From qa-llm-evaluation
Authors and runs OpenAI Evals - Python framework + registry for evaluating LLMs and LLM-backed systems with `oaieval <model> <eval-name>` CLI; supports template-based evals (Match / Includes / FuzzyMatch / ModelBasedClassify) defined in `evals/registry/evals/*.yaml` against JSONL data files in `evals/registry/data/`, plus custom Python eval classes implementing the Eval interface. Use when the user works with the openai/evals repo, needs the OpenAI-curated eval registry, or contributes new evals via PR to the registry.
How this skill is triggered — by the user, by Claude, or both
Slash command
/qa-llm-evaluation:openai-evalsThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
[oa-gh]: https://github.com/openai/evals
Per oa-gh, a registry of YAML eval-specs lives under
evals/registry/evals/, each pointing to a JSONL data file under
evals/registry/data/ (Git-LFS managed). The oaieval CLI runs
an eval against any completion-function-protocol model - either
one of OpenAI's curated evals or a custom one registered by the
team.
For new projects without a registry-contribution motive, evaluate
promptfoo-evaluation or
deepeval-evaluation first -
both have lower friction for non-OpenAI workflows.
For running existing evals (per oa-gh):
pip install evals
For contributing new evals (clone first, then editable install):
git clone https://github.com/openai/evals.git
cd evals
pip install -e .
The editable install is required to register new evals and access the full registry source.
Per github.com/openai/evals/blob/main/docs/run-evals.md:
oaieval gpt-3.5-turbo test-match
Pattern: oaieval <model> <eval-name>. Per oa-run:
"Any implementation of the CompletionFn protocol can be run against oaieval."
Eval names are "specified in the YAML files under evals/registry/evals" (oa-run); implementations live in
evals/elsuite.
Per oa-run:
"logging locally or to Snowflake will write to tmp/evallogs"
Override with --record_path /custom/path/. Logs are JSONL events
"which can be inspected using a text editor or analyzed
programmatically" (oa-run).
Common flags (oa-run):
--no-local-run - Snowflake DB logging--record_path <dir> - output directoryoaieval --help - full CLI optionsEval templates avoid Python authoring for common evaluation
patterns. The four built-in templates per oa-gh (referenced
in eval-templates.md):
ideal (single string or list)ideal
textA registered eval YAML lives at evals/registry/evals/<name>.yaml
and references a JSONL file at evals/registry/data/<name>/samples.jsonl.
Each JSONL row contains the input prompt + the ideal field used by
the template.
For grading logic beyond templates, subclass the Eval interface.
The full pattern lives in docs/custom-eval.md and docs/build-eval.md
in the oa-gh repository - author per the doc when authoring,
then register in the YAML registry.
OpenAI Evals does not ship a first-party CI action. Pattern:
oaieval gpt-4 my-eval --record_path ./evallogs
# parse JSONL evallog for pass-rate; fail CI if below threshold
jq -s '[.[] | select(.spec) | .]' ./evallogs/<run>.jsonl # extract spec + outcomes
For PR-comment integration, parse the events JSONL into a summary and post via gh CLI (no built-in action).
| Anti-pattern | Why it fails | Fix |
|---|---|---|
Pick Match template for open-ended generation | Exact-match always fails on creative outputs | Use ModelBasedClassify (Step 4) |
Skip --record_path in CI | Logs land in /tmp and disappear between steps | Always pass --record_path |
| Custom Python eval without registry YAML | oaieval can't find it | Register the YAML alongside the Python class (Step 5) |
Run on gpt-3.5-turbo only | Model-version drift; results not reproducible | Pin specific snapshot (e.g., gpt-4-0613) |
promptfoo-evaluation.sampling, match, metrics).oaieval CLI referencedocs/build-eval.md, docs/custom-eval.md, docs/eval-templates.md
in oa-gh - authoring details (load from repo when
building a new eval)promptfoo-evaluation,
deepeval-evaluation -
lower-friction alternatives for new projectsprompt-eval-reviewer -
adversarial reviewernpx claudepluginhub testland/qa --plugin qa-llm-evaluationProvides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.
Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.