From builder-ai
Use before merging, deploying, or demo'ing any LLM feature. Requires documented eval results — pass rate, failure analysis, baseline comparison. Blocks "it looked good when I tested it" completions.
How this skill is triggered — by the user, by Claude, or both
Slash command
/builder-ai:eval-before-shipThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
```
AN LLM FEATURE IS NOT READY UNTIL NUMBERS EXIST.
"It looked good when I tested it" is not an eval.
"I ran a few examples and it worked" is not an eval.
A named suite, a defined metric, a pass rate, a failure analysis,
and a baseline comparison IS an eval. All five. Not four.
Trigger before any of these:
An eval must have all five components:
| Component | What It Means | What Does NOT Count |
|---|---|---|
| Named test suite | File with labelled examples in evals/ | "I tested it manually" |
| Defined metric | Accuracy %, faithfulness score, task pass rate | "It seemed accurate" |
| Pass threshold | Explicit minimum (e.g., ≥ 85%) | No threshold = no standard |
| Failure analysis | ≥ 5 failures examined and categorised | "There were a few errors" |
| Baseline comparison | This version vs. previous version or control | First release exempt; all subsequent require it |
Answer these before touching the prompt:
If you cannot answer these before building, the task is not well enough specified to build.
evals/
<feature-name>/
test-set.jsonl ← labelled examples, one JSON object per line
harness.py ← eval runner
results-<date>.md ← documented results
test-set.jsonl format:
{"id": "001", "input": "...", "expected": "...", "tags": ["edge-case"]}
Label ground truth before running any model. Seeing outputs first contaminates labels.
Minimum 50 examples; 200+ for features used at volume. Include: edge cases, short inputs, long inputs, ambiguous inputs, adversarial rephrasing.
python evals/<feature-name>/harness.py --model <model-id> --seed 42
Record: pass rate (X/N = Y%), latency p50/p95, failure category breakdown.
Examine every failing case. Assign each to one category:
The most common failure category is your next fix.
Suite: evals/email-classifier/test-set.jsonl (200 examples)
Model: claude-sonnet-4-6, temperature: 0.0, seed: 42
Pass rate: 178/200 = 89% (threshold: ≥ 85% ✓)
Failure breakdown:
- Format violation: 12 (long emails > 2000 tokens)
- Hallucination: 10 (invented labels not in schema)
Baseline: v1 (82%) → v2 (89%), delta: +7pp ✓
Results: evals/email-classifier/results-2026-06-07.md
These thoughts mean you have NOT completed an eval — stop and build one:
When eval-before-ship is satisfied, state it like this:
Eval complete.
Suite: evals/<feature>/test-set.jsonl — N examples
Model: <model-id>, temperature: X, seed: Y
Pass rate: A/N = B% (threshold: ≥ C% ✓)
Top failure mode: <category> (N cases — <root cause>)
Failure analysis: evals/<feature>/failure-analysis-<date>.md ✓
Baseline: <previous> = X% → this version = Y%, delta: +Zpp ✓
[First release: baseline field may be marked "N/A — initial version"]
Results stored: evals/<feature>/results-<date>.md ✓
The pass rate, failure analysis, and baseline comparison are not optional. On the first release, baseline may be marked "N/A — initial version"; for all subsequent releases it is required.
LLM outputs are stochastic. A prompt that works on the examples you tried does not predict performance on the next 1000 inputs. Eval suites catch regressions before users do — and they make it possible to improve without guessing.
Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
npx claudepluginhub rbraga01/a-team --plugin builder-ai