From grimoire
Guides you through building evals before writing prompts for LLM features and agents. Helps you measure improvement objectively and avoid speculative iteration.
Slash command
/grimoire:apply-evaluation-driven-development
Build evals before writing extensive prompts. Measure first, then iterate.
Adopted by: Anthropic (documented practice for all production AI features), OpenAI (eval-first methodology across all model iterations, with its public Evals framework), Stanford CRFM (HELM) and Google (BIG-bench), whose benchmarks are eval-first by design, and Meta AI, Microsoft Research, and Google DeepMind across production AI systems.
Impact: Teams iterating without evals consistently over-engineer prompts for scenarios they haven't tested, adding complexity that doesn't improve measurable performance. Anthropic's documented practice shows that baseline evals run before any prompt engineering reveal that minimal prompts often perform within 10–15% of heavily engineered versions, which eliminates weeks of speculative work. OpenAI's regression evals on GPT-4 Turbo caught capability regressions not visible to internal red-teaming (OpenAI Evals framework documentation).
Why best: The alternative, writing detailed prompts based on intuition and then testing, produces unmeasurable progress. You cannot know if a change helped without a baseline. Evaluation-driven development applies the same discipline as test-driven development: define success criteria before building, measure against them, iterate to passing.
Sources: Anthropic, "Building effective agents" (2024); OpenAI Evals framework docs; Liang et al., "HELM: Holistic Evaluation of Language Models", Stanford CRFM (2022); Anthropic, Claude agent skills best practices (2024)
Before adding detailed instructions, few-shot examples, or chain-of-thought scaffolding, define at least 10–20 test cases that specify success:
input: "Summarize this support ticket in one sentence."
expected: concise factual summary, no filler phrases, identifies core issue
You cannot measure improvement without a starting point. Evals written after the fact are biased toward the prompt already built.
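One way to make such test cases executable is to store each one as data with a list of pass/fail checks. The sketch below is illustrative only: `EvalCase`, the check helpers, and the filler-phrase list are hypothetical names, and the one-sentence check is a crude heuristic rather than a real sentence splitter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    input: str
    # Each check maps a model output to True (pass) or False (fail).
    checks: tuple[Callable[[str], bool], ...]


# Assumed list of filler phrases; extend it from your own failure data.
FILLER_PHRASES = ("i hope this helps", "in summary", "as an ai")


def is_one_sentence(output: str) -> bool:
    # Crude heuristic: at most one sentence-ending punctuation mark.
    text = output.strip()
    return text.count(".") + text.count("!") + text.count("?") <= 1


def has_no_filler(output: str) -> bool:
    lowered = output.lower()
    return not any(phrase in lowered for phrase in FILLER_PHRASES)


def mentions(keyword: str) -> Callable[[str], bool]:
    # Build a check that requires the core issue to be named.
    return lambda output: keyword.lower() in output.lower()


def passes(case: EvalCase, output: str) -> bool:
    # A case passes only if every check passes.
    return all(check(output) for check in case.checks)


case = EvalCase(
    name="login-failure-ticket",
    input="Summarize this support ticket in one sentence.",
    checks=(is_one_sentence, has_no_filler, mentions("login")),
)
```

Writing the checks before any prompt exists keeps them independent of whatever prompt is eventually built.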
Use the simplest possible instruction that describes the task:
# Wrong — speculation about what the model needs before any measurement
You are a helpful assistant. When summarizing tickets, always identify the root cause,
use professional tone, avoid jargon, check for urgency markers, keep under 20 words...
# Right — minimal starting prompt
Summarize the support ticket in one sentence.
Run the minimal prompt against your evals. Record the baseline pass rate and failure categories. Don't add complexity before you know where the model actually fails.
Run your eval suite against the minimal prompt. Record the baseline pass rate and the failure categories.
Baseline numbers are the only honest signal you have. If you skip this step, every subsequent change is speculation.
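A baseline record can be as small as a pass rate plus a count of failures per category. The following is a minimal sketch under assumed names (`EvalReport`, `summarize`, and the failure-category labels are all hypothetical); it presumes each eval result has already been labelled with a failure category, or `None` if it passed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EvalReport:
    total: int
    passed: int
    failures: Counter = field(default_factory=Counter)

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0

    def top_failure_category(self):
        # Most frequent failure category, or None if everything passed.
        return self.failures.most_common(1)[0][0] if self.failures else None


def summarize(results):
    """results: iterable of (case_name, failure_category_or_None)."""
    results = list(results)
    failures = Counter(cat for _, cat in results if cat is not None)
    passed = sum(1 for _, cat in results if cat is None)
    return EvalReport(total=len(results), passed=passed, failures=failures)


# Hypothetical results from running the minimal prompt.
baseline = summarize([
    ("ticket-01", None),
    ("ticket-02", "too_long"),
    ("ticket-03", "missed_core_issue"),
    ("ticket-04", "too_long"),
    ("ticket-05", None),
])
print(f"{baseline.pass_rate:.0%}", baseline.top_failure_category())
```

Storing this report alongside the prompt version gives every later change a fixed number to beat.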
Each iteration follows one rule: one change, then measure.
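current_prompt = minimal_prompt
current_results = run_evals(current_prompt)
while current_results.pass_rate < target:
    top_failure = current_results.top_failure_category()
    candidate_prompt = minimal_fix_for(current_prompt, top_failure)
    new_results = run_evals(candidate_prompt)
    if new_results.pass_rate > current_results.pass_rate:
        # Accept: the single change measurably helped.
        current_prompt, current_results = candidate_prompt, new_results
    # Otherwise discard the candidate and try a different single change.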
Changing multiple prompt elements in one iteration makes it impossible to isolate what helped. One change per iteration.
A prompt that "looks thorough" is not a success criterion. Stop adding complexity when:
Prompt complexity added beyond what evals justify is technical debt.
Once your feature passes the target bar, every future change (prompt edit, model upgrade, context change) must pass the existing eval suite before shipping.
See write-eval-suite for how to structure the eval harness and integrate with CI/CD.
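As a rough illustration of such a gate, the sketch below compares per-case results for a candidate change against the results of the currently shipped version. The function name and its interpretation of "passing the suite" (no regressions on previously passing cases, every existing case run, pass rate at or above the target) are assumptions, not something the write-eval-suite skill prescribes.

```python
def regression_gate(baseline: dict[str, bool], candidate: dict[str, bool], target: float):
    """Return (ok, reasons) for a candidate change.

    baseline and candidate map eval case name -> whether the case passed.
    """
    reasons = []

    # Every case in the existing suite must have been run.
    missing = sorted(set(baseline) - set(candidate))
    if missing:
        reasons.append(f"cases not run: {missing}")

    # No case that passed before may fail now.
    regressed = sorted(
        name for name, ok in baseline.items() if ok and candidate.get(name) is False
    )
    if regressed:
        reasons.append(f"regressed: {regressed}")

    # The overall pass rate must still meet the target bar.
    pass_rate = sum(candidate.values()) / len(candidate) if candidate else 0.0
    if pass_rate < target:
        reasons.append(f"pass rate {pass_rate:.0%} below target {target:.0%}")

    return (not reasons, reasons)


shipped = {"a": True, "b": True, "c": False}
after_model_upgrade = {"a": True, "b": False, "c": True}
ok, reasons = regression_gate(shipped, after_model_upgrade, target=0.6)
```

Here the overall pass rate is unchanged, but the gate still fails because case `b` regressed, which an aggregate number alone would hide.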
Writing extensive prompts first, then building evals. The baseline is now unavailable. You've invested effort in an unvalidated direction and can't measure what you've gained.
Too few eval cases. The 10–20 cases you start with are enough for a baseline, but 10 examples produce a noisy, unreliable signal. Target 50+ before drawing conclusions about a failure category.
Changing multiple prompt elements in one iteration. You can't isolate which change helped. One change per iteration.
Removing eval cases that "seem outdated." Shrinking the eval set masks regressions. Only add cases; never remove them from an active suite.
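The point about small eval sets can be made concrete with a confidence interval on the pass rate. The sketch below uses the Wilson score interval, a standard choice for proportions; it is an illustration added here, not a method the skill itself specifies, and it treats eval cases as independent samples.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Same observed 70% pass rate, different suite sizes.
for passed, total in [(7, 10), (35, 50)]:
    low, high = wilson_interval(passed, total)
    print(f"{passed}/{total}: {low:.2f} to {high:.2f}")
```

With 7 of 10 passing, the interval spans roughly 0.40 to 0.89, so a prompt change that moves one or two cases is indistinguishable from noise. With 35 of 50 it narrows to roughly 0.56 to 0.81.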
npx claudepluginhub jeffreytse/grimoire --plugin grimoire