From grimoire
Builds structured evaluation suites for LLM and AI system performance using reproducible metrics. Use when testing model quality, prompt changes, or regression detection.
How this skill is triggered — by the user, by Claude, or both
Slash command
/grimoire:write-eval-suiteThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Build a structured evaluation suite that measures LLM or AI system performance with reproducible, comparable metrics.
Build a structured evaluation suite that measures LLM or AI system performance with reproducible, comparable metrics.
Adopted by: OpenAI (public Evals framework), Stanford (HELM — Holistic Evaluation of Language Models), EleutherAI (LM Evaluation Harness) Impact: HELM evaluates 30+ models across 42 scenarios and 7 metric categories; OpenAI uses community evals to discover model regressions before release — systematic evals caught GPT-4 Turbo regressions not visible to internal red-teaming.
Why best: Evals are to AI systems what unit tests are to software: they make quality measurable, regressions detectable, and improvements verifiable. Without them, "the model got better" is a belief, not a fact. A good eval suite is the single most durable investment in a production AI system.
lm-evaluation-harness, or a custom runner. Each eval: input → model call → output → scoring function → metric aggregation.output.strip() == expected. Model-graded: structured prompt asking judge model to rate 1-5 with reasoning. Code eval: execute output, check return value or stdout.Eval structure (OpenAI Evals format):
{"input": [{"role": "user", "content": "Summarize: [article]"}], "ideal": "The article discusses..."}
{"input": [{"role": "user", "content": "Extract the date from: [text]"}], "ideal": "2026-03-15"}
Scoring pipeline:
for example in eval_dataset:
output = model.complete(example["input"])
score = judge_model.grade(output, example["ideal"])
metrics.record(score)
print(f"Mean score: {metrics.mean():.3f} ± {metrics.ci():.3f}")
npx claudepluginhub jeffreytse/grimoire --plugin grimoireUse this skill when the user asks to "design an eval suite", "build evals for my AI feature", "create an evaluation framework", "how do I evaluate my AI", "what evals should I run", "build an eval system", or wants to create a systematic evaluation framework for an AI-powered product feature. Typically run after error-analysis has identified the failure categories to prioritize.
Guides building evals before prompts for LLM features, agents, or prompts. Helps measure improvement objectively and avoid speculative iteration.