llm-evals-toolkit
Use this skill to design rigorous, reproducible evaluations for LLM and RAG
systems: an eval plan, a labeled dataset, metrics, LLM-as-judge rubrics, and the
experiment hygiene needed to trust the numbers and gate regressions.
When to use
- The user wants to measure or compare LLM/RAG quality (prompts, models, retrievers, agents).
- The user needs a metric set, a judge prompt, a golden/test set, or a benchmark harness.
- The user is iterating on a prompt/pipeline and wants regression protection.
Non-goals
- Do not invent results or claim a system is "better" without data.
- Do not over-engineer: prefer the smallest eval that answers the question.
- Do not build heavy infra when a CSV/JSONL dataset + a script suffices.
Workflow (apply in order)
1) Frame the eval
- State the task precisely and the decision the eval informs (ship / pick model / tune prompt).
- Define success criteria up front (target metric + threshold) before looking at outputs.
- List failure modes you actually care about (hallucination, refusal, format breaks, latency, cost).
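One way to make the framing concrete is to write the plan down as a small immutable record before any outputs are inspected. The sketch below is illustrative only: the `EvalPlan` class, its field names, and the example values (task, metric name, 0.90 threshold) are assumptions, not part of this skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: criteria cannot be edited after the fact
class EvalPlan:
    task: str                  # what the system is supposed to do
    decision: str              # what the eval result will decide
    target_metric: str         # the metric the decision hinges on
    threshold: float           # minimum acceptable value, fixed up front
    failure_modes: tuple[str, ...] = ()

    def meets_target(self, observed: float) -> bool:
        """True when the observed metric reaches the pre-registered threshold."""
        return observed >= self.threshold


# Hypothetical example values, for illustration only.
plan = EvalPlan(
    task="Answer billing questions from the help-center corpus",
    decision="ship prompt v2 or keep v1",
    target_metric="faithfulness_pass_rate",
    threshold=0.90,
    failure_modes=("hallucination", "refusal", "format break"),
)
```

Storing this record next to the dataset makes it harder to move the threshold after seeing results.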
2) Build the dataset
- Curate a small golden set of representative inputs with expected outputs or rubrics.
- Cover the distribution: typical cases, hard/edge cases, and adversarial cases.
- Version the dataset (JSONL with stable ids). Keep a held-out slice that is not consulted while iterating, so a prompt overfit to the golden set shows up as a gap between the two.
- Avoid leakage: never reuse eval items as few-shot examples in the system under test.
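The dataset steps above can be sketched as follows, assuming each item is a dict with an `input` field. The helper names (`stable_id`, `is_held_out`, `to_jsonl`, `leaked_ids`), the hash-based id scheme, and the 20% held-out fraction are all assumptions chosen for illustration.

```python
import hashlib
import json


def stable_id(text: str) -> str:
    """Derive an id from the input text so it survives file reordering."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def is_held_out(item_id: str, fraction: float = 0.2) -> bool:
    """Deterministically assign an item to the held-out slice from its id."""
    bucket = int(hashlib.sha256(item_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < int(fraction * 100)


def to_jsonl(items: list[dict]) -> str:
    """Serialize items as JSONL, adding an id and a split label to each."""
    lines = []
    for item in items:
        record = {"id": stable_id(item["input"]), **item}
        record["split"] = "held_out" if is_held_out(record["id"]) else "dev"
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)


def _normalize(text: str) -> str:
    return " ".join(text.lower().split())


def leaked_ids(eval_rows: list[dict], few_shot_inputs: list[str]) -> list[str]:
    """Return ids of eval items whose input also appears as a few-shot example."""
    few_shot = {_normalize(t) for t in few_shot_inputs}
    return [r["id"] for r in eval_rows if _normalize(r["input"]) in few_shot]
```

Because the split is derived from the id rather than from file position, adding new items does not move existing items between slices. The leakage check only catches near-verbatim reuse; paraphrased duplicates would need a fuzzier comparison.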
3) Choose metrics (match the metric to the output)
- Deterministic where possible: exact/fuzzy match, regex/JSON-schema validity, contains-keyword; BLEU/ROUGE only for constrained text with reference outputs (e.g., translation, summarization).
- Model-graded (LLM-as-judge) for open-ended quality: correctness, relevance, coherence, tone, safety.
- RAG-specific: retrieval recall@k / MRR, context precision, faithfulness/groundedness (answer supported by retrieved context), answer relevance.
- Operational: latency (p50/p95), token cost per item, refusal/error rate.
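A few of the deterministic, retrieval, and operational metrics listed above are simple enough to write by hand. The sketch below uses standard definitions; the function names and the whitespace/case normalization in `exact_match` are assumptions, and the JSON check only verifies required top-level keys rather than a full JSON Schema.

```python
import json
import math


def exact_match(pred: str, gold: str) -> bool:
    """Exact match after lowercasing and collapsing whitespace."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred) == norm(gold)


def has_required_keys(text: str, required_keys: set[str]) -> bool:
    """True if text parses as a JSON object containing every required key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(required_keys).issubset(obj.keys())


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=50 or p=95 for latency."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Faithfulness and answer relevance are not covered here because they usually need a model-graded check rather than a closed-form formula.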
4) Design judge prompts (LLM-as-judge)
- Give the judge a clear rubric and a small ordinal scale (e.g., 1-5) or a binary pass/fail with explicit criteria.
- Require a short rationale before the score to improve reliability; output a parseable verdict (JSON).
- Prefer pairwise comparison (A vs B) for "which is better" questions; randomize order to control position bias.
- Calibrate: spot-check judge verdicts against human labels on a sample; report judge–human agreement.
- Use a judge model at least as capable as the system under test, pin its exact version, and keep the judge prompt versioned alongside the dataset.
5) Run and report
- Run each system under test on the full dataset; record per-item scores plus metadata (model, prompt version, params).
- Report aggregate metrics with sample size and a simple uncertainty estimate (e.g., bootstrap CI or standard error); don't over-read deltas that fall inside that uncertainty.
- Show worst failures with inputs/outputs so the user can act, not just a single number.
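For the uncertainty estimate, a percentile bootstrap over per-item scores is often enough. The sketch below assumes scores are plain floats (for example 0/1 pass flags or 1-5 judge scores) and that each result row is a dict with a `score` key; the function names, 2000 resamples, and fixed seed are assumptions.

```python
import random
import statistics


def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # seeded so the report is reproducible
    n = len(scores)
    # Resample items with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def worst_failures(rows: list[dict], k: int = 5) -> list[dict]:
    """Return the k lowest-scoring rows so they can be shown with inputs/outputs."""
    return sorted(rows, key=lambda r: r["score"])[:k]
```

When comparing two systems on the same items, resampling the per-item score differences (a paired bootstrap) is usually tighter than comparing two separate intervals.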
6) Experiment hygiene
- Pin and log: model id, temperature/params, prompt version, dataset version, retriever/index version, judge version.
- Make runs reproducible (seed where supported; store raw outputs).
- Gate regressions in CI: fail when a key metric drops below threshold on the golden set.
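The pin-and-log and regression-gate steps above might look like this in a small script. The manifest fields mirror the list above; the function names, the content-hash versioning scheme, and the metric names used in the usage note are assumptions.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Short content hash, used here as a version label for prompts and data."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def build_manifest(model_id: str, params: dict, prompt_text: str,
                   dataset_text: str, judge_prompt_text: str,
                   retriever_version: str) -> dict:
    """Collect everything needed to reproduce a run into one record."""
    return {
        "model_id": model_id,
        "params": params,  # temperature, max tokens, seed where supported
        "prompt_version": fingerprint(prompt_text),
        "dataset_version": fingerprint(dataset_text),
        "judge_version": fingerprint(judge_prompt_text),
        "retriever_version": retriever_version,
    }


def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return a message for every gated metric that is missing or below threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}: got {value}, required >= {minimum}")
    return failures
```

In CI, a wrapper script could call `gate(...)` on the golden-set metrics and exit with a non-zero status when the returned list is non-empty, for example `gate({"faithfulness": 0.88}, {"faithfulness": 0.90})` returns one failure message. A missing metric is treated as a failure so that a broken harness cannot pass the gate silently.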
Deliverables checklist
- Eval plan: task, decision, success criteria, failure modes.
- Versioned dataset (JSONL with ids) including hard/adversarial cases.
- Metric set (deterministic + model-graded + RAG/operational as relevant).
- Judge prompt(s) with rubric, parseable output, and a calibration note.
- A run report: per-item scores, aggregates with uncertainty, and top failures.
- Optional: a regression gate (threshold) wired into CI.