llm-evals-toolkit | Fullstack ML/AI Agent Skills

llm-evals-toolkit

Use this skill to design rigorous, reproducible evaluations for LLM and RAG systems: an eval plan, a labeled dataset, metrics, LLM-as-judge rubrics, and the experiment hygiene needed to trust the numbers and gate regressions.

When to use

The user wants to measure or compare LLM/RAG quality (prompts, models, retrievers, agents).
The user needs a metric set, a judge prompt, a golden/test set, or a benchmark harness.
The user is iterating on a prompt/pipeline and wants regression protection.

Non-goals

Do not invent results or claim a system is "better" without data.
Do not over-engineer: prefer the smallest eval that answers the question.
Do not build heavy infra when a CSV/JSONL dataset + a script suffices.

Workflow (apply in order)

1) Frame the eval

State the task precisely and the decision the eval informs (ship / pick model / tune prompt).
Define success criteria up front (target metric + threshold) before looking at outputs.
List failure modes you actually care about (hallucination, refusal, format breaks, latency, cost).
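
One way to make the framing concrete is to write it down as a small immutable record before any outputs exist. The sketch below is illustrative only: the `EvalPlan` class, its field names, and the example values are assumptions, not part of this skill, and it assumes a metric where higher is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalPlan:
    """Hypothetical record of an eval's framing, written before any run."""

    task: str
    decision: str          # e.g. "ship", "pick model", "tune prompt"
    target_metric: str
    threshold: float       # minimum acceptable value; assumes higher is better
    failure_modes: tuple[str, ...] = ()

    def passes(self, observed: float) -> bool:
        """True when the observed metric meets the threshold fixed up front."""
        return observed >= self.threshold


# Example values are made up for illustration.
plan = EvalPlan(
    task="Answer support questions from the product docs",
    decision="ship",
    target_metric="faithfulness",
    threshold=0.90,
    failure_modes=("hallucination", "refusal", "format breaks"),
)
```

Freezing the dataclass means the threshold cannot be quietly adjusted after the results are in.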

2) Build the dataset

Curate a small golden set of representative inputs with expected outputs or rubrics.
Cover the distribution: typical cases, hard/edge cases, and adversarial cases.
Version the dataset (JSONL with stable ids). Keep a held-out slice that is never used while iterating on the prompt, so you can detect a prompt that has been overfitted to the golden set.
Avoid leakage: never reuse eval items as few-shot examples in the system under test.
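
The dataset steps above can be sketched with the standard library alone. This is a minimal sketch, not a prescribed format: the field names `id`, `input`, and `expected`, the 20% held-out fraction, and the helper names are all assumptions chosen for illustration.

```python
import hashlib
import json

# Two made-up items in the assumed JSONL layout: one JSON object per line.
SAMPLE = (
    '{"id": "q-001", "input": "What is the capital of France?", "expected": "Paris"}\n'
    '{"id": "q-002", "input": "What is 2 + 2?", "expected": "4"}\n'
)


def load_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into records and reject duplicate ids.

    A record without an "id" field raises KeyError.
    """
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    ids = [r["id"] for r in records]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate ids in dataset")
    return records


def is_held_out(item_id: str, fraction: float = 0.2) -> bool:
    """Assign an item to the held-out slice from a hash of its id.

    Hashing the id, rather than sampling at random, keeps the split stable
    when items are added or the file is reordered.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < fraction


def leaked_ids(eval_items: list[dict], few_shot_inputs: list[str]) -> list[str]:
    """Return ids of eval items whose input also appears as a few-shot example.

    Comparison ignores case and surrounding whitespace; it will not catch
    paraphrased duplicates.
    """
    seen = {s.strip().lower() for s in few_shot_inputs}
    return [r["id"] for r in eval_items if r["input"].strip().lower() in seen]
```

Running `leaked_ids` against the few-shot examples of the system under test before each run is a cheap guard; an empty list is the expected result.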

3) Choose metrics (match the metric to the output)

Deterministic where possible: exact/fuzzy match, regex/JSON-schema validity, contains-keyword, BLEU/ROUGE for constrained text.
Model-graded (LLM-as-judge) for open-ended quality: correctness, relevance, coherence, tone, safety.
RAG-specific: retrieval recall@k / MRR, context precision, faithfulness/groundedness (answer supported by retrieved context), answer relevance.
Operational: latency (p50/p95), token cost per item, refusal/error rate.
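
Several of the deterministic, retrieval, and operational metrics above fit in a few lines each. The following is a sketch under stated assumptions: the function names are illustrative, the JSON check only tests for required keys rather than a full JSON Schema, and the percentile uses the nearest-rank method (other definitions interpolate and give slightly different values).

```python
import json
import math


def exact_match(prediction: str, expected: str) -> bool:
    """Exact match after lowercasing and collapsing whitespace."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(prediction) == norm(expected)


def is_valid_json_object(text: str, required_keys: set[str]) -> bool:
    """True when text parses as a JSON object containing every required key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys.issubset(obj.keys())


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant document ids found in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none was retrieved.

    MRR is the mean of this value over all queries.
    """
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=50 for p50 or pct=95 for p95."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

For example, `recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, k=2)` is 0.5, because one of the two relevant documents appears in the top two results.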

4) Design judge prompts (LLM-as-judge)

Give the judge a clear rubric and a small ordinal scale (e.g., 1-5) or a binary pass/fail with explicit criteria.
Require a short rationale before the score to improve reliability; output a parseable verdict (JSON).
Prefer pairwise comparison (A vs B) for "which is better" questions; randomize order to control position bias.
Calibrate: spot-check judge verdicts against human labels on a sample; report judge–human agreement.
Use a capable judge model, pin its exact version, and keep the judge prompt versioned alongside the dataset.
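
The judge guidance above can be sketched without committing to any particular model API. Everything named here is an assumption for illustration: the template wording, the `rationale`/`score` JSON keys, the "first"/"second" reply convention for pairwise comparison, and the helper names. The call to the judge model itself is deliberately left out.

```python
import json
import random

# Rationale comes before the score in the requested JSON, so the judge
# writes its reasoning first.
JUDGE_TEMPLATE = """You are grading an answer against a rubric.
Rubric: {rubric}
Question: {question}
Answer: {answer}
Write a one-sentence rationale, then give a score from 1 to 5.
Respond with JSON only: {{"rationale": "<text>", "score": <1-5>}}"""


def build_judge_prompt(rubric: str, question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(rubric=rubric, question=question, answer=answer)


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply; raise ValueError instead of guessing a score."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"unparseable verdict: {raw!r}") from exc
    score = verdict.get("score") if isinstance(verdict, dict) else None
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score missing or out of range: {raw!r}")
    return verdict


def shuffled_pair(output_a: str, output_b: str, rng: random.Random):
    """Randomly order two outputs; return (first, second, swapped)."""
    swapped = rng.random() < 0.5
    return (output_b, output_a, True) if swapped else (output_a, output_b, False)


def unswap_winner(judge_choice: str, swapped: bool) -> str:
    """Map the judge's 'first'/'second' choice back to system 'A' or 'B'."""
    if judge_choice not in ("first", "second"):
        raise ValueError(f"unexpected choice: {judge_choice!r}")
    first_is_a = not swapped
    return "A" if (judge_choice == "first") == first_is_a else "B"


def agreement_rate(judge_labels: list, human_labels: list) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Raw agreement is the simplest calibration number to report; a chance-corrected statistic such as Cohen's kappa is a reasonable alternative when the labels are heavily imbalanced.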

5) Run and report

Run each system under test on the full dataset; record per-item scores plus metadata (model, prompt version, params).
Report aggregate metrics with sample size and a simple uncertainty estimate (e.g., bootstrap CI or standard error); do not read a delta as real when it is smaller than that uncertainty.
Show worst failures with inputs/outputs so the user can act, not just a single number.
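
As an illustration of the reporting step, the sketch below computes a percentile bootstrap confidence interval for a mean score, a paired version for the difference between two systems scored on the same items, and a helper that surfaces the lowest-scoring items. The function names, the 2000 resamples, and the `score` field are assumptions; the fixed seed only makes the resampling repeatable.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)
    n = len(scores)
    # Resample the per-item scores with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lower, upper


def paired_delta_ci(scores_a, scores_b, **kwargs):
    """Bootstrap CI for mean(B - A) when both systems ran on the same items."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    deltas = [b - a for a, b in zip(scores_a, scores_b)]
    return bootstrap_ci(deltas, **kwargs)


def worst_failures(records, n=5):
    """Return the n lowest-scoring records so they can be read, not just averaged."""
    return sorted(records, key=lambda r: r["score"])[:n]
```

If the interval returned by `paired_delta_ci` contains zero, the data do not support calling either system better at that sample size.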

6) Experiment hygiene

Pin and log: model id, temperature/params, prompt version, dataset version, retriever/index version, judge version.
Make runs reproducible (seed where supported; store raw outputs).
Gate regressions in CI: fail the build when a key metric drops below its threshold on the golden set.
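
A regression gate can be as small as a script whose exit code fails the CI job. The sketch below is one possible shape, not a fixed interface: `run_manifest`, `check_gate`, and `gate_exit_code` are hypothetical names, the manifest keys are whatever the run pins, and every threshold is treated as a minimum (a lower-is-better metric such as latency would need the comparison reversed).

```python
import hashlib
import json
import sys


def run_manifest(**pinned) -> dict:
    """Record the pinned settings of a run plus a short fingerprint of them.

    The fingerprint is computed from the settings sorted by key, so two runs
    with identical pinned values share a fingerprint.
    """
    canonical = json.dumps(pinned, sort_keys=True)
    fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return {**pinned, "fingerprint": fingerprint}


def check_gate(metrics: dict, thresholds: dict) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from run")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def gate_exit_code(metrics: dict, thresholds: dict) -> int:
    """Print any regressions to stderr; return 1 on failure and 0 otherwise."""
    failures = check_gate(metrics, thresholds)
    for message in failures:
        print("REGRESSION", message, file=sys.stderr)
    return 1 if failures else 0
```

In CI, a wrapper would load the run's aggregate metrics and call `sys.exit(gate_exit_code(metrics, thresholds))`; storing the manifest next to the raw outputs makes it possible to tell which pinned setting changed when the gate trips.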

Deliverables checklist

Eval plan: task, decision, success criteria, failure modes.
Versioned dataset (JSONL with ids) including hard/adversarial cases.
Metric set (deterministic + model-graded + RAG/operational as relevant).
Judge prompt(s) with rubric, parseable output, and a calibration note.
A run report: per-item scores, aggregates with uncertainty, and top failures.
Optional: a regression gate (threshold) wired into CI.