From skillry-ai-and-agent-systems
Use when you need to review LLM eval plans, rubrics, datasets, expected outputs, and regression checks.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skillry-ai-and-agent-systems:44-llm-evaluation-review
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Review an LLM evaluation framework for dataset quality, rubric precision, LLM-as-judge configuration, regression coverage, hallucination detection methods, and metric validity. The goal is to catch evaluation design flaws that produce misleading scores — a poorly designed eval is worse than no eval because it creates false confidence in a system that may be failing in ways the eval does not measure.
prompt-systems-review).
rag-vector-search-review).
Audit the eval dataset. Record: total number of examples, creation method (human-annotated / LLM-generated / sampled from production logs / manually constructed), date of creation and last update, and distribution across task categories. A dataset smaller than 100 examples for a production system has insufficient statistical power to detect regressions reliably. A dataset not updated after a significant product requirement change is measuring the wrong behavior.
Check golden set validity. For a random sample of 10-20 golden-set examples: is the expected output unambiguously correct under current product requirements? Does the input represent current production traffic patterns? Are the annotator guidelines documented and accessible? Golden examples that were correct under old requirements but are now wrong will make the current system look like it is regressing when it is actually improving.
Review rubric precision. For each rubric criterion, apply the two-annotator test: could two independent annotators reach the same score for the same output without discussion? If the criterion contains subjective adjectives — "helpful," "clear," "appropriate," "good" — it fails the test. Replace each failing criterion with an observable, scale-anchored definition:
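As a hypothetical illustration of the rewrite described above, the sketch below flags the subjective adjectives named in this step and shows one possible scale-anchored replacement. The helper name `subjective_terms_in`, the example criterion, and the anchor wording are all assumptions made for illustration; they are not taken from the skill itself.

```python
import re

# The subjective adjectives this step names as failing the two-annotator test.
SUBJECTIVE_TERMS = {"helpful", "clear", "appropriate", "good"}


def subjective_terms_in(criterion: str) -> set[str]:
    """Return the flagged subjective adjectives that appear in a criterion."""
    words = set(re.findall(r"[a-z]+", criterion.lower()))
    return words & SUBJECTIVE_TERMS


# Hypothetical "before": a criterion two annotators could score differently.
vague = "The response is helpful and clear."

# Hypothetical "after": each score is tied to something an annotator can observe.
anchored = {
    "criterion": "Answers the user's question",
    "scale": {
        0: "Does not address the question asked.",
        1: "Addresses the question but omits a required step or value.",
        2: "Addresses the question with every required step and value present.",
    },
}

print(sorted(subjective_terms_in(vague)))
print([sorted(subjective_terms_in(text)) for text in anchored["scale"].values()])
```

A word-list check like this only catches the listed adjectives; it does not replace running the two-annotator test on real outputs.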
Measure inter-annotator agreement. For human-judged criteria, compute Cohen's Kappa (for categorical ratings) or Krippendorff's Alpha (for ordinal scales) on a minimum 50-example overlap set where two annotators scored the same examples independently. Kappa below 0.6 means the rubric criterion is too ambiguous to produce reliable scores — rewrite the criterion before using it. For LLM-as-judge, measure agreement between two different judge models on the same 50-example set. High cross-judge disagreement indicates a rubric problem.
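For categorical ratings, the Cohen's Kappa check can be sketched in a few lines of standard-library Python. The function name `cohens_kappa` and the sample labels are illustrative assumptions; a real overlap set would have at least the 50 examples this step requires.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two annotators who rated the same items categorically."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both annotators must rate the same non-empty set of items")
    n = len(ratings_a)

    # Observed agreement: fraction of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: probability both pick the same label given their label frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Tiny hypothetical overlap set, far smaller than the 50 examples recommended above.
annotator_1 = ["pass", "pass", "fail", "fail"]
annotator_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}, rubric needs rewrite: {kappa < 0.6}")
```

The same function can be applied to the verdicts of two judge models to measure cross-judge agreement.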
Verify hallucination detection coverage. Check that the eval explicitly tests for:
Review regression test structure. Confirm that there is a fixed, immutable regression set that does not change between evaluations. This set must include: known failure cases from past production incidents (one example per incident), edge cases that caused regressions in previous model versions, and a representative sample of high-traffic query patterns. Regression tests must have binary pass/fail verdicts — not rubric scores that can shift when the judge model is updated.
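One possible way to verify that the regression set has not changed between evaluations is to record a fingerprint of it alongside each eval run. The sketch below assumes each case is a JSON-serializable dictionary with a unique `id` field; the field names and the function `regression_set_fingerprint` are hypothetical.

```python
import hashlib
import json


def regression_set_fingerprint(cases):
    """Return a SHA-256 digest that changes if any case is added, removed, or edited."""
    # Sort by id and serialize with sorted keys so that file order and
    # dictionary key order do not affect the digest.
    ordered = sorted(cases, key=lambda case: case["id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical cases with binary pass/fail checks rather than rubric scores.
cases = [
    {"id": "incident-001", "input": "refund after 90 days", "must_contain": "not eligible"},
    {"id": "edge-014", "input": "", "must_contain": "please provide"},
]

baseline = regression_set_fingerprint(cases)
print(baseline == regression_set_fingerprint(list(reversed(cases))))
```

Comparing the stored digest with a freshly computed one before each run makes a silent edit to the set visible.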
Check metric selection. Verify that selected metrics match the task type:
Evaluating with the same model you optimized against. The system was fine-tuned on GPT-4o outputs. The eval uses GPT-4o as the judge. The judge has a strong prior toward GPT-4o style and scores the system higher than human annotators would. The reported quality is inflated. Always use a different judge model family (not just version) than the model used in the system.
Rubric drift. The rubric was written 12 months ago when the product had different requirements. Six months ago the requirements changed but the rubric was not updated. The eval now measures old behavior. Teams optimize for the rubric rather than the actual product goal. Version the rubric alongside the product requirements specification; any PR that changes product requirements must also update the eval rubric.
Cherry-picked regression set. The regression set contains only examples the current model handles well because "the ones it failed on are fixed now." When a similar failure mode recurs, the regression set does not catch it. Never remove failure cases from the regression set — permanently retain one example per production incident, labeled with the incident date and a description.
Position bias in pairwise LLM evaluation. The judge consistently rates the response listed first as better, regardless of quality. On a 100-example pairwise eval, this can shift win rates by 10-15 percentage points — more than the actual quality difference between most model versions. Test for position bias before using pairwise evaluation. If bias exists, either use point-score rubrics instead, or randomize order and average scores across both orderings.
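The "randomize order and average across both orderings" remedy can be sketched as follows. The judge interface here is an assumption: a callable that receives the two responses in display order and returns `"first"` or `"second"`. The function name `pairwise_with_swap` is likewise hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: takes (first_response, second_response)
# and returns "first" or "second".
Judge = Callable[[str, str], str]


def pairwise_with_swap(pairs: Sequence[Tuple[str, str]], judge: Judge) -> Tuple[float, float]:
    """Judge each (A, B) pair in both orders.

    Returns (win rate of A averaged over both orderings,
             fraction of all verdicts that picked the first position).
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    a_score = 0.0
    first_picks = 0
    for response_a, response_b in pairs:
        forward = judge(response_a, response_b)  # A shown in first position
        swapped = judge(response_b, response_a)  # B shown in first position
        # A wins the forward call if "first" is picked, the swapped call if "second" is.
        a_score += ((forward == "first") + (swapped == "second")) / 2
        first_picks += (forward == "first") + (swapped == "first")
    n = len(pairs)
    return a_score / n, first_picks / (2 * n)


# A judge that always prefers whatever is listed first: pure position bias.
win_rate_a, first_rate = pairwise_with_swap([("a", "b"), ("c", "d")], lambda x, y: "first")
print(win_rate_a, first_rate)
```

A judge that is consistent across orderings picks the first position exactly half the time, so a first-position rate well away from 0.5 is a sign of position bias.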
Statistical significance ignored. System A scores 0.762, System B scores 0.748. The difference is 1.4 percentage points on a 100-example eval set. The team ships System A because it scored higher. McNemar's test on these results returns p = 0.43, so the difference is not statistically significant. The systems are indistinguishable on this eval set. Either accept the uncertainty or enlarge the eval set until it has enough power to resolve a difference of this size.
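A minimal sketch of an exact McNemar test, assuming each example has a binary pass/fail verdict for both systems. Only the discordant pairs enter the test: examples where exactly one system passed. The counts in the usage lines are hypothetical and are not meant to reproduce the p-value quoted above.

```python
from math import comb


def mcnemar_exact_p(only_a_correct: int, only_b_correct: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = only_a_correct + only_b_correct
    if n == 0:
        # The systems never disagree, so there is no evidence of a difference.
        return 1.0
    k = min(only_a_correct, only_b_correct)
    # Under the null hypothesis each discordant pair favors either system with p = 0.5,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: A passed where B failed on 3 examples, the reverse on 1.
print(mcnemar_exact_p(3, 1))
# A larger hypothetical imbalance: 5 versus 0.
print(mcnemar_exact_p(5, 0))
```

With few discordant pairs even a one-sided split yields a large p-value, which is why small eval sets rarely separate two similar systems.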
BLEU for open-ended generation. A customer service assistant is evaluated with BLEU against reference responses. BLEU penalizes any paraphrase of the reference, even if semantically equivalent. A correct, clear response that uses different words than the reference gets a low BLEU score. The team optimizes for lexical overlap with the reference, degrading response naturalness. BLEU is a translation metric; do not use it for open-ended generation tasks.
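The effect is easy to reproduce with the clipped n-gram precision that BLEU is built on. The sketch below is deliberately simplified: full BLEU combines several n-gram orders with a geometric mean and a brevity penalty, which are omitted here. The function name and both sentences are made up for illustration.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of candidate against one reference (whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


reference = "you can reset your password from the settings page"
paraphrase = "go to settings and choose a new password"

print(ngram_precision(paraphrase, reference, 1))
print(ngram_precision(paraphrase, reference, 2))
```

The paraphrase gives a correct answer, yet it shares only two unigrams and no bigrams with the reference, so any overlap-based score rates it poorly.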
No eval-to-deploy gate. Evaluation is performed before each release, the scores are discussed in a meeting, and then deployment proceeds regardless. This is not a gate — it is a ritual. Define a minimum threshold for each key metric. Define the process for blocking deployment when the threshold is not met. Document the exception process for overriding the gate with named approver accountability.
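A gate of this kind could be expressed as a small check that a CI job runs before deployment. The metric names, threshold values, and the functions `failing_metrics` and `gate_decision` below are assumptions chosen for illustration; the skill does not prescribe specific metrics or thresholds.

```python
def failing_metrics(scores, thresholds):
    """Return the metrics that are missing or below their minimum threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        # A metric that was not measured counts as a failure, not a pass.
        if value is None or value < minimum:
            failures.append(metric)
    return failures


def gate_decision(scores, thresholds, override_approver=None):
    """Block deployment on any failure unless a named approver overrides the gate."""
    failures = failing_metrics(scores, thresholds)
    if not failures:
        return {"deploy": True, "failures": [], "override_by": None}
    return {
        "deploy": override_approver is not None,
        "failures": failures,
        "override_by": override_approver,  # recorded for accountability
    }


# Hypothetical thresholds and scores.
thresholds = {"faithfulness": 0.90, "regression_pass_rate": 1.0}
print(gate_decision({"faithfulness": 0.88, "regression_pass_rate": 1.0}, thresholds))
```

Returning the failing metrics and the approver name, rather than a bare boolean, leaves an audit trail for every release that went out below threshold.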
Produce an eval review report with:
npx claudepluginhub fluxonlab/skillry --plugin skillry-ai-and-agent-systems