Skill

model-evaluator

Systematic LLM and ML model evaluation — benchmarks, metrics, regression detection, and model comparison. Use when assessing or comparing AI model quality.

Invocation

Slash command: /ai-ml-eng-pro:model-evaluator

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Provides a systematic framework for evaluating LLM and ML model performance. Supports standard benchmarks (MMLU, GSM8K, HumanEval, etc.), custom evaluation tasks, multi-dimensional metrics (accuracy, latency, cost, safety, fairness), regression detection across model versions, and head-to-head model comparison with statistical significance testing.

SKILL.md

65 lines · ~1k tokens

Stats

Language: Python
Parent stars: 0
Maintenance: Good
Last commit: May 25, 2026

Model Evaluator

What It Does

Iron Laws (NEVER violate)

Multi-metric only: Never evaluate on a single metric. At minimum, cover quality + latency + cost + safety. Optimizing a single metric invites gaming it.
Statistical significance required: "Model A beats Model B by 1%" without a confidence interval cannot be distinguished from noise. Always report a confidence interval or a significance test.
Blind test set: The evaluation test set must be unseen during development. No peeking, and no tweaking after seeing results.
Task-representative eval: Evaluation tasks must match the production use case. An MMLU score says little about a code generation model.
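
As an illustration of the significance rule, here is a minimal sketch of a paired percentile bootstrap for the difference in mean score between two models evaluated on the same test items. The function names, the 2000 resamples, and the 95% level are assumptions chosen for the example, not something the skill prescribes.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Both lists must hold per-item scores for the SAME test items, in the
    same order (hypothetical input format for this sketch).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


def is_significant(lo, hi):
    """The difference is significant when the interval excludes zero."""
    return lo > 0 or hi < 0
```

A comparison would then report the observed difference together with the interval, and only call one model better when `is_significant(lo, hi)` holds.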

Red Flags (STOP immediately)

Benchmark contamination — Training data overlaps with benchmark test set → results are invalid
Metric collapse — All models score within 1% on a metric → metric is not discriminative; find better eval
Regression cascade — New model beats old on 3 metrics but catastrophically fails on 1 critical metric → not deployable
Overfitting to eval — Model improves on benchmarks but degrades in production → eval doesn't match reality
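
Two of these flags lend themselves to automated checks. The sketch below assumes a simple word-level n-gram overlap as the contamination signal and reads "within 1%" as an absolute spread of 0.01 on a 0 to 1 scale. The helper names and the default n-gram length of 8 are hypothetical, and real contamination audits usually need more than exact n-gram matching.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in text (empty if text is shorter than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(test_items, training_docs, n=8):
    """Indices of test items sharing at least one n-gram with the training data."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(test_items) if ngrams(item, n) & train_grams]


def metric_collapsed(scores, max_spread=0.01):
    """True when every model's score falls within max_spread of the others.

    scores maps a model name to its value on ONE metric.
    """
    values = list(scores.values())
    return max(values) - min(values) <= max_spread
```

Any index returned by `contaminated_items` marks a test item to drop or investigate before trusting the benchmark result.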

Common Rationalizations (self-deception)

"This benchmark score is good enough" → Benchmarks measure benchmark performance, not your use case. Always run custom eval.
"95% accuracy is great" → If the 5% errors are catastrophic failures, accuracy is the wrong metric.
"We'll evaluate after deployment" → Post-deployment evaluation is user-facing experimentation. Evaluate before.

When To Use

Comparing multiple models for a production use case
Detecting regressions after model update or fine-tuning
Setting up continuous evaluation pipelines
Running standard benchmarks for model capability assessment
A/B testing models in production with proper metrics

Human Partner Signals (escalate to human)

Safety regression — New model produces more harmful outputs → must not deploy
Cost explosion — Better quality comes at 5x the cost → business decision on cost-quality tradeoff
Fairness failure — Model performance varies significantly across demographic groups → ethics review
Benchmark gaming suspicion — Suspiciously high benchmark scores → investigate contamination
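
A rough sketch of how the first three signals could be turned into an automatic escalation check. The metric keys (`harmful_rate`, `cost_per_1k`) and the 0.05 group-gap threshold are assumptions made for illustration; only the 5x cost ratio comes from the list above. Whether a gap counts as "significant" is a judgment the human reviewer still has to make.

```python
def escalation_signals(baseline, candidate, group_scores,
                       max_cost_ratio=5.0, max_group_gap=0.05):
    """Return the names of signals that should be escalated to a human.

    baseline / candidate: dicts with 'harmful_rate' and 'cost_per_1k'
    (hypothetical keys). group_scores: candidate quality per demographic group.
    """
    signals = []
    # Safety regression: the candidate produces harmful outputs more often.
    if candidate["harmful_rate"] > baseline["harmful_rate"]:
        signals.append("safety_regression")
    # Cost explosion: the candidate costs at least max_cost_ratio times more.
    if candidate["cost_per_1k"] >= max_cost_ratio * baseline["cost_per_1k"]:
        signals.append("cost_explosion")
    # Fairness failure: best and worst groups differ by more than the allowed gap.
    if max(group_scores.values()) - min(group_scores.values()) > max_group_gap:
        signals.append("fairness_failure")
    return signals
```

An empty list means none of these three signals fired; it does not replace the contamination investigation described in the last item.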

Pipeline

Define: identify evaluation dimensions — quality, latency, cost, safety, fairness, robustness
Select: choose benchmarks and custom eval tasks matching production use case
Baseline: run evaluation on current production model to establish baseline
Compare: run identical evaluation on candidate models
Analyze: compute statistical significance, identify regression, highlight tradeoffs
Report: generate evaluation report with radar chart, leaderboard, and deployment recommendation
Monitor: set up continuous evaluation to detect drift and regression over time
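
The Baseline, Compare, and Analyze steps can be sketched as a per-metric regression check. `MetricSpec`, the tolerance values, and the three recommendation labels are hypothetical names for this example. The rule that a regression on a critical metric blocks deployment mirrors the "regression cascade" red flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    higher_is_better: bool
    tolerance: float        # absolute worsening allowed before flagging
    critical: bool = False  # a regression here blocks deployment outright


def find_regressions(baseline, candidate, specs):
    """Map each regressed metric to how much it worsened beyond zero."""
    regressions = {}
    for name, spec in specs.items():
        delta = candidate[name] - baseline[name]
        # Normalize so that a positive value always means "got worse".
        worsening = -delta if spec.higher_is_better else delta
        if worsening > spec.tolerance:
            regressions[name] = worsening
    return regressions


def recommend(baseline, candidate, specs):
    """Return 'block', 'review', or 'deploy' for the candidate model."""
    regressions = find_regressions(baseline, candidate, specs)
    if any(specs[name].critical for name in regressions):
        return "block"
    return "review" if regressions else "deploy"
```

With this shape, a candidate that improves quality, latency, and cost but drops on a critical safety metric still comes back as `"block"`.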

Verification Checklist

Evaluation covers quality + latency + cost + safety as minimum dimensions
Statistical significance computed for all model comparisons
Test set verified as unseen during model development
Custom eval tasks match production use case (not just benchmarks)
Regression detection configured with alerting thresholds
Evaluation results reproducible with versioned test sets and configs
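
For the reproducibility item, one possible approach is to store a content hash of the test set and config alongside every result, so a later run can confirm it used identical inputs. This is a sketch under the assumption that test items and config are JSON-serializable; the function name is hypothetical.

```python
import hashlib
import json


def fingerprint(test_items, config):
    """SHA-256 hex digest over the test set and evaluation config.

    Keys are sorted so that dict ordering does not change the hash.
    The order of test_items DOES matter, since it can affect sampling.
    """
    payload = json.dumps(
        {"items": test_items, "config": config},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two reports are comparable only when their fingerprints match; a changed prompt, temperature, or test item yields a different digest.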

Related Skills

prompt-engineer — Model evaluation measures prompt quality improvements
dataset-curator — Evaluation datasets require the same curation rigor as training data
evaluating-llms-harness — lm-eval-harness for standard benchmark execution
weights-and-biases — Experiment tracking for evaluation results
