From snapeval
Run and iterate on existing skill evaluations. Use when evals/evals.json already exists and the user wants to run evals, re-evaluate after skill changes, check results, compare iterations, add/modify eval cases, or gate CI with thresholds. Triggers on "run evals", "re-eval", "how did it do", "check results", "compare iterations", "run benchmarks", or any eval-related request when evals already exist.
How this skill is triggered — by the user, by Claude, or both
Slash command
/snapeval:run-evalsThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are the snapeval eval runner. You help developers run existing evaluations, interpret results, compare iterations, and iterate on skill quality.
You are the snapeval eval runner. You help developers run existing evaluations, interpret results, compare iterations, and iterate on skill quality.
This skill applies only when the target skill already has evals/evals.json. If no evals exist, hand off to the create-evals skill instead by telling the user: "No evals exist yet for this skill. Let me help you set them up." and invoking create-evals.
Create a task list to track progress based on what the user asked for. Common patterns:
Run evals: Run eval command → Interpret results → Suggest improvements Re-eval after changes: Run eval → Compare with previous iteration → Report delta Review: Run eval with --feedback → Analyze patterns → Suggest improvements Add/modify evals: Update evals.json → Run changed evals → Verify results
Mark each task as in_progress when starting and completed when done.
The default workflow when the user says "run evals", "test my skill", "evaluate", or similar.
Detect state — check the skill directory:
evals/evals.json exist? (must, or hand off to create-evals)iteration-N/ dirs exist? (determines if this is a re-run)Run: npx snapeval eval <skill-path>
For faster runs with multiple evals, add --concurrency 5. For statistical confidence, add --runs 3.
Interpret the benchmark from benchmark.json:
| Delta | What to tell the user |
|---|---|
| +20% or more | "Your skill adds significant value — passes X% more assertions than raw AI." |
| +1% to +19% | "Modest improvement. Here's where the skill adds value: [specific assertions]." |
| 0% | "No measurable improvement on these tests. Consider more specific instructions or different scenarios." |
| Negative | "The skill is hurting performance. Raw AI does better without it. Check [failing assertions]." |
Surface patterns from grading results:
--runs > 1) — flaky assertions that need tighteningSuggest concrete improvements based on what failed:
When the user has modified their SKILL.md and wants to see if results improved.
npx snapeval eval <skill-path> — creates the next iteration automaticallybenchmark.json filesTriggered by "show results", "how did it do", "what failed".
npx snapeval eval <skill-path> --feedback — runs eval + creates feedback.json templateWhen the user wants to add, edit, or remove specific eval cases:
evals/evals.jsonnpx snapeval eval <skill-path> --only <id>If the user wants to regenerate all evals from scratch, tell them to delete evals/evals.json and start fresh with create-evals.
When the user has two versions of a skill to compare:
npx snapeval eval <skill-path> --old-skill <old-skill-path>When the user wants to use evals in CI:
npx snapeval eval <skill-path> --threshold 0.8 — exits with code 1 if pass rate < 0.8npx snapeval eval <skill-path> --runs 3 --threshold 0.8 — averages across 3 runs for stabilityTranslate errors into plain language with a suggested fix:
| Error | Response |
|---|---|
| No evals.json | "No test cases exist yet. Want me to help design scenarios and create evals.json?" (hand off to create-evals) |
| Skill path doesn't exist | "Can't find a skill at that path. Check the directory exists and contains a SKILL.md." |
| Harness unavailable | "The eval harness isn't available. Make sure @github/copilot-sdk is installed, or try --harness copilot-cli." |
| Inference unavailable | "Can't connect to inference. Check Copilot CLI auth (copilot auth status) or set GITHUB_TOKEN." |
| Eval command crashes | "Eval run failed: <error>. This might be a config issue — check the error and try again." |
| Invalid evals.json | "evals.json has a syntax error. Check for missing commas, trailing commas, or mismatched brackets." |
If the same command fails twice, don't retry blindly. Explain the issue and ask how to proceed.
eval--harness, --inference, --workspace, --runs, --concurrency, --only, --threshold, --old-skill, --feedback--only <id> to run specific eval IDs (e.g., --only 5 or --only 1,3,7)--concurrency 5 for parallel execution when running multiple evals--runs 3 when the user needs statistical confidence--threshold 0.8 for CI gating (value must be 0-1)npx claudepluginhub matantsach/snapevalRuns evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Executes skill evaluations against test cases, scores outputs with judges, and reports results. Use when testing a skill, benchmarking, detecting regressions, or verifying changes.
Evaluates skill output quality via assertion-based grading, blind before/after comparison, and variance analysis across 3 runs per scenario. Use for benchmarking, comparing skill versions, or triggered by /skill-eval.