snapeval

Harness-agnostic eval runner for agentskills.io skills.

snapeval runs every eval case with and without your skill, grades assertions, and computes a benchmark delta — so you can see exactly what value your skill adds.

snapeval — greeter
Baseline = without SKILL.md (raw AI response)
────────────────────────────────────────────────────────────
  #1 formal greeting for Eleanor
    Skill: 100% | Baseline: 33% | 5.2s
  #2 casual greeting for Marcus
    Skill: 100% ↑ was 67% | Baseline: 67% | 2.7s
  #3 pirate greeting for Zoe
    Skill: 100% | Baseline: 67% | 2.5s
────────────────────────────────────────────────────────────
Summary:
  Skill pass rate:    100.0%
  Baseline pass rate: 55.6%
  Improvement:        +44.4%

How it works

1. You write a SKILL.md and an evals.json with test cases and assertions.
2. snapeval runs each eval twice: once with your skill loaded, once without (baseline).
3. Assertions are graded by an LLM judge (semantic) and/or shell scripts (deterministic).
4. A benchmark shows where your skill adds value and where the raw AI already handles the task.
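
The comparison described above can be sketched in a few lines. This is an illustration only, not snapeval's actual code: the `GradedEval` shape and the `benchmark` function are hypothetical names, and it assumes pass rates are computed by pooling assertions across all evals. The input numbers are chosen to reproduce the sample greeter summary, assuming three assertions per eval.

```typescript
// Hypothetical shape: one graded eval with pass counts for both conditions.
interface GradedEval {
  id: number;
  total: number;          // assertions graded for this eval
  skillPassed: number;    // assertions passed with SKILL.md loaded
  baselinePassed: number; // assertions passed without it (raw AI response)
}

interface Benchmark {
  skillRate: number;    // percent
  baselineRate: number; // percent
  delta: number;        // percentage points
}

// Pool assertions across evals and compare the two pass rates.
function benchmark(evals: GradedEval[]): Benchmark {
  const total = evals.reduce((n, e) => n + e.total, 0);
  if (total === 0) return { skillRate: 0, baselineRate: 0, delta: 0 };
  const skill = evals.reduce((n, e) => n + e.skillPassed, 0);
  const baseline = evals.reduce((n, e) => n + e.baselinePassed, 0);
  const skillRate = (skill / total) * 100;
  const baselineRate = (baseline / total) * 100;
  return { skillRate, baselineRate, delta: skillRate - baselineRate };
}

// Assumed per-eval counts matching the greeter sample (33% = 1/3, 67% = 2/3).
const greeter: GradedEval[] = [
  { id: 1, total: 3, skillPassed: 3, baselinePassed: 1 },
  { id: 2, total: 3, skillPassed: 3, baselinePassed: 2 },
  { id: 3, total: 3, skillPassed: 3, baselinePassed: 2 },
];
const result = benchmark(greeter);
console.log(result.baselineRate.toFixed(1), result.delta.toFixed(1)); // 55.6 44.4
```

Under this reading, the "Improvement" figure is a difference in percentage points between the two pass rates, not a relative change.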

Quick start

As a Copilot plugin

copilot plugin install matantsach/snapeval

Then, in Copilot CLI, say "evaluate my skill"; the snapeval skill handles the rest.

Standalone CLI

git clone https://github.com/matantsach/snapeval.git
cd snapeval && npm install
npx tsx bin/snapeval.ts eval <skill-dir>

Eval format

my-skill/
├── SKILL.md
└── evals/
    ├── evals.json
    └── scripts/         ← optional deterministic checks
        └── validate.sh

evals.json:

{
  "skill_name": "greeter",
  "evals": [
    {
      "id": 1,
      "label": "formal greeting for Eleanor",
      "prompt": "Can you give me a formal greeting for Eleanor?",
      "expected_output": "Returns the formal greeting addressed to Eleanor.",
      "assertions": [
        "Output contains the name Eleanor",
        "Output uses a formal tone",
        "script:validate.sh"
      ]
    }
  ]
}

Field	Required	Description
`id`	yes	Unique numeric identifier
`prompt`	yes	The user prompt sent to the harness
`expected_output`	yes	Human description of the expected behavior
`label`	no	Human-readable name shown in terminal output
`slug`	no	Filesystem-safe name for the eval directory
`assertions`	no	List of assertions to grade (LLM semantic or `script:` prefixed)
`files`	no	Input files to attach to the prompt
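
For readers who want to check an evals.json before running it, the field table can be mirrored as TypeScript types with a small validator. This is a sketch under assumptions, not snapeval's own schema: `EvalCase`, `EvalsFile`, and `checkEvalsFile` are hypothetical names, `skill_name` is treated as required only because the example includes it, and `files` is assumed to be a list of paths.

```typescript
// Hypothetical types mirroring the field table; not snapeval's definitions.
interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  label?: string;
  slug?: string;
  assertions?: string[];
  files?: string[]; // assumed to be file paths
}

interface EvalsFile {
  skill_name: string;
  evals: EvalCase[];
}

// Return a list of problems; an empty list means the required fields are present.
function checkEvalsFile(data: unknown): string[] {
  if (typeof data !== "object" || data === null) return ["root must be an object"];
  const root = data as Record<string, unknown>;
  const problems: string[] = [];
  if (typeof root["skill_name"] !== "string") problems.push("skill_name must be a string");
  const evals = root["evals"];
  if (!Array.isArray(evals)) return [...problems, "evals must be an array"];
  const seen = new Set<number>();
  evals.forEach((entry: unknown, i: number) => {
    const e = (typeof entry === "object" && entry !== null ? entry : {}) as Record<string, unknown>;
    const id = e["id"];
    if (typeof id !== "number") problems.push(`evals[${i}].id must be a number`);
    else if (seen.has(id)) problems.push(`evals[${i}].id ${id} is duplicated`);
    else seen.add(id);
    for (const key of ["prompt", "expected_output"]) {
      if (typeof e[key] !== "string") problems.push(`evals[${i}].${key} must be a string`);
    }
  });
  return problems;
}

const raw: unknown = JSON.parse(`{
  "skill_name": "greeter",
  "evals": [
    { "id": 1,
      "prompt": "Can you give me a formal greeting for Eleanor?",
      "expected_output": "Returns the formal greeting addressed to Eleanor.",
      "assertions": ["Output contains the name Eleanor", "script:validate.sh"] }
  ]
}`);
const problems = checkEvalsFile(raw);
if (problems.length === 0) {
  const file = raw as EvalsFile;
  // label is absent in this entry, so the fallback "#1" is used
  console.log(file.evals.map((e) => e.label ?? `#${e.id}`));
}
```

Returning a list of problems instead of throwing on the first one lets a caller report every missing field in a single pass.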

Assertions

Semantic — graded by an LLM. Write specific, verifiable statements:

"Output contains a YAML block with an 'id' field for each issue"
"Response declines because the pipeline already has unclaimed issues"

Script: prefix the assertion with `script:`. Scripts live in `evals/scripts/`, receive the output directory as `$1`, and pass when they exit with code 0:

"script:validate-json-structure.sh"

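The two assertion kinds can be told apart by the `script:` prefix alone. The sketch below shows one way to model that split; the `Assertion` type and the `classifyAssertion` and `scriptPassed` helpers are hypothetical names used for illustration, and snapeval's internal representation may differ.

```typescript
// Hypothetical discriminated union for the two assertion kinds.
type Assertion =
  | { kind: "script"; script: string }       // file name under evals/scripts/
  | { kind: "semantic"; statement: string }; // graded by the LLM judge

const SCRIPT_PREFIX = "script:";

// Route an assertion string to the deterministic or the semantic grader.
function classifyAssertion(text: string): Assertion {
  if (text.startsWith(SCRIPT_PREFIX)) {
    return { kind: "script", script: text.slice(SCRIPT_PREFIX.length).trim() };
  }
  return { kind: "semantic", statement: text };
}

// A script assertion passes when its process exits with code 0.
function scriptPassed(exitCode: number): boolean {
  return exitCode === 0;
}

const classified = ["Output uses a formal tone", "script:validate.sh"].map(classifyAssertion);
console.log(classified.map((a) => a.kind)); // semantic, then script
```
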
CLI reference

`eval`

Run evals, grade assertions, compute benchmark.

npx snapeval eval [skill-dir] [options]

Flag	Description	Default
`--harness <name>`	Harness adapter	`copilot-sdk`
`--inference <name>`	Inference adapter for grading	`auto`
`--workspace <path>`	Output directory	`../{skill_name}-workspace`
`--runs <n>`	Harness invocations per eval for statistical averaging	`1`
`--concurrency <n>`	Parallel eval cases (1-10)	`1`
`--only <ids>`	Run specific eval IDs (e.g. `--only 1,3,5`)	all
`--threshold <rate>`	Minimum pass rate 0-1 for exit code 0	none
`--old-skill <path>`	Compare against old skill version	none
`--feedback`	Write feedback.json template for human review	off

Exit codes

Code	Meaning
0	Success
1	Threshold not met (eval ran but pass rate below `--threshold`)
2	Config/input error (bad JSON, missing fields, invalid flags)
3	File not found (missing skill dir, evals.json, or script)
4	Runtime error (harness failure, grading failure, timeout)
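
As a rough illustration of how `--threshold` relates to exit codes 0 and 1, the sketch below maps a pass rate to a code. The `ExitCode` constant and the `thresholdExitCode` function are hypothetical names; whether a pass rate exactly equal to the threshold counts as meeting it, and whether an out-of-range threshold is reported as code 2, are assumptions and are not stated above.

```typescript
// Exit codes as listed in the table above.
const ExitCode = {
  Success: 0,
  ThresholdNotMet: 1,
  ConfigError: 2,
  FileNotFound: 3,
  RuntimeError: 4,
} as const;

// passRate and threshold are fractions between 0 and 1.
function thresholdExitCode(passRate: number, threshold?: number): number {
  // No --threshold given: a completed run counts as success.
  if (threshold === undefined) return ExitCode.Success;
  // Assumption: an out-of-range threshold is treated as an invalid flag.
  if (threshold < 0 || threshold > 1) return ExitCode.ConfigError;
  // Assumption: a pass rate equal to the threshold meets it.
  return passRate >= threshold ? ExitCode.Success : ExitCode.ThresholdNotMet;
}

console.log(thresholdExitCode(1.0, 0.9));   // 0
console.log(thresholdExitCode(0.556, 0.9)); // 1
```

In a CI job, this is the distinction that lets a pipeline fail on a quality regression (code 1) separately from a broken setup (codes 2 to 4).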

Output artifacts

Each run creates an iteration directory:
