Skill

eval

Eval-driven development skill for AI workflows. Tracks pass@k metrics, capability and regression evals. Includes blind evaluation protocol for high-stakes scenarios.

testing

ai-ml

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/kernel:eval

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

ReadBashWriteEditGrepGlob

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Supporting Files

reference/eval-research.md

SKILL.md

87 lines · ~1.2k tokens

Stats

LanguageJavaScript

Stars11

MaintenanceExcellent

Last CommitJun 16, 2026

Actions

View Source View Plugin View on GitHub View README

Tags

AgentDB read-start has run. Check for existing eval definitions in _meta/research/. Understand what behavior you're evaluating before writing evals. Skill-specific: skills/eval/reference/eval-research.md

<core_principles>

DEFINE BEFORE CODE: Evals written first force clear thinking about success criteria.
CODE GRADERS > MODEL GRADERS: Deterministic checks beat probabilistic judgments.
STRUCTURAL SEPARATION FOR HIGH-STAKES: When stakes are real (security, payments, eval-of-evals, agent quality scoring), use the blind-evaluator agent — never self-score. Self-scoring inflates results ~36% structurally; procedural separation ("I won't peek") does not fix it.
TRACK PASS@K: pass@1 (first attempt), pass@3 (within 3 attempts). Target pass@3 > 90%.
REGRESSION BEFORE SHIP: Every change must pass existing evals before merge.
FAST EVALS GET RUN: Slow evals get skipped. Keep evaluation fast. </core_principles>

1. DEFINE: Write eval criteria before implementation. (gate: criteria exist in writing before any code) 2. IMPLEMENT: Code to pass defined evals. 3. EVALUATE: Run evals, record pass@k. (gate: pass@3 > 90% for capability; pass^3 = 100% for regression) 4. REPORT: Document results in eval report format. See reference for template.

<blind_evaluation_protocol> Use when implementing agent would otherwise score its own output (high-stakes: security, payments, agent quality):

Spawn agents/blind-evaluator.md as a fresh agent.
Pass ONLY: problem statement, rubric (3-7 criteria with PASS conditions + weights), artifact path.
Do NOT pass: implementer's checkpoint, summary, commit message, prompt, or expected solution.
(gate: blind evaluator runs contamination check — if forbidden inputs detected, returns INVALID; clean inputs and retry)
(gate: confidence < 0.7 from blind evaluator → escalate to human grader)

Two-phase eval protocol:

Run 1: implementing agent solves cold, no eval feedback. Blind evaluator scores. This is the externally-reportable number.
Run 2: implementing agent gets Run 1 score + rubric breakdown, then optimizes. For iteration only. </blind_evaluation_protocol>

pass@k: "At least one success in k attempts" - pass@1: First attempt success rate - pass@3: Success within 3 attempts (typical target: > 90%)

pass^k: "All k trials succeed"

pass^3: 3 consecutive successes
Use for critical paths (auth, payments)

See reference for calculation formula and worked examples.

<grader_selection>

Code-based (preferred): grep, test suite, build, type-check — deterministic, fast.
Model-based: for open-ended outputs that can't be checked deterministically. Run multiple times, take majority.
Human: required for security-sensitive changes, UX evaluation, legal/compliance.

See reference for full grader templates and examples. </grader_selection>

<anti_patterns> Writing evals after implementation tests existing bugs, not requirements. Model-based grading is slow and probabilistic. Prefer code graders. Every change must pass regression evals. No exceptions. Evals that take > 30s get skipped. Keep them fast. Track pass@k over time. Declining reliability is a signal. For any user-facing or high-stakes eval, the implementing agent scoring its own work inflates results ~36%. Spawn blind-evaluator instead. Evaluating against a codebase that already contains the canonical solution = answer key in the eval set. Use pre-merge snapshots or a separate fixture. Greenfield tickets in the golden eval set collapse to self=10, blind=3. Greenfields are not evaluable as solved tasks — exclude them from the dataset. Optimizing how much context the evaluator gets before establishing a baseline score = can't distinguish signal from noise. Run minimal-context baseline first, then test additions one at a time. </anti_patterns>

<on_complete> agentdb write-end '{"skill":"eval","eval_type":"capability|regression","pass_at_1":"<X%>","pass_at_3":"<Y%>","failures":[""]}'

Record eval type, pass rates, and any failures for future reference. </on_complete>

eval

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

eval

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

Similar Skills

Similar Skills