From kernel
Eval-driven development skill for AI workflows. Tracks pass@k metrics, capability and regression evals. Includes blind evaluation protocol for high-stakes scenarios.
Slash command
/kernel:eval
<skill id="eval">
<core_principles>
<blind_evaluation_protocol> Use when the implementing agent would otherwise score its own output (high-stakes: security, payments, agent quality):
agents/blind-evaluator.md as a fresh agent.
Two-phase eval protocol:
pass^k: "All k trials succeed"
See reference for calculation formula and worked examples.
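As a rough illustration of the two metrics, the sketch below estimates them from `n` recorded trials of which `c` passed. It assumes pass@k means "at least one of k trials succeeds" (the usual reading, not stated above) and uses the standard combinatorial estimators; the function names `pass_at_k` and `pass_pow_k` are hypothetical and are not taken from the skill's reference file.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from n trials with c passes.

    Assumption: pass@k is read as "at least one of k succeeds".
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k sample contains only failures,
    # subtracted from 1. comb() returns 0 when n - c < k.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from n trials with c passes (pass^k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k sample contains only passes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 trials, 5 passes: pass@3 is high while pass^3 is low.
    print(round(pass_at_k(10, 5, 3), 3))
    print(round(pass_pow_k(10, 5, 3), 3))
```

The gap between the two numbers is the point: a task can look reliable under pass@k while still failing often when every trial has to succeed.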
<grader_selection>
See reference for full grader templates and examples. </grader_selection>
<anti_patterns>
Writing evals after implementation tests existing bugs, not requirements.
Model-based grading is slow and probabilistic. Prefer code graders.
Every change must pass regression evals. No exceptions.
Evals that take > 30s get skipped. Keep them fast.
Track pass@k over time. Declining reliability is a signal.
For any user-facing or high-stakes eval, the implementing agent scoring its own work inflates results by ~36%. Spawn blind-evaluator instead.
Evaluating against a codebase that already contains the canonical solution puts the answer key in the eval set. Use pre-merge snapshots or a separate fixture.
Greenfield tickets in the golden eval set collapse to self=10, blind=3. Greenfield tickets cannot be evaluated as solved tasks; exclude them from the dataset.
Optimizing how much context the evaluator gets before establishing a baseline score makes it impossible to distinguish signal from noise. Run a minimal-context baseline first, then test additions one at a time.
</anti_patterns>
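One way to act on "prefer code graders" and "keep them fast" is a deterministic check wrapped in a timer. This is a minimal sketch, not the skill's grader template: `GradeResult`, `run_code_grader`, and the `budget_seconds` parameter are hypothetical names, and the 30-second default only mirrors the threshold mentioned above.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class GradeResult:
    passed: bool          # outcome of the deterministic check
    seconds: float        # wall-clock time the check took
    within_budget: bool   # False means the eval is slow enough to get skipped


def run_code_grader(check: Callable[[], bool],
                    budget_seconds: float = 30.0) -> GradeResult:
    """Run a deterministic check and report whether it stayed within budget."""
    start = time.perf_counter()
    try:
        passed = bool(check())
    except Exception:
        # An exception inside the check counts as a failed eval.
        passed = False
    elapsed = time.perf_counter() - start
    return GradeResult(passed, elapsed, elapsed <= budget_seconds)


if __name__ == "__main__":
    result = run_code_grader(lambda: sorted([3, 1, 2]) == [1, 2, 3])
    print(result.passed, result.within_budget)
```

This version only measures duration after the fact. It does not interrupt a check that runs long, so a hard timeout would need a subprocess or similar mechanism.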
<on_complete>
agentdb write-end '{"skill":"eval","eval_type":"capability|regression","pass_at_1":"<X%>","pass_at_3":"<Y%>","failures":[""]}'
Record eval type, pass rates, and any failures for future reference.
</on_complete>
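The completion record could be assembled programmatically so that the JSON is always valid and safely quoted. The sketch below only builds the command string; it assumes the pass rates are stored as percentage strings, as the `<X%>` placeholders suggest, and `build_write_end_command` is a hypothetical helper, not part of `agentdb`.

```python
import json
import shlex
from typing import Sequence


def build_write_end_command(eval_type: str,
                            pass_at_1: float,
                            pass_at_3: float,
                            failures: Sequence[str]) -> str:
    """Build the `agentdb write-end` command line for a finished eval run."""
    if eval_type not in ("capability", "regression"):
        raise ValueError("eval_type must be 'capability' or 'regression'")
    payload = {
        "skill": "eval",
        "eval_type": eval_type,
        # Assumption: rates are recorded as whole-number percentage strings.
        "pass_at_1": f"{pass_at_1:.0%}",
        "pass_at_3": f"{pass_at_3:.0%}",
        "failures": list(failures),
    }
    # shlex.quote keeps the JSON intact as a single shell argument.
    return "agentdb write-end " + shlex.quote(json.dumps(payload))


if __name__ == "__main__":
    print(build_write_end_command("regression", 0.5, 0.9, ["login-flow"]))
```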
npx claudepluginhub ariaxhan/kernel-claude --plugin kernel