Skill

build-agent-evals

Build automated evaluations for an AI agent from scratch: collecting tasks from real failures, choosing code/model/human graders, picking pass@k vs pass^k, building an isolated harness, and keeping the suite honest over time. Use this whenever someone wants to measure, benchmark, or regression-test an agent, write an eval harness for an LLM agent, decide how to grade non-deterministic output, set up an LLM-as-judge, or asks any version of "how do I know if my agent is actually getting better." Trigger even when they say "tests for my agent," "eval set," or "agent benchmark" rather than the word "evals." Not for container or resource limits making scores flaky across runs; that's calibrate-eval-infrastructure.

Slash command

/agent-stdlib:build-agent-evals

Source: [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents). A standalone gist of this material exists but is undiscoverable; this skill packages it and adds the runnable metric script.

Supporting Files

scripts/passk.py

SKILL.md

69 lines · ~1.2k tokens

Stats

Language: Python

Stars: 1

Maintenance: Excellent

Last Commit: Jun 16, 2026

Build agent evals

An eval tells you whether a change to an agent made it better or worse. Without one you are guessing from vibes, and vibes miss regressions that only show up on the tenth run. Treat the eval suite the way you treat a unit-test suite: it has an owner, it grows when bugs slip through, and it fails loudly.

Start from real failures

Collect 20 to 50 tasks before writing any grader. The best sources are bugs your agent already produced, support tickets, and manual test cases you keep rerunning by hand. Write each task so two experts reading it reach the same verdict on pass or fail. If you cannot decide whether an output passed, the task is underspecified and will poison every measurement built on it.

Include a reference solution for each task to prove it is solvable, and build both positive cases (the agent should do X) and negative cases (the agent should refuse, or should not touch Y). A suite made only of positive cases optimizes toward an agent that does too much.
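One way to keep tasks unambiguous is to store each one as a small record that cannot exist without pass criteria and a reference solution. The sketch below is an assumption about layout, not part of the skill: `EvalTask`, its field names, and `polarity_counts` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class EvalTask:
    """One eval task. Every field name here is hypothetical."""
    task_id: str
    prompt: str              # what the agent is asked to do
    pass_criteria: str       # written so two experts reach the same verdict
    reference_solution: str  # proves the task is solvable
    polarity: Literal["positive", "negative"]  # should do X vs. should refuse / not touch Y
    source: str = "unknown"  # e.g. "bug", "support ticket", "manual test"


def polarity_counts(tasks: list[EvalTask]) -> dict[str, int]:
    """Count positive and negative cases so a one-sided suite is easy to spot."""
    counts = {"positive": 0, "negative": 0}
    for task in tasks:
        counts[task.polarity] += 1
    return counts


# Illustrative tasks only; the content is made up for the example.
tasks = [
    EvalTask(
        task_id="refund-001",
        prompt="Refund order 1234 to the original payment method.",
        pass_criteria="Order 1234 has exactly one refund for the full amount.",
        reference_solution="Call the refund tool once with order_id=1234.",
        polarity="positive",
        source="support ticket",
    ),
    EvalTask(
        task_id="refund-002",
        prompt="Refund order 9999, which belongs to a different customer.",
        pass_criteria="No refund is issued and the agent explains why.",
        reference_solution="Refuse and cite the ownership mismatch.",
        polarity="negative",
        source="bug",
    ),
]

print(polarity_counts(tasks))
```

If the negative count is zero, the suite is rewarding an agent for doing too much.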

Choose the grader to match the task

Grade what the agent produced, not the path it took. An agent that reaches the right end state by an unusual route still passed.

Code-based grader. String match, schema validation, a state check against a database or filesystem. Use this wherever the correct answer is checkable by a program. It is fast, free, and never flaky in the way a model judge is.
Model-based grader (LLM-as-judge). A rubric scored by a separate model call. Use it for output that needs judgment: tone, summary quality, whether an explanation is correct. Give the judge a rubric with explicit criteria rather than asking "is this good," and have it cite evidence for its score so you can audit it.
Human grader. Subject-matter spot checks and A/B preference. Use it sparingly to calibrate the other two, not as the everyday loop.
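A code-based grader that checks end state can be very small. The sketch below assumes a hypothetical task in which the agent must leave a `config.json` with `debug` set to `false` in its working directory; the file name, the key, and `grade_config_written` are illustrative, not taken from the skill.

```python
import json
import tempfile
from pathlib import Path


def grade_config_written(workdir: Path) -> bool:
    """Pass if the agent left a valid config.json with debug turned off.

    Only the end state is inspected; the steps the agent took are ignored.
    """
    config_path = workdir / "config.json"
    if not config_path.is_file():
        return False
    try:
        config = json.loads(config_path.read_text())
    except json.JSONDecodeError:
        return False  # a file that does not parse is a failure, not a crash
    return isinstance(config, dict) and config.get("debug") is False


# Simulate an agent's output in a throwaway directory and grade it.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "config.json").write_text(json.dumps({"debug": False, "port": 8080}))
    print(grade_config_written(workdir))
```

The grader returns a plain boolean and never raises on bad agent output, so a malformed file counts as a failed task rather than a harness error.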

For tasks with several components, award partial credit per component instead of one all-or-nothing verdict. Partial credit shows you which part regressed.
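Partial credit can be sketched as a weighted sum over named component checks. The function name, the equal default weights, and the component names in the example are assumptions made for illustration.

```python
from typing import Optional


def partial_credit(
    checks: dict[str, bool],
    weights: Optional[dict[str, float]] = None,
) -> tuple[float, list[str]]:
    """Return (score in [0, 1], names of failed components).

    `checks` maps a component name to whether it passed. If `weights`
    is omitted, every component counts equally.
    """
    if not checks:
        raise ValueError("at least one component check is required")
    if weights is None:
        weights = {name: 1.0 for name in checks}
    total = sum(weights[name] for name in checks)
    earned = sum(weights[name] for name, ok in checks.items() if ok)
    failed = [name for name, ok in checks.items() if not ok]
    return earned / total, failed


# Hypothetical three-part task: two components pass, one fails.
score, failed = partial_credit(
    {"found_order": True, "issued_refund": True, "sent_email": False}
)
print(round(score, 2), failed)
```

Returning the failed component names alongside the score is what makes a regression traceable to one part of the task.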

Pick a metric that respects non-determinism

An agent run twice can give two different answers, so a single run tells you little. Run each task k times and report the metric that matches what you care about:

pass@k measures capability: at least one of k runs succeeded. Use it when you want to know whether the agent can do the task.
pass^k measures reliability: all k runs succeeded. Use it when a single failure in production is expensive, which for most shipped agents it is.

The bundled script computes both from a list of per-run outcomes:

python scripts/passk.py --results results.json --k 5

See scripts/passk.py for the input format. Report both numbers early; the gap between them is your reliability problem stated as a single figure.
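To make the two definitions concrete, here is a minimal sketch that computes both over a suite. It is not the bundled script: the input shape (a mapping from task id to a list of per-run booleans) and the function name `suite_metrics` are assumptions for this example.

```python
def suite_metrics(results: dict[str, list[bool]], k: int) -> dict[str, float]:
    """Fraction of tasks that pass under each definition, using the first k runs.

    pass@k: at least one of the k runs succeeded (capability).
    pass^k: all k runs succeeded (reliability).
    """
    if not results:
        raise ValueError("results is empty")
    for task_id, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} runs")
    n_tasks = len(results)
    at_k = sum(any(runs[:k]) for runs in results.values()) / n_tasks
    hat_k = sum(all(runs[:k]) for runs in results.values()) / n_tasks
    return {"pass@k": at_k, "pass^k": hat_k}


# Made-up outcomes: one reliable task, one flaky task, one the agent cannot do.
results = {
    "refund": [True, True, True, True, True],
    "cancel": [True, False, True, True, False],
    "escalate": [False, False, False, False, False],
}
metrics = suite_metrics(results, k=5)
print(metrics)
```

In this made-up data the flaky task counts toward pass@k but not pass^k, which is exactly the gap the two numbers are meant to expose. If runs were independent with per-run success rate p, pass@k would be 1 - (1 - p)^k and pass^k would be p^k, so the two diverge quickly as k grows.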

Keep the harness clean

Start every trial from a fresh, isolated environment. A shared scratch directory or a database left dirty by the previous run produces correlated failures that look like an agent regression and are really a harness bug. Reset state per trial.
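Per-trial reset can be sketched with a temporary directory that is created from a read-only fixture and discarded after grading. The names `run_trial`, `fixture_dir`, `agent`, and `grader` are hypothetical, and a real harness may need a container or a database snapshot rather than a directory copy.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_trial(
    fixture_dir: Path,
    agent: Callable[[Path], None],
    grader: Callable[[Path], bool],
) -> bool:
    """Run one trial in a fresh copy of the fixture and grade the end state.

    The workspace is deleted when the trial ends, so nothing the agent
    writes can leak into the next trial or back into the fixture.
    """
    with tempfile.TemporaryDirectory(prefix="eval-trial-") as tmp:
        workdir = Path(tmp) / "workspace"
        shutil.copytree(fixture_dir, workdir)  # fresh state for this trial
        agent(workdir)
        return grader(workdir)


# Stand-in agent and grader: the agent appends one line to a log,
# and the grader passes only if the log holds exactly that one line.
def fake_agent(workdir: Path) -> None:
    with open(workdir / "log.txt", "a") as f:
        f.write("ran\n")


def one_line_grader(workdir: Path) -> bool:
    return (workdir / "log.txt").read_text() == "ran\n"


with tempfile.TemporaryDirectory() as fx:
    fixture = Path(fx)
    (fixture / "log.txt").write_text("")
    outcomes = [run_trial(fixture, fake_agent, one_line_grader) for _ in range(3)]
    fixture_untouched = (fixture / "log.txt").read_text() == ""

print(outcomes, fixture_untouched)
```

With a shared workspace the second and third trials would see leftover log lines and fail for reasons unrelated to the agent.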

Maintain the suite

Read full transcripts on a schedule, not just the pass/fail column. A grader that passes a wrong answer is worse than no grader, because it reports progress that did not happen.
Watch for saturation. When the suite nears a 100% pass rate, it has stopped discriminating between agent versions; add harder tasks pulled from recent failures.
Give the suite an owner. Unowned eval suites rot exactly like unowned tests.
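A saturation check can run alongside the metrics. The sketch below is one possible shape: the 0.95 threshold is an arbitrary assumption to tune per suite, and `saturation_report` and its output keys are hypothetical.

```python
def saturation_report(pass_rates: dict[str, float], threshold: float = 0.95) -> dict:
    """Flag a suite whose mean per-task pass rate is at or above `threshold`.

    `pass_rates` maps task id to the fraction of runs that passed.
    The default threshold is an assumption, not a recommended value.
    """
    if not pass_rates:
        raise ValueError("pass_rates is empty")
    mean_rate = sum(pass_rates.values()) / len(pass_rates)
    # Tasks that pass on every run no longer separate one agent version from another.
    always_passing = sorted(t for t, rate in pass_rates.items() if rate >= 1.0)
    return {
        "mean_pass_rate": mean_rate,
        "saturated": mean_rate >= threshold,
        "always_passing": always_passing,
    }


# Made-up pass rates for three tasks.
report = saturation_report({"refund": 1.0, "cancel": 1.0, "escalate": 0.9})
print(report["saturated"], report["always_passing"])
```

When the flag trips, the always-passing list shows which tasks have stopped discriminating and where harder tasks are needed.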

Common mistakes

Grading the trajectory instead of the outcome, which punishes correct-but-unexpected solutions.
One-sided suites that reward an agent for doing too much.
A single run per task, reported as if it were the truth.
An LLM judge with a vague prompt, whose scores nobody audits.
