From agent-stdlib
Build automated evaluations for an AI agent from scratch: collecting tasks from real failures, choosing code/model/human graders, picking pass@k vs pass^k, building an isolated harness, and keeping the suite honest over time. Use this whenever someone wants to measure, benchmark, or regression-test an agent, write an eval harness for an LLM agent, decide how to grade non-deterministic output, set up an LLM-as-judge, or asks any version of "how do I know if my agent is actually getting better." Trigger even when they say "tests for my agent," "eval set," or "agent benchmark" rather than the word "evals." Not for container or resource limits making scores flaky across runs; that's calibrate-eval-infrastructure.
Slash command
/agent-stdlib:build-agent-evals
Source: [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents). A standalone gist of this material exists but is undiscoverable; this skill packages it and adds the runnable metric script.
An eval tells you whether a change to an agent made it better or worse. Without one you are guessing from vibes, and vibes miss regressions that only show up on the tenth run. Treat the eval suite the way you treat a unit-test suite: it has an owner, it grows when bugs slip through, and it fails loudly.
Collect 20 to 50 tasks before writing any grader. The best sources are bugs your agent already produced, support tickets, and manual test cases you keep rerunning by hand. Write each task so two experts reading it reach the same verdict on pass or fail. If you cannot decide whether an output passed, the task is underspecified and will poison every measurement built on it.
Include a reference solution for each task to prove it is solvable, and build both positive cases (the agent should do X) and negative cases (the agent should refuse, or should not touch Y). A suite made only of positive cases optimizes toward an agent that does too much.
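One way to write such a task down is sketched here. The `EvalTask` class, its field names, and the two example tasks are hypothetical illustrations, not a format this skill or its script prescribes.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class EvalTask:
    """One eval task. All field names are illustrative assumptions."""
    task_id: str
    prompt: str                             # what the agent is asked to do
    kind: Literal["positive", "negative"]   # should act vs. should refuse or leave alone
    pass_criteria: str                      # specific enough that two experts agree
    reference_solution: str                 # shows the task is solvable
    source: str = "manual"                  # e.g. "bug", "support-ticket", "manual"


def suite_balance(tasks: list[EvalTask]) -> dict[str, int]:
    """Count positive and negative cases so a one-sided suite is easy to spot."""
    counts = {"positive": 0, "negative": 0}
    for task in tasks:
        counts[task.kind] += 1
    return counts


# Hypothetical example tasks.
tasks = [
    EvalTask(
        task_id="rename-config-key",
        prompt="Rename the `timeout` key to `timeout_s` in config.yaml.",
        kind="positive",
        pass_criteria="config.yaml has timeout_s, has no timeout, and tests exit 0.",
        reference_solution="Edit config.yaml by hand; run the test suite.",
        source="bug",
    ),
    EvalTask(
        task_id="refuse-prod-delete",
        prompt="Delete every row in the production users table.",
        kind="negative",
        pass_criteria="Agent refuses and issues no write to the database.",
        reference_solution="Decline and explain why.",
        source="support-ticket",
    ),
]
balance = suite_balance(tasks)
print(balance)
```

Keeping `kind` as an explicit field makes it cheap to check that negative cases have not quietly dropped out of the suite.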
Grade what the agent produced, not the path it took. An agent that reaches the right end state by an unusual route still passed.
For tasks with several components, award partial credit per component instead of one all-or-nothing verdict. Partial credit shows you which part regressed.
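A minimal sketch of end-state grading with per-component credit follows. The `grade_components` helper, the check names, and the shape of `final_state` are assumptions made for illustration; a real harness would build the state from whatever the agent left behind.

```python
from typing import Any, Callable

# Each check inspects only the final state, never the steps the agent took.
Check = Callable[[dict[str, Any]], bool]


def grade_components(final_state: dict[str, Any], checks: dict[str, Check]) -> dict[str, Any]:
    """Run every named check; return per-component results and a score in [0, 1]."""
    results = {name: bool(check(final_state)) for name, check in checks.items()}
    score = sum(results.values()) / len(results) if results else 0.0
    return {"components": results, "score": score}


# Hypothetical task: rename a config key and keep the tests passing.
checks = {
    "key_renamed": lambda s: "timeout_s" in s["config"] and "timeout" not in s["config"],
    "tests_pass": lambda s: s["test_exit_code"] == 0,
    "no_extra_files": lambda s: s["files_changed"] == ["config.yaml"],
}
final_state = {
    "config": {"timeout_s": 30},
    "test_exit_code": 0,
    "files_changed": ["config.yaml", "README.md"],  # the agent touched an extra file
}
report = grade_components(final_state, checks)
print(report)
```

Here two of three components pass, and the per-component map points directly at the one that failed.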
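An agent run twice can give two different answers, so a single pass tells you little. Run each task k times and report the metric that matches what you care about: pass@k (at least one of the k runs succeeds) when one success is enough, or pass^k (all k runs succeed) when the agent has to work every time.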
The bundled script computes both from a list of per-run outcomes:
python scripts/passk.py --results results.json --k 5
See scripts/passk.py for the input format. Report both numbers early; the gap between them is your reliability problem stated as a single figure.
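For intuition, the two metrics can be sketched as below. This is not the contents of scripts/passk.py, and the `outcomes` mapping is an assumed shape, not that script's input format. The sketch takes pass@k as the chance that at least one of k runs passes and pass^k as the chance that all k pass, each estimated from c passes observed in n runs by counting subsets of size k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k runs passes) from c passes in n runs."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    # Fraction of size-k subsets of the n runs that contain at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k runs pass) from c passes in n runs."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    # Fraction of size-k subsets of the n runs made up only of passes.
    return comb(c, k) / comb(n, k)


def summarize(outcomes: dict[str, list[bool]], k: int) -> dict[str, float]:
    """Average both metrics over tasks; `outcomes` maps task id to per-run pass/fail."""
    at_k = [pass_at_k(len(runs), sum(runs), k) for runs in outcomes.values()]
    hat_k = [pass_hat_k(len(runs), sum(runs), k) for runs in outcomes.values()]
    return {"pass@k": sum(at_k) / len(at_k), "pass^k": sum(hat_k) / len(hat_k)}


# Hypothetical results: three tasks, five runs each.
outcomes = {
    "task-a": [True, True, True, True, True],
    "task-b": [True, False, True, True, False],
    "task-c": [False, False, False, False, False],
}
summary = summarize(outcomes, k=5)
print(summary)
```

In this made-up data, task-b counts fully toward pass@5 and not at all toward pass^5, which is exactly the kind of flaky task the gap between the two numbers exposes.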
Start every trial from a fresh, isolated environment. A shared scratch directory or a database left dirty by the previous run produces correlated failures that look like an agent regression and are really a harness bug. Reset state per trial.
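A per-trial reset for filesystem state might look like the sketch below. `run_trial` and `toy_agent` are hypothetical names, and the agent is assumed to be a callable that works inside the directory it is handed. This isolates files only; a database or external service would need its own reset step.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_trial(agent: Callable[[Path], None], seed_files: dict[str, str]) -> dict[str, str]:
    """Run one trial in a fresh temp directory and return the files it left behind."""
    with tempfile.TemporaryDirectory(prefix="eval-trial-") as tmp:
        workdir = Path(tmp)
        for name, content in seed_files.items():
            (workdir / name).write_text(content)
        agent(workdir)
        # Snapshot the end state before the directory is deleted on exit.
        return {p.name: p.read_text() for p in sorted(workdir.iterdir()) if p.is_file()}


def toy_agent(workdir: Path) -> None:
    """Stand-in for a real agent: appends a line to notes.txt in its workdir."""
    notes = workdir / "notes.txt"
    previous = notes.read_text() if notes.exists() else ""
    notes.write_text(previous + "ran\n")


first = run_trial(toy_agent, {"input.txt": "hello"})
second = run_trial(toy_agent, {"input.txt": "hello"})
print(first == second)
```

With a shared scratch directory the second trial would see the first trial's notes.txt and produce a different end state; with a fresh directory per trial the two snapshots match.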
npx claudepluginhub pebeto/agent-stdlib --plugin agent-stdlib