From agentic-eval
Use when designing or implementing an evaluation loop for AI agent outputs — reflection loops, evaluator-optimizer pipelines, LLM-as-judge scoring, or rubric-based iteration. Not when running an existing test suite or reviewing a completed artifact without iterating.
Slash command
/agentic-eval:agentic-eval
Use this skill when you are designing or implementing an evaluation loop that lets an agent assess and improve its own outputs through iteration — not when you are running a pre-existing test suite or doing a one-off review with no refinement cycle.
The core pattern is: Generate → Evaluate → Critique → Refine → Output, looping until a convergence condition is met or a max-iteration budget is exhausted.
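The loop above can be sketched roughly as follows. This is a minimal illustration, not the skill's own implementation: `reflection_loop`, the `Evaluation` type, and the `generate`, `evaluate`, and `refine` callables are hypothetical names, and the 0.8 threshold is an assumed default. The critique step is folded into the evaluator's return value.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    score: float   # assumed scale: 0.0 to 1.0, higher is better
    critique: str  # feedback handed to the refine step


def reflection_loop(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Evaluation],
    refine: Callable[[str, str, str], str],
    threshold: float = 0.8,   # assumed default, tune per task
    max_iterations: int = 3,
) -> Tuple[str, List[float]]:
    """Generate, then evaluate and refine until the score meets the
    threshold or the iteration budget is spent.

    Returns the best-scoring output seen and the score history.
    """
    output = generate(task)
    best_output, best_score = output, float("-inf")
    history: List[float] = []

    for iteration in range(max_iterations):
        result = evaluate(task, output)
        history.append(result.score)

        # Track the best candidate so a late regression is never returned.
        if result.score > best_score:
            best_output, best_score = output, result.score

        # Stop on convergence, or when the budget is exhausted
        # (refining again would produce an output nobody evaluates).
        if result.score >= threshold or iteration == max_iterations - 1:
            break

        output = refine(task, output, result.critique)

    return best_output, history
```

Returning the score history alongside the output makes it easy to log each iteration and to spot loops that plateau before reaching the threshold.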
| Situation | Use this skill? | Route instead |
|---|---|---|
| Designing a reflection loop with a score threshold and max iterations | Yes | — |
| Implementing LLM-as-judge comparison of two candidate outputs | Yes | — |
| Running npm test to confirm a fix works | No | verification-before-completion |
| Tracing why a specific assertion fails | No | systematic-debugging |
| Writing Jest or pytest test coverage for a module | No | test-driven-development |
| Reviewing a PR diff once, no iteration | No | review-comment-resolution |
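For the LLM-as-judge row in the table, a pairwise comparison might look something like the sketch below. The `judge` callable and its `"first"` / `"second"` / `"tie"` return values are hypothetical; in practice it would wrap a model call. Asking twice with the order swapped is an assumption added here to guard against a judge that favors whichever candidate it sees first. It is not something the skill prescribes.

```python
from typing import Callable

# Hypothetical judge signature: (task, first, second) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def compare_candidates(task: str, a: str, b: str, judge: Judge) -> str:
    """Return "a", "b", or "tie".

    The judge is asked twice with the candidates in swapped order; a
    winner is declared only when both verdicts point at the same candidate.
    """
    forward = judge(task, a, b)    # a shown first
    backward = judge(task, b, a)   # b shown first

    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    # Disagreement or explicit ties are treated as no clear winner.
    return "tie"
```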
Required before starting
Helpful if present
Check whether trigger-queries.json already exists; if so, load it to understand scope.
The three evaluation strategy patterns (outcome-based, LLM-as-judge, rubric-based) and full Python examples are in references/patterns.md.
The implementation checklist — criteria, threshold, loop wiring, convergence, logging — is in assets/eval-checklist.md.
For a new implementation, start with the checklist to confirm your setup is complete, then use the patterns reference to choose and adapt an evaluation strategy.
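As a rough illustration of the rubric-based strategy, the sketch below combines per-criterion ratings into a single score that a loop can compare against its threshold. The function name, the criterion names, and the weighted-average formula are assumptions for illustration; the annotated examples in references/patterns.md are the authoritative versions.

```python
from typing import Dict


def rubric_score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-criterion ratings, each assumed to be 0.0-1.0."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")

    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    return sum(ratings[name] * w for name, w in weights.items()) / total_weight


# Hypothetical rubric: accuracy counts twice as much as the other criteria.
weights = {"accuracy": 2.0, "clarity": 1.0, "completeness": 1.0}
ratings = {"accuracy": 1.0, "clarity": 0.5, "completeness": 0.5}
score = rubric_score(ratings, weights)
```

Keeping the weights explicit makes the rubric easy to log and adjust between runs without touching the loop itself.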
Set a max_iterations bound (3–5 is a safe default) before wiring up a refinement loop. Unbounded loops stall agents.
After implementing an evaluation loop, confirm:
max_iterations is set and respected by the loop
references/patterns.md: The three evaluation strategy patterns (outcome-based, LLM-as-judge, rubric-based) with annotated Python examples and a best-practices table.
assets/eval-checklist.md: Implementation checklist covering setup, loop wiring, convergence, logging, and safety items to confirm before shipping.
npx claudepluginhub matt-riley/lucky-hat --plugin agentic-eval