From prompt-evaluation-claude-code
Eval-driven prompt refinement that runs entirely inside Claude Code via the Agent/Task tool — no Python, no SDK, no API key. Each candidate run and each judge call executes in an isolated subagent with a fresh context window, so samples are independently graded and the main session stays focused on synthesis and iteration. Trivially parallel: spawn N candidate + M judge subagents in one assistant message. Invoke when the user wants to evaluate, A/B test, regress-test, or iterate on a prompt directly inside Claude Code, especially when they reference subagents, the Task/Agent tool, or "test this prompt without writing code". Phrases like "use Claude Code to evaluate this prompt", "spawn subagents to test", "parallel-test these variants", "A/B these prompts in Claude Code", "grade this rubric with subagents", or "iterate this prompt with fresh contexts" qualify. Pairs with the broader `prompt-evaluation` skill for shared dataset-design and binary-judge methodology.
Slash command: `/prompt-evaluation-claude-code:prompt-evaluation-claude-code`
Files: assets/candidate_subagent.template.md, assets/eval_set.template.jsonl, assets/judge_binary_subagent.template.md, assets/judge_pairwise_subagent.template.md, evals/evals.json, references/candidate_subagents.md, references/eval_set.md, references/iteration_loop.md, references/judge_subagents.md, references/parallel_execution.md, references/pitfalls.md, references/shared_methodology.md
This skill is a **router and workflow**. It teaches how to run a
prompt-evaluation loop using Claude Code's own subagent capability
— no external API calls, no Python, no promptfoo. Read the
references on demand.
Claude Code's Agent/Task tool spawns a subagent with a fresh
context window. The subagent sees only the prompt you pass it; it
inherits nothing from the main conversation. When the subagent
finishes, it returns a single message to the caller and its working
context is discarded.
For prompt evaluation, this gives you three properties for free that are otherwise hard to engineer:
Agent calls in one assistant message run concurrently. 20 evals
finish in roughly the wall time of 1.
This is what the existing `prompt-evaluation` skill achieves with
async `messages.create()` calls. This skill achieves it with Agent
calls instead: same shape, no SDK.
| Situation | Use this skill | Use prompt-evaluation |
|---|---|---|
| Quick exploration inside a Claude Code session | ✓ | |
| The user can't or won't write Python / install promptfoo | ✓ | |
| The prompt under test is destined for a Claude Code agent | ✓ | |
| You want a CI-runnable, reproducible eval in the repo | | ✓ |
| You need exact `messages.create()` semantics (system vs user roles, exact model ID, exact output_config) | | ✓ |
| You want the eval to survive outside Claude Code (CI, scheduled jobs, shared with non-Claude-Code teammates) | | ✓ |
| You need RAG-specific metrics (faithfulness, context-precision) at scale | | ✓ |
| You want a one-off, exploratory pass and then a portable codified version | start here, port later | |
The two skills share methodology — binary judges, position-swap
for pairwise, calibration against human labels, dataset stratified
by feature×scenario×persona, "look at data first". This skill
inherits all of that; see references/shared_methodology.md for
the cross-skill pointers.
For each invocation you produce four kinds of artifact in the user's working directory:
<workspace>/
├── eval-set-v1.jsonl ← the test inputs (versioned)
├── eval-set-v2.jsonl ← bumped when cases change
├── prompt-candidates/
│ ├── candidate-v1.md ← the prompt under test
│ └── candidate-v2.md ← the proposed revision
├── iteration-1/
│ ├── eval-1/
│ │ ├── candidate-v1.txt ← raw output from subagent
│ │ ├── candidate-v2.txt
│ │ ├── judge-v1.json ← {verdict, reasoning}
│ │ └── judge-v2.json
│ ├── ...
│ └── results.md ← stamps eval-set version,
│ pass rate, failure modes,
│ cumulative combined record
└── iteration-2/... ← after one refinement loop
Two orthogonal version axes — prompt (candidate-vX.md) and
eval set (eval-set-vY.jsonl) — are tracked separately so
that runs are unambiguous: every iteration is "candidate-vX ran
against eval-set-vY". See references/eval_set.md → Versioning
for the bump-don't-mutate rule and the cumulative-record pattern
in references/iteration_loop.md.
You orchestrate the loop. Subagents do the per-sample work. Files are the interface.
1. Capture the prompt under test and its job. Exact prompt string,
   what an input looks like, what a correct output looks like. If
   vague, ask for one example input plus the user's ideal output;
   that becomes the seed golden pair.
2. Build (or accept) the eval set. Hand-curate 10–30 entries
   from real failures if possible; generate synthetically only as
   a complement. Store as JSONL. See references/eval_set.md.
3. Pick a grading approach. See references/judge_subagents.md.
4. Spawn the run. One assistant message, N+M parallel Agent
   calls (N candidate subagents, M judge subagents), or N
   candidates first and then M judges in a second turn if the judges
   need outputs to be on disk first. See
   references/candidate_subagents.md and
   references/parallel_execution.md.
5. Aggregate, surface failures, propose a revision. Read
   `judge-*.json` from each `eval-*/`. Compute pass rate. List
   failed inputs with one-line reasoning. Cluster the failures
   into 2–4 themes. Propose a single targeted edit to the prompt.
   Write iteration-N/results.md stamped with the eval-set
   version used and (after iteration 2) a cumulative combined
   record showing which cases the current shipping candidate has
   passed across iterations. Ask the user to greenlight
   candidate-v(N+1).md, then re-run on the same eval-set
   version.
That is the whole loop. Subsequent sections drill into the mechanics.
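The aggregation step needs no script inside Claude Code, since the main session can read the judge files directly. For readers who want to see the arithmetic spelled out, here is a minimal Python sketch. It assumes the workspace layout shown earlier (`iteration-N/eval-K/judge-vX.json`) and the `{verdict, reasoning}` judge shape; the function names `summarize_iteration` and `demo`, and the demo data, are hypothetical and not part of the skill.

```python
import json
import tempfile
from pathlib import Path


def summarize_iteration(iteration_dir: Path, candidate: str) -> dict:
    """Compute the pass rate for one candidate from its judge-<candidate>.json files."""
    passed, failed = [], []
    for judge_file in sorted(iteration_dir.glob(f"eval-*/judge-{candidate}.json")):
        record = json.loads(judge_file.read_text())
        eval_id = judge_file.parent.name
        if record["verdict"] == "correct":
            passed.append(eval_id)
        else:
            # Keep the judge's reasoning so failures can be listed one per line.
            failed.append((eval_id, record["reasoning"]))
    total = len(passed) + len(failed)
    return {
        "total": total,
        "pass_rate": len(passed) / total if total else 0.0,
        "failures": failed,
    }


def demo() -> dict:
    """Build a throwaway workspace with made-up verdicts and summarize it."""
    verdicts = {
        "eval-1": "correct",
        "eval-2": "incorrect",
        "eval-3": "correct",
        "eval-4": "correct",
    }
    with tempfile.TemporaryDirectory() as tmp:
        iteration = Path(tmp) / "iteration-1"
        for eval_id, verdict in verdicts.items():
            eval_dir = iteration / eval_id
            eval_dir.mkdir(parents=True)
            payload = {"verdict": verdict, "reasoning": f"made-up reasoning for {eval_id}"}
            (eval_dir / "judge-v1.json").write_text(json.dumps(payload))
        return summarize_iteration(iteration, "v1")


if __name__ == "__main__":
    summary = demo()
    print(f"pass rate: {summary['pass_rate']:.0%} of {summary['total']} evals")
    for eval_id, reasoning in summary["failures"]:
        print(f"FAIL {eval_id}: {reasoning}")
```

The failure list is deliberately kept next to the pass rate, because the clustering into themes works from the per-eval reasoning rather than from the headline number.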
A candidate subagent simulates one execution of the prompt under test on one eval input. The pattern:
Agent({
description: "Run candidate v1 on eval-3",
subagent_type: "general-purpose",
prompt: """
You are a subject under test. The instructions below are the
only instructions you should follow. Do not consult any tools
other than Write. Do not search the web. Do not call other
agents. Do not 'help' beyond what the instructions say.
<prompt_under_test>
{paste candidate prompt verbatim}
</prompt_under_test>
The user message you are responding to is:
<user_message>
{paste eval input}
</user_message>
Write your final answer — and nothing else — to:
{absolute path}/iteration-1/eval-3/candidate-v1.txt
using the Write tool. Then return the single word: DONE
"""
})
Three load-bearing pieces:
`<prompt_under_test>` and `<user_message>` tags.
This is the closest analog to a system+user API call you can
get without leaving Claude Code.
The full template is at assets/candidate_subagent.template.md.
Critical pitfalls live at references/pitfalls.md.
A judge subagent grades one candidate output against one criterion. The pattern is the same shape as a candidate run — file-based output, no tools except Write, fresh context.
Defaults (cross-referenced to prompt-evaluation):
- Binary verdicts (correct / incorrect). Anthropic, Hamel,
  Yan, Arize, Databricks all converge on this. Likert scales aren't
  actionable.
- `<thinking>` then `<correctness>`.
- See references/judge_subagents.md.
The judge subagent writes a JSON file:
{
"verdict": "correct",
"reasoning": "The output identifies both Service Outage and Feature Request, matching the golden set."
}
Template at assets/judge_binary_subagent.template.md; pairwise
variant at assets/judge_pairwise_subagent.template.md.
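Because the judge's file is the only interface back to the main session, it can help to check its shape before counting it. Here is a minimal Python sketch of such a check, assuming the two-field `{verdict, reasoning}` object shown above and the binary `correct` / `incorrect` vocabulary. The helper name `parse_judge_output` is hypothetical, and the skill's own templates may enforce this differently.

```python
import json

# Assumption: the binary judge only ever emits these two verdicts.
ALLOWED_VERDICTS = {"correct", "incorrect"}


def parse_judge_output(raw: str) -> dict:
    """Parse one judge file's text and reject anything outside the expected shape."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    verdict = data.get("verdict")
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")
    return {"verdict": verdict, "reasoning": reasoning}


if __name__ == "__main__":
    sample = '{"verdict": "correct", "reasoning": "Matches the golden set."}'
    print(parse_judge_output(sample)["verdict"])
```

Treating an unparseable or off-vocabulary verdict as an error, rather than as a silent failure, keeps a misbehaving judge from distorting the pass rate.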
If you have N eval inputs, spawn N candidate subagents in the same assistant message. They run concurrently. Then, in the next turn, spawn N judge subagents (also in one message). Wall time is roughly 2× the slowest sample, not N× the average.
Two practical caveats:
- See references/parallel_execution.md for the full table.
- See references/parallel_execution.md for the batching pattern.
The synthesis pass is where this skill earns its keep. You have the failures, the rubric, the pass rates, and (after the second iteration) two passes' worth of comparison. The main session should:
- Read the `candidate-*.txt` outputs in full. Don't skim.
- Write the proposed revision to prompt-candidates/candidate-v2.md.
When the user reports "v2 doesn't look better, just different",
you almost certainly traded one failure mode for another. Look at
the per-mode breakdown, not the headline number. See
references/iteration_loop.md.
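To make the "per-mode breakdown" point concrete, here is a small Python sketch in which two runs share the same headline pass rate but fail in different ways. The failure-mode labels (`missed_label`, `verbose_output`), the data, and the function names are all made up for illustration. In practice the labels would come from clustering the failures by hand, since the judge JSON itself carries only `verdict` and `reasoning`.

```python
from collections import Counter
from typing import Dict, Optional

# Hypothetical labels: eval id -> failure mode, or None when the eval passed.
Modes = Dict[str, Optional[str]]


def pass_rate(modes: Modes) -> float:
    """Fraction of evals with no failure mode attached."""
    return sum(1 for mode in modes.values() if mode is None) / len(modes)


def mode_breakdown(modes: Modes) -> Counter:
    """Count failures per mode, ignoring passing evals."""
    return Counter(mode for mode in modes.values() if mode is not None)


def compare_runs(before: Modes, after: Modes) -> Dict[str, int]:
    """Change in failure count per mode (negative means fewer failures)."""
    b, a = mode_breakdown(before), mode_breakdown(after)
    return {mode: a[mode] - b[mode] for mode in sorted(set(b) | set(a))}


# Made-up data: ten evals per run, two failures each.
V1_MODES: Modes = {f"eval-{i}": None for i in range(1, 11)}
V1_MODES.update({"eval-2": "missed_label", "eval-5": "missed_label"})

V2_MODES: Modes = {f"eval-{i}": None for i in range(1, 11)}
V2_MODES.update({"eval-3": "verbose_output", "eval-7": "verbose_output"})

if __name__ == "__main__":
    print(pass_rate(V1_MODES), pass_rate(V2_MODES))
    print(compare_runs(V1_MODES, V2_MODES))
```

Both runs pass 8 of 10 evals, yet the comparison shows one mode disappearing and another appearing, which is the "different, not better" pattern described above.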
The subagent-as-eval-runner pattern has failure modes that
messages.create() does not. The most common:
prompt-evaluation skill.
verdict as an enum.
Full list with mitigations at references/pitfalls.md.
promptfoo
so it runs in CI without a human at the keyboard. Use this
skill for the exploratory loop and then port.
| File | When to read |
|---|---|
| references/eval_set.md | Designing the eval set; what counts as a good sample |
| references/candidate_subagents.md | Full anatomy of a candidate-subagent prompt + variants |
| references/judge_subagents.md | Binary / numeric / pairwise judges + calibration |
| references/parallel_execution.md | Batching, rate limits, partial-failure handling |
| references/iteration_loop.md | Cluster failures → propose edit → compare runs |
| references/pitfalls.md | Subagent-specific failure modes + mitigations |
| references/shared_methodology.md | Cross-pointers into the prompt-evaluation skill |
| Asset | Purpose |
|---|---|
| assets/candidate_subagent.template.md | Drop-in prompt for a candidate run |
| assets/judge_binary_subagent.template.md | Drop-in prompt for a binary judge |
| assets/judge_pairwise_subagent.template.md | Drop-in prompt for a pairwise judge |
| assets/eval_set.template.jsonl | Shape of the eval-set file |
npx claudepluginhub 46ki75/claude-plugins --plugin prompt-evaluation-claude-code