From ai-dev-toolkit
Test AI system prompts that involve tool calls by packaging them as a skill with mock callable tools, running evals inline (no extra API cost) or via model-specific subagents, and grading assertions against tool-call logs. Use this skill whenever you want to evaluate whether a system prompt causes a model to call the right tools with the right parameters — including comparing behavior across models like Haiku vs Sonnet.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ai-dev-toolkit:prompt-evalThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A methodology for testing system prompts that involve tool-calling behavior. The core idea: package the prompt as a skill with mock bash tools, run test cases, and grade whether the right tools were called with the right parameters.
A methodology for testing system prompts that involve tool-calling behavior. The core idea: package the prompt as a skill with mock bash tools, run test cases, and grade whether the right tools were called with the right parameters.
Why this approach:
Create a skill directory for the prompt you're evaluating:
<project>/docs/skills/<skill-name>/
├── SKILL.md ← the prompt, formatted as a skill
├── scripts/
│ └── call_tool.sh ← mock tool caller (copy from template)
└── evals/
└── evals.json ← test cases + assertions
The skill file has two jobs: (1) give the model the system prompt, and (2) tell it how to call tools using the mock script instead of real APIs.
Include in the SKILL.md:
call_tool.sh{OUTPUTS_DIR}/response.txtTool-calling instruction to include in SKILL.md:
bash {SKILL_DIR}/scripts/call_tool.sh <tool_name> '<compact_json>' {OUTPUTS_DIR}
Compact JSON means no line breaks — the grader parses one JSON object per line.
Copy docs/skills/prompt-eval/scripts/call_tool_template.sh to your skill's scripts/call_tool.sh and add case blocks for each tool in your prompt. Make it executable: chmod +x scripts/call_tool.sh.
The script does two things:
{"tool":"...","params":{...},"timestamp":"..."} to tool_calls.jsonlKeep fake responses realistic enough that the model's follow-up reasoning stays on track (e.g. if create_task returns an id, the model can use it in a subsequent schedule_reminder call).
Good test cases cover the behavioral rules you actually care about. Each case should exercise one distinct pattern.
Typical patterns to cover:
{
"skill_name": "your-skill-name",
"evals": [
{
"id": 1,
"name": "short-kebab-case-name",
"prompt": "The user message to process. Include outputs_dir path here.",
"expected_output": "Human-readable description of what success looks like.",
"expectations": [
"create_task was called",
"create_task assignee_id is mem-001",
"create_task ai_context.reasoning present",
"schedule_reminder was called",
"schedule_reminder member_id is mem-001"
]
}
]
}
Assertions are checked by grade_evals.py using keyword pattern matching. Write them in this form:
| Pattern | Example |
|---|---|
| Tool was called | "create_task was called" |
| Tool NOT called | "schedule_reminder NOT called" |
| Tool called N times | "create_task called 3 times" |
| Field equals value | "create_task assignee_id is mem-001" |
| Nested field present | "create_task ai_context.reasoning present" |
| Field absent/null | "create_task assignee_id absent" |
| Time constraint | "schedule_clarify scheduled_at is within 2h of 2026-03-28T09:00:00Z" |
Keep assertions specific enough to fail when the model gets it wrong, but not so brittle they fail on acceptable variation (e.g. don't assert exact title text).
Running inline means you execute the evals yourself within the current session. This uses no additional API tokens beyond the current conversation.
Create output directories for each eval:
mkdir -p docs/skills/<skill-name>-workspace/iteration-1/eval-<id>-<name>/outputs
For each eval:
evals.jsoncall_tool.shoutputs/response.txtCall tools like this (use compact single-line JSON for params):
bash docs/skills/<skill-name>/scripts/call_tool.sh create_task \
'{"family_id":"fam-001","title":"Buy groceries","assignee_id":"mem-002","created_by_id":"mem-002","due_date":"2026-03-29","priority":"medium","ai_context":{"reasoning":"Bob needs groceries tomorrow."}}' \
docs/skills/<skill-name>-workspace/iteration-1/eval-1-simple-task/outputs
To test a different model (e.g. Haiku), spawn a subagent with the target model. The subagent reads the skill independently and executes the evals — this gives a genuine test of that model's behavior, not yours.
Use iteration-1/<model-name>/eval-<id>-<name>/outputs/ for subagent runs so the grader can find them:
mkdir -p docs/skills/<skill-name>-workspace/iteration-1/haiku/eval-<id>-<name>/outputs
You are being tested with a skill. Read the skill file, then run each eval exactly as instructed.
## Skill to follow
Read: <absolute-path-to>/SKILL.md
## Evals
### Eval 1 — <name>
Outputs dir: <absolute-path>/iteration-1/haiku/eval-1-<name>/outputs
<user message>
### Eval 2 — <name>
...
Run all evals. For each: call tools via Bash, save response to response.txt.
Use compact single-line JSON for all tool params (no line breaks inside the JSON string).
Spawn with model: "haiku" (or "sonnet", "opus") in the Agent tool.
Run the grading script, pointing it at the workspace and evals.json:
python3 docs/skills/prompt-eval/scripts/grade_evals.py \
docs/skills/<skill-name>-workspace/iteration-1 \
docs/skills/<skill-name>/evals/evals.json \
sonnet haiku
The script:
tool_calls.jsonl for each eval × model combinationgrading_summary.json to the workspaceDirectory lookup: The grader looks for tool calls at:
<workspace>/eval-<id>-<name>/<model>/outputs/tool_calls.jsonl (inline runs)<workspace>/<model>/eval-<id>-<name>/outputs/tool_calls.jsonl (subagent runs)For inline runs (where you are the model), use your own name as the model directory, e.g. sonnet.
When assertions fail, diagnose whether the issue is:
If a follow-up tool call (like schedule_reminder after create_task) gets dropped:
After editing SKILL.md, re-run the failing evals (inline or subagent) and re-grade. Iterate until the assertions pass across your target models.
# Run grader
python3 docs/skills/prompt-eval/scripts/grade_evals.py \
<workspace>/iteration-N \
<skill>/evals/evals.json \
sonnet haiku
# Create output dirs (inline run)
mkdir -p <workspace>/iteration-1/eval-<id>-<name>/outputs
# Create output dirs (subagent run)
mkdir -p <workspace>/iteration-1/haiku/eval-<id>-<name>/outputs
# Call a mock tool inline
bash <skill>/scripts/call_tool.sh <tool_name> '<compact_json>' <outputs_dir>
npx claudepluginhub elvinouyang/claude-skill-collection --plugin ai-dev-toolkitTests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.
Executes skill evaluations against test cases, scores outputs with judges, and reports results. Use when testing a skill, benchmarking, detecting regressions, or verifying changes.
Creates evals for skills and runs the benchmark harness to measure whether a skill improves model behavior. Use when testing, benchmarking, or evaluating a skill's quality.