From compounding-skills
Audits .claude/ skills via structured evals: discovers files in .claude/, generates realistic test prompts, measures Claude output improvements, iterates using Anthropic playbook.
How this skill is triggered — by the user, by Claude, or both
Slash command
/compounding-skills:compounding-skills-auditThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are auditing skills generated by `compounding-skills:setup` to verify they actually improve Claude's output quality. This follows the same eval playbook used by Anthropic's official skill-creator plugin.
You are auditing skills generated by compounding-skills:setup to verify they actually improve Claude's output quality. This follows the same eval playbook used by Anthropic's official skill-creator plugin.
Process knowledge: Load the skill-auditor skill for evaluation infrastructure — scripts, agents, viewer, and schemas.
<audit_target>$ARGUMENTS</audit_target>
Find all generated skills in the project's .claude/ directory:
# Find all skill files (SKILL.md and standalone .md skills)
find .claude/skills -name "SKILL.md" -o -name "*.md" 2>/dev/null
If the audit target above specifies a particular skill, focus on that one. Otherwise, present everything discovered and ask what to audit:
Use AskUserQuestion:
question: "What would you like to audit?"
header: "Available skills"
options:
- (list each discovered skill with its name and description)
- label: "All skills"
description: "Audit every generated skill (takes longer)"
For each selected item, read its content fully:
For each item being audited, create 2-3 realistic test prompts that exercise what it's supposed to improve. These should be the kind of tasks a real developer would actually ask Claude to do in this project.
Test prompts should:
For an expert-rails-developer skill, good test prompts might look like:
Test prompts should simulate realistic invocations of the skill:
plan skill: a simple bug fix, a multi-step feature, a refactor)For a {prefix}:plan skill, good test prompts might look like:
For a {prefix}:review skill:
For a {prefix}:brainstorm skill:
Present the test cases to the user: "Here are the test cases I'd like to run. Do these look right, or do you want to add/change any?"
Wait for confirmation before proceeding.
After the user confirms test cases, ask how they'd like to run the evals:
Use AskUserQuestion:
question: "How would you like to run the evals?"
header: "Eval run mode"
options:
- label: "Sequential (recommended)"
description: "Run one eval pair at a time, grade inline. Lighter on resources, more reliable. Uses 2 sub-agents at a time."
- label: "Parallel"
description: "Launch all eval pairs at once, grade via sub-agents. Faster but heavier — uses up to (N × 2) + (N × 2) sub-agents. May hit auth limits on large runs."
Save the chosen mode — it determines behavior in Phase 3 and Phase 4.
After the user confirms, determine the path to the skill-auditor's scripts. The skill-auditor is bundled with this plugin — find it relative to this skill file:
# The skill-auditor lives alongside this skill in the plugin
AUDITOR_PATH="$(dirname "$(dirname "$0")")/skills/skill-auditor"
Save test cases to a workspace directory as a sibling to the skill being audited:
{skill-name}-workspace/
└── evals/
└── evals.json
Use the schema from skill-auditor/references/schemas.md. Don't write assertions yet — just prompts and expected outputs.
{
"skill_name": "expert-rails-developer",
"evals": [
{
"id": 1,
"prompt": "The realistic test prompt",
"expected_output": "Description of what a good result looks like",
"files": []
}
]
}
Run eval pairs one subagent at a time to avoid rate limits. For each test case, run the with-skill version first, wait for it to complete, then run the baseline. Do not spawn multiple subagents in parallel.
Use these prompts for spawning eval sub-agents. The run mode determines when they're launched, not what they do.
Step 1 — With-skill run:
Spawn a single Agent subagent and wait for it to complete:
Execute this task:
- First read the skill at: {path-to-skill-being-audited}/SKILL.md and all its reference files
- Follow the skill's conventions and patterns while completing the task
- Task: {eval prompt}
- Input files: {eval files if any, or "none"}
- Save all outputs to: {workspace}/iteration-{N}/eval-{ID}/with_skill/outputs/
- When done, save a transcript of your work as transcript.md in the outputs directory
Step 2 — Without-skill run (baseline):
After the with-skill run completes, spawn the baseline:
Execute this task:
- Task: {eval prompt}
- Input files: {eval files if any, or "none"}
- Save all outputs to: {workspace}/iteration-{N}/eval-{ID}/without_skill/outputs/
- Do NOT read any skill files — complete this task using only your general knowledge
- When done, save a transcript of your work as transcript.md in the outputs directory
Then move to the next test case. Complete all test cases for one eval before starting the next.
Step 1 — With-skill run:
Spawn a single Agent subagent and wait for it to complete:
Execute this task by following the skill workflow:
- First read the skill at: {path-to-skill-being-audited}
- Also read any skills referenced by the workflow skill (e.g., expert-{stack}-developer)
- Follow the skill's phases and workflow exactly as written
- Task: {eval prompt}
- Input files: {eval files if any, or "none"}
- Save all outputs to: {workspace}/iteration-{N}/eval-{ID}/with_skill/outputs/
- When done, save a transcript of your work as transcript.md in the outputs directory
Step 2 — Without-skill run (baseline):
After the with-skill run completes, spawn the baseline:
Execute this task:
- Task: {eval prompt}
- Input files: {eval files if any, or "none"}
- Save all outputs to: {workspace}/iteration-{N}/eval-{ID}/without_skill/outputs/
- Do NOT read any skill files — complete this task using only your general knowledge
- When done, save a transcript of your work as transcript.md in the outputs directory
Process one eval at a time. For each eval:
grading.json following the grader protocol in skill-auditor/agents/grader.md. For assertions that can be checked programmatically (file exists, contains pattern), run a script instead.This keeps max concurrency at 2 sub-agents and eliminates grader sub-agents entirely.
Launch everything at once for maximum speed:
Once all runs complete:
Grade runs one at a time to avoid rate limits. For each run (with_skill and without_skill), spawn a single grader subagent, wait for it to complete, then grade the next run. Read skill-auditor/agents/grader.md for the grading protocol.
The grader prompt:
You are a grader. Read the grading instructions at: {auditor-path}/agents/grader.md
Evaluate these expectations against the execution outputs:
- Expectations: {assertions list}
- Transcript path: {run-dir}/outputs/transcript.md
- Outputs directory: {run-dir}/outputs/
Save grading results to: {run-dir}/grading.json
Use the exact schema from {auditor-path}/references/schemas.md — the viewer depends on the field names "text", "passed", and "evidence" in the expectations array.
For assertions that can be checked programmatically, write and run a script instead of relying on the grader.
Write an eval_metadata.json for each eval directory:
{
"eval_id": 1,
"eval_name": "descriptive-name-here",
"prompt": "The test prompt",
"assertions": []
}
When each subagent completes, the task notification includes total_tokens and duration_ms. Save this data immediately to timing.json in the run directory — this is the only chance to capture it.
Assertions should be:
Update eval_metadata.json and evals/evals.json with the assertions.
Once all runs are complete and graded (inline for sequential mode, via sub-agents for parallel mode):
Run the aggregation script:
python -m scripts.aggregate_benchmark {workspace}/iteration-{N} --skill-name {name}
Run this from the skill-auditor directory so the module resolves correctly. This produces benchmark.json and benchmark.md.
Read the benchmark data and surface patterns the aggregate stats might hide. See skill-auditor/agents/analyzer.md (the "Analyzing Benchmark Results" section) for what to look for:
Generate the interactive review viewer:
# Find the auditor path
AUDITOR_PATH="path/to/skill-auditor"
nohup python "$AUDITOR_PATH/eval-viewer/generate_review.py" \
{workspace}/iteration-{N} \
--skill-name "{skill-name}" \
--benchmark {workspace}/iteration-{N}/benchmark.json \
> /dev/null 2>&1 &
VIEWER_PID=$!
For iteration 2+, also pass --previous-workspace {workspace}/iteration-{N-1}.
If the environment has no display, use --static {output_path} to write a standalone HTML file instead.
Tell the user: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done reviewing, come back here and let me know."
When the user says they're done reviewing:
feedback.json from the workspacekill $VIEWER_PID 2>/dev/nullBased on the feedback and benchmark data, improve the skill:
Apply improvements to the skill, then:
iteration-{N+1}/--previous-workspace pointing at the previous iterationKeep iterating until:
After the skill content is solid, offer to optimize the description for better triggering accuracy.
Create 20 eval queries — a mix of should-trigger (8-10) and should-not-trigger (8-10). Save as JSON.
Queries must be realistic — specific, detailed, with file paths, personal context, casual phrasing. Not abstract requests like "Format this data" but concrete ones like "ok so I need to add a new service object that handles the stripe webhook for subscription cancellations, following our existing patterns in app/services/".
For should-not-trigger queries, focus on near-misses — queries that share keywords with the skill but actually need something different.
Read the template from skill-auditor/assets/eval_review.html, replace placeholders:
__EVAL_DATA_PLACEHOLDER__ with the JSON array__SKILL_NAME_PLACEHOLDER__ with the skill name__SKILL_DESCRIPTION_PLACEHOLDER__ with the current descriptionWrite to a temp file and open it. The user edits queries and exports the final set.
python -m scripts.run_loop \
--eval-set {path-to-trigger-eval.json} \
--skill-path {path-to-skill} \
--model {model-id} \
--max-iterations 5 \
--verbose
Run from the skill-auditor directory. This handles train/test splitting, evaluation, and iterative improvement automatically.
Take best_description from the output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.
After auditing completes, present a summary:
{command_prefix}:compound periodically to keep skills freshnpx claudepluginhub profburial/compounding-skills --plugin compounding-skillsCreate, edit, optimize, and benchmark skills for Claude Code. Guides users through drafting, evaluating with test prompts, analyzing quantitative results, and iterating on skill descriptions.
Creates, edits, and optimizes skills with iterative evaluation and benchmarking. Guides users through drafting, testing, and refining skills.
Creates new Claude Code skills from scratch, modifies and improves existing ones, runs evaluations and benchmarks performance with variance analysis, optimizes trigger descriptions.