Skill

compounding-skills:audit

Audits .claude/ skills via structured evals: discovers files in .claude/, generates realistic test prompts, measures Claude output improvements, iterates using Anthropic playbook.

Bash

Markdown

testing

code-quality


Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/compounding-skills:compounding-skills-audit


SKILL.md

382 lines · ~3.9k tokens

Stats

Language: Python

Parent stars: 7

Parent forks: 1

Maintenance: Excellent

Last Commit: Mar 27, 2026


compounding-skills:audit — Skill Quality Audit

You are auditing skills generated by compounding-skills:setup to verify they actually improve Claude's output quality. This follows the same eval playbook used by Anthropic's official skill-creator plugin.

Process knowledge: Load the skill-auditor skill for evaluation infrastructure — scripts, agents, viewer, and schemas.

Feature Description

<audit_target>$ARGUMENTS</audit_target>

Phase 1 — Discover Skills

Find all generated skills in the project's .claude/ directory:

# Find all skill files (SKILL.md and standalone .md skills)
# "*.md" already matches SKILL.md, so a single -name test is enough
find .claude/skills -type f -name "*.md" 2>/dev/null

If the audit target above specifies a particular skill, focus on that one. Otherwise, present everything discovered and ask what to audit:

Use AskUserQuestion:

question: "What would you like to audit?"
header: "Available skills"
options:
  - (list each discovered skill with its name and description)
  - label: "All skills"
    description: "Audit every generated skill (takes longer)"

For each selected item, read its content fully:

Expert skills: Read SKILL.md and all reference files to understand what it teaches Claude
Workflow skills: Read the full skill file to understand the workflow it defines
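The discovery step in Phase 1 could also be sketched in Python. This is a minimal sketch, not part of the skill itself: `discover_skills` is a hypothetical helper, and the rule it uses to tell the two skill types apart (a directory holding SKILL.md is an expert skill, any other .md file directly under the root is a workflow skill) is an assumption based on the descriptions above.

```python
from pathlib import Path


def discover_skills(root: Path) -> dict[str, list[Path]]:
    """Group Markdown files under `root` into expert and workflow skills.

    Assumption (not confirmed by the source): a SKILL.md file marks an expert
    skill, and any other .md file directly under `root` is a standalone
    workflow skill. Reference files nested inside a skill directory are skipped.
    """
    found: dict[str, list[Path]] = {"expert": [], "workflow": []}
    if not root.is_dir():
        return found  # like `2>/dev/null`: a missing directory is not an error
    for path in sorted(root.rglob("*.md")):
        if path.name == "SKILL.md":
            found["expert"].append(path)
        elif path.parent == root:
            found["workflow"].append(path)
    return found


if __name__ == "__main__":
    for kind, paths in discover_skills(Path(".claude/skills")).items():
        for p in paths:
            print(f"{kind}: {p}")
```

Grouping by type up front makes it easier to decide later which set of eval agent prompts applies to each item.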

Phase 2 — Generate Test Cases

For each item being audited, create 2-3 realistic test prompts that exercise what it's supposed to improve. These should be the kind of tasks a real developer would actually ask Claude to do in this project.

For Skills

Test prompts should:

Be specific and detailed — include file paths, feature names, and project context
Exercise the skill's core value proposition (conventions, patterns, architecture decisions)
Cover different aspects of what the skill teaches (not just one narrow scenario)
Be complex enough that the skill should make a measurable difference

For an expert-rails-developer skill, good test prompts might look like:

"Add a new endpoint to the V2 API that lets users update their notification preferences. It should follow our existing patterns for services and policies."
"I'm seeing a bug where the subscription renewal job silently fails when the payment gateway returns a timeout. Debug this and add proper error handling."
"Refactor the ReportsController — it's gotten too fat. Extract the report generation logic following our conventions."

For Workflow Skills

Test prompts should simulate realistic invocations of the skill:

Match how a real user would invoke the skill (with realistic arguments)
Exercise the skill's core phases and decision points
Cover different scenarios the skill should handle (e.g., for a plan skill: a simple bug fix, a multi-step feature, a refactor)

For a {prefix}:plan skill, good test prompts might look like:

"Plan adding a new notification preferences API endpoint with email digest support"
"Plan fixing the race condition in our subscription renewal background job"
"Plan refactoring the reports module to use the service object pattern"

For a {prefix}:review skill:

"Review the current branch against the plan" (with a pre-staged branch and plan file)

For a {prefix}:brainstorm skill:

"Brainstorm how to add real-time updates to our dashboard"

Present the test cases to the user: "Here are the test cases I'd like to run. Do these look right, or do you want to add/change any?"

Wait for confirmation before proceeding.

Choose Run Mode

After the user confirms test cases, ask how they'd like to run the evals:

Use AskUserQuestion:

question: "How would you like to run the evals?"
header: "Eval run mode"
options:
  - label: "Sequential (recommended)"
    description: "Run one eval pair at a time, grade inline. Lighter on resources, more reliable. Uses 2 sub-agents at a time."
  - label: "Parallel"
    description: "Launch all eval pairs at once, grade via sub-agents. Faster but heavier: uses up to (N × 2) eval sub-agents plus (N × 2) grader sub-agents, 4N in total for N evals. May hit auth limits on large runs."

Save the chosen mode — it determines behavior in Phase 3 and Phase 4.

Save Test Cases

After the user confirms, determine the path to the skill-auditor's scripts. The skill-auditor is bundled with this plugin — find it relative to this skill file:

# The skill-auditor lives alongside this skill in the plugin
AUDITOR_PATH="$(dirname "$(dirname "$0")")/skills/skill-auditor"

Save test cases to a workspace directory as a sibling to the skill being audited:

{skill-name}-workspace/
└── evals/
    └── evals.json

Use the schema from skill-auditor/references/schemas.md. Don't write assertions yet — just prompts and expected outputs.

{
  "skill_name": "expert-rails-developer",
  "evals": [
    {
      "id": 1,
      "prompt": "The realistic test prompt",
      "expected_output": "Description of what a good result looks like",
      "files": []
    }
  ]
}
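Writing the file above might look like the following. It is a sketch under stated assumptions: `save_test_cases` is a hypothetical helper, the field names are copied from the example JSON, and the authoritative schema remains the one in skill-auditor/references/schemas.md.

```python
import json
from pathlib import Path


def save_test_cases(skill_path: Path, prompts: list[tuple[str, str]]) -> Path:
    """Write evals.json into `{skill-name}-workspace/evals/` beside the skill.

    `skill_path` is either a skill directory or a standalone .md file; `.stem`
    drops the .md suffix in the second case. Each item in `prompts` is a
    (prompt, expected_output) pair. No assertions are written at this stage.
    """
    skill_name = skill_path.stem
    evals_dir = skill_path.parent / f"{skill_name}-workspace" / "evals"
    evals_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "skill_name": skill_name,
        "evals": [
            {"id": i, "prompt": prompt, "expected_output": expected, "files": []}
            for i, (prompt, expected) in enumerate(prompts, start=1)
        ],
    }
    out = evals_dir / "evals.json"
    out.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return out
```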

Phase 3 — Run Evals

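The run mode chosen in Phase 2 determines how runs are scheduled (see Sequential Mode and Parallel Mode below). Within each eval pair, the steps below run the with-skill version first and the baseline second. Unless the chosen mode says otherwise, wait for each subagent to complete before spawning the next to avoid rate limits.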

Eval Agent Prompts

Use these prompts for spawning eval sub-agents. The run mode determines when they're launched, not what they do.

For Skills

Step 1 — With-skill run:

Spawn a single Agent subagent and wait for it to complete:

Execute this task:
- First read the skill at: {path-to-skill-being-audited}/SKILL.md and all its reference files
- Follow the skill's conventions and patterns while completing the task
- Task: {eval prompt}
- Input files: {eval files if any, or "none"}
- Save all outputs to: {workspace}/iteration-{N}/eval-{ID}/with_skill/outputs/
- When done, save a transcript of your work as transcript.md in the outputs directory

Step 2 — Without-skill run (baseline):

After the with-skill run completes, spawn the baseline:

Execute this task:
- Task: {eval prompt}
- Input files: {eval files if any, or "none"}
- Save all outputs to: {workspace}/iteration-{N}/eval-{ID}/without_skill/outputs/
- Do NOT read any skill files — complete this task using only your general knowledge
- When done, save a transcript of your work as transcript.md in the outputs directory

Then move to the next test case. Complete both runs for one test case before starting the next.

For Workflow Skills

Step 1 — With-skill run:

Spawn a single Agent subagent and wait for it to complete:

Execute this task by following the skill workflow:
- First read the skill at: {path-to-skill-being-audited}
- Also read any skills referenced by the workflow skill (e.g., expert-{stack}-developer)
- Follow the skill's phases and workflow exactly as written
- Task: {eval prompt}
- Input files: {eval files if any, or "none"}
- Save all outputs to: {workspace}/iteration-{N}/eval-{ID}/with_skill/outputs/
- When done, save a transcript of your work as transcript.md in the outputs directory

Step 2 — Without-skill run (baseline):

After the with-skill run completes, spawn the baseline:

Execute this task:
- Task: {eval prompt}
- Input files: {eval files if any, or "none"}
- Save all outputs to: {workspace}/iteration-{N}/eval-{ID}/without_skill/outputs/
- Do NOT read any skill files — complete this task using only your general knowledge
- When done, save a transcript of your work as transcript.md in the outputs directory
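The expert-skill prompt pair and its output directories can be assembled mechanically. The sketch below is illustrative only: `outputs_dir` and `build_prompt` are hypothetical helpers, and the wording of each prompt line is adapted from the templates above rather than taken from any script in the plugin.

```python
from pathlib import Path
from typing import Optional, Sequence


def outputs_dir(workspace: str, iteration: int, eval_id: int, with_skill: bool) -> Path:
    """Return {workspace}/iteration-{N}/eval-{ID}/{variant}/outputs."""
    variant = "with_skill" if with_skill else "without_skill"
    return Path(workspace) / f"iteration-{iteration}" / f"eval-{eval_id}" / variant / "outputs"


def build_prompt(task: str, workspace: str, iteration: int, eval_id: int,
                 skill_path: Optional[str] = None, files: Sequence[str] = ()) -> str:
    """Build the with-skill prompt when `skill_path` is given, else the baseline."""
    out = outputs_dir(workspace, iteration, eval_id, with_skill=skill_path is not None)
    lines = ["Execute this task:"]
    if skill_path is not None:
        lines.append(f"- First read the skill at: {skill_path}/SKILL.md and all its reference files")
        lines.append("- Follow the skill's conventions and patterns while completing the task")
    lines.append(f"- Task: {task}")
    lines.append(f"- Input files: {', '.join(files) if files else 'none'}")
    lines.append(f"- Save all outputs to: {out.as_posix()}/")
    if skill_path is None:
        lines.append("- Do NOT read any skill files; complete this task using only your general knowledge")
    lines.append("- When done, save a transcript of your work as transcript.md in the outputs directory")
    return "\n".join(lines)
```

Deriving both prompts from one function keeps the task text and input files identical across the pair, so the only difference between runs is whether the skill is read.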

Sequential Mode (default)

Process one eval at a time. For each eval:

Spawn the pair — launch the with-skill and without-skill agents together (2 agents at once)
Wait for both to complete — capture timing data from the completion notifications
Draft assertions for this eval while outputs are fresh
Grade inline — read the outputs and transcript yourself, then write grading.json following the grader protocol in skill-auditor/agents/grader.md. For assertions that can be checked programmatically (file exists, contains pattern), run a script instead.
Move to the next eval

This keeps max concurrency at 2 sub-agents and eliminates grader sub-agents entirely.

Parallel Mode

Launch everything at once for maximum speed:

Spawn all pairs in one turn — launch with-skill and without-skill agents for every eval simultaneously
Draft assertions for all evals while waiting
Capture timing as each agent completes
Grade via sub-agents: once all runs complete, spawn a grader sub-agent for each run, as described under Grade Each Run

Grade Each Run

Grade runs one at a time to avoid rate limits. For each run (with_skill and without_skill), spawn a single grader subagent, wait for it to complete, then grade the next run. Read skill-auditor/agents/grader.md for the grading protocol.

The grader prompt:

You are a grader. Read the grading instructions at: {auditor-path}/agents/grader.md

Evaluate these expectations against the execution outputs:
- Expectations: {assertions list}
- Transcript path: {run-dir}/outputs/transcript.md
- Outputs directory: {run-dir}/outputs/

Save grading results to: {run-dir}/grading.json

Use the exact schema from {auditor-path}/references/schemas.md — the viewer depends on the field names "text", "passed", and "evidence" in the expectations array.

For assertions that can be checked programmatically, write and run a script instead of relying on the grader.
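A programmatic check for the two cases mentioned earlier (file exists, contains pattern) might be sketched as follows. The helper names are hypothetical. Each result uses the "text", "passed", and "evidence" field names that the viewer expects, but how these entries are merged into grading.json is left to the schema in references/schemas.md.

```python
import re
from pathlib import Path


def check_file_exists(outputs: Path, rel_path: str) -> dict:
    """Pass when `rel_path` exists as a file inside the outputs directory."""
    target = outputs / rel_path
    passed = target.is_file()
    return {
        "text": f"{rel_path} exists",
        "passed": passed,
        "evidence": f"{'found' if passed else 'missing'}: {target}",
    }


def check_contains(outputs: Path, rel_path: str, pattern: str) -> dict:
    """Pass when the file exists and its text matches the regex `pattern`."""
    target = outputs / rel_path
    text = f"{rel_path} matches /{pattern}/"
    if not target.is_file():
        return {"text": text, "passed": False, "evidence": f"missing: {target}"}
    match = re.search(pattern, target.read_text(encoding="utf-8", errors="replace"))
    evidence = f"matched {match.group(0)!r}" if match else "no match"
    return {"text": text, "passed": match is not None, "evidence": evidence}
```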

Shared Steps (both modes)

Write an eval_metadata.json for each eval directory:

{
  "eval_id": 1,
  "eval_name": "descriptive-name-here",
  "prompt": "The test prompt",
  "assertions": []
}

When each subagent completes, the task notification includes total_tokens and duration_ms. Save this data immediately to timing.json in the run directory — this is the only chance to capture it.
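Capturing the timing data can be as small as the sketch below. `save_timing` is a hypothetical helper, and the assumption is that timing.json needs only the two values named in the notification; the real schema may carry more fields.

```python
import json
from pathlib import Path


def save_timing(run_dir: Path, total_tokens: int, duration_ms: int) -> Path:
    """Persist the values from the task notification to {run_dir}/timing.json."""
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / "timing.json"
    data = {"total_tokens": total_tokens, "duration_ms": duration_ms}
    out.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return out
```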

Assertions should be:

Objectively verifiable (not subjective quality judgments)
Discriminating (they should pass with the skill and fail without it)
Named descriptively so they're clear in the benchmark viewer

Update eval_metadata.json and evals/evals.json with the assertions.

Phase 4 — Grade, Aggregate, and Review

Once all runs are complete and graded (inline for sequential mode, via sub-agents for parallel mode):

4.1 Aggregate Benchmark

Run the aggregation script:

python -m scripts.aggregate_benchmark {workspace}/iteration-{N} --skill-name {name}

Run this from the skill-auditor directory so the module resolves correctly. This produces benchmark.json and benchmark.md.

4.2 Analyst Pass

Read the benchmark data and surface patterns the aggregate stats might hide. See skill-auditor/agents/analyzer.md (the "Analyzing Benchmark Results" section) for what to look for:

Assertions that always pass regardless of skill (non-discriminating)
High-variance evals (possibly flaky)
Time/token tradeoffs
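The first pattern in that list, assertions that pass regardless of the skill, can be found by comparing the two graded runs of one eval. This is a minimal sketch assuming each run's expectations are a list of objects with "text" and "passed" fields; `non_discriminating` is a hypothetical helper, not part of the auditor scripts.

```python
def non_discriminating(with_skill: list[dict], without_skill: list[dict]) -> list[str]:
    """Return assertion texts that passed both with and without the skill."""
    baseline = {e["text"]: e["passed"] for e in without_skill}
    return [
        e["text"]
        for e in with_skill
        if e["passed"] and baseline.get(e["text"]) is True
    ]


if __name__ == "__main__":
    with_run = [
        {"text": "uses a service object", "passed": True},
        {"text": "writes a transcript", "passed": True},
    ]
    without_run = [
        {"text": "uses a service object", "passed": False},
        {"text": "writes a transcript", "passed": True},
    ]
    print(non_discriminating(with_run, without_run))
```

An assertion flagged here says nothing about the skill's value and is a candidate for rewording or removal.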

4.3 Launch the Viewer

Generate the interactive review viewer:

# Find the auditor path
AUDITOR_PATH="path/to/skill-auditor"

nohup python "$AUDITOR_PATH/eval-viewer/generate_review.py" \
  {workspace}/iteration-{N} \
  --skill-name "{skill-name}" \
  --benchmark {workspace}/iteration-{N}/benchmark.json \
  > /dev/null 2>&1 &
VIEWER_PID=$!

For iteration 2+, also pass --previous-workspace {workspace}/iteration-{N-1}.

If the environment has no display, use --static {output_path} to write a standalone HTML file instead.
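The viewer options above combine as in the following sketch. `viewer_command` is a hypothetical helper that only builds the argument list; it assumes iterations are numbered from 1 and uses only the flags named in this section.

```python
from typing import Optional


def viewer_command(auditor_path: str, workspace: str, iteration: int,
                   skill_name: str, static_output: Optional[str] = None) -> list[str]:
    """Build the generate_review.py argument list for one iteration."""
    current = f"{workspace}/iteration-{iteration}"
    cmd = [
        "python", f"{auditor_path}/eval-viewer/generate_review.py",
        current,
        "--skill-name", skill_name,
        "--benchmark", f"{current}/benchmark.json",
    ]
    if iteration > 1:
        # Iteration 2+ also compares against the previous iteration.
        cmd += ["--previous-workspace", f"{workspace}/iteration-{iteration - 1}"]
    if static_output is not None:
        # No display available: write a standalone HTML file instead.
        cmd += ["--static", static_output]
    return cmd
```

The resulting list can be passed to `subprocess.Popen` to start the viewer in the background, which also gives a process handle to terminate later.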

Tell the user: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done reviewing, come back here and let me know."

Phase 5 — Read Feedback and Iterate

When the user says they're done reviewing:

Read feedback.json from the workspace
Empty feedback means the output looked fine — focus on test cases with specific complaints
Kill the viewer server: kill $VIEWER_PID 2>/dev/null

Improve the Skill

Based on the feedback and benchmark data, improve the skill:

Generalize — don't overfit to the test cases. Changes should help across many prompts, not just the 2-3 tested ones
Keep it lean — remove instructions that aren't contributing to better outcomes. Read the transcripts to spot unproductive work caused by the skill
Explain the why — tell the model why a convention matters rather than adding rigid MUST/NEVER rules. Claude is smart — understanding beats compliance
Extract repeated work — if all test runs independently wrote similar helper logic, the skill should provide that directly

Apply improvements to the skill, then:

Rerun all test cases into iteration-{N+1}/
Launch the viewer with --previous-workspace pointing at the previous iteration
Wait for user review
Read feedback, improve again, repeat

Keep iterating until:

The user says they're happy
Feedback is all empty (everything looks good)
Improvements aren't making meaningful progress

Phase 6 — Description Optimization (Optional)

After the skill content is solid, offer to optimize the description for better triggering accuracy.

6.1 Generate Trigger Eval Queries

Create up to 20 eval queries: a mix of should-trigger (8-10) and should-not-trigger (8-10). Save as JSON.

Queries must be realistic — specific, detailed, with file paths, personal context, casual phrasing. Not abstract requests like "Format this data" but concrete ones like "ok so I need to add a new service object that handles the stripe webhook for subscription cancellations, following our existing patterns in app/services/".

For should-not-trigger queries, focus on near-misses — queries that share keywords with the skill but actually need something different.
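Assembling the trigger eval set could look like this. It is a sketch only: the "query" and "should_trigger" field names are assumptions, since the source does not show the JSON shape, and `build_trigger_evals` is a hypothetical helper.

```python
import json


def build_trigger_evals(should_trigger: list[str], should_not_trigger: list[str]) -> list[dict]:
    """Combine both query groups, enforcing the 8-10 queries per group guideline."""
    groups = (("should-trigger", should_trigger), ("should-not-trigger", should_not_trigger))
    for name, group in groups:
        if not 8 <= len(group) <= 10:
            raise ValueError(f"expected 8-10 {name} queries, got {len(group)}")
    return (
        [{"query": q, "should_trigger": True} for q in should_trigger]
        + [{"query": q, "should_trigger": False} for q in should_not_trigger]
    )


def to_json(evals: list[dict]) -> str:
    """Serialize the eval set for saving to disk."""
    return json.dumps(evals, indent=2)
```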

6.2 Review with User

Read the template from skill-auditor/assets/eval_review.html, replace placeholders:

__EVAL_DATA_PLACEHOLDER__ with the JSON array
__SKILL_NAME_PLACEHOLDER__ with the skill name
__SKILL_DESCRIPTION_PLACEHOLDER__ with the current description

Write to a temp file and open it. The user edits queries and exports the final set.
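The placeholder substitution can be done in a single pass so that text inserted for one placeholder is never rescanned for another. In this sketch `fill_review_template` is a hypothetical helper, and it assumes the name and description are inserted verbatim; whether they need HTML or JavaScript escaping depends on how the template embeds them.

```python
import json
import re


def fill_review_template(template: str, eval_data: list, skill_name: str,
                         description: str) -> str:
    """Replace the three placeholders in the eval_review.html template text."""
    values = {
        "__EVAL_DATA_PLACEHOLDER__": json.dumps(eval_data),
        "__SKILL_NAME_PLACEHOLDER__": skill_name,
        "__SKILL_DESCRIPTION_PLACEHOLDER__": description,
    }
    pattern = re.compile("|".join(re.escape(key) for key in values))
    # A function replacement avoids backslash processing in inserted text.
    return pattern.sub(lambda m: values[m.group(0)], template)
```

The returned string can then be written with `tempfile.NamedTemporaryFile(suffix=".html", delete=False)` and opened with `webbrowser.open`, both from the standard library.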

6.3 Run the Optimization Loop

python -m scripts.run_loop \
  --eval-set {path-to-trigger-eval.json} \
  --skill-path {path-to-skill} \
  --model {model-id} \
  --max-iterations 5 \
  --verbose

Run from the skill-auditor directory. This handles train/test splitting, evaluation, and iterative improvement automatically.

6.4 Apply Result

Take best_description from the output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.

Summary

After auditing completes, present a summary:

Which skills were audited
Pass rate with skill vs. without (the delta is its value)
Key improvements made
Any skills that didn't outperform baseline (consider simplifying or removing)
Any workflow skills where the workflow didn't produce better results than ad-hoc execution (consider streamlining phases or removing unnecessary steps)
Suggest running {command_prefix}:compound periodically to keep skills fresh
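The pass-rate delta in the summary is simple arithmetic over the graded expectations. The sketch below assumes each expectation is an object with a boolean "passed" field; `pass_rate` and `skill_delta` are hypothetical helpers, not the aggregation script's own functions.

```python
def pass_rate(expectations: list[dict]) -> float:
    """Fraction of expectations that passed; 0.0 for an empty list."""
    if not expectations:
        return 0.0
    return sum(1 for e in expectations if e["passed"]) / len(expectations)


def skill_delta(with_skill: list[dict], without_skill: list[dict]) -> float:
    """Pass rate with the skill minus the baseline pass rate."""
    return pass_rate(with_skill) - pass_rate(without_skill)


if __name__ == "__main__":
    with_run = [{"passed": True}, {"passed": True}, {"passed": True}, {"passed": False}]
    baseline = [{"passed": True}, {"passed": False}, {"passed": False}, {"passed": False}]
    print(f"delta: {skill_delta(with_run, baseline):+.2f}")
```

A delta at or below zero is the signal, per the list above, to consider simplifying or removing the skill.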
