From compounding-skills
Audits Claude skills: evaluates quality with benchmark scripts, test cases, dual-run comparisons, metrics; iterates improvements using agents and Python tooling.
How this skill is triggered — by the user, by Claude, or both
Slash command
/compounding-skills:skill-auditor
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Evaluation and improvement infrastructure for skills generated by `compounding-skills:setup`. Loaded by `compounding-skills:audit`.
- agents/analyzer.md
- agents/comparator.md
- agents/grader.md
- assets/eval_review.html
- eval-viewer/generate_review.py
- eval-viewer/viewer.html
- references/schemas.md
- scripts/__init__.py
- scripts/aggregate_benchmark.py
- scripts/generate_report.py
- scripts/improve_description.py
- scripts/package_skill.py
- scripts/quick_validate.py
- scripts/run_eval.py
- scripts/run_loop.py
- scripts/utils.py
This skill teaches Claude how to:
All tooling lives in this skill's bundled directories. The scripts, viewer, and agents are identical to the ones used by Anthropic's official skill-creator plugin — same playbook, same rigor.
| Script | Purpose |
|---|---|
| `scripts/run_eval.py` | Tests whether a skill description triggers Claude to invoke it |
| `scripts/aggregate_benchmark.py` | Aggregates grading results into statistical summaries |
| `scripts/run_loop.py` | Orchestrates the full eval + description improvement loop |
| `scripts/improve_description.py` | Uses Claude with extended thinking to optimize skill descriptions |
| `scripts/generate_report.py` | Generates HTML reports from optimization loop output |
| `scripts/quick_validate.py` | Validates SKILL.md frontmatter (name, description, format) |
| `scripts/package_skill.py` | Creates distributable .skill files (ZIP archives) |
| `scripts/utils.py` | Shared utilities (YAML frontmatter parsing) |

| File | Purpose |
|---|---|
| `eval-viewer/generate_review.py` | Generates interactive HTML viewer for reviewing eval outputs and benchmarks |
| `eval-viewer/viewer.html` | Rich HTML template with Outputs tab, Benchmark tab, and feedback collection |

| Agent | Purpose |
|---|---|
| `agents/grader.md` | Evaluates assertions against execution outputs and produces structured grading.json |
| `agents/comparator.md` | Blind A/B comparison between two outputs without knowing which skill produced them |
| `agents/analyzer.md` | Analyzes benchmark patterns and explains why one version outperformed another |

| File | Purpose |
|---|---|
| `references/schemas.md` | JSON schemas for evals.json, grading.json, benchmark.json, timing.json, and all other data structures |

| File | Purpose |
|---|---|
| `assets/eval_review.html` | Interactive HTML template for reviewing and editing trigger eval sets |
The audit follows the same methodology as Anthropic's skill-creator:
Every test case runs twice — once with the skill and once without it (baseline). This isolates the actual contribution. If Claude produces equally good output without it, the skill isn't pulling its weight.
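As a minimal sketch of the dual-run idea, the snippet below compares pass rates for one test case run with and without the skill. The function names and the list-of-booleans input are assumptions made for illustration; they are not the interface of the bundled scripts or the grading.json schema.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of assertions that passed; 0.0 for an empty run."""
    return sum(results) / len(results) if results else 0.0


def skill_contribution(with_skill: Sequence[bool],
                       without_skill: Sequence[bool]) -> float:
    """Pass-rate delta between the two runs (positive means the skill helped)."""
    return pass_rate(with_skill) - pass_rate(without_skill)


# Hypothetical assertion outcomes for a single test case.
with_skill_run = [True, True, True, False]
baseline_run = [True, False, True, False]

delta = skill_contribution(with_skill_run, baseline_run)
print(f"pass-rate delta: {delta:+.2f}")
```

A delta near zero across many test cases would suggest the skill adds little beyond what Claude already does unaided.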
The grader agent evaluates each run against predefined assertions. It goes beyond pass/fail — it extracts implicit claims, verifies them against outputs, and critiques whether the assertions themselves are discriminating enough to be useful.
Results are aggregated with mean, stddev, min, max across configurations. The analyzer surfaces patterns the numbers alone won't show: assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.
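The aggregation step can be sketched roughly as follows. This is an illustrative approximation, not the logic of `scripts/aggregate_benchmark.py`: the helper names, the input shapes, and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, min, and max for one configuration."""
    return {
        "mean": mean(values),
        # stdev needs at least two data points; report 0.0 for a single run.
        "stddev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def always_passes(runs: List[bool]) -> bool:
    """True only if there is at least one run and every run passed."""
    return bool(runs) and all(runs)


def non_discriminating(with_skill: Dict[str, List[bool]],
                       without_skill: Dict[str, List[bool]]) -> List[str]:
    """Assertions that pass in every run of both configurations."""
    return sorted(
        name for name in with_skill
        if always_passes(with_skill[name])
        and always_passes(without_skill.get(name, []))
    )


# Hypothetical pass rates per run, keyed by configuration.
pass_rates = {
    "with_skill": [1.0, 0.75, 1.0],
    "without_skill": [0.5, 0.75, 0.25],
}
summary = {config: summarize(rates) for config, rates in pass_rates.items()}

# Hypothetical per-assertion outcomes across runs.
flagged = non_discriminating(
    {"has_tests": [True, True], "follows_naming": [True, True]},
    {"has_tests": [True, True], "follows_naming": [False, True]},
)
print(summary)
print("non-discriminating:", flagged)
```

An assertion flagged here passes whether or not the skill is loaded, so it says nothing about the skill's contribution and is a candidate for tightening or removal.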
The eval viewer presents outputs side-by-side with benchmark data. The human reviews qualitatively (does this output actually look good?) while the benchmarks provide quantitative rigor. Feedback drives the next iteration.
Results are organized by iteration and eval:
{name}-workspace/
├── iteration-1/
│   ├── eval-descriptive-name/
│   │   ├── eval_metadata.json
│   │   ├── with_skill/
│   │   │   ├── outputs/
│   │   │   ├── timing.json
│   │   │   └── grading.json
│   │   └── without_skill/
│   │       ├── outputs/
│   │       ├── timing.json
│   │       └── grading.json
│   ├── benchmark.json
│   └── benchmark.md
├── iteration-2/
│   └── ...
└── feedback.json
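Given the layout above, a small helper can collect the grading files for each configuration. This is a sketch under the assumption that the directory names match the tree exactly; `find_grading_files` and the `my-skill-workspace` path are hypothetical and not part of the bundled tooling.

```python
from pathlib import Path
from typing import Dict, List


def find_grading_files(workspace: Path) -> Dict[str, List[Path]]:
    """Group grading.json paths by configuration directory name."""
    found: Dict[str, List[Path]] = {"with_skill": [], "without_skill": []}
    # Assumed layout: <workspace>/iteration-N/<eval-name>/<configuration>/grading.json
    for path in sorted(workspace.glob("iteration-*/*/*/grading.json")):
        config = path.parent.name
        if config in found:
            found[config].append(path)
    return found


if __name__ == "__main__":
    # Hypothetical workspace directory; substitute the real {name}-workspace path.
    for config, paths in find_grading_files(Path("my-skill-workspace")).items():
        print(config, len(paths))
```

Keeping the two configurations in separate lists makes it straightforward to pair each with-skill result with its baseline for the same eval and iteration.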
Generated skills contain real code examples and project-specific conventions. Good test prompts should exercise whether those conventions actually guide Claude's behavior — not just whether Claude can complete the task at all.
Workflow skills define multi-phase workflows. Good test prompts should exercise whether the workflow produces better results than ad-hoc execution — better plans, more thorough reviews, more structured output.
A skill that doesn't outperform the baseline is overhead. Claude is already capable — the skill needs to demonstrably improve output quality, consistency, or adherence to project conventions to justify its context window cost.
Test cases are examples, not the universe of usage. When improving based on test results, make changes that will generalize to the many prompts users will actually throw at it. Explain the why behind instructions rather than adding rigid rules that only fix the specific test case.
If the benchmark shows a section isn't contributing to better outcomes, consider removing it. Every line costs context window space — unused instructions actively hurt by crowding out useful ones.
Expert skills are judged on whether code output follows conventions. Workflow skills are judged on whether the workflow adds value:
- Does the `plan` skill produce more thorough, actionable plans than ad-hoc planning?
- Does the `review` skill catch issues that a general review misses?
- Does the `brainstorm` skill lead to better-explored design decisions?
- Does the `work` skill produce more consistent, tested implementations?
- Does the `compound` skill extract genuinely useful patterns?

Workflow skill improvements often involve streamlining phases, improving question quality, or removing steps that don't contribute to better outcomes.
npx claudepluginhub profburial/compounding-skills --plugin compounding-skills