Skill

skill-auditor

Audits Claude skills: evaluates quality with benchmark scripts, test cases, dual-run comparisons, metrics; iterates improvements using agents and Python tooling.

Python

Markdown

testing

code-quality

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command: `/compounding-skills:skill-auditor`

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Evaluation and improvement infrastructure for skills generated by `compounding-skills:setup`. Loaded by `compounding-skills:audit`.

Supporting Files

agents/analyzer.md
agents/comparator.md
agents/grader.md
assets/eval_review.html
eval-viewer/generate_review.py
eval-viewer/viewer.html
references/schemas.md
scripts/__init__.py
scripts/aggregate_benchmark.py
scripts/generate_report.py
scripts/improve_description.py
scripts/package_skill.py
scripts/quick_validate.py
scripts/run_eval.py
scripts/run_loop.py
scripts/utils.py

SKILL.md

137 lines · ~1.6k tokens

Stats

Language: Python
Parent stars: 7
Parent forks: 1
Maintenance: Excellent
Last commit: Mar 27, 2026

Skill Auditor

Evaluation and improvement infrastructure for skills generated by `compounding-skills:setup`. Loaded by `compounding-skills:audit`.

Purpose

This skill teaches Claude how to:

- Audit generated skills: evaluate whether they actually improve Claude's output quality
- Run structured evals: test with realistic prompts, grade outputs, and aggregate benchmarks
- Iterate on improvements: use human feedback and quantitative data to refine skills until they are genuinely useful

Infrastructure

All tooling lives in this skill's bundled directories. The scripts, viewer, and agents are identical to the ones used by Anthropic's official skill-creator plugin — same playbook, same rigor.

Scripts

| Script | Purpose |
| --- | --- |
| `scripts/run_eval.py` | Tests whether a skill description triggers Claude to invoke it |
| `scripts/aggregate_benchmark.py` | Aggregates grading results into statistical summaries |
| `scripts/run_loop.py` | Orchestrates the full eval + description improvement loop |
| `scripts/improve_description.py` | Uses Claude with extended thinking to optimize skill descriptions |
| `scripts/generate_report.py` | Generates HTML reports from optimization loop output |
| `scripts/quick_validate.py` | Validates SKILL.md frontmatter (name, description, format) |
| `scripts/package_skill.py` | Creates distributable .skill files (ZIP archives) |
| `scripts/utils.py` | Shared utilities (YAML frontmatter parsing) |
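
To make the frontmatter check concrete, here is a minimal sketch of what validating the `name` and `description` fields could look like. It is not the bundled `quick_validate.py`: the function names, the assumption that frontmatter is a leading `---` delimited block of simple `key: value` pairs, and the kebab-case rule for `name` are all illustrative assumptions.

```python
import re


def parse_frontmatter(text: str) -> dict[str, str]:
    """Return key/value pairs from a leading '---' delimited block.

    Hypothetical helper: handles only flat 'key: value' lines, not full YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter line")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    raise ValueError("frontmatter block is never closed")


def validate_frontmatter(text: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        fields = parse_frontmatter(text)
    except ValueError as exc:
        return [str(exc)]
    problems = []
    for required in ("name", "description"):
        if not fields.get(required):
            problems.append(f"missing or empty field: {required}")
    name = fields.get("name", "")
    # Assumed naming rule for illustration: lowercase words joined by hyphens.
    if name and not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append(f"name is not kebab-case: {name}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.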

Eval Viewer

| File | Purpose |
| --- | --- |
| `eval-viewer/generate_review.py` | Generates interactive HTML viewer for reviewing eval outputs and benchmarks |
| `eval-viewer/viewer.html` | Rich HTML template with Outputs tab, Benchmark tab, and feedback collection |

Agents

| Agent | Purpose |
| --- | --- |
| `agents/grader.md` | Evaluates assertions against execution outputs and produces a structured grading.json |
| `agents/comparator.md` | Blind A/B comparison between two outputs without knowing which skill produced them |
| `agents/analyzer.md` | Analyzes benchmark patterns and explains why one version outperformed another |

References

| File | Purpose |
| --- | --- |
| `references/schemas.md` | JSON schemas for evals.json, grading.json, benchmark.json, timing.json, and all other data structures |

Assets

| File | Purpose |
| --- | --- |
| `assets/eval_review.html` | Interactive HTML template for reviewing and editing trigger eval sets |

Evaluation Methodology

The audit follows the same methodology as Anthropic's skill-creator:

Dual-Run Comparison

Every test case runs twice: once with the skill and once without it (the baseline). This isolates the skill's actual contribution. If Claude produces equally good output without it, the skill isn't pulling its weight.
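
The comparison can be sketched as a difference in assertion pass rates between the two runs. This is an illustrative sketch only: `RunResult`, `skill_delta`, `verdict`, and the 0.05 threshold are hypothetical names and values, not part of the bundled scripts or of the schemas in `references/schemas.md`.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Assertion counts for one run of a test case (hypothetical structure)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Treat a run with no assertions as a 0.0 pass rate.
        return self.passed / self.total if self.total else 0.0


def skill_delta(with_skill: RunResult, without_skill: RunResult) -> float:
    """Pass-rate gain attributable to the skill (negative means it hurt)."""
    return with_skill.pass_rate - without_skill.pass_rate


def verdict(delta: float, min_gain: float = 0.05) -> str:
    """Classify a delta; min_gain is an assumed threshold for illustration."""
    if delta >= min_gain:
        return "skill helps"
    if delta <= -min_gain:
        return "skill hurts"
    return "no measurable gain"


if __name__ == "__main__":
    delta = skill_delta(RunResult(passed=9, total=10), RunResult(passed=6, total=10))
    print(f"{delta:+.2f} -> {verdict(delta)}")
```

A single pair of runs is noisy, so in practice the delta is more meaningful when averaged over several test cases.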

Structured Grading

The grader agent evaluates each run against predefined assertions. It goes beyond pass/fail — it extracts implicit claims, verifies them against outputs, and critiques whether the assertions themselves are discriminating enough to be useful.

Benchmark Aggregation

Results are aggregated with mean, stddev, min, max across configurations. The analyzer surfaces patterns the numbers alone won't show: assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.
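
As a rough illustration of this aggregation step, the sketch below computes the four summary statistics and flags assertions that pass in every run of both configurations. It is not the bundled `aggregate_benchmark.py`: the function names and input shapes are assumptions, and it uses the sample standard deviation, which may differ from what the real script computes.

```python
from statistics import mean, stdev


def summarize(values: list[float]) -> dict[str, float]:
    """Mean, sample stddev, min, and max for one configuration's scores."""
    return {
        "mean": mean(values),
        # stdev() needs at least two data points; report 0.0 for a single run.
        "stddev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def non_discriminating(
    with_skill: dict[str, list[bool]],
    without_skill: dict[str, list[bool]],
) -> list[str]:
    """Assertions that passed in every run, with and without the skill.

    Input shape is assumed: assertion name -> pass/fail result per run.
    """
    shared = set(with_skill) & set(without_skill)
    return sorted(
        name
        for name in shared
        if with_skill[name]
        and without_skill[name]
        and all(with_skill[name])
        and all(without_skill[name])
    )
```

An assertion returned by `non_discriminating` tells you nothing about the skill, so it is a candidate for tightening or removal.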

Human-in-the-Loop Review

The eval viewer presents outputs side-by-side with benchmark data. The human reviews qualitatively (does this output actually look good?) while the benchmarks provide quantitative rigor. Feedback drives the next iteration.

Workspace Layout

Results are organized by iteration and eval:

{name}-workspace/
├── iteration-1/
│   ├── eval-descriptive-name/
│   │   ├── eval_metadata.json
│   │   ├── with_skill/
│   │   │   ├── outputs/
│   │   │   ├── timing.json
│   │   │   └── grading.json
│   │   └── without_skill/
│   │       ├── outputs/
│   │       ├── timing.json
│   │       └── grading.json
│   ├── benchmark.json
│   └── benchmark.md
├── iteration-2/
│   └── ...
└── feedback.json
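
A small sketch of working with this layout using `pathlib` follows. The helper names are hypothetical, and it assumes that eval directories carry a literal `eval-` prefix, which the tree above suggests but does not confirm.

```python
from pathlib import Path

# The two configurations shown in the layout above.
CONFIGS = ("with_skill", "without_skill")


def create_eval_dirs(workspace: Path, iteration: int, eval_name: str) -> Path:
    """Create the outputs/ directories for one eval and return the eval directory."""
    # Assumption: eval directories are named 'eval-<descriptive-name>'.
    eval_dir = workspace / f"iteration-{iteration}" / f"eval-{eval_name}"
    for config in CONFIGS:
        (eval_dir / config / "outputs").mkdir(parents=True, exist_ok=True)
    return eval_dir


def find_gradings(workspace: Path, iteration: int) -> dict[str, dict[str, Path]]:
    """Map eval directory name -> configuration -> grading.json path, where present."""
    found: dict[str, dict[str, Path]] = {}
    iteration_dir = workspace / f"iteration-{iteration}"
    for eval_dir in sorted(iteration_dir.glob("eval-*")):
        for config in CONFIGS:
            grading = eval_dir / config / "grading.json"
            if grading.is_file():
                found.setdefault(eval_dir.name, {})[config] = grading
    return found
```

An eval that appears under only one configuration in the result of `find_gradings` has an incomplete dual run and cannot be compared against its baseline yet.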

Key Principles for Auditing

1. Test What Matters

Generated skills contain real code examples and project-specific conventions. Good test prompts should exercise whether those conventions actually guide Claude's behavior — not just whether Claude can complete the task at all.

Workflow skills define multi-phase workflows. Good test prompts should exercise whether the workflow produces better results than ad-hoc execution — better plans, more thorough reviews, more structured output.

2. Baseline Is Essential

A skill that doesn't outperform the baseline is overhead. Claude is already capable — the skill needs to demonstrably improve output quality, consistency, or adherence to project conventions to justify its context window cost.

3. Generalize, Don't Overfit

Test cases are examples, not the universe of usage. When improving based on test results, make changes that will generalize to the many prompts users will actually throw at it. Explain the why behind instructions rather than adding rigid rules that only fix the specific test case.

4. Lean Beats Comprehensive

If the benchmark shows a section isn't contributing to better outcomes, consider removing it. Every line costs context window space — unused instructions actively hurt by crowding out useful ones.

5. Workflow Skills Have Different Success Criteria

Expert skills are judged on whether code output follows conventions. Workflow skills are judged on whether the workflow adds value:

- Does a plan skill produce more thorough, actionable plans than ad-hoc planning?
- Does a review skill catch issues that a general review misses?
- Does a brainstorm skill lead to better-explored design decisions?
- Does a work skill produce more consistent, tested implementations?
- Does a compound skill extract genuinely useful patterns?

Workflow skill improvements often involve streamlining phases, improving question quality, or removing steps that don't contribute to better outcomes.
