claude-stuff

A framework for organizing reusable Claude Code skills and evaluating their effectiveness.

What it does

Skills are markdown files (slash commands, instruction blocks, prompt snippets) that shape Claude Code's behavior. This framework answers the question: does a skill actually improve Claude's output?

For each test case, the framework:

1. Runs claude -p without the skill (baseline)
2. Runs claude -p with the skill injected
3. Compares the two outputs using three methods:
- Rubric scoring — a judge Claude call rates outputs 0-2 per criterion
- Automated checks — linters and custom scripts validate the output
- Before/after diffs — unified diff with an AI-generated summary

Results are stored as TSV files for easy inspection and analysis.
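
Two of the comparison methods above are easy to picture with the standard library alone. The following is a minimal sketch, not the framework's actual code: the function names `rubric_total` and `before_after_diff` are hypothetical, and the judge call that produces the per-criterion scores is left out.

```python
import difflib


def rubric_total(scores: dict[str, int]) -> int:
    """Sum per-criterion scores, each of which must be 0, 1, or 2."""
    for criterion, score in scores.items():
        if score not in (0, 1, 2):
            raise ValueError(f"score for {criterion!r} must be 0-2, got {score}")
    return sum(scores.values())


def before_after_diff(baseline: str, with_skill: str) -> str:
    """Return a unified diff from the baseline output to the skill output."""
    lines = difflib.unified_diff(
        baseline.splitlines(keepends=True),
        with_skill.splitlines(keepends=True),
        fromfile="baseline",
        tofile="with_skill",
    )
    return "".join(lines)
```

Identical outputs yield an empty diff, which makes "the skill changed nothing" cheap to detect before spending a judge call on it.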

Setup

Requires Python 3.12+ and Pipenv. No runtime dependencies beyond stdlib.

pipenv install --dev
pipenv shell

Usage

# List available test cases
python -m src.cli list

# Run all evaluations
python -m src.cli run

# Run specific cases by glob pattern
python -m src.cli run --cases "python_style_*"

# Override model or budget
python -m src.cli run --model opus --budget 2.0

# View results
python -m src.cli report --run-id <RUN_ID>

Project structure

.claude-plugin/marketplace.json   Plugin marketplace catalog
plugins/                          Skill plugins (loaded via --plugin-dir)
  dev-workflow/
  python-style/
  skill-orchestration/
evals/
  cases/                          Test case definitions (TOML)
  checks/                         Custom check scripts (exit 0=pass, non-zero=fail)
results/                          TSV output from evaluation runs
src/                              Framework source code
tests/                            Unit tests

Defining test cases

Test cases are TOML files in evals/cases/. Example:

[case]
id = "python_style_001"
name = "CSV line parser with style"
plugins = ["python-style"]            # Plugins to load via --plugin-dir
# expected_skills = ["python-style"]  # Optional: skills the agent should invoke
                                       #   (defaults to `plugins`)

[prompt]
text = """
Write a Python function called `parse_csv_line` that takes a single line
of CSV text and returns a list of fields. Handle quoted fields.
"""

[rubric]
criteria = [
    "Function has a docstring",
    "Has type annotations",
    "Handles quoted fields with commas inside",
]

[checks]
scripts = ["evals/checks/has_docstrings.py"]
linters = ["ruff check"]

[options]
model = "sonnet"
max_budget_usd = 0.5

Custom check scripts

Check scripts in evals/checks/ receive the full Claude output on stdin. Exit 0 for pass, non-zero for fail. Write diagnostics to stderr.

import sys

def main() -> int:
    output = sys.stdin.read()
    if "def " not in output:
        print("No function definition found", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
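
From the caller's side, the contract is: pipe the output to the script's stdin, read the exit code, and keep stderr as the diagnostic. A minimal sketch of such a runner follows; `run_check` and `DEMO_CHECK` are hypothetical names, and the inline demo script stands in for a real file under evals/checks/.

```python
import subprocess
import sys

# Inline stand-in for a check script, so the sketch runs without extra files.
DEMO_CHECK = [
    sys.executable,
    "-c",
    "import sys\n"
    "out = sys.stdin.read()\n"
    "if 'def ' not in out:\n"
    "    print('No function definition found', file=sys.stderr)\n"
    "    sys.exit(1)\n",
]


def run_check(argv: list[str], claude_output: str) -> tuple[bool, str]:
    """Run one check; return (passed, diagnostics written to stderr)."""
    proc = subprocess.run(
        argv,
        input=claude_output,
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is a failed check, not an error
    )
    return proc.returncode == 0, proc.stderr.strip()
```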

Output

Each run produces TSV files in results/:

{run_id}_scores.tsv — rubric scores per criterion per variant
{run_id}_checks.tsv — pass/fail per check per variant
{run_id}_diffs.tsv — raw diffs and AI summaries
{run_id}_summary.tsv — aggregated per-case comparison
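
Because the results are plain TSV, they can be analyzed with `csv.DictReader`. The column names in this sketch (`case_id`, `variant`, `criterion`, `score`) are assumptions made for illustration, and the sample rows are invented; check the header row of a real `{run_id}_scores.tsv` before relying on them.

```python
import csv
import io

# Invented sample rows with hypothetical column names, for illustration only.
SAMPLE_SCORES_TSV = "\n".join(
    "\t".join(fields)
    for fields in [
        ("case_id", "variant", "criterion", "score"),
        ("demo_case", "baseline", "Function has a docstring", "0"),
        ("demo_case", "baseline", "Has type annotations", "1"),
        ("demo_case", "with_skill", "Function has a docstring", "2"),
        ("demo_case", "with_skill", "Has type annotations", "2"),
    ]
) + "\n"


def totals_by_variant(tsv_text: str) -> dict[str, int]:
    """Sum rubric scores per variant from TSV text."""
    totals: dict[str, int] = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        totals[row["variant"]] = totals.get(row["variant"], 0) + int(row["score"])
    return totals
```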

Findings

`python-style` improves Python output

Case: python_style_001 — ask Claude to write a CSV line parser, no style hints in the prompt.

Check / Score	Baseline	With `python-style`
Rubric score (0-10)	7	9 (+2)
`ruff check`	✅	✅
`has_docstrings`	❌	✅

The skill consistently adds Google-style docstrings and type annotations that the baseline omits.

`python-style` causes streaming when the prompt doesn't ask for it

Case: streaming_csv_to_json_001 — ask for a "function that converts a CSV file to JSON", with no mention of streaming or memory efficiency.

Check / Score	Baseline	With `python-style`
Rubric score (0-10)	6	10 (+4)
`uses_streaming`	❌	✅
`has_docstrings`	✅	✅

Without the skill, Claude loads all rows into a list and dumps once. With the skill, Claude streams rows incrementally and writes JSON on the fly — even though the prompt never mentions memory or streaming.
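
To make the difference concrete, here is a sketch of the two shapes of solution described above. This is not the output recorded in the run; both functions are hypothetical reconstructions that produce equivalent JSON, differing only in how many rows are held in memory at once.

```python
import csv
import json
from typing import TextIO


def csv_to_json_all_at_once(src: TextIO, dst: TextIO) -> None:
    """Load every row into a list, then dump once (baseline-style)."""
    rows = list(csv.DictReader(src))
    json.dump(rows, dst)


def csv_to_json_streaming(src: TextIO, dst: TextIO) -> None:
    """Write a JSON array one row at a time, holding a single row in memory."""
    dst.write("[")
    for index, row in enumerate(csv.DictReader(src)):
        if index:
            dst.write(", ")  # separator goes before every row except the first
        json.dump(row, dst)
    dst.write("]")
```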

`dev-workflow` causes real TDD execution, verified from the message stream

Case: dev_workflow_tdd_001 — "Create a Stack data structure. Write the implementation and tests as separate files, then run the tests."

The tdd_order check inspects the stream-json message history and classifies each tool call into a TDD event sequence:

Variant	Tool sequence
Baseline	`[write_impl, write_test, run_test_pass]` ❌ impl-first
With `dev-workflow`	`[write_test, run_test_fail, write_impl, run_test_pass]` ✅ full TDD cycle
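
Once tool calls have been classified into the event names shown in the table, the ordering check itself can be small. The sketch below assumes that classification has already happened and only validates the sequence; `is_full_tdd_cycle` is a hypothetical name, and the real tdd_order check may apply stricter or looser rules.

```python
def is_full_tdd_cycle(events: list[str]) -> bool:
    """True if a test is written and seen failing before the first
    implementation write, and a passing run follows that write."""
    if "write_impl" not in events:
        return False
    impl_at = events.index("write_impl")
    before, after = events[:impl_at], events[impl_at + 1:]
    if "write_test" not in before or "run_test_fail" not in before:
        return False
    # The failing run only counts if it comes after the test was written.
    test_first = before.index("write_test") < before.index("run_test_fail")
    return test_first and "run_test_pass" in after
```

Applied to the two sequences in the table, this rejects the baseline (implementation written first) and accepts the `dev-workflow` run.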
