claude-stuff
A framework for organizing reusable Claude Code skills and evaluating their effectiveness.
What it does
Skills are markdown files (slash commands, instruction blocks, prompt snippets) that shape Claude Code's behavior. This framework answers the question: does a skill actually improve Claude's output?
For each test case, the framework:
- Runs
claude -p without the skill (baseline)
- Runs
claude -p with the skill injected
- Compares the two outputs using three methods:
- Rubric scoring — a judge Claude call rates outputs 0-2 per criterion
- Automated checks — linters and custom scripts validate the output
- Before/after diffs — unified diff with an AI-generated summary
Results are stored as TSV files for easy inspection and analysis.
Setup
Requires Python 3.12+ and Pipenv. No runtime dependencies beyond stdlib.
pipenv install --dev
pipenv shell
Usage
# List available test cases
python -m src.cli list
# Run all evaluations
python -m src.cli run
# Run specific cases by glob pattern
python -m src.cli run --cases "python_style_*"
# Override model or budget
python -m src.cli run --model opus --budget 2.0
# View results
python -m src.cli report --run-id <RUN_ID>
Project structure
.claude-plugin/marketplace.json Plugin marketplace catalog
plugins/ Skill plugins (loaded via --plugin-dir)
dev-workflow/
python-style/
skill-orchestration/
evals/
cases/ Test case definitions (TOML)
checks/ Custom check scripts (exit 0=pass, 1=fail)
results/ TSV output from evaluation runs
src/ Framework source code
tests/ Unit tests
Defining test cases
Test cases are TOML files in evals/cases/. Example:
[case]
id = "python_style_001"
name = "CSV line parser with style"
plugins = ["python-style"] # Plugins to load via --plugin-dir
# expected_skills = ["python-style"] # Optional: skills the agent should invoke
# (defaults to `plugins`)
[prompt]
text = """
Write a Python function called `parse_csv_line` that takes a single line
of CSV text and returns a list of fields. Handle quoted fields.
"""
[rubric]
criteria = [
"Function has a docstring",
"Has type annotations",
"Handles quoted fields with commas inside",
]
[checks]
scripts = ["evals/checks/has_docstrings.py"]
linters = ["ruff check"]
[options]
model = "sonnet"
max_budget_usd = 0.5
Custom check scripts
Check scripts in evals/checks/ receive the full Claude output on stdin. Exit 0 for pass, non-zero for fail. Write diagnostics to stderr.
import sys
def main() -> int:
output = sys.stdin.read()
if "def " not in output:
print("No function definition found", file=sys.stderr)
return 1
return 0
if __name__ == "__main__":
sys.exit(main())
Output
Each run produces TSV files in results/:
{run_id}_scores.tsv — rubric scores per criterion per variant
{run_id}_checks.tsv — pass/fail per check per variant
{run_id}_diffs.tsv — raw diffs and AI summaries
{run_id}_summary.tsv — aggregated per-case comparison
Findings
python-style improves Python output
Case: python_style_001 — ask Claude to write a CSV line parser, no style hints in the prompt.
| Check / Score | Baseline | With python-style |
|---|
| Rubric score (0-10) | 7 | 9 (+2) |
ruff check | ✅ | ✅ |
has_docstrings | ❌ | ✅ |
The skill consistently adds Google-style docstrings and type annotations that the baseline omits.
python-style causes streaming when the prompt doesn't ask for it
Case: streaming_csv_to_json_001 — ask for a "function that converts a CSV file to JSON", with no mention of streaming or memory efficiency.
| Check / Score | Baseline | With python-style |
|---|
| Rubric score (0-10) | 6 | 10 (+4) |
uses_streaming | ❌ | ✅ |
has_docstrings | ✅ | ✅ |
Without the skill, Claude loads all rows into a list and dumps once. With the skill, Claude streams rows incrementally and writes JSON on the fly — even though the prompt never mentions memory or streaming.
dev-workflow causes real TDD execution, verified from the message stream
Case: dev_workflow_tdd_001 — "Create a Stack data structure. Write the implementation and tests as separate files, then run the tests."
The tdd_order check inspects the stream-json message history and classifies each tool call into a TDD event sequence:
| Variant | Tool sequence |
|---|
| Baseline | [write_impl, write_test, run_test_pass] ❌ impl-first |
With dev-workflow | [write_test, run_test_fail, write_impl, run_test_pass] ✅ full TDD cycle |