From autobuild
Write a BENCHMARK.md with measurable evaluation criteria through iterative dialogue. Pushes for programmatic metrics over subjective checklists. Invoked after program-writer, before workflow execution.
How this skill is triggered — by the user, by Claude, or both
Slash command
/autobuild:benchmark-writerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Benchmark = **scalar evaluation function**. Takes codebase state, outputs ONE number. Tells orchestrator how far from done.
Benchmark = scalar evaluation function. Takes codebase state, outputs ONE number. Tells orchestrator how far from done.
NOT a plan. NOT exit conditions. NOT a to-do list. MEASUREMENT INSTRUMENT. Same benchmark runs every iteration, produces comparable score. Trajectory (down for MINIMIZE, up for MAXIMIZE) reveals progress.
Belongs in benchmark: score formula, programmatic checks, data science metrics (MSE, F1, correlation), binary checklist items, fuzzy scales (0-10 with rubrics), iteration log.
Does NOT belong (MOST COMMON MISTAKE): exit conditions, completion conditions, convergence criteria. ALWAYS PROGRAM.md, NEVER BENCHMARK.md. Writing "stop", "exit", "completion", "converge" in benchmark - STOP, move to program.
PROGRAM.md exists, user-approved.
Read PROGRAM.md. ASK user - all in ONE message:
What can we measure programmatically? Propose concrete metrics per work item:
wc -l), function counts (grep -c "def ")pytest --co -q | tail -1), test pass rateruff check --statistics)radon cc -s -a)test -f path)Target per metric? Current → target.
What can't be measured programmatically? Becomes fuzzy scale (0-10) with rubric. Last resort. Every fuzzy scale justifies why.
Execution recipe per check? Exact command, script, procedure. Repeatable:
make test, pytest --co -q | tail -1, ruff check --statistics | tail -1npx playwright test --reporter=json | jq '.stats.unexpected'curl -s endpoint | jq '.status' or pytest fixturepython run_simulation.py --config test.yaml | grep 'metric:'claude -p 'prompt', check output contains X")Every check needs Execution line. No ambiguous "verify X works".
Ask: "Which can we compute? What am I missing?"
Includes:
Ask: "Measuring the right things? Targets realistic?"
Common refinements:
Each round: update, show changes, ask if ready.
Explicit approval only. Same phrases as program-writer.
make test fails, wc -l, grep -c. Best. Reproducible, no LLM judgmenttest -f path, grep -q pattern file. Binary, fast## Score
**Direction**: MINIMIZE (target: 0)
score = (failed_tests * 10) + unchecked_items + lint_violations + sum(fuzzy_residuals)
**Weights reflect severity**: test failures 10x worse than missing checklist item.
Rules:
Every fuzzy scale MUST have:
### Scale: Design Consistency (0-10)
Current grade: [0] /10. Residual: [10]
Rubric:
- 10 = every module follows identical patterns, no mixed conventions
- 8 = consistent with 1-2 minor deviations
- 5 = some patterns shared, some divergent
- 2 = no consistent patterns
Why not programmatic: consistency cross-cutting, no single grep pattern captures it.
# Benchmark: <title matching PROGRAM.md>
## Score
**Direction**: MINIMIZE (target: 0)
score =
## Evaluation
**Programmatic checks** (run these commands):
1. `make test` - count failures (weight: 10x)
Execution: `.venv/bin/pytest tests/ -q --tb=no | tail -1`
2. `make lint` - must be clean
Execution: `uvx ruff check --statistics | tail -1`
3. <custom metric>
Execution: <exact command that outputs the number>
**Scenario tests** (if applicable):
- <test name>
Execution: <exact command or script>
Expected: <what passing looks like>
Metric: <what number to extract>
**Generative checks**:
4. For each [ ] item, verify against code. Mark [x] with evidence
5. Grade fuzzy scales using rubrics
6. EDIT this file, UPDATE Iteration Log
---
## Section 1: <work item group>
- [ ] <binary check - passes or fails, use [ ] for unchecked, [x] for passing>
Execution: <how to check - command, grep, or read instruction>
Evidence: <what was found when checked>
## Fuzzy Scales (if any)
### Scale: <name> (0-10)
Current grade: [0] /10. Residual: [10]
Rubric: <what each score means>
Why not programmatic: <justification>
---
## Iteration Log
| Iter | Score | Tests | Notes |
|------|-------|-------|-------|
| base | TBD | N | before any work |
<!-- NO "Completion Conditions" or "Exit Conditions" section here.
Those belong in PROGRAM.md. The benchmark ONLY scores. -->
npx claudepluginhub stellarshenson/claude-code-plugins --plugin autobuildGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.