From autoimprove
Use when testing debate agent bug-finding accuracy against curated code challenges — F1 scoring, 'test debate agents on challenges', 'benchmark agents'.
Slash command
/autoimprove:challenge [--suite puzzles|all] [--language python|typescript|go|rust|all] [--difficulty easy|medium|hard|all] [--tags <tag>] [--id <challenge-id>] [--dry-run]
<SKILL-GUARD>
Run debate agents against curated challenges and score them with precision-weighted F1.
Initialize progress tracking:
TodoWrite([
{id: "1", content: "🔍 Parse arguments", status: "pending"},
{id: "2", content: "📋 Load & filter manifest", status: "pending"},
{id: "3", content: "🔄 Run challenges", status: "pending"},
{id: "4", content: "📊 Aggregate & report", status: "pending"},
{id: "5", content: "🏷️ Log results", status: "pending"}
])
Mark step in progress: TodoWrite([{id: "1", status: "in_progress"}])
- --tags (e.g. --tags off-by-one,loop); a challenge matches if ANY of its tags match
- --id (e.g. --id python/off-by-one); overrides all other filters
Mark complete: TodoWrite([{id: "1", status: "completed"}])
Mark step in progress: TodoWrite([{id: "2", status: "in_progress"}])
Read challenges/manifest.json from the project root.
Apply filters in this order:
1. If --id is set: match exactly that one challenge. If not found, print Challenge '<id>' not found in manifest. and stop.
2. language (if not "all")
3. difficulty (if not "all")
4. tags (if provided): keep challenges whose tags array contains at least one of the requested tags
If the filtered set is empty, print:
No challenges match the given filters (language=<x>, difficulty=<y>, tags=<z>).
Run /challenge to see all available challenges.
and stop.
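The filter order above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the manifest field names (id, language, difficulty, tags) are assumptions inferred from the dry-run listing, and filter_challenges is a hypothetical helper.

```python
from typing import Iterable, Optional


def filter_challenges(
    manifest: Iterable[dict],
    challenge_id: Optional[str] = None,
    language: str = "all",
    difficulty: str = "all",
    tags: Optional[list] = None,
) -> list:
    """Apply the filters in the documented order; an id overrides all others."""
    challenges = list(manifest)
    if challenge_id is not None:
        # Exact match on one challenge; every other filter is ignored.
        return [c for c in challenges if c["id"] == challenge_id]
    if language != "all":
        challenges = [c for c in challenges if c["language"] == language]
    if difficulty != "all":
        challenges = [c for c in challenges if c["difficulty"] == difficulty]
    if tags:
        wanted = set(tags)
        # Keep a challenge if ANY of its tags was requested.
        challenges = [c for c in challenges if wanted & set(c.get("tags", []))]
    return challenges


# Hypothetical manifest entries, modelled on the dry-run listing.
manifest = [
    {"id": "python/off-by-one", "language": "python", "difficulty": "easy",
     "tags": ["boundary", "loop", "off-by-one"]},
    {"id": "go/goroutine-leak", "language": "go", "difficulty": "medium",
     "tags": ["goroutine", "channel", "resource-leak"]},
]

selected = filter_challenges(manifest, language="go", tags=["loop", "channel"])
print([c["id"] for c in selected])  # ['go/goroutine-leak']
```

An empty return value corresponds to the "No challenges match" (or "not found in manifest") message; printing it is left to the caller.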
If --dry-run: print the filtered challenge list and stop:
Dry run — would run N challenge(s):
python/off-by-one [easy, python] tags: boundary, loop, off-by-one
go/goroutine-leak [medium, go] tags: goroutine, channel, resource-leak
Otherwise, report how many challenges will be run:
Running {N} challenges ({languages})...
Mark complete: TodoWrite([{id: "2", status: "completed"}])
Mark step in progress: TodoWrite([{id: "3", status: "in_progress"}])
For each challenge in the filtered set, run sequentially (not in parallel) to avoid context flooding from concurrent debate pipelines. If the set is large (>5 challenges), note estimated wall time: roughly 2–4 min per challenge.
Read the challenge file (e.g., challenges/python/off-by-one/challenge.py).
The file extension tells you the language:
- .py → Python
- .ts → TypeScript
- .go → Go
- .rs → Rust
Spawn the adversarial-review pipeline in single-pass mode (1 round, not iterative) against the challenge file. This means:
- Enthusiast findings are a JSON array with shape [{"file": "...", "line": N, "type": "...", "description": "..."}]
- The Judge marks each finding accepted or rejected; outputs structured JSON rulings with shape [{"finding_id": N, "verdict": "accepted|rejected", "rationale": "..."}]
Pass the full challenge file content as the code under review. Do NOT pass the answer key: agents must find bugs without hints.
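Before scoring, it can help to check that both agent outputs really have the documented shapes. The sketch below is one possible validation, assuming the outputs arrive as JSON strings; parse_findings and parse_rulings are hypothetical helpers, and the sample values (including a zero-based finding_id) are invented for illustration.

```python
import json

FINDING_KEYS = {"file", "line", "type", "description"}
RULING_KEYS = {"finding_id", "verdict", "rationale"}
VERDICTS = {"accepted", "rejected"}


def _load_array(raw: str, required: set, label: str) -> list:
    """Parse a JSON array of objects and check each object has the required keys."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError(f"{label}: expected a JSON array")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"{label}[{i}]: expected a JSON object")
        missing = required - set(item)
        if missing:
            raise ValueError(f"{label}[{i}] is missing {sorted(missing)}")
    return data


def parse_findings(raw: str) -> list:
    return _load_array(raw, FINDING_KEYS, "findings")


def parse_rulings(raw: str) -> list:
    rulings = _load_array(raw, RULING_KEYS, "rulings")
    for i, ruling in enumerate(rulings):
        if ruling["verdict"] not in VERDICTS:
            raise ValueError(f"rulings[{i}]: unknown verdict {ruling['verdict']!r}")
    return rulings


# Invented sample outputs, for illustration only.
findings = parse_findings(
    '[{"file": "challenge.py", "line": 12, "type": "off-by-one",'
    ' "description": "loop skips the last item"}]'
)
rulings = parse_rulings(
    '[{"finding_id": 0, "verdict": "accepted", "rationale": "confirmed"}]'
)
print(len(findings), rulings[0]["verdict"])  # 1 accepted
```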
The key outputs needed for scoring are:
- ENTHUSIAST_FINDINGS: the raw findings array from the Enthusiast
- JUDGE_RULINGS: the verdict array from the Judge
Prepare a combined JSON file with the Judge's rulings AND the Enthusiast's findings (the scoring script needs both to match file/line/type):
# Write components via printf to avoid shell injection on embedded quotes (F5)
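# Use mktemp to prevent parallel-run collisions (F6).
# Templates end in X's: BSD/macOS mktemp only randomizes trailing X's,
# so a .json suffix after them would leave the name fixed.
debate_tmpfile=$(mktemp /tmp/debate-output-XXXXXX)
rulings_tmpfile=$(mktemp /tmp/debate-rulings-XXXXXX)
findings_tmpfile=$(mktemp /tmp/debate-findings-XXXXXX)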
printf '%s' "${JUDGE_RULINGS}" > "$rulings_tmpfile"
printf '%s' "${ENTHUSIAST_FINDINGS}" > "$findings_tmpfile"
jq -n \
--slurpfile rulings "$rulings_tmpfile" \
--slurpfile findings "$findings_tmpfile" \
'{rulings: $rulings[0], findings: $findings[0]}' > "$debate_tmpfile"
rm "$rulings_tmpfile" "$findings_tmpfile"
# Use absolute path relative to project root (F7)
SCORE_SCRIPT="$(git rev-parse --show-toplevel)/scripts/score-challenge.sh"
ANSWER_KEY="$(git rev-parse --show-toplevel)/challenges/{id}/answer-key.json"
"$SCORE_SCRIPT" "$ANSWER_KEY" "$debate_tmpfile"
rm "$debate_tmpfile"
Parse the F1 score from the output JSON.
Print per-challenge result:
{id}: F1={f1} (P={precision} R={recall}) TP={tp} FP={fp} FN={fn} {PASS|FAIL}
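For reference, here is how precision, recall, and an F-score follow from the TP/FP/FN counts in the result line. This is a sketch only: the source does not show how scripts/score-challenge.sh weights precision, so the code uses the standard F-beta formula (beta = 1 gives plain F1, beta < 1 favours precision), and PASS_THRESHOLD is a hypothetical value.

```python
PASS_THRESHOLD = 0.5  # hypothetical; the real pass criterion is not stated in the source


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> tuple:
    """Return (precision, recall, f_beta) from raw counts, guarding against 0/0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, score


def format_result(challenge_id: str, tp: int, fp: int, fn: int) -> str:
    """Render one per-challenge line in the documented layout."""
    precision, recall, f1 = f_beta(tp, fp, fn)
    verdict = "PASS" if f1 >= PASS_THRESHOLD else "FAIL"
    return (
        f"{challenge_id}: F1={f1:.2f} (P={precision:.2f} R={recall:.2f}) "
        f"TP={tp} FP={fp} FN={fn} {verdict}"
    )


# Counts chosen for illustration: two true positives and one false positive.
print(format_result("python/null-handling", tp=2, fp=1, fn=0))
# python/null-handling: F1=0.80 (P=0.67 R=1.00) TP=2 FP=1 FN=0 PASS
```

With beta = 0.5 the same counts give a lower score than F1, because the single false positive costs more when precision is weighted more heavily than recall.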
After each challenge completes, update progress:
TodoWrite([{id: "3", content: "🔄 Run challenges — {N}/{total} done"}])
After all challenges complete: TodoWrite([{id: "3", status: "completed", content: "🔄 Run challenges — {total}/{total} done"}])
Mark step in progress: TodoWrite([{id: "4", status: "in_progress"}])
After all challenges complete, print a summary table:
## Challenge Results
| Challenge | Language | Difficulty | F1 | Precision | Recall | Verdict |
|---|---|---|---|---|---|---|
| python/off-by-one | python | easy | 1.00 | 1.00 | 1.00 | PASS |
| python/null-handling | python | easy | 0.80 | 0.67 | 1.00 | PASS |
| ... | ... | ... | ... | ... | ... | ... |
**Overall: {passed}/{total} passed (avg F1: {avg_f1})**
Include a breakdown by language and by difficulty if more than one group is present:
By language: python 2/2 (avg F1: 0.90) go 1/2 (avg F1: 0.60)
By difficulty: easy 3/3 (avg F1: 0.93) medium 0/1 (avg F1: 0.40)
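The per-group breakdown can be computed with a small grouping helper. The sketch below assumes each per-challenge result is a dict with language, difficulty, f1, and passed fields; that record layout and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Optional


def summarize(results: list, key: str) -> dict:
    """Group results by `key` and compute passed/total and average F1 per group."""
    groups = defaultdict(list)
    for result in results:
        groups[result[key]].append(result)
    return {
        name: {
            "passed": sum(1 for r in rows if r["passed"]),
            "total": len(rows),
            "avg_f1": sum(r["f1"] for r in rows) / len(rows),
        }
        for name, rows in groups.items()
    }


def breakdown_line(results: list, key: str) -> Optional[str]:
    """Return a 'By <key>: ...' line, or None when only one group is present."""
    summary = summarize(results, key)
    if len(summary) < 2:
        return None
    parts = [
        f"{name} {s['passed']}/{s['total']} (avg F1: {s['avg_f1']:.2f})"
        for name, s in summary.items()
    ]
    return f"By {key}: " + " ".join(parts)


# Invented sample results, for illustration only.
results = [
    {"language": "python", "difficulty": "easy", "f1": 1.00, "passed": True},
    {"language": "python", "difficulty": "easy", "f1": 0.80, "passed": True},
    {"language": "go", "difficulty": "easy", "f1": 0.80, "passed": True},
    {"language": "go", "difficulty": "medium", "f1": 0.40, "passed": False},
]

print(breakdown_line(results, "language"))
# By language: python 2/2 (avg F1: 0.90) go 1/2 (avg F1: 0.60)
```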
Mark complete: TodoWrite([{id: "4", status: "completed", content: "📊 Aggregate & report — {passed}/{total} passed, avg F1: {avg_f1}"}])
Mark step in progress: TodoWrite([{id: "5", status: "in_progress"}])
Append a summary line to experiments.tsv (if it exists) with type: challenge:
{id} {timestamp} challenge - {total_challenges} {avg_f1} - - {pass_count}/{total} {tokens_used} {wall_time} Challenge benchmark: {passed}/{total} passed
This enables longitudinal tracking of agent accuracy over time.
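A minimal sketch of the logging step, assuming experiments.tsv is tab-separated with the column order shown in the template line above. The helper names and the sample values (run id, timestamp format, token count, wall time) are hypothetical.

```python
import os


def challenge_row(run_id, timestamp, total, avg_f1, passed, tokens_used, wall_time):
    """Build the summary fields in the order of the documented template."""
    return [
        run_id, timestamp, "challenge", "-", total, f"{avg_f1:.2f}", "-", "-",
        f"{passed}/{total}", tokens_used, wall_time,
        f"Challenge benchmark: {passed}/{total} passed",
    ]


def append_summary(path: str, row: list) -> bool:
    """Append one tab-separated line, but only if the log file already exists."""
    if not os.path.isfile(path):
        return False
    with open(path, "a", encoding="utf-8") as fh:
        fh.write("\t".join(str(value) for value in row) + "\n")
    return True


# Invented values, for illustration only.
row = challenge_row("run-001", "2025-01-01T00:00:00Z", 4, 0.75, 3, 12000, "9m30s")
print("\t".join(str(value) for value in row))
```

Calling append_summary("experiments.tsv", row) would then write the line, and it returns False without creating the file when the log is absent.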
Mark complete: TodoWrite([{id: "5", status: "completed"}])
TodoWrite([
{id: "1", status: "completed"},
{id: "2", status: "completed"},
{id: "3", status: "completed"},
{id: "4", status: "completed"},
{id: "5", status: "completed"}
])
references/runbook.md: Usage examples (7), common failure patterns, edge cases, integration points, recommended workflow for prompt iteration, when-not-to-use. Load when debugging scores or onboarding.