From claude-commands
Runs the autor research and SWE-bench benchmark loop: executes run_autor_experiment.py for technique comparison, evaluates against SWE-bench, and manages bandit state for technique selection.
Slash command: /claude-commands:autor-bench-eloop
**Loop interval**: 30m | **Max duration**: 12h (24 iterations)
Drive the autor research + benchmarking pipeline: run run_autor_experiment.py for technique comparison, evaluate against SWE-bench, and build the bandit state for technique selection.
- run_autor_experiment.py exists and passes python3 -m py_compile
- ~/.swes/ (SWE-bench evaluation suite)
- ai_orch accessible at ~/worktrees/pr6270-swebench/orchestration/
- tmux list-sessions for autor-* sessions
- technique_bandit/bandit_state.json readable

# Check active autor runs
tmux list-sessions 2>/dev/null | grep -E "autor-|swebench-" || echo "no active sessions"
# Check latest score files
ls -t research-wiki/scores/SR-*.json 2>/dev/null | head -5
# Check SWE-bench eval status
ls -t ~/.swes/eval_results/ 2>/dev/null | head -3 | grep . || echo "no eval results yet"
# Check bandit state
python3 -c "import json; d=json.load(open('technique_bandit/bandit_state.json')); [print(f'{k}: n={v[\"n\"]}, mean={v[\"mean\"]:.2f}') for k,v in d['techniques'].items()]"
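
The prerequisite checks above could also be scripted. The following is a minimal sketch, assuming the file names used in this skill; the `preflight` helper and its return shape are hypothetical and not part of the repository.

```python
import json
import py_compile
from pathlib import Path


def preflight(script: Path, bandit_state: Path) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []

    # The harness must exist and compile (same check as `python3 -m py_compile`).
    if not script.is_file():
        problems.append(f"missing script: {script}")
    else:
        try:
            py_compile.compile(str(script), doraise=True)
        except py_compile.PyCompileError as exc:
            problems.append(f"script does not compile: {exc.msg.strip()}")

    # The bandit state must be readable JSON with a 'techniques' mapping
    # (the schema is inferred from the one-liners in this skill).
    try:
        state = json.loads(bandit_state.read_text())
        if not (isinstance(state, dict) and isinstance(state.get("techniques"), dict)):
            problems.append("bandit state has no 'techniques' mapping")
    except (OSError, json.JSONDecodeError) as exc:
        problems.append(f"bandit state unreadable: {exc}")

    return problems
```

Calling `preflight(Path("scripts/run_autor_experiment.py"), Path("technique_bandit/bandit_state.json"))` from the repository root would return an empty list when both checks pass.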
Primary metric: rubric mean across techniques in bandit_state.json.
Secondary metric: SWE-bench resolution rate (instances solved / total evaluated).
# Compute current technique means
python3 -c "
import json
d = json.load(open('technique_bandit/bandit_state.json'))
for k, v in sorted(d['techniques'].items(), key=lambda x: -x[1].get('mean',0)):
    print(f'{k}: n={v[\"n\"]}, mean={v[\"mean\"]:.2f}')
"
# Check SWE-bench results if available
if [ -f ~/.swes/eval_results/latest.json ]; then
python3 -c "import json; d=json.load(open('$HOME/.swes/eval_results/latest.json')); print(f'SWE-bench: {d[\"resolved\"]}/{d[\"total\"]} = {d[\"resolved\"]/max(d[\"total\"],1)*100:.1f}%')"
fi
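
The two metrics can also be computed as plain functions, which is easier to test than shell one-liners. This is a sketch that assumes the JSON shapes implied by the commands above (`techniques` entries with `n` and `mean`, and an eval file with `resolved` and `total`); the function names are hypothetical.

```python
def rank_techniques(state: dict) -> list:
    """Return (name, n, mean) tuples sorted by mean score, highest first."""
    rows = [
        (name, stats.get("n", 0), stats.get("mean", 0.0))
        for name, stats in state["techniques"].items()
    ]
    return sorted(rows, key=lambda row: -row[2])


def resolution_rate(result: dict) -> float:
    """SWE-bench resolution rate in percent; 0.0 when nothing was evaluated."""
    total = result.get("total", 0)
    return 100.0 * result.get("resolved", 0) / total if total else 0.0
```

Both functions take already-parsed dictionaries, so they can be fed from `json.load` on `technique_bandit/bandit_state.json` and `~/.swes/eval_results/latest.json` respectively.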
Decision tree:
- run_autor_experiment.py --technique <technique> --prs 6265,6261,6245,6269 --n 1
- swebench-tester subagent
- ai_orch --agent-cli claude vs run_autor_experiment.py comparison

Based on the diagnosis, pick ONE of:
- python scripts/run_autor_experiment.py --technique SR-multi-exemplar --prs 6265,6261,6245,6269 --n 1
- python scripts/run_autor_experiment.py --technique SR-prtype --prs 6265,6261,6245,6269 --n 1
- swebench-tester subagent
- aiorch-cli-tester subagent

Append to wiki/syntheses/et_logs/eloop_cycles.md:
## YYYY-MM-DD HH:MM cycle
### Quality metric: technique means from bandit_state.json
### SWE-bench: X resolved / Y evaluated
### Live vs computed gap: ±X points
### New runs dispatched: [list]
### Findings: [observations]
touch /tmp/autor_bench_eloop_last_run
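
Rendering the log entry from values keeps each cycle's record in the same shape. The sketch below follows the template above; the function names and parameters are hypothetical, and the quality-metric line is emitted verbatim from the template.

```python
from datetime import datetime


def format_cycle_entry(when: datetime, resolved: int, evaluated: int,
                       gap_points: float, runs: list, findings: str) -> str:
    """Render one cycle entry following the log template."""
    lines = [
        f"## {when:%Y-%m-%d %H:%M} cycle",
        "### Quality metric: technique means from bandit_state.json",
        f"### SWE-bench: {resolved} resolved / {evaluated} evaluated",
        f"### Live vs computed gap: ±{abs(gap_points):g} points",
        f"### New runs dispatched: {', '.join(runs) or 'none'}",
        f"### Findings: {findings}",
    ]
    return "\n".join(lines) + "\n"


def append_cycle_entry(log_path: str, entry: str) -> None:
    """Append the entry to the cycle log; the file is created if missing."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(entry)
```

Opening the file in append mode means earlier cycles are never rewritten, which matches the "append" instruction above.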
6a. Run autor experiment:
cd "$HOME/llm-wiki-autor-phase3"
python scripts/run_autor_experiment.py --technique <chosen> --prs 6265,6261,6245,6269 --n 1 --outdir research-wiki/scores
6b. Run SWE-bench comparison if dispatched:
Use swebench-tester subagent — runs predictions through SWE-bench harness.
6c. Run ai_orch CLI comparison if dispatched: Compare raw CLI output (claude/codex/gemini via ai_orch) vs autor harness on same SWE-bench instances.
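
How bandit_state.json gets updated after a run is not shown here. As an illustration only, assuming each technique entry keeps a sample count `n` and a running `mean` (the fields read by the commands in this skill), a new rubric score could be folded in like this; `record_score` is a hypothetical helper.

```python
def record_score(state: dict, technique: str, score: float) -> dict:
    """Fold one new rubric score into a technique's running mean, in place."""
    stats = state["techniques"].setdefault(technique, {"n": 0, "mean": 0.0})
    stats["n"] += 1
    # Incremental mean: new_mean = old_mean + (score - old_mean) / n
    stats["mean"] += (score - stats["mean"]) / stats["n"]
    return stats
```

The incremental form avoids storing every past score, at the cost of not being able to recompute other statistics later.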
## Autor-Bench Loop Cycle — HH:MM
- Best technique: X @ Y.mean (n=Z)
- Live vs computed: ±X pts
- SWE-bench: X% resolution (N/M)
- Next run: technique=Z
| Metric | Source | Healthy threshold |
|---|---|---|
| SR-multi-exemplar mean | bandit_state.json | >86 (Phase 7 live target) |
| Live vs computed gap | run output vs phase7_results.md | <3 pts |
| SWE-bench resolution | ~/.swes/eval_results/latest.json | >15% (beats SWE-agent) |
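
The thresholds in the table translate directly into a pass/fail check. This is a sketch with a hypothetical `health_report` function; the inputs are assumed to be already extracted from the sources listed in the table.

```python
def health_report(sr_multi_exemplar_mean: float, gap_points: float,
                  resolution_pct: float) -> dict:
    """Map each metric in the table to True when it meets its healthy threshold."""
    return {
        "SR-multi-exemplar mean": sr_multi_exemplar_mean > 86,
        "Live vs computed gap": abs(gap_points) < 3,
        "SWE-bench resolution": resolution_pct > 15,
    }
```

All three comparisons are strict, as in the table, so a value sitting exactly on a threshold is reported as unhealthy.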
# Start the loop (via /loop skill)
/loop 30m /autor-bench-eloop
# Or manually for one cycle
/autor-bench-eloop
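
The loop budget follows from the interval and the maximum duration: 12h at one cycle per 30m gives 24 iterations. The sketch below also shows one plausible use of the /tmp/autor_bench_eloop_last_run marker file, namely skipping a manual cycle that comes too soon after the last one. That use is an assumption (the skill does not say what reads the marker), and `cycle_is_due` is a hypothetical helper.

```python
import os
import time
from typing import Optional

LOOP_INTERVAL_S = 30 * 60                            # 30m
MAX_DURATION_S = 12 * 60 * 60                        # 12h
MAX_ITERATIONS = MAX_DURATION_S // LOOP_INTERVAL_S   # 24


def cycle_is_due(marker: str = "/tmp/autor_bench_eloop_last_run",
                 now: Optional[float] = None) -> bool:
    """True when no marker exists or a full interval has passed since its mtime."""
    now = time.time() if now is None else now
    try:
        last_run = os.path.getmtime(marker)
    except OSError:
        return True  # never run before, or marker was cleaned up
    return now - last_run >= LOOP_INTERVAL_S
```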
- scripts/run_autor_experiment.py: deterministic autor harness
- technique_bandit/bandit_state.json: bandit state + technique scores
- research-wiki/scores/SR-*.json: score artifacts
- wiki/syntheses/phase7_results.md: Phase 7 synthesis
- ~/.swes/: SWE-bench evaluation suite
- ~/worktrees/pr6270-swebench/orchestration/: ai_orch CLI orchestration

npx claudepluginhub jleechanorg/claude-commands --plugin claude-commands