From autoresearch
Use when running autonomous iterative experiments to optimize a measurable metric on an existing codebase — not for writing tests or diagnosing a specific failure.
Slash command
/autoresearch:autoresearch
Use this skill when the goal is to improve a specific, measurable outcome through repeated automated experiments — write a change, measure it, keep it or revert it, repeat. You define the metric; the agent runs the loop autonomously until interrupted or the budget is exhausted.
This is distinct from test-driven development (writing tests for behavior) and systematic debugging (diagnosing a known failure). Use this skill when optimization through iteration is the work.
Route test writing to test-driven-development and failure diagnosis to systematic-debugging.

| Situation | Use this skill? | Route instead |
|---|---|---|
| Writing failing tests before implementing a feature | No | test-driven-development |
| Diagnosing why a test or build is failing | No | systematic-debugging |
| Iteratively optimizing a benchmark or metric | Yes | — |
| Improving code quality with no numeric signal | No | systematic-debugging or verification-before-completion |
| Running a single targeted experiment to validate a hypothesis | No | Implement directly; this skill is for multi-experiment autonomous loops |
Required before editing
- Metric command (e.g., `npm run benchmark`, `go test -bench=. ./...`, `hyperfine './build.sh'`).
- Direction: `lower_is_better` or `higher_is_better`.

Helpful if present
- Max experiments: a number or `unlimited` (default: `unlimited`, stop only on interrupt).

Only investigate if encountered
- Branch: `git checkout -b autoresearch/<tag>` (use today's date as the tag, e.g., `autoresearch/2025-07-09`).

Summarize all parameters in a table before proceeding:
| Parameter | Value |
|---|---|
| Goal | … |
| Metric command | … |
| Metric extraction | … |
| Direction | lower is better / higher is better |
| In-scope files | … |
| Out-of-scope files | … |
| Constraints | … |
| Max experiments | … |
| Simplicity policy | … |
Do not start the loop until confirmed.
1. Create the experiment branch: `git checkout -b autoresearch/<tag>`.
2. Create `results.tsv` in the repo root with the header row (see `references/experiment-guide.md` for the TSV format). Add `results.tsv` and `run.log` to `.git/info/exclude` so they stay untracked without modifying any tracked file.
3. Record the baseline measurement as experiment `0` with status `baseline`.

Run until MAX_EXPERIMENTS is reached or the user interrupts. Do not stop to ask permission between iterations.
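The journal setup described above could be scripted along these lines. This is a minimal sketch, not the skill's own tooling: `init_journal` is a hypothetical helper name, and the header columns are assumed to mirror the row fields logged by the loop (`references/experiment-guide.md` remains the authority on the TSV format).

```shell
#!/usr/bin/env bash
# Hypothetical helper: create results.tsv and keep it and run.log untracked.
# Assumption: header columns mirror the fields logged per experiment.
init_journal() {
  local repo_root="$1"
  local journal="$repo_root/results.tsv"
  local exclude="$repo_root/.git/info/exclude"
  local name

  # Write the header row only when the journal does not exist yet.
  if [ ! -f "$journal" ]; then
    printf 'experiment_number\tcommit_hash\tmetric_value\tstatus\tdescription\n' > "$journal"
  fi

  # .git/info/exclude is local to this clone, so no tracked file changes.
  mkdir -p "$(dirname "$exclude")"
  touch "$exclude"
  for name in results.tsv run.log; do
    grep -qxF "$name" "$exclude" || printf '%s\n' "$name" >> "$exclude"
  done
}
```

Calling `init_journal .` from the repo root is idempotent: a second call leaves the header row and the exclude entries unchanged.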
For each experiment:
1. Choose the next change. See `references/experiment-guide.md` for strategy ordering (low-hanging fruit first, diversify after plateaus, combine winners, etc.).
2. Commit it: `git add <changed files> && git commit -m "experiment: <short description>"`.
3. Run the metric command, redirecting all output to `run.log` (`<command> > run.log 2>&1`).
4. Extract the metric from `run.log`. If extraction fails, read the last 50 lines for the error.
5. If the metric improved, keep the commit.
6. If the metric did not improve: `git revert HEAD --no-edit`. Log status `discard`.
7. If the run crashed: `git revert HEAD --no-edit` (or revert both commits if two were made) and log status `crash`. Do not use `--amend` or `reset --hard`.
8. Append a row to `results.tsv`: `experiment_number commit_hash metric_value status description`.

When the loop ends:
- Print `results.tsv` as a formatted table.
- List the experiment commits: `git log --oneline <start_commit>..HEAD`.

Outputs:
- `autoresearch/<tag>` branch containing only kept (improving) experiment commits.
- `results.tsv` (untracked): full experiment journal.
- `run.log` (untracked): output from the most recent metric run.

Do not use `git reset --hard` as a revert strategy. Always revert with a new commit (`git revert HEAD --no-edit`) to preserve history and avoid data loss.
Do not use `git commit --amend` in the loop. Amended history makes the results log unreliable.

Mechanical:
node skills/skill-authoring/scripts/validate-skill-library.mjs skills/autoresearch/SKILL.md
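For the experiment loop itself, the keep-or-discard decision and the journal row can be sketched as two small shell functions. The names `is_improvement` and `append_result` are hypothetical, ties are assumed to count as no improvement, and the direction strings follow the `lower_is_better` / `higher_is_better` values gathered during setup.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for one loop iteration; not part of the skill itself.

# Succeeds (exit 0) only when $1 is strictly better than $2 for direction $3.
# Assumption: a tie is treated as "not improved".
is_improvement() {
  local new="$1" best="$2" direction="$3"
  case "$direction" in
    lower_is_better)
      awk -v n="$new" -v b="$best" 'BEGIN { exit !((n + 0) < (b + 0)) }' ;;
    higher_is_better)
      awk -v n="$new" -v b="$best" 'BEGIN { exit !((n + 0) > (b + 0)) }' ;;
    *)
      echo "unknown direction: $direction" >&2
      return 2 ;;
  esac
}

# Appends one tab-separated row to the journal given as the first argument:
# experiment_number commit_hash metric_value status description.
append_result() {
  local journal="$1"
  shift
  printf '%s\t%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5" >> "$journal"
}
```

A discard step might then call `is_improvement "$value" "$best" lower_is_better`, run `git revert HEAD --no-edit` when it fails, and log the row with `append_result results.tsv "$n" "$hash" "$value" discard "$desc"`. The variables `$value`, `$best`, `$n`, `$hash`, and `$desc` are placeholders the caller would supply.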
Smoke tests:
- "… parseUser function before I implement it" (→ test-driven-development)
- "…" (→ systematic-debugging)

Before each reporting step, apply verification-before-completion: confirm the results.tsv row count matches the experiment count and the final metric was measured from the last run.
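The row-count half of that check can be automated. A minimal sketch, assuming the first line of `results.tsv` is the header and that the caller passes the number of rows it expects (whether the baseline row `0` counts is the caller's choice); `check_journal` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the number of data rows in the journal
# with the number of experiments the caller believes it has logged.
check_journal() {
  local journal="$1" expected="$2" rows

  # Count every line after the header; "n + 0" prints 0 for a header-only file.
  rows="$(awk 'NR > 1 { n++ } END { print n + 0 }' "$journal")"

  if [ "$rows" -ne "$expected" ]; then
    echo "results.tsv has $rows rows, expected $expected" >&2
    return 1
  fi
}
```

Running `check_journal results.tsv "$count"` before the final report fails with a non-zero status when a row was skipped or written twice, where `$count` is a placeholder for the caller's own experiment counter.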
- "… src/compute.ts to reduce the npm run benchmark p95 latency. Keep going until I stop you."
- "… go test -bench=. ./... ns/op for the BenchmarkSort function. File a result log and tell me what you tried."
- "… time cargo build --release, lower is better. Only touch Cargo.toml and src/."

- references/experiment-guide.md: experiment strategy order, TSV format spec, git safety patterns, and constraint handling
- assets/setup-template.md: fillable setup table to confirm inputs before the loop starts
npx claudepluginhub matt-riley/lucky-hat --plugin autoresearch