From autoresearch
Set up and run an autonomous experiment loop for any optimization target. Use when asked to "run autoresearch", "optimize X in a loop", "set up autoresearch", or "start experiments".
How this skill is triggered — by the user, by Claude, or both
Slash command
/autoresearch:createThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Autonomous experiment loop: try ideas, keep what works, discard what doesn't, never stop.
Autonomous experiment loop: try ideas, keep what works, discard what doesn't, never stop.
All scripts are in the plugin. Reference them as:
${CLAUDE_PLUGIN_ROOT}/scripts/prepare-experiment.sh${CLAUDE_PLUGIN_ROOT}/scripts/run-experiment.sh${CLAUDE_PLUGIN_ROOT}/scripts/log-experiment.sh${CLAUDE_PLUGIN_ROOT}/scripts/status.sh${CLAUDE_PLUGIN_ROOT}/scripts/advisor.py (UCB scoring, failure analysis, cost tracking)AskUserQuestion to confirm: Goal, Metric (+ direction), Files in scope, Constraints, Guard (optional safety-net command that must always pass — e.g. pnpm test, tsc --noEmit). Do NOT infer constraints silently — ask the user what's off-limits. If they say "none", write "none". The environment (language version, runtime, OS) is always part of the optimization surface unless the user explicitly restricts it.git checkout -b autoresearch/<goal>-<date>autoresearch.md (session doc, template below) and autoresearch.sh (benchmark script). Commit both.autoresearch.checks.sh — the guard script. This must exit 0 when everything is OK; non-zero blocks the keep. Commit it.cp "${CLAUDE_PLUGIN_ROOT}/scripts/pre-commit-hook.sh" .git/hooks/pre-commit
autoresearch.jsonl:
echo '{"type":"config","name":"<name>","metricName":"<metric>","metricUnit":"<unit>","bestDirection":"<lower|higher>"}' > autoresearch.jsonl
mkdir -p ~/.claude/states/autoresearch
cat > ~/.claude/states/autoresearch/$(openssl rand -hex 4).md << 'STATEOF'
---
session_id: "$SESSION_ID"
cwd: "$(pwd)"
last_resume: 0
---
Resume the autoresearch experiment loop. Read autoresearch.md and git log for context. Check autoresearch.ideas.md if it exists. Be careful not to overfit to the benchmarks and do not cheat.
STATEOF
IMPORTANT: The session_id value must match the current Claude Code session ID.${CLAUDE_PLUGIN_ROOT}/skills/create/REFERENCE.md for the template table).
d. Cross-signal correlation: join two profiler signals to find root causes at intersections (CPU x allocations, CPU x exceptions, alloc x thread-state, I/O x endpoints).
e. Characterize the search space: document every tunable dimension in the "Search Space" section of autoresearch.md — type (continuous, discrete, categorical), range/values, dependencies.
f. Consult past experience: mdvault search "autoresearch technique <bottleneck-type>" --top-k 10 if available.
g. Algorithmic complexity audit: For each hot function, state current Big-O and theoretical best Big-O. If current is worse by ≥1 level (e.g., O(N²) vs O(N log N)), the FIRST experiment MUST target the worst Big-O gap.
h. Headroom table: For each bottleneck, estimate: (1) % of total runtime, (2) theoretical max speedup if fully optimized, (3) headroom = (1) × (2). Write this table in autoresearch.md. Attack the highest-headroom row first.${CLAUDE_PLUGIN_ROOT}/scripts/run-experiment.sh "./autoresearch.sh"${CLAUDE_PLUGIN_ROOT}/scripts/log-experiment.sh keep <metric_value> "baseline"# Autoresearch: <goal>
## Directive
<!-- Declarative section: WHAT to optimize, not session state -->
- **Goal**: <one sentence>
- **Success metric**: <name> (<unit>, <lower|higher> is better)
- **Hard constraints**: <ONLY what the user explicitly forbids>
- **Generality test**: Every change must pass: "If this benchmark disappeared, would this still be a worthwhile improvement?"
## Objective
<What we're optimizing and the workload.>
## Metrics
- **Primary**: <name> (<unit>, lower/higher is better)
- **Secondary**: <name>, <name>, ...
## How to Run
`./autoresearch.sh` outputs `METRIC name=number` lines.
Log results with `${CLAUDE_PLUGIN_ROOT}/scripts/log-experiment.sh`.
## Files in Scope
<Every file the agent may modify, with a brief note on what it does.>
## Off Limits
<What must NOT be touched.>
## Guard
<Guard command if any, e.g. `pnpm test`. "none" if no guard.>
## Constraints
<ONLY constraints the user explicitly stated.>
## Search Space
| Dimension | Type | Range/Values | Dependencies |
|-----------|------|--------------|--------------|
**Active dimensions**: <N> | **Explored**: <list> | **Unexplored**: <list>
## Headroom Table
| Bottleneck | % Runtime | Max Speedup | Headroom | Current Big-O | Best Big-O |
|------------|-----------|-------------|----------|---------------|------------|
## Profiling Notes
<Where time/resources are actually spent. Update periodically.>
## Problem Profile
**Bottleneck classification**: <e.g. CPU-compute (72%), Memory-bandwidth>
**Current focus**: <bottleneck> -> trying <technique family> first.
**Pivot trigger**: If 3 discards in current focus, re-profile.
## What's Been Tried
<Update as experiments accumulate. Note key wins, dead ends.>
Bash script (set -euo pipefail): pre-checks fast (<1s), runs the benchmark, outputs METRIC name=number lines. The benchmark scripts are IN SCOPE — modify them freely. Extend them when a new optimization dimension requires it (JVM flags file, compiler flags, env vars). Do NOT skip a dimension because "the script doesn't support it" — make the script support it.
The guard is a safety net that must always pass. If the benchmark improves but the guard fails, the change is reverted (guard_failed). Common guards: pnpm test, tsc --noEmit, pnpm build. Create only when the user specifies a guard or constraints require it. Never modify the guard script during the loop.
LOOP FOREVER. Never ask "should I continue?" — the user expects autonomous work.
keep. Worse/equal -> discard.runs 5. If stddev > delta, discard.log-experiment.sh <status> <metric> "<description>" '{}' "root cause" (use '{}' as metrics_json placeholder when passing failure_reason as 5th arg). WHY it failed, not just that it failed. Every 5 experiments, run advisor.py failures autoresearch.jsonl to spot repeated failure patterns.elapsed_s (time since previous experiment). Every 10 experiments, run advisor.py cost autoresearch.jsonl to check: are discards burning more time than keeps? If avg discard time > 2x avg keep time, simplify your approach.guard_failed, revert. Don't touch the guard script — adapt the implementation.autoresearch.md exists, read it + git log, continue looping.Each iteration:
autoresearch.ideas.md -> top kept experiments (the context hook injects top 5 keeps as "population" — mutate from the best, not just the latest). Always attack the highest-headroom bottleneck first.git revert on failure (preserving the attempt in history as memory):
${CLAUDE_PLUGIN_ROOT}/scripts/prepare-experiment.sh "<description>" [files...]
If AUTORESEARCH_PREPARED=false, skip to next iteration.${CLAUDE_PLUGIN_ROOT}/scripts/run-experiment.sh "./autoresearch.sh" [timeout] [checks_timeout] [runs] [warmup] [early_stop_pct]
runs (default 1): N times, report median. warmup (default 0): untimed. early_stop_pct (default 0): abort remaining runs when first run is >N% worse than best. In Refinement phase, use runs 5 warmup 1 early_stop_pct 20.${CLAUDE_PLUGIN_ROOT}/scripts/log-experiment.sh <status> <metric> "<description>" [metrics_json] [failure_reason]
NEVER manually append to JSONL or revert code. The script handles rollback.
Description format: [dimension] what you tried | techniques: tech1, tech2
dimension: the optimization layer (algorithm, data-layout, io, runtime-flags, parallelism, build-pipeline, os-kernel, hardware, elimination, problem-reformulation, pipeline-order, cross-language, library-swap, compression, concurrency, measurement)techniques: cumulative list of named techniques in current working code[data-layout] convert HashMap to flat arrays | techniques: flat-array, arena-allocator, SoAmdvault remember "TECHNIQUE: <name> | BOTTLENECK: <type> | GAIN: <X>% | CONTEXT: <workload> | WORKS-WHEN: <conditions>" --namespace autoresearch/techniques
autoresearch.ideas.md using checkbox format (- [ ] / - [x])When stuck (5+ consecutive discards), escalate in order:
git log --oneline -50) — look for patterns in what worked vs failedAlso read ${CLAUDE_PLUGIN_ROOT}/skills/create/REFERENCE.md for exploration strategies, dimension checklists, decision tree templates, and optimization patterns.
NEVER STOP. Keep going until interrupted.
npx claudepluginhub sderosiaux/claude-plugins --plugin autoresearchSets up autonomous experiment loops for code optimization targets. Gathers goal/metric/files, creates git branch/benchmark script/logging, runs baseline via subagent. For 'run autoresearch' or iterative experiments.
Runs autonomous experiment loops to iteratively optimize measurable metrics like code performance, ML loss, build size via git branches, code changes, verify commands, and guards.
Guides interactive setup of optimization goals, metrics, and scope; runs autonomous git-committed experiment loops: code changes, testing, measurement, keep improvements or revert. For performance tuning in git repos.