Skill

Autoresearch: Setup

Set up and run an autonomous experiment loop for any optimization target. Use when asked to "run autoresearch", "optimize X in a loop", "set up autoresearch", or "start experiments".

Invocation

Slash command: /autoresearch:create

Supporting Files

REFERENCE.md

SKILL.md (186 lines · ~3.3k tokens)

Stats

Language: Shell
Stars: 1
Forks: 1
Maintenance: Excellent
Last Commit: Apr 9, 2026

Autoresearch: Setup

Autonomous experiment loop: try ideas, keep what works, discard what doesn't, never stop.

Scripts

All scripts are in the plugin. Reference them as:

${CLAUDE_PLUGIN_ROOT}/scripts/prepare-experiment.sh
${CLAUDE_PLUGIN_ROOT}/scripts/run-experiment.sh
${CLAUDE_PLUGIN_ROOT}/scripts/log-experiment.sh
${CLAUDE_PLUGIN_ROOT}/scripts/status.sh
${CLAUDE_PLUGIN_ROOT}/scripts/advisor.py (UCB scoring, failure analysis, cost tracking)

Setup Steps

Use AskUserQuestion to confirm: Goal, Metric (+ direction), Files in scope, Constraints, Guard (optional safety-net command that must always pass — e.g. pnpm test, tsc --noEmit). Do NOT infer constraints silently — ask the user what's off-limits. If they say "none", write "none". The environment (language version, runtime, OS) is always part of the optimization surface unless the user explicitly restricts it.
Create branch: git checkout -b autoresearch/<goal>-<date>
Read the source files in scope. Understand the workload deeply before writing anything.
Write autoresearch.md (session doc, template below) and autoresearch.sh (benchmark script). Commit both.
If a guard was specified (or constraints require correctness checks), write autoresearch.checks.sh — the guard script. This must exit 0 when everything is OK; non-zero blocks the keep. Commit it.
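As an illustration of the branch-naming step above, here is a minimal sketch that builds a name of the form autoresearch/<goal>-<date>. The helper names `slugify` and `branch_name`, the slug rules, and the YYYYMMDD date format are assumptions; the skill only specifies the overall pattern.

```shell
#!/usr/bin/env bash
# Sketch: derive a branch name of the form autoresearch/<goal>-<date>.
# slugify/branch_name and the YYYYMMDD format are hypothetical choices.

# Lowercase the goal, collapse non-alphanumerics into "-", trim stray dashes.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

# Build the branch name; the day defaults to today but can be passed explicitly.
branch_name() {
  local goal="$1" day="${2:-$(date +%Y%m%d)}"
  printf 'autoresearch/%s-%s\n' "$(slugify "$goal")" "$day"
}
```

A possible use is `git checkout -b "$(branch_name "Reduce p99 latency")"`.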

Install the pre-commit hook:

cp "${CLAUDE_PLUGIN_ROOT}/scripts/pre-commit-hook.sh" .git/hooks/pre-commit

Write the config line to autoresearch.jsonl:

echo '{"type":"config","name":"<name>","metricName":"<metric>","metricUnit":"<unit>","bestDirection":"<lower|higher>"}' > autoresearch.jsonl
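The config line above can also be produced by a small helper that rejects an invalid direction before anything is written. This is only a sketch: `config_line` is a hypothetical name, and it does not escape JSON special characters, so it assumes plain names and units.

```shell
#!/usr/bin/env bash
# Sketch: print the autoresearch.jsonl config line after validating bestDirection.
# config_line is a hypothetical helper; values are assumed to need no JSON escaping.
config_line() {
  local name="$1" metric="$2" unit="$3" direction="$4"
  case "$direction" in
    lower|higher) ;;
    *) echo "bestDirection must be 'lower' or 'higher'" >&2; return 1 ;;
  esac
  printf '{"type":"config","name":"%s","metricName":"%s","metricUnit":"%s","bestDirection":"%s"}\n' \
    "$name" "$metric" "$unit" "$direction"
}
```

For example, `config_line "parse-speed" "parse_ms" "ms" "lower" > autoresearch.jsonl` writes the first line of the log.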

Create the auto-resume state file:

mkdir -p ~/.claude/states/autoresearch
cat > ~/.claude/states/autoresearch/$(openssl rand -hex 4).md << 'STATEOF'
---
session_id: "$SESSION_ID"
cwd: "$(pwd)"
last_resume: 0
---
Resume the autoresearch experiment loop. Read autoresearch.md and git log for context. Check autoresearch.ideas.md if it exists. Be careful not to overfit to the benchmarks and do not cheat.
STATEOF

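IMPORTANT: The session_id value must match the current Claude Code session ID. The heredoc delimiter above is quoted ('STATEOF'), so the shell writes $SESSION_ID and $(pwd) literally; substitute the real session ID and working directory before running the command.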
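Because a quoted heredoc delimiter suppresses expansion, one way to get real values into the state file is an unquoted delimiter with the session ID passed in explicitly. The sketch below assumes the caller already knows the session ID; `write_state_file` and its optional directory argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: write the auto-resume state file with values expanded.
# write_state_file is a hypothetical helper; the caller supplies the session ID.
write_state_file() {
  local session_id="$1" state_dir="${2:-$HOME/.claude/states/autoresearch}"
  local state_file
  mkdir -p "$state_dir"
  state_file="$state_dir/$(openssl rand -hex 4).md"
  # Unquoted delimiter: $session_id and $PWD expand when the file is written.
  cat > "$state_file" << STATEOF
---
session_id: "$session_id"
cwd: "$PWD"
last_resume: 0
---
Resume the autoresearch experiment loop. Read autoresearch.md and git log for context. Check autoresearch.ideas.md if it exists. Be careful not to overfit to the benchmarks and do not cheat.
STATEOF
  printf '%s\n' "$state_file"
}
```

The function prints the path it wrote, so the result can be inspected with `cat "$(write_state_file "<session-id>")"`.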

Target environment audit before any optimization.
a. Read the target spec FIRST. If the benchmark runs on a specific server/container/VM, find its exact spec (CPU model, cache sizes, OS version, kernel features like hugepages/THP, available libraries, compiler flags). This is the #1 constraint: code that wins locally but ignores the target environment will not transfer. Document specs in autoresearch.md under "Target Environment".
b. Match the test environment. If running locally or on a VM, match the target as closely as possible (same CPU family, same OS, same allocator, same kernel config). Bare-metal for PMU access if profiling. If local/target ratio is constant across experiments, the bottleneck is SYSTEMIC (TLB, allocator, NUMA, frequency), not algorithmic, so focus experiments on system-level changes.
c. Identify free-lunch zones. If the benchmark harness separates setup from measurement (constructor outside timer, warmup phase, etc.), document exactly what work is "free" and exploit it aggressively (pre-allocation, page pre-faulting, cache warming, data structure precomputation).
Profile & Classify the workload.
a. Profile (perf, profiler, flamegraph, GC logs, iostat, strace: whatever fits).
b. Classify bottleneck: CPU-compute, CPU-branch, Memory-bandwidth, Memory-allocation, Memory-TLB, I/O-read, I/O-write, Concurrency, Startup, External.
c. Write the Problem Profile in autoresearch.md. Build a Decision Tree (see ${CLAUDE_PLUGIN_ROOT}/skills/create/REFERENCE.md for the template table).
d. Cross-signal correlation: join two profiler signals to find root causes at intersections (CPU x allocations, CPU x exceptions, alloc x thread-state, I/O x endpoints).
e. Characterize the search space: document every tunable dimension in the "Search Space" section of autoresearch.md, including type (continuous, discrete, categorical), range/values, and dependencies.
f. Consult past experience: mdvault search "autoresearch technique <bottleneck-type>" --top-k 10 if available.
g. Algorithmic complexity audit: For each hot function, state current Big-O and theoretical best Big-O. If current is worse by ≥1 level (e.g., O(N²) vs O(N log N)), the FIRST experiment MUST target the worst Big-O gap.
h. Headroom table: For each bottleneck, estimate: (1) % of total runtime, (2) theoretical max speedup if fully optimized, (3) headroom = (1) × (2). Write this table in autoresearch.md. Attack the highest-headroom row first.
Run baseline: ${CLAUDE_PLUGIN_ROOT}/scripts/run-experiment.sh "./autoresearch.sh"
Log baseline: ${CLAUDE_PLUGIN_ROOT}/scripts/log-experiment.sh keep <metric_value> "baseline"
Start the main loop immediately. Follow the Loop Rules below.
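The headroom ranking from the Profile & Classify step (headroom = % of runtime × max speedup) can be sketched with awk and sort. The row format "name pct_runtime max_speedup", the helper name `rank_headroom`, and the sample numbers are all made up for illustration; the product is used only as a ranking score.

```shell
#!/usr/bin/env bash
# Sketch: rank bottlenecks by headroom = % runtime x theoretical max speedup.
# Input rows are "name pct_runtime max_speedup"; the sample values are invented.
rank_headroom() {
  LC_ALL=C awk '{ printf "%s %.1f\n", $1, $2 * $3 }' | LC_ALL=C sort -k2,2nr
}

printf '%s\n' \
  'parse 60 2' \
  'alloc 25 10' \
  'io 15 1.5' | rank_headroom
# prints:
# alloc 250.0
# parse 120.0
# io 22.5
```

In this invented example the allocator row ranks first even though parsing takes more of the runtime, which is the situation the headroom table is meant to expose.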

autoresearch.md template

# Autoresearch: <goal>

## Directive
<!-- Declarative section: WHAT to optimize, not session state -->
- **Goal**: <one sentence>
- **Success metric**: <name> (<unit>, <lower|higher> is better)
- **Hard constraints**: <ONLY what the user explicitly forbids>
- **Generality test**: Every change must pass: "If this benchmark disappeared, would this still be a worthwhile improvement?"

## Objective
<What we're optimizing and the workload.>

## Metrics
- **Primary**: <name> (<unit>, lower/higher is better)
- **Secondary**: <name>, <name>, ...

## How to Run
`./autoresearch.sh` outputs `METRIC name=number` lines.
Log results with `${CLAUDE_PLUGIN_ROOT}/scripts/log-experiment.sh`.

## Files in Scope
<Every file the agent may modify, with a brief note on what it does.>

## Off Limits
<What must NOT be touched.>

## Guard
<Guard command if any, e.g. `pnpm test`. "none" if no guard.>

## Constraints
<ONLY constraints the user explicitly stated.>

## Search Space
| Dimension | Type | Range/Values | Dependencies |
|-----------|------|--------------|--------------|

**Active dimensions**: <N> | **Explored**: <list> | **Unexplored**: <list>

## Headroom Table
| Bottleneck | % Runtime | Max Speedup | Headroom | Current Big-O | Best Big-O |
|------------|-----------|-------------|----------|---------------|------------|

## Profiling Notes
<Where time/resources are actually spent. Update periodically.>

## Problem Profile
**Bottleneck classification**: <e.g. CPU-compute (72%), Memory-bandwidth>
**Current focus**: <bottleneck> -> trying <technique family> first.
**Pivot trigger**: If 3 discards in current focus, re-profile.

## What's Been Tried
<Update as experiments accumulate. Note key wins, dead ends.>

autoresearch.sh

Bash script (set -euo pipefail): pre-checks fast (<1s), runs the benchmark, outputs METRIC name=number lines. The benchmark scripts are IN SCOPE — modify them freely. Extend them when a new optimization dimension requires it (JVM flags file, compiler flags, env vars). Do NOT skip a dimension because "the script doesn't support it" — make the script support it.
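A minimal skeleton for such a script might look like the following. The workload, the pre-check, and the metric name `elapsed_s` are hypothetical placeholders; only the `set -euo pipefail` header and the `METRIC name=number` output format come from the description above.

```shell
#!/usr/bin/env bash
# Sketch of an autoresearch.sh benchmark script. The workload, pre-check, and
# metric name are hypothetical placeholders to be replaced per project.
set -euo pipefail

# Placeholder workload (assumption): swap in the real benchmark command.
run_workload() {
  sleep 0.05
}

# Fast pre-check (<1s): fail early if a required tool is missing.
precheck() {
  command -v sleep >/dev/null
}

# Wall-clock seconds for a command, measured with the bash `time` keyword.
measure_seconds() {
  local TIMEFORMAT='%R'
  { time "$@" >/dev/null 2>&1; } 2>&1
}

# Print one metric in the `METRIC name=number` format.
emit_metric() {
  printf 'METRIC %s=%s\n' "$1" "$2"
}

precheck
emit_metric elapsed_s "$(measure_seconds run_workload)"
```

Additional `emit_metric` calls can report secondary metrics, one per line.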

autoresearch.checks.sh — the Guard (optional)

The guard is a safety net that must always pass. If the benchmark improves but the guard fails, the change is reverted (guard_failed). Common guards: pnpm test, tsc --noEmit, pnpm build. Create only when the user specifies a guard or constraints require it. Never modify the guard script during the loop.
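One possible shape for the guard script is a loop that runs each guard command and stops at the first failure. `run_guards` is a hypothetical helper, and the commands in the commented wiring are just the examples named above.

```shell
#!/usr/bin/env bash
# Sketch of autoresearch.checks.sh: exit 0 only if every guard command passes.
# run_guards is a hypothetical helper; the guard commands are project-specific.
run_guards() {
  local cmd
  for cmd in "$@"; do
    if ! bash -c "$cmd"; then
      echo "guard failed: $cmd" >&2
      return 1
    fi
  done
}

# Example wiring (commented out so the sketch has no side effects):
# run_guards "pnpm test" "tsc --noEmit" || exit 1
```

Keeping the guard list in one place makes it easy to see that the script was not changed during the loop.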

Loop Rules

LOOP FOREVER. Never ask "should I continue?" — the user expects autonomous work.

Primary metric is king. Improved -> keep. Worse/equal -> discard.
Validate small gains. Delta < 5% -> re-run with runs set to 5. If stddev > delta, discard.
Simpler is better. Removing code for equal perf = keep.
Simplicity gate. If diff adds >50 net lines for <5% gain, flag as suspicious complexity. Prefer the simpler approach.
Generality test. Before keeping: "If this benchmark disappeared, would this change still be worthwhile?" If no, discard — it's overfitting.
Diagnose failures. On discard/crash/guard_failed, pass a root-cause reason to log-experiment.sh: log-experiment.sh <status> <metric> "<description>" '{}' "root cause" (use '{}' as metrics_json placeholder when passing failure_reason as 5th arg). WHY it failed, not just that it failed. Every 5 experiments, run advisor.py failures autoresearch.jsonl to spot repeated failure patterns.
Cost awareness. Each experiment logs elapsed_s (time since previous experiment). Every 10 experiments, run advisor.py cost autoresearch.jsonl to check: are discards burning more time than keeps? If avg discard time > 2x avg keep time, simplify your approach.
Don't thrash. Repeatedly reverting? Try something structurally different.
Guard failed? Metric improved but guard failed → guard_failed, revert. Don't touch the guard script — adapt the implementation.
Crashes: fix if trivial, otherwise log and move on.
Don't self-impose constraints. The environment is part of the optimization surface unless the user said otherwise.
Resuming: if autoresearch.md exists, read it + git log, continue looping.
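The "validate small gains" rule above can be sketched as a keep/discard decision over repeated runs. This is an illustration under assumptions: `judge_small_gain` is a hypothetical helper, it compares the mean of the runs against the best value so far (run-experiment.sh itself reports a median), and it uses the sample standard deviation.

```shell
#!/usr/bin/env bash
# Sketch: keep a small gain only if it exceeds run-to-run noise.
# usage: judge_small_gain <lower|higher> <best_so_far> <run1> <run2> ...
# Hypothetical helper; uses the mean and the sample standard deviation.
judge_small_gain() {
  local direction="$1" best="$2"
  shift 2
  printf '%s\n' "$@" | LC_ALL=C awk -v dir="$direction" -v best="$best" '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      mean = sum / n
      var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
      if (var < 0) var = 0   # guard against rounding error
      sd = sqrt(var)
      delta = (dir == "lower") ? best - mean : mean - best
      # Worse or equal is a discard; so is a gain smaller than the noise.
      verdict = (delta > 0 && sd <= delta) ? "keep" : "discard"
      print verdict
    }'
}
```

For instance, `judge_small_gain lower 100 97 97.5 96.5 97 97` yields keep, while `judge_small_gain lower 100 99 105 93 101 97` yields discard because the spread of the runs is larger than the 1-unit gain.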

Each iteration:

Choose what to try next. Consult: Headroom table -> Decision Tree -> "What's Been Tried" -> autoresearch.ideas.md -> top kept experiments (the context hook injects top 5 keeps as "population" — mutate from the best, not just the latest). Always attack the highest-headroom bottleneck first.
Edit code
Prepare the experiment (stages, validates atomicity, commits). This enables git revert on failure (preserving the attempt in history as memory):
```
${CLAUDE_PLUGIN_ROOT}/scripts/prepare-experiment.sh "<description>" [files...]
```
If AUTORESEARCH_PREPARED=false, skip to next iteration.
Run the experiment: ${CLAUDE_PLUGIN_ROOT}/scripts/run-experiment.sh "./autoresearch.sh" [timeout] [checks_timeout] [runs] [warmup] [early_stop_pct]
- runs (default 1): run N times and report the median.
- warmup (default 0): number of untimed warmup runs.
- early_stop_pct (default 0): abort the remaining runs when the first run is more than N% worse than the best.
In the Refinement phase, use runs 5 warmup 1 early_stop_pct 20.
Parse AUTORESEARCH_* output lines
MANDATORY: ${CLAUDE_PLUGIN_ROOT}/scripts/log-experiment.sh <status> <metric> "<description>" [metrics_json] [failure_reason]
NEVER manually append to JSONL or revert code. The script handles rollback.
Description format: [dimension] what you tried | techniques: tech1, tech2
- dimension: the optimization layer (algorithm, data-layout, io, runtime-flags, parallelism, build-pipeline, os-kernel, hardware, elimination, problem-reformulation, pipeline-order, cross-language, library-swap, compression, concurrency, measurement)
- techniques: cumulative list of named techniques in current working code
- Example: [data-layout] convert HashMap to flat arrays | techniques: flat-array, arena-allocator, SoA

On keep with >5% improvement, store in mdvault if available:

mdvault remember "TECHNIQUE: <name> | BOTTLENECK: <type> | GAIN: <X>% | CONTEXT: <workload> | WORKS-WHEN: <conditions>" --namespace autoresearch/techniques

Update "What's Been Tried" in autoresearch.md periodically
Write deferred ideas to autoresearch.ideas.md using checkbox format (- [ ] / - [x])
Every 3 discards in a row: re-profile, update Problem Profile. At 5 consecutive discards: trigger the escalation protocol (see below).
Cross-metric correlation every ~10 experiments if secondary metrics exist
Repeat
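The description format used when logging in the steps above ([dimension] what you tried | techniques: tech1, tech2) is easy to get subtly wrong by hand. The following sketch assembles it from arguments; `format_description` is a hypothetical helper, not part of the plugin.

```shell
#!/usr/bin/env bash
# Sketch: build a log description in the "[dimension] what | techniques: a, b" format.
# usage: format_description <dimension> <what> <technique>...
# Hypothetical helper; it does not validate the dimension name.
format_description() {
  local dimension="$1" what="$2"
  shift 2
  local techniques
  # Join the remaining arguments with commas, then add a space after each comma.
  techniques=$(IFS=,; printf '%s' "$*")
  printf '[%s] %s | techniques: %s\n' "$dimension" "$what" "${techniques//,/, }"
}
```

Calling `format_description data-layout "convert HashMap to flat arrays" flat-array arena-allocator SoA` reproduces the example line given in the text.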

When stuck (5+ consecutive discards), escalate in order:

Re-read the TARGET ENVIRONMENT spec — you may have missed a system feature (hugepages, allocator, CPU feature, kernel config)
Check the local/target ratio — if it's CONSTANT across experiments, stop optimizing code and start optimizing the system layer (allocation strategy, page size, TLB, NUMA)
Re-read ALL in-scope files from scratch — you may have missed something
Re-read the original goal/direction — recalibrate
Review full experiment history (git log --oneline -50) — look for patterns in what worked vs failed
Combine near-misses — two changes that individually didn't help might work together
Try the OPPOSITE of what hasn't been working — if optimizing in one direction, reverse
Radical architecture change — rethink the approach entirely, different algorithm/structure
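The stuck-detection thresholds (re-profile at 3 consecutive discards, escalate at 5) can be sketched as a function over recent statuses. This is one interpretation: `stuck_action` is a hypothetical helper, only the literal status discard extends the streak, and any other status resets it.

```shell
#!/usr/bin/env bash
# Sketch: decide the next step from recent experiment statuses (oldest first).
# usage: stuck_action <status>...
# Hypothetical helper; assumes only "discard" counts toward the streak.
stuck_action() {
  local streak=0 status
  for status in "$@"; do
    if [ "$status" = "discard" ]; then
      streak=$((streak + 1))
    else
      streak=0
    fi
  done
  if [ "$streak" -ge 5 ]; then
    echo "escalate"
  elif [ "$streak" -ge 3 ]; then
    echo "re-profile"
  else
    echo "continue"
  fi
}
```

Whether crash and guard_failed entries should also extend the streak is left open here; the text only mentions discards.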

Also read ${CLAUDE_PLUGIN_ROOT}/skills/create/REFERENCE.md for exploration strategies, dimension checklists, decision tree templates, and optimization patterns.

NEVER STOP. Keep going until interrupted.
