Skill

autoresearch

Set up and run an autonomous experiment loop for any optimization target. Use when asked to "run autoresearch", "optimize X in a loop", "set up autoresearch", "start experiments", or "benchmark and optimize".

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/autoresearch:autoresearch <goal or "resume" or "off" or "clear" or "status">

User invocable

Model invocable

Inline context

Default effort

Argument hint: <goal or "resume" or "off" or "clear" or "status">

Tool Access

This skill is limited to the following tools:

Bash, Read, Write, Edit, Glob, Grep

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Autonomous experiment loop: try ideas, keep what works, discard what doesn't, never stop.

SKILL.md

308 lines · ~3k tokens

Stats

Language: Shell

Stars: 0

Maintenance: Good

Last Commit: Mar 18, 2026

Autoresearch

Autonomous experiment loop: try ideas, keep what works, discard what doesn't, never stop.

Commands

/autoresearch <goal> — set up a new session and start looping
/autoresearch resume — resume from existing autoresearch.md
/autoresearch status — show experiment results summary
/autoresearch off — deactivate the loop (stop hook stops blocking)
/autoresearch clear — delete autoresearch.jsonl and reset all state

Safety Principle

Autoresearch MUST run in a git worktree. This is a strict requirement, not a suggestion. The worktree isolates every experiment's edits, commits, and reverts from the user's main checkout, so the loop never modifies the main working tree.

Before doing anything else, verify you are in a worktree:

# .git is a FILE in worktrees, a DIRECTORY in main checkouts
# (run this from the top level of the checkout)
if [ -d .git ]; then
  echo "ERROR: autoresearch must run in a git worktree."
  echo "Start with: claude -w autoresearch-<name>"
  exit 1
fi

If not in a worktree, tell the user to restart with claude -w <name> and stop.
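The check above only works from the top level of the checkout. It also cannot tell a worktree from a submodule, whose `.git` is a file as well. A minimal sketch of a stricter check follows. It assumes the usual layout in which a linked worktree's `.git` file contains `gitdir: <main>/.git/worktrees/<name>`. The helper name `in_linked_worktree` is hypothetical and not part of the plugin.

```shell
# Hypothetical helper: succeed only inside a linked git worktree.
# Usage: in_linked_worktree [start_dir]
# Returns 0 = linked worktree, 1 = main checkout or submodule, 2 = not a repo.
in_linked_worktree() {
  local dir
  dir=$(cd "${1:-.}" && pwd -P) || return 2
  while :; do
    if [ -d "$dir/.git" ]; then
      return 1                      # main checkout: .git is a directory
    elif [ -f "$dir/.git" ]; then
      # Linked worktrees point into <main>/.git/worktrees/<name>;
      # submodules also use a .git file but point into .git/modules/.
      case "$(head -n 1 "$dir/.git")" in
        "gitdir: "*/worktrees/*) return 0 ;;
        *) return 1 ;;
      esac
    fi
    [ "$dir" = "/" ] && return 2    # reached the filesystem root
    dir=$(dirname "$dir")           # otherwise keep walking upward
  done
}
```

For example, `in_linked_worktree || echo "restart with: claude -w <name>"` would work from any subdirectory of the checkout.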

Setup

When starting a new session:

Verify worktree — run the check above. Do not proceed if not in a worktree.
Gather info — ask or infer from context:
- Goal: what are we optimizing? (e.g., "FIFO lot matching speed")
- Command: the benchmark command (e.g., bun run bench:fifo)
- Metric: name, unit, and direction (e.g., recalc_us, microseconds, lower is better)
- Files in scope: which files may be modified
- Quality gate (optional): correctness checks command (e.g., bun run test)
- Guards (optional): metrics that must not regress (e.g., memory_mb < 512)
- Cost ceiling (optional): max USD to spend before auto-stopping (e.g., 5.00)
- Constraints: hard rules (no new deps, specific files off-limits, etc.)
Read source files deeply before writing anything. Understand the workload.
Write session files and commit them:

`autoresearch.md`

The heart of the session. A fresh agent with zero context should be able to read this file and run the loop effectively. Invest time making it excellent.

# Autoresearch: <goal>

## Objective
Specific description of what we're optimizing and why.

## Metrics
- **Primary**: <name> (<unit>, <lower|higher> is better)
- **Secondary** (optional): additional metrics to track

## How to Run
./autoresearch.sh

## Files in Scope
List every file the agent may modify, with brief notes on what each does.

## Off Limits
What must NOT be touched and why.

## Constraints
Hard rules: tests must pass, no new dependencies, etc.

## Guards (must not regress)
Metrics that must stay within bounds regardless of the primary optimization.
Format: `metric_name operator threshold` (e.g., `memory_mb < 512`, `test_count >= 47`)
The benchmark script outputs these as additional METRIC lines.
If any guard is violated, treat the experiment as checks_failed.

## Baseline
- Primary metric: <value>
- Date: <date>
- Commit: <hash>

## What's Been Tried
Updated as experiments accumulate. Format:
- Run N: <description> → <kept|discarded|crashed> (<metric value>, <delta%>)

`autoresearch.sh`

Bash benchmark script. Must:

Use set -euo pipefail
Run fast (every second is multiplied by hundreds of runs)
Output METRIC <name>=<number> lines on stdout
Exit 0 on success, non-zero on failure

Example:

#!/bin/bash
set -euo pipefail
# Pre-check: fast syntax/compile verification
bun check

# Run benchmark
RESULT=$(bun run bench:fifo 2>&1)
echo "$RESULT"

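# Extract and output metric (sed rather than grep -P, which BSD/macOS grep lacks)
TIME=$(echo "$RESULT" | sed -n 's/.*recalc_us: \([0-9.][0-9.]*\).*/\1/p')
[ -n "$TIME" ] || { echo "recalc_us not found in benchmark output" >&2; exit 1; }
echo "METRIC recalc_us=$TIME"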
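On the consuming side, the `METRIC <name>=<number>` lines have to be read back out of the benchmark output. The sketch below shows one way to do that with `awk`. The function name `parse_metric` is an assumption for illustration, and the plugin's own parsing may differ.

```shell
# Hypothetical helper: print the value of one METRIC line read from stdin.
# Usage: ./autoresearch.sh | parse_metric recalc_us
# If the metric appears more than once, the last value wins.
# Exits non-zero when the metric is missing.
parse_metric() {
  awk -v key="$1" '
    $1 == "METRIC" {
      eq = index($2, "=")                       # split "name=value" at the first "="
      if (eq && substr($2, 1, eq - 1) == key) {
        value = substr($2, eq + 1)
        found = 1
      }
    }
    END {
      if (!found) exit 1
      print value
    }
  '
}
```

Because guard metrics are emitted as additional `METRIC` lines, the same helper can read them by name.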

`autoresearch.checks.sh` (optional)

Only create when quality gates are needed. Runs after every passing benchmark.

Uses set -euo pipefail
Runs correctness checks (tests, types, lint)
Exit 0 = checks pass, non-zero = checks fail
Its execution time does NOT affect the primary metric
Keep output minimal — only errors, suppress verbose success

Example:

#!/bin/bash
set -euo pipefail
bun run test --run 2>&1 | tail -20

Make scripts executable: chmod +x autoresearch.sh autoresearch.checks.sh
Activate the loop: write the state file that tells the stop hook to keep looping:

mkdir -p .claude
cat > .claude/autoresearch-loop.local.md << 'EOF'
---
stop_count: 0
max_iterations: 50
max_consecutive_discards: 8
max_cost_usd: 0
active: true
---
Read autoresearch.md for full context. Continue the experiment loop.
EOF

Set max_cost_usd to the user's cost ceiling if they specified one (0 = unlimited). Set max_consecutive_discards to 8 by default (loop stops after 8 consecutive non-keep results).
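To inspect the state file by hand, the front matter between the two `---` lines can be read as simple `key: value` pairs. This is a minimal sketch under that assumption. `state_get` is a hypothetical helper, and the stop hook's actual parsing is not shown in this document.

```shell
# Hypothetical helper: read one front-matter field from the loop state file.
# Usage: state_get .claude/autoresearch-loop.local.md max_iterations
# Only lines between the first and second "---" are considered.
state_get() {
  local file="$1" key="$2"
  awk -v key="$key" '
    /^---[ \t]*$/ { fence++; next }             # count front-matter delimiters
    fence == 1 {
      sep = index($0, ":")
      if (sep && substr($0, 1, sep - 1) == key) {
        val = substr($0, sep + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", val)        # trim surrounding whitespace
        print val
        found = 1
        exit
      }
    }
    END { exit (found ? 0 : 1) }
  ' "$file"
}
```

Restricting the match to the front matter keeps the prompt text below the second `---` from being read as a setting.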

Initialize JSONL using the log helper:

${CLAUDE_PLUGIN_ROOT}/hooks/scripts/log-experiment.sh . init "<goal>" "<metric_name>" "<metric_unit>" "<lower|higher>"

Run baseline: execute ./autoresearch.sh, parse the metric, log the baseline:

${CLAUDE_PLUGIN_ROOT}/hooks/scripts/log-experiment.sh . result 1 <baseline_metric> keep "baseline"

Start looping immediately

The Experiment Loop

Each iteration:

Review context: check the system message for the current search strategy (explore/exploit/combine/ablation). Read autoresearch.md (especially "What's Been Tried"), check git log for recent experiments, check autoresearch.ideas.md if it exists
Form a hypothesis: decide what to try next. Prefer ideas that are structurally different from recent failures.
Edit files: make the code change
Run benchmark: ./autoresearch.sh
Parse metric: extract the METRIC name=value line from output
Run quality gate (if autoresearch.checks.sh exists): ./autoresearch.checks.sh
Decide and act:

If improved (metric is better) AND checks pass:

git add -A
git commit -m "<description>

Autoresearch: {\"status\":\"keep\",\"metric\":<value>,\"delta\":\"<delta%>\"}"

Log the result using the helper (ensures consistent JSONL format):

${CLAUDE_PLUGIN_ROOT}/hooks/scripts/log-experiment.sh . result <run_number> <metric_value> keep "<description>"

If worse/equal OR checks fail:

Revert code changes (autoresearch files are automatically preserved):

${CLAUDE_PLUGIN_ROOT}/hooks/scripts/revert-experiment.sh .

Log the result:

${CLAUDE_PLUGIN_ROOT}/hooks/scripts/log-experiment.sh . result <run_number> <metric_value> <discard|crash|checks_failed> "<description>"

Update autoresearch.md: append to "What's Been Tried"
Repeat: return to the first step (Review context)
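The keep/discard decision in the loop is a strict comparison in the metric's direction, gated on the quality checks. The sketch below assumes floating-point metrics and uses `awk` for the arithmetic. The names `is_better`, `delta_pct`, and `decide` are hypothetical. Checking the quality gate before the metric is also an assumption made here.

```shell
# Hypothetical helpers for the keep/discard decision.

# is_better BEST CANDIDATE DIRECTION
# Exit 0 only when CANDIDATE strictly beats BEST; an equal value is not better.
is_better() {
  awk -v best="$1" -v cand="$2" -v dir="$3" 'BEGIN {
    best += 0; cand += 0                        # force numeric comparison
    if (dir == "lower")  exit ((cand < best) ? 0 : 1)
    if (dir == "higher") exit ((cand > best) ? 0 : 1)
    exit 2                                      # unknown direction
  }'
}

# delta_pct BEST CANDIDATE
# Print the signed percentage change relative to BEST.
delta_pct() {
  awk -v best="$1" -v cand="$2" 'BEGIN {
    best += 0; cand += 0
    if (best == 0) { print "n/a"; exit }        # avoid division by zero
    printf "%+.1f%%\n", (cand - best) / best * 100
  }'
}

# decide BEST CANDIDATE DIRECTION [CHECKS_EXIT_CODE]
# Print keep, discard, or checks_failed.
decide() {
  local best="$1" cand="$2" direction="$3" checks_exit="${4:-0}"
  if [ "$checks_exit" -ne 0 ]; then
    echo checks_failed
  elif is_better "$best" "$cand" "$direction"; then
    echo keep
  else
    echo discard
  fi
}
```

For a lower-is-better metric, `decide 100 90 lower 0` prints `keep` and `delta_pct 100 90` prints `-10.0%`.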

Loop Rules

LOOP FOREVER. Never ask "should I continue?" Never stop to summarize. Never wait for permission.

Primary metric is king. Improved → keep. Worse or equal → discard.
Simpler is better. Removing code for equal performance = keep.
Don't thrash. Repeatedly reverting the same idea? Try something structurally different.
Crashes: fix if trivial (typo, missing import), otherwise log as crash and move on.
Think longer when stuck. Re-read source files. Study profiling data. Try a completely different approach.
If out of ideas, think harder. Recall relevant academic literature. Try counterintuitive approaches. Combine two previous ideas.
Check guards after every experiment. If autoresearch.md has a ## Guards section, parse all guard metrics from the benchmark output and verify each one. If any guard is violated, treat the experiment as checks_failed regardless of primary metric improvement.
Signal completion when you've genuinely exhausted all ideas and further experiments would be unproductive. Emit <promise>AUTORESEARCH_COMPLETE</promise> in your message. The stop hook will detect this and cleanly terminate the loop.

NEVER STOP. The user may be away for hours.
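The guard rule above compares a measured value against a `metric_name operator threshold` expression. A minimal sketch of that comparison follows. `check_guard` is a hypothetical helper, and the operator set it accepts (`<`, `<=`, `>`, `>=`, `==`) is an assumption that goes beyond the two operators shown in the document's examples.

```shell
# Hypothetical helper: evaluate one guard expression against a measured value.
# Usage: check_guard "memory_mb < 512" 300
# Exit 0 = guard holds, 1 = guard violated, 2 = malformed guard.
check_guard() {
  awk -v guard="$1" -v value="$2" 'BEGIN {
    if (split(guard, part, " ") != 3) exit 2    # expect: name operator threshold
    op = part[2]
    v = value + 0
    t = part[3] + 0
    if      (op == "<")  ok = (v <  t)
    else if (op == "<=") ok = (v <= t)
    else if (op == ">")  ok = (v >  t)
    else if (op == ">=") ok = (v >= t)
    else if (op == "==") ok = (v == t)
    else exit 2                                 # unsupported operator
    exit (ok ? 0 : 1)
  }'
}
```

A violated guard would then be logged as `checks_failed`, even when the primary metric improved.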

Search Strategy

The stop hook computes an adaptive search strategy based on experiment history and includes it in the system message each turn. Adapt your approach:

explore (default): try novel, diverse ideas. Prefer untried approaches. Be creative.
exploit: the keep rate is high — refine the current best. Make small, incremental tweaks. Don't try anything radical.
combine: multiple near-misses exist (experiments that almost beat the best). Merge the top 2-3 best ideas into a single experiment.
ablation: many consecutive failures with no near-misses. Remove components one at a time from the current best to find what's actually driving the gain. Simplify.

Check the system message at the start of each iteration for the current strategy and reason.

Profiling (optional)

For performance optimization targets, create autoresearch.profile.sh during setup. Run it once after the baseline to capture profiling data, then add a ## Profiling Hotspots section to autoresearch.md.

Example autoresearch.profile.sh for Python:

#!/bin/bash
set -euo pipefail
python3 -c "
import cProfile, pstats, io
# Import and run your target
from sort import sort_numbers
import random
random.seed(42)
data = random.sample(range(100000), 5000)
pr = cProfile.Profile()
pr.enable()
for _ in range(3): sort_numbers(data)
pr.disable()
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(15)
print(s.getvalue())
"

Add the output to autoresearch.md as a hotspot table. Re-profile after significant kept experiments to check if the bottleneck has moved.

Guardrails

Do not overfit to the benchmark. The optimization must improve real-world performance, not just game the measurement script.
Do not cheat on the benchmark. Never modify autoresearch.sh, autoresearch.checks.sh, or the test suite to make metrics look better.
Do not add benchmark-specific code paths. No if running_benchmark: ... shortcuts. The optimized code must be production-quality.

Ideas Backlog

When you discover promising but complex optimizations you won't pursue right now, append them as bullets to autoresearch.ideas.md. On resume, prune stale entries and experiment with the rest.
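Appending to the backlog can be a one-liner, but skipping exact duplicates keeps the file short over a long session. A minimal sketch, with `add_idea` as a hypothetical helper name:

```shell
# Hypothetical helper: append an idea as a bullet, skipping exact duplicates.
# Usage: add_idea "Replace linear scan with binary search" [file]
add_idea() {
  local idea="$1" file="${2:-autoresearch.ideas.md}"
  # -x matches whole lines, -F treats the idea as a literal string
  if [ -f "$file" ] && grep -qxF -- "- $idea" "$file"; then
    return 0
  fi
  printf '%s\n' "- $idea" >> "$file"
}
```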

Resuming

When /autoresearch resume is called or after context compaction:

Read autoresearch.md — this is the complete session state
Read autoresearch.jsonl — parse to find run count, best metric, recent results
Check git log --oneline -10 — see recent commits
Check autoresearch.ideas.md if it exists — promising paths to explore
Continue looping from where you left off

User Messages During Experiments

If the user sends a message while you're mid-experiment, finish the current run + log cycle first, then incorporate their feedback.

Status Display

When /autoresearch status is called, use the /autoresearch-status command to display results.

Automatic Termination

The loop stops automatically when any of these occur:

Max iterations reached — stop hook deletes the state file
Convergence detected — too many consecutive discards (default 8)
Cost budget exceeded — estimated session cost exceeds max_cost_usd
Completion signal — you emit <promise>AUTORESEARCH_COMPLETE</promise> when all ideas are exhausted
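The four conditions above can be sketched as a single function. This is an illustration only: the stop hook's real logic is not shown in this document. The names `trailing_discards` and `stop_reason`, the order of the checks, and the one-status-per-line input format are all assumptions.

```shell
# Hypothetical helpers illustrating the termination conditions.

# trailing_discards: read one status per line (oldest first) on stdin and
# print how many consecutive non-keep results end the sequence.
trailing_discards() {
  awk '{ if ($1 == "keep") n = 0; else n++ } END { print n + 0 }'
}

# stop_reason STOP_COUNT MAX_ITER DISCARDS MAX_DISCARDS COST MAX_COST MESSAGE
# Print the reason and return 0 if the loop should stop; return 1 otherwise.
stop_reason() {
  local stop_count="$1" max_iter="$2" discards="$3" max_discards="$4"
  local cost="$5" max_cost="$6" message="$7"
  if [ "$stop_count" -ge "$max_iter" ]; then
    echo max_iterations
  elif [ "$discards" -ge "$max_discards" ]; then
    echo converged
  elif awk -v c="$cost" -v m="$max_cost" \
       'BEGIN { exit ((m + 0 > 0 && c + 0 > m + 0) ? 0 : 1) }'; then
    echo cost_exceeded                          # max_cost of 0 means unlimited
  else
    case "$message" in
      *"<promise>AUTORESEARCH_COMPLETE</promise>"*) echo completed ;;
      *) return 1 ;;
    esac
  fi
}
```

With the default limits, `stop_reason 3 50 8 8 0 0 ""` prints `converged`, matching the rule that eight consecutive non-keep results end the loop.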

Deactivating

When /autoresearch off is called:

Delete .claude/autoresearch-loop.local.md (stops the stop hook)
Print a summary of results
Do NOT delete autoresearch.jsonl or autoresearch.md

When /autoresearch clear is called:

Delete .claude/autoresearch-loop.local.md
Delete autoresearch.jsonl
Print confirmation
Do NOT delete autoresearch.md (it's still useful as documentation)
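The difference between `off` and `clear` comes down to which files are removed. The sketch below makes that explicit. The function names are hypothetical, and the results summary that `off` prints is omitted because the JSONL format is not described here.

```shell
# Hypothetical helpers showing which files each command removes.
# Both take the worktree root as an optional argument (default: current dir).

# off: stop the loop but keep all experiment history.
autoresearch_off() {
  local root="${1:-.}"
  rm -f "$root/.claude/autoresearch-loop.local.md"
}

# clear: stop the loop and drop the experiment log; autoresearch.md survives.
autoresearch_clear() {
  local root="${1:-.}"
  rm -f "$root/.claude/autoresearch-loop.local.md" "$root/autoresearch.jsonl"
  echo "autoresearch state cleared (autoresearch.md kept)"
}
```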
