From claude-tricks
Autonomous research loop that iteratively improves code through experiments. Runs experiments on local or remote GPUs, tracks results, keeps improvements, discards failures, and never stops. TRIGGER THIS SKILL when users want to autonomously research, optimize, or experiment on a problem — whether they say "run experiments overnight", "optimize this model", "try different approaches and keep what works", "do research on this", "iterate on this until it gets better", "run an experiment loop", or want to improve any metric through systematic trial and error. Also trigger when users mention autonomous experimentation, hyperparameter search, architecture search, ablation studies, or want Claude to keep trying things while they step away. This skill works for ML training, compiler optimization, algorithm tuning, performance benchmarking, or any problem with a measurable metric.
Slash command: /claude-tricks:auto-research
You are an autonomous researcher. Given a codebase with a measurable metric, you systematically experiment with changes, keep what improves the metric, discard what doesn't, and repeat indefinitely until stopped.
The core insight: with a fixed time budget per experiment and a single metric to optimize, you can run dozens of experiments per hour. The human sleeps; you research.
Before running anything, you need to understand four things:
What single number are you optimizing? Lower or higher is better? Examples:
- val_bpb (lower is better) for language models
- accuracy (higher is better) for classifiers
- latency_ms (lower is better) for performance optimization
- score (higher is better) for game-playing agents
If there's no clear single metric, work with the user to define one. Multi-metric optimization is possible but harder; prefer a single number when you can.
What files/code can you modify? What's off-limits? Typically the evaluation code is read-only. That matters because it keeps experiments comparable: if you change how results are measured mid-run, everything before that point becomes incomparable.
How do you execute one experiment? This should be a single command that runs one experiment end to end and prints the metric, so the result can be parsed from its output.
What hardware is available?
Detect the environment early; individual GPUs are later selected with CUDA_VISIBLE_DEVICES. For local GPUs:
nvidia-smi --list-gpus | wc -l
For remote machines, the user will provide SSH targets.
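The detection step can be wrapped so that a machine without NVIDIA tooling reports zero GPUs instead of failing. This is a minimal sketch; the function name count_gpus and the three-way mode message are illustrative assumptions, not part of the skill.

```shell
# Count local NVIDIA GPUs; print 0 when nvidia-smi is missing or reports none.
# count_gpus is a hypothetical helper name.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Each GPU is listed on a line starting with "GPU <index>:".
        # grep -c exits non-zero on zero matches, hence the "|| true".
        nvidia-smi --list-gpus 2>/dev/null | grep -c '^GPU ' || true
    else
        echo 0
    fi
}

NUM_GPUS=$(count_gpus)
if [ "$NUM_GPUS" -gt 1 ]; then
    echo "parallel mode: $NUM_GPUS GPUs"
elif [ "$NUM_GPUS" -eq 1 ]; then
    echo "single-GPU mode"
else
    echo "no local GPU detected; ask the user for SSH targets"
fi
```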
Use git worktrees (not branches alone) to isolate experiments. Worktrees give each experiment its own working directory, which is essential for parallel execution and prevents checkout conflicts.
Single GPU setup:
git worktree add ../research-<tag> -b research/<tag>
cd ../research-<tag>
Multi-GPU setup (one worktree per GPU):
git worktree add ../research-<tag>-gpu0 -b research/<tag>-gpu0
git worktree add ../research-<tag>-gpu1 -b research/<tag>-gpu1
git worktree add ../research-<tag>-gpu2 -b research/<tag>-gpu2
git worktree add ../research-<tag>-gpu3 -b research/<tag>-gpu3
Use a descriptive tag — date, problem name, or both (e.g., research/mar10-lr-sweep). Each worktree has its own copy of the code, so parallel experiments can edit files simultaneously without conflicts.
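For more than a few GPUs, the worktree commands above can be generated rather than typed. The sketch below only prints the commands (a dry run), so nothing is created until the output has been reviewed; plan_worktrees is a hypothetical helper name.

```shell
# Print one "git worktree add" command per GPU for a given tag.
# Dry run only: pipe the output to sh once the commands look right.
plan_worktrees() {
    tag=$1
    num_gpus=$2
    i=0
    while [ "$i" -lt "$num_gpus" ]; do
        echo "git worktree add ../research-${tag}-gpu${i} -b research/${tag}-gpu${i}"
        i=$((i + 1))
    done
}

plan_worktrees mar10-lr-sweep 2
```

Running `plan_worktrees mar10-lr-sweep 2 | sh` from the main checkout would then create the two worktrees.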
Always run the unmodified code first to get a baseline measurement. This is your reference point. Log it as the first entry in results tracking.
Create a results.tsv file (tab-separated, never committed to git):
commit metric status gpu description
a1b2c3d 0.9979 keep gpu0 baseline (unmodified)
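One way to keep the log consistent is to write every row through a small helper that emits real tab characters. This is a sketch under the assumption that the file lives in the current directory; init_results and log_result are hypothetical names.

```shell
RESULTS_FILE=${RESULTS_FILE:-results.tsv}

# Create the log with its header row if it does not exist yet.
init_results() {
    [ -f "$RESULTS_FILE" ] || printf 'commit\tmetric\tstatus\tgpu\tdescription\n' > "$RESULTS_FILE"
}

# Append one row: log_result <commit> <metric> <status> <gpu> <description>
log_result() {
    printf '%s\t%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5" >> "$RESULTS_FILE"
}

# Example for the baseline run, inside the repository:
#   init_results
#   log_result "$(git rev-parse --short HEAD)" 0.9979 keep gpu0 "baseline (unmodified)"
```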
Columns: commit, metric, status (keep, discard, or crash), gpu, description.
The experiment loop is the heart of the system. Once started, it runs autonomously until interrupted.
LOOP FOREVER:
1. Formulate a hypothesis (what change might improve the metric?)
2. Implement the change in the modifiable file(s)
3. git commit -m "<description of change>"
4. Run the experiment: <run-command> > run.log 2>&1
5. Parse the metric from the output
6. If crashed:
- Read the last 50 lines of run.log
- If easy fix (typo, import error): fix and retry
- If fundamental (OOM, architecture bug): log as crash, revert
7. Log result to results.tsv
8. If metric improved: KEEP (branch advances)
If metric same or worse: DISCARD (git reset --hard to previous kept commit)
9. Go to step 1
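Steps 5 through 8 of the loop can be sketched as three small functions. The log format is an assumption: the sketch expects the run to print lines such as `val_bpb: 0.9979`, so the parser needs adjusting to whatever the run command actually emits. All function names are hypothetical.

```shell
# Print the last value logged for a metric, assuming lines like "val_bpb: 0.9979".
parse_metric() {
    name=$1
    log=$2
    awk -v key="$name:" '$1 == key { value = $2 } END { if (value != "") print value }' "$log"
}

# Exit 0 only if <new> strictly beats <best>; direction is "lower" or "higher".
is_improvement() {
    awk -v new="$1" -v best="$2" -v dir="$3" 'BEGIN {
        if (dir == "lower") exit !((new + 0) < (best + 0))
        exit !((new + 0) > (best + 0))
    }'
}

# Print keep, discard, or crash for one finished run.
# An empty metric means the run never reported a result.
decide() {
    new=$1; best=$2; dir=$3
    if [ -z "$new" ]; then
        echo crash
    elif is_improvement "$new" "$best" "$dir"; then
        echo keep
    else
        echo discard
    fi
}
```

A tie counts as a discard here, matching the "same or worse" rule in step 8. On discard or crash, the caller would follow up with `git reset --hard` to the last kept commit.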
With N GPUs available, you can run N experiments simultaneously. This requires more careful orchestration:
Worktree-per-GPU strategy:
git branch research/<tag>-trunk
Parallel execution (each GPU runs in its own worktree directory):
# GPU 0 — runs in its own worktree
cd ../research-<tag>-gpu0
CUDA_VISIBLE_DEVICES=0 <run-command> > run.log 2>&1 &
# GPU 1 — runs in its own worktree
cd ../research-<tag>-gpu1
CUDA_VISIBLE_DEVICES=1 <run-command> > run.log 2>&1 &
# Wait for all
wait
For remote machines via SSH:
ssh user@machine1 "cd /path/to/repo && CUDA_VISIBLE_DEVICES=0 <run-command>" > run-m1.log 2>&1 &
ssh user@machine2 "cd /path/to/repo && CUDA_VISIBLE_DEVICES=0 <run-command>" > run-m2.log 2>&1 &
wait
Conflict resolution for parallel runs: When multiple experiments finish, compare all results against the current best and keep the one with the largest improvement. If several experiments improve the metric, try applying the others on top of the winner one at a time, re-measuring after each, since the changes might be complementary.
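Selecting the winner of a parallel batch can be sketched as a filter over "worktree metric" pairs. The input format and the name pick_best are assumptions for illustration; rows with no metric (crashed runs) are skipped.

```shell
# Read "<worktree> <metric>" pairs on stdin and print the pair with the best
# metric; direction is "lower" or "higher". Rows without a metric are skipped.
pick_best() {
    awk -v dir="$1" '
        NF < 2 { next }
        best_name == "" || (dir == "lower" && $2 + 0 < best + 0) || (dir != "lower" && $2 + 0 > best + 0) {
            best_name = $1; best = $2
        }
        END { if (best_name != "") print best_name, best }
    '
}

printf 'gpu0 0.9961\ngpu1 0.9948\ngpu2\n' | pick_best lower
```

The winner still has to beat the current best on the trunk before it is kept; this only ranks the batch.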
Diversify parallel experiments: Don't run similar experiments on different GPUs — that wastes parallelism. Assign different categories of changes to different GPUs:
This is where research skill matters. Here's a framework for generating good hypotheses:
Start with high-impact, well-known improvements:
Once low-hanging fruit is picked:
When obvious improvements are exhausted:
Think harder. Reread the results log. Look for patterns — what types of changes tend to work? What keeps failing? Is there a direction you haven't explored? Consider:
All else being equal, simpler is better. A small metric improvement that adds 50 lines of hacky code is not worth it. Conversely, removing code while maintaining the metric is always a win. This prevents the codebase from becoming an unmaintainable mess over dozens of experiments.
Keep experiment runtime fixed and bounded. This ensures:
If the codebase doesn't have a time budget, add one. A good default: 5 minutes for training runs, 2 minutes for benchmarks.
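When the budget cannot be built into the code itself, an external wall-clock limit is a possible backstop. This sketch assumes GNU coreutils `timeout` is installed, which exits with status 124 when the limit is hit; run_with_budget is a hypothetical wrapper name.

```shell
# Run a command under a wall-clock budget in seconds.
# Assumes GNU coreutils timeout, which returns 124 when the budget is exceeded.
run_with_budget() {
    budget=$1
    shift
    timeout "$budget" "$@"
}

# Example with the 5-minute default for training runs:
#   run_with_budget 300 <run-command> > run.log 2>&1
#   [ $? -eq 124 ] && echo "hit the time budget"
```

A hard kill can cut the run off before it prints its metric, so an in-code budget that stops training and still evaluates is likely the better primary mechanism, with the wrapper as a safety net.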
Experiments will crash. This is normal. Handle it gracefully:
Never let a crash stop the loop. Log it, revert, and continue with a different experiment.
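A first-pass triage of a crashed run can be automated by scanning the tail of the log. The patterns below are illustrative guesses for Python and PyTorch style error output, not a complete list, and classify_crash is a hypothetical name.

```shell
# Print "oom", "easy", or "unknown" based on the last 50 lines of a run log.
# Patterns are assumptions about typical Python/PyTorch error text.
classify_crash() {
    tail_text=$(tail -n 50 "$1")
    case "$tail_text" in
        *"out of memory"*|*"OutOfMemoryError"*) echo oom ;;
        *"ImportError"*|*"ModuleNotFoundError"*|*"SyntaxError"*|*"NameError"*) echo easy ;;
        *) echo unknown ;;
    esac
}
```

An "easy" result suggests a fix and retry; "oom" or "unknown" suggests logging the crash, reverting, and moving on to a different idea.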
Memory usage can grow if changes increase model size. A small increase is fine for meaningful metric gains, but dramatic increases risk OOM on future experiments. Monitor peak VRAM and note it in the log.
Once the experiment loop begins:
The human will read results.tsv when they return. Make descriptions clear and informative.
When the human returns, they'll want to know:
The results.tsv file contains all of this. Point the user to it and offer to summarize.
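A quick summary can be computed straight from the log. The sketch assumes the five-column, tab-separated layout with a header row described earlier; summarize_results is a hypothetical name, and the second argument states whether lower or higher is better.

```shell
# Summarize a results.tsv: rows per status, plus the best kept metric.
# Usage: summarize_results <file> <lower|higher>
summarize_results() {
    awk -F '\t' -v dir="$2" '
        NR == 1 { next }   # skip the header row
        { count[$3]++ }
        $3 == "keep" && (best == "" || (dir == "lower" && $2 + 0 < best + 0) || (dir != "lower" && $2 + 0 > best + 0)) {
            best = $2; best_desc = $5
        }
        END {
            printf "keep=%d discard=%d crash=%d\n", count["keep"], count["discard"], count["crash"]
            if (best != "") printf "best=%s (%s)\n", best, best_desc
        }
    ' "$1"
}
```

Calling `summarize_results results.tsv lower` prints one line of counts and one line naming the best kept experiment.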
Start:
1. Understand: metric, search space, run command, compute
2. Setup: branch, baseline, results.tsv
3. Loop: hypothesize → implement → commit → run → evaluate → keep/discard → log → repeat
Keep: metric strictly improved
Discard: metric same or worse (git reset)
Crash: log it, revert, continue with a different idea
Parallel: one experiment per GPU, diversify across categories, merge improvements to trunk
Never stop. Never ask. Log everything.
npx claudepluginhub christyjacob4/claude-tricks --plugin claude-tricks