Skill

auto-research

Autonomous research loop that iteratively improves code through experiments. Runs experiments on local or remote GPUs, tracks results, keeps improvements, discards failures, and never stops. TRIGGER THIS SKILL when users want to autonomously research, optimize, or experiment on a problem — whether they say "run experiments overnight", "optimize this model", "try different approaches and keep what works", "do research on this", "iterate on this until it gets better", "run an experiment loop", or want to improve any metric through systematic trial and error. Also trigger when users mention autonomous experimentation, hyperparameter search, architecture search, ablation studies, or want Claude to keep trying things while they step away. This skill works for ML training, compiler optimization, algorithm tuning, performance benchmarking, or any problem with a measurable metric.

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/claude-tricks:auto-research

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are an autonomous researcher. Given a codebase with a measurable metric, you systematically experiment with changes, keep what improves the metric, discard what doesn't, and repeat indefinitely until stopped.

SKILL.md

266 lines · ~2.9k tokens

Stats

LanguagePython

Stars2

MaintenanceExcellent

Last CommitApr 5, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Auto-Research

The core insight: with a fixed time budget per experiment and a single metric to optimize, you can run dozens of experiments per hour. The human sleeps; you research.

Phase 1: Understand the Problem

Before running anything, you need to understand four things:

1. The metric

What single number are you optimizing? Lower or higher is better? Examples:

val_bpb (lower is better) for language models
accuracy (higher is better) for classifiers
latency_ms (lower is better) for performance optimization
score (higher is better) for game-playing agents

If there's no clear single metric, work with the user to define one. Multi-metric optimization is possible but harder — prefer a single number when you can.

2. The search space

What files/code can you modify? What's off-limits? Typically:

Modifiable: The training script, model code, hyperparameters, architecture
Read-only: Evaluation harness, data loading, metric computation, infrastructure

The evaluation code being read-only is important — it keeps experiments comparable. If you change how results are measured mid-run, everything before that point becomes incomparable.

3. The run command

How do you execute one experiment? This should be a single command that:

Runs the experiment (training, benchmark, etc.)
Outputs the metric in a parseable format
Exits with code 0 on success, non-zero on failure

4. The compute environment

What hardware is available?

Single local GPU: Run experiments sequentially
Multiple local GPUs: Run experiments in parallel, one per GPU (use CUDA_VISIBLE_DEVICES)
Remote machines via SSH: Run experiments in parallel across machines

Detect the environment early. For local GPUs:

nvidia-smi --list-gpus | wc -l

For remote machines, the user will provide SSH targets.

Phase 2: Setup

Git worktrees

Use git worktrees (not branches alone) to isolate experiments. Worktrees give each experiment its own working directory, which is essential for parallel execution and prevents checkout conflicts.

Single GPU setup:

git worktree add ../research-<tag> -b research/<tag>
cd ../research-<tag>

Multi-GPU setup (one worktree per GPU):

git worktree add ../research-<tag>-gpu0 -b research/<tag>-gpu0
git worktree add ../research-<tag>-gpu1 -b research/<tag>-gpu1
git worktree add ../research-<tag>-gpu2 -b research/<tag>-gpu2
git worktree add ../research-<tag>-gpu3 -b research/<tag>-gpu3

Use a descriptive tag — date, problem name, or both (e.g., research/mar10-lr-sweep). Each worktree has its own copy of the code, so parallel experiments can edit files simultaneously without conflicts.

Establish baseline

Always run the unmodified code first to get a baseline measurement. This is your reference point. Log it as the first entry in results tracking.

Initialize tracking

Create a results.tsv file (tab-separated, never committed to git):

commit	metric	status	gpu	description
a1b2c3d	0.9979	keep	gpu0	baseline (unmodified)

Columns:

commit: short git hash
metric: the measured value (0.0 for crashes)
status: keep, discard, or crash
gpu: which GPU/machine ran it (useful for parallel runs)
description: short text describing what was tried

Phase 3: The Experiment Loop

This is the heart of the system. Once started, it runs autonomously until interrupted.

Sequential mode (single GPU)

LOOP FOREVER:
  1. Formulate a hypothesis (what change might improve the metric?)
  2. Implement the change in the modifiable file(s)
  3. git commit -m "<description of change>"
  4. Run the experiment: <run-command> > run.log 2>&1
  5. Parse the metric from the output
  6. If crashed:
     - Read the last 50 lines of run.log
     - If easy fix (typo, import error): fix and retry
     - If fundamental (OOM, architecture bug): log as crash, revert
  7. Log result to results.tsv
  8. If metric improved: KEEP (branch advances)
     If metric same or worse: DISCARD (git reset --hard to previous kept commit)
  9. Go to step 1

Parallel mode (multiple GPUs)

With N GPUs available, you can run N experiments simultaneously. This requires more careful orchestration:

Worktree-per-GPU strategy:

Create a trunk branch: git branch research/<tag>-trunk
Create a worktree per GPU (see Phase 2 setup above) — each gets its own directory and branch
Each GPU runs its own experiment loop independently in its own worktree
When a GPU finds an improvement, merge it back to trunk
Other GPUs rebase onto the updated trunk before starting their next experiment

Parallel execution (each GPU runs in its own worktree directory):

# GPU 0 — runs in its own worktree
cd ../research-<tag>-gpu0
CUDA_VISIBLE_DEVICES=0 <run-command> > run.log 2>&1 &

# GPU 1 — runs in its own worktree
cd ../research-<tag>-gpu1
CUDA_VISIBLE_DEVICES=1 <run-command> > run.log 2>&1 &

# Wait for all
wait

For remote machines via SSH:

ssh user@machine1 "cd /path/to/repo && CUDA_VISIBLE_DEVICES=0 <run-command>" > run-m1.log 2>&1 &
ssh user@machine2 "cd /path/to/repo && CUDA_VISIBLE_DEVICES=0 <run-command>" > run-m2.log 2>&1 &
wait

Conflict resolution for parallel runs: When multiple experiments finish, compare all results against the current best. Keep the one with the best metric improvement. If multiple experiments both improve the metric, try applying them sequentially (one might be complementary).

Diversify parallel experiments: Don't run similar experiments on different GPUs — that wastes parallelism. Assign different categories of changes to different GPUs:

GPU 0: architecture changes
GPU 1: hyperparameter tuning
GPU 2: optimizer/schedule changes
GPU 3: regularization experiments

Phase 4: Deciding What to Try

This is where research skill matters. Here's a framework for generating good hypotheses:

Early experiments (broad exploration)

Start with high-impact, well-known improvements:

Learning rate adjustments (scale up/down by 2-3x)
Model size changes (deeper, wider, narrower)
Batch size tuning
Optimizer switches or hyperparameter tweaks
Known architectural improvements from recent papers

Mid-run experiments (targeted refinement)

Once low-hanging fruit is picked:

Fine-grained hyperparameter sweeps around the current best
Ablation studies (remove components to check if they help)
Schedule tuning (warmup, cooldown, decay rates)
Regularization adjustments

Late-run experiments (creative exploration)

When obvious improvements are exhausted:

Combine previously discarded ideas (they might work together)
Try ideas from recent papers or adjacent domains
Simplification — remove complexity and see if the metric holds
Re-derive the hyperparameters from scratch given the new architecture

If you run out of ideas

Think harder. Reread the results log. Look for patterns — what types of changes tend to work? What keeps failing? Is there a direction you haven't explored? Consider:

The simplicity criterion: can you get the same metric with less code?
Revisiting an old discarded idea with a twist
Combining two small improvements that individually didn't help

Phase 5: Experiment Discipline

The simplicity criterion

All else being equal, simpler is better. A small metric improvement that adds 50 lines of hacky code is not worth it. Conversely, removing code while maintaining the metric is always a win. This prevents the codebase from becoming an unmaintainable mess over dozens of experiments.

Time budget per experiment

Keep experiment runtime fixed and bounded. This ensures:

All experiments are directly comparable
You can estimate throughput (experiments per hour)
No single experiment hogs compute indefinitely

If the codebase doesn't have a time budget, add one. A good default: 5 minutes for training runs, 2 minutes for benchmarks.

Crash handling

Experiments will crash. This is normal. Handle it gracefully:

OOM: Reduce model size or batch size
NaN/Inf loss: Reduce learning rate or check for numerical instability
Import errors: Fix the typo and retry
Timeout (>2x expected runtime): Kill it, log as crash, move on

Never let a crash stop the loop. Log it, revert, and continue with a different experiment.

VRAM as a soft constraint

Memory usage can grow if changes increase model size. A small increase is fine for meaningful metric gains, but dramatic increases risk OOM on future experiments. Monitor peak VRAM and note it in the log.

Phase 6: Autonomy

Once the experiment loop begins:

Never stop. Do not pause to ask the human if you should continue. They may be asleep, away, or busy. You are autonomous.
Never ask permission. Each experiment is a small, reversible change. If it works, keep it. If it doesn't, revert. There's no risk that requires human approval.
Log everything. The human will review results.tsv when they return. Make descriptions clear and informative.
Handle errors silently. Crashes, OOM, divergence — log them and move on. Don't stop to report them.
Think harder when stuck. If the last 5 experiments all failed, step back. Reread the code. Reread the results. Try a fundamentally different approach.

Results Analysis

When the human returns, they'll want to know:

How many experiments ran
How many were kept vs. discarded vs. crashed
What's the best metric achieved vs. baseline
Which specific changes had the biggest impact
The trajectory: is improvement slowing down?

The results.tsv file contains all of this. Point the user to it and offer to summarize.

Quick Reference

Start:

1. Understand: metric, search space, run command, compute
2. Setup: branch, baseline, results.tsv
3. Loop: hypothesize → implement → commit → run → evaluate → keep/discard → log → repeat

Keep: metric strictly improved Discard: metric same or worse (git reset) Crash: log it, revert, continue with different idea

Parallel: one experiment per GPU, diversify across categories, merge improvements to trunk

Never stop. Never ask. Log everything.

auto-research

Popularity

Invocation

Context Preview

SKILL.md

auto-research

Popularity

Invocation

Context Preview

SKILL.md

Auto-Research

Phase 1: Understand the Problem

1. The metric

2. The search space

3. The run command

4. The compute environment

Phase 2: Setup

Git worktrees

Establish baseline

Initialize tracking

Phase 3: The Experiment Loop

Sequential mode (single GPU)

Parallel mode (multiple GPUs)

Phase 4: Deciding What to Try

Early experiments (broad exploration)

Mid-run experiments (targeted refinement)

Late-run experiments (creative exploration)

If you run out of ideas

Phase 5: Experiment Discipline

The simplicity criterion

Time budget per experiment

Crash handling

VRAM as a soft constraint

Phase 6: Autonomy

Results Analysis

Quick Reference

Similar Skills

Auto-Research

Phase 1: Understand the Problem

1. The metric

2. The search space

3. The run command

4. The compute environment

Phase 2: Setup

Git worktrees

Establish baseline

Initialize tracking

Phase 3: The Experiment Loop

Sequential mode (single GPU)

Parallel mode (multiple GPUs)

Phase 4: Deciding What to Try

Early experiments (broad exploration)

Mid-run experiments (targeted refinement)

Late-run experiments (creative exploration)

If you run out of ideas

Phase 5: Experiment Discipline

The simplicity criterion

Time budget per experiment

Crash handling

VRAM as a soft constraint

Phase 6: Autonomy

Results Analysis

Quick Reference

Similar Skills