From research-loop
> Adapted from [karpathy/autoresearch](https://github.com/karpathy/autoresearch) program.md.
Slash command: `/research-loop:autoresearch`
This skill teaches Research Loop's Empirical agent how to run autonomous nanochat/GPT training experiments using the autoresearch setup.
You are the Empirical Agent operating on the karpathy/autoresearch codebase.
Your job is to autonomously experiment with train.py to minimize val_bpb
(validation bits per byte — lower is better).
prepare.py — fixed constants, data prep, tokenizer, dataloader, evaluation. DO NOT MODIFY.
train.py — the ONLY file you edit. Model architecture, optimizer, hyperparameters.
program.md — agent instructions (this skill supersedes it)
run.log — benchmark output (written by: uv run train.py > run.log 2>&1)
results.tsv — your experiment log (tab-separated, not tracked by git)
You CAN:

- Edit `train.py`: this is the only file you touch.
- Tune `train.py`: all constants at the top are fair game.

You CANNOT:

- Modify `prepare.py`: it is read-only (evaluation harness, dataloader, constants).
- Modify `pyproject.toml`.
- Change the `evaluate_bpb` function: it is the ground truth metric.

Run the experiment:

uv run train.py > run.log 2>&1
Training runs for a fixed 5-minute wall-clock time budget regardless of what you change.
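The run step can also be driven from Python. The following is a minimal sketch, not part of the original setup: the helper name `run_experiment` and the `timeout_s` safety margin are assumptions introduced for illustration. It redirects stdout and stderr to `run.log` in the same way as the shell command.

```python
import subprocess


def run_experiment(cmd=("uv", "run", "train.py"), log_path="run.log", timeout_s=600):
    """Run one training command, sending stdout and stderr to log_path.

    timeout_s is a hypothetical guard against a hung run; the training
    script itself enforces the 5-minute budget.
    Returns the process exit code, or None if the guard timeout was hit.
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(
                list(cmd),
                stdout=log,
                stderr=subprocess.STDOUT,  # same effect as 2>&1
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before re-raising
            return None
    return proc.returncode
```

A non-zero exit code or a `None` result can be treated as a crash when deciding what to do with the run.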
Read the result:
grep "^val_bpb:\|^peak_vram_mb:" run.log
The summary block looks like:
---
val_bpb: 0.997900
training_seconds: 300.1
total_seconds: 325.9
peak_vram_mb: 45060.2
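For programmatic use, the same two fields can be pulled out of the log in Python. This is a sketch under the assumption that the summary lines look exactly like the block above; `parse_summary` and `read_summary` are hypothetical helper names, not functions from the autoresearch codebase.

```python
import re

# Matches lines such as "val_bpb: 0.997900" or "peak_vram_mb: 45060.2"
SUMMARY_LINE = re.compile(
    r"^(val_bpb|peak_vram_mb):\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE
)


def parse_summary(log_text):
    """Return a dict with the numeric summary fields found in the log text.

    Missing keys (for example after an OOM, or when the value is "nan")
    are simply absent from the result.
    """
    return {key: float(value) for key, value in SUMMARY_LINE.findall(log_text)}


def read_summary(log_path="run.log"):
    """Read run.log from disk and parse its summary fields."""
    with open(log_path) as f:
        return parse_summary(f.read())
```

If `val_bpb` is absent from the returned dict, the run most likely crashed and the tail of the log is the place to look.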
When proposing a mutation, always specify:

- the change to `train.py`
- the expected effect on `val_bpb`

Good first experiments (in rough priority order):
1. Learning rates (`MATRIX_LR`, `EMBEDDING_LR`): high leverage, low risk
2. `DEPTH` increase (8 → 10 or 12): more capacity, higher VRAM
3. `WARMDOWN_RATIO` adjustment: often undertuned
4. `WINDOW_PATTERN` change (e.g. "SSLL" or "L"): architectural
5. `TOTAL_BATCH_SIZE` increase: may improve generalization
6. `WEIGHT_DECAY` tuning: regularization
7. Other optimizer settings (`ADAM_BETAS`, `SCALAR_LR`)

| Delta | Action |
|---|---|
| val_bpb improved (lower) | Keep — advance the branch |
| val_bpb equal or worse | Discard — git reset --hard HEAD |
| Crash (OOM / NaN / exit 1) | Discard — check tail -n 50 run.log for the error |
Simplicity criterion: A 0.001 improvement that adds 20 lines of hacky code is probably not worth it. A 0.001 improvement from deleting code is always worth it.
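The decision table above reduces to a small comparison. The sketch below assumes the best `val_bpb` so far is tracked by the caller; `decide` is a hypothetical helper name. The simplicity criterion is a judgment call and is deliberately left out of the code.

```python
import math


def decide(new_bpb, best_bpb, crashed=False):
    """Map one run's outcome to a status: 'keep', 'discard', or 'crash'.

    new_bpb is None when no val_bpb line was found in run.log.
    """
    if crashed or new_bpb is None or math.isnan(new_bpb):
        return "crash"
    # Lower is better; a tie counts as "equal or worse" and is discarded
    return "keep" if new_bpb < best_bpb else "discard"
```

For example, `decide(0.9950, 0.9979)` yields `"keep"`, while an equal score yields `"discard"`.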
After every run (keep or discard), append to results.tsv (tab-separated):
commit val_bpb memory_gb status description
- `commit`: 7-char git hash
- `val_bpb`: metric value (0.000000 for crashes)
- `memory_gb`: peak_vram_mb / 1024 rounded to 1 decimal (0.0 for crashes)
- `status`: keep, discard, or crash
- `description`: short description of what you tried

The default train.py requires an NVIDIA H100. For macOS (MPS) or smaller GPUs:
- Set `DEPTH` to 4, `TOTAL_BATCH_SIZE` to 2**14, `DEVICE_BATCH_SIZE` to 16
- Set `WINDOW_PATTERN = "L"` (banded attention is slow on non-CUDA)

Install the plugin:

npx claudepluginhub moralespanitz/research-loop --plugin research-loop
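Returning to the results.tsv logging step described earlier, one row can be assembled as follows. This is a sketch only: `format_result_row` and `append_result` are hypothetical helper names, and writing the header when the file is missing is an assumption, not something the original instructions specify.

```python
import os

HEADER = "commit\tval_bpb\tmemory_gb\tstatus\tdescription"


def format_result_row(commit, val_bpb, peak_vram_mb, status, description):
    """Build one tab-separated results.tsv row from a run's outcome."""
    crashed = status == "crash"
    bpb = 0.0 if crashed else val_bpb
    memory_gb = 0.0 if crashed else peak_vram_mb / 1024
    # Collapse tabs and newlines so the description cannot break the TSV layout
    clean = " ".join(description.split())
    return f"{commit[:7]}\t{bpb:.6f}\t{memory_gb:.1f}\t{status}\t{clean}"


def append_result(row, path="results.tsv"):
    """Append a row, writing the header first if the file does not exist yet."""
    is_new = not os.path.exists(path)
    with open(path, "a") as f:
        if is_new:
            f.write(HEADER + "\n")
        f.write(row + "\n")
```

With the sample summary values shown earlier, a peak of 45060.2 MB is logged as 44.0 GB.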