karpathy-autoresearch

A Claude Code plugin that runs Andrej Karpathy's autoresearch pattern on any repo — autonomously edit one file, run it, keep changes that improve a scalar metric, reset the ones that don't. Git is the memory. A TSV is the audit trail.

Point the loop at your repo and define the metric it should improve. Walk away. Review the diffs when you come back.

What it does

You give it a program.md spec — editable file, frozen files, one shell command, one scalar metric, a budget. The skill then hill-climbs:

for N trials:
  pick next hypothesis from program.md
  edit the editable file
  commit WIP
  run the command (hard wall-clock kill)
  parse metric + guards from run.log
  if metric improved AND guards pass -> keep commit
  else                                -> git reset --hard
  append row to .autoresearch/results.tsv

One git branch per run (autoresearch/<tag>). One TSV row per trial. No LLM self-grading — the scalar metric decides.
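The loop above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: `TrialResult`, `improved`, `hill_climb`, the injected `run_trial` / `keep` / `discard` callables, and the four TSV columns are hypothetical names chosen for the example. The git and editing steps are passed in as callables so the keep-or-discard logic can be read (and tested) on its own.

```python
import csv
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class TrialResult:
    metric: Optional[float]  # None if the run crashed, timed out, or printed no metric
    guards_ok: bool


def improved(new: Optional[float], best: Optional[float], direction: str) -> bool:
    """True if `new` strictly beats `best` for direction 'min' or 'max'."""
    if new is None:
        return False
    if best is None:
        return True
    return new < best if direction == "min" else new > best


def hill_climb(
    hypotheses: Iterable[str],
    run_trial: Callable[[str], TrialResult],  # assumed: edit, commit WIP, run, parse
    keep: Callable[[], None],                 # assumed: leave the WIP commit in place
    discard: Callable[[], None],              # assumed: reset to the last kept commit
    tsv_path: str,
    direction: str = "min",
    max_trials: int = 50,
    patience: int = 10,
    baseline: Optional[float] = None,
) -> Optional[float]:
    """Greedy keep-or-discard loop; returns the best metric seen."""
    best, streak = baseline, 0
    with open(tsv_path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for trial, hypothesis in enumerate(hypotheses, start=1):
            # Stop on total budget or on a run of trials without improvement.
            if trial > max_trials or streak >= patience:
                break
            result = run_trial(hypothesis)
            kept = result.guards_ok and improved(result.metric, best, direction)
            if kept:
                keep()
                best, streak = result.metric, 0
            else:
                discard()
                streak += 1
            # One row per trial, kept or not (column layout is an assumption).
            writer.writerow([trial, hypothesis, result.metric, "keep" if kept else "discard"])
    return best
```

Because the decision depends only on the parsed number and the guards, a rejected trial leaves nothing behind except its TSV row.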

What makes this different

Generic "AI coding agents" plan, implement, and declare victory. This runs code, reads a number, and either keeps the change or throws it out. Closer to stochastic optimization than to code-gen chat.

| | Generic agent | karpathy-autoresearch |
| --- | --- | --- |
| Judge | LLM self-assessment | Ground-truth scalar metric |
| Memory | Chat context | Git branch + results.tsv |
| Action surface | Whole codebase | Exactly one file |
| Stop condition | "I think we're done" | Budget / no-improvement streak |
| Bias | Add features | Prefer code deletion |

Install

claude plugin marketplace add tbelbek/karpathy-autoresearch
claude plugin install karpathy-autoresearch@karpathy-autoresearch-marketplace

Use

1. cd into any repo you want to improve.
2. Run /karpathy-autoresearch. If no program.md exists, the skill offers to scaffold one from the template and stops so you can fill it in.
3. Fill in the program: editable file, frozen files, run command, metric, budget, hypotheses.
4. Run /karpathy-autoresearch again. It echoes the program back and asks for authorization.
5. Walk away. Come back to .autoresearch/results.tsv and a git branch of experiments.

Example `program.md` sections

Editable surface: train.py — only file the loop may modify
Frozen surface: evaluate.py, pyproject.toml
Command: uv run train.py > run.log 2>&1
Metric: val_bpb, direction min, parsed via regex on run.log
Budget: 5 min per trial, 50 trials total, stop after 10 consecutive trials without improvement
Hypotheses: cosine LR, larger batch, RMSNorm, remove biases, fused optimizer

Full template: skills/karpathy-autoresearch/references/program.template.md.
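Since you write the metric extraction yourself, here is one possible shape for the "regex on run.log" step. It is a sketch that assumes the training script prints lines such as `val_bpb: 1.234` or `val_bpb=1.234`; the function name `parse_metric` and that log format are assumptions, not something the plugin prescribes. It takes the last match so that the final reported value wins.

```python
import re
from typing import Optional


def parse_metric(log_text: str, name: str) -> Optional[float]:
    """Return the last `name: <number>` or `name=<number>` value in the log, or None."""
    pattern = re.compile(
        rf"\b{re.escape(name)}\s*[:=]\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)"
    )
    matches = pattern.findall(log_text)
    # No match means the run produced no metric; the caller should treat that as a failed trial.
    return float(matches[-1]) if matches else None
```

Returning `None` rather than a default number keeps a crashed or truncated run from being mistaken for an improvement.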

Safety

Refuses to start without a program.md — no silent defaults
Requires explicit user authorization of the total budget
Frozen files are never touched
Hard wall-clock kill per trial; hard total budget
git reset --hard on every rejected trial — no creeping state
One variable per trial — enforced by convention, logged in TSV
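For illustration, a hard wall-clock kill per trial could look like the sketch below. It uses the standard library's `subprocess.run` with `timeout`; the helper name `run_with_deadline` and its return convention are assumptions for the example, not the plugin's actual mechanism.

```python
import subprocess
from typing import Optional, Sequence


def run_with_deadline(cmd: Sequence[str], timeout_s: float, log_path: str) -> Optional[int]:
    """Run cmd with stdout and stderr written to log_path.

    Returns the exit code, or None if the process was killed at the deadline.
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(
                cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising; processes the
            # child spawned itself may need separate handling.
            return None
    return proc.returncode
```

A call such as `run_with_deadline(["uv", "run", "train.py"], 300, "run.log")` would correspond to the 5-minute budget in the example program, with a `None` result treated as a rejected trial.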

What this skill is NOT

Not a web research tool
Not a benchmark harness (you write the metric extraction)
Not a hyperparameter grid search — it is directed and hypothesis-led
Not unbounded: it stops cleanly when the budget is spent or the no-improvement streak limit is reached

Credits

Pattern by Andrej Karpathy — https://github.com/karpathy/autoresearch. This plugin is an independent port to Claude Code; no code from Karpathy's repo is included.

License

MIT
