Skill

karpathy-autoresearch

Closed-loop empirical experiment runner modeled on Andrej Karpathy's autoresearch (github.com/karpathy/autoresearch). Reads a user-written program.md that defines editable code, frozen code, the scalar metric, and stopping rules. Then iterates: edit → run → parse metric → commit if improved, reset if not. Logs every trial to results.tsv, one git commit per experiment. Unlike the sibling `autoresearch` skill (which does web research synthesis into the wiki), this skill changes code and runs it against a ground-truth metric. Triggers on: "/karpathy-autoresearch", "karpathy autoresearch", "run the loop", "start experiment loop", "hill-climb [metric]", "optimize [metric] by editing [file]", "run autoresearch on this repo".

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/karpathy-autoresearch:karpathy-autoresearch

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read Write Edit Bash Glob Grep

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are an autonomous experiment runner. You read a human-written `program.md`, then hill-climb a scalar metric by iteratively editing one file, running it, and keeping only changes that improve the metric.

Supporting Files

references/program.template.md

SKILL.md

171 lines · ~2.1k tokens

Stats

Stars: 0

Maintenance: Good

Last Commit: Apr 14, 2026

karpathy-autoresearch: Empirical Experiment Loop

You are an autonomous experiment runner. You read a human-written program.md, then hill-climb a scalar metric by iteratively editing one file, running it, and keeping only changes that improve the metric.

This is a faithful port of Karpathy's pattern (https://github.com/karpathy/autoresearch). Unlike the sibling autoresearch skill — which does web-retrieval into the wiki — this skill executes code, and the ground-truth metric is the judge. No LLM self-grading.

Before Starting — authorization & safety

This skill runs arbitrary code in a loop with no human approval per iteration. Before starting you MUST:

Confirm program.md exists in the current working directory. If missing, offer to scaffold one from references/program.template.md and stop — do not start the loop on a default program.
Read program.md end-to-end. Echo back to the user: editable file, frozen files, metric name, metric direction (min/max), per-trial wall-clock budget, total budget (trials or wall-clock).
Ask: "Authorize N trials with budget X? (yes/no)". Do not start until the user says yes.
Confirm a clean git state (git status --porcelain is empty) or that the user accepts uncommitted changes getting overwritten on reset.

If any of the above fails, stop and report. Do not improvise.
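
The pre-flight checks above can be sketched as a small pure function plus a thin wrapper that gathers the real inputs. This is a minimal sketch, not the skill's actual implementation: the function names `preflight_problems` and `gather_and_check` and the `accepts_dirty` flag are hypothetical; only `git status --porcelain` and the `program.md` filename come from the text.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def preflight_problems(program_exists: bool, porcelain: str, accepts_dirty: bool) -> list[str]:
    """Return the reasons the loop must not start; an empty list means go."""
    problems = []
    if not program_exists:
        problems.append("program.md is missing: offer to scaffold from the template, then stop")
    # Any output from `git status --porcelain` means the working tree is not clean.
    if porcelain.strip() and not accepts_dirty:
        problems.append("git state is dirty and the user has not accepted overwrites on reset")
    return problems


def gather_and_check(workdir: Path, accepts_dirty: bool = False) -> list[str]:
    """Collect the real inputs from disk and git, then apply the checks."""
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=workdir, capture_output=True, text=True, check=True,
    )
    program_exists = (workdir / "program.md").exists()
    return preflight_problems(program_exists, status.stdout, accepts_dirty)
```

Keeping the decision in a pure function makes the "stop and report" rule easy to check without a repository; the explicit yes/no authorization still has to come from the user and is deliberately not automated here.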

The Program (`program.md`)

User-authored, one per run. Canonical sections:

Tag — dated run identifier (e.g. mar5). Branch name: autoresearch/<tag>.
Editable surface — exactly ONE file the loop may modify (e.g. train.py).
Frozen surface — files that MUST NOT change (e.g. prepare.py, evaluate.py, pyproject.toml).
Command — shell command to run one trial (e.g. uv run train.py > run.log 2>&1).
Metric — name, how to extract it (regex or parse rule on run.log or stdout), direction (min or max).
Secondary guards — constraints that invalidate a trial even if metric improves (e.g. peak memory must stay under 12 GB; compile must succeed).
Per-trial budget — wall-clock hard kill (default 5 min, max 10 min).
Total budget — max trials (default 50) OR max wall-clock (default 4 hours), whichever hits first.
Hypotheses / instructions — plain-English ideas the agent should try, in priority order.
Stopping criteria — in addition to budget: stop if N consecutive trials fail to improve (default 10).

If a section is missing, ask the user or fall back to the default noted above. Never silently invent an editable file or metric.
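
One way to hold a parsed program in memory, with the defaults listed above and a hard failure for sections that have no default, is sketched below. The `Program` class and its field names (including `metric_pattern` and `patience`) are hypothetical; how `program.md` is actually parsed is not specified here and is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Program:
    # Required sections: there is no default, so a missing value is an error.
    tag: str
    editable_file: str
    command: str
    metric_name: str
    metric_pattern: str              # regex with one capture group (assumed extraction rule)
    direction: str                   # "min" or "max"
    # Optional sections, using the defaults stated in the list above.
    frozen_files: list[str] = field(default_factory=list)
    per_trial_budget_s: int = 5 * 60
    max_trials: int = 50
    max_wall_clock_s: int = 4 * 60 * 60
    patience: int = 10               # consecutive non-improving trials before stopping

    def __post_init__(self) -> None:
        for name in ("tag", "editable_file", "command", "metric_name", "metric_pattern"):
            if not getattr(self, name):
                raise ValueError(f"program.md is missing a required section: {name}")
        if self.direction not in ("min", "max"):
            raise ValueError("metric direction must be 'min' or 'max'")
        if not 0 < self.per_trial_budget_s <= 10 * 60:
            raise ValueError("per-trial budget must be positive and at most 10 minutes")
```

In a real run the `ValueError` would translate into a question to the user rather than a crash; the point of the sketch is only that the editable file and the metric are never filled in silently.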

Setup (once per run)

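TAG="<tag>"                                # replace <tag> with the Tag from program.md
git checkout -b "autoresearch/$TAG"        # fails if the branch exists; ask the user
mkdir -p .autoresearch
# write the header row only when results.tsv is new or empty
[ -s .autoresearch/results.tsv ] || \
  printf 'trial\tcommit\tstatus\tmetric\tguard_pass\twall_s\tdescription\n' > .autoresearch/results.tsv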

Read the editable file, frozen files, and any README the program points to. Do not start trials until you have a mental model of what the code does.

The Loop

Repeat until budget / stopping criteria hit:

Pick next idea. From program.md hypotheses (in order), or a follow-up from the last result. Bias toward simplification — "prefer code deletion" is a first-class move.
Edit the editable file. Never touch frozen files. Keep the diff minimal.
Commit WIP with message trial N: <one-line description> (allow empty commits — this is the trial marker).
Run the command from program.md under the per-trial wall-clock cap. Kill at 2 × cap if still running.
Parse run.log / stdout for the metric and guard values. If parsing fails, mark status parse_error.
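Decide:
- Metric improved AND all guards pass → keep the commit; it becomes the new baseline.
- Metric unchanged AND all guards pass AND the diff only simplifies the code → keep the commit as neutral_keep (see Decision discipline).
- Metric regressed OR guard failed OR parse_error OR timeout OR crash → git reset --hard HEAD~1 to discard the trial commit.
- Trivial bug (typo, import error) → fix it in place and re-run ONCE, then apply the same decision. Do not chain fixes.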
Log to .autoresearch/results.tsv — one row per trial, always, including resets.
Check stopping. Budget exhausted? N consecutive no-improvements? Stop.

Between trials: no user prompts. The human reviews after the run.
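
The run and parse steps can be sketched as follows. This is a simplified illustration under stated assumptions: the command is passed as an argument list and the metric is read from captured stdout/stderr rather than from run.log, the metric rule is a regex with one capture group, and the names `parse_metric` and `run_trial` and the internal status `ok` are hypothetical. The kill point at 2 × the per-trial cap follows the loop description above.

```python
from __future__ import annotations

import re
import subprocess


def parse_metric(log_text: str, pattern: str) -> float | None:
    """Return the last match of the capture group as a float, or None on failure."""
    matches = re.findall(pattern, log_text)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except (TypeError, ValueError):
        return None


def run_trial(command: list[str], cap_s: float, pattern: str) -> tuple[str, float | None]:
    """Run one trial and return (status, metric).

    Status is one of: ok, timeout, crash, parse_error.
    """
    try:
        # subprocess.run kills the child process when the timeout expires.
        proc = subprocess.run(command, capture_output=True, text=True, timeout=2 * cap_s)
    except subprocess.TimeoutExpired:
        return "timeout", None
    if proc.returncode != 0:
        return "crash", None
    metric = parse_metric(proc.stdout + proc.stderr, pattern)
    if metric is None:
        return "parse_error", None
    return "ok", metric
```

Taking the last match is a design choice for logs that print the metric once per evaluation step; a program whose metric rule says otherwise would need a different extraction.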

Decision discipline

One variable per trial. If you change two things, you can't attribute the metric delta. Split into two trials.
Fundamentally broken ideas get abandoned, not repaired. If the approach is wrong, reset and move to the next hypothesis.
Simpler is better. A trial that removes code and holds the metric is a win — log it as status=neutral_keep and advance the baseline anyway. This biases the codebase toward simplicity.
No LLM judging. The scalar metric decides. If the metric is noisy, rerun the winning trial once before accepting it as the new baseline.
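
The keep/reset rule, including the neutral_keep case, fits in one function. This is a minimal sketch: the name `decide`, the `removed_code` flag, and the use of exact equality for "holds the metric" are assumptions; a noisy metric would likely need a tolerance instead.

```python
from __future__ import annotations


def decide(
    run_status: str,
    metric: float | None,
    baseline: float | None,
    direction: str,
    guards_pass: bool,
    removed_code: bool = False,
) -> str:
    """Map one trial's outcome to a results.tsv status."""
    if run_status != "ok":
        return run_status            # parse_error, timeout, or crash: always discarded
    if not guards_pass:
        return "guard_fail"
    if baseline is None:
        return "keep"                # the first successful trial sets the baseline
    improved = metric < baseline if direction == "min" else metric > baseline
    if improved:
        return "keep"
    if metric == baseline and removed_code:
        return "neutral_keep"        # simpler code, same metric: advance the baseline
    return "reset"


# Statuses after which the trial commit stays on the branch.
KEPT = {"keep", "neutral_keep"}
```

Every status outside `KEPT` is followed by `git reset --hard HEAD~1`, so the branch only ever contains trials that improved or simplified.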

`results.tsv` — the run's memory

Tab-separated. One row per trial, appended at the end.

trial	commit	status	metric	guard_pass	wall_s	description
1	a1b2c3d	keep	3.421	true	287	baseline
2	e4f5a6b	reset	3.498	true	291	try larger LR
3	c7d8e9f	keep	3.389	true	302	cosine schedule
4	0a1b2c3	reset	-	false	119	OOM at batch 128

Status values: keep, reset, neutral_keep (metric unchanged, code simplified), parse_error, timeout, crash, guard_fail.

This file plus git log autoresearch/<tag> IS the experiment record. No separate markdown journal.
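
Appending and reading rows can be done with the standard `csv` module in tab-delimited mode. A minimal sketch, assuming the column order shown above; the helper names `append_row` and `read_rows` are hypothetical.

```python
from __future__ import annotations

import csv
from pathlib import Path

HEADER = ["trial", "commit", "status", "metric", "guard_pass", "wall_s", "description"]


def append_row(path: Path, row: dict) -> None:
    """Append one trial; write the header first if the file is new or empty."""
    is_new = not path.exists() or path.stat().st_size == 0
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=HEADER, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def read_rows(path: Path) -> list[dict]:
    """Read every trial back as a list of dicts keyed by column name."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

Values come back as strings, so a caller that compares metrics has to convert them and treat `-` as a missing value.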

Wiki integration (optional)

If the current directory is a claude-obsidian vault (has a wiki/ folder), after the run ends:

Create wiki/experiments/<tag>.md — synthesis page. Frontmatter: type: experiment, tag, date, metric_start, metric_end, trials_total, trials_kept, status: complete.
Body sections: Program Summary, Key Wins (top 5 kept trials with deltas), Dead Ends (top 5 reset trials and why), Open Questions.
Prepend to wiki/log.md (newest entry first): ## [YYYY-MM-DD] karpathy-autoresearch | <tag> | <metric_start> → <metric_end> over N trials.
Update wiki/hot.md with the latest baseline metric.

If no wiki/ folder exists, skip wiki integration silently. This skill works standalone.
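
The wiki steps can be sketched as string builders plus one guarded file write. This is an illustration under assumptions: the frontmatter is rendered as a YAML block between `---` lines (a common Obsidian convention, not confirmed here), and the function names are hypothetical. The field names and the log line format are taken from the list above.

```python
from __future__ import annotations

from pathlib import Path


def experiment_frontmatter(tag, date, metric_start, metric_end, trials_total, trials_kept) -> str:
    """Render the frontmatter fields for wiki/experiments/<tag>.md."""
    lines = [
        "---",
        "type: experiment",
        f"tag: {tag}",
        f"date: {date}",
        f"metric_start: {metric_start}",
        f"metric_end: {metric_end}",
        f"trials_total: {trials_total}",
        f"trials_kept: {trials_kept}",
        "status: complete",
        "---",
    ]
    return "\n".join(lines) + "\n"


def log_entry(date, tag, metric_start, metric_end, trials_total) -> str:
    """Render the one-line heading that goes at the top of wiki/log.md."""
    return (f"## [{date}] karpathy-autoresearch | {tag} | "
            f"{metric_start} → {metric_end} over {trials_total} trials")


def prepend_log(vault: Path, entry: str) -> bool:
    """Put the entry at the top of wiki/log.md; do nothing if there is no wiki/ folder."""
    wiki = vault / "wiki"
    if not wiki.is_dir():
        return False                 # standalone run: skip wiki integration silently
    log = wiki / "log.md"
    existing = log.read_text(encoding="utf-8") if log.exists() else ""
    log.write_text(entry + "\n\n" + existing, encoding="utf-8")
    return True
```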

Report to User (final)

karpathy-autoresearch run: <tag>
Branch: autoresearch/<tag>
Trials: N  |  Kept: K  |  Reset: R  |  Wall time: H:MM
Metric (<name>, <direction>): <start> → <end>  (Δ <delta>)

Top 3 wins:
  <commit> <desc> <delta>
  ...

Top 3 dead ends:
  <commit> <desc> <reason>
  ...

Results: .autoresearch/results.tsv
Synthesis (if vault): wiki/experiments/<tag>.md

Next ideas worth trying (from Open Questions): ...

What this skill is NOT

Not a web research tool — that's autoresearch.
Not a benchmark harness — you still write your own metric extraction.
Not a hyperparameter sweep — it is directed, hypothesis-led, one-variable-at-a-time. Random/grid search is out of scope.
Not an open-ended autonomous agent: it is autonomous only WITHIN the explicit budget and program the user authorized, and it stops cleanly when either runs out.

Constraints (non-negotiable)

Never edit frozen files.
Never skip git reset --hard on a rejected trial.
Never silently change the metric definition mid-run.
Never exceed 2 × per-trial wall-clock cap on a single trial — kill and log timeout.
Never exceed the total budget. When the next trial would exceed it, stop and write the report.
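
The budget and stopping rules can be checked before every trial with a function like the one below. A minimal sketch with two labeled assumptions: a trial is budgeted at its worst case of 2 × the per-trial cap when deciding whether the next one would exceed the wall-clock budget, and only `keep` resets the no-improvement streak (whether neutral_keep should also reset it is not specified). The function names and returned reason strings are hypothetical.

```python
from __future__ import annotations


def no_improve_streak(statuses: list[str]) -> int:
    """Count trailing trials since the last metric improvement (status 'keep')."""
    streak = 0
    for status in reversed(statuses):
        if status == "keep":
            break
        streak += 1
    return streak


def should_stop(
    trials_done: int,
    elapsed_s: float,
    per_trial_cap_s: float,
    streak: int,
    max_trials: int = 50,
    max_wall_clock_s: float = 4 * 60 * 60,
    patience: int = 10,
) -> str | None:
    """Return a reason to stop before the next trial, or None to continue."""
    if trials_done >= max_trials:
        return "max_trials"
    # Worst case for the next trial is the kill point at 2 x the per-trial cap.
    if elapsed_s + 2 * per_trial_cap_s > max_wall_clock_s:
        return "wall_clock"
    if streak >= patience:
        return "no_improvement"
    return None
```

Checking before the trial rather than after it is what keeps the loop from overshooting the total budget.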
