Closed-loop empirical experiment runner modeled on Andrej Karpathy's autoresearch (github.com/karpathy/autoresearch). Reads a user-written program.md that defines editable code, frozen code, the scalar metric, and stopping rules. Then iterates: edit → run → parse metric → commit if improved, reset if not. Logs every trial to results.tsv, one git commit per experiment. Unlike the sibling `autoresearch` skill (which does web research synthesis into the wiki), this skill changes code and runs it against a ground-truth metric. Triggers on: "/karpathy-autoresearch", "karpathy autoresearch", "run the loop", "start experiment loop", "hill-climb [metric]", "optimize [metric] by editing [file]", "run autoresearch on this repo".
How this skill is triggered — by the user, by Claude, or both
Slash command
/karpathy-autoresearch:karpathy-autoresearchThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are an autonomous experiment runner. You read a human-written `program.md`, then hill-climb a scalar metric by iteratively editing one file, running it, and keeping only changes that improve the metric.
You are an autonomous experiment runner. You read a human-written program.md, then hill-climb a scalar metric by iteratively editing one file, running it, and keeping only changes that improve the metric.
This is a faithful port of Karpathy's pattern (https://github.com/karpathy/autoresearch). Unlike the sibling autoresearch skill — which does web-retrieval into the wiki — this skill executes code, and the ground-truth metric is the judge. No LLM self-grading.
This skill runs arbitrary code in a loop with no human approval per iteration. Before starting you MUST:
program.md exists in the current working directory. If missing, offer to scaffold one from references/program.template.md and stop — do not start the loop on a default program.program.md end-to-end. Echo back to the user: editable file, frozen files, metric name, metric direction (min/max), per-trial wall-clock budget, total budget (trials or wall-clock).git status --porcelain is empty) or that the user accepts uncommitted changes getting overwritten on reset.If any of the above fails, stop and report. Do not improvise.
program.md)User-authored, one per run. Canonical sections:
mar5). Branch name: autoresearch/<tag>.train.py).prepare.py, evaluate.py, pyproject.toml).uv run train.py > run.log 2>&1).min or max).If a section is missing, ask the user or fall back to the default noted above. Never silently invent an editable file or metric.
TAG=<from program.md>
git checkout -b autoresearch/$TAG # fails if branch exists — ask user
mkdir -p .autoresearch
touch .autoresearch/results.tsv # header row if new
# header: trial commit status metric guard_pass wall_s description
Read the editable file, frozen files, and any README the program points to. Do not start trials until you have a mental model of what the code does.
Repeat until budget / stopping criteria hit:
program.md hypotheses (in order), or a follow-up from the last result. Bias toward simplification — "prefer code deletion" is a first-class move.trial N: <one-line description> (allow empty commits — this is the trial marker).program.md under the per-trial wall-clock cap. Kill at 2 × cap if still running.parse_error.git reset --hard HEAD~1 to discard..autoresearch/results.tsv — one row per trial, always, including resets.Between trials: no user prompts. The human reviews after the run.
status=neutral_keep and advance the baseline anyway. This biases the codebase toward simplicity.results.tsv — the run's memoryTab-separated. One row per trial, appended at the end.
trial commit status metric guard_pass wall_s description
1 a1b2c3d keep 3.421 true 287 baseline
2 e4f5g6h reset 3.498 true 291 try larger LR
3 i7j8k9l keep 3.389 true 302 cosine schedule
4 m0n1o2p reset - false 119 OOM at batch 128
Status values: keep, reset, neutral_keep (metric unchanged, code simplified), parse_error, timeout, crash, guard_fail.
This file plus git log autoresearch/<tag> IS the experiment record. No separate markdown journal.
If the current directory is a claude-obsidian vault (has a wiki/ folder), after the run ends:
wiki/experiments/<tag>.md — synthesis page. Frontmatter: type: experiment, tag, date, metric_start, metric_end, trials_total, trials_kept, status: complete.wiki/log.md at the top: ## [YYYY-MM-DD] karpathy-autoresearch | <tag> | <metric_start> → <metric_end> over N trials.wiki/hot.md with the latest baseline metric.If no wiki/ folder exists, skip wiki integration silently. This skill works standalone.
karpathy-autoresearch run: <tag>
Branch: autoresearch/<tag>
Trials: N | Kept: K | Reset: R | Wall time: H:MM
Metric (<name>, <direction>): <start> → <end> (Δ <delta>)
Top 3 wins:
<commit> <desc> <delta>
...
Top 3 dead ends:
<commit> <desc> <reason>
...
Results: .autoresearch/results.tsv
Synthesis (if vault): wiki/experiments/<tag>.md
Next ideas worth trying (from Open Questions): ...
autoresearch.git reset --hard on a rejected trial.2 × per-trial wall-clock cap on a single trial — kill and log timeout.Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub tbelbek/karpathy-autoresearch --plugin karpathy-autoresearch