From kivi-claude-skills
Autonomous ML experiment loop inspired by Karpathy's autoresearch. Modifies training code, runs experiments with fixed time budgets, evaluates metrics, keeps improvements and discards failures via git. Use when the user wants to autonomously optimize ML training, run experiment loops, auto-tune hyperparameters, or set up autonomous research agents. Triggers include "autoresearch", "experiment loop", "autonomous training", "auto-optimize", "hyperparameter search".
Slash command: /kivi-claude-skills:autoresearch
Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). An AI agent autonomously modifies your training code, runs experiments on a fixed time budget, keeps improvements, discards failures, and iterates — all tracked via git commits and a results log.
- /autoresearch setup <script-path>: Initialize a research session
- /autoresearch run: Start the autonomous experiment loop
- /autoresearch status: Show experiment progress dashboard
- /autoresearch analyze: Post-session analysis with recommendations

/autoresearch setup <script-path>
Interactive setup phase. Read the code, agree on parameters, create a branch, initialize tracking.
Read the training script at <script-path> and all related files in the same directory. Understand the framework (PyTorch, JAX, TensorFlow, etc.), model architecture, training loop, and evaluation logic.
Ask the user to confirm or provide these parameters:
| Parameter | Description | Default |
|---|---|---|
| run_command | How to execute training (e.g., python train.py, uv run train.py) | python <script-path> |
| metric_name | The evaluation metric to optimize (e.g., val_loss, val_bpb, accuracy) | (must specify) |
| metric_direction | lower (loss-like) or higher (accuracy-like) | lower |
| metric_grep | Grep pattern to extract metric from stdout (e.g., ^val_loss:) | ^<metric_name>: |
| time_budget_minutes | Minutes per experiment | 5 |
| editable_files | Files the agent may modify | [<script-path>] |
| readonly_files | Files to read for context but never modify | [] |
| max_experiments | Safety cap, 0 = unlimited | 50 |
Propose a run tag based on today's date (e.g., mar23). User may override.
Create branch autoresearch/<tag> from current HEAD:
git checkout -b autoresearch/<tag>
Create results.tsv with a header row:
commit <metric_name> memory_gb duration_s status description
Create .autoresearch.json:
{
"tag": "<tag>",
"branch": "autoresearch/<tag>",
"script_path": "<script-path>",
"run_command": "<command>",
"metric_name": "<name>",
"metric_direction": "lower",
"metric_grep": "^<name>:",
"time_budget_minutes": 5,
"timeout_minutes": 10,
"editable_files": ["<script-path>"],
"readonly_files": [],
"max_experiments": 50,
"created_at": "<ISO timestamp>"
}
Add to .gitignore (if not already present): results.tsv, .autoresearch.json, run.log
Confirm setup and show a summary. Await user's go-ahead before running.
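The tracking-file part of setup can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: the helper names `propose_tag` and `init_session` are hypothetical, the tag format for single-digit days is a guess, and branch creation is left to the `git checkout -b` command above.

```python
import json
from datetime import date, datetime, timezone
from pathlib import Path

TRACKING_FILES = ["results.tsv", ".autoresearch.json", "run.log"]
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def propose_tag(today: date) -> str:
    """Build a run tag such as 'mar23' (hypothetical helper)."""
    return f"{MONTHS[today.month - 1]}{today.day}"


def init_session(workdir: Path, script_path: str, metric_name: str, tag: str) -> dict:
    """Write .autoresearch.json, results.tsv and .gitignore entries using the documented defaults."""
    config = {
        "tag": tag,
        "branch": f"autoresearch/{tag}",
        "script_path": script_path,
        "run_command": f"python {script_path}",
        "metric_name": metric_name,
        "metric_direction": "lower",
        "metric_grep": f"^{metric_name}:",
        "time_budget_minutes": 5,
        "timeout_minutes": 10,
        "editable_files": [script_path],
        "readonly_files": [],
        "max_experiments": 50,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    (workdir / ".autoresearch.json").write_text(json.dumps(config, indent=2) + "\n")

    # Header row; tab-separated is assumed from the .tsv extension.
    header = ["commit", metric_name, "memory_gb", "duration_s", "status", "description"]
    (workdir / "results.tsv").write_text("\t".join(header) + "\n")

    # Append only the entries that are not already listed in .gitignore.
    gitignore = workdir / ".gitignore"
    text = gitignore.read_text() if gitignore.exists() else ""
    missing = [name for name in TRACKING_FILES if name not in text.splitlines()]
    if missing:
        sep = "" if text == "" or text.endswith("\n") else "\n"
        gitignore.write_text(text + sep + "\n".join(missing) + "\n")
    return config
```

Calling `init_session(Path("."), "train.py", "val_loss", propose_tag(date.today()))` would leave the three tracking files in place without touching the training script.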
/autoresearch run
The core autonomous experiment loop. Once started, it runs until max_experiments is reached or the user interrupts.
- Read .autoresearch.json for configuration.
- Check that the current branch is autoresearch/<tag>. If not, git checkout autoresearch/<tag>.
- Read results.tsv. If no data rows exist, the first experiment is the baseline (run the script as-is, record with status baseline).
Repeat indefinitely:
Based on previous results in results.tsv, current code state, and ML knowledge, decide what to try next. Use the Experiment Strategy Guide for ideas. Write a one-line description of the experiment before modifying code.
Edit the editable file(s) with the experimental change. One idea per experiment — never combine unrelated changes.
git add <editable-files>
git commit -m "exp: <short description>"
Execute the training command with output redirected and a timeout:
timeout <timeout_minutes>m <run_command> > run.log 2>&1
Use the Bash tool with timeout set to timeout_minutes * 60 * 1000 ms.
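For readers who want to drive the same step from a script rather than the Bash tool, here is a rough Python equivalent of the timed run. The function name `run_experiment` and its return shape are assumptions made for illustration. A non-zero exit code is not treated specially here, since the loop detects crashes through a missing metric.

```python
import subprocess
import time
from pathlib import Path


def run_experiment(run_command: str, timeout_minutes: float, log_path: Path) -> tuple[str, float]:
    """Run the training command with stdout and stderr sent to log_path.

    Returns (outcome, duration_s), where outcome is 'finished' or 'timeout'.
    """
    start = time.monotonic()
    with log_path.open("w") as log:
        try:
            subprocess.run(
                run_command,
                shell=True,                  # run_command is a shell string, e.g. "python train.py"
                stdout=log,
                stderr=subprocess.STDOUT,    # same effect as "> run.log 2>&1"
                timeout=timeout_minutes * 60,
            )
            outcome = "finished"
        except subprocess.TimeoutExpired:
            outcome = "timeout"
    return outcome, time.monotonic() - start
```

With `shell=True`, a timeout kills the shell itself; processes the shell spawned may need separate cleanup, which this sketch does not attempt.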
Extract the metric:
grep '<metric_grep>' run.log
If grep returns nothing, the run crashed. Go to Step 6a.
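The grep step could be mirrored in Python roughly like this. It assumes the metric is printed as a plain number after the matched prefix and that the last matching line is the one that counts; both are assumptions, and `extract_metric` is a hypothetical name. Simple patterns such as `^val_loss:` behave the same in grep and in Python's `re`.

```python
import re
from typing import Optional

# Matches integers, decimals and scientific notation such as 1e-3.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def extract_metric(log_text: str, metric_grep: str) -> Optional[float]:
    """Return the number following the last line that matches metric_grep, or None."""
    line_pattern = re.compile(metric_grep)
    value = None
    for line in log_text.splitlines():
        match = line_pattern.search(line)
        if match is None:
            continue
        found = NUMBER.search(line, match.end())
        if found:
            value = float(found.group())  # later matches overwrite earlier ones
    return value
```

A return value of `None` corresponds to "grep returns nothing", so the caller would treat the run as a crash. A metric printed as `nan` is also reported as `None` by this sketch.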
6a. Crash:
- Inspect the failure with tail -n 50 run.log.
- Record the status crash and move on.
- Log in results.tsv: metric=0.000000, memory_gb=0.0, duration_s=0, status=crash.
6b. Metric improved (lower for lower, higher for higher):
- Status keep: the branch advances and this commit stays.
6c. Metric equal or worse:
- Status discard: revert the commit with git reset --hard HEAD~1.
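The three outcomes of step 6 reduce to a small comparison. The sketch below is one way to express it; `decide_status` is a hypothetical helper, and a missing best value is taken to mean the run is the baseline. It only returns the status and leaves the git revert to the caller.

```python
from typing import Optional


def decide_status(new_value: Optional[float], best_value: Optional[float], direction: str) -> str:
    """Map a run's metric to crash, baseline, keep or discard."""
    if new_value is None:
        return "crash"       # no metric line found in run.log
    if best_value is None:
        return "baseline"    # nothing to compare against yet
    if direction == "lower":
        improved = new_value < best_value
    elif direction == "higher":
        improved = new_value > best_value
    else:
        raise ValueError(f"unknown metric_direction: {direction!r}")
    # Equal counts as not improved, matching "Metric equal or worse".
    return "keep" if improved else "discard"
```

Because the comparison is strict, a run that ties the current best is discarded.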
Append a row to results.tsv:
<commit-7chars> <metric_value> <memory_gb> <duration_s> <status> <description>
Extract memory from run.log if available (grep for peak memory, VRAM, etc.). If not available, use 0.0.
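Appending the row might look like the following. The number formatting (six decimals for the metric, one for memory, whole seconds) is inferred from the crash values given in step 6a, and tab separation is assumed from the .tsv extension; `append_result` is an illustrative name.

```python
from pathlib import Path


def append_result(results_path: Path, commit: str, metric_value: float,
                  memory_gb: float, duration_s: float, status: str,
                  description: str) -> None:
    """Append one experiment row to results.tsv."""
    # Tabs or newlines inside the description would break the row layout.
    clean_description = " ".join(description.split())
    row = [
        commit[:7],               # short commit hash
        f"{metric_value:.6f}",
        f"{memory_gb:.1f}",
        f"{duration_s:.0f}",
        status,
        clean_description,
    ]
    with results_path.open("a") as fh:
        fh.write("\t".join(row) + "\n")
```

For a crash, the caller would pass `0.0` for the metric, memory and duration, which yields the `0.000000`, `0.0` and `0` values described above.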
Do NOT ask the user if you should continue. Do NOT pause.
After each experiment, print a one-line summary:
#<N> <commit> <metric_value> <status> <description>
/autoresearch status
Show an ASCII dashboard of current progress.
Read .autoresearch.json and results.tsv, then print:
================================================================
AUTORESEARCH · <tag> · <script_path>
================================================================
EXPERIMENTS [################....] 16/50
METRIC <metric_name> (<direction>)
BEST <best_value> (exp #<N>, commit <hash>)
BASELINE <baseline_value>
IMPROVEMENT <delta> (<percentage>%)
RECENT EXPERIMENTS
#16 a1b2c3d 0.993 keep increase LR to 0.04
#15 b2c3d4e 1.005 discard switch to GeLU activation
#14 c3d4e5f 0.000 crash double model width (OOM)
#13 d4e5f6a 0.995 keep reduce depth to 6 layers
STATUS BREAKDOWN
keep: 8 (50%)
discard: 5 (31%)
crash: 3 (19%)
================================================================
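The numbers on the dashboard can be derived from the results rows with a few lines of Python. The sketch below makes several assumptions: rows are dicts with `metric` and `status` keys, experiments are numbered from 1 in file order, crashes are ignored when picking the best value, and improvement is reported as a positive number when the metric moved in the desired direction. `summarize` is a hypothetical helper, and rendering the progress bar is left out.

```python
from collections import Counter


def summarize(rows: list[dict], direction: str) -> dict:
    """Compute baseline, best value, improvement and status breakdown."""
    total = len(rows)
    counts = Counter(row["status"] for row in rows)
    breakdown = {s: (n, round(100 * n / total)) for s, n in counts.items()}

    scored = [(i, row) for i, row in enumerate(rows, start=1) if row["status"] != "crash"]
    if not scored:
        return {"total": total, "breakdown": breakdown}

    # The baseline is the row marked 'baseline', else the first non-crash row.
    baseline = next((r for _, r in scored if r["status"] == "baseline"), scored[0][1])["metric"]
    pick = min if direction == "lower" else max
    best_index, best_row = pick(scored, key=lambda item: item[1]["metric"])
    best = best_row["metric"]

    delta = baseline - best if direction == "lower" else best - baseline
    percent = 100 * delta / abs(baseline) if baseline else 0.0
    return {
        "total": total,
        "baseline": baseline,
        "best": best,
        "best_index": best_index,
        "delta": delta,
        "percent": percent,
        "breakdown": breakdown,
    }
```

The returned dict maps onto the BEST, BASELINE, IMPROVEMENT and STATUS BREAKDOWN lines of the dashboard.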
/autoresearch analyze
Post-session analysis. Run after stopping the loop.
- Read .autoresearch.json and results.tsv.
- Run git log --oneline autoresearch/<tag> to see kept commits.
- Write autoresearch-<tag>-analysis.md with:
Summary:
What Worked:
- List keep experiments with descriptions, grouped by category (hyperparams, architecture, optimizer, etc.)
What Didn't Work:
- List discard experiments, identify patterns (e.g., "all activation function changes were discarded")
Crashes:
Metric Progression:
Recommendations:
Experiment Strategy Guide
Ordered by risk level. Start low, escalate gradually.
- After a keep, try variations on the same theme (e.g., if LR 0.02 helped, try 0.03).
- After repeated discard results in one category, switch categories.

- Branch autoresearch/<tag> is created from current HEAD.
- keep → commit stays, branch advances.
- discard → git reset --hard HEAD~1 reverts the commit.
- crash (with fix) → revert failing commit, apply fix, new commit, re-run.
- results.tsv, .autoresearch.json, run.log are untracked (in .gitignore).

- Only modify files listed in editable_files.
- Never commit results.tsv, .autoresearch.json, or run.log to git.
- Redirect all training output to run.log: <command> > run.log 2>&1
- After a crash, read the end of the log: tail -n 50 run.log

Install: npx claudepluginhub phoxiao/kivi-claude-skills --plugin kivi-claude-skills