From autoresearch
Autonomously optimize any Claude Code skill by running it repeatedly, scoring outputs on a 0-100.00 scale, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on. Outputs: an improved SKILL.md, a results log, screenshots of every run, and a changelog of every mutation tried.
Slash command: /autoresearch:autoresearch
Most skills work about 70% of the time. The other 30% you get garbage. The fix isn't to rewrite the skill from scratch. It's to let an agent run it dozens of times, score every output, and tighten the prompt until that 30% disappears.
This skill adapts Andrej Karpathy's autoresearch methodology (autonomous experimentation loops) to Claude Code skills. Instead of optimizing ML training code, we optimize skill prompts. Like Karpathy's version, we run real experiments and measure real results — no simulations.
Take any existing skill, define what "good output" looks like as scored eval criteria (0-100.00 scale), then run an autonomous loop that:
Iron rule: real execution, not simulation. Karpathy's autoresearch runs real train.py and measures real val_bpb. We run real skills and score real outputs. Never simulate what a skill "would produce" — actually run it and see.
Output: An improved SKILL.md + results.tsv log + changelog.md + screenshots of every run + a dashboard linking to all evidence.
STOP. Do not run any experiments until all fields below are confirmed with the user. Ask for any missing fields before proceeding.
.html, render in browser, take screenshot
.md or .txt
Before changing anything, read and understand the target skill completely.
references/ that the skill links to
Do NOT skip this. You need to understand what the skill does before you can improve it.
The skill's parent directory MUST be a git repository. This is how we ensure full rollback safety.
Check for git: Run git rev-parse --git-dir in the skill's parent directory. If it fails, initialize: git init && git add -A && git commit -m "snapshot before autoresearch".
Verify clean state: Run git status. If there are uncommitted changes, commit them first: git add -A && git commit -m "pre-autoresearch snapshot". Never start on a dirty tree.
Create a dedicated branch: Generate a tag from today's date and skill name:
git checkout -b autoresearch/<skill-name>-<YYYY-MM-DD>
The branch name must not already exist. If it does, append -2, -3, etc.
Confirm: The original skill is now safely preserved on the previous branch (usually main or master). You can always get back to it with git checkout main.
Why this matters: Every experiment becomes a git commit. If autoresearch makes things worse, you git checkout main and the original is untouched. If you run autoresearch multiple times, each run is its own branch — no backup file overwriting.
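The branch-naming rule above can be sketched as a small helper. This is a minimal illustration, not part of the plugin: the function name `unique_branch_name` is hypothetical, and the set of existing branch names is assumed to be collected by the caller (for example from the output of `git branch`).

```python
from datetime import date


def unique_branch_name(skill_name: str, existing: set, today: date) -> str:
    """Return autoresearch/<skill-name>-<YYYY-MM-DD>, adding -2, -3, ... on collision."""
    base = f"autoresearch/{skill_name}-{today.isoformat()}"
    if base not in existing:
        return base
    suffix = 2
    # Walk the suffixes until a free branch name is found.
    while f"{base}-{suffix}" in existing:
        suffix += 1
    return f"{base}-{suffix}"


if __name__ == "__main__":
    taken = {"main", "autoresearch/landing-page-2025-01-15"}
    # The plain name is taken, so the helper falls through to the -2 suffix.
    print(unique_branch_name("landing-page", taken, date(2025, 1, 15)))
```

The resulting name would then be passed to `git checkout -b`.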
Convert the user's eval criteria into a structured test. Each eval is scored on a 0-100.00 scale — this gives fine-grained signal while keeping scoring consistent.
Format each eval as:
EVAL [number]: [Short name] (weight: [1-3])
Question: [What specific quality does this measure?]
Scoring guide:
95-100: Exceptional — exceeds expectations, no issues
80-94: Good — meets expectations with minor issues
60-79: Acceptable — works but has notable gaps
40-59: Below average — significant issues
0-39: Poor — fails to meet the criteria
Rules for good evals:
See references/eval-guide.md for detailed examples of good vs bad evals.
Final score calculation:
final_score = weighted_average(all_eval_scores_across_all_runs)
The final score is a single number from 0.00 to 100.00. Example: 4 evals × 2 runs = 8 individual scores → weighted average → one final score like 78.25.
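One plausible reading of the weighted average is sketched below. It assumes each individual score counts with its eval's weight (1-3), so that heavier evals pull the final number harder. The function name `final_score` and the eval names and numbers in the example are made up for illustration.

```python
def final_score(scores: dict, weights: dict) -> float:
    """Weighted average of every per-run eval score, rounded to two decimals.

    scores  maps eval name -> list of 0-100 scores, one per run
    weights maps eval name -> integer weight (1-3)
    """
    total = 0.0
    weight_sum = 0
    for name, run_scores in scores.items():
        weight = weights[name]
        for score in run_scores:
            if not 0.0 <= score <= 100.0:
                raise ValueError(f"{name}: score {score} is outside 0-100")
            total += weight * score
            weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no scores to average")
    return round(total / weight_sum, 2)


if __name__ == "__main__":
    # 4 evals x 2 runs = 8 individual scores (illustrative values only).
    weights = {"layout": 3, "copy": 2, "palette": 1, "accessibility": 2}
    scores = {
        "layout": [80.0, 90.0],
        "copy": [70.0, 75.0],
        "palette": [60.0, 62.0],
        "accessibility": [85.0, 80.0],
    }
    print(final_score(scores, weights))  # 1252 / 16 = 78.25
```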
Create the evidence directory structure before running any experiments:
autoresearch-[skill-name]/
├── runs/ # all raw outputs go here
│ ├── exp0-run1-input1.html # naming: exp{N}-run{R}-{input-slug}.{ext}
│ ├── exp0-run1-input1.png # screenshot of the above
│ └── ...
├── dashboard.html # static HTML dashboard (data embedded, no fetch)
├── results.json # structured experiment data
├── results.tsv # tab-separated score log
└── changelog.md # detailed mutation log
Add autoresearch-*/ to .gitignore in the repo root. These files survive git resets.
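As a rough sketch, the directory layout and the `exp{N}-run{R}-{input-slug}.{ext}` naming convention could be set up like this. The helper names (`slugify`, `init_evidence_dir`, `artifact_path`) and the slug rule are assumptions for illustration. The skill itself only fixes the filename pattern.

```python
import re
import tempfile
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase the input name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def init_evidence_dir(parent: Path, skill_name: str) -> Path:
    """Create autoresearch-<skill-name>/runs/ and return the evidence root."""
    root = parent / f"autoresearch-{skill_name}"
    (root / "runs").mkdir(parents=True, exist_ok=True)
    return root


def artifact_path(root: Path, exp: int, run: int, input_name: str, ext: str) -> Path:
    """Build runs/exp{N}-run{R}-{input-slug}.{ext} under the evidence root."""
    filename = f"exp{exp}-run{run}-{slugify(input_name)}.{ext.lstrip('.')}"
    return root / "runs" / filename


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = init_evidence_dir(Path(tmp), "landing-page")
        print(artifact_path(root, 0, 1, "Pricing Page", "html").name)
```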
This plugin includes two hooks that enforce behavior automatically:
PostToolUse hook on Write — When you write ANY .html file to a runs/ directory, the hook automatically:
.png alongside the .html
Stop hook. When the agent tries to stop, the hook checks:
.autoresearch-active flag present? If yes, is results.json status set to "complete"?
To activate hooks: Create the flag file at the start of the run:
touch .autoresearch-active
To deactivate hooks: Remove the flag file when the run is complete:
rm .autoresearch-active
The hooks are dormant unless this flag file exists in the project directory.
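The Stop hook's check can be pictured roughly as follows. This is not the plugin's actual hook code, only a sketch of the logic described above. The function name `stop_allowed`, the `"running"` status value, and the choice to treat a missing or unreadable results.json as "not complete" are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def stop_allowed(flag: Path, results: Path) -> bool:
    """Return True when the agent may stop, False when the run looks unfinished."""
    if not flag.exists():
        return True  # no flag file: the hooks are dormant
    try:
        data = json.loads(results.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False  # assumption: unreadable results count as "not complete"
    return isinstance(data, dict) and data.get("status") == "complete"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        flag = Path(tmp) / ".autoresearch-active"
        results = Path(tmp) / "results.json"
        flag.touch()
        results.write_text(json.dumps({"status": "running"}), encoding="utf-8")
        print(stop_allowed(flag, results))  # False: run still in progress
        results.write_text(json.dumps({"status": "complete"}), encoding="utf-8")
        print(stop_allowed(flag, results))  # True: safe to stop
```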
For visual outputs (HTML/CSS/JS, React components, pages):
runs/exp{N}-run{R}-{input-slug}.html using the Write tool
For text/code outputs:
runs/exp{N}-run{R}-{input-slug}.md (or .py, .ts, etc.)
runs/exp{N}-run{R}-{input-slug}.out
Every run must produce a saved artifact.
Run the skill AS-IS before changing anything. This is experiment #0.
First, activate the hooks:
touch .autoresearch-active
runs/exp0-run1-{input-slug}.{ext}
dashboard.html with the baseline data embedded directly in the HTML (do NOT use fetch; browsers block it from the file:// protocol)
results.tsv format (tab-separated):
experiment commit score status description
0 a1b2c3d 72.50 baseline original skill — no changes
Score is always 0.00-100.00 (weighted average of all eval scores across all runs for that experiment).
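A minimal sketch of writing rows in this format follows. The column names come from the header shown above. The helper names (`tsv_line`, `result_row`, `append_result`) are hypothetical, and the log is handled as a string here so the example stays free of file I/O.

```python
import csv
import io

COLUMNS = ["experiment", "commit", "score", "status", "description"]


def tsv_line(fields: list) -> str:
    """Serialize one row as a tab-separated line ending in a newline."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerow(fields)
    return buf.getvalue()


def result_row(experiment: int, commit: str, score: float, status: str, description: str) -> str:
    """Format one experiment, keeping the score at two decimals on the 0-100 scale."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score {score} is outside 0.00-100.00")
    return tsv_line([experiment, commit, f"{score:.2f}", status, description])


def append_result(existing: str, row: str) -> str:
    """Return the log text with the row appended, adding the header to an empty log."""
    return (existing or tsv_line(COLUMNS)) + row


if __name__ == "__main__":
    log = append_result("", result_row(0, "a1b2c3d", 72.5, "baseline", "original skill, no changes"))
    print(log, end="")
```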
IMPORTANT: After establishing baseline, confirm the score with the user before proceeding. If baseline is already 95+, the skill may not need optimization — ask the user if they want to continue.
This is the core autoresearch loop. Once started, run autonomously until stopped.
LOOP:
Analyze failures. Look at which evals are failing most. Read the actual outputs (and screenshots) that failed. Identify the pattern — is it a formatting issue? A missing instruction? An ambiguous directive?
Form a hypothesis. Pick ONE thing to change. Don't change 5 things at once — you won't know what helped.
Good mutations:
Bad mutations:
Make the change and commit. Edit SKILL.md with ONE targeted mutation, then:
git add SKILL.md
git commit -m "experiment [N]: [short description of change]"
Every experiment is a commit. This is the core safety mechanism.
Run the experiment. Actually execute the skill [N] times with the same test inputs. Save every output. Take screenshots for visual outputs. Name files: runs/exp{N}-run{R}-{input-slug}.{ext}
Score it. Score each eval for each run on a 0.00-100.00 scale from the REAL outputs and screenshots — not from what you think the skill would produce. Calculate the weighted average as the experiment's final score (a single 0.00-100.00 number).
Decide: keep or discard.
git reset --hard HEAD~1 to revert. The change didn't move the needle.
git reset --hard HEAD~1 to revert.
Simplicity criterion (from Karpathy): All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Removing something and getting equal or better results is a great outcome: that is a simplification win.
Log the result in results.tsv and update results.json. These files are gitignored and persist across resets.
Append to changelog.md (also gitignored — see step 6 format below).
Rebuild dashboard.html with updated data embedded inline. Include links to all screenshot files for each experiment.
Repeat. Go back to step 1 of the loop.
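The keep-or-discard step of the loop can be sketched as a pure function. The exact thresholds are an assumption here: the sketch keeps a mutation only when the score strictly improves, or when it matches the best score while simplifying the prompt, and otherwise returns the `git reset --hard HEAD~1` revert. The names `Decision` and `decide` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

REVERT = ["git", "reset", "--hard", "HEAD~1"]


@dataclass
class Decision:
    status: str                          # "kept" or "discard", as logged in results.tsv
    best_score: float                    # best score to carry into the next experiment
    revert_command: Optional[List[str]]  # command to run when discarding, else None


def decide(new_score: float, best_score: float, simplifies: bool = False) -> Decision:
    """Keep strict improvements, or equal scores that simplify the skill."""
    improved = new_score > best_score
    simpler_and_equal = simplifies and new_score == best_score
    if improved or simpler_and_equal:
        return Decision("kept", new_score, None)
    return Decision("discard", best_score, list(REVERT))


if __name__ == "__main__":
    print(decide(88.75, 72.50).status)                   # kept
    print(decide(88.75, 88.75).status)                   # discard
    print(decide(88.75, 88.75, simplifies=True).status)  # kept
```

Returning the revert command instead of executing it keeps the decision testable. The caller would run it, for example with `subprocess.run`.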
NEVER STOP. Once the loop starts, do not pause to ask the user if you should continue. They may be away from the computer. Run autonomously until:
If you run out of ideas: Re-read the failing outputs and screenshots. Try combining two previous near-miss mutations. Try a completely different approach to the same problem. Try removing things instead of adding them. Simplification that maintains the score is a win.
After each experiment (whether kept or discarded), append to changelog.md:
## Experiment [N] — [keep/discard] — `[commit hash or "reverted"]`
**Score:** [XX.XX] / 100.00 (delta: [+/-X.XX] from previous)
**Change:** [One sentence describing what was changed]
**Reasoning:** [Why this change was expected to help]
**Result:** [What actually happened — which evals improved/declined and by how much]
**Per-eval breakdown:** [eval1: XX.XX, eval2: XX.XX, ...]
**Failing outputs:** [Brief description of what still scores low, if anything]
**Evidence:** [List screenshot filenames showing examples]
This changelog is the most valuable artifact. It's a research log that any future agent (or smarter future model) can pick up and continue from.
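For illustration, an entry in the format above could be rendered with a helper like the one below. The function name `changelog_entry` and its parameter list are hypothetical. Only the field labels and their order are taken from the template.

```python
DASH = "\u2014"  # the dash character used in the changelog heading format


def changelog_entry(n: int, kept: bool, commit: str, score: float, previous: float,
                    change: str, reasoning: str, result: str,
                    per_eval: dict, failing: str, evidence: list) -> str:
    """Render one changelog.md entry in the documented field order."""
    status = "keep" if kept else "discard"
    ref = commit if kept else "reverted"  # discarded commits are reset away
    breakdown = ", ".join(f"{name}: {value:.2f}" for name, value in per_eval.items())
    lines = [
        f"## Experiment {n} {DASH} {status} {DASH} `{ref}`",
        f"**Score:** {score:.2f} / 100.00 (delta: {score - previous:+.2f} from previous)",
        f"**Change:** {change}",
        f"**Reasoning:** {reasoning}",
        f"**Result:** {result}",
        f"**Per-eval breakdown:** {breakdown}",
        f"**Failing outputs:** {failing}",
        f"**Evidence:** {', '.join(evidence)}",
    ]
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    print(changelog_entry(
        1, True, "b2c3d4e", 88.75, 72.50,
        "Added specific color palette guidance.",
        "Palette eval scored lowest at baseline.",
        "Palette eval improved; others unchanged.",
        {"layout": 90.0, "palette": 85.0},
        "None noted.",
        ["exp1-run1-pricing.png"],
    ), end="")
```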
When the user returns or the loop stops:
Rebuild the final dashboard with ALL data embedded inline in the HTML. The dashboard must include:
runs/
CRITICAL: Embed all data directly in the HTML as JavaScript variables. Do NOT use fetch(): it fails when the page is opened via the file:// protocol because of browser CORS restrictions.
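One way to embed the data inline is to serialize it to JSON and substitute it into a `<script>` block when the dashboard is generated. The sketch below assumes a results structure with an `experiments` list whose items carry `experiment`, `score`, `status`, and `screenshots` keys. That schema, the `RESULTS` variable, and the `build_dashboard` helper are illustrative, not the skill's actual format.

```python
import json

TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>autoresearch dashboard</title></head>
<body>
<h1>autoresearch dashboard</h1>
<ul id="experiments"></ul>
<script>
const RESULTS = __RESULTS_JSON__;
const list = document.getElementById("experiments");
for (const exp of RESULTS.experiments) {
  const item = document.createElement("li");
  item.textContent = `#${exp.experiment} ${exp.status}: ${exp.score.toFixed(2)}`;
  for (const shot of exp.screenshots) {
    const link = document.createElement("a");
    link.href = shot;  // relative path such as runs/exp0-run1-pricing.png
    link.textContent = " [screenshot]";
    item.appendChild(link);
  }
  list.appendChild(item);
}
</script>
</body>
</html>
"""


def build_dashboard(results: dict) -> str:
    """Return a self-contained HTML page with the results embedded as a JS variable."""
    # Escape "<" so text inside the data cannot close the script element early.
    payload = json.dumps(results, indent=2).replace("<", "\\u003c")
    return TEMPLATE.replace("__RESULTS_JSON__", payload)


if __name__ == "__main__":
    data = {"experiments": [{"experiment": 0, "score": 72.5, "status": "baseline",
                             "screenshots": ["runs/exp0-run1-pricing.png"]}]}
    html = build_dashboard(data)
    print(len(html) > 0)
```

Because the page carries its own data, it can be opened straight from disk without a local server.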
Present a summary:
git checkout main && git merge autoresearch/<branch-name>
git checkout main (branch stays for reference)
git diff main...autoresearch/<branch-name>
git log --oneline autoresearch/<branch-name>
Deactivate hooks: Remove the flag file and update results.json:
rm .autoresearch-active
Update results.json status to "complete" BEFORE removing the flag — the Stop hook checks this.
Open the dashboard: On Windows start dashboard.html, on macOS open dashboard.html.
The skill produces artifacts in two locations:
Git branch (autoresearch/<skill-name>-<date>):
git log of every experiment tried (discards are reset, but logged in changelog)
git diff main shows total improvement
Working directory (autoresearch-[skill-name]/, gitignored):
autoresearch-[skill-name]/
├── runs/ # every output + screenshot from every run
│ ├── exp0-run1-pricing.html
│ ├── exp0-run1-pricing.png
│ ├── exp0-run2-portfolio.html
│ ├── exp0-run2-portfolio.png
│ ├── exp1-run1-pricing.html
│ ├── exp1-run1-pricing.png
│ └── ...
├── dashboard.html # self-contained dashboard with embedded data + screenshot links
├── results.json # structured experiment data with per-eval breakdown
├── results.tsv # tab-separated score log
└── changelog.md # detailed mutation log with commit hashes and evidence links
results.tsv example:
experiment commit score status description
0 a1b2c3d 72.50 baseline original skill — no changes
1 b2c3d4e 88.75 kept added specific color palette guidance
2 reverted 88.75 discard tried enforcing font sizes — no improvement
3 c3d4e5f 94.38 kept added pre-ship checklist for eval criteria
What feeds into autoresearch:
What autoresearch feeds into:
A good autoresearch run:
If the skill "passes" all evals but the actual output quality hasn't improved — the evals are bad, not the skill. Go back to step 2 and write better evals.
npx claudepluginhub lendtrain/autoresearch-for-skills