Skill

autoresearch

Autonomously optimize any Claude Code skill by running it repeatedly, scoring outputs on a 0-100.00 scale, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on. Outputs: an improved SKILL.md, a results log, screenshots of every run, and a changelog of every mutation tried.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/autoresearch:autoresearch

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Most skills work about 70% of the time. The other 30% you get garbage. The fix isn't to rewrite the skill from scratch. It's to let an agent run it dozens of times, score every output, and tighten the prompt until that 30% disappears.

Supporting Files

references/eval-guide.md

SKILL.md

388 lines · ~4.9k tokens

Stats

Language: Shell

Stars: 0

Maintenance: Good

Last commit: Mar 25, 2026

Autoresearch for Skills

This skill adapts Andrej Karpathy's autoresearch methodology (autonomous experimentation loops) to Claude Code skills. Instead of optimizing ML training code, we optimize skill prompts. Like Karpathy's version, we run real experiments and measure real results — no simulations.

the core job

Take any existing skill, define what "good output" looks like as scored eval criteria (0-100.00 scale), then run an autonomous loop that:

Actually executes the skill with real test inputs — producing real outputs
Captures evidence — saves outputs to files and takes screenshots where applicable
Scores every output against the eval criteria
Mutates the skill prompt to fix failures
Keeps mutations that improve the score, discards the rest
Repeats until the score ceiling is hit or the user stops it

Iron rule: real execution, not simulation. Karpathy's autoresearch runs real train.py and measures real val_bpb. We run real skills and score real outputs. Never simulate what a skill "would produce" — actually run it and see.

Output: An improved SKILL.md + results.tsv log + changelog.md + screenshots of every run + a dashboard linking to all evidence.

before starting: gather context

STOP. Do not run any experiments until all fields below are confirmed with the user. Ask for any missing fields before proceeding.

Target skill — Which skill do you want to optimize? (need the exact path to SKILL.md)
Test inputs — What 3-5 different prompts/scenarios should we test the skill with? (variety matters — pick inputs that cover different use cases so we don't overfit to one scenario)
Eval criteria — What 3-6 checks define a good output? Each check is scored 0-100.00 against a written scoring guide. (These are your "test questions" — see references/eval-guide.md for how to write good evals.)
Runs per experiment — How many times should we run the skill per mutation? Default: 3. (more runs = more reliable scores, but slower and more expensive. 3 is the sweet spot for balancing cost with signal.)
Budget cap — Optional. Max number of experiment cycles before stopping. Default: no cap (runs until you stop it).
Output type — What does the skill produce? This determines how we capture evidence:
- Visual (HTML/CSS, components, pages, diagrams): save as .html, render in browser, take screenshot
- Code (functions, scripts, configs): save as source file, run if possible, capture output
- Text (content, docs, copy): save as .md or .txt
- Mixed: save all artifacts and screenshot any visual portion

step 1: read the skill and set up git safety

Before changing anything, read and understand the target skill completely.

Read the full SKILL.md file
Read any files in references/ that the skill links to
Identify the skill's core job, process steps, and output format
Note any existing quality checks or anti-patterns already in the skill

Do NOT skip this. You need to understand what the skill does before you can improve it.

git branching (Karpathy-style safety net)

The skill's parent directory MUST be a git repository. This is how we ensure full rollback safety.

Check for git: Run git rev-parse --git-dir in the skill's parent directory. If it fails, initialize: git init && git add -A && git commit -m "snapshot before autoresearch".
Verify clean state: Run git status. If there are uncommitted changes, commit them first: git add -A && git commit -m "pre-autoresearch snapshot". Never start on a dirty tree.
Create a dedicated branch: Generate a tag from today's date and skill name:
```
git checkout -b autoresearch/<skill-name>-<YYYY-MM-DD>
```
The branch name must not already exist. If it does, append -2, -3, etc.
Confirm: The original skill is now safely preserved on the previous branch (usually main or master). You can always get back to it with git checkout main.

Why this matters: Every experiment becomes a git commit. If autoresearch makes things worse, you git checkout main and the original is untouched. If you run autoresearch multiple times, each run is its own branch — no backup file overwriting.
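The branch-naming rule above can be sketched in a few lines. This is only an illustration: `pick_branch_name` and its arguments are hypothetical names, and the set of existing branches is assumed to be collected separately (for example from the output of `git branch`).

```python
from datetime import date


def pick_branch_name(skill_name: str, existing: set, today: date) -> str:
    """Return autoresearch/<skill-name>-<YYYY-MM-DD>, adding -2, -3, ... on collision."""
    base = f"autoresearch/{skill_name}-{today.isoformat()}"
    if base not in existing:
        return base
    suffix = 2
    # Keep counting until the candidate name is unused.
    while f"{base}-{suffix}" in existing:
        suffix += 1
    return f"{base}-{suffix}"
```

The result is the name to pass to git checkout -b.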

step 2: build the eval suite

Convert the user's eval criteria into a structured test. Each eval is scored on a 0-100.00 scale — this gives fine-grained signal while keeping scoring consistent.

Format each eval as:

EVAL [number]: [Short name] (weight: [1-3])
Question: [What specific quality does this measure?]
Scoring guide:
  95-100: Exceptional — exceeds expectations, no issues
  80-94:  Good — meets expectations with minor issues
  60-79:  Acceptable — works but has notable gaps
  40-59:  Below average — significant issues
  0-39:   Poor — fails to meet the criteria

Rules for good evals:

Score each eval independently on a 0.00 to 100.00 scale. Use decimals for precision (e.g., 87.50, not just 88).
Specific enough to be consistent. "Is the text readable?" is too vague. "All content is legible with proper contrast, no truncated text, and correct spelling" is scorable.
Not so narrow that the skill games the eval. "Contains fewer than 200 words" will make the skill optimize for brevity at the expense of everything else.
3-6 evals is the sweet spot. More than that and the skill starts parroting eval criteria back instead of actually improving.
Each eval can optionally have a weight (1-3) to emphasize critical criteria. Default weight is 1.

See references/eval-guide.md for detailed examples of good vs bad evals.

Final score calculation:

final_score = weighted_average(all_eval_scores_across_all_runs)

The final score is a single number from 0.00 to 100.00. Example: 4 evals × 2 runs = 8 individual scores → weighted average → one final score like 78.25.
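One way to read the formula is sketched below. It assumes each eval's weight applies to every run of that eval, and that scores arrive as a mapping from eval name to a list of per-run scores; the function and argument names are illustrative, not part of the skill.

```python
def final_score(scores: dict, weights: dict) -> float:
    """Weighted average of every eval score across every run, rounded to 2 decimals."""
    total = 0.0
    weight_sum = 0
    for eval_name, run_scores in scores.items():
        weight = weights.get(eval_name, 1)  # default weight is 1
        total += weight * sum(run_scores)
        weight_sum += weight * len(run_scores)
    if weight_sum == 0:
        raise ValueError("no scores to average")
    return round(total / weight_sum, 2)


# 4 evals x 2 runs = 8 individual scores, all with the default weight
example = {
    "layout": [80, 76],
    "contrast": [90, 70],
    "copy": [75, 85],
    "spelling": [70, 80],
}
print(final_score(example, {}))  # 78.25
```

With unequal weights, a weight-3 eval scoring 100 and a weight-1 eval scoring 60 average to 90.00 rather than 80.00.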

step 3: set up the working directory and activate hooks

Create the evidence directory structure before running any experiments:

autoresearch-[skill-name]/
├── runs/                    # all raw outputs go here
│   ├── exp0-run1-input1.html   # naming: exp{N}-run{R}-{input-slug}.{ext}
│   ├── exp0-run1-input1.png    # screenshot of the above
│   └── ...
├── dashboard.html           # static HTML dashboard (data embedded, no fetch)
├── results.json             # structured experiment data
├── results.tsv              # tab-separated score log
└── changelog.md             # detailed mutation log

Add autoresearch-*/ to .gitignore in the repo root. Because these files are untracked and ignored, git reset --hard leaves them in place, so the evidence survives every revert.

Automated hooks (plugin-enforced)

This plugin includes two hooks that enforce behavior automatically:

PostToolUse hook on Write — When you write ANY .html file to a runs/ directory, the hook automatically:
- Opens the file in the headless browser (browse daemon)
- Takes a screenshot and saves it as .png alongside the .html
- You do NOT need to manually screenshot — the hook handles it
Stop hook — When the agent tries to stop, the hook checks:
- Is .autoresearch-active flag present? If yes, is results.json status set to "complete"?
- If the run is incomplete, the hook BLOCKS the agent from stopping

To activate hooks: Create the flag file at the start of the run:

touch .autoresearch-active

To deactivate hooks: Remove the flag file when the run is complete:

rm .autoresearch-active

The hooks are dormant unless this flag file exists in the project directory.
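The Stop hook's decision can be pictured roughly as follows. This is a sketch of the behavior described above, not the plugin's actual hook code: the function name, the location of results.json, and the top-level "status" key are assumptions.

```python
import json
from pathlib import Path


def should_block_stop(project_dir: Path, results_path: Path) -> bool:
    """Return True when the agent should be prevented from stopping."""
    flag = project_dir / ".autoresearch-active"
    if not flag.exists():
        return False  # no flag file: hooks are dormant
    try:
        data = json.loads(results_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return True  # flag is set but results are unreadable: treat as incomplete
    status = data.get("status") if isinstance(data, dict) else None
    return status != "complete"
```

The sketch treats a missing or malformed results.json as an incomplete run; whether the real hook does the same is not stated here.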

Saving outputs

For visual outputs (HTML/CSS/JS, React components, pages):

Save the generated code to runs/exp{N}-run{R}-{input-slug}.html using the Write tool
The PostToolUse hook automatically screenshots it — no manual step needed
The screenshot is the ground truth for visual evals — score from it, not from the code

For text/code outputs:

Save to runs/exp{N}-run{R}-{input-slug}.md (or .py, .ts, etc.)
If the code is runnable, execute it and capture stdout to runs/exp{N}-run{R}-{input-slug}.out

Every run must produce a saved artifact.
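A small helper keeps artifact names consistent with the exp{N}-run{R}-{input-slug}.{ext} pattern. The slug rule here (lowercase, non-alphanumerics collapsed to dashes) is an assumption, since the skill does not define how slugs are derived, and both function names are illustrative.

```python
import re


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics into single dashes."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def artifact_name(experiment: int, run: int, input_label: str, ext: str) -> str:
    """Build runs/exp{N}-run{R}-{input-slug}.{ext} for one saved output."""
    return f"runs/exp{experiment}-run{run}-{slugify(input_label)}.{ext.lstrip('.')}"


print(artifact_name(0, 1, "Pricing page", "html"))  # runs/exp0-run1-pricing-page.html
```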

step 4: establish baseline

Run the skill AS-IS before changing anything. This is experiment #0.

First, activate the hooks:

touch .autoresearch-active

For each test input, actually invoke the skill (via the Skill tool or by following its instructions) to produce real output
Save every output to runs/exp0-run{R}-{input-slug}.{ext}, where R is the run number
For visual outputs, render in browser and take screenshots
Score every output against every eval — score from the ACTUAL output, not what you think the skill would do
Record the baseline score in results.tsv and results.json
Generate dashboard.html with the baseline data embedded directly in the HTML (do NOT use fetch — browsers block it from file:// protocol)
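Embedding the data inline can be as simple as serializing it into a script tag when the page is generated. The sketch below assumes a results structure with an "experiments" list whose first entry is the baseline; that shape, the RESULTS variable, and the function name are illustrative only.

```python
import json


def render_dashboard(results: dict) -> str:
    """Return a self-contained HTML page with the results embedded as a JS variable."""
    # Escape "</" so a string in the data cannot close the script tag early.
    payload = json.dumps(results).replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>autoresearch dashboard</title></head>\n'
        '<body>\n<div id="summary"></div>\n'
        "<script>\n"
        f"const RESULTS = {payload};\n"
        "document.getElementById('summary').textContent =\n"
        "  'Baseline score: ' + RESULTS.experiments[0].score.toFixed(2);\n"
        "</script>\n</body></html>\n"
    )
```

Because the page carries its own data, it opens correctly from a file:// path; rebuilding it after each experiment means re-running the generator with the updated results.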

results.tsv format (tab-separated):

experiment	commit	score	status	description
0	a1b2c3d	72.50	baseline	original skill — no changes

Score is always 0.00-100.00 (weighted average of all eval scores across all runs for that experiment).
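Appending a row in this format is straightforward with the standard csv module. A minimal sketch, assuming the header is written once when the file is first created; `append_result` is a hypothetical helper name.

```python
import csv
from pathlib import Path

HEADER = ["experiment", "commit", "score", "status", "description"]


def append_result(tsv_path: Path, experiment: int, commit: str,
                  score: float, status: str, description: str) -> None:
    """Append one experiment row to results.tsv, writing the header on first use."""
    is_new = not tsv_path.exists()
    with tsv_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writerow(HEADER)
        # Scores are always logged with two decimals, e.g. 72.50.
        writer.writerow([experiment, commit, f"{score:.2f}", status, description])
```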

IMPORTANT: After establishing baseline, confirm the score with the user before proceeding. If baseline is already 95+, the skill may not need optimization — ask the user if they want to continue.

step 5: run the experiment loop

This is the core autoresearch loop. Once started, run autonomously until stopped.

LOOP:

Analyze failures. Look at which evals are failing most. Read the actual outputs (and screenshots) that failed. Identify the pattern — is it a formatting issue? A missing instruction? An ambiguous directive?
Form a hypothesis. Pick ONE thing to change. Don't change 5 things at once — you won't know what helped.

Good mutations:
- Add a specific instruction that addresses the most common failure
- Reword an ambiguous instruction to be more explicit
- Add an anti-pattern ("Do NOT do X") for a recurring mistake
- Move a buried instruction higher in the skill (priority = position)
- Add or improve an example that shows the correct behavior
- Remove an instruction that's causing the skill to over-optimize for one thing at the expense of others
Bad mutations:
- Rewriting the entire skill from scratch
- Adding 10 new rules at once
- Making the skill longer without a specific reason
- Adding vague instructions like "make it better" or "be more creative"
Make the change and commit. Edit SKILL.md with ONE targeted mutation, then:
```
git add SKILL.md
git commit -m "experiment [N]: [short description of change]"
```
Every experiment is a commit. This is the core safety mechanism.
Run the experiment. Actually execute the skill against the same test inputs, using the runs-per-experiment count agreed during setup. Save every output. Take screenshots for visual outputs. Name files: runs/exp{N}-run{R}-{input-slug}.{ext}
Score it. Score each eval for each run on a 0.00-100.00 scale from the REAL outputs and screenshots — not from what you think the skill would produce. Calculate the weighted average as the experiment's final score (a single 0.00-100.00 number).
Decide: keep or discard.
- Score improved by 1.00+ points → KEEP. Log it. Record the commit hash. This is the new baseline.
- Score changed by less than 1.00 point → DISCARD. git reset --hard HEAD~1 to revert. The change didn't move the needle. The one exception is a mutation that only removes or simplifies instructions and holds the score: keep it as a simplification win.
- Score got worse → DISCARD. git reset --hard HEAD~1 to revert.
Simplicity criterion (from Karpathy): All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Removing something and getting equal or better results is a great outcome — that's a simplification win.
Log the result in results.tsv and update results.json. These files are gitignored and persist across resets.
Append to changelog.md (also gitignored — see step 6 format below).
Rebuild dashboard.html with updated data embedded inline. Include links to all screenshot files for each experiment.
Repeat. Go back to the first step of the loop (analyze failures).
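The keep-or-discard rule from the loop can be written as a small function. This is a sketch: the `simplifies` flag is one interpretation of the simplicity criterion (keep a mutation that only simplifies the skill and does not lower the score), and the names are illustrative.

```python
def decide(previous_best: float, new_score: float,
           simplifies: bool = False, threshold: float = 1.00) -> str:
    """Return 'keep' or 'discard' for one mutation."""
    delta = round(new_score - previous_best, 2)
    if delta >= threshold:
        return "keep"  # improved by 1.00+ points
    if simplifies and delta >= 0:
        return "keep"  # assumption: equal or better score with a simpler skill
    return "discard"  # flat or worse: revert with git reset --hard HEAD~1


print(decide(88.75, 94.38))  # keep
print(decide(88.75, 88.75))  # discard
```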

NEVER STOP. Once the loop starts, do not pause to ask the user if you should continue. They may be away from the computer. Run autonomously until:

The user manually stops you
You hit the budget cap (if one was set)
You hit 98.00+ score for 3 consecutive experiments (diminishing returns)
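The two automatic stop conditions can be checked after every experiment. In this sketch, `scores` is the list of experiment scores in order, and `cycles_run` counts completed experiment cycles; whether the baseline counts toward the budget cap is not specified, so that is left to the caller. Both function names are hypothetical.

```python
from typing import Optional, Sequence


def hit_score_ceiling(scores: Sequence[float], ceiling: float = 98.00,
                      streak: int = 3) -> bool:
    """True when the last `streak` experiment scores are all at or above `ceiling`."""
    return len(scores) >= streak and all(s >= ceiling for s in scores[-streak:])


def should_stop(scores: Sequence[float], cycles_run: int,
                budget_cap: Optional[int] = None) -> bool:
    """True when the budget cap is reached or the score ceiling has held."""
    if budget_cap is not None and cycles_run >= budget_cap:
        return True
    return hit_score_ceiling(scores)
```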

If you run out of ideas: Re-read the failing outputs and screenshots. Try combining two previous near-miss mutations. Try a completely different approach to the same problem. Try removing things instead of adding them. Simplification that maintains the score is a win.

step 6: write the changelog

After each experiment (whether kept or discarded), append to changelog.md:

## Experiment [N] — [keep/discard] — `[commit hash or "reverted"]`

**Score:** [XX.XX] / 100.00 (delta: [+/-X.XX] from previous)
**Change:** [One sentence describing what was changed]
**Reasoning:** [Why this change was expected to help]
**Result:** [What actually happened — which evals improved/declined and by how much]
**Per-eval breakdown:** [eval1: XX.XX, eval2: XX.XX, ...]
**Failing outputs:** [Brief description of what still scores low, if anything]
**Evidence:** [List screenshot filenames showing examples]

This changelog is the most valuable artifact. It's a research log that any future agent (or smarter future model) can pick up and continue from.
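Rendering entries from structured data keeps the changelog format uniform across experiments. The helper below is a sketch that follows the template above field by field; its name and keyword arguments are illustrative.

```python
def changelog_entry(*, n: int, status: str, commit: str, score: float, delta: float,
                    change: str, reasoning: str, result: str,
                    per_eval: dict, failing: str, evidence: list) -> str:
    """Render one changelog.md entry in the template's format."""
    dash = "\u2014"  # the separator used in the template heading
    breakdown = ", ".join(f"{name}: {value:.2f}" for name, value in per_eval.items())
    lines = [
        f"## Experiment {n} {dash} {status} {dash} `{commit}`",
        "",
        f"**Score:** {score:.2f} / 100.00 (delta: {delta:+.2f} from previous)",
        f"**Change:** {change}",
        f"**Reasoning:** {reasoning}",
        f"**Result:** {result}",
        f"**Per-eval breakdown:** {breakdown}",
        f"**Failing outputs:** {failing}",
        f"**Evidence:** {', '.join(evidence)}",
        "",
    ]
    return "\n".join(lines)
```

Appending the returned string to changelog.md after each experiment, kept or discarded, preserves the full research log.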

step 7: deliver results

When the user returns or the loop stops:

Rebuild the final dashboard with ALL data embedded inline in the HTML. The dashboard must include:
- Score metrics (baseline, final, improvement, kept, reverted counts)
- Score progression chart and per-eval score chart (use Chart.js from CDN)
- Per-eval breakdown showing which evals improved most
- Full experiment log table with: #, description, score, status, commit, per-eval scores
- Screenshot gallery: for each experiment, thumbnail links to the actual screenshot files in runs/
- Links to results.json, changelog.md, and the improved SKILL.md
- Git commands for applying/discarding/diffing
CRITICAL: Embed all data directly in the HTML as JavaScript variables. Do NOT use fetch() — it fails from file:// protocol due to browser CORS restrictions.
Present a summary:
- Score summary: Baseline score → Final score (e.g., 72.50 → 94.38, delta +21.88)
- Total experiments run: How many mutations were tried
- Keep rate: How many mutations were kept vs discarded
- Top 3 changes that helped most (from the changelog)
- Remaining failure patterns (what the skill still gets wrong, if anything)
- How to apply:
  - To keep the improvement: git checkout main && git merge autoresearch/<branch-name>
  - To discard everything: git checkout main (branch stays for reference)
  - To see what changed: git diff main...autoresearch/<branch-name>
  - To see every kept experiment: git log --oneline autoresearch/<branch-name> (discarded experiments are reset off the branch and recorded only in changelog.md)
Deactivate hooks: Set the status in results.json to "complete" first, then remove the flag file:
```
rm .autoresearch-active
```
The order matters: the Stop hook checks the results.json status for as long as the flag exists.
Open the dashboard: On Windows start dashboard.html, on macOS open dashboard.html, on Linux xdg-open dashboard.html.

output format

The skill produces artifacts in two locations:

Git branch (autoresearch/<skill-name>-<date>):

Every kept mutation as a commit on the branch
Git log of every kept experiment (discarded experiments are reset off the branch but stay logged in changelog.md)
git diff main shows the cumulative change to the skill

Working directory (autoresearch-[skill-name]/, gitignored):

autoresearch-[skill-name]/
├── runs/                    # every output + screenshot from every run
│   ├── exp0-run1-pricing.html
│   ├── exp0-run1-pricing.png
│   ├── exp0-run2-portfolio.html
│   ├── exp0-run2-portfolio.png
│   ├── exp1-run1-pricing.html
│   ├── exp1-run1-pricing.png
│   └── ...
├── dashboard.html           # self-contained dashboard with embedded data + screenshot links
├── results.json             # structured experiment data with per-eval breakdown
├── results.tsv              # tab-separated score log
└── changelog.md             # detailed mutation log with commit hashes and evidence links

results.tsv example:

experiment	commit	score	status	description
0	a1b2c3d	72.50	baseline	original skill — no changes
1	b2c3d4e	88.75	kept	added specific color palette guidance
2	reverted	88.75	discard	tried enforcing font sizes — no improvement
3	c3d4e5f	94.38	kept	added pre-ship checklist for eval criteria
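The headline numbers for the final summary can be derived from this log. A minimal sketch, assuming the first data row is the baseline and the final score is the score of the last row marked kept; `summarize` is a hypothetical helper.

```python
import csv
import io


def summarize(tsv_text: str) -> dict:
    """Compute baseline, final score, delta, and keep/discard counts from results.tsv text."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    baseline = float(rows[0]["score"])
    kept = [row for row in rows[1:] if row["status"] == "kept"]
    final = float(kept[-1]["score"]) if kept else baseline
    tried = len(rows) - 1  # every row after the baseline is one mutation
    return {
        "baseline": baseline,
        "final": final,
        "delta": round(final - baseline, 2),
        "experiments": tried,
        "kept": len(kept),
        "discarded": tried - len(kept),
    }
```

Applied to the four example rows above, this gives a baseline of 72.5, a final score of 94.38, a delta of 21.88, and 2 kept mutations out of 3 tried.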

how this connects to other skills

What feeds into autoresearch:

Any existing skill that needs optimization
User-defined eval criteria (or help them define evals using the eval guide)

What autoresearch feeds into:

The improved skill lives on a branch until the user merges it
The changelog can be passed to future models for continued optimization
The eval suite can be reused whenever the skill is updated
The screenshot evidence trail can be reviewed by the user to validate improvements

the test

A good autoresearch run:

Ran real experiments — actually executed the skill and produced real outputs, not simulated scores
Captured evidence — saved every output to a file and took screenshots for visual outputs
Used git branching — original skill preserved on main, experiments on a dedicated branch
Started with a baseline — never changed anything before measuring the starting point
Used 0-100.00 scored evals — precise scores with decimal granularity, not vague vibes
Changed one thing at a time — so you know exactly what helped
Committed every experiment — full git history of what was tried
Kept results outside git — results.tsv, changelog.md, and runs/ survive resets
Improved the score — measurable improvement from baseline to final
Didn't overfit — the skill got better at the actual job, not just at passing the specific test inputs
Ran autonomously — didn't stop to ask permission between experiments
Applied simplicity criterion — equal results with less complexity = keep the simpler version
Built a dashboard with evidence — screenshots linked for every experiment so the user can visually verify

If the skill "passes" all evals but the actual output quality hasn't improved — the evals are bad, not the skill. Go back to step 2 and write better evals.
