autoresearch-for-skills

A Claude Code plugin that autonomously optimizes skill prompts using Andrej Karpathy's autoresearch methodology, applied to Claude Code skill prompts rather than ML training code.

Point it at any skill, define what "good" looks like, and let it run. It executes the skill repeatedly with real inputs, scores every output on a 0-100 scale, mutates the prompt, keeps what improves the score, and discards the rest.

What it does

Runs real experiments: executes the skill and produces real outputs (not simulated)
Scores on a 0.00-100.00 scale: weighted eval criteria with decimal precision
Mutates one thing at a time: so you know exactly which change helped
Uses git branching: every experiment is a commit, and the original skill is preserved on main
Captures evidence: saves every output and takes screenshots for visual skills
Builds a dashboard: an HTML dashboard with score charts, an eval radar, and a screenshot gallery

Install

claude plugins marketplace add /path/to/autoresearch-for-skills
claude plugins install autoresearch@autoresearch-for-skills

Or add the GitHub repo as a marketplace:

# Add the marketplace
claude plugins marketplace add --source github --repo lendtrain/autoresearch-for-skills

# Install the plugin
claude plugins install autoresearch@autoresearch-for-skills

Usage

/autoresearch skill=path/to/SKILL.md iterations=10

The skill will ask you for:

Target skill — path to the SKILL.md to optimize
Test inputs — 3-5 prompts/scenarios to test with
Eval criteria — 3-6 scored criteria defining "good output"
Runs per experiment — how many times to run per mutation (default: 3)
Budget cap — max experiment cycles (optional)
Output type — visual, code, text, or mixed

Then it runs autonomously until you stop it or it reaches the budget cap.
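
The answers listed above amount to a small run configuration. As a rough sketch only, the inputs and their suggested ranges could be modeled in Python as follows. `RunConfig`, its field names, and `problems()` are hypothetical names chosen for illustration; how the plugin stores these values internally is not documented here.

```python
from dataclasses import dataclass
from typing import List, Optional

OUTPUT_TYPES = ("visual", "code", "text", "mixed")


@dataclass
class RunConfig:
    """Hypothetical container for the answers the skill asks for."""
    skill_path: str                    # path to the SKILL.md to optimize
    test_inputs: List[str]             # 3-5 prompts/scenarios
    eval_criteria: List[str]           # 3-6 scored criteria
    output_type: str                   # one of OUTPUT_TYPES
    runs_per_experiment: int = 3       # default stated in this README
    budget_cap: Optional[int] = None   # optional max experiment cycles

    def problems(self) -> List[str]:
        """Return readable issues; an empty list means the config looks sane."""
        issues = []
        if not 3 <= len(self.test_inputs) <= 5:
            issues.append("expected 3-5 test inputs")
        if not 3 <= len(self.eval_criteria) <= 6:
            issues.append("expected 3-6 eval criteria")
        if self.output_type not in OUTPUT_TYPES:
            issues.append(f"output type must be one of {OUTPUT_TYPES}")
        if self.runs_per_experiment < 1:
            issues.append("runs per experiment must be at least 1")
        if self.budget_cap is not None and self.budget_cap < 1:
            issues.append("budget cap must be at least 1 when set")
        return issues
```

Checking the ranges before a run starts is a design choice of this sketch; the README states the ranges only as guidance.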

How scoring works

Each eval criterion is scored from 0.00 to 100.00 on every run. The experiment score is the weighted average across all criteria and runs. A mutation is kept if it raises the score by at least 1.00 point. A mutation that leaves the score unchanged while reducing complexity is also kept, as a simplification win.
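
The arithmetic above can be illustrated with a short Python sketch. It is not the plugin's code: `Criterion`, `experiment_score`, and `decide` are hypothetical names, and measuring "complexity" as prompt line count is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

KEEP_THRESHOLD = 1.00  # minimum gain in points, per the rule above


@dataclass(frozen=True)
class Criterion:
    name: str      # hypothetical criterion label
    weight: float  # relative weight; weights need not sum to 1


def experiment_score(criteria: List[Criterion], runs: List[Dict[str, float]]) -> float:
    """Weighted average of per-criterion scores (0.00-100.00) over all runs.

    Each entry in `runs` maps a criterion name to that run's score.
    """
    total_weight = sum(c.weight for c in criteria) * len(runs)
    weighted_sum = sum(c.weight * run[c.name] for run in runs for c in criteria)
    return round(weighted_sum / total_weight, 2)


def decide(best: float, candidate: float, best_lines: int, candidate_lines: int) -> str:
    """Apply the keep rule; line count is an assumed proxy for complexity."""
    if candidate - best >= KEEP_THRESHOLD:
        return "kept"
    if candidate == best and candidate_lines < best_lines:
        return "kept"  # simplification win
    return "discarded"
```

For example, with weights 2 and 1 and two runs scoring (90, 60) and (80, 70), the experiment score is (2·90 + 60 + 2·80 + 70) / 6 = 78.33.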

What you get

On a git branch (autoresearch/<skill-name>-<date>):

Every kept mutation as a commit
`git diff main` shows the cumulative change to the skill

In a working directory (autoresearch-<skill-name>/):

autoresearch-<skill-name>/
├── runs/                    # every output + screenshot from every run
├── dashboard.html           # self-contained HTML dashboard with charts
├── results.json             # structured experiment data
├── results.tsv              # tab-separated score log
└── changelog.md             # detailed mutation log
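
The score log can be summarized with a few lines of Python. This is a sketch under a stated assumption: the column layout of results.tsv is not documented here, so the header below (Experiment, Score, Status, Description) is borrowed from the example tables in this README, and `summarize` is a hypothetical helper. The sample rows are taken from the design-review example.

```python
import csv
import io

# Assumed layout: a header row followed by one tab-separated row per experiment.
SAMPLE = (
    "Experiment\tScore\tStatus\tDescription\n"
    "0\t84.50\tbaseline\tOriginal skill\n"
    "1\t91.56\tkept\tAdded mandatory fix gate\n"
    "2\t90.88\tkept\tCondensed fix gate\n"
)


def summarize(tsv_text: str) -> dict:
    """Return the baseline score, the last kept score, and the gain between them."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    baseline = next(float(r["Score"]) for r in rows if r["Status"] == "baseline")
    kept = [float(r["Score"]) for r in rows if r["Status"] == "kept"]
    final = kept[-1] if kept else baseline  # no kept mutations means no change
    return {"baseline": baseline, "final": final, "gain": round(final - baseline, 2)}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it against a real log, read the file with `open(path, newline="")` and pass its contents to `summarize`, adjusting the column names if the actual header differs.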

Plugin components

Skill: `/autoresearch`

The core autoresearch loop — gather context, build evals, establish baseline, run experiments, build dashboard.
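
The experiment phase of this loop behaves like greedy hill climbing over the prompt. A minimal Python sketch follows, assuming hypothetical `mutate` and `score` callables that stand in for the single-change prompt edit and the real skill runs. It is not the plugin's implementation, and it omits git commits, evidence capture, and simplification wins.

```python
from typing import Callable, List, Tuple

LogEntry = Tuple[int, float, str]  # (experiment number, score, status)


def optimize(
    prompt: str,
    mutate: Callable[[str], str],
    score: Callable[[str], float],
    iterations: int = 10,
    min_gain: float = 1.00,
) -> Tuple[str, List[LogEntry]]:
    """Keep a mutation only when it beats the best score so far by min_gain."""
    best_prompt, best_score = prompt, score(prompt)
    log: List[LogEntry] = [(0, best_score, "baseline")]
    for i in range(1, iterations + 1):
        candidate = mutate(best_prompt)      # one change per experiment
        candidate_score = score(candidate)
        if candidate_score - best_score >= min_gain:
            best_prompt, best_score = candidate, candidate_score
            log.append((i, candidate_score, "kept"))
        else:
            # Discarded mutations never become the starting point for the next one.
            log.append((i, candidate_score, "discarded"))
    return best_prompt, log
```

Because each candidate is derived from the best prompt so far, a discarded mutation leaves no trace in later experiments, which matches the "keeps what improves the score, and discards the rest" behavior described earlier.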

Hooks

PostToolUse (Write) — auto-screenshots HTML files written to runs/ directories during active runs
Stop — prevents the agent from stopping before the dashboard is built and results are marked complete

References

eval-guide.md — how to write eval criteria that actually work

Example results

From a real run optimizing a frontend-design skill:

Experiment	Score	Status	Description
0	62.50	baseline	Original skill
1	100.00	kept	Added explicit banned font list + approved alternatives
2	100.00	kept	Removed redundant paragraph (simplification win)

From optimizing a design-review skill against a live Next.js app:

Experiment	Score	Status	Description
0	84.50	baseline	Original skill — reported issues but didn't fix them
1	91.56	kept	Added mandatory fix gate — agent now produces actual code fixes
2	90.88	kept	Condensed fix gate from 26 to 3 lines (simplification win)

Based on

Karpathy's autoresearch — autonomous experimentation loops for ML training
Adapted for Claude Code skill prompt optimization

License

MIT
