From super-dev
Auto-improve any agent prompt using Karpathy's autoresearch method. Runs iterative test-measure-improve loops on agent prompts to systematically increase quality. Triggers on: "autoresearch", "auto-improve", "optimize agent", "tune prompt", "improve skill quality".
How this skill is triggered — by the user, by Claude, or both
Slash command
/super-dev:autoresearchThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Based on Andrej Karpathy's autoresearch method. Instead of manually improving agent prompts, let an AI agent do it in an iterative loop: try a small change, score the result, keep improvements, revert regressions.
Based on Andrej Karpathy's autoresearch method. Instead of manually improving agent prompts, let an AI agent do it in an iterative loop: try a small change, score the result, keep improvements, revert regressions.
┌─────────────────────────────────────────────────────┐
│ AUTORESEARCH LOOP │
│ │
│ 1. Run agent on test input │
│ 2. Score output against checklist (yes/no) │
│ 3. Record baseline score │
│ 4. Analyze weakest checklist items │
│ 5. Make ONE small change to the agent prompt │
│ 6. Re-run agent on same test input │
│ 7. Score again │
│ 8. If score improved → KEEP change │
│ If score dropped → REVERT change │
│ 9. Repeat from step 4 until target score reached │
│ │
│ Stop condition: 95%+ score 3 times in a row │
│ OR max rounds reached (default: 10) │
└─────────────────────────────────────────────────────┘
/super-dev:autoresearch
Arguments:
--agent <agent-name> Agent to optimize (e.g., "code-reviewer", "qa-agent")
--test-input <prompt> Test input to run the agent on
--rounds <N> Max improvement rounds (default: 10)
User specifies which agent prompt to improve. Read the agent's markdown file from agents/<name>.md.
Ask the user (or auto-generate from the agent's existing quality gates) a set of 3-6 yes/no scoring questions. Each question checks one specific aspect of the agent's output.
Good checklist questions (yes/no only):
Bad checklist questions (avoid):
Run the agent on the test input 3 times. Score each run against the checklist. The average score is the baseline.
Baseline: 5/8 checks passing = 62.5%
For each round:
When done (target reached or max rounds), produce:
# Autoresearch Results: [agent-name]
## Summary
- **Baseline score:** 62.5% (5/8)
- **Final score:** 93.75% (7.5/8)
- **Rounds:** 6
- **Changes kept:** 4
- **Changes reverted:** 2
## Changelog
### Round 1: Added gotchas section ✓ KEPT
- **Score:** 62.5% → 75%
- **Change:** Added "Gotchas" section listing 6 common failures
- **Why:** Checklist item "identifies production-risk bugs" was failing
- **Effect:** Bug identification improved in 2/3 test runs
### Round 2: Added explicit timezone rule ✓ KEPT
- **Score:** 75% → 81.25%
- **Change:** Added rule "Always flag timezone-naive datetime operations"
- **Why:** Checklist item "catches time-related bugs" was failing
### Round 3: Reduced prompt verbosity ✗ REVERTED
- **Score:** 81.25% → 75%
- **Change:** Removed 3 paragraphs of explanation, kept only rules
- **Why:** Hypothesized less text = more focused output
- **Effect:** Quality dropped, model needed the context
[...]
## Files
- **Original:** agents/[name].md.backup
- **Improved:** agents/[name].md
- **Results log:** ${CLAUDE_PLUGIN_DATA}/autoresearch/[name]-results.json
Store results in ${CLAUDE_PLUGIN_DATA}/autoresearch/:
${CLAUDE_PLUGIN_DATA}/autoresearch/
├── code-reviewer-results.json
├── code-reviewer-changelog.md
├── qa-agent-results.json
└── qa-agent-changelog.md
Good changes (one at a time):
Bad changes (avoid):
The autoresearch skill is a meta-tool: it improves the tools that build your software. Run it periodically on agents that produce inconsistent results.
Recommended schedule:
npx claudepluginhub jenningsloy318/claude-skill-artifacts --plugin super-devAnalyzes agent performance, classifies failure modes, and applies prompt improvements with validation and staged rollouts.
Runs autonomous optimization loops to iteratively improve prompts, templates, configs, or code using four-way separation of main agent, eval agent, test runner, and deterministic eval.py judge. Invoke via /autoresearch or 'optimize this prompt'.
Audits Claude Code agents for violations, gaps, and improvements across 7 dimensions like description quality and frontmatter, outputting structured repair plans.