By abdielou
Transform vague optimization problems into fully scaffolded autonomous experiment loops with eval suites, scoring functions, and meta-agent directives.
Transform a vague optimization problem into a fully scaffolded autonomous experiment loop. Use when user says 'autoeval', 'optimization loop', 'autonomous experiment', 'hill-climbing loop', or invokes /autoeval.
Phase 1 of autoeval -- interactive discovery to clarify the optimization problem, classify the loop type, and identify exit ramps. Use when autoeval orchestrator routes to Phase 1, or when user invokes directly.
Phase 2 of autoeval -- interactive metric exploration and scoring function design with stress-testing. Use when autoeval orchestrator routes to Phase 2, or when user invokes directly.
Phase 3 of autoeval -- build eval cases, scoring functions, and coverage strategy. The most critical phase of the optimization loop. Use when autoeval orchestrator routes to Phase 3, or when user invokes directly.
Phase 4 of autoeval -- generate the seed implementation with marked edit surface that the meta-agent will iterate on. Use when autoeval orchestrator routes to Phase 4, or when user invokes directly.
ALPHA — DO NOT USE. Currently testing loop types with real problems. Example executions will be added for each of the 12 types. Moves out of alpha once all are validated.
A Claude Code skill that transforms an optimization problem into a fully scaffolded autonomous experiment loop. Describe what you want to optimize, and autoeval builds the eval suite, seed implementation, monitoring dashboard, and loop runner — everything you need to kick off an overnight experiment.
/plugin marketplace add abdielou/autoeval
/plugin install autoeval@abdielou-autoeval
/autoeval optimize my bin packing heuristic for minimum waste
/autoeval improve my classifier's accuracy on the test suite
/autoeval --auto tune my prompt to maximize extraction accuracy
autoeval walks you through defining the problem, designing the scoring function, and building the eval suite. Then it scaffolds the loop and tells you how to run it.
The --auto flag makes the later phases run autonomously after the interactive metric design is locked in.
| File | What it does |
|---|---|
| `program.md` | Meta-agent directive: goal, edit surface, iteration protocol, constraints |
| `evals/` | Test cases, scoring functions, eval runner |
| `run-loop.py` | Launches claude sessions with auto-restart and timeout |
| `monitor.py` | Live dashboard at localhost:8080 |
| `steering.md` | Guide the agent mid-run without stopping the loop |
| Seed harness | Minimal baseline implementation with marked edit surface |
Terminal 1 — start the loop:
cd <output-dir>
python run-loop.py
Sessions auto-restart every N iterations with fresh context. Ctrl+C once kills the current session; Ctrl+C again within 3 seconds stops the loop entirely.
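The two-stage Ctrl+C behavior can be pictured with a small sketch. This is not the code in run-loop.py; the class name `InterruptTracker` and its method are hypothetical, and only the 3-second window comes from the description above.

```python
import time

# From the description above: a second Ctrl+C within 3 seconds stops the loop.
STOP_WINDOW_SECONDS = 3.0


class InterruptTracker:
    """Hypothetical helper: decides what a Ctrl+C should do based on timing."""

    def __init__(self, window=STOP_WINDOW_SECONDS):
        self.window = window
        self.last_interrupt = None  # timestamp of the previous Ctrl+C, if any

    def on_interrupt(self, now=None):
        """Return 'stop_loop' for a second Ctrl+C inside the window,
        otherwise 'kill_session'."""
        now = time.monotonic() if now is None else now
        previous = self.last_interrupt
        self.last_interrupt = now
        if previous is not None and now - previous <= self.window:
            return "stop_loop"
        return "kill_session"
```

A runner built this way would call `on_interrupt()` from its `KeyboardInterrupt` handler, end the current session on `"kill_session"`, and exit the outer loop on `"stop_loop"`.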
Terminal 2 — monitor progress:
cd <output-dir>
python monitor.py
Live chart showing score over iterations, component scores (toggleable), failed experiments as red X markers, hypothesis on hover, and summary stats. Auto-refreshes every 30 seconds.
The loop uses two models to balance cost and quality:
Opus-level creativity for the "think hard" steps at ~10% of the cost.
Guide the loop agent mid-run without stopping it. Append entries to steering.md tagged with the commit hash they apply after:
## after 7f5ad6f
Stop trying to optimize the greedy heuristic. The biggest gain is switching
to a scoring-based approach. Look at the best-fit-decreasing literature.
## after a1b2c3d
The classifier is plateauing at 96.2%. The bottleneck is feature engineering,
not model choice. Try polynomial features on columns 3-7.
The agent reads this before each iteration and follows it. Old entries are automatically irrelevant once HEAD moves past them.
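As a rough illustration of the entry format, the sketch below parses `## after <hash>` sections and selects the notes tied to the current HEAD. The function names are hypothetical, and the matching rule (an entry applies while HEAD starts with its hash) is an assumption, not necessarily how the agent decides relevance.

```python
import re

# Matches headers such as "## after 7f5ad6f" (abbreviated or full commit hash).
HEADER = re.compile(r"^## after ([0-9a-f]{7,40})\s*$")


def parse_steering(text):
    """Return a list of (commit_hash, note) pairs in file order."""
    entries = []
    current_hash, lines = None, []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            # A new header closes the previous entry, if there was one.
            if current_hash is not None:
                entries.append((current_hash, "\n".join(lines).strip()))
            current_hash, lines = match.group(1), []
        elif current_hash is not None:
            lines.append(line)
    if current_hash is not None:
        entries.append((current_hash, "\n".join(lines).strip()))
    return entries


def notes_for_head(entries, head):
    """Assumption: an entry is relevant while HEAD starts with its hash."""
    return [note for commit, note in entries if head.startswith(commit)]
```

Under this assumption, `notes_for_head(parse_steering(text), head)` returns an empty list once HEAD has moved to a commit that no entry names, which mirrors how old entries stop mattering.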
For permanent changes, edit program.md — takes effect at the next session restart.
Most optimization loops silently discard failed attempts. autoeval preserves them.
When an experiment fails, before reverting the code, the agent saves the full diff, hypothesis, score breakdown, and failure analysis to failed_experiments/. The monitoring dashboard shows a Failed Experiments table with the last 20 failures. Before each new hypothesis, the deep reasoning model sees past failures and avoids repeating them.
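One way to picture the record that gets preserved is the sketch below. The JSON layout, the file naming scheme, and both function names are assumptions made for illustration; only the `failed_experiments/` directory, the saved fields, and the 20-entry limit come from the description above.

```python
import json
from pathlib import Path


def save_failed_experiment(out_dir, iteration, hypothesis, diff, scores, analysis):
    """Write one failure record before the code is reverted (hypothetical layout)."""
    folder = Path(out_dir) / "failed_experiments"
    folder.mkdir(parents=True, exist_ok=True)
    record = {
        "iteration": iteration,
        "hypothesis": hypothesis,
        "diff": diff,
        "score_breakdown": scores,
        "failure_analysis": analysis,
    }
    # Zero-padded names keep lexical order equal to iteration order.
    path = folder / f"iter_{iteration:04d}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


def recent_failures(out_dir, limit=20):
    """Load the most recent failure records, oldest first."""
    folder = Path(out_dir) / "failed_experiments"
    files = sorted(folder.glob("iter_*.json"))[-limit:]
    return [json.loads(f.read_text(encoding="utf-8")) for f in files]
```

Feeding the output of `recent_failures` into the next hypothesis prompt is one plausible way for the reasoning model to see what has already been tried.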
Failed iterations become institutional knowledge the loop builds up over time.
autoeval classifies your problem against 12 optimization loop types:
| Training | Agent Harness | Generative Output | Algorithm Performance |
| Retrieval & Ranking | Pipeline Optimization | Simulation Calibration | Strategy & Decision |
| Adversarial | Data Curation | Control Systems | Interface Optimization |
Each type defines what the agent edits and how it's scored. See loop-types.md for details.
The eval suite goes through integrity validation before the loop starts:
autoresearch | autoagent | AIDE | OpenEvolve | AI-Scientist
MIT
npx claudepluginhub abdielou/autoeval --plugin autoeval