Autonomous codebase improvement loop — modify code, evaluate against benchmarks, keep or discard via git worktrees
15 agent definitions for the autoimprove improvement loop. Each agent is a Markdown file with YAML frontmatter (`name`, `description`, optional `model`).
Maps safe zones and risky heuristics in spec/design prose before seeing Enthusiast findings. Parallel phase-1 dispatch — produces adversarial context for the Judge-spec. Spawned by the review orchestrator — not invoked directly by users.
Maps safe zones and risky heuristics before seeing Enthusiast findings. Parallel phase-1 dispatch — produces adversarial context for the Judge. Spawned by the review orchestrator — not invoked directly by users.
Runs the full E→A→J debate pipeline on a single code challenge and scores it with F1. Dispatched by the challenge skill — not invoked directly by users.
Surfaces strategic insights from a completed idea-matrix convergence report — dimension patterns, hidden assumptions, risk clusters, cells to re-examine. Does not re-score.
Run an adversarial Enthusiast→Adversary→Judge debate review on code. Automatically converges — no manual round control needed. Use when the user says 'adversarial review', 'debate review', 'run a review round', 'do a review round', 'review code with debate agents', 'i want an adversarial review', or '/autoimprove review'. Do NOT trigger on generic 'review' requests or PR reviews. Takes a file, diff, or PR as target.
Main entry point for the autonomous improvement loop. Use when the harness calls `Skill(autoimprove)`, when the user runs `/autoimprove`, or when the user asks to start the full research → experiment → judge → converge flow. This is an alias for the `run` skill. It exists so callers can invoke the top-level `autoimprove` skill name directly without failing with "Unknown skill: autoimprove".
Run cross-model calibration for autoimprove skills — compare Opus (gold standard) vs Haiku (cheap) on the same input to identify reasoning gaps. Use when the user says '/calibrate', 'calibrate skill', 'model calibration', or 'calibration gap'. Phase 1: hardcoded for adversarial-review only.
Use when testing debate agent bug-finding accuracy against curated code challenges — F1 scoring, 'test debate agents on challenges', 'benchmark agents'.
Manually sweep stale autoimprove worktrees and branches via `skills/_shared/cleanup-worktrees.sh`. Safe to run at any time — protects live worktrees, tagged keepers, and in-flight experiments. Triggers: '/autoimprove cleanup', 'clean up stale worktrees', 'sweep orphan branches', 'autoimprove hygiene'. <example> user: "/autoimprove cleanup --dry-run" assistant: I'll use the cleanup skill to preview what the sweep would remove. <commentary>Dry-run preview — cleanup skill.</commentary> </example> <example> user: "clean up the orphan worktree-agent branches" assistant: I'll use the cleanup skill to sweep them. <commentary>Explicit cleanup request — cleanup skill.</commentary> </example> Do NOT use for in-loop per-experiment cleanup → that lives in step 3j of the run skill. This skill is the manual/safety-net sweep only.
Modifies files: hook triggers on file write and edit operations.
Uses power tools: uses Bash, Write, or Edit tools.
Autonomous codebase improvement loop for Claude Code and Codex. Inspired by karpathy/autoresearch.
You program the improvement strategy. The system modifies code, evaluates against your benchmarks, and keeps or discards changes via git worktree isolation. You wake up to a log of experiments and a better codebase.
autoimprove.yaml            evaluate.sh            experimenter agent
(you write this)      (deterministic scoring)      (blind to scoring)
       │                         │                         │
       ▼                         ▼                         ▼
┌──────────────┐  spawn   ┌──────────────┐  evaluate  ┌──────────┐
│ orchestrator │─────────▶│ worktree     │───────────▶│ verdict  │
│ (loop)       │◀─────────│ experiment   │            │ keep or  │
│              │  commit  │              │            │ discard  │
└──────────────┘          └──────────────┘            └──────────┘
The orchestrator picks improvement themes (failing tests, TODOs, coverage gaps), spawns an experimenter agent into an isolated git worktree, then evaluates the result with a deterministic script. The experimenter never sees your metrics or scores — it makes changes it genuinely believes are improvements.
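The isolation step can be pictured with plain git commands. The sketch below is only an illustration, not the plugin's actual implementation: the function names, the branch naming scheme, and the `.worktrees` directory are hypothetical. Only the `git worktree add`, `git worktree remove`, and `git branch -D` invocations are standard git.

```python
import subprocess


def worktree_path(repo: str, exp_id: str) -> str:
    # Hypothetical location for experiment worktrees.
    return f"{repo}/.worktrees/exp-{exp_id}"


def branch_name(exp_id: str) -> str:
    # Hypothetical branch naming scheme.
    return f"autoimprove/exp-{exp_id}"


def spawn_cmd(repo: str, exp_id: str) -> list:
    """Command that creates an isolated worktree on a fresh branch."""
    return ["git", "-C", repo, "worktree", "add",
            "-b", branch_name(exp_id), worktree_path(repo, exp_id)]


def discard_cmds(repo: str, exp_id: str) -> list:
    """Commands that throw an experiment away: drop the worktree, then its branch."""
    return [
        ["git", "-C", repo, "worktree", "remove", "--force",
         worktree_path(repo, exp_id)],
        ["git", "-C", repo, "branch", "-D", branch_name(exp_id)],
    ]


def run_all(cmds: list) -> None:
    """Run each git command, raising if any of them fails."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)
```

Inside a git repository, `run_all([spawn_cmd(".", "001")])` would create the worktree and `run_all(discard_cmds(".", "001"))` would remove it again without touching the main checkout.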
Scoring uses set logic, not weighted averages. A change is kept only if no metric regresses and at least one improves. A single regression vetoes the entire experiment.
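As a rough illustration of the set-logic rule, here is a minimal Python sketch. It is not the project's `evaluate.sh`. It assumes that `tolerance` and `significance` are relative fractions of the baseline value and that a lower-is-better direction exists alongside `higher_is_better`. The `Metric` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    higher_is_better: bool
    tolerance: float     # assumed: max acceptable relative regression
    significance: float  # assumed: min relative change that counts as improvement


def relative_gain(metric: Metric, baseline: float, candidate: float) -> float:
    """Signed change, positive when the candidate is better than the baseline."""
    if baseline == 0:
        delta = candidate - baseline  # fall back to the absolute change
    else:
        delta = (candidate - baseline) / abs(baseline)
    return delta if metric.higher_is_better else -delta


def verdict(metrics, baseline: dict, candidate: dict) -> str:
    """Keep only if no metric regresses and at least one improves."""
    gains = [relative_gain(m, baseline[m.name], candidate[m.name]) for m in metrics]
    regressed = any(g < -m.tolerance for m, g in zip(metrics, gains))
    improved = any(g >= m.significance for m, g in zip(metrics, gains))
    return "keep" if improved and not regressed else "discard"


METRICS = [
    Metric("test_count", higher_is_better=True, tolerance=0.02, significance=0.01),
    Metric("todo_count", higher_is_better=False, tolerance=0.02, significance=0.01),
]
BASELINE = {"test_count": 42, "todo_count": 7}

print(verdict(METRICS, BASELINE, {"test_count": 44, "todo_count": 7}))  # keep
print(verdict(METRICS, BASELINE, {"test_count": 44, "todo_count": 8}))  # discard
print(verdict(METRICS, BASELINE, {"test_count": 42, "todo_count": 7}))  # discard
```

The second case shows the veto: two extra tests do not compensate for one extra TODO, because there is no weighting that lets one metric buy back another.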
Claude Code:
# 0. Install the plugin (one-time)
claude plugin marketplace add https://github.com/ipedro/autoimprove
claude plugin install autoimprove
# 1. Inside your project, run:
/autoimprove init
Codex:
$autoimprove:init
Both `/autoimprove init` (Claude Code) and `$autoimprove:init` (Codex) are interactive. They detect your project, run your tests, and scaffold everything:
autoimprove initialized for my-project (Node.js)
Gates
[PASS] tests — npm test (42 tests, 0 failures)
Metrics (baseline)
test_count: 42
todo_count: 7
Files written:
autoimprove.yaml ← your improvement strategy
benchmark/metrics.sh ← measures test_count + todo_count
Next step: /autoimprove run --experiments 3
You don't write a benchmark script — init generates one from your project. Then:
Claude Code:
# 2. Run the improvement loop (3 trial experiments first)
/autoimprove run --experiments 3
# 3. See what happened
/autoimprove report
Codex:
$autoimprove:autoimprove --experiments 3
$autoimprove:report
| autoresearch | autoimprove |
|---|---|
| train.py (agent edits this) | Your source code |
| prepare.py (immutable eval) | evaluate.sh |
| program.md (human strategy) | autoimprove.yaml |
| val_bpb (fitness number) | Per-metric set logic |
| git reset --hard | git worktree remove |
The key insight from autoresearch: the human doesn't edit the code — they edit the improvement strategy. You tune autoimprove.yaml, not your source files.
autoimprove is conservative by default:
evaluate.sh (bash + jq), no LLM in the scoring loop
autoimprove.yaml lives in your project root:
gates:
  - name: tests
    command: npm test
  - name: typecheck
    command: npx tsc --noEmit
benchmarks:
  - name: project-metrics
    type: script
    command: bash benchmark/metrics.sh
    metrics:
      - name: test_count
        extract: "json:.test_count"
        direction: higher_is_better
        tolerance: 0.02      # max acceptable regression
        significance: 0.01   # min meaningful improvement
themes:
  auto:
    strategy: weighted_random
    priorities:
      failing_tests: 5
      todo_comments: 3
      coverage_gaps: 2
See docs/configuration.md for the full schema.
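To give a feel for what a `weighted_random` strategy could mean for the `priorities` above, here is a small Python sketch. It assumes the priorities act as relative sampling weights, which the README does not spell out. `pick_theme` is a hypothetical helper, not part of autoimprove.

```python
import random
from collections import Counter

# Priorities copied from the example configuration.
PRIORITIES = {"failing_tests": 5, "todo_comments": 3, "coverage_gaps": 2}


def pick_theme(priorities: dict, rng: random.Random) -> str:
    """Draw one theme, with probability proportional to its priority."""
    themes = list(priorities)
    weights = [priorities[t] for t in themes]
    return rng.choices(themes, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only to make the sketch repeatable
counts = Counter(pick_theme(PRIORITIES, rng) for _ in range(10_000))
print(counts.most_common())
```

Under this reading, `failing_tests` would be picked about half the time (5 out of a total weight of 10), while lower-priority themes still get regular turns.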
jq (brew install jq / apt install jq)
autoimprove ships both a Claude Code plugin manifest and a Codex plugin manifest.
Claude Code
claude plugin marketplace add https://github.com/ipedro/autoimprove
claude plugin install autoimprove
npx claudepluginhub tokyo-megacorp/autoimprove