autoimprove

Name: autoimprove
Author: tokyo-megacorp


Autonomous codebase improvement loop — modify code, evaluate against benchmarks, keep or discard via git worktrees

npx claudepluginhub tokyo-megacorp/autoimprove

Popularity

Stars: Above avg (Med: 0 · Avg: 285)

Installs: Med: 0 · Avg: 1

What's Inside

Agents (16)

agents/ — Agent Definitions

/AGENTS

15 agent definitions for the autoimprove improvement loop. Each agent is a Markdown file with YAML frontmatter (`name`, `description`, optional `model`).

adversary-spec

/adversary-spec

Maps safe zones and risky heuristics in spec/design prose before seeing Enthusiast findings. Parallel phase-1 dispatch — produces adversarial context for the Judge-spec. Spawned by the review orchestrator — not invoked directly by users.

adversary

/adversary

Maps safe zones and risky heuristics before seeing Enthusiast findings. Parallel phase-1 dispatch — produces adversarial context for the Judge. Spawned by the review orchestrator — not invoked directly by users.

challenge-runner

/challenge-runner

Runs the full E→A→J debate pipeline on a single code challenge and scores it with F1. Dispatched by the challenge skill — not invoked directly by users.

convergence-analyst

/convergence-analyst

Surfaces strategic insights from a completed idea-matrix convergence report — dimension patterns, hidden assumptions, risk clusters, cells to re-examine. Does not re-score.

Skills (26)

adversarial-review

/adversarial-review

Run an adversarial Enthusiast→Adversary→Judge debate review on code. Automatically converges — no manual round control needed. Use when the user says 'adversarial review', 'debate review', 'run a review round', 'do a review round', 'review code with debate agents', 'i want an adversarial review', or '/autoimprove review'. Do NOT trigger on generic 'review' requests or PR reviews. Takes a file, diff, or PR as target.

autoimprove

/autoimprove

Main entry point for the autonomous improvement loop. Use when the harness calls `Skill(autoimprove)`, when the user runs `/autoimprove`, or when the user asks to start the full research → experiment → judge → converge flow. This is an alias for the `run` skill. It exists so callers can invoke the top-level `autoimprove` skill name directly without failing with "Unknown skill: autoimprove".

calibrate

/calibrate

Run cross-model calibration for autoimprove skills — compare Opus (gold standard) vs Haiku (cheap) on the same input to identify reasoning gaps. Use when the user says '/calibrate', 'calibrate skill', 'model calibration', or 'calibration gap'. Phase 1: hardcoded for adversarial-review only.

challenge

/challenge

Use when testing debate agent bug-finding accuracy against curated code challenges — F1 scoring, 'test debate agents on challenges', 'benchmark agents'.

cleanup

/cleanup

Manually sweep stale autoimprove worktrees and branches via `skills/_shared/cleanup-worktrees.sh`. Safe to run at any time: it protects live worktrees, tagged keepers, and in-flight experiments. Triggers: '/autoimprove cleanup', 'clean up stale worktrees', 'sweep orphan branches', 'autoimprove hygiene'.

<example>
user: "/autoimprove cleanup --dry-run"
assistant: I'll use the cleanup skill to preview what the sweep would remove.
<commentary>Dry-run preview: cleanup skill.</commentary>
</example>

<example>
user: "clean up the orphan worktree-agent branches"
assistant: I'll use the cleanup skill to sweep them.
<commentary>Explicit cleanup request: cleanup skill.</commentary>
</example>

Do NOT use for in-loop per-experiment cleanup → that lives in step 3j of the run skill. This skill is the manual/safety-net sweep only.

Hooks (1)

Event Hooks

File writes

1 hook across 1 event

The plugin manifest points to a different repository than the source indexed by ClaudePluginHub.

Stats

Version: 0.15.1
Released: Jun 5, 2026
Language: Shell
Stars: 2
Maintenance: Excellent
License: MIT
Last Commit: Jun 13, 2026
Added: May 7, 2026


Safety Signals

Caution

Modifies files

Hook triggers on file write and edit operations

Uses power tools

Uses Bash, Write, or Edit tools

README

autoimprove

Autonomous codebase improvement loop for Claude Code and Codex. Inspired by karpathy/autoresearch.

You program the improvement strategy. The system modifies code, evaluates against your benchmarks, and keeps or discards changes via git worktree isolation. You wake up to a log of experiments and a better codebase.

How it works

autoimprove.yaml          evaluate.sh               experimenter agent
(you write this)          (deterministic scoring)   (blind to scoring)
       │                         │                         │
       ▼                         ▼                         ▼
┌──────────────┐  spawn   ┌──────────────┐  evaluate  ┌──────────┐
│ orchestrator │─────────▶│  worktree    │───────────▶│ verdict  │
│    (loop)    │◀─────────│  experiment  │            │ keep or  │
│              │  commit  │              │            │ discard  │
└──────────────┘          └──────────────┘            └──────────┘

The orchestrator picks improvement themes (failing tests, TODOs, coverage gaps), spawns an experimenter agent into an isolated git worktree, then evaluates the result with a deterministic script. The experimenter never sees your metrics or scores — it makes changes it genuinely believes are improvements.

Scoring uses set logic, not weighted averages. A change is kept only if no metric regresses and at least one improves. A single regression vetoes the entire experiment.
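The keep-or-discard rule above can be sketched in a few lines of bash. This is an illustration only, not the project's `evaluate.sh`: the function names `classify_metric` and `verdict`, the `direction:baseline:candidate` argument format, and the `lower_is_better` direction are assumptions made for the example (the README itself only shows `higher_is_better`).

```bash
#!/usr/bin/env bash
# Illustrative sketch of "no metric regresses and at least one improves".
# All names and the argument format are hypothetical.

# Print "improved", "regressed", or "unchanged" for a single metric.
# Args: direction (higher_is_better|lower_is_better), baseline, candidate
classify_metric() {
  local direction=$1 baseline=$2 candidate=$3
  awk -v d="$direction" -v b="$baseline" -v c="$candidate" 'BEGIN {
    delta = c - b
    if (d == "lower_is_better") delta = -delta   # flip sign so positive means better
    if (delta > 0)      print "improved"
    else if (delta < 0) print "regressed"
    else                print "unchanged"
  }'
}

# Print "keep" or "discard" for a set of metrics.
# Args: one "direction:baseline:candidate" triple per metric
verdict() {
  local improved=0 triple direction baseline candidate
  for triple in "$@"; do
    IFS=: read -r direction baseline candidate <<< "$triple"
    case "$(classify_metric "$direction" "$baseline" "$candidate")" in
      regressed) echo discard; return 0 ;;   # a single regression vetoes
      improved)  improved=1 ;;
    esac
  done
  if (( improved )); then echo keep; else echo discard; fi
}
```

For instance, `verdict higher_is_better:42:45 lower_is_better:7:7` prints `keep`, while changing the second triple to `lower_is_better:7:9` prints `discard` even though the first metric improved.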

Quick start

Claude Code:

# 0. Install the plugin (one-time)
claude plugin marketplace add https://github.com/ipedro/autoimprove
claude plugin install autoimprove

# 1. Inside your project, run:
/autoimprove init

Codex:

$autoimprove:init

/autoimprove init in Claude Code and $autoimprove:init in Codex are interactive — they detect your project, run your tests, and scaffold everything:

autoimprove initialized for my-project (Node.js)

Gates
  [PASS] tests — npm test (42 tests, 0 failures)

Metrics (baseline)
  test_count: 42
  todo_count: 7

Files written:
  autoimprove.yaml        ← your improvement strategy
  benchmark/metrics.sh   ← measures test_count + todo_count

Next step: /autoimprove run --experiments 3
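For orientation, a script that reports these two metrics could look roughly like the sketch below. The generated `benchmark/metrics.sh` is not reproduced in this README, so the function names, the `TODO` search pattern, and the exact JSON layout are assumptions; the only detail taken from the Configuration section is that metrics are extracted from JSON keys such as `.test_count`.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a generated benchmark/metrics.sh.
# It is NOT the script that init writes; names and layout are illustrative.

# Count lines containing "TODO" in text files under a directory.
count_todos() {
  local dir=$1
  # grep exits non-zero when nothing matches; "|| true" keeps the count at 0.
  { grep -rI "TODO" "$dir" 2>/dev/null || true; } | wc -l | tr -d ' '
}

# Print both metrics as a single JSON object on stdout.
# Args: test_count, todo_count (how test_count is obtained depends on the test runner)
emit_metrics() {
  local test_count=$1 todo_count=$2
  printf '{"test_count": %d, "todo_count": %d}\n' "$test_count" "$todo_count"
}
```

A caller might combine the two as `emit_metrics "$test_count" "$(count_todos src)"`, where `src` and the source of `$test_count` are placeholders for project-specific values.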

You don't write a benchmark script — init generates one from your project. Then:

Claude Code:

# 2. Run the improvement loop (3 trial experiments first)
/autoimprove run --experiments 3

# 3. See what happened
/autoimprove report

Codex:

$autoimprove:autoimprove --experiments 3
$autoimprove:report

The autoresearch mapping

| autoresearch | autoimprove |
| --- | --- |
| `train.py` (agent edits this) | Your source code |
| `prepare.py` (immutable eval) | `evaluate.sh` |
| `program.md` (human strategy) | `autoimprove.yaml` |
| `val_bpb` (fitness number) | Per-metric set logic |
| `git reset --hard` | `git worktree remove` |

The key insight from autoresearch: the human doesn't edit the code — they edit the improvement strategy. You tune autoimprove.yaml, not your source files.
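The last row of the mapping, discarding an experiment by removing its worktree rather than resetting the main checkout, might look like this in plain git. It is a sketch under stated assumptions: the `exp/<name>` branch naming, the temporary worktree location, and both function names are hypothetical, and the real orchestrator may sequence these steps differently.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; branch names and paths are illustrative only.

# Create an isolated worktree on a new branch and print its path.
# Args: repo path, experiment name
start_experiment() {
  local repo=$1 name=$2 wt
  wt="$(mktemp -d)/$name"
  git -C "$repo" worktree add -b "exp/$name" "$wt" >/dev/null 2>&1 || return 1
  echo "$wt"
}

# Keep (fast-forward only) or discard, then always clean up the worktree.
# Prints the final outcome: "keep" or "discard".
# Args: repo path, experiment name, worktree path, requested outcome
finish_experiment() {
  local repo=$1 name=$2 wt=$3 outcome=$4
  if [ "$outcome" = keep ]; then
    # A branch that cannot be fast-forwarded is treated as a discard.
    git -C "$repo" merge --ff-only "exp/$name" >/dev/null 2>&1 || outcome=discard
  else
    outcome=discard
  fi
  git -C "$repo" worktree remove --force "$wt" >/dev/null 2>&1
  git -C "$repo" branch -D "exp/$name" >/dev/null 2>&1
  echo "$outcome"
}
```

Because the experiment lives in its own directory and branch, a discard never touches the main checkout; only a successful fast-forward changes it.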

Safety

autoimprove is conservative by default:

Hard gates first — tests and typecheck must pass or the change is immediately discarded
No metric can regress — a single regression vetoes, regardless of other improvements
Epoch drift halt — session stops if cumulative drift exceeds 5% from session start
Trust starts small — tier 0 limits experiments to 3 files, 150 lines. Scope expands only after consecutive successful keeps
Fast-forward only — rebase conflicts = discard. Clean linear history guaranteed
Experimenter is blind — can't game metrics it can't see
Evaluation is deterministic — evaluate.sh (bash + jq), no LLM in the scoring loop
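As one concrete case from the list above, the tier 0 limit (3 files, 150 lines) could be checked against `git diff --numstat` output. This is a sketch: the function name `within_scope` is hypothetical, and counting lines as added plus deleted is an assumption, since the README does not say how the 150 lines are measured.

```bash
#!/usr/bin/env bash
# Hypothetical scope check; reads "added<TAB>deleted<TAB>path" rows on stdin,
# which is the row format printed by `git diff --numstat`.
# Args: max_files, max_lines
# Exit status: 0 if the change fits within both limits, 1 otherwise.
within_scope() {
  local max_files=$1 max_lines=$2
  awk -v mf="$max_files" -v ml="$max_lines" '
    { files++; lines += $1 + $2 }   # binary files show "-" and count as 0 lines
    END { exit !(files <= mf && lines <= ml) }
  '
}
```

A caller could then write something like `git diff --numstat <base>..HEAD | within_scope 3 150 || echo "discard: exceeds tier 0 scope"`, with `<base>` standing in for whatever commit the experiment started from.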

Configuration

autoimprove.yaml lives in your project root:

gates:
  - name: tests
    command: npm test
  - name: typecheck
    command: npx tsc --noEmit

benchmarks:
  - name: project-metrics
    type: script
    command: bash benchmark/metrics.sh
    metrics:
      - name: test_count
        extract: "json:.test_count"
        direction: higher_is_better
        tolerance: 0.02       # max acceptable regression
        significance: 0.01    # min meaningful improvement

themes:
  auto:
    strategy: weighted_random
    priorities:
      failing_tests: 5
      todo_comments: 3
      coverage_gaps: 2
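One plausible reading of `strategy: weighted_random` is that each theme is drawn with probability proportional to its priority, so with the values above `failing_tests` would be picked about 5 times out of 10. The sketch below illustrates that reading only; `pick_theme` is a hypothetical function, and it takes the random roll as an argument so its behaviour can be checked deterministically.

```bash
#!/usr/bin/env bash
# Hypothetical weighted pick; not taken from the autoimprove source.
# Args: roll (integer in [0, sum of weights)), then one "name:weight" pair per theme
pick_theme() {
  local roll=$1 pair name weight
  shift
  for pair in "$@"; do
    name=${pair%%:*}
    weight=${pair##*:}
    if (( roll < weight )); then
      echo "$name"
      return 0
    fi
    roll=$(( roll - weight ))   # move on to the next theme's slice
  done
  return 1                      # roll was outside the total weight
}
```

With the priorities from the config, a real draw would be `pick_theme $(( RANDOM % 10 )) failing_tests:5 todo_comments:3 coverage_gaps:2`, where 10 is the sum of the three weights.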

See docs/configuration.md for the full schema.
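The `tolerance` and `significance` fields suggest that small movements in a metric count as neither a regression nor an improvement. A minimal sketch of that idea follows, assuming both values are fractions of the baseline (so `0.02` means 2%); the README does not state whether they are relative or absolute, and `classify_change` and `lower_is_better` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical threshold check; assumes tolerance and significance are
# relative to the baseline value.
# Args: direction, baseline, candidate, tolerance, significance
classify_change() {
  awk -v d="$1" -v b="$2" -v c="$3" -v tol="$4" -v sig="$5" 'BEGIN {
    denom = (b < 0 ? -b : b)
    if (denom == 0) denom = 1            # assumption: fall back to absolute change
    rel = (c - b) / denom
    if (d == "lower_is_better") rel = -rel
    if (rel < -tol)      print "regressed"   # worse by more than the tolerance
    else if (rel >= sig) print "improved"    # better by at least the significance
    else                 print "unchanged"   # within the noise band
  }'
}
```

Under this reading, a `test_count` that moves from 100 to 99 stays inside the 2% tolerance and is reported as `unchanged`, while a move from 100 to 97 is `regressed`.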

Prerequisites

Claude Code or Codex
jq (brew install jq / apt install jq)
bash 4+
A project with a test suite

Installation

autoimprove ships both a Claude Code plugin manifest and a Codex plugin manifest.

Claude Code

Add as a marketplace:

claude plugin marketplace add https://github.com/ipedro/autoimprove

Install the plugin:

claude plugin install autoimprove
