auto-optimize
AutoResearch for Performance Engineering
Measure first. Reason deep. Reflect. Repeat.
Andrej Karpathy introduced the idea of autoresearch — closing the loop between hypothesis, experiment, and reflection so that an AI agent can drive an entire research cycle autonomously. auto-optimize applies that idea to performance engineering.
You define a numeric goal and a success threshold. The plugin builds regression and benchmark infrastructure, locks a baseline, then runs an autonomous loop: profile → reason → plan → apply → test → measure → reflect → repeat. Every iteration is a git commit. Every decision is reasoned, recorded, and fed back into the next cycle.
Installation
claude plugin marketplace add bluuewhale/auto-optimize
claude plugin install auto-optimize@auto-optimize
Quick Start
/auto-optimize The API is slow. I want to make it faster.
auto-optimize will ask a few clarifying questions — metric, scope, success target, and test commands — then take it from there. If benchmarks or regression tests are missing, it writes them before starting the loop.
Real-World Result: 27% Faster Hash Table
Full write-up: HashSmith Part 3 — I Automated My Way to a 27% Faster Hash Table
HashSmith is an open-source high-performance hash table for the JVM — a SwissTable-style map built around SWAR probing and 8-byte control word groups. After two rounds of manual optimization, the author handed the profiler to auto-optimize.
One prompt. ~3 hours. No manual intervention.
/auto-optimize I want to optimize the get/put performance of the SwissMap implementation.
| | |
|---|---|
| Experiments run | 5 |
| Optimizations landed | 3 |
| Dropped | 2 |
| Improvement vs baseline | 13–32% across all 8 benchmark scenarios |
What the agent found
The agent ran 5 experiments autonomously. Three compounding wins, in order:
- Tombstone guard: the probe loop was carrying tombstone-handling logic on a path where tombstones essentially never exist in production. Splitting it into two specialized loop bodies eliminated the dead weight. Put path: -19% to -45%.
- ILP hoisting on the read path: emptyMask was being computed after the key-equality loop, creating a serial dependency. Moving it adjacent to eqMask let the CPU's out-of-order engine issue the two independent SWAR operations in parallel. Get path: -11% to -36%.
- A third, smaller improvement compounded on top of both.
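The ILP hoisting change can be illustrated with a minimal sketch. It is not HashSmith's code: the control-byte value `EMPTY = 0x80`, the little-endian lane order, and the helper names `match_byte`, `lanes`, `probe_serial`, and `probe_hoisted` are all assumptions made for illustration. Python is used only to show the data dependency; any speedup would come from JIT-compiled JVM code, not from this script.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF
ONES = 0x0101_0101_0101_0101   # 0x01 in every byte lane
LOW7 = 0x7F7F_7F7F_7F7F_7F7F   # low 7 bits of every byte lane
EMPTY = 0x80                   # ASSUMPTION: control byte marking an empty slot


def match_byte(word, byte):
    """SWAR compare: set 0x80 in each lane of the 8-byte `word` equal to `byte`."""
    x = word ^ (ONES * byte)              # matching lanes become 0x00
    t = ((x & LOW7) + LOW7) | x | LOW7    # non-zero lanes -> 0xFF, zero lanes -> 0x7F
    return ~t & MASK64                    # zero lanes -> 0x80, all others -> 0x00


def lanes(mask):
    """Yield lane indices (0 = least significant byte) of the set lanes in `mask`."""
    while mask:
        low = mask & -mask                # isolate the lowest set bit
        yield (low.bit_length() - 1) // 8
        mask ^= low


def probe_serial(word, h2, keys_match):
    """Before: empty_mask is computed only once the key loop has finished."""
    eq_mask = match_byte(word, h2)
    for lane in lanes(eq_mask):
        if keys_match(lane):
            return ("hit", lane)
    empty_mask = match_byte(word, EMPTY)
    return ("miss", None) if empty_mask else ("next_group", None)


def probe_hoisted(word, h2, keys_match):
    """After: both masks are computed back to back; neither depends on the other."""
    eq_mask = match_byte(word, h2)
    empty_mask = match_byte(word, EMPTY)
    for lane in lanes(eq_mask):
        if keys_match(lane):
            return ("hit", lane)
    return ("miss", None) if empty_mask else ("next_group", None)
```

Both functions return the same result for every input. The only difference is where `empty_mask` is computed, which is the kind of reordering that gives an out-of-order CPU two independent instruction chains to overlap.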
None of these required a single line of code written by the author. The structured reasoning pipeline (Step-Back → CoT → Self-Consistency → Pre-mortem) found the tombstone fast path by asking "what is this loop doing that it doesn't need to do?", a question that wasn't visible in the disassembly alone.
The Problem
Most optimization attempts fail silently:
- You change code, feel like it's faster, and ship it, but never measure before or after
- You write a quick benchmark once, optimize for it, then lose the script
- You try five approaches, forget what you tried, and repeat the same dead ends
auto-optimize enforces the discipline you know you should have but don't.
How It Works
| Phase | What Happens | Output |
|---|---|---|
| 0. Gather | Collects goal, scope, metric direction, and numeric success criteria | experiment-plan.md |
| 1. Infra | Builds Regression Test and Benchmark Test scripts if missing (parallel sub-agents) | tests/ + bench/ |
| 1.5 Baseline | Locks noise-floor-validated baseline measurement and environment snapshot | baseline/ |
| 2. Loop | Profile → Disassemble → Reason (Opus) → Apply → Test → Benchmark → Reflect | iterations/ + leaderboard.md |
| 3. Report | Summarizes all iterations, best config, and recommended next steps | final-report.md |
Every iteration is a git commit — including reverts. The full experiment history is always recoverable.
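The keep-or-revert decision implied by the baseline and loop phases can be sketched as follows. This is an assumption about one reasonable policy, not the plugin's actual implementation: the names `Candidate`, `noise_floor`, `decide`, and `run_loop`, the use of the coefficient of variation as the noise floor, and the lower-is-better metric are all hypothetical.

```python
import statistics
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    samples: List[float]   # repeated benchmark timings; lower is better (assumed)
    tests_pass: bool       # outcome of the regression tests


def noise_floor(baseline_samples):
    """Relative run-to-run spread of the baseline (coefficient of variation)."""
    return statistics.stdev(baseline_samples) / statistics.mean(baseline_samples)


def decide(best_mean, candidate, floor):
    """Keep a change only if tests pass and the gain exceeds the noise floor."""
    if not candidate.tests_pass:
        return "revert"
    gain = (best_mean - statistics.mean(candidate.samples)) / best_mean
    return "keep" if gain > floor else "revert"


def run_loop(baseline_samples, candidates):
    """Evaluate candidates in order, logging every verdict, reverts included."""
    floor = noise_floor(baseline_samples)
    best = statistics.mean(baseline_samples)
    log = []
    for cand in candidates:
        verdict = decide(best, cand, floor)
        if verdict == "keep":
            best = statistics.mean(cand.samples)
        # In a real run, this is the point where each verdict would be committed.
        log.append((cand.name, verdict, best))
    return log
```

Measuring each candidate against the best result so far, rather than the original baseline, means a later change is kept only if it adds a gain of its own beyond measurement noise.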
The Intelligence Layer
Most AI coding tools apply changes and hope for the best. auto-optimize's inner loop is built differently — each iteration runs a structured reasoning pipeline powered by Claude Opus before a single line of code is touched.
Multi-technique Reasoning per Iteration
Every iteration delegates planning to a dedicated Opus sub-agent that applies four reasoning techniques in sequence: