Use when the user wants to tune an AI agent or optimize one or more measurable numbers at once - "optimize this", "make X faster without blowing up Y", "reduce latency and cost", "find the best trade-off between A and B", "tune these parameters", "which model/prompt/setup", "speed up my app". Runs a Design of Experiments matrix (up to 11 factors in one pass), measures every objective on every run, and selects the best trade-off by weighted scalarization, Derringer-Suich desirability, or Pareto frontier. Falls back to a single-variable autoresearch loop.
Slash command: /agent-doe-engine:agent-doe-engine
<!-- SPDX-FileCopyrightText: 2025-2026 Tyrone Ross, Jr <[email protected]> -->
Optimize numbers you can measure - fast. The core idea is Design of Experiments: test many input variables at once in a handful of runs instead of changing one thing at a time. Then, when several outcomes compete, pick the setting that best trades them off.
The efficiency is the whole point. One-factor-at-a-time needs a run per variable and still misses interactions. DOE resolves several variables together: 2–3 factors in ≤8 runs, 4–7 in 8 runs, 8–11 in a 12-run screening pass. Build time, latency, token cost, bundle size, coverage, accuracy - anything a one-line command turns into a number.
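To make the run counts concrete, here is a minimal sketch of a two-level full factorial in coded units (-1 = low, +1 = high). It uses only the standard library. The factor names are illustrative assumptions, and the plugin's own doe.py may lay out its matrix differently.

```python
from itertools import product

# Hypothetical factor names, for illustration only.
factors = ["workers", "batch_size", "timeout"]

# Every combination of low (-1) and high (+1): 2^3 = 8 runs.
design = [
    dict(zip(factors, levels))
    for levels in product((-1, 1), repeat=len(factors))
]

for run_id, row in enumerate(design):
    print(run_id, row)
```

Each factor sits at its high level in exactly half of the runs, so all eight runs contribute to every main-effect estimate.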
The metric is the only judge. No "this looks better."
${CLAUDE_PLUGIN_ROOT} below is the plugin root; from a clone it's the repo root. Runtime state lives in the consumer project under .agent-doe-engine/optimize/.
| Shape | Trigger | Path |
|---|---|---|
| Multi-factor (the core - fewer runs) | one number, several knobs to test together | DOE matrix, single objective |
| Multi-objective (the differentiator) | ≥2 competing numbers ("faster AND cheaper", "latency vs accuracy") | DOE matrix + an objectives list + a selection method |
| Single-factor | one number, one thing to try | autoresearch greedy loop |
Multi-factor and multi-objective compose: a single DOE run can test many variables and score several objectives at once - that is the fastest path to a good trade-off.
Skip when the user already named factors and they're known-adjustable in this repo. Otherwise run this phase first - wrong factors burn the whole budget on noise.
Every agent-doe-engine run mutates factor values across many DOE runs. Doing that in the user's primary checkout interleaves optimization writes with real work-in-progress and risks leaving the tree dirty if a run is killed. The helper handles create / reuse / cleanup:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/worktree.py \
--workdir "$TARGET_REPO" --target "<target name>" --json init
Use the printed path as the worktree from this point on (cd into it before any further agent-doe-engine command). The branch is agent-doe-engine/<slug>; the worktree is <repo-name>-agent-doe-engine-<slug> alongside the repo. Re-running init is idempotent. At the end of Phase 3 Review run worktree.py ... cleanup [--delete-branch] to remove it.
This is the default path, not an afterthought - there is no "just run in main" shortcut. The helper is stdlib-only and never reaches outside git worktree operations.
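As a rough illustration of the naming convention above, the sketch below derives the branch and worktree path from a target name. The slug rule (lowercase, non-alphanumerics collapsed to hyphens) and the example repo path are assumptions; worktree.py may slugify differently, so treat the printed path from init as authoritative.

```python
import re
from pathlib import Path


def slugify(target: str) -> str:
    """Assumed slug rule: lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", target.lower()).strip("-")


def worktree_names(repo: Path, target: str) -> tuple:
    """Return (branch, worktree_path) following the documented pattern."""
    slug = slugify(target)
    branch = f"agent-doe-engine/{slug}"
    # The worktree sits alongside the repo, not inside it.
    path = repo.parent / f"{repo.name}-agent-doe-engine-{slug}"
    return branch, path


# Hypothetical repo location and target name.
branch, path = worktree_names(Path("/work/myapp"), "API p95 latency")
print(branch)
print(path)
```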
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/suggest_factors.py \
--workdir "$PWD" --top 12 --json --research-levels > /tmp/mg-candidates.json
--research-levels flags high-confidence numeric knobs whose names match tuning keywords (batch, timeout, lr, ...) with needs_research: true and a research_topic string. The script never calls research itself - it just marks which candidates would benefit if the host has a research capability available (see §0.4).
The host coding agent's LLM reads the candidate list and selects which to take forward. The script is deterministic; the choice is reasoning work. Use the AskUserQuestion path to confirm - candidates pre-checked, per existing convention. Surface for each: name, current_value, suggested_levels, confidence, file:line, and one-line why. Limit the user-facing list to the ~6 highest-signal entries; let the user add/remove.
This is the canonical confirmation point - never auto-run downstream phases on heuristic candidates alone.
For each accepted candidate, prove the optimizer can actually move it before spending DOE runs:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/validate_factors.py \
--workdir "$PWD" --candidates /tmp/mg-candidates.json --json --reject-non-adjustable \
> .agent-doe-engine/optimize/validated_factors.json
The validator performs a snapshot → mutate → re-read → revert → verify cycle on each candidate's primary definition site. Output classification:
| adjustability | reason | Action |
|---|---|---|
| adjustable | ok | enters the DOE |
| not_adjustable | dead_constant | reject - zero references; the optimizer would change a value nothing reads |
| not_adjustable | duplicate_definition | reject - two sites with conflicting values, which one wins is ambiguous |
| not_adjustable | mutation_failed | reject - write didn't land (read-only FS, locked file, race with build cache) |
| not_adjustable | revert_failed | hard-surface - working tree is dirty; stop the run and ask the user before continuing |
| not_adjustable | no_definition_site | reject - name vanished since the scan |
Only adjustable candidates enter factors.json. For each rejection, show the user the reason and evidence so they can fix the underlying issue (extract a duplicate to a single config, add a real reference, etc.) or override. --reject-non-adjustable exits 1 - surface that to the user with the rejection summary; the user decides whether to drop, fix, or override.
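The snapshot → mutate → re-read → revert → verify cycle described above can be sketched for a single `NAME = value` definition in a text file. This is a simplified illustration, not validate_factors.py itself: the function name, the regex, and the probe value are assumptions. It treats any second definition as a duplicate and omits the dead_constant reference check.

```python
import re
import tempfile
from pathlib import Path


def check_adjustable(path: Path, name: str, probe: str) -> tuple:
    """Return (adjustability, reason) for a `NAME = value` line in `path`."""
    pattern = re.compile(rf"^({re.escape(name)}\s*=\s*)(.+)$", re.MULTILINE)
    snapshot = path.read_text()                       # snapshot
    matches = pattern.findall(snapshot)
    if not matches:
        return "not_adjustable", "no_definition_site"
    if len(matches) > 1:
        return "not_adjustable", "duplicate_definition"
    mutated = pattern.sub(lambda m: m.group(1) + probe, snapshot)
    path.write_text(mutated)                          # mutate
    reread = pattern.search(path.read_text())         # re-read
    landed = reread is not None and reread.group(2) == probe
    path.write_text(snapshot)                         # revert
    if path.read_text() != snapshot:                  # verify
        return "not_adjustable", "revert_failed"
    if not landed:
        return "not_adjustable", "mutation_failed"
    return "adjustable", "ok"


# Demo on a throwaway file so no real config is touched.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "settings.py"
    cfg.write_text("BATCH_SIZE = 32\nTIMEOUT = 5\n")
    result = check_adjustable(cfg, "BATCH_SIZE", "64")
    missing = check_adjustable(cfg, "WORKERS", "4")
    unchanged = cfg.read_text() == "BATCH_SIZE = 32\nTIMEOUT = 5\n"

print(result, missing, unchanged)
```

The revert happens before the mutation result is judged, so the file is restored even when the write did not land.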
For any validated candidate that was flagged needs_research: true in §0.1, the host LLM may consult its research capability (web search, Exa, Context7, internal docs - whatever the host has available) to propose best-practice levels. The reasoning is the host's; the plugin only carries the structured input/output.
- Numeric factors: replace the heuristic suggested_levels ([0.5x, 1x, 2x]) with researched levels (e.g. for BATCH_SIZE = 32 on a Transformer training loop, research may suggest [8, 16, 32, 64] based on published GPU memory tradeoffs).
- Categorical factors: levels can be strings (e.g. {"name": "prompt_variant", "levels": ["chain-of-thought", "few-shot", "zero-shot"]}). The DOE machinery treats them as categorical levels - useful for prompt A/B/C, tokenizer choice, model variant, scheduler family, etc.

This step is off by default. Enable it only when the user explicitly asks ("research good levels for these") OR when the candidate set is small enough (≤3 factors) that the research overhead is worth it. The host invokes its own research tool - there are no vendor API calls inside this plugin. Always cite the source the research returned in the factors.json why field so a future run can audit it.
If the host has no research capability, skip this step silently - the heuristic levels are a working default.
Write the validated (and optionally researched) candidates to .agent-doe-engine/optimize/factors.json in the shape [{name, low, high}] (numeric two-level) or [{name, levels:[...]}] (numeric multi-level OR categorical). From here, the rest of the SETUP phase (objectives, design) proceeds as Phase 1 below.
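A quick shape check before writing factors.json might look like the sketch below. The function name and the error messages are illustrative, and the rule that low must be strictly below high is an assumption rather than something the plugin documents.

```python
def check_factor(factor: dict) -> str:
    """Classify one factors.json entry; raise ValueError on a bad shape."""
    name = factor.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError("factor needs a non-empty string name")
    if "levels" in factor:
        levels = factor["levels"]
        if not isinstance(levels, list) or len(levels) < 2:
            raise ValueError(f"{name}: levels must list at least two values")
        return "multi-level"
    if "low" in factor and "high" in factor:
        # Assumption: a two-level numeric factor needs low < high.
        if not factor["low"] < factor["high"]:
            raise ValueError(f"{name}: low must be below high")
        return "two-level"
    raise ValueError(f"{name}: expected low/high or levels")


# Hypothetical entries in the two documented shapes.
factors = [
    {"name": "workers", "low": 2, "high": 8},
    {"name": "prompt_variant",
     "levels": ["chain-of-thought", "few-shot", "zero-shot"]},
]
shapes = [check_factor(f) for f in factors]
print(shapes)
```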
Wrong metric = Goodhart's Law. Wrong factors = wasted runs. This is the highest-leverage phase.
Each objective is a number plus how to read it:
{
"objectives": [
{"name": "latency_ms", "direction": "lower", "weight": 0.5, "metric_cmd": "python3 bench.py --stat p95",
"validity": "validated"},
{"name": "cost_usd", "direction": "lower", "weight": 0.3, "metric_cmd": "python3 cost.py",
"validity": "validated"},
{"name": "coverage_pct", "direction": "higher", "weight": 0.2, "metric_cmd": "pytest --cov | tail -1 | grep -o '[0-9]*%'",
"validity": "unvalidated"}
],
"selection": "scalarize"
}
Write it to .agent-doe-engine/optimize/objectives.json. One objective is the single-metric case - everything below still works.
validity field (optional; default unvalidated when absent):
| Value | Meaning |
|---|---|
| validated | The metric has been shown to correlate with the real user outcome (e.g. correlated against ground-truth human ratings or an A/B result). A DOE winner on this metric is safe to apply. |
| unvalidated | The metric is a proxy that has not yet been correlated against the real user outcome. Optimizing it moves the number; whether it moves the underlying goal is unknown. |
| needs_human_ratings | Known proxy; ground-truth human ratings exist or could be collected and should be used to validate before the next DOE cycle. |
Set validity: "validated" only when you have evidence (e.g. a correlation study, an A/B test, or a published benchmark showing the metric tracks user outcome). Leave it absent or "unvalidated" during early exploration. The overfitting reviewer treats any DOE or loop winner selected on an unvalidated or needs_human_ratings metric as a strong_checkpoint finding (Goodhart risk - see Phase 3).
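To see which objectives would trigger that finding before a run, a small pre-flight check can apply the documented default (absent means unvalidated). This is a sketch with a hypothetical helper name; it is not the overfitting reviewer's own logic.

```python
import json

# Trimmed copy of the example objectives (metric_cmd omitted for brevity).
objectives_json = """
{
  "objectives": [
    {"name": "latency_ms", "direction": "lower", "weight": 0.5, "validity": "validated"},
    {"name": "cost_usd", "direction": "lower", "weight": 0.3, "validity": "validated"},
    {"name": "coverage_pct", "direction": "higher", "weight": 0.2}
  ],
  "selection": "scalarize"
}
"""


def goodhart_risks(objectives: list) -> list:
    """Names of objectives whose validity is anything other than 'validated'."""
    return [
        o["name"]
        for o in objectives
        if o.get("validity", "unvalidated") != "validated"
    ]


risky = goodhart_risks(json.loads(objectives_json)["objectives"])
print(risky)
```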
Choosing selection:
| Method | Picks | Use when |
|---|---|---|
| scalarize (default) | max weighted sum of normalized objectives | you can express priorities as weights |
| desirability | max Derringer-Suich D (geometric mean of per-objective desirabilities) | every objective must clear a bar - a zero on one tanks the run |
| pareto | the non-dominated trade-off set (single winner = max-desirability point on the front) | you want to see all trade-offs before committing |
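The three methods can be compared on a toy data set. The sketch below makes several assumptions that doe.py may not share: min-max normalization across the observed runs, the normalized value used directly as a linear desirability, and an unweighted geometric mean. The run values and weights are made up.

```python
from math import prod

objectives = [
    {"name": "latency_ms", "direction": "lower", "weight": 0.6},
    {"name": "cost_usd", "direction": "lower", "weight": 0.4},
]
runs = [
    {"latency_ms": 100.0, "cost_usd": 4.0},
    {"latency_ms": 80.0, "cost_usd": 5.0},
    {"latency_ms": 120.0, "cost_usd": 6.0},
]


def normalize(runs, objectives):
    """Min-max scale each objective to [0, 1], where 1 is always best."""
    scaled = [{} for _ in runs]
    for obj in objectives:
        name = obj["name"]
        vals = [r[name] for r in runs]
        lo, span = min(vals), max(vals) - min(vals)
        for row, v in zip(scaled, vals):
            s = 1.0 if span == 0 else (v - lo) / span
            row[name] = 1.0 - s if obj["direction"] == "lower" else s
    return scaled


def scalarize(scaled, objectives):
    """Weighted sum of normalized objectives, one score per run."""
    return [sum(o["weight"] * row[o["name"]] for o in objectives) for row in scaled]


def desirability(scaled, objectives):
    """Geometric mean of per-objective desirabilities; any zero gives zero."""
    n = len(objectives)
    return [prod(row[o["name"]] for o in objectives) ** (1.0 / n) for row in scaled]


def pareto_front(scaled, objectives):
    """Indices of runs that no other run beats on every objective."""
    names = [o["name"] for o in objectives]

    def dominates(a, b):
        return all(a[n] >= b[n] for n in names) and any(a[n] > b[n] for n in names)

    return [i for i, row in enumerate(scaled)
            if not any(dominates(other, row) for other in scaled)]


scaled = normalize(runs, objectives)
scores = scalarize(scaled, objectives)
d_values = desirability(scaled, objectives)
front = pareto_front(scaled, objectives)
print(scores, d_values, front)
```

In this toy set, runs 0 and 1 are both on the front. The weights break the tie toward run 1 under scalarize, while run 2 is worst on both objectives and scores zero under desirability.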
If the user named factors ("optimize workers, batch_size, timeout"), validate the shape [{name, low, high}] or [{name, levels:[...]}] and run them through Phase 0.3 (adjustability validation) before skipping ahead.
Otherwise the factor inventory comes from Phase 0 (PLAN) above: scan → host-LLM picks → adjustability validation → optional research → .agent-doe-engine/optimize/factors.json. Phase 0 is the canonical path; this section is the contract for what factors.json must contain. Do not auto-run optimization on heuristic candidates that have not passed validate_factors.py.
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/doe.py detect <k>
Routing: k=1 → autoresearch (§Single-factor); 2–3 → 2^k full factorial (≤8 runs); 4–7 → fractional factorial 2^(k-p) Res III/IV (8 runs); 8–11 → Plackett-Burman 12-run screening.
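The routing rule can be restated as a small function, which is handy for estimating the run budget before generating a design. The labels returned here are illustrative and are not necessarily the strings doe.py detect prints.

```python
from typing import Optional, Tuple


def route_design(k: int) -> Tuple[str, Optional[int]]:
    """Map a factor count to (design family, number of runs)."""
    if k < 1 or k > 11:
        raise ValueError("this sketch only covers 1 to 11 factors")
    if k == 1:
        return "autoresearch", None   # greedy loop, bounded by --budget instead
    if k <= 3:
        return "full_factorial", 2 ** k
    if k <= 7:
        return "fractional_factorial", 8
    return "plackett_burman", 12


for k in (1, 3, 5, 11):
    print(k, route_design(k))
```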
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/doe.py generate \
--factors "$(cat .agent-doe-engine/optimize/factors.json)" \
--design auto --seed "$RANDOM" \
> .agent-doe-engine/optimize/doe.json
For each row in .agent-doe-engine/optimize/doe.json (in randomized run_order):
1. Apply runs[i]._factors to code / config / env.
2. Measure every objective's metric_cmd (use metric_runner.py for sampled/aggregated measurement of noisy metrics):
   python3 ${CLAUDE_PLUGIN_ROOT}/scripts/metric_runner.py --cmd "<metric_cmd>" --samples 5 --warmups 1 --aggregate p95
3. Run the guard:
   python3 ${CLAUDE_PLUGIN_ROOT}/scripts/metric_runner.py --guard "<guard_cmd>"
4. Append the result to .agent-doe-engine/optimize/results.jsonl: {"run_id": i, "values": {"latency_ms": .., "cost_usd": ..}, "guard_ok": true}

Then fit effects and select:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/doe.py analyze \
--design .agent-doe-engine/optimize/doe.json \
--results .agent-doe-engine/optimize/results.jsonl \
--objectives .agent-doe-engine/optimize/objectives.json \
> .agent-doe-engine/optimize/effects.json
Output: ranked main effects + interactions per objective, the selection result (best run, scores, always the pareto_front), and best_factors (concrete winning values). Apply the winning combination as one commit. If selection: pareto, present the front and let the user pick the trade-off; default to the max-desirability point.
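For intuition about what a main effect or interaction means in a two-level design, the standard estimate is the mean response at the high level minus the mean response at the low level. The sketch below applies that textbook formula to a made-up 2^2 design; the response values are invented, and doe.py's actual fitting code may be organized differently.

```python
from itertools import product
from statistics import mean

# Coded 2^2 design: rows are (factor A, factor B) at -1 / +1.
design = list(product((-1, 1), repeat=2))
# Made-up latency responses, one per design row.
responses = [100.0, 80.0, 90.0, 60.0]


def contrast_effect(column, responses):
    """Mean response where the column is +1 minus mean where it is -1."""
    hi = [y for c, y in zip(column, responses) if c == 1]
    lo = [y for c, y in zip(column, responses) if c == -1]
    return mean(hi) - mean(lo)


col_a = [row[0] for row in design]
col_b = [row[1] for row in design]
col_ab = [a * b for a, b in zip(col_a, col_b)]  # interaction column

effect_a = contrast_effect(col_a, responses)
effect_b = contrast_effect(col_b, responses)
effect_ab = contrast_effect(col_ab, responses)
print(effect_a, effect_b, effect_ab)
```

Here raising B lowers latency more than raising A does, and the small negative interaction means the two together help slightly more than their separate effects predict.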
When there is one factor (or one thing to try), skip DOE.
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/loop.py --init --workdir "$PWD" \
--target "<name>" --scope "<glob>" \
--objectives "$(cat .agent-doe-engine/optimize/objectives.json | python3 -c 'import sys,json;print(json.dumps(json.load(sys.stdin)["objectives"]))')" \
--selection scalarize \
--metric-cmd "true" --guard-cmd "<cmd>" --budget 20 --direction lower
Measure the baseline once, then record it:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/loop.py --set-baseline --workdir "$PWD" \
--baseline-values '{"latency_ms": 100, "cost_usd": 5}'
Then dispatch the optimize-runner agent. Each iteration: hypothesize one atomic change → apply → measure every objective → loop.py --score --values '{...}' to get the scalar aggregate (improvement ratio vs baseline; >1 = better) → keep if aggregate improves and the guard passes, else git revert. Convergence: 5 consecutive discards, regressing trend, or budget exhausted.
Single-objective mode is the original behavior - omit --objectives and use --metric-cmd directly.
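The keep-or-revert decision in the loop can be illustrated as follows. The aggregate shown is a weighted arithmetic mean of per-objective improvement ratios, which is an assumption: the source only states that the aggregate is an improvement ratio against the baseline with >1 meaning better, so loop.py --score may combine objectives differently.

```python
def aggregate_ratio(baseline: dict, values: dict, objectives: list) -> float:
    """Assumed aggregate: weighted mean of improvement ratios (>1 = better)."""
    total_weight = sum(o["weight"] for o in objectives)
    score = 0.0
    for o in objectives:
        b, v = baseline[o["name"]], values[o["name"]]
        # For "lower" objectives a smaller value is an improvement.
        ratio = b / v if o["direction"] == "lower" else v / b
        score += o["weight"] * ratio
    return score / total_weight


def should_keep(score: float, best_so_far: float, guard_ok: bool) -> bool:
    """Keep only when the aggregate improves and the guard passes."""
    return guard_ok and score > best_so_far


objectives = [
    {"name": "latency_ms", "direction": "lower", "weight": 0.5},
    {"name": "cost_usd", "direction": "lower", "weight": 0.5},
]
baseline = {"latency_ms": 100, "cost_usd": 5}
candidate = {"latency_ms": 80, "cost_usd": 5}

score = aggregate_ratio(baseline, candidate, objectives)
print(score, should_keep(score, 1.0, guard_ok=True))
```

A candidate that improves the aggregate but fails the guard is still discarded, which is what keeps the loop from trading correctness for speed.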
1. Dispatch the overfitting-reviewer (read-only): check for removed safety, fragile shortcuts, metric-gaming, and scope violations across the kept changes.
2. Archive the run:
   python3 ${CLAUDE_PLUGIN_ROOT}/scripts/loop.py --archive --workdir "$PWD"
3. Clean up the worktree:
   python3 ${CLAUDE_PLUGIN_ROOT}/scripts/worktree.py \
   --workdir "$TARGET_REPO" --target "<target name>" --json cleanup [--delete-branch]
   Default keeps the branch (so the user can inspect / merge later); add --delete-branch only when the user explicitly discards the run.

| Component | Tier | Why |
|---|---|---|
| Setup (objectives, factors, selection) | Thinking | Wrong metric = Goodhart |
| Hypothesis generation | Code | High volume |
| Metric / guard / analyze | deterministic scripts | no LLM |
| Keep/revert | deterministic | numeric comparison |
| Overfitting review | Code (read-only) | pattern matching |
.agent-doe-engine/optimize/
├── objectives.json # objectives + selection method
├── factors.json # factor definitions
├── doe.json # generated design matrix
├── results.jsonl # measured responses per run
├── effects.json # per-objective effects + selection result
├── experiment.json # autoresearch config (single/few-factor mode)
├── results.tsv # autoresearch iteration log
└── experiments/ # archived runs
See profiles.md for ready-made single-objective presets (simplify, build time, bundle size, latency). Compose them into a multi-objective objectives.json when you want to optimize several at once.
npx claudepluginhub tyroneross/agent-doe-engine --plugin agent-doe-engine