Skill

agent-doe-engine

Use when the user wants to tune an AI agent or optimize one or more measurable numbers at once - "optimize this", "make X faster without blowing up Y", "reduce latency and cost", "find the best trade-off between A and B", "tune these parameters", "which model/prompt/setup", "speed up my app". Runs a Design of Experiments matrix (up to 11 factors in one pass), measures every objective on every run, and selects the best trade-off by weighted scalarization, Derringer-Suich desirability, or Pareto frontier. Falls back to a single-variable autoresearch loop.

Invocation

Slash command

/agent-doe-engine:agent-doe-engine

Supporting Files

profiles.md

SKILL.md

249 lines · ~4k tokens

Stats

Language: Python

Stars: 0

Maintenance: Excellent

Last Commit: Jun 11, 2026

agent-doe-engine

Optimize numbers you can measure - fast. The core idea is Design of Experiments: test many input variables at once in a handful of runs instead of changing one thing at a time. Then, when several outcomes compete, pick the setting that best trades them off.

The efficiency is the whole point. One-factor-at-a-time needs a run per variable and still misses interactions. DOE resolves several variables together: 2–3 factors in ≤8 runs, 4–7 in 8 runs, 8–11 in a 12-run screening pass. Build time, latency, token cost, bundle size, coverage, accuracy - anything a one-line command turns into a number.

The metric is the only judge. No "this looks better."

${CLAUDE_PLUGIN_ROOT} below is the plugin root; from a clone it's the repo root. Runtime state lives in the consumer project under .agent-doe-engine/optimize/.

Three shapes of request

| Shape | Trigger | Path |
| --- | --- | --- |
| Multi-factor (the core - fewer runs) | one number, several knobs to test together | DOE matrix, single objective |
| Multi-objective (the differentiator) | ≥2 competing numbers ("faster AND cheaper", "latency vs accuracy") | DOE matrix + an `objectives` list + a `selection` method |
| Single-factor | one number, one thing to try | autoresearch greedy loop |

Multi-factor and multi-objective compose: a single DOE run can test many variables and score several objectives at once - that is the fastest path to a good trade-off.

Phase 0: PLAN - pick the right variables before spending runs

Skip when the user already named factors and they're known-adjustable in this repo. Otherwise run this phase first - wrong factors burn the whole budget on noise.

0.0 - Open a dedicated worktree (mandatory; do NOT run in main)

Every agent-doe-engine run mutates factor values across many DOE runs. Doing that in the user's primary checkout interleaves optimization writes with real work-in-progress and risks leaving the tree dirty if a run is killed. The helper handles create / reuse / cleanup:

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/worktree.py \
  --workdir "$TARGET_REPO" --target "<target name>" --json init

Use the printed path as the worktree from this point on (cd into it before any further agent-doe-engine command). The branch is agent-doe-engine/<slug>; the worktree is <repo-name>-agent-doe-engine-<slug> alongside the repo. Re-running init is idempotent. At the end of Phase 3 Review run worktree.py ... cleanup [--delete-branch] to remove it.
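As a rough illustration of the naming convention above, the sketch below derives the branch name and sibling worktree path and builds an equivalent `git worktree add` command. The `slugify` rule and the `worktree_plan` helper are assumptions made for this example; the real `worktree.py` may slug the target name differently and does more (reuse, cleanup).

```python
import re
from pathlib import Path


def slugify(target: str) -> str:
    """Hypothetical slug rule: lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", target.lower()).strip("-")


def worktree_plan(repo: Path, target: str) -> dict:
    """Derive the branch name, the sibling worktree path, and the git command."""
    slug = slugify(target)
    branch = f"agent-doe-engine/{slug}"
    path = repo.parent / f"{repo.name}-agent-doe-engine-{slug}"
    # `git worktree add -b <branch> <path>` creates the branch and checks it out at <path>.
    cmd = ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]
    return {"branch": branch, "path": path, "cmd": cmd}


plan = worktree_plan(Path("/src/myapp"), "P95 latency")
print(plan["branch"])     # agent-doe-engine/p95-latency
print(plan["path"].name)  # myapp-agent-doe-engine-p95-latency
```

Keeping the worktree next to the repo rather than inside it means the optimization checkout never shows up as untracked files in the primary tree.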

This is the default path, not an afterthought - there is no "just run in main" shortcut. The helper is stdlib-only and never reaches outside git worktree operations.

0.1 - Scan for factor candidates

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/suggest_factors.py \
  --workdir "$PWD" --top 12 --json --research-levels > /tmp/mg-candidates.json

--research-levels flags high-confidence numeric knobs whose names match tuning keywords (batch, timeout, lr, ...) with needs_research: true and a research_topic string. The script never calls research itself - it just marks which candidates would benefit if the host has a research capability available (see §0.4).

0.2 - Host LLM ranks and picks the candidates to test

The host coding agent's LLM reads the candidate list and selects which to take forward. The script is deterministic; the choice is reasoning work. Use the AskUserQuestion path to confirm - candidates pre-checked, per existing convention. Surface for each: name, current_value, suggested_levels, confidence, file:line, and one-line why. Limit the user-facing list to the ~6 highest-signal entries; let the user add/remove.

This is the canonical confirmation point - never auto-run downstream phases on heuristic candidates alone.

0.3 - Validate adjustability (REQUIRED before DOE)

For each accepted candidate, prove the optimizer can actually move it before spending DOE runs:

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/validate_factors.py \
  --workdir "$PWD" --candidates /tmp/mg-candidates.json --json --reject-non-adjustable \
  > .agent-doe-engine/optimize/validated_factors.json

The validator performs a snapshot → mutate → re-read → revert → verify cycle on each candidate's primary definition site. Output classification:

| adjustability | reason | Action |
| --- | --- | --- |
| `adjustable` | `ok` | enters the DOE |
| `not_adjustable` | `dead_constant` | reject - zero references; the optimizer would change a value nothing reads |
| `not_adjustable` | `duplicate_definition` | reject - two sites with conflicting values, which one wins is ambiguous |
| `not_adjustable` | `mutation_failed` | reject - write didn't land (read-only FS, locked file, race with build cache) |
| `not_adjustable` | `revert_failed` | hard-surface - working tree is dirty; stop the run and ask the user before continuing |
| `not_adjustable` | `no_definition_site` | reject - name vanished since the scan |

Only adjustable candidates enter factors.json. For each rejection, show the user the reason and evidence so they can fix the underlying issue (extract a duplicate to a single config, add a real reference, etc.) or override. With --reject-non-adjustable, the validator exits with status 1 when a candidate is rejected; surface that exit to the user together with the rejection summary, and let the user decide whether to drop, fix, or override.
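The snapshot → mutate → re-read → revert → verify cycle can be pictured with a minimal sketch. It assumes the factor is defined on a single `NAME = value` line in one file; `check_adjustable` and its regex are hypothetical and are not the validator's actual code. Reference counting (the `dead_constant` case) is left out.

```python
import re
import tempfile
from pathlib import Path


def check_adjustable(path: Path, name: str, probe: str) -> tuple[str, str]:
    """Return (adjustability, reason) for a `NAME = value` definition in `path`."""
    pattern = re.compile(rf"^({re.escape(name)}\s*=\s*)(.+)$", re.MULTILINE)
    snapshot = path.read_text()                       # 1. snapshot
    sites = pattern.findall(snapshot)
    if not sites:
        return "not_adjustable", "no_definition_site"
    if len({value for _, value in sites}) > 1:        # conflicting definitions
        return "not_adjustable", "duplicate_definition"
    try:
        mutated = pattern.sub(lambda m: m.group(1) + probe, snapshot)
        path.write_text(mutated)                      # 2. mutate
        reread = pattern.search(path.read_text())     # 3. re-read
        landed = reread is not None and reread.group(2) == probe
    except OSError:
        landed = False
    try:
        path.write_text(snapshot)                     # 4. revert
    except OSError:
        pass                                          # judged by the verify step
    if path.read_text() != snapshot:                  # 5. verify
        return "not_adjustable", "revert_failed"
    return ("adjustable", "ok") if landed else ("not_adjustable", "mutation_failed")


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "settings.py"
    cfg.write_text("BATCH_SIZE = 32\nTIMEOUT = 5\n")
    print(check_adjustable(cfg, "BATCH_SIZE", "64"))  # ('adjustable', 'ok')
    print(check_adjustable(cfg, "WORKERS", "8"))      # ('not_adjustable', 'no_definition_site')
```

The verify step compares the file against the snapshot instead of trusting the revert write, which is why a failed revert can be reported as its own reason.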

0.4 - Research seam (optional, host-driven, off by default)

For any validated candidate that was flagged needs_research: true in §0.1, the host LLM may consult its research capability (web search, Exa, Context7, internal docs - whatever the host has available) to propose best-practice levels. The reasoning is the host's; the plugin only carries the structured input/output.

Numeric factor: replace the heuristic suggested_levels ([0.5x, 1x, 2x]) with researched levels (e.g. for BATCH_SIZE = 32 on a Transformer training loop, research may suggest [8, 16, 32, 64] based on published GPU memory tradeoffs).
Categorical factor: replace the levels with named variants ({"name": "prompt_variant", "levels": ["chain-of-thought", "few-shot", "zero-shot"]}). The DOE machinery treats them as categorical levels - useful for prompt A/B/C, tokenizer choice, model variant, scheduler family, etc.

This step is off by default. Enable only when the user explicitly asks ("research good levels for these") OR when the candidate set is small enough (≤3 factors) that the research overhead is worth it. The host invokes its own research tool - there are no vendor API calls inside this plugin. Always cite the source the research returned in the factors.json why field so a future run can audit it.

If the host has no research capability, skip this step silently - the heuristic levels are a working default.

0.5 - Compose the factor file

Write the validated (and optionally researched) candidates to .agent-doe-engine/optimize/factors.json in the shape [{name, low, high}] (numeric two-level) or [{name, levels:[...]}] (numeric multi-level OR categorical). From here, the rest of the SETUP phase (objectives, design) proceeds as Phase 1 below.
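A small shape check for the two accepted factor forms might look like the sketch below. `validate_factor` and its exact rules (for example, rejecting an entry that has both `low`/`high` and `levels`) are assumptions for illustration, not the plugin's schema.

```python
import json


def validate_factor(factor: dict) -> str:
    """Classify one entry as 'two_level' or 'multi_level'; raise on a bad shape."""
    name = factor.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError(f"factor needs a non-empty string name: {factor!r}")
    has_range = "low" in factor and "high" in factor
    has_levels = "levels" in factor
    if has_range == has_levels:
        raise ValueError(f"{name}: give either low/high or levels, not both or neither")
    if has_range:
        if factor["low"] == factor["high"]:
            raise ValueError(f"{name}: low and high must differ")
        return "two_level"
    levels = factor["levels"]
    if not isinstance(levels, list) or len(set(map(str, levels))) < 2:
        raise ValueError(f"{name}: levels must list at least two distinct values")
    return "multi_level"


factors = json.loads("""[
  {"name": "workers", "low": 2, "high": 8},
  {"name": "prompt_variant", "levels": ["chain-of-thought", "few-shot", "zero-shot"]}
]""")
print([validate_factor(f) for f in factors])  # ['two_level', 'multi_level']
```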

Phase 1: SETUP - get the objectives and factors right

Wrong metric = Goodhart's Law. Wrong factors = wasted runs. This is the highest-leverage phase.

1.1 - Name the objectives

Each objective is a number plus how to read it:

{
  "objectives": [
    {"name": "latency_ms",   "direction": "lower",  "weight": 0.5, "metric_cmd": "python3 bench.py --stat p95",
     "validity": "validated"},
    {"name": "cost_usd",     "direction": "lower",  "weight": 0.3, "metric_cmd": "python3 cost.py",
     "validity": "validated"},
    {"name": "coverage_pct", "direction": "higher", "weight": 0.2, "metric_cmd": "pytest --cov | awk '/^TOTAL/ {print $NF+0}'",
     "validity": "unvalidated"}
  ],
  "selection": "scalarize"
}

Write it to .agent-doe-engine/optimize/objectives.json. One objective is the single-metric case - everything below still works.

validity field (optional; default unvalidated when absent):

| Value | Meaning |
| --- | --- |
| `validated` | The metric has been shown to correlate with the real user outcome (e.g. correlated against ground-truth human ratings or an A/B result). A DOE winner on this metric is safe to apply. |
| `unvalidated` | The metric is a proxy that has not yet been correlated against the real user outcome. Optimizing it moves the number; whether it moves the underlying goal is unknown. |
| `needs_human_ratings` | Known proxy; ground-truth human ratings exist or could be collected and should be used to validate before the next DOE cycle. |

Set validity: "validated" only when you have evidence (e.g. a correlation study, an A/B test, or a published benchmark showing the metric tracks user outcome). Leave it absent or "unvalidated" during early exploration. The overfitting reviewer treats any DOE or loop winner selected on an unvalidated or needs_human_ratings metric as a strong_checkpoint finding (Goodhart risk - see Phase 3).
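The validity rule can be expressed as a short check. This is only a sketch of the rule as described here: the `goodhart_findings` helper and the `objective` and `severity` field names are hypothetical, and the reviewer itself is an agent, not this function.

```python
def goodhart_findings(objectives: list[dict]) -> list[dict]:
    """Flag every objective whose validity is not 'validated' (absent = 'unvalidated')."""
    findings = []
    for obj in objectives:
        validity = obj.get("validity", "unvalidated")
        if validity != "validated":
            findings.append({
                "objective": obj["name"],
                "validity": validity,
                "severity": "strong_checkpoint",  # Goodhart risk on this winner
            })
    return findings


objectives = [
    {"name": "latency_ms", "validity": "validated"},
    {"name": "coverage_pct"},  # validity absent, so treated as unvalidated
    {"name": "helpfulness", "validity": "needs_human_ratings"},
]
for finding in goodhart_findings(objectives):
    print(finding["objective"], finding["validity"])
# coverage_pct unvalidated
# helpfulness needs_human_ratings
```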

Choosing selection:

| Method | Picks | Use when |
| --- | --- | --- |
| `scalarize` (default) | max weighted sum of normalized objectives | you can express priorities as weights |
| `desirability` | max Derringer-Suich D (geometric mean of per-objective desirabilities) | every objective must clear a bar - a zero on one tanks the run |
| `pareto` | the non-dominated trade-off set (single winner = max-desirability point on the front) | you want to see all trade-offs before committing |
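To make the three methods concrete, here is a minimal sketch on made-up numbers. It assumes min-max normalization over the observed runs with 1 as the best score, and it uses those normalized scores directly as desirabilities. The plugin's actual normalization is not specified here, and classical Derringer-Suich desirability uses user-chosen lower, target, and upper bounds rather than observed extremes.

```python
import math

OBJECTIVES = [  # same shape as objectives.json; weights chosen for illustration
    {"name": "latency_ms", "direction": "lower", "weight": 0.5},
    {"name": "cost_usd", "direction": "lower", "weight": 0.5},
]
RUNS = [  # illustrative numbers, not measured data
    {"latency_ms": 100, "cost_usd": 5.0},
    {"latency_ms": 80, "cost_usd": 9.0},
    {"latency_ms": 120, "cost_usd": 4.0},
    {"latency_ms": 110, "cost_usd": 6.0},
]


def normalize(runs, objectives):
    """Min-max scale each objective to [0, 1], where 1 is always the best run."""
    scaled = [{} for _ in runs]
    for obj in objectives:
        vals = [run[obj["name"]] for run in runs]
        lo, hi = min(vals), max(vals)
        for out, v in zip(scaled, vals):
            if hi == lo:
                score = 1.0  # no spread: every run is equally good
            else:
                frac = (v - lo) / (hi - lo)
                score = 1.0 - frac if obj["direction"] == "lower" else frac
            out[obj["name"]] = score
    return scaled


def scalarize(scores, objectives):
    """Weighted sum of normalized objectives."""
    return sum(o["weight"] * scores[o["name"]] for o in objectives)


def desirability(scores, objectives):
    """Weighted geometric mean; a 0 on any objective forces D to 0."""
    total = sum(o["weight"] for o in objectives)
    return math.prod(scores[o["name"]] ** (o["weight"] / total) for o in objectives)


def pareto_front(scaled):
    """Indices of runs that no other run dominates."""
    def dominates(a, b):
        return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)
    return [i for i, s in enumerate(scaled)
            if not any(dominates(t, s) for t in scaled)]


scaled = normalize(RUNS, OBJECTIVES)
best = max(range(len(RUNS)), key=lambda i: scalarize(scaled[i], OBJECTIVES))
print(best, pareto_front(scaled))  # 0 [0, 1, 2]
```

In this toy data, runs 1 and 2 are each best on one objective and worst on the other, so they sit on the Pareto front but score a desirability of 0. That is the "every objective must clear a bar" behavior in the table.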

1.2 - Identify factors

If the user named factors ("optimize workers, batch_size, timeout"), validate the shape [{name, low, high}] or [{name, levels:[...]}] and run them through Phase 0.3 (adjustability validation) before skipping ahead.

Otherwise the factor inventory comes from Phase 0 (PLAN) above: scan → host-LLM picks → adjustability validation → optional research → .agent-doe-engine/optimize/factors.json. Phase 0 is the canonical path; this section is the contract for what factors.json must contain. Do not auto-run optimization on heuristic candidates that have not passed validate_factors.py.

1.3 - Pick the design (≥2 factors)

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/doe.py detect <k>

Routing: k=1 → autoresearch (§Single-factor); k=2–3 → 2^k full factorial (≤8 runs); k=4–7 → 2^(k-p) fractional factorial, Resolution III/IV (8 runs); k=8–11 → Plackett-Burman 12-run screening.
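The routing rule and the simplest design it selects can be sketched as follows. The design labels returned by `pick_design` and the dictionary layout produced by `full_factorial` (the `coded`, `_factors`, and `run_order` keys) are assumptions for illustration; the real output of `doe.py` may be shaped differently.

```python
import itertools
import random
from typing import Optional


def pick_design(k: int) -> tuple[str, Optional[int]]:
    """Map factor count k to (design family, run count) per the routing rule."""
    if k < 1:
        raise ValueError("need at least one factor")
    if k == 1:
        return "autoresearch", None        # greedy loop; length set by its budget
    if k <= 3:
        return "full_factorial", 2 ** k    # 4 or 8 runs
    if k <= 7:
        return "fractional_factorial", 8   # 2^(k-p) with p = k - 3
    if k <= 11:
        return "plackett_burman", 12
    raise ValueError("this routing table covers at most 11 factors")


def full_factorial(factors: list[dict], seed: int) -> dict:
    """2^k design: every low/high combination, with a seeded random run order."""
    names = [f["name"] for f in factors]
    runs = []
    for coded in itertools.product((-1, 1), repeat=len(factors)):
        values = {f["name"]: (f["low"] if c < 0 else f["high"])
                  for f, c in zip(factors, coded)}
        runs.append({"coded": dict(zip(names, coded)), "_factors": values})
    order = list(range(len(runs)))
    random.Random(seed).shuffle(order)     # randomizing guards against time drift
    return {"design": "full_factorial", "runs": runs, "run_order": order}


print(pick_design(3), pick_design(5), pick_design(9))
design = full_factorial(
    [{"name": "workers", "low": 2, "high": 8},
     {"name": "batch_size", "low": 16, "high": 64}],
    seed=7,
)
print(len(design["runs"]))            # 4
print(design["runs"][0]["_factors"])  # {'workers': 2, 'batch_size': 16}
```

Fractional factorial and Plackett-Burman matrices need generator columns and are omitted here; the point of the sketch is only the routing thresholds and the shape of a coded two-level design.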

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/doe.py generate \
  --factors "$(cat .agent-doe-engine/optimize/factors.json)" \
  --design auto --seed "$RANDOM" \
  > .agent-doe-engine/optimize/doe.json

Phase 2: RUN THE MATRIX

For each row in .agent-doe-engine/optimize/doe.json (in randomized run_order):

1. Apply the factor values from runs[i]._factors to code / config / env.
2. Measure every objective - run each objective's metric_cmd (use metric_runner.py for sampled/aggregated measurement of noisy metrics):
   ```
   python3 ${CLAUDE_PLUGIN_ROOT}/scripts/metric_runner.py --cmd "<metric_cmd>" --samples 5 --warmups 1 --aggregate p95
   ```
3. Run the guard (must exit 0): python3 ${CLAUDE_PLUGIN_ROOT}/scripts/metric_runner.py --guard "<guard_cmd>".
4. Append to .agent-doe-engine/optimize/results.jsonl: {"run_id": i, "values": {"latency_ms": .., "cost_usd": ..}, "guard_ok": true}.
5. Revert the factor changes - each DOE run starts from the same baseline; the design does not accumulate.
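The per-run steps above can be sketched as one loop. This is an illustration under assumptions, not the plugin's code: the `apply` and `revert` callbacks are hypothetical hooks, `run_order` is assumed to be a list of run indices, each metric command is assumed to print a single number, and the nearest-rank percentile may differ from what `metric_runner.py` computes.

```python
import json
import math
import subprocess
from pathlib import Path


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def measure(cmd: str, samples: int = 5, warmups: int = 1) -> float:
    """Run `cmd` warmups + samples times, parse stdout as a float, return the p95."""
    values = []
    for i in range(warmups + samples):
        out = subprocess.run(cmd, shell=True, check=True,
                             capture_output=True, text=True).stdout
        if i >= warmups:                   # warmup results are discarded
            values.append(float(out.strip()))
    return p95(values)


def run_matrix(design, objectives, guard_cmd, apply, revert, results_path: Path):
    """One pass over the design: apply -> measure -> guard -> append -> revert."""
    with results_path.open("a") as log:
        for i in design["run_order"]:
            apply(design["runs"][i]["_factors"])
            try:
                values = {o["name"]: measure(o["metric_cmd"]) for o in objectives}
                guard_ok = subprocess.run(guard_cmd, shell=True).returncode == 0
                log.write(json.dumps({"run_id": i, "values": values,
                                      "guard_ok": guard_ok}) + "\n")
            finally:
                revert()                   # every run starts from the same baseline
```

Putting the revert in a `finally` clause keeps the tree clean even when a metric command fails partway through a run.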

Then fit effects and select:

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/doe.py analyze \
  --design .agent-doe-engine/optimize/doe.json \
  --results .agent-doe-engine/optimize/results.jsonl \
  --objectives .agent-doe-engine/optimize/objectives.json \
  > .agent-doe-engine/optimize/effects.json

Output: ranked main effects + interactions per objective, the selection result (best run, scores, and the pareto_front, which is always included), and best_factors (concrete winning values). Apply the winning combination as one commit. If selection is pareto, present the front and let the user pick the trade-off; default to the max-desirability point.
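For a two-level design, a main effect is the mean response at the high level minus the mean response at the low level. An interaction is the same contrast taken on the product of two coded columns. The sketch below shows that arithmetic on illustrative numbers; the function names are hypothetical and the real `doe.py analyze` may fit effects differently.

```python
from statistics import mean


def contrast(column: list[int], responses: list[float]) -> float:
    """Mean response where the coded column is +1 minus mean where it is -1."""
    high = [y for c, y in zip(column, responses) if c > 0]
    low = [y for c, y in zip(column, responses) if c < 0]
    return mean(high) - mean(low)


def main_effects(coded_runs: list[dict], responses: list[float]) -> dict:
    """Main effect per factor, ranked by absolute size."""
    effects = {name: contrast([run[name] for run in coded_runs], responses)
               for name in coded_runs[0]}
    return dict(sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True))


def interaction(coded_runs, responses, a: str, b: str) -> float:
    """Two-factor interaction: contrast on the product of the two coded columns."""
    return contrast([run[a] * run[b] for run in coded_runs], responses)


# 2^2 design in coded units; latencies are illustrative, not measured.
coded = [
    {"workers": -1, "batch_size": -1},
    {"workers": -1, "batch_size": 1},
    {"workers": 1, "batch_size": -1},
    {"workers": 1, "batch_size": 1},
]
latency_ms = [120.0, 110.0, 90.0, 70.0]
print(main_effects(coded, latency_ms))  # {'workers': -35.0, 'batch_size': -15.0}
print(interaction(coded, latency_ms, "workers", "batch_size"))  # -5.0
```

Here raising `workers` lowers latency by 35 ms on average, more than twice the effect of `batch_size`. The small negative interaction says the two gains reinforce each other slightly.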

Single-factor - autoresearch loop

When there is one factor (or one thing to try), skip DOE.

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/loop.py --init --workdir "$PWD" \
  --target "<name>" --scope "<glob>" \
  --objectives "$(cat .agent-doe-engine/optimize/objectives.json | python3 -c 'import sys,json;print(json.dumps(json.load(sys.stdin)["objectives"]))')" \
  --selection scalarize \
  --metric-cmd "true" --guard-cmd "<cmd>" --budget 20 --direction lower

Measure the baseline once, then record it:

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/loop.py --set-baseline --workdir "$PWD" \
  --baseline-values '{"latency_ms": 100, "cost_usd": 5}'

Then dispatch the optimize-runner agent. Each iteration: hypothesize one atomic change → apply → measure every objective → loop.py --score --values '{...}' to get the scalar aggregate (improvement ratio vs baseline; >1 = better) → keep if the aggregate improves and the guard passes, else git revert. The loop stops after 5 consecutive discards, on a regressing trend, or when the budget is exhausted.
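A minimal sketch of the keep/discard logic follows. The aggregate is assumed to be a weighted geometric mean of per-objective improvement ratios; the actual formula behind `loop.py --score` is not documented here, and the regressing-trend stop is left out.

```python
def improvement_ratio(values: dict, baseline: dict, objectives: list[dict]) -> float:
    """Weighted geometric mean of per-objective ratios vs baseline; > 1 is better."""
    total = sum(o["weight"] for o in objectives)
    score = 1.0
    for o in objectives:
        new, old = values[o["name"]], baseline[o["name"]]
        ratio = old / new if o["direction"] == "lower" else new / old
        score *= ratio ** (o["weight"] / total)
    return score


def run_loop(candidates, baseline, objectives, budget=20, patience=5):
    """Greedy keep/discard; stops on budget or `patience` discards in a row."""
    best, kept, discards = 1.0, [], 0      # 1.0 is the baseline's own score
    for i, (values, guard_ok) in enumerate(candidates):
        if i >= budget or discards >= patience:
            break
        score = improvement_ratio(values, baseline, objectives)
        if guard_ok and score > best:
            best, discards = score, 0
            kept.append(i)
        else:
            discards += 1                  # the real loop reverts the change here
    return best, kept


objectives = [
    {"name": "latency_ms", "direction": "lower", "weight": 0.5},
    {"name": "cost_usd", "direction": "lower", "weight": 0.5},
]
baseline = {"latency_ms": 100, "cost_usd": 5}
candidates = [  # (measured values, guard passed), illustrative only
    ({"latency_ms": 80, "cost_usd": 5}, True),   # faster at the same cost: keep
    ({"latency_ms": 70, "cost_usd": 9}, True),   # faster but much pricier: discard
    ({"latency_ms": 60, "cost_usd": 5}, False),  # guard failed: discard
]
best, kept = run_loop(candidates, baseline, objectives)
print(round(best, 3), kept)  # 1.118 [0]
```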

Single-objective mode is the original behavior - omit --objectives and use --metric-cmd directly.

Phase 3: REVIEW

1. Dispatch overfitting-reviewer (read-only): check for removed safety, fragile shortcuts, metric-gaming, scope violations across the kept changes.
2. Summarize: runs, kept/reverted, per-objective improvement, the chosen trade-off.
3. Archive: python3 ${CLAUDE_PLUGIN_ROOT}/scripts/loop.py --archive --workdir "$PWD".
4. Worktree cleanup (Phase 0.0 counterpart). When the user has reviewed the kept changes and is ready to merge/cherry-pick or discard, remove the agent-doe-engine worktree:
   ```
   python3 ${CLAUDE_PLUGIN_ROOT}/scripts/worktree.py \
     --workdir "$TARGET_REPO" --target "<target name>" --json cleanup [--delete-branch]
   ```
   Default keeps the branch (so the user can inspect / merge later); add --delete-branch only when the user explicitly discards the run.

Model tiering (when running under a multi-model host)

| Component | Tier | Why |
| --- | --- | --- |
| Setup (objectives, factors, selection) | Thinking | Wrong metric = Goodhart |
| Hypothesis generation | Code | High volume |
| Metric / guard / analyze | deterministic scripts | no LLM |
| Keep/revert | deterministic | numeric comparison |
| Overfitting review | Code (read-only) | pattern matching |

State files

.agent-doe-engine/optimize/
├── objectives.json   # objectives + selection method
├── factors.json      # factor definitions
├── doe.json          # generated design matrix
├── results.jsonl     # measured responses per run
├── effects.json      # per-objective effects + selection result
├── experiment.json   # autoresearch config (single/few-factor mode)
├── results.tsv       # autoresearch iteration log
└── experiments/      # archived runs

Profiles

See profiles.md for ready-made single-objective presets (simplify, build time, bundle size, latency). Compose them into a multi-objective objectives.json when you want to optimize several at once.
