From betterbest
Use when you want to make an existing thing as good as it can be through iterative experiments — a prompt, config, function, agent system, essay, recipe, policy doc, design — anything you can inspect, change, and measure. Triggers when someone says "make this better", "improve X", "tune this", "iterate on this until it's good", or hands you something that almost works. A general-purpose adaptation of autoresearch for arbitrary artefacts.
How this skill is triggered — by the user, by Claude, or both
Slash command
/betterbest:betterbestThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
**You've got good. Go for best.** Polish and improve just about anything, from code to prose.
You've got good. Go for best. Polish and improve just about anything, from code to prose.
betterbest is a general-purpose adaptation of autoresearch: you improve a thing by running iterative experiments on it. One experiment = change one thing, run a probe, read the result, keep it if better or revert if not. Repeat until the thing is as good as it can be.
It works on anything you can inspect, change, and measure: a prompt, a config, a function, an agent system, an essay, a recipe, a melody, a policy doc, a UI flow. The thing only has to be good enough to start — betterbest refines what exists, it does not generate from nothing.
| Piece | What it is | Examples |
|---|---|---|
| Artefact | the thing you're improving (inspectable, mutable, versionable) | prompt.txt, a function, speech.md, a recipe |
| Experiment | one change → run probe → read → keep/revert | the loop body |
| Probe | whatever exercises the artefact and yields something to read | test questions at a prompt; tests+profile for code; a critic-persona panel for prose (see Working on prose) |
| Rubric | the weighted dimensions you read each result against, each tagged hard gate or soft penalty; each dimension's scorer can be a formula OR a subagent evaluation — same interface, just a weighted term | +5 cites source / −50 hallucination / −20 cost; or +30 "a fresh critic judges clarity" |
| Journal | the running record of every experiment and verdict | scripts/journal.py |
digraph betterbest {
"Setup (once): artefact, probe, rubric, knobs" [shape=box];
"Read state + history (best-so-far, what's been tried)" [shape=box];
"Propose ONE mutation (a hypothesis)" [shape=box];
"Apply it to the artefact" [shape=box];
"Run the probe" [shape=box];
"CANARY: did the probe run THIS variant?" [shape=diamond];
"Score by READING outputs vs rubric" [shape=box];
"Better than best?" [shape=diamond];
"Keep + journal" [shape=box];
"Revert + journal why" [shape=box];
"Cadence check: critique judge? audit rubric? stop?" [shape=diamond];
"Done" [shape=doublecircle];
"Setup (once): artefact, probe, rubric, knobs" -> "Read state + history (best-so-far, what's been tried)";
"Read state + history (best-so-far, what's been tried)" -> "Propose ONE mutation (a hypothesis)";
"Propose ONE mutation (a hypothesis)" -> "Apply it to the artefact";
"Apply it to the artefact" -> "Run the probe";
"Run the probe" -> "CANARY: did the probe run THIS variant?";
"CANARY: did the probe run THIS variant?" -> "Read state + history (best-so-far, what's been tried)" [label="mismatch: fix, don't score"];
"CANARY: did the probe run THIS variant?" -> "Score by READING outputs vs rubric" [label="ok"];
"Score by READING outputs vs rubric" -> "Better than best?";
"Better than best?" -> "Keep + journal" [label="yes"];
"Better than best?" -> "Revert + journal why" [label="no"];
"Keep + journal" -> "Cadence check: critique judge? audit rubric? stop?";
"Revert + journal why" -> "Cadence check: critique judge? audit rubric? stop?";
"Cadence check: critique judge? audit rubric? stop?" -> "Read state + history (best-so-far, what's been tried)" [label="continue"];
"Cadence check: critique judge? audit rubric? stop?" -> "Done" [label="stop condition met"];
}
Before iterating, settle these. Use AskUserQuestion for the multiple-choice knobs; ask
free-form for the artefact/probe/rubric. Don't start the loop until they're set.
No user to interview (autonomous run)? Don't block — choose the defaults (subagent-auditor review mode; plateau K=5-8 and budget N per the sizing guidance), record each choice in the journal as an explicit assumption to revisit, and start the loop. The rubric audit is where wrong guesses surface and get corrected.
A capable agent already proposes one change at a time, applies the loop to non-code things, audits a gamed rubric, and stops on a plateau — when a single clean task is in front of it. The value of betterbest is keeping that alive across a long, real run, plus these things that are easy to drop:
The rubric verdict comes from you reading the output and deciding. Mechanical checks
(grep, a count, a format match) are hints to look at, never the score. Regex misses
paraphrase, tone, partial leaks, and every subjective dimension; it burns the rubric's dynamic
range on false positives. scripts/judge_helper.py renders a worksheet you fill in by reading;
it deliberately does no scoring. Even for a "fully specified" literal (an ID, a banned word),
the grep is a pre-filter — you still read the flagged item to render the verdict.
The silent failure: you believe you tested variant v14, but a stale path / cached config /
un-applied edit means the probe ran v12. The score looks fine; it's measuring the wrong thing,
and every later experiment builds on a phantom. Before trusting any score, verify the
running config matches the declared variant — compare a declared property (a planted marker, the
model id, the file hash) against what actually appears in the probe output. (Prose has no hash —
see Working on prose for its canary form.) scripts/canary.py does this and exits loud on
mismatch. On mismatch: do NOT log; fix the apply/probe path; re-run. This is cheap; run it
every time, not just when suspicious.
Over a long run you anchor on your own prior verdicts and scores creep upward — the same output you'd have scored 6 at iteration 3 you wave through as an 8 at iteration 25. Every 5–7 iterations, spawn a fresh subagent with no loop context. Give it the rubric and a sample of recent outputs cold, and ask: were these scored consistently and correctly? Journal its findings. When it disagrees, either justify your score or revise it — both are useful. A single-shot run never reveals this; a real run always does.
Early experiments expose flaws in the rubric itself: a reward being gamed, a dimension that never fires, a penalty too weak to matter, two dimensions double-counting. On the rubric-review cadence from setup, step out of optimization mode and ask: is the rubric still measuring what we care about? When you revise it, journal the revision and re-score recent iterations against the new rubric (you trade away comparability with older scores — that's the cost of not optimizing the wrong thing for 50 iterations).
The audit can change the judges, not just the weights. A dimension's scorer can be a formula or a subagent, so the audit revises both: re-weight, add, or drop dimensions and swap a dimension's implementation — replace a formula with a fresh critic, or split a vague subagent-judge into two sharper ones. Adding or removing a judge from a panel is adding or removing a weighted dimension; no separate machinery. The same anti-gaming guard applies: the audit is anchored to the original goal (a fresh goal-anchored agent or the human), so judges change only to measure the goal better, never to pick a more lenient referee — swapping in a friendlier judge is just rubric-gaming one level up.
Re-anchor best after a revision — don't keep climbing the old hill. The old rubric is what
crowned the current best, and it's the thing you just decided you couldn't trust. So re-scoring
isn't bookkeeping; it can change which variant you continue from. Re-score recent iterations —
including ones you REVERTED — under the new rubric, because a rubric change is exactly what
resurrects a variant the old metric undervalued. Pick the new best from those re-scored results
and continue from it. Bound "recent" to keep it cheap, and treat anything older than the last
rubric change (or older than a canary fix) as suspect rather than re-scoring the whole history.
If an audit confirms the rubric is sound and you've plateaued, that's a real stop, not a reason to grind.
The core decision of every experiment. Apply these in order — any one can force a revert:
Every code-specific ritual has a prose form. Collected here so it's one place, not scattered:
response_format: json_object 100%. Look for kernel mutations early; don't get stuck only tweaking text.| Script | Does | Never does |
|---|---|---|
scripts/journal.py | init/log/status: best-so-far, plateau length, cost | decide keep/revert |
scripts/probe_runner.py | run your probe, capture out/err/time per iteration | score the output |
scripts/judge_helper.py | render a READ-and-fill worksheet (rubric + items + hints) | assign any score |
scripts/canary.py | verify the probe ran your declared variant; loud on mismatch | judge quality |
Run any with --help. The judging is always you reading your rubric against your outputs.
betterbest is itself an artefact. To improve it: artefact = this SKILL.md; probe = a subagent
tries to run a betterbest loop on a fresh toy problem and reports friction; rubric = did they
understand each step first-read, configure the loop correctly, run the canary and judge-critique
unprompted, and know when to stop. Then iterate — one mutation at a time.
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub drewmccormack/betterbest --plugin betterbest