From slow-powers
Use when testing whether a new skill improves agent behavior, or when validating a change to an existing skill's language.
How this skill is triggered — by the user, by Claude, or both
Slash command
/slow-powers:evaluating-skillsThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Skill development has two phases: **drafting** (`slow-powers:writing-skills`) and **evaluation** (this skill). This skill owns the *craft* of evaluation — deciding whether a change needs measuring, designing test cases, devising pressure-testing scenarios, writing assertions, and reading results. The *mechanics* of actually running an eval — building the workspace, staging skills, dispatching s...
evals/baseline/BASELINE.mdevals/baseline/NOTES.mdevals/baseline/benchmark.jsonevals/baseline/grading/deterministic-edit-skip__new_skill.jsonevals/baseline/grading/deterministic-edit-skip__old_skill.jsonevals/baseline/grading/did-my-revision-help__new_skill.jsonevals/baseline/grading/did-my-revision-help__old_skill.jsonevals/baseline/grading/is-new-skill-ready-to-ship__new_skill.jsonevals/baseline/grading/is-new-skill-ready-to-ship__old_skill.jsonevals/evals.jsonevals/fixtures/iron-law/candidate-skill.mdpressure-scenarios.mdSkill development has two phases: drafting (slow-powers:writing-skills) and evaluation (this skill). This skill owns the craft of evaluation — deciding whether a change needs measuring, designing test cases, devising pressure-testing scenarios, writing assertions, and reading results. The mechanics of actually running an eval — building the workspace, staging skills, dispatching subagents, grading, aggregating — are owned by a dedicated tool, eval-magic, which ships as a dependency-less prebuilt binary you invoke as eval-magic. See Running the eval for the hand-off.
An eval is a structured measurement of whether a skill actually shifts behavior. Each test case is a realistic prompt; each run dispatches a fresh general-purpose subagent twice — once with the skill loaded, once without (or once with the prior version, once with the revised version) — and grades the outputs against assertions. Pass-rate deltas tell you whether the skill is worth shipping or the change is worth landing.
Evals are harness-agnostic: run records use a portable JSON schema so an eval authored on one harness (Claude Code, Codex, OpenCode) can be executed and graded on any other. The runner documents the schema and the per-harness specifics; this skill stays at the level of what makes a good eval.
with_skill vs without_skill. Use when validating that a brand-new skill beats baseline behavior with no skill loaded.old_skill vs new_skill. Use when testing a language change to an existing skill — snapshot the old SKILL.md, then run both variants against the same prompts. A negative or zero delta is a signal to revert: the new language did not improve behavior.The runner implements both; pick the mode that matches the change you're measuring.
Before you build an eval, decide whether this change needs one. An eval measures whether words on a page shift contingent behavior — what the agent does when the outcome is genuinely in doubt: under pressure, with ambiguity, or against a competing goal. That is where measurement earns its cost.
A deterministic change doesn't move that needle. Removing a one-line "announce out loud that you're using this skill" instruction, fixing a typo, or shipping a manually-invoked, testing-only procedure changes what the agent is told, not whether it complies under pressure. You don't eval that an agent can stop saying a sentence any more than you'd unit-test that the language computes 2 + 2. Following an unambiguous instruction is the runtime contract; evals test the contingent logic built on top of it.
The question for every skill change: does it alter contingent behavior, or is it deterministic instruction-following?
Either way, announce the decision and why — "deterministic instruction removal, no eval" or "this changes pressured compliance, I'll run an eval." A visible decision is one the user can override; a silent one is a rationalization waiting to happen. The door stays open: if the user wants an eval anyway, run a worthwhile one — design real cases, don't phone it in to confirm a foregone conclusion.
Skill type is a fast read, not a verdict. Reference and manually-invoked procedural changes often land deterministic; discipline, technique, and pattern changes often carry contingency (pressure-scenarios.md draws the same line under "When to use" / "Don't use them for"). Use type to orient your first guess — never as the answer. The decision is per change, not per type: a deterministic typo fix in a discipline-enforcing skill still skips, and restructuring a reference doc because the agent kept missing a section is contingent and earns an eval.
No skill shipped without passing evals. No behavior-shaping change landed without a positive revision delta.
Once you've judged a change behavior-shaping, the law is absolute — these are not exemptions:
If you can't measure a behavioral change, you don't know if it's an improvement. Tuning behavior-shaping language without a benchmark drifts the skill — sometimes silently — toward worse behavior under pressure. (The narrow, declared exception for deterministic changes is above; "deterministic" is a judgment you announce and defend, not a backdoor around this law.)
Excuses for skipping an eval on a change you've already judged behavior-shaping. None of them hold.
| Excuse | Reality |
|---|---|
| "Change is obviously an improvement" | Then proving it is fast. Run the eval. |
| "Existing skill works fine" | Until pressure-test scenario X. Run the eval. |
| "No time to author test cases" | The cost of shipping a regression is higher than 30 minutes of eval design. |
| "It's just rewording" | Wording IS the skill. Reword = changed skill. Run the eval. |
| "Eval results are noisy" | Then add runs, not skip the eval. |
| "Pass rate was already 100%" | Then the assertion is too easy. Replace it. |
| "I'll just call it deterministic" | Deterministic means the agent's compliance isn't in doubt — not that you'd rather not measure. If the wording could change a pressured choice, it's behavioral. Run the eval. |
An eval run is not free. Each test case dispatches a fresh subagent per condition — an N-case suite is 2N full agent sessions, plus a judge dispatch for every llm_judge assertion. That is real wall-clock time and real tokens, and a subagent under test can write outside its sandbox and pollute the real workspace. Never kick off a run silently.
Before building the workspace and dispatching anything, STOP and present the user a run summary, then wait for explicit confirmation:
new-skill (with vs without) or revision (old vs new), plus the baseline label for revision modeevals.json)llm_judge assertions. The runner never dispatches these itself, so it can't observe them — state them explicitly so the user can correct a wrong choice before tokens are spent.2N agent dispatches plus judge dispatches; call out that this is time- and token-intensive--guard is the default; proceed unguarded only on an explicit opt-out, and warn that stray writes will then only be detected after the fact, never blocked)Do not dispatch until the user confirms this summary. An earlier "run the eval" is not confirmation — the summary may reveal a wrong mode, the wrong model, or a missing guard the user never intended. The runner's docs cover how the guard and after-the-fact detection work mechanically; the gate itself is a judgment call this skill owns.
All of these mean: STOP. Present the pre-flight summary and wait for confirmation.
A test case has these parts:
true): set false for a negative eval where correct behavior is the skill not firing (e.g. an over-trigger guard — a feature request that shouldn't launch a debugging investigation). Negative evals are excluded from the skill-invocation rate, so a correct non-invocation isn't mistaken for the skill failing to fire.Cases live in <skill>/evals/evals.json. For the file shape, see the author-template example in the eval-magic README and validate against the bundled schema with eval-magic validate; for worked, maintained examples, read the live suites in this repo — e.g. skills/verifying-development-work/evals/evals.json and skills/hardening-plans/evals/evals.json.
Tips for writing good prompts:
bun test, quote the output").pressure-scenarios.md for the pressure-scenario taxonomy (time pressure, sunk cost, authority, exhaustion, etc.).Don't write assertions yet. You don't know what "good" looks like until you see what the first run produces.
What "stresses the skill" depends on what kind of skill it is. The four types from slow-powers:writing-skills each need a different style of prompt:
pressure-scenarios.md for the taxonomy. The wild failure for these skills is almost always mid-session — the agent is already committed to a skill-free approach when the trigger arrives — so a cold prompt under-measures them; pair each cold case with a seeded one (see Seeding conversation context below). Success = the rule holds under maximum pressure.A cold prompt measures trigger-recognition in isolation. The harder, more realistic failure is trigger-recognition under a competing attractor — an agent already mid-session, committed to a skill-free approach, where loading the skill reads as redundant. Approximate that by seeding: embed prior User: / Assistant: turns directly in the prompt string (it is wrapped verbatim as the user request, so a multi-line transcript needs no schema change). Seed an Assistant: turn that has already produced work in a native, skill-free style, then a final User: turn carrying the real request. A seed can reproduce prior commitment / in-flight momentum, redundancy framing, sunk cost, and — usefully — a prior plan that name-drops a skill (e.g. a parenthetical "TDD — tests first") without actually following it, so you can test whether the agent makes the discipline load-bearing or treats the label as compliance. For a worked example, see the seeded cases in hardening-plans/evals/.
When to seed. Seed when the skill's real-world failure happens mid-session under a competing attractor — prior commitment to a skill-free approach, redundancy framing ("I'm already doing this"), sunk cost, exhaustion, or an in-flight workflow/mode that makes loading the skill feel like ceremony. A cold prompt is enough when all you need to know is whether the description triggers from a clean start. Discipline-enforcing skills almost always warrant at least one seeded case kept alongside a cold contrast case (the cold one isolates the description; the seeded one stresses the trigger under momentum). Technique, pattern, and reference skills usually don't need seeding unless their failure, too, is specifically a mid-session one.
Reusable seed scaffold — adapt the turns to your skill's attractor:
[The following is the conversation so far in this session. You are the
assistant; continue from the final user turn.]
User: <the original request that kicked off the work>
Assistant: <work already produced in a native, skill-free style — the
approach the skill is supposed to correct, optionally name-dropping the
skill's discipline as a label without following it>
User: <the turn that should trigger the skill — phrased so loading it now
reads as redundant or as duplicated effort>
Keep the seeded turns short and concrete; the point is to establish momentum, not to write a full session.
The ceiling — state it plainly. A seed is text the subagent reads, not a state it operates under. It cannot place the agent in a harness-injected mode — a real plan mode, an enforced multi-phase workflow, genuine context-window pressure — it can only describe one. So when the wild failure you're chasing was caused by such a mode (the documented case: an agent in plan mode that invoked zero skills because the mode's own procedure made loading them feel redundant), a text seed cannot fully reproduce it — the causal layer is exactly the one a prompt string can't inject. A seeded pass is therefore necessary but not sufficient — it under-estimates real-session difficulty — and a seed that fails to reproduce a known wild failure is usually hitting this ceiling, not testing a bad seed. Treat seeded results as a stronger-than-cold signal, not as ground truth, and don't let downstream work over-trust them.
Narrowing the gap — --plan-mode. For the documented plan-mode case, the runner offers the highest-fidelity in-runner approximation: its --plan-mode flag injects the harness's verbatim plan-mode procedure into every dispatch as an operating-context layer the subagent is told it is operating under, rather than a paraphrase the agent merely reads in the seed prose. This narrows the gap (verbatim procedure > paraphrase) but does not close it: it is still text the agent reads, not an injected mode, so the necessary-not-sufficient ceiling above stands unchanged. Use it as the strongest in-runner signal and pair it with a paraphrase-seed arm. See eval-magic run --help for the flag and the per-harness profiles it depends on.
After iteration 1, you've seen what the outputs look like. Now write assertions: verifiable statements about correctness. Add them to evals.json and re-grade existing outputs without re-dispatching. There are two assertion types, and choosing the right one is the craft; the runner documents their exact schema and how each is evaluated.
transcript_check — mechanical. Matches patterns against a run's tool invocations. Fast, deterministic, cheap. Use for "did the agent run X" or "did file Y get written." Depends on the harness exposing transcripts (full support on Claude Code; on transcript-less harnesses these grade as unverifiable — lean on llm_judge there).llm_judge — judged. Soft criteria a model evaluates. Use for "did the response quote actual evidence" or "did the agent refuse to claim success without proof." Portable across all harnesses.For maximally portable evals, lean on llm_judge for the substantive checks and use transcript_check for cheap mechanical signals where the adapter is available.
Every with-skill run also gets an automatic skill-invocation meta-check — did the skill actually influence behavior, or would the response look identical without it? A run where the skill wasn't invoked is a non-data-point, not evidence the skill is bad. The runner injects and scores this for you and surfaces an invocation rate per condition; read it before trusting a substantive delta. (Mechanics in the runner's docs.)
Once a run is graded and aggregated, the headline is the delta: what the skill costs (time, tokens) and what it buys (pass-rate improvement). A skill that adds 13 seconds and 1700 tokens but improves pass rate by 50 points is probably worth it; one that doubles tokens for a 2-point gain is probably not. For Mode B, a positive delta.pass_rate means the revision is an improvement.
Analyzing patterns:
Human review catches what assertions don't — outputs that are technically correct but miss the point. Keep per-eval reviewer notes; an empty note means the output looked fine. Focus the next iteration on the cases you had specific complaints about.
Guidance for revision:
When to stop: pass rates are satisfactory and reviewer feedback is consistently empty; iteration deltas have plateaued; or you've found a more fundamental issue (wrong scope, unrepresentative prompts).
The mechanics of executing a run live in eval-magic — the eval-magic binary. eval-magic's README is the complete operating guide, and every flag is documented in the tool's own help.
| Need | Where |
|---|---|
| Quickstart, install, the two modes end-to-end | the eval-magic README |
Every subcommand and flag; the --skill-dir model; workspace layout | eval-magic --help and eval-magic <subcommand> --help |
| Full run mechanics: dispatch loop, transcript access, grading, aggregating, baselines | the eval-magic README |
| Claude Code & Codex harness specifics — isolating from installed plugins, the guard, judging | the README's Harnesses section |
| What a harness needs to reach Claude-Code-tier support | docs/harness-parity.md |
slow-powers:writing-skills — drafting a skill (Phase 1)pressure-scenarios.md — pressure-scenario taxonomy for authoring prompts that stress discipline-enforcing skillseval-magic tool) — runs the evals this skill teaches you to authornpx claudepluginhub slowdini/slow-powers --plugin slow-powersGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.