From ai-native-toolkit
Compares two versions of an LLM-directed document across a transfer set for behavioral equivalence. A/B test prompts, validate compressed skills, or gate transforms on no-regression.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ai-native-toolkit:ab-equivalenceThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A thin capability that compares two versions of an LLM-directed document - an `original` (the teacher) and a `candidate` (the student) - across a transfer set, and returns a per-case verdict on whether the candidate still induces the behaviour the original induced.
A thin capability that compares two versions of an LLM-directed document - an original (the teacher) and a candidate (the student) - across a transfer set, and returns a per-case verdict on whether the candidate still induces the behaviour the original induced.
It is transform-agnostic: it judges behavioural equivalence between two versions and neither knows nor cares which transform produced the candidate. It therefore serves every optimizer transform that claims to preserve behaviour - compression today, directive-clarity next - not just compression. It is a library capability other skills compose: semantic-compress invokes it to gate a distillation, and skill-forge exposes it alongside its own quality gate. It does not judge absolute quality ("is this skill good?") - that is a different question answered by different judges. A/B equivalence judges sameness between two versions ("does the candidate still do what the original did?").
This skill owns the runner (references/runner-prompt.md, the pure-wrapper template, paths relative to this skill directory). The runner is the shared execution primitive: it applies one version of a document to one case input and returns a transcript and self-report. Skills that need behavioural comparison compose this capability rather than re-implementing the runner.
| Input | Required | Notes |
|---|---|---|
original | yes | Path to the teacher document - the version whose behaviour is the equivalence target. |
candidate | yes | Path to the student document - the transformed version under test. |
transfer_set | yes | Array of cases spanning the test taxonomy (happy / edge / adversarial / composition). The transfer set is the operational definition of the behaviour being preserved, so its breadth bounds the safety of the conclusion. |
The caller (e.g. semantic-compress) owns deriving and confirming the transfer set; this capability consumes it. A thin transfer set yields a weak equivalence claim - the caller is responsible for flagging coverage, and the output records it.
For each case in the transfer set:
references/runner-prompt.md, the pure-wrapper prompt) once with original as the skill draft, on the case input, producing the teacher transcript.candidate as the skill draft, on the identical case input, producing the candidate transcript.references/equivalence-judge-prompt.md - a focused compare-two-transcripts judge), which emits the per-case verdict and efficiency signal.The runner and runner-prompt are the only execution primitive; the equivalence judge is the one comparison component, distinct from any absolute-quality lens. The judge compares observed behaviour to observed behaviour, never the candidate against what the original document says it should do. The full contract and schema are in references/ab-equivalence.md; the judge prompt and decision rule are in references/equivalence-judge-prompt.md.
The original never changes across a multi-round transform loop, so its transcript per case is captured once and reused across every round. Only the candidate is re-run each round. This is a hard rule, not an optimization: re-running the teacher each round wastes runner budget and risks introducing teacher-side noise that the judge would mistake for a candidate change. The caller passes the cached teacher transcripts back in on rounds >= 2; this capability re-runs only the candidate. A budget ceiling on candidate re-runs belongs to the caller's loop, not here.
Per case, the judge returns exactly one verdict:
| Verdict | Meaning | What it must cite |
|---|---|---|
equivalent | The candidate induced every behaviour and discipline the original induced. Incidental wording differences with no behavioural consequence are still equivalent. | Nothing required beyond the verdict. |
candidate-regressed | A behaviour or discipline the original induced is absent in the candidate. This is the failing verdict. | The specific behaviour lost - the discipline, step, or output the original produced and the candidate did not. |
candidate-diverged | The candidate behaves differently but no behaviour the original induced was lost - a different-but-not-worse change (including incidental improvements). | The difference - what the candidate did differently. Not necessarily worse; documented for the caller's judgement. |
The regressed-vs-diverged decision is the load-bearing distinction, stated authoritatively in references/equivalence-judge-prompt.md:
Decision order: check for any loss first. If any loss exists, the verdict is candidate-regressed - even alongside an unrelated gain; a regression is never excused by an improvement elsewhere. Only with no loss do you choose between diverged and equivalent. When uncertain whether a delta is a loss or merely a difference, the judge defaults to candidate-regressed - a false regression costs one add-back round; a false equivalent ships a behaviour-losing transform undetected.
Independent of the verdict, the judge records an efficiency signal per case - how directly the runner acted on each version versus how much it had to unpack or reinterpret the instruction before acting:
| Field | Type | Meaning |
|---|---|---|
original_directness | integer 1-5 | How directly the runner acted on the original: 5 = acted immediately, no reinterpretation; 1 = had to unpack, infer, or work around the instruction heavily before acting. |
candidate_directness | integer 1-5 | The same measure for the candidate. |
interpretation_notes | string | What the runner had to unpack or reinterpret on each version - the qualitative evidence behind the two scores. |
The signal is read from the runner self-report (references/runner-prompt.md): steps followed / skipped, ambiguities hit and how resolved, improvisation beyond the skill, and any point it wanted to deviate but followed literally all reveal how much interpretive work each version forced. Directness is scored from interpretive work shown, not from document length - a shorter document that forced more reinterpretation is less direct, not more.
Why it exists: compression's gate is strict no-regression (sameness alone). But the optimizer family includes transforms that claim behaviour-preserving-but-lighter - directive-clarity rewrites instructions the model must unpack into directives that name the action. Such a transform can only be validated if the harness measures the lightness, not just the sameness: its gate is no-regression and a measured efficiency gain (candidate_directness > original_directness with no candidate-regressed). Recording the signal here, on every A/B run, is what lets those transforms prove a measured gain instead of asserting one. This capability records the signal; it never gates on it - whether a gain is required is the calling transform's gate.
{
"cases": [
{
"case_id": "string",
"verdict": "equivalent|candidate-regressed|candidate-diverged",
"behaviour_delta": "string",
"efficiency_signal": {
"original_directness": 1,
"candidate_directness": 1,
"interpretation_notes": "string"
}
}
],
"summary": {
"pass": true,
"regressions": 0,
"divergences": 0,
"equivalents": 0
}
}
case_id - the transfer-set case identifier.verdict - one of the three categories above.behaviour_delta - for candidate-regressed, the specific behaviour lost; for candidate-diverged, the difference observed; empty (or "") for equivalent.efficiency_signal - the per-case directness scores and notes described above.summary.regressions / divergences / equivalents - counts of each verdict across cases.summary.pass is true if and only if zero cases are candidate-regressed. Divergences do not fail the run - they are surfaced for the caller's judgement. This encodes the strict no-regression gate: the candidate is accepted only when it loses nothing.The capability runs the same mechanism in every mode; modes differ only in how the runner pair per case and the equivalence judge are spawned. The caller's harness selects the mode; A/B equivalence runs inside whatever mode it is handed.
Solo mode (chat / standalone ZIP, no subagents) is the default: a single agent works each case sequentially - it applies the original via the runner wrapper, then the candidate on the identical input, then judges the two transcripts with the equivalence-judge prompt, recording the verdict and efficiency signal before moving to the next case. The cached teacher transcript is the only state carried between rounds.
Phased sub-agent mode (Agent Teams flag off): the lead spawns a fresh runner subagent per version per case (the runner pair) and a fresh judge subagent per case. With no persistent agents, the cached teacher transcripts are injected into each round so only the candidate is re-run.
Team mode (Agent Teams flag on, CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1): the equivalence judge can run as a persistent background teammate (Agent with run_in_background: true, joined to the session's single implicit team), communicating with the lead via SendMessage; ephemeral runners are spawned per round and shut down after. As with the forge loop, shut each teammate down with a SendMessage shutdown_request at the end of the run; nothing persists to block a future run. Interim behaviour (CC 2.1.178+) - do not block on a structured approval. Since the team→subagent merge, a background subagent cannot send a structured shutdown_response; it rejects the request with "Structured team-protocol messages ... cannot be sent by a background subagent. Send a plain text message instead." Send shutdown_request once, treat the teammate's plain-text acknowledgement as the completion signal, and ignore any subsequent idle_notifications (TaskStop / TaskList cannot reach background teammates either; they reap when the session exits). Restore a real lead-side approval-wait here when anthropics/claude-code#68721 and anthropics/claude-code#60199 land.
No persistent team is required by the capability itself - the judge holds no across-round memory of its own; the cached teacher transcript is the only state carried between rounds, and the caller owns it.
npx claudepluginhub bjcoombs/ai-native-toolkit --plugin ai-native-toolkitGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.