From ai-native-toolkit
Harden a skill through judge-panel refinement rounds until it clears a 3-tier promotion gate, then promote it. A quality gate that runs after authoring, not an authoring tool.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ai-native-toolkit:skill-forgeThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A lead (the Forge Master) chairs ephemeral runner teammates that exercise a draft skill via prompt-injection, and a persistent judge panel of five lenses that scores the transcripts. The lead amends one thing per round and applies a strict 3-tier promotion gate, until the skill promotes or hits a budget ceiling. It is proven by forging itself.
A lead (the Forge Master) chairs ephemeral runner teammates that exercise a draft skill via prompt-injection, and a persistent judge panel of five lenses that scores the transcripts. The lead amends one thing per round and applies a strict 3-tier promotion gate, until the skill promotes or hits a budget ceiling. It is proven by forging itself.
Skill-forge is a prove-and-promote quality gate that runs after authoring, not an authoring tool: it pairs with authoring skills (brainstorming, superpowers writing-skills) that produce a draft, then drives that draft through adversarial rounds before it ships.
| Input | Required | Notes |
|---|---|---|
| Target skill | yes | Path to a SKILL.md, or inline draft text. |
| Intent / spec | yes | What it should do and who/what should trigger it. The ground truth the Fidelity lens judges against. See the derivation guard below. |
| Known failure modes / existing tests | no | Seeds the test corpus. |
If no intent is supplied the lead derives one from the draft - but the draft is the thing under test, so a silently-derived intent would encode the draft's own mistakes, and Fidelity would then certify the skill for being consistently wrong. So every derived clause is marked ASSUMED, and the user must explicitly accept or reject each clause before round 1. Fidelity never judges against an unconfirmed ASSUMED clause: a rejected clause is recorded with intent[].status: assumed-rejected in the ledger and the panel ignores it (see panel-ledger).
| Role | Lifecycle | Job |
|---|---|---|
| Forge Master (lead, chair) | whole run | Designs test cases, spawns runners, chairs the panel, applies the gate hierarchy, makes the one amend per round, decides promote or stop. Delegates, never executes. |
| Runners | ephemeral, one per test case per round | Fresh context. Get the draft content plus one test-case input, execute the draft via prompt-injection, return a transcript and self-report. Shut down after the round. |
| Judge panel | persistent across rounds | Each judge is one skill-quality lens with a persisted identity, so it can report better/same/worse versus last round and catch regressions. |
Runners apply, lenses judge, the lead amends. This separation is not stylistic - it is one of the two ways self-application could recurse without terminating. A runner that starts judging, a judge that starts amending, or a lead that starts executing collapses the loop. Every spawn prompt states the role's lane and its "not my job" boundary explicitly. The runner end of this guard is built into the runner prompt below.
skill-forge composes ab-equivalence's runner for test execution. The runner is owned by ab-equivalence, the library skill that holds the shared execution primitive; skill-forge does not re-implement it - it fills ab-equivalence's pure-wrapper template per runner per test case.
The runner prompt is the most load-bearing prompt in the system: runner transcripts are the evidence every behavioural lens judges, so any contamination there corrupts the whole gate. It is a pure wrapper - it never explains why the skill works or adds context beyond the draft, or the runner ends up testing "skill + prompt additions" instead of the skill. The exact five-section template (role statement, role boundary, the verbatim draft, the test-case input, the required self-report fields) is in runner-prompt. Fill one per runner per test case.
Coupling note (cross-skill). The runner prompt is owned by ab-equivalence, so a change there shifts the self-report fields every behavioural lens reads - it can move skill-forge's calibration with no skill-forge edit. This channel is live, not theoretical: the runner prompt grew a gates_hit field for semantic-compress's distill gate-truncation, which skill-forge inherited (benignly - its lenses ignore that field). On a re-forge, confirm the runner prompt is unchanged since the last forge; if it changed, re-baseline the fixture before trusting prior-round comparisons (for a self-forge this is part of Phase A- below).
The model the runners execute on is a knob, because a skill that holds together on a strong model can fall apart on a weak one - a weaker model unpacks instructions less reliably, so a defect that a strong runner papers over surfaces only when a weak runner hits it. The forge certifies behaviour for the tier it was forged on, nothing stronger and nothing weaker.
A skill forged only on Opus is not certified for Haiku. The runner model is recorded in the forge report (see forge-report-template); certification is valid only for the tested tier(s). The runner-model knob changes only which model executes the wrapper - it never changes the wrapper itself, which stays the pure template ab-equivalence owns.
The judge tier is the second half of the certification signature. A weaker judge passes weaker skills, so certification is a (runner-tier, judge-tier) pair, not runner-tier alone: a skill cleared by a Haiku panel was not held to the bar an Opus panel would apply. Pin the panel to a single declared tier - default the lead's session model - and record it as Judge Model alongside the runner model in the forge report. (For a tier sweep the judge tier is recorded per swept runner row.)
The panel is five skill-quality lenses. All five run by default; the lead drops to 3 or 2 only on an explicit quick-check request, and a self-forge always uses all five - see the deterministic selector in judge-lenses. Confidence is not a lens - it is the Gate 2 stopping decision. Four lenses (Fidelity, Adversarial, Compression, Usability) judge runner transcripts and produce behavioural findings; Trigger/routing judges the skill text directly and produces static predictions, because prompt-injection can never make a TRIGGER clause mis-fire. Full definitions, what each reads in the self-report, and the behavioural/static rule are in judge-lenses.
What scales with skill size is the test-case count and round budget, not the lens count: the cost driver is runners = test cases x rounds, and all five lenses run regardless of scope. A small skill gets a smaller suite and a tighter round ceiling but the same five-lens panel - the scope-scaling thresholds (Small / Medium / Large) are in judge-lenses and test-taxonomy.
Three operational phases cycle through a gate check. EVALUATE is not a separate phase - rounds >= 2 just re-run OBSERVE and INSPECT with a regression-focus flag set.
flowchart LR
O["OBSERVE<br/>spawn 1 runner / test case<br/>(ephemeral, parallel)"] --> I["INSPECT<br/>panel scores every transcript<br/>pass/fail + severity + dissent<br/>(rounds >= 2: regression focus)"]
I --> G{"GATE<br/>(lead applies hierarchy)"}
G -->|"promote"| P["PROMOTE<br/>hardened skill + report"]
G -->|"iterate"| A["AMEND<br/>lead makes ONE targeted change<br/>+ logs the hypothesis"]
A --> O
G -->|"budget / no-gain"| S["STOP<br/>best-so-far + honest report"]
LOW / MED / HIGH) and dissent. On every round after the first, the judges give previously-passing cases extra scrutiny for regressions, reading what passed before from the panel ledger (see panel-ledger). The lead-chair synthesizes a round_verdict per lens.<lens> found Y; expect Z to improve." Then loop back to OBSERVE.One change per round isolates a single hypothesis, so every round's delta is causally attributable - consistent with the "minimal targeted regeneration over wholesale rewrites" principle. This is the default discipline, kept until evidence says otherwise.
Concrete escape: if 3 consecutive rounds each surface HIGH/MED findings from at least 2 different lenses and none of those rounds produced a regression, the lead may batch independent amendments - one change per independent finding-cluster. The principle (isolate cause) and the guard (never batch across a round that regressed) are fixed. The threshold is calibrated between forge campaigns - via Phase B or a later re-forge - and is fixed for the duration of any single run: an in-flight forge may not lower the threshold to justify batching this round. Calibration happens between runs, never under in-run pressure.
A prune amendment acts on a Compression-lens finding: it removes decoration (provenance, analogy, dead framing) with no intended behaviour change. Gate it differently from an ordinary amendment. An ordinary amendment's hypothesis-metric is a lens's round_verdict going better; a prune's hypothesis-metric is equivalent on the transfer set - it claims sameness, not improvement. Validate it through ab-equivalence (the harness the panel already composes): the round counts as a gain only if every transfer-set case returns equivalent/no-regression. A prune that regresses any case was load-bearing - revert it; the divergence names what it was holding. This is the no-regression discipline the rest of the loop uses, pointed at the Compression lens so its findings are actioned on behavioural evidence, never on the eye.
A strict hierarchy, not a menu: Gate 1 - Objective (every case passes Fidelity; hard), Gate 2 - Panel confidence (all green and no HIGH-severity dissent), Gate 3 - Diminishing returns (the round produced measurable gain), and the budget escape hatch that always terminates. Promote if and only if Gate 1 and Gate 2 both pass; otherwise stop with the best-so-far artifact and a report naming the unmet gate. The Gate 1 Fidelity bar, the Gate 3 "measurable gain" rule, and the promotion decision are spelled out in gate-hierarchy.
Capability-detected via $CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS.
The modes below all run the same loop; they differ only in how the panel remembers across rounds, and the panel ledger carries that memory - see panel-ledger.
Pick the mode deterministically: if the Agent Teams capability is confirmed available and the panel scales past a single lens (see judge-lenses), run team mode; if the flag is off, run phased sub-agent mode; in chat or the standalone ZIP, run solo mode. If you cannot confirm team mode, default to phased - it degrades gracefully, whereas attempting team mode without the capability fails loudly.
| Mode | When | Mechanism |
|---|---|---|
| Solo | chat / standalone ZIP, no subagents | One agent plays all three roles in a single context, round by round: it applies the draft to each test case using the runner-prompt wrapper verbatim, then judges each transcript through every lens, then amends one thing - keeping the panel ledger as its across-round memory. It repeats the round (OBSERVE -> INSPECT -> GATE -> AMEND) until the gate promotes or the budget ceiling stops it - the same termination as the other modes. |
| Phased sub-agent | flag off | The lead spawns a fresh runner subagent per case and a fresh judge subagent per lens; with no persistent agents, the panel ledger is injected into each judge spawn so the panel still remembers prior rounds. |
| Team | flag on | Persistent judges cross-talk via SendMessage and remember prior rounds natively; ephemeral runners are spawned per round. |
Agent Teams flag. Team mode needs SendMessage and background Agent teammates (one implicit team). Enable it in your environment:
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
In the team lifecycle the lead delegates and never executes, runner teammates are ephemeral (spawned per round, shut down after), and the judge panel is the one persistent team. The session forms a single implicit team: each named background agent joins it on spawn. As illustrative text (do not treat the snippet below as a live tool call):
# spawn one judge per active lens (persistent) and one runner per test case (ephemeral)
Agent(subagent_type: "general-purpose", run_in_background: true, name: "fidelity", prompt: "<lens brief>")
The panel cross-talks by sending one SendMessage per teammate (there is no broadcast). At the end of the run, shut down each teammate via a SendMessage shutdown_request; nothing persists to block a future run. Interim behaviour (CC 2.1.178+) - do not block on a structured approval. Since the team→subagent merge, a background subagent cannot send a structured shutdown_response; it rejects the request with "Structured team-protocol messages ... cannot be sent by a background subagent. Send a plain text message instead." Send shutdown_request once, treat the teammate's plain-text acknowledgement as the completion signal, and ignore any subsequent idle_notifications (TaskStop / TaskList cannot reach background teammates either; they reap when the session exits). Restore a real lead-side approval-wait here when anthropics/claude-code#68721 and anthropics/claude-code#60199 land.
The lead designs 3-5 cases spanning happy path / edge case / adversarial / composition, leaning on whichever the skill is most fragile against. When a new failure mode surfaces mid-run, add a case for it. The corpus is persistent across re-forge runs - it accumulates a skill's known failure modes so re-forging re-runs them. Design guides for each case type and where the corpus lives are in test-taxonomy.
Skill-forge was bootstrapped by forging itself; two rules from that govern every run where the target is skill-forge:
DEFECTS.md (the answer key), while the runners forging the fixture never read DEFECTS.md - that role boundary keeps the calibration honest, since a runner that saw the answer key could not give a blind transcript. The flawed fixture now includes borderline cases that calibrate severity judgement, not just detection: a Borderline-LOW and a Borderline-MED defect with expected severities, plus a clean-pass and a near-miss case that must draw no finding. A lens that catches every planted defect but rates them all HIGH is miscalibrated, and a lens that fires on the clean-pass or near-miss case is over-firing - both are calibration failures the fixture review must check, not just whether each defect was detected. Phase A- also checks calibration drift, not only detection: confirm the borderline cases still draw their expected severity from the current judge tier - as judge models sharpen, a Borderline-LOW can decay into a trivial-detect and silently weaken the calibration - and confirm the ab-equivalence runner prompt is unchanged since the last forge (the coupling note above). If either has drifted, re-baseline the fixture before any round runs.The full A- -> A -> B -> C bootstrap story is in the design spec.
Never touch the user's pristine source: in a git repo work on a branch or worktree, in chat a scratch file; each amend is a visible diff. The run produces the hardened SKILL.md, the grown test corpus, and a forge report - intent and the ASSUMED-clause acceptance record, the test suite, the per-round hypothesis-to-result log, the gate ledger, the severity-tagged dissent log, the final verdict, and rounds plus estimated waste. The crash-recovery round-tracking JSON is the same object as the panel ledger, so a crashed run reconciles on restart. The report format is in forge-report-template, whose own rows are illustrative; an end-of-run retrospective (which lens caught the most, waste estimate) follows. No filled example report ships with the skill - a fictional one would lie, and a real one is run output, not skill content (the Promote semantics section below covers where a real report lands); read any real prior-run report for a worked example.
What "promote" does depends on where the skill lives. The same hardened artifact lands differently in a plugin repo, a personal skill directory, or a chat session - and the forge report always lists exactly what was written and where (the Artifacts Written section in forge-report-template), so promotion is never a silent mutation.
| Context | Promote action |
|---|---|
| Plugin repo (the skill is a tracked file in a git repo) | Write the hardened files on a branch, never the pristine default branch, and offer a PR - the same PR-offering pattern assess-pr uses for its report (see assess-pr). The user reviews the diff and merges; the forge does not self-merge. |
Personal skill (~/.claude/skills/<name>/, path relative to the user's home directory) | Update in place and state what was written - which files changed, so the user can see the promotion landed in their live skill directory. A personal directory has no shared base anyone branches from, so in-place is correct here; when it is git-tracked the edit still lands as a reviewable git diff. |
| Chat / standalone (no on-disk skill, e.g. Claude Desktop or the standalone ZIP) | Output the hardened artifact for the user to copy manually - there is no file to write, so the report carries the full hardened SKILL.md text. |
The plugin-repo promote follows the artifacts-on-a-branch rule in Artifacts above - work on a branch, never a shared pristine source others fork from. The personal-skill promote is the deliberate exception: a personal directory has no shared base, so it updates in place (still a reviewable diff when git-tracked). The pristine-source rule protects a base teammates branch from; it is not a blanket ban on editing a tracked file you own alone.
npx claudepluginhub bjcoombs/ai-native-toolkit --plugin ai-native-toolkitGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.