From Leerie
Autonomous self-healing loop that replays failing leerie worker prompts, iteratively proposes patches via a subagent, and measures success against replay baselines.
How this skill is triggered — by the user, by Claude, or both
Slash command
/leerie:llm-self-heal <run-id-or-ndjson-path> [--call-type <name>] [--max-iterations <N>] [--n-replays <N>] [--success-threshold <0..1>]<run-id-or-ndjson-path> [--call-type <name>] [--max-iterations <N>] [--n-replays <N>] [--success-threshold <0..1>]This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
<objective>
claude -p, score each, establish noise floor.patch-generator subagent to propose a
minimal patch to the system prompt, apply the patch, replay the
patched arm, score, check convergence.<heal-dir>/<call_type>/healing-<call_type>.md
with the verdict (SUCCESS / PLATEAUED / BUDGET_EXHAUSTED / TIMEOUT /
REGRESSED), the best patch found, and the full iteration history.Output: per call_type with failures, a heal report under the run's
<heal_subdir>/ directory (default heal-out/; configurable via
--heal-dir / LEERIE_HEAL_DIR / leerie.toml heal_dir).
Production prompts in prompts/ are NOT modified by this skill.
Patches are proposed evidence — applying them is a separate manual step.
<execution_context>
Arguments parsed from $ARGUMENTS:
<run-id> or path to a calls.ndjson file or its
parent directory. If a run-id is given, the skill resolves
.leerie/runs/<run-id>/calls.ndjson and the corresponding heal
output dir .leerie/runs/<run-id>/heal-out/.--call-type <name> (optional): heal only this call_type; default
heals all call_types that have failing verdicts in the verdict files
found under judge-out/.--verdict-dir <dir> (optional): where judge-llm-batch wrote its
verdict JSON files. Defaults to <run-dir>/judge-out/.--heal-dir <dir> (optional): where to write heal-loop state and
reports. Defaults to <run-dir>/heal-out/ or the value resolved from
LEERIE_HEAL_DIR / leerie.toml heal_dir.--max-iterations <N> (default 10, HEAL_MAX_ROUNDS_DEFAULT):
hard cap on loop iterations per call_type.--n-replays <N> (default 5, HEAL_N_REPLAYS_DEFAULT): replays per
arm (baseline or each patched iteration) per failing sample.--success-threshold <0..1> (default 0.9,
HEAL_SUCCESS_THRESHOLD_DEFAULT): pass-rate target for SUCCESS exit.--plateau-window <N> (default 3, HEAL_PLATEAU_WINDOW_DEFAULT):
consecutive iterations of small delta → PLATEAUED exit.--plateau-delta <0..1> (default 0.03, HEAL_PLATEAU_DELTA_DEFAULT):
"small delta" threshold in pass-rate units.--model <alias> (default sonnet, MODEL_DEFAULT_PER_WORKER["heal"]):
model alias passed to claude -p for replay arms. Override via
LEERIE_MODEL_HEAL or --heal-model on the main orchestrator, or
pass --model directly to this skill invocation.All default values match IMPLEMENTATION.md §2 "Heal-loop convergence parameters". </execution_context>
Leerie records every `claude -p` worker invocation to `.leerie/runs//calls.ndjson` (one line per call, appended immediately after each call returns). The judge-llm-batch skill produces verdict JSON files in `judge-out/`. This skill consumes those verdicts to close the loop: it replays failing samples against patched prompts to find a patch that raises the pass rate above the success threshold.Each call_type has exactly one system-prompt source (IMPLEMENTATION.md §10):
| call_type | Prompt source |
|---|---|
| classifier | prompts/classifier.md |
| planner | prompts/planner.md |
| reconciler | prompts/reconciler.md |
| implementer | prompts/implementer.md |
| integrator | prompts/integrator.md |
| conformer | prompts/conformer.md |
The heal loop reads the current file from prompts/ as the base
prompt text for any call_type it heals.
Under <heal-dir>/<call_type>/:
state.json — loop state (history, best-so-far, baseline)
iter-<N>/
patch-request.json — inputs for the patch-generator subagent
patch-response.json — subagent's structured output
applied-patch.txt — the patched system prompt text
arm-results.json — n-replay results per failing sample
scores.json — per-sample per-replay pass/fail verdicts
This matches IMPLEMENTATION.md §8 "Coordination directory layout".
Each replay runs claude -p with:
--append-system-prompt <patched-prompt-text> (the prompt under test)--json-schema <schema> (the same schema used for this call_type in
the live orchestrator, per SCHEMAS in orchestrator/leerie.py)--model <model> (the heal model alias)user_content fieldA replay passes when structured_output is present and schema-valid
(parsed_ok=true equivalent). Schema validity is the primary pass
criterion — the same bar the live orchestrator uses.
orchestrator/leerie.py and prompts/. If
not, abort: llm-self-heal must run from the leerie repo root..leerie/runs/<run-id>/calls.ndjson exists. If a directory or
file path, resolve accordingly.judge-out/ (or --verdict-dir) contains at least one
*-verdicts.json file with pass=false entries. If none, emit:
No failures found — nothing to heal. and stop.--call-type. If absent, collect all call_types that have
failing verdict entries.failures × n_replays × 2 × max_iterations LLM
calls. Warn the user if this is large (>50 calls) before proceeding.For each call_type in scope:
For each failing sample (those with pass=false in the verdict file):
system_prompt, user_content, and call_type from the
matching calls.ndjson line (join on call_id).n_replays unpatched claude -p calls using the captured
system_prompt verbatim. Record pass/fail for each.<heal-dir>/<call_type>/state.json.Read base prompt: load from prompts/<call_type>.md.
Build patch request: create
<heal-dir>/<call_type>/iter-N/patch-request.json with:
call_typecurrent_prompt (the current best prompt text — base on iter 0,
the last applied patch on subsequent iterations)failing_samples (array of {call_id, user_content, captured_response, judge_rationale} for samples that still fail)prior_attempts (array of {iter, anchor, replacement, strategy, pass_rate} from state.json history)Invoke patch-generator subagent:
Agent({
subagent_type: "patch-generator",
description: "Propose patch for <call_type> iter-<N>",
prompt: "<contents of patch-request.json formatted as delimited sections>"
})
Write the subagent's structured output to
<heal-dir>/<call_type>/iter-N/patch-response.json.
The subagent emits {anchor, replacement, strategy, pivot_reason}.
If the output is not valid JSON, retry once with a reminder to emit
only the JSON envelope.
Apply patch: substitute anchor → replacement in the current
prompt text. Write the resulting prompt to
<heal-dir>/<call_type>/iter-N/applied-patch.txt.
If anchor is not found verbatim in the current prompt, skip this
iteration and log a warning — do not apply a patch that cannot be
cleanly located.
Replay patched arm: for each failing sample, run n_replays
claude -p calls against the patched prompt. Record pass/fail.
Write to <heal-dir>/<call_type>/iter-N/arm-results.json and
scores.json.
Score and check convergence: compute pass rate across all non-noise-floor samples for this iteration.
pass_rate ≥ success_threshold → SUCCESS, stop.plateau_window all have |Δpass_rate| < plateau_delta → PLATEAUED, stop.iter == max_iterations → BUDGET_EXHAUSTED, stop.plateau_delta for
plateau_window consecutive iters → REGRESSED, stop.state.json with this iteration's result and best-so-far.Write <heal-dir>/<call_type>/healing-<call_type>.md with:
# Heal report: <call_type>
**Verdict:** SUCCESS | PLATEAUED | BUDGET_EXHAUSTED | TIMEOUT | REGRESSED
**Best pass rate:** <P>% (iter <N>)
**Baseline pass rate:** <B>%
**Iterations run:** <N>
## Best patch
**Anchor:**
```
Replacement:
<replacement text>
Strategy:
| iter | pass_rate | delta | verdict |
|---|---|---|---|
| 0 (baseline) | % | — | — |
| 1 | % | <Δ> | CONTINUE |
| ... |
<noise-floor note if any samples didn't reproduce>
## Step 3: Emit summary
For each call_type processed:
[<call_type>] iterations= verdict= best_pass_rate=
% → /<call_type>/healing-<call_type>.md
Then stop. Do not apply any patch to `prompts/`.
</workflow>
<safety_constraints>
- This skill reads NDJSON captures, verdict JSON, and prompt files
- This skill writes only into `<heal-dir>/<call_type>/`
- This skill does NOT modify `prompts/*.md` or `orchestrator/leerie.py`
- Patches are proposed with measured evidence — applying them is a
separate manual step the user performs after reviewing the report
- Replays make real `claude -p` calls using the user's subscription
- The skill warns the user before starting if the estimated call count
is large (>50 calls)
</safety_constraints>
<example_invocations>
Heal all call_types with failures in a run:
/leerie:llm-self-heal fix-login-timeout-bug-b81e90
Heal only the implementer call_type:
/leerie:llm-self-heal fix-login-timeout-bug-b81e90 --call-type implementer
Heal with tighter budget:
/leerie:llm-self-heal fix-login-timeout-bug-b81e90
--call-type planner
--max-iterations 5
--success-threshold 0.85
</example_invocations>
Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub enricai/leerie --plugin leerie