Skill

llm-self-heal

Autonomous self-healing loop that replays failing leerie worker prompts, iteratively proposes patches via a subagent, and measures success against replay baselines.

testing

automation

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/leerie:llm-self-heal <run-id-or-ndjson-path> [--call-type <name>] [--max-iterations <N>] [--n-replays <N>] [--success-threshold <0..1>]

User invocable

Model invocable

Inline context

Default effort

Argument hint<run-id-or-ndjson-path> [--call-type <name>] [--max-iterations <N>] [--n-replays <N>] [--success-threshold <0..1>]

Tool Access

This skill is limited to the following tools:

ReadWriteEditBashGlobGrepAgent

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

SKILL.md

296 lines · ~2.8k tokens

Stats

LanguagePython

Stars4

Forks1

MaintenanceExcellent

Last CommitJun 17, 2026

Actions

View Source View Plugin View on GitHub View README

call_type → prompt-file mapping

Each call_type has exactly one system-prompt source (IMPLEMENTATION.md §10):

call_type	Prompt source
classifier	`prompts/classifier.md`
planner	`prompts/planner.md`
reconciler	`prompts/reconciler.md`
implementer	`prompts/implementer.md`
integrator	`prompts/integrator.md`
conformer	`prompts/conformer.md`

The heal loop reads the current file from prompts/ as the base prompt text for any call_type it heals.

Heal-loop on-disk state layout

Under <heal-dir>/<call_type>/:

state.json           — loop state (history, best-so-far, baseline)
iter-<N>/
  patch-request.json   — inputs for the patch-generator subagent
  patch-response.json  — subagent's structured output
  applied-patch.txt    — the patched system prompt text
  arm-results.json     — n-replay results per failing sample
  scores.json          — per-sample per-replay pass/fail verdicts

This matches IMPLEMENTATION.md §8 "Coordination directory layout".

Replay mechanics

Each replay runs claude -p with:

--append-system-prompt <patched-prompt-text> (the prompt under test)
--json-schema <schema> (the same schema used for this call_type in the live orchestrator, per SCHEMAS in orchestrator/leerie.py)
--model <model> (the heal model alias)
User content from the captured user_content field

A replay passes when structured_output is present and schema-valid (parsed_ok=true equivalent). Schema validity is the primary pass criterion — the same bar the live orchestrator uses.

Step 1: Pre-flight

Confirm CWD contains orchestrator/leerie.py and prompts/. If not, abort: llm-self-heal must run from the leerie repo root.
Resolve the input path. If a run-id string, check .leerie/runs/<run-id>/calls.ndjson exists. If a directory or file path, resolve accordingly.
Confirm judge-out/ (or --verdict-dir) contains at least one *-verdicts.json file with pass=false entries. If none, emit: No failures found — nothing to heal. and stop.
Resolve --call-type. If absent, collect all call_types that have failing verdict entries.
Estimate total cost: failures × n_replays × 2 × max_iterations LLM calls. Warn the user if this is large (>50 calls) before proceeding.

Step 2: Per-call_type loop

For each call_type in scope:

2a. Baseline

For each failing sample (those with pass=false in the verdict file):

Extract system_prompt, user_content, and call_type from the matching calls.ndjson line (join on call_id).
Run n_replays unpatched claude -p calls using the captured system_prompt verbatim. Record pass/fail for each.
Persist per-sample baseline pass rates to <heal-dir>/<call_type>/state.json.
Samples that pass at majority vote in baseline are "noise-floor" cases — the captured failure didn't reproduce, so patch measurement is unreliable for them. Note them in the report.

2b. Iteration loop (until convergence or cap)

Read base prompt: load from prompts/<call_type>.md.
Build patch request: create <heal-dir>/<call_type>/iter-N/patch-request.json with:
- call_type
- current_prompt (the current best prompt text — base on iter 0, the last applied patch on subsequent iterations)
- failing_samples (array of {call_id, user_content, captured_response, judge_rationale} for samples that still fail)
- prior_attempts (array of {iter, anchor, replacement, strategy, pass_rate} from state.json history)
Invoke patch-generator subagent:
```
Agent({
  subagent_type: "patch-generator",
  description: "Propose patch for <call_type> iter-<N>",
  prompt: "<contents of patch-request.json formatted as delimited sections>"
})
```
Write the subagent's structured output to <heal-dir>/<call_type>/iter-N/patch-response.json. The subagent emits {anchor, replacement, strategy, pivot_reason}. If the output is not valid JSON, retry once with a reminder to emit only the JSON envelope.
Apply patch: substitute anchor → replacement in the current prompt text. Write the resulting prompt to <heal-dir>/<call_type>/iter-N/applied-patch.txt. If anchor is not found verbatim in the current prompt, skip this iteration and log a warning — do not apply a patch that cannot be cleanly located.
Replay patched arm: for each failing sample, run n_replays claude -p calls against the patched prompt. Record pass/fail. Write to <heal-dir>/<call_type>/iter-N/arm-results.json and scores.json.
Score and check convergence: compute pass rate across all non-noise-floor samples for this iteration.
- pass_rate ≥ success_threshold → SUCCESS, stop.
- Iterations within plateau_window all have |Δpass_rate| < plateau_delta → PLATEAUED, stop.
- iter == max_iterations → BUDGET_EXHAUSTED, stop.
- Pass rate dropped vs. prior best by > plateau_delta for plateau_window consecutive iters → REGRESSED, stop.
- Otherwise → CONTINUE. Update state.json with this iteration's result and best-so-far.

2c. Write report

Write <heal-dir>/<call_type>/healing-<call_type>.md with:

# Heal report: <call_type>

**Verdict:** SUCCESS | PLATEAUED | BUDGET_EXHAUSTED | TIMEOUT | REGRESSED
**Best pass rate:** <P>% (iter <N>)
**Baseline pass rate:** <B>%
**Iterations run:** <N>

## Best patch

**Anchor:**

```

Replacement:

<replacement text>

Strategy:

Iteration history

iter	pass_rate	delta	verdict
0 (baseline)	%	—	—
1	%	<Δ>	CONTINUE
...

Notes

<noise-floor note if any samples didn't reproduce>


## Step 3: Emit summary

For each call_type processed:

[<call_type>] iterations= verdict= best_pass_rate=

% → /<call_type>/healing-<call_type>.md


Then stop. Do not apply any patch to `prompts/`.

</workflow>

<safety_constraints>
- This skill reads NDJSON captures, verdict JSON, and prompt files
- This skill writes only into `<heal-dir>/<call_type>/`
- This skill does NOT modify `prompts/*.md` or `orchestrator/leerie.py`
- Patches are proposed with measured evidence — applying them is a
  separate manual step the user performs after reviewing the report
- Replays make real `claude -p` calls using the user's subscription
- The skill warns the user before starting if the estimated call count
  is large (>50 calls)
</safety_constraints>

<example_invocations>

Heal all call_types with failures in a run:

/leerie:llm-self-heal fix-login-timeout-bug-b81e90


Heal only the implementer call_type:

/leerie:llm-self-heal fix-login-timeout-bug-b81e90 --call-type implementer


Heal with tighter budget:

/leerie:llm-self-heal fix-login-timeout-bug-b81e90
--call-type planner
--max-iterations 5
--success-threshold 0.85


</example_invocations>

llm-self-heal

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

llm-self-heal

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

call_type → prompt-file mapping

Heal-loop on-disk state layout

Replay mechanics

Step 1: Pre-flight

Step 2: Per-call_type loop

2a. Baseline

2b. Iteration loop (until convergence or cap)

2c. Write report

Iteration history

Notes

Similar Skills

call_type → prompt-file mapping

Heal-loop on-disk state layout

Replay mechanics

Step 1: Pre-flight

Step 2: Per-call_type loop

2a. Baseline

2b. Iteration loop (until convergence or cap)

2c. Write report

Iteration history

Notes

Similar Skills