improving-skills | Skill Kit

Stats

Actions

Tags

improving-skills | Skill Kit

Improving Skills

Adapts the autoresearch loop (modify → verify → keep-or-revert) for one purpose only: improving SKILL.md files. The target is always a single SKILL.md; no prompts, no agents, no generic code optimization.

When to use

The user wants to autonomously improve a specific SKILL.md — phrasings like "tighten this skill", "shrink the skill body without losing capability", "make this skill trigger more reliably", "run autoresearch on the X skill".

Always invoked manually via /skill-kit:improving-skills. The skill has disable-model-invocation: true so it never auto-fires.

Steps

Set up the run. Take the target SKILL.md path from the user. Copy the entire skill directory to .skill-kit/runs/<run-id>/skill/ (run-id is a short timestamp like 2026-05-13-1542). Initialize a scratch git branch for the run.
Read configuration. Copy templates from this skill's templates/ into .skill-kit/runs/<run-id>/ if not already present:
- program.md — goals and constraints. If <skill-dir>/goal.md exists (drafted via /skill-kit:goal-improve-skill), seed program.md from it instead of starting from the blank template.
- eval.json — scoring config (paths, weights, iteration cap)
- test-prompts.md — positive + negative trigger fixtures (drives the trigger-accuracy dimension) If <skill-dir>/references/learnings.md exists and carries a "Retro log" of observations from real runs, read it and fold its Status: open entries into program.md's "Notes for the agent" as a candidate-edit backlog — these are the human-flagged improvements the loop should try first. Treat them as candidates, not mandates: each must still pass the harness and raise the composite to be kept, and any unhelpful entry is just reverted. Verify the target ships a tests.md sidecar with ≥3 scenarios — the quality dimension scores candidates against those scenarios. If test-prompts.md or tests.md is missing/empty, stop and ask the user to populate it before starting the loop.
Score the baseline. Run score-skill .skill-kit/runs/<run-id>/skill/SKILL.md and record the composite as iteration 0 in .skill-kit/runs/<run-id>/results.tsv.
Run the loop. See loop.md for the iteration mechanics and stopping conditions.
Value-add check. When the loop stops, run the value-add test (${CLAUDE_PLUGIN_ROOT}/reference/value-add-test.md) once on the final kept candidate — the composite proves the skill is well-formed, not that it beats just asking the model. Skip only if the target is not a generative/judgment skill (deterministic / safety / format-compliance); then record Value-add verdict: N/A (non-generative). Record the verdict (PASS / CONCERN / FAIL, lift, seeds) in .skill-kit/runs/<run-id>/results.tsv and the report. A FAIL means the composite gains never reached the user (usually scaffolding burying substance) — recommend the substance-first remedy and do not call the skill done. This runs once at loop end, never per-iteration (it spawns ~4N sub-agents). See scoring.md → "What the composite does NOT measure".
Report. Summarize: starting score, best score, number of iterations, what changed in the winning candidate, and the value-add verdict. Point the user at .skill-kit/runs/<run-id>/skill/SKILL.md and the results.tsv audit log.

Constraints

One focused change per iteration. Resist the urge to refactor multiple things at once.
Never modify frontmatter field names (name, description) — only their values. Renaming description to desc would break discovery.
Never drop below the harness floor: every kept iteration must pass check-skill cleanly (zero FAILs).
Preserve the skill's intended behavior. If test-prompts.md says a prompt should fire, the candidate must keep it firing.

Output format

After the loop stops, emit:

## Improving-skills run <run-id>

- Target: <path>
- Iterations: <n>
- Baseline score: <score>
- Best score:     <score> (iteration <k>)
- Token reduction: <baseline_tokens> → <best_tokens> (<percent>%)
- Trigger (simulated / empirical): <sim> / <emp>
- Value-add verdict: <PASS|CONCERN|FAIL, lift, seeds | N/A (non-generative)>

## What changed

<2-4 bullets describing the surviving edits>

## Audit log

.skill-kit/runs/<run-id>/results.tsv
.skill-kit/runs/<run-id>/skill/SKILL.md

Details

Loop mechanics, retry behavior, stopping conditions: loop.md
Scoring composite and how to interpret results.tsv: scoring.md
Empirical trigger measurement (real firing vs simulated prediction): ${CLAUDE_PLUGIN_ROOT}/reference/trigger-accuracy.md