From Skill Kit
Iteratively improves an agent skill's SKILL.md by mutating it, scoring against a fixed composite (trigger accuracy, instruction quality, token efficiency, best-practices compliance), and keeping only changes that raise the score. Use when the user asks to tighten, shrink, or improve a SKILL.md while preserving its intended behavior.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-kit:improving-skillsThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Adapts the autoresearch loop (modify → verify → keep-or-revert) for one
Adapts the autoresearch loop (modify → verify → keep-or-revert) for one purpose only: improving SKILL.md files. The target is always a single SKILL.md; no prompts, no agents, no generic code optimization.
The user wants to autonomously improve a specific SKILL.md — phrasings like "tighten this skill", "shrink the skill body without losing capability", "make this skill trigger more reliably", "run autoresearch on the X skill".
Always invoked manually via /skill-kit:improving-skills. The skill has
disable-model-invocation: true so it never auto-fires.
.skill-kit/runs/<run-id>/skill/ (run-id is a short
timestamp like 2026-05-13-1542). Initialize a scratch git branch for
the run.templates/
into .skill-kit/runs/<run-id>/ if not already present:
program.md — goals and constraints. If <skill-dir>/goal.md exists
(drafted via /skill-kit:goal-improve-skill), seed program.md from it
instead of starting from the blank template.eval.json — scoring config (paths, weights, iteration cap)test-prompts.md — positive + negative trigger fixtures (drives the
trigger-accuracy dimension)
If <skill-dir>/references/learnings.md exists and carries a "Retro log"
of observations from real runs, read it and fold its Status: open entries
into program.md's "Notes for the agent" as a candidate-edit backlog —
these are the human-flagged improvements the loop should try first. Treat
them as candidates, not mandates: each must still pass the harness and
raise the composite to be kept, and any unhelpful entry is just reverted.
Verify the target ships a tests.md sidecar with ≥3 scenarios — the
quality dimension scores candidates against those scenarios. If
test-prompts.md or tests.md is missing/empty, stop and ask the
user to populate it before starting the loop.score-skill .skill-kit/runs/<run-id>/skill/SKILL.md
and record the composite as iteration 0 in .skill-kit/runs/<run-id>/results.tsv.${CLAUDE_PLUGIN_ROOT}/reference/value-add-test.md) once on the
final kept candidate — the composite proves the skill is well-formed, not
that it beats just asking the model. Skip only if the target is not a
generative/judgment skill (deterministic / safety / format-compliance); then
record Value-add verdict: N/A (non-generative). Record the verdict
(PASS / CONCERN / FAIL, lift, seeds) in .skill-kit/runs/<run-id>/results.tsv and the
report. A FAIL means the composite gains never reached the user (usually
scaffolding burying substance) — recommend the substance-first remedy and do
not call the skill done. This runs once at loop end, never per-iteration
(it spawns ~4N sub-agents). See scoring.md → "What the composite
does NOT measure"..skill-kit/runs/<run-id>/skill/SKILL.md and the results.tsv audit log.name, description) — only their
values. Renaming description to desc would break discovery.check-skill cleanly (zero FAILs).test-prompts.md says a
prompt should fire, the candidate must keep it firing.After the loop stops, emit:
## Improving-skills run <run-id>
- Target: <path>
- Iterations: <n>
- Baseline score: <score>
- Best score: <score> (iteration <k>)
- Token reduction: <baseline_tokens> → <best_tokens> (<percent>%)
- Trigger (simulated / empirical): <sim> / <emp>
- Value-add verdict: <PASS|CONCERN|FAIL, lift, seeds | N/A (non-generative)>
## What changed
<2-4 bullets describing the surviving edits>
## Audit log
.skill-kit/runs/<run-id>/results.tsv
.skill-kit/runs/<run-id>/skill/SKILL.md
${CLAUDE_PLUGIN_ROOT}/reference/trigger-accuracy.mdGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub mjenkinsx9/mjenkins-toolbox --plugin skill-kit