From skill-evolver
Autonomous skill evolution engine. Use when the user wants to improve, optimize, or evolve a Claude Code skill through iterative A/B testing. Trigger on phrases like "进化这个skill", "优化skill", "自动改进", "跑进化", "evolve skill", "improve skill automatically". Also use when user wants to run an outer optimization loop over skill prompts.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-evolver:skill-evolverThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Autonomous outer-loop optimization for Claude Code skills.
Autonomous outer-loop optimization for Claude Code skills.
Framework tools are at ${CLAUDE_PLUGIN_ROOT}/evolver/ — use this prefix for all framework commands.
User project paths (targets/, experiments/, evals) are relative to the user's project root.
Three principles:
BASELINE → LOOP { ANALYZE → PROPOSE → APPLY → VERIFY → KEEP/REVERT → LOG }
Core cycle: read evidence, propose one atomic change, verify mechanically, keep or revert. Full details in Step 2 below. Steps: ANALYZE → READ → PROPOSE → APPLY → RUN all evals → GRADE → KEEP or REVERT → LOG
Read targets/<target>.yaml:
project_root: /path/to/project
skills:
- name: <id> # skill identifier
path: skills/<id> # relative to project_root
evals_dir: evals/<id> # must contain grade.py + evals.json
settings:
max_iterations: 20 # omit for unbounded
min_delta: 0.01 # minimum score delta to keep
auto_revert: true # revert on no improvement
guard: null # optional regression test command
The evals directory must contain:
grade.py — module with grade(project_root: Path, label: str) -> GradingResultevals.json — eval definitions (see schema below)iteration-N/ — optional fixture directoriesFor L3 assertion design guidance, see references/l3-assertion-patterns.md in this skill.
Good L3 assertions discriminate — with-skill passes, without-skill fails. If both pass at 100%,
the assertion is dead weight.
{
"skills": {
"<id>": {
"skill_path": "skills/<id>",
"evals": [{
"id": "eval-id", "name": "描述", "iteration": 1,
"prompt": "给 agent 的执行指令...",
"project_dirs": {
"with_skill": "/tmp/test-with",
"without_skill": "/tmp/test-without"
}
}]
}
}
}
prompt: injected into subagent prompt templateproject_dirs.*: isolated working directories for each trackiteration: fixture version, increment when prompts/grading changeBefore any changes, run all evals for the target skill and record baseline scores.
For each eval in evals.json:
mkdir -p {project_dir}/outputsiteration-{N}/), copy them to the project dirSpawn agents per the Subagent Protocol (see below). Run all evals in parallel.
After all agents complete, run grading for each track:
python3 ${CLAUDE_PLUGIN_ROOT}/evolver/framework/runner.py grade {evals_dir} {project_dir} {label}
Or directly: python3 {evals_dir}/grade.py if it supports CLI invocation.
For each eval, compute per-level pass rates from grading.json:
L1 = passed_L1 / total_L1
L2 = passed_L2 / total_L2
L3 = passed_L3 / total_L3
eval_score = L1 * 0.2 + L2 * 0.3 + L3 * 0.5
Multi-eval aggregation: average the per-eval aggregate scores:
overall_score = mean(eval_scores)
delta = overall_with_score - overall_without_score
If delta < 0.05, the eval assertions lack discriminative power — bare Claude can almost match the skill. Consider adding more L3 assertions before evolving.
L3 (content quality) gets higher weight to combat Goodhart — we don't want the optimizer to sacrifice quality for structural compliance.
Before entering the evolution loop, audit every assertion for discriminatory power. An assertion that passes 100% in BOTH tracks (with AND without skill) is dead weight — it costs grading time but provides zero signal.
Audit procedure:
For each assertion in the baseline grading results:
Output an audit summary:
Assertion Audit (<N> total):
Strong: <K> — clearly differentiate skill value
Weak: <W> — marginal, monitor
Dead: <D> — pass 100% both tracks, need replacement
Reversed: <R> — skill makes it worse (investigate)
Action on dead assertions:
references/l3-assertion-patterns.md for better design patterns, then propose replacement assertions BEFORE starting evolution.Dead L3 example: "摘要 >3 句" — both with and without pass 100%. Replace with: "摘要提到至少一个具体的人名、数字或日期" — bare Claude often writes vague summaries without concrete anchoring.
After baseline is established, present a summary to the user before starting the loop:
Baseline established for <skill-name>:
Evals: <N> test cases
Aggregate score: <score>
L1: <score> L2: <score> L3: <score>
Weakest area: <dimension with lowest score>
Settings:
Max iterations: <N or "unlimited">
Min delta: <value>
Guard: <command or "none">
Proceed with evolution loop? (Ctrl+C to abort)
Do NOT start the loop until the user acknowledges. This prevents wasted iterations on misconfigured evals.
Before proposing any change, read:
experiments/<skill>/experiments.tsv — full history of what's been triedCategorize failures:
Form a hypothesis: "Changing X in SKILL.md will improve Y metric because Z."
Rules for proposals:
Common proposal mistakes to avoid:
Edit the skill file. ONE section, ONE concept. Commit with message:
experiment: <skill-name> — <brief hypothesis>
Re-run all evals using the same process as Step 1 (prepare dirs → spawn agents → grade with runner.py → compute score).
Compute scores from grading.json (same runner.py score as Step 1.4):
python3 ${CLAUDE_PLUGIN_ROOT}/evolver/framework/runner.py score experiments/{skill}/iter-{N}/modified/{track}/grading.json
(where {track} is with_skill and without_skill respectively)
Compare new aggregate score to previous best:
new_score > prev_best + min_delta → tentative KEEP, proceed to blind quality checknew_score <= prev_best + min_delta → REVERT (git checkout -- <skill-path>)Blind quality gate: see "## Blind Quality Check" for full protocol. Key rule:
Initialize (first iteration only):
python3 ${CLAUDE_PLUGIN_ROOT}/evolver/scripts/aggregate.py init experiments {skill_name}
Log each iteration using aggregate.py:
python3 ${CLAUDE_PLUGIN_ROOT}/evolver/scripts/aggregate.py log experiments/{skill} \
{iteration} {commit} {l1} {l2} {l3} {aggregate} {delta} {status} "{hypothesis}"
This appends to experiments/<skill>/experiments.tsv:
iteration commit l1 l2 l3 aggregate delta status hypothesis
0 a1b2c3d 1.0 0.85 0.60 0.78 0.0 baseline initial
1 b2c3d4e 1.0 0.87 0.62 0.80 +0.02 keep added memory point examples
2 - 1.0 0.83 0.58 0.77 -0.01 discard mandatory checklists
Print summary anytime:
python3 ${CLAUDE_PLUGIN_ROOT}/evolver/scripts/aggregate.py summary experiments/{skill}
Also save full grading.json for each iteration:
experiments/<skill>/iter-<N>/
├── baseline/
│ ├── with_skill/grading.json
│ └── without_skill/grading.json
├── modified/
│ ├── with_skill/grading.json
│ └── without_skill/grading.json
└── summary.json # {iteration, hypothesis, score_delta, kept}
Every 10 iterations, print a progress summary:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Progress after <N> iterations for <skill-name>
Baseline: <baseline_score>
Current: <current_score> (Δ <delta>)
Best: <best_score> (iter <N>)
Kept: <K> changes
Discarded: <D> changes
Top improvement: <brief description of best change>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bounded loops (max_iterations set) also print a final summary with baseline → current best trajectory.
Stop when:
max_iterations reachedPrint final summary:
Guard is an optional safety net. While Verify checks "did the metric improve?", Guard checks "did anything else break?"
Configure in targets/<target>.yaml:
settings:
guard: "python3 evals/another-skill/grade.py /tmp/another-skill-test --quick"
Guard rules:
Example: when evolving skill A, guard with skill B's evals to ensure changes to A don't break B's output format expectations.
Mechanical assertions (L1/L2/L3) can be gamed — the optimizer learns to pass assertions without actually improving output quality. Blind comparison is an independent quality signal: a fresh agent compares with-skill and without-skill outputs without knowing which is which.
Blind check runs in one of two modes, determined by the baseline assertion audit (Step 1.5):
| Mode | Condition | Behavior |
|---|---|---|
| Periodic (default) | Assertions have healthy discrimination (delta > 0.05, <30% dead) | Blind check every 5 iterations as sanity check |
| Primary Gate | Assertions lack discrimination (delta < 0.05, or >30% dead L3 assertions) | Blind check runs on EVERY iteration as the main keep/revert gate |
In Primary Gate mode, the mechanical score still runs, but the blind comparison's winner determines keep/revert. The mechanical score is recorded for tracking but doesn't decide. This prevents the optimizer from gaming weak assertions — even if it learns to pass every dead L3 check, the blind judge catches quality regressions.
Mode transition: Start in Periodic. If the baseline audit triggers Primary Gate, stay in it until assertions are improved (dead L3% < 30% AND delta > 0.05 after re-running baseline with new assertions). The evolver should proactively suggest assertion improvements when stuck in Primary Gate mode.
| Trigger | Rationale |
|---|---|
| Every 5 iterations | Periodic sanity check |
| L3 score jumps >0.15 in one iteration | Suspiciously fast improvement |
| Hypothesis is purely structural | Adding keywords, reordering sections — easy to game |
| Manual request | Proposer wants a second opinion |
Skip blind check in Periodic mode if all L3 assertions pass at 100% AND delta > 0.05 (skill is genuinely converged). Never skip in Primary Gate mode — the high dead-assertion rate means L3 passing doesn't guarantee quality.
Spawn a subagent with this prompt:
You are a blind quality judge. Output A and B come from unknown sources.
Task: {eval.prompt}
Output A: {path_a}
Output B: {path_b}
1. Read both outputs.
2. Score each on Content (correctness/completeness/accuracy, 1-5) and Structure (organization/formatting/usability, 1-5).
3. Pick winner: A, B, or TIE.
Save to {output_path}/comparison.json:
{"winner": "A"|"B"|"TIE", "reasoning": "...", "rubric": {"A": {...}, "B": {...}}}
CRITICAL: Do NOT try to infer which output came from the skill.
Randomization: Randomly assign which output is A and which is B. Record the mapping so you can unblind after.
After the comparator finishes, read comparison.json:
| Blind Winner | Mechanical Delta | Verdict | Action |
|---|---|---|---|
| with-skill | delta > 0 | Strong confirmation | KEEP |
| with-skill | delta ≤ 0 | Assertions too strict or miss what matters | Consider revising assertions; tentative KEEP |
| without-skill | delta > 0 | Mechanical improvement was spurious | REVERT |
| without-skill | delta ≤ 0 | Confirmed regression | Keep the revert |
| TIE | any | Mechanical score decides | Fall back to normal keep/revert |
Spawn an analyzer to understand WHY a change was reverted:
Read {comparison.json}. Winner: {winner_path}, Loser: {loser_path}.
1. What specific instruction difference caused the outcome? Quote both.
2. What concrete change would flip the result?
Save to {output_path}/analysis.json:
{"winner_strengths": [...], "loser_weaknesses": [...], "improvement_suggestions": [{"priority": "high"|"medium"|"low", "suggestion": "..."}]}
The analysis feeds into the next Propose step — proposer reads analysis.json alongside grading failures.
| Failure | Response |
|---|---|
| Eval agent timeout / crash | Mark that eval as inconclusive, continue with remaining evals. If >50% of evals fail, abort iteration and revert. |
| Grading script error | Attempt fix (max 3 tries: syntax fix, import fix, assertion fix). If still broken, skip this eval for this iteration, log warning. |
| Git conflict on revert | Abort iteration, reset to HEAD~1, log the anomalous state for manual review. |
| Infinite loop / hang | Each iteration has a 10-minute timeout. If exceeded, kill subagents, revert, log timeout. |
| Resource exhaustion (disk/token) | Revert, log warning, suggest reducing eval scope or increasing limits. |
| External dependency missing | Skip evaluation that depends on it, log, continue with other evals. |
| All evals fail | Abort iteration, revert, log critical error. Do NOT count this as a kept change. |
Every crash is logged to experiments/<skill>/crashes.tsv:
timestamp iteration type detail action
2026-05-19T14:30:00 3 timeout eval agent >600s reverted
2026-05-19T14:45:00 5 grad_error grade.py syntax fixed (attempt 1)
Each eval spawns two agents in parallel (Agent tool, run_in_background: true).
For each eval definition from evals.json. {skill_path} = project_root + path.
with-skill agent:
Execute this task using the {skill_name} skill.
First, read {skill_path}/SKILL.md and {skill_path}/references/. Follow the skill's instructions exactly.
Then: {eval.prompt}
Work in the directory: {eval.project_dirs.with_skill}. Save summary to {eval.project_dirs.with_skill}/outputs/summary.md
Note: Read the skill file directly — the Skill tool only works for installed skills, not development skills.
without-skill agent:
Execute this task WITHOUT any skill. Do NOT read any SKILL.md. Work from your own knowledge.
Task: {eval.prompt}
Work in {eval.project_dirs.without_skill}. Save summary to {eval.project_dirs.without_skill}/outputs/summary.md
mkdir -p {project_dir}/outputs, copy fixtures if presentpython3 ${CLAUDE_PLUGIN_ROOT}/evolver/framework/runner.py grade {evals_dir} {project_dir} {label}; save results to experiments/{skill}/iter-{N}/{config}/{track}/Parallelism: all evals run in parallel, N_evals × 2 concurrent agents. Batch in groups of 2-3 if resource-constrained.
Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
npx claudepluginhub tulancn/skill-evolver --plugin skill-evolver