From stellarlinkco-skills
Evolves any measurable artifact (prompt, skill, code) through autonomous mutation-evaluate-gate loops. Supports GT case suites and scalar metric loops with automatic keep/revert.
Slash command
/stellarlinkco-skills:self-evolution
You are the evolution controller. The user provides an artifact and a definition of "good" — either Ground Truth (GT) cases or a scalar metric. You drive the loop: mutate, evaluate, gate, keep or revert, repeat.
Core principle: Any artifact that can be evaluated can be trained. You need three things:
Choose the lightest loop that still has a real oracle:
Do not force GT case generation when the user already has a fixed executable metric. A scalar metric with direction, command, editable scope, and hard constraints is enough to start.
Before starting, verify:
- GT cases are available (format in references/ground-truth.md), OR
- a scalar metric is defined with its command, direction (minimize or maximize), and parse rule
- the artifact type is known (see references/artifact-guide.md)
If neither GT nor scalar metric exists, help the user create one. Analyze the artifact, propose 10-15 test cases with assertions or one benchmark command with metric extraction, and ask for review.
For skills specifically: If skill-creator is installed, use quick_validate.py for L1 validation and skill-creator's grader for L2 evaluation. See references/artifact-guide.md for skill-specific execution methods.
Determine the artifact type and run mode. This drives execution method and mutation layer definitions.
| Type | Artifact Is | Execution Method | Default Mode | Example |
|---|---|---|---|---|
| prompt | A text prompt/instruction | Send to LLM with test input, capture output | GT Suite | System prompt, few-shot template |
| skill | SKILL.md + references/scripts | Run claude with skill loaded | GT Suite | Claude Code skill |
| code | Source code files | Run/test via shell command | GT Suite or Scoreboard | Python function, JS module |
| experiment | Bounded code/config optimized by one benchmark | Run fixed command, parse scalar metric | Scoreboard | autoresearch train.py |
| idea | A document/proposal | LLM evaluates against criteria | GT Suite | Business plan, design doc |
| config | Configuration file | Apply config, run system, check behavior | GT Suite or Scoreboard | YAML config, .env settings |
| custom | User-defined | User provides execution command | GT Suite, Scoreboard, or Hybrid | Anything else |
If the artifact or oracle is ambiguous, ask the user. If only the run mode is ambiguous, prefer GT Suite for semantic quality and Scoreboard for numeric optimization.
Record the contract before the first mutation:
The oracle is the source of truth. Do not edit the oracle to make a mutation pass.
Create workspace as a sibling to the artifact:
<artifact-name>-evolution/
├── evolve_plan.md # Strategy document
├── results.tsv # Per-iteration summary (append-only)
├── experiments.jsonl # Structured experiment log (append-only)
├── gt/ # GT Suite / Hybrid only
│ ├── dev.json
│ ├── holdout.json
│ └── regression.json
├── traces/ # Per-iteration execution traces
└── iterations/ # Per-iteration snapshots and result JSON
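The layout above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own tooling: `create_workspace` is a hypothetical helper, and deriving `<artifact-name>` from the file stem is an assumption.

```python
from pathlib import Path


def create_workspace(artifact_path: str, with_gt: bool = True) -> Path:
    """Create <artifact-name>-evolution/ next to the artifact and return it."""
    artifact = Path(artifact_path)
    # Assumption: the artifact name is the path stem (file name without extension).
    workspace = artifact.parent / f"{artifact.stem}-evolution"
    for sub in ("traces", "iterations"):
        (workspace / sub).mkdir(parents=True, exist_ok=True)
    if with_gt:
        # gt/ is only needed for GT Suite / Hybrid runs.
        (workspace / "gt").mkdir(exist_ok=True)
    # touch() creates missing files and leaves existing ledgers untouched.
    for name in ("evolve_plan.md", "results.tsv", "experiments.jsonl"):
        (workspace / name).touch()
    return workspace
```

Calling it twice is safe: existing directories and ledger files are kept, which matches the append-only intent of results.tsv and experiments.jsonl.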
For Scoreboard Mode, results.tsv must at minimum include:
iteration commit metric_name metric_value cost status description
Use tab-separated rows. Keep run logs and raw outputs in traces/iteration-{N}/.
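One way to keep the Scoreboard ledger append-only is a small writer like the sketch below. `append_result` is a hypothetical helper (the skill ships its own scripts/results_tracker.py, whose internals are not shown here); the column list is the minimum set named above.

```python
import csv
from pathlib import Path

# Minimum Scoreboard columns, in the order listed above.
COLUMNS = ["iteration", "commit", "metric_name", "metric_value",
           "cost", "status", "description"]


def append_result(results_path: str, row: dict) -> None:
    """Append one tab-separated row; write the header only for an empty file."""
    path = Path(results_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t",
                                lineterminator="\n")
        if write_header:
            writer.writeheader()
        # DictWriter raises ValueError on unknown keys, which guards the schema.
        writer.writerow(row)
```

Opening the file in append mode means earlier rows are never rewritten, so a discarded experiment still leaves its entry in the ledger.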
git init && git add -A && git commit -m "pre-evolution baseline"
For Scoreboard Mode, create or reuse a dedicated experiment branch/tag when the repository workflow allows it. This keeps discarded experiments cheap to revert and makes the result ledger auditable.
For GT Suite / Hybrid:
See references/ground-truth.md for the full GT format and assertion types.
For Scoreboard:
- Record the metric direction (minimize or maximize) and any minimum meaningful delta.
Run the current artifact state through the oracle. This is iteration 0.
- GT Suite / Hybrid: save traces to traces/iteration-0/, results to iterations/iteration-0/l2_results.json.
- Scoreboard: save run logs to traces/iteration-0/, and append the baseline row to results.tsv.
Analyze the artifact, oracle, scope, and baseline results. Write evolve_plan.md:
Ask only if contract choices remain ambiguous. If the user already supplied a fixed metric command and editable scope, record the plan and start the loop.
Use the loop shape that matches the selected operating mode.
GT Suite / Hybrid checklist:
Iteration {N}:
- [ ] Phase 1: Review — read memory + traces
- [ ] Phase 2: Ideate — propose ONE atomic change with trace evidence
- [ ] Phase 3: Modify — execute the change
- [ ] Phase 4: Commit — git commit (before verify)
- [ ] Phase 5: Verify — L1 → L2 (→ L3 if triggered)
- [ ] Phase 6: Gate — deterministic AND gate → KEEP or DISCARD
- [ ] Phase 7: Log — results.tsv + experiments.jsonl + traces
- [ ] Phase 8: Loop — continue / promote / stop
For Scoreboard Mode, use the lean autoresearch loop in the dedicated section below. It keeps the same invariants: one mutation, one run, one ledger entry, deterministic keep/discard.
Read the evolution memory:
- results.tsv
- experiments.jsonl
Extract 5 signals:
Propose ONE atomic change. Follow the priority ladder:
Hard rule: evidence before action. Cite specific trace evidence:
"Case {id} failed because {observed behavior}. Trace at {path} shows {evidence}. If I change {X}, the output should change from {current} to {expected}."
No trace citation, no mutation allowed. On the first iteration, cite baseline L2 results as evidence.
Atomicity test: If describing the change requires the word "and", split into two iterations.
Depth over assertion-gaming. When the artifact is a document or idea, passing GT assertions is necessary but not sufficient. Each mutation should make the artifact genuinely stronger, not just add the minimum content to flip an assertion from FAIL to PASS. If a case expects "market data with dollar figures," add real analysis — not a single throwaway number. The GT is a lower bound on quality, not the target.
When all cases already pass: If dev pass_rate is 1.0, shift priority to Simplify (Priority 5). Look for instructions, code, or content that can be removed without dropping pass_rate. Shorter, cleaner artifacts are better than padded ones. Do not waste iterations on cosmetic changes that add no behavioral value.
Read references/mutation.md for layer rules and per-artifact mutation guidance.
Execute the proposed change:
- Check git diff --stat: if more than 5 files changed, the change is probably not atomic.
Commit BEFORE verifying. This preserves the audit trail even for failed mutations.
git add <artifact-files> && git commit -m "evolve: iteration-{N} — {one-line description}"
Only stage artifact files. Never git add -A — workspace artifacts must not be committed.
Run the 3-layer evaluation pipeline. Read references/evaluation.md for details.
L1 Quick Gate (every iteration):
L2 Dev Eval (every iteration):
- Save traces to traces/iteration-{N}/
L3 Strict Eval (conditional): Triggered every N iterations, when dev pass_rate exceeds the threshold, or before layer promotion. Runs the holdout set and the regression set to detect overfitting.
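The three L3 triggers can be expressed as a single predicate. This is a sketch under stated assumptions: `should_run_l3` is a hypothetical name, and the defaults (every 5 iterations, a 0.9 pass-rate threshold) are placeholders, since the actual values are left to references/evaluation.md and the plan.

```python
def should_run_l3(iteration: int, dev_pass_rate: float, *,
                  every_n: int = 5,
                  pass_rate_threshold: float = 0.9,
                  before_promotion: bool = False) -> bool:
    """Return True when any of the three L3 triggers fires."""
    # Periodic trigger: every N iterations (iteration 0 is the baseline, so skip it).
    periodic = every_n > 0 and iteration > 0 and iteration % every_n == 0
    # Score trigger: dev pass_rate exceeds the threshold, so check for overfitting.
    high_score = dev_pass_rate > pass_rate_threshold
    # Promotion trigger: always run holdout + regression before moving up a layer.
    return periodic or high_score or before_promotion
```

For example, `should_run_l3(3, 0.95)` fires on the score trigger alone, while `should_run_l3(3, 0.5)` does not fire.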
Apply the 5-dimension AND gate. ALL must pass — any NO triggers git revert HEAD.
| Dim | Question | Check |
|---|---|---|
| Structure | Did L1 pass? | L1 result |
| Progress | pass_rate >= previous best? | Compare results.tsv |
| Regression | No previously-passing case now fails? | Diff per-case results |
| Cost | Token/time within 2x baseline? | Timing data |
| Safety | No new safety violations? | L1 safety scan |
Read references/gate.md for edge cases and noise handling.
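The table above amounts to a deterministic AND over five booleans. The sketch below is one possible encoding, not the skill's implementation: `IterationResult` and `gate` are hypothetical names, "previously passing" is taken relative to the last kept iteration, and "no new safety violations" is read as a count no higher than the baseline.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class IterationResult:
    l1_passed: bool
    pass_rate: float
    case_results: Dict[str, bool]  # case id -> passed
    cost: float                    # tokens or seconds, same unit as the baseline
    safety_violations: int


def gate(current: IterationResult,
         previous_best_pass_rate: float,
         previous_case_results: Dict[str, bool],
         baseline_cost: float,
         baseline_safety_violations: int = 0) -> Tuple[str, Dict[str, bool]]:
    """Apply the 5-dimension AND gate; any False dimension means DISCARD."""
    details = {
        "structure": current.l1_passed,
        "progress": current.pass_rate >= previous_best_pass_rate,
        # Every case that passed before must still pass now.
        "regression": all(current.case_results.get(case_id, False)
                          for case_id, passed in previous_case_results.items()
                          if passed),
        "cost": current.cost <= 2 * baseline_cost,
        "safety": current.safety_violations <= baseline_safety_violations,
    }
    decision = "KEEP" if all(details.values()) else "DISCARD"
    return decision, details
```

The returned `details` dict has the same keys as `gate_details` in the experiments.jsonl record, so it can be logged directly. A DISCARD would then be followed by `git revert HEAD`, as described above.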
Append to results.tsv:
iteration layer mutation_type description pass_rate delta decision tokens duration_s timestamp
Append to experiments.jsonl:
{
  "iteration": 5,
  "artifact_type": "prompt",
  "layer": 2,
  "mutation": "Added disambiguation hint for ambiguous queries",
  "target_cases": ["case-12"],
  "pass_rate_before": 0.82,
  "pass_rate_after": 0.86,
  "regressions": [],
  "decision": "KEEP",
  "gate_details": {"structure": true, "progress": true, "regression": true, "cost": true, "safety": true}
}
Save per-case traces to traces/iteration-{N}/case-{id}.md.
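A record like the one above is stored as one JSON object per line. The sketch below shows appending and reading back such a log; `append_experiment`, `read_experiments`, and `discarded_mutations` are hypothetical helpers, the last one illustrating how the Review phase might surface recently failed ideas.

```python
import json
from pathlib import Path
from typing import Dict, List


def append_experiment(log_path: str, record: Dict) -> None:
    """Append one record as a single JSON line (the log is append-only)."""
    line = json.dumps(record, ensure_ascii=False)
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(line + "\n")


def read_experiments(log_path: str) -> List[Dict]:
    """Load every record, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def discarded_mutations(records: List[Dict]) -> List[str]:
    """Descriptions of mutations whose decision was DISCARD, oldest first."""
    return [r["mutation"] for r in records if r.get("decision") == "DISCARD"]
```

Keeping each record on one line means a crashed run can at worst leave a truncated final line, and earlier entries stay parseable.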
Three outcomes:
On stop, generate a final evolution report:
Use this mode when the oracle is one scalar metric from a fixed command. The model is not optimizing a checklist; it is doing experimental research under a bounded benchmark.
Before iteration 1, write this contract in evolve_plan.md:
editable_scope:
forbidden_scope:
run_command:
metric_name:
metric_direction: minimize|maximize
metric_parse_rule:
timeout_seconds:
hard_constraints:
soft_costs:
keep_rule:
discard_rule:
The reference pattern is references/autoresearch/program.md: a tiny repo, one editable file, one fixed evaluation function, one primary metric, and a ledger of experiments.
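The metric_parse_rule, metric_direction, and keep/discard fields of the contract can be combined as in the sketch below. It assumes the parse rule is a regular expression with one capture group and that the last match in the output is the final metric; `parse_metric`, `decide`, and the `min_delta` parameter are hypothetical names for illustration.

```python
import re
from typing import Optional


def parse_metric(output: str, pattern: str) -> Optional[float]:
    """Return the last captured metric value, or None if nothing parses."""
    matches = re.findall(pattern, output)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        return None


def decide(metric: Optional[float], best: float, direction: str,
           min_delta: float = 0.0) -> str:
    """Map one run to a status using the contract's direction and delta."""
    if metric is None:
        # No metric could be extracted: treat the run as crashed.
        return "crash/discard"
    if direction == "minimize":
        improved = metric < best - min_delta
    elif direction == "maximize":
        improved = metric > best + min_delta
    else:
        raise ValueError(f"unknown metric_direction: {direction!r}")
    return "keep" if improved else "discard"
```

Requiring a strict improvement beyond `min_delta` keeps noise-level changes from being recorded as progress; a tie is discarded under this rule, which is a design choice rather than something the contract mandates.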
Repeat until the user stops you, the target is reached, or the configured iteration limit is exhausted:
- Read results.tsv, experiments.jsonl, the current diff, and the latest trace. Identify the current best metric and recent failed ideas.
- Assign each run exactly one status: crash/discard, discard, or keep.
- Update results.tsv and experiments.jsonl with commit, metric, cost, status, and short description.
If several consecutive experiments fail:
Do not pause between successful experiments just to ask whether to continue. The user can redirect, stop, or adjust the contract at any time.
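Running the fixed command under timeout_seconds can be sketched with the standard library. The helper below is an assumption-laden illustration: `run_experiment`, the `run.log` file name, and the outcome labels "ok", "crash", and "timeout" are hypothetical, chosen only to show how a run log ends up in traces/iteration-{N}/.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def run_experiment(command: List[str], timeout_seconds: float,
                   trace_dir: str) -> Tuple[str, str]:
    """Run the command once; return (outcome, captured output) and save a log."""
    out_dir = Path(trace_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        proc = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout_seconds)
        output = proc.stdout + proc.stderr
        # A non-zero exit code is treated as a crash.
        outcome = "ok" if proc.returncode == 0 else "crash"
    except subprocess.TimeoutExpired:
        output = f"timed out after {timeout_seconds} seconds\n"
        outcome = "timeout"
    # Keep the raw output next to the other traces for this iteration.
    (out_dir / "run.log").write_text(output, encoding="utf-8")
    return outcome, output
```

Only an "ok" outcome would go on to metric parsing; a crash or timeout can be logged and reverted without inspecting the metric at all.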
The loop runs autonomously, but the user can intervene at any time:
Read these when entering the relevant phase:
- references/ground-truth.md: GT case format, 8 assertion types, split strategy, generation
- references/evaluation.md: 3-layer evaluation pipeline, scoreboard execution, noise mitigation
- references/gate.md: 5-dimension AND gate, scalar metric edge cases, logging
- references/mutation.md: Layered mutation strategy, priority ladder, evidence protocol
- references/artifact-guide.md: Per-artifact-type guidance, including metric/autoresearch optimization
- references/autoresearch/program.md: Minimal reference pattern for fixed-metric autonomous research loops
Run these directly; do not load source into context:
- scripts/evaluate_assertions.py --output-file <path> --expectations '<json>': Evaluate programmatic assertions
- scripts/results_tracker.py <workspace> log --iteration N --layer L ...: Append GT Suite results to results.tsv + experiments.jsonl
- scripts/results_tracker.py <workspace> init --mode scoreboard: Initialize a Scoreboard workspace
- scripts/results_tracker.py <workspace> log --mode scoreboard --iteration N --layer L --mutation-type <type> --description <text> --metric-name <name> --metric-value <value> --metric-direction minimize|maximize --delta <delta> --decision BASELINE|KEEP|DISCARD: Append scalar metric results
- scripts/structural_check.py <artifact-path> --type <artifact-type>: L1 structural validator
Run the experiment-mode validator when evolving fixed-metric benchmark artifacts:
scripts/structural_check.py <artifact-path> --type experiment --language <language> --plan-path <evolve_plan.md>
npx claudepluginhub stellarlinkco/skills