From claude-evolve
Socratic interview to clarify evolution tasks before execution. Asks targeted questions across four dimensions, scores ambiguity, produces initial.py, evaluate.py, config.json when ambiguity drops below 20%.
Slash command
/claude-evolve:evolve-interview <describe what you want to optimize>
Ouroboros-inspired Socratic interview that refuses to start an expensive evolution run until the task specification is mathematically clear. Asks targeted questions across four dimensions, scores ambiguity after every answer, and only crystallizes into concrete artifacts (`initial.py`, `evaluate.py`, `config.json`) once ambiguity drops below 20%.
When this skill is invoked, immediately execute the workflow below. Do not only restate or summarize these instructions back to the user.
Evolutionary code discovery is expensive. Each generation burns real LLM calls and wall-clock time. If the fitness function is ambiguous, the wrong code region is marked mutable, or the evaluator is unreliable, the entire run produces garbage. This skill prevents that by forcing specification clarity before execution.
Analogous to oh-my-claudecode's deep-interview, but specialized for the four dimensions that actually matter for evolution:
- Goal Clarity
- Program Clarity
- Evaluation Clarity
- Constraint Clarity

- `initial.py` / `evaluate.py` yet
- `initial.py` with EVOLVE-BLOCK markers AND a working `evaluate.py` → run `/evolve` directly
- `/evolve-status`
- `/evolve-install`
- `Explore` subagent BEFORE asking the user about them

Ambiguity score computed as:
ambiguity = 1 - (goal * 0.30 + program * 0.25 + evaluation * 0.30 + constraints * 0.15)
Each sub-score is in [0.0, 1.0]:
| Dimension | Weight | What "1.0" looks like |
|---|---|---|
| Goal Clarity | 0.30 | A single scalar fitness function stated in one sentence. No qualifiers. |
| Program Clarity | 0.25 | Exact file path + exact function/region to mark mutable. EVOLVE-BLOCK boundaries known. |
| Evaluation Clarity | 0.30 | Concrete command that runs real code and returns a number. Deterministic or well-averaged. |
| Constraint Clarity | 0.15 | Hard bounds on what must not change, time budget, correctness checks. |
Threshold: 0.20 ambiguity (i.e. 0.80 total clarity) before crystallization.
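The weighted formula and threshold above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's implementation: the names `WEIGHTS`, `ambiguity`, and `ready_to_crystallize` are hypothetical, and rounding to four decimals is an added assumption to keep floating-point noise from flipping the threshold check.

```python
# Weights copied from the dimension table above.
WEIGHTS = {"goal": 0.30, "program": 0.25, "evaluation": 0.30, "constraints": 0.15}
THRESHOLD = 0.20  # maximum ambiguity allowed before crystallization


def ambiguity(scores):
    """Return 1 - weighted clarity for four sub-scores in [0.0, 1.0]."""
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score must be in [0.0, 1.0]")
    clarity = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    # Rounding is an assumption of this sketch, not part of the stated formula.
    return round(1.0 - clarity, 4)


def ready_to_crystallize(scores):
    """True once ambiguity is at or below the 0.20 threshold."""
    return ambiguity(scores) <= THRESHOLD


example = {"goal": 0.9, "program": 0.8, "evaluation": 0.7, "constraints": 0.6}
print(ambiguity(example), ready_to_crystallize(example))
```

With these example scores the weighted clarity is 0.27 + 0.20 + 0.21 + 0.09 = 0.77, so ambiguity is 0.23 and the interview would continue.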
- `Explore` agent to map the relevant area and store `codebase_context`. Examples of what to extract: function signatures, test files, existing benchmarks, language/runtime, external deps.
- `state/interview-state.json`:

```json
{
  "active": true,
  "phase": "interview",
  "interview_id": "<uuid-or-timestamp>",
  "type": "greenfield|brownfield",
  "initial_idea": "<user input>",
  "rounds": [],
  "current_ambiguity": 1.0,
  "threshold": 0.20,
  "codebase_context": null,
  "challenge_modes_used": [],
  "ontology_snapshots": []
}
```
Starting evolution task interview. I'll ask targeted questions to understand what you want to evolve before committing to an expensive run. After each answer I'll show your clarity score. We'll proceed to spec crystallization once ambiguity drops below 20%.
Your idea: "<initial_idea>"
Project type: <greenfield|brownfield>
Current ambiguity: 100%
Repeat until ambiguity ≤ 0.20 OR user early-exits:
Select the dimension with the lowest clarity score. If tied, use weight order: Goal > Evaluation > Program > Constraints.
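The selection rule above, including the tie-break, fits in one function. This is a sketch under the assumption that scores are kept in a dict keyed by lowercase dimension name; `TIE_BREAK_ORDER` and `weakest_dimension` are hypothetical names.

```python
# Tie-break order stated above: Goal > Evaluation > Program > Constraints.
TIE_BREAK_ORDER = ["goal", "evaluation", "program", "constraints"]


def weakest_dimension(scores):
    """Return the lowest-scoring dimension, breaking ties by weight order."""
    # min() returns the first minimal element it encounters, so iterating in
    # tie-break order resolves ties in favor of the earlier dimension.
    return min(TIE_BREAK_ORDER, key=lambda name: scores[name])


print(weakest_dimension(
    {"goal": 0.5, "program": 0.9, "evaluation": 0.5, "constraints": 0.9}
))
```

Here Goal and Evaluation tie at 0.5, so the function returns `goal`.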
Question styles by dimension:
| Dimension | Question template | Example |
|---|---|---|
| Goal | "What single number defines success?" | "You said 'faster' — are we minimizing wall-clock seconds, CPU time, or something else?" |
| Program | "Which exact region is mutable?" | "I see three functions in solver.py. Which one is the algorithm you want to replace vs. fixed interface?" |
| Evaluation | "How do we measure that without asking an LLM?" | "Do you have a test suite we can run, or do we need a dedicated evaluator that runs the code on sample inputs?" |
| Constraints | "What must remain unchanged or bounded?" | "Is there a memory limit, runtime cap per evaluation, or an API surface that must stay stable?" |
If the scope feels conceptually fuzzy (user keeps redefining the target), switch to an ontology question before returning to normal dimensions:
"Across the last few rounds you've described this as X, Y, and Z. Which of those IS the thing we're evolving, and which are inputs or metrics?"
Use AskUserQuestion to present the question with clickable options when possible. Header format:
Round {n} | Targeting: {weakest_dim} | Why now: {one-sentence rationale} | Ambiguity: {pct}%
{question}
Options should be concrete candidates (e.g. three distinct fitness metric choices) plus "Other" for free text.
After the user answers, compute scores for all four dimensions.
Scoring rubric (apply to every dimension):
Be conservative. If you can write multiple different valid specifications from the user's answers, the score is not above 0.8.
Compute:
clarity = goal*0.30 + program*0.25 + evaluation*0.30 + constraints*0.15
ambiguity = 1 - clarity
Each round, extract the key entities mentioned (nouns with meaningful structure): Program, Evaluator, Input, Output, Metric, Constraint, Test, etc.
Track stability across rounds:
- `stable_entities`: present in both current and previous rounds
- `changed_entities`: renamed (same type, >50% field overlap)
- `new_entities`: first seen this round
- `removed_entities`: in previous but not current
- `stability_ratio` = (stable + changed) / total. Round 1 is N/A (no previous to compare).
Append the snapshot to state.ontology_snapshots[].
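The stability bookkeeping described above could look roughly like the following. Several details are assumptions of this sketch rather than anything the skill specifies: the `Entity` shape, measuring "field overlap" as shared fields divided by all fields, and taking `total` to be the number of entities in the current round.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    type: str
    fields: frozenset


def field_overlap(a, b):
    """Assumed overlap measure: shared fields / all fields."""
    union = a.fields | b.fields
    return len(a.fields & b.fields) / len(union) if union else 0.0


def diff_ontology(previous, current):
    """Classify current entities against the previous round's snapshot."""
    first_round = previous is None
    previous = previous or []
    prev_names = {e.name for e in previous}
    curr_names = {e.name for e in current}
    stable = [e.name for e in current if e.name in prev_names]
    # Previous entities whose names disappeared are rename candidates.
    candidates = [e for e in previous if e.name not in curr_names]
    changed, new = [], []
    for entity in current:
        if entity.name in prev_names:
            continue
        match = next(
            (p for p in candidates
             if p.type == entity.type and field_overlap(p, entity) > 0.5),
            None,
        )
        if match is not None:
            candidates.remove(match)  # each old entity can be renamed once
            changed.append(entity.name)
        else:
            new.append(entity.name)
    removed = [p.name for p in candidates]
    total = len(current)
    ratio = None if first_round or total == 0 else (len(stable) + len(changed)) / total
    return {"stable": stable, "changed": changed, "new": new,
            "removed": removed, "stability_ratio": ratio}
```

Passing `previous=None` models round 1, where the ratio is reported as N/A (`None`).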
After scoring, show the user:
Round {n} complete.
| Dimension | Score | Weight | Weighted | Gap |
|-----------|-------|--------|----------|-----|
| Goal | {g} | 0.30 | {g*0.30} | {gap} |
| Program | {p} | 0.25 | {p*0.25} | {gap} |
| Evaluation | {e} | 0.30 | {e*0.30} | {gap} |
| Constraints | {c} | 0.15 | {c*0.15} | {gap} |
| **Total clarity** | | | **{total}** | |
| **Ambiguity** | | | **{1-total}** | |
**Ontology:** {n} entities | Stability: {ratio}% | New: {n} | Changed: {n} | Stable: {n}
**Next target:** {weakest} — {rationale}
Append the round to state.rounds[] and rewrite state/interview-state.json.
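One way to append a round and rewrite the state file is sketched below. The shape of a round record (here just `round` and `ambiguity` keys) is a hypothetical choice, and the write-then-rename step is a design choice of this sketch so that an interrupted write cannot leave a half-written state file behind.

```python
import json
import os
import tempfile
from pathlib import Path


def append_round(state_path, round_record):
    """Append one round to the interview state and rewrite the file."""
    path = Path(state_path)
    state = json.loads(path.read_text())
    state["rounds"].append(round_record)
    # Assumes each round record carries the ambiguity computed for that round.
    state["current_ambiguity"] = round_record["ambiguity"]
    # Write to a temporary file in the same directory, then swap it in.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, path)
    return state
```

On resume, re-reading the same file with `json.loads` gives the last completed round as `state["rounds"][-1]`.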
At specific rounds, inject a perspective shift into the question generation prompt. Each mode is used once and then normal questioning resumes.
CONTRARIAN MODE: Your next question should challenge the user's framing. Ask "what if the opposite were true?" or "what if the constraint you're treating as hard were actually negotiable?" Target the assumption most likely to be wrong.
Example: "You've said we must preserve the existing API surface. What if we didn't — would that unlock a dramatically better algorithm?"
SIMPLIFIER MODE: Your next question should probe whether the spec can be reduced. Ask "what's the smallest version that would still be valuable?" or "which dimension could we drop and still ship?" Look for aspirational complexity.
Example: "You mentioned wanting both speed and accuracy. If we optimized purely for speed with a minimum accuracy floor, would that still be useful?"
ONTOLOGIST MODE: Ambiguity is still high. We may be solving the wrong problem. Look at the ontology: {entities}. Ask "which entity is the core thing we're evolving, and which are context or environment?" Force the user to pick a noun.
Example: "You've mentioned the parser, the AST, and the optimizer. Which one is the evolution target? The other two are constraints or fixtures."
Track used modes in state.challenge_modes_used[] to prevent repetition.
When ambiguity ≤ 0.20 (or early exit), generate the concrete artifacts:
`initial.py`

Based on the interview answers:

- `# EVOLVE-BLOCK-START` / `# EVOLVE-BLOCK-END` markers
- `run_experiment(**kwargs)` function as the stable interface

If the user provided an existing file, read it and insert EVOLVE-BLOCK markers around the region they identified. Do not rewrite code outside the markers.
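For a greenfield task, a generated `initial.py` might look like the sketch below. The `solve` function and its sorting baseline are placeholders invented for illustration; only the marker comments and the `run_experiment(**kwargs)` entry point come from the description above.

```python
"""Hypothetical initial.py: only the region between the markers is mutable."""


# EVOLVE-BLOCK-START
def solve(values):
    """Placeholder baseline algorithm that the evolution run may rewrite."""
    return sorted(values)
# EVOLVE-BLOCK-END


def run_experiment(**kwargs):
    """Stable interface: everything outside the markers stays fixed."""
    values = kwargs.get("values", [3, 1, 2])
    return {"output": solve(values)}
```

Keeping `run_experiment` outside the markers means the evaluator can call it the same way for every candidate, whatever the evolved region turns into.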
`evaluate.py`

Based on the evaluation answers, produce a standalone evaluator that:

- `--program_path <path>` as an argument
- `run_experiment()` (or the specified entry point)
- `combined_score` and `correct`

Important: The evaluator must NEVER use claude or any LLM to judge fitness. It must be pure code execution. If the user tried to specify an LLM-as-judge, reject it and ask for an objective metric.
Example template:
```python
#!/usr/bin/env python3
"""Evaluator for <task description>."""
import argparse
import importlib.util
import json
import sys
import time


def load_program(path):
    """Import the candidate program from an arbitrary file path."""
    spec = importlib.util.spec_from_file_location("candidate", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--program_path", required=True)
    args = p.parse_args()

    try:
        mod = load_program(args.program_path)
        # <user-specific: call run_experiment, validate, compute metric>
        # _compute_score and _validate are task-specific helpers to define here.
        result = mod.run_experiment()
        score = _compute_score(result)
        correct = _validate(result)
    except Exception as e:
        # Report any failure as a zero score rather than crashing the run.
        print(json.dumps({"combined_score": 0.0, "correct": False, "error": str(e)}))
        sys.exit(0)

    print(json.dumps({
        "combined_score": float(score),
        "correct": bool(correct),
        # <additional metrics>
    }))


if __name__ == "__main__":
    main()
```
`config.json`

Generate an EvolveConfig-compatible JSON. Set reasonable defaults based on the interview:

- `task_description`: built from the Goal and Constraints answers
- `init_program_path`: `"initial.py"`
- `eval_program_path`: `"evaluate.py"`
- `num_generations`: default 50; user can override
- `ensemble.arms`: default `["sonnet/medium", "sonnet/low", "haiku/high", "haiku/medium", "haiku/low"]` (skip `sonnet/high` and above unless the user explicitly wants extended thinking; they're slow and often exceed token limits)
- `patches.types`: `["diff", "full", "cross", "fix"]` with default probabilities
- `islands.num_islands`: 2 (default) or 3 for larger search
- `novelty.similarity_threshold`: 0.95
- `llm_timeout`: 300 (5 min), generous for medium effort but not so generous that failures waste hours
- `eval_timeout`: derived from the user's stated time budget per evaluation (default 120)

Display:
✅ Spec crystallized (ambiguity: {pct}%)
Files written:
- initial.py ({lines} lines, EVOLVE-BLOCK marks {startline}-{endline})
- evaluate.py ({lines} lines)
- config.json ({gens} generations, {arms} bandit arms, {islands} islands)
Baseline fitness: run `python3 evaluate.py --program_path initial.py` to verify
Next step: run `/evolve` to start the evolutionary loop.
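As a rough illustration of the `config.json` defaults listed earlier in this phase, the sketch below assembles them into a dict. The nesting is an assumption: dotted names such as `ensemble.arms` are read here as nested objects, which may not match the real `EvolveConfig` schema, and `build_config` is a hypothetical helper. Patch-type probabilities are omitted because their default values are not stated.

```python
import json


def build_config(task_description, eval_timeout=120, num_generations=50, num_islands=2):
    """Assemble the interview defaults into a JSON-serializable dict."""
    return {
        "task_description": task_description,
        "init_program_path": "initial.py",
        "eval_program_path": "evaluate.py",
        "num_generations": num_generations,
        # Assumed nesting for the dotted keys in the list above.
        "ensemble": {
            "arms": ["sonnet/medium", "sonnet/low", "haiku/high",
                     "haiku/medium", "haiku/low"],
        },
        "patches": {"types": ["diff", "full", "cross", "fix"]},
        "islands": {"num_islands": num_islands},
        "novelty": {"similarity_threshold": 0.95},
        "llm_timeout": 300,
        "eval_timeout": eval_timeout,
    }


config_text = json.dumps(build_config("<task description>"), indent=2)
print(config_text)
```

Whatever the actual schema turns out to be, loading the written file through `EvolveConfig.from_json()` is the check that matters.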
Then run the baseline evaluation yourself (via Bash) and report the baseline score so the user can see what they're starting from.
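The baseline run can also be scripted. The sketch below shells out to the evaluator and parses its result; `run_baseline` is a hypothetical helper, and it assumes the evaluator prints its JSON result as the last non-empty line on stdout, as the example template does.

```python
import json
import subprocess
import sys


def run_baseline(evaluator="evaluate.py", program="initial.py", timeout=120):
    """Run the evaluator on the initial program and parse its JSON result."""
    proc = subprocess.run(
        [sys.executable, evaluator, "--program_path", program],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Assumption: the JSON result is the last non-empty stdout line.
    lines = [line for line in proc.stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError(f"evaluator produced no output: {proc.stderr.strip()}")
    result = json.loads(lines[-1])
    missing = {"combined_score", "correct"} - result.keys()
    if missing:
        raise RuntimeError(f"evaluator output is missing keys: {sorted(missing)}")
    return result
```

A baseline with `correct` set to false, or a `combined_score` of 0.0 accompanied by an `error` field, is a sign the evaluator or the initial program needs fixing before any generations are spent.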
After the spec is ready, use AskUserQuestion to present:
Question: "Spec ready. How would you like to proceed?"
Options:
1. `/evolve` with the generated config (recommended)

On option 1, invoke the `evolve_start` MCP tool (or equivalent) with the generated config path.
- `AskUserQuestion` for every interview question: one at a time, never batch
- `Agent(subagent_type="Explore")` (haiku, short timeout) for brownfield codebase exploration BEFORE asking the user
- `Write` to create `initial.py`, `evaluate.py`, `config.json` in Phase 4
- `Bash` to run the baseline evaluation in Phase 4d
- `Skill("claude-evolve:evolve")` or the `evolve_start` MCP tool to hand off in Phase 5

Write to `state/interview-state.json` after every round. On resume, re-read the file and continue from the last completed round. If the user edits artifacts manually between sessions, detect that and offer to re-score with the new artifacts in mind.
- `initial.py` written with EVOLVE-BLOCK markers around the mutable region
- `evaluate.py` written and returns JSON with `combined_score` and `correct`
- `config.json` written and loads correctly via `EvolveConfig.from_json()`

Round 3 | Targeting: Evaluation | Why now: we know the fitness metric
(sum of radii) and the mutable region (the packing algorithm), but
haven't specified HOW we measure it — ambiguity of the evaluator is
our biggest remaining gap. | Ambiguity: 38%
Do you have a test suite that validates circle packings, or should
I generate a dedicated evaluator that runs the algorithm once and
computes the sum of radii with overlap/boundary checks?
[explore agent finds: solver.py has parse(), plan(), and execute()]
Round 2 | Targeting: Program | Why now: you said "evolve the planner"
but solver.py has three functions — we need to know exactly which to
wrap in EVOLVE-BLOCK markers. | Ambiguity: 60%
I see three functions in solver.py: parse() (130 lines),
plan() (340 lines), and execute() (45 lines). Is `plan()` the
algorithm we're evolving, with parse() and execute() as fixed
interface?
"What's the fitness metric, and what's the mutable region, and how
long per evaluation, and how many generations should we run?"
Four questions at once → shallow answers, inaccurate scoring.
User: "Have Claude rate the code quality on a scale of 1 to 10."
Interviewer: (accepts this)
Wrong. Push back and explain that claude-evolve requires real code execution for fitness. Ask for an objective metric: does the code pass tests, produce correct output, run faster, use less memory, etc.
npx claudepluginhub samuelzxu/claude-evolve --plugin claude-evolve