From mh
Evolve repo-local Claude Code harness assets through a 5-phase pipeline — harvest context, propose candidate, evaluate with evidence, audit regressions, report results.
Slash command: /mh:harness-evolve
ultrathink.
Objective: $ARGUMENTS
You are the pipeline orchestrator. Execute these 5 phases IN ORDER, using subagents and MCP tools. Each phase produces disk artifacts consumed by the next.
Reserve a candidate run ID and record the objective:
RUN_ID=$(mh-next-run)
RUN_DIR=$(mh-next-run --run-id "$RUN_ID" --path)
echo "Run: $RUN_ID at $RUN_DIR"
Gather project context relevant to the objective. Run:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/context_harvester.py --project . --objective "$ARGUMENTS" --budget 8000
Save the output to $RUN_DIR/context-snapshot.md for crash recovery.
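The harvest-and-save step could be scripted roughly as follows. This is a minimal sketch, assuming the harvester prints its snapshot to stdout; `harvester_command`, `save_snapshot`, and the injectable `runner` parameter are hypothetical names introduced for illustration and are not part of the plugin.

```python
import os
import subprocess
from pathlib import Path


def harvester_command(objective: str, budget: int = 8000) -> list[str]:
    """Build the context_harvester.py invocation shown above."""
    plugin_root = os.environ.get("CLAUDE_PLUGIN_ROOT", ".")
    script = str(Path(plugin_root) / "scripts" / "context_harvester.py")
    return ["python3", script, "--project", ".",
            "--objective", objective, "--budget", str(budget)]


def save_snapshot(run_dir: str, objective: str, runner=subprocess.run) -> Path:
    """Run the harvester and write its stdout to context-snapshot.md.

    Assumption: the harvester writes the snapshot to stdout.
    `runner` is injectable so the function can be tried without the plugin.
    """
    result = runner(harvester_command(objective),
                    capture_output=True, text=True, check=True)
    target = Path(run_dir) / "context-snapshot.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(result.stdout, encoding="utf-8")
    return target
```

Because `check=True` is passed, a failing harvester raises instead of leaving an empty snapshot on disk.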
Also read the current frontier and regressions:
- frontier_read MCP tool (or run mh-frontier --markdown)
- harness://regressions MCP resource (or run mh-regressions --markdown)
Discover installed plugins and their callable capabilities:
- plugin_scan MCP tool (with include_capabilities=true)
Read execution traces from prior sessions and runs:
- trace_search MCP tool with the objective as query to find relevant session traces
- candidate_diff for the top-3 frontier runs to get their full patches, hypotheses, and safety notes
These traces are critical: according to the Meta-Harness paper, raw traces improve proposal quality by 43% compared with summaries.
Dispatch the harness-proposer subagent with this context:
You are proposing a harness improvement for: "$ARGUMENTS"
Run ID: $RUN_ID
[Include the harvested context, frontier, regressions, AND plugin capabilities from Phase 1]
Available external capabilities: [paste plugin_scan output]
You may reference other plugins' skills or MCP tools in your proposal. For example, you can propose a rule that triggers /superpowers:test-driven-development, or a skill that calls Context7 for doc verification, or a hook that uses Playwright for visual checks.
Prior candidates: [Include candidate_diff output for top-3 frontier runs: their patches, hypotheses, and what worked/failed]
Execution traces: [Include trace_search results: tool calls, errors, patterns from recent sessions]
Study what was tried before. The paper proves that reading prior candidate source code and traces is the #1 factor for better proposals.
Create these files in $RUN_DIR:
- hypothesis.md (Claim / Evidence / Predicted impact / Risk)
- safety-note.md
- candidate.patch (unified diff of your changes)
Edit ONLY harness surfaces: CLAUDE.md, .claude/skills/, .claude/agents/, .claude/rules/, prompts/, .meta-harness/**, helper scripts. Do NOT edit application code. Run mh-validate before finishing.
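For readers unfamiliar with the unified diff format expected in candidate.patch, the sketch below produces one with Python's standard `difflib`. It only illustrates the format; the proposer may well generate the patch another way (for example with its version-control tooling). `make_patch` and the sample CLAUDE.md contents are hypothetical.

```python
import difflib


def make_patch(path: str, before: str, after: str) -> str:
    """Return a unified diff between two versions of one file."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical edit to a harness surface (CLAUDE.md), not a real proposal.
old = "# Project rules\nRun tests before committing.\n"
new = "# Project rules\nRun tests before committing.\nQuote evidence in hypothesis.md.\n"
patch = make_patch("CLAUDE.md", old, new)
print(patch)
```

Identical inputs yield an empty string, which is a convenient way to detect a proposal that changed nothing.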
Wait for the proposer to complete. Verify that $RUN_DIR/hypothesis.md and $RUN_DIR/candidate.patch exist.
Before proceeding, verify these files exist in $RUN_DIR: hypothesis.md, safety-note.md, candidate.patch.
If ANY artifact is missing or empty, STOP and report the failure. Do NOT proceed to Phase 3.
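The gate above can be checked mechanically. The following is a minimal sketch, assuming the required artifacts are the three files the proposer is asked to create; `REQUIRED_ARTIFACTS` and `missing_artifacts` are hypothetical names, not plugin commands.

```python
from pathlib import Path

# Assumption: the three artifacts the proposer is asked to create.
REQUIRED_ARTIFACTS = ("hypothesis.md", "safety-note.md", "candidate.patch")


def missing_artifacts(run_dir: str, names=REQUIRED_ARTIFACTS) -> list[str]:
    """Return the artifacts that are absent or contain only whitespace."""
    problems = []
    for name in names:
        path = Path(run_dir) / name
        if not path.is_file() or not path.read_text(encoding="utf-8").strip():
            problems.append(name)
    return problems
```

An orchestrating script might call `missing_artifacts(run_dir)` and stop with a failure report whenever the returned list is non-empty.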
Run deterministic evaluation:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/eval_runner.py --eval-dir ${CLAUDE_PLUGIN_ROOT}/eval-tasks --cwd . --json
Then dispatch the harness-evaluator subagent to assess LLM-judge criteria:
Evaluate candidate $RUN_ID. Read ONLY the files in $RUN_DIR (hypothesis.md, candidate.patch, safety-note.md). Do NOT read the proposer's reasoning or conversation — only its output artifacts.
Deterministic results: [paste eval_runner output]
For each LLM-judge criterion in eval-tasks/, assess whether the candidate meets it. Write metrics.json to $RUN_DIR and record it to the frontier using mh-record-metrics.
Wait for the evaluator. Verify that $RUN_DIR/metrics.json exists.
Before proceeding, verify:
- $RUN_DIR/metrics.json exists
- the metrics are recorded in the frontier (run mh-frontier --markdown to confirm)
If metrics were not recorded, STOP. Do NOT proceed to Phase 4.
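The file half of this gate could look like the sketch below. It assumes metrics.json holds a JSON object; the actual schema is not specified here, so nothing beyond "non-empty object" is checked, and `load_metrics` is a hypothetical helper. Confirming that the run appears in the frontier still requires mh-frontier --markdown.

```python
import json
from pathlib import Path


def load_metrics(run_dir: str) -> dict:
    """Load metrics.json, raising if it is missing, malformed, or empty."""
    path = Path(run_dir) / "metrics.json"
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; stop before Phase 4")
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict) or not data:
        raise ValueError(f"{path} must contain a non-empty JSON object")
    return data
```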
Dispatch the regression-auditor subagent:
Audit candidate $RUN_ID for regressions. Read the frontier (mh-frontier --markdown), the candidate's metrics.json, and the patch. Compare against prior frontier leaders. Write analysis.md to $RUN_DIR with: likely cause, confidence, evidence, recommendation.
Wait for the auditor. Read $RUN_DIR/analysis.md.
Before reporting, verify that $RUN_DIR/analysis.md exists.
If missing, note "audit skipped" in the report but proceed to Phase 5.
Present results using the Meta-Harness output style:
⚗ EVOLUTION REPORT ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Run: $RUN_ID | Hypothesis: [from hypothesis.md]
| Metric | Baseline | Candidate | Delta |
|---------------|----------|-----------|------------|
| Score | [val] | [val] | [+/-]% ▲/▼ |
| Latency (ms) | [val] | [val] | [+/-]% ▲/▼ |
| Tokens | [val] | [val] | [+/-]% ▲/▼ |
Confidence: N=[sample] | Method: deterministic + LLM-judge
Risk: [from safety-note.md]
Verdict: [PROMOTE / REJECT / ITERATE]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
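The Delta column in the template can be filled in with a small helper. This is a sketch under two assumptions: the delta is the relative change against the baseline, and the arrow follows the sign of the change rather than whether the change is an improvement. `format_delta` and `report_row` are hypothetical names, and the numbers in the usage lines are made up.

```python
def format_delta(baseline: float, candidate: float) -> str:
    """Relative change versus baseline, e.g. '+15.0% ▲'."""
    if baseline == 0:
        return "n/a"  # relative change is undefined for a zero baseline
    pct = (candidate - baseline) / baseline * 100
    if pct == 0:
        return "0.0%"
    arrow = "▲" if pct > 0 else "▼"
    return f"{pct:+.1f}% {arrow}"


def report_row(name: str, baseline: float, candidate: float) -> str:
    """Render one table row in the report's pipe-delimited layout."""
    return f"| {name} | {baseline:g} | {candidate:g} | {format_delta(baseline, candidate)} |"


# Illustrative values only.
print(report_row("Score", 0.80, 0.92))
print(report_row("Latency (ms)", 1200, 900))
```

For latency and token counts a downward arrow is the desirable outcome, so a reader should interpret the arrow together with the metric name.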
npx claudepluginhub yannabadie/meta-harness-ygn