From harness
Use for controlled experiments — testing one prompt across different models or phrasings, comparing multiple versions of an algorithm or code implementation, isolating a feature from a large project for standalone debugging, or iterating v1 to v2 to v3 to compare improvements. Trigger phrases include "test X across different Y", "compare v1 v2", "set up an experiment", "isolate this feature to debug", "v2 failed, try v3", "/harness". Do NOT trigger when the user just wants to run code once (no comparison intent), or only wants to review existing results.
How this skill is triggered — by the user, by Claude, or both
Slash command
/harness:harnessThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Lab lead (main session) → research groups (one group per variant; within a group: setup → script → run).
Lab lead (main session) → research groups (one group per variant; within a group: setup → script → run).
<status/conclusion>. <key evidence>. <next step / decision needed>.───────────────────────────────────────────────── rule, not crammed together.[0] Clarify Main session talks with the user; no lab created. Sign-off gates [1].
[1] File lab Main session writes INTENT/hypothesis/STATUS + mkdir vN + writes vN/README
[2] Dispatch Mode A: three concurrent waves; Mode B: serial + ask the user between gens
[3] Aggregate Main session reads vN/verdict, writes REPORT, reports back in one paragraph
| Mode | When to choose | Main-session dispatch |
|---|---|---|
| A parallel | Hypothesis clear, directions known | Sign off v1..vN once, dispatch in waves |
| B iterate | Don't know what to tune next | Design v1 → run → read verdict → talk → design v2 |
Trigger defaults (still confirm): "compare v1 v2 v3" → A; "v_N failed, try v_N+1" / "let's try and see" → B.
SLUG=<short-name>
TS=$(date '+%Y%m%d-%H%M%S')
LAB="/tmp/harness-lab/${SLUG}-${TS}"
mkdir -p "$LAB"
Then:
$LAB/INTENT.md (from ${CLAUDE_PLUGIN_ROOT}/skills/harness/templates/INTENT.md)$LAB/hypothesis.md (from templates/hypothesis.md, with decision criteria)$LAB/STATUS.md (from templates/STATUS.md, empty)mkdir -p $LAB/v{1..N} + Write each vN/README.md (from templates/vN-README.md)mkdir $LAB/v1 + Write v1/README.mdWave 1 (dispatch N agents in one message):
Agent(subagent_type="harness:setup",
prompt="LAB=$LAB N=1",
run_in_background=true)
Agent(subagent_type="harness:setup",
prompt="LAB=$LAB N=2",
run_in_background=true)
...
Wait for all N DONE notifications → wave 2 same shape with harness:script → wait → wave 3 with harness:run → wait.
Agent(harness:setup, "LAB=$LAB N=1", bg=true)
→ wait
Agent(harness:script, "LAB=$LAB N=1", bg=true)
→ wait
Agent(harness:run, "LAB=$LAB N=1", bg=true)
→ wait → Read v1/verdict.md → talk with the user:
"v1 verdict: <PASS/FAIL + key numbers>. Next gen change X→Y?"
User signs off → mkdir v2 + Write v2/README + repeat the three steps. Until the user says "enough".
ls $LAB/v*/verdict.md
Dispatch one harness:report (haiku) to write the REPORT:
Agent(subagent_type="harness:report",
prompt="LAB=$LAB",
run_in_background=true)
Wait for the notification → main session Reads $LAB/REPORT.html → distills it into the final report (do not rewrite REPORT, it is the archive; Edit the conclusion/recommendation section if needed, but default to trusting the haiku draft). Report format:
Lab
$LABdone, N variants.───────────────────────────────────────────────── Conclusion ───────────────────────────────────────────────── <winner / failure / inconclusive + 1-2 sentence core finding, numbers first>.
───────────────────────────────────────────────── Recommendation ─────────────────────────────────────────────────
- <concrete action 1>
- <concrete action 2>
- <concrete action 3>
REPORT:
$LAB/REPORT.html
Agent STATUS last line:
| Status code | Main-session handling |
|---|---|
DONE — ... | next |
DONE_WITH_CONCERNS — <why> | correctness issue → fix; observation → go |
NEEDS_CONTEXT — <missing> | supply context, re-dispatch the same agent |
BLOCKED — <why> | see below |
BLOCKED routing:
On verdict FAIL: main session re-dispatches script for a second run (prompt adds "last v_N failed at X, fix run.sh or setup"). Still FAIL after 3 rounds → escalate to the user.
vN/runs/* or vN/artifact/* (verdict.md lives in the vN/ root and MUST be read)<plugin root>/
agents/
setup.md (haiku)
script.md (sonnet)
run.md (haiku)
report.md (haiku)
skills/harness/
SKILL.md
templates/
INTENT.md hypothesis.md verdict.md vN-README.md REPORT.html STATUS.md
/tmp/harness-lab/<slug>-<ts>/
├── INTENT.md main session
├── hypothesis.md main session
├── STATUS.md appended by every role
├── REPORT.html main session (via harness:report)
└── v1/, v2/, .../
├── README.md main session
├── artifact/ harness:setup + harness:script
│ ├── (scaffolding)
│ └── run.sh
├── runs/transcript.log harness:run
└── verdict.md harness:run
date '+%Y%m%d-%H%M%S' localdate -u +'%Y-%m-%dT%H:%M:%SZ' UTC ISO8601 (agents enforce this)"is the experiment done": tail -5 $LAB/STATUS.md, answer in one sentence, do not Read REPORT / vN/.
Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
npx claudepluginhub hali0515/harness-lab --plugin harness