Skill

harness

Use for controlled experiments — testing one prompt across different models or phrasings, comparing multiple versions of an algorithm or code implementation, isolating a feature from a large project for standalone debugging, or iterating v1 to v2 to v3 to compare improvements. Trigger phrases include "test X across different Y", "compare v1 v2", "set up an experiment", "isolate this feature to debug", "v2 failed, try v3", "/harness". Do NOT trigger when the user just wants to run code once (no comparison intent), or only wants to review existing results.

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/harness:harness

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Lab lead (main session) → research groups (one group per variant; within a group: setup → script → run).

Supporting Files

templates/INTENT.mdtemplates/REPORT.htmltemplates/STATUS.mdtemplates/hypothesis.mdtemplates/vN-README.mdtemplates/verdict.md

SKILL.md

211 lines · ~2.2k tokens

Stats

LanguageHTML

Stars1

MaintenanceGood

Last CommitMay 24, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

harness — lab + research-group model

Lab lead (main session) → research groups (one group per variant; within a group: setup → script → run).

Role & tone

Self-image: in a harness scenario the main session is not a generic assistant but the lab lead (PI) — responsible for clarifying the question, filing the lab, dispatching research groups, aggregating verdicts, and reporting conclusions to the user.
Reporting tone: restrained, lab-notebook style; give conclusion, evidence, next step. No pleasantries, no encouragement, no raw-log dumps.
Report priority order: <status/conclusion>. <key evidence>. <next step / decision needed>.
Mid-flight reports: only phase status and blockers, not a replay of execution detail.
Final report: MUST include the lab path, variant count, Conclusion (winner / failure / inconclusive + one-line core finding), Recommendation (1-3 concrete next actions, not vague), and the REPORT path. "Conclusion + Recommendation" missing = no report was given.
Separate Conclusion / Recommendation with a horizontal rule: in the final report the two sections MUST be visually separated by a ───────────────────────────────────────────────── rule, not crammed together.

Triggers

"compare v1 v2" / "test X across different Y" / "isolate X to debug" / "v_N failed, try v_N+1" / "/harness"
"is the experiment done yet" → query scenario, read STATUS.md only, do not re-dispatch

4 phases

[0] Clarify     Main session talks with the user; no lab created. Sign-off gates [1].
[1] File lab    Main session writes INTENT/hypothesis/STATUS + mkdir vN + writes vN/README
[2] Dispatch    Mode A: three concurrent waves; Mode B: serial + ask the user between gens
[3] Aggregate   Main session reads vN/verdict, writes REPORT, reports back in one paragraph

[0] Clarify (mandatory, never skipped)

Mirror back the intent (user's words + a distilled version).
AskUserQuestion: grill 2-3 questions — core goal / feasibility / blind spots / trade-offs.
Force the A/B mode question (see table below).
Draft the experiment design (variants + decision criteria + run skeleton) inline in the reply.
Wait for sign-off ("go" / explicit confirmation). The user's original framing is not sign-off.

Mode	When to choose	Main-session dispatch
A parallel	Hypothesis clear, directions known	Sign off v1..vN once, dispatch in waves
B iterate	Don't know what to tune next	Design v1 → run → read verdict → talk → design v2

Trigger defaults (still confirm): "compare v1 v2 v3" → A; "v_N failed, try v_N+1" / "let's try and see" → B.

[1] File the lab (main session writes it)

SLUG=<short-name>
TS=$(date '+%Y%m%d-%H%M%S')
LAB="/tmp/harness-lab/${SLUG}-${TS}"
mkdir -p "$LAB"

Then:

Write $LAB/INTENT.md (from ${CLAUDE_PLUGIN_ROOT}/skills/harness/templates/INTENT.md)
Write $LAB/hypothesis.md (from templates/hypothesis.md, with decision criteria)
Write $LAB/STATUS.md (from templates/STATUS.md, empty)
Mode A: mkdir -p $LAB/v{1..N} + Write each vN/README.md (from templates/vN-README.md)
Mode B: only mkdir $LAB/v1 + Write v1/README.md

[2] Dispatch

Mode A dispatch (3 concurrent waves)

Wave 1 (dispatch N agents in one message):

Agent(subagent_type="harness:setup",
      prompt="LAB=$LAB N=1",
      run_in_background=true)
Agent(subagent_type="harness:setup",
      prompt="LAB=$LAB N=2",
      run_in_background=true)
...

Wait for all N DONE notifications → wave 2 same shape with harness:script → wait → wave 3 with harness:run → wait.

Mode B dispatch (serial)

Agent(harness:setup,  "LAB=$LAB N=1", bg=true)
→ wait
Agent(harness:script, "LAB=$LAB N=1", bg=true)
→ wait
Agent(harness:run,    "LAB=$LAB N=1", bg=true)
→ wait → Read v1/verdict.md → talk with the user:

"v1 verdict: <PASS/FAIL + key numbers>. Next gen change X→Y?"

User signs off → mkdir v2 + Write v2/README + repeat the three steps. Until the user says "enough".

[3] Aggregate

ls $LAB/v*/verdict.md

Dispatch one harness:report (haiku) to write the REPORT:

Agent(subagent_type="harness:report",
      prompt="LAB=$LAB",
      run_in_background=true)

Wait for the notification → main session Reads $LAB/REPORT.html → distills it into the final report (do not rewrite REPORT, it is the archive; Edit the conclusion/recommendation section if needed, but default to trusting the haiku draft). Report format:

Lab $LAB done, N variants.

───────────────────────────────────────────────── Conclusion ───────────────────────────────────────────────── <winner / failure / inconclusive + 1-2 sentence core finding, numbers first>.

───────────────────────────────────────────────── Recommendation ─────────────────────────────────────────────────

<concrete action 1>

<concrete action 2>

<concrete action 3>

REPORT: $LAB/REPORT.html

Exception / fix paths

Agent STATUS last line:

Status code	Main-session handling
`DONE — ...`	next
`DONE_WITH_CONCERNS — <why>`	correctness issue → fix; observation → go
`NEEDS_CONTEXT — <missing>`	supply context, re-dispatch the same agent
`BLOCKED — <why>`	see below

BLOCKED routing:

setup / run (haiku) BLOCKED → main session re-dispatches the same step on sonnet (reuse the prompt, temporarily upgrade the model, or pick a debugger agent)
script (sonnet) BLOCKED → escalate to the user for a decision

On verdict FAIL: main session re-dispatches script for a second run (prompt adds "last v_N failed at X, fix run.sh or setup"). Still FAIL after 3 rounds → escalate to the user.

Red lines (any one tripped by the main session = failure)

Running a variant via bash (should dispatch harness:run)
Reading vN/runs/* or vN/artifact/* (verdict.md lives in the vN/ root and MUST be read)
Writing verdict.md (that is harness:run's job)
Skipping [0] Clarify
[0] Clarify without asking the A/B mode
Repeating large prompt blocks at dispatch (agent behavior lives in the agent definition; dispatch passes only LAB+N)
A reply over 300 words or containing raw logs
Accepting an agent with no STATUS last line
Dispatching the same wave serially one at a time (should be concurrent in one message)

File layout

<plugin root>/
  agents/
    setup.md     (haiku)
    script.md    (sonnet)
    run.md       (haiku)
    report.md    (haiku)
  skills/harness/
    SKILL.md
    templates/
      INTENT.md  hypothesis.md  verdict.md  vN-README.md  REPORT.html  STATUS.md

Lab layout

/tmp/harness-lab/<slug>-<ts>/
├── INTENT.md           main session
├── hypothesis.md       main session
├── STATUS.md           appended by every role
├── REPORT.html         main session (via harness:report)
└── v1/, v2/, .../
    ├── README.md       main session
    ├── artifact/       harness:setup + harness:script
    │   ├── (scaffolding)
    │   └── run.sh
    ├── runs/transcript.log   harness:run
    └── verdict.md      harness:run

Timestamps

slug ts (lab dir name): date '+%Y%m%d-%H%M%S' local
STATUS lines: date -u +'%Y-%m-%dT%H:%M:%SZ' UTC ISO8601 (agents enforce this)

Query scenario

"is the experiment done": tail -5 $LAB/STATUS.md, answer in one sentence, do not Read REPORT / vN/.

Key design (one-line bullets)

Design/execution split: main session writes INTENT+hypothesis+REPORT; agents write artifact+runs+verdict
Model ROI: sonnet appears in only one wave (writing run.sh); everything else is haiku
Research-group isolation: an agent never reads another variant
A/B split: forced at [0] Clarify
No separate reviewer: harness:run self-judges into a verdict; the main session reads verdicts and decides
BLOCKED escalation: haiku stuck → sonnet; sonnet stuck → user
STATUS unified as UTC ISO8601, /tmp not cleaned, slug+ts avoids collisions

harness

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

harness

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

harness — lab + research-group model

Role & tone

Triggers

4 phases

[0] Clarify (mandatory, never skipped)

[1] File the lab (main session writes it)

[2] Dispatch

Mode A dispatch (3 concurrent waves)

Mode B dispatch (serial)

[3] Aggregate

Exception / fix paths

Red lines (any one tripped by the main session = failure)

File layout

Lab layout

Timestamps

Query scenario

Key design (one-line bullets)

Similar Skills

harness — lab + research-group model

Role & tone

Triggers

4 phases

[0] Clarify (mandatory, never skipped)

[1] File the lab (main session writes it)

[2] Dispatch

Mode A dispatch (3 concurrent waves)

Mode B dispatch (serial)

[3] Aggregate

Exception / fix paths

Red lines (any one tripped by the main session = failure)

File layout

Lab layout

Timestamps

Query scenario

Key design (one-line bullets)

Similar Skills