Search everything...

Stats

Actions

Available In

harness

Name: harness
Author: hali0515

By hali0515

Lab + research-group model for controlled A/B experiments in Claude Code: test a prompt across models or phrasings, compare algorithm or implementation versions, isolate a feature from a large project for focused debugging, or iterate v1 to v2 to v3 and compare. The main session acts as lab lead (PI); each variant is run by a research group (setup, script, run) with a cost-optimized model mix.

npx claudepluginhub hali0515/harness-lab --plugin harness

Popularity

Stars

Above avg

Med: 0·Avg: 285

Installs

Med: 0·Avg: 1

What's Inside

Agents4

report

/report

harness plugin internal: given a lab path, read all .md + vN/artifact + vN/runs, and fill the templates/REPORT.html template into lab/REPORT.html (dark OpenAI-style multi-section doc site, left nav + inline markdown rendering + artifact visualization). Does NOT write the main session's final spoken report to the user.

run

/run

harness plugin internal: given a lab path and variant number, run vN/artifact/run.sh, redirect all output to vN/runs/transcript.log, read the transcript and write vN/verdict.md (PASS/FAIL/CONCERNS + key numbers) from the template, then report back to the main session in one paragraph on the last line.

script

/script

harness plugin internal: given a lab path and variant number, read the existing scaffolding + hypothesis, then write vN/artifact/run.sh (or an equivalent executable entry point). This is the only sonnet appearance in the harness pipeline, invested in designing the experiment's execution logic.

setup

/setup

harness plugin internal: given a lab path and variant number, scaffold vN/artifact/ + prepare mock data / inputs / samples. Does NOT write execution logic (that is the script agent's job). Only invoked within the harness pipeline.

Skills1

harness

/harness

Use for controlled experiments — testing one prompt across different models or phrasings, comparing multiple versions of an algorithm or code implementation, isolating a feature from a large project for standalone debugging, or iterating v1 to v2 to v3 to compare improvements. Trigger phrases include "test X across different Y", "compare v1 v2", "set up an experiment", "isolate this feature to debug", "v2 failed, try v3", "/harness". Do NOT trigger when the user just wants to run code once (no comparison intent), or only wants to review existing results.

The plugin manifest points to a different repository than the source indexed by ClaudePluginHub.

Stats

Version0.1.0

LanguageHTML

Stars1

MaintenanceGood

LicenseMIT

Last CommitMay 24, 2026

AddedMay 24, 2026

Actions

View on GitHub View README Plugin Marketplace JSON Homepage

Own this plugin?

Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).

Available In

harness-lab1

Safety Signals

Caution

Uses power tools

Uses Bash, Write, or Edit tools

README

harness-lab

A Claude Code plugin for running controlled experiments — the way a lab runs them.

When you want to know whether prompt phrasing A beats B, whether the cheaper model is good enough, or whether your v3 rewrite actually improved anything, you need a comparison, not a one-off run. harness gives you a repeatable structure for that: it files the experiment, runs each variant in isolation, judges each one against criteria you set up front, and aggregates the results into a single report.

The model: lab + research groups

The main Claude Code session acts as the lab lead (PI). It never runs experiments itself — it clarifies the question, files the lab, and dispatches a research group per variant. Each group is a short pipeline of subagents:

Lab lead (main session)
│
├── clarify the question  →  file the lab (INTENT / hypothesis / STATUS)
│
├── variant v1 ── group ──►  setup ──► script ──► run ──► verdict
├── variant v2 ── group ──►  setup ──► script ──► run ──► verdict
├── variant vN ── group ──►  setup ──► script ──► run ──► verdict
│
└── aggregate verdicts  →  report  →  REPORT.html  →  conclusion + recommendation

Agent	Model	Does
`harness:setup`	haiku	scaffolds `vN/artifact/` — mock data, inputs, samples
`harness:script`	sonnet	writes the replayable `run.sh` (the experiment's execution logic)
`harness:run`	haiku	runs the script, captures the transcript, writes a PASS/FAIL verdict
`harness:report`	haiku	aggregates everything into a dark, navigable `REPORT.html`

Why it's structured this way

Design / execution split. The main session owns the design (intent, hypothesis, decision criteria, final report); the subagents own execution (scaffolding, scripts, runs, verdicts). Each side stays in its lane.
Model ROI. sonnet — the expensive model — appears in exactly one wave, writing the script. Everything else (scaffolding, running, reporting) runs on haiku. You pay for reasoning where it matters.
Research-group isolation. A variant's agents never read another variant's files. No cross-contamination, and each group fits comfortably in context.
Verdicts before opinions. harness:run judges each variant against the hard criteria in the hypothesis and grounds the verdict in transcript numbers — not vibes. The lab lead reads verdicts and decides.

Install

claude plugin marketplace add LH151732/harness-lab
claude install harness@harness-lab

That's it. The skill auto-triggers on experiment-shaped requests; the four agents come bundled.

Usage

Just describe a comparison. The skill triggers on phrasing like:

"compare v1 v2 v3"
"test this prompt across Sonnet and Haiku"
"isolate this function from the project and benchmark two implementations"
"v2 didn't work, let's try v3"
or invoke it explicitly with /harness

Two modes

The lab lead always asks which mode fits before doing anything:

Mode A — parallel. You know the directions to compare. It signs off v1..vN at once and dispatches them in concurrent waves, then produces one report.
Mode B — iterate. You don't know the next move yet. It runs v1, shows you the verdict, and you decide what v2 should change — repeating until you're satisfied.

What you get

Every experiment lives under /tmp/harness-lab/<slug>-<timestamp>/:

<slug>-<ts>/
├── INTENT.md           what you're trying to find out + mode
├── hypothesis.md       the claim, the variants, the hard PASS/FAIL criteria
├── STATUS.md           append-only execution trace (UTC timestamps)
├── REPORT.html         the aggregated, navigable report
└── v1/, v2/, .../
    ├── README.md       this variant's spec
    ├── artifact/       scaffolding + run.sh
    ├── runs/transcript.log
    └── verdict.md      PASS / FAIL / CONCERNS + key numbers

The final message back to you is deliberately terse: a conclusion (winner / failure / inconclusive, numbers first) and a recommendation (1-3 concrete next steps), plus the path to REPORT.html for the full archive.

Requirements

Claude Code with plugin support.
The experiments run as local shell scripts in /tmp, so whatever your experiment needs (an interpreter, an API key, a CLI) must be available in your environment. harness orchestrates; it doesn't sandbox.

Customizing

The four agents and six templates are plain Markdown / HTML under agents/ and skills/harness/templates/. Fork and edit them — the REPORT.html template, in particular, is built to be retargeted: it ships with generic run.sh / artifacts / transcript slots per variant, and the harness:report agent fills them from whatever your experiment produced.

License

MIT — see LICENSE.

harness

Popularity

What's Inside

Confidence

README

harness-lab

The model: lab + research groups

Why it's structured this way

Install

Usage

Two modes

What you get

Requirements

Customizing

License

Similar Plugins

everything-claude-code

unity-dev-toolkit

creative-writing

dotnet-skills

harness-lab

The model: lab + research groups

Why it's structured this way

Install

Usage

Two modes

What you get

Requirements

Customizing

License

Popularity

Health & Quality

Similar Plugins

everything-claude-code

unity-dev-toolkit

creative-writing

dotnet-skills

claude-seo

r-skills