By hali0515
Lab + research-group model for controlled A/B experiments in Claude Code: test a prompt across models or phrasings, compare algorithm or implementation versions, isolate a feature from a large project for focused debugging, or iterate v1 to v2 to v3 and compare. The main session acts as lab lead (PI); each variant is run by a research group (setup, script, run) with a cost-optimized model mix.
harness plugin internal: given a lab path, read all .md + vN/artifact + vN/runs, and fill the templates/REPORT.html template into lab/REPORT.html (dark OpenAI-style multi-section doc site, left nav + inline markdown rendering + artifact visualization). Does NOT write the main session's final spoken report to the user.
harness plugin internal: given a lab path and variant number, run vN/artifact/run.sh, redirect all output to vN/runs/transcript.log, read the transcript and write vN/verdict.md (PASS/FAIL/CONCERNS + key numbers) from the template, then report back to the main session in one paragraph on the last line.
harness plugin internal: given a lab path and variant number, read the existing scaffolding + hypothesis, then write vN/artifact/run.sh (or an equivalent executable entry point). This is the only sonnet appearance in the harness pipeline, invested in designing the experiment's execution logic.
harness plugin internal: given a lab path and variant number, scaffold vN/artifact/ + prepare mock data / inputs / samples. Does NOT write execution logic (that is the script agent's job). Only invoked within the harness pipeline.
Uses power tools
Uses Bash, Write, or Edit tools
Own this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimOwn this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimBased on adoption, maintenance, documentation, and repository signals. Not a security audit or endorsement.
A Claude Code plugin for running controlled experiments — the way a lab runs them.
When you want to know whether prompt phrasing A beats B, whether the cheaper model is good enough, or whether your v3 rewrite actually improved anything, you need a comparison, not a one-off run. harness gives you a repeatable structure for that: it files the experiment, runs each variant in isolation, judges each one against criteria you set up front, and aggregates the results into a single report.
The main Claude Code session acts as the lab lead (PI). It never runs experiments itself — it clarifies the question, files the lab, and dispatches a research group per variant. Each group is a short pipeline of subagents:
Lab lead (main session)
│
├── clarify the question → file the lab (INTENT / hypothesis / STATUS)
│
├── variant v1 ── group ──► setup ──► script ──► run ──► verdict
├── variant v2 ── group ──► setup ──► script ──► run ──► verdict
├── variant vN ── group ──► setup ──► script ──► run ──► verdict
│
└── aggregate verdicts → report → REPORT.html → conclusion + recommendation
| Agent | Model | Does |
|---|---|---|
harness:setup | haiku | scaffolds vN/artifact/ — mock data, inputs, samples |
harness:script | sonnet | writes the replayable run.sh (the experiment's execution logic) |
harness:run | haiku | runs the script, captures the transcript, writes a PASS/FAIL verdict |
harness:report | haiku | aggregates everything into a dark, navigable REPORT.html |
harness:run judges each variant against the hard criteria in the hypothesis and grounds the verdict in transcript numbers — not vibes. The lab lead reads verdicts and decides.claude plugin marketplace add LH151732/harness-lab
claude install harness@harness-lab
That's it. The skill auto-triggers on experiment-shaped requests; the four agents come bundled.
Just describe a comparison. The skill triggers on phrasing like:
/harnessThe lab lead always asks which mode fits before doing anything:
v1..vN at once and dispatches them in concurrent waves, then produces one report.v1, shows you the verdict, and you decide what v2 should change — repeating until you're satisfied.Every experiment lives under /tmp/harness-lab/<slug>-<timestamp>/:
<slug>-<ts>/
├── INTENT.md what you're trying to find out + mode
├── hypothesis.md the claim, the variants, the hard PASS/FAIL criteria
├── STATUS.md append-only execution trace (UTC timestamps)
├── REPORT.html the aggregated, navigable report
└── v1/, v2/, .../
├── README.md this variant's spec
├── artifact/ scaffolding + run.sh
├── runs/transcript.log
└── verdict.md PASS / FAIL / CONCERNS + key numbers
The final message back to you is deliberately terse: a conclusion (winner / failure / inconclusive, numbers first) and a recommendation (1-3 concrete next steps), plus the path to REPORT.html for the full archive.
/tmp, so whatever your experiment needs (an interpreter, an API key, a CLI) must be available in your environment. harness orchestrates; it doesn't sandbox.The four agents and six templates are plain Markdown / HTML under agents/ and skills/harness/templates/. Fork and edit them — the REPORT.html template, in particular, is built to be retargeted: it ships with generic run.sh / artifacts / transcript slots per variant, and the harness:report agent fills them from whatever your experiment produced.
MIT — see LICENSE.
npx claudepluginhub hali0515/harness-lab --plugin harnessComplete collection of battle-tested Claude Code configs from an Anthropic hackathon winner - agents, skills, hooks, and rules evolved over 10+ months of intensive daily use
Unity Development Toolkit - Expert agents for scripting/refactoring/optimization, script templates, and Agent Skills for Unity C# development
Complete creative writing suite with 10 specialized agents covering the full writing process: research gathering, character development, story architecture, world-building, dialogue coaching, editing/review, outlining, content strategy, believability auditing, and prose style/voice analysis. Includes genre-specific guides, templates, and quality checklists.
Comprehensive .NET development skills for modern C#, ASP.NET, MAUI, Blazor, Aspire, EF Core, Native AOT, testing, security, performance optimization, CI/CD, and cloud-native applications
Comprehensive SEO analysis plugin for Claude Code. 25 sub-skills (21 core + 1 orchestrator + 1 framework + 2 extension mirrors) and 18 sub-agents cover technical SEO, content quality, schema, sitemaps, Core Web Vitals, local SEO, backlinks, AI/GEO, ecommerce, hreflang, SXO, clustering, drift monitoring, and Google APIs. Includes optional MCP extensions, SPA-aware rendering, portability, and hardened SSRF/DNS-rebinding safe fetchers.
Modern R development skills for Claude Code - tidyverse patterns, rlang metaprogramming, Bayesian inference, performance optimization, and more