Skill

run-harness

Use to execute the harness loop on a project that has a .harness/ folder. Triggers on "/run-harness", "run the harness", "kick off the harness on [X]", "resume the harness". You — Claude Code — are the orchestrator. Each role (planner, generator, evaluator) is a subagent invocation via the Task tool. State lives on disk under .harness/state/.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/harness:run-harness

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are the orchestrator. Each role is a fresh subagent invocation with its own context window. State communication is through JSON files on disk. You do not roleplay the planner, generator, or evaluator — you dispatch them and read what they wrote.

SKILL.md

175 lines · ~1.9k tokens

Stats

Language: Python

Stars: 0

Maintenance: Good

Last Commit: Jun 1, 2026

run-harness

When to invoke

User says "run the harness" or "/run-harness".
User says "resume" — pick up after the last iteration_start event in state/progress.jsonl.
User says "re-run iteration N" — re-execute a single iteration only.

Preconditions

.harness/config.json exists. If not, stop and suggest /instantiate-harness.
.harness/calibration/ is non-empty. If it has no examples (only README.md), warn loudly: evaluator quality without calibration is poor (principle P6). Ask before proceeding.
Schemas are reachable. Plugin root is ~/.claude/plugins/harness/ (or wherever installed); resolve relative paths from there.
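
The first two preconditions can be checked mechanically before anything is dispatched. The following is a minimal sketch, not part of the plugin: the function name `check_preconditions` and the wording of the returned messages are assumptions, and only the paths come from the list above.

```python
from pathlib import Path


def check_preconditions(project: Path) -> list[str]:
    """Return a list of problems; an empty list means the loop may start."""
    harness = project / ".harness"
    problems = []

    # Precondition 1: the config must exist.
    if not (harness / "config.json").is_file():
        problems.append("missing .harness/config.json; suggest /instantiate-harness")

    # Precondition 2: calibration needs at least one entry besides README.md.
    calibration = harness / "calibration"
    examples = (
        [p for p in calibration.iterdir() if p.name != "README.md"]
        if calibration.is_dir()
        else []
    )
    if not examples:
        problems.append("no calibration examples; warn and ask before proceeding")

    return problems
```

A missing config is a hard stop, while missing calibration only calls for a warning and a confirmation, so the caller would treat the two messages differently.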

The loop you drive

0 — Setup

Read .harness/config.json into memory. Note models.{planner,generator,evaluator}, budget, pivot.consecutive_failure_threshold, and pivot.plateau_threshold. Do not log iteration_start during setup; that event is logged once per iteration, in step 2a.
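
As an illustration of the setup step, the sketch below pulls the named fields out of the config. The `HarnessSettings` class and `load_settings` function are hypothetical. The key names follow this document (`budget.max_iterations` appears in step 2), but the value types are assumptions because the config schema is not shown here.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class HarnessSettings:
    planner_model: str
    generator_model: str
    evaluator_model: str
    max_iterations: int                 # assumed to be an integer
    consecutive_failure_threshold: int  # assumed to be an integer
    plateau_threshold: float            # assumed to be numeric


def load_settings(config_path: Path) -> HarnessSettings:
    """Extract the fields the orchestrator needs; a KeyError signals a bad config."""
    cfg = json.loads(config_path.read_text(encoding="utf-8"))
    return HarnessSettings(
        planner_model=cfg["models"]["planner"],
        generator_model=cfg["models"]["generator"],
        evaluator_model=cfg["models"]["evaluator"],
        max_iterations=cfg["budget"]["max_iterations"],
        consecutive_failure_threshold=cfg["pivot"]["consecutive_failure_threshold"],
        plateau_threshold=cfg["pivot"]["plateau_threshold"],
    )
```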

1 — Planner (runs once)

If state/plan.json does not exist, or this is a fresh run (the user passed neither --resume nor --iteration N):

Task {
  subagent_type: general-purpose,
  model: <config.models.planner>,
  description: "Plan harness scope",
  prompt: """
  You are the planner. Read your system prompt at .harness/planner.md and follow it.
  Your output: write state/plan.json matching runtime/schemas/plan.schema.json.
  The user prompt: <config.user_prompt>
  The rubric is at .harness/rubric.json.
  Do not produce more than 8 deliverables. Do not pick stack/tools/structure.
  When done, print "PLAN_WRITTEN" and the path.
  """
}

Validate the written state/plan.json against the schema (call python <plugin-root>/runtime/cli/validate.py plan state/plan.json). Append a plan_written line to state/progress.jsonl.

2 — Iteration loop

For iteration N from 1 to budget.max_iterations:

2a — Log iteration start

Append {"t":..., "event":"iteration_start", "iter": N} to state/progress.jsonl.
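
Every step in the loop appends one JSON object per line to progress.jsonl, so a small helper keeps the records uniform. This is a sketch under assumptions: `append_event` is a hypothetical name, and the format of `t` is not specified above, so an ISO 8601 UTC timestamp is assumed.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(progress_path: Path, event: str, **fields) -> dict:
    """Append one event record as a single JSON line and return it."""
    record = {
        "t": datetime.now(timezone.utc).isoformat(),  # assumed timestamp format
        "event": event,
        **fields,
    }
    progress_path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier events intact, so the file stays a running log.
    with progress_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

For example, `append_event(path, "iteration_start", iter=3)` would write the record described in 2a.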

2b — Contract negotiation (P4)

Dispatch the generator subagent to propose a contract:

Task {
  subagent_type: general-purpose,
  model: <config.models.generator>,
  description: "Propose contract iter N",
  prompt: """
  You are the generator. Read your system prompt at .harness/generator.md and follow it.
  Read state/plan.json and .harness/rubric.json.
  Propose a contract for iteration N. Write state/contracts/iter-{N:03d}.json
  matching runtime/schemas/contract.schema.json. Do NOT build yet.
  When done, print "CONTRACT_PROPOSED" and the path.
  """
}

Then dispatch the evaluator subagent to critique:

Task {
  subagent_type: general-purpose,
  model: <config.models.evaluator>,
  description: "Critique contract iter N",
  prompt: """
  You are the evaluator. Read your system prompt at .harness/evaluator.md and follow it.
  Read state/contracts/iter-{N:03d}.json. Either approve it or critique it.
  If approve: set "agreed_by": ["generator","evaluator"] and "agreed_at": <ISO>
  in the same file and write it back. Print "CONTRACT_AGREED".
  If critique: write state/handoffs/handoff-{seq}.json with kind "contract_critique"
  and the specific issues. Print "CONTRACT_CRITIQUED".
  """
}

If the evaluator critiques, dispatch the generator again to revise. Cap negotiation at 3 rounds; if the third round still ends in a critique, force-agree and record a note in the log.

Append contract_agreed to progress.jsonl.
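
The negotiation cap can be pictured as a bounded loop. In this sketch `propose` and `critique` are hypothetical callables standing in for the generator and evaluator Task dispatches; `critique` is assumed to return True when the evaluator printed CONTRACT_AGREED.

```python
from typing import Callable, Tuple

MAX_ROUNDS = 3  # the cap stated in 2b


def negotiate(
    propose: Callable[[int], None],
    critique: Callable[[int], bool],
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[int, bool]:
    """Run propose/critique rounds; return (rounds_used, forced)."""
    for round_no in range(1, max_rounds + 1):
        propose(round_no)        # generator writes or revises the contract
        if critique(round_no):   # evaluator approved this round
            return round_no, False
    # The cap was reached without approval: the caller force-agrees and logs a note.
    return max_rounds, True
```

The `forced` flag lets the orchestrator attach the log note only when agreement was not reached naturally.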

2c — Build artifact

Task {
  subagent_type: general-purpose,
  model: <config.models.generator>,
  description: "Build artifact iter N",
  prompt: """
  You are the generator. Read your system prompt at .harness/generator.md and follow it.
  Read state/contracts/iter-{N:03d}.json (the agreed contract).
  Build the artifact. When done, write state/handoffs/handoff-{seq}.json with
  kind "artifact_ready" and a per-criterion self-check (HINT only, not verdict).
  Print "ARTIFACT_READY" and the artifact path(s).
  """
}

Append artifact_ready to progress.jsonl.

2d — Evaluate

Task {
  subagent_type: general-purpose,
  model: <config.models.evaluator>,
  description: "Grade iter N",
  prompt: """
  You are the evaluator. Read your system prompt at .harness/evaluator.md and follow it.
  Read state/contracts/iter-{N:03d}.json, .harness/rubric.json, and the calibration
  examples at .harness/calibration/.
  Operate the artifact — run every verification action listed in the contract.
  Do not grade from description.
  Write state/verdicts/verdict-{N:03d}.json matching runtime/schemas/verdict.schema.json.
  Print "VERDICT_WRITTEN" and iteration_pass.
  """
}

Validate the verdict against the schema. Append a verdict line to progress.jsonl with iter, pass, and the failing criterion IDs.
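
To make the shape of that progress line concrete, here is a sketch that derives it from a verdict. The verdict schema is not reproduced in this document, so the `criteria` list with `id` and `pass` keys is an assumption, as are the event name `verdict` and the `failing` key; only `iteration_pass` is named above.

```python
def verdict_event(iteration: int, verdict: dict) -> dict:
    """Build the progress.jsonl fields for one verdict (assumed verdict shape)."""
    # Collect the IDs of criteria that did not pass, in the order they appear.
    failing = [c["id"] for c in verdict.get("criteria", []) if not c["pass"]]
    return {
        "event": "verdict",
        "iter": iteration,
        "pass": bool(verdict["iteration_pass"]),
        "failing": failing,
    }
```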

2e — Decide

Run the pivot check:

python <plugin-root>/runtime/cli/check_pivot.py --project . --iteration N

The script reads recent verdicts from state/verdicts/ and returns one of:

PASS → log halt_success, exit
ITERATE → continue to iteration N+1
PIVOT <criterion> → archive iteration N, log pivot, continue to N+1 with fresh-context instruction
BUDGET → log halt_budget, exit

If PIVOT: dispatch the generator at N+1 with an explicit "previous attempt archived under state/archive/iter-{N:03d}; take a different approach" instruction in the prompt.
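
The four outcomes can be parsed into a small structure before the orchestrator acts on them. This sketch assumes the script prints its result as a single line on standard output, which the text above does not confirm; `Decision`, `parse_decision`, and `pivot_note` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    action: str
    criterion: Optional[str] = None


def parse_decision(stdout: str) -> Decision:
    """Parse 'PASS', 'ITERATE', 'PIVOT <criterion>', or 'BUDGET'."""
    parts = stdout.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty output from pivot check")
    action = parts[0]
    if action == "PIVOT":
        if len(parts) != 2:
            raise ValueError("PIVOT requires a criterion")
        return Decision("PIVOT", parts[1])
    if action in {"PASS", "ITERATE", "BUDGET"} and len(parts) == 1:
        return Decision(action)
    raise ValueError(f"unrecognised decision: {stdout!r}")


def pivot_note(iteration: int) -> str:
    """Text to include in the next generator prompt after a pivot."""
    return (
        f"previous attempt archived under state/archive/iter-{iteration:03d}; "
        "take a different approach"
    )
```

Raising on unrecognised output keeps the loop from silently continuing when the script fails or its format changes.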

3 — Surface to user

At meaningful events (verdict, pivot, halt), print a one-line summary. Don't surface every handoff — surface the structural events only. At halt, print final artifact path(s) and the line count from progress.jsonl.

Modes

Fresh: /run-harness — planner runs, then iterate.
Resume: /run-harness --resume — read the last event from state/progress.jsonl and start at the next iteration.
Re-iteration: /run-harness --iteration N — re-run only that iteration (useful for debugging a specific failure after editing a prompt).
Dry: /run-harness --dry. Runs python <plugin-root>/runtime/cli/validate.py config .harness/config.json && python <plugin-root>/runtime/cli/validate.py rubric .harness/rubric.json and dispatches no roles.
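
Resume mode needs a rule for choosing the starting iteration from the log. The sketch below shows one possible reading, stated as an assumption: if the last started iteration already has a verdict, start at the next one; otherwise re-run the interrupted one. The event name `verdict` is also assumed, and `next_iteration` is a hypothetical helper.

```python
import json
from pathlib import Path


def next_iteration(progress_path: Path) -> int:
    """Return the iteration a resumed run would start at under the assumed rule."""
    last_started = 0
    closed = set()  # iterations whose most recent start was followed by a verdict
    if progress_path.exists():
        for line in progress_path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            rec = json.loads(line)
            if rec.get("event") == "iteration_start":
                last_started = rec["iter"]
                closed.discard(rec["iter"])  # a restart reopens the iteration
            elif rec.get("event") == "verdict":
                closed.add(rec["iter"])
    if last_started == 0:
        return 1  # nothing has run yet
    return last_started + 1 if last_started in closed else last_started
```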

What good looks like

Each role gets its own Task call. You never "play" the role yourself.
Every JSON file written by a subagent is validated before the next role reads it.
progress.jsonl reflects every meaningful event in real time.
Pivots fire when the same criterion fails consecutive_failure_threshold times — not because of vibes.
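
The consecutive-failure rule in the last point can be written as a pure function. This is an illustrative sketch only, not the logic of check_pivot.py, which is not shown here; it ignores pivot.plateau_threshold, and `criterion_to_pivot_on` is a hypothetical name.

```python
from typing import Optional, Sequence, Set


def criterion_to_pivot_on(
    failing_history: Sequence[Set[str]], threshold: int
) -> Optional[str]:
    """Return a criterion that failed in each of the last `threshold` iterations.

    failing_history holds one set of failing criterion IDs per iteration,
    oldest first.
    """
    if threshold < 1 or len(failing_history) < threshold:
        return None
    recent = failing_history[-threshold:]
    # A criterion is stuck only if it appears in every one of the recent sets.
    stuck = set.intersection(*(set(f) for f in recent))
    # min() makes the choice deterministic when several criteria are stuck.
    return min(stuck) if stuck else None
```

Because the function looks only at recorded verdicts, the same history always yields the same decision.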

Anti-patterns

Concatenating role prompts into one Task call. Each role must be a separate Task. That's how P1 (separate context windows) and P2 (adversarial pressure) get their teeth.
Letting a subagent skip verification. If the evaluator's verdict doesn't cite the verification action's output, reject and re-dispatch.
Continuing past a schema validation failure. Reject and re-dispatch the writer agent.
Surfacing every internal step to the user. Surface iteration starts, verdicts, pivots, halts. Not contract round 2 of 3.
Forgetting to archive on pivot. The archive is what makes the pivot meaningful — the next generator session must see the previous attempt is gone and a different approach is requested.

Why this works

Each subagent invocation gets its own context window. The evaluator never sees the generator's reasoning, only what's on disk. That's the adversarial pressure principle P2 made concrete. The file system is the message bus — not your conversation context.
