From harness
Enforce four-role separation (Planner / Writer / Evaluator / Orchestrator) when drafting + judging completed work. Runs in two modes — interactive `/gen-eval-pair <prompt>` (current session reviews planner draft + implements as writer) and ralph-loop (headless writer dispatches planner; automatic P3 acts as gate instead of human review). Both modes share the same 5-phase pipeline: planner drafts contract → evaluator reviews contract → evaluator proposes rubric additions; orchestrator applies → writer implements → evaluator scores against rubric. The project's `.harness/rubric.md` is the only rule file evaluators read. Use when setting up autonomous coding loops, when the user mentions "evaluator-rubric", "sprint contract", "球員兼裁判", "player-referee", or wants to gate work behind structured review.
Slash command: `/harness:gen-eval-pair`
Writer/evaluator separation for autonomous coding loops. Closes the player-referee gap: the agent that writes the code MUST NOT be the agent that judges it.
tokens = split $ARGUMENTS by whitespace
first_token = first non-flag token
if first_token == "lint" → lint sub-command (contract structure check)
    path = second non-flag token = contract path
if first_token == "eval" → eval sub-command (re-score an existing contract)
    path = second non-flag token = contract path
otherwise → default invocation (interactive full pipeline; runs the 5 phases)
    prompt = entire $ARGUMENTS, used as the user task description
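The parsing rules above can be sketched in TypeScript. This is a minimal illustration, not the skill's actual parser: the `Invocation` type, the `parseArguments` name, and the rule that a flag is any token starting with `--` are all assumptions made for the example.

```typescript
// Hypothetical result type; the skill only describes this parsing in pseudocode.
type Invocation =
  | { kind: "lint"; contractPath: string | undefined }
  | { kind: "eval"; contractPath: string | undefined }
  | { kind: "default"; prompt: string };

// Assumption: a "flag" is any token that starts with "--".
const isFlag = (token: string): boolean => token.startsWith("--");

function parseArguments(args: string): Invocation {
  // Split on whitespace and drop empty tokens.
  const tokens = args.trim().split(/\s+/).filter((t) => t.length > 0);
  const positional = tokens.filter((t) => !isFlag(t));
  const first = positional[0];
  const second = positional[1];

  if (first === "lint") {
    return { kind: "lint", contractPath: second };
  }
  if (first === "eval") {
    return { kind: "eval", contractPath: second };
  }
  // Default invocation: the entire argument string is the task description.
  return { kind: "default", prompt: args.trim() };
}
```

Note that in the default branch the whole argument string is kept as the prompt, flags included, which matches "prompt = entire $ARGUMENTS" above.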
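Invocation: `/gen-eval-pair <prompt>`
Drives the 5-phase pipeline:
1. Planner drafts the contract at `.harness/contract/<task-id>/contract.md`
2. Evaluator reviews the contract
3. Evaluator proposes additions to `.harness/rubric.md`; orchestrator applies them
4. Writer implements
5. Evaluator scores against `.harness/rubric.md`
The contract path is generated automatically under `.harness/contract/<task-id>/` (task-id is sequential or a timestamp).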
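Invocation: `/gen-eval-pair lint <contract-path>`
Lint only: checks contract structure and completeness; does not run the writer or evaluator scoring.
Internally calls `${CLAUDE_PLUGIN_ROOT}/scripts/run.sh --contract=<path> --phase=lint`.
Invocation: `/gen-eval-pair eval <contract-path>`
Eval only: runs the evaluator against an existing contract plus completed work, skipping the lint and planner phases.
Internally calls `${CLAUDE_PLUGIN_ROOT}/scripts/run.sh --contract=<path> --phase=eval`.
Naming note: Mode A / Mode B below refer to the pipeline topology (interactive with a human present vs. automated ralph-loop). This is a different dimension from the `lint` / `eval` sub-commands parsed from `$ARGUMENTS` above. The Mode A/B topology names are already referenced stably in REFERENCE.md and in downstream personas/templates.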
P1 user / driver provides feature description
P2 planner drafts contract (subagent in both modes)
Mode A: .harness/contract/<task-id>/contract.md
Mode B: .ralph/sprints/<US-id>-contract.v<n>.md
P3 evaluator reviews contract → Accept | Revise | Insufficient
└─ Revise/Insufficient: Mode A = re-prompt planner in parent session;
Mode B = write defects to .ralph/prompt.md, retry next iteration
P4 evaluator proposes rubric additions;
orchestrator applies (only if rubric missing fields the contract needs)
P5 writer implements per contract
P6 evaluator scores per rubric → Accept | Revise | Block | Incomplete
├─ verdict has suggestedRubricAdditions[] → orchestrator applies → re-run P6 (max 3 rounds)
└─ no additions → final verdict, end
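The P6 branch at the end of this diagram can be sketched as a bounded loop. This is a hedged illustration only: `runScoreLoop`, `score`, and `applyAdditions` are hypothetical names, the callbacks stand in for the evaluator dispatch and the orchestrator's rubric write, and the sketch assumes that additions still pending at the round cap are left unapplied.

```typescript
type ScoreVerdict = "Accept" | "Revise" | "Block" | "Incomplete";

interface ScoreOutcome {
  verdict: ScoreVerdict;
  suggestedRubricAdditions: { id: string }[];
}

// `score` stands in for dispatching the evaluator at P6;
// `applyAdditions` stands in for the orchestrator writing to rubric.md.
function runScoreLoop(
  score: (round: number) => ScoreOutcome,
  applyAdditions: (additions: { id: string }[]) => void,
  maxRounds = 3,
): { verdict: ScoreVerdict; rounds: number } {
  let round = 1;
  let outcome = score(round);
  // Re-run P6 only while the verdict carries additions and rounds remain.
  while (outcome.suggestedRubricAdditions.length > 0 && round < maxRounds) {
    applyAdditions(outcome.suggestedRubricAdditions);
    round += 1;
    outcome = score(round);
  }
  return { verdict: outcome.verdict, rounds: round };
}
```

With the default cap the evaluator is dispatched at most three times, and a verdict without additions ends the loop immediately.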
The hard separation — four roles, none plays another's part:
- Planner: a subagent (`agents/planner.md`), read-only tools. Never writes the file directly; returns markdown via JSON for the orchestrator to persist. Avoids the "writer-is-also-contract-author" bias where contract AC drift toward the implementation already half-formed in the writer's head.
- Writer: implements per contract. Never modifies `.harness/rubric.md`. NEVER produces a verdict. NEVER drafts the contract directly (the Planner subagent does that at P2).
- Evaluator: returns `{verdict, defects[], suggestedRubricAdditions[]}`; proposes only.
- Orchestrator (`run.sh` / this skill in interactive topology) is a deterministic script. The sole author of `rubric.md`: applies evaluator's `suggestedRubricAdditions` in a separate `chore(harness): extend rubric ...` commit. Re-invokes evaluator.
Mode A: interactive `/gen-eval-pair <prompt>`
Use when a human is in the loop (you're working in a Claude Code session and want gated review on the work you're about to do).
| Aspect | Details |
|---|---|
| Planner | Spawned via Task tool with subagent_type: harness:planner at P2 — fresh context. Returns contract markdown + story_id + open questions; you review in this session before persisting. |
| Writer | The current Claude session (you, reading this skill) — implements at P5 against the persisted contract |
| Evaluator | Spawned via Task tool with subagent_type: harness:evaluator — fresh context every phase. The same persona is dispatched for P3 / P4 / P6, distinguished by Phase=... in the prompt (per-phase output schemas tracked in agents/evaluator.md). |
| Pause points | After Planner returns (you review draft); after P3 if Revise; after P6 if Revise/Block — you ask the user how to proceed |
When `/gen-eval-pair <prompt>` is invoked
Treat `<prompt>` as the feature description for P1.
P2 — Planner drafts contract. Read <project>/.harness/rubric.md if it exists (path passed to planner). Pick <US-id> from prd.json if integrated with ralph; otherwise slug from <prompt>. Dispatch:
Task({
subagent_type: "harness:planner",
description: "Draft sprint contract",
prompt: "Phase=draft-contract. user_prompt=<prompt>. rubric=<path>. suggested_story_id=<US-id>. Return JSON {story_id, version, title, contract_md, rubric_coverage_notes, open_questions}."
})
Review the returned contract_md in this session (you, parent agent). If open_questions is non-empty, surface them to the user before persisting. Make any tightening edits, then write to .harness/contract/<task-id>/contract.md. Versioning: subsequent revisions overwrite the same file; if you need to preserve drafts, suffix as contract.v2.md, contract.v3.md in the same directory.
P3 — evaluator reviews contract. Dispatch:
Task({
subagent_type: "harness:evaluator",
description: "Review sprint contract",
prompt: "Phase=contract-review. Read <contract path>. Read <rubric path if exists>. Verify every AC is falsifiable + measurable + traceable to rubric rules. Return JSON {verdict, defects[]}."
})
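P4 — propose rubric additions. The evaluator first detects the situation per the table below, then branches. The orchestrator (the assistant acting as orchestrator) is the only rubric writer, and each change gets its own commit. The Writer never touches the rubric.
Situation detection: the criterion is whether `.harness/rubric.md` exists and is non-empty.
| Situation | Detection | Evaluator behaviour (Phase=propose-rubric) | Orchestrator output |
|---|---|---|---|
| Greenfield | `.harness/rubric.md` is missing or empty | Derives a complete initial rubric from the contract, following the eight sections of `templates/rubric.global.example.md` (Design References / Page Type Wrappers / Forbidden Patterns / Composition Assertions / Scoring Dimensions / Process Rules / Verdict Rules / Evolution Log) | Creates `.harness/rubric.md` in its own commit: `chore(harness): seed rubric — from <task-id>` |
| Brownfield | `.harness/rubric.md` exists and is non-empty | Compares contract requirements against existing rubric coverage and returns the missing `suggestedRubricAdditions[]` | Appends each entry to the existing `.harness/rubric.md`, one commit per entry: `chore(harness): extend rubric — <reason>` |
Both situations share the same evaluator dispatch (Phase=propose-rubric); the evaluator picks the branch itself based on whether the supplied rubric path is readable and non-empty. The commits for greenfield seeding and brownfield extension must both be kept separate from the writer's story commit.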
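The greenfield/brownfield detection rule (rubric file missing or empty vs. present and non-empty) can be sketched as a pure function. This is an illustrative sketch only: the function names are hypothetical, the caller is assumed to have read `.harness/rubric.md` already (passing `null` when the file is missing or unreadable), and treating a whitespace-only file as empty is an assumption.

```typescript
type RubricSituation = "greenfield" | "brownfield";

// `rubricText` is the content of .harness/rubric.md, or null when the file
// is missing or unreadable. Reading the file is left to the caller.
function detectRubricSituation(rubricText: string | null): RubricSituation {
  // Assumption: a whitespace-only file counts as empty.
  if (rubricText === null || rubricText.trim().length === 0) {
    return "greenfield";
  }
  return "brownfield";
}

// Maps the detected situation to what the orchestrator does with the result.
function orchestratorAction(
  situation: RubricSituation,
): "create-rubric" | "append-additions" {
  return situation === "greenfield" ? "create-rubric" : "append-additions";
}
```

Keeping the check pure makes it easy to apply the same rule on both the dispatch side and the evaluator side without touching the filesystem twice.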
P5 — implement. Write the code per contract. Run project Quality Requirements (pnpm svelte-check, lint, tests, etc.).
P6 — evaluator scores. Dispatch:
Task({
subagent_type: "harness:evaluator",
description: "Score implementation",
prompt: "Phase=score. Read .harness/rubric.md (your ONLY rule file). Read <contract>. Read <screenshot>. Return JSON {verdict, defects[], suggestedRubricAdditions[]}."
})
If `suggestedRubricAdditions` is non-empty → orchestrator applies → re-dispatch (max 3 rounds).
Mode B: ralph-loop (`bun run .ralph/ralph.ts`)
Use when fully autonomous. The driver spawns a fresh writer agent each iteration; the writer dispatches the planner + evaluator (subagents OR external CLI).
Path convention: Mode B uses `.ralph/sprints/` and `.ralph/prompt.md` exclusively. Mode A artefacts live under `.harness/`.
| Aspect | Details |
|---|---|
| Planner | Dispatched from within writer's process via Task — subagent_type: harness:planner. No human review in this mode — P3 contract-review (run automatically) is the gate. |
| Writer | Headless agent spawned by .ralph/ralph.ts (claude -p / copilot --yolo / gemini -p) — fresh process per iteration; implements at P5 |
| Evaluator | Dispatched from within writer's process — subagent via Task or external CLI (e.g. copilot --model gpt-5.4) per project config |
| Pause points | None — Revise leaves story in_progress for next iteration; Block flips story to blocked |
The writer's prompt (.ralph/prompt.md) directs it through:
1. Pick the next story from `prd.json` (or wherever ralph stores it).
2. Dispatch the Planner (`subagent_type: harness:planner`). Receive contract draft.
3. Persist the draft to `.ralph/sprints/<story_id>-contract.v<n>.md` without human review.
4. Run `<plugin-target>/run.sh --contract=<path> --phase=contract-review` → automatic P3 gate.
5. On Accept: continue through the remaining phases (`run.sh --phase=all` or equivalent).
6. On Revise/Insufficient: write defects + `open_questions` into `.ralph/prompt.md` (or a story-specific feedback file), mark the story `needs-revision`, terminate this iteration. Next ralph iteration sees the feedback and re-dispatches Planner.
# Mode B example
<plugin-target>/run.sh --contract=.ralph/sprints/<US-id>-contract.md
run.sh orchestrates P3→P6. Phase flags (--phase=contract-review|propose-rubric|score|all) let the writer call subsets. Default all runs the full pipeline.
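The phase-flag behaviour described here (four accepted values, `all` as the default) can be sketched as follows. `run.sh` is a shell script, so this TypeScript is only an illustration of the selection logic, not its implementation. `resolvePhase` is a hypothetical name, and mapping the deprecated `lint` value to `contract-review` is an assumption.

```typescript
const PHASES = ["contract-review", "propose-rubric", "score", "all"] as const;
type Phase = (typeof PHASES)[number];

interface PhaseSelection {
  phase: Phase;
  warning?: string;
}

function resolvePhase(argv: readonly string[]): PhaseSelection {
  const prefix = "--phase=";
  const flag = argv.find((arg) => arg.startsWith(prefix));
  // No flag given: default to the full pipeline.
  if (flag === undefined) {
    return { phase: "all" };
  }
  const value = flag.slice(prefix.length);
  // Assumption: the deprecated "lint" alias maps to contract-review.
  if (value === "lint") {
    return { phase: "contract-review", warning: "--phase=lint is deprecated" };
  }
  const phase = PHASES.find((candidate) => candidate === value);
  if (phase === undefined) {
    throw new Error(`unknown phase: ${value}`);
  }
  return { phase };
}
```

Rejecting unknown values loudly, rather than falling back to `all`, keeps a typo from silently running the full pipeline.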
In Mode A, the human reviews the Planner's draft to catch: vague AC, missing verification, scope creep, contract-evaluator-injection. These are the exact things P3 contract-review (already implemented) is designed to catch. Mode B just makes the gate explicit and automatic instead of relying on a human pause. See REFERENCE.md "Planner dispatch — Mode A vs Mode B topology" for the full failure-mode mapping.
`rubric.md`
The plugin does NOT ship runtime rules, only schema and examples. Each project owns its own rule file at:
<project>/.harness/
└── rubric.md ← single file, mutates in place; git history is the audit trail
Why single-layer by default: at any given moment the evaluator only sees one rubric. Splitting "this rule is cross-story vs story-specific" requires N stories of evidence — premature classification at story 1 produces guess-based labels that bias toward task and never grow global. Start single; opt into the two-layer split when you have evidence it's needed (see Advanced below).
Example shipped by plugin (copy into project + customize):
- `templates/rubric.global.example.md`, with sections: Design References / Page Type Wrappers / Forbidden Patterns / Composition Assertions / Scoring Dimensions / Process Rules / Verdict Rules / Evolution Log. Use as the seed for `.harness/rubric.md`.
`suggestedRubricAdditions` schema (returned by evaluator at P4 / P6):
{
"id": "<short-id>",
"section": "Forbidden Patterns" | "Composition Assertions" | "Page Type Wrappers" | "Required Components" | "Process Rules" | "Scoring Dimensions" | "Verdict Rules",
"row": { "<col>": "<val>" },
"reasonShort": "<≤100 chars>"
}
Orchestrator appends each addition to rubric.md in place. Each write is a separate chore(harness): extend rubric — <reason> commit, never bundled with writer's story commit.
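Before appending, an orchestrator could check each proposed addition against the schema above. The sketch below is a minimal, assumed validator rather than the plugin's own code: `validateAddition` is a hypothetical name, and measuring the 100-character limit with `String.length` (UTF-16 code units) is an assumption.

```typescript
// Section names copied from the schema above.
const SECTIONS = [
  "Forbidden Patterns",
  "Composition Assertions",
  "Page Type Wrappers",
  "Required Components",
  "Process Rules",
  "Scoring Dimensions",
  "Verdict Rules",
] as const;

// Returns a list of problems; an empty list means the addition looks valid.
function validateAddition(input: unknown): string[] {
  if (typeof input !== "object" || input === null) {
    return ["addition must be an object"];
  }
  const candidate = input as Record<string, unknown>;
  const problems: string[] = [];

  const id = candidate["id"];
  if (typeof id !== "string" || id.length === 0) {
    problems.push("id must be a non-empty string");
  }
  const section = candidate["section"];
  if (!SECTIONS.some((known) => known === section)) {
    problems.push("section must be one of the schema's section names");
  }
  const row = candidate["row"];
  if (typeof row !== "object" || row === null || Array.isArray(row)) {
    problems.push("row must be an object of column/value pairs");
  }
  const reason = candidate["reasonShort"];
  if (typeof reason !== "string" || reason.length > 100) {
    problems.push("reasonShort must be a string of at most 100 characters");
  }
  return problems;
}
```

Collecting every problem instead of failing on the first one lets the orchestrator report all defects of a rejected addition in a single pass.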
Writer never modifies .harness/rubric.md. Enforced via six layers:
1. Evaluator persona (`agents/evaluator.md`) uses Read/Grep/Glob/WebFetch, with no Write/Edit. It proposes additions, doesn't apply them.
2. External CLI evaluator: `copilot --model <X> --allow-tool Read`. Same restriction.
3. Orchestrator: `run.sh` (or this skill in interactive topology) applies `suggestedRubricAdditions`, in its own commit.
4. `rubric.ts` startup checks (a) writer's HEAD commit doesn't touch `rubric.md`, (b) writer's working tree doesn't dirty `rubric.md`, (c) contract's Expected files touched doesn't list `rubric.md`. Any violation → Block.
5. Run trace: `<contract>.run-trace.json` sidecar (exit code, timestamps, stdout hash). `rubric.ts` rejects visual reviews whose latest section doesn't match the trace nonce. Defends against synthetic `visual-review.md` files. See REFERENCE.md "Tamper hardening — nonce + run-trace".
6. Addition gate: `update-rubric.ts` runs each addition's `verification_command` against `<fixtures-dir>/known-good` (must exit 0) and `known-broken` (must exit non-0). Rules that don't discriminate are rejected before commit. Modes: `--gate=off|lenient|strict` (default `lenient`). See REFERENCE.md "Addition gate".
Plus the global model rule:
- External CLI evaluator: no `--model` containing `claude` (writer is assumed Claude). Subagent mode is read-only by tool restriction.
⚠️ The 5-phase pipeline above is the target architecture. Current scripts implement P3 (`lint-contract.ts`, regex-only) and P6 (`rubric.ts` + `visual-review.ts`). Pending work:
| Phase | Status |
|---|---|
| P3 contract-review | ✏️ Deterministic lint shipped (scripts/lint-contract.ts); LLM-semantic review prompt available at templates/prompts/contract-review-prompt.md but no dispatcher script — caller-driven |
| P4 propose-rubric | ✏️ Apply step shipped (scripts/update-rubric.ts); evaluator dispatch is caller-driven (drop JSON at <contract>.propose-rubric.json, then run.sh --phase=propose-rubric) |
| agents/evaluator.md per-phase schemas | ✅ Complete (P3 / P4 / P6 each with own input + output schema) |
| .harness/rubric.md parser | ✅ Shipped (scripts/parse-rubric.ts); rubric.ts reads project rubric's ## Scoring Dimensions and ## Verdict Rules to drive dim list + thresholds. Falls back to built-in 7-dim if neither section present. See REFERENCE.md "Scoring Dimensions + Verdict Rules — parser spec". |
| run.sh --phase= separation | ✅ Complete (contract-review / propose-rubric / score / all; lint aliased with deprecation warning) |
| Re-score loop (P6 iteration) | ✅ Shipped in run.sh run_p6 — up to 3 rounds; each iteration runs visual-review + rubric.ts, applies suggestedRubricAdditions[] via update-rubric.ts, exits early on convergence (all-duplicates → exit code 3) or hard cap. See REFERENCE.md "Orchestrator algorithm". |
| Smoke test | ✅ Shipped (scripts/smoke-test.ts); three modes (baseline / project / both). init.ts runs baseline at install; re-run with --mode=project after editing rubric.md or adding fixtures. See REFERENCE.md "Smoke test". |
| P2 Planner role | ✅ Shipped — persona agents/planner.md + prompt template templates/prompts/planner-prompt.md. Mode A: assistant dispatches Planner subagent and reviews draft in parent session before persisting. Mode B (ralph): writer dispatches Planner, persists without human review, automatic P3 contract-review acts as gate; Revise/Insufficient → defects + open_questions written to .ralph/prompt.md for next iteration. See REFERENCE.md "Planner dispatch — Mode A vs Mode B topology". |
| split-rubric CLI | ✅ Shipped (scripts/split-rubric.ts); single → two-layer migration in one commit. Classifies rules by Evolution Log recurrence (0 stories → global; 1 story → task; 2+ → global). --dry-run / --yes / interactive. See REFERENCE.md "Migration: single → two-layer". |
| Data-driven promotion | ✅ Shipped (scripts/rubric-stats.ts); update-rubric.ts auto-records every task-layer apply to <target>/.rubric-stats.json. Run rubric-stats.ts --propose to surface rules in ≥N stories as promotion candidates. See REFERENCE.md "Data-driven promotion". |
| uncheckedStates[] in P6 verdict | ✅ Shipped (variant B). Evaluator MUST declare every (state, viewport) it failed to cover. rubric.ts surfaces gaps in score.json. run.sh --strict-unchecked downgrades Accept → Incomplete when non-empty; default lenient mode preserves verdict. See REFERENCE.md "uncheckedStates handling". |
| Tested across: per AC | ✅ Shipped (variant C). Planner persona + prompt template + sprint-contract template all require each AC to declare Tested across: viewports=[..], states=[..]. lint-contract.ts enforces; --allow-default-tested-across demotes to warning during migration. |
| Interactive evaluator | ✅ Shipped (variant A). agents/evaluator.md tools extended with Bash / Write / mcp__playwright__* plus strict path + command whitelists. run.sh --evaluator-probe=passive|interactive (default passive) gates the capability; subagent dispatcher renders mode-specific prompt section; external (copilot) dispatcher silently downgrades interactive → passive. See REFERENCE.md "Probe mode". |
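The `uncheckedStates[]` rule in the table above (strict mode downgrades Accept to Incomplete when any state is uncovered, lenient mode preserves the verdict) is small enough to sketch directly. The names `applyUncheckedPolicy` and `UncheckedState` are hypothetical, and the shape of an unchecked entry as a `{ state, viewport }` pair is an assumption based on the "(state, viewport)" wording.

```typescript
type FinalVerdict = "Accept" | "Revise" | "Block" | "Incomplete";

// Assumed shape: one entry per (state, viewport) the evaluator did not cover.
interface UncheckedState {
  state: string;
  viewport: string;
}

function applyUncheckedPolicy(
  verdict: FinalVerdict,
  uncheckedStates: readonly UncheckedState[],
  strict: boolean,
): FinalVerdict {
  // Only an Accept is downgraded, and only in strict mode with gaps present.
  if (strict && verdict === "Accept" && uncheckedStates.length > 0) {
    return "Incomplete";
  }
  return verdict;
}
```

Other verdicts pass through unchanged in both modes, since the table only describes a downgrade from Accept.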
Use this skill as documentation of the target flow while building it out story by story.
See REFERENCE.md for rubric.md grammar specifics, parsing rules, orchestrator algorithm, Planner dispatch topology (Mode A vs Mode B), smoke test modes, tamper hardening (nonce + run-trace), addition gate (verification_command + fixtures), troubleshooting (Windows MSYS / bun vs Node / timeout tuning), the rubric extension audit-log format, migration single → two-layer (split-rubric.ts), data-driven promotion (rubric-stats.ts), and the opt-in two-layer rubric (rubric.global.md + tasks/<US-id>-rubric.task.v<n>.md) for projects that have outgrown the single-file default and want cross-story vs per-story rule isolation.
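Below are the canonical Task() dispatch structures for each phase. The Task() examples inside the Mode A workflow (steps 2/3/4/6) are condensed to stay readable inline; when you need to decide the granularity of `description`, which context the prompt should carry, or how to re-dispatch across multiple review rounds, this appendix is authoritative. Full JSON output schemas are not repeated here; they are defined in agents/planner.md and agents/evaluator.md. This appendix covers only Mode A subagent dispatch; Mode B goes through run.sh --phase= and this section does not apply (its prompts are handled by the variable-injection templates under templates/prompts/).
Description examples (one line for the Task() description field):
- `Review sprint contract`
- `Review sprint contract (revise round <n>)`
Prompt structure:
Phase=contract-review
contract: .harness/contract/<task-id>/contract.md
rubric: .harness/rubric.md # if it does not exist, write "not yet created" explicitly
previous_verdict: <omitted; attach only on a revise re-dispatch>
- verdict: Revise
- defects: [<full defects[] JSON from the previous round>]
- revised_sections: [<names of the contract sections fixed this round>]
Return JSON per agents/evaluator.md Phase: contract-review schema.
Multi-round dispatch notes: when P3 is re-dispatched, the contract should already be written back to the same file (overwritten, or saved as contract.v2.md, contract.v3.md). The prompt must carry the previous verdict's complete defects[] JSON plus a one-line revised_sections list, so the evaluator can focus on verifying whether the previous round's defects were actually fixed instead of reviewing the new contract from scratch. On the first dispatch, omit the previous_verdict block entirely.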
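A prompt builder for the P3 dispatch could follow the structure above, attaching the `previous_verdict` block only on a revise round. This is a sketch under assumptions: `buildContractReviewPrompt` and the `PreviousVerdict` interface are hypothetical names, and serialising `defects` with `JSON.stringify` is one possible way to "carry the complete JSON".

```typescript
interface PreviousVerdict {
  verdict: "Revise" | "Insufficient";
  defects: unknown[]; // full defects[] from the previous round
  revisedSections: string[]; // contract sections fixed this round
}

function buildContractReviewPrompt(
  contractPath: string,
  rubricPath: string | null, // null when the rubric does not exist yet
  previous?: PreviousVerdict,
): string {
  const lines: string[] = [
    "Phase=contract-review",
    `contract: ${contractPath}`,
    `rubric: ${rubricPath ?? "not yet created"}`,
  ];
  // First dispatch: the previous_verdict block is omitted entirely.
  if (previous !== undefined) {
    lines.push(
      "previous_verdict:",
      `- verdict: ${previous.verdict}`,
      `- defects: ${JSON.stringify(previous.defects)}`,
      `- revised_sections: ${JSON.stringify(previous.revisedSections)}`,
    );
  }
  lines.push("Return JSON per agents/evaluator.md Phase: contract-review schema.");
  return lines.join("\n");
}
```

The returned string would be passed as the `prompt` field of the Task() call for the evaluator subagent.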
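Description examples:
- Greenfield (`.harness/rubric.md` missing or empty): `Seed rubric from contract`
- Brownfield (`.harness/rubric.md` exists and is non-empty): `Propose rubric additions`
Prompt structure:
Phase=propose-rubric
contract: .harness/contract/<task-id>/contract.md
rubric: .harness/rubric.md # path is always passed; the evaluator decides greenfield/brownfield itself
mode: single # or two-layer (see Advanced)
template_reference: templates/rubric.global.example.md # always included; greenfield uses it to derive the full rubric, brownfield uses it to align section names
Return JSON per agents/evaluator.md Phase: propose-rubric schema (suggestedRubricAdditions[]).
Multi-round dispatch notes: P4 is usually single-round (the evaluator returns suggestedRubricAdditions[] in one pass, and once the orchestrator applies them the flow moves to P5). Re-score feedback triggered at P6 returns to P6, not P4: if the P6 verdict carries suggestedRubricAdditions[], the orchestrator applies them directly and re-dispatches P6 (at most 3 rounds) without going back to P4. The greenfield/brownfield branch is decided by the evaluator based on whether the rubric path is readable and non-empty (see the Mode A step 4 table); the dispatch side is not responsible for switching the phase token.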
Description examples:
- `Score implementation`
- `Re-score after rubric extension (round <n>)`
Prompt structure:
Phase=score
contract: .harness/contract/<task-id>/contract.md
rubric: .harness/rubric.md
screenshot: .harness/contract/<task-id>/screenshot.png # if available; when both fullPage and viewport shots exist, list one per line
design_system: [DESIGN.md, MAPPING.md, src/lib/layout.css]
source: [<list of files the writer changed this round>]
previous_round: <omitted; attach only on a re-score>
- applied_additions: [<id list of the previous round's suggestedRubricAdditions[] that were applied>]
- prior_verdict: <previous round's verdict>
Return JSON per agents/evaluator.md Phase: score schema (verdict, defects[], suggestedRubricAdditions[]).
Multi-round dispatch notes: a P6 re-score is triggered after the orchestrator applies suggestedRubricAdditions[] (at most 3 rounds; the convergence condition is that all additions are duplicates; see REFERENCE.md "Orchestrator algorithm"). The re-dispatch prompt must carry the id list in previous_round.applied_additions so the evaluator knows the rubric has already gained rules X / Y / Z and does not propose the same additions again, wasting a round trip. Attach prior_verdict as well so the evaluator can align on the direction of improvement, but do not attach the previous round's full defects: P6 is an independent scoring against the latest rubric + latest screenshot, and earlier defects may no longer apply.
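The convergence condition mentioned for the P6 re-score loop (every proposed addition is a duplicate) can be sketched as a simple check. This is an assumed reading, not the shipped algorithm: `isConverged` is a hypothetical name, and deciding "duplicate" purely by addition `id` is an assumption; REFERENCE.md "Orchestrator algorithm" remains the authority.

```typescript
// Assumption: an addition is a duplicate when its id was already applied
// in an earlier round.
function isConverged(
  proposedIds: readonly string[],
  appliedIds: ReadonlySet<string>,
): boolean {
  // An empty proposal list also counts as converged: nothing new to apply.
  return proposedIds.every((id) => appliedIds.has(id));
}

// Filters a proposal list down to the ids that still need applying.
function newAdditions(
  proposedIds: readonly string[],
  appliedIds: ReadonlySet<string>,
): string[] {
  return proposedIds.filter((id) => !appliedIds.has(id));
}
```

A check like this is also why the re-dispatch prompt carries `applied_additions`: if the evaluator already knows which ids were applied, duplicate proposals should be rare.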
npx claudepluginhub gn00678465/harness-tools --plugin harness