Designs and runs Anthropic-style long-running application harnesses for autonomous coding. Use when turning a short prompt into a multi-agent workflow, dispatching initializer/planner/generator/evaluator/coordinator roles, tracking completion through a machine-readable feature list, negotiating sprint contracts before coding, making incremental progress on failing features across sessions, or running until the required feature set passes. Also activate for /start, /session, /run, /reset commands, context reset with handoff files, supervised vs continuous execution modes, or questions about Anthropic harness design patterns and context anxiety.
Slash command: /harness:harness
> **Domain-agnostic.** This harness works for any project type — web apps, CLI tools, APIs, libraries, infrastructure, data pipelines, mobile apps, or any software development workflow. The patterns below are not specific to any tech stack or domain.
Blend the two Anthropic articles rather than following only one of them:
If the harness keeps creating new sprints but never finishes, it is missing the 2025 completion machinery. If the harness finishes in one sprint without a written rationale, it is no longer clearly following the 2026 sprinted harness.
The default harness should have:
The goal is not to maximize the number of sprints. The goal is to drive the required feature set to passing status and stop.
This harness follows a GAN-like (Generative Adversarial Network) pattern from Anthropic's engineering articles:
The adversarial tension between generator and evaluator prevents the common failure mode where a model is too lenient grading its own work. This pattern applies regardless of domain — whether building software, writing tender documents, or designing architecture.
Whenever this skill is activated (by any agent, command, or direct invocation):
- .harness/ directory exists. If not -> suggest /harness:start
- .harness/state.json -> verify mode, variant, current_sprint_phase exist
- .harness/features.json -> verify at least one feature exists
- .harness/config.json -> use defaults for missing fields
- release.json (project root) if exists -> know current version
Based on domain_profile in state.json or spec.md:
- custom -> read spec.md for custom criteria
- software
Domain profiles are provided by domain skill suites (e.g., harness-sdlc-suite). See the installed suite's index skill for available profiles and routing.
- passes field only changed by evaluator evidence
- current_round only incremented by coordinator or session
Every command (/start, /run, /session, /reset, /release) must run these validation steps before proceeding:
1. .harness/: verify directory exists. If not -> "No harness found. Run /harness:start first." STOP.
2. state.json: verify it contains mode, variant, current_sprint_phase. If missing fields -> warn and use defaults.
3. features.json: verify it is valid JSON with at least one feature. If malformed -> STOP with error.
4. config.json: verify it is valid JSON. If missing -> use defaults silently.
The harness uses .harness/config.json as the single configuration source for persistent preferences. State.json holds runtime state (round, phase, errors); config.json holds tunable settings.
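The four validation steps listed above could be sketched roughly as follows. This is a minimal illustration, not the harness's actual implementation: the function name, the `HarnessError` class, the default values in `STATE_DEFAULTS`, and the assumption that features.json is a top-level JSON array are all hypothetical.

```python
import json
from pathlib import Path

# Hypothetical defaults; the real schema is described in references/patterns.md.
STATE_DEFAULTS = {"mode": "supervised", "variant": "A", "current_sprint_phase": "idle"}


class HarnessError(Exception):
    """Raised when a validation step must STOP the command."""


def validate_harness(project_root):
    """Run the validation steps; return (state, features, config, warnings)."""
    harness = Path(project_root) / ".harness"
    if not harness.is_dir():
        raise HarnessError("No harness found. Run /harness:start first.")

    # state.json: warn about missing fields and fall back to defaults.
    warnings = []
    state_path = harness / "state.json"
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    for field, default in STATE_DEFAULTS.items():
        if field not in state:
            warnings.append(f"state.json missing '{field}', using default")
            state[field] = default

    # features.json: must be valid JSON with at least one feature.
    try:
        features = json.loads((harness / "features.json").read_text())
    except (OSError, json.JSONDecodeError) as exc:
        raise HarnessError(f"features.json is missing or malformed: {exc}")
    if not features:
        raise HarnessError("features.json must contain at least one feature")

    # config.json: a missing file silently yields defaults (empty overrides here).
    config_path = harness / "config.json"
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    return state, features, config, warnings
```

A command would call `validate_harness(".")` first and surface any returned warnings before doing other work.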
The initializer creates a default config.json during /start. Users can edit it manually between sessions. Config.json values take precedence over state.json defaults when both define the same field (e.g., context_reset_threshold).
Key fields: use_codex (auto/on/off), context_reset_threshold, auto_commit, auto_retro, retro_interval, max_retry_on_failure, evaluator_strictness (lenient/standard/strict), commit_prefix_pass, commit_prefix_fail, commit_tag. See references/patterns.md for the full schema and field descriptions.
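One way to picture the precedence rule (config.json over state.json, then built-in defaults) is the sketch below. Only the field names, the two defaults of 3, and the commit prefixes and tag come from this document; the remaining default values and the `resolve_setting` helper are assumptions made for illustration.

```python
# Field names follow the key-field list; values marked "assumed" are placeholders.
DEFAULT_CONFIG = {
    "use_codex": "auto",                 # assumed default
    "context_reset_threshold": 3,
    "auto_commit": True,                 # assumed default
    "auto_retro": True,                  # assumed default
    "retro_interval": 3,
    "max_retry_on_failure": 1,           # assumed default
    "evaluator_strictness": "standard",  # assumed default
    "commit_prefix_pass": "feat",
    "commit_prefix_fail": "wip",
    "commit_tag": "[harness]",
}


def resolve_setting(name, config, state):
    """Return a setting, preferring config.json, then state.json, then defaults."""
    if name in config:
        return config[name]
    if name in state:
        return state[name]
    return DEFAULT_CONFIG[name]
```

For example, `resolve_setting("context_reset_threshold", {"context_reset_threshold": 5}, {"context_reset_threshold": 4})` picks the config.json value.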
When the environment supports separate agents or sessions, dispatch explicitly:
- initializer agent to create the operational scaffold.
- planner agent to produce or refine the product spec if the prompt is underspecified.
- generator agent to propose the next bounded sprint and implement it.
- evaluator agent to review the contract, then test + review + grade the implementation.
- coordinator agent in continuous mode to advance rounds automatically until a stop condition is reached.
- releaser agent after all required features pass or when the user requests /release.
Do not collapse these into one agent unless the environment truly cannot separate them. If you are forced to use one agent, state that the run is an approximation and not faithful role separation.
Use these ownership boundaries:
Use role-scoped references so each subagent reads only the context it needs:
The initializer exists to prevent endless re-planning and drifting scope.
Initializer responsibilities:
- .harness/features.json.
- .harness/progress.md.
- .harness/init.md or an equivalent setup artifact that explains how to start and verify the app.
- .harness/state.json when the run will continue automatically.
Do not skip the feature list. It is the main completion ledger.
Every planned app spec should include an Execution strategy section.
That section should declare:
If the run finishes in one sprint, the planner or coordinator should state why sprint decomposition was not load-bearing for this specific app.
If .claude/settings.json has "openai-codex" in extraKnownMarketplaces or "codex@openai-codex": true in enabledPlugins, use the codex CLI for adversarial code review. Fall back to Claude-only review if the CLI is not installed or not authenticated.
For projects with a runtime component, the evaluator MUST read the active domain skill for runtime verification procedures. Build-only verification (e.g., npm run build passing) is NOT sufficient; the evaluator must also:
Runtime verification prevents the class of failures where all features pass build verification while the app crashes on startup.
The coordinator exists to keep the run convergent and auditable.
Coordinator responsibilities:
- .harness/state.json
The coordinator should also write a short rationale whenever it:
The releaser agent manages version bumps, changelog generation, and git tags.
- /release to cut a release checkpoint mid-run.
- release.json (project root), CHANGELOG.md (project root)
- features.json, state.json, summary.md, progress.md
Tracks all releases with version history. Schema documented in references/patterns.md.
Fields: current_version, releases[] (version, date, features_shipped, features_deferred, changelog, sprint_count, previous_version), next_version.
Generated from feature evidence in features.json. Each entry lists shipped features, deferred features, sprint count, and notable changes.
| Condition | Bump | Example |
|---|---|---|
| Only bug fixes or reliability improvements | patch | 0.1.0 -> 0.1.1 |
| At least one new feature shipped | minor | 0.1.0 -> 0.2.0 |
| Breaking changes to existing behavior | major | 0.2.0 -> 1.0.0 |
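The bump rules in the table can be expressed as a small function. This is a sketch under assumptions: the function name and flags are hypothetical, and giving a breaking change priority over a new feature follows common semantic-versioning practice rather than anything stated here.

```python
def bump_version(version, *, breaking=False, new_feature=False):
    """Return the next version string for a release, per the bump table."""
    major, minor, patch = (int(part) for part in version.split("."))
    if breaking:
        return f"{major + 1}.0.0"          # major bump resets minor and patch
    if new_feature:
        return f"{major}.{minor + 1}.0"    # minor bump resets patch
    return f"{major}.{minor}.{patch + 1}"  # bug fixes or reliability only
```

Calling `bump_version("0.1.0", new_feature=True)` reproduces the minor example from the table.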
The releaser creates an annotated git tag for each release: git tag -a vX.Y.Z -m "Release vX.Y.Z".
After creating the release commit and git tag, the releaser syncs the new version into all plugin manifest files:
- .claude-plugin/marketplace.json -- update version in each plugin entry
- plugins/harness/.claude-plugin/plugin.json -- update version
- .codex-plugin/plugin.json -- update version
This prevents version drift between release.json and the plugin descriptors.
Features have a maturity field alongside the binary passes flag. Maturity adds granularity for tracking overall project readiness:
| Level | Meaning | Scoring Trigger |
|---|---|---|
| draft | Initial implementation, known gaps | Any criterion below 3 |
| functional | Core behavior works | All criteria >= 3 (also sets passes = true) |
| reviewed | Passed evaluation with acceptable scores | All criteria >= 3, evaluator accepted |
| polished | Production-ready quality | All criteria >= 4 |
| accepted | Stakeholder sign-off | Set manually by user/stakeholder, not by evaluator |
The evaluator sets maturity automatically based on scores after grading. The accepted level is never set by the evaluator — it requires explicit stakeholder sign-off, which is particularly relevant for domain profiles where stakeholder approval is a core evaluation criterion.
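The scoring triggers could map to maturity levels roughly like this. The helper name and the `evaluator_accepted` flag are hypothetical, and the sketch assumes the evaluator assigns the highest level whose trigger is met. It never returns accepted, since that level is reserved for stakeholder sign-off.

```python
def maturity_from_scores(scores, evaluator_accepted=False):
    """Derive a maturity level from criterion scores on the 0-5 scale."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    lowest = min(scores.values())
    if lowest < 3:
        return "draft"       # any criterion below 3
    if lowest >= 4:
        return "polished"    # all criteria >= 4
    # All criteria >= 3: the level depends on whether the evaluator accepted.
    return "reviewed" if evaluator_accepted else "functional"
```

Under this sketch a feature's passes flag would be true for every returned level except draft.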
The harness supports multiple domains through a profile system. Each domain declares 4 primary evaluation criteria, artifact taxonomy, verification methods, and stakeholder lens.
Domain profiles are provided by domain skill suites (e.g., harness-sdlc-suite). See the installed suite's index skill for available profiles and routing.
The custom profile allows any project to define its own 4 criteria inline in spec.md without requiring a domain skill suite. This makes the core harness fully self-contained for projects that do not fit a predefined domain.
A project can declare a primary profile + optional secondary profile in state.json (secondary_profile field) for cross-domain work. When a secondary profile is active, the evaluator loads both domain skills and scores both sets of criteria (8 criteria total: 4 primary + 4 secondary). The primary profile's criteria determine the pass/fail threshold; the secondary profile's criteria are scored but treated as advisory unless all 4 are below 3, in which case they trigger a warning. This allows, for example, a software project with secondary: ops to track operational readiness alongside code quality without letting ops criteria block feature acceptance during early development.
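A minimal sketch of the primary/secondary split described above, assuming scores arrive as plain dictionaries; the function name and the warning text are hypothetical.

```python
def evaluate_profiles(primary, secondary=None):
    """Return (passed, warnings); only primary criteria can block acceptance."""
    passed = all(score >= 3 for score in primary.values())
    warnings = []
    # Secondary criteria are advisory unless every one of them is below 3.
    if secondary and all(score < 3 for score in secondary.values()):
        warnings.append("all secondary criteria are below 3")
    return passed, warnings
```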
Do not rely on prose-only judgments. Every evaluation round should produce numeric criterion scores plus granular contract-check results.
Primary criteria are determined by the domain profile declared in spec.md. The default software profile uses: product_depth, functionality, visual_design, code_quality.
Score each primary criterion on a 0-5 scale:
- 0: absent, broken, or not meaningfully implemented
- 1: severely incomplete or mostly non-functional
- 2: partially present but below acceptable quality
- 3: acceptable baseline for the sprint goal
- 4: strong implementation with only minor issues
- 5: excellent implementation for the scoped work
Hard rules:
- A primary criterion scoring below 3 fails the round.
You may report an average score as a trend signal, but never use the average to override a failed criterion.
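The rule that the average never overrides a failed criterion might look like the sketch below. It assumes the failing threshold is 3, matching the maturity table, and the function name and returned keys are illustrative only.

```python
from statistics import mean


def grade_round(primary_scores):
    """Fail the round if any criterion is below 3; the average is informational."""
    failed = sorted(name for name, score in primary_scores.items() if score < 3)
    return {
        "decision": "FAIL" if failed else "PASS",
        "failed_criteria": failed,
        "average": round(mean(primary_scores.values()), 2),  # trend signal only
    }
```

With scores of 5, 5, 5, and 2 the average is 4.25, yet the round still fails because one criterion is below the threshold.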
After the domain criteria scoring (0-5 scale) above, every evaluation round applies a binary Authenticity Gate. This gate is a cross-cutting quality check that detects technically-competent-but-generic output -- artifacts that score adequately on domain criteria yet show no evidence of project-specific decision-making.
The gate checks four dimensions. Each dimension is binary pass/fail -- not scored on a 0-5 scale. The gate runs AFTER domain criteria scoring and is independent of those scores.
| Dimension | Definition |
|---|---|
| internal_consistency | All artifacts share consistent conventions -- structure, terminology, and style form a unified whole rather than appearing assembled from different sources. |
| intentionality | Evidence of project-specific decisions tailored to THIS project's context. Artifacts reflect deliberate choices rather than unmodified defaults or generic template output. |
| craft | Technical fundamentals are correct for the artifact type -- consistent structure, clear hierarchy, uniform conventions, and formatting that follows established standards for the deliverable format. |
| fitness_for_purpose | Every deliverable is usable by the target audience without requiring additional explanation. Artifacts serve their stated purpose and can be consumed as-is. |
The authenticity gate operates as a dual-side control:
Both sides reference the same 4 dimensions. The generator's checklist and the evaluator's gate are two views of the same quality standard.
Each sprint contract must define:
The evaluator should score both:
Primary criteria measure overall quality. Checklist items measure whether the sprint actually satisfied the contract.
Every evaluation round should emit all of these:
- .harness/sprints/NN-evaluation.md
- .harness/sprints/NN-evaluation.json
The Markdown artifact is for human review. The JSON artifact is for machine-readable continuity across long runs.
The Markdown artifact must include these sections:
The JSON artifact should include at least:
- decision
- target_feature_ids
- primary_scores
- contract_checks
- blockers
- non_blocking_issues
- feature_evidence
- test_results
- review_findings
Do not mark a feature as passing based only on the Markdown summary if the structured evaluation artifact is missing or inconsistent.
Maintain a machine-readable .harness/features.json with at least:
- id
- title
- required
- priority
- status
- passes
- evidence
- notes
Rules:
At the start of every coding session or reset session:
- .harness/progress.md.
- .harness/features.json.
- .harness/spec.md.
If the session starts by inventing a new roadmap instead of checking the failing feature list, the harness is drifting.
Every sprint must result in a git commit. This prevents work loss and creates a traceable history.
| Outcome | Prefix | Format |
|---|---|---|
| Evaluation PASS | feat | feat(F-XXX): <title> — sprint N [harness] |
| Evaluation FAIL | wip | wip(F-XXX): <title> — sprint N attempt [harness] |
| Implementation pre-eval | wip | wip(F-XXX): implement <title> — sprint N [harness] |
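The three commit formats in the table could be generated as follows. The function name and the outcome labels ("pass", "fail", "pre-eval") are assumptions for illustration; the prefixes, separator, and [harness] tag are taken from the table.

```python
DASH = "\u2014"  # the separator character used in the commit format table


def commit_message(feature_id, title, sprint, outcome):
    """Build a harness commit message for one of the three table outcomes."""
    if outcome == "pass":
        return f"feat({feature_id}): {title} {DASH} sprint {sprint} [harness]"
    if outcome == "fail":
        return f"wip({feature_id}): {title} {DASH} sprint {sprint} attempt [harness]"
    if outcome == "pre-eval":
        return f"wip({feature_id}): implement {title} {DASH} sprint {sprint} [harness]"
    raise ValueError(f"unknown outcome: {outcome!r}")
```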
- git add -A to stage all changes
- [harness] tag identifies automated commits
Choose one mode explicitly:
- supervised: complete one bounded round, surface the result, and wait for the user before advancing
- continuous: keep advancing rounds automatically until all stop conditions are satisfied or a documented blocker pauses the run
Use continuous when the user asks the harness to keep going without manual retriggering.
Use supervised when the user wants tight review between rounds.
Use this by default when the user asks to follow the 2026 app-harness article.
Fallback for context anxiety or environments that cannot sustain continuous sessions. See references/advanced.md for full details.
Remove sprint decomposition when evidence shows it is no longer adding value. See references/advanced.md for full details.
Every role-owned artifact must include a metadata block:
- Role
- Agent
- Inputs
- Status
For review and evaluation artifacts, also include:
- Reviewed by
- Decision
For feature-list updates, include evaluator-backed evidence entries tied to feature IDs.
For evaluation artifacts, also include:
In Variant A, run this loop:
1. .harness/features.json and .harness/progress.md.
2. .harness/state.json in continuous mode.
3. .harness/spec.md if the spec is incomplete.
4. .harness/sprints/NN-contract.md.
5. .harness/sprints/NN-contract-review.md.
6. .harness/sprints/NN-builder-report.md.
7. .harness/sprints/NN-evaluation.md and .harness/sprints/NN-evaluation.json.
8. .harness/features.json.
Do not keep increasing sprint count without reducing the number of failing required features.
Start with:
- .harness/features.json
- .harness/progress.md
- .harness/init.md + .harness/init.sh + .harness/init.bat
- .harness/state.json in continuous mode
- .harness/spec.md
- .harness/sprints/NN-contract.md
- .harness/sprints/NN-contract-review.md
- .harness/sprints/NN-builder-report.md
- .harness/sprints/NN-evaluation.md
- .harness/sprints/NN-evaluation.json
Release artifacts (project root -- persist across .harness/ resets):
- release.json -- created by releaser after all required features pass
- CHANGELOG.md -- generated changelog from feature evidence
Optional supporting artifacts:
- .harness/handoff.md for reset-based runs only
- .harness/summary.md for final wrap-up
- .harness/evaluator-calibration.md when subjective scoring needs tighter anchors
- .harness/decomposition.md when sprint planning needs an auditable rationale outside the main spec
- .harness/cost-log.md for tracking per-sprint cost and duration
Use the shared schemas in references/patterns.md.
A run is complete when one of these is true:
- All required features in .harness/features.json have passes: true.
If none of these conditions are checked, the harness has no definition of done.
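The feature-list completion condition can be checked mechanically. The sketch below assumes features are dictionaries carrying the id, required, and passes fields; both helper names are hypothetical, and treating a list with no required features as not complete is an assumption.

```python
def failing_required(features):
    """Return the ids of required features that do not yet pass."""
    return [f["id"] for f in features if f.get("required") and f.get("passes") is not True]


def required_features_pass(features):
    """True only when at least one required feature exists and all of them pass."""
    has_required = any(f.get("required") for f in features)
    return has_required and not failing_required(features)
```

A coordinator could call `failing_required` at the start of each round to pick the next target instead of inventing new scope.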
In continuous mode, pause and record the reason if any of these happen:
Do not keep looping just because budget remains.
When generator and evaluator evidence conflict:
The evaluator uses a structured rubric system with anchored examples to prevent scoring drift.
A persisted calibration file (.harness/evaluator-calibration.md) is required only when expected_sprint_count > 3. For shorter runs (3 or fewer sprints), the evaluator scores with anchors conceptually without persisting them to a file.
When required: after the first evaluation round, create evaluator-calibration.md with concrete score anchors (descriptions of what 2, 3, 4, and 5 look like) for each of the domain profile's primary criteria. Review and update every 3 rounds.
For each criterion every round, the evaluator must:
- NN-evaluation.md under a "Score Justification" section
After every retro_interval rounds (from config.json, default 3) or after any FAIL evaluation, the coordinator appends a ## Retrospective -- Rounds X-Y section to .harness/progress.md (not a separate file). The section covers: what worked, what didn't, adjustments for next rounds, patterns detected.
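The retrospective trigger might be decided like this. The sketch assumes rounds are numbered from 1 and that "every retro_interval rounds" means round numbers divisible by the interval; the function name is hypothetical.

```python
def retrospective_due(current_round, last_decision, retro_interval=3):
    """Return True when the coordinator should append a retrospective section."""
    if last_decision == "FAIL":
        return True  # any failed evaluation triggers a retrospective
    return current_round > 0 and current_round % retro_interval == 0
```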
The coordinator reads the latest retrospective section in progress.md before starting each new round and incorporates learnings into generator/evaluator dispatch instructions.
Harness assumptions decay with model improvements. Test removal methodically. See references/advanced.md.
Use compaction when context < 60% and quality is stable. Use /reset when context > 75% or model shows context anxiety. See references/advanced.md.
Remove one component at a time and observe impact. Keep the feature list and evaluator-led QA as last line of defense. See references/advanced.md.
If an agent spawn fails (timeout, API error, crash):
- state.json errors array.
- stop_reason and STOP. Never silently continue.
Track rounds_since_reset in state.json. After context_reset_threshold rounds (default: 3), the coordinator pauses the run, writes a handoff file, and resets the counter. The next /session or /run picks up from the handoff automatically.
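The rounds_since_reset counter could be maintained as in this sketch. The function name is hypothetical, and the state is modeled as a plain dictionary rather than the real state.json schema.

```python
def record_round(state, context_reset_threshold=3):
    """Increment rounds_since_reset; return (new_state, reset_needed)."""
    new_state = dict(state)  # leave the caller's state untouched
    new_state["rounds_since_reset"] = new_state.get("rounds_since_reset", 0) + 1
    if new_state["rounds_since_reset"] >= context_reset_threshold:
        # The coordinator would pause and write the handoff file at this point.
        new_state["rounds_since_reset"] = 0
        return new_state, True
    return new_state, False
```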
state.json tracks current_sprint_phase (one of: idle, contract, implementation, evaluation). When a session starts, it checks this field and resumes from the last active phase instead of restarting the sprint.
The coordinator MUST NOT update features.json directly. Only evaluator evidence in NN-evaluation.json feature_evidence may flip pass/fail status. Before advancing to the next round, the coordinator verifies that NN-contract.md, NN-evaluation.md, and NN-evaluation.json all exist.
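The artifact check before advancing a round could be sketched as below. The helper name is hypothetical, and rendering NN as a zero-padded two-digit sprint number is an assumption about the naming scheme.

```python
from pathlib import Path


def missing_round_artifacts(harness_dir, sprint):
    """List the required sprint artifacts that do not exist yet."""
    prefix = f"{sprint:02d}"  # assumed rendering of the NN placeholder
    required = [
        f"{prefix}-contract.md",
        f"{prefix}-evaluation.md",
        f"{prefix}-evaluation.json",
    ]
    sprints = Path(harness_dir) / "sprints"
    return [name for name in required if not (sprints / name).exists()]
```

The coordinator would advance only when this returns an empty list, and would otherwise record the missing files as the reason for pausing.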
state.json includes a cost_tracking object with per-round timestamps for each phase (contract, implementation, evaluation). The coordinator updates these at phase boundaries. An optional cost-log.md artifact provides a human-readable summary.
The initializer generates both init.sh (bash) and init.bat (Windows CMD) so the harness works on any platform. init.sh should detect Windows (MSYS/Git Bash) and adapt accordingly.
Use the review checklist in references/advanced.md to verify whether a run truly followed the harness.
npx claudepluginhub xuzhijie-ownself/harness --plugin harness