Skill

cowork-harness

Test or debug a Claude Code skill/plugin under Claude Cowork's runtime — sandboxed agent, default-deny egress, the can_use_tool permission/question protocol — using the cowork-harness CLI. Use when validating or regression-testing a skill, authoring or debugging a scenario YAML (prompt + scripted answers + assert:), choosing a fidelity tier, scripting AskUserQuestion / tool-permission answers, or asserting artifacts, egress, or sub-agent dispatch. Especially when a harness run no-ops an assertion, fails on an unanswered gate, false-greens, a steered answer never reaches the model, or a web_fetch is unexpectedly denied or gated. NOT for generic unit testing (pytest/vitest of your own scripts) or non-Cowork CI. Covers the skill / run / chat / record / replay / trace / decide / assert / scaffold commands and the session-vs-scenario split.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/cowork-harness:cowork-harness

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

This skill teaches you to drive the **`cowork-harness` CLI** — a fixture that runs a Claude Code

SKILL.md

234 lines · ~4k tokens

Stats

LanguageTypeScript

Parent stars0

MaintenanceExcellent

Last CommitJun 18, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

cowork-harness

This skill teaches you to drive the cowork-harness CLI — a fixture that runs a Claude Code skill the way Claude Cowork runs it (sandboxed agent, default-deny egress, the permission / AskUserQuestion control protocol). It is not the CLI itself: you still invoke cowork-harness … in the shell; this skill tells you how to author scenarios, pick a fidelity tier, choose an answer path, place assertions in the right CI lane, and avoid the harness's "✓ passed ≠ actually correct" traps.

The single most important idea: a green run is not automatically a correct run. The harness has several ways to silently no-op a check (skip an assertion on replay, auto-answer a gate, observe an empty egress allowlist). This skill exists mostly to keep you out of those traps — the Gotchas section below is the highest-value part. Read it.

Version note: the facts and file:line pointers here track cowork-harness 0.5.0 (baseline desktop-1.13576.1). If your checkout is newer, prefer the live --help, SPEC.md, and docs/*.md over this snapshot, and re-run the bundled linter.

0. Preflight — make sure the harness can actually run

Before the first command, confirm the CLI is reachable and fail loud (never fake a pass) when a tier's dependencies are missing:

CLI on PATH, recent enough? Run cowork-harness --version — this skill needs ≥ 0.5.0 (the commands/assertions it teaches: assert --list, scaffold, trace --dispatches, artifact_json incl. the in: operator, verify-cassettes, batch record <dir>/--rerecord-stale, record-time redaction, multiSelect/answer:, verify-run, record --max-artifact-bytes, verify-cassettes --allow-domain/--allow-email/--allow-file, and scenario skills: staleness scoping). If it's missing or older than 0.5.0, prefix every command with npx using a version floor: npx cowork-harness@>=0.5.0 <cmd> (Node ≥ 20). The floor matters — plain @latest would silently fetch an older CLI and the new commands would fail as "unknown command"; @>=0.5.0 instead fails loud if no compatible version is published. To install once instead: npm i -g cowork-harness@latest.
Agent binary (every tier). The staged Claude Code agent is bind-mounted from a local Claude Desktop install, or point COWORK_AGENT_BINARY at a claude-code-vm/<ver>/claude ELF. Nothing is bundled. No agent → no run; report that, don't skip silently.
Docker / Lima. Only --fidelity protocol (L0) runs without them. container / microvm / hostloop / cowork need Docker (Lima for L2). If they're absent, drop to --fidelity protocol and say so — a green that never exercised the sandbox is not a sandbox pass.
Auth. CLAUDE_CODE_OAUTH_TOKEN (preferred) or ANTHROPIC_API_KEY, via env or .env.

1. Pick the loop

"Is it even alive?" (inner loop) → cowork-harness skill <folder> "<prompt>". Fastest; no scenario file.
Repeatable, asserted regression → author a scenarios/*.yaml and run cowork-harness run. This is the CI-grade path and most of this skill.
Multi-turn debugging → cowork-harness chat (interactive; gates answered at the TTY, not an asserted test — see Debugging with chat in docs/scenario.md).

Full command set: skill · run · chat · record · replay · verify-cassettes · verify-run · trace · decide · gates · answer · scaffold · assert · sync · list · boundary-check · vm <init|status|delete|prune>. Always check cowork-harness <cmd> --help.

2. Two files: session vs scenario

sessions/*.yaml — pre-prompt setup: model, mounts (folders), and discovery (marketplaces / plugins / skills / mcp). One session is reused by many scenarios. A scenario that omits session: gets an all-defaults inline session (not a file on disk).
scenarios/*.yaml — the test: prompt, scripted answers:, and assert:.

This split matters: release ground truth (baseline: / baselines/, produced by sync) is separate from authored setup (session: / sessions/). "profile" is retired vocabulary — do not use it. See references/scenario-schema.md for every field.

3. Discovery: how the skill-under-test gets mounted

The skill is copied fresh into the sandbox each run. Wire it via plugins.local_plugins + plugins.enabled: [<plugin>@local] in the session (or --marketplace / --plugin flags on skill). A missing mount source is now a hard error (mount source(s) not found …); set COWORK_HARNESS_SOFT_MISSING=1 to fall back to warn-and-exclude. A folders[].to containing / or .. is rejected. See references/scenario-schema.md.

4. Choose a fidelity tier

Tier	What it gives you	Use when
`protocol`	Fastest; no sandbox, no egress	Pure protocol/answer-shape tests. Rejected if the scenario asserts egress.
`container`	Real sandbox + real default-deny egress (default)	Most functional + boundary tests.
`microvm`	VM-grade escape isolation (macOS arm64). Egress transport is the same allowlist proxy as `container` — not better network fidelity	Testing untrusted code escape, not network behavior.
`hostloop` / `cowork`	Production split-exec (host runs the loop, guest runs tools)	Highest-fidelity / parity runs.

Set the tier in the scenario's fidelity: field, not a flag — run rejects --fidelity (it's a skill-only flag). See references/fidelity-and-answers.md.

5. Choose an answer path (gates: AskUserQuestion + tool-permission)

Default to deterministic: scripted answers: + on_unanswered: fail. Anything that brings a live model into answering flags the run nonDeterministic — keep those out of deterministic regressions.

Path	How	Deterministic?
Scripted	`answers:` rules + `on_unanswered: fail`	✅ (the CI/agent default)
LLM decider	`on_unanswered: llm` (YAML) or `--decider-llm` (CLI)	❌ flags nonDeterministic
Spawned helper	`--decider-cmd '<helper>'`	depends on helper
In-band (driving agent)	`--decider-dir <dir>` (+ a Monitor)	depends

Exact accepted values (teach precisely): --on-unanswered takes fail|prompt|first on skill, only fail|first on run. llm is NOT an --on-unanswered value — the bare flag --on-unanswered llm is rejected (use --decider-llm); the YAML spelling is on_unanswered: llm. The word agent is retired — do not write on_unanswered: agent (the schema rejects it). --on-unanswered first is itself flagged nonDeterministic — it is not a deterministic stand-in for scripted answers. See references/fidelity-and-answers.md.

6. Assertions: two orthogonal axes

Conflating these is the biggest landmine. An assertion key has two independent properties:

Axis A — robust to LLM phrasing drift? Structural/boundary keys (subagent_dispatched, egress_*, file_exists, user_visible_artifact, result) are robust. Free-text content is not: match prose with transcript_matches / transcript_contains (stable lexical markers only — not semantic content the model paraphrases, which re-records red); check structured JSON with YAML artifact_json (or the pytest lane for complex predicates), not via a transcript substring.
Axis B — survives replay? Independent of Axis A. On the token-free replay lane, only content keys evaluate; filesystem / egress keys are silently skipped (live-only). A key being "robust" says nothing about whether it runs on your replay gate.

Getting Axis B wrong means a check that silently does nothing in CI. The harness now warns loudly when it skips, and the bundled linter catches it before you push — run it (§9).

See references/scenario-schema.md for the full assertion catalog with each key's replay class.

7. web_fetch (fail-closed, two-path)

web_fetch behaves unlike curl. A URL is gated by provenance, not the egress allowlist:

A URL is provenanced iff it appeared in the prompt or a prior web_fetch result. To make a fetch succeed, put the URL in the prompt.
Provenanced → fetches (still SSRF-guarded per redirect hop); the egress hostname allowlist is not consulted.
Not provenanced → raises a per-domain approval gate (webfetch:<domain>) that is fail-closed (it is not auto-allowed; --on-unanswered first won't allow it). Answer it with a scripted rule (when_tool: "webfetch:<domain>" + grant: domain|once), a session web_fetch.approved_domains, or a live decider.

Surprise to remember: adding a host to egress.extra_allow is a no-op for a provenanced fetch. Full model in references/scenario-schema.md.

8. Run, then lock determinism

Read the verdict and the inline failing transcript. To pin a flaky-because-stochastic gate, paste the echoed --answer "<q>=<choice>" footer lines back into the scenario's answers: for a deterministic re-run. Use cowork-harness trace <id> to digest a run. If only an assertion is wrong (the run itself was fine), cowork-harness verify-run <run-dir> <scenario.yaml> re-checks the assert: block against a kept run dir (--keep, or a --session-id run) with no live re-record — tokens-free, ~1s per iteration.

9. Scaffold a valid scenario, lint before you push, then place in CI

Don't hand-write the YAML from memory — that's how invented keys (assertions: vs assert:, json_file, answer_policy) creep in. Start from the bundled generator, which emits the known-good skeleton (right tier, scripted answers: + on_unanswered: fail, content assertions separated from live-only ones, one concern per item) and self-lints its own output:

S="${CLAUDE_PLUGIN_ROOT}/skills/cowork-harness/scripts/scenario.py"
# Working from a repo checkout instead of an installed plugin?
# S=".claude/skills/cowork-harness/scripts/scenario.py"
python3 "$S" scaffold --name report-check --skill ./skills/report-gen \
  --prompt "Generate the weekly report to outputs/report.md." \
  --content 'weekly report' --artifact outputs/report.md \
  --egress-allowed api.weather.example.com --out scenarios/report-check.yaml

Then lint every scenario — it encodes the no-silent-false-green invariants. Use the CLI wrapper cowork-harness lint (it runs the same bundled scenario.py lint):

cowork-harness lint scenarios/*.yaml

lint flags: filesystem/egress-only assertions on a replay gate (silent no-op), bad regex quoting, an egress assert on protocol fidelity, a controlOut-gated key on a non-controlOut replay, mixed-class assertion items, and hallucinated schema (assertions: vs assert:, unknown keys). Exit code is non-zero on errors (CI-friendly). scaffold auto-upgrades the tier if you ask for egress on protocol, so it never emits a scenario lint would reject.

CI placement: a token-free replay PR gate (content/structure only) + a nightly live run (filesystem/egress). See references/ci-recipe.md for the four-stage pipeline.

Gotchas — the "✓ passed ≠ correct" landmines

Stated as symptom → why → fix. The full catalog (with file:line) is in the references; these are the ones that bite hardest.

An assertion passed but tested nothing on the PR gate. Why: on a manifest-less cassette replay skips filesystem/egress keys (file_exists, user_visible_artifact, artifact_json, egress_*, no_delete_in_outputs, self_heal_ran, transcript_no_host_path); a mixed item like {result, egress_denied} greens on result while its egress_denied half is dropped. (record snapshots an artifacts manifest, which makes file_exists/user_visible_artifact/artifact_json replay-checkable — but the live-only egress keys stay skipped.) Fix: put egress/live-only checks on a live gate; keep one concern per assert: item; run the linter. The harness warns loudly on skip.
A steered gate answer never reached the model. Why: serializeDecision must emit updatedInput: { questions, answers }; a header-only gate (empty question) can never be keyed. Fix: give every gate a non-empty question. (multiSelect gates ARE supported — answer with a choose: list; free-text "Other" via answer:.) question_asked / questions_count_max / gate_answers_delivered only evaluate on replay with a controlOut cassette — re-record an old cassette or they're excluded (loudly), not vacuously passed. gate_answers_delivered fails on unobserved delivery (absence of evidence is failure, not neutral).
A multi-key assert: item is an AND. A single list item with more than one key passes iff every key passes. Fix: one concern per item unless you genuinely mean conjunction (and a mixed-class conjunction still loses its filesystem half on replay — see gotcha 1).
tool_called doesn't mean "attempted". Tool counts are authoritative and de-duped: a tool that was requested then denied does not register as called. Fix: don't assert tool_called to prove an attempt; it proves the tool actually ran.
subagent_declared_but_unused fires on declared-but-didn't-use-THAT-tool, even if the sub-agent used other tools. And subagent_dispatched matches by dispatch description too — skills often dispatch with no subagent_type (agentType:"unknown"), so match the description.
dispatch_count_max asserts but does NOT enforce. The harness records the count; it does not reproduce Cowork's skip-on-cap ({perTask:1, global:3}, deferred). Passing means "happened to dispatch ≤N this run," not "the harness capped it." Don't read enforcement into the assert.
protocol is rejected (not silently passed) if the scenario asserts egress — boundary assertions need a sandboxed tier (container+). Good: this one fails loud by design.
Read-only mounts are enforced; delete-deny is not. mode:r mounts get a real :ro bind (a write fails in-guest). But rw vs rwd (write-but-no-delete on outputs/ / .projects/) is not mount-enforced — rm succeeds and is only caught post-hoc by no_delete_in_outputs.
Keep .env out of any mounted folder — it is copied into the sandbox and the token could leak. Put it at a working-dir or install root (token resolution: env > --dotenv > ./.env > install .env).

For the complete gotcha list, the assertion catalog, the YAML schema, the fidelity/answer tables, and the CI recipe, read the files in references/.

References

references/scenario-schema.md — scenario/session YAML schema, full assertion catalog (with each key's replay class), the web_fetch model, and the complete gotcha list.
references/fidelity-and-answers.md — fidelity tiers, answer paths, the determinism contract.
references/ci-recipe.md — replay-vs-live lane split and the four-stage GitHub Actions pipeline.
scripts/scenario.py — scaffold a valid scenario skeleton and lint scenarios for the no-silent-false-green invariants (both usable as CI steps).

cowork-harness

Invocation

Context Preview

SKILL.md

cowork-harness

Invocation

Context Preview

SKILL.md

cowork-harness

0. Preflight — make sure the harness can actually run

1. Pick the loop

2. Two files: session vs scenario

3. Discovery: how the skill-under-test gets mounted

4. Choose a fidelity tier

5. Choose an answer path (gates: AskUserQuestion + tool-permission)

6. Assertions: two orthogonal axes

7. web_fetch (fail-closed, two-path)

8. Run, then lock determinism

9. Scaffold a valid scenario, lint before you push, then place in CI

Gotchas — the "✓ passed ≠ correct" landmines

References

Similar Skills

cowork-harness

0. Preflight — make sure the harness can actually run

1. Pick the loop

2. Two files: session vs scenario

3. Discovery: how the skill-under-test gets mounted

4. Choose a fidelity tier

5. Choose an answer path (gates: AskUserQuestion + tool-permission)

6. Assertions: two orthogonal axes

7. web_fetch (fail-closed, two-path)

8. Run, then lock determinism

9. Scaffold a valid scenario, lint before you push, then place in CI

Gotchas — the "✓ passed ≠ correct" landmines

References

Similar Skills