Skill

harness-engineering

Analyzes recurring failures to identify gaps in instructions, skills, tests, CI gates, and lint rules, then produces harness-level fixes.

developer-tools

automation

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/claude-commands:harness-engineering

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

When a mistake or failure pattern is identified, analyze whether the root cause is a gap in the **harness** (instructions, skills, tests, guardrails, automation) rather than just the code. Produce a concrete fix at the harness level so the same class of mistake cannot recur without human intervention.

SKILL.md

209 lines · ~3.1k tokens

Stats

LanguagePython

Stars27

Forks4

MaintenanceExcellent

Last CommitJun 2, 2026

Actions

View Source View Plugin View on GitHub View README

Harness Engineering Skill

Purpose

When a mistake or failure pattern is identified, analyze whether the root cause is a gap in the harness (instructions, skills, tests, guardrails, automation) rather than just the code. Produce a concrete fix at the harness level so the same class of mistake cannot recur without human intervention.

Command: ~/.claude/commands/harness.md

Scope: user vs repository

User scope (general): ~/.claude/commands/harness.md and this skill apply to any repo unless a project overrides them.
Repository overlay: Some projects ship .claude/commands/harness.md and/or .claude/skills/<name>/SKILL.md that extend user-scope rules (for example gateway operations). When both exist, read the repo-local file for project-specific failure modes. Collision: workspace-local .claude/commands/ overrides the same-named global command in that workspace.
jleechanclaw / ~/.openclaw: Use the openclaw-harness skill in that repo for gateway, canary, deploy, and lane-backlog triage. Tracked user-scope copies for drift control live under docs/harness/ in jleechanclaw.

Harness Layers (ordered by durability)

Layer	Files	What it prevents
Instructions	`CLAUDE.md`, `AGENTS.md` (global + repo)	Wrong approach, wrong assumptions, wrong defaults
Skills	`~/.claude/skills/.md`, `~/.claude/commands/.md`	Repeated manual workflows, forgotten validation steps
Memory	`~/.claude/projects//memory/.md`	Forgetting user preferences, past corrections, project context
Integration tests	`tests/test_integration_.py`, `tests/test__test.py`	Regressions in real behavior
CI gates	`.github/workflows/*.yml`, pre-commit hooks	Merging broken code, mislabeled artifacts
Lint/validation rules	`.pre-commit-config.yaml`, custom linters	Style drift, naming violations, structural problems

Analysis Protocol

When invoked, execute this sequence in full, every time:

Step 1: Identify the failure class

Classify what went wrong:

Mislabeled artifact — something was called X but didn't meet X's criteria (e.g., "E2E test" that isn't E2E)
Wrong approach — took an approach the user has previously corrected
Missing validation — produced output without checking it meets requirements
Repeated manual fix — user had to manually correct the same type of issue more than once
Silent degradation — something broke but nothing flagged it (includes: harness layer present but broken)
Launchd env-isolation — a process moved from interactive shell to launchd without propagating required env vars; .bashrc-sourced secrets silently disappear; the process appears alive but all API calls fail
Knowledge gap — didn't know about a constraint, convention, or tool
LLM path error — the agent reasoned toward a wrong solution despite having sufficient context

Step 2: 5 Whys — the technical problem

Ask "Why?" five times about the technical failure, drilling into root cause:

Why 1: Why did the observable failure happen?
Why 2: Why did the mechanism that caused Why 1 exist?
Why 3: Why wasn't that mechanism caught or prevented?
Why 4: Why wasn't there a guardrail at that level?
Why 5: Why was the system designed without that guardrail?
→ Root cause: <single sentence>

Stop earlier if you hit bedrock. Each answer should be more specific than the last.

Step 3: 5 Whys — the agent path

Ask "Why?" five times about why the LLM (Claude Code or any coding agent) went down the wrong path. This is mandatory. Every harness failure has two dimensions: the technical problem AND the agent reasoning failure that let it slip through.

Why 1: Why did the agent not catch/prevent the failure?
Why 2: Why did the agent reason or act that way?
Why 3: Why didn't the agent's instructions prevent that reasoning?
Why 4: Why wasn't there a skill, memory, or rule that would have redirected the agent?
Why 5: Why was the harness incomplete for this class of agent behavior?
→ Agent root cause: <single sentence>

Key questions to drive this:

Did the agent trust existing code without verifying it worked correctly?
Did the agent describe the problem correctly but at the wrong level of abstraction (high-level summary instead of exact code trace)?
Did the agent assume "present = working" when it should have verified?
Did the agent skip verification because the skill/instruction didn't mandate it?
Was there a confirmation bias: the agent found what looked like the expected pattern and stopped searching?
Did I verify this pattern/regex/heuristic works at every call site, not just the obvious one? (e.g., a regex added to heuristic_decision() but not is_approval_candidate() will silently fail — is_approval_candidate() is the gate that decides whether heuristic_decision() ever runs)
Did I verify the shell command before encoding it in instructions? Before writing any gh api, jq, or CLI command into CLAUDE.md / skills / memory, run it against a real target (e.g., a real PR) and confirm the output is non-empty and correct. Do NOT encode a command pattern until you have verified it returns the expected output for the actual API response shape. Failure mode: agent writes .state == "FAILURE" (wrong field for GitHub Actions) when it should be .conclusion == "FAILURE" — command returns 0 failures silently.

Step 4: Find the harness gap

For each failure class, check which harness layers are missing or insufficient:

Read existing instructions — ~/.claude/CLAUDE.md, repo CLAUDE.md, ~/.codex/AGENTS.md
- Is the rule already documented? If yes → it's an adherence problem, add a stronger enforcement instruction
- If no → add the rule
Check for existing skills — ~/.claude/skills/, ~/.claude/commands/
- Is there a skill that should have caught this? If yes → update it
- If no and the pattern is repeatable → create a skill
Check memory — ~/.claude/projects/*/memory/
- Was this corrected before? If yes → the memory wasn't sufficient; strengthen it
- If no → save feedback memory
Check tests — are there tests that would catch this regression?
- If no → propose an integration test
Check CI — would CI have caught this before merge?
- If no → propose a CI gate or pre-commit hook

Critical check — harness layer present but broken: For each harness layer that exists, verify it actually works, not just that it exists. A broken guardrail is as bad as a missing one and is harder to detect.

Step 5: Propose the fix

Output a concrete action plan:

FAILURE CLASS: <classification>

5 WHYS — TECHNICAL:
1. <why>
2. <why>
3. <why>
4. <why>
5. <why>
→ Root cause: <sentence>

5 WHYS — AGENT PATH:
1. <why>
2. <why>
3. <why>
4. <why>
5. <why>
→ Agent root cause: <sentence>

HARNESS FIXES (in order of priority):
1. [LAYER] FILE: <path> — <what to add/change>
2. [LAYER] FILE: <path> — <what to add/change>
...

VERIFICATION: <how to confirm the fix prevents recurrence>

Step 6: Implement

After user approval (or if invoked with --fix):

Apply all harness fixes
Run verification
Report what was changed

Decision Rules

If the same correction has been given twice: This is a mandatory harness fix. No exceptions.
If the fix is a one-liner in code but the pattern could recur: Harness fix first, code fix second.
If unsure whether it's a harness gap or a one-off: Ask the user. Don't assume one-off.
Never add instructions that duplicate what's already documented: Check first.
Prefer the most durable layer: Instructions > Skills > Memory > Tests > CI
5 Whys are mandatory: Never skip them. Short-circuit analysis produces shallow fixes.
Agent path is mandatory: Never analyze only the technical dimension. Always ask why the agent failed too.

Example failure pattern: CLI redacts secrets but scripts still export them

Observable: Gateway returns unauthorized / embedded fallback; logs show token value __OPENCLAW_REDACTED__ or similar.

Failure class: Silent degradation (harness script looked correct: “get token, export, run”) plus knowledge gap (CLI intentionally does not echo real secrets).

Technical 5 Whys (compressed): Export used openclaw config get … → CLI prints redacted sentinel → shell passed sentinel to gateway → auth failed → embedded fallback.

Agent 5 Whys (compressed): Pattern “export token then call CLI” is common in docs → agent did not verify the value was non-redacted → no instruction forbidding export of config get for secrets → recurrence.

Harness fixes (typical):

Instructions — ~/.claude/CLAUDE.md: never export gateway tokens from config get when output can be redacted; unset override env vars; read JSON or use provider-specific keys (MINIMAX_API_KEY) as documented.
Commands — ~/.claude/commands/claw.md (or repo overlay): validate token with a file read / non-redacted check; document the redaction trap explicitly.
Memory — one-line feedback memory so project agents see it in MEMORY.md.

Verification: Run the command path and confirm the exported string is not a known redaction sentinel and gateway returns status: ok without “falling back to embedded”.

Example failure pattern: Green Gate job success ≠ Gate 7 pass (silent CI success)

Observable: PR shows Green Gate completed/success in gh pr checks but skeptic VERDICT was never posted. Agent reports PR as 7-green.

Failure class: Silent CI success — a workflow can exit 0 at the job level while an inner step fails. The harness layer (Green Gate) is present and running, but it reports the wrong thing at the job level.

Technical 5 Whys:

Why did the PR show Green Gate completed/success even though Gate 7 was failing?
Why does the workflow exit 0 at job level even when the VERDICT check step exits 1?
Why is the VERDICT check inside a step rather than as a separate job with its own status?
Why does the polling step swallow the step failure without propagating it to job status?
Why is there no CI gate that independently verifies VERDICT existence as a separate job? → Root cause: The Green Gate conflates “trigger posted” with “skeptic evaluated”; the job always exits 0 because the trigger step succeeds, and the VERDICT step failure is swallowed inside the job.

Agent 5 Whys:

Why did the agent trust gh pr checks output to determine Gate 7 status?
Why didn't the agent read the workflow log to verify gate outcomes?
Why was there no instruction telling the agent that gh pr checks is insufficient for Gate 7?
Why did the existing 7-green definition not include a verification procedure for Gate 7?
Why was the skill written as policy without a mandatory verification step? → Agent root cause: The harness defined 7-green as policy without requiring independent VERDICT verification, so agents trusted the workflow's job-level status.

Harness fixes:

Skill — pr-green-definition.md: Add mandatory REST API VERDICT check procedure (not just policy statement)
Skill — pr-green-definition.md: Add counter-example showing Green Gate completed/success while Gate 7 fails
Command — /green.md: Already exists but relied on workflow logs only; updated to mandate REST check for Gate 7
Memory — feedback_2026_04_21_silent_ci_success.md: Document the specific PRs where this was observed

Verification: For any PR where Green Gate shows completed/success, independently run the REST API VERDICT check and confirm the VERDICT comment exists before reporting 7-green.

Anti-patterns

Adding a memory entry when the fix should be an instruction (memory is per-project; instructions are global)
Writing a skill for a one-time operation
Adding an integration test without also fixing the instruction that led to the bug
Proposing CI gates for things that should be caught at the instruction level
Over-engineering: adding 5 harness layers when 1 instruction would suffice
Skipping the agent 5 Whys because "the technical fix is obvious"
Assuming a harness layer works because it exists — verify it

harness-engineering

Popularity

Invocation

Context Preview

SKILL.md

harness-engineering

Popularity

Invocation

Context Preview

SKILL.md

Harness Engineering Skill

Purpose

Scope: user vs repository

Harness Layers (ordered by durability)

Analysis Protocol

Step 1: Identify the failure class

Step 2: 5 Whys — the technical problem

Step 3: 5 Whys — the agent path

Step 4: Find the harness gap

Step 5: Propose the fix

Step 6: Implement

Decision Rules

Example failure pattern: CLI redacts secrets but scripts still export them

Example failure pattern: Green Gate job success ≠ Gate 7 pass (silent CI success)

Anti-patterns

Similar Skills

Harness Engineering Skill

Purpose

Scope: user vs repository

Harness Layers (ordered by durability)

Analysis Protocol

Step 1: Identify the failure class

Step 2: 5 Whys — the technical problem

Step 3: 5 Whys — the agent path

Step 4: Find the harness gap

Step 5: Propose the fix

Step 6: Implement

Decision Rules

Example failure pattern: CLI redacts secrets but scripts still export them

Example failure pattern: Green Gate job success ≠ Gate 7 pass (silent CI success)

Anti-patterns

Similar Skills