Skill

harness-audit

Audit an AI agent's harness for production readiness by reading its actual code, configs, and run traces rather than a questionnaire. Scores 44 runtime controls (grouped into the eight ContextOS harness properties) as Pass/Partial/Fail with file:line evidence, assigns a maturity band, and emits a prioritized fix queue. Use when someone asks to audit, assess, or production-readiness-check an AI agent or agent harness (LangGraph, OpenAI Agents SDK, ADK, CrewAI, or custom), or asks "is my agent safe to ship / production-ready / governed".

Invocation

Slash command: /harness-audit:harness-audit

Supporting Files

reference/checklist.md
reference/scorecard-template.md
scripts/prescan.mjs

SKILL.md

161 lines · ~2.2k tokens

Stats

Language: JavaScript
Stars: 1
Maintenance: Excellent
Last commit: Jun 9, 2026

Agent Harness Audit

You are running a production readiness audit on an AI agent's harness. The governing rule is the one thing that makes this audit worth anything:

No artifact, no pass. Judge the runtime record, not the architecture diagram or the README. A control that is only described — in prose, in a design doc, in the system prompt — is a Fail. A harness is evidence, not confidence. And prefer evidence from channels the agent cannot author: a harness-emitted tool-call / resource-access / message record outranks anything the model self-reports, because a self-report is gameable.

Your job is to find the evidence in the target repository (and run traces, if available), score each control against that evidence, and hand back an action plan the team can execute. You are not here to be reassured; you are here to find what will break in production.

Inputs

Target: the path to the agent's repository (default: the current working directory). If the user names a different path, audit that.
Run traces (optional): paths or a directory of real run logs/traces. If present, inspect at least two — one ordinary successful run and one boundary run that hit a policy denial, tool error, approval gate, evaluator failure, escalation, rollback, or fallback. The happy path shows what the harness does when nothing is stressed; the boundary path shows whether the harness exists when it matters.
Declared maturity band (optional): ask the user, or infer and state your assumption. The band determines which failures block launch (see Maturity).
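
One way to enforce the two-trace minimum is to classify each run by the events it contains before opening it. The following is a minimal sketch, assuming each trace has already been parsed into an object with a `status` string and an `events` array whose entries carry a `type`; the trace shape and the event type names are hypothetical and need to be mapped to whatever the target harness actually emits.

```javascript
// Hypothetical event types that mark a boundary run.
const BOUNDARY_EVENTS = new Set([
  "policy_denial", "tool_error", "approval_gate", "evaluator_failure",
  "escalation", "rollback", "fallback",
]);

// True if any event in the trace is a boundary event.
function isBoundaryRun(trace) {
  return trace.events.some((e) => BOUNDARY_EVENTS.has(e.type));
}

// Picks one ordinary successful run and one boundary run. A missing kind is
// returned as null so the gap can be reported instead of hidden.
function pickAuditTraces(traces) {
  const ordinary =
    traces.find((t) => t.status === "success" && !isBoundaryRun(t)) ?? null;
  const boundary = traces.find(isBoundaryRun) ?? null;
  return { ordinary, boundary };
}
```

If `boundary` comes back null, only the happy path is available, and the trace-dependent controls should be flagged for a runtime re-audit.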

Protocol

Work in this order. Do not skip the prescan, and do not score a control Pass without a concrete artifact reference.

1. Orient (≈2 min)

Identify the substrate before judging it. Read package.json / pyproject.toml / lockfiles / README to determine the framework (LangGraph, OpenAI Agents SDK, Google ADK, Semantic Kernel, CrewAI, Mastra, or custom), the language, and where the agent's entry point, tools, prompts, policies, evals, and telemetry live. State what you found in one short paragraph.
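
Framework identification can start from the dependency list. This is a rough sketch, not part of the skill: the package names in the table are assumptions to verify against the target's manifest and lockfile, and a match only suggests where to read next.

```javascript
// Dependency name -> framework label. Package names are assumptions.
const FRAMEWORK_HINTS = [
  ["@langchain/langgraph", "LangGraph"],
  ["langgraph", "LangGraph"],
  ["@openai/agents", "OpenAI Agents SDK"],
  ["openai-agents", "OpenAI Agents SDK"],
  ["google-adk", "Google ADK"],
  ["semantic-kernel", "Semantic Kernel"],
  ["crewai", "CrewAI"],
  ["@mastra/core", "Mastra"],
];

// Takes a flat list of dependency names (from package.json or pyproject.toml)
// and returns the matching framework labels, or ["custom"] if none match.
function detectFrameworks(dependencyNames) {
  const deps = new Set(dependencyNames.map((d) => d.toLowerCase()));
  const found = new Set();
  for (const [pkg, label] of FRAMEWORK_HINTS) {
    if (deps.has(pkg)) found.add(label);
  }
  return found.size > 0 ? [...found] : ["custom"];
}

// Example with a parsed package.json (contents are illustrative).
const examplePkg = { dependencies: { "@langchain/langgraph": "*", zod: "*" } };
console.log(detectFrameworks(Object.keys(examplePkg.dependencies)));
```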

Also determine whether this is single-agent or multi-agent (handoffs, sub-agents, planner/worker roles, a graph with multiple agent nodes). If multi-agent, controls #42 (outbound disclosure) and #43 (communication policy) are in scope and the inter-agent message channel is a first-class audit surface — coordination expands the risk surface. If single-agent, mark #43 N/A and say so.
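
The scoping rule can be written down as a small lookup so the N/A decision is explicit in the scorecard. This sketch encodes only what the text states, that #43 becomes N/A for a single-agent system; leaving #42 in scope for single-agent targets is an assumption, since the text does not say.

```javascript
// Controls that apply only when more than one agent is coordinating.
const MULTI_AGENT_ONLY = new Set([43]);

// Returns "in-scope" or "N/A" for a control number given the topology.
function controlScope(controlNumber, isMultiAgent) {
  if (!isMultiAgent && MULTI_AGENT_ONLY.has(controlNumber)) return "N/A";
  return "in-scope";
}
```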

2. Prescan for evidence (≈3 min)

Run the deterministic prescan to get a fast map of candidate evidence:

node "$SKILL_DIR/scripts/prescan.mjs" <target-path>

(If Node is unavailable or the script errors, fall back to rg/grep manually using the search hints in the checklist.) The prescan only locates candidate evidence — it never decides Pass/Fail. You must open each hit and verify it actually enforces the control at the right boundary. A keyword match is not a control.
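
To make the "locate, never decide" rule concrete, here is a minimal sketch of a keyword scan over file contents already loaded into memory. It is not the implementation of scripts/prescan.mjs, and the two search hints are hypothetical stand-ins for the ones in reference/checklist.md.

```javascript
// Hypothetical search hints; the real ones live in the checklist.
const HINTS = [
  { control: "approval gate", pattern: /approv|interrupt|human[_-]?in[_-]?the[_-]?loop/i },
  { control: "policy engine", pattern: /\bopa\b|\brego\b|policy/i },
];

// Scans files given as { path, text } and returns candidate hits as path:line.
// Every hit is marked unverified: locating evidence is not scoring it.
function findCandidates(files, hints = HINTS) {
  const hits = [];
  for (const { path, text } of files) {
    text.split(/\r?\n/).forEach((line, i) => {
      for (const { control, pattern } of hints) {
        if (pattern.test(line)) {
          hits.push({ control, ref: `${path}:${i + 1}`, verified: false });
        }
      }
    });
  }
  return hits;
}
```

Each returned `ref` still has to be opened and read; a hit only becomes evidence once the code at that line is confirmed to enforce the control at the right boundary.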

3. Score the 44 controls

Open reference/checklist.md. It lists every control with: the audit question, minimum pass evidence, the immediate fail signal, severity, where to look, and the ContextOS plane/doc that owns the remediation. For each control:

Pass — a script, trace, manifest, config, test, or record proves the control for a real path. Cite the artifact as path:line.
Partial — the control exists but coverage is incomplete, manual, delayed, undocumented, or enforced at the wrong boundary. Cite what exists and what's missing.
Fail — absent, unenforced, unverifiable, or prose-only. Say where you looked.

Apply the five-minute rule: if you cannot find the evidence within ~5 minutes of searching, score it Fail. A control that can't be found under pressure won't protect the system under pressure. Note it as "not found within budget" rather than asserting it doesn't exist — but it still scores Fail.

Severity (P0/P1/P2) comes from the checklist and is independent of the pass state — a P0 control that is Partial is still a launch blocker.
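
The scoring rules above can be checked mechanically before a scorecard is emitted. This is a sketch under assumed field names (`state`, `severity`, `evidence`, `missing`, `searched`), none of which are defined by the skill, and it uses a simplified path:line pattern.

```javascript
// Simplified path:line shape, for example "src/policy.js:42".
const PATH_LINE = /^[^\s:]+:\d+$/;

const hasArtifact = (refs) => (refs ?? []).some((r) => PATH_LINE.test(r));

// Returns a list of problems with one score record; empty means well-formed.
function validateScore(score) {
  const errors = [];
  if (!["Pass", "Partial", "Fail", "N/A"].includes(score.state)) {
    errors.push("state must be Pass, Partial, Fail, or N/A");
  }
  // Severity is validated on its own: it never depends on the state.
  if (!["P0", "P1", "P2"].includes(score.severity)) {
    errors.push("severity must be P0, P1, or P2");
  }
  if (score.state === "Pass" && !hasArtifact(score.evidence)) {
    errors.push("Pass requires at least one path:line artifact");
  }
  if (score.state === "Partial" && !(hasArtifact(score.evidence) && score.missing)) {
    errors.push("Partial must cite what exists and what is missing");
  }
  if (score.state === "Fail" && (score.searched ?? []).length === 0) {
    errors.push("Fail must say where you looked");
  }
  return errors;
}
```

A control that was not found within the five-minute budget would be recorded as a Fail whose `searched` list names the places checked, with a note that it was not found within budget.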

4. Roll up and decide

Roll the 44 control scores into the eight outcome groups (context-aware, policy-governed, tool-controlled, validated, observable, reversible, measurable, continuously improving).
Determine the maturity band the evidence actually supports (see below), and compare it to the band the team claims to operate at. The gap between claimed and evidenced maturity is the headline finding.
Apply the launch decision: any P0 Fail/Partial blocks production unless the agent is fully read-only, isolated from real users, and confined to a controlled beta.
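
The roll-up and the launch rule are simple enough to sketch in code. The record fields (`group`, `severity`, `state`) and the three agent flags are hypothetical names chosen for illustration; the logic follows the rule as stated above.

```javascript
// Counts score states per outcome group.
function rollUp(scores) {
  const groups = {};
  for (const s of scores) {
    const g = (groups[s.group] ??= {});
    g[s.state] = (g[s.state] ?? 0) + 1;
  }
  return groups;
}

// Any P0 Fail or Partial blocks production unless the agent is read-only,
// isolated from real users, and confined to a controlled beta.
function launchDecision(scores, agent) {
  const blockers = scores.filter(
    (s) => s.severity === "P0" && (s.state === "Fail" || s.state === "Partial"),
  );
  const exempt = Boolean(
    agent.readOnly && agent.isolatedFromRealUsers && agent.controlledBeta,
  );
  return { blocked: blockers.length > 0 && !exempt, blockers };
}
```

Note that the exemption needs all three conditions at once; a read-only agent exposed to real users is still blocked.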

5. Emit the report

Produce the scorecard using reference/scorecard-template.md. The report MUST end with a fix queue: every gap gets an owner placeholder, severity, the concrete fix, the expected evidence that would flip it to Pass, and the ContextOS plane/doc to read. Order the queue by the dependency chain, not by control number: fix the most load-bearing failure first (a broken context compiler undermines grounding, validation, observability, and replay; a missing policy engine undermines tool control, approval, and privacy). Tell the user to fix one gap well and re-run the audit. Do not hand back dozens of parallel workstreams.
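
Ordering by dependency chain amounts to a topological sort over the open gaps. The sketch below assumes each gap has a `name` and a `severity`, and uses only the two dependency examples given in the text as its edge list; a real audit would need a fuller map, and gaps caught in a dependency cycle would be dropped by this simple version.

```javascript
// Edges from the examples in the text: a failure in the key undermines the values.
const UNDERMINES = {
  "context compiler": ["grounding", "validation", "observability", "replay"],
  "policy engine": ["tool control", "approval", "privacy"],
};

// Orders gaps so upstream fixes come first. Among gaps that are ready, the one
// undermining the most other open gaps goes first, then higher severity.
function orderFixQueue(gaps, undermines = UNDERMINES) {
  const byName = new Map(gaps.map((g) => [g.name, g]));
  const downstream = (name) =>
    (undermines[name] ?? []).filter((d) => byName.has(d));
  const indegree = new Map(gaps.map((g) => [g.name, 0]));
  for (const g of gaps) {
    for (const d of downstream(g.name)) indegree.set(d, indegree.get(d) + 1);
  }
  const rank = (a, b) =>
    downstream(b.name).length - downstream(a.name).length ||
    a.severity.localeCompare(b.severity); // "P0" sorts before "P1"
  const ready = gaps.filter((g) => indegree.get(g.name) === 0).sort(rank);
  const ordered = [];
  while (ready.length > 0) {
    const next = ready.shift();
    ordered.push(next);
    for (const d of downstream(next.name)) {
      indegree.set(d, indegree.get(d) - 1);
      if (indegree.get(d) === 0) ready.push(byName.get(d));
    }
    ready.sort(rank);
  }
  return ordered;
}
```

The first element of the result is the single gap to hand the team before the audit is re-run.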

Maturity bands

The band is not a label; it determines which failures block launch.

Band	Appropriate use	Required controls	Not allowed
Prototype	Internal exploration, no real side effects, synthetic/low-risk data	Agent charter, basic eval set, trace capture, tool sandbox	Real users, PII, money movement, durable memory
Controlled beta	Limited users, explicit supervision, compensating controls	P0 controls for touched surfaces, approval gates, offline evals, trace review, fix queue	Direct high-risk execution without a human gate
Production	Real users, real tools, monitored release lifecycle	Full P0 pass, P1 gaps owned, live validation, replay, rollback, incident playbook	Unversioned prompt/model/tool changes
Regulated / high-risk	Regulated data, money movement, legal/health/security, destructive actions	Full P0/P1 pass, red-team audit, retention policy, evidence retention, formal release governance	Informal approval, undocumented memory, non-replayable action
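
The headline finding, the gap between claimed and evidenced maturity, can be approximated from the scores. This sketch looks only at the P0/P1 requirements in the table, so its result is an upper bound: the other required controls for each band (incident playbook, red-team audit, and so on) still have to be evidenced separately. The `owner` and `touchedSurface` fields are hypothetical.

```javascript
const BANDS = ["Prototype", "Controlled beta", "Production", "Regulated / high-risk"];

// Highest band the P0/P1 scores could support.
function evidencedBand(scores) {
  const open = (sev) =>
    scores.filter((s) => s.severity === sev && s.state !== "Pass" && s.state !== "N/A");
  const p0Open = open("P0");
  const p1Open = open("P1");
  if (p0Open.length === 0 && p1Open.length === 0) return "Regulated / high-risk";
  if (p0Open.length === 0 && p1Open.every((s) => s.owner)) return "Production";
  if (p0Open.every((s) => !s.touchedSurface)) return "Controlled beta";
  return "Prototype";
}

// Positive result: the team claims more maturity than the evidence supports.
function maturityGap(claimed, evidenced) {
  return BANDS.indexOf(claimed) - BANDS.indexOf(evidenced);
}
```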

Discipline reminders

Equivalent evidence, not ContextOS terminology. The target may use any framework. You are looking for the control, however it's named — a Pydantic guardrail, a LangGraph interrupt, an OPA policy, an OTEL span. The checklist's ContextOS mapping is for the remediation pointer, not a naming requirement.
Boundary over happy path. If you only inspect a successful run, you have audited the demo, not the harness.
Completion is not safety. A run that finishes the task says nothing about whether it stayed in bounds: task completion and safety are routinely misaligned, and a higher completion rate often means more boundary crossings, not fewer. A run that completed the task while crossing a boundary is a Fail, not partial credit.
Watch the object, not just the tool. The most common live violation is the right tool on the wrong object — a valid refund/lookup/file call applied to an out-of-scope customer, record, or path. Schema validation passes; the boundary is still crossed. Scrutinize resource-scope binding (#41) and the argument-level axis of trajectory evals (#27) accordingly.
Be specific and falsifiable. Every Pass cites path:line. Every Fail says where you looked. Never soften a P0 Fail into a suggestion.
Don't fabricate. If traces weren't provided, say the trace-dependent controls were scored from code only and flag them for a runtime re-audit.
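
The "right tool on the wrong object" case is easiest to see side by side. In this illustrative sketch, the `refund` tool, its argument names, and the session's `allowedCustomerIds` binding are all hypothetical; the point is that the schema check and the scope check are separate questions.

```javascript
// Schema check: right tool, well-formed arguments.
function isSchemaValid(call) {
  return (
    call.tool === "refund" &&
    typeof call.args.customerId === "string" &&
    typeof call.args.amountCents === "number"
  );
}

// Scope check: is this object one the current session is bound to?
function isInScope(call, session) {
  return session.allowedCustomerIds.includes(call.args.customerId);
}

const session = { allowedCustomerIds: ["cust_123"] };
const call = { tool: "refund", args: { customerId: "cust_999", amountCents: 500 } };

console.log(isSchemaValid(call)); // true: the call is well-formed
console.log(isInScope(call, session)); // false: wrong object, boundary crossed
```

When auditing, look for where the equivalent of `isInScope` runs in the target harness and whether its result is recorded by the harness rather than reported by the model.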

What this audit is grounded in

This skill is the runnable form of the ContextOS eight-property harness audit: https://contextosai.com/blog/eight-property-harness-audit — 44 controls grouped into eight outcomes, evidence required for every pass. The eight properties are defined at https://contextosai.com/docs/foundations/harness-engineering.
