From harness-audit
Audit an AI agent's harness for production readiness by reading its ACTUAL code, configs, and run traces, not a questionnaire. Scores 44 runtime controls (grouped into the eight ContextOS harness properties) Pass/Partial/Fail with file:line evidence, assigns a maturity band, and emits a prioritized fix queue. Use when someone asks to audit / assess / production-readiness-check an AI agent, agent harness, LangGraph / OpenAI Agents SDK / ADK / CrewAI / custom agent, or asks "is my agent safe to ship / production-ready / governed".
Slash command
/harness-audit:harness-audit
You are running a production readiness audit on an AI agent's harness. The governing rule is the one thing that makes this audit worth anything:
No artifact, no pass. Judge the runtime record, not the architecture diagram or the README. A control that is only described — in prose, in a design doc, in the system prompt — is a Fail. A harness is evidence, not confidence. And prefer evidence from channels the agent cannot author: a harness-emitted tool-call / resource-access / message record outranks anything the model self-reports, because a self-report is gameable.
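One way to make the evidence ranking concrete is to tag each artifact with the channel that produced it and prefer the channel the agent cannot author. This is a minimal sketch, not part of the skill: the `Channel` names, the `Evidence` type, and the `best_evidence` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Channel(IntEnum):
    """Hypothetical trust ranking: higher values are harder for the agent to author."""
    DESCRIBED_ONLY = 0     # prose, design doc, system prompt: never enough for a Pass
    MODEL_SELF_REPORT = 1  # the model says it did something, which is gameable
    HARNESS_RECORD = 2     # harness-emitted tool-call / resource-access / message record


@dataclass(frozen=True)
class Evidence:
    ref: str          # illustrative "path:line" reference
    channel: Channel


def best_evidence(items: list[Evidence]) -> Optional[Evidence]:
    """Return the most trustworthy artifact, or None if only descriptions exist."""
    usable = [e for e in items if e.channel > Channel.DESCRIBED_ONLY]
    return max(usable, key=lambda e: e.channel, default=None)
```

A `None` result corresponds to "no artifact, no pass": the control was described somewhere but nothing in the runtime record backs it.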
Your job is to find the evidence in the target repository (and run traces, if available), score each control against that evidence, and hand back an action plan the team can execute. You are not here to be reassured; you are here to find what will break in production.
Work in this order. Do not skip the prescan, and do not score a control Pass without a concrete artifact reference.
Identify the substrate before judging it. Read package.json /
pyproject.toml / lockfiles / README to determine the framework (LangGraph,
OpenAI Agents SDK, Google ADK, Semantic Kernel, CrewAI, Mastra, or custom),
the language, and where the agent's entry point, tools, prompts, policies,
evals, and telemetry live. State what you found in one short paragraph.
Also determine whether this is single-agent or multi-agent (handoffs, sub-agents, planner/worker roles, a graph with multiple agent nodes). If multi-agent, controls #42 (outbound disclosure) and #43 (communication policy) are in scope and the inter-agent message channel is a first-class audit surface — coordination expands the risk surface. If single-agent, mark #43 N/A and say so.
Run the deterministic prescan to get a fast map of candidate evidence:
node "$SKILL_DIR/scripts/prescan.mjs" <target-path>
(If Node is unavailable or the script errors, fall back to rg/grep
manually using the search hints in the checklist.) The prescan only locates
candidate evidence — it never decides Pass/Fail. You must open each hit and
verify it actually enforces the control at the right boundary. A keyword match
is not a control.
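If the prescan script cannot run, the manual fallback can be approximated with a few lines of Python. The sketch below only illustrates "locate, don't decide": the `HINTS` patterns and the `candidate_hits` helper are hypothetical, and the real search hints are the ones in reference/checklist.md.

```python
import re
from pathlib import Path
from typing import Iterator

# Hypothetical search hints; substitute the ones from reference/checklist.md.
HINTS = {
    "approval-gate": re.compile(r"require_approval|human_in_the_loop", re.IGNORECASE),
    "trace-capture": re.compile(r"opentelemetry|tracer\.start", re.IGNORECASE),
}

SKIP_DIRS = {".git", "node_modules", ".venv", "__pycache__"}


def candidate_hits(root: str) -> Iterator[tuple[str, str]]:
    """Yield (hint_name, "path:line") pairs. These are leads to open, not verdicts."""
    root_path = Path(root)
    for path in sorted(root_path.rglob("*")):
        if not path.is_file():
            continue
        if SKIP_DIRS & set(path.relative_to(root_path).parts):
            continue  # vendored or generated directories
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in HINTS.items():
                if pattern.search(line):
                    yield name, f"{path.as_posix()}:{lineno}"
```

Every pair it yields still has to be opened and checked against the boundary where the control is supposed to be enforced.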
Open reference/checklist.md. It lists every control with: the audit question,
minimum pass evidence, the immediate fail signal, severity, where to look, and
the ContextOS plane/doc that owns the remediation. For each control:
path:line.
Apply the five-minute rule: if you cannot find the evidence within ~5 minutes of searching, score it Fail. A control that can't be found under pressure won't protect the system under pressure. Note it as "not found within budget" rather than asserting it doesn't exist, but it still scores Fail.
Severity (P0/P1/P2) comes from the checklist and is independent of the
pass state — a P0 control that is Partial is still a launch blocker.
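The two scoring rules (a Pass needs an artifact reference, and severity is independent of pass state) can be written down as a small check. This is a sketch under assumptions: the `Finding` record and both helpers are hypothetical, not the format used by reference/scorecard-template.md.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    control_id: int
    severity: str                   # "P0", "P1", or "P2", taken from the checklist
    state: str                      # "Pass", "Partial", or "Fail"
    evidence: Optional[str] = None  # "path:line" reference, required for a Pass
    note: str = ""


def normalize(f: Finding) -> Finding:
    """Downgrade a Pass that carries no artifact reference to a Fail."""
    if f.state == "Pass" and not f.evidence:
        return Finding(f.control_id, f.severity, "Fail", None, "no artifact reference")
    return f


def is_launch_blocker(f: Finding) -> bool:
    """Severity is independent of state: a P0 that is not a full Pass blocks launch."""
    return f.severity == "P0" and f.state != "Pass"
```

For example, `is_launch_blocker(Finding(3, "P0", "Partial", "src/x.py:1"))` is true even though the control has some evidence behind it.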
Produce the scorecard using reference/scorecard-template.md. The report MUST
end with a fix queue: every gap gets an owner placeholder, severity, the
concrete fix, the expected evidence that would flip it to Pass, and the
ContextOS plane/doc to read. Order the queue by the dependency chain, not by
control number — fix the most load-bearing failure first (a broken context
compiler undermines grounding, validation, observability, and replay; a missing
policy engine undermines tool control, approval, and privacy). Tell the user to
fix one well and re-run the audit. Do not hand back dozens of parallel workstreams.
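Ordering the queue by the dependency chain amounts to a topological sort over the open gaps. The sketch below uses Python's standard `graphlib` module; the gap names and the `DEPENDS_ON` map are hypothetical labels derived from the two examples in the text, not identifiers from the checklist.

```python
from graphlib import TopologicalSorter

# Hypothetical gap names. Each key lists the gaps that should be fixed before it,
# following the two dependency examples given in the text.
DEPENDS_ON = {
    "grounding": {"context-compiler"},
    "validation": {"context-compiler"},
    "observability": {"context-compiler"},
    "replay": {"context-compiler"},
    "tool-control": {"policy-engine"},
    "approval": {"policy-engine"},
    "privacy": {"policy-engine"},
}


def fix_order(open_gaps: set[str]) -> list[str]:
    """Order the open gaps so load-bearing fixes come before their dependents."""
    graph = {
        gap: DEPENDS_ON.get(gap, set()) & open_gaps  # ignore already-closed gaps
        for gap in sorted(open_gaps)
    }
    return list(TopologicalSorter(graph).static_order())
```

The head of the returned list is the "fix one well" candidate; the rest of the queue is re-derived after the next audit run.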
The band is not a label; it determines which failures block launch.
| Band | Appropriate use | Required controls | Not allowed |
|---|---|---|---|
| Prototype | Internal exploration, no real side effects, synthetic/low-risk data | Agent charter, basic eval set, trace capture, tool sandbox | Real users, PII, money movement, durable memory |
| Controlled beta | Limited users, explicit supervision, compensating controls | P0 controls for touched surfaces, approval gates, offline evals, trace review, fix queue | Direct high-risk execution without a human gate |
| Production | Real users, real tools, monitored release lifecycle | Full P0 pass, P1 gaps owned, live validation, replay, rollback, incident playbook | Unversioned prompt/model/tool changes |
| Regulated / high-risk | Regulated data, money movement, legal/health/security, destructive actions | Full P0/P1 pass, red-team audit, retention policy, evidence retention, formal release governance | Informal approval, undocumented memory, non-replayable action |
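The severity-related part of the table can be read as a gate per band. The following sketch is a simplified interpretation, not the skill's logic: the `Result` type, the band keys, and the `owned` and `touched` flags are assumptions, and the other required controls in the table (replay, rollback, red-team audit, and so on) are not modeled.

```python
from typing import NamedTuple


class Result(NamedTuple):
    control_id: int
    severity: str         # "P0", "P1", or "P2"
    state: str            # "Pass", "Partial", or "Fail"
    owned: bool = False   # a named owner exists for the gap
    touched: bool = True  # the control's surface is exercised by this deployment


def blockers(band: str, results: list[Result]) -> list[int]:
    """Return the control ids that block launch at the given band."""
    out = []
    for r in results:
        if r.state == "Pass":
            continue
        if band == "regulated":
            blocked = r.severity in ("P0", "P1")          # full P0/P1 pass
        elif band == "production":
            blocked = r.severity == "P0" or (r.severity == "P1" and not r.owned)
        elif band == "controlled-beta":
            blocked = r.severity == "P0" and r.touched    # P0 for touched surfaces
        else:
            blocked = False  # prototype: no severity gate in this sketch
        if blocked:
            out.append(r.control_id)
    return out
```

The same set of findings therefore yields a different blocker list depending on the band, which is the sense in which the band "determines which failures block launch".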
path:line. Every Fail says
where you looked. Never soften a P0 Fail into a suggestion.
This skill is the runnable form of the ContextOS eight-property harness audit: https://contextosai.com/blog/eight-property-harness-audit (44 controls grouped into eight outcomes, evidence required for every pass). The eight properties are defined at https://contextosai.com/docs/foundations/harness-engineering.
npx claudepluginhub contextosai/skills --plugin example-skills