Slash command: /python:auditing-python
Invoke the `python:standardizing-python` skill before proceeding. If that skill is unavailable, report the missing skill and continue with the closest available workflow.
Adversarial code review through comprehension. Find design flaws that automated tools cannot catch. Produce a structured verdict -- not code changes.
This skill is read-only. It produces verdicts, not commits or fixes.
Test evidence quality is audited by /auditing-python-tests. This skill audits implementation code, not test code. If test files are in scope, delegate to /auditing-python-tests.
<quick_start>
/testing for methodology + /testing-python for Python patterns
CLAUDE.md, pyproject.toml (Phase 0)
</quick_start>
<repo_local_overlay>
Standards are pre-loaded above. Check for spx/local/python.md at the repository root. Read it if it exists and apply it as repo-local routing to the product's governing specs and decisions. A local overlay supplements skill behavior; it does not declare product truth.
</repo_local_overlay>
<essential_principles>
Trust automated gates, then comprehend.
Phases 1-2 are mechanical prerequisites. If they fail, stop -- REJECTED. If they pass, do NOT re-check what linters and tests already verified. Your time is for Phase 3.
Comprehension is the core value.
Automated tools catch syntax errors, type mismatches, and lint violations. You catch: functions that do more than their name says, dead parameters required by no Protocol, IO tangled with logic, and designs that will break under change. The predict/verify protocol (Phase 3) is how you find these.
Test evidence is not your concern.
/auditing-python-tests evaluates whether tests provide genuine evidence using the 4-property model (coupling, falsifiability, alignment, coverage). This skill verifies tests PASS, not whether they have evidentiary value. Do not duplicate that work.
Binary verdict, no caveats.
APPROVED means every concern passes. REJECTED means at least one fails. APPROVED output contains no notes, warnings, or suggestions sections.
</essential_principles>
Execute phases IN ORDER. Do not skip.
Phase 0: Scope and Product Config
CLAUDE.md/README.md for validation commands and test runners
pyproject.toml for tool configurations (ruff, mypy, pytest)
Phase 1: Automated Gates (blocking)
Run the product's validation command. Catches everything linters handle: type annotations, naming, magic numbers, bare excepts, unused imports, security rules.
If the product lacks its own linter configs, use the reference configs in ${CLAUDE_SKILL_DIR}/rules/:
| File | Purpose |
|---|---|
| ruff_quality.toml | Ruff linting (pycodestyle, bugbear, security) |
| mypy_strict.toml | Mypy strict type checking |
| semgrep_sec.yaml | Semgrep security patterns (eval, pickle, shell) |
Non-zero exit = REJECTED. Do not proceed.
Do NOT manually re-check what linters catch. If the product's linters are properly configured per /standardizing-python, they handle type annotations, magic numbers, bare excepts, unused imports, commented-out code, modern syntax, and security rules.
Note: Some rules require manual verification during Phase 3 -- deep relative imports, sys.path manipulation, unqualified Any, # type: ignore without justification.
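The blocking rule for this phase can be sketched as a small helper. This is a minimal sketch, assuming the product's validation command is available as an argument list; `run_gate` and its return strings are hypothetical names, not part of the skill, and the two commands below are stand-ins for a real validation command.

```python
import subprocess
import sys
from collections.abc import Sequence


def run_gate(command: Sequence[str]) -> str:
    """Run one blocking gate; any non-zero exit code means REJECTED."""
    completed = subprocess.run(
        list(command), capture_output=True, text=True, check=False
    )
    return "PASS" if completed.returncode == 0 else "REJECTED"


# Hypothetical stand-ins for the product's validation command.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(3)"]
```

The same helper would apply unchanged to Phase 2, since a failing test run also surfaces as a non-zero exit code.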
Phase 2: Test Execution (blocking)
Run the full test suite. Use the product's test runner from CLAUDE.md.
If tests require infrastructure (databases, Docker), attempt to provision it. Do not skip tests because infrastructure "isn't running" -- try to start it first.
ANY test failure = REJECTED. Do not proceed.
Phase 3: Code Comprehension
Read every file. Understand it. Question it. Do NOT skim, sample, or check boxes.
3.1 Per-Function Protocol
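For each function/method, predict what it does from its name and signature, then read the body and compare. Common surprises: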
| Surprise | What it suggests |
|---|---|
| Parameter never used in body | Dead parameter -- required by Protocol, or remove |
| Does more than name says | SRP violation or misleading name |
| Does less than name says | Name overpromises or logic is incomplete |
| Variable assigned but never read | Dead code or unfinished logic |
| Code path that can never execute | Dead branch given calling context |
| Return value contradicts the type hint | Logic error or wrong return type |
Prediction matched? Move on. Surprise? Document it with file:line.
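The first row of the table is the easiest one to misjudge, so here is an illustrative sketch of a parameter that looks dead but is required by a Protocol. The `CommandHandler` name comes from the failure modes in this skill; the `handle` method, its parameters, and both handler classes are hypothetical.

```python
from typing import Protocol


class CommandHandler(Protocol):
    """Hypothetical Protocol: every handler accepts the same parameters."""

    def handle(self, command: str, dry_run: bool) -> str: ...


class DeployHandler:
    def handle(self, command: str, dry_run: bool) -> str:
        # Uses both parameters.
        return f"skipped {command}" if dry_run else f"ran {command}"


class EchoHandler:
    def handle(self, command: str, dry_run: bool) -> str:
        # dry_run is unused here, but the Protocol requires it: not a finding.
        return command


def dispatch(handler: CommandHandler, command: str) -> str:
    """Call any handler through the shared Protocol signature."""
    return handler.handle(command, dry_run=True)
```

Read in isolation, `EchoHandler.handle` has an unused `dry_run`. Read next to the Protocol and `DeployHandler`, the parameter is part of a shared contract, so the surprise resolves without a finding.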
3.2 Design Evaluation
For the codebase as a whole:
ValueError/RuntimeError?
3.3 Import Evaluation
Evaluate import structure using the same vocabulary as /auditing-python-tests:
| Import pattern | Classification |
|---|---|
| import pytest | Framework -- not reviewed |
| from hypothesis import given | Framework -- not reviewed |
| import json | Stdlib -- not reviewed |
| from typing import TYPE_CHECKING | Type-only -- erased at runtime |
| from product.config import parse_config | Codebase (production) -- review |
| from ..config import parse_config | Codebase (relative) -- review |
| from product_testing.harnesses import ConfigTestHarness | Codebase (test infra) -- review |
Import depth rules:
| Depth | Example | Verdict |
|---|---|---|
| Package import | from product.config ... | OK -- preferred |
| 1 level | from ..config ... | Review -- truly module-internal? |
| 2+ levels | from ....helpers ... | REJECT -- use package import |
| sys.path manip. | sys.path.insert(0, ...) | REJECT -- always |
For stable locations (product_testing.harnesses.*, product_testing.generators.*, and inert fixture path providers), package imports are mandatory.
See ${CLAUDE_SKILL_DIR}/references/false-positive-handling.md for application context when evaluating security and linter suppression comments.
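As a rough aid for the depth rules, the check can be sketched with the standard `ast` module. This is a sketch under one stated assumption: the table's "levels" count parent-package hops, so `..` is one level (ast `level` 2) and a single leading dot is not flagged. `classify_imports` and `SAMPLE` are hypothetical names, and the result is only a pointer for manual review, not a verdict.

```python
import ast


def _is_sys_path_call(node: ast.AST) -> bool:
    """True for method calls on sys.path, e.g. sys.path.insert(0, ...)."""
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
        return False
    target = node.func.value
    return (
        isinstance(target, ast.Attribute)
        and target.attr == "path"
        and isinstance(target.value, ast.Name)
        and target.value.id == "sys"
    )


def classify_imports(source: str) -> list[tuple[int, str]]:
    """Return (line, verdict) pairs for relative imports and sys.path calls."""
    findings: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            # Assumption: "." stays in the package, ".." climbs one level.
            hops = node.level - 1
            if hops >= 2:
                findings.append((node.lineno, "REJECT"))
            elif hops == 1:
                findings.append((node.lineno, "REVIEW"))
        elif _is_sys_path_call(node):
            findings.append((node.lineno, "REJECT"))
    return sorted(findings)


SAMPLE = """\
import sys
from product.config import parse_config
from ..config import parse_config
from ....helpers import load
sys.path.insert(0, "src")
"""

print(classify_imports(SAMPLE))
```

Parsing never executes the source, so the sample's imports do not need to resolve.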
Phase 4: ADR/PDR Compliance
Find applicable ADRs/PDRs in the spec hierarchy (*.adr.md, *.pdr.md). Verify each constraint is followed. Undocumented deviations = REJECTED. If the product has no spec hierarchy, this concern is N/A.
| Decision Record Constraint | Violation Example | Verdict |
|---|---|---|
| "Use dependency injection" (ADR) | Direct imports of external services | REJECTED |
| "l1 tests for logic" (ADR) | l1 tests hitting network | REJECTED |
| "No ORM" (ADR) | SQLAlchemy models introduced | REJECTED |
| "Lifecycle is Draft→Published" (PDR) | Added hidden Archived state | REJECTED |
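Locating the decision records can be sketched with `pathlib`. This is a minimal sketch, assuming the records sit somewhere under a single root directory; `find_decision_records` is a hypothetical helper, and the file names in the demo are made up.

```python
import tempfile
from pathlib import Path


def find_decision_records(root: Path) -> list[Path]:
    """Collect *.adr.md and *.pdr.md files under root, sorted for stable output."""
    return sorted([*root.rglob("*.adr.md"), *root.rglob("*.pdr.md")])


# Demo against a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "spx").mkdir()
    for name in ("spx/auth.adr.md", "spx/lifecycle.pdr.md", "README.md"):
        (root / name).write_text("", encoding="utf-8")
    found = [path.name for path in find_decision_records(root)]

print(found)
```

An empty result corresponds to the "no spec hierarchy" case, where this concern is N/A.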
<failure_modes>
These are real failures from past audits. Study them to avoid repeating them.
Approved code that passed ruff+mypy but had a design flaw. The auditor trusted Phase 1 output and skimmed Phase 3. The code had a function named validate_config that also wrote the config file -- SRP violation hidden behind a reasonable name. The predict/verify protocol would have caught it: "Given the name, I predict this validates. But the body also calls Path.write_text(). Surprise."
Rejected code for a false positive. The auditor flagged a parameter as "dead code" because it wasn't used in the function body. The parameter was required by a CommandHandler Protocol contract -- other implementations used it. Before flagging dead parameters, check if the function implements a Protocol.
Tried to evaluate test evidence instead of delegating. The auditor found lambda cmd: (0, "", "") in tests and spent time analyzing whether it severed coupling. That's /auditing-python-tests' job. This auditor should have verified tests PASS (Phase 2) and moved on to comprehending the implementation code.
Distracted by style while missing a logic bug. The auditor spent review time on naming conventions, import ordering, and docstring completeness. Meanwhile, a branch condition was inverted -- if is_valid should have been if not is_valid. Comprehension (understanding what the code does) must come before style. Style is the linter's job.
Accepted code with tangled IO. A process_orders function both computed order totals AND sent confirmation emails. Tests passed and types were correct. But the function was untestable without an email server -- IO and logic were tangled. The design evaluation (3.2) would have caught it: "Can core logic be tested without IO? No."
</failure_modes>
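The tangled-IO failure above has a common remedy: keep the computation pure and inject the side effect. The following is an illustrative sketch of that shape, not the audited code; `Order`, `order_total`, and the `send_email` callable are hypothetical.

```python
from collections.abc import Callable
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Order:
    customer_email: str
    unit_price: Decimal
    quantity: int


def order_total(order: Order) -> Decimal:
    """Pure logic: no IO, so it can be tested without an email server."""
    return order.unit_price * order.quantity


def process_orders(
    orders: list[Order], send_email: Callable[[str, str], None]
) -> list[Decimal]:
    """Thin IO shell: computes totals, then delegates sending to the caller."""
    totals = [order_total(order) for order in orders]
    for order, total in zip(orders, totals):
        send_email(order.customer_email, f"Your order total is {total}")
    return totals
```

With this split, the 3.2 question "Can core logic be tested without IO?" has a yes answer for `order_total`, and `process_orders` only needs a recording callable in place of a mail client.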
<output_format>
Emit the verdict as JSON conforming to the canonical schema in plugins/spec-tree/skills/auditing/scripts/verdict.py. The skill's entire output is the JSON verdict. The caller captures the JSON and routes it through emit_verdict.py with the requested --format (defaulting to markdown+json for PR-comment delivery).
The skill's overall is PASS iff every concern row is PASS or UNKNOWN (N/A maps to UNKNOWN); FAIL if any concern is FAIL. Findings carry severity REJECT for blocking violations.
{
  "schema_version": 1,
  "skill": "auditing-python",
  "target": "<scope-target>",
  "overall": "PASS | FAIL | UNKNOWN",
  "rows": [
    { "name": "automated-gates", "status": "PASS | FAIL | UNKNOWN", "findings": [] },
    { "name": "test-execution", "status": "PASS | FAIL | UNKNOWN", "findings": [] },
    { "name": "function-comprehension", "status": "PASS | FAIL | UNKNOWN", "findings": [] },
    { "name": "design-coherence", "status": "PASS | FAIL | UNKNOWN", "findings": [] },
    { "name": "import-structure", "status": "PASS | FAIL | UNKNOWN", "findings": [] },
    { "name": "adr-pdr-compliance", "status": "PASS | FAIL | UNKNOWN", "findings": [] }
  ],
  "metadata": { "branch": "<branch>" }
}
Each finding carries file, line, rule (the concern name from the verdict table or a specific violation name), severity: "REJECT", and message (the one-line "why this fails"). Include correct-approach code samples and required changes directly in the finding's message field — the JSON verdict is the complete output of this skill.
</output_format>
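The aggregation rule for `overall` can be sketched in a few lines. This sketch mirrors only the fields shown in this document and is not a substitute for the canonical schema in verdict.py; `make_finding` and `build_verdict` are hypothetical helpers, and the file path and branch in the usage are made up.

```python
import json

# Concern names copied from the verdict rows in this document.
CONCERNS = (
    "automated-gates",
    "test-execution",
    "function-comprehension",
    "design-coherence",
    "import-structure",
    "adr-pdr-compliance",
)


def make_finding(file: str, line: int, rule: str, message: str) -> dict:
    """Build one finding; blocking violations carry severity REJECT."""
    return {"file": file, "line": line, "rule": rule,
            "severity": "REJECT", "message": message}


def build_verdict(
    target: str,
    branch: str,
    findings: dict[str, list[dict]],
    not_applicable: frozenset[str] = frozenset(),
) -> dict:
    """Assemble rows, then derive overall: FAIL if any row is FAIL, else PASS."""
    rows = []
    for name in CONCERNS:
        row_findings = findings.get(name, [])
        if row_findings:
            status = "FAIL"
        elif name in not_applicable:
            status = "UNKNOWN"  # N/A maps to UNKNOWN
        else:
            status = "PASS"
        rows.append({"name": name, "status": status, "findings": row_findings})
    overall = "FAIL" if any(row["status"] == "FAIL" for row in rows) else "PASS"
    return {"schema_version": 1, "skill": "auditing-python", "target": target,
            "overall": overall, "rows": rows, "metadata": {"branch": branch}}


# Hypothetical usage: one design finding, no spec hierarchy in the product.
verdict = build_verdict(
    target="src/product",
    branch="main",
    findings={"design-coherence": [make_finding(
        "src/product/orders.py", 42, "design-coherence",
        "process_orders mixes total computation with email IO")]},
    not_applicable=frozenset({"adr-pdr-compliance"}),
)
print(json.dumps(verdict, indent=2))
```

Deriving each row's status from its findings keeps the row and the overall value from drifting apart.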
<what_to_avoid>
Evaluating test evidence quality (delegate to /auditing-python-tests)
</what_to_avoid>
<example_review>
Read ${CLAUDE_SKILL_DIR}/references/example-audit.md for complete APPROVED and REJECTED examples showing all concern types.
</example_review>
<success_criteria>
Review is complete when:
</success_criteria>