From python
ALWAYS invoke this skill when the user asks to audit Python test evidence, review Python tests for spec-tree evidence quality, or evaluate Python test infrastructure. NEVER audit Python test evidence without this skill.
How this skill is triggered — by the user, by Claude, or both
Slash command
/python:auditing-python-testsThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Invoke the `python:standardizing-python` skill before proceeding. If that skill is unavailable, report the missing skill and continue with the closest available workflow.
Invoke the python:standardizing-python skill before proceeding. If that skill is unavailable, report the missing skill and continue with the closest available workflow.
Invoke the python:standardizing-python-tests skill before proceeding. If that skill is unavailable, report the missing skill and continue with the closest available workflow.
Invoke the spec-tree:testing skill before proceeding. If that skill is unavailable, report the missing skill and continue with the closest available workflow.
Invoke the spec-tree:auditing-tests skill before proceeding. If that skill is unavailable, report the missing skill and continue with the closest available workflow.
!test -f spx/local/python.md && cat spx/local/python.md || true
!test -f spx/local/python-tests.md && cat spx/local/python-tests.md || true
Any overlay loaded above routes skill behavior to the product's governing specs and decisions; a local overlay supplements skill behavior and does not declare product truth.
Audit Python test evidence against the spec-tree test-audit properties plus Python-specific source ownership, testability, harness, generator, inert-fixture, and pytest discovery rules.This skill audits tests and test infrastructure. It does not audit production implementation style except where source architecture affects test evidence quality.
<audit_scope> For every in-scope test assertion, inspect the full evidence chain:
product_testing.harnesses.* modulesproduct_testing.generators.* modulesconftest.py files that apply to the testDo not approve a test by looking only at the test file. Laundering and severed coupling can live in generators, harnesses, fixture path providers, and pytest discovery shims. </audit_scope>
<gate_0_deterministic>
Run deterministic checks before judgment when the repository provides the tools. Prefer the repository's canonical commands — those its CLAUDE.md / AGENTS.md, Justfile, Makefile, or package scripts document; the python3 -m … forms below are the portable fallback when the product ships no wrapper:
bad_test_names="$(rg --files <spec-node-path>/tests/ --glob '*.py' | rg -v '/test_[^.]+\.(scenario|mapping|conformance|property|compliance)\.l[123](\.[^.]+)?\.py$' || true)"
test -z "$bad_test_names"
python3 -m pytest --collect-only -q <spec-node-path>/tests/
python3 -m ruff check <spec-node-path>/tests/
python3 -m mypy <spec-node-path>/tests/
Report unavailable tools separately instead of silently skipping them.
If collection, linting, type checking, or repository-canonical deterministic checks fail, halt the audit and emit REJECT with the failing command and diagnostic. Do not proceed to semantic evidence judgment on uncollectable, untyped, or lint-failing tests.
</gate_0_deterministic>
<coupling_audit> Classify imports by runtime coupling:
| Import pattern | Classification |
|---|---|
import pytest | Framework, does not count |
from hypothesis import given | Framework, does not count |
import json | Stdlib, does not count |
from typing import TYPE_CHECKING | Type-only, does not count |
from product.config import parse_config | Production coupling |
from product_testing.harnesses import config_harness | Indirect coupling through harness |
from product_testing.generators import valid_config | Input-domain provider, audit separately |
Imports inside if TYPE_CHECKING: do not create runtime coupling. A test with only framework, stdlib, and type-only imports is a tautology unless it reaches production through a harness that itself reaches production.
When a test imports a harness, inspect the harness and verify it calls the production behavior the assertion is about. A harness that builds expected values without exercising production is severed coupling. </coupling_audit>
<falsifiability_audit> Reject replacement of the behavior under test:
unittest.mock.patch replacing the production function, class, module, client, or repository the assertion claims to verifyMock() or MagicMock() standing in for behavior the assertion claims to verifymocker.patch(...) replacing the dependency under testmonkeypatch replacing the behavior under testsys.path or importlib tricks that cause tests to import alternate modulesAccept explicit test doubles only when they are passed through dependency injection and map to a /testing Stage 5 exception:
| Exception | Python pattern |
|---|---|
| Failure modes | Class implementing a protocol and raising |
| Interaction protocols | Class with typed call recording |
| Time/concurrency | Injected clock or controllable scheduler |
| Safety | Class that records intent |
| Combinatorial cost | Configurable class mirroring real behavior |
| Observability | Class capturing hidden boundary details |
| Contract probes | Stub validated against a real schema |
</falsifiability_audit>
<source_ownership_audit> The audit asks one question per test case: where does this case come from? The legitimate sources:
| Assertion type | Case source |
|---|---|
| Scenario | The spec assertion text — the case is declared by the spec, not invented by the test author |
| Mapping | A finite source-owned enumeration (enum, registry, schema, structured metadata) |
| Property | A generator over a domain — the author writes the invariant, the generator owns the cases |
| Conformance | An external oracle (schema validator, reference implementation, parser the test doesn't author) |
| Compliance | The decision record being enforced — the case is the rule itself |
| Any (fixture) | An inert fixture file under product_testing/fixtures/, passed to the code under test as a file path or byte stream — the file's whole real-world payload is the case |
The first five rows pair an assertion type with the case source it normally takes. The Fixture row is cross-cutting: any assertion type may use an inert fixture file as the case. An auditor classifying a test case checks both the assertion type and whether the case is a whole-payload fixture file.
A case that does not have a documentable source outside the author's head is a tautology dressed as a measurement — the test confirms the author's understanding forever, never the spec. The defect is in the case's origin. Lexical location (Final at module scope, plain assignment, inline literal), syntactic form, and reuse pattern (shared bag, single-value, parametrize row) are irrelevant — the audit names them only as forms the same defect takes.
Vocabulary check (where the values live). Independently of case provenance, the values used in cases must come from the right home:
| Value kind | Lives in |
|---|---|
| Production vocabulary (labels, paths, schema fields, tokens) | Owning production module, imported |
| Variable input domain | Generator under product_testing/generators/ |
| Resource-bound or runner-tuning value (timeouts, retries) | The harness module that owns the resource, under product_testing/harnesses/ |
| Whole-payload real-world sample | Inert fixture file under product_testing/fixtures/, read by path |
| One-off descriptive text (test titles, diagnostic messages) | Inline in the test function body |
For each test case, name the source. REJECT against the missing source when:
When the missing source is an architectural defect (the Python module that should own the vocabulary does not yet exist), name the module that should be created and the spec-tree node that should govern it.
Pass only when every case is traceable to a source independent of the author and every value lives in its proper home. </source_ownership_audit>
<generator_audit> Audit every imported generator:
Property evidence requires a meaningful property. @given that only checks for lack of exceptions is insufficient.
</generator_audit>
<harness_audit> Audit every imported harness:
Pytest fixture callables that perform setup, teardown, cleanup, or dependency access are harness entrypoints. They belong under product_testing.harnesses.*.
</harness_audit>
<fixture_audit> Audit inert fixture files and fixture path providers:
Reject Python modules under product_testing/fixtures/ that export pytest fixture body functions. That category is for inert data files under the PDR vocabulary.
</fixture_audit>
<conftest_audit>
Inspect every conftest.py that applies to the test path.
Allowed content:
product_testing.harnesses.*Rejected content:
</conftest_audit>
<architectural_dry_audit> When two or more in-scope tests repeat setup or infrastructure logic, reject the duplication and identify the canonical destination:
| Repeated pattern | Destination |
|---|---|
| Temp product scaffolding | product_testing.harnesses.* |
| Subprocess or CLI execution setup | product_testing.harnesses.* |
| Database, Docker, browser, service, or API setup | product_testing.harnesses.* |
| Domain-shaped input construction with variable values | product_testing.generators.* |
| Real-world payload samples | product_testing/fixtures/ as data |
Do not recommend tests/helpers, tests/support, node-local helper modules, or fixture body code in conftest.py.
</architectural_dry_audit>
<verdict_format>
Follow <verdict_format> in /auditing-tests.
For each finding, include:
Emit APPROVED only when Gate 0 and all evidence-property checks pass. Emit REJECT when any property fails.
</verdict_format>
<failure_modes>
Failure 1: Accepted TYPE_CHECKING import as coupling.
Claude saw from product.theme import ThemeColor inside an if TYPE_CHECKING: block and counted it as runtime coupling. The test declared its own color values and never executed production code. Avoid this by ignoring type-only imports for coupling.
Failure 2: Missed coupling severed by @patch.
Claude saw a production import and approved the test, while @patch("product.database.query") replaced the imported behavior. Avoid this by checking decorators, fixtures, monkeypatch usage, and harness setup code.
Failure 3: Accepted a generator that only hid a constant.
Claude saw a Hypothesis strategy and treated it as property evidence. The strategy returned one copied source value through st.just(...). Avoid this by inspecting generator bodies and requiring meaningful variation.
Failure 4: Accepted pytest fixture body code in conftest.py.
Claude treated pytest discovery as a reason to put setup logic in conftest.py. The PDR requires harness logic to live in the product's test-infrastructure implementation home. Avoid this by checking every applicable conftest.py.
Failure 5: Accepted a hand-picked test case as evidence.
Claude saw parse("name=alice") followed by assert result.name == "alice" and approved the test. The string "name=alice" was invented by the author to demonstrate their understanding of the parser. The same author wrote (or read) the parser. Every future run confirms the author's understanding — that "name=alice" parses to a record with name field equal to "alice" — never the spec assertion about parser correctness. Avoid this by asking, for every case, where the case comes from: a generator, an oracle, a fixture, source-owned vocabulary, or the spec assertion text itself. If the answer is "the author chose it because it seemed reasonable", REJECT.
Failure 6: Container-literal keys treated as opaque scaffolding.
Claude saw INVENTORY_JSON = f'{{"flatcar-version":"{VERSION}",...}}' and classified it as test-fixture scaffolding because the values were synthetic. The keys were production-owned label vocabulary the author hand-wrote into the template — hand-picked cases for the parser or consumer that reads the JSON. Avoid this by auditing keys and values separately; the construction json.dumps({LABEL: synthetic_value, ...}) with LABEL imported is the only legitimate form.
Failure 7: "Artifact is the source-of-truth" rationalization.
Claude saw a test that hand-copied a YAML field name ("flatcar-version"), an HCL attribute, or a systemd unit path. The value appeared in a parsed artifact file but no Python module owned it. Claude classified the artifact as the source-of-truth and accepted the case. The artifact is downstream — a Python module either renders or consumes it, and the absence of that Python module is the architectural defect, not the test's fault for finding nothing to import. Avoid this by naming the missing source-of-truth module and the spec-tree node that should govern it; REJECT against the missing module.
</failure_modes>
<success_criteria> A Python test audit succeeds when:
conftest.py is limited to discovery, registration, and explicit harness importsAPPROVED</success_criteria>
Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub outcomeeng/plugins --plugin python