falsegreen-skill

LLM-based semantic analysis for false-positive test detection. Companion to falsegreen, the Python static scanner.

For Python, this skill applies the complete falsegreen catalog directly — all structural and semantic patterns — via LLM analysis, without requiring the static scanner to run first. For TypeScript and JavaScript, it is the primary detection tool.

Why this exists

A test suite with 100% green tests is not a proof of correctness. It is a proof that no test failed — which is a different thing. Tests can pass permanently not because the code is right, but because the test never checks anything meaningful.

Existing tools catch some of these cases. Linters like ruff or flake8-pytest-style catch syntax-level patterns: a bare assert True, a test with no assert statement, an unreachable block. Mutation testing tools like mutmut probe whether tests actually fail when the code changes. Both approaches have limits: linters cannot reason about test intent, and mutation testing requires the code to run.

This skill fills the gap between linters and mutation testing. It reads the test as text, reconstructs the intent, and asks six structural questions about whether the test can actually fail. The questions are derived from the taxonomy of false-positive test patterns collected in CREDITS.md.

The core insight: a test is useful if and only if there exists some incorrect implementation that would cause it to fail. If no such implementation exists — because the assertion is unreachable, tautological, or verifies the mock instead of the code — the test is structurally green regardless of whether the production code is correct.
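To make the criterion concrete, here is a minimal sketch. The names `add_correct`, `add_broken`, `vacuous_test`, `meaningful_test`, and `passes` are hypothetical and exist only for illustration. It runs two tests against a correct implementation and a deliberately broken one, and only the test with an independent expected value can tell them apart.

```python
def add_correct(a, b):
    return a + b


def add_broken(a, b):
    # Deliberately wrong implementation, standing in for "the code broke".
    return a - b


def vacuous_test(add):
    # Tautological: the result is compared with itself, so this cannot fail.
    result = add(2, 3)
    assert result == result


def meaningful_test(add):
    # The expected value 5 is worked out by hand, not derived from the code.
    assert add(2, 3) == 5


def passes(test, implementation):
    """Return True if `test` passes against `implementation`."""
    try:
        test(implementation)
    except AssertionError:
        return False
    return True
```

Under these assumptions, `passes(vacuous_test, add_broken)` is true while `passes(meaningful_test, add_broken)` is false: no incorrect implementation can make the vacuous test fail, which is what "structurally green" means here.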

The methodology

One rule underlies every judgment: a test is useful only if it can fail when the code breaks.

The six-judgment framework (J1-J6) makes this rule concrete:

| # | Question | Catches |
|---|----------|---------|
| J1 | Does the assertion run? | Dead assertions, vacuous loops, swallowed failures |
| J2 | Is the expected value from an independent oracle? | Echo mocks, formula re-implementation, spec contradictions |
| J3 | Is the real unit under test, not a mock of it? | Mock-the-SUT, self-confirming literals |
| J4 | Does the assertion verify enough? | Truthiness-only, `len > 0`, repr coupling, broad raises |
| J5 | Is the test coupled to implementation internals? | Positional mock args, private method testing |
| J6 | Does the test pass in isolation, without ordering? | Shared mutable state, test-order dependency |
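As an illustration of J2 and J3, the sketch below contrasts an echo mock with a test that exercises real code against a hand-computed expectation. The function `get_discounted_price`, its 10% discount rule, and the `base_price` method are hypothetical, invented for this example.

```python
from unittest.mock import Mock


def get_discounted_price(pricing_service, sku):
    # Hypothetical unit under test: applies a 10% discount to the base price.
    return round(pricing_service.base_price(sku) * 0.9, 2)


def test_echo_mock():
    # J3 fails: the unit under test is itself replaced by a mock.
    sut = Mock(return_value=90.0)
    # J2 fails: the expected value is just the mock's configured return value.
    assert sut("sku-1") == 90.0


def test_independent_oracle():
    # Only the collaborator is mocked; the real function runs.
    service = Mock()
    service.base_price.return_value = 100.0
    # 90.0 is derived by hand from the stated rule (10% off 100.0).
    assert get_discounted_price(service, "sku-1") == 90.0
```

`test_echo_mock` stays green even if `get_discounted_price` is deleted, because it never calls it. `test_independent_oracle` fails as soon as the discount logic changes.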

A test is flagged HIGH only when the first failed judgment has no plausible legitimate interpretation. A test is flagged LOW when a smell is likely but a legitimate intent remains plausible. Everything else is PASS.

Precision over recall. One wrong flag on a legitimate test costs more goodwill than a missed smell. Exemptions are explicit:

- Semantic case 18 requires a cited independent oracle (spec, docstring, API contract). Without a citation, do not report case 18.
- Characterization tests, which intentionally freeze current behavior, are not false positives.
- Boolean predicates (`isinstance`, `.exists()`, `.is_dir()`) are not weak assertions.
- In HTTP/UI layer tests, a truthiness check on a response object means "the request succeeded" and is meaningful.
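The boolean-predicate exemption can be sketched as follows. The helpers `write_report`, `write_report_broken`, `check_exists`, and `check_truthy` are hypothetical names used only for this example. A predicate such as `.exists()` states the fact under test, whereas a bare truthiness check on a `Path` object passes regardless of whether a file was written.

```python
import tempfile
from pathlib import Path


def write_report(directory: Path) -> Path:
    # Hypothetical unit under test: writes a report and returns its path.
    path = directory / "report.txt"
    path.write_text("total=3\n")
    return path


def write_report_broken(directory: Path) -> Path:
    # Deliberately broken variant: returns the path without creating the file.
    return directory / "report.txt"


def check_exists(writer):
    with tempfile.TemporaryDirectory() as tmp:
        # Boolean predicate: fails if the file was not actually created.
        assert writer(Path(tmp)).exists()


def check_truthy(writer):
    with tempfile.TemporaryDirectory() as tmp:
        # Truthiness-only: any Path object is truthy, so this cannot fail.
        assert writer(Path(tmp))
```

In this sketch `check_exists` distinguishes the two writers and `check_truthy` does not, which is why the first should not be flagged as a weak assertion.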

Full protocol: SKILL.md.

What it detects

Python — structural patterns (complete falsegreen catalog)

Family A — The test never checks anything

| Code | Pattern | Example |
|------|---------|---------|
| C1 | Assert inside `if`/`for` that may not run | `if items: assert items[0].valid` when `items` can be `[]` |
| C2 | No assertion at all | test body contains only setup calls |
| C2b | Calls SUT but discards result | `result = process(x)`, where `result` is never asserted |
| C3 | Assert inside `try` whose `except` swallows it | `except Exception: pass` catches `AssertionError` |
| C4 | Test function nested inside another function | pytest does not collect inner defs |
| C4b | Test class with `__init__` | pytest skips classes that have `__init__` |
| C20 | Assertion after unconditional `return`/`raise` | dead code, never runs |
| C21 | Every assert is conditional, none runs unconditionally | all asserts inside `if/else` branches |
| CC | Commented-out assertion | `# assert result == 42` |

Family B — The check is weak or always true
