falsegreen-skill

LLM-based semantic analysis for false-positive test detection. Companion
to falsegreen, the Python static
scanner.
For Python, this skill applies the complete falsegreen catalog directly — all
structural and semantic patterns — via LLM analysis, without requiring the
static scanner to run first. For TypeScript and JavaScript, it is the primary
detection tool.
Why this exists
A test suite with 100% green tests is not a proof of correctness. It is a
proof that no test failed — which is a different thing. Tests can pass
permanently not because the code is right, but because the test never checks
anything meaningful.
Static analysis tools catch some of these cases. Linters like ruff or
flake8-pytest-style catch syntax-level patterns: a bare assert True, a
missing assert call, an unreachable block. Mutation testing tools like
mutmut probe whether tests actually fail when the code changes. Both
approaches have limits: linters cannot reason about test intent, and
mutation testing requires the code to run.
This skill fills the gap between linters and mutation testing. It reads the
test as text, reconstructs the intent, and asks six structural questions about
whether the test can actually fail. The questions are derived from the
taxonomy of false-positive test patterns collected in
CREDITS.md.
The core insight: a test is useful if and only if there exists some incorrect
implementation that would cause it to fail. If no such implementation exists —
because the assertion is unreachable, tautological, or verifies the mock
instead of the code — the test is structurally green regardless of whether the
production code is correct.
The methodology
One rule underlies every judgment: a test is useful only if it can fail when
the code breaks.
The six-judgment framework (J1-J6) makes this rule concrete:
| # | Question | Catches |
|---|
| J1 | Does the assertion run? | Dead assertions, vacuous loops, swallowed failures |
| J2 | Is the expected value from an independent oracle? | Echo mocks, formula re-implementation, spec contradictions |
| J3 | Is the real unit under test, not a mock of it? | Mock-the-SUT, self-confirming literals |
| J4 | Does the assertion verify enough? | Truthiness-only, len > 0, repr coupling, broad raises |
| J5 | Is the test coupled to implementation internals? | Positional mock args, private method testing |
| J6 | Does the test pass in isolation, without ordering? | Shared mutable state, test-order dependency |
A test is flagged HIGH only when the first failed judgment has no plausible
legitimate interpretation. A test is flagged LOW when the smell is likely but
has plausible intent. Everything else is PASS.
Precision over recall. One wrong flag on a legitimate test costs more
goodwill than a missed smell. Exemptions are explicit:
- Semantic case 18 requires a cited independent oracle (spec, docstring, API
contract). Without a citation, do not report case 18.
- Characterization tests — intentionally freezing current behavior — are not
false positives.
- Boolean predicates (
isinstance, .exists(), .is_dir()) are not weak assertions.
- In HTTP/UI layer tests, a truthiness check on a response object means
"the request succeeded" and is meaningful.
Full protocol: SKILL.md.
What it detects
Python — structural patterns (complete falsegreen catalog)
Family A — The test never checks anything
| Code | Pattern | Example |
|---|
| C1 | Assert inside if/for that may not run | if items: assert items[0].valid when items can be [] |
| C2 | No assertion at all | test body contains only setup calls |
| C2b | Calls SUT but discards result | result = process(x) — result never asserted |
| C3 | Assert inside try whose except swallows it | except Exception: pass catches AssertionError |
| C4 | Test function nested inside another function | pytest does not collect inner defs |
| C4b | Test class with __init__ | pytest skips classes that have __init__ |
| C20 | Assertion after unconditional return/raise | dead code, never runs |
| C21 | Every assert is conditional, none runs unconditionally | all asserts inside if/else branches |
| CC | Commented-out assertion | # assert result == 42 |
Family B — The check is weak or always true