From test-quality-tools
Audit, harden, or generate unit tests with a focus on mutation-resistance and durable quality — not just coverage %. Use when the user asks to improve or harden test quality, make tests less fragile/brittle, review tests for anti-patterns (error-message-substring asserts, private-symbol access, tautological constructor readbacks, recomputed-crypto expectations, over-mocking, missing boundary tests, unrolled cases that should be parametrized), generate a high-quality suite for an untested module, or generally "make my tests better". Measures a suite against a validated quality scorecard and iterates until it plateaus, holding coverage as a non-regression floor and REPL-verifying library assumptions. Works across stacks — Python/pytest (validated), JavaScript/TypeScript (Jest, Vitest, Mocha/Chai, node:test), and Go — with a per-language scorer; the rubric is the same everywhere.
Slash command: /test-quality-tools:test-quality
Bring a target's tests up to a durable-quality bar. "Quality" means tests that fail when behavior breaks and survive when it's only refactored — the opposite of coverage-chasing suites that hit 100% yet catch nothing.
This skill encodes the result of a controlled experiment: pointing test generation at a multi-axis quality scorecard (rather than a coverage number) produced suites that beat human-written baselines on the auto-countable axes in 9 of 9 Python suites (8 of 9 with the model held fixed — the rubric, not the model, drove the gain). The two reference docs are that experiment's distilled output:
references/quality-contract.md: 10 anti-fragility rules, each with the repair.
references/scorecard.md: the scoring axes, the improvement gate, the stop condition.
scripts/score.py: measures the auto-countable axes for any pytest suite.
Read both reference docs before starting. They are the substance; this file is the procedure.
Coverage is a floor, not a goal. Once the suite is at or above its starting coverage, more coverage doesn't count as improvement. What counts is moving the quality axes: fewer fragility patterns, more rigor signals, less real mocking, better LOC efficiency — with every test traceable to a user-observable contract.
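A minimal sketch of why coverage alone is a weak signal. The function `is_adult`, its mutant, and both test helpers are hypothetical names made up for this illustration; they are not part of the skill. The weak test executes every line of the function yet cannot tell the original from a `>=` to `>` mutation, while the boundary test can.

```python
def is_adult(age: int) -> bool:
    """Contract under test: 18 and over counts as adult."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Same function with `>=` mutated to `>`, as a mutation tool might do."""
    return age > 18


def weak_test(fn) -> bool:
    """Runs every line of fn, but only with values far from the boundary."""
    return fn(30) is True and fn(5) is False


def boundary_test(fn) -> bool:
    """Pins the boundary value itself and its two neighbours."""
    cases = [(17, False), (18, True), (19, True)]
    return all(fn(age) is expected for age, expected in cases)


if __name__ == "__main__":
    # Both tests pass on the real implementation.
    print(weak_test(is_adult), boundary_test(is_adult))
    # Only the boundary test notices the mutant.
    print(weak_test(is_adult_mutant), boundary_test(is_adult_mutant))
```

In a real pytest suite the three cases would normally be written with `pytest.mark.parametrize` rather than a hand-rolled loop; plain functions are used here only to keep the sketch dependency-free.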
Python (pytest): python -m pytest <tests> --cov=<src> --cov-branch
Jest: npx jest --coverage · Vitest: npx vitest run --coverage
Mocha: npx c8 --check-coverage mocha (c8/nyc for coverage)
Go: go test -cover -coverprofile=cover.out ./... && go tool cover -func=cover.out
Read pyproject.toml/pytest.ini, package.json (scripts.test, jest/vitest config), go.mod, Makefile to find the project's real command; prefer it over the defaults above.
Identify the source package and its tests dir. Detect the language/framework (this sets the score.py --lang profile: python, js, or go) and the coverage-enabled test command. Determine the mode:
python <skill>/scripts/score.py --tests <tests_dir> [--lang python|js|go] for the auto axes (--lang auto-detects if omitted).
--baseline <that copy>.
Read the tests (or, for generate mode, the source) and inventory:
score.py.
<, <=, >, >=, ==, != in the source and check each has a boundary test), B.3 real-I/O fixture use, E.3 contract-naming.
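The comparison-operator inventory mentioned above could be approximated with the standard-library `ast` module. This is a rough sketch under that assumption, not what score.py does; `list_comparisons` and `SAMPLE` are hypothetical names introduced for the example.

```python
import ast

# Map AST comparison node types to the operator text.
_OPS = {
    ast.Lt: "<", ast.LtE: "<=", ast.Gt: ">",
    ast.GtE: ">=", ast.Eq: "==", ast.NotEq: "!=",
}


def list_comparisons(source: str) -> list[tuple[int, str]]:
    """Return (line number, operator) for each boundary-relevant comparison."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Compare):
            for op in node.ops:
                symbol = _OPS.get(type(op))
                if symbol is not None:  # skips `in`, `is`, and similar
                    found.append((node.lineno, symbol))
    return sorted(found)


# Hypothetical source under audit.
SAMPLE = """
def shipping(total, items):
    if total >= 50:
        return 0
    if items == 0:
        return None
    return 5
"""

if __name__ == "__main__":
    for lineno, symbol in list_comparisons(SAMPLE):
        print(f"line {lineno}: {symbol}")
```

Each reported line is a candidate for a boundary test. Whether one already exists still has to be checked by reading the tests.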
Produce a concrete findings list: file:line → which rule → the repair.
Loop. Each round, make ONE focused, justified move:
Rules for the loop:
python -c "...", node -e "..." (or a scratch test you delete), go run a snippet. Don't assert from memory. Keep a short log of what you checked.
unittest.mock; parametrize instead of unrolling.
references/scorecard.md): keep the move only if the coverage floor holds, an A-axis dropped or a B-axis rose, nothing regressed, and every new test maps to a stated contract. If a move fails the gate, revert it; a no-op round is not progress.
score.py --tests <tests> --baseline <start>).
Stop when 3 consecutive rounds can't produce a gated improvement, OR there are no contract violations left and every source boundary has a test.
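The gate and the first half of the stop condition can be sketched as two small functions. This is one reading of the prose above, not the logic in references/scorecard.md or score.py. The `Snapshot` class, the axis keys, and the assumption that A-axes count fragility (lower is better) while B-axes count rigor (higher is better) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical summary of one scoring run; not score.py's real output."""
    coverage: float                                       # percent, used only as a floor
    a_axes: dict[str, int] = field(default_factory=dict)  # fragility counts, lower is better
    b_axes: dict[str, int] = field(default_factory=dict)  # rigor signals, higher is better


def passes_gate(before: Snapshot, after: Snapshot, floor: float,
                new_tests_map_to_contracts: bool) -> bool:
    """Keep a move only if all four conditions from the text hold."""
    floor_holds = after.coverage >= floor
    a_dropped = any(after.a_axes.get(k, 0) < v for k, v in before.a_axes.items())
    b_rose = any(v > before.b_axes.get(k, 0) for k, v in after.b_axes.items())
    a_regressed = any(v > before.a_axes.get(k, 0) for k, v in after.a_axes.items())
    b_regressed = any(after.b_axes.get(k, 0) < v for k, v in before.b_axes.items())
    return (floor_holds
            and (a_dropped or b_rose)
            and not (a_regressed or b_regressed)
            and new_tests_map_to_contracts)


def should_stop(gate_results: list[bool], patience: int = 3) -> bool:
    """Stop once the last `patience` rounds all failed the gate."""
    return len(gate_results) >= patience and not any(gate_results[-patience:])
```

The second stop condition (no contract violations left and every source boundary tested) depends on reading the code and is deliberately left out of the sketch.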
== to in, is a regression even if the score "improves". The gate's rule 3 exists to catch exactly this.
js (Jest/Vitest/Mocha/node:test) and go profiles apply the same axes with heuristic regexes: trustworthy for trends and worst-offenders, but lean harder on reading the tests, and treat the W/L/T tally as indicative, not authoritative.
PROFILES in scripts/score.py (file globs + a test_def regex + the per-axis regexes):
it's a self-contained dict per language.
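For orientation, a per-language profile of the kind described (file globs, a test-definition regex, per-axis regexes) might look roughly like the sketch below. The key names, axis names, and regexes are illustrative guesses, not copied from scripts/score.py; check the real dict before extending it.

```python
import re

# Hypothetical shape only: key names and regexes are illustrative,
# not taken from scripts/score.py.
PROFILES = {
    "python": {
        "file_globs": ["test_*.py", "*_test.py"],
        "test_def": re.compile(r"^\s*def (test_\w+)\s*\(", re.MULTILINE),
        "axes": {
            # fragility: reaching into a private attribute of another object
            "private_access": re.compile(r"\b\w+\._[a-zA-Z]\w*"),
            # rigor: parametrized cases
            "parametrize": re.compile(r"@pytest\.mark\.parametrize"),
        },
    },
}


def count_axes(test_source: str, lang: str) -> dict[str, int]:
    """Count test definitions and per-axis regex hits in one test file's text."""
    profile = PROFILES[lang]
    counts = {"tests": len(profile["test_def"].findall(test_source))}
    for axis, pattern in profile["axes"].items():
        counts[axis] = len(pattern.findall(test_source))
    return counts


# Hypothetical test file contents used to exercise the sketch.
SAMPLE = '''
import pytest

@pytest.mark.parametrize("n", [0, 1])
def test_counts(n):
    assert n >= 0

def test_private(cache):
    assert cache._store == {}
'''

if __name__ == "__main__":
    print(count_axes(SAMPLE, "python"))
```

Regex counting of this kind is heuristic, which is consistent with the advice above to treat the non-Python tallies as indicative rather than authoritative.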
npx claudepluginhub rollinsio/beyond-test-coverage --plugin test-quality-tools