From ajbm-dev
Applies the Testing Trophy strategy with 15 principles for trustworthy unit, integration, and e2e tests. Enforces disciplined mocking, anti-pattern detection, and honest reporting.
Slash command
/ajbm-dev:testing-best-practices
Use this skill to produce test evidence that is trustworthy, reproducible, and useful for decisions.
Apply it whenever you:
Scope boundary: This skill covers test quality — what makes a good test, how to choose test levels, and how to verify tests are trustworthy. For the test-first process (RED-GREEN-REFACTOR), use test-driven-development.
Run the Oracle Check after every test. State unknowns explicitly. Execute, don't just suggest.
Pick one mode immediately, then run the matching workflow:
- plan: build risk map and test strategy (apply Tier A principles)
- write: implement or update tests for changed behavior (apply Tier B principles)
- review: audit existing tests for quality and trustworthiness (apply Tier C principles)
- report: summarize exactly what was executed and what remains
If user intent is mixed, run plan then continue with write.
1. ALWAYS investigate when a test passes on first run — verify it tests the right thing.
2. ALWAYS assert on real system behavior, not mock wiring.
3. ALWAYS keep tests as straight-line code — no conditionals, loops, or try/catch.
4. ALWAYS execute tests and report concrete evidence before claiming they pass.
5. ALWAYS keep production APIs clean — move test lifecycle helpers to test harnesses.
Confidence-per-effort, highest to lowest:
┌───────┐
│ e2e │ Few: critical user journeys only
┌─┴───────┴─┐
│Integration │ Most: contracts, persistence, boundaries
┌─┴────────────┴─┐
│ Unit │ Many: pure logic and branching
┌─┴────────────────┴─┐
│ Static Analysis │ All: types, linting, formatting
└────────────────────┘
Decision heuristic: Use the lowest test level that still proves the behavior.
Prefer integration tests when uncertain. They catch real bugs with fewer mocks.
For property-based testing, contract testing, and when e2e is worth the cost, see references/testing-trophy.md.
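The 15 principles are organized into three tiers that map to operating modes: Tier A (principles 1-4) for plan, Tier B (principles 5-11) for write, and Tier C (principles 12-15) for review.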
| # | Principle | Rule |
|---|---|---|
| 1 | Mostly Integration | Integration tests give highest confidence-per-effort. Default to them when uncertain. |
| 2 | The Beyonce Rule | "If you liked it, shoulda put a test on it." Test everything you value: performance, security, error paths. |
| 3 | Test Boundaries and Errors | Every non-trivial change needs: happy path + failure path + edge case. |
| 4 | Hermetic Tests | Self-contained, order-independent, no shared mutable state. Each test sets up and tears down its own world. |
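Principle 3 (happy path + failure path + edge case) can be sketched as below. This is a minimal illustration, not the skill's reference material: `parse_port` and `ParsePortTest` are hypothetical names invented for the example, and only the standard-library `unittest` module is assumed.

```python
import unittest


def parse_port(text: str) -> int:
    """Hypothetical function under test: parse a TCP port number from text."""
    value = int(text)  # int() raises ValueError for non-numeric text
    if not 1 <= value <= 65535:
        raise ValueError(f"port out of range: {value}")
    return value


class ParsePortTest(unittest.TestCase):
    # Each test builds its own input and shares no state with the others.

    def test_valid_port_is_returned_as_int(self):
        # Happy path
        self.assertEqual(parse_port("8080"), 8080)

    def test_non_numeric_text_is_rejected(self):
        # Failure path
        with self.assertRaises(ValueError):
            parse_port("http")

    def test_highest_valid_port_is_accepted(self):
        # Edge case: upper boundary, inclusive
        self.assertEqual(parse_port("65535"), 65535)

    def test_port_just_above_range_is_rejected(self):
        # Edge case: first value past the boundary
        with self.assertRaises(ValueError):
            parse_port("65536")
```

Because no test depends on another, the four cases can run in any order, which is also what principle 4 asks for.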
| # | Principle | Rule |
|---|---|---|
| 5 | Test Behavior, Not Implementation | Assert on observable outcomes (return values, persisted state, API output). If refactoring breaks the test, the test was wrong. |
| 6 | Real Over Mock | Prefer: Real > Fake > Spy > Mock. Mock only at external or nondeterministic boundaries. |
| 7 | One Behavior Per Test | Each test is a single given/when/then. Test name reads as a sentence describing the behavior. |
| 8 | Test State, Not Interactions | Verify WHAT the result is, not HOW the system got there. verify(mock).called() is almost always wrong. |
| 9 | DAMP Over DRY | Descriptive And Meaningful Phrases. Duplicate freely if it makes each test self-contained and readable. |
| 10 | Straight-Line Tests | No conditionals, loops, try/catch, or computed expected values in tests. Every path through a test is the same path. |
| 11 | Clear Failure Messages | The failure message alone should tell you what went wrong, at 3 AM, without reading the test code. |
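As a rough sketch of principles 10 and 11 together, the tests below use literal expected values and failure messages that explain themselves. `shipping_cost` and its pricing rule are assumptions made up for this example.

```python
def shipping_cost(weight_kg: float) -> float:
    """Hypothetical function under test.

    Assumed rule: flat 5.00 up to 1 kg, then 2.00 for each additional kg.
    """
    if weight_kg <= 1:
        return 5.00
    return 5.00 + 2.00 * (weight_kg - 1)


def test_parcel_of_1_kg_costs_the_flat_rate():
    cost = shipping_cost(1)
    assert cost == 5.00, f"1 kg parcel should cost the flat 5.00, got {cost}"


def test_parcel_of_3_kg_costs_base_plus_two_extra_kg():
    cost = shipping_cost(3)
    # The expected value is a literal worked out from the pricing rule by hand.
    # Recomputing it in the test (5.00 + 2.00 * (3 - 1)) would copy the
    # implementation and any bug in it.
    assert cost == 9.00, (
        f"3 kg parcel should cost 9.00 (5.00 base + 2 extra kg at 2.00), got {cost}"
    )
```

Neither test contains a branch or a loop, so every run takes the same path, and a failure message names the input, the expected value, and the actual value.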
| # | Principle | Rule |
|---|---|---|
| 12 | Deterministic Always | Same code + same test = same result. Control time, randomness, network, ordering. |
| 13 | Tests Are Documentation | Write tests a stranger would want to read while debugging. They document expected behavior. |
| 14 | Investigate First-Run Passes | A test that passes immediately proves nothing. Verify it tests the right thing — not the current (possibly buggy) behavior. |
| 15 | Survive Refactoring | If refactoring internals breaks a test without changing behavior, that test is testing implementation details. |
For code examples, exceptions, and language-specific notes on each principle, see references/principles.md.
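One common way to satisfy principle 12 is to pass time and randomness in as arguments rather than reading them from the environment. The sketch below assumes a hypothetical `make_token` function; it is not taken from references/principles.md.

```python
import random
from datetime import datetime, timezone


def make_token(now: datetime, rng: random.Random) -> str:
    """Hypothetical function under test: date prefix plus four random digits.

    The clock value and the random source are parameters, so a test can
    control both.
    """
    return f"{now:%Y%m%d}-{rng.randrange(10_000):04d}"


def test_token_prefix_comes_from_the_injected_clock():
    fixed_now = datetime(2024, 2, 29, 23, 59, tzinfo=timezone.utc)
    token = make_token(fixed_now, random.Random(7))
    assert token.startswith("20240229-"), f"expected prefix 20240229-, got {token!r}"


def test_same_seed_yields_the_same_token():
    fixed_now = datetime(2024, 2, 29, 23, 59, tzinfo=timezone.utc)
    first = make_token(fixed_now, random.Random(7))
    second = make_token(fixed_now, random.Random(7))
    assert first == second, f"seeded runs diverged: {first!r} vs {second!r}"
```

Production code would call `make_token(datetime.now(timezone.utc), random.Random())`; only the call site changes, not the function.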
Before adding a mock, answer all five:
If answers are unclear, do not mock yet.
The hierarchy — prefer left:
Real dependency → Fake (in-memory impl) → Spy (real + recording) → Stub (canned response) → Mock (behavior verification)
For framework-specific mocking patterns (Jest, pytest, gomock) and the mock audit checklist, see references/mocking-guide.md.
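To illustrate the "Fake" step of the hierarchy, here is a minimal sketch that swaps a database-backed repository for an in-memory one and asserts on resulting state instead of on calls. `InMemoryUserRepo` and `register_user` are hypothetical names for this example only.

```python
class InMemoryUserRepo:
    """Fake: a small working implementation of a hypothetical repository."""

    def __init__(self):
        self._users = {}

    def save(self, email: str, name: str) -> None:
        self._users[email] = name

    def find(self, email: str):
        return self._users.get(email)


def register_user(repo, email: str, name: str) -> str:
    """Hypothetical code under test: normalize the email, reject duplicates, store."""
    normalized = email.strip().lower()
    if repo.find(normalized) is not None:
        raise ValueError(f"already registered: {normalized}")
    repo.save(normalized, name)
    return normalized


def test_registration_stores_user_under_normalized_email():
    repo = InMemoryUserRepo()
    register_user(repo, "  Ada@Example.com ", "Ada")
    # Assert on state the caller can observe, not on how save() was invoked.
    stored = repo.find("ada@example.com")
    assert stored == "Ada", f"expected 'Ada' stored under normalized email, got {stored!r}"
```

If `register_user` were later refactored to batch writes or call `save` differently, this test would keep passing as long as the stored result is unchanged.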
The problem: AI reads code, infers "expected" behavior from it, writes a test matching the current (possibly buggy) output. 68% of AI-generated test suites validate bugs this way.
The protocol (mandatory during write mode):
toBeDefined() instead of toEqual(specificValue)) → strengthen assertion
The test must encode what the code SHOULD do, not what it DOES do.
| # | Pattern | Signal | Fix |
|---|---|---|---|
| 1 | Mock behavior testing | Assertions on *-mock artifacts only | Assert real system outputs |
| 2 | Over-mocking | Setup larger than assertion intent | Mock only external boundaries |
| 3 | Unrealistic fixtures | Partial objects, impossible state | Contract-complete fixtures |
| 4 | Test-only production methods | resetForTests, destroyForTest | Move to test harness |
| 5 | Implementation-detail assertions | Private method call checks | Assert observable outcomes |
| 6 | Snapshot overreach | Huge snapshots as primary signal | Assert critical fields explicitly |
| 7 | Flaky async/time tests | sleep-based waits, race conditions | Control clock, wait on conditions |
| 8 | Silent failures | expect(true).toBe(true), catch-ignore | Fail loudly with diagnostics |
| 9 | Skip debt | Skipped tests without tracking | Skip only with reason + ticket |
| 10 | Coverage theater | High line coverage, weak behavior coverage | Prioritize branch decisions and invariants |
| 11 | Missing regression tests | Bug fixed without pinning failure | Add regression test with fix |
| 12 | Missing boundary contracts | Only unit mocks around APIs | Add integration/contract tests |
| 13 | Assertion roulette | Multiple unrelated assertions, no clear failure message | One behavior per test, clear messages |
| 14 | Circular oracle | Test validates current behavior, not correct behavior | State expected behavior first (Oracle Guard) |
| 15 | Conditional test logic | if/else, loops, try/catch in test body | Straight-line code only |
For full details with code examples, signals, and corrections, see references/anti-patterns.md.
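For anti-pattern 7, one possible fix is to wait on the condition itself with a bounded timeout instead of sleeping for a guessed duration. The sketch below uses the standard-library `threading.Event`; `start_worker` is a hypothetical background job invented for the example.

```python
import threading


def start_worker(done: threading.Event, results: list) -> threading.Thread:
    """Hypothetical background job: record a result, then signal completion."""

    def run():
        results.append("processed")
        done.set()

    thread = threading.Thread(target=run)
    thread.start()
    return thread


def test_worker_result_is_recorded_once_done_is_signalled():
    done = threading.Event()
    results = []
    thread = start_worker(done, results)

    # Block until the worker signals, with a generous upper bound.
    # A fixed time.sleep(0.5) here would be slow on fast machines and
    # flaky on slow ones.
    finished = done.wait(timeout=5.0)
    thread.join(timeout=5.0)

    assert finished, "worker did not signal completion within 5 seconds"
    assert results == ["processed"], f"unexpected worker output: {results!r}"
```

The timeout only bounds the failure case; on a passing run the test returns as soon as the event is set.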
Use the Required Output Template exactly. Include all fields, no omissions.
When executing tests, run in layers:
Do not mark complete until all are true:
Always include in reports:
Required phrases:
- I did not run tests.
- I ran targeted tests only.
- Tests are currently failing: followed by the list.
- I accepted N first-run passing tests after investigation.
Forbidden phrases when used without evidence: "should pass", "looks good", "probably fixed", "ready".
Testing Summary
- Mode: <plan|write|review|report>
- Commands run:
  - <command>
- Scope:
  - <files/suites>
- Results:
  - <N passed, M failed, K skipped>
- Oracle Check:
  - <N tests investigated for first-run pass, M verified, K flagged>
- Risks covered:
  - <behaviors validated>
- Gaps / limitations:
  - <what was not verified>
- test-driven-development for the RED-GREEN-REFACTOR cycle
- systematic-debugging for investigating test failures
- authoring-skills for testing skills
Evidence is the product.
If evidence is weak, improve tests. If evidence is missing, say so directly. If a test passed on first run, prove it should have.
npx claudepluginhub ajbmachon/ajbm-skills --plugin ajbm-dev