From claude-commands
Defines test directory structure, decision principles for choosing testing layers (unit, E2E, MCP API, HTTP API, browser), and evidence implications. Use when creating tests or reviewing coverage.
Slash command: /claude-commands:testing-layers
This skill defines the **concrete test directory structure** for Your Project, the **decision principles** for choosing the right testing layer, and the **evidence implications** of each layer. Use this when creating new tests, reviewing test coverage, or evaluating `/es` evidence completeness.
| Layer | Directory | Runner | Count | Evidence Class |
|---|---|---|---|---|
| 1. Unit | $PROJECT_ROOT/tests/ | ./vpython -m pytest $PROJECT_ROOT/tests/test_*.py | ~295 files | Mock (no /es credit) |
| 1b. Unit (top-level) | tests/ | ./vpython -m pytest tests/test_*.py | ~4 files | Mock (no /es credit) |
| 2. End-to-End | $PROJECT_ROOT/tests/test_end2end/ | ./vpython -m pytest $PROJECT_ROOT/tests/test_end2end/ | ~30 files | Mock (no /es credit) — see /end2end-testing skill |
| 3. MCP API | testing_mcp/ | ./vpython testing_mcp/test_*.py --server http://127.0.0.1:8001 | ~139 files | Server + LLM (full /es) |
| 4. HTTP API | testing_http/ | ./vpython testing_http/test_*.py | ~25 files | Server (partial /es) |
| 5. Browser | testing_ui/ | ./vpython testing_ui/test_*.py | ~40 files | Server + LLM + Browser (full /es + video) |
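The Runner column can be captured as data so a helper picks the right command for a layer. This is a minimal sketch: the `RUNNERS` dict and `command_for` helper are hypothetical names introduced here for illustration, while the command strings are copied from the table above.

```python
from string import Template

# Runner commands copied from the table above. The dict itself is a
# hypothetical convenience, not something the repo is known to define.
RUNNERS = {
    "unit": "./vpython -m pytest $PROJECT_ROOT/tests/test_*.py",
    "unit-top-level": "./vpython -m pytest tests/test_*.py",
    "e2e": "./vpython -m pytest $PROJECT_ROOT/tests/test_end2end/",
    "mcp": "./vpython testing_mcp/test_*.py --server http://127.0.0.1:8001",
    "http": "./vpython testing_http/test_*.py",
    "browser": "./vpython testing_ui/test_*.py",
}


def command_for(layer: str, project_root: str) -> str:
    """Return the shell command for a layer with $PROJECT_ROOT filled in."""
    if layer not in RUNNERS:
        raise ValueError(f"unknown layer: {layer!r}")
    # Template only touches $PROJECT_ROOT; glob patterns are left for the shell.
    return Template(RUNNERS[layer]).substitute(PROJECT_ROOT=project_root)
```

Because the commands contain glob patterns such as `test_*.py`, the returned string is meant to be run through a shell rather than split into an argument list.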
| Library | Path | Purpose |
|---|---|---|
| MCP test base | testing_mcp/lib/base_test.py | MCPTestBase — server lifecycle, evidence bundle, provenance |
| MCP campaign utils | testing_mcp/lib/campaign_utils.py | finish_character_creation, process_action, get_campaign_state |
| MCP client | testing_mcp/lib/mcp_client.py | MCP protocol client for tools_call |
| Evidence utils | testing_mcp/lib/__init__.py | capture_provenance, create_evidence_bundle, write_with_checksum |
| Browser test base | testing_ui/lib/browser_test_base.py | BrowserTestBase — Playwright lifecycle, screenshots, video |
| HTTP test config | testing_http/lib/config.py | Server URL config, auth bypass |
- Layers 1 and 2 run with `TEST_MODE=mock` (via `run_tests.sh`).
- Real-mode layers run with `MCP_TEST_MODE=real`, `TEST_MODE=real`, `MOCK_SERVICES_MODE=false`, `TESTING_AUTH_BYPASS=true`.
- Tests in `testing_mcp/` or `testing_ui/` run in real mode only; this is a hard policy.

If the behavior under test is deterministic server code that runs the same way regardless of what the LLM said, a unit test is the correct tier. The LLM is just a trigger: you can simulate it with a dict.
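As an illustration of simulating the LLM with a dict, the sketch below unit-tests a small guard without any model call. `strip_level_keys` is a hypothetical function written for this example, not the project's actual guard.

```python
def strip_level_keys(llm_payload: dict) -> dict:
    """Hypothetical guard: drop the 'level' key the LLM must not set."""
    cleaned = dict(llm_payload)  # copy so the caller's dict is untouched
    cleaned.pop("level", None)
    return cleaned


def test_strip_level_keys_removes_level():
    # The "LLM response" is just a dict; no model call is needed.
    simulated_llm_output = {"narrative": "You win the duel.", "level": 99}

    result = strip_level_keys(simulated_llm_output)

    assert "level" not in result
    assert result["narrative"] == "You win the duel."
    # The input fixture is not mutated by the guard.
    assert simulated_llm_output["level"] == 99
```

The test is deterministic and runs in milliseconds, which is the property that qualifies this kind of logic for Layer 1.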
CRITICAL DEFAULT RULE: Write a unit test ONLY when we are 100% confident the logic can be tested entirely in isolation. The logic must be truly small and self-contained. Otherwise, strongly consider Layer 2 (End-to-End) first by default.
- `dict.pop("level")` works identically whether the dict came from Gemini or a test fixture
- `is_level_up_active()`

Real LLM tests prove the contract between model output and server consumption works. Mocks assume the shape is correct; real calls prove it.
Streaming, parsing, Firestore persistence, DOM rendering — bugs live in the glue, not the logic. If proving "data flows correctly through N layers," real E2E adds value.
- process_action → rewards engine → Firestore write → state read-back

If the behavior spans multiple functions, files, or subsystems but does NOT require real LLM judgment (e.g., standard request routing, validation pipelines, state serialization, or game state updates), Layer 2 (End-to-End) is the preferred layer. It provides high integration confidence across the callstack while remaining fast and deterministic via Mock LLMs. Strongly consider Layer 2 first by default unless the logic is truly small and self-contained.
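A Layer 2 test of this kind might look like the sketch below, which drives an action through rewards and persistence with a mock LLM. Every name here (`FakeLLM`, `InMemoryStore`, `apply_rewards`, `handle_action`) is a hypothetical stand-in; the project's real patterns live in the /end2end-testing skill.

```python
class FakeLLM:
    """Mock LLM: returns a canned, deterministic response."""

    def __init__(self, response: dict):
        self.response = response

    def generate(self, prompt: str) -> dict:
        return dict(self.response)


class InMemoryStore:
    """Stand-in for the persistence layer (assumed interface)."""

    def __init__(self):
        self._docs = {}

    def write(self, campaign_id: str, state: dict) -> None:
        self._docs[campaign_id] = dict(state)

    def read(self, campaign_id: str) -> dict:
        return dict(self._docs[campaign_id])


def apply_rewards(state: dict, llm_output: dict) -> dict:
    """Hypothetical rewards step: add awarded XP to the stored total."""
    new_state = dict(state)
    new_state["xp"] = state.get("xp", 0) + llm_output.get("xp_awarded", 0)
    return new_state


def handle_action(campaign_id: str, action: str, llm, store) -> str:
    """Hypothetical request path: read state, call LLM, reward, persist."""
    state = store.read(campaign_id)
    llm_output = llm.generate(f"Player action: {action}")
    store.write(campaign_id, apply_rewards(state, llm_output))
    return llm_output["narrative"]


def test_action_flows_through_rewards_and_persistence():
    store = InMemoryStore()
    store.write("c1", {"xp": 10})
    llm = FakeLLM({"narrative": "The goblin falls.", "xp_awarded": 25})

    narrative = handle_action("c1", "attack goblin", llm, store)

    assert narrative == "The goblin falls."
    # Read back through the store to prove the cross-function handoff worked.
    assert store.read("c1")["xp"] == 35
```

The assertion on the read-back value fails if any step in the chain is skipped, which is what distinguishes this from a unit test of `apply_rewards` alone.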
Mandatory coverage rule: When a PR creates or updates multiple non-test files under $PROJECT_ROOT/**, it must add or update a Layer 2 E2E test unless the PR explicitly justifies why the changed code is unreachable through an end-to-end application path. The E2E must exercise every newly introduced or modified production code path in that PR, including cross-file handoffs, and assertions must fail if any new path is skipped.
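The file-level half of this rule can be checked mechanically. The sketch below is one possible check, assuming a file counts as a test if it sits under a `tests` directory or is named `test_*`; `needs_e2e_update` is a hypothetical helper, and `mvp_site` in the usage note is only an example value for `$PROJECT_ROOT`.

```python
from pathlib import PurePosixPath


def _is_under(path: str, parent: PurePosixPath) -> bool:
    """True if path is located inside parent."""
    try:
        PurePosixPath(path).relative_to(parent)
        return True
    except ValueError:
        return False


def _is_test_file(path: str) -> bool:
    """Assumption: files under tests/ or named test_* are test files."""
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def needs_e2e_update(changed_files, project_root: str) -> bool:
    """Flag a PR that changes multiple non-test files without touching E2E tests."""
    root = PurePosixPath(project_root)
    e2e_dir = root / "tests" / "test_end2end"
    production = [
        f for f in changed_files if _is_under(f, root) and not _is_test_file(f)
    ]
    touches_e2e = any(_is_under(f, e2e_dir) for f in changed_files)
    return len(production) > 1 and not touches_e2e
```

For example, `needs_e2e_update(["mvp_site/a.py", "mvp_site/b.py"], "mvp_site")` returns `True`. The check only detects a missing E2E change; it cannot verify that the E2E test exercises every new code path, and it does not model the explicit-justification exception.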
A Gemini call costs time and tokens. A unit test costs milliseconds. If the logic is a 5-line dict operation with no ambiguity, the risk of a failure that manifests only under LLM pressure is near zero.
- `block_unauthorized_level_mutations()`: deterministic `.pop()` on dict keys

If a real-LLM test can pass without the guard ever firing (because the LLM didn't cooperate with the diagnostic prompt), it's not actually testing what it claims.

- `UNAUTHORIZED_LEVEL_UP_PENDING_CCS_MUTATION`

| Question | Yes → | No → |
|---|---|---|
| Does LLM judgment affect the outcome? | Layer 3+ (MCP/HTTP/Browser) | Layer 1/2 (Unit/E2E) |
| Testing an LLM↔Server contract? | Layer 3+ | Layer 1/2 |
| Integration seam that mocks hide? | Layer 3+ | Layer 1/2 |
| Risk justifies cost of real LLM call? | Layer 3+ | Layer 1/2 |
| User-visible UI behavior? | Layer 5 (Browser) | Layer 3 or 4 |
| Can it pass vacuously? | Fix the harness first | Proceed |
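The last row of the table asks whether a test can pass vacuously. One way to rule that out is to make the harness fail when the guard under test never fired. The sketch below assumes a hypothetical guard that records its own activations; it is not the project's `block_unauthorized_level_mutations()` implementation.

```python
def block_level_mutation(payload: dict, fired: list) -> dict:
    """Hypothetical guard: strip 'level' and record that it fired."""
    cleaned = dict(payload)
    if "level" in cleaned:
        cleaned.pop("level")
        fired.append("level")
    return cleaned


def check_guard(payload: dict) -> dict:
    """Run the guard and fail loudly if the input never exercised it."""
    fired = []
    cleaned = block_level_mutation(payload, fired)
    if not fired:
        # Without this check, a payload with no 'level' key would "pass"
        # while proving nothing about the guard.
        raise AssertionError("vacuous test: guard never fired for this input")
    assert "level" not in cleaned
    return cleaned
```

With a payload that lacks `level` (for instance, because the LLM ignored the diagnostic prompt), `check_guard` raises instead of reporting success.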
| Layer | /es Evidence Credit | What It Proves | What It Does NOT Prove |
|---|---|---|---|
| Unit | ❌ None — supporting only | Logic correctness in isolation | Integration, real LLM shapes, persistence, UI rendering |
| E2E | ❌ None — supporting only | High-fidelity callstack integration with mock services | Real LLM shapes, browser behavior |
| MCP | ✅ Server + LLM | Real server processes real LLM output correctly | UI rendering, browser behavior |
| HTTP | ⚠️ Server only (no LLM unless explicitly called) | HTTP API contract, auth, response shapes | LLM behavior, UI rendering |
| Browser | ✅ Full (Server + LLM + UI + Video) | End-to-end user-visible behavior | Nothing — highest confidence |
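The credit column can be encoded as a lookup to answer "what is the strongest evidence this set of tests provides?". This is a sketch only: the short credit labels and their numeric ranking are assumptions made here, derived from the table's ordering of none, server only, server + LLM, and full.

```python
# Ranking is an assumption for this sketch: weakest (0) to strongest (3).
CREDIT_RANK = {"none": 0, "server": 1, "server+llm": 2, "full": 3}

# Evidence credit per layer, transcribed from the table above.
LAYER_CREDIT = {
    "unit": "none",
    "e2e": "none",
    "http": "server",
    "mcp": "server+llm",
    "browser": "full",
}


def strongest_credit(layers) -> str:
    """Return the strongest /es evidence credit among the given layers."""
    credits = [LAYER_CREDIT[layer] for layer in layers]
    return max(credits, key=CREDIT_RANK.__getitem__, default="none")
```

A change covered only by unit and E2E tests yields `"none"`, matching the table's "supporting only" classification.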
- MCPTestBase auto-emits: metadata.json, collection_log.jsonl, raw_llm_request_responses.jsonl, raw_http_request_responses.jsonl
- BrowserTestBase auto-emits: all MCP artifacts + screenshots + .webm video + VTT subtitles
- Unit and E2E tests do not produce /es bundles; they are CI gate artifacts only
- For /es-compliant bundles, use MCPTestBase or BrowserTestBase.

mvp_site PR Coverage Checklist

For PRs that add or modify more than one non-test file under $PROJECT_ROOT/**:
- Add or update a Layer 2 E2E test under $PROJECT_ROOT/tests/test_end2end/.

Related files:

- .claude/skills/end2end-testing.md: Layer 2 E2E patterns (fake implementations, Flask API test base class, multi-phase LLM testing)
- .claude/skills/evidence-standards.md: evidence class system, minimum viable checklist
- .claude/skills/pr-blocker-min-repro.md: 4-layer (now 5-layer) repro protocol
- .claude/commands/4layer.md: command to run the repro ladder
- .claude/commands/tdd.md: TDD workflow command
- AGENTS.md: testing_mcp and testing_ui execution policy (real mode only)

npx claudepluginhub jleechanorg/claude-commands --plugin claude-commands