Build-an-X workflow that caps the E2E suite size by computing flakiness ROI per test - for each E2E test, computes (regressions caught × value) ÷ (runtime × flake rate × maintenance cost), ranks all tests by ROI, identifies the bottom decile (low ROI = high cost / low signal), and recommends specific tests to retire / move to lower layer / fix flake. Use quarterly to keep E2E count from growing past the team's maintenance capacity.
Slash command: `/qa-process:e2e-suite-budget`
E2E tests are expensive: per Cohn's test pyramid, they're "brittle, expensive to write,
and time consuming to run." Without active management, the E2E
suite grows: a new feature adds 5 tests per sprint, the flake rate
creeps up, CI time balloons, the team disables flaky tests, and coverage
becomes illusory.
This skill computes per-test ROI and recommends which tests to retire / move to a lower layer / fix.
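Per E2E test, the skill needs the inputs to the ROI formula: runtime and flake rate (per-test stats from junit-xml-analysis), regressions caught, value tier, and maintenance count.

ROI = (regressions_caught × value_tier) / (runtime_min × (1 + flake_rate) × (1 + maintenance_count_norm))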
Where:
- regressions_caught ≥ 0 (defaults to 0; can be fractional for "caught a near-miss").
- value_tier ∈ [1, 5].
- runtime_min is the test's average runtime in minutes.
- flake_rate ∈ [0, 1] (e.g., 0.05 = 5% flake).
- maintenance_count_norm is the number of PRs touching the test in the window divided by the median for the suite (so a typical-maintenance test has a normalized count of ~1).

Higher ROI = more value per cost.
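As a worked example of the formula above, here is a minimal sketch that scores two tests. The function name `roi` and both sets of input values are hypothetical, chosen only to illustrate the arithmetic.

```python
def roi(regressions_caught, value_tier, runtime_min, flake_rate, maintenance_count_norm):
    """ROI per the formula above: value delivered divided by cost incurred."""
    value = regressions_caught * value_tier
    cost = runtime_min * (1 + flake_rate) * (1 + maintenance_count_norm)
    return value / cost


# Hypothetical test A: caught 2 regressions, tier 4, 1.5 min, 2% flake,
# typical maintenance (normalized count of 1.0).
roi_a = roi(2, 4, 1.5, 0.02, 1.0)

# Hypothetical test B: caught nothing, tier 2, 2.1 min, 18% flake,
# twice the median maintenance.
roi_b = roi(0, 2, 2.1, 0.18, 2.0)

print(round(roi_a, 2), round(roi_b, 2))  # 2.61 0.0
```

Because the numerator is a product, any test with `regressions_caught` of 0 scores exactly 0 whatever its cost, so tests tied at zero need a human look at runtime and flake rate to order them.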
# scripts/e2e-budget.py
# Usage: python e2e-budget.py <stats> <regressions> <value_tiers> <maintenance>
import json
import statistics
import sys


def load(path):
    """Read one JSON input file and close it afterwards."""
    with open(path) as f:
        return json.load(f)


# Load per-test stats from CI history
stats = load(sys.argv[1])        # {test_id: {runtime_min, flake_rate, ...}}
regressions = load(sys.argv[2])  # {test_id: count}
value_tiers = load(sys.argv[3])  # {test_id: tier}
maintenance = load(sys.argv[4])  # {test_id: pr_count}

# Suite-wide median PR count; fall back to 1 when it is 0 or there is no data
median_pr_count = (statistics.median(maintenance.values()) if maintenance else 0) or 1

scores = {}
for tid, s in stats.items():
    rc = regressions.get(tid, 0)  # regressions caught (defaults to 0)
    vt = value_tiers.get(tid, 3)  # value tier (defaults to mid-scale 3)
    rt = s['runtime_min']         # assumed > 0, otherwise the division fails
    fr = s['flake_rate']
    mn = maintenance.get(tid, 0) / median_pr_count
    scores[tid] = (rc * vt) / (rt * (1 + fr) * (1 + mn))

# Sort ascending: lowest ROI first
ranked = sorted(scores.items(), key=lambda x: x[1])
print(json.dumps(ranked))
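The script prints the full ranking; picking the bottom decile from it could look like the sketch below. The helper name `bottom_decile` and the round-down rule are assumptions, not part of the skill; rounding down is chosen here because it turns a 142-test suite into 14 candidates.

```python
def bottom_decile(ranked):
    """Return the lowest-ROI tenth of an ascending list of (test_id, score) pairs.

    Assumption: the decile size is rounded down, with a minimum of one test
    so that small suites still produce a candidate.
    """
    if not ranked:
        return []
    size = max(1, len(ranked) // 10)
    return ranked[:size]


# Synthetic ranking of 142 tests, already sorted ascending by score.
example_ranked = [(f"test-{i}", float(i)) for i in range(142)]
candidates = bottom_decile(example_ranked)
print(len(candidates))  # 14
```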
## E2E suite budget — `<repo>` — Q2 2026
**Total E2E tests:** 142
**Total runtime:** 38 min (per CI run)
**Median flake rate:** 4.2%
**Bottom-decile (14 tests) recommended for action:**
| Test | ROI | Runtime | Flake | Regressions caught | Value tier | Recommendation |
|---------------------------------------------------|----:|--------:|------:|-------------------:|-----------:|----------------|
| `archive-flow.spec.ts > old-orders` | 0.0 | 2.1m | 18% | 0 | 2 | Retire — high flake, no signal in 6mo. |
| `legacy-checkout.spec.ts > deprecated-promo` | 0.1 | 3.2m | 8% | 0 | 1 | Retire — feature deprecated. |
| `cart.spec.ts > add 1000 items` | 0.2 | 4.5m | 2% | 0 | 2 | Move to perf suite — not E2E concern. |
| `e2e-utils.spec.ts > date-formatting` | 0.5 | 0.8m | 1% | 0 | 2 | Move to unit layer. |
| ... (10 more) | | | | | | |
### Estimated impact of acting on all 14
- Suite size: 142 → 128 (-14)
- Runtime: 38 min → 28 min (-10 min per CI run, ~26% reduction)
- Flake-related reruns: estimated -50%
- Maintenance load: -20% (these 14 had the highest PR-touch count)
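The size and runtime figures in an impact summary like this one are simple arithmetic. A rough sketch follows, assuming tests run serially and every acted-on test leaves the E2E suite (a test that is only de-flaked would not reduce either number). The function name `estimate_impact` and the sample numbers are hypothetical.

```python
def estimate_impact(total_tests, total_runtime_min, acted_runtimes_min):
    """Estimate suite size and runtime after removing the given tests.

    acted_runtimes_min holds the average runtime of each test being
    retired or moved out of the E2E layer.
    """
    saved = sum(acted_runtimes_min)
    return {
        "tests_after": total_tests - len(acted_runtimes_min),
        "runtime_after_min": total_runtime_min - saved,
        "runtime_reduction_pct": round(100 * saved / total_runtime_min, 1),
    }


# Hypothetical suite: 50 tests, 20 min per run, three tests removed.
impact = estimate_impact(50, 20.0, [2.0, 1.5, 0.5])
print(impact)
```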
| Class | Action |
|---|---|
| retire | Delete; covered by other tests OR feature deprecated. |
| lower-layer | Rewrite at unit / integration; cheaper. |
| fix-flake | Test catches bugs but flakes; investigate per flaky-test-quarantine. |
| consolidate | Merge with sibling test that overlaps. |
| keep-but-monitor | Low ROI but catches important regressions; tag for next-quarter review. |
The team picks the appropriate class per test; the skill recommends.
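One way the recommendation could be produced is a small rule cascade over the classes above. This is a sketch only: the function name, the boolean flags, the rule order, and the thresholds are all assumptions (the 3% flake threshold is borrowed from the budget's `max_flake_rate`), and the output is a suggestion for the team to confirm.

```python
FLAKE_THRESHOLD = 0.03  # assumption: reuse the budget's max_flake_rate
HIGH_VALUE_TIER = 4     # assumption: tiers 4-5 count as "important"


def suggest_class(regressions_caught, flake_rate, value_tier,
                  feature_deprecated=False, covered_elsewhere=False,
                  overlaps_sibling=False):
    """Suggest an action class for one low-ROI test (hypothetical rules)."""
    if feature_deprecated or covered_elsewhere:
        return "retire"
    if overlaps_sibling:
        return "consolidate"
    if regressions_caught > 0 and flake_rate > FLAKE_THRESHOLD:
        return "fix-flake"
    if regressions_caught > 0 or value_tier >= HIGH_VALUE_TIER:
        return "keep-but-monitor"
    return "lower-layer"


print(suggest_class(0, 0.08, 1, feature_deprecated=True))  # retire
print(suggest_class(3, 0.10, 4))                           # fix-flake
print(suggest_class(0, 0.01, 2))                           # lower-layer
```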
Set an absolute budget:
# e2e-budget.yml
budget:
  max_tests: 100
  max_runtime_min: 30
  max_flake_rate: 0.03 # 3%
When the suite exceeds budget, adding a new E2E test in the next sprint requires retiring or moving an existing one. Force the trade-off.
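A budget check along these lines could run in CI. The sketch below hard-codes the budget values as a dict rather than parsing `e2e-budget.yml`, so it needs no YAML library. The function name `budget_violations` is hypothetical, and comparing `max_flake_rate` against the suite's median flake rate is an assumption, since the budget file does not say which flake statistic it caps.

```python
# Mirrors the values in e2e-budget.yml (hard-coded here for self-containment).
BUDGET = {"max_tests": 100, "max_runtime_min": 30, "max_flake_rate": 0.03}


def budget_violations(test_count, runtime_min, flake_rate, budget=BUDGET):
    """Return one message per budget limit that the suite exceeds."""
    checks = [
        ("max_tests", test_count),
        ("max_runtime_min", runtime_min),
        ("max_flake_rate", flake_rate),
    ]
    return [
        f"{key}: {actual} > {budget[key]}"
        for key, actual in checks
        if actual > budget[key]
    ]


# Suite figures from the sample report: 142 tests, 38 min, 4.2% median flake.
violations = budget_violations(142, 38, 0.042)
for message in violations:
    print(message)
```

A non-empty result would be the signal that the next new E2E test has to displace an existing one.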
| Cadence | Trigger |
|---|---|
| Quarterly | Scheduled review. |
| Per-major-feature | New tests added; verify suite stays under budget. |
| After flake spike | Reactive review; flake source likely a low-ROI test. |
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| ROI formula without "regressions caught" data | All tests look equal; no basis for prioritization. | Track real-bug catches over time (Step 1). |
| Treating value_tier as binary (critical / not) | Misses tier-2 / tier-3 nuance. | 1-5 scale (Step 1). |
| Auto-retiring bottom-decile without review | False positives - important tests retired. | Recommendation only; team confirms (Step 5). |
| Adding E2E tests without budget enforcement | Suite grows; no constraint forcing trade-offs. | Per-quarter cap (Step 6). |
| Recommending "retire" without alternative | Tests that catch important bugs may have low ROI due to runtime; deletion regrets. | Per-test categorization (Step 5). |
| Cherry-picking tests to retire (favorites stay) | Bias. | Apply ranking uniformly; document overrides. |
- test-pyramid-balancer - sibling: identifies the layer-balance issue this skill addresses tactically.
- flaky-test-quarantine - sibling: handles the flake side of low-ROI tests.
- junit-xml-analysis - upstream: provides per-test runtime + flake stats.
- unit-test-coverage-targeter - complementary: identifies what to add at the unit layer when E2E tests get retired.

npx claudepluginhub testland/qa --plugin qa-process