From qa-flake-triage
Runs a target end-to-end test N times under varied conditions (worker isolation, test order, viewport, network throttling, parallelism) to identify the axis along which the flake reproduces. Returns a probable root cause classified against the 8 flake patterns plus a numeric reproduction rate per axis. Use when a test has been flagged flaky and the team needs to know which condition triggers the failure.
Agent reference
`qa-flake-triage:agents/e2e-flake-bisector` (model: sonnet)
A bisector that varies one axis at a time to localize the flake source.
1. **Establish a baseline failure rate.** Run the target test N times (default 20) under the project's standard CI configuration. Record pass/fail per run plus duration.
2. **Vary one axis at a time.** For each of the axes below, run the test N times **changing only that axis** from the baseline. Record the new pass/fail rate.
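The baseline-then-vary procedure can be sketched as a small loop. This is a minimal illustration, not the agent's actual implementation: the `Runner` signature, the `overrides` keys, and the fake runner below are hypothetical stand-ins for whatever invokes the real test runner.

```typescript
// Outcome of one test execution.
type RunResult = { passed: boolean; durationMs: number };

// Hypothetical runner: executes the target test once under the given
// overrides. A real one would shell out to the project's test runner.
type Runner = (overrides: Record<string, string>) => RunResult;

type Variation = { axis: string; label: string; overrides: Record<string, string> };

type RateRow = {
  axis: string;
  label: string;
  failures: number;
  runs: number;
  rate: number;
  totalMs: number;
};

// Run the test n times under one configuration, counting failures and
// accumulating duration.
function measure(
  run: Runner,
  axis: string,
  label: string,
  overrides: Record<string, string>,
  n: number
): RateRow {
  let failures = 0;
  let totalMs = 0;
  for (let i = 0; i < n; i++) {
    const result = run(overrides);
    if (!result.passed) failures++;
    totalMs += result.durationMs;
  }
  return { axis, label, failures, runs: n, rate: failures / n, totalMs };
}

// Step 1: baseline with no overrides. Step 2: one row per variation, each
// changing only its own axis relative to the baseline.
function bisect(run: Runner, variations: Variation[], n: number = 20) {
  const baseline = measure(run, "baseline", "standard CI", {}, n);
  const rows = variations.map((v) => measure(run, v.axis, v.label, v.overrides, n));
  return { baseline, rows };
}

// Fake runner for illustration only: fails whenever 4 workers are requested.
const fakeRunner: Runner = (o) => ({ passed: o.workers !== "4", durationMs: 2000 });

const report = bisect(fakeRunner, [
  { axis: "Worker count", label: "-j 1", overrides: { workers: "1" } },
  { axis: "Worker count", label: "-j 4", overrides: { workers: "4" } },
]);
```

With the fake runner, only the `-j 4` row shows failures, which is the signal the sweep looks for: one axis whose rate departs from the baseline.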
The bisector sweeps these axes in order, cheapest and most discriminating first:
| Axis | Variations | Pattern surfaced |
|---|---|---|
| Run-alone | The target test alone vs. the full suite. | test ordering, shared parallel state |
| Worker count | -j 1 (sequential) vs. -j 4 vs. -j N (full parallelism). | shared parallel state |
| Random order | --randomize vs. fixed file order. | test ordering |
| Network throttle | Default vs. --slow-mo 1000ms vs. CDP Network.emulateNetworkConditions at 100kbps. | async/timing, network |
| Viewport | 375 / 768 / 1280 / 1920. | locator drift |
| Animation flag | animations: 'allow' vs. 'disabled'. | async/timing |
| OS / runner | Linux container vs. macOS / Windows runner. | environment variance |
| Repetition count | 100 sequential runs of just this test. | resource leak |
The bisector runs each variation with N=20 repetitions. With 8 axes × 2-4 variations × 20 reps, expect 320-640 test runs in total, which is why this agent is invoked per test rather than across the whole suite.
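The run budget can be checked with a few lines of arithmetic. This is a rough sketch: the per-axis variation counts are read off the axis table, the count of 3 for OS / runner (Linux, macOS, Windows) is an assumption, and every listed variation is counted even where it coincides with the baseline.

```typescript
const REPS = 20; // repetitions per variation
const AXES = 8;  // rows in the axis table

// Bounds quoted in the text: every axis has between 2 and 4 variations.
const lowerBound = AXES * 2 * REPS; // 8 * 2 * 20 = 320
const upperBound = AXES * 4 * REPS; // 8 * 4 * 20 = 640

// Variation counts per axis, excluding the repetition-count axis, which is
// a single block of 100 sequential runs rather than N=20 per variation.
const variationsPerAxis: { axis: string; count: number }[] = [
  { axis: "Run-alone", count: 2 },
  { axis: "Worker count", count: 3 },
  { axis: "Random order", count: 2 },
  { axis: "Network throttle", count: 3 },
  { axis: "Viewport", count: 4 },
  { axis: "Animation flag", count: 2 },
  { axis: "OS / runner", count: 3 }, // assumed: Linux, macOS, Windows
];

const totalVariations = variationsPerAxis.reduce((sum, a) => sum + a.count, 0); // 19
const tableRuns = totalVariations * REPS + 100; // 19 * 20 + 100 = 480
```

Under these assumptions the table as written lands at 480 runs, inside the quoted 320-640 range.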
## Flake bisect — `<test-id>`
**Baseline failure rate:** N/20 (X%)
### Axis sweep
| Axis | Variation | Failure rate | Δ vs. baseline |
|-------------------|-------------------------|-------------:|---------------:|
| Run-alone | alone | 0/20 (0%) | -X% |
| Worker count | -j 1 | 1/20 (5%) | -10% |
| Worker count | -j 4 | 8/20 (40%) | +25% |
| Random order | --randomize | 12/20 (60%) | +45% |
| Network throttle | 100kbps | 0/20 (0%) | -X% |
| Viewport | 375 | 6/20 (30%) | +15% |
| Viewport | 1280 | 2/20 (10%) | -5% |
| Animation flag | allow | 4/20 (20%) | +5% |
| Repetition count | 100 sequential | 3/100 (3%) | ~0% |
### Classification
**Probable root cause:** shared parallel state + test ordering
**Confidence:** high — axis sweep shows a >5x rate increase under
parallel execution AND under randomized order.
**Pattern:** Pattern 3 (shared parallel state) per
[`flake-pattern-reference`](../skills/flake-pattern-reference/SKILL.md).
### Recommended next step
1. Hand off to [`parallel-isolation-checker`](./parallel-isolation-checker.md)
to find the specific shared-state leak.
2. Pending fix, quarantine via [`flaky-test-quarantine`](../skills/flaky-test-quarantine/SKILL.md)
with this bisect report linked from the annotation.
3. Once isolation is fixed, re-run the bisect to confirm the failure
rate dropped below 1/20.
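The failure-rate and Δ columns of the report follow directly from the raw counts. The helper below is a hypothetical sketch of that formatting, not the agent's code; the baseline of 3/20 (15%) is an assumption inferred from the sample deltas (for example 1/20 = 5% shown as -10%).

```typescript
// Format one axis-sweep row: failures/runs, percentage, and the change in
// percentage points relative to the baseline.
function formatRow(
  axis: string,
  variation: string,
  failures: number,
  runs: number,
  baselineFailures: number,
  baselineRuns: number
): string {
  const pct = Math.round((100 * failures) / runs);
  const basePct = Math.round((100 * baselineFailures) / baselineRuns);
  const delta = pct - basePct;
  const sign = delta > 0 ? "+" : ""; // negative numbers already carry "-"
  return `| ${axis} | ${variation} | ${failures}/${runs} (${pct}%) | ${sign}${delta}% |`;
}

const parallelRow = formatRow("Worker count", "-j 4", 8, 20, 3, 20);
const sequentialRow = formatRow("Worker count", "-j 1", 1, 20, 3, 20);
```

Note that Δ here is a difference in percentage points, not a ratio; the ">5x" wording in the confidence line compares rates multiplicatively.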
Input: tests/checkout.spec.ts:42 fails ~3/20 in CI.
Bisect output (excerpt):
| Axis | Variation | Failure rate | Δ |
|-------------------|-----------------|-------------:|---|
| Run-alone | alone | 3/20 (15%) | ~0% |
| Network throttle | 100kbps | 17/20 (85%) | +70% |
Classification: async/timing (Pattern 1). The 100kbps throttle
moves the rate from 15% to 85%, indicating a missing
page.waitForLoadState('networkidle') or similar deterministic wait.
Hand off to the developer with await expect(page.locator('[data-testid="checkout-summary"]')).toBeVisible()
as the suggested fix.
Input: tests/users.spec.ts:88 fails ~10% in CI, 0% locally.
Bisect output (excerpt):
| Axis | Variation | Failure rate | Δ |
|---------------|----------------|-------------:|---|
| Run-alone | alone | 0/20 (0%) | -10% |
| Random order | --randomize | 14/20 (70%) | +60% |
Classification: test ordering (Pattern 2). The test passes when
run alone, fails 14/20 with random order. Likely a beforeAll
mutating state that another test depends on; remediation is moving
that setup into beforeEach.
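One simple way this kind of order dependence arises can be modeled without a test runner. The sketch below is a toy simulation under assumed names (`Db`, `runSuite`, and the two test names are hypothetical): state is seeded once, as a `beforeAll` would do, one test mutates it, and another test only passes if it runs first. Re-seeding before every test, as a `beforeEach` would do, removes the dependence.

```typescript
type Db = { users: string[] };

// Fresh fixture state.
const seedDb = (): Db => ({ users: ["alice"] });

// Two toy tests sharing one Db: "deletes user" mutates the shared state,
// "lists users" expects the seeded user to still be present.
const toyTests: { [name: string]: (db: Db) => boolean } = {
  "lists users": (db) => db.users.indexOf("alice") >= 0,
  "deletes user": (db) => {
    db.users = db.users.filter((u) => u !== "alice");
    return db.users.length === 0;
  },
};

// Run the toy tests in the given order under either setup strategy.
function runSuite(
  order: string[],
  setupMode: "beforeAll" | "beforeEach"
): { [name: string]: boolean } {
  let db = seedDb(); // beforeAll: seeded once for the whole suite
  const results: { [name: string]: boolean } = {};
  for (const name of order) {
    if (setupMode === "beforeEach") db = seedDb(); // fresh state per test
    results[name] = toyTests[name](db);
  }
  return results;
}

const fixedOrder = runSuite(["lists users", "deletes user"], "beforeAll");
const shuffled = runSuite(["deletes user", "lists users"], "beforeAll");
const shuffledIsolated = runSuite(["deletes user", "lists users"], "beforeEach");
```

In the fixed order both tests pass; in the shuffled order "lists users" fails under `beforeAll` and passes again under `beforeEach`, mirroring the alone-vs-randomized split in the bisect output.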
| Axis | Variation | Failure rate | Δ |
|-------------------|-----------------|-------------:|---|
| (every axis) | (every variation) | 0/20-2/20 | within noise |
Classification: inconclusive. Likely Pattern 8 (randomness) or a real low-rate environmental flake. Recommend persisting the random seed used in each run so that a failing run can be replayed with the same seed (replay-on-failure pattern).
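Seed persistence can be sketched as follows. This is an illustration under assumptions: `pickSeed` and `generateFixture` are hypothetical helpers, and the generator is the widely used mulberry32 routine standing in for whatever source of randomness the test actually uses.

```typescript
// Small deterministic PRNG (mulberry32): same seed, same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Use the replay seed when one is supplied, otherwise derive a fresh one.
// The chosen seed should be logged with the run so a failure can be replayed.
function pickSeed(replaySeed?: number): number {
  return replaySeed !== undefined ? replaySeed >>> 0 : Date.now() >>> 0;
}

// Hypothetical randomized test data derived entirely from the seed.
function generateFixture(seed: number): number[] {
  const rand = mulberry32(seed);
  return [rand(), rand(), rand()].map((x) => Math.floor(x * 1000));
}

const seedUsed = pickSeed();                      // first run: record this value
const original = generateFixture(seedUsed);
const replayed = generateFixture(pickSeed(seedUsed)); // replay with the same seed
```

Because all randomness flows from the recorded seed, the replayed fixture is identical to the one that produced the failure.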
A full sweep at N=20 across 8 axes is 320-640 executions; at ~2s per test execution that is roughly 11-21 minutes of cumulative test time per target test, and less in wall-clock terms when runs execute on parallel workers. For tests with longer runtimes (multi-step E2E flows), reduce N to 10 and skip the 100-sequential-runs axis unless a leak is suspected.
The bisector is not for screening the entire suite - it's for a
single test the team has decided is worth investigating. For
suite-wide screening, use
ai-flake-detector.
npx claudepluginhub testland/qa --plugin qa-flake-triage