From qa-flake-triage
Builds a persistent flakiness infrastructure dashboard from JUnit XML or JSON CI run history: defines the flake-rate metric (failures per test over a configurable window), authors the data model, generates a Grafana time-series panel JSON or configures a Datadog CI Visibility view, derives the quarantine-candidate query, and wires trend alerts. Use when a team needs a long-lived observability surface for test reliability that outlasts any single weekly report.
Slash command
/qa-flake-triage:flake-dashboard-author
Terminology note: "flaky test" is a practitioner-emergent term used in the industry engineering tradition (Google Testing Blog, google-flaky). "Defect," "failure," and "test run" follow ISTQB Glossary v4.7.1 definitions. "Test suite" and "test case" are used per ISTQB as well.
The e2e-test-trend-reporter agent produces a comparable weekly markdown
snapshot. This skill builds the persistent infrastructure layer: a live
dashboard that accumulates run history and surfaces the flake-rate metric
continuously, rather than on demand.
The canonical flake-rate formula for a single test T over a window of N
runs is:
$$
\text{flake\_rate}(T, \text{window}) = \frac{\text{failed\_runs}(T, \text{window}) + \text{retried\_passed\_runs}(T, \text{window})}{\text{total\_runs}(T, \text{window})}
$$
A run counts as "retried-passed" when the test framework reports it as
flaky (passed only after at least one retry). Playwright exposes this in
reporter.onTestEnd as test.outcome() === 'flaky'; result.status describes
only a single attempt (Playwright reporter API). JUnit XML has no standard
retry markup: Maven Surefire's rerun support writes a <flakyFailure>
element inside <testcase> for each failed attempt of a test that later
passed, and a <rerunFailure> element for failed reruns of a test that
never passed.
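
As a quick check of the arithmetic, here is a minimal Python sketch of the formula. The function name and the status strings are illustrative assumptions; it also assumes every recorded run, including skipped ones, counts toward the denominator.

```python
from collections import Counter

def flake_rate(statuses):
    """Return (failed + retried-passed) / total for one test's runs in a window."""
    counts = Counter(statuses)
    total = len(statuses)
    if total == 0:
        return None  # no runs in the window, so the rate is undefined
    return (counts["failed"] + counts["flaky"]) / total

# 20 runs: 17 clean passes, 2 failures, 1 pass-after-retry
runs = ["passed"] * 17 + ["failed"] * 2 + ["flaky"]
print(f"{flake_rate(runs):.2%}")  # prints 15.00%
```

Returning None for an empty window keeps "no data" distinct from a genuine 0% rate.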
Choose your window at ingestion time. Recommended defaults:
| Team cadence | Window | Minimum runs before showing rate |
|---|---|---|
| Multiple deploys per day | 7 days | 20 |
| One deploy per day | 14 days | 10 |
| Weekly releases | 30 days | 5 |
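
One way to carry these defaults into ingestion code is a small lookup that suppresses the rate until the minimum sample is reached. This is a sketch only: the cadence keys and the function name are made-up labels, and the numbers are copied from the table above.

```python
from datetime import timedelta

# (window, minimum runs before showing a rate), copied from the table above.
# The dictionary keys are hypothetical labels, not part of any tool's config.
WINDOW_DEFAULTS = {
    "multiple_deploys_per_day": (timedelta(days=7), 20),
    "one_deploy_per_day": (timedelta(days=14), 10),
    "weekly_releases": (timedelta(days=30), 5),
}

def displayable_rate(cadence, failed_or_flaky, total_runs):
    """Return the flake rate, or None while the sample is below the minimum."""
    _window, min_runs = WINDOW_DEFAULTS[cadence]
    if total_runs < min_runs:
        return None
    return failed_or_flaky / total_runs

print(displayable_rate("one_deploy_per_day", 1, 4))   # None: fewer than 10 runs
print(displayable_rate("one_deploy_per_day", 1, 20))  # 0.05
```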
Persist one row per test-case execution. Minimum schema:
CREATE TABLE test_runs (
run_id TEXT NOT NULL,
suite_name TEXT NOT NULL,
test_name TEXT NOT NULL,
status TEXT NOT NULL, -- 'passed' | 'failed' | 'flaky' | 'skipped'
duration_ms INTEGER NOT NULL,
branch TEXT,
commit_sha TEXT,
worker_index INTEGER,
started_at TIMESTAMPTZ NOT NULL,
PRIMARY KEY (run_id, suite_name, test_name)
);
CREATE INDEX idx_test_runs_name_time ON test_runs (test_name, started_at);
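status = 'flaky' is the retried-passed value. Playwright reports it as the
test outcome when retries are enabled (Playwright retries), and Surefire
rerun reports mark it with a <flakyFailure> element. For raw JUnit XML
without retry markup, derive flaky at ingestion time: when the same
run_id + suite_name + test_name appears twice within one CI run, first as
failed and then as passed, collapse the pair into a single row with status
'flaky'. The primary key allows only one row per test per run, so the
collapse must happen before the insert.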
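
The derivation for raw JUnit XML can be sketched in Python as a pre-insert pass over per-attempt records. The function name and the dictionary shape are assumptions for illustration; the input is assumed to be in execution order.

```python
from itertools import groupby

def collapse_attempts(attempts):
    """Collapse per-attempt records into one row per (run_id, suite_name, test_name).

    `attempts` is a list of dicts, in execution order, with the keys
    run_id, suite_name, test_name and status.
    """
    def key(a):
        return (a["run_id"], a["suite_name"], a["test_name"])

    rows = []
    # sorted() is stable, so attempts keep their execution order inside a group
    for (run_id, suite, name), group in groupby(sorted(attempts, key=key), key=key):
        statuses = [a["status"] for a in group]
        final = statuses[-1]
        if final == "passed" and "failed" in statuses[:-1]:
            final = "flaky"  # failed first, then passed within the same CI run
        rows.append({"run_id": run_id, "suite_name": suite,
                     "test_name": name, "status": final})
    return rows

attempts = [
    {"run_id": "r1", "suite_name": "cart", "test_name": "adds_item", "status": "failed"},
    {"run_id": "r1", "suite_name": "cart", "test_name": "adds_item", "status": "passed"},
    {"run_id": "r1", "suite_name": "cart", "test_name": "removes_item", "status": "passed"},
]
for row in collapse_attempts(attempts):
    print(row["test_name"], row["status"])
# adds_item flaky
# removes_item passed
```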
Populate from JUnit XML using xmllint --xpath:
# Extract per-testcase rows from a JUnit XML report
xmllint --xpath '//testcase' report.xml \
| python3 scripts/parse_junit.py --output jsonl >> test_runs.jsonl
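
The contents of scripts/parse_junit.py are not shown here. As a rough idea of what such a parser might do, the following hypothetical sketch reads a whole JUnit report (rather than the xmllint stream) with the standard library and emits one JSON line per test case. The column subset and the status mapping are assumptions.

```python
import json
import xml.etree.ElementTree as ET

def testcase_rows(xml_text, run_id):
    """Yield one dict per <testcase>, shaped like a subset of test_runs."""
    root = ET.fromstring(xml_text)
    for suite in root.iter("testsuite"):
        for case in suite.findall("testcase"):
            tags = {child.tag for child in case}
            if tags & {"failure", "error"}:
                status = "failed"
            elif "skipped" in tags:
                status = "skipped"
            elif tags & {"flakyFailure", "flakyError"}:
                status = "flaky"  # retry markup, present only in some reports
            else:
                status = "passed"
            yield {
                "run_id": run_id,
                "suite_name": suite.get("name", ""),
                "test_name": case.get("name", ""),
                "status": status,
                # JUnit reports durations in seconds; the table stores milliseconds
                "duration_ms": round(float(case.get("time", 0)) * 1000),
            }

SAMPLE = """<testsuite name="checkout">
  <testcase name="pays_with_card" time="0.250"/>
  <testcase name="applies_coupon" time="1.5"><failure message="boom"/></testcase>
</testsuite>"""

for row in testcase_rows(SAMPLE, run_id="run-1"):
    print(json.dumps(row))
```

Fields such as branch, commit_sha and started_at are not present in every JUnit report and would need to come from CI environment variables.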
Populate from Playwright JSON reporter output (--reporter=json) by
iterating results[].suites[].specs[].tests[].results[].
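
A Python sketch of that traversal follows. It assumes the report has a top-level suites array, that suites can nest further suites, and that each test entry carries a status of expected, unexpected, flaky or skipped plus a results array with duration, workerIndex and startTime. Verify those field names against the JSON your Playwright version emits; the helper names are hypothetical.

```python
def iter_playwright_tests(report):
    """Yield (suite_name, test_name, status, results) for every test entry."""
    def walk(suite, path):
        titles = path + [suite.get("title", "")]
        for spec in suite.get("specs", []):
            for test in spec.get("tests", []):
                suite_name = " > ".join(t for t in titles if t)
                yield suite_name, spec.get("title", ""), test.get("status"), test.get("results", [])
        for child in suite.get("suites", []):
            yield from walk(child, titles)

    for top in report.get("suites", []):
        yield from walk(top, [])

# Assumed mapping from Playwright test status to the test_runs.status values
STATUS_MAP = {"expected": "passed", "unexpected": "failed",
              "flaky": "flaky", "skipped": "skipped"}

def to_row(suite_name, test_name, status, results):
    """Build one test_runs-shaped dict from a test entry and its attempts."""
    last = results[-1] if results else {}
    return {
        "suite_name": suite_name,
        "test_name": test_name,
        "status": STATUS_MAP.get(status, "failed"),
        "duration_ms": sum(r.get("duration", 0) for r in results),
        "worker_index": last.get("workerIndex"),
        "started_at": results[0].get("startTime") if results else None,
    }

report = {"suites": [{"title": "cart.spec.ts", "specs": [], "suites": [
    {"title": "cart", "specs": [{"title": "adds item", "tests": [
        {"status": "flaky", "results": [
            {"status": "failed", "duration": 900, "workerIndex": 0,
             "startTime": "2024-01-01T00:00:00.000Z"},
            {"status": "passed", "duration": 700, "workerIndex": 1,
             "startTime": "2024-01-01T00:00:02.000Z"}]}]}]}]}]}

rows = [to_row(*entry) for entry in iter_playwright_tests(report)]
print(rows[0]["suite_name"], rows[0]["status"], rows[0]["duration_ms"])
# cart.spec.ts > cart flaky 1600
```

Summing the attempt durations is one design choice; storing only the final attempt's duration is equally defensible.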
The following panel JSON renders a time series of each test's daily flake rate across the trailing 14 days. Add it to the panels array of a dashboard's JSON model, or include it in a dashboard object and POST that to the Grafana Dashboard HTTP API (Grafana Dashboard API).
{
"type": "timeseries",
"title": "Flake rate per test (14-day rolling)",
"datasource": { "type": "postgres", "uid": "${DS_POSTGRES}" },
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 0 },
"id": 1,
"targets": [
{
"refId": "A",
"datasource": { "type": "postgres", "uid": "${DS_POSTGRES}" },
"rawSql": "SELECT date_trunc('day', started_at) AS time, test_name, ROUND(100.0 * SUM(CASE WHEN status IN ('failed','flaky') THEN 1 ELSE 0 END) / COUNT(*), 2) AS flake_rate FROM test_runs WHERE started_at >= NOW() - INTERVAL '14 days' GROUP BY 1, 2 ORDER BY 1",
"format": "time_series"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 2 },
{ "color": "red", "value": 5 }
]
},
"custom": {
"lineWidth": 2,
"fillOpacity": 10,
"pointSize": 5,
"showPoints": "auto",
"spanNulls": false
}
},
"overrides": []
},
"options": {
"legend": { "displayMode": "table", "placement": "bottom", "calcs": ["lastNotNull", "max"] },
"tooltip": { "mode": "multi", "sort": "desc" }
}
}
Key fields per the Grafana time-series panel docs:
- fieldConfig.defaults.thresholds.steps - color bands at 2% (yellow) and 5% (red) correspond to the quarantine-candidate thresholds in Step 5.
- fieldConfig.defaults.unit - set to "percent" so Grafana formats values as 2.4% rather than raw decimals.
- options.legend.calcs - includes "max" so the legend table shows the peak flake rate for each test over the window at a glance.
- datasource.uid - uses the ${DS_POSTGRES} dashboard variable so the JSON is portable across Grafana instances. Replace with your actual datasource UID when importing into a specific instance.

If your team uses Datadog, CI Visibility ingests test results natively via
the datadog-ci CLI or SDK reporters. The built-in
CI Visibility - Tests dashboard tracks Total Flaky Tests
(updated every 30 minutes per Datadog flaky test docs).
Datadog applies three tags automatically (Datadog flaky test docs):
- is_flaky - test is currently passing and failing across runs on the same commit.
- is_new_flaky - flaky behavior first appeared on this branch.
- is_known_flaky - flaky on the current or default branch previously.

Quarantine query in CI Visibility Explorer:
@test.status:fail @test.is_flaky:true
Flakiness rate formula in a Datadog Timeboard widget using the Metrics
query editor (CI Visibility emits ci.test.flaky as a count metric):
(count:ci.test.flaky{*} by {test.name}.as_count() /
count:ci.test.run{*} by {test.name}.as_count()) * 100
Trend alert using a Datadog Monitor:
- Group the monitor by test.name.
- Set the alert threshold to > 5 (5% flake rate) for quarantine candidates.
- Set the warning threshold to > 2 (2%) for early watch.

A test becomes a quarantine candidate when its flake rate exceeds the team threshold over the window AND it has enough samples to be statistically meaningful. Recommended SQL query for the data model in Step 2:
SELECT
test_name,
suite_name,
COUNT(*) AS total_runs,
SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END) AS flaky_runs,
ROUND(
100.0 * SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END)
/ NULLIF(COUNT(*), 0), 2
) AS flake_rate_pct,
MAX(started_at) AS last_seen
FROM test_runs
WHERE started_at >= NOW() - INTERVAL '14 days'
GROUP BY test_name, suite_name
HAVING
COUNT(*) >= 10
AND ROUND(
100.0 * SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END)
/ NULLIF(COUNT(*), 0), 2
) >= 5
ORDER BY flake_rate_pct DESC;
The HAVING COUNT(*) >= 10 guard prevents a test with 1 run and 1 failure
from appearing as 100% flaky. Adjust the minimum run count per your window
size using the table in Step 1.
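
To see the minimum-sample guard in action without a Postgres instance, the sketch below runs a trimmed version of the query against an in-memory SQLite table. This is an adaptation for illustration only: the time-window filter is omitted because NOW() and INTERVAL are Postgres syntax, and the table carries just three columns.

```python
import sqlite3

QUERY = """
SELECT test_name,
       COUNT(*) AS total_runs,
       ROUND(100.0 * SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END)
             / COUNT(*), 2) AS flake_rate_pct
FROM test_runs
GROUP BY test_name
HAVING COUNT(*) >= :min_runs
   AND 100.0 * SUM(CASE WHEN status IN ('failed', 'flaky') THEN 1 ELSE 0 END)
       / COUNT(*) >= :threshold
ORDER BY flake_rate_pct DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_runs (run_id TEXT, test_name TEXT, status TEXT)")

rows = [("r0", "one_off", "failed")]  # 1 run, 1 failure: 100% but too few samples
rows += [(f"r{i}", "checkout", "flaky" if i < 2 else "passed") for i in range(20)]
rows += [(f"r{i}", "login", "passed") for i in range(20)]
conn.executemany("INSERT INTO test_runs VALUES (?, ?, ?)", rows)

candidates = conn.execute(QUERY, {"min_runs": 10, "threshold": 5}).fetchall()
print(candidates)  # [('checkout', 20, 10.0)]
```

Only checkout qualifies: one_off is excluded by the run-count guard and login never fails.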
Hand quarantine candidates to the
flaky-test-quarantine skill, which
enforces the two-week TTL and renewal cap.
Grafana managed alert rules evaluate expressions against your datasource on a configurable schedule (Grafana alert rules).
Steps to create a flake-rate spike alert:
1. Query A: the same SQL from the Grafana panel in Step 3.
2. Expression B (Reduce): set Function to Last, Input to A.
3. Expression C (Threshold): set Input to B, threshold to IS ABOVE 5 (5% flake rate).
4. Pending period: 15m so transient spikes don't fire pages.
5. Recovery threshold: 2 so the alert resolves only when the flake rate drops back below the warning level.

The fieldConfig.defaults.thresholds.steps bands in the panel JSON
(green/yellow/red at null/2/5) visually mirror the alert thresholds so
on-call engineers see the same boundary lines in the chart that trigger the
alert (Grafana time-series thresholds).
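
The gap between the firing level (5) and the resolve level (2) gives the alert hysteresis. The Python sketch below simulates only that state logic on a series of evaluated flake rates; it is not Grafana's implementation, it ignores the 15m pending period, and the function and parameter names are invented for the example.

```python
def alert_states(rates, fire_above=5.0, recover_below=2.0):
    """Return 'firing' or 'ok' for each evaluated flake rate, with hysteresis."""
    firing = False
    states = []
    for rate in rates:
        if not firing and rate > fire_above:
            firing = True   # crossed the 5% line
        elif firing and rate < recover_below:
            firing = False  # dropped back under the 2% warning level
        states.append("firing" if firing else "ok")
    return states

print(alert_states([1.0, 6.0, 4.0, 3.0, 1.5, 4.0]))
# ['ok', 'firing', 'firing', 'firing', 'ok', 'ok']
```

Rates of 4.0 and 3.0 keep the alert firing because they are still above the resolve level, while the final 4.0 does not re-fire because it never exceeds 5.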
Given pw-results.json (Playwright --reporter=json output):
# 1. Parse into the test_runs table
node scripts/ingest_playwright_json.js pw-results.json \
--db postgres://localhost/qa_metrics \
--branch "$CI_BRANCH" \
--commit "$CI_COMMIT_SHA" \
--run-id "$CI_RUN_ID"
# 2. Run the quarantine-candidate query and emit a CSV
psql postgres://localhost/qa_metrics \
-f scripts/quarantine_candidates.sql \
--csv > candidates-$(date +%F).csv
# 3. Import the Grafana dashboard JSON
curl -s -X POST http://grafana:3000/api/dashboards/import \
-H 'Content-Type: application/json' \
-u "$GRAFANA_USER:$GRAFANA_PASS" \
-d @dashboards/flakiness-overview.json
After the first ingestion, the Grafana panel populates immediately for the last 14 days of history that was just loaded. The trend alert begins evaluating on the next 1-minute evaluation cycle.
- <flakyFailure> comes from Maven Surefire's rerun support and is not part of the base JUnit XML schema. Many reporters, including pytest's JUnit XML output, do not emit it. Treat status = 'flaky' as an enrichment step rather than a baseline guarantee.
- ci.test.flaky metric availability: the ci.test.flaky count metric requires the Datadog Agent test reporter or datadog-ci junit upload with the --service flag. Raw JUnit uploads via HTTP do not emit the metric automatically. Verify metric existence in Metrics Explorer before building the Timeboard widget.
- The ${DS_POSTGRES} variable requires a matching datasource name (or UID override) in every Grafana instance where the JSON is imported.

- Playwright retries - flaky status definition; source of the retried-passed run classification used in the flake-rate formula.
- Playwright reporter API - test outcome values including 'flaky'; used for JSON ingestion in Step 2.
- Grafana time-series panel docs - fieldConfig.defaults.thresholds structure and unit options; basis for the panel JSON in Step 3.
- Datadog flaky test docs - is_flaky, is_new_flaky, is_known_flaky tag definitions; 30-minute metric refresh cadence.
- CI Visibility - Tests dashboard - Total Flaky Tests widget; baseline for the Timeboard formula in Step 4.
- flaky-test-quarantine - downstream consumer of the quarantine-candidate query in Step 5.
- e2e-test-trend-reporter - the weekly narrative complement to this persistent dashboard.
- flake-pattern-reference - pattern catalog used to interpret spikes surfaced by this dashboard.

npx claudepluginhub testland/qa --plugin qa-flake-triage