From qa-resilience-drills
Reference for tracking MTTR (Mean Time To Recovery) / MTBF (Mean Time Between Failures) / MTTD (Mean Time To Detection) / MTTA (Mean Time To Acknowledge) - incident-record schema, calculation formulae, dashboards-as-code, target-vs-actual alerting. Aligns with ITIL incident management + ISO 20000 + Google SRE incident response chapter.
Slash command: /qa-resilience-drills:mttr-mtbf-tracker
Reference document for the four canonical incident-response metrics. This is a reference skill - incidents are tracked in your IR tool (PagerDuty, Opsgenie, FireHydrant, custom), and this skill defines the schema + formulae so dashboards reflect reality.
Required fields:
{
  "incident_id": "INC-2026-05-06-001",
  "service": "orders",
  "severity": "SEV-1",
  "detected_at": "2026-05-06T10:23:14Z",
  "acknowledged_at": "2026-05-06T10:25:02Z",
  "mitigated_at": "2026-05-06T10:54:11Z",
  "resolved_at": "2026-05-06T11:42:33Z",
  "root_cause_category": "deployment-config",
  "is_planned_maintenance": false,
  "customer_impact": true
}
Record distinct timestamps for detected, acknowledged, mitigated (impact stopped), and resolved (root cause remediated). Conflating them inflates or deflates the metrics.
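A minimal Python sketch of a consistency check for a record like the one above. It assumes the four timestamps should be non-decreasing in the order detected, acknowledged, mitigated, resolved; the names `parse_ts` and `validate_incident` are illustrative and not part of any IR tool's API.

```python
from datetime import datetime

# Timestamp fields in the order they are expected to occur for one incident.
ORDERED_FIELDS = ["detected_at", "acknowledged_at", "mitigated_at", "resolved_at"]


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as 2026-05-06T10:23:14Z."""
    # Swap the trailing Z for an explicit offset so this also works before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def validate_incident(record):
    """Return a list of problems; an empty list means the record looks consistent."""
    problems = [f"missing field: {name}" for name in ORDERED_FIELDS if name not in record]
    if problems:
        return problems
    stamps = [parse_ts(record[name]) for name in ORDERED_FIELDS]
    for i in range(1, len(ORDERED_FIELDS)):
        if stamps[i] < stamps[i - 1]:
            problems.append(f"{ORDERED_FIELDS[i]} is earlier than {ORDERED_FIELDS[i - 1]}")
    return problems
```

Running a check like this at ingest time catches swapped or conflated timestamps before they reach a dashboard.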
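MTTD = mean(detected_at − incident_start_at) # incident_start_at is not among the required fields above; add it to the record to report MTTD
MTTA = mean(acknowledged_at − detected_at)
MTTR = mean(mitigated_at − detected_at) # or resolved_at − detected_at, depending on the chosen definition
MTBF = mean(time between mitigation of one incident and detection of the next)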
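The formulae above can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: records are dicts shaped like the schema example, `incident_start_at` is an optional extra field, and the function name `incident_metrics` and its `recovery_field` parameter are hypothetical.

```python
from datetime import datetime
from statistics import mean


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as 2026-05-06T10:23:14Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def seconds_between(record, start_field, end_field):
    return (parse_ts(record[end_field]) - parse_ts(record[start_field])).total_seconds()


def incident_metrics(incidents, recovery_field="mitigated_at"):
    """Return MTTD, MTTA, MTTR and MTBF in seconds (None where not computable)."""
    ordered = sorted(incidents, key=lambda r: parse_ts(r["detected_at"]))
    # MTTD needs incident_start_at, which not every record carries.
    ttd = [seconds_between(r, "incident_start_at", "detected_at")
           for r in ordered if "incident_start_at" in r]
    tta = [seconds_between(r, "detected_at", "acknowledged_at") for r in ordered]
    ttr = [seconds_between(r, "detected_at", recovery_field) for r in ordered]
    # MTBF: gap from one incident's mitigation to the next incident's detection.
    gaps = [(parse_ts(nxt["detected_at"]) - parse_ts(prev["mitigated_at"])).total_seconds()
            for prev, nxt in zip(ordered, ordered[1:])]
    return {
        "mttd": mean(ttd) if ttd else None,
        "mtta": mean(tta) if tta else None,
        "mttr": mean(ttr) if ttr else None,
        "mtbf": mean(gaps) if gaps else None,
    }
```

Passing `recovery_field="resolved_at"` switches MTTR to the resolution definition; whichever one is used, it should be used for every report.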
| Metric | Window | Direction |
|---|---|---|
| MTTD | rolling 90 days | Lower better (faster detection) |
| MTTA | rolling 90 days | Lower better (responsive on-call) |
| MTTR | rolling 90 days | Lower better (faster recovery) |
| MTBF | rolling 365 days | Higher better (more time between failures) |
Definition note: MTTR can mean Mitigation OR Resolution; pick one per organization and document it. Mixing the two yields misleading trends.
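The rolling windows in the table can be applied before any averaging. A small Python sketch, assuming incidents are assigned to a window by `detected_at` (an assumption; the source does not say which timestamp anchors the window) and that `now` is a timezone-aware datetime. The helper name `in_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as 2026-05-06T10:23:14Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def in_window(incidents, now, days):
    """Keep incidents detected within the last `days` days up to and including `now`."""
    cutoff = now - timedelta(days=days)
    return [r for r in incidents if cutoff < parse_ts(r["detected_at"]) <= now]
```

For example, `in_window(incidents, datetime.now(timezone.utc), 90)` selects the set used for MTTD, MTTA, and MTTR, while a 365-day call selects the set used for MTBF.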
| Should exclude | Why |
|---|---|
| Planned maintenance | Not a failure |
| Test/drill incidents | Don't pollute reliability metrics |
| Issues outside the customer-trust path (internal-only) | Per organization policy; be explicit |
| Duplicates / "same root cause" within window | Inflates incident count |
The schema fields is_planned_maintenance and customer_impact allow filtered queries.
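A hedged Python sketch of the exclusions, assuming records shaped like the schema example. The 24-hour duplicate window, the choice to key duplicates on `(service, root_cause_category)`, and the function names `reportable` and `dedupe` are illustrative assumptions; set them according to your own policy.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as 2026-05-06T10:23:14Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def reportable(incidents):
    """Drop planned maintenance and incidents without customer impact."""
    return [
        r for r in incidents
        if not r.get("is_planned_maintenance", False) and r.get("customer_impact", False)
    ]


def dedupe(incidents, window_hours=24):
    """Keep the first incident per (service, root cause) within each window."""
    kept = []
    last_kept = {}  # key -> detection time of the most recent kept incident
    for r in sorted(incidents, key=lambda rec: parse_ts(rec["detected_at"])):
        key = (r["service"], r["root_cause_category"])
        ts = parse_ts(r["detected_at"])
        previous = last_kept.get(key)
        if previous is None or ts - previous > timedelta(hours=window_hours):
            kept.append(r)
            last_kept[key] = ts
    return kept
```

Applying `dedupe(reportable(incidents))` before computing any metric keeps the incident count tied to distinct customer-facing failures.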
# Grafana dashboard fragment
panels:
  - title: "MTTR (rolling 90 days)"
    targets:
      - expr: |
          avg_over_time(
            (
              incident_mitigated_ts - incident_detected_ts
            )[90d:1d]
          )
    format: "duration"
  - title: "MTBF (rolling 365 days)"
    targets:
      - expr: |
          ... (your time-series store DSL)
Treat dashboards as code (versioned, reviewed). Avoid dashboards assembled by hand in the UI that nobody can rebuild.
- alert: MTTR_TARGET_BREACH
  expr: avg_over_time(mttr_seconds[30d]) > 1800 # 30 min target
  for: 1h
  labels: { severity: warning }
  annotations:
    summary: "30-day MTTR exceeds 30-min target"
Alert fires when the trend breaks the target - not on individual incidents.
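To make the trend-versus-incident distinction concrete, here is a small Python sketch of the same comparison. It assumes the input is a list of per-incident recovery durations in seconds from the last 30 days; `breaches_target` is a hypothetical name, and the sketch does not model the rule's `for: 1h` hold period.

```python
from statistics import mean

TARGET_SECONDS = 1800  # 30-minute target, matching the rule above


def breaches_target(recovery_seconds, target=TARGET_SECONDS):
    """True when the mean recovery time over the window exceeds the target."""
    if not recovery_seconds:
        return False  # no incidents in the window, nothing to alert on
    return mean(recovery_seconds) > target
```

With durations of 600, 900, and 3000 seconds the mean is 1500 seconds, so nothing fires even though one incident took 50 minutes. Replace the last value with 5400 and the mean rises to 2300 seconds, which breaches the target.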
The ITIL 4 (Information Technology Infrastructure Library) incident management practice maps to these metrics:
| ITIL term | This skill's metric |
|---|---|
| Time to detect | MTTD |
| Time to acknowledge / response | MTTA |
| Time to restore service | MTTR (mitigation) |
| Time to resolve | MTTR (resolution) |
| Mean time between failures | MTBF |
ITIL doesn't prescribe specific formulae; this skill makes them explicit. Pair with your ITSM tool (ServiceNow, Jira Service Management).
Each incident has a postmortem. Postmortem fields feed back into the incident schema:
| Postmortem field | Schema field |
|---|---|
| Detection mechanism | (annotation; helps drive MTTD lower) |
| Root cause | root_cause_category |
| Action items | (separate table; link by incident_id) |
| Was the runbook used? | (annotation; informs runbook-quality investment) |
Action items have due dates; track completion.
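One way to track completion is a periodic check over the action-item table. The Python sketch below assumes each item is a dict with `incident_id`, `due_date` (ISO date), and `completed_on` (ISO date or `None`); apart from `incident_id`, these field names are hypothetical, since the source does not define the action-item schema.

```python
from datetime import date


def overdue_items(action_items, today):
    """Return open action items whose due date has passed."""
    return [
        item for item in action_items
        if item.get("completed_on") is None
        and date.fromisoformat(item["due_date"]) < today
    ]


def completion_rate(action_items):
    """Fraction of action items that are completed, or None if there are none."""
    if not action_items:
        return None
    done = sum(1 for item in action_items if item.get("completed_on") is not None)
    return done / len(action_items)
```

Because each item carries `incident_id`, overdue items can be joined back to the incident record and its `root_cause_category` when reviewing repeat failures.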
Many organizations report only MTTR-mitigation (better numbers, truer to customer experience). Per the "Embracing Risk" chapter of the Google SRE book, the customer-facing metric is what matters for SLO purposes.
Document which definition your reports use; both are legitimate.
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Mixed mitigation/resolution in MTTR | Trends incoherent | Pick one (Step 8) |
| Include maintenance / test incidents | Inflated incident count | Step 3 exclusion |
| Dashboard built once, never revisited | Stale; unrelated to current SLOs | Dashboards-as-code (Step 4) |
| MTTR target without MTTD focus | Fast recovery from things you found late ≠ fast for customer | Track all four |
| Postmortem disconnected from metrics | Action items don't reduce future MTTR | Step 7 integration |
error-budget-tests - per-incident budget consumption
dr-drill-runner - drills produce incidents with is_planned_maintenance: true