From qa-resilience-drills
Run scheduled DR drills - author the runbook (per-tier RTO + RPO), pre-drill checklist (data sync state, alert silencing, customer comms), drill workflow (announce → fail-over → verify → fail-back), post-drill report. Per Google Cloud DR planning guide; covers cold / warm / hot standby tier-specific patterns.
How this skill is triggered — by the user, by Claude, or both
Slash command
/qa-resilience-drills:dr-drill-runnerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Per the [Google Cloud DR planning guide], DR planning requires
Per the Google Cloud DR planning guide, DR planning requires "end-to-end recovery design addressing backup, restoration, and cleanup procedures." Drills test that the procedure works AND that the team can run it. Both surface different failures.
Per the Google Cloud DR planning guide:
| Metric | Definition |
|---|---|
| RTO | Maximum acceptable length of time the application can be offline |
| RPO | Maximum acceptable data loss (time window) |
| Tier | Example RTO | Example RPO | Pattern |
|---|---|---|---|
| 1 (revenue-critical) | < 15 min | < 1 min | Hot standby (active-active) |
| 2 (customer-impacting) | < 4 hr | < 1 hr | Warm standby |
| 3 (internal) | < 24 hr | < 24 hr | Cold (rebuild from backup) |
Document per service in a service catalog; drills enforce the contract.
Per the Google Cloud DR planning guide:
Drill expectations differ:
restore-time-tests skill).## Pre-Drill Checklist — `<service>` `<date>`
- [ ] Drill window scheduled (low-traffic; aligned with
customer-comm window)
- [ ] Drill scope decided (region, single service, full app)
- [ ] Replication lag confirmed within RPO at T-30 min
- [ ] Monitoring alerts SILENCED for expected failure indicators
(alert routing redirected to drill channel)
- [ ] On-call notified (avoid duplicate paging during drill)
- [ ] Customer comms sent if customer-impacting drill
- [ ] Rollback path documented (what triggers abort?)
- [ ] Drill commander assigned (owns go/no-go calls)
- [ ] Postmortem time scheduled (within 48hr of drill end)
Skipping the pre-drill = drills become incidents.
## Drill Workflow
### T-0: Announce
- Post in #drill-channel; confirm all participants ready.
- Drill commander gives "GO" — record T-0 timestamp.
### T+0..N: Fail-over
- Execute the runbook step-by-step (everyone follows the doc; no
improvisation).
- Capture timestamp of each step.
### Verify
- Run the verification suite (smoke + customer-impact + data integrity).
- Compare actual vs expected RTO; if RTO breached, decide:
abort + rollback, or continue + capture learning.
### Fail-back
- If hot/warm: redirect traffic back to primary.
- If cold: tear down DR environment + restore primary.
- Verify primary is healthy before claiming drill complete.
### Cleanup
- Re-enable alerts (Step 3).
- Send "all clear" customer comms.
- Reconcile any drill-introduced data divergence.
## Drill Report — `<service>` `<date>`
**Drill objective:** Verify warm standby fails over within RTO 4hr.
**Timeline:**
- T-30 min: Replication lag verified (52s — within RPO 1hr) ✓
- T-0: Announced, on-call silenced
- T+12m: Failover initiated
- T+47m: Standby took traffic
- T+1h22m: Verified service healthy on standby
- T+2h11m: Failback to primary
- T+3h05m: Drill complete
**RTO observed:** 1h22m (target: 4hr) ✓
**Issues found:**
1. CRITICAL: DNS TTL was 24hr in standby DNS records; users
couldn't reach service for 23min after failover. Fix: lower
TTL to 60s in standby zone before next drill.
2. MAJOR: Secret-manager copy step was undocumented; commander
improvised. Fix: add Step 3.4 to runbook.
3. MINOR: One alert wasn't silenced in advance; on-call was paged.
**Action items (with owners + dates):**
- DNS TTL fix → @platform-team — 2026-05-20
- Runbook Step 3.4 → @sre — 2026-05-13
- Alert routing audit → @sre — 2026-05-13
**Next drill:** 2026-08-06 (quarterly cadence).
Cold drills = bring up from backup. Verifies:
backup-verification-author).restore-time-tests).Hot drills = redirect traffic between active replicas. Verifies:
| Tier | Cadence |
|---|---|
| 1 | Monthly (game-day style) |
| 2 | Quarterly |
| 3 | Annually |
Per the Google Cloud DR planning guide: "test it regularly, noting any issues." Without cadence, runbooks rot.
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Skip pre-drill checklist | Drill becomes incident | Step 3 mandatory |
| One person knows the runbook | Bus-factor 1; drill panics when they're out | Rotate drill commander |
| Skip post-drill report | Lessons lost; same issues recur | Step 5 mandatory + 48hr deadline |
| Test failover only; skip failback | Failback is the actual prod path; bugs hide | Step 4 covers both |
| Lower RTO target after a missed drill | Goalpost moving | Hold the line + invest in fixes |
qa-chaos-resilience) - they test rehearsed paths; chaos tests unrehearsed ones.backup-verification-author,
restore-time-tests - sister
skills for drill prerequisiteserror-budget-tests,
mttr-mtbf-tracker - incident
metrics fed by drillsnpx claudepluginhub testland/qa --plugin qa-resilience-drillsProvides CDSS development patterns for drug interaction checking, dose validation, clinical scoring (NEWS2, qSOFA), and alert classification integrated into EMR workflows.