From qa-resilience-drills
Build error-budget gate tests: SLO and error-budget calculation per the Embracing Risk chapter of the Google SRE book ("difference between target uptime and actual uptime"), burn-rate alerting per the SRE workbook, a monthly-budget exhaustion test, and a freeze trigger when the budget is consumed. Reference: sre.google, Embracing Risk.
Slash command: `/qa-resilience-drills:error-budget-tests`
Per Google SRE - Embracing Risk, "the difference between [SLO] and [actual uptime] is the 'budget' of how much 'unreliability' is remaining for the quarter." When the budget is consumed, releases freeze. Tests verify this contract is enforced.
| Element | Example |
|---|---|
| SLI (indicator) | successful_requests / total_requests over rolling 30-day window |
| SLO (objective) | 99.9% over 30 days |
| Error budget | 100% − 99.9% = 0.1%; 0.1% of 30 days = 43.2 minutes of downtime allowed per 30 days |
Per Google SRE - Embracing Risk: "A failure affecting 0.0002% of queries consumes 20% of a 0.001% quarterly budget."
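The arithmetic behind the table and the quoted example can be sketched as below. This is a minimal illustration, not part of the skill: the function names `error_budget_minutes` and `budget_consumed_fraction` are hypothetical, and the budget is assumed to be a simple time-based fraction of the window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo)


def budget_consumed_fraction(failure_rate: float, budget_rate: float) -> float:
    """Share of the error budget spent by failures at the given rate.

    Both arguments must use the same unit (both fractions or both percent).
    """
    return failure_rate / budget_rate


# 99.9% over 30 days leaves 43.2 minutes of budget
print(round(error_budget_minutes(0.999), 1))               # 43.2
# 0.0002% failed queries against a 0.001% budget spends 20% of it
print(round(budget_consumed_fraction(0.0002, 0.001), 3))   # 0.2
```

The same helper gives roughly 26 seconds per 30 days for a 99.999% SLO, which is why very tight targets leave almost no room for releases.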
def test_sli_excludes_planned_maintenance():
    t1, t2, t3 = 0, 60, 120  # arbitrary timestamps for the fixture
    requests = [
        # Normal traffic
        Request(success=True, ts=t1, was_maintenance=False),
        Request(success=False, ts=t2, was_maintenance=False),
        # Planned maintenance: should NOT count against the SLO
        Request(success=False, ts=t3, was_maintenance=True),
    ]
    sli = compute_sli(requests)
    # 1 success / 2 non-maintenance requests = 0.5 (not 1/3)
    assert sli == 0.5
Exclusions for maintenance windows and planned outages must be agreed upon in advance, and they change the SLI. Test that the calculation actually applies them.
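For reference, an SLI calculation that satisfies the exclusion rule might look like the sketch below. The `Request` fields mirror the ones used in the test; the type of `ts` and the behavior for an empty sample (reporting 1.0) are assumptions, not something the skill specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    success: bool
    ts: float                      # assumed epoch seconds; unused by the calculation
    was_maintenance: bool = False


def compute_sli(requests: list[Request]) -> float:
    """Fraction of successful requests, ignoring planned-maintenance traffic."""
    counted = [r for r in requests if not r.was_maintenance]
    if not counted:
        # Assumption: with no eligible traffic, report a perfect SLI
        return 1.0
    return sum(r.success for r in counted) / len(counted)


sample = [
    Request(success=True, ts=0),
    Request(success=False, ts=60),
    Request(success=False, ts=120, was_maintenance=True),
]
print(compute_sli(sample))  # 0.5
```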
def test_30_min_outage_consumes_70_percent_of_monthly_budget():
    """30 days × 0.1% = 43.2 min budget. A 30 min outage consumes about 69%."""
    monthly_budget_min = 30 * 24 * 60 * 0.001  # 43.2 min
    incident_duration_min = 30
    consumed_pct = (incident_duration_min / monthly_budget_min) * 100
    assert 65 < consumed_pct < 75
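Per the SRE workbook, burn-rate alerting fires when the budget is being consumed faster than the SLO window can sustain. A burn rate of 1× spends exactly the whole budget over the window; higher multiples exhaust it proportionally sooner.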
| Window | Burn rate | Alert |
|---|---|---|
| 1 hour | 14.4× | "Critical - page" (consumes 2% in 1 hr) |
| 6 hours | 6× | "Major - ticket" (consumes 5% in 6 hr) |
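The "consumes N%" figures in the table follow from the burn rate, the alert window, and the 30-day SLO period. A small sketch of that relationship, with the hypothetical helper name `budget_consumed_pct`:

```python
def budget_consumed_pct(burn_rate: float, window_hours: float,
                        slo_period_hours: float = 30 * 24) -> float:
    """Percent of the period's error budget spent by burning at
    `burn_rate` for `window_hours`."""
    return burn_rate * window_hours / slo_period_hours * 100


print(round(budget_consumed_pct(14.4, 1), 2))  # 2.0
print(round(budget_consumed_pct(6, 6), 2))     # 5.0
```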
def test_critical_burn_alert_fires_at_14_4x():
    # Simulate a 1-hour window burning budget at 14.4×
    error_rate_in_window = 0.0144  # 1.44%, i.e. 14.4× the 0.1% error budget
    alert = burn_rate_alert(window_seconds=3600, observed_rate=error_rate_in_window)
    assert alert.severity == "critical"
    assert alert.routes_to == "page"
Test both directions: burn at 14.4× → critical; below threshold → no alert.
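One way `burn_rate_alert` could be implemented so that both directions are testable is sketched below. The `Alert` type, the `POLICIES` mapping, and the `slo` default are assumptions made for illustration; the thresholds and routes are copied from the single-window table above.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    severity: str
    routes_to: str


# window_seconds -> (burn-rate threshold, severity, route)
POLICIES = {
    3600: (14.4, "critical", "page"),
    6 * 3600: (6.0, "major", "ticket"),
}


def burn_rate_alert(window_seconds: int, observed_rate: float,
                    slo: float = 0.999) -> Optional[Alert]:
    """Return an Alert if the observed error rate meets the window's
    burn-rate threshold, else None. Unknown windows raise KeyError."""
    threshold, severity, route = POLICIES[window_seconds]
    burn_rate = observed_rate / (1.0 - slo)
    # Tolerant comparison: 0.0144 / 0.001 is not exactly 14.4 in binary floats
    if burn_rate > threshold or math.isclose(burn_rate, threshold, rel_tol=1e-9):
        return Alert(severity, route)
    return None
```

The tolerant comparison matters for the boundary test: a strict `>=` on floats can miss an error rate that is mathematically exactly at the threshold.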
Per Google SRE - Embracing Risk: "If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted."
def test_freeze_engaged_when_budget_below_zero():
    # Overspend the budget: 60 min of outage against a 43.2 min 30-day budget
    budget_state = BudgetTracker(slo=0.999, window_days=30)
    budget_state.record_outage_minutes(60)
    assert budget_state.remaining_seconds < 0
    gate = release_gate(budget_state)
    assert gate.should_freeze() is True
    assert gate.reason == "Error budget exhausted"
from datetime import datetime, timedelta, timezone

def test_budget_resets_as_old_outages_age_out():
    now = datetime.now(timezone.utc)
    # Outage 35 days ago: the rolling 30-day window has aged it out
    tracker = BudgetTracker(slo=0.999, window_days=30)
    tracker.record_outage(when=now - timedelta(days=35), duration=timedelta(minutes=60))
    # The window no longer includes the 35-day-old event
    assert tracker.remaining_seconds > 0
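The freeze and reset tests above assume a tracker with a rolling window and a gate that reads it. A minimal sketch of both follows. The internals are assumptions: outages are stored as (start, duration) pairs, an outage counts in full if it started inside the window, `GateDecision` is a hypothetical return type, and a budget of exactly zero is treated as exhausted.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class BudgetTracker:
    slo: float
    window_days: int
    _outages: list = field(default_factory=list)   # (start, duration) pairs

    def record_outage(self, when: datetime, duration: timedelta) -> None:
        self._outages.append((when, duration))

    def record_outage_minutes(self, minutes: float) -> None:
        # Convenience form: an outage that starts now
        self.record_outage(datetime.now(timezone.utc), timedelta(minutes=minutes))

    @property
    def remaining_seconds(self) -> float:
        window = timedelta(days=self.window_days)
        budget = window.total_seconds() * (1.0 - self.slo)
        cutoff = datetime.now(timezone.utc) - window
        # Simplification: an outage counts in full if it started inside the window
        spent = sum(d.total_seconds() for start, d in self._outages if start >= cutoff)
        return budget - spent


@dataclass(frozen=True)
class GateDecision:
    frozen: bool
    reason: str

    def should_freeze(self) -> bool:
        return self.frozen


def release_gate(tracker: BudgetTracker) -> GateDecision:
    # Assumption: a budget of exactly zero also freezes releases
    if tracker.remaining_seconds <= 0:
        return GateDecision(True, "Error budget exhausted")
    return GateDecision(False, "Error budget available")
```

Because the budget is recomputed from the window on every read, the freeze lifts on its own once old outages age out; no explicit reset step is needed.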
The SRE workbook recommends multi-window burn-rate alerts to balance sensitivity vs noise:
| Long window | Short window | Burn rate threshold | Alert |
|---|---|---|---|
| 1 hr | 5 min | 14.4× | Page |
| 6 hr | 30 min | 6× | Page |
| 3 day | 6 hr | 1× | Ticket |
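The short window confirms that the burn is still happening, so the long window is not alerting on an incident that has already ended. Both windows must exceed the threshold before the alert fires.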
def test_both_windows_must_trigger_to_page():
    # Long window says "burn rate high"; short window says "burn has stopped"
    long_burn = 14.5
    short_burn = 0.5
    page_fired = multi_window_alert(long_burn, short_burn,
                                    threshold_long=14.4, threshold_short=14.4)
    assert page_fired is False  # don't page once the issue has resolved
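A sketch of the two-window rule, and of evaluating all three tiers from the table above, is shown here. `multi_window_alert` follows the signature used in the test; `TIERS` and `evaluate` are hypothetical helpers, and the burn rates per window are assumed to be computed elsewhere.

```python
def multi_window_alert(long_burn: float, short_burn: float,
                       threshold_long: float, threshold_short: float) -> bool:
    """Fire only when both the long and the short window meet their thresholds."""
    return long_burn >= threshold_long and short_burn >= threshold_short


# (long window, short window, burn-rate threshold, action)
TIERS = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def evaluate(burns: dict[str, float]) -> list[str]:
    """Return the action of every tier whose two windows both meet the threshold."""
    return [
        action
        for long_w, short_w, threshold, action in TIERS
        if multi_window_alert(burns[long_w], burns[short_w], threshold, threshold)
    ]


# The 1h burn is high but the 5m burn has dropped, so the fast tier stays quiet
burns = {"1h": 14.5, "5m": 0.5, "6h": 7.0, "30m": 6.5, "3d": 1.2}
print(evaluate(burns))  # ['page', 'ticket']
```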
Per Google SRE - Embracing Risk: "Rather than political negotiations, teams reference objective metrics." Report budget remaining to product + leadership:
def test_weekly_budget_report_format():
    report = weekly_budget_report(service="orders", week=current_week)
    assert "remaining_minutes" in report
    assert "burn_rate" in report
    assert "incidents_this_window" in report
    assert "freeze_status" in report
    # Format: machine- and human-readable (CSV + Slack message)
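As an illustration of the two output formats, the sketch below builds a report with the four required keys and renders it as CSV and as a one-line chat message. Everything here is hypothetical: `build_budget_report` takes its inputs explicitly rather than looking them up the way `weekly_budget_report` presumably does, and the week label, the extra `service` and `week` keys, and the message wording are assumptions.

```python
import csv
import io


def build_budget_report(service: str, week: str, remaining_minutes: float,
                        burn_rate: float, incidents: int, frozen: bool) -> dict:
    return {
        "service": service,
        "week": week,
        "remaining_minutes": round(remaining_minutes, 1),
        "burn_rate": round(burn_rate, 2),
        "incidents_this_window": incidents,
        "freeze_status": "frozen" if frozen else "open",
    }


def to_csv(report: dict) -> str:
    """Machine-readable form: one header row and one data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(report))
    writer.writeheader()
    writer.writerow(report)
    return buf.getvalue()


def to_slack_text(report: dict) -> str:
    """Human-readable form: a single summary line."""
    return (f"*{report['service']}* error budget, week {report['week']}: "
            f"{report['remaining_minutes']} min remaining, "
            f"burn rate {report['burn_rate']}x, "
            f"{report['incidents_this_window']} incident(s), "
            f"releases {report['freeze_status']}")


report = build_budget_report("orders", "2024-W07", 12.4, 1.8, 2, frozen=False)
print(to_csv(report))
print(to_slack_text(report))
```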
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| SLO with no enforcement (no freeze) | Targets ignored; reliability degrades | Step 5 freeze-trigger |
| Single burn-rate alert | Either too noisy or too late | Step 7 multi-window |
| Include maintenance in SLI | Planned outages eat real budget | Step 2 exclusion |
| 99.999% SLO ("five nines") for everything | 26 sec/month budget; constant freeze | Tier SLOs per criticality |
| No reporting | Stakeholders don't internalize | Step 8 weekly cadence |
mttr-mtbf-tracker - incident metrics that consume budget
dr-drill-runner - drills that intentionally affect SLI
npx claudepluginhub testland/qa --plugin qa-resilience-drills