Build restore-time SLA tests - per-database + per-object-store baseline measurement, RTO objective verification, parallel-restore optimization tests, point-in-time-recovery (PITR) latency. Bound `time-to-functional` (TTF) ≤ documented RTO; flag silent regressions when restore time grows over months.
Slash command: `/qa-resilience-drills:restore-time-tests`
Per the Google Cloud DR planning guide, RTO is "the maximum acceptable length of time that your application can be offline." Restore-time tests measure the actual time-to-functional (TTF) for each backup type and gate it on the RTO budget.
Time-to-functional = sum of:
| Segment | Definition |
|---|---|
| Detection | Time from incident to "something's wrong" |
| Decision | Time from detection to "initiate DR" |
| Provisioning | Time to spin up DR environment (IaC apply) |
| Restore | Time to apply the latest backup |
| Verification | Time to run smoke tests + accept traffic |
| Cutover | DNS / load balancer switch + propagation |
Each segment has its own SLA; together they must fit within the RTO.
This skill focuses on the Restore and Verification segments.
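The segment table can be written down as an explicit budget check. The sketch below is only an illustration: the per-segment numbers are made-up placeholders (apart from the 50% restore share and 5-minute verification budget used elsewhere in this skill), and `check_ttf_budget` is a hypothetical helper name.

```python
# Hypothetical per-segment budgets in seconds (illustrative values only).
SEGMENT_BUDGETS = {
    "detection": 10 * 60,
    "decision": 15 * 60,
    "provisioning": 45 * 60,
    "restore": 2 * 3600,
    "verification": 5 * 60,
    "cutover": 15 * 60,
}


def check_ttf_budget(segment_budgets, rto_seconds):
    """Return (total_ttf, headroom); negative headroom means the budgets exceed the RTO."""
    total = sum(segment_budgets.values())
    return total, rto_seconds - total


total, headroom = check_ttf_budget(SEGMENT_BUDGETS, rto_seconds=4 * 3600)
print(f"TTF budget {total}s, headroom {headroom}s")
```

Keeping the budgets in one place makes it obvious when a change to one segment (for example, a slower restore) eats the headroom of the others.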
```python
import subprocess
import time

import pytest

RTO_BUDGET_SECONDS = 4 * 3600  # 4 hours


@pytest.mark.benchmark
def test_postgres_restore_time_under_rto():
    # Setup: recreate a clean target DB; fail fast if setup itself breaks
    subprocess.run(["psql", "-h", "test-db", "-c", "DROP DATABASE IF EXISTS db_test"], check=True)
    subprocess.run(["psql", "-h", "test-db", "-c", "CREATE DATABASE db_test"], check=True)

    backup = "postgres-prod-latest.sql.gz"
    start = time.monotonic()
    # pipefail surfaces gunzip errors; ON_ERROR_STOP makes psql exit non-zero on SQL errors
    subprocess.run(
        ["bash", "-c",
         f"set -o pipefail; gunzip -c {backup} | psql -v ON_ERROR_STOP=1 -h test-db -d db_test"],
        check=True,
    )
    elapsed = time.monotonic() - start

    # Restore should be < 50% of total RTO budget (rest is provision + verify + cutover)
    restore_budget = RTO_BUDGET_SECONDS * 0.5
    assert elapsed < restore_budget, \
        f"Restore took {elapsed:.0f}s; budget {restore_budget:.0f}s"
```
Run weekly in CI; track trend.
Many backup tools support parallelization. Test:
```bash
# pg_restore parallel (needs a custom- or directory-format dump, not plain SQL)
pg_restore -j 8 -d db_test backup.dump   # 8 parallel workers

# pgBackRest parallel restore
pgbackrest --stanza=prod --process-max=8 restore
```
```python
def test_parallel_restore_faster_than_serial():
    # run_restore is a project helper (not shown) that returns elapsed seconds
    serial = run_restore(parallel_jobs=1)
    parallel = run_restore(parallel_jobs=8)

    speedup = serial / parallel
    assert speedup > 3.0, f"Parallel restore only {speedup:.1f}x faster (diminishing returns)"
```
Find the sweet spot (often 4-8 jobs); past that, contention diminishes returns.
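One way to locate that sweet spot is to time a restore at several job counts and pick the smallest count that is close to the fastest run. The sketch below assumes the timings have already been collected; `pick_job_count`, the 10% tolerance, and the sample numbers are all illustrative, not part of this skill.

```python
def pick_job_count(timings, tolerance=0.10):
    """Smallest job count whose restore time is within `tolerance` of the fastest run."""
    best = min(timings.values())
    for jobs in sorted(timings):
        if timings[jobs] <= best * (1 + tolerance):
            return jobs


# Illustrative measurements: parallel jobs -> restore seconds (made-up numbers).
measured = {1: 3600, 2: 1900, 4: 1050, 8: 980, 16: 1010}
print(pick_job_count(measured))
```

Preferring the smallest count within tolerance leaves CPU and I/O headroom on the restore host for whatever else is running during a real incident.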
PITR = restore the database to an arbitrary point in the past (within retention). Restore time + WAL replay time:
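```python
import subprocess
import time
from datetime import datetime, timedelta, timezone


def test_pitr_to_5min_ago_under_30min():
    target_time = datetime.now(timezone.utc) - timedelta(minutes=5)

    start = time.monotonic()
    # Restore an existing base backup taken before target_time. A base backup taken
    # now (e.g. a fresh pg_basebackup) cannot be replayed backwards to an earlier
    # target, so this uses the pgBackRest stanza from the parallel-restore example.
    subprocess.run([
        "pgbackrest", "--stanza=prod", "--pg1-path=/restore",
        "--type=time", f"--target={target_time:%Y-%m-%d %H:%M:%S%z}",
        "--target-action=promote", "restore",
    ], check=True)
    subprocess.run(["pg_ctl", "start", "-D", "/restore"], check=True)

    # wait_for_recovery_complete is a project helper (not shown); it should return
    # once WAL replay has reached the target and the server has been promoted
    wait_for_recovery_complete(timeout=1800)
    elapsed = time.monotonic() - start

    assert elapsed < 1800, f"PITR took {elapsed:.0f}s; budget 30min"
```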
PITR latency = base restore + WAL replay. The test times both segments together.
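The PITR test calls `wait_for_recovery_complete`, which the source does not define. One possible shape for it is a generic polling loop with a deadline; the sketch below is an assumption, with `wait_until` as a hypothetical name. For PostgreSQL, the predicate could run `SELECT pg_is_in_recovery()` and return True once it reports false.

```python
import time


def wait_until(predicate, timeout, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns True; raise TimeoutError after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        sleep(interval)
```

`sleep` and `clock` are injectable so the helper itself can be unit-tested without real waiting.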
For S3 / GCS / Azure Blob restores, time the partial restore (not whole-bucket):
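```python
import time

import boto3

s3 = boto3.client("s3")  # assumes boto3; the source does not show how `s3` is created


def test_partial_object_restore_under_5_min():
    # sample_500_keys_from_inventory is a project helper (not shown)
    keys_to_restore = sample_500_keys_from_inventory()

    start = time.monotonic()
    for key in keys_to_restore:
        s3.copy_object(
            Bucket="restore-target",
            Key=key,
            # VersionId is elided in the source; supply the version to restore
            CopySource={"Bucket": "backup-versioned", "Key": key, "VersionId": ...},
        )
    elapsed = time.monotonic() - start

    assert elapsed < 300, f"500-object restore took {elapsed:.0f}s"
```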
500 objects = realistic single-customer-account restore size.
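The loop in the test copies objects one at a time. If the real restore tooling copies concurrently, it may be more representative to time it that way. The sketch below is a minimal illustration using a thread pool; `timed_parallel_restore`, `copy_one`, and the worker count of 16 are assumptions, not part of this skill.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_parallel_restore(keys, copy_one, max_workers=16):
    """Run copy_one(key) for every key on a thread pool; return elapsed seconds."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so any exception from copy_one is re-raised here
        list(pool.map(copy_one, keys))
    return time.monotonic() - start
```

In a test, `copy_one` would wrap the per-key copy call for the object store in use, and the returned duration would be asserted against the same 5-minute budget.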
Backup grows over time → restore time grows. Track:
```python
def emit_restore_time_metric(elapsed_seconds, backup_size_bytes):
    # metrics_client is whatever gauge-style client the project already uses (not shown)
    metrics_client.gauge("dr.restore_time_seconds", elapsed_seconds)
    metrics_client.gauge("dr.backup_size_bytes", backup_size_bytes)
    metrics_client.gauge("dr.restore_throughput_bytes_per_sec",
                         backup_size_bytes / elapsed_seconds)
```
Alert if restore time grows more than 20% in 90 days. Growth at that rate indicates a need for backup compaction, more parallelism, or RTO renegotiation.
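The 20%-in-90-days rule can be expressed as a small check over recorded restore times. This is a sketch under assumptions: the `(day_number, restore_seconds)` sample format and both function names are hypothetical, and a real setup would more likely encode the rule in the monitoring system's own alerting language.

```python
def restore_time_growth(samples, window_days=90):
    """samples: list of (day_number, restore_seconds), sorted by day.

    Returns fractional growth between the oldest and newest sample inside the
    window that ends at the most recent sample.
    """
    latest_day = samples[-1][0]
    window = [(d, s) for d, s in samples if latest_day - d <= window_days]
    baseline, current = window[0][1], window[-1][1]
    return (current - baseline) / baseline


def should_alert(samples, threshold=0.20, window_days=90):
    """True when restore time grew by more than `threshold` over the window."""
    return restore_time_growth(samples, window_days) > threshold
```

Comparing only the window's endpoints is deliberately simple; a noisy series may need a median over the first and last few runs instead.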
Restore success ≠ functional. Verification adds time:
```python
def test_post_restore_smoke_under_5_min():
    # do_restore and run_smoke_suite are project helpers (not shown)
    do_restore()

    start = time.time()
    run_smoke_suite("dr-environment")
    elapsed = time.time() - start

    assert elapsed < 300, f"Smoke tests took {elapsed:.0f}s; budget 5min"
```
Smoke suite scope: critical paths only. Full regression is too slow for the RTO window.
After restore, applications hit cold caches → first requests slow. Test that the cold-start latency is within service SLA:
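```python
import time

import requests


def test_cold_start_latency_within_sla():
    # Restore complete; app started; these are the first requests it serves
    latencies = []
    for _ in range(100):
        start = time.perf_counter()
        requests.get("https://dr-env.svc/api/products", timeout=10)
        latencies.append(time.perf_counter() - start)

    # Nearest-rank p99 of 100 samples is the 99th smallest (index 98);
    # index 99 would be the maximum
    p99_cold = sorted(latencies)[98]
    assert p99_cold < 2.0, f"Cold-start p99 {p99_cold:.2f}s exceeds 2s SLA"
```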
A cache-warm step may be needed in the DR runbook (loading common queries before declaring "functional").
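The cold-start test reports a p99 over 100 samples. A small nearest-rank helper makes the percentile explicit and keeps it correct if the sample count changes. This is a sketch; `nearest_rank_percentile` is a hypothetical name, and nearest-rank is one of several common percentile definitions.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: the value at 1-indexed rank ceil(pct * n / 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

With 100 samples, the nearest-rank p99 is the 99th smallest value, so a single slow outlier does not decide the result on its own.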
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Test on yesterday's backup, claim "RTO met" | Real DR uses minutes-old backup | Weekly cadence with realistic data freshness |
| Skip parallel test; use single thread | Aggregate RTO budget breached at scale | Step 3 sweet-spot tuning |
| Skip verification time | Restore "complete"; users still 5xx | Step 7 must be timed |
| No trend tracking | Silent regression months in | Step 6 metric + alert |
| RTO measured on DB only, ignoring the app | App may take longer than DB | Step 8 cold-start |
- `dr-drill-runner` - drill-level end-to-end timing
- `backup-verification-author` - verifies backup integrity before restore
- `error-budget-tests` - restore failures consume error budget

`npx claudepluginhub testland/qa --plugin qa-resilience-drills`