From grimoire
Designs cloud systems that survive regional outages, minimize global latency, and meet data residency requirements. Covers RTO/RPO targets, region selection, active-active/passive patterns, data replication, global load balancing, and failover testing.
Slash command
/grimoire:design-multi-region-architecture
Design cloud systems that remain available and performant across geographic failures.
Adopted by: Netflix, Amazon, Google, Stripe, financial institutions under DORA/BCBS239 regulation
Impact: Achieve 99.99%+ availability; reduce latency by 40-60% for globally distributed users; meet RTO < 15 min targets
Why best: Single-region deployments create single points of failure; multi-region distributes blast radius and puts compute near users
Sources: AWS Well-Architected Reliability Pillar (2023); Google Cloud Architecture Framework; NIST SP 800-34 Rev.1
Define availability targets — Set RTO/RPO per service tier (tier 1: RTO < 15 min, RPO < 1 min; tier 2: RTO < 4 hr, RPO < 1 hr). These drive region count and replication strategy.
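As a rough illustration, the tier targets in this step can be written down as data so that other tooling can read them. The Python sketch below uses hypothetical names (`TierTarget`, `TIERS`, `replication_mode`). The sync/async rule only restates the replication guidance in this document: zero RPO needs synchronous replication.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierTarget:
    """Recovery targets for one service tier (illustrative names)."""
    rto: timedelta  # recovery must complete in less than this
    rpo: timedelta  # data-loss window must stay below this


# Values copied from the tier definitions in this step.
TIERS = {
    "tier1": TierTarget(rto=timedelta(minutes=15), rpo=timedelta(minutes=1)),
    "tier2": TierTarget(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
}


def replication_mode(rpo: timedelta) -> str:
    """Zero RPO requires synchronous replication; anything else can be async."""
    return "synchronous" if rpo == timedelta(0) else "asynchronous"
```

With these definitions, `replication_mode(TIERS["tier1"].rpo)` yields `"asynchronous"`, because a one-minute RPO tolerates a small replication delay.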
Select regions — Choose primary, secondary, and optional tertiary regions based on: user latency (<100 ms rule), data residency law (GDPR, PDPA), cloud provider availability, and inter-region latency (prefer <50 ms pairs).
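The selection criteria in this step can be sketched as a simple filter. The function name `eligible_pairs`, the input shapes, and the region labels in the usage note are assumptions made for illustration; real latency figures would come from your own measurements.

```python
from itertools import combinations


def eligible_pairs(user_latency_ms, inter_region_ms, allowed_regions,
                   max_user_ms=100, max_pair_ms=50):
    """Return (region_a, region_b, pair_latency_ms) tuples, lowest latency first.

    user_latency_ms: region -> measured latency from the target user base
    inter_region_ms: frozenset({a, b}) -> measured latency between a and b
    allowed_regions: regions permitted by data residency rules
    """
    # Keep regions that are legally allowed and close enough to users.
    candidates = sorted(
        region for region, ms in user_latency_ms.items()
        if region in allowed_regions and ms < max_user_ms
    )
    pairs = []
    for a, b in combinations(candidates, 2):
        ms = inter_region_ms.get(frozenset((a, b)))
        # Skip pairs with no measurement or with too much inter-region latency.
        if ms is not None and ms < max_pair_ms:
            pairs.append((a, b, ms))
    return sorted(pairs, key=lambda p: p[2])
```

For example, with hypothetical regions `eu-a` (30 ms), `eu-b` (45 ms), and `us-a` (120 ms), a residency rule allowing only the two EU regions, and 20 ms between them, the only eligible pair is `("eu-a", "eu-b", 20)`.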
Choose a multi-region pattern — Active-active: traffic served from all regions simultaneously (highest complexity, lowest RTO). Active-passive: standby region takes over on failure (simpler, higher RTO). Pilot-light: minimal standby scaled on failover.
Design data replication — Synchronous replication for zero RPO (adds latency); asynchronous for low-latency writes with non-zero RPO. Use CRDTs or last-write-wins for conflict resolution. Partition databases by geography where data residency requires it.
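To make the two conflict-resolution options concrete, here is a minimal Python sketch of a last-write-wins register and a grow-only counter (one of the simplest CRDTs). The class and function names are illustrative, and using the region name as a tie-breaker is an assumption chosen so that every replica picks the same winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """Last-write-wins register: the write with the highest timestamp survives."""
    value: str
    timestamp: float
    region: str  # deterministic tie-breaker when timestamps are equal

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        return max(self, other, key=lambda r: (r.timestamp, r.region))


def merge_gcounter(a: dict, b: dict) -> dict:
    """Merge two grow-only counters by taking the per-region maximum."""
    return {region: max(a.get(region, 0), b.get(region, 0))
            for region in a.keys() | b.keys()}


def gcounter_value(counter: dict) -> int:
    """The counter's value is the sum of every region's local count."""
    return sum(counter.values())
```

Both merges give the same result regardless of argument order, which is what lets replicas converge. Note the difference in behavior: last-write-wins silently discards the losing concurrent write, while the counter merge loses nothing.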
Implement global load balancing — Route traffic via anycast DNS (Route 53, Cloud DNS) or global load balancers with health checks. Use latency-based or geolocation routing. Set failover thresholds and TTLs (<30 s).
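The routing logic in this step (health checks, a failover threshold, latency-based selection) can be sketched in a few lines of Python. This is not how any particular DNS service or load balancer is implemented. `RegionHealth`, `route`, and the default threshold of three consecutive failures are assumptions for illustration.

```python
class RegionHealth:
    """Tracks consecutive failed health checks for one region."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> None:
        # A single success resets the streak; failures accumulate.
        self.consecutive_failures = 0 if check_passed else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def route(latency_ms: dict, health: dict):
    """Pick the lowest-latency healthy region, or None if all are unhealthy."""
    healthy = [region for region in latency_ms if health[region].healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda region: latency_ms[region])
```

The threshold trades detection speed against flapping: a lower value fails over sooner but reacts to transient errors. Short TTLs matter for the same reason, since clients keep using a cached answer until it expires.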
Handle consistency trade-offs — Apply CAP theorem: during partition, choose availability or consistency per service. Use eventual consistency for non-critical data; strong consistency only where required (financial transactions).
Design stateless compute — Ensure application servers carry no local state. Externalize sessions to distributed caches (Redis Cluster, Memorystore). Store ephemeral files in object storage, not local disk.
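One way to picture externalized sessions: the request handler receives a store object and keeps nothing between calls. In the Python sketch below, `SessionStore`, `InMemoryStore`, and `handle_request` are hypothetical names, and the in-memory class is only a stand-in for a shared cache such as Redis so that the example runs without external services.

```python
from typing import Optional, Protocol


class SessionStore(Protocol):
    def get(self, session_id: str) -> Optional[dict]: ...
    def put(self, session_id: str, data: dict) -> None: ...


class InMemoryStore:
    """Stand-in for a shared cache; a real deployment would call a remote store."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, session_id: str) -> Optional[dict]:
        stored = self._data.get(session_id)
        # Return a copy so callers cannot mutate the store by accident.
        return dict(stored) if stored is not None else None

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = dict(data)


def handle_request(store: SessionStore, session_id: str) -> int:
    """Increment a per-session hit counter; the server itself holds no state."""
    session = store.get(session_id) or {"hits": 0}
    session["hits"] += 1
    store.put(session_id, session)
    return session["hits"]
```

Because all state lives in the store, any server in any region can handle the next request for the same session. The read-modify-write here is not atomic; a production version would likely rely on an atomic operation provided by the cache.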
Implement observability per region — Deploy independent monitoring stacks per region. Aggregate metrics globally but alert regionally. Measure inter-region replication lag as a key metric.
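Replication lag is easiest to reason about as a duration compared against the RPO budget. The following Python sketch assumes each replica can report the commit timestamp of the last change it applied. The function names and the alerting rule (alert when lag reaches the RPO) are illustrative choices, not a prescribed method.

```python
from datetime import datetime, timedelta


def replication_lag(primary_commit_at: datetime, replica_applied_at: datetime) -> timedelta:
    """Lag is how far the replica's last applied commit trails the primary's."""
    # Clamp at zero so small clock differences never produce a negative lag.
    return max(primary_commit_at - replica_applied_at, timedelta(0))


def lag_alerts(lag_by_region: dict, rpo: timedelta) -> list:
    """Return the regions whose lag has reached or exceeded the RPO budget."""
    return sorted(region for region, lag in lag_by_region.items() if lag >= rpo)
```

If a region fails while its replicas lag by more than the RPO, the target is already missed, so this metric is worth alerting on before any outage occurs.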
Test failover regularly — Run game days quarterly; automate chaos experiments (inject region failure in staging). Measure actual RTO vs target; update runbooks.
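Measuring actual RTO against the target comes down to timestamp arithmetic on the drill's events. A small Python sketch follows, assuming the drill records when the failure was injected and when service was restored. `measured_rto` and `drill_report` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def measured_rto(failure_injected_at: datetime, service_restored_at: datetime) -> timedelta:
    """Elapsed time between injecting the failure and restoring service."""
    if service_restored_at < failure_injected_at:
        raise ValueError("restore time precedes failure injection")
    return service_restored_at - failure_injected_at


def drill_report(failure_injected_at: datetime, service_restored_at: datetime,
                 target_rto: timedelta) -> dict:
    """Summarize one game-day drill as actual vs target RTO."""
    actual = measured_rto(failure_injected_at, service_restored_at)
    return {
        "actual_rto_s": actual.total_seconds(),
        "target_rto_s": target_rto.total_seconds(),
        "met_target": actual < target_rto,
    }
```

A drill that restores service 11 minutes after injection meets a 15-minute target; one that takes 20 minutes does not, which is the signal to revisit the runbook.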
Document runbooks — Write step-by-step regional failover procedures. Assign ownership. Test runbooks annually and after every architecture change.
npx claudepluginhub jeffreytse/grimoire --plugin grimoire