From grimoire
Designs cloud disaster recovery plans with RTO/RPO tiers, backup architecture, IaC, and communication procedures based on AWS, NIST, and ISO standards.
Slash command
/grimoire:design-disaster-recovery-plan
Create a tested, documented plan that recovers critical systems within defined time and data-loss targets.
Adopted by: Financial institutions (required by FFIEC, BCBS 239); healthcare (HIPAA §164.308); the AWS Well-Architected Framework covers it under the Reliability pillar.
Impact: Organizations with tested DRPs recover 3x faster (IBM 2022); the average breach without a DRP costs $4.7M more in downtime losses; NIST guidance calls for documented RTOs.
Why best: Untested assumptions about recovery are the leading cause of extended outages; a written, drilled plan removes ambiguity under pressure.
Sources: AWS Well-Architected Reliability Pillar (2023); NIST SP 800-34 Rev.1 (2010); ISO 22301:2019
Conduct business impact analysis (BIA) — Identify all systems, rank by business criticality, and quantify downtime cost ($/hour). Output: tiered system inventory with financial impact per outage hour.
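A minimal sketch of the tiered inventory this step produces, assuming downtime cost per hour is the only ranking input. The `SystemImpact` type, the cost floors in `assign_tier`, and the sample systems are all hypothetical and should be replaced with figures from your own BIA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemImpact:
    name: str
    downtime_cost_per_hour: float  # USD per outage hour, as estimated in the BIA


def assign_tier(cost_per_hour: float,
                tier1_floor: float = 50_000.0,
                tier2_floor: float = 5_000.0) -> int:
    """Map hourly downtime cost to a criticality tier (1 = mission-critical).

    ASSUMPTION: both cost floors are placeholders, not recommended values.
    """
    if cost_per_hour >= tier1_floor:
        return 1
    if cost_per_hour >= tier2_floor:
        return 2
    return 3


def tiered_inventory(systems):
    """Return (tier, system) pairs, most expensive outage first."""
    ranked = sorted(systems, key=lambda s: s.downtime_cost_per_hour, reverse=True)
    return [(assign_tier(s.downtime_cost_per_hour), s) for s in ranked]


# Hypothetical systems and costs, for illustration only.
inventory = tiered_inventory([
    SystemImpact("internal-wiki", 300.0),
    SystemImpact("payments-api", 120_000.0),
    SystemImpact("reporting-db", 8_000.0),
])

for tier, system in inventory:
    print(f"Tier {tier}: {system.name} (${system.downtime_cost_per_hour:,.0f}/hour)")
```

In practice criticality usually also reflects regulatory and safety exposure, so cost alone is a starting point for the ranking, not the whole of it.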
Define RTO and RPO per tier — Recovery Time Objective (max acceptable downtime) and Recovery Point Objective (max acceptable data loss). Tier 1 (mission-critical): RTO < 1 hr, RPO < 15 min. Tier 2: RTO < 4 hr, RPO < 1 hr. Tier 3: RTO < 24 hr, RPO < 24 hr.
Select DR strategy per tier — Backup/restore (lowest cost, highest RTO); pilot light (minimal standby, minutes to scale); warm standby (reduced-capacity running system); active-active (highest cost, near-zero RTO). Match strategy to RTO/RPO; don't gold-plate low-criticality systems.
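One way to make the matching concrete is to pick the cheapest strategy whose typical recovery time fits inside the tier's RTO with some headroom. The sketch below assumes illustrative `typical_rto` values and a 50% headroom factor; neither comes from a benchmark, and only the cost ordering and the tier RTO targets are taken from the steps above.

```python
from datetime import timedelta
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    relative_cost: int      # 1 = cheapest; ordering follows the step above
    typical_rto: timedelta  # ASSUMPTION: illustrative figures only


STRATEGIES = [
    Strategy("backup/restore", 1, timedelta(hours=12)),
    Strategy("pilot light", 2, timedelta(minutes=45)),
    Strategy("warm standby", 3, timedelta(minutes=10)),
    Strategy("active-active", 4, timedelta(minutes=1)),
]

# RTO targets per tier, from the tier definitions in the previous step.
TIER_RTO = {
    1: timedelta(hours=1),
    2: timedelta(hours=4),
    3: timedelta(hours=24),
}


def cheapest_strategy(rto_target: timedelta, headroom: float = 0.5) -> Strategy:
    """Return the lowest-cost strategy that recovers within headroom * target.

    ASSUMPTION: the headroom factor reserves part of the RTO for detection
    and human decision time; 0.5 is a placeholder, not a standard.
    """
    budget = rto_target * headroom
    candidates = [s for s in STRATEGIES if s.typical_rto <= budget]
    if not candidates:
        raise ValueError(f"no strategy recovers within {budget}")
    return min(candidates, key=lambda s: s.relative_cost)


for tier, target in TIER_RTO.items():
    print(f"Tier {tier}: {cheapest_strategy(target).name}")
```

Choosing by minimum cost is what keeps low-criticality systems from being gold-plated: a more expensive strategy is selected only when the cheaper ones cannot meet the budget.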
Design data backup architecture — Define backup frequency, retention period, and geographic distribution. Enforce 3-2-1 rule: 3 copies, 2 different media, 1 offsite. Test restores monthly; a backup never tested is not a backup.
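The 3-2-1 rule is easy to check mechanically against a backup inventory. The following is a sketch that assumes each copy is described by a hypothetical `BackupCopy` record and that the primary data counts as one of the three copies; the locations and media shown are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str  # site or region identifier (hypothetical values below)
    medium: str    # e.g. "disk", "object-storage", "tape"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies) -> bool:
    """Check 3 copies, on at least 2 different media, with at least 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    BackupCopy("primary-dc", "disk", offsite=False),            # production data
    BackupCopy("primary-dc", "tape", offsite=False),            # local backup
    BackupCopy("remote-region", "object-storage", offsite=True),
]
print(satisfies_3_2_1(copies))
```

A check like this only validates the inventory. It says nothing about whether a restore works, which is why the monthly restore test remains a separate requirement.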
Document recovery procedures — Write step-by-step runbooks for each failure scenario (region failure, data corruption, ransomware, service dependency outage). Include commands, escalation contacts, and decision trees. Store runbooks outside the failed system.
Implement infrastructure as code — All infrastructure must be reproducible from code (Terraform, CloudFormation, Pulumi). Manual click-ops environments cannot be reliably recreated under pressure. IaC is a prerequisite for sub-hour RTO.
Define communication plan — Establish incident commander role, internal escalation chain, and external stakeholder communication cadence. Draft status page templates in advance. Identify who authorizes failover decisions.
Automate failover where possible — Use health checks and automated DNS failover for tier 1 systems. Manual failover adds 15-60 minutes of human decision time. Automate the detection-to-failover path; require human sign-off only for irreversible actions.
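The detection-to-failover path can be sketched as a small decision function that is independent of any particular DNS or load-balancer product. The example assumes failover triggers after a fixed number of consecutive failed health checks; `FailoverMonitor`, `next_step`, and the threshold of three are hypothetical names and values chosen for illustration.

```python
from collections import deque


class FailoverMonitor:
    """Signals failover after N consecutive failed health checks."""

    def __init__(self, failure_threshold: int = 3):
        # ASSUMPTION: three consecutive failures is an illustrative default.
        self.failure_threshold = failure_threshold
        self._recent = deque(maxlen=failure_threshold)

    def record(self, healthy: bool) -> bool:
        """Store one health-check result; return True when failover should trigger."""
        self._recent.append(healthy)
        window_full = len(self._recent) == self.failure_threshold
        return window_full and not any(self._recent)


def next_step(should_fail_over: bool, irreversible: bool) -> str:
    """Automate reversible failover; route irreversible actions to a human."""
    if not should_fail_over:
        return "no-action"
    return "await-human-approval" if irreversible else "automatic-failover"


monitor = FailoverMonitor(failure_threshold=3)
results = [monitor.record(healthy) for healthy in (True, False, False, False)]
print(results)
print(next_step(results[-1], irreversible=False))
```

Requiring several consecutive failures trades a little detection time for fewer false failovers caused by a single dropped check.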
Conduct tabletop exercises — Quarterly: walk through disaster scenarios with stakeholders without actually failing systems. Identify gaps in procedures, ownership ambiguity, and missing automation. Document findings and remediate.
Run full DR drills — Annually (or after major architecture changes): execute actual failover to DR environment. Measure actual RTO/RPO against targets. Treat drill failures as production incidents and fix root causes.
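Measuring a drill against its targets comes down to three timestamps. The sketch below assumes the actual RTO is the time from outage start to service restoration and the actual RPO is the gap between the last recoverable data point and the outage start. The function names and sample times are hypothetical; the targets are the Tier 1 values from the tier definitions above.

```python
from datetime import datetime, timedelta


def measure_drill(outage_start: datetime,
                  service_restored: datetime,
                  last_recoverable_point: datetime):
    """Return (actual_rto, actual_rpo) for one drill."""
    actual_rto = service_restored - outage_start        # downtime experienced
    actual_rpo = outage_start - last_recoverable_point  # data window lost
    return actual_rto, actual_rpo


def drill_passed(actual_rto: timedelta, actual_rpo: timedelta,
                 rto_target: timedelta, rpo_target: timedelta) -> bool:
    """Both objectives must be met; targets are strict upper bounds."""
    return actual_rto < rto_target and actual_rpo < rpo_target


# Hypothetical drill timeline for a Tier 1 system (RTO < 1 hr, RPO < 15 min).
rto, rpo = measure_drill(
    outage_start=datetime(2024, 1, 15, 10, 0),
    service_restored=datetime(2024, 1, 15, 10, 42),
    last_recoverable_point=datetime(2024, 1, 15, 9, 50),
)
passed = drill_passed(rto, rpo,
                      rto_target=timedelta(hours=1),
                      rpo_target=timedelta(minutes=15))
print(f"RTO {rto}, RPO {rpo}, passed={passed}")
```

Recording the raw timestamps alongside the pass/fail result makes it possible to see how close each drill came to the target, and to track the trend across drills.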
npx claudepluginhub jeffreytse/grimoire --plugin grimoire
Define recovery objectives (RTO/RPO), backup strategies, failover procedures, and testing protocols. Use when planning disaster recovery or establishing continuity practices.
Defines RPO/RTO targets, designs backup architecture, and guides disaster recovery drills for full-region or platform outages. Also handles ransomware planning and post-incident restoration gap analysis.
Produces a complete disaster recovery plan with RPO/RTO targets, per-scenario runbooks, backup procedures, testing cadence, and communication templates for a service or system.