From grimoire
Guides writing SRE runbooks for on-call engineers to diagnose and resolve incidents. Includes alert headers, triage checklists, branching resolution paths, rollback steps, and escalation criteria.
Slash command: /grimoire:write-sre-runbook
Write a runbook that an on-call engineer who has never seen this service can follow at 3 AM to diagnose and resolve an incident.
Adopted by: Google SRE teams; PagerDuty's incident response model; AWS operational excellence pillar in the Well-Architected Framework
Impact: PagerDuty reports that teams with documented runbooks reduce mean time to resolution (MTTR) by 40-60%; AWS Well-Architected reviews flag missing runbooks as a reliability risk requiring remediation
Why best: Runbooks fail when they assume knowledge the on-call engineer does not have at 3 AM under stress. The baseline assumption must be: the reader knows nothing about this specific service, is tired, and has five minutes before an executive asks for an update.
<your-cluster-name> without a pointer to where to find the value
Alert: CheckoutLatencyHigh (P1, SLO: 99.5% of requests under 800ms, MTTR target: 30 min)
Triage command: kubectl top pods -n checkout-service --sort-by=cpu | head -6 (header plus the five busiest pods). Expected: no single pod above 80% of its CPU limit; if one pod is at 100%, proceed to Branch A (pod restart).
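The triage step above can be made mechanical so the on-call engineer does not have to do arithmetic at 3 AM. kubectl top pods reports CPU in millicores rather than percentages, so the check needs the pod's CPU limit to compare against. The following is a minimal sketch, not part of the skill itself: the function name triage_cpu is hypothetical, and it assumes every pod in the namespace shares one CPU limit (passed in millicores) and that CPU values carry the usual "m" suffix.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl top pods --no-headers` output on stdin
# (columns: NAME CPU MEMORY) and flags pods against the runbook thresholds.
# Assumption: all pods share the same CPU limit, given in millicores as $1.
triage_cpu() {
  limit_m="$1"
  awk -v limit="$limit_m" '
    {
      cpu = $2
      sub(/m$/, "", cpu)                 # "950m" -> "950"
      pct = int(cpu * 100 / limit)       # integer percent of the CPU limit
      if (pct >= 100)    { printf "%s %d%% BRANCH_A\n", $1, pct; hot = 1 }
      else if (pct > 80) { printf "%s %d%% WATCH\n", $1, pct }
    }
    END { if (!hot) print "no pod at 100%" }
  '
}
```

One possible invocation, assuming a 1000m limit: kubectl top pods -n checkout-service --no-headers | triage_cpu 1000. Any line ending in BRANCH_A corresponds to the pod-restart path; WATCH lines are above the 80% expectation but not yet saturated.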
npx claudepluginhub jeffreytse/grimoire --plugin grimoire