Generate a structured operational runbook for incident response, diagnostics, or maintenance procedures. Use when defining a runbook for a new alert, documenting a post-incident gap, establishing an operational procedure, or preparing a maintenance checklist.
How this skill is triggered — by the user, by Claude, or both
Slash command
/cloud-engineering-aws:write-runbookThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Generate a fully structured operational runbook from a scenario description.
Generate a fully structured operational runbook from a scenario description.
$ARGUMENTS = description of the operational scenario — alert name, service name, and/or procedure type (e.g., "high CPU on payment-api", "certificate rotation for api.example.com", "diagnose intermittent 502s from ALB").
Select the type based on the scenario:
If the scenario is ambiguous, default to incident response.
Assign a unique ID using the pattern RUN-[SERVICE-PREFIX]-[NNN] where:
SERVICE-PREFIX is a 2–4 letter abbreviation of the primary service (e.g., PAY for payment-api, DB for database, CERT for certificate, GEN for general)NNN is a sequential number starting at 001Use the template for the selected type below. Fill every section with specifics derived from $ARGUMENTS. Do not leave placeholder text that requires the user to fill in — infer reasonable defaults and mark them with [VERIFY] where validation against the real system is required.
If this procedure is normally executed by automation or a script: Ask: What steps does the normal automation perform implicitly that a responder would miss if bypassing it? Incorporate every answer as an explicit numbered step. Common hidden steps: loading environment variables, validating credentials, acquiring locks, checking prerequisites, taking a pre-operation backup. Never write "run the script" — list what the script does so a responder can execute the procedure without it.
# [Alert Name]: [Service] — Incident Response Runbook
## Metadata
| Field | Value |
|---------------------|-----------------------------------------------|
| Runbook ID | RUN-XXX-001 |
| Owner | [Team] / @[individual] |
| Last Updated | YYYY-MM-DD |
| Last Validated | YYYY-MM-DD |
| Severity | [P1 / P2 / P3] |
| Related Alerts | [alert-name] |
| Dashboards | [Link to primary dashboard — VERIFY] |
| Tools Required | [e.g., kubectl, aws-cli, psql] |
| Permissions Required| [e.g., k8s-prod-admin, AWS ReadOnly] |
| Escalation Contact | [PagerDuty service / Slack channel] |
## Overview
**What this runbook covers:** [1–2 sentences on the symptom and scope]
**What this runbook does NOT cover:** [Explicit exclusions with links to other runbooks if known]
## Impact
[Who or what is affected when this alert fires. Customer-facing? Internal only? Data integrity risk? Revenue impact?]
## Prerequisites
- [ ] Access to [system/tool] via [method — e.g., VPN, bastion host]
- [ ] CLI tool [name] installed and authenticated
- [ ] Credentials fetched from [vault path — never store inline]
- [ ] [Any other precondition — e.g., maintenance window open, stakeholders notified]
## Diagnosis Steps
### Step 1: Confirm the alert is real (not a false positive)
Run:
```bash
[command to confirm the alert condition — e.g., check metric, health endpoint]
Expected: [normal output]
If alert is confirmed: → Continue to Step 2.
If no anomaly found: → Alert may have resolved. Monitor for 5 minutes. If clean, close with note "resolved before investigation began."
Run:
[command to check affected services, error rates, user impact]
Expected: [normal output]
If blast radius is expanding: → Escalate immediately per the Escalation section below.
If contained: → Continue to Step 3.
Run:
[exact diagnostic command]
Expected: [normal output description]
If abnormal → Go to Remediation Option A.
If normal → Go to Step 4.
Run:
[exact diagnostic command]
Expected: [normal output]
If abnormal → Go to Remediation Option B.
If normal → Go to Step 5.
Run:
[command to check recent deployments, config changes, or infrastructure changes]
Expected: No changes in the past [N] hours / last deployment was > [N] hours ago.
If a recent change correlates with the issue: → Rollback per the Rollback section, then re-verify.
If no correlating change found: → Escalate with full diagnostic output.
⚠️ This modifies production state.
Run:
[exact command]
Expected output: [what success looks like]
Verify the fix took effect:
[verification command]
Expected: [success indicator]
→ Continue to Verification.
⚠️ This modifies production state.
Run:
[exact command]
Expected output: [what success looks like]
→ Continue to Verification.
If remediation worsens the situation or introduces new errors:
⚠️ Run:
[exact rollback command]
Verify rollback succeeded:
[verification command]
Expected: [pre-incident state indicators]
Escalate if rollback does not restore the system to baseline.
Escalate if any of the following are true:
How to escalate:
---
### Diagnostic Runbook
For situations with **unknown root cause** — structure as a symptom-to-cause decision tree ordered from most likely to least likely.
```markdown
# [Symptom Description] — Diagnostic Runbook
## Metadata
| Field | Value |
|---------------------|------------------------------------|
| Runbook ID | RUN-XXX-001 |
| Owner | [Team] / @[individual] |
| Last Updated | YYYY-MM-DD |
| Last Validated | YYYY-MM-DD |
| Tools Required | [tools] |
| Permissions Required| [permissions] |
| Related Runbooks | [links to remediation runbooks] |
## Symptom
[Observable behavior that triggered this investigation — exact error message, metric value, or user report]
## Investigation Checklist
Work through checks in order. Stop at the first check that identifies a root cause.
### Check 1: [Most likely cause]
**Command:**
```bash
[diagnostic command]
If [condition indicating this is the cause]: Root cause is [X]. → See [Remediation Runbook RUN-XXX].
If normal: → Continue to Check 2.
Command:
[diagnostic command]
If [condition]: Root cause is [Y]. → See [Remediation Runbook RUN-XXX].
If normal: → Continue to Check 3.
Command:
[diagnostic command]
If [condition]: Root cause is [Z]. → Apply [resolution].
If normal: → Continue to Check 4.
Command:
[diagnostic command]
If [condition]: → [action]
If normal: → No root cause identified. Escalate.
Escalate to [team] via [method] with the following information gathered:
---
### Maintenance Runbook
For planned operational tasks — deployments, certificate rotations, capacity changes, database migrations.
```markdown
# [Procedure Name] — Maintenance Runbook
## Metadata
| Field | Value |
|----------------------|------------------------------------------|
| Runbook ID | RUN-XXX-001 |
| Owner | [Team] / @[individual] |
| Last Updated | YYYY-MM-DD |
| Last Validated | YYYY-MM-DD |
| Estimated Duration | [N] minutes |
| Maintenance Window | [required window — or "anytime"] |
| Tools Required | [tools] |
| Permissions Required | [permissions — fetch from vault, path] |
| Rollback Time | [estimated rollback duration] |
## Overview
**Purpose:** [What this procedure accomplishes and why it is needed]
**Scope:** [What systems are affected]
**Risk level:** [Low / Medium / High] — [brief rationale]
## Pre-Maintenance Checklist
- [ ] Change ticket [ID] approved and linked
- [ ] Maintenance window confirmed: [date/time] – [date/time] [timezone]
- [ ] Stakeholders notified via [channel]
- [ ] Backup completed and verified — run:
```bash
[backup verification command]
Expected: [confirmation output]
Run:
[exact command]
Checkpoint: Verify step succeeded:
[verification command]
Expected: [success output]
If step fails: → Do not continue. Execute rollback.
⚠️ This modifies production state.
Run:
[exact command]
Checkpoint:
[verification command]
Expected: [success output]
[...]
If the procedure fails at any step:
Stop immediately — do not continue to the next step.
⚠️ Execute rollback:
[rollback command for the most recent irreversible action]
Verify system returned to pre-maintenance state:
[verification command]
Notify stakeholders of rollback via [channel].
Update change ticket with failure details and schedule a post-mortem.
---
## Output
Produce a single complete runbook in Markdown. Fill every field with specifics from `$ARGUMENTS`. Where exact values cannot be inferred, use `[VERIFY]` as a placeholder and note what needs to be confirmed against the real system.
After the runbook, include a brief **"Next Steps"** section (outside the runbook body) with:
1. Where to store this runbook (suggest the relevant repository or wiki location)
2. Which alert definition(s) should link to this runbook URL
3. Who should validate it (service owner or on-call engineer)
4. Suggested first validation method (e.g., tabletop walkthrough, game day, next incident)
npx claudepluginhub cpliakas/claude-code-digital-coworkers --plugin cloud-engineering-awsGenerates Markdown runbooks for incident response, operational procedures, troubleshooting guides, and emergency protocols from system analysis. Outputs structured files with metadata, steps, decision trees, and escalation paths.
Generates operational runbooks for repeatable incident procedures that any engineer can execute under pressure. Follows Google SRE and PagerDuty best practices.
Generates structured operational runbooks for services, incidents, or deployments with prerequisites, step-by-step procedures, rollback steps, and escalation paths.