From skillry-documentation-and-tech-writing
Use when you need to write or audit incident runbooks and on-call operational docs — symptom-first triage, validated diagnostic and recovery commands, escalation paths and severity levels, rollback steps, and verification that service is restored.
Slash command
/skillry-documentation-and-tech-writing:305-runbook-and-operational-docs
Produce operational documentation an on-call engineer can follow at 3 a.m. under pressure: incident runbooks keyed by symptom, exact diagnostic and recovery commands, clear severity levels and escalation paths, rollback procedures, and a "confirm recovery" step. A good runbook turns a stressful incident into a checklist. Each procedure must be specific, validated, and safe — fail-closed where an action is destructive.
# Collect the alerts and signals that should each map to a runbook
grep -rEi "alert|severity|pager|threshold" monitoring/ alerts/ 2>/dev/null | head -40
ls runbooks/ docs/runbooks/ 2>/dev/null
List each alert/symptom that can page a human. Every pageable alert needs a runbook entry; an alert with no runbook is a gap.
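The coverage check can be sketched as a small shell function. This is only an illustration under stated assumptions: the alert names are assumed to sit one per line in a text file, and each runbook is assumed to be stored as `<alert_name>.md`. The function name `runbook_gaps` and that file layout are hypothetical, not part of the skill.

```shell
# Sketch: print every alert name that has no matching runbook file.
# Assumptions (hypothetical): one alert name per line in $1,
# runbooks stored as $2/<alert_name>.md.
runbook_gaps() (
  alerts_file=$1
  runbooks_dir=$2
  # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
  while IFS= read -r alert || [ -n "$alert" ]; do
    # Skip blank lines and comment lines in the alert list.
    case $alert in ''|'#'*) continue ;; esac
    [ -f "$runbooks_dir/$alert.md" ] || printf 'GAP: %s\n' "$alert"
  done < "$alerts_file"
)
```

Running `runbook_gaps alerts.txt runbooks/` prints one `GAP:` line per uncovered alert, which can be pasted straight into the coverage list.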
The responder sees a symptom, not a root cause. Lead with the observable symptom and alert name, then severity, then triage. Title: "Symptom: API 5xx rate above 5%", not "Database connection pool internals".
Give read-only diagnostics first — establish blast radius and likely cause before acting.
# Read-only triage block: scope the problem before changing anything
kubectl get pods -n prod -l app=api # are pods healthy / restarting?
kubectl logs -n prod deploy/api --since=10m --tail=100 | grep -Ei "error|panic|timeout"
curl -sS -o /dev/null -w '%{http_code} %{time_total}s\n' "$HEALTHCHECK_URL"
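When auditing a triage block like the one above, it can help to check mechanically that every `kubectl` line really is read-only. The following is a rough sketch, not a complete classifier: the function name `kubectl_mode` is hypothetical, the verb list covers only common read-only verbs, and anything unrecognized is treated as mutating so the check fails closed.

```shell
# Sketch: classify a kubectl command line by its verb.
# Assumption: the verb is the first word after "kubectl". Commands that put
# global flags before the verb fall through to "mutating" (fail-closed).
kubectl_mode() (
  set -f        # disable glob expansion while splitting the command line
  set -- $1     # intentional word splitting into positional parameters
  [ "$1" = "kubectl" ] || { echo "unknown"; exit 0; }
  case $2 in
    get|describe|logs|top|version) echo "read-only" ;;
    rollout)
      # "rollout status" and "rollout history" only inspect state.
      case $3 in
        status|history) echo "read-only" ;;
        *) echo "mutating" ;;
      esac ;;
    *) echo "mutating" ;;
  esac
)
```

A line reported as `mutating` inside a triage section is a finding for the review list rather than proof of a bug; the reviewer still decides.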
State what each severity means (user impact + response time) and exactly who/what is paged at each, with the time-to-escalate if unresolved.
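One way to keep the severity rules unambiguous is to express them as a lookup that a reviewer can test. The sketch below assumes the thresholds used in this document's example escalation matrix (SEV-2 escalates to the lead at 15 minutes); the function name `escalation_target` and the role labels are placeholders to be replaced with your own policy.

```shell
# Sketch: who is paged for a severity after N minutes without mitigation.
# Thresholds follow the example matrix in this document; role names are placeholders.
escalation_target() (
  sev=$1
  minutes=$2
  case $sev in
    SEV-1) echo "on-call lead" ;;          # on-call and lead paged immediately
    SEV-2)
      if [ "$minutes" -ge 15 ]; then
        echo "on-call lead"                # unresolved at 15 min: add the lead
      else
        echo "on-call"
      fi ;;
    SEV-3) echo "on-call" ;;               # no immediate escalation
    *) echo "unknown severity: $sev" >&2; exit 1 ;;
  esac
)
```

An unknown severity returns a non-zero status instead of guessing, so a typo in a runbook header is caught early.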
Each mutating step states its effect and a verification. Destructive steps require confirmation and a pre-action snapshot/backup.
# Recovery example — restart, then VERIFY before declaring resolved
kubectl rollout restart deploy/api -n prod
kubectl rollout status deploy/api -n prod --timeout=120s
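# Rollback to the previous known-good release (confirm the target revision first)
kubectl rollout history deploy/api -n prod   # list revisions; identify the last known-good one
kubectl rollout undo deploy/api -n prod      # reverts to the previous revision
kubectl rollout status deploy/api -n prod --timeout=120s   # verify the rollback completed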
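The confirmation requirement for destructive steps can be sketched as a fail-closed prompt. This is an illustration only: the function name `confirm_destructive` and the prompt wording are hypothetical, and a real runbook may rely on a change-management tool instead.

```shell
# Sketch: fail-closed confirmation for a destructive step.
# The operator must retype the exact target; an empty line, a different
# answer, or end-of-input all abort with a non-zero status.
confirm_destructive() (
  target=$1
  [ -n "$target" ] || exit 1               # refuse to confirm an empty target
  printf 'Type "%s" to confirm: ' "$target" >&2
  answer=""
  IFS= read -r answer || true              # EOF leaves the answer empty or partial
  [ "$answer" = "$target" ]
)

# Example use (left as a comment so that sourcing this file changes nothing):
# confirm_destructive "deploy/api" && kubectl rollout undo deploy/api -n prod
```

Requiring the target name, rather than a generic "yes", makes it harder to confirm the wrong action out of habit at 3 a.m.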
End with the exact signal that proves recovery (alert cleared, success rate normal) and the incident-comms/update step. Add a postmortem trigger for high-severity incidents.
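A recovery signal such as "5xx rate below 1% for 5 minutes" is easier to apply when it is written as a check. The sketch below assumes the error-rate samples have already been fetched by some other tool (for example one sample per minute) and are passed as arguments; the function name `recovery_confirmed` is hypothetical.

```shell
# Sketch: recovery holds only if every sample in the window is below the threshold.
# Usage: recovery_confirmed <threshold_percent> <sample> [<sample> ...]
recovery_confirmed() (
  threshold=$1
  shift
  [ "$#" -gt 0 ] || exit 1   # no samples is not proof of recovery
  for rate in "$@"; do
    # awk handles the decimal comparison; exit status 0 means rate < threshold.
    awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r + 0 < t + 0) }' || exit 1
  done
)
```

For instance, `recovery_confirmed 1 0.4 0.2 0.0 0.1 0.3` succeeds, while a single sample at or above 1 anywhere in the window makes it fail, so a brief dip does not count as recovery.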
# Runbook: API 5xx rate above 5%
**Severity:** SEV-2 (user-facing errors). Page: on-call API engineer.
**Escalate to** team lead if not mitigated in 15 min; upgrade to **SEV-1** if checkout is down.
## Symptom
Alert `api_5xx_high` firing; users see 500s on the API.
## 1. Triage (read-only)
kubectl get pods -n prod -l app=api
kubectl logs -n prod deploy/api --since=10m --tail=100 | grep -Ei "error|timeout"
curl -sS -o /dev/null -w '%{http_code}\n' "$HEALTHCHECK_URL"
## 2. Likely causes → action
- Pods crash-looping → restart: `kubectl rollout restart deploy/api -n prod`
- Bad recent deploy → roll back: `kubectl rollout undo deploy/api -n prod`
- DB unreachable → check `db_connections` dashboard; escalate to DBA on-call.
## 3. Confirm recovery
kubectl rollout status deploy/api -n prod --timeout=120s
Alert `api_5xx_high` clears and 5xx rate < 1% for 5 min.
## 4. Comms
Post status in the incident channel: cause, action taken, current state.
SEV-1/SEV-2 require a postmortem within 48h.
# Escalation matrix
| Severity | Impact | Response and escalation |
|---|---|---|
| SEV-1 | full outage / data loss | page on-call + lead immediately; exec update 30m |
| SEV-2 | major degradation | page on-call; escalate lead at 15m |
| SEV-3 | minor / single feature | on-call handles; no immediate escalation |
Produce: (1) the alert-to-runbook coverage list (gaps flagged); (2) per-symptom runbooks with severity, read-only triage, recovery, and rollback; (3) the severity/escalation matrix; (4) a recovery-confirmation signal per runbook; (5) a comms/postmortem step; (6) a list of unvalidated or destructive commands needing review.
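Item (6) can be started with a simple scan of each runbook for commands that usually deserve a second look. This is a rough sketch: the function name `flag_destructive` is hypothetical, and the pattern list is an illustrative starting point rather than an exhaustive definition of "destructive".

```shell
# Sketch: print runbook lines (with line numbers) that contain commands
# likely to need review. Returns non-zero when nothing is flagged.
flag_destructive() (
  grep -nE 'kubectl (delete|drain)|rollout (undo|restart)|rm -rf|DROP (TABLE|DATABASE)' "$1"
)
```

Each flagged line still needs human judgment: the scan only shows where to look, and a clean result does not prove a runbook is safe.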
$HOME/relative paths; never real production endpoints or absolute machine paths.
npx claudepluginhub fluxonlab/skillry --plugin skillry-documentation-and-tech-writing