From rune
Executes structured production incident response: triages P1-P3 severity, contains blast radius (rollback, mitigation), root-causes after stabilization, logs timeline, generates postmortem. Triggers on outages or 'incident'.
How this skill is triggered — by the user, by Claude, or both
Slash command
/rune:incidentThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Structured incident response for production issues. Follows a strict order: triage first, contain before investigating, root-cause after stable, postmortem last. Prevents the most common incident anti-pattern — developers debugging while the system is still on fire. Covers P1 outages, P2 degraded service, and P3 minor issues with appropriate urgency at each level.
Structured incident response for production issues. Follows a strict order: triage first, contain before investigating, root-cause after stable, postmortem last. Prevents the most common incident anti-pattern — developers debugging while the system is still on fire. Covers P1 outages, P2 degraded service, and P3 minor issues with appropriate urgency at each level.
/rune incident "description of what's broken" — direct user invocationlaunch (L1): watchdog alerts during Phase 3 VERIFYdeploy (L2): health check fails post-deploywatchdog emits incident.detected — no manual invocation neededwatchdog (L3): current system state — which endpoints are down, response timesautopsy (L2): root cause analysis after containmentjournal (L3): record incident timeline and decisionssentinel (L2): check for security dimension (data exposure, unauthorized access)neural-memory (ext): after resolution — capture incident root cause + fix pattern cross-session so the same failure mode is never diagnosed twicelaunch (L1): monitoring alert during production verificationdeploy (L2): post-deploy health check failure/rune incident direct invocationClassify severity using this matrix:
| Severity | Definition | Contain Within |
|---|---|---|
| P1 | Full outage — core feature unavailable for all users | 15 minutes |
| P2 | Partial degradation — feature broken for subset of users or degraded for all | 1 hour |
| P3 | Minor issue — cosmetic, edge case, or non-blocking degradation | 4 hours |
P1 indicators: 5xx on root /, auth endpoint down, payment flow broken, data loss detected
P2 indicators: elevated error rate (>1%) on key flow, 1+ regions down, performance >5x baseline
P3 indicators: UI glitch, non-critical feature broken, low error rate (<0.1%)
Emit: TRIAGE: [P1|P2|P3] — [one-line impact description]
Choose containment strategy based on what's available and severity:
| Strategy | When to Use |
|---|---|
| Rollback | Last deploy caused regression (check git log vs incident start time) |
| Feature flag off | Feature-gated code — disable without deploy |
| Traffic shift | Multi-region: route away from affected region |
| Scale up | Resource exhaustion (CPU/memory/connection pool) |
| Rate limit | Abuse pattern or traffic spike |
| Manual intervention | DB locked record, stuck job, cache corruption |
Execute containment action. Then invoke watchdog to verify system is stable before proceeding.
Emit: CONTAINED: [strategy used] — [timestamp] or CONTAINMENT_FAILED: [what was tried] — escalate
Invoke watchdog with current base_url and critical endpoints.
Proceed to Step 4 only if watchdog returns ALL_HEALTHY or DEGRADED with upward trend.
If watchdog returns DOWN — return to Step 2 with a different containment strategy.
Invoke sentinel to check if the incident has a security dimension:
If sentinel returns BLOCK: escalate to security incident — different protocol (notify security team, preserve logs, document access chain).
If sentinel returns PASS or WARN: continue to root cause.
Invoke autopsy with context:
autopsy returns: root cause hypothesis with evidence, affected code paths, contributing factors.
Do not attempt fixes — incident only investigates. Any code changes are a separate task.
Construct incident timeline using:
Format:
[HH:MM] Incident detected — [who/what detected it]
[HH:MM] Triage: [P1/P2/P3] — [impact]
[HH:MM] Containment started — [strategy]
[HH:MM] CONTAINED — [watchdog confirms stable]
[HH:MM] RCA: [root cause summary]
[HH:MM] Resolution: [what was done]
Invoke journal to record the timeline and decisions in .rune/adr/ as an incident ADR.
Generate postmortem report and save as .rune/incidents/INCIDENT-[YYYY-MM-DD]-[slug].md:
# Incident Report: [title]
**Severity**: [P1|P2|P3]
**Date**: [YYYY-MM-DD]
**Duration**: [time from detection to resolution]
**Impact**: [users affected, data affected, revenue impact if known]
## Timeline
[from Step 6]
## Root Cause
[from autopsy — specific, not vague]
## Contributing Factors
[from autopsy — what made this worse]
## What Went Well
[containment speed, detection, communication]
## What Went Wrong
[detection lag, failed first containment, etc.]
## Prevention Actions
| Action | Owner | Due | Priority |
|--------|-------|-----|----------|
| [specific action] | [team/person] | [date] | P1/P2/P3 |
## Lessons Learned
[3-5 bullet points]
## Incident Response: [title]
### Triage
P2 — Login service returning 503 for ~30% of users
### Containment
Strategy: Rollback to commit abc123 (pre-deploy from 14:32)
Status: CONTAINED at 15:07 — watchdog confirms ALL_HEALTHY
### Security Check
sentinel: PASS — no data exposure detected
### Root Cause (from autopsy)
Connection pool exhausted — new feature added synchronous DB call in middleware,
reducing available connections from 20 to 3 under load
File: src/middleware/auth.ts:47
### Timeline
14:32 Deploy completed
14:45 Alerts fired — 503 rate >1%
14:47 TRIAGE: P2
14:52 Containment: rollback initiated
15:07 CONTAINED
15:20 RCA complete
15:35 Postmortem drafted
### Postmortem saved
.rune/incidents/INCIDENT-2026-02-24-login-503.md
| Gate | Requires | If Missing |
|---|---|---|
| Triage Gate | Severity classified (P1/P2/P3) before any other step | Classify before proceeding |
| Containment Gate | watchdog confirms HEALTHY/DEGRADED-improving before RCA | Return to containment if still DOWN |
| Security Gate | sentinel ran before closing incident | Run sentinel — do not skip |
| Postmortem Gate | All sections populated (Timeline, RCA, Prevention Actions) before status = Resolved | Complete or note as DRAFT |
| Artifact | Format | Location |
|---|---|---|
| Incident response report | Markdown | inline (chat output) |
| Incident timeline | Text (HH:MM format) | inline + postmortem |
| Postmortem document | Markdown | .rune/incidents/INCIDENT-<date>-<slug>.md |
| Prevention actions table | Markdown table | postmortem |
| Journal entry (incident ADR) | Text | .rune/adr/ (via rune:journal) |
Known failure modes for this skill. Check these before declaring done.
| Failure Mode | Severity | Mitigation |
|---|---|---|
| Starting RCA before containment confirmed | CRITICAL | HARD-GATE: check CONTAINED status before calling autopsy |
| Declaring incident resolved without watchdog verification | HIGH | MUST call watchdog after containment — not just assume |
| Postmortem Prevention Actions without owners or dates | MEDIUM | Every action needs owner + due date — otherwise it never happens |
| Skipping sentinel because "looks like a performance issue" | HIGH | Security dimension is not always obvious — always run sentinel |
| P1 triage without 15-minute containment urgency | HIGH | P1 SLA = 15 min to contain — flag if containment exceeds threshold |
~3000-8000 tokens input, ~1000-2500 tokens output. Sonnet for response coordination.
npx claudepluginhub rune-kit/rune --plugin @rune/analyticsClassifies incidents by severity (SEV1-4), constructs timelines, assesses impact, performs 5 Whys root cause analysis, and generates blameless post-mortems for production issues.
Responds to production incidents using a structured workflow: classify severity, triage impact, mitigate, root-cause, and write a blameless post-mortem. Use for outages, production issues, or security incidents.
Guides structured incident response from detection through post-mortem, including severity classification, response workflow, and post-mortem templates.