From devops
Guide incident response — detect, assess, mitigate, root cause, prevent recurrence.
How this skill is triggered — by the user, by Claude, or both
Slash command
/devops:incident-responseThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Respond to $ARGUMENTS.
Respond to $ARGUMENTS.
The cardinal rule: mitigate first, root-cause second. The goal is to stop the bleeding before you diagnose the disease. Never spend 30 minutes investigating while users are down.
Gather the symptoms (2 minutes max):
Classify severity:
| Severity | Criteria | Response time | Communication cadence |
|---|---|---|---|
| SEV-1 (Critical) | Service down, data loss, security breach, revenue impact | Immediate | Every 15 minutes |
| SEV-2 (High) | Major feature degraded, affecting many users, no workaround | < 30 min | Every 30 minutes |
| SEV-3 (Medium) | Feature degraded, workaround exists, limited user impact | < 2 hours | Every 2 hours |
| SEV-4 (Low) | Minor issue, cosmetic, single user affected | Next business day | Resolution only |
Rules:
Answer these questions before taking action:
Build a timeline (MANDATORY):
HH:MM UTC — [Event] — [Source of information]
14:23 UTC — Error rate spike to 15% (normal: <1%) — Datadog alert
14:25 UTC — Deployment abc123 completed — GitHub Actions
14:27 UTC — First customer report via support — Zendesk ticket #4521
The timeline is the single most important artifact. Update it continuously.
Choose the fastest path to reduce user impact. Speed over elegance.
Mitigation options (in order of preference):
| Option | Speed | Risk | When to use |
|---|---|---|---|
| Feature flag off | Seconds | Low | Feature is behind a flag |
| Rollback deployment | 1-5 min | Low | Recent deployment is the likely cause |
| Scale up/out | 1-5 min | Low | Load-related, capacity issue |
| Traffic redirect | 1-5 min | Medium | Regional issue, failover available |
| Configuration change | 1-10 min | Medium | Bad config deployed |
| Hotfix deploy | 10-30 min | Higher | Root cause identified and fix is small |
| Service isolation | 1-5 min | Medium | Cascade prevention, circuit breaker |
Rules:
Only after mitigation is confirmed effective:
Check the change log — recent deployments, config changes, dependency updates, infrastructure changes
git log --oneline --since="2 hours ago"
Correlate with the timeline — what changed within 30 minutes before the incident started?
Trace the failure path:
Form a hypothesis — specific and falsifiable:
Test the hypothesis with one change — confirm or refute before moving to the next
Identify contributing factors beyond the root cause:
For every root cause, define concrete prevention measures:
| Prevention type | Example | Timeline |
|---|---|---|
| Immediate | Add missing validation, fix the bug | This sprint |
| Short-term | Add test, add monitoring alert, add circuit breaker | Next sprint |
| Long-term | Architecture change, process improvement, training | Next quarter |
Rules:
Who to notify (by severity):
| Severity | Notify | Channel |
|---|---|---|
| SEV-1 | Engineering lead, product lead, support lead, affected customers | Incident Slack channel + status page |
| SEV-2 | Engineering lead, product lead | Incident Slack channel |
| SEV-3 | Team lead | Team Slack channel |
| SEV-4 | Log for next standup | None |
Status update template:
**Incident: [title]**
**Severity:** SEV-[1/2/3/4]
**Status:** [Investigating / Mitigating / Monitoring / Resolved]
**Impact:** [who is affected and how]
**Current action:** [what is being done right now]
**Next update:** [time]
Rules:
# Post-Mortem: [Incident Title]
**Date:** [date]
**Duration:** [start time] — [end time] ([total duration])
**Severity:** SEV-[level]
**Author:** [name]
**Reviewers:** [names]
## Summary
[2-3 sentences: what happened, who was affected, what was the impact]
## Timeline
| Time (UTC) | Event | Source |
|---|---|---|
| [HH:MM] | [what happened] | [how we know] |
## Impact
- **Users affected:** [number or percentage]
- **Duration of impact:** [time]
- **Data impact:** [none / lost / corrupted / exposed]
- **Financial impact:** [if any]
- **SLA impact:** [if any]
## Root Cause
[Detailed technical explanation. Not "human error" — what system allowed this to happen?]
## Contributing Factors
- [Factor 1 — why it made the incident worse or harder to detect]
- [Factor 2]
## Resolution
[What was done to resolve the incident — both mitigation and permanent fix]
## Detection
- **How was it detected?** [alert / customer report / internal discovery]
- **Time to detect:** [minutes from start to detection]
- **Could we have detected it sooner?** [yes/no — how?]
## Action Items
| # | Action | Type | Owner | Deadline | Status |
|---|---|---|---|---|---|
| 1 | [action] | Prevent / Detect / Mitigate | [name] | [date] | TODO |
| 2 | [action] | Prevent / Detect / Mitigate | [name] | [date] | TODO |
## Lessons Learned
- **What went well:** [things that helped during the response]
- **What went poorly:** [things that hindered the response]
- **Where we got lucky:** [things that could have made it worse]
Deliver:
Use the runbook template (templates/runbook.md) for creating runbooks referenced during incidents.
Provides behavioral guidelines to reduce common LLM coding mistakes, focusing on simplicity, surgical changes, assumption surfacing, and verifiable success criteria.
Searches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.
npx claudepluginhub hpsgd/turtlestack --plugin devops