Systematic root cause analysis using 5 Whys, fishbone diagrams, and fault tree analysis. Use this skill when investigating why an incident happened, performing RCA, or writing postmortems. Activate when: root cause, why did this happen, 5 whys, incident analysis, postmortem investigation, how did this happen, what caused, failure analysis.
Slash command: `/devops-sre:root-cause-analysis`
**Find the real cause, not just the symptoms, to prevent recurrence.**
Keep asking "why" until you reach an actionable root cause.
Problem: API returned 500 errors for 45 minutes
Why #1: Why did the API return 500 errors?
→ The database connection pool was exhausted
Why #2: Why was the connection pool exhausted?
→ Connections weren't being released after queries
Why #3: Why weren't connections being released?
→ A code change introduced a bug that skipped connection.close()
Why #4: Why wasn't this caught before production?
→ Our integration tests don't check for connection leaks
Why #5: Why don't integration tests check for connection leaks?
→ We haven't implemented connection pool monitoring in tests
ROOT CAUSE: Missing connection leak detection in test suite
ACTION: Add connection pool assertions to integration tests
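The action above can be sketched as a test-style assertion. This is a minimal illustration, not the incident's real code: `TinyPool`, `leaky_query`, `safe_query`, and `assert_no_leak` are hypothetical names, and the pool only counts checkouts instead of opening real connections.

```python
class TinyPool:
    """Hypothetical stand-in for a real connection pool; it only counts checkouts."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()  # placeholder for a real connection

    def release(self, conn):
        self.in_use -= 1


def leaky_query(pool):
    """Mimics the bug from Why #3: the connection is never released."""
    pool.acquire()
    return "rows"  # missing pool.release(...)


def safe_query(pool):
    """Releases the connection even if the query raises."""
    conn = pool.acquire()
    try:
        return "rows"
    finally:
        pool.release(conn)


def assert_no_leak(pool, fn, calls=3):
    """Test-style check: checkouts must return to the baseline after the calls."""
    before = pool.in_use
    for _ in range(calls):
        fn(pool)
    leaked = pool.in_use - before
    assert leaked == 0, f"{leaked} connection(s) leaked by {fn.__name__}"


assert_no_leak(TinyPool(size=10), safe_query)  # passes silently
```

Running `assert_no_leak(TinyPool(size=10), leaky_query)` raises an `AssertionError`, which is the failure a CI run would surface before the change reached production. A real suite would read the in-use count from whatever statistics the actual pool library exposes.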
| Do | Don't |
|---|---|
| Use data, not assumptions | Stop at "human error" |
| Consider multiple branches | Accept vague answers |
| Verify each "because" | Skip to conclusions |
| Look for systemic issues | Blame individuals |
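Some of these rules can be checked mechanically before a review meeting. The sketch below is one possible approach under stated assumptions: the `Why` record, the `review_chain` helper, and the list of blame phrases are all hypothetical, and a phrase match is only a prompt for a human to dig deeper.

```python
from dataclasses import dataclass

# Assumed phrases suggesting the chain stopped at blame rather than a systemic cause.
BLAME_PHRASES = ("human error", "operator mistake", "someone forgot")


@dataclass
class Why:
    question: str
    answer: str
    evidence: str = ""  # pointer to the log, metric, or PR that backs the answer


def review_chain(chain):
    """Return warnings for a 5 Whys chain; an empty list means nothing was flagged."""
    warnings = []
    for i, step in enumerate(chain, start=1):
        if not step.evidence.strip():
            warnings.append(f"Why #{i}: no evidence recorded")
        if any(phrase in step.answer.lower() for phrase in BLAME_PHRASES):
            warnings.append(f"Why #{i}: stops at blame, ask why the system allowed it")
    return warnings


chain = [
    Why("Why did the API return 500 errors?",
        "The database connection pool was exhausted",
        evidence="pool metrics"),  # illustrative evidence label
    Why("Why was the bug deployed?", "Human error"),
]
for warning in review_chain(chain):
    print(warning)
```

The second step is flagged twice: it records no evidence and it stops at "human error".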
Most incidents have multiple contributing factors.
┌─────────────────────────────────────────────────────────────┐
│ INCIDENT: API OUTAGE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Direct Cause: │
│ └─ Database connection pool exhaustion │
│ │
│ Contributing Factors: │
│ ├─ [Code] Connection leak bug in PR #1234 │
│ ├─ [Process] Code review didn't catch the bug │
│ ├─ [Testing] No connection leak tests │
│ ├─ [Monitoring] No alert for connection pool usage │
│ ├─ [Deploy] Deployed during high-traffic period │
│ └─ [Recovery] Runbook for this scenario was outdated │
│ │
│ Environmental Factors: │
│ ├─ Team was understaffed (vacation season) │
│ └─ Similar incident 6 months ago, action items incomplete │
│ │
└─────────────────────────────────────────────────────────────┘
Work backwards from failure to identify all paths.
[API Outage]
├─ [DB Connections Exhausted]
│  ├─ [Connection Leak]
│  │  └─ [Bug in Code]
│  └─ [Too Many Requests]
│     ├─ [Traffic Spike]
│     │  └─ [Marketing Campaign]
│     └─ [Missing Rate Limit]
└─ [App Server Crashed]
   └─ [OOM Error]
      └─ [Memory Leak]
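To make "identify all paths" concrete, the tree above can be written as nested data and walked from the top event down to each leaf cause. This is a minimal sketch: the dict encoding and the `cause_paths` helper are illustrative choices, and placing Marketing Campaign under Traffic Spike is an assumption based on the diagram layout and the incident timeline.

```python
# Fault tree from the example as nested dicts: event -> its possible causes.
FAULT_TREE = {
    "API Outage": {
        "DB Connections Exhausted": {
            "Connection Leak": {"Bug in Code": {}},
            "Too Many Requests": {
                "Traffic Spike": {"Marketing Campaign": {}},
                "Missing Rate Limit": {},
            },
        },
        "App Server Crashed": {"OOM Error": {"Memory Leak": {}}},
    }
}


def cause_paths(tree, prefix=()):
    """Yield every path from the top event down to a leaf cause."""
    for event, children in tree.items():
        path = prefix + (event,)
        if children:
            yield from cause_paths(children, path)
        else:
            yield path


for path in cause_paths(FAULT_TREE):
    print(" <- ".join(path))
```

Each printed line is one candidate failure path to confirm or rule out with evidence. This example yields four paths.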
Detailed timeline helps identify the chain of events.
Timeline: API Outage - 2026-01-15
Time (UTC) | Event | Source
------------|------------------------------|--------
09:00 | Deploy v2.3.4 started | GitHub
09:15 | Deploy completed | K8s
09:45 | Marketing email sent (50k) | Marketing
10:02 | Traffic spike begins | Datadog
10:15 | Connection pool at 80% | Metrics
10:23 | First 500 errors | Logs
10:25 | Alert fired | PagerDuty
10:27 | On-call acknowledged | PagerDuty
10:35 | Root cause identified | Slack
10:42 | Rollback initiated | K8s
10:48 | Service recovering | Datadog
11:00 | All clear declared | Slack
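Key Finding: 70 minutes between deploy completion (09:15) and the alert (10:25); 38 minutes between the marketing email (09:45) and the first 500 errors (10:23)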
Deploy + traffic spike = perfect storm
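Intervals between timeline events are easy to miscount by hand, so it can help to compute them. The sketch below copies the timestamps from the timeline above. The event keys and the `minutes_between` helper are hypothetical names, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Events copied from the timeline (all on 2026-01-15, UTC).
EVENTS = {
    "deploy_completed": "09:15",
    "marketing_email": "09:45",
    "traffic_spike": "10:02",
    "first_500": "10:23",
    "alert_fired": "10:25",
    "acknowledged": "10:27",
    "rollback": "10:42",
    "all_clear": "11:00",
}


def minutes_between(start, end, events=EVENTS):
    """Minutes elapsed between two named events on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(events[end], fmt) - datetime.strptime(events[start], fmt)
    return int(delta.total_seconds() // 60)


print(minutes_between("deploy_completed", "alert_fired"))  # 70
print(minutes_between("marketing_email", "first_500"))     # 38
print(minutes_between("first_500", "alert_fired"))         # 2
print(minutes_between("alert_fired", "all_clear"))         # 35
```

The same helper gives time-to-detect, time-to-acknowledge, and time-to-recover figures for the postmortem.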
Good action items are SMART:
| Criteria | Bad Example | Good Example |
|---|---|---|
| Specific | "Improve testing" | "Add connection pool leak test to CI" |
| Measurable | "Monitor better" | "Alert when pool > 80% for 5 min" |
| Assignable | "Team should fix" | "@jane owns implementation" |
| Realistic | "Rewrite entire system" | "Add circuit breaker to DB calls" |
| Time-bound | "Soon" | "Complete by 2026-02-01" |
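A measurable action item such as "Alert when pool > 80% for 5 min" maps directly onto a check. The sketch below is illustrative only: `sustained_breach` is a hypothetical helper, and it assumes one usage sample per minute, expressed as a ratio between 0.0 and 1.0. In practice the rule would live in the monitoring system's own alert configuration.

```python
def sustained_breach(samples, threshold=0.80, window_minutes=5):
    """Return True if usage stayed above the threshold for window_minutes in a row.

    samples: pool usage ratios, one per minute, oldest first (assumed sampling rate).
    """
    streak = 0
    for usage in samples:
        # Extend the streak on a breach, reset it on any sample at or below threshold.
        streak = streak + 1 if usage > threshold else 0
        if streak >= window_minutes:
            return True
    return False


print(sustained_breach([0.70, 0.85, 0.90, 0.82, 0.81, 0.95]))        # True
print(sustained_breach([0.85, 0.90, 0.70, 0.90, 0.90, 0.90, 0.90]))  # False
```

Requiring a sustained breach rather than a single sample keeps a brief spike from paging the on-call engineer.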
## Root Cause Analysis
### Direct Cause
[What directly caused the incident]
### 5 Whys Analysis
1. Why? → [Answer]
2. Why? → [Answer]
3. Why? → [Answer]
4. Why? → [Answer]
5. Why? → [Root cause]
### Contributing Factors
- **Technical:** [List]
- **Process:** [List]
- **Organizational:** [List]
### Why Wasn't This Caught?
- In development: [Why]
- In code review: [Why]
- In testing: [Why]
- In staging: [Why]
- By monitoring: [Why]
### Action Items
| Priority | Action | Owner | Due | Prevents |
|----------|--------|-------|-----|----------|
| P0 | [Action] | @name | [Date] | Direct cause |
| P1 | [Action] | @name | [Date] | Detection |
| P2 | [Action] | @name | [Date] | Future risk |
npx claudepluginhub latestaiagents/agent-skills --plugin devops-sre