sre-incident-response | site-reliability-engineering

SRE Incident Response

Managing incidents and conducting effective postmortems.

Incident Severity Levels

P0 - Critical

Impact: Service completely down or major functionality unavailable
Response: Immediate, all-hands
Communication: Every 30 minutes
Examples: Complete outage, data loss, security breach

P1 - High

Impact: Significant degradation affecting many users
Response: Immediate, primary on-call
Communication: Every hour
Examples: Elevated error rates, slow response times

P2 - Medium

Impact: Minor degradation or single component affected
Response: Next business day
Communication: Daily updates
Examples: Single region issue, non-critical feature down

P3 - Low

Impact: No user impact yet, potential future issue
Response: Track in backlog
Communication: Async
Examples: Monitoring gaps, capacity warnings
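
The severity table above can also be kept as data, so that tooling and people read the same definitions. The following is a minimal sketch, not a prescribed implementation: the `SeverityPolicy` type, the `classify` helper, and its three boolean inputs are hypothetical names introduced for illustration, and the flags are deliberately coarse (they do not capture data loss or security breaches, which the table also treats as P0).

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    level: str
    response: str
    # None means updates are asynchronous, with no fixed cadence.
    update_interval: Optional[timedelta]


# Values copied from the severity definitions above.
POLICIES = {
    "P0": SeverityPolicy("P0", "Immediate, all-hands", timedelta(minutes=30)),
    "P1": SeverityPolicy("P1", "Immediate, primary on-call", timedelta(hours=1)),
    "P2": SeverityPolicy("P2", "Next business day", timedelta(days=1)),
    "P3": SeverityPolicy("P3", "Track in backlog", None),
}


def classify(full_outage: bool, many_users_degraded: bool, any_user_impact: bool) -> str:
    """Pick a severity from coarse impact flags (hypothetical inputs)."""
    if full_outage:
        return "P0"
    if many_users_degraded:
        return "P1"
    if any_user_impact:
        return "P2"
    return "P3"
```

For example, `POLICIES[classify(False, True, True)]` yields the P1 policy, whose `update_interval` is one hour.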

Incident Response Process

1. Detection

Alert fires → On-call acknowledges → Initial assessment

2. Triage

- Assess severity
- Page additional responders if needed
- Establish incident channel
- Assign incident commander

3. Mitigation

- Identify mitigation options
- Execute fastest safe mitigation
- Monitor for improvement
- Escalate if not improving
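
"Escalate if not improving" is easier to act on when the team agrees in advance on what counts as improvement. The sketch below shows one possible rule, assuming error-rate samples are collected at a fixed interval starting when the mitigation is applied. The function name, `min_samples`, and `required_drop` are illustrative assumptions, not recommended values.

```python
from typing import Sequence


def should_escalate(
    error_rates: Sequence[float],
    min_samples: int = 5,
    required_drop: float = 0.2,
) -> bool:
    """Return True when the metric has not improved enough since mitigation.

    error_rates: samples at a fixed interval; the first one is captured
    when the mitigation is applied.
    """
    if len(error_rates) < min_samples:
        return False  # Too early to judge; keep monitoring.
    baseline = error_rates[0]
    if baseline <= 0:
        return False  # Nothing to improve on.
    # Average the last few samples to smooth out noise.
    recent_window = error_rates[-3:]
    recent = sum(recent_window) / len(recent_window)
    improvement = (baseline - recent) / baseline
    return improvement < required_drop
```

With these defaults, a series that stays near its starting value triggers escalation after five samples, while one that falls by more than 20% does not.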

4. Resolution

- Verify service health
- Communicate resolution
- Document actions taken
- Schedule postmortem

5. Follow-up

- Conduct postmortem
- Identify action items
- Track completion
- Update runbooks
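
The five phases above can be modeled as a small state machine, which helps incident tooling reject out-of-order steps such as scheduling follow-up before the service is verified healthy. This is a sketch under stated assumptions: the `Phase` and `Incident` names are hypothetical, and the backward edge from Resolution to Mitigation is an assumption covering the case where verification fails.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = 1
    TRIAGE = 2
    MITIGATION = 3
    RESOLUTION = 4
    FOLLOW_UP = 5


# Allowed transitions between phases. RESOLUTION -> MITIGATION is an
# assumed edge for when the health check fails after a mitigation.
TRANSITIONS = {
    Phase.DETECTION: {Phase.TRIAGE},
    Phase.TRIAGE: {Phase.MITIGATION},
    Phase.MITIGATION: {Phase.RESOLUTION},
    Phase.RESOLUTION: {Phase.MITIGATION, Phase.FOLLOW_UP},
    Phase.FOLLOW_UP: set(),
}


class Incident:
    def __init__(self) -> None:
        self.phase = Phase.DETECTION
        self.history = [Phase.DETECTION]

    def advance(self, target: Phase) -> None:
        """Move to the target phase, or raise if the step is not allowed."""
        if target not in TRANSITIONS[self.phase]:
            raise ValueError(f"cannot move from {self.phase.name} to {target.name}")
        self.phase = target
        self.history.append(target)
```

Keeping `history` gives the postmortem a record of when each phase began, provided the caller also stores timestamps alongside it.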

Incident Roles

Incident Commander (IC)

- Owns incident response
- Makes decisions
- Coordinates responders
- Manages communication
- Declares incident resolved

Operations Lead

- Executes technical remediation
- Proposes mitigation strategies
- Implements fixes
- Tests changes

Communications Lead

- Updates status page
- Posts to incident channel
- Notifies stakeholders
- Prepares external messaging

Planning Lead

- Tracks action items
- Takes detailed notes
- Monitors responder fatigue
- Coordinates shift changes

Communication Templates

Initial Notification

🚨 INCIDENT DECLARED - P0

Service: API Gateway
Impact: All API requests failing
Started: 2024-01-15 14:23 UTC
IC: @alice
Status Channel: #incident-001

Current Status: Investigating
Next Update: 30 minutes
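
Rendering the initial notification from structured fields keeps the wording consistent under pressure. The sketch below reproduces the template above. The function and parameter names are hypothetical, `started` is assumed to already be in UTC, and the update cadence is taken from the P0 and P1 severity definitions in this document.

```python
from datetime import datetime, timezone

# Minutes between updates per severity, from the severity definitions.
UPDATE_MINUTES = {"P0": 30, "P1": 60}


def initial_notification(severity, service, impact, started, ic, channel):
    """Render the initial notification; parameter names are illustrative."""
    lines = [
        f"🚨 INCIDENT DECLARED - {severity}",
        "",
        f"Service: {service}",
        f"Impact: {impact}",
        f"Started: {started:%Y-%m-%d %H:%M} UTC",  # assumes started is UTC
        f"IC: {ic}",
        f"Status Channel: {channel}",
        "",
        "Current Status: Investigating",
        f"Next Update: {UPDATE_MINUTES[severity]} minutes",
    ]
    return "\n".join(lines)


print(initial_notification(
    "P0",
    "API Gateway",
    "All API requests failing",
    datetime(2024, 1, 15, 14, 23, tzinfo=timezone.utc),
    "@alice",
    "#incident-001",
))
```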

Status Update

📊 INCIDENT UPDATE #2 - P0

Service: API Gateway
Elapsed: 45 minutes

Progress: Identified root cause as database connection pool exhaustion.
Mitigation: Increasing pool size and restarting services.

ETA to Resolution: 15 minutes
Next Update: 15 minutes or when resolved

Resolution Notice

✅ INCIDENT RESOLVED - P0

Service: API Gateway
Duration: 1h 12m
Impact: 100% of API requests failed

Resolution: Increased database connection pool and restarted services.

Next Steps:
- Postmortem scheduled for tomorrow 10am
- Monitoring for recurrence
- Action items being tracked in #incident-001

Blameless Postmortem

Template

# Incident Postmortem: API Outage 2024-01-15

## Summary

On January 15th, our API was completely unavailable for 72 minutes due to
database connection pool exhaustion.

## Impact

- Duration: 72 minutes (14:23 - 15:35 UTC)
- Severity: P0
- Users Affected: 100% of API users (~50,000 requests failed)
- Revenue Impact: ~$5,000 in SLA credits

## Timeline

**14:23** - Alerts fire for elevated error rate
**14:25** - IC paged, incident channel created
**14:30** - Identified all database connections exhausted
**14:45** - Decided to increase pool size
**15:00** - Configuration deployed
**15:15** - Services restarted
**15:35** - Error rate returned to normal, incident resolved

## Root Cause

Database connection pool was sized for normal load (100 connections).
Traffic spike from new feature launch (3x normal) exhausted connections.
No alerting existed for connection pool utilization.

## What Went Well

- Detection and paging were quick (IC paged 2 minutes after the first alert)
- Team assembled rapidly
- Clear communication maintained

## What Didn't Go Well

- No capacity testing before feature launch
- Connection pool metrics not monitored
- No automated rollback capability

## Action Items

1. [P0] Add connection pool utilization monitoring (@bob, 1/17)
2. [P0] Implement automated rollback for deploys (@charlie, 1/20)
3. [P1] Establish capacity testing process (@diana, 1/25)
4. [P1] Increase connection pool to 300 (@bob, 1/16)
5. [P2] Update deployment runbook with load testing (@eve, 1/30)

## Lessons Learned

- Always load test before launching features
- Monitor resource utilization at all layers
- Have rollback mechanisms ready
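
The duration figures in a postmortem are worth deriving from the timeline rather than typing by hand, so that the Summary, Impact, and Timeline sections cannot drift apart. The sketch below recomputes them from the example timeline above. The metric names (`time_to_page`, `time_to_diagnose`, `time_to_resolve`) are illustrative labels rather than standard definitions, and the helper assumes all events fall on the same day.

```python
from datetime import datetime

# Timeline entries copied from the example postmortem (all times UTC, same day).
TIMELINE = [
    ("14:23", "Alerts fire for elevated error rate"),
    ("14:25", "IC paged, incident channel created"),
    ("14:30", "Identified all database connections exhausted"),
    ("14:45", "Decided to increase pool size"),
    ("15:00", "Configuration deployed"),
    ("15:15", "Services restarted"),
    ("15:35", "Error rate returned to normal, incident resolved"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


times = [t for t, _ in TIMELINE]
time_to_page = minutes_between(times[0], times[1])      # first alert to IC paged
time_to_diagnose = minutes_between(times[0], times[2])  # first alert to cause identified
time_to_resolve = minutes_between(times[0], times[-1])  # first alert to resolution

# Gaps between consecutive events show where the response spent its time.
gaps = [minutes_between(a, b) for a, b in zip(times, times[1:])]
```

For this timeline `time_to_resolve` is 72 minutes, matching the stated duration, and the largest gap is the 20 minutes between the service restart and recovery.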

Runbooks

Example Runbook

# Runbook: High Database Latency

## Symptoms

- Database query times > 500ms
- Elevated API latency
- Alert: DatabaseLatencyHigh

## Impact

Users experience slow page loads. Treat as P1 severity if p95 latency exceeds 1s.

## Investigation

1. Check database metrics in Grafana
   https://grafana.example.com/d/db-overview

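2. Identify slow queries:
   ```sql
   -- On PostgreSQL 12 and earlier, use total_time instead of total_exec_time.
   SELECT query, calls, total_exec_time, mean_exec_time
   FROM pg_stat_statements
   ORDER BY total_exec_time DESC LIMIT 10;
   ```

3. Check for sessions waiting on locks:
   ```sql
   -- Filtering on state = 'active' alone lists running queries, not lock waits.
   SELECT pid, state, wait_event_type, wait_event,
          pg_blocking_pids(pid) AS blocked_by, query
   FROM pg_stat_activity
   WHERE wait_event_type = 'Lock';
   ```

## Mitigation

Quick fixes:

- Kill long-running queries if safe
- Add missing indexes if identified
- Scale up read replicas if read-heavy

Escalation: If latency > 2s for > 15 minutes, page DBA team.

## Prevention

- Regular query performance reviews
- Automated index recommendations
- Capacity planning for growth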


## Best Practices

### Blameless Culture

- Focus on systems, not individuals
- Assume good intentions
- Learn from mistakes
- Reward transparency

### Clear Severity Definitions

- Severity should be based on user impact
- Document response time expectations
- Update definitions based on learnings

### Practice Incident Response

- Run "game days" quarterly
- Practice different scenarios
- Test on-call handoffs
- Review and improve runbooks

### Track Action Items

- Assign owners and due dates
- Review in team meetings
- Close the loop on completion
- Measure time to completion