From fairdb-ops-manager
Guides P0 critical PostgreSQL database outage response with verification commands, diagnostics, recovery procedures, alerts, and documentation templates.
Slash command: /fairdb-ops-manager:incident-p0-database-down
# SOP-201: P0 - Database Down (CRITICAL)

🚨 **EMERGENCY INCIDENT RESPONSE**

You are responding to a **P0 CRITICAL incident**: PostgreSQL database is down.

## Severity: P0 - CRITICAL

- **Impact:** ALL customers affected
- **Response Time:** IMMEDIATE
- **Resolution Target:** <15 minutes

## Your Mission

Guide rapid diagnosis and recovery with:

- Systematic troubleshooting steps
- Clear commands for each check
- Fast recovery procedures
- Customer communication templates
- Post-incident documentation

## IMMEDIATE ACTIONS (First 60 seconds)

### 1. Verify the Issue
# Is PostgreSQL running?
sudo systemctl status postgresql
# Can we connect?
sudo -u postgres psql -c "SELECT 1;"
# Check recent logs
sudo tail -100 /var/log/postgresql/postgresql-16-main.log
Post to incident channel IMMEDIATELY:
🚨 P0 INCIDENT - Database Down
Time: [TIMESTAMP]
Server: VPS-XXX
Impact: All customers unable to connect
Status: Investigating
ETA: TBD
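Filling the alert in by hand costs time in the first minute. A minimal sketch of a helper that renders the template above; `format_p0_alert` is a hypothetical name, not an existing FairDB script, and how the text reaches the incident channel is left to your tooling.

```shell
# Sketch: render the P0 alert text from the template above.
# format_p0_alert is a hypothetical helper, not part of FairDB tooling.
format_p0_alert() {
  local server="$1"
  local status="${2:-Investigating}"
  # Allow the caller to pass a timestamp; default to the current UTC time.
  local timestamp="${3:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  cat <<EOF
🚨 P0 INCIDENT - Database Down
Time: ${timestamp}
Server: ${server}
Impact: All customers unable to connect
Status: ${status}
ETA: TBD
EOF
}
```

Usage: `format_p0_alert VPS-XXX` prints the message with the current UTC time, ready to paste.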
sudo systemctl status postgresql
sudo systemctl status pgbouncer # If installed
Possible states:
- inactive (dead) → Service stopped
- failed → Service crashed
- active (running) → Service running but not responding
# Check for PostgreSQL processes
ps aux | grep postgres
# Check listening ports
sudo ss -tlnp | grep 5432
sudo ss -tlnp | grep 6432 # pgBouncer
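The service states listed above can be mapped to a next step mechanically. A minimal sketch, assuming the state string comes from `systemctl is-active`; `next_step_for_state` is a hypothetical helper name. On Debian-style installs the per-cluster unit is often `postgresql@16-main` rather than `postgresql`, so treat the unit name as an assumption too.

```shell
# Sketch: map the output of `systemctl is-active <unit>` to a next step.
# next_step_for_state is a hypothetical helper name.
next_step_for_state() {
  case "$1" in
    inactive) echo "Service stopped: try a plain start" ;;
    failed)   echo "Service crashed: read the logs before restarting" ;;
    active)   echo "Service running: check processes, ports, and connections" ;;
    *)        echo "Unrecognized state '$1': inspect systemctl status output" ;;
  esac
}

# Example (not run here):
#   next_step_for_state "$(systemctl is-active postgresql)"
```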
df -h /var/lib/postgresql
⚠️ If disk is full (100%): follow the "If disk is 100% full" recovery steps below.
# Check for errors in PostgreSQL log
sudo grep -i "error\|fatal\|panic" /var/log/postgresql/postgresql-16-main.log | tail -50
# Check system logs
sudo journalctl -u postgresql -n 100 --no-pager
# Check for OOM (Out of Memory) kills
sudo grep -i "killed process" /var/log/syslog | grep postgres
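A per-severity count makes it quicker to see which kind of failure dominates the log. A minimal sketch, assuming the default English log format where the severity is followed by a colon (for example `FATAL:`); `summarize_log_severities` is a hypothetical helper name.

```shell
# Sketch: count ERROR, FATAL, and PANIC lines in a PostgreSQL log file.
# summarize_log_severities is a hypothetical helper name.
summarize_log_severities() {
  local logfile="$1"
  local level
  for level in ERROR FATAL PANIC; do
    # grep -c prints the number of matching lines (0 if none).
    printf '%s=%s\n' "$level" "$(grep -c "${level}:" "$logfile")"
  done
}

# Example (not run here; reading the log usually needs sudo):
#   summarize_log_severities /var/log/postgresql/postgresql-16-main.log
```

Any PANIC count above zero points toward the corruption procedure rather than a plain restart.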
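# Test PostgreSQL config: parse the configuration and print one setting
# (on Debian/Ubuntu layouts postgresql.conf lives in /etc/postgresql/16/main; a broken config makes this command fail)
sudo -u postgres /usr/lib/postgresql/16/bin/postgres -D /var/lib/postgresql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf -C shared_buffers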
# Check for lock files
ls -la /var/run/postgresql/
ls -la /var/lib/postgresql/16/main/postmaster.pid
If service is stopped but no obvious errors:
# Start PostgreSQL
sudo systemctl start postgresql
# Check status
sudo systemctl status postgresql
# Test connection
sudo -u postgres psql -c "SELECT version();"
# Monitor logs
sudo tail -f /var/log/postgresql/postgresql-16-main.log
✅ If successful: Jump to "Post-Recovery" section
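After a start, the server may take a few seconds before it accepts connections. A minimal sketch of a polling loop for that wait; `retry_until` is a hypothetical helper and the 30-second budget in the example is an assumption, not a FairDB setting.

```shell
# Sketch: poll a command once per second until it succeeds or a timeout expires.
# retry_until is a hypothetical helper; pick a timeout that fits your recovery target.
retry_until() {
  local timeout="$1"
  shift
  local waited=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example (not run here):
#   retry_until 30 pg_isready -q -h /var/run/postgresql -p 5432 \
#     && echo "accepting connections" || echo "still down after 30s"
```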
If error mentions "postmaster.pid already exists":
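# Stop PostgreSQL (if running)
sudo systemctl stop postgresql
# Confirm no postgres processes remain; removing the PID file under a live server risks data corruption
pgrep -a postgres
# Remove stale PID file (only if the check above printed nothing)
sudo rm /var/lib/postgresql/16/main/postmaster.pid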
# Start PostgreSQL
sudo systemctl start postgresql
# Verify
sudo systemctl status postgresql
sudo -u postgres psql -c "SELECT 1;"
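Before deleting `postmaster.pid`, it helps to confirm that the PID it records is really gone. A minimal sketch, relying on the fact that the first line of `postmaster.pid` is the postmaster's PID; `pid_file_is_stale` is a hypothetical helper name. A live PID does not prove the process is PostgreSQL (PIDs get reused), so the helper only ever reports "stale" when no such process exists.

```shell
# Sketch: succeed (exit 0) only when the PID file names a process that no longer exists.
# pid_file_is_stale is a hypothetical helper name.
pid_file_is_stale() {
  local pidfile="$1"
  local pid
  # No PID file at all: nothing to clean up, so report "not stale".
  [ -f "$pidfile" ] || return 1
  # The first line of postmaster.pid holds the postmaster's PID.
  pid="$(head -n 1 "$pidfile" | tr -d '[:space:]')"
  # Refuse to guess if the first line is empty or not a number.
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;
  esac
  # kill -0 covers our own processes; /proc/<pid> covers other users' on Linux.
  if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
    return 1
  fi
  return 0
}

# Example (not run here; reading the data directory needs sudo):
#   pid_file_is_stale /var/lib/postgresql/16/main/postmaster.pid && echo "safe to remove"
```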
If disk is 100% full:
# Find largest files
sudo du -sh /var/lib/postgresql/16/main/* | sort -rh | head -10
# Option A: Clear old logs
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete
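# Option B: Vacuum to reclaim space
# Needs a running server, and VACUUM FULL rewrites each table, so it needs free space
# about the size of the largest table; run it only after another option has freed room.
sudo -u postgres vacuumdb --all --full
# Option C: Archive/delete old WAL files (DANGER!)
# Only if you have confirmed backups! Deleting WAL that crash recovery or replicas still need can make the cluster unrecoverable.
# The segment name below is an example; every WAL file older than it is removed. Preview with -n (dry run) first.
sudo -u postgres pg_archivecleanup -n /var/lib/postgresql/16/main/pg_wal 000000010000000000000010
sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal 000000010000000000000010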
# Check space
df -h /var/lib/postgresql
# Start PostgreSQL
sudo systemctl start postgresql
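Starting PostgreSQL on a volume that is still nearly full usually leads straight back to the same outage. A minimal sketch of a guard that checks headroom first; `disk_usage_pct`, `disk_has_headroom`, and the 90% default threshold are illustrative assumptions.

```shell
# Sketch: print the used-space percentage of the filesystem holding a path.
# disk_usage_pct and disk_has_headroom are hypothetical helper names.
disk_usage_pct() {
  # df -P prints one POSIX-format line per filesystem; column 5 is the capacity.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds only when usage is below the given threshold (default 90, an assumption).
disk_has_headroom() {
  local pct
  pct="$(disk_usage_pct "$1")"
  [ -n "$pct" ] && [ "$pct" -lt "${2:-90}" ]
}

# Example (not run here):
#   disk_has_headroom /var/lib/postgresql && sudo systemctl start postgresql
```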
If config test fails:
# Restore backup config
sudo cp /etc/postgresql/16/main/postgresql.conf.backup /etc/postgresql/16/main/postgresql.conf
sudo cp /etc/postgresql/16/main/pg_hba.conf.backup /etc/postgresql/16/main/pg_hba.conf
# Start PostgreSQL
sudo systemctl start postgresql
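Copying the backup over the live config destroys the broken version, which the post-incident report may need. A minimal sketch of a helper that keeps a timestamped copy first; `preserve_copy` and the `.broken.` suffix are hypothetical, not FairDB conventions.

```shell
# Sketch: keep a timestamped copy of a file before overwriting it.
# preserve_copy is a hypothetical helper; the ".broken." suffix is an assumption.
preserve_copy() {
  local src="$1"
  local dest
  dest="${src}.broken.$(date +%Y%m%d-%H%M%S)"
  # -p keeps mode and timestamps so the copy can be compared later.
  cp -p "$src" "$dest" && printf '%s\n' "$dest"
}

# Example (not run here; needs root for /etc/postgresql):
#   preserve_copy /etc/postgresql/16/main/postgresql.conf
#   preserve_copy /etc/postgresql/16/main/pg_hba.conf
```

Run it before the `cp` commands above; it prints the path of the saved copy.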
If logs show corruption errors:
# Stop PostgreSQL
sudo systemctl stop postgresql
# Run filesystem check (if safe to do so)
# sudo fsck /dev/sdX # Only if unmounted!
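# Try single-user mode recovery (opens the "postgres" database; exit with Ctrl-D)
# On Debian/Ubuntu layouts the config file lives under /etc/postgresql, so pass it explicitly
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D /var/lib/postgresql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf postgres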
# If that fails, restore from backup (SOP-204)
⚠️ At this point, escalate to backup restoration procedure!
# Test connections
sudo -u postgres psql -c "SELECT version();"
# Check all databases
sudo -u postgres psql -c "\l"
# Test customer database access (example)
sudo -u postgres psql -d customer_db_001 -c "SELECT 1;"
# Check active connections
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
# Run health check
/opt/fairdb/scripts/pg-health-check.sh
✅ RESOLVED - Database Restored
Resolution Time: [X minutes]
Root Cause: [Brief description]
Recovery Method: [Which recovery procedure used]
Customer Impact: [Duration of outage]
Follow-up: [Post-mortem scheduled]
Template:
Subject: [RESOLVED] Database Service Interruption
Dear FairDB Customer,
We experienced a brief service interruption affecting database
connectivity from [START_TIME] to [END_TIME] ([DURATION]).
The issue has been fully resolved and all services are operational.
Root Cause: [Brief explanation]
Resolution: [What we did]
Prevention: [Steps to prevent recurrence]
We apologize for any inconvenience. If you continue to experience
issues, please contact [email protected].
- FairDB Operations Team
Create incident report at /opt/fairdb/incidents/YYYY-MM-DD-database-down.md:
# Incident Report: Database Down
**Incident ID:** INC-YYYYMMDD-001
**Severity:** P0 - Critical
**Date:** YYYY-MM-DD
**Duration:** X minutes
## Timeline
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Resolution implemented
- HH:MM - Service restored
- HH:MM - Verified functionality
## Root Cause
[Detailed explanation]
## Impact
- Customers affected: X
- Downtime: X minutes
- Data loss: None / [describe if any]
## Resolution
[Detailed steps taken]
## Prevention
[Action items to prevent recurrence]
## Follow-up Tasks
- [ ] Review monitoring alerts
- [ ] Update runbooks
- [ ] Implement preventive measures
- [ ] Schedule post-mortem meeting
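Creating the report file from a script keeps the naming consistent across incidents. A minimal sketch that writes only the header of the template above; `new_incident_report` is a hypothetical helper, and the fixed `-001` suffix assumes one incident per day.

```shell
# Sketch: create an incident report skeleton named after the given date.
# new_incident_report is a hypothetical helper; the SOP's directory is /opt/fairdb/incidents.
new_incident_report() {
  local dir="$1"
  local day="${2:-$(date +%Y-%m-%d)}"
  local file="${dir}/${day}-database-down.md"
  # Never overwrite a report that already exists.
  if [ -e "$file" ]; then
    echo "exists: $file" >&2
    return 1
  fi
  mkdir -p "$dir" || return 1
  # The -001 sequence number is fixed here; bump it by hand for a second incident.
  cat > "$file" <<EOF
# Incident Report: Database Down
**Incident ID:** INC-$(printf '%s' "$day" | tr -d '-')-001
**Severity:** P0 - Critical
**Date:** ${day}
**Duration:** X minutes
EOF
  printf '%s\n' "$file"
}

# Example (not run here; may need sudo):
#   new_incident_report /opt/fairdb/incidents
```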
Escalate if:
Escalation contacts: [Document your escalation chain]
Begin by asking:
Then immediately execute Diagnostic Protocol starting with Check 1.
Remember: Speed is critical. Every minute counts. Stay calm, work systematically.