Slash Command

/incident-p0-database-down

Guides P0 critical PostgreSQL database outage response with verification commands, diagnostics, recovery procedures, alerts, and documentation templates.

PostgreSQL

Popularity

Parent stars

2,199

Parent forks

296

Invocation

How this command is triggered — by the user, by Claude, or both

Slash command

/fairdb-ops-manager:incident-p0-database-down

Model invocable

No pre-commands

Configuration

Model: sonnet

Referenced Files

Files this command reads when invoked

fairdb.io.

Command Content

319 lines · ~1.8k tokens

Stats

Language: Python

Maintenance: Good

Last Commit: Mar 22, 2026

SOP-201: P0 - Database Down (CRITICAL)

🚨 EMERGENCY INCIDENT RESPONSE

You are responding to a P0 CRITICAL incident: PostgreSQL database is down.

Severity: P0 - CRITICAL

Impact: ALL customers affected
Response Time: IMMEDIATE
Resolution Target: <15 minutes

Your Mission

Guide rapid diagnosis and recovery with:

Systematic troubleshooting steps
Clear commands for each check
Fast recovery procedures
Customer communication templates
Post-incident documentation

IMMEDIATE ACTIONS (First 60 seconds)

1. Verify the Issue

# Is PostgreSQL running?
sudo systemctl status postgresql

# Can we connect?
sudo -u postgres psql -c "SELECT 1;"

# Check recent logs
sudo tail -100 /var/log/postgresql/postgresql-16-main.log

2. Alert Stakeholders

Post to incident channel IMMEDIATELY:

🚨 P0 INCIDENT - Database Down
Time: [TIMESTAMP]
Server: VPS-XXX
Impact: All customers unable to connect
Status: Investigating
ETA: TBD
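
To avoid typing the alert by hand under pressure, the template above can be rendered by a small helper. This is a minimal sketch: `render_alert` is a hypothetical function name, and the UTC timestamp format is an assumption rather than a FairDB convention.

```shell
# Hypothetical helper: render the P0 alert template.
# $1 = server name, $2 = optional timestamp (defaults to current UTC time).
render_alert() {
    server="$1"
    timestamp="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
    printf '%s\n' \
        "🚨 P0 INCIDENT - Database Down" \
        "Time: $timestamp" \
        "Server: $server" \
        "Impact: All customers unable to connect" \
        "Status: Investigating" \
        "ETA: TBD"
}

# Usage: render_alert VPS-XXX
```

Paste the output into the incident channel, then update Status and ETA as the investigation progresses.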

DIAGNOSTIC PROTOCOL

Check 1: Service Status

sudo systemctl status postgresql
sudo systemctl status pgbouncer  # If installed

Possible states:

inactive (dead) → Service stopped
failed → Service crashed
active (running) → Service running but not responding
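
The state mapping above can be scripted so the diagnosis is printed directly. A minimal sketch, assuming the state string comes from `systemctl is-active`; `interpret_state` is a hypothetical name. On Debian/Ubuntu the `postgresql` unit may be an umbrella unit, so the per-cluster unit (for example `postgresql@16-main`) may be the one worth querying.

```shell
# Hypothetical helper: map a systemd unit state to the diagnosis listed above.
# $1 = state string as printed by `systemctl is-active <unit>`.
interpret_state() {
    case "$1" in
        inactive) echo "Service stopped" ;;
        failed)   echo "Service crashed" ;;
        active)   echo "Service running but may not be responding" ;;
        *)        echo "Unexpected state: $1" ;;
    esac
}

# Usage on the affected server:
#   interpret_state "$(systemctl is-active postgresql)"
```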

Check 2: Process Status

# Check for PostgreSQL processes (the [p] keeps grep from matching itself)
ps aux | grep '[p]ostgres'

# Check listening ports
sudo ss -tlnp | grep 5432
sudo ss -tlnp | grep 6432  # pgBouncer

Check 3: Disk Space

df -h /var/lib/postgresql

⚠️ If disk is full (100%):

This is likely the cause!
Jump to the "Recovery 3: Disk Full Emergency" section
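
The disk check can be reduced to a yes/no answer. The sketch below assumes a 95% threshold, which is an illustrative value and not part of this SOP; `disk_usage_pct` and `disk_is_critical` are hypothetical names.

```shell
# Print the usage percentage (digits only) of the filesystem holding a path.
disk_usage_pct() {
    # -P forces the portable one-line-per-filesystem output format.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage ($1) is at or above the threshold ($2).
disk_is_critical() {
    [ "$1" -ge "$2" ]
}

# Usage (95 is an assumed threshold):
#   if disk_is_critical "$(disk_usage_pct /var/lib/postgresql)" 95; then
#       echo "Disk nearly full: go to Recovery 3"
#   fi
```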

Check 4: Log Analysis

# Check for errors in PostgreSQL log
sudo grep -i "error\|fatal\|panic" /var/log/postgresql/postgresql-16-main.log | tail -50

# Check system logs
sudo journalctl -u postgresql -n 100 --no-pager

# Check for OOM (Out of Memory) kills
sudo grep -i "killed process" /var/log/syslog | grep postgres
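
When the log is long, a per-severity count shows at a glance whether there is a PANIC or FATAL to chase first. A minimal sketch, assuming English log messages where each entry carries `ERROR:`, `FATAL:`, or `PANIC:`; the function names are hypothetical.

```shell
# Count log lines carrying one severity tag.
# $1 = log file, $2 = severity keyword (ERROR, FATAL, or PANIC).
count_severity() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c "$2:" "$1" || true
}

# Print one NAME=count line per severity, most serious first.
summarize_log() {
    for level in PANIC FATAL ERROR; do
        printf '%s=%s\n' "$level" "$(count_severity "$1" "$level")"
    done
}

# Usage:
#   sudo sh -c '. ./helpers.sh; summarize_log /var/log/postgresql/postgresql-16-main.log'
```

The `helpers.sh` file name in the usage line is only a placeholder for wherever these functions are saved.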

Check 5: Configuration Issues

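# Test PostgreSQL config: postgres has no --check flag, but -C parses
# postgresql.conf, prints one setting, and exits non-zero on a config error.
# -D points at the config directory (Debian/Ubuntu layout assumed).
sudo -u postgres /usr/lib/postgresql/16/bin/postgres -C data_directory -D /etc/postgresql/16/main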

# Check for lock files
ls -la /var/run/postgresql/
ls -la /var/lib/postgresql/16/main/postmaster.pid

RECOVERY PROCEDURES

Recovery 1: Simple Service Restart

If service is stopped but no obvious errors:

# Start PostgreSQL
sudo systemctl start postgresql

# Check status
sudo systemctl status postgresql

# Test connection
sudo -u postgres psql -c "SELECT version();"

# Monitor logs
sudo tail -f /var/log/postgresql/postgresql-16-main.log

✅ If successful: Jump to the "POST-RECOVERY ACTIONS" section

Recovery 2: Remove Stale PID File

If error mentions "postmaster.pid already exists":

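# Stop PostgreSQL (if running)
sudo systemctl stop postgresql

# Confirm no postgres process survives; removing the PID file under a
# live postmaster can corrupt the data directory
pgrep -a -u postgres postgres

# Remove stale PID file (only if the check above printed nothing)
sudo rm /var/lib/postgresql/16/main/postmaster.pid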

# Start PostgreSQL
sudo systemctl start postgresql

# Verify
sudo systemctl status postgresql
sudo -u postgres psql -c "SELECT 1;"
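
Before deleting `postmaster.pid`, it helps to confirm that the PID it records is really gone. The sketch below relies on the first line of `postmaster.pid` holding the postmaster's PID and on Linux `/proc`; `pid_file_is_stale` is a hypothetical name. It only checks that the PID is not alive, not that a live PID belongs to PostgreSQL.

```shell
# Exit 0 if the PID recorded in the file is not running (file is stale),
# exit 1 if that PID is alive, exit 2 if the first line is not a number.
pid_file_is_stale() {
    pid="$(head -n 1 "$1" | tr -d '[:space:]')"
    case "$pid" in
        ''|*[!0-9]*) return 2 ;;
    esac
    # On Linux, /proc/<pid> exists only while that process is alive.
    [ ! -d "/proc/$pid" ]
}

# Usage (run as root or postgres so the file is readable):
#   if pid_file_is_stale /var/lib/postgresql/16/main/postmaster.pid; then
#       echo "Stale PID file: safe to remove"
#   fi
```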

Recovery 3: Disk Full Emergency

If disk is 100% full:

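# Find largest files
sudo du -sh /var/lib/postgresql/16/main/* | sort -rh | head -10

# Option A: Clear old logs
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete

# Option B: Remove old WAL files (DANGER!)
# Only if you have confirmed backups! pg_archivecleanup deletes every segment
# older than the one named, so name the "Latest checkpoint's REDO WAL file"
# reported by pg_controldata. Unarchived or unreplicated segments are lost.
sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main | grep "REDO WAL file"
sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal [REDO_WAL_FILE]

# Check space
df -h /var/lib/postgresql

# Start PostgreSQL
sudo systemctl start postgresql

# Option C: Once the server is up and space is free, reclaim table bloat.
# VACUUM FULL needs a running server, takes exclusive locks, and writes a
# full copy of each table, so it cannot be the first step on a full disk.
sudo -u postgres vacuumdb --all --full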
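
If WAL cleanup is unavoidable, the segment name passed to `pg_archivecleanup` should come from the cluster's control file rather than from memory. A minimal sketch that extracts it from `pg_controldata` output; `redo_wal_file` is a hypothetical name, and the `pg_controldata` path assumes the Debian/Ubuntu layout used elsewhere in this SOP.

```shell
# Read pg_controldata output on stdin and print the REDO WAL segment name.
redo_wal_file() {
    awk -F': *' '/REDO WAL file/ { print $2 }'
}

# Usage:
#   sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata \
#       /var/lib/postgresql/16/main | redo_wal_file
```

Segments older than the printed one are not needed for crash recovery, but they may still be needed by replicas or WAL archiving.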

Recovery 4: Configuration Fix

If config test fails:

# Restore backup config
sudo cp /etc/postgresql/16/main/postgresql.conf.backup /etc/postgresql/16/main/postgresql.conf
sudo cp /etc/postgresql/16/main/pg_hba.conf.backup /etc/postgresql/16/main/pg_hba.conf

# Start PostgreSQL
sudo systemctl start postgresql

Recovery 5: Database Corruption (WORST CASE)

If logs show corruption errors:

# Stop PostgreSQL
sudo systemctl stop postgresql

# Run filesystem check (if safe to do so)
# sudo fsck /dev/sdX  # Only if unmounted!

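# Try single-user mode recovery (Debian/Ubuntu keep postgresql.conf under
# /etc, so pass config_file explicitly; the trailing "postgres" is the database)
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D /var/lib/postgresql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf postgres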

# If that fails, restore from backup (SOP-204)

⚠️ At this point, escalate to backup restoration procedure!

POST-RECOVERY ACTIONS

1. Verify Full Functionality

# Test connections
sudo -u postgres psql -c "SELECT version();"

# Check all databases
sudo -u postgres psql -c "\l"

# Test customer database access (example)
sudo -u postgres psql -d customer_db_001 -c "SELECT 1;"

# Check active connections
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"

# Run health check
/opt/fairdb/scripts/pg-health-check.sh
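
The verification commands above can be wrapped so each one reports PASS or FAIL and the failures are tallied. A minimal sketch; `run_check` and the `failures` counter are hypothetical names.

```shell
failures=0

# Run a command quietly and report PASS/FAIL under a readable name.
# $1 = check name, remaining arguments = the command to run.
run_check() {
    name="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $name"
    else
        echo "FAIL: $name"
        failures=$((failures + 1))
    fi
}

# Usage:
#   run_check "connection"    sudo -u postgres psql -c "SELECT 1;"
#   run_check "customer db"   sudo -u postgres psql -d customer_db_001 -c "SELECT 1;"
#   run_check "health script" /opt/fairdb/scripts/pg-health-check.sh
#   echo "$failures check(s) failed"
```

Only mark the incident resolved when the failure count is zero.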

2. Update Incident Status

✅ RESOLVED - Database Restored
Resolution Time: [X minutes]
Root Cause: [Brief description]
Recovery Method: [Which recovery procedure used]
Customer Impact: [Duration of outage]
Follow-up: [Post-mortem scheduled]

3. Customer Communication

Template:

Subject: [RESOLVED] Database Service Interruption

Dear FairDB Customer,

We experienced a brief service interruption affecting database
connectivity from [START_TIME] to [END_TIME] ([DURATION]).

The issue has been fully resolved and all services are operational.

Root Cause: [Brief explanation]
Resolution: [What we did]
Prevention: [Steps to prevent recurrence]

We apologize for any inconvenience. If you continue to experience
issues, please contact [SUPPORT_EMAIL].

- FairDB Operations Team

4. Document Incident

Create incident report at /opt/fairdb/incidents/YYYY-MM-DD-database-down.md:

# Incident Report: Database Down

**Incident ID:** INC-YYYYMMDD-001
**Severity:** P0 - Critical
**Date:** YYYY-MM-DD
**Duration:** X minutes

## Timeline
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Resolution implemented
- HH:MM - Service restored
- HH:MM - Verified functionality

## Root Cause
[Detailed explanation]

## Impact
- Customers affected: X
- Downtime: X minutes
- Data loss: None / [describe if any]

## Resolution
[Detailed steps taken]

## Prevention
[Action items to prevent recurrence]

## Follow-up Tasks
- [ ] Review monitoring alerts
- [ ] Update runbooks
- [ ] Implement preventive measures
- [ ] Schedule post-mortem meeting
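
The incident ID and report path follow fixed patterns, so they can be generated instead of typed. A minimal sketch; both function names are hypothetical, and the sequence number must be passed without leading zeros because `printf` would read them as octal.

```shell
# Build an incident ID such as INC-20260322-001.
# $1 = date as YYYY-MM-DD, $2 = sequence number for that day (no leading zeros).
incident_id() {
    printf 'INC-%s-%03d\n' "$(printf '%s' "$1" | tr -d '-')" "$2"
}

# Build the report path used in this SOP for a given date.
incident_report_path() {
    printf '/opt/fairdb/incidents/%s-database-down.md\n' "$1"
}

# Usage:
#   today="$(date +%Y-%m-%d)"
#   incident_id "$today" 1
#   incident_report_path "$today"
```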

ESCALATION CRITERIA

Escalate if:

❌ Cannot restore service within 15 minutes
❌ Data corruption suspected
❌ Backup restoration required
❌ Multiple VPS affected
❌ Security incident suspected

Escalation contacts: [Document your escalation chain]
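
The 15-minute criterion is easy to lose track of mid-incident, so it can be checked mechanically. A minimal sketch working in epoch seconds; `should_escalate` is a hypothetical name.

```shell
# Succeed (exit 0) once 15 minutes have passed since the incident started.
# $1 = incident start (epoch seconds), $2 = current time (epoch seconds).
should_escalate() {
    elapsed=$(( $2 - $1 ))
    [ "$elapsed" -ge 900 ]   # 15 minutes = 900 seconds
}

# Usage:
#   start="$(date +%s)"        # record when the incident is declared
#   ...
#   if should_escalate "$start" "$(date +%s)"; then
#       echo "15 minutes exceeded: escalate now"
#   fi
```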

START RESPONSE

Begin by asking:

"What symptoms are you seeing? (Can't connect, service down, etc.)"
"When did the issue start?"
"Are you on the affected server now?"

Then immediately execute Diagnostic Protocol starting with Check 1.

Remember: Speed is critical. Every minute counts. Stay calm, work systematically.
