---
name: incident-p0-database-down
description: Emergency response procedure for SOP-201 P0 - Database Down (Critical)
model: sonnet
---

# SOP-201: P0 - Database Down (CRITICAL)

🚨 **EMERGENCY INCIDENT RESPONSE**

You are responding to a **P0 CRITICAL incident**: the PostgreSQL database is down.

## Severity: P0 - CRITICAL
- **Impact:** ALL customers affected
- **Response Time:** IMMEDIATE
- **Resolution Target:** <15 minutes

## Your Mission

Guide rapid diagnosis and recovery with:
- Systematic troubleshooting steps
- Clear commands for each check
- Fast recovery procedures
- Customer communication templates
- Post-incident documentation

## IMMEDIATE ACTIONS (First 60 seconds)

### 1. Verify the Issue
```bash
# Is PostgreSQL running?
sudo systemctl status postgresql

# Can we connect?
sudo -u postgres psql -c "SELECT 1;"

# Check recent logs
sudo tail -n 100 /var/log/postgresql/postgresql-16-main.log
```

### 2. Alert Stakeholders
**Post to incident channel IMMEDIATELY:**
```
🚨 P0 INCIDENT - Database Down
Time: [TIMESTAMP]
Server: VPS-XXX
Impact: All customers unable to connect
Status: Investigating
ETA: TBD
```
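Filling in the timestamp by hand costs time during the first minute. The sketch below renders the same alert text with a UTC timestamp. `render_alert` and its server argument are hypothetical helpers, not part of the SOP, and posting the result to the incident channel is left to whatever tooling you already use.

```bash
# Render the P0 alert with the current UTC time filled in.
# render_alert is a hypothetical helper; $1 is the affected server name.
render_alert() {
  local server="${1:-VPS-XXX}"
  local ts
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  cat <<EOF
🚨 P0 INCIDENT - Database Down
Time: ${ts}
Server: ${server}
Impact: All customers unable to connect
Status: Investigating
ETA: TBD
EOF
}
```

Calling `render_alert VPS-007` prints the six-line alert ready to paste.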

## DIAGNOSTIC PROTOCOL

### Check 1: Service Status
```bash
sudo systemctl status postgresql
sudo systemctl status postgresql@16-main  # Per-cluster unit on Debian/Ubuntu packaging
sudo systemctl status pgbouncer  # If installed
```

**Possible states:**
- `inactive (dead)` → Service stopped
- `failed` → Service crashed
- `active (exited)` → Wrapper unit only (Debian/Ubuntu); check `postgresql@16-main` for the real state
- `active (running)` → Service running but not responding
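The state-to-action mapping above can be sketched as a small lookup, which is handy when the check is scripted. The function name `next_step_for_state` and the suggested next steps are assumptions layered on this runbook, and the input is expected to be the one-word output of `systemctl is-active`.

```bash
# Map a systemd unit state (as printed by `systemctl is-active`) to the
# next step in this runbook. The mapping itself is an assumption.
next_step_for_state() {
  case "$1" in
    inactive) echo "Service stopped: try Recovery 1 (simple restart)" ;;
    failed)   echo "Service crashed: run Checks 2-5 before restarting" ;;
    active)   echo "Service running but not responding: continue with Check 2" ;;
    *)        echo "Unknown state '$1': continue with Check 2" ;;
  esac
}
```

A typical call would be `next_step_for_state "$(systemctl is-active postgresql)"`. On Debian/Ubuntu packaging the per-cluster unit (for example `postgresql@16-main`) is usually the more informative one to query.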

### Check 2: Process Status
```bash
# Check for PostgreSQL processes (the [p] keeps grep from matching itself)
ps aux | grep '[p]ostgres'

# Check listening ports
sudo ss -tlnp | grep ':5432'
sudo ss -tlnp | grep ':6432'  # pgBouncer
```

### Check 3: Disk Space
```bash
df -h /var/lib/postgresql
```

⚠️ **If disk is full (100%):**
- This is likely the cause!
- Jump to the "Recovery 3: Disk Full Emergency" section
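For a scripted version of this check, the usage percentage can be read from `df -P`, whose POSIX output format keeps each filesystem on one line. This is a minimal sketch: the function names and the default threshold of 95% are assumptions, not values from the SOP.

```bash
# Print the usage percentage (digits only) of the filesystem holding a path.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Succeed when usage is at or above a threshold (default 95, an assumed value).
disk_nearly_full() {
  local pct
  pct="$(disk_usage_pct "$1")"
  [ "$pct" -ge "${2:-95}" ]
}
```

For example, `disk_nearly_full /var/lib/postgresql && echo "Go to Recovery 3"` flags the disk before it reaches 100%.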

### Check 4: Log Analysis
```bash
# Check for errors in PostgreSQL log
sudo grep -iE "error|fatal|panic" /var/log/postgresql/postgresql-16-main.log | tail -50

# Check system logs
sudo journalctl -u postgresql -n 100 --no-pager

# Check for OOM (Out of Memory) kills
sudo grep -i "killed process" /var/log/syslog | grep postgres
# If /var/log/syslog is absent, the kernel journal holds the same messages
sudo journalctl -k --no-pager | grep -i "killed process" | grep postgres
```
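When the log is long, a per-severity count gives a quicker first read than scrolling through raw matches. The sketch below assumes the default PostgreSQL log format, where each entry carries a severity tag such as `ERROR:` or `FATAL:`. `summarize_log_severity` is a hypothetical helper name.

```bash
# Count ERROR / FATAL / PANIC entries in a PostgreSQL log file.
# Assumes the default log format with "SEVERITY:" tags on each entry.
summarize_log_severity() {
  local log="$1" level
  for level in ERROR FATAL PANIC; do
    printf '%s %s\n' "$level" "$(grep -c "${level}:" "$log")"
  done
}
```

Run it as `sudo bash -c "$(declare -f summarize_log_severity); summarize_log_severity /var/log/postgresql/postgresql-16-main.log"` if the log is not readable by your user. Any non-zero PANIC count points toward Recovery 5.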

### Check 5: Configuration Issues
```bash
# Test PostgreSQL config: parsing fails with an error if postgresql.conf is invalid.
# Assumes the Debian/Ubuntu layout, where the config lives under /etc/postgresql.
sudo -u postgres /usr/lib/postgresql/16/bin/postgres -C config_file -D /etc/postgresql/16/main

# Check for lock files
ls -la /var/run/postgresql/
sudo ls -la /var/lib/postgresql/16/main/postmaster.pid
```

## RECOVERY PROCEDURES

### Recovery 1: Simple Service Restart

**If service is stopped but no obvious errors:**

```bash
# Start PostgreSQL
sudo systemctl start postgresql

# Check status
sudo systemctl status postgresql

# Test connection
sudo -u postgres psql -c "SELECT version();"

# Monitor logs
sudo tail -f /var/log/postgresql/postgresql-16-main.log
```

**✅ If successful:** Jump to the "POST-RECOVERY ACTIONS" section

### Recovery 2: Remove Stale PID File

**If error mentions "postmaster.pid already exists":**

```bash
# Stop PostgreSQL (if running)
sudo systemctl stop postgresql

# Confirm no postgres process is still alive: removing the PID file
# under a live server risks data corruption
pgrep -u postgres -a postgres

# Remove stale PID file (only if the check above printed nothing)
sudo rm /var/lib/postgresql/16/main/postmaster.pid

# Start PostgreSQL
sudo systemctl start postgresql

# Verify
sudo systemctl status postgresql
sudo -u postgres psql -c "SELECT 1;"
```
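Before deleting the PID file it is worth confirming that the process it names is really gone. The first line of `postmaster.pid` holds the postmaster's PID, so a guard can be sketched as below. `pid_file_is_stale` is a hypothetical helper, and it errs on the side of caution: if the PID has been reused by an unrelated process it still reports "not stale".

```bash
# Succeed (0) only if the PID on line 1 of a postmaster.pid file is not a
# live process. Returns 1 if the process exists, 2 if the PID is unreadable.
pid_file_is_stale() {
  local pid
  pid="$(head -n 1 "$1" 2>/dev/null | tr -d '[:space:]')"
  case "$pid" in
    ''|*[!0-9]*) return 2 ;;   # missing file or non-numeric PID: do not guess
  esac
  if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
    return 1                   # process exists: the PID file is NOT stale
  fi
  return 0
}
```

Because the data directory is readable only by `postgres` and root, run the check from a root shell, for example `pid_file_is_stale /var/lib/postgresql/16/main/postmaster.pid && echo "safe to remove"`.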
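
### Recovery 3: Disk Full Emergency

**If disk is 100% full:**

```bash
# Find largest files (the glob must expand as root: the data directory is not world-readable)
sudo sh -c 'du -sh /var/lib/postgresql/16/main/* | sort -rh | head -10'

# Option A: Clear old logs
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete

# Option B: Archive/delete old WAL files (DANGER!)
# Only if you have confirmed backups!
# pg_archivecleanup is designed for WAL archive directories. The segment name below is
# an example: replace it with the oldest segment still needed. pg_controldata shows the
# latest checkpoint's REDO WAL file; that segment and later ones are required for crash recovery.
sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main | grep "REDO WAL file"
sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal 000000010000000000000010

# Check space
df -h /var/lib/postgresql

# Start PostgreSQL
sudo systemctl start postgresql

# Option C (only after PostgreSQL is back up): Vacuum to reclaim space
# VACUUM FULL needs a running server and temporary free space to rewrite each table
sudo -u postgres vacuumdb --all --full
```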
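Since `find ... -delete` is irreversible, a dry run that only lists the candidates is a cheap safeguard. The sketch below uses the same match criteria as Option A but prints instead of deleting. `list_old_logs` is a hypothetical helper, and the 7-day default mirrors the command above.

```bash
# List *.log files older than N days (default 7) without deleting anything.
# $1 = log directory, $2 = age in days.
list_old_logs() {
  find "$1" -name "*.log" -mtime +"${2:-7}" -print
}
```

Review the output of `sudo bash -c "$(declare -f list_old_logs); list_old_logs /var/log/postgresql/"` first, then run the `-delete` variant once the list looks right. The active log file is recently modified, so it does not match.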

### Recovery 4: Configuration Fix

**If config test fails:**

```bash
# Keep a copy of the broken config for the post-mortem
sudo cp /etc/postgresql/16/main/postgresql.conf /etc/postgresql/16/main/postgresql.conf.broken

# Restore backup config
sudo cp /etc/postgresql/16/main/postgresql.conf.backup /etc/postgresql/16/main/postgresql.conf
sudo cp /etc/postgresql/16/main/pg_hba.conf.backup /etc/postgresql/16/main/pg_hba.conf

# Start PostgreSQL
sudo systemctl start postgresql
```

### Recovery 5: Database Corruption (WORST CASE)

**If logs show corruption errors:**

```bash
# Stop PostgreSQL
sudo systemctl stop postgresql

# Run filesystem check (if safe to do so)
# sudo fsck /dev/sdX  # Only if unmounted!

# Try single-user mode recovery
# On Debian/Ubuntu the config file lives outside the data directory, so pass it explicitly;
# the trailing "postgres" is the database to open
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D /var/lib/postgresql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf postgres

# If that fails, restore from backup (SOP-204)
```

⚠️ **At this point, escalate to the backup restoration procedure (SOP-204)!**

## POST-RECOVERY ACTIONS

### 1. Verify Full Functionality
```bash
# Test connections
sudo -u postgres psql -c "SELECT version();"

# Check all databases
sudo -u postgres psql -c "\l"

# Test customer database access (example)
sudo -u postgres psql -d customer_db_001 -c "SELECT 1;"

# Check active connections
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"

# Run health check
/opt/fairdb/scripts/pg-health-check.sh
```
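Testing a single customer database leaves the others unverified. One way to extend the check, sketched below, is to read every connectable database name and run `SELECT 1` against each. `probe_databases` and `psql_select_one` are hypothetical helpers. The probe command is passed in as an argument so the loop can be exercised without a live server.

```bash
# Run a probe command against each database name read from stdin.
# Prints one OK/FAIL line per database; returns 1 if any probe fails.
probe_databases() {
  local db failed=0
  while IFS= read -r db; do
    [ -n "$db" ] || continue
    # stdin is redirected so the probe cannot swallow the remaining names
    if "$@" "$db" </dev/null >/dev/null 2>&1; then
      echo "OK   $db"
    else
      echo "FAIL $db"
      failed=1
    fi
  done
  return "$failed"
}

# Probe for real use: SELECT 1 in the given database.
psql_select_one() {
  sudo -u postgres psql -d "$1" -Atc "SELECT 1;"
}
```

A full sweep would then look like `sudo -u postgres psql -Atc "SELECT datname FROM pg_database WHERE datallowconn;" | probe_databases psql_select_one`.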

### 2. Update Incident Status
```
✅ RESOLVED - Database Restored
Resolution Time: [X minutes]
Root Cause: [Brief description]
Recovery Method: [Which recovery procedure used]
Customer Impact: [Duration of outage]
Follow-up: [Post-mortem scheduled]
```

### 3. Customer Communication

**Template:**
```
Subject: [RESOLVED] Database Service Interruption

Dear FairDB Customer,

We experienced a brief service interruption affecting database
connectivity from [START_TIME] to [END_TIME] ([DURATION]).

The issue has been fully resolved and all services are operational.

Root Cause: [Brief explanation]
Resolution: [What we did]
Prevention: [Steps to prevent recurrence]

We apologize for any inconvenience. If you continue to experience
issues, please contact support@fairdb.io.

- FairDB Operations Team
```

### 4. Document Incident

Create incident report at `/opt/fairdb/incidents/YYYY-MM-DD-database-down.md`:

```markdown
# Incident Report: Database Down

**Incident ID:** INC-YYYYMMDD-001
**Severity:** P0 - Critical
**Date:** YYYY-MM-DD
**Duration:** X minutes

## Timeline
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Resolution implemented
- HH:MM - Service restored
- HH:MM - Verified functionality

## Root Cause
[Detailed explanation]

## Impact
- Customers affected: X
- Downtime: X minutes
- Data loss: None / [describe if any]

## Resolution
[Detailed steps taken]

## Prevention
[Action items to prevent recurrence]

## Follow-up Tasks
- [ ] Review monitoring alerts
- [ ] Update runbooks
- [ ] Implement preventive measures
- [ ] Schedule post-mortem meeting
```
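The report path and incident ID both follow fixed patterns, so they can be derived from the date instead of typed by hand. This is a minimal sketch: `incident_id` and `incident_report_path` are hypothetical helpers, and the three-digit sequence number is inferred from the `INC-YYYYMMDD-001` example above.

```bash
# Build the incident ID from a date and a sequence number.
# $1 = date as YYYY-MM-DD, $2 = sequence number without leading zeros (default 1)
incident_id() {
  printf 'INC-%s-%03d\n' "$(printf '%s' "$1" | tr -d '-')" "${2:-1}"
}

# Build the report path for a given date (YYYY-MM-DD).
incident_report_path() {
  printf '/opt/fairdb/incidents/%s-database-down.md\n' "$1"
}
```

For today's incident, `incident_report_path "$(date +%F)"` gives the file to create and `incident_id "$(date +%F)"` gives the ID for its header.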

## ESCALATION CRITERIA

Escalate if:
- ❌ Cannot restore service within 15 minutes
- ❌ Data corruption suspected
- ❌ Backup restoration required
- ❌ Multiple VPS instances affected
- ❌ Security incident suspected

**Escalation contacts:** [Document your escalation chain]
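The 15-minute criterion is easy to lose track of mid-incident. A small elapsed-time check, sketched below, makes it explicit. `past_resolution_target` is a hypothetical helper that takes epoch seconds; the 15-minute default comes from the resolution target in this SOP.

```bash
# Succeed once LIMIT minutes (default 15) have elapsed since the start time.
# $1 = start (epoch seconds), $2 = now (epoch seconds, default: current time),
# $3 = limit in minutes
past_resolution_target() {
  local start="$1" now="${2:-$(date +%s)}" limit_min="${3:-15}"
  [ $(( now - start )) -ge $(( limit_min * 60 )) ]
}
```

Record `START=$(date +%s)` when the incident is declared, then `past_resolution_target "$START" && echo "ESCALATE NOW"` can be run between recovery attempts.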

## START RESPONSE

Begin by asking:
1. "What symptoms are you seeing? (Can't connect, service down, etc.)"
2. "When did the issue start?"
3. "Are you on the affected server now?"

Then immediately execute the Diagnostic Protocol, starting with Check 1.

**Remember:** Speed is critical. Every minute counts. Stay calm, work systematically.