Initial commit

2025-11-29 18:52:55 +08:00
commit 713820bb67
22 changed files with 2987 additions and 0 deletions
--- a/commands/incident-p0-database-down.md
+++ b/commands/incident-p0-database-down.md
@@ -0,0 +1,318 @@
+---
+name: incident-p0-database-down
+description: Emergency response procedure for SOP-201 P0 - Database Down (Critical)
+model: sonnet
+---
+
+# SOP-201: P0 - Database Down (CRITICAL)
+
+🚨 **EMERGENCY INCIDENT RESPONSE**
+
+You are responding to a **P0 CRITICAL incident**: PostgreSQL database is down.
+
+## Severity: P0 - CRITICAL
+- **Impact:** ALL customers affected
+- **Response Time:** IMMEDIATE
+- **Resolution Target:** <15 minutes
+
+## Your Mission
+
+Guide rapid diagnosis and recovery with:
+- Systematic troubleshooting steps
+- Clear commands for each check
+- Fast recovery procedures
+- Customer communication templates
+- Post-incident documentation
+
+## IMMEDIATE ACTIONS (First 60 seconds)
+
+### 1. Verify the Issue
+```bash
+# Is PostgreSQL running?
+sudo systemctl status postgresql
+
+# Can we connect?
+sudo -u postgres psql -c "SELECT 1;"
+
+# Check recent logs
+sudo tail -100 /var/log/postgresql/postgresql-16-main.log
+```
+
+### 2. Alert Stakeholders
+**Post to incident channel IMMEDIATELY:**
+```
+🚨 P0 INCIDENT - Database Down
+Time: [TIMESTAMP]
+Server: VPS-XXX
+Impact: All customers unable to connect
+Status: Investigating
+ETA: TBD
+```
+
+## DIAGNOSTIC PROTOCOL
+
+### Check 1: Service Status
+```bash
+sudo systemctl status postgresql
+sudo systemctl status pgbouncer  # If installed
+```
+
+**Possible states:**
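+- `inactive (dead)` → Service stopped
+- `failed` → Service crashed
+- `active (running)` → Service running but not responding
+- `active (exited)` → Normal for the umbrella `postgresql` unit on Debian/Ubuntu; it does not prove the cluster is up, so also check `postgresql@16-main`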
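+
+As a rough sketch of how these states could drive the next step, the helper below maps the output of `systemctl is-active` to a section of this SOP. The function name and the state-to-step mapping are assumptions made for illustration; they are not part of the original procedure.
+
+```bash
+# Hypothetical helper: suggest the next SOP step for a systemd unit state.
+# $1 = state as printed by `systemctl is-active` (inactive, failed, active, ...)
+next_step_for_state() {
+  case "$1" in
+    inactive) echo "Recovery 1: Simple Service Restart" ;;
+    failed)   echo "Check 4: Log Analysis" ;;
+    active)   echo "Check 2: Process Status" ;;
+    *)        echo "Unrecognized state '$1': continue with Check 2" ;;
+  esac
+}
+
+# Usage on the affected server (the unit name is an assumption):
+# next_step_for_state "$(systemctl is-active postgresql)"
+```
+
+Keeping the mapping in one place makes it easier to review after an incident, but the operator should still read the full `systemctl status` output before acting.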
+
+### Check 2: Process Status
+```bash
+# Check for PostgreSQL processes
+ps aux | grep '[p]ostgres'  # brackets keep grep from matching its own process
+
+# Check listening ports
+sudo ss -tlnp | grep 5432
+sudo ss -tlnp | grep 6432  # pgBouncer
+```
+
+### Check 3: Disk Space
+```bash
+df -h /var/lib/postgresql
+```
+
+⚠️ **If disk is full (100%):**
+- This is likely the cause!
+- Jump to "Recovery 3: Disk Full Emergency"
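+
+A disk rarely has to reach exactly 100% before PostgreSQL stops writing, so it may help to flag high usage earlier. The sketch below is one possible check; the function name and the 95% default threshold are assumptions, not values from this SOP.
+
+```bash
+# Hypothetical helper: succeed (exit 0) when disk usage is at or above a threshold.
+# $1 = use percentage as printed by df (for example " 97%")
+# $2 = threshold in percent (assumed default: 95)
+disk_is_critical() {
+  pct=$(printf '%s' "$1" | tr -d ' %')   # strip padding and the percent sign
+  [ "$pct" -ge "${2:-95}" ]
+}
+
+# Usage (GNU df; --output is not available on every platform):
+# disk_is_critical "$(df --output=pcent /var/lib/postgresql | tail -1)" \
+#   && echo "Disk nearly full: see Recovery 3"
+```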
+
+### Check 4: Log Analysis
+```bash
+# Check for errors in PostgreSQL log
+sudo grep -i "error\|fatal\|panic" /var/log/postgresql/postgresql-16-main.log | tail -50
+
+# Check system logs
+sudo journalctl -u postgresql -n 100 --no-pager
+
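+# Check for OOM (Out of Memory) kills
+# (on journald-only systems without /var/log/syslog, use: sudo journalctl -k | grep -i "killed process")
+sudo grep -i "killed process" /var/log/syslog | grep postgres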
+```
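+
+To judge quickly how severe the log picture is, the matches can be counted per severity instead of read line by line. This is a minimal sketch; the function name is hypothetical, and it assumes severities appear in the log as `FATAL:`, `PANIC:` and `ERROR:`, as they do with untranslated server messages.
+
+```bash
+# Hypothetical helper: count log lines of one severity, reading the log on stdin.
+# $1 = severity label (ERROR, FATAL or PANIC)
+count_severity() {
+  # grep -c exits non-zero when the count is 0, so mask that for callers
+  grep -c -- "$1:" || true
+}
+
+# Usage:
+# for sev in PANIC FATAL ERROR; do
+#   printf '%s: ' "$sev"
+#   sudo cat /var/log/postgresql/postgresql-16-main.log | count_severity "$sev"
+# done
+```
+
+Any non-zero `PANIC` count is a reason to look at Recovery 5 before attempting a plain restart.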
+
+### Check 5: Configuration Issues
+```bash
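+# Test that the config parses: prints data_directory on success, an error otherwise
+# (-D points at the config directory, following the Debian layout used in Recovery 4)
+sudo -u postgres /usr/lib/postgresql/16/bin/postgres -C data_directory -D /etc/postgresql/16/main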
+
+# Check for lock files
+ls -la /var/run/postgresql/
+ls -la /var/lib/postgresql/16/main/postmaster.pid
+```
+
+## RECOVERY PROCEDURES
+
+### Recovery 1: Simple Service Restart
+
+**If service is stopped but no obvious errors:**
+
+```bash
+# Start PostgreSQL
+sudo systemctl start postgresql
+
+# Check status
+sudo systemctl status postgresql
+
+# Test connection
+sudo -u postgres psql -c "SELECT version();"
+
+# Monitor logs (Ctrl-C to stop)
+sudo tail -f /var/log/postgresql/postgresql-16-main.log
+```
+
+**✅ If successful:** Jump to "POST-RECOVERY ACTIONS"
+
+### Recovery 2: Remove Stale PID File
+
+**If error mentions "postmaster.pid already exists":**
+
+```bash
+# Stop PostgreSQL (if running)
+sudo systemctl stop postgresql
+
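+# Confirm no postgres process is still running; removing the PID file under a
+# live postmaster risks data corruption
+pgrep -a postgres && echo "postgres still running - do NOT remove the PID file"
+
+# Remove stale PID file (only if the check above printed nothing)
+sudo rm /var/lib/postgresql/16/main/postmaster.pid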
+
+# Start PostgreSQL
+sudo systemctl start postgresql
+
+# Verify
+sudo systemctl status postgresql
+sudo -u postgres psql -c "SELECT 1;"
+```
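+
+Before the PID file is removed, it is worth confirming that the PID recorded in it is really gone. The first line of `postmaster.pid` holds the postmaster's PID. The helper below is a sketch with a hypothetical name, and it relies on `/proc`, so it assumes Linux.
+
+```bash
+# Hypothetical helper (Linux only): succeed when no process with this PID exists.
+# $1 = PID taken from the first line of postmaster.pid
+pid_is_dead() {
+  case "$1" in
+    ''|*[!0-9]*) return 2 ;;   # empty or not a number: refuse to guess
+  esac
+  [ ! -d "/proc/$1" ]
+}
+
+# Usage:
+# pid=$(sudo head -n 1 /var/lib/postgresql/16/main/postmaster.pid)
+# pid_is_dead "$pid" && echo "PID $pid is gone: the PID file is stale"
+```
+
+A live PID does not prove the process is PostgreSQL, since PIDs are reused, so a "not dead" result should be followed by a look at `ps -p <pid>`.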
+
+### Recovery 3: Disk Full Emergency
+
+**If disk is 100% full:**
+
+```bash
+# Find largest files
+sudo du -sh /var/lib/postgresql/16/main/* | sort -rh | head -10
+
+# Option A: Clear old logs
+sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete
+
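+# Option B: Vacuum to reclaim space
+# Needs a running server, and VACUUM FULL rewrites each table, so it temporarily
+# needs extra free space and locks tables. Run it only after Option A or C has
+# freed room and PostgreSQL has been started.
+sudo -u postgres vacuumdb --all --full
+
+# Option C: Archive/delete old WAL files (DANGER!)
+# Only if you have confirmed backups! pg_archivecleanup is designed for WAL archive
+# directories. The segment name below is an example: replace it with the oldest
+# segment to keep (pg_controldata reports it as "Latest checkpoint's REDO WAL file").
+# Add -n first for a dry run that only lists the files it would remove.
+sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal 000000010000000000000010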
+
+# Check space
+df -h /var/lib/postgresql
+
+# Start PostgreSQL
+sudo systemctl start postgresql
+```
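+
+The segment name passed to `pg_archivecleanup` should come from the cluster itself rather than from an example. One way to obtain it, sketched below with a hypothetical function name, is to read the `Latest checkpoint's REDO WAL file` line from `pg_controldata` output. This assumes untranslated (English) output, and it does not account for segments still needed by replication slots or by WAL archiving.
+
+```bash
+# Hypothetical helper: read pg_controldata output on stdin and print the WAL
+# segment that the latest checkpoint's REDO location lives in.
+redo_wal_file() {
+  sed -n "s/^Latest checkpoint's REDO WAL file:[[:space:]]*//p"
+}
+
+# Usage (binary and data paths follow the Debian layout used in this SOP):
+# sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main | redo_wal_file
+```
+
+Segments older than the printed one are not needed for crash recovery, but they may still be needed for point-in-time recovery or by a standby.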
+
+### Recovery 4: Configuration Fix
+
+**If config test fails:**
+
+```bash
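+# Keep the broken config for the post-mortem
+sudo cp /etc/postgresql/16/main/postgresql.conf /etc/postgresql/16/main/postgresql.conf.broken
+
+# Restore backup config (assumes .backup copies exist from before the last change)
+sudo cp /etc/postgresql/16/main/postgresql.conf.backup /etc/postgresql/16/main/postgresql.conf
+sudo cp /etc/postgresql/16/main/pg_hba.conf.backup /etc/postgresql/16/main/pg_hba.conf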
+
+# Start PostgreSQL
+sudo systemctl start postgresql
+```
+
+### Recovery 5: Database Corruption (WORST CASE)
+
+**If logs show corruption errors:**
+
+```bash
+# Stop PostgreSQL
+sudo systemctl stop postgresql
+
+# Run filesystem check (if safe to do so)
+# sudo fsck /dev/sdX  # Only if unmounted!
+
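+# Try single-user mode recovery (opens the "postgres" database; exit with Ctrl-D)
+# config_file is passed explicitly because the Debian layout keeps it outside the data directory
+sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D /var/lib/postgresql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf postgres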
+
+# If that fails, restore from backup (SOP-204)
+```
+
+⚠️ **At this point, escalate to backup restoration procedure!**
+
+## POST-RECOVERY ACTIONS
+
+### 1. Verify Full Functionality
+```bash
+# Test connections
+sudo -u postgres psql -c "SELECT version();"
+
+# Check all databases
+sudo -u postgres psql -c "\l"
+
+# Test customer database access (example)
+sudo -u postgres psql -d customer_db_001 -c "SELECT 1;"
+
+# Check active connections
+sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
+
+# Run health check
+/opt/fairdb/scripts/pg-health-check.sh
+```
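+
+Right after a start, PostgreSQL may still be replaying WAL and refuse connections for a short while, so a single failed test is not conclusive. The retry loop below is a sketch with a hypothetical function name; the attempt count and delay in the usage line are assumptions.
+
+```bash
+# Hypothetical helper: run a command until it succeeds or attempts run out.
+# $1 = maximum attempts, $2 = seconds between attempts, rest = command to run
+retry_until_ok() {
+  attempts=$1
+  delay=$2
+  shift 2
+  i=1
+  while [ "$i" -le "$attempts" ]; do
+    "$@" && return 0
+    # Sleep only when another attempt will follow
+    [ "$i" -lt "$attempts" ] && sleep "$delay"
+    i=$((i + 1))
+  done
+  return 1
+}
+
+# Usage: wait up to about 30 seconds for the server to accept connections.
+# retry_until_ok 15 2 pg_isready -q -h /var/run/postgresql
+```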
+
+### 2. Update Incident Status
+```
+✅ RESOLVED - Database Restored
+Resolution Time: [X minutes]
+Root Cause: [Brief description]
+Recovery Method: [Which recovery procedure used]
+Customer Impact: [Duration of outage]
+Follow-up: [Post-mortem scheduled]
+```
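+
+The `Resolution Time` field can be computed rather than estimated if the start of the incident is recorded with `date +%s`. The helpers below are a sketch with hypothetical names; they take Unix epoch seconds and compare against the 15-minute resolution target stated at the top of this SOP.
+
+```bash
+# Hypothetical helper: whole minutes between two epoch timestamps.
+# $1 = start, $2 = end; partial minutes are rounded up
+elapsed_minutes() {
+  echo $(( ($2 - $1 + 59) / 60 ))
+}
+
+# Hypothetical helper: succeed when the outage stayed under 15 minutes.
+within_target() {
+  [ "$(elapsed_minutes "$1" "$2")" -lt 15 ]
+}
+
+# Usage:
+# start=$(date +%s)        # when the incident is declared
+# elapsed_minutes "$start" "$(date +%s)"
+```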
+
+### 3. Customer Communication
+
+**Template:**
+```
+Subject: [RESOLVED] Database Service Interruption
+
+Dear FairDB Customer,
+
+We experienced a brief service interruption affecting database
+connectivity from [START_TIME] to [END_TIME] ([DURATION]).
+
+The issue has been fully resolved and all services are operational.
+
+Root Cause: [Brief explanation]
+Resolution: [What we did]
+Prevention: [Steps to prevent recurrence]
+
+We apologize for any inconvenience. If you continue to experience
+issues, please contact support@fairdb.io.
+
+- FairDB Operations Team
+```
+
+### 4. Document Incident
+
+Create incident report at `/opt/fairdb/incidents/YYYY-MM-DD-database-down.md`:
+
+```markdown
+# Incident Report: Database Down
+
+**Incident ID:** INC-YYYYMMDD-001
+**Severity:** P0 - Critical
+**Date:** YYYY-MM-DD
+**Duration:** X minutes
+
+## Timeline
+- HH:MM - Issue detected
+- HH:MM - Investigation started
+- HH:MM - Root cause identified
+- HH:MM - Resolution implemented
+- HH:MM - Service restored
+- HH:MM - Verified functionality
+
+## Root Cause
+[Detailed explanation]
+
+## Impact
+- Customers affected: X
+- Downtime: X minutes
+- Data loss: None / [describe if any]
+
+## Resolution
+[Detailed steps taken]
+
+## Prevention
+[Action items to prevent recurrence]
+
+## Follow-up Tasks
+- [ ] Review monitoring alerts
+- [ ] Update runbooks
+- [ ] Implement preventive measures
+- [ ] Schedule post-mortem meeting
+```
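+
+The report path and incident ID follow fixed patterns, so they can be generated to avoid typos under pressure. This is a minimal sketch; both function names are hypothetical, and the sequence number is assumed to be a plain decimal without leading zeros.
+
+```bash
+# Hypothetical helper: report path for a given date.
+# $1 = date as YYYY-MM-DD (defaults to today)
+incident_report_path() {
+  echo "/opt/fairdb/incidents/${1:-$(date +%F)}-database-down.md"
+}
+
+# Hypothetical helper: incident ID in the INC-YYYYMMDD-NNN form.
+# $1 = date as YYYY-MM-DD, $2 = sequence number for that day (defaults to 1)
+incident_id() {
+  printf 'INC-%s-%03d\n' "$(printf '%s' "$1" | tr -d '-')" "${2:-1}"
+}
+
+# Usage:
+# incident_report_path 2025-11-29
+# incident_id 2025-11-29 1
+```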
+
+## ESCALATION CRITERIA
+
+Escalate if:
+- ❌ Cannot restore service within 15 minutes
+- ❌ Data corruption suspected
+- ❌ Backup restoration required
+- ❌ Multiple VPS instances affected
+- ❌ Security incident suspected
+
+**Escalation contacts:** [Document your escalation chain]
+
+## START RESPONSE
+
+Begin by asking:
+1. "What symptoms are you seeing? (Can't connect, service down, etc.)"
+2. "When did the issue start?"
+3. "Are you on the affected server now?"
+
+Then immediately execute Diagnostic Protocol starting with Check 1.
+
+**Remember:** Speed is critical. Every minute counts. Stay calm, work systematically.