---
name: incident-p0-disk-full
description: Emergency response for SOP-203 P0 - Disk Space Emergency
model: sonnet
---

# SOP-203: P0 - Disk Space Emergency

🚨 **CRITICAL: Disk usage above 95% (or already at 100%)**

You are responding to a **disk space emergency** that threatens database operations.

## Severity: P0 - CRITICAL
- **Impact:** Database writes failing, potential data loss
- **Response Time:** IMMEDIATE
- **Resolution Target:** <30 minutes

## IMMEDIATE DANGER SIGNS

If disk is at 100%:
- ❌ PostgreSQL cannot write data
- ❌ WAL files cannot be created
- ❌ Transactions will fail
- ❌ Database may crash
- ❌ Backups will fail

**Act NOW to free space!**

## RAPID ASSESSMENT

### 1. Check Current Usage
```bash
# Overall disk usage
df -h

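# PostgreSQL data directory (owned by postgres, mode 0700, so sudo is needed)
sudo du -sh /var/lib/postgresql/16/main

# Find largest directories (--max-depth avoids a shell glob the calling user cannot expand)
sudo du -h --max-depth=1 /var/lib/postgresql/16/main | sort -rh | head -10

# Find largest files
sudo find /var/lib/postgresql/16/main -type f -size +100M -exec ls -lh {} + | sort -k5 -rh | head -20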
```
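
A filesystem can also report "No space left on device" when it runs out of inodes while `df -h` still shows free blocks. The following is a minimal sketch for checking both figures; `usage_pct` and `inode_pct` are hypothetical helper names, and the inode variant assumes a `df` that supports `-i` (GNU coreutils does).

```bash
# usage_pct PATH -> percentage of blocks used on the filesystem holding PATH (no % sign)
usage_pct() {
    df -P "$1" | awk 'NR==2 {gsub("%", "", $5); print $5}'
}

# inode_pct PATH -> percentage of inodes used ("-" on filesystems that do not report inodes)
inode_pct() {
    df -Pi "$1" | awk 'NR==2 {gsub("%", "", $5); print $5}'
}

# Example, using the data directory from this SOP:
# echo "blocks: $(usage_pct /var/lib/postgresql)%  inodes: $(inode_pct /var/lib/postgresql)%"
```

If inode usage is near 100% while block usage is not, look for directories holding very large numbers of small files rather than a few large ones.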

### 2. Identify Culprits
```bash
# Check log sizes
du -sh /var/log/postgresql/

# Check WAL directory size and approximate segment count
sudo du -sh /var/lib/postgresql/16/main/pg_wal/
sudo ls -1 /var/lib/postgresql/16/main/pg_wal/ | wc -l

# Check for temp files
du -sh /tmp/
find /tmp -type f -size +10M -ls

# Database sizes
sudo -u postgres psql -c "
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) AS size,
    pg_database_size(datname) AS size_bytes
FROM pg_database
ORDER BY size_bytes DESC;"
```

## EMERGENCY SPACE RECOVERY

### Priority 1: Clear Old Logs (SAFEST)

```bash
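# Delete PostgreSQL logs older than 7 days
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete

# Compress logs not modified in the last day (skip the active log:
# PostgreSQL keeps it open, so compressing it would not free its space)
sudo find /var/log/postgresql/ -name "*.log" -mtime +0 -exec gzip {} +

# Trim the systemd journal to the last 7 days
sudo journalctl --vacuum-time=7d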

# Check space recovered
df -h
```

**Expected recovery:** 1-5 GB
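
Before deleting anything, it can help to know how much space the step would actually free. The following is a minimal dry-run sketch, assuming GNU `find` (for `-printf`); `reclaimable_bytes` is a hypothetical helper name, not part of any existing FairDB script.

```bash
# reclaimable_bytes DIR DAYS -> total size in bytes of *.log files older than DAYS days
reclaimable_bytes() {
    local dir="$1" days="$2"
    # %s prints each file's size in bytes; awk sums them (prints 0 when nothing matches)
    find "$dir" -type f -name "*.log" -mtime +"$days" -printf '%s\n' \
        | awk '{total += $1} END {printf "%.0f\n", total}'
}

# Example (run with sudo if the directory is not readable):
# reclaimable_bytes /var/log/postgresql 7
```

If the result is far below the 1-5 GB estimate, move on to the next priority rather than spending time on logs.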

### Priority 2: Archive Old WAL Files

⚠️ **ONLY if you have confirmed backups!**

```bash
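# Check WAL retention settings
sudo -u postgres psql -c "SHOW wal_keep_size;"

# Check for a failing archive_command (unarchived WAL cannot be recycled)
sudo -u postgres psql -c "SELECT archived_count, failed_count, last_failed_wal FROM pg_stat_archiver;"

# Check for inactive replication slots pinning old WAL
sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# List the oldest WAL files
sudo ls -lh /var/lib/postgresql/16/main/pg_wal/ | head -50

# Verify archiving end to end with pgBackRest. A full backup also expires old archived WAL
# in the repository (per retention settings); only run it if the repository is NOT on the full disk
sudo -u postgres pgbackrest --stanza=main check
sudo -u postgres pgbackrest --stanza=main --type=full backup

# Do NOT delete files from pg_wal by hand, and do not point pg_archivecleanup at pg_wal
# (it is meant for WAL archive directories). Once archiving works and stale slots are dropped,
# a checkpoint lets PostgreSQL remove or recycle old segments itself
sudo -u postgres psql -c "CHECKPOINT;"

# Check space
df -h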
```

**Expected recovery:** 5-20 GB

### Priority 3: Vacuum Databases

```bash
# Plain vacuum: makes dead rows reusable inside each table, but rarely returns space to the OS
sudo -u postgres vacuumdb --all --analyze

# Check largest tables
sudo -u postgres psql -c "
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

# Full vacuum on bloated tables (SLOW, takes an exclusive lock, and needs free space for a complete copy of the table)
sudo -u postgres psql -d [database] -c "VACUUM FULL [table_name];"

# Check space
df -h
```

**Expected recovery:** Variable, depends on bloat

### Priority 4: Remove Temp Files

```bash
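# Clear orphaned PostgreSQL temp files (they live under base/pgsql_tmp; only do this
# when no queries are running, since active queries use these files)
sudo find /var/lib/postgresql/16/main/base/pgsql_tmp -mindepth 1 -delete

# Clear system temp files older than one day (a blanket rm -rf /tmp/* can break running services)
sudo find /tmp -xdev -type f -mtime +1 -delete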

# Clear old backups (if local copies exist)
ls -lh /opt/fairdb/backups/
# Delete old local backups if remote backups are confirmed

df -h
```
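
If `df -h` barely changes after files were deleted, a running process may still hold those files open, and the kernel frees the space only when the process closes them or restarts. The following is a minimal sketch for spotting such files, assuming Linux `/proc` and GNU `find`; `deleted_open_files` is a hypothetical helper name.

```bash
# deleted_open_files -> lists file descriptors whose target has been unlinked
deleted_open_files() {
    # /proc/<pid>/fd/<n> is a symlink whose target ends in " (deleted)" once the file is removed.
    # Errors for processes we may not inspect (or that exit mid-scan) are discarded.
    find /proc/[0-9]*/fd -maxdepth 1 -type l -lname '* (deleted)' \
        -printf '%p -> %l\n' 2>/dev/null || true
}

# Example (run as root to see every process):
# deleted_open_files | grep postgresql
```

The PID is the second path component of each result. Restarting or reloading the owning service releases the space.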

### Priority 5: Drop Old/Unused Databases (DANGER!)

⚠️ **ONLY with customer approval!**

```bash
# List databases with size and the most recent query start among currently connected sessions
sudo -u postgres psql -c "
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) AS size,
    (SELECT max(query_start) FROM pg_stat_activity WHERE datname = d.datname) AS last_activity
FROM pg_database d
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY pg_database_size(datname) DESC;"

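# pg_stat_activity only shows sessions connected right now: a NULL last_activity means
# no session is open at this moment, not that the database is unused. Confirm with the customer.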

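# BEFORE DROPPING: Backup! Write the dump to a volume that still has free space
# (the path below is only safe if it is not on the full disk)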
sudo -u postgres pg_dump [database_name] | gzip > /opt/fairdb/backups/emergency-backup-[database_name].sql.gz

# Drop database (IRREVERSIBLE!)
sudo -u postgres psql -c "DROP DATABASE [database_name];"
```

## LONG-TERM SOLUTIONS

### Option 1: Increase Disk Size

**Contabo/VPS Provider:**
1. Log into provider control panel
2. Upgrade storage plan
3. Resize disk partition
4. Expand filesystem

```bash
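# After the provider resize, grow the partition (growpart ships in cloud-guest-utils), then the filesystem
sudo growpart /dev/sda 1  # Adjust disk and partition number as needed
sudo resize2fs /dev/sda1  # ext2/3/4 only; for XFS use xfs_growfs on the mount point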

# Verify
df -h
```

### Option 2: Move Data to External Volume

```bash
# Create new volume/mount point
# Move PostgreSQL data directory
sudo systemctl stop postgresql
sudo rsync -av /var/lib/postgresql/ /mnt/new-volume/postgresql/
sudo mv /var/lib/postgresql /var/lib/postgresql.old
sudo ln -s /mnt/new-volume/postgresql /var/lib/postgresql
sudo systemctl start postgresql
```

### Option 3: Offload Old Data

- Archive old customer databases
- Export historical data to cold storage
- Implement data retention policies

### Option 4: Optimize Storage

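```sql
-- Use lz4 for newly stored values in a column (PostgreSQL 14+, built with lz4 support)
ALTER TABLE [table_name] ALTER COLUMN [column_name] SET COMPRESSION lz4;

-- Existing values generally keep their current compression; VACUUM FULL rewrites
-- the table to reclaim bloat but does not recompress stored values
VACUUM FULL [table_name];

-- Make autovacuum more aggressive for this table
ALTER TABLE [table_name] SET (autovacuum_vacuum_scale_factor = 0.05);
```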

## MONITORING & PREVENTION

### Set Up Disk Monitoring

Add to cron (`crontab -e`):
```bash
# Check disk space every hour
0 * * * * /opt/fairdb/scripts/check-disk-space.sh
```

**Create script** `/opt/fairdb/scripts/check-disk-space.sh`:
```bash
#!/bin/bash
THRESHOLD=80
# -P forces one line per filesystem, so long device names cannot shift the columns
USAGE=$(df -P /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')

if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "WARNING: Disk usage at ${USAGE}%" | mail -s "FairDB Disk Warning" your-email@example.com
fi
```
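
A single 80% threshold cannot tell an early warning from the emergency this SOP covers. The following is a minimal sketch of a two-level check, assuming the 80% warning level from the script above and the 95% level from the top of this SOP; `check_usage` and its return codes are hypothetical, not an existing FairDB convention.

```bash
# check_usage PCT [WARN] [CRIT] -> prints a status line; returns 0 (ok), 1 (warning), 2 (critical)
check_usage() {
    local pct="$1" warn="${2:-80}" crit="${3:-95}"
    if [ "$pct" -gt "$crit" ]; then
        echo "CRITICAL: disk usage at ${pct}%"
        return 2
    elif [ "$pct" -gt "$warn" ]; then
        echo "WARNING: disk usage at ${pct}%"
        return 1
    fi
    echo "OK: disk usage at ${pct}%"
    return 0
}

# Example: alert only when the status is not OK
# msg=$(check_usage "$USAGE") || echo "$msg" | mail -s "FairDB Disk Alert" your-email@example.com
```

The non-zero return codes let a cron wrapper or monitoring agent route warnings and critical alerts to different channels.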

### Configure Log Rotation

Edit `/etc/logrotate.d/postgresql`:
```
/var/log/postgresql/*.log {
    daily
    rotate 7
    compress
    delaycompress
    notifempty
    missingok
    # PostgreSQL keeps its log file open; truncate in place instead of renaming it
    copytruncate
}
```

### Implement Database Quotas

```sql
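-- PostgreSQL has no per-database size limit setting; enforce quotas by monitoring instead.
-- List databases over a 10 GB quota:
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database WHERE pg_database_size(datname) > pg_size_bytes('10GB');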
```

## POST-RECOVERY ACTIONS

### 1. Verify Database Health
```bash
# Check PostgreSQL status
sudo systemctl status postgresql

# Test connections
sudo -u postgres psql -c "SELECT 1;"

# Run health check
/opt/fairdb/scripts/pg-health-check.sh
```

### 2. Document Incident

```markdown
# Disk Space Emergency - YYYY-MM-DD

## Initial State
- Disk usage: X%
- Free space: XGB
- Affected services: [list]

## Actions Taken
- [List each action with space recovered]

## Final State
- Disk usage: X%
- Free space: XGB
- Time to resolution: X minutes

## Root Cause
[Why did disk fill up?]

## Prevention
- [ ] Implement monitoring
- [ ] Set up log rotation
- [ ] Schedule regular cleanups
- [ ] Consider storage upgrade
```
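
The "Actions Taken" list asks for the space recovered by each action. The following is a minimal sketch for measuring it, assuming free space is sampled before and after every step; `free_kb` and `recovered_mb` are hypothetical helper names.

```bash
# free_kb PATH -> free space in KB on the filesystem holding PATH
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# recovered_mb BEFORE_KB AFTER_KB -> space gained in whole MB (negative if usage grew)
recovered_mb() {
    echo $(( ($2 - $1) / 1024 ))
}

# Example:
# before=$(free_kb /var/lib/postgresql)
# ... run one recovery step ...
# after=$(free_kb /var/lib/postgresql)
# echo "Recovered $(recovered_mb "$before" "$after") MB"
```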

### 3. Implement Monitoring

```bash
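# Install monitoring script (run-parts skips file names containing a dot, so drop the .sh suffix;
# skip this step if the hourly crontab entry is already installed)
sudo install -m 755 /opt/fairdb/scripts/check-disk-space.sh /etc/cron.hourly/check-disk-space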

# Set up alerts
# (Configure email/Slack notifications)
```

## DECISION TREE

```
Disk at 100%?
├─ Yes → Priority 1 & 2 (Logs + WAL) IMMEDIATELY
│   ├─ Space freed? → Continue to monitoring
│   └─ Still full? → Priority 3 (Vacuum) + Consider Priority 5
│
└─ No → Disk at 85-99%?
    ├─ Priority 1 (Logs) + Schedule Priority 3 (Vacuum)
    └─ Plan long-term solution (resize disk)
```
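
The tree above can be sketched as a small function that maps a usage percentage to the steps it calls for. This is an illustration only: `next_steps` is a hypothetical helper, and the branch for usage below 85% is an assumption, since the tree does not cover that case.

```bash
# next_steps USAGE_PCT -> prints the recovery steps the decision tree calls for
next_steps() {
    local pct="$1"
    if [ "$pct" -ge 100 ]; then
        echo "Priority 1 + 2 now (logs, WAL); if still full: Priority 3 (vacuum), consider Priority 5"
    elif [ "$pct" -ge 85 ]; then
        echo "Priority 1 (logs); schedule Priority 3 (vacuum); plan a disk resize"
    else
        # Assumption: not part of the tree above
        echo "No emergency action; keep monitoring"
    fi
}

# Example:
# next_steps "$(df -P /var/lib/postgresql | awk 'NR==2 {gsub("%", "", $5); print $5}')"
```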

## START RESPONSE

Ask user:
1. "What is the current disk usage? (run `df -h`)"
2. "Is PostgreSQL still running?"
3. "When did this start happening?"

Then immediately execute Rapid Assessment and Emergency Space Recovery procedures.

**Remember:** Time is critical. Database writes are failing. Act fast but safely!