---
name: incident-p0-disk-full
description: Emergency response for SOP-203 P0 - Disk Space Emergency
model: sonnet
---

SOP-203: P0 - Disk Space Emergency

🚨 CRITICAL: Disk Space at 100% or >95%

You are responding to a disk space emergency that threatens database operations.

Severity: P0 - CRITICAL

  • Impact: Database writes failing, potential data loss
  • Response Time: IMMEDIATE
  • Resolution Target: <30 minutes

IMMEDIATE DANGER SIGNS

If disk is at 100%:

  • PostgreSQL cannot write data
  • WAL files cannot be created
  • Transactions will fail
  • Database may crash
  • Backups will fail

Act NOW to free space!
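
A safe first check (a sketch, before touching anything) is whether the server is already logging write failures:

# Recent ENOSPC errors in the logs confirm writes are already failing
sudo grep -i "no space left on device" /var/log/postgresql/*.log | tail -5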

RAPID ASSESSMENT

1. Check Current Usage

# Overall disk usage
df -h

# PostgreSQL data directory
du -sh /var/lib/postgresql/16/main

# Find largest directories
du -sh /var/lib/postgresql/16/main/* | sort -rh | head -10

# Find largest files
find /var/lib/postgresql/16/main -type f -size +100M -exec ls -lh {} \; | sort -k5 -rh | head -20
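
If df reports far more usage than du can account for, space may be pinned by files that were deleted while still held open by a process; a quick check (sketch):

# Lists open files whose on-disk link count is zero (deleted but still open)
sudo lsof +L1 | head -20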

2. Identify Culprits

# Check log sizes
du -sh /var/log/postgresql/

# Check WAL directory
du -sh /var/lib/postgresql/16/main/pg_wal/
ls /var/lib/postgresql/16/main/pg_wal/ | wc -l   # count of WAL segment files

# Check for temp files
du -sh /tmp/
find /tmp -type f -size +10M -ls

# Database sizes
sudo -u postgres psql -c "
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) AS size,
    pg_database_size(datname) AS size_bytes
FROM pg_database
ORDER BY size_bytes DESC;"

EMERGENCY SPACE RECOVERY

Priority 1: Clear Old Logs (SAFEST)

# PostgreSQL logs older than 7 days
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete

# Compress older logs (skip the log PostgreSQL is currently writing to; compressing
# an open file keeps its space pinned by the open file handle)
sudo find /var/log/postgresql/ -name "*.log" -mtime +1 -exec gzip {} \;

# Clear syslog/journal
sudo journalctl --vacuum-time=7d

# Check space recovered
df -h

Expected recovery: 1-5 GB
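
If logs alone do not recover enough, the package cache is another low-risk target on a Debian/Ubuntu host (which the paths in this runbook assume):

# Downloaded .deb files can accumulate to several GB
sudo apt-get clean
df -h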

Priority 2: Archive Old WAL Files

⚠️ ONLY if you have confirmed backups!

# Check WAL retention settings
sudo -u postgres psql -c "SHOW wal_keep_size;"

# List old WAL files
ls -lh /var/lib/postgresql/16/main/pg_wal/ | tail -50

# Take a fresh backup so older WAL segments become removable (pgBackRest)
sudo -u postgres pgbackrest --stanza=main --type=full backup

# Clean old WALs (CAREFUL! Last resort; keep everything from the latest checkpoint onward)
# pg_archivecleanup removes segments older than the file you name, so pass the
# "Latest checkpoint's REDO WAL file" reported by pg_controldata
REDO_WAL=$(sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main | awk '/REDO WAL file/ {print $NF}')
sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal "$REDO_WAL"

# Check space
df -h

Expected recovery: 5-20 GB
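
If pg_wal keeps growing back, find out why segments are not being recycled before deleting anything; an inactive replication slot or a failing archive_command are the usual causes. A diagnostic sketch using standard catalog views:

# Inactive replication slots pin WAL indefinitely
sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# Archiver failures also prevent WAL removal
sudo -u postgres psql -c "SELECT archived_count, failed_count, last_failed_wal FROM pg_stat_archiver;"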

Priority 3: Vacuum Databases

# Quick vacuum (recovers space within tables)
sudo -u postgres vacuumdb --all --analyze

# Check largest tables
sudo -u postgres psql -c "
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

# Full vacuum on bloated tables (SLOW, takes an exclusive lock, and needs enough
# free space to write a full copy of the table, so avoid it while disk is at 100%)
sudo -u postgres psql -d [database] -c "VACUUM FULL [table_name];"

# Check space
df -h

Expected recovery: Variable, depends on bloat
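
To decide which tables are worth a VACUUM FULL, dead-tuple counts from the statistics views are a reasonable guide (a sketch; run it against each large database):

# Tables with a high dead-to-live tuple ratio recover the most space
sudo -u postgres psql -d [database] -c "
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;"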

Priority 4: Remove Temp Files

# Clear leftover PostgreSQL temp files (they live in base/pgsql_tmp; only remove
# files old enough that no running query can still be using them)
sudo find /var/lib/postgresql/16/main/base/pgsql_tmp/ -type f -mmin +60 -delete

# Clear old files from system temp (a blanket rm -rf /tmp/* can break running services)
sudo find /tmp -type f -mtime +1 -delete

# Clear old backups (if local copies exist)
ls -lh /opt/fairdb/backups/
# Delete old local backups if remote backups are confirmed

df -h

Priority 5: Drop Old/Unused Databases (DANGER!)

⚠️ ONLY with customer approval!

# List databases with size and the latest activity among currently connected sessions
sudo -u postgres psql -c "
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) AS size,
    (SELECT max(query_start) FROM pg_stat_activity WHERE datname = d.datname) AS last_activity
FROM pg_database d
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY pg_database_size(datname) DESC;"

# Identify candidates: pg_stat_activity only covers current sessions, so a NULL
# last_activity means no one is connected right now, not that the database is unused;
# confirm inactivity with the customer before proceeding

# BEFORE DROPPING: Back up! Write the dump somewhere with free space
# (remote storage or another volume); the local disk is the thing that is full
sudo -u postgres pg_dump [database_name] | gzip > /opt/fairdb/backups/emergency-backup-[database_name].sql.gz
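
# Sanity-check the dump before dropping anything (gzip -t verifies the archive is
# intact); ideally confirm a copy also exists off this host
gzip -t /opt/fairdb/backups/emergency-backup-[database_name].sql.gz && echo "dump archive OK"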

# Drop database (IRREVERSIBLE!)
sudo -u postgres psql -c "DROP DATABASE [database_name];"

LONG-TERM SOLUTIONS

Option 1: Increase Disk Size

Contabo/VPS Provider:

  1. Log into provider control panel
  2. Upgrade storage plan
  3. Resize disk partition
  4. Expand filesystem
# After the provider resize, grow the partition (if needed) and expand the filesystem
# growpart is in cloud-guest-utils; adjust the device and partition number for your VPS
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1

# Verify
df -h

Option 2: Move Data to External Volume

# Create new volume/mount point
# Move PostgreSQL data directory
sudo systemctl stop postgresql
sudo rsync -av /var/lib/postgresql/ /mnt/new-volume/postgresql/
sudo mv /var/lib/postgresql /var/lib/postgresql.old
sudo ln -s /mnt/new-volume/postgresql /var/lib/postgresql
sudo systemctl start postgresql
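
Before removing the old copy, confirm ownership and that the server really runs from the new volume (a verification sketch; /mnt/new-volume is the example mount point above):

ls -ld /mnt/new-volume/postgresql      # should be owned by postgres:postgres
readlink -f /var/lib/postgresql        # should point at /mnt/new-volume/postgresql
sudo -u postgres psql -c "SELECT 1;"   # server is up and accepting connections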

Option 3: Offload Old Data

  • Archive old customer databases (see the sketch after this list)
  • Export historical data to cold storage
  • Implement data retention policies
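
For the first item, a minimal archive sketch (customer_db_001 and /mnt/archive are illustrative names):

# Dump in compressed custom format, copy it off-box, then drop the live database
sudo -u postgres pg_dump -Fc -f /mnt/archive/customer_db_001.dump customer_db_001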

Option 4: Optimize Storage

# Use LZ4 compression for TOAST-able columns (PostgreSQL 14+; set per column, not per table)
ALTER TABLE [table_name] ALTER COLUMN [column_name] SET COMPRESSION lz4;

# Note: only newly written values use the new method; existing data keeps its old
# compression until it is rewritten (for example via dump and restore)

# Set autovacuum more aggressively on large tables
ALTER TABLE [table_name] SET (autovacuum_vacuum_scale_factor = 0.05);

MONITORING & PREVENTION

Set Up Disk Monitoring

Add to cron (crontab -e):

# Check disk space every hour
0 * * * * /opt/fairdb/scripts/check-disk-space.sh

Create script /opt/fairdb/scripts/check-disk-space.sh:

#!/bin/bash
THRESHOLD=80
# -P keeps df output on one line so awk always reads the usage column
USAGE=$(df -P /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')

if [ "$USAGE" -gt "$THRESHOLD" ]; then
    # mail needs a working MTA on the host; adjust the recipient address
    echo "WARNING: Disk usage at ${USAGE}%" | mail -s "FairDB Disk Warning" your-email@example.com
fi
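
If the host has no mail transfer agent, that mail call fails silently. A fallback worth adding inside the same if block (sketch) is to also write the warning to syslog:

# Visible via journalctl -t fairdb-disk-check even when mail cannot be delivered
logger -t fairdb-disk-check "WARNING: Disk usage at ${USAGE}%"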

Configure Log Rotation

Edit /etc/logrotate.d/postgresql:

/var/log/postgresql/*.log {
    daily
    rotate 7
    compress
    delaycompress
    # PostgreSQL keeps its log file open; truncate in place instead of moving it
    copytruncate
    notifempty
    missingok
}

Implement Database Quotas

-- PostgreSQL has no built-in per-database size quota, so enforce overall limits
-- from the monitoring layer; what can be set directly is a per-session temp file cap
ALTER DATABASE customer_db_001 SET temp_file_limit = '1GB';
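
Since any hard size limit has to be enforced outside the database, a minimal cron-able check is one option (sketch; the 10 GB figure and customer_db_001 are illustrative):

SIZE=$(sudo -u postgres psql -Atc "SELECT pg_database_size('customer_db_001');")
if [ "$SIZE" -gt $((10 * 1024 * 1024 * 1024)) ]; then
    echo "customer_db_001 exceeds its 10 GB limit"   # alert or mail here
fi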

POST-RECOVERY ACTIONS

1. Verify Database Health

# Check PostgreSQL status
sudo systemctl status postgresql

# Test connections
sudo -u postgres psql -c "SELECT 1;"

# Run health check
/opt/fairdb/scripts/pg-health-check.sh

2. Document Incident

# Disk Space Emergency - YYYY-MM-DD

## Initial State
- Disk usage: X%
- Free space: XGB
- Affected services: [list]

## Actions Taken
- [List each action with space recovered]

## Final State
- Disk usage: X%
- Free space: XGB
- Time to resolution: X minutes

## Root Cause
[Why did disk fill up?]

## Prevention
- [ ] Implement monitoring
- [ ] Set up log rotation
- [ ] Schedule regular cleanups
- [ ] Consider storage upgrade

3. Implement Monitoring

# Install monitoring script (run-parts skips filenames containing a dot, so drop the .sh suffix)
sudo install -m 755 /opt/fairdb/scripts/check-disk-space.sh /etc/cron.hourly/check-disk-space

# Set up alerts
# (Configure email/Slack notifications)

DECISION TREE

Disk at 100%?
├─ Yes → Priority 1 & 2 (Logs + WAL) IMMEDIATELY
│   ├─ Space freed? → Continue to monitoring
│   └─ Still full? → Priority 3 (Vacuum) + Consider Priority 5
│
└─ Disk at 85-99%?
    ├─ Priority 1 (Logs) + Schedule Priority 3 (Vacuum)
    └─ Plan long-term solution (resize disk)

START RESPONSE

Ask user:

  1. "What is the current disk usage? (run df -h)"
  2. "Is PostgreSQL still running?"
  3. "When did this start happening?"

Then immediately execute Rapid Assessment and Emergency Space Recovery procedures.

Remember: Time is critical. Database writes are failing. Act fast but safely!