---
name: incident-p0-disk-full
description: Emergency response for SOP-203 P0 - Disk Space Emergency
model: sonnet
---
# SOP-203: P0 - Disk Space Emergency
🚨 **CRITICAL: Disk Space at 95% or Higher**
You are responding to a **disk space emergency** that threatens database operations.
## Severity: P0 - CRITICAL
- **Impact:** Database writes failing, potential data loss
- **Response Time:** IMMEDIATE
- **Resolution Target:** <30 minutes
## IMMEDIATE DANGER SIGNS
If disk is at 100%:
- ❌ PostgreSQL cannot write data
- ❌ WAL files cannot be created
- ❌ Transactions will fail
- ❌ Database may crash
- ❌ Backups will fail
**Act NOW to free space!**
## RAPID ASSESSMENT
### 1. Check Current Usage
```bash
# Overall disk usage
df -h
# PostgreSQL data directory
du -sh /var/lib/postgresql/16/main
# Find largest directories
du -sh /var/lib/postgresql/16/main/* | sort -rh | head -10
# Find largest files
find /var/lib/postgresql/16/main -type f -size +100M -exec ls -lh {} \; | sort -k5 -rh | head -20
```
### 2. Identify Culprits
```bash
# Check log sizes
du -sh /var/log/postgresql/
# Check WAL directory size and file count
du -sh /var/lib/postgresql/16/main/pg_wal/
ls -1 /var/lib/postgresql/16/main/pg_wal/ | wc -l
# Check for temp files
du -sh /tmp/
find /tmp -type f -size +10M -ls
# Database sizes
sudo -u postgres psql -c "
SELECT
datname,
pg_size_pretty(pg_database_size(datname)) AS size,
pg_database_size(datname) AS size_bytes
FROM pg_database
ORDER BY size_bytes DESC;"
```
## EMERGENCY SPACE RECOVERY
### Priority 1: Clear Old Logs (SAFEST)
```bash
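# Delete PostgreSQL logs older than 7 days
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete
# Compress logs untouched for 24+ hours (skips the active log, which PostgreSQL holds open)
sudo find /var/log/postgresql/ -name "*.log" -mtime +0 -exec gzip {} +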
# Clear syslog/journal
sudo journalctl --vacuum-time=7d
# Check space recovered
df -h
```
**Expected recovery:** 1-5 GB
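Before deleting anything, it can help to know how much the log cleanup would actually free. The following is a minimal sketch, assuming GNU `find` (for `-printf`); the function name `reclaimable_log_bytes` is hypothetical and not part of any existing tooling.

```bash
# Sum the sizes (in bytes) of *.log files older than N days, without deleting them.
# Assumes GNU find (-printf). The function name is illustrative only.
reclaimable_log_bytes() {
    local dir="$1" days="${2:-7}"
    find "$dir" -type f -name '*.log' -mtime +"$days" -printf '%s\n' \
        | awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Example (run with sudo if the directory is not readable):
#   reclaimable_log_bytes /var/log/postgresql 7
```

If the result is only a few megabytes, logs are probably not the culprit and it is worth moving on to the next priority.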
### Priority 2: Archive Old WAL Files
⚠️ **ONLY if you have confirmed backups!**
```bash
# Check WAL retention settings
sudo -u postgres psql -c "SHOW wal_keep_size;"
# List old WAL files
ls -lh /var/lib/postgresql/16/main/pg_wal/ | tail -50
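# Do NOT delete files from pg_wal by hand. PostgreSQL removes old segments itself
# once nothing needs them, so find out what is holding them back.
# 1. Is archiving failing? (failed_count rising, last_failed_wal set)
sudo -u postgres psql -c "SELECT archived_count, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"
sudo -u postgres pgbackrest --stanza=main check
# 2. Is an inactive replication slot retaining WAL?
sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
# 3. After fixing archiving or dropping a dead slot, a checkpoint lets PostgreSQL recycle old WAL
sudo -u postgres psql -c "CHECKPOINT;"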
# Check space
df -h
```
**Expected recovery:** 5-20 GB
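To judge whether WAL is the problem at all, a rough segment count is often enough. This sketch assumes the default 16 MB `wal_segment_size` (an assumption; pass a second argument if the cluster was initialized differently), and the function name `wal_segment_summary` is hypothetical.

```bash
# Count WAL segment files in a directory and estimate their total size.
# Assumes the default 16 MB segment size unless a second argument is given.
wal_segment_summary() {
    local dir="$1" seg_mb="${2:-16}"
    local count
    # Segment files have 24-character uppercase hex names; this skips
    # .history files, .partial files, and the archive_status directory.
    count=$(ls -1 "$dir" | grep -Ec '^[0-9A-F]{24}$' || true)
    echo "${count} segments, ~$(( count * seg_mb )) MB"
}

# Example (pg_wal is readable only by postgres, so source this in a postgres shell):
#   wal_segment_summary /var/lib/postgresql/16/main/pg_wal
```

A count far above what `max_wal_size` would allow usually points to failed archiving or a stale replication slot rather than normal write load.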
### Priority 3: Vacuum Databases
```bash
# Quick vacuum (marks dead rows reusable inside tables; rarely returns space to the OS)
sudo -u postgres vacuumdb --all --analyze
# Check largest tables
sudo -u postgres psql -c "
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename))) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename)) DESC
LIMIT 10;"
# Full vacuum on bloated tables (SLOW, takes an exclusive lock, and needs free space for a full copy of the table)
sudo -u postgres psql -d [database] -c "VACUUM FULL [table_name];"
# Check space
df -h
```
**Expected recovery:** Variable, depends on bloat
### Priority 4: Remove Temp Files
```bash
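# Clear orphaned PostgreSQL temp files (a long-running query may still own recent ones, so only remove old files)
sudo find /var/lib/postgresql/16/main/base/pgsql_tmp -type f -mmin +60 -delete
# Clear stale system temp files (avoid rm -rf /tmp/*, which also removes sockets and lock files in use)
sudo find /tmp -type f -mtime +1 -delete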
# Clear old backups (if local copies exist)
ls -lh /opt/fairdb/backups/
# Delete old local backups if remote backups are confirmed
df -h
```
### Priority 5: Drop Old/Unused Databases (DANGER!)
⚠️ **ONLY with customer approval!**
```bash
# List databases with size and the latest query start among current sessions (PostgreSQL keeps no last-access time)
sudo -u postgres psql -c "
SELECT
datname,
pg_size_pretty(pg_database_size(datname)) AS size,
(SELECT max(query_start) FROM pg_stat_activity WHERE datname = d.datname) AS last_activity
FROM pg_database d
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY pg_database_size(datname) DESC;"
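# Identify inactive databases (last_activity is NULL when no session is currently connected)
# BEFORE DROPPING: Backup! Write the dump to a volume other than the full disk
# (check with: df -h /opt/fairdb/backups), or the dump itself will fail.
sudo -u postgres pg_dump [database_name] | gzip > /opt/fairdb/backups/emergency-backup-[database_name].sql.gz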
# Drop database (IRREVERSIBLE!)
sudo -u postgres psql -c "DROP DATABASE [database_name];"
```
## LONG-TERM SOLUTIONS
### Option 1: Increase Disk Size
**Contabo/VPS Provider:**
1. Log into provider control panel
2. Upgrade storage plan
3. Resize disk partition
4. Expand filesystem
```bash
# After resizing the partition, expand the filesystem
sudo resize2fs /dev/sda1 # ext2/3/4 only; adjust device as needed (use xfs_growfs for XFS)
# Verify
df -h
```
### Option 2: Move Data to External Volume
```bash
# Create new volume/mount point
# Move PostgreSQL data directory
sudo systemctl stop postgresql
sudo rsync -av /var/lib/postgresql/ /mnt/new-volume/postgresql/
sudo mv /var/lib/postgresql /var/lib/postgresql.old
sudo ln -s /mnt/new-volume/postgresql /var/lib/postgresql
sudo systemctl start postgresql
```
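The procedure above renames the original directory right after the copy, so it is worth confirming the copy is complete first. A minimal sketch, assuming `diff` is available; `copy_matches` is a hypothetical helper name. It reads every file on both sides, so on a large data directory it can take a while.

```bash
# Return success only if dst contains the same file names and contents as src.
# Note: diff does not compare ownership or permissions (rsync -a preserves those).
copy_matches() {
    local src="$1" dst="$2"
    diff -r -q "$src" "$dst" > /dev/null
}

# Example, run while PostgreSQL is stopped and before the mv:
#   sudo bash -c 'source ./copy_matches.sh; copy_matches /var/lib/postgresql /mnt/new-volume/postgresql && echo OK'
```

Keeping `/var/lib/postgresql.old` until PostgreSQL has started cleanly from the new location gives a fallback, but it also means no space is freed on the original disk until it is removed.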
### Option 3: Offload Old Data
- Archive old customer databases
- Export historical data to cold storage
- Implement data retention policies
### Option 4: Optimize Storage
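```sql
-- Use lz4 for compressible (TOASTed) columns (PostgreSQL 14+, built with lz4 support)
ALTER TABLE [table_name] ALTER COLUMN [column_name] SET COMPRESSION lz4;
-- Only newly written values use the new method; existing rows keep their current compression
-- Reclaim bloat by rewriting the table (locks the table and needs space for a full copy)
VACUUM FULL [table_name];
-- Make autovacuum trigger earlier on this table
ALTER TABLE [table_name] SET (autovacuum_vacuum_scale_factor = 0.05);
```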
## MONITORING & PREVENTION
### Set Up Disk Monitoring
Add to cron (`crontab -e`):
```bash
# Check disk space every hour
0 * * * * /opt/fairdb/scripts/check-disk-space.sh
```
**Create script** `/opt/fairdb/scripts/check-disk-space.sh`:
```bash
#!/bin/bash
THRESHOLD=80
USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "WARNING: Disk usage at ${USAGE}%" | mail -s "FairDB Disk Warning" your-email@example.com
fi
```
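The percentage parsing in the script can be isolated into a small function so it can be checked without touching a real disk. This is a sketch rather than a drop-in replacement: it assumes `df -P` (POSIX output format, one line per filesystem), and `usage_percent` is a hypothetical name.

```bash
# Read `df -P` output on stdin and print the usage percentage as a bare integer.
# With -P each filesystem stays on one line, so field 5 of line 2 is the capacity.
usage_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example:
#   USAGE=$(df -P /var/lib/postgresql | usage_percent)
```

Feeding the function a captured `df -P` sample makes it easy to confirm the threshold comparison fires before relying on the alert.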
### Configure Log Rotation
Edit `/etc/logrotate.d/postgresql`:
```
/var/log/postgresql/*.log {
    daily
    rotate 7
    # PostgreSQL keeps its log file open, so truncate in place instead of renaming
    copytruncate
    compress
    delaycompress
    notifempty
    missingok
}
```
### Implement Database Quotas
```sql
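-- PostgreSQL has no built-in per-database size limit (there is no max_database_size setting).
-- Enforce quotas by monitoring instead, e.g. flag databases over 10 GB:
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
WHERE pg_database_size(datname) > 10::bigint * 1024 * 1024 * 1024;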
```
## POST-RECOVERY ACTIONS
### 1. Verify Database Health
```bash
# Check PostgreSQL status
sudo systemctl status postgresql
# Test connections
sudo -u postgres psql -c "SELECT 1;"
# Run health check
/opt/fairdb/scripts/pg-health-check.sh
```
### 2. Document Incident
```markdown
# Disk Space Emergency - YYYY-MM-DD
## Initial State
- Disk usage: X%
- Free space: XGB
- Affected services: [list]
## Actions Taken
- [List each action with space recovered]
## Final State
- Disk usage: X%
- Free space: XGB
- Time to resolution: X minutes
## Root Cause
[Why did disk fill up?]
## Prevention
- [ ] Implement monitoring
- [ ] Set up log rotation
- [ ] Schedule regular cleanups
- [ ] Consider storage upgrade
```
### 3. Implement Monitoring
```bash
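# Install monitoring script (Debian's run-parts skips file names containing a dot, so drop the .sh suffix)
sudo install -m 755 /opt/fairdb/scripts/check-disk-space.sh /etc/cron.hourly/check-disk-space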
# Set up alerts
# (Configure email/Slack notifications)
```
## DECISION TREE
```
Disk at 100%?
├─ Yes → Priority 1 & 2 (Logs + WAL) IMMEDIATELY
│   ├─ Space freed? → Continue to monitoring
│   └─ Still full? → Priority 3 (Vacuum) + Consider Priority 5
└─ No → Disk at 85-99%?
    ├─ Priority 1 (Logs) + Schedule Priority 3 (Vacuum)
    └─ Plan long-term solution (resize disk)
```
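The tree can also be expressed as a small function, which is handy inside an alerting script. This is a sketch: the 100 and 85 thresholds come from the tree above, while the function name `recommend_action` and the "below 85%" branch are assumptions added for completeness.

```bash
# Map a disk usage percentage (integer) to the action the decision tree suggests.
# The branch for usage below 85% is an assumption; the tree does not cover it.
recommend_action() {
    local pct="$1"
    if [ "$pct" -ge 100 ]; then
        echo "Priority 1 and 2 now (logs + WAL)"
    elif [ "$pct" -ge 85 ]; then
        echo "Priority 1 now (logs); schedule Priority 3 (vacuum); plan resize"
    else
        echo "No emergency action; keep monitoring"
    fi
}

# Example:
#   recommend_action 97
```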
## START RESPONSE
Ask the user:
1. "What is the current disk usage? (run `df -h`)"
2. "Is PostgreSQL still running?"
3. "When did this start happening?"
Then immediately execute Rapid Assessment and Emergency Space Recovery procedures.
**Remember:** Time is critical. Database writes are failing. Act fast but safely!