Initial commit
commands/incident-p0-disk-full.md
---
name: incident-p0-disk-full
description: Emergency response for SOP-203 P0 - Disk Space Emergency
model: sonnet
---

# SOP-203: P0 - Disk Space Emergency

🚨 **CRITICAL: Disk usage above 95% (or already at 100%)**

You are responding to a **disk space emergency** that threatens database operations.

## Severity: P0 - CRITICAL
- **Impact:** Database writes failing, potential data loss
- **Response Time:** IMMEDIATE
- **Resolution Target:** <30 minutes

## IMMEDIATE DANGER SIGNS

If disk is at 100%:
- ❌ PostgreSQL cannot write data
- ❌ WAL files cannot be created
- ❌ Transactions will fail
- ❌ Database may crash
- ❌ Backups will fail

**Act NOW to free space!**

## RAPID ASSESSMENT

### 1. Check Current Usage
```bash
# Overall disk usage
df -h

# PostgreSQL data directory (owned by postgres, so sudo is needed)
sudo du -sh /var/lib/postgresql/16/main

# Find largest directories
sudo du -h --max-depth=1 /var/lib/postgresql/16/main | sort -rh | head -10

# Find largest files
sudo find /var/lib/postgresql/16/main -type f -size +100M -exec ls -lh {} + | sort -k5 -rh | head -20
```
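
To judge how far the recovery steps need to go, the `df` numbers can be turned into a concrete target. A minimal sketch, assuming the goal is to get back to 80% usage (the threshold the monitoring script in this SOP uses); `kb_to_free` is a hypothetical helper, not an existing FairDB script. The result is an approximation, because `df` computes its percentage against used plus available space, which excludes reserved blocks.

```bash
# kb_to_free TOTAL_KB USED_KB TARGET_PCT
# Prints how many KB must be freed so that usage drops to TARGET_PCT or below.
kb_to_free() {
  total_kb=$1
  used_kb=$2
  target_pct=$3
  # Largest "used" value that still meets the target (integer floor)
  allowed_kb=$(( total_kb * target_pct / 100 ))
  if [ "$used_kb" -le "$allowed_kb" ]; then
    echo 0
  else
    echo $(( used_kb - allowed_kb ))
  fi
}

# Example with live numbers (df -P reports 1K blocks in columns 2 and 3):
# kb_to_free $(df -P /var/lib/postgresql | awk 'NR==2 {print $2, $3}') 80
```

For a 100 GB volume at 95% (104857600 KB total, 99614720 KB used), this prints 15728640, i.e. 15 GB to free.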

### 2. Identify Culprits
```bash
# Check log sizes
sudo du -sh /var/log/postgresql/

# Check WAL directory (size and number of entries)
sudo du -sh /var/lib/postgresql/16/main/pg_wal/
sudo ls -1 /var/lib/postgresql/16/main/pg_wal/ | wc -l

# Check for temp files
sudo du -sh /tmp/
sudo find /tmp -type f -size +10M -ls

# Database sizes
sudo -u postgres psql -c "
SELECT
  datname,
  pg_size_pretty(pg_database_size(datname)) AS size,
  pg_database_size(datname) AS size_bytes
FROM pg_database
ORDER BY size_bytes DESC;"
```

## EMERGENCY SPACE RECOVERY

### Priority 1: Clear Old Logs (SAFEST)

```bash
# PostgreSQL logs older than 7 days
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete

# Compress remaining logs, skipping any written in the last 24 hours
# (the active log must stay as-is; gzip needs some free space for its output)
sudo find /var/log/postgresql/ -name "*.log" -mtime +0 -exec gzip {} +

# Clear syslog/journal
sudo journalctl --vacuum-time=7d

# Check space recovered
df -h
```

**Expected recovery:** 1-5 GB
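
Before running the `-delete` step, it can be worth previewing what it would remove and how much space that represents. A minimal sketch; `preview_old_logs` is a hypothetical helper, and it assumes the same `*.log` naming pattern as the commands above:

```bash
# preview_old_logs DIR DAYS
# Lists *.log files in DIR older than DAYS days and prints their combined
# size in KB. Nothing is deleted.
preview_old_logs() {
  dir=$1
  days=$2
  find "$dir" -name "*.log" -mtime +"$days" -print
  # du -k prints "<size-in-KB><TAB><path>" per file; sum the first column
  find "$dir" -name "*.log" -mtime +"$days" -exec du -k {} + |
    awk '{ sum += $1 } END { printf "total_kb=%d\n", sum }'
}

# Usage (run as root for the real log directory):
# preview_old_logs /var/log/postgresql 7
```

If `total_kb` is far below the 1-5 GB range, logs are probably not the main culprit and it makes sense to move on to the next priority quickly.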

### Priority 2: Archive Old WAL Files

⚠️ **ONLY if you have confirmed backups!**

```bash
# Check WAL retention settings
sudo -u postgres psql -c "SHOW wal_keep_size;"

# List the oldest WAL files (names sort chronologically)
sudo ls -lh /var/lib/postgresql/16/main/pg_wal/ | head -50

# Find what is holding WAL back: failed archiving or stale replication slots
sudo -u postgres psql -c "SELECT archived_count, failed_count, last_failed_wal FROM pg_stat_archiver;"
sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# Archive WAL files (pgBackRest will help; make sure its repository is NOT on the full disk)
sudo -u postgres pgbackrest --stanza=main --type=full backup

# Never delete files from pg_wal by hand, and do not point pg_archivecleanup at it
# (that tool is meant for archive directories). Once archiving works again or the
# stale slot is dropped, a checkpoint lets PostgreSQL remove old segments itself.
sudo -u postgres psql -c "CHECKPOINT;"

# Check space
df -h
```

**Expected recovery:** 5-20 GB
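
The WAL file count from the assessment step translates into disk usage fairly directly, because WAL segments have a fixed size. A minimal sketch, assuming the default 16 MB segment size (confirm with `SHOW wal_segment_size;`); `wal_mb` and `count_wal_segments` are hypothetical helpers:

```bash
# wal_mb SEGMENT_COUNT [SEGMENT_MB]
# Estimates WAL disk usage in MB. SEGMENT_MB defaults to 16, PostgreSQL's
# default wal_segment_size; pass another value if the cluster differs.
wal_mb() {
  count=$1
  segment_mb=${2:-16}
  echo $(( count * segment_mb ))
}

# count_wal_segments DIR
# Counts only real segment files (24 hex characters), skipping .history and
# .partial files and the archive_status directory.
count_wal_segments() {
  ls -1 "$1" | grep -Ec '^[0-9A-F]{24}$'
}

# Usage (run as root or postgres):
# wal_mb "$(count_wal_segments /var/lib/postgresql/16/main/pg_wal)"
```

For example, 640 segments at 16 MB each come to 10240 MB, about 10 GB, which sits inside the expected recovery range above.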

### Priority 3: Vacuum Databases

```bash
# Quick vacuum (makes dead-row space reusable inside tables; rarely shrinks files on disk)
sudo -u postgres vacuumdb --all --analyze

# Check largest tables (pg_tables is per database, so connect to the one you are inspecting)
sudo -u postgres psql -d [database] -c "
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC
LIMIT 10;"

# Full vacuum on bloated tables (SLOW, locks table, and needs free space
# for a complete new copy of the table while it runs)
sudo -u postgres psql -d [database] -c "VACUUM FULL [table_name];"

# Check space
df -h
```

**Expected recovery:** Variable, depends on bloat

### Priority 4: Remove Temp Files

```bash
# Clear PostgreSQL temp files (they live under base/pgsql_tmp; only remove files
# untouched for an hour, since running queries may still be using newer ones)
sudo find /var/lib/postgresql/16/main/base/pgsql_tmp -type f -mmin +60 -delete

# Clear system temp (only files older than a day; other processes may hold newer ones open)
sudo find /tmp -xdev -type f -mtime +0 -delete

# Clear old backups (if local copies exist)
ls -lh /opt/fairdb/backups/
# Delete old local backups if remote backups are confirmed

df -h
```

### Priority 5: Drop Old/Unused Databases (DANGER!)

⚠️ **ONLY with customer approval!**

```bash
# List databases and the most recent activity among currently connected sessions
sudo -u postgres psql -c "
SELECT
  datname,
  pg_size_pretty(pg_database_size(datname)) AS size,
  (SELECT max(query_start) FROM pg_stat_activity a WHERE a.datname = d.datname) AS last_activity
FROM pg_database d
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY pg_database_size(datname) DESC;"

# Identify inactive databases (last_activity is NULL or very old).
# pg_stat_activity only covers open sessions, so NULL means nobody is connected
# right now, not that the database is unused; confirm with the customer.

# BEFORE DROPPING: Backup! (write the dump to a volume that still has free space)
sudo -u postgres pg_dump [database_name] | gzip > /opt/fairdb/backups/emergency-backup-[database_name].sql.gz

# Drop database (IRREVERSIBLE!)
sudo -u postgres psql -c "DROP DATABASE [database_name];"
```

## LONG-TERM SOLUTIONS

### Option 1: Increase Disk Size

**Contabo/VPS Provider:**
1. Log into provider control panel
2. Upgrade storage plan
3. Resize disk partition
4. Expand filesystem

```bash
# After the provider resize, grow the partition (growpart ships in cloud-guest-utils on Debian/Ubuntu)
sudo growpart /dev/sda 1 # Adjust disk and partition number as needed

# Expand filesystem (ext4; use xfs_growfs on the mount point for XFS)
sudo resize2fs /dev/sda1 # Adjust device as needed

# Verify
df -h
```

### Option 2: Move Data to External Volume

```bash
# Create new volume/mount point
# Move PostgreSQL data directory
sudo systemctl stop postgresql
sudo rsync -av /var/lib/postgresql/ /mnt/new-volume/postgresql/
sudo mv /var/lib/postgresql /var/lib/postgresql.old
sudo ln -s /mnt/new-volume/postgresql /var/lib/postgresql
sudo systemctl start postgresql
```

### Option 3: Offload Old Data

- Archive old customer databases
- Export historical data to cold storage
- Implement data retention policies

### Option 4: Optimize Storage

```sql
-- Use lz4 for a column's TOAST compression (PostgreSQL 14+; set per column)
ALTER TABLE [table_name] ALTER COLUMN [column_name] SET COMPRESSION lz4;

-- Rewrite table to reclaim bloat (only newly written values use lz4;
-- existing values may keep their current compression)
VACUUM FULL [table_name];

-- Set autovacuum more aggressively
ALTER TABLE [table_name] SET (autovacuum_vacuum_scale_factor = 0.05);
```

## MONITORING & PREVENTION

### Set Up Disk Monitoring

Add to cron (`crontab -e`):
```bash
# Check disk space every hour
0 * * * * /opt/fairdb/scripts/check-disk-space.sh
```

**Create script** `/opt/fairdb/scripts/check-disk-space.sh`:
```bash
#!/bin/bash
THRESHOLD=80
# -P keeps each filesystem on one line even when the device name is long
USAGE=$(df -P /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')

if [ "$USAGE" -gt "$THRESHOLD" ]; then
  echo "WARNING: Disk usage at ${USAGE}%" | mail -s "FairDB Disk Warning" your-email@example.com
fi
```
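
A single 80% threshold raises a warning but does not separate it from the >95% emergency this SOP covers. A minimal sketch of a two-level variant; the function names are hypothetical, and the 95% critical level is an assumption taken from the thresholds named in this document:

```bash
# alert_level USAGE_PCT
# Maps a usage percentage to a level: OK (<=80), WARNING (81-95), CRITICAL (>95).
alert_level() {
  usage=$1
  if [ "$usage" -gt 95 ]; then
    echo CRITICAL
  elif [ "$usage" -gt 80 ]; then
    echo WARNING
  else
    echo OK
  fi
}

# usage_from_df reads `df -P` output on stdin and prints the usage percentage
# of the first filesystem line as a bare integer.
usage_from_df() {
  awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Usage:
# alert_level "$(df -P /var/lib/postgresql | usage_from_df)"
```

Keeping the parsing and the classification in separate functions means the thresholds can be checked with plain numbers, without needing a nearly full disk to test against.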

### Configure Log Rotation

Edit `/etc/logrotate.d/postgresql`:
```
/var/log/postgresql/*.log {
    daily
    rotate 7
    compress
    delaycompress
    notifempty
    missingok
    # PostgreSQL keeps its log file open, so truncate in place instead of renaming
    copytruncate
}
```
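
### Implement Database Quotas

```sql
-- PostgreSQL has no per-database size limit (max_database_size does not exist).
-- Flag databases over 10 GB, and cap per-process temp file usage instead.
SELECT datname FROM pg_database WHERE pg_database_size(datname) > 10737418240;
ALTER DATABASE customer_db_001 SET temp_file_limit = '10GB';
```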

## POST-RECOVERY ACTIONS

### 1. Verify Database Health
```bash
# Check PostgreSQL status
sudo systemctl status postgresql

# Test connections
sudo -u postgres psql -c "SELECT 1;"

# Run health check
/opt/fairdb/scripts/pg-health-check.sh
```

### 2. Document Incident

```markdown
# Disk Space Emergency - YYYY-MM-DD

## Initial State
- Disk usage: X%
- Free space: XGB
- Affected services: [list]

## Actions Taken
- [List each action with space recovered]

## Final State
- Disk usage: X%
- Free space: XGB
- Time to resolution: X minutes

## Root Cause
[Why did disk fill up?]

## Prevention
- [ ] Implement monitoring
- [ ] Set up log rotation
- [ ] Schedule regular cleanups
- [ ] Consider storage upgrade
```
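
Filling in the template by hand is easy to skip once the pressure is off, so the skeleton can be generated with the measured values already in place. A minimal sketch; `new_incident_report` is a hypothetical helper that takes the before and after usage percentages and the minutes to resolution:

```bash
# new_incident_report BEFORE_PCT AFTER_PCT MINUTES
# Prints the incident template with today's date and the measured values filled in.
new_incident_report() {
  cat <<EOF
# Disk Space Emergency - $(date +%Y-%m-%d)

## Initial State
- Disk usage: $1%
- Affected services: [list]

## Actions Taken
- [List each action with space recovered]

## Final State
- Disk usage: $2%
- Time to resolution: $3 minutes

## Root Cause
[Why did disk fill up?]
EOF
}

# Usage:
# new_incident_report 100 72 25 > incident-$(date +%Y-%m-%d).md
```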

### 3. Implement Monitoring

```bash
# Install monitoring script (no ".sh" suffix: run-parts skips file names containing dots)
sudo install -m 755 /opt/fairdb/scripts/check-disk-space.sh /etc/cron.hourly/check-disk-space

# Set up alerts
# (Configure email/Slack notifications)
```

## DECISION TREE

```
Disk at 100%?
├─ Yes → Priority 1 & 2 (Logs + WAL) IMMEDIATELY
│   ├─ Space freed? → Continue to monitoring
│   └─ Still full? → Priority 3 (Vacuum) + Consider Priority 5
│
└─ Disk at 85-99%?
    ├─ Priority 1 (Logs) + Schedule Priority 3 (Vacuum)
    └─ Plan long-term solution (resize disk)
```
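
The tree above can also be expressed as a small function, which is handy when the usage figure comes from a script rather than a person. A minimal sketch; `next_steps` is a hypothetical helper, and the message for usage below 85% is an assumption, since the tree does not cover that case:

```bash
# next_steps USAGE_PCT
# Prints the branch of the decision tree that applies to a usage percentage.
next_steps() {
  usage=$1
  if [ "$usage" -ge 100 ]; then
    echo "Priority 1 & 2 (Logs + WAL) immediately; if still full, Priority 3 (Vacuum) and consider Priority 5"
  elif [ "$usage" -ge 85 ]; then
    echo "Priority 1 (Logs), schedule Priority 3 (Vacuum), plan long-term solution"
  else
    echo "Below the thresholds in this SOP; keep monitoring"
  fi
}

# Usage:
# next_steps "$(df -P /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')"
```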

## START RESPONSE

Ask user:
1. "What is the current disk usage? (run `df -h`)"
2. "Is PostgreSQL still running?"
3. "When did this start happening?"

Then immediately execute Rapid Assessment and Emergency Space Recovery procedures.

**Remember:** Time is critical. Database writes are failing. Act fast but safely!