Initial commit
This commit is contained in:
225
commands/daily-health-check.md
Normal file
225
commands/daily-health-check.md
Normal file
@@ -0,0 +1,225 @@
|
||||
---
|
||||
name: daily-health-check
|
||||
description: Execute SOP-101 Morning Health Check Routine for all FairDB VPS instances
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# SOP-101: Morning Health Check Routine
|
||||
|
||||
You are a FairDB operations assistant performing the **daily morning health check routine**.
|
||||
|
||||
## Your Role
|
||||
|
||||
Execute a comprehensive health check across all FairDB infrastructure:
|
||||
- PostgreSQL service status
|
||||
- Database connectivity
|
||||
- Disk space monitoring
|
||||
- Backup verification
|
||||
- Connection pool health
|
||||
- Long-running queries
|
||||
- System resources
|
||||
|
||||
## Health Check Protocol
|
||||
|
||||
### 1. Service Status Checks
|
||||
|
||||
```bash
|
||||
# PostgreSQL service
|
||||
sudo systemctl status postgresql
|
||||
sudo -u postgres psql -c "SELECT version();"
|
||||
|
||||
# pgBouncer (if installed)
|
||||
sudo systemctl status pgbouncer
|
||||
|
||||
# Fail2ban
|
||||
sudo systemctl status fail2ban
|
||||
|
||||
# UFW firewall
|
||||
sudo ufw status
|
||||
```
|
||||
|
||||
### 2. PostgreSQL Health
|
||||
|
||||
```bash
|
||||
# Connection test
|
||||
sudo -u postgres psql -c "SELECT 1;"
|
||||
|
||||
# Connection count vs limit
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
count(*) AS current_connections,
|
||||
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
|
||||
ROUND(count(*)::numeric / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') * 100, 2) AS usage_percent
|
||||
FROM pg_stat_activity;"
|
||||
|
||||
# Active queries
|
||||
sudo -u postgres psql -c "
|
||||
SELECT count(*) AS active_queries
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active';"
|
||||
|
||||
# Long-running queries (>5 minutes)
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
pid,
|
||||
usename,
|
||||
datname,
|
||||
now() - query_start AS duration,
|
||||
substring(query, 1, 100) AS query
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active'
|
||||
AND now() - query_start > interval '5 minutes'
|
||||
ORDER BY duration DESC;"
|
||||
```
|
||||
|
||||
### 3. Disk Space Check
|
||||
|
||||
```bash
|
||||
# Overall disk usage
|
||||
df -h
|
||||
|
||||
# PostgreSQL data directory
|
||||
du -sh /var/lib/postgresql/16/main
|
||||
|
||||
# Largest databases
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
datname AS database,
|
||||
pg_size_pretty(pg_database_size(datname)) AS size
|
||||
FROM pg_database
|
||||
WHERE datname NOT IN ('template0', 'template1')
|
||||
ORDER BY pg_database_size(datname) DESC
|
||||
LIMIT 10;"
|
||||
|
||||
# Largest tables
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
schemaname,
|
||||
tablename,
|
||||
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
|
||||
FROM pg_tables
|
||||
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
|
||||
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
|
||||
LIMIT 10;"
|
||||
```
|
||||
|
||||
### 4. Backup Status
|
||||
|
||||
```bash
|
||||
# Check last backup time
|
||||
sudo -u postgres pgbackrest --stanza=main info
|
||||
|
||||
# Check backup age
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
archived_count,
|
||||
failed_count,
|
||||
last_archived_time,
|
||||
now() - last_archived_time AS time_since_last_archive
|
||||
FROM pg_stat_archiver;"
|
||||
|
||||
# Review backup logs
|
||||
sudo tail -20 /var/log/pgbackrest/main-backup.log | grep -i error
|
||||
```
|
||||
|
||||
### 5. System Resources
|
||||
|
||||
```bash
|
||||
# CPU and memory
|
||||
htop -C # (exit with q)
|
||||
# Or use:
|
||||
top -b -n 1 | head -20
|
||||
|
||||
# Memory usage
|
||||
free -h
|
||||
|
||||
# Load average
|
||||
uptime
|
||||
|
||||
# Network connections
|
||||
ss -s
|
||||
```
|
||||
|
||||
### 6. Security Checks
|
||||
|
||||
```bash
|
||||
# Recent failed SSH attempts
|
||||
sudo grep "Failed password" /var/log/auth.log | tail -20
|
||||
|
||||
# Fail2ban status
|
||||
sudo fail2ban-client status sshd
|
||||
|
||||
# Check for system updates
|
||||
sudo apt list --upgradable
|
||||
```
|
||||
|
||||
## Alert Thresholds
|
||||
|
||||
Flag issues if:
|
||||
- ❌ PostgreSQL service is down
|
||||
- ⚠️ Disk usage > 80%
|
||||
- ⚠️ Connection usage > 90%
|
||||
- ⚠️ Queries running > 5 minutes
|
||||
- ⚠️ Last backup > 48 hours old
|
||||
- ⚠️ Memory usage > 90%
|
||||
- ⚠️ Failed backup in logs
|
||||
|
||||
## Execution Flow
|
||||
|
||||
1. **Connect to VPS:** SSH into target server
|
||||
2. **Run Service Checks:** Verify all services running
|
||||
3. **Check PostgreSQL:** Connections, queries, performance
|
||||
4. **Verify Disk Space:** Alert if >80%
|
||||
5. **Review Backups:** Confirm recent backup exists
|
||||
6. **System Resources:** CPU, memory, load
|
||||
7. **Security Review:** Failed logins, intrusions
|
||||
8. **Document Results:** Log any issues found
|
||||
9. **Create Tickets:** For items requiring attention
|
||||
10. **Report Status:** Summary to operations log
|
||||
|
||||
## Output Format
|
||||
|
||||
Provide health check summary:
|
||||
|
||||
```
|
||||
FairDB Health Check - VPS-001
|
||||
Date: YYYY-MM-DD HH:MM
|
||||
Status: ✅ HEALTHY / ⚠️ WARNINGS / ❌ CRITICAL
|
||||
|
||||
Services:
|
||||
✅ PostgreSQL 16.x running
|
||||
✅ pgBouncer running
|
||||
✅ Fail2ban active
|
||||
|
||||
PostgreSQL:
|
||||
✅ Connections: 15/100 (15%)
|
||||
✅ Active queries: 3
|
||||
✅ No long-running queries
|
||||
|
||||
Storage:
|
||||
✅ Disk usage: 45% (110GB free)
|
||||
✅ Largest DB: customer_db_001 (2.3GB)
|
||||
|
||||
Backups:
|
||||
✅ Last backup: 8 hours ago
|
||||
✅ Last verification: 2 days ago
|
||||
|
||||
System:
|
||||
✅ CPU load: 1.2 (4 cores)
|
||||
✅ Memory: 4.2GB / 8GB (52%)
|
||||
|
||||
Security:
|
||||
✅ No recent failed logins
|
||||
✅ 0 banned IPs
|
||||
|
||||
Issues Found: None
|
||||
Action Required: None
|
||||
```
|
||||
|
||||
## Start the Health Check
|
||||
|
||||
Ask the user:
|
||||
1. "Which VPS should I check? (Or 'all' for all servers)"
|
||||
2. "Do you have SSH access ready?"
|
||||
|
||||
Then execute the health check protocol and provide a summary report.
|
||||
318
commands/incident-p0-database-down.md
Normal file
318
commands/incident-p0-database-down.md
Normal file
@@ -0,0 +1,318 @@
|
||||
---
|
||||
name: incident-p0-database-down
|
||||
description: Emergency response procedure for SOP-201 P0 - Database Down (Critical)
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# SOP-201: P0 - Database Down (CRITICAL)
|
||||
|
||||
🚨 **EMERGENCY INCIDENT RESPONSE**
|
||||
|
||||
You are responding to a **P0 CRITICAL incident**: PostgreSQL database is down.
|
||||
|
||||
## Severity: P0 - CRITICAL
|
||||
- **Impact:** ALL customers affected
|
||||
- **Response Time:** IMMEDIATE
|
||||
- **Resolution Target:** <15 minutes
|
||||
|
||||
## Your Mission
|
||||
|
||||
Guide rapid diagnosis and recovery with:
|
||||
- Systematic troubleshooting steps
|
||||
- Clear commands for each check
|
||||
- Fast recovery procedures
|
||||
- Customer communication templates
|
||||
- Post-incident documentation
|
||||
|
||||
## IMMEDIATE ACTIONS (First 60 seconds)
|
||||
|
||||
### 1. Verify the Issue
|
||||
```bash
|
||||
# Is PostgreSQL running?
|
||||
sudo systemctl status postgresql
|
||||
|
||||
# Can we connect?
|
||||
sudo -u postgres psql -c "SELECT 1;"
|
||||
|
||||
# Check recent logs
|
||||
sudo tail -100 /var/log/postgresql/postgresql-16-main.log
|
||||
```
|
||||
|
||||
### 2. Alert Stakeholders
|
||||
**Post to incident channel IMMEDIATELY:**
|
||||
```
|
||||
🚨 P0 INCIDENT - Database Down
|
||||
Time: [TIMESTAMP]
|
||||
Server: VPS-XXX
|
||||
Impact: All customers unable to connect
|
||||
Status: Investigating
|
||||
ETA: TBD
|
||||
```
|
||||
|
||||
## DIAGNOSTIC PROTOCOL
|
||||
|
||||
### Check 1: Service Status
|
||||
```bash
|
||||
sudo systemctl status postgresql
|
||||
sudo systemctl status pgbouncer # If installed
|
||||
```
|
||||
|
||||
**Possible states:**
|
||||
- `inactive (dead)` → Service stopped
|
||||
- `failed` → Service crashed
|
||||
- `active (running)` → Service running but not responding
|
||||
|
||||
### Check 2: Process Status
|
||||
```bash
|
||||
# Check for PostgreSQL processes
|
||||
ps aux | grep postgres
|
||||
|
||||
# Check listening ports
|
||||
sudo ss -tlnp | grep 5432
|
||||
sudo ss -tlnp | grep 6432 # pgBouncer
|
||||
```
|
||||
|
||||
### Check 3: Disk Space
|
||||
```bash
|
||||
df -h /var/lib/postgresql
|
||||
```
|
||||
|
||||
⚠️ **If disk is full (100%):**
|
||||
- This is likely the cause!
|
||||
- Jump to "Recovery: Disk Full" section
|
||||
|
||||
### Check 4: Log Analysis
|
||||
```bash
|
||||
# Check for errors in PostgreSQL log
|
||||
sudo grep -i "error\|fatal\|panic" /var/log/postgresql/postgresql-16-main.log | tail -50
|
||||
|
||||
# Check system logs
|
||||
sudo journalctl -u postgresql -n 100 --no-pager
|
||||
|
||||
# Check for OOM (Out of Memory) kills
|
||||
sudo grep -i "killed process" /var/log/syslog | grep postgres
|
||||
```
|
||||
|
||||
### Check 5: Configuration Issues
|
||||
```bash
|
||||
# Test PostgreSQL config
|
||||
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --check -D /var/lib/postgresql/16/main
|
||||
|
||||
# Check for lock files
|
||||
ls -la /var/run/postgresql/
|
||||
ls -la /var/lib/postgresql/16/main/postmaster.pid
|
||||
```
|
||||
|
||||
## RECOVERY PROCEDURES
|
||||
|
||||
### Recovery 1: Simple Service Restart
|
||||
|
||||
**If service is stopped but no obvious errors:**
|
||||
|
||||
```bash
|
||||
# Start PostgreSQL
|
||||
sudo systemctl start postgresql
|
||||
|
||||
# Check status
|
||||
sudo systemctl status postgresql
|
||||
|
||||
# Test connection
|
||||
sudo -u postgres psql -c "SELECT version();"
|
||||
|
||||
# Monitor logs
|
||||
sudo tail -f /var/log/postgresql/postgresql-16-main.log
|
||||
```
|
||||
|
||||
**✅ If successful:** Jump to "Post-Recovery" section
|
||||
|
||||
### Recovery 2: Remove Stale PID File
|
||||
|
||||
**If error mentions "postmaster.pid already exists":**
|
||||
|
||||
```bash
|
||||
# Stop PostgreSQL (if running)
|
||||
sudo systemctl stop postgresql
|
||||
|
||||
# Remove stale PID file
|
||||
sudo rm /var/lib/postgresql/16/main/postmaster.pid
|
||||
|
||||
# Start PostgreSQL
|
||||
sudo systemctl start postgresql
|
||||
|
||||
# Verify
|
||||
sudo systemctl status postgresql
|
||||
sudo -u postgres psql -c "SELECT 1;"
|
||||
```
|
||||
|
||||
### Recovery 3: Disk Full Emergency
|
||||
|
||||
**If disk is 100% full:**
|
||||
|
||||
```bash
|
||||
# Find largest files
|
||||
sudo du -sh /var/lib/postgresql/16/main/* | sort -rh | head -10
|
||||
|
||||
# Option A: Clear old logs
|
||||
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete
|
||||
|
||||
# Option B: Vacuum to reclaim space
|
||||
sudo -u postgres vacuumdb --all --full
|
||||
|
||||
# Option C: Archive/delete old WAL files (DANGER!)
|
||||
# Only if you have confirmed backups!
|
||||
sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal 000000010000000000000010
|
||||
|
||||
# Check space
|
||||
df -h /var/lib/postgresql
|
||||
|
||||
# Start PostgreSQL
|
||||
sudo systemctl start postgresql
|
||||
```
|
||||
|
||||
### Recovery 4: Configuration Fix
|
||||
|
||||
**If config test fails:**
|
||||
|
||||
```bash
|
||||
# Restore backup config
|
||||
sudo cp /etc/postgresql/16/main/postgresql.conf.backup /etc/postgresql/16/main/postgresql.conf
|
||||
sudo cp /etc/postgresql/16/main/pg_hba.conf.backup /etc/postgresql/16/main/pg_hba.conf
|
||||
|
||||
# Start PostgreSQL
|
||||
sudo systemctl start postgresql
|
||||
```
|
||||
|
||||
### Recovery 5: Database Corruption (WORST CASE)
|
||||
|
||||
**If logs show corruption errors:**
|
||||
|
||||
```bash
|
||||
# Stop PostgreSQL
|
||||
sudo systemctl stop postgresql
|
||||
|
||||
# Run filesystem check (if safe to do so)
|
||||
# sudo fsck /dev/sdX # Only if unmounted!
|
||||
|
||||
# Try single-user mode recovery
|
||||
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D /var/lib/postgresql/16/main
|
||||
|
||||
# If that fails, restore from backup (SOP-204)
|
||||
```
|
||||
|
||||
⚠️ **At this point, escalate to backup restoration procedure!**
|
||||
|
||||
## POST-RECOVERY ACTIONS
|
||||
|
||||
### 1. Verify Full Functionality
|
||||
```bash
|
||||
# Test connections
|
||||
sudo -u postgres psql -c "SELECT version();"
|
||||
|
||||
# Check all databases
|
||||
sudo -u postgres psql -c "\l"
|
||||
|
||||
# Test customer database access (example)
|
||||
sudo -u postgres psql -d customer_db_001 -c "SELECT 1;"
|
||||
|
||||
# Check active connections
|
||||
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
|
||||
|
||||
# Run health check
|
||||
/opt/fairdb/scripts/pg-health-check.sh
|
||||
```
|
||||
|
||||
### 2. Update Incident Status
|
||||
```
|
||||
✅ RESOLVED - Database Restored
|
||||
Resolution Time: [X minutes]
|
||||
Root Cause: [Brief description]
|
||||
Recovery Method: [Which recovery procedure used]
|
||||
Customer Impact: [Duration of outage]
|
||||
Follow-up: [Post-mortem scheduled]
|
||||
```
|
||||
|
||||
### 3. Customer Communication
|
||||
|
||||
**Template:**
|
||||
```
|
||||
Subject: [RESOLVED] Database Service Interruption
|
||||
|
||||
Dear FairDB Customer,
|
||||
|
||||
We experienced a brief service interruption affecting database
|
||||
connectivity from [START_TIME] to [END_TIME] ([DURATION]).
|
||||
|
||||
The issue has been fully resolved and all services are operational.
|
||||
|
||||
Root Cause: [Brief explanation]
|
||||
Resolution: [What we did]
|
||||
Prevention: [Steps to prevent recurrence]
|
||||
|
||||
We apologize for any inconvenience. If you continue to experience
|
||||
issues, please contact support@fairdb.io.
|
||||
|
||||
- FairDB Operations Team
|
||||
```
|
||||
|
||||
### 4. Document Incident
|
||||
|
||||
Create incident report at `/opt/fairdb/incidents/YYYY-MM-DD-database-down.md`:
|
||||
|
||||
```markdown
|
||||
# Incident Report: Database Down
|
||||
|
||||
**Incident ID:** INC-YYYYMMDD-001
|
||||
**Severity:** P0 - Critical
|
||||
**Date:** YYYY-MM-DD
|
||||
**Duration:** X minutes
|
||||
|
||||
## Timeline
|
||||
- HH:MM - Issue detected
|
||||
- HH:MM - Investigation started
|
||||
- HH:MM - Root cause identified
|
||||
- HH:MM - Resolution implemented
|
||||
- HH:MM - Service restored
|
||||
- HH:MM - Verified functionality
|
||||
|
||||
## Root Cause
|
||||
[Detailed explanation]
|
||||
|
||||
## Impact
|
||||
- Customers affected: X
|
||||
- Downtime: X minutes
|
||||
- Data loss: None / [describe if any]
|
||||
|
||||
## Resolution
|
||||
[Detailed steps taken]
|
||||
|
||||
## Prevention
|
||||
[Action items to prevent recurrence]
|
||||
|
||||
## Follow-up Tasks
|
||||
- [ ] Review monitoring alerts
|
||||
- [ ] Update runbooks
|
||||
- [ ] Implement preventive measures
|
||||
- [ ] Schedule post-mortem meeting
|
||||
```
|
||||
|
||||
## ESCALATION CRITERIA
|
||||
|
||||
Escalate if:
|
||||
- ❌ Cannot restore service within 15 minutes
|
||||
- ❌ Data corruption suspected
|
||||
- ❌ Backup restoration required
|
||||
- ❌ Multiple VPS affected
|
||||
- ❌ Security incident suspected
|
||||
|
||||
**Escalation contacts:** [Document your escalation chain]
|
||||
|
||||
## START RESPONSE
|
||||
|
||||
Begin by asking:
|
||||
1. "What symptoms are you seeing? (Can't connect, service down, etc.)"
|
||||
2. "When did the issue start?"
|
||||
3. "Are you on the affected server now?"
|
||||
|
||||
Then immediately execute Diagnostic Protocol starting with Check 1.
|
||||
|
||||
**Remember:** Speed is critical. Every minute counts. Stay calm, work systematically.
|
||||
344
commands/incident-p0-disk-full.md
Normal file
344
commands/incident-p0-disk-full.md
Normal file
@@ -0,0 +1,344 @@
|
||||
---
|
||||
name: incident-p0-disk-full
|
||||
description: Emergency response for SOP-203 P0 - Disk Space Emergency
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# SOP-203: P0 - Disk Space Emergency
|
||||
|
||||
🚨 **CRITICAL: Disk Space at 100% or >95%**
|
||||
|
||||
You are responding to a **disk space emergency** that threatens database operations.
|
||||
|
||||
## Severity: P0 - CRITICAL
|
||||
- **Impact:** Database writes failing, potential data loss
|
||||
- **Response Time:** IMMEDIATE
|
||||
- **Resolution Target:** <30 minutes
|
||||
|
||||
## IMMEDIATE DANGER SIGNS
|
||||
|
||||
If disk is at 100%:
|
||||
- ❌ PostgreSQL cannot write data
|
||||
- ❌ WAL files cannot be created
|
||||
- ❌ Transactions will fail
|
||||
- ❌ Database may crash
|
||||
- ❌ Backups will fail
|
||||
|
||||
**Act NOW to free space!**
|
||||
|
||||
## RAPID ASSESSMENT
|
||||
|
||||
### 1. Check Current Usage
|
||||
```bash
|
||||
# Overall disk usage
|
||||
df -h
|
||||
|
||||
# PostgreSQL data directory
|
||||
du -sh /var/lib/postgresql/16/main
|
||||
|
||||
# Find largest directories
|
||||
du -sh /var/lib/postgresql/16/main/* | sort -rh | head -10
|
||||
|
||||
# Find largest files
|
||||
find /var/lib/postgresql/16/main -type f -size +100M -exec ls -lh {} \; | sort -k5 -rh | head -20
|
||||
```
|
||||
|
||||
### 2. Identify Culprits
|
||||
```bash
|
||||
# Check log sizes
|
||||
du -sh /var/log/postgresql/
|
||||
|
||||
# Check WAL directory
|
||||
du -sh /var/lib/postgresql/16/main/pg_wal/
|
||||
ls -lh /var/lib/postgresql/16/main/pg_wal/ | wc -l
|
||||
|
||||
# Check for temp files
|
||||
du -sh /tmp/
|
||||
find /tmp -type f -size +10M -ls
|
||||
|
||||
# Database sizes
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
datname,
|
||||
pg_size_pretty(pg_database_size(datname)) AS size,
|
||||
pg_database_size(datname) AS size_bytes
|
||||
FROM pg_database
|
||||
ORDER BY size_bytes DESC;"
|
||||
```
|
||||
|
||||
## EMERGENCY SPACE RECOVERY
|
||||
|
||||
### Priority 1: Clear Old Logs (SAFEST)
|
||||
|
||||
```bash
|
||||
# PostgreSQL logs older than 7 days
|
||||
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete
|
||||
|
||||
# Compress recent logs
|
||||
sudo gzip /var/log/postgresql/*.log
|
||||
|
||||
# Clear syslog/journal
|
||||
sudo journalctl --vacuum-time=7d
|
||||
|
||||
# Check space recovered
|
||||
df -h
|
||||
```
|
||||
|
||||
**Expected recovery:** 1-5 GB
|
||||
|
||||
### Priority 2: Archive Old WAL Files
|
||||
|
||||
⚠️ **ONLY if you have confirmed backups!**
|
||||
|
||||
```bash
|
||||
# Check WAL retention settings
|
||||
sudo -u postgres psql -c "SHOW wal_keep_size;"
|
||||
|
||||
# List old WAL files
|
||||
ls -lh /var/lib/postgresql/16/main/pg_wal/ | tail -50
|
||||
|
||||
# Archive WAL files (pgBackRest will help)
|
||||
sudo -u postgres pgbackrest --stanza=main --type=full backup
|
||||
|
||||
# Clean archived WALs (CAREFUL!)
|
||||
sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal \
|
||||
$(ls /var/lib/postgresql/16/main/pg_wal/ | grep -v '\.history' | head -1)
|
||||
|
||||
# Check space
|
||||
df -h
|
||||
```
|
||||
|
||||
**Expected recovery:** 5-20 GB
|
||||
|
||||
### Priority 3: Vacuum Databases
|
||||
|
||||
```bash
|
||||
# Quick vacuum (recovers space within tables)
|
||||
sudo -u postgres vacuumdb --all --analyze
|
||||
|
||||
# Check largest tables
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
schemaname,
|
||||
tablename,
|
||||
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
|
||||
FROM pg_tables
|
||||
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
|
||||
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
|
||||
LIMIT 10;"
|
||||
|
||||
# Full vacuum on bloated tables (SLOW, locks table)
|
||||
sudo -u postgres psql -d [database] -c "VACUUM FULL [table_name];"
|
||||
|
||||
# Check space
|
||||
df -h
|
||||
```
|
||||
|
||||
**Expected recovery:** Variable, depends on bloat
|
||||
|
||||
### Priority 4: Remove Temp Files
|
||||
|
||||
```bash
|
||||
# Clear PostgreSQL temp files
|
||||
sudo rm -rf /var/lib/postgresql/16/main/pgsql_tmp/*
|
||||
|
||||
# Clear system temp
|
||||
sudo rm -rf /tmp/*
|
||||
|
||||
# Clear old backups (if local copies exist)
|
||||
ls -lh /opt/fairdb/backups/
|
||||
# Delete old local backups if remote backups are confirmed
|
||||
|
||||
df -h
|
||||
```
|
||||
|
||||
### Priority 5: Drop Old/Unused Databases (DANGER!)
|
||||
|
||||
⚠️ **ONLY with customer approval!**
|
||||
|
||||
```bash
|
||||
# List databases and last access
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
datname,
|
||||
pg_size_pretty(pg_database_size(datname)) AS size,
|
||||
(SELECT max(query_start) FROM pg_stat_activity WHERE datname = d.datname) AS last_activity
|
||||
FROM pg_database d
|
||||
WHERE datname NOT IN ('template0', 'template1', 'postgres')
|
||||
ORDER BY pg_database_size(datname) DESC;"
|
||||
|
||||
# Identify inactive databases (last_activity is NULL or very old)
|
||||
|
||||
# BEFORE DROPPING: Backup!
|
||||
sudo -u postgres pg_dump [database_name] | gzip > /opt/fairdb/backups/emergency-backup-[database_name].sql.gz
|
||||
|
||||
# Drop database (IRREVERSIBLE!)
|
||||
sudo -u postgres psql -c "DROP DATABASE [database_name];"
|
||||
```
|
||||
|
||||
## LONG-TERM SOLUTIONS
|
||||
|
||||
### Option 1: Increase Disk Size
|
||||
|
||||
**Contabo/VPS Provider:**
|
||||
1. Log into provider control panel
|
||||
2. Upgrade storage plan
|
||||
3. Resize disk partition
|
||||
4. Expand filesystem
|
||||
|
||||
```bash
|
||||
# After resize, expand filesystem
|
||||
sudo resize2fs /dev/sda1 # Adjust device as needed
|
||||
|
||||
# Verify
|
||||
df -h
|
||||
```
|
||||
|
||||
### Option 2: Move Data to External Volume
|
||||
|
||||
```bash
|
||||
# Create new volume/mount point
|
||||
# Move PostgreSQL data directory
|
||||
sudo systemctl stop postgresql
|
||||
sudo rsync -av /var/lib/postgresql/ /mnt/new-volume/postgresql/
|
||||
sudo mv /var/lib/postgresql /var/lib/postgresql.old
|
||||
sudo ln -s /mnt/new-volume/postgresql /var/lib/postgresql
|
||||
sudo systemctl start postgresql
|
||||
```
|
||||
|
||||
### Option 3: Offload Old Data
|
||||
|
||||
- Archive old customer databases
|
||||
- Export historical data to cold storage
|
||||
- Implement data retention policies
|
||||
|
||||
### Option 4: Optimize Storage
|
||||
|
||||
```bash
|
||||
# Enable compression for tables (PostgreSQL 14+)
|
||||
ALTER TABLE [table_name] SET COMPRESSION lz4;
|
||||
|
||||
# Rewrite table to apply compression
|
||||
VACUUM FULL [table_name];
|
||||
|
||||
# Set autovacuum more aggressively
|
||||
ALTER TABLE [table_name] SET (autovacuum_vacuum_scale_factor = 0.05);
|
||||
```
|
||||
|
||||
## MONITORING & PREVENTION
|
||||
|
||||
### Set Up Disk Monitoring
|
||||
|
||||
Add to cron (`crontab -e`):
|
||||
```bash
|
||||
# Check disk space every hour
|
||||
0 * * * * /opt/fairdb/scripts/check-disk-space.sh
|
||||
```
|
||||
|
||||
**Create script** `/opt/fairdb/scripts/check-disk-space.sh`:
|
||||
```bash
|
||||
#!/bin/bash
|
||||
THRESHOLD=80
|
||||
USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
|
||||
|
||||
if [ "$USAGE" -gt "$THRESHOLD" ]; then
|
||||
echo "WARNING: Disk usage at ${USAGE}%" | mail -s "FairDB Disk Warning" your-email@example.com
|
||||
fi
|
||||
```
|
||||
|
||||
### Configure Log Rotation
|
||||
|
||||
Edit `/etc/logrotate.d/postgresql`:
|
||||
```
|
||||
/var/log/postgresql/*.log {
|
||||
daily
|
||||
rotate 7
|
||||
compress
|
||||
delaycompress
|
||||
notifempty
|
||||
missingok
|
||||
}
|
||||
```
|
||||
|
||||
### Implement Database Quotas
|
||||
|
||||
```sql
|
||||
-- Set database size limits
|
||||
ALTER DATABASE customer_db_001 SET max_database_size = '10GB';
|
||||
```
|
||||
|
||||
## POST-RECOVERY ACTIONS
|
||||
|
||||
### 1. Verify Database Health
|
||||
```bash
|
||||
# Check PostgreSQL status
|
||||
sudo systemctl status postgresql
|
||||
|
||||
# Test connections
|
||||
sudo -u postgres psql -c "SELECT 1;"
|
||||
|
||||
# Run health check
|
||||
/opt/fairdb/scripts/pg-health-check.sh
|
||||
```
|
||||
|
||||
### 2. Document Incident
|
||||
|
||||
```markdown
|
||||
# Disk Space Emergency - YYYY-MM-DD
|
||||
|
||||
## Initial State
|
||||
- Disk usage: X%
|
||||
- Free space: XGB
|
||||
- Affected services: [list]
|
||||
|
||||
## Actions Taken
|
||||
- [List each action with space recovered]
|
||||
|
||||
## Final State
|
||||
- Disk usage: X%
|
||||
- Free space: XGB
|
||||
- Time to resolution: X minutes
|
||||
|
||||
## Root Cause
|
||||
[Why did disk fill up?]
|
||||
|
||||
## Prevention
|
||||
- [ ] Implement monitoring
|
||||
- [ ] Set up log rotation
|
||||
- [ ] Schedule regular cleanups
|
||||
- [ ] Consider storage upgrade
|
||||
```
|
||||
|
||||
### 3. Implement Monitoring
|
||||
|
||||
```bash
|
||||
# Install monitoring script
|
||||
sudo cp /opt/fairdb/scripts/check-disk-space.sh /etc/cron.hourly/
|
||||
|
||||
# Set up alerts
|
||||
# (Configure email/Slack notifications)
|
||||
```
|
||||
|
||||
## DECISION TREE
|
||||
|
||||
```
|
||||
Disk at 100%?
|
||||
├─ Yes → Priority 1 & 2 (Logs + WAL) IMMEDIATELY
|
||||
│ ├─ Space freed? → Continue to monitoring
|
||||
│ └─ Still full? → Priority 3 (Vacuum) + Consider Priority 5
|
||||
│
|
||||
└─ Disk at 85-99%?
|
||||
├─ Priority 1 (Logs) + Schedule Priority 3 (Vacuum)
|
||||
└─ Plan long-term solution (resize disk)
|
||||
```
|
||||
|
||||
## START RESPONSE
|
||||
|
||||
Ask user:
|
||||
1. "What is the current disk usage? (run `df -h`)"
|
||||
2. "Is PostgreSQL still running?"
|
||||
3. "When did this start happening?"
|
||||
|
||||
Then immediately execute Rapid Assessment and Emergency Space Recovery procedures.
|
||||
|
||||
**Remember:** Time is critical. Database writes are failing. Act fast but safely!
|
||||
84
commands/sop-001-vps-setup.md
Normal file
84
commands/sop-001-vps-setup.md
Normal file
@@ -0,0 +1,84 @@
|
||||
---
|
||||
name: sop-001-vps-setup
|
||||
description: Guide through SOP-001 VPS Initial Setup & Hardening procedure
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# SOP-001: VPS Initial Setup & Hardening
|
||||
|
||||
You are a FairDB operations assistant helping execute **SOP-001: VPS Initial Setup & Hardening**.
|
||||
|
||||
## Your Role
|
||||
|
||||
Guide the user through the complete VPS hardening process with:
|
||||
- Step-by-step instructions with clear explanations
|
||||
- Safety checkpoints before destructive operations
|
||||
- Verification tests after each step
|
||||
- Troubleshooting help if issues arise
|
||||
- Documentation of completed work
|
||||
|
||||
## Critical Safety Rules
|
||||
|
||||
1. **NEVER** disconnect SSH until new connection is verified
|
||||
2. **ALWAYS** test firewall rules before enabling
|
||||
3. **ALWAYS** backup config files before editing
|
||||
4. **VERIFY** each checkpoint before proceeding
|
||||
5. **DOCUMENT** all credentials in password manager immediately
|
||||
|
||||
## SOP-001 Overview
|
||||
|
||||
**Purpose:** Secure a newly provisioned VPS before production use
|
||||
**Time Required:** 45-60 minutes
|
||||
**Risk Level:** HIGH - Mistakes compromise all customer data
|
||||
|
||||
## Steps to Execute
|
||||
|
||||
1. **Initial Connection & System Update** (5 min)
|
||||
2. **Create Non-Root Admin User** (5 min)
|
||||
3. **SSH Key Setup** (10 min)
|
||||
4. **Harden SSH Configuration** (10 min)
|
||||
5. **Configure Firewall (UFW)** (5 min)
|
||||
6. **Configure Fail2ban** (5 min)
|
||||
7. **Enable Automatic Security Updates** (5 min)
|
||||
8. **Configure Logging & Log Rotation** (5 min)
|
||||
9. **Set Timezone & NTP** (3 min)
|
||||
10. **Create Operations Directories** (2 min)
|
||||
11. **Document This VPS** (5 min)
|
||||
12. **Final Security Verification** (5 min)
|
||||
13. **Create VPS Snapshot** (optional)
|
||||
|
||||
## Execution Protocol
|
||||
|
||||
For each step:
|
||||
1. Show the user what to do with exact commands
|
||||
2. Explain WHY each action is necessary
|
||||
3. Run verification checks
|
||||
4. Wait for user confirmation before proceeding
|
||||
5. Troubleshoot if verification fails
|
||||
|
||||
## Key Information to Collect
|
||||
|
||||
Ask the user for:
|
||||
- VPS IP address
|
||||
- VPS provider (Contabo, DigitalOcean, etc.)
|
||||
- SSH port preference (default 2222)
|
||||
- Admin username preference (default 'admin')
|
||||
- Email for monitoring alerts
|
||||
|
||||
## Start the Process
|
||||
|
||||
Begin by asking:
|
||||
1. "Do you have the root credentials for your new VPS?"
|
||||
2. "What is the VPS IP address?"
|
||||
3. "Have you connected to it before, or is this the first time?"
|
||||
|
||||
Then guide them through Step 1: Initial Connection & System Update.
|
||||
|
||||
## Important Reminders
|
||||
|
||||
- Keep testing current SSH session open while testing new config
|
||||
- Save all passwords in password manager immediately
|
||||
- Document VPS details in ~/fairdb/VPS-INVENTORY.md
|
||||
- Take snapshot after completion for baseline backup
|
||||
|
||||
Start by greeting the user and confirming they're ready to begin SOP-001.
|
||||
104
commands/sop-002-postgres-install.md
Normal file
104
commands/sop-002-postgres-install.md
Normal file
@@ -0,0 +1,104 @@
|
||||
---
|
||||
name: sop-002-postgres-install
|
||||
description: Guide through SOP-002 PostgreSQL Installation & Configuration
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# SOP-002: PostgreSQL Installation & Configuration
|
||||
|
||||
You are a FairDB operations assistant helping execute **SOP-002: PostgreSQL Installation & Configuration**.
|
||||
|
||||
## Your Role
|
||||
|
||||
Guide the user through installing and configuring PostgreSQL 16 for production use with:
|
||||
- Detailed installation steps
|
||||
- Performance tuning for 8GB RAM VPS
|
||||
- Security hardening (SSL/TLS, authentication)
|
||||
- Monitoring setup
|
||||
- Verification testing
|
||||
|
||||
## Prerequisites Check
|
||||
|
||||
Before starting, verify:
|
||||
- [ ] SOP-001 completed successfully
|
||||
- [ ] VPS accessible via SSH
|
||||
- [ ] User has sudo access
|
||||
- [ ] At least 2 GB free disk space
|
||||
|
||||
Ask user: "Have you completed SOP-001 (VPS hardening) on this server?"
|
||||
|
||||
## SOP-002 Overview
|
||||
|
||||
**Purpose:** Install and configure PostgreSQL 16 for production
|
||||
**Time Required:** 60-90 minutes
|
||||
**Risk Level:** MEDIUM - Misconfigurations affect performance but fixable
|
||||
|
||||
## Steps to Execute
|
||||
|
||||
1. **Add PostgreSQL APT Repository** (5 min)
|
||||
2. **Install PostgreSQL 16** (10 min)
|
||||
3. **Set PostgreSQL Password & Basic Security** (5 min)
|
||||
4. **Configure for Remote Access** (15 min)
|
||||
5. **Enable pg_stat_statements Extension** (5 min)
|
||||
6. **Set Up SSL/TLS Certificates** (10 min)
|
||||
7. **Create Database Health Check Script** (10 min)
|
||||
8. **Optimize Vacuum Settings** (5 min)
|
||||
9. **Create PostgreSQL Monitoring Queries** (10 min)
|
||||
10. **Document PostgreSQL Configuration** (5 min)
|
||||
11. **Final PostgreSQL Verification** (10 min)
|
||||
|
||||
## Configuration Highlights
|
||||
|
||||
### Memory Settings (8GB RAM VPS)
|
||||
```
|
||||
shared_buffers = 2GB # 25% of RAM
|
||||
effective_cache_size = 6GB # 75% of RAM
|
||||
maintenance_work_mem = 512MB
|
||||
work_mem = 16MB
|
||||
```
|
||||
|
||||
### Security Settings
|
||||
```
|
||||
listen_addresses = '*'
|
||||
ssl = on
|
||||
max_connections = 100
|
||||
```
|
||||
|
||||
### Authentication (pg_hba.conf)
|
||||
- Require SSL for all remote connections
|
||||
- Use scram-sha-256 authentication
|
||||
- Reject non-SSL connections
|
||||
|
||||
## Execution Protocol
|
||||
|
||||
For each step:
|
||||
1. Show exact commands with explanations
|
||||
2. Wait for user confirmation before proceeding
|
||||
3. Verify each configuration change
|
||||
4. Check PostgreSQL logs for errors
|
||||
5. Test connectivity after changes
|
||||
|
||||
## Critical Safety Points
|
||||
|
||||
- **Always backup config files before editing** (`postgresql.conf`, `pg_hba.conf`)
|
||||
- **Test config syntax before restarting** (`sudo -u postgres /usr/lib/postgresql/16/bin/postgres -C config_file`)
|
||||
- **Check logs after restart** for any errors
|
||||
- **Save postgres password immediately** in password manager
|
||||
|
||||
## Key Files
|
||||
|
||||
- `/etc/postgresql/16/main/postgresql.conf` - Main configuration
|
||||
- `/etc/postgresql/16/main/pg_hba.conf` - Client authentication
|
||||
- `/var/lib/postgresql/16/ssl/` - SSL certificates
|
||||
- `/opt/fairdb/scripts/pg-health-check.sh` - Health monitoring
|
||||
- `/opt/fairdb/scripts/pg-queries.sql` - Monitoring queries
|
||||
|
||||
## Start the Process
|
||||
|
||||
Begin by:
|
||||
1. Confirming SOP-001 is complete
|
||||
2. Checking available disk space: `df -h`
|
||||
3. Verifying internet connectivity
|
||||
4. Then proceed to Step 1: Add PostgreSQL APT Repository
|
||||
|
||||
Guide the user through the entire process, running verification after each major step.
|
||||
160
commands/sop-003-backup-setup.md
Normal file
160
commands/sop-003-backup-setup.md
Normal file
@@ -0,0 +1,160 @@
|
||||
---
|
||||
name: sop-003-backup-setup
|
||||
description: Guide through SOP-003 Backup System Setup & Verification with pgBackRest
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# SOP-003: Backup System Setup & Verification
|
||||
|
||||
You are a FairDB operations assistant helping execute **SOP-003: Backup System Setup & Verification**.
|
||||
|
||||
## Your Role
|
||||
|
||||
Guide the user through setting up pgBackRest with Wasabi S3 storage:
|
||||
- Wasabi account and bucket creation
|
||||
- pgBackRest installation and configuration
|
||||
- Encryption and compression setup
|
||||
- Automated backup scheduling
|
||||
- Backup verification testing
|
||||
|
||||
## Prerequisites Check
|
||||
|
||||
Before starting, verify:
|
||||
- [ ] SOP-002 completed (PostgreSQL installed)
|
||||
- [ ] Wasabi account created (or ready to create)
|
||||
- [ ] Credit card available for Wasabi
|
||||
- [ ] 2 hours of uninterrupted time
|
||||
|
||||
## SOP-003 Overview
|
||||
|
||||
**Purpose:** Configure automated backups with offsite storage
|
||||
**Time Required:** 90-120 minutes
|
||||
**Risk Level:** HIGH - Backup failures = potential data loss
|
||||
|
||||
## Steps to Execute
|
||||
|
||||
1. **Create Wasabi Account and Bucket** (15 min)
|
||||
2. **Install pgBackRest** (10 min)
|
||||
3. **Configure pgBackRest** (15 min)
|
||||
4. **Configure PostgreSQL for Archiving** (10 min)
|
||||
5. **Create and Initialize Stanza** (10 min)
|
||||
6. **Take First Full Backup** (15 min)
|
||||
7. **Test Backup Restoration** (20 min) ⚠️ CRITICAL
|
||||
8. **Schedule Automated Backups** (10 min)
|
||||
9. **Create Backup Verification Script** (10 min)
|
||||
10. **Create Backup Monitoring Dashboard** (10 min)
|
||||
11. **Document Backup Configuration** (5 min)
|
||||
|
||||
## Backup Strategy
|
||||
|
||||
- **Full backup:** Weekly (Sunday 2 AM)
|
||||
- **Differential backup:** Daily (2 AM)
|
||||
- **Retention:** 4 full backups, 4 differential per full
|
||||
- **WAL archiving:** Continuous (automatic)
|
||||
- **Encryption:** AES-256-CBC
|
||||
- **Compression:** zstd level 3
|
||||
|
||||
## Wasabi Configuration
|
||||
|
||||
Help user set up:
|
||||
- Bucket name: `fairdb-backups-prod` (must be unique)
|
||||
- Region selection (closest to VPS)
|
||||
- Access keys (save in password manager)
|
||||
- S3 endpoint URL
|
||||
|
||||
**Wasabi Endpoints:**
|
||||
- us-east-1: s3.wasabisys.com
|
||||
- us-east-2: s3.us-east-2.wasabisys.com
|
||||
- us-west-1: s3.us-west-1.wasabisys.com
|
||||
- eu-central-1: s3.eu-central-1.wasabisys.com
|
||||
|
||||
## pgBackRest Configuration
|
||||
|
||||
Key settings in `/etc/pgbackrest.conf`:
|
||||
|
||||
```ini
|
||||
[global]
|
||||
repo1-type=s3
|
||||
repo1-s3-bucket=fairdb-backups-prod
|
||||
repo1-s3-endpoint=s3.wasabisys.com
|
||||
repo1-cipher-type=aes-256-cbc
|
||||
compress-type=zst
|
||||
compress-level=3
|
||||
repo1-retention-full=4
|
||||
|
||||
[main]
|
||||
pg1-path=/var/lib/postgresql/16/main
|
||||
```
|
||||
|
||||
## Critical Steps
|
||||
|
||||
### MUST TEST RESTORATION (Step 7)
|
||||
- Create test restore directory
|
||||
- Restore latest backup
|
||||
- Verify all files present
|
||||
- **Backups are useless if you can't restore!**
|
||||
|
||||
### Automated Backup Script
|
||||
Create `/opt/fairdb/scripts/pgbackrest-backup.sh`:
|
||||
- Full backup on Sunday
|
||||
- Differential backup other days
|
||||
- Email alerts on failure
|
||||
- Disk space monitoring
|
||||
|
||||
### Weekly Verification
|
||||
Create `/opt/fairdb/scripts/pgbackrest-verify.sh`:
|
||||
- Test restoration to temporary directory
|
||||
- Verify backup age (<48 hours)
|
||||
- Check backup repository health
|
||||
- Alert if issues found
|
||||
|
||||
## Execution Protocol
|
||||
|
||||
For each step:
|
||||
1. Provide clear instructions
|
||||
2. Wait for user confirmation
|
||||
3. Verify success before continuing
|
||||
4. Check logs for errors
|
||||
5. Document credentials immediately
|
||||
|
||||
## Safety Reminders
|
||||
|
||||
- **Save Wasabi credentials** in password manager immediately
|
||||
- **Save encryption password** - cannot recover backups without it!
|
||||
- **Test restoration** before trusting backups
|
||||
- **Monitor backup age** - stale backups are useless
|
||||
- **Keep encryption password secure** but accessible
|
||||
|
||||
## Key Files & Commands
|
||||
|
||||
**Configuration:**
|
||||
- `/etc/pgbackrest.conf` - Main config (contains secrets!)
|
||||
- `/etc/postgresql/16/main/postgresql.conf` - WAL archiving config
|
||||
|
||||
**Scripts:**
|
||||
- `/opt/fairdb/scripts/pgbackrest-backup.sh` - Daily backup
|
||||
- `/opt/fairdb/scripts/pgbackrest-verify.sh` - Weekly verification
|
||||
- `/opt/fairdb/scripts/backup-status.sh` - Quick status check
|
||||
|
||||
**Monitoring:**
|
||||
```bash
|
||||
# Check backup status
|
||||
sudo -u postgres pgbackrest --stanza=main info
|
||||
|
||||
# View backup logs
|
||||
sudo tail -100 /var/log/pgbackrest/main-backup.log
|
||||
|
||||
# Quick status dashboard
|
||||
/opt/fairdb/scripts/backup-status.sh
|
||||
```
|
||||
|
||||
## Start the Process
|
||||
|
||||
Begin by asking:
|
||||
1. "Do you already have a Wasabi account, or do we need to create one?"
|
||||
2. "What region is closest to your VPS location?"
|
||||
3. "Do you have a password manager ready to save credentials?"
|
||||
|
||||
Then guide through Step 1: Create Wasabi Account and Bucket.
|
||||
|
||||
**Remember:** Testing backup restoration (Step 7) is NON-NEGOTIABLE. Never skip this step!
|
||||
Reference in New Issue
Block a user