Initial commit
This commit is contained in:
480
commands/fairdb-emergency-response.md
Normal file
480
commands/fairdb-emergency-response.md
Normal file
@@ -0,0 +1,480 @@
|
||||
---
|
||||
name: fairdb-emergency-response
|
||||
description: Emergency incident response procedures for critical FairDB issues
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# FairDB Emergency Incident Response
|
||||
|
||||
You are responding to a critical incident in the FairDB PostgreSQL infrastructure. Follow this structured approach to diagnose, contain, and resolve the issue.
|
||||
|
||||
## Incident Classification
|
||||
|
||||
First, identify the incident type:
|
||||
- **P1 Critical**: Complete service outage, data loss risk
|
||||
- **P2 High**: Major degradation, affecting multiple customers
|
||||
- **P3 Medium**: Single customer impact, performance issues
|
||||
- **P4 Low**: Minor issues, cosmetic problems
|
||||
|
||||
## Initial Assessment (First 5 Minutes)
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# FairDB Emergency Response Script
|
||||
|
||||
echo "================================================"
|
||||
echo " FAIRDB EMERGENCY INCIDENT RESPONSE"
|
||||
echo " Started: $(date '+%Y-%m-%d %H:%M:%S')"
|
||||
echo "================================================"
|
||||
|
||||
# Create incident log
|
||||
INCIDENT_ID="INC-$(date +%Y%m%d-%H%M%S)"
|
||||
INCIDENT_LOG="/opt/fairdb/incidents/${INCIDENT_ID}.log"
|
||||
mkdir -p /opt/fairdb/incidents
|
||||
|
||||
{
|
||||
echo "Incident ID: $INCIDENT_ID"
|
||||
echo "Response started: $(date)"
|
||||
echo "Responding user: $(whoami)"
|
||||
echo "========================================"
|
||||
} | tee $INCIDENT_LOG
|
||||
```
|
||||
|
||||
## Step 1: Service Status Check
|
||||
|
||||
```bash
|
||||
echo -e "\n[STEP 1] SERVICE STATUS CHECK" | tee -a $INCIDENT_LOG
|
||||
echo "------------------------------" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Check PostgreSQL service
|
||||
if systemctl is-active --quiet postgresql; then
|
||||
echo "✅ PostgreSQL: RUNNING" | tee -a $INCIDENT_LOG
|
||||
else
|
||||
echo "❌ CRITICAL: PostgreSQL is DOWN" | tee -a $INCIDENT_LOG
|
||||
echo "Attempting emergency restart..." | tee -a $INCIDENT_LOG
|
||||
|
||||
# Try to start the service
|
||||
sudo systemctl start postgresql 2>&1 | tee -a $INCIDENT_LOG
|
||||
|
||||
sleep 5
|
||||
|
||||
if systemctl is-active --quiet postgresql; then
|
||||
echo "✅ PostgreSQL restarted successfully" | tee -a $INCIDENT_LOG
|
||||
else
|
||||
echo "❌ FAILED to restart PostgreSQL" | tee -a $INCIDENT_LOG
|
||||
echo "Checking for port conflicts..." | tee -a $INCIDENT_LOG
|
||||
sudo netstat -tulpn | grep :5432 | tee -a $INCIDENT_LOG
|
||||
|
||||
# Check for corruption
|
||||
echo "Checking for data corruption..." | tee -a $INCIDENT_LOG
|
||||
sudo -u postgres /usr/lib/postgresql/16/bin/postgres -D /var/lib/postgresql/16/main -C data_directory 2>&1 | tee -a $INCIDENT_LOG
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check disk space
|
||||
echo -e "\nDisk Space:" | tee -a $INCIDENT_LOG
|
||||
df -h | grep -E "^/dev|^Filesystem" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Check for full disks
|
||||
FULL_DISKS=$(df -h | grep -E "100%|9[5-9]%" | wc -l)
|
||||
if [ $FULL_DISKS -gt 0 ]; then
|
||||
echo "⚠️ CRITICAL: Disk space exhausted!" | tee -a $INCIDENT_LOG
|
||||
echo "Emergency cleanup required..." | tee -a $INCIDENT_LOG
|
||||
|
||||
# Emergency log cleanup
|
||||
find /var/log/postgresql -name "*.log" -mtime +7 -delete 2>/dev/null
|
||||
find /opt/fairdb/logs -name "*.log" -mtime +7 -delete 2>/dev/null
|
||||
|
||||
echo "Old logs cleared. New disk usage:" | tee -a $INCIDENT_LOG
|
||||
df -h | grep -E "^/dev" | tee -a $INCIDENT_LOG
|
||||
fi
|
||||
```
|
||||
|
||||
## Step 2: Connection Diagnostics
|
||||
|
||||
```bash
|
||||
echo -e "\n[STEP 2] CONNECTION DIAGNOSTICS" | tee -a $INCIDENT_LOG
|
||||
echo "--------------------------------" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Test local connection
|
||||
echo "Testing local connection..." | tee -a $INCIDENT_LOG
|
||||
if sudo -u postgres psql -c "SELECT 1;" > /dev/null 2>&1; then
|
||||
echo "✅ Local connections: OK" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Get connection stats
|
||||
sudo -u postgres psql -t -c "
|
||||
SELECT 'Active connections: ' || count(*)
|
||||
FROM pg_stat_activity
|
||||
WHERE state != 'idle';" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Check for connection exhaustion
|
||||
MAX_CONN=$(sudo -u postgres psql -t -c "SHOW max_connections;")
|
||||
CURRENT_CONN=$(sudo -u postgres psql -t -c "SELECT count(*) FROM pg_stat_activity;")
|
||||
|
||||
echo "Connections: $CURRENT_CONN / $MAX_CONN" | tee -a $INCIDENT_LOG
|
||||
|
||||
if [ $CURRENT_CONN -gt $(( MAX_CONN * 90 / 100 )) ]; then
|
||||
echo "⚠️ WARNING: Connection pool nearly exhausted" | tee -a $INCIDENT_LOG
|
||||
echo "Terminating idle connections..." | tee -a $INCIDENT_LOG
|
||||
|
||||
# Kill idle connections older than 10 minutes
|
||||
sudo -u postgres psql << 'EOF' | tee -a $INCIDENT_LOG
|
||||
SELECT pg_terminate_backend(pid)
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'idle'
|
||||
AND state_change < NOW() - INTERVAL '10 minutes'
|
||||
AND pid != pg_backend_pid();
|
||||
EOF
|
||||
fi
|
||||
else
|
||||
echo "❌ CRITICAL: Cannot connect to PostgreSQL" | tee -a $INCIDENT_LOG
|
||||
echo "Checking PostgreSQL logs..." | tee -a $INCIDENT_LOG
|
||||
tail -50 /var/log/postgresql/postgresql-*.log | tee -a $INCIDENT_LOG
|
||||
fi
|
||||
|
||||
# Check network connectivity
|
||||
echo -e "\nNetwork status:" | tee -a $INCIDENT_LOG
|
||||
ip addr show | grep "inet " | tee -a $INCIDENT_LOG
|
||||
```
|
||||
|
||||
## Step 3: Performance Emergency Response
|
||||
|
||||
```bash
|
||||
echo -e "\n[STEP 3] PERFORMANCE TRIAGE" | tee -a $INCIDENT_LOG
|
||||
echo "----------------------------" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Find and kill long-running queries
|
||||
echo "Checking for blocked/long queries..." | tee -a $INCIDENT_LOG
|
||||
|
||||
sudo -u postgres psql << 'EOF' | tee -a $INCIDENT_LOG
|
||||
-- Queries running longer than 5 minutes
|
||||
SELECT
|
||||
pid,
|
||||
now() - query_start as duration,
|
||||
state,
|
||||
LEFT(query, 100) as query_preview
|
||||
FROM pg_stat_activity
|
||||
WHERE state != 'idle'
|
||||
AND now() - query_start > interval '5 minutes'
|
||||
ORDER BY duration DESC;
|
||||
|
||||
-- Kill queries running longer than 30 minutes
|
||||
SELECT pg_cancel_backend(pid)
|
||||
FROM pg_stat_activity
|
||||
WHERE state != 'idle'
|
||||
AND now() - query_start > interval '30 minutes'
|
||||
AND pid != pg_backend_pid();
|
||||
EOF
|
||||
|
||||
# Check for locks
|
||||
echo -e "\nChecking for lock conflicts..." | tee -a $INCIDENT_LOG
|
||||
sudo -u postgres psql << 'EOF' | tee -a $INCIDENT_LOG
|
||||
SELECT
|
||||
blocked_locks.pid AS blocked_pid,
|
||||
blocked_activity.usename AS blocked_user,
|
||||
blocking_locks.pid AS blocking_pid,
|
||||
blocking_activity.usename AS blocking_user,
|
||||
blocked_activity.query AS blocked_statement,
|
||||
blocking_activity.query AS blocking_statement
|
||||
FROM pg_catalog.pg_locks blocked_locks
|
||||
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
|
||||
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
|
||||
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
|
||||
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
|
||||
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
|
||||
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
|
||||
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
|
||||
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
|
||||
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
|
||||
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
|
||||
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
|
||||
AND blocking_locks.pid != blocked_locks.pid
|
||||
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
|
||||
WHERE NOT blocked_locks.GRANTED;
|
||||
EOF
|
||||
```
|
||||
|
||||
## Step 4: Data Integrity Check
|
||||
|
||||
```bash
|
||||
echo -e "\n[STEP 4] DATA INTEGRITY CHECK" | tee -a $INCIDENT_LOG
|
||||
echo "------------------------------" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Check for corruption indicators
|
||||
echo "Checking for corruption indicators..." | tee -a $INCIDENT_LOG
|
||||
|
||||
# Check PostgreSQL data directory
|
||||
DATA_DIR="/var/lib/postgresql/16/main"
|
||||
if [ -d "$DATA_DIR" ]; then
|
||||
echo "Data directory exists: $DATA_DIR" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Check for recovery in progress
|
||||
if [ -f "$DATA_DIR/recovery.signal" ]; then
|
||||
echo "⚠️ Recovery in progress!" | tee -a $INCIDENT_LOG
|
||||
fi
|
||||
|
||||
# Check WAL status
|
||||
WAL_COUNT=$(ls -1 $DATA_DIR/pg_wal/*.partial 2>/dev/null | wc -l)
|
||||
if [ $WAL_COUNT -gt 0 ]; then
|
||||
echo "⚠️ Partial WAL files detected: $WAL_COUNT" | tee -a $INCIDENT_LOG
|
||||
fi
|
||||
else
|
||||
echo "❌ CRITICAL: Data directory not found!" | tee -a $INCIDENT_LOG
|
||||
fi
|
||||
|
||||
# Run basic integrity check
|
||||
echo -e "\nRunning integrity checks..." | tee -a $INCIDENT_LOG
|
||||
for DB in $(sudo -u postgres psql -t -c "SELECT datname FROM pg_database WHERE datistemplate = false;"); do
|
||||
echo "Checking database: $DB" | tee -a $INCIDENT_LOG
|
||||
sudo -u postgres psql -d $DB -c "SELECT 1;" > /dev/null 2>&1
|
||||
if [ $? -eq 0 ]; then
|
||||
echo " ✅ Database $DB is accessible" | tee -a $INCIDENT_LOG
|
||||
else
|
||||
echo " ❌ Database $DB has issues!" | tee -a $INCIDENT_LOG
|
||||
fi
|
||||
done
|
||||
```
|
||||
|
||||
## Step 5: Emergency Recovery Actions
|
||||
|
||||
```bash
|
||||
echo -e "\n[STEP 5] RECOVERY ACTIONS" | tee -a $INCIDENT_LOG
|
||||
echo "-------------------------" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Determine if recovery is needed
|
||||
read -p "Do you need to initiate emergency recovery? (yes/no): " NEED_RECOVERY
|
||||
|
||||
if [ "$NEED_RECOVERY" = "yes" ]; then
|
||||
echo "Starting emergency recovery procedures..." | tee -a $INCIDENT_LOG
|
||||
|
||||
# Option 1: Restart in single-user mode for repairs
|
||||
echo "Option 1: Single-user mode repair" | tee -a $INCIDENT_LOG
|
||||
echo "Command: sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D $DATA_DIR" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Option 2: Restore from backup
|
||||
echo "Option 2: Restore from backup" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Check available backups
|
||||
if command -v pgbackrest &> /dev/null; then
|
||||
echo "Available backups:" | tee -a $INCIDENT_LOG
|
||||
sudo -u postgres pgbackrest --stanza=fairdb info 2>&1 | tee -a $INCIDENT_LOG
|
||||
fi
|
||||
|
||||
# Option 3: Point-in-time recovery
|
||||
echo "Option 3: Point-in-time recovery" | tee -a $INCIDENT_LOG
|
||||
echo "Use: /opt/fairdb/scripts/restore-pitr.sh 'YYYY-MM-DD HH:MM:SS'" | tee -a $INCIDENT_LOG
|
||||
|
||||
read -p "Select recovery option (1/2/3/none): " RECOVERY_OPTION
|
||||
|
||||
case $RECOVERY_OPTION in
|
||||
1)
|
||||
echo "Starting single-user mode..." | tee -a $INCIDENT_LOG
|
||||
sudo systemctl stop postgresql
|
||||
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D $DATA_DIR
|
||||
;;
|
||||
2)
|
||||
echo "Starting backup restore..." | tee -a $INCIDENT_LOG
|
||||
read -p "Enter backup label to restore: " BACKUP_LABEL
|
||||
sudo systemctl stop postgresql
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --set=$BACKUP_LABEL restore
|
||||
sudo systemctl start postgresql
|
||||
;;
|
||||
3)
|
||||
echo "Starting PITR..." | tee -a $INCIDENT_LOG
|
||||
read -p "Enter target time (YYYY-MM-DD HH:MM:SS): " TARGET_TIME
|
||||
/opt/fairdb/scripts/restore-pitr.sh "$TARGET_TIME"
|
||||
;;
|
||||
*)
|
||||
echo "No recovery action taken" | tee -a $INCIDENT_LOG
|
||||
;;
|
||||
esac
|
||||
fi
|
||||
```
|
||||
|
||||
## Step 6: Customer Communication
|
||||
|
||||
```bash
|
||||
echo -e "\n[STEP 6] CUSTOMER IMPACT ASSESSMENT" | tee -a $INCIDENT_LOG
|
||||
echo "------------------------------------" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Identify affected customers
|
||||
echo "Affected customer databases:" | tee -a $INCIDENT_LOG
|
||||
|
||||
AFFECTED_DBS=$(sudo -u postgres psql -t -c "
|
||||
SELECT datname FROM pg_database
|
||||
WHERE datname NOT IN ('postgres', 'template0', 'template1')
|
||||
ORDER BY datname;")
|
||||
|
||||
for DB in $AFFECTED_DBS; do
|
||||
# Check if database is accessible
|
||||
if sudo -u postgres psql -d $DB -c "SELECT 1;" > /dev/null 2>&1; then
|
||||
echo " ✅ $DB - Operational" | tee -a $INCIDENT_LOG
|
||||
else
|
||||
echo " ❌ $DB - IMPACTED" | tee -a $INCIDENT_LOG
|
||||
fi
|
||||
done
|
||||
|
||||
# Generate customer notification
|
||||
cat << EOF | tee -a $INCIDENT_LOG
|
||||
|
||||
CUSTOMER NOTIFICATION TEMPLATE
|
||||
===============================
|
||||
Subject: FairDB Service Incident - $INCIDENT_ID
|
||||
|
||||
Dear Customer,
|
||||
|
||||
We are currently experiencing a service incident affecting FairDB PostgreSQL services.
|
||||
|
||||
Incident ID: $INCIDENT_ID
|
||||
Start Time: $(date)
|
||||
Severity: [P1/P2/P3/P4]
|
||||
Status: Investigating / Identified / Monitoring / Resolved
|
||||
|
||||
Impact:
|
||||
[Describe customer impact]
|
||||
|
||||
Current Actions:
|
||||
[List recovery actions being taken]
|
||||
|
||||
Next Update:
|
||||
We will provide an update within 30 minutes or sooner if the situation changes.
|
||||
|
||||
We apologize for any inconvenience and are working to resolve this as quickly as possible.
|
||||
|
||||
For urgent matters, please contact our emergency hotline: [PHONE]
|
||||
|
||||
Regards,
|
||||
FairDB Operations Team
|
||||
EOF
|
||||
```
|
||||
|
||||
## Step 7: Post-Incident Checklist
|
||||
|
||||
```bash
|
||||
echo -e "\n[STEP 7] STABILIZATION CHECKLIST" | tee -a $INCIDENT_LOG
|
||||
echo "---------------------------------" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Verification checklist
|
||||
cat << 'EOF' | tee -a $INCIDENT_LOG
|
||||
Post-Recovery Verification:
|
||||
[ ] PostgreSQL service running
|
||||
[ ] All customer databases accessible
|
||||
[ ] Backup system operational
|
||||
[ ] Monitoring alerts cleared
|
||||
[ ] Network connectivity verified
|
||||
[ ] Disk space adequate (>20% free)
|
||||
[ ] CPU usage normal (<80%)
|
||||
[ ] Memory usage normal (<90%)
|
||||
[ ] No blocking locks
|
||||
[ ] No long-running queries
|
||||
[ ] Recent backup available
|
||||
[ ] Customer access verified
|
||||
[ ] Incident documented
|
||||
[ ] Root cause identified
|
||||
[ ] Prevention plan created
|
||||
EOF
|
||||
|
||||
# Final status
|
||||
echo -e "\n[FINAL STATUS]" | tee -a $INCIDENT_LOG
|
||||
echo "==============" | tee -a $INCIDENT_LOG
|
||||
/usr/local/bin/fairdb-health-check | head -20 | tee -a $INCIDENT_LOG
|
||||
```
|
||||
|
||||
## Step 8: Root Cause Analysis
|
||||
|
||||
```bash
|
||||
echo -e "\n[STEP 8] ROOT CAUSE ANALYSIS" | tee -a $INCIDENT_LOG
|
||||
echo "-----------------------------" | tee -a $INCIDENT_LOG
|
||||
|
||||
# Collect evidence
|
||||
echo "Collecting evidence for RCA..." | tee -a $INCIDENT_LOG
|
||||
|
||||
# System logs
|
||||
echo -e "\nSystem logs (last hour):" | tee -a $INCIDENT_LOG
|
||||
sudo journalctl --since "1 hour ago" -p err --no-pager | tail -20 | tee -a $INCIDENT_LOG
|
||||
|
||||
# PostgreSQL logs
|
||||
echo -e "\nPostgreSQL error logs:" | tee -a $INCIDENT_LOG
|
||||
find /var/log/postgresql -name "*.log" -mmin -60 -exec grep -i "error\|fatal\|panic" {} \; | tail -20 | tee -a $INCIDENT_LOG
|
||||
|
||||
# Resource history
|
||||
echo -e "\nResource usage history:" | tee -a $INCIDENT_LOG
|
||||
sar -u -f /var/log/sysstat/sa$(date +%d) | tail -10 | tee -a $INCIDENT_LOG 2>/dev/null
|
||||
|
||||
# Create RCA document
|
||||
cat << EOF | tee /opt/fairdb/incidents/${INCIDENT_ID}-rca.md
|
||||
# Root Cause Analysis - $INCIDENT_ID
|
||||
|
||||
## Incident Summary
|
||||
- **Date/Time**: $(date)
|
||||
- **Duration**: [TO BE FILLED]
|
||||
- **Severity**: [P1/P2/P3/P4]
|
||||
- **Impact**: [Number of customers/databases affected]
|
||||
|
||||
## Timeline
|
||||
[Document sequence of events]
|
||||
|
||||
## Root Cause
|
||||
[Identify primary cause]
|
||||
|
||||
## Contributing Factors
|
||||
[List any contributing factors]
|
||||
|
||||
## Resolution
|
||||
[Describe how the incident was resolved]
|
||||
|
||||
## Lessons Learned
|
||||
[What was learned from this incident]
|
||||
|
||||
## Action Items
|
||||
[ ] [Prevention measure 1]
|
||||
[ ] [Prevention measure 2]
|
||||
[ ] [Monitoring improvement]
|
||||
|
||||
## Metrics
|
||||
- Time to Detection: [minutes]
|
||||
- Time to Resolution: [minutes]
|
||||
- Customer Impact Duration: [minutes]
|
||||
|
||||
Generated: $(date)
|
||||
EOF
|
||||
|
||||
echo -e "\n================================================" | tee -a $INCIDENT_LOG
|
||||
echo " INCIDENT RESPONSE COMPLETED" | tee -a $INCIDENT_LOG
|
||||
echo " Incident ID: $INCIDENT_ID" | tee -a $INCIDENT_LOG
|
||||
echo " Log saved to: $INCIDENT_LOG" | tee -a $INCIDENT_LOG
|
||||
echo " RCA template: /opt/fairdb/incidents/${INCIDENT_ID}-rca.md" | tee -a $INCIDENT_LOG
|
||||
echo "================================================" | tee -a $INCIDENT_LOG
|
||||
```
|
||||
|
||||
## Emergency Contacts
|
||||
|
||||
Keep these contacts readily available:
|
||||
- PostgreSQL Expert: [Contact info]
|
||||
- Infrastructure Team: [Contact info]
|
||||
- Customer Success: [Contact info]
|
||||
- Management Escalation: [Contact info]
|
||||
|
||||
## Quick Reference Commands
|
||||
|
||||
```bash
|
||||
# Emergency service control
|
||||
sudo systemctl stop postgresql
|
||||
sudo systemctl start postgresql
|
||||
sudo systemctl restart postgresql
|
||||
|
||||
# Kill all connections
|
||||
sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid != pg_backend_pid();"
|
||||
|
||||
# Emergency single-user mode
|
||||
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D /var/lib/postgresql/16/main
|
||||
|
||||
# Force checkpoint
|
||||
sudo -u postgres psql -c "CHECKPOINT;"
|
||||
|
||||
# Emergency vacuum
|
||||
sudo -u postgres vacuumdb --all --analyze-in-stages
|
||||
|
||||
# Check data checksums
|
||||
sudo -u postgres /usr/lib/postgresql/16/bin/pg_checksums -D /var/lib/postgresql/16/main --check
|
||||
```
|
||||
459
commands/fairdb-health-check.md
Normal file
459
commands/fairdb-health-check.md
Normal file
@@ -0,0 +1,459 @@
|
||||
---
|
||||
name: fairdb-health-check
|
||||
description: Comprehensive health check for FairDB PostgreSQL infrastructure
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# FairDB System Health Check
|
||||
|
||||
Perform a comprehensive health check of the FairDB PostgreSQL infrastructure including server resources, database status, backup integrity, and customer databases.
|
||||
|
||||
## System Health Overview
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# FairDB Comprehensive Health Check
|
||||
|
||||
echo "================================================"
|
||||
echo " FairDB System Health Check"
|
||||
echo " $(date '+%Y-%m-%d %H:%M:%S')"
|
||||
echo "================================================"
|
||||
```
|
||||
|
||||
## Step 1: Server Resources Check
|
||||
|
||||
```bash
|
||||
echo -e "\n[1/10] SERVER RESOURCES"
|
||||
echo "------------------------"
|
||||
|
||||
# CPU Usage
|
||||
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
|
||||
echo "CPU Usage: ${CPU_USAGE}%"
|
||||
if (( $(echo "$CPU_USAGE > 80" | bc -l) )); then
|
||||
echo "⚠️ WARNING: High CPU usage detected"
|
||||
fi
|
||||
|
||||
# Memory Usage
|
||||
MEM_INFO=$(free -m | awk 'NR==2{printf "Memory: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }')
|
||||
echo "$MEM_INFO"
|
||||
MEM_PERCENT=$(free | grep Mem | awk '{print $3/$2 * 100.0}')
|
||||
if (( $(echo "$MEM_PERCENT > 90" | bc -l) )); then
|
||||
echo "⚠️ WARNING: High memory usage detected"
|
||||
fi
|
||||
|
||||
# Disk Usage
|
||||
echo "Disk Usage:"
|
||||
df -h | grep -E '^/dev/' | while read line; do
|
||||
USAGE=$(echo $line | awk '{print $5}' | sed 's/%//')
|
||||
MOUNT=$(echo $line | awk '{print $6}')
|
||||
echo " $MOUNT: $line"
|
||||
if [ $USAGE -gt 85 ]; then
|
||||
echo " ⚠️ WARNING: Disk space critical on $MOUNT"
|
||||
fi
|
||||
done
|
||||
|
||||
# Load Average
|
||||
LOAD=$(uptime | awk -F'load average:' '{print $2}')
|
||||
echo "Load Average:$LOAD"
|
||||
CORES=$(nproc)
|
||||
LOAD_1=$(echo $LOAD | cut -d, -f1 | tr -d ' ')
|
||||
if (( $(echo "$LOAD_1 > $CORES" | bc -l) )); then
|
||||
echo "⚠️ WARNING: High load average detected"
|
||||
fi
|
||||
```
|
||||
|
||||
## Step 2: PostgreSQL Service Status
|
||||
|
||||
```bash
|
||||
echo -e "\n[2/10] POSTGRESQL SERVICE"
|
||||
echo "-------------------------"
|
||||
|
||||
# Check if PostgreSQL is running
|
||||
if systemctl is-active --quiet postgresql; then
|
||||
echo "✅ PostgreSQL service: RUNNING"
|
||||
|
||||
# Get version and uptime
|
||||
sudo -u postgres psql -t -c "SELECT version();" | head -1
|
||||
|
||||
UPTIME=$(sudo -u postgres psql -t -c "
|
||||
SELECT now() - pg_postmaster_start_time() as uptime;")
|
||||
echo "Uptime: $UPTIME"
|
||||
else
|
||||
echo "❌ CRITICAL: PostgreSQL service is NOT running!"
|
||||
echo "Attempting to start..."
|
||||
sudo systemctl start postgresql
|
||||
sleep 5
|
||||
if systemctl is-active --quiet postgresql; then
|
||||
echo "✅ Service restarted successfully"
|
||||
else
|
||||
echo "❌ Failed to start PostgreSQL - manual intervention required!"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check PostgreSQL cluster status
|
||||
sudo pg_lsclusters
|
||||
```
|
||||
|
||||
## Step 3: Database Connections
|
||||
|
||||
```bash
|
||||
echo -e "\n[3/10] DATABASE CONNECTIONS"
|
||||
echo "---------------------------"
|
||||
|
||||
# Connection statistics
|
||||
sudo -u postgres psql -t << EOF
|
||||
SELECT
|
||||
'Total Connections: ' || count(*) || '/' || setting AS connection_info
|
||||
FROM pg_stat_activity, pg_settings
|
||||
WHERE pg_settings.name = 'max_connections'
|
||||
GROUP BY setting;
|
||||
EOF
|
||||
|
||||
# Connections by database
|
||||
echo -e "\nConnections by database:"
|
||||
sudo -u postgres psql -t -c "
|
||||
SELECT datname, count(*) as connections
|
||||
FROM pg_stat_activity
|
||||
GROUP BY datname
|
||||
ORDER BY connections DESC;"
|
||||
|
||||
# Connections by user
|
||||
echo -e "\nConnections by user:"
|
||||
sudo -u postgres psql -t -c "
|
||||
SELECT usename, count(*) as connections
|
||||
FROM pg_stat_activity
|
||||
GROUP BY usename
|
||||
ORDER BY connections DESC;"
|
||||
|
||||
# Check for idle connections
|
||||
IDLE_COUNT=$(sudo -u postgres psql -t -c "
|
||||
SELECT count(*)
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'idle'
|
||||
AND state_change < NOW() - INTERVAL '10 minutes';")
|
||||
|
||||
if [ $IDLE_COUNT -gt 10 ]; then
|
||||
echo "⚠️ WARNING: $IDLE_COUNT idle connections older than 10 minutes"
|
||||
fi
|
||||
```
|
||||
|
||||
## Step 4: Database Performance Metrics
|
||||
|
||||
```bash
|
||||
echo -e "\n[4/10] PERFORMANCE METRICS"
|
||||
echo "--------------------------"
|
||||
|
||||
# Cache hit ratio
|
||||
sudo -u postgres psql -t << 'EOF'
|
||||
SELECT
|
||||
'Cache Hit Ratio: ' ||
|
||||
ROUND(100.0 * sum(heap_blks_hit) /
|
||||
NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2) || '%'
|
||||
FROM pg_statio_user_tables;
|
||||
EOF
|
||||
|
||||
# Transaction statistics
|
||||
sudo -u postgres psql -t -c "
|
||||
SELECT
|
||||
'Transactions: ' || xact_commit || ' commits, ' ||
|
||||
xact_rollback || ' rollbacks, ' ||
|
||||
ROUND(100.0 * xact_rollback / NULLIF(xact_commit + xact_rollback, 0), 2) || '% rollback rate'
|
||||
FROM pg_stat_database
|
||||
WHERE datname = 'postgres';"
|
||||
|
||||
# Longest running queries
|
||||
echo -e "\nLong-running queries (>1 minute):"
|
||||
sudo -u postgres psql -t -c "
|
||||
SELECT pid, now() - query_start as duration,
|
||||
LEFT(query, 50) as query_preview
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active'
|
||||
AND now() - query_start > interval '1 minute'
|
||||
ORDER BY duration DESC
|
||||
LIMIT 5;"
|
||||
|
||||
# Table bloat check
|
||||
echo -e "\nTable bloat (top 5):"
|
||||
sudo -u postgres psql -t << 'EOF'
|
||||
SELECT
|
||||
schemaname || '.' || tablename AS table,
|
||||
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
|
||||
ROUND(100 * pg_total_relation_size(schemaname||'.'||tablename) /
|
||||
NULLIF(sum(pg_total_relation_size(schemaname||'.'||tablename))
|
||||
OVER (), 0), 2) AS percentage
|
||||
FROM pg_tables
|
||||
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
|
||||
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
|
||||
LIMIT 5;
|
||||
EOF
|
||||
```
|
||||
|
||||
## Step 5: Backup Status
|
||||
|
||||
```bash
|
||||
echo -e "\n[5/10] BACKUP STATUS"
|
||||
echo "--------------------"
|
||||
|
||||
# Check pgBackRest status
|
||||
if command -v pgbackrest &> /dev/null; then
|
||||
echo "pgBackRest Status:"
|
||||
|
||||
# Get all stanzas
|
||||
STANZAS=$(sudo -u postgres pgbackrest info --output=json 2>/dev/null | jq -r '.[].name' 2>/dev/null)
|
||||
|
||||
if [ -z "$STANZAS" ]; then
|
||||
echo "⚠️ WARNING: No backup stanzas configured"
|
||||
else
|
||||
for STANZA in $STANZAS; do
|
||||
echo -e "\nStanza: $STANZA"
|
||||
|
||||
# Get last backup info
|
||||
LAST_BACKUP=$(sudo -u postgres pgbackrest --stanza=$STANZA info --output=json 2>/dev/null | \
|
||||
jq -r '.[] | select(.name=="'$STANZA'") | .backup[-1].timestamp.stop' 2>/dev/null)
|
||||
|
||||
if [ ! -z "$LAST_BACKUP" ]; then
|
||||
echo " Last backup: $LAST_BACKUP"
|
||||
|
||||
# Calculate age in hours
|
||||
BACKUP_AGE=$(( ($(date +%s) - $(date -d "$LAST_BACKUP" +%s)) / 3600 ))
|
||||
|
||||
if [ $BACKUP_AGE -gt 25 ]; then
|
||||
echo " ⚠️ WARNING: Last backup is $BACKUP_AGE hours old"
|
||||
else
|
||||
echo " ✅ Backup is current ($BACKUP_AGE hours old)"
|
||||
fi
|
||||
else
|
||||
echo " ❌ ERROR: No backups found for this stanza"
|
||||
fi
|
||||
done
|
||||
fi
|
||||
else
|
||||
echo "❌ ERROR: pgBackRest is not installed"
|
||||
fi
|
||||
|
||||
# Check WAL archiving
|
||||
WAL_STATUS=$(sudo -u postgres psql -t -c "SHOW archive_mode;")
|
||||
echo -e "\nWAL Archiving: $WAL_STATUS"
|
||||
|
||||
if [ "$WAL_STATUS" = " on" ]; then
|
||||
LAST_ARCHIVED=$(sudo -u postgres psql -t -c "
|
||||
SELECT age(now(), last_archived_time)
|
||||
FROM pg_stat_archiver;")
|
||||
echo "Last WAL archived: $LAST_ARCHIVED ago"
|
||||
fi
|
||||
```
|
||||
|
||||
## Step 6: Replication Status
|
||||
|
||||
```bash
|
||||
echo -e "\n[6/10] REPLICATION STATUS"
|
||||
echo "-------------------------"
|
||||
|
||||
# Check if this is a primary or replica
|
||||
IS_PRIMARY=$(sudo -u postgres psql -t -c "SELECT pg_is_in_recovery();")
|
||||
|
||||
if [ "$IS_PRIMARY" = " f" ]; then
|
||||
echo "Role: PRIMARY"
|
||||
|
||||
# Check replication slots
|
||||
REP_SLOTS=$(sudo -u postgres psql -t -c "
|
||||
SELECT count(*) FROM pg_replication_slots WHERE active = true;")
|
||||
echo "Active replication slots: $REP_SLOTS"
|
||||
|
||||
# Check connected replicas
|
||||
sudo -u postgres psql -t -c "
|
||||
SELECT client_addr, state, sync_state,
|
||||
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) as lag
|
||||
FROM pg_stat_replication;" 2>/dev/null
|
||||
else
|
||||
echo "Role: REPLICA"
|
||||
|
||||
# Check replication lag
|
||||
LAG=$(sudo -u postgres psql -t -c "
|
||||
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag;")
|
||||
echo "Replication lag: ${LAG} seconds"
|
||||
|
||||
if (( $(echo "$LAG > 60" | bc -l) )); then
|
||||
echo "⚠️ WARNING: High replication lag detected"
|
||||
fi
|
||||
fi
|
||||
```
|
||||
|
||||
## Step 7: Security Audit
|
||||
|
||||
```bash
|
||||
echo -e "\n[7/10] SECURITY AUDIT"
|
||||
echo "---------------------"
|
||||
|
||||
# Check for default passwords
|
||||
echo "Checking for common issues..."
|
||||
|
||||
# SSL status
|
||||
SSL_STATUS=$(sudo -u postgres psql -t -c "SHOW ssl;")
|
||||
echo "SSL: $SSL_STATUS"
|
||||
if [ "$SSL_STATUS" != " on" ]; then
|
||||
echo "⚠️ WARNING: SSL is not enabled"
|
||||
fi
|
||||
|
||||
# Check for users without passwords
|
||||
NO_PASS=$(sudo -u postgres psql -t -c "
|
||||
SELECT count(*) FROM pg_shadow WHERE passwd IS NULL;")
|
||||
if [ $NO_PASS -gt 0 ]; then
|
||||
echo "⚠️ WARNING: $NO_PASS users without passwords"
|
||||
fi
|
||||
|
||||
# Check firewall status
|
||||
if sudo ufw status | grep -q "Status: active"; then
|
||||
echo "✅ Firewall: ACTIVE"
|
||||
else
|
||||
echo "⚠️ WARNING: Firewall is not active"
|
||||
fi
|
||||
|
||||
# Check fail2ban status
|
||||
if systemctl is-active --quiet fail2ban; then
|
||||
echo "✅ Fail2ban: RUNNING"
|
||||
JAIL_STATUS=$(sudo fail2ban-client status postgresql 2>/dev/null | grep "Currently banned" || echo "Jail not configured")
|
||||
echo " PostgreSQL jail: $JAIL_STATUS"
|
||||
else
|
||||
echo "⚠️ WARNING: Fail2ban is not running"
|
||||
fi
|
||||
```
|
||||
|
||||
## Step 8: Customer Database Health
|
||||
|
||||
```bash
|
||||
echo -e "\n[8/10] CUSTOMER DATABASES"
|
||||
echo "-------------------------"
|
||||
|
||||
# Check each customer database
|
||||
CUSTOMER_DBS=$(sudo -u postgres psql -t -c "
|
||||
SELECT datname FROM pg_database
|
||||
WHERE datname NOT IN ('postgres', 'template0', 'template1')
|
||||
ORDER BY datname;")
|
||||
|
||||
for DB in $CUSTOMER_DBS; do
|
||||
echo -e "\nDatabase: $DB"
|
||||
|
||||
# Size
|
||||
SIZE=$(sudo -u postgres psql -t -c "
|
||||
SELECT pg_size_pretty(pg_database_size('$DB'));")
|
||||
echo " Size: $SIZE"
|
||||
|
||||
# Connection count
|
||||
CONN=$(sudo -u postgres psql -t -c "
|
||||
SELECT count(*) FROM pg_stat_activity WHERE datname = '$DB';")
|
||||
echo " Connections: $CONN"
|
||||
|
||||
# Transaction rate
|
||||
TPS=$(sudo -u postgres psql -t -c "
|
||||
SELECT xact_commit + xact_rollback as transactions
|
||||
FROM pg_stat_database WHERE datname = '$DB';")
|
||||
echo " Total transactions: $TPS"
|
||||
|
||||
# Check for locks
|
||||
LOCKS=$(sudo -u postgres psql -t -d $DB -c "
|
||||
SELECT count(*) FROM pg_locks WHERE granted = false;")
|
||||
if [ $LOCKS -gt 0 ]; then
|
||||
echo " ⚠️ WARNING: $LOCKS blocked locks detected"
|
||||
fi
|
||||
done
|
||||
```
|
||||
|
||||
## Step 9: System Logs Analysis
|
||||
|
||||
```bash
|
||||
echo -e "\n[9/10] LOG ANALYSIS"
|
||||
echo "-------------------"
|
||||
|
||||
# Check PostgreSQL logs for errors
|
||||
LOG_DIR="/var/log/postgresql"
|
||||
if [ -d "$LOG_DIR" ]; then
|
||||
echo "Recent PostgreSQL errors (last 24 hours):"
|
||||
find $LOG_DIR -name "*.log" -mtime -1 -exec grep -i "error\|fatal\|panic" {} \; | \
|
||||
tail -10 | head -5
|
||||
|
||||
ERROR_COUNT=$(find $LOG_DIR -name "*.log" -mtime -1 -exec grep -i "error\|fatal\|panic" {} \; | wc -l)
|
||||
echo "Total errors in last 24 hours: $ERROR_COUNT"
|
||||
|
||||
if [ $ERROR_COUNT -gt 100 ]; then
|
||||
echo "⚠️ WARNING: High error rate detected"
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check system logs
|
||||
echo -e "\nRecent system issues:"
|
||||
sudo journalctl -p err -since "24 hours ago" --no-pager | tail -5
|
||||
```
|
||||
|
||||
## Step 10: Recommendations
|
||||
|
||||
```bash
|
||||
echo -e "\n[10/10] HEALTH SUMMARY & RECOMMENDATIONS"
|
||||
echo "========================================="
|
||||
|
||||
# Collect all warnings
|
||||
WARNINGS=0
|
||||
CRITICAL=0
|
||||
|
||||
# Generate recommendations based on findings
|
||||
echo -e "\nRecommendations:"
|
||||
|
||||
# Check if vacuum is needed
|
||||
LAST_VACUUM=$(sudo -u postgres psql -t -c "
|
||||
SELECT MAX(last_autovacuum) FROM pg_stat_user_tables;")
|
||||
echo "- Last autovacuum: $LAST_VACUUM"
|
||||
|
||||
# Check if analyze is needed
|
||||
LAST_ANALYZE=$(sudo -u postgres psql -t -c "
|
||||
SELECT MAX(last_autoanalyze) FROM pg_stat_user_tables;")
|
||||
echo "- Last autoanalyze: $LAST_ANALYZE"
|
||||
|
||||
# Generate overall health score
|
||||
echo -e "\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
|
||||
if [ $CRITICAL -eq 0 ] && [ $WARNINGS -lt 3 ]; then
|
||||
echo "✅ OVERALL HEALTH: GOOD"
|
||||
elif [ $CRITICAL -eq 0 ] && [ $WARNINGS -lt 10 ]; then
|
||||
echo "⚠️ OVERALL HEALTH: FAIR - Review warnings"
|
||||
else
|
||||
echo "❌ OVERALL HEALTH: POOR - Immediate action required"
|
||||
fi
|
||||
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
|
||||
|
||||
# Save report
|
||||
REPORT_FILE="/opt/fairdb/logs/health-check-$(date +%Y%m%d-%H%M%S).log"
|
||||
echo -e "\nFull report saved to: $REPORT_FILE"
|
||||
```
|
||||
|
||||
## Actions Based on Results
|
||||
|
||||
### If Critical Issues Found:
|
||||
1. Check PostgreSQL service status
|
||||
2. Review disk space availability
|
||||
3. Verify backup integrity
|
||||
4. Check for data corruption
|
||||
5. Review security vulnerabilities
|
||||
|
||||
### If Warnings Found:
|
||||
1. Schedule maintenance window
|
||||
2. Plan capacity upgrades
|
||||
3. Review query performance
|
||||
4. Update monitoring thresholds
|
||||
5. Document issues for trending
|
||||
|
||||
### Regular Maintenance Tasks:
|
||||
1. Run VACUUM ANALYZE on large tables
|
||||
2. Update table statistics
|
||||
3. Review and optimize slow queries
|
||||
4. Clean up old logs
|
||||
5. Test backup restoration
|
||||
|
||||
## Schedule Next Health Check
|
||||
|
||||
```bash
|
||||
# Schedule regular health checks
|
||||
echo "30 */6 * * * root /usr/local/bin/fairdb-health-check > /dev/null 2>&1" | \
|
||||
sudo tee /etc/cron.d/fairdb-health-check
|
||||
|
||||
echo "Health checks scheduled every 6 hours"
|
||||
```
|
||||
446
commands/fairdb-onboard-customer.md
Normal file
446
commands/fairdb-onboard-customer.md
Normal file
@@ -0,0 +1,446 @@
|
||||
---
|
||||
name: fairdb-onboard-customer
|
||||
description: Complete customer onboarding workflow for FairDB PostgreSQL service
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# FairDB Customer Onboarding Workflow
|
||||
|
||||
You are onboarding a new customer for FairDB PostgreSQL as a Service. This comprehensive workflow creates their database, users, configures access, sets up backups, and provides connection details.
|
||||
|
||||
## Step 1: Gather Customer Information
|
||||
|
||||
Collect these details:
|
||||
1. **Customer Name**: Company/organization name
|
||||
2. **Database Name**: Preferred database name (lowercase, no spaces)
|
||||
3. **Primary Contact**: Name and email
|
||||
4. **Plan Type**: Starter/Professional/Enterprise
|
||||
5. **IP Allowlist**: Customer IP addresses for access
|
||||
6. **Special Requirements**: Extensions, configurations, etc.
|
||||
|
||||
## Step 2: Validate Resources
|
||||
|
||||
```bash
|
||||
# Check available resources
|
||||
df -h /var/lib/postgresql
|
||||
free -h
|
||||
sudo -u postgres psql -c "SELECT count(*) as database_count FROM pg_database WHERE datistemplate = false;"
|
||||
|
||||
# Check current connections
|
||||
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
|
||||
```
|
||||
|
||||
## Step 3: Create Customer Database
|
||||
|
||||
```bash
|
||||
# Set customer variables
|
||||
CUSTOMER_NAME="customer_name" # Replace with actual
|
||||
DB_NAME="${CUSTOMER_NAME}_db"
|
||||
DB_OWNER="${CUSTOMER_NAME}_owner"
|
||||
DB_USER="${CUSTOMER_NAME}_user"
|
||||
DB_READONLY="${CUSTOMER_NAME}_readonly"
|
||||
|
||||
# Generate secure passwords
|
||||
DB_OWNER_PASS=$(openssl rand -base64 32)
|
||||
DB_USER_PASS=$(openssl rand -base64 32)
|
||||
DB_READONLY_PASS=$(openssl rand -base64 32)
|
||||
|
||||
# Create database and users
|
||||
sudo -u postgres psql << EOF
|
||||
-- Create database owner role
|
||||
CREATE ROLE ${DB_OWNER} WITH LOGIN PASSWORD '${DB_OWNER_PASS}'
|
||||
CREATEDB CREATEROLE CONNECTION LIMIT 5;
|
||||
|
||||
-- Create application user
|
||||
CREATE ROLE ${DB_USER} WITH LOGIN PASSWORD '${DB_USER_PASS}'
|
||||
CONNECTION LIMIT 50;
|
||||
|
||||
-- Create read-only user
|
||||
CREATE ROLE ${DB_READONLY} WITH LOGIN PASSWORD '${DB_READONLY_PASS}'
|
||||
CONNECTION LIMIT 10;
|
||||
|
||||
-- Create customer database
|
||||
CREATE DATABASE ${DB_NAME}
|
||||
WITH OWNER = ${DB_OWNER}
|
||||
ENCODING = 'UTF8'
|
||||
LC_COLLATE = 'en_US.UTF-8'
|
||||
LC_CTYPE = 'en_US.UTF-8'
|
||||
TEMPLATE = template0
|
||||
CONNECTION LIMIT = 100;
|
||||
|
||||
-- Configure database
|
||||
\c ${DB_NAME}
|
||||
|
||||
-- Create schema
|
||||
CREATE SCHEMA IF NOT EXISTS ${CUSTOMER_NAME} AUTHORIZATION ${DB_OWNER};
|
||||
|
||||
-- Grant permissions
|
||||
GRANT CONNECT ON DATABASE ${DB_NAME} TO ${DB_USER}, ${DB_READONLY};
|
||||
GRANT USAGE ON SCHEMA ${CUSTOMER_NAME} TO ${DB_USER}, ${DB_READONLY};
|
||||
GRANT CREATE ON SCHEMA ${CUSTOMER_NAME} TO ${DB_USER};
|
||||
|
||||
-- Default privileges for tables
|
||||
ALTER DEFAULT PRIVILEGES FOR ROLE ${DB_OWNER} IN SCHEMA ${CUSTOMER_NAME}
|
||||
GRANT ALL ON TABLES TO ${DB_USER};
|
||||
|
||||
ALTER DEFAULT PRIVILEGES FOR ROLE ${DB_OWNER} IN SCHEMA ${CUSTOMER_NAME}
|
||||
GRANT SELECT ON TABLES TO ${DB_READONLY};
|
||||
|
||||
-- Default privileges for sequences
|
||||
ALTER DEFAULT PRIVILEGES FOR ROLE ${DB_OWNER} IN SCHEMA ${CUSTOMER_NAME}
|
||||
GRANT ALL ON SEQUENCES TO ${DB_USER};
|
||||
|
||||
ALTER DEFAULT PRIVILEGES FOR ROLE ${DB_OWNER} IN SCHEMA ${CUSTOMER_NAME}
|
||||
GRANT SELECT ON SEQUENCES TO ${DB_READONLY};
|
||||
|
||||
-- Enable useful extensions
|
||||
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
|
||||
CREATE EXTENSION IF NOT EXISTS pgcrypto;
|
||||
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
|
||||
CREATE EXTENSION IF NOT EXISTS citext;
|
||||
EOF
|
||||
|
||||
echo "Database ${DB_NAME} created successfully"
|
||||
```
|
||||
|
||||
## Step 4: Configure Network Access
|
||||
|
||||
```bash
|
||||
# Add customer IP to pg_hba.conf
|
||||
CUSTOMER_IP="203.0.113.0/32" # Replace with actual customer IP
|
||||
|
||||
# Backup pg_hba.conf
|
||||
sudo cp /etc/postgresql/16/main/pg_hba.conf /etc/postgresql/16/main/pg_hba.conf.$(date +%Y%m%d)
|
||||
|
||||
# Add customer access rules
|
||||
cat << EOF | sudo tee -a /etc/postgresql/16/main/pg_hba.conf
|
||||
|
||||
# Customer: ${CUSTOMER_NAME}
|
||||
hostssl ${DB_NAME} ${DB_OWNER} ${CUSTOMER_IP} scram-sha-256
|
||||
hostssl ${DB_NAME} ${DB_USER} ${CUSTOMER_IP} scram-sha-256
|
||||
hostssl ${DB_NAME} ${DB_READONLY} ${CUSTOMER_IP} scram-sha-256
|
||||
EOF
|
||||
|
||||
# Update firewall
|
||||
sudo ufw allow from ${CUSTOMER_IP} to any port 5432 comment "FairDB: ${CUSTOMER_NAME}"
|
||||
|
||||
# Reload PostgreSQL configuration
|
||||
sudo systemctl reload postgresql
|
||||
```
|
||||
|
||||
## Step 5: Set Resource Limits
|
||||
|
||||
```bash
|
||||
# Configure per-database resource limits based on plan
|
||||
case "${PLAN_TYPE}" in
|
||||
"starter")
|
||||
MAX_CONN=50
|
||||
WORK_MEM="4MB"
|
||||
SHARED_BUFFERS="256MB"
|
||||
;;
|
||||
"professional")
|
||||
MAX_CONN=100
|
||||
WORK_MEM="8MB"
|
||||
SHARED_BUFFERS="1GB"
|
||||
;;
|
||||
"enterprise")
|
||||
MAX_CONN=200
|
||||
WORK_MEM="16MB"
|
||||
SHARED_BUFFERS="4GB"
|
||||
;;
|
||||
esac
|
||||
|
||||
# Apply database-specific settings
|
||||
sudo -u postgres psql -d ${DB_NAME} << EOF
|
||||
-- Set connection limit
|
||||
ALTER DATABASE ${DB_NAME} CONNECTION LIMIT ${MAX_CONN};
|
||||
|
||||
-- Set database parameters
|
||||
ALTER DATABASE ${DB_NAME} SET work_mem = '${WORK_MEM}';
|
||||
ALTER DATABASE ${DB_NAME} SET maintenance_work_mem = '${WORK_MEM}';
|
||||
ALTER DATABASE ${DB_NAME} SET effective_cache_size = '${SHARED_BUFFERS}';
|
||||
ALTER DATABASE ${DB_NAME} SET random_page_cost = 1.1;
|
||||
ALTER DATABASE ${DB_NAME} SET log_statement = 'all';
|
||||
ALTER DATABASE ${DB_NAME} SET log_duration = on;
|
||||
EOF
|
||||
```
|
||||
|
||||
## Step 6: Configure Backup Policy
|
||||
|
||||
```bash
|
||||
# Create customer-specific backup configuration
|
||||
cat << EOF | sudo tee -a /opt/fairdb/configs/backup-${CUSTOMER_NAME}.conf
|
||||
# Backup configuration for ${CUSTOMER_NAME}
|
||||
DATABASE=${DB_NAME}
|
||||
BACKUP_RETENTION_DAYS=30
|
||||
BACKUP_SCHEDULE="0 3 * * *" # Daily at 3 AM
|
||||
BACKUP_TYPE="full"
|
||||
S3_PREFIX="${CUSTOMER_NAME}/"
|
||||
EOF
|
||||
|
||||
# Add to pgBackRest configuration
|
||||
sudo tee -a /etc/pgbackrest/pgbackrest.conf << EOF
|
||||
|
||||
[${CUSTOMER_NAME}]
|
||||
pg1-path=/var/lib/postgresql/16/main
|
||||
pg1-database=${DB_NAME}
|
||||
pg1-port=5432
|
||||
backup-user=backup_user
|
||||
process-max=2
|
||||
repo1-retention-full=4
|
||||
repo1-retention-diff=7
|
||||
EOF
|
||||
|
||||
# Create backup stanza for customer
|
||||
sudo -u postgres pgbackrest --stanza=${CUSTOMER_NAME} stanza-create
|
||||
|
||||
# Schedule customer backup
|
||||
echo "0 3 * * * postgres pgbackrest --stanza=${CUSTOMER_NAME} --type=full backup" | \
|
||||
sudo tee -a /etc/cron.d/fairdb-customer-${CUSTOMER_NAME}
|
||||
```
|
||||
|
||||
## Step 7: Setup Monitoring
|
||||
|
||||
```bash
|
||||
# Create monitoring user and grants
|
||||
sudo -u postgres psql -d ${DB_NAME} << EOF
|
||||
-- Grant monitoring permissions
|
||||
GRANT pg_monitor TO ${DB_READONLY};
|
||||
GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO ${DB_OWNER};
|
||||
EOF
|
||||
|
||||
# Create customer monitoring script
|
||||
cat << 'EOF' | sudo tee /opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh
|
||||
#!/bin/bash
|
||||
# Monitoring script for ${CUSTOMER_NAME}
|
||||
|
||||
DB_NAME="${DB_NAME}"
|
||||
ALERT_THRESHOLD_CONNECTIONS=80
|
||||
ALERT_THRESHOLD_SIZE_GB=100
|
||||
|
||||
# Check connection usage
|
||||
CONN_USAGE=$(sudo -u postgres psql -t -c "
|
||||
SELECT (count(*) * 100.0 / setting::int)::int as pct
|
||||
FROM pg_stat_activity, pg_settings
|
||||
WHERE name = 'max_connections'
|
||||
AND datname = '${DB_NAME}'
|
||||
GROUP BY setting;")
|
||||
|
||||
if [ ${CONN_USAGE:-0} -gt $ALERT_THRESHOLD_CONNECTIONS ]; then
|
||||
echo "ALERT: Connection usage at ${CONN_USAGE}% for ${CUSTOMER_NAME}"
|
||||
fi
|
||||
|
||||
# Check database size
|
||||
DB_SIZE_GB=$(sudo -u postgres psql -t -c "
|
||||
SELECT pg_database_size('${DB_NAME}') / 1024 / 1024 / 1024;")
|
||||
|
||||
if [ ${DB_SIZE_GB:-0} -gt $ALERT_THRESHOLD_SIZE_GB ]; then
|
||||
echo "ALERT: Database size is ${DB_SIZE_GB}GB for ${CUSTOMER_NAME}"
|
||||
fi
|
||||
|
||||
# Check for long-running queries
|
||||
sudo -u postgres psql -d ${DB_NAME} -c "
|
||||
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
|
||||
FROM pg_stat_activity
|
||||
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
|
||||
AND state = 'active';"
|
||||
EOF
|
||||
|
||||
sudo chmod +x /opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh
|
||||
|
||||
# Add to monitoring cron
|
||||
echo "*/10 * * * * root /opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh" | \
|
||||
sudo tee -a /etc/cron.d/fairdb-monitor-${CUSTOMER_NAME}
|
||||
```
|
||||
|
||||
## Step 8: Generate SSL Certificates
|
||||
|
||||
```bash
|
||||
# Create customer SSL certificate
|
||||
sudo mkdir -p /etc/postgresql/16/main/ssl/${CUSTOMER_NAME}
|
||||
cd /etc/postgresql/16/main/ssl/${CUSTOMER_NAME}
|
||||
|
||||
# Generate customer-specific SSL cert
|
||||
sudo openssl req -new -x509 -days 365 -nodes \
|
||||
-out server.crt -keyout server.key \
|
||||
-subj "/C=US/ST=State/L=City/O=FairDB/OU=${CUSTOMER_NAME}/CN=${CUSTOMER_NAME}.fairdb.io"
|
||||
|
||||
# Set permissions
|
||||
sudo chmod 600 server.key
|
||||
sudo chown postgres:postgres server.*
|
||||
|
||||
# Create client certificate
|
||||
sudo openssl req -new -nodes \
|
||||
-out client.csr -keyout client.key \
|
||||
-subj "/C=US/ST=State/L=City/O=FairDB/OU=${CUSTOMER_NAME}/CN=${DB_USER}"
|
||||
|
||||
sudo openssl x509 -req -CAcreateserial \
|
||||
-in client.csr -CA server.crt -CAkey server.key \
|
||||
-out client.crt -days 365
|
||||
|
||||
# Package client certificates
|
||||
tar czf /tmp/${CUSTOMER_NAME}-ssl-bundle.tar.gz client.crt client.key server.crt
|
||||
```
|
||||
|
||||
## Step 9: Create Connection Documentation
|
||||
|
||||
```bash
|
||||
# Generate connection details document
|
||||
cat << EOF > /tmp/${CUSTOMER_NAME}-connection-details.md
|
||||
# FairDB PostgreSQL Connection Details
|
||||
## Customer: ${CUSTOMER_NAME}
|
||||
|
||||
### Database Information
|
||||
- **Database Name**: ${DB_NAME}
|
||||
- **Host**: fairdb-prod.example.com
|
||||
- **Port**: 5432
|
||||
- **SSL Required**: Yes
|
||||
|
||||
### User Credentials
|
||||
#### Database Owner (DDL Operations)
|
||||
- **Username**: ${DB_OWNER}
|
||||
- **Password**: ${DB_OWNER_PASS}
|
||||
- **Connection Limit**: 5
|
||||
- **Permissions**: Full database owner
|
||||
|
||||
#### Application User (DML Operations)
|
||||
- **Username**: ${DB_USER}
|
||||
- **Password**: ${DB_USER_PASS}
|
||||
- **Connection Limit**: 50
|
||||
- **Permissions**: CRUD operations on all tables
|
||||
|
||||
#### Read-Only User (Reporting)
|
||||
- **Username**: ${DB_READONLY}
|
||||
- **Password**: ${DB_READONLY_PASS}
|
||||
- **Connection Limit**: 10
|
||||
- **Permissions**: SELECT only
|
||||
|
||||
### Connection Strings
|
||||
\`\`\`
|
||||
# Standard connection
|
||||
postgresql://${DB_USER}:${DB_USER_PASS}@fairdb-prod.example.com:5432/${DB_NAME}?sslmode=require
|
||||
|
||||
# With SSL certificate
|
||||
postgresql://${DB_USER}:${DB_USER_PASS}@fairdb-prod.example.com:5432/${DB_NAME}?sslmode=require&sslcert=client.crt&sslkey=client.key&sslrootcert=server.crt
|
||||
|
||||
# JDBC URL
|
||||
jdbc:postgresql://fairdb-prod.example.com:5432/${DB_NAME}?ssl=true&sslmode=require
|
||||
|
||||
# psql command
|
||||
psql "host=fairdb-prod.example.com port=5432 dbname=${DB_NAME} user=${DB_USER} sslmode=require"
|
||||
\`\`\`
|
||||
|
||||
### Resource Limits
|
||||
- **Plan**: ${PLAN_TYPE}
|
||||
- **Max Connections**: ${MAX_CONN}
|
||||
- **Storage Quota**: Unlimited (pay per GB)
|
||||
- **Backup Retention**: 30 days
|
||||
- **Backup Schedule**: Daily at 3:00 AM UTC
|
||||
|
||||
### Support Information
|
||||
- **Email**: support@fairdb.io
|
||||
- **Emergency**: +1-xxx-xxx-xxxx
|
||||
- **Documentation**: https://docs.fairdb.io
|
||||
- **Status Page**: https://status.fairdb.io
|
||||
|
||||
### Important Notes
|
||||
1. Always use SSL connections
|
||||
2. Rotate passwords every 90 days
|
||||
3. Monitor connection pool usage
|
||||
4. Test restore procedures quarterly
|
||||
5. Keep IP allowlist updated
|
||||
|
||||
### Next Steps
|
||||
1. Download SSL certificates: ${CUSTOMER_NAME}-ssl-bundle.tar.gz
|
||||
2. Test connection with provided credentials
|
||||
3. Configure application connection pool
|
||||
4. Set up monitoring dashboards
|
||||
5. Review security best practices
|
||||
|
||||
Generated: $(date)
|
||||
EOF
|
||||
|
||||
echo "Connection details saved to /tmp/${CUSTOMER_NAME}-connection-details.md"
|
||||
```
|
||||
|
||||
## Step 10: Final Verification
|
||||
|
||||
```bash
|
||||
# Test all user connections
|
||||
echo "Testing database connections..."
|
||||
|
||||
# Test owner connection
|
||||
PGPASSWORD=${DB_OWNER_PASS} psql -h localhost -U ${DB_OWNER} -d ${DB_NAME} -c "SELECT current_user, current_database();"
|
||||
|
||||
# Test app user connection
|
||||
PGPASSWORD=${DB_USER_PASS} psql -h localhost -U ${DB_USER} -d ${DB_NAME} -c "SELECT current_user, current_database();"
|
||||
|
||||
# Test readonly connection
|
||||
PGPASSWORD=${DB_READONLY_PASS} psql -h localhost -U ${DB_READONLY} -d ${DB_NAME} -c "SELECT current_user, current_database();"
|
||||
|
||||
# Verify backup configuration
|
||||
sudo -u postgres pgbackrest --stanza=${CUSTOMER_NAME} check
|
||||
|
||||
# Check monitoring
|
||||
/opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh
|
||||
|
||||
# Generate onboarding summary
|
||||
echo "
|
||||
===========================================
|
||||
FairDB Customer Onboarding Complete
|
||||
===========================================
|
||||
Customer: ${CUSTOMER_NAME}
|
||||
Database: ${DB_NAME}
|
||||
Created: $(date)
|
||||
Plan: ${PLAN_TYPE}
|
||||
|
||||
Files Generated:
|
||||
- /tmp/${CUSTOMER_NAME}-connection-details.md
|
||||
- /tmp/${CUSTOMER_NAME}-ssl-bundle.tar.gz
|
||||
|
||||
Next Actions:
|
||||
1. Send connection details to customer
|
||||
2. Schedule onboarding call
|
||||
3. Monitor initial usage
|
||||
4. Follow up in 24 hours
|
||||
===========================================
|
||||
"
|
||||
```
|
||||
|
||||
## Onboarding Checklist
|
||||
|
||||
Verify completion:
|
||||
- [ ] Database created
|
||||
- [ ] Users created with secure passwords
|
||||
- [ ] Network access configured
|
||||
- [ ] Resource limits applied
|
||||
- [ ] Backup policy configured
|
||||
- [ ] Monitoring enabled
|
||||
- [ ] SSL certificates generated
|
||||
- [ ] Documentation created
|
||||
- [ ] Connection tests passed
|
||||
- [ ] Customer notified
|
||||
|
||||
## Rollback Procedure
|
||||
|
||||
If onboarding fails:
|
||||
```bash
|
||||
# Remove database and users
|
||||
sudo -u postgres psql << EOF
|
||||
DROP DATABASE IF EXISTS ${DB_NAME};
|
||||
DROP ROLE IF EXISTS ${DB_OWNER};
|
||||
DROP ROLE IF EXISTS ${DB_USER};
|
||||
DROP ROLE IF EXISTS ${DB_READONLY};
|
||||
EOF
|
||||
|
||||
# Remove configurations
|
||||
sudo rm -f /etc/cron.d/fairdb-customer-${CUSTOMER_NAME}
|
||||
sudo rm -f /etc/cron.d/fairdb-monitor-${CUSTOMER_NAME}
|
||||
sudo rm -f /opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh
|
||||
sudo rm -rf /etc/postgresql/16/main/ssl/${CUSTOMER_NAME}
|
||||
|
||||
# Remove firewall rule
|
||||
sudo ufw delete allow from ${CUSTOMER_IP} to any port 5432
|
||||
|
||||
echo "Customer ${CUSTOMER_NAME} rollback complete"
|
||||
```
|
||||
420
commands/fairdb-setup-backup.md
Normal file
420
commands/fairdb-setup-backup.md
Normal file
@@ -0,0 +1,420 @@
|
||||
---
|
||||
name: fairdb-setup-backup
|
||||
description: Configure pgBackRest with Wasabi S3 for automated PostgreSQL backups
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# FairDB pgBackRest Backup Configuration with Wasabi S3
|
||||
|
||||
You are configuring pgBackRest with Wasabi S3 storage for automated PostgreSQL backups. Follow SOP-003 precisely.
|
||||
|
||||
## Prerequisites Check
|
||||
|
||||
Verify before starting:
|
||||
1. PostgreSQL 16 is installed and running
|
||||
2. Wasabi S3 account is active with bucket created
|
||||
3. AWS CLI credentials are available
|
||||
4. At least 50GB free disk space for local backups
|
||||
|
||||
## Step 1: Install pgBackRest
|
||||
|
||||
```bash
|
||||
# Add pgBackRest repository
|
||||
sudo apt-get install -y software-properties-common
|
||||
sudo add-apt-repository -y ppa:pgbackrest/backrest
|
||||
sudo apt-get update
|
||||
|
||||
# Install pgBackRest
|
||||
sudo apt-get install -y pgbackrest
|
||||
|
||||
# Verify installation
|
||||
pgbackrest version
|
||||
```
|
||||
|
||||
## Step 2: Configure Wasabi S3 Credentials
|
||||
|
||||
```bash
|
||||
# Create pgBackRest configuration directory
|
||||
sudo mkdir -p /etc/pgbackrest
|
||||
sudo mkdir -p /var/lib/pgbackrest
|
||||
sudo mkdir -p /var/log/pgbackrest
|
||||
sudo mkdir -p /var/spool/pgbackrest
|
||||
|
||||
# Set ownership
|
||||
sudo chown -R postgres:postgres /var/lib/pgbackrest
|
||||
sudo chown -R postgres:postgres /var/log/pgbackrest
|
||||
sudo chown -R postgres:postgres /var/spool/pgbackrest
|
||||
|
||||
# Store Wasabi credentials (secure these!)
|
||||
export WASABI_ACCESS_KEY="YOUR_WASABI_ACCESS_KEY"
|
||||
export WASABI_SECRET_KEY="YOUR_WASABI_SECRET_KEY"
|
||||
export WASABI_BUCKET="fairdb-backups"
|
||||
export WASABI_REGION="us-east-1" # Or your Wasabi region
|
||||
export WASABI_ENDPOINT="s3.us-east-1.wasabisys.com" # Adjust for your region
|
||||
```
|
||||
|
||||
## Step 3: Create pgBackRest Configuration
|
||||
|
||||
```bash
|
||||
# Create main configuration file
|
||||
sudo tee /etc/pgbackrest/pgbackrest.conf << EOF
|
||||
[global]
|
||||
# General Options
|
||||
process-max=4
|
||||
log-level-console=info
|
||||
log-level-file=detail
|
||||
start-fast=y
|
||||
stop-auto=y
|
||||
archive-async=y
|
||||
archive-push-queue-max=4GB
|
||||
spool-path=/var/spool/pgbackrest
|
||||
|
||||
# S3 Repository Configuration
|
||||
repo1-type=s3
|
||||
repo1-s3-endpoint=${WASABI_ENDPOINT}
|
||||
repo1-s3-bucket=${WASABI_BUCKET}
|
||||
repo1-s3-region=${WASABI_REGION}
|
||||
repo1-s3-key=${WASABI_ACCESS_KEY}
|
||||
repo1-s3-key-secret=${WASABI_SECRET_KEY}
|
||||
repo1-path=/pgbackrest
|
||||
repo1-retention-full=4
|
||||
repo1-retention-diff=12
|
||||
repo1-retention-archive=30
|
||||
repo1-cipher-type=aes-256-cbc
|
||||
repo1-cipher-pass=CHANGE_THIS_PASSPHRASE
|
||||
|
||||
# Local Repository (for faster restores)
|
||||
repo2-type=posix
|
||||
repo2-path=/var/lib/pgbackrest
|
||||
repo2-retention-full=2
|
||||
repo2-retention-diff=6
|
||||
|
||||
[fairdb]
|
||||
# PostgreSQL Configuration
|
||||
pg1-path=/var/lib/postgresql/16/main
|
||||
pg1-port=5432
|
||||
pg1-user=postgres
|
||||
|
||||
# Archive Configuration
|
||||
archive-timeout=60
|
||||
archive-check=y
|
||||
backup-standby=n
|
||||
|
||||
# Backup Options
|
||||
compress-type=lz4
|
||||
compress-level=3
|
||||
backup-user=backup_user
|
||||
delta=y
|
||||
process-max=2
|
||||
EOF
|
||||
|
||||
# Secure the configuration file
|
||||
sudo chmod 640 /etc/pgbackrest/pgbackrest.conf
|
||||
sudo chown postgres:postgres /etc/pgbackrest/pgbackrest.conf
|
||||
```
|
||||
|
||||
## Step 4: Configure PostgreSQL for pgBackRest
|
||||
|
||||
```bash
|
||||
# Update PostgreSQL configuration
|
||||
sudo tee -a /etc/postgresql/16/main/postgresql.conf << 'EOF'
|
||||
|
||||
# pgBackRest Archive Configuration
|
||||
archive_mode = on
|
||||
archive_command = 'pgbackrest --stanza=fairdb archive-push %p'
|
||||
archive_timeout = 60
|
||||
max_wal_senders = 3
|
||||
wal_level = replica
|
||||
wal_log_hints = on
|
||||
EOF
|
||||
|
||||
# Restart PostgreSQL
|
||||
sudo systemctl restart postgresql
|
||||
```
|
||||
|
||||
## Step 5: Initialize Backup Stanza
|
||||
|
||||
```bash
|
||||
# Create the stanza
|
||||
sudo -u postgres pgbackrest --stanza=fairdb stanza-create
|
||||
|
||||
# Verify stanza
|
||||
sudo -u postgres pgbackrest --stanza=fairdb check
|
||||
```
|
||||
|
||||
## Step 6: Create Backup Scripts
|
||||
|
||||
```bash
|
||||
# Full backup script
|
||||
sudo tee /opt/fairdb/scripts/backup-full.sh << 'EOF'
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
LOG_FILE="/var/log/fairdb/backup-full-$(date +%Y%m%d-%H%M%S).log"
|
||||
echo "Starting full backup at $(date)" | tee -a $LOG_FILE
|
||||
|
||||
# Perform full backup to both repositories
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --type=full --repo=1 backup 2>&1 | tee -a $LOG_FILE
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --type=full --repo=2 backup 2>&1 | tee -a $LOG_FILE
|
||||
|
||||
# Verify backup
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --repo=1 info 2>&1 | tee -a $LOG_FILE
|
||||
|
||||
echo "Full backup completed at $(date)" | tee -a $LOG_FILE
|
||||
|
||||
# Send notification (implement webhook/email here)
|
||||
curl -X POST $FAIRDB_MONITORING_WEBHOOK \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d "{\"text\":\"FairDB full backup completed successfully\"}" 2>/dev/null || true
|
||||
EOF
|
||||
|
||||
# Incremental backup script
|
||||
sudo tee /opt/fairdb/scripts/backup-incremental.sh << 'EOF'
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
LOG_FILE="/var/log/fairdb/backup-incr-$(date +%Y%m%d-%H%M%S).log"
|
||||
echo "Starting incremental backup at $(date)" | tee -a $LOG_FILE
|
||||
|
||||
# Perform incremental backup
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --type=incr --repo=1 backup 2>&1 | tee -a $LOG_FILE
|
||||
|
||||
echo "Incremental backup completed at $(date)" | tee -a $LOG_FILE
|
||||
EOF
|
||||
|
||||
# Differential backup script
|
||||
sudo tee /opt/fairdb/scripts/backup-differential.sh << 'EOF'
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
LOG_FILE="/var/log/fairdb/backup-diff-$(date +%Y%m%d-%H%M%S).log"
|
||||
echo "Starting differential backup at $(date)" | tee -a $LOG_FILE
|
||||
|
||||
# Perform differential backup
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --type=diff --repo=1 backup 2>&1 | tee -a $LOG_FILE
|
||||
|
||||
echo "Differential backup completed at $(date)" | tee -a $LOG_FILE
|
||||
EOF
|
||||
|
||||
# Make scripts executable
|
||||
sudo chmod +x /opt/fairdb/scripts/backup-*.sh
|
||||
```
|
||||
|
||||
## Step 7: Schedule Automated Backups
|
||||
|
||||
```bash
|
||||
# Add to root's crontab for automated backups
|
||||
cat << 'EOF' | sudo tee /etc/cron.d/fairdb-backups
|
||||
# FairDB Automated Backup Schedule
|
||||
SHELL=/bin/bash
|
||||
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
|
||||
|
||||
# Weekly full backup (Sunday 2 AM)
|
||||
0 2 * * 0 root /opt/fairdb/scripts/backup-full.sh
|
||||
|
||||
# Daily differential backup (Mon-Sat 2 AM)
|
||||
0 2 * * 1-6 root /opt/fairdb/scripts/backup-differential.sh
|
||||
|
||||
# Hourly incremental backup (business hours)
|
||||
0 9-18 * * 1-5 root /opt/fairdb/scripts/backup-incremental.sh
|
||||
|
||||
# Backup verification (daily at 5 AM)
|
||||
0 5 * * * postgres pgbackrest --stanza=fairdb --repo=1 check
|
||||
|
||||
# Archive expiration (daily at 3 AM)
|
||||
0 3 * * * postgres pgbackrest --stanza=fairdb --repo=1 expire
|
||||
EOF
|
||||
```
|
||||
|
||||
## Step 8: Create Restore Procedures
|
||||
|
||||
```bash
|
||||
# Point-in-time recovery script
|
||||
sudo tee /opt/fairdb/scripts/restore-pitr.sh << 'EOF'
|
||||
#!/bin/bash
|
||||
# FairDB Point-in-Time Recovery Script
|
||||
|
||||
if [ $# -ne 1 ]; then
|
||||
echo "Usage: $0 'YYYY-MM-DD HH:MM:SS'"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
TARGET_TIME="$1"
|
||||
BACKUP_PATH="/var/lib/postgresql/16/main"
|
||||
|
||||
echo "WARNING: This will restore the database to $TARGET_TIME"
|
||||
echo "Current data will be LOST. Continue? (yes/no)"
|
||||
read CONFIRM
|
||||
|
||||
if [ "$CONFIRM" != "yes" ]; then
|
||||
echo "Restore cancelled"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Stop PostgreSQL
|
||||
sudo systemctl stop postgresql
|
||||
|
||||
# Clear data directory
|
||||
sudo rm -rf ${BACKUP_PATH}/*
|
||||
|
||||
# Restore to target time
|
||||
sudo -u postgres pgbackrest --stanza=fairdb \
|
||||
--type=time \
|
||||
--target="$TARGET_TIME" \
|
||||
--target-action=promote \
|
||||
restore
|
||||
|
||||
# Start PostgreSQL
|
||||
sudo systemctl start postgresql
|
||||
|
||||
echo "Restore completed. Verify data integrity."
|
||||
EOF
|
||||
|
||||
sudo chmod +x /opt/fairdb/scripts/restore-pitr.sh
|
||||
```
|
||||
|
||||
## Step 9: Test Backup and Restore
|
||||
|
||||
```bash
|
||||
# Perform test backup
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --type=full backup
|
||||
|
||||
# Check backup info
|
||||
sudo -u postgres pgbackrest --stanza=fairdb info
|
||||
|
||||
# List backups
|
||||
sudo -u postgres pgbackrest --stanza=fairdb info --output=json
|
||||
|
||||
# Test restore to alternate location
|
||||
sudo mkdir -p /tmp/pgbackrest-test
|
||||
sudo chown postgres:postgres /tmp/pgbackrest-test
|
||||
sudo -u postgres pgbackrest --stanza=fairdb \
|
||||
--pg1-path=/tmp/pgbackrest-test \
|
||||
--type=latest \
|
||||
restore
|
||||
```
|
||||
|
||||
## Step 10: Monitor Backup Health
|
||||
|
||||
```bash
|
||||
# Create monitoring script
|
||||
sudo tee /opt/fairdb/scripts/check-backup-health.sh << 'EOF'
|
||||
#!/bin/bash
|
||||
# FairDB Backup Health Check
|
||||
|
||||
# Check last backup time
|
||||
LAST_BACKUP=$(sudo -u postgres pgbackrest --stanza=fairdb info --output=json | \
|
||||
jq -r '.[] | .backup[-1].timestamp.stop')
|
||||
|
||||
# Convert to seconds
|
||||
LAST_BACKUP_EPOCH=$(date -d "$LAST_BACKUP" +%s)
|
||||
CURRENT_EPOCH=$(date +%s)
|
||||
HOURS_AGO=$(( ($CURRENT_EPOCH - $LAST_BACKUP_EPOCH) / 3600 ))
|
||||
|
||||
# Alert if backup is older than 25 hours
|
||||
if [ $HOURS_AGO -gt 25 ]; then
|
||||
echo "ALERT: Last backup was $HOURS_AGO hours ago!"
|
||||
# Send alert (implement notification here)
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "Backup health OK - last backup $HOURS_AGO hours ago"
|
||||
|
||||
# Check S3 connectivity
|
||||
aws s3 ls s3://${WASABI_BUCKET}/pgbackrest/ \
|
||||
--endpoint-url=https://${WASABI_ENDPOINT} > /dev/null 2>&1
|
||||
if [ $? -ne 0 ]; then
|
||||
echo "ALERT: Cannot connect to Wasabi S3!"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "S3 connectivity OK"
|
||||
EOF
|
||||
|
||||
sudo chmod +x /opt/fairdb/scripts/check-backup-health.sh
|
||||
|
||||
# Add to monitoring cron
|
||||
echo "*/30 * * * * root /opt/fairdb/scripts/check-backup-health.sh" | \
|
||||
sudo tee -a /etc/cron.d/fairdb-monitoring
|
||||
```
|
||||
|
||||
## Step 11: Document Backup Configuration
|
||||
|
||||
```bash
|
||||
cat > /opt/fairdb/configs/backup-info.txt << EOF
|
||||
FairDB Backup Configuration
|
||||
===========================
|
||||
Backup Solution: pgBackRest
|
||||
Primary Repository: Wasabi S3 (${WASABI_BUCKET})
|
||||
Secondary Repository: Local (/var/lib/pgbackrest)
|
||||
Stanza Name: fairdb
|
||||
Encryption: AES-256-CBC
|
||||
|
||||
Retention Policy:
|
||||
- Full Backups: 4 (S3), 2 (Local)
|
||||
- Differential: 12 (S3), 6 (Local)
|
||||
- WAL Archives: 30 days
|
||||
|
||||
Schedule:
|
||||
- Full: Weekly (Sunday 2 AM)
|
||||
- Differential: Daily (Mon-Sat 2 AM)
|
||||
- Incremental: Hourly (9 AM - 6 PM weekdays)
|
||||
|
||||
Restore Procedures:
|
||||
- Latest: pgbackrest --stanza=fairdb restore
|
||||
- PITR: /opt/fairdb/scripts/restore-pitr.sh 'YYYY-MM-DD HH:MM:SS'
|
||||
|
||||
Monitoring:
|
||||
- Health checks: Every 30 minutes
|
||||
- Verification: Daily at 5 AM
|
||||
- Expiration: Daily at 3 AM
|
||||
EOF
|
||||
```
|
||||
|
||||
## Verification Checklist
|
||||
|
||||
Confirm these items:
|
||||
- [ ] pgBackRest installed and configured
|
||||
- [ ] Wasabi S3 credentials configured
|
||||
- [ ] Stanza created and verified
|
||||
- [ ] PostgreSQL archive_command configured
|
||||
- [ ] Backup scripts created and executable
|
||||
- [ ] Automated schedule configured
|
||||
- [ ] Test backup successful
|
||||
- [ ] Test restore successful
|
||||
- [ ] Monitoring scripts in place
|
||||
- [ ] Documentation complete
|
||||
|
||||
## Security Notes
|
||||
|
||||
- Store Wasabi credentials securely (use AWS Secrets Manager in production)
|
||||
- Encrypt backup repository with strong passphrase
|
||||
- Regularly test restore procedures
|
||||
- Monitor backup logs for failures
|
||||
- Keep pgBackRest updated
|
||||
|
||||
## Output Summary
|
||||
|
||||
Provide the user with:
|
||||
1. Backup stanza status: `pgbackrest --stanza=fairdb info`
|
||||
2. Next full backup time from cron schedule
|
||||
3. Location of backup scripts and logs
|
||||
4. Restore procedure documentation
|
||||
5. Monitoring webhook configuration needed
|
||||
|
||||
## Important Commands
|
||||
|
||||
```bash
|
||||
# Manual backup commands
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --type=full backup # Full
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --type=diff backup # Differential
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --type=incr backup # Incremental
|
||||
|
||||
# Check backup status
|
||||
sudo -u postgres pgbackrest --stanza=fairdb info
|
||||
sudo -u postgres pgbackrest --stanza=fairdb check
|
||||
|
||||
# Restore commands
|
||||
sudo -u postgres pgbackrest --stanza=fairdb restore # Latest
|
||||
sudo -u postgres pgbackrest --stanza=fairdb --type=time --target="2024-01-01 12:00:00" restore # PITR
|
||||
```
|
||||
Reference in New Issue
Block a user