Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:19:24 +08:00
commit 74075be734
22 changed files with 2851 additions and 0 deletions


@@ -0,0 +1,480 @@
---
name: fairdb-emergency-response
description: Emergency incident response procedures for critical FairDB issues
model: sonnet
---
# FairDB Emergency Incident Response
You are responding to a critical incident in the FairDB PostgreSQL infrastructure. Follow this structured approach to diagnose, contain, and resolve the issue.
## Incident Classification
First, identify the incident type:
- **P1 Critical**: Complete service outage, data loss risk
- **P2 High**: Major degradation, affecting multiple customers
- **P3 Medium**: Single customer impact, performance issues
- **P4 Low**: Minor issues, cosmetic problems
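
The severity levels above can be encoded in a small helper so every response script tags the incident log consistently. A minimal sketch; the symptom keywords are illustrative assumptions, not an official FairDB taxonomy:

```bash
# Map a coarse symptom keyword to a severity label (hypothetical mapping).
classify_incident() {
    case "$1" in
        outage|data-loss)     echo "P1" ;;   # complete outage, data loss risk
        degradation)          echo "P2" ;;   # major degradation, multiple customers
        single-customer|perf) echo "P3" ;;   # single customer, performance issues
        *)                    echo "P4" ;;   # minor or cosmetic
    esac
}

classify_incident outage   # -> P1
classify_incident perf     # -> P3
```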
## Initial Assessment (First 5 Minutes)
```bash
#!/bin/bash
# FairDB Emergency Response Script
echo "================================================"
echo " FAIRDB EMERGENCY INCIDENT RESPONSE"
echo " Started: $(date '+%Y-%m-%d %H:%M:%S')"
echo "================================================"
# Create incident log
INCIDENT_ID="INC-$(date +%Y%m%d-%H%M%S)"
INCIDENT_LOG="/opt/fairdb/incidents/${INCIDENT_ID}.log"
mkdir -p /opt/fairdb/incidents
{
echo "Incident ID: $INCIDENT_ID"
echo "Response started: $(date)"
echo "Responding user: $(whoami)"
echo "========================================"
} | tee $INCIDENT_LOG
```
## Step 1: Service Status Check
```bash
echo -e "\n[STEP 1] SERVICE STATUS CHECK" | tee -a $INCIDENT_LOG
echo "------------------------------" | tee -a $INCIDENT_LOG
# Check PostgreSQL service
if systemctl is-active --quiet postgresql; then
    echo "✅ PostgreSQL: RUNNING" | tee -a $INCIDENT_LOG
else
    echo "❌ CRITICAL: PostgreSQL is DOWN" | tee -a $INCIDENT_LOG
    echo "Attempting emergency restart..." | tee -a $INCIDENT_LOG
    # Try to start the service
    sudo systemctl start postgresql 2>&1 | tee -a $INCIDENT_LOG
    sleep 5
    if systemctl is-active --quiet postgresql; then
        echo "✅ PostgreSQL restarted successfully" | tee -a $INCIDENT_LOG
    else
        echo "❌ FAILED to restart PostgreSQL" | tee -a $INCIDENT_LOG
        echo "Checking for port conflicts..." | tee -a $INCIDENT_LOG
        sudo ss -tulpn | grep :5432 | tee -a $INCIDENT_LOG
        # Print the configured data directory (verifies the binary can read its config)
        echo "Verifying server configuration..." | tee -a $INCIDENT_LOG
        sudo -u postgres /usr/lib/postgresql/16/bin/postgres -D /var/lib/postgresql/16/main -C data_directory 2>&1 | tee -a $INCIDENT_LOG
    fi
fi
# Check disk space
echo -e "\nDisk Space:" | tee -a $INCIDENT_LOG
df -h | grep -E "^/dev|^Filesystem" | tee -a $INCIDENT_LOG
# Check for full disks
FULL_DISKS=$(df -h | grep -E "100%|9[5-9]%" | wc -l)
if [ $FULL_DISKS -gt 0 ]; then
echo "⚠️ CRITICAL: Disk space exhausted!" | tee -a $INCIDENT_LOG
echo "Emergency cleanup required..." | tee -a $INCIDENT_LOG
# Emergency log cleanup
find /var/log/postgresql -name "*.log" -mtime +7 -delete 2>/dev/null
find /opt/fairdb/logs -name "*.log" -mtime +7 -delete 2>/dev/null
echo "Old logs cleared. New disk usage:" | tee -a $INCIDENT_LOG
df -h | grep -E "^/dev" | tee -a $INCIDENT_LOG
fi
```
## Step 2: Connection Diagnostics
```bash
echo -e "\n[STEP 2] CONNECTION DIAGNOSTICS" | tee -a $INCIDENT_LOG
echo "--------------------------------" | tee -a $INCIDENT_LOG
# Test local connection
echo "Testing local connection..." | tee -a $INCIDENT_LOG
if sudo -u postgres psql -c "SELECT 1;" > /dev/null 2>&1; then
echo "✅ Local connections: OK" | tee -a $INCIDENT_LOG
# Get connection stats
sudo -u postgres psql -t -c "
SELECT 'Active connections: ' || count(*)
FROM pg_stat_activity
WHERE state != 'idle';" | tee -a $INCIDENT_LOG
# Check for connection exhaustion
MAX_CONN=$(sudo -u postgres psql -t -c "SHOW max_connections;")
CURRENT_CONN=$(sudo -u postgres psql -t -c "SELECT count(*) FROM pg_stat_activity;")
echo "Connections: $CURRENT_CONN / $MAX_CONN" | tee -a $INCIDENT_LOG
if [ $CURRENT_CONN -gt $(( MAX_CONN * 90 / 100 )) ]; then
echo "⚠️ WARNING: Connection pool nearly exhausted" | tee -a $INCIDENT_LOG
echo "Terminating idle connections..." | tee -a $INCIDENT_LOG
# Kill idle connections older than 10 minutes
sudo -u postgres psql << 'EOF' | tee -a $INCIDENT_LOG
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes'
AND pid != pg_backend_pid();
EOF
fi
else
echo "❌ CRITICAL: Cannot connect to PostgreSQL" | tee -a $INCIDENT_LOG
echo "Checking PostgreSQL logs..." | tee -a $INCIDENT_LOG
tail -50 /var/log/postgresql/postgresql-*.log | tee -a $INCIDENT_LOG
fi
# Check network connectivity
echo -e "\nNetwork status:" | tee -a $INCIDENT_LOG
ip addr show | grep "inet " | tee -a $INCIDENT_LOG
```
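
One robustness note on the arithmetic above: `psql -t` pads its output with spaces and a trailing newline, which trips up quoted `[ ... ]` comparisons. A sketch of trimming before comparing, with the psql calls simulated by `echo` so the logic runs standalone:

```bash
# psql -t pads output with whitespace; trim it before doing arithmetic.
MAX_CONN="$(echo ' 100' | tr -d '[:space:]')"     # stand-in for: SHOW max_connections;
CURRENT_CONN="$(echo ' 93' | tr -d '[:space:]')"  # stand-in for: SELECT count(*) FROM pg_stat_activity;

THRESHOLD=$(( MAX_CONN * 90 / 100 ))
if [ "$CURRENT_CONN" -gt "$THRESHOLD" ]; then
    echo "WARNING: ${CURRENT_CONN}/${MAX_CONN} connections (threshold ${THRESHOLD})"
fi
```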
## Step 3: Performance Emergency Response
```bash
echo -e "\n[STEP 3] PERFORMANCE TRIAGE" | tee -a $INCIDENT_LOG
echo "----------------------------" | tee -a $INCIDENT_LOG
# Find and kill long-running queries
echo "Checking for blocked/long queries..." | tee -a $INCIDENT_LOG
sudo -u postgres psql << 'EOF' | tee -a $INCIDENT_LOG
-- Queries running longer than 5 minutes
SELECT
pid,
now() - query_start as duration,
state,
LEFT(query, 100) as query_preview
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
-- Kill queries running longer than 30 minutes
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - query_start > interval '30 minutes'
AND pid != pg_backend_pid();
EOF
# Check for locks
echo -e "\nChecking for lock conflicts..." | tee -a $INCIDENT_LOG
sudo -u postgres psql << 'EOF' | tee -a $INCIDENT_LOG
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;
EOF
```
## Step 4: Data Integrity Check
```bash
echo -e "\n[STEP 4] DATA INTEGRITY CHECK" | tee -a $INCIDENT_LOG
echo "------------------------------" | tee -a $INCIDENT_LOG
# Check for corruption indicators
echo "Checking for corruption indicators..." | tee -a $INCIDENT_LOG
# Check PostgreSQL data directory
DATA_DIR="/var/lib/postgresql/16/main"
if [ -d "$DATA_DIR" ]; then
echo "Data directory exists: $DATA_DIR" | tee -a $INCIDENT_LOG
# Check for recovery in progress (data dir is readable only by postgres/root)
if sudo test -f "$DATA_DIR/recovery.signal"; then
echo "⚠️ Recovery in progress!" | tee -a $INCIDENT_LOG
fi
# Check WAL status
WAL_COUNT=$(sudo find "$DATA_DIR/pg_wal" -name '*.partial' 2>/dev/null | wc -l)
if [ $WAL_COUNT -gt 0 ]; then
echo "⚠️ Partial WAL files detected: $WAL_COUNT" | tee -a $INCIDENT_LOG
fi
else
echo "❌ CRITICAL: Data directory not found!" | tee -a $INCIDENT_LOG
fi
# Run basic integrity check
echo -e "\nRunning integrity checks..." | tee -a $INCIDENT_LOG
for DB in $(sudo -u postgres psql -t -c "SELECT datname FROM pg_database WHERE datistemplate = false;"); do
echo "Checking database: $DB" | tee -a $INCIDENT_LOG
sudo -u postgres psql -d $DB -c "SELECT 1;" > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo " ✅ Database $DB is accessible" | tee -a $INCIDENT_LOG
else
echo " ❌ Database $DB has issues!" | tee -a $INCIDENT_LOG
fi
done
```
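
The `for DB in $(...)` loop above relies on shell word splitting, which is fine for one-word database names but silently mangles anything else. A `while read` loop is the safer pattern; here `printf` stands in for the psql query:

```bash
# Word-splitting-safe iteration over one-name-per-line output
# (printf stands in for: sudo -u postgres psql -t -c "SELECT datname FROM pg_database ...").
printf '%s\n' acme_db globex_db initech_db |
while IFS= read -r DB; do
    [ -z "$DB" ] && continue   # psql -t emits a trailing blank line
    echo "Checking database: $DB"
done
```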
## Step 5: Emergency Recovery Actions
```bash
echo -e "\n[STEP 5] RECOVERY ACTIONS" | tee -a $INCIDENT_LOG
echo "-------------------------" | tee -a $INCIDENT_LOG
# Determine if recovery is needed
read -p "Do you need to initiate emergency recovery? (yes/no): " NEED_RECOVERY
if [ "$NEED_RECOVERY" = "yes" ]; then
echo "Starting emergency recovery procedures..." | tee -a $INCIDENT_LOG
# Option 1: Restart in single-user mode for repairs
echo "Option 1: Single-user mode repair" | tee -a $INCIDENT_LOG
echo "Command: sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D $DATA_DIR" | tee -a $INCIDENT_LOG
# Option 2: Restore from backup
echo "Option 2: Restore from backup" | tee -a $INCIDENT_LOG
# Check available backups
if command -v pgbackrest &> /dev/null; then
echo "Available backups:" | tee -a $INCIDENT_LOG
sudo -u postgres pgbackrest --stanza=fairdb info 2>&1 | tee -a $INCIDENT_LOG
fi
# Option 3: Point-in-time recovery
echo "Option 3: Point-in-time recovery" | tee -a $INCIDENT_LOG
echo "Use: /opt/fairdb/scripts/restore-pitr.sh 'YYYY-MM-DD HH:MM:SS'" | tee -a $INCIDENT_LOG
read -p "Select recovery option (1/2/3/none): " RECOVERY_OPTION
case $RECOVERY_OPTION in
    1)
        echo "Starting single-user mode..." | tee -a $INCIDENT_LOG
        sudo systemctl stop postgresql
        sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D $DATA_DIR
        ;;
    2)
        echo "Starting backup restore..." | tee -a $INCIDENT_LOG
        read -p "Enter backup label to restore: " BACKUP_LABEL
        sudo systemctl stop postgresql
        sudo -u postgres pgbackrest --stanza=fairdb --set=$BACKUP_LABEL restore
        sudo systemctl start postgresql
        ;;
    3)
        echo "Starting PITR..." | tee -a $INCIDENT_LOG
        read -p "Enter target time (YYYY-MM-DD HH:MM:SS): " TARGET_TIME
        /opt/fairdb/scripts/restore-pitr.sh "$TARGET_TIME"
        ;;
    *)
        echo "No recovery action taken" | tee -a $INCIDENT_LOG
        ;;
esac
fi
```
## Step 6: Customer Communication
```bash
echo -e "\n[STEP 6] CUSTOMER IMPACT ASSESSMENT" | tee -a $INCIDENT_LOG
echo "------------------------------------" | tee -a $INCIDENT_LOG
# Identify affected customers
echo "Affected customer databases:" | tee -a $INCIDENT_LOG
AFFECTED_DBS=$(sudo -u postgres psql -t -c "
SELECT datname FROM pg_database
WHERE datname NOT IN ('postgres', 'template0', 'template1')
ORDER BY datname;")
for DB in $AFFECTED_DBS; do
# Check if database is accessible
if sudo -u postgres psql -d $DB -c "SELECT 1;" > /dev/null 2>&1; then
echo "$DB - Operational" | tee -a $INCIDENT_LOG
else
echo "$DB - IMPACTED" | tee -a $INCIDENT_LOG
fi
done
# Generate customer notification
cat << EOF | tee -a $INCIDENT_LOG
CUSTOMER NOTIFICATION TEMPLATE
===============================
Subject: FairDB Service Incident - $INCIDENT_ID
Dear Customer,
We are currently experiencing a service incident affecting FairDB PostgreSQL services.
Incident ID: $INCIDENT_ID
Start Time: $(date)
Severity: [P1/P2/P3/P4]
Status: Investigating / Identified / Monitoring / Resolved
Impact:
[Describe customer impact]
Current Actions:
[List recovery actions being taken]
Next Update:
We will provide an update within 30 minutes or sooner if the situation changes.
We apologize for any inconvenience and are working to resolve this as quickly as possible.
For urgent matters, please contact our emergency hotline: [PHONE]
Regards,
FairDB Operations Team
EOF
```
## Step 7: Post-Incident Checklist
```bash
echo -e "\n[STEP 7] STABILIZATION CHECKLIST" | tee -a $INCIDENT_LOG
echo "---------------------------------" | tee -a $INCIDENT_LOG
# Verification checklist
cat << 'EOF' | tee -a $INCIDENT_LOG
Post-Recovery Verification:
[ ] PostgreSQL service running
[ ] All customer databases accessible
[ ] Backup system operational
[ ] Monitoring alerts cleared
[ ] Network connectivity verified
[ ] Disk space adequate (>20% free)
[ ] CPU usage normal (<80%)
[ ] Memory usage normal (<90%)
[ ] No blocking locks
[ ] No long-running queries
[ ] Recent backup available
[ ] Customer access verified
[ ] Incident documented
[ ] Root cause identified
[ ] Prevention plan created
EOF
# Final status
echo -e "\n[FINAL STATUS]" | tee -a $INCIDENT_LOG
echo "==============" | tee -a $INCIDENT_LOG
/usr/local/bin/fairdb-health-check | head -20 | tee -a $INCIDENT_LOG
```
## Step 8: Root Cause Analysis
```bash
echo -e "\n[STEP 8] ROOT CAUSE ANALYSIS" | tee -a $INCIDENT_LOG
echo "-----------------------------" | tee -a $INCIDENT_LOG
# Collect evidence
echo "Collecting evidence for RCA..." | tee -a $INCIDENT_LOG
# System logs
echo -e "\nSystem logs (last hour):" | tee -a $INCIDENT_LOG
sudo journalctl --since "1 hour ago" -p err --no-pager | tail -20 | tee -a $INCIDENT_LOG
# PostgreSQL logs
echo -e "\nPostgreSQL error logs:" | tee -a $INCIDENT_LOG
find /var/log/postgresql -name "*.log" -mmin -60 -exec grep -i "error\|fatal\|panic" {} \; | tail -20 | tee -a $INCIDENT_LOG
# Resource history
echo -e "\nResource usage history:" | tee -a $INCIDENT_LOG
sar -u -f /var/log/sysstat/sa$(date +%d) 2>/dev/null | tail -10 | tee -a $INCIDENT_LOG
# Create RCA document
cat << EOF | tee /opt/fairdb/incidents/${INCIDENT_ID}-rca.md
# Root Cause Analysis - $INCIDENT_ID
## Incident Summary
- **Date/Time**: $(date)
- **Duration**: [TO BE FILLED]
- **Severity**: [P1/P2/P3/P4]
- **Impact**: [Number of customers/databases affected]
## Timeline
[Document sequence of events]
## Root Cause
[Identify primary cause]
## Contributing Factors
[List any contributing factors]
## Resolution
[Describe how the incident was resolved]
## Lessons Learned
[What was learned from this incident]
## Action Items
[ ] [Prevention measure 1]
[ ] [Prevention measure 2]
[ ] [Monitoring improvement]
## Metrics
- Time to Detection: [minutes]
- Time to Resolution: [minutes]
- Customer Impact Duration: [minutes]
Generated: $(date)
EOF
echo -e "\n================================================" | tee -a $INCIDENT_LOG
echo " INCIDENT RESPONSE COMPLETED" | tee -a $INCIDENT_LOG
echo " Incident ID: $INCIDENT_ID" | tee -a $INCIDENT_LOG
echo " Log saved to: $INCIDENT_LOG" | tee -a $INCIDENT_LOG
echo " RCA template: /opt/fairdb/incidents/${INCIDENT_ID}-rca.md" | tee -a $INCIDENT_LOG
echo "================================================" | tee -a $INCIDENT_LOG
```
## Emergency Contacts
Keep these contacts readily available:
- PostgreSQL Expert: [Contact info]
- Infrastructure Team: [Contact info]
- Customer Success: [Contact info]
- Management Escalation: [Contact info]
## Quick Reference Commands
```bash
# Emergency service control
sudo systemctl stop postgresql
sudo systemctl start postgresql
sudo systemctl restart postgresql
# Kill all connections
sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid != pg_backend_pid();"
# Emergency single-user mode
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D /var/lib/postgresql/16/main
# Force checkpoint
sudo -u postgres psql -c "CHECKPOINT;"
# Emergency vacuum
sudo -u postgres vacuumdb --all --analyze-in-stages
# Check data checksums
sudo -u postgres /usr/lib/postgresql/16/bin/pg_checksums -D /var/lib/postgresql/16/main --check
```


@@ -0,0 +1,459 @@
---
name: fairdb-health-check
description: Comprehensive health check for FairDB PostgreSQL infrastructure
model: sonnet
---
# FairDB System Health Check
Perform a comprehensive health check of the FairDB PostgreSQL infrastructure including server resources, database status, backup integrity, and customer databases.
## System Health Overview
```bash
#!/bin/bash
# FairDB Comprehensive Health Check
echo "================================================"
echo " FairDB System Health Check"
echo " $(date '+%Y-%m-%d %H:%M:%S')"
echo "================================================"
```
## Step 1: Server Resources Check
```bash
echo -e "\n[1/10] SERVER RESOURCES"
echo "------------------------"
# CPU Usage
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
echo "CPU Usage: ${CPU_USAGE}%"
if (( $(echo "$CPU_USAGE > 80" | bc -l) )); then
echo "⚠️ WARNING: High CPU usage detected"
fi
# Memory Usage
MEM_INFO=$(free -m | awk 'NR==2{printf "Memory: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }')
echo "$MEM_INFO"
MEM_PERCENT=$(free | grep Mem | awk '{print $3/$2 * 100.0}')
if (( $(echo "$MEM_PERCENT > 90" | bc -l) )); then
echo "⚠️ WARNING: High memory usage detected"
fi
# Disk Usage
echo "Disk Usage:"
df -h | grep -E '^/dev/' | while read line; do
USAGE=$(echo $line | awk '{print $5}' | sed 's/%//')
MOUNT=$(echo $line | awk '{print $6}')
echo " $MOUNT: $line"
if [ $USAGE -gt 85 ]; then
echo " ⚠️ WARNING: Disk space critical on $MOUNT"
fi
done
# Load Average
LOAD=$(uptime | awk -F'load average:' '{print $2}')
echo "Load Average:$LOAD"
CORES=$(nproc)
LOAD_1=$(echo $LOAD | cut -d, -f1 | tr -d ' ')
if (( $(echo "$LOAD_1 > $CORES" | bc -l) )); then
echo "⚠️ WARNING: High load average detected"
fi
```
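
The float comparisons above depend on `bc`, which is not installed on every minimal host. `awk` can serve as a drop-in replacement; a sketch:

```bash
# Float comparison without bc: awk exits 0 when the condition holds.
above() {   # usage: above VALUE THRESHOLD
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

CPU_USAGE="83.5"
if above "$CPU_USAGE" 80; then
    echo "⚠️ WARNING: High CPU usage detected (${CPU_USAGE}%)"
fi
```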
## Step 2: PostgreSQL Service Status
```bash
echo -e "\n[2/10] POSTGRESQL SERVICE"
echo "-------------------------"
# Check if PostgreSQL is running
if systemctl is-active --quiet postgresql; then
echo "✅ PostgreSQL service: RUNNING"
# Get version and uptime
sudo -u postgres psql -t -c "SELECT version();" | head -1
UPTIME=$(sudo -u postgres psql -t -c "
SELECT now() - pg_postmaster_start_time() as uptime;")
echo "Uptime: $UPTIME"
else
echo "❌ CRITICAL: PostgreSQL service is NOT running!"
echo "Attempting to start..."
sudo systemctl start postgresql
sleep 5
if systemctl is-active --quiet postgresql; then
echo "✅ Service restarted successfully"
else
echo "❌ Failed to start PostgreSQL - manual intervention required!"
exit 1
fi
fi
# Check PostgreSQL cluster status
sudo pg_lsclusters
```
## Step 3: Database Connections
```bash
echo -e "\n[3/10] DATABASE CONNECTIONS"
echo "---------------------------"
# Connection statistics
sudo -u postgres psql -t << EOF
SELECT
'Total Connections: ' || count(*) || '/' || setting AS connection_info
FROM pg_stat_activity, pg_settings
WHERE pg_settings.name = 'max_connections'
GROUP BY setting;
EOF
# Connections by database
echo -e "\nConnections by database:"
sudo -u postgres psql -t -c "
SELECT datname, count(*) as connections
FROM pg_stat_activity
GROUP BY datname
ORDER BY connections DESC;"
# Connections by user
echo -e "\nConnections by user:"
sudo -u postgres psql -t -c "
SELECT usename, count(*) as connections
FROM pg_stat_activity
GROUP BY usename
ORDER BY connections DESC;"
# Check for idle connections
IDLE_COUNT=$(sudo -u postgres psql -t -c "
SELECT count(*)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes';")
if [ $IDLE_COUNT -gt 10 ]; then
echo "⚠️ WARNING: $IDLE_COUNT idle connections older than 10 minutes"
fi
```
## Step 4: Database Performance Metrics
```bash
echo -e "\n[4/10] PERFORMANCE METRICS"
echo "--------------------------"
# Cache hit ratio
sudo -u postgres psql -t << 'EOF'
SELECT
'Cache Hit Ratio: ' ||
ROUND(100.0 * sum(heap_blks_hit) /
NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2) || '%'
FROM pg_statio_user_tables;
EOF
# Transaction statistics
sudo -u postgres psql -t -c "
SELECT
'Transactions: ' || xact_commit || ' commits, ' ||
xact_rollback || ' rollbacks, ' ||
ROUND(100.0 * xact_rollback / NULLIF(xact_commit + xact_rollback, 0), 2) || '% rollback rate'
FROM pg_stat_database
WHERE datname = 'postgres';"
# Longest running queries
echo -e "\nLong-running queries (>1 minute):"
sudo -u postgres psql -t -c "
SELECT pid, now() - query_start as duration,
LEFT(query, 50) as query_preview
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '1 minute'
ORDER BY duration DESC
LIMIT 5;"
# Table bloat check
echo -e "\nTable bloat (top 5):"
sudo -u postgres psql -t << 'EOF'
SELECT
schemaname || '.' || tablename AS table,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
ROUND(100 * pg_total_relation_size(schemaname||'.'||tablename) /
NULLIF(sum(pg_total_relation_size(schemaname||'.'||tablename))
OVER (), 0), 2) AS percentage
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 5;
EOF
```
## Step 5: Backup Status
```bash
echo -e "\n[5/10] BACKUP STATUS"
echo "--------------------"
# Check pgBackRest status
if command -v pgbackrest &> /dev/null; then
echo "pgBackRest Status:"
# Get all stanzas
STANZAS=$(sudo -u postgres pgbackrest info --output=json 2>/dev/null | jq -r '.[].name' 2>/dev/null)
if [ -z "$STANZAS" ]; then
echo "⚠️ WARNING: No backup stanzas configured"
else
for STANZA in $STANZAS; do
echo -e "\nStanza: $STANZA"
# Get last backup info
LAST_BACKUP=$(sudo -u postgres pgbackrest --stanza=$STANZA info --output=json 2>/dev/null | \
jq -r '.[] | select(.name=="'$STANZA'") | .backup[-1].timestamp.stop' 2>/dev/null)
if [ ! -z "$LAST_BACKUP" ]; then
echo " Last backup: $LAST_BACKUP"
# Calculate age in hours
BACKUP_AGE=$(( ($(date +%s) - $(date -d "$LAST_BACKUP" +%s)) / 3600 ))
if [ $BACKUP_AGE -gt 25 ]; then
echo " ⚠️ WARNING: Last backup is $BACKUP_AGE hours old"
else
echo " ✅ Backup is current ($BACKUP_AGE hours old)"
fi
else
echo " ❌ ERROR: No backups found for this stanza"
fi
done
fi
else
echo "❌ ERROR: pgBackRest is not installed"
fi
# Check WAL archiving
WAL_STATUS=$(sudo -u postgres psql -t -c "SHOW archive_mode;" | tr -d '[:space:]')
echo -e "\nWAL Archiving: $WAL_STATUS"
if [ "$WAL_STATUS" = "on" ]; then
LAST_ARCHIVED=$(sudo -u postgres psql -t -c "
SELECT age(now(), last_archived_time)
FROM pg_stat_archiver;")
echo "Last WAL archived: $LAST_ARCHIVED ago"
fi
```
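
The backup-age arithmetic above is worth seeing with fixed timestamps (GNU `date -d` assumed; in the live check the values come from pgBackRest metadata):

```bash
# Hours between two timestamps, as used for the 25-hour staleness check.
LAST_BACKUP="2024-01-01 00:00:00"
NOW="2024-01-02 02:00:00"
BACKUP_AGE=$(( ($(TZ=UTC date -d "$NOW" +%s) - $(TZ=UTC date -d "$LAST_BACKUP" +%s)) / 3600 ))
echo "Backup age: ${BACKUP_AGE} hours"
if [ "$BACKUP_AGE" -gt 25 ]; then
    echo "⚠️ WARNING: Last backup is $BACKUP_AGE hours old"
fi
```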
## Step 6: Replication Status
```bash
echo -e "\n[6/10] REPLICATION STATUS"
echo "-------------------------"
# Check if this is a primary or replica
IS_PRIMARY=$(sudo -u postgres psql -t -c "SELECT pg_is_in_recovery();" | tr -d '[:space:]')
if [ "$IS_PRIMARY" = "f" ]; then
echo "Role: PRIMARY"
# Check replication slots
REP_SLOTS=$(sudo -u postgres psql -t -c "
SELECT count(*) FROM pg_replication_slots WHERE active = true;")
echo "Active replication slots: $REP_SLOTS"
# Check connected replicas
sudo -u postgres psql -t -c "
SELECT client_addr, state, sync_state,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) as lag
FROM pg_stat_replication;" 2>/dev/null
else
echo "Role: REPLICA"
# Check replication lag
LAG=$(sudo -u postgres psql -t -c "
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag;")
echo "Replication lag: ${LAG} seconds"
if (( $(echo "$LAG > 60" | bc -l) )); then
echo "⚠️ WARNING: High replication lag detected"
fi
fi
```
## Step 7: Security Audit
```bash
echo -e "\n[7/10] SECURITY AUDIT"
echo "---------------------"
# Check for default passwords
echo "Checking for common issues..."
# SSL status
SSL_STATUS=$(sudo -u postgres psql -t -c "SHOW ssl;" | tr -d '[:space:]')
echo "SSL: $SSL_STATUS"
if [ "$SSL_STATUS" != "on" ]; then
echo "⚠️ WARNING: SSL is not enabled"
fi
# Check for users without passwords
NO_PASS=$(sudo -u postgres psql -t -c "
SELECT count(*) FROM pg_shadow WHERE passwd IS NULL;")
if [ $NO_PASS -gt 0 ]; then
echo "⚠️ WARNING: $NO_PASS users without passwords"
fi
# Check firewall status
if sudo ufw status | grep -q "Status: active"; then
echo "✅ Firewall: ACTIVE"
else
echo "⚠️ WARNING: Firewall is not active"
fi
# Check fail2ban status
if systemctl is-active --quiet fail2ban; then
echo "✅ Fail2ban: RUNNING"
JAIL_STATUS=$(sudo fail2ban-client status postgresql 2>/dev/null | grep "Currently banned" || echo "Jail not configured")
echo " PostgreSQL jail: $JAIL_STATUS"
else
echo "⚠️ WARNING: Fail2ban is not running"
fi
```
## Step 8: Customer Database Health
```bash
echo -e "\n[8/10] CUSTOMER DATABASES"
echo "-------------------------"
# Check each customer database
CUSTOMER_DBS=$(sudo -u postgres psql -t -c "
SELECT datname FROM pg_database
WHERE datname NOT IN ('postgres', 'template0', 'template1')
ORDER BY datname;")
for DB in $CUSTOMER_DBS; do
echo -e "\nDatabase: $DB"
# Size
SIZE=$(sudo -u postgres psql -t -c "
SELECT pg_size_pretty(pg_database_size('$DB'));")
echo " Size: $SIZE"
# Connection count
CONN=$(sudo -u postgres psql -t -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = '$DB';")
echo " Connections: $CONN"
# Transaction rate
TPS=$(sudo -u postgres psql -t -c "
SELECT xact_commit + xact_rollback as transactions
FROM pg_stat_database WHERE datname = '$DB';")
echo " Total transactions: $TPS"
# Check for locks
LOCKS=$(sudo -u postgres psql -t -d $DB -c "
SELECT count(*) FROM pg_locks WHERE granted = false;")
if [ $LOCKS -gt 0 ]; then
echo " ⚠️ WARNING: $LOCKS blocked locks detected"
fi
done
```
## Step 9: System Logs Analysis
```bash
echo -e "\n[9/10] LOG ANALYSIS"
echo "-------------------"
# Check PostgreSQL logs for errors
LOG_DIR="/var/log/postgresql"
if [ -d "$LOG_DIR" ]; then
echo "Recent PostgreSQL errors (last 24 hours):"
find $LOG_DIR -name "*.log" -mtime -1 -exec grep -i "error\|fatal\|panic" {} \; | \
tail -10 | head -5
ERROR_COUNT=$(find $LOG_DIR -name "*.log" -mtime -1 -exec grep -i "error\|fatal\|panic" {} \; | wc -l)
echo "Total errors in last 24 hours: $ERROR_COUNT"
if [ $ERROR_COUNT -gt 100 ]; then
echo "⚠️ WARNING: High error rate detected"
fi
fi
# Check system logs
echo -e "\nRecent system issues:"
sudo journalctl -p err --since "24 hours ago" --no-pager | tail -5
```
## Step 10: Recommendations
```bash
echo -e "\n[10/10] HEALTH SUMMARY & RECOMMENDATIONS"
echo "========================================="
# Collect all warnings
# (increment these wherever a ⚠️/❌ is emitted above, e.g. WARNINGS=$((WARNINGS + 1));
# otherwise the health score below always reports GOOD)
WARNINGS=0
CRITICAL=0
# Generate recommendations based on findings
echo -e "\nRecommendations:"
# Check if vacuum is needed
LAST_VACUUM=$(sudo -u postgres psql -t -c "
SELECT MAX(last_autovacuum) FROM pg_stat_user_tables;")
echo "- Last autovacuum: $LAST_VACUUM"
# Check if analyze is needed
LAST_ANALYZE=$(sudo -u postgres psql -t -c "
SELECT MAX(last_autoanalyze) FROM pg_stat_user_tables;")
echo "- Last autoanalyze: $LAST_ANALYZE"
# Generate overall health score
echo -e "\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
if [ $CRITICAL -eq 0 ] && [ $WARNINGS -lt 3 ]; then
echo "✅ OVERALL HEALTH: GOOD"
elif [ $CRITICAL -eq 0 ] && [ $WARNINGS -lt 10 ]; then
echo "⚠️ OVERALL HEALTH: FAIR - Review warnings"
else
echo "❌ OVERALL HEALTH: POOR - Immediate action required"
fi
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
# Save report (the script only prints to stdout; capture it explicitly)
REPORT_FILE="/opt/fairdb/logs/health-check-$(date +%Y%m%d-%H%M%S).log"
echo -e "\nTo save this report, run: fairdb-health-check | tee $REPORT_FILE"
```
## Actions Based on Results
### If Critical Issues Found:
1. Check PostgreSQL service status
2. Review disk space availability
3. Verify backup integrity
4. Check for data corruption
5. Review security vulnerabilities
### If Warnings Found:
1. Schedule maintenance window
2. Plan capacity upgrades
3. Review query performance
4. Update monitoring thresholds
5. Document issues for trending
### Regular Maintenance Tasks:
1. Run VACUUM ANALYZE on large tables
2. Update table statistics
3. Review and optimize slow queries
4. Clean up old logs
5. Test backup restoration
## Schedule Next Health Check
```bash
# Schedule regular health checks
echo "30 */6 * * * root /usr/local/bin/fairdb-health-check > /dev/null 2>&1" | \
sudo tee /etc/cron.d/fairdb-health-check
echo "Health checks scheduled every 6 hours"
```


@@ -0,0 +1,446 @@
---
name: fairdb-onboard-customer
description: Complete customer onboarding workflow for FairDB PostgreSQL service
model: sonnet
---
# FairDB Customer Onboarding Workflow
You are onboarding a new customer for FairDB PostgreSQL as a Service. This comprehensive workflow creates their database, users, configures access, sets up backups, and provides connection details.
## Step 1: Gather Customer Information
Collect these details:
1. **Customer Name**: Company/organization name
2. **Database Name**: Preferred database name (lowercase, no spaces)
3. **Primary Contact**: Name and email
4. **Plan Type**: Starter/Professional/Enterprise
5. **IP Allowlist**: Customer IP addresses for access
6. **Special Requirements**: Extensions, configurations, etc.
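
The database-name convention ("lowercase, no spaces") is easy to enforce up front. A hypothetical validator sketch (the exact character set is an assumption):

```bash
# Accept lowercase letters, digits, and underscores, starting with a letter.
valid_db_name() {
    printf '%s' "$1" | grep -Eq '^[a-z][a-z0-9_]*$'
}

valid_db_name "acme_prod" && echo "name ok"
valid_db_name "Acme Prod" || echo "name rejected"   # uppercase and space
```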
## Step 2: Validate Resources
```bash
# Check available resources
df -h /var/lib/postgresql
free -h
sudo -u postgres psql -c "SELECT count(*) as database_count FROM pg_database WHERE datistemplate = false;"
# Check current connections
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
```
## Step 3: Create Customer Database
```bash
# Set customer variables
CUSTOMER_NAME="customer_name" # Replace with actual
DB_NAME="${CUSTOMER_NAME}_db"
DB_OWNER="${CUSTOMER_NAME}_owner"
DB_USER="${CUSTOMER_NAME}_user"
DB_READONLY="${CUSTOMER_NAME}_readonly"
# Generate secure passwords
DB_OWNER_PASS=$(openssl rand -base64 32)
DB_USER_PASS=$(openssl rand -base64 32)
DB_READONLY_PASS=$(openssl rand -base64 32)
# Create database and users
sudo -u postgres psql << EOF
-- Create database owner role
CREATE ROLE ${DB_OWNER} WITH LOGIN PASSWORD '${DB_OWNER_PASS}'
CREATEDB CREATEROLE CONNECTION LIMIT 5;
-- Create application user
CREATE ROLE ${DB_USER} WITH LOGIN PASSWORD '${DB_USER_PASS}'
CONNECTION LIMIT 50;
-- Create read-only user
CREATE ROLE ${DB_READONLY} WITH LOGIN PASSWORD '${DB_READONLY_PASS}'
CONNECTION LIMIT 10;
-- Create customer database
CREATE DATABASE ${DB_NAME}
WITH OWNER = ${DB_OWNER}
ENCODING = 'UTF8'
LC_COLLATE = 'en_US.UTF-8'
LC_CTYPE = 'en_US.UTF-8'
TEMPLATE = template0
CONNECTION LIMIT = 100;
-- Configure database
\c ${DB_NAME}
-- Create schema
CREATE SCHEMA IF NOT EXISTS ${CUSTOMER_NAME} AUTHORIZATION ${DB_OWNER};
-- Grant permissions
GRANT CONNECT ON DATABASE ${DB_NAME} TO ${DB_USER}, ${DB_READONLY};
GRANT USAGE ON SCHEMA ${CUSTOMER_NAME} TO ${DB_USER}, ${DB_READONLY};
GRANT CREATE ON SCHEMA ${CUSTOMER_NAME} TO ${DB_USER};
-- Default privileges for tables
ALTER DEFAULT PRIVILEGES FOR ROLE ${DB_OWNER} IN SCHEMA ${CUSTOMER_NAME}
GRANT ALL ON TABLES TO ${DB_USER};
ALTER DEFAULT PRIVILEGES FOR ROLE ${DB_OWNER} IN SCHEMA ${CUSTOMER_NAME}
GRANT SELECT ON TABLES TO ${DB_READONLY};
-- Default privileges for sequences
ALTER DEFAULT PRIVILEGES FOR ROLE ${DB_OWNER} IN SCHEMA ${CUSTOMER_NAME}
GRANT ALL ON SEQUENCES TO ${DB_USER};
ALTER DEFAULT PRIVILEGES FOR ROLE ${DB_OWNER} IN SCHEMA ${CUSTOMER_NAME}
GRANT SELECT ON SEQUENCES TO ${DB_READONLY};
-- Enable useful extensions
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS citext;
EOF
echo "Database ${DB_NAME} created successfully"
```
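
One caveat on the passwords above: `openssl rand -base64 32` can emit `/`, `+`, and `=`, which some connection-string parsers mishandle. An alphanumeric-only alternative, sketched:

```bash
# 32-character alphanumeric password (avoids base64's /, + and = characters).
gen_password() {
    tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 32
}

DB_USER_PASS="$(gen_password)"
echo "${#DB_USER_PASS}"   # -> 32
```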
## Step 4: Configure Network Access
```bash
# Add customer IP to pg_hba.conf
CUSTOMER_IP="203.0.113.0/32" # Replace with actual customer IP
# Backup pg_hba.conf
sudo cp /etc/postgresql/16/main/pg_hba.conf /etc/postgresql/16/main/pg_hba.conf.$(date +%Y%m%d)
# Add customer access rules
cat << EOF | sudo tee -a /etc/postgresql/16/main/pg_hba.conf
# Customer: ${CUSTOMER_NAME}
hostssl ${DB_NAME} ${DB_OWNER} ${CUSTOMER_IP} scram-sha-256
hostssl ${DB_NAME} ${DB_USER} ${CUSTOMER_IP} scram-sha-256
hostssl ${DB_NAME} ${DB_READONLY} ${CUSTOMER_IP} scram-sha-256
EOF
# Update firewall
sudo ufw allow from ${CUSTOMER_IP} to any port 5432 comment "FairDB: ${CUSTOMER_NAME}"
# Reload PostgreSQL configuration
sudo systemctl reload postgresql
```
## Step 5: Set Resource Limits
```bash
# Configure per-database resource limits based on plan
PLAN_TYPE="starter"   # Replace with the customer's actual plan
case "${PLAN_TYPE}" in
"starter")
MAX_CONN=50
WORK_MEM="4MB"
SHARED_BUFFERS="256MB"
;;
"professional")
MAX_CONN=100
WORK_MEM="8MB"
SHARED_BUFFERS="1GB"
;;
"enterprise")
MAX_CONN=200
WORK_MEM="16MB"
SHARED_BUFFERS="4GB"
;;
*)
echo "ERROR: Unknown plan type: ${PLAN_TYPE}" >&2
exit 1
;;
esac
# Apply database-specific settings
sudo -u postgres psql -d ${DB_NAME} << EOF
-- Set connection limit
ALTER DATABASE ${DB_NAME} CONNECTION LIMIT ${MAX_CONN};
-- Set database parameters
ALTER DATABASE ${DB_NAME} SET work_mem = '${WORK_MEM}';
ALTER DATABASE ${DB_NAME} SET maintenance_work_mem = '${WORK_MEM}';
-- Planner estimate only; this does not allocate memory (shared_buffers itself is cluster-wide)
ALTER DATABASE ${DB_NAME} SET effective_cache_size = '${SHARED_BUFFERS}';
ALTER DATABASE ${DB_NAME} SET random_page_cost = 1.1;
-- 'all' is very verbose; consider dropping to 'ddl' or 'mod' once onboarding is verified
ALTER DATABASE ${DB_NAME} SET log_statement = 'all';
ALTER DATABASE ${DB_NAME} SET log_duration = on;
EOF
```
## Step 6: Configure Backup Policy
```bash
# Create customer-specific backup configuration
cat << EOF | sudo tee -a /opt/fairdb/configs/backup-${CUSTOMER_NAME}.conf
# Backup configuration for ${CUSTOMER_NAME}
DATABASE=${DB_NAME}
BACKUP_RETENTION_DAYS=30
BACKUP_SCHEDULE="0 3 * * *" # Daily at 3 AM
BACKUP_TYPE="full"
S3_PREFIX="${CUSTOMER_NAME}/"
EOF
# Add to pgBackRest configuration
sudo tee -a /etc/pgbackrest/pgbackrest.conf << EOF
[${CUSTOMER_NAME}]
pg1-path=/var/lib/postgresql/16/main
pg1-database=${DB_NAME}
pg1-port=5432
process-max=2
repo1-retention-full=4
repo1-retention-diff=7
EOF
# Create backup stanza for customer
sudo -u postgres pgbackrest --stanza=${CUSTOMER_NAME} stanza-create
# Schedule customer backup
echo "0 3 * * * postgres pgbackrest --stanza=${CUSTOMER_NAME} --type=full backup" | \
sudo tee -a /etc/cron.d/fairdb-customer-${CUSTOMER_NAME}
```
## Step 7: Setup Monitoring
```bash
# Create monitoring user and grants
sudo -u postgres psql -d ${DB_NAME} << EOF
-- Grant monitoring permissions
GRANT pg_monitor TO ${DB_READONLY};
-- PG 14+: pg_stat_statements_reset takes (oid, oid, bigint) with defaults; the zero-argument signature no longer exists
GRANT EXECUTE ON FUNCTION pg_stat_statements_reset(oid, oid, bigint) TO ${DB_OWNER};
EOF
# Create customer monitoring script.
# Unquoted EOF: ${DB_NAME}/${CUSTOMER_NAME} expand now; \$-escaped parts run at execution time.
cat << EOF | sudo tee /opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh
#!/bin/bash
# Monitoring script for ${CUSTOMER_NAME}
DB_NAME="${DB_NAME}"
CUSTOMER_NAME="${CUSTOMER_NAME}"
ALERT_THRESHOLD_CONNECTIONS=80
ALERT_THRESHOLD_SIZE_GB=100
# Check connection usage as a percentage of max_connections
CONN_USAGE=\$(sudo -u postgres psql -t -A -c "
SELECT (count(*) * 100.0 / setting::int)::int AS pct
FROM pg_stat_activity, pg_settings
WHERE name = 'max_connections'
AND datname = '\${DB_NAME}'
GROUP BY setting;")
if [ "\${CONN_USAGE:-0}" -gt "\$ALERT_THRESHOLD_CONNECTIONS" ]; then
    echo "ALERT: Connection usage at \${CONN_USAGE}% for \${CUSTOMER_NAME}"
fi
# Check database size
DB_SIZE_GB=\$(sudo -u postgres psql -t -A -c "
SELECT pg_database_size('\${DB_NAME}') / 1024 / 1024 / 1024;")
if [ "\${DB_SIZE_GB:-0}" -gt "\$ALERT_THRESHOLD_SIZE_GB" ]; then
    echo "ALERT: Database size is \${DB_SIZE_GB}GB for \${CUSTOMER_NAME}"
fi
# Check for long-running queries
sudo -u postgres psql -d "\${DB_NAME}" -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state = 'active';"
EOF
sudo chmod +x /opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh
# Add to monitoring cron
echo "*/10 * * * * root /opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh" | \
sudo tee -a /etc/cron.d/fairdb-monitor-${CUSTOMER_NAME}
```
## Step 8: Generate SSL Certificates
```bash
# Create customer SSL certificate
sudo mkdir -p /etc/postgresql/16/main/ssl/${CUSTOMER_NAME}
cd /etc/postgresql/16/main/ssl/${CUSTOMER_NAME}
# Generate customer-specific SSL cert
sudo openssl req -new -x509 -days 365 -nodes \
-out server.crt -keyout server.key \
-subj "/C=US/ST=State/L=City/O=FairDB/OU=${CUSTOMER_NAME}/CN=${CUSTOMER_NAME}.fairdb.io"
# Set permissions
sudo chmod 600 server.key
sudo chown postgres:postgres server.*
# Create client certificate
sudo openssl req -new -nodes \
-out client.csr -keyout client.key \
-subj "/C=US/ST=State/L=City/O=FairDB/OU=${CUSTOMER_NAME}/CN=${DB_USER}"
sudo openssl x509 -req -CAcreateserial \
-in client.csr -CA server.crt -CAkey server.key \
-out client.crt -days 365
# Package client certificates for delivery (sudo: key files are root/postgres-owned)
sudo tar czf /tmp/${CUSTOMER_NAME}-ssl-bundle.tar.gz client.crt client.key server.crt
```
## Step 9: Create Connection Documentation
```bash
# Generate connection details document
cat << EOF > /tmp/${CUSTOMER_NAME}-connection-details.md
# FairDB PostgreSQL Connection Details
## Customer: ${CUSTOMER_NAME}
### Database Information
- **Database Name**: ${DB_NAME}
- **Host**: fairdb-prod.example.com
- **Port**: 5432
- **SSL Required**: Yes
### User Credentials
#### Database Owner (DDL Operations)
- **Username**: ${DB_OWNER}
- **Password**: ${DB_OWNER_PASS}
- **Connection Limit**: 5
- **Permissions**: Full database owner
#### Application User (DML Operations)
- **Username**: ${DB_USER}
- **Password**: ${DB_USER_PASS}
- **Connection Limit**: 50
- **Permissions**: CRUD operations on all tables
#### Read-Only User (Reporting)
- **Username**: ${DB_READONLY}
- **Password**: ${DB_READONLY_PASS}
- **Connection Limit**: 10
- **Permissions**: SELECT only
### Connection Strings
\`\`\`
# Standard connection
postgresql://${DB_USER}:${DB_USER_PASS}@fairdb-prod.example.com:5432/${DB_NAME}?sslmode=require
# With SSL certificate
postgresql://${DB_USER}:${DB_USER_PASS}@fairdb-prod.example.com:5432/${DB_NAME}?sslmode=require&sslcert=client.crt&sslkey=client.key&sslrootcert=server.crt
# JDBC URL
jdbc:postgresql://fairdb-prod.example.com:5432/${DB_NAME}?ssl=true&sslmode=require
# psql command
psql "host=fairdb-prod.example.com port=5432 dbname=${DB_NAME} user=${DB_USER} sslmode=require"
\`\`\`
### Resource Limits
- **Plan**: ${PLAN_TYPE}
- **Max Connections**: ${MAX_CONN}
- **Storage Quota**: Unlimited (pay per GB)
- **Backup Retention**: 30 days
- **Backup Schedule**: Daily at 3:00 AM UTC
### Support Information
- **Email**: support@fairdb.io
- **Emergency**: +1-xxx-xxx-xxxx
- **Documentation**: https://docs.fairdb.io
- **Status Page**: https://status.fairdb.io
### Important Notes
1. Always use SSL connections
2. Rotate passwords every 90 days
3. Monitor connection pool usage
4. Test restore procedures quarterly
5. Keep IP allowlist updated
### Next Steps
1. Download SSL certificates: ${CUSTOMER_NAME}-ssl-bundle.tar.gz
2. Test connection with provided credentials
3. Configure application connection pool
4. Set up monitoring dashboards
5. Review security best practices
Generated: $(date)
EOF
echo "Connection details saved to /tmp/${CUSTOMER_NAME}-connection-details.md"
```
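Steps 3 and 9 assume `DB_OWNER_PASS`, `DB_USER_PASS`, and `DB_READONLY_PASS` were set earlier in the runbook. If they were not, a minimal generator sketch (hex output keeps connection strings free of shell- and URL-quoting surprises; the helper name is an assumption, not part of the runbook):

```shell
# Hypothetical helper: run before Step 3 if the password variables are not already set.
# openssl rand -hex gives a fixed-length string containing only [0-9a-f].
gen_pass() { openssl rand -hex 16; }

DB_OWNER_PASS=$(gen_pass)
DB_USER_PASS=$(gen_pass)
DB_READONLY_PASS=$(gen_pass)
```

Each password is 32 hex characters; restrict any file that captures them (e.g. `chmod 600` on the connection-details document before sending it).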
## Step 10: Final Verification
```bash
# Test all user connections
echo "Testing database connections..."
# Test owner connection
PGPASSWORD=${DB_OWNER_PASS} psql -h localhost -U ${DB_OWNER} -d ${DB_NAME} -c "SELECT current_user, current_database();"
# Test app user connection
PGPASSWORD=${DB_USER_PASS} psql -h localhost -U ${DB_USER} -d ${DB_NAME} -c "SELECT current_user, current_database();"
# Test readonly connection
PGPASSWORD=${DB_READONLY_PASS} psql -h localhost -U ${DB_READONLY} -d ${DB_NAME} -c "SELECT current_user, current_database();"
# Verify backup configuration
sudo -u postgres pgbackrest --stanza=${CUSTOMER_NAME} check
# Check monitoring
/opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh
# Generate onboarding summary
echo "
===========================================
FairDB Customer Onboarding Complete
===========================================
Customer: ${CUSTOMER_NAME}
Database: ${DB_NAME}
Created: $(date)
Plan: ${PLAN_TYPE}
Files Generated:
- /tmp/${CUSTOMER_NAME}-connection-details.md
- /tmp/${CUSTOMER_NAME}-ssl-bundle.tar.gz
Next Actions:
1. Send connection details to customer
2. Schedule onboarding call
3. Monitor initial usage
4. Follow up in 24 hours
===========================================
"
```
## Onboarding Checklist
Verify completion:
- [ ] Database created
- [ ] Users created with secure passwords
- [ ] Network access configured
- [ ] Resource limits applied
- [ ] Backup policy configured
- [ ] Monitoring enabled
- [ ] SSL certificates generated
- [ ] Documentation created
- [ ] Connection tests passed
- [ ] Customer notified
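
Parts of the checklist above can be spot-checked mechanically. A sketch that verifies the files this runbook generates actually exist (the helper name is an assumption; paths are the ones used in Steps 7-9):

```shell
# Print OK/MISSING for each artifact path; returns nonzero if anything is missing
check_artifacts() {
    local missing=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            echo "OK      $f"
        else
            echo "MISSING $f"
            missing=1
        fi
    done
    return $missing
}

# Example (paths from this runbook):
#   check_artifacts \
#       "/tmp/${CUSTOMER_NAME}-connection-details.md" \
#       "/tmp/${CUSTOMER_NAME}-ssl-bundle.tar.gz" \
#       "/opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh"
```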
## Rollback Procedure
If onboarding fails:
```bash
# Remove database and users
sudo -u postgres psql << EOF
DROP DATABASE IF EXISTS ${DB_NAME};
DROP ROLE IF EXISTS ${DB_OWNER};
DROP ROLE IF EXISTS ${DB_USER};
DROP ROLE IF EXISTS ${DB_READONLY};
EOF
# Remove configurations
sudo rm -f /etc/cron.d/fairdb-customer-${CUSTOMER_NAME}
sudo rm -f /etc/cron.d/fairdb-monitor-${CUSTOMER_NAME}
sudo rm -f /opt/fairdb/scripts/monitor-${CUSTOMER_NAME}.sh
sudo rm -rf /etc/postgresql/16/main/ssl/${CUSTOMER_NAME}
# Remove the pg_hba.conf block added in Step 4 (comment line plus three hostssl lines)
sudo sed -i "/^# Customer: ${CUSTOMER_NAME}$/,+3d" /etc/postgresql/16/main/pg_hba.conf
# Remove firewall rule
sudo ufw delete allow from ${CUSTOMER_IP} to any port 5432
# Reload PostgreSQL to apply the pg_hba change
sudo systemctl reload postgresql
echo "Customer ${CUSTOMER_NAME} rollback complete"
```

---
name: fairdb-setup-backup
description: Configure pgBackRest with Wasabi S3 for automated PostgreSQL backups
model: sonnet
---
# FairDB pgBackRest Backup Configuration with Wasabi S3
You are configuring pgBackRest with Wasabi S3 storage for automated PostgreSQL backups. Follow SOP-003 precisely.
## Prerequisites Check
Verify before starting:
1. PostgreSQL 16 is installed and running
2. Wasabi S3 account is active with bucket created
3. AWS CLI credentials are available
4. At least 50GB free disk space for local backups
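
Prerequisite 4 can be checked mechanically. A sketch (the 50 GB threshold is this document's figure; the helper name and checked path are assumptions):

```shell
# Free space in whole GiB at a given path (df -P: portable, one data line, KiB blocks)
free_gb() { df -P "$1" | awk 'NR==2 {print int($4 / 1048576)}'; }

REQUIRED_GB=50
if [ "$(free_gb /var/lib)" -lt "$REQUIRED_GB" ]; then
    echo "WARNING: less than ${REQUIRED_GB}GB free for local backups"
fi
```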
## Step 1: Install pgBackRest
```bash
# pgBackRest is available from the PGDG apt repository (already configured for
# PostgreSQL 16) and from Ubuntu's default repositories
sudo apt-get update

# Install pgBackRest
sudo apt-get install -y pgbackrest
# Verify installation
pgbackrest version
```
## Step 2: Configure Wasabi S3 Credentials
```bash
# Create pgBackRest configuration directory
sudo mkdir -p /etc/pgbackrest
sudo mkdir -p /var/lib/pgbackrest
sudo mkdir -p /var/log/pgbackrest
sudo mkdir -p /var/spool/pgbackrest
# Set ownership
sudo chown -R postgres:postgres /var/lib/pgbackrest
sudo chown -R postgres:postgres /var/log/pgbackrest
sudo chown -R postgres:postgres /var/spool/pgbackrest
# Store Wasabi credentials (secure these!)
export WASABI_ACCESS_KEY="YOUR_WASABI_ACCESS_KEY"
export WASABI_SECRET_KEY="YOUR_WASABI_SECRET_KEY"
export WASABI_BUCKET="fairdb-backups"
export WASABI_REGION="us-east-1" # Or your Wasabi region
export WASABI_ENDPOINT="s3.us-east-1.wasabisys.com" # Adjust for your region
```
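The exported credentials above live only in the current shell session. A sketch for persisting them in a mode-600 env file (the default path under `$HOME` is an assumption for illustration; in production move it somewhere root-only such as `/etc/fairdb/`, or use a proper secret store):

```shell
# Write credentials to a private env file; umask 077 makes it 600 from creation.
# CRED_FILE default is a hypothetical location -- adapt to your secret store.
CRED_FILE="${CRED_FILE:-$HOME/.fairdb-wasabi.env}"
umask 077
mkdir -p "$(dirname "$CRED_FILE")"
cat > "$CRED_FILE" << CREDS
WASABI_ACCESS_KEY=${WASABI_ACCESS_KEY}
WASABI_SECRET_KEY=${WASABI_SECRET_KEY}
CREDS
chmod 600 "$CRED_FILE"

# Later sessions can restore the variables with:
#   set -a; . "$CRED_FILE"; set +a
```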
## Step 3: Create pgBackRest Configuration
```bash
# Create main configuration file
sudo tee /etc/pgbackrest/pgbackrest.conf << EOF
[global]
# General Options
process-max=4
log-level-console=info
log-level-file=detail
start-fast=y
stop-auto=y
archive-async=y
archive-push-queue-max=4GB
spool-path=/var/spool/pgbackrest
# S3 Repository Configuration
repo1-type=s3
repo1-s3-endpoint=${WASABI_ENDPOINT}
repo1-s3-bucket=${WASABI_BUCKET}
repo1-s3-region=${WASABI_REGION}
repo1-s3-key=${WASABI_ACCESS_KEY}
repo1-s3-key-secret=${WASABI_SECRET_KEY}
repo1-path=/pgbackrest
repo1-retention-full=4
repo1-retention-diff=12
repo1-retention-archive=30
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=CHANGE_THIS_PASSPHRASE
# Local Repository (for faster restores)
repo2-type=posix
repo2-path=/var/lib/pgbackrest
repo2-retention-full=2
repo2-retention-diff=6
[fairdb]
# PostgreSQL Configuration
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432
pg1-user=postgres
# Archive Configuration
archive-timeout=60
archive-check=y
backup-standby=n
# Backup Options
compress-type=lz4
compress-level=3
backup-user=backup_user
delta=y
process-max=2
EOF
# Secure the configuration file
sudo chmod 640 /etc/pgbackrest/pgbackrest.conf
sudo chown postgres:postgres /etc/pgbackrest/pgbackrest.conf
```
## Step 4: Configure PostgreSQL for pgBackRest
```bash
# Update PostgreSQL configuration
sudo tee -a /etc/postgresql/16/main/postgresql.conf << 'EOF'
# pgBackRest Archive Configuration
archive_mode = on
archive_command = 'pgbackrest --stanza=fairdb archive-push %p'
archive_timeout = 60
max_wal_senders = 3
wal_level = replica
wal_log_hints = on
EOF
# Restart PostgreSQL
sudo systemctl restart postgresql
```
## Step 5: Initialize Backup Stanza
```bash
# Create the stanza
sudo -u postgres pgbackrest --stanza=fairdb stanza-create
# Verify stanza
sudo -u postgres pgbackrest --stanza=fairdb check
```
## Step 6: Create Backup Scripts
```bash
# Full backup script
sudo tee /opt/fairdb/scripts/backup-full.sh << 'EOF'
#!/bin/bash
set -e
LOG_FILE="/var/log/fairdb/backup-full-$(date +%Y%m%d-%H%M%S).log"
echo "Starting full backup at $(date)" | tee -a $LOG_FILE
# Perform full backup to both repositories
sudo -u postgres pgbackrest --stanza=fairdb --type=full --repo=1 backup 2>&1 | tee -a $LOG_FILE
sudo -u postgres pgbackrest --stanza=fairdb --type=full --repo=2 backup 2>&1 | tee -a $LOG_FILE
# Verify backup
sudo -u postgres pgbackrest --stanza=fairdb --repo=1 info 2>&1 | tee -a $LOG_FILE
echo "Full backup completed at $(date)" | tee -a $LOG_FILE
# Send notification (implement webhook/email here)
curl -X POST $FAIRDB_MONITORING_WEBHOOK \
-H 'Content-Type: application/json' \
-d "{\"text\":\"FairDB full backup completed successfully\"}" 2>/dev/null || true
EOF
# Incremental backup script
sudo tee /opt/fairdb/scripts/backup-incremental.sh << 'EOF'
#!/bin/bash
set -e
LOG_FILE="/var/log/fairdb/backup-incr-$(date +%Y%m%d-%H%M%S).log"
echo "Starting incremental backup at $(date)" | tee -a $LOG_FILE
# Perform incremental backup
sudo -u postgres pgbackrest --stanza=fairdb --type=incr --repo=1 backup 2>&1 | tee -a $LOG_FILE
echo "Incremental backup completed at $(date)" | tee -a $LOG_FILE
EOF
# Differential backup script
sudo tee /opt/fairdb/scripts/backup-differential.sh << 'EOF'
#!/bin/bash
set -e
LOG_FILE="/var/log/fairdb/backup-diff-$(date +%Y%m%d-%H%M%S).log"
echo "Starting differential backup at $(date)" | tee -a $LOG_FILE
# Perform differential backup
sudo -u postgres pgbackrest --stanza=fairdb --type=diff --repo=1 backup 2>&1 | tee -a $LOG_FILE
echo "Differential backup completed at $(date)" | tee -a $LOG_FILE
EOF
# Create the log directory the scripts write to, then make them executable
sudo mkdir -p /var/log/fairdb
sudo chmod +x /opt/fairdb/scripts/backup-*.sh
```
## Step 7: Schedule Automated Backups
```bash
# Add to root's crontab for automated backups
cat << 'EOF' | sudo tee /etc/cron.d/fairdb-backups
# FairDB Automated Backup Schedule
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
# Weekly full backup (Sunday 2 AM)
0 2 * * 0 root /opt/fairdb/scripts/backup-full.sh
# Daily differential backup (Mon-Sat 2 AM)
0 2 * * 1-6 root /opt/fairdb/scripts/backup-differential.sh
# Hourly incremental backup (business hours)
0 9-18 * * 1-5 root /opt/fairdb/scripts/backup-incremental.sh
# Backup verification (daily at 5 AM)
0 5 * * * postgres pgbackrest --stanza=fairdb --repo=1 check
# Archive expiration (daily at 3 AM)
0 3 * * * postgres pgbackrest --stanza=fairdb --repo=1 expire
EOF
```
## Step 8: Create Restore Procedures
```bash
# Point-in-time recovery script
sudo tee /opt/fairdb/scripts/restore-pitr.sh << 'EOF'
#!/bin/bash
# FairDB Point-in-Time Recovery Script
if [ $# -ne 1 ]; then
echo "Usage: $0 'YYYY-MM-DD HH:MM:SS'"
exit 1
fi
TARGET_TIME="$1"
BACKUP_PATH="/var/lib/postgresql/16/main"
echo "WARNING: This will restore the database to $TARGET_TIME"
echo "Current data will be LOST. Continue? (yes/no)"
read CONFIRM
if [ "$CONFIRM" != "yes" ]; then
echo "Restore cancelled"
exit 1
fi
# Stop PostgreSQL
sudo systemctl stop postgresql
# Clear data directory
sudo rm -rf ${BACKUP_PATH}/*
# Restore to target time
sudo -u postgres pgbackrest --stanza=fairdb \
--type=time \
--target="$TARGET_TIME" \
--target-action=promote \
restore
# Start PostgreSQL
sudo systemctl start postgresql
echo "Restore completed. Verify data integrity."
EOF
sudo chmod +x /opt/fairdb/scripts/restore-pitr.sh
```
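`restore-pitr.sh` passes its argument straight through to pgBackRest. A small guard that rejects malformed target times before anything destructive runs (a sketch using GNU `date`; the function name is an assumption):

```shell
# Accept a recovery target only if GNU date can parse it as a timestamp
valid_target_time() {
    date -d "$1" "+%Y-%m-%d %H:%M:%S" > /dev/null 2>&1
}

# Example guard for the top of restore-pitr.sh:
#   valid_target_time "$1" || { echo "Bad target time: $1" >&2; exit 1; }
```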
## Step 9: Test Backup and Restore
```bash
# Perform test backup
sudo -u postgres pgbackrest --stanza=fairdb --type=full backup
# Check backup info
sudo -u postgres pgbackrest --stanza=fairdb info
# List backups
sudo -u postgres pgbackrest --stanza=fairdb info --output=json
# Test restore to an alternate location (the default restore type recovers to the
# end of available WAL; pgBackRest has no --type=latest)
sudo mkdir -p /tmp/pgbackrest-test
sudo chown postgres:postgres /tmp/pgbackrest-test
sudo -u postgres pgbackrest --stanza=fairdb \
    --pg1-path=/tmp/pgbackrest-test \
    restore
```
## Step 10: Monitor Backup Health
```bash
# Create monitoring script
sudo tee /opt/fairdb/scripts/check-backup-health.sh << 'EOF'
#!/bin/bash
# FairDB Backup Health Check
# Quoted heredoc above: set literal values here to match Step 2
# (variables do not expand at script-creation time)
WASABI_BUCKET="fairdb-backups"
WASABI_ENDPOINT="s3.us-east-1.wasabisys.com"
# Check last backup time (pgBackRest reports timestamps as epoch seconds in JSON)
LAST_BACKUP_EPOCH=$(sudo -u postgres pgbackrest --stanza=fairdb info --output=json | \
    jq -r '.[0].backup[-1].timestamp.stop')
CURRENT_EPOCH=$(date +%s)
HOURS_AGO=$(( (CURRENT_EPOCH - LAST_BACKUP_EPOCH) / 3600 ))
# Alert if backup is older than 25 hours
if [ $HOURS_AGO -gt 25 ]; then
echo "ALERT: Last backup was $HOURS_AGO hours ago!"
# Send alert (implement notification here)
exit 1
fi
echo "Backup health OK - last backup $HOURS_AGO hours ago"
# Check S3 connectivity
aws s3 ls s3://${WASABI_BUCKET}/pgbackrest/ \
--endpoint-url=https://${WASABI_ENDPOINT} > /dev/null 2>&1
if [ $? -ne 0 ]; then
echo "ALERT: Cannot connect to Wasabi S3!"
exit 1
fi
echo "S3 connectivity OK"
EOF
sudo chmod +x /opt/fairdb/scripts/check-backup-health.sh
# Add to monitoring cron
echo "*/30 * * * * root /opt/fairdb/scripts/check-backup-health.sh" | \
sudo tee -a /etc/cron.d/fairdb-monitoring
```
## Step 11: Document Backup Configuration
```bash
sudo mkdir -p /opt/fairdb/configs
cat << EOF | sudo tee /opt/fairdb/configs/backup-info.txt > /dev/null
FairDB Backup Configuration
===========================
Backup Solution: pgBackRest
Primary Repository: Wasabi S3 (${WASABI_BUCKET})
Secondary Repository: Local (/var/lib/pgbackrest)
Stanza Name: fairdb
Encryption: AES-256-CBC
Retention Policy:
- Full Backups: 4 (S3), 2 (Local)
- Differential: 12 (S3), 6 (Local)
- WAL Archives: 30 days
Schedule:
- Full: Weekly (Sunday 2 AM)
- Differential: Daily (Mon-Sat 2 AM)
- Incremental: Hourly (9 AM - 6 PM weekdays)
Restore Procedures:
- Latest: pgbackrest --stanza=fairdb restore
- PITR: /opt/fairdb/scripts/restore-pitr.sh 'YYYY-MM-DD HH:MM:SS'
Monitoring:
- Health checks: Every 30 minutes
- Verification: Daily at 5 AM
- Expiration: Daily at 3 AM
EOF
```
## Verification Checklist
Confirm these items:
- [ ] pgBackRest installed and configured
- [ ] Wasabi S3 credentials configured
- [ ] Stanza created and verified
- [ ] PostgreSQL archive_command configured
- [ ] Backup scripts created and executable
- [ ] Automated schedule configured
- [ ] Test backup successful
- [ ] Test restore successful
- [ ] Monitoring scripts in place
- [ ] Documentation complete
## Security Notes
- Store Wasabi credentials securely (use AWS Secrets Manager in production)
- Encrypt backup repository with strong passphrase
- Regularly test restore procedures
- Monitor backup logs for failures
- Keep pgBackRest updated
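
The second note above matters because Step 3 leaves `repo1-cipher-pass=CHANGE_THIS_PASSPHRASE` as a placeholder. A sketch for generating and patching in a real passphrase (the helper name is an assumption; run against `/etc/pgbackrest/pgbackrest.conf` as root):

```shell
# Replace the repo1-cipher-pass placeholder with a freshly generated passphrase.
# base64 output never contains '|', so it is safe as a sed replacement here.
rotate_cipher_pass() {
    local conf="$1"
    local pass
    pass=$(openssl rand -base64 48 | tr -d '\n')
    sed -i "s|^repo1-cipher-pass=.*|repo1-cipher-pass=${pass}|" "$conf"
}

# Production usage (as root):
#   rotate_cipher_pass /etc/pgbackrest/pgbackrest.conf
```

Record the passphrase in your secret store: without it, the encrypted repository cannot be restored.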
## Output Summary
Provide the user with:
1. Backup stanza status: `pgbackrest --stanza=fairdb info`
2. Next full backup time from cron schedule
3. Location of backup scripts and logs
4. Restore procedure documentation
5. Monitoring webhook configuration needed
## Important Commands
```bash
# Manual backup commands
sudo -u postgres pgbackrest --stanza=fairdb --type=full backup # Full
sudo -u postgres pgbackrest --stanza=fairdb --type=diff backup # Differential
sudo -u postgres pgbackrest --stanza=fairdb --type=incr backup # Incremental
# Check backup status
sudo -u postgres pgbackrest --stanza=fairdb info
sudo -u postgres pgbackrest --stanza=fairdb check
# Restore commands
sudo -u postgres pgbackrest --stanza=fairdb restore # Latest
sudo -u postgres pgbackrest --stanza=fairdb --type=time --target="2024-01-01 12:00:00" restore # PITR
```