Initial commit

Zhongwei Li
2025-11-30 08:18:28 +08:00
commit c2260d8672
8 changed files with 908 additions and 0 deletions

commands/recovery.md (new file, 753 lines)

@@ -0,0 +1,753 @@
---
description: Implement disaster recovery and point-in-time recovery strategies
shortcut: recovery
---
# Database Recovery Manager
Implement comprehensive disaster recovery, point-in-time recovery (PITR), and automated failover for production database systems, backed by automatic backup verification and recovery testing.
## When to Use This Command
Use `/recovery` when you need to:
- Set up disaster recovery infrastructure for production databases
- Implement point-in-time recovery (PITR) capabilities
- Automate backup validation and recovery testing
- Design multi-region failover strategies
- Recover from data corruption or accidental deletions
- Meet compliance requirements for backup retention and recovery time objectives (RTO)
DON'T use this when:
- You only need basic database backups (use the backup automator instead)
- You're working with development databases that have no recovery requirements
- The database system doesn't support WAL or binary log replication
- Compliance doesn't require tested recovery procedures
## Design Decisions
This command implements **comprehensive disaster recovery with PITR** because:
- Point-in-time recovery prevents data loss from user errors or corruption
- Automated failover ensures minimal downtime (RTO < 5 minutes)
- Regular recovery testing validates backup integrity before disasters
- Multi-region replication provides geographic redundancy
- WAL archiving enables recovery to any point in the last 30 days
**Alternative considered: Snapshot-only backups**
- Simpler to implement and restore
- No point-in-time recovery capability
- Recovery point objective (RPO) limited to snapshot frequency
- Recommended only for non-critical databases
**Alternative considered: Manual recovery procedures**
- No automation or testing
- Prone to human error during incidents
- Longer recovery times (an RTO of hours rather than minutes)
- Recommended only for development environments
## Prerequisites
Before running this command:
1. Database with WAL/binary logging enabled
2. Object storage for backup retention (S3, GCS, Azure Blob)
3. Monitoring infrastructure for backup validation
4. Understanding of RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements
5. Separate recovery environment for testing
## Implementation Process
### Step 1: Configure WAL Archiving and Continuous Backup
Enable write-ahead logging (WAL) archiving for point-in-time recovery capabilities.
### Step 2: Implement Automated Base Backup System
Set up scheduled base backups with compression and encryption to object storage.
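As a minimal sketch of this step (assuming `pg_basebackup`, the `aws` CLI, and a replication-capable role are available, and reusing the placeholder bucket from Example 1), a nightly cron job could look like:
```bash
#!/bin/bash
# Illustrative nightly base backup to S3; names and flags are assumptions
set -euo pipefail
BACKUP_BUCKET="s3://my-db-backups"
STAMP=$(date +%Y%m%d_%H%M%S)
WORK_DIR=$(mktemp -d)
chown postgres:postgres "$WORK_DIR"

# Compressed tar-format base backup; -X fetch bundles the WAL needed
# for a self-consistent backup into base.tar.gz
sudo -u postgres pg_basebackup -D "$WORK_DIR" -F tar -z -X fetch

# Upload with server-side encryption; schedule via cron or a systemd timer.
# The "backup.tar.gz" suffix matches what pitr-restore.sh greps for.
aws s3 cp "$WORK_DIR/base.tar.gz" \
    "$BACKUP_BUCKET/base-backups/${STAMP}-backup.tar.gz" --sse AES256
rm -rf "$WORK_DIR"
```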
### Step 3: Design Failover and High Availability Architecture
Configure streaming replication with automated failover for zero-downtime recovery.
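The heavy lifting here belongs to `failover/replication-monitor.py`; as a quick sketch of the underlying health check, standby lag is visible in `pg_stat_replication` on the primary (the 30-second threshold is an illustrative assumption, not a recommendation):
```bash
#!/bin/bash
# Probe streaming replication health on the primary (illustrative)
sudo -u postgres psql -Atc "
    SELECT application_name, state,
           COALESCE(EXTRACT(EPOCH FROM replay_lag), 0)
    FROM pg_stat_replication;" | \
while IFS='|' read -r name state lag; do
    echo "standby=$name state=$state replay_lag=${lag}s"
    # Flag standbys that fall behind the (assumed) 30s threshold
    if awk -v l="$lag" 'BEGIN { exit !(l > 30) }'; then
        echo "WARNING: $name is lagging" >&2
    fi
done
```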
### Step 4: Build Recovery Testing Framework
Automate recovery validation by restoring backups to test environments regularly.
### Step 5: Document and Drill Recovery Procedures
Create runbooks and conduct disaster recovery drills quarterly.
## Output Format
The command generates:
- `config/recovery.conf` - PostgreSQL recovery configuration
- `scripts/pitr-restore.sh` - Point-in-time recovery automation script
- `monitoring/backup-validator.py` - Automated backup verification
- `failover/replication-monitor.py` - Streaming replication health monitoring
- `docs/recovery-runbook.md` - Step-by-step recovery procedures
## Code Examples
### Example 1: PostgreSQL PITR with WAL Archiving
```bash
# postgresql.conf - Enable WAL archiving
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-db-backups/wal-archive/%f --region us-east-1'
archive_timeout = 300 # Force segment switch every 5 minutes
max_wal_senders = 10
wal_keep_size = 1GB
# Recovery settings (used only when restoring from the archive)
restore_command = 'aws s3 cp s3://my-db-backups/wal-archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
```
```bash
#!/bin/bash
# scripts/pitr-restore.sh - Point-in-Time Recovery Script
set -euo pipefail

# Configuration
BACKUP_BUCKET="s3://my-db-backups"
PGDATA="/var/lib/postgresql/14/main"
TARGET_TIME="${1:-latest}"          # Format: '2024-10-15 14:30:00 UTC'
RECOVERY_TARGET="${2:-immediate}"   # immediate, time, xid, name

# Colors for output ($'...' so the escapes survive heredocs too)
RED=$'\033[0;31m'
GREEN=$'\033[0;32m'
YELLOW=$'\033[1;33m'
NC=$'\033[0m'

log() {
    echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}

error() {
    echo -e "${RED}[ERROR]${NC} $1" >&2
    exit 1
}

warn() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}

# Step 1: Verify prerequisites
check_prerequisites() {
    log "Checking prerequisites..."

    # Check if PostgreSQL is stopped
    if systemctl is-active --quiet postgresql; then
        warn "PostgreSQL is running. Stopping service..."
        systemctl stop postgresql
    fi

    # Verify AWS credentials
    if ! aws sts get-caller-identity &>/dev/null; then
        error "AWS credentials not configured"
    fi

    # Check available disk space
    REQUIRED_SPACE=$((50 * 1024 * 1024))  # 50GB in KB
    AVAILABLE_SPACE=$(df -k "$PGDATA" | tail -1 | awk '{print $4}')
    if [ "$AVAILABLE_SPACE" -lt "$REQUIRED_SPACE" ]; then
        error "Insufficient disk space. Required: 50GB, Available: $((AVAILABLE_SPACE / 1024 / 1024))GB"
    fi

    log "Prerequisites check passed"
}

# Step 2: List available base backups
list_backups() {
    log "Fetching available base backups..."
    aws s3 ls "$BACKUP_BUCKET/base-backups/" --recursive | \
        grep "backup.tar.gz" | \
        awk '{print $4}' | \
        sort -r | \
        head -10

    read -p "Enter backup to restore (or press Enter for latest): " SELECTED_BACKUP
    if [ -z "$SELECTED_BACKUP" ]; then
        SELECTED_BACKUP=$(aws s3 ls "$BACKUP_BUCKET/base-backups/" --recursive | \
            grep "backup.tar.gz" | \
            awk '{print $4}' | \
            sort -r | \
            head -1)
    fi
    log "Selected backup: $SELECTED_BACKUP"
}

# Step 3: Restore base backup
restore_base_backup() {
    log "Restoring base backup..."

    # Preserve the current PGDATA if it exists
    if [ -d "$PGDATA" ]; then
        BACKUP_DIR="${PGDATA}.$(date +%Y%m%d_%H%M%S)"
        warn "Backing up current PGDATA to $BACKUP_DIR"
        mv "$PGDATA" "$BACKUP_DIR"
    fi
    mkdir -p "$PGDATA"

    # Download and extract base backup
    log "Downloading base backup from S3..."
    aws s3 cp "$BACKUP_BUCKET/$SELECTED_BACKUP" - | \
        tar -xzf - -C "$PGDATA"

    # Set correct permissions
    chown -R postgres:postgres "$PGDATA"
    chmod 700 "$PGDATA"
    log "Base backup restored successfully"
}

# Step 4: Configure recovery
configure_recovery() {
    log "Configuring recovery settings..."
    cat > "$PGDATA/recovery.signal" << EOF
# Recovery signal file created by pitr-restore.sh
EOF

    # Create postgresql.auto.conf with recovery settings
    cat > "$PGDATA/postgresql.auto.conf" << EOF
# Temporary recovery configuration
restore_command = 'aws s3 cp $BACKUP_BUCKET/wal-archive/%f %p'
recovery_target_action = 'promote'
EOF

    # Add recovery target if specified
    case "$RECOVERY_TARGET" in
        time)
            echo "recovery_target_time = '$TARGET_TIME'" >> "$PGDATA/postgresql.auto.conf"
            log "Recovery target: $TARGET_TIME"
            ;;
        xid)
            echo "recovery_target_xid = '$TARGET_TIME'" >> "$PGDATA/postgresql.auto.conf"
            log "Recovery target XID: $TARGET_TIME"
            ;;
        name)
            echo "recovery_target_name = '$TARGET_TIME'" >> "$PGDATA/postgresql.auto.conf"
            log "Recovery target name: $TARGET_TIME"
            ;;
        immediate)
            echo "recovery_target = 'immediate'" >> "$PGDATA/postgresql.auto.conf"
            log "Recovery target: end of base backup"
            ;;
    esac
    log "Recovery configuration complete"
}

# Step 5: Start recovery and monitor
start_recovery() {
    log "Starting PostgreSQL in recovery mode..."
    systemctl start postgresql

    log "Monitoring recovery progress..."
    # Wait for recovery to complete
    while true; do
        if sudo -u postgres psql -c "SELECT pg_is_in_recovery();" 2>/dev/null | grep -q "f"; then
            log "Recovery completed successfully!"
            break
        fi

        # Show recovery progress
        RECOVERY_INFO=$(sudo -u postgres psql -c "
            SELECT
                CASE
                    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 100
                    ELSE ROUND(100.0 * pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0') /
                         NULLIF(pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '0/0'), 0), 2)
                END AS recovery_percent,
                pg_last_wal_replay_lsn() AS replay_lsn,
                NOW() - pg_last_xact_replay_timestamp() AS replay_lag
        " 2>/dev/null | tail -3 | head -1 || echo "Checking...")
        echo -ne "\r${YELLOW}Recovery in progress: $RECOVERY_INFO${NC}"
        sleep 5
    done
    echo ""
}

# Step 6: Verify recovery
verify_recovery() {
    log "Verifying database integrity..."

    # Run basic checks
    sudo -u postgres psql -c "SELECT version();"
    sudo -u postgres psql -c "SELECT COUNT(*) FROM pg_stat_database;"

    # Check for replication slots
    SLOT_COUNT=$(sudo -u postgres psql -t -c "SELECT COUNT(*) FROM pg_replication_slots;")
    if [ "$SLOT_COUNT" -gt 0 ]; then
        warn "$SLOT_COUNT replication slots found. Consider cleaning up if this is a new primary."
    fi
    log "Database verification complete"
}

# Main recovery workflow
main() {
    log "=== PostgreSQL Point-in-Time Recovery ==="
    log "Target: $TARGET_TIME"
    log "Recovery mode: $RECOVERY_TARGET"

    check_prerequisites
    list_backups
    restore_base_backup
    configure_recovery
    start_recovery
    verify_recovery

    log "Recovery completed successfully!"
    log "Database is now operational"
    cat << EOF
${GREEN}Next Steps:${NC}
1. Verify application connectivity
2. Check data integrity for affected tables
3. Update DNS/load balancer to point to recovered database
4. Monitor replication lag if standby servers exist
5. Create new base backup after recovery
EOF
}

# Run main function
main "$@"
```
```python
# monitoring/backup-validator.py - Automated Backup Verification
import os
import subprocess
import boto3
import psycopg2
import logging
import json
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional
from dataclasses import dataclass, asdict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class BackupValidationResult:
    """Results from backup validation."""
    backup_name: str
    backup_date: datetime
    validation_date: datetime
    size_mb: float
    restore_time_seconds: float
    integrity_check_passed: bool
    table_count: int
    row_sample_count: int
    errors: List[str]
    warnings: List[str]

    def to_dict(self) -> dict:
        result = asdict(self)
        result['backup_date'] = self.backup_date.isoformat()
        result['validation_date'] = self.validation_date.isoformat()
        return result


class PostgreSQLBackupValidator:
    """Validates PostgreSQL backups by restoring to a test environment."""

    def __init__(
        self,
        s3_bucket: str,
        test_db_config: Dict[str, str],
        retention_days: int = 30
    ):
        self.s3_bucket = s3_bucket
        self.test_db_config = test_db_config
        self.retention_days = retention_days
        self.s3_client = boto3.client('s3')

    def list_recent_backups(self, days: int = 7) -> List[Dict[str, Any]]:
        """List backups from the last N days."""
        prefix = "base-backups/"
        response = self.s3_client.list_objects_v2(
            Bucket=self.s3_bucket,
            Prefix=prefix
        )
        cutoff_date = datetime.now() - timedelta(days=days)
        backups = []
        for obj in response.get('Contents', []):
            if obj['LastModified'].replace(tzinfo=None) > cutoff_date:
                backups.append({
                    'key': obj['Key'],
                    'size': obj['Size'],
                    'last_modified': obj['LastModified']
                })
        return sorted(backups, key=lambda x: x['last_modified'], reverse=True)

    def download_backup(self, backup_key: str, local_path: str) -> bool:
        """Download backup from S3."""
        try:
            logger.info(f"Downloading backup {backup_key}...")
            self.s3_client.download_file(
                self.s3_bucket,
                backup_key,
                local_path
            )
            logger.info(f"Downloaded to {local_path}")
            return True
        except Exception as e:
            logger.error(f"Download failed: {e}")
            return False

    def restore_to_test_db(self, backup_path: str) -> Optional[float]:
        """Restore backup to the test database and measure elapsed time.

        Note: pg_restore only understands logical dumps (pg_dump -Fc/-Ft);
        a physical base backup must instead be validated by starting a
        throwaway PostgreSQL instance on the extracted data directory.
        """
        start_time = datetime.now()
        try:
            # Drop and recreate test database
            conn = psycopg2.connect(
                host=self.test_db_config['host'],
                user=self.test_db_config['user'],
                password=self.test_db_config['password'],
                database='postgres'
            )
            conn.autocommit = True
            with conn.cursor() as cur:
                cur.execute(f"DROP DATABASE IF EXISTS {self.test_db_config['database']};")
                cur.execute(f"CREATE DATABASE {self.test_db_config['database']};")
            conn.close()

            # Restore backup using pg_restore
            restore_cmd = [
                'pg_restore',
                '--host', self.test_db_config['host'],
                '--username', self.test_db_config['user'],
                '--dbname', self.test_db_config['database'],
                '--no-owner',
                '--no-acl',
                '--verbose',
                backup_path
            ]
            result = subprocess.run(
                restore_cmd,
                capture_output=True,
                text=True,
                # Merge the parent environment so pg_restore keeps PATH etc.
                env={**os.environ, 'PGPASSWORD': self.test_db_config['password']}
            )
            if result.returncode != 0:
                logger.error(f"Restore failed: {result.stderr}")
                return None
            restore_time = (datetime.now() - start_time).total_seconds()
            logger.info(f"Restore completed in {restore_time:.2f} seconds")
            return restore_time
        except Exception as e:
            logger.error(f"Restore error: {e}")
            return None

    def verify_database_integrity(self) -> Dict[str, Any]:
        """Run integrity checks on the restored database."""
        checks = {
            'table_count': 0,
            'row_sample_count': 0,
            'index_validity': True,
            'constraint_violations': [],
            'errors': [],
            'warnings': []
        }
        try:
            conn = psycopg2.connect(
                host=self.test_db_config['host'],
                user=self.test_db_config['user'],
                password=self.test_db_config['password'],
                database=self.test_db_config['database']
            )
            # Autocommit so one failed check doesn't abort the rest
            conn.autocommit = True
            with conn.cursor() as cur:
                # Count tables
                cur.execute("""
                    SELECT COUNT(*)
                    FROM information_schema.tables
                    WHERE table_schema NOT IN ('pg_catalog', 'information_schema');
                """)
                checks['table_count'] = cur.fetchone()[0]

                # Sample row counts from ten random tables
                cur.execute("""
                    SELECT schemaname, tablename
                    FROM pg_tables
                    WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
                    ORDER BY random()
                    LIMIT 10;
                """)
                for schema, table in cur.fetchall():
                    try:
                        cur.execute(f'SELECT COUNT(*) FROM "{schema}"."{table}";')
                        row_count = cur.fetchone()[0]
                        checks['row_sample_count'] += row_count
                    except Exception as e:
                        checks['warnings'].append(f"Could not count {schema}.{table}: {e}")

                # Rebuild a sample of indexes; REINDEX fails loudly on corruption
                cur.execute("""
                    SELECT schemaname, tablename, indexname
                    FROM pg_indexes
                    WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
                    LIMIT 100;
                """)
                for schema, table, index in cur.fetchall():
                    try:
                        cur.execute(f'REINDEX INDEX "{schema}"."{index}";')
                    except Exception as e:
                        checks['errors'].append(f"Invalid index {schema}.{index}: {e}")
                        checks['index_validity'] = False
            conn.close()
        except Exception as e:
            checks['errors'].append(f"Integrity check failed: {e}")
        return checks

    def validate_backup(self, backup_info: Dict[str, Any]) -> BackupValidationResult:
        """Complete backup validation workflow."""
        logger.info(f"Validating backup: {backup_info['key']}")
        errors = []
        warnings = []
        restore_time = None
        integrity_checks: Dict[str, Any] = {}

        # Download backup
        local_backup = f"/tmp/{backup_info['key'].split('/')[-1]}"
        if not self.download_backup(backup_info['key'], local_backup):
            errors.append("Failed to download backup")
        else:
            # Restore to test database
            restore_time = self.restore_to_test_db(local_backup)
            if restore_time is None:
                errors.append("Failed to restore backup")
            else:
                # Verify integrity only after a successful restore
                integrity_checks = self.verify_database_integrity()
                errors.extend(integrity_checks.get('errors', []))
                warnings.extend(integrity_checks.get('warnings', []))

        result = BackupValidationResult(
            backup_name=backup_info['key'],
            backup_date=backup_info['last_modified'].replace(tzinfo=None),
            validation_date=datetime.now(),
            size_mb=backup_info['size'] / (1024 * 1024),
            restore_time_seconds=restore_time or 0.0,
            integrity_check_passed=len(errors) == 0,
            table_count=integrity_checks.get('table_count', 0),
            row_sample_count=integrity_checks.get('row_sample_count', 0),
            errors=errors,
            warnings=warnings
        )

        # Log results
        if result.integrity_check_passed:
            logger.info(f"✅ Backup validation PASSED: {backup_info['key']}")
        else:
            logger.error(f"❌ Backup validation FAILED: {backup_info['key']}")
            for error in errors:
                logger.error(f"  - {error}")
        return result

    def run_daily_validation(self):
        """Run daily backup validation on the most recent backup."""
        logger.info("Starting daily backup validation...")
        backups = self.list_recent_backups(days=1)
        if not backups:
            logger.warning("No recent backups found")
            return
        latest_backup = backups[0]
        result = self.validate_backup(latest_backup)

        # Save validation results
        report_file = f"validation-report-{datetime.now().strftime('%Y%m%d')}.json"
        with open(report_file, 'w') as f:
            json.dump(result.to_dict(), f, indent=2)
        logger.info(f"Validation report saved to {report_file}")

        # Alert if validation failed
        if not result.integrity_check_passed:
            self.send_alert(result)

    def send_alert(self, result: BackupValidationResult):
        """Send alert for failed validation."""
        logger.critical(f"ALERT: Backup validation failed for {result.backup_name}")
        # Implement alerting (email, Slack, PagerDuty, etc.)


# Usage
if __name__ == "__main__":
    validator = PostgreSQLBackupValidator(
        s3_bucket="my-db-backups",
        test_db_config={
            'host': 'test-db.example.com',
            'user': 'postgres',
            'password': 'password',  # placeholder; load from a secrets manager
            'database': 'validation_test'
        },
        retention_days=30
    )
    validator.run_daily_validation()
```
### Example 2: MySQL PITR with Binary Log Replication
```bash
# my.cnf - Enable binary logging for PITR
[mysqld]
server-id = 1
log-bin = /var/log/mysql/mysql-bin
binlog_format = ROW
binlog_row_image = FULL
expire_logs_days = 7
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
# Replication for high availability
gtid_mode = ON
enforce_gtid_consistency = ON
log_slave_updates = ON
```
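The restore script below expects timestamped tarballs under `/backups/mysql`; one way to produce them, sketched here assuming Percona XtraBackup is installed (the `mysql-<timestamp>.tar.gz` naming is an assumption chosen to match `find_base_backup`), is:
```bash
#!/bin/bash
# Illustrative companion: create the base backups the restore script consumes
set -euo pipefail
BACKUP_DIR="/backups/mysql"
STAMP=$(date +%Y-%m-%dT%H:%M:%S)   # must stay parseable by `date -d`
WORK_DIR=$(mktemp -d)

# Consistent physical backup without stopping the server
xtrabackup --backup --target-dir="$WORK_DIR"
xtrabackup --prepare --target-dir="$WORK_DIR"

# Everything after the first dash is what find_base_backup() parses
tar -czf "$BACKUP_DIR/mysql-${STAMP}.tar.gz" -C "$WORK_DIR" .
rm -rf "$WORK_DIR"
```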
```bash
#!/bin/bash
# scripts/mysql-pitr-restore.sh
set -euo pipefail
BACKUP_DIR="/backups/mysql"
BINLOG_DIR="/var/log/mysql"
TARGET_DATETIME="$1" # Format: '2024-10-15 14:30:00'
RESTORE_DIR="/var/lib/mysql-restore"
log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"; }
error() { echo "[ERROR] $1" >&2; exit 1; }
# Find the newest base backup taken before the target time.
# Log to stderr so command substitution only captures the backup path.
find_base_backup() {
    log "Finding appropriate base backup..." >&2
    TARGET_EPOCH=$(date -d "$TARGET_DATETIME" +%s)
    for backup in $(ls -t "$BACKUP_DIR"/*.tar.gz); do
        # Everything after the first dash is the timestamp
        BACKUP_TIME=$(basename "$backup" .tar.gz | cut -d'-' -f2-)
        BACKUP_EPOCH=$(date -d "$BACKUP_TIME" +%s 2>/dev/null) || continue
        if [ "$BACKUP_EPOCH" -lt "$TARGET_EPOCH" ]; then
            echo "$backup"
            return 0
        fi
    done
    error "No suitable backup found before $TARGET_DATETIME"
}

# Restore base backup
BASE_BACKUP=$(find_base_backup)
log "Restoring base backup: $BASE_BACKUP"
systemctl stop mysql
rm -rf "$RESTORE_DIR"
mkdir -p "$RESTORE_DIR"
tar -xzf "$BASE_BACKUP" -C "$RESTORE_DIR"
chown -R mysql:mysql "$RESTORE_DIR"

# Start MySQL on the restored data directory so binary logs can be replayed
# (assumes my.cnf/systemd have been pointed at $RESTORE_DIR)
systemctl start mysql

# Apply binary logs up to target time; the [0-9]* glob skips mysql-bin.index
log "Applying binary logs up to $TARGET_DATETIME..."
mysqlbinlog \
    --stop-datetime="$TARGET_DATETIME" \
    "$BINLOG_DIR"/mysql-bin.[0-9]* | \
    mysql -u root
log "Point-in-time recovery completed"
```
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| "WAL segment not found" | Missing archived WAL files | Check `archive_command` and S3 bucket permissions |
| "Invalid checkpoint" | Corrupted base backup | Restore from previous base backup |
| "Recovery target not reached" | Target time beyond available WAL | Verify WAL archiving is functioning |
| "Insufficient disk space" | Large database or WAL files | Provision additional storage or compress archives |
| "Connection refused during recovery" | PostgreSQL still in recovery mode | Wait for recovery to complete before connecting |
## Configuration Options
**WAL Archiving**
- `wal_level = replica`: Write enough WAL for archiving and replication (PostgreSQL)
- `archive_mode = on`: Activate WAL archiving
- `archive_timeout = 300`: Force WAL segment switch every 5 minutes
- `log-bin`: Enable binary logging (MySQL)
**Recovery Targets**
- `recovery_target_time`: Restore to specific timestamp
- `recovery_target_xid`: Restore to transaction ID
- `recovery_target_name`: Restore to named restore point
- `recovery_target = 'immediate'`: Stop at end of base backup
**Replication**
- `max_wal_senders = 10`: Maximum replication connections
- `wal_keep_size = 1GB`: Minimum WAL retention on primary
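Wiring a standby to consume these settings can be as small as the following sketch; the host, user, and slot name are placeholders:
```bash
# On a fresh standby: clone the primary and enable streaming replication.
# -R writes primary_conninfo to postgresql.auto.conf and creates standby.signal;
# -C/-S create and use a named replication slot on the primary.
pg_basebackup -h primary.example.com -U replication_user \
    -D /var/lib/postgresql/14/main -R -X stream -C -S standby1_slot
systemctl start postgresql
```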
## Best Practices
DO:
- Test recovery procedures monthly in isolated environment
- Monitor WAL archiving lag and alert if > 5 minutes
- Encrypt backups at rest and in transit
- Store backups in multiple regions for geographic redundancy
- Validate backup integrity automatically after creation
- Document RTO/RPO requirements and measure against them
DON'T:
- Skip recovery testing (untested backups are useless)
- Store backups on same infrastructure as production database
- Ignore WAL archiving failures (creates recovery gaps)
- Use same credentials for production and backup storage
- Assume backups work without validation
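The "monitor WAL archiving lag" item above can be scripted directly against `pg_stat_archiver`; the five-minute threshold mirrors the DO list (a sketch, assuming local `psql` access):
```bash
#!/bin/bash
# Alert when nothing has been archived for more than 5 minutes
LAGGING=$(sudo -u postgres psql -Atc "
    SELECT COALESCE(now() - last_archived_time > interval '5 minutes', true)
    FROM pg_stat_archiver;")
if [ "$LAGGING" = "t" ]; then
    echo "ALERT: WAL archiving is lagging or has never succeeded" >&2
    exit 1
fi
```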
## Performance Considerations
- WAL archiving adds ~1-5% overhead depending on write workload
- Use parallel backup tools (pgBackRest, Barman) for large databases
- Compress WAL archives to reduce storage costs (50-70% reduction typical)
- Use incremental backups to minimize backup window
- Consider backup network bandwidth (1TB database = ~30 minutes over 10 Gbps)
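For the compression point above, a hedged variant of the Example 1 `archive_command` gzips each segment on its way to S3 (same placeholder bucket; the matching `restore_command` must decompress):
```bash
# postgresql.conf - compressed WAL archiving (illustrative)
archive_command = 'gzip -c %p | aws s3 cp - s3://my-db-backups/wal-archive/%f.gz --region us-east-1'
restore_command = 'aws s3 cp s3://my-db-backups/wal-archive/%f.gz - | gunzip > %p'
```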
## Related Commands
- `/database-backup-automator` - Automated backup scheduling
- `/database-replication-manager` - Configure streaming replication
- `/database-health-monitor` - Monitor backup and replication health
- `/database-migration-manager` - Schema change management with recovery points
## Version History
- v1.0.0 (2024-10): Initial implementation with PostgreSQL and MySQL PITR support
- Planned v1.1.0: Add automated failover orchestration and multi-region replication