Initial commit

15  .claude-plugin/plugin.json  Normal file
@@ -0,0 +1,15 @@
{
  "name": "database-replication-manager",
  "description": "Manage database replication, failover, and high availability configurations",
  "version": "1.0.0",
  "author": {
    "name": "Claude Code Plugins",
    "email": "[email protected]"
  },
  "skills": [
    "./skills"
  ],
  "commands": [
    "./commands"
  ]
}

3  README.md  Normal file
@@ -0,0 +1,3 @@
# database-replication-manager

Manage database replication, failover, and high availability configurations

759  commands/replication.md  Normal file
@@ -0,0 +1,759 @@
---
description: Comprehensive database replication management with streaming replication, failover automation, and lag monitoring
shortcut: replication
---

# Database Replication Manager

Implement production-grade database replication for PostgreSQL and MySQL with streaming replication (physical), logical replication (selective tables), synchronous and asynchronous modes, automatic failover, lag monitoring, conflict resolution, and read scaling across multiple replicas. Achieve 99.99% availability with RPO <5 seconds and RTO <30 seconds for automated failover.

## When to Use This Command

Use `/replication` when you need to:
- Implement high availability with automatic failover (99.99%+ uptime)
- Scale read workloads across multiple replicas (10x read capacity)
- Create disaster recovery instances in different regions
- Enable zero-downtime database migrations and upgrades
- Implement read-heavy application architectures
- Meet compliance requirements for data redundancy

DON'T use this when:
- Single server handles all load comfortably (<50% CPU)
- Database size is small (<10GB) and backup/restore is fast
- Application doesn't support read replica routing
- Network latency between regions is high (>100ms for sync replication)
- You lack monitoring infrastructure for replication lag
- Write workload is too heavy for replication to keep up

## Design Decisions

This command implements **automated replication with failover** because:
- Streaming replication provides real-time data synchronization
- Automatic failover reduces RTO from hours to seconds
- Read replicas enable horizontal read scaling
- Logical replication allows selective table replication
- Synchronous mode ensures zero data loss for critical transactions

**Alternative considered: Application-level read/write splitting**
- No replication overhead at the database level
- Requires application changes for every database interaction
- More complex error handling and retry logic
- Recommended when replication infrastructure is unavailable

**Alternative considered: Database clustering (Patroni, Galera)**
- Multi-master with automatic failover
- More complex setup and maintenance
- Better for write-heavy workloads
- Recommended for high-write applications requiring HA

## Prerequisites

Before running this command:
1. Primary and replica servers with network connectivity
2. Sufficient disk space for WAL archiving (30-50% of database size)
3. Monitoring system for replication lag alerts
4. Understanding of RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements
5. Tested failover procedures and runbooks

## Implementation Process

### Step 1: Configure Primary for Replication
Enable WAL archiving, set max_wal_senders, and create a replication user.
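
As a minimal sketch, assuming a PostgreSQL primary and the same names used in Example 1 below, the server settings and the replication role can be created from SQL (`wal_level` only takes effect after a restart):

```sql
-- Minimal primary-side replication settings (restart required for wal_level).
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;

-- Dedicated replication role; the password is a placeholder.
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'REPLACE_ME';
```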

### Step 2: Initialize Replica with Base Backup
Use pg_basebackup to clone the primary database to the replica server.

### Step 3: Configure Replica Connection
Set primary_conninfo and start the replica in standby mode.
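
On PostgreSQL 12+ these are plain configuration settings, so a sketch like the following works on the replica (values are illustrative; `pg_basebackup -R` in Example 1 writes them for you):

```sql
-- Run on the replica; a reload applies primary_conninfo on PostgreSQL 13+,
-- older versions need a restart.
ALTER SYSTEM SET primary_conninfo = 'host=primary.example.com port=5432 user=replicator application_name=replica1';
ALTER SYSTEM SET primary_slot_name = 'replica1_slot';
SELECT pg_reload_conf();
```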

### Step 4: Verify Replication Status
Check replication lag and ensure WAL streaming is active.
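
For example, a quick check on the primary (one row per connected replica; the `replay_lag` column exists on PostgreSQL 10+):

```sql
-- Run on the primary: per-replica streaming state and lag.
SELECT application_name,
       state,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
```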

### Step 5: Implement Monitoring and Failover
Deploy replication lag alerts and automatic failover scripts.

## Output Format

The command generates:
- `replication/primary_setup.sql` - Primary configuration and replication user
- `replication/replica_setup.sh` - Automated replica initialization script
- `replication/failover.py` - Automatic failover orchestration
- `replication/monitoring.yml` - Prometheus/Grafana replication metrics
- `replication/recovery.conf` - Replica recovery configuration

## Code Examples

### Example 1: PostgreSQL Streaming Replication Setup

```bash
#!/bin/bash
#
# Production-ready PostgreSQL streaming replication setup
# with automatic failover and monitoring integration
#

set -e

# Configuration
PRIMARY_HOST="${PRIMARY_HOST:-primary.example.com}"
REPLICA_HOST="${REPLICA_HOST:-replica.example.com}"
REPLICATION_USER="${REPLICATION_USER:-replicator}"
REPLICATION_PASSWORD="${REPLICATION_PASSWORD:-changeme}"
POSTGRES_DATA_DIR="/var/lib/postgresql/14/main"

echo "========================================="
echo "PostgreSQL Streaming Replication Setup"
echo "========================================="
echo ""

# ===== PRIMARY SERVER CONFIGURATION =====

setup_primary() {
    echo "Configuring PRIMARY server: $PRIMARY_HOST"
    echo ""

    # 1. Configure postgresql.conf for replication
    cat >> /etc/postgresql/14/main/postgresql.conf <<EOF

# ========== REPLICATION SETTINGS ==========
# Added by replication setup script

# Write-Ahead Log (WAL) settings
wal_level = replica            # WAL detail needed for physical replication
max_wal_senders = 10           # Max concurrent replication connections
wal_keep_size = 1024           # Keep 1GB of WAL segments (PostgreSQL 13+)
max_replication_slots = 10     # For replication slots (recommended)

# Synchronous replication (optional - for zero data loss)
# synchronous_standby_names = 'replica1'   # Uncomment for sync replication
synchronous_commit = local     # Options: off, local, remote_write, remote_apply, on

# Archive WAL for point-in-time recovery (optional)
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
archive_timeout = 300          # Force WAL switch every 5 minutes

# Hot standby settings
hot_standby = on               # Allow reads on replica
hot_standby_feedback = on      # Prevent query conflicts

# ========================================
EOF

    # 2. Create WAL archive directory
    mkdir -p /var/lib/postgresql/wal_archive
    chown postgres:postgres /var/lib/postgresql/wal_archive
    chmod 700 /var/lib/postgresql/wal_archive

    # 3. Configure pg_hba.conf for replication connections
    cat >> /etc/postgresql/14/main/pg_hba.conf <<EOF

# Replication connections (added by setup script)
host    replication    $REPLICATION_USER    $REPLICA_HOST/32    scram-sha-256
host    replication    $REPLICATION_USER    0.0.0.0/0           scram-sha-256    # For multiple replicas
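# NOTE: 0.0.0.0/0 accepts replication connections from any address; in production,
# list each replica's IP/CIDR instead. A /mask requires an IP address, and the
# default REPLICA_HOST above is a hostname, which cannot take /32.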
EOF

    # 4. Create replication user
    sudo -u postgres psql <<EOF
-- Create replication user with strong password
CREATE ROLE $REPLICATION_USER WITH REPLICATION LOGIN PASSWORD '$REPLICATION_PASSWORD';

-- Grant necessary permissions
GRANT CONNECT ON DATABASE postgres TO $REPLICATION_USER;

-- Create replication slot (recommended for reliability)
SELECT * FROM pg_create_physical_replication_slot('replica1_slot');

-- Verify replication user
\du $REPLICATION_USER
EOF

    # 5. Restart PostgreSQL to apply changes
    systemctl restart postgresql@14-main

    echo ""
    echo "✅ Primary server configured successfully"
    echo ""
    echo "Replication Status:"
    sudo -u postgres psql -c "SELECT * FROM pg_replication_slots;"
    sudo -u postgres psql -c "SELECT usename, application_name, client_addr, state, sync_state FROM pg_stat_replication;"
}

# ===== REPLICA SERVER CONFIGURATION =====

setup_replica() {
    echo "Configuring REPLICA server: $REPLICA_HOST"
    echo ""

    # 1. Stop PostgreSQL on replica (will be rebuilt)
    systemctl stop postgresql@14-main

    # 2. Backup existing data (safety)
    if [ -d "$POSTGRES_DATA_DIR" ]; then
        mv "$POSTGRES_DATA_DIR" "${POSTGRES_DATA_DIR}.backup.$(date +%Y%m%d-%H%M%S)"
    fi

    # 3. Create base backup from primary using pg_basebackup
    echo "Creating base backup from primary (this may take several minutes)..."
    sudo -u postgres pg_basebackup \
        -h $PRIMARY_HOST \
        -D $POSTGRES_DATA_DIR \
        -U $REPLICATION_USER \
        -P \
        -v \
        -R \
        -X stream \
        -S replica1_slot

    # -R: Creates standby.signal and writes recovery parameters
    # -X stream: Stream WAL while backup is in progress
    # -S: Use the replication slot created during primary setup
    #     (add -C only if the slot does not exist yet; -C fails on an existing slot)

    # 4. Configure replica-specific settings (optional; -R already wrote
    #    primary_conninfo and primary_slot_name, later entries override them)
    cat >> $POSTGRES_DATA_DIR/postgresql.auto.conf <<EOF

# Replica-specific configuration
primary_conninfo = 'host=$PRIMARY_HOST port=5432 user=$REPLICATION_USER password=$REPLICATION_PASSWORD application_name=replica1'
primary_slot_name = 'replica1_slot'

# Recovery settings
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'    # If using WAL archiving
recovery_target_timeline = 'latest'

# Hot standby settings (allow read queries on replica)
hot_standby = on
max_standby_streaming_delay = 30s    # Max delay before canceling conflicting queries
EOF

    # 5. Set proper permissions
    chown -R postgres:postgres $POSTGRES_DATA_DIR
    chmod 700 $POSTGRES_DATA_DIR

    # 6. Start replica
    systemctl start postgresql@14-main

    echo ""
    echo "✅ Replica server configured successfully"
    echo ""
    echo "Replica Status:"
    sudo -u postgres psql -c "SELECT pg_is_in_recovery();"    # Should return 't'
    sudo -u postgres psql -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
}

# ===== REPLICATION VERIFICATION =====

verify_replication() {
    echo "Verifying replication setup..."
    echo ""

    # Check primary replication status
    echo "=== PRIMARY SERVER STATUS ==="
    sudo -u postgres psql -h $PRIMARY_HOST -U $REPLICATION_USER postgres <<EOF
SELECT
    client_addr,
    application_name,
    state,
    sync_state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    sync_priority,
    EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
FROM pg_stat_replication;
EOF

    echo ""
    echo "=== REPLICA SERVER STATUS ==="
    sudo -u postgres psql -h $REPLICA_HOST postgres <<EOF
SELECT
    pg_is_in_recovery() AS is_replica,
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    pg_last_xact_replay_timestamp() AS last_replay_timestamp,
    EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag_seconds;
EOF

    echo ""
    echo "=== REPLICATION LAG ==="
    # Acceptable lag: <1 second for local replicas, <5 seconds for remote.
    # Measured on the primary via replay_lag; pg_last_xact_replay_timestamp()
    # is a recovery-only function and returns NULL on the primary.
    LAG=$(sudo -u postgres psql -h $PRIMARY_HOST -U postgres postgres -t -A -c \
        "SELECT COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) FROM pg_stat_replication LIMIT 1;")

    if [ -z "$LAG" ]; then
        echo "❌ No connected replicas found in pg_stat_replication"
    elif (( $(echo "$LAG < 5" | bc -l) )); then
        echo "✅ Replication lag: ${LAG}s (healthy)"
    else
        echo "⚠️  Replication lag: ${LAG}s (high)"
    fi
}

# ===== MAIN =====

case "${1:-}" in
    primary)
        setup_primary
        ;;
    replica)
        setup_replica
        ;;
    verify)
        verify_replication
        ;;
    *)
        echo "Usage: $0 {primary|replica|verify}"
        echo ""
        echo "  primary - Configure primary server for replication"
        echo "  replica - Set up replica from primary"
        echo "  verify  - Verify replication status"
        exit 1
        ;;
esac
```

### Example 2: Automated Failover Script with Health Checks and Alerting

```python
#!/usr/bin/env python3
"""
Production-ready PostgreSQL automatic failover script with
health checks, monitoring integration, and rollback capability.
"""

import logging
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import psycopg2
import requests

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class ServerRole(Enum):
    """Database server role."""
    PRIMARY = "primary"
    REPLICA = "replica"
    UNKNOWN = "unknown"


@dataclass
class ReplicationStatus:
    """Replication status information."""
    is_primary: bool
    is_replica: bool
    replication_lag_seconds: Optional[float]
    wal_receive_lsn: Optional[str]
    wal_replay_lsn: Optional[str]
    connected_replicas: int


class PostgreSQLFailoverManager:
    """
    Manages automatic failover for PostgreSQL streaming replication.
    """

    def __init__(
        self,
        primary_host: str,
        replica_host: str,
        postgres_user: str = "postgres",
        postgres_password: str = "",
        failover_threshold_seconds: int = 30,
        alert_webhook: Optional[str] = None
    ):
        """
        Initialize failover manager.

        Args:
            primary_host: Primary server hostname
            replica_host: Replica server hostname
            postgres_user: PostgreSQL superuser
            postgres_password: PostgreSQL password
            failover_threshold_seconds: Trigger failover after this many seconds down
            alert_webhook: Slack/PagerDuty webhook for alerts
        """
        self.primary_host = primary_host
        self.replica_host = replica_host
        self.postgres_user = postgres_user
        self.postgres_password = postgres_password
        self.failover_threshold = failover_threshold_seconds
        self.alert_webhook = alert_webhook

        self.primary_down_since: Optional[float] = None

    def check_server_health(self, host: str) -> bool:
        """
        Check if PostgreSQL server is healthy.

        Args:
            host: Server hostname

        Returns:
            True if server is healthy, False otherwise
        """
        try:
            conn = psycopg2.connect(
                host=host,
                user=self.postgres_user,
                password=self.postgres_password,
                database="postgres",
                connect_timeout=5
            )
            conn.close()
            return True

        except Exception as e:
            logger.error(f"Health check failed for {host}: {e}")
            return False

    def get_replication_status(self, host: str) -> Optional[ReplicationStatus]:
        """
        Get replication status from a server.

        Args:
            host: Server hostname

        Returns:
            ReplicationStatus or None if unreachable
        """
        try:
            conn = psycopg2.connect(
                host=host,
                user=self.postgres_user,
                password=self.postgres_password,
                database="postgres",
                connect_timeout=5
            )

            with conn.cursor() as cur:
                # Check if primary or replica
                cur.execute("SELECT pg_is_in_recovery()")
                is_replica = cur.fetchone()[0]
                is_primary = not is_replica

                # Get replication lag (for replicas)
                replication_lag = None
                wal_receive_lsn = None
                wal_replay_lsn = None

                if is_replica:
                    cur.execute("""
                        SELECT
                            EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag_seconds,
                            pg_last_wal_receive_lsn()::text AS receive_lsn,
                            pg_last_wal_replay_lsn()::text AS replay_lsn
                    """)
                    row = cur.fetchone()
                    replication_lag = row[0]
                    wal_receive_lsn = row[1]
                    wal_replay_lsn = row[2]

                # Count connected replicas (for primary)
                connected_replicas = 0
                if is_primary:
                    cur.execute("SELECT COUNT(*) FROM pg_stat_replication")
                    connected_replicas = cur.fetchone()[0]

            conn.close()

            return ReplicationStatus(
                is_primary=is_primary,
                is_replica=is_replica,
                replication_lag_seconds=replication_lag,
                wal_receive_lsn=wal_receive_lsn,
                wal_replay_lsn=wal_replay_lsn,
                connected_replicas=connected_replicas
            )

        except Exception as e:
            logger.error(f"Failed to get replication status from {host}: {e}")
            return None

    def promote_replica_to_primary(self, replica_host: str) -> bool:
        """
        Promote replica to primary.

        Args:
            replica_host: Replica server to promote

        Returns:
            True if promotion successful
        """
        logger.info(f"Promoting replica {replica_host} to primary...")

        try:
            # Execute pg_promote() over a regular client connection (PostgreSQL 12+)
            conn = psycopg2.connect(
                host=replica_host,
                user=self.postgres_user,
                password=self.postgres_password,
                database="postgres"
            )

            with conn.cursor() as cur:
                # Promote replica to primary
                cur.execute("SELECT pg_promote()")
                conn.commit()

            conn.close()

            # Wait for promotion to complete
            time.sleep(5)

            # Verify promotion
            status = self.get_replication_status(replica_host)
            if status and status.is_primary:
                logger.info(f"✅ Successfully promoted {replica_host} to primary")
                return True
            else:
                logger.error("❌ Promotion failed - server is still a replica")
                return False

        except Exception as e:
            logger.error(f"Promotion failed: {e}")
            return False

    def send_alert(self, message: str, severity: str = "error") -> None:
        """
        Send alert to configured webhook.

        Args:
            message: Alert message
            severity: Alert severity (info, warning, error, critical)
        """
        if not self.alert_webhook:
            return

        emoji_map = {
            'info': 'ℹ️',
            'warning': '⚠️',
            'error': '❌',
            'critical': '🚨'
        }

        payload = {
            'text': f"{emoji_map.get(severity, '❓')} PostgreSQL Failover Alert",
            'attachments': [{
                'color': 'danger' if severity in ['error', 'critical'] else 'warning',
                'text': message,
                'footer': 'PostgreSQL Failover Manager',
                'ts': int(time.time())
            }]
        }

        try:
            requests.post(self.alert_webhook, json=payload, timeout=5)
        except Exception as e:
            logger.error(f"Failed to send alert: {e}")

    def monitor_and_failover(self) -> None:
        """
        Continuously monitor replication and perform automatic failover.
        """
        logger.info("Starting replication monitoring...")

        while True:
            try:
                # Check primary health
                primary_healthy = self.check_server_health(self.primary_host)

                if not primary_healthy:
                    if self.primary_down_since is None:
                        # Primary just went down
                        self.primary_down_since = time.time()
                        logger.warning(f"⚠️ Primary {self.primary_host} is DOWN")
                        self.send_alert(
                            f"Primary database {self.primary_host} is unreachable. "
                            f"Failover will trigger in {self.failover_threshold} seconds.",
                            severity='warning'
                        )

                    # Check if primary has been down long enough to trigger failover
                    down_duration = time.time() - self.primary_down_since

                    if down_duration >= self.failover_threshold:
                        logger.critical(
                            f"🚨 Primary down for {down_duration:.0f}s - "
                            f"TRIGGERING FAILOVER"
                        )

                        self.send_alert(
                            f"PRIMARY FAILURE: {self.primary_host} down for "
                            f"{down_duration:.0f}s. Initiating automatic failover to "
                            f"{self.replica_host}",
                            severity='critical'
                        )

                        # Perform failover
                        success = self.promote_replica_to_primary(self.replica_host)

                        if success:
                            self.send_alert(
                                f"✅ FAILOVER SUCCESSFUL: {self.replica_host} is now PRIMARY. "
                                f"Update application connection strings immediately.",
                                severity='error'  # Still an error situation
                            )
                            # Stop monitoring (manual intervention required)
                            break
                        else:
                            self.send_alert(
                                f"❌ FAILOVER FAILED: Could not promote {self.replica_host}. "
                                f"Manual intervention required immediately.",
                                severity='critical'
                            )
                            break

                else:
                    # Primary is healthy
                    if self.primary_down_since is not None:
                        # Primary recovered
                        logger.info(f"✅ Primary {self.primary_host} recovered")
                        self.send_alert(
                            f"Primary database {self.primary_host} has recovered.",
                            severity='info'
                        )
                        self.primary_down_since = None

                    # Check replication lag
                    replica_status = self.get_replication_status(self.replica_host)

                    if replica_status:
                        lag = replica_status.replication_lag_seconds or 0

                        if lag > 60:
                            logger.warning(f"⚠️ High replication lag: {lag:.1f}s")
                            self.send_alert(
                                f"High replication lag detected: {lag:.1f} seconds",
                                severity='warning'
                            )
                        else:
                            logger.info(
                                f"Replication healthy: lag={lag:.1f}s, "
                                f"primary={self.primary_host}, replica={self.replica_host}"
                            )

                # Sleep before next check
                time.sleep(10)

            except KeyboardInterrupt:
                logger.info("Monitoring stopped by user")
                break
            except Exception as e:
                logger.error(f"Monitoring error: {e}")
                time.sleep(10)


# CLI usage
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="PostgreSQL Failover Manager")
    parser.add_argument("--primary", required=True, help="Primary host")
    parser.add_argument("--replica", required=True, help="Replica host")
    parser.add_argument("--user", default="postgres", help="PostgreSQL user")
    parser.add_argument("--password", default="", help="PostgreSQL password")
    parser.add_argument("--threshold", type=int, default=30, help="Failover threshold (seconds)")
    parser.add_argument("--webhook", help="Alert webhook URL")

    args = parser.parse_args()

    manager = PostgreSQLFailoverManager(
        primary_host=args.primary,
        replica_host=args.replica,
        postgres_user=args.user,
        postgres_password=args.password,
        failover_threshold_seconds=args.threshold,
        alert_webhook=args.webhook
    )

    manager.monitor_and_failover()
```

## Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| "could not connect to server" | Replica cannot reach primary | Check network connectivity, firewall rules, pg_hba.conf |
| "requested WAL segment already removed" | WAL files deleted before replica could receive them | Increase wal_keep_size or use replication slots |
| "replication slot does not exist" | Replica trying to use non-existent slot | Create slot on primary: `SELECT pg_create_physical_replication_slot('slot_name')` |
| "hot standby conflict" | Query on replica conflicts with recovery | Increase max_standby_streaming_delay or tune query cancellation |
| "timeline history file missing" | Replica and primary have diverged after failover | Rebuild replica from new primary using pg_basebackup |

## Configuration Options

**Replication Modes**
- **Asynchronous** (default): Best performance, small data loss risk
- **Synchronous** (`synchronous_commit=on`): Zero data loss, slower writes
- **Remote write** (`synchronous_commit=remote_write`): Balanced approach
- **Remote apply** (`synchronous_commit=remote_apply`): Strongest consistency
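
The synchronous modes only take effect once `synchronous_standby_names` is set; a sketch (the standby name must match the replica's `application_name`, `replica1` here), including the per-transaction override for critical writes:

```sql
-- Cluster-wide: name the synchronous standby and pick a durability level.
ALTER SYSTEM SET synchronous_standby_names = 'replica1';
ALTER SYSTEM SET synchronous_commit = 'remote_write';
SELECT pg_reload_conf();

-- Per-transaction: demand the strongest guarantee only where it matters.
BEGIN;
SET LOCAL synchronous_commit = 'remote_apply';
-- ... critical writes here ...
COMMIT;
```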

**Replication Methods**
- **Streaming replication**: Binary WAL streaming (physical replication)
- **Logical replication**: Selective table replication (PostgreSQL 10+)
- **WAL shipping**: Archive-based replication (for backups)
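
Logical replication is driven entirely from SQL; a minimal sketch (the `orders` table and the connection string are illustrative):

```sql
-- On the publisher:
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the subscriber (the table must already exist there with a matching schema):
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary.example.com dbname=appdb user=replicator password=REPLACE_ME'
    PUBLICATION orders_pub;
```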

**Failover Strategies**
- **Manual failover**: DBA triggers promotion (safest)
- **Automatic failover**: Scripted promotion after health check failure
- **Patroni/repmgr**: Cluster management with automatic failover
- **Cloud-managed**: RDS/CloudSQL automatic failover
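
For manual failover on PostgreSQL 12+, promotion is a single call on the replica:

```sql
-- Blocks until promotion completes (default timeout 60s), then returns true.
SELECT pg_promote(wait => true);
```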

## Best Practices

DO:
- Use replication slots to prevent WAL deletion before the replica receives it (see the slot check after this list)
- Monitor replication lag continuously (alert at >10 seconds)
- Test failover procedures quarterly (disaster recovery drills)
- Use synchronous replication for critical write transactions only
- Implement connection pooling to handle failover reconnections
- Document failover runbooks with step-by-step procedures
- Keep replicas on the same PostgreSQL major version as the primary
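
A minimal slot health check, run on the primary (an inactive slot retains WAL indefinitely and can eventually fill the disk):

```sql
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```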

DON'T:
- Run long-running queries on replicas without tuning conflict resolution
- Forget to update application connection strings after failover
- Use synchronous replication over high-latency networks (>50ms)
- Skip monitoring replication lag (undetected lag means stale reads and large data loss on failover)
- Promote a replica without verifying it's up to date
- Delete replication slots without stopping replicas first
- Ignore hot standby conflicts (causes query cancellations)

## Performance Considerations

- **Replication lag**: <1s for local replicas, <5s for cross-region
- **Write overhead**: 5-10% for asynchronous, 20-50% for synchronous
- **Network bandwidth**: 10-50 Mbps per active replica
- **Disk I/O**: Replica writes the same data as the primary (similar load)
- **Failover time (RTO)**: 10-60 seconds for automatic, 5-15 minutes for manual
- **Data loss (RPO)**: 0 seconds (sync), 0-5 seconds (async)

## Security Considerations

- Use strong passwords for the replication user (20+ characters)
- Encrypt replication traffic with SSL/TLS (`sslmode=require`)
- Restrict replication connections in pg_hba.conf to specific IPs
- Audit all failover operations for compliance
- Rotate replication credentials quarterly
- Use a dedicated replication user (not a superuser)
- Enable connection logging for replication connections

## Related Commands

- `/database-backup-automator` - Backup before major replication changes
- `/database-health-monitor` - Monitor replication lag and health
- `/database-recovery-manager` - PITR using WAL archives from replication
- `/database-connection-pooler` - Handle connection routing after failover

## Version History

- v1.0.0 (2024-10): Initial implementation with streaming replication and automatic failover
- Planned v1.1.0: Add logical replication support, Patroni integration

61  plugin.lock.json  Normal file
@@ -0,0 +1,61 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:jeremylongshore/claude-code-plugins-plus:plugins/database/database-replication-manager",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "97ae5d64e8c06b38e8716238998aab19812e69e9",
    "treeHash": "da5f5a034a7629bd22d3978b916295393b676c7a98873cdb2b6431232b16c504",
    "generatedAt": "2025-11-28T10:18:21.192078Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "database-replication-manager",
    "description": "Manage database replication, failover, and high availability configurations",
    "version": "1.0.0"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "3f30b26bdc6895ac101de9ef4745188554a76c4c10318197441534d57eba2191"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "8264a4ac2bb7ecbcf87a1c1d2fd467af191222b97173156be933e93db6081821"
      },
      {
        "path": "commands/replication.md",
        "sha256": "2de263b472c1793b2cc6ae7fc3e47fce5b876968f377957e57a0becbcd9237b6"
      },
      {
        "path": "skills/database-replication-manager/SKILL.md",
        "sha256": "59f96e915fadc1fa39f2cb8f3310d321e405b86adb80a8b749056a4154bd19b3"
      },
      {
        "path": "skills/database-replication-manager/references/README.md",
        "sha256": "720f6e0356268b97a40bc71569cfcb315894d4833059d52dd27be8063f41ab1a"
      },
      {
        "path": "skills/database-replication-manager/scripts/README.md",
        "sha256": "223f0497201f24e609b54d987df9f5afc41731069e21e2309cd0f6ba0b340bde"
      },
      {
        "path": "skills/database-replication-manager/assets/README.md",
        "sha256": "24cfb70ef86f6762c67597ed06440f4ab8d6cd22e8bfdac707ad12925d21c49d"
      }
    ],
    "dirSha256": "da5f5a034a7629bd22d3978b916295393b676c7a98873cdb2b6431232b16c504"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}

55  skills/database-replication-manager/SKILL.md  Normal file
@@ -0,0 +1,55 @@
---
name: managing-database-replication
description: |
  This skill enables Claude to manage database replication, failover, and high availability configurations using the database-replication-manager plugin. It is designed to assist with tasks such as setting up master-slave replication, configuring automatic failover, monitoring replication lag, and implementing read scaling. Use this skill when the user requests help with "database replication", "failover configuration", "high availability", "replication lag", or "read scaling" for databases like PostgreSQL or MySQL. The plugin facilitates both physical and logical replication strategies.
allowed-tools: Read, Write, Edit, Grep, Glob, Bash
version: 1.0.0
---

## Overview

This skill empowers Claude to automate and streamline database replication processes, ensuring high availability and data consistency across multiple database instances. It simplifies the configuration and management of complex replication topologies.

## How It Works

1. **Initialization**: The skill activates the database-replication-manager plugin upon detecting relevant keywords.
2. **Configuration**: The skill prompts the user for database connection details, replication type (physical/logical), and desired configuration parameters (e.g., failover settings, replication lag thresholds).
3. **Implementation**: The plugin generates and executes the necessary commands to configure database replication based on the user's specifications.

## When to Use This Skill

This skill activates when you need to:
- Set up a new database replication environment.
- Configure automatic failover for a database cluster.
- Monitor replication lag and trigger alerts based on defined thresholds.
- Implement read scaling by distributing read queries across multiple replicas.

## Examples

### Example 1: Setting up Master-Slave Replication

User request: "Set up master-slave replication for my PostgreSQL database with automatic failover."

The skill will:
1. Activate the database-replication-manager plugin.
2. Guide the user through the configuration process, prompting for connection details and failover settings.
3. Generate and execute the necessary PostgreSQL commands to establish master-slave replication and configure automatic failover.

### Example 2: Monitoring Replication Lag

User request: "Monitor replication lag on my MySQL replica and alert me if it exceeds 5 seconds."

The skill will:
1. Activate the database-replication-manager plugin.
2. Configure replication lag monitoring for the specified MySQL replica.
3. Set up alerts that trigger when the replication lag exceeds the defined threshold of 5 seconds.
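
Under the hood, the lag figure on a MySQL replica comes from the replica status; a sketch (MySQL 8.0.22+ syntax, older servers use `SHOW SLAVE STATUS`):

```sql
-- Seconds_Behind_Source in the output is the value compared against the 5-second threshold.
SHOW REPLICA STATUS;
```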

## Best Practices

- **Security**: Always encrypt database credentials and use secure communication channels for replication traffic.
- **Monitoring**: Implement comprehensive monitoring of replication status, lag, and resource utilization.
- **Testing**: Regularly test failover procedures to ensure they function correctly in a disaster recovery scenario.

## Integration

This skill can be integrated with other monitoring and alerting tools to provide comprehensive database management capabilities. It can also be used in conjunction with infrastructure-as-code tools to automate the deployment and configuration of database replication environments.

6  skills/database-replication-manager/assets/README.md  Normal file
@@ -0,0 +1,6 @@
# Assets

Bundled resources for database-replication-manager skill

- [ ] replication_template.conf: A template configuration file for database replication.
- [ ] monitoring_dashboard.json: A sample dashboard configuration for monitoring replication metrics.

8  skills/database-replication-manager/references/README.md  Normal file
@@ -0,0 +1,8 @@
# References

Bundled resources for database-replication-manager skill

- [ ] replication_best_practices.md: A document detailing best practices for database replication.
- [ ] failover_configuration.md: Detailed instructions on configuring automatic failover.
- [ ] database_schema.md: Schema definitions for the databases being replicated.
- [ ] troubleshooting_replication.md: A guide to troubleshooting common replication issues.

7  skills/database-replication-manager/scripts/README.md  Normal file
@@ -0,0 +1,7 @@
# Scripts

Bundled resources for database-replication-manager skill

- [ ] setup_replication.sh: Automates the setup of master-slave replication.
- [ ] failover.sh: Simulates a failover scenario and tests the failover configuration.
- [ ] monitor_replication.py: Monitors replication lag and sends alerts if it exceeds a threshold.