---
description: Monitor database health with real-time metrics, predictive alerts, and automated remediation
shortcut: health-check
---

# Database Health Monitor

Implement production-grade database health monitoring for PostgreSQL and MySQL with real-time metrics collection, predictive alerting, automated remediation, and comprehensive dashboards. Detect performance degradation, resource exhaustion, and replication issues before they impact production, supporting 99.9% uptime SLA compliance.

## When to Use This Command

Use `/health-check` when you need to:

- Monitor database health metrics (connections, CPU, memory, disk) in real time
- Detect performance degradation before users report issues
- Track query performance trends and identify slow query patterns
- Monitor replication lag and data synchronization issues
- Implement automated alerts for critical thresholds (e.g., connections > 90%)
- Generate executive health reports for stakeholders

DON'T use this when:

- You only need a one-time health check (use manual SQL queries instead)
- The database is a development/test environment (continuous monitoring is overkill for non-production)
- You lack monitoring infrastructure (Prometheus, Grafana, or equivalent)
- Metrics collection overhead impacts performance (use sampling instead)
- Your cloud provider offers managed monitoring (use RDS Performance Insights instead)

## Design Decisions

This command implements **comprehensive continuous monitoring** because:

- Real-time metrics enable proactive issue detection (fix before users notice)
- Historical trends reveal capacity planning needs (prevent future outages)
- Automated alerting reduces mean time to resolution (MTTR) by up to 80%
- Predictive analysis identifies issues before they become critical
- Centralized dashboards provide single-pane-of-glass visibility

**Alternative considered: Cloud provider monitoring (RDS/Cloud SQL)**

- Lower setup overhead (managed service)
- Vendor-specific dashboards and metrics
- Limited customization for business-specific alerts
- Recommended when using managed databases without custom metrics

**Alternative considered: Application Performance Monitoring (APM) only**

- Monitors the application layer, not database internals
- Misses database-specific issues (replication lag, vacuum bloat)
- Cannot detect issues before they impact applications
- Recommended as a complement to, not a replacement for, database monitoring

## Prerequisites

Before running this command:

1. Monitoring infrastructure (Prometheus + Grafana or equivalent)
2. Database metrics exporter installed (postgres_exporter, mysqld_exporter)
3. Alert notification channels configured (Slack, PagerDuty, email)
4. Database permissions for monitoring queries (pg_monitor role or equivalent)
5. Historical baseline data for anomaly detection (minimum 7 days)

## Implementation Process

### Step 1: Deploy Metrics Collector

Install postgres_exporter or mysqld_exporter to expose database metrics to Prometheus.

### Step 2: Configure Prometheus Scraping

Add scrape targets for database metrics with appropriate intervals (15-30 seconds).

### Step 3: Create Grafana Dashboards

Import or build dashboards for connections, queries, replication, and resources.

### Step 4: Define Alert Rules

Set thresholds for critical metrics: connection pool saturation, replication lag, disk usage.

### Step 5: Implement Automated Remediation

Create runbooks and scripts to auto-heal common issues (kill idle connections, vacuum); see the sketch below.
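The following is a minimal sketch of the kind of auto-healing Step 5 describes: terminating sessions stuck "idle in transaction" past a threshold. The generated artifact is `monitoring/remediation.sh`; this Python version is illustrative only, and the connection string, threshold, and function name are assumptions, not part of the generated files.

```python
#!/usr/bin/env python3
"""Illustrative remediation sketch: clean up stale 'idle in transaction' sessions."""

import psycopg2

# Illustrative values -- adjust the DSN and threshold to your environment.
CONN_STRING = "postgresql://monitoring_user@localhost/postgres"
IDLE_THRESHOLD_MINUTES = 10


def terminate_idle_in_transaction(conn_string: str, threshold_minutes: int) -> int:
    """Request termination of backends idle in transaction longer than the threshold.

    Returns the number of backends a termination signal was sent to.
    """
    with psycopg2.connect(conn_string) as conn:
        with conn.cursor() as cur:
            # pg_terminate_backend() requires superuser or pg_signal_backend
            # membership; grant it to the remediation role explicitly.
            cur.execute(
                """
                SELECT pg_terminate_backend(pid)
                FROM pg_stat_activity
                WHERE state = 'idle in transaction'
                  AND state_change < now() - make_interval(mins => %s)
                  AND pid != pg_backend_pid()
                """,
                (threshold_minutes,),
            )
            return len(cur.fetchall())


if __name__ == "__main__":
    count = terminate_idle_in_transaction(CONN_STRING, IDLE_THRESHOLD_MINUTES)
    print(f"Sent termination signal to {count} idle-in-transaction backends")
```

In a production runbook this logic would typically run only after an alert fires and log which sessions it killed, rather than running unconditionally on a timer.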
## Output Format

The command generates:

- `monitoring/prometheus_config.yml` - Scrape targets and alert rules
- `monitoring/health_monitor.py` - Python health check collector
- `monitoring/grafana_dashboard.json` - Pre-configured Grafana dashboard
- `monitoring/alert_rules.yml` - Alert definitions with thresholds
- `monitoring/remediation.sh` - Automated remediation scripts

## Code Examples

### Example 1: PostgreSQL Comprehensive Health Monitor

```python
#!/usr/bin/env python3
"""
Production-ready PostgreSQL health monitoring system with real-time
metrics collection, predictive alerting, and automated remediation.
"""

import psycopg2
from psycopg2.extras import RealDictCursor
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional
import logging
import json
import time
import requests

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class PostgreSQLHealthMonitor:
    """
    Comprehensive PostgreSQL health monitoring with metrics collection,
    threshold alerting, and historical trend analysis.
    """

    def __init__(
        self,
        conn_string: str,
        alert_webhook: Optional[str] = None,
        check_interval: int = 30
    ):
        """
        Initialize health monitor.

        Args:
            conn_string: Database connection string
            alert_webhook: Slack webhook URL for alerts
            check_interval: Seconds between health checks
        """
        self.conn_string = conn_string
        self.alert_webhook = alert_webhook
        self.check_interval = check_interval
        self.baseline_metrics = {}

    def collect_health_metrics(self) -> Dict[str, Any]:
        """
        Collect comprehensive health metrics from PostgreSQL.

        Returns:
            Dictionary with all health metrics
        """
        metrics = {
            'timestamp': datetime.now().isoformat(),
            'connections': {},
            'performance': {},
            'resources': {},
            'replication': {},
            'vacuum': {},
            'health_score': 100  # Start at 100, deduct for issues
        }

        with psycopg2.connect(self.conn_string) as conn:
            with conn.cursor(cursor_factory=RealDictCursor) as cur:
                # Connection metrics
                metrics['connections'] = self._collect_connection_metrics(cur)

                # Query performance metrics
                metrics['performance'] = self._collect_performance_metrics(cur)

                # Resource utilization metrics
                metrics['resources'] = self._collect_resource_metrics(cur)

                # Replication metrics (if replica)
                metrics['replication'] = self._collect_replication_metrics(cur)

                # Vacuum and maintenance metrics
                metrics['vacuum'] = self._collect_vacuum_metrics(cur)

        # Calculate overall health score
        metrics['health_score'] = self._calculate_health_score(metrics)

        return metrics

    def _collect_connection_metrics(self, cur) -> Dict[str, Any]:
        """Collect connection pool metrics."""
        cur.execute("""
            SELECT
                count(*) as total_connections,
                count(*) FILTER (WHERE state = 'active') as active_connections,
                count(*) FILTER (WHERE state = 'idle') as idle_connections,
                count(*) FILTER (WHERE state = 'idle in transaction') as idle_in_transaction,
                max(EXTRACT(EPOCH FROM (now() - backend_start))) as max_connection_age_seconds
            FROM pg_stat_activity
            WHERE pid != pg_backend_pid()
        """)
        conn_stats = dict(cur.fetchone())

        # Get max connections setting
        cur.execute("SHOW max_connections")
        max_conn = int(cur.fetchone()['max_connections'])
        conn_stats['max_connections'] = max_conn
        conn_stats['connection_usage_pct'] = (
            conn_stats['total_connections'] / max_conn * 100
        )

        # Alert if connection pool is > 80% full
        if conn_stats['connection_usage_pct'] > 80:
            logger.warning(
                f"Connection pool at {conn_stats['connection_usage_pct']:.1f}% "
                f"({conn_stats['total_connections']}/{max_conn})"
            )

        return conn_stats

    def _collect_performance_metrics(self, cur) -> Dict[str, Any]:
        """Collect query performance metrics."""
        # Check if pg_stat_statements is enabled
        cur.execute("""
            SELECT count(*) > 0 as enabled
            FROM pg_extension
            WHERE extname = 'pg_stat_statements'
        """)
        pg_stat_enabled = cur.fetchone()['enabled']

        perf_metrics = {'pg_stat_statements_enabled': pg_stat_enabled}

        if pg_stat_enabled:
            # Top 5 slowest queries
            cur.execute("""
                SELECT
                    query,
                    calls,
                    mean_exec_time,
                    total_exec_time,
                    stddev_exec_time
                FROM pg_stat_statements
                WHERE query NOT LIKE '%pg_stat_statements%'
                ORDER BY mean_exec_time DESC
                LIMIT 5
            """)
            perf_metrics['slow_queries'] = [dict(row) for row in cur.fetchall()]

            # Overall query stats
            cur.execute("""
                SELECT
                    sum(calls) as total_queries,
                    avg(mean_exec_time) as avg_query_time_ms,
                    percentile_cont(0.95) WITHIN GROUP (ORDER BY mean_exec_time) as p95_query_time_ms,
                    percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_exec_time) as p99_query_time_ms
                FROM pg_stat_statements
            """)
            perf_metrics['query_stats'] = dict(cur.fetchone())
        else:
            logger.warning("pg_stat_statements extension not enabled")

        # Transaction stats
        cur.execute("""
            SELECT
                sum(xact_commit) as commits,
                sum(xact_rollback) as rollbacks,
                CASE WHEN sum(xact_commit) > 0
                    THEN sum(xact_rollback)::float / sum(xact_commit) * 100
                    ELSE 0
                END as rollback_rate_pct
            FROM pg_stat_database
            WHERE datname = current_database()
        """)
        perf_metrics['transactions'] = dict(cur.fetchone())

        return perf_metrics

    def _collect_resource_metrics(self, cur) -> Dict[str, Any]:
        """Collect resource utilization metrics."""
        resource_metrics = {}

        # Database size
        cur.execute("""
            SELECT
                pg_database_size(current_database()) as size_bytes,
                pg_size_pretty(pg_database_size(current_database())) as size_pretty
        """)
        resource_metrics['database_size'] = dict(cur.fetchone())

        # Top 10 largest tables
        cur.execute("""
            SELECT
                schemaname,
                tablename,
                pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
                pg_total_relation_size(schemaname||'.'||tablename) as size_bytes
            FROM pg_tables
            WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
            ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
            LIMIT 10
        """)
        resource_metrics['largest_tables'] = [dict(row) for row in cur.fetchall()]

        # Cache hit ratio
        cur.execute("""
            SELECT
                sum(heap_blks_read) as heap_read,
                sum(heap_blks_hit) as heap_hit,
                CASE WHEN sum(heap_blks_hit) + sum(heap_blks_read) > 0
                    THEN sum(heap_blks_hit)::float / (sum(heap_blks_hit) + sum(heap_blks_read)) * 100
                    ELSE 0
                END as cache_hit_ratio_pct
            FROM pg_statio_user_tables
        """)
        resource_metrics['cache'] = dict(cur.fetchone())

        # Index hit ratio
        cur.execute("""
            SELECT
                sum(idx_blks_read) as idx_read,
                sum(idx_blks_hit) as idx_hit,
                CASE WHEN sum(idx_blks_hit) + sum(idx_blks_read) > 0
                    THEN sum(idx_blks_hit)::float / (sum(idx_blks_hit) + sum(idx_blks_read)) * 100
                    ELSE 0
                END as index_hit_ratio_pct
            FROM pg_statio_user_indexes
        """)
        resource_metrics['index_cache'] = dict(cur.fetchone())

        # Warn if cache hit ratio is < 95%
        cache_hit_ratio = resource_metrics['cache']['cache_hit_ratio_pct']
        if cache_hit_ratio < 95:
            logger.warning(
                f"Low cache hit ratio: {cache_hit_ratio:.2f}% "
                "(consider increasing shared_buffers)"
            )

        return resource_metrics

    def _collect_replication_metrics(self, cur) -> Dict[str, Any]:
        """Collect replication lag and sync metrics."""
        replication_metrics = {'is_replica': False}

        # Check if this is a replica
        cur.execute("SELECT pg_is_in_recovery() as is_recovery")
        is_replica = cur.fetchone()['is_recovery']
        replication_metrics['is_replica'] = is_replica

        if is_replica:
            # Calculate replication lag
            cur.execute("""
                SELECT
                    EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag_seconds,
                    pg_last_xact_replay_timestamp() as last_replay_timestamp
            """)
            lag_data = dict(cur.fetchone())
            replication_metrics['lag'] = lag_data

            lag_seconds = lag_data['lag_seconds'] or 0

            # Alert if lag > 60 seconds
            if lag_seconds > 60:
                logger.error(f"High replication lag: {lag_seconds:.1f} seconds")
                self._send_alert(
                    f"⚠️ High replication lag: {lag_seconds:.1f} seconds",
                    severity='critical'
                )
            elif lag_seconds > 30:
                logger.warning(f"Elevated replication lag: {lag_seconds:.1f} seconds")

        return replication_metrics

    def _collect_vacuum_metrics(self, cur) -> Dict[str, Any]:
        """Collect vacuum and autovacuum metrics."""
        vacuum_metrics = {}

        # Tables needing vacuum
        cur.execute("""
            SELECT
                schemaname,
                relname,
                n_dead_tup,
                n_live_tup,
                CASE WHEN n_live_tup > 0
                    THEN n_dead_tup::float / n_live_tup * 100
                    ELSE 0
                END as dead_tuple_ratio_pct,
                last_vacuum,
                last_autovacuum
            FROM pg_stat_user_tables
            WHERE n_dead_tup > 10000
            ORDER BY n_dead_tup DESC
            LIMIT 10
        """)
        vacuum_metrics['tables_needing_vacuum'] = [dict(row) for row in cur.fetchall()]

        # Bloated tables (dead tuples > 20% of live tuples)
        bloated_tables = [
            t for t in vacuum_metrics['tables_needing_vacuum']
            if t['dead_tuple_ratio_pct'] > 20
        ]
        if bloated_tables:
            logger.warning(
                f"{len(bloated_tables)} tables have >20% dead tuples, "
                "consider manual VACUUM ANALYZE"
            )

        return vacuum_metrics

    def _calculate_health_score(self, metrics: Dict[str, Any]) -> int:
        """
        Calculate overall health score (0-100) based on metrics.

        Deductions:
        - Connection pool > 80%: -10 points (> 90%: -20)
        - Cache hit ratio < 95%: -10 points (< 90%: -20)
        - Replication lag > 30s: -20 points (> 60s: -30)
        - Rollback rate > 5%: -10 points (> 10%: -20)
        - Tables with >20% dead tuples: -5 points each (max -20)
        """
        score = 100

        # Connection pool utilization
        conn_usage = metrics['connections'].get('connection_usage_pct', 0)
        if conn_usage > 90:
            score -= 20
        elif conn_usage > 80:
            score -= 10

        # Cache hit ratio
        cache_hit_ratio = metrics['resources']['cache'].get('cache_hit_ratio_pct', 100)
        if cache_hit_ratio < 90:
            score -= 20
        elif cache_hit_ratio < 95:
            score -= 10

        # Replication lag (if replica)
        if metrics['replication']['is_replica']:
            lag_seconds = metrics['replication']['lag'].get('lag_seconds', 0) or 0
            if lag_seconds > 60:
                score -= 30
            elif lag_seconds > 30:
                score -= 20

        # Rollback rate
        rollback_rate = metrics['performance']['transactions'].get('rollback_rate_pct', 0)
        if rollback_rate > 10:
            score -= 20
        elif rollback_rate > 5:
            score -= 10

        # Bloated tables
        bloated_count = sum(
            1 for t in metrics['vacuum']['tables_needing_vacuum']
            if t['dead_tuple_ratio_pct'] > 20
        )
        score -= min(bloated_count * 5, 20)

        return max(score, 0)  # Ensure score doesn't go below 0

    def _send_alert(self, message: str, severity: str = 'warning') -> None:
        """
        Send alert to configured webhook (Slack, etc.).

        Args:
            message: Alert message
            severity: 'info', 'warning', or 'critical'
        """
        if not self.alert_webhook:
            return

        emoji_map = {
            'info': 'ℹ️',
            'warning': '⚠️',
            'critical': '🚨'
        }

        payload = {
            'text': f"{emoji_map.get(severity, '❓')} Database Health Alert",
            'attachments': [{
                'color': 'danger' if severity == 'critical' else 'warning',
                'text': message,
                'footer': 'PostgreSQL Health Monitor',
                'ts': int(time.time())
            }]
        }

        try:
            response = requests.post(
                self.alert_webhook,
                json=payload,
                timeout=5
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Failed to send alert: {e}")

    def monitor_continuous(self) -> None:
        """Run continuous health monitoring loop."""
        logger.info(
            f"Starting continuous health monitoring "
            f"(interval: {self.check_interval}s)"
        )

        while True:
            try:
                metrics = self.collect_health_metrics()

                # Log health score
                health_score = metrics['health_score']
                logger.info(f"Health score: {health_score}/100")

                # Alert on poor health
                if health_score < 70:
                    self._send_alert(
                        f"Database health score: {health_score}/100\n"
                        f"Connection usage: {metrics['connections']['connection_usage_pct']:.1f}%\n"
                        f"Cache hit ratio: {metrics['resources']['cache']['cache_hit_ratio_pct']:.1f}%",
                        severity='critical' if health_score < 50 else 'warning'
                    )

                # Store metrics for historical trending (optional)
                # self._store_metrics(metrics)

                time.sleep(self.check_interval)

            except KeyboardInterrupt:
                logger.info("Stopping health monitoring")
                break
            except Exception as e:
                logger.error(f"Health check error: {e}")
                time.sleep(self.check_interval)

    def generate_health_report(self) -> str:
        """
        Generate human-readable health report.

        Returns:
            Formatted health report string
        """
        metrics = self.collect_health_metrics()

        report = f"""
=== PostgreSQL Health Report ===
Generated: {metrics['timestamp']}
Health Score: {metrics['health_score']}/100

--- Connections ---
Total: {metrics['connections']['total_connections']}/{metrics['connections']['max_connections']} ({metrics['connections']['connection_usage_pct']:.1f}%)
Active: {metrics['connections']['active_connections']}
Idle: {metrics['connections']['idle_connections']}
Idle in Transaction: {metrics['connections']['idle_in_transaction']}

--- Performance ---
Cache Hit Ratio: {metrics['resources']['cache']['cache_hit_ratio_pct']:.2f}%
Index Hit Ratio: {metrics['resources']['index_cache']['index_hit_ratio_pct']:.2f}%
Rollback Rate: {metrics['performance']['transactions']['rollback_rate_pct']:.2f}%

--- Resources ---
Database Size: {metrics['resources']['database_size']['size_pretty']}

--- Replication ---
Is Replica: {metrics['replication']['is_replica']}
"""

        if metrics['replication']['is_replica']:
            lag = metrics['replication']['lag']['lag_seconds'] or 0
            report += f"Replication Lag: {lag:.1f} seconds\n"

        report += "\n--- Vacuum Status ---\n"
        report += f"Tables Needing Vacuum: {len(metrics['vacuum']['tables_needing_vacuum'])}\n"

        return report


# CLI usage
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="PostgreSQL Health Monitor")
    parser.add_argument("--conn", required=True, help="Connection string")
    parser.add_argument("--webhook", help="Slack webhook URL for alerts")
    parser.add_argument("--interval", type=int, default=30, help="Check interval (seconds)")
    parser.add_argument("--continuous", action="store_true", help="Run continuous monitoring")

    args = parser.parse_args()

    monitor = PostgreSQLHealthMonitor(
        conn_string=args.conn,
        alert_webhook=args.webhook,
        check_interval=args.interval
    )

    if args.continuous:
        monitor.monitor_continuous()
    else:
        print(monitor.generate_health_report())
```

### Example 2: Prometheus Alert Rules for Database Health

```yaml
# prometheus_alert_rules.yml
# Production-ready Prometheus alert rules for PostgreSQL health monitoring

groups:
  - name: postgresql_health
    interval: 30s
    rules:
      # Connection pool saturation
      - alert: PostgreSQLConnectionPoolHigh
        expr: |
          (pg_stat_activity_count / pg_settings_max_connections) > 0.8
        for: 5m
        labels:
          severity: warning
          component: database
        annotations:
          summary: "PostgreSQL connection pool usage high"
          description: "Connection pool at {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      - alert: PostgreSQLConnectionPoolCritical
        expr: |
          (pg_stat_activity_count / pg_settings_max_connections) > 0.95
        for: 2m
        labels:
          severity: critical
          component: database
        annotations:
          summary: "PostgreSQL connection pool nearly exhausted"
          description: "Connection pool at {{ $value | humanizePercentage }} on {{ $labels.instance }}. Immediate action required."

      # Replication lag
      - alert: PostgreSQLReplicationLag
        expr: |
          pg_replication_lag_seconds > 30
        for: 2m
        labels:
          severity: warning
          component: replication
        annotations:
          summary: "PostgreSQL replication lag high"
          description: "Replication lag is {{ $value }} seconds on {{ $labels.instance }}"

      - alert: PostgreSQLReplicationLagCritical
        expr: |
          pg_replication_lag_seconds > 60
        for: 1m
        labels:
          severity: critical
          component: replication
        annotations:
          summary: "PostgreSQL replication lag critical"
          description: "Replication lag is {{ $value }} seconds on {{ $labels.instance }}. Data synchronization delayed."

      # Cache hit ratio
      - alert: PostgreSQLLowCacheHitRatio
        expr: |
          rate(pg_stat_database_blks_hit[5m]) /
          (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])) < 0.95
        for: 10m
        labels:
          severity: warning
          component: performance
        annotations:
          summary: "PostgreSQL cache hit ratio low"
          description: "Cache hit ratio is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Consider increasing shared_buffers."

      # Disk usage
      - alert: PostgreSQLDiskUsageHigh
        expr: |
          (pg_database_size_bytes / (1024^3)) > 800  # 800 GB
        for: 5m
        labels:
          severity: warning
          component: storage
        annotations:
          summary: "PostgreSQL database size growing"
          description: "Database size is {{ $value }} GB on {{ $labels.instance }}. Consider archival or partitioning."

      # Rollback rate
      - alert: PostgreSQLHighRollbackRate
        expr: |
          rate(pg_stat_database_xact_rollback[5m]) /
          (rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
          component: application
        annotations:
          summary: "PostgreSQL high transaction rollback rate"
          description: "Rollback rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Check application errors."

      # Dead tuples (vacuum needed)
      - alert: PostgreSQLTableBloat
        expr: |
          pg_stat_user_tables_n_dead_tup > 100000
        for: 30m
        labels:
          severity: warning
          component: maintenance
        annotations:
          summary: "PostgreSQL table bloat detected"
          description: "Table {{ $labels.relname }} has {{ $value }} dead tuples on {{ $labels.instance }}. Manual VACUUM recommended."
      # Long-running queries
      - alert: PostgreSQLLongRunningQuery
        expr: |
          pg_stat_activity_max_tx_duration_seconds > 600  # 10 minutes
        for: 2m
        labels:
          severity: warning
          component: performance
        annotations:
          summary: "PostgreSQL long-running query detected"
          description: "Query running for {{ $value }} seconds on {{ $labels.instance }}"

      # Database down
      - alert: PostgreSQLDown
        expr: |
          up{job="postgresql"} == 0
        for: 1m
        labels:
          severity: critical
          component: availability
        annotations:
          summary: "PostgreSQL instance down"
          description: "PostgreSQL instance {{ $labels.instance }} is unreachable"
```

### Example 3: Grafana Dashboard JSON (Excerpt)

```json
{
  "dashboard": {
    "title": "PostgreSQL Health Dashboard",
    "uid": "postgresql-health",
    "tags": ["postgresql", "database", "health"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Health Score",
        "type": "gauge",
        "targets": [{
          "expr": "pg_health_score",
          "legendFormat": "Health"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": 0, "color": "red"},
                {"value": 70, "color": "yellow"},
                {"value": 90, "color": "green"}
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Connection Pool Usage",
        "type": "timeseries",
        "targets": [{
          "expr": "pg_stat_activity_count / pg_settings_max_connections * 100",
          "legendFormat": "Usage %"
        }],
        "fieldConfig": {
          "defaults": {"unit": "percent"}
        }
      },
      {
        "id": 3,
        "title": "Cache Hit Ratio",
        "type": "timeseries",
        "targets": [{
          "expr": "rate(pg_stat_database_blks_hit[5m]) / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])) * 100",
          "legendFormat": "Cache Hit %"
        }],
        "fieldConfig": {
          "defaults": {"unit": "percent"}
        }
      },
      {
        "id": 4,
        "title": "Replication Lag",
        "type": "timeseries",
        "targets": [{
          "expr": "pg_replication_lag_seconds",
          "legendFormat": "Lag (seconds)"
        }],
        "fieldConfig": {
          "defaults": {"unit": "s"}
        }
      },
      {
        "id": 5,
        "title": "Query Performance (P95 Latency)",
        "type": "timeseries",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(pg_stat_statements_mean_exec_time_bucket[5m]))",
          "legendFormat": "P95 latency"
        }],
        "fieldConfig": {
          "defaults": {"unit": "ms"}
        }
      }
    ]
  }
}
```

## Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| "permission denied for pg_stat_statements" | Insufficient monitoring privileges | Grant the pg_monitor role: `GRANT pg_monitor TO monitoring_user` |
| "extension pg_stat_statements not found" | Extension not enabled | Enable in postgresql.conf: `shared_preload_libraries = 'pg_stat_statements'` |
| "connection limit reached" | Monitoring exhausting connections | Use connection pooling (PgBouncer) with a dedicated monitoring pool |
| "Prometheus scrape timeout" | Slow monitoring queries | Reduce scrape frequency or simplify expensive collector queries |
| Alert fatigue | Too many false positives | Adjust thresholds based on baseline, use alert grouping |

## Configuration Options

**Monitoring Intervals**

- **Real-time (10-30s)**: Critical production databases
- **Standard (1-5 min)**: Most production workloads
- **Relaxed (10-15 min)**: Development/staging environments
- **On-demand**: Manual health checks

**Alert Thresholds** (codified in the sketch after this section)

- **Connection pool**: 80% warning, 95% critical
- **Replication lag**: 30s warning, 60s critical
- **Cache hit ratio**: <95% warning, <90% critical
- **Disk usage**: 80% warning, 90% critical
- **Rollback rate**: >5% warning, >10% critical

**Retention Policies**

- **Raw metrics**: 15 days (high resolution)
- **Downsampled metrics**: 90 days (5-minute intervals)
- **Long-term trends**: 1 year (1-hour intervals)
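One way to keep these thresholds consistent between the Python collector and the Prometheus rules is to centralize them in code. The sketch below is illustrative only; the `THRESHOLDS` table and `evaluate` helper are hypothetical names (not part of the generated `monitoring/` files), and the numbers simply restate the values listed above.

```python
"""Illustrative sketch: evaluating a metric against the documented alert thresholds."""

from typing import Optional

# Thresholds from the "Alert Thresholds" section above.
# direction "above": fires when the value exceeds the limit;
# direction "below": fires when the value drops under the limit.
THRESHOLDS = {
    "connection_usage_pct": {"direction": "above", "warning": 80, "critical": 95},
    "replication_lag_seconds": {"direction": "above", "warning": 30, "critical": 60},
    "cache_hit_ratio_pct": {"direction": "below", "warning": 95, "critical": 90},
    "disk_usage_pct": {"direction": "above", "warning": 80, "critical": 90},
    "rollback_rate_pct": {"direction": "above", "warning": 5, "critical": 10},
}


def evaluate(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a metric value."""
    rule = THRESHOLDS[metric]
    if rule["direction"] == "above":
        if value > rule["critical"]:
            return "critical"
        if value > rule["warning"]:
            return "warning"
    else:  # "below"
        if value < rule["critical"]:
            return "critical"
        if value < rule["warning"]:
            return "warning"
    return None


if __name__ == "__main__":
    print(evaluate("connection_usage_pct", 87))   # -> warning
    print(evaluate("cache_hit_ratio_pct", 89.5))  # -> critical
```

A helper like this could feed `_send_alert()` in Example 1, so that tuning a threshold requires editing only one table.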
## Best Practices

DO:

- Enable pg_stat_statements for query-level insights
- Set alert thresholds based on historical baselines (not arbitrary values)
- Use a dedicated monitoring user with the pg_monitor role
- Implement alert grouping to reduce notification fatigue
- Test alerting channels quarterly to ensure delivery
- Document runbooks for each alert type
- Monitor the monitoring system itself (meta-monitoring)

DON'T:

- Query pg_stat_activity excessively (adds overhead)
- Ignore cache hit ratio warnings (they impact performance significantly)
- Set thresholds without understanding workload patterns
- Alert on metrics that don't require action
- Collect metrics more frequently than necessary (resource waste)
- Neglect replication lag on read replicas
- Skip baseline data collection before alerting

## Performance Considerations

- **Monitoring overhead**: <1% CPU impact with 30s intervals
- **Network overhead**: 5-10 KB/s metrics bandwidth per database
- **Storage requirements**: ~100 MB/day for raw metrics (per database)
- **Query latency**: <10ms for most health check queries
- **Alert latency**: 1-5 seconds from threshold breach to notification
- **Dashboard refresh**: 30s intervals recommended for real-time visibility

## Security Considerations

- Use a read-only monitoring user (pg_monitor role, no write access)
- Encrypt metrics in transit (TLS for Prometheus scraping)
- Restrict dashboard access with Grafana authentication
- Mask sensitive query data in pg_stat_statements
- Audit access to monitoring systems (SOC 2 compliance)
- Rotate monitoring credentials quarterly
- Secure webhook URLs (Slack, PagerDuty) as secrets

## Related Commands

- `/database-transaction-monitor` - Real-time transaction monitoring
- `/database-deadlock-detector` - Detect and resolve deadlocks
- `/sql-query-optimizer` - Optimize slow queries identified by the health monitor
- `/database-connection-pooler` - Connection pool optimization

## Version History

- v1.0.0 (2024-10): Initial implementation with PostgreSQL/MySQL health monitoring
- Planned v1.1.0: Add predictive alerting with machine learning and automated remediation