Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:18:19 +08:00
commit 968db1804b
8 changed files with 1052 additions and 0 deletions

View File

@@ -0,0 +1,15 @@
{
"name": "database-health-monitor",
"description": "Database plugin for database-health-monitor",
"version": "1.0.0",
"author": {
"name": "Claude Code Plugins",
"email": "[email protected]"
},
"skills": [
"./skills"
],
"commands": [
"./commands"
]
}

3
README.md Normal file
View File

@@ -0,0 +1,3 @@
# database-health-monitor
Database plugin for database-health-monitor

895
commands/health-check.md Normal file
View File

@@ -0,0 +1,895 @@
---
description: Monitor database health with real-time metrics, predictive alerts, and automated remediation
shortcut: health-check
---
# Database Health Monitor
Implement production-grade database health monitoring for PostgreSQL and MySQL with real-time metrics collection, predictive alerting, automated remediation, and comprehensive dashboards. Detect performance degradation, resource exhaustion, and replication issues before they impact production with 99.9% uptime SLA compliance.
## When to Use This Command
Use `/health-check` when you need to:
- Monitor database health metrics (connections, CPU, memory, disk) in real-time
- Detect performance degradation before users report issues
- Track query performance trends and identify slow query patterns
- Monitor replication lag and data synchronization issues
- Implement automated alerts for critical thresholds (connections > 90%)
- Generate executive health reports for stakeholders
DON'T use this when:
- You only need one-time health check (use manual SQL queries instead)
- Database is development/test environment (overkill for non-production)
- You lack monitoring infrastructure (Prometheus, Grafana, or equivalent)
- Metrics collection overhead impacts performance (use sampling instead)
- Cloud provider offers managed monitoring (use RDS Performance Insights instead)
## Design Decisions
This command implements **comprehensive continuous monitoring** because:
- Real-time metrics enable proactive issue detection (fix before users notice)
- Historical trends reveal capacity planning needs (prevent future outages)
- Automated alerting reduces mean-time-to-resolution (MTTR) by 80%
- Predictive analysis identifies issues before they become critical
- Centralized dashboards provide single-pane-of-glass visibility
**Alternative considered: Cloud provider monitoring (RDS/CloudSQL)**
- Lower setup overhead (managed service)
- Vendor-specific dashboards and metrics
- Limited customization for business-specific alerts
- Recommended when using managed databases without custom metrics
**Alternative considered: Application Performance Monitoring (APM) only**
- Monitors application layer, not database internals
- Misses database-specific issues (replication lag, vacuum bloat)
- Cannot detect issues before they impact applications
- Recommended as complement, not replacement, for database monitoring
## Prerequisites
Before running this command:
1. Monitoring infrastructure (Prometheus + Grafana or equivalent)
2. Database metrics exporter installed (postgres_exporter, mysqld_exporter)
3. Alert notification channels configured (Slack, PagerDuty, email)
4. Database permissions for monitoring queries (pg_monitor role or equivalent)
5. Historical baseline data for anomaly detection (minimum 7 days)
## Implementation Process
### Step 1: Deploy Metrics Collector
Install postgres_exporter or mysqld_exporter to expose database metrics to Prometheus.
### Step 2: Configure Prometheus Scraping
Add scrape targets for database metrics with appropriate intervals (15-30 seconds).
### Step 3: Create Grafana Dashboards
Import or build dashboards for connections, queries, replication, and resources.
### Step 4: Define Alert Rules
Set thresholds for critical metrics: connection pool saturation, replication lag, disk usage.
### Step 5: Implement Automated Remediation
Create runbooks and scripts to auto-heal common issues (kill idle connections, vacuum).
## Output Format
The command generates:
- `monitoring/prometheus_config.yml` - Scrape targets and alert rules
- `monitoring/health_monitor.py` - Python health check collector
- `monitoring/grafana_dashboard.json` - Pre-configured Grafana dashboard
- `monitoring/alert_rules.yml` - Alert definitions with thresholds
- `monitoring/remediation.sh` - Automated remediation scripts
## Code Examples
### Example 1: PostgreSQL Comprehensive Health Monitor
```python
#!/usr/bin/env python3
"""
Production-ready PostgreSQL health monitoring system with real-time
metrics collection, predictive alerting, and automated remediation.
"""
import psycopg2
from psycopg2.extras import RealDictCursor
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple
import logging
import json
import time
import requests
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
class PostgreSQLHealthMonitor:
"""
Comprehensive PostgreSQL health monitoring with metrics collection,
threshold alerting, and historical trend analysis.
"""
def __init__(
self,
conn_string: str,
alert_webhook: Optional[str] = None,
check_interval: int = 30
):
"""
Initialize health monitor.
Args:
conn_string: Database connection string
alert_webhook: Slack webhook URL for alerts
check_interval: Seconds between health checks
"""
self.conn_string = conn_string
self.alert_webhook = alert_webhook
self.check_interval = check_interval
self.baseline_metrics = {}
def collect_health_metrics(self) -> Dict[str, any]:
"""
Collect comprehensive health metrics from PostgreSQL.
Returns:
Dictionary with all health metrics
"""
metrics = {
'timestamp': datetime.now().isoformat(),
'connections': {},
'performance': {},
'resources': {},
'replication': {},
'vacuum': {},
'health_score': 100 # Start at 100, deduct for issues
}
with psycopg2.connect(self.conn_string) as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cur:
# Connection metrics
metrics['connections'] = self._collect_connection_metrics(cur)
# Query performance metrics
metrics['performance'] = self._collect_performance_metrics(cur)
# Resource utilization metrics
metrics['resources'] = self._collect_resource_metrics(cur)
# Replication metrics (if replica)
metrics['replication'] = self._collect_replication_metrics(cur)
# Vacuum and maintenance metrics
metrics['vacuum'] = self._collect_vacuum_metrics(cur)
# Calculate overall health score
metrics['health_score'] = self._calculate_health_score(metrics)
return metrics
def _collect_connection_metrics(self, cur) -> Dict[str, any]:
"""Collect connection pool metrics."""
cur.execute("""
SELECT
count(*) as total_connections,
count(*) FILTER (WHERE state = 'active') as active_connections,
count(*) FILTER (WHERE state = 'idle') as idle_connections,
count(*) FILTER (WHERE state = 'idle in transaction') as idle_in_transaction,
max(EXTRACT(EPOCH FROM (now() - backend_start))) as max_connection_age_seconds
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
""")
conn_stats = dict(cur.fetchone())
# Get max connections setting
cur.execute("SHOW max_connections")
max_conn = int(cur.fetchone()['max_connections'])
conn_stats['max_connections'] = max_conn
conn_stats['connection_usage_pct'] = (
conn_stats['total_connections'] / max_conn * 100
)
# Alert if connection pool is > 80% full
if conn_stats['connection_usage_pct'] > 80:
logger.warning(
f"Connection pool at {conn_stats['connection_usage_pct']:.1f}% "
f"({conn_stats['total_connections']}/{max_conn})"
)
return conn_stats
def _collect_performance_metrics(self, cur) -> Dict[str, any]:
"""Collect query performance metrics."""
# Check if pg_stat_statements is enabled
cur.execute("""
SELECT count(*) > 0 as enabled
FROM pg_extension
WHERE extname = 'pg_stat_statements'
""")
pg_stat_enabled = cur.fetchone()['enabled']
perf_metrics = {'pg_stat_statements_enabled': pg_stat_enabled}
if pg_stat_enabled:
# Top 5 slowest queries
cur.execute("""
SELECT
query,
calls,
mean_exec_time,
total_exec_time,
stddev_exec_time
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements%'
ORDER BY mean_exec_time DESC
LIMIT 5
""")
perf_metrics['slow_queries'] = [dict(row) for row in cur.fetchall()]
# Overall query stats
cur.execute("""
SELECT
sum(calls) as total_queries,
avg(mean_exec_time) as avg_query_time_ms,
percentile_cont(0.95) WITHIN GROUP (ORDER BY mean_exec_time) as p95_query_time_ms,
percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_exec_time) as p99_query_time_ms
FROM pg_stat_statements
""")
perf_metrics['query_stats'] = dict(cur.fetchone())
else:
logger.warning("pg_stat_statements extension not enabled")
# Transaction stats
cur.execute("""
SELECT
sum(xact_commit) as commits,
sum(xact_rollback) as rollbacks,
CASE
WHEN sum(xact_commit) > 0 THEN
sum(xact_rollback)::float / sum(xact_commit) * 100
ELSE 0
END as rollback_rate_pct
FROM pg_stat_database
WHERE datname = current_database()
""")
perf_metrics['transactions'] = dict(cur.fetchone())
return perf_metrics
def _collect_resource_metrics(self, cur) -> Dict[str, any]:
"""Collect resource utilization metrics."""
resource_metrics = {}
# Database size
cur.execute("""
SELECT
pg_database_size(current_database()) as size_bytes,
pg_size_pretty(pg_database_size(current_database())) as size_pretty
""")
resource_metrics['database_size'] = dict(cur.fetchone())
# Top 10 largest tables
cur.execute("""
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
pg_total_relation_size(schemaname||'.'||tablename) as size_bytes
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10
""")
resource_metrics['largest_tables'] = [dict(row) for row in cur.fetchall()]
# Cache hit ratio
cur.execute("""
SELECT
sum(heap_blks_read) as heap_read,
sum(heap_blks_hit) as heap_hit,
CASE
WHEN sum(heap_blks_hit) + sum(heap_blks_read) > 0 THEN
sum(heap_blks_hit)::float / (sum(heap_blks_hit) + sum(heap_blks_read)) * 100
ELSE 0
END as cache_hit_ratio_pct
FROM pg_statio_user_tables
""")
resource_metrics['cache'] = dict(cur.fetchone())
# Index hit ratio
cur.execute("""
SELECT
sum(idx_blks_read) as idx_read,
sum(idx_blks_hit) as idx_hit,
CASE
WHEN sum(idx_blks_hit) + sum(idx_blks_read) > 0 THEN
sum(idx_blks_hit)::float / (sum(idx_blks_hit) + sum(idx_blks_read)) * 100
ELSE 0
END as index_hit_ratio_pct
FROM pg_statio_user_indexes
""")
resource_metrics['index_cache'] = dict(cur.fetchone())
# Warn if cache hit ratio is < 95%
cache_hit_ratio = resource_metrics['cache']['cache_hit_ratio_pct']
if cache_hit_ratio < 95:
logger.warning(
f"Low cache hit ratio: {cache_hit_ratio:.2f}% "
"(consider increasing shared_buffers)"
)
return resource_metrics
def _collect_replication_metrics(self, cur) -> Dict[str, any]:
"""Collect replication lag and sync metrics."""
replication_metrics = {'is_replica': False}
# Check if this is a replica
cur.execute("SELECT pg_is_in_recovery() as is_recovery")
is_replica = cur.fetchone()['is_recovery']
replication_metrics['is_replica'] = is_replica
if is_replica:
# Calculate replication lag
cur.execute("""
SELECT
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag_seconds,
pg_last_xact_replay_timestamp() as last_replay_timestamp
""")
lag_data = dict(cur.fetchone())
replication_metrics['lag'] = lag_data
lag_seconds = lag_data['lag_seconds'] or 0
# Alert if lag > 60 seconds
if lag_seconds > 60:
logger.error(f"High replication lag: {lag_seconds:.1f} seconds")
self._send_alert(
f"⚠️ High replication lag: {lag_seconds:.1f} seconds",
severity='critical'
)
elif lag_seconds > 30:
logger.warning(f"Elevated replication lag: {lag_seconds:.1f} seconds")
return replication_metrics
def _collect_vacuum_metrics(self, cur) -> Dict[str, any]:
"""Collect vacuum and autovacuum metrics."""
vacuum_metrics = {}
# Tables needing vacuum
cur.execute("""
SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup,
CASE
WHEN n_live_tup > 0 THEN
n_dead_tup::float / n_live_tup * 100
ELSE 0
END as dead_tuple_ratio_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 10
""")
vacuum_metrics['tables_needing_vacuum'] = [dict(row) for row in cur.fetchall()]
# Bloated tables (dead tuples > 20% of live tuples)
bloated_tables = [
t for t in vacuum_metrics['tables_needing_vacuum']
if t['dead_tuple_ratio_pct'] > 20
]
if bloated_tables:
logger.warning(
f"{len(bloated_tables)} tables have >20% dead tuples, "
"consider manual VACUUM ANALYZE"
)
return vacuum_metrics
def _calculate_health_score(self, metrics: Dict[str, any]) -> int:
"""
Calculate overall health score (0-100) based on metrics.
Deductions:
- Connection pool > 80%: -10 points
- Cache hit ratio < 95%: -10 points
- Replication lag > 30s: -20 points
- Rollback rate > 5%: -10 points
- Tables with >20% dead tuples: -5 points each (max -20)
"""
score = 100
# Connection pool utilization
conn_usage = metrics['connections'].get('connection_usage_pct', 0)
if conn_usage > 90:
score -= 20
elif conn_usage > 80:
score -= 10
# Cache hit ratio
cache_hit_ratio = metrics['resources']['cache'].get('cache_hit_ratio_pct', 100)
if cache_hit_ratio < 90:
score -= 20
elif cache_hit_ratio < 95:
score -= 10
# Replication lag (if replica)
if metrics['replication']['is_replica']:
lag_seconds = metrics['replication']['lag'].get('lag_seconds', 0) or 0
if lag_seconds > 60:
score -= 30
elif lag_seconds > 30:
score -= 20
# Rollback rate
rollback_rate = metrics['performance']['transactions'].get('rollback_rate_pct', 0)
if rollback_rate > 10:
score -= 20
elif rollback_rate > 5:
score -= 10
# Bloated tables
bloated_count = sum(
1 for t in metrics['vacuum']['tables_needing_vacuum']
if t['dead_tuple_ratio_pct'] > 20
)
score -= min(bloated_count * 5, 20)
return max(score, 0) # Ensure score doesn't go below 0
def _send_alert(self, message: str, severity: str = 'warning') -> None:
"""
Send alert to configured webhook (Slack, etc.).
Args:
message: Alert message
severity: 'info', 'warning', or 'critical'
"""
if not self.alert_webhook:
return
emoji_map = {
'info': '',
'warning': '⚠️',
'critical': '🚨'
}
payload = {
'text': f"{emoji_map.get(severity, '')} Database Health Alert",
'attachments': [{
'color': 'warning' if severity == 'warning' else 'danger',
'text': message,
'footer': f'PostgreSQL Health Monitor',
'ts': int(time.time())
}]
}
try:
response = requests.post(
self.alert_webhook,
json=payload,
timeout=5
)
response.raise_for_status()
except Exception as e:
logger.error(f"Failed to send alert: {e}")
def monitor_continuous(self) -> None:
"""
Run continuous health monitoring loop.
"""
logger.info(
f"Starting continuous health monitoring "
f"(interval: {self.check_interval}s)"
)
while True:
try:
metrics = self.collect_health_metrics()
# Log health score
health_score = metrics['health_score']
logger.info(f"Health score: {health_score}/100")
# Alert on poor health
if health_score < 70:
self._send_alert(
f"Database health score: {health_score}/100\n"
f"Connection usage: {metrics['connections']['connection_usage_pct']:.1f}%\n"
f"Cache hit ratio: {metrics['resources']['cache']['cache_hit_ratio_pct']:.1f}%",
severity='critical' if health_score < 50 else 'warning'
)
# Store metrics for historical trending (optional)
# self._store_metrics(metrics)
time.sleep(self.check_interval)
except KeyboardInterrupt:
logger.info("Stopping health monitoring")
break
except Exception as e:
logger.error(f"Health check error: {e}")
time.sleep(self.check_interval)
def generate_health_report(self) -> str:
"""
Generate human-readable health report.
Returns:
Formatted health report string
"""
metrics = self.collect_health_metrics()
report = f"""
=== PostgreSQL Health Report ===
Generated: {metrics['timestamp']}
Health Score: {metrics['health_score']}/100
--- Connections ---
Total: {metrics['connections']['total_connections']}/{metrics['connections']['max_connections']} ({metrics['connections']['connection_usage_pct']:.1f}%)
Active: {metrics['connections']['active_connections']}
Idle: {metrics['connections']['idle_connections']}
Idle in Transaction: {metrics['connections']['idle_in_transaction']}
--- Performance ---
Cache Hit Ratio: {metrics['resources']['cache']['cache_hit_ratio_pct']:.2f}%
Index Hit Ratio: {metrics['resources']['index_cache']['index_hit_ratio_pct']:.2f}%
Rollback Rate: {metrics['performance']['transactions']['rollback_rate_pct']:.2f}%
--- Resources ---
Database Size: {metrics['resources']['database_size']['size_pretty']}
--- Replication ---
Is Replica: {metrics['replication']['is_replica']}
"""
if metrics['replication']['is_replica']:
lag = metrics['replication']['lag']['lag_seconds'] or 0
report += f"Replication Lag: {lag:.1f} seconds\n"
report += f"\n--- Vacuum Status ---\n"
report += f"Tables Needing Vacuum: {len(metrics['vacuum']['tables_needing_vacuum'])}\n"
return report
# CLI usage
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="PostgreSQL Health Monitor")
parser.add_argument("--conn", required=True, help="Connection string")
parser.add_argument("--webhook", help="Slack webhook URL for alerts")
parser.add_argument("--interval", type=int, default=30, help="Check interval (seconds)")
parser.add_argument("--continuous", action="store_true", help="Run continuous monitoring")
args = parser.parse_args()
monitor = PostgreSQLHealthMonitor(
conn_string=args.conn,
alert_webhook=args.webhook,
check_interval=args.interval
)
if args.continuous:
monitor.monitor_continuous()
else:
print(monitor.generate_health_report())
```
### Example 2: Prometheus Alert Rules for Database Health
```yaml
# prometheus_alert_rules.yml
# Production-ready Prometheus alert rules for PostgreSQL health monitoring
groups:
- name: postgresql_health
interval: 30s
rules:
# Connection pool saturation
- alert: PostgreSQLConnectionPoolHigh
expr: |
(pg_stat_activity_count / pg_settings_max_connections) > 0.8
for: 5m
labels:
severity: warning
component: database
annotations:
summary: "PostgreSQL connection pool usage high"
description: "Connection pool at {{ $value | humanizePercentage }} on {{ $labels.instance }}"
- alert: PostgreSQLConnectionPoolCritical
expr: |
(pg_stat_activity_count / pg_settings_max_connections) > 0.95
for: 2m
labels:
severity: critical
component: database
annotations:
summary: "PostgreSQL connection pool nearly exhausted"
description: "Connection pool at {{ $value | humanizePercentage }} on {{ $labels.instance }}. Immediate action required."
# Replication lag
- alert: PostgreSQLReplicationLag
expr: |
pg_replication_lag_seconds > 30
for: 2m
labels:
severity: warning
component: replication
annotations:
summary: "PostgreSQL replication lag high"
description: "Replication lag is {{ $value }} seconds on {{ $labels.instance }}"
- alert: PostgreSQLReplicationLagCritical
expr: |
pg_replication_lag_seconds > 60
for: 1m
labels:
severity: critical
component: replication
annotations:
summary: "PostgreSQL replication lag critical"
description: "Replication lag is {{ $value }} seconds on {{ $labels.instance }}. Data synchronization delayed."
# Cache hit ratio
- alert: PostgreSQLLowCacheHitRatio
expr: |
rate(pg_stat_database_blks_hit[5m]) /
(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])) < 0.95
for: 10m
labels:
severity: warning
component: performance
annotations:
summary: "PostgreSQL cache hit ratio low"
description: "Cache hit ratio is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Consider increasing shared_buffers."
# Disk usage
- alert: PostgreSQLDiskUsageHigh
expr: |
(pg_database_size_bytes / (1024^3)) > 800 # 800GB
for: 5m
labels:
severity: warning
component: storage
annotations:
summary: "PostgreSQL database size growing"
description: "Database size is {{ $value }} GB on {{ $labels.instance }}. Consider archival or partitioning."
# Rollback rate
- alert: PostgreSQLHighRollbackRate
expr: |
rate(pg_stat_database_xact_rollback[5m]) /
(rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])) > 0.05
for: 5m
labels:
severity: warning
component: application
annotations:
summary: "PostgreSQL high transaction rollback rate"
description: "Rollback rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Check application errors."
# Dead tuples (vacuum needed)
- alert: PostgreSQLTableBloat
expr: |
pg_stat_user_tables_n_dead_tup > 100000
for: 30m
labels:
severity: warning
component: maintenance
annotations:
summary: "PostgreSQL table bloat detected"
description: "Table {{ $labels.relname }} has {{ $value }} dead tuples on {{ $labels.instance }}. Manual VACUUM recommended."
# Long-running queries
- alert: PostgreSQLLongRunningQuery
expr: |
pg_stat_activity_max_tx_duration_seconds > 600 # 10 minutes
for: 2m
labels:
severity: warning
component: performance
annotations:
summary: "PostgreSQL long-running query detected"
description: "Query running for {{ $value }} seconds on {{ $labels.instance }}"
# Database down
- alert: PostgreSQLDown
expr: |
up{job="postgresql"} == 0
for: 1m
labels:
severity: critical
component: availability
annotations:
summary: "PostgreSQL instance down"
description: "PostgreSQL instance {{ $labels.instance }} is unreachable"
```
### Example 3: Grafana Dashboard JSON (Excerpt)
```json
{
"dashboard": {
"title": "PostgreSQL Health Dashboard",
"uid": "postgresql-health",
"tags": ["postgresql", "database", "health"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Health Score",
"type": "gauge",
"targets": [{
"expr": "pg_health_score",
"legendFormat": "Health"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"value": 0, "color": "red"},
{"value": 70, "color": "yellow"},
{"value": 90, "color": "green"}
]
}
}
}
},
{
"id": 2,
"title": "Connection Pool Usage",
"type": "timeseries",
"targets": [{
"expr": "pg_stat_activity_count / pg_settings_max_connections * 100",
"legendFormat": "Usage %"
}],
"fieldConfig": {
"defaults": {
"unit": "percent"
}
}
},
{
"id": 3,
"title": "Cache Hit Ratio",
"type": "timeseries",
"targets": [{
"expr": "rate(pg_stat_database_blks_hit[5m]) / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])) * 100",
"legendFormat": "Cache Hit %"
}],
"fieldConfig": {
"defaults": {
"unit": "percent"
}
}
},
{
"id": 4,
"title": "Replication Lag",
"type": "timeseries",
"targets": [{
"expr": "pg_replication_lag_seconds",
"legendFormat": "Lag (seconds)"
}],
"fieldConfig": {
"defaults": {
"unit": "s"
}
}
},
{
"id": 5,
"title": "Query Performance (P95 Latency)",
"type": "timeseries",
"targets": [{
"expr": "histogram_quantile(0.95, rate(pg_stat_statements_mean_exec_time_bucket[5m]))",
"legendFormat": "P95 latency"
}],
"fieldConfig": {
"defaults": {
"unit": "ms"
}
}
}
]
}
}
```
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| "permission denied for pg_stat_statements" | Insufficient monitoring privileges | Grant pg_monitor role: `GRANT pg_monitor TO monitoring_user` |
| "extension pg_stat_statements not found" | Extension not enabled | Enable in postgresql.conf: `shared_preload_libraries = 'pg_stat_statements'` |
| "connection limit reached" | Monitoring exhausting connections | Use connection pooling (PgBouncer) with dedicated monitoring pool |
| "Prometheus scrape timeout" | Slow monitoring queries | Add indexes on pg_stat tables, reduce scrape frequency |
| "Alert fatigue" | Too many false positives | Adjust thresholds based on baseline, use alert grouping |
## Configuration Options
**Monitoring Intervals**
- **Real-time (10-30s)**: Critical production databases
- **Standard (1-5min)**: Most production workloads
- **Relaxed (10-15min)**: Development/staging environments
- **On-demand**: Manual health checks
**Alert Thresholds**
- **Connection pool**: 80% warning, 95% critical
- **Replication lag**: 30s warning, 60s critical
- **Cache hit ratio**: <95% warning, <90% critical
- **Disk usage**: 80% warning, 90% critical
- **Rollback rate**: >5% warning, >10% critical
**Retention Policies**
- **Raw metrics**: 15 days (high resolution)
- **Downsampled metrics**: 90 days (5min intervals)
- **Long-term trends**: 1 year (1hour intervals)
## Best Practices
DO:
- Enable pg_stat_statements for query-level insights
- Set alert thresholds based on historical baseline (not arbitrary values)
- Use dedicated monitoring user with pg_monitor role
- Implement alert grouping to reduce notification fatigue
- Test alerting channels quarterly to ensure delivery
- Document runbooks for each alert type
- Monitor the monitoring system itself (meta-monitoring)
DON'T:
- Query pg_stat_activity excessively (adds overhead)
- Ignore cache hit ratio warnings (impacts performance significantly)
- Set thresholds without understanding workload patterns
- Alert on metrics that don't require action
- Collect metrics more frequently than necessary (resource waste)
- Neglect replication lag on read replicas
- Skip baseline data collection before alerting
## Performance Considerations
- **Monitoring overhead**: <1% CPU impact with 30s intervals
- **Network overhead**: 5-10KB/s metrics bandwidth per database
- **Storage requirements**: 100MB/day for raw metrics (per database)
- **Query latency**: <10ms for most health check queries
- **Alert latency**: 1-5 seconds from threshold breach to notification
- **Dashboard refresh**: 30s intervals recommended for real-time visibility
## Security Considerations
- Use read-only monitoring user (pg_monitor role, no write access)
- Encrypt metrics in transit (TLS for Prometheus scraping)
- Restrict dashboard access with Grafana authentication
- Mask sensitive query data in pg_stat_statements
- Audit access to monitoring systems (SOC2 compliance)
- Rotate monitoring credentials quarterly
- Secure webhook URLs (Slack, PagerDuty) as secrets
## Related Commands
- `/database-transaction-monitor` - Real-time transaction monitoring
- `/database-deadlock-detector` - Detect and resolve deadlocks
- `/sql-query-optimizer` - Optimize slow queries identified by health monitor
- `/database-connection-pooler` - Connection pool optimization
## Version History
- v1.0.0 (2024-10): Initial implementation with PostgreSQL/MySQL health monitoring
- Planned v1.1.0: Add predictive alerting with machine learning, automated remediation

61
plugin.lock.json Normal file
View File

@@ -0,0 +1,61 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:jeremylongshore/claude-code-plugins-plus:plugins/database/database-health-monitor",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "a055dc7aae80338023c1d21ea89838b79dd61811",
"treeHash": "3a109b7767df3d54a5f550262c49fd64af0c8c84f655872d5b8d8ee4b3ce08d0",
"generatedAt": "2025-11-28T10:18:19.934030Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "database-health-monitor",
"description": "Database plugin for database-health-monitor",
"version": "1.0.0"
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "49f04b2638265792b2b7a27664a8d9e88b6f96518a84770349d7ebb1cba0e803"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "c623af4ce9dcd693f829d28d9ebe3b32a4242555c53e3c2e5e22be4f0b06388b"
},
{
"path": "commands/health-check.md",
"sha256": "915e36a99a1ea3df70ccc2ddbe71dde2519cc98b5fa982ac45611e64a20c2e40"
},
{
"path": "skills/database-health-monitor/SKILL.md",
"sha256": "0af927656450b1e313ce8b665efe22047108095ac33e583909391a6de05c1e1f"
},
{
"path": "skills/database-health-monitor/references/README.md",
"sha256": "b2a55d36219f1d4d2b8a4ca1ef9a2d1d0ddaacebca9bf7fdf7cbb2856cb61b04"
},
{
"path": "skills/database-health-monitor/scripts/README.md",
"sha256": "064ea3bd8fbc2aad43199019a4b744789757006243905bf00f1a0713f918d125"
},
{
"path": "skills/database-health-monitor/assets/README.md",
"sha256": "d1eff4a48628bfcd0a16e3aed80ca00a73584cb523f3df433fc9dd72546d7626"
}
],
"dirSha256": "3a109b7767df3d54a5f550262c49fd64af0c8c84f655872d5b8d8ee4b3ce08d0"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}

View File

@@ -0,0 +1,55 @@
---
name: monitoring-database-health
description: |
This skill enables Claude to monitor database health using real-time metrics, predictive alerts, and automated remediation. It's designed for production-grade database health monitoring for PostgreSQL and MySQL, detecting performance degradation, resource exhaustion, and replication issues. Use this skill when the user requests to monitor database health, check database performance, receive database alerts, or automate database remediation. The skill is triggered by phrases like "check database health", "monitor database performance", "database health check", or "/health-check".
allowed-tools: Read, Write, Edit, Grep, Glob, Bash
version: 1.0.0
---
## Overview
This skill empowers Claude to proactively monitor the health of your databases. It provides real-time metrics, predictive alerts, and automated remediation capabilities to ensure optimal performance and uptime.
## How It Works
1. **Initiate Health Check**: The user requests a database health check via natural language or the `/health-check` command.
2. **Collect Metrics**: The plugin gathers real-time metrics from the specified database (PostgreSQL or MySQL), including connection counts, query performance, resource utilization, and replication status.
3. **Analyze and Alert**: The collected metrics are analyzed against predefined thresholds and historical data to identify potential issues. Predictive alerts are generated for anomalies.
4. **Provide Report**: A comprehensive health report is provided, detailing the current status, potential issues, and recommended remediation steps.
## When to Use This Skill
This skill activates when you need to:
- Check the current health status of a database.
- Monitor database performance for potential bottlenecks.
- Receive alerts about potential database issues before they impact production.
## Examples
### Example 1: Checking Database Performance
User request: "Check the health of my PostgreSQL database."
The skill will:
1. Connect to the PostgreSQL database.
2. Collect metrics on CPU usage, memory consumption, disk I/O, connection counts, and query execution times.
3. Analyze the collected data and generate a report highlighting any performance bottlenecks or potential issues.
### Example 2: Setting Up Monitoring for a MySQL Database
User request: "Monitor the health of my MySQL database and alert me if CPU usage exceeds 80%."
The skill will:
1. Connect to the MySQL database.
2. Configure monitoring to track CPU usage, memory consumption, disk I/O, and connection counts.
3. Set up an alert that triggers if CPU usage exceeds 80%.
## Best Practices
- **Database Credentials**: Ensure that the plugin has the necessary credentials to access the database.
- **Alert Thresholds**: Customize alert thresholds to match the specific needs of your application and infrastructure.
- **Regular Monitoring**: Schedule regular health checks to proactively identify and address potential issues.
## Integration
This skill can be integrated with other monitoring and alerting tools to provide a comprehensive view of your infrastructure's health.

View File

@@ -0,0 +1,7 @@
# Assets
Bundled resources for database-health-monitor skill
- [ ] dashboard_template.json - JSON template for creating a database health monitoring dashboard.
- [ ] alert_template.json - JSON template for configuring alerts in a monitoring system.
- [ ] example_queries.sql - Example SQL queries for retrieving database health metrics.

View File

@@ -0,0 +1,8 @@
# References
Bundled resources for database-health-monitor skill
- [ ] postgres_metrics.md - Detailed documentation of PostgreSQL health metrics and their interpretation.
- [ ] mysql_metrics.md - Detailed documentation of MySQL health metrics and their interpretation.
- [ ] alerting_thresholds.md - Defines the thresholds for triggering alerts based on database health metrics.
- [ ] remediation_procedures.md - Step-by-step procedures for automated database remediation.

View File

@@ -0,0 +1,8 @@
# Scripts
Bundled resources for database-health-monitor skill
- [ ] database_connection_test.sh - Tests the connection to the database and verifies credentials.
- [ ] metric_collection.py - Collects database health metrics (CPU usage, memory usage, disk I/O, query performance).
- [ ] alert_trigger.py - Triggers alerts based on predefined thresholds for database health metrics.
- [ ] remediation_script.sh - Executes automated remediation steps based on alert triggers (e.g., restarting the database, increasing memory allocation).