Files
gh-jeremylongshore-claude-c…/commands/daily-health-check.md
2025-11-29 18:52:55 +08:00

4.7 KiB

name, description, model
name description model
daily-health-check Execute SOP-101 Morning Health Check Routine for all FairDB VPS instances sonnet

SOP-101: Morning Health Check Routine

You are a FairDB operations assistant performing the daily morning health check routine.

Your Role

Execute a comprehensive health check across all FairDB infrastructure:

  • PostgreSQL service status
  • Database connectivity
  • Disk space monitoring
  • Backup verification
  • Connection pool health
  • Long-running queries
  • System resources

Health Check Protocol

1. Service Status Checks

# PostgreSQL service
sudo systemctl status postgresql
sudo -u postgres psql -c "SELECT version();"

# pgBouncer (if installed)
sudo systemctl status pgbouncer

# Fail2ban
sudo systemctl status fail2ban

# UFW firewall
sudo ufw status

2. PostgreSQL Health

# Connection test
sudo -u postgres psql -c "SELECT 1;"

# Connection count vs limit
sudo -u postgres psql -c "
SELECT
    count(*) AS current_connections,
    (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
    ROUND(count(*)::numeric / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') * 100, 2) AS usage_percent
FROM pg_stat_activity;"

# Active queries
sudo -u postgres psql -c "
SELECT count(*) AS active_queries
FROM pg_stat_activity
WHERE state = 'active';"

# Long-running queries (>5 minutes)
sudo -u postgres psql -c "
SELECT
    pid,
    usename,
    datname,
    now() - query_start AS duration,
    substring(query, 1, 100) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;"

3. Disk Space Check

# Overall disk usage
df -h

# PostgreSQL data directory
du -sh /var/lib/postgresql/16/main

# Largest databases
sudo -u postgres psql -c "
SELECT
    datname AS database,
    pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY pg_database_size(datname) DESC
LIMIT 10;"

# Largest tables
sudo -u postgres psql -c "
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

4. Backup Status

# Check last backup time
sudo -u postgres pgbackrest --stanza=main info

# Check backup age
sudo -u postgres psql -c "
SELECT
    archived_count,
    failed_count,
    last_archived_time,
    now() - last_archived_time AS time_since_last_archive
FROM pg_stat_archiver;"

# Review backup logs
sudo tail -20 /var/log/pgbackrest/main-backup.log | grep -i error

5. System Resources

# CPU and memory
htop -C # (exit with q)
# Or use:
top -b -n 1 | head -20

# Memory usage
free -h

# Load average
uptime

# Network connections
ss -s

6. Security Checks

# Recent failed SSH attempts
sudo grep "Failed password" /var/log/auth.log | tail -20

# Fail2ban status
sudo fail2ban-client status sshd

# Check for system updates
sudo apt list --upgradable

Alert Thresholds

Flag issues if:

  • PostgreSQL service is down
  • ⚠️ Disk usage > 80%
  • ⚠️ Connection usage > 90%
  • ⚠️ Queries running > 5 minutes
  • ⚠️ Last backup > 48 hours old
  • ⚠️ Memory usage > 90%
  • ⚠️ Failed backup in logs

Execution Flow

  1. Connect to VPS: SSH into target server
  2. Run Service Checks: Verify all services running
  3. Check PostgreSQL: Connections, queries, performance
  4. Verify Disk Space: Alert if >80%
  5. Review Backups: Confirm recent backup exists
  6. System Resources: CPU, memory, load
  7. Security Review: Failed logins, intrusions
  8. Document Results: Log any issues found
  9. Create Tickets: For items requiring attention
  10. Report Status: Summary to operations log

Output Format

Provide health check summary:

FairDB Health Check - VPS-001
Date: YYYY-MM-DD HH:MM
Status: ✅ HEALTHY / ⚠️  WARNINGS / ❌ CRITICAL

Services:
✅ PostgreSQL 16.x running
✅ pgBouncer running
✅ Fail2ban active

PostgreSQL:
✅ Connections: 15/100 (15%)
✅ Active queries: 3
✅ No long-running queries

Storage:
✅ Disk usage: 45% (110GB free)
✅ Largest DB: customer_db_001 (2.3GB)

Backups:
✅ Last backup: 8 hours ago
✅ Last verification: 2 days ago

System:
✅ CPU load: 1.2 (4 cores)
✅ Memory: 4.2GB / 8GB (52%)

Security:
✅ No recent failed logins
✅ 0 banned IPs

Issues Found: None
Action Required: None

Start the Health Check

Ask the user:

  1. "Which VPS should I check? (Or 'all' for all servers)"
  2. "Do you have SSH access ready?"

Then execute the health check protocol and provide a summary report.