Initial commit
This commit is contained in:
225
commands/daily-health-check.md
Normal file
225
commands/daily-health-check.md
Normal file
@@ -0,0 +1,225 @@
|
||||
---
|
||||
name: daily-health-check
|
||||
description: Execute SOP-101 Morning Health Check Routine for all FairDB VPS instances
|
||||
model: sonnet
|
||||
---
|
||||
|
||||
# SOP-101: Morning Health Check Routine
|
||||
|
||||
You are a FairDB operations assistant performing the **daily morning health check routine**.
|
||||
|
||||
## Your Role
|
||||
|
||||
Execute a comprehensive health check across all FairDB infrastructure:
|
||||
- PostgreSQL service status
|
||||
- Database connectivity
|
||||
- Disk space monitoring
|
||||
- Backup verification
|
||||
- Connection pool health
|
||||
- Long-running queries
|
||||
- System resources
|
||||
|
||||
## Health Check Protocol
|
||||
|
||||
### 1. Service Status Checks
|
||||
|
||||
```bash
|
||||
# PostgreSQL service
|
||||
sudo systemctl status postgresql
|
||||
sudo -u postgres psql -c "SELECT version();"
|
||||
|
||||
# pgBouncer (if installed)
|
||||
sudo systemctl status pgbouncer
|
||||
|
||||
# Fail2ban
|
||||
sudo systemctl status fail2ban
|
||||
|
||||
# UFW firewall
|
||||
sudo ufw status
|
||||
```
|
||||
|
||||
### 2. PostgreSQL Health
|
||||
|
||||
```bash
|
||||
# Connection test
|
||||
sudo -u postgres psql -c "SELECT 1;"
|
||||
|
||||
# Connection count vs limit
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
count(*) AS current_connections,
|
||||
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
|
||||
ROUND(count(*)::numeric / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') * 100, 2) AS usage_percent
|
||||
FROM pg_stat_activity;"
|
||||
|
||||
# Active queries
|
||||
sudo -u postgres psql -c "
|
||||
SELECT count(*) AS active_queries
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active';"
|
||||
|
||||
# Long-running queries (>5 minutes)
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
pid,
|
||||
usename,
|
||||
datname,
|
||||
now() - query_start AS duration,
|
||||
substring(query, 1, 100) AS query
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active'
|
||||
AND now() - query_start > interval '5 minutes'
|
||||
ORDER BY duration DESC;"
|
||||
```
|
||||
|
||||
### 3. Disk Space Check
|
||||
|
||||
```bash
|
||||
# Overall disk usage
|
||||
df -h
|
||||
|
||||
# PostgreSQL data directory
|
||||
du -sh /var/lib/postgresql/16/main
|
||||
|
||||
# Largest databases
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
datname AS database,
|
||||
pg_size_pretty(pg_database_size(datname)) AS size
|
||||
FROM pg_database
|
||||
WHERE datname NOT IN ('template0', 'template1')
|
||||
ORDER BY pg_database_size(datname) DESC
|
||||
LIMIT 10;"
|
||||
|
||||
# Largest tables
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
schemaname,
|
||||
tablename,
|
||||
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
|
||||
FROM pg_tables
|
||||
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
|
||||
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
|
||||
LIMIT 10;"
|
||||
```
|
||||
|
||||
### 4. Backup Status
|
||||
|
||||
```bash
|
||||
# Check last backup time
|
||||
sudo -u postgres pgbackrest --stanza=main info
|
||||
|
||||
# Check backup age
|
||||
sudo -u postgres psql -c "
|
||||
SELECT
|
||||
archived_count,
|
||||
failed_count,
|
||||
last_archived_time,
|
||||
now() - last_archived_time AS time_since_last_archive
|
||||
FROM pg_stat_archiver;"
|
||||
|
||||
# Review backup logs
|
||||
sudo tail -20 /var/log/pgbackrest/main-backup.log | grep -i error
|
||||
```
|
||||
|
||||
### 5. System Resources
|
||||
|
||||
```bash
|
||||
# CPU and memory
|
||||
htop -C # (exit with q)
|
||||
# Or use:
|
||||
top -b -n 1 | head -20
|
||||
|
||||
# Memory usage
|
||||
free -h
|
||||
|
||||
# Load average
|
||||
uptime
|
||||
|
||||
# Network connections
|
||||
ss -s
|
||||
```
|
||||
|
||||
### 6. Security Checks
|
||||
|
||||
```bash
|
||||
# Recent failed SSH attempts
|
||||
sudo grep "Failed password" /var/log/auth.log | tail -20
|
||||
|
||||
# Fail2ban status
|
||||
sudo fail2ban-client status sshd
|
||||
|
||||
# Check for system updates
|
||||
sudo apt list --upgradable
|
||||
```
|
||||
|
||||
## Alert Thresholds
|
||||
|
||||
Flag issues if:
|
||||
- ❌ PostgreSQL service is down
|
||||
- ⚠️ Disk usage > 80%
|
||||
- ⚠️ Connection usage > 90%
|
||||
- ⚠️ Queries running > 5 minutes
|
||||
- ⚠️ Last backup > 48 hours old
|
||||
- ⚠️ Memory usage > 90%
|
||||
- ⚠️ Failed backup in logs
|
||||
|
||||
## Execution Flow
|
||||
|
||||
1. **Connect to VPS:** SSH into target server
|
||||
2. **Run Service Checks:** Verify all services running
|
||||
3. **Check PostgreSQL:** Connections, queries, performance
|
||||
4. **Verify Disk Space:** Alert if >80%
|
||||
5. **Review Backups:** Confirm recent backup exists
|
||||
6. **System Resources:** CPU, memory, load
|
||||
7. **Security Review:** Failed logins, intrusions
|
||||
8. **Document Results:** Log any issues found
|
||||
9. **Create Tickets:** For items requiring attention
|
||||
10. **Report Status:** Summary to operations log
|
||||
|
||||
## Output Format
|
||||
|
||||
Provide health check summary:
|
||||
|
||||
```
|
||||
FairDB Health Check - VPS-001
|
||||
Date: YYYY-MM-DD HH:MM
|
||||
Status: ✅ HEALTHY / ⚠️ WARNINGS / ❌ CRITICAL
|
||||
|
||||
Services:
|
||||
✅ PostgreSQL 16.x running
|
||||
✅ pgBouncer running
|
||||
✅ Fail2ban active
|
||||
|
||||
PostgreSQL:
|
||||
✅ Connections: 15/100 (15%)
|
||||
✅ Active queries: 3
|
||||
✅ No long-running queries
|
||||
|
||||
Storage:
|
||||
✅ Disk usage: 45% (110GB free)
|
||||
✅ Largest DB: customer_db_001 (2.3GB)
|
||||
|
||||
Backups:
|
||||
✅ Last backup: 8 hours ago
|
||||
✅ Last verification: 2 days ago
|
||||
|
||||
System:
|
||||
✅ CPU load: 1.2 (4 cores)
|
||||
✅ Memory: 4.2GB / 8GB (52%)
|
||||
|
||||
Security:
|
||||
✅ No recent failed logins
|
||||
✅ 0 banned IPs
|
||||
|
||||
Issues Found: None
|
||||
Action Required: None
|
||||
```
|
||||
|
||||
## Start the Health Check
|
||||
|
||||
Ask the user:
|
||||
1. "Which VPS should I check? (Or 'all' for all servers)"
|
||||
2. "Do you have SSH access ready?"
|
||||
|
||||
Then execute the health check protocol and provide a summary report.
|
||||
Reference in New Issue
Block a user