Initial commit

2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions
--- a/agents/sre/modules/infrastructure.md
+++ b/agents/sre/modules/infrastructure.md
@@ -0,0 +1,561 @@
+# Infrastructure Diagnostics
+
+**Purpose**: Troubleshoot server, network, disk, and cloud infrastructure issues.
+
+## Common Infrastructure Issues
+
+### 1. High CPU Usage (Server)
+
+**Symptoms**:
+- Server CPU at 100%
+- Applications slow
+- SSH lag
+
+**Diagnosis**:
+
+#### Check CPU Usage
+```bash
+# Overall CPU usage
+top -bn1 | grep "Cpu(s)"
+
+# Top CPU processes
+top -bn1 | head -20
+
+# CPU usage per core
+mpstat -P ALL 1 5
+
+# Sample CPU every second, 10 times (needs sysstat); plain `sar -u` shows today's history
+sar -u 1 10
+```
+
+**Red flags**:
+- CPU at 100% for >5 minutes
+- Single process using >80% CPU
+- iowait >20% (disk bottleneck)
+- System CPU >30% (kernel overhead)
+
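The red-flag thresholds above can also be computed from two samples of the aggregate `cpu` line in `/proc/stat`. This is a minimal sketch, assuming the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); `cpu_breakdown` is a hypothetical helper name, not an existing tool.

```bash
# cpu_breakdown LINE1 LINE2
# Takes two samples of the first line of /proc/stat and prints the share of
# time spent in each state between them, followed by any red flags.
cpu_breakdown() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) first[i] = $i }
    NR == 2 {
      for (i = 2; i <= 9; i++) { delta[i] = $i - first[i]; total += delta[i] }
      if (total <= 0) { print "no CPU ticks elapsed between samples"; exit 1 }
      sys    = 100 * delta[4] / total
      iowait = 100 * delta[6] / total
      printf "user=%.0f system=%.0f iowait=%.0f idle=%.0f\n",
             100 * delta[2] / total, sys, iowait, 100 * delta[5] / total
      if (iowait > 20) print "RED FLAG: iowait above 20% (disk bottleneck)"
      if (sys > 30)    print "RED FLAG: system CPU above 30% (kernel overhead)"
    }'
}

# Usage on a live host (one-second window):
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   cpu_breakdown "$a" "$b"
```

A one-second window is noisy; a longer gap between samples gives a steadier picture.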
+---
+
+#### Identify CPU-heavy Process
+```bash
+# Top CPU process
+ps aux | sort -nrk 3,3 | head -10
+
+# CPU per thread
+top -H
+
+# Process tree
+pstree -p
+```
+
+**Common causes**:
+- Application bug (infinite loop)
+- Heavy computation
+- Crypto mining malware
+- Backup/compression running
+
+---
+
+#### Immediate Mitigation
+```bash
+# 1. Lower process priority (renice does not cap CPU; the process only yields to others)
+renice +10 <PID>  # Lower priority
+
+# 2. Kill process (last resort)
+kill -TERM <PID>  # Graceful
+kill -KILL <PID>  # Force kill
+
+# 3. Scale horizontally (add servers)
+# Cloud: Auto-scaling group
+
+# 4. Scale vertically (bigger instance)
+# Cloud: Resize instance
+```
+
+---
+
+### 2. Out of Memory (OOM)
+
+**Symptoms**:
+- "Out of memory" errors
+- OOM Killer triggered
+- Applications crash
+- Swap usage high
+
+**Diagnosis**:
+
+#### Check Memory Usage
+```bash
+# Current memory usage
+free -h
+
+# Memory per process
+ps aux | sort -nrk 4,4 | head -10
+
+# Check OOM killer logs
+dmesg | grep -i "out of memory\|oom"
+grep -i "out of memory" /var/log/syslog  # /var/log/messages on RHEL/CentOS
+
+# Check swap usage
+swapon -s
+```
+
+**Red flags**:
+- Available memory <10%
+- Swap usage >80%
+- OOM killer active
+- Single process using >50% memory
+
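The first two red flags can be derived directly from `/proc/meminfo`. The sketch below assumes the `MemAvailable` field is present (it is on current kernels); `mem_flags` is a hypothetical helper name.

```bash
# mem_flags
# Reads /proc/meminfo-formatted text on stdin and reports available memory
# and swap usage as percentages, followed by any red flags.
mem_flags() {
  awk '
    $1 == "MemTotal:"     { total  = $2 }
    $1 == "MemAvailable:" { avail  = $2 }
    $1 == "SwapTotal:"    { stotal = $2 }
    $1 == "SwapFree:"     { sfree  = $2 }
    END {
      if (total <= 0) { print "MemTotal not found"; exit 1 }
      avail_pct = 100 * avail / total
      printf "available=%.0f%%", avail_pct
      if (stotal > 0) {
        swap_pct = 100 * (stotal - sfree) / stotal
        printf " swap_used=%.0f%%", swap_pct
      }
      printf "\n"
      if (avail_pct < 10)              print "RED FLAG: available memory below 10%"
      if (stotal > 0 && swap_pct > 80) print "RED FLAG: swap usage above 80%"
    }'
}

# Usage: mem_flags < /proc/meminfo
```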
+---
+
+#### Immediate Mitigation
+```bash
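+# 1. Drop page cache (run as root; non-destructive, but it rarely helps because
+#    the kernel already reclaims cache under pressure, and reads slow down afterwards)
+sync && echo 3 > /proc/sys/vm/drop_caches
+
+# 2. Kill memory-heavy process (try SIGTERM first, SIGKILL if it is ignored)
+kill -TERM <PID>
+kill -9 <PID>
+
+# 3. Increase swap (temporary)
+dd if=/dev/zero of=/swapfile bs=1M count=2048
+chmod 600 /swapfile  # swap files must not be readable by other users
+mkswap /swapfile
+swapon /swapfile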
+
+# 4. Scale up (more RAM)
+# Cloud: Resize instance
+```
+
+---
+
+### 3. Disk Full
+
+**Symptoms**:
+- "No space left on device" errors
+- Applications can't write files
+- Database refuses writes
+- Logs not being written
+
+**Diagnosis**:
+
+#### Check Disk Usage
+```bash
+# Disk usage by partition
+df -h
+
+# Disk usage by directory
+du -sh /*
+du -sh /var/*
+
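+# Find large files (stay on one filesystem, hide permission errors)
+find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null
+
+# Find deleted files that are still held open (their space is not released yet)
+lsof +L1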
+```
+
+**Red flags**:
+- Disk usage >90%
+- /var/log full (runaway logs)
+- /tmp full (temp files not cleaned)
+- Deleted files still holding space (process has handle)
+
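To apply the 90% threshold across every mounted filesystem, the POSIX output of `df -P` can be filtered. This is a sketch only: `full_filesystems` is a hypothetical helper, and it assumes mount points contain no spaces.

```bash
# full_filesystems [LIMIT]
# Reads `df -P` output on stdin and prints each mount point whose usage
# is above LIMIT percent (default 90).
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5            # Capacity column, e.g. "95%"
      sub(/%/, "", use)
      if (use + 0 > limit) print $6, use "%"
    }'
}

# Usage: df -P | full_filesystems 90
```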
+---
+
+#### Immediate Mitigation
+```bash
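+# 1. Clean up rotated logs older than 7 days
+find /var/log -name "*.log.*" -mtime +7 -delete
+journalctl --vacuum-time=7d
+
+# 2. Clean up stale temp files (running processes may still need recent ones)
+find /tmp /var/tmp -xdev -type f -mtime +7 -delete
+
+# 3. Release space held by deleted-but-open files
+# Identify the owning process, then restart that service; do not kill blindly
+lsof +L1
+systemctl restart <service>
+# If a restart is not possible, truncate the file through its descriptor:
+: > /proc/<PID>/fd/<FD>
+
+# 4. Compress rotated logs only (gzip on a live log leaves the writer holding
+#    the deleted original, so no space is freed)
+find /var/log -name "*.log.[0-9]*" ! -name "*.gz" -exec gzip {} +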
+
+# 5. Expand disk (cloud)
+# AWS: Modify EBS volume size
+# Azure: Expand managed disk
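+# After expanding, grow the partition first, then the filesystem:
+growpart /dev/xvda 1  # partition 1 of /dev/xvda (from cloud-utils)
+resize2fs /dev/xvda1  # ext4
+xfs_growfs /            # xfs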
+```
+
+---
+
+### 4. Network Issues
+
+**Symptoms**:
+- Slow network performance
+- Timeouts
+- Connection refused
+- High latency
+
+**Diagnosis**:
+
+#### Check Network Connectivity
+```bash
+# Ping test
+ping -c 5 google.com
+
+# DNS resolution
+nslookup example.com
+dig example.com
+
+# Traceroute
+traceroute example.com
+
+# Check network interfaces
+ip addr show
+ifconfig  # legacy (net-tools), may not be installed
+
+# Check routing table
+ip route show
+route -n  # legacy (net-tools), may not be installed
+```
+
+**Red flags**:
+- Packet loss >1%
+- Latency >100ms (same region)
+- DNS resolution failures
+- Interface down
+
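The packet-loss threshold can be checked automatically by parsing the summary line that `ping` prints. This is a sketch under the assumption that the summary contains the text `N% packet loss`, as iputils and BSD ping do; `loss_report` is a hypothetical helper name.

```bash
# loss_report
# Reads ping output on stdin and reports the packet-loss percentage,
# flagging anything above the 1% threshold.
loss_report() {
  loss=$(sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p' | head -n 1)
  if [ -z "$loss" ]; then
    echo "no ping summary found"
    return 2
  fi
  # awk does the comparison because the value may be fractional (e.g. 0.5)
  if awk -v l="$loss" 'BEGIN { exit !(l > 1) }'; then
    echo "loss=${loss}% RED FLAG (above 1%)"
  else
    echo "loss=${loss}% ok"
  fi
}

# Usage: ping -c 20 example.com | loss_report
```

A larger `-c` count gives a more meaningful percentage; with 5 packets, a single drop already reads as 20%.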
+---
+
+#### Check Network Bandwidth
+```bash
+# Current bandwidth usage
+iftop -i eth0
+
+# Network stats
+netstat -i
+
+# Live bandwidth (if vnstat installed); use `vnstat -d` for daily history
+vnstat -l
+
+# Check for bandwidth limits (cloud)
+# AWS: Check CloudWatch NetworkIn/NetworkOut
+```
+
+---
+
+#### Check Firewall Rules
+```bash
+# Check iptables rules
+iptables -L -n -v
+
+# Check firewalld (RHEL/CentOS)
+firewall-cmd --list-all
+
+# Check UFW (Ubuntu)
+ufw status verbose
+
+# Check security groups (cloud)
+# AWS: EC2 → Security Groups
+# Azure: Network Security Groups
+```
+
+**Common causes**:
+- Firewall blocking traffic
+- Security group misconfigured
+- MTU mismatch
+- Network congestion
+- DDoS attack
+
+---
+
+#### Immediate Mitigation
+```bash
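+# 1. Allow HTTP/HTTPS (insert at the top so the rule precedes any DROP/REJECT)
+iptables -I INPUT -p tcp --dport 80 -j ACCEPT
+iptables -I INPUT -p tcp --dport 443 -j ACCEPT
+
+# 2. Restart networking (may drop your SSH session; use the unit the host actually runs)
+systemctl restart networking      # Debian/Ubuntu (ifupdown)
+systemctl restart NetworkManager  # RHEL/CentOS and NetworkManager-based hosts
+
+# 3. Flush DNS cache (systemd-resolved)
+resolvectl flush-caches  # older releases: systemd-resolve --flush-caches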
+
+# 4. Check cloud network ACLs
+# Ensure subnet has route to internet gateway
+```
+
+---
+
+### 5. High Disk I/O (Slow Disk)
+
+**Symptoms**:
+- Applications slow
+- High iowait CPU
+- Disk latency high
+
+**Diagnosis**:
+
+#### Check Disk I/O
+```bash
+# Disk I/O stats
+iostat -x 1 5
+
+# Look for:
+# - %util >80% (disk saturated)
+# - await >100ms (high latency)
+
+# Top I/O processes
+iotop -o
+
+# Sample per-device I/O every second, 10 times (needs sysstat); plain `sar -d` shows today's history
+sar -d 1 10
+```
+
+**Red flags**:
+- %util at 100%
+- await >100ms
+- iowait CPU >20%
+- Queue size (avgqu-sz, shown as aqu-sz in newer sysstat) >10
+
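Because the column layout of `iostat -x` differs between sysstat versions, a filter for saturated disks is safer if it locates `%util` by header name rather than by position. This is a sketch only; `saturated_disks` is a hypothetical helper name.

```bash
# saturated_disks [LIMIT]
# Reads `iostat -x` output on stdin and prints each device whose %util
# is above LIMIT (default 80). The %util column is found from the header.
saturated_disks() {
  awk -v limit="${1:-80}" '
    $1 ~ /^Device/ {
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 > limit { print $1, $col }
  '
}

# Usage: iostat -x 1 5 | saturated_disks 80
```

The first report from `iostat` covers the time since boot, so later reports in the stream reflect current load more accurately.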
+---
+
+#### Common Causes
+```bash
+# 1. Database without indexes (Seq Scan)
+# See database-diagnostics.md
+
+# 2. Log rotation running
+# Large logs being compressed
+
+# 3. Backup running
+# Database dump, file backup
+
+# 4. Disk issue (bad sectors)
+dmesg | grep -i "I/O error"
+smartctl -a /dev/sda  # SMART status
+```
+
+---
+
+#### Immediate Mitigation
+```bash
+# 1. Reduce I/O pressure
+# Stop non-critical processes (backup, log rotation)
+
+# 2. Add read cache
+# Enable query caching (database)
+# Add Redis for application cache
+
+# 3. Scale disk IOPS (cloud)
+# AWS: Change EBS volume type (gp2 → gp3 → io1/io2)
+# Azure: Change disk tier
+
+# 4. Move to SSD (if on HDD)
+```
+
+---
+
+### 6. Service Down / Process Crashed
+
+**Symptoms**:
+- Service not responding
+- Health check failures
+- 502 Bad Gateway
+
+**Diagnosis**:
+
+#### Check Service Status
+```bash
+# Systemd services
+systemctl status nginx
+systemctl status postgresql
+systemctl status application
+
+# Check if process running
+ps aux | grep nginx
+pidof nginx
+
+# Check service logs
+journalctl -u nginx -n 50
+tail -f /var/log/nginx/error.log
+```
+
+**Red flags**:
+- Service: inactive (dead)
+- Process not found
+- Recent crash in logs
+
+---
+
+#### Check Why Service Crashed
+```bash
+# Check system logs
+dmesg | tail -50
+grep "error\|segfault\|killed" /var/log/syslog
+
+# Check application logs
+tail -100 /var/log/application.log
+
+# Check for OOM killer
+dmesg | grep -i "killed process"
+
+# Check core dumps
+ls -l /var/crash/
+ls -l /tmp/core*
+```
+
+**Common causes**:
+- Out of memory (OOM Killer)
+- Segmentation fault (code bug)
+- Unhandled exception
+- Dependency service down
+- Configuration error
+
+---
+
+#### Immediate Mitigation
+```bash
+# 1. Restart service
+systemctl restart nginx
+
+# 2. Check if started successfully
+systemctl status nginx
+curl http://localhost
+
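+# 3. If startup fails, check config
+nginx -t  # Test nginx config
+# PostgreSQL has no dedicated config-test flag; asking the server binary for a
+# setting makes it parse postgresql.conf and fail on syntax errors
+# (adjust the binary path and data directory for your distribution)
+sudo -u postgres postgres -D /var/lib/postgresql/data -C data_directory
+
+# 4. Enable auto-restart (systemd)
+# Add to the service file (or a drop-in via `systemctl edit nginx`),
+# then run `systemctl daemon-reload`:
+# [Service]
+# Restart=always
+# RestartSec=10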
+```
+
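The auto-restart settings above can be applied without editing the packaged unit file by writing a systemd drop-in. This is a minimal sketch, assuming the standard `/etc/systemd/system/<unit>.d/` drop-in location; the helper name `write_restart_dropin` and the file name `restart.conf` are illustrative choices.

```bash
# write_restart_dropin UNIT [ROOT]
# Writes a drop-in that enables auto-restart for UNIT and prints its path.
# ROOT defaults to /etc/systemd/system; pass another directory to dry-run.
write_restart_dropin() {
  unit="$1"
  root="${2:-/etc/systemd/system}"
  dir="$root/$unit.d"
  mkdir -p "$dir" || return 1
  printf '[Service]\nRestart=always\nRestartSec=10\n' > "$dir/restart.conf" || return 1
  echo "$dir/restart.conf"
}

# Usage (as root):
#   write_restart_dropin nginx.service && systemctl daemon-reload
```

Keeping the override in a drop-in means a package upgrade that replaces the unit file does not remove it.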
+---
+
+### 7. Cloud Infrastructure Issues
+
+#### AWS-Specific
+
+**Instance Issues**:
+```bash
+# Check instance health
+aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0
+
+# Check system logs
+aws ec2 get-console-output --instance-id i-1234567890abcdef0
+
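+# Check CloudWatch metrics (time window, period, and statistic are required)
+aws cloudwatch get-metric-statistics \
+  --namespace AWS/EC2 \
+  --metric-name CPUUtilization \
+  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
+  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
+  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
+  --period 300 \
+  --statistics Average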
+```
+
+**EBS Volume Issues**:
+```bash
+# Check volume status
+aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0
+
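+# Increase IOPS (gp3; 3000 is the gp3 baseline, so request more than that)
+aws ec2 modify-volume \
+  --volume-id vol-1234567890abcdef0 \
+  --iops 6000
+
+# Check volume metrics (dimension, time window, period, and statistic are required)
+aws cloudwatch get-metric-statistics \
+  --namespace AWS/EBS \
+  --metric-name VolumeReadOps \
+  --dimensions Name=VolumeId,Value=vol-1234567890abcdef0 \
+  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
+  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
+  --period 300 \
+  --statistics Sum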
+```
+
+**Network Issues**:
+```bash
+# Check security groups
+aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0
+
+# Check network ACLs
+aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0
+
+# Check route tables
+aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0
+```
+
+---
+
+#### Azure-Specific
+
+**VM Issues**:
+```bash
+# Check VM status
+az vm get-instance-view --name myVM --resource-group myRG
+
+# Restart VM
+az vm restart --name myVM --resource-group myRG
+
+# Resize VM
+az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3
+```
+
+**Disk Issues**:
+```bash
+# Check disk status
+az disk show --name myDisk --resource-group myRG
+
+# Expand disk
+az disk update --name myDisk --resource-group myRG --size-gb 256
+```
+
+---
+
+## Infrastructure Performance Metrics
+
+**Server Health**:
+- CPU: <70% average, <90% peak
+- Memory: <80% usage
+- Disk: <80% usage, <80% of provisioned IOPS
+- Network: <70% bandwidth
+
+**Uptime**:
+- Target: 99.9% (8.76 hours downtime/year)
+- Monitoring: Check every 1 minute
+
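The downtime figure for an uptime target follows from simple arithmetic: the allowed fraction of downtime multiplied by the hours in a year. The sketch below reproduces that calculation for any target, assuming a 365-day year and twelve equal months; `downtime_budget` is a hypothetical helper name.

```bash
# downtime_budget SLO_PERCENT
# Prints the downtime allowed per year and per month for an uptime target.
downtime_budget() {
  awk -v slo="$1" 'BEGIN {
    hours = (100 - slo) / 100 * 365 * 24
    printf "%.2f hours/year, %.1f minutes/month\n", hours, hours * 60 / 12
  }'
}

# Usage: downtime_budget 99.9
```

For the 99.9% target this gives the 8.76 hours per year quoted above, or a little under 44 minutes per month.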
+**Response Time**:
+- Ping latency: <50ms (same region)
+- HTTP response: <200ms
+
+---
+
+## Infrastructure Diagnostic Checklist
+
+**When diagnosing infrastructure issues**:
+
+- [ ] Check CPU usage (target: <70%)
+- [ ] Check memory usage (target: <80%)
+- [ ] Check disk usage (target: <80%)
+- [ ] Check disk I/O (%util, await)
+- [ ] Check network connectivity (ping, traceroute)
+- [ ] Check firewall rules (iptables, security groups)
+- [ ] Check service status (systemd, ps)
+- [ ] Check system logs (dmesg, /var/log/syslog)
+- [ ] Check cloud metrics (CloudWatch, Azure Monitor)
+- [ ] Check for hardware issues (SMART, dmesg errors)
+
+**Tools**:
+- `top`, `htop` - CPU, memory
+- `df`, `du` - Disk usage
+- `iostat` - Disk I/O
+- `iftop`, `netstat` - Network
+- `dmesg`, `journalctl` - System logs
+- Cloud dashboards (AWS, Azure, GCP)
+
+---
+
+## Related Documentation
+
+- [SKILL.md](../SKILL.md) - Main SRE agent
+- [backend-diagnostics.md](backend-diagnostics.md) - Application-level troubleshooting
+- [database-diagnostics.md](database-diagnostics.md) - Database performance
+- [security-incidents.md](security-incidents.md) - Security response