# Infrastructure Diagnostics
**Purpose**: Troubleshoot server, network, disk, and cloud infrastructure issues.
## Common Infrastructure Issues
### 1. High CPU Usage (Server)
**Symptoms**:
- Server CPU at 100%
- Applications slow
- SSH lag
**Diagnosis**:
#### Check CPU Usage
```bash
# Overall CPU usage
top -bn1 | grep "Cpu(s)"
# Top CPU processes
top -bn1 | head -20
# CPU usage per core
mpstat -P ALL 1 5
# Historical CPU (if sar installed)
sar -u 1 10
```
**Red flags**:
- CPU at 100% for >5 minutes
- Single process using >80% CPU
- iowait >20% (disk bottleneck)
- System CPU >30% (kernel overhead)
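To confirm the load is sustained rather than a momentary spike, a quick sketch using `vmstat` (column positions assume the default Linux layout: us=13, sy=14, wa=16):
```bash
# Average user+system CPU and iowait over a 60-second window
# (NR>3 skips the header lines and the since-boot summary row)
vmstat 1 60 | awk 'NR>3 {us+=$13; sy+=$14; wa+=$16; n++} END {printf "avg cpu: %d%%  avg iowait: %d%%\n", (us+sy)/n, wa/n}'
```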
---
#### Identify CPU-heavy Process
```bash
# Top CPU process
ps aux | sort -nrk 3,3 | head -10
# CPU per thread
top -H
# Process tree
pstree -p
```
**Common causes**:
- Application bug (infinite loop)
- Heavy computation
- Crypto mining malware
- Backup/compression running
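Before killing anything, confirm what the process actually is (the mining-malware case above often shows a deleted or oddly located binary). A sketch using /proc; the PID is a placeholder:
```bash
PID=12345                                 # Replace with the suspect PID
ls -l /proc/$PID/exe                      # Binary actually running; "(deleted)" is suspicious
tr '\0' ' ' < /proc/$PID/cmdline; echo    # Full command line
ls -l /proc/$PID/cwd                      # Working directory
lsof -p "$PID" | head -20                 # Open files and network sockets
```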
---
#### Immediate Mitigation
```bash
# 1. Lower process priority (renice)
renice -n 10 -p <PID> # Deprioritize; helps only if other work is competing for CPU
# 2. Kill process (last resort)
kill -TERM <PID> # Graceful
kill -KILL <PID> # Force kill
# 3. Scale horizontally (add servers)
# Cloud: Auto-scaling group
# 4. Scale vertically (bigger instance)
# Cloud: Resize instance
```
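If the offender is a systemd-managed service, a CPU quota can cap it without killing it. A sketch, assuming a unit named `myapp.service`:
```bash
# Cap the service at half a core; --runtime keeps the change only until reboot
systemctl set-property --runtime myapp.service CPUQuota=50%
systemctl show myapp.service | grep -i cpuquota   # Verify
```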
---
### 2. Out of Memory (OOM)
**Symptoms**:
- "Out of memory" errors
- OOM Killer triggered
- Applications crash
- Swap usage high
**Diagnosis**:
#### Check Memory Usage
```bash
# Current memory usage
free -h
# Memory per process
ps aux | sort -nrk 4,4 | head -10
# Check OOM killer logs
dmesg | grep -i "out of memory\|oom"
grep "Out of memory" /var/log/syslog
# Check swap usage
swapon -s
```
**Red flags**:
- Available memory <10%
- Swap usage >80%
- OOM killer active
- Single process using >50% memory
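To see which application (not just which PID) is consuming the memory, a rough aggregation by command name:
```bash
# Sum resident memory (RSS) per command name, largest first
ps -eo rss=,comm= | awk '{mem[$2]+=$1} END {for (c in mem) printf "%10.1f MiB  %s\n", mem[c]/1024, c}' | sort -rn | head
```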
---
#### Immediate Mitigation
```bash
# 1. Drop caches (non-destructive, but expect a brief performance dip; needs a root shell)
sync && echo 3 > /proc/sys/vm/drop_caches
# 2. Kill the memory-heavy process (graceful first, force only if it ignores SIGTERM)
kill -TERM <PID>
kill -KILL <PID>
# 3. Increase swap (temporary stopgap)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# 4. Scale up (more RAM)
# Cloud: Resize instance
```
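If a critical process (e.g., the database) keeps getting picked by the OOM killer, its kill priority can be lowered. A sketch, assuming root and `postgres` as the process to protect:
```bash
PID=$(pidof -s postgres)
echo -1000 > /proc/$PID/oom_score_adj   # -1000 = exempt from OOM killing; use sparingly
cat /proc/$PID/oom_score                # Lower score = less likely to be chosen
```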
---
### 3. Disk Full
**Symptoms**:
- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written
**Diagnosis**:
#### Check Disk Usage
```bash
# Disk usage by partition
df -h
# Disk usage by directory
du -sh /*
du -sh /var/*
# Find large files
find / -type f -size +100M -exec ls -lh {} \;
# Find files using deleted space
lsof | grep deleted
```
**Red flags**:
- Disk usage >90%
- /var/log full (runaway logs)
- /tmp full (temp files not cleaned)
- Deleted files still holding space (process has handle)
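The last red flag above can be checked directly: `lsof +L1` lists open files whose link count is zero (i.e., deleted but still held open).
```bash
# Open-but-deleted files and the processes holding them (largest offenders are usually logs)
lsof -nP +L1 2>/dev/null | head -20
```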
---
#### Immediate Mitigation
```bash
# 1. Clean up logs
find /var/log -name "*.log.*" -mtime +7 -delete
journalctl --vacuum-time=7d
# 2. Clean up temp files (caution: running services may keep sockets or lock files in /tmp)
rm -rf /tmp/*
rm -rf /var/tmp/*
# 3. Free space held by deleted files that a process still has open
#    (space is reclaimed only when the holding process exits; prefer restarting the service over kill -9)
lsof 2>/dev/null | grep deleted | awk '{print $2}' | sort -u | xargs -r kill -TERM
# 4. Compress rotated logs (don't gzip a file a process is still writing; the space won't be freed)
gzip /var/log/*.log.1
# 5. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
# After expanding, grow the partition first (if any), then the filesystem:
growpart /dev/xvda 1 # cloud-utils; needed when the filesystem sits on a partition
resize2fs /dev/xvda1 # ext4
xfs_growfs / # xfs
```
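Cleanup only buys time; the lasting fix is rotation. A minimal logrotate sketch, assuming the runaway log is `/var/log/application.log` (hypothetical path):
```bash
cat > /etc/logrotate.d/application <<'EOF'
/var/log/application.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF
logrotate --debug /etc/logrotate.d/application   # Dry run to check the config
```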
---
### 4. Network Issues
**Symptoms**:
- Slow network performance
- Timeouts
- Connection refused
- High latency
**Diagnosis**:
#### Check Network Connectivity
```bash
# Ping test
ping -c 5 google.com
# DNS resolution
nslookup example.com
dig example.com
# Traceroute
traceroute example.com
# Check network interfaces
ip addr show
ifconfig
# Check routing table
ip route show
route -n
```
**Red flags**:
- Packet loss >1%
- Latency >100ms (same region)
- DNS resolution failures
- Interface down
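To quantify loss and latency per hop rather than eyeballing ping output, use `mtr` (if installed) or a longer ping run:
```bash
# 100-cycle report combining ping and traceroute
mtr --report --report-cycles 100 example.com
# Plain-ping fallback: the summary shows % packet loss and rtt min/avg/max
ping -c 100 -i 0.2 example.com | tail -2
```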
---
#### Check Network Bandwidth
```bash
# Current bandwidth usage
iftop -i eth0
# Network stats
netstat -i
# Historical bandwidth (if vnstat installed)
vnstat -l
# Check for bandwidth limits (cloud)
# AWS: Check CloudWatch NetworkIn/NetworkOut
```
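Connection-state counts often reveal the problem faster than raw bandwidth numbers (e.g., a flood of SYN-RECV suggests a SYN flood; a pile-up of TIME-WAIT suggests connection churn):
```bash
# TCP connections grouped by state
ss -ant | awk 'NR>1 {count[$1]++} END {for (s in count) print count[s], s}' | sort -rn
# Kernel socket summary
ss -s
```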
---
#### Check Firewall Rules
```bash
# Check iptables rules
iptables -L -n -v
# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all
# Check UFW (Ubuntu)
ufw status verbose
# Check security groups (cloud)
# AWS: EC2 → Security Groups
# Azure: Network Security Groups
```
**Common causes**:
- Firewall blocking traffic
- Security group misconfigured
- MTU mismatch
- Network congestion
- DDoS attack
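Of the causes above, an MTU mismatch is the quickest to confirm from the host: send non-fragmentable pings just under the usual 1500-byte Ethernet MTU (1472 bytes of payload + 28 bytes of ICMP/IP headers):
```bash
ping -M do -s 1472 -c 3 example.com   # Should succeed on a clean 1500-byte path
ping -M do -s 1400 -c 3 example.com   # If only the smaller size works, suspect an MTU problem on the path
```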
---
#### Immediate Mitigation
```bash
# 1. Allow HTTP/HTTPS through iptables (verify current rules first: iptables -L -n -v)
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
# 2. Restart networking (use whichever your distro runs)
systemctl restart networking # Debian/Ubuntu ifupdown
systemctl restart NetworkManager # NetworkManager-based systems
# 3. Flush DNS cache
resolvectl flush-caches # Older systemd: systemd-resolve --flush-caches
# 4. Check cloud network ACLs
# Ensure subnet has route to internet gateway
```
---
### 5. High Disk I/O (Slow Disk)
**Symptoms**:
- Applications slow
- High iowait CPU
- Disk latency high
**Diagnosis**:
#### Check Disk I/O
```bash
# Disk I/O stats
iostat -x 1 5
# Look for:
# - %util >80% (disk saturated)
# - await >100ms (high latency)
# Top I/O processes
iotop -o
# Historical I/O (if sar installed)
sar -d 1 10
```
**Red flags**:
- %util at 100%
- await >100ms
- iowait CPU >20%
- Queue size (avgqu-sz) >10
---
#### Common Causes
```bash
# 1. Database without indexes (Seq Scan)
# See database-diagnostics.md
# 2. Log rotation running
# Large logs being compressed
# 3. Backup running
# Database dump, file backup
# 4. Disk issue (bad sectors)
dmesg | grep -i "I/O error"
smartctl -a /dev/sda # SMART status
```
---
#### Immediate Mitigation
```bash
# 1. Reduce I/O pressure
# Stop non-critical processes (backup, log rotation)
# 2. Add read cache
# Enable query caching (database)
# Add Redis for application cache
# 3. Scale disk IOPS (cloud)
# AWS: Change EBS volume type (gp2 → gp3 → io1)
# Azure: Change disk tier
# 4. Move to SSD (if on HDD)
```
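If the I/O hog is something that must finish (e.g., a backup), it can be deprioritized instead of stopped, assuming util-linux `ionice` is available:
```bash
ionice -c 3 -p <PID>   # Idle class: only gets disk time when nothing else is waiting
ionice -p <PID>        # Verify the new scheduling class
```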
---
### 6. Service Down / Process Crashed
**Symptoms**:
- Service not responding
- Health check failures
- 502 Bad Gateway
**Diagnosis**:
#### Check Service Status
```bash
# Systemd services
systemctl status nginx
systemctl status postgresql
systemctl status application
# Check if process running
ps aux | grep nginx
pidof nginx
# Check service logs
journalctl -u nginx -n 50
tail -f /var/log/nginx/error.log
```
**Red flags**:
- Service: inactive (dead)
- Process not found
- Recent crash in logs
---
#### Check Why Service Crashed
```bash
# Check system logs
dmesg | tail -50
grep "error\|segfault\|killed" /var/log/syslog
# Check application logs
tail -100 /var/log/application.log
# Check for OOM killer
dmesg | grep -i "killed process"
# Check core dumps
coredumpctl list | tail -10 # systemd-coredump systems
ls -l /var/crash/
ls -l /tmp/core*
```
**Common causes**:
- Out of memory (OOM Killer)
- Segmentation fault (code bug)
- Unhandled exception
- Dependency service down
- Configuration error
---
#### Immediate Mitigation
```bash
# 1. Restart service
systemctl restart nginx
# 2. Check if started successfully
systemctl status nginx
curl http://localhost
# 3. If startup fails, check config
nginx -t # Test nginx config
sudo -u postgres postgres -D /var/lib/postgresql/data -C data_directory # Parses postgresql.conf; fails on bad config
# 4. Enable auto-restart (systemd)
# Add to service file:
[Service]
Restart=always
RestartSec=10
```
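The restart policy is best applied as a drop-in override rather than by editing the packaged unit file. A sketch for nginx:
```bash
systemctl edit nginx                       # Opens an override file; add the [Service] lines above
systemctl daemon-reload
systemctl show nginx | grep -E '^Restart'  # Confirm Restart/RestartUSec reflect the override
```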
---
### 7. Cloud Infrastructure Issues
#### AWS-Specific
**Instance Issues**:
```bash
# Check instance health
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0
# Check system logs
aws ec2 get-console-output --instance-id i-1234567890abcdef0
# Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistics Average --period 300 \
  --start-time <ISO8601-start> --end-time <ISO8601-end>
```
**EBS Volume Issues**:
```bash
# Check volume status
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0
# Increase IOPS (gp3 baseline is already 3000, so request more than that)
aws ec2 modify-volume \
  --volume-id vol-1234567890abcdef0 \
  --volume-type gp3 --iops 6000
# Check volume metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeReadOps \
  --dimensions Name=VolumeId,Value=vol-1234567890abcdef0 \
  --statistics Sum --period 300 \
  --start-time <ISO8601-start> --end-time <ISO8601-end>
```
**Network Issues**:
```bash
# Check security groups
aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0
# Check network ACLs
aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0
# Check route tables
aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0
```
---
#### Azure-Specific
**VM Issues**:
```bash
# Check VM status
az vm get-instance-view --name myVM --resource-group myRG
# Restart VM
az vm restart --name myVM --resource-group myRG
# Resize VM
az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3
```
**Disk Issues**:
```bash
# Check disk status
az disk show --name myDisk --resource-group myRG
# Expand disk
az disk update --name myDisk --resource-group myRG --size-gb 256
```
---
## Infrastructure Performance Metrics
**Server Health**:
- CPU: <70% average, <90% peak
- Memory: <80% usage
- Disk: <80% usage, <80% of provisioned IOPS
- Network: <70% bandwidth
**Uptime**:
- Target: 99.9% (8.76 hours downtime/year)
- Monitoring: Check every 1 minute
**Response Time**:
- Ping latency: <50ms (same region)
- HTTP response: <200ms
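curl's built-in timers give a quick breakdown against these targets; the `/health` path below is a placeholder:
```bash
curl -o /dev/null -s -w 'dns: %{time_namelookup}s  connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n' https://example.com/health
```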
---
## Infrastructure Diagnostic Checklist
**When diagnosing infrastructure issues**:
- [ ] Check CPU usage (target: <70%)
- [ ] Check memory usage (target: <80%)
- [ ] Check disk usage (target: <80%)
- [ ] Check disk I/O (%util, await)
- [ ] Check network connectivity (ping, traceroute)
- [ ] Check firewall rules (iptables, security groups)
- [ ] Check service status (systemd, ps)
- [ ] Check system logs (dmesg, /var/log/syslog)
- [ ] Check cloud metrics (CloudWatch, Azure Monitor)
- [ ] Check for hardware issues (SMART, dmesg errors)
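A quick way to work through most of the checklist in one pass (a rough snapshot sketch; run as root for full detail):
```bash
OUT=/tmp/triage-$(date +%Y%m%d-%H%M%S).txt
{
  uptime                    # Load averages
  free -h                   # Memory
  df -h                     # Disk usage
  iostat -x 1 2             # Disk I/O (second report is the live sample)
  ss -s                     # Socket summary
  systemctl --failed        # Failed services
  dmesg -T | tail -30       # Recent kernel messages (OOM, I/O errors)
} > "$OUT" 2>&1
echo "Snapshot written to $OUT"
```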
**Tools**:
- `top`, `htop` - CPU, memory
- `df`, `du` - Disk usage
- `iostat` - Disk I/O
- `iftop`, `netstat` - Network
- `dmesg`, `journalctl` - System logs
- Cloud dashboards (AWS, Azure, GCP)
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Application-level troubleshooting
- [database-diagnostics.md](database-diagnostics.md) - Database performance
- [security-incidents.md](security-incidents.md) - Security response