Initial commit
agents/sre/modules/infrastructure.md
# Infrastructure Diagnostics

**Purpose**: Troubleshoot server, network, disk, and cloud infrastructure issues.

## Common Infrastructure Issues

### 1. High CPU Usage (Server)

**Symptoms**:
- Server CPU at 100%
- Applications slow
- SSH lag

**Diagnosis**:

#### Check CPU Usage
```bash
# Overall CPU usage
top -bn1 | grep "Cpu(s)"

# Top CPU processes
top -bn1 | head -20

# CPU usage per core
mpstat -P ALL 1 5

# Live CPU samples (if sar installed): 10 samples, 1 second apart
sar -u 1 10

# Historical CPU for today (requires sysstat data collection)
sar -u
```

**Red flags**:
- CPU at 100% for >5 minutes
- Single process using >80% CPU
- iowait >20% (disk bottleneck)
- System CPU >30% (kernel overhead)
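
The iowait and system-CPU thresholds above can also be checked without `top`, by diffing two readings of the aggregate `cpu` line in `/proc/stat`. The sketch below assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal). The function names `cpu_breakdown` and `cpu_red_flags` are hypothetical helpers, not part of any tool listed here.

```bash
# cpu_breakdown BEFORE AFTER
# BEFORE and AFTER are aggregate "cpu ..." lines from /proc/stat taken a few
# seconds apart. Prints three percentages: busy iowait system
cpu_breakdown() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
    NR == 2 {
      total = 0
      for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
      if (total == 0) { print "0.0 0.0 0.0"; exit }
      idle = d[5] + d[6]   # idle + iowait both count as "not busy"
      printf "%.1f %.1f %.1f\n", 100 * (total - idle) / total, 100 * d[6] / total, 100 * d[4] / total
    }'
}

# cpu_red_flags: reads "busy iowait system" on stdin and prints one line
# per threshold from the list above that is exceeded
cpu_red_flags() {
  awk '{
    if ($2 > 20) print "iowait >20% (disk bottleneck)"
    if ($3 > 30) print "system CPU >30% (kernel overhead)"
  }'
}

# Usage (assumed 5-second sampling window):
#   before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
#   cpu_breakdown "$before" "$after" | cpu_red_flags
```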

---

#### Identify CPU-heavy Process
```bash
# Top CPU process
ps aux | sort -nrk 3,3 | head -10

# CPU per thread
top -H

# Process tree
pstree -p
```

**Common causes**:
- Application bug (infinite loop)
- Heavy computation
- Crypto mining malware
- Backup/compression running

---

#### Immediate Mitigation
```bash
# 1. Limit process CPU (nice)
renice +10 <PID>  # Lower priority

# 2. Kill process (last resort)
kill -TERM <PID>  # Graceful
kill -KILL <PID>  # Force kill

# 3. Scale horizontally (add servers)
# Cloud: Auto-scaling group

# 4. Scale vertically (bigger instance)
# Cloud: Resize instance
```

---

### 2. Out of Memory (OOM)

**Symptoms**:
- "Out of memory" errors
- OOM Killer triggered
- Applications crash
- Swap usage high

**Diagnosis**:

#### Check Memory Usage
```bash
# Current memory usage
free -h

# Memory per process
ps aux | sort -nrk 4,4 | head -10

# Check OOM killer logs
dmesg | grep -i "out of memory\|oom"
grep "Out of memory" /var/log/syslog  # Debian/Ubuntu; /var/log/messages on RHEL/CentOS

# Check swap usage
swapon --show  # replaces the deprecated `swapon -s`
```

**Red flags**:
- Available memory <10%
- Swap usage >80%
- OOM killer active
- Single process using >50% memory
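
The first two thresholds can be computed directly from `/proc/meminfo`. This is a minimal sketch, assuming a kernel that exposes `MemAvailable` (3.14 or later). `mem_available_pct` and `swap_used_pct` are hypothetical helper names, and the optional file argument exists only so the functions can be run against a saved copy.

```bash
# mem_available_pct [FILE]
# Prints MemAvailable as a percentage of MemTotal (FILE defaults to /proc/meminfo)
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", 100 * avail / total }' "${1:-/proc/meminfo}"
}

# swap_used_pct [FILE]
# Prints used swap as a percentage of SwapTotal, or 0.0 when no swap is configured
swap_used_pct() {
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END { if (total > 0) printf "%.1f\n", 100 * (total - free) / total; else print "0.0" }' "${1:-/proc/meminfo}"
}

# Usage:
#   mem_available_pct   # red flag when below 10
#   swap_used_pct       # red flag when above 80
```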

---

#### Immediate Mitigation
```bash
# 1. Drop page cache, dentries, and inodes (run as root). Non-destructive,
#    but clean cache is reclaimed automatically under memory pressure, so
#    this rarely fixes a real OOM and makes subsequent reads slower
sync && echo 3 > /proc/sys/vm/drop_caches

# 2. Kill memory-heavy process
kill -TERM <PID>  # Graceful first
kill -9 <PID>     # Force kill if it does not exit

# 3. Increase swap (temporary)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile  # swap must not be readable by other users
mkswap /swapfile
swapon /swapfile

# 4. Scale up (more RAM)
# Cloud: Resize instance
```

---

### 3. Disk Full

**Symptoms**:
- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written

**Diagnosis**:

#### Check Disk Usage
```bash
# Disk usage by partition
df -h

# Inode usage (exhausted inodes also cause "No space left on device")
df -i

# Disk usage by directory
du -sh /* 2>/dev/null
du -sh /var/*

# Find large files (-xdev stays on the root filesystem)
find / -xdev -type f -size +100M -exec ls -lh {} +

# Find deleted files that are still held open
lsof | grep deleted
```

**Red flags**:
- Disk usage >90%
- /var/log full (runaway logs)
- /tmp full (temp files not cleaned)
- Deleted files still holding space (process has handle)
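
The >90% threshold can be checked mechanically by filtering `df -P` output (the POSIX format keeps each filesystem on one line). A minimal sketch follows. `full_filesystems` is a hypothetical helper name, and it assumes mount points contain no spaces.

```bash
# full_filesystems [THRESHOLD]
# Reads `df -P` output on stdin and prints "mountpoint use%" for every
# filesystem whose usage is above THRESHOLD percent (default 90)
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)              # "95%" -> "95"
      if (use + 0 > limit + 0) print $6, use "%"
    }'
}

# Usage:
#   df -P | full_filesystems 90
```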

---

#### Immediate Mitigation
```bash
# 1. Clean up logs
find /var/log -name "*.log.*" -mtime +7 -delete
journalctl --vacuum-time=7d

# 2. Clean up old temp files (running processes may still need recent ones)
find /tmp /var/tmp -xdev -type f -mtime +7 -delete

# 3. Release space held by deleted files
lsof | grep deleted  # note the owning PID and service
# Restart the owning service instead of killing every listed PID
systemctl restart <service>

# 4. Compress rotated logs (compressing a live log breaks its writer)
gzip /var/log/*.log.[0-9]*

# 5. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
# After expanding:
growpart /dev/xvda 1  # grow the partition first (cloud-utils)
resize2fs /dev/xvda1  # ext4
xfs_growfs /  # xfs
```

---

### 4. Network Issues

**Symptoms**:
- Slow network performance
- Timeouts
- Connection refused
- High latency

**Diagnosis**:

#### Check Network Connectivity
```bash
# Ping test
ping -c 5 google.com

# DNS resolution
nslookup example.com
dig example.com

# Traceroute
traceroute example.com

# Check network interfaces
ip addr show
ifconfig

# Check routing table
ip route show
route -n
```

**Red flags**:
- Packet loss >1%
- Latency >100ms (same region)
- DNS resolution failures
- Interface down
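
The packet-loss and latency thresholds can be read from the summary that `ping` prints at the end of a run. The sketch below assumes the iputils (Linux) or BSD summary format. `packet_loss_pct` and `avg_rtt_ms` are hypothetical helper names.

```bash
# packet_loss_pct: reads ping output on stdin, prints the loss percentage
packet_loss_pct() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# avg_rtt_ms: reads ping output on stdin, prints the average round-trip time
# from the "rtt min/avg/max/mdev = a/b/c/d ms" line (BSD prints "round-trip")
avg_rtt_ms() {
  awk -F/ '/^(rtt|round-trip)/ { print $5 }'
}

# Usage:
#   out=$(ping -c 5 example.com)
#   echo "$out" | packet_loss_pct   # red flag when above 1
#   echo "$out" | avg_rtt_ms        # red flag when above 100 in the same region
```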

---

#### Check Network Bandwidth
```bash
# Current bandwidth usage
iftop -i eth0

# Network stats
netstat -i

# Bandwidth (if vnstat installed)
vnstat -l  # live traffic
vnstat -d  # daily history

# Check for bandwidth limits (cloud)
# AWS: Check CloudWatch NetworkIn/NetworkOut
```

---

#### Check Firewall Rules
```bash
# Check iptables rules
iptables -L -n -v

# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all

# Check UFW (Ubuntu)
ufw status verbose

# Check security groups (cloud)
# AWS: EC2 → Security Groups
# Azure: Network Security Groups
```

**Common causes**:
- Firewall blocking traffic
- Security group misconfigured
- MTU mismatch
- Network congestion
- DDoS attack

---

#### Immediate Mitigation
```bash
# 1. Allow HTTP/HTTPS through the firewall
#    (-I inserts at the top; a rule appended with -A after a DROP/REJECT never matches)
iptables -I INPUT -p tcp --dport 80 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -j ACCEPT

# 2. Restart networking (use whichever service the host runs)
systemctl restart networking      # Debian/Ubuntu (ifupdown)
systemctl restart NetworkManager  # RHEL/CentOS

# 3. Flush DNS cache (systemd-resolved)
resolvectl flush-caches  # older systemd: systemd-resolve --flush-caches

# 4. Check cloud network ACLs
# Ensure subnet has route to internet gateway
```

---

### 5. High Disk I/O (Slow Disk)

**Symptoms**:
- Applications slow
- High iowait CPU
- Disk latency high

**Diagnosis**:

#### Check Disk I/O
```bash
# Disk I/O stats
iostat -x 1 5

# Look for:
# - %util >80% (disk saturated)
# - await >100ms (high latency; r_await/w_await on newer sysstat)

# Top I/O processes
iotop -o

# Live I/O samples (if sar installed); `sar -d` alone shows today's history
sar -d 1 10
```

**Red flags**:
- %util at 100%
- await >100ms
- iowait CPU >20%
- Queue size (`aqu-sz`; `avgqu-sz` on older sysstat) >10
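
Because `iostat -x` column names and positions differ between sysstat versions, a filter for the `%util` threshold is more robust if it locates the column by its header instead of a fixed index. A minimal sketch follows. `saturated_disks` is a hypothetical helper name, and it assumes a locale with `.` as the decimal separator.

```bash
# saturated_disks [THRESHOLD]
# Reads `iostat -x` output on stdin and prints "device %util" for each
# device whose %util is above THRESHOLD (default 80)
saturated_disks() {
  awk -v limit="${1:-80}" '
    NF == 0        { col = 0; next }   # blank line ends a device table
    $1 ~ /^Device/ {                   # header row: find the %util column
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
  '
}

# Usage (the first iostat report is an average since boot, so later
# reports are the ones that reflect current load):
#   LC_ALL=C iostat -x 1 5 | saturated_disks 80
```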

---

#### Common Causes
```bash
# 1. Database without indexes (Seq Scan)
# See database-diagnostics.md

# 2. Log rotation running
# Large logs being compressed

# 3. Backup running
# Database dump, file backup

# 4. Disk issue (bad sectors)
dmesg | grep -i "I/O error"
smartctl -a /dev/sda  # SMART status
```

---

#### Immediate Mitigation
```bash
# 1. Reduce I/O pressure
# Stop non-critical processes (backup, log rotation)

# 2. Add read cache
# Enable query caching (database)
# Add Redis for application cache

# 3. Scale disk IOPS (cloud)
# AWS: Change EBS volume type (gp2 → gp3 → io1)
# Azure: Change disk tier

# 4. Move to SSD (if on HDD)
```

---

### 6. Service Down / Process Crashed

**Symptoms**:
- Service not responding
- Health check failures
- 502 Bad Gateway

**Diagnosis**:

#### Check Service Status
```bash
# Systemd services
systemctl status nginx
systemctl status postgresql
systemctl status application

# Check if process running
pgrep -a nginx  # unlike `ps aux | grep nginx`, never matches itself
pidof nginx

# Check service logs
journalctl -u nginx -n 50
tail -f /var/log/nginx/error.log
```

**Red flags**:
- Service: inactive (dead)
- Process not found
- Recent crash in logs

---

#### Check Why Service Crashed
```bash
# Check system logs
dmesg | tail -50
grep "error\|segfault\|killed" /var/log/syslog

# Check application logs
tail -100 /var/log/application.log

# Check for OOM killer
dmesg | grep -i "killed process"

# Check core dumps
ls -l /var/crash/
ls -l /tmp/core*
```

**Common causes**:
- Out of memory (OOM Killer)
- Segmentation fault (code bug)
- Unhandled exception
- Dependency service down
- Configuration error

---

#### Immediate Mitigation
```bash
# 1. Restart service
systemctl restart nginx

# 2. Check if started successfully
systemctl status nginx
curl http://localhost

# 3. If startup fails, check config
nginx -t  # Test nginx config
# PostgreSQL has no dedicated config-test flag; asking the server binary for
# a setting parses postgresql.conf and reports syntax errors (run as postgres)
postgres -D /var/lib/postgresql/data -C port

# 4. Enable auto-restart (systemd)
# Add to service file:
# [Service]
# Restart=always
# RestartSec=10
```
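
One way to apply the auto-restart settings without editing the packaged unit file is a systemd drop-in. The sketch below writes one. `write_restart_dropin` and the file name `restart.conf` are hypothetical, and the optional base-directory argument exists only so the function can be tried outside `/etc`.

```bash
# write_restart_dropin UNIT [BASE_DIR]
# Creates BASE_DIR/UNIT.d/restart.conf with the restart policy shown above
# (BASE_DIR defaults to /etc/systemd/system, which requires root)
write_restart_dropin() {
  unit="$1"
  base="${2:-/etc/systemd/system}"
  dir="$base/$unit.d"
  mkdir -p "$dir" || return 1
  printf '[Service]\nRestart=always\nRestartSec=10\n' > "$dir/restart.conf"
}

# Usage (as root), then reload systemd so the drop-in takes effect:
#   write_restart_dropin nginx.service
#   systemctl daemon-reload
#   systemctl restart nginx
```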

---

### 7. Cloud Infrastructure Issues

#### AWS-Specific

**Instance Issues**:
```bash
# Check instance health
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0

# Check system logs
aws ec2 get-console-output --instance-id i-1234567890abcdef0

# Check CloudWatch metrics (time range, period, and statistic are required;
# the date syntax below is GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```

**EBS Volume Issues**:
```bash
# Check volume status
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0

# Increase IOPS (gp3 baseline is 3000, so request more; 6000 is an example value)
aws ec2 modify-volume \
  --volume-id vol-1234567890abcdef0 \
  --iops 6000

# Check volume metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeReadOps \
  --dimensions Name=VolumeId,Value=vol-1234567890abcdef0 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum
```

**Network Issues**:
```bash
# Check security groups
aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0

# Check network ACLs
aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0

# Check route tables
aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0
```

---

#### Azure-Specific

**VM Issues**:
```bash
# Check VM status
az vm get-instance-view --name myVM --resource-group myRG

# Restart VM
az vm restart --name myVM --resource-group myRG

# Resize VM
az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3
```

**Disk Issues**:
```bash
# Check disk status
az disk show --name myDisk --resource-group myRG

# Expand disk
az disk update --name myDisk --resource-group myRG --size-gb 256
```

---

## Infrastructure Performance Metrics

**Server Health**:
- CPU: <70% average, <90% peak
- Memory: <80% usage
- Disk: <80% usage, <80% IOPS
- Network: <70% bandwidth

**Uptime**:
- Target: 99.9% (8.76 hours downtime/year)
- Monitoring: Check every 1 minute
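
The downtime figure follows from the availability target: (100 − 99.9) / 100 × 365 × 24 = 8.76 hours. The same arithmetic as a small sketch, for trying other targets. `downtime_hours_per_year` is a hypothetical helper name, and a 365-day year is assumed.

```bash
# downtime_hours_per_year TARGET_PERCENT
# Prints the yearly downtime budget in hours for an availability target
downtime_hours_per_year() {
  awk -v slo="$1" 'BEGIN { printf "%.2f\n", (100 - slo) / 100 * 365 * 24 }'
}

# Usage:
#   downtime_hours_per_year 99.9
#   downtime_hours_per_year 99.99
```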

**Response Time**:
- Ping latency: <50ms (same region)
- HTTP response: <200ms

---

## Infrastructure Diagnostic Checklist

**When diagnosing infrastructure issues**:

- [ ] Check CPU usage (target: <70%)
- [ ] Check memory usage (target: <80%)
- [ ] Check disk usage (target: <80%)
- [ ] Check disk I/O (%util, await)
- [ ] Check network connectivity (ping, traceroute)
- [ ] Check firewall rules (iptables, security groups)
- [ ] Check service status (systemd, ps)
- [ ] Check system logs (dmesg, /var/log/syslog)
- [ ] Check cloud metrics (CloudWatch, Azure Monitor)
- [ ] Check for hardware issues (SMART, dmesg errors)

**Tools**:
- `top`, `htop` - CPU, memory
- `df`, `du` - Disk usage
- `iostat` - Disk I/O
- `iftop`, `netstat` - Network
- `dmesg`, `journalctl` - System logs
- Cloud dashboards (AWS, Azure, GCP)

---

## Related Documentation

- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Application-level troubleshooting
- [database-diagnostics.md](database-diagnostics.md) - Database performance
- [security-incidents.md](security-incidents.md) - Security response