Infrastructure Diagnostics
Purpose: Troubleshoot server, network, disk, and cloud infrastructure issues.
Common Infrastructure Issues
1. High CPU Usage (Server)
Symptoms:
- Server CPU at 100%
- Applications slow
- SSH lag
Diagnosis:
Check CPU Usage
# Overall CPU usage
top -bn1 | grep "Cpu(s)"
# Top CPU processes
top -bn1 | head -20
# CPU usage per core
mpstat -P ALL 1 5
# Historical CPU (if sar installed)
sar -u 1 10
Red flags:
- CPU at 100% for >5 minutes
- Single process using >80% CPU
- iowait >20% (disk bottleneck)
- System CPU >30% (kernel overhead)
Identify CPU-heavy Process
# Top CPU process
ps aux | sort -nrk 3,3 | head -10
# CPU per thread
top -H
# Process tree
pstree -p
Common causes:
- Application bug (infinite loop)
- Heavy computation
- Crypto mining malware
- Backup/compression running
Immediate Mitigation
# 1. Limit process CPU (nice)
renice +10 <PID> # Lower priority
# 2. Kill process (last resort)
kill -TERM <PID> # Graceful
kill -KILL <PID> # Force kill
# 3. Scale horizontally (add servers)
# Cloud: Auto-scaling group
# 4. Scale vertically (bigger instance)
# Cloud: Resize instance
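If resizing is the chosen fix, a minimal AWS CLI sketch (instance ID and target type are placeholders; the stop/start causes a short outage):
# Resize an EC2 instance
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
aws ec2 wait instance-stopped --instance-ids i-1234567890abcdef0
aws ec2 modify-instance-attribute --instance-id i-1234567890abcdef0 --instance-type "{\"Value\": \"m5.xlarge\"}"
aws ec2 start-instances --instance-ids i-1234567890abcdef0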
2. Out of Memory (OOM)
Symptoms:
- "Out of memory" errors
- OOM Killer triggered
- Applications crash
- Swap usage high
Diagnosis:
Check Memory Usage
# Current memory usage
free -h
# Memory per process
ps aux | sort -nrk 4,4 | head -10
# Check OOM killer logs
dmesg | grep -i "out of memory\|oom"
grep "Out of memory" /var/log/syslog
# Check swap usage
swapon -s
Red flags:
- Available memory <10%
- Swap usage >80%
- OOM killer active
- Single process using >50% memory
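Quick check of the memory and swap thresholds above, assuming the procps-ng free column layout (column 7 of the Mem: line is "available"):
# Percent of memory still available and percent of swap in use
free | awk '/Mem:/ {printf "available: %.0f%%\n", $7/$2*100} /Swap:/ {if ($2>0) printf "swap used: %.0f%%\n", $3/$2*100}'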
Immediate Mitigation
# 1. Free page cache (non-destructive, but expect a brief performance dip while caches refill)
sync && echo 3 > /proc/sys/vm/drop_caches
# 2. Kill memory-heavy process (prefer kill -TERM; use -9 only if it ignores SIGTERM)
kill -9 <PID>
# 3. Increase swap (temporary)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# 4. Scale up (more RAM)
# Cloud: Resize instance
3. Disk Full
Symptoms:
- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written
Diagnosis:
Check Disk Usage
# Disk usage by partition
df -h
# Disk usage by directory
du -sh /*
du -sh /var/*
# Find large files (stay on this filesystem, ignore permission errors)
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null
# Find files using deleted space
lsof | grep deleted
Red flags:
- Disk usage >90%
- /var/log full (runaway logs)
- /tmp full (temp files not cleaned)
- Deleted files still holding space (process has handle)
Immediate Mitigation
# 1. Clean up logs
find /var/log -name "*.log.*" -mtime +7 -delete
journalctl --vacuum-time=7d
# 2. Clean up temp files (careful: running applications may keep sockets or working files in /tmp)
rm -rf /tmp/*
rm -rf /var/tmp/*
# 3. Find deleted files still held open, then restart the owning service to release the space
lsof +L1 # open files whose on-disk link count is 0 (deleted but held open)
# If a restart is not possible, truncate the file through the open descriptor:
# : > /proc/<PID>/fd/<FD>
# 4. Compress rotated logs (compressing a log a process still has open frees no space)
gzip /var/log/*.log.1
# 5. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
# After expanding, grow the partition first, then the filesystem:
growpart /dev/xvda 1 # cloud-utils; skip if the filesystem sits directly on the disk
resize2fs /dev/xvda1 # ext4
xfs_growfs / # xfs
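On AWS, the expansion itself can also be done from the CLI; a sketch with a placeholder volume ID (size is in GiB):
aws ec2 modify-volume --volume-id vol-1234567890abcdef0 --size 200
# Then grow the partition and filesystem on the instance as shown above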
4. Network Issues
Symptoms:
- Slow network performance
- Timeouts
- Connection refused
- High latency
Diagnosis:
Check Network Connectivity
# Ping test
ping -c 5 google.com
# DNS resolution
nslookup example.com
dig example.com
# Traceroute
traceroute example.com
# Check network interfaces
ip addr show
ifconfig
# Check routing table
ip route show
route -n
Red flags:
- Packet loss >1%
- Latency >100ms (same region)
- DNS resolution failures
- Interface down
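To put numbers on packet loss and latency (hostname is a placeholder; mtr may need to be installed):
# Loss and round-trip stats over 100 probes
ping -c 100 -i 0.2 example.com | tail -3
# Per-hop loss/latency report
mtr -rwc 100 example.com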
Check Network Bandwidth
# Current bandwidth usage
iftop -i eth0
# Network stats
netstat -i
# Historical bandwidth (if vnstat installed)
vnstat -l
# Check for bandwidth limits (cloud)
# AWS: Check CloudWatch NetworkIn/NetworkOut
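On AWS, the bandwidth cap depends on the instance type; a quick way to see the advertised limit (instance type is a placeholder):
aws ec2 describe-instance-types --instance-types m5.large \
--query "InstanceTypes[].NetworkInfo.NetworkPerformance" --output text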
Check Firewall Rules
# Check iptables rules
iptables -L -n -v
# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all
# Check UFW (Ubuntu)
ufw status verbose
# Check security groups (cloud)
# AWS: EC2 → Security Groups
# Azure: Network Security Groups
Common causes:
- Firewall blocking traffic
- Security group misconfigured
- MTU mismatch (quick check after this list)
- Network congestion
- DDoS attack
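Quick MTU check referenced above (interface and host are placeholders; 1472 bytes of payload + 28 bytes of headers = a standard 1500-byte frame):
# Interface MTU
ip link show eth0 | grep -o "mtu [0-9]*"
# Largest packet that passes with the don't-fragment bit set
ping -c 3 -M do -s 1472 example.com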
Immediate Mitigation
# 1. Check firewall allows traffic (insert ahead of any DROP/REJECT rules)
iptables -I INPUT -p tcp --dport 80 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -j ACCEPT
# 2. Restart networking
systemctl restart networking # Debian/Ubuntu (ifupdown)
systemctl restart NetworkManager # RHEL/Fedora, desktop systems
# 3. Flush DNS cache
resolvectl flush-caches # older systemd: systemd-resolve --flush-caches
# 4. Check cloud network ACLs
# Ensure subnet has route to internet gateway
5. High Disk I/O (Slow Disk)
Symptoms:
- Applications slow
- High iowait CPU
- Disk latency high
Diagnosis:
Check Disk I/O
# Disk I/O stats
iostat -x 1 5
# Look for:
# - %util >80% (disk saturated)
# - await >100ms (high latency)
# Top I/O processes
iotop -o
# Historical I/O (if sar installed)
sar -d 1 10
Red flags:
- %util at 100%
- await >100ms
- iowait CPU >20%
- Queue size (avgqu-sz) >10
Common Causes
# 1. Database without indexes (Seq Scan)
# See database-diagnostics.md
# 2. Log rotation running
# Large logs being compressed
# 3. Backup running
# Database dump, file backup
# 4. Disk issue (bad sectors)
dmesg | grep -i "I/O error"
smartctl -a /dev/sda # SMART status
Immediate Mitigation
# 1. Reduce I/O pressure
# Stop non-critical processes (backup, log rotation)
# 2. Add read cache
# Enable query caching (database)
# Add Redis for application cache
# 3. Scale disk IOPS (cloud)
# AWS: Change EBS volume type (gp2 → gp3, or io1/io2 for provisioned IOPS)
# Azure: Change disk tier
# 4. Move to SSD (if on HDD)
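If the I/O-heavy process cannot simply be stopped, it can be deprioritized instead; a sketch using util-linux ionice (PID is a placeholder, and the idle class only has an effect with the CFQ/BFQ schedulers):
# Give the process disk time only when nothing else needs it
ionice -c 3 -p <PID>
# Optionally lower its CPU priority as well
renice +10 <PID>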
6. Service Down / Process Crashed
Symptoms:
- Service not responding
- Health check failures
- 502 Bad Gateway
Diagnosis:
Check Service Status
# Systemd services
systemctl status nginx
systemctl status postgresql
systemctl status application
# Check if process running
ps aux | grep nginx
pidof nginx
# Check service logs
journalctl -u nginx -n 50
tail -f /var/log/nginx/error.log
Red flags:
- Service: inactive (dead)
- Process not found
- Recent crash in logs
Check Why Service Crashed
# Check system logs
dmesg | tail -50
grep "error\|segfault\|killed" /var/log/syslog
# Check application logs
tail -100 /var/log/application.log
# Check for OOM killer
dmesg | grep -i "killed process"
# Check core dumps
ls -l /var/crash/
ls -l /tmp/core*
Common causes:
- Out of memory (OOM Killer)
- Segmentation fault (code bug)
- Unhandled exception
- Dependency service down
- Configuration error
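To check whether a dependency or related unit is down (nginx is just the example service used in this section):
# Units this service depends on, plus anything currently failed
systemctl list-dependencies nginx
systemctl --failed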
Immediate Mitigation
# 1. Restart service
systemctl restart nginx
# 2. Check if started successfully
systemctl status nginx
curl http://localhost
# 3. If startup fails, check config
nginx -t # Test nginx config
sudo -u postgres postgres -D /var/lib/postgresql/data -C data_directory # PostgreSQL: errors out if postgresql.conf does not parse
# 4. Enable auto-restart (systemd)
# Add to service file:
[Service]
Restart=always
RestartSec=10
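The edited unit file only takes effect after systemd reloads it (service name is a placeholder):
systemctl daemon-reload
systemctl restart nginx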
7. Cloud Infrastructure Issues
AWS-Specific
Instance Issues:
# Check instance health
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0
# Check system logs
aws ec2 get-console-output --instance-id i-1234567890abcdef0
# Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--statistics Average --period 300 \
--start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
EBS Volume Issues:
# Check volume status
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0
# Increase IOPS (gp3)
aws ec2 modify-volume \
--volume-id vol-1234567890abcdef0 \
--iops 3000
# Check volume metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS \
--metric-name VolumeReadOps \
--dimensions Name=VolumeId,Value=vol-1234567890abcdef0 \
--statistics Sum --period 300 --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z
Network Issues:
# Check security groups
aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0
# Check network ACLs
aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0
# Check route tables
aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0
Azure-Specific
VM Issues:
# Check VM status
az vm get-instance-view --name myVM --resource-group myRG
# Restart VM
az vm restart --name myVM --resource-group myRG
# Resize VM
az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3
Disk Issues:
# Check disk status
az disk show --name myDisk --resource-group myRG
# Expand disk
az disk update --name myDisk --resource-group myRG --size-gb 256
Infrastructure Performance Metrics
Server Health:
- CPU: <70% average, <90% peak
- Memory: <80% usage
- Disk: <80% usage, <80% of provisioned IOPS
- Network: <70% bandwidth
Uptime:
- Target: 99.9% (8.76 hours downtime/year)
- Monitoring: Check every 1 minute
Response Time:
- Ping latency: <50ms (same region)
- HTTP response: <200ms
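A minimal snapshot against these targets, assuming sysstat is installed (mount point and interval are arbitrary choices):
# Load vs. core count, CPU busy %, root filesystem usage
uptime && nproc
mpstat 1 3 | awk '/Average/ {printf "CPU used: %.0f%%\n", 100-$NF}'
df -h / | awk 'NR==2 {print "Disk used on /: " $5}'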
Infrastructure Diagnostic Checklist
When diagnosing infrastructure issues:
- Check CPU usage (target: <70%)
- Check memory usage (target: <80%)
- Check disk usage (target: <80%)
- Check disk I/O (%util, await)
- Check network connectivity (ping, traceroute)
- Check firewall rules (iptables, security groups)
- Check service status (systemd, ps)
- Check system logs (dmesg, /var/log/syslog)
- Check cloud metrics (CloudWatch, Azure Monitor)
- Check for hardware issues (SMART, dmesg errors)
Tools:
- top, htop - CPU, memory
- df, du - Disk usage
- iostat - Disk I/O
- iftop, netstat - Network
- dmesg, journalctl - System logs
- Cloud dashboards (AWS, Azure, GCP)
Related Documentation
- SKILL.md - Main SRE agent
- backend-diagnostics.md - Application-level troubleshooting
- database-diagnostics.md - Database performance
- security-incidents.md - Security response