# Infrastructure Diagnostics

**Purpose**: Troubleshoot server, network, disk, and cloud infrastructure issues.

## Common Infrastructure Issues

### 1. High CPU Usage (Server)

**Symptoms**:
- Server CPU at 100%
- Applications slow
- SSH lag

**Diagnosis**:

#### Check CPU Usage

```bash
# Overall CPU usage
top -bn1 | grep "Cpu(s)"

# Top CPU processes
top -bn1 | head -20

# CPU usage per core
mpstat -P ALL 1 5

# Historical CPU (if sar installed)
sar -u 1 10
```

**Red flags**:
- CPU at 100% for >5 minutes
- Single process using >80% CPU
- iowait >20% (disk bottleneck)
- System CPU >30% (kernel overhead)

---

#### Identify CPU-heavy Process

```bash
# Top CPU processes
ps aux | sort -nrk 3,3 | head -10

# CPU per thread
top -H

# Process tree
pstree -p
```

**Common causes**:
- Application bug (infinite loop)
- Heavy computation
- Crypto-mining malware
- Backup/compression running

---

#### Immediate Mitigation

```bash
# 1. Lower process priority (nice)
renice +10 -p <PID>

# 2. Kill the process (last resort)
kill -TERM <PID>   # Graceful
kill -KILL <PID>   # Force

# 3. Scale horizontally (add servers)
# Cloud: Auto-scaling group

# 4. Scale vertically (bigger instance)
# Cloud: Resize instance
```

---

### 2. Out of Memory (OOM)

**Symptoms**:
- "Out of memory" errors
- OOM Killer triggered
- Applications crash
- Swap usage high

**Diagnosis**:

#### Check Memory Usage

```bash
# Current memory usage
free -h

# Memory per process
ps aux | sort -nrk 4,4 | head -10

# Check OOM Killer logs
dmesg | grep -i "out of memory\|oom"
grep "Out of memory" /var/log/syslog

# Check swap usage
swapon -s
```

**Red flags**:
- Available memory <10%
- Swap usage >80%
- OOM Killer active
- Single process using >50% memory

---

#### Immediate Mitigation

```bash
# 1. Drop page cache (safe; the kernel rebuilds it as needed)
sync && echo 3 > /proc/sys/vm/drop_caches

# 2. Kill the memory-heavy process
kill -9 <PID>

# 3. Add swap (temporary)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile   # swapon refuses world-readable swap files
mkswap /swapfile
swapon /swapfile

# 4. Scale up (more RAM)
# Cloud: Resize instance
```

---

### 3. Disk Full

**Symptoms**:
- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written

**Diagnosis**:

#### Check Disk Usage

```bash
# Disk usage by partition
df -h

# Disk usage by directory
du -sh /*
du -sh /var/*

# Find large files
find / -type f -size +100M -exec ls -lh {} \;

# Find deleted files still holding space
lsof | grep deleted
```

**Red flags**:
- Disk usage >90%
- /var/log full (runaway logs)
- /tmp full (temp files not cleaned)
- Deleted files still holding space (a process keeps an open handle)

---

#### Immediate Mitigation

```bash
# 1. Clean up logs
find /var/log -name "*.log.*" -mtime +7 -delete
journalctl --vacuum-time=7d

# 2. Clean up temp files
rm -rf /tmp/*
rm -rf /var/tmp/*

# 3. Kill processes holding deleted files (releases the space)
lsof | grep deleted | awk '{print $2}' | sort -u | xargs kill -9

# 4. Compress logs
gzip /var/log/*.log

# 5. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
# After expanding:
resize2fs /dev/xvda1  # ext4
xfs_growfs /          # xfs
```

---
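When a deleted file is still pinned by an open handle (step 3 above), it can often be truncated in place instead of killing the holder, which is gentler on the service. A minimal sketch; `<PID>` and `<FD>` are placeholders to fill in from the lsof output:

```bash
# Deleted-but-still-open files: +L1 lists entries with link count 0
lsof +L1

# Truncate the file through /proc instead of killing the process:
# <PID> = lsof's PID column, <FD> = numeric part of its FD column ("4w" -> 4)
: > /proc/<PID>/fd/<FD>
```

---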
### 4. Network Issues

**Symptoms**:
- Slow network performance
- Timeouts
- Connection refused
- High latency

**Diagnosis**:

#### Check Network Connectivity

```bash
# Ping test
ping -c 5 google.com

# DNS resolution
nslookup example.com
dig example.com

# Traceroute
traceroute example.com

# Check network interfaces
ip addr show
ifconfig

# Check routing table
ip route show
route -n
```

**Red flags**:
- Packet loss >1%
- Latency >100ms (same region)
- DNS resolution failures
- Interface down

---

#### Check Network Bandwidth

```bash
# Current bandwidth usage
iftop -i eth0

# Network stats
netstat -i

# Historical bandwidth (if vnstat installed)
vnstat -l

# Check for bandwidth limits (cloud)
# AWS: Check CloudWatch NetworkIn/NetworkOut
```

---

#### Check Firewall Rules

```bash
# Check iptables rules
iptables -L -n -v

# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all

# Check UFW (Ubuntu)
ufw status verbose

# Check security groups (cloud)
# AWS: EC2 → Security Groups
# Azure: Network Security Groups
```

**Common causes**:
- Firewall blocking traffic
- Security group misconfigured
- MTU mismatch
- Network congestion
- DDoS attack

---

#### Immediate Mitigation

```bash
# 1. Allow traffic through the firewall
#    (-I inserts at the top so an earlier DROP rule can't shadow it)
iptables -I INPUT -p tcp --dport 80 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -j ACCEPT

# 2. Restart networking
systemctl restart networking        # Debian/Ubuntu
systemctl restart NetworkManager    # RHEL/Fedora

# 3. Flush DNS cache
resolvectl flush-caches             # or: systemd-resolve --flush-caches (older systemd)

# 4. Check cloud network ACLs
# Ensure subnet has route to internet gateway
```

---

### 5. High Disk I/O (Slow Disk)

**Symptoms**:
- Applications slow
- High iowait CPU
- Disk latency high

**Diagnosis**:

#### Check Disk I/O

```bash
# Disk I/O stats
iostat -x 1 5
# Look for:
# - %util >80% (disk saturated)
# - await >100ms (high latency)

# Top I/O processes
iotop -o

# Historical I/O (if sar installed)
sar -d 1 10
```

**Red flags**:
- %util at 100%
- await >100ms
- iowait CPU >20%
- Queue size (avgqu-sz) >10

---

#### Common Causes

```bash
# 1. Database without indexes (Seq Scan)
# See database-diagnostics.md

# 2. Log rotation running
# Large logs being compressed

# 3. Backup running
# Database dump, file backup

# 4. Disk issue (bad sectors)
dmesg | grep -i "I/O error"
smartctl -a /dev/sda  # SMART status
```

---

#### Immediate Mitigation

```bash
# 1. Reduce I/O pressure
# Stop non-critical processes (backup, log rotation)

# 2. Add read cache
# Enable query caching (database)
# Add Redis for application cache

# 3. Scale disk IOPS (cloud)
# AWS: Change EBS volume type (gp2 → gp3 → io1)
# Azure: Change disk tier

# 4. Move to SSD (if on HDD)
```

---
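To spot a saturated device without scanning the full `iostat` matrix, a small filter can flag anything over the 80% line from the red flags above. A sketch; that `%util` is the last column holds for most sysstat versions but is an assumption:

```bash
# Flag devices whose %util exceeds 80% in any of three 1-second samples
# (requires sysstat; device lines start lowercase, header lines don't)
iostat -dx 1 3 | awk '/^[a-z]/ && $NF + 0 > 80 {print $1, "%util=" $NF}'
```

---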
### 6. Service Down / Process Crashed

**Symptoms**:
- Service not responding
- Health check failures
- 502 Bad Gateway

**Diagnosis**:

#### Check Service Status

```bash
# Systemd services
systemctl status nginx
systemctl status postgresql
systemctl status application

# Check if process running
ps aux | grep nginx
pidof nginx

# Check service logs
journalctl -u nginx -n 50
tail -f /var/log/nginx/error.log
```

**Red flags**:
- Service: inactive (dead)
- Process not found
- Recent crash in logs

---

#### Check Why the Service Crashed

```bash
# Check system logs
dmesg | tail -50
grep "error\|segfault\|killed" /var/log/syslog

# Check application logs
tail -100 /var/log/application.log

# Check for OOM Killer
dmesg | grep -i "killed process"

# Check core dumps
ls -l /var/crash/
ls -l /tmp/core*
```

**Common causes**:
- Out of memory (OOM Killer)
- Segmentation fault (code bug)
- Unhandled exception
- Dependency service down
- Configuration error

---

#### Immediate Mitigation

```bash
# 1. Restart service
systemctl restart nginx

# 2. Check if it started successfully
systemctl status nginx
curl http://localhost

# 3. If startup fails, check config
nginx -t  # Test nginx config
# PostgreSQL has no -t equivalent; asking for a setting forces a config parse:
sudo -u postgres postgres -D /var/lib/postgresql/data -C data_directory

# 4. Enable auto-restart (systemd)
# Add to service file:
[Service]
Restart=always
RestartSec=10
```

---

### 7. Cloud Infrastructure Issues

#### AWS-Specific

**Instance Issues**:

```bash
# Check instance health
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0

# Check system logs
aws ec2 get-console-output --instance-id i-1234567890abcdef0

# Check CloudWatch metrics (time range, period, and statistic are required)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 \
  --statistics Average
```

**EBS Volume Issues**:

```bash
# Check volume status
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0

# Increase IOPS (gp3)
aws ec2 modify-volume \
  --volume-id vol-1234567890abcdef0 \
  --iops 3000

# Check volume metrics
# (needs the same --start-time/--end-time/--period/--statistics flags as above)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeReadOps \
  --dimensions Name=VolumeId,Value=vol-1234567890abcdef0
```

**Network Issues**:

```bash
# Check security groups
aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0

# Check network ACLs
aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0

# Check route tables
aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0
```

---

#### Azure-Specific

**VM Issues**:

```bash
# Check VM status
az vm get-instance-view --name myVM --resource-group myRG

# Restart VM
az vm restart --name myVM --resource-group myRG

# Resize VM
az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3
```

**Disk Issues**:

```bash
# Check disk status
az disk show --name myDisk --resource-group myRG

# Expand disk
az disk update --name myDisk --resource-group myRG --size-gb 256
```

---

## Infrastructure Performance Metrics

**Server Health**:
- CPU: <70% average, <90% peak
- Memory: <80% usage
- Disk: <80% usage, <80% of provisioned IOPS
- Network: <70% bandwidth

**Uptime**:
- Target: 99.9% (8.76 hours downtime/year)
- Monitoring: Check every 1 minute

**Response Time**:
- Ping latency: <50ms (same region)
- HTTP response: <200ms

---
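A quick way to compare a host against these targets is a small threshold check. A sketch; the parsing assumes the English output format of `top` and `free` (which varies across versions), and the thresholds simply mirror the targets above:

```bash
#!/usr/bin/env bash
# Warn when live usage crosses the targets above (CPU 70%, memory 80%, disk 80%).

cpu=$(top -bn1 | awk -F'[, ]+' '/Cpu\(s\)/ {print 100 - $8}')   # 100 - idle%
mem=$(free | awk '/^Mem:/ {printf "%.0f", $3 / $2 * 100}')      # used / total
disk=$(df --output=pcent / | tail -1 | tr -dc '0-9')            # root filesystem

awk -v c="$cpu" -v m="$mem" -v d="$disk" 'BEGIN {
    if (c > 70) print "WARN: CPU at " c "% (target <70%)"
    if (m > 80) print "WARN: memory at " m "% (target <80%)"
    if (d > 80) print "WARN: disk at " d "% (target <80%)"
}'
```

---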
## Infrastructure Diagnostic Checklist

**When diagnosing infrastructure issues**:
- [ ] Check CPU usage (target: <70%)
- [ ] Check memory usage (target: <80%)
- [ ] Check disk usage (target: <80%)
- [ ] Check disk I/O (%util, await)
- [ ] Check network connectivity (ping, traceroute)
- [ ] Check firewall rules (iptables, security groups)
- [ ] Check service status (systemd, ps)
- [ ] Check system logs (dmesg, /var/log/syslog)
- [ ] Check cloud metrics (CloudWatch, Azure Monitor)
- [ ] Check for hardware issues (SMART, dmesg errors)

**Tools**:
- `top`, `htop` - CPU, memory
- `df`, `du` - Disk usage
- `iostat` - Disk I/O
- `iftop`, `netstat` - Network
- `dmesg`, `journalctl` - System logs
- Cloud dashboards (AWS, Azure, GCP)

---

## Related Documentation

- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Application-level troubleshooting
- [database-diagnostics.md](database-diagnostics.md) - Database performance
- [security-incidents.md](security-incidents.md) - Security response
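---

For a fast first pass through the checklist above, the read-only commands can be batched into a single report. A minimal sketch; the report path and the exact command selection are assumptions, and everything here only reads state:

```bash
#!/usr/bin/env bash
# Read-only triage: capture the checklist's key outputs into one report file.
{
    echo "== load / cpu ==";  uptime; top -bn1 | head -15
    echo "== memory ==";      free -h
    echo "== disk ==";        df -h
    echo "== disk i/o ==";    iostat -dx 1 2 2>/dev/null | tail -20
    echo "== network ==";     ip -br addr; ss -s
    echo "== services ==";    systemctl --failed
    echo "== kernel log =="; dmesg -T 2>/dev/null | tail -20
} > "/tmp/triage-$(hostname)-$(date +%Y%m%d-%H%M).txt"
```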