
Infrastructure Diagnostics

Purpose: Troubleshoot server, network, disk, and cloud infrastructure issues.

Common Infrastructure Issues

1. High CPU Usage (Server)

Symptoms:

  • Server CPU at 100%
  • Applications slow
  • SSH lag

Diagnosis:

Check CPU Usage

# Overall CPU usage
top -bn1 | grep "Cpu(s)"

# Top CPU processes
top -bn1 | head -20

# CPU usage per core
mpstat -P ALL 1 5

# Historical CPU (if sar installed)
sar -u 1 10

Red flags:

  • CPU at 100% for >5 minutes
  • Single process using >80% CPU
  • iowait >20% (disk bottleneck)
  • System CPU >30% (kernel overhead)

Identify CPU-heavy Process

# Top CPU process
ps aux | sort -nrk 3,3 | head -10

# CPU per thread
top -H

# Process tree
pstree -p

Common causes:

  • Application bug (infinite loop)
  • Heavy computation
  • Crypto mining malware (see the check below)
  • Backup/compression running
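
To vet a suspicious CPU-heavy process for mining malware, a minimal sketch (the <PID> is a placeholder, as elsewhere in this guide):

# Inspect the binary behind the suspicious PID
ls -l /proc/<PID>/exe

# Check its outbound connections (miners usually hold a connection to a pool server)
ss -tnp | grep <PID>

# Confirm the binary belongs to an installed package (Debian/Ubuntu)
dpkg -S "$(readlink /proc/<PID>/exe)"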

Immediate Mitigation

# 1. Limit process CPU (nice)
renice +10 <PID>  # Lower priority

# 2. Kill process (last resort)
kill -TERM <PID>  # Graceful
kill -KILL <PID>  # Force kill

# 3. Scale horizontally (add servers)
# Cloud: Auto-scaling group

# 4. Scale vertically (bigger instance)
# Cloud: Resize instance
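
If the host sits behind an AWS Auto Scaling group, scaling out can be done from the CLI; a minimal sketch (the group name is illustrative):

# Scale out an Auto Scaling group (group name is a placeholder)
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name web-asg \
  --desired-capacity 4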

2. Out of Memory (OOM)

Symptoms:

  • "Out of memory" errors
  • OOM Killer triggered
  • Applications crash
  • Swap usage high

Diagnosis:

Check Memory Usage

# Current memory usage
free -h

# Memory per process
ps aux | sort -nrk 4,4 | head -10

# Check OOM killer logs
dmesg | grep -i "out of memory\|oom"
grep "Out of memory" /var/log/syslog

# Check swap usage
swapon -s

Red flags:

  • Available memory <10%
  • Swap usage >80%
  • OOM killer active
  • Single process using >50% memory

Immediate Mitigation

# 1. Drop caches (page cache, dentries, inodes) - safe, but expect a brief slowdown
sync && echo 3 > /proc/sys/vm/drop_caches

# 2. Kill memory-heavy process
kill -9 <PID>

# 3. Increase swap (temporary)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile  # swapon warns about world-readable swap files
mkswap /swapfile
swapon /swapfile

# 4. Scale up (more RAM)
# Cloud: Resize instance
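
On AWS, a vertical resize requires a stop/start cycle; a minimal sketch (instance ID and target type are placeholders):

# Resize an EC2 instance (IDs and instance type are illustrative)
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
aws ec2 wait instance-stopped --instance-ids i-1234567890abcdef0
aws ec2 modify-instance-attribute \
  --instance-id i-1234567890abcdef0 \
  --instance-type "{\"Value\": \"m5.2xlarge\"}"
aws ec2 start-instances --instance-ids i-1234567890abcdef0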

3. Disk Full

Symptoms:

  • "No space left on device" errors
  • Applications can't write files
  • Database refuses writes
  • Logs not being written

Diagnosis:

Check Disk Usage

# Disk usage by partition
df -h

# Disk usage by directory
du -sh /*
du -sh /var/*

# Find large files
find / -type f -size +100M -exec ls -lh {} \;

# Find files using deleted space
lsof | grep deleted

Red flags:

  • Disk usage >90%
  • /var/log full (runaway logs)
  • /tmp full (temp files not cleaned)
  • Deleted files still holding space (process has handle)

Immediate Mitigation

# 1. Clean up logs
find /var/log -name "*.log.*" -mtime +7 -delete
journalctl --vacuum-time=7d

# 2. Clean up temp files (caution: running services may hold files in /tmp)
rm -rf /tmp/*
rm -rf /var/tmp/*

# 3. Find processes holding deleted files open, then restart/reload those services
lsof | grep deleted
# Restarting the owning service (or truncating via its /proc/<PID>/fd entry) releases
# the space; avoid a blanket kill -9 on every PID in the list

# 4. Compress rotated logs (do not gzip a log a service still has open)
gzip /var/log/*.log.1  # rotated-log names vary; adjust the glob

# 5. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
# After expanding, grow the partition (if any), then the filesystem:
growpart /dev/xvda 1  # from cloud-utils; only needed if the FS sits on a partition
resize2fs /dev/xvda1  # ext4
xfs_growfs /          # xfs

4. Network Issues

Symptoms:

  • Slow network performance
  • Timeouts
  • Connection refused
  • High latency

Diagnosis:

Check Network Connectivity

# Ping test
ping -c 5 google.com

# DNS resolution
nslookup example.com
dig example.com

# Traceroute
traceroute example.com

# Check network interfaces
ip addr show
ifconfig

# Check routing table
ip route show
route -n

Red flags:

  • Packet loss >1%
  • Latency >100ms (same region)
  • DNS resolution failures
  • Interface down

Check Network Bandwidth

# Current bandwidth usage
iftop -i eth0

# Network stats
netstat -i

# Historical bandwidth (if vnstat installed)
vnstat -l

# Check for bandwidth limits (cloud)
# AWS: Check CloudWatch NetworkIn/NetworkOut
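
A sketch of pulling NetworkOut for one instance from CloudWatch (instance ID and time window are placeholders):

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name NetworkOut \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2025-01-01T00:00:00Z --end-time 2025-01-01T01:00:00Z \
  --period 300 --statistics Average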

Check Firewall Rules

# Check iptables rules
iptables -L -n -v

# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all

# Check UFW (Ubuntu)
ufw status verbose

# Check security groups (cloud)
# AWS: EC2 → Security Groups
# Azure: Network Security Groups

Common causes:

  • Firewall blocking traffic
  • Security group misconfigured
  • MTU mismatch (see the ping test below)
  • Network congestion
  • DDoS attack
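
To test for an MTU mismatch, send non-fragmentable pings sized just under a 1500-byte MTU (1472 bytes of payload + 28 bytes of IP/ICMP headers = 1500); a quick sketch:

# Path MTU test: fails with "message too long" if an MTU along the path is smaller
ping -c 3 -M do -s 1472 example.com

# Check the local interface MTU
ip link show eth0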

Immediate Mitigation

# 1. Check firewall allows traffic (insert before any existing DROP/REJECT rules)
iptables -I INPUT -p tcp --dport 80 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -j ACCEPT

# 2. Restart networking
systemctl restart networking       # Debian/Ubuntu (ifupdown)
systemctl restart NetworkManager   # RHEL/CentOS and NetworkManager-managed hosts

# 3. Flush DNS cache (systemd-resolved)
systemd-resolve --flush-caches   # older systemd
resolvectl flush-caches          # newer systemd

# 4. Check cloud network ACLs
# Ensure subnet has route to internet gateway
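
A sketch of checking the subnet's route table from the AWS CLI (the subnet ID is a placeholder); look for a 0.0.0.0/0 route targeting an igw- or nat- ID:

aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-1234567890abcdef0 \
  --query 'RouteTables[].Routes'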

5. High Disk I/O (Slow Disk)

Symptoms:

  • Applications slow
  • High iowait CPU
  • Disk latency high

Diagnosis:

Check Disk I/O

# Disk I/O stats
iostat -x 1 5

# Look for:
# - %util >80% (disk saturated)
# - await >100ms (high latency)

# Top I/O processes
iotop -o

# Historical I/O (if sar installed)
sar -d 1 10

Red flags:

  • %util at 100%
  • await >100ms
  • iowait CPU >20%
  • Queue size (avgqu-sz) >10

Common Causes

# 1. Database without indexes (Seq Scan)
# See database-diagnostics.md

# 2. Log rotation running
# Large logs being compressed

# 3. Backup running
# Database dump, file backup

# 4. Disk issue (bad sectors)
dmesg | grep -i "I/O error"
smartctl -a /dev/sda  # SMART status

Immediate Mitigation

# 1. Reduce I/O pressure
# Stop non-critical processes (backup, log rotation)

# 2. Add read cache
# Enable query caching (database)
# Add Redis for application cache

# 3. Scale disk IOPS (cloud)
# AWS: Change EBS volume type (gp2 → gp3 → io1)
# Azure: Change disk tier

# 4. Move to SSD (if on HDD)
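
A sketch of moving a gp2 volume to gp3 with higher provisioned IOPS and throughput (IDs and numbers are illustrative):

aws ec2 modify-volume \
  --volume-id vol-1234567890abcdef0 \
  --volume-type gp3 \
  --iops 6000 \
  --throughput 250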

6. Service Down / Process Crashed

Symptoms:

  • Service not responding
  • Health check failures
  • 502 Bad Gateway

Diagnosis:

Check Service Status

# Systemd services
systemctl status nginx
systemctl status postgresql
systemctl status application

# Check if process running
ps aux | grep nginx
pidof nginx

# Check service logs
journalctl -u nginx -n 50
tail -f /var/log/nginx/error.log

Red flags:

  • Service: inactive (dead)
  • Process not found
  • Recent crash in logs

Check Why Service Crashed

# Check system logs
dmesg | tail -50
grep "error\|segfault\|killed" /var/log/syslog

# Check application logs
tail -100 /var/log/application.log

# Check for OOM killer
dmesg | grep -i "killed process"

# Check core dumps
ls -l /var/crash/
ls -l /tmp/core*

Common causes:

  • Out of memory (OOM Killer)
  • Segmentation fault (code bug)
  • Unhandled exception
  • Dependency service down
  • Configuration error

Immediate Mitigation

# 1. Restart service
systemctl restart nginx

# 2. Check if started successfully
systemctl status nginx
curl http://localhost

# 3. If startup fails, check config
nginx -t  # Test nginx config
sudo -u postgres postgres -D /var/lib/postgresql/data -C config_file  # PostgreSQL: parses config, fails on errors (binary path varies by distro)

# 4. Enable auto-restart (systemd)
# Add to service file:
[Service]
Restart=always
RestartSec=10
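
One way to apply this without editing the packaged unit is a systemd drop-in override; a sketch for nginx (the same pattern works for any service):

# Create a drop-in override
mkdir -p /etc/systemd/system/nginx.service.d
cat > /etc/systemd/system/nginx.service.d/restart.conf <<'EOF'
[Service]
Restart=always
RestartSec=10
EOF

systemctl daemon-reload
systemctl restart nginx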

7. Cloud Infrastructure Issues

AWS-Specific

Instance Issues:

# Check instance health
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0

# Check system logs
aws ec2 get-console-output --instance-id i-1234567890abcdef0

# Check CloudWatch metrics (get-metric-statistics also needs a time window, period, and statistic)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2025-01-01T00:00:00Z --end-time 2025-01-01T01:00:00Z \
  --period 300 --statistics Average

EBS Volume Issues:

# Check volume status
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0

# Increase IOPS (gp3)
aws ec2 modify-volume \
  --volume-id vol-1234567890abcdef0 \
  --iops 3000

# Check volume metrics (times are placeholders)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeReadOps \
  --dimensions Name=VolumeId,Value=vol-1234567890abcdef0 \
  --start-time 2025-01-01T00:00:00Z --end-time 2025-01-01T01:00:00Z \
  --period 300 --statistics Sum

Network Issues:

# Check security groups
aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0

# Check network ACLs
aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0

# Check route tables
aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0

Azure-Specific

VM Issues:

# Check VM status
az vm get-instance-view --name myVM --resource-group myRG

# Restart VM
az vm restart --name myVM --resource-group myRG

# Resize VM
az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3

Disk Issues:

# Check disk status
az disk show --name myDisk --resource-group myRG

# Expand disk
az disk update --name myDisk --resource-group myRG --size-gb 256

Infrastructure Performance Metrics

Server Health:

  • CPU: <70% average, <90% peak
  • Memory: <80% usage
  • Disk: <80% usage, <80% of provisioned IOPS
  • Network: <70% of available bandwidth

Uptime:

  • Target: 99.9% (8.76 hours downtime/year)
  • Monitoring: Check every 1 minute

Response Time:

  • Ping latency: <50ms (same region)
  • HTTP response: <200ms

Infrastructure Diagnostic Checklist

When diagnosing infrastructure issues:

  • Check CPU usage (target: <70%)
  • Check memory usage (target: <80%)
  • Check disk usage (target: <80%)
  • Check disk I/O (%util, await)
  • Check network connectivity (ping, traceroute)
  • Check firewall rules (iptables, security groups)
  • Check service status (systemd, ps)
  • Check system logs (dmesg, /var/log/syslog)
  • Check cloud metrics (CloudWatch, Azure Monitor)
  • Check for hardware issues (SMART, dmesg errors)

Tools:

  • top, htop - CPU, memory
  • df, du - Disk usage
  • iostat - Disk I/O
  • iftop, netstat - Network
  • dmesg, journalctl - System logs
  • Cloud dashboards (AWS, Azure, GCP)