Infrastructure Diagnostics
Purpose: Troubleshoot server, network, disk, and cloud infrastructure issues.
Common Infrastructure Issues
1. High CPU Usage (Server)
Symptoms:
- Server CPU at 100%
- Applications slow
- SSH lag
Diagnosis:
Check CPU Usage
# Overall CPU usage
top -bn1 | grep "Cpu(s)"
# Top CPU processes
top -bn1 | head -20
# CPU usage per core
mpstat -P ALL 1 5
# Historical CPU (if sar installed)
sar -u 1 10
Red flags:
- CPU at 100% for >5 minutes
- Single process using >80% CPU
- iowait >20% (disk bottleneck)
- System CPU >30% (kernel overhead)
Identify CPU-heavy Process
# Top CPU process
ps aux | sort -nrk 3,3 | head -10
# CPU per thread
top -H
# Process tree
pstree -p
Common causes:
- Application bug (infinite loop)
- Heavy computation
- Crypto mining malware
- Backup/compression running
Immediate Mitigation
# 1. Limit process CPU (nice)
renice +10 <PID> # Lower priority
# 2. Kill process (last resort)
kill -TERM <PID> # Graceful
kill -KILL <PID> # Force kill
# 3. Scale horizontally (add servers)
# Cloud: Auto-scaling group
# 4. Scale vertically (bigger instance)
# Cloud: Resize instance
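If resizing is the chosen fix, a minimal AWS CLI sketch (instance ID and target type are placeholders; the stop/start causes a short outage):
# Resize an EC2 instance
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
aws ec2 wait instance-stopped --instance-ids i-1234567890abcdef0
aws ec2 modify-instance-attribute --instance-id i-1234567890abcdef0 --instance-type "{\"Value\": \"m5.xlarge\"}"
aws ec2 start-instances --instance-ids i-1234567890abcdef0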
2. Out of Memory (OOM)
Symptoms:
- "Out of memory" errors
- OOM Killer triggered
- Applications crash
- Swap usage high
Diagnosis:
Check Memory Usage
# Current memory usage
free -h
# Memory per process
ps aux | sort -nrk 4,4 | head -10
# Check OOM killer logs
dmesg | grep -i "out of memory\|oom"
grep "Out of memory" /var/log/syslog
# Check swap usage
swapon -s
Red flags:
- Available memory <10%
- Swap usage >80%
- OOM killer active
- Single process using >50% memory
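Quick check of the memory and swap thresholds above, assuming the procps-ng free column layout (column 7 of the Mem: line is "available"):
# Percent of memory still available and percent of swap in use
free | awk '/Mem:/ {printf "available: %.0f%%\n", $7/$2*100} /Swap:/ {if ($2>0) printf "swap used: %.0f%%\n", $3/$2*100}'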
Immediate Mitigation
# 1. Free page cache (non-destructive, but expect a brief performance dip while caches refill)
sync && echo 3 > /proc/sys/vm/drop_caches
# 2. Kill memory-heavy process (prefer kill -TERM; use -9 only if it ignores SIGTERM)
kill -9 <PID>
# 3. Increase swap (temporary)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# 4. Scale up (more RAM)
# Cloud: Resize instance
3. Disk Full
Symptoms:
- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written
Diagnosis:
Check Disk Usage
# Disk usage by partition
df -h
# Disk usage by directory
du -sh /*
du -sh /var/*
# Find large files (stay on this filesystem, ignore permission errors)
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null
# Find files using deleted space
lsof | grep deleted
Red flags:
- Disk usage >90%
- /var/log full (runaway logs)
- /tmp full (temp files not cleaned)
- Deleted files still holding space (process has handle)
Immediate Mitigation
# 1. Clean up logs
find /var/log -name "*.log.*" -mtime +7 -delete
journalctl --vacuum-time=7d
# 2. Clean up temp files (careful: running applications may keep sockets or working files in /tmp)
rm -rf /tmp/*
rm -rf /var/tmp/*
# 3. Find deleted files still held open, then restart the owning service to release the space
lsof +L1 # open files whose on-disk link count is 0 (deleted but held open)
# If a restart is not possible, truncate the file through the open descriptor:
# : > /proc/<PID>/fd/<FD>
# 4. Compress rotated logs (compressing a log a process still has open frees no space)
gzip /var/log/*.log.1
# 5. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
# After expanding, grow the partition first, then the filesystem:
growpart /dev/xvda 1 # cloud-utils; skip if the filesystem sits directly on the disk
resize2fs /dev/xvda1 # ext4
xfs_growfs / # xfs
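On AWS, the expansion itself can also be done from the CLI; a sketch with a placeholder volume ID (size is in GiB):
aws ec2 modify-volume --volume-id vol-1234567890abcdef0 --size 200
# Then grow the partition and filesystem on the instance as shown above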
4. Network Issues
Symptoms:
- Slow network performance
- Timeouts
- Connection refused
- High latency
Diagnosis:
Check Network Connectivity
# Ping test
ping -c 5 google.com
# DNS resolution
nslookup example.com
dig example.com
# Traceroute
traceroute example.com
# Check network interfaces
ip addr show
ifconfig
# Check routing table
ip route show
route -n
Red flags:
- Packet loss >1%
- Latency >100ms (same region)
- DNS resolution failures
- Interface down
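To put numbers on packet loss and latency (hostname is a placeholder; mtr may need to be installed):
# Loss and round-trip stats over 100 probes
ping -c 100 -i 0.2 example.com | tail -3
# Per-hop loss/latency report
mtr -rwc 100 example.com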
Check Network Bandwidth
# Current bandwidth usage
iftop -i eth0
# Network stats
netstat -i
# Historical bandwidth (if vnstat installed)
vnstat -l
# Check for bandwidth limits (cloud)
# AWS: Check CloudWatch NetworkIn/NetworkOut
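On AWS, the bandwidth cap depends on the instance type; a quick way to see the advertised limit (instance type is a placeholder):
aws ec2 describe-instance-types --instance-types m5.large \
--query "InstanceTypes[].NetworkInfo.NetworkPerformance" --output text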
Check Firewall Rules
# Check iptables rules
iptables -L -n -v
# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all
# Check UFW (Ubuntu)
ufw status verbose
# Check security groups (cloud)
# AWS: EC2 → Security Groups
# Azure: Network Security Groups
Common causes:
- Firewall blocking traffic
- Security group misconfigured
- MTU mismatch (quick check after this list)
- Network congestion
- DDoS attack
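Quick MTU check referenced above (interface and host are placeholders; 1472 bytes of payload + 28 bytes of headers = a standard 1500-byte frame):
# Interface MTU
ip link show eth0 | grep -o "mtu [0-9]*"
# Largest packet that passes with the don't-fragment bit set
ping -c 3 -M do -s 1472 example.com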
Immediate Mitigation
# 1. Check firewall allows traffic (insert ahead of any DROP/REJECT rules)
iptables -I INPUT -p tcp --dport 80 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -j ACCEPT
# 2. Restart networking
systemctl restart networking # Debian/Ubuntu (ifupdown)
systemctl restart NetworkManager # RHEL/Fedora, desktop systems
# 3. Flush DNS cache
resolvectl flush-caches # older systemd: systemd-resolve --flush-caches
# 4. Check cloud network ACLs
# Ensure subnet has route to internet gateway
5. High Disk I/O (Slow Disk)
Symptoms:
- Applications slow
- High iowait CPU
- Disk latency high
Diagnosis:
Check Disk I/O
# Disk I/O stats
iostat -x 1 5
# Look for:
# - %util >80% (disk saturated)
# - await >100ms (high latency)
# Top I/O processes
iotop -o
# Historical I/O (if sar installed)
sar -d 1 10
Red flags:
- %util at 100%
- await >100ms
- iowait CPU >20%
- Queue size (avgqu-sz) >10
Common Causes
# 1. Database without indexes (Seq Scan)
# See database-diagnostics.md
# 2. Log rotation running
# Large logs being compressed
# 3. Backup running
# Database dump, file backup
# 4. Disk issue (bad sectors)
dmesg | grep -i "I/O error"
smartctl -a /dev/sda # SMART status
Immediate Mitigation
# 1. Reduce I/O pressure
# Stop non-critical processes (backup, log rotation)
# 2. Add read cache
# Enable query caching (database)
# Add Redis for application cache
# 3. Scale disk IOPS (cloud)
# AWS: Change EBS volume type (gp2 → gp3, or io1/io2 for provisioned IOPS)
# Azure: Change disk tier
# 4. Move to SSD (if on HDD)
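If the I/O-heavy process cannot simply be stopped, it can be deprioritized instead; a sketch using util-linux ionice (PID is a placeholder, and the idle class only has an effect with the CFQ/BFQ schedulers):
# Give the process disk time only when nothing else needs it
ionice -c 3 -p <PID>
# Optionally lower its CPU priority as well
renice +10 <PID>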
6. Service Down / Process Crashed
Symptoms:
- Service not responding
- Health check failures
- 502 Bad Gateway
Diagnosis:
Check Service Status
# Systemd services
systemctl status nginx
systemctl status postgresql
systemctl status application
# Check if process running
ps aux | grep nginx
pidof nginx
# Check service logs
journalctl -u nginx -n 50
tail -f /var/log/nginx/error.log
Red flags:
- Service: inactive (dead)
- Process not found
- Recent crash in logs
Check Why Service Crashed
# Check system logs
dmesg | tail -50
grep "error\|segfault\|killed" /var/log/syslog
# Check application logs
tail -100 /var/log/application.log
# Check for OOM killer
dmesg | grep -i "killed process"
# Check core dumps
ls -l /var/crash/
ls -l /tmp/core*
Common causes:
- Out of memory (OOM Killer)
- Segmentation fault (code bug)
- Unhandled exception
- Dependency service down
- Configuration error
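To check whether a dependency or related unit is down (nginx is just the example service used in this section):
# Units this service depends on, plus anything currently failed
systemctl list-dependencies nginx
systemctl --failed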
Immediate Mitigation
# 1. Restart service
systemctl restart nginx
# 2. Check if started successfully
systemctl status nginx
curl http://localhost
# 3. If startup fails, check config
nginx -t # Test nginx config
sudo -u postgres postgres -D /var/lib/postgresql/data -C data_directory # PostgreSQL: errors out if postgresql.conf does not parse
# 4. Enable auto-restart (systemd)
# Add to service file:
[Service]
Restart=always
RestartSec=10
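The edited unit file only takes effect after systemd reloads it (service name is a placeholder):
systemctl daemon-reload
systemctl restart nginx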
7. Cloud Infrastructure Issues
AWS-Specific
Instance Issues:
# Check instance health
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0
# Check system logs
aws ec2 get-console-output --instance-id i-1234567890abcdef0
# Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--statistics Average --period 300 \
--start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
EBS Volume Issues:
# Check volume status
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0
# Increase IOPS (gp3)
aws ec2 modify-volume \
--volume-id vol-1234567890abcdef0 \
--iops 3000
# Check volume metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS \
--metric-name VolumeReadOps \
--dimensions Name=VolumeId,Value=vol-1234567890abcdef0 \
--statistics Sum --period 300 --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z
Network Issues:
# Check security groups
aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0
# Check network ACLs
aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0
# Check route tables
aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0
Azure-Specific
VM Issues:
# Check VM status
az vm get-instance-view --name myVM --resource-group myRG
# Restart VM
az vm restart --name myVM --resource-group myRG
# Resize VM
az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3
Disk Issues:
# Check disk status
az disk show --name myDisk --resource-group myRG
# Expand disk
az disk update --name myDisk --resource-group myRG --size-gb 256
Infrastructure Performance Metrics
Server Health:
- CPU: <70% average, <90% peak
- Memory: <80% usage
- Disk: <80% usage, <80% of provisioned IOPS
- Network: <70% bandwidth
Uptime:
- Target: 99.9% (8.76 hours downtime/year)
- Monitoring: Check every 1 minute
Response Time:
- Ping latency: <50ms (same region)
- HTTP response: <200ms
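A minimal snapshot against these targets, assuming sysstat is installed (mount point and interval are arbitrary choices):
# Load vs. core count, CPU busy %, root filesystem usage
uptime && nproc
mpstat 1 3 | awk '/Average/ {printf "CPU used: %.0f%%\n", 100-$NF}'
df -h / | awk 'NR==2 {print "Disk used on /: " $5}'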
Infrastructure Diagnostic Checklist
When diagnosing infrastructure issues:
- Check CPU usage (target: <70%)
- Check memory usage (target: <80%)
- Check disk usage (target: <80%)
- Check disk I/O (%util, await)
- Check network connectivity (ping, traceroute)
- Check firewall rules (iptables, security groups)
- Check service status (systemd, ps)
- Check system logs (dmesg, /var/log/syslog)
- Check cloud metrics (CloudWatch, Azure Monitor)
- Check for hardware issues (SMART, dmesg errors)
Tools:
- top, htop - CPU, memory
- df, du - Disk usage
- iostat - Disk I/O
- iftop, netstat - Network
- dmesg, journalctl - System logs
- Cloud dashboards (AWS, Azure, GCP)
Related Documentation
- SKILL.md - Main SRE agent
- backend-diagnostics.md - Application-level troubleshooting
- database-diagnostics.md - Database performance
- security-incidents.md - Security response