Playbook: High CPU Usage
Symptoms
- CPU usage at 80-100%
- Applications slow or unresponsive
- Server lag, SSH slow
- Monitoring alert: "CPU usage >80% for 5 minutes"
Severity
- SEV2 if application degraded but functional
- SEV1 if application unresponsive
Diagnosis
Step 1: Identify Top CPU Process
# Current CPU usage
top -bn1 | head -20
# Top CPU processes (ps reports %CPU averaged over each process's lifetime, not the current instant)
ps aux | sort -nrk 3,3 | head -10
# CPU per thread
top -H -p <PID>
What to look for:
- Single process using >80% CPU
- Multiple processes all high (system-wide issue)
- High system (sy) vs user (us) CPU time; high iowait (wa) points to a disk bottleneck (see Step 3)
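The first two checks above can be sketched as a small filter over `ps` output. This is a minimal sketch, not part of the original playbook: the function name `classify_cpu_hogs` is hypothetical, the 80% default mirrors the threshold in the list, and the input is assumed to have the `PID %CPU COMMAND` shape produced by `ps -eo pid,pcpu,comm --no-headers`.

```shell
#!/bin/sh
# Hypothetical helper: classify_cpu_hogs is an illustrative name, not an existing tool.
# Reads lines of "PID %CPU COMMAND" on stdin and reports whether zero, one,
# or several processes exceed the threshold (default 80, taken from the list above).
classify_cpu_hogs() {
    threshold="${1:-80}"
    awk -v t="$threshold" '
        $2 + 0 > t { hot++; pids = pids " " $1 }
        END {
            if (hot == 0)      print "none"
            else if (hot == 1) print "single:" pids
            else               print "multiple:" pids
        }'
}

# Usage on a live host (procps ps):
#   ps -eo pid,pcpu,comm --no-headers | classify_cpu_hogs 80
```

A `single:` result suggests one runaway process, while `multiple:` suggests a system-wide issue. Note that `ps` can report more than 100% for multi-threaded processes on multi-core hosts.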
Step 2: Identify Process Type
Application process (node, java, python):
# Check application logs
tail -100 /var/log/application.log
# Check for infinite loops, heavy computation
# Check APM for slow endpoints
System process (kernel, systemd):
# Check system logs
dmesg | tail -50
journalctl -xe
# Check for hardware issues
Unknown/suspicious process:
# Check process details
ps -fp <PID>
lsof -p <PID>
# Could be malware (crypto mining)
# See security-incidents.md
Step 3: Check If Disk-Related
# Check iowait
iostat -x 1 5
# If %iowait stays above 20% across samples, the disk is likely the bottleneck
# See infrastructure.md for disk I/O troubleshooting
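The iowait check in this step can be automated roughly as follows. This is a sketch under assumptions: the function names `iowait_from_iostat` and `disk_is_bottleneck` are hypothetical, and the parser assumes the `avg-cpu:` header layout printed by sysstat's `iostat` in the C locale (run it with `LC_ALL=C` so decimals use a dot).

```shell
#!/bin/sh
# Hypothetical helper: iowait_from_iostat is an illustrative name.
# Reads iostat output on stdin, finds the %iowait column by its header name,
# and prints the value from the last avg-cpu report (the most recent sample).
iowait_from_iostat() {
    awk '
        /^avg-cpu:/ {
            # The values line has no "avg-cpu:" label, so shift the index by one.
            for (i = 1; i <= NF; i++) if ($i == "%iowait") col = i - 1
            want = 1
            next
        }
        want { value = $col; want = 0 }
        END { if (value != "") print value }'
}

# Hypothetical helper: succeeds (exit 0) when the given iowait value is above
# the limit (default 20, the threshold used in this step).
disk_is_bottleneck() {
    limit="${2:-20}"
    awk -v v="$1" -v l="$limit" 'BEGIN { exit !(v + 0 > l + 0) }'
}

# Usage on a live host:
#   val=$(LC_ALL=C iostat -x 1 5 | iowait_from_iostat)
#   disk_is_bottleneck "$val" && echo "iowait ${val}%: check disk I/O"
```

The first report from `iostat` covers the time since boot, which is why the sketch keeps only the last one.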
Mitigation
Immediate (Now - 5 min)
Option A: Lower Process Priority
# Reduce CPU priority (a higher nice value means lower priority)
renice -n 10 -p <PID>
# Impact: Process gets less CPU time
# Risk: Low (process still runs, just slower)
Option B: Kill Process (if application)
# Graceful shutdown
kill -TERM <PID>
# Force kill (last resort)
kill -KILL <PID>
# Restart service
systemctl restart <service>
# Impact: Process restarts, CPU normalizes
# Risk: Medium (brief downtime)
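The graceful-then-forced sequence in Option B can be wrapped so that SIGKILL is sent only after a grace period. This is a minimal sketch: the function name `stop_gracefully` and the 10-second default are assumptions, not part of the original playbook. `kill -0` sends no signal; it only checks whether the PID still exists.

```shell
#!/bin/sh
# Hypothetical helper: stop_gracefully is an illustrative name.
# Sends SIGTERM, waits up to <timeout> seconds for the process to exit,
# and only then falls back to SIGKILL.
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was needed,
# 2 if the PID could not be signalled in the first place.
stop_gracefully() {
    pid="$1"
    timeout="${2:-10}"   # assumed default grace period, in seconds

    kill -TERM "$pid" 2>/dev/null || return 2

    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # process is gone
        sleep 1
        waited=$((waited + 1))
    done

    kill -0 "$pid" 2>/dev/null || return 0
    kill -KILL "$pid" 2>/dev/null
    return 1
}

# Usage on a live host (placeholders as in the commands above):
#   stop_gracefully <PID> 15 || echo "SIGKILL was required or PID not found"
#   systemctl restart <service>
```

A return value of 1 is worth recording in the incident notes, since a process that ignores SIGTERM may not have flushed its state.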
Option C: Scale Horizontally (cloud)
# Add more instances to distribute load
# AWS: Auto Scaling Group
# Azure: Scale Set
# Kubernetes: Horizontal Pod Autoscaler
# Impact: Load distributed across instances
# Risk: Low (no downtime)
Short-term (5 min - 1 hour)
Option A: Optimize Code (if application bug)
# Profile application
# Node.js: node --prof
# Java: jstack, jvisualvm
# Python: py-spy
# Identify hot path
# Fix infinite loop, heavy computation
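For Java, one way to find the hot path is to match the busiest thread from `top -H -p <PID>` against a `jstack` dump. Many JDK versions print the native thread ID in the dump as a hexadecimal `nid=0x...` field, while `top` shows it in decimal, so a conversion is needed. The sketch below assumes that hex format (if your JDK prints `nid` in decimal, skip the conversion); the function name `tid_to_nid` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: tid_to_nid is an illustrative name.
# Converts a decimal thread ID (as shown by `top -H`) to the hexadecimal
# form used in the nid= field of many jstack thread dumps.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Usage on a live host (PID and TID are placeholders):
#   jstack <PID> | grep -A 20 "nid=$(tid_to_nid <TID>)"
```

For example, thread ID 12345 corresponds to `nid=0x3039`.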
Option B: Add Caching
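// Cache expensive computation (memoization).
// heavyComputation is a placeholder for the real CPU-bound work.
// Note: this Map grows without bound; cap or expire entries in production.
const cache = new Map();
function expensiveOperation(input) {
  if (cache.has(input)) {
    return cache.get(input);
  }
  const result = heavyComputation(input);
  cache.set(input, result);
  return result;
}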
Option C: Scale Vertically (cloud)
# Resize to larger instance type
# AWS: Change instance type (e.g. t3.large → t3.xlarge, 2 → 4 vCPUs; t3.medium → t3.large adds memory but keeps 2 vCPUs)
# Azure: Resize VM
# Impact: More CPU capacity
# Risk: Medium (brief downtime during resize)
Long-term (1 hour+)
- Add an early-warning CPU alert (>70% for 5 min), below the existing 80% alert
- Optimize application code (reduce computation)
- Use worker threads for heavy tasks (Node.js)
- Implement auto-scaling (cloud)
- Add APM for performance profiling
- Review architecture (async processing, job queues)
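An alert condition such as ">70% for 5 min" fires only when every sample in the window exceeds the threshold, not when a single spike does. A minimal sketch of that evaluation, assuming one CPU sample per minute; the function name `sustained_above` is hypothetical and real monitoring systems express this in their own rule syntax.

```shell
#!/bin/sh
# Hypothetical helper: sustained_above is an illustrative name.
# Usage: sustained_above <threshold> <sample>...
# Succeeds (exit 0) only if at least one sample is given and every sample
# is strictly greater than the threshold.
sustained_above() {
    threshold="$1"
    shift
    [ "$#" -gt 0 ] || return 1
    for sample in "$@"; do
        awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s + 0 > t + 0) }' || return 1
    done
    return 0
}

# Usage: five one-minute samples against the 70% threshold
#   sustained_above 70 75 82 91 77 88 && echo "ALERT: CPU >70% for 5 min"
```

Requiring the whole window to exceed the threshold avoids paging on short bursts such as cron jobs or deploys.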
Escalation
Escalate to developer if:
- Application code causing issue
- Requires code fix or optimization
Escalate to security-agent if:
- Unknown/suspicious process
- Potential malware or crypto mining
Escalate to infrastructure if:
- Hardware issue (kernel errors)
- Cloud infrastructure problem
Related Runbooks
- 03-memory-leak.md - If memory also high
- 04-slow-api-response.md - If API slow due to CPU
- ../modules/infrastructure.md - Infrastructure diagnostics
Post-Incident
After resolving:
- Create post-mortem (if SEV1/SEV2)
- Identify root cause
- Add monitoring/alerting
- Update this runbook if needed
- Add regression test (if code bug)