Playbook: High CPU Usage

Symptoms

  • CPU usage at 80-100%
  • Applications slow or unresponsive
  • Server lag, SSH slow
  • Monitoring alert: "CPU usage >80% for 5 minutes"

Severity

  • SEV2 if application degraded but functional
  • SEV1 if application unresponsive

Diagnosis

Step 1: Identify Top CPU Process

# Current CPU usage
top -bn1 | head -20

# Top CPU processes
ps aux | sort -nrk 3,3 | head -10

# CPU per thread
top -H -p <PID>

What to look for:

  • Single process using >80% CPU
  • Multiple processes all high (system-wide issue)
  • System CPU vs user CPU split; high iowait points to a disk bottleneck, not CPU (see Step 3)
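
To see that split per core, mpstat is a quick check (a sketch; assumes the sysstat package is installed):

# Per-CPU breakdown of %usr, %sys, and %iowait: 5 one-second samples
mpstat -P ALL 1 5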

Step 2: Identify Process Type

Application process (node, java, python):

# Check application logs
tail -100 /var/log/application.log

# Check for infinite loops, heavy computation
# Check APM for slow endpoints
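
If the process is Java, one sketch for tying the hottest thread from top -H back to code (assumes jstack is available on the host):

# top -H shows per-thread CPU; note the TID of the busiest thread and convert it to hex
printf '%x\n' <TID>

# jstack reports native thread IDs as nid=0x<hex>; pull that thread's stack
jstack <PID> | grep -A 20 'nid=0x<hex-tid>'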

System process (kernel, systemd):

# Check system logs
dmesg | tail -50
journalctl -xe

# Check for hardware issues
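
One sketch for the hardware check, using kernel messages (exact wording varies by kernel and driver):

# Look for machine-check or thermal-throttling events
journalctl -k | grep -iE 'mce|machine check|thermal|throttl'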

Unknown/suspicious process:

# Check process details
ps -fp <PID>
lsof -p <PID>

# Could be malware (crypto mining)
# See security-incidents.md
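
Two additional checks worth running (a sketch; assumes root or the owning user's privileges):

# Executable path of the suspect PID (miners often run from /tmp or /dev/shm)
ls -l /proc/<PID>/exe

# Open TCP connections owned by that PID
ss -tnp | grep <PID>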

Step 3: Check I/O Wait

# Check iowait
iostat -x 1 5

# If iowait >20%, the disk is the bottleneck, not CPU
# See infrastructure.md for disk I/O troubleshooting

Mitigation

Immediate (Now - 5 min)

Option A: Lower Process Priority

# Reduce CPU priority
renice +10 <PID>

# Impact: Process gets less CPU time
# Risk: Low (process still runs, just slower)

Option B: Kill Process (if application)

# Graceful shutdown
kill -TERM <PID>

# Force kill (last resort)
kill -KILL <PID>

# Restart service
systemctl restart <service>

# Impact: Process restarts, CPU normalizes
# Risk: Medium (brief downtime)

Option C: Scale Horizontally (cloud)

# Add more instances to distribute load
# AWS: Auto Scaling Group
# Azure: Scale Set
# Kubernetes: Horizontal Pod Autoscaler

# Impact: Load distributed across instances
# Risk: Low (no downtime)
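
For the Kubernetes case, a minimal sketch (assumes a Deployment named web and a working metrics-server; names are examples):

# Scale between 3 and 10 pods, targeting ~70% average CPU
kubectl autoscale deployment web --cpu-percent=70 --min=3 --max=10

# Watch current vs target utilization
kubectl get hpa web --watch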

Short-term (5 min - 1 hour)

Option A: Optimize Code (if application bug)

# Profile application
# Node.js: node --prof
# Java: jstack, jvisualvm
# Python: py-spy

# Identify hot path
# Fix infinite loop, heavy computation

Option B: Add Caching

// Cache expensive computation
const cache = new Map();

function expensiveOperation(input) {
  if (cache.has(input)) {
    return cache.get(input);
  }

  const result = computeResult(input); // placeholder for the actual heavy computation
  cache.set(input, result);
  return result;
}
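
Note: a plain Map grows without bound; for a long-running process, add an LRU or TTL eviction policy so memory does not become the next bottleneck.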

Option C: Scale Vertically (cloud)

# Resize to larger instance type
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM
# Impact: More CPU capacity
# Risk: Medium (brief downtime during resize)
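
A hedged AWS CLI sketch of the resize flow (instance ID and target type are examples; the stop/start is the downtime window):

# Stop the instance, wait, change the type, start it again
aws ec2 stop-instances --instance-ids <instance-id>
aws ec2 wait instance-stopped --instance-ids <instance-id>
aws ec2 modify-instance-attribute --instance-id <instance-id> --instance-type Value=t3.large
aws ec2 start-instances --instance-ids <instance-id>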

Long-term (1 hour+)

  • Add CPU monitoring alert (>70% for 5 min)
  • Optimize application code (reduce computation)
  • Use worker threads for heavy tasks (Node.js)
  • Implement auto-scaling (cloud)
  • Add APM for performance profiling
  • Review architecture (async processing, job queues)

Escalation

Escalate to developer if:

  • Application code causing issue
  • Requires code fix or optimization

Escalate to security-agent if:

  • Unknown/suspicious process
  • Potential malware or crypto mining

Escalate to infrastructure if:

  • Hardware issue (kernel errors)
  • Cloud infrastructure problem


Post-Incident

After resolving:

  • Create post-mortem (if SEV1/SEV2)
  • Identify root cause
  • Add monitoring/alerting
  • Update this runbook if needed
  • Add regression test (if code bug)