Initial commit

2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions
--- a/agents/sre/playbooks/01-high-cpu-usage.md
+++ b/agents/sre/playbooks/01-high-cpu-usage.md
@@ -0,0 +1,204 @@
+# Playbook: High CPU Usage
+
+## Symptoms
+
+- CPU usage at 80-100%
+- Applications slow or unresponsive
+- Server lag, SSH slow
+- Monitoring alert: "CPU usage >80% for 5 minutes"
+
+## Severity
+
+- **SEV2** if application degraded but functional
+- **SEV1** if application unresponsive
+
+## Diagnosis
+
+### Step 1: Identify Top CPU Process
+
+```bash
+# Current CPU usage
+top -bn1 | head -20
+
+# Top CPU processes (note: ps reports %CPU averaged over each process's lifetime, not the current instant)
+ps aux | sort -nrk 3,3 | head -10
+
+# CPU per thread
+top -H -p <PID>
+```
+
+**What to look for**:
+- Single process using >80% CPU
+- Multiple processes all high (system-wide issue)
+- User CPU vs. system CPU time; high iowait instead points to a disk bottleneck (see Step 3)
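+
+On Linux, the user/system/iowait split can also be computed from two samples of the aggregate `cpu` line in `/proc/stat`. The following is a minimal sketch, not part of the original playbook tooling: the function name `cpu_breakdown` is hypothetical, and it assumes the usual field order `user nice system idle iowait irq softirq steal`.
+
+```bash
+# cpu_breakdown LINE_A LINE_B
+# LINE_A and LINE_B are two "cpu ..." lines from /proc/stat, taken some time apart.
+# Prints integer percentages of the elapsed CPU time spent in user, system, and iowait.
+cpu_breakdown() {
+  awk -v a="$1" -v b="$2" 'BEGIN {
+    split(a, x, " "); split(b, y, " ")
+    total = 0
+    # Fields 2..9: user nice system idle iowait irq softirq steal
+    for (i = 2; i <= 9; i++) { d[i] = y[i] - x[i]; total += d[i] }
+    if (total <= 0) { print "no-delta"; exit 1 }
+    printf "user=%d system=%d iowait=%d\n", (d[2] + d[3]) * 100 / total, (d[4] + d[7] + d[8]) * 100 / total, d[6] * 100 / total
+  }'
+}
+
+# Example (Linux only): sample one second apart
+# a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
+# cpu_breakdown "$a" "$b"
+```
+
+Here nice time is counted as user time, and irq/softirq time as system time; adjust the grouping if your monitoring reports these separately.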
+
+---
+
+### Step 2: Identify Process Type
+
+**Application process** (node, java, python):
+```bash
+# Check application logs
+tail -100 /var/log/application.log
+
+# Check for infinite loops, heavy computation
+# Check APM for slow endpoints
+```
+
+**System process** (kernel, systemd):
+```bash
+# Check system logs
+dmesg | tail -50
+journalctl -xe
+
+# Check for hardware issues
+```
+
+**Unknown/suspicious process**:
+```bash
+# Check process details
+ps -fp <PID>
+lsof -p <PID>
+
+# Could be malware (crypto mining)
+# See security-incidents.md
+```
+
+---
+
+### Step 3: Check If Disk-Related
+
+```bash
+# Check iowait
+iostat -x 1 5
+
+# If %iowait (in the avg-cpu section) stays above ~20%, disk is the likely bottleneck
+# See infrastructure.md for disk I/O troubleshooting
+```
+
+---
+
+## Mitigation
+
+### Immediate (Now - 5 min)
+
+**Option A: Lower Process Priority**
+```bash
+# Reduce CPU priority
+renice +10 <PID>
+
+# Impact: Process gets less CPU time
+# Risk: Low (process still runs, just slower)
+```
+
+**Option B: Kill Process** (if application)
+```bash
+# Graceful shutdown
+kill -TERM <PID>
+
+# Force kill (last resort)
+kill -KILL <PID>
+
+# Restart service
+systemctl restart <service>
+
+# Impact: Process restarts, CPU normalizes
+# Risk: Medium (brief downtime)
+```
+
+**Option C: Scale Horizontally** (cloud)
+```bash
+# Add more instances to distribute load
+# AWS: Auto Scaling Group
+# Azure: Scale Set
+# Kubernetes: Horizontal Pod Autoscaler
+
+# Impact: Load distributed across instances
+# Risk: Low (no downtime)
+```
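+
+When scaling out by hand, a target instance count can be estimated with the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents: desired = ceil(current × current utilization / target utilization). The sketch below is illustrative only; the function name `desired_replicas` is hypothetical, and it assumes integer CPU percentages.
+
+```bash
+# desired_replicas CURRENT_REPLICAS CURRENT_UTIL_PERCENT TARGET_UTIL_PERCENT
+# Prints the replica count needed to bring average utilization down to the target.
+desired_replicas() {
+  local current=$1 util=$2 target=$3
+  # Integer ceiling division: ceil(current * util / target)
+  echo $(( (current * util + target - 1) / target ))
+}
+
+# Example: 4 replicas at 90% CPU, aiming for 60%
+# kubectl scale deployment <name> --replicas="$(desired_replicas 4 90 60)"
+```
+
+This assumes load spreads evenly across instances; a single hot process or an infinite loop will not be helped by adding replicas.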
+
+---
+
+### Short-term (5 min - 1 hour)
+
+**Option A: Optimize Code** (if application bug)
+```bash
+# Profile application
+# Node.js: node --prof
+# Java: jstack, jvisualvm
+# Python: py-spy
+
+# Identify hot path
+# Fix infinite loop, heavy computation
+```
+
+**Option B: Add Caching**
+```javascript
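+// Cache results of an expensive, pure computation (memoization).
+// Note: this Map grows without bound; cap its size or use an LRU cache in production.
+const cache = new Map();
+
+// Placeholder for the real CPU-heavy work (hypothetical example).
+function heavyComputation(input) {
+  return input * input;
+}
+
+function expensiveOperation(input) {
+  if (cache.has(input)) {
+    return cache.get(input);
+  }
+
+  const result = heavyComputation(input);
+  cache.set(input, result);
+  return result;
+}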
+```
+
+**Option C: Scale Vertically** (cloud)
+```bash
+# Resize to larger instance type
+# AWS: Change instance type (t3.medium → t3.large)
+# Azure: Resize VM
+# Impact: More CPU capacity
+# Risk: Medium (brief downtime during resize)
+```
+
+---
+
+### Long-term (1 hour+)
+
+- [ ] Add an early-warning CPU alert (>70% for 5 min), below the 80% alert threshold
+- [ ] Optimize application code (reduce computation)
+- [ ] Use worker threads for heavy tasks (Node.js)
+- [ ] Implement auto-scaling (cloud)
+- [ ] Add APM for performance profiling
+- [ ] Review architecture (async processing, job queues)
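+
+The "(>70% for 5 min)" alert condition is a sustained-threshold check: it fires only when every sample in the window exceeds the threshold, so short spikes do not page anyone. A minimal sketch of that logic follows; the function name `sustained_above` is hypothetical, samples are assumed to be integer percentages, and a real monitoring system would normally express this declaratively (for example, with a `for:` clause in a Prometheus alerting rule).
+
+```bash
+# sustained_above THRESHOLD SAMPLE...
+# Succeeds (exit 0) only if every sample is strictly above THRESHOLD.
+sustained_above() {
+  local threshold=$1
+  shift
+  [ "$#" -gt 0 ] || return 1   # no samples: do not alert
+  local s
+  for s in "$@"; do
+    [ "$s" -gt "$threshold" ] || return 1
+  done
+  return 0
+}
+
+# Example: five one-minute CPU samples
+# sustained_above 70 82 91 77 85 74 && echo "ALERT: CPU >70% for 5 min"
+```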
+
+---
+
+## Escalation
+
+**Escalate to developer if**:
+- Application code causing issue
+- Requires code fix or optimization
+
+**Escalate to security-agent if**:
+- Unknown/suspicious process
+- Potential malware or crypto mining
+
+**Escalate to infrastructure if**:
+- Hardware issue (kernel errors)
+- Cloud infrastructure problem
+
+---
+
+## Related Runbooks
+
+- [03-memory-leak.md](03-memory-leak.md) - If memory also high
+- [04-slow-api-response.md](04-slow-api-response.md) - If API slow due to CPU
+- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure diagnostics
+
+---
+
+## Post-Incident
+
+After resolving:
+- [ ] Create post-mortem (if SEV1/SEV2)
+- [ ] Identify root cause
+- [ ] Add monitoring/alerting
+- [ ] Update this playbook if needed
+- [ ] Add regression test (if code bug)