Initial commit

2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions
--- a/agents/sre/playbooks/03-memory-leak.md
+++ b/agents/sre/playbooks/03-memory-leak.md
@@ -0,0 +1,252 @@
+# Playbook: Memory Leak
+
+## Symptoms
+
+- Memory usage increasing continuously over time
+- Application crashes with OutOfMemoryError (Java) or "JavaScript heap out of memory" (Node.js)
+- Performance degrades over time
+- High swap usage
+- Monitoring alert: "Memory usage >90%"
+
+## Severity
+
+- **SEV2** if memory increasing but not yet critical
+- **SEV1** if application crashed or unresponsive
+
+## Diagnosis
+
+### Step 1: Confirm Memory Leak
+
+```bash
+# Sample memory every 5 minutes: %MEM, virtual size (VSZ, KiB), resident size (RSS, KiB)
+# Querying by PID avoids matching the grep process itself
+watch -n 300 'ps -o pid,pmem,vsz,rss,comm -p <PID>'
+
+# Check whether memory keeps climbing across samples
+# Leak: 20% → 30% → 40% → 50% (steady growth, never returns to baseline)
+# Normal: 30% → 32% → 31% → 30% (stable)
+```
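+
+The leak-versus-normal comparison in the comments above can also be checked in code. The sketch below is only an illustration: `isLikelyLeak`, `sampleRssMb`, and the 0.8 threshold are hypothetical names and values, not part of any tool used in this playbook.
+
+```javascript
+// Hypothetical helper: treats a series of memory readings as a likely leak
+// when memory grew overall and most consecutive steps went up.
+function isLikelyLeak(samples, minRisingRatio = 0.8) {
+  if (samples.length < 3) return false; // too few points to judge a trend
+  let rising = 0;
+  for (let i = 1; i < samples.length; i++) {
+    if (samples[i] > samples[i - 1]) rising++;
+  }
+  const risingRatio = rising / (samples.length - 1);
+  const netGrowth = samples[samples.length - 1] - samples[0];
+  return netGrowth > 0 && risingRatio >= minRisingRatio;
+}
+
+// Resident set size of the current Node.js process, in MB
+function sampleRssMb() {
+  return process.memoryUsage().rss / (1024 * 1024);
+}
+
+console.log(isLikelyLeak([20, 30, 40, 50])); // true
+console.log(isLikelyLeak([30, 32, 31, 30])); // false
+```
+
+In practice the samples would come from calling `sampleRssMb()` on a timer or from the monitoring system, and the window should be long enough to span normal garbage collection cycles.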
+
+---
+
+### Step 2: Get Memory Snapshot
+
+**Java (Heap Dump)**:
+```bash
+# Get heap dump
+jmap -dump:format=b,file=heap.bin <PID>
+
+# Analyze with VisualVM or Eclipse Memory Analyzer (MAT)
+
+# JDK 8 and earlier only (jhat was removed in JDK 9)
+jhat heap.bin
+# Open http://localhost:7000
+```
+
+**Node.js (Heap Snapshot)**:
+```bash
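+# Start with the inspector enabled
+node --inspect index.js
+
+# Chrome DevTools (chrome://inspect) → Memory → Take heap snapshot
+
+# Or write a snapshot when the process receives a signal (Node.js 12+)
+node --heapsnapshot-signal=SIGUSR2 index.js
+kill -USR2 <PID>
+
+# Or trigger a snapshot from application code (JavaScript, not shell):
+#   const heapdump = require('heapdump');
+#   heapdump.writeSnapshot('/tmp/heap-' + Date.now() + '.heapsnapshot');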
+```
+
+**Python (Memory Profiler)**:
+```bash
+# Install memory_profiler
+pip install memory_profiler
+
+# Profile functions decorated with @profile (prints line-by-line memory usage)
+python -m memory_profiler script.py
+```
+
+---
+
+### Step 3: Identify Leak Source
+
+**Look for**:
+- Large arrays/objects growing over time
+- Detached DOM nodes (if browser/UI)
+- Event listeners not removed
+- Timers/intervals not cleared
+- Closures holding references
+- Cache without eviction policy
+
+**Common patterns**:
+```javascript
+// 1. Global cache growing forever
+global.cache = {}; // Never cleared
+
+// 2. Event listeners not removed
+emitter.on('event', handler); // Never removed
+
+// 3. Timers not cleared
+setInterval(() => { /* ... */ }, 1000); // Never cleared
+
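+// 4. Closures
+function createHandler() {
+  const largeData = new Array(1000000);
+  return () => {
+    // The closure references largeData, so it stays in memory
+    // for as long as the returned handler is reachable
+    return largeData.length;
+  };
+}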
+```
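+
+Pattern 2 above can be confirmed at runtime by watching the listener count on a long-lived emitter. This is a minimal sketch using Node's built-in `events` module; `leakyRequest` and `cleanRequest` are hypothetical stand-ins for per-request application code.
+
+```javascript
+const { EventEmitter } = require('events');
+
+const emitter = new EventEmitter();
+
+// Leaky: registers a new listener on every call and never removes it
+function leakyRequest() {
+  emitter.on('event', () => {});
+}
+
+// Clean: removes its own listener once the work is done
+function cleanRequest() {
+  const handler = () => {};
+  emitter.on('event', handler);
+  emitter.off('event', handler);
+}
+
+for (let i = 0; i < 5; i++) leakyRequest();
+const afterLeaky = emitter.listenerCount('event');
+
+for (let i = 0; i < 5; i++) cleanRequest();
+const afterClean = emitter.listenerCount('event');
+
+console.log(afterLeaky, afterClean); // 5 5
+```
+
+A listener count that grows with request volume points to this pattern. Node.js also prints a `MaxListenersExceededWarning` once more than 10 listeners are attached to one event by default.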
+
+---
+
+## Mitigation
+
+### Immediate (Now - 5 min)
+
+**Option A: Restart Application**
+```bash
+# Restart to free memory
+systemctl restart application
+
+# Impact: Memory usage returns to baseline
+# Risk: Low (brief downtime)
+# NOTE: This is temporary, leak will recur!
+```
+
+**Option B: Increase Memory Limit** (temporary)
+```bash
+# Java
+java -Xmx4G -jar application.jar  # Was 2G
+
+# Node.js
+node --max-old-space-size=4096 index.js  # Was 2048
+
+# Impact: Buys time to find root cause
+# Risk: Low (but doesn't fix leak)
+```
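+
+After raising the Node.js limit, it can help to confirm from inside the process that the new value took effect. The following sketch uses the built-in `v8` module; the helper names `heapLimitMb` and `heapUsedRatio` are hypothetical.
+
+```javascript
+const v8 = require('v8');
+
+// Effective V8 heap limit in MB (reflects --max-old-space-size)
+function heapLimitMb() {
+  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
+}
+
+// Fraction of the heap limit currently in use (0 to 1)
+function heapUsedRatio() {
+  const stats = v8.getHeapStatistics();
+  return stats.used_heap_size / stats.heap_size_limit;
+}
+
+console.log(`heap limit: ${heapLimitMb()} MB`);
+console.log(`heap used: ${(heapUsedRatio() * 100).toFixed(1)}%`);
+```
+
+Logging these two values at startup and periodically afterwards shows how much headroom the larger limit actually buys.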
+
+**Option C: Scale Horizontally** (cloud)
+```bash
+# Add more instances
+# Use load balancer to rotate traffic
+# Restart instances on schedule (e.g., every 6 hours)
+
+# Impact: Distributes load, restarts prevent OOM
+# Risk: Low (but doesn't fix root cause)
+```
+
+---
+
+### Short-term (5 min - 1 hour)
+
+**Analyze heap dump** and identify leak source
+
+**Common Fixes**:
+
+**1. Add LRU Cache**
+```javascript
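+// BAD: Unbounded cache
+const unboundedCache = {};
+
+// GOOD: LRU cache with size limit
+// lru-cache v10+ uses a named export; older versions: const LRU = require('lru-cache')
+const { LRUCache } = require('lru-cache');
+const cache = new LRUCache({ max: 1000 });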
+```
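+
+To show what a size limit does, here is a dependency-free sketch of LRU eviction built on `Map`, which iterates keys in insertion order. `SimpleLRU` is a hypothetical illustration, not the `lru-cache` implementation; for production use the package above is the better choice.
+
+```javascript
+// Minimal LRU sketch: the first key in the Map is always the least recently used
+class SimpleLRU {
+  constructor(max) {
+    this.max = max;
+    this.map = new Map();
+  }
+
+  get(key) {
+    if (!this.map.has(key)) return undefined;
+    const value = this.map.get(key);
+    // Re-insert so the key moves to the most recently used position
+    this.map.delete(key);
+    this.map.set(key, value);
+    return value;
+  }
+
+  set(key, value) {
+    if (this.map.has(key)) this.map.delete(key);
+    this.map.set(key, value);
+    if (this.map.size > this.max) {
+      // Evict the least recently used entry
+      const oldest = this.map.keys().next().value;
+      this.map.delete(oldest);
+    }
+  }
+
+  get size() {
+    return this.map.size;
+  }
+}
+
+const lru = new SimpleLRU(2);
+lru.set('a', 1);
+lru.set('b', 2);
+lru.get('a');    // 'a' becomes most recently used
+lru.set('c', 3); // over capacity: evicts 'b'
+console.log(lru.get('b')); // undefined
+console.log(lru.size);     // 2
+```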
+
+**2. Remove Event Listeners**
+```javascript
+// Add listener
+const handler = () => { /* ... */ };
+emitter.on('event', handler);
+
+// CRITICAL: Remove later
+emitter.off('event', handler);
+
+// React/Vue: cleanup in componentWillUnmount/onUnmounted
+```
+
+**3. Clear Timers**
+```javascript
+// Set timer
+const intervalId = setInterval(() => { /* ... */ }, 1000);
+
+// CRITICAL: Clear later
+clearInterval(intervalId);
+
+// React: cleanup in useEffect return
+useEffect(() => {
+  const id = setInterval(() => { /* ... */ }, 1000);
+  return () => clearInterval(id);
+}, []);
+```
+
+**4. Close Connections**
+```javascript
+// BAD: Connection leak
+const leakedConn = await db.connect();
+await leakedConn.query(/* ... */);
+// Connection never closed!
+
+// GOOD: Always close
+const conn = await db.connect();
+try {
+  await conn.query(/* ... */);
+} finally {
+  await conn.close(); // CRITICAL: runs even if the query throws
+}
+```
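+
+The try/finally pattern above can be wrapped once so callers cannot forget to close. This is a sketch under assumptions: `withConnection` is a hypothetical helper, and `fakeDb` is a stand-in client for illustration only, not a real driver API.
+
+```javascript
+// Hypothetical helper: `db.connect()` must resolve to an object with close()
+async function withConnection(db, work) {
+  const conn = await db.connect();
+  try {
+    return await work(conn);
+  } finally {
+    await conn.close(); // runs on success and on error
+  }
+}
+
+// Stand-in client that counts open connections (illustration only)
+const fakeDb = {
+  open: 0,
+  async connect() {
+    fakeDb.open++;
+    return {
+      async query(sql) {
+        if (sql === 'bad') throw new Error('query failed');
+        return [sql];
+      },
+      async close() {
+        fakeDb.open--;
+      },
+    };
+  },
+};
+
+async function demo() {
+  await withConnection(fakeDb, (conn) => conn.query('SELECT 1'));
+  // The failing query still releases its connection
+  await withConnection(fakeDb, (conn) => conn.query('bad')).catch(() => {});
+  return fakeDb.open;
+}
+
+demo().then((open) => console.log('open connections:', open)); // open connections: 0
+```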
+
+---
+
+### Long-term (1 hour+)
+
+- [ ] Add memory monitoring (alert if >80% and increasing)
+- [ ] Add memory profiling to CI/CD (detect leaks early)
+- [ ] Use WeakMap for caches keyed by objects (an entry becomes collectable once its key is unreachable)
+- [ ] Review closure usage (avoid holding large data)
+- [ ] Add automated restart (every N hours, if leak can't be fixed immediately)
+- [ ] Load test to reproduce leak in test environment
+- [ ] Fix root cause in code
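+
+The WeakMap item in the list above applies only when cache keys are objects. A minimal sketch, assuming per-object metadata is what gets cached; `metadataCache` and `getMetadata` are hypothetical names.
+
+```javascript
+// Entries do not keep their key alive: once the key object is unreachable,
+// the garbage collector may drop the entry too.
+const metadataCache = new WeakMap();
+let computeCalls = 0;
+
+function getMetadata(obj) {
+  let meta = metadataCache.get(obj);
+  if (meta === undefined) {
+    computeCalls++; // count cache misses for the demo
+    meta = { keyCount: Object.keys(obj).length };
+    metadataCache.set(obj, meta);
+  }
+  return meta;
+}
+
+let session = { user: 'alice', role: 'admin' };
+getMetadata(session);
+getMetadata(session); // served from the cache
+console.log(computeCalls); // 1
+
+// Dropping the last reference makes the cache entry eligible for collection
+session = null;
+```
+
+WeakMap keys cannot be strings or numbers, so caches keyed by IDs or URLs still need a bounded structure such as an LRU cache.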
+
+---
+
+## Escalation
+
+**Escalate to developer if**:
+- Application code causing leak
+- Requires code fix
+
+**Escalate to platform team if**:
+- Platform/framework bug
+- Requires upgrade or workaround
+
+---
+
+## Prevention Checklist
+
+- [ ] Use LRU cache (not unbounded)
+- [ ] Remove event listeners in cleanup
+- [ ] Clear timers/intervals
+- [ ] Close database connections (use `finally`)
+- [ ] Avoid closures holding large data
+- [ ] Use WeakMap for temporary caches keyed by objects
+- [ ] Profile memory in development
+- [ ] Load test before production
+
+---
+
+## Related Runbooks
+
+- [01-high-cpu-usage.md](01-high-cpu-usage.md) - If CPU also high
+- [07-service-down.md](07-service-down.md) - If OOM crashed service
+- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting
+
+---
+
+## Post-Incident
+
+After resolving:
+- [ ] Create post-mortem
+- [ ] Identify leak source from heap dump
+- [ ] Fix code
+- [ ] Add regression test (memory profiling)
+- [ ] Add monitoring alert
+- [ ] Update this runbook if needed
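+
+A memory regression test from the checklist above might look like the following rough sketch. `measureHeapGrowth` is a hypothetical helper, and the numbers are noisy unless the process runs with `node --expose-gc`, so any threshold should be generous.
+
+```javascript
+// Hypothetical helper: heap growth in bytes across repeated calls to fn
+function measureHeapGrowth(fn, iterations = 50) {
+  if (global.gc) global.gc(); // only defined under `node --expose-gc`
+  const before = process.memoryUsage().heapUsed;
+  for (let i = 0; i < iterations; i++) fn();
+  if (global.gc) global.gc();
+  return process.memoryUsage().heapUsed - before;
+}
+
+// Deliberately leaky function: every call retains a large array
+const retained = [];
+const leaky = () => retained.push(new Array(100000).fill(0));
+
+const growth = measureHeapGrowth(leaky);
+console.log(`heap growth: ${(growth / (1024 * 1024)).toFixed(1)} MB`);
+```
+
+A CI check could fail the build when the growth for a fixed workload exceeds a budget chosen from a known-good baseline.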