Files
gh-anton-abyzov-specweave-p…/agents/sre/playbooks/03-memory-leak.md
2025-11-29 17:56:41 +08:00

5.3 KiB

Playbook: Memory Leak

Symptoms

  • Memory usage increasing continuously over time
  • Application crashes with OutOfMemoryError (Java) or "JavaScript heap out of memory" (Node.js)
  • Performance degrades over time
  • High swap usage
  • Monitoring alert: "Memory usage >90%"

Severity

  • SEV2 if memory increasing but not yet critical
  • SEV1 if application crashed or unresponsive

Diagnosis

Step 1: Confirm Memory Leak

# Monitor memory over time (5 minute intervals)
watch -n 300 'ps aux | grep <process> | awk "{print \$4, \$5, \$6}"'

# Check if memory continuously increasing
# Leak: 20% → 30% → 40% → 50% (linear growth)
# Normal: 30% → 32% → 31% → 30% (stable)

Step 2: Get Memory Snapshot

Java (Heap Dump):

# Get heap dump
jmap -dump:format=b,file=heap.bin <PID>

# Analyze with jhat or VisualVM
jhat heap.bin
# Open http://localhost:7000

# Or use Eclipse Memory Analyzer

Node.js (Heap Snapshot):

# Start with --inspect
node --inspect index.js

# Chrome DevTools → Memory → Take heap snapshot

# Or use heapdump module
const heapdump = require('heapdump');
heapdump.writeSnapshot('/tmp/heap-' + Date.now() + '.heapsnapshot');

Python (Memory Profiler):

# Install memory_profiler
pip install memory_profiler

# Profile function
python -m memory_profiler script.py

Step 3: Identify Leak Source

Look for:

  • Large arrays/objects growing over time
  • Detached DOM nodes (if browser/UI)
  • Event listeners not removed
  • Timers/intervals not cleared
  • Closures holding references
  • Cache without eviction policy

Common patterns:

// 1. Global cache growing forever
global.cache = {}; // Never cleared

// 2. Event listeners not removed
emitter.on('event', handler); // Never removed

// 3. Timers not cleared
setInterval(() => { /* ... */ }, 1000); // Never cleared

// 4. Closures
function createHandler() {
  const largeData = new Array(1000000);
  return () => {
    // Closure keeps largeData in memory
  };
}

Mitigation

Immediate (Now - 5 min)

Option A: Restart Application

# Restart to free memory
systemctl restart application

# Impact: Memory usage returns to baseline
# Risk: Low (brief downtime)
# NOTE: This is temporary, leak will recur!

Option B: Increase Memory Limit (temporary)

# Java
java -Xmx4G -jar application.jar  # Was 2G

# Node.js
node --max-old-space-size=4096 index.js  # Was 2048

# Impact: Buys time to find root cause
# Risk: Low (but doesn't fix leak)

Option C: Scale Horizontally (cloud)

# Add more instances
# Use load balancer to rotate traffic
# Restart instances on schedule (e.g., every 6 hours)

# Impact: Distributes load, restarts prevent OOM
# Risk: Low (but doesn't fix root cause)

Short-term (5 min - 1 hour)

Analyze heap dump and identify leak source

Common Fixes:

1. Add LRU Cache

// BAD: Unbounded cache
const cache = {};

// GOOD: LRU cache with size limit
const LRU = require('lru-cache');
const cache = new LRU({ max: 1000 });

2. Remove Event Listeners

// Add listener
const handler = () => { /* ... */ };
emitter.on('event', handler);

// CRITICAL: Remove later
emitter.off('event', handler);

// React/Vue: cleanup in componentWillUnmount/onUnmounted

3. Clear Timers

// Set timer
const intervalId = setInterval(() => { /* ... */ }, 1000);

// CRITICAL: Clear later
clearInterval(intervalId);

// React: cleanup in useEffect return
useEffect(() => {
  const id = setInterval(() => { /* ... */ }, 1000);
  return () => clearInterval(id);
}, []);

4. Close Connections

// BAD: Connection leak
const conn = await db.connect();
await conn.query(/* ... */);
// Connection never closed!

// GOOD: Always close
const conn = await db.connect();
try {
  await conn.query(/* ... */);
} finally {
  await conn.close(); // CRITICAL
}

Long-term (1 hour+)

  • Add memory monitoring (alert if >80% and increasing)
  • Add memory profiling to CI/CD (detect leaks early)
  • Use WeakMap for caches (auto garbage collected)
  • Review closure usage (avoid holding large data)
  • Add automated restart (every N hours, if leak can't be fixed immediately)
  • Load test to reproduce leak in test environment
  • Fix root cause in code

Escalation

Escalate to developer if:

  • Application code causing leak
  • Requires code fix

Escalate to platform team if:

  • Platform/framework bug
  • Requires upgrade or workaround

Prevention Checklist

  • Use LRU cache (not unbounded)
  • Remove event listeners in cleanup
  • Clear timers/intervals
  • Close database connections (use finally)
  • Avoid closures holding large data
  • Use WeakMap for temporary caches
  • Profile memory in development
  • Load test before production


Post-Incident

After resolving:

  • Create post-mortem
  • Identify leak source from heap dump
  • Fix code
  • Add regression test (memory profiling)
  • Add monitoring alert
  • Update this runbook if needed