Initial commit
agents/sre/playbooks/01-high-cpu-usage.md (new file, 204 lines)

# Playbook: High CPU Usage

## Symptoms

- CPU usage at 80-100%
- Applications slow or unresponsive
- Server lag, slow SSH sessions
- Monitoring alert: "CPU usage >80% for 5 minutes"

## Severity

- **SEV2** if application degraded but functional
- **SEV1** if application unresponsive

## Diagnosis

### Step 1: Identify Top CPU Process

```bash
# Current CPU usage
top -bn1 | head -20

# Top CPU processes
ps aux | sort -nrk 3,3 | head -10

# CPU per thread
top -H -p <PID>
```

**What to look for**:
- A single process using >80% CPU
- Multiple processes all high (system-wide issue)
- System CPU vs. user CPU (high iowait points to a disk issue)

---

### Step 2: Identify Process Type

**Application process** (node, java, python):
```bash
# Check application logs
tail -100 /var/log/application.log

# Check for infinite loops, heavy computation
# Check APM for slow endpoints
```

**System process** (kernel, systemd):
```bash
# Check system logs
dmesg | tail -50
journalctl -xe

# Check for hardware issues
```

**Unknown/suspicious process**:
```bash
# Check process details
ps aux | grep <PID>
lsof -p <PID>

# Could be malware (e.g. crypto mining)
# See security-incidents.md
```

---

### Step 3: Check If Disk-Related

```bash
# Check iowait
iostat -x 1 5

# If iowait >20%, the disk is the bottleneck
# See infrastructure.md for disk I/O troubleshooting
```

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Lower Process Priority**
```bash
# Reduce CPU priority
renice +10 <PID>

# Impact: Process gets less CPU time
# Risk: Low (process still runs, just slower)
```

**Option B: Kill Process** (if application)
```bash
# Graceful shutdown
kill -TERM <PID>

# Force kill (last resort)
kill -KILL <PID>

# Restart service
systemctl restart <service>

# Impact: Process restarts, CPU normalizes
# Risk: Medium (brief downtime)
```

**Option C: Scale Horizontally** (cloud)
```bash
# Add more instances to distribute load
# AWS: Auto Scaling Group
# Azure: Scale Set
# Kubernetes: Horizontal Pod Autoscaler

# Impact: Load distributed across instances
# Risk: Low (no downtime)
```

---

### Short-term (5 min - 1 hour)

**Option A: Optimize Code** (if application bug)
```bash
# Profile the application
# Node.js: node --prof
# Java: jstack, jvisualvm
# Python: py-spy

# Identify the hot path
# Fix the infinite loop or heavy computation
```

**Option B: Add Caching**
```javascript
// Cache expensive computation
const cache = new Map();

function expensiveOperation(input) {
  if (cache.has(input)) {
    return cache.get(input);
  }

  const result = computeHeavy(input); // the heavy computation itself
  cache.set(input, result);
  return result;
}
```

**Option C: Scale Vertically** (cloud)
```bash
# Resize to a larger instance type
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM

# Impact: More CPU capacity
# Risk: Medium (brief downtime during resize)
```

---

### Long-term (1 hour+)

- [ ] Add CPU monitoring alert (>70% for 5 min)
- [ ] Optimize application code (reduce computation)
- [ ] Use worker threads for heavy tasks (Node.js)
- [ ] Implement auto-scaling (cloud)
- [ ] Add APM for performance profiling
- [ ] Review architecture (async processing, job queues)

---

## Escalation

**Escalate to developer if**:
- Application code is causing the issue
- A code fix or optimization is required

**Escalate to security-agent if**:
- Unknown/suspicious process
- Potential malware or crypto mining

**Escalate to infrastructure if**:
- Hardware issue (kernel errors)
- Cloud infrastructure problem

---

## Related Runbooks

- [03-memory-leak.md](03-memory-leak.md) - If memory also high
- [04-slow-api-response.md](04-slow-api-response.md) - If API slow due to CPU
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure diagnostics

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify root cause
- [ ] Add monitoring/alerting
- [ ] Update this runbook if needed
- [ ] Add regression test (if code bug)

agents/sre/playbooks/02-database-deadlock.md (new file, 241 lines)

# Playbook: Database Deadlock

## Symptoms

- "Deadlock detected" errors in application
- API returning 500 errors
- Transactions timing out
- Database connection pool exhausted
- Monitoring alert: "Deadlock count >0"

## Severity

- **SEV2** if isolated to specific endpoint
- **SEV1** if affecting all database operations

## Diagnosis

### Step 1: Confirm Deadlock (PostgreSQL)

```sql
-- Check for currently blocked queries and what is blocking them
SELECT
    blocked_locks.pid AS blocked_pid,
    blocked_activity.usename AS blocked_user,
    blocking_locks.pid AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocked_activity.query AS blocked_statement,
    blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Check the cumulative deadlock counter
SELECT datname, deadlocks FROM pg_stat_database WHERE datname = 'your_database';
```

### Step 2: Confirm Deadlock (MySQL)

```sql
-- Show InnoDB status (includes deadlock info)
SHOW ENGINE INNODB STATUS\G

-- Look for the "LATEST DETECTED DEADLOCK" section
-- It shows which transactions were involved
```

---
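
The `SHOW ENGINE INNODB STATUS` output is one large text blob; if it is collected programmatically, the deadlock section can be cut out mechanically (a sketch; it assumes InnoDB's dashed-line section delimiters):

```javascript
// Extract the "LATEST DETECTED DEADLOCK" section from
// SHOW ENGINE INNODB STATUS output (sections end at a dashed line)
function latestDeadlockSection(statusText) {
  const start = statusText.indexOf('LATEST DETECTED DEADLOCK');
  if (start === -1) return null; // no deadlock recorded since startup
  const rest = statusText.slice(start);
  const end = rest.indexOf('\n------------');
  return end === -1 ? rest.trim() : rest.slice(0, end).trim();
}
```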

### Step 3: Identify Deadlock Pattern

**Common Pattern 1: Lock Order Mismatch**
```
Transaction A: Locks row 1, then row 2
Transaction B: Locks row 2, then row 1
→ DEADLOCK
```

**Common Pattern 2: Gap Locks**
```
Transaction A: SELECT ... FOR UPDATE WHERE id BETWEEN 1 AND 10
Transaction B: INSERT INTO table (id) VALUES (5)
→ DEADLOCK
```

**Common Pattern 3: Foreign Key Deadlock**
```
Transaction A: Updates parent table
Transaction B: Inserts into child table
→ DEADLOCK (foreign key check locks)
```

---
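
Whichever pattern applies, the application side usually pairs the fix with a bounded retry, since databases resolve a deadlock by aborting one victim transaction. A minimal sketch (the error checks cover PostgreSQL's SQLSTATE `40P01` and MySQL's errno 1213; the `fn` callback is whatever runs the transaction):

```javascript
// Retry a transactional function a few times when the database aborts it
// as a deadlock victim (PostgreSQL SQLSTATE 40P01, MySQL errno 1213)
async function withDeadlockRetry(fn, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const isDeadlock = err.code === '40P01' || err.errno === 1213;
      if (!isDeadlock || attempt >= maxAttempts) throw err;
      // brief backoff so the competing transaction can finish
      await new Promise((resolve) => setTimeout(resolve, 50 * attempt));
    }
  }
}
```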

## Mitigation

### Immediate (Now - 5 min)

**Option A: Kill Blocking Query** (PostgreSQL)
```sql
-- Terminate the blocking process
SELECT pg_terminate_backend(<blocking_pid>);

-- Verify the lock queue cleared
SELECT count(*) FROM pg_locks WHERE NOT granted;
-- Should return 0
```

**Option B: Kill Blocking Query** (MySQL)
```sql
-- Show process list
SHOW PROCESSLIST;

-- Kill the blocking query
KILL <process_id>;
```

**Option C: Kill Idle Transactions** (PostgreSQL)
```sql
-- Terminate transactions idle for more than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < NOW() - INTERVAL '5 minutes';

-- Impact: Frees up locks
-- Risk: Low (transactions are idle)
```

---

### Short-term (5 min - 1 hour)

**Option A: Add Transaction Timeout** (PostgreSQL)
```sql
-- Set statement timeout (30 seconds)
ALTER DATABASE your_database SET statement_timeout = '30s';

-- Or per session, in the application:
SET statement_timeout = '30s';

-- Impact: Prevents long-running statements from holding locks
-- Risk: Low (transactions should be fast)
```

**Option B: Add Transaction Timeout** (MySQL)
```sql
-- Set lock wait timeout
SET GLOBAL innodb_lock_wait_timeout = 30;

-- Impact: Transactions fail fast instead of queueing on locks
-- Risk: Low (application should handle the errors)
```

**Option C: Fix Lock Order in Application**
```javascript
// BAD: Lock order depends on argument order — two concurrent
// transfers A→B and B→A can deadlock
async function transferMoney(fromId, toId, amount) {
  await db.query('UPDATE accounts SET balance = balance - ? WHERE id = ?', [amount, fromId]);
  await db.query('UPDATE accounts SET balance = balance + ? WHERE id = ?', [amount, toId]);
}

// GOOD: Acquire the row locks in a consistent (ascending id) order,
// then apply the debit/credit to the correct accounts
async function transferMoney(fromId, toId, amount) {
  const [firstId, secondId] = fromId < toId ? [fromId, toId] : [toId, fromId];

  await db.query('SELECT id FROM accounts WHERE id = ? FOR UPDATE', [firstId]);
  await db.query('SELECT id FROM accounts WHERE id = ? FOR UPDATE', [secondId]);

  await db.query('UPDATE accounts SET balance = balance - ? WHERE id = ?', [amount, fromId]);
  await db.query('UPDATE accounts SET balance = balance + ? WHERE id = ?', [amount, toId]);
}
```

---

### Long-term (1 hour+)

**Option A: Reduce Transaction Scope**
```javascript
// BAD: Long transaction holding a row lock across an external call
await db.query('BEGIN');
const user = await db.query('SELECT * FROM users WHERE id = ? FOR UPDATE', [userId]);
await sendEmail(user.email); // External call (slow!) while the lock is held
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = ?', [userId]);
await db.query('COMMIT');

// GOOD: Keep the external call outside the transaction
const user = await db.query('SELECT * FROM users WHERE id = ?', [userId]);
await sendEmail(user.email); // Outside transaction
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = ?', [userId]);
```

**Option B: Use Optimistic Locking**
```sql
-- Add a version column
ALTER TABLE accounts ADD COLUMN version INT DEFAULT 0;

-- Update with a version check
UPDATE accounts
SET balance = balance - 100, version = version + 1
WHERE id = 1 AND version = <current_version>;

-- If 0 rows were updated, re-read and retry with the new version
```

**Option C: Review Isolation Level**
```sql
-- PostgreSQL default: READ COMMITTED
-- Most cases: READ COMMITTED is fine
-- Rare cases: REPEATABLE READ or SERIALIZABLE

-- Lower isolation = less locking = fewer deadlocks
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```

---
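
The optimistic-locking option above is normally paired with an application-side retry: if the version check matched no rows, another transaction won the race, so re-read and try again. A sketch (the `db.query` result shape with `rows`/`rowCount` follows node-postgres conventions and is an assumption):

```javascript
// Optimistic debit: retry when the version check matched no rows,
// i.e. another transaction updated the account first
async function debitOptimistic(db, accountId, amount, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { rows } = await db.query(
      'SELECT version FROM accounts WHERE id = $1', [accountId]);
    const result = await db.query(
      'UPDATE accounts SET balance = balance - $1, version = version + 1 ' +
      'WHERE id = $2 AND version = $3',
      [amount, accountId, rows[0].version]);
    if (result.rowCount === 1) return; // our version check won
  }
  throw new Error('too many concurrent updates');
}
```

No row locks are held between the read and the write, so this pattern cannot deadlock; it trades locks for retries.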

## Escalation

**Escalate to developer if**:
- Application code is causing the deadlock
- Code refactoring is required

**Escalate to DBA if**:
- Database configuration issue
- Foreign key constraint problem

---

## Prevention

- [ ] Always lock in the same order
- [ ] Keep transactions short
- [ ] Use timeouts (statement_timeout, lock_wait_timeout)
- [ ] Use optimistic locking when possible
- [ ] Add deadlock monitoring alert
- [ ] Review isolation level (lower = fewer deadlocks)

---

## Related Runbooks

- [04-slow-api-response.md](04-slow-api-response.md) - If API slow due to deadlock
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem
- [ ] Identify which queries deadlocked
- [ ] Fix lock order in application code
- [ ] Add regression test
- [ ] Update this runbook if needed

agents/sre/playbooks/03-memory-leak.md (new file, 252 lines)

# Playbook: Memory Leak

## Symptoms

- Memory usage increasing continuously over time
- Application crashes with OutOfMemoryError (Java) or "JavaScript heap out of memory" (Node.js)
- Performance degrades over time
- High swap usage
- Monitoring alert: "Memory usage >90%"

## Severity

- **SEV2** if memory increasing but not yet critical
- **SEV1** if application crashed or unresponsive

## Diagnosis

### Step 1: Confirm Memory Leak

```bash
# Monitor memory over time (5 minute intervals)
watch -n 300 'ps aux | grep <process> | awk "{print \$4, \$5, \$6}"'

# Check whether memory is continuously increasing
# Leak:   20% → 30% → 40% → 50% (steady growth)
# Normal: 30% → 32% → 31% → 30% (stable)
```

---
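
Eyeballing the trend works, but given a series of samples the check can also be done mechanically: a consistently positive least-squares slope suggests a leak (a sketch; sampling interval and units are whatever your monitoring emits):

```javascript
// Least-squares slope of equally spaced memory samples
// (units: percent or MB per sampling interval)
function memorySlope(samples) {
  const n = samples.length;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  samples.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}

console.log(memorySlope([20, 30, 40, 50])); // → 10 (leaking)
console.log(memorySlope([30, 32, 31, 30])); // small magnitude (stable)
```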

### Step 2: Get Memory Snapshot

**Java (Heap Dump)**:
```bash
# Get heap dump
jmap -dump:format=b,file=heap.bin <PID>

# Analyze with jhat or VisualVM
jhat heap.bin
# Open http://localhost:7000

# Or use Eclipse Memory Analyzer
```

**Node.js (Heap Snapshot)**:
```bash
# Start with the inspector enabled
node --inspect index.js
# Chrome DevTools → Memory → Take heap snapshot

# Or trigger a snapshot from application code with the heapdump module:
#   const heapdump = require('heapdump');
#   heapdump.writeSnapshot('/tmp/heap-' + Date.now() + '.heapsnapshot');
```

**Python (Memory Profiler)**:
```bash
# Install memory_profiler
pip install memory_profiler

# Profile a script
python -m memory_profiler script.py
```

---

### Step 3: Identify Leak Source

**Look for**:
- Large arrays/objects growing over time
- Detached DOM nodes (if browser/UI)
- Event listeners not removed
- Timers/intervals not cleared
- Closures holding references
- Caches without an eviction policy

**Common patterns**:
```javascript
// 1. Global cache growing forever
global.cache = {}; // Never cleared

// 2. Event listeners not removed
emitter.on('event', handler); // Never removed

// 3. Timers not cleared
setInterval(() => { /* ... */ }, 1000); // Never cleared

// 4. Closures
function createHandler() {
  const largeData = new Array(1000000);
  return () => {
    // Closure keeps largeData in memory
  };
}
```

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Restart Application**
```bash
# Restart to free memory
systemctl restart application

# Impact: Memory usage returns to baseline
# Risk: Low (brief downtime)
# NOTE: This is temporary; the leak will recur!
```

**Option B: Increase Memory Limit** (temporary)
```bash
# Java
java -Xmx4G -jar application.jar # Was 2G

# Node.js
node --max-old-space-size=4096 index.js # Was 2048

# Impact: Buys time to find the root cause
# Risk: Low (but doesn't fix the leak)
```

**Option C: Scale Horizontally** (cloud)
```bash
# Add more instances
# Use a load balancer to rotate traffic
# Restart instances on a schedule (e.g., every 6 hours)

# Impact: Distributes load; scheduled restarts prevent OOM
# Risk: Low (but doesn't fix the root cause)
```

---

### Short-term (5 min - 1 hour)

**Analyze the heap dump** and identify the leak source.

**Common Fixes**:

**1. Add an LRU Cache**
```javascript
// BAD: Unbounded cache
const cache = {};

// GOOD: LRU cache with a size limit
const LRU = require('lru-cache');
const cache = new LRU({ max: 1000 });
```

**2. Remove Event Listeners**
```javascript
// Add listener
const handler = () => { /* ... */ };
emitter.on('event', handler);

// CRITICAL: Remove it later
emitter.off('event', handler);

// React/Vue: clean up in componentWillUnmount/onUnmounted
```

**3. Clear Timers**
```javascript
// Set timer
const intervalId = setInterval(() => { /* ... */ }, 1000);

// CRITICAL: Clear it later
clearInterval(intervalId);

// React: clean up in the useEffect return
useEffect(() => {
  const id = setInterval(() => { /* ... */ }, 1000);
  return () => clearInterval(id);
}, []);
```

**4. Close Connections**
```javascript
// BAD: Connection leak
const conn = await db.connect();
await conn.query(/* ... */);
// Connection never closed!

// GOOD: Always close
const conn = await db.connect();
try {
  await conn.query(/* ... */);
} finally {
  await conn.close(); // CRITICAL
}
```

---
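
If pulling in `lru-cache` isn't an option, a `Map` already iterates in insertion order, which is enough for a tiny LRU (a sketch, not a drop-in replacement for the library):

```javascript
// Minimal LRU cache using Map's insertion order to track recency
class SimpleLRU {
  constructor(max) {
    this.max = max;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // re-insert so this key becomes most recent
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      // evict the least recently used entry (first in insertion order)
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```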

### Long-term (1 hour+)

- [ ] Add memory monitoring (alert if >80% and increasing)
- [ ] Add memory profiling to CI/CD (detect leaks early)
- [ ] Use WeakMap for caches (entries are garbage collected with their keys)
- [ ] Review closure usage (avoid holding large data)
- [ ] Add automated restart (every N hours, if the leak can't be fixed immediately)
- [ ] Load test to reproduce the leak in a test environment
- [ ] Fix the root cause in code

---

## Escalation

**Escalate to developer if**:
- Application code is causing the leak
- A code fix is required

**Escalate to platform team if**:
- Platform/framework bug
- Requires an upgrade or workaround

---

## Prevention Checklist

- [ ] Use an LRU cache (not unbounded)
- [ ] Remove event listeners in cleanup
- [ ] Clear timers/intervals
- [ ] Close database connections (use `finally`)
- [ ] Avoid closures holding large data
- [ ] Use WeakMap for temporary caches
- [ ] Profile memory in development
- [ ] Load test before production

---

## Related Runbooks

- [01-high-cpu-usage.md](01-high-cpu-usage.md) - If CPU also high
- [07-service-down.md](07-service-down.md) - If OOM crashed the service
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem
- [ ] Identify leak source from heap dump
- [ ] Fix code
- [ ] Add regression test (memory profiling)
- [ ] Add monitoring alert
- [ ] Update this runbook if needed

agents/sre/playbooks/04-slow-api-response.md (new file, 269 lines)

# Playbook: Slow API Response

## Symptoms

- API response time >1 second (degraded)
- API response time >5 seconds (critical)
- Users reporting slow loading
- Timeout errors (504 Gateway Timeout)
- Monitoring alert: "p95 response time >1s"

## Severity

- **SEV3** if response time 1-3 seconds
- **SEV2** if response time 3-5 seconds
- **SEV1** if response time >5 seconds or timeouts

## Diagnosis

### Step 1: Check Application Logs

```bash
# Find slow requests (duration field > 1000ms)
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'

# Identify the slow endpoint
awk '/duration/ {print $3, $5}' /var/log/application.log | sort -nk2 | tail -20

# Example output:
# /api/dashboard 8200ms ← SLOW
# /api/users 50ms
# /api/posts 120ms
```

---

### Step 2: Measure Response Time Breakdown

**Total response time = Database + Application + Network**

```bash
# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/endpoint

# curl-format.txt:
#   time_namelookup:    %{time_namelookup}\n
#   time_connect:       %{time_connect}\n
#   time_starttransfer: %{time_starttransfer}\n
#   time_total:         %{time_total}\n
```

**Example breakdown**:
```
time_namelookup:    0.005s (DNS)
time_connect:       0.010s (TCP connect)
time_starttransfer: 8.200s (Time to first byte) ← SLOW HERE
time_total:         8.250s

→ Problem is backend processing, not network
```

---

### Step 3: Check Database Query Time

```bash
# Check application logs for query time
grep "query.*duration" /var/log/application.log

# Example:
# query: SELECT * FROM users... duration: 7800ms ← SLOW
```

**If the database is slow** → See [database-diagnostics.md](../modules/database-diagnostics.md)

---

### Step 4: Check External API Calls

```bash
# Check logs for external API calls
grep "http.request" /var/log/application.log

# Example:
# http.request: GET https://api.external.com/data duration: 5000ms ← SLOW
```

---
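
Log lines like the ones grepped above have to come from somewhere; a tiny Express-style middleware can produce them (a sketch; the log format and field positions are illustrative, so adjust the `grep`/`awk` field numbers to match):

```javascript
// Express-style middleware: log method, URL and duration when the
// response finishes
function requestTimer(logger = console) {
  return (req, res, next) => {
    const start = process.hrtime.bigint();
    res.on('finish', () => {
      const ms = Number(process.hrtime.bigint() - start) / 1e6;
      logger.log(`${req.method} ${req.url} duration: ${ms.toFixed(0)}ms`);
    });
    next();
  };
}
```

Mounted with `app.use(requestTimer())`, every request emits one line that the diagnosis commands above can slice.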

## Mitigation

### Immediate (Now - 5 min)

**Option A: Add Database Index** (if the DB is the bottleneck)
```sql
-- Example: Missing index on last_login_at
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);

-- Impact: 7.8s → 50ms query time
-- Risk: Low (CONCURRENTLY = no table lock)
```

**Option B: Enable Caching** (if the same data is requested frequently)
```javascript
// Add a Redis cache (node-redis v4: connect the client before use)
const redis = require('redis').createClient();
await redis.connect();

app.get('/api/dashboard', async (req, res) => {
  // Check cache first
  const cached = await redis.get('dashboard:' + req.user.id);
  if (cached) return res.json(JSON.parse(cached));

  // Generate data
  const data = await generateDashboard(req.user.id);

  // Cache for 5 minutes
  await redis.setEx('dashboard:' + req.user.id, 300, JSON.stringify(data));

  res.json(data);
});

// Impact: 8s → 10ms (cache hit)
// Risk: Low (data staleness acceptable for a dashboard)
```

**Option C: Optimize Query** (if N+1 query)
```javascript
// BAD: N+1 queries
const users = await db.query('SELECT * FROM users');
for (const user of users) {
  const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
  user.posts = posts;
}

// GOOD: Single query with JOIN
const users = await db.query(`
  SELECT users.*, posts.*
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);
```

---
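
When a JOIN doesn't fit (e.g. the per-user lookups are scattered across code paths), DataLoader-style batching collapses the N queries into one per tick. A dependency-free sketch of the idea (`batchFn` and its keys-in/results-out contract are illustrative):

```javascript
// Collect all keys requested in the same tick, then resolve them
// with a single batched lookup (DataLoader-style)
function createBatchLoader(batchFn) {
  let pending = [];
  return function load(key) {
    return new Promise((resolve, reject) => {
      pending.push({ key, resolve, reject });
      if (pending.length === 1) {
        // first key this tick: schedule one flush for the whole batch
        queueMicrotask(async () => {
          const batch = pending;
          pending = [];
          try {
            const results = await batchFn(batch.map((p) => p.key));
            batch.forEach((p, i) => p.resolve(results[i]));
          } catch (err) {
            batch.forEach((p) => p.reject(err));
          }
        });
      }
    });
  };
}
```

Here `batchFn` would run one `SELECT ... WHERE user_id IN (...)` for all collected ids instead of one query per id.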

### Short-term (5 min - 1 hour)

**Option A: Add Timeout** (if an external API is slow)
```javascript
// Add a timeout to the external API call; a timed-out fetch rejects,
// so catch it and fall back
let data;
try {
  const response = await fetch('https://api.external.com/data', {
    signal: AbortSignal.timeout(2000), // abort after 2 seconds
  });
  data = await response.json();
} catch (err) {
  data = fallbackData; // timeout or network error → use fallback
}

// Impact: Prevents a slow external API from blocking the response
// Risk: Low (fallback data acceptable)
```

**Option B: Async Processing** (if computation is heavy)
```javascript
// BAD: Synchronous heavy computation
app.post('/api/process', async (req, res) => {
  const result = await heavyComputation(req.body); // 10 seconds
  res.json(result);
});

// GOOD: Async processing with a job queue
app.post('/api/process', async (req, res) => {
  const jobId = await queue.add('process', req.body);
  res.status(202).json({ jobId, status: 'processing' });
});

// Client polls for the result
app.get('/api/job/:id', async (req, res) => {
  const job = await queue.getJob(req.params.id);
  res.json({ status: job.status, result: job.result });
});

// Impact: API responds immediately (202 Accepted)
// Risk: Low (client needs to handle the async pattern)
```

**Option C: Pagination** (if returning a large dataset)
```javascript
// BAD: Return all 10,000 records
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users); // Huge payload
});

// GOOD: Pagination
app.get('/api/users', async (req, res) => {
  const page = parseInt(req.query.page) || 1;
  const limit = 50;
  const offset = (page - 1) * limit;

  const users = await db.query('SELECT * FROM users LIMIT ? OFFSET ?', [limit, offset]);
  res.json({ data: users, page, limit });
});

// Impact: 8s → 200ms (smaller dataset)
// Risk: Low (clients usually want pagination anyway)
```

---

### Long-term (1 hour+)

- [ ] Add response time monitoring (p95, p99)
- [ ] Add APM (Application Performance Monitoring)
- [ ] Optimize database queries (add indexes, reduce JOINs)
- [ ] Add caching layer (Redis, Memcached)
- [ ] Implement pagination for large datasets
- [ ] Move heavy computation to background jobs
- [ ] Add timeouts for external APIs
- [ ] Add E2E test: API response <1s
- [ ] Review and optimize N+1 queries

---

## Common Root Causes

| Symptom | Root Cause | Solution |
|---------|------------|----------|
| 7.8s query time | Missing database index | CREATE INDEX |
| 10,000 records returned | No pagination | Add LIMIT/OFFSET |
| 50 queries for 1 request | N+1 query problem | Use JOIN or DataLoader |
| 5s external API call | No timeout | Add timeout + fallback |
| Heavy computation | Sync processing | Async job queue |
| Same data fetched repeatedly | No caching | Add Redis cache |

---

## Escalation

**Escalate to developer if**:
- Application code needs optimization
- N+1 query problem

**Escalate to DBA if**:
- Database performance issue
- Query optimization help is needed

**Escalate to external team if**:
- External API consistently slow
- SLA needs to be renegotiated

---

## Related Runbooks

- [02-database-deadlock.md](02-database-deadlock.md) - If database locked
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem
- [ ] Identify root cause (DB, external API, N+1, etc.)
- [ ] Add performance test (response time <1s)
- [ ] Add monitoring alert
- [ ] Update this runbook if needed

agents/sre/playbooks/05-ddos-attack.md (new file, 293 lines)

# Playbook: DDoS Attack

## Symptoms

- Sudden traffic spike (10x-100x normal)
- Legitimate users can't access the service
- High bandwidth usage (link saturated)
- Server overload (CPU, memory, network)
- Monitoring alert: "Traffic spike", "Bandwidth >90%"

## Severity

- **SEV1** - Production service unavailable due to attack

## Diagnosis

### Step 1: Confirm Traffic Spike

```bash
# Check current connections
netstat -ntu | wc -l

# Compare to baseline (normal: 100-500, attack: 10,000+)

# Check requests per second (nginx)
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c
```

---

### Step 2: Identify Attack Pattern

**Check connections by IP**:
```bash
# Top 20 IPs by connection count
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20

# Example output:
# 5000 192.168.1.100 ← Attacker IP
# 3000 192.168.1.101 ← Attacker IP
#    2 192.168.1.200 ← Legitimate user
```

**Check HTTP requests by IP** (nginx):
```bash
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
```

**Check request patterns**:
```bash
# Check requested URLs
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20

# Check user agents (bots often have telltale user agents)
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -nr
```

---
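
The same per-IP counting can be done in code when logs are already being shipped somewhere; a sketch (it assumes the client IP is the first space-separated field, as in nginx's default combined log format):

```javascript
// Count requests per client IP and return the top talkers
function topTalkers(logLines, n = 20) {
  const counts = new Map();
  for (const line of logLines) {
    const ip = line.split(' ')[0];
    counts.set(ip, (counts.get(ip) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}
```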

### Step 3: Classify Attack Type

**HTTP Flood** (application layer):
- Many HTTP requests from distributed IPs
- Valid HTTP requests, just too many
- Example: 10,000 requests/second to the homepage

**SYN Flood** (network layer):
- Many TCP SYN packets
- Connection handshakes never complete
- Exhausts the server's connection table

**Amplification** (DNS, NTP):
- Small request → large response
- Attacker spoofs your IP as the source
- Servers send the large responses to you

---
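
For an HTTP flood, the rate-limiting idea nginx's `limit_req` implements can be sketched at the application layer as a per-IP token bucket (illustrative only; in production, limit at the edge, before traffic reaches the app):

```javascript
// Per-IP token bucket: `rate` tokens/sec refill, `burst` bucket capacity.
// The clock is injectable so the behavior is testable.
function createRateLimiter(rate, burst, now = () => Date.now()) {
  const buckets = new Map(); // ip -> { tokens, last }
  return function allow(ip) {
    const t = now();
    const b = buckets.get(ip) || { tokens: burst, last: t };
    // refill proportionally to elapsed time, capped at burst
    b.tokens = Math.min(burst, b.tokens + ((t - b.last) / 1000) * rate);
    b.last = t;
    buckets.set(ip, b);
    if (b.tokens >= 1) {
      b.tokens -= 1;
      return true;
    }
    return false;
  };
}
```

A request is served only when its IP's bucket still holds a token; sustained senders drain their bucket and get rejected until it refills.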
|
||||
|
||||
## Mitigation
|
||||
|
||||
### Immediate (Now - 5 min)
|
||||
|
||||
**Option A: Block Attacker IPs** (if few IPs)
|
||||
```bash
|
||||
# Block single IP (iptables)
|
||||
iptables -A INPUT -s <ATTACKER_IP> -j DROP
|
||||
|
||||
# Block IP range
|
||||
iptables -A INPUT -s 192.168.1.0/24 -j DROP
|
||||
|
||||
# Block specific country (using ipset + GeoIP)
|
||||
# Advanced, see infrastructure team
|
||||
|
||||
# Impact: Blocks attacker, restores service
|
||||
# Risk: Low (if attacker IPs identified correctly)
|
||||
```
|
||||
|
||||
**Option B: Enable Rate Limiting** (nginx)
|
||||
```nginx
|
||||
# Add to nginx.conf
|
||||
http {
|
||||
# Define rate limit zone (10 req/s per IP)
|
||||
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
|
||||
|
||||
server {
|
||||
location / {
|
||||
# Apply rate limit
|
||||
limit_req zone=one burst=20 nodelay;
|
||||
limit_req_status 429;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Reload nginx
|
||||
nginx -t && systemctl reload nginx
|
||||
|
||||
# Impact: Limits requests per IP
|
||||
# Risk: Low (legitimate users rarely exceed 10 req/s)
|
||||
```
|
||||
|
||||
**Option C: Enable CloudFlare "Under Attack" Mode**
|
||||
```bash
|
||||
# If using CloudFlare:
|
||||
# 1. Log in to CloudFlare dashboard
|
||||
# 2. Select domain
|
||||
# 3. Click "Under Attack Mode"
|
||||
# 4. Adds JavaScript challenge before serving content
|
||||
|
||||
# Impact: Blocks bots, allows legitimate browsers
|
||||
# Risk: Low (slight user friction)
|
||||
```
|
||||
|
||||
**Option D: Enable AWS Shield** (AWS)
|
||||
```bash
|
||||
# AWS Shield Standard: Free, automatic DDoS protection
|
||||
# AWS Shield Advanced: $3000/month, enhanced protection
|
||||
|
||||
# CloudFormation:
|
||||
aws cloudformation deploy \
|
||||
--template-file shield.yaml \
|
||||
--stack-name ddos-protection
|
||||
|
||||
# Impact: Absorbs DDoS at AWS edge
|
||||
# Risk: None (AWS handles)
|
||||
```

---

### Short-term (5 min - 1 hour)

**Option A: Add Connection Limits**
```nginx
# Limit concurrent connections per IP
limit_conn_zone $binary_remote_addr zone=addr:10m;

server {
    location / {
        limit_conn addr 10;  # Max 10 concurrent connections per IP
    }
}
```

**Option B: Add CAPTCHA** (reCAPTCHA)
```html
<!-- Add reCAPTCHA to sensitive endpoints -->
<!-- The token must also be verified server-side (Google's siteverify API) -->
<form action="/login" method="POST">
  <input type="email" name="email">
  <input type="password" name="password">
  <div class="g-recaptcha" data-sitekey="YOUR_SITE_KEY"></div>
  <button type="submit">Login</button>
</form>
```

**Option C: Scale Up** (cloud auto-scaling)
```bash
# AWS: Increase Auto Scaling Group desired capacity
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 20  # Was 5

# Impact: More capacity to handle the attack
# Risk: Medium (costs money, may not fully mitigate)
# NOTE: Only do this if legitimate traffic also spiked
```

---

### Long-term (1 hour+)

- [ ] Enable Cloudflare or AWS Shield (DDoS protection service)
- [ ] Implement rate limiting on all endpoints
- [ ] Add CAPTCHA to login, signup, checkout
- [ ] Configure auto-scaling (handle legitimate traffic spikes)
- [ ] Add monitoring alert for traffic anomalies
- [ ] Create DDoS response plan
- [ ] Contact ISP for upstream filtering (if very large attack)
- [ ] Review and update firewall rules
- [ ] Add geographic blocking (if applicable)

---

## Important Notes

**DO NOT**:
- Scale up indefinitely (the attack can grow; costs explode)
- Fight DDoS at the application layer alone (use a CDN and cloud protection)

**DO**:
- Use a CDN/DDoS protection service (Cloudflare, AWS Shield, Akamai)
- Enable rate limiting
- Block attacker IPs/ranges
- Monitor costs (auto-scaling can be expensive)

---

## Escalation

**Escalate to infrastructure team if**:
- Attack very large (>10 Gbps)
- Need upstream filtering at ISP level

**Escalate to security team**:
- All DDoS attacks (for post-mortem, legal action)

**Contact ISP if**:
- Attack saturating the internet connection
- Need the transit provider to filter

**Contact Cloudflare/AWS if**:
- Using their DDoS protection
- Need assistance enabling features

---

## Prevention Checklist

- [ ] Use a CDN (Cloudflare, CloudFront, Akamai)
- [ ] Enable DDoS protection (AWS Shield, Cloudflare)
- [ ] Implement rate limiting (per IP, per user)
- [ ] Add CAPTCHA to sensitive endpoints
- [ ] Configure auto-scaling (within cost limits)
- [ ] Monitor traffic patterns (detect spikes early)
- [ ] Have a DDoS response plan ready
- [ ] Test the response plan (tabletop exercise)

---

## Related Runbooks

- [01-high-cpu-usage.md](01-high-cpu-usage.md) - If CPU overloaded
- [07-service-down.md](07-service-down.md) - If service crashed
- [../modules/security-incidents.md](../modules/security-incidents.md) - Security response
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (mandatory for DDoS)
- [ ] Identify attack vectors
- [ ] Document attacker IPs, patterns
- [ ] Report to ISP, Cloudflare (they may block the attacker)
- [ ] Review and improve DDoS defenses
- [ ] Consider legal action (if attacker identified)
- [ ] Update this runbook if needed

---

## Useful Commands Reference

```bash
# Check connection count
netstat -ntu | wc -l

# Top IPs by connection count
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20

# Block IP (iptables)
iptables -A INPUT -s <IP> -j DROP

# Check nginx requests per second
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c

# List iptables rules
iptables -L -n -v

# Clear all iptables rules (CAREFUL!)
iptables -F

# Save iptables rules (persist after reboot)
iptables-save > /etc/iptables/rules.v4
```
# Playbook: Disk Full

## Symptoms

- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written
- Monitoring alert: "Disk usage >90%"

## Severity

- **SEV3** if disk >90% but still functioning
- **SEV2** if disk >95% and applications degraded
- **SEV1** if disk 100% and applications down

## Diagnosis

### Step 1: Check Disk Usage

```bash
# Check disk usage by partition
df -h

# Example output:
# Filesystem      Size  Used  Avail  Use%  Mounted on
# /dev/sda1        50G   48G     2G   96%  /        ← CRITICAL
# /dev/sdb1       100G   20G    80G   20%  /data
```

---

### Step 2: Find Large Directories

```bash
# Disk usage by top-level directory
du -sh /*

# Example output:
#  15G  /var   ← Likely logs
#  10G  /home
#   5G  /usr
#   1G  /tmp

# Drill down into large directory
du -sh /var/*

# Example:
#  14G  /var/log   ← FOUND IT
# 500M  /var/cache
```

---

### Step 3: Find Large Files

```bash
# Find files larger than 100MB
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -h -r | head -20

# Example output:
# 5.0G /var/log/application.log   ← Large log file
# 2.0G /var/log/nginx/access.log
# 500M /tmp/dump.sql
```

---

### Step 4: Check for Deleted Files Holding Space

```bash
# Files deleted but process still has handle
lsof | grep deleted | awk '{print $1, $2, $7}' | sort -u

# Example output:
# nginx 1234 10G   ← nginx has handle to 10GB deleted file
```

**Why this happens**:
- File deleted (`rm /var/log/nginx/access.log`)
- But process (nginx) still writing to it
- Disk space not released until process closes file or restarts
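Because of this, prefer truncating a busy log in place over deleting it: truncation releases the space immediately and the writer's file handle stays valid. A minimal demonstration on a scratch file:

```shell
# Create a throwaway "log" with some data in it
printf 'old log data\n' > /tmp/demo.log

# Truncate in place: same inode, size drops to 0, any open
# file descriptors keep working (unlike `rm`, which leaves the
# space held until the writing process exits or restarts)
: > /tmp/demo.log

# Verify it is empty
wc -c < /tmp/demo.log
```

On a real host the equivalent would be `: > /var/log/nginx/access.log` (or `truncate -s 0 <file>`).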

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Delete Old Logs**
```bash
# Delete old log files (>7 days)
find /var/log -name "*.log.*" -mtime +7 -delete

# Delete compressed logs (>30 days)
find /var/log -name "*.gz" -mtime +30 -delete

# journalctl: Keep only last 7 days
journalctl --vacuum-time=7d

# Impact: Frees disk space immediately
# Risk: Low (old logs not needed for debugging recent issues)
```

**Option B: Compress Logs**
```bash
# Compress large log files
# (Avoid gzipping a file a process is still writing to; rotate or
#  truncate it first, otherwise the writer keeps the old inode open)
gzip /var/log/application.log
gzip /var/log/nginx/access.log

# Impact: Reduces log file size by 80-90%
# Risk: Low (logs still available, just compressed)
```

**Option C: Release Deleted Files**
```bash
# Find processes holding deleted files
lsof | grep deleted

# Restart the process to release the space
systemctl restart nginx

# Or send SIGHUP (many daemons reopen their log files on HUP)
kill -HUP <PID>

# Impact: Frees disk space held by deleted files
# Risk: Medium (brief service interruption)
```

**Option D: Clean Temp Files**
```bash
# Delete old temp files (use find rather than `rm -rf /tmp/*`,
# which would also remove files that running processes still use)
find /tmp -mindepth 1 -mtime +2 -delete
find /var/tmp -mindepth 1 -mtime +2 -delete

# Delete apt/yum cache
apt-get clean     # Ubuntu/Debian
yum clean all     # RHEL/CentOS

# Delete old kernels (Ubuntu)
apt-get autoremove --purge

# Impact: Frees disk space
# Risk: Low (old temp files can be deleted)
```

---

### Short-term (5 min - 1 hour)

**Option A: Rotate Logs Immediately**
```bash
# Force log rotation
logrotate -f /etc/logrotate.conf

# Verify logs rotated
ls -lh /var/log/

# Configure aggressive rotation (daily instead of weekly)
# Edit /etc/logrotate.d/application:
/var/log/application.log {
    daily            # Was: weekly
    rotate 7         # Keep 7 days
    compress         # Compress old logs
    delaycompress    # Don't compress most recent
    missingok        # Don't error if file missing
    notifempty       # Don't rotate if empty
    create 0640 www-data www-data
    sharedscripts
    postrotate
        systemctl reload application
    endscript
}
```

**Option B: Archive Old Data**
```bash
# Archive old database dumps
tar -czf old-dumps.tar.gz /backup/*.sql
rm /backup/*.sql

# Move to cheaper storage (S3, Archive)
aws s3 cp old-dumps.tar.gz s3://archive-bucket/
rm old-dumps.tar.gz

# Impact: Frees local disk space
# Risk: Low (data archived, not deleted)
```

**Option C: Expand Disk** (cloud)
```bash
# AWS: Modify EBS volume
aws ec2 modify-volume --volume-id vol-1234567890abcdef0 --size 100  # Was 50 GB

# Wait for modification to complete (5-10 min)
watch aws ec2 describe-volumes-modifications --volume-ids vol-1234567890abcdef0

# If the filesystem sits on a partition, grow the partition first
# (growpart is in the cloud-utils package)
sudo growpart /dev/xvda 1

# Resize filesystem
# ext4:
sudo resize2fs /dev/xvda1

# xfs:
sudo xfs_growfs /

# Verify
df -h

# Impact: More disk space
# Risk: Low (no downtime, but takes time)
```

---

### Long-term (1 hour+)

- [ ] Add disk usage monitoring (alert at >80%)
- [ ] Configure log rotation (daily, keep 7 days)
- [ ] Set up log forwarding (to ELK, Splunk, CloudWatch)
- [ ] Review disk usage trends (plan capacity)
- [ ] Add automated cleanup (cron job for old files)
- [ ] Archive old data (move to S3, Glacier)
- [ ] Implement log sampling (reduce volume)
- [ ] Review application logging (reduce verbosity)
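The automated-cleanup item above can be sketched as a daily `find ... -delete` job (the paths, ages, and cron schedule are illustrative, not from the source); the demo below runs the same command against a scratch directory:

```shell
# A cron entry for daily cleanup might look like:
#   0 3 * * * root find /tmp -mindepth 1 -mtime +7 -delete

# Demo: create one stale file and one fresh file, then clean up.
# (`touch -d` needs GNU coreutils.)
mkdir -p /tmp/cleanup-demo
touch -d "10 days ago" /tmp/cleanup-demo/stale.tmp
touch /tmp/cleanup-demo/fresh.tmp

# Delete anything older than 7 days
find /tmp/cleanup-demo -mindepth 1 -mtime +7 -delete

# Only fresh.tmp should remain
ls /tmp/cleanup-demo
```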

---

## Common Culprits

| Location | Cause | Solution |
|----------|-------|----------|
| /var/log | Log files not rotated | logrotate, compress, delete old |
| /tmp | Temp files not cleaned | Delete old files, add cron job |
| /var/cache | Apt/yum cache | apt-get clean, yum clean all |
| /home | User files, downloads | Clean up or expand disk |
| Database | Large tables, no archiving | Archive old data, vacuum |
| Deleted files | Process holding handle | Restart process |

---

## Prevention Checklist

- [ ] Configure log rotation (daily, 7 days retention)
- [ ] Add disk monitoring (alert at >80%)
- [ ] Set up log forwarding (reduce local storage)
- [ ] Add cron job to clean temp files
- [ ] Review disk trends monthly
- [ ] Plan capacity (expand before hitting limit)
- [ ] Archive old data (move to cheaper storage)
- [ ] Implement log sampling (reduce volume)

---

## Escalation

**Escalate to developer if**:
- Application generating excessive logs
- Need to reduce logging verbosity

**Escalate to DBA if**:
- Database files consuming disk
- Need to archive old data

**Escalate to infrastructure if**:
- Need to expand disk (physical server)
- Need to add new disk

---

## Related Runbooks

- [07-service-down.md](07-service-down.md) - If disk full crashed service
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify what filled disk
- [ ] Implement prevention (log rotation, monitoring)
- [ ] Review disk trends (prevent recurrence)
- [ ] Update this runbook if needed

---

## Useful Commands Reference

```bash
# Disk usage
df -h         # By partition
du -sh /*     # By directory
du -sh /var/* # Drill down

# Large files
find / -type f -size +100M -exec ls -lh {} \;

# Deleted files holding space
lsof | grep deleted

# Clean up
find /var/log -name "*.log.*" -mtime +7 -delete  # Old logs
gzip /var/log/*.log                              # Compress
journalctl --vacuum-time=7d                      # journalctl
apt-get clean                                    # Apt cache
yum clean all                                    # Yum cache

# Log rotation
logrotate -f /etc/logrotate.conf

# Expand disk (after EBS resize)
resize2fs /dev/xvda1  # ext4
xfs_growfs /          # xfs
```
# Playbook: Service Down

## Symptoms

- Service not responding
- Health check failures
- 502 Bad Gateway or 503 Service Unavailable
- Users can't access application
- Monitoring alert: "Service down", "Health check failed"

## Severity

- **SEV1** - Production service completely unavailable

## Diagnosis

### Step 1: Check Service Status

```bash
# Check if service is running (systemd)
systemctl status nginx
systemctl status application
systemctl status postgresql

# Check process
ps aux | grep nginx
pidof nginx

# Example output:
# nginx.service - nginx web server
#   Active: inactive (dead)   ← SERVICE IS DOWN
```

---

### Step 2: Check Why Service Stopped

**Check Service Logs** (systemd):
```bash
# Last 50 lines of service logs
journalctl -u nginx -n 50

# Tail logs in real-time
journalctl -u nginx -f

# Look for:
# - Exit code (0 = normal, non-zero = error)
# - Error messages
# - Crash reason
```

**Check Application Logs**:
```bash
# Check application error log
tail -100 /var/log/application/error.log

# Look for:
# - Exception/error before crash
# - Stack trace
# - "Fatal error", "Segmentation fault"
```

**Check System Logs**:
```bash
# Check for OOM (Out of Memory) killer
dmesg | grep -i "out of memory\|oom\|killed process"

# Example:
# Out of memory: Killed process 1234 (node) total-vm:8GB
# ↑ OOM Killer terminated the application

# Check kernel errors
dmesg | tail -50

# Check syslog
grep "error\|segfault" /var/log/syslog
```

---

### Step 3: Identify Root Cause

**Common causes**:

| Symptom | Root Cause |
|---------|------------|
| "Out of memory" in dmesg | OOM Killer (memory leak, insufficient memory) |
| "Segmentation fault" | Application bug (crash) |
| "Address already in use" | Port already bound |
| "Connection refused" to database | Database down |
| "No such file or directory" | Missing config file |
| "Permission denied" | Wrong file permissions |
| Exit code 137 | Killed by OOM Killer |
| Exit code 139 | Segmentation fault |
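The 137/139 rows follow the shell convention that a process killed by signal N exits with status 128+N: 137 = 128+9 (SIGKILL, the signal the OOM Killer sends) and 139 = 128+11 (SIGSEGV). A quick check in any POSIX shell:

```shell
# Child kills itself with SIGKILL (signal 9); parent reports 128+9 = 137
sh -c 'kill -KILL $$'; echo "exit: $?"

# Ordinary exit codes pass through unchanged
sh -c 'exit 3'; echo "exit: $?"
```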

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Restart Service**
```bash
# Restart service
systemctl restart nginx

# Check if started successfully
systemctl status nginx

# Test endpoint
curl http://localhost

# Impact: Service restored
# Risk: Low (if root cause not addressed, may crash again)
```

**Option B: Fix Configuration Error** (if config issue)
```bash
# Test configuration
nginx -t   # nginx
# postgres has no single "config test" flag; `postgres -C <setting>`
# parses postgresql.conf and fails if it is invalid (paths vary by distro):
sudo -u postgres postgres -C data_directory -D /var/lib/postgresql/data

# If config error, check recent changes
git diff HEAD~1 /etc/nginx/nginx.conf

# Revert to working config
git checkout HEAD~1 /etc/nginx/nginx.conf

# Restart
systemctl restart nginx
```

**Option C: Free Up Resources** (if OOM)
```bash
# Check memory usage
free -h

# Kill memory-heavy processes (non-critical)
kill -9 <PID>

# Free page cache
sync && echo 3 > /proc/sys/vm/drop_caches

# Restart service
systemctl restart application
```

**Option D: Change Port** (if port conflict)
```bash
# Check what's using port
lsof -i :80

# Example:
# apache2 1234 root 4u IPv4 12345 0t0 TCP *:80 (LISTEN)
# ↑ Apache using port 80

# Stop conflicting service
systemctl stop apache2

# Start intended service
systemctl start nginx
```

---

### Short-term (5 min - 1 hour)

**Option A: Fix Crash Bug** (if application bug)
```bash
# Check stack trace in logs
tail -100 /var/log/application/error.log

# Identify line causing crash
# Example: NullPointerException at PaymentService.java:42

# Deploy hotfix OR revert to previous version
git checkout <previous-working-commit>
npm run build && pm2 restart all

# Impact: Bug fixed, service stable
# Risk: Medium (needs proper testing)
```

**Option B: Increase Memory** (if OOM)
```bash
# Short-term: Increase swap
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile   # swapon warns if the swap file is world-readable
mkswap /swapfile
swapon /swapfile

# Long-term: Resize instance
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM

# Impact: More memory available
# Risk: Medium (swap is slow, instance resize has downtime)
```

**Option C: Enable Auto-Restart** (systemd)
```bash
# Edit service file
# /etc/systemd/system/application.service

[Service]
Restart=always            # Auto-restart on failure
RestartSec=10             # Wait 10s before restart
StartLimitBurst=5         # Max 5 restarts
StartLimitIntervalSec=60  # In 60 seconds
# (On newer systemd versions, StartLimitBurst/StartLimitIntervalSec
#  belong in the [Unit] section)

# Reload systemd
systemctl daemon-reload

# Impact: Service auto-restarts on crash
# Risk: Low (but doesn't fix root cause)
```

**Option D: Route Traffic to Backup** (if multi-instance)
```bash
# If using a load balancer:
# 1. Remove failed instance from LB
# 2. Traffic goes to healthy instances

# AWS:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0

# Impact: Users see working instance
# Risk: Low (other instances handle load)
```

---

### Long-term (1 hour+)

- [ ] Fix root cause (memory leak, bug, etc.)
- [ ] Add health check monitoring
- [ ] Enable auto-restart (systemd)
- [ ] Set up redundancy (multiple instances)
- [ ] Add load balancer (distribute traffic)
- [ ] Increase memory/CPU (if resource issue)
- [ ] Add alerting (service down, health check fail)
- [ ] Add E2E test (smoke test after deploy)
- [ ] Review deployment process (how did the bug reach prod?)

---

## Root Cause Analysis

**For each incident, determine**:

1. **What failed?** (nginx, application, database)
2. **Why did it fail?** (OOM, bug, config error)
3. **What triggered it?** (deploy, traffic spike, external event)
4. **How to prevent it?** (fix bug, add monitoring, increase capacity)

---

## Escalation

**Escalate to developer if**:
- Application crash due to bug
- Need code fix

**Escalate to platform team if**:
- Platform/framework issue
- Infrastructure problem

**Escalate to on-call manager if**:
- Can't restore service in 30 min
- Need additional resources

---

## Prevention Checklist

- [ ] Health check monitoring (alert on failure)
- [ ] Auto-restart (systemd Restart=always)
- [ ] Redundancy (multiple instances behind LB)
- [ ] Resource monitoring (CPU, memory alerts)
- [ ] Graceful degradation (circuit breakers, fallbacks)
- [ ] Smoke tests after deploy
- [ ] Rollback plan (blue-green, canary)
- [ ] Chaos engineering (test failure scenarios)

---

## Related Runbooks

- [03-memory-leak.md](03-memory-leak.md) - If OOM caused crash
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Application diagnostics

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (MANDATORY for SEV1)
- [ ] Timeline with all events
- [ ] Root cause analysis
- [ ] Action items (prevent recurrence)
- [ ] Update runbook if needed
- [ ] Share learnings with team

---

## Useful Commands Reference

```bash
# Service status
systemctl status <service>
systemctl restart <service>
journalctl -u <service> -n 50

# Process check
ps aux | grep <process>
pidof <process>

# Check OOM
dmesg | grep -i "out of memory\|oom"

# Check port usage
lsof -i :<port>
netstat -tlnp | grep <port>

# Test config
nginx -t
sudo -u postgres postgres -C data_directory -D /var/lib/postgresql/data  # postgres (paths vary)

# Health check
curl http://localhost/health
```
# Playbook: Data Corruption

## Symptoms

- Users report incorrect data
- Database integrity constraint violations
- Foreign key errors
- Application errors due to unexpected data
- Failed backups (checksum mismatch)
- Monitoring alert: "Data integrity check failed"

## Severity

- **SEV1** - Critical data corrupted (financial, health, legal)
- **SEV2** - Non-critical data corrupted (user profiles, cache)
- **SEV3** - Recoverable corruption (can restore from backup)

## Diagnosis

### Step 1: Confirm Corruption

**Database Integrity Check** (PostgreSQL):
```sql
-- Sanity check: database metadata is reachable
SELECT * FROM pg_catalog.pg_database WHERE datname = 'your_database';

-- Check whether data checksums are enabled (corruption detection);
-- if enabled, the pg_checksums utility can verify pages on a stopped cluster
SHOW data_checksums;

-- List the largest tables (helps scope where to look)
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```

**Database Integrity Check** (MySQL):
```sql
-- Check table for corruption
CHECK TABLE users;

-- Repair table (if corrupted)
-- NOTE: snapshot the data first; repair can destroy forensic evidence
REPAIR TABLE users;

-- Optimize table (defragment)
OPTIMIZE TABLE users;
```

---

### Step 2: Identify Scope

**Questions to answer**:
- Which tables/data are affected?
- How many records are corrupted?
- When did the corruption start?
- What's the impact on users?

**Check Database Logs**:
```bash
# PostgreSQL
grep "ERROR\|FATAL\|PANIC" /var/log/postgresql/postgresql.log

# MySQL
grep "ERROR" /var/log/mysql/error.log

# Look for:
# - Constraint violations
# - Foreign key errors
# - Checksum errors
# - Disk I/O errors
```

---

### Step 3: Determine Root Cause

**Common causes**:

| Cause | Symptoms |
|-------|----------|
| Disk corruption | I/O errors in dmesg, checksum failures |
| Application bug | Logical corruption (wrong data, not random) |
| Failed migration | Schema mismatch, foreign key violations |
| Concurrent writes | Race condition, duplicate records |
| Hardware failure | Random corruption, unrelated records |
| Malicious attack | Deliberate data modification |

**Check for Disk Errors**:
```bash
# Check disk errors
dmesg | grep -i "I/O error\|disk error"

# Check SMART status
smartctl -a /dev/sda

# Look for: Reallocated_Sector_Ct, Current_Pending_Sector
```

---

## Mitigation

### Immediate (Now - 5 min)

**CRITICAL: Preserve Evidence**
```bash
# 1. STOP ALL WRITES (prevent further corruption)
#    Put the application in read-only mode OR
#    take the application offline

# 2. Snapshot/backup the current state (even if corrupted)
# PostgreSQL:
pg_dump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# MySQL:
mysqldump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# 3. Snapshot disk (cloud)
# AWS:
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0 --description "Corruption snapshot"

# Impact: Preserves evidence for forensics
# Risk: None (read-only operations)
```

**CRITICAL: DO NOT**:
- Delete corrupted data (may be needed for forensics)
- Run REPAIR TABLE (may destroy evidence)
- Restart the database (may clear logs)

---

### Short-term (5 min - 1 hour)

**Option A: Restore from Backup** (if recent clean backup)
```bash
# 1. Identify last known good backup
ls -lh /backup/

# Example:
# backup-20251026-0200.sql   ← Clean backup (before corruption)
# backup-20251026-0800.sql   ← Corrupted

# 2. Restore from clean backup
# PostgreSQL:
psql your_database < /backup/backup-20251026-0200.sql

# MySQL:
mysql your_database < /backup/backup-20251026-0200.sql

# 3. Verify data integrity
#    Run application tests
#    Check user-reported issues

# Impact: Data restored to clean state
# Risk: Medium (lose data written after backup time)
```

**Option B: Repair Corrupted Records** (if isolated corruption)
```sql
-- Identify corrupted records
SELECT * FROM users WHERE email IS NULL;  -- Should not be null

-- Fix corrupted records
UPDATE users SET email = 'unknown@example.com' WHERE email IS NULL;

-- Verify fix
SELECT count(*) FROM users WHERE email IS NULL;  -- Should be 0

-- Impact: Corruption fixed
-- Risk: Low (if the corruption is known and fixable)
```

**Option C: Point-in-Time Recovery** (PostgreSQL)
```bash
# If WAL (Write-Ahead Logging) archiving is enabled:

# 1. Determine recovery point (before corruption)
#    2025-10-26 07:00:00 (corruption detected at 08:00)

# 2. Restore from base backup + WAL
pg_basebackup -D /var/lib/postgresql/data-recovery

# 3. Set the recovery target
#    PostgreSQL <= 11: recovery.conf
#    PostgreSQL 12+:  postgresql.conf plus an empty recovery.signal file
# recovery_target_time = '2025-10-26 07:00:00'

# 4. Start PostgreSQL (will replay WAL until the target time)
systemctl start postgresql

# Impact: Restore to the exact point before corruption
# Risk: Low (if WAL available)
```

---

### Long-term (1 hour+)

**Root Cause Analysis**:

**If disk corruption**:
- [ ] Replace disk immediately
- [ ] Check RAID status
- [ ] Run filesystem check (fsck)
- [ ] Enable database checksums

**If application bug**:
- [ ] Fix bug in application code
- [ ] Add data validation
- [ ] Add integrity checks
- [ ] Add regression test

**If failed migration**:
- [ ] Review migration script
- [ ] Test migrations in staging first
- [ ] Add rollback plan
- [ ] Use transaction-based migrations

**If concurrent writes**:
- [ ] Add locking (row-level, table-level)
- [ ] Use optimistic locking (version column)
- [ ] Review transaction isolation level
- [ ] Add unique constraints
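The optimistic-locking item can be sketched in SQL (the `accounts` table, `balance`, and `version` columns are illustrative, not from the source): the UPDATE only succeeds if no other transaction has written the row since it was read.

```sql
-- Read the row along with its current version
SELECT id, balance, version FROM accounts WHERE id = 42;  -- e.g. version = 7

-- Write back only if the version is unchanged, bumping it atomically
UPDATE accounts
SET balance = 100, version = version + 1
WHERE id = 42 AND version = 7;
-- 0 rows updated → a concurrent write happened; re-read and retry
```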

---

## Prevention

**Backups**:
- [ ] Daily automated backups
- [ ] Test restore process monthly
- [ ] Multiple backup locations (local + S3)
- [ ] Point-in-time recovery enabled (WAL)
- [ ] Retention: 30 days

**Monitoring**:
- [ ] Data integrity checks (checksums)
- [ ] Foreign key violation alerts
- [ ] Disk error monitoring (SMART)
- [ ] Backup success/failure alerts
- [ ] Application-level data validation

**Data Validation**:
- [ ] Database constraints (NOT NULL, FOREIGN KEY, CHECK)
- [ ] Application-level validation
- [ ] Schema migrations in transactions
- [ ] Automated data quality tests

**Redundancy**:
- [ ] Database replication (primary + replica)
- [ ] RAID for disk redundancy
- [ ] Multi-AZ deployment (cloud)

---

## Escalation

**Escalate to DBA if**:
- Database-level corruption
- Need expert for recovery

**Escalate to developer if**:
- Application bug causing corruption
- Need code fix

**Escalate to security team if**:
- Suspected malicious attack
- Unauthorized data modification

**Escalate to management if**:
- Critical data lost
- Legal/compliance implications
- Data breach

---

## Legal/Compliance

**If critical data corrupted**:
- [ ] Notify legal team
- [ ] Notify compliance team
- [ ] Check notification requirements:
  - GDPR: 72 hours for breach notification
  - HIPAA: 60 days for breach notification
  - PCI-DSS: Immediate notification
- [ ] Document incident timeline (for audit)
- [ ] Preserve evidence (forensics)

---

## Related Runbooks

- [07-service-down.md](07-service-down.md) - If database down
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
|
||||
- [../modules/security-incidents.md](../modules/security-incidents.md) - If malicious attack
|
||||
|
||||
---
|
||||
|
||||
## Post-Incident
|
||||
|
||||
After resolving:
|
||||
- [ ] Create post-mortem (MANDATORY for SEV1)
|
||||
- [ ] Root cause analysis (what, why, how)
|
||||
- [ ] Identify affected users/records
|
||||
- [ ] User communication (if needed)
|
||||
- [ ] Action items (prevent recurrence)
|
||||
- [ ] Update backup/recovery procedures
|
||||
- [ ] Update this runbook if needed
|
||||
|
||||
---
|
||||
|
||||
## Useful Commands Reference

```bash
# PostgreSQL: confirm the server responds and list databases
psql -c "SELECT * FROM pg_catalog.pg_database"

# MySQL table check
mysqlcheck -c your_database

# Backup
pg_dump your_database > backup.sql
mysqldump your_database > backup.sql

# Restore
psql your_database < backup.sql
mysql your_database < backup.sql

# Disk check
dmesg | grep -i "I/O error"
smartctl -a /dev/sda
fsck /dev/sda1

# Snapshot (AWS)
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0
```
430
agents/sre/playbooks/09-cascade-failure.md
Normal file
@@ -0,0 +1,430 @@
# Playbook: Cascade Failure

## Symptoms

- Multiple services failing simultaneously
- Failures spreading across services
- Dependency services timing out
- Error rate increasing exponentially
- Monitoring alert: "Multiple services degraded", "Cascade detected"

## Severity

- **SEV1** - Cascade affecting production services

## What is a Cascade Failure?

**Definition**: One service's failure triggers failures in dependent services, spreading through the system.

**Example**:
```
Database slow (2s queries)
  ↓
API times out waiting for database (5s timeout)
  ↓
Frontend times out waiting for API (10s timeout)
  ↓
Load balancer marks frontend unhealthy
  ↓
Traffic routes to other frontends (overloading them)
  ↓
All frontends fail → Complete outage
```
---

## Diagnosis

### Step 1: Identify Initial Failure Point

**Check Service Dependencies**:
```
Frontend → API → Database
             ↓
          Cache (Redis)
             ↓
          Queue (RabbitMQ)
             ↓
          External API
```

**Find the root**:
```bash
# Check service health (start with leaf dependencies)
# 1. Database
psql -c "SELECT 1"

# 2. Cache
redis-cli PING

# 3. Queue
rabbitmqctl status

# 4. External API
curl https://api.external.com/health

# First failure = likely root cause
```

---

### Step 2: Trace Failure Propagation

**Check Service Logs** (in order):
```bash
# Database logs (first)
tail -100 /var/log/postgresql/postgresql.log

# API logs (second)
tail -100 /var/log/api/error.log

# Frontend logs (third)
tail -100 /var/log/frontend/error.log
```

**Look for timestamps**:
```
14:00:00 - Database: Slow query (7s)  ← ROOT CAUSE
14:00:05 - API: Timeout error
14:00:10 - Frontend: API unavailable
14:00:15 - Load balancer: All frontends unhealthy
```

---

### Step 3: Assess Cascade Depth

**How many layers are affected?**
- **1 layer**: Database only (isolated failure)
- **2-3 layers**: Database → API → Frontend (cascade)
- **4+ layers**: Full system cascade (critical)
---

## Mitigation

### Immediate (Now - 5 min)

**PRIORITY: Stop the cascade from spreading**

**Option A: Circuit Breaker** (if not already enabled)
```javascript
// Enable a circuit breaker manually
// Prevents the API from overwhelming the database

const CircuitBreaker = require('opossum');

const dbQuery = new CircuitBreaker(queryDatabase, {
  timeout: 3000,                 // 3s timeout
  errorThresholdPercentage: 50,  // Open after 50% failures
  resetTimeout: 30000            // Try again after 30s
});

dbQuery.on('open', () => {
  console.log('Circuit breaker OPEN - using fallback');
});

// Use the fallback when the circuit is open
dbQuery.fallback(() => {
  return cachedData; // Return cached data instead
});
```

**Option B: Rate Limiting** (protect downstream)
```nginx
# Limit requests to the API (nginx), shielding the database behind it
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://api-backend;
}
```

**Option C: Shed Load** (reject non-critical requests)
```javascript
// Reject non-critical requests when overloaded
app.use((req, res, next) => {
  const load = getCurrentLoad(); // CPU, memory, queue depth

  if (load > 0.8 && !isCriticalEndpoint(req.path)) {
    return res.status(503).json({
      error: 'Service overloaded, try again later'
    });
  }

  next();
});

function isCriticalEndpoint(path) {
  return ['/api/health', '/api/payment'].includes(path);
}
```

**Option D: Isolate Failure** (take the failing service offline)
```bash
# Remove the failing service from the load balancer
# AWS ELB:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0

# nginx:
# Comment out the failing backend in the upstream block
# upstream api {
#   server api1.example.com;     # Healthy
#   # server api2.example.com;   # FAILING - commented out
# }

# Impact: Prevents the failing service from affecting others
# Risk: Reduced capacity
```
---

### Short-term (5 min - 1 hour)

**Option A: Fix the Root Cause**

**If database slow**:
```sql
-- Add missing index
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login_at);
```

**If external API slow**:
```javascript
// Add timeout + fallback (native fetch has no `timeout` option;
// pass an AbortSignal instead)
const response = await fetch('https://api.external.com', {
  signal: AbortSignal.timeout(2000) // 2s timeout
});

if (!response.ok) {
  return fallbackData; // Don't cascade the failure
}
```

**If service overloaded**:
```bash
# Scale horizontally (add more instances)
# AWS Auto Scaling:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 10 # Was 5
```

---

**Option B: Add Timeouts** (prevent indefinite waiting)
```javascript
// Database query timeout (option name varies by driver)
const result = await db.query('SELECT * FROM users', {
  timeout: 3000 // 3 second timeout
});

// API call timeout
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000) // 5 second timeout
});

// Impact: Fail fast instead of cascading
// Risk: Low (better to time out than cascade)
```

---

**Option C: Add Bulkheads** (isolate critical paths)
```javascript
// Separate connection pools for critical vs non-critical work
const criticalPool = new Pool({ max: 10 });   // Payments, auth
const nonCriticalPool = new Pool({ max: 5 }); // Analytics, reports

// Critical requests get priority
app.post('/api/payment', async (req, res) => {
  const conn = await criticalPool.connect();
  // ...
});

// Non-critical requests use the separate pool
app.get('/api/analytics', async (req, res) => {
  const conn = await nonCriticalPool.connect();
  // ...
});

// Impact: Critical paths protected from non-critical load
// Risk: None (isolation improves reliability)
```

---

### Long-term (1 hour+)

**Architecture Improvements**:

- [ ] **Circuit breakers** (all external dependencies)
- [ ] **Timeouts** (every network call, database query)
- [ ] **Retries with exponential backoff** (transient failures)
- [ ] **Bulkheads** (isolate critical paths)
- [ ] **Rate limiting** (protect downstream services)
- [ ] **Graceful degradation** (fallback data, cached responses)
- [ ] **Health checks** (detect failures early)
- [ ] **Auto-scaling** (handle load spikes)
- [ ] **Chaos engineering** (test cascade scenarios)
---

## Cascade Prevention Patterns

### 1. Circuit Breaker Pattern
```javascript
const breaker = new CircuitBreaker(riskyOperation, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

breaker.fallback(() => cachedData);
```

**Benefits**:
- Fast failure (don't wait for the timeout)
- Automatic recovery (retries after the reset timeout)
- Fallback data (graceful degradation)

---

### 2. Timeout Pattern
```javascript
// ALWAYS set timeouts
const response = await fetch('/api', {
  signal: AbortSignal.timeout(5000)
});
```

**Benefits**:
- Fail fast (don't cascade indefinite waits)
- Predictable behavior

---

### 3. Bulkhead Pattern
```javascript
// Separate resource pools
const criticalPool = new Pool({ max: 10 });
const nonCriticalPool = new Pool({ max: 5 });
```

**Benefits**:
- Critical paths protected
- Non-critical load can't exhaust resources

---

### 4. Retry with Backoff
```javascript
async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
    }
  }
}
```
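The retry pattern calls a `sleep` helper that isn't defined in the snippet; a self-contained version with the helper filled in (the flaky-call usage is illustrative):

```javascript
// Promise-based sleep, used for the backoff delays.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Same retry-with-backoff pattern, now runnable as-is.
async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
    }
  }
}
```

Wrapping a flaky dependency call, e.g. `retryWithBackoff(() => fetchDashboard())`, retries transient failures twice before surfacing the error.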

**Benefits**:
- Handles transient failures
- Exponential backoff prevents a thundering herd

---

### 5. Load Shedding
```javascript
// Reject requests when overloaded
if (queueDepth > threshold) {
  return res.status(503).send('Overloaded');
}
```

**Benefits**:
- Prevents overload
- Protects downstream services

---
## Escalation

**Escalate to architecture team if**:
- System-wide cascade
- Architectural changes needed

**Escalate to all service owners if**:
- Multiple teams affected
- A coordinated response is needed

**Escalate to management if**:
- Complete outage
- Large customer impact

---

## Prevention Checklist

- [ ] Circuit breakers on all external calls
- [ ] Timeouts on all network operations
- [ ] Retries with exponential backoff
- [ ] Bulkheads for critical paths
- [ ] Rate limiting (protect downstream)
- [ ] Health checks (detect failures early)
- [ ] Auto-scaling (handle load)
- [ ] Graceful degradation (fallback data)
- [ ] Chaos engineering (test failure scenarios)
- [ ] Load testing (find breaking points)

---

## Related Runbooks

- [04-slow-api-response.md](04-slow-api-response.md) - API performance
- [07-service-down.md](07-service-down.md) - Service failures
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create a post-mortem (MANDATORY for cascade failures)
- [ ] Draw a cascade diagram (which services failed, in what order)
- [ ] Identify missing safeguards (circuit breakers, timeouts)
- [ ] Implement prevention patterns
- [ ] Test cascade scenarios (chaos engineering)
- [ ] Update this runbook if needed

---

## Cascade Failure Examples

**Netflix Outage (2012)**:
- Database latency → API timeouts → Frontend failures → Complete outage
- **Fix**: Circuit breakers, timeouts, fallback data

**AWS S3 Outage (2017)**:
- S3 down → Websites using S3 fail → Status dashboards fail (also hosted on S3)
- **Fix**: Multi-region redundancy, fallback to other regions

**Google Cloud Outage (2019)**:
- Network misconfiguration → Internal services fail → External services cascade
- **Fix**: Network configuration validation, staged rollouts

---

## Key Takeaways

1. **Cascades happen when failures propagate** (no circuit breakers, no timeouts)
2. **Fix the root cause first** (not the symptoms)
3. **Fail fast, don't cascade waits** (timeouts everywhere)
4. **Graceful degradation** (fallback beats failure)
5. **Test failure scenarios** (chaos engineering)
464
agents/sre/playbooks/10-rate-limit-exceeded.md
Normal file
@@ -0,0 +1,464 @@
# Playbook: Rate Limit Exceeded

## Symptoms

- "Rate limit exceeded" errors
- "429 Too Many Requests" responses
- "Quota exceeded" messages
- Legitimate requests being blocked
- Monitoring alert: "High rate of 429 errors"

## Severity

- **SEV3** if isolated to specific users/endpoints
- **SEV2** if affecting many users
- **SEV1** if critical functionality is blocked (payments, auth)

## Diagnosis

### Step 1: Identify What's Rate Limited

**Check Error Messages**:
```bash
# Application logs
grep "rate limit\|429\|quota exceeded" /var/log/application.log

# nginx logs
awk '$9 == 429 {print $1, $7}' /var/log/nginx/access.log | sort | uniq -c

# Example output:
#   500 192.168.1.100 /api/users  ← IP hitting the rate limit
#   200 192.168.1.101 /api/posts
```

**Check the Rate Limit Source**:
- **Application-level**: Your code enforcing the limit
- **nginx/API Gateway**: Reverse proxy rate limiting
- **External API**: Third-party service limit (Stripe, Twilio, etc.)
- **Cloud**: AWS API Gateway, Cloudflare

---

### Step 2: Determine If Legitimate or Malicious

**Legitimate traffic**:
```
Scenario: User refreshing dashboard repeatedly
Pattern:  Single user, single endpoint, short burst
Action:   Increase rate limit or add caching
```

**Malicious traffic** (abuse):
```
Scenario: Scraper or bot
Pattern:  Multiple IPs, automated behavior, sustained
Action:   Block IPs, add CAPTCHA
```

**Traffic spike** (legitimate):
```
Scenario: Marketing campaign, viral post
Pattern:  Many users, distributed IPs, real user behavior
Action:   Increase rate limit, scale up
```
---

### Step 3: Check Current Rate Limits

**nginx**:
```bash
# Check nginx.conf
grep "limit_req" /etc/nginx/nginx.conf

# Example:
# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
#                                               ^^^^^ Current limit
```

**Application** (Express.js example):
```javascript
// Check the rate limit middleware
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100,                 // Limit: 100 requests per 15 minutes
});
```

**External API**:
```bash
# Check the external API's documentation, e.g.:
# Stripe: 100 requests per second
# Twilio: 100 requests per second
# Google Maps: $200/month free quota
# (verify against current provider docs - limits change)

# Check current usage
# Stripe:
curl https://api.stripe.com/v1/balance \
  -u sk_test_XXX: \
  -H "Stripe-Account: acct_XXX"

# Rate limit response headers (if the provider sends them):
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 45  ← 45 requests left
```
---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Increase the Rate Limit** (if traffic is legitimate)

**nginx**:
```nginx
# Edit /etc/nginx/nginx.conf
# Increase from 10r/s to 50r/s
limit_req_zone $binary_remote_addr zone=one:10m rate=50r/s;

# Test and reload
# nginx -t && systemctl reload nginx

# Impact: Allows more requests
# Risk: Low (if traffic is legitimate)
```

**Application** (Express.js):
```javascript
// Increase from 100 to 500 requests per 15 min
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 500, // Increased
});

// Restart the application:
// pm2 restart all
```

---

**Option B: Whitelist Specific IPs** (if the source is known and legitimate)

**nginx**:
```nginx
# Whitelist internal IPs and monitoring systems
geo $limit {
    default        1;
    10.0.0.0/8     0;  # Internal network
    192.168.1.100  0;  # Monitoring system
}

map $limit $limit_key {
    0 "";
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=one:10m rate=10r/s;
```

**Application**:
```javascript
const limiter = rateLimit({
  skip: (req) => {
    // Whitelist internal IPs
    return req.ip.startsWith('10.') || req.ip === '192.168.1.100';
  },
  windowMs: 15 * 60 * 1000,
  max: 100,
});
```
---

**Option C: Add Caching** (reduce requests to the backend)

**Redis cache**:
```javascript
const redis = require('redis').createClient();

app.get('/api/users', async (req, res) => {
  // Check the cache first
  const cached = await redis.get('users:' + req.query.id);
  if (cached) {
    return res.json(JSON.parse(cached));
  }

  // Fetch from the database
  const user = await db.query('SELECT * FROM users WHERE id = ?', [req.query.id]);

  // Cache for 5 minutes
  await redis.setex('users:' + req.query.id, 300, JSON.stringify(user));

  res.json(user);
});

// Impact: Reduces backend load, fewer rate limit hits
// Risk: Low (if data staleness is acceptable)
```

---

**Option D: Block Malicious IPs** (if abuse is detected)

**nginx**:
```bash
# Block a specific IP
iptables -A INPUT -s 192.168.1.100 -j DROP

# Or in nginx.conf:
# deny 192.168.1.100;
# deny 192.168.1.0/24;  # Block a range
```

**Cloudflare**:
```
# Cloudflare dashboard:
# Security → WAF → Custom rules
# Block IP: 192.168.1.100
```

---
### Short-term (5 min - 1 hour)

**Option A: Implement Tiered Rate Limits**

**Different limits for different users**:
```javascript
const rateLimit = require('express-rate-limit');

const createLimiter = (max) => rateLimit({
  windowMs: 15 * 60 * 1000,
  max: max,
  keyGenerator: (req) => req.user?.id || req.ip,
});

// Create each limiter ONCE at startup - creating a new limiter per
// request would reset its hit counts and never enforce the limit
const premiumLimiter = createLimiter(1000); // Premium: 1000 req/15min
const userLimiter = createLimiter(300);     // Authenticated: 300 req/15min
const anonLimiter = createLimiter(100);     // Anonymous: 100 req/15min

app.use('/api', (req, res, next) => {
  let limiter;
  if (req.user?.tier === 'premium') {
    limiter = premiumLimiter;
  } else if (req.user) {
    limiter = userLimiter;
  } else {
    limiter = anonLimiter;
  }
  limiter(req, res, next);
});
```

---

**Option B: Add a CAPTCHA** (prevent bots)

**reCAPTCHA** on sensitive endpoints:
```javascript
// Sketch using the express-recaptcha package; check its docs for
// the exact constructor and middleware names for your version
const Recaptcha = require('express-recaptcha').RecaptchaV2;
const recaptcha = new Recaptcha('SITE_KEY', 'SECRET_KEY');

app.post('/api/login', recaptcha.middleware.verify, async (req, res) => {
  if (!req.recaptcha.error) {
    // CAPTCHA valid, proceed with login
    await handleLogin(req, res);
  } else {
    res.status(400).json({ error: 'CAPTCHA failed' });
  }
});
```

---

**Option C: Upgrade the External API Plan** (if hitting an external limit)

**Stripe**:
```
Current: 100 requests/second (free)
Upgrade: Contact Stripe for a higher limit (paid)
```

**AWS API Gateway**:
```bash
# Increase the throttle limit
aws apigateway update-usage-plan \
  --usage-plan-id <ID> \
  --patch-operations \
    op=replace,path=/throttle/rateLimit,value=1000

# Impact: Higher rate limit
# Risk: None (may cost more)
```
---

### Long-term (1 hour+)

- [ ] **Implement tiered rate limits** (premium, authenticated, anonymous)
- [ ] **Add caching** (reduce backend load)
- [ ] **Use a CDN** (cache static content, reduce origin requests)
- [ ] **Add CAPTCHA** (prevent bots on sensitive endpoints)
- [ ] **Monitor rate limit usage** (alert before hitting the limit)
- [ ] **Batch requests** (reduce API calls to external services)
- [ ] **Implement retry with backoff** (for external API rate limits)
- [ ] **Document rate limits** (API documentation for users)
- [ ] **Add rate limit headers** (tell users their remaining quota)

---

## Rate Limit Best Practices

### 1. Return Helpful Headers

**The 429 status and Retry-After come from RFC 6585; the X-RateLimit-* headers are a de facto convention**:
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1698345600   # Unix timestamp
Retry-After: 60                 # Seconds until reset

{
  "error": "Rate limit exceeded",
  "message": "You have exceeded the rate limit of 100 requests per 15 minutes. Try again in 60 seconds."
}
```
**Implementation**:
```javascript
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  standardHeaders: true, // Return RateLimit-* headers
  legacyHeaders: false,
  handler: (req, res) => {
    const resetMs = new Date(req.rateLimit.resetTime).getTime() - Date.now();
    res.status(429).json({
      error: 'Rate limit exceeded',
      message: `You have exceeded the rate limit of ${req.rateLimit.limit} requests per 15 minutes. Try again in ${Math.ceil(resetMs / 1000)} seconds.`,
    });
  },
});
```
---

### 2. Use a Sliding Window (not a Fixed Window)

**Fixed window** (bad):
```
Window 1: 00:00-00:15 (100 requests)
Window 2: 00:15-00:30 (100 requests)

User makes 100 requests at 00:14:59
User makes 100 requests at 00:15:01
→ 200 requests in 2 seconds! (burst)
```

**Sliding window** (good):
```
Rate limit is based on the last 15 minutes from the current time
→ Can't burst (the limit is enforced continuously)
```
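The sliding behavior can be sketched with a minimal in-memory limiter (illustrative only; this is not the express-rate-limit implementation, and production setups usually back the window state with Redis):

```javascript
// Sliding-window rate limiter: allow at most `limit` requests
// within any trailing `windowMs` period, per key (user ID or IP).
class SlidingWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // key -> array of request timestamps
  }

  allow(key, now = Date.now()) {
    const cutoff = now - this.windowMs;
    // Keep only timestamps inside the trailing window
    const recent = (this.hits.get(key) || []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false; // over the limit for this trailing window
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}
```

Unlike a fixed window, a burst at a window boundary still counts against the trailing period, so the 200-requests-in-2-seconds scenario above is rejected.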

---

### 3. Different Limits for Different Endpoints

```javascript
// Expensive endpoint (lower limit)
app.get('/api/analytics', rateLimit({ max: 10 }), handler);

// Cheap endpoint (higher limit)
app.get('/api/health', rateLimit({ max: 1000 }), handler);
```

---
## External API Rate Limit Handling

### Retry with Backoff

```javascript
async function callExternalAPI(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await fetch(url);

      // Check rate limit headers (NaN when the header is absent)
      const remaining = parseInt(response.headers.get('X-RateLimit-Remaining'), 10);
      if (remaining < 10) {
        console.warn('Approaching rate limit:', remaining);
      }

      if (response.status === 429) {
        // Rate limited - honor Retry-After if present
        const retryAfter = Number(response.headers.get('Retry-After')) || 60;
        console.log(`Rate limited, retrying after ${retryAfter}s`);
        await sleep(retryAfter * 1000);
        continue;
      }

      return response.json();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff
    }
  }
}
```
---

## Escalation

**Escalate to developer if**:
- Application rate limit logic needs changes
- Caching needs to be implemented

**Escalate to infrastructure if**:
- nginx/API Gateway rate limit config
- Capacity needs to be scaled up

**Escalate to external vendor if**:
- Hitting an external API rate limit
- A higher quota is needed

---

## Related Runbooks

- [05-ddos-attack.md](05-ddos-attack.md) - If malicious traffic
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create a post-mortem (if SEV1/SEV2)
- [ ] Identify why the rate limit was hit
- [ ] Adjust rate limits (if needed)
- [ ] Add monitoring (alert before hitting the limit)
- [ ] Document rate limits (for users/API consumers)
- [ ] Update this runbook if needed

---

## Useful Commands Reference

```bash
# Check 429 errors (nginx)
awk '$9 == 429 {print $1}' /var/log/nginx/access.log | sort | uniq -c

# Check rate limit config (nginx)
grep "limit_req" /etc/nginx/nginx.conf

# Block an IP (iptables)
iptables -A INPUT -s <IP> -j DROP

# Test a rate limit
for i in {1..200}; do curl http://localhost/api; done

# Check an external API's rate limit
curl -I https://api.example.com -H "Authorization: Bearer TOKEN"
# Look for X-RateLimit-* headers
```