# Playbook: Cascade Failure

## Symptoms

- Multiple services failing simultaneously
- Failures spreading across services
- Dependency services timing out
- Error rate increasing exponentially
- Monitoring alerts: "Multiple services degraded", "Cascade detected"

## Severity

- **SEV1** - Cascade affecting production services

## What is a Cascade Failure?

**Definition**: One service failure triggers failures in dependent services, spreading through the system.

**Example**:
```
Database slow (2s queries)
        ↓
API times out waiting for database (5s timeout)
        ↓
Frontend times out waiting for API (10s timeout)
        ↓
Load balancer marks frontend unhealthy
        ↓
Traffic routes to other frontends, overloading them
        ↓
All frontends fail → Complete outage
```

---

## Diagnosis

### Step 1: Identify Initial Failure Point

**Check Service Dependencies**:
```
Frontend → API → Database
            ↓
         Cache (Redis)
            ↓
         Queue (RabbitMQ)
            ↓
         External API
```

**Find the root**:
```bash
# Check service health (start with leaf dependencies)
# 1. Database
psql -c "SELECT 1"

# 2. Cache
redis-cli PING

# 3. Queue
rabbitmqctl status

# 4. External API
curl https://api.external.com/health

# First failure = likely root cause
```

---

### Step 2: Trace Failure Propagation

**Check Service Logs** (in order):
```bash
# Database logs (first)
tail -100 /var/log/postgresql/postgresql.log

# API logs (second)
tail -100 /var/log/api/error.log

# Frontend logs (third)
tail -100 /var/log/frontend/error.log
```

**Look for timestamps**:
```
14:00:00 - Database: Slow query (7s)   ← ROOT CAUSE
14:00:05 - API: Timeout error
14:00:10 - Frontend: API unavailable
14:00:15 - Load balancer: All frontends unhealthy
```
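
The ordering can also be reconstructed mechanically. A minimal sketch (not part of any existing tooling) that pulls the first error-ish line out of each log and sorts services by when they started failing; it assumes the log paths above and an `HH:MM:SS` timestamp at the start of each line:

```javascript
// find-root.js - order services by when their errors started (sketch)
const fs = require('fs');

// Assumed log locations - adjust to your environment
const logs = {
  database: '/var/log/postgresql/postgresql.log',
  api: '/var/log/api/error.log',
  frontend: '/var/log/frontend/error.log',
};

const firstError = {};
for (const [service, path] of Object.entries(logs)) {
  // For very large logs, tail them into a temp file first
  const lines = fs.readFileSync(path, 'utf8').split('\n');
  // First line that starts with HH:MM:SS and mentions an error/timeout
  const hit = lines.find((l) => /^\d{2}:\d{2}:\d{2}/.test(l) && /error|timeout|slow/i.test(l));
  if (hit) firstError[service] = hit.slice(0, 8);
}

// Earliest timestamp = likely root cause
Object.entries(firstError)
  .sort(([, a], [, b]) => a.localeCompare(b))
  .forEach(([service, ts]) => console.log(`${ts}  ${service}`));
```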

---

### Step 3: Assess Cascade Depth

**How many layers are affected?** (A quick probe sketch follows this list.)

- **1 layer**: Database only (isolated failure)
- **2-3 layers**: Database → API → Frontend (cascade)
- **4+ layers**: Full system cascade (critical)
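
To gauge depth quickly, probe each layer's health endpoint and count the failures. A minimal sketch using Node 18+'s built-in `fetch`; the hostnames and `/health` paths are placeholders for your environment:

```javascript
// cascade-depth.js - probe each layer and count how many are unhealthy (sketch)
const layers = [
  { name: 'database', url: 'http://db-proxy.internal/health' },  // hypothetical endpoints
  { name: 'api',      url: 'http://api.internal/health' },
  { name: 'frontend', url: 'http://frontend.internal/health' },
  { name: 'lb',       url: 'http://lb.internal/health' },
];

async function probe({ name, url }) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(2000) }); // fail fast
    return { name, healthy: res.ok };
  } catch {
    return { name, healthy: false }; // timeout or connection error counts as unhealthy
  }
}

(async () => {
  const results = await Promise.all(layers.map(probe));
  const failing = results.filter((r) => !r.healthy);
  console.log(`Affected layers: ${failing.length}`, failing.map((r) => r.name));
})();
```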

---

## Mitigation

### Immediate (Now - 5 min)

**PRIORITY: Stop the cascade from spreading**

**Option A: Circuit Breaker** (if not already enabled)
```javascript
// Enable a circuit breaker manually
// Prevents the API from overwhelming the database

const CircuitBreaker = require('opossum');

// queryDatabase is your existing query function; wrap it in the breaker
const dbQuery = new CircuitBreaker(queryDatabase, {
  timeout: 3000,                 // 3s timeout per call
  errorThresholdPercentage: 50,  // Open after 50% failures
  resetTimeout: 30000            // Try again after 30s
});

dbQuery.on('open', () => {
  console.log('Circuit breaker OPEN - using fallback');
});

// Use fallback when the circuit is open
dbQuery.fallback(() => {
  return cachedData; // Return previously cached data instead of hitting the database
});

// Call dbQuery.fire(...) wherever queryDatabase(...) was called directly
```

**Option B: Rate Limiting** (protect downstream)
```nginx
# Limit request rate into the API (protects downstream services like the database)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://api-backend;
}
```

**Option C: Shed Load** (reject non-critical requests)
```javascript
// Reject non-critical requests when overloaded (Express middleware, assumes an existing `app`)
const os = require('os');

// Rough load signal: 1-minute load average per CPU
// (swap in queue depth, memory pressure, etc. for your service)
function getCurrentLoad() {
  return os.loadavg()[0] / os.cpus().length;
}

function isCriticalEndpoint(path) {
  return ['/api/health', '/api/payment'].includes(path);
}

app.use((req, res, next) => {
  const load = getCurrentLoad();

  if (load > 0.8 && !isCriticalEndpoint(req.path)) {
    return res.status(503).json({
      error: 'Service overloaded, try again later'
    });
  }

  next();
});
```

**Option D: Isolate Failure** (take failing service offline)
```bash
# Remove the failing service from the load balancer
# AWS ELB:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0

# nginx: comment out the failing backend in the upstream block
# upstream api {
#     server api1.example.com;    # Healthy
#     # server api2.example.com;  # FAILING - commented out
# }

# Impact: Prevents failing service from affecting others
# Risk: Reduced capacity
```

---

### Short-term (5 min - 1 hour)

**Option A: Fix Root Cause**

**If database slow**:
```sql
-- Add the missing index (CONCURRENTLY avoids blocking writes while it builds)
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login_at);
```

**If external API slow**:
```javascript
// Add timeout + fallback
let response;
try {
  response = await fetch('https://api.external.com', {
    signal: AbortSignal.timeout(2000) // 2s timeout - abort instead of waiting
  });
} catch (err) {
  return fallbackData; // Timed out - don't cascade the failure
}

if (!response.ok) {
  return fallbackData; // Upstream error - don't cascade the failure
}
```

**If service overloaded**:
```bash
# Scale horizontally (add more instances)
# AWS Auto Scaling:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 10   # Was 5
```

---

**Option B: Add Timeouts** (prevent indefinite waiting)
```javascript
// Database query timeout
// (exact syntax depends on your client; e.g. node-postgres accepts statement_timeout on the pool/client)
const result = await db.query('SELECT * FROM users', {
  timeout: 3000 // 3 second timeout
});

// API call timeout
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000) // 5 second timeout
});

// Impact: Fail fast instead of cascading
// Risk: Low (better to timeout than cascade)
```

---

**Option C: Add Bulkheads** (isolate critical paths)
```javascript
// Separate connection pools for critical vs non-critical work
const { Pool } = require('pg');

const criticalPool = new Pool({ max: 10 });   // Payments, auth
const nonCriticalPool = new Pool({ max: 5 }); // Analytics, reports

// Critical requests get their own pool
app.post('/api/payment', async (req, res) => {
  const conn = await criticalPool.connect();
  // ...
});

// Non-critical requests use a separate pool
app.get('/api/analytics', async (req, res) => {
  const conn = await nonCriticalPool.connect();
  // ...
});

// Impact: Critical paths protected from non-critical load
// Risk: None (isolation improves reliability)
```

---

### Long-term (1 hour+)

**Architecture Improvements**:

- [ ] **Circuit breakers** (all external dependencies)
- [ ] **Timeouts** (every network call, database query)
- [ ] **Retries with exponential backoff** (transient failures)
- [ ] **Bulkheads** (isolate critical paths)
- [ ] **Rate limiting** (protect downstream services)
- [ ] **Graceful degradation** (fallback data, cached responses)
- [ ] **Health checks** (detect failures early; see the sketch after this list)
- [ ] **Auto-scaling** (handle load spikes)
- [ ] **Chaos engineering** (test cascade scenarios)
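
Most of these items are illustrated elsewhere in this playbook; dependency-aware health checks are not, so here is a minimal sketch. It assumes an Express `app`, a pg pool named `db`, and a Redis client named `redis` already exist (all placeholder names):

```javascript
// Health endpoint that exercises dependencies, so the load balancer and
// alerting can detect failures early instead of after a cascade has spread.
app.get('/health', async (req, res) => {
  const checks = {};

  try {
    await db.query('SELECT 1');  // database reachable?
    checks.database = 'ok';
  } catch (err) {
    checks.database = 'fail';
  }

  try {
    await redis.ping();          // cache reachable?
    checks.cache = 'ok';
  } catch (err) {
    checks.cache = 'fail';
  }

  const healthy = Object.values(checks).every((v) => v === 'ok');
  res.status(healthy ? 200 : 503).json({ healthy, checks });
});
```

Whether the load balancer should pull an instance whose *shared* dependency is failing is a judgment call: for a dependency every instance shares (e.g. the primary database), reporting degraded-but-alive avoids the load balancer amplifying the cascade by draining the whole fleet.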

---

## Cascade Prevention Patterns

### 1. Circuit Breaker Pattern
```javascript
const CircuitBreaker = require('opossum');

// riskyOperation: any async call to an external dependency
const breaker = new CircuitBreaker(riskyOperation, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

breaker.fallback(() => cachedData);
```

**Benefits**:
- Fast failure (don't wait for timeout)
- Automatic recovery (reset after timeout)
- Fallback data (graceful degradation)

---

### 2. Timeout Pattern
```javascript
// ALWAYS set timeouts
const response = await fetch('/api', {
  signal: AbortSignal.timeout(5000)
});
```

**Benefits**:
- Fail fast (don't cascade indefinite waits)
- Predictable behavior

---

### 3. Bulkhead Pattern
```javascript
// Separate resource pools
const criticalPool = new Pool({ max: 10 });
const nonCriticalPool = new Pool({ max: 5 });
```

**Benefits**:
- Critical paths protected
- Non-critical load can't exhaust resources

---

### 4. Retry with Backoff
```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
    }
  }
}
```

**Benefits**:
- Handles transient failures
- Exponential backoff prevents thundering herd

---

### 5. Load Shedding
```javascript
// Reject requests when overloaded
if (queueDepth > threshold) {
  return res.status(503).send('Overloaded');
}
```

**Benefits**:
- Prevent overload
- Protect downstream services

---

## Escalation

**Escalate to architecture team if**:
- System-wide cascade
- Architectural changes needed

**Escalate to all service owners if**:
- Multiple teams affected
- Need coordinated response

**Escalate to management if**:
- Complete outage
- Large customer impact

---

## Prevention Checklist

- [ ] Circuit breakers on all external calls
- [ ] Timeouts on all network operations
- [ ] Retries with exponential backoff
- [ ] Bulkheads for critical paths
- [ ] Rate limiting (protect downstream)
- [ ] Health checks (detect failures early)
- [ ] Auto-scaling (handle load)
- [ ] Graceful degradation (fallback data)
- [ ] Chaos engineering (test failure scenarios; see the sketch after this list)
- [ ] Load testing (find breaking points)
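
The last two items are the ones teams most often skip. One lightweight starting point is to wrap a dependency call with injected latency and errors in a staging environment and confirm that the timeouts, circuit breakers, and fallbacks above actually engage. A minimal sketch; `chaosWrap` is a hypothetical helper, not an existing library:

```javascript
// chaos-wrap.js - wrap a dependency call with injected latency/errors (test environments only)
function chaosWrap(fn, { errorRate = 0.2, maxDelayMs = 3000 } = {}) {
  return async (...args) => {
    // Random extra latency, to verify timeouts fire
    await new Promise((resolve) => setTimeout(resolve, Math.random() * maxDelayMs));

    // Random failures, to verify circuit breakers open and fallbacks engage
    if (Math.random() < errorRate) {
      throw new Error('chaos: injected dependency failure');
    }
    return fn(...args);
  };
}

// Usage (staging only): swap in the wrapped call and watch dashboards/alerts respond
// const queryDatabase = chaosWrap(realQueryDatabase, { errorRate: 0.3 });
```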

---

## Related Runbooks

- [04-slow-api-response.md](04-slow-api-response.md) - API performance
- [07-service-down.md](07-service-down.md) - Service failures
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (MANDATORY for cascade failures)
- [ ] Draw cascade diagram (which services failed in order)
- [ ] Identify missing safeguards (circuit breakers, timeouts)
- [ ] Implement prevention patterns
- [ ] Test cascade scenarios (chaos engineering)
- [ ] Update this runbook if needed

---

## Cascade Failure Examples

**Netflix Outage (2012)**:
- Database latency → API timeouts → Frontend failures → Complete outage
- **Fix**: Circuit breakers, timeouts, fallback data

**AWS S3 Outage (2017)**:
- S3 down → Websites using S3 fail → Status dashboards fail (also on S3)
- **Fix**: Multi-region redundancy, fallback to different regions

**Google Cloud Outage (2019)**:
- Network misconfiguration → Internal services fail → External services cascade
- **Fix**: Network configuration validation, staged rollouts

---

## Key Takeaways

1. **Cascades happen when failures propagate** (no circuit breakers or timeouts)
2. **Fix the root cause first** (not the symptoms)
3. **Fail fast, don't cascade waits** (timeouts everywhere)
4. **Degrade gracefully** (fallback > failure)
5. **Test failure scenarios** (chaos engineering)