Playbook: Cascade Failure

Symptoms

  • Multiple services failing simultaneously
  • Failures spreading across services
  • Dependency services timing out
  • Error rate increasing exponentially
  • Monitoring alert: "Multiple services degraded", "Cascade detected"

Severity

  • SEV1 - Cascade affecting production services

What is a Cascade Failure?

Definition: One service failure triggers failures in dependent services, spreading through the system.

Example:

Database slow (2s queries)
  ↓
API times out waiting for database (5s timeout)
  ↓
Frontend times out waiting for API (10s timeout)
  ↓
Load balancer marks frontend unhealthy
  ↓
Traffic routes to other frontends (overloading them)
  ↓
All frontends fail → Complete outage

Diagnosis

Step 1: Identify Initial Failure Point

Check Service Dependencies:

Frontend → API ─┬→ Database
                ├→ Cache (Redis)
                ├→ Queue (RabbitMQ)
                └→ External API

Find the root:

# Check service health (start with leaf dependencies)
# 1. Database
psql -c "SELECT 1"

# 2. Cache
redis-cli PING

# 3. Queue
rabbitmqctl status

# 4. External API
curl https://api.external.com/health

# First failure = likely root cause
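
Under incident pressure it helps to script these probes so the first failing dependency surfaces immediately. A minimal Node sketch; the commands mirror the manual checks above and are assumptions about your environment:

// check-deps.js - probe leaf dependencies in order; first failure = likely root cause
const { execSync } = require('child_process');

const checks = [
  { name: 'Database', cmd: 'psql -c "SELECT 1"' },
  { name: 'Cache', cmd: 'redis-cli PING' },
  { name: 'Queue', cmd: 'rabbitmqctl status' },
  { name: 'External API', cmd: 'curl -sf https://api.external.com/health' },
];

for (const { name, cmd } of checks) {
  try {
    execSync(cmd, { stdio: 'ignore', timeout: 5000 }); // don't let a probe itself hang
    console.log(`OK    ${name}`);
  } catch {
    console.log(`FAIL  ${name}  <- investigate this first`);
  }
}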

Step 2: Trace Failure Propagation

Check Service Logs (in order):

# Database logs (first)
tail -100 /var/log/postgresql/postgresql.log

# API logs (second)
tail -100 /var/log/api/error.log

# Frontend logs (third)
tail -100 /var/log/frontend/error.log

Look for timestamps:

14:00:00 - Database: Slow query (7s)  ← ROOT CAUSE
14:00:05 - API: Timeout error
14:00:10 - Frontend: API unavailable
14:00:15 - Load balancer: All frontends unhealthy

Step 3: Assess Cascade Depth

How many layers affected?

  • 1 layer: Database only (isolated failure)
  • 2-3 layers: Database → API → Frontend (cascade)
  • 4+ layers: Full system cascade (critical)
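
To make the depth count concrete, you can walk the dependency graph upward from the failing root. A minimal sketch; the topology map is an assumption based on the Step 1 diagram:

// Who depends on whom: service -> services that call it
const dependents = {
  database: ['api'],
  cache: ['api'],
  queue: ['api'],
  'external-api': ['api'],
  api: ['frontend'],
  frontend: [],
};

// Count how many layers of unhealthy services sit above the root failure
function cascadeDepth(root, unhealthy) {
  let depth = 1; // the root layer itself
  let frontier = [root];
  const seen = new Set(frontier);
  while (frontier.length > 0) {
    const next = frontier
      .flatMap((svc) => dependents[svc] || [])
      .filter((svc) => unhealthy.has(svc) && !seen.has(svc));
    next.forEach((svc) => seen.add(svc));
    if (next.length > 0) depth += 1;
    frontier = next;
  }
  return depth;
}

const unhealthy = new Set(['database', 'api', 'frontend']);
console.log(cascadeDepth('database', unhealthy)); // 3 -> cascade (2-3 layers)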

Mitigation

Immediate (Now - 5 min)

PRIORITY: Stop the cascade from spreading

Option A: Circuit Breaker (if not already enabled)

// Enable a circuit breaker manually
// Prevents the API from overwhelming the database
// (queryDatabase is your existing async DB call)

const CircuitBreaker = require('opossum');

const dbQuery = new CircuitBreaker(queryDatabase, {
  timeout: 3000,        // 3s timeout
  errorThresholdPercentage: 50,  // Open after 50% failures
  resetTimeout: 30000   // Try again after 30s
});

dbQuery.on('open', () => {
  console.log('Circuit breaker OPEN - using fallback');
});

// Use fallback when the circuit is open
dbQuery.fallback(() => {
  return cachedData; // Return previously cached data instead
});
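
With opossum, calls go through the breaker via fire() rather than calling the database directly. A usage sketch, assuming queryDatabase accepts (sql, params):

// Route queries through the breaker; if the circuit is open,
// the fallback above returns cached data instead of hitting the DB
const users = await dbQuery.fire('SELECT * FROM users WHERE id = $1', [userId]);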

Option B: Rate Limiting (protect downstream)

# Throttle requests at the API tier (nginx) to protect the database behind it
# (limit_req_zone belongs in the http {} context)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
  limit_req zone=api burst=20 nodelay;
  proxy_pass http://api-backend;
}

Option C: Shed Load (reject non-critical requests)

// Reject non-critical requests when overloaded
app.use((req, res, next) => {
  const load = getCurrentLoad(); // CPU, memory, queue depth

  if (load > 0.8 && !isCriticalEndpoint(req.path)) {
    return res.status(503).json({
      error: 'Service overloaded, try again later'
    });
  }

  next();
});

function isCriticalEndpoint(path) {
  return ['/api/health', '/api/payment'].includes(path);
}
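
getCurrentLoad() above is deliberately abstract. One possible implementation, assuming "load" means normalized CPU load average (os.loadavg() is Linux/macOS only; it returns zeros on Windows):

const os = require('os');

// 1-minute load average divided by core count: ~1.0 means all cores busy
function getCurrentLoad() {
  return os.loadavg()[0] / os.cpus().length;
}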

Option D: Isolate Failure (take failing service offline)

# Remove failing service from load balancer
# AWS ELB:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0

# nginx:
# Comment out failing backend in upstream block
# upstream api {
#   server api1.example.com;  # Healthy
#   # server api2.example.com;  # FAILING - commented out
# }

# Impact: Prevents failing service from affecting others
# Risk: Reduced capacity

Short-term (5 min - 1 hour)

Option A: Fix Root Cause

If database slow:

-- Add missing index
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login_at);

If external API slow:

// Add a timeout + fallback (standard fetch has no `timeout` option;
// use an AbortSignal, as in Option B below)
const response = await fetch('https://api.external.com', {
  signal: AbortSignal.timeout(2000)  // 2s timeout
});

if (!response.ok) {
  return fallbackData; // Don't cascade the failure
}

If service overloaded:

# Scale horizontally (add more instances)
# AWS Auto Scaling:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 10  # Was 5

Option B: Add Timeouts (prevent indefinite waiting)

// Database query timeout (the option name varies by driver,
// e.g. `timeout` in mysql2 or `query_timeout` on a node-postgres client)
const result = await db.query('SELECT * FROM users', {
  timeout: 3000  // 3 second timeout
});

// API call timeout
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000)  // 5 second timeout
});

// Impact: Fail fast instead of cascading
// Risk: Low (better to timeout than cascade)

Option C: Add Bulkheads (isolate critical paths)

// Separate connection pools for critical vs non-critical traffic
const { Pool } = require('pg'); // example uses node-postgres

const criticalPool = new Pool({ max: 10 }); // Payments, auth
const nonCriticalPool = new Pool({ max: 5 }); // Analytics, reports

// Critical requests draw from their own pool
app.post('/api/payment', async (req, res) => {
  const conn = await criticalPool.connect();
  try {
    // ...
  } finally {
    conn.release(); // always return the connection to the pool
  }
});

// Non-critical requests use the separate pool
app.get('/api/analytics', async (req, res) => {
  const conn = await nonCriticalPool.connect();
  try {
    // ...
  } finally {
    conn.release();
  }
});

// Impact: Critical paths protected from non-critical load
// Risk: Low (pool sizes need tuning, but the isolation improves reliability)

Long-term (1 hour+)

Architecture Improvements:

  • Circuit Breakers (all external dependencies)
  • Timeouts (every network call, database query)
  • Retries with exponential backoff (transient failures)
  • Bulkheads (isolate critical paths)
  • Rate limiting (protect downstream services)
  • Graceful degradation (fallback data, cached responses)
  • Health checks (detect failures early; see the sketch after this list)
  • Auto-scaling (handle load spikes)
  • Chaos engineering (test cascade scenarios)
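
Most of these patterns are shown elsewhere in this playbook; health checks are not, so here is a minimal Express sketch (the endpoint path and the db client are assumptions about your stack):

const express = require('express');
const app = express();

// Shallow health check: the process is up and can reach its critical dependency.
// Load balancers poll this and pull the instance before it drags others down.
app.get('/api/health', async (req, res) => {
  try {
    await db.query('SELECT 1'); // db: your existing database client/pool
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'degraded', error: err.message });
  }
});

app.listen(3000);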

Cascade Prevention Patterns

1. Circuit Breaker Pattern

const CircuitBreaker = require('opossum');

const breaker = new CircuitBreaker(riskyOperation, { // riskyOperation: any async function
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

breaker.fallback(() => cachedData); // cachedData: your graceful-degradation value

Benefits:

  • Fast failure (don't wait for timeout)
  • Automatic recovery (reset after timeout)
  • Fallback data (graceful degradation)

2. Timeout Pattern

// ALWAYS set timeouts
const response = await fetch('/api', {
  signal: AbortSignal.timeout(5000)
});

Benefits:

  • Fail fast (don't cascade indefinite waits)
  • Predictable behavior

3. Bulkhead Pattern

// Separate resource pools
const criticalPool = new Pool({ max: 10 });
const nonCriticalPool = new Pool({ max: 5 });

Benefits:

  • Critical paths protected
  • Non-critical load can't exhaust resources

4. Retry with Backoff

// Small sleep helper used below
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
    }
  }
}
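
Usage, wrapping a flaky network call (the URL is illustrative):

const data = await retryWithBackoff(() =>
  fetch('/api/data', { signal: AbortSignal.timeout(5000) }).then((res) => {
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res.json();
  })
);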

Benefits:

  • Handles transient failures
  • Exponential backoff (ideally with random jitter) prevents thundering herds

5. Load Shedding

// Reject requests when overloaded
if (queueDepth > threshold) {
  return res.status(503).send('Overloaded');
}

Benefits:

  • Prevent overload
  • Protect downstream services

Escalation

Escalate to architecture team if:

  • System-wide cascade
  • Architectural changes needed

Escalate to all service owners if:

  • Multiple teams affected
  • Need coordinated response

Escalate to management if:

  • Complete outage
  • Large customer impact

Prevention Checklist

  • Circuit breakers on all external calls
  • Timeouts on all network operations
  • Retries with exponential backoff
  • Bulkheads for critical paths
  • Rate limiting (protect downstream)
  • Health checks (detect failures early)
  • Auto-scaling (handle load)
  • Graceful degradation (fallback data)
  • Chaos engineering (test failure scenarios)
  • Load testing (find breaking points)


Post-Incident

After resolving:

  • Create post-mortem (MANDATORY for cascade failures)
  • Draw cascade diagram (which services failed in order)
  • Identify missing safeguards (circuit breakers, timeouts)
  • Implement prevention patterns
  • Test cascade scenarios (chaos engineering)
  • Update this runbook if needed

Cascade Failure Examples

Netflix Outage (2012):

  • Database latency → API timeouts → Frontend failures → Complete outage
  • Fix: Circuit breakers, timeouts, fallback data

AWS S3 Outage (2017):

  • S3 down → Websites using S3 fail → Status dashboards fail (also on S3)
  • Fix: Multi-region redundancy, fallback to different regions

Google Cloud Outage (2019):

  • Network misconfiguration → Internal services fail → External services cascade
  • Fix: Network configuration validation, staged rollouts

Key Takeaways

  1. Cascades happen when failures propagate unchecked (no circuit breakers or timeouts)
  2. Fix the root cause first (not the symptoms)
  3. Fail fast, don't cascade waits (timeouts everywhere)
  4. Graceful degradation (fallback > failure)
  5. Test failure scenarios (chaos engineering)