Playbook: Cascade Failure
Symptoms
- Multiple services failing simultaneously
- Failures spreading across services
- Dependency services timing out
- Error rate increasing exponentially
- Monitoring alert: "Multiple services degraded", "Cascade detected"
Severity
- SEV1 - Cascade affecting production services
What is a Cascade Failure?
Definition: One service failure triggers failures in dependent services, spreading through the system.
Example:
Database slow (2s queries)
↓
API times out waiting for database (5s timeout)
↓
Frontend times out waiting for API (10s timeout)
↓
Load balancer marks frontend unhealthy
↓
Traffic routes to other frontends (overloading them)
↓
All frontends fail → Complete outage
Diagnosis
Step 1: Identify Initial Failure Point
Check Service Dependencies:
Frontend → API → Database
            ├─→ Cache (Redis)
            ├─→ Queue (RabbitMQ)
            └─→ External API
Find the root:
# Check service health (start with leaf dependencies)
# 1. Database
psql -c "SELECT 1"
# 2. Cache
redis-cli PING
# 3. Queue
rabbitmqctl status
# 4. External API
curl https://api.external.com/health
# First failure = likely root cause
Step 2: Trace Failure Propagation
Check Service Logs (in order):
# Database logs (first)
tail -100 /var/log/postgresql/postgresql.log
# API logs (second)
tail -100 /var/log/api/error.log
# Frontend logs (third)
tail -100 /var/log/frontend/error.log
Look for timestamps:
14:00:00 - Database: Slow query (7s) ← ROOT CAUSE
14:00:05 - API: Timeout error
14:00:10 - Frontend: API unavailable
14:00:15 - Load balancer: All frontends unhealthy
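To build a timeline like the one above across services, it can help to merge the error lines from each log and sort them by timestamp. A rough sketch, assuming the log paths above and log lines that begin with an ISO-style timestamp (adjust paths and the filter to your setup):
// Hypothetical log-merge helper (Node.js): prints error/timeout lines from all
// three logs in timestamp order, assuming each line starts with a sortable timestamp.
const fs = require('fs');

const logs = [
  '/var/log/postgresql/postgresql.log',
  '/var/log/api/error.log',
  '/var/log/frontend/error.log'
];

const lines = logs.flatMap((file) =>
  fs.readFileSync(file, 'utf8')
    .split('\n')
    .filter((line) => /error|timeout|slow/i.test(line))
    .map((line) => ({ file, line }))
);

lines.sort((a, b) => a.line.localeCompare(b.line)); // ISO timestamps sort as strings
lines.forEach(({ file, line }) => console.log(`${file}: ${line}`));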
Step 3: Assess Cascade Depth
How many layers affected?
- 1 layer: Database only (isolated failure)
- 2-3 layers: Database → API → Frontend (cascade)
- 4+ layers: Full system cascade (critical)
Mitigation
Immediate (Now - 5 min)
PRIORITY: Stop the cascade from spreading
Option A: Circuit Breaker (if not already enabled)
// Enable circuit breaker manually
// Prevents API from overwhelming database
const CircuitBreaker = require('opossum');
const dbQuery = new CircuitBreaker(queryDatabase, {
  timeout: 3000,                 // 3s timeout
  errorThresholdPercentage: 50,  // Open after 50% failures
  resetTimeout: 30000            // Try again after 30s
});
dbQuery.on('open', () => {
  console.log('Circuit breaker OPEN - using fallback');
});
// Use fallback when circuit open
dbQuery.fallback(() => {
  return cachedData; // Return cached data instead
});
Option B: Rate Limiting (protect downstream)
# Limit request rate into the API (nginx), protecting the database behind it
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://api-backend;
}
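If nginx is not in front of the affected service, the same protection can be applied in-process. A minimal token-bucket sketch (the rate, burst size, and route are illustrative assumptions, not from this runbook):
// Hypothetical in-process limiter: roughly 10 requests/second with a burst of 20.
// This is a global limit; key the bucket by client IP for per-client limits.
const RATE = 10;    // tokens refilled per second
const BURST = 20;   // bucket capacity
let tokens = BURST;
let last = Date.now();

app.use('/api/', (req, res, next) => {
  const now = Date.now();
  tokens = Math.min(BURST, tokens + ((now - last) / 1000) * RATE);
  last = now;
  if (tokens < 1) {
    return res.status(429).send('Rate limit exceeded, try again later');
  }
  tokens -= 1;
  next();
});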
Option C: Shed Load (reject non-critical requests)
// Reject non-critical requests when overloaded
app.use((req, res, next) => {
  const load = getCurrentLoad(); // CPU, memory, queue depth
  if (load > 0.8 && !isCriticalEndpoint(req.path)) {
    return res.status(503).json({
      error: 'Service overloaded, try again later'
    });
  }
  next();
});

function isCriticalEndpoint(path) {
  return ['/api/health', '/api/payment'].includes(path);
}
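getCurrentLoad() above is application-specific. A minimal sketch for a Node.js service (an assumption, not part of the original middleware), using the 1-minute load average normalized by CPU count:
// Hypothetical helper: approximates "load" as 1-minute load average per CPU.
// Swap in event-loop lag or queue depth if those better reflect saturation.
const os = require('os');

function getCurrentLoad() {
  return os.loadavg()[0] / os.cpus().length; // ~1.0 means all CPUs busy
}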
Option D: Isolate Failure (take failing service offline)
# Remove failing service from load balancer
# AWS ELB:
aws elbv2 deregister-targets \
--target-group-arn <arn> \
--targets Id=i-1234567890abcdef0
# nginx:
# Comment out failing backend in upstream block
# upstream api {
# server api1.example.com; # Healthy
# # server api2.example.com; # FAILING - commented out
# }
# Impact: Prevents failing service from affecting others
# Risk: Reduced capacity
Short-term (5 min - 1 hour)
Option A: Fix Root Cause
If database slow:
-- Add missing index
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login_at);
If external API slow:
// Add a timeout + fallback so a slow external API can't stall us
// (native fetch has no `timeout` option - use AbortSignal.timeout instead)
try {
  const response = await fetch('https://api.external.com', {
    signal: AbortSignal.timeout(2000) // abort after 2s
  });
  if (!response.ok) return fallbackData; // Don't cascade the failure
  return await response.json();
} catch (err) {
  return fallbackData; // Timed out or network error - degrade gracefully
}
If service overloaded:
# Scale horizontally (add more instances)
# AWS Auto Scaling:
aws autoscaling set-desired-capacity \
--auto-scaling-group-name my-asg \
--desired-capacity 10 # Was 5
Option B: Add Timeouts (prevent indefinite waiting)
// Database query timeout
const result = await db.query('SELECT * FROM users', {
  timeout: 3000 // 3 second timeout
});

// API call timeout
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000) // 5 second timeout
});
// Impact: Fail fast instead of cascading
// Risk: Low (better to timeout than cascade)
Option C: Add Bulkheads (isolate critical paths)
// Separate connection pools for critical vs non-critical
const criticalPool = new Pool({ max: 10 });   // Payments, auth
const nonCriticalPool = new Pool({ max: 5 }); // Analytics, reports

// Critical requests get priority
app.post('/api/payment', async (req, res) => {
  const conn = await criticalPool.connect();
  // ...
});

// Non-critical requests use separate pool
app.get('/api/analytics', async (req, res) => {
  const conn = await nonCriticalPool.connect();
  // ...
});
// Impact: Critical paths protected from non-critical load
// Risk: Minimal (each pool is smaller, but the isolation improves reliability)
Long-term (1 hour+)
Architecture Improvements:
- Circuit Breakers (all external dependencies)
- Timeouts (every network call, database query)
- Retries with exponential backoff (transient failures)
- Bulkheads (isolate critical paths)
- Rate limiting (protect downstream services)
- Graceful degradation (fallback data, cached responses; sketch after this list)
- Health checks (detect failures early)
- Auto-scaling (handle load spikes)
- Chaos engineering (test cascade scenarios)
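A minimal sketch of graceful degradation, assuming a hypothetical fetchFromPrimary() data source (an illustrative name, not part of this runbook): keep the last good response and serve it, even if stale, when the primary dependency fails.
// Hypothetical fallback cache - fetchFromPrimary() and the cache shape are
// illustrative assumptions.
let lastGood = null; // { data, fetchedAt }

async function getProductList() {
  try {
    const fresh = await fetchFromPrimary();   // app-specific call
    lastGood = { data: fresh, fetchedAt: Date.now() };
    return fresh;
  } catch (err) {
    if (lastGood) return lastGood.data;       // degrade: stale data beats an error
    throw err;                                // nothing cached yet - surface the failure
  }
}
This is the same idea as the cachedData fallback used with the circuit breaker above.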
Cascade Prevention Patterns
1. Circuit Breaker Pattern
const breaker = new CircuitBreaker(riskyOperation, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});
breaker.fallback(() => cachedData);
Benefits:
- Fast failure (don't wait for timeout)
- Automatic recovery (reset after timeout)
- Fallback data (graceful degradation)
2. Timeout Pattern
// ALWAYS set timeouts
const response = await fetch('/api', {
  signal: AbortSignal.timeout(5000)
});
Benefits:
- Fail fast (don't cascade indefinite waits)
- Predictable behavior
3. Bulkhead Pattern
// Separate resource pools
const criticalPool = new Pool({ max: 10 });
const nonCriticalPool = new Pool({ max: 5 });
Benefits:
- Critical paths protected
- Non-critical load can't exhaust resources
4. Retry with Backoff
// sleep helper so the snippet is self-contained
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
    }
  }
}
Benefits:
- Handles transient failures
- Exponential backoff spreads retries out; add jitter (see the variant below) to avoid a thundering herd
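One common refinement, not shown in the snippet above: randomize ("jitter") the delay so that many clients retrying at once don't all hit the recovering service at the same instant. A minimal change to the delay line:
// "Full jitter" variant: wait a random amount up to the exponential cap
const delay = Math.random() * Math.pow(2, i) * 1000; // 0-1s, 0-2s, 0-4s
await sleep(delay);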
5. Load Shedding
// Reject requests when overloaded
if (queueDepth > threshold) {
  return res.status(503).send('Overloaded');
}
Benefits:
- Prevent overload
- Protect downstream services
Escalation
Escalate to architecture team if:
- System-wide cascade
- Architectural changes needed
Escalate to all service owners if:
- Multiple teams affected
- Need coordinated response
Escalate to management if:
- Complete outage
- Large customer impact
Prevention Checklist
- Circuit breakers on all external calls
- Timeouts on all network operations
- Retries with exponential backoff
- Bulkheads for critical paths
- Rate limiting (protect downstream)
- Health checks (detect failures early; sketch after this list)
- Auto-scaling (handle load)
- Graceful degradation (fallback data)
- Chaos engineering (test failure scenarios)
- Load testing (find breaking points)
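A minimal health-check sketch, assuming an Express app where `pool` (database) and `cache` (Redis client) already exist (the names are illustrative assumptions): it probes the service's own leaf dependencies so the load balancer can pull an unhealthy instance out of rotation before callers start timing out.
// Hypothetical /api/health endpoint - `pool` and `cache` are assumed app objects.
app.get('/api/health', async (req, res) => {
  const checks = {};
  try {
    await pool.query('SELECT 1');   // database reachable?
    checks.database = 'ok';
  } catch (err) {
    checks.database = 'fail';
  }
  try {
    await cache.ping();             // Redis reachable?
    checks.cache = 'ok';
  } catch (err) {
    checks.cache = 'fail';
  }
  const healthy = Object.values(checks).every((v) => v === 'ok');
  res.status(healthy ? 200 : 503).json(checks);
});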
Related Runbooks
- 04-slow-api-response.md - API performance
- 07-service-down.md - Service failures
- ../modules/backend-diagnostics.md - Backend troubleshooting
Post-Incident
After resolving:
- Create post-mortem (MANDATORY for cascade failures)
- Draw cascade diagram (which services failed in order)
- Identify missing safeguards (circuit breakers, timeouts)
- Implement prevention patterns
- Test cascade scenarios (chaos engineering)
- Update this runbook if needed
Cascade Failure Examples
Netflix Outage (2012):
- Database latency → API timeouts → Frontend failures → Complete outage
- Fix: Circuit breakers, timeouts, fallback data
AWS S3 Outage (2017):
- S3 down → Websites using S3 fail → Status dashboards fail (also on S3)
- Fix: Multi-region redundancy, fallback to different regions
Google Cloud Outage (2019):
- Network misconfiguration → Internal services fail → External services cascade
- Fix: Network configuration validation, staged rollouts
Key Takeaways
- Cascades happen when failures propagate (no circuit breakers, timeouts)
- Fix the root cause first (not the symptoms)
- Fail fast, don't cascade waits (timeouts everywhere)
- Graceful degradation (fallback > failure)
- Test failure scenarios (chaos engineering)