# Playbook: Cascade Failure

## Symptoms

- Multiple services failing simultaneously
- Failures spreading across services
- Dependency services timing out
- Error rate increasing exponentially
- Monitoring alert: "Multiple services degraded", "Cascade detected"

## Severity

- **SEV1** - Cascade affecting production services

## What is a Cascade Failure?

**Definition**: One service failure triggers failures in dependent services, spreading through the system.

**Example**:

```
Database slow (2s queries)
    ↓
API times out waiting for database (5s timeout)
    ↓
Frontend times out waiting for API (10s timeout)
    ↓
Load balancer marks frontend unhealthy
    ↓
Traffic routes to other frontends (overloading them)
    ↓
All frontends fail → Complete outage
```

---

## Diagnosis

### Step 1: Identify Initial Failure Point

**Check Service Dependencies**:

```
Frontend → API → Database
   ↓
Cache (Redis)
   ↓
Queue (RabbitMQ)
   ↓
External API
```

**Find the root**:

```bash
# Check service health (start with leaf dependencies)

# 1. Database
psql -c "SELECT 1"

# 2. Cache
redis-cli PING

# 3. Queue
rabbitmqctl status

# 4. External API
curl https://api.external.com/health

# First failure = likely root cause
```

---

### Step 2: Trace Failure Propagation

**Check Service Logs** (in order):

```bash
# Database logs (first)
tail -100 /var/log/postgresql/postgresql.log

# API logs (second)
tail -100 /var/log/api/error.log

# Frontend logs (third)
tail -100 /var/log/frontend/error.log
```

**Look for timestamps**:

```
14:00:00 - Database: Slow query (7s)  ← ROOT CAUSE
14:00:05 - API: Timeout error
14:00:10 - Frontend: API unavailable
14:00:15 - Load balancer: All frontends unhealthy
```
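
To order the failures quickly, a rough sketch like the following pulls the first error or timeout line from each log (paths follow the examples above; adjust the pattern and paths to your environment):

```bash
# Print the first error/timeout line from each log to see which service failed first
for log in /var/log/postgresql/postgresql.log \
           /var/log/api/error.log \
           /var/log/frontend/error.log; do
  echo "== $log =="
  grep -iE 'error|timeout' "$log" | head -1
done
```
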
---

### Step 3: Assess Cascade Depth

**How many layers affected?** (a quick probe sketch follows the list)

- **1 layer**: Database only (isolated failure)
- **2-3 layers**: Database → API → Frontend (cascade)
- **4+ layers**: Full system cascade (critical)
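
A minimal sketch for counting affected layers, assuming each layer exposes an HTTP health endpoint (the URLs below are hypothetical placeholders):

```bash
# Probe each layer with a short timeout and count failures
failing=0
for url in http://frontend.internal/health \
           http://api.internal/health \
           http://db-proxy.internal/health; do
  if ! curl -sf --max-time 2 "$url" > /dev/null; then
    echo "UNHEALTHY: $url"
    failing=$((failing + 1))
  fi
done
echo "Layers affected: $failing"
```
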
---

## Mitigation

### Immediate (Now - 5 min)

**PRIORITY: Stop the cascade from spreading**

**Option A: Circuit Breaker** (if not already enabled)

```javascript
// Enable circuit breaker manually
// Prevents API from overwhelming database

const CircuitBreaker = require('opossum');

const dbQuery = new CircuitBreaker(queryDatabase, {
  timeout: 3000,                // 3s timeout
  errorThresholdPercentage: 50, // Open after 50% failures
  resetTimeout: 30000           // Try again after 30s
});

dbQuery.on('open', () => {
  console.log('Circuit breaker OPEN - using fallback');
});

// Use fallback when circuit open
dbQuery.fallback(() => {
  return cachedData; // Return cached data instead
});
```

**Option B: Rate Limiting** (protect downstream)

```nginx
# Limit request rate to the API (protects the database downstream)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
  limit_req zone=api burst=20 nodelay;
  proxy_pass http://api-backend;
}
```

**Option C: Shed Load** (reject non-critical requests)

```javascript
// Reject non-critical requests when overloaded
app.use((req, res, next) => {
  const load = getCurrentLoad(); // CPU, memory, queue depth

  if (load > 0.8 && !isCriticalEndpoint(req.path)) {
    return res.status(503).json({
      error: 'Service overloaded, try again later'
    });
  }

  next();
});

function isCriticalEndpoint(path) {
  return ['/api/health', '/api/payment'].includes(path);
}
```

**Option D: Isolate Failure** (take failing service offline)

```bash
# Remove failing service from load balancer
# AWS ELB:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0

# nginx:
# Comment out failing backend in upstream block
# upstream api {
#   server api1.example.com;    # Healthy
#   # server api2.example.com;  # FAILING - commented out
# }

# Impact: Prevents failing service from affecting others
# Risk: Reduced capacity
```

---

### Short-term (5 min - 1 hour)

**Option A: Fix Root Cause**

**If database slow**:

```sql
-- Add missing index
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login_at);
```
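
If the offending query is not yet known, a quick look at `pg_stat_activity` can surface it before you decide which index to add (a minimal sketch, assuming PostgreSQL and `psql` access; the 1-second threshold is arbitrary):

```bash
# List queries that have been running longer than 1 second
psql -c "SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
         FROM pg_stat_activity
         WHERE state <> 'idle' AND now() - query_start > interval '1 second'
         ORDER BY runtime DESC;"
```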

**If external API slow**:

```javascript
// Add timeout + fallback so a slow dependency can't stall this service
let data;
try {
  const response = await fetch('https://api.external.com', {
    signal: AbortSignal.timeout(2000) // 2s timeout
  });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  data = await response.json();
} catch (err) {
  data = fallbackData; // Don't cascade the failure
}
```

**If service overloaded**:

```bash
# Scale horizontally (add more instances)
# AWS Auto Scaling:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 10  # Was 5
```

---

**Option B: Add Timeouts** (prevent indefinite waiting)

```javascript
// Database query timeout (the exact option name varies by driver/ORM)
const result = await db.query('SELECT * FROM users', {
  timeout: 3000 // 3 second timeout
});

// API call timeout
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000) // 5 second timeout
});

// Impact: Fail fast instead of cascading
// Risk: Low (better to timeout than cascade)
```

---

**Option C: Add Bulkheads** (isolate critical paths)

```javascript
// Separate connection pools for critical vs non-critical work
const { Pool } = require('pg'); // assuming node-postgres

const criticalPool = new Pool({ max: 10 });   // Payments, auth
const nonCriticalPool = new Pool({ max: 5 }); // Analytics, reports

// Critical requests get priority
app.post('/api/payment', async (req, res) => {
  const conn = await criticalPool.connect();
  // ...
});

// Non-critical requests use a separate pool
app.get('/api/analytics', async (req, res) => {
  const conn = await nonCriticalPool.connect();
  // ...
});

// Impact: Critical paths protected from non-critical load
// Risk: None (isolation improves reliability)
```

---

### Long-term (1 hour+)

**Architecture Improvements**:

- [ ] **Circuit Breakers** (all external dependencies)
- [ ] **Timeouts** (every network call, database query)
- [ ] **Retries with exponential backoff** (transient failures)
- [ ] **Bulkheads** (isolate critical paths)
- [ ] **Rate limiting** (protect downstream services)
- [ ] **Graceful degradation** (fallback data, cached responses)
- [ ] **Health checks** (detect failures early)
- [ ] **Auto-scaling** (handle load spikes)
- [ ] **Chaos engineering** (test cascade scenarios; see the sketch below)
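
One low-tech way to rehearse a cascade is to inject artificial latency on a dependency and watch whether timeouts, circuit breakers, and fallbacks actually hold. A minimal sketch, assuming Linux hosts with `tc` available and a **non-production** environment:

```bash
# CAUTION: staging/test environment only.
# Add 2s of latency to all outbound traffic on the database host (assumes interface eth0)
tc qdisc add dev eth0 root netem delay 2000ms

# ...observe whether API timeouts and fallbacks behave as expected...

# Remove the injected latency
tc qdisc del dev eth0 root netem
```
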
---
## Cascade Prevention Patterns

### 1. Circuit Breaker Pattern

```javascript
const breaker = new CircuitBreaker(riskyOperation, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

breaker.fallback(() => cachedData);
```

**Benefits**:

- Fast failure (don't wait for timeout)
- Automatic recovery (reset after timeout)
- Fallback data (graceful degradation)

---

### 2. Timeout Pattern

```javascript
// ALWAYS set timeouts
const response = await fetch('/api', {
  signal: AbortSignal.timeout(5000)
});
```

**Benefits**:

- Fail fast (don't cascade indefinite waits)
- Predictable behavior

---

### 3. Bulkhead Pattern

```javascript
// Separate resource pools
const criticalPool = new Pool({ max: 10 });
const nonCriticalPool = new Pool({ max: 5 });
```

**Benefits**:

- Critical paths protected
- Non-critical load can't exhaust resources

---

### 4. Retry with Backoff

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
    }
  }
}
```

**Benefits**:

- Handles transient failures
- Exponential backoff prevents thundering herd

---

### 5. Load Shedding

```javascript
// Reject requests when overloaded
if (queueDepth > threshold) {
  return res.status(503).send('Overloaded');
}
```

**Benefits**:

- Prevent overload
- Protect downstream services

---
## Escalation

**Escalate to architecture team if**:

- System-wide cascade
- Architectural changes needed

**Escalate to all service owners if**:

- Multiple teams affected
- Need coordinated response

**Escalate to management if**:

- Complete outage
- Large customer impact

---

## Prevention Checklist

- [ ] Circuit breakers on all external calls
- [ ] Timeouts on all network operations
- [ ] Retries with exponential backoff
- [ ] Bulkheads for critical paths
- [ ] Rate limiting (protect downstream)
- [ ] Health checks (detect failures early)
- [ ] Auto-scaling (handle load)
- [ ] Graceful degradation (fallback data)
- [ ] Chaos engineering (test failure scenarios)
- [ ] Load testing (find breaking points)

---

## Related Runbooks

- [04-slow-api-response.md](04-slow-api-response.md) - API performance
- [07-service-down.md](07-service-down.md) - Service failures
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:

- [ ] Create post-mortem (MANDATORY for cascade failures)
- [ ] Draw cascade diagram (which services failed in order)
- [ ] Identify missing safeguards (circuit breakers, timeouts)
- [ ] Implement prevention patterns
- [ ] Test cascade scenarios (chaos engineering)
- [ ] Update this runbook if needed

---

## Cascade Failure Examples

**Netflix Outage (2012)**:

- Database latency → API timeouts → Frontend failures → Complete outage
- **Fix**: Circuit breakers, timeouts, fallback data

**AWS S3 Outage (2017)**:

- S3 down → Websites using S3 fail → Status dashboards fail (also on S3)
- **Fix**: Multi-region redundancy, fallback to different regions

**Google Cloud Outage (2019)**:

- Network misconfiguration → Internal services fail → External services cascade
- **Fix**: Network configuration validation, staged rollouts

---

## Key Takeaways

1. **Cascades happen when failures propagate** (no circuit breakers, timeouts)
2. **Fix the root cause first** (not the symptoms)
3. **Fail fast, don't cascade waits** (timeouts everywhere)
4. **Graceful degradation** (fallback > failure)
5. **Test failure scenarios** (chaos engineering)