# Playbook: Cascade Failure
## Symptoms
- Multiple services failing simultaneously
- Failures spreading across services
- Dependency services timing out
- Error rate increasing exponentially
- Monitoring alerts such as "Multiple services degraded" or "Cascade detected"
## Severity
- **SEV1** - Cascade affecting production services
## What is a Cascade Failure?
**Definition**: One service failure triggers failures in dependent services, spreading through the system.
**Example**:
```
Database slow (2s queries)
API times out waiting for database (5s timeout)
Frontend times out waiting for API (10s timeout)
Load balancer marks frontend unhealthy
Traffic routes to other frontends (overloading them)
All frontends fail → Complete outage
```
---
## Diagnosis
### Step 1: Identify Initial Failure Point
**Check Service Dependencies**:
```
Frontend → API → Database
              → Cache (Redis)
              → Queue (RabbitMQ)
              → External API
```
**Find the root**:
```bash
# Check service health (start with leaf dependencies)
# 1. Database
psql -c "SELECT 1"
# 2. Cache
redis-cli PING
# 3. Queue
rabbitmqctl status
# 4. External API
curl https://api.external.com/health
# First failure = likely root cause
```
---
### Step 2: Trace Failure Propagation
**Check Service Logs** (in order):
```bash
# Database logs (first)
tail -100 /var/log/postgresql/postgresql.log
# API logs (second)
tail -100 /var/log/api/error.log
# Frontend logs (third)
tail -100 /var/log/frontend/error.log
```
**Look for timestamps**:
```
14:00:00 - Database: Slow query (7s) ← ROOT CAUSE
14:00:05 - API: Timeout error
14:00:10 - Frontend: API unavailable
14:00:15 - Load balancer: All frontends unhealthy
```
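To correlate these timestamps quickly, a minimal Node sketch along these lines reports the first error-like line in each log. The paths match the ones above; the error keywords and line format are assumptions to adjust for your environment:
```javascript
// Hypothetical sketch: print the first error-like line per service log.
// The service with the earliest first-error timestamp is the likely root cause.
const fs = require('fs');

const logs = {
  database: '/var/log/postgresql/postgresql.log',
  api: '/var/log/api/error.log',
  frontend: '/var/log/frontend/error.log',
};

for (const [service, path] of Object.entries(logs)) {
  const lines = fs.readFileSync(path, 'utf8').split('\n');
  // Assumes errors mention one of these keywords; tune to your log format
  const first = lines.find((l) => /error|timeout|unavailable/i.test(l));
  console.log(`${service}: ${first ? first.slice(0, 80) : 'no errors found'}`);
}
```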
---
### Step 3: Assess Cascade Depth
**How many layers are affected?** (a quick probe sketch follows this list)
- **1 layer**: Database only (isolated failure)
- **2-3 layers**: Database → API → Frontend (cascade)
- **4+ layers**: Full system cascade (critical)
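A quick way to count affected layers is to probe each layer's health endpoint in dependency order. This is a minimal sketch for Node 18+ (native `fetch`); the URLs are placeholders for your own services:
```javascript
// Hypothetical probe: count how many layers are currently failing.
const layers = [
  ['database', 'http://db-proxy.internal/health'],
  ['api', 'http://api.internal/health'],
  ['frontend', 'http://frontend.internal/health'],
];

async function cascadeDepth() {
  let failing = 0;
  for (const [name, url] of layers) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(2000) });
      console.log(name, res.ok ? 'healthy' : 'degraded');
      if (!res.ok) failing += 1;
    } catch {
      console.log(name, 'unreachable');
      failing += 1;
    }
  }
  return failing; // 1 = isolated, 2-3 = cascade, 4+ = critical
}

cascadeDepth().then((n) => console.log('Layers affected:', n));
```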
---
## Mitigation
### Immediate (Now - 5 min)
**PRIORITY: Stop the cascade from spreading**
**Option A: Circuit Breaker** (if not already enabled)
```javascript
// Enable circuit breaker manually
// Prevents API from overwhelming database
const CircuitBreaker = require('opossum');
const dbQuery = new CircuitBreaker(queryDatabase, {
  timeout: 3000,                 // 3s timeout
  errorThresholdPercentage: 50,  // Open after 50% failures
  resetTimeout: 30000            // Try again after 30s
});
dbQuery.on('open', () => {
  console.log('Circuit breaker OPEN - using fallback');
});
// Use fallback when circuit open
dbQuery.fallback(() => {
  return cachedData; // Return cached data instead
});
// Call the database through the breaker instead of directly:
//   const users = await dbQuery.fire(sql, params);
```
**Option B: Rate Limiting** (protect downstream)
```nginx
# Throttle the request rate into the API tier, protecting the database behind it
# (limit_req_zone belongs in the http{} context; the location block in a server{})
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://api-backend;
}
```
**Option C: Shed Load** (reject non-critical requests)
```javascript
// Reject non-critical requests when overloaded
const os = require('os');

// Simple load signal: 1-minute load average normalized by CPU count
// (could also factor in memory pressure or queue depth)
function getCurrentLoad() {
  return os.loadavg()[0] / os.cpus().length;
}

function isCriticalEndpoint(path) {
  return ['/api/health', '/api/payment'].includes(path);
}

app.use((req, res, next) => {
  const load = getCurrentLoad();
  if (load > 0.8 && !isCriticalEndpoint(req.path)) {
    return res.status(503).json({
      error: 'Service overloaded, try again later'
    });
  }
  next();
});
```
**Option D: Isolate Failure** (take failing service offline)
```bash
# Remove failing service from load balancer
# AWS ELB:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0

# nginx:
# Comment out failing backend in upstream block
# upstream api {
#     server api1.example.com;       # Healthy
#     # server api2.example.com;     # FAILING - commented out
# }
# Impact: Prevents failing service from affecting others
# Risk: Reduced capacity
```
---
### Short-term (5 min - 1 hour)
**Option A: Fix Root Cause**
**If database slow**:
```sql
-- Add missing index
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login_at);
```
**If external API slow**:
```javascript
// Add timeout + fallback
let data;
try {
  const response = await fetch('https://api.external.com', {
    signal: AbortSignal.timeout(2000) // 2s timeout
  });
  data = response.ok ? await response.json() : fallbackData;
} catch (err) {
  data = fallbackData; // Timed out or unreachable - don't cascade the failure
}
```
**If service overloaded**:
```bash
# Scale horizontally (add more instances)
# AWS Auto Scaling:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 10   # Was 5
```
---
**Option B: Add Timeouts** (prevent indefinite waiting)
```javascript
// Database query timeout
const result = await db.query('SELECT * FROM users', {
  timeout: 3000 // 3 second timeout
});
// API call timeout
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000) // 5 second timeout
});
// Impact: Fail fast instead of cascading
// Risk: Low (better to timeout than cascade)
```
---
**Option C: Add Bulkheads** (isolate critical paths)
```javascript
// Separate connection pools for critical vs non-critical
const { Pool } = require('pg'); // e.g. PostgreSQL connection pools
const criticalPool = new Pool({ max: 10 });    // Payments, auth
const nonCriticalPool = new Pool({ max: 5 });  // Analytics, reports
// Critical requests get priority
app.post('/api/payment', async (req, res) => {
  const conn = await criticalPool.connect();
  // ...
});
// Non-critical requests use separate pool
app.get('/api/analytics', async (req, res) => {
  const conn = await nonCriticalPool.connect();
  // ...
});
// Impact: Critical paths protected from non-critical load
// Risk: None (isolation improves reliability)
```
---
### Long-term (1 hour+)
**Architecture Improvements**:
- [ ] **Circuit Breakers** (all external dependencies)
- [ ] **Timeouts** (every network call, database query)
- [ ] **Retries with exponential backoff** (transient failures)
- [ ] **Bulkheads** (isolate critical paths)
- [ ] **Rate limiting** (protect downstream services)
- [ ] **Graceful degradation** (fallback data, cached responses)
- [ ] **Health checks** (detect failures early; see the sketch after this list)
- [ ] **Auto-scaling** (handle load spikes)
- [ ] **Chaos engineering** (test cascade scenarios)
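A dependency-aware health check is the building block for several of the items above. This is a minimal sketch assuming Express and a `pg` pool (Node 18+ for native `fetch`); the cache URL and port are placeholders:
```javascript
// Hypothetical health endpoint: reports per-dependency status and returns 503
// when any dependency is down, so the load balancer can pull the node early.
const express = require('express');
const { Pool } = require('pg');

const app = express();
const db = new Pool(); // connection details come from PG* environment variables

app.get('/api/health', async (req, res) => {
  const checks = {};

  try {
    await db.query('SELECT 1'); // is the database reachable?
    checks.database = 'ok';
  } catch {
    checks.database = 'failed';
  }

  try {
    // placeholder downstream check - replace with your own dependencies
    const r = await fetch('http://cache.internal/health', {
      signal: AbortSignal.timeout(1000),
    });
    checks.cache = r.ok ? 'ok' : 'failed';
  } catch {
    checks.cache = 'failed';
  }

  const healthy = Object.values(checks).every((s) => s === 'ok');
  res.status(healthy ? 200 : 503).json(checks);
});

app.listen(3000);
```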
---
## Cascade Prevention Patterns
### 1. Circuit Breaker Pattern
```javascript
const breaker = new CircuitBreaker(riskyOperation, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});
breaker.fallback(() => cachedData);
```
**Benefits**:
- Fast failure (don't wait for timeout)
- Automatic recovery (reset after timeout)
- Fallback data (graceful degradation)
---
### 2. Timeout Pattern
```javascript
// ALWAYS set timeouts
const response = await fetch('/api', {
  signal: AbortSignal.timeout(5000)
});
```
**Benefits**:
- Fail fast (don't cascade indefinite waits)
- Predictable behavior
---
### 3. Bulkhead Pattern
```javascript
// Separate resource pools
const criticalPool = new Pool({ max: 10 });
const nonCriticalPool = new Pool({ max: 5 });
```
**Benefits**:
- Critical paths protected
- Non-critical load can't exhaust resources
---
### 4. Retry with Backoff
```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
    }
  }
}
```
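A possible usage, wrapping a flaky downstream call (the endpoint is illustrative):
```javascript
// Retries the request up to 3 times with 1s/2s/4s backoff before giving up
async function loadData() {
  return retryWithBackoff(async () => {
    const res = await fetch('/api/data', { signal: AbortSignal.timeout(5000) });
    if (!res.ok) throw new Error(`HTTP ${res.status}`); // treat non-2xx as failure
    return res.json();
  });
}
```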
**Benefits**:
- Handles transient failures
- Exponential backoff prevents thundering herd
---
### 5. Load Shedding
```javascript
// Reject requests when overloaded
if (queueDepth > threshold) {
  return res.status(503).send('Overloaded');
}
```
**Benefits**:
- Prevent overload
- Protect downstream services
---
## Escalation
**Escalate to architecture team if**:
- System-wide cascade
- Architectural changes needed
**Escalate to all service owners if**:
- Multiple teams affected
- Need coordinated response
**Escalate to management if**:
- Complete outage
- Large customer impact
---
## Prevention Checklist
- [ ] Circuit breakers on all external calls
- [ ] Timeouts on all network operations
- [ ] Retries with exponential backoff
- [ ] Bulkheads for critical paths
- [ ] Rate limiting (protect downstream)
- [ ] Health checks (detect failures early)
- [ ] Auto-scaling (handle load)
- [ ] Graceful degradation (fallback data)
- [ ] Chaos engineering (test failure scenarios)
- [ ] Load testing (find breaking points)
---
## Related Runbooks
- [04-slow-api-response.md](04-slow-api-response.md) - API performance
- [07-service-down.md](07-service-down.md) - Service failures
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (MANDATORY for cascade failures)
- [ ] Draw cascade diagram (which services failed in order)
- [ ] Identify missing safeguards (circuit breakers, timeouts)
- [ ] Implement prevention patterns
- [ ] Test cascade scenarios (chaos engineering)
- [ ] Update this runbook if needed
---
## Cascade Failure Examples
**Netflix Outage (2012)**:
- Database latency → API timeouts → Frontend failures → Complete outage
- **Fix**: Circuit breakers, timeouts, fallback data
**AWS S3 Outage (2017)**:
- S3 down → Websites using S3 fail → Status dashboards fail (also on S3)
- **Fix**: Multi-region redundancy, fallback to different regions
**Google Cloud Outage (2019)**:
- Network misconfiguration → Internal services fail → External services cascade
- **Fix**: Network configuration validation, staged rollouts
---
## Key Takeaways
1. **Cascades happen when failures propagate unchecked** (no circuit breakers or timeouts)
2. **Fix the root cause first** (not the symptoms)
3. **Fail fast, don't cascade waits** (timeouts everywhere)
4. **Graceful degradation** (fallback > failure)
5. **Test failure scenarios** (chaos engineering)