Initial commit

Zhongwei Li
2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions

@@ -0,0 +1,204 @@
# Playbook: High CPU Usage
## Symptoms
- CPU usage at 80-100%
- Applications slow or unresponsive
- Server lag, SSH slow
- Monitoring alert: "CPU usage >80% for 5 minutes"
## Severity
- **SEV2** if application degraded but functional
- **SEV1** if application unresponsive
## Diagnosis
### Step 1: Identify Top CPU Process
```bash
# Current CPU usage
top -bn1 | head -20
# Top CPU processes
ps aux | sort -nrk 3,3 | head -10
# CPU per thread
top -H -p <PID>
```
**What to look for**:
- Single process using >80% CPU
- Multiple processes all high (system-wide issue)
- System vs. user CPU split (high iowait points to a disk bottleneck rather than CPU)
---
### Step 2: Identify Process Type
**Application process** (node, java, python):
```bash
# Check application logs
tail -100 /var/log/application.log
# Check for infinite loops, heavy computation
# Check APM for slow endpoints
```
**System process** (kernel, systemd):
```bash
# Check system logs
dmesg | tail -50
journalctl -xe
# Check for hardware issues
```
**Unknown/suspicious process**:
```bash
# Check process details
ps aux | grep <PID>
lsof -p <PID>
# Could be malware (crypto mining)
# See security-incidents.md
```
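Two quick follow-ups that often expose crypto-miners, reading `/proc` directly and listing the process's network peers (standard Linux tooling; `<PID>` as above):
```bash
# Where does the binary actually live? Miners often run from /tmp or a deleted file
ls -l /proc/<PID>/exe
tr '\0' ' ' < /proc/<PID>/cmdline; echo
# Outbound connections to unfamiliar hosts (mining pools) are a red flag
ss -tnp | grep "pid=<PID>"
```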
---
### Step 3: Check If Disk-Related
```bash
# Check iowait
iostat -x 1 5
# If iowait >20%, disk is bottleneck
# See infrastructure.md for disk I/O troubleshooting
```
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Lower Process Priority**
```bash
# Reduce CPU priority
renice +10 <PID>
# Impact: Process gets less CPU time
# Risk: Low (process still runs, just slower)
```
**Option B: Kill Process** (if application)
```bash
# Graceful shutdown
kill -TERM <PID>
# Force kill (last resort)
kill -KILL <PID>
# Restart service
systemctl restart <service>
# Impact: Process restarts, CPU normalizes
# Risk: Medium (brief downtime)
```
**Option C: Scale Horizontally** (cloud)
```bash
# Add more instances to distribute load
# AWS: Auto Scaling Group
# Azure: Scale Set
# Kubernetes: Horizontal Pod Autoscaler
# Impact: Load distributed across instances
# Risk: Low (no downtime)
```
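For concreteness, a hedged sketch of what the scale-out usually looks like; the deployment name, ASG name, and thresholds below are placeholders, not values from this environment:
```bash
# Kubernetes: add a CPU-based Horizontal Pod Autoscaler
kubectl autoscale deployment api --cpu-percent=70 --min=3 --max=10
# AWS: raise the Auto Scaling Group's desired capacity
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 10
```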
---
### Short-term (5 min - 1 hour)
**Option A: Optimize Code** (if application bug)
```bash
# Profile application
# Node.js: node --prof
# Java: jstack, jvisualvm
# Python: py-spy
# Identify hot path
# Fix infinite loop, heavy computation
```
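Typical invocations of the profilers listed above (assuming the standard tooling is installed; `<PID>` is the hot process from Step 1):
```bash
# Node.js: record a V8 CPU profile, then summarise it
node --prof app.js
node --prof-process isolate-*.log | head -40
# Java: dump thread stacks; cross-reference with `top -H -p <PID>` to find the hot thread
jstack <PID> > /tmp/threads.txt
# Python: sample a running process without restarting it
py-spy top --pid <PID>
```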
**Option B: Add Caching**
```javascript
// Cache expensive computation
const cache = new Map();
function expensiveOperation(input) {
if (cache.has(input)) {
return cache.get(input);
}
const result = doHeavyComputation(input); // placeholder for the heavy computation
cache.set(input, result);
return result;
}
```
**Option C: Scale Vertically** (cloud)
```bash
# Resize to larger instance type
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM
# Impact: More CPU capacity
# Risk: Medium (brief downtime during resize)
```
---
### Long-term (1 hour+)
- [ ] Add CPU monitoring alert (>70% for 5 min)
- [ ] Optimize application code (reduce computation)
- [ ] Use worker threads for heavy tasks (Node.js)
- [ ] Implement auto-scaling (cloud)
- [ ] Add APM for performance profiling
- [ ] Review architecture (async processing, job queues)
---
## Escalation
**Escalate to developer if**:
- Application code causing issue
- Requires code fix or optimization
**Escalate to security-agent if**:
- Unknown/suspicious process
- Potential malware or crypto mining
**Escalate to infrastructure if**:
- Hardware issue (kernel errors)
- Cloud infrastructure problem
---
## Related Runbooks
- [03-memory-leak.md](03-memory-leak.md) - If memory also high
- [04-slow-api-response.md](04-slow-api-response.md) - If API slow due to CPU
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure diagnostics
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify root cause
- [ ] Add monitoring/alerting
- [ ] Update this runbook if needed
- [ ] Add regression test (if code bug)

@@ -0,0 +1,241 @@
# Playbook: Database Deadlock
## Symptoms
- "Deadlock detected" errors in application
- API returning 500 errors
- Transactions timing out
- Database connection pool exhausted
- Monitoring alert: "Deadlock count >0"
## Severity
- **SEV2** if isolated to specific endpoint
- **SEV1** if affecting all database operations
## Diagnosis
### Step 1: Confirm Deadlock (PostgreSQL)
```sql
-- Check for currently locked queries
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
-- Check the cumulative deadlock counter for the database
SELECT datname, deadlocks FROM pg_stat_database WHERE datname = 'your_database';
```
### Step 2: Confirm Deadlock (MySQL)
```sql
-- Show InnoDB status (includes deadlock info)
SHOW ENGINE INNODB STATUS\G
-- Look for "LATEST DETECTED DEADLOCK" section
-- Shows which transactions were involved
```
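From a shell, the deadlock section can be extracted directly, and MySQL can be told to log every deadlock rather than just the latest (a hedged sketch; requires sufficient privileges):
```bash
# Print only the "LATEST DETECTED DEADLOCK" section
mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '/LATEST DETECTED DEADLOCK/,/^TRANSACTIONS/p'
# Record every deadlock in the MySQL error log going forward
mysql -e "SET GLOBAL innodb_print_all_deadlocks = ON"
```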
---
### Step 3: Identify Deadlock Pattern
**Common Pattern 1: Lock Order Mismatch**
```
Transaction A: Locks row 1, then row 2
Transaction B: Locks row 2, then row 1
→ DEADLOCK
```
**Common Pattern 2: Gap Locks**
```
Transaction A: SELECT ... FOR UPDATE WHERE id BETWEEN 1 AND 10
Transaction B: INSERT INTO table (id) VALUES (5)
→ DEADLOCK
```
**Common Pattern 3: Foreign Key Deadlock**
```
Transaction A: Updates parent table
Transaction B: Inserts into child table
→ DEADLOCK (foreign key check locks)
```
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Kill Blocking Query** (PostgreSQL)
```sql
-- Terminate blocking process
SELECT pg_terminate_backend(<blocking_pid>);
-- Verify deadlock cleared
SELECT count(*) FROM pg_locks WHERE NOT granted;
-- Should return 0
```
**Option B: Kill Blocking Query** (MySQL)
```sql
-- Show process list
SHOW PROCESSLIST;
-- Kill blocking query
KILL <process_id>;
```
**Option C: Kill Idle Transactions** (PostgreSQL)
```sql
-- Find idle transactions (>5 min)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < NOW() - INTERVAL '5 minutes';
-- Impact: Frees up locks
-- Risk: Low (transactions are idle)
```
---
### Short-term (5 min - 1 hour)
**Option A: Add Transaction Timeout** (PostgreSQL)
```sql
-- Set statement timeout (30 seconds)
ALTER DATABASE your_database SET statement_timeout = '30s';
-- Or in application:
SET statement_timeout = '30s';
-- Impact: Prevents long-running transactions
-- Risk: Low (transactions should be fast)
```
**Option B: Add Transaction Timeout** (MySQL)
```sql
-- Set lock wait timeout
SET GLOBAL innodb_lock_wait_timeout = 30;
-- Impact: Transactions fail instead of waiting forever
-- Risk: Low (application should handle errors)
```
**Option C: Fix Lock Order in Application**
```javascript
// BAD: Inconsistent lock order
async function transferMoney(fromId, toId, amount) {
await db.query('UPDATE accounts SET balance = balance - ? WHERE id = ?', [amount, fromId]);
await db.query('UPDATE accounts SET balance = balance + ? WHERE id = ?', [amount, toId]);
}
// GOOD: Consistent lock order (always update the lower id first, keeping from/to amounts correct)
async function transferMoney(fromId, toId, amount) {
  const updates = [
    { id: fromId, delta: -amount },
    { id: toId, delta: amount },
  ].sort((a, b) => a.id - b.id); // lock rows in ascending id order
  for (const { id, delta } of updates) {
    await db.query('UPDATE accounts SET balance = balance + ? WHERE id = ?', [delta, id]);
  }
}
```
---
### Long-term (1 hour+)
**Option A: Reduce Transaction Scope**
```javascript
// BAD: Long transaction, row lock held across a slow external call
await db.query('BEGIN');
const user = await db.query('SELECT * FROM users WHERE id = ? FOR UPDATE', [userId]);
await sendEmail(user.email); // External call (slow!) while the lock is held
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = ?', [userId]);
await db.query('COMMIT');
// GOOD: Short transaction
const user = await db.query('SELECT * FROM users WHERE id = ?', [userId]);
await sendEmail(user.email); // Outside transaction
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = ?', [userId]);
```
**Option B: Use Optimistic Locking**
```sql
-- Add version column
ALTER TABLE accounts ADD COLUMN version INT DEFAULT 0;
-- Update with version check
UPDATE accounts
SET balance = balance - 100, version = version + 1
WHERE id = 1 AND version = <current_version>;
-- If 0 rows updated, retry with new version
```
**Option C: Review Isolation Level**
```sql
-- PostgreSQL default: READ COMMITTED
-- Most cases: READ COMMITTED is fine
-- Rare cases: REPEATABLE READ or SERIALIZABLE
-- Lower isolation = less locking = fewer deadlocks
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```
---
## Escalation
**Escalate to developer if**:
- Application code causing deadlock
- Requires code refactoring
**Escalate to DBA if**:
- Database configuration issue
- Foreign key constraint problem
---
## Prevention
- [ ] Always lock in same order
- [ ] Keep transactions short
- [ ] Use timeout (statement_timeout, lock_wait_timeout)
- [ ] Use optimistic locking when possible
- [ ] Add deadlock monitoring alert
- [ ] Review isolation level (lower = fewer deadlocks)
---
## Related Runbooks
- [04-slow-api-response.md](04-slow-api-response.md) - If API slow due to deadlock
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem
- [ ] Identify which queries deadlocked
- [ ] Fix lock order in application code
- [ ] Add regression test
- [ ] Update this runbook if needed

@@ -0,0 +1,252 @@
# Playbook: Memory Leak
## Symptoms
- Memory usage increasing continuously over time
- Application crashes with OutOfMemoryError (Java) or "JavaScript heap out of memory" (Node.js)
- Performance degrades over time
- High swap usage
- Monitoring alert: "Memory usage >90%"
## Severity
- **SEV2** if memory increasing but not yet critical
- **SEV1** if application crashed or unresponsive
## Diagnosis
### Step 1: Confirm Memory Leak
```bash
# Monitor memory over time (5 minute intervals)
watch -n 300 'ps aux | grep <process> | awk "{print \$4, \$5, \$6}"'
# Check if memory continuously increasing
# Leak: 20% → 30% → 40% → 50% (linear growth)
# Normal: 30% → 32% → 31% → 30% (stable)
```
---
### Step 2: Get Memory Snapshot
**Java (Heap Dump)**:
```bash
# Get heap dump
jmap -dump:format=b,file=heap.bin <PID>
# Analyze with jhat or VisualVM
jhat heap.bin
# Open http://localhost:7000
# Or use Eclipse Memory Analyzer
```
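If a full heap dump is too heavy to take in production, lighter JDK tools can confirm the trend first (both ship with the JDK):
```bash
# Old-generation utilisation (O column) climbing toward 100% across full GCs suggests a leak
jstat -gcutil <PID> 5000
# Cheap histogram of live classes/instances, useful for spotting the growing type
jcmd <PID> GC.class_histogram | head -30
```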
**Node.js (Heap Snapshot)**:
```bash
# Start with --inspect
node --inspect index.js
# Chrome DevTools → Memory → Take heap snapshot
# Or use the heapdump module from application code:
#   const heapdump = require('heapdump');
#   heapdump.writeSnapshot('/tmp/heap-' + Date.now() + '.heapsnapshot');
```
**Python (Memory Profiler)**:
```bash
# Install memory_profiler
pip install memory_profiler
# Profile function
python -m memory_profiler script.py
```
---
### Step 3: Identify Leak Source
**Look for**:
- Large arrays/objects growing over time
- Detached DOM nodes (if browser/UI)
- Event listeners not removed
- Timers/intervals not cleared
- Closures holding references
- Cache without eviction policy
**Common patterns**:
```javascript
// 1. Global cache growing forever
global.cache = {}; // Never cleared
// 2. Event listeners not removed
emitter.on('event', handler); // Never removed
// 3. Timers not cleared
setInterval(() => { /* ... */ }, 1000); // Never cleared
// 4. Closures
function createHandler() {
const largeData = new Array(1000000);
return () => {
// Closure keeps largeData in memory
};
}
```
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Restart Application**
```bash
# Restart to free memory
systemctl restart application
# Impact: Memory usage returns to baseline
# Risk: Low (brief downtime)
# NOTE: This is temporary, leak will recur!
```
**Option B: Increase Memory Limit** (temporary)
```bash
# Java
java -Xmx4G -jar application.jar # Was 2G
# Node.js
node --max-old-space-size=4096 index.js # Was 2048
# Impact: Buys time to find root cause
# Risk: Low (but doesn't fix leak)
```
**Option C: Scale Horizontally** (cloud)
```bash
# Add more instances
# Use load balancer to rotate traffic
# Restart instances on schedule (e.g., every 6 hours)
# Impact: Distributes load, restarts prevent OOM
# Risk: Low (but doesn't fix root cause)
```
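As a stopgap while the leak is being fixed, the scheduled restart can be a simple cron entry; the service name and interval below are placeholders:
```bash
# /etc/cron.d/restart-leaky-app: restart every 6 hours until the leak is fixed
0 */6 * * * root /usr/bin/systemctl restart application
```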
---
### Short-term (5 min - 1 hour)
**Analyze heap dump** and identify leak source
**Common Fixes**:
**1. Add LRU Cache**
```javascript
// BAD: Unbounded cache
const cache = {};
// GOOD: LRU cache with size limit
const LRU = require('lru-cache');
const cache = new LRU({ max: 1000 });
```
**2. Remove Event Listeners**
```javascript
// Add listener
const handler = () => { /* ... */ };
emitter.on('event', handler);
// CRITICAL: Remove later
emitter.off('event', handler);
// React/Vue: cleanup in componentWillUnmount/onUnmounted
```
**3. Clear Timers**
```javascript
// Set timer
const intervalId = setInterval(() => { /* ... */ }, 1000);
// CRITICAL: Clear later
clearInterval(intervalId);
// React: cleanup in useEffect return
useEffect(() => {
const id = setInterval(() => { /* ... */ }, 1000);
return () => clearInterval(id);
}, []);
```
**4. Close Connections**
```javascript
// BAD: Connection leak
const conn = await db.connect();
await conn.query(/* ... */);
// Connection never closed!
// GOOD: Always close
const conn = await db.connect();
try {
await conn.query(/* ... */);
} finally {
await conn.close(); // CRITICAL
}
```
---
### Long-term (1 hour+)
- [ ] Add memory monitoring (alert if >80% and increasing)
- [ ] Add memory profiling to CI/CD (detect leaks early)
- [ ] Use WeakMap for caches (auto garbage collected)
- [ ] Review closure usage (avoid holding large data)
- [ ] Add automated restart (every N hours, if leak can't be fixed immediately)
- [ ] Load test to reproduce leak in test environment
- [ ] Fix root cause in code
---
## Escalation
**Escalate to developer if**:
- Application code causing leak
- Requires code fix
**Escalate to platform team if**:
- Platform/framework bug
- Requires upgrade or workaround
---
## Prevention Checklist
- [ ] Use LRU cache (not unbounded)
- [ ] Remove event listeners in cleanup
- [ ] Clear timers/intervals
- [ ] Close database connections (use `finally`)
- [ ] Avoid closures holding large data
- [ ] Use WeakMap for temporary caches
- [ ] Profile memory in development
- [ ] Load test before production
---
## Related Runbooks
- [01-high-cpu-usage.md](01-high-cpu-usage.md) - If CPU also high
- [07-service-down.md](07-service-down.md) - If OOM crashed service
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem
- [ ] Identify leak source from heap dump
- [ ] Fix code
- [ ] Add regression test (memory profiling)
- [ ] Add monitoring alert
- [ ] Update this runbook if needed

@@ -0,0 +1,269 @@
# Playbook: Slow API Response
## Symptoms
- API response time >1 second (degraded)
- API response time >5 seconds (critical)
- Users reporting slow loading
- Timeout errors (504 Gateway Timeout)
- Monitoring alert: "p95 response time >1s"
## Severity
- **SEV3** if response time 1-3 seconds
- **SEV2** if response time 3-5 seconds
- **SEV1** if response time >5 seconds or timeouts
## Diagnosis
### Step 1: Check Application Logs
```bash
# Find slow requests
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'
# Identify slow endpoint
awk '/duration/ {print $3, $5}' /var/log/application.log | sort -nk2 | tail -20
# Example output:
# /api/dashboard 8200ms ← SLOW
# /api/users 50ms
# /api/posts 120ms
```
---
### Step 2: Measure Response Time Breakdown
**Total response time = Database + Application + Network**
```bash
# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/endpoint
# curl-format.txt:
# time_namelookup: %{time_namelookup}\n
# time_connect: %{time_connect}\n
# time_starttransfer: %{time_starttransfer}\n
# time_total: %{time_total}\n
```
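If creating `curl-format.txt` is inconvenient, the same timing variables can be passed inline:
```bash
curl -o /dev/null -s \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  http://api.example.com/endpoint
```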
**Example breakdown**:
```
time_namelookup: 0.005s (DNS)
time_connect: 0.010s (TCP connect)
time_starttransfer: 8.200s (Time to first byte) ← SLOW HERE
time_total: 8.250s
→ Problem is backend processing, not network
```
---
### Step 3: Check Database Query Time
```bash
# Check application logs for query time
grep "query.*duration" /var/log/application.log
# Example:
# query: SELECT * FROM users... duration: 7800ms ← SLOW
```
**If database is slow** → See [database-diagnostics.md](../modules/database-diagnostics.md)
---
### Step 4: Check External API Calls
```bash
# Check logs for external API calls
grep "http.request" /var/log/application.log
# Example:
# http.request: GET https://api.external.com/data duration: 5000ms ← SLOW
```
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Add Database Index** (if DB is bottleneck)
```sql
-- Example: Missing index on last_login_at
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);
-- Impact: 7.8s → 50ms query time
-- Risk: Low (CONCURRENTLY = no table lock)
```
**Option B: Enable Caching** (if same data requested frequently)
```javascript
// Add Redis cache (ioredis client shown: promise-based, so the awaits below work as written)
const Redis = require('ioredis');
const redis = new Redis();
app.get('/api/dashboard', async (req, res) => {
// Check cache first
const cached = await redis.get('dashboard:' + req.user.id);
if (cached) return res.json(JSON.parse(cached));
// Generate data
const data = await generateDashboard(req.user.id);
// Cache for 5 minutes
await redis.setex('dashboard:' + req.user.id, 300, JSON.stringify(data));
res.json(data);
});
// Impact: 8s → 10ms (cache hit)
// Risk: Low (data staleness acceptable for dashboard)
```
**Option C: Optimize Query** (if N+1 query)
```javascript
// BAD: N+1 queries
const users = await db.query('SELECT * FROM users');
for (const user of users) {
const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
user.posts = posts;
}
// GOOD: Single query with JOIN
const users = await db.query(`
SELECT users.*, posts.*
FROM users
LEFT JOIN posts ON posts.user_id = users.id
`);
```
---
### Short-term (5 min - 1 hour)
**Option A: Add Timeout** (if external API is slow)
```javascript
// Add a timeout to the external API call (native fetch aborts via AbortSignal)
let data;
try {
  const response = await fetch('https://api.external.com/data', {
    signal: AbortSignal.timeout(2000), // 2 second timeout
  });
  // Non-2xx response: use fallback data
  data = response.ok ? await response.json() : fallbackData;
} catch (err) {
  // Timeout (AbortError) or network failure: use fallback data
  data = fallbackData;
}
// Impact: Prevents slow external API from blocking response
// Risk: Low (fallback data acceptable)
```
**Option B: Async Processing** (if computation is heavy)
```javascript
// BAD: Synchronous heavy computation
app.post('/api/process', async (req, res) => {
const result = await heavyComputation(req.body); // 10 seconds
res.json(result);
});
// GOOD: Async processing with job queue
app.post('/api/process', async (req, res) => {
const jobId = await queue.add('process', req.body);
res.status(202).json({ jobId, status: 'processing' });
});
// Client polls for result
app.get('/api/job/:id', async (req, res) => {
const job = await queue.getJob(req.params.id);
res.json({ status: job.status, result: job.result });
});
// Impact: API responds immediately (202 Accepted)
// Risk: Low (client needs to handle async pattern)
```
**Option C: Pagination** (if returning large dataset)
```javascript
// BAD: Return all 10,000 records
app.get('/api/users', async (req, res) => {
const users = await db.query('SELECT * FROM users');
res.json(users); // Huge payload
});
// GOOD: Pagination
app.get('/api/users', async (req, res) => {
const page = parseInt(req.query.page) || 1;
const limit = 50;
const offset = (page - 1) * limit;
const users = await db.query('SELECT * FROM users LIMIT ? OFFSET ?', [limit, offset]);
res.json({ data: users, page, limit });
});
// Impact: 8s → 200ms (smaller dataset)
// Risk: Low (clients usually want pagination anyway)
```
---
### Long-term (1 hour+)
- [ ] Add response time monitoring (p95, p99)
- [ ] Add APM (Application Performance Monitoring)
- [ ] Optimize database queries (add indexes, reduce JOINs)
- [ ] Add caching layer (Redis, Memcached)
- [ ] Implement pagination for large datasets
- [ ] Move heavy computation to background jobs
- [ ] Add timeout for external APIs
- [ ] Add E2E test: API response <1s
- [ ] Review and optimize N+1 queries
---
## Common Root Causes
| Symptom | Root Cause | Solution |
|---------|------------|----------|
| 7.8s query time | Missing database index | CREATE INDEX |
| 10,000 records returned | No pagination | Add LIMIT/OFFSET |
| 50 queries for 1 request | N+1 query problem | Use JOIN or DataLoader |
| 5s external API call | No timeout | Add timeout + fallback |
| Heavy computation | Sync processing | Async job queue |
| Same data fetched repeatedly | No caching | Add Redis cache |
---
## Escalation
**Escalate to developer if**:
- Application code needs optimization
- N+1 query problem
**Escalate to DBA if**:
- Database performance issue
- Need help with query optimization
**Escalate to external team if**:
- External API consistently slow
- Need to negotiate SLA
---
## Related Runbooks
- [02-database-deadlock.md](02-database-deadlock.md) - If database locked
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem
- [ ] Identify root cause (DB, external API, N+1, etc.)
- [ ] Add performance test (response time <1s)
- [ ] Add monitoring alert
- [ ] Update this runbook if needed

@@ -0,0 +1,293 @@
# Playbook: DDoS Attack
## Symptoms
- Sudden traffic spike (10x-100x normal)
- Legitimate users can't access service
- High bandwidth usage (saturated)
- Server overload (CPU, memory, network)
- Monitoring alert: "Traffic spike", "Bandwidth >90%"
## Severity
- **SEV1** - Production service unavailable due to attack
## Diagnosis
### Step 1: Confirm Traffic Spike
```bash
# Check current connections
netstat -ntu | wc -l
# Compare to baseline (normal: 100-500, attack: 10,000+)
# Check requests per second (nginx)
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c
```
---
### Step 2: Identify Attack Pattern
**Check connections by IP**:
```bash
# Top 20 IPs by connection count
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20
# Example output:
# 5000 192.168.1.100 ← Attacker IP
# 3000 192.168.1.101 ← Attacker IP
# 2 192.168.1.200 ← Legitimate user
```
**Check HTTP requests by IP** (nginx):
```bash
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
```
**Check request patterns**:
```bash
# Check requested URLs
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
# Check user agents (bots often have telltale user agents)
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -nr
```
---
### Step 3: Classify Attack Type
**HTTP Flood** (application layer):
- Many HTTP requests from distributed IPs
- Valid HTTP requests, just too many
- Example: 10,000 requests/second to homepage
**SYN Flood** (network layer):
- Many TCP SYN packets
- Connection requests never complete
- Exhausts server connection table
**Amplification** (DNS, NTP):
- Small request → Large response
- Attacker spoofs your IP
- Servers send large responses to you
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Block Attacker IPs** (if few IPs)
```bash
# Block single IP (iptables)
iptables -A INPUT -s <ATTACKER_IP> -j DROP
# Block IP range
iptables -A INPUT -s 192.168.1.0/24 -j DROP
# Block specific country (using ipset + GeoIP)
# Advanced, see infrastructure team
# Impact: Blocks attacker, restores service
# Risk: Low (if attacker IPs identified correctly)
```
**Option B: Enable Rate Limiting** (nginx)
```nginx
# Add to nginx.conf
http {
# Define rate limit zone (10 req/s per IP)
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
server {
location / {
# Apply rate limit
limit_req zone=one burst=20 nodelay;
limit_req_status 429;
}
}
}
# Reload nginx
nginx -t && systemctl reload nginx
# Impact: Limits requests per IP
# Risk: Low (legitimate users rarely exceed 10 req/s)
```
**Option C: Enable CloudFlare "Under Attack" Mode**
```bash
# If using CloudFlare:
# 1. Log in to CloudFlare dashboard
# 2. Select domain
# 3. Click "Under Attack Mode"
# 4. Adds JavaScript challenge before serving content
# Impact: Blocks bots, allows legitimate browsers
# Risk: Low (slight user friction)
```
**Option D: Enable AWS Shield** (AWS)
```bash
# AWS Shield Standard: Free, automatic DDoS protection
# AWS Shield Advanced: $3000/month, enhanced protection
# CloudFormation:
aws cloudformation deploy \
--template-file shield.yaml \
--stack-name ddos-protection
# Impact: Absorbs DDoS at AWS edge
# Risk: None (AWS handles)
```
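If Step 3 pointed at a SYN flood rather than an HTTP flood, kernel SYN cookies usually buy immediate relief; a minimal sketch (values are common starting points, tune for your kernel):
```bash
# Enable SYN cookies and tighten SYN backlog handling
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
sysctl -w net.ipv4.tcp_synack_retries=2
# Persist across reboots
echo "net.ipv4.tcp_syncookies = 1" >> /etc/sysctl.d/99-ddos.conf
```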
---
### Short-term (5 min - 1 hour)
**Option A: Add Connection Limits**
```nginx
# Limit concurrent connections per IP
limit_conn_zone $binary_remote_addr zone=addr:10m;
server {
location / {
limit_conn addr 10; # Max 10 concurrent connections per IP
}
}
```
**Option B: Add CAPTCHA** (reCAPTCHA)
```html
<!-- Add reCAPTCHA to sensitive endpoints (load the api.js script once per page) -->
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<form action="/login" method="POST">
<input type="email" name="email">
<input type="password" name="password">
<div class="g-recaptcha" data-sitekey="YOUR_SITE_KEY"></div>
<button type="submit">Login</button>
</form>
```
**Option C: Scale Up** (cloud auto-scaling)
```bash
# AWS: Increase Auto Scaling Group desired capacity
aws autoscaling set-desired-capacity \
--auto-scaling-group-name my-asg \
--desired-capacity 20 # Was 5
# Impact: More capacity to handle attack
# Risk: Medium (costs money, may not fully mitigate)
# NOTE: Only do this if legitimate traffic also spiked
```
---
### Long-term (1 hour+)
- [ ] Enable CloudFlare or AWS Shield (DDoS protection service)
- [ ] Implement rate limiting on all endpoints
- [ ] Add CAPTCHA to login, signup, checkout
- [ ] Configure auto-scaling (handle legitimate traffic spikes)
- [ ] Add monitoring alert for traffic anomalies
- [ ] Create DDoS response plan
- [ ] Contact ISP for upstream filtering (if very large attack)
- [ ] Review and update firewall rules
- [ ] Add geographic blocking (if applicable)
---
## Important Notes
**DO NOT**:
- Scale up indefinitely (attack can grow, costs explode)
- Fight DDoS at application layer alone (use CDN, cloud protection)
**DO**:
- Use CDN/DDoS protection service (CloudFlare, AWS Shield, Akamai)
- Enable rate limiting
- Block attacker IPs/ranges
- Monitor costs (auto-scaling can be expensive)
---
## Escalation
**Escalate to infrastructure team if**:
- Attack very large (>10 Gbps)
- Need upstream filtering at ISP level
**Escalate to security team**:
- All DDoS attacks (for post-mortem, legal action)
**Contact ISP if**:
- Attack saturating internet connection
- Need transit provider to filter
**Contact CloudFlare/AWS if**:
- Using their DDoS protection
- Need assistance enabling features
---
## Prevention Checklist
- [ ] Use CDN (CloudFlare, CloudFront, Akamai)
- [ ] Enable DDoS protection (AWS Shield, CloudFlare)
- [ ] Implement rate limiting (per IP, per user)
- [ ] Add CAPTCHA to sensitive endpoints
- [ ] Configure auto-scaling (within cost limits)
- [ ] Monitor traffic patterns (detect spikes early)
- [ ] Have DDoS response plan ready
- [ ] Test response plan (tabletop exercise)
---
## Related Runbooks
- [01-high-cpu-usage.md](01-high-cpu-usage.md) - If CPU overloaded
- [07-service-down.md](07-service-down.md) - If service crashed
- [../modules/security-incidents.md](../modules/security-incidents.md) - Security response
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (mandatory for DDoS)
- [ ] Identify attack vectors
- [ ] Document attacker IPs, patterns
- [ ] Report to ISP, CloudFlare (they may block attacker)
- [ ] Review and improve DDoS defenses
- [ ] Consider legal action (if attacker identified)
- [ ] Update this runbook if needed
---
## Useful Commands Reference
```bash
# Check connection count
netstat -ntu | wc -l
# Top IPs by connection count
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20
# Block IP (iptables)
iptables -A INPUT -s <IP> -j DROP
# Check nginx requests per second
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c
# List iptables rules
iptables -L -n -v
# Clear all iptables rules (CAREFUL!)
iptables -F
# Save iptables rules (persist after reboot)
iptables-save > /etc/iptables/rules.v4
```

@@ -0,0 +1,314 @@
# Playbook: Disk Full
## Symptoms
- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written
- Monitoring alert: "Disk usage >90%"
## Severity
- **SEV3** if disk >90% but still functioning
- **SEV2** if disk >95% and applications degraded
- **SEV1** if disk 100% and applications down
## Diagnosis
### Step 1: Check Disk Usage
```bash
# Check disk usage by partition
df -h
# Example output:
# Filesystem Size Used Avail Use% Mounted on
# /dev/sda1 50G 48G 2G 96% / ← CRITICAL
# /dev/sdb1 100G 20G 80G 20% /data
```
---
### Step 2: Find Large Directories
```bash
# Disk usage by top-level directory
du -sh /*
# Example output:
# 15G /var ← Likely logs
# 10G /home
# 5G /usr
# 1G /tmp
# Drill down into large directory
du -sh /var/*
# Example:
# 14G /var/log ← FOUND IT
# 500M /var/cache
```
---
### Step 3: Find Large Files
```bash
# Find files larger than 100MB
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -h -r | head -20
# Example output:
# 5.0G /var/log/application.log ← Large log file
# 2.0G /var/log/nginx/access.log
# 500M /tmp/dump.sql
```
---
### Step 4: Check for Deleted Files Holding Space
```bash
# Files deleted but process still has handle
lsof | grep deleted | awk '{print $1, $2, $7}' | sort -u
# Example output:
# nginx 1234 10G ← nginx has handle to 10GB deleted file
```
**Why this happens**:
- File deleted (`rm /var/log/nginx/access.log`)
- But process (nginx) still writing to it
- Disk space not released until process closes file or restarts
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Delete Old Logs**
```bash
# Delete old log files (>7 days)
find /var/log -name "*.log.*" -mtime +7 -delete
# Delete compressed logs (>30 days)
find /var/log -name "*.gz" -mtime +30 -delete
# journalctl: Keep only last 7 days
journalctl --vacuum-time=7d
# Impact: Frees disk space immediately
# Risk: Low (old logs not needed for debugging recent issues)
```
**Option B: Compress Logs**
```bash
# Compress large log files
gzip /var/log/application.log
gzip /var/log/nginx/access.log
# Impact: Reduces log file size by 80-90%
# Risk: Low (logs still available, just compressed)
```
**Option C: Release Deleted Files**
```bash
# Find processes holding deleted files
lsof | grep deleted
# Restart process to release space
systemctl restart nginx
# Or signal the process to reopen its log files (nginx uses USR1 for this)
kill -USR1 <PID>
# Impact: Frees disk space held by deleted files
# Risk: Medium (brief service interruption)
```
**Option D: Clean Temp Files**
```bash
# Delete old temp files
find /tmp -type f -atime +2 -delete
find /var/tmp -type f -atime +2 -delete
# Delete apt/yum cache
apt-get clean # Ubuntu/Debian
yum clean all # RHEL/CentOS
# Delete old kernels (Ubuntu)
apt-get autoremove --purge
# Impact: Frees disk space
# Risk: Low (temp files can be deleted)
```
---
### Short-term (5 min - 1 hour)
**Option A: Rotate Logs Immediately**
```bash
# Force log rotation
logrotate -f /etc/logrotate.conf
# Verify logs rotated
ls -lh /var/log/
# Configure aggressive rotation (daily instead of weekly)
# Edit /etc/logrotate.d/application:
/var/log/application.log {
daily # Was: weekly
rotate 7 # Keep 7 days
compress # Compress old logs
delaycompress # Don't compress most recent
missingok # Don't error if file missing
notifempty # Don't rotate if empty
create 0640 www-data www-data
sharedscripts
postrotate
systemctl reload application
endscript
}
```
**Option B: Archive Old Data**
```bash
# Archive old database dumps
tar -czf old-dumps.tar.gz /backup/*.sql
rm /backup/*.sql
# Move to cheaper storage (S3, Archive)
aws s3 cp old-dumps.tar.gz s3://archive-bucket/
rm old-dumps.tar.gz
# Impact: Frees local disk space
# Risk: Low (data archived, not deleted)
```
**Option C: Expand Disk** (cloud)
```bash
# AWS: Modify EBS volume
aws ec2 modify-volume --volume-id vol-1234567890abcdef0 --size 100 # Was 50 GB
# Wait for modification to complete (5-10 min)
watch aws ec2 describe-volumes-modifications --volume-ids vol-1234567890abcdef0
# Grow the partition first if the volume is partitioned (cloud-utils growpart)
sudo growpart /dev/xvda 1
# Resize filesystem
# ext4:
sudo resize2fs /dev/xvda1
# xfs:
sudo xfs_growfs /
# Verify
df -h
# Impact: More disk space
# Risk: Low (no downtime, but takes time)
```
---
### Long-term (1 hour+)
- [ ] Add disk usage monitoring (alert at >80%)
- [ ] Configure log rotation (daily, keep 7 days)
- [ ] Set up log forwarding (to ELK, Splunk, CloudWatch)
- [ ] Review disk usage trends (plan capacity)
- [ ] Add automated cleanup (cron job for old files; see the sketch after this list)
- [ ] Archive old data (move to S3, Glacier)
- [ ] Implement log sampling (reduce volume)
- [ ] Review application logging (reduce verbosity)
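A minimal sketch of the automated cleanup item above; paths and retention mirror the commands earlier in this runbook and should be adjusted to your environment:
```bash
cat > /etc/cron.daily/disk-cleanup <<'EOF'
#!/bin/bash
# Old rotated/compressed logs
find /var/log -name "*.log.*" -mtime +7 -delete
find /var/log -name "*.gz" -mtime +30 -delete
# Stale temp files
find /tmp -type f -atime +2 -delete
# Cap journald retention
journalctl --vacuum-time=7d
EOF
chmod +x /etc/cron.daily/disk-cleanup
```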
---
## Common Culprits
| Location | Cause | Solution |
|----------|-------|----------|
| /var/log | Log files not rotated | logrotate, compress, delete old |
| /tmp | Temp files not cleaned | Delete old files, add cron job |
| /var/cache | Apt/yum cache | apt-get clean, yum clean all |
| /home | User files, downloads | Clean up or expand disk |
| Database | Large tables, no archiving | Archive old data, vacuum |
| Deleted files | Process holding handle | Restart process |
---
## Prevention Checklist
- [ ] Configure log rotation (daily, 7 days retention)
- [ ] Add disk monitoring (alert at >80%)
- [ ] Set up log forwarding (reduce local storage)
- [ ] Add cron job to clean temp files
- [ ] Review disk trends monthly
- [ ] Plan capacity (expand before hitting limit)
- [ ] Archive old data (move to cheaper storage)
- [ ] Implement log sampling (reduce volume)
---
## Escalation
**Escalate to developer if**:
- Application generating excessive logs
- Need to reduce logging verbosity
**Escalate to DBA if**:
- Database files consuming disk
- Need to archive old data
**Escalate to infrastructure if**:
- Need to expand disk (physical server)
- Need to add new disk
---
## Related Runbooks
- [07-service-down.md](07-service-down.md) - If disk full crashed service
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify what filled disk
- [ ] Implement prevention (log rotation, monitoring)
- [ ] Review disk trends (prevent recurrence)
- [ ] Update this runbook if needed
---
## Useful Commands Reference
```bash
# Disk usage
df -h # By partition
du -sh /* # By directory
du -sh /var/* # Drill down
# Large files
find / -type f -size +100M -exec ls -lh {} \;
# Deleted files holding space
lsof | grep deleted
# Clean up
find /var/log -name "*.log.*" -mtime +7 -delete # Old logs
gzip /var/log/*.log # Compress
journalctl --vacuum-time=7d # journalctl
apt-get clean # Apt cache
yum clean all # Yum cache
# Log rotation
logrotate -f /etc/logrotate.conf
# Expand disk (after EBS resize)
resize2fs /dev/xvda1 # ext4
xfs_growfs / # xfs
```

@@ -0,0 +1,333 @@
# Playbook: Service Down
## Symptoms
- Service not responding
- Health check failures
- 502 Bad Gateway or 503 Service Unavailable
- Users can't access application
- Monitoring alert: "Service down", "Health check failed"
## Severity
- **SEV1** - Production service completely unavailable
## Diagnosis
### Step 1: Check Service Status
```bash
# Check if service is running (systemd)
systemctl status nginx
systemctl status application
systemctl status postgresql
# Check process
ps aux | grep nginx
pidof nginx
# Example output:
# nginx.service - nginx web server
# Active: inactive (dead) ← SERVICE IS DOWN
```
---
### Step 2: Check Why Service Stopped
**Check Service Logs** (systemd):
```bash
# Last 50 lines of service logs
journalctl -u nginx -n 50
# Tail logs in real-time
journalctl -u nginx -f
# Look for:
# - Exit code (0 = normal, non-zero = error)
# - Error messages
# - Crash reason
```
**Check Application Logs**:
```bash
# Check application error log
tail -100 /var/log/application/error.log
# Look for:
# - Exception/error before crash
# - Stack trace
# - "Fatal error", "Segmentation fault"
```
**Check System Logs**:
```bash
# Check for OOM (Out of Memory) killer
dmesg | grep -i "out of memory\|oom\|killed process"
# Example:
# Out of memory: Killed process 1234 (node) total-vm:8GB
# ↑ OOM Killer terminated application
# Check kernel errors
dmesg | tail -50
# Check syslog
grep "error\|segfault" /var/log/syslog
```
---
### Step 3: Identify Root Cause
**Common causes**:
| Symptom | Root Cause |
|---------|------------|
| "Out of memory" in dmesg | OOM Killer (memory leak, insufficient memory) |
| "Segmentation fault" | Application bug (crash) |
| "Address already in use" | Port already bound |
| "Connection refused" to database | Database down |
| "No such file or directory" | Missing config file |
| "Permission denied" | Wrong file permissions |
| Exit code 137 | Killed by OOM Killer |
| Exit code 139 | Segmentation fault |
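The exit code and kill signal can also be read straight from systemd instead of digging through logs:
```bash
# ExecMainStatus = exit code, ExecMainCode = how it exited (exited/killed/dumped), Result = failure reason
systemctl show -p ExecMainStatus,ExecMainCode,Result application
```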
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Restart Service**
```bash
# Restart service
systemctl restart nginx
# Check if started successfully
systemctl status nginx
# Test endpoint
curl http://localhost
# Impact: Service restored
# Risk: Low (if root cause not addressed, may crash again)
```
**Option B: Fix Configuration Error** (if config issue)
```bash
# Test configuration
nginx -t # nginx
sudo -u postgres postgres -C config_file -D /var/lib/postgresql/data  # postgres: fails if postgresql.conf is invalid (adjust paths)
# If config error, check recent changes
git diff HEAD~1 /etc/nginx/nginx.conf
# Revert to working config
git checkout HEAD~1 /etc/nginx/nginx.conf
# Restart
systemctl restart nginx
```
**Option C: Free Up Resources** (if OOM)
```bash
# Check memory usage
free -h
# Kill memory-heavy processes (non-critical)
kill -9 <PID>
# Free page cache
sync && echo 3 > /proc/sys/vm/drop_caches
# Restart service
systemctl restart application
```
**Option D: Change Port** (if port conflict)
```bash
# Check what's using port
lsof -i :80
# Example:
# apache2 1234 root 4u IPv4 12345 0t0 TCP *:80 (LISTEN)
# ↑ Apache using port 80
# Stop conflicting service
systemctl stop apache2
# Start intended service
systemctl start nginx
```
---
### Short-term (5 min - 1 hour)
**Option A: Fix Crash Bug** (if application bug)
```bash
# Check stack trace in logs
tail -100 /var/log/application/error.log
# Identify line causing crash
# Example: NullPointerException at PaymentService.java:42
# Deploy hotfix OR revert to previous version
git checkout <previous-working-commit>
npm run build && pm2 restart all
# Impact: Bug fixed, service stable
# Risk: Medium (need proper testing)
```
**Option B: Increase Memory** (if OOM)
```bash
# Short-term: Increase swap
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Long-term: Resize instance
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM
# Impact: More memory available
# Risk: Medium (swap is slow, instance resize has downtime)
```
**Option C: Enable Auto-Restart** (systemd)
```bash
# Edit the unit file: /etc/systemd/system/application.service
# (systemd does not allow inline comments after values)
[Unit]
# Allow at most 5 restart attempts within 60 seconds
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
# Auto-restart on failure, waiting 10s between attempts
Restart=always
RestartSec=10
# Reload systemd
systemctl daemon-reload
# Impact: Service auto-restarts on crash
# Risk: Low (but doesn't fix root cause)
```
**Option D: Route Traffic to Backup** (if multi-instance)
```bash
# If using load balancer:
# 1. Remove failed instance from LB
# 2. Traffic goes to healthy instances
# AWS:
aws elbv2 deregister-targets \
--target-group-arn <arn> \
--targets Id=i-1234567890abcdef0
# Impact: Users see working instance
# Risk: Low (other instances handle load)
```
---
### Long-term (1 hour+)
- [ ] Fix root cause (memory leak, bug, etc.)
- [ ] Add health check monitoring (see the sketch after this list)
- [ ] Enable auto-restart (systemd)
- [ ] Set up redundancy (multiple instances)
- [ ] Add load balancer (distribute traffic)
- [ ] Increase memory/CPU (if resource issue)
- [ ] Add alerting (service down, health check fail)
- [ ] Add E2E test (smoke test after deploy)
- [ ] Review deployment process (how did bug reach prod?)
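A minimal cron-based sketch for the health-check item above; the endpoint, interval, and alert address are placeholders, and a proper monitoring stack is preferable when available:
```bash
# /etc/cron.d/healthcheck: probe every minute, alert on failure
* * * * * root curl -fsS --max-time 5 http://localhost/health >/dev/null || echo "health check failed on $(hostname)" | mail -s "ALERT: service unhealthy" oncall@example.com
```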
---
## Root Cause Analysis
**For each incident, determine**:
1. **What failed?** (nginx, application, database)
2. **Why did it fail?** (OOM, bug, config error)
3. **What triggered it?** (deploy, traffic spike, external event)
4. **How to prevent?** (fix bug, add monitoring, increase capacity)
---
## Escalation
**Escalate to developer if**:
- Application crash due to bug
- Need code fix
**Escalate to platform team if**:
- Platform/framework issue
- Infrastructure problem
**Escalate to on-call manager if**:
- Can't restore service in 30 min
- Need additional resources
---
## Prevention Checklist
- [ ] Health check monitoring (alert on failure)
- [ ] Auto-restart (systemd Restart=always)
- [ ] Redundancy (multiple instances behind LB)
- [ ] Resource monitoring (CPU, memory alerts)
- [ ] Graceful degradation (circuit breakers, fallbacks)
- [ ] Smoke tests after deploy
- [ ] Rollback plan (blue-green, canary)
- [ ] Chaos engineering (test failure scenarios)
---
## Related Runbooks
- [03-memory-leak.md](03-memory-leak.md) - If OOM caused crash
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Application diagnostics
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (MANDATORY for SEV1)
- [ ] Timeline with all events
- [ ] Root cause analysis
- [ ] Action items (prevent recurrence)
- [ ] Update runbook if needed
- [ ] Share learnings with team
---
## Useful Commands Reference
```bash
# Service status
systemctl status <service>
systemctl restart <service>
journalctl -u <service> -n 50
# Process check
ps aux | grep <process>
pidof <process>
# Check OOM
dmesg | grep -i "out of memory\|oom"
# Check port usage
lsof -i :<port>
netstat -tlnp | grep <port>
# Test config
nginx -t
sudo -u postgres postgres -C config_file -D /var/lib/postgresql/data  # postgres config check (adjust paths)
# Health check
curl http://localhost/health
```

@@ -0,0 +1,337 @@
# Playbook: Data Corruption
## Symptoms
- Users report incorrect data
- Database integrity constraint violations
- Foreign key errors
- Application errors due to unexpected data
- Failed backups (checksum mismatch)
- Monitoring alert: "Data integrity check failed"
## Severity
- **SEV1** - Critical data corrupted (financial, health, legal)
- **SEV2** - Non-critical data corrupted (user profiles, cache)
- **SEV3** - Recoverable corruption (can restore from backup)
## Diagnosis
### Step 1: Confirm Corruption
**Database Integrity Check** (PostgreSQL):
```sql
-- Check whether data checksums are enabled
SHOW data_checksums;
-- Checksum failure counters per database (PostgreSQL 12+)
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE datname = 'your_database';
-- Check for bloat / unexpectedly large relations
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```
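For deeper verification, PostgreSQL also ships command-line checkers; both are read-only, and the data directory path below is a placeholder:
```bash
# Verify on-disk checksums while the cluster is STOPPED (PostgreSQL 12+)
pg_checksums --check -D /var/lib/postgresql/data
# Check heap and B-tree index consistency on a RUNNING cluster (amcheck extension, PostgreSQL 14+ client)
pg_amcheck -d your_database
```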
**Database Integrity Check** (MySQL):
```sql
-- Check table for corruption
CHECK TABLE users;
-- Repair table (MyISAM/ARCHIVE only; can destroy evidence, so prefer restoring InnoDB tables from backup)
REPAIR TABLE users;
-- Optimize table (defragment)
OPTIMIZE TABLE users;
```
---
### Step 2: Identify Scope
**Questions to answer**:
- Which tables/data are affected?
- How many records corrupted?
- When did corruption start?
- What's the impact on users?
**Check Database Logs**:
```bash
# PostgreSQL
grep "ERROR\|FATAL\|PANIC" /var/log/postgresql/postgresql.log
# MySQL
grep "ERROR" /var/log/mysql/error.log
# Look for:
# - Constraint violations
# - Foreign key errors
# - Checksum errors
# - Disk I/O errors
```
---
### Step 3: Determine Root Cause
**Common causes**:
| Cause | Symptoms |
|-------|----------|
| Disk corruption | I/O errors in dmesg, checksum failures |
| Application bug | Logical corruption (wrong data, not random) |
| Failed migration | Schema mismatch, foreign key violations |
| Concurrent writes | Race condition, duplicate records |
| Hardware failure | Random corruption, unrelated records |
| Malicious attack | Deliberate data modification |
**Check for Disk Errors**:
```bash
# Check disk errors
dmesg | grep -i "I/O error\|disk error"
# Check SMART status
smartctl -a /dev/sda
# Look for: Reallocated_Sector_Ct, Current_Pending_Sector
```
---
## Mitigation
### Immediate (Now - 5 min)
**CRITICAL: Preserve Evidence**
```bash
# 1. STOP ALL WRITES (prevent further corruption)
# Put application in read-only mode OR
# Take application offline
# 2. Snapshot/backup current state (even if corrupted)
# PostgreSQL:
pg_dump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql
# MySQL:
mysqldump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql
# 3. Snapshot disk (cloud)
# AWS:
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0 --description "Corruption snapshot"
# Impact: Preserves evidence for forensics
# Risk: None (read-only operations)
```
**CRITICAL: DO NOT**:
- Delete corrupted data (may need for forensics)
- Run REPAIR TABLE (may destroy evidence)
- Restart database (may clear logs)
---
### Short-term (5 min - 1 hour)
**Option A: Restore from Backup** (if recent clean backup)
```bash
# 1. Identify last known good backup
ls -lh /backup/ | grep pg_dump
# Example:
# backup-20251026-0200.sql ← Clean backup (before corruption)
# backup-20251026-0800.sql ← Corrupted
# 2. Restore from clean backup
# PostgreSQL:
psql your_database < /backup/backup-20251026-0200.sql
# MySQL:
mysql your_database < /backup/backup-20251026-0200.sql
# 3. Verify data integrity
# Run application tests
# Check user-reported issues
# Impact: Data restored to clean state
# Risk: Medium (lose data after backup time)
```
**Option B: Repair Corrupted Records** (if isolated corruption)
```sql
-- Identify corrupted records
SELECT * FROM users WHERE email IS NULL; -- Should not be null
-- Fix corrupted records
UPDATE users SET email = 'unknown@example.com' WHERE email IS NULL;
-- Verify fix
SELECT count(*) FROM users WHERE email IS NULL; -- Should be 0
-- Impact: Corruption fixed
-- Risk: Low (if corruption is known and fixable)
```
**Option C: Point-in-Time Recovery** (PostgreSQL)
```bash
# If WAL (Write-Ahead Logging) enabled:
# 1. Determine recovery point (before corruption)
# 2025-10-26 07:00:00 (corruption detected at 08:00)
# 2. Restore from base backup + WAL
pg_basebackup -D /var/lib/postgresql/data-recovery
# 3. Set the recovery target (PostgreSQL 12+: postgresql.conf plus an empty recovery.signal file;
#    older versions: recovery.conf)
# recovery_target_time = '2025-10-26 07:00:00'
# 4. Start PostgreSQL (will replay WAL until target time)
systemctl start postgresql
# Impact: Restore to exact point before corruption
# Risk: Low (if WAL available)
```
---
### Long-term (1 hour+)
**Root Cause Analysis**:
**If disk corruption**:
- [ ] Replace disk immediately
- [ ] Check RAID status
- [ ] Run filesystem check (fsck)
- [ ] Enable database checksums
**If application bug**:
- [ ] Fix bug in application code
- [ ] Add data validation
- [ ] Add integrity checks
- [ ] Add regression test
**If failed migration**:
- [ ] Review migration script
- [ ] Test migrations in staging first
- [ ] Add rollback plan
- [ ] Use transaction-based migrations
**If concurrent writes**:
- [ ] Add locking (row-level, table-level)
- [ ] Use optimistic locking (version column)
- [ ] Review transaction isolation level
- [ ] Add unique constraints
---
## Prevention
**Backups**:
- [ ] Daily automated backups
- [ ] Test restore process monthly
- [ ] Multiple backup locations (local + S3)
- [ ] Point-in-time recovery enabled (WAL)
- [ ] Retention: 30 days
**Monitoring**:
- [ ] Data integrity checks (checksums)
- [ ] Foreign key violation alerts
- [ ] Disk error monitoring (SMART)
- [ ] Backup success/failure alerts
- [ ] Application-level data validation
**Data Validation**:
- [ ] Database constraints (NOT NULL, FOREIGN KEY, CHECK)
- [ ] Application-level validation
- [ ] Schema migrations in transactions
- [ ] Automated data quality tests
**Redundancy**:
- [ ] Database replication (primary + replica)
- [ ] RAID for disk redundancy
- [ ] Multi-AZ deployment (cloud)
---
## Escalation
**Escalate to DBA if**:
- Database-level corruption
- Need expert for recovery
**Escalate to developer if**:
- Application bug causing corruption
- Need code fix
**Escalate to security team if**:
- Suspected malicious attack
- Unauthorized data modification
**Escalate to management if**:
- Critical data lost
- Legal/compliance implications
- Data breach
---
## Legal/Compliance
**If critical data corrupted**:
- [ ] Notify legal team
- [ ] Notify compliance team
- [ ] Check notification requirements:
- GDPR: 72 hours for breach notification
- HIPAA: 60 days for breach notification
- PCI-DSS: Immediate notification
- [ ] Document incident timeline (for audit)
- [ ] Preserve evidence (forensics)
---
## Related Runbooks
- [07-service-down.md](07-service-down.md) - If database down
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
- [../modules/security-incidents.md](../modules/security-incidents.md) - If malicious attack
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (MANDATORY for SEV1)
- [ ] Root cause analysis (what, why, how)
- [ ] Identify affected users/records
- [ ] User communication (if needed)
- [ ] Action items (prevent recurrence)
- [ ] Update backup/recovery procedures
- [ ] Update this runbook if needed
---
## Useful Commands Reference
```bash
# PostgreSQL: checksum failure counters (PostgreSQL 12+)
psql -c "SELECT datname, checksum_failures FROM pg_stat_database"
# MySQL table check
mysqlcheck -c your_database
# Backup
pg_dump your_database > backup.sql
mysqldump your_database > backup.sql
# Restore
psql your_database < backup.sql
mysql your_database < backup.sql
# Disk check
dmesg | grep -i "I/O error"
smartctl -a /dev/sda
fsck /dev/sda1
# Snapshot (AWS)
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0
```

@@ -0,0 +1,430 @@
# Playbook: Cascade Failure
## Symptoms
- Multiple services failing simultaneously
- Failures spreading across services
- Dependency services timing out
- Error rate increasing exponentially
- Monitoring alert: "Multiple services degraded", "Cascade detected"
## Severity
- **SEV1** - Cascade affecting production services
## What is a Cascade Failure?
**Definition**: One service failure triggers failures in dependent services, spreading through the system.
**Example**:
```
Database slow (2s queries)
API times out waiting for database (5s timeout)
Frontend times out waiting for API (10s timeout)
Load balancer marks frontend unhealthy
Traffic routes to other frontends (overload them)
All frontends fail → Complete outage
```
---
## Diagnosis
### Step 1: Identify Initial Failure Point
**Check Service Dependencies**:
```
Frontend → API → Database
             ├─→ Cache (Redis)
             ├─→ Queue (RabbitMQ)
             └─→ External API
```
**Find the root**:
```bash
# Check service health (start with leaf dependencies)
# 1. Database
psql -c "SELECT 1"
# 2. Cache
redis-cli PING
# 3. Queue
rabbitmqctl status
# 4. External API
curl https://api.external.com/health
# First failure = likely root cause
```
---
### Step 2: Trace Failure Propagation
**Check Service Logs** (in order):
```bash
# Database logs (first)
tail -100 /var/log/postgresql/postgresql.log
# API logs (second)
tail -100 /var/log/api/error.log
# Frontend logs (third)
tail -100 /var/log/frontend/error.log
```
**Look for timestamps**:
```
14:00:00 - Database: Slow query (7s) ← ROOT CAUSE
14:00:05 - API: Timeout error
14:00:10 - Frontend: API unavailable
14:00:15 - Load balancer: All frontends unhealthy
```
---
### Step 3: Assess Cascade Depth
**How many layers affected?**
- **1 layer**: Database only (isolated failure)
- **2-3 layers**: Database → API → Frontend (cascade)
- **4+ layers**: Full system cascade (critical)
---
## Mitigation
### Immediate (Now - 5 min)
**PRIORITY: Stop the cascade from spreading**
**Option A: Circuit Breaker** (if not already enabled)
```javascript
// Enable circuit breaker manually
// Prevents API from overwhelming database
const CircuitBreaker = require('opossum');
const dbQuery = new CircuitBreaker(queryDatabase, {
timeout: 3000, // 3s timeout
errorThresholdPercentage: 50, // Open after 50% failures
resetTimeout: 30000 // Try again after 30s
});
dbQuery.on('open', () => {
console.log('Circuit breaker OPEN - using fallback');
});
// Use fallback when circuit open
dbQuery.fallback(() => {
return cachedData; // Return cached data instead
});
```
**Option B: Rate Limiting** (protect downstream)
```nginx
# Limit requests to database (nginx)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
location /api/ {
limit_req zone=api burst=20 nodelay;
proxy_pass http://api-backend;
}
```
**Option C: Shed Load** (reject non-critical requests)
```javascript
// Reject non-critical requests when overloaded
app.use((req, res, next) => {
const load = getCurrentLoad(); // CPU, memory, queue depth
if (load > 0.8 && !isCriticalEndpoint(req.path)) {
return res.status(503).json({
error: 'Service overloaded, try again later'
});
}
next();
});
function isCriticalEndpoint(path) {
return ['/api/health', '/api/payment'].includes(path);
}
```
**Option D: Isolate Failure** (take failing service offline)
```bash
# Remove failing service from load balancer
# AWS ELB:
aws elbv2 deregister-targets \
--target-group-arn <arn> \
--targets Id=i-1234567890abcdef0
# nginx:
# Comment out failing backend in upstream block
# upstream api {
# server api1.example.com; # Healthy
# # server api2.example.com; # FAILING - commented out
# }
# Impact: Prevents failing service from affecting others
# Risk: Reduced capacity
```
---
### Short-term (5 min - 1 hour)
**Option A: Fix Root Cause**
**If database slow**:
```sql
-- Add missing index
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login_at);
```
**If external API slow**:
```javascript
// Add timeout + fallback (native fetch: abort via AbortSignal, catch the abort error)
try {
  const response = await fetch('https://api.external.com', {
    signal: AbortSignal.timeout(2000), // 2s timeout
  });
  if (!response.ok) return fallbackData; // Don't cascade failure
  return await response.json();
} catch (err) {
  return fallbackData; // Timed out: degrade gracefully instead of cascading
}
```
**If service overloaded**:
```bash
# Scale horizontally (add more instances)
# AWS Auto Scaling:
aws autoscaling set-desired-capacity \
--auto-scaling-group-name my-asg \
--desired-capacity 10 # Was 5
```
---
**Option B: Add Timeouts** (prevent indefinite waiting)
```javascript
// Database query timeout
const result = await db.query('SELECT * FROM users', {
timeout: 3000 // 3 second timeout
});
// API call timeout
const response = await fetch('/api/data', {
signal: AbortSignal.timeout(5000) // 5 second timeout
});
// Impact: Fail fast instead of cascading
// Risk: Low (better to timeout than cascade)
```
---
**Option C: Add Bulkheads** (isolate critical paths)
```javascript
// Separate connection pools for critical vs non-critical work (node-postgres Pool shown)
const { Pool } = require('pg');
const criticalPool = new Pool({ max: 10 });   // Payments, auth
const nonCriticalPool = new Pool({ max: 5 }); // Analytics, reports
// Critical requests get priority
app.post('/api/payment', async (req, res) => {
const conn = await criticalPool.connect();
// ...
});
// Non-critical requests use separate pool
app.get('/api/analytics', async (req, res) => {
const conn = await nonCriticalPool.connect();
// ...
});
// Impact: Critical paths protected from non-critical load
// Risk: None (isolation improves reliability)
```
---
### Long-term (1 hour+)
**Architecture Improvements**:
- [ ] **Circuit Breakers** (all external dependencies)
- [ ] **Timeouts** (every network call, database query)
- [ ] **Retries with exponential backoff** (transient failures)
- [ ] **Bulkheads** (isolate critical paths)
- [ ] **Rate limiting** (protect downstream services)
- [ ] **Graceful degradation** (fallback data, cached responses)
- [ ] **Health checks** (detect failures early)
- [ ] **Auto-scaling** (handle load spikes)
- [ ] **Chaos engineering** (test cascade scenarios)
---
## Cascade Prevention Patterns
### 1. Circuit Breaker Pattern
```javascript
// Assuming a circuit-breaker library such as 'opossum'
const CircuitBreaker = require('opossum');
const breaker = new CircuitBreaker(riskyOperation, {
timeout: 3000,
errorThresholdPercentage: 50,
resetTimeout: 30000
});
breaker.fallback(() => cachedData);
```
**Benefits**:
- Fast failure (don't wait for timeout)
- Automatic recovery (reset after timeout)
- Fallback data (graceful degradation)
---
### 2. Timeout Pattern
```javascript
// ALWAYS set timeouts
const response = await fetch('/api', {
signal: AbortSignal.timeout(5000)
});
```
**Benefits**:
- Fail fast (don't cascade indefinite waits)
- Predictable behavior
---
### 3. Bulkhead Pattern
```javascript
// Separate resource pools
const criticalPool = new Pool({ max: 10 });
const nonCriticalPool = new Pool({ max: 5 });
```
**Benefits**:
- Critical paths protected
- Non-critical load can't exhaust resources
---
### 4. Retry with Backoff
```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, retries = 3) {
for (let i = 0; i < retries; i++) {
try {
return await fn();
} catch (error) {
if (i === retries - 1) throw error;
await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
}
}
}
```
**Benefits**:
- Handles transient failures
- Exponential backoff prevents thundering herd
---
### 5. Load Shedding
```javascript
// Reject requests when overloaded
if (queueDepth > threshold) {
return res.status(503).send('Overloaded');
}
```
**Benefits**:
- Prevent overload
- Protect downstream services
---
## Escalation
**Escalate to architecture team if**:
- System-wide cascade
- Architectural changes needed
**Escalate to all service owners if**:
- Multiple teams affected
- Need coordinated response
**Escalate to management if**:
- Complete outage
- Large customer impact
---
## Prevention Checklist
- [ ] Circuit breakers on all external calls
- [ ] Timeouts on all network operations
- [ ] Retries with exponential backoff
- [ ] Bulkheads for critical paths
- [ ] Rate limiting (protect downstream)
- [ ] Health checks (detect failures early)
- [ ] Auto-scaling (handle load)
- [ ] Graceful degradation (fallback data)
- [ ] Chaos engineering (test failure scenarios)
- [ ] Load testing (find breaking points; see the k6 sketch below)
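For the load-testing item above, a minimal k6 script is one way to probe for the breaking point (the target URL, stages, and thresholds below are placeholders to adapt):
```javascript
// save as load-test.js, run with: k6 run load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 50 },   // ramp up to 50 virtual users
    { duration: '3m', target: 200 },  // push toward the suspected breaking point
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // fail the run if p95 latency exceeds 500ms
    http_req_failed: ['rate<0.01'],   // or if more than 1% of requests fail
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/health'); // placeholder target
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```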
---
## Related Runbooks
- [04-slow-api-response.md](04-slow-api-response.md) - API performance
- [07-service-down.md](07-service-down.md) - Service failures
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (MANDATORY for cascade failures)
- [ ] Draw cascade diagram (which services failed in order)
- [ ] Identify missing safeguards (circuit breakers, timeouts)
- [ ] Implement prevention patterns
- [ ] Test cascade scenarios (chaos engineering)
- [ ] Update this runbook if needed
---
## Cascade Failure Examples
**Netflix Outage (2012)**:
- Database latency → API timeouts → Frontend failures → Complete outage
- **Fix**: Circuit breakers, timeouts, fallback data
**AWS S3 Outage (2017)**:
- S3 down → Websites using S3 fail → Status dashboards fail (also on S3)
- **Fix**: Multi-region redundancy, fallback to different regions
**Google Cloud Outage (2019)**:
- Network misconfiguration → Internal services fail → External services cascade
- **Fix**: Network configuration validation, staged rollouts
---
## Key Takeaways
1. **Cascades happen when failures propagate** (no circuit breakers, timeouts)
2. **Fix the root cause first** (not the symptoms)
3. **Fail fast, don't cascade waits** (timeouts everywhere)
4. **Graceful degradation** (fallback > failure)
5. **Test failure scenarios** (chaos engineering)

# Playbook: Rate Limit Exceeded
## Symptoms
- "Rate limit exceeded" errors
- "429 Too Many Requests" responses
- "Quota exceeded" messages
- Legitimate requests being blocked
- Monitoring alert: "High rate of 429 errors"
## Severity
- **SEV3** if isolated to specific users/endpoints
- **SEV2** if affecting many users
- **SEV1** if critical functionality blocked (payments, auth)
## Diagnosis
### Step 1: Identify What's Rate Limited
**Check Error Messages**:
```bash
# Application logs
grep "rate limit\|429\|quota exceeded" /var/log/application.log
# nginx logs
awk '$9 == 429 {print $1, $7}' /var/log/nginx/access.log | sort | uniq -c
# Example output:
# 500 192.168.1.100 /api/users ← IP hitting rate limit
# 200 192.168.1.101 /api/posts
```
**Check Rate Limit Source**:
- **Application-level**: Your code enforcing limit
- **nginx/API Gateway**: Reverse proxy rate limiting
- **External API**: Third-party service limit (Stripe, Twilio, etc.)
- **Cloud**: AWS API Gateway, CloudFlare
---
### Step 2: Determine If Legitimate or Malicious
**Legitimate traffic**:
```
Scenario: User refreshing dashboard repeatedly
Pattern: Single user, single endpoint, short burst
Action: Increase rate limit or add caching
```
**Malicious traffic** (abuse):
```
Scenario: Scraper or bot
Pattern: Multiple IPs, automated behavior, sustained
Action: Block IPs, add CAPTCHA
```
**Traffic spike** (legitimate):
```
Scenario: Marketing campaign, viral post
Pattern: Many users, distributed IPs, real user behavior
Action: Increase rate limit, scale up
```
---
### Step 3: Check Current Rate Limits
**nginx**:
```nginx
# Check nginx.conf
grep "limit_req" /etc/nginx/nginx.conf
# Example:
# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
# ^^^^ Current limit
```
**Application** (Express.js example):
```javascript
// Check rate limit middleware
const rateLimit = require('express-rate-limit');
const limiter = rateLimit({
windowMs: 15 * 60 * 1000, // 15 minutes
max: 100, // Limit: 100 requests per 15 minutes
});
```
**External API**:
```bash
# Check external API documentation
# Stripe: 100 requests per second
# Twilio: 100 requests per second
# Google Maps: $200/month free quota
# Check current usage
# Stripe:
curl https://api.stripe.com/v1/balance \
-u sk_test_XXX: \
-H "Stripe-Account: acct_XXX"
# Response headers (if the provider exposes rate-limit headers):
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 45 ← 45 requests left
```
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Increase Rate Limit** (if legitimate traffic)
**nginx**:
```nginx
# Edit /etc/nginx/nginx.conf
# Increase from 10r/s to 50r/s
limit_req_zone $binary_remote_addr zone=one:10m rate=50r/s;
# Test and reload
nginx -t && systemctl reload nginx
# Impact: Allows more requests
# Risk: Low (if traffic is legitimate)
```
**Application** (Express.js):
```javascript
// Increase from 100 to 500 requests per 15 min
const limiter = rateLimit({
windowMs: 15 * 60 * 1000,
max: 500, // Increased
});
// Restart application
pm2 restart all
```
---
**Option B: Whitelist Specific IPs** (if known legitimate source)
**nginx**:
```nginx
# Whitelist internal IPs, monitoring systems
geo $limit {
default 1;
10.0.0.0/8 0; # Internal network
192.168.1.100 0; # Monitoring system
}
map $limit $limit_key {
0 "";
1 $binary_remote_addr;
}
limit_req_zone $limit_key zone=one:10m rate=10r/s;
```
**Application**:
```javascript
const limiter = rateLimit({
skip: (req) => {
// Whitelist internal IPs
return req.ip.startsWith('10.') || req.ip === '192.168.1.100';
},
windowMs: 15 * 60 * 1000,
max: 100,
});
```
---
**Option C: Add Caching** (reduce requests to backend)
**Redis cache**:
```javascript
// Using ioredis for a promise-based client (node-redis v4 would also need `await client.connect()`)
const Redis = require('ioredis');
const redis = new Redis(); // localhost:6379 by default
app.get('/api/users', async (req, res) => {
// Check cache first
const cached = await redis.get('users:' + req.query.id);
if (cached) {
return res.json(JSON.parse(cached));
}
// Fetch from database
const user = await db.query('SELECT * FROM users WHERE id = ?', [req.query.id]);
// Cache for 5 minutes
await redis.setex('users:' + req.query.id, 300, JSON.stringify(user));
res.json(user);
});
// Impact: Reduces backend load, fewer rate limit hits
// Risk: Low (data staleness acceptable)
```
---
**Option D: Block Malicious IPs** (if abuse detected)
**iptables / nginx**:
```bash
# Block specific IP
iptables -A INPUT -s 192.168.1.100 -j DROP
# Or in nginx.conf:
deny 192.168.1.100;
deny 192.168.1.0/24; # Block range
```
**CloudFlare**:
```
# CloudFlare dashboard:
# Security → WAF → Custom rules
# Block IP: 192.168.1.100
```
---
### Short-term (5 min - 1 hour)
**Option A: Implement Tiered Rate Limits**
**Different limits for different users**:
```javascript
const rateLimit = require('express-rate-limit');
const createLimiter = (max) => rateLimit({
windowMs: 15 * 60 * 1000,
max: max,
keyGenerator: (req) => req.user?.id || req.ip,
});
// Create each limiter once; building a new limiter per request would reset its counters
const premiumLimiter = createLimiter(1000);      // Premium: 1000 req/15min
const authenticatedLimiter = createLimiter(300); // Authenticated: 300 req/15min
const anonymousLimiter = createLimiter(100);     // Anonymous: 100 req/15min
app.use('/api', (req, res, next) => {
  let limiter;
  if (req.user?.tier === 'premium') {
    limiter = premiumLimiter;
  } else if (req.user) {
    limiter = authenticatedLimiter;
  } else {
    limiter = anonymousLimiter;
  }
  limiter(req, res, next);
});
```
---
**Option B: Add CAPTCHA** (prevent bots)
**reCAPTCHA** on sensitive endpoints:
```javascript
// express-recaptcha exposes RecaptchaV2/RecaptchaV3 classes; instantiate with your keys
const { RecaptchaV2 } = require('express-recaptcha');
const recaptcha = new RecaptchaV2(process.env.RECAPTCHA_SITE_KEY, process.env.RECAPTCHA_SECRET);
app.post('/api/login', recaptcha.middleware.verify, async (req, res) => {
if (!req.recaptcha.error) {
// CAPTCHA valid, proceed with login
await handleLogin(req, res);
} else {
res.status(400).json({ error: 'CAPTCHA failed' });
}
});
```
---
**Option C: Upgrade External API Plan** (if hitting external limit)
**Stripe**:
```
Current: 100 requests/second (free)
Upgrade: Contact Stripe for higher limit (paid)
```
**AWS API Gateway**:
```bash
# Increase throttle limit
aws apigateway update-usage-plan \
--usage-plan-id <ID> \
--patch-operations \
op=replace,path=/throttle/rateLimit,value=1000
# Impact: Higher rate limit
# Risk: None (may cost more)
```
---
### Long-term (1 hour+)
- [ ] **Implement tiered rate limits** (premium, authenticated, anonymous)
- [ ] **Add caching** (reduce backend load)
- [ ] **Use CDN** (cache static content, reduce origin requests)
- [ ] **Add CAPTCHA** (prevent bots on sensitive endpoints)
- [ ] **Monitor rate limit usage** (alert before hitting limit)
- [ ] **Batch requests** (reduce API calls to external services; see the sketch after this list)
- [ ] **Implement retry with backoff** (external API rate limits)
- [ ] **Document rate limits** (API documentation for users)
- [ ] **Add rate limit headers** (tell users their remaining quota)
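For the batching item above, a rough sketch of coalescing individual lookups into a single call against a hypothetical batch endpoint (`/v1/users?ids=...` and the 50 ms window are assumptions, not a real vendor API):
```javascript
// Collect individual lookups for a short window, then send them as one batched call.
// Requires Node 18+ for the global fetch.
const pending = new Map(); // id -> [{ resolve, reject }, ...]
let flushTimer = null;

function fetchUser(id) {
  return new Promise((resolve, reject) => {
    if (!pending.has(id)) pending.set(id, []);
    pending.get(id).push({ resolve, reject });
    if (!flushTimer) flushTimer = setTimeout(flushBatch, 50); // 50ms batching window
  });
}

async function flushBatch() {
  const batch = new Map(pending);
  pending.clear();
  flushTimer = null;
  try {
    const ids = [...batch.keys()].join(',');
    // Hypothetical batch endpoint: 1 call instead of N
    const res = await fetch(`https://api.example.com/v1/users?ids=${ids}`);
    const users = await res.json(); // assume a { [id]: user } response shape
    for (const [id, waiters] of batch) {
      waiters.forEach(({ resolve }) => resolve(users[id]));
    }
  } catch (err) {
    for (const waiters of batch.values()) {
      waiters.forEach(({ reject }) => reject(err));
    }
  }
}
```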
---
## Rate Limit Best Practices
### 1. Return Helpful Headers
**Example response** (the 429 status is defined in RFC 6585; the `X-RateLimit-*` headers are a widely used convention):
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1698345600 # Unix timestamp
Retry-After: 60        # Seconds until reset

{
"error": "Rate limit exceeded",
"message": "You have exceeded the rate limit of 100 requests per 15 minutes. Try again in 60 seconds."
}
```
**Implementation**:
```javascript
const limiter = rateLimit({
windowMs: 15 * 60 * 1000,
max: 100,
standardHeaders: true, // Return RateLimit-* headers
legacyHeaders: false,
  handler: (req, res) => {
    const retryInSeconds = Math.ceil((req.rateLimit.resetTime - Date.now()) / 1000);
    res.status(429).json({
      error: 'Rate limit exceeded',
      message: `You have exceeded the rate limit of ${req.rateLimit.limit} requests per 15 minutes. Try again in ${retryInSeconds} seconds.`,
    });
  },
});
});
```
---
### 2. Use Sliding Window (not Fixed Window)
**Fixed window** (bad):
```
Window 1: 00:00-00:15 (100 requests)
Window 2: 00:15-00:30 (100 requests)
User makes 100 requests at 00:14:59
User makes 100 requests at 00:15:01
→ 200 requests in 2 seconds! (burst)
```
**Sliding window** (good):
```
Rate limit based on last 15 minutes from current time
→ Can't burst (limit enforced continuously)
```
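A minimal in-memory sliding-window middleware sketch (per-IP, single process only; a production setup would typically back this with Redis so all instances share counts):
```javascript
// Sliding window: count only the requests that fall inside the last `windowMs`.
const WINDOW_MS = 15 * 60 * 1000; // 15 minutes
const MAX_REQUESTS = 100;
const hits = new Map(); // ip -> array of request timestamps

function slidingWindowLimiter(req, res, next) {
  const now = Date.now();
  const timestamps = (hits.get(req.ip) || []).filter((t) => now - t < WINDOW_MS);
  if (timestamps.length >= MAX_REQUESTS) {
    // Oldest request in the window determines when capacity frees up
    res.set('Retry-After', String(Math.ceil((timestamps[0] + WINDOW_MS - now) / 1000)));
    return res.status(429).json({ error: 'Rate limit exceeded' });
  }
  timestamps.push(now);
  hits.set(req.ip, timestamps);
  next();
}

// app.use('/api', slidingWindowLimiter);
```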
---
### 3. Different Limits for Different Endpoints
```javascript
// Expensive endpoint (lower limit)
app.get('/api/analytics', rateLimit({ max: 10 }), handler);
// Cheap endpoint (higher limit)
app.get('/api/health', rateLimit({ max: 1000 }), handler);
```
---
## External API Rate Limit Handling
### Retry with Backoff
```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callExternalAPI(url, retries = 3) {
for (let i = 0; i < retries; i++) {
try {
const response = await fetch(url);
// Check rate limit headers
      const remaining = Number(response.headers.get('X-RateLimit-Remaining'));
      if (!Number.isNaN(remaining) && remaining < 10) {
console.warn('Approaching rate limit:', remaining);
}
if (response.status === 429) {
// Rate limited
const retryAfter = response.headers.get('Retry-After') || 60;
console.log(`Rate limited, retrying after ${retryAfter}s`);
await sleep(retryAfter * 1000);
continue;
}
return response.json();
} catch (error) {
if (i === retries - 1) throw error;
await sleep(Math.pow(2, i) * 1000); // Exponential backoff
}
}
}
```
---
## Escalation
**Escalate to developer if**:
- Application rate limit logic needs changes
- Need to implement caching
**Escalate to infrastructure if**:
- nginx/API Gateway rate limit config
- Need to scale up capacity
**Escalate to external vendor if**:
- Hitting external API rate limit
- Need higher quota
---
## Related Runbooks
- [05-ddos-attack.md](05-ddos-attack.md) - If malicious traffic
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify why rate limit hit
- [ ] Adjust rate limits (if needed)
- [ ] Add monitoring (alert before hitting limit)
- [ ] Document rate limits (for users/API consumers)
- [ ] Update this runbook if needed
---
## Useful Commands Reference
```bash
# Check 429 errors (nginx)
awk '$9 == 429 {print $1}' /var/log/nginx/access.log | sort | uniq -c
# Check rate limit config (nginx)
grep "limit_req" /etc/nginx/nginx.conf
# Block IP (iptables)
iptables -A INPUT -s <IP> -j DROP
# Test rate limit
for i in {1..200}; do curl http://localhost/api; done
# Check external API rate limit
curl -I https://api.example.com -H "Authorization: Bearer TOKEN"
# Look for X-RateLimit-* headers
```