Initial commit
agents/sre/playbooks/01-high-cpu-usage.md (new file, 204 lines)

# Playbook: High CPU Usage

## Symptoms

- CPU usage at 80-100%
- Applications slow or unresponsive
- Server lag, slow SSH sessions
- Monitoring alert: "CPU usage >80% for 5 minutes"

## Severity

- **SEV2** if application degraded but functional
- **SEV1** if application unresponsive

## Diagnosis

### Step 1: Identify Top CPU Process

```bash
# Current CPU usage
top -bn1 | head -20

# Top CPU processes
ps aux | sort -nrk 3,3 | head -10

# CPU per thread
top -H -p <PID>
```

**What to look for**:
- A single process using >80% CPU
- Multiple processes all high (system-wide issue)
- System CPU vs. user CPU (high iowait points to a disk issue)

---

### Step 2: Identify Process Type

**Application process** (node, java, python):
```bash
# Check application logs
tail -100 /var/log/application.log

# Check for infinite loops, heavy computation
# Check APM for slow endpoints
```

**System process** (kernel, systemd):
```bash
# Check system logs
dmesg | tail -50
journalctl -xe

# Check for hardware issues
```

**Unknown/suspicious process**:
```bash
# Check process details
ps aux | grep <PID>
lsof -p <PID>

# Could be malware (e.g. crypto mining)
# See security-incidents.md
```

---

### Step 3: Check If Disk-Related

```bash
# Check iowait
iostat -x 1 5

# If iowait >20%, the disk is the bottleneck
# See infrastructure.md for disk I/O troubleshooting
```

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Lower Process Priority**
```bash
# Reduce CPU priority
renice +10 <PID>

# Impact: Process gets less CPU time
# Risk: Low (process still runs, just slower)
```

**Option B: Kill Process** (if application)
```bash
# Graceful shutdown
kill -TERM <PID>

# Force kill (last resort)
kill -KILL <PID>

# Restart service
systemctl restart <service>

# Impact: Process restarts, CPU normalizes
# Risk: Medium (brief downtime)
```

**Option C: Scale Horizontally** (cloud)
```bash
# Add more instances to distribute load
# AWS: Auto Scaling Group
# Azure: Scale Set
# Kubernetes: Horizontal Pod Autoscaler

# Impact: Load distributed across instances
# Risk: Low (no downtime)
```

---

### Short-term (5 min - 1 hour)

**Option A: Optimize Code** (if application bug)
```bash
# Profile the application
# Node.js: node --prof
# Java: jstack, jvisualvm
# Python: py-spy

# Identify the hot path
# Fix the infinite loop or heavy computation
```

**Option B: Add Caching**
```javascript
// Cache expensive computation
const cache = new Map();

function expensiveOperation(input) {
  if (cache.has(input)) {
    return cache.get(input);
  }

  const result = computeHeavy(input); // the heavy computation itself
  cache.set(input, result);
  return result;
}
```

**Option C: Scale Vertically** (cloud)
```bash
# Resize to a larger instance type
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM

# Impact: More CPU capacity
# Risk: Medium (brief downtime during resize)
```

---

### Long-term (1 hour+)

- [ ] Add CPU monitoring alert (>70% for 5 min)
- [ ] Optimize application code (reduce computation)
- [ ] Use worker threads for heavy tasks (Node.js)
- [ ] Implement auto-scaling (cloud)
- [ ] Add APM for performance profiling
- [ ] Review architecture (async processing, job queues)

---

## Escalation

**Escalate to developer if**:
- Application code is causing the issue
- A code fix or optimization is required

**Escalate to security-agent if**:
- Unknown/suspicious process
- Potential malware or crypto mining

**Escalate to infrastructure if**:
- Hardware issue (kernel errors)
- Cloud infrastructure problem

---

## Related Runbooks

- [03-memory-leak.md](03-memory-leak.md) - If memory also high
- [04-slow-api-response.md](04-slow-api-response.md) - If API slow due to CPU
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure diagnostics

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify root cause
- [ ] Add monitoring/alerting
- [ ] Update this runbook if needed
- [ ] Add regression test (if code bug)

agents/sre/playbooks/02-database-deadlock.md (new file, 241 lines)

# Playbook: Database Deadlock

## Symptoms

- "Deadlock detected" errors in application
- API returning 500 errors
- Transactions timing out
- Database connection pool exhausted
- Monitoring alert: "Deadlock count >0"

## Severity

- **SEV2** if isolated to specific endpoint
- **SEV1** if affecting all database operations

## Diagnosis

### Step 1: Confirm Deadlock (PostgreSQL)

```sql
-- Check for currently blocked queries and what is blocking them
SELECT
    blocked_locks.pid AS blocked_pid,
    blocked_activity.usename AS blocked_user,
    blocking_locks.pid AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocked_activity.query AS blocked_statement,
    blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Check the cumulative deadlock counter
SELECT datname, deadlocks FROM pg_stat_database WHERE datname = 'your_database';
```

### Step 2: Confirm Deadlock (MySQL)

```sql
-- Show InnoDB status (includes deadlock info)
SHOW ENGINE INNODB STATUS\G

-- Look for the "LATEST DETECTED DEADLOCK" section
-- It shows which transactions were involved
```

---
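
The `SHOW ENGINE INNODB STATUS` output is one large text blob; if it is collected programmatically, the deadlock section can be cut out mechanically (a sketch; it assumes InnoDB's dashed-line section delimiters):

```javascript
// Extract the "LATEST DETECTED DEADLOCK" section from
// SHOW ENGINE INNODB STATUS output (sections end at a dashed line)
function latestDeadlockSection(statusText) {
  const start = statusText.indexOf('LATEST DETECTED DEADLOCK');
  if (start === -1) return null; // no deadlock recorded since startup
  const rest = statusText.slice(start);
  const end = rest.indexOf('\n------------');
  return end === -1 ? rest.trim() : rest.slice(0, end).trim();
}
```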

### Step 3: Identify Deadlock Pattern

**Common Pattern 1: Lock Order Mismatch**
```
Transaction A: Locks row 1, then row 2
Transaction B: Locks row 2, then row 1
→ DEADLOCK
```

**Common Pattern 2: Gap Locks**
```
Transaction A: SELECT ... FOR UPDATE WHERE id BETWEEN 1 AND 10
Transaction B: INSERT INTO table (id) VALUES (5)
→ DEADLOCK
```

**Common Pattern 3: Foreign Key Deadlock**
```
Transaction A: Updates parent table
Transaction B: Inserts into child table
→ DEADLOCK (foreign key check locks)
```

---
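
Whichever pattern applies, the application side usually pairs the fix with a bounded retry, since databases resolve a deadlock by aborting one victim transaction. A minimal sketch (the error checks cover PostgreSQL's SQLSTATE `40P01` and MySQL's errno 1213; the `fn` callback is whatever runs the transaction):

```javascript
// Retry a transactional function a few times when the database aborts it
// as a deadlock victim (PostgreSQL SQLSTATE 40P01, MySQL errno 1213)
async function withDeadlockRetry(fn, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const isDeadlock = err.code === '40P01' || err.errno === 1213;
      if (!isDeadlock || attempt >= maxAttempts) throw err;
      // brief backoff so the competing transaction can finish
      await new Promise((resolve) => setTimeout(resolve, 50 * attempt));
    }
  }
}
```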

## Mitigation

### Immediate (Now - 5 min)

**Option A: Kill Blocking Query** (PostgreSQL)
```sql
-- Terminate the blocking process
SELECT pg_terminate_backend(<blocking_pid>);

-- Verify the lock queue cleared
SELECT count(*) FROM pg_locks WHERE NOT granted;
-- Should return 0
```

**Option B: Kill Blocking Query** (MySQL)
```sql
-- Show process list
SHOW PROCESSLIST;

-- Kill the blocking query
KILL <process_id>;
```

**Option C: Kill Idle Transactions** (PostgreSQL)
```sql
-- Terminate transactions idle for more than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < NOW() - INTERVAL '5 minutes';

-- Impact: Frees up locks
-- Risk: Low (transactions are idle)
```

---

### Short-term (5 min - 1 hour)

**Option A: Add Transaction Timeout** (PostgreSQL)
```sql
-- Set statement timeout (30 seconds)
ALTER DATABASE your_database SET statement_timeout = '30s';

-- Or per session, in the application:
SET statement_timeout = '30s';

-- Impact: Prevents long-running statements from holding locks
-- Risk: Low (transactions should be fast)
```

**Option B: Add Transaction Timeout** (MySQL)
```sql
-- Set lock wait timeout
SET GLOBAL innodb_lock_wait_timeout = 30;

-- Impact: Transactions fail fast instead of queueing on locks
-- Risk: Low (application should handle the errors)
```

**Option C: Fix Lock Order in Application**
```javascript
// BAD: Lock order depends on argument order — two concurrent
// transfers A→B and B→A can deadlock
async function transferMoney(fromId, toId, amount) {
  await db.query('UPDATE accounts SET balance = balance - ? WHERE id = ?', [amount, fromId]);
  await db.query('UPDATE accounts SET balance = balance + ? WHERE id = ?', [amount, toId]);
}

// GOOD: Acquire the row locks in a consistent (ascending id) order,
// then apply the debit/credit to the correct accounts
async function transferMoney(fromId, toId, amount) {
  const [firstId, secondId] = fromId < toId ? [fromId, toId] : [toId, fromId];

  await db.query('SELECT id FROM accounts WHERE id = ? FOR UPDATE', [firstId]);
  await db.query('SELECT id FROM accounts WHERE id = ? FOR UPDATE', [secondId]);

  await db.query('UPDATE accounts SET balance = balance - ? WHERE id = ?', [amount, fromId]);
  await db.query('UPDATE accounts SET balance = balance + ? WHERE id = ?', [amount, toId]);
}
```

---

### Long-term (1 hour+)

**Option A: Reduce Transaction Scope**
```javascript
// BAD: Long transaction holding a row lock across an external call
await db.query('BEGIN');
const user = await db.query('SELECT * FROM users WHERE id = ? FOR UPDATE', [userId]);
await sendEmail(user.email); // External call (slow!) while the lock is held
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = ?', [userId]);
await db.query('COMMIT');

// GOOD: Keep the external call outside the transaction
const user = await db.query('SELECT * FROM users WHERE id = ?', [userId]);
await sendEmail(user.email); // Outside transaction
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = ?', [userId]);
```

**Option B: Use Optimistic Locking**
```sql
-- Add a version column
ALTER TABLE accounts ADD COLUMN version INT DEFAULT 0;

-- Update with a version check
UPDATE accounts
SET balance = balance - 100, version = version + 1
WHERE id = 1 AND version = <current_version>;

-- If 0 rows were updated, re-read and retry with the new version
```

**Option C: Review Isolation Level**
```sql
-- PostgreSQL default: READ COMMITTED
-- Most cases: READ COMMITTED is fine
-- Rare cases: REPEATABLE READ or SERIALIZABLE

-- Lower isolation = less locking = fewer deadlocks
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```

---
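
The optimistic-locking option above is normally paired with an application-side retry: if the version check matched no rows, another transaction won the race, so re-read and try again. A sketch (the `db.query` result shape with `rows`/`rowCount` follows node-postgres conventions and is an assumption):

```javascript
// Optimistic debit: retry when the version check matched no rows,
// i.e. another transaction updated the account first
async function debitOptimistic(db, accountId, amount, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { rows } = await db.query(
      'SELECT version FROM accounts WHERE id = $1', [accountId]);
    const result = await db.query(
      'UPDATE accounts SET balance = balance - $1, version = version + 1 ' +
      'WHERE id = $2 AND version = $3',
      [amount, accountId, rows[0].version]);
    if (result.rowCount === 1) return; // our version check won
  }
  throw new Error('too many concurrent updates');
}
```

No row locks are held between the read and the write, so this pattern cannot deadlock; it trades locks for retries.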

## Escalation

**Escalate to developer if**:
- Application code is causing the deadlock
- Code refactoring is required

**Escalate to DBA if**:
- Database configuration issue
- Foreign key constraint problem

---

## Prevention

- [ ] Always lock in the same order
- [ ] Keep transactions short
- [ ] Use timeouts (statement_timeout, lock_wait_timeout)
- [ ] Use optimistic locking when possible
- [ ] Add deadlock monitoring alert
- [ ] Review isolation level (lower = fewer deadlocks)

---

## Related Runbooks

- [04-slow-api-response.md](04-slow-api-response.md) - If API slow due to deadlock
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem
- [ ] Identify which queries deadlocked
- [ ] Fix lock order in application code
- [ ] Add regression test
- [ ] Update this runbook if needed

agents/sre/playbooks/03-memory-leak.md (new file, 252 lines)

# Playbook: Memory Leak

## Symptoms

- Memory usage increasing continuously over time
- Application crashes with OutOfMemoryError (Java) or "JavaScript heap out of memory" (Node.js)
- Performance degrades over time
- High swap usage
- Monitoring alert: "Memory usage >90%"

## Severity

- **SEV2** if memory increasing but not yet critical
- **SEV1** if application crashed or unresponsive

## Diagnosis

### Step 1: Confirm Memory Leak

```bash
# Monitor memory over time (5 minute intervals)
watch -n 300 'ps aux | grep <process> | awk "{print \$4, \$5, \$6}"'

# Check whether memory is continuously increasing
# Leak:   20% → 30% → 40% → 50% (steady growth)
# Normal: 30% → 32% → 31% → 30% (stable)
```

---
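
Eyeballing the trend works, but given a series of samples the check can also be done mechanically: a consistently positive least-squares slope suggests a leak (a sketch; sampling interval and units are whatever your monitoring emits):

```javascript
// Least-squares slope of equally spaced memory samples
// (units: percent or MB per sampling interval)
function memorySlope(samples) {
  const n = samples.length;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  samples.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}

console.log(memorySlope([20, 30, 40, 50])); // → 10 (leaking)
console.log(memorySlope([30, 32, 31, 30])); // small magnitude (stable)
```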

### Step 2: Get Memory Snapshot

**Java (Heap Dump)**:
```bash
# Get heap dump
jmap -dump:format=b,file=heap.bin <PID>

# Analyze with jhat or VisualVM
jhat heap.bin
# Open http://localhost:7000

# Or use Eclipse Memory Analyzer
```

**Node.js (Heap Snapshot)**:
```bash
# Start with the inspector enabled
node --inspect index.js
# Chrome DevTools → Memory → Take heap snapshot

# Or trigger a snapshot from application code with the heapdump module:
#   const heapdump = require('heapdump');
#   heapdump.writeSnapshot('/tmp/heap-' + Date.now() + '.heapsnapshot');
```

**Python (Memory Profiler)**:
```bash
# Install memory_profiler
pip install memory_profiler

# Profile a script
python -m memory_profiler script.py
```

---

### Step 3: Identify Leak Source

**Look for**:
- Large arrays/objects growing over time
- Detached DOM nodes (if browser/UI)
- Event listeners not removed
- Timers/intervals not cleared
- Closures holding references
- Caches without an eviction policy

**Common patterns**:
```javascript
// 1. Global cache growing forever
global.cache = {}; // Never cleared

// 2. Event listeners not removed
emitter.on('event', handler); // Never removed

// 3. Timers not cleared
setInterval(() => { /* ... */ }, 1000); // Never cleared

// 4. Closures
function createHandler() {
  const largeData = new Array(1000000);
  return () => {
    // Closure keeps largeData in memory
  };
}
```

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Restart Application**
```bash
# Restart to free memory
systemctl restart application

# Impact: Memory usage returns to baseline
# Risk: Low (brief downtime)
# NOTE: This is temporary; the leak will recur!
```

**Option B: Increase Memory Limit** (temporary)
```bash
# Java
java -Xmx4G -jar application.jar # Was 2G

# Node.js
node --max-old-space-size=4096 index.js # Was 2048

# Impact: Buys time to find the root cause
# Risk: Low (but doesn't fix the leak)
```

**Option C: Scale Horizontally** (cloud)
```bash
# Add more instances
# Use a load balancer to rotate traffic
# Restart instances on a schedule (e.g., every 6 hours)

# Impact: Distributes load; scheduled restarts prevent OOM
# Risk: Low (but doesn't fix the root cause)
```

---

### Short-term (5 min - 1 hour)

**Analyze the heap dump** and identify the leak source.

**Common Fixes**:

**1. Add an LRU Cache**
```javascript
// BAD: Unbounded cache
const cache = {};

// GOOD: LRU cache with a size limit
const LRU = require('lru-cache');
const cache = new LRU({ max: 1000 });
```

**2. Remove Event Listeners**
```javascript
// Add listener
const handler = () => { /* ... */ };
emitter.on('event', handler);

// CRITICAL: Remove it later
emitter.off('event', handler);

// React/Vue: clean up in componentWillUnmount/onUnmounted
```

**3. Clear Timers**
```javascript
// Set timer
const intervalId = setInterval(() => { /* ... */ }, 1000);

// CRITICAL: Clear it later
clearInterval(intervalId);

// React: clean up in the useEffect return
useEffect(() => {
  const id = setInterval(() => { /* ... */ }, 1000);
  return () => clearInterval(id);
}, []);
```

**4. Close Connections**
```javascript
// BAD: Connection leak
const conn = await db.connect();
await conn.query(/* ... */);
// Connection never closed!

// GOOD: Always close
const conn = await db.connect();
try {
  await conn.query(/* ... */);
} finally {
  await conn.close(); // CRITICAL
}
```

---
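
If pulling in `lru-cache` isn't an option, a `Map` already iterates in insertion order, which is enough for a tiny LRU (a sketch, not a drop-in replacement for the library):

```javascript
// Minimal LRU cache using Map's insertion order to track recency
class SimpleLRU {
  constructor(max) {
    this.max = max;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // re-insert so this key becomes most recent
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      // evict the least recently used entry (first in insertion order)
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```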

### Long-term (1 hour+)

- [ ] Add memory monitoring (alert if >80% and increasing)
- [ ] Add memory profiling to CI/CD (detect leaks early)
- [ ] Use WeakMap for caches (entries are garbage collected with their keys)
- [ ] Review closure usage (avoid holding large data)
- [ ] Add automated restart (every N hours, if the leak can't be fixed immediately)
- [ ] Load test to reproduce the leak in a test environment
- [ ] Fix the root cause in code

---

## Escalation

**Escalate to developer if**:
- Application code is causing the leak
- A code fix is required

**Escalate to platform team if**:
- Platform/framework bug
- Requires an upgrade or workaround

---

## Prevention Checklist

- [ ] Use an LRU cache (not unbounded)
- [ ] Remove event listeners in cleanup
- [ ] Clear timers/intervals
- [ ] Close database connections (use `finally`)
- [ ] Avoid closures holding large data
- [ ] Use WeakMap for temporary caches
- [ ] Profile memory in development
- [ ] Load test before production

---

## Related Runbooks

- [01-high-cpu-usage.md](01-high-cpu-usage.md) - If CPU also high
- [07-service-down.md](07-service-down.md) - If OOM crashed the service
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem
- [ ] Identify leak source from heap dump
- [ ] Fix code
- [ ] Add regression test (memory profiling)
- [ ] Add monitoring alert
- [ ] Update this runbook if needed

agents/sre/playbooks/04-slow-api-response.md (new file, 269 lines)

# Playbook: Slow API Response

## Symptoms

- API response time >1 second (degraded)
- API response time >5 seconds (critical)
- Users reporting slow loading
- Timeout errors (504 Gateway Timeout)
- Monitoring alert: "p95 response time >1s"

## Severity

- **SEV3** if response time 1-3 seconds
- **SEV2** if response time 3-5 seconds
- **SEV1** if response time >5 seconds or timeouts

## Diagnosis

### Step 1: Check Application Logs

```bash
# Find slow requests (duration field > 1000ms)
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'

# Identify the slow endpoint
awk '/duration/ {print $3, $5}' /var/log/application.log | sort -nk2 | tail -20

# Example output:
# /api/dashboard 8200ms ← SLOW
# /api/users 50ms
# /api/posts 120ms
```

---

### Step 2: Measure Response Time Breakdown

**Total response time = Database + Application + Network**

```bash
# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/endpoint

# curl-format.txt:
#   time_namelookup:    %{time_namelookup}\n
#   time_connect:       %{time_connect}\n
#   time_starttransfer: %{time_starttransfer}\n
#   time_total:         %{time_total}\n
```

**Example breakdown**:
```
time_namelookup:    0.005s (DNS)
time_connect:       0.010s (TCP connect)
time_starttransfer: 8.200s (Time to first byte) ← SLOW HERE
time_total:         8.250s

→ Problem is backend processing, not network
```

---

### Step 3: Check Database Query Time

```bash
# Check application logs for query time
grep "query.*duration" /var/log/application.log

# Example:
# query: SELECT * FROM users... duration: 7800ms ← SLOW
```

**If the database is slow** → See [database-diagnostics.md](../modules/database-diagnostics.md)

---

### Step 4: Check External API Calls

```bash
# Check logs for external API calls
grep "http.request" /var/log/application.log

# Example:
# http.request: GET https://api.external.com/data duration: 5000ms ← SLOW
```

---
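
Log lines like the ones grepped above have to come from somewhere; a tiny Express-style middleware can produce them (a sketch; the log format and field positions are illustrative, so adjust the `grep`/`awk` field numbers to match):

```javascript
// Express-style middleware: log method, URL and duration when the
// response finishes
function requestTimer(logger = console) {
  return (req, res, next) => {
    const start = process.hrtime.bigint();
    res.on('finish', () => {
      const ms = Number(process.hrtime.bigint() - start) / 1e6;
      logger.log(`${req.method} ${req.url} duration: ${ms.toFixed(0)}ms`);
    });
    next();
  };
}
```

Mounted with `app.use(requestTimer())`, every request emits one line that the diagnosis commands above can slice.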

## Mitigation

### Immediate (Now - 5 min)

**Option A: Add Database Index** (if the DB is the bottleneck)
```sql
-- Example: Missing index on last_login_at
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);

-- Impact: 7.8s → 50ms query time
-- Risk: Low (CONCURRENTLY = no table lock)
```

**Option B: Enable Caching** (if the same data is requested frequently)
```javascript
// Add a Redis cache (node-redis v4: connect the client before use)
const redis = require('redis').createClient();
await redis.connect();

app.get('/api/dashboard', async (req, res) => {
  // Check cache first
  const cached = await redis.get('dashboard:' + req.user.id);
  if (cached) return res.json(JSON.parse(cached));

  // Generate data
  const data = await generateDashboard(req.user.id);

  // Cache for 5 minutes
  await redis.setEx('dashboard:' + req.user.id, 300, JSON.stringify(data));

  res.json(data);
});

// Impact: 8s → 10ms (cache hit)
// Risk: Low (data staleness acceptable for a dashboard)
```

**Option C: Optimize Query** (if N+1 query)
```javascript
// BAD: N+1 queries
const users = await db.query('SELECT * FROM users');
for (const user of users) {
  const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
  user.posts = posts;
}

// GOOD: Single query with JOIN
const users = await db.query(`
  SELECT users.*, posts.*
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);
```

---
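
When a JOIN doesn't fit (e.g. the per-user lookups are scattered across code paths), DataLoader-style batching collapses the N queries into one per tick. A dependency-free sketch of the idea (`batchFn` and its keys-in/results-out contract are illustrative):

```javascript
// Collect all keys requested in the same tick, then resolve them
// with a single batched lookup (DataLoader-style)
function createBatchLoader(batchFn) {
  let pending = [];
  return function load(key) {
    return new Promise((resolve, reject) => {
      pending.push({ key, resolve, reject });
      if (pending.length === 1) {
        // first key this tick: schedule one flush for the whole batch
        queueMicrotask(async () => {
          const batch = pending;
          pending = [];
          try {
            const results = await batchFn(batch.map((p) => p.key));
            batch.forEach((p, i) => p.resolve(results[i]));
          } catch (err) {
            batch.forEach((p) => p.reject(err));
          }
        });
      }
    });
  };
}
```

Here `batchFn` would run one `SELECT ... WHERE user_id IN (...)` for all collected ids instead of one query per id.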

### Short-term (5 min - 1 hour)

**Option A: Add Timeout** (if an external API is slow)
```javascript
// Add a timeout to the external API call; a timed-out fetch rejects,
// so catch it and fall back
let data;
try {
  const response = await fetch('https://api.external.com/data', {
    signal: AbortSignal.timeout(2000), // abort after 2 seconds
  });
  data = await response.json();
} catch (err) {
  data = fallbackData; // timeout or network error → use fallback
}

// Impact: Prevents a slow external API from blocking the response
// Risk: Low (fallback data acceptable)
```

**Option B: Async Processing** (if computation is heavy)
```javascript
// BAD: Synchronous heavy computation
app.post('/api/process', async (req, res) => {
  const result = await heavyComputation(req.body); // 10 seconds
  res.json(result);
});

// GOOD: Async processing with a job queue
app.post('/api/process', async (req, res) => {
  const jobId = await queue.add('process', req.body);
  res.status(202).json({ jobId, status: 'processing' });
});

// Client polls for the result
app.get('/api/job/:id', async (req, res) => {
  const job = await queue.getJob(req.params.id);
  res.json({ status: job.status, result: job.result });
});

// Impact: API responds immediately (202 Accepted)
// Risk: Low (client needs to handle the async pattern)
```

**Option C: Pagination** (if returning a large dataset)
```javascript
// BAD: Return all 10,000 records
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users); // Huge payload
});

// GOOD: Pagination
app.get('/api/users', async (req, res) => {
  const page = parseInt(req.query.page) || 1;
  const limit = 50;
  const offset = (page - 1) * limit;

  const users = await db.query('SELECT * FROM users LIMIT ? OFFSET ?', [limit, offset]);
  res.json({ data: users, page, limit });
});

// Impact: 8s → 200ms (smaller dataset)
// Risk: Low (clients usually want pagination anyway)
```

---

### Long-term (1 hour+)

- [ ] Add response time monitoring (p95, p99)
- [ ] Add APM (Application Performance Monitoring)
- [ ] Optimize database queries (add indexes, reduce JOINs)
- [ ] Add caching layer (Redis, Memcached)
- [ ] Implement pagination for large datasets
- [ ] Move heavy computation to background jobs
- [ ] Add timeouts for external APIs
- [ ] Add E2E test: API response <1s
- [ ] Review and optimize N+1 queries

---

## Common Root Causes

| Symptom | Root Cause | Solution |
|---------|------------|----------|
| 7.8s query time | Missing database index | CREATE INDEX |
| 10,000 records returned | No pagination | Add LIMIT/OFFSET |
| 50 queries for 1 request | N+1 query problem | Use JOIN or DataLoader |
| 5s external API call | No timeout | Add timeout + fallback |
| Heavy computation | Sync processing | Async job queue |
| Same data fetched repeatedly | No caching | Add Redis cache |

---

## Escalation

**Escalate to developer if**:
- Application code needs optimization
- N+1 query problem

**Escalate to DBA if**:
- Database performance issue
- Query optimization help is needed

**Escalate to external team if**:
- External API consistently slow
- SLA needs to be renegotiated

---

## Related Runbooks

- [02-database-deadlock.md](02-database-deadlock.md) - If database locked
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem
- [ ] Identify root cause (DB, external API, N+1, etc.)
- [ ] Add performance test (response time <1s)
- [ ] Add monitoring alert
- [ ] Update this runbook if needed

agents/sre/playbooks/05-ddos-attack.md (new file, 293 lines)

# Playbook: DDoS Attack

## Symptoms

- Sudden traffic spike (10x-100x normal)
- Legitimate users can't access the service
- High bandwidth usage (link saturated)
- Server overload (CPU, memory, network)
- Monitoring alert: "Traffic spike", "Bandwidth >90%"

## Severity

- **SEV1** - Production service unavailable due to attack

## Diagnosis

### Step 1: Confirm Traffic Spike

```bash
# Check current connections
netstat -ntu | wc -l

# Compare to baseline (normal: 100-500, attack: 10,000+)

# Check requests per second (nginx)
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c
```

---

### Step 2: Identify Attack Pattern

**Check connections by IP**:
```bash
# Top 20 IPs by connection count
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20

# Example output:
# 5000 192.168.1.100 ← Attacker IP
# 3000 192.168.1.101 ← Attacker IP
#    2 192.168.1.200 ← Legitimate user
```

**Check HTTP requests by IP** (nginx):
```bash
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
```

**Check request patterns**:
```bash
# Check requested URLs
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20

# Check user agents (bots often have telltale user agents)
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -nr
```

---
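
The same per-IP counting can be done in code when logs are already being shipped somewhere; a sketch (it assumes the client IP is the first space-separated field, as in nginx's default combined log format):

```javascript
// Count requests per client IP and return the top talkers
function topTalkers(logLines, n = 20) {
  const counts = new Map();
  for (const line of logLines) {
    const ip = line.split(' ')[0];
    counts.set(ip, (counts.get(ip) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}
```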

### Step 3: Classify Attack Type

**HTTP Flood** (application layer):
- Many HTTP requests from distributed IPs
- Valid HTTP requests, just too many
- Example: 10,000 requests/second to the homepage

**SYN Flood** (network layer):
- Many TCP SYN packets
- Connection handshakes never complete
- Exhausts the server's connection table

**Amplification** (DNS, NTP):
- Small request → large response
- Attacker spoofs your IP as the source
- Servers send the large responses to you

---
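
For an HTTP flood, the rate-limiting idea nginx's `limit_req` implements can be sketched at the application layer as a per-IP token bucket (illustrative only; in production, limit at the edge, before traffic reaches the app):

```javascript
// Per-IP token bucket: `rate` tokens/sec refill, `burst` bucket capacity.
// The clock is injectable so the behavior is testable.
function createRateLimiter(rate, burst, now = () => Date.now()) {
  const buckets = new Map(); // ip -> { tokens, last }
  return function allow(ip) {
    const t = now();
    const b = buckets.get(ip) || { tokens: burst, last: t };
    // refill proportionally to elapsed time, capped at burst
    b.tokens = Math.min(burst, b.tokens + ((t - b.last) / 1000) * rate);
    b.last = t;
    buckets.set(ip, b);
    if (b.tokens >= 1) {
      b.tokens -= 1;
      return true;
    }
    return false;
  };
}
```

A request is served only when its IP's bucket still holds a token; sustained senders drain their bucket and get rejected until it refills.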
|
||||
|
||||
## Mitigation
|
||||
|
||||
### Immediate (Now - 5 min)
|
||||
|
||||
**Option A: Block Attacker IPs** (if few IPs)
|
||||
```bash
|
||||
# Block single IP (iptables)
|
||||
iptables -A INPUT -s <ATTACKER_IP> -j DROP
|
||||
|
||||
# Block IP range
|
||||
iptables -A INPUT -s 192.168.1.0/24 -j DROP
|
||||
|
||||
# Block specific country (using ipset + GeoIP)
|
||||
# Advanced, see infrastructure team
|
||||
|
||||
# Impact: Blocks attacker, restores service
|
||||
# Risk: Low (if attacker IPs identified correctly)
|
||||
```
|
||||
|
||||
**Option B: Enable Rate Limiting** (nginx)
|
||||
```nginx
|
||||
# Add to nginx.conf
|
||||
http {
|
||||
# Define rate limit zone (10 req/s per IP)
|
||||
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
|
||||
|
||||
server {
|
||||
location / {
|
||||
# Apply rate limit
|
||||
limit_req zone=one burst=20 nodelay;
|
||||
limit_req_status 429;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Reload nginx
|
||||
nginx -t && systemctl reload nginx
|
||||
|
||||
# Impact: Limits requests per IP
|
||||
# Risk: Low (legitimate users rarely exceed 10 req/s)
|
||||
```
|
||||
|
||||
**Option C: Enable CloudFlare "Under Attack" Mode**
|
||||
```bash
|
||||
# If using CloudFlare:
|
||||
# 1. Log in to CloudFlare dashboard
|
||||
# 2. Select domain
|
||||
# 3. Click "Under Attack Mode"
|
||||
# 4. Adds JavaScript challenge before serving content
|
||||
|
||||
# Impact: Blocks bots, allows legitimate browsers
|
||||
# Risk: Low (slight user friction)
|
||||
```
|
||||
|
||||
**Option D: Enable AWS Shield** (AWS)
|
||||
```bash
|
||||
# AWS Shield Standard: Free, automatic DDoS protection
|
||||
# AWS Shield Advanced: $3000/month, enhanced protection
|
||||
|
||||
# CloudFormation:
|
||||
aws cloudformation deploy \
|
||||
--template-file shield.yaml \
|
||||
--stack-name ddos-protection
|
||||
|
||||
# Impact: Absorbs DDoS at AWS edge
|
||||
# Risk: None (AWS handles)
|
||||
```

---

### Short-term (5 min - 1 hour)

**Option A: Add Connection Limits**
```nginx
# Limit concurrent connections per IP
limit_conn_zone $binary_remote_addr zone=addr:10m;

server {
    location / {
        limit_conn addr 10;  # Max 10 concurrent connections per IP
    }
}
```

**Option B: Add CAPTCHA** (reCAPTCHA)
```html
<!-- Add reCAPTCHA to sensitive endpoints -->
<!-- The token must also be verified server-side (Google's siteverify API) -->
<form action="/login" method="POST">
  <input type="email" name="email">
  <input type="password" name="password">
  <div class="g-recaptcha" data-sitekey="YOUR_SITE_KEY"></div>
  <button type="submit">Login</button>
</form>
```

**Option C: Scale Up** (cloud auto-scaling)
```bash
# AWS: Increase Auto Scaling Group desired capacity
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 20  # Was 5

# Impact: More capacity to handle the attack
# Risk: Medium (costs money, may not fully mitigate)
# NOTE: Only do this if legitimate traffic also spiked
```

---

### Long-term (1 hour+)

- [ ] Enable Cloudflare or AWS Shield (DDoS protection service)
- [ ] Implement rate limiting on all endpoints
- [ ] Add CAPTCHA to login, signup, checkout
- [ ] Configure auto-scaling (handle legitimate traffic spikes)
- [ ] Add monitoring alert for traffic anomalies
- [ ] Create DDoS response plan
- [ ] Contact ISP for upstream filtering (if very large attack)
- [ ] Review and update firewall rules
- [ ] Add geographic blocking (if applicable)

---

## Important Notes

**DO NOT**:
- Scale up indefinitely (the attack can grow; costs explode)
- Fight DDoS at the application layer alone (use a CDN and cloud protection)

**DO**:
- Use a CDN/DDoS protection service (Cloudflare, AWS Shield, Akamai)
- Enable rate limiting
- Block attacker IPs/ranges
- Monitor costs (auto-scaling can be expensive)

---

## Escalation

**Escalate to infrastructure team if**:
- Attack very large (>10 Gbps)
- Need upstream filtering at ISP level

**Escalate to security team**:
- All DDoS attacks (for post-mortem, legal action)

**Contact ISP if**:
- Attack saturating the internet connection
- Need the transit provider to filter

**Contact Cloudflare/AWS if**:
- Using their DDoS protection
- Need assistance enabling features

---

## Prevention Checklist

- [ ] Use a CDN (Cloudflare, CloudFront, Akamai)
- [ ] Enable DDoS protection (AWS Shield, Cloudflare)
- [ ] Implement rate limiting (per IP, per user)
- [ ] Add CAPTCHA to sensitive endpoints
- [ ] Configure auto-scaling (within cost limits)
- [ ] Monitor traffic patterns (detect spikes early)
- [ ] Have a DDoS response plan ready
- [ ] Test the response plan (tabletop exercise)

---

## Related Runbooks

- [01-high-cpu-usage.md](01-high-cpu-usage.md) - If CPU overloaded
- [07-service-down.md](07-service-down.md) - If service crashed
- [../modules/security-incidents.md](../modules/security-incidents.md) - Security response
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (mandatory for DDoS)
- [ ] Identify attack vectors
- [ ] Document attacker IPs, patterns
- [ ] Report to ISP, Cloudflare (they may block the attacker)
- [ ] Review and improve DDoS defenses
- [ ] Consider legal action (if attacker identified)
- [ ] Update this runbook if needed

---

## Useful Commands Reference

```bash
# Check connection count
netstat -ntu | wc -l

# Top IPs by connection count
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20

# Block IP (iptables)
iptables -A INPUT -s <IP> -j DROP

# Check nginx requests per second
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c

# List iptables rules
iptables -L -n -v

# Clear all iptables rules (CAREFUL!)
iptables -F

# Save iptables rules (persist after reboot)
iptables-save > /etc/iptables/rules.v4
```
# Playbook: Disk Full

## Symptoms

- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written
- Monitoring alert: "Disk usage >90%"

## Severity

- **SEV3** if disk >90% but still functioning
- **SEV2** if disk >95% and applications degraded
- **SEV1** if disk 100% and applications down

## Diagnosis

### Step 1: Check Disk Usage

```bash
# Check disk usage by partition
df -h

# Example output:
# Filesystem      Size  Used  Avail  Use%  Mounted on
# /dev/sda1        50G   48G     2G   96%  /        ← CRITICAL
# /dev/sdb1       100G   20G    80G   20%  /data
```

---

### Step 2: Find Large Directories

```bash
# Disk usage by top-level directory
du -sh /*

# Example output:
#  15G  /var   ← Likely logs
#  10G  /home
#   5G  /usr
#   1G  /tmp

# Drill down into large directory
du -sh /var/*

# Example:
#  14G  /var/log   ← FOUND IT
# 500M  /var/cache
```

---

### Step 3: Find Large Files

```bash
# Find files larger than 100MB
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -h -r | head -20

# Example output:
# 5.0G /var/log/application.log   ← Large log file
# 2.0G /var/log/nginx/access.log
# 500M /tmp/dump.sql
```

---

### Step 4: Check for Deleted Files Holding Space

```bash
# Files deleted but process still has handle
lsof | grep deleted | awk '{print $1, $2, $7}' | sort -u

# Example output:
# nginx 1234 10G   ← nginx has handle to 10GB deleted file
```

**Why this happens**:
- File deleted (`rm /var/log/nginx/access.log`)
- But process (nginx) still writing to it
- Disk space not released until process closes file or restarts
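Because of this, prefer truncating a busy log in place over deleting it: truncation releases the space immediately and the writer's file handle stays valid. A minimal demonstration on a scratch file:

```shell
# Create a throwaway "log" with some data in it
printf 'old log data\n' > /tmp/demo.log

# Truncate in place: same inode, size drops to 0, any open
# file descriptors keep working (unlike `rm`, which leaves the
# space held until the writing process exits or restarts)
: > /tmp/demo.log

# Verify it is empty
wc -c < /tmp/demo.log
```

On a real host the equivalent would be `: > /var/log/nginx/access.log` (or `truncate -s 0 <file>`).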

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Delete Old Logs**
```bash
# Delete old log files (>7 days)
find /var/log -name "*.log.*" -mtime +7 -delete

# Delete compressed logs (>30 days)
find /var/log -name "*.gz" -mtime +30 -delete

# journalctl: Keep only last 7 days
journalctl --vacuum-time=7d

# Impact: Frees disk space immediately
# Risk: Low (old logs not needed for debugging recent issues)
```

**Option B: Compress Logs**
```bash
# Compress large log files
# (Avoid gzipping a file a process is still writing to; rotate or
#  truncate it first, otherwise the writer keeps the old inode open)
gzip /var/log/application.log
gzip /var/log/nginx/access.log

# Impact: Reduces log file size by 80-90%
# Risk: Low (logs still available, just compressed)
```

**Option C: Release Deleted Files**
```bash
# Find processes holding deleted files
lsof | grep deleted

# Restart the process to release the space
systemctl restart nginx

# Or send SIGHUP (many daemons reopen their log files on HUP)
kill -HUP <PID>

# Impact: Frees disk space held by deleted files
# Risk: Medium (brief service interruption)
```

**Option D: Clean Temp Files**
```bash
# Delete old temp files (use find rather than `rm -rf /tmp/*`,
# which would also remove files that running processes still use)
find /tmp -mindepth 1 -mtime +2 -delete
find /var/tmp -mindepth 1 -mtime +2 -delete

# Delete apt/yum cache
apt-get clean     # Ubuntu/Debian
yum clean all     # RHEL/CentOS

# Delete old kernels (Ubuntu)
apt-get autoremove --purge

# Impact: Frees disk space
# Risk: Low (old temp files can be deleted)
```

---

### Short-term (5 min - 1 hour)

**Option A: Rotate Logs Immediately**
```bash
# Force log rotation
logrotate -f /etc/logrotate.conf

# Verify logs rotated
ls -lh /var/log/

# Configure aggressive rotation (daily instead of weekly)
# Edit /etc/logrotate.d/application:
/var/log/application.log {
    daily            # Was: weekly
    rotate 7         # Keep 7 days
    compress         # Compress old logs
    delaycompress    # Don't compress most recent
    missingok        # Don't error if file missing
    notifempty       # Don't rotate if empty
    create 0640 www-data www-data
    sharedscripts
    postrotate
        systemctl reload application
    endscript
}
```

**Option B: Archive Old Data**
```bash
# Archive old database dumps
tar -czf old-dumps.tar.gz /backup/*.sql
rm /backup/*.sql

# Move to cheaper storage (S3, Archive)
aws s3 cp old-dumps.tar.gz s3://archive-bucket/
rm old-dumps.tar.gz

# Impact: Frees local disk space
# Risk: Low (data archived, not deleted)
```

**Option C: Expand Disk** (cloud)
```bash
# AWS: Modify EBS volume
aws ec2 modify-volume --volume-id vol-1234567890abcdef0 --size 100  # Was 50 GB

# Wait for modification to complete (5-10 min)
watch aws ec2 describe-volumes-modifications --volume-ids vol-1234567890abcdef0

# If the filesystem sits on a partition, grow the partition first
# (growpart is in the cloud-utils package)
sudo growpart /dev/xvda 1

# Resize filesystem
# ext4:
sudo resize2fs /dev/xvda1

# xfs:
sudo xfs_growfs /

# Verify
df -h

# Impact: More disk space
# Risk: Low (no downtime, but takes time)
```

---

### Long-term (1 hour+)

- [ ] Add disk usage monitoring (alert at >80%)
- [ ] Configure log rotation (daily, keep 7 days)
- [ ] Set up log forwarding (to ELK, Splunk, CloudWatch)
- [ ] Review disk usage trends (plan capacity)
- [ ] Add automated cleanup (cron job for old files)
- [ ] Archive old data (move to S3, Glacier)
- [ ] Implement log sampling (reduce volume)
- [ ] Review application logging (reduce verbosity)
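The automated-cleanup item above can be sketched as a daily `find ... -delete` job (the paths, ages, and cron schedule are illustrative, not from the source); the demo below runs the same command against a scratch directory:

```shell
# A cron entry for daily cleanup might look like:
#   0 3 * * * root find /tmp -mindepth 1 -mtime +7 -delete

# Demo: create one stale file and one fresh file, then clean up.
# (`touch -d` needs GNU coreutils.)
mkdir -p /tmp/cleanup-demo
touch -d "10 days ago" /tmp/cleanup-demo/stale.tmp
touch /tmp/cleanup-demo/fresh.tmp

# Delete anything older than 7 days
find /tmp/cleanup-demo -mindepth 1 -mtime +7 -delete

# Only fresh.tmp should remain
ls /tmp/cleanup-demo
```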

---

## Common Culprits

| Location | Cause | Solution |
|----------|-------|----------|
| /var/log | Log files not rotated | logrotate, compress, delete old |
| /tmp | Temp files not cleaned | Delete old files, add cron job |
| /var/cache | Apt/yum cache | apt-get clean, yum clean all |
| /home | User files, downloads | Clean up or expand disk |
| Database | Large tables, no archiving | Archive old data, vacuum |
| Deleted files | Process holding handle | Restart process |

---

## Prevention Checklist

- [ ] Configure log rotation (daily, 7 days retention)
- [ ] Add disk monitoring (alert at >80%)
- [ ] Set up log forwarding (reduce local storage)
- [ ] Add cron job to clean temp files
- [ ] Review disk trends monthly
- [ ] Plan capacity (expand before hitting limit)
- [ ] Archive old data (move to cheaper storage)
- [ ] Implement log sampling (reduce volume)

---

## Escalation

**Escalate to developer if**:
- Application generating excessive logs
- Need to reduce logging verbosity

**Escalate to DBA if**:
- Database files consuming disk
- Need to archive old data

**Escalate to infrastructure if**:
- Need to expand disk (physical server)
- Need to add new disk

---

## Related Runbooks

- [07-service-down.md](07-service-down.md) - If disk full crashed service
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify what filled disk
- [ ] Implement prevention (log rotation, monitoring)
- [ ] Review disk trends (prevent recurrence)
- [ ] Update this runbook if needed

---

## Useful Commands Reference

```bash
# Disk usage
df -h         # By partition
du -sh /*     # By directory
du -sh /var/* # Drill down

# Large files
find / -type f -size +100M -exec ls -lh {} \;

# Deleted files holding space
lsof | grep deleted

# Clean up
find /var/log -name "*.log.*" -mtime +7 -delete  # Old logs
gzip /var/log/*.log                              # Compress
journalctl --vacuum-time=7d                      # journalctl
apt-get clean                                    # Apt cache
yum clean all                                    # Yum cache

# Log rotation
logrotate -f /etc/logrotate.conf

# Expand disk (after EBS resize)
resize2fs /dev/xvda1  # ext4
xfs_growfs /          # xfs
```
# Playbook: Service Down

## Symptoms

- Service not responding
- Health check failures
- 502 Bad Gateway or 503 Service Unavailable
- Users can't access application
- Monitoring alert: "Service down", "Health check failed"

## Severity

- **SEV1** - Production service completely unavailable

## Diagnosis

### Step 1: Check Service Status

```bash
# Check if service is running (systemd)
systemctl status nginx
systemctl status application
systemctl status postgresql

# Check process
ps aux | grep nginx
pidof nginx

# Example output:
# nginx.service - nginx web server
#   Active: inactive (dead)   ← SERVICE IS DOWN
```

---

### Step 2: Check Why Service Stopped

**Check Service Logs** (systemd):
```bash
# Last 50 lines of service logs
journalctl -u nginx -n 50

# Tail logs in real-time
journalctl -u nginx -f

# Look for:
# - Exit code (0 = normal, non-zero = error)
# - Error messages
# - Crash reason
```

**Check Application Logs**:
```bash
# Check application error log
tail -100 /var/log/application/error.log

# Look for:
# - Exception/error before crash
# - Stack trace
# - "Fatal error", "Segmentation fault"
```

**Check System Logs**:
```bash
# Check for OOM (Out of Memory) killer
dmesg | grep -i "out of memory\|oom\|killed process"

# Example:
# Out of memory: Killed process 1234 (node) total-vm:8GB
# ↑ OOM Killer terminated the application

# Check kernel errors
dmesg | tail -50

# Check syslog
grep "error\|segfault" /var/log/syslog
```

---

### Step 3: Identify Root Cause

**Common causes**:

| Symptom | Root Cause |
|---------|------------|
| "Out of memory" in dmesg | OOM Killer (memory leak, insufficient memory) |
| "Segmentation fault" | Application bug (crash) |
| "Address already in use" | Port already bound |
| "Connection refused" to database | Database down |
| "No such file or directory" | Missing config file |
| "Permission denied" | Wrong file permissions |
| Exit code 137 | Killed by OOM Killer |
| Exit code 139 | Segmentation fault |
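The 137/139 rows follow the shell convention that a process killed by signal N exits with status 128+N: 137 = 128+9 (SIGKILL, the signal the OOM Killer sends) and 139 = 128+11 (SIGSEGV). A quick check in any POSIX shell:

```shell
# Child kills itself with SIGKILL (signal 9); parent reports 128+9 = 137
sh -c 'kill -KILL $$'; echo "exit: $?"

# Ordinary exit codes pass through unchanged
sh -c 'exit 3'; echo "exit: $?"
```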

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Restart Service**
```bash
# Restart service
systemctl restart nginx

# Check if started successfully
systemctl status nginx

# Test endpoint
curl http://localhost

# Impact: Service restored
# Risk: Low (if root cause not addressed, may crash again)
```

**Option B: Fix Configuration Error** (if config issue)
```bash
# Test configuration
nginx -t   # nginx
# postgres has no single "config test" flag; `postgres -C <setting>`
# parses postgresql.conf and fails if it is invalid (paths vary by distro):
sudo -u postgres postgres -C data_directory -D /var/lib/postgresql/data

# If config error, check recent changes
git diff HEAD~1 /etc/nginx/nginx.conf

# Revert to working config
git checkout HEAD~1 /etc/nginx/nginx.conf

# Restart
systemctl restart nginx
```

**Option C: Free Up Resources** (if OOM)
```bash
# Check memory usage
free -h

# Kill memory-heavy processes (non-critical)
kill -9 <PID>

# Free page cache
sync && echo 3 > /proc/sys/vm/drop_caches

# Restart service
systemctl restart application
```

**Option D: Change Port** (if port conflict)
```bash
# Check what's using port
lsof -i :80

# Example:
# apache2 1234 root 4u IPv4 12345 0t0 TCP *:80 (LISTEN)
# ↑ Apache using port 80

# Stop conflicting service
systemctl stop apache2

# Start intended service
systemctl start nginx
```

---

### Short-term (5 min - 1 hour)

**Option A: Fix Crash Bug** (if application bug)
```bash
# Check stack trace in logs
tail -100 /var/log/application/error.log

# Identify line causing crash
# Example: NullPointerException at PaymentService.java:42

# Deploy hotfix OR revert to previous version
git checkout <previous-working-commit>
npm run build && pm2 restart all

# Impact: Bug fixed, service stable
# Risk: Medium (needs proper testing)
```

**Option B: Increase Memory** (if OOM)
```bash
# Short-term: Increase swap
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile   # swapon warns if the swap file is world-readable
mkswap /swapfile
swapon /swapfile

# Long-term: Resize instance
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM

# Impact: More memory available
# Risk: Medium (swap is slow, instance resize has downtime)
```

**Option C: Enable Auto-Restart** (systemd)
```bash
# Edit service file
# /etc/systemd/system/application.service

[Service]
Restart=always            # Auto-restart on failure
RestartSec=10             # Wait 10s before restart
StartLimitBurst=5         # Max 5 restarts
StartLimitIntervalSec=60  # In 60 seconds
# (On newer systemd versions, StartLimitBurst/StartLimitIntervalSec
#  belong in the [Unit] section)

# Reload systemd
systemctl daemon-reload

# Impact: Service auto-restarts on crash
# Risk: Low (but doesn't fix root cause)
```

**Option D: Route Traffic to Backup** (if multi-instance)
```bash
# If using a load balancer:
# 1. Remove failed instance from LB
# 2. Traffic goes to healthy instances

# AWS:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0

# Impact: Users see working instance
# Risk: Low (other instances handle load)
```

---

### Long-term (1 hour+)

- [ ] Fix root cause (memory leak, bug, etc.)
- [ ] Add health check monitoring
- [ ] Enable auto-restart (systemd)
- [ ] Set up redundancy (multiple instances)
- [ ] Add load balancer (distribute traffic)
- [ ] Increase memory/CPU (if resource issue)
- [ ] Add alerting (service down, health check fail)
- [ ] Add E2E test (smoke test after deploy)
- [ ] Review deployment process (how did the bug reach prod?)

---

## Root Cause Analysis

**For each incident, determine**:

1. **What failed?** (nginx, application, database)
2. **Why did it fail?** (OOM, bug, config error)
3. **What triggered it?** (deploy, traffic spike, external event)
4. **How to prevent it?** (fix bug, add monitoring, increase capacity)

---

## Escalation

**Escalate to developer if**:
- Application crash due to bug
- Need code fix

**Escalate to platform team if**:
- Platform/framework issue
- Infrastructure problem

**Escalate to on-call manager if**:
- Can't restore service in 30 min
- Need additional resources

---

## Prevention Checklist

- [ ] Health check monitoring (alert on failure)
- [ ] Auto-restart (systemd Restart=always)
- [ ] Redundancy (multiple instances behind LB)
- [ ] Resource monitoring (CPU, memory alerts)
- [ ] Graceful degradation (circuit breakers, fallbacks)
- [ ] Smoke tests after deploy
- [ ] Rollback plan (blue-green, canary)
- [ ] Chaos engineering (test failure scenarios)

---

## Related Runbooks

- [03-memory-leak.md](03-memory-leak.md) - If OOM caused crash
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Application diagnostics

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (MANDATORY for SEV1)
- [ ] Timeline with all events
- [ ] Root cause analysis
- [ ] Action items (prevent recurrence)
- [ ] Update runbook if needed
- [ ] Share learnings with team

---

## Useful Commands Reference

```bash
# Service status
systemctl status <service>
systemctl restart <service>
journalctl -u <service> -n 50

# Process check
ps aux | grep <process>
pidof <process>

# Check OOM
dmesg | grep -i "out of memory\|oom"

# Check port usage
lsof -i :<port>
netstat -tlnp | grep <port>

# Test config
nginx -t
sudo -u postgres postgres -C data_directory -D /var/lib/postgresql/data  # postgres (paths vary)

# Health check
curl http://localhost/health
```
# Playbook: Data Corruption

## Symptoms

- Users report incorrect data
- Database integrity constraint violations
- Foreign key errors
- Application errors due to unexpected data
- Failed backups (checksum mismatch)
- Monitoring alert: "Data integrity check failed"

## Severity

- **SEV1** - Critical data corrupted (financial, health, legal)
- **SEV2** - Non-critical data corrupted (user profiles, cache)
- **SEV3** - Recoverable corruption (can restore from backup)

## Diagnosis

### Step 1: Confirm Corruption

**Database Integrity Check** (PostgreSQL):
```sql
-- Sanity check: database metadata is reachable
SELECT * FROM pg_catalog.pg_database WHERE datname = 'your_database';

-- Check whether data checksums are enabled (corruption detection);
-- if enabled, the pg_checksums utility can verify pages on a stopped cluster
SHOW data_checksums;

-- List the largest tables (helps scope where to look)
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```

**Database Integrity Check** (MySQL):
```sql
-- Check table for corruption
CHECK TABLE users;

-- Repair table (if corrupted)
-- NOTE: snapshot the data first; repair can destroy forensic evidence
REPAIR TABLE users;

-- Optimize table (defragment)
OPTIMIZE TABLE users;
```

---

### Step 2: Identify Scope

**Questions to answer**:
- Which tables/data are affected?
- How many records are corrupted?
- When did the corruption start?
- What's the impact on users?

**Check Database Logs**:
```bash
# PostgreSQL
grep "ERROR\|FATAL\|PANIC" /var/log/postgresql/postgresql.log

# MySQL
grep "ERROR" /var/log/mysql/error.log

# Look for:
# - Constraint violations
# - Foreign key errors
# - Checksum errors
# - Disk I/O errors
```

---

### Step 3: Determine Root Cause

**Common causes**:

| Cause | Symptoms |
|-------|----------|
| Disk corruption | I/O errors in dmesg, checksum failures |
| Application bug | Logical corruption (wrong data, not random) |
| Failed migration | Schema mismatch, foreign key violations |
| Concurrent writes | Race condition, duplicate records |
| Hardware failure | Random corruption, unrelated records |
| Malicious attack | Deliberate data modification |

**Check for Disk Errors**:
```bash
# Check disk errors
dmesg | grep -i "I/O error\|disk error"

# Check SMART status
smartctl -a /dev/sda

# Look for: Reallocated_Sector_Ct, Current_Pending_Sector
```

---

## Mitigation

### Immediate (Now - 5 min)

**CRITICAL: Preserve Evidence**
```bash
# 1. STOP ALL WRITES (prevent further corruption)
#    Put the application in read-only mode OR
#    take the application offline

# 2. Snapshot/backup the current state (even if corrupted)
# PostgreSQL:
pg_dump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# MySQL:
mysqldump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# 3. Snapshot disk (cloud)
# AWS:
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0 --description "Corruption snapshot"

# Impact: Preserves evidence for forensics
# Risk: None (read-only operations)
```

**CRITICAL: DO NOT**:
- Delete corrupted data (may be needed for forensics)
- Run REPAIR TABLE (may destroy evidence)
- Restart the database (may clear logs)

---

### Short-term (5 min - 1 hour)

**Option A: Restore from Backup** (if recent clean backup)
```bash
# 1. Identify last known good backup
ls -lh /backup/

# Example:
# backup-20251026-0200.sql   ← Clean backup (before corruption)
# backup-20251026-0800.sql   ← Corrupted

# 2. Restore from clean backup
# PostgreSQL:
psql your_database < /backup/backup-20251026-0200.sql

# MySQL:
mysql your_database < /backup/backup-20251026-0200.sql

# 3. Verify data integrity
#    Run application tests
#    Check user-reported issues

# Impact: Data restored to clean state
# Risk: Medium (lose data written after backup time)
```

**Option B: Repair Corrupted Records** (if isolated corruption)
```sql
-- Identify corrupted records
SELECT * FROM users WHERE email IS NULL;  -- Should not be null

-- Fix corrupted records
UPDATE users SET email = 'unknown@example.com' WHERE email IS NULL;

-- Verify fix
SELECT count(*) FROM users WHERE email IS NULL;  -- Should be 0

-- Impact: Corruption fixed
-- Risk: Low (if the corruption is known and fixable)
```

**Option C: Point-in-Time Recovery** (PostgreSQL)
```bash
# If WAL (Write-Ahead Logging) archiving is enabled:

# 1. Determine recovery point (before corruption)
#    2025-10-26 07:00:00 (corruption detected at 08:00)

# 2. Restore from base backup + WAL
pg_basebackup -D /var/lib/postgresql/data-recovery

# 3. Set the recovery target
#    PostgreSQL <= 11: recovery.conf
#    PostgreSQL 12+:  postgresql.conf plus an empty recovery.signal file
# recovery_target_time = '2025-10-26 07:00:00'

# 4. Start PostgreSQL (will replay WAL until the target time)
systemctl start postgresql

# Impact: Restore to the exact point before corruption
# Risk: Low (if WAL available)
```

---

### Long-term (1 hour+)

**Root Cause Analysis**:

**If disk corruption**:
- [ ] Replace disk immediately
- [ ] Check RAID status
- [ ] Run filesystem check (fsck)
- [ ] Enable database checksums

**If application bug**:
- [ ] Fix bug in application code
- [ ] Add data validation
- [ ] Add integrity checks
- [ ] Add regression test

**If failed migration**:
- [ ] Review migration script
- [ ] Test migrations in staging first
- [ ] Add rollback plan
- [ ] Use transaction-based migrations

**If concurrent writes**:
- [ ] Add locking (row-level, table-level)
- [ ] Use optimistic locking (version column)
- [ ] Review transaction isolation level
- [ ] Add unique constraints
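The optimistic-locking item can be sketched in SQL (the `accounts` table, `balance`, and `version` columns are illustrative, not from the source): the UPDATE only succeeds if no other transaction has written the row since it was read.

```sql
-- Read the row along with its current version
SELECT id, balance, version FROM accounts WHERE id = 42;  -- e.g. version = 7

-- Write back only if the version is unchanged, bumping it atomically
UPDATE accounts
SET balance = 100, version = version + 1
WHERE id = 42 AND version = 7;
-- 0 rows updated → a concurrent write happened; re-read and retry
```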

---

## Prevention

**Backups**:
- [ ] Daily automated backups
- [ ] Test restore process monthly
- [ ] Multiple backup locations (local + S3)
- [ ] Point-in-time recovery enabled (WAL)
- [ ] Retention: 30 days

**Monitoring**:
- [ ] Data integrity checks (checksums)
- [ ] Foreign key violation alerts
- [ ] Disk error monitoring (SMART)
- [ ] Backup success/failure alerts
- [ ] Application-level data validation

**Data Validation**:
- [ ] Database constraints (NOT NULL, FOREIGN KEY, CHECK)
- [ ] Application-level validation
- [ ] Schema migrations in transactions
- [ ] Automated data quality tests

**Redundancy**:
- [ ] Database replication (primary + replica)
- [ ] RAID for disk redundancy
- [ ] Multi-AZ deployment (cloud)

---

## Escalation

**Escalate to DBA if**:
- Database-level corruption
- Need expert for recovery

**Escalate to developer if**:
- Application bug causing corruption
- Need code fix

**Escalate to security team if**:
- Suspected malicious attack
- Unauthorized data modification

**Escalate to management if**:
- Critical data lost
- Legal/compliance implications
- Data breach

---

## Legal/Compliance

**If critical data corrupted**:
- [ ] Notify legal team
- [ ] Notify compliance team
- [ ] Check notification requirements:
  - GDPR: 72 hours for breach notification
  - HIPAA: 60 days for breach notification
  - PCI-DSS: Immediate notification
- [ ] Document incident timeline (for audit)
- [ ] Preserve evidence (forensics)

---

## Related Runbooks

- [07-service-down.md](07-service-down.md) - If database down
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
|
||||
- [../modules/security-incidents.md](../modules/security-incidents.md) - If malicious attack
|
||||
|
||||
---
|
||||
|
||||
## Post-Incident
|
||||
|
||||
After resolving:
|
||||
- [ ] Create post-mortem (MANDATORY for SEV1)
|
||||
- [ ] Root cause analysis (what, why, how)
|
||||
- [ ] Identify affected users/records
|
||||
- [ ] User communication (if needed)
|
||||
- [ ] Action items (prevent recurrence)
|
||||
- [ ] Update backup/recovery procedures
|
||||
- [ ] Update this runbook if needed
|
||||
|
||||
---
|
||||
|
||||
## Useful Commands Reference

```bash
# PostgreSQL: confirm the server responds and list databases
psql -c "SELECT * FROM pg_catalog.pg_database"

# MySQL table check
mysqlcheck -c your_database

# Backup
pg_dump your_database > backup.sql
mysqldump your_database > backup.sql

# Restore
psql your_database < backup.sql
mysql your_database < backup.sql

# Disk check
dmesg | grep -i "I/O error"
smartctl -a /dev/sda
fsck /dev/sda1

# Snapshot (AWS)
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0
```
430
agents/sre/playbooks/09-cascade-failure.md
Normal file
@@ -0,0 +1,430 @@
# Playbook: Cascade Failure

## Symptoms

- Multiple services failing simultaneously
- Failures spreading across services
- Dependency services timing out
- Error rate increasing exponentially
- Monitoring alert: "Multiple services degraded", "Cascade detected"

## Severity

- **SEV1** - Cascade affecting production services

## What is a Cascade Failure?

**Definition**: One service's failure triggers failures in dependent services, spreading through the system.

**Example**:
```
Database slow (2s queries)
  ↓
API times out waiting for database (5s timeout)
  ↓
Frontend times out waiting for API (10s timeout)
  ↓
Load balancer marks frontend unhealthy
  ↓
Traffic routes to other frontends (overloading them)
  ↓
All frontends fail → Complete outage
```
---

## Diagnosis

### Step 1: Identify Initial Failure Point

**Check Service Dependencies**:
```
Frontend → API → Database
             ↓
          Cache (Redis)
             ↓
          Queue (RabbitMQ)
             ↓
          External API
```

**Find the root**:
```bash
# Check service health (start with leaf dependencies)
# 1. Database
psql -c "SELECT 1"

# 2. Cache
redis-cli PING

# 3. Queue
rabbitmqctl status

# 4. External API
curl https://api.external.com/health

# First failure = likely root cause
```

---

### Step 2: Trace Failure Propagation

**Check Service Logs** (in order):
```bash
# Database logs (first)
tail -100 /var/log/postgresql/postgresql.log

# API logs (second)
tail -100 /var/log/api/error.log

# Frontend logs (third)
tail -100 /var/log/frontend/error.log
```

**Look for timestamps**:
```
14:00:00 - Database: Slow query (7s)  ← ROOT CAUSE
14:00:05 - API: Timeout error
14:00:10 - Frontend: API unavailable
14:00:15 - Load balancer: All frontends unhealthy
```

---

### Step 3: Assess Cascade Depth

**How many layers are affected?**
- **1 layer**: Database only (isolated failure)
- **2-3 layers**: Database → API → Frontend (cascade)
- **4+ layers**: Full system cascade (critical)
---

## Mitigation

### Immediate (Now - 5 min)

**PRIORITY: Stop the cascade from spreading**

**Option A: Circuit Breaker** (if not already enabled)
```javascript
// Enable a circuit breaker manually
// Prevents the API from overwhelming the database

const CircuitBreaker = require('opossum');

const dbQuery = new CircuitBreaker(queryDatabase, {
  timeout: 3000,                 // 3s timeout
  errorThresholdPercentage: 50,  // Open after 50% failures
  resetTimeout: 30000            // Try again after 30s
});

dbQuery.on('open', () => {
  console.log('Circuit breaker OPEN - using fallback');
});

// Use the fallback when the circuit is open
dbQuery.fallback(() => {
  return cachedData; // Return cached data instead
});
```

**Option B: Rate Limiting** (protect downstream)
```nginx
# Limit requests to the API (nginx), shielding the database behind it
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://api-backend;
}
```

**Option C: Shed Load** (reject non-critical requests)
```javascript
// Reject non-critical requests when overloaded
app.use((req, res, next) => {
  const load = getCurrentLoad(); // CPU, memory, queue depth

  if (load > 0.8 && !isCriticalEndpoint(req.path)) {
    return res.status(503).json({
      error: 'Service overloaded, try again later'
    });
  }

  next();
});

function isCriticalEndpoint(path) {
  return ['/api/health', '/api/payment'].includes(path);
}
```

**Option D: Isolate Failure** (take the failing service offline)
```bash
# Remove the failing service from the load balancer
# AWS ELB:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0

# nginx:
# Comment out the failing backend in the upstream block
# upstream api {
#   server api1.example.com;     # Healthy
#   # server api2.example.com;   # FAILING - commented out
# }

# Impact: Prevents the failing service from affecting others
# Risk: Reduced capacity
```
---

### Short-term (5 min - 1 hour)

**Option A: Fix the Root Cause**

**If database slow**:
```sql
-- Add missing index
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login_at);
```

**If external API slow**:
```javascript
// Add timeout + fallback (native fetch has no `timeout` option;
// pass an AbortSignal instead)
const response = await fetch('https://api.external.com', {
  signal: AbortSignal.timeout(2000) // 2s timeout
});

if (!response.ok) {
  return fallbackData; // Don't cascade the failure
}
```

**If service overloaded**:
```bash
# Scale horizontally (add more instances)
# AWS Auto Scaling:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 10 # Was 5
```

---

**Option B: Add Timeouts** (prevent indefinite waiting)
```javascript
// Database query timeout (option name varies by driver)
const result = await db.query('SELECT * FROM users', {
  timeout: 3000 // 3 second timeout
});

// API call timeout
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000) // 5 second timeout
});

// Impact: Fail fast instead of cascading
// Risk: Low (better to time out than cascade)
```

---

**Option C: Add Bulkheads** (isolate critical paths)
```javascript
// Separate connection pools for critical vs non-critical work
const criticalPool = new Pool({ max: 10 });   // Payments, auth
const nonCriticalPool = new Pool({ max: 5 }); // Analytics, reports

// Critical requests get priority
app.post('/api/payment', async (req, res) => {
  const conn = await criticalPool.connect();
  // ...
});

// Non-critical requests use the separate pool
app.get('/api/analytics', async (req, res) => {
  const conn = await nonCriticalPool.connect();
  // ...
});

// Impact: Critical paths protected from non-critical load
// Risk: None (isolation improves reliability)
```

---

### Long-term (1 hour+)

**Architecture Improvements**:

- [ ] **Circuit breakers** (all external dependencies)
- [ ] **Timeouts** (every network call, database query)
- [ ] **Retries with exponential backoff** (transient failures)
- [ ] **Bulkheads** (isolate critical paths)
- [ ] **Rate limiting** (protect downstream services)
- [ ] **Graceful degradation** (fallback data, cached responses)
- [ ] **Health checks** (detect failures early)
- [ ] **Auto-scaling** (handle load spikes)
- [ ] **Chaos engineering** (test cascade scenarios)
---

## Cascade Prevention Patterns

### 1. Circuit Breaker Pattern
```javascript
const breaker = new CircuitBreaker(riskyOperation, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

breaker.fallback(() => cachedData);
```

**Benefits**:
- Fast failure (don't wait for the timeout)
- Automatic recovery (retries after the reset timeout)
- Fallback data (graceful degradation)

---

### 2. Timeout Pattern
```javascript
// ALWAYS set timeouts
const response = await fetch('/api', {
  signal: AbortSignal.timeout(5000)
});
```

**Benefits**:
- Fail fast (don't cascade indefinite waits)
- Predictable behavior

---

### 3. Bulkhead Pattern
```javascript
// Separate resource pools
const criticalPool = new Pool({ max: 10 });
const nonCriticalPool = new Pool({ max: 5 });
```

**Benefits**:
- Critical paths protected
- Non-critical load can't exhaust resources

---

### 4. Retry with Backoff
```javascript
async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
    }
  }
}
```
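The retry pattern calls a `sleep` helper that isn't defined in the snippet; a self-contained version with the helper filled in (the flaky-call usage is illustrative):

```javascript
// Promise-based sleep, used for the backoff delays.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Same retry-with-backoff pattern, now runnable as-is.
async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s
    }
  }
}
```

Wrapping a flaky dependency call, e.g. `retryWithBackoff(() => fetchDashboard())`, retries transient failures twice before surfacing the error.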

**Benefits**:
- Handles transient failures
- Exponential backoff prevents a thundering herd

---

### 5. Load Shedding
```javascript
// Reject requests when overloaded
if (queueDepth > threshold) {
  return res.status(503).send('Overloaded');
}
```

**Benefits**:
- Prevents overload
- Protects downstream services

---
## Escalation

**Escalate to architecture team if**:
- System-wide cascade
- Architectural changes needed

**Escalate to all service owners if**:
- Multiple teams affected
- A coordinated response is needed

**Escalate to management if**:
- Complete outage
- Large customer impact

---

## Prevention Checklist

- [ ] Circuit breakers on all external calls
- [ ] Timeouts on all network operations
- [ ] Retries with exponential backoff
- [ ] Bulkheads for critical paths
- [ ] Rate limiting (protect downstream)
- [ ] Health checks (detect failures early)
- [ ] Auto-scaling (handle load)
- [ ] Graceful degradation (fallback data)
- [ ] Chaos engineering (test failure scenarios)
- [ ] Load testing (find breaking points)

---

## Related Runbooks

- [04-slow-api-response.md](04-slow-api-response.md) - API performance
- [07-service-down.md](07-service-down.md) - Service failures
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create a post-mortem (MANDATORY for cascade failures)
- [ ] Draw a cascade diagram (which services failed, in what order)
- [ ] Identify missing safeguards (circuit breakers, timeouts)
- [ ] Implement prevention patterns
- [ ] Test cascade scenarios (chaos engineering)
- [ ] Update this runbook if needed

---

## Cascade Failure Examples

**Netflix Outage (2012)**:
- Database latency → API timeouts → Frontend failures → Complete outage
- **Fix**: Circuit breakers, timeouts, fallback data

**AWS S3 Outage (2017)**:
- S3 down → Websites using S3 fail → Status dashboards fail (also hosted on S3)
- **Fix**: Multi-region redundancy, fallback to other regions

**Google Cloud Outage (2019)**:
- Network misconfiguration → Internal services fail → External services cascade
- **Fix**: Network configuration validation, staged rollouts

---

## Key Takeaways

1. **Cascades happen when failures propagate** (no circuit breakers, no timeouts)
2. **Fix the root cause first** (not the symptoms)
3. **Fail fast, don't cascade waits** (timeouts everywhere)
4. **Graceful degradation** (fallback beats failure)
5. **Test failure scenarios** (chaos engineering)
464
agents/sre/playbooks/10-rate-limit-exceeded.md
Normal file
@@ -0,0 +1,464 @@
# Playbook: Rate Limit Exceeded

## Symptoms

- "Rate limit exceeded" errors
- "429 Too Many Requests" responses
- "Quota exceeded" messages
- Legitimate requests being blocked
- Monitoring alert: "High rate of 429 errors"

## Severity

- **SEV3** if isolated to specific users/endpoints
- **SEV2** if affecting many users
- **SEV1** if critical functionality is blocked (payments, auth)

## Diagnosis

### Step 1: Identify What's Rate Limited

**Check Error Messages**:
```bash
# Application logs
grep "rate limit\|429\|quota exceeded" /var/log/application.log

# nginx logs
awk '$9 == 429 {print $1, $7}' /var/log/nginx/access.log | sort | uniq -c

# Example output:
#   500 192.168.1.100 /api/users  ← IP hitting the rate limit
#   200 192.168.1.101 /api/posts
```

**Check the Rate Limit Source**:
- **Application-level**: Your code enforcing the limit
- **nginx/API Gateway**: Reverse proxy rate limiting
- **External API**: Third-party service limit (Stripe, Twilio, etc.)
- **Cloud**: AWS API Gateway, Cloudflare

---

### Step 2: Determine If Legitimate or Malicious

**Legitimate traffic**:
```
Scenario: User refreshing dashboard repeatedly
Pattern:  Single user, single endpoint, short burst
Action:   Increase rate limit or add caching
```

**Malicious traffic** (abuse):
```
Scenario: Scraper or bot
Pattern:  Multiple IPs, automated behavior, sustained
Action:   Block IPs, add CAPTCHA
```

**Traffic spike** (legitimate):
```
Scenario: Marketing campaign, viral post
Pattern:  Many users, distributed IPs, real user behavior
Action:   Increase rate limit, scale up
```
---

### Step 3: Check Current Rate Limits

**nginx**:
```bash
# Check nginx.conf
grep "limit_req" /etc/nginx/nginx.conf

# Example:
# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
#                                               ^^^^^ Current limit
```

**Application** (Express.js example):
```javascript
// Check the rate limit middleware
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100,                 // Limit: 100 requests per 15 minutes
});
```

**External API**:
```bash
# Check the external API's documentation, e.g.:
# Stripe: 100 requests per second
# Twilio: 100 requests per second
# Google Maps: $200/month free quota
# (verify against current provider docs - limits change)

# Check current usage
# Stripe:
curl https://api.stripe.com/v1/balance \
  -u sk_test_XXX: \
  -H "Stripe-Account: acct_XXX"

# Rate limit response headers (if the provider sends them):
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 45  ← 45 requests left
```
---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Increase the Rate Limit** (if traffic is legitimate)

**nginx**:
```nginx
# Edit /etc/nginx/nginx.conf
# Increase from 10r/s to 50r/s
limit_req_zone $binary_remote_addr zone=one:10m rate=50r/s;

# Test and reload
# nginx -t && systemctl reload nginx

# Impact: Allows more requests
# Risk: Low (if traffic is legitimate)
```

**Application** (Express.js):
```javascript
// Increase from 100 to 500 requests per 15 min
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 500, // Increased
});

// Restart the application:
// pm2 restart all
```

---

**Option B: Whitelist Specific IPs** (if the source is known and legitimate)

**nginx**:
```nginx
# Whitelist internal IPs and monitoring systems
geo $limit {
    default        1;
    10.0.0.0/8     0;  # Internal network
    192.168.1.100  0;  # Monitoring system
}

map $limit $limit_key {
    0 "";
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=one:10m rate=10r/s;
```

**Application**:
```javascript
const limiter = rateLimit({
  skip: (req) => {
    // Whitelist internal IPs
    return req.ip.startsWith('10.') || req.ip === '192.168.1.100';
  },
  windowMs: 15 * 60 * 1000,
  max: 100,
});
```
---

**Option C: Add Caching** (reduce requests to the backend)

**Redis cache**:
```javascript
const redis = require('redis').createClient();

app.get('/api/users', async (req, res) => {
  // Check the cache first
  const cached = await redis.get('users:' + req.query.id);
  if (cached) {
    return res.json(JSON.parse(cached));
  }

  // Fetch from the database
  const user = await db.query('SELECT * FROM users WHERE id = ?', [req.query.id]);

  // Cache for 5 minutes
  await redis.setex('users:' + req.query.id, 300, JSON.stringify(user));

  res.json(user);
});

// Impact: Reduces backend load, fewer rate limit hits
// Risk: Low (if data staleness is acceptable)
```

---

**Option D: Block Malicious IPs** (if abuse is detected)

**nginx**:
```bash
# Block a specific IP
iptables -A INPUT -s 192.168.1.100 -j DROP

# Or in nginx.conf:
# deny 192.168.1.100;
# deny 192.168.1.0/24;  # Block a range
```

**Cloudflare**:
```
# Cloudflare dashboard:
# Security → WAF → Custom rules
# Block IP: 192.168.1.100
```

---
### Short-term (5 min - 1 hour)

**Option A: Implement Tiered Rate Limits**

**Different limits for different users**:
```javascript
const rateLimit = require('express-rate-limit');

const createLimiter = (max) => rateLimit({
  windowMs: 15 * 60 * 1000,
  max: max,
  keyGenerator: (req) => req.user?.id || req.ip,
});

// Create each limiter ONCE at startup - creating a new limiter per
// request would reset its hit counts and never enforce the limit
const premiumLimiter = createLimiter(1000); // Premium: 1000 req/15min
const userLimiter = createLimiter(300);     // Authenticated: 300 req/15min
const anonLimiter = createLimiter(100);     // Anonymous: 100 req/15min

app.use('/api', (req, res, next) => {
  let limiter;
  if (req.user?.tier === 'premium') {
    limiter = premiumLimiter;
  } else if (req.user) {
    limiter = userLimiter;
  } else {
    limiter = anonLimiter;
  }
  limiter(req, res, next);
});
```

---

**Option B: Add a CAPTCHA** (prevent bots)

**reCAPTCHA** on sensitive endpoints:
```javascript
// Sketch using the express-recaptcha package; check its docs for
// the exact constructor and middleware names for your version
const Recaptcha = require('express-recaptcha').RecaptchaV2;
const recaptcha = new Recaptcha('SITE_KEY', 'SECRET_KEY');

app.post('/api/login', recaptcha.middleware.verify, async (req, res) => {
  if (!req.recaptcha.error) {
    // CAPTCHA valid, proceed with login
    await handleLogin(req, res);
  } else {
    res.status(400).json({ error: 'CAPTCHA failed' });
  }
});
```

---

**Option C: Upgrade the External API Plan** (if hitting an external limit)

**Stripe**:
```
Current: 100 requests/second (free)
Upgrade: Contact Stripe for a higher limit (paid)
```

**AWS API Gateway**:
```bash
# Increase the throttle limit
aws apigateway update-usage-plan \
  --usage-plan-id <ID> \
  --patch-operations \
    op=replace,path=/throttle/rateLimit,value=1000

# Impact: Higher rate limit
# Risk: None (may cost more)
```
---

### Long-term (1 hour+)

- [ ] **Implement tiered rate limits** (premium, authenticated, anonymous)
- [ ] **Add caching** (reduce backend load)
- [ ] **Use a CDN** (cache static content, reduce origin requests)
- [ ] **Add CAPTCHA** (prevent bots on sensitive endpoints)
- [ ] **Monitor rate limit usage** (alert before hitting the limit)
- [ ] **Batch requests** (reduce API calls to external services)
- [ ] **Implement retry with backoff** (for external API rate limits)
- [ ] **Document rate limits** (API documentation for users)
- [ ] **Add rate limit headers** (tell users their remaining quota)

---

## Rate Limit Best Practices

### 1. Return Helpful Headers

**The 429 status and Retry-After come from RFC 6585; the X-RateLimit-* headers are a de facto convention**:
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1698345600   # Unix timestamp
Retry-After: 60                 # Seconds until reset

{
  "error": "Rate limit exceeded",
  "message": "You have exceeded the rate limit of 100 requests per 15 minutes. Try again in 60 seconds."
}
```
**Implementation**:
```javascript
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  standardHeaders: true, // Return RateLimit-* headers
  legacyHeaders: false,
  handler: (req, res) => {
    const resetMs = new Date(req.rateLimit.resetTime).getTime() - Date.now();
    res.status(429).json({
      error: 'Rate limit exceeded',
      message: `You have exceeded the rate limit of ${req.rateLimit.limit} requests per 15 minutes. Try again in ${Math.ceil(resetMs / 1000)} seconds.`,
    });
  },
});
```
---

### 2. Use a Sliding Window (not a Fixed Window)

**Fixed window** (bad):
```
Window 1: 00:00-00:15 (100 requests)
Window 2: 00:15-00:30 (100 requests)

User makes 100 requests at 00:14:59
User makes 100 requests at 00:15:01
→ 200 requests in 2 seconds! (burst)
```

**Sliding window** (good):
```
Rate limit is based on the last 15 minutes from the current time
→ Can't burst (the limit is enforced continuously)
```
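The sliding behavior can be sketched with a minimal in-memory limiter (illustrative only; this is not the express-rate-limit implementation, and production setups usually back the window state with Redis):

```javascript
// Sliding-window rate limiter: allow at most `limit` requests
// within any trailing `windowMs` period, per key (user ID or IP).
class SlidingWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // key -> array of request timestamps
  }

  allow(key, now = Date.now()) {
    const cutoff = now - this.windowMs;
    // Keep only timestamps inside the trailing window
    const recent = (this.hits.get(key) || []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false; // over the limit for this trailing window
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}
```

Unlike a fixed window, a burst at a window boundary still counts against the trailing period, so the 200-requests-in-2-seconds scenario above is rejected.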

---

### 3. Different Limits for Different Endpoints

```javascript
// Expensive endpoint (lower limit)
app.get('/api/analytics', rateLimit({ max: 10 }), handler);

// Cheap endpoint (higher limit)
app.get('/api/health', rateLimit({ max: 1000 }), handler);
```

---
## External API Rate Limit Handling

### Retry with Backoff

```javascript
async function callExternalAPI(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await fetch(url);

      // Check rate limit headers (NaN when the header is absent)
      const remaining = parseInt(response.headers.get('X-RateLimit-Remaining'), 10);
      if (remaining < 10) {
        console.warn('Approaching rate limit:', remaining);
      }

      if (response.status === 429) {
        // Rate limited - honor Retry-After if present
        const retryAfter = Number(response.headers.get('Retry-After')) || 60;
        console.log(`Rate limited, retrying after ${retryAfter}s`);
        await sleep(retryAfter * 1000);
        continue;
      }

      return response.json();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff
    }
  }
}
```
---

## Escalation

**Escalate to developer if**:
- Application rate limit logic needs changes
- Caching needs to be implemented

**Escalate to infrastructure if**:
- nginx/API Gateway rate limit config
- Capacity needs to be scaled up

**Escalate to external vendor if**:
- Hitting an external API rate limit
- A higher quota is needed

---

## Related Runbooks

- [05-ddos-attack.md](05-ddos-attack.md) - If malicious traffic
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create a post-mortem (if SEV1/SEV2)
- [ ] Identify why the rate limit was hit
- [ ] Adjust rate limits (if needed)
- [ ] Add monitoring (alert before hitting the limit)
- [ ] Document rate limits (for users/API consumers)
- [ ] Update this runbook if needed

---

## Useful Commands Reference

```bash
# Check 429 errors (nginx)
awk '$9 == 429 {print $1}' /var/log/nginx/access.log | sort | uniq -c

# Check rate limit config (nginx)
grep "limit_req" /etc/nginx/nginx.conf

# Block an IP (iptables)
iptables -A INPUT -s <IP> -j DROP

# Test a rate limit
for i in {1..200}; do curl http://localhost/api; done

# Check an external API's rate limit
curl -I https://api.example.com -H "Authorization: Bearer TOKEN"
# Look for X-RateLimit-* headers
```