Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions

# Playbook: Rate Limit Exceeded
## Symptoms
- "Rate limit exceeded" errors
- "429 Too Many Requests" responses
- "Quota exceeded" messages
- Legitimate requests being blocked
- Monitoring alert: "High rate of 429 errors"
## Severity
- **SEV3** if isolated to specific users/endpoints
- **SEV2** if affecting many users
- **SEV1** if critical functionality blocked (payments, auth)
## Diagnosis
### Step 1: Identify What's Rate Limited
**Check Error Messages**:
```bash
# Application logs (case-insensitive, since messages are often capitalized)
grep -iE "rate limit|429|quota exceeded" /var/log/application.log
# nginx logs (combined format: $9 = status, $1 = client IP, $7 = path)
awk '$9 == 429 {print $1, $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn
# Example output:
#   500 192.168.1.100 /api/users  ← IP hitting rate limit
#   200 192.168.1.101 /api/posts
```
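When the logs are already being processed in Node, the same count can be sketched in JavaScript. This is a minimal sketch that assumes nginx's default `combined` log format; `count429s` is a hypothetical helper name, not part of any library.

```javascript
// Sketch: count 429 responses per client IP and path from nginx access-log
// lines. Assumes the default "combined" log_format; adjust the pattern if
// the server uses a custom format.
function count429s(lines) {
  const counts = new Map();
  // combined: IP - user [time] "METHOD path PROTO" status bytes "ref" "ua"
  const pattern = /^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" (\d{3}) /;
  for (const line of lines) {
    const match = pattern.exec(line);
    if (!match || match[3] !== '429') continue; // skip non-429 and unparsable lines
    const key = `${match[1]} ${match[2]}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Highest counts first, like `sort | uniq -c | sort -rn`.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

Each entry in the result is a `["<ip> <path>", count]` pair, so the first entry is the heaviest rate-limited client and endpoint.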
**Check Rate Limit Source**:
- **Application-level**: Your own code enforces the limit
- **nginx/API Gateway**: The reverse proxy enforces the limit
- **External API**: A third-party service enforces the limit (Stripe, Twilio, etc.)
- **Cloud**: A managed edge service enforces the limit (AWS API Gateway, Cloudflare)
---
### Step 2: Determine If Legitimate or Malicious
**Legitimate traffic**:
```
Scenario: User refreshing dashboard repeatedly
Pattern: Single user, single endpoint, short burst
Action: Increase rate limit or add caching
```
**Malicious traffic** (abuse):
```
Scenario: Scraper or bot
Pattern: Multiple IPs, automated behavior, sustained
Action: Block IPs, add CAPTCHA
```
**Traffic spike** (legitimate):
```
Scenario: Marketing campaign, viral post
Pattern: Many users, distributed IPs, real user behavior
Action: Increase rate limit, scale up
```
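To compare real traffic with the three patterns above, it can help to reduce the rate-limited requests to a few numbers. The sketch below only computes those numbers and leaves the judgment to the responder. The event fields (`ip`, `path`, `ts` in epoch milliseconds) and the function name are assumptions for illustration.

```javascript
// Sketch: summarize rate-limited requests so the pattern can be compared
// with the scenarios above. Field names (ip, path, ts) are assumptions.
function summarize429s(events) {
  if (events.length === 0) {
    return { total: 0, distinctIps: 0, distinctPaths: 0, durationSec: 0, topIpShare: 0 };
  }
  const perIp = new Map();
  const paths = new Set();
  let first = Infinity;
  let last = -Infinity;
  for (const { ip, path, ts } of events) {
    perIp.set(ip, (perIp.get(ip) || 0) + 1);
    paths.add(path);
    first = Math.min(first, ts);
    last = Math.max(last, ts);
  }
  const topCount = Math.max(...perIp.values());
  return {
    total: events.length,
    distinctIps: perIp.size,            // 1 suggests a single user or script
    distinctPaths: paths.size,          // 1 suggests a single hot endpoint
    durationSec: (last - first) / 1000, // short burst vs. sustained
    topIpShare: topCount / events.length, // near 1.0 means one dominant source
  };
}
```

A single IP, a single path, and a short duration point toward the "user refreshing" scenario; many IPs over a long duration need a closer look at behavior before deciding between abuse and a genuine spike.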
---
### Step 3: Check Current Rate Limits
**nginx**:
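```bash
# Search the whole nginx config tree (limits are often set in included files)
grep -rn "limit_req" /etc/nginx/
# Example:
# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
#                                                 ^^^^^^^^^^ Current limit
# Note: nginx rejects with 503 by default; it returns 429 only if
# limit_req_status 429; is configured
```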
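nginx documents `limit_req` as a leaky bucket, so `rate=10r/s` behaves like one request per 100 ms rather than any 10 requests within a second. The following is a simplified model of that accounting, written to show why a small client-side burst can be rejected when no `burst=` value is set. It is a sketch under that reading of the documentation, not nginx's actual code, and `createLeakyBucket` is a hypothetical name.

```javascript
// Simplified model of nginx limit_req accounting (a sketch, not nginx code).
// ratePerSec corresponds to rate=...r/s, burst to the limit_req burst= value.
function createLeakyBucket(ratePerSec, burst = 0) {
  let excess = 0;    // requests currently above the drain rate
  let lastMs = null; // time of the last accepted request
  return function allow(nowMs) {
    if (lastMs === null) {
      lastMs = nowMs;
      return true; // the first request for a key is always accepted
    }
    const drained = (ratePerSec * (nowMs - lastMs)) / 1000;
    const next = Math.max(0, excess - drained + 1);
    if (next > burst) return false; // over the limit: request is rejected
    excess = next;
    lastMs = nowMs;
    return true;
  };
}
```

With `createLeakyBucket(10)`, a request 50 ms after the previous one is rejected while one 100 ms later is accepted; with `createLeakyBucket(10, 5)`, up to five extra requests are tolerated before rejections start.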
**Application** (Express.js example):
```javascript
// Check rate limit middleware
const rateLimit = require('express-rate-limit');
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit: 100 requests per 15 minutes
});
```
**External API**:
```bash
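# Check the provider's documentation for current limits; the figures below
# are examples and vary by account, mode, and endpoint
# Stripe: 100 requests per second
# Twilio: 100 requests per second
# Google Maps: $200/month free quota
# Check current usage (-i prints the response headers)
# Stripe:
curl -i https://api.stripe.com/v1/balance \
  -u sk_test_XXX: \
  -H "Stripe-Account: acct_XXX"
# Response headers (illustrative; names vary and some providers send none):
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 45  ← 45 requests left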
```
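Where a provider does send quota headers, usage can be computed from them. This is a minimal sketch that assumes the de facto `X-RateLimit-Limit` and `X-RateLimit-Remaining` names; `rateLimitStatus` and the 80% warning threshold are illustrative choices, and `headers` is anything with a `get(name)` method, such as a fetch `Headers` object.

```javascript
// Sketch: derive quota usage from rate limit response headers.
// Header names follow the de facto X-RateLimit-* convention (an assumption;
// check each provider's documentation for the names it actually sends).
function rateLimitStatus(headers, warnRatio = 0.8) {
  const limit = Number.parseInt(headers.get('X-RateLimit-Limit'), 10);
  const remaining = Number.parseInt(headers.get('X-RateLimit-Remaining'), 10);
  if (!Number.isFinite(limit) || !Number.isFinite(remaining) || limit <= 0) {
    return { known: false }; // headers missing or unparsable
  }
  const usedRatio = (limit - remaining) / limit;
  return { known: true, limit, remaining, usedRatio, warn: usedRatio >= warnRatio };
}
```

For the example headers above (limit 100, remaining 45), the used ratio is 0.55, which is below the assumed warning threshold.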
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Increase Rate Limit** (if legitimate traffic)
**nginx**:
```nginx
# Edit /etc/nginx/nginx.conf
# Increase from 10r/s to 50r/s
limit_req_zone $binary_remote_addr zone=one:10m rate=50r/s;
# Test and reload (run in a shell, not part of the config):
#   nginx -t && systemctl reload nginx
# Impact: Allows more requests
# Risk: Low (if traffic is legitimate)
```
**Application** (Express.js):
```javascript
// Increase from 100 to 500 requests per 15 min
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 500, // Increased
});
// Restart the application to apply the change (run in a shell):
//   pm2 restart all
```
---
**Option B: Whitelist Specific IPs** (if known legitimate source)
**nginx**:
```nginx
# Whitelist internal IPs, monitoring systems
geo $limit {
    default        1;
    10.0.0.0/8     0;  # Internal network
    192.168.1.100  0;  # Monitoring system
}
map $limit $limit_key {
    0  "";                   # Empty key: request is not rate limited
    1  $binary_remote_addr;
}
limit_req_zone $limit_key zone=one:10m rate=10r/s;
```
**Application**:
```javascript
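const limiter = rateLimit({
  skip: (req) => {
    // Whitelist internal IPs. Behind a reverse proxy, req.ip is the proxy's
    // address unless Express's 'trust proxy' setting is configured.
    return req.ip.startsWith('10.') || req.ip === '192.168.1.100';
  },
  windowMs: 15 * 60 * 1000,
  max: 100,
});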
```
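String-prefix checks cannot express ranges such as `172.16.0.0/12`, so a CIDR-aware check may be preferable. The sketch below uses Node's built-in `net.BlockList` (available since Node 15), which despite its name is a general address-matching list. `isAllowlisted` is a hypothetical helper, and the addresses mirror the examples above.

```javascript
const net = require('net');

// Sketch: CIDR-aware allowlist built on Node's net.BlockList.
const allowlist = new net.BlockList();
allowlist.addSubnet('10.0.0.0', 8);    // Internal network
allowlist.addAddress('192.168.1.100'); // Monitoring system

function isAllowlisted(ip) {
  if (typeof ip !== 'string') return false;
  // Express may report IPv4 clients as IPv4-mapped IPv6 (::ffff:a.b.c.d).
  const unmapped = ip.startsWith('::ffff:') ? ip.slice(7) : ip;
  if (net.isIPv4(unmapped)) return allowlist.check(unmapped, 'ipv4');
  if (net.isIPv6(ip)) return allowlist.check(ip, 'ipv6');
  return false; // not a valid IP address
}
```

It would plug into the limiter as `skip: (req) => isAllowlisted(req.ip)`.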
---
**Option C: Add Caching** (reduce requests to backend)
**Redis cache**:
```javascript
// Assumes node-redis v4+ (promise API; connect() is required before use)
const redis = require('redis').createClient();
redis.connect().catch(console.error);
app.get('/api/users', async (req, res) => {
  const cacheKey = 'users:' + req.query.id;
  // Check cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return res.json(JSON.parse(cached));
  }
  // Fetch from database (db is the application's own database client)
  const user = await db.query('SELECT * FROM users WHERE id = ?', [req.query.id]);
  // Cache for 5 minutes
  await redis.setEx(cacheKey, 300, JSON.stringify(user));
  res.json(user);
});
// Impact: Reduces backend load, fewer rate limit hits
// Risk: Low if up to 5 minutes of data staleness is acceptable
```
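If Redis is not available during the incident, the same cache-aside idea can be sketched with an in-process map. This is a minimal illustration with hypothetical names (`createTtlCache`, `getOrLoad`); the cache is per process, is lost on restart, and does not deduplicate concurrent misses.

```javascript
// Sketch: in-process cache-aside helper with a TTL. The `now` parameter is
// injectable for testing; names here are illustrative, not from a library.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return async function getOrLoad(key, loader) {
    const hit = entries.get(key);
    if (hit && hit.expiresAt > now()) return hit.value; // fresh: skip the backend
    const value = await loader(key); // miss or stale: call the loader
    entries.set(key, { value, expiresAt: now() + ttlMs });
    return value;
  };
}
```

A route handler could then call `getOrLoad('users:' + id, loadUserFromDb)`, where `loadUserFromDb` stands in for the application's own query function.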
---
**Option D: Block Malicious IPs** (if abuse detected)
**nginx**:
```bash
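# Block specific IP at the firewall (-I inserts ahead of existing ACCEPT rules)
iptables -I INPUT -s 192.168.1.100 -j DROP
# Or in nginx.conf (http, server, or location context), then reload nginx:
#   deny 192.168.1.100;
#   deny 192.168.1.0/24;  # Block range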
```
**Cloudflare**:
```
# Cloudflare dashboard:
# Security → WAF → Custom rules
# Block IP: 192.168.1.100
```
---
### Short-term (5 min - 1 hour)
**Option A: Implement Tiered Rate Limits**
**Different limits for different users**:
```javascript
const rateLimit = require('express-rate-limit');
const createLimiter = (max) => rateLimit({
  windowMs: 15 * 60 * 1000,
  max: max,
  keyGenerator: (req) => req.user?.id || req.ip,
});
// Create each limiter once at startup. A limiter created inside the request
// handler gets a fresh store on every request, so its limit is never reached.
const limiters = {
  premium: createLimiter(1000), // Premium: 1000 req/15min
  authenticated: createLimiter(300), // Authenticated: 300 req/15min
  anonymous: createLimiter(100), // Anonymous: 100 req/15min
};
app.use('/api', (req, res, next) => {
  let tier = 'anonymous';
  if (req.user?.tier === 'premium') {
    tier = 'premium';
  } else if (req.user) {
    tier = 'authenticated';
  }
  limiters[tier](req, res, next);
});
```
---
**Option B: Add CAPTCHA** (prevent bots)
**reCAPTCHA** on sensitive endpoints:
```javascript
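// Site and secret keys come from the reCAPTCHA admin console; the
// environment variable names below are placeholders
const { RecaptchaV2 } = require('express-recaptcha');
const recaptcha = new RecaptchaV2(process.env.RECAPTCHA_SITE_KEY, process.env.RECAPTCHA_SECRET_KEY);
app.post('/api/login', recaptcha.middleware.verify, async (req, res) => {
  if (!req.recaptcha.error) {
    // CAPTCHA valid, proceed with login
    await handleLogin(req, res);
  } else {
    res.status(400).json({ error: 'CAPTCHA failed' });
  }
});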
```
---
**Option C: Upgrade External API Plan** (if hitting external limit)
**Stripe**:
```
Current: Default limit for the account (see Stripe's rate limit documentation)
Upgrade: Contact Stripe support to request a higher limit
```
**AWS API Gateway**:
```bash
# Increase throttle limit
aws apigateway update-usage-plan \
  --usage-plan-id <ID> \
  --patch-operations \
    op=replace,path=/throttle/rateLimit,value=1000
# Impact: Higher rate limit
# Risk: Low (more traffic reaches the backend and may cost more)
```
---
### Long-term (1 hour+)
- [ ] **Implement tiered rate limits** (premium, authenticated, anonymous)
- [ ] **Add caching** (reduce backend load)
- [ ] **Use CDN** (cache static content, reduce origin requests)
- [ ] **Add CAPTCHA** (prevent bots on sensitive endpoints)
- [ ] **Monitor rate limit usage** (alert before hitting limit)
- [ ] **Batch requests** (reduce API calls to external services)
- [ ] **Implement retry with backoff** (external API rate limits)
- [ ] **Document rate limits** (API documentation for users)
- [ ] **Add rate limit headers** (tell users their remaining quota)
---
## Rate Limit Best Practices
### 1. Return Helpful Headers
**429 response** (the status code is defined in RFC 6585; the `X-RateLimit-*` headers are a widely used convention, not part of that RFC):
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1698345600 # Unix timestamp
Retry-After: 60 # Seconds until reset

{
  "error": "Rate limit exceeded",
  "message": "You have exceeded the rate limit of 100 requests per 15 minutes. Try again in 60 seconds."
}
```
**Implementation**:
```javascript
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  standardHeaders: true, // Return RateLimit-* headers
  legacyHeaders: false, // Set to true to also send X-RateLimit-* headers
  handler: (req, res) => {
    // Round up after converting milliseconds to seconds
    const retryAfterSec = Math.ceil((req.rateLimit.resetTime - Date.now()) / 1000);
    res.status(429).json({
      error: 'Rate limit exceeded',
      message: `You have exceeded the rate limit of ${req.rateLimit.limit} requests per 15 minutes. Try again in ${retryAfterSec} seconds.`,
    });
  },
});
```
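For services that do not use `express-rate-limit`, the header values in the example response can be derived from the limiter's state. This is a minimal sketch; `buildRateLimitHeaders` and its parameter names are hypothetical, and `resetTimeMs` is assumed to be the window reset time in epoch milliseconds.

```javascript
// Sketch: build the headers for a 429 response from limiter state.
// Uses the X-RateLimit-* names shown above; resetTimeMs is epoch milliseconds.
function buildRateLimitHeaders({ limit, remaining, resetTimeMs }, nowMs = Date.now()) {
  const retryAfterSec = Math.max(0, Math.ceil((resetTimeMs - nowMs) / 1000));
  return {
    'X-RateLimit-Limit': String(limit),
    'X-RateLimit-Remaining': String(Math.max(0, remaining)),
    'X-RateLimit-Reset': String(Math.ceil(resetTimeMs / 1000)), // Unix seconds
    'Retry-After': String(retryAfterSec), // whole seconds, never negative
  };
}
```

The returned object can be passed to `res.set(...)` in Express before sending the 429 body.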
---
### 2. Use Sliding Window (not Fixed Window)
**Fixed window** (bad):
```
Window 1: 00:00-00:15 (100 requests)
Window 2: 00:15-00:30 (100 requests)
User makes 100 requests at 00:14:59
User makes 100 requests at 00:15:01
→ 200 requests in 2 seconds! (burst)
```
**Sliding window** (good):
```
Rate limit counts requests in the 15 minutes ending at the current request
→ Can't burst across a window boundary (limit enforced continuously)
```
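One way to implement the sliding window is a log of accepted request times, sketched below for a single key. This is an illustration rather than what any particular library does, `createSlidingWindowLimiter` is a hypothetical name, and a production limiter would keep one log per client and usually store it in something like Redis.

```javascript
// Sketch: sliding-window log limiter for a single key. It stores one
// timestamp per accepted request, so memory grows with the limit.
function createSlidingWindowLimiter(limit, windowMs) {
  const timestamps = []; // accepted request times, oldest first
  return function allow(nowMs = Date.now()) {
    // Drop requests that have aged out of the window ending at nowMs.
    while (timestamps.length > 0 && timestamps[0] <= nowMs - windowMs) {
      timestamps.shift();
    }
    if (timestamps.length >= limit) return false; // window is full
    timestamps.push(nowMs);
    return true;
  };
}
```

Replaying the fixed-window example against this limiter, the 100 requests at 00:14:59 are accepted and the 100 requests at 00:15:01 are all rejected, because the earlier requests are still inside the trailing 15 minutes.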
---
### 3. Different Limits for Different Endpoints
```javascript
// Expensive endpoint (lower limit: 10 requests per minute)
app.get('/api/analytics', rateLimit({ windowMs: 60 * 1000, max: 10 }), handler);
// Cheap endpoint (higher limit: 1000 requests per minute)
app.get('/api/health', rateLimit({ windowMs: 60 * 1000, max: 1000 }), handler);
```
---
## External API Rate Limit Handling
### Retry with Backoff
```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callExternalAPI(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await fetch(url);
      // Check rate limit headers (absent on many APIs, so guard against null)
      const remaining = response.headers.get('X-RateLimit-Remaining');
      if (remaining !== null && Number(remaining) < 10) {
        console.warn('Approaching rate limit:', remaining);
      }
      if (response.status === 429) {
        // Rate limited: honor Retry-After when it is a number of seconds
        const retryAfter = Number(response.headers.get('Retry-After')) || 60;
        console.log(`Rate limited, retrying after ${retryAfter}s`);
        await sleep(retryAfter * 1000);
        continue;
      }
      return await response.json();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff
    }
  }
  // Every attempt was rate limited
  throw new Error(`Rate limited after ${retries} attempts: ${url}`);
}
```
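The wait time itself can be factored into a small function. The sketch below differs from the code above in two ways, both of which are additions rather than part of the original playbook: it accepts `Retry-After` as either a number of seconds or an HTTP-date, and it adds random jitter so that many clients do not retry at the same moment. The function name, the 30-second cap, and the injectable `nowMs` and `random` parameters are assumptions.

```javascript
// Sketch: choose how long to wait before retrying a rate-limited call.
// Retry-After may be delta-seconds or an HTTP-date.
function retryDelayMs(attempt, retryAfter, nowMs = Date.now(), random = Math.random) {
  if (retryAfter != null && retryAfter !== '') {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
    const dateMs = Date.parse(retryAfter);
    if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - nowMs);
  }
  // No usable header: exponential backoff with full jitter, capped at 30 s.
  const ceiling = Math.min(30000, 1000 * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

In a retry loop this would be used as `await sleep(retryDelayMs(i, response.headers.get('Retry-After')))`.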
---
## Escalation
**Escalate to developer if**:
- Application rate limit logic needs changes
- Caching needs to be implemented
**Escalate to infrastructure if**:
- nginx/API Gateway rate limit config needs changes
- Capacity needs to be scaled up
**Escalate to external vendor if**:
- An external API rate limit is being hit
- A higher quota is needed
---
## Related Runbooks
- [05-ddos-attack.md](05-ddos-attack.md) - If malicious traffic
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify why the rate limit was hit
- [ ] Adjust rate limits (if needed)
- [ ] Add monitoring (alert before hitting limit)
- [ ] Document rate limits (for users/API consumers)
- [ ] Update this runbook if needed
---
## Useful Commands Reference
```bash
# Check 429 errors (nginx)
awk '$9 == 429 {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn
# Check rate limit config (nginx)
grep -rn "limit_req" /etc/nginx/
# Block IP (iptables)
iptables -I INPUT -s <IP> -j DROP
# Test rate limit (prints a count per HTTP status code)
for i in {1..200}; do curl -s -o /dev/null -w "%{http_code}\n" http://localhost/api; done | sort | uniq -c
# Check external API rate limit
curl -I https://api.example.com -H "Authorization: Bearer TOKEN"
# Look for X-RateLimit-* headers
```