# Playbook: Rate Limit Exceeded

## Symptoms

- "Rate limit exceeded" errors
- "429 Too Many Requests" responses
- "Quota exceeded" messages
- Legitimate requests being blocked
- Monitoring alert: "High rate of 429 errors"

## Severity

- **SEV3** if isolated to specific users/endpoints
- **SEV2** if affecting many users
- **SEV1** if critical functionality blocked (payments, auth)

## Diagnosis

### Step 1: Identify What's Rate Limited

**Check Error Messages**:
```bash
# Application logs (-i: case-insensitive, -E: extended regex)
grep -iE "rate limit|429|quota exceeded" /var/log/application.log

# nginx logs (default combined format: $9 = status, $1 = client IP, $7 = path)
awk '$9 == 429 {print $1, $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Example output (count, client IP, path):
#     500 192.168.1.100 /api/users  ← IP hitting rate limit
#     200 192.168.1.101 /api/posts
```

**Check Rate Limit Source**:
- **Application-level**: Your own code enforcing a limit
- **nginx/API Gateway**: Reverse proxy rate limiting
- **External API**: Third-party service limit (Stripe, Twilio, etc.)
- **Cloud**: AWS API Gateway, Cloudflare

---

### Step 2: Determine If Legitimate or Malicious

**Legitimate traffic**:
```
Scenario: User refreshing dashboard repeatedly
Pattern: Single user, single endpoint, short burst
Action: Increase rate limit or add caching
```

**Malicious traffic** (abuse):
```
Scenario: Scraper or bot
Pattern: Multiple IPs, automated behavior, sustained
Action: Block IPs, add CAPTCHA
```

**Traffic spike** (legitimate):
```
Scenario: Marketing campaign, viral post
Pattern: Many users, distributed IPs, real user behavior
Action: Increase rate limit, scale up
```

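A first pass at telling these patterns apart is to measure how concentrated the 429s are. The sketch below assumes the `ip path` pairs printed by the awk command in Step 1 have already been parsed into objects. The function name `summarize429s` and the 80% threshold are illustrative assumptions, not tuned values. A `distributed` result can be either a bot network or a genuine spike, so it still needs a human look at user agents and request behavior.

```javascript
// Hypothetical helper: summarize which clients are receiving 429s.
// `entries` is an array of { ip, path } objects parsed from the access log.
function summarize429s(entries) {
  // Count 429 responses per client IP.
  const perIp = new Map();
  for (const { ip } of entries) {
    perIp.set(ip, (perIp.get(ip) || 0) + 1);
  }

  // Find the single busiest client and its share of all 429s.
  let topIp = null;
  let topCount = 0;
  for (const [ip, count] of perIp) {
    if (count > topCount) {
      topIp = ip;
      topCount = count;
    }
  }
  const topShare = entries.length ? topCount / entries.length : 0;

  // Assumed threshold: one IP causing at least 80% of the 429s looks like a
  // single noisy client; anything else is treated as spread across sources.
  let pattern;
  if (entries.length === 0) pattern = 'none';
  else if (topShare >= 0.8) pattern = 'single-source';
  else pattern = 'distributed';

  return { distinctIps: perIp.size, topIp, topShare, pattern };
}
```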
---

### Step 3: Check Current Rate Limits

**nginx**:
```bash
# Search the nginx config (limits may live in included files, so search recursively)
grep -rn "limit_req" /etc/nginx/

# Example:
# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
#                                                 ^^^^^^^^^^ Current limit
```

**Application** (Express.js example):
```javascript
// Check rate limit middleware
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit: 100 requests per 15 minutes
});
```

**External API**:
```bash
# Check external API documentation
# Example figures (verify against the vendor's current docs before relying on them):
# Stripe: 100 requests per second
# Twilio: 100 requests per second
# Google Maps: $200/month free quota

# Check current usage
# Stripe (-i prints the response headers):
curl -i https://api.stripe.com/v1/balance \
  -u sk_test_XXX: \
  -H "Stripe-Account: acct_XXX"

# Response headers, where the vendor sends them (names vary by API,
# and not every API exposes its remaining quota):
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 45  ← 45 requests left
```

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Increase Rate Limit** (if legitimate traffic)

**nginx**:
```nginx
# Edit /etc/nginx/nginx.conf
# Increase from 10r/s to 50r/s
limit_req_zone $binary_remote_addr zone=one:10m rate=50r/s;

# Then test and reload from a shell:
#   nginx -t && systemctl reload nginx

# Impact: Allows more requests
# Risk: Low (if traffic is legitimate)
```

**Application** (Express.js):
```javascript
// Increase from 100 to 500 requests per 15 min
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 500, // Increased
});

// Then restart the application from a shell, e.g.:
//   pm2 restart all
```

---

**Option B: Whitelist Specific IPs** (if known legitimate source)

**nginx**:
```nginx
# Whitelist internal IPs, monitoring systems
geo $limit {
    default 1;
    10.0.0.0/8 0;     # Internal network
    192.168.1.100 0;  # Monitoring system
}

map $limit $limit_key {
    0 "";
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=one:10m rate=10r/s;
```

**Application**:
```javascript
const limiter = rateLimit({
  skip: (req) => {
    // Whitelist internal IPs. Behind a reverse proxy, req.ip is only the
    // real client address if Express's `trust proxy` setting is configured.
    return req.ip.startsWith('10.') || req.ip === '192.168.1.100';
  },
  windowMs: 15 * 60 * 1000,
  max: 100,
});
```

---

**Option C: Add Caching** (reduce requests to backend)

**Redis cache**:
```javascript
const { createClient } = require('redis'); // node-redis v4+

const redis = createClient();
redis.connect().catch(console.error); // v4+ clients must connect before use

// `app` is your Express app and `db` your existing database client
app.get('/api/users', async (req, res) => {
  const cacheKey = 'users:' + req.query.id;

  // Check cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return res.json(JSON.parse(cached));
  }

  // Fetch from database
  const user = await db.query('SELECT * FROM users WHERE id = ?', [req.query.id]);

  // Cache for 5 minutes
  await redis.set(cacheKey, JSON.stringify(user), { EX: 300 });

  res.json(user);
});

// Impact: Reduces backend load, fewer rate limit hits
// Risk: Low, provided up to 5 minutes of data staleness is acceptable
```

---

**Option D: Block Malicious IPs** (if abuse detected)

**iptables / nginx**:
```bash
# Block specific IP (-I inserts at the top, so an earlier ACCEPT rule cannot override it)
iptables -I INPUT -s 192.168.1.100 -j DROP

# Or in nginx.conf (http, server, or location block), then reload nginx:
#   deny 192.168.1.100;
#   deny 192.168.1.0/24;  # Block range
```

**Cloudflare**:
```
# Cloudflare dashboard:
# Security → WAF → Custom rules
# Block IP: 192.168.1.100
```

---

### Short-term (5 min - 1 hour)

**Option A: Implement Tiered Rate Limits**

**Different limits for different users**:
```javascript
const rateLimit = require('express-rate-limit');

const createLimiter = (max) => rateLimit({
  windowMs: 15 * 60 * 1000,
  max: max,
  keyGenerator: (req) => req.user?.id || req.ip,
});

// Create each limiter once at startup. Building a new limiter per request
// would give every request a fresh counter, so the limit would never trigger.
const limiters = {
  premium: createLimiter(1000),      // Premium: 1000 req/15min
  authenticated: createLimiter(300), // Authenticated: 300 req/15min
  anonymous: createLimiter(100),     // Anonymous: 100 req/15min
};

// Assumes auth middleware has already populated req.user
app.use('/api', (req, res, next) => {
  let limiter;
  if (req.user?.tier === 'premium') {
    limiter = limiters.premium;
  } else if (req.user) {
    limiter = limiters.authenticated;
  } else {
    limiter = limiters.anonymous;
  }
  limiter(req, res, next);
});
```

---

**Option B: Add CAPTCHA** (prevent bots)

**reCAPTCHA** on sensitive endpoints:
```javascript
const { RecaptchaV2 } = require('express-recaptcha');

// Keys come from the reCAPTCHA admin console; the env var names are placeholders
const recaptcha = new RecaptchaV2(
  process.env.RECAPTCHA_SITE_KEY,
  process.env.RECAPTCHA_SECRET_KEY
);

app.post('/api/login', recaptcha.middleware.verify, async (req, res) => {
  if (!req.recaptcha.error) {
    // CAPTCHA valid, proceed with login
    await handleLogin(req, res);
  } else {
    res.status(400).json({ error: 'CAPTCHA failed' });
  }
});
```

---

**Option C: Upgrade External API Plan** (if hitting external limit)

**Stripe**:
```
Current: 100 requests/second (default)
Upgrade: Contact Stripe to request a higher limit
```

**AWS API Gateway**:
```bash
# Increase throttle limit
aws apigateway update-usage-plan \
  --usage-plan-id <ID> \
  --patch-operations \
    op=replace,path=/throttle/rateLimit,value=1000

# Impact: Higher rate limit
# Risk: Low (higher throughput may cost more and put more load on the backend)
```

---

### Long-term (1 hour+)

- [ ] **Implement tiered rate limits** (premium, authenticated, anonymous)
- [ ] **Add caching** (reduce backend load)
- [ ] **Use CDN** (cache static content, reduce origin requests)
- [ ] **Add CAPTCHA** (prevent bots on sensitive endpoints)
- [ ] **Monitor rate limit usage** (alert before hitting limit)
- [ ] **Batch requests** (reduce API calls to external services)
- [ ] **Implement retry with backoff** (external API rate limits)
- [ ] **Document rate limits** (API documentation for users)
- [ ] **Add rate limit headers** (tell users their remaining quota)

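For the "monitor rate limit usage" item, the core check is small: compare the remaining quota with the limit and alert once usage crosses a threshold. The following is a minimal sketch, assuming the `limit` and `remaining` values come from response headers or from your limiter's store. The function name `quotaAlert` and the 80% default threshold are hypothetical choices to adapt to your alerting setup.

```javascript
// Hypothetical helper: decide whether quota usage warrants an early alert.
function quotaAlert(limit, remaining, threshold = 0.8) {
  if (!(limit > 0)) {
    throw new RangeError('limit must be a positive number');
  }
  // Clamp so a stale or negative "remaining" cannot push usage outside 0..1.
  const used = Math.min(Math.max(limit - remaining, 0), limit);
  const usage = used / limit;
  return { usage, alert: usage >= threshold };
}
```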
---

## Rate Limit Best Practices

### 1. Return Helpful Headers

**429 response** (status code defined in RFC 6585; the `X-RateLimit-*` headers are a widely used convention, not part of that RFC):
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1698345600  # Unix timestamp
Retry-After: 60                # Seconds until reset

{
  "error": "Rate limit exceeded",
  "message": "You have exceeded the rate limit of 100 requests per 15 minutes. Try again in 60 seconds."
}
```

**Implementation**:
```javascript
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  standardHeaders: true, // Return RateLimit-* headers
  legacyHeaders: false,  // Set to true to also send the X-RateLimit-* headers
  handler: (req, res) => {
    // Round up the whole number of seconds left until the window resets
    const retrySeconds = Math.ceil((req.rateLimit.resetTime - Date.now()) / 1000);
    res.status(429).json({
      error: 'Rate limit exceeded',
      message: `You have exceeded the rate limit of ${req.rateLimit.limit} requests per 15 minutes. Try again in ${retrySeconds} seconds.`,
    });
  },
});
```

---

### 2. Use Sliding Window (not Fixed Window)

**Fixed window** (bad):
```
Window 1: 00:00-00:15 (100 requests)
Window 2: 00:15-00:30 (100 requests)

User makes 100 requests at 00:14:59
User makes 100 requests at 00:15:01
→ 200 requests in 2 seconds! (burst)
```

**Sliding window** (good):
```
Rate limit based on last 15 minutes from current time
→ Can't burst across a window boundary (limit enforced continuously)
```

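A sliding window can be implemented in several ways; one of the simplest is a sliding-window log that stores the timestamp of each accepted request. The sketch below is a minimal in-memory illustration, not a description of what `express-rate-limit` does internally. The class name `SlidingWindowLimiter` and its `allow(key, now)` method are hypothetical, idle keys are never evicted, and a multi-instance deployment would need a shared store such as Redis instead of a local `Map`. Replaying the scenario above, the 100 requests at 00:14:59 are accepted and a request at 00:15:01 is rejected, because the earlier ones are still inside the trailing 15 minutes.

```javascript
// Sliding-window log: remember the timestamp of each accepted request per key
// and count only those that fall inside the last `windowMs` milliseconds.
class SlidingWindowLimiter {
  constructor(max, windowMs) {
    this.max = max;
    this.windowMs = windowMs;
    this.hits = new Map(); // key -> array of accepted-request timestamps (ms)
  }

  // Returns true if the request is allowed. `now` is injectable for testing.
  allow(key, now = Date.now()) {
    const cutoff = now - this.windowMs;
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(key) || []).filter((t) => t > cutoff);
    const allowed = recent.length < this.max;
    if (allowed) recent.push(now);
    this.hits.set(key, recent);
    return allowed;
  }
}
```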
---

### 3. Different Limits for Different Endpoints

```javascript
// `handler` stands in for each route's own handler function.
// windowMs is omitted here, so the library default window applies.

// Expensive endpoint (lower limit)
app.get('/api/analytics', rateLimit({ max: 10 }), handler);

// Cheap endpoint (higher limit)
app.get('/api/health', rateLimit({ max: 1000 }), handler);
```

---

## External API Rate Limit Handling

### Retry with Backoff

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callExternalAPI(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await fetch(url);

      // Check rate limit headers (skip the warning when the header is absent)
      const remaining = response.headers.get('X-RateLimit-Remaining');
      if (remaining !== null && Number(remaining) < 10) {
        console.warn('Approaching rate limit:', remaining);
      }

      if (response.status === 429) {
        // Rate limited: use Retry-After when it is a number of seconds,
        // otherwise fall back to 60s
        const retryAfter = Number(response.headers.get('Retry-After')) || 60;
        console.log(`Rate limited, retrying after ${retryAfter}s`);
        await sleep(retryAfter * 1000);
        continue;
      }

      return response.json();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff
    }
  }
  // Every attempt returned 429: fail loudly instead of returning undefined
  throw new Error(`Still rate limited after ${retries} attempts: ${url}`);
}
```

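`Retry-After` can carry either a number of seconds or an HTTP date (RFC 9110), so multiplying the raw header value by 1000 only covers the first form. The helper below is a sketch of a parser that handles both. The name `retryAfterMs` is hypothetical, and the 60-second fallback simply mirrors the default used in the retry function above.

```javascript
// Hypothetical helper: convert a Retry-After header value to milliseconds.
// Accepts delay-seconds ("120") or an HTTP-date ("Wed, 21 Oct 2015 07:28:00 GMT").
function retryAfterMs(headerValue, now = Date.now(), fallbackMs = 60000) {
  if (headerValue === null || headerValue === undefined) return fallbackMs;
  const value = String(headerValue).trim();

  // Form 1: a non-negative integer number of seconds.
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // Form 2: an HTTP-date; wait until that moment, never a negative delay.
  const date = Date.parse(value);
  if (!Number.isNaN(date)) return Math.max(date - now, 0);

  return fallbackMs; // Unparseable value
}
```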
---

## Escalation

**Escalate to developer if**:
- Application rate limit logic needs changes
- Caching needs to be implemented

**Escalate to infrastructure if**:
- nginx/API Gateway rate limit config needs changes
- Capacity needs to be scaled up

**Escalate to external vendor if**:
- Hitting external API rate limit
- Need higher quota

---

## Related Runbooks

- [05-ddos-attack.md](05-ddos-attack.md) - If malicious traffic
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify why the rate limit was hit
- [ ] Adjust rate limits (if needed)
- [ ] Add monitoring (alert before hitting limit)
- [ ] Document rate limits (for users/API consumers)
- [ ] Update this runbook if needed

---

## Useful Commands Reference

```bash
# Check 429 errors (nginx)
awk '$9 == 429 {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Check rate limit config (nginx)
grep -rn "limit_req" /etc/nginx/

# Block IP (iptables)
iptables -I INPUT -s <IP> -j DROP

# Test rate limit (prints a count per HTTP status code)
for i in {1..200}; do curl -s -o /dev/null -w "%{http_code}\n" http://localhost/api; done | sort | uniq -c

# Check external API rate limit
curl -I https://api.example.com -H "Authorization: Bearer TOKEN"
# Look for X-RateLimit-* headers
```