# Playbook: Rate Limit Exceeded

## Symptoms

- "Rate limit exceeded" errors
- "429 Too Many Requests" responses
- "Quota exceeded" messages
- Legitimate requests being blocked
- Monitoring alert: "High rate of 429 errors"

## Severity

- **SEV3** if isolated to specific users/endpoints
- **SEV2** if affecting many users
- **SEV1** if critical functionality is blocked (payments, auth)

## Diagnosis

### Step 1: Identify What's Rate Limited

**Check Error Messages**:
```bash
# Application logs (case-insensitive, since messages are usually capitalized)
grep -iE "rate limit|429|quota exceeded" /var/log/application.log

# nginx logs (assumes the default combined log format, where field 9 is the status)
awk '$9 == 429 {print $1, $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Example output:
#   500 192.168.1.100 /api/users  ← IP hitting rate limit
#   200 192.168.1.101 /api/posts
```

**Check Rate Limit Source**:
- **Application-level**: Your code enforcing the limit
- **nginx/API Gateway**: Reverse proxy rate limiting
- **External API**: Third-party service limit (Stripe, Twilio, etc.)
- **Cloud**: AWS API Gateway, Cloudflare

---

### Step 2: Determine If Legitimate or Malicious

**Legitimate traffic**:
```
Scenario: User refreshing dashboard repeatedly
Pattern: Single user, single endpoint, short burst
Action: Increase rate limit or add caching
```

**Malicious traffic** (abuse):
```
Scenario: Scraper or bot
Pattern: Multiple IPs, automated behavior, sustained
Action: Block IPs, add CAPTCHA
```

**Traffic spike** (legitimate):
```
Scenario: Marketing campaign, viral post
Pattern: Many users, distributed IPs, real user behavior
Action: Increase rate limit, scale up
```

---

### Step 3: Check Current Rate Limits

**nginx**:
```bash
# Check the nginx config (limits may live in included files, so search recursively)
grep -rn "limit_req" /etc/nginx/

# Example:
# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
#                                                 ^^^^^^^^^^ Current limit
```

**Application** (Express.js example):
```javascript
// Check rate limit middleware
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,  // 15 minutes
  max: 100,  // Limit: 100 requests per 15 minutes
});
```

**External API**:
```bash
# Check external API documentation. The figures below are examples and change
# over time; confirm them against the vendor's current docs.
# Stripe: 100 requests per second
# Twilio: 100 requests per second
# Google Maps: $200/month free credit

# Check current usage (-i prints the response headers)
# Stripe:
curl -i https://api.stripe.com/v1/balance \
  -u sk_test_XXX: \
  -H "Stripe-Account: acct_XXX"

# If the API exposes rate limit headers (not every vendor does), they look like:
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 45  ← 45 requests left
```

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Increase Rate Limit** (if legitimate traffic)

**nginx**:
```nginx
# Edit /etc/nginx/nginx.conf
# Increase from 10r/s to 50r/s
limit_req_zone $binary_remote_addr zone=one:10m rate=50r/s;

# Then test and reload from a shell:
#   nginx -t && systemctl reload nginx

# Impact: Allows more requests
# Risk: Low (if traffic is legitimate)
```

**Application** (Express.js):
```javascript
// Increase from 100 to 500 requests per 15 min
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 500,  // Increased
});

// Then restart the application from a shell, e.g.: pm2 restart all
```

---

**Option B: Whitelist Specific IPs** (if known legitimate source)

**nginx**:
```nginx
# Whitelist internal IPs, monitoring systems
geo $limit {
    default        1;
    10.0.0.0/8     0;  # Internal network
    192.168.1.100  0;  # Monitoring system
}

map $limit $limit_key {
    0 "";
    1 $binary_remote_addr;
}

# Requests with an empty key are not counted against the limit
limit_req_zone $limit_key zone=one:10m rate=10r/s;
```

**Application**:
```javascript
const limiter = rateLimit({
  skip: (req) => {
    // Whitelist internal IPs.
    // Behind a reverse proxy, req.ip is only the client IP if "trust proxy" is configured.
    return req.ip.startsWith('10.') || req.ip === '192.168.1.100';
  },
  windowMs: 15 * 60 * 1000,
  max: 100,
});
```

---

**Option C: Add Caching** (reduce requests to backend)

**Redis cache**:
```javascript
// Assumes node-redis v4+ (promise API) and an existing `app` and `db`
const redis = require('redis').createClient();
redis.connect().catch(console.error);

app.get('/api/users', async (req, res) => {
  const cacheKey = 'users:' + req.query.id;

  // Check cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return res.json(JSON.parse(cached));
  }

  // Fetch from database
  const user = await db.query('SELECT * FROM users WHERE id = ?', [req.query.id]);

  // Cache for 5 minutes
  await redis.setEx(cacheKey, 300, JSON.stringify(user));

  res.json(user);
});

// Impact: Reduces backend load, fewer rate limit hits
// Risk: Low (if up to 5 minutes of data staleness is acceptable)
```

---

**Option D: Block Malicious IPs** (if abuse detected)

**iptables / nginx**:
```bash
# Block specific IP at the firewall
# (use -I INPUT 1 instead of -A if an earlier rule already accepts the traffic)
iptables -A INPUT -s 192.168.1.100 -j DROP

# Or in nginx.conf (http, server, or location block), then reload nginx:
#   deny 192.168.1.100;
#   deny 192.168.1.0/24;  # Block range
```

**Cloudflare**:
```
# Cloudflare dashboard:
# Security → WAF → Custom rules
# Block IP: 192.168.1.100
```

---

### Short-term (5 min - 1 hour)

**Option A: Implement Tiered Rate Limits**

**Different limits for different users**:
```javascript
const rateLimit = require('express-rate-limit');

const createLimiter = (max) => rateLimit({
  windowMs: 15 * 60 * 1000,
  max: max,
  keyGenerator: (req) => req.user?.id || req.ip,
});

// Create each limiter once at startup. A limiter created inside the request
// handler gets a fresh counter on every request and never enforces its limit.
const limiters = {
  premium: createLimiter(1000),       // Premium: 1000 req/15min
  authenticated: createLimiter(300),  // Authenticated: 300 req/15min
  anonymous: createLimiter(100),      // Anonymous: 100 req/15min
};

// Assumes auth middleware has already populated req.user
app.use('/api', (req, res, next) => {
  let tier = 'anonymous';
  if (req.user?.tier === 'premium') {
    tier = 'premium';
  } else if (req.user) {
    tier = 'authenticated';
  }
  limiters[tier](req, res, next);
});
```

---

**Option B: Add CAPTCHA** (prevent bots)

**reCAPTCHA** on sensitive endpoints:
```javascript
const { RecaptchaV2 } = require('express-recaptcha');

// Environment variable names are placeholders; use your own key storage
const recaptcha = new RecaptchaV2(
  process.env.RECAPTCHA_SITE_KEY,
  process.env.RECAPTCHA_SECRET_KEY
);

app.post('/api/login', recaptcha.middleware.verify, async (req, res) => {
  if (!req.recaptcha.error) {
    // CAPTCHA valid, proceed with login
    await handleLogin(req, res);
  } else {
    res.status(400).json({ error: 'CAPTCHA failed' });
  }
});
```

---

**Option C: Upgrade External API Plan** (if hitting external limit)

**Stripe**:
```
Current: 100 requests/second (default; confirm in Stripe's docs)
Upgrade: Contact Stripe support to request a higher limit
```

**AWS API Gateway**:
```bash
# Increase throttle limit (replace <USAGE_PLAN_ID> with your usage plan's ID)
aws apigateway update-usage-plan \
  --usage-plan-id <USAGE_PLAN_ID> \
  --patch-operations \
    op=replace,path=/throttle/rateLimit,value=1000

# Impact: Higher rate limit
# Risk: Low (may cost more and put more load on the backend)
```

---

### Long-term (1 hour+)

- [ ] **Implement tiered rate limits** (premium, authenticated, anonymous)
- [ ] **Add caching** (reduce backend load)
- [ ] **Use CDN** (cache static content, reduce origin requests)
- [ ] **Add CAPTCHA** (prevent bots on sensitive endpoints)
- [ ] **Monitor rate limit usage** (alert before hitting limit)
- [ ] **Batch requests** (reduce API calls to external services)
- [ ] **Implement retry with backoff** (external API rate limits)
- [ ] **Document rate limits** (API documentation for users)
- [ ] **Add rate limit headers** (tell users their remaining quota)

---

## Rate Limit Best Practices

### 1. Return Helpful Headers

**429 status (RFC 6585) with conventional `X-RateLimit-*` headers**:
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1698345600
Retry-After: 60

{
  "error": "Rate limit exceeded",
  "message": "You have exceeded the rate limit of 100 requests per 15 minutes. Try again in 60 seconds."
}
```

`X-RateLimit-Reset` is a Unix timestamp and `Retry-After` is the number of seconds to wait. RFC 6585 defines only the 429 status code; the `X-RateLimit-*` header names are a widely used convention, not part of that RFC.

**Implementation**:
```javascript
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  standardHeaders: true,  // Return RateLimit-* headers
  legacyHeaders: false,   // Set to true to also send the X-RateLimit-* headers shown above
  handler: (req, res) => {
    // resetTime is a Date; convert the remaining milliseconds to whole seconds
    const retrySeconds = Math.ceil((req.rateLimit.resetTime - Date.now()) / 1000);
    res.status(429).json({
      error: 'Rate limit exceeded',
      message: `You have exceeded the rate limit of ${req.rateLimit.limit} requests per 15 minutes. Try again in ${retrySeconds} seconds.`,
    });
  },
});
```

---

### 2. Use Sliding Window (not Fixed Window)

**Fixed window** (bad):
```
Window 1: 00:00-00:15 (100 requests)
Window 2: 00:15-00:30 (100 requests)

User makes 100 requests at 00:14:59
User makes 100 requests at 00:15:01
→ 200 requests in 2 seconds! (burst)
```

**Sliding window** (good):
```
Rate limit based on last 15 minutes from current time
→ Can't burst across a window boundary (at most 100 requests in any 15-minute span)
```

---

### 3. Different Limits for Different Endpoints

```javascript
// Limits apply per window (express-rate-limit defaults to a 1-minute window)

// Expensive endpoint (lower limit)
app.get('/api/analytics', rateLimit({ max: 10 }), handler);

// Cheap endpoint (higher limit)
app.get('/api/health', rateLimit({ max: 1000 }), handler);
```

---

## External API Rate Limit Handling

### Retry with Backoff

```javascript
// Uses the global fetch available in Node 18+
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callExternalAPI(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await fetch(url);

      // Check rate limit headers (the header may be absent)
      const remaining = response.headers.get('X-RateLimit-Remaining');
      if (remaining !== null && Number(remaining) < 10) {
        console.warn('Approaching rate limit:', remaining);
      }

      if (response.status === 429) {
        // Rate limited. Retry-After may be missing or an HTTP date, so fall back to 60s.
        const retryAfter = Number(response.headers.get('Retry-After')) || 60;
        console.log(`Rate limited, retrying after ${retryAfter}s`);
        await sleep(retryAfter * 1000);
        continue;
      }

      return response.json();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000);  // Exponential backoff
    }
  }
  // Every attempt was rate limited; fail loudly instead of returning undefined
  throw new Error(`Still rate limited after ${retries} attempts: ${url}`);
}
```

---

## Escalation

**Escalate to developer if**:
- Application rate limit logic needs changes
- Caching needs to be implemented

**Escalate to infrastructure if**:
- nginx/API Gateway rate limit config needs changes
- Capacity needs to be scaled up

**Escalate to external vendor if**:
- You are hitting an external API rate limit
- You need a higher quota

---

## Related Runbooks

- [05-ddos-attack.md](05-ddos-attack.md) - If malicious traffic
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (if SEV1/SEV2)
- [ ] Identify why the rate limit was hit
- [ ] Adjust rate limits (if needed)
- [ ] Add monitoring (alert before hitting limit)
- [ ] Document rate limits (for users/API consumers)
- [ ] Update this runbook if needed

---

## Useful Commands Reference

```bash
# Check 429 errors (nginx)
awk '$9 == 429 {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Check rate limit config (nginx)
grep -rn "limit_req" /etc/nginx/

# Block IP (iptables); replace <IP_ADDRESS> with the offending address
iptables -A INPUT -s <IP_ADDRESS> -j DROP

# Test rate limit (prints a count per HTTP status code)
for i in {1..200}; do curl -s -o /dev/null -w "%{http_code}\n" http://localhost/api; done | sort | uniq -c

# Check external API rate limit
curl -I https://api.example.com -H "Authorization: Bearer TOKEN"
# Look for X-RateLimit-* headers
```
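## Appendix: Sliding Window Sketch

The sliding-window recommendation under Rate Limit Best Practices can be illustrated with a small in-memory "sliding window log" limiter. This is only a sketch under stated assumptions: the names `createSlidingWindowLimiter`, `allow`, and the injectable `now` clock are hypothetical and are not part of express-rate-limit or any other library. It keeps state in a single process, so a multi-instance deployment would need a shared store such as Redis.

```javascript
// Sliding-window log limiter (illustrative sketch; all names are hypothetical).
// Keeps the timestamps of accepted requests per key and accepts a new request
// only if fewer than `max` were accepted within the last `windowMs` milliseconds.
function createSlidingWindowLimiter({ windowMs, max, now = Date.now }) {
  const hits = new Map(); // key -> ascending array of accepted timestamps (ms)

  return function allow(key) {
    const t = now();
    const cutoff = t - windowMs;

    // Drop timestamps that have slid out of the window.
    const recent = (hits.get(key) || []).filter((ts) => ts > cutoff);

    const allowed = recent.length < max;
    if (allowed) recent.push(t);

    if (recent.length > 0) hits.set(key, recent);
    else hits.delete(key);

    // Seconds until the oldest counted request leaves the window (for Retry-After).
    const oldest = recent.length > 0 ? recent[0] : t;
    const retryAfterSec = allowed ? 0 : Math.ceil((oldest + windowMs - t) / 1000);

    return { allowed, remaining: Math.max(0, max - recent.length), retryAfterSec };
  };
}

// Usage with a fake clock: the boundary burst from the fixed-window example.
let clock = (14 * 60 + 59) * 1000; // 00:14:59
const allow = createSlidingWindowLimiter({
  windowMs: 15 * 60 * 1000,
  max: 100,
  now: () => clock,
});
for (let i = 0; i < 100; i++) allow('user-1'); // 100 requests accepted
clock += 2000;                                  // 00:15:01
console.log(allow('user-1').allowed);           // false: still inside the same 15-minute span
```

Storing one timestamp per accepted request costs memory proportional to `max` per key. A sliding-window counter that approximates the count from two adjacent fixed windows is a common cheaper alternative when limits are large.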