Initial commit

2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions
--- a/agents/sre/playbooks/10-rate-limit-exceeded.md
+++ b/agents/sre/playbooks/10-rate-limit-exceeded.md
@@ -0,0 +1,464 @@
+# Playbook: Rate Limit Exceeded
+
+## Symptoms
+
+- "Rate limit exceeded" errors
+- "429 Too Many Requests" responses
+- "Quota exceeded" messages
+- Legitimate requests being blocked
+- Monitoring alert: "High rate of 429 errors"
+
+## Severity
+
+- **SEV3** if isolated to specific users/endpoints
+- **SEV2** if affecting many users
+- **SEV1** if critical functionality blocked (payments, auth)
+
+## Diagnosis
+
+### Step 1: Identify What's Rate Limited
+
+**Check Error Messages**:
+```bash
+# Application logs
+grep -iE "rate limit|429|quota exceeded" /var/log/application.log
+
+# nginx logs
+awk '$9 == 429 {print $1, $7}' /var/log/nginx/access.log | sort | uniq -c
+
+# Example output:
+# 500 192.168.1.100 /api/users    ← IP hitting rate limit
+# 200 192.168.1.101 /api/posts
+```
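+
+When the log has to be analyzed from a script rather than a shell, the same per-IP, per-path count can be sketched in Node.js. This is a minimal sketch that assumes the default nginx "combined" log format (the same field positions the `awk` command relies on); `count429s` and `sampleLines` are hypothetical names used only for illustration.
+
+```javascript
+// Count 429 responses per "client IP + path" from nginx combined-format lines.
+function count429s(lines) {
+  const counts = new Map();
+  for (const line of lines) {
+    const fields = line.trim().split(/\s+/);
+    // Combined format: fields[0] = client IP, fields[6] = path, fields[8] = status
+    // (the same columns as awk's $1, $7 and $9).
+    if (fields.length < 9 || fields[8] !== '429') continue;
+    const key = `${fields[0]} ${fields[6]}`;
+    counts.set(key, (counts.get(key) || 0) + 1);
+  }
+  // Highest count first, so the noisiest client is at the top.
+  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
+}
+
+// Illustrative input only; real lines would be read from the access log.
+const sampleLines = [
+  '192.168.1.100 - - [29/Nov/2025:17:56:41 +0800] "GET /api/users HTTP/1.1" 429 0 "-" "curl/8.0"',
+  '192.168.1.101 - - [29/Nov/2025:17:56:42 +0800] "GET /api/posts HTTP/1.1" 429 0 "-" "curl/8.0"',
+  '192.168.1.100 - - [29/Nov/2025:17:56:43 +0800] "GET /api/users HTTP/1.1" 429 0 "-" "curl/8.0"',
+  '192.168.1.100 - - [29/Nov/2025:17:56:44 +0800] "GET /api/users HTTP/1.1" 200 512 "-" "curl/8.0"',
+];
+
+console.log(count429s(sampleLines));
+```
+
+A single key with a very high count points at one client hammering one endpoint; many keys with similar counts suggest distributed traffic, which matters for the next step.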
+
+**Check Rate Limit Source**:
+- **Application-level**: Your code enforcing limit
+- **nginx/API Gateway**: Reverse proxy rate limiting
+- **External API**: Third-party service limit (Stripe, Twilio, etc.)
+- **Cloud**: AWS API Gateway, Cloudflare
+
+---
+
+### Step 2: Determine If Legitimate or Malicious
+
+**Legitimate traffic**:
+```
+Scenario: User refreshing dashboard repeatedly
+Pattern: Single user, single endpoint, short burst
+Action: Increase rate limit or add caching
+```
+
+**Malicious traffic** (abuse):
+```
+Scenario: Scraper or bot
+Pattern: Multiple IPs, automated behavior, sustained
+Action: Block IPs, add CAPTCHA
+```
+
+**Traffic spike** (legitimate):
+```
+Scenario: Marketing campaign, viral post
+Pattern: Many users, distributed IPs, real user behavior
+Action: Increase rate limit, scale up
+```
+
+---
+
+### Step 3: Check Current Rate Limits
+
+**nginx**:
+```bash
+# Search nginx.conf and any included config files for rate limit directives
+grep -rn "limit_req" /etc/nginx/
+
+# Example:
+# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
+#                                                         ^^^^ Current limit
+```
+
+**Application** (Express.js example):
+```javascript
+// Check rate limit middleware
+const rateLimit = require('express-rate-limit');
+
+const limiter = rateLimit({
+  windowMs: 15 * 60 * 1000, // 15 minutes
+  max: 100, // Limit: 100 requests per 15 minutes
+});
+```
+
+**External API**:
+```bash
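+# Check the vendor's current documentation: published limits change over time
+# and often differ between test and live modes. The figures below are examples only.
+# Stripe: 100 requests per second
+# Twilio: 100 requests per second
+# Google Maps: $200/month free quota
+
+# Check current usage (-i prints the response headers)
+# Stripe:
+curl -i https://api.stripe.com/v1/balance \
+  -u sk_test_XXX: \
+  -H "Stripe-Account: acct_XXX"
+
+# Many APIs report usage in response headers like the ones below
+# (header names vary by vendor, and not every API sends them):
+# X-RateLimit-Limit: 100
+# X-RateLimit-Remaining: 45  ← 45 requests left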
+```
+
+---
+
+## Mitigation
+
+### Immediate (Now - 5 min)
+
+**Option A: Increase Rate Limit** (if legitimate traffic)
+
+**nginx**:
+```nginx
+# Edit /etc/nginx/nginx.conf
+# Increase from 10r/s to 50r/s
+limit_req_zone $binary_remote_addr zone=one:10m rate=50r/s;
+
+# Then test and reload from a shell (this command is not part of nginx.conf):
+#   nginx -t && systemctl reload nginx
+
+# Impact: Allows more requests
+# Risk: Low (if traffic is legitimate)
+```
+
+**Application** (Express.js):
+```javascript
+// Increase from 100 to 500 requests per 15 min
+const limiter = rateLimit({
+  windowMs: 15 * 60 * 1000,
+  max: 500, // Increased
+});
+
+// Then restart the application from a shell, e.g.:
+//   pm2 restart all
+```
+
+---
+
+**Option B: Whitelist Specific IPs** (if known legitimate source)
+
+**nginx**:
+```nginx
+# Whitelist internal IPs, monitoring systems
+geo $limit {
+  default 1;
+  10.0.0.0/8 0;        # Internal network
+  192.168.1.100 0;     # Monitoring system
+}
+
+map $limit $limit_key {
+  0 "";
+  1 $binary_remote_addr;
+}
+
+limit_req_zone $limit_key zone=one:10m rate=10r/s;
+```
+
+**Application**:
+```javascript
+const limiter = rateLimit({
+  skip: (req) => {
+    // Whitelist internal IPs
+    return req.ip.startsWith('10.') || req.ip === '192.168.1.100';
+  },
+  windowMs: 15 * 60 * 1000,
+  max: 100,
+});
+```
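+
+A string prefix check only works for ranges that align with a dotted prefix, and it misses IPv4-mapped IPv6 addresses such as `::ffff:10.1.2.3`, which Node.js can report on dual-stack sockets. The sketch below shows one way to match against CIDR ranges instead. It handles IPv4 only, and `ipv4ToInt`, `inCidr`, `ALLOWLIST`, and `isAllowlisted` are hypothetical helpers, not part of express-rate-limit.
+
+```javascript
+// Convert dotted-quad IPv4 text to an integer, or null if it is not valid IPv4.
+function ipv4ToInt(ip) {
+  const parts = ip.split('.');
+  if (parts.length !== 4) return null;
+  let value = 0;
+  for (const part of parts) {
+    if (!/^\d{1,3}$/.test(part)) return null;
+    const octet = Number(part);
+    if (octet > 255) return null;
+    value = value * 256 + octet;
+  }
+  return value;
+}
+
+// True if `ip` falls inside `cidr` (e.g. "10.0.0.0/8"); a bare IP is treated as /32.
+function inCidr(ip, cidr) {
+  const [base, bitsText = '32'] = cidr.split('/');
+  const bits = Number(bitsText);
+  // Strip the IPv4-mapped IPv6 prefix before parsing.
+  const ipInt = ipv4ToInt(ip.replace(/^::ffff:/i, ''));
+  const baseInt = ipv4ToInt(base);
+  if (ipInt === null || baseInt === null) return false;
+  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
+  // Two addresses share a network if they fall in the same block of 2^(32 - bits).
+  const blockSize = 2 ** (32 - bits);
+  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
+}
+
+// Same ranges as the nginx example: internal network and monitoring system.
+const ALLOWLIST = ['10.0.0.0/8', '192.168.1.100/32'];
+const isAllowlisted = (ip) => ALLOWLIST.some((cidr) => inCidr(ip, cidr));
+
+console.log(isAllowlisted('::ffff:10.1.2.3'), isAllowlisted('192.168.1.101'));
+```
+
+With this in place, the `skip` option could be written as `skip: (req) => isAllowlisted(req.ip)`. If the application sits behind a reverse proxy, `req.ip` is only the real client address when Express's `trust proxy` setting is configured for that proxy.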
+
+---
+
+**Option C: Add Caching** (reduce requests to backend)
+
+**Redis cache**:
+```javascript
+const redis = require('redis').createClient();
+// node-redis v4+: the client must connect before it can run commands
+redis.connect().catch(console.error);
+
+app.get('/api/users', async (req, res) => {
+  const cacheKey = 'users:' + req.query.id;
+
+  // Check cache first
+  const cached = await redis.get(cacheKey);
+  if (cached) {
+    return res.json(JSON.parse(cached));
+  }
+
+  // Fetch from database
+  const user = await db.query('SELECT * FROM users WHERE id = ?', [req.query.id]);
+
+  // Cache for 5 minutes (setEx is the node-redis v4+ name; v3 used callback-style setex)
+  await redis.setEx(cacheKey, 300, JSON.stringify(user));
+
+  res.json(user);
+});
+
+// Impact: Reduces backend load, fewer rate limit hits
+// Risk: Low, provided up to 5 minutes of data staleness is acceptable
+```
+
+---
+
+**Option D: Block Malicious IPs** (if abuse detected)
+
+**nginx**:
+```bash
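+# Block specific IP at the firewall
+# (-I inserts at the top of the chain; with -A, an earlier ACCEPT rule would still match first)
+iptables -I INPUT -s 192.168.1.100 -j DROP
+
+# Or add to nginx.conf, then run: nginx -t && systemctl reload nginx
+#   deny 192.168.1.100;
+#   deny 192.168.1.0/24;  # Block range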
+```
+
+**Cloudflare**:
+```
+# Cloudflare dashboard:
+# Security → WAF → Custom rules
+# Block IP: 192.168.1.100
+```
+
+---
+
+### Short-term (5 min - 1 hour)
+
+**Option A: Implement Tiered Rate Limits**
+
+**Different limits for different users**:
+```javascript
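+const rateLimit = require('express-rate-limit');
+
+const createLimiter = (max) => rateLimit({
+  windowMs: 15 * 60 * 1000,
+  max: max,
+  keyGenerator: (req) => req.user?.id || req.ip,
+});
+
+// Create each limiter once at startup. Building a limiter inside the request
+// handler gives every request a fresh counter, so the limit is never enforced.
+const limiters = {
+  premium: createLimiter(1000),       // Premium: 1000 req/15min
+  authenticated: createLimiter(300),  // Authenticated: 300 req/15min
+  anonymous: createLimiter(100),      // Anonymous: 100 req/15min
+};
+
+// Assumes earlier auth middleware has already populated req.user
+app.use('/api', (req, res, next) => {
+  let tier = 'anonymous';
+  if (req.user?.tier === 'premium') tier = 'premium';
+  else if (req.user) tier = 'authenticated';
+  limiters[tier](req, res, next);
+});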
+```
+
+---
+
+**Option B: Add CAPTCHA** (prevent bots)
+
+**reCAPTCHA** on sensitive endpoints:
+```javascript
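+const { RecaptchaV2 } = require('express-recaptcha');
+
+// Site and secret keys come from the reCAPTCHA admin console (env var names are placeholders)
+const recaptcha = new RecaptchaV2(process.env.RECAPTCHA_SITE_KEY, process.env.RECAPTCHA_SECRET_KEY);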
+
+app.post('/api/login', recaptcha.middleware.verify, async (req, res) => {
+  if (!req.recaptcha.error) {
+    // CAPTCHA valid, proceed with login
+    await handleLogin(req, res);
+  } else {
+    res.status(400).json({ error: 'CAPTCHA failed' });
+  }
+});
+```
+
+---
+
+**Option C: Upgrade External API Plan** (if hitting external limit)
+
+**Stripe**:
+```
+Current: 100 requests/second (default; confirm against Stripe's current documentation)
+Upgrade: Contact Stripe support to request a higher limit
+```
+
+**AWS API Gateway**:
+```bash
+# Increase throttle limit
+aws apigateway update-usage-plan \
+  --usage-plan-id <ID> \
+  --patch-operations \
+    op=replace,path=/throttle/rateLimit,value=1000
+
+# Impact: Higher rate limit
+# Risk: Low (higher request volume may increase cost and backend load)
+```
+
+---
+
+### Long-term (1 hour+)
+
+- [ ] **Implement tiered rate limits** (premium, authenticated, anonymous)
+- [ ] **Add caching** (reduce backend load)
+- [ ] **Use CDN** (cache static content, reduce origin requests)
+- [ ] **Add CAPTCHA** (prevent bots on sensitive endpoints)
+- [ ] **Monitor rate limit usage** (alert before hitting limit)
+- [ ] **Batch requests** (reduce API calls to external services)
+- [ ] **Implement retry with backoff** (external API rate limits)
+- [ ] **Document rate limits** (API documentation for users)
+- [ ] **Add rate limit headers** (tell users their remaining quota)
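+
+For the "alert before hitting limit" item, one possible shape for the check is a small function that classifies quota usage from a limit/remaining pair, such as the values reported by `X-RateLimit-*` headers. This is a sketch only: `quotaStatus` is a hypothetical helper and the 80% warning threshold is an assumption to tune per service.
+
+```javascript
+// Classify quota usage so an alert can fire before the limit is actually hit.
+// warnFraction is the share of the quota that may be used before warning.
+function quotaStatus(limit, remaining, warnFraction = 0.8) {
+  if (remaining <= 0) return 'exhausted';
+  const usedFraction = limit > 0 ? (limit - remaining) / limit : 1;
+  return usedFraction >= warnFraction ? 'warning' : 'ok';
+}
+
+console.log(quotaStatus(100, 45)); // 55% used
+console.log(quotaStatus(100, 20)); // 80% used
+console.log(quotaStatus(100, 0));  // nothing left
+```
+
+Feeding the result into the existing monitoring system gives an early signal at `warning` instead of a page at the first 429.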
+
+---
+
+## Rate Limit Best Practices
+
+### 1. Return Helpful Headers
+
+**429 status (defined in RFC 6585) with conventional `X-RateLimit-*` headers**:
+```http
+HTTP/1.1 429 Too Many Requests
+X-RateLimit-Limit: 100
+X-RateLimit-Remaining: 0
+X-RateLimit-Reset: 1698345600  # Unix timestamp
+Retry-After: 60  # Seconds until reset
+
+{
+  "error": "Rate limit exceeded",
+  "message": "You have exceeded the rate limit of 100 requests per 15 minutes. Try again in 60 seconds."
+}
+```
+
+**Implementation**:
+```javascript
+const limiter = rateLimit({
+  windowMs: 15 * 60 * 1000,
+  max: 100,
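+  standardHeaders: true,  // Return RateLimit-* headers
+  legacyHeaders: false,   // Set to true to also send the X-RateLimit-* headers shown above
+  handler: (req, res) => {
+    // resetTime is a Date; round the remaining time up to whole seconds
+    const retryAfterSeconds = Math.ceil((req.rateLimit.resetTime - Date.now()) / 1000);
+    res.status(429).json({
+      error: 'Rate limit exceeded',
+      message: `You have exceeded the rate limit of ${req.rateLimit.limit} requests per 15 minutes. Try again in ${retryAfterSeconds} seconds.`,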
+    });
+  },
+});
+```
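+
+For services that do not use express-rate-limit, the header values in the 429 example can be derived from three numbers: the limit, the requests used so far, and the time the window resets. The following is a minimal sketch; `rateLimitHeaders` and its parameter names are hypothetical.
+
+```javascript
+// Build the conventional X-RateLimit-* headers, plus Retry-After once the quota is spent.
+// resetAtMs and nowMs are Unix timestamps in milliseconds.
+function rateLimitHeaders({ limit, used, resetAtMs }, nowMs = Date.now()) {
+  const remaining = Math.max(0, limit - used);
+  const headers = {
+    'X-RateLimit-Limit': String(limit),
+    'X-RateLimit-Remaining': String(remaining),
+    'X-RateLimit-Reset': String(Math.ceil(resetAtMs / 1000)), // Unix seconds
+  };
+  if (remaining === 0) {
+    // Whole seconds until the window resets, never negative.
+    headers['Retry-After'] = String(Math.max(0, Math.ceil((resetAtMs - nowMs) / 1000)));
+  }
+  return headers;
+}
+
+// Window resets 60 seconds after "now", and all 100 requests are used.
+const exampleHeaders = rateLimitHeaders(
+  { limit: 100, used: 100, resetAtMs: 1698345600000 },
+  1698345540000
+);
+console.log(exampleHeaders);
+```
+
+With these inputs the result matches the example response: a reset of `1698345600` and a `Retry-After` of `60`.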
+
+---
+
+### 2. Use Sliding Window (not Fixed Window)
+
+**Fixed window** (bad):
+```
+Window 1: 00:00-00:15 (100 requests)
+Window 2: 00:15-00:30 (100 requests)
+
+User makes 100 requests at 00:14:59
+User makes 100 requests at 00:15:01
+→ 200 requests in 2 seconds! (burst)
+```
+
+**Sliding window** (good):
+```
+Rate limit based on the last 15 minutes from the current time
+→ No double-sized burst at window boundaries (at most 100 requests in any 15-minute span)
+```
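+
+One simple way to get this behavior is a "sliding window log": keep the timestamps of recent requests per client and count only those inside the window. The sketch below is an in-memory, single-process illustration, not how any particular library implements it; `SlidingWindowLimiter` is a hypothetical class, and a multi-instance deployment would need a shared store such as Redis.
+
+```javascript
+// Sliding window log limiter: allows at most `limit` requests per key
+// in any span of `windowMs` milliseconds.
+class SlidingWindowLimiter {
+  constructor(limit, windowMs) {
+    this.limit = limit;
+    this.windowMs = windowMs;
+    this.hits = new Map(); // key -> timestamps (ms) of accepted requests
+  }
+
+  // Returns true and records the request if it is allowed, false otherwise.
+  allow(key, nowMs = Date.now()) {
+    const cutoff = nowMs - this.windowMs;
+    // Drop timestamps that have aged out of the window.
+    const recent = (this.hits.get(key) || []).filter((t) => t > cutoff);
+    const allowed = recent.length < this.limit;
+    if (allowed) recent.push(nowMs);
+    this.hits.set(key, recent);
+    return allowed;
+  }
+}
+
+// Replay the boundary scenario: 100 requests at 00:14:59, one more at 00:15:01.
+const limiter = new SlidingWindowLimiter(100, 15 * 60 * 1000);
+const at = (min, sec) => (min * 60 + sec) * 1000;
+let accepted = 0;
+for (let i = 0; i < 100; i++) {
+  if (limiter.allow('user-1', at(14, 59))) accepted++;
+}
+console.log(accepted, limiter.allow('user-1', at(15, 1)));
+```
+
+Unlike the fixed window, the request at 00:15:01 is rejected because the 100 earlier requests are still inside the trailing 15 minutes. The trade-off is memory: one timestamp is stored per accepted request, and idle keys are never evicted in this sketch.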
+
+---
+
+### 3. Different Limits for Different Endpoints
+
+```javascript
+// Expensive endpoint (lower limit)
+app.get('/api/analytics', rateLimit({ max: 10 }), handler);
+
+// Cheap endpoint (higher limit)
+app.get('/api/health', rateLimit({ max: 1000 }), handler);
+```
+
+---
+
+## External API Rate Limit Handling
+
+### Retry with Backoff
+
+```javascript
+const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
+
+async function callExternalAPI(url, retries = 3) {
+  for (let i = 0; i < retries; i++) {
+    try {
+      const response = await fetch(url);
+
+      // Check rate limit headers (many APIs do not send this header)
+      const remaining = response.headers.get('X-RateLimit-Remaining');
+      if (remaining !== null && Number(remaining) < 10) {
+        console.warn('Approaching rate limit:', remaining);
+      }
+
+      if (response.status === 429) {
+        // Rate limited: honor Retry-After in seconds; fall back to 60s if missing or not numeric
+        const retryAfter = Number(response.headers.get('Retry-After')) || 60;
+        console.log(`Rate limited, retrying after ${retryAfter}s`);
+        await sleep(retryAfter * 1000);
+        continue;
+      }
+
+      return response.json();
+    } catch (error) {
+      if (i === retries - 1) throw error;
+      await sleep(Math.pow(2, i) * 1000); // Exponential backoff
+    }
+  }
+  // Every attempt was rate limited
+  throw new Error(`Still rate limited after ${retries} attempts: ${url}`);
+}
+```
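+
+Two details are worth handling separately when hardening a retry loop. `Retry-After` may carry either a number of seconds or an HTTP date, and many clients retrying on the same schedule can hit the API in lockstep. The sketch below parses both header forms and adds randomized ("full jitter") backoff, which is a swapped-in refinement rather than something the code above does; `parseRetryAfterMs` and `backoffDelayMs` are hypothetical helpers, and the 1s base and 30s cap are assumptions.
+
+```javascript
+// Parse a Retry-After header value into milliseconds to wait.
+// Accepts delta-seconds ("60") or an HTTP date; returns null if absent or unparseable.
+function parseRetryAfterMs(value, nowMs = Date.now()) {
+  if (value == null || value === '') return null;
+  const text = String(value).trim();
+  if (/^\d+$/.test(text)) return Number(text) * 1000;
+  const dateMs = Date.parse(text);
+  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
+}
+
+// Exponential backoff with full jitter: a random delay between 0 and
+// min(capMs, baseMs * 2^attempt). `random` is injectable for testing.
+function backoffDelayMs(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
+  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
+  return Math.floor(random() * ceiling);
+}
+
+console.log(parseRetryAfterMs('60'));
+console.log(backoffDelayMs(3, { random: () => 0.5 }));
+```
+
+In a retry loop, a reasonable pattern is to use the parsed `Retry-After` value when the server provides one and fall back to `backoffDelayMs(attempt)` otherwise.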
+
+---
+
+## Escalation
+
+**Escalate to developer if**:
+- Application rate limit logic needs changes
+- Need to implement caching
+
+**Escalate to infrastructure if**:
+- nginx/API Gateway rate limit config needs changes
+- Need to scale up capacity
+
+**Escalate to external vendor if**:
+- Hitting external API rate limit
+- Need higher quota
+
+---
+
+## Related Runbooks
+
+- [05-ddos-attack.md](05-ddos-attack.md) - If malicious traffic
+- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting
+
+---
+
+## Post-Incident
+
+After resolving:
+- [ ] Create post-mortem (if SEV1/SEV2)
+- [ ] Identify why rate limit hit
+- [ ] Adjust rate limits (if needed)
+- [ ] Add monitoring (alert before hitting limit)
+- [ ] Document rate limits (for users/API consumers)
+- [ ] Update this runbook if needed
+
+---
+
+## Useful Commands Reference
+
+```bash
+# Check 429 errors (nginx)
+awk '$9 == 429 {print $1}' /var/log/nginx/access.log | sort | uniq -c
+
+# Check rate limit config (nginx)
+grep "limit_req" /etc/nginx/nginx.conf
+
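+# Block IP (iptables; -I inserts the rule ahead of any existing ACCEPT rules)
+iptables -I INPUT -s <IP> -j DROP
+
+# Test rate limit (prints a count per HTTP status code)
+for i in {1..200}; do curl -s -o /dev/null -w "%{http_code}\n" http://localhost/api; done | sort | uniq -c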
+
+# Check external API rate limit
+curl -I https://api.example.com -H "Authorization: Bearer TOKEN"
+# Look for X-RateLimit-* headers
+```