Zhongwei Li 9427ed1eea Initial commit

2025-11-29 17:56:41 +08:00

Playbook: Rate Limit Exceeded

Symptoms

"Rate limit exceeded" errors
"429 Too Many Requests" responses
"Quota exceeded" messages
Legitimate requests being blocked
Monitoring alert: "High rate of 429 errors"

Severity

SEV3 if isolated to specific users/endpoints
SEV2 if affecting many users
SEV1 if critical functionality blocked (payments, auth)

Diagnosis

Step 1: Identify What's Rate Limited

Check Error Messages:

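# Application logs (case-insensitive, so "Rate limit exceeded" also matches)
grep -i "rate limit\|429\|quota exceeded" /var/log/application.log

# nginx logs: count 429 responses per client IP and path, busiest first
# (assumes the default "combined" log format, where field 9 is the status code)
awk '$9 == 429 {print $1, $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn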

# Example output:
# 500 192.168.1.100 /api/users    ← IP hitting rate limit
# 200 192.168.1.101 /api/posts
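
When the logs are already being shipped into a Node.js tool, the same tally can be done in code. This is a minimal sketch, assuming nginx's default "combined" log format; the function name `count429s` is hypothetical and not part of any library.

```javascript
// Sketch: tally 429 responses per (client IP, path) from nginx log lines.
// Assumes the "combined" format:
//   IP - user [time] "METHOD path PROTO" status bytes "referer" "agent"
function count429s(lines) {
  const counts = new Map();
  const pattern = /^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" (\d{3}) /;
  for (const line of lines) {
    const match = pattern.exec(line);
    // Skip lines that do not parse or are not rate limit responses
    if (!match || match[3] !== '429') continue;
    const key = `${match[1]} ${match[2]}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Highest offenders first, as [ "ip path", count ] pairs
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

Feeding it the lines of the access log returns pairs such as `["192.168.1.100 /api/users", 500]`, ordered by count.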

Check Rate Limit Source:

Application-level: Your code enforcing limit
nginx/API Gateway: Reverse proxy rate limiting
External API: Third-party service limit (Stripe, Twilio, etc.)
Cloud/CDN: AWS API Gateway, Cloudflare

Step 2: Determine If Legitimate or Malicious

Legitimate traffic:

Scenario: User refreshing dashboard repeatedly
Pattern: Single user, single endpoint, short burst
Action: Increase rate limit or add caching

Malicious traffic (abuse):

Scenario: Scraper or bot
Pattern: Multiple IPs, automated behavior, sustained
Action: Block IPs, add CAPTCHA

Traffic spike (legitimate):

Scenario: Marketing campaign, viral post
Pattern: Many users, distributed IPs, real user behavior
Action: Increase rate limit, scale up
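
The three patterns above differ mainly in how many distinct clients are blocked, how concentrated the traffic is, and how long it lasts. The sketch below computes those figures from a list of blocked requests so a responder can compare them against the scenarios. The event shape (`ip`, `path`, `timestampMs`) and the function name are assumptions for illustration, and it deliberately returns numbers rather than a verdict.

```javascript
// Sketch: summarize blocked (429) requests to help tell the scenarios apart.
// Each event is assumed to look like { ip, path, timestampMs }.
function summarizeBlocked(events) {
  if (events.length === 0) {
    return { total: 0, distinctIps: 0, distinctPaths: 0, topIpShare: 0, durationSeconds: 0 };
  }
  const perIp = new Map();
  const paths = new Set();
  let first = Infinity;
  let last = -Infinity;
  for (const { ip, path, timestampMs } of events) {
    perIp.set(ip, (perIp.get(ip) || 0) + 1);
    paths.add(path);
    first = Math.min(first, timestampMs);
    last = Math.max(last, timestampMs);
  }
  const topIpCount = Math.max(...perIp.values());
  return {
    total: events.length,
    distinctIps: perIp.size,                  // 1 suggests a single user
    distinctPaths: paths.size,                // 1 suggests a single endpoint
    topIpShare: topIpCount / events.length,   // near 1.0 means one client dominates
    durationSeconds: (last - first) / 1000,   // short burst vs sustained
  };
}
```

A single IP with `topIpShare` near 1.0 and a short duration matches the "user refreshing dashboard" pattern; many IPs over a long duration needs a closer look at user agents and request timing to separate a campaign spike from a bot.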

Step 3: Check Current Rate Limits

nginx:

# Check nginx config (limits are often set in included files, not only nginx.conf)
grep -rn "limit_req" /etc/nginx/

# Example:
# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
#                                                         ^^^^ Current limit

Application (Express.js example):

// Check rate limit middleware
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit: 100 requests per 15 minutes
});

External API:

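# Check external API documentation
# Published limits vary by plan, endpoint, and account, and change over time.
# Verify the figures below against the vendor's current documentation:
# Stripe: 100 requests per second
# Twilio: 100 requests per second
# Google Maps: $200/month free quota

# Check current usage (-i prints the response headers)
# Stripe:
curl -i https://api.stripe.com/v1/balance \
  -u sk_test_XXX: \
  -H "Stripe-Account: acct_XXX"

# Some APIs report quota in response headers. Header names vary by
# provider and not every API sends them. Example:
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 45  ← 45 requests left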

Mitigation

Immediate (Now - 5 min)

Option A: Increase Rate Limit (if legitimate traffic)

nginx:

# Edit /etc/nginx/nginx.conf
# Increase from 10r/s to 50r/s
limit_req_zone $binary_remote_addr zone=one:10m rate=50r/s;

# Test and reload
nginx -t && systemctl reload nginx

# Impact: Allows more requests
# Risk: Low (if traffic is legitimate)

Application (Express.js):

// Increase from 100 to 500 requests per 15 min
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 500, // Increased
});

// Restart the application to apply the change (shell command):
//   pm2 restart all

Option B: Whitelist Specific IPs (if known legitimate source)

nginx:

# Whitelist internal IPs, monitoring systems
geo $limit {
  default 1;
  10.0.0.0/8 0;        # Internal network
  192.168.1.100 0;     # Monitoring system
}

map $limit $limit_key {
  0 "";
  1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=one:10m rate=10r/s;

Application:

const limiter = rateLimit({
  skip: (req) => {
    // Whitelist internal IPs
    return req.ip.startsWith('10.') || req.ip === '192.168.1.100';
  },
  windowMs: 15 * 60 * 1000,
  max: 100,
});
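
A prefix check such as `startsWith('10.')` cannot express other ranges, and Express can report IPv4 clients in IPv4-mapped IPv6 form (for example `::ffff:10.1.2.3`), which the prefix check misses. The sketch below is one way to match against CIDR ranges instead. The helper names (`ipv4ToInt`, `inCidr`, `isAllowlisted`) are hypothetical, and only IPv4 is handled.

```javascript
// Convert dotted IPv4 text to an unsigned 32-bit number; null if not valid IPv4.
function ipv4ToInt(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True if ip falls inside cidr ("10.0.0.0/8"); a bare address means /32.
function inCidr(ip, cidr) {
  const [base, bitsText = '32'] = cidr.split('/');
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  const blockSize = 2 ** (32 - bits); // number of addresses in the range
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// True if the client address matches any allowlisted range.
function isAllowlisted(ip, cidrs) {
  // Strip the IPv4-mapped IPv6 prefix if present
  const normalized = ip.startsWith('::ffff:') ? ip.slice(7) : ip;
  return cidrs.some((cidr) => inCidr(normalized, cidr));
}
```

It could then be used in the limiter as `skip: (req) => isAllowlisted(req.ip, ['10.0.0.0/8', '192.168.1.100'])`. Behind a reverse proxy, `req.ip` only reflects the real client when Express's `trust proxy` setting is configured.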

Option C: Add Caching (reduce requests to backend)

Redis cache:

// Assumes node-redis v4+ (promise API). node-redis v3 and ioredis
// name the expiring set command setex() instead of setEx().
const redis = require('redis').createClient();
redis.connect().catch(console.error); // v4+ clients must connect before use

app.get('/api/users', async (req, res) => {
  const cacheKey = 'users:' + req.query.id;

  // Check cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return res.json(JSON.parse(cached));
  }

  // Fetch from database
  const user = await db.query('SELECT * FROM users WHERE id = ?', [req.query.id]);

  // Cache for 5 minutes
  await redis.setEx(cacheKey, 300, JSON.stringify(user));

  res.json(user);
});

// Impact: Reduces backend load, fewer rate limit hits
// Risk: Low, provided up to 5 minutes of data staleness is acceptable

Option D: Block Malicious IPs (if abuse detected)

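Firewall or nginx:

# Block specific IP at the firewall (-I inserts at the top of the chain,
# so the rule applies before any existing ACCEPT rule)
iptables -I INPUT -s 192.168.1.100 -j DROP

# Or in nginx.conf (http, server, or location context):
deny 192.168.1.100;
deny 192.168.1.0/24;  # Block range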

Cloudflare:

# Cloudflare dashboard:
# Security → WAF → Custom rules
# Block IP: 192.168.1.100

Short-term (5 min - 1 hour)

Option A: Implement Tiered Rate Limits

Different limits for different users:

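const rateLimit = require('express-rate-limit');

const createLimiter = (max) => rateLimit({
  windowMs: 15 * 60 * 1000,
  max: max,
  keyGenerator: (req) => req.user?.id || req.ip,
});

// Create each limiter once at startup. A limiter created inside the
// request handler gets a fresh counter store on every request, so the
// limit would never be reached.
const premiumLimiter = createLimiter(1000);       // Premium: 1000 req/15min
const authenticatedLimiter = createLimiter(300);  // Authenticated: 300 req/15min
const anonymousLimiter = createLimiter(100);      // Anonymous: 100 req/15min

// Assumes authentication middleware has already populated req.user
app.use('/api', (req, res, next) => {
  let limiter;
  if (req.user?.tier === 'premium') {
    limiter = premiumLimiter;
  } else if (req.user) {
    limiter = authenticatedLimiter;
  } else {
    limiter = anonymousLimiter;
  }
  limiter(req, res, next);
});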

Option B: Add CAPTCHA (prevent bots)

reCAPTCHA on sensitive endpoints:

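// express-recaptcha exports classes; create an instance with your keys.
// The environment variable names here are placeholders.
const { RecaptchaV2 } = require('express-recaptcha');
const recaptcha = new RecaptchaV2(
  process.env.RECAPTCHA_SITE_KEY,
  process.env.RECAPTCHA_SECRET_KEY
);

app.post('/api/login', recaptcha.middleware.verify, async (req, res) => {
  if (!req.recaptcha.error) {
    // CAPTCHA valid, proceed with login
    await handleLogin(req, res);
  } else {
    res.status(400).json({ error: 'CAPTCHA failed' });
  }
});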

Option C: Upgrade External API Plan (if hitting external limit)

Stripe:

Current: default limit of about 100 requests/second (confirm in the Stripe docs)
Upgrade: Contact Stripe support to request a higher limit

AWS API Gateway:

# Increase throttle limit
aws apigateway update-usage-plan \
  --usage-plan-id <ID> \
  --patch-operations \
    op=replace,path=/throttle/rateLimit,value=1000

# Impact: Higher rate limit
# Risk: Low (may cost more; the backend must be able to absorb the extra load)

Long-term (1 hour+)

Implement tiered rate limits (premium, authenticated, anonymous)
Add caching (reduce backend load)
Use CDN (cache static content, reduce origin requests)
Add CAPTCHA (prevent bots on sensitive endpoints)
Monitor rate limit usage (alert before hitting limit)
Batch requests (reduce API calls to external services)
Implement retry with backoff (external API rate limits)
Document rate limits (API documentation for users)
Add rate limit headers (tell users their remaining quota)
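
For the "alert before hitting limit" item, the check can be as simple as comparing used quota against a warning threshold. This is a minimal sketch; the function name `quotaStatus` and the 80% default threshold are assumptions to adjust per service, and `limit` and `remaining` would come from whatever the provider or limiter exposes.

```javascript
// Sketch: classify quota usage so an alert can fire before the limit is hit.
// warnAt is the fraction of the limit at which to start warning (assumed 0.8).
function quotaStatus(limit, remaining, warnAt = 0.8) {
  if (!(limit > 0) || remaining < 0 || remaining > limit) {
    throw new RangeError('expected limit > 0 and 0 <= remaining <= limit');
  }
  const used = (limit - remaining) / limit; // fraction of quota consumed
  let level = 'ok';
  if (remaining === 0) {
    level = 'exhausted';
  } else if (used >= warnAt) {
    level = 'warning';
  }
  return { used, level };
}
```

A periodic job could call this with the latest header values and page or post to chat when `level` is `'warning'`, leaving time to act before requests start failing.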

Rate Limit Best Practices

1. Return Helpful Headers

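RFC 6585 defines the 429 status code and allows a Retry-After header. The X-RateLimit-* headers are a widely used convention, not part of that RFC: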

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1698345600  # Unix timestamp
Retry-After: 60  # Seconds until reset

{
  "error": "Rate limit exceeded",
  "message": "You have exceeded the rate limit of 100 requests per 15 minutes. Try again in 60 seconds."
}

Implementation:

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  standardHeaders: true,  // Return RateLimit-* headers
  legacyHeaders: false,   // Set to true to also send X-RateLimit-* headers
  handler: (req, res) => {
    res.status(429).json({
      error: 'Rate limit exceeded',
      message: `You have exceeded the rate limit of ${req.rateLimit.limit} requests per 15 minutes. Try again in ${Math.ceil((req.rateLimit.resetTime - Date.now()) / 1000)} seconds.`,
    });
  },
});
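
For services that build these headers by hand rather than through middleware, the relationship between the reset timestamp and Retry-After can be sketched as follows. The state object shape (`limit`, `remaining`, `resetEpochSeconds`) and the function name are assumptions for illustration, not part of express-rate-limit.

```javascript
// Sketch: derive X-RateLimit-* and Retry-After values from limiter state.
// resetEpochSeconds is the Unix time (in seconds) at which the window resets.
function rateLimitHeaders({ limit, remaining, resetEpochSeconds }, nowMs = Date.now()) {
  const headers = {
    'X-RateLimit-Limit': String(limit),
    'X-RateLimit-Remaining': String(Math.max(0, remaining)),
    'X-RateLimit-Reset': String(resetEpochSeconds),
  };
  if (remaining <= 0) {
    // Retry-After is relative: whole seconds from now until the reset
    const seconds = Math.max(0, Math.ceil(resetEpochSeconds - nowMs / 1000));
    headers['Retry-After'] = String(seconds);
  }
  return headers;
}
```

In an Express handler the result could be applied with `res.set(rateLimitHeaders(state))` before sending the 429 body.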

2. Use Sliding Window (not Fixed Window)

Fixed window (bad):

Window 1: 00:00-00:15 (100 requests)
Window 2: 00:15-00:30 (100 requests)

User makes 100 requests at 00:14:59
User makes 100 requests at 00:15:01
→ 200 requests in 2 seconds! (burst)

Sliding window (good):

Rate limit based on last 15 minutes from current time
→ Can't burst (limit enforced continuously)
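
One simple way to get sliding-window behavior is to keep a log of recent request timestamps per client and count only those inside the window. The sketch below is an in-memory illustration of that idea, not how any particular library implements it; the class name `SlidingWindowLimiter` is hypothetical.

```javascript
// Sketch: sliding-window log limiter (in-memory, single process).
class SlidingWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // key -> timestamps (ms) of allowed requests
  }

  // Returns true and records the request if allowed; false if over the limit.
  allow(key, nowMs = Date.now()) {
    const cutoff = nowMs - this.windowMs;
    // Keep only requests made within the last windowMs
    const recent = (this.hits.get(key) || []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.hits.set(key, recent);
    return allowed;
  }
}
```

With `new SlidingWindowLimiter(100, 15 * 60 * 1000)`, the 100 requests at 00:14:59 still count at 00:15:01, so the second burst from the example above is rejected. The trade-off is memory: this approach stores up to `limit` timestamps per key, and a multi-instance deployment would need a shared store instead of a local Map.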

3. Different Limits for Different Endpoints

// Expensive endpoint (lower limit); windowMs defaults to 1 minute when omitted
app.get('/api/analytics', rateLimit({ max: 10 }), handler);

// Cheap endpoint (higher limit)
app.get('/api/health', rateLimit({ max: 1000 }), handler);

External API Rate Limit Handling

Retry with Backoff

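const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callExternalAPI(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await fetch(url);

      // Check rate limit headers (many APIs do not send this header)
      const remaining = response.headers.get('X-RateLimit-Remaining');
      if (remaining !== null && Number(remaining) < 10) {
        console.warn('Approaching rate limit:', remaining);
      }

      if (response.status === 429) {
        // Rate limited: honor Retry-After when it is a number of seconds,
        // otherwise fall back to 60 seconds
        const retryAfter = Number(response.headers.get('Retry-After')) || 60;
        console.log(`Rate limited, retrying after ${retryAfter}s`);
        await sleep(retryAfter * 1000);
        continue;
      }

      return response.json();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff
    }
  }
  // Every attempt was rate limited; fail loudly instead of returning undefined
  throw new Error(`Rate limited after ${retries} attempts: ${url}`);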
}
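
Two refinements are common in retry logic like this. Retry-After may carry either a number of seconds or an HTTP date, and adding random jitter to the backoff keeps many clients from retrying at the same instant. The helpers below are a sketch of both; the names `parseRetryAfter` and `backoffDelayMs`, the 1 second base, and the 30 second cap are assumptions.

```javascript
// Sketch: parse a Retry-After header value into milliseconds to wait.
// Accepts delta-seconds ("60") or an HTTP date; returns null if unusable.
function parseRetryAfter(value, nowMs = Date.now()) {
  if (value == null || value === '') return null;
  const text = String(value).trim();
  if (/^\d+$/.test(text)) return Number(text) * 1000;
  const dateMs = Date.parse(text);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}

// Sketch: exponential backoff with "full jitter".
// The delay is a random value between 0 and min(capMs, baseMs * 2^attempt).
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

In a retry loop, a reasonable pattern is to wait `parseRetryAfter(header) ?? backoffDelayMs(attempt)`, so the server's hint wins when present and jittered backoff covers the rest.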

Escalation

Escalate to developer if:

Application rate limit logic needs changes
Need to implement caching

Escalate to infrastructure if:

nginx/API Gateway rate limit config needs changes
Need to scale up capacity

Escalate to external vendor if:

Hitting external API rate limit
Need higher quota

Related Runbooks

05-ddos-attack.md - If malicious traffic
../modules/backend-diagnostics.md - Backend troubleshooting

Post-Incident

After resolving:

Create post-mortem (if SEV1/SEV2)
Identify why rate limit hit
Adjust rate limits (if needed)
Add monitoring (alert before hitting limit)
Document rate limits (for users/API consumers)
Update this runbook if needed

Useful Commands Reference

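# Check 429 errors (nginx), busiest client first
awk '$9 == 429 {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Check rate limit config (nginx, including included files)
grep -rn "limit_req" /etc/nginx/

# Block IP (iptables; -I inserts ahead of existing ACCEPT rules)
iptables -I INPUT -s <IP> -j DROP

# Test rate limit (prints how many responses came back with each status code)
for i in {1..200}; do curl -s -o /dev/null -w "%{http_code}\n" http://localhost/api; done | sort | uniq -c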

# Check external API rate limit
curl -I https://api.example.com -H "Authorization: Bearer TOKEN"
# Look for X-RateLimit-* headers
