Playbook: Slow API Response

Symptoms

  • API response time >1 second (degraded)
  • API response time >5 seconds (critical)
  • Users reporting slow loading
  • Timeout errors (504 Gateway Timeout)
  • Monitoring alert: "p95 response time >1s"

Severity

  • SEV3 if response time 1-3 seconds
  • SEV2 if response time 3-5 seconds
  • SEV1 if response time >5 seconds or timeouts

Diagnosis

Step 1: Check Application Logs

# Find slow requests
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'

# Identify slow endpoint
awk '/duration/ {print $3, $5}' /var/log/application.log | sort -nk2 | tail -20

# Example output:
# /api/dashboard 8200ms  ← SLOW
# /api/users 50ms
# /api/posts 120ms

Step 2: Measure Response Time Breakdown

Total response time = Database + Application + Network

# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/endpoint

# curl-format.txt:
# time_namelookup:  %{time_namelookup}\n
# time_connect:  %{time_connect}\n
# time_starttransfer:  %{time_starttransfer}\n
# time_total:  %{time_total}\n

Example breakdown:

time_namelookup:    0.005s  (DNS)
time_connect:       0.010s  (TCP connect)
time_starttransfer: 8.200s  (Time to first byte) ← SLOW HERE
time_total:         8.250s

→ Problem is backend processing, not network

Step 3: Check Database Query Time

# Check application logs for query time
grep "query.*duration" /var/log/application.log

# Example:
# query: SELECT * FROM users... duration: 7800ms  ← SLOW

If database is slow → See database-diagnostics.md
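
If the application does not already emit per-query timings, a thin wrapper around the database client can produce log lines in the format grepped above. A minimal sketch, assuming a node-postgres (pg) Pool; the function name and log format are illustrative:

// Minimal sketch: wrap the database client so every query logs its duration
// in the same format as the grep above ("query: ... duration: 7800ms")
const { Pool } = require('pg');
const pool = new Pool();

async function timedQuery(sql, params) {
  const start = Date.now();
  try {
    return await pool.query(sql, params);
  } finally {
    console.log(`query: ${sql} duration: ${Date.now() - start}ms`);
  }
}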


Step 4: Check External API Calls

# Check logs for external API calls
grep "http.request" /var/log/application.log

# Example:
# http.request: GET https://api.external.com/data duration: 5000ms ← SLOW

Mitigation

Immediate (Now - 5 min)

Option A: Add Database Index (if DB is bottleneck)

-- Example: Missing index on last_login_at
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);

-- Impact: 7.8s → 50ms query time
-- Risk: Low (CONCURRENTLY = no table lock)

Option B: Enable Caching (if same data requested frequently)

// Add Redis cache (node-redis v4 client; await redis.connect() once at startup)
const redis = require('redis').createClient();

app.get('/api/dashboard', async (req, res) => {
  // Check cache first
  const cached = await redis.get('dashboard:' + req.user.id);
  if (cached) return res.json(JSON.parse(cached));

  // Generate data
  const data = await generateDashboard(req.user.id);

  // Cache for 5 minutes
  await redis.setEx('dashboard:' + req.user.id, 300, JSON.stringify(data));

  res.json(data);
});

// Impact: 8s → 10ms (cache hit)
// Risk: Low (data staleness acceptable for dashboard)

Option C: Optimize Query (if N+1 query)

// BAD: N+1 queries
const users = await db.query('SELECT * FROM users');
for (const user of users) {
  const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
  user.posts = posts;
}

// GOOD: Single query with JOIN
const users = await db.query(`
  SELECT users.*, posts.*
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);
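
One caveat: SELECT users.*, posts.* returns one flat row per user/post pair with colliding column names, so in practice you alias the columns and group rows back into users in application code. A minimal sketch (the aliases and the users/posts shape are illustrative):

// Group the flat JOIN rows back into users with a posts[] array.
// Explicit aliases avoid the users.* / posts.* column-name collision.
const rows = await db.query(`
  SELECT users.id AS user_id, users.name AS user_name,
         posts.id AS post_id, posts.title AS post_title
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);

const usersById = new Map();
for (const row of rows) {
  if (!usersById.has(row.user_id)) {
    usersById.set(row.user_id, { id: row.user_id, name: row.user_name, posts: [] });
  }
  if (row.post_id !== null) {
    usersById.get(row.user_id).posts.push({ id: row.post_id, title: row.post_title });
  }
}
const usersWithPosts = [...usersById.values()];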

Short-term (5 min - 1 hour)

Option A: Add Timeout (if external API is slow)

// Add a timeout to the external API call (AbortController; a plain
// `timeout` option is not supported by Node's built-in fetch)
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 2000); // 2 second timeout
let data;
try {
  const response = await fetch('https://api.external.com/data', {
    signal: controller.signal,
  });
  data = response.ok ? await response.json() : fallbackData;
} catch (err) {
  data = fallbackData; // timed out or failed → use fallback data
} finally {
  clearTimeout(timer);
}

// Impact: Prevents slow external API from blocking response
// Risk: Low (fallback data acceptable)

Option B: Async Processing (if computation is heavy)

// BAD: Synchronous heavy computation
app.post('/api/process', async (req, res) => {
  const result = await heavyComputation(req.body); // 10 seconds
  res.json(result);
});

// GOOD: Async processing with a job queue ('queue' = your job-queue client,
// e.g. BullMQ; exact method and field names vary by library)
app.post('/api/process', async (req, res) => {
  const job = await queue.add('process', req.body);
  res.status(202).json({ jobId: job.id, status: 'processing' });
});

// Client polls for result
app.get('/api/job/:id', async (req, res) => {
  const job = await queue.getJob(req.params.id);
  res.json({ status: job.status, result: job.result });
});

// Impact: API responds immediately (202 Accepted)
// Risk: Low (client needs to handle async pattern)
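
For completeness, a separate worker process has to consume the queue and run the computation; a minimal sketch assuming BullMQ (adapt the queue name and connection settings to whichever job library is actually in use):

// Worker process: picks up jobs enqueued by POST /api/process and runs the
// heavy computation outside the request/response path (BullMQ-style API)
const { Worker } = require('bullmq');

const worker = new Worker('process', async (job) => {
  // job.data is the request body that was enqueued
  return heavyComputation(job.data);
}, { connection: { host: 'localhost', port: 6379 } });

worker.on('failed', (job, err) => {
  console.error(`job ${job.id} failed:`, err.message);
});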

Option C: Pagination (if returning large dataset)

// BAD: Return all 10,000 records
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users); // Huge payload
});

// GOOD: Pagination
app.get('/api/users', async (req, res) => {
  const page = parseInt(req.query.page) || 1;
  const limit = 50;
  const offset = (page - 1) * limit;

  const users = await db.query('SELECT * FROM users LIMIT ? OFFSET ?', [limit, offset]);
  res.json({ data: users, page, limit });
});

// Impact: 8s → 200ms (smaller dataset)
// Risk: Low (clients usually want pagination anyway)

Long-term (1 hour+)

  • Add response time monitoring (p95, p99) (see the instrumentation sketch after this list)
  • Add APM (Application Performance Monitoring)
  • Optimize database queries (add indexes, reduce JOINs)
  • Add caching layer (Redis, Memcached)
  • Implement pagination for large datasets
  • Move heavy computation to background jobs
  • Add timeout for external APIs
  • Add E2E test: API response <1s
  • Review and optimize N+1 queries
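
For the monitoring items above, a minimal sketch of per-request latency instrumentation, assuming Express and the prom-client library (metric name, buckets, and the /metrics route are illustrative); p95/p99 panels and the "p95 >1s" alert are then built on this histogram in Prometheus/Grafana:

// Record request duration in a Prometheus histogram (prom-client),
// so p95/p99 dashboards and alerts can be built on top of it
const client = require('prom-client');
const express = require('express');
const app = express();

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds', // illustrative metric name
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status_code: res.statusCode,
    });
  });
  next();
});

// Expose metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});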

Common Root Causes

Symptom                        Root Cause               Solution
7.8s query time                Missing database index   CREATE INDEX
10,000 records returned        No pagination            Add LIMIT/OFFSET
50 queries for 1 request       N+1 query problem        Use JOIN or DataLoader
5s external API call           No timeout               Add timeout + fallback
Heavy computation              Sync processing          Async job queue
Same data fetched repeatedly   No caching               Add Redis cache

Escalation

Escalate to developer if:

  • Application code needs optimization
  • N+1 query problem

Escalate to DBA if:

  • Database performance issue
  • Need help with query optimization

Escalate to external team if:

  • External API consistently slow
  • Need to negotiate SLA


Post-Incident

After resolving:

  • Create post-mortem
  • Identify root cause (DB, external API, N+1, etc.)
  • Add performance test (response time <1s)
  • Add monitoring alert
  • Update this runbook if needed