Playbook: Slow API Response

Symptoms

  • API response time >1 second (degraded)
  • API response time >5 seconds (critical)
  • Users reporting slow loading
  • Timeout errors (504 Gateway Timeout)
  • Monitoring alert: "p95 response time >1s"

Severity

  • SEV3 if response time 1-3 seconds
  • SEV2 if response time 3-5 seconds
  • SEV1 if response time >5 seconds or timeouts

Diagnosis

Step 1: Check Application Logs

# Find slow requests
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'

# Identify slow endpoint
awk '/duration/ {print $3, $5}' /var/log/application.log | sort -nk2 | tail -20

# Example output:
# /api/dashboard 8200ms  ← SLOW
# /api/users 50ms
# /api/posts 120ms

Step 2: Measure Response Time Breakdown

Total response time = Database + Application + Network

# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/endpoint

# curl-format.txt:
# time_namelookup:  %{time_namelookup}\n
# time_connect:  %{time_connect}\n
# time_starttransfer:  %{time_starttransfer}\n
# time_total:  %{time_total}\n

Example breakdown:

time_namelookup:    0.005s  (DNS)
time_connect:       0.010s  (TCP connect)
time_starttransfer: 8.200s  (Time to first byte) ← SLOW HERE
time_total:         8.250s

→ Problem is backend processing, not network

Step 3: Check Database Query Time

# Check application logs for query time
grep "query.*duration" /var/log/application.log

# Example:
# query: SELECT * FROM users... duration: 7800ms  ← SLOW

If database is slow → See database-diagnostics.md
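
If the application does not already emit per-query timings, a thin wrapper around the database client can produce log lines in the format grepped above. A minimal sketch, assuming a node-postgres (pg) Pool; the function name and log format are illustrative:

// Minimal sketch: wrap the database client so every query logs its duration
// in the same format as the grep above ("query: ... duration: 7800ms")
const { Pool } = require('pg');
const pool = new Pool();

async function timedQuery(sql, params) {
  const start = Date.now();
  try {
    return await pool.query(sql, params);
  } finally {
    console.log(`query: ${sql} duration: ${Date.now() - start}ms`);
  }
}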


Step 4: Check External API Calls

# Check logs for external API calls
grep "http.request" /var/log/application.log

# Example:
# http.request: GET https://api.external.com/data duration: 5000ms ← SLOW

Mitigation

Immediate (Now - 5 min)

Option A: Add Database Index (if DB is bottleneck)

-- Example: Missing index on last_login_at
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);

-- Impact: 7.8s → 50ms query time
-- Risk: Low (CONCURRENTLY = no table lock)

Option B: Enable Caching (if same data requested frequently)

// Add Redis cache (node-redis v4 client; await redis.connect() once at startup)
const redis = require('redis').createClient();

app.get('/api/dashboard', async (req, res) => {
  // Check cache first
  const cached = await redis.get('dashboard:' + req.user.id);
  if (cached) return res.json(JSON.parse(cached));

  // Generate data
  const data = await generateDashboard(req.user.id);

  // Cache for 5 minutes
  await redis.setEx('dashboard:' + req.user.id, 300, JSON.stringify(data));

  res.json(data);
});

// Impact: 8s → 10ms (cache hit)
// Risk: Low (data staleness acceptable for dashboard)

Option C: Optimize Query (if N+1 query)

// BAD: N+1 queries
const users = await db.query('SELECT * FROM users');
for (const user of users) {
  const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
  user.posts = posts;
}

// GOOD: Single query with JOIN
const users = await db.query(`
  SELECT users.*, posts.*
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);
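
One caveat: SELECT users.*, posts.* returns one flat row per user/post pair with colliding column names, so in practice you alias the columns and group rows back into users in application code. A minimal sketch (the aliases and the users/posts shape are illustrative):

// Group the flat JOIN rows back into users with a posts[] array.
// Explicit aliases avoid the users.* / posts.* column-name collision.
const rows = await db.query(`
  SELECT users.id AS user_id, users.name AS user_name,
         posts.id AS post_id, posts.title AS post_title
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);

const usersById = new Map();
for (const row of rows) {
  if (!usersById.has(row.user_id)) {
    usersById.set(row.user_id, { id: row.user_id, name: row.user_name, posts: [] });
  }
  if (row.post_id !== null) {
    usersById.get(row.user_id).posts.push({ id: row.post_id, title: row.post_title });
  }
}
const usersWithPosts = [...usersById.values()];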

Short-term (5 min - 1 hour)

Option A: Add Timeout (if external API is slow)

// Add a timeout to the external API call (AbortController; a plain
// `timeout` option is not supported by Node's built-in fetch)
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 2000); // 2 second timeout
let data;
try {
  const response = await fetch('https://api.external.com/data', {
    signal: controller.signal,
  });
  data = response.ok ? await response.json() : fallbackData;
} catch (err) {
  data = fallbackData; // timed out or failed → use fallback data
} finally {
  clearTimeout(timer);
}

// Impact: Prevents slow external API from blocking response
// Risk: Low (fallback data acceptable)

Option B: Async Processing (if computation is heavy)

// BAD: Synchronous heavy computation
app.post('/api/process', async (req, res) => {
  const result = await heavyComputation(req.body); // 10 seconds
  res.json(result);
});

// GOOD: Async processing with a job queue ('queue' = your job-queue client,
// e.g. BullMQ; exact method and field names vary by library)
app.post('/api/process', async (req, res) => {
  const job = await queue.add('process', req.body);
  res.status(202).json({ jobId: job.id, status: 'processing' });
});

// Client polls for result
app.get('/api/job/:id', async (req, res) => {
  const job = await queue.getJob(req.params.id);
  res.json({ status: job.status, result: job.result });
});

// Impact: API responds immediately (202 Accepted)
// Risk: Low (client needs to handle async pattern)
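
For completeness, a separate worker process has to consume the queue and run the computation; a minimal sketch assuming BullMQ (adapt the queue name and connection settings to whichever job library is actually in use):

// Worker process: picks up jobs enqueued by POST /api/process and runs the
// heavy computation outside the request/response path (BullMQ-style API)
const { Worker } = require('bullmq');

const worker = new Worker('process', async (job) => {
  // job.data is the request body that was enqueued
  return heavyComputation(job.data);
}, { connection: { host: 'localhost', port: 6379 } });

worker.on('failed', (job, err) => {
  console.error(`job ${job.id} failed:`, err.message);
});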

Option C: Pagination (if returning large dataset)

// BAD: Return all 10,000 records
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users); // Huge payload
});

// GOOD: Pagination
app.get('/api/users', async (req, res) => {
  const page = parseInt(req.query.page) || 1;
  const limit = 50;
  const offset = (page - 1) * limit;

  const users = await db.query('SELECT * FROM users LIMIT ? OFFSET ?', [limit, offset]);
  res.json({ data: users, page, limit });
});

// Impact: 8s → 200ms (smaller dataset)
// Risk: Low (clients usually want pagination anyway)

Long-term (1 hour+)

  • Add response time monitoring (p95, p99) (see the instrumentation sketch after this list)
  • Add APM (Application Performance Monitoring)
  • Optimize database queries (add indexes, reduce JOINs)
  • Add caching layer (Redis, Memcached)
  • Implement pagination for large datasets
  • Move heavy computation to background jobs
  • Add timeout for external APIs
  • Add E2E test: API response <1s
  • Review and optimize N+1 queries
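
For the monitoring items above, a minimal sketch of per-request latency instrumentation, assuming Express and the prom-client library (metric name, buckets, and the /metrics route are illustrative); p95/p99 panels and the "p95 >1s" alert are then built on this histogram in Prometheus/Grafana:

// Record request duration in a Prometheus histogram (prom-client),
// so p95/p99 dashboards and alerts can be built on top of it
const client = require('prom-client');
const express = require('express');
const app = express();

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds', // illustrative metric name
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status_code: res.statusCode,
    });
  });
  next();
});

// Expose metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});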

Common Root Causes

Symptom                        Root Cause               Solution
7.8s query time                Missing database index   CREATE INDEX
10,000 records returned        No pagination            Add LIMIT/OFFSET
50 queries for 1 request       N+1 query problem        Use JOIN or DataLoader
5s external API call           No timeout               Add timeout + fallback
Heavy computation              Sync processing          Async job queue
Same data fetched repeatedly   No caching               Add Redis cache

Escalation

Escalate to developer if:

  • Application code needs optimization
  • N+1 query problem

Escalate to DBA if:

  • Database performance issue
  • Need help with query optimization

Escalate to external team if:

  • External API consistently slow
  • Need to negotiate SLA


Post-Incident

After resolving:

  • Create post-mortem
  • Identify root cause (DB, external API, N+1, etc.)
  • Add performance test (response time <1s)
  • Add monitoring alert
  • Update this runbook if needed