Playbook: Slow API Response
Symptoms
- API response time >1 second (degraded)
- API response time >5 seconds (critical)
- Users reporting slow loading
- Timeout errors (504 Gateway Timeout)
- Monitoring alert: "p95 response time >1s"
Severity
- SEV3 if response time 1-3 seconds
- SEV2 if response time 3-5 seconds
- SEV1 if response time >5 seconds or timeouts
Diagnosis
Step 1: Check Application Logs
# Find slow requests
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'
# Identify slow endpoint
awk '/duration/ {print $3, $5}' /var/log/application.log | sort -nk2 | tail -20
# Example output:
# /api/dashboard 8200ms ← SLOW
# /api/users 50ms
# /api/posts 120ms
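If the application emits structured JSON logs instead, the same check can be done with jq. A sketch; the path and duration_ms field names are assumptions and should be adapted to the real log schema:
# Assumes one JSON object per line with hypothetical "path" and "duration_ms" fields
jq -r 'select(.duration_ms > 1000) | "\(.path) \(.duration_ms)ms"' /var/log/application.log | sort -k2 -n | tail -20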
Step 2: Measure Response Time Breakdown
Total response time = Database + Application + Network
# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/endpoint
# curl-format.txt:
# time_namelookup: %{time_namelookup}\n
# time_connect: %{time_connect}\n
# time_starttransfer: %{time_starttransfer}\n
# time_total: %{time_total}\n
Example breakdown:
time_namelookup: 0.005s (DNS)
time_connect: 0.010s (TCP connect)
time_starttransfer: 8.200s (Time to first byte) ← SLOW HERE
time_total: 8.250s
→ Problem is backend processing, not network
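If curl-format.txt does not exist yet, it can be created in place (quoted heredoc keeps the \n literals that curl expects):
# Write the timing template used by curl -w "@curl-format.txt"
cat > curl-format.txt <<'EOF'
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n
EOF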
Step 3: Check Database Query Time
# Check application logs for query time
grep "query.*duration" /var/log/application.log
# Example:
# query: SELECT * FROM users... duration: 7800ms ← SLOW
If database is slow → See database-diagnostics.md
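To confirm whether a missing index is the cause, run the slow query through EXPLAIN ANALYZE. PostgreSQL shown; the connection details and query are illustrative:
# A sequential scan on a large table in the plan usually points to a missing index
psql -h db.example.com -U app -d appdb \
  -c "EXPLAIN ANALYZE SELECT * FROM users WHERE last_login_at > now() - interval '7 days';"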
Step 4: Check External API Calls
# Check logs for external API calls
grep "http.request" /var/log/application.log
# Example:
# http.request: GET https://api.external.com/data duration: 5000ms ← SLOW
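To confirm the external dependency is the slow part, time it directly (URL taken from the log line above):
# More than 1-2s here means the external API, not our backend, is the bottleneck
curl -o /dev/null -s -w "time_total: %{time_total}\n" https://api.external.com/data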
Mitigation
Immediate (Now - 5 min)
Option A: Add Database Index (if DB is bottleneck)
-- Example: Missing index on last_login_at
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);
-- Impact: 7.8s → 50ms query time
-- Risk: Low (CONCURRENTLY = no table lock)
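After running it, verify the index actually built: CREATE INDEX CONCURRENTLY can fail partway and leave an invalid index behind. Assuming psql access to the same database:
# indisvalid must be true before relying on the new index
psql -d appdb -c "SELECT c.relname, i.indisvalid FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid WHERE c.relname = 'idx_users_last_login_at';"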
Option B: Enable Caching (if same data requested frequently)
// Add Redis cache (node-redis v4: connect the client once at startup with await redis.connect())
const redis = require('redis').createClient();
app.get('/api/dashboard', async (req, res) => {
  // Check cache first
  const cached = await redis.get('dashboard:' + req.user.id);
  if (cached) return res.json(JSON.parse(cached));
  // Generate data
  const data = await generateDashboard(req.user.id);
  // Cache for 5 minutes
  await redis.setEx('dashboard:' + req.user.id, 300, JSON.stringify(data));
  res.json(data);
});
// Impact: 8s → 10ms (cache hit)
// Risk: Low (data staleness acceptable for dashboard)
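If dashboard data can change while cached, invalidate the key on write so the next request repopulates it. A minimal sketch assuming the same key scheme; createPost is an illustrative helper:
// Invalidate the cached dashboard whenever the underlying data changes
app.post('/api/posts', async (req, res) => {
  const post = await createPost(req.user.id, req.body); // assumed existing helper
  await redis.del('dashboard:' + req.user.id);          // next GET /api/dashboard rebuilds the cache
  res.status(201).json(post);
});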
Option C: Optimize Query (if N+1 query)
// BAD: N+1 queries
const users = await db.query('SELECT * FROM users');
for (const user of users) {
  const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
  user.posts = posts;
}
// GOOD: Single query with JOIN
const users = await db.query(`
  SELECT users.*, posts.*
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);
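When a JOIN is awkward, request-level batching with the dataloader package is an alternative (see the root-cause table below). A sketch; note that expanding an array into IN (?) depends on the DB driver (mysql2 does this, others need explicit placeholders):
const DataLoader = require('dataloader');
// Batch all post lookups made during one request into a single query
const postsByUser = new DataLoader(async (userIds) => {
  const posts = await db.query('SELECT * FROM posts WHERE user_id IN (?)', [userIds]);
  // DataLoader requires one result per key, in the same order as the keys
  return userIds.map((id) => posts.filter((p) => p.user_id === id));
});
const users = await db.query('SELECT * FROM users');
await Promise.all(users.map(async (user) => {
  user.posts = await postsByUser.load(user.id); // 2 queries total instead of N+1
}));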
Short-term (5 min - 1 hour)
Option A: Add Timeout (if external API is slow)
// Add timeout to external API call (AbortSignal.timeout needs Node 17.3+; a timeout throws, so catch it)
let data;
try {
  const response = await fetch('https://api.external.com/data', {
    signal: AbortSignal.timeout(2000), // abort after 2 seconds
  });
  data = response.ok ? await response.json() : fallbackData;
} catch (err) {
  // Timed out (or network error) → use fallback data
  data = fallbackData;
}
// Impact: Prevents slow external API from blocking response
// Risk: Low (fallback data acceptable)
Option B: Async Processing (if computation is heavy)
// BAD: Synchronous heavy computation
app.post('/api/process', async (req, res) => {
  const result = await heavyComputation(req.body); // 10 seconds
  res.json(result);
});
// GOOD: Async processing with job queue
app.post('/api/process', async (req, res) => {
  const jobId = await queue.add('process', req.body);
  res.status(202).json({ jobId, status: 'processing' });
});
// Client polls for result
app.get('/api/job/:id', async (req, res) => {
  const job = await queue.getJob(req.params.id);
  res.json({ status: job.status, result: job.result });
});
// Impact: API responds immediately (202 Accepted)
// Risk: Low (client needs to handle async pattern)
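The worker that consumes the queue runs outside the request path. A generic sketch assuming a Bull-style queue.process API; adapt to whatever queue library is actually in use:
// Worker process: picks jobs off the queue and stores the result for the polling endpoint
queue.process('process', async (job) => {
  const result = await heavyComputation(job.data); // same 10-second computation, now off the request path
  return result; // persisted by the queue so GET /api/job/:id can return it
});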
Option C: Pagination (if returning large dataset)
// BAD: Return all 10,000 records
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users); // Huge payload
});
// GOOD: Pagination
app.get('/api/users', async (req, res) => {
  const page = parseInt(req.query.page, 10) || 1;
  const limit = 50;
  const offset = (page - 1) * limit;
  const users = await db.query('SELECT * FROM users LIMIT ? OFFSET ?', [limit, offset]);
  res.json({ data: users, page, limit });
});
// Impact: 8s → 200ms (smaller dataset)
// Risk: Low (clients usually want pagination anyway)
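For very deep pages, OFFSET itself gets slow because the database still scans the skipped rows; keyset (cursor) pagination avoids that. A sketch assuming an indexed id column:
// Keyset pagination: "give me the next 50 rows after the last id I saw"
app.get('/api/users', async (req, res) => {
  const afterId = parseInt(req.query.after, 10) || 0;
  const users = await db.query('SELECT * FROM users WHERE id > ? ORDER BY id LIMIT 50', [afterId]);
  res.json({ data: users, nextCursor: users.length ? users[users.length - 1].id : null });
});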
Long-term (1 hour+)
- Add response time monitoring (p95, p99)
- Add APM (Application Performance Monitoring)
- Optimize database queries (add indexes, reduce JOINs)
- Add caching layer (Redis, Memcached)
- Implement pagination for large datasets
- Move heavy computation to background jobs
- Add timeout for external APIs
- Add E2E test: API response <1s (see the sketch after this list)
- Review and optimize N+1 queries
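A minimal version of that response-time check, runnable from CI or cron; the endpoint is illustrative:
# Fails (exit 1) if the dashboard endpoint takes 1s or longer
t=$(curl -o /dev/null -s -w '%{time_total}' http://api.example.com/api/dashboard)
awk -v t="$t" 'BEGIN { exit !(t < 1.0) }' || { echo "FAIL: response took ${t}s"; exit 1; }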
Common Root Causes
| Symptom | Root Cause | Solution |
|---|---|---|
| 7.8s query time | Missing database index | CREATE INDEX |
| 10,000 records returned | No pagination | Add LIMIT/OFFSET |
| 50 queries for 1 request | N+1 query problem | Use JOIN or DataLoader |
| 5s external API call | No timeout | Add timeout + fallback |
| Heavy computation | Sync processing | Async job queue |
| Same data fetched repeatedly | No caching | Add Redis cache |
Escalation
Escalate to developer if:
- Application code needs optimization
- N+1 query problem
Escalate to DBA if:
- Database performance issue
- Need help with query optimization
Escalate to external team if:
- External API consistently slow
- Need to negotiate SLA
Related Runbooks
- 02-database-deadlock.md - If database locked
- ../modules/database-diagnostics.md - Database troubleshooting
- ../modules/backend-diagnostics.md - Backend troubleshooting
Post-Incident
After resolving:
- Create post-mortem
- Identify root cause (DB, external API, N+1, etc.)
- Add performance test (response time <1s)
- Add monitoring alert
- Update this runbook if needed