gh-anton-abyzov-specweave-p…/agents/sre/playbooks/04-slow-api-response.md
2025-11-29 17:56:41 +08:00

# Playbook: Slow API Response
## Symptoms
- API response time >1 second (degraded)
- API response time >5 seconds (critical)
- Users reporting slow loading
- Timeout errors (504 Gateway Timeout)
- Monitoring alert: "p95 response time >1s"
## Severity
- **SEV3** if response time 1-3 seconds
- **SEV2** if response time 3-5 seconds
- **SEV1** if response time >5 seconds or timeouts
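The thresholds above can be encoded so that alerting and triage agree on the level. This is a minimal sketch: `classifySeverity` is a hypothetical helper, and because the listed ranges touch at 3 seconds, it assumes exactly 3 seconds counts as SEV2 and exactly 5 seconds still counts as SEV2.

```javascript
// Thresholds in milliseconds, taken from the Severity list above.
// classifySeverity is a hypothetical helper, not part of any library.
// Assumption: boundary values (3000 ms, 5000 ms) fall into SEV2.
function classifySeverity(responseTimeMs, timedOut = false) {
  if (timedOut || responseTimeMs > 5000) return 'SEV1';
  if (responseTimeMs >= 3000) return 'SEV2';
  if (responseTimeMs >= 1000) return 'SEV3';
  return null; // under 1 second: not an incident by this playbook
}

console.log(classifySeverity(8200)); // SEV1
console.log(classifySeverity(1200)); // SEV3
```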
## Diagnosis
### Step 1: Check Application Logs
```bash
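# Find slow requests (>1000 ms); adjust the field numbers to match your log format
# $5+0 forces a numeric comparison even when the field looks like "8200ms"
grep "duration" /var/log/application.log | awk '$5+0 > 1000'
# Identify the slowest endpoints (slowest first)
awk '/duration/ {print $3, $5}' /var/log/application.log | sort -k2,2nr | head -20
# Example output:
# /api/dashboard 8200ms ← SLOW
# /api/posts 120ms
# /api/users 50ms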
```
---
### Step 2: Measure Response Time Breakdown
**Total response time = Database + Application + Network**
```bash
# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/endpoint
# curl-format.txt:
# time_namelookup: %{time_namelookup}\n
# time_connect: %{time_connect}\n
# time_starttransfer: %{time_starttransfer}\n
# time_total: %{time_total}\n
```
**Example breakdown**:
```
time_namelookup: 0.005s (DNS)
time_connect: 0.010s (TCP connect)
time_starttransfer: 8.200s (Time to first byte) ← SLOW HERE
time_total: 8.250s
→ Problem is backend processing, not network
```
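The curl timings are cumulative, so each phase is the difference between two adjacent values. The sketch below works through the example numbers; `breakdownTimings` is a hypothetical helper. It assumes plain HTTP: over HTTPS the TLS handshake would be counted in `serverWait` unless curl's `time_appconnect` is also recorded.

```javascript
// Split cumulative curl timings (in seconds) into per-phase durations.
// breakdownTimings is a hypothetical helper written for this example.
function breakdownTimings({ namelookup, connect, starttransfer, total }) {
  const round = (x) => Math.round(x * 1000) / 1000; // round to whole milliseconds
  return {
    dns: round(namelookup),
    tcpConnect: round(connect - namelookup),
    serverWait: round(starttransfer - connect), // request sent until first byte
    transfer: round(total - starttransfer),
  };
}

const phases = breakdownTimings({
  namelookup: 0.005,
  connect: 0.010,
  starttransfer: 8.200,
  total: 8.250,
});
console.log(phases);
// { dns: 0.005, tcpConnect: 0.005, serverWait: 8.19, transfer: 0.05 }
```

Here 8.19 of the 8.25 seconds is spent waiting for the first byte, which points at backend processing rather than DNS, connection setup, or payload transfer.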
---
### Step 3: Check Database Query Time
```bash
# Check application logs for query time
grep "query.*duration" /var/log/application.log
# Example:
# query: SELECT * FROM users... duration: 7800ms ← SLOW
```
**If database is slow** → See [database-diagnostics.md](../modules/database-diagnostics.md)
---
### Step 4: Check External API Calls
```bash
# Check logs for external API calls
grep "http.request" /var/log/application.log
# Example:
# http.request: GET https://api.external.com/data duration: 5000ms ← SLOW
```
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Add Database Index** (if DB is bottleneck)
```sql
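-- Example: missing index on last_login_at (PostgreSQL syntax)
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);
-- Impact: 7.8s → 50ms query time
-- Risk: Low (CONCURRENTLY builds the index without blocking writes;
-- it cannot run inside a transaction block, and a failed build
-- leaves an invalid index that must be dropped and recreated)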
```
**Option B: Enable Caching** (if same data requested frequently)
```javascript
// Add Redis cache (node-redis v4+ API)
const { createClient } = require('redis');
const redis = createClient();
redis.connect().catch(console.error); // v4+ clients must connect before use
app.get('/api/dashboard', async (req, res) => {
  const key = 'dashboard:' + req.user.id;
  // Check cache first
  const cached = await redis.get(key);
  if (cached) return res.json(JSON.parse(cached));
  // Generate data
  const data = await generateDashboard(req.user.id);
  // Cache for 5 minutes
  await redis.set(key, JSON.stringify(data), { EX: 300 });
  res.json(data);
});
// Impact: 8s → 10ms (cache hit)
// Risk: Low (data staleness acceptable for dashboard)
```
**Option C: Optimize Query** (if N+1 query)
```javascript
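// BAD: N+1 queries (1 query for the users, then 1 more per user)
const users = await db.query('SELECT * FROM users');
for (const user of users) {
  const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
  user.posts = posts;
}
// GOOD: Single query with JOIN (returns one flat row per user/post pair)
// Columns are aliased because users.* and posts.* would both expose "id";
// the name and title columns are illustrative
const rows = await db.query(`
  SELECT users.id AS user_id, users.name AS user_name,
         posts.id AS post_id, posts.title AS post_title
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);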
```
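A JOIN returns one flat row per user/post pair, so the application still has to regroup those rows into users with nested posts. The sketch below shows one way to do that, assuming the query aliases its columns as `user_id`, `user_name`, `post_id`, and `post_title` (hypothetical names chosen for illustration); `groupPostsByUser` is likewise a hypothetical helper.

```javascript
// Group flat JOIN rows into one object per user with a nested posts array.
// Column names (user_id, user_name, post_id, post_title) are assumptions.
function groupPostsByUser(rows) {
  const usersById = new Map();
  for (const row of rows) {
    if (!usersById.has(row.user_id)) {
      usersById.set(row.user_id, { id: row.user_id, name: row.user_name, posts: [] });
    }
    // A LEFT JOIN yields NULL post columns for users without posts
    if (row.post_id !== null) {
      usersById.get(row.user_id).posts.push({ id: row.post_id, title: row.post_title });
    }
  }
  return [...usersById.values()];
}

const grouped = groupPostsByUser([
  { user_id: 1, user_name: 'ann', post_id: 10, post_title: 'first' },
  { user_id: 1, user_name: 'ann', post_id: 11, post_title: 'second' },
  { user_id: 2, user_name: 'bob', post_id: null, post_title: null },
]);
console.log(JSON.stringify(grouped));
```

The result holds two users: the first with two posts, the second with an empty `posts` array.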
---
### Short-term (5 min - 1 hour)
**Option A: Add Timeout** (if external API is slow)
```javascript
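// Add timeout to external API call
// (standard fetch has no `timeout` option; AbortSignal.timeout needs Node.js 17.3+)
let response;
try {
  response = await fetch('https://api.external.com/data', {
    signal: AbortSignal.timeout(2000), // abort after 2 seconds
  });
} catch (err) {
  // Timeout or network failure: use fallback data
  return fallbackData;
}
// Upstream error status: also use fallback data
if (!response.ok) {
  return fallbackData;
}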
// Impact: Prevents slow external API from blocking response
// Risk: Low (fallback data acceptable)
```
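Some client libraries do not accept an abort signal. For those, the same timeout-plus-fallback idea can be approximated by racing the call against a timer. This is a sketch only: `withTimeout` is a hypothetical helper, it does not cancel the underlying request, and it still rejects if the call itself fails before the timer fires.

```javascript
// Resolve with `fallback` if `promise` has not settled within `ms` milliseconds.
// withTimeout is a hypothetical helper; the slow call keeps running in the background.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage: a call that takes 200 ms, capped at 10 ms
const slowCall = new Promise((resolve) => setTimeout(() => resolve('live data'), 200));
withTimeout(slowCall, 10, 'fallback data').then((value) => console.log(value));
```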
**Option B: Async Processing** (if computation is heavy)
```javascript
// BAD: Synchronous heavy computation
app.post('/api/process', async (req, res) => {
  const result = await heavyComputation(req.body); // 10 seconds
  res.json(result);
});
// GOOD: Async processing with job queue (assumes a BullMQ-style queue API)
app.post('/api/process', async (req, res) => {
  const job = await queue.add('process', req.body);
  res.status(202).json({ jobId: job.id, status: 'processing' });
});
// Client polls for result
app.get('/api/job/:id', async (req, res) => {
  const job = await queue.getJob(req.params.id);
  if (!job) return res.status(404).json({ error: 'Job not found' });
  res.json({ status: await job.getState(), result: job.returnvalue });
});
// Impact: API responds immediately (202 Accepted)
// Risk: Low (client needs to handle async pattern)
```
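On the client side, the polling step might look like the sketch below. `pollJob` is a hypothetical helper, the status values `completed` and `failed` are assumptions about what the status endpoint returns, and `fetchStatus` is injected so the sketch does not depend on a particular HTTP client.

```javascript
// Poll a job-status function until the job finishes or attempts run out.
// pollJob and the 'completed' / 'failed' status values are assumptions.
async function pollJob(fetchStatus, { intervalMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await fetchStatus();
    if (job.status === 'completed') return job.result;
    if (job.status === 'failed') throw new Error('Job failed');
    // Still processing: wait before asking again
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Job did not finish in time');
}

// Usage with a stand-in status function that completes on the third call
let calls = 0;
const fakeStatus = async () =>
  ++calls < 3 ? { status: 'processing' } : { status: 'completed', result: 42 };
pollJob(fakeStatus, { intervalMs: 5 }).then((result) => console.log(result));
```

Capping the number of attempts keeps a stuck job from turning into an endless polling loop.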
**Option C: Pagination** (if returning large dataset)
```javascript
// BAD: Return all 10,000 records
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users); // Huge payload
});
// GOOD: Pagination
app.get('/api/users', async (req, res) => {
  const page = Math.max(1, parseInt(req.query.page, 10) || 1); // clamp to >= 1
  const limit = 50;
  const offset = (page - 1) * limit;
  // ORDER BY keeps page contents stable between requests
  const users = await db.query('SELECT * FROM users ORDER BY id LIMIT ? OFFSET ?', [limit, offset]);
  res.json({ data: users, page, limit });
});
// Impact: 8s → 200ms (smaller dataset)
// Risk: Low (clients usually want pagination anyway)
```
---
### Long-term (1 hour+)
- [ ] Add response time monitoring (p95, p99)
- [ ] Add APM (Application Performance Monitoring)
- [ ] Optimize database queries (add indexes, reduce JOINs)
- [ ] Add caching layer (Redis, Memcached)
- [ ] Implement pagination for large datasets
- [ ] Move heavy computation to background jobs
- [ ] Add timeout for external APIs
- [ ] Add E2E test: API response <1s
- [ ] Review and optimize N+1 queries
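For the p95/p99 monitoring item, it helps to see how a percentile is derived from raw durations. The sketch below uses the nearest-rank method, which is one of several common definitions; `percentile` is a hypothetical helper and the sample durations are made up for illustration.

```javascript
// Nearest-rank percentile over response times in milliseconds.
// percentile is a hypothetical helper; monitoring tools may interpolate instead.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort on a copy
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up sample: nine fast requests and one slow outlier
const durations = [50, 120, 80, 8200, 95, 110, 60, 70, 130, 90];
console.log(percentile(durations, 50)); // 90
console.log(percentile(durations, 95)); // 8200
```

The median looks healthy at 90 ms while p95 exposes the 8.2 second outlier, which is why the alert in this playbook is defined on p95 rather than on the average.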
---
## Common Root Causes
| Symptom | Root Cause | Solution |
|---------|------------|----------|
| 7.8s query time | Missing database index | CREATE INDEX |
| 10,000 records returned | No pagination | Add LIMIT/OFFSET |
| 50 queries for 1 request | N+1 query problem | Use JOIN or DataLoader |
| 5s external API call | No timeout | Add timeout + fallback |
| Heavy computation | Sync processing | Async job queue |
| Same data fetched repeatedly | No caching | Add Redis cache |
---
## Escalation
**Escalate to developer if**:
- Application code needs optimization
- N+1 query problem
**Escalate to DBA if**:
- Database performance issue
- Need help with query optimization
**Escalate to external team if**:
- External API consistently slow
- Need to negotiate SLA
---
## Related Runbooks
- [02-database-deadlock.md](02-database-deadlock.md) - If the database is locked
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting
---
## Post-Incident
After resolving:
- [ ] Create post-mortem
- [ ] Identify root cause (DB, external API, N+1, etc.)
- [ ] Add performance test (response time <1s)
- [ ] Add monitoring alert
- [ ] Update this runbook if needed