agents/sre/playbooks/04-slow-api-response.md
# Playbook: Slow API Response

## Symptoms

- API response time >1 second (degraded)
- API response time >5 seconds (critical)
- Users reporting slow loading
- Timeout errors (504 Gateway Timeout)
- Monitoring alert: "p95 response time >1s"

## Severity

- **SEV3** if response time 1-3 seconds
- **SEV2** if response time 3-5 seconds
- **SEV1** if response time >5 seconds or timeouts
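
The severity bands above can be encoded so that alert handlers and dashboards agree on them. The following is a minimal sketch, not part of any existing tooling: `classifySeverity` is a hypothetical name, and the treatment of the exact boundaries (3000 ms counts as SEV3, 5000 ms as SEV2) is an assumption, since the list does not say which band owns them.

```javascript
// Hypothetical helper: map a p95 latency in milliseconds to a severity band.
// Boundary handling (exactly 3000 ms -> SEV3, exactly 5000 ms -> SEV2) is assumed.
function classifySeverity(p95Ms, hasTimeouts = false) {
  if (hasTimeouts || p95Ms > 5000) return 'SEV1';
  if (p95Ms > 3000) return 'SEV2';
  if (p95Ms > 1000) return 'SEV3';
  return null; // below the ">1s" alerting threshold
}
```

For example, `classifySeverity(4000)` yields `'SEV2'`, and any latency combined with `hasTimeouts = true` yields `'SEV1'`.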

## Diagnosis

### Step 1: Check Application Logs

```bash
# Assumes whitespace-separated log lines with the endpoint in field 3 and the
# duration in milliseconds (e.g. "8200ms") in field 5; adjust to your log format.

# Find slow requests (over 1000 ms); "+0" forces a numeric comparison
awk '/duration/ && $5+0 > 1000' /var/log/application.log

# Identify the slowest endpoints (slowest first)
awk '/duration/ {print $3, $5}' /var/log/application.log | sort -k2,2nr | head -20

# Example output:
# /api/dashboard 8200ms ← SLOW
# /api/posts 120ms
# /api/users 50ms
```

---

### Step 2: Measure Response Time Breakdown

**Total response time = Database + Application + Network**

```bash
# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/endpoint

# curl-format.txt:
# time_namelookup: %{time_namelookup}s\n
# time_connect: %{time_connect}s\n
# time_starttransfer: %{time_starttransfer}s\n
# time_total: %{time_total}s\n
```

**Example breakdown**:
```
time_namelookup: 0.005s (DNS)
time_connect: 0.010s (TCP connect)
time_starttransfer: 8.200s (Time to first byte) ← SLOW HERE
time_total: 8.250s

→ Problem is backend processing, not network
```
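
curl reports these timings cumulatively from the start of the request, so the time spent in each phase is the difference between consecutive values. The sketch below shows that arithmetic; `breakdown` and its field names are hypothetical, and it takes the four curl values in seconds.

```javascript
// Split curl's cumulative timings (seconds) into per-phase durations (ms).
function breakdown({ namelookup, connect, starttransfer, total }) {
  const ms = (seconds) => Math.round(seconds * 1000);
  return {
    dnsMs: ms(namelookup),
    tcpConnectMs: ms(connect - namelookup),
    // Time between connection established and first byte; for HTTPS this
    // also includes the TLS handshake.
    serverWaitMs: ms(starttransfer - connect),
    transferMs: ms(total - starttransfer),
  };
}

const phases = breakdown({
  namelookup: 0.005,
  connect: 0.010,
  starttransfer: 8.200,
  total: 8.250,
});
console.log(phases);
```

With the example values, almost all of the 8250 ms falls in `serverWaitMs`, which matches the conclusion that the backend, not the network, is the bottleneck.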

---

### Step 3: Check Database Query Time

```bash
# Check application logs for query time
grep "query.*duration" /var/log/application.log

# Example:
# query: SELECT * FROM users... duration: 7800ms ← SLOW
```

**If database is slow** → See [database-diagnostics.md](../modules/database-diagnostics.md)

---

### Step 4: Check External API Calls

```bash
# Check logs for external API calls
grep "http.request" /var/log/application.log

# Example:
# http.request: GET https://api.external.com/data duration: 5000ms ← SLOW
```
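
When many external calls are logged, it can help to rank hosts by their worst observed latency. The sketch below assumes log lines shaped exactly like the example above (`http.request: METHOD URL duration: NNNNms`); the function name `slowestHosts` is hypothetical and the pattern will need adjusting for other log formats.

```javascript
// Assumed line shape, taken from the example above:
//   http.request: GET https://api.external.com/data duration: 5000ms
const HTTP_REQUEST_LINE = /http\.request: \S+ (https?:\/\/\S+) duration: (\d+)ms/;

// Return [host, worstDurationMs] pairs, slowest host first.
function slowestHosts(lines) {
  const worstByHost = new Map();
  for (const line of lines) {
    const match = HTTP_REQUEST_LINE.exec(line);
    if (!match) continue; // not an external-call line
    const host = new URL(match[1]).host;
    const durationMs = Number(match[2]);
    worstByHost.set(host, Math.max(durationMs, worstByHost.get(host) ?? 0));
  }
  return [...worstByHost.entries()].sort((a, b) => b[1] - a[1]);
}
```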

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Add Database Index** (if DB is bottleneck)
```sql
-- Example: Missing index on last_login_at (PostgreSQL syntax)
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);

-- Impact: 7.8s → 50ms query time
-- Risk: Low (CONCURRENTLY does not block writes, but the build is slower
-- and it cannot run inside a transaction block)
```

**Option B: Enable Caching** (if same data requested frequently)
```javascript
// Add Redis cache (node-redis v4+ API: explicit connect(), camelCase setEx)
const { createClient } = require('redis');
const redis = createClient();
redis.connect().catch(console.error);

app.get('/api/dashboard', async (req, res) => {
  const cacheKey = 'dashboard:' + req.user.id;

  // Check cache first
  const cached = await redis.get(cacheKey);
  if (cached) return res.json(JSON.parse(cached));

  // Generate data
  const data = await generateDashboard(req.user.id);

  // Cache for 5 minutes
  await redis.setEx(cacheKey, 300, JSON.stringify(data));

  res.json(data);
});

// Impact: 8s → 10ms (cache hit)
// Risk: Low (data staleness acceptable for dashboard)
```
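
The handler above follows the cache-aside pattern: read from the cache, fall back to computing the value, then store it with a TTL. The sketch below isolates that pattern so it can be tried without a Redis server. The in-memory `createTtlCache` is only a stand-in for Redis, and both function names are hypothetical.

```javascript
// In-memory stand-in for Redis, for illustration only.
// `now` is injectable so expiry can be exercised without waiting.
function createTtlCache(now = () => Date.now()) {
  const entries = new Map(); // key -> { value, expiresAt }
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (entry.expiresAt <= now()) {
        entries.delete(key); // expired: drop it and report a miss
        return undefined;
      }
      return entry.value;
    },
    set(key, value, ttlSeconds) {
      entries.set(key, { value, expiresAt: now() + ttlSeconds * 1000 });
    },
  };
}

// Cache-aside: return the cached value, or compute it, store it, and return it.
async function getOrCompute(cache, key, ttlSeconds, compute) {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const value = await compute();
  cache.set(key, value, ttlSeconds);
  return value;
}
```

A call such as `getOrCompute(cache, 'dashboard:' + userId, 300, () => generateDashboard(userId))` would then run the expensive function at most once per key per five minutes. Concurrent misses for the same key still compute twice in this sketch.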

**Option C: Optimize Query** (if N+1 query)
```javascript
// BAD: N+1 queries (one query for users, then one more per user)
const users = await db.query('SELECT * FROM users');
for (const user of users) {
  const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
  user.posts = posts;
}

// GOOD: Single query with JOIN
// Returns one flat row per (user, post) pair. Post columns are aliased so they
// do not collide with users.id (assumes posts has an id column); select
// further post columns as needed.
const rows = await db.query(`
  SELECT users.*, posts.id AS post_id
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);
```
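
A JOIN returns one flat row per user/post pair, so the rows usually have to be regrouped before they match the nested shape the N+1 loop produced. The sketch below shows one way to do that. The column names (`id`, `name`, `post_id`, `post_title`) are hypothetical and assume the post columns were aliased in the SELECT list.

```javascript
// Regroup flat JOIN rows into one object per user with a nested posts array.
// Column names are hypothetical; adapt them to the actual SELECT list.
function groupPostsByUser(rows) {
  const usersById = new Map();
  for (const { post_id, post_title, ...userColumns } of rows) {
    if (!usersById.has(userColumns.id)) {
      usersById.set(userColumns.id, { ...userColumns, posts: [] });
    }
    // A LEFT JOIN yields NULL post columns for users without posts.
    if (post_id !== null) {
      usersById.get(userColumns.id).posts.push({ id: post_id, title: post_title });
    }
  }
  return [...usersById.values()];
}
```

Users keep the order in which they first appear in the result set, so an `ORDER BY` in the query carries through to the grouped output.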

---

### Short-term (5 min - 1 hour)

**Option A: Add Timeout** (if external API is slow)
```javascript
// Add timeout to external API call.
// Standard fetch has no `timeout` option; abort via a signal instead
// (AbortSignal.timeout requires Node.js 17.3+).
// `fetchExternalData` is an illustrative wrapper name.
async function fetchExternalData() {
  try {
    const response = await fetch('https://api.external.com/data', {
      signal: AbortSignal.timeout(2000), // 2 second timeout
    });
    if (!response.ok) return fallbackData; // non-2xx: use fallback data
    return await response.json(); // assumes a JSON response body
  } catch (err) {
    // Timeout (or network error): use fallback data
    return fallbackData;
  }
}

// Impact: Prevents slow external API from blocking response
// Risk: Low (fallback data acceptable)
```
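
The same timeout-plus-fallback idea can be applied to any promise, not only HTTP calls, for example a slow SDK client that offers no timeout setting. The sketch below uses `Promise.race`; `withTimeout` is a hypothetical helper name. It does not cancel the underlying work, so it is best paired with an abort signal where the operation supports one.

```javascript
// Resolve with `fallback` if `promise` does not settle within `ms`.
// Rejections from `promise` still propagate to the caller.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Usage might look like `const data = await withTimeout(client.getData(), 2000, fallbackData);`, where `client.getData()` is whatever slow call needs bounding.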

**Option B: Async Processing** (if computation is heavy)
```javascript
// `queue` is a placeholder for your job-queue client; method names vary by library.

// BAD: Synchronous heavy computation
app.post('/api/process', async (req, res) => {
  const result = await heavyComputation(req.body); // 10 seconds
  res.json(result);
});

// GOOD: Async processing with job queue
app.post('/api/process', async (req, res) => {
  const jobId = await queue.add('process', req.body);
  res.status(202).json({ jobId, status: 'processing' });
});

// Client polls for result
app.get('/api/job/:id', async (req, res) => {
  const job = await queue.getJob(req.params.id);
  if (!job) return res.status(404).json({ error: 'job not found' });
  res.json({ status: job.status, result: job.result });
});

// Impact: API responds immediately (202 Accepted)
// Risk: Low (client needs to handle async pattern)
```
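
To make the contract of the `queue` object concrete, here is a minimal in-memory sketch exposing the same two methods used above, `add` and `getJob`. It is an illustration only: `createQueue` is a hypothetical name, jobs are lost on restart, and a production setup would use a persistent queue with separate workers.

```javascript
const { randomUUID } = require('node:crypto');

// Minimal in-memory job queue (illustrative; not persistent, single process).
// `handlers` maps a job name to an async function that processes the payload.
function createQueue(handlers) {
  const jobs = new Map();
  return {
    async add(name, payload) {
      const id = randomUUID();
      const job = { id, status: 'processing', result: null, error: null };
      jobs.set(id, job);
      // Start the work in the background; it is deliberately not awaited,
      // so the caller gets the job id back immediately.
      Promise.resolve()
        .then(() => handlers[name](payload))
        .then((result) => { job.status = 'completed'; job.result = result; })
        .catch((err) => { job.status = 'failed'; job.error = String(err); });
      return id;
    },
    async getJob(id) {
      return jobs.get(id) ?? null;
    },
  };
}
```

A client would poll `getJob` until `status` leaves `'processing'`, then read `result` or `error`.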

**Option C: Pagination** (if returning large dataset)
```javascript
// BAD: Return all 10,000 records
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users); // Huge payload
});

// GOOD: Pagination
app.get('/api/users', async (req, res) => {
  // Clamp to page 1 so a negative or non-numeric page never yields a negative OFFSET
  const page = Math.max(1, parseInt(req.query.page, 10) || 1);
  const limit = 50;
  const offset = (page - 1) * limit;

  // ORDER BY keeps page contents stable between requests
  const users = await db.query('SELECT * FROM users ORDER BY id LIMIT ? OFFSET ?', [limit, offset]);
  res.json({ data: users, page, limit });
});

// Impact: 8s → 200ms (smaller dataset)
// Risk: Low (clients usually want pagination anyway)
```
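
If clients are also allowed to choose the page size, the limit needs an upper bound, otherwise `?limit=100000` brings back the original problem. The sketch below validates both parameters in one place. The `limit` query parameter, the `maxLimit` of 200, and the name `parsePaging` are assumptions for illustration.

```javascript
// Parse and clamp paging parameters from a query-string object.
// Defaults and the maximum page size are illustrative assumptions.
function parsePaging(query, { defaultLimit = 50, maxLimit = 200 } = {}) {
  const toInt = (value, fallback) => {
    const n = Number.parseInt(value, 10);
    return Number.isNaN(n) ? fallback : n;
  };
  const page = Math.max(1, toInt(query.page, 1));
  const limit = Math.min(maxLimit, Math.max(1, toInt(query.limit, defaultLimit)));
  return { page, limit, offset: (page - 1) * limit };
}
```

In the handler, `const { page, limit, offset } = parsePaging(req.query);` would replace the three separate assignments.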

---

### Long-term (1 hour+)

- [ ] Add response time monitoring (p95, p99)
- [ ] Add APM (Application Performance Monitoring)
- [ ] Optimize database queries (add indexes, reduce JOINs)
- [ ] Add caching layer (Redis, Memcached)
- [ ] Implement pagination for large datasets
- [ ] Move heavy computation to background jobs
- [ ] Add timeout for external APIs
- [ ] Add E2E test: API response <1s
- [ ] Review and optimize N+1 queries
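
For the p95/p99 monitoring and the response-time test in the checklist above, percentiles can be computed directly from recorded durations. The sketch below uses the nearest-rank definition, which is one of several in common use. Monitoring systems often interpolate or work from histograms instead, so their figures may differ slightly. The function name `percentile` is hypothetical.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs, p) {
  if (samplesMs.length === 0) throw new RangeError('no samples');
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const durationsMs = [50, 120, 8200, 90, 60];
console.log(percentile(durationsMs, 50), percentile(durationsMs, 95));
```

A performance test could then assert `percentile(durationsMs, 95) < 1000` to enforce the "response <1s" target.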

---

## Common Root Causes

| Symptom | Root Cause | Solution |
|---------|------------|----------|
| 7.8s query time | Missing database index | CREATE INDEX |
| 10,000 records returned | No pagination | Add LIMIT/OFFSET |
| 50 queries for 1 request | N+1 query problem | Use JOIN or DataLoader |
| 5s external API call | No timeout | Add timeout + fallback |
| Heavy computation | Sync processing | Async job queue |
| Same data fetched repeatedly | No caching | Add Redis cache |

---

## Escalation

**Escalate to developer if**:
- Application code needs optimization
- N+1 query problem

**Escalate to DBA if**:
- Database performance issue
- Need help with query optimization

**Escalate to external team if**:
- External API consistently slow
- Need to negotiate SLA

---

## Related Runbooks

- [02-database-deadlock.md](02-database-deadlock.md) - If database locked
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem
- [ ] Identify root cause (DB, external API, N+1, etc.)
- [ ] Add performance test (response time <1s)
- [ ] Add monitoring alert
- [ ] Update this runbook if needed