# Playbook: Slow API Response

## Symptoms

- API response time >1 second (degraded)
- API response time >5 seconds (critical)
- Users reporting slow loading
- Timeout errors (504 Gateway Timeout)
- Monitoring alert: "p95 response time >1s"

## Severity

- **SEV3** if response time 1-3 seconds
- **SEV2** if response time 3-5 seconds
- **SEV1** if response time >5 seconds or timeouts
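
The thresholds above can be encoded directly, for example in an alert router. This is a minimal sketch: the function name `classifySeverity` and the `timedOut` option are hypothetical, and because the ranges above share their edge values, it assumes that a value exactly on a boundary (3 s or 5 s) falls into the lower severity.

```javascript
// Map a p95 response time (milliseconds) to the severity levels listed above.
// Assumption: exactly 3000 ms counts as SEV3 and exactly 5000 ms as SEV2.
function classifySeverity(p95Ms, { timedOut = false } = {}) {
  if (timedOut || p95Ms > 5000) return 'SEV1';
  if (p95Ms > 3000) return 'SEV2';
  if (p95Ms >= 1000) return 'SEV3';
  return null; // under 1 second: not an incident by this playbook
}

console.log(classifySeverity(8200)); // SEV1
console.log(classifySeverity(4000)); // SEV2
console.log(classifySeverity(1500)); // SEV3
```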

## Diagnosis

### Step 1: Check Application Logs

```bash
# Find slow requests (>1000 ms); "+ 0" forces awk to compare "8200ms" as a number
grep "duration" /var/log/application.log | awk '$5 + 0 > 1000'

# Identify slowest endpoints (slowest first)
awk '/duration/ {print $3, $5}' /var/log/application.log | sort -k2,2nr | head -20

# Example output:
# /api/dashboard 8200ms  ← SLOW
# /api/users 50ms
# /api/posts 120ms
```
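
Because the monitoring alert is defined on p95, it can help to compute the same percentile per endpoint from the log rather than looking only at individual slow requests. The sketch below assumes the same log layout as the `awk` commands above (endpoint in field 3, duration such as `8200ms` in field 5); the function names and the sample lines are hypothetical.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Group durations by endpoint and return the p95 (in ms) for each one.
// Assumed line layout: field 3 = endpoint, field 5 = duration like "8200ms".
function p95ByEndpoint(lines) {
  const byEndpoint = new Map();
  for (const line of lines) {
    if (!line.includes('duration')) continue;
    const fields = line.trim().split(/\s+/);
    const ms = parseInt(fields[4], 10); // "8200ms" -> 8200
    if (Number.isNaN(ms)) continue;
    const endpoint = fields[2];
    if (!byEndpoint.has(endpoint)) byEndpoint.set(endpoint, []);
    byEndpoint.get(endpoint).push(ms);
  }
  const result = {};
  for (const [endpoint, durations] of byEndpoint) {
    result[endpoint] = percentile(durations, 95);
  }
  return result;
}
```

In practice the lines would come from reading `/var/log/application.log` and splitting on newlines.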

---

### Step 2: Measure Response Time Breakdown

**Total response time = Database + Application + Network**

```bash
# Use curl with timing
curl -w "@curl-format.txt" -o /dev/null -s http://api.example.com/endpoint

# curl-format.txt:
# time_namelookup:  %{time_namelookup}\n
# time_connect:  %{time_connect}\n
# time_starttransfer:  %{time_starttransfer}\n
# time_total:  %{time_total}\n
```

**Example breakdown**:
```
time_namelookup:    0.005s  (DNS)
time_connect:       0.010s  (TCP connect)
time_starttransfer: 8.200s  (Time to first byte) ← SLOW HERE
time_total:         8.250s

→ Problem is backend processing, not network
```
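
curl's timers are cumulative (each one is measured from the start of the request), so the per-phase cost is the difference between neighbouring timers. The sketch below does that arithmetic for the example values; the function and field names are hypothetical, and it assumes plain HTTP (with HTTPS, the TLS handshake reported by `time_appconnect` sits between TCP connect and the server wait).

```javascript
// Split curl's cumulative timers (seconds) into per-phase durations (milliseconds).
function breakdown({ namelookup, connect, starttransfer, total }) {
  const ms = (seconds) => Math.round(seconds * 1000);
  return {
    dnsMs: ms(namelookup),
    tcpConnectMs: ms(connect - namelookup),
    serverWaitMs: ms(starttransfer - connect), // request sent until first byte arrives
    transferMs: ms(total - starttransfer),
  };
}

console.log(breakdown({ namelookup: 0.005, connect: 0.010, starttransfer: 8.2, total: 8.25 }));
// { dnsMs: 5, tcpConnectMs: 5, serverWaitMs: 8190, transferMs: 50 }
```

Here almost all of the 8.25 s is spent waiting for the first byte, which points at backend processing.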

---

### Step 3: Check Database Query Time

```bash
# Check application logs for query time
grep "query.*duration" /var/log/application.log

# Example:
# query: SELECT * FROM users... duration: 7800ms  ← SLOW
```

**If database is slow** → See [database-diagnostics.md](../modules/database-diagnostics.md)

---

### Step 4: Check External API Calls

```bash
# Check logs for external API calls
grep "http.request" /var/log/application.log

# Example:
# http.request: GET https://api.external.com/data duration: 5000ms ← SLOW
```
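
The `grep` commands in Steps 3 and 4 only work if the application logs a duration for each query and outbound call. If it does not, a small wrapper can add that. This is a sketch: the helper name `timed` and its options are hypothetical, and the log shape simply mirrors the example lines above.

```javascript
// Run an async call and log how long it took as "<label> duration: <N>ms".
// `log` and `now` are injectable so the helper can be tested without real I/O.
async function timed(label, fn, { log = console.log, now = Date.now } = {}) {
  const start = now();
  try {
    return await fn();
  } finally {
    // Runs on success and on failure, so slow failing calls are logged too.
    log(`${label} duration: ${now() - start}ms`);
  }
}

// Usage (hypothetical call site):
// const res = await timed('http.request: GET https://api.external.com/data',
//   () => fetch('https://api.external.com/data'));
```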

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Add Database Index** (if DB is bottleneck)
```sql
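-- Example: Missing index on last_login_at (PostgreSQL syntax)
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);

-- Impact: 7.8s → 50ms query time
-- Risk: Low (CONCURRENTLY does not block writes, but the build is slower,
-- cannot run inside a transaction, and leaves an INVALID index behind if it fails)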
```

**Option B: Enable Caching** (if same data requested frequently)
```javascript
// Add Redis cache (node-redis v4+ API)
const redis = require('redis').createClient();
redis.on('error', (err) => console.error('Redis error', err));
redis.connect().catch(console.error); // v4+ requires an explicit connect

app.get('/api/dashboard', async (req, res) => {
  const key = 'dashboard:' + req.user.id;

  // Check cache first
  const cached = await redis.get(key);
  if (cached) return res.json(JSON.parse(cached));

  // Generate data
  const data = await generateDashboard(req.user.id);

  // Cache for 5 minutes
  await redis.setEx(key, 300, JSON.stringify(data));

  res.json(data);
});

// Impact: 8s → 10ms (cache hit)
// Risk: Low (data staleness acceptable for dashboard)
```

**Option C: Optimize Query** (if N+1 query)
```javascript
// BAD: N+1 queries (1 query for users + 1 query per user)
const users = await db.query('SELECT * FROM users');
for (const user of users) {
  const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
  user.posts = posts;
}

// GOOD: Single query with JOIN
// Note: users.* and posts.* may share column names (e.g. id), so alias columns
// in real code, then group the flat rows by user.
const rows = await db.query(`
  SELECT users.*, posts.*
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);
```
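
A JOIN returns one flat row per user/post pair, so the rows still have to be folded back into the nested shape the N+1 loop produced. A minimal sketch of that step, assuming the query aliases its columns as `user_id`, `user_name`, `post_id`, and `post_title` (these aliases and the function name are hypothetical):

```javascript
// Fold flat JOIN rows into one object per user with a nested posts array.
// With a LEFT JOIN, post_* columns are null for users who have no posts.
function groupPostsByUser(rows) {
  const users = new Map();
  for (const row of rows) {
    if (!users.has(row.user_id)) {
      users.set(row.user_id, { id: row.user_id, name: row.user_name, posts: [] });
    }
    if (row.post_id !== null) {
      users.get(row.user_id).posts.push({ id: row.post_id, title: row.post_title });
    }
  }
  return [...users.values()];
}
```

A `Map` keeps users in first-seen order, so an `ORDER BY` in the query carries through to the result.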

---

### Short-term (5 min - 1 hour)

**Option A: Add Timeout** (if external API is slow)
```javascript
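// Add timeout to external API call
// (Node 18+: built-in fetch has no `timeout` option; use AbortSignal.timeout)
async function fetchExternalData() {
  try {
    const response = await fetch('https://api.external.com/data', {
      signal: AbortSignal.timeout(2000), // abort after 2 seconds
    });
    // Non-2xx response: use fallback data
    if (!response.ok) return fallbackData;
    return await response.json();
  } catch (err) {
    // Timeout or network failure: use fallback data
    return fallbackData;
  }
}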

// Impact: Prevents slow external API from blocking response
// Risk: Low (fallback data acceptable)
```
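
Some client libraries accept neither a timeout option nor an abort signal. In that case a generic wrapper based on `Promise.race` can enforce the same deadline. This is a sketch, not a drop-in: the helper name `withTimeout` is hypothetical, it treats rejections like timeouts (both yield the fallback), and it only stops waiting rather than cancelling the underlying call.

```javascript
// Resolve with `fallback` if `promise` rejects or does not settle within `ms` milliseconds.
function withTimeout(promise, ms, fallback) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise.catch(() => fallback), deadline])
    .finally(() => clearTimeout(timer)); // do not leave the timer pending
}

// Usage (hypothetical client):
// const data = await withTimeout(legacyClient.getData(), 2000, fallbackData);
```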

**Option B: Async Processing** (if computation is heavy)
```javascript
// BAD: Synchronous heavy computation
app.post('/api/process', async (req, res) => {
  const result = await heavyComputation(req.body); // 10 seconds
  res.json(result);
});

// GOOD: Async processing with job queue
app.post('/api/process', async (req, res) => {
  const jobId = await queue.add('process', req.body);
  res.status(202).json({ jobId, status: 'processing' });
});

// Client polls for result
app.get('/api/job/:id', async (req, res) => {
  const job = await queue.getJob(req.params.id);
  if (!job) return res.status(404).json({ error: 'Job not found' });
  res.json({ status: job.status, result: job.result });
});

// Impact: API responds immediately (202 Accepted)
// Risk: Low (client needs to handle async pattern)
```
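
The client side of this pattern is a polling loop with a bounded number of attempts. A minimal sketch, assuming the status endpoint keeps reporting `'processing'` until the job is done; the function name `pollJob`, its options, and any status value other than `'processing'` are hypothetical.

```javascript
// Poll `getStatus` until the job leaves the 'processing' state or attempts run out.
async function pollJob(getStatus, { intervalMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await getStatus();
    if (job.status !== 'processing') return job;
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Job still processing after ${maxAttempts} attempts`);
}

// Usage (hypothetical URL shape):
// const job = await pollJob(() => fetch(`/api/job/${jobId}`).then((r) => r.json()));
```

Capping the attempts keeps a stuck job from turning into an endless stream of polling requests.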

**Option C: Pagination** (if returning large dataset)
```javascript
// BAD: Return all 10,000 records
app.get('/api/users', async (req, res) => {
  const users = await db.query('SELECT * FROM users');
  res.json(users); // Huge payload
});

// GOOD: Pagination
app.get('/api/users', async (req, res) => {
  const page = Math.max(1, parseInt(req.query.page, 10) || 1); // clamp 0 and negatives to 1
  const limit = 50;
  const offset = (page - 1) * limit;

  // ORDER BY keeps pages stable; without it, row order is not guaranteed
  const users = await db.query('SELECT * FROM users ORDER BY id LIMIT ? OFFSET ?', [limit, offset]);
  res.json({ data: users, page, limit });
});

// Impact: 8s → 200ms (smaller dataset)
// Risk: Low (clients usually want pagination anyway)
```
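
If clients are allowed to choose the page size, the parameters should be validated before they reach the query. The sketch below is one way to do that; the `limit` query parameter, the `maxLimit` cap of 200, and the function name are assumptions that go beyond the fixed page size of 50 used above.

```javascript
// Turn untrusted query parameters into safe LIMIT/OFFSET values.
function parsePagination(query, { defaultLimit = 50, maxLimit = 200 } = {}) {
  // Missing, non-numeric, zero, or negative pages all fall back to page 1.
  const page = Math.max(1, Number.parseInt(query.page, 10) || 1);
  const requested = Number.parseInt(query.limit, 10) || defaultLimit;
  // Clamp the page size so one request cannot ask for the whole table.
  const limit = Math.min(Math.max(1, requested), maxLimit);
  return { page, limit, offset: (page - 1) * limit };
}

console.log(parsePagination({ page: '3' })); // { page: 3, limit: 50, offset: 100 }
```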

---

### Long-term (1 hour+)

- [ ] Add response time monitoring (p95, p99)
- [ ] Add APM (Application Performance Monitoring)
- [ ] Optimize database queries (add indexes, remove unnecessary JOINs)
- [ ] Add caching layer (Redis, Memcached)
- [ ] Implement pagination for large datasets
- [ ] Move heavy computation to background jobs
- [ ] Add timeout for external APIs
- [ ] Add E2E test: API response <1s
- [ ] Review and optimize N+1 queries
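
The response-time test in the checklist can start as a simple timing assertion. A sketch under stated assumptions: the helper name `assertWithinBudget` is hypothetical, and a single measurement is noisy, so a real test would likely take several samples and compare a percentile.

```javascript
// Run an async call and throw if it takes longer than `budgetMs` milliseconds.
async function assertWithinBudget(fn, budgetMs) {
  const start = performance.now();
  await fn();
  const elapsedMs = performance.now() - start;
  if (elapsedMs > budgetMs) {
    throw new Error(`Took ${elapsedMs.toFixed(0)}ms, budget is ${budgetMs}ms`);
  }
  return elapsedMs;
}

// Usage (hypothetical endpoint, 1 second budget):
// await assertWithinBudget(() => fetch('http://api.example.com/endpoint'), 1000);
```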

---

## Common Root Causes

| Symptom | Root Cause | Solution |
|---------|------------|----------|
| 7.8s query time | Missing database index | CREATE INDEX |
| 10,000 records returned | No pagination | Add LIMIT/OFFSET |
| 50 queries for 1 request | N+1 query problem | Use JOIN or DataLoader |
| 5s external API call | No timeout | Add timeout + fallback |
| Heavy computation | Sync processing | Async job queue |
| Same data fetched repeatedly | No caching | Add Redis cache |

---

## Escalation

**Escalate to developer if**:
- Application code needs optimization
- N+1 query problem

**Escalate to DBA if**:
- Database performance issue
- Need help with query optimization

**Escalate to external team if**:
- External API consistently slow
- Need to negotiate SLA

---

## Related Runbooks

- [02-database-deadlock.md](02-database-deadlock.md) - If database locked
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem
- [ ] Identify root cause (DB, external API, N+1, etc.)
- [ ] Add performance test (response time <1s)
- [ ] Add monitoring alert
- [ ] Update this runbook if needed