414 lines
10 KiB
Markdown
414 lines
10 KiB
Markdown
# Performance Degradation Analysis
|
||
|
||
Investigating API response time increase from 200ms to 2000ms, resolved through N+1 query elimination, caching, and index optimization.
|
||
|
||
## Overview
|
||
|
||
**Incident**: API response times degraded 10x (200ms → 2000ms)
|
||
**Impact**: User-facing slowness, timeout errors, poor UX
|
||
**Root Cause**: N+1 query problem + missing indexes + no caching
|
||
**Resolution**: Query optimization + indexes + Redis caching
|
||
**Status**: Resolved
|
||
|
||
## Incident Timeline
|
||
|
||
| Time | Event | Action |
|
||
|------|-------|--------|
|
||
| 08:00 | Slowness reports from users | Support tickets opened |
|
||
| 08:15 | Monitoring confirms degradation | p95 latency 2000ms |
|
||
| 08:30 | Database profiling started | Slow query log analysis |
|
||
| 09:00 | N+1 query identified | Found 100+ queries per request |
|
||
| 09:30 | Fix implemented | Eager loading + indexes |
|
||
| 10:00 | Caching added | Redis for frequently accessed data |
|
||
| 10:30 | Deployment complete | Latency back to 200ms |
|
||
|
||
---
|
||
|
||
## Symptoms and Detection
|
||
|
||
### Initial Metrics
|
||
|
||
**Latency Increase**:
|
||
```
|
||
p50: 180ms → 1800ms (+900% slower)
|
||
p95: 220ms → 2100ms (+854% slower)
|
||
p99: 450ms → 3500ms (+677% slower)
|
||
|
||
Requests timing out: 5% (>3s timeout)
|
||
```
|
||
|
||
**User Impact**:
|
||
- Page load times: 5-10 seconds
|
||
- API timeouts: 5% of requests
|
||
- Support tickets: 47 in 1 hour
|
||
- User complaints: "App is unusable"
|
||
|
||
---
|
||
|
||
## Diagnosis
|
||
|
||
### Step 1: Application Performance Monitoring
|
||
|
||
**Wrangler Tail Analysis**:
|
||
```bash
|
||
# Monitor worker requests in real-time
|
||
wrangler tail --format pretty
|
||
|
||
# Output shows slow requests:
|
||
[2024-12-05 08:20:15] GET /api/orders - 2145ms
|
||
└─ database_query: 1950ms (90% of total time!)
|
||
└─ json_serialization: 150ms
|
||
└─ response_headers: 45ms
|
||
|
||
# Red flag: Database taking 90% of request time
|
||
```
|
||
|
||
---
|
||
|
||
### Step 2: Database Query Analysis
|
||
|
||
**PlanetScale Slow Query Log**:
|
||
```bash
|
||
# Enable and check slow queries
|
||
pscale database insights greyhaven-db main --slow-queries
|
||
|
||
# Results:
|
||
Query: SELECT * FROM order_items WHERE order_id = ?
|
||
Calls: 157 times per request # ❌ N+1 query problem!
|
||
Avg time: 12ms per query
|
||
Total: 1884ms per request (12ms × 157)
|
||
```
|
||
|
||
**N+1 Query Pattern Identified**:
|
||
```python
|
||
# api/orders.py (BEFORE - N+1 Problem)
|
||
@router.get("/orders/{user_id}")
|
||
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
|
||
# Query 1: Get all orders for user
|
||
orders = session.exec(
|
||
select(Order).where(Order.user_id == user_id)
|
||
).all() # Returns 157 orders
|
||
|
||
# Query 2-158: Get items for EACH order (N+1!)
|
||
for order in orders:
|
||
order.items = session.exec(
|
||
select(OrderItem).where(OrderItem.order_id == order.id)
|
||
).all() # 157 additional queries!
|
||
|
||
return orders
|
||
|
||
# Total queries: 1 + 157 = 158 queries per request
|
||
# Total time: 10ms + (157 × 12ms) = 1894ms
|
||
```
|
||
|
||
---
|
||
|
||
### Step 3: Database Index Analysis
|
||
|
||
**Missing Indexes**:
|
||
```sql
|
||
-- Check existing indexes
|
||
SELECT indexname, indexdef
|
||
FROM pg_indexes
|
||
WHERE tablename = 'order_items';
|
||
|
||
-- Results:
|
||
-- Primary key on id (exists) ✅
|
||
-- NO index on order_id ❌ (needed for WHERE clause)
|
||
-- NO index on user_id ❌ (needed for joins)
|
||
|
||
-- Explain plan shows full table scan
|
||
EXPLAIN ANALYZE
|
||
SELECT * FROM order_items WHERE order_id = 123;
|
||
|
||
-- Result:
|
||
Seq Scan on order_items (cost=0.00..1500.00 rows=1 width=100) (actual time=12.345..12.345 rows=5 loops=157)
|
||
Filter: (order_id = 123)
|
||
Rows Removed by Filter: 10000
|
||
|
||
-- Full table scan on 10K rows, 157 times = extremely slow!
|
||
```
|
||
|
||
---
|
||
|
||
## Resolution
|
||
|
||
### Fix 1: Eliminate N+1 with Eager Loading
|
||
|
||
**After - Single Query with Join**:
|
||
```python
|
||
# api/orders.py (AFTER - Eager Loading)
|
||
from sqlmodel import select
|
||
from sqlalchemy.orm import selectinload
|
||
|
||
@router.get("/orders/{user_id}")
|
||
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
|
||
# ✅ Single query with eager loading
|
||
statement = (
|
||
select(Order)
|
||
.where(Order.user_id == user_id)
|
||
.options(selectinload(Order.items)) # Eager load items
|
||
)
|
||
|
||
orders = session.exec(statement).all()
|
||
|
||
return orders
|
||
|
||
# Total queries: 2 (1 for orders, 1 for all items)
|
||
# Total time: 10ms + 25ms = 35ms (98% faster!)
|
||
```
|
||
|
||
**Query Comparison**:
|
||
```
|
||
BEFORE (N+1):
|
||
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
|
||
- Query 2-158: SELECT * FROM order_items WHERE order_id = ? (×157, 12ms each)
|
||
- Total: 1894ms
|
||
|
||
AFTER (Eager Loading):
|
||
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
|
||
- Query 2: SELECT * FROM order_items WHERE order_id IN (?, ?, ..., ?) (25ms)
|
||
- Total: 35ms (54x faster!)
|
||
```
|
||
|
||
---
|
||
|
||
### Fix 2: Add Database Indexes
|
||
|
||
**Create Indexes**:
|
||
```sql
|
||
-- Index on order_id for faster lookups
|
||
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
|
||
|
||
-- Index on user_id for user queries
|
||
CREATE INDEX idx_orders_user_id ON orders(user_id);
|
||
|
||
-- Index on created_at for time-based queries
|
||
CREATE INDEX idx_orders_created_at ON orders(created_at);
|
||
|
||
-- Composite index for common filters
|
||
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
|
||
```
|
||
|
||
**Before/After EXPLAIN**:
|
||
```sql
|
||
-- BEFORE (no index):
|
||
EXPLAIN ANALYZE SELECT * FROM order_items WHERE order_id = 123;
|
||
Seq Scan (cost=0.00..1500.00) (actual time=12.345ms)
|
||
|
||
-- AFTER (with index):
|
||
Index Scan using idx_order_items_order_id (cost=0.00..8.50) (actual time=0.045ms)
|
||
|
||
-- 270x faster (12.345ms → 0.045ms)
|
||
```
|
||
|
||
---
|
||
|
||
### Fix 3: Implement Redis Caching
|
||
|
||
**Cache Frequent Queries**:
|
||
```typescript
|
||
// cache.ts - Redis caching layer
|
||
import { Redis } from '@upstash/redis';
|
||
|
||
const redis = new Redis({
|
||
url: env.UPSTASH_REDIS_URL,
|
||
token: env.UPSTASH_REDIS_TOKEN
|
||
});
|
||
|
||
async function getCachedOrders(userId: number) {
|
||
const cacheKey = `orders:user:${userId}`;
|
||
|
||
// Check cache
|
||
const cached = await redis.get(cacheKey);
|
||
if (cached) {
|
||
return JSON.parse(cached); // Cache hit
|
||
}
|
||
|
||
// Cache miss - query database
|
||
const orders = await fetchOrdersFromDb(userId);
|
||
|
||
// Store in cache (5 minute TTL)
|
||
await redis.setex(cacheKey, 300, JSON.stringify(orders));
|
||
|
||
return orders;
|
||
}
|
||
```
|
||
|
||
**Cache Hit Rates**:
|
||
```
|
||
Requests: 10,000
|
||
Cache hits: 8,500 (85%)
|
||
Cache misses: 1,500 (15%)
|
||
|
||
Avg latency with cache:
|
||
- Cache hit: 5ms (Redis)
|
||
- Cache miss: 35ms (database)
|
||
- Overall: (0.85 × 5) + (0.15 × 35) = 9.5ms
|
||
```
|
||
|
||
---
|
||
|
||
### Fix 4: Database Connection Pooling
|
||
|
||
**Optimize Pool Settings**:
|
||
```python
|
||
# database.py - Tuned for performance
|
||
engine = create_engine(
|
||
database_url,
|
||
pool_size=50, # Increased from 20
|
||
max_overflow=20,
|
||
pool_recycle=1800, # 30 minutes
|
||
pool_pre_ping=True, # Health check
|
||
echo=False,
|
||
connect_args={
|
||
"server_settings": {
|
||
"statement_timeout": "30000", # 30s query timeout
|
||
"idle_in_transaction_session_timeout": "60000" # 60s idle
|
||
}
|
||
}
|
||
)
|
||
```
|
||
|
||
---
|
||
|
||
## Results
|
||
|
||
### Performance Metrics
|
||
|
||
| Metric | Before | After | Improvement |
|
||
|--------|--------|-------|-------------|
|
||
| **p50 Latency** | 1800ms | 180ms | **90% faster** |
|
||
| **p95 Latency** | 2100ms | 220ms | **90% faster** |
|
||
| **p99 Latency** | 3500ms | 450ms | **87% faster** |
|
||
| **Database Queries** | 158/request | 2/request | **99% reduction** |
|
||
| **Cache Hit Rate** | 0% | 85% | **85% hits** |
|
||
| **Timeout Errors** | 5% | 0% | **100% eliminated** |
|
||
|
||
### Cost Impact
|
||
|
||
**Database Query Reduction**:
|
||
```
|
||
Before: 158 queries × 100 req/s = 15,800 queries/s
|
||
After: 2 queries × 100 req/s = 200 queries/s
|
||
|
||
Reduction: 98.7% fewer queries
|
||
Cost savings: $450/month (reduced database tier)
|
||
```
|
||
|
||
---
|
||
|
||
## Prevention Measures
|
||
|
||
### 1. Query Performance Monitoring
|
||
|
||
**Slow Query Alert**:
|
||
```yaml
|
||
# Alert on slow database queries
|
||
- alert: SlowDatabaseQueries
|
||
expr: histogram_quantile(0.95, rate(database_query_duration_seconds[5m])) > 0.1
|
||
for: 5m
|
||
annotations:
|
||
summary: "Database queries p95 >100ms"
|
||
```
|
||
|
||
### 2. N+1 Query Detection
|
||
|
||
**Test for N+1 Patterns**:
|
||
```python
|
||
# tests/test_n_plus_one.py
|
||
import pytest
|
||
from sqlalchemy import event
|
||
from database import engine
|
||
|
||
@pytest.fixture
|
||
def query_counter():
|
||
"""Count SQL queries during test"""
|
||
queries = []
|
||
|
||
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
|
||
queries.append(statement)
|
||
|
||
event.listen(engine, "before_cursor_execute", before_cursor_execute)
|
||
yield queries
|
||
event.remove(engine, "before_cursor_execute", before_cursor_execute)
|
||
|
||
def test_get_user_orders_no_n_plus_one(query_counter):
|
||
"""Verify endpoint doesn't have N+1 queries"""
|
||
get_user_orders(user_id=1)
|
||
|
||
# Should be 2 queries max (orders + items)
|
||
assert len(query_counter) <= 2, f"N+1 detected: {len(query_counter)} queries"
|
||
```
|
||
|
||
### 3. Database Index Coverage
|
||
|
||
```sql
|
||
-- Check for missing indexes
|
||
SELECT
|
||
schemaname,
|
||
tablename,
|
||
attname,
|
||
n_distinct,
|
||
correlation
|
||
FROM pg_stats
|
||
WHERE schemaname = 'public'
|
||
AND n_distinct > 100 -- Cardinality suggests index needed
|
||
ORDER BY tablename, attname;
|
||
```
|
||
|
||
### 4. Performance Budget
|
||
|
||
```typescript
|
||
// Set performance budgets
|
||
const PERFORMANCE_BUDGETS = {
|
||
api_latency_p95: 500, // ms
|
||
database_queries_per_request: 5,
|
||
cache_hit_rate_min: 0.70, // 70%
|
||
};
|
||
|
||
// CI/CD check
|
||
if (metrics.api_latency_p95 > PERFORMANCE_BUDGETS.api_latency_p95) {
|
||
throw new Error(`Performance budget exceeded: ${metrics.api_latency_p95}ms > 500ms`);
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## Lessons Learned
|
||
|
||
### What Went Well
|
||
|
||
✅ Slow query log pinpointed N+1 problem
|
||
✅ Eager loading eliminated 99% of queries
|
||
✅ Indexes provided 270x speedup
|
||
✅ Caching reduced load by 85%
|
||
|
||
### What Could Be Improved
|
||
|
||
❌ No N+1 query detection before production
|
||
❌ Missing indexes not caught in code review
|
||
❌ No caching layer initially
|
||
❌ No query performance monitoring
|
||
|
||
### Key Takeaways
|
||
|
||
1. **Always use eager loading** for associations
|
||
2. **Add indexes** for all foreign keys and WHERE clauses
|
||
3. **Implement caching** for frequently accessed data
|
||
4. **Monitor query counts** per request (alert on >10)
|
||
5. **Test for N+1** in CI/CD pipeline
|
||
|
||
---
|
||
|
||
## Related Documentation
|
||
|
||
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
|
||
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
|
||
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
|
||
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
|
||
|
||
---
|
||
|
||
Return to [examples index](INDEX.md)
|