# Performance Degradation Analysis

Investigating an API response-time increase from 200ms to 2000ms, resolved through N+1 query elimination, caching, and index optimization.

## Overview

- **Incident**: API response times degraded 10x (200ms → 2000ms)
- **Impact**: User-facing slowness, timeout errors, poor UX
- **Root Cause**: N+1 query problem + missing indexes + no caching
- **Resolution**: Query optimization + indexes + Redis caching
- **Status**: Resolved

## Incident Timeline

| Time | Event | Action |
|------|-------|--------|
| 08:00 | Slowness reports from users | Support tickets opened |
| 08:15 | Monitoring confirms degradation | p95 latency 2000ms |
| 08:30 | Database profiling started | Slow query log analysis |
| 09:00 | N+1 query identified | Found 100+ queries per request |
| 09:30 | Fix implemented | Eager loading + indexes |
| 10:00 | Caching added | Redis for frequently accessed data |
| 10:30 | Deployment complete | Latency back to 200ms |

---

## Symptoms and Detection

### Initial Metrics

**Latency Increase**:
```
p50: 180ms → 1800ms (+900% slower)
p95: 220ms → 2100ms (+854% slower)
p99: 450ms → 3500ms (+677% slower)

Requests timing out: 5% (>3s timeout)
```

**User Impact**:
- Page load times: 5-10 seconds
- API timeouts: 5% of requests
- Support tickets: 47 in 1 hour
- User complaints: "App is unusable"

---

## Diagnosis

### Step 1: Application Performance Monitoring

**Wrangler Tail Analysis**:
```bash
# Monitor worker requests in real-time
wrangler tail --format pretty

# Output shows slow requests:
[2024-12-05 08:20:15] GET /api/orders - 2145ms
  ├─ database_query: 1950ms (90% of total time!)
  ├─ json_serialization: 150ms
  └─ response_headers: 45ms

# Red flag: Database taking 90% of request time
```
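When platform tracing isn't available, a similar phase breakdown can be approximated inside the API itself. A minimal sketch for a FastAPI service, assuming the synchronous SQLAlchemy `engine` from `database.py` (module and names are illustrative):

```python
# timing.py - approximate per-request time split (illustrative sketch)
import contextvars
import time

from fastapi import FastAPI, Request
from sqlalchemy import event

from database import engine  # assumption: the engine defined in database.py

app = FastAPI()

# The middleware stores a mutable bucket in a ContextVar; the DB event
# handlers mutate it, so the total survives task/thread context copies.
_request_db_time = contextvars.ContextVar("request_db_time", default=None)

@event.listens_for(engine, "before_cursor_execute")
def _query_start(conn, cursor, statement, parameters, context, executemany):
    context._started_at = time.perf_counter()

@event.listens_for(engine, "after_cursor_execute")
def _query_end(conn, cursor, statement, parameters, context, executemany):
    bucket = _request_db_time.get()
    if bucket is not None:
        bucket["db"] += time.perf_counter() - context._started_at

@app.middleware("http")
async def log_phase_timing(request: Request, call_next):
    bucket = {"db": 0.0}
    _request_db_time.set(bucket)
    started = time.perf_counter()
    response = await call_next(request)
    total_ms = (time.perf_counter() - started) * 1000
    db_ms = bucket["db"] * 1000
    # A database share near 90% points straight at query problems
    print(f"{request.method} {request.url.path} - {total_ms:.0f}ms (db: {db_ms:.0f}ms)")
    return response
```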
---

### Step 2: Database Query Analysis

**PlanetScale Slow Query Log**:
```bash
# Enable and check slow queries
pscale database insights greyhaven-db main --slow-queries

# Results:
Query: SELECT * FROM order_items WHERE order_id = ?
Calls: 157 times per request  # ❌ N+1 query problem!
Avg time: 12ms per query
Total: 1884ms per request (12ms × 157)
```

**N+1 Query Pattern Identified**:
```python
# api/orders.py (BEFORE - N+1 Problem)
from fastapi import APIRouter, Depends
from sqlmodel import Session, select

from database import get_session      # import paths illustrative
from models import Order, OrderItem

router = APIRouter()

@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # Query 1: Get all orders for user
    orders = session.exec(
        select(Order).where(Order.user_id == user_id)
    ).all()  # Returns 157 orders

    # Query 2-158: Get items for EACH order (N+1!)
    for order in orders:
        order.items = session.exec(
            select(OrderItem).where(OrderItem.order_id == order.id)
        ).all()  # 157 additional queries!

    return orders

# Total queries: 1 + 157 = 158 queries per request
# Total time: 10ms + (157 × 12ms) = 1894ms
```

---

### Step 3: Database Index Analysis

**Missing Indexes**:
```sql
-- Check existing indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'order_items';

-- Results:
-- Primary key on id (exists) ✅
-- NO index on order_id ❌ (needed for WHERE clause)
-- NO index on user_id ❌ (needed for joins)

-- Explain plan shows full table scan
EXPLAIN ANALYZE
SELECT * FROM order_items WHERE order_id = 123;

-- Result:
Seq Scan on order_items (cost=0.00..1500.00 rows=1 width=100) (actual time=12.345..12.345 rows=5 loops=1)
  Filter: (order_id = 123)
  Rows Removed by Filter: 10000

-- Full table scan on 10K rows, repeated 157 times per request = extremely slow!
```

---

## Resolution

### Fix 1: Eliminate N+1 with Eager Loading

**After - Eager Loading**:
```python
# api/orders.py (AFTER - Eager Loading)
from sqlmodel import Session, select
from sqlalchemy.orm import selectinload

@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # ✅ Eager loading: two bounded queries instead of 1 + N
    statement = (
        select(Order)
        .where(Order.user_id == user_id)
        .options(selectinload(Order.items))  # Eager load items
    )

    orders = session.exec(statement).all()

    return orders

# Total queries: 2 (1 for orders, 1 for all items)
# Total time: 10ms + 25ms = 35ms (98% faster!)
```

**Query Comparison**:
```
BEFORE (N+1):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2-158: SELECT * FROM order_items WHERE order_id = ? (×157, 12ms each)
- Total: 1894ms

AFTER (Eager Loading):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2: SELECT * FROM order_items WHERE order_id IN (?, ?, ..., ?) (25ms)
- Total: 35ms (54x faster!)
```
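`selectinload` costs one extra round trip; when a single query genuinely matters, `joinedload` fetches the same data with one JOIN at the price of a wider, duplicated result set, so `selectinload` remains the safer default for large collections. A sketch of the alternative, reusing the endpoint above:

```python
# Alternative: joinedload collapses the fetch into a single JOIN query
from sqlalchemy.orm import joinedload

statement = (
    select(Order)
    .where(Order.user_id == user_id)
    .options(joinedload(Order.items))  # LEFT OUTER JOIN to order_items
)

# unique() is required when joined-loading a collection, because the
# JOIN repeats each order row once per item before deduplication.
orders = session.exec(statement).unique().all()
```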
---

### Fix 2: Add Database Indexes

**Create Indexes**:
```sql
-- Index on order_id for faster lookups
CREATE INDEX idx_order_items_order_id ON order_items(order_id);

-- Index on user_id for user queries
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Index on created_at for time-based queries
CREATE INDEX idx_orders_created_at ON orders(created_at);

-- Composite index for common filters
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
```
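To keep future environments from missing these, the single-column indexes can also be declared on the table models so they ship with the schema; the composite index still needs an explicit `sqlalchemy.Index` in `__table_args__`. A sketch assuming SQLModel models matching the endpoints above (field lists abbreviated, names illustrative):

```python
# models.py - indexes declared with the schema (illustrative sketch)
from datetime import datetime
from typing import Optional

from sqlalchemy import Index
from sqlmodel import Field, Relationship, SQLModel

class Order(SQLModel, table=True):
    __tablename__ = "orders"
    __table_args__ = (
        # Composite index for the common "user's recent orders" filter
        Index("idx_orders_user_created", "user_id", "created_at"),
    )

    id: Optional[int] = Field(default=None, primary_key=True)
    user_id: int = Field(index=True)  # idx on user_id
    created_at: datetime = Field(default_factory=datetime.utcnow, index=True)

    items: list["OrderItem"] = Relationship(back_populates="order")

class OrderItem(SQLModel, table=True):
    __tablename__ = "order_items"

    id: Optional[int] = Field(default=None, primary_key=True)
    order_id: int = Field(foreign_key="orders.id", index=True)  # idx on order_id

    order: Optional[Order] = Relationship(back_populates="items")
```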
**Before/After EXPLAIN**:
```sql
-- BEFORE (no index):
EXPLAIN ANALYZE SELECT * FROM order_items WHERE order_id = 123;
Seq Scan (cost=0.00..1500.00) (actual time=12.345ms)

-- AFTER (with index):
Index Scan using idx_order_items_order_id (cost=0.00..8.50) (actual time=0.045ms)

-- ~270x faster (12.345ms → 0.045ms)
```

---

### Fix 3: Implement Redis Caching

**Cache Frequent Queries**:
```typescript
// cache.ts - Redis caching layer
import { Redis } from '@upstash/redis';

const redis = new Redis({
  url: env.UPSTASH_REDIS_URL,
  token: env.UPSTASH_REDIS_TOKEN
});

async function getCachedOrders(userId: number) {
  const cacheKey = `orders:user:${userId}`;

  // Check cache (the Upstash client serializes/deserializes JSON itself,
  // so the stored value comes back as an object, not a string)
  const cached = await redis.get(cacheKey);
  if (cached) {
    return cached; // Cache hit
  }

  // Cache miss - query database
  const orders = await fetchOrdersFromDb(userId);

  // Store in cache (5 minute TTL)
  await redis.setex(cacheKey, 300, orders);

  return orders;
}
```
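The 5-minute TTL bounds staleness, but order writes should also evict the key so users see their own changes immediately. A sketch of write-path invalidation from the Python API, assuming a `redis-py` client can reach the same store the Worker reads (client setup and names are illustrative):

```python
# api/orders.py - evict the cached list when an order changes (sketch)
import redis

r = redis.Redis.from_url("redis://cache:6379")  # assumption: shared cache endpoint

@router.post("/orders")
async def create_order(order: Order, session: Session = Depends(get_session)):
    session.add(order)
    session.commit()
    session.refresh(order)

    # Delete rather than update: the next read repopulates the cache,
    # avoiding write-path serialization bugs and racy partial updates
    r.delete(f"orders:user:{order.user_id}")

    return order
```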
**Cache Hit Rates**:
```
Requests: 10,000
Cache hits: 8,500 (85%)
Cache misses: 1,500 (15%)

Avg latency with cache:
- Cache hit: 5ms (Redis)
- Cache miss: 35ms (database)
- Overall: (0.85 × 5) + (0.15 × 35) = 9.5ms
```

---

### Fix 4: Database Connection Pooling

**Optimize Pool Settings**:
```python
# database.py - Tuned for performance
engine = create_engine(
    database_url,
    pool_size=50,        # Increased from 20
    max_overflow=20,
    pool_recycle=1800,   # 30 minutes
    pool_pre_ping=True,  # Health check
    echo=False,
    connect_args={
        # Note: "server_settings" is honored by the asyncpg driver;
        # with psycopg2, pass {"options": "-c statement_timeout=30000"} instead
        "server_settings": {
            "statement_timeout": "30000",  # 30s query timeout
            "idle_in_transaction_session_timeout": "60000"  # 60s idle
        }
    }
)
```
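Whether 50 is the right pool size is checkable at runtime: SQLAlchemy's `QueuePool` reports its own state. A quick sketch (assumes the `engine` above):

```python
# pool_check.py - spot-check pool utilization (sketch)
from database import engine  # assumption: engine from database.py

# One-line summary, e.g. "Pool size: 50  Connections in pool: 12 ..."
print(engine.pool.status())

# Persistent overflow or exhausted checkouts mean the pool is undersized
print(f"in use: {engine.pool.checkedout()}")
```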
---

## Results

### Performance Metrics

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **p50 Latency** | 1800ms | 180ms | **90% faster** |
| **p95 Latency** | 2100ms | 220ms | **90% faster** |
| **p99 Latency** | 3500ms | 450ms | **87% faster** |
| **Database Queries** | 158/request | 2/request | **98.7% reduction** |
| **Cache Hit Rate** | 0% | 85% | **+85 pts** |
| **Timeout Errors** | 5% | 0% | **Eliminated** |

### Cost Impact

**Database Query Reduction**:
```
Before: 158 queries × 100 req/s = 15,800 queries/s
After: 2 queries × 100 req/s = 200 queries/s

Reduction: 98.7% fewer queries
Cost savings: $450/month (reduced database tier)
```

---

## Prevention Measures

### 1. Query Performance Monitoring

**Slow Query Alert**:
```yaml
# Alert on slow database queries
- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, rate(database_query_duration_seconds_bucket[5m])) > 0.1
  for: 5m
  annotations:
    summary: "Database queries p95 >100ms"
```
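This alert assumes the service exports a `database_query_duration_seconds` histogram. A minimal sketch of that instrumentation with `prometheus_client`, hooked into the same SQLAlchemy cursor events used by the N+1 test below (the `engine` import is an assumption):

```python
# metrics.py - export query timings for the alert above (sketch)
import time

from prometheus_client import Histogram
from sqlalchemy import event

from database import engine  # assumption: engine from database.py

QUERY_DURATION = Histogram(
    "database_query_duration_seconds",
    "Wall-clock duration of individual SQL queries",
)

@event.listens_for(engine, "before_cursor_execute")
def _mark_start(conn, cursor, statement, parameters, context, executemany):
    context._prom_start = time.perf_counter()

@event.listens_for(engine, "after_cursor_execute")
def _observe(conn, cursor, statement, parameters, context, executemany):
    QUERY_DURATION.observe(time.perf_counter() - context._prom_start)
```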
### 2. N+1 Query Detection

**Test for N+1 Patterns**:
```python
# tests/test_n_plus_one.py
import pytest
from fastapi.testclient import TestClient
from sqlalchemy import event

from database import engine
from main import app  # assumes the FastAPI app lives in main.py

client = TestClient(app)

@pytest.fixture
def query_counter():
    """Count SQL queries during test"""
    queries = []

    def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        queries.append(statement)

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    yield queries
    event.remove(engine, "before_cursor_execute", before_cursor_execute)

def test_get_user_orders_no_n_plus_one(query_counter):
    """Verify endpoint doesn't have N+1 queries"""
    client.get("/orders/1")

    # Should be 2 queries max (orders + items)
    assert len(query_counter) <= 2, f"N+1 detected: {len(query_counter)} queries"
```
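The fixture generalizes into a reusable guard, so any test can wrap a request in a query budget instead of redeclaring the counter:

```python
# tests/utils.py - reusable query budget guard (sketch)
from contextlib import contextmanager

from sqlalchemy import event

from database import engine

@contextmanager
def assert_max_queries(limit: int):
    """Fail if the enclosed block issues more than `limit` SQL queries."""
    queries = []

    def _record(conn, cursor, statement, parameters, context, executemany):
        queries.append(statement)

    event.listen(engine, "before_cursor_execute", _record)
    try:
        yield queries
    finally:
        event.remove(engine, "before_cursor_execute", _record)

    assert len(queries) <= limit, (
        f"Query budget exceeded: {len(queries)} > {limit}\n" + "\n".join(queries)
    )

# Usage:
#   with assert_max_queries(2):
#       client.get("/orders/1")
```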
### 3. Database Index Coverage

```sql
-- Check for missing indexes
SELECT
    schemaname,
    tablename,
    attname,
    n_distinct,
    correlation
FROM pg_stats
WHERE schemaname = 'public'
    AND n_distinct > 100  -- Cardinality suggests index needed
ORDER BY tablename, attname;
```

### 4. Performance Budget

```typescript
// Set performance budgets
const PERFORMANCE_BUDGETS = {
  api_latency_p95: 500, // ms
  database_queries_per_request: 5,
  cache_hit_rate_min: 0.70, // 70%
};

// CI/CD check
if (metrics.api_latency_p95 > PERFORMANCE_BUDGETS.api_latency_p95) {
  throw new Error(`Performance budget exceeded: ${metrics.api_latency_p95}ms > 500ms`);
}
```

---

## Lessons Learned

### What Went Well

- ✅ Slow query log pinpointed N+1 problem
- ✅ Eager loading eliminated ~99% of queries
- ✅ Indexes provided ~270x speedup
- ✅ Caching reduced load by 85%

### What Could Be Improved

- ❌ No N+1 query detection before production
- ❌ Missing indexes not caught in code review
- ❌ No caching layer initially
- ❌ No query performance monitoring

### Key Takeaways

1. **Always use eager loading** for associations
2. **Add indexes** for all foreign keys and WHERE clauses
3. **Implement caching** for frequently accessed data
4. **Monitor query counts** per request (alert on >10)
5. **Test for N+1** in CI/CD pipeline

---

## Related Documentation

- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)

---

Return to [examples index](INDEX.md)