Performance Degradation Analysis
Investigating API response time increase from 200ms to 2000ms, resolved through N+1 query elimination, caching, and index optimization.
Overview
- Incident: API response times degraded 10x (200ms → 2000ms)
- Impact: User-facing slowness, timeout errors, poor UX
- Root Cause: N+1 query problem + missing indexes + no caching
- Resolution: Query optimization + indexes + Redis caching
- Status: Resolved
Incident Timeline
| Time | Event | Action |
|---|---|---|
| 08:00 | Slowness reports from users | Support tickets opened |
| 08:15 | Monitoring confirms degradation | p95 latency 2000ms |
| 08:30 | Database profiling started | Slow query log analysis |
| 09:00 | N+1 query identified | Found 100+ queries per request |
| 09:30 | Fix implemented | Eager loading + indexes |
| 10:00 | Caching added | Redis for frequently accessed data |
| 10:30 | Deployment complete | Latency back to 200ms |
Symptoms and Detection
Initial Metrics
Latency Increase:
- p50: 180ms → 1800ms (+900%)
- p95: 220ms → 2100ms (+854%)
- p99: 450ms → 3500ms (+678%)
- Requests timing out: 5% (>3s timeout)
User Impact:
- Page load times: 5-10 seconds
- API timeouts: 5% of requests
- Support tickets: 47 in 1 hour
- User complaints: "App is unusable"
Diagnosis
Step 1: Application Performance Monitoring
Wrangler Tail Analysis:
# Monitor worker requests in real-time
wrangler tail --format pretty

# Output shows slow requests:
[2024-12-05 08:20:15] GET /api/orders - 2145ms
  ├─ database_query: 1950ms (90% of total time!)
  ├─ json_serialization: 150ms
  └─ response_headers: 45ms

# Red flag: the database accounts for 90% of request time
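The same breakdown can be captured application-side. A minimal sketch, assuming the FastAPI service behind this API (the middleware and header naming are illustrative), that stamps total handler time on every response:

# timing.py - coarse per-request timing middleware (illustrative sketch)
import time
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def add_server_timing(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Surfaces handler time to log tails and browser devtools
    response.headers["Server-Timing"] = f"app;dur={elapsed_ms:.1f}"
    return response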
Step 2: Database Query Analysis
Slow Query Analysis (pg_stat_statements):
-- Find the most expensive queries by total time
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY calls * mean_exec_time DESC
LIMIT 10;

-- Results:
Query: SELECT * FROM order_items WHERE order_id = $1
Calls: 157 times per request  -- ❌ N+1 query problem!
Avg time: 12ms per query
Total: 1884ms per request (12ms × 157)
N+1 Query Pattern Identified:
# api/orders.py (BEFORE - N+1 problem)
@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # Query 1: get all orders for the user
    orders = session.exec(
        select(Order).where(Order.user_id == user_id)
    ).all()  # Returns 157 orders

    # Queries 2-158: get items for EACH order (N+1!)
    for order in orders:
        order.items = session.exec(
            select(OrderItem).where(OrderItem.order_id == order.id)
        ).all()  # 157 additional queries!

    return orders

# Total queries: 1 + 157 = 158 queries per request
# Total time: 10ms + (157 × 12ms) = 1894ms
Step 3: Database Index Analysis
Missing Indexes:
-- Check existing indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'order_items';

-- Results:
-- Primary key on id (exists) ✅
-- NO index on order_id ❌ (needed for the WHERE clause)
-- (orders.user_id is similarly unindexed ❌ - needed for user lookups)

-- Explain plan shows a full table scan
EXPLAIN ANALYZE
SELECT * FROM order_items WHERE order_id = 123;

-- Result:
Seq Scan on order_items  (cost=0.00..1500.00 rows=1 width=100) (actual time=12.345..12.345 rows=5 loops=1)
  Filter: (order_id = 123)
  Rows Removed by Filter: 10000

-- Full table scan over 10K rows, run 157 times per request = extremely slow!
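A plan regression like this can also be caught in code. A minimal sketch, assuming a SQLAlchemy engine for the same database (the DSN is a placeholder), that fails if the lookup still plans a sequential scan:

# check_plan.py - guard against full table scans (illustrative sketch)
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host/db")  # placeholder DSN

def assert_index_scan():
    with engine.connect() as conn:
        rows = conn.execute(
            text("EXPLAIN SELECT * FROM order_items WHERE order_id = :oid"),
            {"oid": 123},
        ).fetchall()
    plan = "\n".join(row[0] for row in rows)
    # Fails while the order_id index is missing; passes once it exists
    assert "Seq Scan" not in plan, f"Full table scan planned:\n{plan}"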
Resolution
Fix 1: Eliminate N+1 with Eager Loading
After - Single Query with Join:
# api/orders.py (AFTER - eager loading)
from sqlmodel import select
from sqlalchemy.orm import selectinload

@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # ✅ Single statement with eager loading
    statement = (
        select(Order)
        .where(Order.user_id == user_id)
        .options(selectinload(Order.items))  # Eager load items
    )
    orders = session.exec(statement).all()
    return orders

# Total queries: 2 (1 for orders, 1 for all items)
# Total time: 10ms + 25ms = 35ms (98% faster!)
Query Comparison:
BEFORE (N+1):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2-158: SELECT * FROM order_items WHERE order_id = ? (×157, 12ms each)
- Total: 1894ms
AFTER (Eager Loading):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2: SELECT * FROM order_items WHERE order_id IN (?, ?, ..., ?) (25ms)
- Total: 35ms (54x faster!)
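selectinload wins here because the relationship is one-to-many: a join would repeat each order row once per item. For comparison, a joinedload variant (a sketch, assuming the same models and session as above):

# api/orders_joined.py - joinedload variant for comparison (sketch)
from sqlmodel import Session, select
from sqlalchemy.orm import joinedload

from models import Order  # assumes the same Order/OrderItem models as above

def get_user_orders_joined(session: Session, user_id: int):
    statement = (
        select(Order)
        .where(Order.user_id == user_id)
        .options(joinedload(Order.items))  # one LEFT OUTER JOIN instead of two queries
    )
    # unique() collapses the duplicate Order rows the join produces
    return session.exec(statement).unique().all()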
Fix 2: Add Database Indexes
Create Indexes:
-- Index on order_id for faster lookups
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
-- Index on user_id for user queries
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Index on created_at for time-based queries
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Composite index for common filters
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
Before/After EXPLAIN:
-- BEFORE (no index):
EXPLAIN ANALYZE SELECT * FROM order_items WHERE order_id = 123;
Seq Scan (cost=0.00..1500.00) (actual time=12.345ms)
-- AFTER (with index):
Index Scan using idx_order_items_order_id (cost=0.00..8.50) (actual time=0.045ms)
-- 270x faster (12.345ms → 0.045ms)
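In a migration-managed schema these index changes would be tracked in version control rather than applied by hand. A minimal Alembic sketch (the revision identifiers are placeholders):

# migrations/versions/add_order_indexes.py - Alembic migration (sketch)
import sqlalchemy as sa
from alembic import op

revision = "add_order_indexes"  # placeholder revision id
down_revision = None            # placeholder parent revision

def upgrade():
    op.create_index("idx_order_items_order_id", "order_items", ["order_id"])
    op.create_index("idx_orders_user_id", "orders", ["user_id"])
    op.create_index("idx_orders_created_at", "orders", ["created_at"])
    # Composite index with descending order for the common user+recency filter
    op.create_index("idx_orders_user_created", "orders", ["user_id", sa.text("created_at DESC")])

def downgrade():
    op.drop_index("idx_orders_user_created", table_name="orders")
    op.drop_index("idx_orders_created_at", table_name="orders")
    op.drop_index("idx_orders_user_id", table_name="orders")
    op.drop_index("idx_order_items_order_id", table_name="order_items")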
Fix 3: Implement Redis Caching
Cache Frequent Queries:
// cache.ts - Redis caching layer (Upstash REST client)
import { Redis } from '@upstash/redis';

const redis = new Redis({
  url: env.UPSTASH_REDIS_URL,   // env bindings provided by the Workers runtime
  token: env.UPSTASH_REDIS_TOKEN,
});

async function getCachedOrders(userId: number) {
  const cacheKey = `orders:user:${userId}`;

  // Check cache - @upstash/redis deserializes JSON automatically,
  // so no JSON.parse() here (parsing again would throw)
  const cached = await redis.get<Order[]>(cacheKey);
  if (cached) {
    return cached; // Cache hit
  }

  // Cache miss - query database
  const orders = await fetchOrdersFromDb(userId);

  // Store in cache (5 minute TTL); the client serializes to JSON
  await redis.setex(cacheKey, 300, orders);
  return orders;
}
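TTL alone means up to five minutes of staleness after a write. A common complement is deleting the key whenever an order changes; since the database layer is Python, here is a minimal sketch with the redis-py client (connection details and the persistence helper are illustrative):

# cache_invalidation.py - drop the cached list when orders change (sketch)
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection

def invalidate_user_orders(user_id: int) -> None:
    # Next read repopulates the cache from the database
    r.delete(f"orders:user:{user_id}")

def create_order(user_id: int, order_data: dict) -> None:
    save_order_to_db(user_id, order_data)  # illustrative persistence call
    invalidate_user_orders(user_id)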
Cache Hit Rates:
Requests: 10,000
Cache hits: 8,500 (85%)
Cache misses: 1,500 (15%)
Avg latency with cache:
- Cache hit: 5ms (Redis)
- Cache miss: 35ms (database)
- Overall: (0.85 × 5) + (0.15 × 35) = 9.5ms
Fix 4: Database Connection Pooling
Optimize Pool Settings:
# database.py - engine tuned for performance
engine = create_engine(
    database_url,
    pool_size=50,        # Increased from 20
    max_overflow=20,
    pool_recycle=1800,   # Recycle connections after 30 minutes
    pool_pre_ping=True,  # Health-check connections before use
    echo=False,
    connect_args={
        # "server_settings" is asyncpg-specific; with psycopg2 pass
        # "options": "-c statement_timeout=30000" instead
        "server_settings": {
            "statement_timeout": "30000",  # 30s query timeout
            "idle_in_transaction_session_timeout": "60000",  # 60s idle limit
        }
    },
)
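The get_session dependency the endpoints above rely on is not shown in this doc; a minimal sketch using SQLModel's Session over this engine:

# database.py (continued) - session dependency for FastAPI (sketch)
from sqlmodel import Session

def get_session():
    # One session per request; closed automatically when the request ends
    with Session(engine) as session:
        yield session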
Results
Performance Metrics
| Metric | Before | After | Improvement |
|---|---|---|---|
| p50 Latency | 1800ms | 180ms | 90% faster |
| p95 Latency | 2100ms | 220ms | 90% faster |
| p99 Latency | 3500ms | 450ms | 87% faster |
| Database Queries | 158/request | 2/request | 99% reduction |
| Cache Hit Rate | 0% | 85% | 85% hits |
| Timeout Errors | 5% | 0% | Eliminated |
Cost Impact
Database Query Reduction:
Before: 158 queries × 100 req/s = 15,800 queries/s
After: 2 queries × 100 req/s = 200 queries/s
Reduction: 98.7% fewer queries
Cost savings: $450/month (reduced database tier)
Prevention Measures
1. Query Performance Monitoring
Slow Query Alert:
# Alert on slow database queries
- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, sum by (le) (rate(database_query_duration_seconds_bucket[5m]))) > 0.1
  for: 5m
  annotations:
    summary: "Database queries p95 >100ms"
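For this alert to fire, the application must export the histogram it references. A minimal sketch with prometheus_client, hooking the same SQLAlchemy cursor events the N+1 test below uses (the metric name matches the alert expression):

# metrics.py - export database_query_duration_seconds (sketch)
import time
from prometheus_client import Histogram
from sqlalchemy import event
from database import engine

QUERY_DURATION = Histogram(
    "database_query_duration_seconds",
    "Time spent executing SQL queries",
)

@event.listens_for(engine, "before_cursor_execute")
def _start_timer(conn, cursor, statement, parameters, context, executemany):
    context._query_start = time.perf_counter()

@event.listens_for(engine, "after_cursor_execute")
def _observe_duration(conn, cursor, statement, parameters, context, executemany):
    QUERY_DURATION.observe(time.perf_counter() - context._query_start)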
2. N+1 Query Detection
Test for N+1 Patterns:
# tests/test_n_plus_one.py
import pytest
from fastapi.testclient import TestClient
from sqlalchemy import event

from database import engine
from main import app  # app import path is project-specific

@pytest.fixture
def query_counter():
    """Count SQL queries issued while a test runs."""
    queries = []

    def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        queries.append(statement)

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    yield queries
    event.remove(engine, "before_cursor_execute", before_cursor_execute)

def test_get_user_orders_no_n_plus_one(query_counter):
    """Verify the endpoint doesn't issue N+1 queries."""
    # Exercise the endpoint through the app so the session dependency is injected
    client = TestClient(app)
    client.get("/orders/1")

    # Should be 2 queries max (orders + items)
    assert len(query_counter) <= 2, f"N+1 detected: {len(query_counter)} queries"
3. Database Index Coverage
-- Check for missing indexes
SELECT
schemaname,
tablename,
attname,
n_distinct,
correlation
FROM pg_stats
WHERE schemaname = 'public'
AND n_distinct > 100 -- Cardinality suggests index needed
ORDER BY tablename, attname;
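The pg_stats heuristic above flags high-cardinality columns; a complementary check is making sure every foreign key has an index at all. A minimal sketch with SQLAlchemy's inspector (the DSN is a placeholder):

# check_fk_indexes.py - flag foreign keys without a covering index (sketch)
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@host/db")  # placeholder DSN
insp = inspect(engine)

for table in insp.get_table_names():
    # A FK column is covered only if it leads some index
    leading_cols = {ix["column_names"][0] for ix in insp.get_indexes(table) if ix["column_names"]}
    for fk in insp.get_foreign_keys(table):
        for col in fk["constrained_columns"]:
            if col not in leading_cols:
                print(f"{table}.{col}: foreign key without a leading index")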
4. Performance Budget
// Set performance budgets
const PERFORMANCE_BUDGETS = {
  api_latency_p95: 500,             // ms
  database_queries_per_request: 5,
  cache_hit_rate_min: 0.70,         // 70%
};

// CI/CD check
if (metrics.api_latency_p95 > PERFORMANCE_BUDGETS.api_latency_p95) {
  throw new Error(`Performance budget exceeded: ${metrics.api_latency_p95}ms > 500ms`);
}
Lessons Learned
What Went Well
- ✅ Slow query log pinpointed the N+1 problem
- ✅ Eager loading eliminated 99% of queries
- ✅ Indexes provided a ~270x speedup
- ✅ Caching absorbed 85% of read load
What Could Be Improved
- ❌ No N+1 query detection before production
- ❌ Missing indexes not caught in code review
- ❌ No caching layer in the initial design
- ❌ No query performance monitoring
Key Takeaways
- Always use eager loading for associations
- Add indexes for all foreign keys and WHERE clauses
- Implement caching for frequently accessed data
- Monitor query counts per request (alert on >10; see the sketch after this list)
- Test for N+1 in CI/CD pipeline
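A sketch of the per-request query counting from takeaway 4, assuming the FastAPI app and engine from earlier (the threshold and logger are illustrative):

# query_count_middleware.py - warn when a request issues too many queries (sketch)
import logging
from contextvars import ContextVar
from fastapi import Request
from sqlalchemy import event
from database import engine
from main import app  # app import path is project-specific

log = logging.getLogger("perf")
query_count: ContextVar[int] = ContextVar("query_count", default=0)

@event.listens_for(engine, "before_cursor_execute")
def _count_query(conn, cursor, statement, parameters, context, executemany):
    query_count.set(query_count.get() + 1)

@app.middleware("http")
async def warn_on_query_storm(request: Request, call_next):
    query_count.set(0)
    response = await call_next(request)
    if query_count.get() > 10:
        log.warning("Possible N+1: %d queries for %s", query_count.get(), request.url.path)
    return response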
Related Documentation
- Worker Deployment: cloudflare-worker-deployment-failure.md
- Database Issues: planetscale-connection-issues.md
- Network Debugging: distributed-system-debugging.md
- Runbooks: ../reference/troubleshooting-runbooks.md