# Performance Degradation Analysis

An investigation of a 10x API response time increase (200ms → 2000ms), resolved through N+1 query elimination, caching, and index optimization.

## Overview

**Incident**: API response times degraded 10x (200ms → 2000ms)
**Impact**: User-facing slowness, timeout errors, poor UX
**Root Cause**: N+1 query problem + missing indexes + no caching
**Resolution**: Query optimization + indexes + Redis caching
**Status**: Resolved

## Incident Timeline

| Time | Event | Action |
|------|-------|--------|
| 08:00 | Slowness reports from users | Support tickets opened |
| 08:15 | Monitoring confirms degradation | p95 latency 2000ms |
| 08:30 | Database profiling started | Slow query log analysis |
| 09:00 | N+1 query identified | Found 100+ queries per request |
| 09:30 | Fix implemented | Eager loading + indexes |
| 10:00 | Caching added | Redis for frequently accessed data |
| 10:30 | Deployment complete | Latency back to 200ms |

---

## Symptoms and Detection

### Initial Metrics

**Latency Increase**:

```
p50: 180ms → 1800ms (+900%)
p95: 220ms → 2100ms (+854%)
p99: 450ms → 3500ms (+677%)

Requests timing out: 5% (>3s timeout)
```

**User Impact**:
- Page load times: 5-10 seconds
- API timeouts: 5% of requests
- Support tickets: 47 in 1 hour
- User complaints: "App is unusable"

---

## Diagnosis

### Step 1: Application Performance Monitoring

**Wrangler Tail Analysis**:

```bash
# Monitor worker requests in real-time
wrangler tail --format pretty

# Output shows slow requests:
[2024-12-05 08:20:15] GET /api/orders - 2145ms
  └─ database_query: 1950ms (90% of total time!)
  └─ json_serialization: 150ms
  └─ response_headers: 45ms

# Red flag: Database taking 90% of request time
```

---

### Step 2: Database Query Analysis

**PlanetScale Slow Query Log**:

```bash
# Enable and check slow queries
pscale database insights greyhaven-db main --slow-queries

# Results:
Query: SELECT * FROM order_items WHERE order_id = ?
Calls: 157 times per request  # ❌ N+1 query problem!
Avg time: 12ms per query
Total: 1884ms per request (12ms × 157)
```

**N+1 Query Pattern Identified**:

```python
# api/orders.py (BEFORE - N+1 Problem)
@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # Query 1: Get all orders for user
    orders = session.exec(
        select(Order).where(Order.user_id == user_id)
    ).all()  # Returns 157 orders

    # Queries 2-158: Get items for EACH order (N+1!)
    for order in orders:
        order.items = session.exec(
            select(OrderItem).where(OrderItem.order_id == order.id)
        ).all()  # 157 additional queries!

    return orders

# Total queries: 1 + 157 = 158 queries per request
# Total time: 10ms + (157 × 12ms) = 1894ms
```

---

### Step 3: Database Index Analysis

**Missing Indexes**:

```sql
-- Check existing indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'order_items';

-- Results:
-- Primary key on id (exists) ✅
-- NO index on order_id ❌ (needed for WHERE clause)
-- NO index on user_id ❌ (needed for joins)

-- Explain plan shows full table scan
EXPLAIN ANALYZE
SELECT * FROM order_items WHERE order_id = 123;

-- Result:
-- Seq Scan on order_items (cost=0.00..1500.00 rows=1 width=100)
--   (actual time=12.345..12.345 rows=5 loops=157)
--   Filter: (order_id = 123)
--   Rows Removed by Filter: 10000

-- Full table scan on 10K rows, 157 times = extremely slow!
```

---

## Resolution

### Fix 1: Eliminate N+1 with Eager Loading

**After - Single Query with Join**:

```python
# api/orders.py (AFTER - Eager Loading)
from sqlmodel import select
from sqlalchemy.orm import selectinload

@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # ✅ Single statement with eager loading
    statement = (
        select(Order)
        .where(Order.user_id == user_id)
        .options(selectinload(Order.items))  # Eager load items
    )
    orders = session.exec(statement).all()
    return orders

# Total queries: 2 (1 for orders, 1 for all items)
# Total time: 10ms + 25ms = 35ms (98% faster!)
```

**Query Comparison**:

```
BEFORE (N+1):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Queries 2-158: SELECT * FROM order_items WHERE order_id = ? (×157, 12ms each)
- Total: 1894ms

AFTER (Eager Loading):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2: SELECT * FROM order_items WHERE order_id IN (?, ?, ..., ?) (25ms)
- Total: 35ms (54x faster!)
```

---

### Fix 2: Add Database Indexes

**Create Indexes**:

```sql
-- Index on order_id for faster lookups
CREATE INDEX idx_order_items_order_id ON order_items(order_id);

-- Index on user_id for user queries
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Index on created_at for time-based queries
CREATE INDEX idx_orders_created_at ON orders(created_at);

-- Composite index for common filters
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
```

**Before/After EXPLAIN**:

```sql
-- BEFORE (no index):
EXPLAIN ANALYZE SELECT * FROM order_items WHERE order_id = 123;
-- Seq Scan (cost=0.00..1500.00) (actual time=12.345ms)

-- AFTER (with index):
-- Index Scan using idx_order_items_order_id (cost=0.00..8.50) (actual time=0.045ms)

-- 270x faster (12.345ms → 0.045ms)
```

---

### Fix 3: Implement Redis Caching

**Cache Frequent Queries**:

```typescript
// cache.ts - Redis caching layer
import { Redis } from '@upstash/redis';

const redis = new Redis({
  url: env.UPSTASH_REDIS_URL,
  token: env.UPSTASH_REDIS_TOKEN
});

async function getCachedOrders(userId: number) {
  const cacheKey = `orders:user:${userId}`;

  // Check cache (the SDK deserializes stored JSON automatically,
  // so no manual JSON.parse)
  const cached = await redis.get(cacheKey);
  if (cached) {
    return cached; // Cache hit
  }

  // Cache miss - query database
  const orders = await fetchOrdersFromDb(userId);

  // Store in cache (5 minute TTL); pass the object directly and let
  // the SDK serialize it, to avoid double-encoding
  await redis.setex(cacheKey, 300, orders);

  return orders;
}
```

**Cache Hit Rates**:

```
Requests: 10,000
Cache hits: 8,500 (85%)
Cache misses: 1,500 (15%)

Avg latency with cache:
- Cache hit: 5ms (Redis)
- Cache miss: 35ms (database)
- Overall: (0.85 × 5) + (0.15 × 35) = 9.5ms
```
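The cache above is read-through with a five-minute TTL, so a write can serve stale data until the key expires. Below is a minimal sketch of explicit invalidation on the write path, assuming order writes go through the same FastAPI service and a `redis-py` client can reach the same Redis instance the Worker uses; `OrderCreate`, the `models` module, the `total` field, and the connection URL are illustrative, not from the incident:

```python
# hedged sketch: drop the per-user cache entry whenever an order is written
import redis.asyncio as aioredis
from fastapi import APIRouter, Depends
from sqlmodel import Session

from database import get_session       # same dependency as the read endpoints
from models import Order, OrderCreate  # OrderCreate is a hypothetical input schema

router = APIRouter()
cache = aioredis.from_url("redis://localhost:6379")  # point at the shared instance

@router.post("/orders")
async def create_order(payload: OrderCreate, session: Session = Depends(get_session)):
    order = Order(user_id=payload.user_id, total=payload.total)  # fields illustrative
    session.add(order)
    session.commit()
    session.refresh(order)

    # Same key format as getCachedOrders above; deleting the key forces the
    # next read to repopulate from the database instead of serving a stale
    # list for up to the full TTL.
    await cache.delete(f"orders:user:{order.user_id}")
    return order
```

Deleting rather than rewriting the key keeps the write path simple and leaves the read path as the single place that populates the cache.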
---

### Fix 4: Database Connection Pooling

**Optimize Pool Settings**:

```python
# database.py - Tuned for performance
from sqlalchemy import create_engine

engine = create_engine(
    database_url,
    pool_size=50,        # Increased from 20
    max_overflow=20,
    pool_recycle=1800,   # 30 minutes
    pool_pre_ping=True,  # Health check
    echo=False,
    connect_args={
        "server_settings": {
            "statement_timeout": "30000",                   # 30s query timeout
            "idle_in_transaction_session_timeout": "60000"  # 60s idle
        }
    }
)
```

---

## Results

### Performance Metrics

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **p50 Latency** | 1800ms | 180ms | **90% faster** |
| **p95 Latency** | 2100ms | 220ms | **90% faster** |
| **p99 Latency** | 3500ms | 450ms | **87% faster** |
| **Database Queries** | 158/request | 2/request | **99% reduction** |
| **Cache Hit Rate** | 0% | 85% | **85% hits** |
| **Timeout Errors** | 5% | 0% | **100% eliminated** |

### Cost Impact

**Database Query Reduction**:

```
Before: 158 queries × 100 req/s = 15,800 queries/s
After: 2 queries × 100 req/s = 200 queries/s

Reduction: 98.7% fewer queries
Cost savings: $450/month (reduced database tier)
```

---

## Prevention Measures

### 1. Query Performance Monitoring

**Slow Query Alert**:

```yaml
# Alert on slow database queries
- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, sum by (le) (rate(database_query_duration_seconds_bucket[5m]))) > 0.1
  for: 5m
  annotations:
    summary: "Database queries p95 >100ms"
```

### 2. N+1 Query Detection

**Test for N+1 Patterns**:

```python
# tests/test_n_plus_one.py
import pytest
from fastapi.testclient import TestClient
from sqlalchemy import event

from database import engine
from main import app  # app import path is project-specific

client = TestClient(app)

@pytest.fixture
def query_counter():
    """Count SQL queries issued during the test"""
    queries = []

    def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        queries.append(statement)

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    yield queries
    event.remove(engine, "before_cursor_execute", before_cursor_execute)

def test_get_user_orders_no_n_plus_one(query_counter):
    """Verify the orders endpoint doesn't issue N+1 queries"""
    client.get("/orders/1")  # exercise the endpoint end to end

    # Should be 2 queries max (orders + items)
    assert len(query_counter) <= 2, f"N+1 detected: {len(query_counter)} queries"
```

### 3. Database Index Coverage

```sql
-- Check for columns that may need indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND n_distinct > 100  -- high cardinality suggests an index may help
  -- (negative n_distinct values are ratios of row count; adjust if needed)
ORDER BY tablename, attname;
```

### 4. Performance Budget

```typescript
// Set performance budgets
const PERFORMANCE_BUDGETS = {
  api_latency_p95: 500,             // ms
  database_queries_per_request: 5,
  cache_hit_rate_min: 0.70,         // 70%
};

// CI/CD check
if (metrics.api_latency_p95 > PERFORMANCE_BUDGETS.api_latency_p95) {
  throw new Error(`Performance budget exceeded: ${metrics.api_latency_p95}ms > 500ms`);
}
```

---

## Lessons Learned

### What Went Well
✅ Slow query log pinpointed N+1 problem
✅ Eager loading eliminated 99% of queries
✅ Indexes provided 270x speedup
✅ Caching reduced database load by 85%

### What Could Be Improved
❌ No N+1 query detection before production
❌ Missing indexes not caught in code review
❌ No caching layer initially
❌ No query performance monitoring

### Key Takeaways
1. **Always use eager loading** for associations
2. **Add indexes** for all foreign keys and WHERE clauses
3. **Implement caching** for frequently accessed data
4. **Monitor query counts** per request (alert on >10; see the middleware sketch below)
5. **Test for N+1** in CI/CD pipeline
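Takeaway 4 can be enforced at runtime as well as in tests. The following is a minimal sketch of per-request query counting as FastAPI middleware, reusing the same `before_cursor_execute` hook as the N+1 test above; the `QUERY_BUDGET` value, logger name, and `X-DB-Query-Count` header are assumptions for illustration, not part of the incident response:

```python
# hedged sketch: count queries per request and warn past a budget
import logging
from contextvars import ContextVar
from typing import Optional

from fastapi import FastAPI, Request
from sqlalchemy import event

from database import engine  # same engine the application uses

QUERY_BUDGET = 10  # assumed threshold, matching "alert on >10"
logger = logging.getLogger("query_budget")

# A mutable list inside a ContextVar: contexts copied into worker threads
# share the same list object, so appends stay visible to the middleware.
_queries: ContextVar[Optional[list]] = ContextVar("queries", default=None)

def _count_query(conn, cursor, statement, parameters, context, executemany):
    bucket = _queries.get()
    if bucket is not None:  # ignore queries issued outside a request
        bucket.append(statement)

event.listen(engine, "before_cursor_execute", _count_query)

app = FastAPI()

@app.middleware("http")
async def query_budget_middleware(request: Request, call_next):
    bucket: list = []
    _queries.set(bucket)  # fresh counter for this request
    response = await call_next(request)
    if len(bucket) > QUERY_BUDGET:
        logger.warning(
            "query budget exceeded: %s ran %d queries (budget %d)",
            request.url.path, len(bucket), QUERY_BUDGET,
        )
    response.headers["X-DB-Query-Count"] = str(len(bucket))  # aids debugging
    return response
```

Paired with the CI test above, this catches N+1 regressions that only show up with production-shaped data.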
---

## Related Documentation

- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)

---

Return to [examples index](INDEX.md)