# Performance Degradation Analysis
Investigating a 10x API response time increase (200ms to 2000ms), resolved through N+1 query elimination, index optimization, and Redis caching.
## Overview
**Incident**: API response times degraded 10x (200ms → 2000ms)
**Impact**: User-facing slowness, timeout errors, poor UX
**Root Cause**: N+1 query problem + missing indexes + no caching
**Resolution**: Query optimization + indexes + Redis caching
**Status**: Resolved
## Incident Timeline
| Time | Event | Action |
|------|-------|--------|
| 08:00 | Slowness reports from users | Support tickets opened |
| 08:15 | Monitoring confirms degradation | p95 latency 2000ms |
| 08:30 | Database profiling started | Slow query log analysis |
| 09:00 | N+1 query identified | Found 100+ queries per request |
| 09:30 | Fix implemented | Eager loading + indexes |
| 10:00 | Caching added | Redis for frequently accessed data |
| 10:30 | Deployment complete | Latency back to 200ms |
---
## Symptoms and Detection
### Initial Metrics
**Latency Increase**:
```
p50: 180ms → 1800ms (+900%)
p95: 220ms → 2100ms (+854%)
p99: 450ms → 3500ms (+677%)
Requests timing out: 5% (>3s timeout)
```
**User Impact**:
- Page load times: 5-10 seconds
- API timeouts: 5% of requests
- Support tickets: 47 in 1 hour
- User complaints: "App is unusable"
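Before server-side profiling, a quick client-side probe can confirm the regression independently of the dashboards. A minimal sketch using httpx; the endpoint URL and sample count are placeholders, not details from the incident:
```python
# latency_probe.py - hypothetical client-side latency sampler
import statistics
import time

import httpx

URL = "https://api.example.com/api/orders/1"  # placeholder endpoint
samples: list[float] = []

with httpx.Client(timeout=10.0) as client:
    for _ in range(50):
        start = time.perf_counter()
        client.get(URL)
        samples.append((time.perf_counter() - start) * 1000)  # ms

print(f"p50: {statistics.median(samples):.0f}ms")
print(f"p95: {statistics.quantiles(samples, n=100)[94]:.0f}ms")
```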
---
## Diagnosis
### Step 1: Application Performance Monitoring
**Wrangler Tail Analysis**:
```bash
# Monitor worker requests in real-time
wrangler tail --format pretty
# Output shows slow requests:
[2024-12-05 08:20:15] GET /api/orders - 2145ms
  ├─ database_query: 1950ms (90% of total time!)
  ├─ json_serialization: 150ms
  └─ response_headers: 45ms
# Red flag: Database taking 90% of request time
```
---
### Step 2: Database Query Analysis
**PlanetScale Slow Query Log**:
```bash
# Enable and check slow queries
pscale database insights greyhaven-db main --slow-queries
# Results:
Query: SELECT * FROM order_items WHERE order_id = ?
Calls: 157 times per request # ❌ N+1 query problem!
Avg time: 12ms per query
Total: 1884ms per request (12ms × 157)
```
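The same signature shows up locally if SQL echo is enabled in development: an N+1 endpoint prints the identical per-row SELECT over and over. This is the standard SQLAlchemy/SQLModel flag, not incident-specific configuration:
```python
# development only - echo each SQL statement to the console; an N+1
# endpoint repeats the same SELECT once per parent row
from sqlmodel import create_engine

database_url = "postgresql://localhost/app_db"  # placeholder
engine = create_engine(database_url, echo=True)
```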
**N+1 Query Pattern Identified**:
```python
# api/orders.py (BEFORE - N+1 Problem)
@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # Query 1: Get all orders for user
    orders = session.exec(
        select(Order).where(Order.user_id == user_id)
    ).all()  # Returns 157 orders

    # Query 2-158: Get items for EACH order (N+1!)
    for order in orders:
        order.items = session.exec(
            select(OrderItem).where(OrderItem.order_id == order.id)
        ).all()  # 157 additional queries!

    return orders

# Total queries: 1 + 157 = 158 queries per request
# Total time: 10ms + (157 × 12ms) = 1894ms
```
---
### Step 3: Database Index Analysis
**Missing Indexes**:
```sql
-- Check existing indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'order_items';
-- Results:
-- Primary key on id (exists) ✅
-- NO index on order_id ❌ (needed for WHERE clause)
-- NO index on user_id ❌ (needed for joins)
-- Explain plan shows full table scan
EXPLAIN ANALYZE
SELECT * FROM order_items WHERE order_id = 123;
-- Result:
Seq Scan on order_items  (cost=0.00..1500.00 rows=1 width=100) (actual time=12.345..12.345 rows=5 loops=157)
  Filter: (order_id = 123)
  Rows Removed by Filter: 10000
-- Full table scan on 10K rows, 157 times = extremely slow!
```
---
## Resolution
### Fix 1: Eliminate N+1 with Eager Loading
**After - Single Query with Join**:
```python
# api/orders.py (AFTER - Eager Loading)
from sqlmodel import select
from sqlalchemy.orm import selectinload

@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # ✅ Two queries total, with eager loading
    statement = (
        select(Order)
        .where(Order.user_id == user_id)
        .options(selectinload(Order.items))  # Eager load items
    )
    orders = session.exec(statement).all()
    return orders

# Total queries: 2 (1 for orders, 1 for all items)
# Total time: 10ms + 25ms = 35ms (98% faster!)
```
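`selectinload` batches the children into a second `IN (...)` query, which avoids duplicating parent rows. `joinedload` is the single-query alternative; a sketch reusing the imports and models from the block above (a design option, not what was deployed):
```python
from sqlalchemy.orm import joinedload

# Single LEFT OUTER JOIN; parent rows repeat once per item, so the
# result must be de-duplicated with .unique() before .all()
statement = (
    select(Order)
    .where(Order.user_id == user_id)
    .options(joinedload(Order.items))
)
orders = session.exec(statement).unique().all()
```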
**Query Comparison**:
```
BEFORE (N+1):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2-158: SELECT * FROM order_items WHERE order_id = ? (×157, 12ms each)
- Total: 1894ms
AFTER (Eager Loading):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2: SELECT * FROM order_items WHERE order_id IN (?, ?, ..., ?) (25ms)
- Total: 35ms (54x faster!)
```
---
### Fix 2: Add Database Indexes
**Create Indexes**:
```sql
-- Index on order_id for faster lookups
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
-- Index on user_id for user queries
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Index on created_at for time-based queries
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Composite index for common filters
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
```
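If schema changes go through Alembic (an assumption; the incident doesn't say how the DDL was applied), the same indexes would ship as a migration rather than ad-hoc SQL:
```python
# versions/xxxx_add_order_indexes.py - hypothetical Alembic migration
import sqlalchemy as sa
from alembic import op

def upgrade() -> None:
    op.create_index("idx_order_items_order_id", "order_items", ["order_id"])
    op.create_index("idx_orders_user_id", "orders", ["user_id"])
    op.create_index("idx_orders_created_at", "orders", ["created_at"])
    op.create_index("idx_orders_user_created", "orders",
                    ["user_id", sa.text("created_at DESC")])

def downgrade() -> None:
    op.drop_index("idx_orders_user_created", table_name="orders")
    op.drop_index("idx_orders_created_at", table_name="orders")
    op.drop_index("idx_orders_user_id", table_name="orders")
    op.drop_index("idx_order_items_order_id", table_name="order_items")
```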
**Before/After EXPLAIN**:
```sql
-- BEFORE (no index):
EXPLAIN ANALYZE SELECT * FROM order_items WHERE order_id = 123;
Seq Scan (cost=0.00..1500.00) (actual time=12.345ms)
-- AFTER (with index):
Index Scan using idx_order_items_order_id (cost=0.00..8.50) (actual time=0.045ms)
-- 270x faster (12.345ms → 0.045ms)
```
---
### Fix 3: Implement Redis Caching
**Cache Frequent Queries**:
```typescript
// cache.ts - Redis caching layer
import { Redis } from '@upstash/redis';

const redis = new Redis({
  url: env.UPSTASH_REDIS_URL,
  token: env.UPSTASH_REDIS_TOKEN
});

async function getCachedOrders(userId: number) {
  const cacheKey = `orders:user:${userId}`;

  // Check cache (@upstash/redis JSON-deserializes values on get)
  const cached = await redis.get(cacheKey);
  if (cached) {
    return cached; // Cache hit
  }

  // Cache miss - query database
  const orders = await fetchOrdersFromDb(userId);

  // Store in cache (5 minute TTL)
  await redis.setex(cacheKey, 300, JSON.stringify(orders));
  return orders;
}
```
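A 5-minute TTL bounds staleness, but write paths that modify orders should also evict the key so readers don't wait out the TTL. A minimal sketch using redis-py on the Python side (an assumption; the incident doesn't describe its invalidation path):
```python
# orders_write.py - hypothetical write path with cache eviction
import redis

r = redis.Redis.from_url("redis://localhost:6379")  # placeholder URL

def add_order_item(session, order, item):
    order.items.append(item)
    session.add(order)
    session.commit()
    # Evict the cached list so the next read repopulates it
    r.delete(f"orders:user:{order.user_id}")
```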
**Cache Hit Rates**:
```
Requests: 10,000
Cache hits: 8,500 (85%)
Cache misses: 1,500 (15%)
Avg latency with cache:
- Cache hit: 5ms (Redis)
- Cache miss: 35ms (database)
- Overall: (0.85 × 5) + (0.15 × 35) = 9.5ms
```
---
### Fix 4: Database Connection Pooling
**Optimize Pool Settings**:
```python
# database.py - Tuned for performance
engine = create_engine(
    database_url,
    pool_size=50,        # Increased from 20
    max_overflow=20,
    pool_recycle=1800,   # Recycle connections after 30 minutes
    pool_pre_ping=True,  # Health-check connections before use
    echo=False,
    connect_args={
        "server_settings": {
            "statement_timeout": "30000",  # 30s query timeout
            "idle_in_transaction_session_timeout": "60000",  # 60s idle
        }
    },
)
```
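Whether the larger pool is sized correctly can be checked at runtime: SQLAlchemy's default QueuePool exposes a status summary. A small sketch (the 60-second cadence is an arbitrary choice):
```python
# pool_monitor.py - periodically log connection pool utilization
import logging
import threading

from database import engine  # engine configured above (assumed path)

def log_pool_status(interval_seconds: float = 60.0) -> None:
    # QueuePool.status() returns a one-line summary, e.g.
    # "Pool size: 50  Connections in pool: 12  Current Overflow: 0"
    logging.info("DB pool: %s", engine.pool.status())
    threading.Timer(interval_seconds, log_pool_status, [interval_seconds]).start()

log_pool_status()
```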
---
## Results
### Performance Metrics
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **p50 Latency** | 1800ms | 180ms | **90% faster** |
| **p95 Latency** | 2100ms | 220ms | **90% faster** |
| **p99 Latency** | 3500ms | 450ms | **87% faster** |
| **Database Queries** | 158/request | 2/request | **99% reduction** |
| **Cache Hit Rate** | 0% | 85% | **85% hits** |
| **Timeout Errors** | 5% | 0% | **Eliminated** |
### Cost Impact
**Database Query Reduction**:
```
Before: 158 queries × 100 req/s = 15,800 queries/s
After: 2 queries × 100 req/s = 200 queries/s
Reduction: 98.7% fewer queries
Cost savings: $450/month (reduced database tier)
```
---
## Prevention Measures
### 1. Query Performance Monitoring
**Slow Query Alert**:
```yaml
# Alert on slow database queries (assumes a Prometheus histogram
# named database_query_duration_seconds is exported)
- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, sum(rate(database_query_duration_seconds_bucket[5m])) by (le)) > 0.1
  for: 5m
  annotations:
    summary: "Database queries p95 >100ms"
```
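The rule assumes the application exports a `database_query_duration_seconds` histogram. A sketch of that instrumentation with prometheus_client and SQLAlchemy events, using the documented `conn.info` timing pattern (metric and engine names follow the examples above):
```python
# metrics.py - export the query-duration histogram the alert scrapes
import time

from prometheus_client import Histogram
from sqlalchemy import event

from database import engine  # assumed import path

QUERY_DURATION = Histogram(
    "database_query_duration_seconds",
    "SQL query duration in seconds",
)

@event.listens_for(engine, "before_cursor_execute")
def _start_timer(conn, cursor, statement, parameters, context, executemany):
    conn.info.setdefault("query_start_time", []).append(time.perf_counter())

@event.listens_for(engine, "after_cursor_execute")
def _observe(conn, cursor, statement, parameters, context, executemany):
    start = conn.info["query_start_time"].pop()
    QUERY_DURATION.observe(time.perf_counter() - start)
```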
### 2. N+1 Query Detection
**Test for N+1 Patterns**:
```python
# tests/test_n_plus_one.py
import pytest
from fastapi.testclient import TestClient
from sqlalchemy import event

from database import engine
from main import app  # FastAPI entrypoint (adjust import to your project)

client = TestClient(app)

@pytest.fixture
def query_counter():
    """Count SQL queries executed during the test"""
    queries = []

    def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        queries.append(statement)

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    yield queries
    event.remove(engine, "before_cursor_execute", before_cursor_execute)

def test_get_user_orders_no_n_plus_one(query_counter):
    """Verify endpoint doesn't have N+1 queries"""
    client.get("/orders/1")

    # Should be 2 queries max (orders + items)
    assert len(query_counter) <= 2, f"N+1 detected: {len(query_counter)} queries"
```
### 3. Database Index Coverage
```sql
-- Check for missing indexes
SELECT
schemaname,
tablename,
attname,
n_distinct,
correlation
FROM pg_stats
WHERE schemaname = 'public'
AND n_distinct > 100 -- High cardinality suggests an index may help (negative values are row-count ratios)
ORDER BY tablename, attname;
```
### 4. Performance Budget
```typescript
// Set performance budgets
const PERFORMANCE_BUDGETS = {
  api_latency_p95: 500,             // ms
  database_queries_per_request: 5,
  cache_hit_rate_min: 0.70,         // 70%
};

// CI/CD check
if (metrics.api_latency_p95 > PERFORMANCE_BUDGETS.api_latency_p95) {
  throw new Error(`Performance budget exceeded: ${metrics.api_latency_p95}ms > 500ms`);
}
```
---
## Lessons Learned
### What Went Well
✅ Slow query log pinpointed N+1 problem
✅ Eager loading eliminated 99% of queries
✅ Indexes provided 270x speedup
✅ Caching reduced load by 85%
### What Could Be Improved
❌ No N+1 query detection before production
❌ Missing indexes not caught in code review
❌ No caching layer initially
❌ No query performance monitoring
### Key Takeaways
1. **Always use eager loading** for associations
2. **Add indexes** for all foreign keys and WHERE clauses
3. **Implement caching** for frequently accessed data
4. **Monitor query counts** per request (alert on >10; see the middleware sketch below)
5. **Test for N+1** in CI/CD pipeline
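For takeaway 4, per-request query counting can be wired into the API itself. A hypothetical FastAPI middleware sketch (the ContextVar plumbing and import paths are illustrative, not from the incident):
```python
# query_budget.py - hypothetical per-request query counter
import logging
from contextvars import ContextVar

from fastapi import FastAPI, Request
from sqlalchemy import event

from database import engine  # assumed import path

app = FastAPI()
# Mutable holder: appends made in worker threads stay visible here
_queries: ContextVar[list] = ContextVar("queries")

@event.listens_for(engine, "before_cursor_execute")
def _count_query(conn, cursor, statement, parameters, context, executemany):
    try:
        _queries.get().append(statement)
    except LookupError:
        pass  # query executed outside any HTTP request

@app.middleware("http")
async def query_budget(request: Request, call_next):
    _queries.set([])
    response = await call_next(request)
    count = len(_queries.get())
    if count > 10:  # the ">10 per request" threshold from takeaway 4
        logging.warning("Query budget exceeded: %d queries on %s",
                        count, request.url.path)
    return response
```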
---
## Related Documentation
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
---
Return to [examples index](INDEX.md)