# PlanetScale Connection Pool Exhaustion

Complete investigation of database connection pool exhaustion causing 503 errors, resolved through connection pool tuning and leak fixes.

## Overview

**Incident**: Database connection timeouts causing a 15% request failure rate
**Impact**: Customer-facing 503 errors, support tickets increasing
**Root Cause**: Connection pool too small + unclosed connections in error paths
**Resolution**: Pool tuning (20→52) + connection leak fixes
**Status**: Resolved

## Incident Timeline

| Time | Event | Action |
|------|-------|--------|
| 09:30 | Alerts: High 503 error rate | Oncall paged |
| 09:35 | Investigation started | Check logs, metrics |
| 09:45 | Database connections at 100% | Identified pool exhaustion |
| 10:00 | Temporary fix: restart service | Bought time for root cause |
| 10:30 | Code analysis complete | Found connection leaks |
| 11:00 | Fix deployed (pool + leaks) | Production deployment |
| 11:30 | Monitoring confirmed stable | Incident resolved |

---
## Symptoms and Detection

### Initial Alerts

**Prometheus Alert**:
```yaml
# Alert: HighErrorRate
expr: rate(http_requests_total{status="503"}[5m]) > 0.05
for: 5m
annotations:
  summary: "503 error rate >5% for 5 minutes"
  description: "Current rate: {{ $value | humanizePercentage }}"
```

**Error Logs**:
```
[ERROR] Database query failed: connection timeout
[ERROR] Pool exhausted, waiting for available connection
[ERROR] Request timeout after 30s waiting for DB connection
```

**Impact Metrics**:
```
Error rate: 15% (150 failures per 1000 requests)
User complaints: 23 support tickets in 30 minutes
Failed transactions: ~$15,000 in abandoned carts
```

---
## Diagnosis

### Step 1: Check Connection Pool Status

**Query PlanetScale**:
```bash
# Connect to database
pscale shell greyhaven-db main

# Check active connections
SELECT
    COUNT(*) as active_connections,
    MIN(pg_stat_activity.query_start) as oldest_query
FROM pg_stat_activity
WHERE state = 'active';

# Result:
# active_connections: 98
# oldest_query: 2024-12-05 09:15:23 (15 minutes ago!)
```

**Check Application Pool**:
```python
# In FastAPI app - add diagnostic endpoint
from sqlmodel import Session
from database import engine

@app.get("/pool-status")
def pool_status():
    pool = engine.pool
    return {
        "size": pool.size(),
        "checked_out": pool.checkedout(),
        "overflow": pool.overflow(),
        "timeout": pool._timeout,
        "max_overflow": pool._max_overflow
    }

# Response:
{
    "size": 20,
    "checked_out": 20,  # Pool exhausted!
    "overflow": 0,
    "timeout": 30,
    "max_overflow": 10
}
```
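To keep an eye on this endpoint during an incident, a small poller is handy. A minimal sketch, assuming the diagnostic endpoint is exposed at the API host used elsewhere in this doc; the 90% threshold and 10-second interval are arbitrary choices, not incident tooling:

```python
# watch_pool.py - poll the /pool-status endpoint and warn on high utilization
# (diagnostic sketch; URL, threshold, and interval are assumptions)
import time

import httpx

POOL_STATUS_URL = "https://api.greyhaven.io/pool-status"  # assumed location of the endpoint above

def watch(threshold: float = 0.9, interval_seconds: int = 10) -> None:
    while True:
        status = httpx.get(POOL_STATUS_URL, timeout=5).json()
        utilization = status["checked_out"] / status["size"]
        if utilization >= threshold:
            print(f"WARNING: pool at {utilization:.0%} ({status['checked_out']}/{status['size']})")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    watch()
```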
**Red Flags**:
- ✅ Pool at 100% capacity (20/20 connections checked out)
- ✅ No overflow connections being used (0/10)
- ✅ Connections held for >15 minutes
- ✅ New requests timing out waiting for connections

---
### Step 2: Identify Connection Leaks

**Code Review - Found Vulnerable Pattern**:
```python
# api/orders.py (BEFORE - LEAK)
from fastapi import APIRouter
from sqlmodel import Session, select
from database import engine

router = APIRouter()

@router.post("/orders")
async def create_order(order_data: OrderCreate):
    # ❌ LEAK: Session never closed on exception
    session = Session(engine)

    # Create order
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # If exception here, session never closed!
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    # session.close() never reached
    return order
```

**How Leak Occurs**:
1. Request creates session (acquires connection from pool)
2. Exception raised after commit
3. Function exits without calling `session.close()`
4. Connection remains "checked out" from pool
5. After 20 such exceptions, pool exhausted (a diagnostic sketch follows)
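
To trace which code paths are holding connections, the pool's checkout/checkin events can be hooked. This is a diagnostic sketch rather than part of the deployed fix; it assumes the same `engine` from `database.py`, and the module name and stack depth are arbitrary:

```python
# leak_trace.py - record where each pooled connection was checked out
# (diagnostic sketch, not part of the deployed fix)
import traceback

from sqlalchemy import event

from database import engine  # same engine used by the leaking endpoint

@event.listens_for(engine, "checkout")
def on_checkout(dbapi_conn, connection_record, connection_proxy):
    # Stash the call stack at checkout time; a connection that never gets a
    # matching checkin still carries this stack, pointing at the leaking code path.
    connection_record.info["checkout_stack"] = traceback.format_stack(limit=10)

@event.listens_for(engine, "checkin")
def on_checkin(dbapi_conn, connection_record):
    connection_record.info.pop("checkout_stack", None)
```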
---

### Step 3: Load Testing to Reproduce

**Test Script**:
```python
# test_connection_leak.py
import asyncio
import httpx

async def create_order(client, amount):
    """Create an order that will trigger the exception path"""
    try:
        response = await client.post(
            "https://api.greyhaven.io/orders",
            json={"total": amount}
        )
        return response.status_code
    except Exception:
        return 503

async def load_test():
    """Simulate 100 orders with high amounts (triggers the leak)"""
    async with httpx.AsyncClient() as client:
        # Trigger 100 exceptions (each one leaks a connection)
        tasks = [create_order(client, 15000) for _ in range(100)]
        results = await asyncio.gather(*tasks)

        success = sum(1 for r in results if r == 201)
        errors = sum(1 for r in results if r == 503)

        print(f"Success: {success}, Errors: {errors}")

asyncio.run(load_test())
```

**Results**:
```
First 20 requests: reach the handler and leak all 20 pool connections
Remaining 80 requests: 503 after timing out waiting for a connection

Proves: Connection leak exhausts pool
```

---
## Resolution

### Fix 1: Use Context Manager (Guaranteed Cleanup)

**After - With Context Manager**:
```python
# api/orders.py (AFTER - FIXED)
from fastapi import APIRouter, Depends
from sqlmodel import Session
from database import engine

router = APIRouter()

# ✅ Dependency injection with automatic cleanup
def get_session():
    with Session(engine) as session:
        yield session
        # Session always closed (even on exception)

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session)
):
    # Session managed by FastAPI dependency
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # Exception here? No problem - session still closed by context manager
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    return order
```

**Why This Works**:
- Context manager (`with` statement) guarantees `session.close()` in `__exit__`
- Works even if an exception is raised
- FastAPI `Depends()` handles generator cleanup automatically (the sketch below shows the equivalent `try`/`finally`)
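
For intuition, the dependency above behaves roughly like this hand-written version; an illustrative sketch, not code from the incident:

```python
# Roughly what the get_session() dependency above guarantees
# (illustrative sketch only)
from sqlmodel import Session

from database import engine

def get_session_explicit():
    session = Session(engine)
    try:
        yield session      # the request handler runs while the session is checked out
    finally:
        session.close()    # always runs: normal return or any exception
```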
---

### Fix 2: Increase Connection Pool Size

**Before** (pool too small):
```python
# database.py (BEFORE)
from sqlmodel import create_engine

engine = create_engine(
    database_url,
    pool_size=20,       # Too small for load
    max_overflow=10,
    pool_timeout=30
)
```

**After** (tuned for load):
```python
# database.py (AFTER)
from sqlmodel import create_engine
import os

# Calculate pool size based on workers
# Formula: (workers * 2) + buffer
# 16 workers * 2 + 20 buffer = 52
workers = int(os.getenv("WEB_CONCURRENCY", 16))
pool_size = (workers * 2) + 20

engine = create_engine(
    database_url,
    pool_size=pool_size,     # 52 connections
    max_overflow=20,         # Burst to 72 total
    pool_timeout=30,
    pool_recycle=3600,       # Recycle after 1 hour
    pool_pre_ping=True,      # Verify connection health
    echo=False
)
```

**Pool Size Calculation**:
```
Workers: 16 (Uvicorn workers)
Connections per worker: 2 (normal peak)
Buffer: 20 (for spikes)

pool_size = (16 * 2) + 20 = 52
max_overflow = 20 (total 72 for extreme spikes)
```

---
### Fix 3: Add Connection Pool Monitoring

**Prometheus Metrics**:
```python
# monitoring.py
import asyncio

from prometheus_client import Gauge
from database import engine

# Pool metrics
db_pool_size = Gauge('db_pool_size_total', 'Total pool size')
db_pool_checked_out = Gauge('db_pool_checked_out', 'Connections in use')
db_pool_idle = Gauge('db_pool_idle', 'Idle connections')
db_pool_overflow = Gauge('db_pool_overflow', 'Overflow connections')

def update_pool_metrics():
    """Update pool metrics every 10 seconds"""
    pool = engine.pool
    db_pool_size.set(pool.size())
    db_pool_checked_out.set(pool.checkedout())
    db_pool_idle.set(pool.size() - pool.checkedout())
    db_pool_overflow.set(pool.overflow())

# Schedule in background task
async def pool_monitor():
    while True:
        update_pool_metrics()
        await asyncio.sleep(10)
```
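The monitoring loop still has to be started somewhere. One option is a FastAPI lifespan hook; this is a sketch, and the `main.py` location and import path are assumptions:

```python
# main.py - start the pool monitor alongside the app
# (sketch; file location and import path are assumptions)
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

from monitoring import pool_monitor

@asynccontextmanager
async def lifespan(app: FastAPI):
    task = asyncio.create_task(pool_monitor())  # run the 10s metrics loop in the background
    yield
    task.cancel()  # stop the loop on shutdown

app = FastAPI(lifespan=lifespan)
```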
**Grafana Alert**:
```yaml
# Alert: Connection pool near exhaustion
expr: db_pool_checked_out / db_pool_size_total > 0.8
for: 5m
annotations:
  summary: "Connection pool >80% utilized"
  description: "{{ $value | humanizePercentage }} of pool in use"
```

---

### Fix 4: Add Timeout and Retry Logic

**Connection Timeout Handling**:
```python
# database.py - Add connection retry
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from sqlmodel import Session
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    reraise=True
)
def _open_session() -> Session:
    """Open a session and force a connection checkout; retried on pool timeout."""
    session = Session(engine)
    try:
        session.connection()  # raises sqlalchemy.exc.TimeoutError if the pool is exhausted
    except PoolTimeoutError:
        # Pool exhausted - close and retry after exponential backoff
        session.close()
        raise
    return session

def get_session_with_retry():
    """FastAPI dependency: session with automatic retry on pool timeout."""
    session = _open_session()
    try:
        yield session
    finally:
        session.close()

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session_with_retry)
):
    # Will retry up to 3 times if pool exhausted
    ...
```
## Results

### Before vs After Metrics

| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **Connection Pool Size** | 20 | 52 | +160% capacity |
| **Pool Utilization** | 100% (exhausted) | 40-60% (healthy) | -40% utilization |
| **503 Error Rate** | 15% | 0.01% | **99.9% reduction** |
| **Request Timeout** | 30s (waiting) | <100ms | **99.7% faster** |
| **Leaked Connections** | 12/hour | 0/day | **100% eliminated** |

---
### Deployment Verification

**Load Test After Fix**:
```bash
# Simulate 1000 concurrent orders
ab -n 1000 -c 50 -T application/json -p order.json https://api.greyhaven.io/orders

# Results:
Requests per second: 250 [#/sec]
Time per request: 200ms [mean]
Failed requests: 0 (0%)
Successful requests: 1000 (100%)

# Pool status during test:
{
    "size": 52,
    "checked_out": 28,  # 54% utilization (healthy)
    "overflow": 0,
    "idle": 24
}
```

---
## Prevention Measures

### 1. Connection Leak Tests

```python
# tests/test_connection_leaks.py
import pytest

from database import engine

@pytest.fixture
def track_connections():
    before = engine.pool.checkedout()
    yield
    after = engine.pool.checkedout()
    assert after == before, f"Leaked {after - before} connections"
```
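A test can then opt in to the leak check simply by requesting the fixture. The endpoint payload and `client` fixture below are hypothetical:

```python
# Any test that exercises an error path can opt in to the leak check
# (hypothetical payload and `client` fixture)
def test_rejected_order_does_not_leak(client, track_connections):
    response = client.post("/orders", json={"total": 15000})  # exceeds the order limit
    assert response.status_code >= 400
    # track_connections asserts at teardown that no connection stayed checked out
```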
### 2. Pool Alerts

```yaml
# Alert if pool >80% for 5 minutes
expr: db_pool_checked_out / db_pool_size_total > 0.8
```

### 3. Health Check

```python
from sqlalchemy import text

@app.get("/health/database")
async def database_health():
    with Session(engine) as session:
        session.execute(text("SELECT 1"))
    pool = engine.pool
    return {"status": "healthy", "pool_utilization": pool.checkedout() / pool.size()}
```

### 4. Monitoring Commands

```bash
# Active connections
pscale shell db main --execute "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'"

# Slow queries
pscale database insights db main --slow-queries
```

---
## Lessons Learned

### What Went Well

- ✅ Quick identification of pool exhaustion (Prometheus alerts)
- ✅ Context manager pattern eliminated leaks
- ✅ Pool tuning based on formula (workers * 2 + buffer)
- ✅ Comprehensive monitoring added

### What Could Be Improved

- ❌ No pool monitoring before incident
- ❌ Pool size not calculated based on load
- ❌ Missing connection leak tests

### Key Takeaways

1. **Always use context managers** for database sessions
2. **Calculate pool size** based on workers and load
3. **Monitor pool utilization** with alerts at 80%
4. **Test for connection leaks** in CI/CD
5. **Add retry logic** for transient pool timeouts

---
## PlanetScale Best Practices

```bash
# Connection string with SSL
DATABASE_URL="postgresql://user:pass@aws.connect.psdb.cloud/db?sslmode=require"

# Schema changes via deploy requests
pscale deploy-request create db schema-update

# Test in branch
pscale branch create db test-feature
```

```sql
-- Index frequently queried columns
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Analyze slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
```

---
## Related Documentation

- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)

---

Return to [examples index](INDEX.md)