# PlanetScale Connection Pool Exhaustion

Complete investigation of database connection pool exhaustion causing 503 errors, resolved through connection pool tuning and leak fixes.

## Overview

**Incident**: Database connection timeouts causing a 15% request failure rate
**Impact**: Customer-facing 503 errors, support tickets increasing
**Root Cause**: Connection pool too small + unclosed connections in error paths
**Resolution**: Pool tuning (20→52) + connection leak fixes
**Status**: Resolved

## Incident Timeline

| Time | Event | Action |
|------|-------|--------|
| 09:30 | Alerts: High 503 error rate | Oncall paged |
| 09:35 | Investigation started | Check logs, metrics |
| 09:45 | Database connections at 100% | Identified pool exhaustion |
| 10:00 | Temporary fix: restart service | Bought time for root cause |
| 10:30 | Code analysis complete | Found connection leaks |
| 11:00 | Fix deployed (pool + leaks) | Production deployment |
| 11:30 | Monitoring confirmed stable | Incident resolved |

---
## Symptoms and Detection

### Initial Alerts

**Prometheus Alert**:
```yaml
# Alert: HighErrorRate
expr: rate(http_requests_total{status="503"}[5m]) > 0.05
for: 5m
annotations:
  summary: "503 error rate >5% for 5 minutes"
  description: "Current rate: {{ $value | humanizePercentage }}"
```

**Error Logs**:
```
[ERROR] Database query failed: connection timeout
[ERROR] Pool exhausted, waiting for available connection
[ERROR] Request timeout after 30s waiting for DB connection
```

**Impact Metrics**:
```
Error rate: 15% (150 failures per 1000 requests)
User complaints: 23 support tickets in 30 minutes
Failed transactions: ~$15,000 in abandoned carts
```

---
## Diagnosis

### Step 1: Check Connection Pool Status

**Query PlanetScale**:
```bash
# Connect to database
pscale shell greyhaven-db main

# Check active connections
SELECT
    COUNT(*) as active_connections,
    MIN(pg_stat_activity.query_start) as oldest_query
FROM pg_stat_activity
WHERE state = 'active';

# Result:
# active_connections: 98
# oldest_query: 2024-12-05 09:15:23 (15 minutes ago!)
```

**Check Application Pool**:
```python
# In FastAPI app - add diagnostic endpoint
from sqlmodel import Session
from database import engine

@app.get("/pool-status")
def pool_status():
    pool = engine.pool
    return {
        "size": pool.size(),
        "checked_out": pool.checkedout(),
        "overflow": pool.overflow(),
        "timeout": pool._timeout,
        "max_overflow": pool._max_overflow
    }

# Response:
{
    "size": 20,
    "checked_out": 20,  # Pool exhausted!
    "overflow": 0,
    "timeout": 30,
    "max_overflow": 10
}
```
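To keep an eye on this endpoint during an incident, a small poller is handy. A minimal sketch, assuming the diagnostic endpoint is exposed at the API host used elsewhere in this doc; the 90% threshold and 10-second interval are arbitrary choices, not incident tooling:

```python
# watch_pool.py - poll the /pool-status endpoint and warn on high utilization
# (diagnostic sketch; URL, threshold, and interval are assumptions)
import time

import httpx

POOL_STATUS_URL = "https://api.greyhaven.io/pool-status"  # assumed location of the endpoint above

def watch(threshold: float = 0.9, interval_seconds: int = 10) -> None:
    while True:
        status = httpx.get(POOL_STATUS_URL, timeout=5).json()
        utilization = status["checked_out"] / status["size"]
        if utilization >= threshold:
            print(f"WARNING: pool at {utilization:.0%} ({status['checked_out']}/{status['size']})")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    watch()
```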
**Red Flags**:
- ✅ Pool at 100% capacity (20/20 connections checked out)
- ✅ No overflow connections being used (0/10)
- ✅ Connections held for >15 minutes
- ✅ New requests timing out waiting for connections

---
### Step 2: Identify Connection Leaks

**Code Review - Found Vulnerable Pattern**:
```python
# api/orders.py (BEFORE - LEAK)
from fastapi import APIRouter
from sqlmodel import Session, select
from database import engine

router = APIRouter()

@router.post("/orders")
async def create_order(order_data: OrderCreate):
    # ❌ LEAK: Session never closed on exception
    session = Session(engine)

    # Create order
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # If exception here, session never closed!
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    # session.close() never reached
    return order
```

**How Leak Occurs**:
1. Request creates session (acquires connection from pool)
2. Exception raised after commit
3. Function exits without calling `session.close()`
4. Connection remains "checked out" from pool
5. After 20 such exceptions, pool exhausted (a diagnostic sketch follows)
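
To trace which code paths are holding connections, the pool's checkout/checkin events can be hooked. This is a diagnostic sketch rather than part of the deployed fix; it assumes the same `engine` from `database.py`, and the module name and stack depth are arbitrary:

```python
# leak_trace.py - record where each pooled connection was checked out
# (diagnostic sketch, not part of the deployed fix)
import traceback

from sqlalchemy import event

from database import engine  # same engine used by the leaking endpoint

@event.listens_for(engine, "checkout")
def on_checkout(dbapi_conn, connection_record, connection_proxy):
    # Stash the call stack at checkout time; a connection that never gets a
    # matching checkin still carries this stack, pointing at the leaking code path.
    connection_record.info["checkout_stack"] = traceback.format_stack(limit=10)

@event.listens_for(engine, "checkin")
def on_checkin(dbapi_conn, connection_record):
    connection_record.info.pop("checkout_stack", None)
```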
---

### Step 3: Load Testing to Reproduce

**Test Script**:
```python
# test_connection_leak.py
import asyncio
import httpx

async def create_order(client, amount):
    """Create an order that will trigger the exception path"""
    try:
        response = await client.post(
            "https://api.greyhaven.io/orders",
            json={"total": amount}
        )
        return response.status_code
    except Exception:
        return 503

async def load_test():
    """Simulate 100 orders with high amounts (triggers the leak)"""
    async with httpx.AsyncClient() as client:
        # Trigger 100 exceptions (each one leaks a connection)
        tasks = [create_order(client, 15000) for _ in range(100)]
        results = await asyncio.gather(*tasks)

        success = sum(1 for r in results if r == 201)
        errors = sum(1 for r in results if r == 503)

        print(f"Success: {success}, Errors: {errors}")

asyncio.run(load_test())
```

**Results**:
```
First 20 requests: reach the handler and leak all 20 pool connections
Remaining 80 requests: 503 after timing out waiting for a connection

Proves: Connection leak exhausts pool
```

---
## Resolution

### Fix 1: Use Context Manager (Guaranteed Cleanup)

**After - With Context Manager**:
```python
# api/orders.py (AFTER - FIXED)
from fastapi import APIRouter, Depends
from sqlmodel import Session
from database import engine

router = APIRouter()

# ✅ Dependency injection with automatic cleanup
def get_session():
    with Session(engine) as session:
        yield session
        # Session always closed (even on exception)

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session)
):
    # Session managed by FastAPI dependency
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # Exception here? No problem - session still closed by context manager
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    return order
```

**Why This Works**:
- Context manager (`with` statement) guarantees `session.close()` in `__exit__`
- Works even if an exception is raised
- FastAPI `Depends()` handles generator cleanup automatically (the sketch below shows the equivalent `try`/`finally`)
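
For intuition, the dependency above behaves roughly like this hand-written version; an illustrative sketch, not code from the incident:

```python
# Roughly what the get_session() dependency above guarantees
# (illustrative sketch only)
from sqlmodel import Session

from database import engine

def get_session_explicit():
    session = Session(engine)
    try:
        yield session      # the request handler runs while the session is checked out
    finally:
        session.close()    # always runs: normal return or any exception
```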
---

### Fix 2: Increase Connection Pool Size

**Before** (pool too small):
```python
# database.py (BEFORE)
from sqlmodel import create_engine

engine = create_engine(
    database_url,
    pool_size=20,       # Too small for load
    max_overflow=10,
    pool_timeout=30
)
```

**After** (tuned for load):
```python
# database.py (AFTER)
from sqlmodel import create_engine
import os

# Calculate pool size based on workers
# Formula: (workers * 2) + buffer
# 16 workers * 2 + 20 buffer = 52
workers = int(os.getenv("WEB_CONCURRENCY", 16))
pool_size = (workers * 2) + 20

engine = create_engine(
    database_url,
    pool_size=pool_size,     # 52 connections
    max_overflow=20,         # Burst to 72 total
    pool_timeout=30,
    pool_recycle=3600,       # Recycle after 1 hour
    pool_pre_ping=True,      # Verify connection health
    echo=False
)
```

**Pool Size Calculation**:
```
Workers: 16 (Uvicorn workers)
Connections per worker: 2 (normal peak)
Buffer: 20 (for spikes)

pool_size = (16 * 2) + 20 = 52
max_overflow = 20 (total 72 for extreme spikes)
```

---
### Fix 3: Add Connection Pool Monitoring

**Prometheus Metrics**:
```python
# monitoring.py
import asyncio

from prometheus_client import Gauge
from database import engine

# Pool metrics
db_pool_size = Gauge('db_pool_size_total', 'Total pool size')
db_pool_checked_out = Gauge('db_pool_checked_out', 'Connections in use')
db_pool_idle = Gauge('db_pool_idle', 'Idle connections')
db_pool_overflow = Gauge('db_pool_overflow', 'Overflow connections')

def update_pool_metrics():
    """Update pool metrics every 10 seconds"""
    pool = engine.pool
    db_pool_size.set(pool.size())
    db_pool_checked_out.set(pool.checkedout())
    db_pool_idle.set(pool.size() - pool.checkedout())
    db_pool_overflow.set(pool.overflow())

# Schedule in background task
async def pool_monitor():
    while True:
        update_pool_metrics()
        await asyncio.sleep(10)
```
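The monitoring loop still has to be started somewhere. One option is a FastAPI lifespan hook; this is a sketch, and the `main.py` location and import path are assumptions:

```python
# main.py - start the pool monitor alongside the app
# (sketch; file location and import path are assumptions)
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

from monitoring import pool_monitor

@asynccontextmanager
async def lifespan(app: FastAPI):
    task = asyncio.create_task(pool_monitor())  # run the 10s metrics loop in the background
    yield
    task.cancel()  # stop the loop on shutdown

app = FastAPI(lifespan=lifespan)
```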
**Grafana Alert**:
```yaml
# Alert: Connection pool near exhaustion
expr: db_pool_checked_out / db_pool_size_total > 0.8
for: 5m
annotations:
  summary: "Connection pool >80% utilized"
  description: "{{ $value | humanizePercentage }} of pool in use"
```

---

### Fix 4: Add Timeout and Retry Logic

**Connection Timeout Handling**:
```python
# database.py - Add connection retry
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from sqlmodel import Session
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    reraise=True
)
def _open_session() -> Session:
    """Open a session and force a connection checkout; retried on pool timeout."""
    session = Session(engine)
    try:
        session.connection()  # raises sqlalchemy.exc.TimeoutError if the pool is exhausted
    except PoolTimeoutError:
        # Pool exhausted - close and retry after exponential backoff
        session.close()
        raise
    return session

def get_session_with_retry():
    """FastAPI dependency: session with automatic retry on pool timeout."""
    session = _open_session()
    try:
        yield session
    finally:
        session.close()

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session_with_retry)
):
    # Will retry up to 3 times if pool exhausted
    ...
```
## Results

### Before vs After Metrics

| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **Connection Pool Size** | 20 | 52 | +160% capacity |
| **Pool Utilization** | 100% (exhausted) | 40-60% (healthy) | -40% utilization |
| **503 Error Rate** | 15% | 0.01% | **99.9% reduction** |
| **Request Timeout** | 30s (waiting) | <100ms | **99.7% faster** |
| **Leaked Connections** | 12/hour | 0/day | **100% eliminated** |

---
### Deployment Verification

**Load Test After Fix**:
```bash
# Simulate 1000 concurrent orders
ab -n 1000 -c 50 -T application/json -p order.json https://api.greyhaven.io/orders

# Results:
Requests per second: 250 [#/sec]
Time per request: 200ms [mean]
Failed requests: 0 (0%)
Successful requests: 1000 (100%)

# Pool status during test:
{
    "size": 52,
    "checked_out": 28,  # 54% utilization (healthy)
    "overflow": 0,
    "idle": 24
}
```

---
## Prevention Measures

### 1. Connection Leak Tests

```python
# tests/test_connection_leaks.py
import pytest

from database import engine

@pytest.fixture
def track_connections():
    before = engine.pool.checkedout()
    yield
    after = engine.pool.checkedout()
    assert after == before, f"Leaked {after - before} connections"
```
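A test can then opt in to the leak check simply by requesting the fixture. The endpoint payload and `client` fixture below are hypothetical:

```python
# Any test that exercises an error path can opt in to the leak check
# (hypothetical payload and `client` fixture)
def test_rejected_order_does_not_leak(client, track_connections):
    response = client.post("/orders", json={"total": 15000})  # exceeds the order limit
    assert response.status_code >= 400
    # track_connections asserts at teardown that no connection stayed checked out
```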
### 2. Pool Alerts

```yaml
# Alert if pool >80% for 5 minutes
expr: db_pool_checked_out / db_pool_size_total > 0.8
```

### 3. Health Check

```python
from sqlalchemy import text

@app.get("/health/database")
async def database_health():
    with Session(engine) as session:
        session.execute(text("SELECT 1"))
    pool = engine.pool
    return {"status": "healthy", "pool_utilization": pool.checkedout() / pool.size()}
```

### 4. Monitoring Commands

```bash
# Active connections
pscale shell db main --execute "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'"

# Slow queries
pscale database insights db main --slow-queries
```

---
## Lessons Learned

### What Went Well

- ✅ Quick identification of pool exhaustion (Prometheus alerts)
- ✅ Context manager pattern eliminated leaks
- ✅ Pool tuning based on formula (workers * 2 + buffer)
- ✅ Comprehensive monitoring added

### What Could Be Improved

- ❌ No pool monitoring before incident
- ❌ Pool size not calculated based on load
- ❌ Missing connection leak tests

### Key Takeaways

1. **Always use context managers** for database sessions
2. **Calculate pool size** based on workers and load
3. **Monitor pool utilization** with alerts at 80%
4. **Test for connection leaks** in CI/CD
5. **Add retry logic** for transient pool timeouts

---
## PlanetScale Best Practices

```bash
# Connection string with SSL
DATABASE_URL="postgresql://user:pass@aws.connect.psdb.cloud/db?sslmode=require"

# Schema changes via deploy requests
pscale deploy-request create db schema-update

# Test in branch
pscale branch create db test-feature
```

```sql
-- Index frequently queried columns
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Analyze slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
```

---
## Related Documentation

- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)

---

Return to [examples index](INDEX.md)