PlanetScale Connection Pool Exhaustion
Complete investigation of database connection pool exhaustion causing 503 errors, resolved through connection pool tuning and leak fixes.
Overview
- Incident: Database connection timeouts causing a 15% request failure rate
- Impact: Customer-facing 503 errors, rising support ticket volume
- Root Cause: Connection pool too small, plus unclosed connections in error paths
- Resolution: Pool tuning (20 → 52) and connection leak fixes
- Status: Resolved
Incident Timeline
| Time | Event | Action |
|---|---|---|
| 09:30 | Alerts: High 503 error rate | Oncall paged |
| 09:35 | Investigation started | Check logs, metrics |
| 09:45 | Database connections at 100% | Identified pool exhaustion |
| 10:00 | Temporary fix: restart service | Bought time for root cause |
| 10:30 | Code analysis complete | Found connection leaks |
| 11:00 | Fix deployed (pool + leaks) | Production deployment |
| 11:30 | Monitoring confirmed stable | Incident resolved |
Symptoms and Detection
Initial Alerts
Prometheus Alert:
# Alert: HighErrorRate
expr: rate(http_requests_total{status="503"}[5m]) > 0.05
for: 5m
annotations:
  summary: "503 error rate >5% for 5 minutes"
  description: "Current rate: {{ $value | humanizePercentage }}"
Error Logs:
[ERROR] Database query failed: connection timeout
[ERROR] Pool exhausted, waiting for available connection
[ERROR] Request timeout after 30s waiting for DB connection
Impact Metrics:
- Error rate: 15% (150 failures per 1,000 requests)
- User complaints: 23 support tickets in 30 minutes
- Failed transactions: ~$15,000 in abandoned carts
Diagnosis
Step 1: Check Connection Pool Status
Query PlanetScale:
# Connect to the database
pscale shell greyhaven-db main

# Check active connections and how long the oldest has been running
SELECT
    COUNT(*) AS active_connections,
    MIN(query_start) AS oldest_query
FROM pg_stat_activity
WHERE state = 'active';

# Result:
# active_connections: 98
# oldest_query: 2024-12-05 09:15:23 (15 minutes ago!)
Check Application Pool:
# In the FastAPI app - add a diagnostic endpoint
from database import engine

@app.get("/pool-status")
def pool_status():
    pool = engine.pool
    return {
        "size": pool.size(),
        "checked_out": pool.checkedout(),
        "overflow": pool.overflow(),
        "timeout": pool._timeout,           # private attribute on QueuePool
        "max_overflow": pool._max_overflow  # private attribute on QueuePool
    }

# Response:
{
    "size": 20,
    "checked_out": 20,  # Pool exhausted!
    "overflow": 0,
    "timeout": 30,
    "max_overflow": 10
}
Red Flags:
- ✅ Pool at 100% capacity (20/20 connections checked out)
- ✅ No overflow connections being used (0/10)
- ✅ Connections held for >15 minutes
- ✅ New requests timing out waiting for connections
Step 2: Identify Connection Leaks
Code Review - Found Vulnerable Pattern:
# api/orders.py (BEFORE - LEAK)
from fastapi import APIRouter
from sqlmodel import Session, select
from database import engine

router = APIRouter()

@router.post("/orders")
async def create_order(order_data: OrderCreate):
    # ❌ LEAK: Session never closed on exception
    session = Session(engine)

    # Create order
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # If an exception is raised here, the session is never closed!
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    # session.close() is never reached
    return order
How Leak Occurs:
1. A request creates a session (acquiring a connection from the pool)
2. An exception is raised after commit
3. The function exits without calling session.close()
4. The connection remains "checked out" from the pool
5. After 20 such exceptions, the pool is exhausted (see the sketch below)
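A minimal sketch of the mechanism (standalone illustration, not the incident code): as long as a session that has checked out a connection is neither closed nor garbage-collected, its pool slot stays occupied.

# leak_demo.py - hypothetical illustration of the leak mechanism
from sqlmodel import Session
from database import engine

# Hold references, as error-handling middleware or logging often does
sessions = []

before = engine.pool.checkedout()
for _ in range(5):
    session = Session(engine)
    session.connection()      # checks a connection out of the pool
    sessions.append(session)
    # on the error path, session.close() is never called

# 5 connections are still checked out of the pool
print(engine.pool.checkedout() - before)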
Step 3: Load Testing to Reproduce
Test Script:
# test_connection_leak.py
import asyncio
import httpx

async def create_order(client, amount):
    """Create an order that triggers the exception path"""
    try:
        response = await client.post(
            "https://api.greyhaven.io/orders",
            json={"total": amount}
        )
        return response.status_code
    except Exception:
        return 503

async def load_test():
    """Simulate 100 orders with high amounts (triggers the leak)"""
    async with httpx.AsyncClient() as client:
        # Trigger 100 exceptions (leaking 100 connections)
        tasks = [create_order(client, 15000) for _ in range(100)]
        results = await asyncio.gather(*tasks)

        success = sum(1 for r in results if r == 201)
        errors = sum(1 for r in results if r == 503)
        print(f"Success: {success}, Errors: {errors}")

asyncio.run(load_test())
Results:
Success: 20 (the first 20 requests use all 20 connections)
Errors: 80 (the remaining 80 time out waiting for the pool)
Proves: the connection leak exhausts the pool
Resolution
Fix 1: Use Context Manager (Guaranteed Cleanup)
After - With Context Manager:
# api/orders.py (AFTER - FIXED)
from fastapi import APIRouter, Depends
from sqlmodel import Session
from database import engine

router = APIRouter()

# ✅ Dependency injection with automatic cleanup
def get_session():
    with Session(engine) as session:
        yield session
        # Session always closed (even on exception)

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session)
):
    # Session managed by the FastAPI dependency
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # Exception here? No problem - the session is still closed by the context manager
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    return order
Why This Works:
- The context manager (the with statement) guarantees session.close() runs in __exit__ (demonstrated in the sketch below)
- This holds even if an exception is raised
- FastAPI's Depends() handles the generator dependency's cleanup automatically
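A standalone check of that guarantee (a sketch using the same engine; nothing here is from the incident itself):

# Demonstration: the pool recovers even when the body raises
from sqlmodel import Session
from database import engine

before = engine.pool.checkedout()
try:
    with Session(engine) as session:
        session.connection()      # checks a connection out of the pool
        raise ValueError("boom")  # simulate the error path
except ValueError:
    pass

# __exit__ closed the session, so the connection went back to the pool
assert engine.pool.checkedout() == before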
Fix 2: Increase Connection Pool Size
Before (pool too small):
# database.py (BEFORE)
from sqlmodel import create_engine

engine = create_engine(
    database_url,
    pool_size=20,      # Too small for the load
    max_overflow=10,
    pool_timeout=30
)
After (tuned for load):
# database.py (AFTER)
import os

from sqlmodel import create_engine

# Calculate pool size based on workers
# Formula: (workers * 2) + buffer
# 16 workers * 2 + 20 buffer = 52
workers = int(os.getenv("WEB_CONCURRENCY", 16))
pool_size = (workers * 2) + 20

engine = create_engine(
    database_url,
    pool_size=pool_size,   # 52 connections
    max_overflow=20,       # Burst to 72 total
    pool_timeout=30,
    pool_recycle=3600,     # Recycle after 1 hour
    pool_pre_ping=True,    # Verify connection health
    echo=False
)
Pool Size Calculation:
Workers: 16 (Uvicorn workers)
Connections per worker: 2 (normal peak)
Buffer: 20 (for spikes)
pool_size = (16 * 2) + 20 = 52
max_overflow = 20 (total 72 for extreme spikes)
Fix 3: Add Connection Pool Monitoring
Prometheus Metrics:
# monitoring.py
import asyncio

from prometheus_client import Gauge

from database import engine

# Pool metrics
db_pool_size = Gauge('db_pool_size_total', 'Total pool size')
db_pool_checked_out = Gauge('db_pool_checked_out', 'Connections in use')
db_pool_idle = Gauge('db_pool_idle', 'Idle connections')
db_pool_overflow = Gauge('db_pool_overflow', 'Overflow connections')

def update_pool_metrics():
    """Refresh the pool gauges from the engine's pool"""
    pool = engine.pool
    db_pool_size.set(pool.size())
    db_pool_checked_out.set(pool.checkedout())
    db_pool_idle.set(pool.size() - pool.checkedout())
    db_pool_overflow.set(pool.overflow())

# Schedule in a background task
async def pool_monitor():
    """Update pool metrics every 10 seconds"""
    while True:
        update_pool_metrics()
        await asyncio.sleep(10)
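The coroutine still needs to be started somewhere. One way to wire it up (a sketch assuming a FastAPI lifespan handler in main.py; the module layout is an assumption):

# main.py - start the pool monitor alongside the app (illustrative)
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

from monitoring import pool_monitor

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Launch the metrics loop when the app starts
    task = asyncio.create_task(pool_monitor())
    yield
    # Stop it cleanly on shutdown
    task.cancel()

app = FastAPI(lifespan=lifespan)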
Grafana Alert:
# Alert: Connection pool near exhaustion
expr: db_pool_checked_out / db_pool_size_total > 0.8
for: 5m
annotations:
  summary: "Connection pool >80% utilized"
  description: "{{ $value | humanizePercentage }} of pool in use"
Fix 4: Add Timeout and Retry Logic
Connection Timeout Handling:
# database.py - Add connection acquisition retry
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from sqlmodel import Session
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(PoolTimeoutError),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10)
)
def _open_session() -> Session:
    """Open a session and check out a connection immediately, so a pool
    timeout surfaces here where tenacity can retry it with backoff."""
    session = Session(engine)
    try:
        session.connection()  # raises sqlalchemy.exc.TimeoutError if the pool is exhausted
    except Exception:
        session.close()
        raise
    return session

def get_session_with_retry():
    """FastAPI dependency: yields a session, retrying acquisition on pool timeout"""
    session = _open_session()
    try:
        yield session
    finally:
        session.close()

# api/orders.py
@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session_with_retry)
):
    # Connection acquisition retries up to 3 times if the pool is exhausted
    ...
Results
Before vs After Metrics
| Metric | Before Fix | After Fix | Improvement |
|---|---|---|---|
| Connection Pool Size | 20 | 52 | +160% capacity |
| Pool Utilization | 100% (exhausted) | 40-60% (healthy) | 40-60 point reduction |
| 503 Error Rate | 15% | 0.01% | 99.9% reduction |
| Request Timeout | 30s (waiting) | <100ms | 99.7% faster |
| Leaked Connections | 12/hour | 0/day | 100% eliminated |
Deployment Verification
Load Test After Fix:
# Simulate 1000 orders with 50 concurrent connections
ab -n 1000 -c 50 -T application/json -p order.json https://api.greyhaven.io/orders
# Results:
Requests per second: 250 [#/sec]
Time per request: 200ms [mean]
Failed requests: 0 (0%)
Successful requests: 1000 (100%)
# Pool status during test:
{
    "size": 52,
    "checked_out": 28,  # 54% utilization (healthy)
    "overflow": 0,
    "idle": 24
}
Prevention Measures
1. Connection Leak Tests
# tests/test_connection_leaks.py
import pytest

from database import engine

@pytest.fixture
def track_connections():
    """Fail the test if it checks out connections it never returns"""
    before = engine.pool.checkedout()
    yield
    after = engine.pool.checkedout()
    assert after == before, f"Leaked {after - before} connections"
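Used like this in the same test module (a sketch; the TestClient setup, app location, and payload are assumptions, and catching a reintroduced leak can depend on when leaked sessions are garbage-collected):

from fastapi.testclient import TestClient

from main import app  # assumption: the FastAPI app lives in main.py

client = TestClient(app, raise_server_exceptions=False)

def test_create_order_does_not_leak(track_connections):
    # Exercise the error path that originally leaked a connection
    client.post("/orders", json={"total": 15000})
    # track_connections asserts the checked-out count is unchanged on teardown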
2. Pool Alerts
# Alert if pool >80% for 5 minutes
expr: db_pool_checked_out / db_pool_size_total > 0.8
3. Health Check
@app.get("/health/database")
async def database_health():
with Session(engine) as session:
session.execute("SELECT 1")
return {"status": "healthy", "pool_utilization": pool.checkedout() / pool.size()}
4. Monitoring Commands
# Active connections
pscale shell db main --execute "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'"
# Slow queries
pscale database insights db main --slow-queries
Lessons Learned
What Went Well
- ✅ Quick identification of pool exhaustion (Prometheus alerts)
- ✅ Context manager pattern eliminated the leaks
- ✅ Pool tuning based on a formula (workers * 2 + buffer)
- ✅ Comprehensive monitoring added
What Could Be Improved
- ❌ No pool monitoring before the incident
- ❌ Pool size not calculated based on load
- ❌ Missing connection leak tests
Key Takeaways
- Always use context managers for database sessions
- Calculate pool size based on workers and load
- Monitor pool utilization with alerts at 80%
- Test for connection leaks in CI/CD
- Add retry logic for transient pool timeouts
PlanetScale Best Practices
# Connection string with SSL
DATABASE_URL="postgresql://user:pass@aws.connect.psdb.cloud/db?sslmode=require"
# Schema changes via deploy requests
pscale deploy-request create db schema-update
# Test in branch
pscale branch create db test-feature
-- Index frequently queried columns
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Analyze slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
Related Documentation
- Worker Deployment: cloudflare-worker-deployment-failure.md
- Network Debugging: distributed-system-debugging.md
- Performance: performance-degradation-analysis.md
- Runbooks: ../reference/troubleshooting-runbooks.md