PlanetScale Connection Pool Exhaustion

Complete investigation of database connection pool exhaustion causing 503 errors, resolved through connection pool tuning and leak fixes.

Overview

  • Incident: Database connection timeouts causing a 15% request failure rate
  • Impact: Customer-facing 503 errors; rising support ticket volume
  • Root Cause: Connection pool too small, plus unclosed connections in error paths
  • Resolution: Pool tuning (20 → 52) and connection leak fixes
  • Status: Resolved

Incident Timeline

Time    Event                             Action
09:30   Alerts: High 503 error rate       Oncall paged
09:35   Investigation started             Check logs, metrics
09:45   Database connections at 100%      Identified pool exhaustion
10:00   Temporary fix: restart service    Bought time for root cause
10:30   Code analysis complete            Found connection leaks
11:00   Fix deployed (pool + leaks)       Production deployment
11:30   Monitoring confirmed stable       Incident resolved

Symptoms and Detection

Initial Alerts

Prometheus Alert:

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status="503"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  annotations:
    summary: "503 error rate >5% for 5 minutes"
    description: "Current error rate: {{ $value | humanizePercentage }}"

Error Logs:

[ERROR] Database query failed: connection timeout
[ERROR] Pool exhausted, waiting for available connection
[ERROR] Request timeout after 30s waiting for DB connection

Impact Metrics:

Error rate: 15% (150 failures per 1000 requests)
User complaints: 23 support tickets in 30 minutes
Failed transactions: ~$15,000 in abandoned carts

Diagnosis

Step 1: Check Connection Pool Status

Query PlanetScale:

# Connect to the database branch
pscale shell greyhaven-db main

-- Check active connections (run inside the shell)
SELECT
  COUNT(*) AS active_connections,
  MIN(query_start) AS oldest_query   -- MIN gives the oldest, longest-held query
FROM pg_stat_activity
WHERE state = 'active';

-- Result:
--   active_connections: 98
--   oldest_query: 2024-12-05 09:15:23 (15 minutes ago!)

Check Application Pool:

# In FastAPI app - add diagnostic endpoint
from sqlmodel import Session
from database import engine

@app.get("/pool-status")
def pool_status():
    pool = engine.pool
    return {
        "size": pool.size(),
        "checked_out": pool.checkedout(),
        "overflow": pool.overflow(),
        # _timeout and _max_overflow are private QueuePool attributes --
        # acceptable for a temporary diagnostic endpoint
        "timeout": pool._timeout,
        "max_overflow": pool._max_overflow
    }

# Response:
{
  "size": 20,
  "checked_out": 20,  # Pool exhausted!
  "overflow": 0,
  "timeout": 30,
  "max_overflow": 10
}

Red Flags:

  • Pool at 100% capacity (20/20 connections checked out)
  • No overflow connections being used (0/10)
  • Connections held for >15 minutes
  • New requests timing out waiting for connections
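
To see which sessions are actually holding connections, the same pg_stat_activity view queried in Step 1 can be inspected from the application side. A minimal sketch, assuming the application `engine` is importable (the columns are standard PostgreSQL):

# Sketch: list the longest-held active connections via the application engine
from sqlalchemy import text
from database import engine

with engine.connect() as conn:
    rows = conn.execute(text("""
        SELECT pid,
               state,
               now() - query_start AS age,
               left(query, 60) AS query
        FROM pg_stat_activity
        WHERE state = 'active'
        ORDER BY age DESC
        LIMIT 10
    """))
    for row in rows:
        print(row.pid, row.age, row.query)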

Step 2: Identify Connection Leaks

Code Review - Found Vulnerable Pattern:

# api/orders.py (BEFORE - LEAK)
from fastapi import APIRouter
from sqlmodel import Session, select
from database import engine

router = APIRouter()

@router.post("/orders")
async def create_order(order_data: OrderCreate):
    # ❌ LEAK: Session never closed on exception
    session = Session(engine)

    # Create order
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # If an exception is raised here, the function exits without
    # closing the session -- the connection stays checked out
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    # session.close() is never called on any path
    return order

How Leak Occurs:

  1. Request creates session (acquires connection from pool)
  2. Exception raised after commit
  3. Function exits without calling session.close()
  4. Connection remains "checked out" from pool
  5. After 20 such exceptions, pool exhausted
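
Before reproducing this through the API (Step 3), the mechanism can be seen in isolation. Below is a minimal, self-contained sketch using a throwaway SQLite engine with a deliberately tiny QueuePool (not the production PlanetScale engine); sessions that are opened but never closed keep their connections checked out until nothing is left:

# pool_exhaustion_demo.py - illustrative only
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from sqlalchemy.pool import QueuePool
from sqlmodel import Session, create_engine

# Tiny pool so exhaustion is quick to reproduce
engine = create_engine(
    "sqlite:///leak-demo.db",
    poolclass=QueuePool, pool_size=2, max_overflow=0, pool_timeout=1,
)

leaked = []
for _ in range(2):
    session = Session(engine)
    session.connection()     # checks a connection out of the pool
    leaked.append(session)   # never closed -- mimics the error path above

print("checked out:", engine.pool.checkedout())   # 2 of 2 -- pool exhausted

# The next checkout has nothing available and times out
try:
    Session(engine).connection()
except PoolTimeoutError as exc:
    print("pool timeout:", exc)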

Step 3: Load Testing to Reproduce

Test Script:

# test_connection_leak.py
import asyncio
import httpx

async def create_order(client, amount):
    """Create order that will trigger exception"""
    try:
        response = await client.post(
            "https://api.greyhaven.io/orders",
            json={"total": amount}
        )
        return response.status_code
    except Exception:
        return 503

async def load_test():
    """Simulate 100 orders with high amounts (triggers leak)"""
    async with httpx.AsyncClient() as client:
        # Every order exceeds the limit, so each handled request leaks a connection
        tasks = [create_order(client, 15000) for _ in range(100)]
        results = await asyncio.gather(*tasks)

        leaked = sum(1 for r in results if r == 500)    # handler raised -> connection leaked
        timeouts = sum(1 for r in results if r == 503)  # no connection available -> timeout

        print(f"Leaked (500): {leaked}, Pool timeouts (503): {timeouts}")

asyncio.run(load_test())

Results:

Leaked (500): 20         (first 20 requests consume every pooled connection)
Pool timeouts (503): 80  (remaining 80 time out waiting for the pool)

Proves: Connection leak exhausts pool

Resolution

Fix 1: Use Context Manager (Guaranteed Cleanup)

After - With Context Manager:

# api/orders.py (AFTER - FIXED)
from fastapi import APIRouter, Depends
from sqlmodel import Session
from database import engine

router = APIRouter()

# ✅ Dependency injection with automatic cleanup
def get_session():
    with Session(engine) as session:
        yield session
    # Session always closed (even on exception)

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session)
):
    # Session managed by FastAPI dependency
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # Exception here? No problem - session still closed by context manager
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    return order

Why This Works:

  • Context manager (with statement) guarantees session.close() in __exit__
  • Works even if exception raised
  • FastAPI Depends() handles async cleanup automatically
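
The behavior can also be verified end to end. A sketch of such a test, assuming the app and engine above are importable (module names here are hypothetical) and that no other requests are in flight:

# tests/test_session_cleanup.py - sketch
from fastapi.testclient import TestClient

from database import engine
from main import app   # hypothetical module that includes the orders router

def test_session_released_even_when_handler_raises():
    client = TestClient(app, raise_server_exceptions=False)
    client.post("/orders", json={"total": 15000})   # triggers the ValueError path
    # The dependency's context manager closed the session, so no connection is leaked
    assert engine.pool.checkedout() == 0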

Fix 2: Increase Connection Pool Size

Before (pool too small):

# database.py (BEFORE)
from sqlmodel import create_engine

engine = create_engine(
    database_url,
    pool_size=20,        # Too small for load
    max_overflow=10,
    pool_timeout=30
)

After (tuned for load):

# database.py (AFTER)
from sqlmodel import create_engine
import os

# Calculate pool size based on workers
# Formula: (workers * 2) + buffer
# 16 workers * 2 + 20 buffer = 52
workers = int(os.getenv("WEB_CONCURRENCY", 16))
pool_size = (workers * 2) + 20

engine = create_engine(
    database_url,
    pool_size=pool_size,      # 52 connections
    max_overflow=20,          # Burst to 72 total
    pool_timeout=30,
    pool_recycle=3600,        # Recycle after 1 hour
    pool_pre_ping=True,       # Verify connection health
    echo=False
)

Pool Size Calculation:

Workers: 16 (Uvicorn workers)
Connections per worker: 2 (normal peak)
Buffer: 20 (for spikes)

pool_size = (16 * 2) + 20 = 52
max_overflow = 20 (total 72 for extreme spikes)

Fix 3: Add Connection Pool Monitoring

Prometheus Metrics:

# monitoring.py
from prometheus_client import Gauge
from database import engine

# Pool metrics
db_pool_size = Gauge('db_pool_size_total', 'Total pool size')
db_pool_checked_out = Gauge('db_pool_checked_out', 'Connections in use')
db_pool_idle = Gauge('db_pool_idle', 'Idle connections')
db_pool_overflow = Gauge('db_pool_overflow', 'Overflow connections')

def update_pool_metrics():
    """Refresh the pool gauges from the current SQLAlchemy pool state"""
    pool = engine.pool
    db_pool_size.set(pool.size())
    db_pool_checked_out.set(pool.checkedout())
    db_pool_idle.set(pool.size() - pool.checkedout())
    db_pool_overflow.set(pool.overflow())

# Poll every 10 seconds from a background task (startup wiring sketched below)
import asyncio
async def pool_monitor():
    while True:
        update_pool_metrics()
        await asyncio.sleep(10)
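
One way to start the monitor is FastAPI's lifespan hook. A sketch, assuming the pool_monitor coroutine above lives in monitoring.py and the application object is created here:

# main.py - start the pool monitor at application startup
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

from monitoring import pool_monitor

@asynccontextmanager
async def lifespan(app: FastAPI):
    task = asyncio.create_task(pool_monitor())   # poll pool gauges every 10 seconds
    yield                                        # application serves requests
    task.cancel()                                # stop polling on shutdown

app = FastAPI(lifespan=lifespan)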

Prometheus Alert Rule:

- alert: DBPoolNearExhaustion
  expr: db_pool_checked_out / db_pool_size_total > 0.8
  for: 5m
  annotations:
    summary: "Connection pool >80% utilized"
    description: "{{ $value | humanizePercentage }} of pool in use"

Fix 4: Add Timeout and Retry Logic

Connection Timeout Handling:

# database.py - Retry connection checkout when the pool times out
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from sqlmodel import Session
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Note: tenacity cannot retry a generator dependency directly (calling a
# generator function only creates the generator), so the retry wraps the
# connection checkout instead.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    retry=retry_if_exception_type(PoolTimeoutError)
)
def _checkout_connection():
    """Check out a pooled connection, retrying with exponential backoff on pool timeout"""
    return engine.connect()

def get_session_with_retry():
    """FastAPI dependency: session bound to a connection acquired with retries"""
    connection = _checkout_connection()
    try:
        with Session(bind=connection) as session:
            yield session
    finally:
        connection.close()  # return the connection to the pool

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session_with_retry)
):
    # Connection checkout retries up to 3 times if the pool is momentarily exhausted
    ...

Results

Before vs After Metrics

Metric                 Before Fix         After Fix          Improvement
Connection Pool Size   20                 52                 +160% capacity
Pool Utilization       100% (exhausted)   40-60% (healthy)   40-60 point drop
503 Error Rate         15%                0.01%              99.9% reduction
DB Connection Wait     30s (timed out)    <100ms             99.7% faster
Leaked Connections     12/hour            0/day              100% eliminated

Deployment Verification

Load Test After Fix:

# Simulate 1000 concurrent orders
ab -n 1000 -c 50 -T application/json -p order.json https://api.greyhaven.io/orders

# Results:
Complete requests:      1000
Failed requests:        0
Requests per second:    250 [#/sec] (mean)
Time per request:       200 [ms] (mean)

# Pool status during test:
{
  "size": 52,
  "checked_out": 28,     # 54% utilization (healthy)
  "overflow": 0,
  "idle": 24
}

Prevention Measures

1. Connection Leak Tests

# tests/test_connection_leaks.py
import pytest

from database import engine

@pytest.fixture
def track_connections():
    """Fail any test that finishes with more connections checked out than it started with"""
    before = engine.pool.checkedout()
    yield
    after = engine.pool.checkedout()
    assert after == before, f"Leaked {after - before} connections"
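
A hypothetical test that opts in to leak detection (assumes a `client` TestClient fixture for the app):

def test_create_order_error_path_does_not_leak(track_connections, client):
    # Exercise the error path that originally leaked connections
    response = client.post("/orders", json={"total": 15000})
    assert response.status_code >= 400
    # track_connections asserts on teardown that the checked-out count is unchanged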

2. Pool Alerts

# Alert if pool >80% for 5 minutes
expr: db_pool_checked_out / db_pool_size_total > 0.8

3. Health Check

@app.get("/health/database")
async def database_health():
    with Session(engine) as session:
        session.execute("SELECT 1")
        return {"status": "healthy", "pool_utilization": pool.checkedout() / pool.size()}

4. Monitoring Commands

# Active connections
pscale shell db main --execute "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'"

# Slow queries
pscale database insights db main --slow-queries

Lessons Learned

What Went Well

  • Quick identification of pool exhaustion (Prometheus alerts)
  • Context manager pattern eliminated connection leaks
  • Pool tuning based on a formula (workers * 2 + buffer)
  • Comprehensive monitoring added

What Could Be Improved

  • No pool monitoring before the incident
  • Pool size not calculated based on load
  • Missing connection leak tests

Key Takeaways

  1. Always use context managers for database sessions
  2. Calculate pool size based on workers and load
  3. Monitor pool utilization with alerts at 80%
  4. Test for connection leaks in CI/CD
  5. Add retry logic for transient pool timeouts

PlanetScale Best Practices

# Connection string with SSL
DATABASE_URL="postgresql://user:pass@aws.connect.psdb.cloud/db?sslmode=require"

# Schema changes via deploy requests
pscale deploy-request create db schema-update

# Test in branch
pscale branch create db test-feature

-- Index frequently queried columns
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Analyze slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;


Return to examples index