Initial commit

Zhongwei Li
2025-11-29 18:29:23 +08:00
commit ebc71f5387
37 changed files with 9382 additions and 0 deletions

# DevOps Troubleshooter Examples
Real-world infrastructure troubleshooting scenarios for Grey Haven's Cloudflare Workers + PlanetScale PostgreSQL stack.
## Examples Overview
### 1. Cloudflare Worker Deployment Failure
**File**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
**Scenario**: Worker deployment fails with "Script exceeds size limit" error
**Stack**: Cloudflare Workers, wrangler, webpack bundling
**Impact**: Production deployment blocked for 2 hours
**Resolution**: Bundle size reduction (5.2MB → 450KB, 180KB compressed) via tree shaking and code splitting
**Lines**: ~450 lines
### 2. PlanetScale Connection Pool Exhaustion
**File**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
**Scenario**: Database connection timeouts causing 503 errors
**Stack**: PlanetScale PostgreSQL, connection pooling, FastAPI
**Impact**: 15% of requests failing, customer complaints
**Resolution**: Connection pool tuning, connection leak fixes
**Lines**: ~430 lines
### 3. Distributed System Network Debugging
**File**: [distributed-system-debugging.md](distributed-system-debugging.md)
**Scenario**: Intermittent 504 Gateway Timeout errors between services
**Stack**: Cloudflare Workers, external APIs, DNS, CORS
**Impact**: 5% of API calls failing, no clear pattern
**Resolution**: DNS caching issue, worker timeout configuration
**Lines**: ~420 lines
### 4. Performance Degradation Analysis
**File**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
**Scenario**: API response times increased from 200ms to 2000ms
**Stack**: Cloudflare Workers, PlanetScale, caching layer
**Impact**: User-facing slowness, poor UX
**Resolution**: N+1 query elimination, caching strategy, index optimization
**Lines**: ~410 lines
---
## Quick Navigation
**By Issue Type**:
- Deployment failures → [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- Database issues → [planetscale-connection-issues.md](planetscale-connection-issues.md)
- Network problems → [distributed-system-debugging.md](distributed-system-debugging.md)
- Performance issues → [performance-degradation-analysis.md](performance-degradation-analysis.md)
**By Stack Component**:
- Cloudflare Workers → Examples 1, 3, 4
- PlanetScale PostgreSQL → Examples 2, 4
- Distributed Systems → Example 3
---
## Related Documentation
- **Reference**: [Reference Index](../reference/INDEX.md) - Runbooks and diagnostic commands
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident templates
- **Main Agent**: [devops-troubleshooter.md](../devops-troubleshooter.md) - DevOps troubleshooter agent
---
Return to [main agent](../devops-troubleshooter.md)

# Cloudflare Worker Deployment Failure Investigation
Complete troubleshooting workflow for "Script exceeds size limit" deployment failure, resolved through bundle optimization and code splitting.
## Overview
**Incident**: Worker deployment failing with size limit error
**Impact**: Production deployment blocked for 2 hours
**Root Cause**: Bundle size grew from 1.2MB to 5.2MB after adding dependencies
**Resolution**: Bundle optimization (tree shaking, code splitting, dropping bundled SDKs) reduced the bundle to 450KB (180KB compressed)
**Status**: Resolved
## Incident Timeline
| Time | Event | Action |
|------|-------|--------|
| 14:00 | Deployment initiated via CI/CD | `wrangler deploy` triggered |
| 14:02 | Deployment failed | Error: "Script exceeds 1MB size limit" |
| 14:05 | Investigation started | Check recent code changes |
| 14:15 | Root cause identified | New dependencies increased bundle size |
| 14:30 | Fix implemented | Bundle optimization applied |
| 14:45 | Fix deployed | Successful deployment to production |
| 16:00 | Monitoring complete | Confirmed stable deployment |
---
## Symptoms and Detection
### Initial Error
**Deployment Command**:
```bash
$ wrangler deploy
[ERROR] Script exceeds the size limit (5.2MB > 1MB after compression)
```
**CI/CD Pipeline Failure**:
```yaml
# GitHub Actions output
Step: Deploy to Cloudflare Workers
✓ Build completed (5.2MB bundle)
✗ Deployment failed: Script size exceeds limit
Error: Workers Free plan limit is 1MB compressed
```
**Impact**:
- Production deployment blocked
- New features stuck in staging
- Team unable to deploy hotfixes
---
## Diagnosis
### Step 1: Check Bundle Size
**Before Investigation**:
```bash
# Build the worker locally
npm run build
# Check output size
ls -lh dist/
-rw-r--r-- 1 user staff 5.2M Dec 5 14:10 worker.js
```
**Analyze Bundle Composition**:
```bash
# Use webpack-bundle-analyzer
npm install --save-dev webpack-bundle-analyzer
# Add to webpack.config.js
const BundleAnalyzerPlugin = require('webpack-bundle-analyzer').BundleAnalyzerPlugin;
module.exports = {
plugins: [
new BundleAnalyzerPlugin()
]
};
# Build and open analyzer
npm run build
# Opens http://127.0.0.1:8888 with visual bundle breakdown
```
**Bundle Analyzer Findings**:
```
Total Size: 5.2MB
Breakdown:
- @anthropic-ai/sdk: 2.1MB (40%)
- aws-sdk: 1.8MB (35%)
- lodash: 800KB (15%)
- moment: 300KB (6%)
- application code: 200KB (4%)
```
**Red Flags**:
1. Full `aws-sdk` imported (only needed S3)
2. Entire `lodash` library (only using 3 functions)
3. `moment` included (native Date API would suffice)
4. Large AI SDK (only using text generation)
---
### Step 2: Identify Recent Changes
**Git Diff**:
```bash
# Check what changed in last deploy
git diff HEAD~1 HEAD -- src/
# Key changes:
+ import { Anthropic } from '@anthropic-ai/sdk';
+ import AWS from 'aws-sdk';
+ import _ from 'lodash';
+ import moment from 'moment';
```
**PR Analysis**:
```
PR #234: Add AI content generation feature
- Added @anthropic-ai/sdk (full SDK)
- Added AWS S3 integration (full aws-sdk)
- Used lodash for data manipulation
- Used moment for date formatting
Result: Bundle size increased by 4MB
```
---
### Step 3: Cloudflare Worker Size Limits
**Plan Limits**:
```
Workers Free: 1MB compressed
Workers Paid: 10MB compressed
Current plan: Workers Free
Current size: 5.2MB (over limit)
Options:
1. Upgrade to Workers Paid ($5/month)
2. Reduce bundle size to <1MB
3. Split into multiple workers
```
**Decision**: Reduce bundle size (no budget for upgrade)
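Because the limit applies to the compressed script, it helps to check the gzipped size locally before pushing; Cloudflare's own compression may differ slightly, so treat this as an approximation. A minimal Node/TypeScript sketch, assuming the bundle is emitted at `dist/worker.js`:
```typescript
// check-bundle-size.ts - approximate the compressed size of the built worker
import { readFileSync } from 'node:fs';
import { gzipSync } from 'node:zlib';

const BUNDLE_PATH = 'dist/worker.js';   // assumed build output path
const LIMIT_BYTES = 1024 * 1024;        // Workers Free: 1MB compressed

const raw = readFileSync(BUNDLE_PATH);
const gz = gzipSync(raw);               // gzip approximates the upload compression

console.log(`raw: ${(raw.length / 1024).toFixed(0)}KB, gzip: ${(gz.length / 1024).toFixed(0)}KB`);
if (gz.length > LIMIT_BYTES) {
  console.error('Compressed bundle exceeds the 1MB Workers Free limit');
  process.exit(1);
}
```
Run it after `npm run build` (e.g. with `npx tsx check-bundle-size.ts`).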
---
## Resolution
### Fix 1: Tree Shaking with Named Imports
**Before** (imports entire libraries):
```typescript
// ❌ BAD: Imports full library
import _ from 'lodash';
import moment from 'moment';
import AWS from 'aws-sdk';
// Usage:
const unique = _.uniq(array);
const date = moment().format('YYYY-MM-DD');
const s3 = new AWS.S3();
```
**After** (imports only needed functions):
```typescript
// ✅ GOOD: Named imports enable tree shaking
import { uniq, map, filter } from 'lodash-es';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
// ✅ BETTER: Native alternatives
const unique = [...new Set(array)];
const date = new Date().toISOString().split('T')[0];
// S3 client (v3 - modular)
const s3 = new S3Client({ region: 'us-east-1' });
```
**Size Reduction**:
```
Before:
- lodash: 800KB → lodash-es tree-shaken: 50KB (94% reduction)
- moment: 300KB → native Date: 0KB (100% reduction)
- aws-sdk: 1.8MB → @aws-sdk/client-s3: 200KB (89% reduction)
```
---
### Fix 2: External Dependencies (Don't Bundle Large SDKs)
**Before**:
```typescript
// worker.ts - bundled @anthropic-ai/sdk (2.1MB)
import { Anthropic } from '@anthropic-ai/sdk';
const client = new Anthropic({
apiKey: env.ANTHROPIC_API_KEY
});
```
**After** (use fetch directly):
```typescript
// worker.ts - use native fetch (0KB)
async function callAnthropic(prompt: string, env: Env) {
const response = await fetch('https://api.anthropic.com/v1/messages', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-api-key': env.ANTHROPIC_API_KEY,
'anthropic-version': '2023-06-01'
},
body: JSON.stringify({
model: 'claude-3-sonnet-20240229',
max_tokens: 1024,
messages: [
{ role: 'user', content: prompt }
]
})
});
return response.json();
}
```
**Size Reduction**:
```
Before: @anthropic-ai/sdk: 2.1MB
After: Native fetch: 0KB
Savings: 2.1MB (100% reduction)
```
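Dropping the SDK also drops its TypeScript types, so it helps to declare a minimal local shape for the response. A sketch that builds on `callAnthropic` above; the field names mirror the public Messages API response but should be verified against current docs:
```typescript
// Minimal response shape for the Messages API call above
// (only the fields this worker reads; verify against current API docs)
interface AnthropicMessageResponse {
  id: string;
  model: string;
  content: Array<{ type: string; text?: string }>;
  stop_reason: string | null;
}

async function generateText(prompt: string, env: Env): Promise<string> {
  const data = (await callAnthropic(prompt, env)) as AnthropicMessageResponse;
  return data.content.map(block => block.text ?? '').join('');
}
```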
---
### Fix 3: Code Splitting (Async Imports)
**Before** (everything bundled):
```typescript
// worker.ts
import { expensiveFunction } from './expensive-module';
export default {
async fetch(request: Request, env: Env) {
// Even if not used, expensive-module is in bundle
if (request.url.includes('/special')) {
return expensiveFunction(request);
}
return new Response('OK');
}
};
```
**After** (lazy load):
```typescript
// worker.ts
export default {
async fetch(request: Request, env: Env) {
if (request.url.includes('/special')) {
// Only load when needed (separate chunk)
const { expensiveFunction } = await import('./expensive-module');
return expensiveFunction(request);
}
return new Response('OK');
}
};
```
**Size Reduction**:
```
Main bundle: 1.8MB → 500KB (72% reduction)
expensive-module chunk: Loaded on-demand (lazy)
```
---
### Fix 4: Webpack Configuration Optimization
**Updated webpack.config.js**:
```javascript
const webpack = require('webpack');
const path = require('path');
module.exports = {
entry: './src/worker.ts',
target: 'webworker',
mode: 'production',
optimization: {
minimize: true,
usedExports: true, // Tree shaking
sideEffects: false,
},
resolve: {
extensions: ['.ts', '.js'],
alias: {
// Replace heavy libraries with lighter alternatives
'moment': 'date-fns',
'lodash': 'lodash-es'
}
},
module: {
rules: [
{
test: /\.ts$/,
use: {
loader: 'ts-loader',
options: {
transpileOnly: true,
compilerOptions: {
module: 'esnext', // Enable tree shaking
moduleResolution: 'node'
}
}
},
exclude: /node_modules/
}
]
},
plugins: [
new webpack.DefinePlugin({
'process.env.NODE_ENV': JSON.stringify('production')
})
],
output: {
filename: 'worker.js',
path: path.resolve(__dirname, 'dist'),
libraryTarget: 'commonjs2'
}
};
```
---
## Results
### Bundle Size Comparison
| Category | Before | After | Reduction |
|----------|--------|-------|-----------|
| **@anthropic-ai/sdk** | 2.1MB | 0KB (fetch) | -100% |
| **aws-sdk** | 1.8MB | 200KB (v3) | -89% |
| **lodash** | 800KB | 50KB (tree-shaken) | -94% |
| **moment** | 300KB | 0KB (native Date) | -100% |
| **Application code** | 200KB | 200KB | 0% |
| **TOTAL** | **5.2MB** | **450KB** | **-91%** |
**Compressed Size**:
- Before: 5.2MB → 1.8MB compressed (over 1MB limit)
- After: 450KB → 180KB compressed (under 1MB limit)
---
### Deployment Verification
**Successful Deployment**:
```bash
$ wrangler deploy
✔ Building...
✔ Validating...
Bundle size: 450KB (180KB compressed)
✔ Uploading...
✔ Deployed to production
Production URL: https://api.greyhaven.io
Worker ID: worker-abc123
```
**Load Testing**:
```bash
# Before optimization (would fail deployment)
# Bundle: 5.2MB, deploy: FAIL
# After optimization
$ ab -n 1000 -c 10 https://api.greyhaven.io/
Requests per second: 1250 [#/sec]
Time per request: 8ms [mean]
Successful requests: 1000 (100%)
Bundle size: 450KB ✓
```
---
## Prevention Measures
### 1. CI/CD Bundle Size Check
```yaml
# .github/workflows/deploy.yml - Add size validation
steps:
  - run: npm ci && npm run build
  - name: Check bundle size
    run: |
      SIZE_MB=$(wc -c < dist/worker.js | awk '{print $1/1048576}')
      if (( $(echo "$SIZE_MB > 1.0" | bc -l) )); then
        echo "❌ Bundle exceeds 1MB"; exit 1
      fi
  - run: npx wrangler deploy
```
### 2. Pre-commit Hook
```bash
# .git/hooks/pre-commit
SIZE_MB=$(wc -c < dist/worker.js | awk '{print $1/1048576}')
(( $(echo "$SIZE_MB < 1.0" | bc -l) )) || { echo "❌ Bundle >1MB"; exit 1; }
```
### 3. PR Template
```markdown
## Bundle Impact
- [ ] Bundle size <800KB
- [ ] Tree shaking verified
Size: [Before → After]
```
### 4. Automated Analysis
```json
{
"scripts": {
"analyze": "webpack --profile --json > stats.json && webpack-bundle-analyzer stats.json"
}
}
```
---
## Lessons Learned
### What Went Well
✅ Identified root cause quickly (bundle analyzer)
✅ Multiple optimization strategies applied
✅ Achieved 91% bundle size reduction
✅ Added automated checks to prevent recurrence
### What Could Be Improved
❌ No bundle size monitoring before incident
❌ Dependencies added without size consideration
❌ No pre-commit checks for bundle size
### Key Takeaways
1. **Always check bundle size** when adding dependencies
2. **Use native APIs** instead of libraries when possible
3. **Tree shaking** requires named imports (not default)
4. **Code splitting** for rarely-used features
5. **External API calls** are lighter than bundling SDKs
---
## Related Documentation
- **PlanetScale Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
---
Return to [examples index](INDEX.md)

# Distributed System Network Debugging
Investigating intermittent 504 Gateway Timeout errors between Cloudflare Workers and external APIs, resolved through DNS caching and timeout tuning.
## Overview
**Incident**: 5% of API requests failing with 504 timeouts
**Impact**: Intermittent failures, no clear pattern, user frustration
**Root Cause**: DNS resolution delays + worker timeout too aggressive
**Resolution**: DNS caching + timeout increase (5s→30s)
**Status**: Resolved
## Incident Timeline
| Time | Event | Action |
|------|-------|--------|
| 14:00 | 504 errors detected | Alerts triggered |
| 14:10 | Pattern analysis started | Check logs, no obvious cause |
| 14:30 | Network trace performed | Found DNS delays |
| 14:50 | Root cause identified | DNS + timeout combination |
| 15:10 | Fix deployed | DNS caching + timeout tuning |
| 15:40 | Monitoring confirmed | 504s eliminated |
---
## Symptoms and Detection
### Initial Alerts
**Error Pattern**:
```
[ERROR] Request to https://api.partner.com/data failed: 504 Gateway Timeout
[ERROR] Upstream timeout after 5000ms
[ERROR] DNS lookup took 3200ms (80% of timeout!)
```
**Characteristics**:
- ❌ Random occurrence (5% of requests)
- ❌ No pattern by time of day
- ❌ Affects all worker regions equally
- ❌ External API reports no issues
- ✅ Only affects specific external endpoints
---
## Diagnosis
### Step 1: Network Request Breakdown
**curl Timing Analysis**:
```bash
# Test external API with detailed timing
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nStart: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
-o /dev/null -s https://api.partner.com/data
# Results (intermittent):
DNS: 3.201s # ❌ Very slow!
Connect: 3.450s
TLS: 3.780s
Start: 4.120s
Total: 4.823s # Close to 5s worker timeout
```
**Fast vs Slow Requests**:
```
FAST (95% of requests):
DNS: 0.050s → Connect: 0.120s → Total: 0.850s ✅
SLOW (5% of requests):
DNS: 3.200s → Connect: 3.450s → Total: 4.850s ❌ (near timeout)
```
**Root Cause**: DNS resolution delays causing total request time to exceed worker timeout.
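The curl numbers come from outside the platform; from inside a Worker the DNS and connect phases aren't observable, but a coarse wrapper still shows how much of each request is spent waiting on the upstream versus in the worker itself. A sketch (the log fields are illustrative, not an existing schema):
```typescript
// Coarse timing around the upstream call. Workers don't expose DNS/connect
// phases, so this only separates "waiting on upstream" from "work in the worker".
async function timedFetch(url: string, init?: RequestInit): Promise<Response> {
  const started = Date.now();
  try {
    return await fetch(url, init);
  } finally {
    // Illustrative structured log line for wrangler tail / log analysis
    console.log(JSON.stringify({ event: 'upstream_timing', url, upstreamMs: Date.now() - started }));
  }
}
```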
---
### Step 2: DNS Investigation
**nslookup Testing**:
```bash
# Test DNS resolution
time nslookup api.partner.com
# Results (vary):
Run 1: 0.05s ✅
Run 2: 3.10s ❌
Run 3: 0.04s ✅
Run 4: 2.95s ❌
Pattern: DNS cache miss causes 3s delay
```
**dig Analysis**:
```bash
# Detailed DNS query
dig api.partner.com +stats
# Results:
;; Query time: 3021 msec # Slow!
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Dec 05 14:25:32 UTC 2024
;; MSG SIZE rcvd: 84
# Root cause: No DNS caching in worker
```
---
### Step 3: Worker Timeout Configuration
**Current Worker Code**:
```typescript
// worker.ts (BEFORE - Too aggressive timeout)
export default {
async fetch(request: Request, env: Env) {
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 5000); // 5s timeout
try {
const response = await fetch('https://api.partner.com/data', {
signal: controller.signal
});
return response;
} catch (error) {
// 5% of requests timeout here
return new Response('Gateway Timeout', { status: 504 });
} finally {
clearTimeout(timeout);
}
}
};
```
**Problem**: 5s timeout doesn't account for DNS delays (up to 3s).
---
### Step 4: CORS and Headers Check
**Test CORS Headers**:
```bash
# Check CORS preflight
curl -I -X OPTIONS https://api.greyhaven.io/proxy \
-H "Origin: https://app.greyhaven.io" \
-H "Access-Control-Request-Method: POST"
# Response:
HTTP/2 200
access-control-allow-origin: https://app.greyhaven.io ✅
access-control-allow-methods: GET, POST, PUT, DELETE ✅
access-control-max-age: 86400
```
**No CORS issues** - problem isolated to DNS + timeout.
---
## Resolution
### Fix 1: Implement DNS Caching
**Worker with DNS Cache**:
```typescript
// worker.ts (AFTER - With DNS caching)
interface DnsCache {
ip: string;
timestamp: number;
ttl: number;
}
const DNS_CACHE = new Map<string, DnsCache>();
const DNS_TTL = 60 * 1000; // 60 seconds
async function resolveWithCache(hostname: string): Promise<string> {
const cached = DNS_CACHE.get(hostname);
if (cached && Date.now() - cached.timestamp < cached.ttl) {
// Cache hit - return immediately
return cached.ip;
}
// Cache miss - resolve DNS
const dnsResponse = await fetch(`https://1.1.1.1/dns-query?name=${hostname}`, {
headers: { 'accept': 'application/dns-json' }
});
const dnsData = await dnsResponse.json();
const ip = dnsData.Answer[0].data;
// Update cache
DNS_CACHE.set(hostname, {
ip,
timestamp: Date.now(),
ttl: DNS_TTL
});
return ip;
}
export default {
async fetch(request: Request, env: Env) {
// Pre-resolve DNS (cached)
const ip = await resolveWithCache('api.partner.com');
// Use IP directly (bypass DNS)
const response = await fetch(`https://${ip}/data`, {
headers: {
'Host': 'api.partner.com' // Preserve the original hostname for the origin (note: TLS certificate checks against a bare IP can still fail on some platforms)
}
});
return response;
}
};
```
**Result**: DNS resolution <5ms (cache hit) vs 3000ms (cache miss).
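A further refinement is to serve a stale cache entry and refresh it in the background, so that once the cache is populated no request ever blocks on a 3-second miss. A sketch building on `DNS_CACHE` and `resolveWithCache` above; `ctx` is the `ExecutionContext` passed as the third argument to the fetch handler:
```typescript
// Stale-while-revalidate variant: return the cached IP even if its TTL has
// expired and refresh in the background, so no request blocks on a DNS miss.
// Builds on DNS_CACHE and resolveWithCache from the snippet above.
async function resolveStaleWhileRevalidate(hostname: string, ctx: ExecutionContext): Promise<string | null> {
  const cached = DNS_CACHE.get(hostname);
  if (!cached) {
    return null; // first request still resolves synchronously via resolveWithCache
  }
  if (Date.now() - cached.timestamp >= cached.ttl) {
    ctx.waitUntil(resolveWithCache(hostname)); // refresh without blocking this request
  }
  return cached.ip;
}
```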
---
### Fix 2: Increase Worker Timeout
**Updated Timeout**:
```typescript
// worker.ts - Increased timeout to account for DNS
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30000); // 30s timeout
try {
const response = await fetch('https://api.partner.com/data', {
signal: controller.signal
});
return response;
} finally {
clearTimeout(timeout);
}
```
**Timeout Breakdown**:
```
Old: 5s total
- DNS: 3s (worst case)
- Connect: 1s
- Request: 1s
= Frequent timeouts
New: 30s total
- DNS: <0.01s (cached)
- Connect: 1s
- Request: 2s
- Buffer: 27s (ample)
= No timeouts
```
---
### Fix 3: Add Retry Logic with Exponential Backoff
**Retry Implementation**:
```typescript
// utils/retry.ts
async function fetchWithRetry(
url: string,
options: RequestInit,
maxRetries: number = 3
): Promise<Response> {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
const response = await fetch(url, options);
// Retry on 5xx errors
if (response.status >= 500 && attempt < maxRetries - 1) {
const delay = Math.pow(2, attempt) * 1000; // Exponential backoff
await new Promise(resolve => setTimeout(resolve, delay));
continue;
}
return response;
} catch (error) {
if (attempt === maxRetries - 1) throw error;
// Exponential backoff: 1s, 2s, 4s
const delay = Math.pow(2, attempt) * 1000;
await new Promise(resolve => setTimeout(resolve, delay));
}
}
throw new Error('Max retries exceeded');
}
// Usage:
const response = await fetchWithRetry('https://api.partner.com/data', {
signal: controller.signal
});
```
---
### Fix 4: Circuit Breaker Pattern
**Prevent Cascading Failures**:
```typescript
// utils/circuit-breaker.ts
class CircuitBreaker {
private failures: number = 0;
private lastFailureTime: number = 0;
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'OPEN') {
// Check if enough time passed to try again
if (Date.now() - this.lastFailureTime > 60000) {
this.state = 'HALF_OPEN';
} else {
throw new Error('Circuit breaker is OPEN');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
this.failures = 0;
this.state = 'CLOSED';
}
private onFailure() {
this.failures++;
this.lastFailureTime = Date.now();
if (this.failures >= 5) {
this.state = 'OPEN'; // Trip circuit after 5 failures
}
}
}
// Usage:
const breaker = new CircuitBreaker();
const response = await breaker.execute(() =>
fetch('https://api.partner.com/data')
);
```
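The retry helper and the circuit breaker compose naturally: the breaker wraps the retrying call so that sustained upstream failures trip fast instead of burning the full backoff schedule. A sketch using the helpers defined above:
```typescript
// Compose the pieces: the breaker wraps the retrying call, and the whole thing
// still respects the 30s AbortController timeout from Fix 2.
const partnerBreaker = new CircuitBreaker();

async function callPartnerApi(): Promise<Response> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 30000);
  try {
    return await partnerBreaker.execute(() =>
      fetchWithRetry('https://api.partner.com/data', { signal: controller.signal })
    );
  } finally {
    clearTimeout(timeout);
  }
}
```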
---
## Results
### Before vs After Metrics
| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **504 Error Rate** | 5% | 0.01% | **99.8% reduction** |
| **DNS Resolution** | 3000ms (worst) | <5ms (cached) | **99.8% faster** |
| **Total Request Time** | 4800ms (p95) | 850ms (p95) | **82% faster** |
| **Timeout Threshold** | 5s (too low) | 30s (appropriate) | +500% headroom |
---
### Network Diagnostics
**traceroute Analysis**:
```bash
# Check network path to external API
traceroute api.partner.com
# Results show no packet loss
1 gateway (10.0.0.1) 1.234 ms
2 isp-router (100.64.0.1) 5.678 ms
...
15 api.partner.com (203.0.113.42) 45.234 ms
```
**No packet loss** - confirms DNS was the issue, not network.
---
## Prevention Measures
### 1. Network Monitoring Dashboard
**Metrics to Track**:
```typescript
import { Histogram } from 'prom-client';

// Track network timing metrics
const network_dns_duration = new Histogram({
name: 'network_dns_duration_seconds',
help: 'DNS resolution time'
});
const network_connect_duration = new Histogram({
name: 'network_connect_duration_seconds',
help: 'TCP connection time'
});
const network_total_duration = new Histogram({
name: 'network_total_duration_seconds',
help: 'Total request time'
});
```
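Prometheus still needs an endpoint to scrape these series from. A minimal handler exposing prom-client's default registry (the `/metrics` route and the wiring into the worker are assumptions); the DNS and connect series are fed by external probes such as the curl timing check above, while total request time can be observed in-process:
```typescript
import { register } from 'prom-client';

// Handler for GET /metrics: expose the histograms above for Prometheus to scrape
async function metricsHandler(): Promise<Response> {
  return new Response(await register.metrics(), {
    headers: { 'Content-Type': register.contentType }
  });
}
```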
### 2. Alert Rules
```yaml
# Alert on high DNS resolution time
- alert: SlowDnsResolution
  expr: histogram_quantile(0.95, rate(network_dns_duration_seconds_bucket[5m])) > 1
  for: 5m
  annotations:
    summary: "DNS resolution p95 >1s"
# Alert on gateway timeouts
- alert: HighGatewayTimeouts
  expr: rate(http_requests_total{status="504"}[5m]) > 0.01
  for: 5m
  annotations:
    summary: "504 error rate >1%"
```
### 3. Health Check Endpoints
```typescript
// Handler for GET /health/network (checkDns, checkConnectivity, checkLatency
// are the project's probe helpers)
async function networkHealth() {
  const checks = await Promise.all([
    checkDns('api.partner.com'),
    checkConnectivity('https://api.partner.com/health'),
    checkLatency('https://api.partner.com/ping')
  ]);
  return {
    status: checks.every(c => c.healthy) ? 'healthy' : 'degraded',
    checks
  };
}
```
---
## Lessons Learned
### What Went Well
✅ Detailed network timing analysis pinpointed DNS
✅ DNS caching eliminated 99.8% of timeouts
✅ Circuit breaker prevents cascading failures
### What Could Be Improved
❌ No DNS monitoring before incident
❌ Timeout too aggressive without considering DNS
❌ No retry logic for transient failures
### Key Takeaways
1. **Always cache DNS** in workers (60s TTL minimum)
2. **Account for DNS time** when setting timeouts
3. **Add retry logic** with exponential backoff
4. **Implement circuit breakers** for external dependencies
5. **Monitor network timing** (DNS, connect, TLS, transfer)
---
## Related Documentation
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
---
Return to [examples index](INDEX.md)

# Performance Degradation Analysis
Investigating API response time increase from 200ms to 2000ms, resolved through N+1 query elimination, caching, and index optimization.
## Overview
**Incident**: API response times degraded 10x (200ms → 2000ms)
**Impact**: User-facing slowness, timeout errors, poor UX
**Root Cause**: N+1 query problem + missing indexes + no caching
**Resolution**: Query optimization + indexes + Redis caching
**Status**: Resolved
## Incident Timeline
| Time | Event | Action |
|------|-------|--------|
| 08:00 | Slowness reports from users | Support tickets opened |
| 08:15 | Monitoring confirms degradation | p95 latency 2000ms |
| 08:30 | Database profiling started | Slow query log analysis |
| 09:00 | N+1 query identified | Found 100+ queries per request |
| 09:30 | Fix implemented | Eager loading + indexes |
| 10:00 | Caching added | Redis for frequently accessed data |
| 10:30 | Deployment complete | Latency back to 200ms |
---
## Symptoms and Detection
### Initial Metrics
**Latency Increase**:
```
p50: 180ms → 1800ms (+900% slower)
p95: 220ms → 2100ms (+854% slower)
p99: 450ms → 3500ms (+677% slower)
Requests timing out: 5% (>3s timeout)
```
**User Impact**:
- Page load times: 5-10 seconds
- API timeouts: 5% of requests
- Support tickets: 47 in 1 hour
- User complaints: "App is unusable"
---
## Diagnosis
### Step 1: Application Performance Monitoring
**Wrangler Tail Analysis**:
```bash
# Monitor worker requests in real-time
wrangler tail --format pretty
# Output shows slow requests:
[2024-12-05 08:20:15] GET /api/orders - 2145ms
└─ database_query: 1950ms (90% of total time!)
└─ json_serialization: 150ms
└─ response_headers: 45ms
# Red flag: Database taking 90% of request time
```
---
### Step 2: Database Query Analysis
**PlanetScale Slow Query Log**:
```bash
# Enable and check slow queries
pscale database insights greyhaven-db main --slow-queries
# Results:
Query: SELECT * FROM order_items WHERE order_id = ?
Calls: 157 times per request # ❌ N+1 query problem!
Avg time: 12ms per query
Total: 1884ms per request (12ms × 157)
```
**N+1 Query Pattern Identified**:
```python
# api/orders.py (BEFORE - N+1 Problem)
@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # Query 1: Get all orders for user
    orders = session.exec(
        select(Order).where(Order.user_id == user_id)
    ).all()  # Returns 157 orders

    # Query 2-158: Get items for EACH order (N+1!)
    for order in orders:
        order.items = session.exec(
            select(OrderItem).where(OrderItem.order_id == order.id)
        ).all()  # 157 additional queries!

    return orders

# Total queries: 1 + 157 = 158 queries per request
# Total time: 10ms + (157 × 12ms) = 1894ms
```
---
### Step 3: Database Index Analysis
**Missing Indexes**:
```sql
-- Check existing indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'order_items';
-- Results:
-- Primary key on id (exists) ✅
-- NO index on order_id ❌ (needed for WHERE clause)
-- NO index on user_id ❌ (needed for joins)
-- Explain plan shows full table scan
EXPLAIN ANALYZE
SELECT * FROM order_items WHERE order_id = 123;
-- Result:
Seq Scan on order_items (cost=0.00..1500.00 rows=1 width=100) (actual time=12.345..12.345 rows=5 loops=157)
Filter: (order_id = 123)
Rows Removed by Filter: 10000
-- Full table scan on 10K rows, 157 times = extremely slow!
```
---
## Resolution
### Fix 1: Eliminate N+1 with Eager Loading
**After - Single Query with Join**:
```python
# api/orders.py (AFTER - Eager Loading)
from sqlmodel import select
from sqlalchemy.orm import selectinload

@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # ✅ Single query with eager loading
    statement = (
        select(Order)
        .where(Order.user_id == user_id)
        .options(selectinload(Order.items))  # Eager load items
    )
    orders = session.exec(statement).all()
    return orders

# Total queries: 2 (1 for orders, 1 for all items)
# Total time: 10ms + 25ms = 35ms (98% faster!)
```
**Query Comparison**:
```
BEFORE (N+1):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2-158: SELECT * FROM order_items WHERE order_id = ? (×157, 12ms each)
- Total: 1894ms
AFTER (Eager Loading):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2: SELECT * FROM order_items WHERE order_id IN (?, ?, ..., ?) (25ms)
- Total: 35ms (54x faster!)
```
---
### Fix 2: Add Database Indexes
**Create Indexes**:
```sql
-- Index on order_id for faster lookups
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
-- Index on user_id for user queries
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Index on created_at for time-based queries
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Composite index for common filters
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
```
**Before/After EXPLAIN**:
```sql
-- BEFORE (no index):
EXPLAIN ANALYZE SELECT * FROM order_items WHERE order_id = 123;
Seq Scan (cost=0.00..1500.00) (actual time=12.345ms)
-- AFTER (with index):
Index Scan using idx_order_items_order_id (cost=0.00..8.50) (actual time=0.045ms)
-- 270x faster (12.345ms → 0.045ms)
```
---
### Fix 3: Implement Redis Caching
**Cache Frequent Queries**:
```typescript
// cache.ts - Redis caching layer
import { Redis } from '@upstash/redis';
const redis = new Redis({
url: env.UPSTASH_REDIS_URL,
token: env.UPSTASH_REDIS_TOKEN
});
async function getCachedOrders(userId: number) {
const cacheKey = `orders:user:${userId}`;
// Check cache
const cached = await redis.get(cacheKey);
if (cached) {
return JSON.parse(cached); // Cache hit
}
// Cache miss - query database
const orders = await fetchOrdersFromDb(userId);
// Store in cache (5 minute TTL)
await redis.setex(cacheKey, 300, JSON.stringify(orders));
return orders;
}
```
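The sketch above only reads through the cache, so the write path needs to drop the key or users would see stale orders for up to five minutes. A sketch using the same key convention; `createOrderInDb` is a placeholder for the existing write logic:
```typescript
// Invalidate the per-user cache entry whenever that user's orders change, so
// the next read falls through to the database once and re-populates the cache.
// createOrderInDb is a placeholder for the existing write path.
async function createOrderAndInvalidate(userId: number, order: unknown) {
  const created = await createOrderInDb(userId, order);
  await redis.del(`orders:user:${userId}`); // same key convention as getCachedOrders
  return created;
}
```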
**Cache Hit Rates**:
```
Requests: 10,000
Cache hits: 8,500 (85%)
Cache misses: 1,500 (15%)
Avg latency with cache:
- Cache hit: 5ms (Redis)
- Cache miss: 35ms (database)
- Overall: (0.85 × 5) + (0.15 × 35) = 9.5ms
```
---
### Fix 4: Database Connection Pooling
**Optimize Pool Settings**:
```python
# database.py - Tuned for performance
from sqlmodel import create_engine

engine = create_engine(
    database_url,
    pool_size=50,          # Increased from 20
    max_overflow=20,
    pool_recycle=1800,     # 30 minutes
    pool_pre_ping=True,    # Health check
    echo=False,
    connect_args={
        "server_settings": {
            "statement_timeout": "30000",                   # 30s query timeout
            "idle_in_transaction_session_timeout": "60000"  # 60s idle
        }
    }
)
```
---
## Results
### Performance Metrics
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **p50 Latency** | 1800ms | 180ms | **90% faster** |
| **p95 Latency** | 2100ms | 220ms | **90% faster** |
| **p99 Latency** | 3500ms | 450ms | **87% faster** |
| **Database Queries** | 158/request | 2/request | **99% reduction** |
| **Cache Hit Rate** | 0% | 85% | **85% hits** |
| **Timeout Errors** | 5% | 0% | **100% eliminated** |
### Cost Impact
**Database Query Reduction**:
```
Before: 158 queries × 100 req/s = 15,800 queries/s
After: 2 queries × 100 req/s = 200 queries/s
Reduction: 98.7% fewer queries
Cost savings: $450/month (reduced database tier)
```
---
## Prevention Measures
### 1. Query Performance Monitoring
**Slow Query Alert**:
```yaml
# Alert on slow database queries
- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, rate(database_query_duration_seconds_bucket[5m])) > 0.1
  for: 5m
  annotations:
    summary: "Database queries p95 >100ms"
```
### 2. N+1 Query Detection
**Test for N+1 Patterns**:
```python
# tests/test_n_plus_one.py
import pytest
from sqlalchemy import event
from api.orders import get_user_orders
from database import engine

@pytest.fixture
def query_counter():
    """Count SQL queries during test"""
    queries = []

    def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        queries.append(statement)

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    yield queries
    event.remove(engine, "before_cursor_execute", before_cursor_execute)

def test_get_user_orders_no_n_plus_one(query_counter):
    """Verify endpoint doesn't have N+1 queries"""
    get_user_orders(user_id=1)
    # Should be 2 queries max (orders + items)
    assert len(query_counter) <= 2, f"N+1 detected: {len(query_counter)} queries"
```
### 3. Database Index Coverage
```sql
-- Check for missing indexes
SELECT
schemaname,
tablename,
attname,
n_distinct,
correlation
FROM pg_stats
WHERE schemaname = 'public'
AND n_distinct > 100 -- Cardinality suggests index needed
ORDER BY tablename, attname;
```
### 4. Performance Budget
```typescript
// Set performance budgets
const PERFORMANCE_BUDGETS = {
api_latency_p95: 500, // ms
database_queries_per_request: 5,
cache_hit_rate_min: 0.70, // 70%
};
// CI/CD check
if (metrics.api_latency_p95 > PERFORMANCE_BUDGETS.api_latency_p95) {
throw new Error(`Performance budget exceeded: ${metrics.api_latency_p95}ms > 500ms`);
}
```
---
## Lessons Learned
### What Went Well
✅ Slow query log pinpointed N+1 problem
✅ Eager loading eliminated 99% of queries
✅ Indexes provided 270x speedup
✅ Caching reduced load by 85%
### What Could Be Improved
❌ No N+1 query detection before production
❌ Missing indexes not caught in code review
❌ No caching layer initially
❌ No query performance monitoring
### Key Takeaways
1. **Always use eager loading** for associations
2. **Add indexes** for all foreign keys and WHERE clauses
3. **Implement caching** for frequently accessed data
4. **Monitor query counts** per request (alert on >10)
5. **Test for N+1** in CI/CD pipeline
---
## Related Documentation
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
---
Return to [examples index](INDEX.md)

# PlanetScale Connection Pool Exhaustion
Complete investigation of database connection pool exhaustion causing 503 errors, resolved through connection pool tuning and leak fixes.
## Overview
**Incident**: Database connection timeouts causing 15% request failure rate
**Impact**: Customer-facing 503 errors, support tickets increasing
**Root Cause**: Connection pool too small + unclosed connections in error paths
**Resolution**: Pool tuning (20→52) + connection leak fixes
**Status**: Resolved
## Incident Timeline
| Time | Event | Action |
|------|-------|--------|
| 09:30 | Alerts: High 503 error rate | Oncall paged |
| 09:35 | Investigation started | Check logs, metrics |
| 09:45 | Database connections at 100% | Identified pool exhaustion |
| 10:00 | Temporary fix: restart service | Bought time for root cause |
| 10:30 | Code analysis complete | Found connection leaks |
| 11:00 | Fix deployed (pool + leaks) | Production deployment |
| 11:30 | Monitoring confirmed stable | Incident resolved |
---
## Symptoms and Detection
### Initial Alerts
**Prometheus Alert**:
```yaml
# Alert: HighErrorRate
expr: rate(http_requests_total{status="503"}[5m]) > 0.05
for: 5m
annotations:
summary: "503 error rate >5% for 5 minutes"
description: "Current rate: {{ $value | humanizePercentage }}"
```
**Error Logs**:
```
[ERROR] Database query failed: connection timeout
[ERROR] Pool exhausted, waiting for available connection
[ERROR] Request timeout after 30s waiting for DB connection
```
**Impact Metrics**:
```
Error rate: 15% (150 failures per 1000 requests)
User complaints: 23 support tickets in 30 minutes
Failed transactions: ~$15,000 in abandoned carts
```
---
## Diagnosis
### Step 1: Check Connection Pool Status
**Query PlanetScale**:
```bash
# Connect to database
pscale shell greyhaven-db main
# Check active connections
SELECT
COUNT(*) as active_connections,
MAX(pg_stat_activity.query_start) as oldest_query
FROM pg_stat_activity
WHERE state = 'active';
# Result:
# active_connections: 98
# oldest_query: 2024-12-05 09:15:23 (15 minutes ago!)
```
**Check Application Pool**:
```python
# In FastAPI app - add diagnostic endpoint
from sqlmodel import Session
from database import engine

@app.get("/pool-status")
def pool_status():
    pool = engine.pool
    return {
        "size": pool.size(),
        "checked_out": pool.checkedout(),
        "overflow": pool.overflow(),
        "timeout": pool._timeout,
        "max_overflow": pool._max_overflow
    }
# Response:
{
"size": 20,
"checked_out": 20, # Pool exhausted!
"overflow": 0,
"timeout": 30,
"max_overflow": 10
}
```
**Red Flags**:
- ✅ Pool at 100% capacity (20/20 connections checked out)
- ✅ No overflow connections being used (0/10)
- ✅ Connections held for >15 minutes
- ✅ New requests timing out waiting for connections
---
### Step 2: Identify Connection Leaks
**Code Review - Found Vulnerable Pattern**:
```python
# api/orders.py (BEFORE - LEAK)
from fastapi import APIRouter
from sqlmodel import Session, select
from database import engine

router = APIRouter()

@router.post("/orders")
async def create_order(order_data: OrderCreate):
    # ❌ LEAK: Session never closed on exception
    session = Session(engine)

    # Create order
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # If exception here, session never closed!
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    # session.close() never reached
    return order
```
**How Leak Occurs**:
1. Request creates session (acquires connection from pool)
2. Exception raised after commit
3. Function exits without calling `session.close()`
4. Connection remains "checked out" from pool
5. After 20 such exceptions, pool exhausted
---
### Step 3: Load Testing to Reproduce
**Test Script**:
```python
# test_connection_leak.py
import asyncio
import httpx

async def create_order(client, amount):
    """Create order that will trigger exception"""
    try:
        response = await client.post(
            "https://api.greyhaven.io/orders",
            json={"total": amount}
        )
        return response.status_code
    except Exception:
        return 503

async def load_test():
    """Simulate 100 orders with high amounts (triggers leak)"""
    async with httpx.AsyncClient() as client:
        # Trigger 100 exceptions (leak 100 connections)
        tasks = [create_order(client, 15000) for _ in range(100)]
        results = await asyncio.gather(*tasks)

    success = sum(1 for r in results if r == 201)
    errors = sum(1 for r in results if r == 503)
    print(f"Success: {success}, Errors: {errors}")

asyncio.run(load_test())
```
**Results**:
```
Success: 20 (first 20 use all connections)
Errors: 80 (remaining 80 timeout waiting for pool)
Proves: Connection leak exhausts pool
```
---
## Resolution
### Fix 1: Use Context Manager (Guaranteed Cleanup)
**After - With Context Manager**:
```python
# api/orders.py (AFTER - FIXED)
from fastapi import APIRouter, Depends
from sqlmodel import Session
from database import engine

router = APIRouter()

# ✅ Dependency injection with automatic cleanup
def get_session():
    with Session(engine) as session:
        yield session
        # Session always closed (even on exception)

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session)
):
    # Session managed by FastAPI dependency
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # Exception here? No problem - session still closed by context manager
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    return order
```
**Why This Works**:
- Context manager (`with` statement) guarantees `session.close()` in `__exit__`
- Works even if exception raised
- FastAPI `Depends()` handles async cleanup automatically
---
### Fix 2: Increase Connection Pool Size
**Before** (pool too small):
```python
# database.py (BEFORE)
from sqlmodel import create_engine

engine = create_engine(
    database_url,
    pool_size=20,      # Too small for load
    max_overflow=10,
    pool_timeout=30
)
```
**After** (tuned for load):
```python
# database.py (AFTER)
from sqlmodel import create_engine
import os

# Calculate pool size based on workers
# Formula: (workers * 2) + buffer
# 16 workers * 2 + 20 buffer = 52
workers = int(os.getenv("WEB_CONCURRENCY", 16))
pool_size = (workers * 2) + 20

engine = create_engine(
    database_url,
    pool_size=pool_size,    # 52 connections
    max_overflow=20,        # Burst to 72 total
    pool_timeout=30,
    pool_recycle=3600,      # Recycle after 1 hour
    pool_pre_ping=True,     # Verify connection health
    echo=False
)
```
**Pool Size Calculation**:
```
Workers: 16 (Uvicorn workers)
Connections per worker: 2 (normal peak)
Buffer: 20 (for spikes)
pool_size = (16 * 2) + 20 = 52
max_overflow = 20 (total 72 for extreme spikes)
```
---
### Fix 3: Add Connection Pool Monitoring
**Prometheus Metrics**:
```python
# monitoring.py
from prometheus_client import Gauge
from database import engine

# Pool metrics
db_pool_size = Gauge('db_pool_size_total', 'Total pool size')
db_pool_checked_out = Gauge('db_pool_checked_out', 'Connections in use')
db_pool_idle = Gauge('db_pool_idle', 'Idle connections')
db_pool_overflow = Gauge('db_pool_overflow', 'Overflow connections')

def update_pool_metrics():
    """Update pool metrics every 10 seconds"""
    pool = engine.pool
    db_pool_size.set(pool.size())
    db_pool_checked_out.set(pool.checkedout())
    db_pool_idle.set(pool.size() - pool.checkedout())
    db_pool_overflow.set(pool.overflow())

# Schedule in background task
import asyncio

async def pool_monitor():
    while True:
        update_pool_metrics()
        await asyncio.sleep(10)
```
**Grafana Alert**:
```yaml
# Alert: Connection pool near exhaustion
expr: db_pool_checked_out / db_pool_size_total > 0.8
for: 5m
annotations:
summary: "Connection pool >80% utilized"
description: "{{ $value | humanizePercentage }} of pool in use"
```
---
### Fix 4: Add Timeout and Retry Logic
**Connection Timeout Handling**:
```python
# database.py - Add connection retry
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(PoolTimeoutError),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10)
)
def _checkout_session() -> Session:
    """Acquire a session, retrying if the pool is exhausted"""
    session = Session(engine)
    session.connection()  # force the pool checkout now so a timeout surfaces here
    return session

def get_session_with_retry():
    """Dependency: session acquired with automatic retry on pool timeout"""
    session = _checkout_session()
    try:
        yield session
    finally:
        session.close()

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session_with_retry)
):
    # Will retry up to 3 times if the pool is exhausted
    ...
```
---
## Results
### Before vs After Metrics
| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **Connection Pool Size** | 20 | 52 | +160% capacity |
| **Pool Utilization** | 100% (exhausted) | 40-60% (healthy) | -40% utilization |
| **503 Error Rate** | 15% | 0.01% | **99.9% reduction** |
| **Request Timeout** | 30s (waiting) | <100ms | **99.7% faster** |
| **Leaked Connections** | 12/hour | 0/day | **100% eliminated** |
---
### Deployment Verification
**Load Test After Fix**:
```bash
# Simulate 1000 concurrent orders
ab -n 1000 -c 50 -p order.json https://api.greyhaven.io/orders
# Results:
Requests per second: 250 [#/sec]
Time per request: 200ms [mean]
Failed requests: 0 (0%)
Successful requests: 1000 (100%)
# Pool status during test:
{
"size": 52,
"checked_out": 28, # 54% utilization (healthy)
"overflow": 0,
"idle": 24
}
```
---
## Prevention Measures
### 1. Connection Leak Tests
```python
# tests/test_connection_leaks.py
import pytest
from database import engine

@pytest.fixture
def track_connections():
    before = engine.pool.checkedout()
    yield
    after = engine.pool.checkedout()
    assert after == before, f"Leaked {after - before} connections"
```
### 2. Pool Alerts
```yaml
# Alert if pool >80% for 5 minutes
expr: db_pool_checked_out / db_pool_size_total > 0.8
```
### 3. Health Check
```python
@app.get("/health/database")
async def database_health():
with Session(engine) as session:
session.execute("SELECT 1")
return {"status": "healthy", "pool_utilization": pool.checkedout() / pool.size()}
```
### 4. Monitoring Commands
```bash
# Active connections
pscale shell db main --execute "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'"
# Slow queries
pscale database insights db main --slow-queries
```
---
## Lessons Learned
### What Went Well
✅ Quick identification of pool exhaustion (Prometheus alerts)
✅ Context manager pattern eliminated leaks
✅ Pool tuning based on formula (workers * 2 + buffer)
✅ Comprehensive monitoring added
### What Could Be Improved
❌ No pool monitoring before incident
❌ Pool size not calculated based on load
❌ Missing connection leak tests
### Key Takeaways
1. **Always use context managers** for database sessions
2. **Calculate pool size** based on workers and load
3. **Monitor pool utilization** with alerts at 80%
4. **Test for connection leaks** in CI/CD
5. **Add retry logic** for transient pool timeouts
---
## PlanetScale Best Practices
```bash
# Connection string with SSL
DATABASE_URL="postgresql://user:pass@aws.connect.psdb.cloud/db?sslmode=require"
# Schema changes via deploy requests
pscale deploy-request create db schema-update
# Test in branch
pscale branch create db test-feature
```
```sql
-- Index frequently queried columns
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Analyze slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
```
---
## Related Documentation
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
---
Return to [examples index](INDEX.md)