Initial commit
This commit is contained in:
26
skills/devops-troubleshooting/SKILL.md
Normal file
26
skills/devops-troubleshooting/SKILL.md
Normal file
@@ -0,0 +1,26 @@
|
||||
# DevOps Troubleshooting Skill
|
||||
|
||||
DevOps and infrastructure troubleshooting for Cloudflare Workers, PlanetScale PostgreSQL, and distributed systems.
|
||||
|
||||
## Description
|
||||
|
||||
Infrastructure diagnosis, performance analysis, network debugging, and cloud platform troubleshooting.
|
||||
|
||||
## What's Included
|
||||
|
||||
- **Examples**: Deployment issues, connection errors, performance degradation
|
||||
- **Reference**: Troubleshooting methodologies, common issues
|
||||
- **Templates**: Diagnostic reports, fix commands
|
||||
|
||||
## Use When
|
||||
|
||||
- Deployment issues
|
||||
- Infrastructure problems
|
||||
- Connection errors
|
||||
- Performance degradation
|
||||
|
||||
## Related Agents
|
||||
|
||||
- `devops-troubleshooter`
|
||||
|
||||
**Skill Version**: 1.0
|
||||
68
skills/devops-troubleshooting/examples/INDEX.md
Normal file
68
skills/devops-troubleshooting/examples/INDEX.md
Normal file
@@ -0,0 +1,68 @@
|
||||
# DevOps Troubleshooter Examples
|
||||
|
||||
Real-world infrastructure troubleshooting scenarios for Grey Haven's Cloudflare Workers + PlanetScale PostgreSQL stack.
|
||||
|
||||
## Examples Overview
|
||||
|
||||
### 1. Cloudflare Worker Deployment Failure
|
||||
|
||||
**File**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
|
||||
**Scenario**: Worker deployment fails with "Script exceeds size limit" error
|
||||
**Stack**: Cloudflare Workers, wrangler, webpack bundling
|
||||
**Impact**: Production deployment blocked, 2-hour downtime
|
||||
**Resolution**: Bundle size reduction (5.2MB → 1.8MB), code splitting, tree shaking
|
||||
**Lines**: ~450 lines
|
||||
|
||||
### 2. PlanetScale Connection Pool Exhaustion
|
||||
|
||||
**File**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
|
||||
**Scenario**: Database connection timeouts causing 503 errors
|
||||
**Stack**: PlanetScale PostgreSQL, connection pooling, FastAPI
|
||||
**Impact**: 15% of requests failing, customer complaints
|
||||
**Resolution**: Connection pool tuning, connection leak fixes
|
||||
**Lines**: ~430 lines
|
||||
|
||||
### 3. Distributed System Network Debugging
|
||||
|
||||
**File**: [distributed-system-debugging.md](distributed-system-debugging.md)
|
||||
**Scenario**: Intermittent 504 Gateway Timeout errors between services
|
||||
**Stack**: Cloudflare Workers, external APIs, DNS, CORS
|
||||
**Impact**: 5% of API calls failing, no clear pattern
|
||||
**Resolution**: DNS caching issue, worker timeout configuration
|
||||
**Lines**: ~420 lines
|
||||
|
||||
### 4. Performance Degradation Analysis
|
||||
|
||||
**File**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
|
||||
**Scenario**: API response times increased from 200ms to 2000ms
|
||||
**Stack**: Cloudflare Workers, PlanetScale, caching layer
|
||||
**Impact**: User-facing slowness, poor UX
|
||||
**Resolution**: N+1 query elimination, caching strategy, index optimization
|
||||
**Lines**: ~410 lines
|
||||
|
||||
---
|
||||
|
||||
## Quick Navigation
|
||||
|
||||
**By Issue Type**:
|
||||
- Deployment failures → [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
|
||||
- Database issues → [planetscale-connection-issues.md](planetscale-connection-issues.md)
|
||||
- Network problems → [distributed-system-debugging.md](distributed-system-debugging.md)
|
||||
- Performance issues → [performance-degradation-analysis.md](performance-degradation-analysis.md)
|
||||
|
||||
**By Stack Component**:
|
||||
- Cloudflare Workers → Examples 1, 3, 4
|
||||
- PlanetScale PostgreSQL → Examples 2, 4
|
||||
- Distributed Systems → Example 3
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Reference**: [Reference Index](../reference/INDEX.md) - Runbooks and diagnostic commands
|
||||
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident templates
|
||||
- **Main Agent**: [devops-troubleshooter.md](../devops-troubleshooter.md) - DevOps troubleshooter agent
|
||||
|
||||
---
|
||||
|
||||
Return to [main agent](../devops-troubleshooter.md)
|
||||
@@ -0,0 +1,466 @@
|
||||
# Cloudflare Worker Deployment Failure Investigation
|
||||
|
||||
Complete troubleshooting workflow for "Script exceeds size limit" deployment failure, resolved through bundle optimization and code splitting.
|
||||
|
||||
## Overview
|
||||
|
||||
**Incident**: Worker deployment failing with size limit error
|
||||
**Impact**: Production deployment blocked for 2 hours
|
||||
**Root Cause**: Bundle size grew from 1.2MB to 5.2MB after adding dependencies
|
||||
**Resolution**: Bundle optimization (code splitting, tree shaking) reduced size to 1.8MB
|
||||
**Status**: Resolved
|
||||
|
||||
## Incident Timeline
|
||||
|
||||
| Time | Event | Action |
|
||||
|------|-------|--------|
|
||||
| 14:00 | Deployment initiated via CI/CD | `wrangler deploy` triggered |
|
||||
| 14:02 | Deployment failed | Error: "Script exceeds 1MB size limit" |
|
||||
| 14:05 | Investigation started | Check recent code changes |
|
||||
| 14:15 | Root cause identified | New dependencies increased bundle size |
|
||||
| 14:30 | Fix implemented | Bundle optimization applied |
|
||||
| 14:45 | Fix deployed | Successful deployment to production |
|
||||
| 16:00 | Monitoring complete | Confirmed stable deployment |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms and Detection
|
||||
|
||||
### Initial Error
|
||||
|
||||
**Deployment Command**:
|
||||
```bash
|
||||
$ wrangler deploy
|
||||
✘ [ERROR] Script exceeds the size limit (5.2MB > 1MB after compression)
|
||||
```
|
||||
|
||||
**CI/CD Pipeline Failure**:
|
||||
```yaml
|
||||
# GitHub Actions output
|
||||
Step: Deploy to Cloudflare Workers
|
||||
✓ Build completed (5.2MB bundle)
|
||||
✗ Deployment failed: Script size exceeds limit
|
||||
Error: Workers Free plan limit is 1MB compressed
|
||||
```
|
||||
|
||||
**Impact**:
|
||||
- Production deployment blocked
|
||||
- New features stuck in staging
|
||||
- Team unable to deploy hotfixes
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Step 1: Check Bundle Size
|
||||
|
||||
**Before Investigation**:
|
||||
```bash
|
||||
# Build the worker locally
|
||||
npm run build
|
||||
|
||||
# Check output size
|
||||
ls -lh dist/
|
||||
-rw-r--r-- 1 user staff 5.2M Dec 5 14:10 worker.js
|
||||
```
|
||||
|
||||
**Analyze Bundle Composition**:
|
||||
```bash
|
||||
# Use webpack-bundle-analyzer
|
||||
npm install --save-dev webpack-bundle-analyzer
|
||||
|
||||
# Add to webpack.config.js
|
||||
const BundleAnalyzerPlugin = require('webpack-bundle-analyzer').BundleAnalyzerPlugin;
|
||||
|
||||
module.exports = {
|
||||
plugins: [
|
||||
new BundleAnalyzerPlugin()
|
||||
]
|
||||
};
|
||||
|
||||
# Build and open analyzer
|
||||
npm run build
|
||||
# Opens http://127.0.0.1:8888 with visual bundle breakdown
|
||||
```
|
||||
|
||||
**Bundle Analyzer Findings**:
|
||||
```
|
||||
Total Size: 5.2MB
|
||||
|
||||
Breakdown:
|
||||
- @anthropic-ai/sdk: 2.1MB (40%)
|
||||
- aws-sdk: 1.8MB (35%)
|
||||
- lodash: 800KB (15%)
|
||||
- moment: 300KB (6%)
|
||||
- application code: 200KB (4%)
|
||||
```
|
||||
|
||||
**Red Flags**:
|
||||
1. Full `aws-sdk` imported (only needed S3)
|
||||
2. Entire `lodash` library (only using 3 functions)
|
||||
3. `moment` included (native Date API would suffice)
|
||||
4. Large AI SDK (only using text generation)
|
||||
|
||||
---
|
||||
|
||||
### Step 2: Identify Recent Changes
|
||||
|
||||
**Git Diff**:
|
||||
```bash
|
||||
# Check what changed in last deploy
|
||||
git diff HEAD~1 HEAD -- src/
|
||||
|
||||
# Key changes:
|
||||
+ import { Anthropic } from '@anthropic-ai/sdk';
|
||||
+ import AWS from 'aws-sdk';
|
||||
+ import _ from 'lodash';
|
||||
+ import moment from 'moment';
|
||||
```
|
||||
|
||||
**PR Analysis**:
|
||||
```
|
||||
PR #234: Add AI content generation feature
|
||||
- Added @anthropic-ai/sdk (full SDK)
|
||||
- Added AWS S3 integration (full aws-sdk)
|
||||
- Used lodash for data manipulation
|
||||
- Used moment for date formatting
|
||||
|
||||
Result: Bundle size increased by 4MB
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Cloudflare Worker Size Limits
|
||||
|
||||
**Plan Limits**:
|
||||
```
|
||||
Workers Free: 1MB compressed
|
||||
Workers Paid: 10MB compressed
|
||||
|
||||
Current plan: Workers Free
|
||||
Current size: 5.2MB (over limit)
|
||||
|
||||
Options:
|
||||
1. Upgrade to Workers Paid ($5/month)
|
||||
2. Reduce bundle size to <1MB
|
||||
3. Split into multiple workers
|
||||
```
|
||||
|
||||
**Decision**: Reduce bundle size (no budget for upgrade)
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Fix 1: Tree Shaking with Named Imports
|
||||
|
||||
**Before** (imports entire libraries):
|
||||
```typescript
|
||||
// ❌ BAD: Imports full library
|
||||
import _ from 'lodash';
|
||||
import moment from 'moment';
|
||||
import AWS from 'aws-sdk';
|
||||
|
||||
// Usage:
|
||||
const unique = _.uniq(array);
|
||||
const date = moment().format('YYYY-MM-DD');
|
||||
const s3 = new AWS.S3();
|
||||
```
|
||||
|
||||
**After** (imports only needed functions):
|
||||
```typescript
|
||||
// ✅ GOOD: Named imports enable tree shaking
|
||||
import { uniq, map, filter } from 'lodash-es';
|
||||
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
|
||||
|
||||
// ✅ BETTER: Native alternatives
|
||||
const unique = [...new Set(array)];
|
||||
const date = new Date().toISOString().split('T')[0];
|
||||
|
||||
// S3 client (v3 - modular)
|
||||
const s3 = new S3Client({ region: 'us-east-1' });
|
||||
```
|
||||
|
||||
**Size Reduction**:
|
||||
```
|
||||
Before:
|
||||
- lodash: 800KB → lodash-es tree-shaken: 50KB (94% reduction)
|
||||
- moment: 300KB → native Date: 0KB (100% reduction)
|
||||
- aws-sdk: 1.8MB → @aws-sdk/client-s3: 200KB (89% reduction)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 2: External Dependencies (Don't Bundle Large SDKs)
|
||||
|
||||
**Before**:
|
||||
```typescript
|
||||
// worker.ts - bundled @anthropic-ai/sdk (2.1MB)
|
||||
import { Anthropic } from '@anthropic-ai/sdk';
|
||||
|
||||
const client = new Anthropic({
|
||||
apiKey: env.ANTHROPIC_API_KEY
|
||||
});
|
||||
```
|
||||
|
||||
**After** (use fetch directly):
|
||||
```typescript
|
||||
// worker.ts - use native fetch (0KB)
|
||||
async function callAnthropic(prompt: string, env: Env) {
|
||||
const response = await fetch('https://api.anthropic.com/v1/messages', {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json',
|
||||
'x-api-key': env.ANTHROPIC_API_KEY,
|
||||
'anthropic-version': '2023-06-01'
|
||||
},
|
||||
body: JSON.stringify({
|
||||
model: 'claude-3-sonnet-20240229',
|
||||
max_tokens: 1024,
|
||||
messages: [
|
||||
{ role: 'user', content: prompt }
|
||||
]
|
||||
})
|
||||
});
|
||||
|
||||
return response.json();
|
||||
}
|
||||
```
|
||||
|
||||
**Size Reduction**:
|
||||
```
|
||||
Before: @anthropic-ai/sdk: 2.1MB
|
||||
After: Native fetch: 0KB
|
||||
Savings: 2.1MB (100% reduction)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 3: Code Splitting (Async Imports)
|
||||
|
||||
**Before** (everything bundled):
|
||||
```typescript
|
||||
// worker.ts
|
||||
import { expensiveFunction } from './expensive-module';
|
||||
|
||||
export default {
|
||||
async fetch(request: Request, env: Env) {
|
||||
// Even if not used, expensive-module is in bundle
|
||||
if (request.url.includes('/special')) {
|
||||
return expensiveFunction(request);
|
||||
}
|
||||
return new Response('OK');
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
**After** (lazy load):
|
||||
```typescript
|
||||
// worker.ts
|
||||
export default {
|
||||
async fetch(request: Request, env: Env) {
|
||||
if (request.url.includes('/special')) {
|
||||
// Only load when needed (separate chunk)
|
||||
const { expensiveFunction } = await import('./expensive-module');
|
||||
return expensiveFunction(request);
|
||||
}
|
||||
return new Response('OK');
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
**Size Reduction**:
|
||||
```
|
||||
Main bundle: 1.8MB → 500KB (72% reduction)
|
||||
expensive-module chunk: Loaded on-demand (lazy)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 4: Webpack Configuration Optimization
|
||||
|
||||
**Updated webpack.config.js**:
|
||||
```javascript
|
||||
const webpack = require('webpack');
|
||||
const path = require('path');
|
||||
|
||||
module.exports = {
|
||||
entry: './src/worker.ts',
|
||||
target: 'webworker',
|
||||
mode: 'production',
|
||||
optimization: {
|
||||
minimize: true,
|
||||
usedExports: true, // Tree shaking
|
||||
sideEffects: false,
|
||||
},
|
||||
resolve: {
|
||||
extensions: ['.ts', '.js'],
|
||||
alias: {
|
||||
// Replace heavy libraries with lighter alternatives
|
||||
'moment': 'date-fns',
|
||||
'lodash': 'lodash-es'
|
||||
}
|
||||
},
|
||||
module: {
|
||||
rules: [
|
||||
{
|
||||
test: /\.ts$/,
|
||||
use: {
|
||||
loader: 'ts-loader',
|
||||
options: {
|
||||
transpileOnly: true,
|
||||
compilerOptions: {
|
||||
module: 'esnext', // Enable tree shaking
|
||||
moduleResolution: 'node'
|
||||
}
|
||||
}
|
||||
},
|
||||
exclude: /node_modules/
|
||||
}
|
||||
]
|
||||
},
|
||||
plugins: [
|
||||
new webpack.DefinePlugin({
|
||||
'process.env.NODE_ENV': JSON.stringify('production')
|
||||
})
|
||||
],
|
||||
output: {
|
||||
filename: 'worker.js',
|
||||
path: path.resolve(__dirname, 'dist'),
|
||||
libraryTarget: 'commonjs2'
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Results
|
||||
|
||||
### Bundle Size Comparison
|
||||
|
||||
| Category | Before | After | Reduction |
|
||||
|----------|--------|-------|-----------|
|
||||
| **@anthropic-ai/sdk** | 2.1MB | 0KB (fetch) | -100% |
|
||||
| **aws-sdk** | 1.8MB | 200KB (v3) | -89% |
|
||||
| **lodash** | 800KB | 50KB (tree-shaken) | -94% |
|
||||
| **moment** | 300KB | 0KB (native Date) | -100% |
|
||||
| **Application code** | 200KB | 200KB | 0% |
|
||||
| **TOTAL** | **5.2MB** | **450KB** | **-91%** |
|
||||
|
||||
**Compressed Size**:
|
||||
- Before: 5.2MB → 1.8MB compressed (over 1MB limit)
|
||||
- After: 450KB → 180KB compressed (under 1MB limit)
|
||||
|
||||
---
|
||||
|
||||
### Deployment Verification
|
||||
|
||||
**Successful Deployment**:
|
||||
```bash
|
||||
$ wrangler deploy
|
||||
✔ Building...
|
||||
✔ Validating...
|
||||
Bundle size: 450KB (180KB compressed)
|
||||
✔ Uploading...
|
||||
✔ Deployed to production
|
||||
|
||||
Production URL: https://api.greyhaven.io
|
||||
Worker ID: worker-abc123
|
||||
```
|
||||
|
||||
**Load Testing**:
|
||||
```bash
|
||||
# Before optimization (would fail deployment)
|
||||
# Bundle: 5.2MB, deploy: FAIL
|
||||
|
||||
# After optimization
|
||||
$ ab -n 1000 -c 10 https://api.greyhaven.io/
|
||||
Requests per second: 1250 [#/sec]
|
||||
Time per request: 8ms [mean]
|
||||
Successful requests: 1000 (100%)
|
||||
Bundle size: 450KB ✓
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention Measures
|
||||
|
||||
### 1. CI/CD Bundle Size Check
|
||||
|
||||
```yaml
|
||||
# .github/workflows/deploy.yml - Add size validation
|
||||
steps:
|
||||
- run: npm ci && npm run build
|
||||
- name: Check bundle size
|
||||
run: |
|
||||
SIZE_MB=$(stat -f%z dist/worker.js | awk '{print $1/1048576}')
|
||||
if (( $(echo "$SIZE_MB > 1.0" | bc -l) )); then
|
||||
echo "❌ Bundle exceeds 1MB"; exit 1
|
||||
fi
|
||||
- run: npx wrangler deploy
|
||||
```
|
||||
|
||||
### 2. Pre-commit Hook
|
||||
|
||||
```bash
|
||||
# .git/hooks/pre-commit
|
||||
SIZE_MB=$(stat -f%z dist/worker.js | awk '{print $1/1048576}')
|
||||
[ "$SIZE_MB" -lt "1.0" ] || { echo "❌ Bundle >1MB"; exit 1; }
|
||||
```
|
||||
|
||||
### 3. PR Template
|
||||
|
||||
```markdown
|
||||
## Bundle Impact
|
||||
- [ ] Bundle size <800KB
|
||||
- [ ] Tree shaking verified
|
||||
Size: [Before → After]
|
||||
```
|
||||
|
||||
### 4. Automated Analysis
|
||||
|
||||
```json
|
||||
{
|
||||
"scripts": {
|
||||
"analyze": "webpack --profile --json > stats.json && webpack-bundle-analyzer stats.json"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
### What Went Well
|
||||
|
||||
✅ Identified root cause quickly (bundle analyzer)
|
||||
✅ Multiple optimization strategies applied
|
||||
✅ Achieved 91% bundle size reduction
|
||||
✅ Added automated checks to prevent recurrence
|
||||
|
||||
### What Could Be Improved
|
||||
|
||||
❌ No bundle size monitoring before incident
|
||||
❌ Dependencies added without size consideration
|
||||
❌ No pre-commit checks for bundle size
|
||||
|
||||
### Key Takeaways
|
||||
|
||||
1. **Always check bundle size** when adding dependencies
|
||||
2. **Use native APIs** instead of libraries when possible
|
||||
3. **Tree shaking** requires named imports (not default)
|
||||
4. **Code splitting** for rarely-used features
|
||||
5. **External API calls** are lighter than bundling SDKs
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **PlanetScale Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
|
||||
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
|
||||
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
|
||||
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
|
||||
|
||||
---
|
||||
|
||||
Return to [examples index](INDEX.md)
|
||||
@@ -0,0 +1,477 @@
|
||||
# Distributed System Network Debugging
|
||||
|
||||
Investigating intermittent 504 Gateway Timeout errors between Cloudflare Workers and external APIs, resolved through DNS caching and timeout tuning.
|
||||
|
||||
## Overview
|
||||
|
||||
**Incident**: 5% of API requests failing with 504 timeouts
|
||||
**Impact**: Intermittent failures, no clear pattern, user frustration
|
||||
**Root Cause**: DNS resolution delays + worker timeout too aggressive
|
||||
**Resolution**: DNS caching + timeout increase (5s→30s)
|
||||
**Status**: Resolved
|
||||
|
||||
## Incident Timeline
|
||||
|
||||
| Time | Event | Action |
|
||||
|------|-------|--------|
|
||||
| 14:00 | 504 errors detected | Alerts triggered |
|
||||
| 14:10 | Pattern analysis started | Check logs, no obvious cause |
|
||||
| 14:30 | Network trace performed | Found DNS delays |
|
||||
| 14:50 | Root cause identified | DNS + timeout combination |
|
||||
| 15:10 | Fix deployed | DNS caching + timeout tuning |
|
||||
| 15:40 | Monitoring confirmed | 504s eliminated |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms and Detection
|
||||
|
||||
### Initial Alerts
|
||||
|
||||
**Error Pattern**:
|
||||
```
|
||||
[ERROR] Request to https://api.partner.com/data failed: 504 Gateway Timeout
|
||||
[ERROR] Upstream timeout after 5000ms
|
||||
[ERROR] DNS lookup took 3200ms (80% of timeout!)
|
||||
```
|
||||
|
||||
**Characteristics**:
|
||||
- ❌ Random occurrence (5% of requests)
|
||||
- ❌ No pattern by time of day
|
||||
- ❌ Affects all worker regions equally
|
||||
- ❌ External API reports no issues
|
||||
- ✅ Only affects specific external endpoints
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Step 1: Network Request Breakdown
|
||||
|
||||
**curl Timing Analysis**:
|
||||
```bash
|
||||
# Test external API with detailed timing
|
||||
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nStart: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
|
||||
-o /dev/null -s https://api.partner.com/data
|
||||
|
||||
# Results (intermittent):
|
||||
DNS: 3.201s # ❌ Very slow!
|
||||
Connect: 3.450s
|
||||
TLS: 3.780s
|
||||
Start: 4.120s
|
||||
Total: 4.823s # Close to 5s worker timeout
|
||||
```
|
||||
|
||||
**Fast vs Slow Requests**:
|
||||
```
|
||||
FAST (95% of requests):
|
||||
DNS: 0.050s → Connect: 0.120s → Total: 0.850s ✅
|
||||
|
||||
SLOW (5% of requests):
|
||||
DNS: 3.200s → Connect: 3.450s → Total: 4.850s ❌ (near timeout)
|
||||
```
|
||||
|
||||
**Root Cause**: DNS resolution delays causing total request time to exceed worker timeout.
|
||||
|
||||
---
|
||||
|
||||
### Step 2: DNS Investigation
|
||||
|
||||
**nslookup Testing**:
|
||||
```bash
|
||||
# Test DNS resolution
|
||||
time nslookup api.partner.com
|
||||
|
||||
# Results (vary):
|
||||
Run 1: 0.05s ✅
|
||||
Run 2: 3.10s ❌
|
||||
Run 3: 0.04s ✅
|
||||
Run 4: 2.95s ❌
|
||||
|
||||
Pattern: DNS cache miss causes 3s delay
|
||||
```
|
||||
|
||||
**dig Analysis**:
|
||||
```bash
|
||||
# Detailed DNS query
|
||||
dig api.partner.com +stats
|
||||
|
||||
# Results:
|
||||
;; Query time: 3021 msec # Slow!
|
||||
;; SERVER: 1.1.1.1#53(1.1.1.1)
|
||||
;; WHEN: Thu Dec 05 14:25:32 UTC 2024
|
||||
;; MSG SIZE rcvd: 84
|
||||
|
||||
# Root cause: No DNS caching in worker
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Worker Timeout Configuration
|
||||
|
||||
**Current Worker Code**:
|
||||
```typescript
|
||||
// worker.ts (BEFORE - Too aggressive timeout)
|
||||
export default {
|
||||
async fetch(request: Request, env: Env) {
|
||||
const controller = new AbortController();
|
||||
const timeout = setTimeout(() => controller.abort(), 5000); // 5s timeout
|
||||
|
||||
try {
|
||||
const response = await fetch('https://api.partner.com/data', {
|
||||
signal: controller.signal
|
||||
});
|
||||
return response;
|
||||
} catch (error) {
|
||||
// 5% of requests timeout here
|
||||
return new Response('Gateway Timeout', { status: 504 });
|
||||
} finally {
|
||||
clearTimeout(timeout);
|
||||
}
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
**Problem**: 5s timeout doesn't account for DNS delays (up to 3s).
|
||||
|
||||
---
|
||||
|
||||
### Step 4: CORS and Headers Check
|
||||
|
||||
**Test CORS Headers**:
|
||||
```bash
|
||||
# Check CORS preflight
|
||||
curl -I -X OPTIONS https://api.greyhaven.io/proxy \
|
||||
-H "Origin: https://app.greyhaven.io" \
|
||||
-H "Access-Control-Request-Method: POST"
|
||||
|
||||
# Response:
|
||||
HTTP/2 200
|
||||
access-control-allow-origin: https://app.greyhaven.io ✅
|
||||
access-control-allow-methods: GET, POST, PUT, DELETE ✅
|
||||
access-control-max-age: 86400 ✅
|
||||
```
|
||||
|
||||
**No CORS issues** - problem isolated to DNS + timeout.
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Fix 1: Implement DNS Caching
|
||||
|
||||
**Worker with DNS Cache**:
|
||||
```typescript
|
||||
// worker.ts (AFTER - With DNS caching)
|
||||
interface DnsCache {
|
||||
ip: string;
|
||||
timestamp: number;
|
||||
ttl: number;
|
||||
}
|
||||
|
||||
const DNS_CACHE = new Map<string, DnsCache>();
|
||||
const DNS_TTL = 60 * 1000; // 60 seconds
|
||||
|
||||
async function resolveWithCache(hostname: string): Promise<string> {
|
||||
const cached = DNS_CACHE.get(hostname);
|
||||
|
||||
if (cached && Date.now() - cached.timestamp < cached.ttl) {
|
||||
// Cache hit - return immediately
|
||||
return cached.ip;
|
||||
}
|
||||
|
||||
// Cache miss - resolve DNS
|
||||
const dnsResponse = await fetch(`https://1.1.1.1/dns-query?name=${hostname}`, {
|
||||
headers: { 'accept': 'application/dns-json' }
|
||||
});
|
||||
const dnsData = await dnsResponse.json();
|
||||
const ip = dnsData.Answer[0].data;
|
||||
|
||||
// Update cache
|
||||
DNS_CACHE.set(hostname, {
|
||||
ip,
|
||||
timestamp: Date.now(),
|
||||
ttl: DNS_TTL
|
||||
});
|
||||
|
||||
return ip;
|
||||
}
|
||||
|
||||
export default {
|
||||
async fetch(request: Request, env: Env) {
|
||||
// Pre-resolve DNS (cached)
|
||||
const ip = await resolveWithCache('api.partner.com');
|
||||
|
||||
// Use IP directly (bypass DNS)
|
||||
const response = await fetch(`https://${ip}/data`, {
|
||||
headers: {
|
||||
'Host': 'api.partner.com' // Required for SNI
|
||||
}
|
||||
});
|
||||
|
||||
return response;
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
**Result**: DNS resolution <5ms (cache hit) vs 3000ms (cache miss).
|
||||
|
||||
---
|
||||
|
||||
### Fix 2: Increase Worker Timeout
|
||||
|
||||
**Updated Timeout**:
|
||||
```typescript
|
||||
// worker.ts - Increased timeout to account for DNS
|
||||
const controller = new AbortController();
|
||||
const timeout = setTimeout(() => controller.abort(), 30000); // 30s timeout
|
||||
|
||||
try {
|
||||
const response = await fetch('https://api.partner.com/data', {
|
||||
signal: controller.signal
|
||||
});
|
||||
return response;
|
||||
} finally {
|
||||
clearTimeout(timeout);
|
||||
}
|
||||
```
|
||||
|
||||
**Timeout Breakdown**:
|
||||
```
|
||||
Old: 5s total
|
||||
- DNS: 3s (worst case)
|
||||
- Connect: 1s
|
||||
- Request: 1s
|
||||
= Frequent timeouts
|
||||
|
||||
New: 30s total
|
||||
- DNS: <0.01s (cached)
|
||||
- Connect: 1s
|
||||
- Request: 2s
|
||||
- Buffer: 27s (ample)
|
||||
= No timeouts
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 3: Add Retry Logic with Exponential Backoff
|
||||
|
||||
**Retry Implementation**:
|
||||
```typescript
|
||||
// utils/retry.ts
|
||||
async function fetchWithRetry(
|
||||
url: string,
|
||||
options: RequestInit,
|
||||
maxRetries: number = 3
|
||||
): Promise<Response> {
|
||||
for (let attempt = 0; attempt < maxRetries; attempt++) {
|
||||
try {
|
||||
const response = await fetch(url, options);
|
||||
|
||||
// Retry on 5xx errors
|
||||
if (response.status >= 500 && attempt < maxRetries - 1) {
|
||||
const delay = Math.pow(2, attempt) * 1000; // Exponential backoff
|
||||
await new Promise(resolve => setTimeout(resolve, delay));
|
||||
continue;
|
||||
}
|
||||
|
||||
return response;
|
||||
} catch (error) {
|
||||
if (attempt === maxRetries - 1) throw error;
|
||||
|
||||
// Exponential backoff: 1s, 2s, 4s
|
||||
const delay = Math.pow(2, attempt) * 1000;
|
||||
await new Promise(resolve => setTimeout(resolve, delay));
|
||||
}
|
||||
}
|
||||
|
||||
throw new Error('Max retries exceeded');
|
||||
}
|
||||
|
||||
// Usage:
|
||||
const response = await fetchWithRetry('https://api.partner.com/data', {
|
||||
signal: controller.signal
|
||||
});
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 4: Circuit Breaker Pattern
|
||||
|
||||
**Prevent Cascading Failures**:
|
||||
```typescript
|
||||
// utils/circuit-breaker.ts
|
||||
class CircuitBreaker {
|
||||
private failures: number = 0;
|
||||
private lastFailureTime: number = 0;
|
||||
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
|
||||
|
||||
async execute<T>(fn: () => Promise<T>): Promise<T> {
|
||||
if (this.state === 'OPEN') {
|
||||
// Check if enough time passed to try again
|
||||
if (Date.now() - this.lastFailureTime > 60000) {
|
||||
this.state = 'HALF_OPEN';
|
||||
} else {
|
||||
throw new Error('Circuit breaker is OPEN');
|
||||
}
|
||||
}
|
||||
|
||||
try {
|
||||
const result = await fn();
|
||||
this.onSuccess();
|
||||
return result;
|
||||
} catch (error) {
|
||||
this.onFailure();
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
private onSuccess() {
|
||||
this.failures = 0;
|
||||
this.state = 'CLOSED';
|
||||
}
|
||||
|
||||
private onFailure() {
|
||||
this.failures++;
|
||||
this.lastFailureTime = Date.now();
|
||||
|
||||
if (this.failures >= 5) {
|
||||
this.state = 'OPEN'; // Trip circuit after 5 failures
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Usage:
|
||||
const breaker = new CircuitBreaker();
|
||||
const response = await breaker.execute(() =>
|
||||
fetch('https://api.partner.com/data')
|
||||
);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Results
|
||||
|
||||
### Before vs After Metrics
|
||||
|
||||
| Metric | Before Fix | After Fix | Improvement |
|
||||
|--------|-----------|-----------|-------------|
|
||||
| **504 Error Rate** | 5% | 0.01% | **99.8% reduction** |
|
||||
| **DNS Resolution** | 3000ms (worst) | <5ms (cached) | **99.8% faster** |
|
||||
| **Total Request Time** | 4800ms (p95) | 850ms (p95) | **82% faster** |
|
||||
| **Timeout Threshold** | 5s (too low) | 30s (appropriate) | +500% headroom |
|
||||
|
||||
---
|
||||
|
||||
### Network Diagnostics
|
||||
|
||||
**traceroute Analysis**:
|
||||
```bash
|
||||
# Check network path to external API
|
||||
traceroute api.partner.com
|
||||
|
||||
# Results show no packet loss
|
||||
1 gateway (10.0.0.1) 1.234 ms
|
||||
2 isp-router (100.64.0.1) 5.678 ms
|
||||
...
|
||||
15 api.partner.com (203.0.113.42) 45.234 ms
|
||||
```
|
||||
|
||||
**No packet loss** - confirms DNS was the issue, not network.
|
||||
|
||||
---
|
||||
|
||||
## Prevention Measures
|
||||
|
||||
### 1. Network Monitoring Dashboard
|
||||
|
||||
**Metrics to Track**:
|
||||
```typescript
|
||||
// Track network timing metrics
|
||||
const network_dns_duration = new Histogram({
|
||||
name: 'network_dns_duration_seconds',
|
||||
help: 'DNS resolution time'
|
||||
});
|
||||
|
||||
const network_connect_duration = new Histogram({
|
||||
name: 'network_connect_duration_seconds',
|
||||
help: 'TCP connection time'
|
||||
});
|
||||
|
||||
const network_total_duration = new Histogram({
|
||||
name: 'network_total_duration_seconds',
|
||||
help: 'Total request time'
|
||||
});
|
||||
```
|
||||
|
||||
### 2. Alert Rules
|
||||
|
||||
```yaml
|
||||
# Alert on high DNS resolution time
|
||||
- alert: SlowDnsResolution
|
||||
expr: histogram_quantile(0.95, network_dns_duration_seconds) > 1
|
||||
for: 5m
|
||||
annotations:
|
||||
summary: "DNS resolution p95 >1s"
|
||||
|
||||
# Alert on gateway timeouts
|
||||
- alert: HighGatewayTimeouts
|
||||
expr: rate(http_requests_total{status="504"}[5m]) > 0.01
|
||||
for: 5m
|
||||
annotations:
|
||||
summary: "504 error rate >1%"
|
||||
```
|
||||
|
||||
### 3. Health Check Endpoints
|
||||
|
||||
```typescript
|
||||
@app.get("/health/network")
|
||||
async function networkHealth() {
|
||||
const checks = await Promise.all([
|
||||
checkDns('api.partner.com'),
|
||||
checkConnectivity('https://api.partner.com/health'),
|
||||
checkLatency('https://api.partner.com/ping')
|
||||
]);
|
||||
|
||||
return {
|
||||
status: checks.every(c => c.healthy) ? 'healthy' : 'degraded',
|
||||
checks
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
### What Went Well
|
||||
|
||||
✅ Detailed network timing analysis pinpointed DNS
|
||||
✅ DNS caching eliminated 99.8% of timeouts
|
||||
✅ Circuit breaker prevents cascading failures
|
||||
|
||||
### What Could Be Improved
|
||||
|
||||
❌ No DNS monitoring before incident
|
||||
❌ Timeout too aggressive without considering DNS
|
||||
❌ No retry logic for transient failures
|
||||
|
||||
### Key Takeaways
|
||||
|
||||
1. **Always cache DNS** in workers (60s TTL minimum)
|
||||
2. **Account for DNS time** when setting timeouts
|
||||
3. **Add retry logic** with exponential backoff
|
||||
4. **Implement circuit breakers** for external dependencies
|
||||
5. **Monitor network timing** (DNS, connect, TLS, transfer)
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
|
||||
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
|
||||
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
|
||||
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
|
||||
|
||||
---
|
||||
|
||||
Return to [examples index](INDEX.md)
|
||||
@@ -0,0 +1,413 @@
|
||||
# Performance Degradation Analysis
|
||||
|
||||
Investigating API response time increase from 200ms to 2000ms, resolved through N+1 query elimination, caching, and index optimization.
|
||||
|
||||
## Overview
|
||||
|
||||
**Incident**: API response times degraded 10x (200ms → 2000ms)
|
||||
**Impact**: User-facing slowness, timeout errors, poor UX
|
||||
**Root Cause**: N+1 query problem + missing indexes + no caching
|
||||
**Resolution**: Query optimization + indexes + Redis caching
|
||||
**Status**: Resolved
|
||||
|
||||
## Incident Timeline
|
||||
|
||||
| Time | Event | Action |
|
||||
|------|-------|--------|
|
||||
| 08:00 | Slowness reports from users | Support tickets opened |
|
||||
| 08:15 | Monitoring confirms degradation | p95 latency 2000ms |
|
||||
| 08:30 | Database profiling started | Slow query log analysis |
|
||||
| 09:00 | N+1 query identified | Found 100+ queries per request |
|
||||
| 09:30 | Fix implemented | Eager loading + indexes |
|
||||
| 10:00 | Caching added | Redis for frequently accessed data |
|
||||
| 10:30 | Deployment complete | Latency back to 200ms |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms and Detection
|
||||
|
||||
### Initial Metrics
|
||||
|
||||
**Latency Increase**:
|
||||
```
|
||||
p50: 180ms → 1800ms (+900% slower)
|
||||
p95: 220ms → 2100ms (+854% slower)
|
||||
p99: 450ms → 3500ms (+677% slower)
|
||||
|
||||
Requests timing out: 5% (>3s timeout)
|
||||
```
|
||||
|
||||
**User Impact**:
|
||||
- Page load times: 5-10 seconds
|
||||
- API timeouts: 5% of requests
|
||||
- Support tickets: 47 in 1 hour
|
||||
- User complaints: "App is unusable"
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Step 1: Application Performance Monitoring
|
||||
|
||||
**Wrangler Tail Analysis**:
|
||||
```bash
|
||||
# Monitor worker requests in real-time
|
||||
wrangler tail --format pretty
|
||||
|
||||
# Output shows slow requests:
|
||||
[2024-12-05 08:20:15] GET /api/orders - 2145ms
|
||||
└─ database_query: 1950ms (90% of total time!)
|
||||
└─ json_serialization: 150ms
|
||||
└─ response_headers: 45ms
|
||||
|
||||
# Red flag: Database taking 90% of request time
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 2: Database Query Analysis
|
||||
|
||||
**PlanetScale Slow Query Log**:
|
||||
```bash
|
||||
# Enable and check slow queries
|
||||
pscale database insights greyhaven-db main --slow-queries
|
||||
|
||||
# Results:
|
||||
Query: SELECT * FROM order_items WHERE order_id = ?
|
||||
Calls: 157 times per request # ❌ N+1 query problem!
|
||||
Avg time: 12ms per query
|
||||
Total: 1884ms per request (12ms × 157)
|
||||
```
|
||||
|
||||
**N+1 Query Pattern Identified**:
|
||||
```python
|
||||
# api/orders.py (BEFORE - N+1 Problem)
|
||||
@router.get("/orders/{user_id}")
|
||||
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
|
||||
# Query 1: Get all orders for user
|
||||
orders = session.exec(
|
||||
select(Order).where(Order.user_id == user_id)
|
||||
).all() # Returns 157 orders
|
||||
|
||||
# Query 2-158: Get items for EACH order (N+1!)
|
||||
for order in orders:
|
||||
order.items = session.exec(
|
||||
select(OrderItem).where(OrderItem.order_id == order.id)
|
||||
).all() # 157 additional queries!
|
||||
|
||||
return orders
|
||||
|
||||
# Total queries: 1 + 157 = 158 queries per request
|
||||
# Total time: 10ms + (157 × 12ms) = 1894ms
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Database Index Analysis
|
||||
|
||||
**Missing Indexes**:
|
||||
```sql
|
||||
-- Check existing indexes
|
||||
SELECT indexname, indexdef
|
||||
FROM pg_indexes
|
||||
WHERE tablename = 'order_items';
|
||||
|
||||
-- Results:
|
||||
-- Primary key on id (exists) ✅
|
||||
-- NO index on order_id ❌ (needed for WHERE clause)
|
||||
-- NO index on user_id ❌ (needed for joins)
|
||||
|
||||
-- Explain plan shows full table scan
|
||||
EXPLAIN ANALYZE
|
||||
SELECT * FROM order_items WHERE order_id = 123;
|
||||
|
||||
-- Result:
|
||||
Seq Scan on order_items (cost=0.00..1500.00 rows=1 width=100) (actual time=12.345..12.345 rows=5 loops=157)
|
||||
Filter: (order_id = 123)
|
||||
Rows Removed by Filter: 10000
|
||||
|
||||
-- Full table scan on 10K rows, 157 times = extremely slow!
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Fix 1: Eliminate N+1 with Eager Loading
|
||||
|
||||
**After - Single Query with Join**:
|
||||
```python
|
||||
# api/orders.py (AFTER - Eager Loading)
|
||||
from sqlmodel import select
|
||||
from sqlalchemy.orm import selectinload
|
||||
|
||||
@router.get("/orders/{user_id}")
|
||||
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
|
||||
# ✅ Single query with eager loading
|
||||
statement = (
|
||||
select(Order)
|
||||
.where(Order.user_id == user_id)
|
||||
.options(selectinload(Order.items)) # Eager load items
|
||||
)
|
||||
|
||||
orders = session.exec(statement).all()
|
||||
|
||||
return orders
|
||||
|
||||
# Total queries: 2 (1 for orders, 1 for all items)
|
||||
# Total time: 10ms + 25ms = 35ms (98% faster!)
|
||||
```
|
||||
|
||||
**Query Comparison**:
|
||||
```
|
||||
BEFORE (N+1):
|
||||
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
|
||||
- Query 2-158: SELECT * FROM order_items WHERE order_id = ? (×157, 12ms each)
|
||||
- Total: 1894ms
|
||||
|
||||
AFTER (Eager Loading):
|
||||
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
|
||||
- Query 2: SELECT * FROM order_items WHERE order_id IN (?, ?, ..., ?) (25ms)
|
||||
- Total: 35ms (54x faster!)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 2: Add Database Indexes
|
||||
|
||||
**Create Indexes**:
|
||||
```sql
|
||||
-- Index on order_id for faster lookups
|
||||
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
|
||||
|
||||
-- Index on user_id for user queries
|
||||
CREATE INDEX idx_orders_user_id ON orders(user_id);
|
||||
|
||||
-- Index on created_at for time-based queries
|
||||
CREATE INDEX idx_orders_created_at ON orders(created_at);
|
||||
|
||||
-- Composite index for common filters
|
||||
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
|
||||
```
|
||||
|
||||
**Before/After EXPLAIN**:
|
||||
```sql
|
||||
-- BEFORE (no index):
|
||||
EXPLAIN ANALYZE SELECT * FROM order_items WHERE order_id = 123;
|
||||
Seq Scan (cost=0.00..1500.00) (actual time=12.345ms)
|
||||
|
||||
-- AFTER (with index):
|
||||
Index Scan using idx_order_items_order_id (cost=0.00..8.50) (actual time=0.045ms)
|
||||
|
||||
-- 270x faster (12.345ms → 0.045ms)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 3: Implement Redis Caching
|
||||
|
||||
**Cache Frequent Queries**:
|
||||
```typescript
|
||||
// cache.ts - Redis caching layer
|
||||
import { Redis } from '@upstash/redis';
|
||||
|
||||
const redis = new Redis({
|
||||
url: env.UPSTASH_REDIS_URL,
|
||||
token: env.UPSTASH_REDIS_TOKEN
|
||||
});
|
||||
|
||||
async function getCachedOrders(userId: number) {
|
||||
const cacheKey = `orders:user:${userId}`;
|
||||
|
||||
// Check cache
|
||||
const cached = await redis.get(cacheKey);
|
||||
if (cached) {
|
||||
return JSON.parse(cached); // Cache hit
|
||||
}
|
||||
|
||||
// Cache miss - query database
|
||||
const orders = await fetchOrdersFromDb(userId);
|
||||
|
||||
// Store in cache (5 minute TTL)
|
||||
await redis.setex(cacheKey, 300, JSON.stringify(orders));
|
||||
|
||||
return orders;
|
||||
}
|
||||
```
|
||||
|
||||
**Cache Hit Rates**:
|
||||
```
|
||||
Requests: 10,000
|
||||
Cache hits: 8,500 (85%)
|
||||
Cache misses: 1,500 (15%)
|
||||
|
||||
Avg latency with cache:
|
||||
- Cache hit: 5ms (Redis)
|
||||
- Cache miss: 35ms (database)
|
||||
- Overall: (0.85 × 5) + (0.15 × 35) = 9.5ms
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 4: Database Connection Pooling
|
||||
|
||||
**Optimize Pool Settings**:
|
||||
```python
|
||||
# database.py - Tuned for performance
|
||||
engine = create_engine(
|
||||
database_url,
|
||||
pool_size=50, # Increased from 20
|
||||
max_overflow=20,
|
||||
pool_recycle=1800, # 30 minutes
|
||||
pool_pre_ping=True, # Health check
|
||||
echo=False,
|
||||
connect_args={
|
||||
"server_settings": {
|
||||
"statement_timeout": "30000", # 30s query timeout
|
||||
"idle_in_transaction_session_timeout": "60000" # 60s idle
|
||||
}
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Results
|
||||
|
||||
### Performance Metrics
|
||||
|
||||
| Metric | Before | After | Improvement |
|
||||
|--------|--------|-------|-------------|
|
||||
| **p50 Latency** | 1800ms | 180ms | **90% faster** |
|
||||
| **p95 Latency** | 2100ms | 220ms | **90% faster** |
|
||||
| **p99 Latency** | 3500ms | 450ms | **87% faster** |
|
||||
| **Database Queries** | 158/request | 2/request | **99% reduction** |
|
||||
| **Cache Hit Rate** | 0% | 85% | **85% hits** |
|
||||
| **Timeout Errors** | 5% | 0% | **100% eliminated** |
|
||||
|
||||
### Cost Impact
|
||||
|
||||
**Database Query Reduction**:
|
||||
```
|
||||
Before: 158 queries × 100 req/s = 15,800 queries/s
|
||||
After: 2 queries × 100 req/s = 200 queries/s
|
||||
|
||||
Reduction: 98.7% fewer queries
|
||||
Cost savings: $450/month (reduced database tier)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention Measures
|
||||
|
||||
### 1. Query Performance Monitoring
|
||||
|
||||
**Slow Query Alert**:
|
||||
```yaml
|
||||
# Alert on slow database queries
|
||||
- alert: SlowDatabaseQueries
|
||||
expr: histogram_quantile(0.95, rate(database_query_duration_seconds[5m])) > 0.1
|
||||
for: 5m
|
||||
annotations:
|
||||
summary: "Database queries p95 >100ms"
|
||||
```
|
||||
|
||||
### 2. N+1 Query Detection
|
||||
|
||||
**Test for N+1 Patterns**:
|
||||
```python
|
||||
# tests/test_n_plus_one.py
|
||||
import pytest
|
||||
from sqlalchemy import event
|
||||
from database import engine
|
||||
|
||||
@pytest.fixture
|
||||
def query_counter():
|
||||
"""Count SQL queries during test"""
|
||||
queries = []
|
||||
|
||||
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
|
||||
queries.append(statement)
|
||||
|
||||
event.listen(engine, "before_cursor_execute", before_cursor_execute)
|
||||
yield queries
|
||||
event.remove(engine, "before_cursor_execute", before_cursor_execute)
|
||||
|
||||
def test_get_user_orders_no_n_plus_one(query_counter):
|
||||
"""Verify endpoint doesn't have N+1 queries"""
|
||||
get_user_orders(user_id=1)
|
||||
|
||||
# Should be 2 queries max (orders + items)
|
||||
assert len(query_counter) <= 2, f"N+1 detected: {len(query_counter)} queries"
|
||||
```
|
||||
|
||||
### 3. Database Index Coverage
|
||||
|
||||
```sql
|
||||
-- Check for missing indexes
|
||||
SELECT
|
||||
schemaname,
|
||||
tablename,
|
||||
attname,
|
||||
n_distinct,
|
||||
correlation
|
||||
FROM pg_stats
|
||||
WHERE schemaname = 'public'
|
||||
AND n_distinct > 100 -- Cardinality suggests index needed
|
||||
ORDER BY tablename, attname;
|
||||
```
|
||||
|
||||
### 4. Performance Budget
|
||||
|
||||
```typescript
|
||||
// Set performance budgets
|
||||
const PERFORMANCE_BUDGETS = {
|
||||
api_latency_p95: 500, // ms
|
||||
database_queries_per_request: 5,
|
||||
cache_hit_rate_min: 0.70, // 70%
|
||||
};
|
||||
|
||||
// CI/CD check
|
||||
if (metrics.api_latency_p95 > PERFORMANCE_BUDGETS.api_latency_p95) {
|
||||
throw new Error(`Performance budget exceeded: ${metrics.api_latency_p95}ms > 500ms`);
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
### What Went Well
|
||||
|
||||
✅ Slow query log pinpointed N+1 problem
|
||||
✅ Eager loading eliminated 99% of queries
|
||||
✅ Indexes provided 270x speedup
|
||||
✅ Caching reduced load by 85%
|
||||
|
||||
### What Could Be Improved
|
||||
|
||||
❌ No N+1 query detection before production
|
||||
❌ Missing indexes not caught in code review
|
||||
❌ No caching layer initially
|
||||
❌ No query performance monitoring
|
||||
|
||||
### Key Takeaways
|
||||
|
||||
1. **Always use eager loading** for associations
|
||||
2. **Add indexes** for all foreign keys and WHERE clauses
|
||||
3. **Implement caching** for frequently accessed data
|
||||
4. **Monitor query counts** per request (alert on >10)
|
||||
5. **Test for N+1** in CI/CD pipeline
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
|
||||
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
|
||||
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
|
||||
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
|
||||
|
||||
---
|
||||
|
||||
Return to [examples index](INDEX.md)
|
||||
@@ -0,0 +1,499 @@
|
||||
# PlanetScale Connection Pool Exhaustion
|
||||
|
||||
Complete investigation of database connection pool exhaustion causing 503 errors, resolved through connection pool tuning and leak fixes.
|
||||
|
||||
## Overview
|
||||
|
||||
**Incident**: Database connection timeouts causing 15% request failure rate
|
||||
**Impact**: Customer-facing 503 errors, support tickets increasing
|
||||
**Root Cause**: Connection pool too small + unclosed connections in error paths
|
||||
**Resolution**: Pool tuning (20→50) + connection leak fixes
|
||||
**Status**: Resolved
|
||||
|
||||
## Incident Timeline
|
||||
|
||||
| Time | Event | Action |
|
||||
|------|-------|--------|
|
||||
| 09:30 | Alerts: High 503 error rate | Oncall paged |
|
||||
| 09:35 | Investigation started | Check logs, metrics |
|
||||
| 09:45 | Database connections at 100% | Identified pool exhaustion |
|
||||
| 10:00 | Temporary fix: restart service | Bought time for root cause |
|
||||
| 10:30 | Code analysis complete | Found connection leaks |
|
||||
| 11:00 | Fix deployed (pool + leaks) | Production deployment |
|
||||
| 11:30 | Monitoring confirmed stable | Incident resolved |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms and Detection
|
||||
|
||||
### Initial Alerts
|
||||
|
||||
**Prometheus Alert**:
|
||||
```yaml
|
||||
# Alert: HighErrorRate
|
||||
expr: rate(http_requests_total{status="503"}[5m]) > 0.05
|
||||
for: 5m
|
||||
annotations:
|
||||
summary: "503 error rate >5% for 5 minutes"
|
||||
description: "Current rate: {{ $value | humanizePercentage }}"
|
||||
```
|
||||
|
||||
**Error Logs**:
|
||||
```
|
||||
[ERROR] Database query failed: connection timeout
|
||||
[ERROR] Pool exhausted, waiting for available connection
|
||||
[ERROR] Request timeout after 30s waiting for DB connection
|
||||
```
|
||||
|
||||
**Impact Metrics**:
|
||||
```
|
||||
Error rate: 15% (150 failures per 1000 requests)
|
||||
User complaints: 23 support tickets in 30 minutes
|
||||
Failed transactions: ~$15,000 in abandoned carts
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Step 1: Check Connection Pool Status
|
||||
|
||||
**Query PlanetScale**:
|
||||
```bash
|
||||
# Connect to database
|
||||
pscale shell greyhaven-db main
|
||||
|
||||
# Check active connections
|
||||
SELECT
|
||||
COUNT(*) as active_connections,
|
||||
MAX(pg_stat_activity.query_start) as oldest_query
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active';
|
||||
|
||||
# Result:
|
||||
# active_connections: 98
|
||||
# oldest_query: 2024-12-05 09:15:23 (15 minutes ago!)
|
||||
```
|
||||
|
||||
**Check Application Pool**:
|
||||
```python
|
||||
# In FastAPI app - add diagnostic endpoint
|
||||
from sqlmodel import Session
|
||||
from database import engine
|
||||
|
||||
@app.get("/pool-status")
|
||||
def pool_status():
|
||||
pool = engine.pool
|
||||
return {
|
||||
"size": pool.size(),
|
||||
"checked_out": pool.checkedout(),
|
||||
"overflow": pool.overflow(),
|
||||
"timeout": pool._timeout,
|
||||
"max_overflow": pool._max_overflow
|
||||
}
|
||||
|
||||
# Response:
|
||||
{
|
||||
"size": 20,
|
||||
"checked_out": 20, # Pool exhausted!
|
||||
"overflow": 0,
|
||||
"timeout": 30,
|
||||
"max_overflow": 10
|
||||
}
|
||||
```
|
||||
|
||||
**Red Flags**:
|
||||
- ✅ Pool at 100% capacity (20/20 connections checked out)
|
||||
- ✅ No overflow connections being used (0/10)
|
||||
- ✅ Connections held for >15 minutes
|
||||
- ✅ New requests timing out waiting for connections
|
||||
|
||||
---
|
||||
|
||||
### Step 2: Identify Connection Leaks
|
||||
|
||||
**Code Review - Found Vulnerable Pattern**:
|
||||
```python
|
||||
# api/orders.py (BEFORE - LEAK)
|
||||
from fastapi import APIRouter
|
||||
from sqlmodel import Session, select
|
||||
from database import engine
|
||||
|
||||
router = APIRouter()
|
||||
|
||||
@router.post("/orders")
|
||||
async def create_order(order_data: OrderCreate):
|
||||
# ❌ LEAK: Session never closed on exception
|
||||
session = Session(engine)
|
||||
|
||||
# Create order
|
||||
order = Order(**order_data.dict())
|
||||
session.add(order)
|
||||
session.commit()
|
||||
|
||||
# If exception here, session never closed!
|
||||
if order.total > 10000:
|
||||
raise ValueError("Order exceeds limit")
|
||||
|
||||
# session.close() never reached
|
||||
return order
|
||||
```
|
||||
|
||||
**How Leak Occurs**:
|
||||
1. Request creates session (acquires connection from pool)
|
||||
2. Exception raised after commit
|
||||
3. Function exits without calling `session.close()`
|
||||
4. Connection remains "checked out" from pool
|
||||
5. After 20 such exceptions, pool exhausted
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Load Testing to Reproduce
|
||||
|
||||
**Test Script**:
|
||||
```python
|
||||
# test_connection_leak.py
|
||||
import asyncio
|
||||
import httpx
|
||||
|
||||
async def create_order(client, amount):
|
||||
"""Create order that will trigger exception"""
|
||||
try:
|
||||
response = await client.post(
|
||||
"https://api.greyhaven.io/orders",
|
||||
json={"total": amount}
|
||||
)
|
||||
return response.status_code
|
||||
except Exception:
|
||||
return 503
|
||||
|
||||
async def load_test():
|
||||
"""Simulate 100 orders with high amounts (triggers leak)"""
|
||||
async with httpx.AsyncClient() as client:
|
||||
# Trigger 100 exceptions (leak 100 connections)
|
||||
tasks = [create_order(client, 15000) for _ in range(100)]
|
||||
results = await asyncio.gather(*tasks)
|
||||
|
||||
success = sum(1 for r in results if r == 201)
|
||||
errors = sum(1 for r in results if r == 503)
|
||||
|
||||
print(f"Success: {success}, Errors: {errors}")
|
||||
|
||||
asyncio.run(load_test())
|
||||
```
|
||||
|
||||
**Results**:
|
||||
```
|
||||
Success: 20 (first 20 use all connections)
|
||||
Errors: 80 (remaining 80 timeout waiting for pool)
|
||||
|
||||
Proves: Connection leak exhausts pool
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Fix 1: Use Context Manager (Guaranteed Cleanup)
|
||||
|
||||
**After - With Context Manager**:
|
||||
```python
|
||||
# api/orders.py (AFTER - FIXED)
|
||||
from fastapi import APIRouter, Depends
|
||||
from sqlmodel import Session
|
||||
from database import get_session
|
||||
|
||||
router = APIRouter()
|
||||
|
||||
# ✅ Dependency injection with automatic cleanup
|
||||
def get_session():
|
||||
with Session(engine) as session:
|
||||
yield session
|
||||
# Session always closed (even on exception)
|
||||
|
||||
@router.post("/orders")
|
||||
async def create_order(
|
||||
order_data: OrderCreate,
|
||||
session: Session = Depends(get_session)
|
||||
):
|
||||
# Session managed by FastAPI dependency
|
||||
order = Order(**order_data.dict())
|
||||
session.add(order)
|
||||
session.commit()
|
||||
|
||||
# Exception here? No problem - session still closed by context manager
|
||||
if order.total > 10000:
|
||||
raise ValueError("Order exceeds limit")
|
||||
|
||||
return order
|
||||
```
|
||||
|
||||
**Why This Works**:
|
||||
- Context manager (`with` statement) guarantees `session.close()` in `__exit__`
|
||||
- Works even if exception raised
|
||||
- FastAPI `Depends()` handles async cleanup automatically
|
||||
|
||||
---
|
||||
|
||||
### Fix 2: Increase Connection Pool Size
|
||||
|
||||
**Before** (pool too small):
|
||||
```python
|
||||
# database.py (BEFORE)
|
||||
from sqlmodel import create_engine
|
||||
|
||||
engine = create_engine(
|
||||
database_url,
|
||||
pool_size=20, # Too small for load
|
||||
max_overflow=10,
|
||||
pool_timeout=30
|
||||
)
|
||||
```
|
||||
|
||||
**After** (tuned for load):
|
||||
```python
|
||||
# database.py (AFTER)
|
||||
from sqlmodel import create_engine
|
||||
import os
|
||||
|
||||
# Calculate pool size based on workers
|
||||
# Formula: (workers * 2) + buffer
|
||||
# 16 workers * 2 + 20 buffer = 52
|
||||
workers = int(os.getenv("WEB_CONCURRENCY", 16))
|
||||
pool_size = (workers * 2) + 20
|
||||
|
||||
engine = create_engine(
|
||||
database_url,
|
||||
pool_size=pool_size, # 52 connections
|
||||
max_overflow=20, # Burst to 72 total
|
||||
pool_timeout=30,
|
||||
pool_recycle=3600, # Recycle after 1 hour
|
||||
pool_pre_ping=True, # Verify connection health
|
||||
echo=False
|
||||
)
|
||||
```
|
||||
|
||||
**Pool Size Calculation**:
|
||||
```
|
||||
Workers: 16 (Uvicorn workers)
|
||||
Connections per worker: 2 (normal peak)
|
||||
Buffer: 20 (for spikes)
|
||||
|
||||
pool_size = (16 * 2) + 20 = 52
|
||||
max_overflow = 20 (total 72 for extreme spikes)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 3: Add Connection Pool Monitoring
|
||||
|
||||
**Prometheus Metrics**:
|
||||
```python
|
||||
# monitoring.py
|
||||
from prometheus_client import Gauge
|
||||
from database import engine
|
||||
|
||||
# Pool metrics
|
||||
db_pool_size = Gauge('db_pool_size_total', 'Total pool size')
|
||||
db_pool_checked_out = Gauge('db_pool_checked_out', 'Connections in use')
|
||||
db_pool_idle = Gauge('db_pool_idle', 'Idle connections')
|
||||
db_pool_overflow = Gauge('db_pool_overflow', 'Overflow connections')
|
||||
|
||||
def update_pool_metrics():
|
||||
"""Update pool metrics every 10 seconds"""
|
||||
pool = engine.pool
|
||||
db_pool_size.set(pool.size())
|
||||
db_pool_checked_out.set(pool.checkedout())
|
||||
db_pool_idle.set(pool.size() - pool.checkedout())
|
||||
db_pool_overflow.set(pool.overflow())
|
||||
|
||||
# Schedule in background task
|
||||
import asyncio
|
||||
async def pool_monitor():
|
||||
while True:
|
||||
update_pool_metrics()
|
||||
await asyncio.sleep(10)
|
||||
```
|
||||
|
||||
**Grafana Alert**:
|
||||
```yaml
|
||||
# Alert: Connection pool near exhaustion
|
||||
expr: db_pool_checked_out / db_pool_size_total > 0.8
|
||||
for: 5m
|
||||
annotations:
|
||||
summary: "Connection pool >80% utilized"
|
||||
description: "{{ $value | humanizePercentage }} of pool in use"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 4: Add Timeout and Retry Logic
|
||||
|
||||
**Connection Timeout Handling**:
|
||||
```python
|
||||
# database.py - Add connection retry
|
||||
from tenacity import retry, stop_after_attempt, wait_exponential
|
||||
|
||||
@retry(
|
||||
stop=stop_after_attempt(3),
|
||||
wait=wait_exponential(multiplier=1, min=1, max=10)
|
||||
)
|
||||
def get_session_with_retry():
|
||||
"""Get session with automatic retry on pool timeout"""
|
||||
try:
|
||||
with Session(engine) as session:
|
||||
yield session
|
||||
except TimeoutError:
|
||||
# Pool exhausted - retry after exponential backoff
|
||||
raise
|
||||
|
||||
@router.post("/orders")
|
||||
async def create_order(
|
||||
order_data: OrderCreate,
|
||||
session: Session = Depends(get_session_with_retry)
|
||||
):
|
||||
# Will retry up to 3 times if pool exhausted
|
||||
...
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Results
|
||||
|
||||
### Before vs After Metrics
|
||||
|
||||
| Metric | Before Fix | After Fix | Improvement |
|
||||
|--------|-----------|-----------|-------------|
|
||||
| **Connection Pool Size** | 20 | 52 | +160% capacity |
|
||||
| **Pool Utilization** | 100% (exhausted) | 40-60% (healthy) | -40% utilization |
|
||||
| **503 Error Rate** | 15% | 0.01% | **99.9% reduction** |
|
||||
| **Request Timeout** | 30s (waiting) | <100ms | **99.7% faster** |
|
||||
| **Leaked Connections** | 12/hour | 0/day | **100% eliminated** |
|
||||
|
||||
---
|
||||
|
||||
### Deployment Verification
|
||||
|
||||
**Load Test After Fix**:
|
||||
```bash
|
||||
# Simulate 1000 concurrent orders
|
||||
ab -n 1000 -c 50 -p order.json https://api.greyhaven.io/orders
|
||||
|
||||
# Results:
|
||||
Requests per second: 250 [#/sec]
|
||||
Time per request: 200ms [mean]
|
||||
Failed requests: 0 (0%)
|
||||
Successful requests: 1000 (100%)
|
||||
|
||||
# Pool status during test:
|
||||
{
|
||||
"size": 52,
|
||||
"checked_out": 28, # 54% utilization (healthy)
|
||||
"overflow": 0,
|
||||
"idle": 24
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention Measures
|
||||
|
||||
### 1. Connection Leak Tests
|
||||
|
||||
```python
|
||||
# tests/test_connection_leaks.py
|
||||
@pytest.fixture
|
||||
def track_connections():
|
||||
before = engine.pool.checkedout()
|
||||
yield
|
||||
after = engine.pool.checkedout()
|
||||
assert after == before, f"Leaked {after - before} connections"
|
||||
```

### 2. Pool Alerts

```yaml
# Alert if pool >80% for 5 minutes
expr: db_pool_checked_out / db_pool_size_total > 0.8
```

### 3. Health Check

```python
from sqlalchemy import text

@app.get("/health/database")
async def database_health():
    with Session(engine) as session:
        session.execute(text("SELECT 1"))
    pool = engine.pool
    return {"status": "healthy", "pool_utilization": pool.checkedout() / pool.size()}
```

### 4. Monitoring Commands

```bash
# Active connections
pscale shell db main --execute "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'"

# Slow queries
pscale database insights db main --slow-queries
```

---

## Lessons Learned

### What Went Well

✅ Quick identification of pool exhaustion (Prometheus alerts)
✅ Context manager pattern eliminated leaks
✅ Pool tuning based on formula (workers * 2 + buffer)
✅ Comprehensive monitoring added

### What Could Be Improved

❌ No pool monitoring before incident
❌ Pool size not calculated based on load
❌ Missing connection leak tests

### Key Takeaways

1. **Always use context managers** for database sessions
2. **Calculate pool size** based on workers and load
3. **Monitor pool utilization** with alerts at 80%
4. **Test for connection leaks** in CI/CD
5. **Add retry logic** for transient pool timeouts

---

## PlanetScale Best Practices

```bash
# Connection string with SSL
DATABASE_URL="postgresql://user:pass@aws.connect.psdb.cloud/db?sslmode=require"

# Schema changes via deploy requests
pscale deploy-request create db schema-update

# Test in branch
pscale branch create db test-feature
```

```sql
-- Index frequently queried columns
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Analyze slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
```

---

## Related Documentation

- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)

---

Return to [examples index](INDEX.md)

72
skills/devops-troubleshooting/reference/INDEX.md
Normal file
@@ -0,0 +1,72 @@

# DevOps Troubleshooter Reference

Quick reference guides for Grey Haven infrastructure troubleshooting - runbooks, diagnostic commands, and platform-specific guides.

## Reference Guides

### Troubleshooting Runbooks

**File**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md)

Step-by-step runbooks for common infrastructure issues:
- **Worker Not Responding**: 500/502/503 errors from Cloudflare Workers
- **Database Connection Failures**: Connection refused, pool exhaustion
- **Deployment Failures**: Failed deployments, rollback procedures
- **Performance Degradation**: Slow responses, high latency
- **Network Issues**: DNS failures, connectivity problems

**Use when**: Following structured resolution for known issues

---

### Diagnostic Commands Reference

**File**: [diagnostic-commands.md](diagnostic-commands.md)

Command reference for quick troubleshooting:
- **Cloudflare Workers**: wrangler commands, log analysis
- **PlanetScale**: Database queries, connection checks
- **Network**: curl timing, DNS resolution, traceroute
- **Performance**: Profiling, metrics collection

**Use when**: Need quick command syntax for diagnostics

---

### Cloudflare Workers Platform Guide

**File**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md)

Cloudflare Workers-specific guidance:
- **Deployment Best Practices**: Bundle size, environment variables
- **Performance Optimization**: CPU limits, memory management
- **Error Handling**: Common errors and solutions
- **Monitoring**: Logs, metrics, analytics

**Use when**: Cloudflare Workers-specific issues

---

## Quick Navigation

**By Issue Type**:
- Worker errors → [troubleshooting-runbooks.md#worker-not-responding](troubleshooting-runbooks.md#worker-not-responding)
- Database issues → [troubleshooting-runbooks.md#database-connection-failures](troubleshooting-runbooks.md#database-connection-failures)
- Performance → [troubleshooting-runbooks.md#performance-degradation](troubleshooting-runbooks.md#performance-degradation)

**By Platform**:
- Cloudflare Workers → [cloudflare-workers-guide.md](cloudflare-workers-guide.md)
- PlanetScale → [diagnostic-commands.md#planetscale-commands](diagnostic-commands.md#planetscale-commands)
- Network → [diagnostic-commands.md#network-commands](diagnostic-commands.md#network-commands)

---

## Related Documentation

- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting walkthroughs
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident report templates
- **Main Agent**: [devops-troubleshooter.md](../devops-troubleshooter.md) - DevOps troubleshooter agent

---

Return to [main agent](../devops-troubleshooter.md)

472
skills/devops-troubleshooting/reference/cloudflare-workers-guide.md
Normal file
@@ -0,0 +1,472 @@

# Cloudflare Workers Platform Guide

Comprehensive guide for deploying, monitoring, and troubleshooting Cloudflare Workers in Grey Haven's stack.

## Workers Architecture

**Execution Model**:
- V8 isolates (not containers)
- Deployed globally to 300+ datacenters
- Request routed to nearest location
- Cold start: ~1-5ms (vs 100-1000ms for containers)
- CPU time limit: 50ms (Free), 50ms-30s (Paid)

**Resource Limits**:
```
Free Plan:
- Bundle size: 1MB compressed
- CPU time: 50ms per request
- Requests: 100,000/day
- KV reads: 100,000/day

Paid Plan ($5/month):
- Bundle size: 10MB compressed
- CPU time: 50ms (standard), up to 30s (unbound)
- Requests: 10M included, $0.50/million after
- KV reads: 10M included
```

---

## Deployment Best Practices

### Bundle Optimization

**Size Reduction Strategies**:
```typescript
// 1. Tree shaking with named imports
import { uniq } from 'lodash-es';  // ✅ Only imports uniq
import _ from 'lodash';            // ❌ Imports entire library

// 2. Use native APIs instead of libraries
const date = new Date().toISOString();  // ✅ Native
import moment from 'moment';            // ❌ 300KB library

// 3. External API calls instead of SDKs
await fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  headers: { 'x-api-key': env.API_KEY },
  body: JSON.stringify({ ... })
});  // ✅ 0KB vs @anthropic-ai/sdk (2.1MB)

// 4. Code splitting with dynamic imports
if (request.url.includes('/special')) {
  const { handler } = await import('./expensive-module');
  return handler(request);
}  // ✅ Lazy load
```

**webpack Configuration**:
```javascript
module.exports = {
  mode: 'production',
  target: 'webworker',
  optimization: {
    minimize: true,
    usedExports: true,  // Tree shaking
    sideEffects: true   // Honor the package.json "sideEffects" flag
  },
  resolve: {
    alias: {
      'lodash': 'lodash-es'  // Use ES modules version
    }
  }
};
```

---

### Environment Variables

**Using Secrets**:
```bash
# Add secret (never in code)
wrangler secret put DATABASE_URL

# List secrets
wrangler secret list

# Delete secret
wrangler secret delete OLD_KEY
```

**Using Variables** (wrangler.toml):
```toml
[vars]
API_ENDPOINT = "https://api.partner.com"
MAX_RETRIES = "3"
CACHE_TTL = "300"

[env.staging.vars]
API_ENDPOINT = "https://staging-api.partner.com"

[env.production.vars]
API_ENDPOINT = "https://api.partner.com"
```

**Accessing in Code**:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const dbUrl = env.DATABASE_URL;     // Secret
    const endpoint = env.API_ENDPOINT;  // Var
    const maxRetries = parseInt(env.MAX_RETRIES);

    return new Response('OK');
  }
};
```
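
The snippets in this guide pass an `env: Env` object around without defining the type. A minimal sketch of such a bindings interface is below; the names mirror the examples in this guide and wrangler.toml, and the `KVNamespace`/`DurableObjectNamespace` typings assume `@cloudflare/workers-types` is installed. Treat it as illustrative, not a required schema.

```typescript
// Illustrative Env interface for the bindings referenced in this guide
interface Env {
  // Secrets (wrangler secret put ...)
  DATABASE_URL: string;
  API_KEY: string;
  // Plain vars ([vars] in wrangler.toml) - always strings, parse as needed
  API_ENDPOINT: string;
  MAX_RETRIES: string;
  CACHE_TTL: string;
  VERSION?: string;
  PROCESSING_API?: string;
  // Platform bindings
  CACHE: KVNamespace;                 // [[kv_namespaces]]
  PROCESSOR: DurableObjectNamespace;  // [[durable_objects.bindings]]
}
```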

---

## Performance Optimization

### CPU Time Management

**Avoid CPU-Intensive Operations**:
```typescript
// ❌ BAD: CPU-intensive operation
function processLargeDataset(data) {
  const sorted = data.sort((a, b) => a.value - b.value);
  const filtered = sorted.filter(item => item.value > 1000);
  const mapped = filtered.map(item => ({ ...item, processed: true }));
  return mapped;  // Can exceed 50ms CPU limit
}

// ✅ GOOD: Offload to external service
async function processLargeDataset(data, env) {
  const response = await fetch(`${env.PROCESSING_API}/process`, {
    method: 'POST',
    body: JSON.stringify(data)
  });
  return response.json();  // External service handles heavy lifting
}

// ✅ BETTER: Use Durable Objects for stateful computation
const id = env.PROCESSOR.idFromName('processor');
const stub = env.PROCESSOR.get(id);
return stub.fetch(request);  // Durable Object has more CPU time
```

**Monitor CPU Usage**:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const start = Date.now();

    try {
      const response = await handleRequest(request, env);
      // Wall-clock duration - a rough proxy for CPU usage, not exact CPU time
      const duration = Date.now() - start;

      if (duration > 40) {
        console.warn(`Request time approaching CPU limit: ${duration}ms`);
      }

      return response;
    } catch (error) {
      const duration = Date.now() - start;
      console.error(`Request failed after ${duration}ms:`, error);
      throw error;
    }
  }
};
```

---

### Caching Strategies

**Cache API**:
```typescript
export default {
  async fetch(request: Request) {
    const cache = caches.default;

    // Check cache
    let response = await cache.match(request);
    if (response) return response;

    // Cache miss - fetch and cache
    response = await fetch(request);

    // Cache for 5 minutes
    const cacheResponse = new Response(response.body, response);
    cacheResponse.headers.set('Cache-Control', 'max-age=300');
    await cache.put(request, cacheResponse.clone());

    // Return the cacheable copy (its body stream is still readable after clone())
    return cacheResponse;
  }
};
```

**KV for Data Caching**:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const url = new URL(request.url);
    const cacheKey = `data:${url.pathname}`;

    // Check KV
    const cached = await env.CACHE.get(cacheKey, 'json');
    if (cached) return Response.json(cached);

    // Fetch data
    const data = await fetchExpensiveData();

    // Store in KV with 5min TTL
    await env.CACHE.put(cacheKey, JSON.stringify(data), {
      expirationTtl: 300
    });

    return Response.json(data);
  }
};
```

---

## Common Errors and Solutions

### Error 1101: Worker Threw Exception

**Cause**: Unhandled JavaScript exception

**Example**:
```typescript
// ❌ BAD: Unhandled error
export default {
  async fetch(request: Request) {
    const data = JSON.parse(request.body);  // request.body is a stream, not text - this throws
    return Response.json(data);
  }
};
```

**Solution**:
```typescript
// ✅ GOOD: Proper error handling
export default {
  async fetch(request: Request) {
    try {
      const body = await request.text();
      const data = JSON.parse(body);
      return Response.json(data);
    } catch (error) {
      console.error('JSON parse error:', error);
      return new Response('Invalid JSON', { status: 400 });
    }
  }
};
```

---

### Error 1015: Rate Limited

**Cause**: Too many requests to origin

**Solution**: Implement caching and rate limiting
```typescript
const RATE_LIMIT = 100;  // requests per minute
const rateLimits = new Map();  // per-isolate counter; use KV or Durable Objects for a global limit

export default {
  async fetch(request: Request) {
    const ip = request.headers.get('CF-Connecting-IP');
    const key = `ratelimit:${ip}`;

    const count = rateLimits.get(key) || 0;
    if (count >= RATE_LIMIT) {
      return new Response('Rate limit exceeded', { status: 429 });
    }

    rateLimits.set(key, count + 1);
    setTimeout(() => rateLimits.delete(key), 60000);

    return new Response('OK');
  }
};
```

---

### Error: Script Exceeds Size Limit

**Diagnosis**:
```bash
# Check bundle size
npm run build
ls -lh dist/worker.js

# Analyze bundle
npm install --save-dev webpack-bundle-analyzer
npm run build -- --analyze
```

**Solutions**: See [bundle optimization](#bundle-optimization) above

---

## Monitoring and Logging

### Structured Logging

```typescript
interface LogEntry {
  level: 'info' | 'warn' | 'error';
  message: string;
  timestamp?: string;  // filled in by log()
  requestId?: string;
  duration?: number;
  metadata?: Record<string, any>;
}

function log(entry: LogEntry) {
  console.log(JSON.stringify({
    ...entry,
    timestamp: new Date().toISOString()
  }));
}

export default {
  async fetch(request: Request, env: Env) {
    const requestId = crypto.randomUUID();
    const start = Date.now();

    try {
      log({
        level: 'info',
        message: 'Request started',
        requestId,
        metadata: {
          method: request.method,
          url: request.url
        }
      });

      const response = await handleRequest(request, env);

      log({
        level: 'info',
        message: 'Request completed',
        requestId,
        duration: Date.now() - start,
        metadata: {
          status: response.status
        }
      });

      return response;
    } catch (error) {
      log({
        level: 'error',
        message: 'Request failed',
        requestId,
        duration: Date.now() - start,
        metadata: {
          error: error.message,
          stack: error.stack
        }
      });

      return new Response('Internal Server Error', { status: 500 });
    }
  }
};
```

---

### Health Check Endpoint

```typescript
export default {
  async fetch(request: Request, env: Env) {
    const url = new URL(request.url);

    if (url.pathname === '/health') {
      return Response.json({
        status: 'healthy',
        timestamp: new Date().toISOString(),
        version: env.VERSION || 'unknown'
      });
    }

    // Regular request handling
    return handleRequest(request, env);
  }
};
```

---

## Testing Workers

```bash
# Local testing
wrangler dev
curl http://localhost:8787/api/users
curl -X POST http://localhost:8787/api/users -H "Content-Type: application/json" -d '{"name": "Test User"}'
```

```typescript
// Unit testing (Vitest)
import { describe, it, expect } from 'vitest';
import worker from './worker';

describe('Worker', () => {
  it('returns 200 for health check', async () => {
    const request = new Request('https://example.com/health');
    const response = await worker.fetch(request, getMockEnv());  // getMockEnv(): test helper returning fake bindings
    expect(response.status).toBe(200);
  });
});
```

---

## Security Best Practices

```typescript
// 1. Validate inputs
function validateEmail(email: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// 2. Set security headers
function addSecurityHeaders(response: Response): Response {
  response.headers.set('X-Content-Type-Options', 'nosniff');
  response.headers.set('X-Frame-Options', 'DENY');
  response.headers.set('Strict-Transport-Security', 'max-age=31536000');
  return response;
}

// 3. CORS configuration
const ALLOWED_ORIGINS = ['https://app.greyhaven.io', 'https://staging.greyhaven.io'];

function handleCors(request: Request): Response | null {
  const origin = request.headers.get('Origin');
  // Reject disallowed origins before answering preflight requests
  if (origin && !ALLOWED_ORIGINS.includes(origin)) {
    return new Response('Forbidden', { status: 403 });
  }
  if (request.method === 'OPTIONS') {
    return new Response(null, {
      headers: {
        'Access-Control-Allow-Origin': origin ?? ALLOWED_ORIGINS[0],
        'Access-Control-Allow-Methods': 'GET,POST,PUT,DELETE',
        'Access-Control-Max-Age': '86400'
      }
    });
  }
  return null;
}
```
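
One way to wire these helpers together is sketched below. `handleRequest` stands in for your own routing and is an assumption, not part of this guide; the response is re-wrapped before adding headers because responses returned directly from `fetch()` have immutable headers.

```typescript
// Sketch: combining the security helpers above in a single handler
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const cors = handleCors(request);
    if (cors) return cors;  // preflight response or 403 for a disallowed origin

    const response = await handleRequest(request, env);  // assumed router
    // Re-wrap so headers are mutable, then attach security headers on the way out
    return addSecurityHeaders(new Response(response.body, response));
  }
};
```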

---

## Related Documentation

- **Runbooks**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md) - Step-by-step procedures
- **Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
- **Examples**: [Examples Index](../examples/INDEX.md) - Full examples

---

Return to [reference index](INDEX.md)

473
skills/devops-troubleshooting/reference/diagnostic-commands.md
Normal file
@@ -0,0 +1,473 @@

# Diagnostic Commands Reference

Quick command reference for Grey Haven infrastructure troubleshooting. Copy-paste ready commands for rapid diagnosis.

## Cloudflare Workers Commands

### Deployment Management

```bash
# List recent deployments
wrangler deployments list

# View specific deployment
wrangler deployments view <deployment-id>

# Rollback to previous version
wrangler rollback --message "Reverting due to errors"

# Deploy to production
wrangler deploy

# Deploy to staging
wrangler deploy --env staging
```

### Logs and Monitoring

```bash
# Real-time logs (pretty format)
wrangler tail --format pretty

# JSON logs for parsing
wrangler tail --format json

# Filter by status code
wrangler tail --format json | grep "\"status\":500"

# Show only errors
wrangler tail --format json | grep -i "error"

# Save logs to file
wrangler tail --format json > worker-logs.json

# Monitor specific worker
wrangler tail --name my-worker
```

### Local Development

```bash
# Start local dev server
wrangler dev

# Dev with specific port
wrangler dev --port 8788

# Dev with remote mode (use production bindings)
wrangler dev --remote

# Test locally
curl http://localhost:8787/api/health
```

### Configuration

```bash
# Show account info
wrangler whoami

# List KV namespaces
wrangler kv:namespace list

# List secrets
wrangler secret list

# Add secret
wrangler secret put API_KEY

# Delete secret
wrangler secret delete API_KEY
```

---

## PlanetScale Commands

### Database Management

```bash
# Connect to database shell
pscale shell greyhaven-db main

# Connect and execute query
pscale shell greyhaven-db main --execute "SELECT COUNT(*) FROM users"

# Show database info
pscale database show greyhaven-db

# List all databases
pscale database list

# Create new branch
pscale branch create greyhaven-db feature-branch

# List branches
pscale branch list greyhaven-db
```

### Connection Monitoring

```sql
-- Active connections
SELECT COUNT(*) as active_connections
FROM pg_stat_activity
WHERE state = 'active';

-- Long-running queries
SELECT
  pid,
  now() - query_start as duration,
  query
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '10 seconds'
ORDER BY duration DESC;

-- Connection by state
SELECT state, COUNT(*)
FROM pg_stat_activity
GROUP BY state;

-- Blocked queries
SELECT
  blocked.pid AS blocked_pid,
  blocking.pid AS blocking_pid,
  blocked.query AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
```

### Performance Analysis

```bash
# Slow query insights
pscale database insights greyhaven-db main --slow-queries

# Database size
pscale database show greyhaven-db --web

# Enable slow query log
pscale database settings update greyhaven-db --enable-slow-query-log
```

```sql
-- Table sizes
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

-- Index usage
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan as index_scans
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

-- Cache hit ratio
SELECT
  'cache hit rate' AS metric,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;
```

### Schema Migrations

```bash
# Create deploy request
pscale deploy-request create greyhaven-db <branch-name>

# List deploy requests
pscale deploy-request list greyhaven-db

# View deploy request diff
pscale deploy-request diff greyhaven-db <number>

# Deploy schema changes
pscale deploy-request deploy greyhaven-db <number>

# Close deploy request
pscale deploy-request close greyhaven-db <number>
```

---

## Network Diagnostic Commands

### DNS Resolution

```bash
# Basic DNS lookup
nslookup api.partner.com

# Detailed DNS query
dig api.partner.com

# Measure DNS time
time nslookup api.partner.com

# Check DNS propagation
dig api.partner.com @8.8.8.8
dig api.partner.com @1.1.1.1

# Reverse DNS lookup
dig -x 203.0.113.42
```

### Connectivity Testing

```bash
# Ping test
ping -c 10 api.partner.com

# Trace network route
traceroute api.partner.com

# TCP connection test
nc -zv api.partner.com 443

# Test specific port
telnet api.partner.com 443
```

### HTTP Request Timing

```bash
# Full timing breakdown
curl -w "\nDNS Lookup: %{time_namelookup}s\nTCP Connect: %{time_connect}s\nTLS Handshake: %{time_appconnect}s\nStart Transfer: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://api.partner.com/data

# Test with specific method
curl -X POST https://api.example.com/api \
  -H "Content-Type: application/json" \
  -d '{"test": "data"}'

# Follow redirects
curl -L https://example.com

# Show response headers
curl -I https://api.example.com

# Test CORS
curl -I -X OPTIONS https://api.example.com \
  -H "Origin: https://app.example.com" \
  -H "Access-Control-Request-Method: POST"
```

### SSL/TLS Verification

```bash
# Check SSL certificate
openssl s_client -connect api.example.com:443

# Show certificate expiry
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# Verify certificate chain
openssl s_client -connect api.example.com:443 -showcerts
```

---

## Application Performance Commands

### Resource Monitoring

```bash
# CPU usage (macOS; on Linux use: top -o %CPU)
top -o cpu

# Memory usage
free -h     # Linux
vm_stat     # macOS

# Disk usage
df -h

# Process list
ps aux | grep node

# Port usage
lsof -i :8000
netstat -an | grep 8000
```

### Log Analysis

```bash
# Tail logs
tail -f /var/log/app.log

# Search logs
grep -i "error" /var/log/app.log

# Count errors
grep -c "ERROR" /var/log/app.log

# Show recent errors with context
grep -B 5 -A 5 "error" /var/log/app.log

# Parse JSON logs
cat app.log | jq 'select(.level=="error")'

# Error frequency
grep "ERROR" /var/log/app.log | cut -d' ' -f1 | uniq -c
```

### Worker Performance

```bash
# Monitor CPU time
wrangler tail --format json | jq '.outcome.cpuTime'

# Monitor duration
wrangler tail --format json | jq '.outcome.duration'

# Requests per second
wrangler tail --format json | wc -l

# Average response time
wrangler tail --format json | \
  jq -r '.outcome.duration' | \
  awk '{sum+=$1; count++} END {print sum/count}'
```

---

## Health Check Scripts

### Worker Health Check

```bash
#!/bin/bash
# health-check-worker.sh

echo "=== Worker Health Check ==="

# Test endpoint
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://api.greyhaven.io/health)

if [ "$STATUS" -eq 200 ]; then
  echo "✅ Worker responding (HTTP $STATUS)"
else
  echo "❌ Worker error (HTTP $STATUS)"
  exit 1
fi

# Check response time
TIME=$(curl -w "%{time_total}" -o /dev/null -s https://api.greyhaven.io/health)
echo "Response time: ${TIME}s"

if (( $(echo "$TIME > 1.0" | bc -l) )); then
  echo "⚠️ Slow response (${TIME}s > 1.0s)"
fi
```

### Database Health Check

```bash
#!/bin/bash
# health-check-db.sh

echo "=== Database Health Check ==="

# Test connection
pscale shell greyhaven-db main --execute "SELECT 1" > /dev/null 2>&1

if [ $? -eq 0 ]; then
  echo "✅ Database connection OK"
else
  echo "❌ Database connection failed"
  exit 1
fi

# Check active connections
ACTIVE=$(pscale shell greyhaven-db main --execute \
  "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'" | tail -1)

echo "Active connections: $ACTIVE"

if [ "$ACTIVE" -gt 80 ]; then
  echo "⚠️ High connection count (>80)"
fi
```

### Complete System Health

```bash
#!/bin/bash
# health-check-all.sh

echo "=== Complete System Health Check ==="

# Worker
echo -e "\n1. Cloudflare Worker"
./health-check-worker.sh

# Database
echo -e "\n2. PlanetScale Database"
./health-check-db.sh

# External APIs
echo -e "\n3. External Dependencies"
for API in "https://api.partner1.com/health" "https://api.partner2.com/health"; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API")
  if [ "$STATUS" -eq 200 ]; then
    echo "✅ $API (HTTP $STATUS)"
  else
    echo "❌ $API (HTTP $STATUS)"
  fi
done

echo -e "\n=== Health Check Complete ==="
```

---

## Troubleshooting One-Liners

```bash
# Find memory hogs
ps aux --sort=-%mem | head -10

# Find CPU hogs
ps aux --sort=-%cpu | head -10

# Disk space by directory
du -sh /* | sort -h

# Network connections
netstat -ant | awk '{print $6}' | sort | uniq -c

# Failed login attempts
grep "Failed password" /var/log/auth.log | wc -l

# Top error codes
awk '{print $9}' access.log | sort | uniq -c | sort -rn

# Requests per minute
awk '{print $4}' access.log | cut -d: -f1-3 | uniq -c

# Average response size
awk '{sum+=$10; count++} END {print sum/count}' access.log
```

---

## Related Documentation

- **Runbooks**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md) - Step-by-step procedures
- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific
- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples

---

Return to [reference index](INDEX.md)

489
skills/devops-troubleshooting/reference/troubleshooting-runbooks.md
Normal file
@@ -0,0 +1,489 @@

# Troubleshooting Runbooks

Step-by-step runbooks for resolving common Grey Haven infrastructure issues. Follow procedures systematically for fastest resolution.

## Runbook 1: Worker Not Responding

### Symptoms
- API returning 500/502/503 errors
- Workers timing out or not processing requests
- Cloudflare error pages showing

### Diagnosis Steps

**1. Check Cloudflare Status**
```bash
# Visit: https://www.cloudflarestatus.com
# Or query status API
curl -s https://www.cloudflarestatus.com/api/v2/status.json | jq '.status.indicator'
```

**2. View Worker Logs**
```bash
# Real-time logs
wrangler tail --format pretty

# Look for errors:
# - "Script exceeded CPU time limit"
# - "Worker threw exception"
# - "Uncaught TypeError"
```

**3. Check Recent Deployments**
```bash
wrangler deployments list

# If recent deployment suspicious, rollback:
wrangler rollback --message "Reverting to stable version"
```

**4. Test Worker Locally**
```bash
# Run worker in dev mode
wrangler dev

# Test endpoint
curl http://localhost:8787/api/health
```

### Resolution Paths

**Path A: Platform Issue** - Wait for Cloudflare, monitor status, communicate ETA
**Path B: Code Error** - Rollback deployment, fix in dev, test before redeploy
**Path C: Resource Limit** - Check CPU logs, optimize operations, upgrade if needed
**Path D: Binding Issue** - Verify wrangler.toml, check bindings, redeploy

### Prevention
- Health check endpoint: `GET /health`
- Monitor error rate with alerts (>1% = alert)
- Test deployments in staging first
- Implement circuit breakers for external calls (see the sketch below)
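
Circuit breakers are recommended above but not shown elsewhere in these runbooks, so a minimal per-isolate sketch follows. The thresholds and the partner API call are illustrative assumptions; for a fleet-wide breaker the failure count would need to live in a Durable Object rather than isolate memory.

```typescript
// Minimal in-memory circuit breaker (per-isolate); thresholds are illustrative
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 5, private resetAfterMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    const open = this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.resetAfterMs;
    if (open) {
      throw new Error('Circuit open: skipping call to failing dependency');
    }
    try {
      const result = await fn();
      this.failures = 0;  // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}

const partnerApi = new CircuitBreaker();
// Usage: const res = await partnerApi.call(() => fetch('https://api.partner.com/data'));
```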

---

## Runbook 2: Database Connection Failures

### Symptoms
- "connection refused" errors
- "too many connections" errors
- Application timing out on database queries
- 503 errors from API

### Diagnosis Steps

**1. Test Database Connection**
```bash
# Direct connection test
pscale shell greyhaven-db main

# If fails, check:
# - Database status
# - Credentials
# - Network connectivity
```

**2. Check Connection Pool**
```bash
# Query pool status
curl http://localhost:8000/pool-status

# Expected healthy response:
{
  "size": 50,
  "checked_out": 25,  # <80% is healthy
  "overflow": 0,
  "available": 25
}
```

**3. Check Active Connections**
```sql
-- In pscale shell
SELECT
  COUNT(*) as active,
  MIN(query_start) as oldest_query
FROM pg_stat_activity
WHERE state = 'active';

-- If active = pool size, pool exhausted
-- If oldest_query >10min, leaked connection
```

**4. Review Application Logs**
```bash
# Search for connection errors
grep -i "connection" logs/app.log | tail -50

# Common errors:
# - "Pool timeout"
# - "Connection refused"
# - "Max connections reached"
```

### Resolution Paths

**Path A: Invalid Credentials**
```bash
# Rotate credentials
pscale password create greyhaven-db main app-password

# Update environment variable
# Restart application
```

**Path B: Pool Exhausted**
```python
# Increase pool size in database.py
engine = create_engine(
    database_url,
    pool_size=50,  # Increase from 20
    max_overflow=20
)
```

**Path C: Connection Leaks**
```python
# Fix: Use context managers
with Session(engine) as session:
    # Work with session
    pass  # Automatically closed
```

**Path D: Database Paused/Down**
```bash
# Resume database if paused
pscale database resume greyhaven-db

# Check database status
pscale database show greyhaven-db
```

### Prevention
- Use connection pooling with proper limits
- Implement retry logic with exponential backoff
- Monitor pool utilization (alert >80%)
- Test for connection leaks in CI/CD

---

## Runbook 3: Deployment Failures

### Symptoms
- `wrangler deploy` fails
- CI/CD pipeline fails at deployment step
- New code not reflecting in production

### Diagnosis Steps

**1. Check Deployment Error**
```bash
wrangler deploy --verbose

# Common errors:
# - "Script exceeds size limit"
# - "Syntax error in worker"
# - "Environment variable missing"
# - "Binding not found"
```

**2. Verify Build Output**
```bash
# Check built file
ls -lh dist/
npm run build

# Ensure build succeeds locally
```

**3. Check Environment Variables**
```bash
# List secrets
wrangler secret list

# Verify wrangler.toml vars
cat wrangler.toml | grep -A 10 "\[vars\]"
```

**4. Test Locally**
```bash
# Start dev server
wrangler dev

# If works locally but not production:
# - Environment variable mismatch
# - Binding configuration issue
```

### Resolution Paths

**Path A: Bundle Too Large**
```bash
# Check bundle size
ls -lh dist/worker.js

# Solutions:
# - Tree shake unused code
# - Code split large modules
# - Use fetch instead of SDK
```

**Path B: Syntax Error**
```bash
# Run TypeScript check
npm run type-check

# Run linter
npm run lint

# Fix errors before deploying
```

**Path C: Missing Variables**
```bash
# Add missing secret
wrangler secret put API_KEY

# Or add to wrangler.toml vars
[vars]
API_ENDPOINT = "https://api.example.com"
```

**Path D: Binding Not Found**
```toml
# wrangler.toml - Add binding
[[kv_namespaces]]
binding = "CACHE"
id = "abc123"

[[d1_databases]]
binding = "DB"
database_name = "greyhaven-db"
database_id = "xyz789"
```

### Prevention
- Bundle size check in CI/CD
- Pre-commit hooks for validation
- Staging environment for testing
- Automated deployment tests

---

## Runbook 4: Performance Degradation

### Symptoms
- API response times increased (>2x normal)
- Slow page loads
- User complaints about slowness
- Timeout errors

### Diagnosis Steps

**1. Check Current Latency**
```bash
# Test endpoint
curl -w "\nTotal: %{time_total}s\n" -o /dev/null -s https://api.greyhaven.io/orders

# p95 should be <500ms
# If >1s, investigate
```

**2. Analyze Worker Logs**
```bash
wrangler tail --format json | jq '{duration: .outcome.duration, event: .event}'

# Identify slow requests
# Check what's taking time
```

**3. Check Database Queries**
```bash
# Slow query log
pscale database insights greyhaven-db main --slow-queries

# Look for:
# - N+1 queries (many small queries)
# - Missing indexes (full table scans)
# - Long-running queries (>100ms)
```

**4. Profile Application**
```bash
# Add timing middleware
# Log slow operations
# Identify bottleneck (DB, API, compute)
```
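
A sketch of the timing middleware suggested above: a small wrapper that logs how long each stage of a request takes, so the bottleneck (database, external API, or compute) stands out in `wrangler tail`. The stage names and downstream calls are placeholders.

```typescript
// Log the duration of each stage of request handling as structured JSON
async function timed<T>(stage: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    console.log(JSON.stringify({ stage, ms: Date.now() - start }));
  }
}

// Usage inside a handler (illustrative names):
// const user  = await timed('db:getUser', () => getUser(env, id));
// const quote = await timed('api:partnerQuote', () => fetch(quoteUrl).then(r => r.json()));
```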

### Resolution Paths

**Path A: N+1 Queries**
```python
# Use eager loading
statement = (
    select(Order)
    .options(selectinload(Order.items))
)
```

**Path B: Missing Indexes**
```sql
-- Add indexes
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_items_order_id ON order_items(order_id);
```

**Path C: No Caching**
```typescript
// Add Redis caching
const cached = await redis.get(cacheKey);
if (cached) return cached;

const result = await expensiveOperation();
await redis.setex(cacheKey, 300, result);
```

**Path D: Worker CPU Limit**
```typescript
// Optimize expensive operations
// Use async operations
// Offload to external service
```

### Prevention
- Monitor p95 latency (alert >500ms)
- Test for N+1 queries in CI/CD
- Add indexes for foreign keys
- Implement caching layer
- Performance budgets in tests

---

## Runbook 5: Network Connectivity Issues

### Symptoms
- Intermittent failures
- DNS resolution errors
- Connection timeouts
- CORS errors

### Diagnosis Steps

**1. Test DNS Resolution**
```bash
# Check DNS
nslookup api.partner.com
dig api.partner.com

# Measure DNS time
time nslookup api.partner.com

# If >1s, DNS is slow
```

**2. Test Connectivity**
```bash
# Basic connectivity
ping api.partner.com

# Trace route
traceroute api.partner.com

# Full timing breakdown
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://api.partner.com
```

**3. Check CORS**
```bash
# Preflight request
curl -I -X OPTIONS https://api.greyhaven.io/api/users \
  -H "Origin: https://app.greyhaven.io" \
  -H "Access-Control-Request-Method: POST"

# Verify headers:
# - Access-Control-Allow-Origin
# - Access-Control-Allow-Methods
```

**4. Check Firewall/Security**
```bash
# Test from different location
# Check IP whitelist
# Verify SSL certificate
```

### Resolution Paths

**Path A: Slow DNS**
```typescript
// Implement DNS caching
const DNS_CACHE = new Map();
// Cache DNS for 60s
```

**Path B: Connection Timeout**
```typescript
// Increase timeout: abort the fetch if it takes longer than 30s
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 30000);
const response = await fetch(url, { signal: controller.signal });
clearTimeout(timer);
```

**Path C: CORS Error**
```typescript
// Add CORS headers
response.headers.set('Access-Control-Allow-Origin', origin);
response.headers.set('Access-Control-Allow-Methods', 'GET,POST,PUT,DELETE');
```

**Path D: SSL/TLS Issue**
```bash
# Check certificate
openssl s_client -connect api.partner.com:443

# Verify not expired
# Check certificate chain
```

### Prevention
- DNS caching (60s TTL)
- Appropriate timeouts (30s for external APIs)
- Health checks for external dependencies
- Circuit breakers for failures
- Monitor external API latency

---

## Emergency Procedures (SEV1)

**Immediate Actions**:
1. **Assess**: Users affected? Functionality broken? Data loss risk?
2. **Communicate**: Alert team, update status page
3. **Stop Bleeding**: `wrangler rollback` or disable feature
4. **Diagnose**: Logs, recent changes, metrics
5. **Fix**: Hotfix or workaround, test first
6. **Verify**: Monitor metrics, test functionality
7. **Postmortem**: Document, root cause, prevention

---

## Escalation Matrix

| Issue Type | First Response | Escalate To | Escalation Trigger |
|------------|---------------|-------------|-------------------|
| Worker errors | DevOps troubleshooter | incident-responder | SEV1/SEV2 |
| Performance | DevOps troubleshooter | performance-optimizer | >30min unresolved |
| Database | DevOps troubleshooter | data-validator | Schema issues |
| Security | DevOps troubleshooter | security-analyzer | Breach suspected |
| Application bugs | DevOps troubleshooter | smart-debug | Infrastructure ruled out |

---

## Related Documentation

- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples
- **Diagnostic Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific

---

Return to [reference index](INDEX.md)

81
skills/devops-troubleshooting/templates/INDEX.md
Normal file
@@ -0,0 +1,81 @@

# DevOps Troubleshooter Templates

Ready-to-use templates for infrastructure incident response, deployment checklists, and performance investigations.

## Available Templates

### Incident Report Template

**File**: [incident-report-template.md](incident-report-template.md)

Comprehensive template for documenting infrastructure incidents:
- **Incident Overview**: Summary, impact, timeline
- **Root Cause Analysis**: What happened, why it happened
- **Resolution Steps**: What was done to fix it
- **Prevention Measures**: How to prevent recurrence
- **Lessons Learned**: What went well, what could improve

**Use when**: Documenting production outages, degradations, or significant infrastructure issues

**Copy and fill in** all sections for your specific incident.

---

### Deployment Checklist

**File**: [deployment-checklist.md](deployment-checklist.md)

Pre-deployment and post-deployment verification checklist:
- **Pre-Deployment Verification**: Code review, tests, dependencies, configuration
- **Deployment Steps**: Backup, deploy, verify, rollback plan
- **Post-Deployment Monitoring**: Health checks, metrics, logs, alerts
- **Rollback Procedures**: When and how to rollback

**Use when**: Deploying Cloudflare Workers, database migrations, infrastructure changes

**Check off** each item before and after deployment.

---

### Performance Investigation Template

**File**: [performance-investigation-template.md](performance-investigation-template.md)

Systematic template for investigating performance issues:
- **Performance Baseline**: Current metrics vs expected
- **Hypothesis Generation**: Potential root causes
- **Data Collection**: Profiling, metrics, logs
- **Analysis**: What the data reveals
- **Optimization Plan**: Prioritized fixes with impact estimates
- **Validation**: Before/after metrics

**Use when**: API latency increases, database slow queries, high CPU/memory usage

**Follow systematically** to diagnose and resolve performance problems.

---

## Template Usage

**How to use these templates**:
1. Copy the template file to your project documentation
2. Fill in all sections marked with `[FILL IN]` placeholders
3. Remove sections that don't apply (optional)
4. Share with your team for review

**When to create reports**:
- **Incident Report**: After any production incident (SEV1-SEV3)
- **Deployment Checklist**: Before every production deployment
- **Performance Investigation**: When performance degrades >20%

---

## Related Documentation

- **Examples**: [Examples Index](../examples/INDEX.md) - Real-world troubleshooting walkthroughs
- **Reference**: [Reference Index](../reference/INDEX.md) - Runbooks and diagnostic commands
- **Main Agent**: [devops-troubleshooter.md](../devops-troubleshooter.md) - DevOps troubleshooter agent

---

Return to [main agent](../devops-troubleshooter.md)