Initial commit

Zhongwei Li
2025-11-29 18:29:23 +08:00
commit ebc71f5387
37 changed files with 9382 additions and 0 deletions

# DevOps Troubleshooter Reference
Quick reference guides for Grey Haven infrastructure troubleshooting: runbooks, diagnostic commands, and platform-specific guides.
## Reference Guides
### Troubleshooting Runbooks
**File**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md)
Step-by-step runbooks for common infrastructure issues:
- **Worker Not Responding**: 500/502/503 errors from Cloudflare Workers
- **Database Connection Failures**: Connection refused, pool exhaustion
- **Deployment Failures**: Failed deployments, rollback procedures
- **Performance Degradation**: Slow responses, high latency
- **Network Issues**: DNS failures, connectivity problems
**Use when**: Following structured resolution for known issues
---
### Diagnostic Commands Reference
**File**: [diagnostic-commands.md](diagnostic-commands.md)
Command reference for quick troubleshooting:
- **Cloudflare Workers**: wrangler commands, log analysis
- **PlanetScale**: Database queries, connection checks
- **Network**: curl timing, DNS resolution, traceroute
- **Performance**: Profiling, metrics collection
**Use when**: Need quick command syntax for diagnostics
---
### Cloudflare Workers Platform Guide
**File**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md)
Cloudflare Workers-specific guidance:
- **Deployment Best Practices**: Bundle size, environment variables
- **Performance Optimization**: CPU limits, memory management
- **Error Handling**: Common errors and solutions
- **Monitoring**: Logs, metrics, analytics
**Use when**: Cloudflare Workers-specific issues
---
## Quick Navigation
**By Issue Type**:
- Worker errors → [troubleshooting-runbooks.md#worker-not-responding](troubleshooting-runbooks.md#worker-not-responding)
- Database issues → [troubleshooting-runbooks.md#database-connection-failures](troubleshooting-runbooks.md#database-connection-failures)
- Performance → [troubleshooting-runbooks.md#performance-degradation](troubleshooting-runbooks.md#performance-degradation)
**By Platform**:
- Cloudflare Workers → [cloudflare-workers-guide.md](cloudflare-workers-guide.md)
- PlanetScale → [diagnostic-commands.md#planetscale-commands](diagnostic-commands.md#planetscale-commands)
- Network → [diagnostic-commands.md#network-commands](diagnostic-commands.md#network-commands)
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting walkthroughs
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident report templates
- **Main Agent**: [devops-troubleshooter.md](../devops-troubleshooter.md) - DevOps troubleshooter agent
---
Return to [main agent](../devops-troubleshooter.md)

# Cloudflare Workers Platform Guide
Comprehensive guide for deploying, monitoring, and troubleshooting Cloudflare Workers in Grey Haven's stack.
## Workers Architecture
**Execution Model**:
- V8 isolates (not containers)
- Deployed globally to 300+ datacenters
- Request routed to nearest location
- Cold start: ~1-5ms (vs 100-1000ms for containers)
- CPU time limit: 50ms (Free), 50ms-30s (Paid)
**Resource Limits**:
```
Free Plan:
- Bundle size: 1MB compressed
- CPU time: 50ms per request
- Requests: 100,000/day
- KV reads: 100,000/day
Paid Plan ($5/month):
- Bundle size: 10MB compressed
- CPU time: 50ms (standard), up to 30s (unbound)
- Requests: 10M included, $0.50/million after
- KV reads: 10M included
```
---
## Deployment Best Practices
### Bundle Optimization
**Size Reduction Strategies**:
```typescript
// 1. Tree shaking with named imports
import { uniq } from 'lodash-es'; // ✅ Only imports uniq
import _ from 'lodash'; // ❌ Imports entire library

// 2. Use native APIs instead of libraries
const date = new Date().toISOString(); // ✅ Native
import moment from 'moment'; // ❌ 300KB library

// 3. External API calls instead of SDKs
await fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  headers: { 'x-api-key': env.API_KEY },
  body: JSON.stringify({ ... })
}); // ✅ 0KB vs @anthropic-ai/sdk (2.1MB)

// 4. Code splitting with dynamic imports
if (request.url.includes('/special')) {
  const { handler } = await import('./expensive-module');
  return handler(request);
} // ✅ Lazy load
```
**webpack Configuration**:
```javascript
module.exports = {
  mode: 'production',
  target: 'webworker',
  optimization: {
    minimize: true,
    usedExports: true, // Tree shaking
    sideEffects: false
  },
  resolve: {
    alias: {
      'lodash': 'lodash-es' // Use ES modules version
    }
  }
};
```
---
### Environment Variables
**Using Secrets**:
```bash
# Add secret (never in code)
wrangler secret put DATABASE_URL
# List secrets
wrangler secret list
# Delete secret
wrangler secret delete OLD_KEY
```
**Using Variables** (wrangler.toml):
```toml
[vars]
API_ENDPOINT = "https://api.partner.com"
MAX_RETRIES = "3"
CACHE_TTL = "300"
[env.staging.vars]
API_ENDPOINT = "https://staging-api.partner.com"
[env.production.vars]
API_ENDPOINT = "https://api.partner.com"
```
**Accessing in Code**:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const dbUrl = env.DATABASE_URL; // Secret
    const endpoint = env.API_ENDPOINT; // Var
    const maxRetries = parseInt(env.MAX_RETRIES, 10); // Vars are always strings
    return new Response('OK');
  }
};
```
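The snippet above references an `Env` type without defining it. A minimal sketch, assuming field names that mirror the examples above (the `intVar` helper is hypothetical, added to guard the `parseInt` call):

```typescript
// Bindings available to the Worker; vars always arrive as strings
interface Env {
  DATABASE_URL: string; // secret via `wrangler secret put`
  API_ENDPOINT: string; // [vars] in wrangler.toml
  MAX_RETRIES: string;
  CACHE_TTL: string;
}

// Parse a numeric var defensively, falling back when unset or malformed
function intVar(value: string | undefined, fallback: number): number {
  const parsed = parseInt(value ?? '', 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}
```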
---
## Performance Optimization
### CPU Time Management
**Avoid CPU-Intensive Operations**:
```typescript
// ❌ BAD: CPU-intensive operation
function processLargeDataset(data) {
  const sorted = data.sort((a, b) => a.value - b.value);
  const filtered = sorted.filter(item => item.value > 1000);
  const mapped = filtered.map(item => ({ ...item, processed: true }));
  return mapped; // Can exceed 50ms CPU limit
}

// ✅ GOOD: Offload to external service
async function processLargeDataset(data, env) {
  const response = await fetch(`${env.PROCESSING_API}/process`, {
    method: 'POST',
    body: JSON.stringify(data)
  });
  return response.json(); // External service handles heavy lifting
}

// ✅ BETTER: Use Durable Objects for stateful computation
const id = env.PROCESSOR.idFromName('processor');
const stub = env.PROCESSOR.get(id);
return stub.fetch(request); // Durable Object has more CPU time
```
**Monitor CPU Usage**:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    // Date.now() measures wall time, which only approximates CPU time
    const start = Date.now();
    try {
      const response = await handleRequest(request, env);
      const duration = Date.now() - start;
      if (duration > 40) {
        console.warn(`CPU time approaching limit: ${duration}ms`);
      }
      return response;
    } catch (error) {
      const duration = Date.now() - start;
      console.error(`Request failed after ${duration}ms:`, error);
      throw error;
    }
  }
};
```
---
### Caching Strategies
**Cache API**:
```typescript
export default {
  async fetch(request: Request) {
    const cache = caches.default;
    // Check cache
    let response = await cache.match(request);
    if (response) return response;
    // Cache miss - fetch, then re-wrap so headers are mutable
    response = await fetch(request);
    response = new Response(response.body, response);
    // Cache for 5 minutes
    response.headers.set('Cache-Control', 'max-age=300');
    // A body can only be read once: put() a clone, return the original
    await cache.put(request, response.clone());
    return response;
  }
};
```
**KV for Data Caching**:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const url = new URL(request.url);
    const cacheKey = `data:${url.pathname}`;
    // Check KV
    const cached = await env.CACHE.get(cacheKey, 'json');
    if (cached) return Response.json(cached);
    // Fetch data
    const data = await fetchExpensiveData();
    // Store in KV with 5min TTL
    await env.CACHE.put(cacheKey, JSON.stringify(data), {
      expirationTtl: 300
    });
    return Response.json(data);
  }
};
```
---
## Common Errors and Solutions
### Error 1101: Worker Threw Exception
**Cause**: Unhandled JavaScript exception
**Example**:
```typescript
// ❌ BAD: Unhandled error
export default {
  async fetch(request: Request) {
    const data = JSON.parse(request.body); // Throws if invalid JSON
    return Response.json(data);
  }
};
```
**Solution**:
```typescript
// ✅ GOOD: Proper error handling
export default {
  async fetch(request: Request) {
    try {
      const body = await request.text();
      const data = JSON.parse(body);
      return Response.json(data);
    } catch (error) {
      console.error('JSON parse error:', error);
      return new Response('Invalid JSON', { status: 400 });
    }
  }
};
```
---
### Error 1015: Rate Limited
**Cause**: Too many requests to origin
**Solution**: Implement caching and rate limiting
```typescript
// NOTE: this Map lives in a single isolate, so limits are per-datacenter
// approximations; use KV or Durable Objects for a global limit
const RATE_LIMIT = 100; // requests per minute
const rateLimits = new Map();

export default {
  async fetch(request: Request) {
    const ip = request.headers.get('CF-Connecting-IP');
    const key = `ratelimit:${ip}`;
    const count = rateLimits.get(key) || 0;
    if (count >= RATE_LIMIT) {
      return new Response('Rate limit exceeded', { status: 429 });
    }
    rateLimits.set(key, count + 1);
    setTimeout(() => rateLimits.delete(key), 60000);
    return new Response('OK');
  }
};
```
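Because the in-memory Map above is per-isolate, a KV-backed counter gets closer to a global limit. A best-effort sketch (the `KVLike` shape and key format are assumptions; KV is eventually consistent, so counts are approximate):

```typescript
// Minimal stand-in for the subset of the KV interface used below
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

// Fixed-window counter; the window key rolls over each minute
async function isRateLimited(kv: KVLike, ip: string, limit = 100): Promise<boolean> {
  const windowKey = `ratelimit:${ip}:${Math.floor(Date.now() / 60000)}`;
  const count = parseInt((await kv.get(windowKey)) ?? '0', 10);
  if (count >= limit) return true;
  await kv.put(windowKey, String(count + 1), { expirationTtl: 120 });
  return false;
}
```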
---
### Error: Script Exceeds Size Limit
**Diagnosis**:
```bash
# Check bundle size
npm run build
ls -lh dist/worker.js
# Analyze bundle
npm install --save-dev webpack-bundle-analyzer
npm run build -- --analyze
```
**Solutions**: See [bundle optimization](#bundle-optimization) above
---
## Monitoring and Logging
### Structured Logging
```typescript
interface LogEntry {
  level: 'info' | 'warn' | 'error';
  message: string;
  timestamp: string;
  requestId?: string;
  duration?: number;
  metadata?: Record<string, any>;
}

function log(entry: LogEntry) {
  console.log(JSON.stringify({
    ...entry,
    timestamp: new Date().toISOString()
  }));
}
export default {
  async fetch(request: Request, env: Env) {
    const requestId = crypto.randomUUID();
    const start = Date.now();
    try {
      log({
        level: 'info',
        message: 'Request started',
        requestId,
        metadata: {
          method: request.method,
          url: request.url
        }
      });
      const response = await handleRequest(request, env);
      log({
        level: 'info',
        message: 'Request completed',
        requestId,
        duration: Date.now() - start,
        metadata: {
          status: response.status
        }
      });
      return response;
    } catch (error) {
      log({
        level: 'error',
        message: 'Request failed',
        requestId,
        duration: Date.now() - start,
        metadata: {
          error: error.message,
          stack: error.stack
        }
      });
      return new Response('Internal Server Error', { status: 500 });
    }
  }
};
```
---
### Health Check Endpoint
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const url = new URL(request.url);
    if (url.pathname === '/health') {
      return Response.json({
        status: 'healthy',
        timestamp: new Date().toISOString(),
        version: env.VERSION || 'unknown'
      });
    }
    // Regular request handling
    return handleRequest(request, env);
  }
};
```
---
## Testing Workers
```bash
# Local testing
wrangler dev
curl http://localhost:8787/api/users
curl -X POST http://localhost:8787/api/users -H "Content-Type: application/json" -d '{"name": "Test User"}'
```

```typescript
// Unit testing (Vitest)
import { describe, it, expect } from 'vitest';
import worker from './worker';

describe('Worker', () => {
  it('returns 200 for health check', async () => {
    const request = new Request('https://example.com/health');
    const response = await worker.fetch(request, getMockEnv());
    expect(response.status).toBe(200);
  });
});
```
---
## Security Best Practices
```typescript
// 1. Validate inputs
function validateEmail(email: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// 2. Set security headers
function addSecurityHeaders(response: Response): Response {
  response.headers.set('X-Content-Type-Options', 'nosniff');
  response.headers.set('X-Frame-Options', 'DENY');
  response.headers.set('Strict-Transport-Security', 'max-age=31536000');
  return response;
}

// 3. CORS configuration
const ALLOWED_ORIGINS = ['https://app.greyhaven.io', 'https://staging.greyhaven.io'];

function handleCors(request: Request): Response | null {
  const origin = request.headers.get('Origin');
  // Reject disallowed origins first; otherwise the preflight
  // response below would approve any origin
  if (origin && !ALLOWED_ORIGINS.includes(origin)) {
    return new Response('Forbidden', { status: 403 });
  }
  if (request.method === 'OPTIONS') {
    return new Response(null, {
      headers: {
        'Access-Control-Allow-Origin': origin ?? '',
        'Access-Control-Allow-Methods': 'GET,POST,PUT,DELETE',
        'Access-Control-Max-Age': '86400'
      }
    });
  }
  return null;
}
```
---
## Related Documentation
- **Runbooks**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md) - Step-by-step procedures
- **Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
- **Examples**: [Examples Index](../examples/INDEX.md) - Full examples
---
Return to [reference index](INDEX.md)

# Diagnostic Commands Reference
Quick command reference for Grey Haven infrastructure troubleshooting. Copy-paste ready commands for rapid diagnosis.
## Cloudflare Workers Commands
### Deployment Management
```bash
# List recent deployments
wrangler deployments list
# View specific deployment
wrangler deployments view <deployment-id>
# Rollback to previous version
wrangler rollback --message "Reverting due to errors"
# Deploy to production
wrangler deploy
# Deploy to staging
wrangler deploy --env staging
```
### Logs and Monitoring
```bash
# Real-time logs (pretty format)
wrangler tail --format pretty
# JSON logs for parsing
wrangler tail --format json
# Filter by status code
wrangler tail --format json | grep "\"status\":500"
# Show only errors
wrangler tail --format json | grep -i "error"
# Save logs to file
wrangler tail --format json > worker-logs.json
# Monitor specific worker
wrangler tail --name my-worker
```
### Local Development
```bash
# Start local dev server
wrangler dev
# Dev with specific port
wrangler dev --port 8788
# Dev with remote mode (use production bindings)
wrangler dev --remote
# Test locally
curl http://localhost:8787/api/health
```
### Configuration
```bash
# Show account info
wrangler whoami
# List KV namespaces
wrangler kv:namespace list
# List secrets
wrangler secret list
# Add secret
wrangler secret put API_KEY
# Delete secret
wrangler secret delete API_KEY
```
---
## PlanetScale Commands
### Database Management
```bash
# Connect to database shell
pscale shell greyhaven-db main
# Connect and execute query
pscale shell greyhaven-db main --execute "SELECT COUNT(*) FROM users"
# Show database info
pscale database show greyhaven-db
# List all databases
pscale database list
# Create new branch
pscale branch create greyhaven-db feature-branch
# List branches
pscale branch list greyhaven-db
```
### Connection Monitoring
```sql
-- Active connections
SELECT COUNT(*) AS active_connections
FROM pg_stat_activity
WHERE state = 'active';

-- Long-running queries
SELECT
  pid,
  now() - query_start AS duration,
  query
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '10 seconds'
ORDER BY duration DESC;

-- Connections by state
SELECT state, COUNT(*)
FROM pg_stat_activity
GROUP BY state;

-- Blocked queries
SELECT
  blocked.pid AS blocked_pid,
  blocking.pid AS blocking_pid,
  blocked.query AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
```
### Performance Analysis
```bash
# Slow query insights
pscale database insights greyhaven-db main --slow-queries
# Database size
pscale database show greyhaven-db --web
# Enable slow query log
pscale database settings update greyhaven-db --enable-slow-query-log
```
```sql
-- Table sizes
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

-- Index usage (low idx_scan can indicate an unused index)
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan AS index_scans
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

-- Cache hit ratio
SELECT
  'cache hit rate' AS metric,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;
```
### Schema Migrations
```bash
# Create deploy request
pscale deploy-request create greyhaven-db <branch-name>
# List deploy requests
pscale deploy-request list greyhaven-db
# View deploy request diff
pscale deploy-request diff greyhaven-db <number>
# Deploy schema changes
pscale deploy-request deploy greyhaven-db <number>
# Close deploy request
pscale deploy-request close greyhaven-db <number>
```
---
## Network Diagnostic Commands
### DNS Resolution
```bash
# Basic DNS lookup
nslookup api.partner.com
# Detailed DNS query
dig api.partner.com
# Measure DNS time
time nslookup api.partner.com
# Check DNS propagation
dig api.partner.com @8.8.8.8
dig api.partner.com @1.1.1.1
# Reverse DNS lookup
dig -x 203.0.113.42
```
### Connectivity Testing
```bash
# Ping test
ping -c 10 api.partner.com
# Trace network route
traceroute api.partner.com
# TCP connection test
nc -zv api.partner.com 443
# Test specific port
telnet api.partner.com 443
```
### HTTP Request Timing
```bash
# Full timing breakdown
curl -w "\nDNS Lookup: %{time_namelookup}s\nTCP Connect: %{time_connect}s\nTLS Handshake: %{time_appconnect}s\nStart Transfer: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
-o /dev/null -s https://api.partner.com/data
# Test with specific method
curl -X POST https://api.example.com/api \
-H "Content-Type: application/json" \
-d '{"test": "data"}'
# Follow redirects
curl -L https://example.com
# Show response headers
curl -I https://api.example.com
# Test CORS
curl -I -X OPTIONS https://api.example.com \
-H "Origin: https://app.example.com" \
-H "Access-Control-Request-Method: POST"
```
### SSL/TLS Verification
```bash
# Check SSL certificate
openssl s_client -connect api.example.com:443
# Show certificate expiry
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Verify certificate chain
openssl s_client -connect api.example.com:443 -showcerts
```
---
## Application Performance Commands
### Resource Monitoring
```bash
# CPU usage
top -o %CPU  # Linux
top -o cpu   # macOS
# Memory usage
free -h # Linux
vm_stat # macOS
# Disk usage
df -h
# Process list
ps aux | grep node
# Port usage
lsof -i :8000
netstat -an | grep 8000
```
### Log Analysis
```bash
# Tail logs
tail -f /var/log/app.log
# Search logs
grep -i "error" /var/log/app.log
# Count errors
grep -c "ERROR" /var/log/app.log
# Show recent errors with context
grep -B 5 -A 5 "error" /var/log/app.log
# Parse JSON logs
cat app.log | jq 'select(.level=="error")'
# Error frequency by timestamp prefix
grep "ERROR" /var/log/app.log | cut -d' ' -f1 | sort | uniq -c
```
### Worker Performance
```bash
# Monitor CPU time
wrangler tail --format json | jq '.outcome.cpuTime'
# Monitor duration
wrangler tail --format json | jq '.outcome.duration'
# Count requests captured during a tail session (divide by session length for req/s)
wrangler tail --format json | wc -l
# Average response time
wrangler tail --format json | \
jq -r '.outcome.duration' | \
awk '{sum+=$1; count++} END {print sum/count}'
```
---
## Health Check Scripts
### Worker Health Check
```bash
#!/bin/bash
# health-check-worker.sh
echo "=== Worker Health Check ==="
# Test endpoint
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://api.greyhaven.io/health)
if [ "$STATUS" -eq 200 ]; then
  echo "✅ Worker responding (HTTP $STATUS)"
else
  echo "❌ Worker error (HTTP $STATUS)"
  exit 1
fi

# Check response time
TIME=$(curl -w "%{time_total}" -o /dev/null -s https://api.greyhaven.io/health)
echo "Response time: ${TIME}s"
if (( $(echo "$TIME > 1.0" | bc -l) )); then
  echo "⚠️ Slow response (${TIME}s exceeds 1.0s threshold)"
fi
```
### Database Health Check
```bash
#!/bin/bash
# health-check-db.sh
echo "=== Database Health Check ==="
# Test connection
pscale shell greyhaven-db main --execute "SELECT 1" > /dev/null 2>&1
if [ $? -eq 0 ]; then
  echo "✅ Database connection OK"
else
  echo "❌ Database connection failed"
  exit 1
fi

# Check active connections
ACTIVE=$(pscale shell greyhaven-db main --execute \
  "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'" | tail -1)
echo "Active connections: $ACTIVE"
if [ "$ACTIVE" -gt 80 ]; then
  echo "⚠️ High connection count (>80)"
fi
```
### Complete System Health
```bash
#!/bin/bash
# health-check-all.sh
echo "=== Complete System Health Check ==="
# Worker
echo -e "\n1. Cloudflare Worker"
./health-check-worker.sh

# Database
echo -e "\n2. PlanetScale Database"
./health-check-db.sh

# External APIs
echo -e "\n3. External Dependencies"
for API in "https://api.partner1.com/health" "https://api.partner2.com/health"; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API")
  if [ "$STATUS" -eq 200 ]; then
    echo "✅ $API (HTTP $STATUS)"
  else
    echo "❌ $API (HTTP $STATUS)"
  fi
done

echo -e "\n=== Health Check Complete ==="
```
---
## Troubleshooting One-Liners
```bash
# Find memory hogs
ps aux --sort=-%mem | head -10
# Find CPU hogs
ps aux --sort=-%cpu | head -10
# Disk space by directory
du -sh /* | sort -h
# Network connections
netstat -ant | awk '{print $6}' | sort | uniq -c
# Failed login attempts
grep "Failed password" /var/log/auth.log | wc -l
# Top HTTP status codes
awk '{print $9}' access.log | sort | uniq -c | sort -rn
# Requests per minute
awk '{print $4}' access.log | cut -d: -f1-3 | uniq -c
# Average response size
awk '{sum+=$10; count++} END {print sum/count}' access.log
```
---
## Related Documentation
- **Runbooks**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md) - Step-by-step procedures
- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific
- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples
---
Return to [reference index](INDEX.md)

# Troubleshooting Runbooks
Step-by-step runbooks for resolving common Grey Haven infrastructure issues. Follow procedures systematically for fastest resolution.
## Runbook 1: Worker Not Responding
### Symptoms
- API returning 500/502/503 errors
- Workers timing out or not processing requests
- Cloudflare error pages showing
### Diagnosis Steps
**1. Check Cloudflare Status**
```bash
# Visit: https://www.cloudflarestatus.com
# Or query status API
curl -s https://www.cloudflarestatus.com/api/v2/status.json | jq '.status.indicator'
```
**2. View Worker Logs**
```bash
# Real-time logs
wrangler tail --format pretty
# Look for errors:
# - "Script exceeded CPU time limit"
# - "Worker threw exception"
# - "Uncaught TypeError"
```
**3. Check Recent Deployments**
```bash
wrangler deployments list
# If recent deployment suspicious, rollback:
wrangler rollback --message "Reverting to stable version"
```
**4. Test Worker Locally**
```bash
# Run worker in dev mode
wrangler dev
# Test endpoint
curl http://localhost:8787/api/health
```
### Resolution Paths
**Path A: Platform Issue** - Wait for Cloudflare, monitor status, communicate ETA
**Path B: Code Error** - Rollback deployment, fix in dev, test before redeploy
**Path C: Resource Limit** - Check CPU logs, optimize operations, upgrade if needed
**Path D: Binding Issue** - Verify wrangler.toml, check bindings, redeploy
### Prevention
- Health check endpoint: `GET /health`
- Monitor error rate with alerts (>1% = alert)
- Test deployments in staging first
- Implement circuit breakers for external calls
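
The circuit-breaker prevention item can be sketched as a small wrapper that fails fast after repeated errors (the threshold and cooldown values are assumptions, not Grey Haven settings):

```typescript
// Open the circuit after `threshold` consecutive failures; reject calls
// until `cooldownMs` has elapsed, then allow a retry to probe recovery
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private threshold = 5, private cooldownMs = 30000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold &&
        Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error('circuit open');
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Wrap each external dependency in its own breaker so one failing partner API cannot exhaust the Worker's time budget.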
---
## Runbook 2: Database Connection Failures
### Symptoms
- "connection refused" errors
- "too many connections" errors
- Application timing out on database queries
- 503 errors from API
### Diagnosis Steps
**1. Test Database Connection**
```bash
# Direct connection test
pscale shell greyhaven-db main
# If fails, check:
# - Database status
# - Credentials
# - Network connectivity
```
**2. Check Connection Pool**
```bash
# Query pool status
curl http://localhost:8000/pool-status
```

Expected healthy response (`checked_out` under 80% of `size`):

```json
{
  "size": 50,
  "checked_out": 25,
  "overflow": 0,
  "available": 25
}
```
**3. Check Active Connections**
```sql
-- In pscale shell
SELECT
  COUNT(*) AS active,
  MAX(query_start) AS oldest_query
FROM pg_stat_activity
WHERE state = 'active';
-- If active = pool size, the pool is exhausted
-- If oldest_query is >10min old, suspect a leaked connection
```
**4. Review Application Logs**
```bash
# Search for connection errors
grep -i "connection" logs/app.log | tail -50
# Common errors:
# - "Pool timeout"
# - "Connection refused"
# - "Max connections reached"
```
### Resolution Paths
**Path A: Invalid Credentials**
```bash
# Rotate credentials
pscale password create greyhaven-db main app-password
# Update environment variable
# Restart application
```
**Path B: Pool Exhausted**
```python
# Increase pool size in database.py
engine = create_engine(
    database_url,
    pool_size=50,    # Increase from 20
    max_overflow=20,
)
```
**Path C: Connection Leaks**
```python
# Fix: Use context managers
with Session(engine) as session:
    # Work with session
    pass  # Automatically closed on exit
```
**Path D: Database Paused/Down**
```bash
# Resume database if paused
pscale database resume greyhaven-db
# Check database status
pscale database show greyhaven-db
```
### Prevention
- Use connection pooling with proper limits
- Implement retry logic with exponential backoff
- Monitor pool utilization (alert >80%)
- Test for connection leaks in CI/CD
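
The retry-with-exponential-backoff item above can be sketched as follows (attempt counts and base delay are assumptions; jitter spreads out retries from concurrent requests):

```typescript
// Retry a flaky async call with exponential backoff plus jitter
async function withRetry<T>(fn: () => Promise<T>, attempts = 3, baseMs = 100): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const delay = baseMs * 2 ** i + Math.random() * baseMs; // jittered backoff
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastErr;
}
```

Only retry errors that are plausibly transient (timeouts, connection resets); retrying an authentication failure just adds load.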
---
## Runbook 3: Deployment Failures
### Symptoms
- `wrangler deploy` fails
- CI/CD pipeline fails at deployment step
- New code not reflecting in production
### Diagnosis Steps
**1. Check Deployment Error**
```bash
wrangler deploy --verbose
# Common errors:
# - "Script exceeds size limit"
# - "Syntax error in worker"
# - "Environment variable missing"
# - "Binding not found"
```
**2. Verify Build Output**
```bash
# Check built file
ls -lh dist/
npm run build
# Ensure build succeeds locally
```
**3. Check Environment Variables**
```bash
# List secrets
wrangler secret list
# Verify wrangler.toml vars
cat wrangler.toml | grep -A 10 "\[vars\]"
```
**4. Test Locally**
```bash
# Start dev server
wrangler dev
# If works locally but not production:
# - Environment variable mismatch
# - Binding configuration issue
```
### Resolution Paths
**Path A: Bundle Too Large**
```bash
# Check bundle size
ls -lh dist/worker.js
# Solutions:
# - Tree shake unused code
# - Code split large modules
# - Use fetch instead of SDK
```
**Path B: Syntax Error**
```bash
# Run TypeScript check
npm run type-check
# Run linter
npm run lint
# Fix errors before deploying
```
**Path C: Missing Variables**
```bash
# Add missing secret
wrangler secret put API_KEY
```

Or add non-secret values to `wrangler.toml`:

```toml
[vars]
API_ENDPOINT = "https://api.example.com"
```
**Path D: Binding Not Found**
```toml
# wrangler.toml - Add binding
[[kv_namespaces]]
binding = "CACHE"
id = "abc123"
[[d1_databases]]
binding = "DB"
database_name = "greyhaven-db"
database_id = "xyz789"
```
### Prevention
- Bundle size check in CI/CD
- Pre-commit hooks for validation
- Staging environment for testing
- Automated deployment tests
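
The bundle-size check in CI/CD could be a small Node script along these lines (the path and limit are assumptions; the Free-plan limit quoted in the platform guide is 1MB compressed):

```typescript
import { existsSync, statSync } from 'fs';

// Return true when the built artifact is within the size budget
function checkBundle(path: string, limitBytes: number): boolean {
  if (!existsSync(path)) throw new Error(`missing build output: ${path}`);
  return statSync(path).size <= limitBytes;
}

// In CI, fail the job when over budget:
// if (!checkBundle('dist/worker.js', 1024 * 1024)) process.exit(1);
```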
---
## Runbook 4: Performance Degradation
### Symptoms
- API response times increased (>2x normal)
- Slow page loads
- User complaints about slowness
- Timeout errors
### Diagnosis Steps
**1. Check Current Latency**
```bash
# Test endpoint
curl -w "\nTotal: %{time_total}s\n" -o /dev/null -s https://api.greyhaven.io/orders
# p95 should be <500ms
# If >1s, investigate
```
**2. Analyze Worker Logs**
```bash
wrangler tail --format json | jq '{duration: .outcome.duration, event: .event}'
# Identify slow requests
# Check what's taking time
```
**3. Check Database Queries**
```bash
# Slow query log
pscale database insights greyhaven-db main --slow-queries
# Look for:
# - N+1 queries (many small queries)
# - Missing indexes (full table scans)
# - Long-running queries (>100ms)
```
**4. Profile Application**
```bash
# Add timing middleware
# Log slow operations
# Identify bottleneck (DB, API, compute)
```
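The timing-middleware idea above can be sketched as a wrapper that flags slow operations (the 500ms budget is an assumption matching the p95 target in step 1):

```typescript
// Wrap any async step and warn when it exceeds the latency budget
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    const ms = Date.now() - start;
    if (ms > 500) console.warn(`[slow] ${label} took ${ms}ms`);
  }
}
```

Wrap database calls, external fetches, and compute steps separately to see which one dominates the total.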
### Resolution Paths
**Path A: N+1 Queries**
```python
# Use eager loading
statement = (
    select(Order)
    .options(selectinload(Order.items))
)
```
**Path B: Missing Indexes**
```sql
-- Add indexes
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_items_order_id ON order_items(order_id);
```
**Path C: No Caching**
```typescript
// Add Redis caching (assumes a reachable Redis and a JSON-serializable result)
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
const result = await expensiveOperation();
await redis.setex(cacheKey, 300, JSON.stringify(result));
```
**Path D: Worker CPU Limit**
```typescript
// Optimize expensive operations
// Use async operations
// Offload to external service
```
### Prevention
- Monitor p95 latency (alert >500ms)
- Test for N+1 queries in CI/CD
- Add indexes for foreign keys
- Implement caching layer
- Performance budgets in tests
---
## Runbook 5: Network Connectivity Issues
### Symptoms
- Intermittent failures
- DNS resolution errors
- Connection timeouts
- CORS errors
### Diagnosis Steps
**1. Test DNS Resolution**
```bash
# Check DNS
nslookup api.partner.com
dig api.partner.com
# Measure DNS time
time nslookup api.partner.com
# If >1s, DNS is slow
```
**2. Test Connectivity**
```bash
# Basic connectivity
ping api.partner.com
# Trace route
traceroute api.partner.com
# Full timing breakdown
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTotal: %{time_total}s\n" \
-o /dev/null -s https://api.partner.com
```
**3. Check CORS**
```bash
# Preflight request
curl -I -X OPTIONS https://api.greyhaven.io/api/users \
-H "Origin: https://app.greyhaven.io" \
-H "Access-Control-Request-Method: POST"
# Verify headers:
# - Access-Control-Allow-Origin
# - Access-Control-Allow-Methods
```
**4. Check Firewall/Security**
```bash
# Test from different location
# Check IP whitelist
# Verify SSL certificate
```
### Resolution Paths
**Path A: Slow DNS**
```typescript
// Implement DNS caching
const DNS_CACHE = new Map();
// Cache DNS for 60s
```
**Path B: Connection Timeout**
```typescript
// Abort requests that hang instead of waiting forever
const controller = new AbortController();
setTimeout(() => controller.abort(), 30000); // 30s timeout
const response = await fetch(url, { signal: controller.signal });
```
**Path C: CORS Error**
```typescript
// Add CORS headers
response.headers.set('Access-Control-Allow-Origin', origin);
response.headers.set('Access-Control-Allow-Methods', 'GET,POST,PUT,DELETE');
```
**Path D: SSL/TLS Issue**
```bash
# Check certificate
openssl s_client -connect api.partner.com:443
# Verify not expired
# Check certificate chain
```
### Prevention
- DNS caching (60s TTL)
- Appropriate timeouts (30s for external APIs)
- Health checks for external dependencies
- Circuit breakers for failures
- Monitor external API latency
---
## Emergency Procedures (SEV1)
**Immediate Actions**:
1. **Assess**: Users affected? Functionality broken? Data loss risk?
2. **Communicate**: Alert team, update status page
3. **Stop Bleeding**: `wrangler rollback` or disable feature
4. **Diagnose**: Logs, recent changes, metrics
5. **Fix**: Hotfix or workaround, test first
6. **Verify**: Monitor metrics, test functionality
7. **Postmortem**: Document, root cause, prevention
---
## Escalation Matrix
| Issue Type | First Response | Escalate To | Escalation Trigger |
|------------|---------------|-------------|-------------------|
| Worker errors | DevOps troubleshooter | incident-responder | SEV1/SEV2 |
| Performance | DevOps troubleshooter | performance-optimizer | >30min unresolved |
| Database | DevOps troubleshooter | data-validator | Schema issues |
| Security | DevOps troubleshooter | security-analyzer | Breach suspected |
| Application bugs | DevOps troubleshooter | smart-debug | Infrastructure ruled out |
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples
- **Diagnostic Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific
---
Return to [reference index](INDEX.md)