Initial commit

**File**: `skills/devops-troubleshooting/reference/INDEX.md`
# DevOps Troubleshooter Reference

Quick reference guides for Grey Haven infrastructure troubleshooting - runbooks, diagnostic commands, and platform-specific guides.

## Reference Guides

### Troubleshooting Runbooks

**File**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md)

Step-by-step runbooks for common infrastructure issues:

- **Worker Not Responding**: 500/502/503 errors from Cloudflare Workers
- **Database Connection Failures**: Connection refused, pool exhaustion
- **Deployment Failures**: Failed deployments, rollback procedures
- **Performance Degradation**: Slow responses, high latency
- **Network Issues**: DNS failures, connectivity problems

**Use when**: Following structured resolution for known issues

---

### Diagnostic Commands Reference

**File**: [diagnostic-commands.md](diagnostic-commands.md)

Command reference for quick troubleshooting:

- **Cloudflare Workers**: wrangler commands, log analysis
- **PlanetScale**: Database queries, connection checks
- **Network**: curl timing, DNS resolution, traceroute
- **Performance**: Profiling, metrics collection

**Use when**: Need quick command syntax for diagnostics

---

### Cloudflare Workers Platform Guide

**File**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md)

Cloudflare Workers-specific guidance:

- **Deployment Best Practices**: Bundle size, environment variables
- **Performance Optimization**: CPU limits, memory management
- **Error Handling**: Common errors and solutions
- **Monitoring**: Logs, metrics, analytics

**Use when**: Cloudflare Workers-specific issues

---

## Quick Navigation

**By Issue Type**:

- Worker errors → [troubleshooting-runbooks.md#worker-not-responding](troubleshooting-runbooks.md#worker-not-responding)
- Database issues → [troubleshooting-runbooks.md#database-connection-failures](troubleshooting-runbooks.md#database-connection-failures)
- Performance → [troubleshooting-runbooks.md#performance-degradation](troubleshooting-runbooks.md#performance-degradation)

**By Platform**:

- Cloudflare Workers → [cloudflare-workers-guide.md](cloudflare-workers-guide.md)
- PlanetScale → [diagnostic-commands.md#planetscale-commands](diagnostic-commands.md#planetscale-commands)
- Network → [diagnostic-commands.md#network-commands](diagnostic-commands.md#network-commands)

---

## Related Documentation

- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting walkthroughs
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident report templates
- **Main Agent**: [devops-troubleshooter.md](../devops-troubleshooter.md) - DevOps troubleshooter agent

---

Return to [main agent](../devops-troubleshooter.md)
---

**File**: `skills/devops-troubleshooting/reference/cloudflare-workers-guide.md`
# Cloudflare Workers Platform Guide

Comprehensive guide for deploying, monitoring, and troubleshooting Cloudflare Workers in Grey Haven's stack.

## Workers Architecture

**Execution Model**:

- V8 isolates (not containers)
- Deployed globally to 300+ datacenters
- Request routed to nearest location
- Cold start: ~1-5ms (vs 100-1000ms for containers)
- CPU time limit: 50ms (Free), 50ms-30s (Paid)

**Resource Limits**:

```
Free Plan:
- Bundle size: 1MB compressed
- CPU time: 50ms per request
- Requests: 100,000/day
- KV reads: 100,000/day

Paid Plan ($5/month):
- Bundle size: 10MB compressed
- CPU time: 50ms (standard), up to 30s (unbound)
- Requests: 10M included, $0.50/million after
- KV reads: 10M included
```

---

## Deployment Best Practices

### Bundle Optimization

**Size Reduction Strategies**:

```typescript
// 1. Tree shaking with named imports
import { uniq } from 'lodash-es'; // ✅ Only imports uniq
import _ from 'lodash'; // ❌ Imports entire library

// 2. Use native APIs instead of libraries
const date = new Date().toISOString(); // ✅ Native
import moment from 'moment'; // ❌ 300KB library

// 3. External API calls instead of SDKs
await fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  headers: { 'x-api-key': env.API_KEY },
  body: JSON.stringify({ ... })
}); // ✅ 0KB vs @anthropic-ai/sdk (2.1MB)

// 4. Code splitting with dynamic imports
if (request.url.includes('/special')) {
  const { handler } = await import('./expensive-module');
  return handler(request);
} // ✅ Lazy load
```

**webpack Configuration**:

```javascript
module.exports = {
  mode: 'production',
  target: 'webworker',
  optimization: {
    minimize: true,
    usedExports: true, // Tree shaking
    sideEffects: false
  },
  resolve: {
    alias: {
      'lodash': 'lodash-es' // Use ES modules version
    }
  }
};
```

---

### Environment Variables

**Using Secrets**:

```bash
# Add secret (never in code)
wrangler secret put DATABASE_URL

# List secrets
wrangler secret list

# Delete secret
wrangler secret delete OLD_KEY
```

**Using Variables** (wrangler.toml):

```toml
[vars]
API_ENDPOINT = "https://api.partner.com"
MAX_RETRIES = "3"
CACHE_TTL = "300"

[env.staging.vars]
API_ENDPOINT = "https://staging-api.partner.com"

[env.production.vars]
API_ENDPOINT = "https://api.partner.com"
```

**Accessing in Code**:

```typescript
export default {
  async fetch(request: Request, env: Env) {
    const dbUrl = env.DATABASE_URL; // Secret
    const endpoint = env.API_ENDPOINT; // Var
    const maxRetries = parseInt(env.MAX_RETRIES);

    return new Response('OK');
  }
};
```
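
The `Env` parameter is the Worker's binding interface (wrangler can generate it from configuration). A hand-written sketch covering the names used in this guide - the optional members are assumptions drawn from later sections, not required config:

```typescript
// Sketch of the Env bindings assumed throughout this guide; adjust
// to match your own wrangler.toml.
interface Env {
  DATABASE_URL: string;               // secret: wrangler secret put DATABASE_URL
  API_ENDPOINT: string;               // [vars] entry
  MAX_RETRIES: string;                // vars arrive as strings; parse as needed
  CACHE_TTL: string;
  CACHE: KVNamespace;                 // KV binding (caching section below)
  VERSION?: string;                   // reported by the health check endpoint
  PROCESSING_API?: string;            // external offload service (CPU section)
  PROCESSOR?: DurableObjectNamespace; // Durable Object binding (CPU section)
}
```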
---

## Performance Optimization

### CPU Time Management

**Avoid CPU-Intensive Operations**:

```typescript
// ❌ BAD: CPU-intensive operation
function processLargeDataset(data) {
  const sorted = data.sort((a, b) => a.value - b.value);
  const filtered = sorted.filter(item => item.value > 1000);
  const mapped = filtered.map(item => ({ ...item, processed: true }));
  return mapped; // Can exceed 50ms CPU limit
}

// ✅ GOOD: Offload to external service
async function processLargeDataset(data, env) {
  const response = await fetch(`${env.PROCESSING_API}/process`, {
    method: 'POST',
    body: JSON.stringify(data)
  });
  return response.json(); // External service handles heavy lifting
}

// ✅ BETTER: Use Durable Objects for stateful computation
const id = env.PROCESSOR.idFromName('processor');
const stub = env.PROCESSOR.get(id);
return stub.fetch(request); // Durable Object has more CPU time
```

**Monitor CPU Usage**:

```typescript
// Note: Date.now() measures wall-clock time, not CPU time, and in
// Workers the clock only advances at I/O boundaries. Treat this as a
// rough duration signal; actual CPU time appears in Workers analytics.
export default {
  async fetch(request: Request, env: Env) {
    const start = Date.now();

    try {
      const response = await handleRequest(request, env);
      const duration = Date.now() - start;

      if (duration > 40) {
        console.warn(`Request duration approaching CPU budget: ${duration}ms`);
      }

      return response;
    } catch (error) {
      const duration = Date.now() - start;
      console.error(`Request failed after ${duration}ms:`, error);
      throw error;
    }
  }
};
```
---

### Caching Strategies

**Cache API**:

```typescript
export default {
  async fetch(request: Request) {
    const cache = caches.default;

    // Check cache
    let response = await cache.match(request);
    if (response) return response;

    // Cache miss - fetch and cache
    const origin = await fetch(request);

    // Re-wrap so headers are mutable, then cache for 5 minutes.
    // Return the re-wrapped response, not the original: the original
    // body stream is transferred by the Response constructor.
    response = new Response(origin.body, origin);
    response.headers.set('Cache-Control', 'max-age=300');
    await cache.put(request, response.clone());

    return response;
  }
};
```
**KV for Data Caching**:

```typescript
export default {
  async fetch(request: Request, env: Env) {
    const url = new URL(request.url);
    const cacheKey = `data:${url.pathname}`;

    // Check KV
    const cached = await env.CACHE.get(cacheKey, 'json');
    if (cached) return Response.json(cached);

    // Fetch data
    const data = await fetchExpensiveData();

    // Store in KV with 5min TTL
    await env.CACHE.put(cacheKey, JSON.stringify(data), {
      expirationTtl: 300
    });

    return Response.json(data);
  }
};
```
---

## Common Errors and Solutions

### Error 1101: Worker Threw Exception

**Cause**: Unhandled JavaScript exception

**Example**:

```typescript
// ❌ BAD: Unhandled error
export default {
  async fetch(request: Request) {
    // request.body is a ReadableStream, not text - JSON.parse on it
    // always throws, and the payload is never actually read
    const data = JSON.parse(request.body);
    return Response.json(data);
  }
};
```

**Solution**:

```typescript
// ✅ GOOD: Proper error handling
export default {
  async fetch(request: Request) {
    try {
      const body = await request.text();
      const data = JSON.parse(body);
      return Response.json(data);
    } catch (error) {
      console.error('JSON parse error:', error);
      return new Response('Invalid JSON', { status: 400 });
    }
  }
};
```
---

### Error 1015: Rate Limited

**Cause**: Too many requests to origin

**Solution**: Implement caching and rate limiting

```typescript
const RATE_LIMIT = 100; // requests per minute

// Note: this Map lives in a single isolate, so counts reset when the
// isolate is recycled and are not shared across datacenters - a first
// line of defense, not a global limit.
const rateLimits = new Map();

export default {
  async fetch(request: Request) {
    const ip = request.headers.get('CF-Connecting-IP');
    const key = `ratelimit:${ip}`;

    const count = rateLimits.get(key) || 0;
    if (count >= RATE_LIMIT) {
      return new Response('Rate limit exceeded', { status: 429 });
    }

    rateLimits.set(key, count + 1);
    setTimeout(() => rateLimits.delete(key), 60000);

    return new Response('OK');
  }
};
```
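
For a limit that survives isolate recycling, the same counter can live in KV - a rough sketch assuming the `CACHE` binding from earlier; KV writes are eventually consistent, so treat the threshold as approximate:

```typescript
// Approximate per-IP limiter backed by KV. Increments are not atomic,
// so bursts can briefly exceed the limit.
async function kvRateLimited(ip: string, env: Env): Promise<boolean> {
  const bucket = Math.floor(Date.now() / 60_000); // per-minute window
  const key = `ratelimit:${ip}:${bucket}`;
  const count = parseInt((await env.CACHE.get(key)) ?? '0', 10);
  if (count >= 100) return true;
  await env.CACHE.put(key, String(count + 1), { expirationTtl: 60 }); // KV's minimum TTL
  return false;
}
```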
---

### Error: Script Exceeds Size Limit

**Diagnosis**:

```bash
# Check bundle size
npm run build
ls -lh dist/worker.js

# Analyze bundle
npm install --save-dev webpack-bundle-analyzer
npm run build -- --analyze
```

**Solutions**: See [bundle optimization](#bundle-optimization) above

---
## Monitoring and Logging

### Structured Logging

```typescript
interface LogEntry {
  level: 'info' | 'warn' | 'error';
  message: string;
  timestamp?: string; // filled in by log()
  requestId?: string;
  duration?: number;
  metadata?: Record<string, any>;
}

function log(entry: LogEntry) {
  console.log(JSON.stringify({
    ...entry,
    timestamp: new Date().toISOString()
  }));
}

export default {
  async fetch(request: Request, env: Env) {
    const requestId = crypto.randomUUID();
    const start = Date.now();

    try {
      log({
        level: 'info',
        message: 'Request started',
        requestId,
        metadata: {
          method: request.method,
          url: request.url
        }
      });

      const response = await handleRequest(request, env);

      log({
        level: 'info',
        message: 'Request completed',
        requestId,
        duration: Date.now() - start,
        metadata: {
          status: response.status
        }
      });

      return response;
    } catch (error) {
      log({
        level: 'error',
        message: 'Request failed',
        requestId,
        duration: Date.now() - start,
        metadata: {
          error: error.message,
          stack: error.stack
        }
      });

      return new Response('Internal Server Error', { status: 500 });
    }
  }
};
```
---

### Health Check Endpoint

```typescript
export default {
  async fetch(request: Request, env: Env) {
    const url = new URL(request.url);

    if (url.pathname === '/health') {
      return Response.json({
        status: 'healthy',
        timestamp: new Date().toISOString(),
        version: env.VERSION || 'unknown'
      });
    }

    // Regular request handling
    return handleRequest(request, env);
  }
};
```
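
A liveness probe like this only proves the Worker is running. A readiness variant can exercise a dependency first - a sketch assuming the `CACHE` KV binding (the probe key is arbitrary):

```typescript
// Readiness probe sketch: report degraded (503) if a KV read fails.
async function readiness(env: Env): Promise<Response> {
  const checks: Record<string, boolean> = {};
  try {
    await env.CACHE.get('readiness-probe'); // only checking that KV responds
    checks.kv = true;
  } catch {
    checks.kv = false;
  }
  const healthy = Object.values(checks).every(Boolean);
  return Response.json(
    { status: healthy ? 'healthy' : 'degraded', checks },
    { status: healthy ? 200 : 503 }
  );
}
```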
---

## Testing Workers

```bash
# Local testing
wrangler dev
curl http://localhost:8787/api/users
curl -X POST http://localhost:8787/api/users -H "Content-Type: application/json" -d '{"name": "Test User"}'
```

```typescript
// Unit testing (Vitest)
import { describe, it, expect } from 'vitest';
import worker from './worker';

describe('Worker', () => {
  it('returns 200 for health check', async () => {
    const request = new Request('https://example.com/health');
    const response = await worker.fetch(request, getMockEnv());
    expect(response.status).toBe(200);
  });
});
```
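
The `getMockEnv()` helper is left undefined above; a minimal hand-rolled stub matching the `Env` sketch from the environment-variables section might look like this (all values are illustrative test fixtures):

```typescript
// Hypothetical test helper: a throwaway Env stub for unit tests.
function getMockEnv(): Env {
  const kvStub = {
    get: async () => null,
    put: async () => undefined,
  } as unknown as KVNamespace;
  return {
    DATABASE_URL: 'postgres://localhost/test',
    API_ENDPOINT: 'https://staging-api.partner.com',
    MAX_RETRIES: '3',
    CACHE_TTL: '300',
    CACHE: kvStub,
    VERSION: 'test',
  };
}
```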
---

## Security Best Practices

```typescript
// 1. Validate inputs
function validateEmail(email: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// 2. Set security headers
function addSecurityHeaders(response: Response): Response {
  response.headers.set('X-Content-Type-Options', 'nosniff');
  response.headers.set('X-Frame-Options', 'DENY');
  response.headers.set('Strict-Transport-Security', 'max-age=31536000');
  return response;
}

// 3. CORS configuration
const ALLOWED_ORIGINS = ['https://app.greyhaven.io', 'https://staging.greyhaven.io'];

function handleCors(request: Request): Response | null {
  const origin = request.headers.get('Origin');

  // Check the allowlist before answering preflights, so OPTIONS
  // requests from disallowed origins are rejected too
  if (origin && !ALLOWED_ORIGINS.includes(origin)) {
    return new Response('Forbidden', { status: 403 });
  }

  if (request.method === 'OPTIONS') {
    return new Response(null, {
      headers: {
        'Access-Control-Allow-Origin': origin ?? '',
        'Access-Control-Allow-Methods': 'GET,POST,PUT,DELETE',
        'Access-Control-Max-Age': '86400'
      }
    });
  }

  return null;
}
```
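
Wiring the helpers together in the entry point (`handleRequest` stands in for the application handler, as elsewhere in this guide):

```typescript
export default {
  async fetch(request: Request, env: Env) {
    // Short-circuit preflights and disallowed origins first
    const cors = handleCors(request);
    if (cors) return cors;

    const response = await handleRequest(request, env);
    return addSecurityHeaders(response);
  }
};
```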
---

## Related Documentation

- **Runbooks**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md) - Step-by-step procedures
- **Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
- **Examples**: [Examples Index](../examples/INDEX.md) - Full examples

---

Return to [reference index](INDEX.md)
---

**File**: `skills/devops-troubleshooting/reference/diagnostic-commands.md`
# Diagnostic Commands Reference

Quick command reference for Grey Haven infrastructure troubleshooting. Copy-paste ready commands for rapid diagnosis.

## Cloudflare Workers Commands

### Deployment Management

```bash
# List recent deployments
wrangler deployments list

# View specific deployment
wrangler deployments view <deployment-id>

# Rollback to previous version
wrangler rollback --message "Reverting due to errors"

# Deploy to production
wrangler deploy

# Deploy to staging
wrangler deploy --env staging
```

### Logs and Monitoring

```bash
# Real-time logs (pretty format)
wrangler tail --format pretty

# JSON logs for parsing
wrangler tail --format json

# Filter by status code
wrangler tail --format json | grep "\"status\":500"

# Show only errors
wrangler tail --format json | grep -i "error"

# Save logs to file
wrangler tail --format json > worker-logs.json

# Monitor specific worker
wrangler tail --name my-worker
```

### Local Development

```bash
# Start local dev server
wrangler dev

# Dev with specific port
wrangler dev --port 8788

# Dev with remote mode (use production bindings)
wrangler dev --remote

# Test locally
curl http://localhost:8787/api/health
```

### Configuration

```bash
# Show account info
wrangler whoami

# List KV namespaces
wrangler kv:namespace list

# List secrets
wrangler secret list

# Add secret
wrangler secret put API_KEY

# Delete secret
wrangler secret delete API_KEY
```

---
## PlanetScale Commands

### Database Management

```bash
# Connect to database shell
pscale shell greyhaven-db main

# Connect and execute query
pscale shell greyhaven-db main --execute "SELECT COUNT(*) FROM users"

# Show database info
pscale database show greyhaven-db

# List all databases
pscale database list

# Create new branch
pscale branch create greyhaven-db feature-branch

# List branches
pscale branch list greyhaven-db
```
### Connection Monitoring

These queries use PostgreSQL system views; for a MySQL/Vitess database, use `SHOW PROCESSLIST` instead.

```sql
-- Active connections
SELECT COUNT(*) as active_connections
FROM pg_stat_activity
WHERE state = 'active';

-- Long-running queries
SELECT
  pid,
  now() - query_start as duration,
  query
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '10 seconds'
ORDER BY duration DESC;

-- Connection by state
SELECT state, COUNT(*)
FROM pg_stat_activity
GROUP BY state;

-- Blocked queries
SELECT
  blocked.pid AS blocked_pid,
  blocking.pid AS blocking_pid,
  blocked.query AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
```
### Performance Analysis

```bash
# Slow query insights
pscale database insights greyhaven-db main --slow-queries

# Database size
pscale database show greyhaven-db --web

# Enable slow query log
pscale database settings update greyhaven-db --enable-slow-query-log
```

```sql
-- Table sizes
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

-- Index usage
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan as index_scans
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

-- Cache hit ratio
SELECT
  'cache hit rate' AS metric,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;
```
### Schema Migrations

```bash
# Create deploy request
pscale deploy-request create greyhaven-db <branch-name>

# List deploy requests
pscale deploy-request list greyhaven-db

# View deploy request diff
pscale deploy-request diff greyhaven-db <number>

# Deploy schema changes
pscale deploy-request deploy greyhaven-db <number>

# Close deploy request
pscale deploy-request close greyhaven-db <number>
```

---
## Network Diagnostic Commands

### DNS Resolution

```bash
# Basic DNS lookup
nslookup api.partner.com

# Detailed DNS query
dig api.partner.com

# Measure DNS time
time nslookup api.partner.com

# Check DNS propagation
dig api.partner.com @8.8.8.8
dig api.partner.com @1.1.1.1

# Reverse DNS lookup
dig -x 203.0.113.42
```

### Connectivity Testing

```bash
# Ping test
ping -c 10 api.partner.com

# Trace network route
traceroute api.partner.com

# TCP connection test
nc -zv api.partner.com 443

# Test specific port
telnet api.partner.com 443
```

### HTTP Request Timing

```bash
# Full timing breakdown
curl -w "\nDNS Lookup: %{time_namelookup}s\nTCP Connect: %{time_connect}s\nTLS Handshake: %{time_appconnect}s\nStart Transfer: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://api.partner.com/data

# Test with specific method
curl -X POST https://api.example.com/api \
  -H "Content-Type: application/json" \
  -d '{"test": "data"}'

# Follow redirects
curl -L https://example.com

# Show response headers
curl -I https://api.example.com

# Test CORS
curl -I -X OPTIONS https://api.example.com \
  -H "Origin: https://app.example.com" \
  -H "Access-Control-Request-Method: POST"
```

### SSL/TLS Verification

```bash
# Check SSL certificate
openssl s_client -connect api.example.com:443

# Show certificate expiry
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# Verify certificate chain
openssl s_client -connect api.example.com:443 -showcerts
```

---
## Application Performance Commands

### Resource Monitoring

```bash
# CPU usage
top -o cpu   # macOS (on Linux: top -o %CPU)

# Memory usage
free -h    # Linux
vm_stat    # macOS

# Disk usage
df -h

# Process list
ps aux | grep node

# Port usage
lsof -i :8000
netstat -an | grep 8000
```

### Log Analysis

```bash
# Tail logs
tail -f /var/log/app.log

# Search logs
grep -i "error" /var/log/app.log

# Count errors
grep -c "ERROR" /var/log/app.log

# Show recent errors with context
grep -B 5 -A 5 "error" /var/log/app.log

# Parse JSON logs
cat app.log | jq 'select(.level=="error")'

# Error frequency
grep "ERROR" /var/log/app.log | cut -d' ' -f1 | uniq -c
```

### Worker Performance

```bash
# Field names below depend on the wrangler tail JSON schema for your version

# Monitor CPU time
wrangler tail --format json | jq '.outcome.cpuTime'

# Monitor duration
wrangler tail --format json | jq '.outcome.duration'

# Count events (sample over a fixed window to get a request rate)
wrangler tail --format json | wc -l

# Average response time
wrangler tail --format json | \
  jq -r '.outcome.duration' | \
  awk '{sum+=$1; count++} END {print sum/count}'
```

---
## Health Check Scripts

### Worker Health Check

```bash
#!/bin/bash
# health-check-worker.sh

echo "=== Worker Health Check ==="

# Test endpoint
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://api.greyhaven.io/health)

if [ "$STATUS" -eq 200 ]; then
  echo "✅ Worker responding (HTTP $STATUS)"
else
  echo "❌ Worker error (HTTP $STATUS)"
  exit 1
fi

# Check response time
TIME=$(curl -w "%{time_total}" -o /dev/null -s https://api.greyhaven.io/health)
echo "Response time: ${TIME}s"

if (( $(echo "$TIME > 1.0" | bc -l) )); then
  echo "⚠️ Slow response (${TIME}s > 1.0s)"
fi
```

### Database Health Check

```bash
#!/bin/bash
# health-check-db.sh

echo "=== Database Health Check ==="

# Test connection
pscale shell greyhaven-db main --execute "SELECT 1" > /dev/null 2>&1

if [ $? -eq 0 ]; then
  echo "✅ Database connection OK"
else
  echo "❌ Database connection failed"
  exit 1
fi

# Check active connections
ACTIVE=$(pscale shell greyhaven-db main --execute \
  "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'" | tail -1)

echo "Active connections: $ACTIVE"

if [ "$ACTIVE" -gt 80 ]; then
  echo "⚠️ High connection count (>80)"
fi
```

### Complete System Health

```bash
#!/bin/bash
# health-check-all.sh

echo "=== Complete System Health Check ==="

# Worker
echo -e "\n1. Cloudflare Worker"
./health-check-worker.sh

# Database
echo -e "\n2. PlanetScale Database"
./health-check-db.sh

# External APIs
echo -e "\n3. External Dependencies"
for API in "https://api.partner1.com/health" "https://api.partner2.com/health"; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API")
  if [ "$STATUS" -eq 200 ]; then
    echo "✅ $API (HTTP $STATUS)"
  else
    echo "❌ $API (HTTP $STATUS)"
  fi
done

echo -e "\n=== Health Check Complete ==="
```

---
## Troubleshooting One-Liners

```bash
# Find memory hogs
ps aux --sort=-%mem | head -10

# Find CPU hogs
ps aux --sort=-%cpu | head -10

# Disk space by directory
du -sh /* | sort -h

# Network connections
netstat -ant | awk '{print $6}' | sort | uniq -c

# Failed login attempts
grep "Failed password" /var/log/auth.log | wc -l

# Top status codes
awk '{print $9}' access.log | sort | uniq -c | sort -rn

# Requests per minute
awk '{print $4}' access.log | cut -d: -f1-3 | uniq -c

# Average response size
awk '{sum+=$10; count++} END {print sum/count}' access.log
```

---
## Related Documentation

- **Runbooks**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md) - Step-by-step procedures
- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific
- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples

---

Return to [reference index](INDEX.md)
---

**File**: `skills/devops-troubleshooting/reference/troubleshooting-runbooks.md`
# Troubleshooting Runbooks

Step-by-step runbooks for resolving common Grey Haven infrastructure issues. Follow procedures systematically for fastest resolution.

## Runbook 1: Worker Not Responding

### Symptoms

- API returning 500/502/503 errors
- Workers timing out or not processing requests
- Cloudflare error pages showing

### Diagnosis Steps

**1. Check Cloudflare Status**

```bash
# Visit: https://www.cloudflarestatus.com
# Or query status API
curl -s https://www.cloudflarestatus.com/api/v2/status.json | jq '.status.indicator'
```

**2. View Worker Logs**

```bash
# Real-time logs
wrangler tail --format pretty

# Look for errors:
# - "Script exceeded CPU time limit"
# - "Worker threw exception"
# - "Uncaught TypeError"
```

**3. Check Recent Deployments**

```bash
wrangler deployments list

# If a recent deployment is suspect, roll back:
wrangler rollback --message "Reverting to stable version"
```

**4. Test Worker Locally**

```bash
# Run worker in dev mode
wrangler dev

# Test endpoint
curl http://localhost:8787/api/health
```
### Resolution Paths

- **Path A: Platform Issue** - Wait for Cloudflare, monitor status, communicate ETA
- **Path B: Code Error** - Rollback deployment, fix in dev, test before redeploy
- **Path C: Resource Limit** - Check CPU logs, optimize operations, upgrade if needed
- **Path D: Binding Issue** - Verify wrangler.toml, check bindings, redeploy

### Prevention

- Health check endpoint: `GET /health`
- Monitor error rate with alerts (>1% = alert)
- Test deployments in staging first
- Implement circuit breakers for external calls (minimal sketch below)
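
A minimal circuit-breaker sketch for outbound calls - thresholds are illustrative, and state is per-isolate, so each isolate trips independently:

```typescript
// Minimal circuit breaker sketch: fail fast after repeated upstream
// failures, then allow a probe request after a cooldown.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call(doFetch: () => Promise<Response>): Promise<Response> {
    if (this.failures >= this.maxFailures) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        return new Response('Upstream unavailable', { status: 503 }); // fail fast
      }
      this.failures = 0; // half-open: let one probe through
    }
    try {
      const res = await doFetch();
      if (res.ok) this.failures = 0;
      return res;
    } catch (err) {
      if (++this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Usage: wrap each upstream request, e.g. `await breaker.call(() => fetch(url))`.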
---

## Runbook 2: Database Connection Failures

### Symptoms

- "connection refused" errors
- "too many connections" errors
- Application timing out on database queries
- 503 errors from API

### Diagnosis Steps

**1. Test Database Connection**

```bash
# Direct connection test
pscale shell greyhaven-db main

# If this fails, check:
# - Database status
# - Credentials
# - Network connectivity
```

**2. Check Connection Pool**

```bash
# Query pool status
curl http://localhost:8000/pool-status

# Expected healthy response:
# {
#   "size": 50,
#   "checked_out": 25,   # <80% is healthy
#   "overflow": 0,
#   "available": 25
# }
```

**3. Check Active Connections**

```sql
-- In pscale shell (PostgreSQL system views; on MySQL/Vitess use SHOW PROCESSLIST)
SELECT
  COUNT(*) as active,
  MAX(query_start) as oldest_query
FROM pg_stat_activity
WHERE state = 'active';

-- If active = pool size, pool exhausted
-- If oldest_query >10min, leaked connection
```
**4. Review Application Logs**

```bash
# Search for connection errors
grep -i "connection" logs/app.log | tail -50

# Common errors:
# - "Pool timeout"
# - "Connection refused"
# - "Max connections reached"
```

### Resolution Paths

**Path A: Invalid Credentials**

```bash
# Rotate credentials
pscale password create greyhaven-db main app-password

# Update environment variable
# Restart application
```

**Path B: Pool Exhausted**

```python
# Increase pool size in database.py
from sqlalchemy import create_engine

engine = create_engine(
    database_url,
    pool_size=50,  # Increase from 20
    max_overflow=20
)
```

**Path C: Connection Leaks**

```python
# Fix: Use context managers
from sqlalchemy.orm import Session

with Session(engine) as session:
    # Work with session
    pass  # Automatically closed
```

**Path D: Database Paused/Down**

```bash
# Resume database if paused
pscale database resume greyhaven-db

# Check database status
pscale database show greyhaven-db
```

### Prevention

- Use connection pooling with proper limits
- Implement retry logic with exponential backoff (see the sketch below)
- Monitor pool utilization (alert >80%)
- Test for connection leaks in CI/CD
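
A minimal retry sketch for that backoff item - attempt count and base delay are illustrative defaults, not tuned values:

```typescript
// Retry an async operation with exponential backoff plus jitter.
async function withRetry<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      const backoffMs = 100 * 2 ** i + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
  throw lastError;
}
```

Usage: wrap connection acquisition or queries, e.g. `await withRetry(() => getConnection())`.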
---

## Runbook 3: Deployment Failures

### Symptoms

- `wrangler deploy` fails
- CI/CD pipeline fails at deployment step
- New code not reflected in production

### Diagnosis Steps

**1. Check Deployment Error**

```bash
wrangler deploy --verbose

# Common errors:
# - "Script exceeds size limit"
# - "Syntax error in worker"
# - "Environment variable missing"
# - "Binding not found"
```

**2. Verify Build Output**

```bash
# Ensure build succeeds locally, then check built files
npm run build
ls -lh dist/
```

**3. Check Environment Variables**

```bash
# List secrets
wrangler secret list

# Verify wrangler.toml vars
cat wrangler.toml | grep -A 10 "\[vars\]"
```

**4. Test Locally**

```bash
# Start dev server
wrangler dev

# If it works locally but not in production:
# - Environment variable mismatch
# - Binding configuration issue
```

### Resolution Paths

**Path A: Bundle Too Large**

```bash
# Check bundle size
ls -lh dist/worker.js

# Solutions:
# - Tree shake unused code
# - Code split large modules
# - Use fetch instead of SDK
```

**Path B: Syntax Error**

```bash
# Run TypeScript check
npm run type-check

# Run linter
npm run lint

# Fix errors before deploying
```

**Path C: Missing Variables**

```bash
# Add missing secret
wrangler secret put API_KEY

# Or add to wrangler.toml vars:
#   [vars]
#   API_ENDPOINT = "https://api.example.com"
```

**Path D: Binding Not Found**

```toml
# wrangler.toml - Add binding
[[kv_namespaces]]
binding = "CACHE"
id = "abc123"

[[d1_databases]]
binding = "DB"
database_name = "greyhaven-db"
database_id = "xyz789"
```

### Prevention

- Bundle size check in CI/CD
- Pre-commit hooks for validation
- Staging environment for testing
- Automated deployment tests

---
## Runbook 4: Performance Degradation

### Symptoms

- API response times increased (>2x normal)
- Slow page loads
- User complaints about slowness
- Timeout errors

### Diagnosis Steps

**1. Check Current Latency**

```bash
# Test endpoint
curl -w "\nTotal: %{time_total}s\n" -o /dev/null -s https://api.greyhaven.io/orders

# p95 should be <500ms
# If >1s, investigate
```

**2. Analyze Worker Logs**

```bash
wrangler tail --format json | jq '{duration: .outcome.duration, event: .event}'

# Identify slow requests
# Check what's taking time
```

**3. Check Database Queries**

```bash
# Slow query log
pscale database insights greyhaven-db main --slow-queries

# Look for:
# - N+1 queries (many small queries)
# - Missing indexes (full table scans)
# - Long-running queries (>100ms)
```

**4. Profile Application**

```bash
# Add timing middleware (see the sketch below)
# Log slow operations
# Identify bottleneck (DB, API, compute)
```
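
A sketch of such timing middleware for a Worker-style handler (`handleRequest` stands in for the application handler, as elsewhere in these docs; the `Server-Timing` header is optional but makes the measurement visible to clients):

```typescript
// Wrap the handler, log slow requests, and surface the measured
// duration via a Server-Timing header.
async function withTiming(
  request: Request,
  env: Env,
  handler: (req: Request, env: Env) => Promise<Response>
): Promise<Response> {
  const start = Date.now();
  const response = await handler(request, env);
  const ms = Date.now() - start;
  if (ms > 500) console.warn(`slow request: ${request.url} took ${ms}ms`);
  const timed = new Response(response.body, response); // re-wrap for mutable headers
  timed.headers.set('Server-Timing', `app;dur=${ms}`);
  return timed;
}
```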
### Resolution Paths

**Path A: N+1 Queries**

```python
# Use eager loading
from sqlalchemy import select
from sqlalchemy.orm import selectinload

statement = (
    select(Order)
    .options(selectinload(Order.items))
)
```

**Path B: Missing Indexes**

```sql
-- Add indexes
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_items_order_id ON order_items(order_id);
```

**Path C: No Caching**

```typescript
// Add Redis caching (`redis` here is an ioredis-style client for a
// hosted Redis reachable from the Worker)
const cached = await redis.get(cacheKey);
if (cached) return cached;

const result = await expensiveOperation();
await redis.setex(cacheKey, 300, result);
```

**Path D: Worker CPU Limit**

```typescript
// Offload CPU-heavy work instead of computing in the Worker - see
// "CPU Time Management" in cloudflare-workers-guide.md for the
// fetch-based offload and Durable Object patterns.
```

### Prevention

- Monitor p95 latency (alert >500ms)
- Test for N+1 queries in CI/CD
- Add indexes for foreign keys
- Implement caching layer
- Performance budgets in tests
---

## Runbook 5: Network Connectivity Issues

### Symptoms

- Intermittent failures
- DNS resolution errors
- Connection timeouts
- CORS errors

### Diagnosis Steps

**1. Test DNS Resolution**

```bash
# Check DNS
nslookup api.partner.com
dig api.partner.com

# Measure DNS time
time nslookup api.partner.com

# If >1s, DNS is slow
```

**2. Test Connectivity**

```bash
# Basic connectivity
ping api.partner.com

# Trace route
traceroute api.partner.com

# Full timing breakdown
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://api.partner.com
```

**3. Check CORS**

```bash
# Preflight request
curl -I -X OPTIONS https://api.greyhaven.io/api/users \
  -H "Origin: https://app.greyhaven.io" \
  -H "Access-Control-Request-Method: POST"

# Verify headers:
# - Access-Control-Allow-Origin
# - Access-Control-Allow-Methods
```

**4. Check Firewall/Security**

```bash
# Test from different location
# Check IP whitelist
# Verify SSL certificate
```

### Resolution Paths

**Path A: Slow DNS**

```typescript
// Implement DNS caching for hot hostnames (note: inside Workers,
// fetch() resolves DNS for you; this applies to Node-side services)
const DNS_CACHE = new Map();
// Cache DNS for 60s
```
**Path B: Connection Timeout**

```typescript
// Increase timeout and wire the signal into the request
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 30000); // 30s
const response = await fetch(url, { signal: controller.signal });
clearTimeout(timer);
```

**Path C: CORS Error**

```typescript
// Add CORS headers
response.headers.set('Access-Control-Allow-Origin', origin);
response.headers.set('Access-Control-Allow-Methods', 'GET,POST,PUT,DELETE');
```

**Path D: SSL/TLS Issue**

```bash
# Check certificate
openssl s_client -connect api.partner.com:443

# Verify not expired
# Check certificate chain
```
### Prevention

- DNS caching (60s TTL)
- Appropriate timeouts (30s for external APIs)
- Health checks for external dependencies
- Circuit breakers for failures
- Monitor external API latency

---

## Emergency Procedures (SEV1)

**Immediate Actions**:

1. **Assess**: Users affected? Functionality broken? Data loss risk?
2. **Communicate**: Alert team, update status page
3. **Stop Bleeding**: `wrangler rollback` or disable feature
4. **Diagnose**: Logs, recent changes, metrics
5. **Fix**: Hotfix or workaround, test first
6. **Verify**: Monitor metrics, test functionality
7. **Postmortem**: Document, root cause, prevention

---

## Escalation Matrix

| Issue Type | First Response | Escalate To | Escalation Trigger |
|------------|----------------|-------------|--------------------|
| Worker errors | DevOps troubleshooter | incident-responder | SEV1/SEV2 |
| Performance | DevOps troubleshooter | performance-optimizer | >30min unresolved |
| Database | DevOps troubleshooter | data-validator | Schema issues |
| Security | DevOps troubleshooter | security-analyzer | Breach suspected |
| Application bugs | DevOps troubleshooter | smart-debug | Infrastructure ruled out |

---

## Related Documentation

- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples
- **Diagnostic Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific

---

Return to [reference index](INDEX.md)