Files
gh-greyhaven-ai-claude-code…/skills/devops-troubleshooting/reference/troubleshooting-runbooks.md
2025-11-29 18:29:23 +08:00

10 KiB

Troubleshooting Runbooks

Step-by-step runbooks for resolving common Grey Haven infrastructure issues. Follow procedures systematically for fastest resolution.

Runbook 1: Worker Not Responding

Symptoms

  • API returning 500/502/503 errors
  • Workers timing out or not processing requests
  • Cloudflare error pages showing

Diagnosis Steps

1. Check Cloudflare Status

# Visit: https://www.cloudflarestatus.com
# Or query status API
curl -s https://www.cloudflarestatus.com/api/v2/status.json | jq '.status.indicator'

2. View Worker Logs

# Real-time logs
wrangler tail --format pretty

# Look for errors:
# - "Script exceeded CPU time limit"
# - "Worker threw exception"
# - "Uncaught TypeError"

3. Check Recent Deployments

wrangler deployments list

# If recent deployment suspicious, rollback:
wrangler rollback --message "Reverting to stable version"

4. Test Worker Locally

# Run worker in dev mode
wrangler dev

# Test endpoint
curl http://localhost:8787/api/health

Resolution Paths

Path A: Platform Issue - Wait for Cloudflare, monitor status, communicate ETA Path B: Code Error - Rollback deployment, fix in dev, test before redeploy Path C: Resource Limit - Check CPU logs, optimize operations, upgrade if needed Path D: Binding Issue - Verify wrangler.toml, check bindings, redeploy

Prevention

  • Health check endpoint: GET /health
  • Monitor error rate with alerts (>1% = alert)
  • Test deployments in staging first
  • Implement circuit breakers for external calls

Runbook 2: Database Connection Failures

Symptoms

  • "connection refused" errors
  • "too many connections" errors
  • Application timing out on database queries
  • 503 errors from API

Diagnosis Steps

1. Test Database Connection

# Direct connection test
pscale shell greyhaven-db main

# If fails, check:
# - Database status
# - Credentials
# - Network connectivity

2. Check Connection Pool

# Query pool status
curl http://localhost:8000/pool-status

# Expected healthy response:
{
  "size": 50,
  "checked_out": 25,  # <80% is healthy
  "overflow": 0,
  "available": 25
}

3. Check Active Connections

-- In pscale shell
SELECT
  COUNT(*) as active,
  MAX(query_start) as oldest_query
FROM pg_stat_activity
WHERE state = 'active';

-- If active = pool size, pool exhausted
-- If oldest_query >10min, leaked connection

4. Review Application Logs

# Search for connection errors
grep -i "connection" logs/app.log | tail -50

# Common errors:
# - "Pool timeout"
# - "Connection refused"
# - "Max connections reached"

Resolution Paths

Path A: Invalid Credentials

# Rotate credentials
pscale password create greyhaven-db main app-password

# Update environment variable
# Restart application

Path B: Pool Exhausted

# Increase pool size in database.py
engine = create_engine(
    database_url,
    pool_size=50,      # Increase from 20
    max_overflow=20
)

Path C: Connection Leaks

# Fix: Use context managers
with Session(engine) as session:
    # Work with session
    pass  # Automatically closed

Path D: Database Paused/Down

# Resume database if paused
pscale database resume greyhaven-db

# Check database status
pscale database show greyhaven-db

Prevention

  • Use connection pooling with proper limits
  • Implement retry logic with exponential backoff
  • Monitor pool utilization (alert >80%)
  • Test for connection leaks in CI/CD

Runbook 3: Deployment Failures

Symptoms

  • wrangler deploy fails
  • CI/CD pipeline fails at deployment step
  • New code not reflecting in production

Diagnosis Steps

1. Check Deployment Error

wrangler deploy --verbose

# Common errors:
# - "Script exceeds size limit"
# - "Syntax error in worker"
# - "Environment variable missing"
# - "Binding not found"

2. Verify Build Output

# Check built file
ls -lh dist/
npm run build

# Ensure build succeeds locally

3. Check Environment Variables

# List secrets
wrangler secret list

# Verify wrangler.toml vars
cat wrangler.toml | grep -A 10 "\[vars\]"

4. Test Locally

# Start dev server
wrangler dev

# If works locally but not production:
# - Environment variable mismatch
# - Binding configuration issue

Resolution Paths

Path A: Bundle Too Large

# Check bundle size
ls -lh dist/worker.js

# Solutions:
# - Tree shake unused code
# - Code split large modules
# - Use fetch instead of SDK

Path B: Syntax Error

# Run TypeScript check
npm run type-check

# Run linter
npm run lint

# Fix errors before deploying

Path C: Missing Variables

# Add missing secret
wrangler secret put API_KEY

# Or add to wrangler.toml vars
[vars]
API_ENDPOINT = "https://api.example.com"

Path D: Binding Not Found

# wrangler.toml - Add binding
[[kv_namespaces]]
binding = "CACHE"
id = "abc123"

[[d1_databases]]
binding = "DB"
database_name = "greyhaven-db"
database_id = "xyz789"

Prevention

  • Bundle size check in CI/CD
  • Pre-commit hooks for validation
  • Staging environment for testing
  • Automated deployment tests

Runbook 4: Performance Degradation

Symptoms

  • API response times increased (>2x normal)
  • Slow page loads
  • User complaints about slowness
  • Timeout errors

Diagnosis Steps

1. Check Current Latency

# Test endpoint
curl -w "\nTotal: %{time_total}s\n" -o /dev/null -s https://api.greyhaven.io/orders

# p95 should be <500ms
# If >1s, investigate

2. Analyze Worker Logs

wrangler tail --format json | jq '{duration: .outcome.duration, event: .event}'

# Identify slow requests
# Check what's taking time

3. Check Database Queries

# Slow query log
pscale database insights greyhaven-db main --slow-queries

# Look for:
# - N+1 queries (many small queries)
# - Missing indexes (full table scans)
# - Long-running queries (>100ms)

4. Profile Application

# Add timing middleware
# Log slow operations
# Identify bottleneck (DB, API, compute)

Resolution Paths

Path A: N+1 Queries

# Use eager loading
statement = (
    select(Order)
    .options(selectinload(Order.items))
)

Path B: Missing Indexes

-- Add indexes
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_items_order_id ON order_items(order_id);

Path C: No Caching

// Add Redis caching
const cached = await redis.get(cacheKey);
if (cached) return cached;

const result = await expensiveOperation();
await redis.setex(cacheKey, 300, result);

Path D: Worker CPU Limit

// Optimize expensive operations
// Use async operations
// Offload to external service

Prevention

  • Monitor p95 latency (alert >500ms)
  • Test for N+1 queries in CI/CD
  • Add indexes for foreign keys
  • Implement caching layer
  • Performance budgets in tests

Runbook 5: Network Connectivity Issues

Symptoms

  • Intermittent failures
  • DNS resolution errors
  • Connection timeouts
  • CORS errors

Diagnosis Steps

1. Test DNS Resolution

# Check DNS
nslookup api.partner.com
dig api.partner.com

# Measure DNS time
time nslookup api.partner.com

# If >1s, DNS is slow

2. Test Connectivity

# Basic connectivity
ping api.partner.com

# Trace route
traceroute api.partner.com

# Full timing breakdown
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://api.partner.com

3. Check CORS

# Preflight request
curl -I -X OPTIONS https://api.greyhaven.io/api/users \
  -H "Origin: https://app.greyhaven.io" \
  -H "Access-Control-Request-Method: POST"

# Verify headers:
# - Access-Control-Allow-Origin
# - Access-Control-Allow-Methods

4. Check Firewall/Security

# Test from different location
# Check IP whitelist
# Verify SSL certificate

Resolution Paths

Path A: Slow DNS

// Implement DNS caching
const DNS_CACHE = new Map();
// Cache DNS for 60s

Path B: Connection Timeout

// Increase timeout
const controller = new AbortController();
setTimeout(() => controller.abort(), 30000); // 30s

Path C: CORS Error

// Add CORS headers
response.headers.set('Access-Control-Allow-Origin', origin);
response.headers.set('Access-Control-Allow-Methods', 'GET,POST,PUT,DELETE');

Path D: SSL/TLS Issue

# Check certificate
openssl s_client -connect api.partner.com:443

# Verify not expired
# Check certificate chain

Prevention

  • DNS caching (60s TTL)
  • Appropriate timeouts (30s for external APIs)
  • Health checks for external dependencies
  • Circuit breakers for failures
  • Monitor external API latency

Emergency Procedures (SEV1)

Immediate Actions:

  1. Assess: Users affected? Functionality broken? Data loss risk?
  2. Communicate: Alert team, update status page
  3. Stop Bleeding: wrangler rollback or disable feature
  4. Diagnose: Logs, recent changes, metrics
  5. Fix: Hotfix or workaround, test first
  6. Verify: Monitor metrics, test functionality
  7. Postmortem: Document, root cause, prevention

Escalation Matrix

Issue Type First Response Escalate To Escalation Trigger
Worker errors DevOps troubleshooter incident-responder SEV1/SEV2
Performance DevOps troubleshooter performance-optimizer >30min unresolved
Database DevOps troubleshooter data-validator Schema issues
Security DevOps troubleshooter security-analyzer Breach suspected
Application bugs DevOps troubleshooter smart-debug Infrastructure ruled out


Return to reference index