Initial commit

2025-11-29 18:29:23 +08:00
commit ebc71f5387
37 changed files with 9382 additions and 0 deletions
--- a/skills/devops-troubleshooting/reference/troubleshooting-runbooks.md
+++ b/skills/devops-troubleshooting/reference/troubleshooting-runbooks.md
@@ -0,0 +1,489 @@
+# Troubleshooting Runbooks
+
+Step-by-step runbooks for resolving common Grey Haven infrastructure issues. Follow procedures systematically for fastest resolution.
+
+## Runbook 1: Worker Not Responding
+
+### Symptoms
+- API returning 500/502/503 errors
+- Workers timing out or not processing requests
+- Cloudflare error pages showing
+
+### Diagnosis Steps
+
+**1. Check Cloudflare Status**
+```bash
+# Visit: https://www.cloudflarestatus.com
+# Or query status API
+curl -s https://www.cloudflarestatus.com/api/v2/status.json | jq '.status.indicator'
+```
+
+**2. View Worker Logs**
+```bash
+# Real-time logs
+wrangler tail --format pretty
+
+# Look for errors:
+# - "Script exceeded CPU time limit"
+# - "Worker threw exception"
+# - "Uncaught TypeError"
+```
+
+**3. Check Recent Deployments**
+```bash
+wrangler deployments list
+
+# If recent deployment suspicious, rollback:
+wrangler rollback --message "Reverting to stable version"
+```
+
+**4. Test Worker Locally**
+```bash
+# Run worker in dev mode
+wrangler dev
+
+# Test endpoint
+curl http://localhost:8787/api/health
+```
+
+### Resolution Paths
+
+- **Path A: Platform Issue** - Wait for Cloudflare, monitor status, communicate ETA
+- **Path B: Code Error** - Rollback deployment, fix in dev, test before redeploy
+- **Path C: Resource Limit** - Check CPU logs, optimize operations, upgrade if needed
+- **Path D: Binding Issue** - Verify wrangler.toml, check bindings, redeploy
+
+### Prevention
+- Health check endpoint: `GET /health`
+- Monitor error rate with alerts (>1% = alert)
+- Test deployments in staging first
+- Implement circuit breakers for external calls
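+
+The circuit-breaker item above can be sketched as follows. This is a minimal illustration, not Grey Haven's implementation: the class name, the thresholds (`max_failures`, `reset_after`), and the injected `clock` are all assumptions chosen for the example.
+
+```python
+import time
+
+
+class CircuitBreaker:
+    """Open after `max_failures` consecutive errors; allow a trial call after `reset_after` seconds."""
+
+    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
+        self.max_failures = max_failures
+        self.reset_after = reset_after
+        self.clock = clock
+        self.failures = 0
+        self.opened_at = None  # None means the circuit is closed
+
+    def call(self, func, *args, **kwargs):
+        if self.opened_at is not None:
+            if self.clock() - self.opened_at < self.reset_after:
+                raise RuntimeError("circuit open: skipping external call")
+            # Cool-down elapsed: allow one trial call (half-open state).
+            self.opened_at = None
+            self.failures = self.max_failures - 1
+        try:
+            result = func(*args, **kwargs)
+        except Exception:
+            self.failures += 1
+            if self.failures >= self.max_failures:
+                self.opened_at = self.clock()
+            raise
+        self.failures = 0  # Any success resets the failure count.
+        return result
+```
+
+Wrapping each external dependency in its own breaker keeps one failing partner API from tying up every request while it is down.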
+
+---
+
+## Runbook 2: Database Connection Failures
+
+### Symptoms
+- "connection refused" errors
+- "too many connections" errors
+- Application timing out on database queries
+- 503 errors from API
+
+### Diagnosis Steps
+
+**1. Test Database Connection**
+```bash
+# Direct connection test
+pscale shell greyhaven-db main
+
+# If fails, check:
+# - Database status
+# - Credentials
+# - Network connectivity
+```
+
+**2. Check Connection Pool**
+```bash
+# Query pool status
+curl http://localhost:8000/pool-status
+
+# Expected healthy response (checked_out below 80% of size is healthy):
+# {
+#   "size": 50,
+#   "checked_out": 25,
+#   "overflow": 0,
+#   "available": 25
+# }
+```
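+
+The 80% rule for the pool-status response can be checked with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and treating any overflow as unhealthy is an interpretation of the example response rather than a documented rule.
+
+```python
+def pool_utilization(status):
+    """Return checked-out connections as a fraction of the pool size."""
+    if status["size"] <= 0:
+        raise ValueError("pool size must be positive")
+    return status["checked_out"] / status["size"]
+
+
+def pool_is_healthy(status, threshold=0.80):
+    """Healthy: utilization below the threshold and no overflow connections (assumed rule)."""
+    return pool_utilization(status) < threshold and status["overflow"] == 0
+
+
+# Sample values copied from the expected response in this step.
+status = {"size": 50, "checked_out": 25, "overflow": 0, "available": 25}
+print(f"{pool_utilization(status):.0%} used, healthy={pool_is_healthy(status)}")
+```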
+
+**3. Check Active Connections**
+```sql
+-- In pscale shell
+SELECT
+  COUNT(*) as active,
+  MIN(query_start) as oldest_query
+FROM pg_stat_activity
+WHERE state = 'active';
+
+-- If active = pool size, the pool is exhausted
+-- If oldest_query is more than 10 minutes old, suspect a stuck query or leaked connection
+```
+
+**4. Review Application Logs**
+```bash
+# Search for connection errors
+grep -i "connection" logs/app.log | tail -50
+
+# Common errors:
+# - "Pool timeout"
+# - "Connection refused"
+# - "Max connections reached"
+```
+
+### Resolution Paths
+
+**Path A: Invalid Credentials**
+```bash
+# Rotate credentials
+pscale password create greyhaven-db main app-password
+
+# Update environment variable
+# Restart application
+```
+
+**Path B: Pool Exhausted**
+```python
+# Increase pool size in database.py
+from sqlalchemy import create_engine
+
+engine = create_engine(
+    database_url,
+    pool_size=50,      # Increase from 20
+    max_overflow=20,   # Extra connections allowed beyond pool_size
+)
+```
+
+**Path C: Connection Leaks**
+```python
+# Fix: Use context managers
+from sqlalchemy.orm import Session
+
+with Session(engine) as session:
+    # Work with session
+    pass  # Session is closed on exit, returning its connection to the pool
+```
+
+**Path D: Database Paused/Down**
+```bash
+# Resume database if paused (first confirm your pscale version offers this subcommand: pscale database --help)
+pscale database resume greyhaven-db
+
+# Check database status
+pscale database show greyhaven-db
+```
+
+### Prevention
+- Use connection pooling with proper limits
+- Implement retry logic with exponential backoff
+- Monitor pool utilization (alert >80%)
+- Test for connection leaks in CI/CD
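+
+Retry with exponential backoff, as listed above, might look like the sketch below. The helper name, the default delays, and the choice of retryable exceptions are assumptions for illustration; the `sleep` parameter is injectable only so the behavior can be tested without waiting.
+
+```python
+import random
+import time
+
+
+def retry_with_backoff(func, attempts=5, base_delay=0.5, max_delay=8.0,
+                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
+    """Call func(); on a retryable error wait up to base_delay * 2**n (capped) and retry."""
+    for attempt in range(attempts):
+        try:
+            return func()
+        except retry_on:
+            if attempt == attempts - 1:
+                raise  # Out of attempts: surface the last error.
+            delay = min(max_delay, base_delay * 2 ** attempt)
+            # Full jitter spreads retries so clients do not reconnect in lockstep.
+            sleep(random.uniform(0, delay))
+```
+
+Capping the delay and adding jitter matters most during a database restart, when every application instance would otherwise retry at the same instant and exhaust connections again.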
+
+---
+
+## Runbook 3: Deployment Failures
+
+### Symptoms
+- `wrangler deploy` fails
+- CI/CD pipeline fails at deployment step
+- New code not reflecting in production
+
+### Diagnosis Steps
+
+**1. Check Deployment Error**
+```bash
+wrangler deploy --verbose
+
+# Common errors:
+# - "Script exceeds size limit"
+# - "Syntax error in worker"
+# - "Environment variable missing"
+# - "Binding not found"
+```
+
+**2. Verify Build Output**
+```bash
+# Build first, then inspect the output
+npm run build
+ls -lh dist/
+
+# Ensure build succeeds locally
+```
+
+**3. Check Environment Variables**
+```bash
+# List secrets
+wrangler secret list
+
+# Verify wrangler.toml vars
+grep -A 10 "\[vars\]" wrangler.toml
+```
+
+**4. Test Locally**
+```bash
+# Start dev server
+wrangler dev
+
+# If works locally but not production:
+# - Environment variable mismatch
+# - Binding configuration issue
+```
+
+### Resolution Paths
+
+**Path A: Bundle Too Large**
+```bash
+# Check bundle size
+ls -lh dist/worker.js
+
+# Solutions:
+# - Tree shake unused code
+# - Code split large modules
+# - Use fetch instead of SDK
+```
+
+**Path B: Syntax Error**
+```bash
+# Run TypeScript check
+npm run type-check
+
+# Run linter
+npm run lint
+
+# Fix errors before deploying
+```
+
+**Path C: Missing Variables**
+```bash
+# Add missing secret
+wrangler secret put API_KEY
+
+# Or add to wrangler.toml under [vars]:
+# [vars]
+# API_ENDPOINT = "https://api.example.com"
+```
+
+**Path D: Binding Not Found**
+```toml
+# wrangler.toml - Add binding
+[[kv_namespaces]]
+binding = "CACHE"
+id = "abc123"
+
+[[d1_databases]]
+binding = "DB"
+database_name = "greyhaven-db"
+database_id = "xyz789"
+```
+
+### Prevention
+- Bundle size check in CI/CD
+- Pre-commit hooks for validation
+- Staging environment for testing
+- Automated deployment tests
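+
+A bundle size check for CI could be as small as the sketch below. The 1 MiB budget is an assumed placeholder, not an official Workers limit, and the function names are hypothetical; set the budget to whatever your plan allows.
+
+```python
+from pathlib import Path
+
+# Assumed budget for illustration; replace with the script size limit of your Workers plan.
+MAX_BUNDLE_BYTES = 1024 * 1024
+
+
+def check_bundle(path, limit=MAX_BUNDLE_BYTES):
+    """Return (ok, size_in_bytes) for the built bundle at `path`."""
+    size = Path(path).stat().st_size
+    return size <= limit, size
+
+
+def ci_exit_code(path, limit=MAX_BUNDLE_BYTES):
+    """Return 0 when the bundle fits the budget, 1 otherwise (suitable for sys.exit)."""
+    ok, size = check_bundle(path, limit)
+    print(f"{path}: {size} bytes (budget {limit})")
+    return 0 if ok else 1
+```
+
+In a pipeline step, calling `sys.exit(ci_exit_code("dist/worker.js"))` after the build fails the job before `wrangler deploy` is attempted.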
+
+---
+
+## Runbook 4: Performance Degradation
+
+### Symptoms
+- API response times increased (>2x normal)
+- Slow page loads
+- User complaints about slowness
+- Timeout errors
+
+### Diagnosis Steps
+
+**1. Check Current Latency**
+```bash
+# Test endpoint (one request is a single sample; repeat it to estimate p95)
+curl -w "\nTotal: %{time_total}s\n" -o /dev/null -s https://api.greyhaven.io/orders
+
+# p95 should be <500ms
+# If >1s, investigate
+```
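+
+Because p95 is a statistic over many requests, it has to be computed from repeated samples. The sketch below uses the nearest-rank method; the latency values are made up for illustration, and the 0.5s threshold mirrors the 500ms target in this step.
+
+```python
+import math
+
+
+def percentile(samples, pct):
+    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
+    if not samples:
+        raise ValueError("need at least one sample")
+    ordered = sorted(samples)
+    rank = max(1, math.ceil(pct * len(ordered) / 100))
+    return ordered[rank - 1]
+
+
+# Hypothetical total times in seconds from 20 repeated requests.
+latencies = [0.12, 0.15, 0.11, 0.14, 0.13, 0.18, 0.16, 0.12, 0.95, 0.14,
+             0.13, 0.17, 0.15, 0.12, 0.19, 0.14, 0.16, 0.13, 0.42, 0.15]
+p95 = percentile(latencies, 95)
+print(f"p95 = {p95 * 1000:.0f} ms, over budget: {p95 > 0.5}")
+```
+
+Note how the single 0.95s outlier does not push p95 over budget, which is why one slow `curl` run on its own is weak evidence of degradation.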
+
+**2. Analyze Worker Logs**
+```bash
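+wrangler tail --format json | jq '{outcome: .outcome, url: .event.request.url, timestamp: .eventTimestamp}'
+
+# Identify failing or slow requests
+# Note: outcome is a string (for example "ok" or "exception"), not an object,
+# so log your own timings in the Worker to see what is taking time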
+```
+
+**3. Check Database Queries**
+```bash
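+# Slow query log (confirm this subcommand and flag exist in your pscale version; Insights is also available in the PlanetScale dashboard)
+pscale database insights greyhaven-db main --slow-queries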
+
+# Look for:
+# - N+1 queries (many small queries)
+# - Missing indexes (full table scans)
+# - Long-running queries (>100ms)
+```
+
+**4. Profile Application**
+```bash
+# Add timing middleware
+# Log slow operations
+# Identify bottleneck (DB, API, compute)
+```
+
+### Resolution Paths
+
+**Path A: N+1 Queries**
+```python
+# Use eager loading
+statement = (
+    select(Order)
+    .options(selectinload(Order.items))
+)
+```
+
+**Path B: Missing Indexes**
+```sql
+-- Add indexes
+CREATE INDEX idx_orders_user_id ON orders(user_id);
+CREATE INDEX idx_items_order_id ON order_items(order_id);
+```
+
+**Path C: No Caching**
+```typescript
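+// Add Redis caching (values stored as JSON strings; assumes a client that returns raw strings)
+const cached = await redis.get(cacheKey);
+if (cached) return JSON.parse(cached);
+
+const result = await expensiveOperation();
+await redis.setex(cacheKey, 300, JSON.stringify(result)); // 300s TTL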
+```
+
+**Path D: Worker CPU Limit**
+```typescript
+// Optimize expensive operations
+// Use async operations
+// Offload to external service
+```
+
+### Prevention
+- Monitor p95 latency (alert >500ms)
+- Test for N+1 queries in CI/CD
+- Add indexes for foreign keys
+- Implement caching layer
+- Performance budgets in tests
+
+---
+
+## Runbook 5: Network Connectivity Issues
+
+### Symptoms
+- Intermittent failures
+- DNS resolution errors
+- Connection timeouts
+- CORS errors
+
+### Diagnosis Steps
+
+**1. Test DNS Resolution**
+```bash
+# Check DNS
+nslookup api.partner.com
+dig api.partner.com
+
+# Measure DNS time
+time nslookup api.partner.com
+
+# If >1s, DNS is slow
+```
+
+**2. Test Connectivity**
+```bash
+# Basic connectivity
+ping api.partner.com
+
+# Trace route
+traceroute api.partner.com
+
+# Full timing breakdown
+curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTotal: %{time_total}s\n" \
+  -o /dev/null -s https://api.partner.com
+```
+
+**3. Check CORS**
+```bash
+# Preflight request
+curl -I -X OPTIONS https://api.greyhaven.io/api/users \
+  -H "Origin: https://app.greyhaven.io" \
+  -H "Access-Control-Request-Method: POST"
+
+# Verify headers:
+# - Access-Control-Allow-Origin
+# - Access-Control-Allow-Methods
+```
+
+**4. Check Firewall/Security**
+```bash
+# Test from different location
+# Check IP whitelist
+# Verify SSL certificate
+```
+
+### Resolution Paths
+
+**Path A: Slow DNS**
+```typescript
+// Implement DNS caching
+const DNS_CACHE = new Map();
+// Cache DNS for 60s
+```
+
+**Path B: Connection Timeout**
+```typescript
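+// Increase timeout: the abort signal must be passed to fetch to take effect
+const controller = new AbortController();
+const timer = setTimeout(() => controller.abort(), 30000); // 30s
+const response = await fetch(url, { signal: controller.signal }); // `url` is a placeholder
+clearTimeout(timer);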
+```
+
+**Path C: CORS Error**
+```typescript
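+// Add CORS headers (echo the origin only after checking it against an allowlist)
+response.headers.set('Access-Control-Allow-Origin', origin);
+response.headers.set('Access-Control-Allow-Methods', 'GET,POST,PUT,DELETE');
+response.headers.set('Vary', 'Origin'); // Keep caches from reusing a response across origins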
+```
+
+**Path D: SSL/TLS Issue**
+```bash
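+# Check certificate validity dates (stdin is closed so the command exits instead of hanging)
+openssl s_client -connect api.partner.com:443 -servername api.partner.com </dev/null 2>/dev/null \
+  | openssl x509 -noout -dates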
+
+# Verify not expired
+# Check certificate chain
+```
+
+### Prevention
+- DNS caching (60s TTL)
+- Appropriate timeouts (30s for external APIs)
+- Health checks for external dependencies
+- Circuit breakers for failures
+- Monitor external API latency
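+
+The 60s caching item above amounts to a small time-to-live cache. The sketch below is a generic illustration in Python rather than Worker code; the class name and the injected `clock` are assumptions, and the loader can be any lookup function.
+
+```python
+import time
+
+
+class TTLCache:
+    """Tiny in-memory cache whose entries expire `ttl` seconds after being stored."""
+
+    def __init__(self, ttl=60.0, clock=time.monotonic):
+        self.ttl = ttl
+        self.clock = clock
+        self._entries = {}  # key -> (expires_at, value)
+
+    def get_or_load(self, key, loader):
+        """Return the cached value for `key`, calling loader(key) on a miss or after expiry."""
+        entry = self._entries.get(key)
+        now = self.clock()
+        if entry is not None and entry[0] > now:
+            return entry[1]
+        value = loader(key)
+        self._entries[key] = (now + self.ttl, value)
+        return value
+```
+
+For DNS, a call such as `cache.get_or_load("api.partner.com", socket.gethostbyname)` would resolve the name at most once per 60 seconds per process.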
+
+---
+
+## Emergency Procedures (SEV1)
+
+**Immediate Actions**:
+1. **Assess**: Users affected? Functionality broken? Data loss risk?
+2. **Communicate**: Alert team, update status page
+3. **Stop Bleeding**: `wrangler rollback` or disable feature
+4. **Diagnose**: Logs, recent changes, metrics
+5. **Fix**: Hotfix or workaround, test first
+6. **Verify**: Monitor metrics, test functionality
+7. **Postmortem**: Document, root cause, prevention
+
+---
+
+## Escalation Matrix
+
+| Issue Type | First Response | Escalate To | Escalation Trigger |
+|------------|---------------|-------------|-------------------|
+| Worker errors | DevOps troubleshooter | incident-responder | SEV1/SEV2 |
+| Performance | DevOps troubleshooter | performance-optimizer | >30min unresolved |
+| Database | DevOps troubleshooter | data-validator | Schema issues |
+| Security | DevOps troubleshooter | security-analyzer | Breach suspected |
+| Application bugs | DevOps troubleshooter | smart-debug | Infrastructure ruled out |
+
+---
+
+## Related Documentation
+
+- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples
+- **Diagnostic Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
+- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific
+
+---
+
+Return to [reference index](INDEX.md)