Initial commit
This commit is contained in:
@@ -0,0 +1,489 @@
|
||||
# Troubleshooting Runbooks
|
||||
|
||||
Step-by-step runbooks for resolving common Grey Haven infrastructure issues. Follow procedures systematically for fastest resolution.
|
||||
|
||||
## Runbook 1: Worker Not Responding
|
||||
|
||||
### Symptoms
|
||||
- API returning 500/502/503 errors
|
||||
- Workers timing out or not processing requests
|
||||
- Cloudflare error pages showing
|
||||
|
||||
### Diagnosis Steps
|
||||
|
||||
**1. Check Cloudflare Status**
|
||||
```bash
|
||||
# Visit: https://www.cloudflarestatus.com
|
||||
# Or query status API
|
||||
curl -s https://www.cloudflarestatus.com/api/v2/status.json | jq '.status.indicator'
|
||||
```
|
||||
|
||||
**2. View Worker Logs**
|
||||
```bash
|
||||
# Real-time logs
|
||||
wrangler tail --format pretty
|
||||
|
||||
# Look for errors:
|
||||
# - "Script exceeded CPU time limit"
|
||||
# - "Worker threw exception"
|
||||
# - "Uncaught TypeError"
|
||||
```
|
||||
|
||||
**3. Check Recent Deployments**
|
||||
```bash
|
||||
wrangler deployments list
|
||||
|
||||
# If recent deployment suspicious, rollback:
|
||||
wrangler rollback --message "Reverting to stable version"
|
||||
```
|
||||
|
||||
**4. Test Worker Locally**
|
||||
```bash
|
||||
# Run worker in dev mode
|
||||
wrangler dev
|
||||
|
||||
# Test endpoint
|
||||
curl http://localhost:8787/api/health
|
||||
```
|
||||
|
||||
### Resolution Paths
|
||||
|
||||
**Path A: Platform Issue** - Wait for Cloudflare, monitor status, communicate ETA
|
||||
**Path B: Code Error** - Rollback deployment, fix in dev, test before redeploy
|
||||
**Path C: Resource Limit** - Check CPU logs, optimize operations, upgrade if needed
|
||||
**Path D: Binding Issue** - Verify wrangler.toml, check bindings, redeploy
|
||||
|
||||
### Prevention
|
||||
- Health check endpoint: `GET /health`
|
||||
- Monitor error rate with alerts (>1% = alert)
|
||||
- Test deployments in staging first
|
||||
- Implement circuit breakers for external calls
|
||||
|
||||
---
|
||||
|
||||
## Runbook 2: Database Connection Failures
|
||||
|
||||
### Symptoms
|
||||
- "connection refused" errors
|
||||
- "too many connections" errors
|
||||
- Application timing out on database queries
|
||||
- 503 errors from API
|
||||
|
||||
### Diagnosis Steps
|
||||
|
||||
**1. Test Database Connection**
|
||||
```bash
|
||||
# Direct connection test
|
||||
pscale shell greyhaven-db main
|
||||
|
||||
# If fails, check:
|
||||
# - Database status
|
||||
# - Credentials
|
||||
# - Network connectivity
|
||||
```
|
||||
|
||||
**2. Check Connection Pool**
|
||||
```bash
|
||||
# Query pool status
|
||||
curl http://localhost:8000/pool-status
|
||||
|
||||
# Expected healthy response:
|
||||
{
|
||||
"size": 50,
|
||||
"checked_out": 25, # <80% is healthy
|
||||
"overflow": 0,
|
||||
"available": 25
|
||||
}
|
||||
```
|
||||
|
||||
**3. Check Active Connections**
|
||||
```sql
|
||||
-- In pscale shell
|
||||
SELECT
|
||||
COUNT(*) as active,
|
||||
MAX(query_start) as oldest_query
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active';
|
||||
|
||||
-- If active = pool size, pool exhausted
|
||||
-- If oldest_query >10min, leaked connection
|
||||
```
|
||||
|
||||
**4. Review Application Logs**
|
||||
```bash
|
||||
# Search for connection errors
|
||||
grep -i "connection" logs/app.log | tail -50
|
||||
|
||||
# Common errors:
|
||||
# - "Pool timeout"
|
||||
# - "Connection refused"
|
||||
# - "Max connections reached"
|
||||
```
|
||||
|
||||
### Resolution Paths
|
||||
|
||||
**Path A: Invalid Credentials**
|
||||
```bash
|
||||
# Rotate credentials
|
||||
pscale password create greyhaven-db main app-password
|
||||
|
||||
# Update environment variable
|
||||
# Restart application
|
||||
```
|
||||
|
||||
**Path B: Pool Exhausted**
|
||||
```python
|
||||
# Increase pool size in database.py
|
||||
engine = create_engine(
|
||||
database_url,
|
||||
pool_size=50, # Increase from 20
|
||||
max_overflow=20
|
||||
)
|
||||
```
|
||||
|
||||
**Path C: Connection Leaks**
|
||||
```python
|
||||
# Fix: Use context managers
|
||||
with Session(engine) as session:
|
||||
# Work with session
|
||||
pass # Automatically closed
|
||||
```
|
||||
|
||||
**Path D: Database Paused/Down**
|
||||
```bash
|
||||
# Resume database if paused
|
||||
pscale database resume greyhaven-db
|
||||
|
||||
# Check database status
|
||||
pscale database show greyhaven-db
|
||||
```
|
||||
|
||||
### Prevention
|
||||
- Use connection pooling with proper limits
|
||||
- Implement retry logic with exponential backoff
|
||||
- Monitor pool utilization (alert >80%)
|
||||
- Test for connection leaks in CI/CD
|
||||
|
||||
---
|
||||
|
||||
## Runbook 3: Deployment Failures
|
||||
|
||||
### Symptoms
|
||||
- `wrangler deploy` fails
|
||||
- CI/CD pipeline fails at deployment step
|
||||
- New code not reflecting in production
|
||||
|
||||
### Diagnosis Steps
|
||||
|
||||
**1. Check Deployment Error**
|
||||
```bash
|
||||
wrangler deploy --verbose
|
||||
|
||||
# Common errors:
|
||||
# - "Script exceeds size limit"
|
||||
# - "Syntax error in worker"
|
||||
# - "Environment variable missing"
|
||||
# - "Binding not found"
|
||||
```
|
||||
|
||||
**2. Verify Build Output**
|
||||
```bash
|
||||
# Check built file
|
||||
ls -lh dist/
|
||||
npm run build
|
||||
|
||||
# Ensure build succeeds locally
|
||||
```
|
||||
|
||||
**3. Check Environment Variables**
|
||||
```bash
|
||||
# List secrets
|
||||
wrangler secret list
|
||||
|
||||
# Verify wrangler.toml vars
|
||||
cat wrangler.toml | grep -A 10 "\[vars\]"
|
||||
```
|
||||
|
||||
**4. Test Locally**
|
||||
```bash
|
||||
# Start dev server
|
||||
wrangler dev
|
||||
|
||||
# If works locally but not production:
|
||||
# - Environment variable mismatch
|
||||
# - Binding configuration issue
|
||||
```
|
||||
|
||||
### Resolution Paths
|
||||
|
||||
**Path A: Bundle Too Large**
|
||||
```bash
|
||||
# Check bundle size
|
||||
ls -lh dist/worker.js
|
||||
|
||||
# Solutions:
|
||||
# - Tree shake unused code
|
||||
# - Code split large modules
|
||||
# - Use fetch instead of SDK
|
||||
```
|
||||
|
||||
**Path B: Syntax Error**
|
||||
```bash
|
||||
# Run TypeScript check
|
||||
npm run type-check
|
||||
|
||||
# Run linter
|
||||
npm run lint
|
||||
|
||||
# Fix errors before deploying
|
||||
```
|
||||
|
||||
**Path C: Missing Variables**
|
||||
```bash
|
||||
# Add missing secret
|
||||
wrangler secret put API_KEY
|
||||
|
||||
# Or add to wrangler.toml vars
|
||||
[vars]
|
||||
API_ENDPOINT = "https://api.example.com"
|
||||
```
|
||||
|
||||
**Path D: Binding Not Found**
|
||||
```toml
|
||||
# wrangler.toml - Add binding
|
||||
[[kv_namespaces]]
|
||||
binding = "CACHE"
|
||||
id = "abc123"
|
||||
|
||||
[[d1_databases]]
|
||||
binding = "DB"
|
||||
database_name = "greyhaven-db"
|
||||
database_id = "xyz789"
|
||||
```
|
||||
|
||||
### Prevention
|
||||
- Bundle size check in CI/CD
|
||||
- Pre-commit hooks for validation
|
||||
- Staging environment for testing
|
||||
- Automated deployment tests
|
||||
|
||||
---
|
||||
|
||||
## Runbook 4: Performance Degradation
|
||||
|
||||
### Symptoms
|
||||
- API response times increased (>2x normal)
|
||||
- Slow page loads
|
||||
- User complaints about slowness
|
||||
- Timeout errors
|
||||
|
||||
### Diagnosis Steps
|
||||
|
||||
**1. Check Current Latency**
|
||||
```bash
|
||||
# Test endpoint
|
||||
curl -w "\nTotal: %{time_total}s\n" -o /dev/null -s https://api.greyhaven.io/orders
|
||||
|
||||
# p95 should be <500ms
|
||||
# If >1s, investigate
|
||||
```
|
||||
|
||||
**2. Analyze Worker Logs**
|
||||
```bash
|
||||
wrangler tail --format json | jq '{duration: .outcome.duration, event: .event}'
|
||||
|
||||
# Identify slow requests
|
||||
# Check what's taking time
|
||||
```
|
||||
|
||||
**3. Check Database Queries**
|
||||
```bash
|
||||
# Slow query log
|
||||
pscale database insights greyhaven-db main --slow-queries
|
||||
|
||||
# Look for:
|
||||
# - N+1 queries (many small queries)
|
||||
# - Missing indexes (full table scans)
|
||||
# - Long-running queries (>100ms)
|
||||
```
|
||||
|
||||
**4. Profile Application**
|
||||
```bash
|
||||
# Add timing middleware
|
||||
# Log slow operations
|
||||
# Identify bottleneck (DB, API, compute)
|
||||
```
|
||||
|
||||
### Resolution Paths
|
||||
|
||||
**Path A: N+1 Queries**
|
||||
```python
|
||||
# Use eager loading
|
||||
statement = (
|
||||
select(Order)
|
||||
.options(selectinload(Order.items))
|
||||
)
|
||||
```
|
||||
|
||||
**Path B: Missing Indexes**
|
||||
```sql
|
||||
-- Add indexes
|
||||
CREATE INDEX idx_orders_user_id ON orders(user_id);
|
||||
CREATE INDEX idx_items_order_id ON order_items(order_id);
|
||||
```
|
||||
|
||||
**Path C: No Caching**
|
||||
```typescript
|
||||
// Add Redis caching
|
||||
const cached = await redis.get(cacheKey);
|
||||
if (cached) return cached;
|
||||
|
||||
const result = await expensiveOperation();
|
||||
await redis.setex(cacheKey, 300, result);
|
||||
```
|
||||
|
||||
**Path D: Worker CPU Limit**
|
||||
```typescript
|
||||
// Optimize expensive operations
|
||||
// Use async operations
|
||||
// Offload to external service
|
||||
```
|
||||
|
||||
### Prevention
|
||||
- Monitor p95 latency (alert >500ms)
|
||||
- Test for N+1 queries in CI/CD
|
||||
- Add indexes for foreign keys
|
||||
- Implement caching layer
|
||||
- Performance budgets in tests
|
||||
|
||||
---
|
||||
|
||||
## Runbook 5: Network Connectivity Issues
|
||||
|
||||
### Symptoms
|
||||
- Intermittent failures
|
||||
- DNS resolution errors
|
||||
- Connection timeouts
|
||||
- CORS errors
|
||||
|
||||
### Diagnosis Steps
|
||||
|
||||
**1. Test DNS Resolution**
|
||||
```bash
|
||||
# Check DNS
|
||||
nslookup api.partner.com
|
||||
dig api.partner.com
|
||||
|
||||
# Measure DNS time
|
||||
time nslookup api.partner.com
|
||||
|
||||
# If >1s, DNS is slow
|
||||
```
|
||||
|
||||
**2. Test Connectivity**
|
||||
```bash
|
||||
# Basic connectivity
|
||||
ping api.partner.com
|
||||
|
||||
# Trace route
|
||||
traceroute api.partner.com
|
||||
|
||||
# Full timing breakdown
|
||||
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTotal: %{time_total}s\n" \
|
||||
-o /dev/null -s https://api.partner.com
|
||||
```
|
||||
|
||||
**3. Check CORS**
|
||||
```bash
|
||||
# Preflight request
|
||||
curl -I -X OPTIONS https://api.greyhaven.io/api/users \
|
||||
-H "Origin: https://app.greyhaven.io" \
|
||||
-H "Access-Control-Request-Method: POST"
|
||||
|
||||
# Verify headers:
|
||||
# - Access-Control-Allow-Origin
|
||||
# - Access-Control-Allow-Methods
|
||||
```
|
||||
|
||||
**4. Check Firewall/Security**
|
||||
```bash
|
||||
# Test from different location
|
||||
# Check IP whitelist
|
||||
# Verify SSL certificate
|
||||
```
|
||||
|
||||
### Resolution Paths
|
||||
|
||||
**Path A: Slow DNS**
|
||||
```typescript
|
||||
// Implement DNS caching
|
||||
const DNS_CACHE = new Map();
|
||||
// Cache DNS for 60s
|
||||
```
|
||||
|
||||
**Path B: Connection Timeout**
|
||||
```typescript
|
||||
// Increase timeout
|
||||
const controller = new AbortController();
|
||||
setTimeout(() => controller.abort(), 30000); // 30s
|
||||
```
|
||||
|
||||
**Path C: CORS Error**
|
||||
```typescript
|
||||
// Add CORS headers
|
||||
response.headers.set('Access-Control-Allow-Origin', origin);
|
||||
response.headers.set('Access-Control-Allow-Methods', 'GET,POST,PUT,DELETE');
|
||||
```
|
||||
|
||||
**Path D: SSL/TLS Issue**
|
||||
```bash
|
||||
# Check certificate
|
||||
openssl s_client -connect api.partner.com:443
|
||||
|
||||
# Verify not expired
|
||||
# Check certificate chain
|
||||
```
|
||||
|
||||
### Prevention
|
||||
- DNS caching (60s TTL)
|
||||
- Appropriate timeouts (30s for external APIs)
|
||||
- Health checks for external dependencies
|
||||
- Circuit breakers for failures
|
||||
- Monitor external API latency
|
||||
|
||||
---
|
||||
|
||||
## Emergency Procedures (SEV1)
|
||||
|
||||
**Immediate Actions**:
|
||||
1. **Assess**: Users affected? Functionality broken? Data loss risk?
|
||||
2. **Communicate**: Alert team, update status page
|
||||
3. **Stop Bleeding**: `wrangler rollback` or disable feature
|
||||
4. **Diagnose**: Logs, recent changes, metrics
|
||||
5. **Fix**: Hotfix or workaround, test first
|
||||
6. **Verify**: Monitor metrics, test functionality
|
||||
7. **Postmortem**: Document, root cause, prevention
|
||||
|
||||
---
|
||||
|
||||
## Escalation Matrix
|
||||
|
||||
| Issue Type | First Response | Escalate To | Escalation Trigger |
|
||||
|------------|---------------|-------------|-------------------|
|
||||
| Worker errors | DevOps troubleshooter | incident-responder | SEV1/SEV2 |
|
||||
| Performance | DevOps troubleshooter | performance-optimizer | >30min unresolved |
|
||||
| Database | DevOps troubleshooter | data-validator | Schema issues |
|
||||
| Security | DevOps troubleshooter | security-analyzer | Breach suspected |
|
||||
| Application bugs | DevOps troubleshooter | smart-debug | Infrastructure ruled out |
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples
|
||||
- **Diagnostic Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
|
||||
- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific
|
||||
|
||||
---
|
||||
|
||||
Return to [reference index](INDEX.md)
|
||||
Reference in New Issue
Block a user