Initial commit
This commit is contained in:
@@ -0,0 +1,477 @@
|
||||
# Distributed System Network Debugging
|
||||
|
||||
Investigating intermittent 504 Gateway Timeout errors between Cloudflare Workers and external APIs, resolved through DNS caching and timeout tuning.
|
||||
|
||||
## Overview
|
||||
|
||||
**Incident**: 5% of API requests failing with 504 timeouts
|
||||
**Impact**: Intermittent failures, no clear pattern, user frustration
|
||||
**Root Cause**: DNS resolution delays + worker timeout too aggressive
|
||||
**Resolution**: DNS caching + timeout increase (5s→30s)
|
||||
**Status**: Resolved
|
||||
|
||||
## Incident Timeline
|
||||
|
||||
| Time | Event | Action |
|
||||
|------|-------|--------|
|
||||
| 14:00 | 504 errors detected | Alerts triggered |
|
||||
| 14:10 | Pattern analysis started | Check logs, no obvious cause |
|
||||
| 14:30 | Network trace performed | Found DNS delays |
|
||||
| 14:50 | Root cause identified | DNS + timeout combination |
|
||||
| 15:10 | Fix deployed | DNS caching + timeout tuning |
|
||||
| 15:40 | Monitoring confirmed | 504s eliminated |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms and Detection
|
||||
|
||||
### Initial Alerts
|
||||
|
||||
**Error Pattern**:
|
||||
```
|
||||
[ERROR] Request to https://api.partner.com/data failed: 504 Gateway Timeout
|
||||
[ERROR] Upstream timeout after 5000ms
|
||||
[ERROR] DNS lookup took 3200ms (80% of timeout!)
|
||||
```
|
||||
|
||||
**Characteristics**:
|
||||
- ❌ Random occurrence (5% of requests)
|
||||
- ❌ No pattern by time of day
|
||||
- ❌ Affects all worker regions equally
|
||||
- ❌ External API reports no issues
|
||||
- ✅ Only affects specific external endpoints
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Step 1: Network Request Breakdown
|
||||
|
||||
**curl Timing Analysis**:
|
||||
```bash
|
||||
# Test external API with detailed timing
|
||||
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nStart: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
|
||||
-o /dev/null -s https://api.partner.com/data
|
||||
|
||||
# Results (intermittent):
|
||||
DNS: 3.201s # ❌ Very slow!
|
||||
Connect: 3.450s
|
||||
TLS: 3.780s
|
||||
Start: 4.120s
|
||||
Total: 4.823s # Close to 5s worker timeout
|
||||
```
|
||||
|
||||
**Fast vs Slow Requests**:
|
||||
```
|
||||
FAST (95% of requests):
|
||||
DNS: 0.050s → Connect: 0.120s → Total: 0.850s ✅
|
||||
|
||||
SLOW (5% of requests):
|
||||
DNS: 3.200s → Connect: 3.450s → Total: 4.850s ❌ (near timeout)
|
||||
```
|
||||
|
||||
**Root Cause**: DNS resolution delays causing total request time to exceed worker timeout.
|
||||
|
||||
---
|
||||
|
||||
### Step 2: DNS Investigation
|
||||
|
||||
**nslookup Testing**:
|
||||
```bash
|
||||
# Test DNS resolution
|
||||
time nslookup api.partner.com
|
||||
|
||||
# Results (vary):
|
||||
Run 1: 0.05s ✅
|
||||
Run 2: 3.10s ❌
|
||||
Run 3: 0.04s ✅
|
||||
Run 4: 2.95s ❌
|
||||
|
||||
Pattern: DNS cache miss causes 3s delay
|
||||
```
|
||||
|
||||
**dig Analysis**:
|
||||
```bash
|
||||
# Detailed DNS query
|
||||
dig api.partner.com +stats
|
||||
|
||||
# Results:
|
||||
;; Query time: 3021 msec # Slow!
|
||||
;; SERVER: 1.1.1.1#53(1.1.1.1)
|
||||
;; WHEN: Thu Dec 05 14:25:32 UTC 2024
|
||||
;; MSG SIZE rcvd: 84
|
||||
|
||||
# Root cause: No DNS caching in worker
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Worker Timeout Configuration
|
||||
|
||||
**Current Worker Code**:
|
||||
```typescript
|
||||
// worker.ts (BEFORE - Too aggressive timeout)
|
||||
export default {
|
||||
async fetch(request: Request, env: Env) {
|
||||
const controller = new AbortController();
|
||||
const timeout = setTimeout(() => controller.abort(), 5000); // 5s timeout
|
||||
|
||||
try {
|
||||
const response = await fetch('https://api.partner.com/data', {
|
||||
signal: controller.signal
|
||||
});
|
||||
return response;
|
||||
} catch (error) {
|
||||
// 5% of requests timeout here
|
||||
return new Response('Gateway Timeout', { status: 504 });
|
||||
} finally {
|
||||
clearTimeout(timeout);
|
||||
}
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
**Problem**: 5s timeout doesn't account for DNS delays (up to 3s).
|
||||
|
||||
---
|
||||
|
||||
### Step 4: CORS and Headers Check
|
||||
|
||||
**Test CORS Headers**:
|
||||
```bash
|
||||
# Check CORS preflight
|
||||
curl -I -X OPTIONS https://api.greyhaven.io/proxy \
|
||||
-H "Origin: https://app.greyhaven.io" \
|
||||
-H "Access-Control-Request-Method: POST"
|
||||
|
||||
# Response:
|
||||
HTTP/2 200
|
||||
access-control-allow-origin: https://app.greyhaven.io ✅
|
||||
access-control-allow-methods: GET, POST, PUT, DELETE ✅
|
||||
access-control-max-age: 86400 ✅
|
||||
```
|
||||
|
||||
**No CORS issues** - problem isolated to DNS + timeout.
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Fix 1: Implement DNS Caching
|
||||
|
||||
**Worker with DNS Cache**:
|
||||
```typescript
|
||||
// worker.ts (AFTER - With DNS caching)
|
||||
interface DnsCache {
|
||||
ip: string;
|
||||
timestamp: number;
|
||||
ttl: number;
|
||||
}
|
||||
|
||||
const DNS_CACHE = new Map<string, DnsCache>();
|
||||
const DNS_TTL = 60 * 1000; // 60 seconds
|
||||
|
||||
async function resolveWithCache(hostname: string): Promise<string> {
|
||||
const cached = DNS_CACHE.get(hostname);
|
||||
|
||||
if (cached && Date.now() - cached.timestamp < cached.ttl) {
|
||||
// Cache hit - return immediately
|
||||
return cached.ip;
|
||||
}
|
||||
|
||||
// Cache miss - resolve DNS
|
||||
const dnsResponse = await fetch(`https://1.1.1.1/dns-query?name=${hostname}`, {
|
||||
headers: { 'accept': 'application/dns-json' }
|
||||
});
|
||||
const dnsData = await dnsResponse.json();
|
||||
const ip = dnsData.Answer[0].data;
|
||||
|
||||
// Update cache
|
||||
DNS_CACHE.set(hostname, {
|
||||
ip,
|
||||
timestamp: Date.now(),
|
||||
ttl: DNS_TTL
|
||||
});
|
||||
|
||||
return ip;
|
||||
}
|
||||
|
||||
export default {
|
||||
async fetch(request: Request, env: Env) {
|
||||
// Pre-resolve DNS (cached)
|
||||
const ip = await resolveWithCache('api.partner.com');
|
||||
|
||||
// Use IP directly (bypass DNS)
|
||||
const response = await fetch(`https://${ip}/data`, {
|
||||
headers: {
|
||||
'Host': 'api.partner.com' // Required for SNI
|
||||
}
|
||||
});
|
||||
|
||||
return response;
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
**Result**: DNS resolution <5ms (cache hit) vs 3000ms (cache miss).
|
||||
|
||||
---
|
||||
|
||||
### Fix 2: Increase Worker Timeout
|
||||
|
||||
**Updated Timeout**:
|
||||
```typescript
|
||||
// worker.ts - Increased timeout to account for DNS
|
||||
const controller = new AbortController();
|
||||
const timeout = setTimeout(() => controller.abort(), 30000); // 30s timeout
|
||||
|
||||
try {
|
||||
const response = await fetch('https://api.partner.com/data', {
|
||||
signal: controller.signal
|
||||
});
|
||||
return response;
|
||||
} finally {
|
||||
clearTimeout(timeout);
|
||||
}
|
||||
```
|
||||
|
||||
**Timeout Breakdown**:
|
||||
```
|
||||
Old: 5s total
|
||||
- DNS: 3s (worst case)
|
||||
- Connect: 1s
|
||||
- Request: 1s
|
||||
= Frequent timeouts
|
||||
|
||||
New: 30s total
|
||||
- DNS: <0.01s (cached)
|
||||
- Connect: 1s
|
||||
- Request: 2s
|
||||
- Buffer: 27s (ample)
|
||||
= No timeouts
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 3: Add Retry Logic with Exponential Backoff
|
||||
|
||||
**Retry Implementation**:
|
||||
```typescript
|
||||
// utils/retry.ts
|
||||
async function fetchWithRetry(
|
||||
url: string,
|
||||
options: RequestInit,
|
||||
maxRetries: number = 3
|
||||
): Promise<Response> {
|
||||
for (let attempt = 0; attempt < maxRetries; attempt++) {
|
||||
try {
|
||||
const response = await fetch(url, options);
|
||||
|
||||
// Retry on 5xx errors
|
||||
if (response.status >= 500 && attempt < maxRetries - 1) {
|
||||
const delay = Math.pow(2, attempt) * 1000; // Exponential backoff
|
||||
await new Promise(resolve => setTimeout(resolve, delay));
|
||||
continue;
|
||||
}
|
||||
|
||||
return response;
|
||||
} catch (error) {
|
||||
if (attempt === maxRetries - 1) throw error;
|
||||
|
||||
// Exponential backoff: 1s, 2s, 4s
|
||||
const delay = Math.pow(2, attempt) * 1000;
|
||||
await new Promise(resolve => setTimeout(resolve, delay));
|
||||
}
|
||||
}
|
||||
|
||||
throw new Error('Max retries exceeded');
|
||||
}
|
||||
|
||||
// Usage:
|
||||
const response = await fetchWithRetry('https://api.partner.com/data', {
|
||||
signal: controller.signal
|
||||
});
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fix 4: Circuit Breaker Pattern
|
||||
|
||||
**Prevent Cascading Failures**:
|
||||
```typescript
|
||||
// utils/circuit-breaker.ts
|
||||
class CircuitBreaker {
|
||||
private failures: number = 0;
|
||||
private lastFailureTime: number = 0;
|
||||
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
|
||||
|
||||
async execute<T>(fn: () => Promise<T>): Promise<T> {
|
||||
if (this.state === 'OPEN') {
|
||||
// Check if enough time passed to try again
|
||||
if (Date.now() - this.lastFailureTime > 60000) {
|
||||
this.state = 'HALF_OPEN';
|
||||
} else {
|
||||
throw new Error('Circuit breaker is OPEN');
|
||||
}
|
||||
}
|
||||
|
||||
try {
|
||||
const result = await fn();
|
||||
this.onSuccess();
|
||||
return result;
|
||||
} catch (error) {
|
||||
this.onFailure();
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
private onSuccess() {
|
||||
this.failures = 0;
|
||||
this.state = 'CLOSED';
|
||||
}
|
||||
|
||||
private onFailure() {
|
||||
this.failures++;
|
||||
this.lastFailureTime = Date.now();
|
||||
|
||||
if (this.failures >= 5) {
|
||||
this.state = 'OPEN'; // Trip circuit after 5 failures
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Usage:
|
||||
const breaker = new CircuitBreaker();
|
||||
const response = await breaker.execute(() =>
|
||||
fetch('https://api.partner.com/data')
|
||||
);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Results
|
||||
|
||||
### Before vs After Metrics
|
||||
|
||||
| Metric | Before Fix | After Fix | Improvement |
|
||||
|--------|-----------|-----------|-------------|
|
||||
| **504 Error Rate** | 5% | 0.01% | **99.8% reduction** |
|
||||
| **DNS Resolution** | 3000ms (worst) | <5ms (cached) | **99.8% faster** |
|
||||
| **Total Request Time** | 4800ms (p95) | 850ms (p95) | **82% faster** |
|
||||
| **Timeout Threshold** | 5s (too low) | 30s (appropriate) | +500% headroom |
|
||||
|
||||
---
|
||||
|
||||
### Network Diagnostics
|
||||
|
||||
**traceroute Analysis**:
|
||||
```bash
|
||||
# Check network path to external API
|
||||
traceroute api.partner.com
|
||||
|
||||
# Results show no packet loss
|
||||
1 gateway (10.0.0.1) 1.234 ms
|
||||
2 isp-router (100.64.0.1) 5.678 ms
|
||||
...
|
||||
15 api.partner.com (203.0.113.42) 45.234 ms
|
||||
```
|
||||
|
||||
**No packet loss** - confirms DNS was the issue, not network.
|
||||
|
||||
---
|
||||
|
||||
## Prevention Measures
|
||||
|
||||
### 1. Network Monitoring Dashboard
|
||||
|
||||
**Metrics to Track**:
|
||||
```typescript
|
||||
// Track network timing metrics
|
||||
const network_dns_duration = new Histogram({
|
||||
name: 'network_dns_duration_seconds',
|
||||
help: 'DNS resolution time'
|
||||
});
|
||||
|
||||
const network_connect_duration = new Histogram({
|
||||
name: 'network_connect_duration_seconds',
|
||||
help: 'TCP connection time'
|
||||
});
|
||||
|
||||
const network_total_duration = new Histogram({
|
||||
name: 'network_total_duration_seconds',
|
||||
help: 'Total request time'
|
||||
});
|
||||
```
|
||||
|
||||
### 2. Alert Rules
|
||||
|
||||
```yaml
|
||||
# Alert on high DNS resolution time
|
||||
- alert: SlowDnsResolution
|
||||
expr: histogram_quantile(0.95, network_dns_duration_seconds) > 1
|
||||
for: 5m
|
||||
annotations:
|
||||
summary: "DNS resolution p95 >1s"
|
||||
|
||||
# Alert on gateway timeouts
|
||||
- alert: HighGatewayTimeouts
|
||||
expr: rate(http_requests_total{status="504"}[5m]) > 0.01
|
||||
for: 5m
|
||||
annotations:
|
||||
summary: "504 error rate >1%"
|
||||
```
|
||||
|
||||
### 3. Health Check Endpoints
|
||||
|
||||
```typescript
|
||||
@app.get("/health/network")
|
||||
async function networkHealth() {
|
||||
const checks = await Promise.all([
|
||||
checkDns('api.partner.com'),
|
||||
checkConnectivity('https://api.partner.com/health'),
|
||||
checkLatency('https://api.partner.com/ping')
|
||||
]);
|
||||
|
||||
return {
|
||||
status: checks.every(c => c.healthy) ? 'healthy' : 'degraded',
|
||||
checks
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
### What Went Well
|
||||
|
||||
✅ Detailed network timing analysis pinpointed DNS
|
||||
✅ DNS caching eliminated 99.8% of timeouts
|
||||
✅ Circuit breaker prevents cascading failures
|
||||
|
||||
### What Could Be Improved
|
||||
|
||||
❌ No DNS monitoring before incident
|
||||
❌ Timeout too aggressive without considering DNS
|
||||
❌ No retry logic for transient failures
|
||||
|
||||
### Key Takeaways
|
||||
|
||||
1. **Always cache DNS** in workers (60s TTL minimum)
|
||||
2. **Account for DNS time** when setting timeouts
|
||||
3. **Add retry logic** with exponential backoff
|
||||
4. **Implement circuit breakers** for external dependencies
|
||||
5. **Monitor network timing** (DNS, connect, TLS, transfer)
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
|
||||
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
|
||||
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
|
||||
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
|
||||
|
||||
---
|
||||
|
||||
Return to [examples index](INDEX.md)
|
||||
Reference in New Issue
Block a user