# Distributed System Network Debugging

Investigating intermittent 504 Gateway Timeout errors between Cloudflare Workers and external APIs, resolved through DNS caching and timeout tuning.

## Overview

- **Incident**: 5% of API requests failing with 504 timeouts
- **Impact**: Intermittent failures, no clear pattern, user frustration
- **Root cause**: DNS resolution delays combined with an overly aggressive worker timeout
- **Resolution**: DNS caching + timeout increase (5s → 30s)
- **Status**: Resolved

## Incident Timeline
| Time | Event | Action |
|---|---|---|
| 14:00 | 504 errors detected | Alerts triggered |
| 14:10 | Pattern analysis started | Check logs, no obvious cause |
| 14:30 | Network trace performed | Found DNS delays |
| 14:50 | Root cause identified | DNS + timeout combination |
| 15:10 | Fix deployed | DNS caching + timeout tuning |
| 15:40 | Monitoring confirmed | 504s eliminated |
## Symptoms and Detection

### Initial Alerts

**Error pattern:**

```
[ERROR] Request to https://api.partner.com/data failed: 504 Gateway Timeout
[ERROR] Upstream timeout after 5000ms
[ERROR] DNS lookup took 3200ms (64% of the 5000ms timeout!)
```

**Characteristics:**
- ❌ Random occurrence (5% of requests)
- ❌ No pattern by time of day
- ❌ Affects all worker regions equally
- ❌ External API reports no issues
- ✅ Only affects specific external endpoints
## Diagnosis

### Step 1: Network Request Breakdown

**curl timing analysis:**

```bash
# Test the external API with detailed timing
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nStart: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
  -o /dev/null -s https://api.partner.com/data

# Results (intermittent):
# DNS:     3.201s   ❌ Very slow!
# Connect: 3.450s
# TLS:     3.780s
# Start:   4.120s
# Total:   4.823s   # Close to the 5s worker timeout
```

**Fast vs slow requests:**

```
FAST (95% of requests):
DNS: 0.050s → Connect: 0.120s → Total: 0.850s ✅

SLOW (5% of requests):
DNS: 3.200s → Connect: 3.450s → Total: 4.850s ❌ (near timeout)
```

**Root cause:** DNS resolution delays push total request time past the worker timeout.
### Step 2: DNS Investigation

**nslookup testing:**

```bash
# Test DNS resolution repeatedly
time nslookup api.partner.com

# Results (vary run to run):
# Run 1: 0.05s ✅
# Run 2: 3.10s ❌
# Run 3: 0.04s ✅
# Run 4: 2.95s ❌
```

**Pattern:** DNS cache misses add roughly 3s of latency.
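The same bimodal pattern can be reproduced from code rather than the shell. A minimal sketch that samples resolution latency against Cloudflare's DNS-over-HTTPS JSON endpoint (the same one Fix 1 below uses) and prints each sample; it measures DoH HTTP latency rather than the OS resolver `nslookup` hits, so treat it as an approximation. The sample count and threshold are illustrative:

```typescript
// dns-sample.ts - sample DoH resolution latency to expose cache-hit vs cache-miss timing.
// Assumes Node 18+ (global fetch and performance). Hostname and sample count are illustrative.
async function sampleDnsLatency(hostname: string, samples: number = 10): Promise<void> {
  for (let i = 1; i <= samples; i++) {
    const start = performance.now();
    const res = await fetch(
      `https://1.1.1.1/dns-query?name=${hostname}&type=A`,
      { headers: { accept: 'application/dns-json' } }
    );
    await res.json(); // Drain the body so the timing covers the full response
    const elapsed = performance.now() - start;
    console.log(`Run ${i}: ${(elapsed / 1000).toFixed(2)}s ${elapsed > 1000 ? '❌' : '✅'}`);
  }
}

sampleDnsLatency('api.partner.com').catch(console.error);
```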
**dig analysis:**

```bash
# Detailed DNS query
dig api.partner.com +stats

# Results:
# ;; Query time: 3021 msec   <- Slow!
# ;; SERVER: 1.1.1.1#53(1.1.1.1)
# ;; WHEN: Thu Dec 05 14:25:32 UTC 2024
# ;; MSG SIZE rcvd: 84

# Root cause: no DNS caching in the worker
```
### Step 3: Worker Timeout Configuration

**Current worker code:**

```typescript
// worker.ts (BEFORE - too aggressive a timeout)
export default {
  async fetch(request: Request, env: Env) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 5000); // 5s timeout

    try {
      const response = await fetch('https://api.partner.com/data', {
        signal: controller.signal
      });
      return response;
    } catch (error) {
      // 5% of requests time out here
      return new Response('Gateway Timeout', { status: 504 });
    } finally {
      clearTimeout(timeout);
    }
  }
};
```

**Problem:** the 5s timeout doesn't account for DNS delays (up to 3s).
### Step 4: CORS and Headers Check

**Test CORS headers:**

```bash
# Check CORS preflight
curl -I -X OPTIONS https://api.greyhaven.io/proxy \
  -H "Origin: https://app.greyhaven.io" \
  -H "Access-Control-Request-Method: POST"

# Response:
# HTTP/2 200
# access-control-allow-origin: https://app.greyhaven.io ✅
# access-control-allow-methods: GET, POST, PUT, DELETE ✅
# access-control-max-age: 86400 ✅
```

No CORS issues; the problem is isolated to DNS resolution plus the timeout.
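To keep this check from regressing silently, the preflight can be probed from code in CI or a scheduled job. A sketch; the URL and Origin mirror the curl test above, while the pass criteria are my assumptions:

```typescript
// Automated CORS preflight probe. URL and Origin come from the curl test above;
// treating POST as the required method is an assumption for illustration.
async function checkCorsPreflight(): Promise<boolean> {
  const res = await fetch('https://api.greyhaven.io/proxy', {
    method: 'OPTIONS',
    headers: {
      'Origin': 'https://app.greyhaven.io',
      'Access-Control-Request-Method': 'POST'
    }
  });
  const allowOrigin = res.headers.get('access-control-allow-origin');
  const allowMethods = res.headers.get('access-control-allow-methods') ?? '';
  return res.ok
    && allowOrigin === 'https://app.greyhaven.io'
    && allowMethods.includes('POST');
}
```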
## Resolution

### Fix 1: Implement DNS Caching

**Worker with DNS cache:**

```typescript
// worker.ts (AFTER - with DNS caching)
interface DnsCache {
  ip: string;
  timestamp: number;
  ttl: number;
}

// Module-scope map: persists per worker isolate, so warm isolates skip DNS entirely
const DNS_CACHE = new Map<string, DnsCache>();
const DNS_TTL = 60 * 1000; // 60 seconds

async function resolveWithCache(hostname: string): Promise<string> {
  const cached = DNS_CACHE.get(hostname);
  if (cached && Date.now() - cached.timestamp < cached.ttl) {
    // Cache hit - return immediately
    return cached.ip;
  }

  // Cache miss - resolve via DNS-over-HTTPS (JSON API)
  const dnsResponse = await fetch(
    `https://1.1.1.1/dns-query?name=${hostname}&type=A`,
    { headers: { accept: 'application/dns-json' } }
  );
  const dnsData = (await dnsResponse.json()) as {
    Answer?: { type: number; data: string; TTL: number }[];
  };

  // The Answer array can lead with CNAME records; take the first A record (type 1)
  const answer = dnsData.Answer?.find(a => a.type === 1);
  if (!answer) {
    throw new Error(`No A record for ${hostname}`);
  }
  const ip = answer.data;

  // Update cache
  DNS_CACHE.set(hostname, {
    ip,
    timestamp: Date.now(),
    ttl: DNS_TTL
  });

  return ip;
}

export default {
  async fetch(request: Request, env: Env) {
    // Pre-resolve DNS (cached)
    const ip = await resolveWithCache('api.partner.com');

    // Use the IP directly (bypasses DNS on the request path)
    const response = await fetch(`https://${ip}/data`, {
      headers: {
        // Preserve the original hostname for virtual-host routing.
        // Caveat: TLS SNI and certificate validation are driven by the URL's
        // hostname (here, the bare IP), so this only works if the origin
        // accepts such connections; verify before relying on it.
        'Host': 'api.partner.com'
      }
    });
    return response;
  }
};
```

**Result:** DNS resolution <5ms (cache hit) vs ~3000ms (cache miss).
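A variation worth considering: rather than a fixed 60s TTL, honor the TTL the resolver returns, clamped to a sane range. A minimal sketch against the same DoH JSON shape as above; the clamp bounds are illustrative assumptions, not measured values:

```typescript
// Cache entries according to the resolver's own TTL, clamped to [30s, 300s].
// The clamp bounds are illustrative assumptions.
const MIN_TTL_MS = 30 * 1000;
const MAX_TTL_MS = 300 * 1000;

function ttlFromAnswer(answerTtlSeconds: number): number {
  const ttlMs = answerTtlSeconds * 1000;
  return Math.min(Math.max(ttlMs, MIN_TTL_MS), MAX_TTL_MS);
}

// Inside resolveWithCache, after picking the A record:
// DNS_CACHE.set(hostname, {
//   ip: answer.data,
//   timestamp: Date.now(),
//   ttl: ttlFromAnswer(answer.TTL)
// });
```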
### Fix 2: Increase Worker Timeout

**Updated timeout:**

```typescript
// worker.ts - timeout increased to leave room for DNS and slow upstreams
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30000); // 30s timeout

try {
  const response = await fetch('https://api.partner.com/data', {
    signal: controller.signal
  });
  return response;
} finally {
  clearTimeout(timeout);
}
```
**Timeout breakdown:**

```
Old: 5s total
- DNS: 3s (worst case)
- Connect: 1s
- Request: 1s
= Frequent timeouts

New: 30s total
- DNS: <0.01s (cached)
- Connect: 1s
- Request: 2s
- Buffer: ~27s (ample)
= No timeouts
```
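Since this pattern recurs in every handler, it is worth extracting into a helper. A sketch of a reusable wrapper; the name, signature, and default are my own, not from the incident code:

```typescript
// utils/fetch-with-timeout.ts
// Reusable timeout wrapper around fetch. The 30s default mirrors Fix 2.
// Note: any signal passed in options is replaced by the timeout signal.
async function fetchWithTimeout(
  url: string,
  options: RequestInit = {},
  timeoutMs: number = 30000
): Promise<Response> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timeout);
  }
}
```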
### Fix 3: Add Retry Logic with Exponential Backoff

**Retry implementation:**

```typescript
// utils/retry.ts
async function fetchWithRetry(
  url: string,
  options: RequestInit,
  maxRetries: number = 3
): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);

      // Retry on 5xx errors
      if (response.status >= 500 && attempt < maxRetries - 1) {
        await response.body?.cancel(); // Release the connection before retrying
        const delay = Math.pow(2, attempt) * 1000; // Exponential backoff: 1s, 2s, 4s
        await new Promise(resolve => setTimeout(resolve, delay));
        continue;
      }
      return response;
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Exponential backoff: 1s, 2s, 4s
      const delay = Math.pow(2, attempt) * 1000;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error('Max retries exceeded');
}

// Usage:
const response = await fetchWithRetry('https://api.partner.com/data', {
  signal: controller.signal
});
```
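One refinement to consider: with many workers retrying in lockstep, fixed 1s/2s/4s delays can synchronize into a thundering herd. Adding full jitter spreads retries out. A sketch, drop-in for the delay calculation above; applying it here is my suggestion, not part of the deployed fix:

```typescript
// Full-jitter backoff: pick a uniform random delay in [0, 2^attempt * baseMs).
// This is the "full jitter" strategy popularized by the AWS Architecture Blog.
function backoffWithJitter(attempt: number, baseMs: number = 1000): number {
  const cap = Math.pow(2, attempt) * baseMs;
  return Math.random() * cap;
}

// In fetchWithRetry, replace the delay calculation:
// const delay = backoffWithJitter(attempt);
```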
### Fix 4: Circuit Breaker Pattern

**Prevent cascading failures:**

```typescript
// utils/circuit-breaker.ts
class CircuitBreaker {
  private failures: number = 0;
  private lastFailureTime: number = 0;
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      // Check whether enough time has passed to try again
      if (Date.now() - this.lastFailureTime > 60000) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  private onFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= 5) {
      this.state = 'OPEN'; // Trip the circuit after 5 consecutive failures
    }
  }
}

// Usage:
const breaker = new CircuitBreaker();
const response = await breaker.execute(() =>
  fetch('https://api.partner.com/data')
);
```
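The fixes compose naturally: the breaker wraps the retry helper, and each attempt can use a timeout-guarded fetch. A sketch of how they might be wired together in the worker; the module-scope breaker placement and the fallback response are assumptions, not the deployed code:

```typescript
// Module scope: one breaker per upstream host, shared across requests in the isolate
const partnerApiBreaker = new CircuitBreaker();

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    try {
      // Breaker wraps retry; each attempt is a plain fetch here, but could
      // call fetchWithTimeout (sketched under Fix 2) for per-attempt deadlines.
      return await partnerApiBreaker.execute(() =>
        fetchWithRetry('https://api.partner.com/data', {})
      );
    } catch (error) {
      // Breaker open or retries exhausted: fail fast with a 504
      return new Response('Gateway Timeout', { status: 504 });
    }
  }
};
```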
## Results

### Before vs After Metrics
| Metric | Before Fix | After Fix | Improvement |
|---|---|---|---|
| 504 Error Rate | 5% | 0.01% | 99.8% reduction |
| DNS Resolution | 3000ms (worst) | <5ms (cached) | 99.8% faster |
| Total Request Time | 4800ms (p95) | 850ms (p95) | 82% faster |
| Timeout Threshold | 5s (too low) | 30s (appropriate) | +500% headroom |
### Network Diagnostics

**traceroute analysis:**

```bash
# Check the network path to the external API
traceroute api.partner.com

# Results show no packet loss:
#  1  gateway (10.0.0.1)  1.234 ms
#  2  isp-router (100.64.0.1)  5.678 ms
#  ...
# 15  api.partner.com (203.0.113.42)  45.234 ms
```

No packet loss, which confirms DNS was the issue, not the network path.
## Prevention Measures

### 1. Network Monitoring Dashboard

**Metrics to track** (prom-client-style histograms; the import below is an assumption, since the original snippet omits it):

```typescript
import { Histogram } from 'prom-client';

// Track network timing metrics
const networkDnsDuration = new Histogram({
  name: 'network_dns_duration_seconds',
  help: 'DNS resolution time'
});
const networkConnectDuration = new Histogram({
  name: 'network_connect_duration_seconds',
  help: 'TCP connection time'
});
const networkTotalDuration = new Histogram({
  name: 'network_total_duration_seconds',
  help: 'Total request time'
});
```
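To actually populate these histograms, each request needs to be timed and observed. A minimal sketch for the total-duration metric, the only phase measurable from plain fetch (DNS and connect timings need lower-level instrumentation); it assumes the networkTotalDuration histogram defined above:

```typescript
// Time a fetch and record the elapsed seconds in the total-duration histogram.
async function instrumentedFetch(url: string, options: RequestInit = {}): Promise<Response> {
  const end = networkTotalDuration.startTimer();
  try {
    return await fetch(url, options);
  } finally {
    end(); // Observes elapsed seconds into the histogram
  }
}
```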
### 2. Alert Rules

```yaml
# Alert on high DNS resolution time
- alert: SlowDnsResolution
  expr: histogram_quantile(0.95, rate(network_dns_duration_seconds_bucket[5m])) > 1
  for: 5m
  annotations:
    summary: "DNS resolution p95 >1s"

# Alert on gateway timeouts
- alert: HighGatewayTimeouts
  expr: rate(http_requests_total{status="504"}[5m]) / rate(http_requests_total[5m]) > 0.01
  for: 5m
  annotations:
    summary: "504 error rate >1%"
```
### 3. Health Check Endpoints

```typescript
// Route: GET /health/network
async function networkHealth(): Promise<Response> {
  const checks = await Promise.all([
    checkDns('api.partner.com'),
    checkConnectivity('https://api.partner.com/health'),
    checkLatency('https://api.partner.com/ping')
  ]);

  return Response.json({
    status: checks.every(c => c.healthy) ? 'healthy' : 'degraded',
    checks
  });
}
```
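The check helpers are not shown in the original handler. A hypothetical checkDns, reusing the DoH resolution from Fix 1, might look like the sketch below; the 1s threshold and the result shape are assumptions chosen to match the handler above:

```typescript
// Hypothetical helper: verifies a hostname resolves within a latency budget.
// The 1s threshold and the { name, healthy, latencyMs } shape are assumptions.
async function checkDns(
  hostname: string
): Promise<{ name: string; healthy: boolean; latencyMs: number }> {
  const start = Date.now();
  try {
    const res = await fetch(
      `https://1.1.1.1/dns-query?name=${hostname}&type=A`,
      { headers: { accept: 'application/dns-json' } }
    );
    const data = (await res.json()) as { Answer?: { type: number }[] };
    const latencyMs = Date.now() - start;
    const resolved = Boolean(data.Answer?.some(a => a.type === 1));
    return { name: `dns:${hostname}`, healthy: resolved && latencyMs < 1000, latencyMs };
  } catch {
    return { name: `dns:${hostname}`, healthy: false, latencyMs: Date.now() - start };
  }
}
```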
## Lessons Learned

### What Went Well

- ✅ Detailed network timing analysis pinpointed DNS
- ✅ DNS caching eliminated 99.8% of timeouts
- ✅ Circuit breaker prevents cascading failures

### What Could Be Improved

- ❌ No DNS monitoring before the incident
- ❌ Timeout set without accounting for DNS resolution time
- ❌ No retry logic for transient failures

### Key Takeaways
- Always cache DNS in workers (60s TTL minimum)
- Account for DNS time when setting timeouts
- Add retry logic with exponential backoff
- Implement circuit breakers for external dependencies
- Monitor network timing (DNS, connect, TLS, transfer)
## Related Documentation
- Worker Deployment: cloudflare-worker-deployment-failure.md
- Database Issues: planetscale-connection-issues.md
- Performance: performance-degradation-analysis.md
- Runbooks: ../reference/troubleshooting-runbooks.md