Distributed System Network Debugging

Investigating intermittent 504 Gateway Timeout errors between Cloudflare Workers and external APIs, resolved through DNS caching and timeout tuning.

Overview

Incident: 5% of API requests failing with 504 timeouts
Impact: Intermittent failures, no clear pattern, user frustration
Root Cause: DNS resolution delays + worker timeout too aggressive
Resolution: DNS caching + timeout increase (5s → 30s)
Status: Resolved

Incident Timeline

| Time | Event | Action |
|-------|-------|--------|
| 14:00 | 504 errors detected | Alerts triggered |
| 14:10 | Pattern analysis started | Check logs, no obvious cause |
| 14:30 | Network trace performed | Found DNS delays |
| 14:50 | Root cause identified | DNS + timeout combination |
| 15:10 | Fix deployed | DNS caching + timeout tuning |
| 15:40 | Monitoring confirmed | 504s eliminated |

Symptoms and Detection

Initial Alerts

Error Pattern:

[ERROR] Request to https://api.partner.com/data failed: 504 Gateway Timeout
[ERROR] Upstream timeout after 5000ms
[ERROR] DNS lookup took 3200ms (64% of the 5s timeout!)

Characteristics:

  • Random occurrence (5% of requests)
  • No pattern by time of day
  • Affects all worker regions equally
  • External API reports no issues
  • Only affects specific external endpoints

Diagnosis

Step 1: Network Request Breakdown

curl Timing Analysis:

# Test external API with detailed timing
curl -w "\nDNS:     %{time_namelookup}s\nConnect: %{time_connect}s\nTLS:     %{time_appconnect}s\nStart:   %{time_starttransfer}s\nTotal:   %{time_total}s\n" \
  -o /dev/null -s https://api.partner.com/data

# Results (intermittent):
DNS:     3.201s  # ❌ Very slow!
Connect: 3.450s
TLS:     3.780s
Start:   4.120s
Total:   4.823s  # Close to 5s worker timeout

Fast vs Slow Requests:

FAST (95% of requests):
DNS: 0.050s → Connect: 0.120s → Total: 0.850s ✅

SLOW (5% of requests):
DNS: 3.200s → Connect: 3.450s → Total: 4.850s ❌ (near timeout)

Root Cause: DNS resolution delays pushing total request time to or past the 5s worker timeout.


Step 2: DNS Investigation

nslookup Testing:

# Test DNS resolution
time nslookup api.partner.com

# Results (vary):
Run 1: 0.05s  ✅
Run 2: 3.10s  ❌
Run 3: 0.04s  ✅
Run 4: 2.95s  ❌

Pattern: DNS cache miss causes 3s delay

dig Analysis:

# Detailed DNS query
dig api.partner.com +stats

# Results:
;; Query time: 3021 msec          # Slow!
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Dec 05 14:25:32 UTC 2024
;; MSG SIZE  rcvd: 84

# Root cause: No DNS caching in worker
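
The same measurement can be reproduced in code. A minimal sketch below queries Cloudflare's DNS-over-HTTPS JSON endpoint in a loop and summarizes the latency spread; it measures the DoH endpoint rather than the runtime's internal resolver, so treat the numbers as an approximation. The probeDnsLatency name and run count are illustrative.

// dns-latency-probe.ts - sample DoH resolution latency to quantify slow lookups
async function probeDnsLatency(hostname: string, runs: number = 20): Promise<void> {
  const timings: number[] = [];

  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await fetch(`https://1.1.1.1/dns-query?name=${hostname}&type=A`, {
      headers: { 'accept': 'application/dns-json' }
    });
    timings.push(Date.now() - start);
  }

  timings.sort((a, b) => a - b);
  console.log(
    `min=${timings[0]}ms p50=${timings[Math.floor(runs / 2)]}ms max=${timings[runs - 1]}ms`
  );
}

// Usage (from an async context):
await probeDnsLatency('api.partner.com');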

Step 3: Worker Timeout Configuration

Current Worker Code:

// worker.ts (BEFORE - Too aggressive timeout)
export default {
  async fetch(request: Request, env: Env) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 5000); // 5s timeout

    try {
      const response = await fetch('https://api.partner.com/data', {
        signal: controller.signal
      });
      return response;
    } catch (error) {
      // 5% of requests timeout here
      return new Response('Gateway Timeout', { status: 504 });
    } finally {
      clearTimeout(timeout);
    }
  }
};

Problem: 5s timeout doesn't account for DNS delays (up to 3s).


Step 4: CORS and Headers Check

Test CORS Headers:

# Check CORS preflight
curl -I -X OPTIONS https://api.greyhaven.io/proxy \
  -H "Origin: https://app.greyhaven.io" \
  -H "Access-Control-Request-Method: POST"

# Response:
HTTP/2 200
access-control-allow-origin: https://app.greyhaven.io ✅
access-control-allow-methods: GET, POST, PUT, DELETE ✅
access-control-max-age: 86400

No CORS issues - problem isolated to DNS + timeout.


Resolution

Fix 1: Implement DNS Caching

Worker with DNS Cache:

// worker.ts (AFTER - With DNS caching)
interface DnsCache {
  ip: string;
  timestamp: number;
  ttl: number;
}

const DNS_CACHE = new Map<string, DnsCache>();
const DNS_TTL = 60 * 1000; // 60 seconds

async function resolveWithCache(hostname: string): Promise<string> {
  const cached = DNS_CACHE.get(hostname);

  if (cached && Date.now() - cached.timestamp < cached.ttl) {
    // Cache hit - return immediately
    return cached.ip;
  }

  // Cache miss - resolve via Cloudflare's DNS-over-HTTPS JSON API
  const dnsResponse = await fetch(`https://1.1.1.1/dns-query?name=${hostname}&type=A`, {
    headers: { 'accept': 'application/dns-json' }
  });
  const dnsData: any = await dnsResponse.json();
  // Prefer an A record (type 1); the first answer can be a CNAME
  const ip = dnsData.Answer.find((a: { type: number }) => a.type === 1)?.data ?? dnsData.Answer[0].data;

  // Update cache
  DNS_CACHE.set(hostname, {
    ip,
    timestamp: Date.now(),
    ttl: DNS_TTL
  });

  return ip;
}

export default {
  async fetch(request: Request, env: Env) {
    // Pre-resolve DNS (cached)
    const ip = await resolveWithCache('api.partner.com');

    // Use IP directly (bypass DNS)
    const response = await fetch(`https://${ip}/data`, {
      headers: {
        'Host': 'api.partner.com' // Original hostname so the origin routes the request correctly
      }
    });

    return response;
  }
};

Result: DNS resolution <5ms (cache hit) vs 3000ms (cache miss).
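
A quick sanity check is to time two back-to-back resolutions: the first pays the DoH round trip, the second should come out of the in-memory Map almost instantly. A minimal sketch using resolveWithCache from above; note that in Workers the Map lives per isolate, so a freshly started isolate still pays one cache miss.

// Time a cold resolve vs. a cached resolve (illustrative check)
const cold = Date.now();
await resolveWithCache('api.partner.com');
console.log(`cold resolve: ${Date.now() - cold}ms`);    // DoH round trip (tens to thousands of ms)

const warm = Date.now();
await resolveWithCache('api.partner.com');
console.log(`cached resolve: ${Date.now() - warm}ms`);  // Map lookup (~0ms)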


Fix 2: Increase Worker Timeout

Updated Timeout:

// worker.ts - Increased timeout to account for DNS
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30000); // 30s timeout

try {
  const response = await fetch('https://api.partner.com/data', {
    signal: controller.signal
  });
  return response;
} finally {
  clearTimeout(timeout);
}

Timeout Breakdown:

Old: 5s total
- DNS: 3s (worst case)
- Connect: 1s
- Request: 1s
= Frequent timeouts

New: 30s total
- DNS: <0.01s (cached)
- Connect: 1s
- Request: 2s
- Buffer: 27s (ample)
= No timeouts
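
If the runtime supports AbortSignal.timeout() (recent Node releases and many edge runtimes do), the same 30s budget can be expressed without managing a controller and clearTimeout by hand. A brief sketch, assuming that API is available:

// Equivalent timeout without a manual controller, if AbortSignal.timeout() is available
const response = await fetch('https://api.partner.com/data', {
  signal: AbortSignal.timeout(30_000) // aborts the fetch after 30s
});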

Fix 3: Add Retry Logic with Exponential Backoff

Retry Implementation:

// utils/retry.ts
async function fetchWithRetry(
  url: string,
  options: RequestInit,
  maxRetries: number = 3
): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);

      // Retry on 5xx errors
      if (response.status >= 500 && attempt < maxRetries - 1) {
        const delay = Math.pow(2, attempt) * 1000; // Exponential backoff
        await new Promise(resolve => setTimeout(resolve, delay));
        continue;
      }

      return response;
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;

      // Exponential backoff: 1s, 2s, 4s
      const delay = Math.pow(2, attempt) * 1000;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw new Error('Max retries exceeded');
}

// Usage:
const response = await fetchWithRetry('https://api.partner.com/data', {
  signal: controller.signal
});
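
One caveat with the usage above: an already-aborted signal cannot be reused, so if the shared controller fires during the first attempt, every retry fails instantly as well. Below is a sketch of a variant that gives each attempt its own timeout (error-path retries only, for brevity; assumes AbortSignal.timeout() is available, as noted earlier). The fetchWithRetryAndTimeout name is illustrative.

// Variant that creates a fresh timeout signal per attempt so one aborted
// attempt does not poison the retries that follow
async function fetchWithRetryAndTimeout(
  url: string,
  options: RequestInit = {},
  timeoutMs: number = 30_000,
  maxRetries: number = 3
): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      // New signal per attempt; its clock starts now, not at the first attempt
      return await fetch(url, { ...options, signal: AbortSignal.timeout(timeoutMs) });
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Exponential backoff between attempts
      await new Promise(resolve => setTimeout(resolve, Math.pow(2, attempt) * 1000));
    }
  }
  throw new Error('Max retries exceeded');
}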

Fix 4: Circuit Breaker Pattern

Prevent Cascading Failures:

// utils/circuit-breaker.ts
class CircuitBreaker {
  private failures: number = 0;
  private lastFailureTime: number = 0;
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      // Check if enough time passed to try again
      if (Date.now() - this.lastFailureTime > 60000) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  private onFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();

    if (this.failures >= 5) {
      this.state = 'OPEN'; // Trip circuit after 5 failures
    }
  }
}

// Usage:
const breaker = new CircuitBreaker();
const response = await breaker.execute(() =>
  fetch('https://api.partner.com/data')
);
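
The three fixes compose cleanly: resolve through the DNS cache, fetch with per-attempt timeouts and retries, and wrap the whole call in the breaker so a sustained partner outage fails fast instead of burning the full timeout on every request. A sketch combining resolveWithCache and CircuitBreaker from above with the fetchWithRetryAndTimeout variant sketched earlier; the callPartnerApi name is illustrative.

// Compose cached DNS, retries, and the circuit breaker for partner API calls
const partnerBreaker = new CircuitBreaker();

async function callPartnerApi(path: string): Promise<Response> {
  return partnerBreaker.execute(async () => {
    const ip = await resolveWithCache('api.partner.com');   // cached DNS lookup
    return fetchWithRetryAndTimeout(`https://${ip}${path}`, {
      headers: { 'Host': 'api.partner.com' }                // original hostname for the origin
    });
  });
}

// Usage:
const response = await callPartnerApi('/data');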

Results

Before vs After Metrics

| Metric | Before Fix | After Fix | Improvement |
|--------|------------|-----------|-------------|
| 504 Error Rate | 5% | 0.01% | 99.8% reduction |
| DNS Resolution | 3000ms (worst) | <5ms (cached) | 99.8% faster |
| Total Request Time | 4800ms (p95) | 850ms (p95) | 82% faster |
| Timeout Threshold | 5s (too low) | 30s (appropriate) | +500% headroom |

Network Diagnostics

traceroute Analysis:

# Check network path to external API
traceroute api.partner.com

# Results show no packet loss
 1  gateway (10.0.0.1)  1.234 ms
 2  isp-router (100.64.0.1)  5.678 ms
...
15  api.partner.com (203.0.113.42)  45.234 ms

No packet loss - confirms DNS was the issue, not the network path.


Prevention Measures

1. Network Monitoring Dashboard

Metrics to Track:

// Track network timing metrics (prom-client-style histograms; adjust to your metrics library)
import { Histogram } from 'prom-client';

const network_dns_duration = new Histogram({
  name: 'network_dns_duration_seconds',
  help: 'DNS resolution time'
});

const network_connect_duration = new Histogram({
  name: 'network_connect_duration_seconds',
  help: 'TCP connection time'
});

const network_total_duration = new Histogram({
  name: 'network_total_duration_seconds',
  help: 'Total request time'
});
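
These histograms only help if something observes them. Below is a sketch of a timing wrapper that records total outbound request time into the histogram declared above; DNS and connect phases are not separately exposed to fetch callers, so only the total duration is measured here. The timedFetch name is illustrative.

// Record total outbound request time into network_total_duration
async function timedFetch(url: string, options?: RequestInit): Promise<Response> {
  const start = Date.now();
  try {
    return await fetch(url, options);
  } finally {
    network_total_duration.observe((Date.now() - start) / 1000); // histogram is in seconds
  }
}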

2. Alert Rules

# Alert on high DNS resolution time
- alert: SlowDnsResolution
  expr: histogram_quantile(0.95, rate(network_dns_duration_seconds_bucket[5m])) > 1
  for: 5m
  annotations:
    summary: "DNS resolution p95 >1s"

# Alert on gateway timeouts
- alert: HighGatewayTimeouts
  expr: sum(rate(http_requests_total{status="504"}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  annotations:
    summary: "504 error rate >1%"

3. Health Check Endpoints

// Worker route handler for GET /health/network
async function networkHealth(): Promise<Response> {
  const checks = await Promise.all([
    checkDns('api.partner.com'),
    checkConnectivity('https://api.partner.com/health'),
    checkLatency('https://api.partner.com/ping')
  ]);

  return Response.json({
    status: checks.every(c => c.healthy) ? 'healthy' : 'degraded',
    checks
  });
}
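
The three helpers referenced above are not defined in this example. A minimal sketch of what they might look like follows; checkDns, checkConnectivity, checkLatency, and the thresholds used are assumptions, not an existing API.

// Hypothetical health-check helpers used by networkHealth()
interface HealthCheck {
  name: string;
  healthy: boolean;
  durationMs: number;
}

async function timedCheck(name: string, fn: () => Promise<boolean>): Promise<HealthCheck> {
  const start = Date.now();
  let healthy = false;
  try {
    healthy = await fn();
  } catch {
    healthy = false; // any thrown error counts as unhealthy
  }
  return { name, healthy, durationMs: Date.now() - start };
}

const checkDns = (hostname: string) =>
  timedCheck('dns', async () => {
    const res = await fetch(`https://1.1.1.1/dns-query?name=${hostname}&type=A`, {
      headers: { 'accept': 'application/dns-json' }
    });
    return res.ok;
  });

const checkConnectivity = (url: string) =>
  timedCheck('connectivity', async () => (await fetch(url, { method: 'HEAD' })).ok);

const checkLatency = (url: string) =>
  timedCheck('latency', async () => {
    const start = Date.now();
    await fetch(url);
    return Date.now() - start < 2000; // illustrative 2s latency budget
  });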

Lessons Learned

What Went Well

  • Detailed network timing analysis pinpointed DNS
  • DNS caching eliminated 99.8% of timeouts
  • Circuit breaker prevents cascading failures

What Could Be Improved

  • No DNS monitoring before the incident
  • Timeout too aggressive without considering DNS
  • No retry logic for transient failures

Key Takeaways

  1. Always cache DNS in workers (60s TTL minimum)
  2. Account for DNS time when setting timeouts
  3. Add retry logic with exponential backoff
  4. Implement circuit breakers for external dependencies
  5. Monitor network timing (DNS, connect, TLS, transfer)


Return to examples index