Initial commit

Zhongwei Li
2025-11-29 18:29:23 +08:00
commit ebc71f5387
37 changed files with 9382 additions and 0 deletions

View File

@@ -0,0 +1,26 @@
# DevOps Troubleshooting Skill
DevOps and infrastructure troubleshooting for Cloudflare Workers, PlanetScale PostgreSQL, and distributed systems.
## Description
Infrastructure diagnosis, performance analysis, network debugging, and cloud platform troubleshooting.
## What's Included
- **Examples**: Deployment issues, connection errors, performance degradation
- **Reference**: Troubleshooting methodologies, common issues
- **Templates**: Diagnostic reports, fix commands
## Use When
- Deployment issues
- Infrastructure problems
- Connection errors
- Performance degradation
## Related Agents
- `devops-troubleshooter`
**Skill Version**: 1.0

View File

@@ -0,0 +1,68 @@
# DevOps Troubleshooter Examples
Real-world infrastructure troubleshooting scenarios for Grey Haven's Cloudflare Workers + PlanetScale PostgreSQL stack.
## Examples Overview
### 1. Cloudflare Worker Deployment Failure
**File**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
**Scenario**: Worker deployment fails with "Script exceeds size limit" error
**Stack**: Cloudflare Workers, wrangler, webpack bundling
**Impact**: Production deployment blocked, 2-hour downtime
**Resolution**: Bundle size reduction (5.2MB → 450KB) via code splitting and tree shaking
**Lines**: ~450 lines
### 2. PlanetScale Connection Pool Exhaustion
**File**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
**Scenario**: Database connection timeouts causing 503 errors
**Stack**: PlanetScale PostgreSQL, connection pooling, FastAPI
**Impact**: 15% of requests failing, customer complaints
**Resolution**: Connection pool tuning, connection leak fixes
**Lines**: ~430 lines
### 3. Distributed System Network Debugging
**File**: [distributed-system-debugging.md](distributed-system-debugging.md)
**Scenario**: Intermittent 504 Gateway Timeout errors between services
**Stack**: Cloudflare Workers, external APIs, DNS, CORS
**Impact**: 5% of API calls failing, no clear pattern
**Resolution**: DNS caching issue, worker timeout configuration
**Lines**: ~420 lines
### 4. Performance Degradation Analysis
**File**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
**Scenario**: API response times increased from 200ms to 2000ms
**Stack**: Cloudflare Workers, PlanetScale, caching layer
**Impact**: User-facing slowness, poor UX
**Resolution**: N+1 query elimination, caching strategy, index optimization
**Lines**: ~410 lines
---
## Quick Navigation
**By Issue Type**:
- Deployment failures → [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- Database issues → [planetscale-connection-issues.md](planetscale-connection-issues.md)
- Network problems → [distributed-system-debugging.md](distributed-system-debugging.md)
- Performance issues → [performance-degradation-analysis.md](performance-degradation-analysis.md)
**By Stack Component**:
- Cloudflare Workers → Examples 1, 3, 4
- PlanetScale PostgreSQL → Examples 2, 4
- Distributed Systems → Example 3
---
## Related Documentation
- **Reference**: [Reference Index](../reference/INDEX.md) - Runbooks and diagnostic commands
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident templates
- **Main Agent**: [devops-troubleshooter.md](../devops-troubleshooter.md) - DevOps troubleshooter agent
---
Return to [main agent](../devops-troubleshooter.md)

View File

@@ -0,0 +1,466 @@
# Cloudflare Worker Deployment Failure Investigation
Complete troubleshooting workflow for "Script exceeds size limit" deployment failure, resolved through bundle optimization and code splitting.
## Overview
**Incident**: Worker deployment failing with size limit error
**Impact**: Production deployment blocked for 2 hours
**Root Cause**: Bundle size grew from 1.2MB to 5.2MB after adding dependencies
**Resolution**: Bundle optimization (code splitting, tree shaking) reduced size to 450KB (180KB compressed)
**Status**: Resolved
## Incident Timeline
| Time | Event | Action |
|------|-------|--------|
| 14:00 | Deployment initiated via CI/CD | `wrangler deploy` triggered |
| 14:02 | Deployment failed | Error: "Script exceeds 1MB size limit" |
| 14:05 | Investigation started | Check recent code changes |
| 14:15 | Root cause identified | New dependencies increased bundle size |
| 14:30 | Fix implemented | Bundle optimization applied |
| 14:45 | Fix deployed | Successful deployment to production |
| 16:00 | Monitoring complete | Confirmed stable deployment |
---
## Symptoms and Detection
### Initial Error
**Deployment Command**:
```bash
$ wrangler deploy
[ERROR] Script exceeds the size limit (5.2MB > 1MB after compression)
```
**CI/CD Pipeline Failure**:
```yaml
# GitHub Actions output
Step: Deploy to Cloudflare Workers
✓ Build completed (5.2MB bundle)
✗ Deployment failed: Script size exceeds limit
Error: Workers Free plan limit is 1MB compressed
```
**Impact**:
- Production deployment blocked
- New features stuck in staging
- Team unable to deploy hotfixes
---
## Diagnosis
### Step 1: Check Bundle Size
**Before Investigation**:
```bash
# Build the worker locally
npm run build
# Check output size
ls -lh dist/
-rw-r--r-- 1 user staff 5.2M Dec 5 14:10 worker.js
```
**Analyze Bundle Composition**:
```bash
# Use webpack-bundle-analyzer
npm install --save-dev webpack-bundle-analyzer
# Add to webpack.config.js
const BundleAnalyzerPlugin = require('webpack-bundle-analyzer').BundleAnalyzerPlugin;
module.exports = {
  plugins: [
    new BundleAnalyzerPlugin()
  ]
};
# Build and open analyzer
npm run build
# Opens http://127.0.0.1:8888 with visual bundle breakdown
```
**Bundle Analyzer Findings**:
```
Total Size: 5.2MB
Breakdown:
- @anthropic-ai/sdk: 2.1MB (40%)
- aws-sdk: 1.8MB (35%)
- lodash: 800KB (15%)
- moment: 300KB (6%)
- application code: 200KB (4%)
```
**Red Flags**:
1. Full `aws-sdk` imported (only needed S3)
2. Entire `lodash` library (only using 3 functions)
3. `moment` included (native Date API would suffice)
4. Large AI SDK (only using text generation)
---
### Step 2: Identify Recent Changes
**Git Diff**:
```bash
# Check what changed in last deploy
git diff HEAD~1 HEAD -- src/
# Key changes:
+ import { Anthropic } from '@anthropic-ai/sdk';
+ import AWS from 'aws-sdk';
+ import _ from 'lodash';
+ import moment from 'moment';
```
**PR Analysis**:
```
PR #234: Add AI content generation feature
- Added @anthropic-ai/sdk (full SDK)
- Added AWS S3 integration (full aws-sdk)
- Used lodash for data manipulation
- Used moment for date formatting
Result: Bundle size increased by 4MB
```
---
### Step 3: Cloudflare Worker Size Limits
**Plan Limits**:
```
Workers Free: 1MB compressed
Workers Paid: 10MB compressed
Current plan: Workers Free
Current size: 5.2MB (over limit)
Options:
1. Upgrade to Workers Paid ($5/month)
2. Reduce bundle size to <1MB
3. Split into multiple workers
```
**Decision**: Reduce bundle size (no budget for upgrade)
---
## Resolution
### Fix 1: Tree Shaking with Named Imports
**Before** (imports entire libraries):
```typescript
// ❌ BAD: Imports full library
import _ from 'lodash';
import moment from 'moment';
import AWS from 'aws-sdk';
// Usage:
const unique = _.uniq(array);
const date = moment().format('YYYY-MM-DD');
const s3 = new AWS.S3();
```
**After** (imports only needed functions):
```typescript
// ✅ GOOD: Named imports enable tree shaking
import { uniq, map, filter } from 'lodash-es';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
// ✅ BETTER: Native alternatives
const unique = [...new Set(array)];
const date = new Date().toISOString().split('T')[0];
// S3 client (v3 - modular)
const s3 = new S3Client({ region: 'us-east-1' });
```
**Size Reduction**:
```
Before:
- lodash: 800KB → lodash-es tree-shaken: 50KB (94% reduction)
- moment: 300KB → native Date: 0KB (100% reduction)
- aws-sdk: 1.8MB → @aws-sdk/client-s3: 200KB (89% reduction)
```
---
### Fix 2: External Dependencies (Don't Bundle Large SDKs)
**Before**:
```typescript
// worker.ts - bundled @anthropic-ai/sdk (2.1MB)
import { Anthropic } from '@anthropic-ai/sdk';
const client = new Anthropic({
  apiKey: env.ANTHROPIC_API_KEY
});
```
**After** (use fetch directly):
```typescript
// worker.ts - use native fetch (0KB)
async function callAnthropic(prompt: string, env: Env) {
  const response = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': env.ANTHROPIC_API_KEY,
      'anthropic-version': '2023-06-01'
    },
    body: JSON.stringify({
      model: 'claude-3-sonnet-20240229',
      max_tokens: 1024,
      messages: [
        { role: 'user', content: prompt }
      ]
    })
  });
  return response.json();
}
```
**Size Reduction**:
```
Before: @anthropic-ai/sdk: 2.1MB
After: Native fetch: 0KB
Savings: 2.1MB (100% reduction)
```
---
### Fix 3: Code Splitting (Async Imports)
**Before** (everything bundled):
```typescript
// worker.ts
import { expensiveFunction } from './expensive-module';
export default {
  async fetch(request: Request, env: Env) {
    // Even if not used, expensive-module is in bundle
    if (request.url.includes('/special')) {
      return expensiveFunction(request);
    }
    return new Response('OK');
  }
};
```
**After** (lazy load):
```typescript
// worker.ts
export default {
  async fetch(request: Request, env: Env) {
    if (request.url.includes('/special')) {
      // Only load when needed (separate chunk)
      const { expensiveFunction } = await import('./expensive-module');
      return expensiveFunction(request);
    }
    return new Response('OK');
  }
};
```
**Size Reduction**:
```
Main bundle: 1.8MB → 500KB (72% reduction)
expensive-module chunk: Loaded on-demand (lazy)
```
---
### Fix 4: Webpack Configuration Optimization
**Updated webpack.config.js**:
```javascript
const webpack = require('webpack');
const path = require('path');
module.exports = {
  entry: './src/worker.ts',
  target: 'webworker',
  mode: 'production',
  optimization: {
    minimize: true,
    usedExports: true, // Tree shaking
    sideEffects: false,
  },
  resolve: {
    extensions: ['.ts', '.js'],
    alias: {
      // Replace heavy libraries with lighter alternatives
      'moment': 'date-fns',
      'lodash': 'lodash-es'
    }
  },
  module: {
    rules: [
      {
        test: /\.ts$/,
        use: {
          loader: 'ts-loader',
          options: {
            transpileOnly: true,
            compilerOptions: {
              module: 'esnext', // Enable tree shaking
              moduleResolution: 'node'
            }
          }
        },
        exclude: /node_modules/
      }
    ]
  },
  plugins: [
    new webpack.DefinePlugin({
      'process.env.NODE_ENV': JSON.stringify('production')
    })
  ],
  output: {
    filename: 'worker.js',
    path: path.resolve(__dirname, 'dist'),
    libraryTarget: 'commonjs2'
  }
};
```
---
## Results
### Bundle Size Comparison
| Category | Before | After | Reduction |
|----------|--------|-------|-----------|
| **@anthropic-ai/sdk** | 2.1MB | 0KB (fetch) | -100% |
| **aws-sdk** | 1.8MB | 200KB (v3) | -89% |
| **lodash** | 800KB | 50KB (tree-shaken) | -94% |
| **moment** | 300KB | 0KB (native Date) | -100% |
| **Application code** | 200KB | 200KB | 0% |
| **TOTAL** | **5.2MB** | **450KB** | **-91%** |
**Compressed Size**:
- Before: 5.2MB → 1.8MB compressed (over 1MB limit)
- After: 450KB → 180KB compressed (under 1MB limit)
---
### Deployment Verification
**Successful Deployment**:
```bash
$ wrangler deploy
✔ Building...
✔ Validating...
Bundle size: 450KB (180KB compressed)
✔ Uploading...
✔ Deployed to production
Production URL: https://api.greyhaven.io
Worker ID: worker-abc123
```
**Load Testing**:
```bash
# Before optimization (would fail deployment)
# Bundle: 5.2MB, deploy: FAIL
# After optimization
$ ab -n 1000 -c 10 https://api.greyhaven.io/
Requests per second: 1250 [#/sec]
Time per request: 8ms [mean]
Successful requests: 1000 (100%)
Bundle size: 450KB ✓
```
---
## Prevention Measures
### 1. CI/CD Bundle Size Check
```yaml
# .github/workflows/deploy.yml - Add size validation
steps:
  - run: npm ci && npm run build
  - name: Check bundle size
    run: |
      # GNU stat on Linux runners (use `stat -f%z` on macOS)
      SIZE_MB=$(stat -c%s dist/worker.js | awk '{print $1/1048576}')
      if (( $(echo "$SIZE_MB > 1.0" | bc -l) )); then
        echo "❌ Bundle exceeds 1MB"; exit 1
      fi
  - run: npx wrangler deploy
```
### 2. Pre-commit Hook
```bash
# .git/hooks/pre-commit (macOS `stat`; use `stat -c%s` on Linux)
SIZE_MB=$(stat -f%z dist/worker.js | awk '{print $1/1048576}')
if (( $(echo "$SIZE_MB > 1.0" | bc -l) )); then
  echo "❌ Bundle >1MB"; exit 1
fi
```
### 3. PR Template
```markdown
## Bundle Impact
- [ ] Bundle size <800KB
- [ ] Tree shaking verified
Size: [Before → After]
```
### 4. Automated Analysis
```json
{
  "scripts": {
    "analyze": "webpack --profile --json > stats.json && webpack-bundle-analyzer stats.json"
  }
}
```
---
## Lessons Learned
### What Went Well
✅ Identified root cause quickly (bundle analyzer)
✅ Multiple optimization strategies applied
✅ Achieved 91% bundle size reduction
✅ Added automated checks to prevent recurrence
### What Could Be Improved
❌ No bundle size monitoring before incident
❌ Dependencies added without size consideration
❌ No pre-commit checks for bundle size
### Key Takeaways
1. **Always check bundle size** when adding dependencies
2. **Use native APIs** instead of libraries when possible
3. **Tree shaking** requires named imports (not default)
4. **Code splitting** for rarely-used features
5. **External API calls** are lighter than bundling SDKs
---
## Related Documentation
- **PlanetScale Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
---
Return to [examples index](INDEX.md)

View File

@@ -0,0 +1,477 @@
# Distributed System Network Debugging
Investigating intermittent 504 Gateway Timeout errors between Cloudflare Workers and external APIs, resolved through DNS caching and timeout tuning.
## Overview
**Incident**: 5% of API requests failing with 504 timeouts
**Impact**: Intermittent failures, no clear pattern, user frustration
**Root Cause**: DNS resolution delays + worker timeout too aggressive
**Resolution**: DNS caching + timeout increase (5s→30s)
**Status**: Resolved
## Incident Timeline
| Time | Event | Action |
|------|-------|--------|
| 14:00 | 504 errors detected | Alerts triggered |
| 14:10 | Pattern analysis started | Check logs, no obvious cause |
| 14:30 | Network trace performed | Found DNS delays |
| 14:50 | Root cause identified | DNS + timeout combination |
| 15:10 | Fix deployed | DNS caching + timeout tuning |
| 15:40 | Monitoring confirmed | 504s eliminated |
---
## Symptoms and Detection
### Initial Alerts
**Error Pattern**:
```
[ERROR] Request to https://api.partner.com/data failed: 504 Gateway Timeout
[ERROR] Upstream timeout after 5000ms
[ERROR] DNS lookup took 3200ms (64% of the 5s timeout!)
```
**Characteristics**:
- ❌ Random occurrence (5% of requests)
- ❌ No pattern by time of day
- ❌ Affects all worker regions equally
- ❌ External API reports no issues
- ✅ Only affects specific external endpoints
---
## Diagnosis
### Step 1: Network Request Breakdown
**curl Timing Analysis**:
```bash
# Test external API with detailed timing
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nStart: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
-o /dev/null -s https://api.partner.com/data
# Results (intermittent):
DNS: 3.201s # ❌ Very slow!
Connect: 3.450s
TLS: 3.780s
Start: 4.120s
Total: 4.823s # Close to 5s worker timeout
```
**Fast vs Slow Requests**:
```
FAST (95% of requests):
DNS: 0.050s → Connect: 0.120s → Total: 0.850s ✅
SLOW (5% of requests):
DNS: 3.200s → Connect: 3.450s → Total: 4.850s ❌ (near timeout)
```
**Root Cause**: DNS resolution delays causing total request time to exceed worker timeout.
---
### Step 2: DNS Investigation
**nslookup Testing**:
```bash
# Test DNS resolution
time nslookup api.partner.com
# Results (vary):
Run 1: 0.05s ✅
Run 2: 3.10s ❌
Run 3: 0.04s ✅
Run 4: 2.95s ❌
Pattern: DNS cache miss causes 3s delay
```
**dig Analysis**:
```bash
# Detailed DNS query
dig api.partner.com +stats
# Results:
;; Query time: 3021 msec # Slow!
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Dec 05 14:25:32 UTC 2024
;; MSG SIZE rcvd: 84
# Root cause: No DNS caching in worker
```
---
### Step 3: Worker Timeout Configuration
**Current Worker Code**:
```typescript
// worker.ts (BEFORE - Too aggressive timeout)
export default {
  async fetch(request: Request, env: Env) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 5000); // 5s timeout
    try {
      const response = await fetch('https://api.partner.com/data', {
        signal: controller.signal
      });
      return response;
    } catch (error) {
      // 5% of requests timeout here
      return new Response('Gateway Timeout', { status: 504 });
    } finally {
      clearTimeout(timeout);
    }
  }
};
```
**Problem**: 5s timeout doesn't account for DNS delays (up to 3s).
---
### Step 4: CORS and Headers Check
**Test CORS Headers**:
```bash
# Check CORS preflight
curl -I -X OPTIONS https://api.greyhaven.io/proxy \
-H "Origin: https://app.greyhaven.io" \
-H "Access-Control-Request-Method: POST"
# Response:
HTTP/2 200
access-control-allow-origin: https://app.greyhaven.io ✅
access-control-allow-methods: GET, POST, PUT, DELETE ✅
access-control-max-age: 86400
```
**No CORS issues** - problem isolated to DNS + timeout.
---
## Resolution
### Fix 1: Implement DNS Caching
**Worker with DNS Cache**:
```typescript
// worker.ts (AFTER - With DNS caching)
interface DnsCache {
  ip: string;
  timestamp: number;
  ttl: number;
}

const DNS_CACHE = new Map<string, DnsCache>();
const DNS_TTL = 60 * 1000; // 60 seconds

async function resolveWithCache(hostname: string): Promise<string> {
  const cached = DNS_CACHE.get(hostname);
  if (cached && Date.now() - cached.timestamp < cached.ttl) {
    // Cache hit - return immediately
    return cached.ip;
  }
  // Cache miss - resolve DNS
  const dnsResponse = await fetch(`https://1.1.1.1/dns-query?name=${hostname}`, {
    headers: { 'accept': 'application/dns-json' }
  });
  const dnsData = await dnsResponse.json();
  const ip = dnsData.Answer[0].data;
  // Update cache
  DNS_CACHE.set(hostname, {
    ip,
    timestamp: Date.now(),
    ttl: DNS_TTL
  });
  return ip;
}

export default {
  async fetch(request: Request, env: Env) {
    // Pre-resolve DNS (cached)
    const ip = await resolveWithCache('api.partner.com');
    // Use IP directly (bypass DNS)
    const response = await fetch(`https://${ip}/data`, {
      headers: {
        'Host': 'api.partner.com' // Required for SNI
      }
    });
    return response;
  }
};
```
**Result**: DNS resolution <5ms (cache hit) vs 3000ms (cache miss).
---
### Fix 2: Increase Worker Timeout
**Updated Timeout**:
```typescript
// worker.ts - Increased timeout to account for DNS
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30000); // 30s timeout
try {
  const response = await fetch('https://api.partner.com/data', {
    signal: controller.signal
  });
  return response;
} finally {
  clearTimeout(timeout);
}
```
**Timeout Breakdown**:
```
Old: 5s total
- DNS: 3s (worst case)
- Connect: 1s
- Request: 1s
= Frequent timeouts
New: 30s total
- DNS: <0.01s (cached)
- Connect: 1s
- Request: 2s
- Buffer: 27s (ample)
= No timeouts
```
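The same pattern generalizes into a small reusable helper so every outbound call gets the same bounded budget. A minimal sketch (the helper name and default budget are illustrative, not part of the incident fix):
```typescript
// fetchWithTimeout - abort any outbound fetch after timeoutMs (sketch).
// The 30s default mirrors the tuned worker timeout above.
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs: number = 30000
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer); // always clear, whether the fetch resolved or aborted
  }
}
```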
---
### Fix 3: Add Retry Logic with Exponential Backoff
**Retry Implementation**:
```typescript
// utils/retry.ts
async function fetchWithRetry(
  url: string,
  options: RequestInit,
  maxRetries: number = 3
): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);
      // Retry on 5xx errors
      if (response.status >= 500 && attempt < maxRetries - 1) {
        const delay = Math.pow(2, attempt) * 1000; // Exponential backoff
        await new Promise(resolve => setTimeout(resolve, delay));
        continue;
      }
      return response;
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Exponential backoff: 1s, 2s, 4s
      const delay = Math.pow(2, attempt) * 1000;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error('Max retries exceeded');
}

// Usage:
const response = await fetchWithRetry('https://api.partner.com/data', {
  signal: controller.signal
});
```
---
### Fix 4: Circuit Breaker Pattern
**Prevent Cascading Failures**:
```typescript
// utils/circuit-breaker.ts
class CircuitBreaker {
  private failures: number = 0;
  private lastFailureTime: number = 0;
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      // Check if enough time passed to try again
      if (Date.now() - this.lastFailureTime > 60000) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  private onFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= 5) {
      this.state = 'OPEN'; // Trip circuit after 5 failures
    }
  }
}

// Usage:
const breaker = new CircuitBreaker();
const response = await breaker.execute(() =>
  fetch('https://api.partner.com/data')
);
```
---
## Results
### Before vs After Metrics
| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **504 Error Rate** | 5% | 0.01% | **99.8% reduction** |
| **DNS Resolution** | 3000ms (worst) | <5ms (cached) | **99.8% faster** |
| **Total Request Time** | 4800ms (p95) | 850ms (p95) | **82% faster** |
| **Timeout Threshold** | 5s (too low) | 30s (appropriate) | +500% headroom |
---
### Network Diagnostics
**traceroute Analysis**:
```bash
# Check network path to external API
traceroute api.partner.com
# Results show no packet loss
1 gateway (10.0.0.1) 1.234 ms
2 isp-router (100.64.0.1) 5.678 ms
...
15 api.partner.com (203.0.113.42) 45.234 ms
```
**No packet loss** - confirms DNS was the issue, not network.
---
## Prevention Measures
### 1. Network Monitoring Dashboard
**Metrics to Track**:
```typescript
// Track network timing metrics (prom-client Histogram)
import { Histogram } from 'prom-client';

const network_dns_duration = new Histogram({
  name: 'network_dns_duration_seconds',
  help: 'DNS resolution time'
});
const network_connect_duration = new Histogram({
  name: 'network_connect_duration_seconds',
  help: 'TCP connection time'
});
const network_total_duration = new Histogram({
  name: 'network_total_duration_seconds',
  help: 'Total request time'
});
```
### 2. Alert Rules
```yaml
# Alert on high DNS resolution time
- alert: SlowDnsResolution
  expr: histogram_quantile(0.95, network_dns_duration_seconds) > 1
  for: 5m
  annotations:
    summary: "DNS resolution p95 >1s"
# Alert on gateway timeouts
- alert: HighGatewayTimeouts
  expr: rate(http_requests_total{status="504"}[5m]) > 0.01
  for: 5m
  annotations:
    summary: "504 error rate >1%"
```
### 3. Health Check Endpoints
```typescript
// Worker route handler for GET /health/network
async function networkHealth() {
  const checks = await Promise.all([
    checkDns('api.partner.com'),
    checkConnectivity('https://api.partner.com/health'),
    checkLatency('https://api.partner.com/ping')
  ]);
  return {
    status: checks.every(c => c.healthy) ? 'healthy' : 'degraded',
    checks
  };
}
```
---
## Lessons Learned
### What Went Well
✅ Detailed network timing analysis pinpointed DNS
✅ DNS caching eliminated 99.8% of timeouts
✅ Circuit breaker prevents cascading failures
### What Could Be Improved
❌ No DNS monitoring before incident
❌ Timeout too aggressive without considering DNS
❌ No retry logic for transient failures
### Key Takeaways
1. **Always cache DNS** in workers (60s TTL minimum)
2. **Account for DNS time** when setting timeouts
3. **Add retry logic** with exponential backoff
4. **Implement circuit breakers** for external dependencies
5. **Monitor network timing** (DNS, connect, TLS, transfer)
---
## Related Documentation
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
---
Return to [examples index](INDEX.md)

View File

@@ -0,0 +1,413 @@
# Performance Degradation Analysis
Investigating API response time increase from 200ms to 2000ms, resolved through N+1 query elimination, caching, and index optimization.
## Overview
**Incident**: API response times degraded 10x (200ms → 2000ms)
**Impact**: User-facing slowness, timeout errors, poor UX
**Root Cause**: N+1 query problem + missing indexes + no caching
**Resolution**: Query optimization + indexes + Redis caching
**Status**: Resolved
## Incident Timeline
| Time | Event | Action |
|------|-------|--------|
| 08:00 | Slowness reports from users | Support tickets opened |
| 08:15 | Monitoring confirms degradation | p95 latency 2000ms |
| 08:30 | Database profiling started | Slow query log analysis |
| 09:00 | N+1 query identified | Found 100+ queries per request |
| 09:30 | Fix implemented | Eager loading + indexes |
| 10:00 | Caching added | Redis for frequently accessed data |
| 10:30 | Deployment complete | Latency back to 200ms |
---
## Symptoms and Detection
### Initial Metrics
**Latency Increase**:
```
p50: 180ms → 1800ms (+900% slower)
p95: 220ms → 2100ms (+854% slower)
p99: 450ms → 3500ms (+677% slower)
Requests timing out: 5% (>3s timeout)
```
**User Impact**:
- Page load times: 5-10 seconds
- API timeouts: 5% of requests
- Support tickets: 47 in 1 hour
- User complaints: "App is unusable"
---
## Diagnosis
### Step 1: Application Performance Monitoring
**Wrangler Tail Analysis**:
```bash
# Monitor worker requests in real-time
wrangler tail --format pretty
# Output shows slow requests:
[2024-12-05 08:20:15] GET /api/orders - 2145ms
└─ database_query: 1950ms (90% of total time!)
└─ json_serialization: 150ms
└─ response_headers: 45ms
# Red flag: Database taking 90% of request time
```
---
### Step 2: Database Query Analysis
**PlanetScale Slow Query Log**:
```bash
# Enable and check slow queries
pscale database insights greyhaven-db main --slow-queries
# Results:
Query: SELECT * FROM order_items WHERE order_id = ?
Calls: 157 times per request # ❌ N+1 query problem!
Avg time: 12ms per query
Total: 1884ms per request (12ms × 157)
```
**N+1 Query Pattern Identified**:
```python
# api/orders.py (BEFORE - N+1 Problem)
@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # Query 1: Get all orders for user
    orders = session.exec(
        select(Order).where(Order.user_id == user_id)
    ).all()  # Returns 157 orders

    # Query 2-158: Get items for EACH order (N+1!)
    for order in orders:
        order.items = session.exec(
            select(OrderItem).where(OrderItem.order_id == order.id)
        ).all()  # 157 additional queries!

    return orders

# Total queries: 1 + 157 = 158 queries per request
# Total time: 10ms + (157 × 12ms) = 1894ms
```
---
### Step 3: Database Index Analysis
**Missing Indexes**:
```sql
-- Check existing indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'order_items';
-- Results:
-- Primary key on id (exists) ✅
-- NO index on order_id ❌ (needed for WHERE clause)
-- NO index on user_id ❌ (needed for joins)
-- Explain plan shows full table scan
EXPLAIN ANALYZE
SELECT * FROM order_items WHERE order_id = 123;
-- Result:
Seq Scan on order_items (cost=0.00..1500.00 rows=1 width=100) (actual time=12.345..12.345 rows=5 loops=157)
Filter: (order_id = 123)
Rows Removed by Filter: 10000
-- Full table scan on 10K rows, 157 times = extremely slow!
```
---
## Resolution
### Fix 1: Eliminate N+1 with Eager Loading
**After - Single Query with Join**:
```python
# api/orders.py (AFTER - Eager Loading)
from sqlmodel import select
from sqlalchemy.orm import selectinload

@router.get("/orders/{user_id}")
async def get_user_orders(user_id: int, session: Session = Depends(get_session)):
    # ✅ Two queries total: orders, then all items via eager loading
    statement = (
        select(Order)
        .where(Order.user_id == user_id)
        .options(selectinload(Order.items))  # Eager load items
    )
    orders = session.exec(statement).all()
    return orders

# Total queries: 2 (1 for orders, 1 for all items)
# Total time: 10ms + 25ms = 35ms (98% faster!)
```
**Query Comparison**:
```
BEFORE (N+1):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2-158: SELECT * FROM order_items WHERE order_id = ? (×157, 12ms each)
- Total: 1894ms
AFTER (Eager Loading):
- Query 1: SELECT * FROM orders WHERE user_id = 1 (10ms)
- Query 2: SELECT * FROM order_items WHERE order_id IN (?, ?, ..., ?) (25ms)
- Total: 35ms (54x faster!)
```
---
### Fix 2: Add Database Indexes
**Create Indexes**:
```sql
-- Index on order_id for faster lookups
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
-- Index on user_id for user queries
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Index on created_at for time-based queries
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Composite index for common filters
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
```
**Before/After EXPLAIN**:
```sql
-- BEFORE (no index):
EXPLAIN ANALYZE SELECT * FROM order_items WHERE order_id = 123;
Seq Scan (cost=0.00..1500.00) (actual time=12.345ms)
-- AFTER (with index):
Index Scan using idx_order_items_order_id (cost=0.00..8.50) (actual time=0.045ms)
-- 270x faster (12.345ms → 0.045ms)
```
---
### Fix 3: Implement Redis Caching
**Cache Frequent Queries**:
```typescript
// cache.ts - Redis caching layer
import { Redis } from '@upstash/redis';

const redis = new Redis({
  url: env.UPSTASH_REDIS_URL,
  token: env.UPSTASH_REDIS_TOKEN
});

async function getCachedOrders(userId: number) {
  const cacheKey = `orders:user:${userId}`;

  // Check cache (@upstash/redis deserializes stored JSON automatically)
  const cached = await redis.get<Order[]>(cacheKey);
  if (cached) {
    return cached; // Cache hit
  }

  // Cache miss - query database
  const orders = await fetchOrdersFromDb(userId);

  // Store in cache (5 minute TTL); the client serializes to JSON
  await redis.set(cacheKey, orders, { ex: 300 });
  return orders;
}
```
**Cache Hit Rates**:
```
Requests: 10,000
Cache hits: 8,500 (85%)
Cache misses: 1,500 (15%)
Avg latency with cache:
- Cache hit: 5ms (Redis)
- Cache miss: 35ms (database)
- Overall: (0.85 × 5) + (0.15 × 35) = 9.5ms
```
---
### Fix 4: Database Connection Pooling
**Optimize Pool Settings**:
```python
# database.py - Tuned for performance
engine = create_engine(
    database_url,
    pool_size=50,        # Increased from 20
    max_overflow=20,
    pool_recycle=1800,   # 30 minutes
    pool_pre_ping=True,  # Health check
    echo=False,
    connect_args={
        "server_settings": {
            "statement_timeout": "30000",  # 30s query timeout
            "idle_in_transaction_session_timeout": "60000"  # 60s idle
        }
    }
)
```
---
## Results
### Performance Metrics
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **p50 Latency** | 1800ms | 180ms | **90% faster** |
| **p95 Latency** | 2100ms | 220ms | **90% faster** |
| **p99 Latency** | 3500ms | 450ms | **87% faster** |
| **Database Queries** | 158/request | 2/request | **99% reduction** |
| **Cache Hit Rate** | 0% | 85% | **85% hits** |
| **Timeout Errors** | 5% | 0% | **100% eliminated** |
### Cost Impact
**Database Query Reduction**:
```
Before: 158 queries × 100 req/s = 15,800 queries/s
After: 2 queries × 100 req/s = 200 queries/s
Reduction: 98.7% fewer queries
Cost savings: $450/month (reduced database tier)
```
---
## Prevention Measures
### 1. Query Performance Monitoring
**Slow Query Alert**:
```yaml
# Alert on slow database queries
- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, rate(database_query_duration_seconds[5m])) > 0.1
  for: 5m
  annotations:
    summary: "Database queries p95 >100ms"
```
### 2. N+1 Query Detection
**Test for N+1 Patterns**:
```python
# tests/test_n_plus_one.py
import pytest
from sqlalchemy import event
from database import engine

@pytest.fixture
def query_counter():
    """Count SQL queries during test"""
    queries = []

    def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        queries.append(statement)

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    yield queries
    event.remove(engine, "before_cursor_execute", before_cursor_execute)

def test_get_user_orders_no_n_plus_one(query_counter):
    """Verify endpoint doesn't have N+1 queries"""
    get_user_orders(user_id=1)
    # Should be 2 queries max (orders + items)
    assert len(query_counter) <= 2, f"N+1 detected: {len(query_counter)} queries"
```
### 3. Database Index Coverage
```sql
-- Check for missing indexes
SELECT
schemaname,
tablename,
attname,
n_distinct,
correlation
FROM pg_stats
WHERE schemaname = 'public'
AND n_distinct > 100 -- Cardinality suggests index needed
ORDER BY tablename, attname;
```
### 4. Performance Budget
```typescript
// Set performance budgets
const PERFORMANCE_BUDGETS = {
  api_latency_p95: 500, // ms
  database_queries_per_request: 5,
  cache_hit_rate_min: 0.70, // 70%
};

// CI/CD check
if (metrics.api_latency_p95 > PERFORMANCE_BUDGETS.api_latency_p95) {
  throw new Error(`Performance budget exceeded: ${metrics.api_latency_p95}ms > 500ms`);
}
```
---
## Lessons Learned
### What Went Well
✅ Slow query log pinpointed N+1 problem
✅ Eager loading eliminated 99% of queries
✅ Indexes provided 270x speedup
✅ Caching reduced load by 85%
### What Could Be Improved
❌ No N+1 query detection before production
❌ Missing indexes not caught in code review
❌ No caching layer initially
❌ No query performance monitoring
### Key Takeaways
1. **Always use eager loading** for associations
2. **Add indexes** for all foreign keys and WHERE clauses
3. **Implement caching** for frequently accessed data
4. **Monitor query counts** per request (alert on >10)
5. **Test for N+1** in CI/CD pipeline
---
## Related Documentation
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Database Issues**: [planetscale-connection-issues.md](planetscale-connection-issues.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
---
Return to [examples index](INDEX.md)

View File

@@ -0,0 +1,499 @@
# PlanetScale Connection Pool Exhaustion
Complete investigation of database connection pool exhaustion causing 503 errors, resolved through connection pool tuning and leak fixes.
## Overview
**Incident**: Database connection timeouts causing 15% request failure rate
**Impact**: Customer-facing 503 errors, support tickets increasing
**Root Cause**: Connection pool too small + unclosed connections in error paths
**Resolution**: Pool tuning (20→50) + connection leak fixes
**Status**: Resolved
## Incident Timeline
| Time | Event | Action |
|------|-------|--------|
| 09:30 | Alerts: High 503 error rate | Oncall paged |
| 09:35 | Investigation started | Check logs, metrics |
| 09:45 | Database connections at 100% | Identified pool exhaustion |
| 10:00 | Temporary fix: restart service | Bought time for root cause |
| 10:30 | Code analysis complete | Found connection leaks |
| 11:00 | Fix deployed (pool + leaks) | Production deployment |
| 11:30 | Monitoring confirmed stable | Incident resolved |
---
## Symptoms and Detection
### Initial Alerts
**Prometheus Alert**:
```yaml
# Alert: HighErrorRate
expr: rate(http_requests_total{status="503"}[5m]) > 0.05
for: 5m
annotations:
  summary: "503 error rate >5% for 5 minutes"
  description: "Current rate: {{ $value | humanizePercentage }}"
```
**Error Logs**:
```
[ERROR] Database query failed: connection timeout
[ERROR] Pool exhausted, waiting for available connection
[ERROR] Request timeout after 30s waiting for DB connection
```
**Impact Metrics**:
```
Error rate: 15% (150 failures per 1000 requests)
User complaints: 23 support tickets in 30 minutes
Failed transactions: ~$15,000 in abandoned carts
```
---
## Diagnosis
### Step 1: Check Connection Pool Status
**Query PlanetScale**:
```bash
# Connect to database
pscale shell greyhaven-db main
# Check active connections
SELECT
COUNT(*) as active_connections,
MAX(pg_stat_activity.query_start) as oldest_query
FROM pg_stat_activity
WHERE state = 'active';
# Result:
# active_connections: 98
# oldest_query: 2024-12-05 09:15:23 (15 minutes ago!)
```
**Check Application Pool**:
```python
# In FastAPI app - add diagnostic endpoint
from sqlmodel import Session
from database import engine

@app.get("/pool-status")
def pool_status():
    pool = engine.pool
    return {
        "size": pool.size(),
        "checked_out": pool.checkedout(),
        "overflow": pool.overflow(),
        "timeout": pool._timeout,
        "max_overflow": pool._max_overflow
    }

# Response:
# {
#     "size": 20,
#     "checked_out": 20,   <- Pool exhausted!
#     "overflow": 0,
#     "timeout": 30,
#     "max_overflow": 10
# }
```
**Red Flags**:
- ✅ Pool at 100% capacity (20/20 connections checked out)
- ✅ No overflow connections being used (0/10)
- ✅ Connections held for >15 minutes
- ✅ New requests timing out waiting for connections
---
### Step 2: Identify Connection Leaks
**Code Review - Found Vulnerable Pattern**:
```python
# api/orders.py (BEFORE - LEAK)
from fastapi import APIRouter
from sqlmodel import Session, select
from database import engine

router = APIRouter()

@router.post("/orders")
async def create_order(order_data: OrderCreate):
    # ❌ LEAK: Session never closed on exception
    session = Session(engine)

    # Create order
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # If exception here, session never closed!
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    # session.close() never reached
    return order
```
**How Leak Occurs**:
1. Request creates session (acquires connection from pool)
2. Exception raised after commit
3. Function exits without calling `session.close()`
4. Connection remains "checked out" from pool
5. After 20 such exceptions, pool exhausted
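A quick way to confirm the mechanism is to watch the pool's checkout count across one failing request. A minimal probe sketch, reusing the app's engine and the `create_order` handler above (the probe script and import paths are illustrative):
```python
# leak_probe.py - observe pool checkouts around a failing request (sketch)
import asyncio
from database import engine            # the same engine the app uses
from api.orders import create_order    # leaky handler shown above
# OrderCreate is the app's request model; import path assumed

before = engine.pool.checkedout()
try:
    # total > 10000 triggers the ValueError path after commit
    asyncio.run(create_order(OrderCreate(total=15000)))
except ValueError:
    pass
after = engine.pool.checkedout()
print(f"leaked connections: {after - before}")  # 1 per failing request
```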
---
### Step 3: Load Testing to Reproduce
**Test Script**:
```python
# test_connection_leak.py
import asyncio
import httpx

async def create_order(client, amount):
    """Create order that will trigger exception"""
    try:
        response = await client.post(
            "https://api.greyhaven.io/orders",
            json={"total": amount}
        )
        return response.status_code
    except Exception:
        return 503

async def load_test():
    """Simulate 100 orders with high amounts (triggers leak)"""
    async with httpx.AsyncClient() as client:
        # Trigger 100 exceptions (leak 100 connections)
        tasks = [create_order(client, 15000) for _ in range(100)]
        results = await asyncio.gather(*tasks)
        success = sum(1 for r in results if r == 201)
        errors = sum(1 for r in results if r == 503)
        print(f"Success: {success}, Errors: {errors}")

asyncio.run(load_test())
```
**Results**:
```
Success: 20 (first 20 use all connections)
Errors: 80 (remaining 80 timeout waiting for pool)
Proves: Connection leak exhausts pool
```
---
## Resolution
### Fix 1: Use Context Manager (Guaranteed Cleanup)
**After - With Context Manager**:
```python
# api/orders.py (AFTER - FIXED)
from fastapi import APIRouter, Depends
from sqlmodel import Session
from database import engine

router = APIRouter()

# ✅ Dependency injection with automatic cleanup
def get_session():
    with Session(engine) as session:
        yield session
        # Session always closed (even on exception)

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session)
):
    # Session managed by FastAPI dependency
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()

    # Exception here? No problem - session still closed by context manager
    if order.total > 10000:
        raise ValueError("Order exceeds limit")

    return order
```
**Why This Works**:
- Context manager (`with` statement) guarantees `session.close()` in `__exit__`
- Works even if exception raised
- FastAPI `Depends()` handles async cleanup automatically
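Conceptually, the `with` block is shorthand for a `try`/`finally` that closes the session no matter how the block exits; spelled out as a sketch:
```python
# Equivalent to `with Session(engine) as session:` - cleanup is unconditional
session = Session(engine)
try:
    order = Order(**order_data.dict())
    session.add(order)
    session.commit()
    if order.total > 10000:
        raise ValueError("Order exceeds limit")  # cleanup below still runs
finally:
    session.close()  # always executed - connection returned to the pool
```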
---
### Fix 2: Increase Connection Pool Size
**Before** (pool too small):
```python
# database.py (BEFORE)
from sqlmodel import create_engine

engine = create_engine(
    database_url,
    pool_size=20,     # Too small for load
    max_overflow=10,
    pool_timeout=30
)
```
**After** (tuned for load):
```python
# database.py (AFTER)
from sqlmodel import create_engine
import os

# Calculate pool size based on workers
# Formula: (workers * 2) + buffer
# 16 workers * 2 + 20 buffer = 52
workers = int(os.getenv("WEB_CONCURRENCY", 16))
pool_size = (workers * 2) + 20

engine = create_engine(
    database_url,
    pool_size=pool_size,   # 52 connections
    max_overflow=20,       # Burst to 72 total
    pool_timeout=30,
    pool_recycle=3600,     # Recycle after 1 hour
    pool_pre_ping=True,    # Verify connection health
    echo=False
)
```
**Pool Size Calculation**:
```
Workers: 16 (Uvicorn workers)
Connections per worker: 2 (normal peak)
Buffer: 20 (for spikes)
pool_size = (16 * 2) + 20 = 52
max_overflow = 20 (total 72 for extreme spikes)
```
---
### Fix 3: Add Connection Pool Monitoring
**Prometheus Metrics**:
```python
# monitoring.py
from prometheus_client import Gauge
from database import engine

# Pool metrics
db_pool_size = Gauge('db_pool_size_total', 'Total pool size')
db_pool_checked_out = Gauge('db_pool_checked_out', 'Connections in use')
db_pool_idle = Gauge('db_pool_idle', 'Idle connections')
db_pool_overflow = Gauge('db_pool_overflow', 'Overflow connections')

def update_pool_metrics():
    """Update pool metrics every 10 seconds"""
    pool = engine.pool
    db_pool_size.set(pool.size())
    db_pool_checked_out.set(pool.checkedout())
    db_pool_idle.set(pool.size() - pool.checkedout())
    db_pool_overflow.set(pool.overflow())

# Schedule in background task
import asyncio

async def pool_monitor():
    while True:
        update_pool_metrics()
        await asyncio.sleep(10)
```
**Grafana Alert**:
```yaml
# Alert: Connection pool near exhaustion
expr: db_pool_checked_out / db_pool_size_total > 0.8
for: 5m
annotations:
  summary: "Connection pool >80% utilized"
  description: "{{ $value | humanizePercentage }} of pool in use"
```
---
### Fix 4: Add Timeout and Retry Logic
**Connection Timeout Handling**:
```python
# database.py - Add connection retry
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10)
)
def get_session_with_retry():
    """Get session with automatic retry on pool timeout"""
    try:
        with Session(engine) as session:
            yield session
    except TimeoutError:
        # Pool exhausted - retry after exponential backoff
        raise

@router.post("/orders")
async def create_order(
    order_data: OrderCreate,
    session: Session = Depends(get_session_with_retry)
):
    # Will retry up to 3 times if pool exhausted
    ...
```
---
## Results
### Before vs After Metrics
| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **Connection Pool Size** | 20 | 52 | +160% capacity |
| **Pool Utilization** | 100% (exhausted) | 40-60% (healthy) | -40% utilization |
| **503 Error Rate** | 15% | 0.01% | **99.9% reduction** |
| **Request Timeout** | 30s (waiting) | <100ms | **99.7% faster** |
| **Leaked Connections** | 12/hour | 0/day | **100% eliminated** |
---
### Deployment Verification
**Load Test After Fix**:
```bash
# Simulate 1000 concurrent orders
ab -n 1000 -c 50 -p order.json https://api.greyhaven.io/orders
# Results:
Requests per second: 250 [#/sec]
Time per request: 200ms [mean]
Failed requests: 0 (0%)
Successful requests: 1000 (100%)
# Pool status during test:
{
  "size": 52,
  "checked_out": 28,   # 54% utilization (healthy)
  "overflow": 0,
  "idle": 24
}
```
---
## Prevention Measures
### 1. Connection Leak Tests
```python
# tests/test_connection_leaks.py
@pytest.fixture
def track_connections():
    before = engine.pool.checkedout()
    yield
    after = engine.pool.checkedout()
    assert after == before, f"Leaked {after - before} connections"
```
### 2. Pool Alerts
```yaml
# Alert if pool >80% for 5 minutes
expr: db_pool_checked_out / db_pool_size_total > 0.8
```
### 3. Health Check
```python
@app.get("/health/database")
async def database_health():
with Session(engine) as session:
session.execute("SELECT 1")
return {"status": "healthy", "pool_utilization": pool.checkedout() / pool.size()}
```
### 4. Monitoring Commands
```bash
# Active connections
pscale shell db main --execute "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'"
# Slow queries
pscale database insights db main --slow-queries
```
---
## Lessons Learned
### What Went Well
✅ Quick identification of pool exhaustion (Prometheus alerts)
✅ Context manager pattern eliminated leaks
✅ Pool tuning based on formula (workers * 2 + buffer)
✅ Comprehensive monitoring added
### What Could Be Improved
❌ No pool monitoring before incident
❌ Pool size not calculated based on load
❌ Missing connection leak tests
### Key Takeaways
1. **Always use context managers** for database sessions
2. **Calculate pool size** based on workers and load
3. **Monitor pool utilization** with alerts at 80%
4. **Test for connection leaks** in CI/CD
5. **Add retry logic** for transient pool timeouts
---
## PlanetScale Best Practices
```bash
# Connection string with SSL
DATABASE_URL="postgresql://user:pass@aws.connect.psdb.cloud/db?sslmode=require"
# Schema changes via deploy requests
pscale deploy-request create db schema-update
# Test in branch
pscale branch create db test-feature
```
```sql
-- Index frequently queried columns
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Analyze slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
```
---
## Related Documentation
- **Worker Deployment**: [cloudflare-worker-deployment-failure.md](cloudflare-worker-deployment-failure.md)
- **Network Debugging**: [distributed-system-debugging.md](distributed-system-debugging.md)
- **Performance**: [performance-degradation-analysis.md](performance-degradation-analysis.md)
- **Runbooks**: [../reference/troubleshooting-runbooks.md](../reference/troubleshooting-runbooks.md)
---
Return to [examples index](INDEX.md)

View File

@@ -0,0 +1,72 @@
# DevOps Troubleshooter Reference
Quick reference guides for Grey Haven infrastructure troubleshooting - runbooks, diagnostic commands, and platform-specific guides.
## Reference Guides
### Troubleshooting Runbooks
**File**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md)
Step-by-step runbooks for common infrastructure issues:
- **Worker Not Responding**: 500/502/503 errors from Cloudflare Workers
- **Database Connection Failures**: Connection refused, pool exhaustion
- **Deployment Failures**: Failed deployments, rollback procedures
- **Performance Degradation**: Slow responses, high latency
- **Network Issues**: DNS failures, connectivity problems
**Use when**: Following structured resolution for known issues
---
### Diagnostic Commands Reference
**File**: [diagnostic-commands.md](diagnostic-commands.md)
Command reference for quick troubleshooting:
- **Cloudflare Workers**: wrangler commands, log analysis
- **PlanetScale**: Database queries, connection checks
- **Network**: curl timing, DNS resolution, traceroute
- **Performance**: Profiling, metrics collection
**Use when**: Need quick command syntax for diagnostics
---
### Cloudflare Workers Platform Guide
**File**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md)
Cloudflare Workers-specific guidance:
- **Deployment Best Practices**: Bundle size, environment variables
- **Performance Optimization**: CPU limits, memory management
- **Error Handling**: Common errors and solutions
- **Monitoring**: Logs, metrics, analytics
**Use when**: Cloudflare Workers-specific issues
---
## Quick Navigation
**By Issue Type**:
- Worker errors → [troubleshooting-runbooks.md#worker-not-responding](troubleshooting-runbooks.md#worker-not-responding)
- Database issues → [troubleshooting-runbooks.md#database-connection-failures](troubleshooting-runbooks.md#database-connection-failures)
- Performance → [troubleshooting-runbooks.md#performance-degradation](troubleshooting-runbooks.md#performance-degradation)
**By Platform**:
- Cloudflare Workers → [cloudflare-workers-guide.md](cloudflare-workers-guide.md)
- PlanetScale → [diagnostic-commands.md#planetscale-commands](diagnostic-commands.md#planetscale-commands)
- Network → [diagnostic-commands.md#network-commands](diagnostic-commands.md#network-commands)
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting walkthroughs
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident report templates
- **Main Agent**: [devops-troubleshooter.md](../devops-troubleshooter.md) - DevOps troubleshooter agent
---
Return to [main agent](../devops-troubleshooter.md)

View File

@@ -0,0 +1,472 @@
# Cloudflare Workers Platform Guide
Comprehensive guide for deploying, monitoring, and troubleshooting Cloudflare Workers in Grey Haven's stack.
## Workers Architecture
**Execution Model**:
- V8 isolates (not containers)
- Deployed globally to 300+ datacenters
- Request routed to nearest location
- Cold start: ~1-5ms (vs 100-1000ms for containers)
- CPU time limit: 50ms (Free), 50ms-30s (Paid)
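Concretely, a Worker is a single module whose exported handlers run inside an isolate at whichever datacenter received the request. A minimal sketch (the binding name is illustrative):
```typescript
// Minimal module-format Worker: no server or container to manage.
// `env` carries bindings (vars, secrets, KV) declared in wrangler.toml.
export interface Env {
  GREETING: string; // illustrative [vars] binding
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Runs in a V8 isolate at the edge location nearest the client
    return new Response(`${env.GREETING}: ${new URL(request.url).pathname}`);
  }
};
```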
**Resource Limits**:
```
Free Plan:
- Bundle size: 1MB compressed
- CPU time: 50ms per request
- Requests: 100,000/day
- KV reads: 100,000/day
Paid Plan ($5/month):
- Bundle size: 10MB compressed
- CPU time: 50ms (standard), up to 30s (unbound)
- Requests: 10M included, $0.50/million after
- KV reads: 10M included
```
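On a paid plan the CPU ceiling can also be raised per worker from wrangler.toml; a sketch assuming the `[limits]` key (value illustrative):
```toml
# wrangler.toml - request a higher CPU budget (paid plans; value illustrative)
name = "my-worker"
main = "src/worker.ts"

[limits]
cpu_ms = 30000 # up to 30 seconds of CPU per request
```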
---
## Deployment Best Practices
### Bundle Optimization
**Size Reduction Strategies**:
```typescript
// 1. Tree shaking with named imports
import { uniq } from 'lodash-es'; // ✅ Only imports uniq
import _ from 'lodash';           // ❌ Imports entire library

// 2. Use native APIs instead of libraries
const date = new Date().toISOString(); // ✅ Native
import moment from 'moment';           // ❌ 300KB library

// 3. External API calls instead of SDKs
await fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  headers: { 'x-api-key': env.API_KEY },
  body: JSON.stringify({ ... })
}); // ✅ 0KB vs @anthropic-ai/sdk (2.1MB)

// 4. Code splitting with dynamic imports
if (request.url.includes('/special')) {
  const { handler } = await import('./expensive-module');
  return handler(request);
} // ✅ Lazy load
```
**webpack Configuration**:
```javascript
module.exports = {
  mode: 'production',
  target: 'webworker',
  optimization: {
    minimize: true,
    usedExports: true, // Tree shaking
    sideEffects: false
  },
  resolve: {
    alias: {
      'lodash': 'lodash-es' // Use ES modules version
    }
  }
};
```
---
### Environment Variables
**Using Secrets**:
```bash
# Add secret (never in code)
wrangler secret put DATABASE_URL
# List secrets
wrangler secret list
# Delete secret
wrangler secret delete OLD_KEY
```
**Using Variables** (wrangler.toml):
```toml
[vars]
API_ENDPOINT = "https://api.partner.com"
MAX_RETRIES = "3"
CACHE_TTL = "300"
[env.staging.vars]
API_ENDPOINT = "https://staging-api.partner.com"
[env.production.vars]
API_ENDPOINT = "https://api.partner.com"
```
**Accessing in Code**:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const dbUrl = env.DATABASE_URL;    // Secret
    const endpoint = env.API_ENDPOINT; // Var
    const maxRetries = parseInt(env.MAX_RETRIES);
    return new Response('OK');
  }
};
```
---
## Performance Optimization
### CPU Time Management
**Avoid CPU-Intensive Operations**:
```typescript
// ❌ BAD: CPU-intensive operation
function processLargeDataset(data) {
  const sorted = data.sort((a, b) => a.value - b.value);
  const filtered = sorted.filter(item => item.value > 1000);
  const mapped = filtered.map(item => ({ ...item, processed: true }));
  return mapped; // Can exceed 50ms CPU limit
}

// ✅ GOOD: Offload to external service
async function processLargeDataset(data, env) {
  const response = await fetch(`${env.PROCESSING_API}/process`, {
    method: 'POST',
    body: JSON.stringify(data)
  });
  return response.json(); // External service handles heavy lifting
}

// ✅ BETTER: Use Durable Objects for stateful computation
const id = env.PROCESSOR.idFromName('processor');
const stub = env.PROCESSOR.get(id);
return stub.fetch(request); // Durable Object has more CPU time
```
**Monitor CPU Usage**:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const start = Date.now();
    try {
      const response = await handleRequest(request, env);
      const duration = Date.now() - start;
      if (duration > 40) {
        console.warn(`CPU time approaching limit: ${duration}ms`);
      }
      return response;
    } catch (error) {
      const duration = Date.now() - start;
      console.error(`Request failed after ${duration}ms:`, error);
      throw error;
    }
  }
};
```
---
### Caching Strategies
**Cache API**:
```typescript
export default {
  async fetch(request: Request) {
    const cache = caches.default;

    // Check cache
    let response = await cache.match(request);
    if (response) return response;

    // Cache miss - fetch and cache
    response = await fetch(request);

    // Clone before caching: a response body can only be read once
    const cacheResponse = new Response(response.clone().body, response);
    cacheResponse.headers.set('Cache-Control', 'max-age=300');
    await cache.put(request, cacheResponse);

    return response;
  }
};
```
**KV for Data Caching**:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const url = new URL(request.url);
    const cacheKey = `data:${url.pathname}`;

    // Check KV
    const cached = await env.CACHE.get(cacheKey, 'json');
    if (cached) return Response.json(cached);

    // Fetch data
    const data = await fetchExpensiveData();

    // Store in KV with 5min TTL
    await env.CACHE.put(cacheKey, JSON.stringify(data), {
      expirationTtl: 300
    });

    return Response.json(data);
  }
};
```
---
## Common Errors and Solutions
### Error 1101: Worker Threw Exception
**Cause**: Unhandled JavaScript exception
**Example**:
```typescript
// ❌ BAD: Unhandled error
export default {
  async fetch(request: Request) {
    const data = JSON.parse(request.body); // Throws if invalid JSON
    return Response.json(data);
  }
};
```
**Solution**:
```typescript
// ✅ GOOD: Proper error handling
export default {
  async fetch(request: Request) {
    try {
      const body = await request.text();
      const data = JSON.parse(body);
      return Response.json(data);
    } catch (error) {
      console.error('JSON parse error:', error);
      return new Response('Invalid JSON', { status: 400 });
    }
  }
};
```
---
### Error 1015: Rate Limited
**Cause**: Too many requests to origin
**Solution**: Implement caching and rate limiting
```typescript
const RATE_LIMIT = 100; // requests per minute
// Note: an in-memory Map is per-isolate; use KV or Durable Objects for a global limit
const rateLimits = new Map();

export default {
  async fetch(request: Request) {
    const ip = request.headers.get('CF-Connecting-IP');
    const key = `ratelimit:${ip}`;
    const count = rateLimits.get(key) || 0;

    if (count >= RATE_LIMIT) {
      return new Response('Rate limit exceeded', { status: 429 });
    }

    rateLimits.set(key, count + 1);
    setTimeout(() => rateLimits.delete(key), 60000);
    return new Response('OK');
  }
};
```
---
### Error: Script Exceeds Size Limit
**Diagnosis**:
```bash
# Check bundle size
npm run build
ls -lh dist/worker.js
# Analyze bundle
npm install --save-dev webpack-bundle-analyzer
npm run build -- --analyze
```
**Solutions**: See [bundle optimization](#bundle-optimization) above
---
## Monitoring and Logging
### Structured Logging
```typescript
interface LogEntry {
  level: 'info' | 'warn' | 'error';
  message: string;
  timestamp: string;
  requestId?: string;
  duration?: number;
  metadata?: Record<string, any>;
}

function log(entry: LogEntry) {
  console.log(JSON.stringify({
    ...entry,
    timestamp: new Date().toISOString()
  }));
}

export default {
  async fetch(request: Request, env: Env) {
    const requestId = crypto.randomUUID();
    const start = Date.now();
    try {
      log({
        level: 'info',
        message: 'Request started',
        requestId,
        metadata: {
          method: request.method,
          url: request.url
        }
      });

      const response = await handleRequest(request, env);

      log({
        level: 'info',
        message: 'Request completed',
        requestId,
        duration: Date.now() - start,
        metadata: {
          status: response.status
        }
      });

      return response;
    } catch (error) {
      log({
        level: 'error',
        message: 'Request failed',
        requestId,
        duration: Date.now() - start,
        metadata: {
          error: error.message,
          stack: error.stack
        }
      });
      return new Response('Internal Server Error', { status: 500 });
    }
  }
};
```
---
### Health Check Endpoint
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const url = new URL(request.url);

    if (url.pathname === '/health') {
      return Response.json({
        status: 'healthy',
        timestamp: new Date().toISOString(),
        version: env.VERSION || 'unknown'
      });
    }

    // Regular request handling
    return handleRequest(request, env);
  }
};
```
---
## Testing Workers
```bash
# Local testing
wrangler dev
curl http://localhost:8787/api/users
curl -X POST http://localhost:8787/api/users -H "Content-Type: application/json" -d '{"name": "Test User"}'
```

```typescript
// Unit testing (Vitest)
import { describe, it, expect } from 'vitest';
import worker from './worker';

describe('Worker', () => {
  it('returns 200 for health check', async () => {
    const request = new Request('https://example.com/health');
    // getMockEnv() is assumed to be a local test helper that builds Env bindings
    const response = await worker.fetch(request, getMockEnv());
    expect(response.status).toBe(200);
  });
});
```
---
## Security Best Practices
```typescript
// 1. Validate inputs
function validateEmail(email: string): boolean {
return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}
// 2. Set security headers (clone first: headers on responses returned by fetch are immutable)
function addSecurityHeaders(response: Response): Response {
  const secured = new Response(response.body, response);
  secured.headers.set('X-Content-Type-Options', 'nosniff');
  secured.headers.set('X-Frame-Options', 'DENY');
  secured.headers.set('Strict-Transport-Security', 'max-age=31536000');
  return secured;
}
// 3. CORS configuration
const ALLOWED_ORIGINS = ['https://app.greyhaven.io', 'https://staging.greyhaven.io'];
function handleCors(request: Request): Response | null {
  const origin = request.headers.get('Origin');
  // Reject disallowed origins before echoing anything back, including preflights
  if (origin && !ALLOWED_ORIGINS.includes(origin)) {
    return new Response('Forbidden', { status: 403 });
  }
  if (request.method === 'OPTIONS') {
    return new Response(null, {
      headers: {
        'Access-Control-Allow-Origin': origin ?? ALLOWED_ORIGINS[0],
        'Access-Control-Allow-Methods': 'GET,POST,PUT,DELETE',
        'Access-Control-Max-Age': '86400'
      }
    });
  }
  return null;
}
```
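A minimal sketch of how these pieces compose in a fetch handler (`handleRequest` stands in for your routing logic, and `Env` is whatever bindings your Worker declares):
```typescript
export default {
  async fetch(request: Request, env: Env) {
    // Short-circuit preflights and reject disallowed origins
    const cors = handleCors(request);
    if (cors) return cors;

    // handleRequest: assumed routing logic that validates inputs as needed
    const response = await handleRequest(request, env);
    return addSecurityHeaders(response);
  }
};
```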
---
## Related Documentation
- **Runbooks**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md) - Step-by-step procedures
- **Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
- **Examples**: [Examples Index](../examples/INDEX.md) - Full examples
---
Return to [reference index](INDEX.md)

View File

@@ -0,0 +1,473 @@
# Diagnostic Commands Reference
Quick command reference for Grey Haven infrastructure troubleshooting. Copy-paste-ready commands for rapid diagnosis.
## Cloudflare Workers Commands
### Deployment Management
```bash
# List recent deployments
wrangler deployments list
# View specific deployment
wrangler deployments view <deployment-id>
# Rollback to previous version
wrangler rollback --message "Reverting due to errors"
# Deploy to production
wrangler deploy
# Deploy to staging
wrangler deploy --env staging
```
### Logs and Monitoring
```bash
# Real-time logs (pretty format)
wrangler tail --format pretty
# JSON logs for parsing
wrangler tail --format json
# Filter by status code
wrangler tail --format json | grep "\"status\":500"
# Show only errors
wrangler tail --format json | grep -i "error"
# Save logs to file
wrangler tail --format json > worker-logs.json
# Monitor specific worker
wrangler tail --name my-worker
```
### Local Development
```bash
# Start local dev server
wrangler dev
# Dev with specific port
wrangler dev --port 8788
# Dev with remote mode (use production bindings)
wrangler dev --remote
# Test locally
curl http://localhost:8787/api/health
```
### Configuration
```bash
# Show account info
wrangler whoami
# List KV namespaces
wrangler kv:namespace list
# List secrets
wrangler secret list
# Add secret
wrangler secret put API_KEY
# Delete secret
wrangler secret delete API_KEY
```
---
## PlanetScale Commands
### Database Management
```bash
# Connect to database shell
pscale shell greyhaven-db main
# Connect and execute query
pscale shell greyhaven-db main --execute "SELECT COUNT(*) FROM users"
# Show database info
pscale database show greyhaven-db
# List all databases
pscale database list
# Create new branch
pscale branch create greyhaven-db feature-branch
# List branches
pscale branch list greyhaven-db
```
### Connection Monitoring
```sql
-- Active connections
SELECT COUNT(*) as active_connections
FROM pg_stat_activity
WHERE state = 'active';
-- Long-running queries
SELECT
pid,
now() - query_start as duration,
query
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < now() - interval '10 seconds'
ORDER BY duration DESC;
-- Connection by state
SELECT state, COUNT(*)
FROM pg_stat_activity
GROUP BY state;
-- Blocked queries
SELECT
blocked.pid AS blocked_pid,
blocking.pid AS blocking_pid,
blocked.query AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
```
### Performance Analysis
```bash
# Slow query insights
pscale database insights greyhaven-db main --slow-queries
# Database size (opens the web dashboard)
pscale database show greyhaven-db --web
# Enable slow query log
pscale database settings update greyhaven-db --enable-slow-query-log
```
```sql
-- Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Index usage
SELECT
schemaname,
tablename,
indexname,
idx_scan as index_scans
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Cache hit ratio
SELECT
'cache hit rate' AS metric,
sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;
```
### Schema Migrations
```bash
# Create deploy request
pscale deploy-request create greyhaven-db <branch-name>
# List deploy requests
pscale deploy-request list greyhaven-db
# View deploy request diff
pscale deploy-request diff greyhaven-db <number>
# Deploy schema changes
pscale deploy-request deploy greyhaven-db <number>
# Close deploy request
pscale deploy-request close greyhaven-db <number>
```
---
## Network Diagnostic Commands
### DNS Resolution
```bash
# Basic DNS lookup
nslookup api.partner.com
# Detailed DNS query
dig api.partner.com
# Measure DNS time
time nslookup api.partner.com
# Check DNS propagation
dig api.partner.com @8.8.8.8
dig api.partner.com @1.1.1.1
# Reverse DNS lookup
dig -x 203.0.113.42
```
### Connectivity Testing
```bash
# Ping test
ping -c 10 api.partner.com
# Trace network route
traceroute api.partner.com
# TCP connection test
nc -zv api.partner.com 443
# Test specific port
telnet api.partner.com 443
```
### HTTP Request Timing
```bash
# Full timing breakdown
curl -w "\nDNS Lookup: %{time_namelookup}s\nTCP Connect: %{time_connect}s\nTLS Handshake: %{time_appconnect}s\nStart Transfer:%{time_starttransfer}s\nTotal: %{time_total}s\n" \
-o /dev/null -s https://api.partner.com/data
# Test with specific method
curl -X POST https://api.example.com/api \
-H "Content-Type: application/json" \
-d '{"test": "data"}'
# Follow redirects
curl -L https://example.com
# Show response headers
curl -I https://api.example.com
# Test CORS
curl -I -X OPTIONS https://api.example.com \
-H "Origin: https://app.example.com" \
-H "Access-Control-Request-Method: POST"
```
### SSL/TLS Verification
```bash
# Check SSL certificate
openssl s_client -connect api.example.com:443
# Show certificate expiry
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Verify certificate chain
openssl s_client -connect api.example.com:443 -showcerts
```
---
## Application Performance Commands
### Resource Monitoring
```bash
# CPU usage
top -o cpu
# Memory usage
free -h # Linux
vm_stat # macOS
# Disk usage
df -h
# Process list
ps aux | grep node
# Port usage
lsof -i :8000
netstat -an | grep 8000
```
### Log Analysis
```bash
# Tail logs
tail -f /var/log/app.log
# Search logs
grep -i "error" /var/log/app.log
# Count errors
grep -c "ERROR" /var/log/app.log
# Show recent errors with context
grep -B 5 -A 5 "error" /var/log/app.log
# Parse JSON logs
cat app.log | jq 'select(.level=="error")'
# Error frequency
grep "ERROR" /var/log/app.log | cut -d' ' -f1 | uniq -c
```
### Worker Performance
```bash
# Monitor CPU time
wrangler tail --format json | jq '.outcome.cpuTime'
# Monitor duration
wrangler tail --format json | jq '.outcome.duration'
# Count streamed log events (rough request volume while tail runs)
wrangler tail --format json | wc -l
# Average response time
wrangler tail --format json | \
jq -r '.outcome.duration' | \
awk '{sum+=$1; count++} END {print sum/count}'
```
---
## Health Check Scripts
### Worker Health Check
```bash
#!/bin/bash
# health-check-worker.sh
echo "=== Worker Health Check ==="
# Test endpoint
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://api.greyhaven.io/health)
if [ "$STATUS" -eq 200 ]; then
echo "✅ Worker responding (HTTP $STATUS)"
else
echo "❌ Worker error (HTTP $STATUS)"
exit 1
fi
# Check response time
TIME=$(curl -w "%{time_total}" -o /dev/null -s https://api.greyhaven.io/health)
echo "Response time: ${TIME}s"
if (( $(echo "$TIME > 1.0" | bc -l) )); then
echo "⚠️ Slow response (>${TIME}s)"
fi
```
### Database Health Check
```bash
#!/bin/bash
# health-check-db.sh
echo "=== Database Health Check ==="
# Test connection
pscale shell greyhaven-db main --execute "SELECT 1" > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "✅ Database connection OK"
else
echo "❌ Database connection failed"
exit 1
fi
# Check active connections
ACTIVE=$(pscale shell greyhaven-db main --execute \
"SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'" | tail -1)
echo "Active connections: $ACTIVE"
if [ "$ACTIVE" -gt 80 ]; then
echo "⚠️ High connection count (>80)"
fi
```
### Complete System Health
```bash
#!/bin/bash
# health-check-all.sh
echo "=== Complete System Health Check ==="
# Worker
echo "\n1. Cloudflare Worker"
./health-check-worker.sh
# Database
echo "\n2. PlanetScale Database"
./health-check-db.sh
# External APIs
echo "\n3. External Dependencies"
for API in "https://api.partner1.com/health" "https://api.partner2.com/health"; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API")
if [ "$STATUS" -eq 200 ]; then
echo "$API (HTTP $STATUS)"
else
echo "$API (HTTP $STATUS)"
fi
done
echo "\n=== Health Check Complete ==="
```
---
## Troubleshooting One-Liners
```bash
# Find memory hogs
ps aux --sort=-%mem | head -10
# Find CPU hogs
ps aux --sort=-%cpu | head -10
# Disk space by directory
du -sh /* | sort -h
# Network connections
netstat -ant | awk '{print $6}' | sort | uniq -c
# Failed login attempts
grep "Failed password" /var/log/auth.log | wc -l
# Top error codes
awk '{print $9}' access.log | sort | uniq -c | sort -rn
# Requests per minute (date:hour:minute from the timestamp field)
awk '{print $4}' access.log | cut -d: -f1-3 | uniq -c
# Average response size
awk '{sum+=$10; count++} END {print sum/count}' access.log
```
---
## Related Documentation
- **Runbooks**: [troubleshooting-runbooks.md](troubleshooting-runbooks.md) - Step-by-step procedures
- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific
- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples
---
Return to [reference index](INDEX.md)

View File

@@ -0,0 +1,489 @@
# Troubleshooting Runbooks
Step-by-step runbooks for resolving common Grey Haven infrastructure issues. Follow procedures systematically for fastest resolution.
## Runbook 1: Worker Not Responding
### Symptoms
- API returning 500/502/503 errors
- Workers timing out or not processing requests
- Cloudflare error pages showing
### Diagnosis Steps
**1. Check Cloudflare Status**
```bash
# Visit: https://www.cloudflarestatus.com
# Or query status API
curl -s https://www.cloudflarestatus.com/api/v2/status.json | jq '.status.indicator'
```
**2. View Worker Logs**
```bash
# Real-time logs
wrangler tail --format pretty
# Look for errors:
# - "Script exceeded CPU time limit"
# - "Worker threw exception"
# - "Uncaught TypeError"
```
**3. Check Recent Deployments**
```bash
wrangler deployments list
# If recent deployment suspicious, rollback:
wrangler rollback --message "Reverting to stable version"
```
**4. Test Worker Locally**
```bash
# Run worker in dev mode
wrangler dev
# Test endpoint
curl http://localhost:8787/api/health
```
### Resolution Paths
**Path A: Platform Issue** - Wait for Cloudflare, monitor status, communicate ETA
**Path B: Code Error** - Rollback deployment, fix in dev, test before redeploy
**Path C: Resource Limit** - Check CPU logs, optimize operations, upgrade if needed
**Path D: Binding Issue** - Verify wrangler.toml, check bindings, redeploy
### Prevention
- Health check endpoint: `GET /health`
- Monitor error rate and alert when it exceeds 1%
- Test deployments in staging first
- Implement circuit breakers for external calls (see the sketch below)
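A circuit breaker stops hammering a failing dependency and fails fast until it recovers. A minimal sketch (thresholds are illustrative, and state is per-isolate; a shared breaker would need a Durable Object):
```typescript
// Minimal circuit breaker: open after N consecutive failures, retry after a cooldown.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private maxFailures = 5,     // illustrative threshold
    private cooldownMs = 30_000  // how long to stay open
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.maxFailures &&
        Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error('Circuit open: skipping call to failing dependency');
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage: wrap external calls so repeated failures short-circuit quickly
const partnerApi = new CircuitBreaker();
// await partnerApi.call(() => fetch('https://api.partner.com/data'));
```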
---
## Runbook 2: Database Connection Failures
### Symptoms
- "connection refused" errors
- "too many connections" errors
- Application timing out on database queries
- 503 errors from API
### Diagnosis Steps
**1. Test Database Connection**
```bash
# Direct connection test
pscale shell greyhaven-db main
# If fails, check:
# - Database status
# - Credentials
# - Network connectivity
```
**2. Check Connection Pool**
```bash
# Query pool status
curl http://localhost:8000/pool-status
# Expected healthy response:
{
  "size": 50,
  "checked_out": 25,
  "overflow": 0,
  "available": 25
}
# checked_out below 80% of size is healthy
```
**3. Check Active Connections**
```sql
-- In pscale shell
SELECT
COUNT(*) as active,
MAX(query_start) as oldest_query
FROM pg_stat_activity
WHERE state = 'active';
-- If active = pool size, pool exhausted
-- If oldest_query >10min, leaked connection
```
**4. Review Application Logs**
```bash
# Search for connection errors
grep -i "connection" logs/app.log | tail -50
# Common errors:
# - "Pool timeout"
# - "Connection refused"
# - "Max connections reached"
```
### Resolution Paths
**Path A: Invalid Credentials**
```bash
# Rotate credentials
pscale password create greyhaven-db main app-password
# Update environment variable
# Restart application
```
**Path B: Pool Exhausted**
```python
# Increase pool size in database.py
from sqlalchemy import create_engine

engine = create_engine(
    database_url,
    pool_size=50,    # increased from 20
    max_overflow=20
)
```
**Path C: Connection Leaks**
```python
# Fix: use a context manager so the connection always returns to the pool
from sqlalchemy.orm import Session

with Session(engine) as session:
    ...  # work with session; closed automatically, even on exceptions
```
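In the FastAPI services this stack uses, the same guarantee can be enforced once with a session dependency instead of remembering a context manager in every route. A sketch (the connection URL and the `User` model are placeholders):
```python
# Sketch: FastAPI dependency that guarantees the session (and its pooled
# connection) is released even if the route raises.
from fastapi import Depends, FastAPI
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker

engine = create_engine("postgresql://...", pool_size=20, max_overflow=10)
SessionLocal = sessionmaker(bind=engine)
app = FastAPI()

def get_session():
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()  # always returns the connection to the pool

@app.get("/users/{user_id}")
def read_user(user_id: int, session: Session = Depends(get_session)):
    return session.get(User, user_id)  # User: your ORM model (assumed)
```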
**Path D: Database Paused/Down**
```bash
# Resume database if paused
pscale database resume greyhaven-db
# Check database status
pscale database show greyhaven-db
```
### Prevention
- Use connection pooling with proper limits
- Implement retry logic with exponential backoff (see the sketch below)
- Monitor pool utilization (alert >80%)
- Test for connection leaks in CI/CD
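Exponential backoff, concretely. A sketch; the retryable exception type depends on your driver:
```python
# Sketch: retry a transient database error with exponential backoff and jitter.
import random
import time

from sqlalchemy.exc import OperationalError  # transient connection errors

def with_retries(fn, max_attempts=4, base_delay=0.2):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except OperationalError:
            if attempt == max_attempts:
                raise
            # 0.2s, 0.4s, 0.8s ... plus jitter so clients don't retry in lockstep
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)

# Usage: result = with_retries(lambda: session.execute(statement))
```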
---
## Runbook 3: Deployment Failures
### Symptoms
- `wrangler deploy` fails
- CI/CD pipeline fails at deployment step
- New code not reflecting in production
### Diagnosis Steps
**1. Check Deployment Error**
```bash
wrangler deploy --verbose
# Common errors:
# - "Script exceeds size limit"
# - "Syntax error in worker"
# - "Environment variable missing"
# - "Binding not found"
```
**2. Verify Build Output**
```bash
# Check built file
ls -lh dist/
npm run build
# Ensure build succeeds locally
```
**3. Check Environment Variables**
```bash
# List secrets
wrangler secret list
# Verify wrangler.toml vars
cat wrangler.toml | grep -A 10 "\[vars\]"
```
**4. Test Locally**
```bash
# Start dev server
wrangler dev
# If works locally but not production:
# - Environment variable mismatch
# - Binding configuration issue
```
### Resolution Paths
**Path A: Bundle Too Large**
```bash
# Check bundle size
ls -lh dist/worker.js
# Solutions:
# - Tree shake unused code
# - Code split large modules
# - Use fetch instead of SDK
```
**Path B: Syntax Error**
```bash
# Run TypeScript check
npm run type-check
# Run linter
npm run lint
# Fix errors before deploying
```
**Path C: Missing Variables**
```bash
# Add missing secret
wrangler secret put API_KEY
```
```toml
# Or add to wrangler.toml vars
[vars]
API_ENDPOINT = "https://api.example.com"
```
**Path D: Binding Not Found**
```toml
# wrangler.toml - Add binding
[[kv_namespaces]]
binding = "CACHE"
id = "abc123"
[[d1_databases]]
binding = "DB"
database_name = "greyhaven-db"
database_id = "xyz789"
```
### Prevention
- Bundle size check in CI/CD
- Pre-commit hooks for validation
- Staging environment for testing
- Automated deployment tests
---
## Runbook 4: Performance Degradation
### Symptoms
- API response times increased (>2x normal)
- Slow page loads
- User complaints about slowness
- Timeout errors
### Diagnosis Steps
**1. Check Current Latency**
```bash
# Test endpoint
curl -w "\nTotal: %{time_total}s\n" -o /dev/null -s https://api.greyhaven.io/orders
# p95 should be <500ms
# If >1s, investigate
```
**2. Analyze Worker Logs**
```bash
wrangler tail --format json | jq '{duration: .outcome.duration, event: .event}'
# Identify slow requests
# Check what's taking time
```
**3. Check Database Queries**
```bash
# Slow query log
pscale database insights greyhaven-db main --slow-queries
# Look for:
# - N+1 queries (many small queries)
# - Missing indexes (full table scans)
# - Long-running queries (>100ms)
```
**4. Profile Application**
```bash
# Add timing middleware
# Log slow operations
# Identify bottleneck (DB, API, compute)
```
### Resolution Paths
**Path A: N+1 Queries**
```python
# Use eager loading: one extra query total, not one query per parent row
from sqlalchemy import select
from sqlalchemy.orm import selectinload

statement = (
    select(Order)  # Order, Order.items: your ORM models
    .options(selectinload(Order.items))
)
```
**Path B: Missing Indexes**
```sql
-- Add indexes
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_items_order_id ON order_items(order_id);
```
**Path C: No Caching**
```typescript
// Add caching (Redis shown; Workers KV or the Cache API follow the same pattern)
const cached = await redis.get(cacheKey);
if (cached) return cached;
const result = await expensiveOperation();
await redis.setex(cacheKey, 300, result); // 300s TTL
return result;
```
**Path D: Worker CPU Limit**
```typescript
// Optimize expensive operations
// Use async operations
// Offload to external service
```
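One concrete way to shave CPU from the critical path is to defer non-essential work with `ctx.waitUntil()`, which lets the response go out while the Worker finishes background tasks. A sketch; `handleRequest` and `writeAnalytics` are hypothetical helpers:
```typescript
export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext) {
    const response = await handleRequest(request, env);

    // Deferred: runs after the response is sent, off the critical path.
    // writeAnalytics() is a hypothetical helper for illustration.
    ctx.waitUntil(writeAnalytics(request, response, env));

    return response;
  }
};
```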
### Prevention
- Monitor p95 latency (alert >500ms)
- Test for N+1 queries in CI/CD
- Add indexes for foreign keys
- Implement caching layer
- Performance budgets in tests
---
## Runbook 5: Network Connectivity Issues
### Symptoms
- Intermittent failures
- DNS resolution errors
- Connection timeouts
- CORS errors
### Diagnosis Steps
**1. Test DNS Resolution**
```bash
# Check DNS
nslookup api.partner.com
dig api.partner.com
# Measure DNS time
time nslookup api.partner.com
# If >1s, DNS is slow
```
**2. Test Connectivity**
```bash
# Basic connectivity
ping api.partner.com
# Trace route
traceroute api.partner.com
# Full timing breakdown
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTotal: %{time_total}s\n" \
-o /dev/null -s https://api.partner.com
```
**3. Check CORS**
```bash
# Preflight request
curl -I -X OPTIONS https://api.greyhaven.io/api/users \
-H "Origin: https://app.greyhaven.io" \
-H "Access-Control-Request-Method: POST"
# Verify headers:
# - Access-Control-Allow-Origin
# - Access-Control-Allow-Methods
```
**4. Check Firewall/Security**
```bash
# Test from different location
# Check IP whitelist
# Verify SSL certificate
```
### Resolution Paths
**Path A: Slow DNS**
```typescript
// Implement DNS caching (see the DoH sketch below)
const DNS_CACHE = new Map();
// Cache entries for 60 seconds
```
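Workers don't expose raw DNS, so "DNS caching" there usually means resolving over DNS-over-HTTPS and memoizing the answer. A sketch using Cloudflare's public DoH JSON endpoint (the cache is per-isolate and best-effort):
```typescript
const DNS_CACHE = new Map<string, { ip: string; expires: number }>();

async function resolveCached(hostname: string): Promise<string> {
  const hit = DNS_CACHE.get(hostname);
  if (hit && hit.expires > Date.now()) return hit.ip;

  // Cloudflare's DNS-over-HTTPS JSON API
  const res = await fetch(
    `https://cloudflare-dns.com/dns-query?name=${hostname}&type=A`,
    { headers: { accept: 'application/dns-json' } }
  );
  const data: any = await res.json();
  const ip = data.Answer?.[0]?.data;
  if (!ip) throw new Error(`DNS resolution failed for ${hostname}`);

  DNS_CACHE.set(hostname, { ip, expires: Date.now() + 60_000 }); // 60s TTL
  return ip;
}
```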
**Path B: Connection Timeout**
```typescript
// Increase the timeout and wire the signal into the request
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 30000); // 30s
try {
  return await fetch(url, { signal: controller.signal });
} finally {
  clearTimeout(timer);
}
```
**Path C: CORS Error**
```typescript
// Add CORS headers
response.headers.set('Access-Control-Allow-Origin', origin);
response.headers.set('Access-Control-Allow-Methods', 'GET,POST,PUT,DELETE');
```
**Path D: SSL/TLS Issue**
```bash
# Check certificate
openssl s_client -connect api.partner.com:443
# Verify not expired
# Check certificate chain
```
### Prevention
- DNS caching (60s TTL)
- Appropriate timeouts (30s for external APIs)
- Health checks for external dependencies
- Circuit breakers for failures
- Monitor external API latency
---
## Emergency Procedures (SEV1)
**Immediate Actions**:
1. **Assess**: Users affected? Functionality broken? Data loss risk?
2. **Communicate**: Alert team, update status page
3. **Stop Bleeding**: `wrangler rollback` or disable feature
4. **Diagnose**: Logs, recent changes, metrics
5. **Fix**: Hotfix or workaround, test first
6. **Verify**: Monitor metrics, test functionality
7. **Postmortem**: Document, root cause, prevention
---
## Escalation Matrix
| Issue Type | First Response | Escalate To | Escalation Trigger |
|------------|---------------|-------------|-------------------|
| Worker errors | DevOps troubleshooter | incident-responder | SEV1/SEV2 |
| Performance | DevOps troubleshooter | performance-optimizer | >30min unresolved |
| Database | DevOps troubleshooter | data-validator | Schema issues |
| Security | DevOps troubleshooter | security-analyzer | Breach suspected |
| Application bugs | DevOps troubleshooter | smart-debug | Infrastructure ruled out |
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - Full troubleshooting examples
- **Diagnostic Commands**: [diagnostic-commands.md](diagnostic-commands.md) - Command reference
- **Cloudflare Guide**: [cloudflare-workers-guide.md](cloudflare-workers-guide.md) - Platform-specific
---
Return to [reference index](INDEX.md)

View File

@@ -0,0 +1,81 @@
# DevOps Troubleshooter Templates
Ready-to-use templates for infrastructure incident response, deployment checklists, and performance investigations.
## Available Templates
### Incident Report Template
**File**: [incident-report-template.md](incident-report-template.md)
Comprehensive template for documenting infrastructure incidents:
- **Incident Overview**: Summary, impact, timeline
- **Root Cause Analysis**: What happened, why it happened
- **Resolution Steps**: What was done to fix it
- **Prevention Measures**: How to prevent recurrence
- **Lessons Learned**: What went well, what could improve
**Use when**: Documenting production outages, degradations, or significant infrastructure issues
**Copy and fill in** all sections for your specific incident.
---
### Deployment Checklist
**File**: [deployment-checklist.md](deployment-checklist.md)
Pre-deployment and post-deployment verification checklist:
- **Pre-Deployment Verification**: Code review, tests, dependencies, configuration
- **Deployment Steps**: Backup, deploy, verify, rollback plan
- **Post-Deployment Monitoring**: Health checks, metrics, logs, alerts
- **Rollback Procedures**: When and how to rollback
**Use when**: Deploying Cloudflare Workers, database migrations, infrastructure changes
**Check off** each item before and after deployment.
---
### Performance Investigation Template
**File**: [performance-investigation-template.md](performance-investigation-template.md)
Systematic template for investigating performance issues:
- **Performance Baseline**: Current metrics vs expected
- **Hypothesis Generation**: Potential root causes
- **Data Collection**: Profiling, metrics, logs
- **Analysis**: What the data reveals
- **Optimization Plan**: Prioritized fixes with impact estimates
- **Validation**: Before/after metrics
**Use when**: API latency increases, database slow queries, high CPU/memory usage
**Follow systematically** to diagnose and resolve performance problems.
---
## Template Usage
**How to use these templates**:
1. Copy the template file to your project documentation
2. Fill in all sections marked with `[FILL IN]` placeholders
3. Remove sections that don't apply (optional)
4. Share with your team for review
**When to create reports**:
- **Incident Report**: After any production incident (SEV1-SEV3)
- **Deployment Checklist**: Before every production deployment
- **Performance Investigation**: When performance degrades >20%
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - Real-world troubleshooting walkthroughs
- **Reference**: [Reference Index](../reference/INDEX.md) - Runbooks and diagnostic commands
- **Main Agent**: [devops-troubleshooter.md](../devops-troubleshooter.md) - DevOps troubleshooter agent
---
Return to [main agent](../devops-troubleshooter.md)