gh-iciakky-cc-general-skills/skills/error-troubleshooter/references/systematic-debugging-methodology.md
2025-11-29 18:47:55 +08:00

Systematic Debugging Methodology

Guiding Principle: Occam's Razor for Debugging

When facing mysterious errors, the root cause is almost always simpler than it appears.

Common error distribution in real-world debugging:

  • 70%: Configuration issues (wrong parameters, missing flags, incorrect paths)
  • 20%: Environment issues (missing dependencies, version mismatches, path problems)
  • 8%: Data format issues (encoding, structure assumptions)
  • 2%: Actual bugs in external APIs or fundamental incompatibilities

Core Rule: Exhaust simple hypotheses before considering complex ones. Authentication failures are configuration problems 99% of the time, not API design flaws.


1. Hypothesis Priority Framework

Always Start Here: The "Boring Checklist"

Before investigating complex hypotheses, verify these fundamentals:

  1. Configuration Loading

    • Are environment variables actually loaded? (Not just "file exists")
    • Are config files in the correct location relative to execution path?
    • Are there hidden default parameters overriding your explicit config?
  2. Exact Version/Name Matching

    • Do library versions exactly match those used in the documentation examples?
    • Are API endpoint names exactly correct (not similar, not partially matching)?
    • Are model/service names character-perfect? (e.g., preview-09-2025 vs. preview-2025-09)
  3. First-Hand Error Visibility

    • Have you personally executed the failing code?
    • Have you seen the complete error output (not summarized or truncated)?
    • Are there error details hidden in non-obvious places (WebSocket close codes, HTTP headers)?
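
The "character-perfect" check can be partly automated. The sketch below is one possible approach, not an established tool: it flags known names that contain the same tokens as the requested name but in a different order or with different separators. The function names and the `example-model-*` identifiers are hypothetical placeholders.

```javascript
// Hypothetical helper: split a name into tokens and sort them,
// so that token order and separator style no longer matter.
function normalizeTokens(name) {
  return name.toLowerCase().split(/[-_.\/:\s]+/).filter(Boolean).sort().join(' ');
}

// Report whether the requested name matches exactly, and if not,
// which known names differ only by token order or separators.
function findNearMisses(requested, knownNames) {
  if (knownNames.includes(requested)) return { exact: true, nearMisses: [] };
  const target = normalizeTokens(requested);
  const nearMisses = knownNames.filter((known) => normalizeTokens(known) === target);
  return { exact: false, nearMisses };
}

// Illustrative names only; substitute the list your provider actually publishes.
const known = ['example-model-preview-09-2025', 'example-model-stable'];
console.log(findNearMisses('example-model-preview-2025-09', known));
```

A non-empty `nearMisses` list is a strong hint that the name was typed from memory rather than copied from the documentation.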

Hypothesis Ordering Template

## Problem: [Description]

### Priority 1: Configuration Issues (Check First)
- [ ] Hypothesis 1a: Config file not loaded from expected path
      Verification: Add logging immediately after config load
- [ ] Hypothesis 1b: SDK using unexpected default parameters
      Verification: Log all parameters passed to SDK constructor
- [ ] Hypothesis 1c: Authentication credentials format issue
      Verification: Check credential string length, prefix, encoding

### Priority 2: Environment Issues
- [ ] Hypothesis 2a: Dependency version mismatch
      Verification: Check package.json vs installed versions
- [ ] Hypothesis 2b: Runtime environment differences
      Verification: Compare working vs failing environment variables

### Priority 3: Data Format Issues
- [ ] Hypothesis 3a: Incorrect data structure assumptions
      Verification: Log raw data structure, don't assume
- [ ] Hypothesis 3b: Encoding mismatch (UTF-8, base64, binary)
      Verification: Inspect first few bytes/characters

### Priority 4: Complex Issues (Only After Above Exhausted)
- [ ] Hypothesis 4a: API limitation or bug
      Verification: Find official documentation or working examples
- [ ] Hypothesis 4b: Fundamental incompatibility
      Verification: Find counter-examples proving it can work
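
Hypothesis 2a (dependency version mismatch) is cheap to verify with a script. The following is a minimal sketch, assuming a standard Node.js layout where dependencies are installed under `node_modules` next to `package.json`; hoisted or workspace installs may live elsewhere. The function names are illustrative.

```javascript
const fs = require('fs');
const path = require('path');

// Read the version actually installed under node_modules, or null if absent.
function readInstalledVersion(projectDir, name) {
  const manifest = path.join(projectDir, 'node_modules', name, 'package.json');
  try {
    return JSON.parse(fs.readFileSync(manifest, 'utf-8')).version ?? null;
  } catch {
    return null; // Not installed here, or manifest unreadable
  }
}

// List every declared dependency next to the version found on disk.
function reportDependencyVersions(projectDir) {
  const pkg = JSON.parse(fs.readFileSync(path.join(projectDir, 'package.json'), 'utf-8'));
  const declared = { ...pkg.dependencies, ...pkg.devDependencies };
  return Object.entries(declared).map(([name, range]) => ({
    name,
    declared: range,
    installed: readInstalledVersion(projectDir, name),
  }));
}

// Print a table for the current directory if it looks like a Node.js project.
if (fs.existsSync(path.join(process.cwd(), 'package.json'))) {
  console.table(reportDependencyVersions(process.cwd()));
} else {
  console.log('No package.json found in', process.cwd());
}
```

The script only places the declared range and the installed version side by side; deciding whether they are compatible is left to the reader (or to a semver library).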

2. Isolation Strategy: Diagnostic Scripts

The Most Powerful Debugging Technique: Create minimal, single-purpose scripts that test ONE thing at a time.

Why This Works

  • Eliminates confounding variables: Complex scripts have multiple failure points
  • Provides clear pass/fail criteria: If this 10-line script fails, the problem is X
  • Builds confidence incrementally: Each passing diagnostic narrows the search space
  • Creates reusable verification tools: Same scripts can verify fixes

Diagnostic Script Principles

  1. One Variable Per Script: Test configuration loading separately from API calls separately from data processing
  2. Verbose Logging: Log every step, every value, every assumption
  3. Explicit Success Criteria: Script should clearly output PASS or FAIL
  4. No Assumptions: Check everything explicitly, even "obvious" things
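
Principle 3 (explicit success criteria) can be enforced with a tiny check runner. This is a sketch rather than a prescribed harness: `runChecks` and the sample config are hypothetical, and each check is simply a named function that returns `true` on success.

```javascript
// Minimal check runner: each check is a named function returning true/false.
// A check that throws is counted as a failure and its message is shown.
function runChecks(checks) {
  let failed = 0;
  for (const [name, check] of Object.entries(checks)) {
    let ok = false;
    let detail = '';
    try {
      ok = check() === true;
    } catch (error) {
      detail = ` (threw: ${error.message})`;
    }
    if (!ok) failed += 1;
    console.log(`${ok ? '✓ PASS' : '✗ FAIL'} ${name}${detail}`);
  }
  console.log(failed === 0 ? '\nRESULT: PASS' : `\nRESULT: FAIL (${failed} failed)`);
  return failed;
}

// Sample data for the demo; the key deliberately contains a space.
const sampleConfig = { apiKey: 'abc 123' };
const failedCount = runChecks({
  'apiKey is a string': () => typeof sampleConfig.apiKey === 'string',
  'apiKey has no whitespace': () => !/\s/.test(sampleConfig.apiKey),
});
```

In a real diagnostic, setting `process.exitCode` from the returned failure count lets shell scripts and CI react to the result.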

Template: Configuration Diagnostic

// diagnose-config.js
// Purpose: Verify configuration loading and format
// Success criteria: All checks pass with ✓

console.log('=== Configuration Diagnostic ===\n');

// 1. Environment
console.log('--- Environment ---');
console.log('Node version:', process.version);
console.log('Working directory:', process.cwd());
console.log('Script location:', __dirname);

// 2. Config File Loading
console.log('\n--- Config File ---');
const fs = require('fs');
const path = require('path');

const configPath = path.join(__dirname, '.env');
console.log('Expected path:', configPath);
console.log('File exists:', fs.existsSync(configPath));

if (fs.existsSync(configPath)) {
  const content = fs.readFileSync(configPath, 'utf-8');
  console.log('File size:', Buffer.byteLength(content, 'utf-8'), 'bytes');
  console.log('Lines:', content.split('\n').length);
}

// 3. Environment Variable
require('dotenv').config({ path: configPath });
console.log('\n--- Environment Variable ---');
console.log('API_KEY defined:', !!process.env.API_KEY);
console.log('API_KEY length:', process.env.API_KEY?.length);
console.log('API_KEY prefix:', process.env.API_KEY?.slice(0, 4));
console.log('API_KEY has whitespace:', /\s/.test(process.env.API_KEY || ''));

// 4. Format Validation
console.log('\n--- Format Validation ---');
const expectedPrefix = 'AIza'; // Example for Google API keys
const expectedLength = 39;

const prefixMatch = process.env.API_KEY?.startsWith(expectedPrefix);
const lengthMatch = process.env.API_KEY?.length === expectedLength;

console.log(prefixMatch ? '✓' : '✗', 'Prefix matches expected format');
console.log(lengthMatch ? '✓' : '✗', 'Length matches expected format');

// 5. SDK Constructor (No Network Call)
console.log('\n--- SDK Initialization ---');
try {
  const SDK = require('./sdk'); // Your SDK
  const client = new SDK({
    apiKey: process.env.API_KEY,
    // Log all parameters, including defaults
  });
  console.log('✓ SDK initialized without errors');
  console.log('Client config:', JSON.stringify(client.config, null, 2));
} catch (error) {
  console.log('✗ SDK initialization failed:', error.message);
}

console.log('\n=== Diagnostic Complete ===');

Template: Data Structure Diagnostic

// diagnose-data-structure.js
// Purpose: Inspect actual data structure without assumptions
// Success criteria: Understand exact structure and location of target data

function inspectObject(obj, path = 'root', maxDepth = 3, currentDepth = 0) {
  const indent = '  '.repeat(currentDepth);

  console.log(`${indent}${path}:`);
  console.log(`${indent}  Type: ${typeof obj}`);
  console.log(`${indent}  Null: ${obj === null}`);
  console.log(`${indent}  Undefined: ${obj === undefined}`);

  if (obj === null || obj === undefined) return;

  if (typeof obj === 'object') {
    if (Buffer.isBuffer(obj)) {
      console.log(`${indent}  [Buffer: ${obj.length} bytes]`);
      console.log(`${indent}  First 20 bytes: ${obj.slice(0, 20).toString('hex')}`);
      return;
    }

    if (Array.isArray(obj)) {
      console.log(`${indent}  [Array: ${obj.length} items]`);
      if (obj.length > 0 && currentDepth < maxDepth) {
        inspectObject(obj[0], `${path}[0]`, maxDepth, currentDepth + 1);
      }
      return;
    }

    const keys = Object.keys(obj);
    console.log(`${indent}  Keys: [${keys.join(', ')}]`);

    if (currentDepth < maxDepth) {
      for (const key of keys.slice(0, 5)) { // Limit to first 5 keys
        inspectObject(obj[key], `${path}.${key}`, maxDepth, currentDepth + 1);
      }
    }
  } else if (typeof obj === 'string') {
    console.log(`${indent}  Length: ${obj.length}`);
    console.log(`${indent}  Preview: "${obj.slice(0, 50)}${obj.length > 50 ? '...' : ''}"`);

    // Check encoding hints
    const isBase64Like = /^[A-Za-z0-9+/=]+$/.test(obj);
    const isHexLike = /^[0-9a-fA-F]+$/.test(obj);
    console.log(`${indent}  Looks like base64: ${isBase64Like}`);
    console.log(`${indent}  Looks like hex: ${isHexLike}`);
  } else {
    console.log(`${indent}  Value: ${obj}`);
  }
}

// Usage example with API response
const response = getAPIResponse(); // Your actual API call
console.log('=== Raw Response Structure ===\n');
inspectObject(response);

// Specific path investigation
console.log('\n=== Investigating Suspected Path ===');
console.log('response.data exists:', !!response.data);
console.log('response.body exists:', !!response.body);
console.log('response.payload exists:', !!response.payload);

// If you're looking for binary data
console.log('\n=== Binary Data Search ===');
function findBuffers(obj, path = 'root') {
  if (Buffer.isBuffer(obj)) {
    console.log(`Found Buffer at: ${path} (${obj.length} bytes)`);
    return;
  }
  if (typeof obj === 'object' && obj !== null) {
    for (const key in obj) {
      findBuffers(obj[key], `${path}.${key}`);
    }
  }
}
findBuffers(response);

Template: API Interaction Diagnostic

// diagnose-api-minimal.js
// Purpose: Minimal API call to isolate authentication from functionality
// Success criteria: Connection succeeds, response received (any response)

console.log('=== Minimal API Test ===\n');

const API = require('./api-client');

async function testMinimalConnection() {
  console.log('1. Creating client...');
  const client = new API({
    apiKey: process.env.API_KEY,
    // Start with absolute minimal config
  });

  console.log('2. Attempting connection...');
  try {
    await client.connect();
    console.log('✓ Connection established');
  } catch (error) {
    console.log('✗ Connection failed:', error.message);
    console.log('Error code:', error.code);
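    // message and stack are non-enumerable, so list the fields explicitly
    console.log('Error details:', JSON.stringify(error, Object.getOwnPropertyNames(error), 2));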
    process.exit(1);
  }

  console.log('3. Sending minimal request...');
  try {
    // Simplest possible request
    const response = await client.send({ message: 'ping' });
    console.log('✓ Response received');
    console.log('Response type:', typeof response);
    console.log('Response keys:', Object.keys(response));
  } catch (error) {
    console.log('✗ Request failed:', error.message);
  }

  console.log('4. Closing connection...');
  await client.close();
  console.log('✓ Test complete');
}

testMinimalConnection().catch(console.error);

Real-World Example: WebSocket Authentication Mystery

Problem: WebSocket connection immediately closes with code 1007 "API key not valid"

Initial Approach (jumping to complex hypotheses):

  • "Maybe the API doesn't support this model"
  • "Maybe the feature flag is disabled for my account"
  • "Maybe there's a rate limit"

Diagnostic Script Approach:

// diagnose-websocket-auth.js
console.log('=== WebSocket Auth Diagnostic ===\n');

// Step 1: Verify API key loading
console.log('--- API Key ---');
console.log('Loaded:', !!process.env.API_KEY);
console.log('Length:', process.env.API_KEY?.length);
console.log('Format:', process.env.API_KEY?.slice(0, 4) + '...' + process.env.API_KEY?.slice(-4));

// Step 2: Check SDK configuration
console.log('\n--- SDK Config ---');
const SDK = require('./sdk'); // Your SDK
const sdk = new SDK({
  apiKey: process.env.API_KEY,
});

// KEY DISCOVERY: Log the actual endpoint being used
console.log('Endpoint:', sdk.endpoint); // Revealed: Using Vertex AI endpoint!

// Step 3: Check SDK defaults
console.log('Default vertexai:', sdk.config.vertexai); // Revealed: true by default!

// Step 4: Test with explicit configuration
console.log('\n--- Testing explicit vertexai: false ---');
const sdkFixed = new SDK({
  apiKey: process.env.API_KEY,
  vertexai: false, // Explicit override
});
console.log('Endpoint:', sdkFixed.endpoint); // Now using correct endpoint!

Result: The diagnostic revealed that the SDK was running with vertexai: true, so requests went to the Vertex AI endpoint instead of the Gemini Developer API endpoint. The fix was a single parameter.

Time saved: This 15-line diagnostic script found the issue in 2 minutes. The alternative (reading SDK source code or trial-and-error config changes) would have taken hours.
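
The same idea generalizes: compare what you passed to a client with what it ended up using. The sketch below assumes the client exposes its effective settings on a `config` property, as in the diagnostic above. `ExampleClient` is a stand-in, not a real SDK, and its defaults are invented for the demo.

```javascript
// List settings the client ended up with that the caller never passed.
function findImplicitSettings(passedOptions, effectiveConfig) {
  return Object.keys(effectiveConfig)
    .filter((key) => !(key in passedOptions))
    .map((key) => ({ key, value: effectiveConfig[key] }));
}

// Stand-in for a real SDK: a hypothetical client that applies its own defaults.
class ExampleClient {
  constructor(options) {
    this.config = { vertexai: true, timeoutMs: 30000, ...options };
  }
}

const passedOptions = { apiKey: 'placeholder-key' };
const exampleClient = new ExampleClient(passedOptions);
for (const { key, value } of findImplicitSettings(passedOptions, exampleClient.config)) {
  console.log(`implicit default: ${key} = ${JSON.stringify(value)}`);
}
```

Every line this prints is a setting you did not choose, which makes each one a candidate for the Priority 1 hypothesis list.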


3. Comparison with Working Examples

Why This Is Critical

When you have a working example (different language, different version, official sample), it's a treasure map showing the correct configuration.

Working example exists → Problem is in the differences

Systematic Comparison Process

  1. Find the Working Example

    • Official SDK examples (preferred)
    • Successful previous implementations in your codebase
    • Community examples with verified success (check issues/discussions)
  2. Compare Layer by Layer

    ## Comparison Checklist

    ### Language/Runtime
    - [ ] Working: Python 3.11, Failing: Node.js 20
    - [ ] Any known language-specific issues?

    ### SDK Versions
    - [ ] Working: v2.1.0, Failing: v2.3.1
    - [ ] Check changelog between versions

    ### Configuration Parameters
    Working (Python):
        client = Client(api_key=key)  # Only 1 parameter
    Failing (JavaScript):
        const client = new Client({ apiKey: key, ...manyOtherParams });
    - [ ] What are those other params? What are their defaults?

    ### API Endpoint
    - [ ] Working: api.service.com, Failing: ???
    - [ ] Log actual endpoint used by SDK

    ### Request Format
    - [ ] Compare actual HTTP/WebSocket frames sent (use network inspector)

    
  3. Identify Hidden Differences

    Common gotchas:

    • Default parameters: JavaScript SDK has vertexai: true default, Python doesn't have this parameter
    • Authentication methods: One uses header, another uses query param
    • Endpoint URLs: SDKs may auto-select endpoints based on config
    • Retry behavior: One SDK retries automatically, hiding transient failures
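
Once both setups are written down as key/value pairs, the differences can be listed mechanically. This is a minimal sketch for flat objects only; `diffSettings` is a hypothetical helper, and the `authVia` and `vertexai` entries are made-up illustration values.

```javascript
// Compare two flat settings objects and group the differences.
function diffSettings(working, failing) {
  const result = { onlyInWorking: [], onlyInFailing: [], changed: [] };
  for (const key of new Set([...Object.keys(working), ...Object.keys(failing)])) {
    if (!(key in failing)) {
      result.onlyInWorking.push(key);
    } else if (!(key in working)) {
      result.onlyInFailing.push(key);
    } else if (working[key] !== failing[key]) {
      result.changed.push({ key, working: working[key], failing: failing[key] });
    }
  }
  return result;
}

// Illustrative snapshots of a working and a failing setup.
const workingSetup = { sdkVersion: 'v2.1.0', endpoint: 'api.service.com', authVia: 'header' };
const failingSetup = {
  sdkVersion: 'v2.3.1',
  endpoint: 'api.service.com',
  authVia: 'header',
  vertexai: true,
};
console.log(JSON.stringify(diffSettings(workingSetup, failingSetup), null, 2));
```

Each entry in the output is a difference to explain or eliminate; settings that match in both setups can be dropped from the search.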

Example: Cross-Language Comparison

Problem: Python POC works, JavaScript POC fails with authentication error

Comparison:

# Python (WORKING)
import google.generativeai as genai

genai.configure(api_key=api_key)  # Simple, one-line config
model = genai.GenerativeModel('gemini-2.5-flash')
response = model.generate_content('Hello')

// JavaScript (FAILING)
const { GoogleGenAI } = require('@google/genai');

const ai = new GoogleGenAI({
  apiKey: process.env.API_KEY,
  // What's different?
});

Investigation:

  1. Python library source: genai.configure() only sets the API key, no other parameters
  2. JavaScript SDK docs: Constructor accepts a vertexai parameter (reported default in this case: true; confirm against your installed version)
  3. Hypothesis: The JavaScript client is being routed to the Vertex AI endpoint

Verification:

const ai = new GoogleGenAI({
  apiKey: process.env.API_KEY,
  vertexai: false,  // Match Python's implicit behavior
});

Result: Fixed. The difference was an implicit vs explicit endpoint selection.


4. Data Structure Verification: Never Assume

The Anti-Pattern

// ❌ Assumption-based code
const audioData = response.data; // Assuming 'data' contains audio
audioFile.write(audioData);
// Result: Writes undefined or wrong data, produces corrupted file

The Correct Pattern

// ✅ Verification-first code

// Step 1: Inspect actual structure
console.log('Response keys:', Object.keys(response));
console.log('Response structure:', JSON.stringify(response, null, 2).slice(0, 500));

// Step 2: Search for target data
function findAudioData(obj, path = 'response') {
  if (Buffer.isBuffer(obj)) {
    console.log(`Found Buffer at ${path}: ${obj.length} bytes`);
    return; // Don't recurse into the Buffer's individual bytes
  }
  if (typeof obj === 'object' && obj !== null) {
    for (const [key, value] of Object.entries(obj)) {
      if (key.includes('audio') || key.includes('data')) {
        console.log(`Candidate at ${path}.${key}:`, typeof value);
      }
      findAudioData(value, `${path}.${key}`);
    }
  }
}
findAudioData(response);

// Step 3: Verify encoding
const candidateData = response.serverContent.modelTurn.parts[0].inlineData.data;
console.log('Data type:', typeof candidateData);
console.log('First 50 chars:', candidateData.slice(0, 50));
console.log('Looks like base64:', /^[A-Za-z0-9+/=]+$/.test(candidateData));

// Step 4: Test decoding
const decoded = Buffer.from(candidateData, 'base64');
console.log('Decoded size:', decoded.length, 'bytes');
console.log('First 10 bytes (hex):', decoded.slice(0, 10).toString('hex'));

// Step 5: Use verified data
audioFile.write(decoded); // Now confident this is correct
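
Inspecting the first bytes becomes more useful when they are compared against known file signatures. The sketch below checks a few widely documented audio container signatures (RIFF/WAVE, OggS, fLaC, ID3). `sniffAudioFormat` is a hypothetical helper, the list is far from exhaustive, and the input is a fake header built for the demo rather than real API output.

```javascript
// Guess a container format from the leading bytes of a Buffer.
function sniffAudioFormat(buffer) {
  const ascii = (start, end) => buffer.subarray(start, end).toString('latin1');
  if (ascii(0, 4) === 'RIFF' && ascii(8, 12) === 'WAVE') return 'wav';
  if (ascii(0, 4) === 'OggS') return 'ogg';
  if (ascii(0, 4) === 'fLaC') return 'flac';
  if (ascii(0, 3) === 'ID3') return 'mp3 (ID3 tag)';
  return 'unknown (possibly raw PCM or another format)';
}

// Build a tiny fake WAV header, base64-encode it, then decode and sniff it.
const fakeHeader = Buffer.concat([
  Buffer.from('RIFF'),
  Buffer.alloc(4), // chunk size placeholder
  Buffer.from('WAVE'),
]);
const decodedSample = Buffer.from(fakeHeader.toString('base64'), 'base64');
console.log('Detected format:', sniffAudioFormat(decodedSample));
```

An `unknown` result is still informative: it suggests the payload has no container header, so writing it straight to a `.wav` file would likely produce an unplayable file.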

Layer-by-Layer Verification Template

// For nested data structures (e.g., API responses, message objects)

function verifyPath(obj, pathString) {
  console.log(`\n=== Verifying: ${pathString} ===`);

  const parts = pathString.split('.');
  let current = obj;
  let currentPath = 'root';

  for (const part of parts) {
    currentPath += `.${part}`;

    console.log(`Checking ${currentPath}...`);

    if (current === null || current === undefined) {
      console.log(`✗ Path broken at ${currentPath}: value is ${current}`);
      return null;
    }

    if (typeof current !== 'object') {
      console.log(`✗ Path broken at ${currentPath}: not an object (${typeof current})`);
      return null;
    }

    if (!(part in current)) {
      console.log(`✗ Key "${part}" doesn't exist`);
      console.log(`  Available keys:`, Object.keys(current));
      return null;
    }

    console.log(`✓ ${part} exists`);
    current = current[part];

    if (Array.isArray(current)) {
      console.log(`  (Array with ${current.length} items)`);
    } else if (Buffer.isBuffer(current)) {
      console.log(`  (Buffer with ${current.length} bytes)`);
    } else {
      console.log(`  (${typeof current})`);
    }
  }

  console.log(`\n✓ Full path verified: ${pathString}`);
  return current;
}

// Usage
const audioData = verifyPath(
  response,
  'serverContent.modelTurn.parts.0.inlineData.data'
);

5. Evidence Quality Hierarchy

Not all evidence is equal. Rank your evidence sources:

Tier 1: Direct Verification (Strongest)

  • Running code that succeeds/fails in front of you
  • Network traffic you personally captured
  • Logs you personally generated with verbose flags
  • Binary data you inspected byte-by-byte

Tier 2: Official Sources

  • Official API documentation (with version number matching yours)
  • Official SDK examples (with version number matching yours)
  • Official changelog entries

Tier 3: Working Examples

  • Community examples with verified success (stars, recent activity)
  • Stack Overflow answers with upvotes and recent dates
  • Your own previous successful implementations

Tier 4: Problem Reports

  • GitHub Issues (open or closed)
  • Stack Overflow questions (problems, not solutions)
  • Forum discussions

Tier 5: Speculation (Weakest)

  • "I think this API doesn't support..."
  • "This probably means..."
  • Assumptions based on API names or parameter names

Applying the Hierarchy

Scenario: Investigating why authentication fails

## Evidence Analysis

### Hypothesis: API doesn't support this authentication method

Evidence collected:
1. [Tier 4] GitHub Issue #123: User reports auth failure (OPEN, no resolution)
2. [Tier 5] Parameter named "beta" suggests experimental feature
3. [Tier 2] Official docs state: "Authentication via API key is supported"
4. [Tier 3] Example repo uses API key successfully (last updated 2 months ago)

Conclusion:
- Tier 2 (official docs) contradicts hypothesis
- Tier 3 (working example) disproves hypothesis
- Tier 4 evidence (open issue) indicates others hit same problem, but doesn't prove API limitation
- Hypothesis REJECTED: Auth method IS supported, problem is likely in configuration

Key Principle: Stronger evidence (a lower tier number) overrules weaker evidence. When sources conflict, trust the one closest to direct verification.
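
The ranking rule can be written down as a small function. This is one possible encoding, not a formal method: only the evidence at the strongest tier present decides the verdict, and a tie within that tier is reported as inconclusive. The `supports` flag and `evaluateHypothesis` name are assumptions made for this sketch.

```javascript
// Tier 1 is strongest, Tier 5 weakest, as in the hierarchy above.
// Only evidence at the strongest tier present decides the verdict.
function evaluateHypothesis(evidence) {
  const strongestTier = Math.min(...evidence.map((item) => item.tier));
  const decisive = evidence.filter((item) => item.tier === strongestTier);
  const supports = decisive.some((item) => item.supports);
  const contradicts = decisive.some((item) => !item.supports);
  let verdict = 'INCONCLUSIVE';
  if (supports && !contradicts) verdict = 'SUPPORTED';
  if (contradicts && !supports) verdict = 'REJECTED';
  return { strongestTier, verdict };
}

// The four items from the scenario above, for the hypothesis
// "API doesn't support this authentication method".
const authEvidence = [
  { tier: 4, supports: true, note: 'Open GitHub issue reports auth failure' },
  { tier: 5, supports: true, note: 'Parameter named "beta"' },
  { tier: 2, supports: false, note: 'Official docs: API key auth is supported' },
  { tier: 3, supports: false, note: 'Example repo uses API key successfully' },
];
console.log(evaluateHypothesis(authEvidence));
```

For the scenario above this reports the Tier 2 documentation as decisive and rejects the hypothesis, matching the written conclusion.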


6. The First-Hand Execution Rule

Rule: Before forming conclusions, personally execute the failing code and observe the complete output.

Why This Matters

Second-hand error reports often omit critical details:

  • Full error messages (users summarize or truncate)
  • Error codes (users paste message but not code)
  • Preceding warnings (users skip "unimportant" output)
  • Environment differences (users assume their env is "normal")

Checklist Before Forming Hypothesis

  • Have I executed the failing code myself?
  • Have I seen the complete console output (not summarized)?
  • Have I checked for errors in non-obvious places (close codes, HTTP status, exit codes)?
  • Have I added extra logging to expose internal state?
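
To make sure nothing is lost when reproducing a failure, run the failing command through a wrapper that keeps stdout, stderr, and the exit code together. The following sketch uses Node's built-in `child_process.spawnSync`; the `runAndCapture` name and the demo child script are illustrative.

```javascript
const { spawnSync } = require('child_process');

// Run a command and keep everything: stdout, stderr, exit code, and signal.
function runAndCapture(command, args) {
  const result = spawnSync(command, args, { encoding: 'utf-8' });
  return {
    exitCode: result.status,
    signal: result.signal,
    stdout: result.stdout,
    stderr: result.stderr,
    spawnError: result.error ? result.error.message : null,
  };
}

// Demo: a child process that prints a result, warns on stderr, and exits non-zero.
const captured = runAndCapture(process.execPath, [
  '-e',
  "console.log('partial result'); console.error('warning: config not found'); process.exitCode = 3;",
]);
console.log(captured);
```

A summary of this run might be "it printed a partial result"; the captured record also shows the warning on stderr and the non-zero exit code.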

Example: The Hidden Close Code

Second-hand report: "The WebSocket connection closes immediately with no error"

Assumptions formed:

  • "No error" → Maybe timeout?
  • "Closes immediately" → Maybe connection refused?

First-hand execution:

// Added logging
websocket.on('close', (code, reason) => {
  console.log('Close code:', code);      // Revealed: 1007
  // toString() handles clients that deliver the reason as a Buffer
  console.log('Close reason:', reason.toString());  // Revealed: "API key not valid"
});

Result: The error WAS present, just not logged by default. The close reason ("API key not valid"), delivered with close code 1007, immediately pointed to an authentication issue.
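
Close codes are easier to act on when logged with their registered meaning. The sketch below maps the standard codes from RFC 6455 (section 7.4.1) and the IANA WebSocket close code registry to their names. `describeClose` is a hypothetical helper, and servers may attach their own meaning to a code through the reason text, as in the example above.

```javascript
// Standard close codes (RFC 6455 section 7.4.1 / IANA registry).
const CLOSE_CODES = {
  1000: 'Normal Closure',
  1001: 'Going Away',
  1002: 'Protocol Error',
  1003: 'Unsupported Data',
  1005: 'No Status Received',
  1006: 'Abnormal Closure',
  1007: 'Invalid Frame Payload Data',
  1008: 'Policy Violation',
  1009: 'Message Too Big',
  1010: 'Mandatory Extension',
  1011: 'Internal Error',
  1015: 'TLS Handshake Failure',
};

// Format a close event as one log line, whatever type the reason arrives as.
function describeClose(code, reason) {
  const name = CLOSE_CODES[code] ?? 'Unregistered or application-specific code';
  // Some clients deliver the reason as a Buffer, others as a string.
  const text = Buffer.isBuffer(reason) ? reason.toString('utf-8') : String(reason ?? '');
  return `code=${code} (${name}) reason=${text ? JSON.stringify(text) : '<empty>'}`;
}

console.log(describeClose(1007, Buffer.from('API key not valid')));
console.log(describeClose(1006, ''));
```

Logging both the code name and the reason matters here: the generic meaning of 1007 is a payload problem, and only the reason text reveals that the server rejected the API key.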


7. Debugging Session Template

Use this template to structure your investigation:

## Problem Statement
[Concise description of unexpected behavior]

## Environment
- Runtime: [Node.js 20.11.0, Python 3.11, etc.]
- Library versions: [Exact versions from package.json/requirements.txt]
- OS: [If potentially relevant]

## Reproduction
[Minimal code that reproduces the issue]

## Expected vs Actual
- Expected: [What should happen]
- Actual: [What actually happens, with exact error messages]

## Hypothesis Priority List

### Priority 1: Configuration (Check First)
- [ ] Hypothesis 1a: [Specific config issue]
      Evidence needed: [What would prove/disprove this]
      Diagnostic: [Script or test to verify]

### Priority 2: Environment
- [ ] Hypothesis 2a: [Specific env issue]
      Evidence needed: [...]
      Diagnostic: [...]

### Priority 3: Data Format
[...]

### Priority 4: Complex Issues (Only if above exhausted)
[...]

## Evidence Collected

### [Hypothesis 1a]
- **Status**: ✓ PROVEN / ✗ DISPROVEN / ⚠ INCONCLUSIVE
- **Evidence tier**: [1-5]
- **Details**: [What you found]
- **Source**: [Where this evidence came from]

[Repeat for each hypothesis]

## Working Examples Comparison

### Python Implementation (WORKING)
[Code]

### JavaScript Implementation (FAILING)
[Code]

### Differences Identified
1. [Difference 1]
2. [Difference 2]

## Solution

### Root Cause
[Final verified cause, with evidence tier]

### Fix Applied
[Exact code change]

### Verification
[How you verified the fix works]

## Lessons Learned
[What would make this faster next time]


8. Common Anti-Patterns to Avoid

Anti-Pattern 1: "Debugging by Modification"

Wrong:

// Try random changes hoping something works
const client = new API({ apiKey: key, timeout: 5000 });  // Doesn't work
const client = new API({ apiKey: key, timeout: 10000 }); // Doesn't work
const client = new API({ apiKey: key, retry: true });    // Doesn't work
// [30 more random attempts...]

Right:

// Diagnose THEN fix
// 1. Create diagnostic to understand current behavior
// 2. Form hypothesis based on diagnostic output
// 3. Make targeted change
// 4. Verify with diagnostic

Anti-Pattern 2: "Complex First"

Wrong: "The API must not support this feature with this model configuration"

Right: "Let me first check if my API key is even loading correctly"

Anti-Pattern 3: "Assumption Stacking"

Wrong:

// Assuming response.data exists
// Assuming it's a Buffer
// Assuming it's in the right format
fs.writeFileSync('output.wav', response.data);

Right:

// Verify each assumption
console.log('data exists:', !!response.data);
console.log('data type:', typeof response.data);
console.log('is Buffer:', Buffer.isBuffer(response.data));
// [Then use data]

Anti-Pattern 4: "Trust the Summary"

Wrong: User says "no error", assume there's no error

Right: Execute code yourself, log everything, find the hidden error code


9. Speed Optimization: Parallel Diagnostics

Once you've identified multiple hypotheses, test them in parallel when possible.

Pattern: Parallel Diagnostic Scripts

# Instead of running diagnostics sequentially:
node diagnose-config.js        # 5 seconds
node diagnose-api.js           # 10 seconds
node diagnose-data-format.js   # 5 seconds
# Total: 20 seconds sequential

# Run in parallel:
node diagnose-config.js &
node diagnose-api.js &
node diagnose-data-format.js &
wait
# Total: 10 seconds (limited by slowest)

Pattern: Multi-Hypothesis Test Script

// test-all-hypotheses.js
async function testAll() {
  const tests = [
    testConfigLoading,
    testAPIEndpoint,
    testDataFormat,
    testVersionCompatibility
  ];

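  // allSettled already records rejections; catching them here would turn
  // every failure into a fulfilled result and report it as PASSED.
  const results = await Promise.allSettled(
    tests.map(async (test) => test())
  );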

  results.forEach((result, i) => {
    console.log(`\nTest ${i + 1}: ${tests[i].name}`);
    if (result.status === 'fulfilled') {
      console.log('✓ PASSED');
    } else {
      console.log('✗ FAILED:', result.reason);
    }
  });
}

Application to Error Troubleshooting

When using the error-troubleshooter skill:

  1. Start with Occam's Razor: Always check configuration and environment issues first (90% of problems)
  2. Create Diagnostic Scripts: Write minimal scripts to isolate variables
  3. Compare with Working Examples: If it works somewhere else, find the differences
  4. Never Assume Data Structure: Verify every layer explicitly
  5. Rank Your Evidence: Tier 1 (direct verification) beats Tier 5 (speculation)
  6. Execute First-Hand: Don't trust summaries, see the complete output yourself
  7. Avoid Anti-Patterns: Diagnose first, fix second; simple first, complex later

This systematic approach leads to faster, more reliable problem resolution.