Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 18:47:55 +08:00
commit e732da8316
20 changed files with 4969 additions and 0 deletions

# Systematic Debugging Methodology
## Guiding Principle: Occam's Razor for Debugging
**When facing mysterious errors, the root cause is almost always simpler than it appears.**
Rough distribution of root causes in real-world debugging (a rule of thumb, not measured data):
- **70%**: Configuration issues (wrong parameters, missing flags, incorrect paths)
- **20%**: Environment issues (missing dependencies, version mismatches, path problems)
- **8%**: Data format issues (encoding, structure assumptions)
- **2%**: Actual bugs in external APIs or fundamental incompatibilities
**Core Rule**: Exhaust simple hypotheses before considering complex ones. Authentication failures are almost always configuration problems, not API design flaws.
---
## 1. Hypothesis Priority Framework
### Always Start Here: The "Boring Checklist"
Before investigating complex hypotheses, verify these fundamentals:
1. **Configuration Loading**
- Are environment variables actually loaded? (Not just "file exists")
- Are config files in the correct location relative to execution path?
- Are there hidden default parameters overriding your explicit config?
2. **Exact Version/Name Matching**
- Do library versions exactly match those used in the documentation examples?
- Are API endpoint names exactly correct (not similar, not partially matching)?
- Are model/service names character-perfect? (e.g., `preview-09-2025` vs. `preview-2025-09`)
3. **First-Hand Error Visibility**
- Have you personally executed the failing code?
- Have you seen the complete error output (not summarized or truncated)?
- Are there error details hidden in non-obvious places (WebSocket close codes, HTTP headers)?
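Near-miss names are easy to overlook by eye. The following is a minimal sketch of a check that flags a configured name made of the same tokens as a known name, only in a different order. The function names `tokenKey` and `findNearMisses` and the sample names are illustrative assumptions, not part of any SDK.

```javascript
// Split a name on common separators and sort the tokens, so that
// 'preview-09-2025' and 'preview-2025-09' produce the same key.
function tokenKey(name) {
  return name.toLowerCase().split(/[-_./]/).sort().join('|');
}

// Returns known names that are not identical to `configured`
// but consist of exactly the same tokens (likely a transposition).
function findNearMisses(configured, knownNames) {
  if (knownNames.includes(configured)) return [];
  const key = tokenKey(configured);
  return knownNames.filter((known) => tokenKey(known) === key);
}

// Hypothetical names, for illustration only
const knownNames = ['model-preview-2025-09', 'model-stable'];
console.log(findNearMisses('model-preview-09-2025', knownNames));
console.log(findNearMisses('model-stable', knownNames));
```

An empty result means either an exact match or no obvious transposition; it does not prove the name is valid for the service.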
### Hypothesis Ordering Template
```markdown
## Problem: [Description]
### Priority 1: Configuration Issues (Check First)
- [ ] Hypothesis 1a: Config file not loaded from expected path
Verification: Add logging immediately after config load
- [ ] Hypothesis 1b: SDK using unexpected default parameters
Verification: Log all parameters passed to SDK constructor
- [ ] Hypothesis 1c: Authentication credentials format issue
Verification: Check credential string length, prefix, encoding
### Priority 2: Environment Issues
- [ ] Hypothesis 2a: Dependency version mismatch
Verification: Check package.json vs installed versions
- [ ] Hypothesis 2b: Runtime environment differences
Verification: Compare working vs failing environment variables
### Priority 3: Data Format Issues
- [ ] Hypothesis 3a: Incorrect data structure assumptions
Verification: Log raw data structure, don't assume
- [ ] Hypothesis 3b: Encoding mismatch (UTF-8, base64, binary)
Verification: Inspect first few bytes/characters
### Priority 4: Complex Issues (Only After Above Exhausted)
- [ ] Hypothesis 4a: API limitation or bug
Verification: Find official documentation or working examples
- [ ] Hypothesis 4b: Fundamental incompatibility
Verification: Find counter-examples proving it can work
```
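The priority ordering above can also be enforced in code. This is a sketch under stated assumptions: each hypothesis is an object with a `priority`, a `name`, and a `check` function that resolves to `true` when that cause is ruled out. The shape of these objects and the name `runInPriorityOrder` are hypothetical.

```javascript
// Run checks in priority order (1 = check first) and stop at the first
// one that fails, so simple causes are always examined before complex ones.
async function runInPriorityOrder(hypotheses) {
  const ordered = [...hypotheses].sort((a, b) => a.priority - b.priority);
  for (const h of ordered) {
    const ok = await h.check();
    console.log(`${ok ? '✓' : '✗'} [P${h.priority}] ${h.name}`);
    if (!ok) return h; // first failing check is the leading suspect
  }
  return null; // nothing failed: move on to more complex hypotheses
}

// Hypothetical checks, for illustration only
const hypotheses = [
  { priority: 4, name: 'API limitation or bug', check: async () => true },
  { priority: 1, name: 'PATH variable is set', check: async () => typeof process.env.PATH === 'string' },
  { priority: 2, name: 'Node major version >= 18', check: async () => Number(process.versions.node.split('.')[0]) >= 18 },
];

runInPriorityOrder(hypotheses).then((suspect) => {
  console.log(suspect ? `Investigate first: ${suspect.name}` : 'All checks passed');
});
```

Stopping at the first failure is a design choice: it keeps attention on the simplest unexplained fact instead of collecting a long list of symptoms.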
---
## 2. Isolation Strategy: Diagnostic Scripts
**The Most Powerful Debugging Technique**: Create minimal, single-purpose scripts that test ONE thing at a time.
### Why This Works
- **Eliminates confounding variables**: Complex scripts have multiple failure points
- **Provides clear pass/fail criteria**: If this 10-line script fails, the problem is X
- **Builds confidence incrementally**: Each passing diagnostic narrows the search space
- **Creates reusable verification tools**: Same scripts can verify fixes
### Diagnostic Script Principles
1. **One Variable Per Script**: Test configuration loading separately from API calls separately from data processing
2. **Verbose Logging**: Log every step, every value, every assumption
3. **Explicit Success Criteria**: Script should clearly output PASS or FAIL
4. **No Assumptions**: Check everything explicitly, even "obvious" things
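The "explicit success criteria" principle can be supported by a small reporter that every diagnostic script shares. This is a minimal sketch; `createReporter` and its method names are assumptions made for illustration.

```javascript
// Minimal pass/fail reporter for a diagnostic script.
// `log` is injectable so the reporter itself can be tested.
function createReporter(log = console.log) {
  let failures = 0;
  return {
    // Record one check and print it with a ✓ or ✗ marker
    check(label, condition, detail = '') {
      if (!condition) failures += 1;
      log(`${condition ? '✓' : '✗'} ${label}${detail ? ` (${detail})` : ''}`);
      return Boolean(condition);
    },
    // Print the overall verdict and return true only if nothing failed
    finish() {
      log(failures === 0 ? 'RESULT: PASS' : `RESULT: FAIL (${failures} failed)`);
      return failures === 0;
    },
  };
}

const report = createReporter();
report.check('Working directory resolved', typeof process.cwd() === 'string', process.cwd());
report.check('Node version available', typeof process.version === 'string', process.version);
const passed = report.finish();
// In a real diagnostic, also set: process.exitCode = passed ? 0 : 1;
```

Setting the exit code lets shells and CI treat a diagnostic like a test, which makes the same script reusable for verifying the fix.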
### Template: Configuration Diagnostic
```javascript
// diagnose-config.js
// Purpose: Verify configuration loading and format
// Success criteria: All checks pass with ✓
console.log('=== Configuration Diagnostic ===\n');
// 1. Environment
console.log('--- Environment ---');
console.log('Node version:', process.version);
console.log('Working directory:', process.cwd());
console.log('Script location:', __dirname);
// 2. Config File Loading
console.log('\n--- Config File ---');
const fs = require('fs');
const path = require('path');
const configPath = path.join(__dirname, '.env');
console.log('Expected path:', configPath);
console.log('File exists:', fs.existsSync(configPath));
if (fs.existsSync(configPath)) {
  const content = fs.readFileSync(configPath, 'utf-8');
  // String length counts UTF-16 code units, so measure bytes explicitly
  console.log('File size:', Buffer.byteLength(content, 'utf-8'), 'bytes');
  console.log('Lines:', content.split('\n').length);
}
// 3. Environment Variable
require('dotenv').config({ path: configPath });
console.log('\n--- Environment Variable ---');
console.log('API_KEY defined:', !!process.env.API_KEY);
console.log('API_KEY length:', process.env.API_KEY?.length);
console.log('API_KEY prefix:', process.env.API_KEY?.slice(0, 4));
console.log('API_KEY has whitespace:', /\s/.test(process.env.API_KEY || ''));
// 4. Format Validation
console.log('\n--- Format Validation ---');
const expectedPrefix = 'AIza'; // Example for Google API keys
const expectedLength = 39;
const prefixMatch = process.env.API_KEY?.startsWith(expectedPrefix);
const lengthMatch = process.env.API_KEY?.length === expectedLength;
console.log(prefixMatch ? '✓' : '✗', 'Prefix matches expected format');
console.log(lengthMatch ? '✓' : '✗', 'Length matches expected format');
// 5. SDK Constructor (No Network Call)
console.log('\n--- SDK Initialization ---');
try {
const SDK = require('./sdk'); // Your SDK
const client = new SDK({
apiKey: process.env.API_KEY,
// Log all parameters, including defaults
});
console.log('✓ SDK initialized without errors');
console.log('Client config:', JSON.stringify(client.config, null, 2));
} catch (error) {
console.log('✗ SDK initialization failed:', error.message);
}
console.log('\n=== Diagnostic Complete ===');
```
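Credential format problems are often invisible characters rather than wrong values. The sketch below describes a secret without printing it in full and flags a few common contaminants (stray quotes, trailing newlines, a byte order mark). The function name `describeSecret` and the field names are illustrative assumptions.

```javascript
// Describe a secret without printing it in full.
function describeSecret(value) {
  if (typeof value !== 'string') {
    return { defined: false };
  }
  return {
    defined: true,
    length: value.length,
    // Show only the edges, and only when the value is long enough
    preview: value.length > 8 ? `${value.slice(0, 4)}...${value.slice(-2)}` : '(too short to preview)',
    hasSurroundingWhitespace: value !== value.trim(),
    hasQuotes: /^["']|["']$/.test(value),
    // Anything outside printable ASCII: newlines, tabs, BOM, smart quotes, etc.
    hasUnexpectedChars: /[^\x20-\x7E]/.test(value),
    hasByteOrderMark: value.charCodeAt(0) === 0xfeff,
  };
}

// Hypothetical value with surrounding quotes and a trailing newline
console.log(describeSecret('"abc123def456"\n'));
console.log(describeSecret(undefined));
```

Whether quotes or non-ASCII characters are actually invalid depends on the service; the flags only point at things worth a second look.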
### Template: Data Structure Diagnostic
```javascript
// diagnose-data-structure.js
// Purpose: Inspect actual data structure without assumptions
// Success criteria: Understand exact structure and location of target data
function inspectObject(obj, path = 'root', maxDepth = 3, currentDepth = 0) {
const indent = ' '.repeat(currentDepth);
console.log(`${indent}${path}:`);
console.log(`${indent} Type: ${typeof obj}`);
console.log(`${indent} Null: ${obj === null}`);
console.log(`${indent} Undefined: ${obj === undefined}`);
if (obj === null || obj === undefined) return;
if (typeof obj === 'object') {
if (Buffer.isBuffer(obj)) {
console.log(`${indent} [Buffer: ${obj.length} bytes]`);
console.log(`${indent} First 20 bytes: ${obj.slice(0, 20).toString('hex')}`);
return;
}
if (Array.isArray(obj)) {
console.log(`${indent} [Array: ${obj.length} items]`);
if (obj.length > 0 && currentDepth < maxDepth) {
inspectObject(obj[0], `${path}[0]`, maxDepth, currentDepth + 1);
}
return;
}
const keys = Object.keys(obj);
console.log(`${indent} Keys: [${keys.join(', ')}]`);
if (currentDepth < maxDepth) {
for (const key of keys.slice(0, 5)) { // Limit to first 5 keys
inspectObject(obj[key], `${path}.${key}`, maxDepth, currentDepth + 1);
}
}
} else if (typeof obj === 'string') {
console.log(`${indent} Length: ${obj.length}`);
console.log(`${indent} Preview: "${obj.slice(0, 50)}${obj.length > 50 ? '...' : ''}"`);
// Check encoding hints
const isBase64Like = /^[A-Za-z0-9+/=]+$/.test(obj);
const isHexLike = /^[0-9a-fA-F]+$/.test(obj);
console.log(`${indent} Looks like base64: ${isBase64Like}`);
console.log(`${indent} Looks like hex: ${isHexLike}`);
} else {
console.log(`${indent} Value: ${obj}`);
}
}
// Usage example with API response
const response = getAPIResponse(); // Your actual API call
console.log('=== Raw Response Structure ===\n');
inspectObject(response);
// Specific path investigation
console.log('\n=== Investigating Suspected Path ===');
console.log('response.data exists:', !!response.data);
console.log('response.body exists:', !!response.body);
console.log('response.payload exists:', !!response.payload);
// If you're looking for binary data
console.log('\n=== Binary Data Search ===');
function findBuffers(obj, path = 'root') {
if (Buffer.isBuffer(obj)) {
console.log(`Found Buffer at: ${path} (${obj.length} bytes)`);
return;
}
if (typeof obj === 'object' && obj !== null) {
for (const key in obj) {
findBuffers(obj[key], `${path}.${key}`);
}
}
}
findBuffers(response);
```
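Recursive searches like `findBuffers` can overflow the stack when a response object refers back to itself, which some SDK objects do. A possible variant, sketched here with the assumed name `findBuffersSafe`, tracks visited objects in a `WeakSet` and returns its findings instead of only logging them.

```javascript
// Depth-first search for Buffers that tolerates circular references.
function findBuffersSafe(obj, path = 'root', seen = new WeakSet(), found = []) {
  if (Buffer.isBuffer(obj)) {
    found.push({ path, bytes: obj.length });
    return found;
  }
  if (typeof obj === 'object' && obj !== null) {
    if (seen.has(obj)) return found; // already visited: circular or shared reference
    seen.add(obj);
    for (const [key, value] of Object.entries(obj)) {
      findBuffersSafe(value, `${path}.${key}`, seen, found);
    }
  }
  return found;
}

// Hypothetical response shape with a self-reference
const sample = { meta: { id: 1 }, parts: [{ audio: Buffer.from('abcd') }] };
sample.self = sample;
console.log(findBuffersSafe(sample));
```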
### Template: API Interaction Diagnostic
```javascript
// diagnose-api-minimal.js
// Purpose: Minimal API call to isolate authentication from functionality
// Success criteria: Connection succeeds, response received (any response)
console.log('=== Minimal API Test ===\n');
const API = require('./api-client');
async function testMinimalConnection() {
  console.log('1. Creating client...');
  const client = new API({
    apiKey: process.env.API_KEY,
    // Start with absolute minimal config
  });
  console.log('2. Attempting connection...');
  try {
    await client.connect();
    console.log('✓ Connection established');
  } catch (error) {
    console.log('✗ Connection failed:', error.message);
    console.log('Error code:', error.code);
    // message and stack are non-enumerable, so plain JSON.stringify(error) would omit them
    console.log('Error details:', JSON.stringify(error, Object.getOwnPropertyNames(error), 2));
    process.exit(1);
  }
  console.log('3. Sending minimal request...');
  try {
    // Simplest possible request
    const response = await client.send({ message: 'ping' });
    console.log('✓ Response received');
    console.log('Response type:', typeof response);
    console.log('Response keys:', Object.keys(response));
  } catch (error) {
    console.log('✗ Request failed:', error.message);
  }
  console.log('4. Closing connection...');
  await client.close();
  console.log('✓ Test complete');
}
testMinimalConnection().catch(console.error);
```
### Real-World Example: WebSocket Authentication Mystery
**Problem**: WebSocket connection immediately closes with code 1007 "API key not valid"
**❌ Initial Approach** (jumping to complex hypotheses):
- "Maybe the API doesn't support this model"
- "Maybe the feature flag is disabled for my account"
- "Maybe there's a rate limit"
**✅ Diagnostic Script Approach**:
```javascript
// diagnose-websocket-auth.js
console.log('=== WebSocket Auth Diagnostic ===\n');
// Step 1: Verify API key loading
console.log('--- API Key ---');
console.log('Loaded:', !!process.env.API_KEY);
console.log('Length:', process.env.API_KEY?.length);
console.log('Format:', process.env.API_KEY?.slice(0, 4) + '...' + process.env.API_KEY?.slice(-4));
// Step 2: Check SDK configuration
console.log('\n--- SDK Config ---');
const SDK = require('./sdk'); // Your SDK (placeholder path)
const sdk = new SDK({
  apiKey: process.env.API_KEY,
});
// KEY DISCOVERY: Log the actual endpoint being used
console.log('Endpoint:', sdk.endpoint); // Revealed: Using Vertex AI endpoint!
// Step 3: Check SDK defaults
console.log('Default vertexai:', sdk.config.vertexai); // Revealed: true by default!
// Step 4: Test with explicit configuration
console.log('\n--- Testing explicit vertexai: false ---');
const sdkFixed = new SDK({
apiKey: process.env.API_KEY,
vertexai: false, // Explicit override
});
console.log('Endpoint:', sdkFixed.endpoint); // Now using correct endpoint!
```
**Result**: The diagnostic revealed that the SDK defaulted to `vertexai: true`, sending requests to the Vertex AI endpoint instead of the Gemini Developer API endpoint. The fix was a single parameter.
**Time saved**: This short diagnostic script found the issue in about 2 minutes. The alternatives (reading SDK source code or trial-and-error config changes) would likely have taken far longer.
---
## 3. Comparison with Working Examples
### Why This Is Critical
When you have a working example (different language, different version, official sample), it's a **treasure map** showing the correct configuration.
**Working example exists → Problem is in the differences**
### Systematic Comparison Process
1. **Find the Working Example**
- Official SDK examples (preferred)
- Successful previous implementations in your codebase
- Community examples with verified success (check issues/discussions)
2. **Compare Layer by Layer**
````markdown
## Comparison Checklist
### Language/Runtime
- [ ] Working: Python 3.11, Failing: Node.js 20
- [ ] Any known language-specific issues?
### SDK Versions
- [ ] Working: v2.1.0, Failing: v2.3.1
- [ ] Check changelog between versions
### Configuration Parameters
Working:
```python
client = Client(api_key=key) # Only 1 parameter
```
Failing:
```javascript
const client = new Client({ apiKey: key, ...manyOtherParams });
```
- [ ] What are those other params? What are their defaults?
### API Endpoint
- [ ] Working: api.service.com, Failing: ???
- [ ] Log actual endpoint used by SDK
### Request Format
- [ ] Compare actual HTTP/WebSocket frames sent (use network inspector)
````
3. **Identify Hidden Differences**
Common gotchas:
- **Default parameters**: JavaScript SDK has `vertexai: true` default, Python doesn't have this parameter
- **Authentication methods**: One uses header, another uses query param
- **Endpoint URLs**: SDKs may auto-select endpoints based on config
- **Retry behavior**: One SDK retries automatically, hiding transient failures
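Once both setups can log their resolved configuration, the comparison itself can be mechanical. The sketch below diffs two flat config objects key by key; `diffConfigs` and the sample values are hypothetical, and nested values are compared by their JSON representation.

```javascript
// Compare two flat config objects and list every key whose value differs,
// including keys that are present on only one side.
function diffConfigs(working, failing) {
  const keys = new Set([...Object.keys(working), ...Object.keys(failing)]);
  const differences = [];
  for (const key of [...keys].sort()) {
    const a = working[key];
    const b = failing[key];
    if (JSON.stringify(a) !== JSON.stringify(b)) {
      differences.push({ key, working: a, failing: b });
    }
  }
  return differences;
}

// Hypothetical resolved configs, for illustration only
const workingConfig = { apiKey: '***', endpoint: 'api.service.com' };
const failingConfig = { apiKey: '***', endpoint: 'other.service.com', vertexai: true };
console.table(diffConfigs(workingConfig, failingConfig));
```

Keys that exist on only one side show up with `undefined` on the other, which is exactly where hidden defaults tend to surface.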
### Example: Cross-Language Comparison
**Problem**: Python POC works, JavaScript POC fails with authentication error
**Comparison**:
```python
# Python (WORKING)
import google.generativeai as genai
genai.configure(api_key=api_key) # Simple, one-line config
model = genai.GenerativeModel('gemini-2.5-flash')
response = model.generate_content('Hello')
```
```javascript
// JavaScript (FAILING)
// Note: the object-style constructor with a `vertexai` option belongs to
// the '@google/genai' package (class GoogleGenAI)
const { GoogleGenAI } = require('@google/genai');
const ai = new GoogleGenAI({
  apiKey: process.env.API_KEY,
  // What's different?
});
```
**Investigation**:
1. Python library source: `genai.configure()` only sets the API key, no other parameters
2. JavaScript SDK docs: the constructor accepts a `vertexai` parameter (in this case it resolved to `true`)
3. Hypothesis: JavaScript is defaulting to the Vertex AI endpoint
**Verification**:
```javascript
const { GoogleGenAI } = require('@google/genai');
const ai = new GoogleGenAI({
  apiKey: process.env.API_KEY,
  vertexai: false, // Match Python's implicit behavior
});
```
**Result**: Fixed. The difference was an implicit vs explicit endpoint selection.
---
## 4. Data Structure Verification: Never Assume
### The Anti-Pattern
```javascript
// ❌ Assumption-based code
const audioData = response.data; // Assuming 'data' contains audio
audioFile.write(audioData);
// Result: Writes undefined or wrong data, produces corrupted file
```
### The Correct Pattern
```javascript
// ✅ Verification-first code
// Step 1: Inspect actual structure
console.log('Response keys:', Object.keys(response));
console.log('Response structure:', JSON.stringify(response, null, 2).slice(0, 500));
// Step 2: Search for target data
function findAudioData(obj, path = 'response') {
if (Buffer.isBuffer(obj)) {
console.log(`Found Buffer at ${path}: ${obj.length} bytes`);
}
if (typeof obj === 'object' && obj !== null) {
for (const [key, value] of Object.entries(obj)) {
if (key.includes('audio') || key.includes('data')) {
console.log(`Candidate at ${path}.${key}:`, typeof value);
}
findAudioData(value, `${path}.${key}`);
}
}
}
findAudioData(response);
// Step 3: Verify encoding
const candidateData = response.serverContent.modelTurn.parts[0].inlineData.data;
console.log('Data type:', typeof candidateData);
console.log('First 50 chars:', candidateData.slice(0, 50));
console.log('Looks like base64:', /^[A-Za-z0-9+/=]+$/.test(candidateData));
// Step 4: Test decoding
const decoded = Buffer.from(candidateData, 'base64');
console.log('Decoded size:', decoded.length, 'bytes');
console.log('First 10 bytes (hex):', decoded.slice(0, 10).toString('hex'));
// Step 5: Use verified data
audioFile.write(decoded); // Now confident this is correct
```
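The character-class regex in Step 3 accepts many strings that are not valid base64 (for example, any plain alphanumeric word). A stricter check, sketched here under the assumption that the payload uses standard padded base64, is a decode and re-encode round trip, followed by a look at the leading bytes of the decoded data. The function names are illustrative.

```javascript
// A string passes only if decoding and re-encoding reproduces it exactly.
// Unpadded or URL-safe base64 is reported as non-canonical by this check.
function isCanonicalBase64(text) {
  if (typeof text !== 'string' || text.length === 0) return false;
  return Buffer.from(text, 'base64').toString('base64') === text;
}

// Identify a few common audio containers by their leading bytes.
function sniffAudioFormat(buffer) {
  const head = buffer.subarray(0, 12);
  const tag4 = head.toString('latin1', 0, 4);
  if (tag4 === 'RIFF' && head.toString('latin1', 8, 12) === 'WAVE') return 'wav';
  if (tag4 === 'OggS') return 'ogg';
  if (tag4 === 'fLaC') return 'flac';
  if (head.toString('latin1', 0, 3) === 'ID3') return 'mp3';
  return 'unknown'; // could be raw PCM, which has no header at all
}

console.log(isCanonicalBase64('aGVsbG8='));
console.log(isCanonicalBase64('hello world'));
console.log(sniffAudioFormat(Buffer.from('RIFF\x24\x00\x00\x00WAVEfmt ', 'latin1')));
```

An `unknown` result is not a failure by itself: headerless PCM needs a container (such as a WAV header) written around it before most players will open the file.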
### Layer-by-Layer Verification Template
```javascript
// For nested data structures (e.g., API responses, message objects)
function verifyPath(obj, pathString) {
console.log(`\n=== Verifying: ${pathString} ===`);
const parts = pathString.split('.');
let current = obj;
let currentPath = 'root';
for (const part of parts) {
currentPath += `.${part}`;
console.log(`Checking ${currentPath}...`);
if (current === null || current === undefined) {
console.log(`✗ Path broken at ${currentPath}: value is ${current}`);
return null;
}
if (typeof current !== 'object') {
console.log(`✗ Path broken at ${currentPath}: not an object (${typeof current})`);
return null;
}
if (!(part in current)) {
console.log(`✗ Key "${part}" doesn't exist`);
console.log(` Available keys:`, Object.keys(current));
return null;
}
console.log(`✓ ${part} exists`);
current = current[part];
if (Array.isArray(current)) {
console.log(` (Array with ${current.length} items)`);
} else if (Buffer.isBuffer(current)) {
console.log(` (Buffer with ${current.length} bytes)`);
} else {
console.log(` (${typeof current})`);
}
}
console.log(`\n✓ Full path verified: ${pathString}`);
return current;
}
// Usage
const audioData = verifyPath(
response,
'serverContent.modelTurn.parts.0.inlineData.data'
);
```
---
## 5. Evidence Quality Hierarchy
Not all evidence is equal. Rank your evidence sources:
### Tier 1: Direct Verification (Strongest)
- Running code that succeeds/fails in front of you
- Network traffic you personally captured
- Logs you personally generated with verbose flags
- Binary data you inspected byte-by-byte
### Tier 2: Official Sources
- Official API documentation (with version number matching yours)
- Official SDK examples (with version number matching yours)
- Official changelog entries
### Tier 3: Working Examples
- Community examples with verified success (stars, recent activity)
- Stack Overflow answers with upvotes and recent dates
- Your own previous successful implementations
### Tier 4: Problem Reports
- GitHub Issues (open or closed)
- Stack Overflow questions (problems, not solutions)
- Forum discussions
### Tier 5: Speculation (Weakest)
- "I think this API doesn't support..."
- "This probably means..."
- Assumptions based on API names or parameter names
### Applying the Hierarchy
**Scenario**: Investigating why authentication fails
```markdown
## Evidence Analysis
### Hypothesis: API doesn't support this authentication method
Evidence collected:
1. [Tier 4] GitHub Issue #123: User reports auth failure (OPEN, no resolution)
2. [Tier 5] Parameter named "beta" suggests experimental feature
3. [Tier 2] Official docs state: "Authentication via API key is supported"
4. [Tier 3] Example repo uses API key successfully (last updated 2 months ago)
Conclusion:
- Tier 2 (official docs) contradicts hypothesis
- Tier 3 (working example) disproves hypothesis
- Tier 4 evidence (open issue) indicates others hit same problem, but doesn't prove API limitation
- Hypothesis REJECTED: Auth method IS supported, problem is likely in configuration
```
**Key Principle**: Stronger evidence overrules weaker evidence: a lower tier number (Tier 1 is strongest) beats a higher one.
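The rule can be made explicit in a few lines. This is only a sketch of one possible encoding: each evidence item carries a `tier` (1 is strongest) and a `supports` flag, and only the strongest tier present decides the verdict. The names `weighEvidence`, `tier`, and `supports` are assumptions for illustration.

```javascript
// Evidence items: { tier: 1-5 (1 = strongest), supports: boolean, note: string }
// Only the strongest tier present decides; weaker evidence is ignored.
function weighEvidence(evidence) {
  if (evidence.length === 0) return { verdict: 'NO EVIDENCE', decidingTier: null };
  const strongestTier = Math.min(...evidence.map((e) => e.tier));
  const strongest = evidence.filter((e) => e.tier === strongestTier);
  const supporting = strongest.filter((e) => e.supports).length;
  let verdict;
  if (supporting === strongest.length) verdict = 'SUPPORTED';
  else if (supporting === 0) verdict = 'REJECTED';
  else verdict = 'INCONCLUSIVE'; // strongest evidence disagrees with itself
  return { verdict, decidingTier: strongestTier };
}

// The scenario above: "API doesn't support this authentication method"
console.log(weighEvidence([
  { tier: 4, supports: true, note: 'Open GitHub issue reporting auth failure' },
  { tier: 5, supports: true, note: 'Parameter named "beta"' },
  { tier: 2, supports: false, note: 'Official docs: API key auth is supported' },
  { tier: 3, supports: false, note: 'Example repo uses API key successfully' },
]));
```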
---
## 6. The First-Hand Execution Rule
**Rule**: Before forming conclusions, personally execute the failing code and observe the complete output.
### Why This Matters
Second-hand error reports often omit critical details:
- Full error messages (users summarize or truncate)
- Error codes (users paste message but not code)
- Preceding warnings (users skip "unimportant" output)
- Environment differences (users assume their env is "normal")
### Checklist Before Forming Hypothesis
- [ ] Have I executed the failing code myself?
- [ ] Have I seen the complete console output (not summarized)?
- [ ] Have I checked for errors in non-obvious places (close codes, HTTP status, exit codes)?
- [ ] Have I added extra logging to expose internal state?
### Example: The Hidden Close Code
**Second-hand report**: "The WebSocket connection closes immediately with no error"
**Assumptions formed**:
- "No error" → Maybe timeout?
- "Closes immediately" → Maybe connection refused?
**First-hand execution**:
```javascript
// Added logging
websocket.on('close', (code, reason) => {
console.log('Close code:', code); // Revealed: 1007
console.log('Close reason:', reason); // Revealed: "API key not valid"
});
```
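**Result**: The error WAS present, just not logged by default. The close reason ("API key not valid") immediately pointed to an authentication issue; the code alone would not have, since 1007 nominally means "invalid frame payload data" in RFC 6455.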
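To make close events self-explanatory in logs, the numeric code can be translated using the status codes defined in RFC 6455 (section 7.4.1). The helper below is a sketch; `describeClose` is an assumed name, and the Buffer handling reflects that some WebSocket libraries deliver the reason as a Buffer rather than a string.

```javascript
// Close codes defined in RFC 6455, section 7.4.1
const CLOSE_CODES = {
  1000: 'Normal closure',
  1001: 'Going away',
  1002: 'Protocol error',
  1003: 'Unsupported data',
  1005: 'No status received',
  1006: 'Abnormal closure',
  1007: 'Invalid frame payload data',
  1008: 'Policy violation',
  1009: 'Message too big',
  1010: 'Mandatory extension',
  1011: 'Internal error',
  1015: 'TLS handshake failure',
};

function describeClose(code, reason) {
  // Normalize the reason: it may arrive as a Buffer, a string, or be missing
  const text = Buffer.isBuffer(reason) ? reason.toString('utf8') : String(reason ?? '');
  const meaning = CLOSE_CODES[code]
    ?? (code >= 4000 && code <= 4999 ? 'Application-defined code' : 'Unregistered code');
  return `code=${code} (${meaning}) reason=${text === '' ? '(empty)' : JSON.stringify(text)}`;
}

console.log(describeClose(1007, Buffer.from('API key not valid')));
console.log(describeClose(1006, ''));
```

Services sometimes reuse a standard code for an unrelated condition, so the reason text should always be logged alongside the code.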
---
## 7. Debugging Session Template
Use this template to structure your investigation:
````markdown
## Problem Statement
[Concise description of unexpected behavior]
## Environment
- Runtime: [Node.js 20.11.0, Python 3.11, etc.]
- Library versions: [Exact versions from package.json/requirements.txt]
- OS: [If potentially relevant]
## Reproduction
[Minimal code that reproduces the issue]
## Expected vs Actual
- Expected: [What should happen]
- Actual: [What actually happens, with exact error messages]
## Hypothesis Priority List
### Priority 1: Configuration (Check First)
- [ ] Hypothesis 1a: [Specific config issue]
Evidence needed: [What would prove/disprove this]
Diagnostic: [Script or test to verify]
### Priority 2: Environment
- [ ] Hypothesis 2a: [Specific env issue]
Evidence needed: [...]
Diagnostic: [...]
### Priority 3: Data Format
[...]
### Priority 4: Complex Issues (Only if above exhausted)
[...]
## Evidence Collected
### [Hypothesis 1a]
- **Status**: ✓ PROVEN / ✗ DISPROVEN / ⚠ INCONCLUSIVE
- **Evidence tier**: [1-5]
- **Details**: [What you found]
- **Source**: [Where this evidence came from]
[Repeat for each hypothesis]
## Working Examples Comparison
### Python Implementation (WORKING)
```python
[Code]
```
### JavaScript Implementation (FAILING)
```javascript
[Code]
```
### Differences Identified
1. [Difference 1]
2. [Difference 2]
## Solution
### Root Cause
[Final verified cause, with evidence tier]
### Fix Applied
```javascript
[Exact code change]
```
### Verification
[How you verified the fix works]
## Lessons Learned
[What would make this faster next time]
````
---
## 8. Common Anti-Patterns to Avoid
### Anti-Pattern 1: "Debugging by Modification"
**❌ Wrong**:
```javascript
// Try random changes hoping something works
let client = new API({ apiKey: key, timeout: 5000 }); // Doesn't work
client = new API({ apiKey: key, timeout: 10000 }); // Doesn't work
client = new API({ apiKey: key, retry: true }); // Doesn't work
// [30 more random attempts...]
```
**✅ Right**:
```javascript
// Diagnose THEN fix
// 1. Create diagnostic to understand current behavior
// 2. Form hypothesis based on diagnostic output
// 3. Make targeted change
// 4. Verify with diagnostic
```
### Anti-Pattern 2: "Complex First"
**❌ Wrong**: "The API must not support this feature with this model configuration"
**✅ Right**: "Let me first check if my API key is even loading correctly"
### Anti-Pattern 3: "Assumption Stacking"
**❌ Wrong**:
```javascript
// Assuming response.data exists
// Assuming it's a Buffer
// Assuming it's in the right format
fs.writeFileSync('output.wav', response.data);
```
**✅ Right**:
```javascript
// Verify each assumption
console.log('data exists:', !!response.data);
console.log('data type:', typeof response.data);
console.log('is Buffer:', Buffer.isBuffer(response.data));
// [Then use data]
```
### Anti-Pattern 4: "Trust the Summary"
**❌ Wrong**: User says "no error", assume there's no error
**✅ Right**: Execute code yourself, log everything, find the hidden error code
---
## 9. Speed Optimization: Parallel Diagnostics
Once you've identified multiple hypotheses, test them in parallel when possible.
### Pattern: Parallel Diagnostic Scripts
```bash
# Instead of running diagnostics sequentially:
node diagnose-config.js # 5 seconds
node diagnose-api.js # 10 seconds
node diagnose-data-format.js # 5 seconds
# Total: 20 seconds sequential
# Run in parallel:
node diagnose-config.js &
node diagnose-api.js &
node diagnose-data-format.js &
wait
# Total: 10 seconds (limited by slowest)
```
### Pattern: Multi-Hypothesis Test Script
```javascript
// test-all-hypotheses.js
// Each test function is assumed to return a promise that rejects on failure.
async function testAll() {
  const tests = [
    testConfigLoading,
    testAPIEndpoint,
    testDataFormat,
    testVersionCompatibility
  ];
  // Do not catch here: a caught rejection would be reported as 'fulfilled'.
  // The async wrapper also turns synchronous throws into rejections.
  const results = await Promise.allSettled(
    tests.map(async (test) => test())
  );
  results.forEach((result, i) => {
    console.log(`\nTest ${i + 1}: ${tests[i].name}`);
    if (result.status === 'fulfilled') {
      console.log('✓ PASSED');
    } else {
      console.log('✗ FAILED:', result.reason);
    }
  });
}
testAll();
```
---
## Application to Error Troubleshooting
When using the error-troubleshooter skill:
1. **Start with Occam's Razor**: Always check configuration and environment issues first (90% of problems)
2. **Create Diagnostic Scripts**: Write minimal scripts to isolate variables
3. **Compare with Working Examples**: If it works somewhere else, find the differences
4. **Never Assume Data Structure**: Verify every layer explicitly
5. **Rank Your Evidence**: Tier 1 (direct verification) beats Tier 5 (speculation)
6. **Execute First-Hand**: Don't trust summaries, see the complete output yourself
7. **Avoid Anti-Patterns**: Diagnose first, fix second; simple first, complex later
This systematic approach leads to faster, more reliable problem resolution.