Troubleshooting Standard Operating Procedures
This document provides detailed, step-by-step procedures for systematic error troubleshooting.
Core Troubleshooting Workflow
Phase 1: Error Recognition and Initial Response
When a tool, script, or command fails:
- Capture Complete Error Information
  - Full error message (stdout and stderr)
  - Tool/command that was executed
  - Context of what was being attempted
  - Any stack traces or error codes
- Assess Error Clarity
  - Is the error message self-explanatory?
  - Does it explicitly state what went wrong and how to fix it?
  - Is this a commonly encountered error pattern?
- Decide on Investigation Depth
  - Trivial/Clear: Apply quick fix
  - Non-trivial/Ambiguous: Proceed to rigorous investigation
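The capture step can be sketched in Python. This is a minimal illustration, not part of the procedure itself: the `CommandFailure` record and the `run_and_capture` helper are hypothetical names, and the failing command is a stand-in chosen so the example is self-contained.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandFailure:
    """Hypothetical record of what Phase 1 asks to capture."""
    command: list
    context: str
    returncode: int
    stdout: str
    stderr: str


def run_and_capture(command: list, context: str) -> Optional[CommandFailure]:
    """Run a command and return a failure record if it exits non-zero."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return None
    return CommandFailure(command, context, result.returncode,
                          result.stdout, result.stderr)


# Stand-in failing command: a Python one-liner that raises ValueError.
failure = run_and_capture(
    [sys.executable, "-c", "int('abc')"],
    context="parsing a user-supplied number",
)
if failure is not None:
    print(failure.returncode)
    print(failure.stderr)  # includes the stack trace and the error type
```

Keeping stdout, stderr, and the exit code together means later phases can work from one record instead of re-running the command to recover details.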
Phase 2: Quick Fix Attempt (Happy Case)
Criteria for attempting quick fix:
- Error message explicitly describes the problem and solution
- Error matches a well-known trivial pattern
- Fix requires minimal changes with low risk
Procedure:
- Apply the fix based on error message or experience
- Re-execute the failing command
- Evaluate the result:
  - Success: Error is resolved → Done
  - No change: Error message identical → Revert and escalate
  - Worse: New errors or degraded state → Revert immediately and escalate
Reversion Protocol:
- Undo all changes made during quick fix attempt
- Verify system is back to pre-fix state
- Document what was attempted (if creating debug notes)
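The apply, re-run, evaluate, and revert sequence above might look like the following sketch. The `Outcome` enum and `attempt_quick_fix` function are illustrative names, and the "fix" here only flips a flag in a dictionary; a real fix would change files or install packages.

```python
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    SUCCESS = "success"
    NO_CHANGE = "no change"
    WORSE = "worse"


def attempt_quick_fix(
    original_error: str,
    apply_fix: Callable[[], None],
    revert_fix: Callable[[], None],
    rerun: Callable[[], Optional[str]],
) -> Outcome:
    """Apply one fix, re-run, and revert unless the error is gone.

    `rerun` returns None on success, or the error message on failure.
    """
    apply_fix()
    new_error = rerun()
    if new_error is None:
        return Outcome.SUCCESS
    revert_fix()  # both failure outcomes revert before escalating
    return Outcome.NO_CHANGE if new_error == original_error else Outcome.WORSE


# Simulated environment: the "fix" marks the missing module as installed.
state = {"pandas_installed": False}
ERROR = "ModuleNotFoundError: No module named 'pandas'"

outcome = attempt_quick_fix(
    ERROR,
    apply_fix=lambda: state.update(pandas_installed=True),
    revert_fix=lambda: state.update(pandas_installed=False),
    rerun=lambda: None if state["pandas_installed"] else ERROR,
)
print(outcome)
```

Anything other than `Outcome.SUCCESS` means the change has already been reverted and the investigation should move to Phase 3.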
Phase 3: Rigorous Investigation
Enter this phase when:
- Quick fix failed
- Error is non-trivial from the start
- Multiple potential causes exist
- Context is ambiguous
Step 1: Error Template Extraction
Purpose: Prepare error message for effective searching by removing variable components.
Procedure:
- Identify the error type/category (e.g., FileNotFoundError, TypeError, ConnectionError)
- Locate the core error message
- Remove variable components:
  - File paths: /home/user/project/file.py → (remove)
  - Usernames: user@example.com → (remove)
  - IDs/numbers: id=12345 → (remove)
  - Timestamps: 2024-01-15 10:30:45 → (remove)
  - User inputs: input='value' → (remove)
  - Line numbers: line 42 → (keep only if part of standard template)
- Retain:
  - Error type/class names
  - Standard error message structure
  - Function/method names from standard library/SDK
  - Standard error codes
Example Transformations:
Original:
ValueError: invalid literal for int() with base 10: 'abc' at line 42 in /home/user/script.py
Template:
ValueError invalid literal for int() with base 10
Original:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='api.example.com', port=443): Max retries exceeded with url: /v1/users
Template:
requests.exceptions.ConnectionError HTTPConnectionPool Max retries exceeded
Original:
ModuleNotFoundError: No module named 'pandas'
Template:
ModuleNotFoundError No module named
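A rough, regex-based sketch of this extraction follows. The pattern list and the `extract_template` name are assumptions made for illustration; the patterns cover only the variable components listed above and will need tuning for other error formats. Manual review of the result is still advisable.

```python
import re

# Patterns for variable components, applied in order (illustrative, not exhaustive).
_VARIABLE_PATTERNS = [
    r"\([^()]*=[^()]*\)",                          # argument lists such as (host='x', port=1)
    r"'[^']*'",                                    # single-quoted user input
    r'"[^"]*"',                                    # double-quoted user input
    r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}",     # timestamps
    r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+",             # email-style usernames
    r"(?:\b(?:in|at|from)\s+)?(?:\b[A-Za-z]:)?(?:[/\\][\w.\-]+)+",  # file paths
    r"\b(?:at\s+)?line \d+\b",                     # line numbers
    r"\b\w+=\d+\b",                                # key=number pairs such as id=12345
]


def extract_template(error: str) -> str:
    """Strip variable components so the remainder is suitable for searching."""
    template = error
    for pattern in _VARIABLE_PATTERNS:
        template = re.sub(pattern, " ", template)
    template = re.sub(r"[:,;=]", " ", template)    # drop separator punctuation
    return re.sub(r"\s+", " ", template).strip()   # collapse whitespace


print(extract_template(
    "ValueError: invalid literal for int() with base 10: 'abc' "
    "at line 42 in /home/user/script.py"
))
# → ValueError invalid literal for int() with base 10

print(extract_template("ModuleNotFoundError: No module named 'pandas'"))
# → ModuleNotFoundError No module named
```

On the `requests` example this sketch keeps a trailing "with url", which the hand-written template drops; a human pass over the output is what catches leftovers like that.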
Step 2: Environment Information Collection
Purpose: Gather context needed to understand and resolve the error.
Collection Strategy:
- Start Minimal: Only collect what's clearly relevant
- Expand as Needed: Add more context if initial research is inconclusive
- Always Protect Privacy: Never collect passwords, API keys, or personal data, and ask for explicit permission before collecting anything project-specific
Standard Environment Information:
System Context:
- Operating system and version
- Shell/terminal environment
- Current working directory (if relevant)
Runtime Context:
- Language/SDK versions (Python, Node.js, etc.)
- Package manager versions (pip, npm, etc.)
- Virtual environment status
Dependency Context:
- Installed package versions (for error-related packages)
- Package lock file status
- Dependency conflicts
Configuration Context:
- Relevant configuration files (only if directly related to error)
- Environment variables (only non-sensitive ones)
Collection Commands by Context:
# Python environment
python --version
pip --version
pip list | grep <package-name>
# Node.js environment
node --version
npm --version
npm list <package-name>
# System information
uname -a # Unix/Linux/Mac
ver # Windows
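# Package conflicts and environment health
pip check # Python: reports installed packages with incompatible or missing dependencies
npm doctor # Node.js: checks the npm environment (registry access, versions, permissions)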
# Environment variables (careful with sensitive data)
env | grep <RELEVANT_VAR>
Privacy Protection:
- Never collect: Passwords, API keys, tokens, private keys, personally identifiable information
- Request permission before collecting: Project-specific paths, configuration files, custom environment variables
- Sanitize output: Remove sensitive data before recording
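The minimal-collection and sanitization rules can be combined in one helper. The sketch below is an assumption-laden illustration: `collect_environment`, the `SENSITIVE_NAME` pattern, and the `DEMO_*` variables are all hypothetical, and the name-based filter is deliberately over-cautious (it also redacts harmless variables whose names happen to contain "KEY").

```python
import os
import platform
import re
import sys

# Variable names matching this pattern are redacted rather than recorded.
SENSITIVE_NAME = re.compile(
    r"PASSWORD|PASSWD|SECRET|TOKEN|KEY|CREDENTIAL", re.IGNORECASE
)


def collect_environment(relevant_vars: list) -> dict:
    """Collect a minimal, sanitized environment snapshot."""
    info = {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        # True when running inside a venv-style virtual environment
        "in_virtualenv": str(sys.prefix != sys.base_prefix),
    }
    for name in relevant_vars:
        if name not in os.environ:
            continue
        redact = SENSITIVE_NAME.search(name) is not None
        info[f"env:{name}"] = "<redacted>" if redact else os.environ[name]
    return info


# Hypothetical variables, set here only so the example is self-contained.
os.environ["DEMO_MODE"] = "dev"
os.environ["DEMO_API_KEY"] = "not-a-real-key"

snapshot = collect_environment(["DEMO_MODE", "DEMO_API_KEY", "DEMO_UNSET"])
for key, value in snapshot.items():
    print(f"{key}: {value}")
```

The caller names the variables to record, which keeps collection minimal by default; nothing is read from the environment unless it was asked for.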
Step 3: Research and Information Gathering
Research Sources (in order of efficiency):
- Web Search with Error Template
  - Search the extracted template (not full error)
  - Add language/framework name to query
  - Example: "ModuleNotFoundError No module named python"
- Official Documentation
  - Error code references
  - SDK/API documentation
  - Known issues and breaking changes
- Community Resources
  - Stack Overflow
  - GitHub Issues (especially for specific libraries)
  - Framework-specific forums
Parallel Research Strategy:
For complex problems, launch multiple research angles simultaneously using subagents:
Investigation Angles:
├─ Subagent 1: Web search for error template
├─ Subagent 2: Search GitHub Issues for affected package
├─ Subagent 3: Check official documentation for breaking changes
└─ Subagent 4: Search for similar errors in codebase history
Research Efficiency:
- Delegate broad searches to subagents
- Keep main context focused on synthesis and decision-making
- Use file-based notes to accumulate findings
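How subagents are launched depends on the tooling in use, so the following only sketches the fan-out and synthesis pattern with a thread pool. The three research functions are hypothetical stubs that return placeholder strings; in practice each would delegate to a subagent or a search tool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


# Hypothetical stand-ins for subagent tasks.
def search_web(template: str) -> str:
    return f"web results for: {template}"


def search_issue_tracker(template: str) -> str:
    return f"issue tracker results for: {template}"


def check_official_docs(template: str) -> str:
    return f"documentation notes for: {template}"


def research_in_parallel(
    template: str, angles: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Run every research angle concurrently and gather findings by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(angles))) as pool:
        futures = {name: pool.submit(fn, template) for name, fn in angles.items()}
        return {name: future.result() for name, future in futures.items()}


findings = research_in_parallel(
    "ModuleNotFoundError No module named",
    {
        "web": search_web,
        "issues": search_issue_tracker,
        "docs": check_official_docs,
    },
)
for source, summary in findings.items():
    print(f"{source}: {summary}")
```

The main context receives one short summary per angle, which matches the goal of keeping it focused on synthesis rather than raw search output.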
Step 4: Debug Notes Creation
When to create debug notes:
- Investigation is expected to be complex
- Multiple theories need tracking
- Context is being consumed quickly
- Investigation may span multiple sessions
Debug Notes Structure:
# Debug Session: [Error Summary]
## Error Information
[Full error details]
## Environment
[Relevant environment information]
## Theories
1. [Theory 1]: [likelihood: high/medium/low]
- Evidence: [supporting information]
- Test: [how to verify]
- Result: [pending/confirmed/rejected]
2. [Theory 2]: ...
## Research Findings
- [Source]: [key information]
- [Source]: [key information]
## Tests Conducted
1. [Test description]
- Command: [test command]
- Result: [outcome]
- Conclusion: [what was learned]
## Solution
[Final solution that resolved the issue]
Use the template in assets/debug-notes-template.md as a starting point.
Step 5: Theory Formulation and Testing
Theory Development:
Based on research and environment analysis:
- List all plausible explanations for the error
- Assess likelihood of each (high/medium/low)
- Identify evidence supporting or contradicting each theory
- Order theories by likelihood and ease of testing
Theory Testing Protocol:
For each theory (starting with most likely):
- Design Test
  - What command/change will verify or reject this theory?
  - What outcome would confirm the theory?
  - What outcome would reject the theory?
- Execute Test
  - Run the test in a controlled manner
  - Capture all output
  - Note any side effects
- Evaluate Result
  - Does the result confirm or reject the theory?
  - Are there unexpected outcomes?
  - What new information was gained?
- Document in Debug Notes
  - Record test and result
  - Update theory status
  - Note any new theories generated
- Iterate
  - If theory confirmed: proceed to solution implementation
  - If theory rejected: move to next theory
  - If inconclusive: gather more information
Testing Best Practices:
- Test one variable at a time
- Use minimal reproducible cases when possible
- Revert changes between theory tests
- Document negative results (what didn't work is valuable information)
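The testing protocol can be expressed as a small loop over theories ordered by likelihood. This is a sketch under stated assumptions: the `Theory` record and `run_theory_checks` function are illustrative names, and each check is a stub lambda standing in for a real command or experiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

LIKELIHOOD_RANK = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Theory:
    name: str
    likelihood: str                       # "high", "medium", or "low"
    check: Callable[[], Optional[bool]]   # True = confirmed, False = rejected, None = inconclusive
    result: str = "pending"


def run_theory_checks(theories: List[Theory]) -> Optional[Theory]:
    """Check theories from most to least likely; stop at the first confirmation."""
    for theory in sorted(theories, key=lambda t: LIKELIHOOD_RANK[t.likelihood]):
        verdict = theory.check()
        if verdict is True:
            theory.result = "confirmed"
            return theory
        # Negative and inconclusive results are recorded, not discarded.
        theory.result = "rejected" if verdict is False else "inconclusive"
    return None


theories = [
    Theory("dependency version conflict", "low", lambda: True),
    Theory("package not installed", "high", lambda: False),
    Theory("wrong virtual environment active", "medium", lambda: True),
]
confirmed = run_theory_checks(theories)
for theory in theories:
    print(f"{theory.name}: {theory.result}")
```

Theories that were never reached stay "pending", so the record shows which explanations were ruled out and which were simply not needed.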
Phase 4: Solution Implementation
Once the correct theory is identified:
- Plan Implementation
  - What changes are needed?
  - Are there risks or side effects?
  - Can the solution be tested incrementally?
- Apply Fix
  - Make the necessary changes
  - Document what was changed
  - Keep changes minimal and focused
- Verify Resolution
  - Re-run the original failing command
  - Confirm error is completely resolved
  - Check for new errors or warnings
- Document Solution
  - Record the fix in debug notes (if created)
  - Note root cause and solution for future reference
  - Consider adding to common error patterns if widely applicable
Phase 5: Post-Resolution
- Clean Up
  - Remove temporary debug files (if any)
  - Clean up test artifacts
  - Restore any temporary changes
- Knowledge Capture
  - If this was a difficult problem with a general solution, consider documenting it
  - Update references/common-error-patterns.md if appropriate
  - Note any tools or techniques that were particularly effective
Advanced Investigation Techniques
Bisection Method
For errors introduced by recent changes:
- Identify last known good state
- Bisect the changes between good and bad state
- Test each bisection point
- Narrow down to the specific change that introduced the error
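The bisection steps amount to a binary search over an ordered list of changes. In the sketch below, `find_first_bad` and the commit labels are hypothetical, and `is_good` stands in for whatever test distinguishes a working state from a broken one. For Git repositories, `git bisect` automates the same search.

```python
from typing import Callable, Sequence


def find_first_bad(changes: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first bad change.

    Assumes `changes` is ordered oldest to newest, changes[0] is known good,
    and changes[-1] is known bad.
    """
    good, bad = 0, len(changes) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_good(changes[mid]):
            good = mid   # the error was introduced after mid
        else:
            bad = mid    # the error was introduced at or before mid
    return changes[bad]


# Hypothetical history in which the error first appears at "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
tested = []


def is_good(change: str) -> bool:
    tested.append(change)
    return history.index(change) < 5


print(find_first_bad(history, is_good))  # → c5
print(tested)                            # → ['c3', 'c5', 'c4']
```

Eight changes are resolved with three tests; the number of tests grows with the logarithm of the history length, which is why bisection pays off on long histories.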
Differential Diagnosis
When multiple theories seem equally plausible:
- Identify distinguishing characteristics of each theory
- Design tests that differentiate between theories
- Execute targeted tests to rule out theories
- Converge on the correct diagnosis through elimination
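One way to make the elimination explicit is to write down what each theory predicts for each candidate test, then run only the tests on which the remaining theories disagree. The sketch below is illustrative: the `eliminate` function, the theory names, the test names, and the observations are all hypothetical.

```python
from typing import Callable, Dict, List


def eliminate(
    predictions: Dict[str, Dict[str, bool]],
    observe: Callable[[str], bool],
) -> List[str]:
    """Return the theories that survive all distinguishing tests.

    predictions[theory][check] is the outcome that theory predicts for a check.
    """
    remaining = set(predictions)
    checks = sorted({c for p in predictions.values() for c in p})
    for check in checks:
        if len(remaining) <= 1:
            break
        expected = {predictions[t][check] for t in remaining if check in predictions[t]}
        if len(expected) < 2:
            continue  # this check cannot tell the remaining theories apart
        observed = observe(check)
        remaining = {
            t for t in remaining
            if predictions[t].get(check, observed) == observed
        }
    return sorted(remaining)


predictions = {
    "package not installed": {
        "pip_show_finds_package": False, "import_works_in_fresh_venv": True},
    "wrong interpreter active": {
        "pip_show_finds_package": True, "import_works_in_fresh_venv": True},
    "package broken upstream": {
        "pip_show_finds_package": True, "import_works_in_fresh_venv": False},
}
observations = {"pip_show_finds_package": True, "import_works_in_fresh_venv": True}

print(eliminate(predictions, observations.__getitem__))
# → ['wrong interpreter active']
```

A check on which every remaining theory predicts the same outcome is skipped, since running it would cost time without narrowing the diagnosis.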
Reproduction Reduction
For complex errors:
- Create minimal reproducible example
- Strip away unrelated code/configuration
- Isolate the essential elements that trigger the error
- Use reduced case for easier investigation
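A simple way to strip away unrelated content is to remove one piece at a time and keep the removal whenever the error still reproduces. The sketch below is a greedy, one-line-at-a-time reduction; `reduce_case` is a hypothetical helper, and `still_fails` stands in for re-running the reproduction.

```python
from typing import Callable, List


def reduce_case(
    lines: List[str], still_fails: Callable[[List[str]], bool]
) -> List[str]:
    """Greedily drop lines that are not needed to trigger the error."""
    reduced = list(lines)
    index = 0
    while index < len(reduced):
        candidate = reduced[:index] + reduced[index + 1:]
        if still_fails(candidate):
            reduced = candidate   # the line was unrelated; drop it
        else:
            index += 1            # the line is essential; keep it
    return reduced


script = [
    "import os",
    "x = 1",
    "y = int('abc')",
    "print(x)",
]


# Stand-in for re-running the script: the error reproduces whenever
# the offending line is still present.
def still_fails(candidate: List[str]) -> bool:
    return "y = int('abc')" in candidate


print(reduce_case(script, still_fails))  # → ["y = int('abc')"]
```

This greedy pass needs roughly one reproduction run per line, so it suits small cases; for large inputs, removing chunks before single lines reduces the number of runs.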
Communication and Escalation
When to Request User Input
Request user input when:
- Multiple equally valid solutions exist
- User preference affects solution choice
- Sensitive information access is needed
- Problem domain knowledge is required
- Verification of fix needs user testing
How to Present Findings
When communicating with user:
- Summarize: Brief description of the error and its cause
- Explain: Why the error occurred
- Solution: What was done to fix it
- Verification: How to confirm it's resolved
- Prevention: How to avoid in the future (if applicable)
Common Pitfalls to Avoid
- Assumption Paralysis: Don't assume too much; verify theories with tests
- Fix Stacking: Don't apply multiple fixes simultaneously; test one at a time
- Context Drift: Stay focused on the original error; avoid rabbit holes
- Incomplete Reversion: Always fully revert failed fixes
- Premature Success: Verify the error is truly resolved, not just hidden
- Privacy Violations: Never collect sensitive data without permission
Success Criteria
A troubleshooting session is successful when:
- The original error is completely resolved
- The fix is understood and documented
- No new errors were introduced
- The solution is appropriate and maintainable
- Lessons learned are captured for future reference