Initial commit
This commit is contained in:
378
skills/error-troubleshooter/references/troubleshooting-sop.md
Normal file
378
skills/error-troubleshooter/references/troubleshooting-sop.md
Normal file
@@ -0,0 +1,378 @@
|
||||
# Troubleshooting Standard Operating Procedures
|
||||
|
||||
This document provides detailed, step-by-step procedures for systematic error troubleshooting.
|
||||
|
||||
## Core Troubleshooting Workflow
|
||||
|
||||
### Phase 1: Error Recognition and Initial Response
|
||||
|
||||
When a tool, script, or command fails:
|
||||
|
||||
1. **Capture Complete Error Information**
|
||||
- Full error message (stdout and stderr)
|
||||
- Tool/command that was executed
|
||||
- Context of what was being attempted
|
||||
- Any stack traces or error codes
|
||||
|
||||
2. **Assess Error Clarity**
|
||||
- Is the error message self-explanatory?
|
||||
- Does it explicitly state what went wrong and how to fix it?
|
||||
- Is this a commonly encountered error pattern?
|
||||
|
||||
3. **Decide on Investigation Depth**
|
||||
- **Trivial/Clear**: Apply quick fix
|
||||
- **Non-trivial/Ambiguous**: Proceed to rigorous investigation
|
||||
|
||||
### Phase 2: Quick Fix Attempt (Happy Case)
|
||||
|
||||
**Criteria for attempting quick fix:**
|
||||
- Error message explicitly describes the problem and solution
|
||||
- Error matches a well-known trivial pattern
|
||||
- Fix requires minimal changes with low risk
|
||||
|
||||
**Procedure:**
|
||||
1. Apply the fix based on error message or experience
|
||||
2. Re-execute the failing command
|
||||
3. Evaluate the result:
|
||||
- **Success**: Error is resolved → Done
|
||||
- **No change**: Error message identical → Revert and escalate
|
||||
- **Worse**: New errors or degraded state → Revert immediately and escalate
|
||||
|
||||
**Reversion Protocol:**
|
||||
- Undo all changes made during quick fix attempt
|
||||
- Verify system is back to pre-fix state
|
||||
- Document what was attempted (if creating debug notes)
|
||||
|
||||
### Phase 3: Rigorous Investigation
|
||||
|
||||
Enter this phase when:
|
||||
- Quick fix failed
|
||||
- Error is non-trivial from the start
|
||||
- Multiple potential causes exist
|
||||
- Context is ambiguous
|
||||
|
||||
#### Step 1: Error Template Extraction
|
||||
|
||||
**Purpose**: Prepare error message for effective searching by removing variable components.
|
||||
|
||||
**Procedure:**
|
||||
1. Identify the error type/category (e.g., FileNotFoundError, TypeError, ConnectionError)
|
||||
2. Locate the core error message
|
||||
3. Remove variable components:
|
||||
- File paths: `/home/user/project/file.py` → (remove)
|
||||
- Usernames: `user@example.com` → (remove)
|
||||
- IDs/numbers: `id=12345` → (remove)
|
||||
- Timestamps: `2024-01-15 10:30:45` → (remove)
|
||||
- User inputs: `input='value'` → (remove)
|
||||
- Line numbers: `line 42` → (keep only if part of standard template)
|
||||
|
||||
4. Retain:
|
||||
- Error type/class names
|
||||
- Standard error message structure
|
||||
- Function/method names from standard library/SDK
|
||||
- Standard error codes
|
||||
|
||||
**Example Transformations:**
|
||||
```
|
||||
Original:
|
||||
ValueError: invalid literal for int() with base 10: 'abc' at line 42 in /home/user/script.py
|
||||
|
||||
Template:
|
||||
ValueError invalid literal for int() with base 10
|
||||
|
||||
Original:
|
||||
requests.exceptions.ConnectionError: HTTPConnectionPool(host='api.example.com', port=443): Max retries exceeded with url: /v1/users
|
||||
|
||||
Template:
|
||||
requests.exceptions.ConnectionError HTTPConnectionPool Max retries exceeded
|
||||
|
||||
Original:
|
||||
ModuleNotFoundError: No module named 'pandas'
|
||||
|
||||
Template:
|
||||
ModuleNotFoundError No module named
|
||||
```
|
||||
|
||||
#### Step 2: Environment Information Collection
|
||||
|
||||
**Purpose**: Gather context needed to understand and resolve the error.
|
||||
|
||||
**Collection Strategy:**
|
||||
1. **Start Minimal**: Only collect what's clearly relevant
|
||||
2. **Expand as Needed**: Add more context if initial research is inconclusive
|
||||
3. **Always Protect Privacy**: Never collect passwords, API keys, personal data without explicit permission
|
||||
|
||||
**Standard Environment Information:**
|
||||
|
||||
**System Context:**
|
||||
- Operating system and version
|
||||
- Shell/terminal environment
|
||||
- Current working directory (if relevant)
|
||||
|
||||
**Runtime Context:**
|
||||
- Language/SDK versions (Python, Node.js, etc.)
|
||||
- Package manager versions (pip, npm, etc.)
|
||||
- Virtual environment status
|
||||
|
||||
**Dependency Context:**
|
||||
- Installed package versions (for error-related packages)
|
||||
- Package lock file status
|
||||
- Dependency conflicts
|
||||
|
||||
**Configuration Context:**
|
||||
- Relevant configuration files (only if directly related to error)
|
||||
- Environment variables (only non-sensitive ones)
|
||||
|
||||
**Collection Commands by Context:**
|
||||
|
||||
```bash
|
||||
# Python environment
|
||||
python --version
|
||||
pip --version
|
||||
pip list | grep <package-name>
|
||||
|
||||
# Node.js environment
|
||||
node --version
|
||||
npm --version
|
||||
npm list <package-name>
|
||||
|
||||
# System information
|
||||
uname -a # Unix/Linux/Mac
|
||||
ver # Windows
|
||||
|
||||
# Package conflicts
|
||||
pip check # Python
|
||||
npm doctor # Node.js
|
||||
|
||||
# Environment variables (careful with sensitive data)
|
||||
env | grep <RELEVANT_VAR>
|
||||
```
|
||||
|
||||
**Privacy Protection:**
|
||||
- **Never collect**: Passwords, API keys, tokens, private keys, personal identifiable information
|
||||
- **Request permission before collecting**: Project-specific paths, configuration files, custom environment variables
|
||||
- **Sanitize output**: Remove sensitive data before recording
|
||||
|
||||
#### Step 3: Research and Information Gathering
|
||||
|
||||
**Research Sources (in order of efficiency):**
|
||||
|
||||
1. **Web Search with Error Template**
|
||||
- Search the extracted template (not full error)
|
||||
- Add language/framework name to query
|
||||
- Example: "ModuleNotFoundError No module named python"
|
||||
|
||||
2. **Official Documentation**
|
||||
- Error code references
|
||||
- SDK/API documentation
|
||||
- Known issues and breaking changes
|
||||
|
||||
3. **Community Resources**
|
||||
- Stack Overflow
|
||||
- GitHub Issues (especially for specific libraries)
|
||||
- Framework-specific forums
|
||||
|
||||
**Parallel Research Strategy:**
|
||||
|
||||
For complex problems, launch multiple research angles simultaneously using subagents:
|
||||
|
||||
```
|
||||
Investigation Angles:
|
||||
├─ Subagent 1: Web search for error template
|
||||
├─ Subagent 2: Search GitHub Issues for affected package
|
||||
├─ Subagent 3: Check official documentation for breaking changes
|
||||
└─ Subagent 4: Search for similar errors in codebase history
|
||||
```
|
||||
|
||||
**Research Efficiency:**
|
||||
- Delegate broad searches to subagents
|
||||
- Keep main context focused on synthesis and decision-making
|
||||
- Use file-based notes to accumulate findings
|
||||
|
||||
#### Step 4: Debug Notes Creation
|
||||
|
||||
**When to create debug notes:**
|
||||
- Investigation is expected to be complex
|
||||
- Multiple theories need tracking
|
||||
- Context is being consumed quickly
|
||||
- Investigation may span multiple sessions
|
||||
|
||||
**Debug Notes Structure:**
|
||||
|
||||
```markdown
|
||||
# Debug Session: [Error Summary]
|
||||
|
||||
## Error Information
|
||||
[Full error details]
|
||||
|
||||
## Environment
|
||||
[Relevant environment information]
|
||||
|
||||
## Theories
|
||||
1. [Theory 1]: [likelihood: high/medium/low]
|
||||
- Evidence: [supporting information]
|
||||
- Test: [how to verify]
|
||||
- Result: [pending/confirmed/rejected]
|
||||
|
||||
2. [Theory 2]: ...
|
||||
|
||||
## Research Findings
|
||||
- [Source]: [key information]
|
||||
- [Source]: [key information]
|
||||
|
||||
## Tests Conducted
|
||||
1. [Test description]
|
||||
- Command: [test command]
|
||||
- Result: [outcome]
|
||||
- Conclusion: [what was learned]
|
||||
|
||||
## Solution
|
||||
[Final solution that resolved the issue]
|
||||
```
|
||||
|
||||
Use the template in `assets/debug-notes-template.md` as a starting point.
|
||||
|
||||
#### Step 5: Theory Formulation and Testing
|
||||
|
||||
**Theory Development:**
|
||||
|
||||
Based on research and environment analysis:
|
||||
1. List all plausible explanations for the error
|
||||
2. Assess likelihood of each (high/medium/low)
|
||||
3. Identify evidence supporting or contradicting each theory
|
||||
4. Order theories by likelihood and ease of testing
|
||||
|
||||
**Theory Testing Protocol:**
|
||||
|
||||
For each theory (starting with most likely):
|
||||
|
||||
1. **Design Test**
|
||||
- What command/change will verify or reject this theory?
|
||||
- What outcome would confirm the theory?
|
||||
- What outcome would reject the theory?
|
||||
|
||||
2. **Execute Test**
|
||||
- Run the test in a controlled manner
|
||||
- Capture all output
|
||||
- Note any side effects
|
||||
|
||||
3. **Evaluate Result**
|
||||
- Does the result confirm or reject the theory?
|
||||
- Are there unexpected outcomes?
|
||||
- What new information was gained?
|
||||
|
||||
4. **Document in Debug Notes**
|
||||
- Record test and result
|
||||
- Update theory status
|
||||
- Note any new theories generated
|
||||
|
||||
5. **Iterate**
|
||||
- If theory confirmed: proceed to solution implementation
|
||||
- If theory rejected: move to next theory
|
||||
- If inconclusive: gather more information
|
||||
|
||||
**Testing Best Practices:**
|
||||
- Test one variable at a time
|
||||
- Use minimal reproducible cases when possible
|
||||
- Revert changes between theory tests
|
||||
- Document negative results (what didn't work is valuable information)
|
||||
|
||||
### Phase 4: Solution Implementation
|
||||
|
||||
Once the correct theory is identified:
|
||||
|
||||
1. **Plan Implementation**
|
||||
- What changes are needed?
|
||||
- Are there risks or side effects?
|
||||
- Can the solution be tested incrementally?
|
||||
|
||||
2. **Apply Fix**
|
||||
- Make the necessary changes
|
||||
- Document what was changed
|
||||
- Keep changes minimal and focused
|
||||
|
||||
3. **Verify Resolution**
|
||||
- Re-run the original failing command
|
||||
- Confirm error is completely resolved
|
||||
- Check for new errors or warnings
|
||||
|
||||
4. **Document Solution**
|
||||
- Record the fix in debug notes (if created)
|
||||
- Note root cause and solution for future reference
|
||||
- Consider adding to common error patterns if widely applicable
|
||||
|
||||
### Phase 5: Post-Resolution
|
||||
|
||||
1. **Clean Up**
|
||||
- Remove temporary debug files (if any)
|
||||
- Clean up test artifacts
|
||||
- Restore any temporary changes
|
||||
|
||||
2. **Knowledge Capture**
|
||||
- If this was a difficult problem with a general solution, consider documenting it
|
||||
- Update `references/common-error-patterns.md` if appropriate
|
||||
- Note any tools or techniques that were particularly effective
|
||||
|
||||
## Advanced Investigation Techniques
|
||||
|
||||
### Bisection Method
|
||||
|
||||
For errors introduced by recent changes:
|
||||
1. Identify last known good state
|
||||
2. Bisect the changes between good and bad state
|
||||
3. Test each bisection point
|
||||
4. Narrow down to the specific change that introduced the error
|
||||
|
||||
### Differential Diagnosis
|
||||
|
||||
When multiple theories seem equally plausible:
|
||||
1. Identify distinguishing characteristics of each theory
|
||||
2. Design tests that differentiate between theories
|
||||
3. Execute targeted tests to rule out theories
|
||||
4. Converge on the correct diagnosis through elimination
|
||||
|
||||
### Reproduction Reduction
|
||||
|
||||
For complex errors:
|
||||
1. Create minimal reproducible example
|
||||
2. Strip away unrelated code/configuration
|
||||
3. Isolate the essential elements that trigger the error
|
||||
4. Use reduced case for easier investigation
|
||||
|
||||
## Communication and Escalation
|
||||
|
||||
### When to Request User Input
|
||||
|
||||
Request user input when:
|
||||
- Multiple equally valid solutions exist
|
||||
- User preference affects solution choice
|
||||
- Sensitive information access is needed
|
||||
- Problem domain knowledge is required
|
||||
- Verification of fix needs user testing
|
||||
|
||||
### How to Present Findings
|
||||
|
||||
When communicating with user:
|
||||
1. **Summarize**: Brief description of the error and its cause
|
||||
2. **Explain**: Why the error occurred
|
||||
3. **Solution**: What was done to fix it
|
||||
4. **Verification**: How to confirm it's resolved
|
||||
5. **Prevention**: How to avoid in the future (if applicable)
|
||||
|
||||
## Common Pitfalls to Avoid
|
||||
|
||||
1. **Assumption Paralysis**: Don't assume too much; verify theories with tests
|
||||
2. **Fix Stacking**: Don't apply multiple fixes simultaneously; test one at a time
|
||||
3. **Context Drift**: Stay focused on the original error; avoid rabbit holes
|
||||
4. **Incomplete Reversion**: Always fully revert failed fixes
|
||||
5. **Premature Success**: Verify the error is truly resolved, not just hidden
|
||||
6. **Privacy Violations**: Never collect sensitive data without permission
|
||||
|
||||
## Success Criteria
|
||||
|
||||
A troubleshooting session is successful when:
|
||||
1. The original error is completely resolved
|
||||
2. The fix is understood and documented
|
||||
3. No new errors were introduced
|
||||
4. The solution is appropriate and maintainable
|
||||
5. Lessons learned are captured for future reference
|
||||
Reference in New Issue
Block a user