Initial commit

2025-11-29 18:47:55 +08:00
commit e732da8316
20 changed files with 4969 additions and 0 deletions
--- a/skills/error-troubleshooter/references/troubleshooting-sop.md
+++ b/skills/error-troubleshooter/references/troubleshooting-sop.md
@@ -0,0 +1,378 @@
+# Troubleshooting Standard Operating Procedures
+
+This document provides detailed, step-by-step procedures for systematic error troubleshooting.
+
+## Core Troubleshooting Workflow
+
+### Phase 1: Error Recognition and Initial Response
+
+When a tool, script, or command fails:
+
+1. **Capture Complete Error Information**
+   - Full error message (stdout and stderr)
+   - Tool/command that was executed
+   - Context of what was being attempted
+   - Any stack traces or error codes
+
+2. **Assess Error Clarity**
+   - Is the error message self-explanatory?
+   - Does it explicitly state what went wrong and how to fix it?
+   - Is this a commonly encountered error pattern?
+
+3. **Decide on Investigation Depth**
+   - **Trivial/Clear**: Apply quick fix
+   - **Non-trivial/Ambiguous**: Proceed to rigorous investigation
+
+### Phase 2: Quick Fix Attempt (Happy Case)
+
+**Criteria for attempting quick fix:**
+- Error message explicitly describes the problem and solution
+- Error matches a well-known trivial pattern
+- Fix requires minimal changes with low risk
+
+**Procedure:**
+1. Apply the fix based on error message or experience
+2. Re-execute the failing command
+3. Evaluate the result:
+   - **Success**: Error is resolved → Done
+   - **No change**: Error message identical → Revert and escalate
+   - **Worse**: New errors or degraded state → Revert immediately and escalate
+
+**Reversion Protocol:**
+- Undo all changes made during quick fix attempt
+- Verify system is back to pre-fix state
+- Document what was attempted (if creating debug notes)
+
+### Phase 3: Rigorous Investigation
+
+Enter this phase when:
+- Quick fix failed
+- Error is non-trivial from the start
+- Multiple potential causes exist
+- Context is ambiguous
+
+#### Step 1: Error Template Extraction
+
+**Purpose**: Prepare error message for effective searching by removing variable components.
+
+**Procedure:**
+1. Identify the error type/category (e.g., FileNotFoundError, TypeError, ConnectionError)
+2. Locate the core error message
+3. Remove variable components:
+   - File paths: `/home/user/project/file.py` → (remove)
+   - Usernames: `user@example.com` → (remove)
+   - IDs/numbers: `id=12345` → (remove)
+   - Timestamps: `2024-01-15 10:30:45` → (remove)
+   - User inputs: `input='value'` → (remove)
+   - Line numbers: `line 42` → (keep only if part of standard template)
+
+4. Retain:
+   - Error type/class names
+   - Standard error message structure
+   - Function/method names from standard library/SDK
+   - Standard error codes
+
+**Example Transformations:**
+```
+Original:
+  ValueError: invalid literal for int() with base 10: 'abc' at line 42 in /home/user/script.py
+
+Template:
+  ValueError invalid literal for int() with base 10
+
+Original:
+  requests.exceptions.ConnectionError: HTTPConnectionPool(host='api.example.com', port=443): Max retries exceeded with url: /v1/users
+
+Template:
+  requests.exceptions.ConnectionError HTTPConnectionPool Max retries exceeded
+
+Original:
+  ModuleNotFoundError: No module named 'pandas'
+
+Template:
+  ModuleNotFoundError No module named
+```
+
+#### Step 2: Environment Information Collection
+
+**Purpose**: Gather context needed to understand and resolve the error.
+
+**Collection Strategy:**
+1. **Start Minimal**: Only collect what's clearly relevant
+2. **Expand as Needed**: Add more context if initial research is inconclusive
+3. **Always Protect Privacy**: Never collect passwords, API keys, personal data without explicit permission
+
+**Standard Environment Information:**
+
+**System Context:**
+- Operating system and version
+- Shell/terminal environment
+- Current working directory (if relevant)
+
+**Runtime Context:**
+- Language/SDK versions (Python, Node.js, etc.)
+- Package manager versions (pip, npm, etc.)
+- Virtual environment status
+
+**Dependency Context:**
+- Installed package versions (for error-related packages)
+- Package lock file status
+- Dependency conflicts
+
+**Configuration Context:**
+- Relevant configuration files (only if directly related to error)
+- Environment variables (only non-sensitive ones)
+
+**Collection Commands by Context:**
+
+```bash
+# Python environment
+python --version
+pip --version
+pip list | grep <package-name>
+
+# Node.js environment
+node --version
+npm --version
+npm list <package-name>
+
+# System information
+uname -a  # Unix/Linux/Mac
+ver  # Windows
+
+# Package conflicts
+pip check  # Python
+npm doctor  # Node.js
+
+# Environment variables (careful with sensitive data)
+env | grep <RELEVANT_VAR>
+```
+
+**Privacy Protection:**
+- **Never collect**: Passwords, API keys, tokens, private keys, personal identifiable information
+- **Request permission before collecting**: Project-specific paths, configuration files, custom environment variables
+- **Sanitize output**: Remove sensitive data before recording
+
+#### Step 3: Research and Information Gathering
+
+**Research Sources (in order of efficiency):**
+
+1. **Web Search with Error Template**
+   - Search the extracted template (not full error)
+   - Add language/framework name to query
+   - Example: "ModuleNotFoundError No module named python"
+
+2. **Official Documentation**
+   - Error code references
+   - SDK/API documentation
+   - Known issues and breaking changes
+
+3. **Community Resources**
+   - Stack Overflow
+   - GitHub Issues (especially for specific libraries)
+   - Framework-specific forums
+
+**Parallel Research Strategy:**
+
+For complex problems, launch multiple research angles simultaneously using subagents:
+
+```
+Investigation Angles:
+├─ Subagent 1: Web search for error template
+├─ Subagent 2: Search GitHub Issues for affected package
+├─ Subagent 3: Check official documentation for breaking changes
+└─ Subagent 4: Search for similar errors in codebase history
+```
+
+**Research Efficiency:**
+- Delegate broad searches to subagents
+- Keep main context focused on synthesis and decision-making
+- Use file-based notes to accumulate findings
+
+#### Step 4: Debug Notes Creation
+
+**When to create debug notes:**
+- Investigation is expected to be complex
+- Multiple theories need tracking
+- Context is being consumed quickly
+- Investigation may span multiple sessions
+
+**Debug Notes Structure:**
+
+```markdown
+# Debug Session: [Error Summary]
+
+## Error Information
+[Full error details]
+
+## Environment
+[Relevant environment information]
+
+## Theories
+1. [Theory 1]: [likelihood: high/medium/low]
+   - Evidence: [supporting information]
+   - Test: [how to verify]
+   - Result: [pending/confirmed/rejected]
+
+2. [Theory 2]: ...
+
+## Research Findings
+- [Source]: [key information]
+- [Source]: [key information]
+
+## Tests Conducted
+1. [Test description]
+   - Command: [test command]
+   - Result: [outcome]
+   - Conclusion: [what was learned]
+
+## Solution
+[Final solution that resolved the issue]
+```
+
+Use the template in `assets/debug-notes-template.md` as a starting point.
+
+#### Step 5: Theory Formulation and Testing
+
+**Theory Development:**
+
+Based on research and environment analysis:
+1. List all plausible explanations for the error
+2. Assess likelihood of each (high/medium/low)
+3. Identify evidence supporting or contradicting each theory
+4. Order theories by likelihood and ease of testing
+
+**Theory Testing Protocol:**
+
+For each theory (starting with most likely):
+
+1. **Design Test**
+   - What command/change will verify or reject this theory?
+   - What outcome would confirm the theory?
+   - What outcome would reject the theory?
+
+2. **Execute Test**
+   - Run the test in a controlled manner
+   - Capture all output
+   - Note any side effects
+
+3. **Evaluate Result**
+   - Does the result confirm or reject the theory?
+   - Are there unexpected outcomes?
+   - What new information was gained?
+
+4. **Document in Debug Notes**
+   - Record test and result
+   - Update theory status
+   - Note any new theories generated
+
+5. **Iterate**
+   - If theory confirmed: proceed to solution implementation
+   - If theory rejected: move to next theory
+   - If inconclusive: gather more information
+
+**Testing Best Practices:**
+- Test one variable at a time
+- Use minimal reproducible cases when possible
+- Revert changes between theory tests
+- Document negative results (what didn't work is valuable information)
+
+### Phase 4: Solution Implementation
+
+Once the correct theory is identified:
+
+1. **Plan Implementation**
+   - What changes are needed?
+   - Are there risks or side effects?
+   - Can the solution be tested incrementally?
+
+2. **Apply Fix**
+   - Make the necessary changes
+   - Document what was changed
+   - Keep changes minimal and focused
+
+3. **Verify Resolution**
+   - Re-run the original failing command
+   - Confirm error is completely resolved
+   - Check for new errors or warnings
+
+4. **Document Solution**
+   - Record the fix in debug notes (if created)
+   - Note root cause and solution for future reference
+   - Consider adding to common error patterns if widely applicable
+
+### Phase 5: Post-Resolution
+
+1. **Clean Up**
+   - Remove temporary debug files (if any)
+   - Clean up test artifacts
+   - Restore any temporary changes
+
+2. **Knowledge Capture**
+   - If this was a difficult problem with a general solution, consider documenting it
+   - Update `references/common-error-patterns.md` if appropriate
+   - Note any tools or techniques that were particularly effective
+
+## Advanced Investigation Techniques
+
+### Bisection Method
+
+For errors introduced by recent changes:
+1. Identify last known good state
+2. Bisect the changes between good and bad state
+3. Test each bisection point
+4. Narrow down to the specific change that introduced the error
+
+### Differential Diagnosis
+
+When multiple theories seem equally plausible:
+1. Identify distinguishing characteristics of each theory
+2. Design tests that differentiate between theories
+3. Execute targeted tests to rule out theories
+4. Converge on the correct diagnosis through elimination
+
+### Reproduction Reduction
+
+For complex errors:
+1. Create minimal reproducible example
+2. Strip away unrelated code/configuration
+3. Isolate the essential elements that trigger the error
+4. Use reduced case for easier investigation
+
+## Communication and Escalation
+
+### When to Request User Input
+
+Request user input when:
+- Multiple equally valid solutions exist
+- User preference affects solution choice
+- Sensitive information access is needed
+- Problem domain knowledge is required
+- Verification of fix needs user testing
+
+### How to Present Findings
+
+When communicating with user:
+1. **Summarize**: Brief description of the error and its cause
+2. **Explain**: Why the error occurred
+3. **Solution**: What was done to fix it
+4. **Verification**: How to confirm it's resolved
+5. **Prevention**: How to avoid in the future (if applicable)
+
+## Common Pitfalls to Avoid
+
+1. **Assumption Paralysis**: Don't assume too much; verify theories with tests
+2. **Fix Stacking**: Don't apply multiple fixes simultaneously; test one at a time
+3. **Context Drift**: Stay focused on the original error; avoid rabbit holes
+4. **Incomplete Reversion**: Always fully revert failed fixes
+5. **Premature Success**: Verify the error is truly resolved, not just hidden
+6. **Privacy Violations**: Never collect sensitive data without permission
+
+## Success Criteria
+
+A troubleshooting session is successful when:
+1. The original error is completely resolved
+2. The fix is understood and documented
+3. No new errors were introduced
+4. The solution is appropriate and maintainable
+5. Lessons learned are captured for future reference