zhongwei/gh-iciakky-cc-general-skills

Zhongwei Li e732da8316 Initial commit

2025-11-29 18:47:55 +08:00

Troubleshooting Standard Operating Procedures

This document provides detailed, step-by-step procedures for systematic error troubleshooting.

Core Troubleshooting Workflow

Phase 1: Error Recognition and Initial Response

When a tool, script, or command fails:

Capture Complete Error Information
- Full error message (stdout and stderr)
- Tool/command that was executed
- Context of what was being attempted
- Any stack traces or error codes
Assess Error Clarity
- Is the error message self-explanatory?
- Does it explicitly state what went wrong and how to fix it?
- Is this a commonly encountered error pattern?
Decide on Investigation Depth
- Trivial/Clear: Apply quick fix
- Non-trivial/Ambiguous: Proceed to rigorous investigation
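
The capture step can be scripted so that nothing is lost when a command fails. The following is a minimal Python sketch, not part of the original procedure: `CommandFailure` and `run_and_capture` are hypothetical names, and it assumes the failing command can safely be run as a subprocess.

```python
from __future__ import annotations

import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CommandFailure:
    """The items Phase 1 asks to capture about a failed command."""
    command: list[str]
    context: str       # what was being attempted
    returncode: int    # exit/error code
    stdout: str
    stderr: str        # usually holds the message and any stack trace


def run_and_capture(command: list[str], context: str) -> CommandFailure | None:
    """Run a command; return a CommandFailure if it exits non-zero, else None."""
    proc = subprocess.run(command, capture_output=True, text=True)
    if proc.returncode == 0:
        return None
    return CommandFailure(command, context, proc.returncode, proc.stdout, proc.stderr)


# Demonstration with a command that is guaranteed to fail.
failure = run_and_capture(
    [sys.executable, "-c", "import sys; print('boom', file=sys.stderr); sys.exit(3)"],
    context="demonstration of a failing command",
)
```

Returning `None` on success keeps the happy path cheap: the caller only inspects the record when there is something to troubleshoot.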

Phase 2: Quick Fix Attempt (Happy Case)

Criteria for attempting quick fix:

Error message explicitly describes the problem and solution
Error matches a well-known trivial pattern
Fix requires minimal changes with low risk

Procedure:

Apply the fix based on error message or experience
Re-execute the failing command
Evaluate the result:
- Success: Error is resolved → Done
- No change: Error message identical → Revert and escalate
- Worse: New errors or degraded state → Revert immediately and escalate

Reversion Protocol:

Undo all changes made during quick fix attempt
Verify system is back to pre-fix state
Document what was attempted (if creating debug notes)
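
The quick-fix procedure and its reversion protocol can be sketched as one guarded operation. This is an illustrative Python sketch under simplifying assumptions: the fix touches a single file, and `attempt_quick_fix`, `apply_fix`, and `run_check` are hypothetical names. A changed error message is treated as "worse" here, which is only one way to detect a degraded state.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


def attempt_quick_fix(
    target: Path,
    apply_fix: Callable[[Path], None],
    run_check: Callable[[], Optional[str]],
) -> str:
    """Apply a fix to one file, re-run the check, and revert unless it succeeds.

    run_check returns None on success or the error message on failure.
    Returns "success", "no change", or "worse".
    """
    error_before = run_check()
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / target.name
        shutil.copy2(target, backup)      # snapshot the pre-fix state
        apply_fix(target)
        error_after = run_check()         # re-execute the failing command
        if error_after is None:
            return "success"
        shutil.copy2(backup, target)      # revert before escalating
    # Identical message: the fix had no effect. Different message: a new error.
    return "no change" if error_after == error_before else "worse"
```

In both failure outcomes the file is restored before the function returns, so escalation to Phase 3 always starts from the pre-fix state.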

Phase 3: Rigorous Investigation

Enter this phase when:

Quick fix failed
Error is non-trivial from the start
Multiple potential causes exist
Context is ambiguous

Step 1: Error Template Extraction

Purpose: Prepare error message for effective searching by removing variable components.

Procedure:

Identify the error type/category (e.g., FileNotFoundError, TypeError, ConnectionError)
Locate the core error message
Remove variable components:
- File paths: /home/user/project/file.py → (remove)
- Usernames and email addresses: user@example.com → (remove)
- IDs/numbers: id=12345 → (remove)
- Timestamps: 2024-01-15 10:30:45 → (remove)
- User inputs: input='value' → (remove)
- Line numbers: line 42 → (keep only if part of standard template)
Retain:
- Error type/class names
- Standard error message structure
- Function/method names from standard library/SDK
- Standard error codes

Example Transformations:

Original:
  ValueError: invalid literal for int() with base 10: 'abc' at line 42 in /home/user/script.py

Template:
  ValueError invalid literal for int() with base 10

Original:
  requests.exceptions.ConnectionError: HTTPConnectionPool(host='api.example.com', port=443): Max retries exceeded with url: /v1/users

Template:
  requests.exceptions.ConnectionError HTTPConnectionPool Max retries exceeded

Original:
  ModuleNotFoundError: No module named 'pandas'

Template:
  ModuleNotFoundError No module named
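
The removal rules above can be approximated with a handful of regular expressions. The sketch below is a rough heuristic, not a definitive implementation: `extract_template` is a hypothetical helper, the patterns are assumptions that cover only Unix-style paths and ISO-like timestamps, and real error messages will need extra rules.

```python
import re

# Applied in order: quoted values go first so their contents cannot confuse later rules.
_VARIABLE_PARTS = [
    re.compile(r"'[^']*'"),                                   # single-quoted user input
    re.compile(r'"[^"]*"'),                                   # double-quoted user input
    re.compile(r"[\w.+-]+@[\w.-]+"),                          # email-style usernames
    re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),    # timestamps
    re.compile(r"(?:\b(?:in|at)\s+)?(?:/[\w.\-]+)+"),         # Unix-style paths
    re.compile(r"\b(?:at\s+)?line\s+\d+\b"),                  # line numbers
    re.compile(r"\b\w+=[^\s,)]*"),                            # key=value pairs such as id=12345
    re.compile(r"\([\s,]+\)"),                                # argument lists emptied by earlier rules
]


def extract_template(error: str) -> str:
    """Strip variable components and return a search-friendly template."""
    for pattern in _VARIABLE_PARTS:
        error = pattern.sub(" ", error)
    error = error.replace(":", " ")       # separators add nothing to a search query
    return " ".join(error.split())        # collapse the leftover whitespace


print(extract_template("ModuleNotFoundError: No module named 'pandas'"))
# ModuleNotFoundError No module named
```

A mechanical pass like this tends to keep slightly more than a hand-written template would (for the `requests` example it retains the trailing "with url"), so the result is best treated as a first draft to trim by eye.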

Step 2: Environment Information Collection

Purpose: Gather context needed to understand and resolve the error.

Collection Strategy:

Start Minimal: Only collect what's clearly relevant
Expand as Needed: Add more context if initial research is inconclusive
Always Protect Privacy: Never collect passwords, API keys, or personal data, and request explicit permission before collecting anything project-specific

Standard Environment Information:

System Context:

Operating system and version
Shell/terminal environment
Current working directory (if relevant)

Runtime Context:

Language/SDK versions (Python, Node.js, etc.)
Package manager versions (pip, npm, etc.)
Virtual environment status

Dependency Context:

Installed package versions (for error-related packages)
Package lock file status
Dependency conflicts

Configuration Context:

Relevant configuration files (only if directly related to error)
Environment variables (only non-sensitive ones)

Collection Commands by Context:

# Python environment
python --version
pip --version
pip list | grep <package-name>

# Node.js environment
node --version
npm --version
npm list <package-name>

# System information
uname -a  # Unix/Linux/Mac
ver  # Windows

# Package conflicts
pip check  # Python
npm ls  # Node.js (reports missing, invalid, or extraneous packages)

# Environment variables (careful with sensitive data)
env | grep <RELEVANT_VAR>
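
When the failing tool is itself a Python program, the same "start minimal" collection can be done from inside the interpreter. This is a small sketch using only the standard library; `collect_environment` is a hypothetical helper, and the virtual-environment check assumes a `venv`/`virtualenv`-style environment (it does not detect conda).

```python
from __future__ import annotations

import platform
import sys
from importlib import metadata


def collect_environment(packages: list[str]) -> dict[str, str]:
    """Collect OS, Python, virtualenv status, and versions of error-related packages."""
    info = {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        # Inside a venv, sys.prefix points at the environment, not the base install.
        "virtualenv": str(sys.prefix != sys.base_prefix),
    }
    for name in packages:
        try:
            info[f"package:{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[f"package:{name}"] = "not installed"
    return info
```

Passing only the packages named in the error keeps the output small; the list can be widened later if the first round of research is inconclusive.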

Privacy Protection:

Never collect: Passwords, API keys, tokens, private keys, personally identifiable information
Request permission before collecting: Project-specific paths, configuration files, custom environment variables
Sanitize output: Remove sensitive data before recording
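
Sanitizing can be partly automated by redacting variables whose names look sensitive before anything is written to notes. The sketch below is a heuristic only: `sanitize_env` and the name pattern are assumptions, and a name-based filter will not catch secrets embedded in the values of innocently named variables.

```python
from __future__ import annotations

import re

# Hypothetical pattern; extend it to match local naming conventions.
SENSITIVE_NAME = re.compile(
    r"PASSWORD|PASSWD|SECRET|TOKEN|API_?KEY|PRIVATE_?KEY|CREDENTIAL",
    re.IGNORECASE,
)


def sanitize_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env with values of sensitive-looking names redacted."""
    return {
        name: "<redacted>" if SENSITIVE_NAME.search(name) else value
        for name, value in env.items()
    }


sample = {"PATH": "/usr/bin", "GITHUB_TOKEN": "abc123", "db_password": "hunter2"}
print(sanitize_env(sample))
# {'PATH': '/usr/bin', 'GITHUB_TOKEN': '<redacted>', 'db_password': '<redacted>'}
```

Keeping the variable name while hiding the value preserves the useful fact that the variable is set.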

Step 3: Research and Information Gathering

Research Sources (in order of efficiency):

Web Search with Error Template
- Search the extracted template (not full error)
- Add language/framework name to query
- Example: "ModuleNotFoundError No module named python"
Official Documentation
- Error code references
- SDK/API documentation
- Known issues and breaking changes
Community Resources
- Stack Overflow
- GitHub Issues (especially for specific libraries)
- Framework-specific forums

Parallel Research Strategy:

For complex problems, launch multiple research angles simultaneously using subagents:

Investigation Angles:
├─ Subagent 1: Web search for error template
├─ Subagent 2: Search GitHub Issues for affected package
├─ Subagent 3: Check official documentation for breaking changes
└─ Subagent 4: Search for similar errors in codebase history

Research Efficiency:

Delegate broad searches to subagents
Keep main context focused on synthesis and decision-making
Use file-based notes to accumulate findings
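
The fan-out and synthesize pattern can be illustrated with ordinary threads. This is only an analogy in Python: how subagents are actually launched depends on the agent framework, so `run_research_angles` and the placeholder tasks below are hypothetical stand-ins rather than a real subagent API.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_research_angles(angles: dict[str, Callable[[], str]]) -> dict[str, str]:
    """Run every research angle concurrently and collect findings by angle name."""
    with ThreadPoolExecutor(max_workers=max(1, len(angles))) as pool:
        futures = {name: pool.submit(task) for name, task in angles.items()}
        # result() blocks until that angle finishes, so synthesis sees all findings.
        return {name: future.result() for name, future in futures.items()}


# Placeholder tasks; real ones would perform searches and return summaries.
findings = run_research_angles({
    "web search": lambda: "summary of web results",
    "github issues": lambda: "summary of related issues",
    "official docs": lambda: "summary of breaking changes",
})
```

Each angle returns a short summary rather than raw search output, which matches the goal of keeping the main context focused on synthesis.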

Step 4: Debug Notes Creation

When to create debug notes:

Investigation is expected to be complex
Multiple theories need tracking
Context is being consumed quickly
Investigation may span multiple sessions

Debug Notes Structure:

# Debug Session: [Error Summary]

## Error Information
[Full error details]

## Environment
[Relevant environment information]

## Theories
1. [Theory 1]: [likelihood: high/medium/low]
   - Evidence: [supporting information]
   - Test: [how to verify]
   - Result: [pending/confirmed/rejected]

2. [Theory 2]: ...

## Research Findings
- [Source]: [key information]
- [Source]: [key information]

## Tests Conducted
1. [Test description]
   - Command: [test command]
   - Result: [outcome]
   - Conclusion: [what was learned]

## Solution
[Final solution that resolved the issue]

Use the template in assets/debug-notes-template.md as a starting point.
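
For sessions where notes are generated programmatically, the structure above can be rendered from plain data. This is a sketch that mirrors the section layout shown earlier; `render_debug_notes` and the dictionary keys are hypothetical, and it does not read the template file itself.

```python
from __future__ import annotations


def render_debug_notes(
    error_summary: str,
    error_details: str,
    environment: str,
    theories: list[dict[str, str]],
) -> str:
    """Render a debug-notes skeleton with the theories filled in."""
    lines = [
        f"# Debug Session: {error_summary}", "",
        "## Error Information", error_details, "",
        "## Environment", environment, "",
        "## Theories",
    ]
    for number, theory in enumerate(theories, start=1):
        lines += [
            f"{number}. {theory['summary']}: [likelihood: {theory['likelihood']}]",
            f"   - Evidence: {theory.get('evidence', '')}",
            f"   - Test: {theory.get('test', '')}",
            f"   - Result: {theory.get('result', 'pending')}",
            "",
        ]
    # Remaining sections start empty and are filled in as the investigation proceeds.
    lines += ["## Research Findings", "", "## Tests Conducted", "", "## Solution", ""]
    return "\n".join(lines)
```

New theories default to a `pending` result, so a freshly rendered file already reflects which tests are still outstanding.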

Step 5: Theory Formulation and Testing

Theory Development:

Based on research and environment analysis:

List all plausible explanations for the error
Assess likelihood of each (high/medium/low)
Identify evidence supporting or contradicting each theory
Order theories by likelihood and ease of testing

Theory Testing Protocol:

For each theory (starting with most likely):

Design Test
- What command/change will verify or reject this theory?
- What outcome would confirm the theory?
- What outcome would reject the theory?
Execute Test
- Run the test in a controlled manner
- Capture all output
- Note any side effects
Evaluate Result
- Does the result confirm or reject the theory?
- Are there unexpected outcomes?
- What new information was gained?
Document in Debug Notes
- Record test and result
- Update theory status
- Note any new theories generated
Iterate
- If theory confirmed: proceed to solution implementation
- If theory rejected: move to next theory
- If inconclusive: gather more information

Testing Best Practices:

Test one variable at a time
Use minimal reproducible cases when possible
Revert changes between theory tests
Document negative results (what didn't work is valuable information)
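
The ordering rule and the testing loop can be sketched together. In this illustrative Python sketch, `Theory`, `effort`, and `evaluate_theories` are hypothetical names, and the numeric effort scale is an assumption. An inconclusive result is only recorded here; in practice it is the cue to gather more information before moving on.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

LIKELIHOOD_RANK = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Theory:
    summary: str
    likelihood: str                          # "high", "medium", or "low"
    effort: int                              # assumed scale: 1 (easy) to 5 (hard) to test
    run_test: Callable[[], Optional[bool]]   # True=confirmed, False=rejected, None=inconclusive
    status: str = "pending"


def evaluate_theories(theories: list[Theory]) -> Optional[Theory]:
    """Test theories one at a time, most likely and cheapest first."""
    ordered = sorted(theories, key=lambda t: (LIKELIHOOD_RANK[t.likelihood], t.effort))
    for theory in ordered:
        outcome = theory.run_test()
        if outcome is True:
            theory.status = "confirmed"
            return theory                    # proceed to solution implementation
        # Negative and inconclusive results are recorded, not discarded.
        theory.status = "rejected" if outcome is False else "inconclusive"
    return None
```

Theories that are never reached keep the `pending` status, which makes it clear afterwards what was and was not actually tested.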

Phase 4: Solution Implementation

Once the correct theory is identified:

Plan Implementation
- What changes are needed?
- Are there risks or side effects?
- Can the solution be tested incrementally?
Apply Fix
- Make the necessary changes
- Document what was changed
- Keep changes minimal and focused
Verify Resolution
- Re-run the original failing command
- Confirm error is completely resolved
- Check for new errors or warnings
Document Solution
- Record the fix in debug notes (if created)
- Note root cause and solution for future reference
- Consider adding to common error patterns if widely applicable

Phase 5: Post-Resolution

Clean Up
- Remove temporary debug files (if any)
- Clean up test artifacts
- Restore any temporary changes
Knowledge Capture
- If this was a difficult problem with a general solution, consider documenting it
- Update references/common-error-patterns.md if appropriate
- Note any tools or techniques that were particularly effective

Advanced Investigation Techniques

Bisection Method

For errors introduced by recent changes:

Identify last known good state
Bisect the changes between good and bad state
Test each bisection point
Narrow down to the specific change that introduced the error
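
The bisection steps amount to a binary search over an ordered history. The sketch below assumes the first entry is known good, the last is known bad, and the error persists once introduced; `find_first_bad` and `is_bad` are hypothetical names. For Git histories, `git bisect` automates the same search.

```python
from __future__ import annotations

from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def find_first_bad(changes: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first bad change.

    changes is ordered oldest to newest; changes[0] must be good and
    changes[-1] must be bad.
    """
    if len(changes) < 2:
        raise ValueError("need at least one good and one bad state")
    good, bad = 0, len(changes) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(changes[mid]):
            bad = mid        # the error was introduced at or before mid
        else:
            good = mid       # the error was introduced after mid
    return changes[bad]


# Ten hypothetical revisions, with the error introduced in revision 6.
print(find_first_bad(list(range(10)), lambda revision: revision >= 6))
# 6
```

Each test halves the remaining range, so ten revisions need at most four tests rather than nine.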

Differential Diagnosis

When multiple theories seem equally plausible:

Identify distinguishing characteristics of each theory
Design tests that differentiate between theories
Execute targeted tests to rule out theories
Converge on the correct diagnosis through elimination
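
Elimination becomes mechanical once each theory's predictions are written down. In this sketch the data layout (theory name mapped to predicted test outcomes) and the helper names `distinguishing_tests` and `eliminate` are assumptions made for illustration.

```python
from __future__ import annotations


def distinguishing_tests(predictions: dict[str, dict[str, bool]]) -> list[str]:
    """Return tests for which at least two theories predict different outcomes."""
    tests = {test for predicted in predictions.values() for test in predicted}
    return sorted(
        test for test in tests
        if len({p[test] for p in predictions.values() if test in p}) > 1
    )


def eliminate(
    predictions: dict[str, dict[str, bool]],
    observations: dict[str, bool],
) -> list[str]:
    """Keep only theories consistent with every observed outcome they predict."""
    return [
        theory for theory, predicted in predictions.items()
        if all(
            predicted[test] == outcome
            for test, outcome in observations.items()
            if test in predicted
        )
    ]


predictions = {
    "missing package": {"import works in fresh venv": True, "pip check passes": True},
    "version conflict": {"import works in fresh venv": True, "pip check passes": False},
}
```

Here only "pip check passes" separates the two theories, so it is the one test worth running; the other would produce the same outcome under either theory.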

Reproduction Reduction

For complex errors:

Create minimal reproducible example
Strip away unrelated code/configuration
Isolate the essential elements that trigger the error
Use reduced case for easier investigation
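
Stripping away unrelated elements can be done greedily: remove one element, re-check, and keep the removal only if the error still occurs. This is a simple single-pass sketch, much cruder than full delta debugging; `reduce_case` and `still_fails` are hypothetical names, and the elements could be source lines, config entries, or input records.

```python
from __future__ import annotations

from typing import Callable, TypeVar

T = TypeVar("T")


def reduce_case(items: list[T], still_fails: Callable[[list[T]], bool]) -> list[T]:
    """Drop every element whose removal leaves the error reproducible."""
    reduced = list(items)
    index = 0
    while index < len(reduced):
        candidate = reduced[:index] + reduced[index + 1:]
        if still_fails(candidate):
            reduced = candidate   # element was unrelated; drop it
        else:
            index += 1            # element is needed to trigger the error; keep it
    return reduced


script = ["import os", "x = 1", "y = int('abc')", "print(x)"]
print(reduce_case(script, lambda lines: "y = int('abc')" in lines))
# ["y = int('abc')"]
```

The `still_fails` check should test for the original error specifically, otherwise the reduction can drift toward a different failure.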

Communication and Escalation

When to Request User Input

Request user input when:

Multiple equally valid solutions exist
User preference affects solution choice
Sensitive information access is needed
Problem domain knowledge is required
Verification of fix needs user testing

How to Present Findings

When communicating with user:

Summarize: Brief description of the error and its cause
Explain: Why the error occurred
Solution: What was done to fix it
Verification: How to confirm it's resolved
Prevention: How to avoid in the future (if applicable)

Common Pitfalls to Avoid

Assumption Paralysis: Don't pile up untested assumptions; verify each theory with a test
Fix Stacking: Don't apply multiple fixes simultaneously; test one at a time
Context Drift: Stay focused on the original error; avoid rabbit holes
Incomplete Reversion: Always fully revert failed fixes
Premature Success: Verify the error is truly resolved, not just hidden
Privacy Violations: Never collect sensitive data without permission

Success Criteria

A troubleshooting session is successful when:

The original error is completely resolved
The fix is understood and documented
No new errors were introduced
The solution is appropriate and maintainable
Lessons learned are captured for future reference
