Initial commit

2025-11-29 18:47:55 +08:00
commit e732da8316
20 changed files with 4969 additions and 0 deletions
--- a/skills/debug/references/common_patterns.md
+++ b/skills/debug/references/common_patterns.md
@@ -0,0 +1,306 @@
+# Common Bug Patterns and Signatures
+
+This reference documents frequently encountered bug patterns, their signatures, and diagnostic approaches.
+
+## Pattern Categories
+
+### 1. Timing and Concurrency Issues
+
+#### Race Conditions
+**Signature:**
+- Intermittent failures
+- Works in development but fails in production
+- Different results with same input
+- Failures during high load
+
+**Common Causes:**
+- Shared mutable state without synchronization
+- Incorrect thread-safety assumptions
+- Async operations completing in unexpected order
+
+**Investigation Approach:**
+- Add extensive logging with timestamps
+- Use debugger breakpoints sparingly (pausing a thread changes timing and can hide the race)
+- Add delays to expose race windows
+- Review all shared state access patterns
+
+#### Deadlocks
+**Signature:**
+- Application hangs indefinitely
+- No error messages
+- Complete freeze with near-idle CPU (sustained high CPU points to a livelock or busy-wait instead)
+- Multiple threads waiting
+
+**Common Causes:**
+- Circular wait for locks
+- Lock ordering violations
+- Database transaction deadlocks
+
+**Investigation Approach:**
+- Check thread dumps / stack traces
+- Review lock acquisition order
+- Use database deadlock detection tools
+- Add timeout mechanisms
+
+### 2. Memory Issues
+
+#### Memory Leaks
+**Signature:**
+- Gradually increasing memory usage
+- Performance degradation over time
+- Out of memory errors after extended runtime
+- Works initially, fails after hours/days
+
+**Common Causes:**
+- Event listeners not cleaned up
+- Cache without eviction policy
+- Lingering references that keep objects reachable (or reference cycles in runtimes that rely on reference counting alone)
+- Resource handles not closed
+
+**Investigation Approach:**
+- Profile memory over time
+- Take heap dumps at intervals
+- Compare object counts between snapshots
+- Check for unclosed resources (files, connections, sockets)
+
+#### Stack Overflow
+**Signature:**
+- Stack overflow error
+- Deep recursion errors
+- Crashes at predictable depth
+
+**Common Causes:**
+- Unbounded recursion
+- Missing base case in recursive function
+- Circular data structure traversal
+
+**Investigation Approach:**
+- Check recursion depth
+- Verify base case conditions
+- Look for circular references
+- Consider iterative alternative
+
+### 3. State Management Issues
+
+#### Stale Cache
+**Signature:**
+- Outdated data displayed
+- Inconsistency between systems
+- Works after cache clear
+- Different results on different servers
+
+**Common Causes:**
+- Cache invalidation not triggered
+- TTL too long
+- Distributed cache synchronization issues
+
+**Investigation Approach:**
+- Check cache invalidation logic
+- Verify cache key generation
+- Test with cache disabled
+- Review cache update patterns
+
+#### State Corruption
+**Signature:**
+- Invalid state transitions
+- Data inconsistency
+- Unexpected null values
+- Objects in impossible states
+
+**Common Causes:**
+- Direct state mutation
+- Missing validation
+- Incorrect error handling leaving partial updates
+- Concurrent modifications
+
+**Investigation Approach:**
+- Add state validation assertions
+- Review state mutation points
+- Check transaction boundaries
+- Look for error handling gaps
+
+### 4. Integration Issues
+
+#### API Failures
+**Signature:**
+- Timeout errors
+- 500/503 errors
+- Network errors
+- Rate limiting responses (HTTP 429)
+
+**Common Causes:**
+- Third-party API downtime
+- Network connectivity issues
+- Authentication token expiration
+- Rate limits exceeded
+
+**Investigation Approach:**
+- Check API status pages
+- Verify network connectivity
+- Review authentication flow
+- Check rate limit headers
+- Test the API directly (curl/Postman)
+
+#### Database Issues
+**Signature:**
+- Connection pool exhausted
+- Slow query performance
+- Lock wait timeouts
+- Connection refused errors
+
+**Common Causes:**
+- Connection leaks (not closing connections)
+- Missing indexes causing full table scans
+- N+1 query problems
+- Database server overload
+
+**Investigation Approach:**
+- Monitor connection pool metrics
+- Review slow query logs
+- Check execution plans
+- Look for repeated queries in loops
+
+### 5. Configuration Issues
+
+#### Environment Mismatches
+**Signature:**
+- Works locally, fails in production
+- Different behavior across environments
+- "It works on my machine"
+
+**Common Causes:**
+- Different environment variables
+- Different dependency versions
+- Different configuration files
+- Platform-specific code paths
+
+**Investigation Approach:**
+- Compare environment variables
+- Check dependency versions (package-lock.json, poetry.lock, etc.)
+- Review configuration for environment-specific values
+- Check platform-specific code paths
+
+#### Missing Dependencies
+**Signature:**
+- Module not found errors
+- Import errors
+- Class/function not defined
+- Version incompatibility errors
+
+**Common Causes:**
+- Missing package in requirements
+- Outdated dependency versions
+- Peer dependency conflicts
+- System library missing
+
+**Investigation Approach:**
+- Review dependency manifests
+- Check installed versions vs required
+- Look for dependency conflicts
+- Verify system libraries installed
+
+### 6. Logic Errors
+
+#### Off-by-One Errors
+**Signature:**
+- Index out of bounds
+- Missing first or last element
+- Infinite loops
+- Incorrect boundary handling
+
+**Common Causes:**
+- Using < where <= is needed (or vice versa)
+- 0-indexed vs 1-indexed confusion
+- Incorrect loop conditions
+
+**Investigation Approach:**
+- Check boundary conditions
+- Test with edge cases (empty, single element)
+- Review loop conditions carefully
+- Add assertions for expected ranges
+
+#### Type Coercion Bugs
+**Signature:**
+- Unexpected type errors
+- Comparison behaving unexpectedly
+- String concatenation instead of addition
+- Falsy value handling issues
+
+**Common Causes:**
+- Implicit type conversion
+- Loose equality checks (== vs ===)
+- Type assumptions without validation
+- Mixed numeric types
+
+**Investigation Approach:**
+- Add explicit type checks
+- Use strict equality
+- Add type annotations/hints
+- Check for implicit conversions
+
+### 7. Error Handling Issues
+
+#### Swallowed Exceptions
+**Signature:**
+- Silent failures
+- No error messages despite failure
+- Incomplete operations
+- Success reported despite failure
+
+**Common Causes:**
+- Empty catch blocks
+- Broad exception catching
+- Returning default values on error
+- Not re-raising exceptions
+
+**Investigation Approach:**
+- Search for empty catch/except blocks
+- Review exception handling patterns
+- Add logging to all error paths
+- Check for bare except/catch clauses
+
+#### Error Propagation Failures
+**Signature:**
+- Low-level errors exposed to users
+- Unclear error messages
+- Generic "Something went wrong"
+- Stack traces in user interface
+
+**Common Causes:**
+- No error translation layer
+- Missing error boundaries
+- Not catching specific exceptions
+- No user-friendly error messages
+
+**Investigation Approach:**
+- Review error handling architecture
+- Check error message clarity
+- Verify error boundaries exist
+- Test error scenarios
+
+## Pattern Recognition Strategies
+
+### Look for These Red Flags:
+
+1. **Time-based behavior**: If adding delays changes behavior, suspect timing issues
+2. **Load-based failures**: If failures increase with load, suspect resource exhaustion or race conditions
+3. **Environment-specific**: If it fails only in certain environments, suspect configuration differences
+4. **Gradual degradation**: If performance worsens over time, suspect memory leaks or resource leaks
+5. **Intermittent failures**: If behavior is non-deterministic, suspect concurrency issues or external dependencies
+
+### Diagnostic Quick Checks:
+
+1. **Can you reproduce it consistently?** No → Likely timing/concurrency issue
+2. **Does it fail immediately?** Yes → Likely configuration or initialization issue
+3. **Does it fail after some time?** Yes → Likely resource leak or state corruption
+4. **Does it fail with specific input?** Yes → Likely validation or edge case handling issue
+5. **Does it fail only in production?** Yes → Likely environment or load-related issue
+
+## Using This Reference
+
+When encountering a bug:
+1. Match the signature to patterns above
+2. Review common causes for that pattern
+3. Follow the investigation approach
+4. Apply lessons from similar past issues
+5. Update this document if you discover new patterns
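
The race-condition entry in common_patterns.md above names "shared mutable state without synchronization" as a cause and "add delays to expose race windows" as a diagnostic. A minimal Python sketch of both follows. The `Counter` class, the `run_workers` helper, and the 1 ms delay are hypothetical illustrations and are not part of the committed files.

```python
import threading
import time


class Counter:
    """Hypothetical shared state used only for this illustration."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self, delay=0.001):
        # Read-modify-write without synchronization.
        current = self.value
        time.sleep(delay)  # artificial delay widens the race window
        self.value = current + 1

    def safe_increment(self, delay=0.001):
        # Holding the lock makes the read-modify-write atomic
        # with respect to the other threads.
        with self._lock:
            current = self.value
            time.sleep(delay)
            self.value = current + 1


def run_workers(increment, workers=8):
    """Start `workers` threads that each call `increment` once, then wait."""
    threads = [threading.Thread(target=increment) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    unsafe = Counter()
    run_workers(unsafe.unsafe_increment)
    print("unsafe:", unsafe.value)  # usually below 8 because updates are lost

    safe = Counter()
    run_workers(safe.safe_increment)
    print("safe:", safe.value)
```

The unsynchronized variant tends to lose updates once the delay is in place, which matches the "different results with same input" signature. The locked variant should always reach the worker count.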
--- a/skills/debug/references/debugging_checklist.md
+++ b/skills/debug/references/debugging_checklist.md
@@ -0,0 +1,176 @@
+# Debugging Checklist
+
+This checklist provides detailed action items for each step of the debugging workflow.
+
+## Step 1: Observe Without Preconception ✓
+
+**Evidence Collection:**
+- [ ] Review user's bug report or issue description
+- [ ] Examine error messages and stack traces
+- [ ] Check application logs (stderr, stdout, application-specific logs)
+- [ ] Review monitoring dashboards (if available)
+- [ ] Inspect recent code changes (`git diff`, `git log`)
+- [ ] Document current environment (OS, versions, dependencies)
+- [ ] Capture configuration files (config files, environment variables, CLI arguments)
+- [ ] Screenshot or record the error if visual
+- [ ] Note exact steps to reproduce
+
+**Documentation:**
+- [ ] Create investigation log file
+- [ ] Record timestamp and initial observations
+- [ ] List all data sources consulted
+
+## Step 2: Classify and Isolate Facts ✓
+
+**Symptom Analysis:**
+- [ ] List all observable symptoms
+- [ ] Distinguish symptoms from potential causes
+- [ ] Identify what changed recently (code, config, dependencies, infrastructure)
+
+**Scope Narrowing:**
+- [ ] Test across different environments (dev, staging, production)
+- [ ] Test across different platforms (Windows, Linux, macOS)
+- [ ] Test across different browsers (if web application)
+- [ ] Test with different input data
+- [ ] Test with different configurations
+- [ ] Identify minimal reproduction case
+- [ ] Test with previous working version (regression testing)
+
+**Component Isolation:**
+- [ ] List all involved components/modules
+- [ ] Mark components known to work correctly
+- [ ] Highlight suspicious components
+- [ ] Draw dependency diagram if complex
+
+## Step 3: Build Differential Diagnosis List ✓
+
+**Infrastructure Issues:**
+- [ ] Network connectivity problems
+- [ ] DNS resolution failures
+- [ ] Load balancer misconfiguration
+- [ ] Firewall/security group blocking
+- [ ] Resource exhaustion (CPU, memory, disk)
+
+**Application Issues:**
+- [ ] Cache staleness or corruption
+- [ ] Database connection pool exhaustion
+- [ ] Database deadlocks or slow queries
+- [ ] Third-party API failures or timeouts
+- [ ] Memory leaks
+- [ ] Race conditions or threading issues
+- [ ] Incorrect error handling
+- [ ] Missing or incorrect input validation
+
+**Configuration Issues:**
+- [ ] Environment variable mismatch
+- [ ] Configuration file errors
+- [ ] Version incompatibility
+- [ ] Missing dependencies
+- [ ] Permission problems
+
+**Code Issues:**
+- [ ] Logic errors in recent changes
+- [ ] Null pointer/undefined errors
+- [ ] Type mismatches
+- [ ] Off-by-one errors
+- [ ] Incorrect assumptions
+
+## Step 4: Apply Elimination and Deductive Reasoning ✓
+
+**Hypothesis Testing:**
+- [ ] Rank hypotheses by likelihood
+- [ ] Design test for most likely hypothesis
+- [ ] Execute test and document result
+- [ ] If hypothesis invalidated, mark as eliminated
+- [ ] If hypothesis confirmed, design further verification
+- [ ] Move to next hypothesis if needed
+
+**Reasoning Documentation:**
+- [ ] Document "If X, then Y" statements
+- [ ] Record why each hypothesis was eliminated
+- [ ] Note which tests ruled out which possibilities
+- [ ] Maintain chain of reasoning for review
+
+**Narrowing Down:**
+- [ ] Eliminate external factors first (network, APIs)
+- [ ] Then infrastructure (resources, configuration)
+- [ ] Then application-level issues (cache, database)
+- [ ] Finally code-level issues (logic, types)
+
+## Step 5: Experimental Verification ✓
+
+**Preparation:**
+- [ ] Create git branch for experiments
+- [ ] Backup current state (checkpoint)
+- [ ] Document experiment plan
+
+**Experimentation:**
+- [ ] Add logging/instrumentation to suspected area
+- [ ] Add debug breakpoints if using debugger
+- [ ] Create controlled test case
+- [ ] Run experiment and capture output
+- [ ] Compare actual vs expected behavior
+
+**Research:**
+- [ ] Search GitHub issues for similar problems
+- [ ] Check Stack Overflow for related questions
+- [ ] Review official documentation for edge cases
+- [ ] Check release notes for known issues
+- [ ] Consult language/framework changelog
+
+**Validation:**
+- [ ] Can the issue be reproduced consistently?
+- [ ] Does the evidence match the hypothesis?
+- [ ] Are there alternative explanations?
+
+## Step 6: Locate and Implement Fix ✓
+
+**Root Cause Confirmation:**
+- [ ] Identify exact file and line number
+- [ ] Understand why the code fails
+- [ ] Confirm this is root cause, not symptom
+
+**Solution Design:**
+- [ ] Consider multiple fix approaches
+- [ ] Evaluate side effects of each approach
+- [ ] Choose most elegant and maintainable solution
+- [ ] Ensure fix doesn't introduce new issues
+
+**Implementation:**
+- [ ] Implement the fix
+- [ ] Add comments explaining the fix
+- [ ] Update related documentation
+- [ ] Add test case to prevent regression
+
+**Verification:**
+- [ ] Test the fix resolves original issue
+- [ ] Run existing test suite
+- [ ] Test edge cases
+- [ ] Verify no new issues introduced
+
+## Step 7: Prevention Mechanism ✓
+
+**Stability Verification:**
+- [ ] Run full test suite
+- [ ] Perform integration testing
+- [ ] Test in staging environment
+- [ ] Monitor for unexpected behavior
+
+**Documentation:**
+- [ ] Update CLAUDE.md or project docs
+- [ ] Document root cause
+- [ ] Document fix and reasoning
+- [ ] Add to knowledge base
+
+**Prevention Measures:**
+- [ ] Add automated test for this scenario
+- [ ] Add validation/assertions to prevent recurrence
+- [ ] Update error messages for clarity
+- [ ] Add monitoring/alerting if applicable
+- [ ] Share learnings with team
+
+**Post-Mortem:**
+- [ ] Review what went well
+- [ ] Identify what could improve
+- [ ] Update debugging procedures if needed
+- [ ] Celebrate the fix! 🎉
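
Step 2 of the checklist above asks for a test against a previous working version. When many versions lie between the last known-good and the first known-bad one, a binary search keeps the number of tests logarithmic, which is the idea behind `git bisect`. The sketch below assumes versions are ordered oldest to newest, the first is good, the last is bad, and the failure persists once introduced. `first_bad_version` and `is_bad` are hypothetical names chosen for illustration.

```python
from typing import Callable, Sequence


def first_bad_version(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is True.

    Assumes versions[0] is good, versions[-1] is bad, and every version
    after the first bad one is also bad (the bug does not come and go).
    """
    if len(versions) < 2:
        raise ValueError("need at least one good and one bad version")
    good, bad = 0, len(versions) - 1  # indices of known-good / known-bad
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid   # bug already present: look earlier
        else:
            good = mid  # still working: look later
    return versions[bad]


if __name__ == "__main__":
    # Hypothetical history in which the bug was introduced in "v1.4".
    history = ["v1.0", "v1.1", "v1.2", "v1.3", "v1.4", "v1.5", "v1.6"]
    tested = []

    def is_bad(version: str) -> bool:
        tested.append(version)
        return history.index(version) >= history.index("v1.4")

    print(first_bad_version(history, is_bad), "found after testing", tested)
```

In practice `is_bad` would check out the version and run the minimal reproduction case. If the bug is intermittent, the monotonic assumption breaks and each probe needs several runs.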
--- a/skills/debug/references/investigation_template.md
+++ b/skills/debug/references/investigation_template.md
@@ -0,0 +1,292 @@
+# Bug Investigation Log Template
+
+Use this template to document debugging sessions systematically. Copy and adapt as needed.
+
+---
+
+## Investigation Metadata
+
+**Issue ID/Reference:** [e.g., #123, TICKET-456]
+**Date Started:** [YYYY-MM-DD HH:MM]
+**Investigator:** [Name or AI assistant]
+**Priority:** [Critical / High / Medium / Low]
+**Status:** [🔴 Investigating / 🟡 In Progress / 🟢 Resolved]
+
+---
+
+## Step 1: Initial Observations
+
+**User Report:**
+```
+[Paste user's bug report or description here]
+```
+
+**Reproduction Steps:**
+1. [Step 1]
+2. [Step 2]
+3. [Step 3]
+
+**Expected Behavior:**
+[What should happen]
+
+**Actual Behavior:**
+[What actually happens]
+
+**Environment:**
+- OS: [e.g., Ubuntu 22.04, Windows 11, macOS 14]
+- Application Version: [e.g., v2.3.1]
+- Runtime: [e.g., Node.js 18.16, Python 3.11]
+- Browser: [if applicable]
+
+**Evidence Collected:**
+
+*Error Messages:*
+```
+[Paste error messages, stack traces]
+```
+
+*Logs:*
+```
+[Relevant log entries]
+```
+
+*Configuration:*
+```
+[Relevant config values, environment variables]
+```
+
+*Recent Changes:*
+- [Commit hash / PR / change description]
+- [git diff summary if relevant]
+
+---
+
+## Step 2: Fact Classification
+
+**Confirmed Symptoms (Observable Facts):**
+1. [Symptom 1]
+2. [Symptom 2]
+3. [Symptom 3]
+
+**Scope Analysis:**
+
+| Test | Result | Notes |
+|------|--------|-------|
+| Different environments (dev/staging/prod) | ✓/✗ | |
+| Different platforms (Win/Mac/Linux) | ✓/✗ | |
+| Different browsers | ✓/✗ | |
+| Different input data | ✓/✗ | |
+| Previous version | ✓/✗ | |
+
+**Isolated Components:**
+- ✅ Working correctly: [Component A, Component B]
+- ❌ Suspected issues: [Component C, Component D]
+- ❓ Uncertain: [Component E]
+
+**What Changed Recently:**
+- [Change 1 - date, description]
+- [Change 2 - date, description]
+
+---
+
+## Step 3: Differential Diagnosis List
+
+**Hypotheses (Ranked by Likelihood):**
+
+### Hypothesis 1: [Name of hypothesis]
+**Likelihood:** High / Medium / Low
+**Category:** [Infrastructure / Application / Configuration / Code]
+**Reasoning:** [Why this is suspected]
+
+### Hypothesis 2: [Name of hypothesis]
+**Likelihood:** High / Medium / Low
+**Category:** [Infrastructure / Application / Configuration / Code]
+**Reasoning:** [Why this is suspected]
+
+### Hypothesis 3: [Name of hypothesis]
+**Likelihood:** High / Medium / Low
+**Category:** [Infrastructure / Application / Configuration / Code]
+**Reasoning:** [Why this is suspected]
+
+[Add more as needed]
+
+---
+
+## Step 4: Elimination and Deductive Reasoning
+
+### Test 1: [Hypothesis being tested]
+**Test Design:** [How to test this hypothesis]
+**Expected Result if Hypothesis True:** [What you expect to see]
+**Actual Result:** [What you observed]
+**Conclusion:** ✅ Confirmed / ❌ Eliminated / ⚠️ Inconclusive
+**Reasoning:**
+```
+If [condition], then [expected behavior]
+We observed [actual behavior]
+Therefore [conclusion]
+```
+
+### Test 2: [Hypothesis being tested]
+**Test Design:** [How to test this hypothesis]
+**Expected Result if Hypothesis True:** [What you expect to see]
+**Actual Result:** [What you observed]
+**Conclusion:** ✅ Confirmed / ❌ Eliminated / ⚠️ Inconclusive
+**Reasoning:**
+```
+[Chain of reasoning]
+```
+
+[Continue for each test]
+
+**Hypotheses Remaining:** [List hypotheses not yet eliminated]
+
+---
+
+## Step 5: Experimental Verification
+
+**Checkpoint Created:** [git branch name, commit hash, or backup location]
+
+### Experiment 1: [Description]
+**Goal:** [What this experiment aims to prove/disprove]
+**Method:**
+```bash
+# Commands or code used
+```
+**Results:**
+```
+[Output or findings]
+```
+**Conclusion:** [What this proves]
+
+### Experiment 2: [Description]
+**Goal:** [What this experiment aims to prove/disprove]
+**Method:**
+```bash
+# Commands or code used
+```
+**Results:**
+```
+[Output or findings]
+```
+**Conclusion:** [What this proves]
+
+**Research Conducted:**
+- [ ] GitHub issues searched: [keywords used]
+- [ ] Stack Overflow checked: [relevant Q&As]
+- [ ] Documentation reviewed: [sections consulted]
+- [ ] Release notes: [findings]
+
+**Findings:**
+[Summary of research findings]
+
+**Root Cause Identified:** ✅ Yes / ❌ No
+
+---
+
+## Step 6: Root Cause and Fix
+
+**Root Cause:**
+[Precise description of what's causing the issue]
+
+**Location:**
+- File: [path/to/file.ext]
+- Line(s): [line number(s)]
+- Function/Method: [name]
+
+**Why This Causes the Issue:**
+[Explanation of the causal mechanism]
+
+**Fix Approaches Considered:**
+
+| Approach | Pros | Cons | Selected |
+|----------|------|------|----------|
+| [Approach 1] | [pros] | [cons] | ✅/❌ |
+| [Approach 2] | [pros] | [cons] | ✅/❌ |
+| [Approach 3] | [pros] | [cons] | ✅/❌ |
+
+**Selected Fix:**
+```diff
+[Show code diff or configuration change]
+```
+
+**Rationale:** [Why this fix was chosen]
+
+**Implementation Notes:**
+[Any important details about the fix]
+
+**Verification:**
+- [ ] Original issue resolved
+- [ ] No new issues introduced
+- [ ] Test suite passes
+- [ ] Edge cases tested
+
+---
+
+## Step 7: Prevention and Documentation
+
+**Regression Test Added:**
+```
+[Test code or test case description]
+```
+
+**Documentation Updated:**
+- [ ] CLAUDE.md updated
+- [ ] Code comments added
+- [ ] API documentation updated
+- [ ] README updated (if needed)
+
+**Prevention Measures Implemented:**
+1. [Measure 1 - e.g., added validation]
+2. [Measure 2 - e.g., improved error handling]
+3. [Measure 3 - e.g., added monitoring]
+
+**Lessons Learned:**
+1. [Lesson 1]
+2. [Lesson 2]
+3. [Lesson 3]
+
+**Knowledge Base Update:**
+- Pattern: [If this represents a new pattern to document]
+- Category: [What category of bug this was]
+- Key Insight: [Main takeaway for future debugging]
+
+---
+
+## Timeline Summary
+
+| Time | Activity | Result |
+|------|----------|--------|
+| [HH:MM] | Investigation started | |
+| [HH:MM] | Initial observations completed | |
+| [HH:MM] | Hypothesis list created | |
+| [HH:MM] | Testing began | |
+| [HH:MM] | Root cause identified | |
+| [HH:MM] | Fix implemented | |
+| [HH:MM] | Verification completed | |
+| [HH:MM] | Issue resolved | |
+
+**Total Time:** [Duration]
+
+---
+
+## Status Update for Stakeholders
+
+**Summary for Non-Technical Audience:**
+[1-2 sentence explanation of what went wrong and how it was fixed]
+
+**Impact:**
+- Users affected: [number or description]
+- Duration: [how long the issue existed]
+- Severity: [impact level]
+
+**Resolution:**
+[Brief description of the fix]
+
+**Follow-up Actions:**
+- [ ] [Action 1]
+- [ ] [Action 2]
+
+---
+
+**Investigation Completed:** [YYYY-MM-DD HH:MM]
+**Final Status:** 🟢 Resolved / 🔴 Unresolved / ⚠️ Workaround Applied
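
Step 4 of the template above records one conclusion per test and then asks for the hypotheses that remain. A minimal sketch of that bookkeeping follows, assuming a hypothesis stays open until some test eliminates it. `Conclusion`, `ExperimentRecord`, and `remaining_hypotheses` are hypothetical names and are not part of the template.

```python
from dataclasses import dataclass
from enum import Enum


class Conclusion(Enum):
    CONFIRMED = "confirmed"
    ELIMINATED = "eliminated"
    INCONCLUSIVE = "inconclusive"


@dataclass
class ExperimentRecord:
    """One filled-in 'Test N' entry from Step 4 of the template."""
    hypothesis: str
    expected_if_true: str
    actual: str
    conclusion: Conclusion


def remaining_hypotheses(hypotheses, records):
    """Return hypotheses, in their original order, that no test has eliminated."""
    eliminated = {
        r.hypothesis for r in records if r.conclusion is Conclusion.ELIMINATED
    }
    return [h for h in hypotheses if h not in eliminated]


if __name__ == "__main__":
    # Hypothetical investigation with three ranked hypotheses.
    ranked = ["stale cache", "connection leak", "config mismatch"]
    log = [
        ExperimentRecord("stale cache", "fixed after cache clear",
                         "still failing", Conclusion.ELIMINATED),
        ExperimentRecord("connection leak", "pool usage grows over time",
                         "metrics unavailable", Conclusion.INCONCLUSIVE),
    ]
    print(remaining_hypotheses(ranked, log))
```

Inconclusive results deliberately leave a hypothesis on the list, so it is retested, not silently dropped.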