Initial commit

Zhongwei Li
2025-11-29 18:47:55 +08:00
commit e732da8316
20 changed files with 4969 additions and 0 deletions

@@ -0,0 +1,306 @@
# Common Bug Patterns and Signatures
This reference documents frequently encountered bug patterns, their signatures, and diagnostic approaches.
## Pattern Categories
### 1. Timing and Concurrency Issues
#### Race Conditions
**Signature:**
- Intermittent failures
- Works in development but fails in production
- Different results with same input
- Failures during high load
**Common Causes:**
- Shared mutable state without synchronization
- Incorrect thread-safety assumptions
- Async operations completing in unexpected order
**Investigation Approach:**
- Add extensive logging with timestamps
- Use debugger breakpoints sparingly (changes timing)
- Add delays to expose race windows
- Review all shared state access patterns
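A minimal sketch of the "add delays to expose race windows" idea, assuming a hypothetical `Counter` class as the shared state. The `pause` hook and the `threading.Barrier` are illustrative devices that force both threads to read before either writes, so the lost update shows up on every run instead of intermittently.

```python
import threading


class Counter:
    """Hypothetical shared mutable state touched by several threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self, pause=None):
        current = self.value        # read
        if pause is not None:
            pause()                 # artificial delay that widens the race window
        self.value = current + 1    # write based on a possibly stale read

    def safe_increment(self):
        with self._lock:            # read-modify-write happens as one unit
            self.value += 1


def run_threads(count, target):
    threads = [threading.Thread(target=target) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def lost_update_demo():
    counter = Counter()
    barrier = threading.Barrier(2)  # both threads must read before either may write
    run_threads(2, lambda: counter.unsafe_increment(pause=barrier.wait))
    return counter.value            # two increments were requested


def locked_demo(threads=8, per_thread=1000):
    counter = Counter()

    def work():
        for _ in range(per_thread):
            counter.safe_increment()

    run_threads(threads, work)
    return counter.value
```

With the barrier in place, `lost_update_demo()` ends at 1 rather than 2 because both threads wrote `0 + 1`. The locked variant keeps every increment.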
#### Deadlocks
**Signature:**
- Application hangs indefinitely
- No error messages
- Low or idle CPU with a complete freeze (sustained high CPU suggests a livelock or busy-wait instead)
- Multiple threads waiting
**Common Causes:**
- Circular wait for locks
- Lock ordering violations
- Database transaction deadlocks
**Investigation Approach:**
- Check thread dumps / stack traces
- Review lock acquisition order
- Use database deadlock detection tools
- Add timeout mechanisms
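The lock-ordering and timeout advice can be sketched as follows. The `acquire_in_order` and `transfer` helpers are hypothetical names, and ordering by `id()` is just one arbitrary but consistent choice; any global ordering works.

```python
import threading


def acquire_in_order(locks, timeout=1.0):
    """Acquire every lock in one global order; return held locks, or None on timeout."""
    unique = {id(lock): lock for lock in locks}        # the same lock may be passed twice
    ordered = [unique[key] for key in sorted(unique)]  # consistent order prevents circular wait
    held = []
    for lock in ordered:
        if not lock.acquire(timeout=timeout):
            for taken in reversed(held):               # back off instead of waiting forever
                taken.release()
            return None
        held.append(lock)
    return held


def transfer(balances, locks, src, dst, amount):
    """Move `amount` between two accounts while holding both account locks."""
    held = acquire_in_order([locks[src], locks[dst]])
    if held is None:
        raise TimeoutError(f"could not lock {src} and {dst}")
    try:
        balances[src] -= amount
        balances[dst] += amount
    finally:
        for lock in reversed(held):
            lock.release()
```

Two threads calling `transfer` in opposite directions would risk a circular wait if each took its own `src` lock first. Sorting removes that possibility, and the timeout turns any remaining hang into a visible error.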
### 2. Memory Issues
#### Memory Leaks
**Signature:**
- Gradually increasing memory usage
- Performance degradation over time
- Out of memory errors after extended runtime
- Works initially, fails after hours/days
**Common Causes:**
- Event listeners not cleaned up
- Cache without eviction policy
- Circular references the runtime cannot collect (pure reference counting), or cycles still reachable from a long-lived root
- Resource handles not closed
**Investigation Approach:**
- Profile memory over time
- Take heap dumps at intervals
- Compare object counts between snapshots
- Check for unclosed resources (files, connections, sockets)
#### Stack Overflow
**Signature:**
- Stack overflow error
- Deep recursion errors
- Crashes at predictable depth
**Common Causes:**
- Unbounded recursion
- Missing base case in recursive function
- Circular data structure traversal
**Investigation Approach:**
- Check recursion depth
- Verify base case conditions
- Look for circular references
- Consider iterative alternative
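A sketch of the iterative alternative, assuming the data is an adjacency mapping (`graph`, a hypothetical name). An explicit stack replaces recursion, and a `seen` set guards against circular references.

```python
def collect_reachable(graph, start):
    """Depth-first traversal without recursion; safe on cycles and deep chains."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:          # a cycle or a second path led back here
            continue
        seen.add(node)
        order.append(node)
        # Reverse so neighbours are visited in their listed order.
        stack.extend(reversed(graph.get(node, [])))
    return order
```

Because the pending work lives in a list on the heap, depth is limited by available memory rather than by the interpreter's recursion limit.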
### 3. State Management Issues
#### Stale Cache
**Signature:**
- Outdated data displayed
- Inconsistency between systems
- Works after cache clear
- Different results on different servers
**Common Causes:**
- Cache invalidation not triggered
- TTL too long
- Distributed cache synchronization issues
**Investigation Approach:**
- Check cache invalidation logic
- Verify cache key generation
- Test with cache disabled
- Review cache update patterns
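To make the TTL and invalidation causes concrete, here is a small sketch of a read-through cache. `TTLCache` and its injectable `clock` parameter are assumptions for illustration; the clock makes expiry testable without sleeping.

```python
import time


class TTLCache:
    """Hypothetical read-through cache with a time-to-live and explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, stored_at)

    def get(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]                  # fresh enough: serve the cached value
        value = loader(key)                  # missing or expired: reload from the source
        self._entries[key] = (value, now)
        return value

    def invalidate(self, key):
        """Call this from every write path; a forgotten call is a classic stale-cache bug."""
        self._entries.pop(key, None)
```

When a value looks stale, checking whether the write path calls `invalidate` with the same key the read path uses is often the quickest test.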
#### State Corruption
**Signature:**
- Invalid state transitions
- Data inconsistency
- Unexpected null values
- Objects in impossible states
**Common Causes:**
- Direct state mutation
- Missing validation
- Incorrect error handling leaving partial updates
- Concurrent modifications
**Investigation Approach:**
- Add state validation assertions
- Review state mutation points
- Check transaction boundaries
- Look for error handling gaps
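State validation assertions can be as simple as a transition table checked at the single mutation point. The `Order` class and its statuses below are hypothetical examples.

```python
ALLOWED_TRANSITIONS = {
    "pending": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": {"delivered"},
}


class InvalidTransition(Exception):
    pass


class Order:
    """Hypothetical entity whose status may only change through transition()."""

    def __init__(self):
        self.status = "pending"
        self.history = ["pending"]

    def transition(self, new_status):
        allowed = ALLOWED_TRANSITIONS.get(self.status, set())
        if new_status not in allowed:
            # Validate before mutating so a rejected change leaves no partial update.
            raise InvalidTransition(f"{self.status} -> {new_status} is not allowed")
        self.status = new_status
        self.history.append(new_status)
```

Routing every change through one validated method also gives a natural place to log transitions when hunting for the code path that corrupts state.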
### 4. Integration Issues
#### API Failures
**Signature:**
- Timeout errors
- HTTP 5xx responses (e.g., 500, 503)
- Network errors
- Rate limiting responses (HTTP 429)
**Common Causes:**
- Third-party API downtime
- Network connectivity issues
- Authentication token expiration
- Rate limits exceeded
**Investigation Approach:**
- Check API status pages
- Verify network connectivity
- Review authentication flow
- Check rate limit headers
- Test with API directly (curl/Postman)
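For transient failures such as timeouts or 503 responses, a bounded retry with exponential backoff is a common mitigation. This is a sketch only: `TransientError` and `call_with_retry` are hypothetical names, and the caller is assumed to map its client library's timeout and 5xx errors onto `TransientError`.

```python
import time


class TransientError(Exception):
    """Raised by the caller's code for failures worth retrying (timeouts, 503s)."""


def call_with_retry(func, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(); on TransientError wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)   # 0.5s, 1s, 2s, ... with the defaults
```

The `sleep` parameter is injectable so the delays can be recorded in tests. Non-transient errors (for example, authentication failures) pass straight through without retrying.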
#### Database Issues
**Signature:**
- Connection pool exhausted
- Slow query performance
- Lock wait timeouts
- Connection refused errors
**Common Causes:**
- Connection leaks (not closing connections)
- Missing indexes causing full table scans
- N+1 query problems
- Database server overload
**Investigation Approach:**
- Monitor connection pool metrics
- Review slow query logs
- Check execution plans
- Look for repeated queries in loops
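The N+1 pattern is easiest to see by counting statements. The sketch below uses an in-memory SQLite database with a hypothetical `authors`/`books` schema and `sqlite3`'s `set_trace_callback` to record each statement the connection executes.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO books VALUES (1, 1, 'Notes'), (2, 2, 'Compilers'), (3, 2, 'COBOL');
    """)
    return conn


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author."""
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_single_query(conn):
    """Same result from a single JOIN."""
    result = {}
    rows = conn.execute(
        "SELECT a.name, b.title FROM authors a "
        "JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


def count_statements(conn, func):
    """Return func(conn) together with the number of SQL statements it ran."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        value = func(conn)
    finally:
        conn.set_trace_callback(None)
    return value, len(statements)
```

Comparing `count_statements(conn, titles_n_plus_one)` with `count_statements(conn, titles_single_query)` shows the loop version issuing more statements for the same result, and the gap grows with the number of authors.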
### 5. Configuration Issues
#### Environment Mismatches
**Signature:**
- Works locally, fails in production
- Different behavior across environments
- "It works on my machine"
**Common Causes:**
- Different environment variables
- Different dependency versions
- Different configuration files
- Platform-specific code paths
**Investigation Approach:**
- Compare environment variables
- Check dependency versions (package-lock.json, poetry.lock, etc.)
- Review configuration for environment-specific values
- Check platform-specific code paths
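Comparing environment variables by eye is error-prone, so a small diff helper can help. `diff_environments` is a hypothetical function; it reports variable names only, on the assumption that values may contain secrets.

```python
def diff_environments(left, right, ignore=()):
    """Compare two name->value mappings and report differing names (not values)."""
    report = {"only_left": [], "only_right": [], "different": []}
    keys = (set(left) | set(right)) - set(ignore)
    for key in sorted(keys):
        if key not in right:
            report["only_left"].append(key)
        elif key not in left:
            report["only_right"].append(key)
        elif left[key] != right[key]:
            report["different"].append(key)
    return report
```

A typical use is to dump the variables from each environment to a file, load both into dictionaries, and pass machine-specific names such as `HOSTNAME` in `ignore`.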
#### Missing Dependencies
**Signature:**
- Module not found errors
- Import errors
- Class/function not defined
- Version incompatibility errors
**Common Causes:**
- Missing package in requirements
- Outdated dependency versions
- Peer dependency conflicts
- System library missing
**Investigation Approach:**
- Review dependency manifests
- Check installed versions vs required
- Look for dependency conflicts
- Verify system libraries installed
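In Python, installed versions can be read with the standard-library `importlib.metadata` module. The helper name `installed_versions` is an assumption for illustration; it reports `None` for any distribution that is not installed.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version string, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Printing this mapping in each environment and comparing the output against the lock file is a quick way to spot a missing or drifted dependency.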
### 6. Logic Errors
#### Off-by-One Errors
**Signature:**
- Index out of bounds
- Missing first or last element
- Infinite loops
- Incorrect boundary handling
**Common Causes:**
- Using `<` where `<=` is needed (or vice versa)
- 0-indexed vs 1-indexed confusion
- Incorrect loop conditions
**Investigation Approach:**
- Check boundary conditions
- Test with edge cases (empty, single element)
- Review loop conditions carefully
- Add assertions for expected ranges
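Two boundary-sensitive helpers illustrate the edge cases worth testing. Both function names are hypothetical.

```python
def adjacent_pairs(items):
    """Return (items[i], items[i + 1]) for every valid i."""
    # The loop stops at len(items) - 1 so that items[i + 1] never runs past the end.
    return [(items[i], items[i + 1]) for i in range(len(items) - 1)]


def page_count(total_items, page_size):
    """Number of pages needed to show total_items, rounding up."""
    assert page_size > 0, "page_size must be positive"
    assert total_items >= 0, "total_items must not be negative"
    # Plain total_items // page_size would drop a partially filled last page.
    return (total_items + page_size - 1) // page_size
```

The cases that usually expose an off-by-one error are the empty input, a single element, and a total that is an exact multiple of the page size.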
#### Type Coercion Bugs
**Signature:**
- Unexpected type errors
- Comparison behaving unexpectedly
- String concatenation instead of addition
- Falsy value handling issues
**Common Causes:**
- Implicit type conversion
- Loose equality checks (`==` where `===` is intended, in JavaScript)
- Type assumptions without validation
- Mixed numeric types
**Investigation Approach:**
- Add explicit type checks
- Use strict equality
- Add type annotations/hints
- Check for implicit conversions
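Python has no `===`, but it has analogous pitfalls: string repetition where multiplication was intended, and falsy values such as `0` being replaced by defaults. The sketch below uses hypothetical helper names to show explicit handling of both.

```python
def parse_quantity(raw):
    """Convert user input to int explicitly instead of trusting its type."""
    if isinstance(raw, bool):            # bool is a subclass of int; reject it on purpose
        raise TypeError("quantity must not be a boolean")
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        return int(raw.strip())          # raises ValueError on non-numeric text
    raise TypeError(f"unsupported quantity type: {type(raw).__name__}")


def with_default(value, default):
    """Apply the default only when the value is missing, not when it is merely falsy."""
    return default if value is None else value
```

For comparison, `"2" * 3` evaluates to `"222"` while `parse_quantity("2") * 3` evaluates to `6`, and `0 or 10` evaluates to `10` while `with_default(0, 10)` keeps the `0`.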
### 7. Error Handling Issues
#### Swallowed Exceptions
**Signature:**
- Silent failures
- No error messages despite failure
- Incomplete operations
- Success reported despite failure
**Common Causes:**
- Empty catch blocks
- Broad exception catching
- Returning default values on error
- Not re-raising exceptions
**Investigation Approach:**
- Search for empty catch/except blocks
- Review exception handling patterns
- Add logging to all error paths
- Check for bare except/catch clauses
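The contrast between a swallowed exception and a logged, re-raised one might look like this. Both functions and the `payments` logger name are hypothetical.

```python
import logging

logger = logging.getLogger("payments")


def parse_amount_swallowing(raw):
    """Anti-pattern: a broad catch plus a silent default hides bad input."""
    try:
        return float(raw)
    except Exception:
        return 0.0


def parse_amount(raw):
    """Catch only the expected errors, record them, and let the caller decide."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        logger.exception("could not parse amount %r", raw)
        raise
```

With the first version, an input of `"abc"` silently becomes a zero amount and the operation reports success. With the second, the failure is logged with a traceback and propagates.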
#### Error Propagation Failures
**Signature:**
- Low-level errors exposed to users
- Unclear error messages
- Generic "Something went wrong"
- Stack traces in user interface
**Common Causes:**
- No error translation layer
- Missing error boundaries
- Not catching specific exceptions
- No user-friendly error messages
**Investigation Approach:**
- Review error handling architecture
- Check error message clarity
- Verify error boundaries exist
- Test error scenarios
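An error translation layer can be sketched as a boundary function that converts low-level exceptions into a user-facing type while keeping the original as the cause. `UserFacingError`, `load_profile`, and the error code are hypothetical.

```python
class UserFacingError(Exception):
    """Error that is safe to show to end users."""

    def __init__(self, message, code):
        super().__init__(message)
        self.code = code


def load_profile(store, user_id):
    """Look up a profile, translating the storage-level KeyError at the boundary."""
    try:
        return store[user_id]
    except KeyError as exc:
        # `from exc` keeps the low-level detail in logs without exposing it in the UI.
        raise UserFacingError(
            "We could not find that profile.", code="profile_not_found"
        ) from exc
```

The user sees a clear message and a stable code, while `__cause__` still carries the original exception for whoever reads the logs.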
## Pattern Recognition Strategies
### Look for These Red Flags:
1. **Time-based behavior**: If adding delays changes behavior, suspect timing issues
2. **Load-based failures**: If failures increase with load, suspect resource exhaustion or race conditions
3. **Environment-specific**: If only fails in certain environments, suspect configuration differences
4. **Gradual degradation**: If performance worsens over time, suspect memory leaks or resource leaks
5. **Intermittent failures**: If behavior is non-deterministic, suspect concurrency issues or external dependencies
### Diagnostic Quick Checks:
1. **Can you reproduce it consistently?** No → Likely timing/concurrency issue
2. **Does it fail immediately?** Yes → Likely configuration or initialization issue
3. **Does it fail after some time?** Yes → Likely resource leak or state corruption
4. **Does it fail with specific input?** Yes → Likely validation or edge case handling issue
5. **Does it fail only in production?** Yes → Likely environment or load-related issue
## Using This Reference
When encountering a bug:
1. Match the signature to patterns above
2. Review common causes for that pattern
3. Follow the investigation approach
4. Apply lessons from similar past issues
5. Update this document if you discover new patterns

@@ -0,0 +1,176 @@
# Debugging Checklist
This checklist provides detailed action items for each step of the debugging workflow.
## Step 1: Observe Without Preconception ✓
**Evidence Collection:**
- [ ] Review user's bug report or issue description
- [ ] Examine error messages and stack traces
- [ ] Check application logs (stderr, stdout, application-specific logs)
- [ ] Review monitoring dashboards (if available)
- [ ] Inspect recent code changes (`git diff`, `git log`)
- [ ] Document current environment (OS, versions, dependencies)
- [ ] Capture configuration files (config files, environment variables, CLI arguments)
- [ ] Screenshot or record the error if visual
- [ ] Note exact steps to reproduce
**Documentation:**
- [ ] Create investigation log file
- [ ] Record timestamp and initial observations
- [ ] List all data sources consulted
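Documenting the current environment can be partly automated. The sketch below is one possible snapshot for a Python process; `capture_environment` and the `APP_` prefix are assumptions, and only variable names are recorded so that secrets stay out of the investigation log.

```python
import json
import os
import platform
import sys


def capture_environment(env_prefixes=("APP_",)):
    """Collect basic facts about the running process for the investigation log."""
    prefixes = tuple(env_prefixes)
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "cwd": os.getcwd(),
        "argv": list(sys.argv),
        # Names only: values may contain credentials.
        "env_names": sorted(name for name in os.environ if name.startswith(prefixes)),
    }


def snapshot_as_json(env_prefixes=("APP_",)):
    return json.dumps(capture_environment(env_prefixes), indent=2)
```

Pasting the output of `snapshot_as_json()` into the log at the start of an investigation makes later environment comparisons much easier.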
## Step 2: Classify and Isolate Facts ✓
**Symptom Analysis:**
- [ ] List all observable symptoms
- [ ] Distinguish symptoms from potential causes
- [ ] Identify what changed recently (code, config, dependencies, infrastructure)
**Scope Narrowing:**
- [ ] Test across different environments (dev, staging, production)
- [ ] Test across different platforms (Windows, Linux, macOS)
- [ ] Test across different browsers (if web application)
- [ ] Test with different input data
- [ ] Test with different configurations
- [ ] Identify minimal reproduction case
- [ ] Test with previous working version (regression check; `git bisect` can narrow down the offending commit)
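Testing against a previous working version generalizes to a binary search over the versions in between, which is the idea `git bisect` automates. A minimal sketch, assuming the first version is known good, the last is known bad, and the bug persists once introduced (`first_bad_index` and `is_bad` are hypothetical names):

```python
def first_bad_index(versions, is_bad):
    """Return the index of the first bad version using O(log n) checks."""
    low, high = 0, len(versions) - 1     # invariant: versions[low] good, versions[high] bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(versions[mid]):
            high = mid                   # the regression is at mid or earlier
        else:
            low = mid                    # the regression is after mid
    return high
```

For 100 candidate versions this needs at most seven checks instead of up to 98 sequential ones, which matters when each check means a build and a manual reproduction.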
**Component Isolation:**
- [ ] List all involved components/modules
- [ ] Mark components known to work correctly
- [ ] Highlight suspicious components
- [ ] Draw dependency diagram if complex
## Step 3: Build Differential Diagnosis List ✓
**Infrastructure Issues:**
- [ ] Network connectivity problems
- [ ] DNS resolution failures
- [ ] Load balancer misconfiguration
- [ ] Firewall/security group blocking
- [ ] Resource exhaustion (CPU, memory, disk)
**Application Issues:**
- [ ] Cache staleness or corruption
- [ ] Database connection pool exhaustion
- [ ] Database deadlocks or slow queries
- [ ] Third-party API failures or timeouts
- [ ] Memory leaks
- [ ] Race conditions or threading issues
- [ ] Incorrect error handling
- [ ] Missing or incorrect input validation
**Configuration Issues:**
- [ ] Environment variable mismatch
- [ ] Configuration file errors
- [ ] Version incompatibility
- [ ] Missing dependencies
- [ ] Permission problems
**Code Issues:**
- [ ] Logic errors in recent changes
- [ ] Null pointer/undefined errors
- [ ] Type mismatches
- [ ] Off-by-one errors
- [ ] Incorrect assumptions
## Step 4: Apply Elimination and Deductive Reasoning ✓
**Hypothesis Testing:**
- [ ] Rank hypotheses by likelihood
- [ ] Design test for most likely hypothesis
- [ ] Execute test and document result
- [ ] If hypothesis invalidated, mark as eliminated
- [ ] If hypothesis confirmed, design further verification
- [ ] Move to next hypothesis if needed
**Reasoning Documentation:**
- [ ] Document "If X, then Y" statements
- [ ] Record why each hypothesis was eliminated
- [ ] Note which tests ruled out which possibilities
- [ ] Maintain chain of reasoning for review
**Narrowing Down:**
- [ ] Eliminate external factors first (network, APIs)
- [ ] Then infrastructure (resources, configuration)
- [ ] Then application-level issues (cache, database)
- [ ] Finally code-level issues (logic, types)
## Step 5: Experimental Verification ✓
**Preparation:**
- [ ] Create git branch for experiments
- [ ] Back up current state (checkpoint)
- [ ] Document experiment plan
**Experimentation:**
- [ ] Add logging/instrumentation to suspected area
- [ ] Add debug breakpoints if using debugger
- [ ] Create controlled test case
- [ ] Run experiment and capture output
- [ ] Compare actual vs expected behavior
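Adding logging to a suspected area can be done without editing the function body by wrapping it. The `traced` decorator below is a hypothetical helper built on the standard `logging` and `functools` modules.

```python
import functools
import logging
import time


def traced(logger):
    """Decorator factory: log arguments, result, duration, and any exception."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            logger.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                elapsed = time.perf_counter() - start
                logger.exception("%s raised after %.3fs", func.__name__, elapsed)
                raise                      # instrumentation must not change behavior
            elapsed = time.perf_counter() - start
            logger.debug("%s returned %r in %.3fs", func.__name__, result, elapsed)
            return result

        return wrapper

    return decorator
```

Applying `@traced(logging.getLogger("debug-session"))` to the suspect function on an experiment branch keeps the instrumentation easy to remove once the hypothesis is settled.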
**Research:**
- [ ] Search GitHub issues for similar problems
- [ ] Check Stack Overflow for related questions
- [ ] Review official documentation for edge cases
- [ ] Check release notes for known issues
- [ ] Consult language/framework changelog
**Validation:**
- [ ] Can the issue be reproduced consistently?
- [ ] Does the evidence match the hypothesis?
- [ ] Are there alternative explanations?
## Step 6: Locate and Implement Fix ✓
**Root Cause Confirmation:**
- [ ] Identify exact file and line number
- [ ] Understand why the code fails
- [ ] Confirm this is root cause, not symptom
**Solution Design:**
- [ ] Consider multiple fix approaches
- [ ] Evaluate side effects of each approach
- [ ] Choose most elegant and maintainable solution
- [ ] Ensure fix doesn't introduce new issues
**Implementation:**
- [ ] Implement the fix
- [ ] Add comments explaining the fix
- [ ] Update related documentation
- [ ] Add test case to prevent regression
**Verification:**
- [ ] Test the fix resolves original issue
- [ ] Run existing test suite
- [ ] Test edge cases
- [ ] Verify no new issues introduced
## Step 7: Prevention Mechanism ✓
**Stability Verification:**
- [ ] Run full test suite
- [ ] Perform integration testing
- [ ] Test in staging environment
- [ ] Monitor for unexpected behavior
**Documentation:**
- [ ] Update CLAUDE.md or project docs
- [ ] Document root cause
- [ ] Document fix and reasoning
- [ ] Add to knowledge base
**Prevention Measures:**
- [ ] Add automated test for this scenario
- [ ] Add validation/assertions to prevent recurrence
- [ ] Update error messages for clarity
- [ ] Add monitoring/alerting if applicable
- [ ] Share learnings with team
**Post-Mortem:**
- [ ] Review what went well
- [ ] Identify what could improve
- [ ] Update debugging procedures if needed
- [ ] Celebrate the fix! 🎉

@@ -0,0 +1,292 @@
# Bug Investigation Log Template
Use this template to document debugging sessions systematically. Copy and adapt as needed.
---
## Investigation Metadata
**Issue ID/Reference:** [e.g., #123, TICKET-456]
**Date Started:** [YYYY-MM-DD HH:MM]
**Investigator:** [Name or AI assistant]
**Priority:** [Critical / High / Medium / Low]
**Status:** [🔴 Investigating / 🟡 Fix In Progress / 🟢 Resolved]
---
## Step 1: Initial Observations
**User Report:**
```
[Paste user's bug report or description here]
```
**Reproduction Steps:**
1. [Step 1]
2. [Step 2]
3. [Step 3]
**Expected Behavior:**
[What should happen]
**Actual Behavior:**
[What actually happens]
**Environment:**
- OS: [e.g., Ubuntu 22.04, Windows 11, macOS 14]
- Application Version: [e.g., v2.3.1]
- Runtime: [e.g., Node.js 18.16, Python 3.11]
- Browser: [if applicable]
**Evidence Collected:**
*Error Messages:*
```
[Paste error messages, stack traces]
```
*Logs:*
```
[Relevant log entries]
```
*Configuration:*
```
[Relevant config values, environment variables]
```
*Recent Changes:*
- [Commit hash / PR / change description]
- [git diff summary if relevant]
---
## Step 2: Fact Classification
**Confirmed Symptoms (Observable Facts):**
1. [Symptom 1]
2. [Symptom 2]
3. [Symptom 3]
**Scope Analysis:**
| Test | Issue Reproduces? | Notes |
|------|--------|-------|
| Different environments (dev/staging/prod) | ✓/✗ | |
| Different platforms (Win/Mac/Linux) | ✓/✗ | |
| Different browsers | ✓/✗ | |
| Different input data | ✓/✗ | |
| Previous version | ✓/✗ | |
**Isolated Components:**
- ✅ Working correctly: [Component A, Component B]
- ❌ Suspected issues: [Component C, Component D]
- ❓ Uncertain: [Component E]
**What Changed Recently:**
- [Change 1 - date, description]
- [Change 2 - date, description]
---
## Step 3: Differential Diagnosis List
**Hypotheses (Ranked by Likelihood):**
### Hypothesis 1: [Name of hypothesis]
**Likelihood:** High / Medium / Low
**Category:** [Infrastructure / Application / Configuration / Code]
**Reasoning:** [Why this is suspected]
### Hypothesis 2: [Name of hypothesis]
**Likelihood:** High / Medium / Low
**Category:** [Infrastructure / Application / Configuration / Code]
**Reasoning:** [Why this is suspected]
### Hypothesis 3: [Name of hypothesis]
**Likelihood:** High / Medium / Low
**Category:** [Infrastructure / Application / Configuration / Code]
**Reasoning:** [Why this is suspected]
[Add more as needed]
---
## Step 4: Elimination and Deductive Reasoning
### Test 1: [Hypothesis being tested]
**Test Design:** [How to test this hypothesis]
**Expected Result if Hypothesis True:** [What you expect to see]
**Actual Result:** [What you observed]
**Conclusion:** ✅ Confirmed / ❌ Eliminated / ⚠️ Inconclusive
**Reasoning:**
```
If [condition], then [expected behavior]
We observed [actual behavior]
Therefore [conclusion]
```
### Test 2: [Hypothesis being tested]
**Test Design:** [How to test this hypothesis]
**Expected Result if Hypothesis True:** [What you expect to see]
**Actual Result:** [What you observed]
**Conclusion:** ✅ Confirmed / ❌ Eliminated / ⚠️ Inconclusive
**Reasoning:**
```
[Chain of reasoning]
```
[Continue for each test]
**Hypotheses Remaining:** [List hypotheses not yet eliminated]
---
## Step 5: Experimental Verification
**Checkpoint Created:** [git branch name, commit hash, or backup location]
### Experiment 1: [Description]
**Goal:** [What this experiment aims to prove/disprove]
**Method:**
```bash
# Commands or code used
```
**Results:**
```
[Output or findings]
```
**Conclusion:** [What this proves]
### Experiment 2: [Description]
**Goal:** [What this experiment aims to prove/disprove]
**Method:**
```bash
# Commands or code used
```
**Results:**
```
[Output or findings]
```
**Conclusion:** [What this proves]
**Research Conducted:**
- [ ] GitHub issues searched: [keywords used]
- [ ] Stack Overflow checked: [relevant Q&As]
- [ ] Documentation reviewed: [sections consulted]
- [ ] Release notes: [findings]
**Findings:**
[Summary of research findings]
**Root Cause Identified:** ✅ Yes / ❌ No
---
## Step 6: Root Cause and Fix
**Root Cause:**
[Precise description of what's causing the issue]
**Location:**
- File: [path/to/file.ext]
- Line(s): [line number(s)]
- Function/Method: [name]
**Why This Causes the Issue:**
[Explanation of the causal mechanism]
**Fix Approaches Considered:**
| Approach | Pros | Cons | Selected |
|----------|------|------|----------|
| [Approach 1] | [pros] | [cons] | ✅/❌ |
| [Approach 2] | [pros] | [cons] | ✅/❌ |
| [Approach 3] | [pros] | [cons] | ✅/❌ |
**Selected Fix:**
```diff
[Show code diff or configuration change]
```
**Rationale:** [Why this fix was chosen]
**Implementation Notes:**
[Any important details about the fix]
**Verification:**
- [ ] Original issue resolved
- [ ] No new issues introduced
- [ ] Test suite passes
- [ ] Edge cases tested
---
## Step 7: Prevention and Documentation
**Regression Test Added:**
```
[Test code or test case description]
```
**Documentation Updated:**
- [ ] CLAUDE.md updated
- [ ] Code comments added
- [ ] API documentation updated
- [ ] README updated (if needed)
**Prevention Measures Implemented:**
1. [Measure 1 - e.g., added validation]
2. [Measure 2 - e.g., improved error handling]
3. [Measure 3 - e.g., added monitoring]
**Lessons Learned:**
1. [Lesson 1]
2. [Lesson 2]
3. [Lesson 3]
**Knowledge Base Update:**
- Pattern: [If this represents a new pattern to document]
- Category: [What category of bug this was]
- Key Insight: [Main takeaway for future debugging]
---
## Timeline Summary
| Time | Activity | Result |
|------|----------|--------|
| [HH:MM] | Investigation started | |
| [HH:MM] | Initial observations completed | |
| [HH:MM] | Hypothesis list created | |
| [HH:MM] | Testing began | |
| [HH:MM] | Root cause identified | |
| [HH:MM] | Fix implemented | |
| [HH:MM] | Verification completed | |
| [HH:MM] | Issue resolved | |
**Total Time:** [Duration]
---
## Status Update for Stakeholders
**Summary for Non-Technical Audience:**
[1-2 sentence explanation of what went wrong and how it was fixed]
**Impact:**
- Users affected: [number or description]
- Duration: [how long the issue existed]
- Severity: [impact level]
**Resolution:**
[Brief description of the fix]
**Follow-up Actions:**
- [ ] [Action 1]
- [ ] [Action 2]
---
**Investigation Completed:** [YYYY-MM-DD HH:MM]
**Final Status:** 🟢 Resolved / 🔴 Unresolved / ⚠️ Workaround Applied