Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 18:29:18 +08:00
commit 46dfc30864
25 changed files with 4683 additions and 0 deletions


@@ -0,0 +1,56 @@
# Smart Debug Reference
Debugging references and methodologies for systematic error resolution.
## Available References
### [error-patterns-database.md](error-patterns-database.md)
Complete error pattern catalog with fixes.
- **Null Pointer Errors** - NoneType, undefined, null reference
- **Type Errors** - Type mismatch, unsupported operand, conversion failures
- **Index Errors** - Array bounds, list access, slice errors
- **Key Errors** - Dictionary key missing, object property undefined
- **Import Errors** - Module not found, circular imports
- **Database Errors** - Connection refused, timeout, constraint violations
- **API Errors** - 400/422/500 responses, contract violations
- **Concurrency Errors** - Race conditions, deadlocks, async issues
- **Memory Errors** - Out of memory, memory leaks
- **Performance Errors** - Slow queries, N+1 problems, inefficient algorithms
### [stack-trace-patterns.md](stack-trace-patterns.md)
Stack trace reading and analysis guide.
- Python stack traces (Traceback format)
- JavaScript/TypeScript stack traces (Error.stack format)
- Java stack traces (Exception format)
- Identifying root file vs. propagation
- Filtering stdlib and third-party frames
- Understanding async stack traces
- Reading minified stack traces
- Source map integration
### [rca-methodology.md](rca-methodology.md)
Root cause analysis methodologies.
- **5 Whys** - Iterative questioning to root cause
- **Timeline Analysis** - Chronological event reconstruction
- **Fishbone Diagram** - Ishikawa cause categorization
- **Fault Tree Analysis** - Logic diagram of failure paths
- **Change Analysis** - Recent deployments and config changes
- **Comparative Analysis** - Working vs. broken environments
- **Reproducibility Testing** - Isolation of causal factors
### [fix-generation-patterns.md](fix-generation-patterns.md)
Code fix patterns for common errors.
- Null check patterns (guard clauses, optional chaining)
- Type validation patterns (isinstance, type hints)
- Error handling patterns (try-catch, error boundaries)
- Input validation patterns (Pydantic, zod)
- Defensive programming patterns
- Fail-fast vs. graceful degradation
- Error recovery strategies
## Quick Reference
**Need error patterns?** → [error-patterns-database.md](error-patterns-database.md)
**Need stack trace help?** → [stack-trace-patterns.md](stack-trace-patterns.md)
**Need RCA methods?** → [rca-methodology.md](rca-methodology.md)
**Need fix patterns?** → [fix-generation-patterns.md](fix-generation-patterns.md)


@@ -0,0 +1,204 @@
# Error Patterns Database
Comprehensive catalog of common error patterns with fixes and prevention strategies.
## Null Pointer / None Type Errors
| Pattern | Indicators | Root Cause | Fix | Prevention |
|---------|-----------|------------|-----|------------|
| **NoneType Attribute** | `'NoneType' object has no attribute 'x'` | Accessing property on None | Add null check: `if obj is None: return` | Use Optional[] types, validation |
| **Undefined Variable** | `ReferenceError: x is not defined` (JS) | Using a variable that was never declared | Declare and initialize the variable | Use `let`/`const`, enable strict mode |
| **Null Dereference** | `Cannot read property 'x' of null` (newer V8: `Cannot read properties of null (reading 'x')`) | Object is null/undefined | Optional chaining: `obj?.property` | Use TypeScript strict null checks |
### Fix Template
```python
# Before
user.name  # AttributeError if user is None

# After
def display_name(user):
    if user is None:
        return "Unknown"
    return user.name

# Or use getattr() with a default (also covers user being None)
getattr(user, 'name', 'Unknown')
```
## Type Errors
| Pattern | Indicators | Root Cause | Fix | Prevention |
|---------|-----------|------------|-----|------------|
| **Operand Type Mismatch** | `unsupported operand type(s) for +: 'int' and 'str'` | Wrong types in operation | Type conversion or fix type hint | mypy, Pydantic validation |
| **Wrong Argument Type** | `expected str, got int` | Passing wrong type | Convert type or fix signature | Static type checking |
| **JSON Serialization** | `Object of type datetime is not JSON serializable` | Can't serialize type | Custom JSON encoder | Use Pydantic models |
### Fix Template
```python
# Type validation with Pydantic
from pydantic import BaseModel

class UserInput(BaseModel):
    age: int  # Automatic validation and conversion

# Input: {"age": "25"} → converts to int(25)
# Input: {"age": "abc"} → ValidationError
```
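The JSON Serialization row names a custom JSON encoder as the fix. A minimal sketch using only the standard library, assuming ISO 8601 strings are an acceptable wire format (the class name `DateTimeEncoder` and the sample payload are illustrative, not part of the original catalog):

```python
import json
from datetime import date, datetime


class DateTimeEncoder(json.JSONEncoder):
    """Serialize date/datetime values as ISO 8601 strings."""

    def default(self, o):
        if isinstance(o, (datetime, date)):
            return o.isoformat()
        # Defer to the base class, which raises TypeError for unknown types
        return super().default(o)


payload = {"created_at": datetime(2025, 1, 2, 3, 4, 5)}
encoded = json.dumps(payload, cls=DateTimeEncoder)
```

Passing the encoder through `cls=` keeps the conversion in one place instead of scattering `isoformat()` calls across call sites.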
## Index / Key Errors
| Pattern | Indicators | Root Cause | Fix | Prevention |
|---------|-----------|------------|-----|------------|
| **List Index Out of Range** | `list index out of range` | Accessing beyond list length | Check length first | Bounds check or try/except `IndexError` (lists have no `.get()`) |
| **Dict KeyError** | `KeyError: 'missing_key'` | Key doesn't exist in dict | Use `.get()` with default | Pydantic models, TypedDict |
| **Array Out of Bounds** | `undefined` (JS array) | Accessing invalid index | Check array length | Optional chaining with a default: `arr?.[i] ?? fallback` |
### Fix Template
```python
from pydantic import BaseModel, EmailStr  # EmailStr needs the email-validator package

# Bad
user_dict['email']  # KeyError if 'email' missing
# Good
user_dict.get('email', 'no-email@example.com')
# Best (with Pydantic)
class User(BaseModel):
    email: EmailStr  # Required, validated
```
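Python lists have no `.get()`, so the list-index case needs its own guard. A small sketch of a default-returning accessor, assuming a `None` default is acceptable to callers (the helper name `get_at` is hypothetical):

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def get_at(items: Sequence[T], index: int, default: Optional[T] = None) -> Optional[T]:
    """Return items[index], or default when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default


emails = ["a@example.com", "b@example.com"]
first = get_at(emails, 0)
missing = get_at(emails, 5, "no-email@example.com")
```

Catching `IndexError` rather than comparing against `len()` also covers negative indexes on short or empty sequences.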
## Import / Module Errors
| Pattern | Indicators | Root Cause | Fix | Prevention |
|---------|-----------|------------|-----|------------|
| **Module Not Found** | `ModuleNotFoundError: No module named 'x'` | Missing dependency | Install: `pip install x` | Add to requirements.txt |
| **Circular Import** | `ImportError: cannot import name 'X' from partially initialized module` | A imports B, B imports A | Refactor to remove cycle | Dependency injection |
| **Relative Import** | `attempted relative import with no known parent package` | Module inside a package run directly as a script | Use absolute imports or run with `python -m package.module` | Configure PYTHONPATH |
### Fix Template
```bash
# Check installed packages
pip list | grep package_name
# Install missing package
pip install package_name
# Add to requirements
echo "package_name==1.2.3" >> requirements.txt
```
## Database Errors
| Pattern | Indicators | Root Cause | Fix | Prevention |
|---------|-----------|------------|-----|------------|
| **Connection Refused** | `Connection refused` | DB not running or wrong host | Check connection string | Health checks, retry logic |
| **Timeout** | `timeout exceeded` | Query too slow or DB overloaded | Optimize query, add indexes | Query analysis, connection pooling |
| **Unique Constraint** | `UNIQUE constraint failed` | Duplicate key | Handle conflict (upsert) | Pre-check existence |
| **Foreign Key Violation** | `FOREIGN KEY constraint failed` | Referenced record doesn't exist | Validate FK exists first | Use transactions |
### Fix Template
```python
# Handle constraint violations
from sqlalchemy.exc import IntegrityError

try:
    db.add(user)
    db.commit()
except IntegrityError as e:
    db.rollback()
    if 'UNIQUE constraint' in str(e):  # SQLite wording; other backends differ
        raise DuplicateUserError() from e
    raise
```
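The Unique Constraint row lists an upsert as the fix. A runnable sketch with the standard-library `sqlite3` module, assuming SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE` (the table layout and the `upsert_user` helper are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT NOT NULL)")


def upsert_user(email: str, name: str) -> None:
    """Insert the user, or update the name if the email already exists."""
    conn.execute(
        """
        INSERT INTO users (email, name) VALUES (?, ?)
        ON CONFLICT(email) DO UPDATE SET name = excluded.name
        """,
        (email, name),
    )
    conn.commit()


upsert_user("ada@example.com", "Ada")
upsert_user("ada@example.com", "Ada Lovelace")  # no IntegrityError; row is updated
rows = conn.execute("SELECT email, name FROM users").fetchall()
```

Letting the database resolve the conflict avoids the race window that a separate "check, then insert" leaves open.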
## API / HTTP Errors
| Pattern | Indicators | Root Cause | Fix | Prevention |
|---------|-----------|------------|-----|------------|
| **400 Bad Request** | Malformed request | Invalid JSON or missing fields | Validate request schema | Pydantic, OpenAPI validation |
| **401 Unauthorized** | Missing/invalid auth token | Token expired or missing | Refresh token logic | Token rotation, validation |
| **404 Not Found** | Resource doesn't exist | Wrong ID or deleted resource | Return 404 with helpful message | Check existence first |
| **422 Unprocessable** | Validation failed | Data doesn't meet constraints | Fix validation or API call | Schema validation |
| **500 Internal Error** | Server-side error | Unhandled exception | Fix server code, add logging | Error handling, monitoring |
### Fix Template
```python
# Proper error handling
from fastapi import HTTPException

@app.post("/users")
async def create_user(user: UserCreate):
    existing = await db.users.find_one({"email": user.email})
    if existing:
        raise HTTPException(
            status_code=409,  # Conflict
            detail="User with this email already exists"
        )
    return await db.users.insert_one(user)
```
## Concurrency / Race Condition Errors
| Pattern | Indicators | Root Cause | Fix | Prevention |
|---------|-----------|------------|-----|------------|
| **Race Condition** | Inconsistent results, data corruption | Multiple threads accessing shared state | Use locks, atomic operations | Immutable data, message queues |
| **Deadlock** | System hangs | Circular wait for resources | Order lock acquisition consistently | Avoid nested locks |
| **Lost Update** | Changes overwritten | Concurrent updates | Optimistic locking (version field) | Transactions, SELECT FOR UPDATE |
### Fix Template
```python
# Use distributed lock (Redis)
import redis_lock

with redis_lock.Lock(redis_client, "order:123"):
    order = db.orders.get(123)
    order.status = "processed"
    db.orders.save(order)
```
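The Lost Update row suggests optimistic locking with a version field. One way this might look, sketched with `sqlite3` so it runs without external services (the `orders` schema and the `update_status` helper are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")
conn.execute("INSERT INTO orders VALUES (123, 'pending', 1)")
conn.commit()


def update_status(order_id: int, new_status: str, expected_version: int) -> bool:
    """Apply the update only if nobody changed the row since it was read."""
    cur = conn.execute(
        "UPDATE orders SET status = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_status, order_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means another writer got there first


# Two writers both read version 1; only the first update succeeds.
first_ok = update_status(123, "processed", expected_version=1)
second_ok = update_status(123, "cancelled", expected_version=1)
final = conn.execute("SELECT status, version FROM orders WHERE id = 123").fetchone()
```

A caller that gets `False` back would typically re-read the row and retry or report a conflict, rather than overwrite the other writer's change.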
## Memory / Performance Errors
| Pattern | Indicators | Root Cause | Fix | Prevention |
|---------|-----------|------------|-----|------------|
| **Out of Memory** | `MemoryError` | Loading too much data | Stream/paginate data | Lazy loading, generators |
| **N+1 Query Problem** | Slow performance | Loop with query inside | Use JOIN or eager loading | Query analysis, APM tools |
| **Memory Leak** | Memory grows over time | Objects still referenced (caches, listeners, reference cycles) | Release lingering references, break cycles | Profiling, weak references |
### Fix Template
```python
# Bad: N+1 queries
for user in users:  # 1 query
    posts = db.posts.find(user_id=user.id)  # N queries

# Good: Single query with join (json_agg is PostgreSQL-specific)
users_with_posts = db.execute("""
    SELECT users.*, json_agg(posts.*) as posts
    FROM users
    LEFT JOIN posts ON posts.user_id = users.id
    GROUP BY users.id
""")
```
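For the Out of Memory row, the suggested fix is to stream or paginate instead of loading everything at once. A minimal generator-based sketch, assuming the data source can be iterated lazily (the `in_batches` helper and the stand-in data are illustrative):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without materializing all rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# A generator stands in for a large result set; only one batch is held at a time.
row_source = (n * n for n in range(10))
batch_sums = [sum(batch) for batch in in_batches(row_source, 4)]
```

Peak memory then scales with the batch size rather than with the total number of rows.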
## Quick Reference: Error → Pattern
| Error Message | Pattern | Fix Priority |
|---------------|---------|--------------|
| `'NoneType' object has no attribute` | null_pointer | High |
| `unsupported operand type` | type_mismatch | Medium |
| `list index out of range` | index_error | Medium |
| `KeyError` | key_error | Medium |
| `ModuleNotFoundError` | import_error | High |
| `Connection refused` | db_connection | High |
| `UNIQUE constraint failed` | db_constraint | Medium |
| `401 Unauthorized` | api_auth | High |
| `MemoryError` | memory_error | Critical |
---
**Usage**: When debugging, match the error message to a pattern, apply its fix template, then implement the prevention strategy.
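
The matching step can be sketched as a lookup over the quick-reference table. This is an illustrative helper, not an existing tool: the name `classify_error` and the first-match-wins rule are assumptions.

```python
import re
from typing import Optional, Tuple

# (regex, pattern name, fix priority), copied from the quick-reference table
ERROR_PATTERNS = [
    (r"'NoneType' object has no attribute", "null_pointer", "High"),
    (r"unsupported operand type", "type_mismatch", "Medium"),
    (r"list index out of range", "index_error", "Medium"),
    (r"KeyError", "key_error", "Medium"),
    (r"ModuleNotFoundError", "import_error", "High"),
    (r"Connection refused", "db_connection", "High"),
    (r"UNIQUE constraint failed", "db_constraint", "Medium"),
    (r"401 Unauthorized", "api_auth", "High"),
    (r"MemoryError", "memory_error", "Critical"),
]


def classify_error(message: str) -> Optional[Tuple[str, str]]:
    """Return (pattern, priority) for the first matching row, else None."""
    for regex, pattern, priority in ERROR_PATTERNS:
        if re.search(regex, message):
            return pattern, priority
    return None


match = classify_error("AttributeError: 'NoneType' object has no attribute 'name'")
```

An unmatched message returns `None`, which is a signal to fall back to manual analysis rather than guess a pattern.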


@@ -0,0 +1,483 @@
# Fix Generation Patterns
Comprehensive guide to generating, evaluating, and implementing fixes for software bugs.
## Multiple Fix Options Strategy
**Core Principle**: Always generate 2-3 fix options with trade-off analysis.
### Fix Option Template
```markdown
**Option 1: [Name]** (e.g., Quick Fix)
**Implementation**: [What to change]
**Pros**: [Benefits]
**Cons**: [Drawbacks]
**Effort**: [Time estimate]
**Risk**: [Low/Medium/High]
**Option 2: [Name]** (e.g., Proper Fix)
**Implementation**: [What to change]
**Pros**: [Benefits]
**Cons**: [Drawbacks]
**Effort**: [Time estimate]
**Risk**: [Low/Medium/High]
**Option 3: [Name]** (e.g., Comprehensive Fix)
**Implementation**: [What to change]
**Pros**: [Benefits]
**Cons**: [Drawbacks]
**Effort**: [Time estimate]
**Risk**: [Low/Medium/High]
**Recommendation**: Option [X] because [reasoning]
```
_See [null-pointer-debug-example.md](../examples/null-pointer-debug-example.md) for complete fix options example._
## Quick Fix vs. Proper Fix
### Decision Matrix
| Criteria | Quick Fix | Proper Fix |
|----------|-----------|------------|
| **Urgency** | Production down, immediate relief needed | Incident resolved, addressing root cause |
| **Scope** | Minimal changes, single file | Multiple files, architectural changes |
| **Time** | Minutes to hours | Hours to days |
| **Testing** | Manual verification | Full test coverage required |
| **Risk** | Low (minimal changes) | Medium (broader impact) |
| **Longevity** | Temporary patch | Permanent solution |
### When to Use Quick Fix
- **Production incident** - System is down, users impacted
- **Known workaround** - Clear, safe mitigation exists
- **Low risk** - Change is isolated and reversible
- **Follow-up planned** - Proper fix scheduled for next sprint
**Pattern**: Quick fix now → Monitor → Proper fix later
### When to Use Proper Fix
- **Root cause addressed** - Not just treating symptoms
- **Proper testing** - Comprehensive test coverage added
- **Type safety** - Leverages static type checking
- **Prevention** - Prevents entire class of similar bugs
- **Documentation** - Code is self-documenting
**Pattern**: Understand root cause → Comprehensive fix → Prevent recurrence
## Fix Priority Assessment
### Priority Matrix
| Severity | Frequency | Priority | Response Time |
|----------|-----------|----------|---------------|
| **Critical** | High | P0 | Immediate (< 1 hour) |
| **Critical** | Low | P1 | Same day |
| **Major** | High | P1 | Same day |
| **Major** | Low | P2 | This week |
| **Minor** | High | P2 | This week |
| **Minor** | Low | P3 | Next sprint |
**Severity Criteria**:
- **Critical**: Data loss, security breach, production down
- **Major**: Degraded performance, incorrect results, feature broken
- **Minor**: Edge case, cosmetic issue, rare error
**Frequency Criteria**:
- **High**: Affects >10% of users or happens >10 times/day
- **Low**: Affects <1% of users or happens occasionally
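
The priority matrix is a plain lookup, which can be encoded directly. A sketch under the assumption that severity and frequency have already been classified by the criteria above (the names `PRIORITY_MATRIX` and `assess_priority` are illustrative):

```python
from typing import Dict, Tuple

# (severity, frequency) -> (priority, response time), copied from the priority matrix
PRIORITY_MATRIX: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("critical", "high"): ("P0", "Immediate (< 1 hour)"),
    ("critical", "low"): ("P1", "Same day"),
    ("major", "high"): ("P1", "Same day"),
    ("major", "low"): ("P2", "This week"),
    ("minor", "high"): ("P2", "This week"),
    ("minor", "low"): ("P3", "Next sprint"),
}


def assess_priority(severity: str, frequency: str) -> Tuple[str, str]:
    """Look up priority and response time; raise on unknown inputs."""
    key = (severity.strip().lower(), frequency.strip().lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"Unknown severity/frequency: {severity!r}, {frequency!r}")
    return PRIORITY_MATRIX[key]


priority, response_time = assess_priority("Critical", "Low")
```

Raising on unknown input keeps a typo in the severity label from silently producing a low priority.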
## Common Fix Patterns by Error Type
### Null/Undefined Errors
**Pattern 1: Null Check with Default**
```python
# Before
name = user.name # NoneType error
# After
name = user.name if user else "Unknown"
```
**Pattern 2: Raise Exception** (API boundaries)
```python
# Before
user = db.users.find_one(user_id)
return user.name  # NoneType error
# After
user = db.users.find_one(user_id)
if user is None:
    raise HTTPException(404, "User not found")
return user.name
```
### Type Errors
**Pattern 1: Type Conversion with Validation**
```python
# Before
total = base_price + discount  # TypeError: int + str
# After
from pydantic import BaseModel

class PriceInput(BaseModel):
    base_price: int
    discount: int  # Automatic validation and conversion

input_data = PriceInput(**request_body)  # Validates types
total = input_data.base_price + input_data.discount
```
### Database Errors
**Pattern 1: Constraint Violations**
```python
# Before
db.add(user)
db.commit()  # IntegrityError: UNIQUE constraint failed
# After
from sqlalchemy.exc import IntegrityError

try:
    db.add(user)
    db.commit()
except IntegrityError:
    db.rollback()
    # Option A: Return error
    raise HTTPException(409, "User with this email already exists")
    # Option B (use instead of Option A): Upsert
    # existing = db.query(User).filter_by(email=user.email).first()
    # if existing:
    #     existing.name = user.name
    #     db.commit()
```
**Pattern 2: Connection Failures**
```python
# Before
engine = create_engine(DATABASE_URL)
connection = engine.connect()  # OperationalError: connection refused
# After
from sqlalchemy import create_engine
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def get_connection():
    engine = create_engine(DATABASE_URL)
    return engine.connect()

connection = get_connection()
```
### API/Integration Errors
**Pattern 1: Validation at Boundary**
```python
# Before
response = payment_api.create_charge(amount=order.total)
# Fails with 422 if amount < 50 (API minimum)
# After
from pydantic import BaseModel, validator  # Pydantic v1 API; v2 uses field_validator

class CreateChargeRequest(BaseModel):
    amount: int  # in cents

    @validator('amount')
    def amount_meets_minimum(cls, v):
        if v < 50:
            raise ValueError('Amount must be at least $0.50')
        return v

# Validate before API call
request = CreateChargeRequest(amount=order.total)  # Fails early
response = payment_api.create_charge(**request.dict())
```
**Pattern 2: Retry with Backoff**
```python
# Before
response = httpx.get(api_url)  # Timeout occasionally
# After
import httpx
from tenacity import (
    retry, retry_if_exception_type, stop_after_attempt, wait_exponential
)

@retry(
    retry=retry_if_exception_type(httpx.TimeoutException),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def fetch_with_retry(url: str):
    async with httpx.AsyncClient(timeout=5.0) as client:
        return await client.get(url)
```
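To make the backoff behaviour concrete, here is a dependency-free sketch of the same idea. It is not how `tenacity` is implemented; the doubling schedule, the `TimeoutError` trigger, and the `retry_with_backoff` name are assumptions chosen for illustration:

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TimeoutError with a doubling, capped delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TimeoutError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise ValueError("attempts must be at least 1")


# Demo: fail twice, then succeed; record delays instead of actually sleeping.
calls = {"count": 0}
delays: List[float] = []


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = retry_with_backoff(flaky, sleep=delays.append)
```

Injecting `sleep` keeps the retry logic testable without real waiting.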
### Performance Errors
**Pattern 1: N+1 Query Fix**
```python
# Before (N+1 queries)
users = db.query(User).all()  # 1 query
for user in users:
    posts = db.query(Post).filter(Post.user_id == user.id).all()  # N queries
# After (Single query with join)
from sqlalchemy.orm import joinedload

users = db.query(User).options(
    joinedload(User.posts)
).all()  # 1 query with join
```
**Pattern 2: Caching**
```python
# Before
def get_user_profile(user_id: str):
    return db.query(User).filter_by(id=user_id).first()  # Every time
# After
from cachetools import TTLCache, cached

cache = TTLCache(maxsize=1000, ttl=300)  # 5 minute TTL

@cached(cache)
def get_user_profile(user_id: str):
    return db.query(User).filter_by(id=user_id).first()
```
## Fix Validation Strategies
### Validation Checklist
```markdown
Before deploying fix:
- [ ] Fix addresses root cause (not just symptoms)
- [ ] Tests added to prevent recurrence
- [ ] Tests pass locally
- [ ] Code reviewed by peer
- [ ] No new linting/type errors
- [ ] Performance impact assessed
- [ ] Security implications reviewed
- [ ] Rollback plan documented
- [ ] Monitoring/alerts updated
```
### Test-Driven Fix Approach
**Pattern**: Write failing test → Implement fix → Test passes
```python
# Step 1: Write failing test
def test_get_user_with_invalid_id_returns_404():
    """Test that invalid user_id returns 404, not 500."""
    response = client.get("/users/invalid-id")
    assert response.status_code == 404
    assert "User not found" in response.json()["detail"]

# Step 2: Run test (should fail with current bug)
# pytest tests/test_users.py::test_get_user_with_invalid_id_returns_404
# AssertionError: 500 != 404

# Step 3: Implement fix
@app.get("/users/{user_id}")
async def get_user(user_id: str):
    user = await db.users.find_one({"id": user_id})
    if user is None:
        raise HTTPException(404, "User not found")
    return user

# Step 4: Run test (should pass)
# pytest tests/test_users.py::test_get_user_with_invalid_id_returns_404
# PASSED
```
### Integration Testing
```python
# Test fix with realistic scenario
import pytest

@pytest.mark.integration
async def test_order_creation_with_negative_total():
    """Integration test: Ensure negative order total is rejected."""
    # Setup
    user = await create_test_user()
    # Attempt to create order with negative total
    response = await client.post("/orders", json={
        "user_id": user.id,
        "items": [],
        "total": -100  # Invalid
    })
    # Assert validation error
    assert response.status_code == 422
    assert "total must be positive" in response.json()["detail"]
    # Verify no order created in database
    orders = await db.orders.find({"user_id": user.id})
    assert len(orders) == 0
```
## Refactoring Considerations
### When to Refactor During Fix
**Refactor if**:
- ✅ Fix requires understanding convoluted code
- ✅ Code duplication prevents proper fix
- ✅ Poor structure makes fix risky
- ✅ Fix is part of larger architectural improvement
**Don't refactor if**:
- ❌ Production incident needs immediate fix
- ❌ Refactoring scope unclear
- ❌ Tests insufficient to ensure safety
- ❌ Refactoring can be done separately
### Refactoring Patterns
**Pattern 1: Extract Function**
```python
# Before (hard to fix null error)
def process_order(order_data):
    user = db.users.find_one(order_data["user_id"])
    if user.is_active and user.credits > 0:
        # 50 lines of order processing
        pass

# After (easier to add null check)
def process_order(order_data):
    user = get_validated_user(order_data["user_id"])
    process_order_for_user(user, order_data)

def get_validated_user(user_id: str) -> User:
    """Get user and validate they can place orders."""
    user = db.users.find_one(user_id)
    if user is None:
        raise HTTPException(404, "User not found")
    if not user.is_active:
        raise HTTPException(403, "User account inactive")
    if user.credits <= 0:
        raise HTTPException(402, "Insufficient credits")
    return user
```
## Production Safety
### Pre-Deployment Checklist
```markdown
- [ ] Fix tested in staging environment
- [ ] Performance impact measured (CPU, memory, latency)
- [ ] Database migrations tested with production-sized data
- [ ] Feature flag available for gradual rollout
- [ ] Rollback procedure documented and tested
- [ ] Monitoring dashboard shows relevant metrics
- [ ] Alerts configured for fix-related failures
- [ ] On-call engineer briefed on deployment
- [ ] Communication sent to stakeholders
```
### Gradual Rollout Pattern
```python
# Use feature flag for gradual rollout
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
ld_client = ldclient.get()

@app.get("/users/{user_id}")
async def get_user(user_id: str):
    use_new_validation = ld_client.variation(
        "new-user-validation",
        Context.create(user_id),
        default=False
    )
    if use_new_validation:
        # New fix with validation
        user = await get_validated_user(user_id)
    else:
        # Old code (fallback)
        user = await db.users.find_one(user_id)
    return user
```
## Rollback Planning
### Rollback Decision Criteria
**Rollback immediately if**:
- Error rate spikes >5% above baseline
- Critical functionality broken
- Data corruption detected
- Performance degrades >50%
- Security vulnerability introduced
**Monitor and investigate if**:
- Error rate increases <5%
- Non-critical functionality affected
- Performance degrades <20%
- Edge cases failing
### Rollback Procedures
**1. Application Code Rollback**
```bash
# Git-based rollback
git revert <commit-hash>
git push origin main
# Or redeploy previous version
git checkout <previous-tag>
./deploy.sh
```
**2. Database Migration Rollback**
```bash
# Alembic (Python)
alembic downgrade -1
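# Drizzle (TypeScript): drizzle-kit has no down-migration command.
# `drizzle-kit drop` only deletes a generated migration file; revert the
# schema itself with a new migration that undoes the change.
bun run drizzle-kit drop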
```
**3. Feature Flag Disable**
```python
# Turn the flag off in the LaunchDarkly dashboard or via its API; no redeploy needed.
# Once it is off, evaluation returns the flag's off variation:
ld_client.variation("new-user-validation", context, default=False)
```
**4. Cache Invalidation**
```python
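# Clear cache after rollback
redis_client.flushdb()  # Clears every key in the current Redis database
# Or selectively (DEL takes exact key names, not glob patterns)
for key in redis_client.scan_iter(match="user:*"):
    redis_client.delete(key)  # Clear user cache only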
```
## Quick Reference
| Error Type | Primary Fix Pattern | Testing Strategy |
|------------|-------------------|------------------|
| **Null/Undefined** | Null check, optional chaining, raise exception | Unit test with None input |
| **Type Mismatch** | Pydantic validation, type guards | Unit test with wrong types |
| **Database** | Try/except with rollback, retries | Integration test with DB |
| **API/Integration** | Validation at boundary, retries | Mock API responses |
| **Performance** | Caching, query optimization | Performance benchmark test |
| Fix Type | When to Use | Risk Level |
|----------|-------------|------------|
| **Quick Fix** | Production incident | Low (isolated change) |
| **Proper Fix** | Root cause resolution | Medium (broader changes) |
| **Comprehensive Fix** | Prevention of entire class | Medium-High (architectural) |
---
**Usage**: When implementing a fix, generate 2-3 options with trade-offs, select the best option based on priority, validate with tests, deploy with a gradual rollout, monitor closely, and document the rollback procedure.


@@ -0,0 +1,466 @@
# Root Cause Analysis (RCA) Methodology
Comprehensive guide to performing effective root cause analysis for software bugs and incidents.
## What is Root Cause Analysis?
**Definition**: Systematic process of identifying the fundamental reason why a problem occurred, not just treating symptoms.
**Goal**: Find the root cause(s) to implement prevention strategies that stop recurrence.
**Key Principle**: Distinguish between:
- **Symptom**: Observable error (e.g., "API returns 500 error")
- **Proximate Cause**: Immediate trigger (e.g., "Database query timeout")
- **Root Cause**: Fundamental reason (e.g., "Missing database index on frequently queried column")
## The 5 Whys Technique
### Overview
**Method**: Ask "Why?" five times (or more) to drill down from symptom to root cause.
**Origin**: Toyota Production System (Lean Manufacturing)
**Best For**: Sequential cause-effect chains
### Example: Null Pointer Error
```
Problem: API endpoint returns 500 error
Why? → User object is null when accessing .name property
Why? → Database query returned null instead of user
Why? → User ID doesn't exist in database
Why? → Frontend sent incorrect user ID from stale cache
Why? → Cache invalidation not triggered after user deletion
ROOT CAUSE: Missing cache invalidation on user deletion
```
### Rules for Effective 5 Whys
1. **Be Specific**: "Database slow" → "Query takes 4.5s (target: <200ms)"
2. **Use Data**: Support each "Why" with evidence (logs, metrics, traces)
3. **Don't Stop Too Early**: Keep asking until you reach a process/policy root cause
4. **Don't Blame People**: Focus on processes, not individuals
5. **May Need More/Fewer Than 5**: Stop when you reach actionable root cause
### Template
```markdown
**Problem Statement**: [Observable symptom with impact]
**Why #1**: [First level cause]
**Evidence**: [Logs, metrics, traces]
**Why #2**: [Deeper cause]
**Evidence**: [Supporting data]
**Why #3**: [Even deeper]
**Evidence**: [Supporting data]
**Why #4**: [Near root cause]
**Evidence**: [Supporting data]
**Why #5**: [Root cause]
**Evidence**: [Supporting data]
**Root Cause**: [Fundamental reason]
**Prevention**: [How to prevent recurrence]
```
## Fishbone (Ishikawa) Diagram
### Overview
**Method**: Visual diagram categorizing potential causes into major categories.
**Best For**: Complex problems with multiple contributing factors
**Categories (Software Context)**:
- **Code**: Logic errors, missing validation, edge cases
- **Data**: Invalid inputs, corrupt data, missing records
- **Infrastructure**: Server issues, network problems, resource limits
- **Dependencies**: Third-party APIs, libraries, services
- **Process**: Deployment issues, configuration errors, environment mismatches
- **People**: Knowledge gaps, communication failures, assumptions
### Example Structure
```
                      Code
                       |
               Missing validation
                      /
                     /
API 500 ─────────────────── Infrastructure
Error                \
                      \
               Database timeout
                       |
                      Data
```
### When to Use
- Multiple potential root causes
- Need stakeholder alignment on cause
- Complex systems with many dependencies
- Post-incident reviews with team
## Timeline Analysis
### Overview
**Method**: Create chronological sequence of events leading to incident.
**Best For**: Understanding cascading failures, race conditions, timing issues
### Example
```markdown
## Timeline: User Profile Page Crash
**T-00:05** - User updates profile information
**T-00:03** - Profile update succeeds, cache invalidation triggered
**T-00:02** - Cache clear initiated but takes 3s (network latency)
**T-00:00** - User refreshes page, cache still has old data
**T+00:01** - API fetches user from cache (stale)
**T+00:02** - Frontend renders with field that was deleted in update
**T+00:03** - JavaScript error: Cannot read property 'X' of undefined
**T+00:04** - Error boundary catches error, shows crash page
**Root Cause**: Cache invalidation is async and completes after page reload, causing stale data rendering.
```
### Components
| Element | Description | Example |
|---------|-------------|---------|
| **Timestamp** | Relative or absolute time | `T-00:05` or `14:32:15 UTC` |
| **Event** | What happened | `User clicked submit` |
| **System State** | Relevant state at time | `Cache: stale, DB: updated` |
| **Decision Point** | Branch in event chain | `If cache miss: fetch DB` |
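
The components in the table map naturally onto a small record type. A sketch of assembling a timeline from unordered observations, assuming offsets are seconds relative to the incident (the `TimelineEvent` class and both helpers are illustrative names):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimelineEvent:
    offset_seconds: int  # relative to the incident at T+00:00; negative = before
    event: str
    system_state: str = ""


def format_offset(seconds: int) -> str:
    """Render an offset as T-MM:SS or T+MM:SS."""
    sign = "-" if seconds < 0 else "+"
    minutes, secs = divmod(abs(seconds), 60)
    return f"T{sign}{minutes:02d}:{secs:02d}"


def render_timeline(events: List[TimelineEvent]) -> List[str]:
    """Sort events chronologically and render one line per event."""
    lines = []
    for e in sorted(events, key=lambda e: e.offset_seconds):
        state = f" [{e.system_state}]" if e.system_state else ""
        lines.append(f"{format_offset(e.offset_seconds)} - {e.event}{state}")
    return lines


timeline = render_timeline([
    TimelineEvent(3, "JavaScript error in profile page"),
    TimelineEvent(-5, "User updates profile information"),
    TimelineEvent(1, "API fetches user from cache", "Cache: stale, DB: updated"),
])
```

Sorting by offset is what turns scattered log entries from different systems into a single cause-and-effect sequence.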
## Distinguishing Root Cause from Symptoms
### Symptom vs. Root Cause
| Symptom | Root Cause |
|---------|------------|
| API returns 500 error | Missing error handling for null user |
| Database query slow | Missing index on `user_id` column |
| Memory leak in production | Circular reference in event listeners |
| User can't login | Session cookie expires after 5 minutes (should be 24h) |
### Test: The Prevention Question
**Ask**: "If I fix this, will the problem never happen again?"
**If Yes** → Likely root cause
**If No** → Still a symptom or contributing factor
**Example**:
- "Add null check" → Prevents this specific null error, but why was data null? (Symptom fix)
- "Add database foreign key constraint" → Prevents any invalid user_id from being stored (Root cause fix)
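
The foreign key example can be demonstrated end to end with `sqlite3`. This is a sketch with an assumed two-table schema; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "user_id INTEGER NOT NULL REFERENCES users(id))"
)
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (user_id) VALUES (1)")  # valid reference

error_message = ""
try:
    conn.execute("INSERT INTO orders (user_id) VALUES (999)")  # no such user
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    error_message = str(exc)

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The invalid row never reaches the table, so downstream code cannot encounter an order that points at a missing user.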
## Contributing Factors vs. Root Cause
### Multiple Contributing Factors
Complex incidents often have multiple contributing factors and one primary root cause.
**Example: Data Loss Incident**
```markdown
**Primary Root Cause**: Database backup script fails silently (no monitoring)
**Contributing Factors**:
1. No backup validation process
2. Backup monitoring disabled in production
3. Backup script lacks error logging
4. No runbook for backup verification
5. Manual backup never tested
**Analysis**: All factors contributed, but root cause is silent failure. Fix that first.
```
### Prioritization
| Priority | Factor Type | Action |
|----------|-------------|--------|
| **P0** | Root cause | Fix immediately |
| **P1** | Major contributor | Fix in same release |
| **P2** | Minor contributor | Fix in next sprint |
| **P3** | Edge case | Backlog |
## RCA Documentation Format
### Standard Structure
```markdown
# Root Cause Analysis: [Title]
**Date**: YYYY-MM-DD
**Incident ID**: INC-12345
**Severity**: [SEV1/SEV2/SEV3]
**Participants**: [Names of people involved in RCA]
## Summary
[2-3 sentence overview of incident and root cause]
## Impact
- **Users Affected**: [Number or %]
- **Duration**: [Time from start to resolution]
- **Business Impact**: [Revenue loss, SLA breach, etc.]
## Timeline
[Chronological sequence of events]
## Root Cause
[Detailed explanation of fundamental cause]
### 5 Whys Analysis
[Step-by-step "Why?" chain]
## Contributing Factors
[List of factors that enabled or worsened the incident]
## Prevention
### Immediate Actions (Within 24h)
- [ ] Action 1
- [ ] Action 2
### Short-term Actions (Within 1 week)
- [ ] Action 1
- [ ] Action 2
### Long-term Actions (Within 1 month)
- [ ] Action 1
- [ ] Action 2
## Lessons Learned
[Key takeaways and process improvements]
```
## Prevention Strategy Development
### Fix Categories
| Category | Description | Example |
|----------|-------------|---------|
| **Technical** | Code, config, infrastructure changes | Add database index, implement retry logic |
| **Process** | Changes to how work is done | Require code review for DB changes |
| **Monitoring** | Detect issues before they cause incidents | Alert on slow query thresholds |
| **Testing** | Catch issues before production | Add integration test for edge case |
| **Documentation** | Improve knowledge sharing | Document backup restoration procedure |
### Prevention Checklist
```markdown
**Can we prevent the root cause?**
- [ ] Technical fix implemented
- [ ] Tests added to catch recurrence
- [ ] Monitoring added to detect early
**Can we detect it faster?**
- [ ] Alerts configured
- [ ] Logging improved
- [ ] Dashboards updated
**Can we mitigate impact?**
- [ ] Graceful degradation added
- [ ] Circuit breaker implemented
- [ ] Fallback logic added
**Can we recover faster?**
- [ ] Runbook created
- [ ] Automation added
- [ ] Team trained
**Can we prevent similar issues?**
- [ ] Pattern identified
- [ ] Linting rule added
- [ ] Architecture review scheduled
```
## Common RCA Pitfalls
### Pitfall 1: Stopping Too Early
**Bad**:
```
Why? → User got 500 error
Root Cause: Server returned error
Prevention: Fix server error
```
**Good**:
```
Why? → User got 500 error
Why? → Server threw unhandled exception
Why? → Null pointer accessing user.email
Why? → User object was null
Why? → Database returned no user
Why? → User ID didn't exist
Why? → Frontend sent deleted user's ID
Why? → Frontend cache not invalidated after deletion
Root Cause: Missing cache invalidation on user deletion
Prevention: Trigger cache clear on user deletion, add cache TTL as safety
```
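The prevention in the good example combines explicit invalidation with a TTL as a safety net. A minimal in-memory sketch of both mechanisms, assuming a server-side cache keyed by user ID (the `UserCache` class, `delete_user` helper, and injectable clock are illustrative, not the system described above):

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class UserCache:
    """Tiny in-memory cache with explicit invalidation plus a TTL safety net."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, user_id: str, user: Any) -> None:
        self._entries[user_id] = (self._clock() + self._ttl, user)

    def get(self, user_id: str) -> Optional[Any]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        expires_at, user = entry
        if self._clock() >= expires_at:
            del self._entries[user_id]  # TTL expired: treat as a miss
            return None
        return user

    def invalidate(self, user_id: str) -> None:
        self._entries.pop(user_id, None)


def delete_user(user_id: str, users: Dict[str, Any], cache: UserCache) -> None:
    """Delete the record and clear its cache entry in the same code path."""
    users.pop(user_id, None)
    cache.invalidate(user_id)


# Demo with a fake clock so the TTL can be exercised without sleeping.
now = [0.0]
cache = UserCache(ttl_seconds=300, clock=lambda: now[0])
users = {"u1": {"name": "Ada"}, "u2": {"name": "Lin"}}
cache.set("u1", users["u1"])
cache.set("u2", users["u2"])

delete_user("u1", users, cache)
after_delete = cache.get("u1")  # invalidated explicitly
before_ttl = cache.get("u2")    # still fresh
now[0] = 301.0
after_ttl = cache.get("u2")     # expired by the TTL safety net
```

Invalidation handles the known deletion path; the TTL bounds how long any path that was missed can serve stale data.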
### Pitfall 2: Blaming People
**Bad**:
```
Root Cause: Developer forgot to add validation
Prevention: Tell developer to remember next time
```
**Good**:
```
Root Cause: No validation enforced at API boundary
Prevention:
- Use Pydantic for automatic validation
- Add linting rule to detect missing validation
- Update code review checklist
```
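As an illustration of enforcing validation at the API boundary, the sketch below uses a standard-library dataclass as a stand-in for a Pydantic model so that it runs without third-party packages. `CreateUserRequest`, its fields, and `parse_create_user` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateUserRequest:
    """Hypothetical request model; validation runs on every construction."""

    email: str
    age: int

    def __post_init__(self):
        if not isinstance(self.email, str) or "@" not in self.email:
            raise ValueError("email must contain '@'")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.age, bool) or not isinstance(self.age, int) or self.age <= 0:
            raise ValueError("age must be a positive integer")


def parse_create_user(payload):
    """Single entry point at the boundary: raw dict in, validated object out."""
    try:
        return CreateUserRequest(email=payload["email"], age=payload["age"])
    except KeyError as exc:
        raise ValueError(f"missing field: {exc.args[0]}") from exc
```

Because handlers only ever receive a `CreateUserRequest`, forgetting to validate is no longer something an individual developer has to remember.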
### Pitfall 3: Accepting "Human Error" as Root Cause
**Bad**:
```
Root Cause: Admin accidentally deleted production database
Prevention: Be more careful
```
**Good**:
```
Root Cause: Production database lacks deletion protection
Prevention:
- Enable RDS deletion protection
- Require MFA for production access
- Implement soft-delete instead of hard-delete
- Require typed confirmation (for example, typing the database name) before destructive operations
```
### Pitfall 4: Multiple Root Causes Without Prioritization
**Bad**:
```
Root Causes:
1. Missing error handling
2. No monitoring
3. Bad documentation
4. Insufficient testing
5. Poor communication
Prevention: Fix all of them
```
**Good**:
```
Primary Root Cause: Missing error handling (caused immediate incident)
Contributing Factors:
- No monitoring (delayed detection)
- Insufficient testing (didn't catch before deployment)
Prevention Priority:
1. Add error handling (prevents recurrence) - P0
2. Add monitoring (faster detection) - P1
3. Add tests (catch in CI) - P1
4. Improve docs (better response) - P2
```
### Pitfall 5: Technical Fix Without Process Improvement
**Bad**:
```
Root Cause: Missing database index
Prevention: Add index
```
**Good**:
```
Root Cause: Missing database index causing slow queries
Prevention:
- Technical: Add index on user_id column
- Process: Require query performance review in code review
- Monitoring: Alert on queries >200ms
- Testing: Add performance test asserting query count
```
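The monitoring item above could be prototyped with a small timing wrapper. This is a sketch only: `timed_query` and the `on_slow` callback are hypothetical, and a real deployment would more likely rely on the database's slow-query log or an APM tool.

```python
import time
from contextlib import contextmanager

SLOW_QUERY_THRESHOLD_MS = 200.0  # mirrors the ">200ms" alert above


@contextmanager
def timed_query(name, on_slow, threshold_ms=SLOW_QUERY_THRESHOLD_MS,
                clock=time.perf_counter):
    """Time the wrapped block; call on_slow(name, elapsed_ms) if it is too slow."""
    start = clock()
    try:
        yield
    finally:
        # Runs even if the query raised, so failures are timed too.
        elapsed_ms = (clock() - start) * 1000.0
        if elapsed_ms > threshold_ms:
            on_slow(name, elapsed_ms)
```

Usage would wrap the query call, for example `with timed_query("load_orders", alert_fn): ...`, where `alert_fn` is whatever hook the team uses to raise an alert.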
## RCA Review and Validation
### Review Checklist
```markdown
- [ ] Root cause clearly identified and evidence-based
- [ ] Timeline accurate and complete
- [ ] All contributing factors documented
- [ ] Prevention strategies are actionable
- [ ] Prevention strategies assigned owners and due dates
- [ ] Lessons learned documented
- [ ] Incident review meeting scheduled
- [ ] RCA shared with relevant teams
```
### Validation Questions
1. **Completeness**: Does the RCA explain all observed symptoms?
2. **Preventability**: Will the proposed fixes prevent recurrence?
3. **Testability**: Can we verify the fixes work?
4. **Generalizability**: Are there similar issues we should address?
5. **Sustainability**: Will fixes remain effective long-term?
## Best Practices
### Do's
- **Start RCA immediately** after incident resolution
- **Involve multiple people** for diverse perspectives
- **Use data** to support each "Why" answer
- **Focus on processes**, not people
- **Document everything** even if it seems obvious
- **Assign owners** to all prevention actions
- **Set deadlines** for prevention implementation
- **Follow up** to ensure actions completed
### Don'ts
- **Don't rush** - Thorough RCA takes time
- **Don't blame** - Focus on systemic issues
- **Don't accept vague answers** - "System was slow" → "Query took 4.5s"
- **Don't stop at technical fixes** - Address process and monitoring too
- **Don't skip documentation** - Future incidents benefit from past RCAs
- **Don't forget to close the loop** - Verify prevention actions worked
## Quick Reference
| Technique | Best For | Output |
|-----------|----------|--------|
| **5 Whys** | Sequential cause-effect chains | Linear cause chain → root cause |
| **Fishbone** | Multiple potential causes | Categorized causes diagram |
| **Timeline** | Cascading failures, timing issues | Chronological event sequence |

| Root Cause Type | Fix Strategy |
|-----------------|--------------|
| **Missing validation** | Add validation at boundary + tests |
| **Missing error handling** | Add try/catch + logging + monitoring |
| **Performance issue** | Optimize + add performance test + alert |
| **Configuration error** | Fix config + add validation + documentation |
| **Process gap** | Update process + add checklist + training |
---
**Usage**: When debugging is complete, perform RCA to understand why the bug existed and how to prevent similar issues. Use 5 Whys for most cases, Fishbone for complex multi-factor incidents, Timeline for cascading failures.

@@ -0,0 +1,414 @@
# Stack Trace Patterns
Comprehensive guide to reading, analyzing, and extracting insights from stack traces across different languages and environments.
## Python Stack Traces
### Anatomy
```python
Traceback (most recent call last):
  File "/app/api/users.py", line 45, in get_user
    user = db.query(User).filter(User.id == user_id).one()
  File "/venv/lib/sqlalchemy/orm/query.py", line 2890, in one
    raise NoResultFound("No row was found")
sqlalchemy.orm.exc.NoResultFound: No row was found
```
**Reading Order**: Bottom-up (exception → root cause)
| Component | Description | Example |
|-----------|-------------|---------|
| **Exception Type** | The error class | `sqlalchemy.orm.exc.NoResultFound` |
| **Exception Message** | Error description | `"No row was found"` |
| **Root Frame** | Where error originated | `query.py:2890, in one` |
| **Call Stack** | Function call chain | `users.py:45 → query.py:2890` |
### Identifying the Root Cause
```python
# Stack trace from user code → library code
Traceback (most recent call last):
  File "/app/api/orders.py", line 23, in create_order  # ← Your code (start here!)
    payment = process_payment(order.total)
  File "/app/services/payment.py", line 67, in process_payment
    stripe.Charge.create(amount=amount)
  File "/venv/lib/stripe/api.py", line 342, in create  # ← Library code (ignore)
    raise InvalidRequestError("Amount must be positive")
stripe.error.InvalidRequestError: Amount must be positive
```
**Analysis**:
1. Exception: `InvalidRequestError` - Amount validation failed
2. Root frame in your code: `payment.py:67` - Calling Stripe with invalid amount
3. Source of bad data: `orders.py:23` - Passing `order.total` (likely 0 or negative)
**Fix Location**: Check `order.total` validation in `orders.py:23`
### Filtering Noise
**Focus on**:
- Files in your project directory (`/app/*`)
- The frame in your code closest to the exception (in a Python traceback, the last frame under your project directory)
**Ignore**:
- Virtual environment files (`/venv/*`, `site-packages/*`)
- Standard library (`/usr/lib/python3.*/`)
- Framework internals (unless debugging framework)
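The filtering rules above can be automated with the standard-library `traceback` module. The sketch below is one possible approach; `project_frames`, `root_frame`, and the `project_root` argument are hypothetical helper names, and a simple path-prefix check stands in for whatever rule identifies your own code.

```python
import traceback


def project_frames(exc, project_root):
    """Return the traceback frames whose file path is under project_root."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [f for f in frames if f.filename.startswith(project_root)]


def root_frame(exc, project_root):
    """Return the project frame closest to the exception, or None."""
    frames = project_frames(exc, project_root)
    # extract_tb lists frames oldest-first, so the last match is the
    # deepest frame in your own code.
    return frames[-1] if frames else None
```

Each returned item is a `traceback.FrameSummary` with `filename`, `lineno`, and `name` attributes, which is usually enough to jump to the right line.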
### Async Stack Traces
```python
Traceback (most recent call last):
  File "/app/api/users.py", line 23, in get_user_profile
    user = await fetch_user(user_id)
  File "/app/services/users.py", line 45, in fetch_user
    data = await http_client.get(f"/users/{user_id}")
  File "/venv/lib/httpx/_client.py", line 1234, in get
    raise ConnectTimeout()
httpx.ConnectTimeout: Connection timed out
```
**Key Indicators**:
- `await` expressions in the source line shown under each frame
- Frames whose functions are declared with `async def`
- Coroutine references
**Analysis**: Trace async call chain: `get_user_profile` → `fetch_user` → `httpx.get` → timeout
## JavaScript/TypeScript Stack Traces
### Node.js Format
```javascript
Error: User not found
    at UserService.findById (/app/services/user.service.ts:42:11)
    at async getUserProfile (/app/api/users.controller.ts:23:18)
    at async /app/middleware/auth.ts:67:5
    at async handleRequest (/app/server/request-handler.ts:15:3)
```
**Reading Order**: Top-down (error → call chain)
| Component | Description | Example |
|-----------|-------------|---------|
| **Error Type** | Error class | `Error` |
| **Error Message** | Description | `"User not found"` |
| **Root Frame** | Where thrown | `user.service.ts:42:11` |
| **Call Stack** | Caller chain | `getUserProfile` → middleware → request handler |
### Browser Stack Traces
```javascript
Uncaught TypeError: Cannot read property 'name' of undefined
    at UserProfile.render (UserProfile.tsx:15:32)
    at finishClassComponent (react-dom.production.min.js:123:45)
    at updateClassComponent (react-dom.production.min.js:456:12)
```
**Analysis**:
- Error: Accessing `.name` on undefined object
- Root: `UserProfile.tsx:15:32` (your component)
- Framework: React rendering internals (ignore)
**Fix**: Add null check in `UserProfile.tsx:15`
### Minified Stack Traces
```javascript
TypeError: Cannot read property 'name' of undefined
    at t.render (main.a3b4c5d6.js:1:23456)
    at u (2.chunk.js:4:567)
```
**Problem**: Minified code is unreadable (`t`, `u`, cryptic filenames)
**Solution**: Use source maps
```javascript
// With source map
TypeError: Cannot read property 'name' of undefined
    at UserProfile.render (src/components/UserProfile.tsx:15:32)
    at ReactComponent.update (src/lib/react.ts:45:12)
```
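**How**: Ensure source maps are available:
- Development: Bundlers typically generate them by default in development mode
- Production: Generate `.map` files at build time and keep them available for debugging
- Error tracking: Services such as Sentry and Bugsnag can apply uploaded source maps automatically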
## Java Stack Traces
### Format
```java
java.lang.NullPointerException: Cannot invoke "User.getName()" because "user" is null
    at com.example.UserService.getFullName(UserService.java:42)
    at com.example.UserController.getUserProfile(UserController.java:23)
    at org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:219)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:142)
```
**Reading Order**: Top-down
**Analysis**:
- Exception: `NullPointerException` with helpful message (introduced in Java 14, enabled by default since Java 15)
- Root: `UserService.java:42` calling `.getName()` on null
- Caller: `UserController.java:23`
- Framework: Spring MVC (ignore)
**Fix**: Add null check at `UserService.java:42`
## FastAPI/Pydantic Stack Traces
### Validation Error
```python
pydantic.error_wrappers.ValidationError: 2 validation errors for UserCreate
email
  field required (type=value_error.missing)
age
  ensure this value is greater than 0 (type=value_error.number.not_gt; limit_value=0)
```
**Analysis**:
- Not a traditional stack trace but a validation error report (Pydantic v1 format shown; v2 wording differs)
- Lists all validation failures
- Each error shows: field, message, type
**Fix**: Client must send valid `email` (required) and `age > 0`
### FastAPI Exception
```python
Traceback (most recent call last):
  File "/app/api/endpoints/users.py", line 45, in create_user
    db_user = await crud.user.create(user_in)
  File "/app/crud/user.py", line 23, in create
    db.commit()
  File "/venv/lib/sqlalchemy/orm/session.py", line 2345, in commit
    raise IntegrityError("UNIQUE constraint failed: users.email")
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: users.email
```
**Analysis**:
- Database constraint violation
- Root in your code: `crud/user.py:23` trying to insert duplicate email
- Caller: `api/endpoints/users.py:45`
**Fix**: Check if user exists before insert, or use upsert
## Cloudflare Workers Stack Traces
### Format
```javascript
Error: Failed to fetch user data
    at fetchUser (worker.js:45:11)
    at handleRequest (worker.js:23:18)
```
**Characteristics**:
- Minimal stack (no Node.js internals)
- V8 isolate execution context
- Limited to worker code only
**Edge Cases**:
```javascript
Uncaught (in promise) TypeError: response.json is not a function
```
- Common cause: Missing `await` on the `fetch()` call, so `response` is a pending Promise rather than a `Response`
- Fix: `const response = await fetch(url)`, then `await response.json()`
## Pattern Recognition
### Null/Undefined Access Patterns
**Python**:
```
AttributeError: 'NoneType' object has no attribute 'X'
TypeError: 'NoneType' object is not subscriptable
```
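**JavaScript** (newer V8 versions use the `properties ... (reading 'X')` wording):
```
TypeError: Cannot read property 'X' of null
TypeError: Cannot read property 'X' of undefined
TypeError: Cannot read properties of undefined (reading 'X')
```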
**Java**:
```
java.lang.NullPointerException: Cannot invoke "X" because "Y" is null
```
### Type Mismatch Patterns
**Python**:
```
TypeError: unsupported operand type(s) for +: 'int' and 'str'
TypeError: 'X' object is not callable
```
**JavaScript**:
```
TypeError: X is not a function
TypeError: Cannot convert undefined or null to object
```
### Import/Module Patterns
**Python**:
```
ModuleNotFoundError: No module named 'X'
ImportError: cannot import name 'X' from 'Y'
```
**JavaScript**:
```
Error: Cannot find module 'X'
SyntaxError: Unexpected token 'export'
```
### Database Patterns
**SQLAlchemy**:
```
sqlalchemy.orm.exc.NoResultFound
sqlalchemy.exc.IntegrityError: UNIQUE constraint failed
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection refused
```
**Drizzle ORM**:
```
DrizzleError: Unique constraint failed on column: email
DrizzleError: Connection to database server failed
```
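The patterns in this section lend themselves to a simple lookup table. The sketch below is a hypothetical classifier (`PATTERNS` and `classify` are illustrative names, and the category labels are assumptions) that matches the final error line against regular expressions built from the messages listed above.

```python
import re

# (category, regex) pairs derived from the messages above; first match wins,
# so null-access is checked before the broader type-mismatch patterns.
PATTERNS = [
    ("null_access", re.compile(
        r"'NoneType' object"
        r"|Cannot read propert(y|ies) .*of (null|undefined)"
        r"|NullPointerException")),
    ("import", re.compile(
        r"ModuleNotFoundError|ImportError|Cannot find module")),
    ("database", re.compile(
        r"IntegrityError|OperationalError|NoResultFound")),
    ("type_mismatch", re.compile(
        r"unsupported operand type|is not a function|is not callable")),
]


def classify(error_line):
    """Return the first matching category for an error line, else 'unknown'."""
    for category, pattern in PATTERNS:
        if pattern.search(error_line):
            return category
    return "unknown"
```

A table like this is only a triage aid: it suggests where to look first, and anything it labels `unknown` still needs a manual read of the trace.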
## Analysis Workflow
### 1. Identify Exception Type
```python
# Example
sqlalchemy.exc.IntegrityError: UNIQUE constraint failed: users.email
# ↓
# Type: IntegrityError (database constraint)
# Subtype: UNIQUE (duplicate key)
```
### 2. Locate Root Frame in Your Code
```python
Traceback (most recent call last):
  File "/app/api/users.py", line 45, in create_user  # ← Root frame
    db.commit()
  File "/venv/lib/sqlalchemy/orm/session.py", line 2345, in commit  # ← Library
    raise IntegrityError()
```
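**Rule**: The root frame is the frame in your project directory closest to the exception: the last project frame before library/framework code in Python, the first project frame in JavaScript and Java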
### 3. Trace Backwards Through Call Chain
```python
create_user (users.py:45)
    ↓ calls
UserService.create (user_service.py:23)
    ↓ calls
db.commit (sqlalchemy) → IntegrityError
```
**Analysis**: Error originates in `db.commit`, propagates through `UserService.create`, surfaces in `create_user` endpoint
### 4. Identify Data Flow
```python
# users.py:45
user = User(email=request.email)  # ← Where does request.email come from?
db.add(user)
db.commit()                       # ← Fails with UNIQUE constraint (raised at flush/commit)
# Trace back:
#   request.email ← Request body
#                 ← Client sent duplicate email
#                 ← Need validation before DB insert
```
### 5. Formulate Hypothesis
**Pattern**: UNIQUE constraint → Attempting duplicate insert
**Root Cause**: No existence check before insert
**Fix**: Add existence check or use upsert
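One way to act on this hypothesis is to let the database enforce uniqueness and handle the constraint error, which avoids the race between a separate existence check and the insert. The sketch below uses the standard-library `sqlite3` module rather than SQLAlchemy so that it is self-contained; the `users` table layout and the `create_user` helper are assumptions made for the example.

```python
import sqlite3


def create_user(conn, email):
    """Insert a user; return False instead of raising if the email exists."""
    try:
        # The connection context manager commits on success and
        # rolls back if the block raises.
        with conn:
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        # UNIQUE constraint failed: let the caller report the duplicate
        # (for example as an HTTP 409) instead of a 500.
        return False
```

With an ORM the same idea applies: catch the integrity error around the commit and translate it into a client-facing error.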
## Advanced Patterns
### Recursive Stack Traces
```python
Traceback (most recent call last):
  File "/app/services/tree.py", line 23, in calculate_depth
    return 1 + calculate_depth(node.parent)
  File "/app/services/tree.py", line 23, in calculate_depth
    return 1 + calculate_depth(node.parent)
  [Previous line repeated 996 more times]
RecursionError: maximum recursion depth exceeded
```
**Analysis**: Unbounded recursion, caused by a missing base case or a circular reference in the `node.parent` chain
**Fix**: Add base case or cycle detection
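A possible shape for that fix is sketched below. The `Node` class is a hypothetical stand-in for the real tree node, and the function is rewritten iteratively so that deep but valid trees do not hit the recursion limit either.

```python
class Node:
    """Hypothetical tree node with a parent pointer."""

    def __init__(self, parent=None):
        self.parent = parent


def calculate_depth(node):
    """Count the node and its ancestors; raise if the parent chain loops."""
    depth = 0
    seen = set()
    while node is not None:  # base case: the root's parent is None
        if id(node) in seen:
            raise ValueError("cycle detected in parent chain")
        seen.add(id(node))
        depth += 1
        node = node.parent
    return depth
```

Raising on a cycle surfaces the corrupted data immediately, instead of failing later with a `RecursionError` that hides the cause.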
### Chained Exceptions (Python 3)
```python
Traceback (most recent call last):
  File "/app/db/connection.py", line 15, in connect
    engine.connect()
sqlalchemy.exc.OperationalError: connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/api/users.py", line 45, in get_user
    db.connect()
  File "/app/db/connection.py", line 20, in connect
    raise DatabaseConnectionError() from e
app.exceptions.DatabaseConnectionError: Failed to connect to database
```
**Reading**: Two stack traces:
1. Original: `OperationalError` (connection refused)
2. Wrapped: `DatabaseConnectionError` (user-friendly message)
**Root Cause**: Original exception (connection refused)
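In code, the original exception can be recovered by walking the chain that Python records on each exception object. The helper below is a small sketch (`root_cause` is a hypothetical name) that follows `__cause__`, set by `raise ... from`, and falls back to `__context__`, set implicitly when an exception is raised inside an `except` block.

```python
def root_cause(exc):
    """Follow the exception chain back to the first exception raised."""
    seen = set()
    while id(exc) not in seen:  # guard against accidental cycles
        seen.add(id(exc))
        nxt = exc.__cause__
        if nxt is None and not exc.__suppress_context__:
            # No explicit cause: use the implicitly recorded context,
            # unless it was suppressed with "raise ... from None".
            nxt = exc.__context__
        if nxt is None:
            break
        exc = nxt
    return exc
```

Logging `type(root_cause(err)).__name__` alongside the wrapper exception keeps the user-friendly message while preserving the detail needed for diagnosis.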
### Multiple Exception Points
```javascript
UnhandledPromiseRejectionWarning: Error: API request failed
    at fetchData (api.js:23:11)
(node:12345) UnhandledPromiseRejectionWarning: Unhandled promise rejection.
```
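**Analysis**: Promise rejected with no `.catch()` handler. Node.js 15 and later terminate the process on an unhandled rejection by default instead of only printing this warning
**Fix**: Add `.catch()` or wrap the `await` in `try`/`catch`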
## Quick Reference
| Language | Read Direction | Focus On | Ignore |
|----------|---------------|----------|--------|
| **Python** | Bottom-up | Last frame in your code | `/venv/`, stdlib |
| **JavaScript** | Top-down | First frame in your code | `node_modules/` |
| **Java** | Top-down | `com.example.*` | `org.springframework.*` |
| **TypeScript** | Top-down | `src/`, `.ts` files | `node_modules/`, `.min.js` |

| Error Pattern | Stack Trace Indicator | Fix Priority |
|---------------|----------------------|--------------|
| **Null/undefined** | `NoneType`, `null`, `undefined` | High |
| **Type mismatch** | `unsupported operand`, `is not a function` | Medium |
| **Import error** | `ModuleNotFoundError`, `Cannot find module` | High |
| **Database** | `IntegrityError`, `OperationalError` | High |
| **Async** | `await`, `Promise`, `Coroutine` | Medium |
---
**Usage**: When debugging, identify exception type, locate root frame in your code, trace backwards through call chain, identify data flow, formulate hypothesis.