Initial commit
56
skills/smart-debugging/reference/INDEX.md
Normal file
@@ -0,0 +1,56 @@
|
||||
# Smart Debug Reference
|
||||
|
||||
Debugging references and methodologies for systematic error resolution.
|
||||
|
||||
## Available References
|
||||
|
||||
### [error-patterns-database.md](error-patterns-database.md)
|
||||
Complete error pattern catalog with fixes.
|
||||
- **Null Pointer Errors** - NoneType, undefined, null reference
|
||||
- **Type Errors** - Type mismatch, unsupported operand, conversion failures
|
||||
- **Index Errors** - Array bounds, list access, slice errors
|
||||
- **Key Errors** - Dictionary key missing, object property undefined
|
||||
- **Import Errors** - Module not found, circular imports
|
||||
- **Database Errors** - Connection refused, timeout, constraint violations
|
||||
- **API Errors** - 400/422/500 responses, contract violations
|
||||
- **Concurrency Errors** - Race conditions, deadlocks, async issues
|
||||
- **Memory Errors** - Out of memory, memory leaks
|
||||
- **Performance Errors** - Slow queries, N+1 problems, inefficient algorithms
|
||||
|
||||
### [stack-trace-patterns.md](stack-trace-patterns.md)
|
||||
Stack trace reading and analysis guide.
|
||||
- Python stack traces (Traceback format)
|
||||
- JavaScript/TypeScript stack traces (Error.stack format)
|
||||
- Java stack traces (Exception format)
|
||||
- Identifying root file vs. propagation
|
||||
- Filtering stdlib and third-party frames
|
||||
- Understanding async stack traces
|
||||
- Reading minified stack traces
|
||||
- Source map integration
|
||||
|
||||
### [rca-methodology.md](rca-methodology.md)
|
||||
Root cause analysis methodologies.
|
||||
- **5 Whys** - Iterative questioning to root cause
|
||||
- **Timeline Analysis** - Chronological event reconstruction
|
||||
- **Fishbone Diagram** - Ishikawa cause categorization
|
||||
- **Fault Tree Analysis** - Logic diagram of failure paths
|
||||
- **Change Analysis** - Recent deployments and config changes
|
||||
- **Comparative Analysis** - Working vs. broken environments
|
||||
- **Reproducibility Testing** - Isolation of causal factors
|
||||
|
||||
### [fix-generation-patterns.md](fix-generation-patterns.md)
|
||||
Code fix patterns for common errors.
|
||||
- Null check patterns (guard clauses, optional chaining)
|
||||
- Type validation patterns (isinstance, type hints)
|
||||
- Error handling patterns (try-catch, error boundaries)
|
||||
- Input validation patterns (Pydantic, zod)
|
||||
- Defensive programming patterns
|
||||
- Fail-fast vs. graceful degradation
|
||||
- Error recovery strategies
|
||||
|
||||
## Quick Reference
|
||||
|
||||
**Need error patterns?** → [error-patterns-database.md](error-patterns-database.md)
|
||||
**Need stack trace help?** → [stack-trace-patterns.md](stack-trace-patterns.md)
|
||||
**Need RCA methods?** → [rca-methodology.md](rca-methodology.md)
|
||||
**Need fix patterns?** → [fix-generation-patterns.md](fix-generation-patterns.md)
|
||||
204
skills/smart-debugging/reference/error-patterns-database.md
Normal file
@@ -0,0 +1,204 @@
|
||||
# Error Patterns Database
|
||||
|
||||
Comprehensive catalog of common error patterns with fixes and prevention strategies.
|
||||
|
||||
## Null Pointer / None Type Errors
|
||||
|
||||
| Pattern | Indicators | Root Cause | Fix | Prevention |
|
||||
|---------|-----------|------------|-----|------------|
|
||||
| **NoneType Attribute** | `'NoneType' object has no attribute 'x'` | Accessing property on None | Add null check: `if obj is None: return` | Use Optional[] types, validation |
|
||||
| **Undefined Variable** | `ReferenceError: x is not defined` (JS) | Using a variable that was never declared or is out of scope | Declare and initialize the variable before use | Use `let`/`const`, enable strict mode |
|
||||
| **Null Dereference** | `Cannot read property 'x' of null` | Object is null/undefined | Optional chaining: `obj?.property` | Use TypeScript strict null checks |
|
||||
|
||||
### Fix Template
|
||||
|
||||
```python
# Before
user.name  # AttributeError if user is None

# After
def display_name(user):
    if user is None:
        return "Unknown"
    return user.name

# Or use getattr() with a default (also covers user being None)
getattr(user, 'name', 'Unknown')
```
|
||||
|
||||
## Type Errors
|
||||
|
||||
| Pattern | Indicators | Root Cause | Fix | Prevention |
|
||||
|---------|-----------|------------|-----|------------|
|
||||
| **Operand Type Mismatch** | `unsupported operand type(s) for +: 'int' and 'str'` | Wrong types in operation | Type conversion or fix type hint | mypy, Pydantic validation |
|
||||
| **Wrong Argument Type** | `expected str, got int` | Passing wrong type | Convert type or fix signature | Static type checking |
|
||||
| **JSON Serialization** | `Object of type datetime is not JSON serializable` | Can't serialize type | Custom JSON encoder | Use Pydantic models |
|
||||
|
||||
### Fix Template
|
||||
|
||||
```python
# Type validation with Pydantic
from pydantic import BaseModel

class UserInput(BaseModel):
    age: int  # Automatic validation and conversion

# Input: {"age": "25"} → converts to int(25)
# Input: {"age": "abc"} → ValidationError
```
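
The JSON Serialization row in the table above names a custom encoder as the fix. A minimal sketch of that idea, using the `default` hook of the standard library's `json.dumps`; the helper name `to_jsonable` and the sample payload are illustrative assumptions, not part of the original catalog.

```python
import json
from datetime import date, datetime


def to_jsonable(value):
    """Fallback hook for json.dumps: convert types the encoder cannot handle."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    # Re-raise for anything else so unexpected types are not silently hidden.
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# Hypothetical payload containing a datetime, which json.dumps rejects by default.
payload = {"event": "signup", "created_at": datetime(2024, 1, 15, 9, 30)}
encoded = json.dumps(payload, default=to_jsonable)
```

Keeping the `TypeError` for unknown types means new unserializable fields still surface as errors instead of being dropped.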
|
||||
|
||||
## Index / Key Errors
|
||||
|
||||
| Pattern | Indicators | Root Cause | Fix | Prevention |
|
||||
|---------|-----------|------------|-----|------------|
|
||||
| **List Index Out of Range** | `list index out of range` | Accessing beyond list length | Check length first | Bounds checks or try/except `IndexError` (lists have no `.get()`) |
|
||||
| **Dict KeyError** | `KeyError: 'missing_key'` | Key doesn't exist in dict | Use `.get()` with default | Pydantic models, TypedDict |
|
||||
| **Array Out of Bounds** | `undefined` (JS array) | Accessing invalid index | Check array length | Default with `arr[i] ?? fallback`; `?.[]` only guards against a null/undefined array |
|
||||
|
||||
### Fix Template
|
||||
|
||||
```python
from pydantic import BaseModel, EmailStr  # EmailStr needs the email-validator package

# Bad
user_dict['email']  # KeyError if 'email' missing

# Good
user_dict.get('email', 'no-email@example.com')

# Best (with Pydantic)
class User(BaseModel):
    email: EmailStr  # Required, validated
```
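
The template above covers dictionaries; Python lists have no `.get()`, so the list case from the table needs its own guard. A small sketch, where the helper name `get_at` and the sample data are hypothetical:

```python
def get_at(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default


emails = ["a@example.com", "b@example.com"]
first = get_at(emails, 0)
missing = get_at(emails, 5, default="no-email@example.com")
```

Catching only `IndexError` keeps other mistakes, such as passing a non-integer index, visible.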
|
||||
|
||||
## Import / Module Errors
|
||||
|
||||
| Pattern | Indicators | Root Cause | Fix | Prevention |
|
||||
|---------|-----------|------------|-----|------------|
|
||||
| **Module Not Found** | `ModuleNotFoundError: No module named 'x'` | Missing dependency | Install: `pip install x` | Add to requirements.txt |
|
||||
| **Circular Import** | `ImportError: cannot import name 'X' from partially initialized module` | A imports B, B imports A | Refactor to remove cycle | Dependency injection |
|
||||
| **Relative Import** | `attempted relative import with no known parent package` | Incorrect relative import | Use absolute imports | Configure PYTHONPATH |
|
||||
|
||||
### Fix Template
|
||||
|
||||
```bash
|
||||
# Check installed packages
|
||||
pip list | grep package_name
|
||||
|
||||
# Install missing package
|
||||
pip install package_name
|
||||
|
||||
# Add to requirements
|
||||
echo "package_name==1.2.3" >> requirements.txt
|
||||
```
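
The same check can be done from inside Python, which helps when the interpreter running the code is not the one `pip` installs into. A sketch using the standard library's `importlib.util.find_spec`; the function name `missing_modules` and the made-up package name are assumptions for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately made up.
result = missing_modules(["json", "surely_not_installed_pkg_xyz"])
```

This sketch is meant for top-level names only; dotted names whose parent package is absent raise instead of returning `None`.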
|
||||
|
||||
## Database Errors
|
||||
|
||||
| Pattern | Indicators | Root Cause | Fix | Prevention |
|
||||
|---------|-----------|------------|-----|------------|
|
||||
| **Connection Refused** | `Connection refused` | DB not running or wrong host | Check connection string | Health checks, retry logic |
|
||||
| **Timeout** | `timeout exceeded` | Query too slow or DB overloaded | Optimize query, add indexes | Query analysis, connection pooling |
|
||||
| **Unique Constraint** | `UNIQUE constraint failed` | Duplicate key | Handle conflict (upsert) | Pre-check existence |
|
||||
| **Foreign Key Violation** | `FOREIGN KEY constraint failed` | Referenced record doesn't exist | Validate FK exists first | Use transactions |
|
||||
|
||||
### Fix Template
|
||||
|
||||
```python
# Handle constraint violations
from sqlalchemy.exc import IntegrityError

try:
    db.add(user)
    db.commit()
except IntegrityError as e:
    db.rollback()
    # Message text is backend-specific; 'UNIQUE constraint' matches SQLite
    if 'UNIQUE constraint' in str(e):
        raise DuplicateUserError()
    raise
```
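
The Unique Constraint row lists an upsert as the fix. One way to sketch it without an ORM is SQLite's `ON CONFLICT ... DO UPDATE` clause through the standard `sqlite3` module (this assumes SQLite 3.24 or newer); the `users` table and the `upsert_user` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT)")


def upsert_user(conn, email, name):
    """Insert a user, or update the name if the email already exists."""
    conn.execute(
        "INSERT INTO users (email, name) VALUES (?, ?) "
        "ON CONFLICT(email) DO UPDATE SET name = excluded.name",
        (email, name),
    )
    conn.commit()


upsert_user(conn, "ada@example.com", "Ada")
upsert_user(conn, "ada@example.com", "Ada Lovelace")  # No IntegrityError raised
rows = conn.execute("SELECT email, name FROM users").fetchall()
```

Doing the conflict handling in one statement avoids the race between a pre-check and the insert.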
|
||||
|
||||
## API / HTTP Errors
|
||||
|
||||
| Pattern | Indicators | Root Cause | Fix | Prevention |
|
||||
|---------|-----------|------------|-----|------------|
|
||||
| **400 Bad Request** | Malformed request | Invalid JSON or missing fields | Validate request schema | Pydantic, OpenAPI validation |
|
||||
| **401 Unauthorized** | Missing/invalid auth token | Token expired or missing | Refresh token logic | Token rotation, validation |
|
||||
| **404 Not Found** | Resource doesn't exist | Wrong ID or deleted resource | Return 404 with helpful message | Check existence first |
|
||||
| **422 Unprocessable** | Validation failed | Data doesn't meet constraints | Fix validation or API call | Schema validation |
|
||||
| **500 Internal Error** | Server-side error | Unhandled exception | Fix server code, add logging | Error handling, monitoring |
|
||||
|
||||
### Fix Template
|
||||
|
||||
```python
# Proper error handling
from fastapi import HTTPException

@app.post("/users")
async def create_user(user: UserCreate):
    existing = await db.users.find_one({"email": user.email})
    if existing:
        raise HTTPException(
            status_code=409,  # Conflict
            detail="User with this email already exists"
        )
    return await db.users.insert_one(user)
```
|
||||
|
||||
## Concurrency / Race Condition Errors
|
||||
|
||||
| Pattern | Indicators | Root Cause | Fix | Prevention |
|
||||
|---------|-----------|------------|-----|------------|
|
||||
| **Race Condition** | Inconsistent results, data corruption | Multiple threads accessing shared state | Use locks, atomic operations | Immutable data, message queues |
|
||||
| **Deadlock** | System hangs | Circular wait for resources | Order lock acquisition consistently | Avoid nested locks |
|
||||
| **Lost Update** | Changes overwritten | Concurrent updates | Optimistic locking (version field) | Transactions, SELECT FOR UPDATE |
|
||||
|
||||
### Fix Template
|
||||
|
||||
```python
# Use distributed lock (Redis)
import redis_lock

with redis_lock.Lock(redis_client, "order:123"):
    order = db.orders.get(123)
    order.status = "processed"
    db.orders.save(order)
```
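
The Lost Update row suggests optimistic locking with a version field. A minimal sketch of that technique, using an in-memory SQLite database so it runs standalone; the `orders` schema and the `update_status` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")
conn.execute("INSERT INTO orders VALUES (123, 'pending', 1)")


def update_status(conn, order_id, new_status, expected_version):
    """Apply the update only if the row still has the version we read earlier."""
    cursor = conn.execute(
        "UPDATE orders SET status = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_status, order_id, expected_version),
    )
    conn.commit()
    return cursor.rowcount == 1  # False means a concurrent writer got there first


# Two writers both read version 1; only the first update is applied.
first_ok = update_status(conn, 123, "processed", expected_version=1)
second_ok = update_status(conn, 123, "cancelled", expected_version=1)
final = conn.execute("SELECT status, version FROM orders WHERE id = 123").fetchone()
```

The caller that gets `False` would typically re-read the row and retry or report a conflict, rather than overwrite the other change.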
|
||||
|
||||
## Memory / Performance Errors
|
||||
|
||||
| Pattern | Indicators | Root Cause | Fix | Prevention |
|
||||
|---------|-----------|------------|-----|------------|
|
||||
| **Out of Memory** | `MemoryError` | Loading too much data | Stream/paginate data | Lazy loading, generators |
|
||||
| **N+1 Query Problem** | Slow performance | Loop with query inside | Use JOIN or eager loading | Query analysis, APM tools |
|
||||
| **Memory Leak** | Memory grows over time | Objects not garbage collected | Fix circular references | Profiling, weak references |
|
||||
|
||||
### Fix Template
|
||||
|
||||
```python
# Bad: N+1 queries
for user in users:  # 1 query
    posts = db.posts.find(user_id=user.id)  # N queries

# Good: Single query with join (json_agg is PostgreSQL-specific)
users_with_posts = db.execute("""
    SELECT users.*, json_agg(posts.*) as posts
    FROM users
    LEFT JOIN posts ON posts.user_id = users.id
    GROUP BY users.id
""")
```
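
For the Out of Memory row, the table recommends streaming and generators. A sketch of batch-wise processing with a generator; `in_batches` is a hypothetical helper, and `range(10)` stands in for a large data source.

```python
from itertools import islice


def in_batches(iterable, batch_size):
    """Yield lists of at most batch_size items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# range() is lazy, so only one batch is held in memory at a time.
batch_sums = [sum(batch) for batch in in_batches(range(10), 4)]
```

The same shape works for database cursors or file objects, since it only relies on the input being iterable.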
|
||||
|
||||
## Quick Reference: Error → Pattern
|
||||
|
||||
| Error Message | Pattern | Fix Priority |
|
||||
|---------------|---------|--------------|
|
||||
| `'NoneType' object has no attribute` | null_pointer | High |
|
||||
| `unsupported operand type` | type_mismatch | Medium |
|
||||
| `list index out of range` | index_error | Medium |
|
||||
| `KeyError` | key_error | Medium |
|
||||
| `ModuleNotFoundError` | import_error | High |
|
||||
| `Connection refused` | db_connection | High |
|
||||
| `UNIQUE constraint failed` | db_constraint | Medium |
|
||||
| `401 Unauthorized` | api_auth | High |
|
||||
| `MemoryError` | memory_error | Critical |
|
||||
|
||||
---
|
||||
|
||||
**Usage**: When debugging, match error message to pattern, apply fix template, implement prevention strategy.
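
The matching step described above can be sketched as a lookup over the quick-reference table. The rows below are copied from that table; the function name `classify_error` and the first-match-wins rule are assumptions of this sketch.

```python
import re

# (regex, pattern name, fix priority), copied from the quick-reference table.
ERROR_PATTERNS = [
    (r"'NoneType' object has no attribute", "null_pointer", "High"),
    (r"unsupported operand type", "type_mismatch", "Medium"),
    (r"list index out of range", "index_error", "Medium"),
    (r"KeyError", "key_error", "Medium"),
    (r"ModuleNotFoundError", "import_error", "High"),
    (r"Connection refused", "db_connection", "High"),
    (r"UNIQUE constraint failed", "db_constraint", "Medium"),
    (r"401 Unauthorized", "api_auth", "High"),
    (r"MemoryError", "memory_error", "Critical"),
]


def classify_error(message):
    """Return (pattern, priority) for the first matching row, else None."""
    for regex, pattern, priority in ERROR_PATTERNS:
        if re.search(regex, message):
            return pattern, priority
    return None


match = classify_error("AttributeError: 'NoneType' object has no attribute 'name'")
```

Returning `None` for unknown messages makes it explicit when an error falls outside the catalog and needs manual analysis.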
|
||||
483
skills/smart-debugging/reference/fix-generation-patterns.md
Normal file
@@ -0,0 +1,483 @@
|
||||
# Fix Generation Patterns
|
||||
|
||||
Comprehensive guide to generating, evaluating, and implementing fixes for software bugs.
|
||||
|
||||
## Multiple Fix Options Strategy
|
||||
|
||||
**Core Principle**: Always generate 2-3 fix options with trade-off analysis.
|
||||
|
||||
### Fix Option Template
|
||||
|
||||
```markdown
|
||||
**Option 1: [Name]** (e.g., Quick Fix)
|
||||
**Implementation**: [What to change]
|
||||
**Pros**: [Benefits]
|
||||
**Cons**: [Drawbacks]
|
||||
**Effort**: [Time estimate]
|
||||
**Risk**: [Low/Medium/High]
|
||||
|
||||
**Option 2: [Name]** (e.g., Proper Fix)
|
||||
**Implementation**: [What to change]
|
||||
**Pros**: [Benefits]
|
||||
**Cons**: [Drawbacks]
|
||||
**Effort**: [Time estimate]
|
||||
**Risk**: [Low/Medium/High]
|
||||
|
||||
**Option 3: [Name]** (e.g., Comprehensive Fix)
|
||||
**Implementation**: [What to change]
|
||||
**Pros**: [Benefits]
|
||||
**Cons**: [Drawbacks]
|
||||
**Effort**: [Time estimate]
|
||||
**Risk**: [Low/Medium/High]
|
||||
|
||||
**Recommendation**: Option [X] because [reasoning]
|
||||
```
|
||||
|
||||
_See [null-pointer-debug-example.md](../examples/null-pointer-debug-example.md) for complete fix options example._
|
||||
|
||||
## Quick Fix vs. Proper Fix
|
||||
|
||||
### Decision Matrix
|
||||
|
||||
| Criteria | Quick Fix | Proper Fix |
|
||||
|----------|-----------|------------|
|
||||
| **Urgency** | Production down, immediate relief needed | Incident resolved, addressing root cause |
|
||||
| **Scope** | Minimal changes, single file | Multiple files, architectural changes |
|
||||
| **Time** | Minutes to hours | Hours to days |
|
||||
| **Testing** | Manual verification | Full test coverage required |
|
||||
| **Risk** | Low (minimal changes) | Medium (broader impact) |
|
||||
| **Longevity** | Temporary patch | Permanent solution |
|
||||
|
||||
### When to Use Quick Fix
|
||||
|
||||
✅ **Production incident** - System is down, users impacted
|
||||
✅ **Known workaround** - Clear, safe mitigation exists
|
||||
✅ **Low risk** - Change is isolated and reversible
|
||||
✅ **Follow-up planned** - Proper fix scheduled for next sprint
|
||||
|
||||
**Pattern**: Quick fix now → Monitor → Proper fix later
|
||||
|
||||
### When to Use Proper Fix
|
||||
|
||||
✅ **Root cause addressed** - Not just treating symptoms
|
||||
✅ **Proper testing** - Comprehensive test coverage added
|
||||
✅ **Type safety** - Leverages static type checking
|
||||
✅ **Prevention** - Prevents entire class of similar bugs
|
||||
✅ **Documentation** - Code is self-documenting
|
||||
|
||||
**Pattern**: Understand root cause → Comprehensive fix → Prevent recurrence
|
||||
|
||||
## Fix Priority Assessment
|
||||
|
||||
### Priority Matrix
|
||||
|
||||
| Severity | Frequency | Priority | Response Time |
|
||||
|----------|-----------|----------|---------------|
|
||||
| **Critical** | High | P0 | Immediate (< 1 hour) |
|
||||
| **Critical** | Low | P1 | Same day |
|
||||
| **Major** | High | P1 | Same day |
|
||||
| **Major** | Low | P2 | This week |
|
||||
| **Minor** | High | P2 | This week |
|
||||
| **Minor** | Low | P3 | Next sprint |
|
||||
|
||||
**Severity Criteria**:
|
||||
- **Critical**: Data loss, security breach, production down
|
||||
- **Major**: Degraded performance, incorrect results, feature broken
|
||||
- **Minor**: Edge case, cosmetic issue, rare error
|
||||
|
||||
**Frequency Criteria**:
|
||||
- **High**: Affects >10% of users or happens >10 times/day
|
||||
- **Low**: Affects <1% of users or happens occasionally
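
The priority matrix above is a direct lookup on severity and frequency, which can be sketched as a dictionary. The values are copied from the table; the names `PRIORITY_MATRIX` and `assess_priority` are illustrative.

```python
# (severity, frequency) -> (priority, response time), copied from the matrix.
PRIORITY_MATRIX = {
    ("critical", "high"): ("P0", "Immediate (< 1 hour)"),
    ("critical", "low"): ("P1", "Same day"),
    ("major", "high"): ("P1", "Same day"),
    ("major", "low"): ("P2", "This week"),
    ("minor", "high"): ("P2", "This week"),
    ("minor", "low"): ("P3", "Next sprint"),
}


def assess_priority(severity, frequency):
    """Look up priority and response time; reject values outside the matrix."""
    key = (severity.lower(), frequency.lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"Unknown severity/frequency combination: {key}")
    return PRIORITY_MATRIX[key]


priority, response_time = assess_priority("Critical", "Low")
```

Raising on unknown input, rather than guessing a default, keeps misclassified bugs from silently landing in the lowest bucket.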
|
||||
|
||||
## Common Fix Patterns by Error Type
|
||||
|
||||
### Null/Undefined Errors
|
||||
|
||||
**Pattern 1: Null Check with Default**
|
||||
```python
|
||||
# Before
|
||||
name = user.name # NoneType error
|
||||
|
||||
# After
|
||||
name = user.name if user else "Unknown"
|
||||
```
|
||||
|
||||
**Pattern 2: Raise Exception** (API boundaries)
|
||||
```python
# Before
user = db.users.find_one(user_id)
return user.name  # NoneType error

# After
user = db.users.find_one(user_id)
if user is None:
    raise HTTPException(404, "User not found")
return user.name
```
|
||||
|
||||
### Type Errors
|
||||
|
||||
**Pattern 1: Type Conversion with Validation**
|
||||
```python
# Before
total = base_price + discount  # TypeError: int + str

# After
from pydantic import BaseModel

class PriceInput(BaseModel):
    base_price: int
    discount: int  # Automatic validation and conversion

input_data = PriceInput(**request_body)  # Validates types
total = input_data.base_price + input_data.discount
```
|
||||
|
||||
### Database Errors
|
||||
|
||||
**Pattern 1: Constraint Violations**
|
||||
```python
# Before
db.add(user)
db.commit()  # IntegrityError: UNIQUE constraint failed

# After
from sqlalchemy.exc import IntegrityError

try:
    db.add(user)
    db.commit()
except IntegrityError:
    db.rollback()
    # Option A: Return error
    raise HTTPException(409, "User with this email already exists")
    # Option B (use instead of Option A): Upsert
    # existing = db.query(User).filter_by(email=user.email).first()
    # if existing:
    #     existing.name = user.name
    #     db.commit()
```
|
||||
|
||||
**Pattern 2: Connection Failures**
|
||||
```python
# Before
engine = create_engine(DATABASE_URL)
connection = engine.connect()  # OperationalError: connection refused

# After
from sqlalchemy import create_engine
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def get_connection():
    engine = create_engine(DATABASE_URL)
    return engine.connect()

connection = get_connection()
```
|
||||
|
||||
### API/Integration Errors
|
||||
|
||||
**Pattern 1: Validation at Boundary**
|
||||
```python
# Before
response = payment_api.create_charge(amount=order.total)
# Fails with 422 if amount < 50 (API minimum)

# After
from pydantic import BaseModel, validator  # Pydantic v1 API; v2 uses field_validator

class CreateChargeRequest(BaseModel):
    amount: int  # Smallest currency unit (50 = $0.50)

    @validator('amount')
    def amount_meets_minimum(cls, v):
        if v < 50:
            raise ValueError('Amount must be at least $0.50')
        return v

# Validate before API call
request = CreateChargeRequest(amount=order.total)  # Fails early
response = payment_api.create_charge(**request.dict())
```
|
||||
|
||||
**Pattern 2: Retry with Backoff**
|
||||
```python
# Before
response = httpx.get(api_url)  # Timeout occasionally

# After
import httpx
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

@retry(
    retry=retry_if_exception_type(httpx.TimeoutException),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def fetch_with_retry(url: str):
    async with httpx.AsyncClient(timeout=5.0) as client:
        return await client.get(url)
```
|
||||
|
||||
### Performance Errors
|
||||
|
||||
**Pattern 1: N+1 Query Fix**
|
||||
```python
from sqlalchemy.orm import joinedload

# Before (N+1 queries)
users = db.query(User).all()  # 1 query
for user in users:
    posts = db.query(Post).filter(Post.user_id == user.id).all()  # N queries

# After (Single query with join)
users = db.query(User).options(
    joinedload(User.posts)
).all()  # 1 query with join
```
|
||||
|
||||
**Pattern 2: Caching**
|
||||
```python
# Before
def get_user_profile(user_id: str):
    return db.query(User).filter_by(id=user_id).first()  # Every time

# After
from cachetools import TTLCache, cached

cache = TTLCache(maxsize=1000, ttl=300)  # 5 minute TTL

@cached(cache)
def get_user_profile(user_id: str):
    return db.query(User).filter_by(id=user_id).first()
```
|
||||
|
||||
## Fix Validation Strategies
|
||||
|
||||
### Validation Checklist
|
||||
|
||||
```markdown
|
||||
Before deploying fix:
|
||||
- [ ] Fix addresses root cause (not just symptoms)
|
||||
- [ ] Tests added to prevent recurrence
|
||||
- [ ] Tests pass locally
|
||||
- [ ] Code reviewed by peer
|
||||
- [ ] No new linting/type errors
|
||||
- [ ] Performance impact assessed
|
||||
- [ ] Security implications reviewed
|
||||
- [ ] Rollback plan documented
|
||||
- [ ] Monitoring/alerts updated
|
||||
```
|
||||
|
||||
### Test-Driven Fix Approach
|
||||
|
||||
**Pattern**: Write failing test → Implement fix → Test passes
|
||||
|
||||
```python
# Step 1: Write failing test
def test_get_user_with_invalid_id_returns_404():
    """Test that invalid user_id returns 404, not 500."""
    response = client.get("/users/invalid-id")
    assert response.status_code == 404
    assert "User not found" in response.json()["detail"]

# Step 2: Run test (should fail with current bug)
# pytest tests/test_users.py::test_get_user_with_invalid_id_returns_404
# AssertionError: 500 != 404

# Step 3: Implement fix
@app.get("/users/{user_id}")
async def get_user(user_id: str):
    user = await db.users.find_one({"id": user_id})
    if user is None:
        raise HTTPException(404, "User not found")
    return user

# Step 4: Run test (should pass)
# pytest tests/test_users.py::test_get_user_with_invalid_id_returns_404
# PASSED
```
|
||||
|
||||
### Integration Testing
|
||||
|
||||
```python
# Test fix with realistic scenario
@pytest.mark.integration
async def test_order_creation_with_negative_total():
    """Integration test: Ensure negative order total is rejected."""
    # Setup
    user = await create_test_user()

    # Attempt to create order with negative total
    response = await client.post("/orders", json={
        "user_id": user.id,
        "items": [],
        "total": -100  # Invalid
    })

    # Assert validation error
    assert response.status_code == 422
    # "detail" may be a list of error objects, so search its string form
    assert "total must be positive" in str(response.json()["detail"])

    # Verify no order created in database
    orders = await db.orders.find({"user_id": user.id})
    assert len(orders) == 0
```
|
||||
|
||||
## Refactoring Considerations
|
||||
|
||||
### When to Refactor During Fix
|
||||
|
||||
**Refactor if**:
|
||||
✅ Fix requires understanding convoluted code
|
||||
✅ Code duplication prevents proper fix
|
||||
✅ Poor structure makes fix risky
|
||||
✅ Fix is part of larger architectural improvement
|
||||
|
||||
**Don't refactor if**:
|
||||
❌ Production incident needs immediate fix
|
||||
❌ Refactoring scope unclear
|
||||
❌ Tests insufficient to ensure safety
|
||||
❌ Refactoring can be done separately
|
||||
|
||||
### Refactoring Patterns
|
||||
|
||||
**Pattern 1: Extract Function**
|
||||
```python
# Before (hard to fix null error)
def process_order(order_data):
    user = db.users.find_one(order_data["user_id"])
    if user.is_active and user.credits > 0:
        # 50 lines of order processing
        pass

# After (easier to add null check)
def process_order(order_data):
    user = get_validated_user(order_data["user_id"])
    process_order_for_user(user, order_data)

def get_validated_user(user_id: str) -> User:
    """Get user and validate they can place orders."""
    user = db.users.find_one(user_id)
    if user is None:
        raise HTTPException(404, "User not found")
    if not user.is_active:
        raise HTTPException(403, "User account inactive")
    if user.credits <= 0:
        raise HTTPException(402, "Insufficient credits")
    return user
```
|
||||
|
||||
## Production Safety
|
||||
|
||||
### Pre-Deployment Checklist
|
||||
|
||||
```markdown
|
||||
- [ ] Fix tested in staging environment
|
||||
- [ ] Performance impact measured (CPU, memory, latency)
|
||||
- [ ] Database migrations tested with production-sized data
|
||||
- [ ] Feature flag available for gradual rollout
|
||||
- [ ] Rollback procedure documented and tested
|
||||
- [ ] Monitoring dashboard shows relevant metrics
|
||||
- [ ] Alerts configured for fix-related failures
|
||||
- [ ] On-call engineer briefed on deployment
|
||||
- [ ] Communication sent to stakeholders
|
||||
```
|
||||
|
||||
### Gradual Rollout Pattern
|
||||
|
||||
```python
# Use feature flag for gradual rollout
# LaunchDarkly's Python server SDK is imported as `ldclient`
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
ld_client = ldclient.get()

@app.get("/users/{user_id}")
async def get_user(user_id: str):
    use_new_validation = ld_client.variation(
        "new-user-validation",
        Context.builder(user_id).build(),
        default=False
    )

    if use_new_validation:
        # New fix with validation (assumes an async get_validated_user)
        user = await get_validated_user(user_id)
    else:
        # Old code (fallback)
        user = await db.users.find_one(user_id)

    return user
```
|
||||
|
||||
## Rollback Planning
|
||||
|
||||
### Rollback Decision Criteria
|
||||
|
||||
**Rollback immediately if**:
|
||||
- Error rate spikes >5% above baseline
|
||||
- Critical functionality broken
|
||||
- Data corruption detected
|
||||
- Performance degrades >50%
|
||||
- Security vulnerability introduced
|
||||
|
||||
**Monitor and investigate if**:
|
||||
- Error rate increases <5%
|
||||
- Non-critical functionality affected
|
||||
- Performance degrades <20%
|
||||
- Edge cases failing
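
The criteria above can be sketched as a small decision function. The thresholds come from the two lists; the names `DeploymentHealth` and `rollback_decision` are hypothetical, and because the lists leave some ranges unspecified (for example, performance degradation between 20% and 50%), this sketch assumes anything short of a rollback trigger is treated as "monitor".

```python
from dataclasses import dataclass


@dataclass
class DeploymentHealth:
    error_rate_increase_pct: float      # Increase over baseline, in percent
    performance_degradation_pct: float  # Slowdown versus baseline, in percent
    critical_functionality_broken: bool = False
    data_corruption_detected: bool = False
    security_vulnerability_introduced: bool = False


def rollback_decision(health):
    """Return "rollback" when any immediate-rollback criterion is met, else "monitor"."""
    if (
        health.error_rate_increase_pct > 5
        or health.performance_degradation_pct > 50
        or health.critical_functionality_broken
        or health.data_corruption_detected
        or health.security_vulnerability_introduced
    ):
        return "rollback"
    # Assumption: ranges the criteria do not cover are monitored, not rolled back.
    return "monitor"


decision = rollback_decision(DeploymentHealth(6.0, 10.0))
```

Encoding the thresholds ahead of time keeps the rollback call from being debated during the incident itself.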
|
||||
|
||||
### Rollback Procedures
|
||||
|
||||
**1. Application Code Rollback**
|
||||
```bash
|
||||
# Git-based rollback
|
||||
git revert <commit-hash>
|
||||
git push origin main
|
||||
|
||||
# Or redeploy previous version
|
||||
git checkout <previous-tag>
|
||||
./deploy.sh
|
||||
```
|
||||
|
||||
**2. Database Migration Rollback**
|
||||
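```bash
# Alembic (Python)
alembic downgrade -1

# Drizzle (TypeScript)
# Drizzle Kit has no built-in down-migration command; `drizzle-kit drop`
# removes a generated migration file but does not revert the database.
# Revert schema changes by generating and applying a new forward migration.
```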
|
||||
|
||||
**3. Feature Flag Disable**
|
||||
```python
# Turn the flag off in the LaunchDarkly dashboard or via its API (no deploy needed).
# With the flag off, this evaluation returns the off variation (old code path):
ld_client.variation("new-user-validation", context, default=False)
```
|
||||
|
||||
**4. Cache Invalidation**
|
||||
```python
# Clear cache after rollback
redis_client.flushdb()  # Clear all keys in the current database
# Or selectively: DEL does not expand glob patterns, so iterate with SCAN
for key in redis_client.scan_iter(match="user:*"):
    redis_client.delete(key)  # Clear user cache only
```
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Error Type | Primary Fix Pattern | Testing Strategy |
|
||||
|------------|-------------------|------------------|
|
||||
| **Null/Undefined** | Null check, optional chaining, raise exception | Unit test with None input |
|
||||
| **Type Mismatch** | Pydantic validation, type guards | Unit test with wrong types |
|
||||
| **Database** | Try/except with rollback, retries | Integration test with DB |
|
||||
| **API/Integration** | Validation at boundary, retries | Mock API responses |
|
||||
| **Performance** | Caching, query optimization | Performance benchmark test |
|
||||
|
||||
| Fix Type | When to Use | Risk Level |
|
||||
|----------|-------------|------------|
|
||||
| **Quick Fix** | Production incident | Low (isolated change) |
|
||||
| **Proper Fix** | Root cause resolution | Medium (broader changes) |
|
||||
| **Comprehensive Fix** | Prevention of entire class | Medium-High (architectural) |
|
||||
|
||||
---
|
||||
|
||||
**Usage**: When implementing fix, generate 2-3 options with trade-offs, select best option based on priority, validate with tests, deploy with gradual rollout, monitor closely, document rollback procedure.
|
||||
466
skills/smart-debugging/reference/rca-methodology.md
Normal file
@@ -0,0 +1,466 @@
|
||||
# Root Cause Analysis (RCA) Methodology
|
||||
|
||||
Comprehensive guide to performing effective root cause analysis for software bugs and incidents.
|
||||
|
||||
## What is Root Cause Analysis?
|
||||
|
||||
**Definition**: A systematic process for identifying the fundamental reason a problem occurred, rather than only treating its symptoms.
|
||||
|
||||
**Goal**: Find the root cause(s) to implement prevention strategies that stop recurrence.
|
||||
|
||||
**Key Principle**: Distinguish between:
|
||||
- **Symptom**: Observable error (e.g., "API returns 500 error")
|
||||
- **Proximate Cause**: Immediate trigger (e.g., "Database query timeout")
|
||||
- **Root Cause**: Fundamental reason (e.g., "Missing database index on frequently queried column")
|
||||
|
||||
## The 5 Whys Technique
|
||||
|
||||
### Overview
|
||||
|
||||
**Method**: Ask "Why?" repeatedly (typically about five times) to drill down from symptom to root cause.
|
||||
|
||||
**Origin**: Toyota Production System (Lean Manufacturing)
|
||||
|
||||
**Best For**: Sequential cause-effect chains
|
||||
|
||||
### Example: Null Pointer Error
|
||||
|
||||
```
|
||||
Problem: API endpoint returns 500 error
|
||||
|
||||
Why? → User object is null when accessing .name property
|
||||
Why? → Database query returned null instead of user
|
||||
Why? → User ID doesn't exist in database
|
||||
Why? → Frontend sent incorrect user ID from stale cache
|
||||
Why? → Cache invalidation not triggered after user deletion
|
||||
ROOT CAUSE: Missing cache invalidation on user deletion
|
||||
```
|
||||
|
||||
### Rules for Effective 5 Whys
|
||||
|
||||
1. **Be Specific**: "Database slow" → "Query takes 4.5s (target: <200ms)"
|
||||
2. **Use Data**: Support each "Why" with evidence (logs, metrics, traces)
|
||||
3. **Don't Stop Too Early**: Keep asking until you reach a process/policy root cause
|
||||
4. **Don't Blame People**: Focus on processes, not individuals
|
||||
5. **May Need More/Fewer Than 5**: Stop when you reach actionable root cause
|
||||
|
||||
### Template
|
||||
|
||||
```markdown
|
||||
**Problem Statement**: [Observable symptom with impact]
|
||||
|
||||
**Why #1**: [First level cause]
|
||||
**Evidence**: [Logs, metrics, traces]
|
||||
|
||||
**Why #2**: [Deeper cause]
|
||||
**Evidence**: [Supporting data]
|
||||
|
||||
**Why #3**: [Even deeper]
|
||||
**Evidence**: [Supporting data]
|
||||
|
||||
**Why #4**: [Near root cause]
|
||||
**Evidence**: [Supporting data]
|
||||
|
||||
**Why #5**: [Root cause]
|
||||
**Evidence**: [Supporting data]
|
||||
|
||||
**Root Cause**: [Fundamental reason]
|
||||
**Prevention**: [How to prevent recurrence]
|
||||
```
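
The template above can also be filled in programmatically, for example when collecting evidence from logs. A sketch under stated assumptions: the `Why` dataclass and `render_five_whys` function are hypothetical, the evidence strings are placeholders, and the last entry in the chain is treated as the root cause.

```python
from dataclasses import dataclass


@dataclass
class Why:
    cause: str
    evidence: str


def render_five_whys(problem, whys, prevention):
    """Render a why-chain in the template's layout; the last cause is the root cause."""
    if not whys:
        raise ValueError("At least one why is required")
    lines = [f"**Problem Statement**: {problem}", ""]
    for number, why in enumerate(whys, start=1):
        lines += [f"**Why #{number}**: {why.cause}", f"**Evidence**: {why.evidence}", ""]
    lines += [f"**Root Cause**: {whys[-1].cause}", f"**Prevention**: {prevention}"]
    return "\n".join(lines)


# Shortened chain based on the null pointer example; evidence values are placeholders.
report = render_five_whys(
    "API endpoint returns 500 error",
    [
        Why("User object is null when accessing .name property", "stack trace"),
        Why("Cache invalidation not triggered after user deletion", "cache logs"),
    ],
    "Invalidate cache on user deletion",
)
```

Requiring an `evidence` value for every step enforces the "Use Data" rule at the point where the chain is written down.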
|
||||
|
||||
## Fishbone (Ishikawa) Diagram
|
||||
|
||||
### Overview
|
||||
|
||||
**Method**: Visual diagram categorizing potential causes into major categories.
|
||||
|
||||
**Best For**: Complex problems with multiple contributing factors
|
||||
|
||||
**Categories (Software Context)**:
|
||||
- **Code**: Logic errors, missing validation, edge cases
|
||||
- **Data**: Invalid inputs, corrupt data, missing records
|
||||
- **Infrastructure**: Server issues, network problems, resource limits
|
||||
- **Dependencies**: Third-party APIs, libraries, services
|
||||
- **Process**: Deployment issues, configuration errors, environment mismatches
|
||||
- **People**: Knowledge gaps, communication failures, assumptions
|
||||
|
||||
### Example Structure
|
||||
|
||||
```
|
||||
                 Code
                  |
          Missing validation
                 /
                /
API 500 ─────────────── Infrastructure
Error           \
                 \
          Database timeout
                  |
                 Data
|
||||
```
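
When a drawn diagram is impractical (for example in a ticket or chat thread), the same categorization can be kept as a text outline. This is a rough sketch assuming the six software categories listed above; the function name `build_fishbone` and the outline layout are illustrative choices, not part of the Ishikawa method itself.

```python
from collections import defaultdict

CATEGORIES = ("Code", "Data", "Infrastructure", "Dependencies", "Process", "People")


def build_fishbone(problem: str, causes: list[tuple[str, str]]) -> str:
    """Group (category, cause) pairs under the problem as an indented outline."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for category, cause in causes:
        if category not in CATEGORIES:
            raise ValueError(f"Unknown category: {category}")
        grouped[category].append(cause)
    lines = [problem]
    for category in CATEGORIES:  # fixed order keeps outlines comparable
        if grouped[category]:
            lines.append(f"  {category}")
            lines += [f"    - {cause}" for cause in grouped[category]]
    return "\n".join(lines)


outline = build_fishbone(
    "API 500 Error",
    [("Code", "Missing validation"), ("Infrastructure", "Database timeout")],
)
print(outline)
```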
|
||||
|
||||
### When to Use
|
||||
|
||||
- Multiple potential root causes
|
||||
- Need stakeholder alignment on cause
|
||||
- Complex systems with many dependencies
|
||||
- Post-incident reviews with team
|
||||
|
||||
## Timeline Analysis
|
||||
|
||||
### Overview
|
||||
|
||||
**Method**: Create chronological sequence of events leading to incident.
|
||||
|
||||
**Best For**: Understanding cascading failures, race conditions, timing issues
|
||||
|
||||
### Example
|
||||
|
||||
```markdown
|
||||
## Timeline: User Profile Page Crash
|
||||
|
||||
**T-00:05** - User updates profile information
|
||||
**T-00:03** - Profile update succeeds, cache invalidation triggered
|
||||
**T-00:02** - Cache clear initiated but takes 3s (network latency)
|
||||
**T-00:00** - User refreshes page, cache still has old data
|
||||
**T+00:01** - API fetches user from cache (stale)
|
||||
**T+00:02** - Frontend renders with field that was deleted in update
|
||||
**T+00:03** - JavaScript error: Cannot read property 'X' of undefined
|
||||
**T+00:04** - Error boundary catches error, shows crash page
|
||||
|
||||
**Root Cause**: Cache invalidation is async and completes after page reload, causing stale data rendering.
|
||||
```
|
||||
|
||||
### Components
|
||||
|
||||
| Element | Description | Example |
|
||||
|---------|-------------|---------|
|
||||
| **Timestamp** | Relative or absolute time | `T-00:05` or `14:32:15 UTC` |
|
||||
| **Event** | What happened | `User clicked submit` |
|
||||
| **System State** | Relevant state at time | `Cache: stale, DB: updated` |
|
||||
| **Decision Point** | Branch in event chain | `If cache miss: fetch DB` |
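
The components in the table can be assembled programmatically from log events. The sketch below assumes events already carry absolute timestamps and that one moment has been chosen as the incident reference point; `Event`, `format_offset`, and `build_timeline` are hypothetical names, and the sample events are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    description: str
    system_state: str = ""


def format_offset(delta: timedelta) -> str:
    """Render an offset from the incident as T-MM:SS or T+MM:SS."""
    total = int(delta.total_seconds())
    sign = "-" if total < 0 else "+"
    minutes, seconds = divmod(abs(total), 60)
    return f"T{sign}{minutes:02d}:{seconds:02d}"


def build_timeline(events: list[Event], incident_at: datetime) -> list[str]:
    """Sort events chronologically and label each relative to the incident."""
    rows = []
    for event in sorted(events, key=lambda e: e.timestamp):
        label = format_offset(event.timestamp - incident_at)
        state = f" [{event.system_state}]" if event.system_state else ""
        rows.append(f"**{label}** - {event.description}{state}")
    return rows


incident = datetime(2024, 1, 1, 14, 32, 15)
events = [
    Event(incident + timedelta(seconds=1), "API fetches user from cache",
          "Cache: stale, DB: updated"),
    Event(incident - timedelta(seconds=5), "User updates profile information"),
]
for row in build_timeline(events, incident):
    print(row)
```

Sorting by timestamp rather than by arrival order matters when logs from several services are merged, since they rarely arrive in sequence.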
|
||||
|
||||
## Distinguishing Root Cause from Symptoms
|
||||
|
||||
### Symptom vs. Root Cause
|
||||
|
||||
| Symptom | Root Cause |
|
||||
|---------|------------|
|
||||
| API returns 500 error | Missing error handling for null user |
|
||||
| Database query slow | Missing index on `user_id` column |
|
||||
| Memory leak in production | Circular reference in event listeners |
|
||||
| User can't log in | Session cookie expires after 5 minutes (should be 24h) |
|
||||
|
||||
### Test: The Prevention Question
|
||||
|
||||
**Ask**: "If I fix this, will the problem never happen again?"
|
||||
|
||||
**If Yes** → Likely root cause
|
||||
**If No** → Still a symptom or contributing factor
|
||||
|
||||
**Example**:
|
||||
- "Add null check" → Prevents this specific null error, but why was data null? (Symptom fix)
|
||||
- "Add database foreign key constraint" → Prevents any invalid user_id from being stored (Root cause fix)
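
To make the foreign key example concrete, here is a small sketch using Python's built-in `sqlite3` module. The `users` and `orders` tables and the `try_insert_order` helper are assumptions made for the illustration; the point is only that the database itself rejects an invalid `user_id`, regardless of what the application code checks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on for the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL REFERENCES users(id)"
    ")"
)
conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")


def try_insert_order(user_id: int) -> bool:
    """Return True if the order was stored, False if the constraint rejected it."""
    try:
        conn.execute("INSERT INTO orders (user_id) VALUES (?)", (user_id,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert_order(1))    # existing user
print(try_insert_order(999))  # no such user
```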
|
||||
|
||||
## Contributing Factors vs. Root Cause
|
||||
|
||||
### Multiple Contributing Factors
|
||||
|
||||
Complex incidents often have multiple contributing factors and one primary root cause.
|
||||
|
||||
**Example: Data Loss Incident**
|
||||
|
||||
```markdown
|
||||
**Primary Root Cause**: Database backup script fails silently (no monitoring)
|
||||
|
||||
**Contributing Factors**:
|
||||
1. No backup validation process
|
||||
2. Backup monitoring disabled in production
|
||||
3. Backup script lacks error logging
|
||||
4. No runbook for backup verification
|
||||
5. Manual backup never tested
|
||||
|
||||
**Analysis**: All factors contributed, but root cause is silent failure. Fix that first.
|
||||
```
|
||||
|
||||
### Prioritization
|
||||
|
||||
| Priority | Factor Type | Action |
|
||||
|----------|-------------|--------|
|
||||
| **P0** | Root cause | Fix immediately |
|
||||
| **P1** | Major contributor | Fix in same release |
|
||||
| **P2** | Minor contributor | Fix in next sprint |
|
||||
| **P3** | Edge case | Backlog |
|
||||
|
||||
## RCA Documentation Format
|
||||
|
||||
### Standard Structure
|
||||
|
||||
```markdown
|
||||
# Root Cause Analysis: [Title]
|
||||
|
||||
**Date**: YYYY-MM-DD
|
||||
**Incident ID**: INC-12345
|
||||
**Severity**: [SEV1/SEV2/SEV3]
|
||||
**Participants**: [Names of people involved in RCA]
|
||||
|
||||
## Summary
|
||||
|
||||
[2-3 sentence overview of incident and root cause]
|
||||
|
||||
## Impact
|
||||
|
||||
- **Users Affected**: [Number or %]
|
||||
- **Duration**: [Time from start to resolution]
|
||||
- **Business Impact**: [Revenue loss, SLA breach, etc.]
|
||||
|
||||
## Timeline
|
||||
|
||||
[Chronological sequence of events]
|
||||
|
||||
## Root Cause
|
||||
|
||||
[Detailed explanation of fundamental cause]
|
||||
|
||||
### 5 Whys Analysis
|
||||
|
||||
[Step-by-step "Why?" chain]
|
||||
|
||||
## Contributing Factors
|
||||
|
||||
[List of factors that enabled or worsened the incident]
|
||||
|
||||
## Prevention
|
||||
|
||||
### Immediate Actions (Within 24h)
|
||||
- [ ] Action 1
|
||||
- [ ] Action 2
|
||||
|
||||
### Short-term Actions (Within 1 week)
|
||||
- [ ] Action 1
|
||||
- [ ] Action 2
|
||||
|
||||
### Long-term Actions (Within 1 month)
|
||||
- [ ] Action 1
|
||||
- [ ] Action 2
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
[Key takeaways and process improvements]
|
||||
```
|
||||
|
||||
## Prevention Strategy Development
|
||||
|
||||
### Fix Categories
|
||||
|
||||
| Category | Description | Example |
|
||||
|----------|-------------|---------|
|
||||
| **Technical** | Code, config, infrastructure changes | Add database index, implement retry logic |
|
||||
| **Process** | Changes to how work is done | Require code review for DB changes |
|
||||
| **Monitoring** | Detect issues before they cause incidents | Alert on slow query thresholds |
|
||||
| **Testing** | Catch issues before production | Add integration test for edge case |
|
||||
| **Documentation** | Improve knowledge sharing | Document backup restoration procedure |
|
||||
|
||||
### Prevention Checklist
|
||||
|
||||
```markdown
|
||||
**Can we prevent the root cause?**
|
||||
- [ ] Technical fix implemented
|
||||
- [ ] Tests added to catch recurrence
|
||||
- [ ] Monitoring added to detect early
|
||||
|
||||
**Can we detect it faster?**
|
||||
- [ ] Alerts configured
|
||||
- [ ] Logging improved
|
||||
- [ ] Dashboards updated
|
||||
|
||||
**Can we mitigate impact?**
|
||||
- [ ] Graceful degradation added
|
||||
- [ ] Circuit breaker implemented
|
||||
- [ ] Fallback logic added
|
||||
|
||||
**Can we recover faster?**
|
||||
- [ ] Runbook created
|
||||
- [ ] Automation added
|
||||
- [ ] Team trained
|
||||
|
||||
**Can we prevent similar issues?**
|
||||
- [ ] Pattern identified
|
||||
- [ ] Linting rule added
|
||||
- [ ] Architecture review scheduled
|
||||
```
|
||||
|
||||
## Common RCA Pitfalls
|
||||
|
||||
### Pitfall 1: Stopping Too Early
|
||||
|
||||
**Bad**:
|
||||
```
|
||||
Why? → User got 500 error
|
||||
Root Cause: Server returned error
|
||||
|
||||
Prevention: Fix server error
|
||||
```
|
||||
|
||||
**Good**:
|
||||
```
|
||||
Why? → User got 500 error
|
||||
Why? → Server threw unhandled exception
|
||||
Why? → Null pointer accessing user.email
|
||||
Why? → User object was null
|
||||
Why? → Database returned no user
|
||||
Why? → User ID didn't exist
|
||||
Why? → Frontend sent deleted user's ID
|
||||
Why? → Frontend cache not invalidated after deletion
|
||||
Root Cause: Missing cache invalidation on user deletion
|
||||
|
||||
Prevention: Trigger cache clear on user deletion, add cache TTL as safety
|
||||
```
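
The prevention named in the good example (invalidate on deletion, with a TTL as a safety net) might look roughly like this. It is a sketch under simplifying assumptions: an in-memory dictionary stands in for both the cache and the database, and `UserCache` and `delete_user` are hypothetical names rather than any particular framework's API.

```python
import time
from typing import Callable, Optional


class UserCache:
    """Tiny in-memory cache with a TTL as a safety net (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[int, tuple[float, dict]] = {}

    def put(self, user_id: int, user: dict) -> None:
        self._entries[user_id] = (self._clock() + self._ttl, user)

    def get(self, user_id: int) -> Optional[dict]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        expires_at, user = entry
        if self._clock() >= expires_at:
            # Safety net: expired entries are dropped even if nobody invalidated them.
            del self._entries[user_id]
            return None
        return user

    def invalidate(self, user_id: int) -> None:
        self._entries.pop(user_id, None)


def delete_user(user_id: int, db: dict, cache: UserCache) -> None:
    """Delete from the store and invalidate the cache in the same code path."""
    db.pop(user_id, None)
    cache.invalidate(user_id)  # the step that was missing in the incident
```

Passing the clock in as a parameter is a design choice that lets a test advance time explicitly instead of sleeping.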
|
||||
|
||||
### Pitfall 2: Blaming People
|
||||
|
||||
**Bad**:
|
||||
```
|
||||
Root Cause: Developer forgot to add validation
|
||||
Prevention: Tell developer to remember next time
|
||||
```
|
||||
|
||||
**Good**:
|
||||
```
|
||||
Root Cause: No validation enforced at API boundary
|
||||
Prevention:
|
||||
- Use Pydantic for automatic validation
|
||||
- Add linting rule to detect missing validation
|
||||
- Update code review checklist
|
||||
```
|
||||
|
||||
### Pitfall 3: Accepting "Human Error" as Root Cause
|
||||
|
||||
**Bad**:
|
||||
```
|
||||
Root Cause: Admin accidentally deleted production database
|
||||
Prevention: Be more careful
|
||||
```
|
||||
|
||||
**Good**:
|
||||
```
|
||||
Root Cause: Production database lacks deletion protection
|
||||
Prevention:
|
||||
- Enable RDS deletion protection
|
||||
- Require MFA for production access
|
||||
- Implement soft-delete instead of hard-delete
|
||||
- Add an "Are you sure?" prompt that requires typing the resource name to confirm
|
||||
```
|
||||
|
||||
### Pitfall 4: Multiple Root Causes Without Prioritization
|
||||
|
||||
**Bad**:
|
||||
```
|
||||
Root Causes:
|
||||
1. Missing error handling
|
||||
2. No monitoring
|
||||
3. Bad documentation
|
||||
4. Insufficient testing
|
||||
5. Poor communication
|
||||
|
||||
Prevention: Fix all of them
|
||||
```
|
||||
|
||||
**Good**:
|
||||
```
|
||||
Primary Root Cause: Missing error handling (caused immediate incident)
|
||||
|
||||
Contributing Factors:
|
||||
- No monitoring (delayed detection)
|
||||
- Insufficient testing (didn't catch before deployment)
|
||||
|
||||
Prevention Priority:
|
||||
1. Add error handling (prevents recurrence) - P0
|
||||
2. Add monitoring (faster detection) - P1
|
||||
3. Add tests (catch in CI) - P1
|
||||
4. Improve docs (better response) - P2
|
||||
```
|
||||
|
||||
### Pitfall 5: Technical Fix Without Process Improvement
|
||||
|
||||
**Bad**:
|
||||
```
|
||||
Root Cause: Missing database index
|
||||
Prevention: Add index
|
||||
```
|
||||
|
||||
**Good**:
|
||||
```
|
||||
Root Cause: Missing database index causing slow queries
|
||||
Prevention:
|
||||
- Technical: Add index on user_id column
|
||||
- Process: Require query performance review in code review
|
||||
- Monitoring: Alert on queries >200ms
|
||||
- Testing: Add performance test asserting query latency and index usage
|
||||
```
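
A test for the "Testing" item could check the query plan rather than wall-clock time, which tends to be flaky in CI. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through the standard `sqlite3` module; the `orders` table, the index name `idx_orders_user_id`, and the `query_plan` helper are assumptions for illustration, and other databases expose plans differently.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}", params).fetchall()
    # The fourth column of each row holds the human-readable plan detail.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")

lookup = "SELECT * FROM orders WHERE user_id = ?"
before = query_plan(conn, lookup, (42,))  # no index yet: full table scan

conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
after = query_plan(conn, lookup, (42,))   # plan should now mention the index

print("before:", before)
print("after: ", after)
```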
|
||||
|
||||
## RCA Review and Validation
|
||||
|
||||
### Review Checklist
|
||||
|
||||
```markdown
|
||||
- [ ] Root cause clearly identified and evidence-based
|
||||
- [ ] Timeline accurate and complete
|
||||
- [ ] All contributing factors documented
|
||||
- [ ] Prevention strategies are actionable
|
||||
- [ ] Prevention strategies assigned owners and due dates
|
||||
- [ ] Lessons learned documented
|
||||
- [ ] Incident review meeting scheduled
|
||||
- [ ] RCA shared with relevant teams
|
||||
```
|
||||
|
||||
### Validation Questions
|
||||
|
||||
1. **Completeness**: Does the RCA explain all observed symptoms?
|
||||
2. **Preventability**: Will the proposed fixes prevent recurrence?
|
||||
3. **Testability**: Can we verify the fixes work?
|
||||
4. **Generalizability**: Are there similar issues we should address?
|
||||
5. **Sustainability**: Will fixes remain effective long-term?
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Do's
|
||||
|
||||
✅ **Start RCA immediately** after incident resolution
|
||||
✅ **Involve multiple people** for diverse perspectives
|
||||
✅ **Use data** to support each "Why" answer
|
||||
✅ **Focus on processes**, not people
|
||||
✅ **Document everything** even if it seems obvious
|
||||
✅ **Assign owners** to all prevention actions
|
||||
✅ **Set deadlines** for prevention implementation
|
||||
✅ **Follow up** to ensure actions completed
|
||||
|
||||
### Don'ts
|
||||
|
||||
❌ **Don't rush** - Thorough RCA takes time
|
||||
❌ **Don't blame** - Focus on systemic issues
|
||||
❌ **Don't accept vague answers** - "System was slow" → "Query took 4.5s"
|
||||
❌ **Don't stop at technical fixes** - Address process and monitoring too
|
||||
❌ **Don't skip documentation** - Future incidents benefit from past RCAs
|
||||
❌ **Don't forget to close the loop** - Verify prevention actions worked
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Technique | Best For | Output |
|
||||
|-----------|----------|--------|
|
||||
| **5 Whys** | Sequential cause-effect chains | Linear cause chain → root cause |
|
||||
| **Fishbone** | Multiple potential causes | Categorized causes diagram |
|
||||
| **Timeline** | Cascading failures, timing issues | Chronological event sequence |
|
||||
|
||||
| Root Cause Type | Fix Strategy |
|
||||
|-----------------|--------------|
|
||||
| **Missing validation** | Add validation at boundary + tests |
|
||||
| **Missing error handling** | Add try/catch + logging + monitoring |
|
||||
| **Performance issue** | Optimize + add performance test + alert |
|
||||
| **Configuration error** | Fix config + add validation + documentation |
|
||||
| **Process gap** | Update process + add checklist + training |
|
||||
|
||||
---
|
||||
|
||||
**Usage**: When debugging is complete, perform RCA to understand why the bug existed and how to prevent similar issues. Use 5 Whys for most cases, Fishbone for complex multi-factor incidents, Timeline for cascading failures.
|
||||
414
skills/smart-debugging/reference/stack-trace-patterns.md
Normal file
@@ -0,0 +1,414 @@
|
||||
# Stack Trace Patterns
|
||||
|
||||
Comprehensive guide to reading, analyzing, and extracting insights from stack traces across different languages and environments.
|
||||
|
||||
## Python Stack Traces
|
||||
|
||||
### Anatomy
|
||||
|
||||
```python
|
||||
Traceback (most recent call last):
|
||||
File "/app/api/users.py", line 45, in get_user
|
||||
user = db.query(User).filter(User.id == user_id).one()
|
||||
File "/venv/lib/sqlalchemy/orm/query.py", line 2890, in one
|
||||
raise NoResultFound("No row was found")
|
||||
sqlalchemy.orm.exc.NoResultFound: No row was found
|
||||
```
|
||||
|
||||
**Reading Order**: Bottom-up (start at the exception on the last line, then walk up through the frames toward your own code)
|
||||
|
||||
| Component | Description | Example |
|
||||
|-----------|-------------|---------|
|
||||
| **Exception Type** | The error class | `sqlalchemy.orm.exc.NoResultFound` |
|
||||
| **Exception Message** | Error description | `"No row was found"` |
|
||||
| **Root Frame** | Where error originated | `query.py:2890, in one` |
|
||||
| **Call Stack** | Function call chain | `users.py:45 → query.py:2890` |
|
||||
|
||||
### Identifying the Root Cause
|
||||
|
||||
```python
|
||||
# Stack trace from user code → library code
|
||||
Traceback (most recent call last):
|
||||
File "/app/api/orders.py", line 23, in create_order # ← Your code (start here!)
|
||||
payment = process_payment(order.total)
|
||||
File "/app/services/payment.py", line 67, in process_payment
|
||||
stripe.charge.create(amount=amount)
|
||||
File "/venv/lib/stripe/api.py", line 342, in create # ← Library code (ignore)
|
||||
raise InvalidRequestError("Amount must be positive")
|
||||
stripe.error.InvalidRequestError: Amount must be positive
|
||||
```
|
||||
|
||||
**Analysis**:
|
||||
1. Exception: `InvalidRequestError` - Amount validation failed
|
||||
2. Root frame in your code: `payment.py:67` - Calling Stripe with invalid amount
|
||||
3. Source of bad data: `orders.py:23` - Passing `order.total` (likely 0 or negative)
|
||||
|
||||
**Fix Location**: Check `order.total` validation in `orders.py:23`
|
||||
|
||||
### Filtering Noise
|
||||
|
||||
**Focus on**:
|
||||
- Files in your project directory (`/app/*`)
|
||||
- First occurrence of error in your code
|
||||
|
||||
**Ignore**:
|
||||
- Virtual environment files (`/venv/*`, `site-packages/*`)
|
||||
- Standard library (`/usr/lib/python3.*/`)
|
||||
- Framework internals (unless debugging framework)
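
The focus/ignore rules can be applied automatically with the standard `traceback` module. This is a sketch assuming the project lives under `/app` and that the listed path fragments identify third-party and standard-library code; `NOISE_MARKERS`, `is_project_frame`, and `project_frames` are hypothetical names and the markers should be adapted to the actual layout.

```python
import traceback
from pathlib import PurePosixPath

# Assumed path fragments for third-party and stdlib code; adjust for your layout.
NOISE_MARKERS = ("/venv/", "/site-packages/", "/usr/lib/python3")


def is_project_frame(filename: str, project_root: str = "/app") -> bool:
    """True for files under the project root that are not vendored dependencies."""
    path = PurePosixPath(filename)
    in_project = PurePosixPath(project_root) in path.parents
    noisy = any(marker in filename for marker in NOISE_MARKERS)
    return in_project and not noisy


def project_frames(exc: BaseException, project_root: str = "/app") -> list[str]:
    """Summarize only the project frames of an exception, oldest call first."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [
        f"{frame.filename}:{frame.lineno} in {frame.name}"
        for frame in frames
        if is_project_frame(frame.filename, project_root)
    ]
```

Calling `project_frames(exc)` inside an `except` block gives a shortened call chain that is easier to scan than the full traceback.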
|
||||
|
||||
### Async Stack Traces
|
||||
|
||||
```python
|
||||
Traceback (most recent call last):
|
||||
File "/app/api/users.py", line 23, in get_user_profile
|
||||
user = await fetch_user(user_id)
|
||||
File "/app/services/users.py", line 45, in fetch_user
|
||||
data = await http_client.get(f"/users/{user_id}")
|
||||
File "/venv/lib/httpx/_client.py", line 1234, in get
|
||||
raise ConnectTimeout()
|
||||
httpx.ConnectTimeout: Connection timed out
|
||||
```
|
||||
|
||||
**Key Indicators**:
|
||||
- `await` in frame descriptions
|
||||
- Async function names (`async def`)
|
||||
- Coroutine references
|
||||
|
||||
**Analysis**: Trace async call chain: `get_user_profile` → `fetch_user` → `httpx.get` → timeout
|
||||
|
||||
## JavaScript/TypeScript Stack Traces
|
||||
|
||||
### Node.js Format
|
||||
|
||||
```javascript
|
||||
Error: User not found
|
||||
at UserService.findById (/app/services/user.service.ts:42:11)
|
||||
at async getUserProfile (/app/api/users.controller.ts:23:18)
|
||||
at async /app/middleware/auth.ts:67:5
|
||||
at async handleRequest (/app/server/request-handler.ts:15:3)
|
||||
```
|
||||
|
||||
**Reading Order**: Top-down (error → call chain)
|
||||
|
||||
| Component | Description | Example |
|
||||
|-----------|-------------|---------|
|
||||
| **Error Type** | Error class | `Error` |
|
||||
| **Error Message** | Description | `"User not found"` |
|
||||
| **Root Frame** | Where thrown | `user.service.ts:42:11` |
|
||||
| **Call Stack** | Caller chain | `getUserProfile` → middleware → request handler |
|
||||
|
||||
### Browser Stack Traces
|
||||
|
||||
```javascript
|
||||
Uncaught TypeError: Cannot read property 'name' of undefined
|
||||
at UserProfile.render (UserProfile.tsx:15:32)
|
||||
at finishClassComponent (react-dom.production.min.js:123:45)
|
||||
at updateClassComponent (react-dom.production.min.js:456:12)
|
||||
```
|
||||
|
||||
**Analysis**:
|
||||
- Error: Accessing `.name` on undefined object
|
||||
- Root: `UserProfile.tsx:15:32` (your component)
|
||||
- Framework: React rendering internals (ignore)
|
||||
|
||||
**Fix**: Add null check in `UserProfile.tsx:15`
|
||||
|
||||
### Minified Stack Traces
|
||||
|
||||
```javascript
|
||||
TypeError: Cannot read property 'name' of undefined
|
||||
at t.render (main.a3b4c5d6.js:1:23456)
|
||||
at u (2.chunk.js:4:567)
|
||||
```
|
||||
|
||||
**Problem**: Minified code is unreadable (`t`, `u`, cryptic filenames)
|
||||
|
||||
**Solution**: Use source maps
|
||||
|
||||
```javascript
|
||||
// With source map
|
||||
TypeError: Cannot read property 'name' of undefined
|
||||
at UserProfile.render (src/components/UserProfile.tsx:15:32)
|
||||
at ReactComponent.update (src/lib/react.ts:45:12)
|
||||
```
|
||||
|
||||
**How**: Ensure source maps are available:
|
||||
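- Development: Usually enabled by default in the bundler's development mode
- Production: Generate `.map` files at build time and either serve them or upload them to your error tracker
- Error tracking: Sentry and Bugsnag apply uploaded source maps to incoming stack traces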
|
||||
|
||||
## Java Stack Traces
|
||||
|
||||
### Format
|
||||
|
||||
```java
|
||||
java.lang.NullPointerException: Cannot invoke "User.getName()" because "user" is null
|
||||
at com.example.UserService.getFullName(UserService.java:42)
|
||||
at com.example.UserController.getUserProfile(UserController.java:23)
|
||||
at org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:219)
|
||||
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:142)
|
||||
```
|
||||
|
||||
**Reading Order**: Top-down
|
||||
|
||||
**Analysis**:
|
||||
- Exception: `NullPointerException` with helpful message (introduced in Java 14, enabled by default since Java 15)
|
||||
- Root: `UserService.java:42` calling `.getName()` on null
|
||||
- Caller: `UserController.java:23`
|
||||
- Framework: Spring MVC (ignore)
|
||||
|
||||
**Fix**: Add null check at `UserService.java:42`
|
||||
|
||||
## FastAPI/Pydantic Stack Traces
|
||||
|
||||
### Validation Error
|
||||
|
||||
```python
|
||||
pydantic.error_wrappers.ValidationError: 2 validation errors for UserCreate
|
||||
email
|
||||
field required (type=value_error.missing)
|
||||
age
|
||||
ensure this value is greater than 0 (type=value_error.number.not_gt; limit_value=0)
|
||||
```
|
||||
|
||||
**Analysis**:
|
||||
- Not a traditional stack trace but a validation error report (Pydantic v1 format shown; v2 words the messages differently)
|
||||
- Lists all validation failures
|
||||
- Each error shows: field, message, type
|
||||
|
||||
**Fix**: Client must send valid `email` (required) and `age > 0`
|
||||
|
||||
### FastAPI Exception
|
||||
|
||||
```python
|
||||
Traceback (most recent call last):
|
||||
File "/app/api/endpoints/users.py", line 45, in create_user
|
||||
db_user = await crud.user.create(user_in)
|
||||
File "/app/crud/user.py", line 23, in create
|
||||
db.add(db_obj)
|
||||
File "/venv/lib/sqlalchemy/orm/session.py", line 2345, in add
|
||||
raise IntegrityError("UNIQUE constraint failed: users.email")
|
||||
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: users.email
|
||||
```
|
||||
|
||||
**Analysis**:
|
||||
- Database constraint violation
|
||||
- Root in your code: `crud/user.py:23` trying to insert duplicate email
|
||||
- Caller: `api/endpoints/users.py:45`
|
||||
|
||||
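**Fix**: Check whether the user exists before inserting, or use an upsert. Note that a real SQLAlchemy session raises `IntegrityError` at flush/commit time rather than in `add()` itself, so keep a handler around the commit as well.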
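
A rough sketch of the existence-check approach, written against the standard `sqlite3` module rather than SQLAlchemy so that it runs without dependencies. `DuplicateEmailError` and `create_user` are hypothetical names; in a FastAPI endpoint the custom error would typically be mapped to a 409 response.

```python
import sqlite3


class DuplicateEmailError(Exception):
    """Raised when the email is already registered (hypothetical app-level error)."""


def create_user(conn: sqlite3.Connection, email: str) -> int:
    # Fast path: a friendly check before attempting the insert.
    exists = conn.execute("SELECT 1 FROM users WHERE email = ?", (email,)).fetchone()
    if exists:
        raise DuplicateEmailError(email)
    try:
        cursor = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    except sqlite3.IntegrityError as exc:
        # A concurrent request can still win the race between check and insert,
        # so the UNIQUE constraint remains the source of truth.
        raise DuplicateEmailError(email) from exc
    conn.commit()
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL)")
```

The check alone is not sufficient under concurrency, which is why the sketch keeps both the check and the exception handler.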
|
||||
|
||||
## Cloudflare Workers Stack Traces
|
||||
|
||||
### Format
|
||||
|
||||
```javascript
|
||||
Error: Failed to fetch user data
|
||||
at fetchUser (worker.js:45:11)
|
||||
at handleRequest (worker.js:23:18)
|
||||
```
|
||||
|
||||
**Characteristics**:
|
||||
- Minimal stack (no Node.js internals)
|
||||
- V8 isolate execution context
|
||||
- Limited to worker code only
|
||||
|
||||
**Edge Cases**:
|
||||
```javascript
|
||||
Uncaught (in promise) TypeError: response.json is not a function
|
||||
```
|
||||
- Common: Missing `await` on `fetch()`, so `response` is still a pending Promise rather than a `Response`
- Fix: `const response = await fetch(url)` first, then `await response.json()`
|
||||
|
||||
## Pattern Recognition
|
||||
|
||||
### Null/Undefined Access Patterns
|
||||
|
||||
**Python**:
|
||||
```
|
||||
AttributeError: 'NoneType' object has no attribute 'X'
|
||||
TypeError: 'NoneType' object is not subscriptable
|
||||
```
|
||||
|
||||
**JavaScript**:
|
||||
```
|
||||
TypeError: Cannot read property 'X' of null
|
||||
TypeError: Cannot read property 'X' of undefined
TypeError: Cannot read properties of undefined (reading 'X')
|
||||
```
|
||||
|
||||
**Java**:
|
||||
```
|
||||
java.lang.NullPointerException: Cannot invoke "X" because "Y" is null
|
||||
```
|
||||
|
||||
### Type Mismatch Patterns
|
||||
|
||||
**Python**:
|
||||
```
|
||||
TypeError: unsupported operand type(s) for +: 'int' and 'str'
|
||||
TypeError: 'X' object is not callable
|
||||
```
|
||||
|
||||
**JavaScript**:
|
||||
```
|
||||
TypeError: X is not a function
|
||||
TypeError: Cannot convert undefined or null to object
|
||||
```
|
||||
|
||||
### Import/Module Patterns
|
||||
|
||||
**Python**:
|
||||
```
|
||||
ModuleNotFoundError: No module named 'X'
|
||||
ImportError: cannot import name 'X' from 'Y'
|
||||
```
|
||||
|
||||
**JavaScript**:
|
||||
```
|
||||
Error: Cannot find module 'X'
|
||||
SyntaxError: Unexpected token 'export'
|
||||
```
|
||||
|
||||
### Database Patterns
|
||||
|
||||
**SQLAlchemy**:
|
||||
```
|
||||
sqlalchemy.orm.exc.NoResultFound
|
||||
sqlalchemy.exc.IntegrityError: UNIQUE constraint failed
|
||||
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection refused
|
||||
```
|
||||
|
||||
**Drizzle ORM**:
|
||||
```
|
||||
DrizzleError: Unique constraint failed on column: email
|
||||
DrizzleError: Connection to database server failed
|
||||
```
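
The patterns in this section lend themselves to simple automated triage. The sketch below maps an error line to one of the categories above using regular expressions; the category labels, the `PATTERNS` list, and the `classify` function are illustrative assumptions, and the patterns cover only the messages listed here.

```python
import re

# Ordered list: the first matching pattern wins. Categories mirror the headings above.
PATTERNS = [
    ("null_access", re.compile(
        r"'NoneType' object"
        r"|Cannot read propert(y|ies) .*of (null|undefined)"
        r"|NullPointerException")),
    ("import", re.compile(r"ModuleNotFoundError|ImportError|Cannot find module")),
    ("database", re.compile(r"NoResultFound|IntegrityError|OperationalError")),
    ("type_mismatch", re.compile(
        r"unsupported operand type"
        r"|is not (a function|callable)"
        r"|Cannot convert undefined or null")),
]


def classify(error_line: str) -> str:
    """Return the first matching category, or 'unknown' if nothing matches."""
    for category, pattern in PATTERNS:
        if pattern.search(error_line):
            return category
    return "unknown"


print(classify("AttributeError: 'NoneType' object has no attribute 'email'"))
print(classify("sqlalchemy.exc.IntegrityError: UNIQUE constraint failed"))
```

Order matters: `'NoneType' object is not subscriptable` is a `TypeError` in Python, but it is listed under null access above, so that pattern is checked first.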
|
||||
|
||||
## Analysis Workflow
|
||||
|
||||
### 1. Identify Exception Type
|
||||
|
||||
```python
|
||||
# Example
|
||||
sqlalchemy.exc.IntegrityError: UNIQUE constraint failed: users.email
|
||||
# ↓
|
||||
# Type: IntegrityError (database constraint)
|
||||
# Subtype: UNIQUE (duplicate key)
|
||||
```
|
||||
|
||||
### 2. Locate Root Frame in Your Code
|
||||
|
||||
```python
|
||||
Traceback (most recent call last):
|
||||
File "/app/api/users.py", line 45, in create_user # ← Root frame
|
||||
db.add(user)
|
||||
File "/venv/lib/sqlalchemy/orm/session.py", line 2345, in add # ← Library
|
||||
raise IntegrityError()
|
||||
```
|
||||
|
||||
**Rule**: Pick the frame in your project directory closest to the exception, i.e. the last of your frames before library/framework code takes over
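
When only the traceback text is available (for example, pasted from a log), the root frame can be located by parsing the `File "...", line N, in func` lines. This is a minimal sketch assuming Python's standard traceback format and a project prefix of `/app/`; `FRAME_RE`, `Frame`, `parse_frames`, and `root_project_frame` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


class Frame(NamedTuple):
    path: str
    line: int
    func: str


def parse_frames(traceback_text: str) -> list[Frame]:
    """Extract frames in the order Python prints them (oldest call first)."""
    return [
        Frame(m["path"], int(m["line"]), m["func"])
        for m in FRAME_RE.finditer(traceback_text)
    ]


def root_project_frame(traceback_text: str, project_prefix: str = "/app/") -> Optional[Frame]:
    """Return the project frame closest to the exception, or None if there is none."""
    project = [f for f in parse_frames(traceback_text) if f.path.startswith(project_prefix)]
    return project[-1] if project else None


sample = '''Traceback (most recent call last):
  File "/app/api/orders.py", line 23, in create_order
    payment = process_payment(order.total)
  File "/app/services/payment.py", line 67, in process_payment
    stripe.charge.create(amount=amount)
  File "/venv/lib/stripe/api.py", line 342, in create
    raise InvalidRequestError("Amount must be positive")
stripe.error.InvalidRequestError: Amount must be positive
'''
print(root_project_frame(sample))
```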
|
||||
|
||||
### 3. Trace Backwards Through Call Chain
|
||||
|
||||
```python
|
||||
create_user (users.py:45)
|
||||
↓ calls
|
||||
UserService.create (user_service.py:23)
|
||||
↓ calls
|
||||
db.add (sqlalchemy) → IntegrityError
|
||||
```
|
||||
|
||||
**Analysis**: Error originates in `db.add`, propagates through `UserService.create`, surfaces in `create_user` endpoint
|
||||
|
||||
### 4. Identify Data Flow
|
||||
|
||||
```python
|
||||
# users.py:45
|
||||
user = User(email=request.email) # ← Where does request.email come from?
|
||||
db.add(user) # ← Fails with UNIQUE constraint
|
||||
|
||||
# Trace back:
|
||||
# request.email ← Request body
|
||||
# ← Client sent duplicate email
|
||||
# ← Need validation before DB insert
|
||||
```
|
||||
|
||||
### 5. Formulate Hypothesis
|
||||
|
||||
**Pattern**: UNIQUE constraint → Attempting duplicate insert
|
||||
**Root Cause**: No existence check before insert
|
||||
**Fix**: Add existence check or use upsert
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Recursive Stack Traces
|
||||
|
||||
```python
|
||||
Traceback (most recent call last):
  File "/app/services/tree.py", line 23, in calculate_depth
    return 1 + calculate_depth(node.parent)
  File "/app/services/tree.py", line 23, in calculate_depth
    return 1 + calculate_depth(node.parent)
  [Previous line repeated 996 more times]
RecursionError: maximum recursion depth exceeded
|
||||
```
|
||||
|
||||
**Analysis**: Circular reference in `node.parent` chain
|
||||
|
||||
**Fix**: Add base case or cycle detection
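
One possible shape for that fix is an iterative walk with a visited set. The sketch assumes a simple node type with a `parent` attribute; the `Node` class is hypothetical, and only the function name `calculate_depth` comes from the trace above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None


def calculate_depth(node: Optional[Node]) -> int:
    """Count the node and its ancestors; raise instead of looping on a parent cycle."""
    depth = 0
    seen: set[int] = set()  # ids of nodes already visited on this walk
    current = node
    while current is not None:  # base case: ran off the top of the tree
        if id(current) in seen:
            raise ValueError(f"Cycle detected at node {current.name!r}")
        seen.add(id(current))
        depth += 1
        current = current.parent
    return depth
```

Iterating instead of recursing also removes the dependency on the interpreter's recursion limit for legitimately deep trees.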
|
||||
|
||||
### Chained Exceptions (Python 3)
|
||||
|
||||
```python
|
||||
Traceback (most recent call last):
|
||||
File "/app/db/connection.py", line 15, in connect
|
||||
engine.connect()
|
||||
sqlalchemy.exc.OperationalError: connection refused
|
||||
|
||||
The above exception was the direct cause of the following exception:
|
||||
|
||||
Traceback (most recent call last):
|
||||
File "/app/api/users.py", line 45, in get_user
|
||||
db.connect()
|
||||
File "/app/db/connection.py", line 20, in connect
|
||||
raise DatabaseConnectionError() from e
|
||||
app.exceptions.DatabaseConnectionError: Failed to connect to database
|
||||
```
|
||||
|
||||
**Reading**: Two stack traces:
|
||||
1. Original: `OperationalError` (connection refused)
|
||||
2. Wrapped: `DatabaseConnectionError` (user-friendly message)
|
||||
|
||||
**Root Cause**: Original exception (connection refused)
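
In code, the original exception can be recovered by following the `__cause__` and `__context__` attributes that Python sets on chained exceptions. The helper name `original_cause` is an assumption, and `ConnectionRefusedError` is used here only as a dependency-free stand-in for SQLAlchemy's `OperationalError`.

```python
def original_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ and __context__ links back to the first exception."""
    seen = {id(exc)}
    current = exc
    while True:
        # Prefer the explicit cause (raise ... from ...); fall back to the
        # implicit context unless it was suppressed with `raise ... from None`.
        nxt = current.__cause__
        if nxt is None and not current.__suppress_context__:
            nxt = current.__context__
        if nxt is None or id(nxt) in seen:
            return current
        seen.add(id(nxt))
        current = nxt


class DatabaseConnectionError(Exception):
    pass


def connect():
    try:
        raise ConnectionRefusedError("connection refused")  # stand-in for OperationalError
    except ConnectionRefusedError as e:
        raise DatabaseConnectionError("Failed to connect to database") from e


try:
    connect()
except DatabaseConnectionError as err:
    root = original_cause(err)
    print(type(root).__name__, "-", root)
```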
|
||||
|
||||
### Multiple Exception Points
|
||||
|
||||
```javascript
|
||||
UnhandledPromiseRejectionWarning: Error: API request failed
|
||||
at fetchData (api.js:23:11)
|
||||
(node:12345) UnhandledPromiseRejectionWarning: Unhandled promise rejection.
|
||||
```
|
||||
|
||||
**Analysis**: Promise rejected but no `.catch()` handler
|
||||
|
||||
**Fix**: Add a `.catch()` handler, or `await` the promise inside a `try`/`catch` block
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Language | Read Direction | Focus On | Ignore |
|
||||
|----------|---------------|----------|--------|
|
||||
| **Python** | Bottom-up | Last frame in your code | `/venv/`, stdlib |
|
||||
| **JavaScript** | Top-down | First frame in your code | `node_modules/` |
|
||||
| **Java** | Top-down | `com.example.*` | `org.springframework.*` |
|
||||
| **TypeScript** | Top-down | `src/`, `.ts` files | `node_modules/`, `.min.js` |
|
||||
|
||||
| Error Pattern | Stack Trace Indicator | Fix Priority |
|
||||
|---------------|----------------------|--------------|
|
||||
| **Null/undefined** | `NoneType`, `null`, `undefined` | High |
|
||||
| **Type mismatch** | `unsupported operand`, `is not a function` | Medium |
|
||||
| **Import error** | `ModuleNotFoundError`, `Cannot find module` | High |
|
||||
| **Database** | `IntegrityError`, `OperationalError` | High |
|
||||
| **Async** | `await`, `Promise`, `Coroutine` | Medium |
|
||||
|
||||
---
|
||||
|
||||
**Usage**: When debugging, identify exception type, locate root frame in your code, trace backwards through call chain, identify data flow, formulate hypothesis.
|
||||