Initial commit

2025-11-29 18:29:18 +08:00
commit 46dfc30864
25 changed files with 4683 additions and 0 deletions
--- a/skills/smart-debugging/examples/INDEX.md
+++ b/skills/smart-debugging/examples/INDEX.md
@@ -0,0 +1,52 @@
+# Smart Debug Examples
+
+Complete examples demonstrating systematic debugging workflows from error triage to verified fixes.
+
+## Available Examples
+
+### [null-pointer-debug-example.md](null-pointer-debug-example.md)
+Complete walkthrough of debugging a NoneType AttributeError.
+- Stack trace analysis and root file identification
+- Error pattern matching (null pointer pattern)
+- Code inspection of problematic function
+- Fix generation with 3 options (early return, custom exception, helper function)
+- Test-driven debugging with failing test creation
+- Fix application and verification
+- Root cause analysis using 5 Whys
+- Prevention strategy implementation
+
+### [type-error-debug-example.md](type-error-debug-example.md)
+Debugging type mismatch and operand type errors.
+- TypeError analysis (unsupported operand types)
+- Type inference from stack trace
+- Pattern matching for type mismatches
+- Type validation fix generation
+- Unit test creation for type validation
+- Static analysis recommendations (mypy, Pydantic)
+- Prevention through type hints
+
+### [integration-failure-debug.md](integration-failure-debug.md)
+Debugging API integration failures and contract violations.
+- HTTP error analysis (400, 422, 500 responses)
+- API contract validation against OpenAPI spec
+- Request/response comparison
+- Schema validation with Pydantic
+- Integration test creation
+- Observability integration (trace ID correlation)
+- Rollback and deployment strategies
+
+### [performance-bug-debug.md](performance-bug-debug.md)
+Debugging performance-related bugs and slow queries.
+- Performance profiling with cProfile
+- Database query analysis (N+1 detection)
+- Caching strategy implementation
+- Optimization verification with benchmarks
+- Delegation to performance-optimizer agent
+- Production monitoring setup
+
+## Quick Reference
+
+**Need null pointer help?** → [null-pointer-debug-example.md](null-pointer-debug-example.md)
+**Need type error help?** → [type-error-debug-example.md](type-error-debug-example.md)
+**Need API debugging?** → [integration-failure-debug.md](integration-failure-debug.md)
+**Need performance debugging?** → [performance-bug-debug.md](performance-bug-debug.md)
--- a/skills/smart-debugging/examples/integration-failure-debug.md
+++ b/skills/smart-debugging/examples/integration-failure-debug.md
@@ -0,0 +1,88 @@
+# Integration Failure Debug Example
+
+Debugging API integration failures and contract violations.
+
+## Error: 422 Unprocessable Entity from Payment API
+
+```json
+{
+  "detail": [
+    {
+      "loc": ["body", "amount"],
+      "msg": "ensure this value is greater than or equal to 50",
+      "type": "value_error.number.not_ge"
+    }
+  ]
+}
+```
+
+## Investigation
+
+### Request Sent
+
+```python
+# Our code
+await payment_api.create_charge({
+    "amount": order.total_cents,  # Sending cents: 0 (empty cart!)
+    "currency": "usd",
+    "customer_id": "cus_123"
+})
+```
+
+### API Contract (OpenAPI Spec)
+
+```yaml
+/charges:
+  post:
+    requestBody:
+      content:
+        application/json:
+          schema:
+            properties:
+              amount:
+                type: integer
+                minimum: 50  # $0.50 minimum!
+```
+
+**Issue**: Sending `amount: 0` violates the API's minimum amount requirement of 50 cents.
+
+## Root Cause
+
+Order validation allows empty carts ($0 total), but the payment API requires a minimum of $0.50 (50 cents).
+
+## Fix
+
+```python
+from pydantic import BaseModel, validator  # Pydantic v1 style; in v2 use field_validator and model_dump()
+
+class CreateChargeRequest(BaseModel):
+    amount: int
+    currency: str
+    customer_id: str
+
+    @validator('amount')
+    def amount_must_meet_minimum(cls, v):
+        if v < 50:  # Match API's minimum
+            raise ValueError('Amount must be at least $0.50 (50 cents)')
+        return v
+
+# Service layer
+async def create_charge(order: Order):
+    # Validate before API call
+    request = CreateChargeRequest(
+        amount=order.total_cents,
+        currency="usd",
+        customer_id=order.customer_id
+    )
+    return await payment_api.create_charge(request.dict())
+```
+
+## Prevention
+
+1. **Schema validation**: Validate against OpenAPI spec
+2. **Contract tests**: Test API contract compliance
+3. **Integration tests**: Test with real API (or mocks matching spec)
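+
+A minimal sketch of the schema-validation step, assuming the relevant part of the OpenAPI spec is available as a Python dict. `CHARGE_SCHEMA` and `find_contract_violations` are hypothetical names used for illustration; a real project would more likely load the full spec and use a schema validation library.
+
+```python
+from typing import Any
+
+# Excerpt of the request schema from the OpenAPI spec above (hypothetical dict form).
+CHARGE_SCHEMA: dict[str, Any] = {
+    "properties": {
+        "amount": {"type": "integer", "minimum": 50},
+    }
+}
+
+
+def find_contract_violations(payload: dict[str, Any], schema: dict[str, Any]) -> list[str]:
+    """Return readable violations of `type: integer` and `minimum` constraints."""
+    violations = []
+    for field, rules in schema.get("properties", {}).items():
+        if field not in payload:
+            continue
+        value = payload[field]
+        # bool is a subclass of int in Python, so exclude it explicitly.
+        is_int = isinstance(value, int) and not isinstance(value, bool)
+        if rules.get("type") == "integer" and not is_int:
+            violations.append(f"{field}: expected integer, got {type(value).__name__}")
+            continue
+        minimum = rules.get("minimum")
+        if minimum is not None and isinstance(value, (int, float)) and value < minimum:
+            violations.append(f"{field}: {value} is below minimum {minimum}")
+    return violations
+
+
+empty_cart = {"amount": 0, "currency": "usd", "customer_id": "cus_123"}
+print(find_contract_violations(empty_cart, CHARGE_SCHEMA))
+```
+
+Running the check in a contract test means the empty-cart payload is flagged before any request reaches the payment API.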
+
+---
+
+**Result**: API contract violations are caught at the service boundary, not in production.
--- a/skills/smart-debugging/examples/null-pointer-debug-example.md
+++ b/skills/smart-debugging/examples/null-pointer-debug-example.md
@@ -0,0 +1,495 @@
+# Null Pointer Debug Example
+
+Complete walkthrough of debugging a NoneType AttributeError using smart-debug systematic methodology.
+
+## Error Encountered
+
+**Environment**: Production
+**Severity**: SEV2 (Degraded service - user profile pages failing)
+**Frequency**: 127 occurrences in last 24 hours
+**First Occurrence**: 2025-01-16 14:23:00 UTC
+
+### Error Message
+
+```python
+AttributeError: 'NoneType' object has no attribute 'name'
+```
+
+### User Report
+
+> "When I click on a user's profile after they've deleted their account, the page crashes with a 500 error instead of showing a 'User not found' message."
+
+## Phase 1: Triage (3 minutes)
+
+**Severity Assessment**:
+- Not production down (SEV1)
+- Affects specific user workflow (profile viewing)
+- 127 occurrences = moderate frequency
+- **Decision**: SEV2 - Proceed with full smart-debug workflow
+
+**Error Category**: Runtime Exception (NoneType error)
+
+## Phase 2: Stack Trace Analysis
+
+### Full Stack Trace
+
+```python
+Traceback (most recent call last):
+  File "/app/api/users.py", line 42, in get_user_profile
+    return {"name": user.name, "email": user.email}
+AttributeError: 'NoneType' object has no attribute 'name'
+```
+
+### Pattern Match
+
+**Pattern**: `null_pointer`
+**Indicators**: `'NoneType' object has no attribute`
+**Likely Cause**: Accessing property on None value - check for null/undefined
+**Fix Template**: Add null check before access
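+
+The pattern match above can be sketched as a small lookup table of regular expressions. This is an illustrative sketch only: `ErrorPattern`, `PATTERNS`, and `match_error` are hypothetical names, and the `type_mismatch` fix template is an assumption, not taken from the smart-debug tooling.
+
+```python
+import re
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass(frozen=True)
+class ErrorPattern:
+    name: str
+    regex: str
+    fix_template: str
+
+
+# Hypothetical pattern table; extend with project-specific patterns as needed.
+PATTERNS = [
+    ErrorPattern(
+        name="null_pointer",
+        regex=r"'NoneType' object has no attribute",
+        fix_template="Add null check before access",
+    ),
+    ErrorPattern(
+        name="type_mismatch",
+        regex=r"unsupported operand type\(s\)",
+        fix_template="Validate or convert operand types",  # assumed wording
+    ),
+]
+
+
+def match_error(message: str) -> Optional[ErrorPattern]:
+    """Return the first pattern whose regex matches the error message."""
+    for pattern in PATTERNS:
+        if re.search(pattern.regex, message):
+            return pattern
+    return None
+
+
+match = match_error("AttributeError: 'NoneType' object has no attribute 'name'")
+print(match.name if match else "unknown")
+```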
+
+## Phase 3: Code Inspection
+
+### Problematic Code (api/users.py:42)
+
+```python
+@router.get("/users/{user_id}/profile")
+async def get_user_profile(user_id: str):
+    """Get user profile information."""
+    user = await db.users.find_one({"id": user_id})
+
+    # Line 42 - THE PROBLEM
+    return {
+        "name": user.name,
+        "email": user.email,
+        "created_at": user.created_at
+    }
+```
+
+**Issue Identified**: No null check! If `find_one()` returns `None` (user doesn't exist), accessing `user.name` raises an AttributeError.
+
+### Root Cause
+
+When a user deletes their account:
+1. User record is deleted from database
+2. But references to that user ID may still exist (links, mentions, etc.)
+3. When someone tries to view the deleted user's profile, `find_one()` returns `None`
+4. Code tries to access `None.name` → AttributeError
+
+## Phase 4: Reproduce Locally
+
+### Test Case (Failing)
+
+```python
+# tests/test_user_api.py
+import pytest
+from fastapi.testclient import TestClient
+
+def test_get_deleted_user_profile(client: TestClient, db):
+    """Test viewing profile of deleted user."""
+    # Setup: Create and then delete a user
+    user_id = "usr_deleted_123"
+    db.users.insert_one({"id": user_id, "name": "Deleted User"})
+    db.users.delete_one({"id": user_id})
+
+    # Action: Try to get profile of deleted user
+    response = client.get(f"/users/{user_id}/profile")
+
+    # Expected: 404 Not Found, not 500 Internal Server Error
+    assert response.status_code == 404
+    assert response.json() == {"detail": f"User {user_id} not found"}
+```
+
+### Run Test (Fails as Expected)
+
+```bash
+$ pytest tests/test_user_api.py::test_get_deleted_user_profile -v
+
+tests/test_user_api.py::test_get_deleted_user_profile FAILED
+
+E   assert 500 == 404
+E    +  where 500 = <Response [500 Internal Server Error]>.status_code
+```
+
+✅ **Reproduction Successful** - Test reliably reproduces the bug.
+
+## Phase 5: Fix Generation
+
+### Option 1: Quick Fix (Return Early)
+
+```python
+from fastapi import HTTPException
+
+@router.get("/users/{user_id}/profile")
+async def get_user_profile(user_id: str):
+    """Get user profile information."""
+    user = await db.users.find_one({"id": user_id})
+
+    # Quick fix: Return early if user not found
+    if user is None:
+        raise HTTPException(status_code=404, detail=f"User {user_id} not found")
+
+    return {
+        "name": user.name,
+        "email": user.email,
+        "created_at": user.created_at
+    }
+```
+
+**Pros**: Simple, fixes the immediate issue
+**Cons**: Doesn't prevent similar issues elsewhere
+
+### Option 2: Robust Fix (Custom Exception)
+
+```python
+# models/exceptions.py
+class UserNotFoundError(Exception):
+    """Raised when user is not found in database."""
+    def __init__(self, user_id: str):
+        self.user_id = user_id
+        super().__init__(f"User {user_id} not found")
+
+# api/users.py
+@router.get("/users/{user_id}/profile")
+async def get_user_profile(user_id: str):
+    """Get user profile information."""
+    user = await db.users.find_one({"id": user_id})
+
+    if user is None:
+        raise UserNotFoundError(user_id)
+
+    return {
+        "name": user.name,
+        "email": user.email,
+        "created_at": user.created_at
+    }
+
+# Global exception handler (requires: from fastapi.responses import JSONResponse)
+@app.exception_handler(UserNotFoundError)
+async def user_not_found_handler(request, exc):
+    return JSONResponse(
+        status_code=404,
+        content={"detail": str(exc)}
+    )
+```
+
+**Pros**: Reusable, type-safe, better error handling
+**Cons**: More boilerplate
+
+### Option 3: Best Practice (Helper Function)
+
+```python
+# services/user_service.py
+async def get_user_or_404(user_id: str) -> User:
+    """Get user by ID or raise 404."""
+    user = await db.users.find_one({"id": user_id})
+    if user is None:
+        raise HTTPException(status_code=404, detail=f"User {user_id} not found")
+    return user
+
+# api/users.py
+@router.get("/users/{user_id}/profile")
+async def get_user_profile(user_id: str):
+    """Get user profile information."""
+    user = await get_user_or_404(user_id)  # Handles None case
+
+    return {
+        "name": user.name,
+        "email": user.email,
+        "created_at": user.created_at
+    }
+```
+
+**Pros**: Reusable pattern, prevents similar bugs, clean API code
+**Cons**: Slightly more abstraction
+
+**Selected**: **Option 3** - Best practice with reusable helper
+
+## Phase 6: Apply Fix
+
+```python
+# services/user_service.py
+from fastapi import HTTPException
+
+async def get_user_or_404(user_id: str) -> User:
+    """
+    Get user by ID or raise 404 Not Found.
+
+    Args:
+        user_id: User ID to retrieve
+
+    Returns:
+        User object
+
+    Raises:
+        HTTPException: 404 if user not found
+
+    Example:
+        user = await get_user_or_404("usr_123")
+    """
+    user = await db.users.find_one({"id": user_id})
+    if user is None:
+        raise HTTPException(
+            status_code=404,
+            detail=f"User {user_id} not found"
+        )
+    return user
+
+# api/users.py (FIXED)
+@router.get("/users/{user_id}/profile")
+async def get_user_profile(user_id: str):
+    """Get user profile information."""
+    user = await get_user_or_404(user_id)  # Null-safe!
+
+    return {
+        "name": user.name,
+        "email": user.email,
+        "created_at": user.created_at
+    }
+```
+
+## Phase 7: Verify Fix
+
+### Run Test Again (Should Pass)
+
+```bash
+$ pytest tests/test_user_api.py::test_get_deleted_user_profile -v
+
+tests/test_user_api.py::test_get_deleted_user_profile PASSED ✅
+```
+
+### Run Full Test Suite
+
+```bash
+$ pytest tests/ -v
+
+======================== 127 passed in 4.23s ========================
+```
+
+✅ **All tests pass** - No regressions introduced
+
+### Manual Verification
+
+```bash
+# Test in staging
+$ curl -i https://api-staging.example.com/users/usr_deleted_123/profile
+
+HTTP/1.1 404 Not Found
+{
+  "detail": "User usr_deleted_123 not found"
+}
+```
+
+✅ **Fix verified** - Returns proper 404 instead of 500
+
+## Phase 8: Root Cause Analysis (5 Whys)
+
+**Why 1**: Why did the API return 500 error?
+→ Code tried to access `user.name` on None
+
+**Why 2**: Why was `user` None?
+→ Database query returned None (user doesn't exist)
+
+**Why 3**: Why didn't code handle None case?
+→ No null check before accessing user properties
+
+**Why 4**: Why was there no null check?
+→ Developer assumed user would always exist (invalid assumption)
+
+**Why 5**: Why was invalid assumption made?
+→ No code review guideline requiring null checks for database queries
+
+**ROOT CAUSE**: Missing null-safety pattern for database queries + no enforcement in code review
+
+## Phase 9: Prevention Strategy
+
+### Immediate Prevention
+
+✅ **Unit test added** (prevents this specific bug)
+
+```python
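+def test_get_deleted_user_profile(client, db):
+    """A deleted user's profile returns 404 instead of 500."""
+    user_id = create_and_delete_user(db)  # same helper as the integration test below
+    response = client.get(f"/users/{user_id}/profile")
+    assert response.status_code == 404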
+```
+
+### Short-term Prevention
+
+✅ **Integration test added** (prevents similar bugs)
+
+```python
+@pytest.mark.parametrize("endpoint", [
+    "/users/{id}/profile",
+    "/users/{id}/settings",
+    "/users/{id}/posts"
+])
+def test_user_endpoints_return_404_for_deleted_users(client, db, endpoint):
+    """All user endpoints should return 404 for deleted users."""
+    user_id = create_and_delete_user(db)
+    response = client.get(endpoint.format(id=user_id))
+    assert response.status_code == 404
+```
+
+### Long-term Prevention
+
+✅ **Architecture change proposed**: Create `get_resource_or_404()` pattern
+
+```python
+# services/base_service.py
+from typing import Generic, TypeVar
+
+from fastapi import HTTPException
+
+T = TypeVar('T')
+
+class BaseService(Generic[T]):
+    """Base service with null-safe query methods."""
+
+    async def get_or_404(
+        self,
+        resource_id: str,
+        resource_type: str = "Resource"
+    ) -> T:
+        """Get resource by ID or raise 404."""
+        resource = await self.find_one({"id": resource_id})
+        if resource is None:
+            raise HTTPException(
+                status_code=404,
+                detail=f"{resource_type} {resource_id} not found"
+            )
+        return resource
+
+# Usage across all resources
+user_service = UserService()
+post_service = PostService()
+comment_service = CommentService()
+
+user = await user_service.get_or_404(user_id, "User")
+post = await post_service.get_or_404(post_id, "Post")
+```
+
+### Monitoring Added
+
+✅ **Alert created** (detects recurrence)
+
+```yaml
+# prometheus/alerts/user_not_found.yml
+groups:
+  - name: user_api
+    rules:
+    - alert: HighUserNotFoundRate
+      expr: |
+        rate(http_requests_total{
+          endpoint="/users/:id/profile",
+          status_code="404"
+        }[5m]) > 10
+      for: 5m
+      annotations:
+        summary: "High rate of user not found errors"
+        description: "{{ $value }} 404s/sec on user profile endpoint"
+```
+
+### Documentation Updated
+
+✅ **Runbook created**
+
+```markdown
+# Runbook: User Not Found Errors
+
+## Symptom
+404 errors when accessing user profiles
+
+## Diagnosis
+- Check if user was recently deleted
+- Verify database replication lag
+- Check for stale cache entries
+
+## Resolution
+- User deleted: Expected behavior
+- Replication lag: Wait 30 seconds
+- Stale cache: Clear user cache
+
+## Prevention
+Always use `get_user_or_404()` helper
+```
+
+## Phase 10: Deploy & Monitor
+
+### Pre-Deployment Checklist
+
+- [x] Fix tested in staging
+- [x] No performance impact
+- [x] Security review not needed (defensive fix)
+- [x] Deployment plan created
+- [x] Rollback plan ready
+
+### Deployment
+
+```bash
+# Deploy to staging
+$ git push origin feature/fix-user-not-found
+$ ./scripts/deploy-staging.sh
+
+# Verify in staging (1 hour)
+$ ./scripts/monitor-staging.sh --duration 1h
+
+# Deploy to production (gradual rollout)
+$ ./scripts/deploy-production.sh --canary 10%  # 10% traffic
+$ sleep 600  # Monitor for 10 minutes
+$ ./scripts/deploy-production.sh --canary 50%  # 50% traffic
+$ sleep 600
+$ ./scripts/deploy-production.sh --canary 100% # Full traffic
+```
+
+### Post-Deployment Monitoring
+
+**1 Hour Post-Deploy**:
+```bash
+# Check error logs
+$ kubectl logs -l app=api --since=1h | grep "User.*not found"
+# No unexpected errors ✅
+
+# Check error rate
+$ curl -sG http://prometheus:9090/api/v1/query --data-urlencode 'query=rate(http_errors_total[1h])'
+# No increase in error rate ✅
+```
+
+**24 Hours Post-Deploy**:
+```bash
+# Verify the 500 rate on the profile endpoint is zero (404s for deleted users are now expected)
+$ curl -sG http://prometheus:9090/api/v1/query --data-urlencode 'query=rate(http_requests_total{status_code="500",endpoint="/users/:id/profile"}[24h])'
+# Result: 0 errors ✅
+```
+
+## Summary
+
+| Metric | Value |
+|--------|-------|
+| **Time to Reproduce** | 5 minutes |
+| **Time to Fix** | 15 minutes |
+| **Time to Deploy** | 30 minutes |
+| **Total Time** | 50 minutes |
+| **Tests Added** | 2 (unit + integration) |
+| **Prevention Strategies** | 3 (tests, architecture, monitoring) |
+| **Recurrences** | 0 (monitored for 1 week) |
+
+## Lessons Learned
+
+### What Went Well
+1. Clear stack trace made root cause obvious
+2. Test-driven debugging reproduced the issue immediately
+3. Helper function prevents similar bugs across codebase
+
+### What Could Be Improved
+1. Should have had null-safety pattern from the start
+2. Code review should catch missing null checks
+3. Static analysis could detect this pattern
+
+### Recommendations
+1. Add `mypy` or similar for null-safety checking
+2. Update code review checklist to include null-safety checks
+3. Create linter rule: "Database queries must use `get_or_404` pattern"
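+
+To illustrate recommendation 1, here is a minimal sketch of how an `Optional` return type lets a static checker flag the missing null check. `find_user`, `profile_name`, and the in-memory `_USERS` dict are hypothetical stand-ins for the real database lookup.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class User:
+    name: str
+    email: str
+
+
+# Hypothetical in-memory stand-in for the users collection.
+_USERS = {"usr_123": User(name="Ada", email="ada@example.com")}
+
+
+def find_user(user_id: str) -> Optional[User]:
+    """Return the user, or None when no record exists."""
+    return _USERS.get(user_id)
+
+
+def profile_name(user_id: str) -> str:
+    user = find_user(user_id)
+    # Without this check, a type checker such as mypy reports that
+    # `user` may be None when `.name` is accessed.
+    if user is None:
+        return "User not found"
+    return user.name
+
+
+print(profile_name("usr_123"))
+print(profile_name("usr_deleted_123"))
+```
+
+The annotation is what makes the check enforceable: if the lookup is typed as returning `User`, the checker has no reason to complain.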
+
+---
+
+**Bug Fixed**: ✅
+**Tests Pass**: ✅
+**Prevention Implemented**: ✅
+**Production Stable**: ✅
--- a/skills/smart-debugging/examples/performance-bug-debug.md
+++ b/skills/smart-debugging/examples/performance-bug-debug.md
@@ -0,0 +1,92 @@
+# Performance Bug Debug Example
+
+Debugging slow database queries and N+1 problems.
+
+## Symptom
+
+API endpoint taking 4.5 seconds to respond (target: < 200ms).
+
+## Profiling
+
+```python
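+import asyncio
+import cProfile
+import pstats
+
+async def profile_endpoint():
+    """Profile one call to the slow endpoint handler."""
+    profiler = cProfile.Profile()
+    profiler.enable()
+
+    await get_users_with_posts()  # function under investigation
+
+    profiler.disable()
+    stats = pstats.Stats(profiler)
+    stats.sort_stats('cumulative')
+    stats.print_stats(10)  # show the 10 most expensive entries
+
+asyncio.run(profile_endpoint())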
+```
+
+### Profile Output
+
+```
+   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
+      100    4.321    0.043    4.321    0.043 database.py:42(execute_query)
+        1    0.089    0.089    4.410    4.410 users.py:15(get_users_with_posts)
+```
+
+**Issue**: Database query called 100 times! (N+1 problem)
+
+## Code Analysis
+
+```python
+# BAD: N+1 Query Problem
+async def get_users_with_posts():
+    users = await db.users.find_all()  # 1 query
+
+    result = []
+    for user in users:  # 100 iterations
+        posts = await db.posts.find({"user_id": user.id})  # N queries!
+        result.append({"user": user, "posts": posts})
+
+    return result  # Total: 101 queries (1 + 100)
+```
+
+## Fix: Use Join/Eager Loading
+
+```python
+# GOOD: Single Query with Join
+async def get_users_with_posts():
+    query = """
+        SELECT
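+            users.*,
+            -- FILTER avoids a [null] array for users without posts (PostgreSQL)
+            COALESCE(json_agg(posts.*) FILTER (WHERE posts.id IS NOT NULL), '[]') AS posts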
+        FROM users
+        LEFT JOIN posts ON posts.user_id = users.id
+        GROUP BY users.id
+    """
+    result = await db.execute(query)  # 1 query total!
+    return result
+```
+
+## Performance Comparison
+
+| Approach | Queries | Time |
+|----------|---------|------|
+| **Before (N+1)** | 101 | 4.5s ❌ |
+| **After (Join)** | 1 | 85ms ✅ |
+
+**Improvement**: 53x faster!
+
+## Prevention
+
+1. **Query logging**: Log all database queries in development
+2. **Performance tests**: Assert query count < threshold
+3. **APM monitoring**: Track query patterns in production (Datadog, New Relic)
+
+```python
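+# Performance test (assumes pytest-asyncio and a query_counter fixture)
+import pytest
+
+@pytest.mark.asyncio
+async def test_get_users_with_posts_query_count(query_counter):
+    await get_users_with_posts()
+    assert query_counter.count <= 1, f"Expected 1 query, got {query_counter.count}"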
+```
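+
+The `query_counter` fixture used in the test is not defined in this example. One possible shape for it, as a sketch: a wrapper that counts calls to an async execute function. `QueryCounter`, `fake_execute`, and `n_plus_one` are hypothetical names, and the fake driver returns no rows.
+
+```python
+import asyncio
+from typing import Any, Awaitable, Callable
+
+
+class QueryCounter:
+    """Wraps an async execute function and counts how often it is called."""
+
+    def __init__(self, execute: Callable[..., Awaitable[Any]]) -> None:
+        self._execute = execute
+        self.count = 0
+
+    async def __call__(self, *args: Any, **kwargs: Any) -> Any:
+        self.count += 1
+        return await self._execute(*args, **kwargs)
+
+
+async def fake_execute(query: str, *params: Any) -> list:
+    """Hypothetical stand-in for the real database driver."""
+    return []
+
+
+async def n_plus_one(execute: Callable[..., Awaitable[Any]], user_ids: range) -> None:
+    """Reproduces the query pattern from the BAD example."""
+    await execute("SELECT * FROM users")  # 1 query
+    for user_id in user_ids:
+        await execute("SELECT * FROM posts WHERE user_id = $1", user_id)  # N queries
+
+
+counter = QueryCounter(fake_execute)
+asyncio.run(n_plus_one(counter, range(100)))
+print(counter.count)  # 101 (1 + 100)
+```
+
+In a real test suite the wrapper would be installed around the database client in a pytest fixture, so the assertion on `count` fails as soon as a loop reintroduces per-row queries.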
+
+---
+
+**Result**: N+1 detected and fixed. Performance SLA met (< 200ms).
--- a/skills/smart-debugging/examples/type-error-debug-example.md
+++ b/skills/smart-debugging/examples/type-error-debug-example.md
@@ -0,0 +1,126 @@
+# Type Error Debug Example
+
+Debugging type mismatch errors using systematic analysis and type validation.
+
+## Error Encountered
+
+**Environment**: Development
+**Severity**: SEV3 (Bug blocking feature development)
+
+### Error Message
+
+```python
+TypeError: unsupported operand type(s) for +: 'int' and 'str'
+```
+
+### Context
+
+Developer implementing new pricing calculation feature receives cryptic type error.
+
+## Stack Trace Analysis
+
+```python
+Traceback (most recent call last):
+  File "/app/services/pricing.py", line 45, in calculate_total
+    total = base_price + discount
+TypeError: unsupported operand type(s) for +: 'int' and 'str'
+```
+
+**Pattern Match**: `type_mismatch` - Incompatible types in operation
+
+## Code Inspection
+
+```python
+# services/pricing.py
+def calculate_total(base_price: int, discount: str) -> int:
+    """Calculate final price after discount."""
+    # Line 45 - THE PROBLEM
+    total = base_price + discount  # int + str = TypeError!
+    return total
+```
+
+**Issue**: `discount` parameter typed as `str` but used in numeric operation.
+
+## Root Cause
+
+API returns discount as string `"10"` instead of integer `10`. Type hint says `str`, but function logic expects `int`.
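+
+A minimal sketch of converting the API value once, at the boundary, so the pricing code only ever sees an `int`. `parse_discount` is a hypothetical helper, not part of the codebase under discussion.
+
+```python
+def parse_discount(raw: object) -> int:
+    """Convert a discount value from an API response into a non-negative int."""
+    if isinstance(raw, bool):
+        # bool is a subclass of int; reject it explicitly.
+        raise TypeError("discount must be an integer, not a bool")
+    if isinstance(raw, int):
+        value = raw
+    elif isinstance(raw, str):
+        value = int(raw.strip())  # raises ValueError for non-numeric text
+    else:
+        raise TypeError(f"unsupported discount type: {type(raw).__name__}")
+    if value < 0:
+        raise ValueError("discount must not be negative")
+    return value
+
+
+# The original failure: adding an int and a str raises TypeError.
+try:
+    100 + "10"
+except TypeError as exc:
+    print(exc)
+
+# After conversion the arithmetic works on plain integers.
+print(100 - parse_discount("10"))  # 90
+```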
+
+## Fix Options
+
+### Option 1: Convert String to Int
+
+```python
+def calculate_total(base_price: int, discount: str) -> int:
+    """Calculate final price after discount."""
+    discount_int = int(discount)  # Convert string to int
+    total = base_price - discount_int  # subtract the discount (the original code added it)
+    return total
+```
+
+**Issue**: Still accepts `str` - misleading type hint!
+
+### Option 2: Fix Type Hint (Correct!)
+
+```python
+def calculate_total(base_price: int, discount: int) -> int:
+    """Calculate final price after discount."""
+    total = base_price - discount
+    return total
+```
+
+**Better**: Type hint matches expected usage, but the caller must now convert the API's string value before calling.
+
+### Option 3: Input Validation with Pydantic
+
+```python
+from pydantic import BaseModel, StrictInt, validator
+
+class PricingInput(BaseModel):
+    # StrictInt rejects "10"; a plain int field would silently coerce it to 10
+    base_price: StrictInt
+    discount: StrictInt
+
+    @validator('discount')
+    def discount_must_not_be_negative(cls, v):
+        if v < 0:
+            raise ValueError('Discount must not be negative')
+        return v
+
+def calculate_total(pricing: PricingInput) -> int:
+    """Calculate final price after discount."""
+    return pricing.base_price - pricing.discount
+```
+
+**Best**: Validates at API boundary, type-safe!
+
+## Test
+
+```python
+import pytest
+from pydantic import ValidationError
+
+def test_calculate_total_with_valid_types():
+    """Test with correct types."""
+    result = calculate_total(PricingInput(base_price=100, discount=10))
+    assert result == 90
+
+def test_calculate_total_rejects_string_discount():
+    """Test rejects string discount."""
+    with pytest.raises(ValidationError):
+        PricingInput(base_price=100, discount="10")
+```
+
+## Prevention
+
+1. **Static type checking**: Run `mypy` in CI/CD
+2. **Pydantic validation**: Validate all API inputs
+3. **Integration tests**: Test with real API responses
+
+**Type Safety Enforcement**:
+```ini
+# mypy.ini
+[mypy]
+python_version = 3.11
+strict = True
+disallow_untyped_defs = True
+```
+
+---
+
+**Result**: Type error caught at dev time, not production. Type hints + Pydantic prevent recurrence.