Systematic Debugging Checklist
Use when debugging errors, exceptions, or unexpected behavior.
Phase 1: Triage (2-5 minutes)
- Error message captured completely
- Stack trace obtained (full, not truncated)
- Environment identified (production, staging, development)
- Severity assessed (SEV1, SEV2, SEV3, SEV4)
- Error frequency determined (one-time, intermittent, consistent)
- First occurrence timestamp identified
- Recent changes reviewed (deployments, config changes)
- Production impact assessed (users affected, revenue impact)
Triage Decision
- SEV1 (Production Down) - Escalate to incident-responder immediately
- SEV2 (Degraded) - Quick investigation (10 min max), then escalate if unresolved
- SEV3 (Bug) - Continue with full smart-debug workflow
- SEV4 (Enhancement) - Document and queue for later
Phase 2: Stack Trace Analysis
- Error type identified (TypeError, ValueError, KeyError, etc.)
- Error message parsed and understood
- Call stack extracted (all frames with file, line, function)
- Root file identified (where error originated, not propagated)
- Root line number identified
- Stdlib/third-party frames filtered out (see the sketch at the end of this phase)
- Related files identified (files in call stack)
- Likely cause predicted using pattern matching
Stack Trace Quality
- Stack trace is complete (not truncated)
- Source maps applied (for minified JavaScript)
- Line numbers accurate (local code matches the deployed version)
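A minimal sketch of automated frame filtering using Python's standard traceback module; `project_root` and the demo error below are assumptions about your source layout, not part of the checklist:

```python
import os
import traceback

def find_root_frame(exc: BaseException, project_root: str):
    """Return the innermost traceback frame inside project code.

    Frames from the stdlib and site-packages are skipped so the result
    points at where the error originated, not where it propagated.
    """
    root = None
    for frame in traceback.extract_tb(exc.__traceback__):
        in_project = frame.filename.startswith(project_root)
        if in_project and "site-packages" not in frame.filename:
            root = frame  # keep overwriting: the innermost match wins
    return root  # FrameSummary with .filename, .lineno, .name, .line

try:
    {}["user_id"]  # stand-in for the failing call
except KeyError as exc:
    here = os.path.dirname(os.path.abspath(__file__))
    frame = find_root_frame(exc, project_root=here)
    if frame is not None:
        print(f"{type(exc).__name__} at {frame.filename}:{frame.lineno} in {frame.name}")
```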
Phase 3: Pattern Matching
- Error pattern database searched (a sketch follows the pattern list below)
- Matching error pattern found (or "unknown")
- Root cause hypothesis generated
- Fix template identified
- Prevention strategy identified
- Similar historical bugs reviewed (if available)
Common Patterns Checked
- Null pointer / NoneType errors
- Type mismatch errors
- Index out of range errors
- Missing dictionary keys
- Module/import errors
- Database connection errors
- API contract violations
- Concurrency errors
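A sketch of what a tiny pattern database can look like, matching on error type plus message text; the entries below are illustrative, not a real database:

```python
import re

# Each entry: (error type, message regex) -> (root cause hypothesis, fix template)
PATTERNS = {
    ("TypeError", r"'NoneType' object"): (
        "A value expected to be set was None (missing null check upstream)",
        "Add a None guard, or return early with a clear error",
    ),
    ("KeyError", r".*"): (
        "Dictionary accessed with a key that may be absent",
        "Use dict.get() with a default, or validate the payload first",
    ),
    ("IndexError", r"out of range"): (
        "Sequence shorter than the code assumes",
        "Check len() before indexing, or iterate instead of indexing",
    ),
}

def match_pattern(error_type: str, message: str):
    for (etype, msg_re), (hypothesis, fix) in PATTERNS.items():
        if etype == error_type and re.search(msg_re, message):
            return hypothesis, fix
    return None  # "unknown" -- fall back to manual analysis

print(match_pattern("KeyError", "'user_id'"))
```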
Phase 4: Code Inspection
- Root file read completely
- Problematic line identified
- Context examined (5 lines before/after)
- Function signature examined
- Variable types inferred
- Data flow traced (inputs to problematic line)
- Hidden assumptions identified (e.g., missing null checks or type validation)
Code Quality Check
- Tests exist for this code path (yes/no)
- Code has type annotations (TypeScript types, Python type hints)
- Code has input validation
- Code has error handling
Phase 5: Observability Investigation
Log Analysis
- Logs queried for error occurrences
- Error frequency calculated (per hour, per day; see the sketch below)
- First occurrence timestamp confirmed
- Recent occurrences reviewed (last 10)
- Affected users identified (user IDs from logs)
- Error correlation checked (other errors at same time)
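A sketch of frequency analysis, assuming JSON-lines logs with `timestamp`, `level`, and `message` fields; the field names and ISO 8601 timestamps are assumptions, so adjust to your log schema:

```python
import json
from collections import Counter
from datetime import datetime

def error_frequency(log_path: str, needle: str) -> Counter:
    """Count matching ERROR lines per hour in a JSON-lines log file."""
    per_hour: Counter = Counter()
    with open(log_path) as f:
        for line in f:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            if entry.get("level") == "ERROR" and needle in entry.get("message", ""):
                ts = datetime.fromisoformat(entry["timestamp"])
                per_hour[ts.strftime("%Y-%m-%d %H:00")] += 1
    return per_hour

# Usage: when did the error first appear, and how often does it occur?
# counts = error_frequency("app.log", "KeyError: 'user_id'")
# first_seen = min(counts) if counts else None
```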
Metrics Analysis
- Error rate queried (Prometheus, Cloudflare Analytics; sketch below)
- Error spike identified (yes/no, when)
- Correlation with traffic spike checked
- Correlation with deployment checked
- Resource utilization checked (CPU, memory, connections)
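If Prometheus is the metrics backend, error rate can be pulled over its standard `/api/v1/query` HTTP API; the metric name, labels, and server address below are placeholders for whatever your services actually export:

```python
import requests

PROM_URL = "http://prometheus:9090"  # assumed address

def error_rate(window: str = "5m") -> list:
    """Query the 5xx request rate via Prometheus's query API."""
    # http_requests_total and the status label are placeholder names.
    query = f'sum(rate(http_requests_total{{status=~"5.."}}[{window}]))'
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]
```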
Trace Analysis
- Trace ID extracted from logs
- Distributed trace viewed (Jaeger, Zipkin)
- Span timings analyzed
- Upstream/downstream services checked
- Trace context propagation verified
Phase 6: Reproduce Locally
- Test environment set up (matches production config)
- Input data identified (from logs or user report)
- Reproduction steps documented
- Error reproduced locally (consistent reproduction)
- Minimal reproduction case created (simplest input that triggers error)
- Failing test case written (pytest, vitest; see the sketch at the end of this phase)
- Test runs and fails as expected
Reproduction Quality
- Reproduction is reliable (100% reproducible)
- Reproduction is minimal (fewest steps possible)
- Test is isolated (no external dependencies if possible)
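A minimal failing-test sketch in pytest. The module, function, and payload are hypothetical; the point is to encode the exact failing input before touching the fix:

```python
import pytest

from myapp.users import parse_user  # hypothetical module under test

def test_parse_user_without_user_id():
    """Reproduces the production KeyError: some payloads omit 'user_id'.

    Must fail before the fix and pass after it.
    """
    payload = {"name": "alice"}  # minimal input that triggers the error
    # Before the fix this raises an unhandled KeyError; after the fix
    # we expect a clear, typed validation error instead.
    with pytest.raises(ValueError, match="user_id"):
        parse_user(payload)
```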
Phase 7: Fix Generation
- Fix hypothesis generated (based on pattern match and code inspection)
- Fix option 1 generated (quick fix)
- Fix option 2 generated (robust fix)
- Fix option 3 generated (best practice fix)
- Trade-offs analyzed (complexity vs. robustness)
- Fix option selected (document rationale)
Fix Options Evaluated
- Quick fix - Minimal code change, may not cover all cases
- Robust fix - Handles edge cases, more defensive
- Best practice fix - Follows design patterns, prevents similar bugs
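As a concrete illustration, the three options for a hypothetical KeyError on a missing `user_id` field (function and field names are assumptions):

```python
from dataclasses import dataclass

# Option 1 -- quick fix: minimal change, but may hide the problem downstream.
def parse_user_quick(payload: dict) -> str:
    return payload.get("user_id", "")

# Option 2 -- robust fix: fails fast with a clear, catchable error.
def parse_user_robust(payload: dict) -> str:
    user_id = payload.get("user_id")
    if not user_id:
        raise ValueError("payload missing required field 'user_id'")
    return user_id

# Option 3 -- best-practice fix: validate the whole payload at the boundary
# (a dataclass here; Pydantic or similar works just as well).
@dataclass
class UserPayload:
    user_id: str
    name: str = ""

    def __post_init__(self) -> None:
        if not self.user_id:
            raise ValueError("user_id must be non-empty")
```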
Phase 8: Fix Application
- Code changes made (Edit or MultiEdit tool)
- Changes reviewed for correctness
- No new bugs introduced (code review)
- Code style consistent (matches project conventions)
- Type hints added (if applicable)
- Comments added (explaining why, not what)
Safety Checks
- No hardcoded values (use constants or config)
- No security vulnerabilities introduced
- No performance regressions introduced
- Backwards compatibility maintained (if API change)
Phase 9: Test Verification
- Failing test now passes (verify fix works)
- Full test suite runs (pytest, vitest)
- All tests pass (no regressions)
- Code coverage maintained or improved
- Integration tests pass (if applicable)
- Edge cases tested (null, empty, large inputs; see the parametrized sketch at the end of this phase)
Test Quality
- Test is clear and readable
- Test documents expected behavior
- Test will catch regressions
- Test runs quickly (<1 second)
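The edge cases above can be covered in one parametrized pytest test; `parse_user` is the same hypothetical function as in the earlier sketches:

```python
import pytest

from myapp.users import parse_user  # hypothetical module under test

@pytest.mark.parametrize("payload", [
    {},                           # empty input
    {"user_id": None},            # null value
    {"user_id": ""},              # empty string
    {"user_id": "x" * 10_000},    # unusually large input
])
def test_parse_user_edge_cases(payload):
    # Each edge case should either succeed or raise a clear validation
    # error -- never crash with an unhandled KeyError/TypeError.
    try:
        result = parse_user(payload)
        assert isinstance(result, str)
    except ValueError:
        pass  # explicit validation failure is acceptable
```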
Phase 10: Root Cause Analysis
- 5 Whys analysis performed (worked example below)
- True root cause identified (not just symptom)
- Contributing factors identified
- Timeline reconstructed (what happened when)
- RCA document created (using template)
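Example 5 Whys, for the hypothetical missing user_id bug used in the sketches above:
- Why did the request crash? A KeyError on 'user_id'.
- Why was 'user_id' missing? Guest sessions from the mobile client omit it.
- Why wasn't the payload validated? The endpoint trusts client input.
- Why does the endpoint trust client input? There is no validation layer at the API boundary.
- Why is there no validation layer? Input validation was never made a team standard for new endpoints (the root cause).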
RCA Document Contents
- Error summary (what, where, when, impact)
- Timeline of events
- Investigation steps documented
- Root cause clearly stated
- Fix applied documented (with code snippet)
- Prevention strategy documented
Phase 11: Prevention Strategy
- Immediate prevention: Unit test added (prevents this specific bug)
- Short-term prevention: Integration tests added (prevents similar bugs)
- Long-term prevention: Architecture changes proposed (prevents class of bugs)
- Monitoring added: Alert created (detects recurrence)
- Documentation updated: Runbook created (guides future debugging)
Prevention Levels
- Test Coverage - Tests prevent regression
- Type Safety - Type hints catch errors at dev time
- Input Validation - Validates data early (Pydantic, zod; sketch below)
- Error Handling - Graceful degradation
- Monitoring - Detects issues quickly
- Documentation - Team learns from incident
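Since the checklist names Pydantic, a minimal boundary-validation sketch (Pydantic v2 syntax; the model and field names are assumptions):

```python
from pydantic import BaseModel, field_validator

class UserPayload(BaseModel):
    user_id: str
    name: str = ""

    @field_validator("user_id")
    @classmethod
    def user_id_non_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("user_id must be non-empty")
        return v

# Invalid payloads now fail at the boundary with a clear ValidationError
# instead of surfacing later as a KeyError deep in business logic.
# UserPayload(name="alice")  # -> ValidationError
```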
Phase 12: Deploy & Monitor
Pre-Deployment
- Fix tested in staging environment
- Performance impact assessed (no significant regression)
- Security review completed (if security-related bug)
- Deployment plan created (gradual rollout, rollback plan)
- Stakeholders notified (if high-impact bug)
Deployment
- Fix deployed to staging first
- Staging verification successful
- Fix deployed to production (gradual rollout if possible)
- Deployment monitoring active (logs, metrics, traces)
Post-Deployment
- Error logs monitored (1 hour post-deploy)
- Error rate confirmed at zero (or significantly reduced)
- No new errors introduced
- No performance degradation
- User reports checked (customer support, social media)
Monitoring Duration
- 1 hour: Active monitoring (logs, errors, metrics)
- 24 hours: Passive monitoring (alerting enabled)
- 1 week: Review error trends (ensure no recurrence)
Phase 13: Documentation & Learning
- Error pattern database updated (add new pattern if discovered)
- Team notified of fix (Slack, email)
- Postmortem conducted (if SEV1 or SEV2)
- Lessons learned documented
- Similar code locations reviewed (apply fix broadly if needed)
- Architecture improvements proposed (if needed)
Knowledge Sharing
- RCA document shared with team
- Runbook created or updated
- Presentation given (if the bug was interesting or impactful)
- Blog post written (if educational value)
Critical Validations
- Bug reliably reproduced before fixing
- Fix verified with passing test
- No regressions introduced
- Root cause identified (not just symptom fixed)
- Prevention strategy implemented
- Monitoring in place to detect recurrence
- Documentation complete (RCA, runbook)
- Team learns from incident
Debugging Anti-Patterns (Avoid These)
- Random code changes without hypothesis
- Adding print statements without a plan
- Debugging production directly (use staging)
- Ignoring error messages or stack traces
- Not writing tests to verify fix
- Fixing symptoms instead of root cause
- Skipping reproduction step
- Not documenting investigation
- Not learning from mistakes (no RCA)
- Working alone for > 30 min when stuck
When to Escalate
Escalate immediately if:
- Production is down (SEV1)
- You're stuck for > 30 minutes with no progress
- Bug is in unfamiliar code/system
- Security vulnerability suspected
- Data corruption suspected
- Multiple systems affected
Who to escalate to:
- incident-responder - Production SEV1/SEV2
- performance-optimizer - Performance bugs
- security-analyzer - Security vulnerabilities
- data-validator - Data validation errors
- Senior engineer - Stuck for > 30 min
- On-call engineer - Outside business hours
Success Criteria
- Bug fixed and verified with test
- Root cause identified and documented
- Prevention strategy implemented
- Team learns from incident
- Similar bugs prevented in future
- Documentation complete and accurate
- Debugging completed in reasonable time (< 2 hours for SEV3)