# Systematic Debugging Checklist

**Use when debugging errors, exceptions, or unexpected behavior.**

## Phase 1: Triage (2-5 minutes)

- [ ] Error message captured completely
- [ ] Stack trace obtained (full, not truncated)
- [ ] Environment identified (production, staging, development)
- [ ] Severity assessed (SEV1, SEV2, SEV3, SEV4)
- [ ] Error frequency determined (one-time, intermittent, consistent)
- [ ] First occurrence timestamp identified
- [ ] Recent changes reviewed (deployments, config changes)
- [ ] Production impact assessed (users affected, revenue impact)

### Triage Decision

- [ ] **SEV1 (Production Down)** - Escalate to incident-responder immediately
- [ ] **SEV2 (Degraded)** - Quick investigation (10 min max), then escalate if unresolved
- [ ] **SEV3 (Bug)** - Continue with full smart-debug workflow
- [ ] **SEV4 (Enhancement)** - Document and queue for later

## Phase 2: Stack Trace Analysis

- [ ] Error type identified (TypeError, ValueError, KeyError, etc.)
- [ ] Error message parsed and understood
- [ ] Call stack extracted (all frames with file, line, function)
- [ ] Root file identified (where the error originated, not just where it propagated)
- [ ] Root line number identified
- [ ] Stdlib/third-party frames filtered out (see the sketch below)
- [ ] Related files identified (other files in the call stack)
- [ ] Likely cause predicted using pattern matching

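Where tooling helps: a minimal sketch of frame extraction and stdlib/third-party filtering, assuming Python's standard `traceback` and `sysconfig` modules. The `app_frames` helper and the `KeyError` demo are illustrative, not part of any required tooling.

```python
import sysconfig
import traceback

STDLIB = sysconfig.get_paths()["stdlib"]

def app_frames(exc: BaseException) -> list[traceback.FrameSummary]:
    """Keep only frames from our own code, dropping stdlib/site-packages."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [
        f for f in frames
        if not f.filename.startswith(STDLIB)
        and "site-packages" not in f.filename
    ]

try:
    {}["missing"]  # stand-in for the real failure
except KeyError as exc:
    frames = app_frames(exc)
    for frame in frames:
        print(f"{frame.filename}:{frame.lineno} in {frame.name}")
    root = frames[-1]  # deepest application frame = likely root file/line
    print("root:", root.filename, root.lineno)
```

The deepest surviving frame is usually the right place to start reading, since the frames below it belong to code you don't control.
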
### Stack Trace Quality

- [ ] Stack trace is complete (not truncated)
- [ ] Source maps applied (for minified JavaScript)
- [ ] Line numbers accurate (code and deployed version match)

## Phase 3: Pattern Matching

- [ ] Error pattern database searched (see the sketch after the pattern list below)
- [ ] Matching error pattern found (or "unknown")
- [ ] Root cause hypothesis generated
- [ ] Fix template identified
- [ ] Prevention strategy identified
- [ ] Similar historical bugs reviewed (if available)

### Common Patterns Checked

- [ ] Null pointer / NoneType errors
- [ ] Type mismatch errors
- [ ] Index out of range errors
- [ ] Missing dictionary keys
- [ ] Module/import errors
- [ ] Database connection errors
- [ ] API contract violations
- [ ] Concurrency errors

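A minimal sketch of what such a pattern database can look like, in Python. The patterns, causes, and fix templates below are invented for illustration; a real database would be larger and likely loaded from a file.

```python
# Illustrative error-pattern "database": exception signature -> diagnosis.
PATTERNS = {
    "KeyError": {
        "likely_cause": "dictionary accessed with a key that is not always present",
        "fix_template": "use .get() with a default, or validate the key first",
        "prevention": "validate input shape early (e.g. with Pydantic)",
    },
    "AttributeError: 'NoneType'": {
        "likely_cause": "a function returned None where a value was expected",
        "fix_template": "add a None check, or make the callee raise instead",
        "prevention": "type hints plus a strict type checker",
    },
}

def match_pattern(error_line: str) -> dict | None:
    """Return the first pattern whose signature appears in the error line."""
    for signature, info in PATTERNS.items():
        if signature in error_line:
            return info
    return None  # "unknown" - fall back to manual analysis

print(match_pattern("KeyError: 'user_id'"))
```

Phase 13 feeds this table: every new RCA should add or refine an entry.
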
## Phase 4: Code Inspection

- [ ] Root file read completely
- [ ] Problematic line identified
- [ ] Context examined (5 lines before/after; see the sketch below)
- [ ] Function signature examined
- [ ] Variable types inferred
- [ ] Data flow traced (inputs to the problematic line)
- [ ] Assumptions identified (missing null checks, missing type validations)

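A small helper for the context step, sketched in Python; `show_context` and the file/line in the usage example are hypothetical.

```python
from pathlib import Path

def show_context(path: str, lineno: int, radius: int = 5) -> None:
    """Print the problematic line with `radius` lines of context."""
    lines = Path(path).read_text().splitlines()
    start = max(lineno - 1 - radius, 0)
    for i, line in enumerate(lines[start:lineno + radius], start=start + 1):
        marker = ">>" if i == lineno else "  "  # flag the suspect line
        print(f"{marker} {i:4d} {line}")

show_context("myapp/users.py", 42)  # hypothetical root file and line
```
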
### Code Quality Check

- [ ] Tests exist for this code path (yes/no)
- [ ] Code has type hints (TypeScript, Python type hints)
- [ ] Code has input validation
- [ ] Code has error handling

## Phase 5: Observability Investigation

### Log Analysis

- [ ] Logs queried for error occurrences
- [ ] Error frequency calculated (per hour, per day; see the sketch below)
- [ ] First occurrence timestamp confirmed
- [ ] Recent occurrences reviewed (last 10)
- [ ] Affected users identified (user IDs from logs)
- [ ] Error correlation checked (other errors at the same time)

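A minimal sketch of the frequency calculation, assuming structured JSON-lines logs with `timestamp`, `level`, and `error` fields; adjust the field names to your logging schema.

```python
import json
from collections import Counter
from datetime import datetime

def error_frequency(log_path: str, error_type: str) -> Counter:
    """Count matching ERROR events per hour in a JSON-lines log file."""
    per_hour: Counter = Counter()
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("level") == "ERROR" and error_type in event.get("error", ""):
                ts = datetime.fromisoformat(event["timestamp"])
                per_hour[ts.strftime("%Y-%m-%d %H:00")] += 1
    return per_hour

print(error_frequency("app.log", "KeyError").most_common(5))
```
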
### Metrics Analysis

- [ ] Error rate queried (Prometheus, Cloudflare Analytics)
- [ ] Error spike identified (yes/no, when)
- [ ] Correlation with traffic spike checked
- [ ] Correlation with deployment checked
- [ ] Resource utilization checked (CPU, memory, connections)

### Trace Analysis

- [ ] Trace ID extracted from logs
- [ ] Distributed trace viewed (Jaeger, Zipkin)
- [ ] Span timings analyzed
- [ ] Upstream/downstream services checked
- [ ] Trace context propagation verified

## Phase 6: Reproduce Locally

- [ ] Test environment set up (matches production config)
- [ ] Input data identified (from logs or the user report)
- [ ] Reproduction steps documented
- [ ] Error reproduced locally (consistent reproduction)
- [ ] Minimal reproduction case created (simplest input that triggers the error)
- [ ] Failing test case written (pytest, vitest; see the sketch below)
- [ ] Test runs and fails as expected

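A minimal failing-test sketch in pytest. `get_display_name` and the `user` record are hypothetical stand-ins for whatever the logs point at; in practice you would import the real function rather than define it inline.

```python
def get_display_name(user: dict) -> str:
    # Inline stand-in reproducing the buggy behavior found in Phase 4;
    # in a real test, import the function from your codebase instead.
    return user["profile"]["display_name"]

def test_display_name_handles_missing_profile():
    # Minimal reproduction input recovered from logs: a user record
    # without the optional "profile" key.
    user = {"id": 42}
    # Fails with KeyError before the fix; passes once a fallback exists.
    assert get_display_name(user) == "anonymous"
```
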
### Reproduction Quality

- [ ] Reproduction is reliable (100% reproducible)
- [ ] Reproduction is minimal (fewest steps possible)
- [ ] Test is isolated (no external dependencies if possible)

## Phase 7: Fix Generation

- [ ] Fix hypothesis generated (based on pattern match and code inspection)
- [ ] Fix option 1 generated (quick fix)
- [ ] Fix option 2 generated (robust fix)
- [ ] Fix option 3 generated (best practice fix)
- [ ] Trade-offs analyzed (complexity vs. robustness; see the sketch below)
- [ ] Fix option selected (document the rationale)

### Fix Options Evaluated

- [ ] **Quick fix** - Minimal code change, may not cover all cases
- [ ] **Robust fix** - Handles edge cases, more defensive
- [ ] **Best practice fix** - Follows design patterns, prevents similar bugs

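To make the three options concrete, a Python sketch for the running hypothetical missing-`profile` `KeyError`; these illustrate the trade-off, not the one correct fix.

```python
from dataclasses import dataclass, field

# Option 1 - quick fix: default the missing key at the crash site.
# (Still breaks if "profile" is explicitly None - minimal coverage.)
def get_display_name_quick(user: dict) -> str:
    return user.get("profile", {}).get("display_name", "anonymous")

# Option 2 - robust fix: tolerate any malformed shape, not just a missing key.
def get_display_name_robust(user: dict) -> str:
    profile = user.get("profile")
    if not isinstance(profile, dict):
        return "anonymous"
    name = profile.get("display_name")
    return name if isinstance(name, str) and name else "anonymous"

# Option 3 - best-practice fix: model the data so the invalid state
# cannot be constructed (dataclasses here; Pydantic also fits).
@dataclass
class Profile:
    display_name: str = "anonymous"

@dataclass
class User:
    id: int
    profile: Profile = field(default_factory=Profile)

def get_display_name(user: User) -> str:
    return user.profile.display_name
```
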
## Phase 8: Fix Application

- [ ] Code changes made (Edit or MultiEdit tool)
- [ ] Changes reviewed for correctness
- [ ] No new bugs introduced (code review)
- [ ] Code style consistent (matches project conventions)
- [ ] Type hints added (if applicable)
- [ ] Comments added (explaining why, not what)

### Safety Checks

- [ ] No hardcoded values (use constants or config)
- [ ] No security vulnerabilities introduced
- [ ] No performance regressions introduced
- [ ] Backwards compatibility maintained (if API change)

## Phase 9: Test Verification

- [ ] Failing test now passes (verifies the fix works)
- [ ] Full test suite runs (pytest, vitest)
- [ ] All tests pass (no regressions)
- [ ] Code coverage maintained or improved
- [ ] Integration tests pass (if applicable)
- [ ] Edge cases tested (null, empty, large inputs; see the sketch below)

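A parametrized edge-case sweep in pytest, reusing the hypothetical robust fix from the Phase 7 sketch (repeated inline so the example is self-contained).

```python
import pytest

def get_display_name_robust(user: dict) -> str:
    # The robust fix from the Phase 7 sketch.
    profile = user.get("profile")
    if not isinstance(profile, dict):
        return "anonymous"
    name = profile.get("display_name")
    return name if isinstance(name, str) and name else "anonymous"

@pytest.mark.parametrize(
    "user",
    [
        {"id": 1},                                   # missing profile
        {"id": 2, "profile": None},                  # null profile
        {"id": 3, "profile": {}},                    # empty profile
        {"id": 4, "profile": {"display_name": ""}},  # empty name
        {"id": 5, "profile": {"display_name": "x" * 10_000}},  # huge input
    ],
)
def test_display_name_never_raises(user):
    # The fix should return a string for every edge case, never raise.
    assert isinstance(get_display_name_robust(user), str)
```
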
### Test Quality

- [ ] Test is clear and readable
- [ ] Test documents expected behavior
- [ ] Test will catch regressions
- [ ] Test runs quickly (< 1 second)

## Phase 10: Root Cause Analysis

- [ ] 5 Whys analysis performed (see the example below)
- [ ] True root cause identified (not just the symptom)
- [ ] Contributing factors identified
- [ ] Timeline reconstructed (what happened when)
- [ ] RCA document created (using the template)

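A hypothetical 5 Whys chain for the running missing-`profile` example, to show the shape of the analysis; every answer below is invented for illustration.

1. Why did requests fail? A `KeyError: 'profile'` was raised in the lookup code.
2. Why was the key missing? Some user records are created without a profile.
3. Why are they created without one? The signup flow gained an optional "skip profile" step.
4. Why didn't the lookup code handle that? It predates the signup change and assumed the key.
5. Why wasn't the mismatch caught earlier? No contract test ties the signup output to the lookup input.

Root cause: the data contract between signup and lookup changed without a test enforcing it, so the symptom is the `KeyError` but the fix must include the missing contract test.
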
### RCA Document Contents

- [ ] Error summary (what, where, when, impact)
- [ ] Timeline of events
- [ ] Investigation steps documented
- [ ] Root cause clearly stated
- [ ] Applied fix documented (with a code snippet)
- [ ] Prevention strategy documented

## Phase 11: Prevention Strategy

- [ ] Immediate prevention: Unit test added (prevents this specific bug)
- [ ] Short-term prevention: Integration tests added (prevents similar bugs)
- [ ] Long-term prevention: Architecture changes proposed (prevents a class of bugs)
- [ ] Monitoring added: Alert created (detects recurrence)
- [ ] Documentation updated: Runbook created (guides future debugging)

### Prevention Levels

- [ ] **Test Coverage** - Tests prevent regression
- [ ] **Type Safety** - Type hints catch errors at development time
- [ ] **Input Validation** - Validates data early (Pydantic, zod; see the sketch below)
- [ ] **Error Handling** - Graceful degradation
- [ ] **Monitoring** - Detects issues quickly
- [ ] **Documentation** - Team learns from the incident

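A minimal sketch of the input-validation level, assuming Pydantic; the model and field names are illustrative.

```python
from pydantic import BaseModel, ValidationError

class UserIn(BaseModel):
    id: int
    display_name: str = "anonymous"

# Malformed input is rejected at the boundary, long before it can
# surface as a confusing KeyError or TypeError deep in the call stack.
try:
    UserIn(id="not-a-number")
except ValidationError as err:
    print(err)
```
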
## Phase 12: Deploy & Monitor

### Pre-Deployment

- [ ] Fix tested in staging environment
- [ ] Performance impact assessed (no significant regression)
- [ ] Security review completed (if security-related bug)
- [ ] Deployment plan created (gradual rollout, rollback plan)
- [ ] Stakeholders notified (if high-impact bug)

### Deployment

- [ ] Fix deployed to staging first
- [ ] Staging verification successful
- [ ] Fix deployed to production (gradual rollout if possible)
- [ ] Deployment monitoring active (logs, metrics, traces)

### Post-Deployment

- [ ] Error logs monitored (1 hour post-deploy)
- [ ] Error rate confirmed at zero or significantly reduced (see the sketch below)
- [ ] No new errors introduced
- [ ] No performance degradation
- [ ] User reports checked (customer support, social media)

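A minimal post-deploy check against the Prometheus HTTP API, sketched in Python; the Prometheus URL and the metric/label names are assumptions to replace with what your service actually exports.

```python
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus:9090"  # hypothetical Prometheus endpoint
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'

def error_rate() -> float:
    """Run an instant query and return the current 5xx rate (req/s)."""
    url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    results = data["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

print("5xx rate:", error_rate())
```
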
### Monitoring Duration

- [ ] 1 hour: Active monitoring (logs, errors, metrics)
- [ ] 24 hours: Passive monitoring (alerting enabled)
- [ ] 1 week: Review error trends (ensure no recurrence)

## Phase 13: Documentation & Learning

- [ ] Error pattern database updated (add new pattern if discovered)
- [ ] Team notified of fix (Slack, email)
- [ ] Postmortem conducted (if SEV1 or SEV2)
- [ ] Lessons learned documented
- [ ] Similar code locations reviewed (apply fix broadly if needed)
- [ ] Architecture improvements proposed (if needed)

### Knowledge Sharing

- [ ] RCA document shared with team
- [ ] Runbook created or updated
- [ ] Presentation given (if interesting or impactful bug)
- [ ] Blog post written (if educational value)

## Critical Validations

- [ ] Bug reliably reproduced before fixing
- [ ] Fix verified with a passing test
- [ ] No regressions introduced
- [ ] Root cause identified (not just the symptom fixed)
- [ ] Prevention strategy implemented
- [ ] Monitoring in place to detect recurrence
- [ ] Documentation complete (RCA, runbook)
- [ ] Team learns from the incident

## Debugging Anti-Patterns (Avoid These)

- [X] Random code changes without a hypothesis
- [X] Adding print statements without a plan
- [X] Debugging production directly (use staging)
- [X] Ignoring error messages or stack traces
- [X] Not writing tests to verify the fix
- [X] Fixing symptoms instead of the root cause
- [X] Skipping the reproduction step
- [X] Not documenting the investigation
- [X] Not learning from mistakes (no RCA)
- [X] Working alone for > 30 minutes when stuck

## When to Escalate

**Escalate immediately if**:

- Production is down (SEV1)
- You're stuck for > 30 minutes with no progress
- Bug is in unfamiliar code/system
- Security vulnerability suspected
- Data corruption suspected
- Multiple systems affected

**Who to escalate to**:

- **incident-responder** - Production SEV1/SEV2
- **performance-optimizer** - Performance bugs
- **security-analyzer** - Security vulnerabilities
- **data-validator** - Data validation errors
- **Senior engineer** - Stuck for > 30 min
- **On-call engineer** - Outside business hours

## Success Criteria

- [ ] Bug fixed and verified with a test
- [ ] Root cause identified and documented
- [ ] Prevention strategy implemented
- [ ] Team learns from the incident
- [ ] Similar bugs prevented in the future
- [ ] Documentation complete and accurate
- [ ] Debugging completed in reasonable time (< 2 hours for SEV3)