# Systematic Debugging Checklist

**Use when debugging errors, exceptions, or unexpected behavior.**

## Phase 1: Triage (2-5 minutes)

- [ ] Error message captured completely
- [ ] Stack trace obtained (full, not truncated)
- [ ] Environment identified (production, staging, development)
- [ ] Severity assessed (SEV1, SEV2, SEV3, SEV4)
- [ ] Error frequency determined (one-time, intermittent, consistent)
- [ ] First occurrence timestamp identified
- [ ] Recent changes reviewed (deployments, config changes)
- [ ] Production impact assessed (users affected, revenue impact)

### Triage Decision

- [ ] **SEV1 (Production Down)** - Escalate to incident-responder immediately
- [ ] **SEV2 (Degraded)** - Quick investigation (10 min max), then escalate if unresolved
- [ ] **SEV3 (Bug)** - Continue with full smart-debug workflow
- [ ] **SEV4 (Enhancement)** - Document and queue for later

## Phase 2: Stack Trace Analysis

- [ ] Error type identified (TypeError, ValueError, KeyError, etc.)
- [ ] Error message parsed and understood
- [ ] Call stack extracted (all frames with file, line, function)
- [ ] Root file identified (where the error originated, not just where it propagated)
- [ ] Root line number identified
- [ ] Stdlib/third-party frames filtered out (see the sketch below)
- [ ] Related files identified (other files in the call stack)
- [ ] Likely cause predicted using pattern matching

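Where tooling helps: a minimal sketch of frame extraction and stdlib/third-party filtering, assuming Python's standard `traceback` and `sysconfig` modules. The `app_frames` helper and the `KeyError` demo are illustrative, not part of any required tooling.

```python
import sysconfig
import traceback

STDLIB = sysconfig.get_paths()["stdlib"]

def app_frames(exc: BaseException) -> list[traceback.FrameSummary]:
    """Keep only frames from our own code, dropping stdlib/site-packages."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [
        f for f in frames
        if not f.filename.startswith(STDLIB)
        and "site-packages" not in f.filename
    ]

try:
    {}["missing"]  # stand-in for the real failure
except KeyError as exc:
    frames = app_frames(exc)
    for frame in frames:
        print(f"{frame.filename}:{frame.lineno} in {frame.name}")
    root = frames[-1]  # deepest application frame = likely root file/line
    print("root:", root.filename, root.lineno)
```

The deepest surviving frame is usually the right place to start reading, since the frames below it belong to code you don't control.
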
### Stack Trace Quality

- [ ] Stack trace is complete (not truncated)
- [ ] Source maps applied (for minified JavaScript)
- [ ] Line numbers accurate (code and deployed version match)

## Phase 3: Pattern Matching

- [ ] Error pattern database searched (see the sketch after the pattern list below)
- [ ] Matching error pattern found (or "unknown")
- [ ] Root cause hypothesis generated
- [ ] Fix template identified
- [ ] Prevention strategy identified
- [ ] Similar historical bugs reviewed (if available)

### Common Patterns Checked

- [ ] Null pointer / NoneType errors
- [ ] Type mismatch errors
- [ ] Index out of range errors
- [ ] Missing dictionary keys
- [ ] Module/import errors
- [ ] Database connection errors
- [ ] API contract violations
- [ ] Concurrency errors

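A minimal sketch of what such a pattern database can look like, in Python. The patterns, causes, and fix templates below are invented for illustration; a real database would be larger and likely loaded from a file.

```python
# Illustrative error-pattern "database": exception signature -> diagnosis.
PATTERNS = {
    "KeyError": {
        "likely_cause": "dictionary accessed with a key that is not always present",
        "fix_template": "use .get() with a default, or validate the key first",
        "prevention": "validate input shape early (e.g. with Pydantic)",
    },
    "AttributeError: 'NoneType'": {
        "likely_cause": "a function returned None where a value was expected",
        "fix_template": "add a None check, or make the callee raise instead",
        "prevention": "type hints plus a strict type checker",
    },
}

def match_pattern(error_line: str) -> dict | None:
    """Return the first pattern whose signature appears in the error line."""
    for signature, info in PATTERNS.items():
        if signature in error_line:
            return info
    return None  # "unknown" - fall back to manual analysis

print(match_pattern("KeyError: 'user_id'"))
```

Phase 13 feeds this table: every new RCA should add or refine an entry.
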
## Phase 4: Code Inspection

- [ ] Root file read completely
- [ ] Problematic line identified
- [ ] Context examined (5 lines before/after; see the sketch below)
- [ ] Function signature examined
- [ ] Variable types inferred
- [ ] Data flow traced (inputs to the problematic line)
- [ ] Assumptions identified (missing null checks, missing type validations)

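A small helper for the context step, sketched in Python; `show_context` and the file/line in the usage example are hypothetical.

```python
from pathlib import Path

def show_context(path: str, lineno: int, radius: int = 5) -> None:
    """Print the problematic line with `radius` lines of context."""
    lines = Path(path).read_text().splitlines()
    start = max(lineno - 1 - radius, 0)
    for i, line in enumerate(lines[start:lineno + radius], start=start + 1):
        marker = ">>" if i == lineno else "  "  # flag the suspect line
        print(f"{marker} {i:4d} {line}")

show_context("myapp/users.py", 42)  # hypothetical root file and line
```
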
### Code Quality Check

- [ ] Tests exist for this code path (yes/no)
- [ ] Code has type hints (TypeScript, Python type hints)
- [ ] Code has input validation
- [ ] Code has error handling

## Phase 5: Observability Investigation

### Log Analysis

- [ ] Logs queried for error occurrences
- [ ] Error frequency calculated (per hour, per day; see the sketch below)
- [ ] First occurrence timestamp confirmed
- [ ] Recent occurrences reviewed (last 10)
- [ ] Affected users identified (user IDs from logs)
- [ ] Error correlation checked (other errors at the same time)

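A minimal sketch of the frequency calculation, assuming structured JSON-lines logs with `timestamp`, `level`, and `error` fields; adjust the field names to your logging schema.

```python
import json
from collections import Counter
from datetime import datetime

def error_frequency(log_path: str, error_type: str) -> Counter:
    """Count matching ERROR events per hour in a JSON-lines log file."""
    per_hour: Counter = Counter()
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("level") == "ERROR" and error_type in event.get("error", ""):
                ts = datetime.fromisoformat(event["timestamp"])
                per_hour[ts.strftime("%Y-%m-%d %H:00")] += 1
    return per_hour

print(error_frequency("app.log", "KeyError").most_common(5))
```
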
### Metrics Analysis

- [ ] Error rate queried (Prometheus, Cloudflare Analytics)
- [ ] Error spike identified (yes/no, when)
- [ ] Correlation with traffic spike checked
- [ ] Correlation with deployment checked
- [ ] Resource utilization checked (CPU, memory, connections)

### Trace Analysis

- [ ] Trace ID extracted from logs
- [ ] Distributed trace viewed (Jaeger, Zipkin)
- [ ] Span timings analyzed
- [ ] Upstream/downstream services checked
- [ ] Trace context propagation verified

## Phase 6: Reproduce Locally

- [ ] Test environment set up (matches production config)
- [ ] Input data identified (from logs or the user report)
- [ ] Reproduction steps documented
- [ ] Error reproduced locally (consistent reproduction)
- [ ] Minimal reproduction case created (simplest input that triggers the error)
- [ ] Failing test case written (pytest, vitest; see the sketch below)
- [ ] Test runs and fails as expected

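A minimal failing-test sketch in pytest. `get_display_name` and the `user` record are hypothetical stand-ins for whatever the logs point at; in practice you would import the real function rather than define it inline.

```python
def get_display_name(user: dict) -> str:
    # Inline stand-in reproducing the buggy behavior found in Phase 4;
    # in a real test, import the function from your codebase instead.
    return user["profile"]["display_name"]

def test_display_name_handles_missing_profile():
    # Minimal reproduction input recovered from logs: a user record
    # without the optional "profile" key.
    user = {"id": 42}
    # Fails with KeyError before the fix; passes once a fallback exists.
    assert get_display_name(user) == "anonymous"
```
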
### Reproduction Quality

- [ ] Reproduction is reliable (100% reproducible)
- [ ] Reproduction is minimal (fewest steps possible)
- [ ] Test is isolated (no external dependencies if possible)

## Phase 7: Fix Generation

- [ ] Fix hypothesis generated (based on pattern match and code inspection)
- [ ] Fix option 1 generated (quick fix)
- [ ] Fix option 2 generated (robust fix)
- [ ] Fix option 3 generated (best practice fix)
- [ ] Trade-offs analyzed (complexity vs. robustness; see the sketch below)
- [ ] Fix option selected (document the rationale)

### Fix Options Evaluated

- [ ] **Quick fix** - Minimal code change, may not cover all cases
- [ ] **Robust fix** - Handles edge cases, more defensive
- [ ] **Best practice fix** - Follows design patterns, prevents similar bugs

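To make the three options concrete, a Python sketch for the running hypothetical missing-`profile` `KeyError`; these illustrate the trade-off, not the one correct fix.

```python
from dataclasses import dataclass, field

# Option 1 - quick fix: default the missing key at the crash site.
# (Still breaks if "profile" is explicitly None - minimal coverage.)
def get_display_name_quick(user: dict) -> str:
    return user.get("profile", {}).get("display_name", "anonymous")

# Option 2 - robust fix: tolerate any malformed shape, not just a missing key.
def get_display_name_robust(user: dict) -> str:
    profile = user.get("profile")
    if not isinstance(profile, dict):
        return "anonymous"
    name = profile.get("display_name")
    return name if isinstance(name, str) and name else "anonymous"

# Option 3 - best-practice fix: model the data so the invalid state
# cannot be constructed (dataclasses here; Pydantic also fits).
@dataclass
class Profile:
    display_name: str = "anonymous"

@dataclass
class User:
    id: int
    profile: Profile = field(default_factory=Profile)

def get_display_name(user: User) -> str:
    return user.profile.display_name
```
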
## Phase 8: Fix Application

- [ ] Code changes made (Edit or MultiEdit tool)
- [ ] Changes reviewed for correctness
- [ ] No new bugs introduced (code review)
- [ ] Code style consistent (matches project conventions)
- [ ] Type hints added (if applicable)
- [ ] Comments added (explaining why, not what)

### Safety Checks

- [ ] No hardcoded values (use constants or config)
- [ ] No security vulnerabilities introduced
- [ ] No performance regressions introduced
- [ ] Backwards compatibility maintained (if API change)

## Phase 9: Test Verification

- [ ] Failing test now passes (verifies the fix works)
- [ ] Full test suite runs (pytest, vitest)
- [ ] All tests pass (no regressions)
- [ ] Code coverage maintained or improved
- [ ] Integration tests pass (if applicable)
- [ ] Edge cases tested (null, empty, large inputs; see the sketch below)

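A parametrized edge-case sweep in pytest, reusing the hypothetical robust fix from the Phase 7 sketch (repeated inline so the example is self-contained).

```python
import pytest

def get_display_name_robust(user: dict) -> str:
    # The robust fix from the Phase 7 sketch.
    profile = user.get("profile")
    if not isinstance(profile, dict):
        return "anonymous"
    name = profile.get("display_name")
    return name if isinstance(name, str) and name else "anonymous"

@pytest.mark.parametrize(
    "user",
    [
        {"id": 1},                                   # missing profile
        {"id": 2, "profile": None},                  # null profile
        {"id": 3, "profile": {}},                    # empty profile
        {"id": 4, "profile": {"display_name": ""}},  # empty name
        {"id": 5, "profile": {"display_name": "x" * 10_000}},  # huge input
    ],
)
def test_display_name_never_raises(user):
    # The fix should return a string for every edge case, never raise.
    assert isinstance(get_display_name_robust(user), str)
```
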
### Test Quality

- [ ] Test is clear and readable
- [ ] Test documents expected behavior
- [ ] Test will catch regressions
- [ ] Test runs quickly (< 1 second)

## Phase 10: Root Cause Analysis

- [ ] 5 Whys analysis performed (see the example below)
- [ ] True root cause identified (not just the symptom)
- [ ] Contributing factors identified
- [ ] Timeline reconstructed (what happened when)
- [ ] RCA document created (using the template)

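A hypothetical 5 Whys chain for the running missing-`profile` example, to show the shape of the analysis; every answer below is invented for illustration.

1. Why did requests fail? A `KeyError: 'profile'` was raised in the lookup code.
2. Why was the key missing? Some user records are created without a profile.
3. Why are they created without one? The signup flow gained an optional "skip profile" step.
4. Why didn't the lookup code handle that? It predates the signup change and assumed the key.
5. Why wasn't the mismatch caught earlier? No contract test ties the signup output to the lookup input.

Root cause: the data contract between signup and lookup changed without a test enforcing it, so the symptom is the `KeyError` but the fix must include the missing contract test.
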
### RCA Document Contents

- [ ] Error summary (what, where, when, impact)
- [ ] Timeline of events
- [ ] Investigation steps documented
- [ ] Root cause clearly stated
- [ ] Applied fix documented (with a code snippet)
- [ ] Prevention strategy documented

## Phase 11: Prevention Strategy

- [ ] Immediate prevention: Unit test added (prevents this specific bug)
- [ ] Short-term prevention: Integration tests added (prevents similar bugs)
- [ ] Long-term prevention: Architecture changes proposed (prevents a class of bugs)
- [ ] Monitoring added: Alert created (detects recurrence)
- [ ] Documentation updated: Runbook created (guides future debugging)

### Prevention Levels

- [ ] **Test Coverage** - Tests prevent regression
- [ ] **Type Safety** - Type hints catch errors at development time
- [ ] **Input Validation** - Validates data early (Pydantic, zod; see the sketch below)
- [ ] **Error Handling** - Graceful degradation
- [ ] **Monitoring** - Detects issues quickly
- [ ] **Documentation** - Team learns from the incident

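A minimal sketch of the input-validation level, assuming Pydantic; the model and field names are illustrative.

```python
from pydantic import BaseModel, ValidationError

class UserIn(BaseModel):
    id: int
    display_name: str = "anonymous"

# Malformed input is rejected at the boundary, long before it can
# surface as a confusing KeyError or TypeError deep in the call stack.
try:
    UserIn(id="not-a-number")
except ValidationError as err:
    print(err)
```
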
## Phase 12: Deploy & Monitor

### Pre-Deployment

- [ ] Fix tested in staging environment
- [ ] Performance impact assessed (no significant regression)
- [ ] Security review completed (if security-related bug)
- [ ] Deployment plan created (gradual rollout, rollback plan)
- [ ] Stakeholders notified (if high-impact bug)

### Deployment

- [ ] Fix deployed to staging first
- [ ] Staging verification successful
- [ ] Fix deployed to production (gradual rollout if possible)
- [ ] Deployment monitoring active (logs, metrics, traces)

### Post-Deployment

- [ ] Error logs monitored (1 hour post-deploy)
- [ ] Error rate confirmed at zero or significantly reduced (see the sketch below)
- [ ] No new errors introduced
- [ ] No performance degradation
- [ ] User reports checked (customer support, social media)

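A minimal post-deploy check against the Prometheus HTTP API, sketched in Python; the Prometheus URL and the metric/label names are assumptions to replace with what your service actually exports.

```python
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus:9090"  # hypothetical Prometheus endpoint
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'

def error_rate() -> float:
    """Run an instant query and return the current 5xx rate (req/s)."""
    url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    results = data["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

print("5xx rate:", error_rate())
```
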
### Monitoring Duration

- [ ] 1 hour: Active monitoring (logs, errors, metrics)
- [ ] 24 hours: Passive monitoring (alerting enabled)
- [ ] 1 week: Review error trends (ensure no recurrence)

## Phase 13: Documentation & Learning

- [ ] Error pattern database updated (add new pattern if discovered)
- [ ] Team notified of fix (Slack, email)
- [ ] Postmortem conducted (if SEV1 or SEV2)
- [ ] Lessons learned documented
- [ ] Similar code locations reviewed (apply fix broadly if needed)
- [ ] Architecture improvements proposed (if needed)

### Knowledge Sharing

- [ ] RCA document shared with team
- [ ] Runbook created or updated
- [ ] Presentation given (if interesting or impactful bug)
- [ ] Blog post written (if educational value)

## Critical Validations

- [ ] Bug reliably reproduced before fixing
- [ ] Fix verified with a passing test
- [ ] No regressions introduced
- [ ] Root cause identified (not just the symptom fixed)
- [ ] Prevention strategy implemented
- [ ] Monitoring in place to detect recurrence
- [ ] Documentation complete (RCA, runbook)
- [ ] Team learns from the incident

## Debugging Anti-Patterns (Avoid These)

- [X] Random code changes without a hypothesis
- [X] Adding print statements without a plan
- [X] Debugging production directly (use staging)
- [X] Ignoring error messages or stack traces
- [X] Not writing tests to verify the fix
- [X] Fixing symptoms instead of the root cause
- [X] Skipping the reproduction step
- [X] Not documenting the investigation
- [X] Not learning from mistakes (no RCA)
- [X] Working alone for > 30 minutes when stuck

## When to Escalate

**Escalate immediately if**:

- Production is down (SEV1)
- You're stuck for > 30 minutes with no progress
- Bug is in unfamiliar code/system
- Security vulnerability suspected
- Data corruption suspected
- Multiple systems affected

**Who to escalate to**:

- **incident-responder** - Production SEV1/SEV2
- **performance-optimizer** - Performance bugs
- **security-analyzer** - Security vulnerabilities
- **data-validator** - Data validation errors
- **Senior engineer** - Stuck for > 30 min
- **On-call engineer** - Outside business hours

## Success Criteria

- [ ] Bug fixed and verified with a test
- [ ] Root cause identified and documented
- [ ] Prevention strategy implemented
- [ ] Team learns from the incident
- [ ] Similar bugs prevented in the future
- [ ] Documentation complete and accurate
- [ ] Debugging completed in reasonable time (< 2 hours for SEV3)