Systematic Debugging Checklist

Use when debugging errors, exceptions, or unexpected behavior.

Phase 1: Triage (2-5 minutes)

  • Error message captured completely
  • Stack trace obtained (full, not truncated)
  • Environment identified (production, staging, development)
  • Severity assessed (SEV1, SEV2, SEV3, SEV4)
  • Error frequency determined (one-time, intermittent, consistent)
  • First occurrence timestamp identified
  • Recent changes reviewed (deployments, config changes)
  • Production impact assessed (users affected, revenue impact)

Triage Decision

  • SEV1 (Production Down) - Escalate to incident-responder immediately
  • SEV2 (Degraded) - Quick investigation (10 min max), then escalate if unresolved
  • SEV3 (Bug) - Continue with full smart-debug workflow
  • SEV4 (Enhancement) - Document and queue for later

Phase 2: Stack Trace Analysis

  • Error type identified (TypeError, ValueError, KeyError, etc.)
  • Error message parsed and understood
  • Call stack extracted (all frames with file, line, function)
  • Root file identified (where the error originated, not just where it propagated)
  • Root line number identified
  • Stdlib/third-party frames filtered out
  • Related files identified (files in call stack)
  • Likely cause predicted using pattern matching

Stack Trace Quality

  • Stack trace is complete (not truncated)
  • Source maps applied (for minified JavaScript)
  • Line numbers accurate (code and deployed version match)
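
A minimal sketch of the root-frame extraction step above, using Python's built-in traceback module. The project-path filter and the returned fields are assumptions about project layout, not part of this skill.

```python
import traceback


def extract_root_frame(exc: BaseException, project_prefix: str = "src/") -> dict:
    """Return the deepest frame inside project code (the likely origin),
    skipping stdlib and third-party frames by path prefix (assumed layout)."""
    frames = traceback.extract_tb(exc.__traceback__)
    project_frames = [f for f in frames if project_prefix in f.filename]
    root = (project_frames or frames)[-1]  # deepest frame = where the error was raised
    return {
        "error_type": type(exc).__name__,
        "message": str(exc),
        "file": root.filename,
        "line": root.lineno,
        "function": root.name,
    }
```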

Phase 3: Pattern Matching

  • Error pattern database searched
  • Matching error pattern found (or "unknown")
  • Root cause hypothesis generated
  • Fix template identified
  • Prevention strategy identified
  • Similar historical bugs reviewed (if available)

Common Patterns Checked

  • Null pointer / NoneType errors
  • Type mismatch errors
  • Index out of range errors
  • Missing dictionary keys
  • Module/import errors
  • Database connection errors
  • API contract violations
  • Concurrency errors
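
A minimal sketch of what a pattern-database lookup could look like; the entries and the match-by-error-type approach are illustrative assumptions, not the skill's actual data format.

```python
# Hypothetical pattern database keyed by error type; real entries would be richer.
ERROR_PATTERNS = {
    "KeyError": {
        "likely_cause": "Dictionary accessed with a key that may be absent",
        "fix_template": "Use dict.get(key, default) or validate the key first",
        "prevention": "Validate input payloads at the boundary (schema validation)",
    },
    "AttributeError": {
        "likely_cause": "Attribute accessed on None or on the wrong type",
        "fix_template": "Add a None/type check before the attribute access",
        "prevention": "Type hints plus static analysis (mypy, pyright)",
    },
}


def match_pattern(error_type: str) -> dict:
    """Return the matching pattern, or an 'unknown' placeholder."""
    return ERROR_PATTERNS.get(error_type, {"likely_cause": "unknown"})
```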

Phase 4: Code Inspection

  • Root file read completely
  • Problematic line identified
  • Context examined (5 lines before/after)
  • Function signature examined
  • Variable types inferred
  • Data flow traced (inputs to problematic line)
  • Assumptions identified (missing null checks, missing type validations)

Code Quality Check

  • Tests exist for this code path (yes/no)
  • Code has type annotations (TypeScript types, Python type hints)
  • Code has input validation
  • Code has error handling

Phase 5: Observability Investigation

Log Analysis

  • Logs queried for error occurrences
  • Error frequency calculated (per hour, per day)
  • First occurrence timestamp confirmed
  • Recent occurrences reviewed (last 10)
  • Affected users identified (user IDs from logs)
  • Error correlation checked (other errors at same time)
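
A minimal sketch of the per-hour frequency calculation above, assuming JSON-lines logs; the timestamp and error_type field names are assumptions about the log format.

```python
import json
from collections import Counter
from datetime import datetime


def error_frequency(log_path: str, error_type: str) -> Counter:
    """Count matching errors per hour from a JSON-lines log file.

    Assumes each line has 'timestamp' (ISO 8601, no trailing 'Z') and
    'error_type' fields -- adjust to the real log schema.
    """
    per_hour: Counter = Counter()
    with open(log_path) as fh:
        for line in fh:
            event = json.loads(line)
            if event.get("error_type") != error_type:
                continue
            ts = datetime.fromisoformat(event["timestamp"])
            per_hour[ts.strftime("%Y-%m-%d %H:00")] += 1
    return per_hour
```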

Metrics Analysis

  • Error rate queried (Prometheus, Cloudflare Analytics)
  • Error spike identified (yes/no, when)
  • Correlation with traffic spike checked
  • Correlation with deployment checked
  • Resource utilization checked (CPU, memory, connections)

Trace Analysis

  • Trace ID extracted from logs
  • Distributed trace viewed (Jaeger, Zipkin)
  • Span timings analyzed
  • Upstream/downstream services checked
  • Trace context propagation verified

Phase 6: Reproduce Locally

  • Test environment set up (matches production config)
  • Input data identified (from logs or user report)
  • Reproduction steps documented
  • Error reproduced locally (consistent reproduction)
  • Minimal reproduction case created (simplest input that triggers error)
  • Failing test case written (pytest, vitest; see the sketch after this phase)
  • Test runs and fails as expected

Reproduction Quality

  • Reproduction is reliable (100% reproducible)
  • Reproduction is minimal (fewest steps possible)
  • Test is isolated (no external dependencies if possible)
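
A sketch of a minimal failing test for a hypothetical missing-key bug; the app.users module, get_display_name function, and payload shape are all illustrative.

```python
# test_user_display_name.py -- hypothetical regression test for a KeyError
# reproduced from production logs.
from app.users import get_display_name  # hypothetical module under test


def test_display_name_missing_nickname_key():
    """Minimal reproduction: a payload without 'nickname' raised KeyError."""
    payload = {"id": 42, "email": "user@example.com"}  # no 'nickname' key
    # Fails before the fix (KeyError); passes once the fix falls back to email.
    assert get_display_name(payload) == "user@example.com"
```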

Phase 7: Fix Generation

  • Fix hypothesis generated (based on pattern match and code inspection)
  • Fix option 1 generated (quick fix)
  • Fix option 2 generated (robust fix)
  • Fix option 3 generated (best practice fix)
  • Trade-offs analyzed (complexity vs. robustness)
  • Fix option selected (document rationale)

Fix Options Evaluated

  • Quick fix - Minimal code change, may not cover all cases
  • Robust fix - Handles edge cases, more defensive
  • Best practice fix - Follows design patterns, prevents similar bugs
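
A sketch of how the three fix options might differ for the hypothetical missing-key bug used above; all function and field names are illustrative.

```python
# Hypothetical buggy line for the missing-key example: `return payload["nickname"]`.
from dataclasses import dataclass
from typing import Optional


# Option 1 -- quick fix: fall back when the key is missing or empty.
def get_display_name_quick(payload: dict) -> str:
    return payload.get("nickname") or payload["email"]


# Option 2 -- robust fix: tolerate missing, empty, and non-string values.
def get_display_name_robust(payload: dict) -> str:
    nickname = payload.get("nickname")
    if isinstance(nickname, str) and nickname.strip():
        return nickname
    return payload.get("email", "unknown user")


# Option 3 -- best-practice fix: give the payload a real type at the boundary,
# so malformed data is rejected once instead of handled at every call site.
@dataclass
class User:
    email: str
    nickname: Optional[str] = None

    @property
    def display_name(self) -> str:
        return self.nickname or self.email
```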

Phase 8: Fix Application

  • Code changes made (Edit or MultiEdit tool)
  • Changes reviewed for correctness
  • No new bugs introduced (code review)
  • Code style consistent (matches project conventions)
  • Type hints added (if applicable)
  • Comments added (explaining why, not what)

Safety Checks

  • No hardcoded values (use constants or config)
  • No security vulnerabilities introduced
  • No performance regressions introduced
  • Backwards compatibility maintained (if API change)

Phase 9: Test Verification

  • Failing test now passes (verify fix works)
  • Full test suite runs (pytest, vitest)
  • All tests pass (no regressions)
  • Code coverage maintained or improved
  • Integration tests pass (if applicable)
  • Edge cases tested (null, empty, large inputs; see the parametrized sketch below)

Test Quality

  • Test is clear and readable
  • Test documents expected behavior
  • Test will catch regressions
  • Test runs quickly (<1 second)
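
A sketch of edge-case coverage for the same hypothetical fix, using pytest.mark.parametrize; the inputs and expected values are illustrative.

```python
import pytest

from app.users import get_display_name  # hypothetical module under test


@pytest.mark.parametrize(
    "payload, expected",
    [
        ({"email": "a@b.co", "nickname": "Ada"}, "Ada"),                # happy path
        ({"email": "a@b.co"}, "a@b.co"),                                # missing key
        ({"email": "a@b.co", "nickname": ""}, "a@b.co"),                # empty value
        ({"email": "a@b.co", "nickname": None}, "a@b.co"),              # explicit null
        ({"email": "a@b.co", "nickname": "x" * 10_000}, "x" * 10_000),  # large input
    ],
)
def test_display_name_edge_cases(payload, expected):
    assert get_display_name(payload) == expected
```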

Phase 10: Root Cause Analysis

  • 5 Whys analysis performed
  • True root cause identified (not just symptom)
  • Contributing factors identified
  • Timeline reconstructed (what happened when)
  • RCA document created (using template)
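
A hypothetical 5 Whys chain for the missing-key example used earlier, showing how the analysis moves past the line that failed:

  • Why did the request fail? get_display_name raised KeyError on "nickname".
  • Why was the key missing? A newer signup flow creates users without a nickname.
  • Why did the code assume the key exists? The handler reads the raw payload with no schema validation.
  • Why is there no validation? The handler predates the project's validation conventions.
  • Why was this not caught earlier? No test covers users created by the newer flow.

Root cause in this example: unvalidated payload shape at the service boundary; the failing line is only the symptom.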

RCA Document Contents

  • Error summary (what, where, when, impact)
  • Timeline of events
  • Investigation steps documented
  • Root cause clearly stated
  • Fix applied documented (with code snippet)
  • Prevention strategy documented

Phase 11: Prevention Strategy

  • Immediate prevention: Unit test added (prevents this specific bug)
  • Short-term prevention: Integration tests added (prevents similar bugs)
  • Long-term prevention: Architecture changes proposed (prevents class of bugs)
  • Monitoring added: Alert created (detects recurrence)
  • Documentation updated: Runbook created (guides future debugging)

Prevention Levels

  • Test Coverage - Tests prevent regression
  • Type Safety - Type hints catch errors at dev time
  • Input Validation - Validates data early (Pydantic, zod; see the sketch after this list)
  • Error Handling - Graceful degradation
  • Monitoring - Detects issues quickly
  • Documentation - Team learns from incident
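
A minimal sketch of the input-validation level using Pydantic, as suggested above; the model continues the hypothetical user payload example.

```python
import logging
from typing import Optional

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)


class UserPayload(BaseModel):
    """Validate user data at the boundary so a missing or malformed field fails
    fast with a clear message instead of raising KeyError deep in business logic."""

    email: str
    nickname: Optional[str] = None


def parse_user(raw: dict) -> Optional[UserPayload]:
    try:
        return UserPayload(**raw)
    except ValidationError as exc:
        logger.warning("Rejected invalid user payload: %s", exc)
        return None
```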

Phase 12: Deploy & Monitor

Pre-Deployment

  • Fix tested in staging environment
  • Performance impact assessed (no significant regression)
  • Security review completed (if security-related bug)
  • Deployment plan created (gradual rollout, rollback plan)
  • Stakeholders notified (if high-impact bug)

Deployment

  • Fix deployed to staging first
  • Staging verification successful
  • Fix deployed to production (gradual rollout if possible)
  • Deployment monitoring active (logs, metrics, traces)

Post-Deployment

  • Error logs monitored (1 hour post-deploy)
  • Error rate confirmed to have dropped to zero (or significantly reduced)
  • No new errors introduced
  • No performance degradation
  • User reports checked (customer support, social media)

Monitoring Duration

  • 1 hour: Active monitoring (logs, errors, metrics)
  • 24 hours: Passive monitoring (alerting enabled)
  • 1 week: Review error trends (ensure no recurrence)

Phase 13: Documentation & Learning

  • Error pattern database updated (add new pattern if discovered)
  • Team notified of fix (Slack, email)
  • Postmortem conducted (if SEV1 or SEV2)
  • Lessons learned documented
  • Similar code locations reviewed (apply fix broadly if needed)
  • Architecture improvements proposed (if needed)

Knowledge Sharing

  • RCA document shared with team
  • Runbook created or updated
  • Presentation given (if interesting or impactful bug)
  • Blog post written (if educational value)

Critical Validations

  • Bug reliably reproduced before fixing
  • Fix verified with passing test
  • No regressions introduced
  • Root cause identified (not just symptom fixed)
  • Prevention strategy implemented
  • Monitoring in place to detect recurrence
  • Documentation complete (RCA, runbook)
  • Team learns from incident

Debugging Anti-Patterns (Avoid These)

  • Random code changes without hypothesis
  • Adding print statements without a plan
  • Debugging production directly (use staging)
  • Ignoring error messages or stack traces
  • Not writing tests to verify fix
  • Fixing symptoms instead of root cause
  • Skipping reproduction step
  • Not documenting investigation
  • Not learning from mistakes (no RCA)
  • Working alone for > 30 min when stuck

When to Escalate

Escalate immediately if:

  • Production is down (SEV1)
  • You're stuck for > 30 minutes with no progress
  • Bug is in unfamiliar code/system
  • Security vulnerability suspected
  • Data corruption suspected
  • Multiple systems affected

Who to escalate to:

  • incident-responder - Production SEV1/SEV2
  • performance-optimizer - Performance bugs
  • security-analyzer - Security vulnerabilities
  • data-validator - Data validation errors
  • Senior engineer - Stuck for > 30 min
  • On-call engineer - Outside business hours

Success Criteria

  • Bug fixed and verified with test
  • Root cause identified and documented
  • Prevention strategy implemented
  • Team learns from incident
  • Similar bugs prevented in future
  • Documentation complete and accurate
  • Debugging completed in a reasonable time (< 2 hours for SEV3)