Systematic Debugging Checklist
Use when debugging errors, exceptions, or unexpected behavior.
Phase 1: Triage (2-5 minutes)
- Error message captured completely
- Stack trace obtained (full, not truncated)
- Environment identified (production, staging, development)
- Severity assessed (SEV1, SEV2, SEV3, SEV4)
- Error frequency determined (one-time, intermittent, consistent)
- First occurrence timestamp identified
- Recent changes reviewed (deployments, config changes)
- Production impact assessed (users affected, revenue impact)
Triage Decision
- SEV1 (Production Down) - Escalate to incident-responder immediately
- SEV2 (Degraded) - Quick investigation (10 min max), then escalate if unresolved
- SEV3 (Bug) - Continue with full smart-debug workflow
- SEV4 (Enhancement) - Document and queue for later
Phase 2: Stack Trace Analysis
- Error type identified (TypeError, ValueError, KeyError, etc.)
- Error message parsed and understood
- Call stack extracted (all frames with file, line, function)
- Root file identified (where error originated, not propagated)
- Root line number identified
- Stdlib/third-party frames filtered out (see the sketch at the end of this phase)
- Related files identified (files in call stack)
- Likely cause predicted using pattern matching
Stack Trace Quality
- Stack trace is complete (not truncated)
- Source maps applied (for minified JavaScript)
- Line numbers accurate (local code matches the deployed version)
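A minimal sketch of automated frame filtering using Python's standard traceback module; `project_root` and the demo error below are assumptions about your source layout, not part of the checklist:

```python
import os
import traceback

def find_root_frame(exc: BaseException, project_root: str):
    """Return the innermost traceback frame inside project code.

    Frames from the stdlib and site-packages are skipped so the result
    points at where the error originated, not where it propagated.
    """
    root = None
    for frame in traceback.extract_tb(exc.__traceback__):
        in_project = frame.filename.startswith(project_root)
        if in_project and "site-packages" not in frame.filename:
            root = frame  # keep overwriting: the innermost match wins
    return root  # FrameSummary with .filename, .lineno, .name, .line

try:
    {}["user_id"]  # stand-in for the failing call
except KeyError as exc:
    here = os.path.dirname(os.path.abspath(__file__))
    frame = find_root_frame(exc, project_root=here)
    if frame is not None:
        print(f"{type(exc).__name__} at {frame.filename}:{frame.lineno} in {frame.name}")
```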
Phase 3: Pattern Matching
- Error pattern database searched (a sketch follows the pattern list below)
- Matching error pattern found (or "unknown")
- Root cause hypothesis generated
- Fix template identified
- Prevention strategy identified
- Similar historical bugs reviewed (if available)
Common Patterns Checked
- Null pointer / NoneType errors
- Type mismatch errors
- Index out of range errors
- Missing dictionary keys
- Module/import errors
- Database connection errors
- API contract violations
- Concurrency errors
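A sketch of what a tiny pattern database can look like, matching on error type plus message text; the entries below are illustrative, not a real database:

```python
import re

# Each entry: (error type, message regex) -> (root cause hypothesis, fix template)
PATTERNS = {
    ("TypeError", r"'NoneType' object"): (
        "A value expected to be set was None (missing null check upstream)",
        "Add a None guard, or return early with a clear error",
    ),
    ("KeyError", r".*"): (
        "Dictionary accessed with a key that may be absent",
        "Use dict.get() with a default, or validate the payload first",
    ),
    ("IndexError", r"out of range"): (
        "Sequence shorter than the code assumes",
        "Check len() before indexing, or iterate instead of indexing",
    ),
}

def match_pattern(error_type: str, message: str):
    for (etype, msg_re), (hypothesis, fix) in PATTERNS.items():
        if etype == error_type and re.search(msg_re, message):
            return hypothesis, fix
    return None  # "unknown" -- fall back to manual analysis

print(match_pattern("KeyError", "'user_id'"))
```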
Phase 4: Code Inspection
- Root file read completely
- Problematic line identified
- Context examined (5 lines before/after)
- Function signature examined
- Variable types inferred
- Data flow traced (inputs to problematic line)
- Hidden assumptions identified (e.g., missing null checks or type validation)
Code Quality Check
- Tests exist for this code path (yes/no)
- Code has type annotations (TypeScript types, Python type hints)
- Code has input validation
- Code has error handling
Phase 5: Observability Investigation
Log Analysis
- Logs queried for error occurrences
- Error frequency calculated (per hour, per day; see the sketch below)
- First occurrence timestamp confirmed
- Recent occurrences reviewed (last 10)
- Affected users identified (user IDs from logs)
- Error correlation checked (other errors at same time)
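A sketch of frequency analysis, assuming JSON-lines logs with `timestamp`, `level`, and `message` fields; the field names and ISO 8601 timestamps are assumptions, so adjust to your log schema:

```python
import json
from collections import Counter
from datetime import datetime

def error_frequency(log_path: str, needle: str) -> Counter:
    """Count matching ERROR lines per hour in a JSON-lines log file."""
    per_hour: Counter = Counter()
    with open(log_path) as f:
        for line in f:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            if entry.get("level") == "ERROR" and needle in entry.get("message", ""):
                ts = datetime.fromisoformat(entry["timestamp"])
                per_hour[ts.strftime("%Y-%m-%d %H:00")] += 1
    return per_hour

# Usage: when did the error first appear, and how often does it occur?
# counts = error_frequency("app.log", "KeyError: 'user_id'")
# first_seen = min(counts) if counts else None
```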
Metrics Analysis
- Error rate queried (Prometheus, Cloudflare Analytics; sketch below)
- Error spike identified (yes/no, when)
- Correlation with traffic spike checked
- Correlation with deployment checked
- Resource utilization checked (CPU, memory, connections)
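If Prometheus is the metrics backend, error rate can be pulled over its standard `/api/v1/query` HTTP API; the metric name, labels, and server address below are placeholders for whatever your services actually export:

```python
import requests

PROM_URL = "http://prometheus:9090"  # assumed address

def error_rate(window: str = "5m") -> list:
    """Query the 5xx request rate via Prometheus's query API."""
    # http_requests_total and the status label are placeholder names.
    query = f'sum(rate(http_requests_total{{status=~"5.."}}[{window}]))'
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]
```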
Trace Analysis
- Trace ID extracted from logs
- Distributed trace viewed (Jaeger, Zipkin)
- Span timings analyzed
- Upstream/downstream services checked
- Trace context propagation verified
Phase 6: Reproduce Locally
- Test environment set up (matches production config)
- Input data identified (from logs or user report)
- Reproduction steps documented
- Error reproduced locally (consistent reproduction)
- Minimal reproduction case created (simplest input that triggers error)
- Failing test case written (pytest, vitest; see the sketch at the end of this phase)
- Test runs and fails as expected
Reproduction Quality
- Reproduction is reliable (100% reproducible)
- Reproduction is minimal (fewest steps possible)
- Test is isolated (no external dependencies if possible)
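A minimal failing-test sketch in pytest. The module, function, and payload are hypothetical; the point is to encode the exact failing input before touching the fix:

```python
import pytest

from myapp.users import parse_user  # hypothetical module under test

def test_parse_user_without_user_id():
    """Reproduces the production KeyError: some payloads omit 'user_id'.

    Must fail before the fix and pass after it.
    """
    payload = {"name": "alice"}  # minimal input that triggers the error
    # Before the fix this raises an unhandled KeyError; after the fix
    # we expect a clear, typed validation error instead.
    with pytest.raises(ValueError, match="user_id"):
        parse_user(payload)
```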
Phase 7: Fix Generation
- Fix hypothesis generated (based on pattern match and code inspection)
- Fix option 1 generated (quick fix)
- Fix option 2 generated (robust fix)
- Fix option 3 generated (best practice fix)
- Trade-offs analyzed (complexity vs. robustness)
- Fix option selected (document rationale)
Fix Options Evaluated
- Quick fix - Minimal code change, may not cover all cases
- Robust fix - Handles edge cases, more defensive
- Best practice fix - Follows design patterns, prevents similar bugs
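As a concrete illustration, the three options for a hypothetical KeyError on a missing `user_id` field (function and field names are assumptions):

```python
from dataclasses import dataclass

# Option 1 -- quick fix: minimal change, but may hide the problem downstream.
def parse_user_quick(payload: dict) -> str:
    return payload.get("user_id", "")

# Option 2 -- robust fix: fails fast with a clear, catchable error.
def parse_user_robust(payload: dict) -> str:
    user_id = payload.get("user_id")
    if not user_id:
        raise ValueError("payload missing required field 'user_id'")
    return user_id

# Option 3 -- best-practice fix: validate the whole payload at the boundary
# (a dataclass here; Pydantic or similar works just as well).
@dataclass
class UserPayload:
    user_id: str
    name: str = ""

    def __post_init__(self) -> None:
        if not self.user_id:
            raise ValueError("user_id must be non-empty")
```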
Phase 8: Fix Application
- Code changes made (Edit or MultiEdit tool)
- Changes reviewed for correctness
- No new bugs introduced (code review)
- Code style consistent (matches project conventions)
- Type hints added (if applicable)
- Comments added (explaining why, not what)
Safety Checks
- No hardcoded values (use constants or config)
- No security vulnerabilities introduced
- No performance regressions introduced
- Backwards compatibility maintained (if API change)
Phase 9: Test Verification
- Failing test now passes (verify fix works)
- Full test suite runs (pytest, vitest)
- All tests pass (no regressions)
- Code coverage maintained or improved
- Integration tests pass (if applicable)
- Edge cases tested (null, empty, large inputs; see the parametrized sketch at the end of this phase)
Test Quality
- Test is clear and readable
- Test documents expected behavior
- Test will catch regressions
- Test runs quickly (<1 second)
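The edge cases above can be covered in one parametrized pytest test; `parse_user` is the same hypothetical function as in the earlier sketches:

```python
import pytest

from myapp.users import parse_user  # hypothetical module under test

@pytest.mark.parametrize("payload", [
    {},                           # empty input
    {"user_id": None},            # null value
    {"user_id": ""},              # empty string
    {"user_id": "x" * 10_000},    # unusually large input
])
def test_parse_user_edge_cases(payload):
    # Each edge case should either succeed or raise a clear validation
    # error -- never crash with an unhandled KeyError/TypeError.
    try:
        result = parse_user(payload)
        assert isinstance(result, str)
    except ValueError:
        pass  # explicit validation failure is acceptable
```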
Phase 10: Root Cause Analysis
- 5 Whys analysis performed (worked example below)
- True root cause identified (not just symptom)
- Contributing factors identified
- Timeline reconstructed (what happened when)
- RCA document created (using template)
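Example 5 Whys, for the hypothetical missing user_id bug used in the sketches above:
- Why did the request crash? A KeyError on 'user_id'.
- Why was 'user_id' missing? Guest sessions from the mobile client omit it.
- Why wasn't the payload validated? The endpoint trusts client input.
- Why does the endpoint trust client input? There is no validation layer at the API boundary.
- Why is there no validation layer? Input validation was never made a team standard for new endpoints (the root cause).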
RCA Document Contents
- Error summary (what, where, when, impact)
- Timeline of events
- Investigation steps documented
- Root cause clearly stated
- Fix applied documented (with code snippet)
- Prevention strategy documented
Phase 11: Prevention Strategy
- Immediate prevention: Unit test added (prevents this specific bug)
- Short-term prevention: Integration tests added (prevents similar bugs)
- Long-term prevention: Architecture changes proposed (prevents class of bugs)
- Monitoring added: Alert created (detects recurrence)
- Documentation updated: Runbook created (guides future debugging)
Prevention Levels
- Test Coverage - Tests prevent regression
- Type Safety - Type hints catch errors at dev time
- Input Validation - Validates data early (Pydantic, zod; sketch below)
- Error Handling - Graceful degradation
- Monitoring - Detects issues quickly
- Documentation - Team learns from incident
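Since the checklist names Pydantic, a minimal boundary-validation sketch (Pydantic v2 syntax; the model and field names are assumptions):

```python
from pydantic import BaseModel, field_validator

class UserPayload(BaseModel):
    user_id: str
    name: str = ""

    @field_validator("user_id")
    @classmethod
    def user_id_non_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("user_id must be non-empty")
        return v

# Invalid payloads now fail at the boundary with a clear ValidationError
# instead of surfacing later as a KeyError deep in business logic.
# UserPayload(name="alice")  # -> ValidationError
```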
Phase 12: Deploy & Monitor
Pre-Deployment
- Fix tested in staging environment
- Performance impact assessed (no significant regression)
- Security review completed (if security-related bug)
- Deployment plan created (gradual rollout, rollback plan)
- Stakeholders notified (if high-impact bug)
Deployment
- Fix deployed to staging first
- Staging verification successful
- Fix deployed to production (gradual rollout if possible)
- Deployment monitoring active (logs, metrics, traces)
Post-Deployment
- Error logs monitored (1 hour post-deploy)
- Error rate confirmed at zero (or significantly reduced)
- No new errors introduced
- No performance degradation
- User reports checked (customer support, social media)
Monitoring Duration
- 1 hour: Active monitoring (logs, errors, metrics)
- 24 hours: Passive monitoring (alerting enabled)
- 1 week: Review error trends (ensure no recurrence)
Phase 13: Documentation & Learning
- Error pattern database updated (add new pattern if discovered)
- Team notified of fix (Slack, email)
- Postmortem conducted (if SEV1 or SEV2)
- Lessons learned documented
- Similar code locations reviewed (apply fix broadly if needed)
- Architecture improvements proposed (if needed)
Knowledge Sharing
- RCA document shared with team
- Runbook created or updated
- Presentation given (if the bug was interesting or impactful)
- Blog post written (if educational value)
Critical Validations
- Bug reliably reproduced before fixing
- Fix verified with passing test
- No regressions introduced
- Root cause identified (not just symptom fixed)
- Prevention strategy implemented
- Monitoring in place to detect recurrence
- Documentation complete (RCA, runbook)
- Team learns from incident
Debugging Anti-Patterns (Avoid These)
- Random code changes without hypothesis
- Adding print statements without a plan
- Debugging production directly (use staging)
- Ignoring error messages or stack traces
- Not writing tests to verify fix
- Fixing symptoms instead of root cause
- Skipping reproduction step
- Not documenting investigation
- Not learning from mistakes (no RCA)
- Working alone for > 30 min when stuck
When to Escalate
Escalate immediately if:
- Production is down (SEV1)
- You're stuck for > 30 minutes with no progress
- Bug is in unfamiliar code/system
- Security vulnerability suspected
- Data corruption suspected
- Multiple systems affected
Who to escalate to:
- incident-responder - Production SEV1/SEV2
- performance-optimizer - Performance bugs
- security-analyzer - Security vulnerabilities
- data-validator - Data validation errors
- Senior engineer - Stuck for > 30 min
- On-call engineer - Outside business hours
Success Criteria
- Bug fixed and verified with test
- Root cause identified and documented
- Prevention strategy implemented
- Team learns from incident
- Similar bugs prevented in future
- Documentation complete and accurate
- Debugging completed in reasonable time (< 2 hours for SEV3)