Initial commit

2025-11-29 18:20:33 +08:00
commit 977fbf5872
27 changed files with 5714 additions and 0 deletions
--- a/skills/debugging/reference/root-cause-framework.md
+++ b/skills/debugging/reference/root-cause-framework.md
@@ -0,0 +1,135 @@
+# Root Cause Analysis Framework
+
+Advanced techniques for identifying fundamental causes rather than symptoms.
+
+## Table of Contents
+
+- [The 5 Whys (Applied to Code)](#the-5-whys-applied-to-code)
+- [Architectural Analysis Method](#architectural-analysis-method)
+- [Data Flow Tracing](#data-flow-tracing)
+- [State Analysis Patterns](#state-analysis-patterns)
+- [Integration Point Analysis](#integration-point-analysis)
+- [Dependency Chain Analysis](#dependency-chain-analysis)
+- [Performance Root Cause Analysis](#performance-root-cause-analysis)
+- [Sequential Thinking Templates](#sequential-thinking-templates)
+
+## The 5 Whys (Applied to Code)
+
+Ask "why" iteratively to drill down from symptom to root cause.
+
+### Example: Null Pointer Exception
+
+1. Why does the null pointer exception occur?
+   → `user.getEmail()` is called on a null user object
+2. Why is the user object null?
+   → `findUserById()` returns null when no user is found
+3. Why does `findUserById()` return null instead of throwing?
+   → Original design used null to indicate "not found"
+4. Why wasn't this caught earlier in the call chain?
+   → Calling code doesn't check for null before using the user
+5. Why doesn't the calling code check for null?
+   → API contract is ambiguous about null as a valid return value
+
+Root cause: Ambiguous API contract leads to inconsistent null handling.
+Proper fix: Define and enforce a clear API contract (Optional/exception/documented null).
+
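One way to make the contract from the example explicit is to offer two lookups with different, documented guarantees. This is a minimal Python sketch, not the original codebase: `User`, `UserNotFoundError`, `get_user_by_id`, and the in-memory `_USERS` store are hypothetical names introduced for illustration, and `find_user_by_id` mirrors the `findUserById()` mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: int
    email: str


class UserNotFoundError(LookupError):
    """Raised when a caller requires a user that does not exist."""


# Hypothetical in-memory store standing in for a real data source.
_USERS = {1: User(id=1, email="ada@example.com")}


def find_user_by_id(user_id: int) -> Optional[User]:
    """Return the user, or None when no user has this id (documented)."""
    return _USERS.get(user_id)


def get_user_by_id(user_id: int) -> User:
    """Return the user, or raise UserNotFoundError. Never returns None."""
    user = find_user_by_id(user_id)
    if user is None:
        raise UserNotFoundError(f"no user with id {user_id}")
    return user
```

Callers that can handle absence use `find_user_by_id` and must check for `None`; callers that cannot use `get_user_by_id` and let the exception propagate.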
+## Architectural Analysis Method
+
+When bugs suggest deeper design issues, analyze architecture systematically.
+
+1. Map components: interactions, data flows, boundaries
+2. Identify assumptions (inputs, state, timing, external systems)
+3. Find assumption mismatches between components
+4. Prefer an architectural fix over a workaround when the problem is systemic
+
+Use `codebase_search` prompts like:
+- "How does ComponentA communicate with ComponentB?"
+- "What data flows from Source to Destination?"
+
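Steps 2 and 3 above can be made concrete by writing each component's assumptions down as data and comparing them. The sketch below is an illustration only: the `find_mismatches` helper and the assumption keys (`timestamp_unit`, `email_nullable`) are hypothetical and not part of any tool named in this document.

```python
from typing import Any, Dict, Tuple


def find_mismatches(
    provides: Dict[str, Any], requires: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {assumption: (required, provided)} for every assumption the
    consumer makes that the producer does not satisfy."""
    return {
        key: (wanted, provides.get(key))
        for key, wanted in requires.items()
        if provides.get(key) != wanted
    }


# Hypothetical example: the producer emits milliseconds, the consumer assumes seconds.
producer = {"timestamp_unit": "ms", "email_nullable": True}
consumer = {"timestamp_unit": "s", "email_nullable": True}
mismatches = find_mismatches(producer, consumer)
```

Here `mismatches` contains only the `timestamp_unit` entry, which points at the boundary where the design assumption breaks.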
+## Data Flow Tracing
+
+Trace transformations to locate where data goes wrong.
+
+- Backward tracing: start at observation point → immediate source → transformation → origin
+- Forward tracing: origin → each transformation → final state
+- At each step, compare expected vs. actual state
+
+Common root causes:
+- Missing validation
+- Incorrect transformation logic
+- Lost context/metadata
+- Race conditions
+- Type/encoding mismatch
+
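Forward tracing with an expected-vs-actual check at each step can be sketched as below. The `first_divergence` helper and the three-stage email pipeline are assumptions made up for this example; the second stage is deliberately wrong so the trace has something to find.

```python
from typing import Any, Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def first_divergence(
    value: Any, stages: List[Stage], expected: List[Any]
) -> Optional[str]:
    """Run value through each stage in order and return the name of the
    first stage whose output differs from the expected output, or None."""
    for (name, transform), want in zip(stages, expected):
        value = transform(value)
        if value != want:
            return name
    return None


# Hypothetical pipeline; "lower" is buggy on purpose (it upper-cases).
stages = [
    ("strip", str.strip),
    ("lower", str.upper),
    ("split", lambda s: s.split("@")),
]
expected = ["Ada@Example.com", "ada@example.com", ["ada", "example.com"]]
culprit = first_divergence("  Ada@Example.com ", stages, expected)
```

The trace stops at the first stage whose output is wrong, so later stages that merely inherit bad input are not blamed.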
+## State Analysis Patterns
+
+Investigate state transitions and invariants.
+
+- Uninitialized state: used before proper setup
+- Stale state: cache invalidation/refresh failures
+- Inconsistent state: related data out of sync (needs atomicity)
+- Invalid state: invariants not enforced (add validation/assertions)
+- Concurrent corruption: missing synchronization/immutability
+
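For the "invalid state" pattern, one common approach is to enforce the invariant at the single place where state changes. A minimal sketch, assuming a hypothetical `Order` class whose allowed transitions are invented for illustration:

```python
class Order:
    """Order whose status may only change along explicitly allowed edges."""

    # Allowed transitions (an assumption for this example).
    _ALLOWED = {
        "new": {"paid", "cancelled"},
        "paid": {"shipped", "cancelled"},
        "shipped": set(),
        "cancelled": set(),
    }

    def __init__(self) -> None:
        self.status = "new"

    def transition(self, new_status: str) -> None:
        """Move to new_status, or raise ValueError if the move is invalid."""
        if new_status not in self._ALLOWED[self.status]:
            raise ValueError(
                f"invalid transition: {self.status} -> {new_status}"
            )
        self.status = new_status
```

With this in place, an invalid transition fails at the point of corruption instead of surfacing later as an unrelated symptom.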
+## Integration Point Analysis
+
+Verify integration contracts at boundaries.
+
+- Data format: actual vs expected
+- Protocol/version: compatibility between both sides and correct usage
+- Timing: sync vs async, timeouts, ordering
+- Error handling: propagation and retries
+- AuthZ/AuthN: credentials, validation, failure behavior
+
+Root cause patterns:
+- Mismatched versions
+- Incomplete error handling
+- Configuration mismatch
+- Network constraints (latency, bandwidth, firewalls)
+
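The "data format: actual vs expected" check can be sketched as a small validator run at the boundary. The `validate_payload` helper and the field names are hypothetical; a real integration would likely use a schema library instead.

```python
from typing import Any, Dict, List


def validate_payload(payload: Dict[str, Any], required: Dict[str, type]) -> List[str]:
    """Return a list of contract violations: missing fields and wrong types."""
    problems = []
    for field_name, expected_type in required.items():
        if field_name not in payload:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected_type):
            actual = type(payload[field_name]).__name__
            problems.append(
                f"{field_name}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


# Hypothetical contract: the remote side sends "id" as a string and omits "email".
contract = {"id": int, "email": str}
problems = validate_payload({"id": "42"}, contract)
```

Logging the returned list at the integration point shows whether a failure is a format mismatch before any downstream code runs.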
+## Dependency Chain Analysis
+
+Map direct, transitive, and hidden dependencies.
+
+- Version conflicts (multiple versions)
+- Missing dependencies (runtime load failures)
+- Initialization order issues
+- Circular dependencies
+
+Use `codebase_search`:
+- "What imports/uses ComponentX?"
+- "What does ComponentX depend on?"
+
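Circular dependencies can be found with a depth-first search over the dependency graph. This is a sketch under the assumption that dependencies have already been collected into a dict mapping each component to the components it depends on; `find_cycle` is a name introduced here.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    unvisited, in_progress, done = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = in_progress
        path.append(node)
        for dep in graph.get(node, ()):
            dep_state = state.get(dep, unvisited)
            if dep_state == in_progress:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == unvisited:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = done
        return None

    for node in graph:
        if state.get(node, unvisited) == unvisited:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For example, `find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})` reports the loop through `a`, `b`, and `c`.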
+## Performance Root Cause Analysis
+
+Identify bottlenecks systematically.
+
+1. Measure first (profile under realistic load)
+2. Check algorithmic complexity and hotspots
+3. Analyze resource usage (CPU, memory, I/O, network)
+4. Classify the cause: algorithm, implementation, resource contention, or external dependency
+
+Fix strategies:
+- Algorithmic improvements
+- Caching/batching
+- Lazy loading
+- Parallelization/asynchronous I/O
+
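"Measure first" can be as simple as wrapping the suspect call in the standard-library profiler. The sketch below uses `cProfile` and `pstats`; the `profile_call` helper and the deliberately quadratic `count_duplicates` function are hypothetical examples, not code from this project.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def count_duplicates(items):
    """Count items that appear more than once (O(n^2) on purpose)."""
    return sum(1 for i, x in enumerate(items) if x in items[:i])


result, report = profile_call(count_duplicates, [1, 2, 1, 3, 2, 1])
```

The report lists the most expensive calls by cumulative time, which is the evidence to collect before choosing any of the fix strategies above.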
+## Sequential Thinking Templates
+
+Use `SequentialThinking:process_thought` to structure complex analysis.
+
+Thought 1 - Problem Definition
+- Symptom, context, confirmed facts, unknowns
+
+Thought 2 - Hypotheses
+- 3–5 candidates, assumptions, likelihood ranking
+
+Thought 3 - Evidence
+- For/against each hypothesis; challenge assumptions
+
+Thought 4 - Selection
+- Pick most likely; rationale; confidence
+
+Thought 5 - Verification
+- Predictions, test plan, alternatives if wrong
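Thoughts 2 through 4 can also be tracked as plain data while working through the template. The sketch below is independent of the `SequentialThinking:process_thought` tool and does not reflect its API; the `Hypothesis` class and its scoring rule (each piece of evidence shifts the estimate by 0.1) are assumptions chosen only to illustrate ranking.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    description: str
    prior: float  # initial likelihood estimate between 0 and 1
    evidence_for: List[str] = field(default_factory=list)
    evidence_against: List[str] = field(default_factory=list)

    def score(self) -> float:
        """Crude heuristic: shift the prior by 0.1 per piece of evidence."""
        shift = 0.1 * (len(self.evidence_for) - len(self.evidence_against))
        return max(0.0, min(1.0, self.prior + shift))


def rank(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Return hypotheses ordered from most to least likely."""
    return sorted(hypotheses, key=lambda h: h.score(), reverse=True)


candidates = [
    Hypothesis("stale cache", 0.6, evidence_against=["cache disabled", "fresh deploy"]),
    Hypothesis("race condition", 0.5, evidence_for=["fails under load", "passes serially"]),
]
best = rank(candidates)[0]
```

Here the race-condition hypothesis ranks first once evidence is counted, even though it started with the lower prior.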