# Root Cause Analysis Framework

Advanced techniques for identifying fundamental causes rather than symptoms.

## Table of Contents

- [The 5 Whys (Applied to Code)](#the-5-whys-applied-to-code)
- [Architectural Analysis Method](#architectural-analysis-method)
- [Data Flow Tracing](#data-flow-tracing)
- [State Analysis Patterns](#state-analysis-patterns)
- [Integration Point Analysis](#integration-point-analysis)
- [Dependency Chain Analysis](#dependency-chain-analysis)
- [Performance Root Cause Analysis](#performance-root-cause-analysis)
- [Sequential Thinking Templates](#sequential-thinking-templates)

## The 5 Whys (Applied to Code)

Ask "why" iteratively to drill down from symptom to root cause.

### Example: Null Pointer Exception

1. Why does the null pointer exception occur?
   → `user.getEmail()` is called on a null user object
2. Why is the user object null?
   → `findUserById()` returns null when no user is found
3. Why does `findUserById()` return null instead of throwing?
   → Original design used null to indicate "not found"
4. Why wasn't this caught earlier in the call chain?
   → Calling code doesn't check for null before using the user
5. Why doesn't the calling code check for null?
   → API contract is ambiguous about null as a valid return value

Root cause: Ambiguous API contract leads to inconsistent null handling.

Proper fix: Define and enforce a clear API contract (Optional/exception/documented null).

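One way to make such a contract explicit is sketched below in Python rather than in the Java-style calls of the example. The names `User`, `UserNotFoundError`, `find_user_by_id`, `get_user_by_id`, and the in-memory `_USERS` store are hypothetical, chosen only for illustration. The sketch offers two entry points so that each caller picks a documented "may be absent" result or an exception.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    user_id: int
    email: str


class UserNotFoundError(LookupError):
    """Raised when a caller requires a user that does not exist."""


# Hypothetical in-memory store standing in for a real data source.
_USERS = {1: User(1, "ada@example.com")}


def find_user_by_id(user_id: int) -> Optional[User]:
    """Return the user, or None if absent. None is a documented result."""
    return _USERS.get(user_id)


def get_user_by_id(user_id: int) -> User:
    """Return the user, or raise UserNotFoundError. Never returns None."""
    user = find_user_by_id(user_id)
    if user is None:
        raise UserNotFoundError(f"no user with id {user_id}")
    return user
```

Callers that can handle absence use `find_user_by_id`, where the `Optional[User]` return type lets a type checker such as mypy flag unchecked use. Callers that cannot handle absence use `get_user_by_id` and fail at the lookup instead of later at `user.email`.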
## Architectural Analysis Method

When bugs suggest deeper design issues, analyze architecture systematically.

1. Map components: interactions, data flows, boundaries
2. Identify assumptions (inputs, state, timing, external systems)
3. Find assumption mismatches between components
4. Choose an architectural fix over a workaround when the problem is systemic

Use `codebase_search` prompts like:

- "How does ComponentA communicate with ComponentB?"
- "What data flows from Source to Destination?"

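Steps 2 and 3 can be made concrete by writing down each component's assumptions and comparing them mechanically. The following is a minimal sketch under that assumption; the function name `find_assumption_mismatches`, the component names, and the recorded topics are all hypothetical.

```python
from collections import defaultdict


def find_assumption_mismatches(assumptions):
    """Group components by what they assume about each topic.

    `assumptions` maps component name -> {topic: assumed value}.
    Returns only topics where components disagree, as
    {topic: {assumed value: [components holding it]}}.
    """
    by_topic = defaultdict(lambda: defaultdict(list))
    for component, topics in assumptions.items():
        for topic, value in topics.items():
            by_topic[topic][value].append(component)
    return {
        topic: {value: sorted(names) for value, names in values.items()}
        for topic, values in by_topic.items()
        if len(values) > 1
    }


# Hypothetical assumptions collected while mapping two components.
assumptions = {
    "OrderService": {"timestamp unit": "milliseconds", "id type": "string"},
    "BillingService": {"timestamp unit": "seconds", "id type": "string"},
}
mismatches = find_assumption_mismatches(assumptions)
```

Here only `"timestamp unit"` is reported, because both components agree on `"id type"`. A mismatch held by many components is a hint that the fix belongs at the shared boundary, not in each caller.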
## Data Flow Tracing

Trace transformations to locate where data goes wrong.

- Backward tracing: start at observation point → immediate source → transformation → origin
- Forward tracing: origin → each transformation → final state
- At each step compare expected vs actual state

Common root causes:

- Missing validation
- Incorrect transformation logic
- Lost context/metadata
- Race conditions
- Type/encoding mismatch

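Forward tracing with an expected-vs-actual check at each step can be sketched as a small helper. This is an illustrative sketch, not a prescribed tool: `trace_forward`, the stage names, and the sample pipeline are hypothetical, and the pipeline deliberately contains an encoding mismatch.

```python
def trace_forward(origin, stages):
    """Apply each (name, transform, is_expected) stage in order.

    Returns (first_failing_stage_or_None, trace), where trace lists the
    value observed after every stage that ran.
    """
    trace = [("origin", origin)]
    value = origin
    for name, transform, is_expected in stages:
        value = transform(value)
        trace.append((name, value))
        if not is_expected(value):
            return name, trace
    return None, trace


original = "café"
stages = [
    ("encode", lambda s: s.encode("utf-8"), lambda v: isinstance(v, bytes)),
    # Bug under investigation: the reader assumes Latin-1 instead of UTF-8.
    ("decode", lambda b: b.decode("latin-1"), lambda v: v == original),
    ("uppercase", str.upper, lambda v: v == "CAFÉ"),
]
failing_stage, trace = trace_forward(original, stages)
```

The helper stops at the first stage whose output fails its check, so `failing_stage` names where the data first went wrong and `trace` shows every intermediate value up to that point.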
## State Analysis Patterns

Investigate state transitions and invariants.

- Uninitialized state: used before proper setup
- Stale state: cache invalidation/refresh failures
- Inconsistent state: related data out of sync (needs atomicity)
- Invalid state: invariants not enforced (add validation/assertions)
- Concurrent corruption: missing synchronization/immutability

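For the "invalid state" pattern, one common approach is to check the invariant at construction and after every mutation, so a violation surfaces where it is introduced. The class below is a minimal single-threaded sketch; `Inventory` and its invariant are hypothetical examples.

```python
class Inventory:
    """Stock counter with the invariant 0 <= reserved <= on_hand."""

    def __init__(self, on_hand):
        self._on_hand = on_hand
        self._reserved = 0
        self._check_invariant()

    def _check_invariant(self):
        if not 0 <= self._reserved <= self._on_hand:
            raise ValueError(
                f"invariant violated: reserved={self._reserved}, "
                f"on_hand={self._on_hand}"
            )

    def reserve(self, quantity):
        """Reserve stock, rejecting requests that would break the invariant."""
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        if self._reserved + quantity > self._on_hand:
            raise ValueError("not enough stock to reserve")
        self._reserved += quantity
        self._check_invariant()

    @property
    def available(self):
        """Stock that is on hand and not yet reserved."""
        return self._on_hand - self._reserved
```

Validating before mutating keeps a rejected request from leaving the object half-updated. This sketch does not address concurrent access, which would additionally need synchronization.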
## Integration Point Analysis

Verify integration contracts at boundaries.

- Data format: actual vs expected
- Protocol/version: compatibility and usage
- Timing: sync vs async, timeouts, ordering
- Error handling: propagation and retries
- AuthZ/AuthN: credentials, validation, failure behavior

Root cause patterns:

- Mismatched versions
- Incomplete error handling
- Configuration mismatch
- Network constraints

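The data-format and version checks can be run explicitly at the boundary so mismatches are reported there, not deep inside the consumer. The sketch below assumes a dictionary payload with a `schema_version` field; the function `contract_violations`, the field names, and the supported versions are hypothetical.

```python
def contract_violations(payload, required_fields, supported_versions):
    """Return a list of human-readable contract violations (empty if none)."""
    problems = []
    for field, expected_type in required_fields.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(
                f"wrong type for {field}: expected "
                f"{expected_type.__name__}, got {actual}"
            )
    version = payload.get("schema_version")
    if version not in supported_versions:
        problems.append(f"unsupported schema_version: {version!r}")
    return problems


# Hypothetical contract for a user payload received from another service.
REQUIRED_FIELDS = {"id": int, "email": str}
SUPPORTED_VERSIONS = (2, 3)

good = {"id": 7, "email": "ada@example.com", "schema_version": 2}
bad = {"id": "7", "schema_version": 1}
```

Collecting every violation instead of stopping at the first one gives a fuller picture of how far the actual format has drifted from the expected one.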
## Dependency Chain Analysis

Map direct, transitive, and hidden dependencies.

- Version conflicts (multiple versions)
- Missing dependencies (runtime load failures)
- Initialization order issues
- Circular dependencies

Use `codebase_search`:

- "What imports/uses ComponentX?"
- "What does ComponentX depend on?"

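Once the dependencies are mapped, initialization order and circular dependencies can be checked mechanically. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the wrapper `initialization_order` and the component graphs are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def initialization_order(dependencies):
    """Return (order, cycle) for {component: set of components it depends on}.

    On success, order lists dependencies before their dependents and cycle
    is None. If a circular dependency exists, order is None and cycle is the
    list of nodes reported by graphlib.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order()), None
    except CycleError as exc:
        # graphlib stores the detected cycle as the second element of args.
        return None, exc.args[1]


# Hypothetical component graphs.
healthy = {
    "app": {"db", "cache"},
    "db": {"config"},
    "cache": {"config"},
    "config": set(),
}
circular = {"app": {"db"}, "db": {"migrations"}, "migrations": {"app"}}
```

For `healthy`, the returned order starts with `config` and ends with `app`. For `circular`, the reported cycle names the components to untangle.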
## Performance Root Cause Analysis

Identify bottlenecks systematically.

1. Measure first (profile under realistic load)
2. Check algorithmic complexity and hotspots
3. Analyze resource usage (CPU, memory, I/O, network)
4. Classify cause: algorithm, implementation, contention, external

Fix strategies:

- Algorithmic improvements
- Caching/batching
- Lazy loading
- Parallelization/asynchronous I/O

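As a small illustration of "measure first, then check complexity", the sketch below times two order-preserving deduplication functions on the same input. The helper `best_time` and both functions are hypothetical. A real investigation would more likely use a profiler such as `cProfile` under realistic load.

```python
import time


def best_time(fn, *args, repeats=5):
    """Return (result, fastest wall-clock seconds) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


def dedupe_quadratic(items):
    """Order-preserving dedupe; list membership makes this O(n^2)."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


def dedupe_linear(items):
    """Same result using a set for membership: O(n) on average."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


data = list(range(2000)) * 2
slow_result, slow_seconds = best_time(dedupe_quadratic, data)
fast_result, fast_seconds = best_time(dedupe_linear, data)
print(f"quadratic: {slow_seconds:.4f}s, linear: {fast_seconds:.4f}s")
```

Both functions return the same list, so the measured difference comes from the algorithm alone. This is the "algorithm" class of cause from step 4.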
## Sequential Thinking Templates

Use `SequentialThinking:process_thought` to structure complex analysis.

Thought 1 - Problem Definition

- Symptom, context, confirmed facts, unknowns

Thought 2 - Hypotheses

- 3–5 candidates, assumptions, likelihood ranking

Thought 3 - Evidence

- For/against each hypothesis; challenge assumptions

Thought 4 - Selection

- Pick most likely; rationale; confidence

Thought 5 - Verification

- Predictions, test plan, alternatives if wrong