# Causal Inference & Root Cause Analysis Methodology

## Root Cause Analysis Workflow

Copy this checklist and track your progress:

```
Root Cause Analysis Progress:
- [ ] Step 1: Generate hypotheses using structured techniques
- [ ] Step 2: Build causal model and distinguish cause types
- [ ] Step 3: Gather evidence and assess temporal sequence
- [ ] Step 4: Test causality with Bradford Hill criteria
- [ ] Step 5: Verify root cause and check coherence
```

**Step 1: Generate hypotheses using structured techniques**

Use 5 Whys, Fishbone diagrams, timeline analysis, or differential diagnosis to systematically generate potential causes. See [Hypothesis Generation Techniques](#hypothesis-generation-techniques) for detailed methods.

**Step 2: Build causal model and distinguish cause types**

Map causal chains to distinguish root causes from proximate causes, symptoms, and confounders. See [Causal Model Building](#causal-model-building) for cause type definitions and chain mapping.

**Step 3: Gather evidence and assess temporal sequence**

Collect evidence for each hypothesis and verify temporal relationships (cause must precede effect). See [Evidence Assessment Methods](#evidence-assessment-methods) for essential tests and evidence hierarchy.

**Step 4: Test causality with Bradford Hill criteria**

Score hypotheses using the 9 Bradford Hill criteria (strength, consistency, specificity, temporality, dose-response, plausibility, coherence, experiment, analogy). See [Bradford Hill Criteria](#bradford-hill-criteria) for the scoring rubric.

**Step 5: Verify root cause and check coherence**

Ensure the identified root cause has strong evidence, fits with known facts, and addresses systemic issues. See [Quality Checklist](#quality-checklist) and [Worked Example](#worked-example-website-conversion-drop) for validation techniques.

---

## Hypothesis Generation Techniques

### 5 Whys (Trace Back to Root)

Start with the effect and ask "why" repeatedly until you reach the root cause.

**Process:**
```
Effect: {Observed problem}
Why? → {Proximate cause 1}
Why? → {Proximate cause 2}
Why? → {Intermediate cause}
Why? → {Deeper cause}
Why? → {Root cause}
```

**Example:**
```
Effect: Server crashed
Why? → Out of memory
Why? → Memory leak in new code
Why? → Connection pool not releasing connections
Why? → Error handling missing in new feature
Why? → Code review process didn't catch this edge case (ROOT CAUSE)
```

**When to stop**: When you reach a cause that, if fixed, would prevent recurrence.

**Pitfalls to avoid**:
- Stopping too early at proximate causes
- Following only one path (should explore multiple branches)
- Accepting vague answers ("because it broke")
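
The branching pitfall can be made concrete with a small sketch. This is illustrative Python (the `Why` class and `candidate_roots` helper are hypothetical names, not part of any library): each answer to "why?" becomes a child node, so multiple branches can be explored instead of a single path.

```python
from dataclasses import dataclass, field

@dataclass
class Why:
    """One node in a 5 Whys tree; each child answers 'why?' for this cause."""
    cause: str
    children: list["Why"] = field(default_factory=list)

    def ask(self, answer: str) -> "Why":
        """Record one answer to 'why?' and return it for chaining."""
        child = Why(answer)
        self.children.append(child)
        return child

def candidate_roots(node: Why) -> list[str]:
    """The leaves of the tree: the deepest cause reached on each branch."""
    if not node.children:
        return [node.cause]
    leaves = []
    for child in node.children:
        leaves.extend(candidate_roots(child))
    return leaves
```

`effect.ask("Out of memory").ask("Memory leak")` builds one branch; calling `ask` twice on the same node opens a second branch, which addresses the single-path pitfall.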

---

### Fishbone Diagram (Categorize Causes)

Organize potential causes by category to ensure comprehensive coverage.

**Categories (6 Ms)**:
```
Methods         Machines        Materials
   |               |               |
   |               |               |
   └───────────────┴───────────────┴─────────→ Effect
   |               |               |
   |               |               |
Manpower        Measurement     Environment
```

**Adapted for software/product**:
- **People**: Skills, knowledge, communication, errors
- **Process**: Procedures, workflows, standards, review
- **Technology**: Tools, systems, infrastructure, dependencies
- **Environment**: External factors, timing, context

**Example (API outage)**:
- **People**: On-call engineer inexperienced with database
- **Process**: No rollback procedure documented
- **Technology**: Database connection pooling misconfigured
- **Environment**: Traffic spike coincided with deploy

---

### Timeline Analysis (What Changed?)

Map events leading up to the effect to identify temporal correlations.

**Process**:
1. Establish baseline period (normal operation)
2. Mark when effect first observed
3. List all changes in days/hours leading up to effect
4. Identify changes that temporally precede effect

**Example**:
```
T-7 days: Normal operation (baseline)
T-3 days: Deployed new payment processor integration
T-2 days: Traffic increased 20%
T-1 day: First error logged (dismissed as fluke)
T-0 (14:00): Conversion rate dropped 30%
T+1 hour: Alert fired, investigation started
```

**Look for**: Changes, anomalies, or events right before effect (temporal precedence).

**Red flags**:
- Multiple changes at once (hard to isolate cause)
- Long lag between change and effect (harder to connect)
- No changes before effect (may be external factor or accumulated technical debt)
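
The process above amounts to a filter over a change log. A minimal Python sketch (the `suspect_changes` helper and its 7-day default window are assumptions for illustration): keep changes that precede the effect within a lookback window, most recent first.

```python
from datetime import datetime, timedelta

def suspect_changes(changes, effect_time, window=timedelta(days=7)):
    """Changes that precede the effect within the lookback window,
    most recent first. Precedence is necessary for causality, not sufficient."""
    preceding = [(t, desc) for t, desc in changes
                 if effect_time - window <= t < effect_time]
    return sorted(preceding, reverse=True)

# Example: two changes before the effect, one after (excluded)
effect = datetime(2024, 11, 15, 2, 0)
changes = [
    (datetime(2024, 11, 10, 9, 0), "switched payment processor"),
    (datetime(2024, 11, 14, 17, 0), "deployed new checkout UI"),
    (datetime(2024, 11, 20, 9, 0), "unrelated later change"),
]
```

Anything this returns is only a candidate; temporal precedence still has to be backed by mechanism and counterfactual evidence.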

---

### Differential Diagnosis

List all conditions/causes that could produce the observed symptoms.

**Process**:
1. List all possible causes (be comprehensive)
2. For each cause, list symptoms it would produce
3. Compare predicted symptoms to observed symptoms
4. Eliminate causes that don't match

**Example (API returning 500 errors)**:

| Cause | Predicted Symptoms | Observed? |
|-------|-------------------|-----------|
| Code bug (logic error) | Specific endpoints fail, reproducible | ✓ Yes |
| Resource exhaustion (memory) | All endpoints slow, CPU/memory high | ✗ CPU normal |
| Dependency failure (database) | All database queries fail, connection errors | ✗ DB responsive |
| Configuration error | Service won't start or immediate failure | ✗ Started fine |
| DDoS attack | High traffic, distributed sources | ✗ Traffic normal |

**Conclusion**: Code bug most likely (matches symptoms, others ruled out).
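
The elimination step lends itself to a tiny helper. A minimal Python sketch (the function name and symptom strings are hypothetical): keep only the causes whose predicted symptoms are all present in the observed set.

```python
def differential(candidates, observed):
    """Keep causes whose predicted symptoms all appear in the observed set.

    candidates: {cause: set of predicted symptoms}
    observed:   set of observed symptoms
    """
    return [cause for cause, predicted in candidates.items()
            if predicted <= observed]

# Abbreviated version of the table above
candidates = {
    "code bug": {"specific endpoints fail", "reproducible"},
    "memory exhaustion": {"all endpoints slow", "memory high"},
    "db failure": {"db queries fail"},
}
observed = {"specific endpoints fail", "reproducible", "db responsive"}
```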

---

## Causal Model Building

### Distinguish Cause Types

#### 1. Root Cause
- **Definition**: Fundamental underlying issue
- **Test**: If fixed, problem doesn't recur
- **Usually**: Structural, procedural, or systemic
- **Example**: "No code review process for infrastructure changes"

#### 2. Proximate Cause
- **Definition**: Immediate trigger
- **Test**: Directly precedes effect
- **Often**: Symptom of deeper root cause
- **Example**: "Deployed buggy config file"

#### 3. Contributing Factor
- **Definition**: Makes effect more likely or severe
- **Test**: Not sufficient alone, but amplifies
- **Often**: Context or conditions
- **Example**: "High traffic amplified impact of bug"

#### 4. Coincidence
- **Definition**: Happened around same time, no causal relationship
- **Test**: No mechanism, doesn't pass counterfactual
- **Example**: "Marketing campaign launched same day" (but unrelated to technical issue)

---

### Map Causal Chains

#### Linear Chain
```
Root Cause
  ↓ (mechanism)
Intermediate Cause
  ↓ (mechanism)
Proximate Cause
  ↓ (mechanism)
Observed Effect
```

**Example**:
```
No SSL cert renewal process (Root)
  ↓ (no automated reminders)
Cert expires unnoticed (Intermediate)
  ↓ (HTTPS handshake fails)
HTTPS connections fail (Proximate)
  ↓ (users can't connect)
Users can't access site (Effect)
```

#### Branching Chain (Multiple Paths)
```
    Root Cause
     ↙     ↘
 Path A   Path B
    ↓        ↓
Effect A  Effect B
```

**Example**:
```
  Database migration error (Root)
       ↙             ↘
Missing indexes   Schema mismatch
       ↓                ↓
 Slow queries      Type errors
```

#### Common Cause (Confounding)
```
    Confounder (Z)
      ↙       ↘
Variable X   Variable Y
```
X and Y correlate because Z causes both, but X doesn't cause Y.

**Example**:
```
     Hot weather (Confounder)
       ↙            ↘
Ice cream sales   Drownings
```
Ice cream doesn't cause drownings; both are caused by hot weather/summer season.
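
Once a chain is mapped, candidate roots can be read mechanically off the graph: they are the nodes nothing points at. A minimal Python sketch over a cause → effects adjacency dict (the `root_causes` name and node labels are illustrative):

```python
def root_causes(edges):
    """Return nodes with no incoming edge in a cause -> [effects] dict.
    Such nodes are candidate roots; everything else is downstream."""
    all_effects = {e for targets in edges.values() for e in targets}
    return set(edges) - all_effects

# The SSL chain from the linear-chain example above
chain = {
    "no cert renewal process": ["cert expires unnoticed"],
    "cert expires unnoticed": ["HTTPS connections fail"],
    "HTTPS connections fail": ["users can't access site"],
}
```

For a branching chain the same function returns each independent root, which is a quick sanity check that the model has exactly the roots you think it has.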

---

## Evidence Assessment Methods

### Essential Tests

#### 1. Temporal Sequence (Must Pass)
**Rule**: Cause MUST precede effect. If effect happened before "cause," it's not causal.

**How to verify**:
- Create detailed timeline
- Mark exact timestamps of cause and effect
- Calculate lag (effect should follow cause)

**Example**:
- Migration deployed: March 10, 2:00 PM
- Queries slowed: March 10, 2:15 PM
- ✓ Cause precedes effect by 15 minutes
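
Once timestamps are recorded, the check is trivial to automate. A minimal Python sketch (the `precedes` name is illustrative; the year in the example timestamps is assumed, since the text gives only month and day):

```python
from datetime import datetime

def precedes(cause_time: datetime, effect_time: datetime) -> bool:
    """Necessary condition for causality: cause strictly before effect."""
    return cause_time < effect_time

# Timestamps from the example (year assumed for illustration)
deployed = datetime(2024, 3, 10, 14, 0)   # migration deployed, 2:00 PM
slowed = datetime(2024, 3, 10, 14, 15)    # queries slowed, 2:15 PM
lag_minutes = (slowed - deployed).total_seconds() / 60  # 15.0
```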

#### 2. Counterfactual (Strong Evidence)
**Question**: "What would have happened without the cause?"

**How to test**:
- **Control group**: Compare cases with cause vs without
- **Rollback**: Remove cause, see if effect disappears
- **Baseline comparison**: Compare to period before cause
- **A/B test**: Random assignment of cause

**Example**:
- Hypothesis: New feature X causes conversion drop
- Test: A/B test with X enabled vs disabled
- Result: No conversion difference → X not the cause
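
When an A/B test is available, the counterfactual comparison can be quantified with a standard two-proportion z-test. A minimal Python sketch (function name is illustrative; |z| > 1.96 corresponds roughly to p < 0.05 two-sided):

```python
from math import sqrt

def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
    """Z statistic for the difference between two conversion rates,
    using the pooled standard error under the null of no difference."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# 3% vs 5% conversion on 10,000 visitors each: clearly significant
z = two_proportion_z(300, 10_000, 500, 10_000)
```

A z near zero between the enabled and disabled arms supports "X is not the cause"; a large |z| supports a real difference.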

#### 3. Mechanism (Plausibility Test)
**Question**: Can you explain HOW cause produces effect?

**Requirements**:
- Pathway from cause to effect
- Each step makes sense
- Supported by theory or evidence

**Example - Strong**:
- Cause: Increased checkout latency (5 sec)
- Mechanism: Users abandon slow pages
- Evidence: Industry data shows 7+ sec → 20% abandon
- ✓ Clear, plausible mechanism

**Example - Weak**:
- Cause: Moon phase (full moon)
- Mechanism: ??? (no plausible pathway)
- Effect: Website traffic increase
- ✗ No mechanism, likely spurious correlation

---

### Evidence Hierarchy

**Strongest Evidence** (Gold Standard):

1. **Randomized Controlled Trial (RCT)**
   - Random assignment eliminates confounding
   - Compare treatment vs control group
   - Example: A/B test with random user assignment

**Strong Evidence**:

2. **Quasi-Experiment / Natural Experiment**
   - Some random-like assignment (not perfect)
   - Example: Policy implemented in one region, not another
   - Control for confounders statistically

**Medium Evidence**:

3. **Longitudinal Study (Before/After)**
   - Track same subjects over time
   - Controls for individual differences
   - Example: Metric before and after change

4. **Well-Controlled Observational Study**
   - Statistical controls for confounders
   - Multiple variables measured
   - Example: Regression analysis with covariates

**Weak Evidence**:

5. **Cross-Sectional Correlation**
   - Single point in time
   - Can't establish temporal order
   - Example: Survey at one moment

6. **Anecdote / Single Case**
   - May not generalize
   - Many confounders
   - Example: One user complaint

**Recommendation**: For high-stakes decisions, aim for quasi-experiment or better.

---

### Bradford Hill Criteria

Nine factors that strengthen causal claims. Score each on a 1-3 scale.

#### 1. Strength
**Question**: How strong is the association?

- **3**: Very strong correlation (r > 0.7, OR > 10)
- **2**: Moderate correlation (r = 0.3-0.7, OR = 2-10)
- **1**: Weak correlation (r < 0.3, OR < 2)

**Example**: 10x latency increase = strong association (3/3)

#### 2. Consistency
**Question**: Does relationship replicate across contexts?

- **3**: Always consistent (multiple studies, settings)
- **2**: Mostly consistent (some exceptions)
- **1**: Rarely consistent (contradictory results)

**Example**: Latency affects conversion on all pages, devices, countries = consistent (3/3)

#### 3. Specificity
**Question**: Is cause-effect relationship specific?

- **3**: Specific cause → specific effect
- **2**: Somewhat specific (some other effects)
- **1**: General/non-specific (cause has many effects)

**Example**: Missing index affects only queries on that table = specific (3/3)

#### 4. Temporality
**Question**: Does cause precede effect?

- **3**: Always (clear temporal sequence)
- **2**: Usually (mostly precedes)
- **1**: Unclear or reverse causation possible

**Example**: Migration (2:00 PM) before slow queries (2:15 PM) = always (3/3)

#### 5. Dose-Response
**Question**: Does more cause → more effect?

- **3**: Clear gradient (linear or monotonic)
- **2**: Some gradient (mostly true)
- **1**: No gradient (flat or random)

**Example**: Larger tables → slower queries = clear gradient (3/3)

#### 6. Plausibility
**Question**: Does mechanism make sense?

- **3**: Very plausible (well-established theory)
- **2**: Somewhat plausible (reasonable)
- **1**: Implausible (contradicts theory)

**Example**: Index scans faster than table scans = well-established (3/3)

#### 7. Coherence
**Question**: Fits with existing knowledge?

- **3**: Fits well (no contradictions)
- **2**: Somewhat fits (minor contradictions)
- **1**: Contradicts existing knowledge

**Example**: Aligns with database optimization theory = fits well (3/3)

#### 8. Experiment
**Question**: Does intervention change outcome?

- **3**: Strong experimental evidence (RCT)
- **2**: Some experimental evidence (quasi)
- **1**: No experimental evidence (observational only)

**Example**: Rollback restored performance = strong experiment (3/3)

#### 9. Analogy
**Question**: Similar cause-effect patterns exist?

- **3**: Strong analogies (many similar cases)
- **2**: Some analogies (a few similar)
- **1**: No analogies (unique case)

**Example**: Similar patterns in other databases = some analogies (2/3)

**Scoring**:
- **Total 18-27**: Strong causal evidence
- **Total 12-17**: Moderate evidence
- **Total < 12**: Weak evidence (need more investigation)
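
The scoring rubric above can be mechanized. A minimal Python sketch (names are illustrative) that validates the nine 1-3 scores, sums them, and applies the thresholds:

```python
CRITERIA = ("strength", "consistency", "specificity", "temporality",
            "dose_response", "plausibility", "coherence", "experiment", "analogy")

def bradford_hill(scores):
    """Sum nine 1-3 criterion scores and bucket them per the thresholds above."""
    if set(scores) != set(CRITERIA):
        raise ValueError("score all nine criteria")
    if any(s not in (1, 2, 3) for s in scores.values()):
        raise ValueError("each criterion is scored 1-3")
    total = sum(scores.values())
    if total >= 18:
        return total, "strong"
    if total >= 12:
        return total, "moderate"
    return total, "weak"
```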

---

## Worked Example: Website Conversion Drop

### Problem

Website conversion rate dropped from 5% to 3% (40% relative drop) starting November 15th.

---

### 1. Effect Definition

**Effect**: Conversion rate dropped 40% (5% → 3%)

**Quantification**:
- Baseline: 5% (stable for 6 months prior)
- Current: 3% (observed for 2 weeks)
- Absolute drop: 2 percentage points
- Relative drop: 40%
- Impact: ~$280k/week revenue loss ($40k/day)

**Timeline**:
- Last normal day: November 14th (5.1% conversion)
- First drop observed: November 15th (3.2% conversion)
- Ongoing: Yes (still at 3% as of November 29th)

**Impact**:
- 10,000 daily visitors
- 500 conversions → 300 conversions
- $200 average order value
- Loss: 200 conversions × $200 = $40k/day
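
The impact figures above can be double-checked with a few lines of arithmetic (Python, values taken directly from the example):

```python
visitors = 10_000        # daily visitors
baseline_rate = 0.05     # 5% conversion before the drop
current_rate = 0.03      # 3% conversion after
avg_order_value = 200    # dollars per conversion

lost_conversions = visitors * (baseline_rate - current_rate)    # 200 per day
daily_loss = lost_conversions * avg_order_value                 # $40k per day
relative_drop = (baseline_rate - current_rate) / baseline_rate  # 40%
```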

---

### 2. Competing Hypotheses (Using Multiple Techniques)

#### Using 5 Whys:
```
Effect: Conversion dropped 40%
Why? → Users abandoning checkout
Why? → Checkout takes too long
Why? → Payment processor is slow
Why? → New processor has higher latency
Why? → Switched to cheaper processor (ROOT)
```

#### Using Fishbone Diagram:

**Technology**:
- New payment processor (Nov 10)
- New checkout UI (Nov 14)

**Process**:
- Lack of A/B testing
- No performance monitoring

**Environment**:
- Seasonal traffic shift (holiday season)

**People**:
- User behavior changes?

#### Using Timeline Analysis:
```
Nov 10: Switched to new payment processor
Nov 14: Deployed new checkout UI
Nov 15: Conversion drop observed (2:00 AM)
```

#### Competing Hypotheses Generated:

**Hypothesis 1: New checkout UI** (deployed Nov 14)
- Type: Proximate cause
- Evidence for: Timing matches (Nov 14 → Nov 15)
- Evidence against: A/B test showed no difference; rollback didn't fix

**Hypothesis 2: Payment processor switch** (Nov 10)
- Type: Root cause candidate
- Evidence for: New processor slower (400ms vs 100ms); timing precedes
- Evidence against: 5-day lag (why not immediate?)

**Hypothesis 3: Payment latency increase** (Nov 15)
- Type: Proximate cause/symptom
- Evidence for: Logs show 5-8 sec checkout (was 2-3 sec); user complaints
- Evidence against: None (strong evidence)

**Hypothesis 4: Seasonal traffic shift**
- Type: Confounder
- Evidence for: Holiday season
- Evidence against: Traffic composition unchanged

---

### 3. Causal Model

#### Causal Chain:
```
ROOT: Switched to cheaper payment processor (Nov 10)
  ↓ (mechanism: lower throughput processor)
INTERMEDIATE: Payment API latency under load
  ↓ (mechanism: slow API → page delays)
PROXIMATE: Checkout takes 5-8 seconds (Nov 15+)
  ↓ (mechanism: users abandon slow pages)
EFFECT: Conversion drops 40%
```

#### Why the Nov 15 lag?
- Nov 10-14: Weekday traffic (low, processor handled it)
- Nov 15: Weekend traffic spike (2x normal)
- Processor couldn't handle weekend load → latency spike

#### Confounders Ruled Out:
- UI change: Rollback had no effect
- Seasonality: Traffic patterns unchanged
- Marketing: No changes

---

### 4. Evidence Assessment

**Temporal Sequence**: ✓ PASS (3/3)
- Processor switch (Nov 10) → Latency (Nov 15) → Drop (Nov 15)
- Clear precedence

**Counterfactual**: ✓ PASS (3/3)
- Test: Switched back to old processor
- Result: Conversion recovered to 4.8%
- Strong evidence

**Mechanism**: ✓ PASS (3/3)
- Slow processor (400ms) → High load → 5-8 sec checkout → User abandonment
- Industry data: 7+ sec = 20% abandon
- Clear, plausible mechanism

**Dose-Response**: ✓ PASS (3/3)

| Checkout Time | Conversion |
|---------------|------------|
| <3 sec | 5% |
| 3-5 sec | 4% |
| 5-7 sec | 3% |
| >7 sec | 2% |

Clear gradient
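
The gradient check is just monotonicity of the dose-response table. A minimal Python sketch (function name illustrative): as the dose (checkout time) increases, the effect (conversion) should move strictly in one direction.

```python
def monotonic_gradient(dose_response):
    """True if the effect moves strictly in one direction as dose increases.

    dose_response: iterable of (dose, effect) pairs, any order.
    """
    effects = [e for _, e in sorted(dose_response)]
    steps = list(zip(effects, effects[1:]))
    return all(a < b for a, b in steps) or all(a > b for a, b in steps)

# The table above: checkout-time bucket (upper bound, sec) vs conversion
table = [(3, 0.05), (5, 0.04), (7, 0.03), (9, 0.02)]
```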

**Consistency**: ✓ PASS (3/3)
- All devices (mobile, desktop, tablet)
- All payment methods
- All countries
- Consistent pattern

**Bradford Hill Score: 25/27** (Very Strong)
1. Strength: 3 (40% drop)
2. Consistency: 3 (all segments)
3. Specificity: 2 (latency affects other things)
4. Temporality: 3 (clear sequence)
5. Dose-response: 3 (gradient)
6. Plausibility: 3 (well-known)
7. Coherence: 3 (fits theory)
8. Experiment: 3 (rollback test)
9. Analogy: 2 (some similar cases)

---

### 5. Conclusion

**Root Cause**: Switched to cheaper payment processor with insufficient throughput for weekend traffic.

**Confidence**: High (95%+)

**Why high confidence**:
- Perfect temporal sequence
- Strong counterfactual (rollback restored)
- Clear mechanism
- Dose-response present
- Consistent across contexts
- No confounders
- Bradford Hill 25/27

**Why root (not symptom)**:
- Processor switch is the decision that created the problem
- Latency is a symptom of this decision
- Fixing latency alone treats the symptom
- Reverting the switch eliminates the problem

---

### 6. Recommended Actions

**Immediate**:
1. ✓ Reverted to old processor (Nov 28)
2. Negotiate better rates with old processor

**If keeping new processor**:
1. Add caching layer (reduce latency to <2 sec)
2. Implement async checkout
3. Load test at 3x peak

**Validation**:
- A/B test: Old processor vs new + caching
- Monitor: Latency p95, conversion rate
- Success: Conversion >4.5%, latency <2 sec

---

### 7. Limitations

**Data limitations**:
- No abandonment reason tracking
- Processor metrics limited (black box)

**Analysis limitations**:
- Didn't test all latency optimizations
- Small sample for some device types

**Generalizability**:
- Specific to our checkout flow
- May not apply to other traffic patterns

---

## Quality Checklist

Before finalizing root cause analysis, verify:

**Effect Definition**:
- [ ] Effect clearly described (not vague)
- [ ] Quantified with magnitude
- [ ] Timeline established (when started, duration)
- [ ] Baseline for comparison stated

**Hypothesis Generation**:
- [ ] Multiple hypotheses generated (3+ competing)
- [ ] Used systematic techniques (5 Whys, Fishbone, Timeline, Differential)
- [ ] Proximate vs root distinguished
- [ ] Confounders identified

**Causal Model**:
- [ ] Causal chain mapped (root → proximate → effect)
- [ ] Mechanisms explained for each link
- [ ] Confounding relationships noted
- [ ] Direct vs indirect causation clear

**Evidence Assessment**:
- [ ] Temporal sequence verified (cause before effect)
- [ ] Counterfactual tested (what if cause absent)
- [ ] Mechanism explained with supporting evidence
- [ ] Dose-response checked (more cause → more effect)
- [ ] Consistency assessed (holds across contexts)
- [ ] Confounders ruled out or controlled

**Conclusion**:
- [ ] Root cause clearly stated
- [ ] Confidence level stated with justification
- [ ] Distinguished from proximate causes/symptoms
- [ ] Alternative explanations acknowledged
- [ ] Limitations noted

**Recommendations**:
- [ ] Address root cause (not just symptoms)
- [ ] Actionable interventions proposed
- [ ] Validation tests specified
- [ ] Success metrics defined

**Minimum Standards**:
- For medium-stakes (postmortems, feature issues): Score ≥ 3.5 on rubric
- For high-stakes (infrastructure, safety): Score ≥ 4.0 on rubric
- Red flags: <3 on Temporal Sequence, Counterfactual, or Mechanism = weak causal claim