Initial commit
@@ -0,0 +1,135 @@
{
  "name": "Bayesian Reasoning Quality Rubric",
  "scale": {
    "min": 1,
    "max": 5,
    "description": "1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent"
  },
  "criteria": [
    {
      "name": "Prior Quality",
      "description": "Prior is based on base rates and reference classes, not just intuition",
      "scoring": {
        "1": "No prior stated or purely intuitive guess",
        "2": "Prior stated but ignores base rates entirely",
        "3": "Prior considers base rates with some adjustment",
        "4": "Prior well-grounded in base rates with justified adjustments",
        "5": "Exceptional prior with multiple reference classes and clear reasoning"
      }
    },
    {
      "name": "Likelihood Justification",
      "description": "Likelihoods P(E|H) and P(E|¬H) are estimated with clear reasoning",
      "scoring": {
        "1": "No likelihoods or purely guessed",
        "2": "Likelihoods given but no justification",
        "3": "Likelihoods have basic reasoning",
        "4": "Likelihoods well-justified with clear logic",
        "5": "Exceptional likelihood estimates with empirical grounding or detailed reasoning"
      }
    },
    {
      "name": "Evidence Diagnosticity",
      "description": "Evidence meaningfully distinguishes between hypotheses (LR ≠ 1)",
      "scoring": {
        "1": "Evidence is not diagnostic at all (LR ≈ 1)",
        "2": "Evidence is weakly diagnostic (LR = 1-2)",
        "3": "Evidence is moderately diagnostic (LR = 2-5)",
        "4": "Evidence is strongly diagnostic (LR = 5-10)",
        "5": "Evidence is very strongly diagnostic (LR > 10)"
      }
    },
    {
      "name": "Calculation Correctness",
      "description": "Bayesian calculation is mathematically correct",
      "scoring": {
        "1": "Major calculation errors",
        "2": "Some calculation errors",
        "3": "Calculation is correct with minor issues",
        "4": "Calculation is fully correct",
        "5": "Perfect calculation with both probability and odds forms shown"
      }
    },
    {
      "name": "Calibration & Realism",
      "description": "Posterior is calibrated, not overconfident (avoids extremes without justification)",
      "scoring": {
        "1": "Posterior is 0% or 100% without extreme evidence",
        "2": "Posterior is very extreme (>95% or <5%) with weak evidence",
        "3": "Posterior is reasonable but might be slightly overconfident",
        "4": "Well-calibrated posterior with appropriate uncertainty",
        "5": "Exceptional calibration with explicit confidence bounds"
      }
    },
    {
      "name": "Assumption Transparency",
      "description": "Key assumptions and limitations are stated explicitly",
      "scoring": {
        "1": "No assumptions stated",
        "2": "Few assumptions mentioned vaguely",
        "3": "Key assumptions stated",
        "4": "Comprehensive assumption documentation",
        "5": "Exceptional transparency with sensitivity analysis showing assumption impact"
      }
    },
    {
      "name": "Base Rate Usage",
      "description": "Analysis uses base rates appropriately (avoids base rate neglect)",
      "scoring": {
        "1": "Completely ignores base rates",
        "2": "Acknowledges base rates but doesn't use them",
        "3": "Uses base rates for prior",
        "4": "Properly incorporates base rates with adjustments",
        "5": "Exceptional use of multiple base rates and reference classes"
      }
    },
    {
      "name": "Sensitivity Analysis",
      "description": "Tests how sensitive conclusion is to input assumptions",
      "scoring": {
        "1": "No sensitivity analysis",
        "2": "Minimal sensitivity check",
        "3": "Basic sensitivity analysis on key inputs",
        "4": "Comprehensive sensitivity analysis",
        "5": "Exceptional sensitivity analysis showing robustness or fragility clearly"
      }
    },
    {
      "name": "Interpretation Quality",
      "description": "Posterior is interpreted correctly with decision implications",
      "scoring": {
        "1": "Misinterprets posterior or no interpretation",
        "2": "Basic interpretation but lacks context",
        "3": "Good interpretation with some decision guidance",
        "4": "Clear interpretation with actionable decision implications",
        "5": "Exceptional interpretation linking probability to specific actions and thresholds"
      }
    },
    {
      "name": "Avoidance of Common Errors",
      "description": "Avoids prosecutor's fallacy, base rate neglect, and other Bayesian errors",
      "scoring": {
        "1": "Multiple major errors (confusing P(E|H) with P(H|E), ignoring base rates)",
        "2": "One major error present",
        "3": "Mostly avoids common errors",
        "4": "Cleanly avoids all common errors",
        "5": "Exceptional awareness with explicit checks against common errors"
      }
    }
  ],
  "overall_assessment": {
    "thresholds": {
      "excellent": "Average score ≥ 4.5 (publication quality)",
      "very_good": "Average score ≥ 4.0 (most forecasts should aim for this)",
      "good": "Average score ≥ 3.5 (minimum for important decisions)",
      "acceptable": "Average score ≥ 3.0 (workable for low-stakes predictions)",
      "needs_rework": "Average score < 3.0 (redo before using)"
    },
    "stakes_guidance": {
      "low_stakes": "Personal predictions, low-cost decisions: aim for ≥ 3.0",
      "medium_stakes": "Business decisions, moderate cost: aim for ≥ 3.5",
      "high_stakes": "Major decisions, high cost of error: aim for ≥ 4.0"
    }
  },
  "usage_instructions": "Rate each criterion on 1-5 scale. Calculate average. For important forecasts or decisions, minimum score is 3.5. For high-stakes decisions where cost of error is high, aim for ≥4.0. Check especially for base rate neglect, prosecutor's fallacy, and overconfidence - these are the most common errors."
}
@@ -0,0 +1,319 @@
|
||||
# Bayesian Analysis: Feature Adoption Forecast
|
||||
|
||||
## Question
|
||||
|
||||
**Hypothesis**: New sharing feature will achieve >20% adoption within 3 months of launch
|
||||
|
||||
**Estimating**: P(adoption >20%)
|
||||
|
||||
**Timeframe**: 3 months post-launch (results measured at month 3)
|
||||
|
||||
**Matters because**: Need 20% adoption to justify ongoing development investment. Below 20%, we should sunset the feature and reallocate resources.
|
||||
|
||||
---
|
||||
|
||||
## Prior Belief (Before Evidence)
|
||||
|
||||
### Base Rate
|
||||
|
||||
What's the general frequency of similar features achieving >20% adoption?
|
||||
|
||||
- **Reference class**: Previous features we've launched in this product category
|
||||
- **Historical data**:
|
||||
- Last 8 features launched: 5 achieved >20% adoption (62.5%)
|
||||
- Industry benchmarks: Social sharing features average 15-25% adoption
|
||||
- Our product has higher engagement than average
|
||||
- **Base rate**: 60%
|
||||
|
||||
### Adjustments
|
||||
|
||||
How is this case different from the base rate?
|
||||
|
||||
- **Factor 1: Feature complexity** - This feature is simpler than average (+5%)
|
||||
- Previous successful features averaged 3 steps to use
|
||||
- This feature is 1-click sharing
|
||||
- Simpler features historically perform better
|
||||
|
||||
- **Factor 2: Market timing** - Competitive pressure is high (-10%)
|
||||
- Two competitors launched similar features 6 months ago
|
||||
- Early adopters may have already switched to competitors
|
||||
- Late-to-market features typically see 15-20% lower adoption
|
||||
|
||||
- **Factor 3: User research signals** - Strong user request (+10%)
|
||||
- Feature was #2 most requested in last user survey (450 responses)
|
||||
- 72% said they would use it "frequently" or "very frequently"
|
||||
- Strong stated intent typically correlates with 40-60% actual usage
|
||||
|
||||
### Prior Probability
|
||||
|
||||
**P(H) = 65%**
|
||||
|
||||
**Justification**: Starting from 60% base rate, adjusted upward for simplicity (+5%) and strong user signals (+10%), adjusted down for late market entry (-10%). Net effect: 65% prior confidence that adoption will exceed 20%.
|
||||
|
||||
**Range if uncertain**: 55% to 75% (accounting for uncertainty in adjustment factors)
|
||||
|
||||
---
|
||||
|
||||
## Evidence
|
||||
|
||||
**What was observed**: Beta test with 200 users showed 35% adoption (70 users actively used feature)
|
||||
|
||||
**How diagnostic**: This is moderately to strongly diagnostic evidence. Beta tests often show higher engagement than production (selection bias), but 35% is meaningfully above our 20% threshold. The question is whether this beta performance predicts production performance.
|
||||
|
||||
### Likelihoods
|
||||
|
||||
**P(E|H) = 75%** - Probability of seeing 35% beta adoption IF true production adoption will be >20%
|
||||
|
||||
**Reasoning**:
|
||||
- If production adoption will be >20%, beta should show higher (beta users are early adopters)
|
||||
- Typical pattern: beta adoption is 1.5-2x production adoption for engaged features
|
||||
- If production will be 22%, beta would likely be 33-44% → 35% fits this well
|
||||
- If production will be 25%, beta would likely be 38-50% → 35% is on lower end but plausible
|
||||
- 75% accounts for variance and beta-to-production conversion uncertainty
|
||||
|
||||
**P(E|¬H) = 15%** - Probability of seeing 35% beta adoption IF true production adoption will be ≤20%
|
||||
|
||||
**Reasoning**:
|
||||
- If production adoption will be ≤20% (say, 15%), beta would typically be 22-30%
|
||||
- Seeing 35% beta when production will be ≤20% would require unusual beta-to-production drop
|
||||
- This could happen (beta selection bias, novelty effect wears off), but is uncommon
|
||||
- 15% reflects that this scenario is possible but unlikely
|
||||
|
||||
**Likelihood Ratio = 75% / 15% = 5.0**
|
||||
|
||||
**Interpretation**: Evidence is moderately strong. A 35% beta result is 5 times more likely if production adoption will exceed 20% than if it won't. This is meaningful but not overwhelming evidence.
|
||||
|
||||
---
|
||||
|
||||
## Bayesian Update
|
||||
|
||||
### Calculation
|
||||
|
||||
**Using odds form** (simpler for this case):
|
||||
|
||||
```
|
||||
Prior Odds = P(H) / P(¬H) = 65% / 35% = 1.86
|
||||
|
||||
Likelihood Ratio = 5.0
|
||||
|
||||
Posterior Odds = Prior Odds × LR = 1.86 × 5.0 = 9.3
|
||||
|
||||
Posterior Probability = Posterior Odds / (1 + Posterior Odds)
|
||||
= 9.3 / 10.3
|
||||
= 90.3%
|
||||
```
|
||||
|
||||
**Verification using probability form**:
|
||||
|
||||
```
|
||||
P(E) = [P(E|H) × P(H)] + [P(E|¬H) × P(¬H)]
|
||||
P(E) = [75% × 65%] + [15% × 35%]
|
||||
P(E) = 48.75% + 5.25% = 54%
|
||||
|
||||
P(H|E) = [P(E|H) × P(H)] / P(E)
|
||||
P(H|E) = [75% × 65%] / 54%
|
||||
P(H|E) = 48.75% / 54% = 90.3%
|
||||
```
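As a reproducibility aid (not part of the original analysis), here is a minimal Python sketch of the same update; the inputs are the prior and likelihoods assumed above.

```python
# Reproduces the update above; inputs are the estimates assumed in this analysis.
prior = 0.65            # P(H): production adoption exceeds 20%
p_e_given_h = 0.75      # P(E|H): 35% beta adoption if H is true
p_e_given_not_h = 0.15  # P(E|~H): 35% beta adoption if H is false

# Odds form
prior_odds = prior / (1 - prior)
lr = p_e_given_h / p_e_given_not_h
posterior_odds = prior_odds * lr
posterior_odds_form = posterior_odds / (1 + posterior_odds)

# Probability form (Bayes' theorem)
p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
posterior_prob_form = p_e_given_h * prior / p_e

print(f"LR = {lr:.1f}")
print(f"posterior (odds form)        = {posterior_odds_form:.1%}")  # ~90.3%
print(f"posterior (probability form) = {posterior_prob_form:.1%}")  # ~90.3%
```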
|
||||
|
||||
### Posterior Probability
|
||||
|
||||
**P(H|E) = 90%**
|
||||
|
||||
### Change in Belief
|
||||
|
||||
- **Prior**: 65%
|
||||
- **Posterior**: 90%
|
||||
- **Change**: +25 percentage points
|
||||
- **Interpretation**: Evidence strongly supports hypothesis. Beta test results meaningfully increased confidence that production adoption will exceed 20%.
|
||||
|
||||
---
|
||||
|
||||
## Sensitivity Analysis
|
||||
|
||||
**How sensitive is posterior to inputs?**
|
||||
|
||||
### If Prior was different:
|
||||
|
||||
| Prior | Posterior | Note |
|
||||
|-------|-----------|------|
|
||||
| 50% | 83% | Even starting at coin-flip, evidence pushes to high confidence |
|
||||
| 75% | 94% | Higher prior → very high posterior |
|
||||
| 40% | 77% | Lower prior → still high confidence |
|
||||
|
||||
**Finding**: Posterior is somewhat robust. Evidence is strong enough that even with priors ranging from 40-75%, posterior stays in 77-94% range.
|
||||
|
||||
### If P(E|H) was different:
|
||||
|
||||
| P(E\|H) | LR | Posterior | Note |
|---------|-----|-----------|------|
| 60% | 4.0 | 88% | Less diagnostic evidence → still high confidence |
| 85% | 5.67 | 91% | More diagnostic evidence → very high confidence |
| 50% | 3.33 | 86% | Weaker evidence → moderate-high confidence |

**Finding**: Posterior is moderately sensitive to P(E|H), but stays above 85% across the plausible range.
|
||||
|
||||
### If P(E|¬H) was different:
|
||||
|
||||
| P(E\|¬H) | LR | Posterior | Note |
|----------|-----|-----------|------|
| 25% | 3.0 | 85% | Less diagnostic → still high confidence |
| 10% | 7.5 | 93% | More diagnostic → very high confidence |
| 30% | 2.5 | 82% | Weaker evidence → moderate-high confidence |
|
||||
|
||||
**Finding**: Posterior is sensitive to P(E|¬H). If beta-to-production drop is common (higher P(E|¬H)), confidence decreases meaningfully.
|
||||
|
||||
**Robustness**: Conclusion is **moderately robust**. Across reasonable input ranges, posterior stays above 77%, supporting launch decision. Most sensitive to assumption about beta-to-production conversion rates.
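The tables above can be regenerated with a small sweep. The sketch below is illustrative (not from the original write-up); it simply reruns Bayes' theorem over the alternative input values while holding the other baseline estimates fixed.

```python
# Illustrative sweep behind the sensitivity tables above: same update, varied inputs.
def posterior(prior, p_e_h, p_e_not_h):
    return p_e_h * prior / (p_e_h * prior + p_e_not_h * (1 - prior))

PRIOR, P_E_H, P_E_NOT_H = 0.65, 0.75, 0.15  # baseline estimates from the analysis

for prior in (0.50, 0.75, 0.40):
    print(f"prior={prior:.0%} -> posterior={posterior(prior, P_E_H, P_E_NOT_H):.0%}")
for p_e_h in (0.60, 0.85, 0.50):
    print(f"P(E|H)={p_e_h:.0%} -> posterior={posterior(PRIOR, p_e_h, P_E_NOT_H):.0%}")
for p_e_not_h in (0.25, 0.10, 0.30):
    print(f"P(E|~H)={p_e_not_h:.0%} -> posterior={posterior(PRIOR, P_E_H, p_e_not_h):.0%}")
```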
|
||||
|
||||
---
|
||||
|
||||
## Calibration Check
|
||||
|
||||
**Am I overconfident?**
|
||||
|
||||
- **Did I anchor on initial belief?**
|
||||
- No - prior (65%) was based on base rates, not arbitrary
|
||||
- Evidence substantially moved belief (+25pp)
|
||||
- Not stuck at starting point
|
||||
|
||||
- **Did I ignore base rates?**
|
||||
- No - explicitly used historical feature adoption (60%) as starting point
|
||||
- Adjusted for known differences systematically
|
||||
|
||||
- **Is my posterior extreme (>90% or <10%)?**
|
||||
- Yes - 90% is borderline extreme
|
||||
- **Check**: Is evidence truly that strong?
|
||||
- LR = 5.0 is moderately strong (not very strong)
|
||||
- Prior was already high (65%)
|
||||
- Combination pushes to 90%
|
||||
- **Concern**: May be slightly overconfident
|
||||
- **Adjustment**: Consider reporting as 85-90% range rather than point estimate
|
||||
|
||||
- **Would an outside observer agree with my likelihoods?**
|
||||
- P(E|H) = 75%: Reasonable - beta users are engaged, expect higher than production
|
||||
- P(E|¬H) = 15%: Potentially optimistic - beta selection bias could be stronger
|
||||
- **Alternative**: If P(E|¬H) = 25%, posterior drops to 85% (more conservative)
|
||||
|
||||
**Red flags**:
|
||||
- ✓ Posterior is not 100% or 0%
|
||||
- ✓ Update magnitude (25pp) matches evidence strength (LR=5.0)
|
||||
- ✓ Prior uses base rates
|
||||
- ⚠ Posterior is at upper end (90%) - consider uncertainty range
|
||||
|
||||
**Calibration adjustment**: Report as 85-90% confidence range to account for uncertainty in likelihoods.
|
||||
|
||||
---
|
||||
|
||||
## Limitations & Assumptions
|
||||
|
||||
**Key assumptions**:
|
||||
|
||||
1. **Beta users are representative of broader user base**
|
||||
- Assumption: Beta users are 1.5-2x more engaged than average
|
||||
- Risk: If beta users are much more engaged (3x), production adoption could be lower
|
||||
- Impact: Could invalidate high posterior
|
||||
|
||||
2. **No major bugs or UX issues in production**
|
||||
- Assumption: Production experience will match beta experience
|
||||
- Risk: Unforeseen technical issues could crater adoption
|
||||
- Impact: Would make evidence misleading
|
||||
|
||||
3. **Competitive landscape stays stable**
|
||||
- Assumption: No major competitor moves in next 3 months
|
||||
- Risk: Competitor could launch superior version
|
||||
- Impact: Could reduce adoption below 20% despite strong beta
|
||||
|
||||
4. **Beta sample size is sufficient (n=200)**
|
||||
- Assumption: 200 users is enough to estimate adoption
|
||||
- Confidence interval: 35% ± 6.6% at 95% CI (see the quick check after this list)
|
||||
- Impact: True beta adoption could be 28-42%, adding uncertainty
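A quick check of that interval, using the standard normal approximation for a sample proportion (a sketch, not part of the original write-up):

```python
# Check of the 95% interval quoted above (normal approximation for a proportion).
import math

p_hat, n = 0.35, 200
se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
margin = 1.96 * se                         # ~95% confidence
print(f"35% ± {margin:.1%}  ->  {p_hat - margin:.0%} to {p_hat + margin:.0%}")  # ≈ ±6.6%, 28%–42%
```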
|
||||
|
||||
**What could invalidate this analysis**:
|
||||
|
||||
- **Major product changes**: If we significantly alter the feature post-beta, beta results become less predictive
|
||||
- **Different user segment**: If we launch to a different user segment than beta testers, adoption patterns may differ
|
||||
- **Seasonal effects**: If beta ran during high-engagement season and launch is during low season
|
||||
- **Discovery/onboarding issues**: If users don't discover the feature in production (beta users were explicitly invited)
|
||||
|
||||
**Uncertainty**:
|
||||
|
||||
- **Most uncertain about**: P(E|¬H) = 15% - How often do features with ≤20% production adoption show 35% beta adoption?
|
||||
- This is the key assumption
|
||||
- If this is actually 25-30%, posterior drops to 82-85%
|
||||
- Recommendation: Review historical beta-to-production conversion data
|
||||
|
||||
- **Could be wrong if**:
|
||||
- Beta users are much more engaged than typical users (>2x multiplier)
|
||||
- Novelty effect in beta wears off quickly in production
|
||||
- Production launch has poor discoverability/onboarding
|
||||
|
||||
---
|
||||
|
||||
## Decision Implications
|
||||
|
||||
**Given posterior of 90% (range: 85-90%)**:
|
||||
|
||||
**Recommended action**: **Proceed with launch** with monitoring plan
|
||||
|
||||
**Rationale**:
|
||||
- 90% confidence exceeds decision threshold for feature launches
|
||||
- Even conservative estimate (85%) supports launch
|
||||
- Risk of failure (<20% adoption) is only 10-15%
|
||||
- Cost of being wrong: Wasted 3 months of development effort
|
||||
- Cost of not launching: Missing potential high-adoption feature
|
||||
|
||||
**If decision threshold is**:
|
||||
|
||||
- **High confidence needed (>80%)**: ✅ **LAUNCH** - Exceeds threshold, proceed with production rollout
|
||||
|
||||
- **Medium confidence (>60%)**: ✅ **LAUNCH** - Well above threshold, strong conviction
|
||||
|
||||
- **Low bar (>40%)**: ✅ **LAUNCH** - Far exceeds minimum threshold
|
||||
|
||||
**Monitoring plan** (to validate forecast):
|
||||
|
||||
1. **Week 1**: Check if adoption is on track for >6% (20% / 3 months, assuming linear growth)
|
||||
- If <4%: Red flag, investigate onboarding/discovery issues
|
||||
- If >8%: Exceeding expectations, validate data quality
|
||||
|
||||
2. **Month 1**: Check if adoption is trending toward >10%
|
||||
- If <7%: Update forecast downward, consider intervention
|
||||
- If >13%: Exceeding expectations, high confidence
|
||||
|
||||
3. **Month 3**: Measure final adoption
|
||||
- If <20%: Analyze what went wrong, calibrate future forecasts
|
||||
- If >20%: Validate forecast accuracy, update priors for future features
|
||||
|
||||
**Next evidence to gather**:
|
||||
|
||||
- **Historical beta-to-production conversion rates**: Review last 5-10 feature launches to calibrate P(E|¬H) more accurately
|
||||
- **User segment analysis**: Compare beta user demographics to production user base
|
||||
- **Competitive feature adoption**: Check competitors' sharing feature adoption rates
|
||||
- **Early production data**: After 1 week of production, use actual adoption data for next Bayesian update
|
||||
|
||||
**What would change our mind**:
|
||||
|
||||
- **Week 1 adoption <3%**: Would update posterior down to ~60%, trigger investigation
|
||||
- **Competitor launches superior feature**: Would need to recalculate with new competitive landscape
|
||||
- **Discovery of major beta sampling bias**: If beta users are 5x more engaged, would significantly reduce confidence
|
||||
|
||||
---
|
||||
|
||||
## Meta: Forecast Quality Assessment
|
||||
|
||||
Using rubric from `rubric_bayesian_reasoning_calibration.json`:
|
||||
|
||||
**Self-assessment**:
|
||||
- Prior Quality: 4/5 (good base rate usage, clear adjustments)
|
||||
- Likelihood Justification: 4/5 (clear reasoning, could use more empirical data)
|
||||
- Evidence Diagnosticity: 4/5 (LR=5.0 is moderately strong)
|
||||
- Calculation Correctness: 5/5 (verified with both odds and probability forms)
|
||||
- Calibration & Realism: 3/5 (posterior is 90%, borderline extreme, flagged for review)
|
||||
- Assumption Transparency: 4/5 (key assumptions stated clearly)
|
||||
- Base Rate Usage: 5/5 (explicit base rate from historical data)
|
||||
- Sensitivity Analysis: 4/5 (comprehensive sensitivity checks)
|
||||
- Interpretation Quality: 4/5 (clear decision implications with thresholds)
|
||||
- Avoidance of Common Errors: 4/5 (no prosecutor's fallacy, proper base rates)
|
||||
|
||||
**Average: 4.1/5** - Meets "very good" threshold for medium-stakes decision
|
||||
|
||||
**Decision**: Forecast is sufficiently rigorous for feature launch decision (medium stakes). Primary area for improvement: gather more data on beta-to-production conversion to refine P(E|¬H) estimate.
|
||||
skills/bayesian-reasoning-calibration/resources/methodology.md (new file, 437 lines)
@@ -0,0 +1,437 @@
|
||||
# Bayesian Reasoning & Calibration Methodology
|
||||
|
||||
## Bayesian Reasoning Workflow
|
||||
|
||||
Copy this checklist and track your progress:
|
||||
|
||||
```
|
||||
Bayesian Reasoning Progress:
|
||||
- [ ] Step 1: Define hypotheses and assign priors
|
||||
- [ ] Step 2: Assign likelihoods for each evidence-hypothesis pair
|
||||
- [ ] Step 3: Compute posteriors and update sequentially
|
||||
- [ ] Step 4: Check for dependent evidence and adjust
|
||||
- [ ] Step 5: Validate calibration and check for bias
|
||||
```
|
||||
|
||||
**Step 1: Define hypotheses and assign priors**
|
||||
|
||||
List all competing hypotheses (including catch-all "Other") and assign prior probabilities that sum to 1.0. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for structuring priors with justifications.
|
||||
|
||||
**Step 2: Assign likelihoods for each evidence-hypothesis pair**
|
||||
|
||||
For each hypothesis, define P(Evidence|Hypothesis) based on how likely the evidence would be if that hypothesis were true. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for likelihood assessment techniques.
|
||||
|
||||
**Step 3: Compute posteriors and update sequentially**
|
||||
|
||||
Calculate posterior probabilities using Bayes' theorem, then chain updates for sequential evidence. See [Sequential Evidence Updates](#sequential-evidence-updates) for multi-stage updating process.
|
||||
|
||||
**Step 4: Check for dependent evidence and adjust**
|
||||
|
||||
Identify whether evidence items are independent or correlated, and use conditional likelihoods if needed. See [Handling Dependent Evidence](#handling-dependent-evidence) for dependence detection and correction.
|
||||
|
||||
**Step 5: Validate calibration and check for bias**
|
||||
|
||||
Track forecasts over time and compare stated probabilities to actual outcomes using calibration curves and Brier scores. See [Calibration Techniques](#calibration-techniques) for debiasing methods and metrics.
|
||||
|
||||
---
|
||||
|
||||
## Multiple Hypothesis Updates
|
||||
|
||||
### Problem: Choosing Among Many Hypotheses
|
||||
|
||||
Often you have 3+ competing hypotheses and need to update all simultaneously.
|
||||
|
||||
**Example:**
|
||||
- H1: Bug in payment processor code
|
||||
- H2: Database connection timeout
|
||||
- H3: Third-party API outage
|
||||
- H4: DDoS attack
|
||||
|
||||
### Approach: Compute Posterior for Each Hypothesis
|
||||
|
||||
**Step 1: Assign prior probabilities** (must sum to 1)
|
||||
|
||||
| Hypothesis | Prior P(H) | Justification |
|
||||
|------------|-----------|---------------|
|
||||
| H1: Payment bug | 0.30 | Common issue, recent deploy |
|
||||
| H2: DB timeout | 0.25 | Has happened before |
|
||||
| H3: API outage | 0.20 | Dependency on external service |
|
||||
| H4: DDoS | 0.10 | Rare but possible |
|
||||
| H5: Other | 0.15 | Catch-all for unknowns |
|
||||
| **Total** | **1.00** | Must sum to 1 |
|
||||
|
||||
**Step 2: Define likelihood for each hypothesis**
|
||||
|
||||
Evidence E: "500 errors only on payment endpoint"
|
||||
|
||||
| Hypothesis | P(E\|H) | Justification |
|
||||
|------------|---------|---------------|
|
||||
| H1: Payment bug | 0.80 | Bug would affect payment specifically |
|
||||
| H2: DB timeout | 0.30 | Would affect multiple endpoints |
|
||||
| H3: API outage | 0.70 | Payment uses external API |
|
||||
| H4: DDoS | 0.50 | Could target any endpoint |
|
||||
| H5: Other | 0.20 | Generic catch-all |
|
||||
|
||||
**Step 3: Compute P(E)** (marginal probability)
|
||||
|
||||
```
|
||||
P(E) = Σ [P(E|Hi) × P(Hi)] for all hypotheses
|
||||
|
||||
P(E) = (0.80 × 0.30) + (0.30 × 0.25) + (0.70 × 0.20) + (0.50 × 0.10) + (0.20 × 0.15)
|
||||
P(E) = 0.24 + 0.075 + 0.14 + 0.05 + 0.03
|
||||
P(E) = 0.535
|
||||
```
|
||||
|
||||
**Step 4: Compute posterior for each hypothesis**
|
||||
|
||||
```
|
||||
P(Hi|E) = [P(E|Hi) × P(Hi)] / P(E)
|
||||
|
||||
P(H1|E) = (0.80 × 0.30) / 0.535 = 0.24 / 0.535 = 0.449 (44.9%)
|
||||
P(H2|E) = (0.30 × 0.25) / 0.535 = 0.075 / 0.535 = 0.140 (14.0%)
|
||||
P(H3|E) = (0.70 × 0.20) / 0.535 = 0.14 / 0.535 = 0.262 (26.2%)
|
||||
P(H4|E) = (0.50 × 0.10) / 0.535 = 0.05 / 0.535 = 0.093 (9.3%)
|
||||
P(H5|E) = (0.20 × 0.15) / 0.535 = 0.03 / 0.535 = 0.056 (5.6%)
|
||||
|
||||
Total: 100% (check: posteriors must sum to 1)
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
- H1 (Payment bug): 30% → 44.9% (increased 15 pp)
|
||||
- H3 (API outage): 20% → 26.2% (increased 6 pp)
|
||||
- H2 (DB timeout): 25% → 14.0% (decreased 11 pp)
|
||||
- H4 (DDoS): 10% → 9.3% (barely changed)
|
||||
|
||||
**Decision**: Investigate H1 (payment bug) first, then H3 (API outage) as backup.
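A compact Python sketch of this multi-hypothesis update (the priors and likelihoods are the illustrative values from the tables above):

```python
# Multi-hypothesis Bayesian update: one marginal P(E), one posterior per hypothesis.
priors = {"H1 payment bug": 0.30, "H2 db timeout": 0.25, "H3 api outage": 0.20,
          "H4 ddos": 0.10, "H5 other": 0.15}
likelihoods = {"H1 payment bug": 0.80, "H2 db timeout": 0.30, "H3 api outage": 0.70,
               "H4 ddos": 0.50, "H5 other": 0.20}

p_e = sum(likelihoods[h] * priors[h] for h in priors)         # marginal P(E) = 0.535
posteriors = {h: likelihoods[h] * priors[h] / p_e for h in priors}

assert abs(sum(posteriors.values()) - 1.0) < 1e-9             # posteriors must sum to 1
for h, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{h}: {p:.1%}")                                     # H1 44.9%, H3 26.2%, ...
```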
|
||||
|
||||
---
|
||||
|
||||
## Sequential Evidence Updates
|
||||
|
||||
### Problem: Multiple Pieces of Evidence Over Time
|
||||
|
||||
Evidence comes in stages, need to update belief sequentially.
|
||||
|
||||
**Example:**
|
||||
- Evidence 1: "500 errors on payment endpoint" (t=0)
|
||||
- Evidence 2: "External API status page shows outage" (t+5 min)
|
||||
- Evidence 3: "Our other services using same API also failing" (t+10 min)
|
||||
|
||||
### Approach: Chain Updates (Prior → E1 → E2 → E3)
|
||||
|
||||
**Step 1: Update with E1** (as above)
|
||||
```
|
||||
Prior → P(H|E1)
|
||||
P(H1|E1) = 44.9% (payment bug)
|
||||
P(H3|E1) = 26.2% (API outage)
|
||||
```
|
||||
|
||||
**Step 2: Use posterior as new prior, update with E2**
|
||||
|
||||
Evidence E2: "External API status page shows outage"
|
||||
|
||||
New prior (from the E1 posteriors; for simplicity, only the two leading hypotheses H1 and H3 are carried forward, and the update below renormalizes over just these two):
|
||||
- P(H1) = 0.449 (payment bug)
|
||||
- P(H3) = 0.262 (API outage)
|
||||
|
||||
Likelihoods given E2:
|
||||
- P(E2|H1) = 0.20 (bug wouldn't cause external API status change)
|
||||
- P(E2|H3) = 0.95 (API outage would definitely show on status page)
|
||||
|
||||
```
|
||||
P(E2) = (0.20 × 0.449) + (0.95 × 0.262) = 0.0898 + 0.2489 = 0.3387
|
||||
|
||||
P(H1|E1,E2) = (0.20 × 0.449) / 0.3387 = 0.265 (26.5%)
|
||||
P(H3|E1,E2) = (0.95 × 0.262) / 0.3387 = 0.735 (73.5%)
|
||||
```
|
||||
|
||||
**Interpretation**: E2 strongly favors H3 (API outage). H1 dropped from 44.9% → 26.5%.
|
||||
|
||||
**Step 3: Update with E3**
|
||||
|
||||
Evidence E3: "Other services using same API also failing"
|
||||
|
||||
New prior:
|
||||
- P(H1) = 0.265
|
||||
- P(H3) = 0.735
|
||||
|
||||
Likelihoods:
|
||||
- P(E3|H1) = 0.10 (payment bug wouldn't affect other services)
|
||||
- P(E3|H3) = 0.99 (API outage would affect all services)
|
||||
|
||||
```
|
||||
P(E3) = (0.10 × 0.265) + (0.99 × 0.735) = 0.0265 + 0.7277 = 0.7542
|
||||
|
||||
P(H1|E1,E2,E3) = (0.10 × 0.265) / 0.7542 = 0.035 (3.5%)
P(H3|E1,E2,E3) = (0.99 × 0.735) / 0.7542 = 0.965 (96.5%)
|
||||
```
|
||||
|
||||
**Final conclusion**: 96.5% confidence it's API outage, not payment bug.
|
||||
|
||||
**Summary of belief evolution:**
|
||||
```
|
||||
Prior After E1 After E2 After E3
|
||||
H1 (Bug): 30% → 44.9% → 26.5% → 3.5%
H3 (API): 20% → 26.2% → 73.5% → 96.5%
|
||||
```
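The same chain can be written as a small loop. The sketch below tracks only H1 and H3 and renormalizes after each piece of evidence, reproducing the figures above:

```python
# Sequential (chained) Bayesian updating over the two tracked hypotheses.
def update(beliefs, likelihoods):
    """beliefs, likelihoods: dicts keyed by hypothesis; returns normalized posteriors."""
    unnorm = {h: beliefs[h] * likelihoods[h] for h in beliefs}
    total = sum(unnorm.values())
    return {h: v / total for h, v in unnorm.items()}

beliefs = {"H1 bug": 0.449, "H3 api": 0.262}                 # posteriors after E1
beliefs = update(beliefs, {"H1 bug": 0.20, "H3 api": 0.95})  # E2: status page shows outage
beliefs = update(beliefs, {"H1 bug": 0.10, "H3 api": 0.99})  # E3: other services failing
print({h: f"{p:.1%}" for h, p in beliefs.items()})           # ≈ {'H1 bug': '3.5%', 'H3 api': '96.5%'}
```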
|
||||
|
||||
---
|
||||
|
||||
## Handling Dependent Evidence
|
||||
|
||||
### Problem: Evidence Items Not Independent
|
||||
|
||||
**Naive approach fails when**:
|
||||
- E1 and E2 are correlated (not independent)
|
||||
- Updating twice with same information
|
||||
|
||||
**Example of dependent evidence:**
|
||||
- E1: "User reports payment failing"
|
||||
- E2: "Another user reports payment failing"
|
||||
|
||||
If E1 and E2 are from same incident, they're not independent evidence!
|
||||
|
||||
### Solution 1: Treat as Single Evidence
|
||||
|
||||
If evidence is dependent, combine into one update:
|
||||
|
||||
**Instead of:**
|
||||
- Update with E1: "User reports payment failing"
|
||||
- Update with E2: "Another user reports payment failing"
|
||||
|
||||
**Do:**
|
||||
- Single update with E: "Multiple users report payment failing"
|
||||
|
||||
Likelihood:
|
||||
- P(E|Bug) = 0.90 (if bug exists, multiple users affected)
|
||||
- P(E|No bug) = 0.05 (false reports rare)
|
||||
|
||||
### Solution 2: Conditional Likelihoods
|
||||
|
||||
If evidence is conditionally dependent (E2 depends on E1), use:
|
||||
|
||||
```
|
||||
P(H|E1,E2) ∝ P(E2|E1,H) × P(E1|H) × P(H)
|
||||
```
|
||||
|
||||
Not:
|
||||
```
|
||||
P(H|E1,E2) ∝ P(E2|H) × P(E1|H) × P(H) ← Wrong: assumes E1 and E2 are independent given H
|
||||
```
|
||||
|
||||
**Example:**
|
||||
- E1: "Test fails on staging"
|
||||
- E2: "Test fails on production" (same test, likely same cause)
|
||||
|
||||
Conditional:
|
||||
- P(E2|E1, Bug) = 0.95 (if staging failed due to bug, production likely fails too)
|
||||
- P(E2|E1, Env) = 0.20 (if staging failed due to environment, production different environment)
|
||||
|
||||
### Red Flags for Dependent Evidence
|
||||
|
||||
Watch out for:
|
||||
- Multiple reports of same incident (count as one)
|
||||
- Cascading failures (downstream failure caused by upstream)
|
||||
- Repeated measurements of same thing (not new info)
|
||||
- Evidence from same source (correlated errors)
|
||||
|
||||
**Principle**: Each update should provide **new information**. If E2 is predictable from E1, it's not independent.
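To see why double counting matters, the sketch below updates twice on two reports of the same incident versus once on the combined report. The combined-report likelihoods (0.90 / 0.05) are the ones given above; the single-report likelihoods (0.60 / 0.05) and the 30% prior are assumed for illustration.

```python
# Double counting correlated reports vs. one combined update (illustrative numbers).
def posterior(prior, p_e_h, p_e_not_h):
    return p_e_h * prior / (p_e_h * prior + p_e_not_h * (1 - prior))

prior = 0.30                          # P(bug) before any reports (assumed)

# Wrong: treat two reports of the same incident as independent evidence.
p = posterior(prior, 0.60, 0.05)      # report 1
p_wrong = posterior(p, 0.60, 0.05)    # report 2, counted again

# Right: one combined update for "multiple users report payment failing".
p_right = posterior(prior, 0.90, 0.05)

print(f"double-counted: {p_wrong:.0%}, single combined update: {p_right:.0%}")
```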
|
||||
|
||||
---
|
||||
|
||||
## Calibration Techniques
|
||||
|
||||
### Problem: Over/Underconfidence Bias
|
||||
|
||||
Common patterns:
|
||||
- **Overconfidence**: Stating 90% when true rate is 70%
|
||||
- **Underconfidence**: Stating 60% when true rate is 80%
|
||||
|
||||
### Calibration Check: Track Predictions Over Time
|
||||
|
||||
**Method**:
|
||||
1. Make many probabilistic forecasts (P=70%, P=40%, etc.)
|
||||
2. Track outcomes
|
||||
3. Group by confidence level
|
||||
4. Compare stated probability to actual frequency
|
||||
|
||||
**Example calibration check:**
|
||||
|
||||
| Your Confidence | # Predictions | # Correct | Actual % | Calibration |
|
||||
|-----------------|---------------|-----------|----------|-------------|
|
||||
| 90-100% | 20 | 16 | 80% | Overconfident |
|
||||
| 70-89% | 30 | 24 | 80% | Good |
|
||||
| 50-69% | 25 | 14 | 56% | Good |
|
||||
| 30-49% | 15 | 5 | 33% | Good |
|
||||
| 0-29% | 10 | 2 | 20% | Good |
|
||||
|
||||
**Interpretation**: Overconfident at high confidence levels (saying 90% but only 80% correct).
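A calibration table like the one above can be produced from a simple forecast log. The sketch below uses made-up placeholder forecasts; swap in your own (stated probability, outcome) pairs.

```python
# Group forecasts into the confidence bands used above and compare to actual hit rates.
forecasts = [  # (stated probability, outcome: 1 = happened, 0 = did not) — placeholder data
    (0.95, 1), (0.90, 1), (0.90, 0), (0.80, 1), (0.75, 1),
    (0.60, 0), (0.55, 1), (0.40, 0), (0.30, 1), (0.20, 0),
]

bands = [(0.90, 1.00), (0.70, 0.89), (0.50, 0.69), (0.30, 0.49), (0.00, 0.29)]
for lo, hi in bands:
    outcomes = [o for p, o in forecasts if lo <= p <= hi]
    if outcomes:
        print(f"{lo:.0%}-{hi:.0%}: n={len(outcomes)}, actual {sum(outcomes) / len(outcomes):.0%}")
```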
|
||||
|
||||
### Calibration Curve
|
||||
|
||||
Plot stated probability vs actual frequency:
|
||||
|
||||
```
|
||||
Actual %
|
||||
100 ┤ ●
|
||||
│ ●
|
||||
80 ┤ ● (overconfident)
|
||||
│ ●
|
||||
60 ┤ ●
|
||||
│ ●
|
||||
40 ┤ ●
|
||||
│
|
||||
20 ┤
|
||||
│
|
||||
0 └─────────────────────────────────
|
||||
0 20 40 60 80 100
|
||||
Stated probability %
|
||||
|
||||
Perfect calibration = diagonal line
|
||||
Above line = overconfident
|
||||
Below line = underconfident
|
||||
```
|
||||
|
||||
### Debiasing Techniques
|
||||
|
||||
**For overconfidence:**
|
||||
- Consider alternative explanations (how could I be wrong?)
|
||||
- Base rate check (what's the typical success rate?)
|
||||
- Pre-mortem: "It's 6 months from now and we failed. Why?"
|
||||
- Confidence intervals: State range, not point estimate
|
||||
|
||||
**For underconfidence:**
|
||||
- Review past successes (build evidence for confidence)
|
||||
- Test predictions: Am I systematically too cautious?
|
||||
- Cost of inaction: What's the cost of waiting for certainty?
|
||||
|
||||
### Brier Score (Calibration Metric)
|
||||
|
||||
**Formula**:
|
||||
```
|
||||
Brier Score = (1/n) × Σ (pi - oi)²
|
||||
|
||||
pi = stated probability for outcome i
|
||||
oi = actual outcome (1 if happened, 0 if not)
|
||||
```
|
||||
|
||||
**Example:**
|
||||
```
|
||||
Prediction 1: P=0.8, Outcome=1 → (0.8-1)² = 0.04
|
||||
Prediction 2: P=0.6, Outcome=0 → (0.6-0)² = 0.36
|
||||
Prediction 3: P=0.9, Outcome=1 → (0.9-1)² = 0.01
|
||||
|
||||
Brier Score = (0.04 + 0.36 + 0.01) / 3 = 0.137
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
- 0.00 = perfect calibration
|
||||
- 0.25 = random guessing
|
||||
- Lower is better
|
||||
|
||||
**Typical scores:**
|
||||
- Expert forecasters: 0.10-0.15
|
||||
- Average people: 0.20-0.25
|
||||
- Aim for: <0.15
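A one-function sketch of the metric (the sample list is the three-prediction example above):

```python
# Brier score over a list of (stated probability, outcome) pairs.
def brier_score(forecasts):
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

print(brier_score([(0.8, 1), (0.6, 0), (0.9, 1)]))  # ≈ 0.137, matching the worked example
```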
|
||||
|
||||
---
|
||||
|
||||
## Advanced Applications
|
||||
|
||||
### Application 1: Hierarchical Priors
|
||||
|
||||
When you're uncertain about the prior itself.
|
||||
|
||||
**Example:**
|
||||
- Question: "Will project finish on time?"
|
||||
- Uncertain about base rate: "Do similar projects finish on time 30% or 60% of the time?"
|
||||
|
||||
**Approach**: Model uncertainty in prior
|
||||
|
||||
```
|
||||
P(On time) = Weighted average of different base rates
|
||||
|
||||
Scenario 1: Base rate = 30% (if similar to past failures), Weight = 40%
|
||||
Scenario 2: Base rate = 50% (if average project), Weight = 30%
|
||||
Scenario 3: Base rate = 70% (if similar to past successes), Weight = 30%
|
||||
|
||||
Prior P(On time) = (0.30 × 0.40) + (0.50 × 0.30) + (0.70 × 0.30)
|
||||
= 0.12 + 0.15 + 0.21
|
||||
= 0.48 (48%)
|
||||
```
|
||||
|
||||
Then update this 48% prior with evidence.
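A small sketch of the mixture prior; the LR = 3 used for the follow-up update is a hypothetical value, not part of the example above.

```python
# Hierarchical (mixture) prior: weight the candidate base rates, then update as usual.
scenarios = [(0.30, 0.40), (0.50, 0.30), (0.70, 0.30)]   # (base rate, weight); weights sum to 1
prior = sum(rate * weight for rate, weight in scenarios)  # 0.48

lr = 3.0                                                  # hypothetical likelihood ratio
posterior_odds = prior / (1 - prior) * lr
print(f"prior = {prior:.0%}, posterior with LR=3 = {posterior_odds / (1 + posterior_odds):.0%}")
```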
|
||||
|
||||
### Application 2: Bayesian Model Comparison
|
||||
|
||||
Compare which model/theory better explains data.
|
||||
|
||||
**Example:**
|
||||
- Model A: "Bug in feature X"
|
||||
- Model B: "Infrastructure issue"
|
||||
|
||||
Evidence: 10 data points
|
||||
|
||||
```
|
||||
P(Model A|Data) / P(Model B|Data) = [P(Data|Model A) × P(Model A)] / [P(Data|Model B) × P(Model B)]
|
||||
```
|
||||
|
||||
**Bayes Factor** = P(Data|Model A) / P(Data|Model B)
|
||||
|
||||
- BF > 10: Strong evidence for Model A
|
||||
- BF = 3-10: Moderate evidence for Model A
|
||||
- BF = 1: Equal evidence
|
||||
- BF < 0.33: Evidence against Model A
|
||||
|
||||
### Application 3: Forecasting with Bayesian Updates
|
||||
|
||||
Use for repeated forecasting (elections, product launches, project timelines).
|
||||
|
||||
**Process:**
|
||||
1. Start with base rate prior
|
||||
2. Update weekly as new evidence arrives
|
||||
3. Track belief evolution over time
|
||||
4. Compare final forecast to outcome (calibration check)
|
||||
|
||||
**Example: Product launch success**
|
||||
|
||||
```
|
||||
Week -8: Prior = 60% (base rate for similar launches)
|
||||
Week -6: Beta feedback positive → Update to 70%
|
||||
Week -4: Competitor launches similar product → Update to 55%
|
||||
Week -2: Pre-orders exceed target → Update to 75%
|
||||
Week 0: Launch → Actual success: Yes ✓
|
||||
|
||||
Forecast evolution: 60% → 70% → 55% → 75% → (Outcome: Yes)
|
||||
```
|
||||
|
||||
**Calibration**: 75% final forecast, outcome=Yes. Good calibration if 7-8 out of 10 forecasts at 75% are correct.
|
||||
|
||||
---
|
||||
|
||||
## Quality Checklist for Complex Cases
|
||||
|
||||
**Multiple Hypotheses**:
|
||||
- [ ] All hypotheses listed (including catch-all "Other")
|
||||
- [ ] Priors sum to 1.0
|
||||
- [ ] Likelihoods defined for all hypothesis-evidence pairs
|
||||
- [ ] Posteriors sum to 1.0 (math check)
|
||||
- [ ] Interpretation provided (which hypothesis favored? by how much?)
|
||||
|
||||
**Sequential Updates**:
|
||||
- [ ] Evidence items clearly independent or conditional dependence noted
|
||||
- [ ] Each update uses previous posterior as new prior
|
||||
- [ ] Belief evolution tracked (how beliefs changed over time)
|
||||
- [ ] Final conclusion integrates all evidence
|
||||
- [ ] Timeline shows when each piece of evidence arrived
|
||||
|
||||
**Calibration**:
|
||||
- [ ] Considered alternative explanations (not overconfident?)
|
||||
- [ ] Checked against base rates (not ignoring priors?)
|
||||
- [ ] Stated confidence interval or range (not just point estimate)
|
||||
- [ ] Identified assumptions that could make forecast wrong
|
||||
- [ ] Planned follow-up to track calibration (compare forecast to outcome)
|
||||
|
||||
**Minimum Standard for Complex Cases**:
|
||||
- Multiple hypotheses: Score ≥ 4.0 on rubric
|
||||
- High-stakes forecasts: Track calibration over 10+ predictions
|
||||
skills/bayesian-reasoning-calibration/resources/template.md (new file, 308 lines)
@@ -0,0 +1,308 @@
|
||||
# Bayesian Reasoning Template
|
||||
|
||||
## Workflow
|
||||
|
||||
Copy this checklist and track your progress:
|
||||
|
||||
```
|
||||
Bayesian Update Progress:
|
||||
- [ ] Step 1: State question and establish prior from base rates
|
||||
- [ ] Step 2: Estimate likelihoods for evidence
|
||||
- [ ] Step 3: Calculate posterior using Bayes' theorem
|
||||
- [ ] Step 4: Perform sensitivity analysis
|
||||
- [ ] Step 5: Calibrate and validate with quality checklist
|
||||
```
|
||||
|
||||
**Step 1: State question and establish prior from base rates**
|
||||
|
||||
Define specific, testable hypothesis with timeframe and success criteria. Identify reference class and base rate, adjust for specific differences, and state prior explicitly with justification. See [Step 1: State the Question](#step-1-state-the-question) and [Step 2: Find Base Rates](#step-2-find-base-rates) for guidance.
|
||||
|
||||
**Step 2: Estimate likelihoods for evidence**
|
||||
|
||||
Assess P(E|H) (probability of evidence if hypothesis TRUE) and P(E|¬H) (probability if FALSE), calculate likelihood ratio = P(E|H) / P(E|¬H), and interpret diagnostic strength. See [Step 3: Estimate Likelihoods](#step-3-estimate-likelihoods) for examples and common mistakes.
|
||||
|
||||
**Step 3: Calculate posterior using Bayes' theorem**
|
||||
|
||||
Apply P(H|E) = [P(E|H) × P(H)] / P(E) or use simpler odds form: Posterior Odds = Prior Odds × LR. Interpret change in belief (prior → posterior) and strength of evidence. See [Step 4: Calculate Posterior](#step-4-calculate-posterior) for calculation methods.
|
||||
|
||||
**Step 4: Perform sensitivity analysis**
|
||||
|
||||
Test how posterior changes with different prior values and likelihoods to assess robustness of conclusion. See [Sensitivity Analysis](#sensitivity-analysis) section in template structure.
|
||||
|
||||
**Step 5: Calibrate and validate with quality checklist**
|
||||
|
||||
Check for overconfidence, base rate neglect, and extreme posteriors. Use [Calibration Check](#calibration-check) and [Quality Checklist](#quality-checklist) to verify prior is justified, likelihoods have reasoning, evidence is diagnostic (LR ≠ 1), calculation correct, and assumptions stated.
|
||||
|
||||
## Quick Template
|
||||
|
||||
````markdown
|
||||
# Bayesian Analysis: {Topic}
|
||||
|
||||
## Question
|
||||
**Hypothesis**: {What are you testing?}
|
||||
**Estimating**: P({specific outcome})
|
||||
**Timeframe**: {When will outcome be known?}
|
||||
**Matters because**: {What decision depends on this?}
|
||||
|
||||
---
|
||||
|
||||
## Prior Belief (Before Evidence)
|
||||
|
||||
### Base Rate
|
||||
{What's the general frequency in similar cases?}
|
||||
- Reference class: {Similar situations}
|
||||
- Base rate: {X%}
|
||||
|
||||
### Adjustments
|
||||
{How is this case different from base rate?}
|
||||
- Factor 1: {Increases/decreases probability because...}
|
||||
- Factor 2: {Increases/decreases probability because...}
|
||||
|
||||
### Prior Probability
|
||||
**P(H) = {X%}**
|
||||
|
||||
**Justification**: {Why this prior?}
|
||||
|
||||
**Range if uncertain**: {min%} to {max%}
|
||||
|
||||
---
|
||||
|
||||
## Evidence
|
||||
|
||||
**What was observed**: {Specific evidence or data}
|
||||
|
||||
**How diagnostic**: {Does this distinguish hypothesis true vs false?}
|
||||
|
||||
### Likelihoods
|
||||
|
||||
**P(E|H) = {X%}** - Probability of seeing this evidence IF hypothesis is TRUE
|
||||
- Reasoning: {Why this likelihood?}
|
||||
|
||||
**P(E|¬H) = {Y%}** - Probability of seeing this evidence IF hypothesis is FALSE
|
||||
- Reasoning: {Why this likelihood?}
|
||||
|
||||
**Likelihood Ratio = {X/Y} = {ratio}**
|
||||
- Interpretation: Evidence is {very strong / moderate / weak / not diagnostic}
|
||||
|
||||
---
|
||||
|
||||
## Bayesian Update
|
||||
|
||||
### Calculation
|
||||
|
||||
**Using probability form**:
|
||||
```
|
||||
P(H|E) = [P(E|H) × P(H)] / P(E)
|
||||
|
||||
where P(E) = [P(E|H) × P(H)] + [P(E|¬H) × P(¬H)]
|
||||
|
||||
P(E) = [{X%} × {Prior%}] + [{Y%} × {100-Prior%}]
|
||||
P(E) = {calculation}
|
||||
|
||||
P(H|E) = [{X%} × {Prior%}] / {P(E)}
|
||||
P(H|E) = {result%}
|
||||
```
|
||||
|
||||
**Or using odds form** (often simpler):
|
||||
```
|
||||
Prior Odds = P(H) / P(¬H) = {Prior%} / {100-Prior%} = {odds}
|
||||
Likelihood Ratio = {LR}
|
||||
Posterior Odds = Prior Odds × LR = {odds} × {LR} = {result}
|
||||
Posterior Probability = Posterior Odds / (1 + Posterior Odds) = {result%}
|
||||
```
|
||||
|
||||
### Posterior Probability
|
||||
**P(H|E) = {result%}**
|
||||
|
||||
### Change in Belief
|
||||
- Prior: {X%}
|
||||
- Posterior: {Y%}
|
||||
- Change: {+/- Z percentage points}
|
||||
- Interpretation: Evidence {strongly supports / moderately supports / weakly supports / contradicts} hypothesis
|
||||
|
||||
---
|
||||
|
||||
## Sensitivity Analysis
|
||||
|
||||
**How sensitive is posterior to inputs?**
|
||||
|
||||
If Prior was {different value}:
|
||||
- Posterior would be: {recalculated value}
|
||||
|
||||
If P(E|H) was {different value}:
|
||||
- Posterior would be: {recalculated value}
|
||||
|
||||
**Robustness**: Conclusion is {robust / somewhat robust / sensitive} to assumptions
|
||||
|
||||
---
|
||||
|
||||
## Calibration Check
|
||||
|
||||
**Am I overconfident?**
|
||||
- Did I anchor on initial belief? {yes/no - reasoning}
|
||||
- Did I ignore base rates? {yes/no - reasoning}
|
||||
- Is my posterior extreme (>90% or <10%)? {If yes, is evidence truly that strong?}
|
||||
- Would an outside observer agree with my likelihoods? {check}
|
||||
|
||||
**Red flags**:
|
||||
- ✗ Posterior is 100% or 0% (almost never justified)
|
||||
- ✗ Large update from weak evidence (check LR)
|
||||
- ✗ Prior ignores base rate entirely
|
||||
- ✗ Likelihoods are guesses without reasoning
|
||||
|
||||
---
|
||||
|
||||
## Limitations & Assumptions
|
||||
|
||||
**Key assumptions**:
|
||||
1. {Assumption 1}
|
||||
2. {Assumption 2}
|
||||
|
||||
**What could invalidate this analysis**:
|
||||
- {Condition that would change conclusion}
|
||||
- {Different interpretation of evidence}
|
||||
|
||||
**Uncertainty**:
|
||||
- Most uncertain about: {which input?}
|
||||
- Could be wrong if: {what scenario?}
|
||||
|
||||
---
|
||||
|
||||
## Decision Implications
|
||||
|
||||
**Given posterior of {X%}**:
|
||||
|
||||
Recommended action: {what to do}
|
||||
|
||||
**If decision threshold is**:
|
||||
- High confidence needed (>80%): {action}
|
||||
- Medium confidence (>60%): {action}
|
||||
- Low bar (>40%): {action}
|
||||
|
||||
**Next evidence to gather**: {What would further update belief?}
|
||||
````
|
||||
|
||||
## Step-by-Step Guide
|
||||
|
||||
### Step 1: State the Question
|
||||
|
||||
Be specific and testable.
|
||||
|
||||
**Good**: "Will our product achieve >1000 DAU within 6 months?"
|
||||
**Bad**: "Will the product succeed?"
|
||||
|
||||
Define success criteria numerically when possible.
|
||||
|
||||
### Step 2: Find Base Rates
|
||||
|
||||
**Method**:
|
||||
1. Identify reference class (similar situations)
|
||||
2. Look up historical frequency
|
||||
3. Adjust for known differences
|
||||
|
||||
**Example**:
|
||||
- Question: Will our SaaS startup raise Series A?
|
||||
- Reference class: B2B SaaS startups, seed stage, similar market
|
||||
- Base rate: ~30% raise Series A within 2 years
|
||||
- Adjustments: Strong traction (+), competitive market (-), experienced team (+)
|
||||
- Adjusted prior: 45%
|
||||
|
||||
**Common mistake**: Ignoring base rates entirely ("inside view" bias)
|
||||
|
||||
### Step 3: Estimate Likelihoods
|
||||
|
||||
Ask: "If hypothesis were true, how likely is this evidence?"
|
||||
Then: "If hypothesis were false, how likely is this evidence?"
|
||||
|
||||
**Example - Medical test**:
|
||||
- Hypothesis: Patient has disease (prevalence 1%)
|
||||
- Evidence: Positive test result
|
||||
- P(positive test | has disease) = 90% (test sensitivity)
|
||||
- P(positive test | no disease) = 5% (false positive rate)
|
||||
- LR = 90% / 5% = 18 (strong evidence)
|
||||
|
||||
**Common mistake**: Confusing P(E|H) with P(H|E) - the "prosecutor's fallacy"
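To make the distinction concrete, here is a short sketch that runs the medical-test numbers above through Bayes' theorem; note how far P(disease | positive) is from the 90% sensitivity.

```python
# Medical-test example: P(E|H) is 90%, but P(H|E) is only ~15% because the disease is rare.
prevalence = 0.01          # P(disease)
sensitivity = 0.90         # P(positive | disease)
false_positive = 0.05      # P(positive | no disease)

p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"P(disease | positive) = {p_disease_given_positive:.1%}")  # ≈ 15.4%, not 90%
```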
|
||||
|
||||
### Step 4: Calculate Posterior
|
||||
|
||||
**Odds form is often easier**:
|
||||
|
||||
1. Convert prior to odds: Odds = P / (1-P)
|
||||
2. Multiply by LR: Posterior Odds = Prior Odds × LR
|
||||
3. Convert back to probability: P = Odds / (1 + Odds)
|
||||
|
||||
**Example**:
|
||||
- Prior: 30% → Odds = 0.3/0.7 = 0.43
|
||||
- LR = 5
|
||||
- Posterior Odds = 0.43 × 5 = 2.15
|
||||
- Posterior Probability = 2.15 / 3.15 = 68%
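The three-step recipe translates directly into a small helper (a sketch; the call reproduces the 30% prior, LR = 5 example):

```python
# Odds-form update helper: prior probability + likelihood ratio -> posterior probability.
def update_with_lr(prior, lr):
    odds = prior / (1 - prior)                     # 1. convert prior to odds
    posterior_odds = odds * lr                     # 2. multiply by the likelihood ratio
    return posterior_odds / (1 + posterior_odds)   # 3. convert back to probability

print(f"{update_with_lr(0.30, 5):.0%}")  # ≈ 68%
```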
|
||||
|
||||
### Step 5: Calibrate
|
||||
|
||||
**Calibration questions**:
|
||||
- If you made 100 predictions at X% confidence, would X actually occur?
|
||||
- Are you systematically over/underconfident?
|
||||
- Does your posterior pass the "outside view" test?
|
||||
|
||||
**Calibration tips**:
|
||||
- Track your forecasts and outcomes
|
||||
- Be especially skeptical of extreme probabilities (>95%, <5%)
|
||||
- Consider opposite evidence (confirmation bias check)
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
**Ignoring base rates** ("base rate neglect"):
|
||||
- Bad: "Test is 90% accurate, so positive test means 90% chance of disease"
|
||||
- Good: "Disease is rare (1%), so even with positive test, probability is only ~15%"
|
||||
|
||||
**Confusing conditional probabilities**:
|
||||
- P(positive test | disease) ≠ P(disease | positive test)
|
||||
- These can be very different!
|
||||
|
||||
**Overconfident likelihoods**:
|
||||
- Claiming P(E|H) = 99% when evidence is ambiguous
|
||||
- Not considering alternative explanations
|
||||
|
||||
**Anchoring on prior**:
|
||||
- Weak evidence + starting at 50% = staying near 50%
|
||||
- Solution: Use base rates, not 50% default
|
||||
|
||||
**Treating all evidence as equally strong**:
|
||||
- Check likelihood ratio (LR)
|
||||
- LR ≈ 1 means evidence is not diagnostic
|
||||
|
||||
## Worked Example
|
||||
|
||||
**Question**: Will project finish on time?
|
||||
|
||||
**Prior**:
|
||||
- Base rate: 60% of our projects finish on time
|
||||
- This project: More complex than average (-), experienced team (+)
|
||||
- Prior: 55%
|
||||
|
||||
**Evidence**: At 50% milestone, we're 1 week behind schedule
|
||||
|
||||
**Likelihoods**:
|
||||
- P(behind at 50% | finish on time) = 30% (can recover)
|
||||
- P(behind at 50% | miss deadline) = 80% (usually signals trouble)
|
||||
- LR = 30% / 80% = 0.375 (evidence against on-time)
|
||||
|
||||
**Calculation**:
|
||||
- Prior odds = 0.55 / 0.45 = 1.22
|
||||
- Posterior odds = 1.22 × 0.375 = 0.46
|
||||
- Posterior probability = 0.46 / 1.46 = 32%
|
||||
|
||||
**Conclusion**: Updated from 55% to 32% probability of on-time finish. Being behind at 50% is meaningful evidence of delay.
|
||||
|
||||
**Decision**: If deadline is flexible, continue. If hard deadline, consider descoping or adding resources.
|
||||
|
||||
## Quality Checklist
|
||||
|
||||
- [ ] Prior is justified (base rate + adjustments)
|
||||
- [ ] Likelihoods have reasoning (not just guesses)
|
||||
- [ ] Evidence is diagnostic (LR significantly different from 1)
|
||||
- [ ] Calculation is correct
|
||||
- [ ] Posterior is in reasonable range (not 0% or 100%)
|
||||
- [ ] Assumptions are stated
|
||||
- [ ] Sensitivity analysis performed
|
||||
- [ ] Decision implications clear
|
||||