Initial commit

Zhongwei Li
2025-11-30 08:38:26 +08:00
commit 41d9f6b189
304 changed files with 98322 additions and 0 deletions

@@ -0,0 +1,182 @@
---
name: bayesian-reasoning-calibration
description: Use when making predictions or judgments under uncertainty and need to explicitly update beliefs with new evidence. Invoke when forecasting outcomes, evaluating probabilities, testing hypotheses, calibrating confidence, assessing risks with uncertain data, or avoiding overconfidence bias. Use when user mentions priors, likelihoods, Bayes theorem, probability updates, forecasting, calibration, or belief revision.
---
# Bayesian Reasoning & Calibration
## Table of Contents
- [Purpose](#purpose)
- [When to Use This Skill](#when-to-use-this-skill)
- [What is Bayesian Reasoning?](#what-is-bayesian-reasoning)
- [Workflow](#workflow)
- [1. Define the Question](#1--define-the-question)
- [2. Establish Prior Beliefs](#2--establish-prior-beliefs)
- [3. Identify Evidence & Likelihoods](#3--identify-evidence--likelihoods)
- [4. Calculate Posterior](#4--calculate-posterior)
- [5. Calibrate & Document](#5--calibrate--document)
- [Common Patterns](#common-patterns)
- [Guardrails](#guardrails)
- [Quick Reference](#quick-reference)
## Purpose
Apply Bayesian reasoning to systematically update probability estimates as new evidence arrives. This helps make better forecasts, avoid overconfidence, and explicitly show how beliefs should change with data.
## When to Use This Skill
- Making forecasts or predictions with uncertainty
- Updating beliefs when new evidence emerges
- Calibrating confidence in estimates
- Testing hypotheses with imperfect data
- Evaluating risks with incomplete information
- Avoiding anchoring and overconfidence biases
- Making decisions under uncertainty
- Comparing multiple competing explanations
- Assessing diagnostic test results
- Forecasting project outcomes with new data
**Trigger phrases:** "What's the probability", "update my belief", "how confident", "forecast", "prior probability", "likelihood", "Bayes", "calibration", "base rate", "posterior probability"
## What is Bayesian Reasoning?
A systematic way to update probability estimates using Bayes' Theorem:
**P(H|E) = P(E|H) × P(H) / P(E)**
Where:
- **P(H)** = Prior: Probability of the hypothesis before seeing the evidence
- **P(E|H)** = Likelihood: Probability of the evidence if the hypothesis is true
- **P(E|¬H)** = Probability of the evidence if the hypothesis is false
- **P(E)** = Marginal probability of the evidence: P(E|H) × P(H) + P(E|¬H) × P(¬H)
- **P(H|E)** = Posterior: Updated probability after seeing the evidence
**Quick Example:**
```markdown
# Should we launch Feature X?
## Prior Belief
Before beta testing: 60% chance of adoption >20%
- Base rate: Similar features get 15-25% adoption
- Our feature seems stronger than average
- Prior: 60%
## New Evidence
Beta test: 35% of users adopted (70 of 200 users)
## Likelihoods
If true adoption is >20%:
- P(seeing 35% in beta | adoption >20%) = 75% (likely to see high beta if true)
If true adoption is ≤20%:
- P(seeing 35% in beta | adoption ≤20%) = 15% (unlikely to see high beta if false)
## Bayesian Update
Posterior = (75% × 60%) / [(75% × 60%) + (15% × 40%)]
Posterior = 45% / (45% + 6%) = 88%
## Conclusion
Updated belief: 88% confident adoption will exceed 20%
Evidence strongly supports launch, but not certain.
```
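The same update can be checked in a few lines of code. Below is a minimal, illustrative Python sketch of the probability-form calculation; the function name `posterior_probability` is a placeholder, not a library call:
```python
def posterior_probability(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Single-hypothesis Bayesian update in probability form."""
    # P(E) = P(E|H) * P(H) + P(E|~H) * P(~H)
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    # P(H|E) = P(E|H) * P(H) / P(E)
    return p_e_given_h * prior / p_e

# Feature X example: prior 60%, P(E|H) = 75%, P(E|~H) = 15%
print(round(posterior_probability(0.60, 0.75, 0.15), 3))  # 0.882 -> ~88%
```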
## Workflow
Copy this checklist and track your progress:
```
Bayesian Reasoning Progress:
- [ ] Step 1: Define the question
- [ ] Step 2: Establish prior beliefs
- [ ] Step 3: Identify evidence and likelihoods
- [ ] Step 4: Calculate posterior
- [ ] Step 5: Calibrate and document
```
**Step 1: Define the question**
Clarify hypothesis (specific, testable claim), probability to estimate, timeframe (when outcome is known), success criteria, and why this matters (what decision depends on it). Example: "Product feature will achieve >20% adoption within 3 months" - matters for launch decision.
**Step 2: Establish prior beliefs**
Set the initial probability using base rates (general frequency) and a reference class (similar situations), adjust for specific differences, and state the probability explicitly with justification. Good priors are grounded in base rates, account for relevant differences, are honest about uncertainty, and include a range if unsure (e.g., 40-60%). Avoid purely intuitive priors, ignoring base rates, and extreme values without justification.
**Step 3: Identify evidence and likelihoods**
Assess evidence (specific observation/data), diagnostic power (does it distinguish hypotheses?), P(E|H) (probability if hypothesis TRUE), P(E|¬H) (probability if FALSE), and calculate likelihood ratio = P(E|H) / P(E|¬H). LR > 10 = very strong evidence, 3-10 = moderate, 1-3 = weak, ≈1 = not diagnostic, <1 = evidence against.
**Step 4: Calculate posterior**
Apply Bayes' Theorem: P(H|E) = [P(E|H) × P(H)] / P(E), or use odds form: Posterior Odds = Prior Odds × Likelihood Ratio. Calculate P(E) = P(E|H)×P(H) + P(E|¬H)×P(¬H), get posterior probability, and interpret change. For simple cases → Use `resources/template.md` calculator. For complex cases (multiple hypotheses) → Study `resources/methodology.md`.
**Step 5: Calibrate and document**
Check calibration (over/underconfident?), validate assumptions (are likelihoods reasonable?), perform sensitivity analysis, create `bayesian-reasoning-calibration.md`, and note limitations. Self-check using `resources/evaluators/rubric_bayesian_reasoning_calibration.json`: verify prior based on base rates, likelihoods justified, evidence diagnostic (LR ≠ 1), calculation correct, posterior calibrated, assumptions stated, sensitivity noted. Minimum standard: Score ≥ 3.5.
## Common Patterns
**For forecasting:**
- Use base rates as starting point
- Update incrementally as evidence arrives
- Track forecast accuracy over time
- Calibrate by comparing predictions to outcomes
**For hypothesis testing:**
- State competing hypotheses explicitly
- Calculate likelihood ratio for evidence
- Update belief proportionally to evidence strength
- Don't claim certainty unless LR is extreme
**For risk assessment:**
- Consider multiple scenarios (not just binary)
- Update risks as new data arrives
- Use ranges when uncertain about likelihoods
- Perform sensitivity analysis
**For avoiding bias:**
- Force explicit priors (prevents anchoring to evidence)
- Use reference classes (prevents ignoring base rates)
- Calculate mathematically (prevents motivated reasoning)
- Document before seeing outcome (enables calibration)
## Guardrails
**Do:**
- State priors explicitly before seeing all evidence
- Use base rates and reference classes
- Estimate likelihoods with justification
- Update incrementally as evidence arrives
- Be honest about uncertainty
- Perform sensitivity analysis
- Track forecasts for calibration
- Acknowledge limits of the model
**Don't:**
- Use extreme priors (1%, 99%) without exceptional justification
- Ignore base rates (common bias)
- Treat all evidence as equally diagnostic
- Update to 100% certainty (almost never justified)
- Cherry-pick evidence
- Skip documenting reasoning
- Forget to calibrate (compare predictions to outcomes)
- Apply to questions where probability is meaningless
## Quick Reference
- **Standard template**: `resources/template.md`
- **Multiple hypotheses**: `resources/methodology.md`
- **Examples**: `resources/examples/product-launch.md`, `resources/examples/medical-diagnosis.md`
- **Quality rubric**: `resources/evaluators/rubric_bayesian_reasoning_calibration.json`
**Bayesian Formula (Odds Form)**:
```
Posterior Odds = Prior Odds × Likelihood Ratio
```
**Likelihood Ratio**:
```
LR = P(Evidence | Hypothesis True) / P(Evidence | Hypothesis False)
```
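For quick updates, the odds form is easy to script. A minimal sketch (the helper name is illustrative, not from any library):
```python
def update_odds(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability via the odds form: posterior odds = prior odds x LR."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Quick example from above: prior 60%, LR = 0.75 / 0.15 = 5.0
print(round(update_odds(0.60, 5.0), 3))  # 0.882 -> same ~88% as the probability form
```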
**Output naming**: `bayesian-reasoning-calibration.md` or `{topic}-forecast.md`

@@ -0,0 +1,135 @@
{
"name": "Bayesian Reasoning Quality Rubric",
"scale": {
"min": 1,
"max": 5,
"description": "1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent"
},
"criteria": [
{
"name": "Prior Quality",
"description": "Prior is based on base rates and reference classes, not just intuition",
"scoring": {
"1": "No prior stated or purely intuitive guess",
"2": "Prior stated but ignores base rates entirely",
"3": "Prior considers base rates with some adjustment",
"4": "Prior well-grounded in base rates with justified adjustments",
"5": "Exceptional prior with multiple reference classes and clear reasoning"
}
},
{
"name": "Likelihood Justification",
"description": "Likelihoods P(E|H) and P(E|¬H) are estimated with clear reasoning",
"scoring": {
"1": "No likelihoods or purely guessed",
"2": "Likelihoods given but no justification",
"3": "Likelihoods have basic reasoning",
"4": "Likelihoods well-justified with clear logic",
"5": "Exceptional likelihood estimates with empirical grounding or detailed reasoning"
}
},
{
"name": "Evidence Diagnosticity",
"description": "Evidence meaningfully distinguishes between hypotheses (LR ≠ 1)",
"scoring": {
"1": "Evidence is not diagnostic at all (LR ≈ 1)",
"2": "Evidence is weakly diagnostic (LR = 1-2)",
"3": "Evidence is moderately diagnostic (LR = 2-5)",
"4": "Evidence is strongly diagnostic (LR = 5-10)",
"5": "Evidence is very strongly diagnostic (LR > 10)"
}
},
{
"name": "Calculation Correctness",
"description": "Bayesian calculation is mathematically correct",
"scoring": {
"1": "Major calculation errors",
"2": "Some calculation errors",
"3": "Calculation is correct with minor issues",
"4": "Calculation is fully correct",
"5": "Perfect calculation with both probability and odds forms shown"
}
},
{
"name": "Calibration & Realism",
"description": "Posterior is calibrated, not overconfident (avoids extremes without justification)",
"scoring": {
"1": "Posterior is 0% or 100% without extreme evidence",
"2": "Posterior is very extreme (>95% or <5%) with weak evidence",
"3": "Posterior is reasonable but might be slightly overconfident",
"4": "Well-calibrated posterior with appropriate uncertainty",
"5": "Exceptional calibration with explicit confidence bounds"
}
},
{
"name": "Assumption Transparency",
"description": "Key assumptions and limitations are stated explicitly",
"scoring": {
"1": "No assumptions stated",
"2": "Few assumptions mentioned vaguely",
"3": "Key assumptions stated",
"4": "Comprehensive assumption documentation",
"5": "Exceptional transparency with sensitivity analysis showing assumption impact"
}
},
{
"name": "Base Rate Usage",
"description": "Analysis uses base rates appropriately (avoids base rate neglect)",
"scoring": {
"1": "Completely ignores base rates",
"2": "Acknowledges base rates but doesn't use them",
"3": "Uses base rates for prior",
"4": "Properly incorporates base rates with adjustments",
"5": "Exceptional use of multiple base rates and reference classes"
}
},
{
"name": "Sensitivity Analysis",
"description": "Tests how sensitive conclusion is to input assumptions",
"scoring": {
"1": "No sensitivity analysis",
"2": "Minimal sensitivity check",
"3": "Basic sensitivity analysis on key inputs",
"4": "Comprehensive sensitivity analysis",
"5": "Exceptional sensitivity analysis showing robustness or fragility clearly"
}
},
{
"name": "Interpretation Quality",
"description": "Posterior is interpreted correctly with decision implications",
"scoring": {
"1": "Misinterprets posterior or no interpretation",
"2": "Basic interpretation but lacks context",
"3": "Good interpretation with some decision guidance",
"4": "Clear interpretation with actionable decision implications",
"5": "Exceptional interpretation linking probability to specific actions and thresholds"
}
},
{
"name": "Avoidance of Common Errors",
"description": "Avoids prosecutor's fallacy, base rate neglect, and other Bayesian errors",
"scoring": {
"1": "Multiple major errors (confusing P(E|H) with P(H|E), ignoring base rates)",
"2": "One major error present",
"3": "Mostly avoids common errors",
"4": "Cleanly avoids all common errors",
"5": "Exceptional awareness with explicit checks against common errors"
}
}
],
"overall_assessment": {
"thresholds": {
"excellent": "Average score ≥ 4.5 (publication quality)",
"very_good": "Average score ≥ 4.0 (most forecasts should aim for this)",
"good": "Average score ≥ 3.5 (minimum for important decisions)",
"acceptable": "Average score ≥ 3.0 (workable for low-stakes predictions)",
"needs_rework": "Average score < 3.0 (redo before using)"
},
"stakes_guidance": {
"low_stakes": "Personal predictions, low-cost decisions: aim for ≥ 3.0",
"medium_stakes": "Business decisions, moderate cost: aim for ≥ 3.5",
"high_stakes": "Major decisions, high cost of error: aim for ≥ 4.0"
}
},
"usage_instructions": "Rate each criterion on 1-5 scale. Calculate average. For important forecasts or decisions, minimum score is 3.5. For high-stakes decisions where cost of error is high, aim for ≥4.0. Check especially for base rate neglect, prosecutor's fallacy, and overconfidence - these are the most common errors."
}

@@ -0,0 +1,319 @@
# Bayesian Analysis: Feature Adoption Forecast
## Question
**Hypothesis**: New sharing feature will achieve >20% adoption within 3 months of launch
**Estimating**: P(adoption >20%)
**Timeframe**: 3 months post-launch (results measured at month 3)
**Matters because**: Need 20% adoption to justify ongoing development investment. Below 20%, we should sunset the feature and reallocate resources.
---
## Prior Belief (Before Evidence)
### Base Rate
What's the general frequency of similar features achieving >20% adoption?
- **Reference class**: Previous features we've launched in this product category
- **Historical data**:
- Last 8 features launched: 5 achieved >20% adoption (62.5%)
- Industry benchmarks: Social sharing features average 15-25% adoption
- Our product has higher engagement than average
- **Base rate**: 60%
### Adjustments
How is this case different from the base rate?
- **Factor 1: Feature complexity** - This feature is simpler than average (+5%)
- Previous successful features averaged 3 steps to use
- This feature is 1-click sharing
- Simpler features historically perform better
- **Factor 2: Market timing** - Competitive pressure is high (-10%)
- Two competitors launched similar features 6 months ago
- Early adopters may have already switched to competitors
- Late-to-market features typically see 15-20% lower adoption
- **Factor 3: User research signals** - Strong user request (+10%)
- Feature was #2 most requested in last user survey (450 responses)
- 72% said they would use it "frequently" or "very frequently"
- Strong stated intent typically correlates with 40-60% actual usage
### Prior Probability
**P(H) = 65%**
**Justification**: Starting from 60% base rate, adjusted upward for simplicity (+5%) and strong user signals (+10%), adjusted down for late market entry (-10%). Net effect: 65% prior confidence that adoption will exceed 20%.
**Range if uncertain**: 55% to 75% (accounting for uncertainty in adjustment factors)
---
## Evidence
**What was observed**: Beta test with 200 users showed 35% adoption (70 users actively used feature)
**How diagnostic**: This is moderately to strongly diagnostic evidence. Beta tests often show higher engagement than production (selection bias), but 35% is meaningfully above our 20% threshold. The question is whether this beta performance predicts production performance.
### Likelihoods
**P(E|H) = 75%** - Probability of seeing 35% beta adoption IF true production adoption will be >20%
**Reasoning**:
- If production adoption will be >20%, beta should show higher (beta users are early adopters)
- Typical pattern: beta adoption is 1.5-2x production adoption for engaged features
- If production will be 22%, beta would likely be 33-44% → 35% fits this well
- If production will be 25%, beta would likely be 38-50% → 35% is on lower end but plausible
- 75% accounts for variance and beta-to-production conversion uncertainty
**P(E|¬H) = 15%** - Probability of seeing 35% beta adoption IF true production adoption will be ≤20%
**Reasoning**:
- If production adoption will be ≤20% (say, 15%), beta would typically be 22-30%
- Seeing 35% beta when production will be ≤20% would require unusual beta-to-production drop
- This could happen (beta selection bias, novelty effect wears off), but is uncommon
- 15% reflects that this scenario is possible but unlikely
**Likelihood Ratio = 75% / 15% = 5.0**
**Interpretation**: Evidence is moderately strong. A 35% beta result is 5 times more likely if production adoption will exceed 20% than if it won't. This is meaningful but not overwhelming evidence.
---
## Bayesian Update
### Calculation
**Using odds form** (simpler for this case):
```
Prior Odds = P(H) / P(¬H) = 65% / 35% = 1.86
Likelihood Ratio = 5.0
Posterior Odds = Prior Odds × LR = 1.86 × 5.0 = 9.3
Posterior Probability = Posterior Odds / (1 + Posterior Odds)
= 9.3 / 10.3
= 90.3%
```
**Verification using probability form**:
```
P(E) = [P(E|H) × P(H)] + [P(E|¬H) × P(¬H)]
P(E) = [75% × 65%] + [15% × 35%]
P(E) = 48.75% + 5.25% = 54%
P(H|E) = [P(E|H) × P(H)] / P(E)
P(H|E) = [75% × 65%] / 54%
P(H|E) = 48.75% / 54% = 90.3%
```
### Posterior Probability
**P(H|E) = 90%**
### Change in Belief
- **Prior**: 65%
- **Posterior**: 90%
- **Change**: +25 percentage points
- **Interpretation**: Evidence strongly supports hypothesis. Beta test results meaningfully increased confidence that production adoption will exceed 20%.
---
## Sensitivity Analysis
**How sensitive is posterior to inputs?**
### If Prior was different:
| Prior | Posterior | Note |
|-------|-----------|------|
| 50% | 83% | Even starting at coin-flip, evidence pushes to high confidence |
| 75% | 94% | Higher prior → very high posterior |
| 40% | 77% | Lower prior → still high confidence |
**Finding**: Posterior is somewhat robust. Evidence is strong enough that even with priors ranging from 40-75%, posterior stays in 77-94% range.
### If P(E|H) was different:
| P(E\|H) | LR | Posterior | Note |
|---------|-----|-----------|------|
| 60% | 4.0 | 88% | Less diagnostic evidence → still high confidence |
| 85% | 5.67 | 91% | More diagnostic evidence → very high confidence |
| 50% | 3.33 | 86% | Weaker evidence → still high confidence |
**Finding**: Posterior is only mildly sensitive to P(E|H), staying in the 86-91% range across plausible values.
### If P(E|¬H) was different:
| P(E\|¬H) | LR | Posterior | Note |
|----------|-----|-----------|------|
| 25% | 3.0 | 85% | Less diagnostic → still high confidence |
| 10% | 7.5 | 93% | More diagnostic → very high confidence |
| 30% | 2.5 | 82% | Weaker evidence → moderate-high confidence |
**Finding**: Posterior is sensitive to P(E|¬H). If beta-to-production drop is common (higher P(E|¬H)), confidence decreases meaningfully.
**Robustness**: Conclusion is **moderately robust**. Across reasonable input ranges, posterior stays above 77%, supporting launch decision. Most sensitive to assumption about beta-to-production conversion rates.
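These tables are easy to regenerate programmatically. A minimal, illustrative Python sketch (the helper name is hypothetical), varying one input at a time around the baseline:
```python
def posterior(prior, p_e_h, p_e_not_h):
    """Single-hypothesis Bayesian update (probability form)."""
    return p_e_h * prior / (p_e_h * prior + p_e_not_h * (1 - prior))

base = {"prior": 0.65, "p_e_h": 0.75, "p_e_not_h": 0.15}

# Vary one input at a time, holding the others at their baseline values
for prior in (0.40, 0.50, 0.65, 0.75):
    print(f"prior={prior:.2f} -> posterior={posterior(prior, base['p_e_h'], base['p_e_not_h']):.0%}")
for p_e_h in (0.50, 0.60, 0.75, 0.85):
    print(f"P(E|H)={p_e_h:.2f} -> posterior={posterior(base['prior'], p_e_h, base['p_e_not_h']):.0%}")
for p_e_not_h in (0.10, 0.15, 0.25, 0.30):
    print(f"P(E|~H)={p_e_not_h:.2f} -> posterior={posterior(base['prior'], base['p_e_h'], p_e_not_h):.0%}")
```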
---
## Calibration Check
**Am I overconfident?**
- **Did I anchor on initial belief?**
- No - prior (65%) was based on base rates, not arbitrary
- Evidence substantially moved belief (+25pp)
- Not stuck at starting point
- **Did I ignore base rates?**
- No - explicitly used historical feature adoption (60%) as starting point
- Adjusted for known differences systematically
- **Is my posterior extreme (>90% or <10%)?**
- Yes - 90% is borderline extreme
- **Check**: Is evidence truly that strong?
- LR = 5.0 is moderately strong (not very strong)
- Prior was already high (65%)
- Combination pushes to 90%
- **Concern**: May be slightly overconfident
- **Adjustment**: Consider reporting as 85-90% range rather than point estimate
- **Would an outside observer agree with my likelihoods?**
- P(E|H) = 75%: Reasonable - beta users are engaged, expect higher than production
- P(E|¬H) = 15%: Potentially optimistic - beta selection bias could be stronger
  - **Alternative**: If P(E|¬H) = 25%, posterior drops to 85% (more conservative)
**Red flags**:
- ✓ Posterior is not 100% or 0%
- ✓ Update magnitude (25pp) matches evidence strength (LR=5.0)
- ✓ Prior uses base rates
- ⚠ Posterior is at upper end (90%) - consider uncertainty range
**Calibration adjustment**: Report as 85-90% confidence range to account for uncertainty in likelihoods.
---
## Limitations & Assumptions
**Key assumptions**:
1. **Beta users are representative of broader user base**
- Assumption: Beta users are 1.5-2x more engaged than average
- Risk: If beta users are much more engaged (3x), production adoption could be lower
- Impact: Could invalidate high posterior
2. **No major bugs or UX issues in production**
- Assumption: Production experience will match beta experience
- Risk: Unforeseen technical issues could crater adoption
- Impact: Would make evidence misleading
3. **Competitive landscape stays stable**
- Assumption: No major competitor moves in next 3 months
- Risk: Competitor could launch superior version
- Impact: Could reduce adoption below 20% despite strong beta
4. **Beta sample size is sufficient (n=200)**
- Assumption: 200 users is enough to estimate adoption
- Confidence interval: 35% ± 6.6% at 95% CI
- Impact: True beta adoption could be 28-42%, adding uncertainty
**What could invalidate this analysis**:
- **Major product changes**: If we significantly alter the feature post-beta, beta results become less predictive
- **Different user segment**: If we launch to a different user segment than beta testers, adoption patterns may differ
- **Seasonal effects**: If beta ran during high-engagement season and launch is during low season
- **Discovery/onboarding issues**: If users don't discover the feature in production (beta users were explicitly invited)
**Uncertainty**:
- **Most uncertain about**: P(E|¬H) = 15% - How often do features with ≤20% production adoption show 35% beta adoption?
- This is the key assumption
  - If this is actually 25-30%, posterior drops to 82-85%
- Recommendation: Review historical beta-to-production conversion data
- **Could be wrong if**:
- Beta users are much more engaged than typical users (>2x multiplier)
- Novelty effect in beta wears off quickly in production
- Production launch has poor discoverability/onboarding
---
## Decision Implications
**Given posterior of 90% (range: 85-90%)**:
**Recommended action**: **Proceed with launch** with monitoring plan
**Rationale**:
- 90% confidence exceeds decision threshold for feature launches
- Even conservative estimate (85%) supports launch
- Risk of failure (<20% adoption) is only 10-15%
- Cost of being wrong: Wasted 3 months of development effort
- Cost of not launching: Missing potential high-adoption feature
**If decision threshold is**:
- **High confidence needed (>80%)**: ✅ **LAUNCH** - Exceeds threshold, proceed with production rollout
- **Medium confidence (>60%)**: ✅ **LAUNCH** - Well above threshold, strong conviction
- **Low bar (>40%)**: ✅ **LAUNCH** - Far exceeds minimum threshold
**Monitoring plan** (to validate forecast):
1. **Week 1**: Check if adoption is on track for >6% (20% / 3 months, assuming linear growth)
- If <4%: Red flag, investigate onboarding/discovery issues
- If >8%: Exceeding expectations, validate data quality
2. **Month 1**: Check if adoption is trending toward >10%
- If <7%: Update forecast downward, consider intervention
- If >13%: Exceeding expectations, high confidence
3. **Month 3**: Measure final adoption
- If <20%: Analyze what went wrong, calibrate future forecasts
- If >20%: Validate forecast accuracy, update priors for future features
**Next evidence to gather**:
- **Historical beta-to-production conversion rates**: Review last 5-10 feature launches to calibrate P(E|¬H) more accurately
- **User segment analysis**: Compare beta user demographics to production user base
- **Competitive feature adoption**: Check competitors' sharing feature adoption rates
- **Early production data**: After 1 week of production, use actual adoption data for next Bayesian update
**What would change our mind**:
- **Week 1 adoption <3%**: Would update posterior down to ~60%, trigger investigation
- **Competitor launches superior feature**: Would need to recalculate with new competitive landscape
- **Discovery of major beta sampling bias**: If beta users are 5x more engaged, would significantly reduce confidence
---
## Meta: Forecast Quality Assessment
Using rubric from `rubric_bayesian_reasoning_calibration.json`:
**Self-assessment**:
- Prior Quality: 4/5 (good base rate usage, clear adjustments)
- Likelihood Justification: 4/5 (clear reasoning, could use more empirical data)
- Evidence Diagnosticity: 4/5 (LR=5.0 is moderately strong)
- Calculation Correctness: 5/5 (verified with both odds and probability forms)
- Calibration & Realism: 3/5 (posterior is 90%, borderline extreme, flagged for review)
- Assumption Transparency: 4/5 (key assumptions stated clearly)
- Base Rate Usage: 5/5 (explicit base rate from historical data)
- Sensitivity Analysis: 4/5 (comprehensive sensitivity checks)
- Interpretation Quality: 4/5 (clear decision implications with thresholds)
- Avoidance of Common Errors: 4/5 (no prosecutor's fallacy, proper base rates)
**Average: 4.1/5** - Meets "very good" threshold for medium-stakes decision
**Decision**: Forecast is sufficiently rigorous for feature launch decision (medium stakes). Primary area for improvement: gather more data on beta-to-production conversion to refine P(E|¬H) estimate.

@@ -0,0 +1,437 @@
# Bayesian Reasoning & Calibration Methodology
## Bayesian Reasoning Workflow
Copy this checklist and track your progress:
```
Bayesian Reasoning Progress:
- [ ] Step 1: Define hypotheses and assign priors
- [ ] Step 2: Assign likelihoods for each evidence-hypothesis pair
- [ ] Step 3: Compute posteriors and update sequentially
- [ ] Step 4: Check for dependent evidence and adjust
- [ ] Step 5: Validate calibration and check for bias
```
**Step 1: Define hypotheses and assign priors**
List all competing hypotheses (including catch-all "Other") and assign prior probabilities that sum to 1.0. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for structuring priors with justifications.
**Step 2: Assign likelihoods for each evidence-hypothesis pair**
For each hypothesis, define P(Evidence|Hypothesis) based on how likely the evidence would be if that hypothesis were true. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for likelihood assessment techniques.
**Step 3: Compute posteriors and update sequentially**
Calculate posterior probabilities using Bayes' theorem, then chain updates for sequential evidence. See [Sequential Evidence Updates](#sequential-evidence-updates) for multi-stage updating process.
**Step 4: Check for dependent evidence and adjust**
Identify whether evidence items are independent or correlated, and use conditional likelihoods if needed. See [Handling Dependent Evidence](#handling-dependent-evidence) for dependence detection and correction.
**Step 5: Validate calibration and check for bias**
Track forecasts over time and compare stated probabilities to actual outcomes using calibration curves and Brier scores. See [Calibration Techniques](#calibration-techniques) for debiasing methods and metrics.
---
## Multiple Hypothesis Updates
### Problem: Choosing Among Many Hypotheses
Often you have 3+ competing hypotheses and need to update all simultaneously.
**Example:**
- H1: Bug in payment processor code
- H2: Database connection timeout
- H3: Third-party API outage
- H4: DDoS attack
### Approach: Compute Posterior for Each Hypothesis
**Step 1: Assign prior probabilities** (must sum to 1)
| Hypothesis | Prior P(H) | Justification |
|------------|-----------|---------------|
| H1: Payment bug | 0.30 | Common issue, recent deploy |
| H2: DB timeout | 0.25 | Has happened before |
| H3: API outage | 0.20 | Dependency on external service |
| H4: DDoS | 0.10 | Rare but possible |
| H5: Other | 0.15 | Catch-all for unknowns |
| **Total** | **1.00** | Must sum to 1 |
**Step 2: Define likelihood for each hypothesis**
Evidence E: "500 errors only on payment endpoint"
| Hypothesis | P(E\|H) | Justification |
|------------|---------|---------------|
| H1: Payment bug | 0.80 | Bug would affect payment specifically |
| H2: DB timeout | 0.30 | Would affect multiple endpoints |
| H3: API outage | 0.70 | Payment uses external API |
| H4: DDoS | 0.50 | Could target any endpoint |
| H5: Other | 0.20 | Generic catch-all |
**Step 3: Compute P(E)** (marginal probability)
```
P(E) = Σ [P(E|Hi) × P(Hi)] for all hypotheses
P(E) = (0.80 × 0.30) + (0.30 × 0.25) + (0.70 × 0.20) + (0.50 × 0.10) + (0.20 × 0.15)
P(E) = 0.24 + 0.075 + 0.14 + 0.05 + 0.03
P(E) = 0.535
```
**Step 4: Compute posterior for each hypothesis**
```
P(Hi|E) = [P(E|Hi) × P(Hi)] / P(E)
P(H1|E) = (0.80 × 0.30) / 0.535 = 0.24 / 0.535 = 0.449 (44.9%)
P(H2|E) = (0.30 × 0.25) / 0.535 = 0.075 / 0.535 = 0.140 (14.0%)
P(H3|E) = (0.70 × 0.20) / 0.535 = 0.14 / 0.535 = 0.262 (26.2%)
P(H4|E) = (0.50 × 0.10) / 0.535 = 0.05 / 0.535 = 0.093 (9.3%)
P(H5|E) = (0.20 × 0.15) / 0.535 = 0.03 / 0.535 = 0.056 (5.6%)
Total: 100% (check: posteriors must sum to 1)
```
**Interpretation:**
- H1 (Payment bug): 30% → 44.9% (increased 15 pp)
- H3 (API outage): 20% → 26.2% (increased 6 pp)
- H2 (DB timeout): 25% → 14.0% (decreased 11 pp)
- H4 (DDoS): 10% → 9.3% (barely changed)
**Decision**: Investigate H1 (payment bug) first, then H3 (API outage) as backup.
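This normalization step is straightforward to script. A minimal, illustrative Python sketch (hypothesis names are shorthand for the table above; the function name is not from any library):
```python
def update(priors: dict, likelihoods: dict) -> dict:
    """Multi-hypothesis Bayesian update: posterior_i = P(E|Hi) * P(Hi) / P(E)."""
    # P(E) is the sum of P(E|Hi) * P(Hi) over all tracked hypotheses
    p_e = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / p_e for h in priors}

priors = {"payment_bug": 0.30, "db_timeout": 0.25, "api_outage": 0.20, "ddos": 0.10, "other": 0.15}
likelihoods = {"payment_bug": 0.80, "db_timeout": 0.30, "api_outage": 0.70, "ddos": 0.50, "other": 0.20}

for h, p in sorted(update(priors, likelihoods).items(), key=lambda kv: -kv[1]):
    print(f"{h:12s} {p:.1%}")  # payment_bug 44.9%, api_outage 26.2%, db_timeout 14.0%, ...
```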
---
## Sequential Evidence Updates
### Problem: Multiple Pieces of Evidence Over Time
Evidence comes in stages, need to update belief sequentially.
**Example:**
- Evidence 1: "500 errors on payment endpoint" (t=0)
- Evidence 2: "External API status page shows outage" (t+5 min)
- Evidence 3: "Our other services using same API also failing" (t+10 min)
### Approach: Chain Updates (Prior → E1 → E2 → E3)
**Step 1: Update with E1** (as above)
```
Prior → P(H|E1)
P(H1|E1) = 44.9% (payment bug)
P(H3|E1) = 26.2% (API outage)
```
**Step 2: Use posterior as new prior, update with E2**
Evidence E2: "External API status page shows outage"
New prior (from the E1 posteriors). For simplicity, only the two leading hypotheses are carried forward; dividing by P(E2) below renormalizes them:
- P(H1) = 0.449 (payment bug)
- P(H3) = 0.262 (API outage)
Likelihoods given E2:
- P(E2|H1) = 0.20 (bug wouldn't cause external API status change)
- P(E2|H3) = 0.95 (API outage would definitely show on status page)
```
P(E2) = (0.20 × 0.449) + (0.95 × 0.262) = 0.0898 + 0.2489 = 0.3387
P(H1|E1,E2) = (0.20 × 0.449) / 0.3387 = 0.265 (26.5%)
P(H3|E1,E2) = (0.95 × 0.262) / 0.3387 = 0.735 (73.5%)
```
**Interpretation**: E2 strongly favors H3 (API outage). H1 dropped from 44.9% → 26.5%.
**Step 3: Update with E3**
Evidence E3: "Other services using same API also failing"
New prior:
- P(H1) = 0.265
- P(H3) = 0.735
Likelihoods:
- P(E3|H1) = 0.10 (payment bug wouldn't affect other services)
- P(E3|H3) = 0.99 (API outage would affect all services)
```
P(E3) = (0.10 × 0.265) + (0.99 × 0.735) = 0.0265 + 0.7277 = 0.7542
P(H1|E1,E2,E3) = (0.10 × 0.265) / 0.7542 = 0.035 (3.5%)
P(H3|E1,E2,E3) = (0.99 × 0.735) / 0.7542 = 0.965 (96.5%)
```
**Final conclusion**: 96.5% confidence it's API outage, not payment bug.
**Summary of belief evolution:**
```
Prior After E1 After E2 After E3
H1 (Bug): 30% → 44.9% → 26.5% → 3.5%
H3 (API): 20% → 26.2% → 73.5% → 96.5%
```
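A minimal sketch of the chained update, using the same kind of `update` helper as in the multiple-hypothesis example (repeated here so the snippet is self-contained; starting values are the post-E1 posteriors of the two leading hypotheses, and the division by P(E) renormalizes them at each step):
```python
def update(priors: dict, likelihoods: dict) -> dict:
    """One Bayesian update step; dividing by P(E) renormalizes the tracked hypotheses."""
    p_e = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / p_e for h in priors}

beliefs = {"payment_bug": 0.449, "api_outage": 0.262}  # post-E1 posteriors
evidence_stream = [
    {"payment_bug": 0.20, "api_outage": 0.95},  # E2: external status page shows outage
    {"payment_bug": 0.10, "api_outage": 0.99},  # E3: other services on the same API failing
]
for likelihoods in evidence_stream:
    beliefs = update(beliefs, likelihoods)
    print({h: round(p, 3) for h, p in beliefs.items()})
# After E2: ~0.265 / 0.735; after E3: ~0.035 / 0.965
```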
---
## Handling Dependent Evidence
### Problem: Evidence Items Not Independent
**Naive approach fails when**:
- E1 and E2 are correlated (not independent)
- Updating twice with same information
**Example of dependent evidence:**
- E1: "User reports payment failing"
- E2: "Another user reports payment failing"
If E1 and E2 are from same incident, they're not independent evidence!
### Solution 1: Treat as Single Evidence
If evidence is dependent, combine into one update:
**Instead of:**
- Update with E1: "User reports payment failing"
- Update with E2: "Another user reports payment failing"
**Do:**
- Single update with E: "Multiple users report payment failing"
Likelihood:
- P(E|Bug) = 0.90 (if bug exists, multiple users affected)
- P(E|No bug) = 0.05 (false reports rare)
### Solution 2: Conditional Likelihoods
If evidence is conditionally dependent (E2 depends on E1), use:
```
P(H|E1,E2) ∝ P(E2|E1,H) × P(E1|H) × P(H)
```
Not:
```
P(H|E1,E2) ∝ P(E2|H) × P(E1|H) × P(H) ← wrong unless E1 and E2 are independent given H
```
**Example:**
- E1: "Test fails on staging"
- E2: "Test fails on production" (same test, likely same cause)
Conditional:
- P(E2|E1, Bug) = 0.95 (if staging failed due to bug, production likely fails too)
- P(E2|E1, Env) = 0.20 (if staging failed due to its environment, production runs in a different environment and is less likely to fail the same way)
### Red Flags for Dependent Evidence
Watch out for:
- Multiple reports of same incident (count as one)
- Cascading failures (downstream failure caused by upstream)
- Repeated measurements of same thing (not new info)
- Evidence from same source (correlated errors)
**Principle**: Each update should provide **new information**. If E2 is predictable from E1, it's not independent.
---
## Calibration Techniques
### Problem: Over/Underconfidence Bias
Common patterns:
- **Overconfidence**: Stating 90% when true rate is 70%
- **Underconfidence**: Stating 60% when true rate is 80%
### Calibration Check: Track Predictions Over Time
**Method**:
1. Make many probabilistic forecasts (P=70%, P=40%, etc.)
2. Track outcomes
3. Group by confidence level
4. Compare stated probability to actual frequency
**Example calibration check:**
| Your Confidence | # Predictions | # Correct | Actual % | Calibration |
|-----------------|---------------|-----------|----------|-------------|
| 90-100% | 20 | 16 | 80% | Overconfident |
| 70-89% | 30 | 24 | 80% | Good |
| 50-69% | 25 | 14 | 56% | Good |
| 30-49% | 15 | 5 | 33% | Good |
| 0-29% | 10 | 2 | 20% | Good |
**Interpretation**: Overconfident at high confidence levels (saying 90% but only 80% correct).
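A minimal sketch of this bucketed check, assuming you keep a log of (stated probability, outcome) pairs; the forecast data below is made up purely for illustration:
```python
from collections import defaultdict

def bucket(p):
    """Coarse confidence bands matching the table above."""
    if p >= 0.90: return "90-100%"
    if p >= 0.70: return "70-89%"
    if p >= 0.50: return "50-69%"
    if p >= 0.30: return "30-49%"
    return "0-29%"

# Hypothetical forecast log: (stated probability, outcome recorded as 1 or 0)
forecasts = [(0.95, 1), (0.92, 1), (0.90, 0), (0.80, 1), (0.75, 1), (0.70, 0),
             (0.60, 1), (0.55, 0), (0.40, 0), (0.35, 1), (0.20, 0), (0.10, 0)]

by_band = defaultdict(list)
for p, outcome in forecasts:
    by_band[bucket(p)].append(outcome)

for band, outcomes in by_band.items():
    print(f"{band}: {len(outcomes)} predictions, actual {sum(outcomes) / len(outcomes):.0%}")
```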
### Calibration Curve
Plot stated probability vs actual frequency:
```
Actual %
100 ┤ ●
│ ●
80 ┤ ● (overconfident)
│ ●
60 ┤ ●
│ ●
40 ┤ ●
20 ┤
0 └─────────────────────────────────
0 20 40 60 80 100
Stated probability %
Perfect calibration = diagonal line
Above line = overconfident
Below line = underconfident
```
### Debiasing Techniques
**For overconfidence:**
- Consider alternative explanations (how could I be wrong?)
- Base rate check (what's the typical success rate?)
- Pre-mortem: "It's 6 months from now and we failed. Why?"
- Confidence intervals: State range, not point estimate
**For underconfidence:**
- Review past successes (build evidence for confidence)
- Test predictions: Am I systematically too cautious?
- Cost of inaction: What's the cost of waiting for certainty?
### Brier Score (Calibration Metric)
**Formula**:
```
Brier Score = (1/n) × Σ (pi - oi)²
pi = stated probability for outcome i
oi = actual outcome (1 if happened, 0 if not)
```
**Example:**
```
Prediction 1: P=0.8, Outcome=1 → (0.8-1)² = 0.04
Prediction 2: P=0.6, Outcome=0 → (0.6-0)² = 0.36
Prediction 3: P=0.9, Outcome=1 → (0.9-1)² = 0.01
Brier Score = (0.04 + 0.36 + 0.01) / 3 = 0.137
```
**Interpretation:**
- 0.00 = perfect calibration
- 0.25 = random guessing
- Lower is better
**Typical scores:**
- Expert forecasters: 0.10-0.15
- Average people: 0.20-0.25
- Aim for: <0.15
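A minimal Python sketch of this calculation, reproducing the example above:
```python
def brier_score(forecasts):
    """Mean squared difference between stated probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

print(round(brier_score([(0.8, 1), (0.6, 0), (0.9, 1)]), 3))  # 0.137
```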
---
## Advanced Applications
### Application 1: Hierarchical Priors
When you're uncertain about the prior itself.
**Example:**
- Question: "Will project finish on time?"
- Uncertain about base rate: "Do similar projects finish on time 30% or 60% of the time?"
**Approach**: Model uncertainty in prior
```
P(On time) = Weighted average of different base rates
Scenario 1: Base rate = 30% (if similar to past failures), Weight = 40%
Scenario 2: Base rate = 50% (if average project), Weight = 30%
Scenario 3: Base rate = 70% (if similar to past successes), Weight = 30%
Prior P(On time) = (0.30 × 0.40) + (0.50 × 0.30) + (0.70 × 0.30)
= 0.12 + 0.15 + 0.21
= 0.48 (48%)
```
Then update this 48% prior with evidence.
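A minimal sketch of this weighted-average prior (scenario values copied from above):
```python
# (base rate, weight) pairs for each scenario about the reference class
scenarios = [(0.30, 0.40), (0.50, 0.30), (0.70, 0.30)]

assert abs(sum(w for _, w in scenarios) - 1.0) < 1e-9  # weights must sum to 1
prior = sum(rate * weight for rate, weight in scenarios)
print(f"Prior P(on time) = {prior:.0%}")  # 48%
```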
### Application 2: Bayesian Model Comparison
Compare which model/theory better explains data.
**Example:**
- Model A: "Bug in feature X"
- Model B: "Infrastructure issue"
Evidence: 10 data points
```
P(Model A|Data) / P(Model B|Data) = [P(Data|Model A) × P(Model A)] / [P(Data|Model B) × P(Model B)]
```
**Bayes Factor** = P(Data|Model A) / P(Data|Model B)
- BF > 10: Strong evidence for Model A
- BF = 3-10: Moderate evidence for Model A
- BF = 1: Equal evidence
- BF < 0.33: Evidence against Model A
### Application 3: Forecasting with Bayesian Updates
Use for repeated forecasting (elections, product launches, project timelines).
**Process:**
1. Start with base rate prior
2. Update weekly as new evidence arrives
3. Track belief evolution over time
4. Compare final forecast to outcome (calibration check)
**Example: Product launch success**
```
Week -8: Prior = 60% (base rate for similar launches)
Week -6: Beta feedback positive → Update to 70%
Week -4: Competitor launches similar product → Update to 55%
Week -2: Pre-orders exceed target → Update to 75%
Week 0: Launch → Actual success: Yes ✓
Forecast evolution: 60% → 70% → 55% → 75% → (Outcome: Yes)
```
**Calibration**: 75% final forecast, outcome=Yes. Good calibration if 7-8 out of 10 forecasts at 75% are correct.
---
## Quality Checklist for Complex Cases
**Multiple Hypotheses**:
- [ ] All hypotheses listed (including catch-all "Other")
- [ ] Priors sum to 1.0
- [ ] Likelihoods defined for all hypothesis-evidence pairs
- [ ] Posteriors sum to 1.0 (math check)
- [ ] Interpretation provided (which hypothesis favored? by how much?)
**Sequential Updates**:
- [ ] Evidence items clearly independent or conditional dependence noted
- [ ] Each update uses previous posterior as new prior
- [ ] Belief evolution tracked (how beliefs changed over time)
- [ ] Final conclusion integrates all evidence
- [ ] Timeline shows when each piece of evidence arrived
**Calibration**:
- [ ] Considered alternative explanations (not overconfident?)
- [ ] Checked against base rates (not ignoring priors?)
- [ ] Stated confidence interval or range (not just point estimate)
- [ ] Identified assumptions that could make forecast wrong
- [ ] Planned follow-up to track calibration (compare forecast to outcome)
**Minimum Standard for Complex Cases**:
- Multiple hypotheses: Score ≥ 4.0 on rubric
- High-stakes forecasts: Track calibration over 10+ predictions

@@ -0,0 +1,308 @@
# Bayesian Reasoning Template
## Workflow
Copy this checklist and track your progress:
```
Bayesian Update Progress:
- [ ] Step 1: State question and establish prior from base rates
- [ ] Step 2: Estimate likelihoods for evidence
- [ ] Step 3: Calculate posterior using Bayes' theorem
- [ ] Step 4: Perform sensitivity analysis
- [ ] Step 5: Calibrate and validate with quality checklist
```
**Step 1: State question and establish prior from base rates**
Define specific, testable hypothesis with timeframe and success criteria. Identify reference class and base rate, adjust for specific differences, and state prior explicitly with justification. See [Step 1: State the Question](#step-1-state-the-question) and [Step 2: Find Base Rates](#step-2-find-base-rates) for guidance.
**Step 2: Estimate likelihoods for evidence**
Assess P(E|H) (probability of evidence if hypothesis TRUE) and P(E|¬H) (probability if FALSE), calculate likelihood ratio = P(E|H) / P(E|¬H), and interpret diagnostic strength. See [Step 3: Estimate Likelihoods](#step-3-estimate-likelihoods) for examples and common mistakes.
**Step 3: Calculate posterior using Bayes' theorem**
Apply P(H|E) = [P(E|H) × P(H)] / P(E) or use simpler odds form: Posterior Odds = Prior Odds × LR. Interpret change in belief (prior → posterior) and strength of evidence. See [Step 4: Calculate Posterior](#step-4-calculate-posterior) for calculation methods.
**Step 4: Perform sensitivity analysis**
Test how posterior changes with different prior values and likelihoods to assess robustness of conclusion. See [Sensitivity Analysis](#sensitivity-analysis) section in template structure.
**Step 5: Calibrate and validate with quality checklist**
Check for overconfidence, base rate neglect, and extreme posteriors. Use [Calibration Check](#calibration-check) and [Quality Checklist](#quality-checklist) to verify prior is justified, likelihoods have reasoning, evidence is diagnostic (LR ≠ 1), calculation correct, and assumptions stated.
## Quick Template
````markdown
# Bayesian Analysis: {Topic}
## Question
**Hypothesis**: {What are you testing?}
**Estimating**: P({specific outcome})
**Timeframe**: {When will outcome be known?}
**Matters because**: {What decision depends on this?}
---
## Prior Belief (Before Evidence)
### Base Rate
{What's the general frequency in similar cases?}
- Reference class: {Similar situations}
- Base rate: {X%}
### Adjustments
{How is this case different from base rate?}
- Factor 1: {Increases/decreases probability because...}
- Factor 2: {Increases/decreases probability because...}
### Prior Probability
**P(H) = {X%}**
**Justification**: {Why this prior?}
**Range if uncertain**: {min%} to {max%}
---
## Evidence
**What was observed**: {Specific evidence or data}
**How diagnostic**: {Does this distinguish hypothesis true vs false?}
### Likelihoods
**P(E|H) = {X%}** - Probability of seeing this evidence IF hypothesis is TRUE
- Reasoning: {Why this likelihood?}
**P(E|¬H) = {Y%}** - Probability of seeing this evidence IF hypothesis is FALSE
- Reasoning: {Why this likelihood?}
**Likelihood Ratio = {X/Y} = {ratio}**
- Interpretation: Evidence is {very strong / moderate / weak / not diagnostic}
---
## Bayesian Update
### Calculation
**Using probability form**:
```
P(H|E) = [P(E|H) × P(H)] / P(E)
where P(E) = [P(E|H) × P(H)] + [P(E|¬H) × P(¬H)]
P(E) = [{X%} × {Prior%}] + [{Y%} × {100-Prior%}]
P(E) = {calculation}
P(H|E) = [{X%} × {Prior%}] / {P(E)}
P(H|E) = {result%}
```
**Or using odds form** (often simpler):
```
Prior Odds = P(H) / P(¬H) = {Prior%} / {100-Prior%} = {odds}
Likelihood Ratio = {LR}
Posterior Odds = Prior Odds × LR = {odds} × {LR} = {result}
Posterior Probability = Posterior Odds / (1 + Posterior Odds) = {result%}
```
### Posterior Probability
**P(H|E) = {result%}**
### Change in Belief
- Prior: {X%}
- Posterior: {Y%}
- Change: {+/- Z percentage points}
- Interpretation: Evidence {strongly supports / moderately supports / weakly supports / contradicts} hypothesis
---
## Sensitivity Analysis
**How sensitive is posterior to inputs?**
If Prior was {different value}:
- Posterior would be: {recalculated value}
If P(E|H) was {different value}:
- Posterior would be: {recalculated value}
**Robustness**: Conclusion is {robust / somewhat robust / sensitive} to assumptions
---
## Calibration Check
**Am I overconfident?**
- Did I anchor on initial belief? {yes/no - reasoning}
- Did I ignore base rates? {yes/no - reasoning}
- Is my posterior extreme (>90% or <10%)? {If yes, is evidence truly that strong?}
- Would an outside observer agree with my likelihoods? {check}
**Red flags**:
- ✗ Posterior is 100% or 0% (almost never justified)
- ✗ Large update from weak evidence (check LR)
- ✗ Prior ignores base rate entirely
- ✗ Likelihoods are guesses without reasoning
---
## Limitations & Assumptions
**Key assumptions**:
1. {Assumption 1}
2. {Assumption 2}
**What could invalidate this analysis**:
- {Condition that would change conclusion}
- {Different interpretation of evidence}
**Uncertainty**:
- Most uncertain about: {which input?}
- Could be wrong if: {what scenario?}
---
## Decision Implications
**Given posterior of {X%}**:
Recommended action: {what to do}
**If decision threshold is**:
- High confidence needed (>80%): {action}
- Medium confidence (>60%): {action}
- Low bar (>40%): {action}
**Next evidence to gather**: {What would further update belief?}
````
## Step-by-Step Guide
### Step 1: State the Question
Be specific and testable.
**Good**: "Will our product achieve >1000 DAU within 6 months?"
**Bad**: "Will the product succeed?"
Define success criteria numerically when possible.
### Step 2: Find Base Rates
**Method**:
1. Identify reference class (similar situations)
2. Look up historical frequency
3. Adjust for known differences
**Example**:
- Question: Will our SaaS startup raise Series A?
- Reference class: B2B SaaS startups, seed stage, similar market
- Base rate: ~30% raise Series A within 2 years
- Adjustments: Strong traction (+), competitive market (-), experienced team (+)
- Adjusted prior: 45%
**Common mistake**: Ignoring base rates entirely ("inside view" bias)
### Step 3: Estimate Likelihoods
Ask: "If hypothesis were true, how likely is this evidence?"
Then: "If hypothesis were false, how likely is this evidence?"
**Example - Medical test**:
- Hypothesis: Patient has disease (prevalence 1%)
- Evidence: Positive test result
- P(positive test | has disease) = 90% (test sensitivity)
- P(positive test | no disease) = 5% (false positive rate)
- LR = 90% / 5% = 18 (strong evidence)
**Common mistake**: Confusing P(E|H) with P(H|E) - the "prosecutor's fallacy"
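To see how different P(E|H) and P(H|E) can be, here is a minimal Python sketch of the medical-test example above (the function name is illustrative):
```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H|E) from prevalence, sensitivity P(E|H), and false-positive rate P(E|~H)."""
    return p_e_given_h * prior / (p_e_given_h * prior + p_e_given_not_h * (1 - prior))

# Prevalence 1%, sensitivity 90%, false-positive rate 5%
print(f"{posterior(0.01, 0.90, 0.05):.1%}")  # ~15.4%, despite the "90% accurate" test
```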
### Step 4: Calculate Posterior
**Odds form is often easier**:
1. Convert prior to odds: Odds = P / (1-P)
2. Multiply by LR: Posterior Odds = Prior Odds × LR
3. Convert back to probability: P = Odds / (1 + Odds)
**Example**:
- Prior: 30% → Odds = 0.3/0.7 = 0.43
- LR = 5
- Posterior Odds = 0.43 × 5 = 2.15
- Posterior Probability = 2.15 / 3.15 = 68%
### Step 5: Calibrate
**Calibration questions**:
- If you made 100 predictions at X% confidence, would about X of them come true?
- Are you systematically over/underconfident?
- Does your posterior pass the "outside view" test?
**Calibration tips**:
- Track your forecasts and outcomes
- Be especially skeptical of extreme probabilities (>95%, <5%)
- Consider opposite evidence (confirmation bias check)
## Common Pitfalls
**Ignoring base rates** ("base rate neglect"):
- Bad: "Test is 90% accurate, so positive test means 90% chance of disease"
- Good: "Disease is rare (1%), so even with positive test, probability is only ~15%"
**Confusing conditional probabilities**:
- P(positive test | disease) ≠ P(disease | positive test)
- These can be very different!
**Overconfident likelihoods**:
- Claiming P(E|H) = 99% when evidence is ambiguous
- Not considering alternative explanations
**Anchoring on prior**:
- Weak evidence + starting at 50% = staying near 50%
- Solution: Use base rates, not 50% default
**Treating all evidence as equally strong**:
- Check likelihood ratio (LR)
- LR ≈ 1 means evidence is not diagnostic
## Worked Example
**Question**: Will project finish on time?
**Prior**:
- Base rate: 60% of our projects finish on time
- This project: More complex than average (-), experienced team (+)
- Prior: 55%
**Evidence**: At 50% milestone, we're 1 week behind schedule
**Likelihoods**:
- P(behind at 50% | finish on time) = 30% (can recover)
- P(behind at 50% | miss deadline) = 80% (usually signals trouble)
- LR = 30% / 80% = 0.375 (evidence against on-time)
**Calculation**:
- Prior odds = 0.55 / 0.45 = 1.22
- Posterior odds = 1.22 × 0.375 = 0.46
- Posterior probability = 0.46 / 1.46 = 32%
**Conclusion**: Updated from 55% to 32% probability of on-time finish. Being behind at 50% is meaningful evidence of delay.
**Decision**: If deadline is flexible, continue. If hard deadline, consider descoping or adding resources.
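A quick arithmetic check of this worked example (a sketch, not part of the template):
```python
prior, p_behind_if_on_time, p_behind_if_late = 0.55, 0.30, 0.80

prior_odds = prior / (1 - prior)              # ~1.22
lr = p_behind_if_on_time / p_behind_if_late   # 0.375 (evidence against on-time finish)
posterior_odds = prior_odds * lr              # ~0.46
print(f"P(on time | behind at 50%) = {posterior_odds / (1 + posterior_odds):.1%}")
# ~31.4%; the ~32% above comes from rounding the intermediate odds
```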
## Quality Checklist
- [ ] Prior is justified (base rate + adjustments)
- [ ] Likelihoods have reasoning (not just guesses)
- [ ] Evidence is diagnostic (LR significantly different from 1)
- [ ] Calculation is correct
- [ ] Posterior is in reasonable range (not 0% or 100%)
- [ ] Assumptions are stated
- [ ] Sensitivity analysis performed
- [ ] Decision implications clear