Bayesian Reasoning & Calibration Methodology
Bayesian Reasoning Workflow
Copy this checklist and track your progress:
Bayesian Reasoning Progress:
- [ ] Step 1: Define hypotheses and assign priors
- [ ] Step 2: Assign likelihoods for each evidence-hypothesis pair
- [ ] Step 3: Compute posteriors and update sequentially
- [ ] Step 4: Check for dependent evidence and adjust
- [ ] Step 5: Validate calibration and check for bias
Step 1: Define hypotheses and assign priors
List all competing hypotheses (including catch-all "Other") and assign prior probabilities that sum to 1.0. See Multiple Hypothesis Updates for structuring priors with justifications.
Step 2: Assign likelihoods for each evidence-hypothesis pair
For each hypothesis, define P(Evidence|Hypothesis) based on how likely the evidence would be if that hypothesis were true. See Multiple Hypothesis Updates for likelihood assessment techniques.
Step 3: Compute posteriors and update sequentially
Calculate posterior probabilities using Bayes' theorem, then chain updates for sequential evidence. See Sequential Evidence Updates for multi-stage updating process.
Step 4: Check for dependent evidence and adjust
Identify whether evidence items are independent or correlated, and use conditional likelihoods if needed. See Handling Dependent Evidence for dependence detection and correction.
Step 5: Validate calibration and check for bias
Track forecasts over time and compare stated probabilities to actual outcomes using calibration curves and Brier scores. See Calibration Techniques for debiasing methods and metrics.
Multiple Hypothesis Updates
Problem: Choosing Among Many Hypotheses
Often you have 3+ competing hypotheses and need to update all simultaneously.
Example:
- H1: Bug in payment processor code
- H2: Database connection timeout
- H3: Third-party API outage
- H4: DDoS attack
Approach: Compute Posterior for Each Hypothesis
Step 1: Assign prior probabilities (must sum to 1)
| Hypothesis | Prior P(H) | Justification |
|---|---|---|
| H1: Payment bug | 0.30 | Common issue, recent deploy |
| H2: DB timeout | 0.25 | Has happened before |
| H3: API outage | 0.20 | Dependency on external service |
| H4: DDoS | 0.10 | Rare but possible |
| H5: Other | 0.15 | Catch-all for unknowns |
| Total | 1.00 | Must sum to 1 |
Step 2: Define likelihood for each hypothesis
Evidence E: "500 errors only on payment endpoint"
| Hypothesis | P(E\|H) | Justification |
|---|---|---|
| H1: Payment bug | 0.80 | Bug would affect payment specifically |
| H2: DB timeout | 0.30 | Would affect multiple endpoints |
| H3: API outage | 0.70 | Payment uses external API |
| H4: DDoS | 0.50 | Could target any endpoint |
| H5: Other | 0.20 | Generic catch-all |
Step 3: Compute P(E) (marginal probability)
P(E) = Σ [P(E|Hi) × P(Hi)] for all hypotheses
P(E) = (0.80 × 0.30) + (0.30 × 0.25) + (0.70 × 0.20) + (0.50 × 0.10) + (0.20 × 0.15)
P(E) = 0.24 + 0.075 + 0.14 + 0.05 + 0.03
P(E) = 0.535
Step 4: Compute posterior for each hypothesis
P(Hi|E) = [P(E|Hi) × P(Hi)] / P(E)
P(H1|E) = (0.80 × 0.30) / 0.535 = 0.24 / 0.535 = 0.449 (44.9%)
P(H2|E) = (0.30 × 0.25) / 0.535 = 0.075 / 0.535 = 0.140 (14.0%)
P(H3|E) = (0.70 × 0.20) / 0.535 = 0.14 / 0.535 = 0.262 (26.2%)
P(H4|E) = (0.50 × 0.10) / 0.535 = 0.05 / 0.535 = 0.093 (9.3%)
P(H5|E) = (0.20 × 0.15) / 0.535 = 0.03 / 0.535 = 0.056 (5.6%)
Total: 100% (check: posteriors must sum to 1)
Interpretation:
- H1 (Payment bug): 30% → 44.9% (increased 15 pp)
- H3 (API outage): 20% → 26.2% (increased 6 pp)
- H2 (DB timeout): 25% → 14.0% (decreased 11 pp)
- H4 (DDoS): 10% → 9.3% (barely changed)
Decision: Investigate H1 (payment bug) first, then H3 (API outage) as backup.
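The update above is mechanical, so it is easy to script. The following is a minimal Python sketch that reproduces the tables for this example; the hypothesis names, priors, and likelihoods are the illustrative values from above, not fixed constants.
```python
# Minimal multi-hypothesis Bayesian update. Priors and likelihoods are the
# illustrative values from the incident example above.

priors = {
    "H1: Payment bug": 0.30,
    "H2: DB timeout": 0.25,
    "H3: API outage": 0.20,
    "H4: DDoS": 0.10,
    "H5: Other": 0.15,
}

# P(E|H) for evidence E = "500 errors only on payment endpoint"
likelihoods = {
    "H1: Payment bug": 0.80,
    "H2: DB timeout": 0.30,
    "H3: API outage": 0.70,
    "H4: DDoS": 0.50,
    "H5: Other": 0.20,
}

assert abs(sum(priors.values()) - 1.0) < 1e-9, "priors must sum to 1"

# Marginal probability of the evidence: P(E) = Σ P(E|Hi) × P(Hi)
marginal = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior for each hypothesis: P(Hi|E) = P(E|Hi) × P(Hi) / P(E)
posteriors = {h: likelihoods[h] * priors[h] / marginal for h in priors}

for h in sorted(posteriors, key=posteriors.get, reverse=True):
    print(f"{h}: {priors[h]:.1%} -> {posteriors[h]:.1%}")
print(f"P(E) = {marginal:.3f}, posterior sum = {sum(posteriors.values()):.3f}")
```
Printing the posterior sum is a cheap sanity check: if it is not 1.0, the priors or the marginal were computed incorrectly.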
Sequential Evidence Updates
Problem: Multiple Pieces of Evidence Over Time
Evidence arrives in stages, so beliefs must be updated sequentially.
Example:
- Evidence 1: "500 errors on payment endpoint" (t=0)
- Evidence 2: "External API status page shows outage" (t+5 min)
- Evidence 3: "Our other services using same API also failing" (t+10 min)
Approach: Chain Updates (Prior → E1 → E2 → E3)
Step 1: Update with E1 (as above)
Prior → P(H|E1)
P(H1|E1) = 44.9% (payment bug)
P(H3|E1) = 26.2% (API outage)
Step 2: Use posterior as new prior, update with E2
Evidence E2: "External API status page shows outage"
New prior (from the E1 posteriors; for brevity, only the two leading hypotheses are carried forward, and each update renormalizes over them):
- P(H1) = 0.449 (payment bug)
- P(H3) = 0.262 (API outage)
Likelihoods given E2:
- P(E2|H1) = 0.20 (bug wouldn't cause external API status change)
- P(E2|H3) = 0.95 (API outage would definitely show on status page)
P(E2) = (0.20 × 0.449) + (0.95 × 0.262) = 0.0898 + 0.2489 = 0.3387
P(H1|E1,E2) = (0.20 × 0.449) / 0.3387 = 0.265 (26.5%)
P(H3|E1,E2) = (0.95 × 0.262) / 0.3387 = 0.735 (73.5%)
Interpretation: E2 strongly favors H3 (API outage). H1 dropped from 44.9% → 26.5%, while H3 rose from 26.2% → 73.5%.
Step 3: Update with E3
Evidence E3: "Other services using same API also failing"
New prior:
- P(H1) = 0.265
- P(H3) = 0.735
Likelihoods:
- P(E3|H1) = 0.10 (payment bug wouldn't affect other services)
- P(E3|H3) = 0.99 (API outage would affect all services)
P(E3) = (0.10 × 0.265) + (0.99 × 0.735) = 0.0265 + 0.7277 = 0.7542
P(H1|E1,E2,E3) = (0.10 × 0.265) / 0.7542 = 0.035 (3.5%)
P(H3|E1,E2,E3) = (0.99 × 0.735) / 0.7542 = 0.965 (96.5%)
Final conclusion: 96.5% confidence it's an API outage, not a payment bug.
Summary of belief evolution:
| Hypothesis | Prior | After E1 | After E2 | After E3 |
|---|---|---|---|---|
| H1 (Bug) | 30% | 44.9% | 26.5% | 3.5% |
| H3 (API) | 20% | 26.2% | 73.5% | 96.5% |
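Chaining is the same computation in a loop, with each posterior feeding the next step as the prior. The sketch below mirrors the worked example: it carries forward only the two leading hypotheses after E1 (their E1 posteriors are taken from the text), and the evidence likelihoods are the illustrative values above.
```python
# Chained updates: each posterior becomes the prior for the next piece of
# evidence. As in the worked example, only the two leading hypotheses after
# E1 are carried forward; each update renormalizes over them.

def bayes_update(prior: dict, likelihood: dict) -> dict:
    """One Bayes update: posterior ∝ likelihood × prior, renormalized."""
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    marginal = sum(unnormalized.values())
    return {h: v / marginal for h, v in unnormalized.items()}

# Posteriors after E1, taken from the multiple-hypothesis example above.
belief = {"H1: Payment bug": 0.449, "H3: API outage": 0.262}

evidence_stream = [
    # E2: external API status page shows an outage
    {"H1: Payment bug": 0.20, "H3: API outage": 0.95},
    # E3: other services using the same API are also failing
    {"H1: Payment bug": 0.10, "H3: API outage": 0.99},
]

for i, likelihood in enumerate(evidence_stream, start=2):
    belief = bayes_update(belief, likelihood)
    print(f"After E{i}: " + ", ".join(f"{h} = {p:.1%}" for h, p in belief.items()))
```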
Handling Dependent Evidence
Problem: Evidence Items Not Independent
The naive approach fails when:
- E1 and E2 are correlated (not independent)
- The same information is counted twice across updates
Example of dependent evidence:
- E1: "User reports payment failing"
- E2: "Another user reports payment failing"
If E1 and E2 are from the same incident, they're not independent evidence!
Solution 1: Treat as Single Evidence
If evidence is dependent, combine into one update:
Instead of:
- Update with E1: "User reports payment failing"
- Update with E2: "Another user reports payment failing"
Do:
- Single update with E: "Multiple users report payment failing"
Likelihood:
- P(E|Bug) = 0.90 (if bug exists, multiple users affected)
- P(E|No bug) = 0.05 (false reports rare)
Solution 2: Conditional Likelihoods
If evidence is conditionally dependent (E2 depends on E1), use:
P(H|E1,E2) ∝ P(E2|E1,H) × P(E1|H) × P(H)
Not:
P(H|E1,E2) ∝ P(E2|H) × P(E1|H) × P(H) ← assumes independence
Example:
- E1: "Test fails on staging"
- E2: "Test fails on production" (same test, likely same cause)
Conditional:
- P(E2|E1, Bug) = 0.95 (if staging failed due to bug, production likely fails too)
- P(E2|E1, Env) = 0.20 (if staging failed due to its environment, production runs in a different environment and is less likely to fail)
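To see the difference in code, the sketch below contrasts a naive update that treats E2 as an independent repeat of E1 with one that uses the conditional likelihood P(E2|E1,H). The 50/50 prior and the P(E1|H) values are assumed purely for illustration; the conditional values come from the staging/production example above.
```python
# Naive update (independence assumed) vs. conditional update P(E2|E1,H).
# Prior and P(E1|H) are assumed for illustration; P(E2|E1,H) values are
# the ones from the staging/production example above.

prior = {"Bug": 0.5, "Env": 0.5}            # assumed prior over the two causes
p_e1 = {"Bug": 0.7, "Env": 0.4}             # assumed P(E1|H): test fails on staging
p_e2_naive = {"Bug": 0.7, "Env": 0.4}       # naive: treat E2 like an independent E1
p_e2_given_e1 = {"Bug": 0.95, "Env": 0.20}  # conditional P(E2|E1,H)

def posterior(prior: dict, *likelihood_terms: dict) -> dict:
    """posterior ∝ prior × product of likelihood terms, renormalized."""
    unnormalized = dict(prior)
    for term in likelihood_terms:
        for h in unnormalized:
            unnormalized[h] *= term[h]
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

naive = posterior(prior, p_e1, p_e2_naive)           # assumes independence
conditional = posterior(prior, p_e1, p_e2_given_e1)  # respects the dependence

print("Naive:      ", {h: round(p, 3) for h, p in naive.items()})
print("Conditional:", {h: round(p, 3) for h, p in conditional.items()})
```
The two answers differ because the conditional likelihood encodes how much (or how little) E2 adds once E1 is already known.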
Red Flags for Dependent Evidence
Watch out for:
- Multiple reports of same incident (count as one)
- Cascading failures (downstream failure caused by upstream)
- Repeated measurements of same thing (not new info)
- Evidence from same source (correlated errors)
Principle: Each update should provide new information. If E2 is predictable from E1, it's not independent.
Calibration Techniques
Problem: Over/Underconfidence Bias
Common patterns:
- Overconfidence: Stating 90% when true rate is 70%
- Underconfidence: Stating 60% when true rate is 80%
Calibration Check: Track Predictions Over Time
Method:
- Make many probabilistic forecasts (P=70%, P=40%, etc.)
- Track outcomes
- Group by confidence level
- Compare stated probability to actual frequency
Example calibration check:
| Your Confidence | # Predictions | # Correct | Actual % | Calibration |
|---|---|---|---|---|
| 90-100% | 20 | 16 | 80% | Overconfident |
| 70-89% | 30 | 24 | 80% | Good |
| 50-69% | 25 | 14 | 56% | Good |
| 30-49% | 15 | 5 | 33% | Good |
| 0-29% | 10 | 2 | 20% | Good |
Interpretation: Overconfident at high confidence levels (saying 90% but only 80% correct).
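A calibration table like the one above can be generated directly from a forecast log. In this sketch the (stated probability, outcome) records are made-up placeholders; the bucket boundaries mirror the table.
```python
# Build a calibration table from a forecast log of (stated probability,
# outcome) pairs. The records are placeholders; use your own log.

records = [  # (stated probability, outcome: 1 = happened, 0 = did not)
    (0.95, 1), (0.90, 0), (0.80, 1), (0.75, 1), (0.60, 0),
    (0.55, 1), (0.40, 0), (0.35, 1), (0.20, 0), (0.10, 0),
]

buckets = [(0.0, 0.3), (0.3, 0.5), (0.5, 0.7), (0.7, 0.9), (0.9, 1.0)]

for lo, hi in buckets:
    in_bucket = [(p, o) for p, o in records
                 if lo <= p < hi or (hi == 1.0 and p == 1.0)]
    if not in_bucket:
        continue
    stated = sum(p for p, _ in in_bucket) / len(in_bucket)
    actual = sum(o for _, o in in_bucket) / len(in_bucket)
    print(f"{lo:.0%}-{hi:.0%}: n={len(in_bucket)}, "
          f"avg stated={stated:.0%}, actual={actual:.0%}")
```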
Calibration Curve
Plot stated probability (x-axis) against actual frequency (y-axis) for each confidence bucket:
- Perfect calibration: points lie on the diagonal (stated = actual)
- Points below the diagonal: overconfident (actual frequency lower than stated probability)
- Points above the diagonal: underconfident (actual frequency higher than stated probability)
Debiasing Techniques
For overconfidence:
- Consider alternative explanations (how could I be wrong?)
- Base rate check (what's the typical success rate?)
- Pre-mortem: "It's 6 months from now and we failed. Why?"
- Confidence intervals: State range, not point estimate
For underconfidence:
- Review past successes (build evidence for confidence)
- Test predictions: Am I systematically too cautious?
- Cost of inaction: What's the cost of waiting for certainty?
Brier Score (Calibration Metric)
Formula:
Brier Score = (1/N) × Σ (p_i − o_i)²
p_i = stated probability for prediction i
o_i = actual outcome (1 if it happened, 0 if not)
Example:
Prediction 1: P=0.8, Outcome=1 → (0.8-1)² = 0.04
Prediction 2: P=0.6, Outcome=0 → (0.6-0)² = 0.36
Prediction 3: P=0.9, Outcome=1 → (0.9-1)² = 0.01
Brier Score = (0.04 + 0.36 + 0.01) / 3 = 0.137
Interpretation:
- 0.00 = perfect calibration
- 0.25 = always predicting 50% (chance-level on binary outcomes)
- Lower is better
Typical scores:
- Expert forecasters: 0.10-0.15
- Average people: 0.20-0.25
- Aim for: <0.15
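The Brier score is a one-liner over a forecast log; the sketch below uses the three predictions from the example above.
```python
# Brier score: mean squared error between stated probabilities and outcomes,
# using the three predictions from the example above.

predictions = [(0.8, 1), (0.6, 0), (0.9, 1)]  # (stated probability, outcome)

brier = sum((p - o) ** 2 for p, o in predictions) / len(predictions)
print(f"Brier score = {brier:.3f}")  # ≈ 0.137; lower is better
```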
Advanced Applications
Application 1: Hierarchical Priors
When you're uncertain about the prior itself.
Example:
- Question: "Will project finish on time?"
- Uncertain about base rate: "Do similar projects finish on time 30% or 60% of the time?"
Approach: Model uncertainty in prior
P(On time) = Weighted average of different base rates
Scenario 1: Base rate = 30% (if similar to past failures), Weight = 40%
Scenario 2: Base rate = 50% (if average project), Weight = 30%
Scenario 3: Base rate = 70% (if similar to past successes), Weight = 30%
Prior P(On time) = (0.30 × 0.40) + (0.50 × 0.30) + (0.70 × 0.30)
= 0.12 + 0.15 + 0.21
= 0.48 (48%)
Then update this 48% prior with evidence.
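A mixture prior is just a weighted average, as in the sketch below; the scenario base rates and weights are the values from this example.
```python
# Mixture prior: weighted average over candidate base rates, using the
# values from the project-timeline example above.

scenarios = [  # (base rate, weight on that scenario)
    (0.30, 0.40),  # similar to past failures
    (0.50, 0.30),  # average project
    (0.70, 0.30),  # similar to past successes
]

assert abs(sum(w for _, w in scenarios) - 1.0) < 1e-9, "weights must sum to 1"
prior_on_time = sum(rate * weight for rate, weight in scenarios)
print(f"Prior P(on time) = {prior_on_time:.2f}")  # 0.48
```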
Application 2: Bayesian Model Comparison
Compare which model/theory better explains data.
Example:
- Model A: "Bug in feature X"
- Model B: "Infrastructure issue"
Evidence: 10 data points
P(Model A|Data) / P(Model B|Data) = [P(Data|Model A) × P(Model A)] / [P(Data|Model B) × P(Model B)]
Bayes Factor = P(Data|Model A) / P(Data|Model B)
- BF > 10: Strong evidence for Model A
- BF = 3-10: Moderate evidence for Model A
- BF = 1-3: Weak evidence for Model A
- BF = 1: Evidence does not favor either model
- BF < 1/3 (≈0.33): Moderate evidence for Model B
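A sketch of the comparison follows. The per-observation likelihoods are assumed for illustration, and the calculation assumes the 10 data points are independent given each model.
```python
# Bayes factor for Model A vs Model B over 10 observations. Likelihood
# values are assumed for illustration; data points are assumed independent
# given each model.
import math

likelihood_A = [0.7, 0.8, 0.6, 0.9, 0.7, 0.8, 0.75, 0.65, 0.8, 0.7]  # P(data_i | A)
likelihood_B = [0.4, 0.5, 0.3, 0.6, 0.4, 0.5, 0.45, 0.35, 0.5, 0.4]  # P(data_i | B)

# Work in log space to avoid underflow when there are many observations.
log_bf = sum(map(math.log, likelihood_A)) - sum(map(math.log, likelihood_B))
bayes_factor = math.exp(log_bf)

prior_odds = 1.0  # assume equal prior probability for the two models
posterior_odds = bayes_factor * prior_odds
print(f"Bayes factor (A vs B) = {bayes_factor:.1f}, posterior odds = {posterior_odds:.1f}")
```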
Application 3: Forecasting with Bayesian Updates
Use for repeated forecasting (elections, product launches, project timelines).
Process:
- Start with base rate prior
- Update weekly as new evidence arrives
- Track belief evolution over time
- Compare final forecast to outcome (calibration check)
Example: Product launch success
Week -8: Prior = 60% (base rate for similar launches)
Week -6: Beta feedback positive → Update to 70%
Week -4: Competitor launches similar product → Update to 55%
Week -2: Pre-orders exceed target → Update to 75%
Week 0: Launch → Actual success: Yes ✓
Forecast evolution: 60% → 70% → 55% → 75% → (Outcome: Yes)
Calibration: 75% final forecast, outcome=Yes. Good calibration if 7-8 out of 10 forecasts at 75% are correct.
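A minimal way to keep the bookkeeping honest is a forecast log that records each revision and the final outcome, as sketched below with the values from this example.
```python
# Forecast log for the product-launch example above; the final (probability,
# outcome) pair feeds the calibration-table and Brier-score sketches.

forecast_log = [
    ("Week -8", 0.60, "base rate for similar launches"),
    ("Week -6", 0.70, "beta feedback positive"),
    ("Week -4", 0.55, "competitor launched similar product"),
    ("Week -2", 0.75, "pre-orders exceed target"),
]
outcome = 1  # launch succeeded

print(" -> ".join(f"{p:.0%}" for _, p, _ in forecast_log) + f" -> outcome: {outcome}")
final_record = (forecast_log[-1][1], outcome)  # (0.75, 1)
print("Record for the calibration log:", final_record)
```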
Quality Checklist for Complex Cases
Multiple Hypotheses:
- All hypotheses listed (including catch-all "Other")
- Priors sum to 1.0
- Likelihoods defined for all hypothesis-evidence pairs
- Posteriors sum to 1.0 (math check)
- Interpretation provided (which hypothesis favored? by how much?)
Sequential Updates:
- Evidence items clearly independent or conditional dependence noted
- Each update uses previous posterior as new prior
- Belief evolution tracked (how beliefs changed over time)
- Final conclusion integrates all evidence
- Timeline shows when each piece of evidence arrived
Calibration:
- Considered alternative explanations (not overconfident?)
- Checked against base rates (not ignoring priors?)
- Stated confidence interval or range (not just point estimate)
- Identified assumptions that could make forecast wrong
- Planned follow-up to track calibration (compare forecast to outcome)
Minimum Standard for Complex Cases:
- Multiple hypotheses: Score ≥ 4.0 on rubric
- High-stakes forecasts: Track calibration over 10+ predictions