# Bayesian Reasoning & Calibration Methodology

## Bayesian Reasoning Workflow

Copy this checklist and track your progress:

```
Bayesian Reasoning Progress:
- [ ] Step 1: Define hypotheses and assign priors
- [ ] Step 2: Assign likelihoods for each evidence-hypothesis pair
- [ ] Step 3: Compute posteriors and update sequentially
- [ ] Step 4: Check for dependent evidence and adjust
- [ ] Step 5: Validate calibration and check for bias
```

**Step 1: Define hypotheses and assign priors**

List all competing hypotheses (including a catch-all "Other") and assign prior probabilities that sum to 1.0. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for structuring priors with justifications.

**Step 2: Assign likelihoods for each evidence-hypothesis pair**

For each hypothesis, define P(Evidence|Hypothesis) based on how likely the evidence would be if that hypothesis were true. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for likelihood assessment techniques.

**Step 3: Compute posteriors and update sequentially**

Calculate posterior probabilities using Bayes' theorem, then chain updates for sequential evidence. See [Sequential Evidence Updates](#sequential-evidence-updates) for the multi-stage updating process.

**Step 4: Check for dependent evidence and adjust**

Identify whether evidence items are independent or correlated, and use conditional likelihoods if needed. See [Handling Dependent Evidence](#handling-dependent-evidence) for dependence detection and correction.

**Step 5: Validate calibration and check for bias**

Track forecasts over time and compare stated probabilities to actual outcomes using calibration curves and Brier scores. See [Calibration Techniques](#calibration-techniques) for debiasing methods and metrics.

---

## Multiple Hypothesis Updates

### Problem: Choosing Among Many Hypotheses

Often you have 3+ competing hypotheses and need to update all of them simultaneously.
**Example:**

- H1: Bug in payment processor code
- H2: Database connection timeout
- H3: Third-party API outage
- H4: DDoS attack

### Approach: Compute Posterior for Each Hypothesis

**Step 1: Assign prior probabilities** (must sum to 1)

| Hypothesis | Prior P(H) | Justification |
|------------|-----------|---------------|
| H1: Payment bug | 0.30 | Common issue, recent deploy |
| H2: DB timeout | 0.25 | Has happened before |
| H3: API outage | 0.20 | Dependency on external service |
| H4: DDoS | 0.10 | Rare but possible |
| H5: Other | 0.15 | Catch-all for unknowns |
| **Total** | **1.00** | Must sum to 1 |

**Step 2: Define the likelihood for each hypothesis**

Evidence E: "500 errors only on payment endpoint"

| Hypothesis | P(E\|H) | Justification |
|------------|---------|---------------|
| H1: Payment bug | 0.80 | Bug would affect payment specifically |
| H2: DB timeout | 0.30 | Would affect multiple endpoints |
| H3: API outage | 0.70 | Payment uses external API |
| H4: DDoS | 0.50 | Could target any endpoint |
| H5: Other | 0.20 | Generic catch-all |

**Step 3: Compute P(E)** (marginal probability)

```
P(E) = Σ [P(E|Hi) × P(Hi)] for all hypotheses

P(E) = (0.80 × 0.30) + (0.30 × 0.25) + (0.70 × 0.20) + (0.50 × 0.10) + (0.20 × 0.15)
P(E) = 0.24 + 0.075 + 0.14 + 0.05 + 0.03
P(E) = 0.535
```

**Step 4: Compute the posterior for each hypothesis**

```
P(Hi|E) = [P(E|Hi) × P(Hi)] / P(E)

P(H1|E) = (0.80 × 0.30) / 0.535 = 0.24 / 0.535 = 0.449 (44.9%)
P(H2|E) = (0.30 × 0.25) / 0.535 = 0.075 / 0.535 = 0.140 (14.0%)
P(H3|E) = (0.70 × 0.20) / 0.535 = 0.14 / 0.535 = 0.262 (26.2%)
P(H4|E) = (0.50 × 0.10) / 0.535 = 0.05 / 0.535 = 0.093 (9.3%)
P(H5|E) = (0.20 × 0.15) / 0.535 = 0.03 / 0.535 = 0.056 (5.6%)

Total: 100% (check: posteriors must sum to 1)
```

**Interpretation:**

- H1 (Payment bug): 30% → 44.9% (increased ~15 percentage points)
- H3 (API outage): 20% → 26.2% (increased ~6 pp)
- H2 (DB timeout): 25% → 14.0% (decreased ~11 pp)
- H4 (DDoS): 10% → 9.3% (barely changed)

**Decision**: Investigate H1 (payment bug) first, with H3 (API outage) as the backup.

---

## Sequential Evidence Updates

### Problem: Multiple Pieces of Evidence Over Time

Evidence arrives in stages, so beliefs must be updated sequentially.

**Example:**

- Evidence 1: "500 errors on payment endpoint" (t=0)
- Evidence 2: "External API status page shows outage" (t+5 min)
- Evidence 3: "Our other services using same API also failing" (t+10 min)

### Approach: Chain Updates (Prior → E1 → E2 → E3)

**Step 1: Update with E1** (as above)

```
Prior → P(H|E1)
P(H1|E1) = 44.9% (payment bug)
P(H3|E1) = 26.2% (API outage)
```

**Step 2: Use the posterior as the new prior, then update with E2**

Evidence E2: "External API status page shows outage"

New prior (from the E1 posterior; for simplicity, only the two leading hypotheses H1 and H3 are carried forward):

- P(H1) = 0.449 (payment bug)
- P(H3) = 0.262 (API outage)

Likelihoods given E2:

- P(E2|H1) = 0.20 (a code bug wouldn't cause an external API status change)
- P(E2|H3) = 0.95 (an API outage would almost certainly show on the status page)

```
P(E2) = (0.20 × 0.449) + (0.95 × 0.262) = 0.0898 + 0.2489 = 0.3387

P(H1|E1,E2) = (0.20 × 0.449) / 0.3387 = 0.265 (26.5%)
P(H3|E1,E2) = (0.95 × 0.262) / 0.3387 = 0.735 (73.5%)
```

**Interpretation**: E2 strongly favors H3 (API outage). H1 dropped from 44.9% → 26.5%.
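Both updates so far (the five-hypothesis update and the E1 → E2 chain) can be reproduced with a short helper before continuing to Step 3 below. This is a minimal sketch; the `bayes_update` name and the dictionary layout are illustrative assumptions, not prescribed tooling:

```python
def bayes_update(priors: dict[str, float], likelihoods: dict[str, float]) -> dict[str, float]:
    """Return posteriors P(H|E) from priors P(H) and likelihoods P(E|H)."""
    # Marginal probability of the evidence: P(E) = Σ P(E|Hi) × P(Hi)
    p_e = sum(likelihoods[h] * priors[h] for h in priors)
    # Bayes' theorem for each hypothesis: P(Hi|E) = P(E|Hi) × P(Hi) / P(E)
    return {h: likelihoods[h] * priors[h] / p_e for h in priors}

# Steps 1-4 of the multi-hypothesis example (priors and likelihoods from the tables above).
priors = {"H1": 0.30, "H2": 0.25, "H3": 0.20, "H4": 0.10, "H5": 0.15}
likelihood_e1 = {"H1": 0.80, "H2": 0.30, "H3": 0.70, "H4": 0.50, "H5": 0.20}
after_e1 = bayes_update(priors, likelihood_e1)
# after_e1 ≈ {"H1": 0.449, "H2": 0.140, "H3": 0.262, "H4": 0.093, "H5": 0.056}

# Step 2 of the sequential example: carry forward only H1 and H3, then update with E2.
after_e2 = bayes_update(
    {"H1": after_e1["H1"], "H3": after_e1["H3"]},
    {"H1": 0.20, "H3": 0.95},
)
# after_e2 ≈ {"H1": 0.265, "H3": 0.735}
```

Because the posterior is renormalized inside the helper, the same function can be chained for each new piece of evidence, as Step 3 shows next.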
**Step 3: Update with E3**

Evidence E3: "Other services using same API also failing"

New prior:

- P(H1) = 0.265
- P(H3) = 0.735

Likelihoods:

- P(E3|H1) = 0.10 (a payment bug wouldn't affect other services)
- P(E3|H3) = 0.99 (an API outage would affect all services using it)

```
P(E3) = (0.10 × 0.265) + (0.99 × 0.735) = 0.0265 + 0.7277 = 0.7542

P(H1|E1,E2,E3) = (0.10 × 0.265) / 0.7542 = 0.035 (3.5%)
P(H3|E1,E2,E3) = (0.99 × 0.735) / 0.7542 = 0.965 (96.5%)
```

**Final conclusion**: 96.5% confidence it's an API outage, not a payment bug.

**Summary of belief evolution:**

```
            Prior    After E1    After E2    After E3
H1 (Bug):    30%  →   44.9%   →   26.5%   →    3.5%
H3 (API):    20%  →   26.2%   →   73.5%   →   96.5%
```

---

## Handling Dependent Evidence

### Problem: Evidence Items Not Independent

**The naive approach fails when**:

- E1 and E2 are correlated (not independent)
- You update twice with the same information

**Example of dependent evidence:**

- E1: "User reports payment failing"
- E2: "Another user reports payment failing"

If E1 and E2 come from the same incident, they are not independent evidence!

### Solution 1: Treat as Single Evidence

If evidence is dependent, combine it into one update.

**Instead of:**

- Update with E1: "User reports payment failing"
- Update with E2: "Another user reports payment failing"

**Do:**

- Single update with E: "Multiple users report payment failing"

Likelihoods:

- P(E|Bug) = 0.90 (if a bug exists, multiple users are affected)
- P(E|No bug) = 0.05 (false reports are rare)

### Solution 2: Conditional Likelihoods

If evidence is conditionally dependent (E2 depends on E1), use:

```
P(H|E1,E2) ∝ P(E2|E1,H) × P(E1|H) × P(H)
```

Not this, which assumes independence:

```
P(H|E1,E2) ∝ P(E2|H) × P(E1|H) × P(H)   ← assumes independence
```

**Example:**

- E1: "Test fails on staging"
- E2: "Test fails on production" (same test, likely same cause)

Conditional likelihoods:

- P(E2|E1, Bug) = 0.95 (if staging failed due to a bug, production likely fails too)
- P(E2|E1, Env) = 0.20 (if staging failed due to its environment, production is a different environment)

### Red Flags for Dependent Evidence

Watch out for:

- Multiple reports of the same incident (count them as one)
- Cascading failures (downstream failure caused by an upstream one)
- Repeated measurements of the same thing (not new information)
- Evidence from the same source (correlated errors)

**Principle**: Each update should provide **new information**. If E2 is predictable from E1, it is not independent.

---

## Calibration Techniques

### Problem: Over/Underconfidence Bias

Common patterns:

- **Overconfidence**: Stating 90% when the true rate is 70%
- **Underconfidence**: Stating 60% when the true rate is 80%

### Calibration Check: Track Predictions Over Time

**Method**:

1. Make many probabilistic forecasts (P=70%, P=40%, etc.)
2. Track outcomes
3. Group forecasts by confidence level
4. Compare stated probability to actual frequency

**Example calibration check:**

| Stated probability | # Predictions | # Occurred | Actual % | Calibration |
|--------------------|---------------|------------|----------|-------------|
| 90-100% | 20 | 16 | 80% | Overconfident |
| 70-89% | 30 | 24 | 80% | Good |
| 50-69% | 25 | 14 | 56% | Good |
| 30-49% | 15 | 5 | 33% | Good |
| 0-29% | 10 | 2 | 20% | Good |

**Interpretation**: Overconfident at high confidence levels (stating 90% when the event occurs only 80% of the time).
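This check is easy to automate once forecasts and outcomes are logged. A minimal sketch under the five-bin grouping used above; the `calibration_table` helper and its bin layout are illustrative assumptions:

```python
def calibration_table(forecasts: list[tuple[float, bool]], n_bins: int = 5):
    """forecasts: (stated probability, did the event occur?) pairs."""
    bins: list[list[bool]] = [[] for _ in range(n_bins)]
    for p, occurred in forecasts:
        idx = min(int(p * n_bins), n_bins - 1)   # e.g. 0.93 -> top bin
        bins[idx].append(occurred)
    rows = []
    for i, outcomes in enumerate(bins):
        if not outcomes:
            continue  # skip empty confidence bands
        lo, hi = i / n_bins, (i + 1) / n_bins
        actual = sum(outcomes) / len(outcomes)   # observed frequency in this band
        rows.append((f"{lo:.0%}-{hi:.0%}", len(outcomes), actual))
    return rows

# Example: a well-calibrated 70% forecaster and an overconfident 90% forecaster.
history = [(0.9, True)] * 8 + [(0.9, False)] * 2 + [(0.7, True)] * 7 + [(0.7, False)] * 3
for band, n, actual in calibration_table(history):
    print(f"{band}: {n} predictions, actual frequency {actual:.0%}")
```

A large gap between the band's stated probability and its actual frequency is the signal to apply the debiasing techniques below.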
### Calibration Curve

Plot stated probability against actual frequency:

```
Actual %
 100 ┤
     │
  80 ┤                           ●
     │                      ●        (overconfident:
  60 ┤                 ●              stated > actual)
     │             ●
  40 ┤         ●
     │
  20 ┤    ●
     │
   0 └────────────────────────────────
     0     20    40     60    80    100
                Stated probability %

Perfect calibration = points on the diagonal (actual = stated)
Points below the diagonal = overconfident
Points above the diagonal = underconfident
```

### Debiasing Techniques

**For overconfidence:**

- Consider alternative explanations (how could I be wrong?)
- Base rate check (what's the typical success rate?)
- Pre-mortem: "It's 6 months from now and we failed. Why?"
- Confidence intervals: state a range, not a point estimate

**For underconfidence:**

- Review past successes (build evidence for confidence)
- Test predictions: am I systematically too cautious?
- Cost of inaction: what's the cost of waiting for certainty?

### Brier Score (Calibration Metric)

**Formula**:

```
Brier Score = (1/n) × Σ (pi - oi)²

pi = stated probability for outcome i
oi = actual outcome (1 if it happened, 0 if not)
```

**Example:**

```
Prediction 1: P=0.8, Outcome=1 → (0.8-1)² = 0.04
Prediction 2: P=0.6, Outcome=0 → (0.6-0)² = 0.36
Prediction 3: P=0.9, Outcome=1 → (0.9-1)² = 0.01

Brier Score = (0.04 + 0.36 + 0.01) / 3 = 0.137
```

**Interpretation:**

- 0.00 = perfect prediction
- 0.25 = no better than always stating 50%
- Lower is better

**Typical scores:**

- Expert forecasters: 0.10-0.15
- Average people: 0.20-0.25
- Aim for: <0.15

---

## Advanced Applications

### Application 1: Hierarchical Priors

Use when you're uncertain about the prior itself.

**Example:**

- Question: "Will the project finish on time?"
- Uncertain about the base rate: "Do similar projects finish on time 30% or 60% of the time?"

**Approach**: Model the uncertainty in the prior.

```
P(On time) = weighted average over candidate base rates

Scenario 1: Base rate = 30% (if similar to past failures),  Weight = 40%
Scenario 2: Base rate = 50% (if an average project),        Weight = 30%
Scenario 3: Base rate = 70% (if similar to past successes), Weight = 30%

Prior P(On time) = (0.30 × 0.40) + (0.50 × 0.30) + (0.70 × 0.30)
                 = 0.12 + 0.15 + 0.21
                 = 0.48 (48%)
```

Then update this 48% prior with evidence.

### Application 2: Bayesian Model Comparison

Compare which model or theory better explains the data.

**Example:**

- Model A: "Bug in feature X"
- Model B: "Infrastructure issue"

Evidence: 10 data points

```
P(Model A|Data) / P(Model B|Data) = [P(Data|Model A) × P(Model A)] / [P(Data|Model B) × P(Model B)]
```

**Bayes Factor** = P(Data|Model A) / P(Data|Model B)

- BF > 10: Strong evidence for Model A
- BF = 3-10: Moderate evidence for Model A
- BF = 1: Equal evidence
- BF < 0.33: Evidence against Model A

### Application 3: Forecasting with Bayesian Updates

Use for repeated forecasting (elections, product launches, project timelines).

**Process:**

1. Start with a base-rate prior
2. Update weekly as new evidence arrives
3. Track belief evolution over time
4. Compare the final forecast to the outcome (calibration check)

**Example: Product launch success**

```
Week -8: Prior = 60% (base rate for similar launches)
Week -6: Beta feedback positive               → update to 70%
Week -4: Competitor launches similar product  → update to 55%
Week -2: Pre-orders exceed target             → update to 75%
Week  0: Launch → actual outcome: success ✓

Forecast evolution: 60% → 70% → 55% → 75% → (Outcome: Yes)
```

**Calibration**: The final forecast was 75% and the outcome was a success. This is consistent with good calibration if roughly 7-8 out of every 10 forecasts made at 75% come true.
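For step 4 of the forecasting process (comparing forecasts to outcomes), the Brier score defined earlier takes only a few lines. A minimal sketch; the function name and the sample data are illustrative, and scoring each weekly forecast against the final outcome is just one simple convention:

```python
def brier_score(forecasts: list[tuple[float, int]]) -> float:
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# The three-prediction example from the Brier score section:
print(round(brier_score([(0.8, 1), (0.6, 0), (0.9, 1)]), 3))  # 0.137

# Scoring the product-launch forecast stream (each weekly forecast vs the final outcome):
weekly = [(0.60, 1), (0.70, 1), (0.55, 1), (0.75, 1)]
print(round(brier_score(weekly), 3))  # 0.129
```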
---

## Quality Checklist for Complex Cases

**Multiple Hypotheses**:

- [ ] All hypotheses listed (including catch-all "Other")
- [ ] Priors sum to 1.0
- [ ] Likelihoods defined for all hypothesis-evidence pairs
- [ ] Posteriors sum to 1.0 (math check)
- [ ] Interpretation provided (which hypothesis is favored, and by how much?)

**Sequential Updates**:

- [ ] Evidence items clearly independent, or conditional dependence noted
- [ ] Each update uses the previous posterior as the new prior
- [ ] Belief evolution tracked (how beliefs changed over time)
- [ ] Final conclusion integrates all evidence
- [ ] Timeline shows when each piece of evidence arrived

**Calibration**:

- [ ] Considered alternative explanations (not overconfident?)
- [ ] Checked against base rates (not ignoring priors?)
- [ ] Stated a confidence interval or range (not just a point estimate)
- [ ] Identified assumptions that could make the forecast wrong
- [ ] Planned follow-up to track calibration (compare forecast to outcome)

**Minimum Standard for Complex Cases**:

- Multiple hypotheses: Score ≥ 4.0 on the rubric
- High-stakes forecasts: Track calibration over 10+ predictions
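The arithmetic items above (priors and posteriors summing to 1.0, probabilities within [0, 1]) can be verified mechanically. A minimal sketch; the `check_distribution` name and the tolerance are illustrative assumptions:

```python
import math

def check_distribution(probs: dict[str, float], tol: float = 1e-6) -> list[str]:
    """Return a list of problems found in a prior or posterior distribution."""
    problems = []
    if any(not 0.0 <= p <= 1.0 for p in probs.values()):
        problems.append("some probabilities fall outside [0, 1]")
    if not math.isclose(sum(probs.values()), 1.0, abs_tol=tol):
        problems.append(f"probabilities sum to {sum(probs.values()):.4f}, not 1.0")
    return problems

# Example: the priors from the Multiple Hypothesis Updates section pass both checks.
priors = {"H1": 0.30, "H2": 0.25, "H3": 0.20, "H4": 0.10, "H5": 0.15}
print(check_distribution(priors))  # []
```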