# Bayesian Reasoning & Calibration Methodology

## Bayesian Reasoning Workflow

Copy this checklist and track your progress:

```
Bayesian Reasoning Progress:
- [ ] Step 1: Define hypotheses and assign priors
- [ ] Step 2: Assign likelihoods for each evidence-hypothesis pair
- [ ] Step 3: Compute posteriors and update sequentially
- [ ] Step 4: Check for dependent evidence and adjust
- [ ] Step 5: Validate calibration and check for bias
```

**Step 1: Define hypotheses and assign priors**

List all competing hypotheses (including a catch-all "Other") and assign prior probabilities that sum to 1.0. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for structuring priors with justifications.

**Step 2: Assign likelihoods for each evidence-hypothesis pair**

For each hypothesis, define P(Evidence|Hypothesis) based on how likely the evidence would be if that hypothesis were true. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for likelihood assessment techniques.

**Step 3: Compute posteriors and update sequentially**

Calculate posterior probabilities using Bayes' theorem, then chain updates for sequential evidence. See [Sequential Evidence Updates](#sequential-evidence-updates) for the multi-stage updating process.

**Step 4: Check for dependent evidence and adjust**

Identify whether evidence items are independent or correlated, and use conditional likelihoods if needed. See [Handling Dependent Evidence](#handling-dependent-evidence) for dependence detection and correction.

**Step 5: Validate calibration and check for bias**

Track forecasts over time and compare stated probabilities to actual outcomes using calibration curves and Brier scores. See [Calibration Techniques](#calibration-techniques) for debiasing methods and metrics.

---

## Multiple Hypothesis Updates

### Problem: Choosing Among Many Hypotheses

Often you have 3+ competing hypotheses and need to update all of them simultaneously.

**Example:**
- H1: Bug in payment processor code
- H2: Database connection timeout
- H3: Third-party API outage
- H4: DDoS attack

### Approach: Compute Posterior for Each Hypothesis

**Step 1: Assign prior probabilities** (must sum to 1)

| Hypothesis | Prior P(H) | Justification |
|------------|-----------|---------------|
| H1: Payment bug | 0.30 | Common issue, recent deploy |
| H2: DB timeout | 0.25 | Has happened before |
| H3: API outage | 0.20 | Dependency on external service |
| H4: DDoS | 0.10 | Rare but possible |
| H5: Other | 0.15 | Catch-all for unknowns |
| **Total** | **1.00** | Must sum to 1 |

**Step 2: Define likelihood for each hypothesis**

Evidence E: "500 errors only on payment endpoint"

| Hypothesis | P(E\|H) | Justification |
|------------|---------|---------------|
| H1: Payment bug | 0.80 | Bug would affect payment specifically |
| H2: DB timeout | 0.30 | Timeout would likely affect multiple endpoints, not just payments |
| H3: API outage | 0.70 | Payment uses external API |
| H4: DDoS | 0.50 | Could target any endpoint |
| H5: Other | 0.20 | Generic catch-all |

**Step 3: Compute P(E)** (marginal probability)

```
P(E) = Σ [P(E|Hi) × P(Hi)] for all hypotheses

P(E) = (0.80 × 0.30) + (0.30 × 0.25) + (0.70 × 0.20) + (0.50 × 0.10) + (0.20 × 0.15)
P(E) = 0.24 + 0.075 + 0.14 + 0.05 + 0.03
P(E) = 0.535
```

**Step 4: Compute posterior for each hypothesis**

```
P(Hi|E) = [P(E|Hi) × P(Hi)] / P(E)

P(H1|E) = (0.80 × 0.30) / 0.535 = 0.24 / 0.535 = 0.449 (44.9%)
P(H2|E) = (0.30 × 0.25) / 0.535 = 0.075 / 0.535 = 0.140 (14.0%)
P(H3|E) = (0.70 × 0.20) / 0.535 = 0.14 / 0.535 = 0.262 (26.2%)
P(H4|E) = (0.50 × 0.10) / 0.535 = 0.05 / 0.535 = 0.093 (9.3%)
P(H5|E) = (0.20 × 0.15) / 0.535 = 0.03 / 0.535 = 0.056 (5.6%)

Total: 100% (check: posteriors must sum to 1)
```

**Interpretation:**
- H1 (Payment bug): 30% → 44.9% (increased 15 pp)
- H3 (API outage): 20% → 26.2% (increased 6 pp)
- H2 (DB timeout): 25% → 14.0% (decreased 11 pp)
- H4 (DDoS): 10% → 9.3% (barely changed)

**Decision**: Investigate H1 (payment bug) first, then H3 (API outage) as backup.

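
This normalization is easy to script. Below is a minimal Python sketch (the structure and names are illustrative, not part of the methodology) that reproduces the posteriors above:

```python
# Minimal sketch: normalize prior × likelihood over competing hypotheses.
# Priors and likelihoods are the illustrative values from the tables above.
priors = {"H1: Payment bug": 0.30, "H2: DB timeout": 0.25,
          "H3: API outage": 0.20, "H4: DDoS": 0.10, "H5: Other": 0.15}
likelihoods = {"H1: Payment bug": 0.80, "H2: DB timeout": 0.30,
               "H3: API outage": 0.70, "H4: DDoS": 0.50, "H5: Other": 0.20}

def posteriors(priors, likelihoods):
    """Return P(H|E) for each hypothesis given P(H) and P(E|H)."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    marginal = sum(unnormalized.values())          # P(E) = Σ P(E|Hi) × P(Hi)
    return {h: u / marginal for h, u in unnormalized.items()}

for h, p in posteriors(priors, likelihoods).items():
    print(f"{h}: {p:.1%}")   # 44.9%, 14.0%, 26.2%, 9.3%, 5.6%
```
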
---

## Sequential Evidence Updates

### Problem: Multiple Pieces of Evidence Over Time

Evidence arrives in stages, so the belief needs to be updated sequentially.

**Example:**
- Evidence 1: "500 errors on payment endpoint" (t=0)
- Evidence 2: "External API status page shows outage" (t+5 min)
- Evidence 3: "Our other services using same API also failing" (t+10 min)

### Approach: Chain Updates (Prior → E1 → E2 → E3)

**Step 1: Update with E1** (as above)
```
Prior → P(H|E1)
P(H1|E1) = 44.9% (payment bug)
P(H3|E1) = 26.2% (API outage)
```

**Step 2: Use posterior as new prior, update with E2**

Evidence E2: "External API status page shows outage"

New prior (from E1 posterior, tracking only the two leading hypotheses from here on):
- P(H1) = 0.449 (payment bug)
- P(H3) = 0.262 (API outage)

Likelihoods given E2:
- P(E2|H1) = 0.20 (bug wouldn't cause external API status change)
- P(E2|H3) = 0.95 (API outage would definitely show on status page)

```
P(E2) = (0.20 × 0.449) + (0.95 × 0.262) = 0.0898 + 0.2489 = 0.3387

P(H1|E1,E2) = (0.20 × 0.449) / 0.3387 = 0.265 (26.5%)
P(H3|E1,E2) = (0.95 × 0.262) / 0.3387 = 0.735 (73.5%)
```

**Interpretation**: E2 strongly favors H3 (API outage). H1 dropped from 44.9% → 26.5%.

**Step 3: Update with E3**

Evidence E3: "Other services using same API also failing"

New prior:
- P(H1) = 0.265
- P(H3) = 0.735

Likelihoods:
- P(E3|H1) = 0.10 (payment bug wouldn't affect other services)
- P(E3|H3) = 0.99 (API outage would affect all services)

```
P(E3) = (0.10 × 0.265) + (0.99 × 0.735) = 0.0265 + 0.7277 = 0.7542

P(H1|E1,E2,E3) = (0.10 × 0.265) / 0.7542 = 0.035 (3.5%)
P(H3|E1,E2,E3) = (0.99 × 0.735) / 0.7542 = 0.965 (96.5%)
```

**Final conclusion**: 96.5% confidence it's an API outage, not a payment bug.

**Summary of belief evolution:**
```
            Prior    After E1    After E2    After E3
H1 (Bug):   30%   →  44.9%    →  26.5%    →  3.5%
H3 (API):   20%   →  26.2%    →  73.5%    →  96.5%
```

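
The chained updates can be scripted by reusing each posterior as the next prior. A minimal sketch, restricted to the two leading hypotheses as in the walkthrough above:

```python
# Minimal sketch: chain Bayesian updates, reusing each posterior as the next prior.
# Restricted to H1 (payment bug) and H3 (API outage), so the E1 numbers below are the
# renormalized versions of the 44.9% / 26.2% from the full five-hypothesis table.

def update(prior, likelihood):
    """One Bayesian update over a dict of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    marginal = sum(unnormalized.values())
    return {h: u / marginal for h, u in unnormalized.items()}

belief = {"H1: bug": 0.30 / 0.50, "H3: api": 0.20 / 0.50}  # priors renormalized to two hypotheses

evidence = [
    {"H1: bug": 0.80, "H3: api": 0.70},   # E1: 500 errors on payment endpoint
    {"H1: bug": 0.20, "H3: api": 0.95},   # E2: external API status page shows outage
    {"H1: bug": 0.10, "H3: api": 0.99},   # E3: other services on the same API failing
]

for i, lik in enumerate(evidence, start=1):
    belief = update(belief, lik)
    print(f"After E{i}:", {h: f"{p:.1%}" for h, p in belief.items()})
# After E1: ≈63.2% / 36.8%; after E2: ≈26.5% / 73.5%; after E3: ≈3.5% / 96.5%
```
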
---

## Handling Dependent Evidence

### Problem: Evidence Items Not Independent

**Naive approach fails when**:
- E1 and E2 are correlated (not independent)
- Updating twice with the same information

**Example of dependent evidence:**
- E1: "User reports payment failing"
- E2: "Another user reports payment failing"

If E1 and E2 are from the same incident, they're not independent evidence!

### Solution 1: Treat as Single Evidence

If evidence is dependent, combine it into one update:

**Instead of:**
- Update with E1: "User reports payment failing"
- Update with E2: "Another user reports payment failing"

**Do:**
- Single update with E: "Multiple users report payment failing"

Likelihood (used in the sketch below):
- P(E|Bug) = 0.90 (if bug exists, multiple users affected)
- P(E|No bug) = 0.05 (false reports rare)

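
A worked version of this single combined update, as a short Python sketch; the 0.30 prior on a bug is an assumed value for illustration, while the two likelihoods are the ones above:

```python
# Single combined update for dependent reports, treated as one piece of evidence.
# ASSUMPTION for illustration: prior P(Bug) = 0.30; the likelihoods are from the bullets above.
p_bug = 0.30
p_e_given_bug = 0.90      # P("multiple users report failures" | Bug)
p_e_given_no_bug = 0.05   # P("multiple users report failures" | No bug)

marginal = p_e_given_bug * p_bug + p_e_given_no_bug * (1 - p_bug)
posterior_bug = p_e_given_bug * p_bug / marginal
print(f"P(Bug | combined reports) = {posterior_bug:.1%}")  # ≈ 88.5%
```

For comparison, updating twice with the two reports as if they were independent would push the posterior to roughly 99%, overstating what was actually learned.
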
### Solution 2: Conditional Likelihoods

If evidence is conditionally dependent (E2 depends on E1), use:

```
P(H|E1,E2) ∝ P(E2|E1,H) × P(E1|H) × P(H)
```

Not:
```
P(H|E1,E2) ≠ P(E2|H) × P(E1|H) × P(H) ← Assumes independence
```

**Example:**
- E1: "Test fails on staging"
- E2: "Test fails on production" (same test, likely same cause)

Conditional likelihoods (see the sketch below):
- P(E2|E1, Bug) = 0.95 (if staging failed due to a bug, production likely fails too)
- P(E2|E1, Env) = 0.20 (if staging failed due to its environment, production runs in a different environment and is less likely to fail for the same reason)

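
A sketch of the conditional version of the update. The priors and P(E1|H) values below are assumed purely for illustration; only the two conditional likelihoods P(E2|E1,H) come from the example above:

```python
# Conditional-likelihood update: P(H|E1,E2) ∝ P(E2|E1,H) × P(E1|H) × P(H)
# ASSUMPTIONS for illustration: the priors and P(E1|H) values are made up;
# only P(E2|E1,H) = 0.95 / 0.20 come from the example above.
hypotheses = {
    #          P(H)   P(E1|H)  P(E2|E1,H)
    "Bug":    (0.50,  0.70,    0.95),
    "Env":    (0.50,  0.60,    0.20),
}

unnormalized = {h: p * e1 * e2 for h, (p, e1, e2) in hypotheses.items()}
total = sum(unnormalized.values())
for h, u in unnormalized.items():
    print(f"P({h} | E1, E2) = {u / total:.1%}")
# Multiplying by an unconditional P(E2|H) instead would double-count the
# information shared between the staging and production failures.
```
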
### Red Flags for Dependent Evidence

Watch out for:
- Multiple reports of the same incident (count as one)
- Cascading failures (downstream failure caused by an upstream one)
- Repeated measurements of the same thing (not new information)
- Evidence from the same source (correlated errors)

**Principle**: Each update should provide **new information**. If E2 is predictable from E1, it's not independent.

---

## Calibration Techniques

### Problem: Over/Underconfidence Bias

Common patterns:
- **Overconfidence**: Stating 90% when the true rate is 70%
- **Underconfidence**: Stating 60% when the true rate is 80%

### Calibration Check: Track Predictions Over Time

**Method**:
1. Make many probabilistic forecasts (P=70%, P=40%, etc.)
2. Track outcomes
3. Group by confidence level
4. Compare stated probability to actual frequency

**Example calibration check:**

| Your Confidence | # Predictions | # Correct | Actual % | Calibration |
|-----------------|---------------|-----------|----------|-------------|
| 90-100% | 20 | 16 | 80% | Overconfident |
| 70-89% | 30 | 24 | 80% | Good |
| 50-69% | 25 | 14 | 56% | Good |
| 30-49% | 15 | 5 | 33% | Good |
| 0-29% | 10 | 2 | 20% | Good |

**Interpretation**: Overconfident at high confidence levels (saying 90% but only 80% correct).

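
A check like the table above can be generated automatically from a forecast log. A minimal sketch; the sample log is made-up data for illustration:

```python
# Minimal calibration check: bucket forecasts by stated confidence and compare
# the stated probability with the observed frequency of correct outcomes.
# The sample log is made-up data for illustration.
forecasts = [  # (stated probability, outcome: 1 = happened, 0 = did not)
    (0.95, 1), (0.90, 0), (0.92, 1), (0.80, 1), (0.75, 1),
    (0.70, 0), (0.60, 1), (0.55, 0), (0.40, 0), (0.20, 0),
]

buckets = {"90-100%": (0.90, 1.01), "70-89%": (0.70, 0.90),
           "50-69%": (0.50, 0.70), "30-49%": (0.30, 0.50), "0-29%": (0.00, 0.30)}

for label, (lo, hi) in buckets.items():
    in_bucket = [(p, o) for p, o in forecasts if lo <= p < hi]
    if not in_bucket:
        continue
    actual = sum(o for _, o in in_bucket) / len(in_bucket)
    stated = sum(p for p, _ in in_bucket) / len(in_bucket)
    print(f"{label}: n={len(in_bucket)}, stated avg={stated:.0%}, actual={actual:.0%}")
```
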
### Calibration Curve

Plot stated probability vs. actual frequency:

```
Actual %
100 ┤                                 ●
    │                                ●
 80 ┤                              ●   (overconfident)
    │                         ●
 60 ┤                    ●
    │               ●
 40 ┤          ●
    │
 20 ┤
    │
  0 └──────────────────────────────────
    0     20     40     60     80    100
                Stated probability %

Perfect calibration = points on the diagonal
Points below the diagonal = overconfident (actual < stated)
Points above the diagonal = underconfident (actual > stated)
```

### Debiasing Techniques

**For overconfidence:**
- Consider alternative explanations (how could I be wrong?)
- Base rate check (what's the typical success rate?)
- Pre-mortem: "It's 6 months from now and we failed. Why?"
- Confidence intervals: State a range, not a point estimate

**For underconfidence:**
- Review past successes (build evidence for confidence)
- Test predictions: Am I systematically too cautious?
- Cost of inaction: What's the cost of waiting for certainty?

### Brier Score (Calibration Metric)

**Formula**:
```
Brier Score = (1/n) × Σ (pᵢ - oᵢ)²

pᵢ = stated probability for prediction i
oᵢ = actual outcome (1 if it happened, 0 if not)
```

**Example:**
```
Prediction 1: P=0.8, Outcome=1 → (0.8-1)² = 0.04
Prediction 2: P=0.6, Outcome=0 → (0.6-0)² = 0.36
Prediction 3: P=0.9, Outcome=1 → (0.9-1)² = 0.01

Brier Score = (0.04 + 0.36 + 0.01) / 3 = 0.137
```

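
The same arithmetic as a small Python helper, using the three predictions from the example:

```python
# Brier score: mean squared difference between stated probabilities and outcomes.
def brier_score(predictions):
    """predictions: list of (stated_probability, outcome) with outcome in {0, 1}."""
    return sum((p - o) ** 2 for p, o in predictions) / len(predictions)

# The three predictions from the example above.
print(round(brier_score([(0.8, 1), (0.6, 0), (0.9, 1)]), 3))  # 0.137
```
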
**Interpretation:**
- 0.00 = perfect calibration
- 0.25 = random guessing (always stating 50% on yes/no outcomes)
- Lower is better

**Typical scores:**
- Expert forecasters: 0.10-0.15
- Average people: 0.20-0.25
- Aim for: <0.15

---

## Advanced Applications

### Application 1: Hierarchical Priors

Use when you're uncertain about the prior itself.

**Example:**
- Question: "Will the project finish on time?"
- Uncertain about the base rate: "Do similar projects finish on time 30% or 60% of the time?"

**Approach**: Model the uncertainty in the prior

```
P(On time) = Weighted average of different base rates

Scenario 1: Base rate = 30% (if similar to past failures), Weight = 40%
Scenario 2: Base rate = 50% (if an average project), Weight = 30%
Scenario 3: Base rate = 70% (if similar to past successes), Weight = 30%

Prior P(On time) = (0.30 × 0.40) + (0.50 × 0.30) + (0.70 × 0.30)
                 = 0.12 + 0.15 + 0.21
                 = 0.48 (48%)
```

Then update this 48% prior with evidence.

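
A sketch of the mixture prior followed by an ordinary update; the evidence and its likelihoods in the second half are assumed values for illustration only:

```python
# Hierarchical prior as a weighted mixture of candidate base rates.
scenarios = [  # (base rate, weight) — weights must sum to 1
    (0.30, 0.40),   # similar to past failures
    (0.50, 0.30),   # average project
    (0.70, 0.30),   # similar to past successes
]
prior_on_time = sum(rate * weight for rate, weight in scenarios)
print(f"Prior P(on time) = {prior_on_time:.0%}")   # 48%

# Then update as usual. ASSUMPTION for illustration: evidence "first milestone slipped"
# with P(E | on time) = 0.30 and P(E | late) = 0.70.
p_e_on_time, p_e_late = 0.30, 0.70
marginal = p_e_on_time * prior_on_time + p_e_late * (1 - prior_on_time)
print(f"P(on time | milestone slipped) = {p_e_on_time * prior_on_time / marginal:.0%}")  # ≈ 28%
```
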
### Application 2: Bayesian Model Comparison

Compare which model/theory better explains the data.

**Example:**
- Model A: "Bug in feature X"
- Model B: "Infrastructure issue"

Evidence: 10 data points

```
P(Model A|Data) / P(Model B|Data) = [P(Data|Model A) × P(Model A)] / [P(Data|Model B) × P(Model B)]
```

**Bayes Factor** = P(Data|Model A) / P(Data|Model B) (a minimal worked computation is sketched after the thresholds below)

- BF > 10: Strong evidence for Model A
- BF = 3-10: Moderate evidence for Model A
- BF = 1: Equal evidence
- BF < 0.33: Evidence against Model A

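
A minimal sketch of such a comparison. It assumes, purely for illustration, that the 10 data points are error/no-error observations and that each model predicts a fixed error rate; neither the rates nor the data come from the example above:

```python
import math

# Bayes factor sketch. ASSUMPTIONS for illustration: the 10 data points are
# error/no-error observations (1 = request failed), and each model predicts a
# fixed failure rate. Neither the rates nor the data come from the text above.
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]          # 7 failures out of 10
rate_a = 0.70   # Model A ("bug in feature X"): most requests to the feature fail
rate_b = 0.10   # Model B ("infrastructure issue"): occasional failures everywhere

def likelihood(rate, data):
    """P(data | model) for independent yes/no observations."""
    return math.prod(rate if x else (1 - rate) for x in data)

bf = likelihood(rate_a, data) / likelihood(rate_b, data)
print(f"Bayes factor (A vs B) = {bf:,.0f}")     # BF >> 10 → strong evidence for Model A
```
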
### Application 3: Forecasting with Bayesian Updates

Use for repeated forecasting (elections, product launches, project timelines).

**Process:**
1. Start with a base-rate prior
2. Update weekly as new evidence arrives
3. Track belief evolution over time
4. Compare the final forecast to the outcome (calibration check)

**Example: Product launch success**

```
Week -8: Prior = 60% (base rate for similar launches)
Week -6: Beta feedback positive → Update to 70%
Week -4: Competitor launches similar product → Update to 55%
Week -2: Pre-orders exceed target → Update to 75%
Week  0: Launch → Actual success: Yes ✓

Forecast evolution: 60% → 70% → 55% → 75% → (Outcome: Yes)
```

**Calibration**: The final forecast was 75% and the outcome was Yes. Calibration is good if roughly 7-8 out of every 10 forecasts made at 75% confidence come true.

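
One way to sanity-check a forecast path like this is to back out the likelihood ratio each update implicitly assigned to its evidence, using the odds form of Bayes' theorem. A small sketch over the path above:

```python
# Back out the implied likelihood ratio of each weekly update from the forecast path,
# using the odds form of Bayes' theorem: posterior_odds = likelihood_ratio × prior_odds.
path = [0.60, 0.70, 0.55, 0.75]   # the forecast evolution from the example above

def odds(p):
    return p / (1 - p)

for before, after in zip(path, path[1:]):
    lr = odds(after) / odds(before)
    print(f"{before:.0%} → {after:.0%}: implied likelihood ratio ≈ {lr:.2f}")
# Ratios far from 1 mean the update treated that week's evidence as very strong;
# if that seems unjustified in hindsight, the step was probably an overreaction.
```
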
---

## Quality Checklist for Complex Cases

**Multiple Hypotheses**:
- [ ] All hypotheses listed (including a catch-all "Other")
- [ ] Priors sum to 1.0
- [ ] Likelihoods defined for all hypothesis-evidence pairs
- [ ] Posteriors sum to 1.0 (math check)
- [ ] Interpretation provided (which hypothesis is favored? by how much?)

**Sequential Updates**:
- [ ] Evidence items clearly independent, or conditional dependence noted
- [ ] Each update uses the previous posterior as the new prior
- [ ] Belief evolution tracked (how beliefs changed over time)
- [ ] Final conclusion integrates all evidence
- [ ] Timeline shows when each piece of evidence arrived

**Calibration**:
- [ ] Considered alternative explanations (not overconfident?)
- [ ] Checked against base rates (not ignoring priors?)
- [ ] Stated a confidence interval or range (not just a point estimate)
- [ ] Identified assumptions that could make the forecast wrong
- [ ] Planned follow-up to track calibration (compare forecast to outcome)

**Minimum Standard for Complex Cases**:
- Multiple hypotheses: Score ≥ 4.0 on the rubric
- High-stakes forecasts: Track calibration over 10+ predictions