# Bayesian Reasoning & Calibration Methodology
## Bayesian Reasoning Workflow
Copy this checklist and track your progress:
```
Bayesian Reasoning Progress:
- [ ] Step 1: Define hypotheses and assign priors
- [ ] Step 2: Assign likelihoods for each evidence-hypothesis pair
- [ ] Step 3: Compute posteriors and update sequentially
- [ ] Step 4: Check for dependent evidence and adjust
- [ ] Step 5: Validate calibration and check for bias
```
**Step 1: Define hypotheses and assign priors**
List all competing hypotheses (including catch-all "Other") and assign prior probabilities that sum to 1.0. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for structuring priors with justifications.
**Step 2: Assign likelihoods for each evidence-hypothesis pair**
For each hypothesis, define P(Evidence|Hypothesis) based on how likely the evidence would be if that hypothesis were true. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for likelihood assessment techniques.
**Step 3: Compute posteriors and update sequentially**
Calculate posterior probabilities using Bayes' theorem, then chain updates for sequential evidence. See [Sequential Evidence Updates](#sequential-evidence-updates) for multi-stage updating process.
**Step 4: Check for dependent evidence and adjust**
Identify whether evidence items are independent or correlated, and use conditional likelihoods if needed. See [Handling Dependent Evidence](#handling-dependent-evidence) for dependence detection and correction.
**Step 5: Validate calibration and check for bias**
Track forecasts over time and compare stated probabilities to actual outcomes using calibration curves and Brier scores. See [Calibration Techniques](#calibration-techniques) for debiasing methods and metrics.
---
## Multiple Hypothesis Updates
### Problem: Choosing Among Many Hypotheses
Often you have 3+ competing hypotheses and need to update all simultaneously.
**Example:**
- H1: Bug in payment processor code
- H2: Database connection timeout
- H3: Third-party API outage
- H4: DDoS attack
### Approach: Compute Posterior for Each Hypothesis
**Step 1: Assign prior probabilities** (must sum to 1)
| Hypothesis | Prior P(H) | Justification |
|------------|-----------|---------------|
| H1: Payment bug | 0.30 | Common issue, recent deploy |
| H2: DB timeout | 0.25 | Has happened before |
| H3: API outage | 0.20 | Dependency on external service |
| H4: DDoS | 0.10 | Rare but possible |
| H5: Other | 0.15 | Catch-all for unknowns |
| **Total** | **1.00** | Must sum to 1 |
**Step 2: Define likelihood for each hypothesis**
Evidence E: "500 errors only on payment endpoint"
| Hypothesis | P(E\|H) | Justification |
|------------|---------|---------------|
| H1: Payment bug | 0.80 | Bug would affect payment specifically |
| H2: DB timeout | 0.30 | Would affect multiple endpoints |
| H3: API outage | 0.70 | Payment uses external API |
| H4: DDoS | 0.50 | Could target any endpoint |
| H5: Other | 0.20 | Generic catch-all |
**Step 3: Compute P(E)** (marginal probability)
```
P(E) = Σ [P(E|Hi) × P(Hi)] for all hypotheses
P(E) = (0.80 × 0.30) + (0.30 × 0.25) + (0.70 × 0.20) + (0.50 × 0.10) + (0.20 × 0.15)
P(E) = 0.24 + 0.075 + 0.14 + 0.05 + 0.03
P(E) = 0.535
```
**Step 4: Compute posterior for each hypothesis**
```
P(Hi|E) = [P(E|Hi) × P(Hi)] / P(E)
P(H1|E) = (0.80 × 0.30) / 0.535 = 0.24 / 0.535 = 0.449 (44.9%)
P(H2|E) = (0.30 × 0.25) / 0.535 = 0.075 / 0.535 = 0.140 (14.0%)
P(H3|E) = (0.70 × 0.20) / 0.535 = 0.14 / 0.535 = 0.262 (26.2%)
P(H4|E) = (0.50 × 0.10) / 0.535 = 0.05 / 0.535 = 0.093 (9.3%)
P(H5|E) = (0.20 × 0.15) / 0.535 = 0.03 / 0.535 = 0.056 (5.6%)
Total: 100% (check: posteriors must sum to 1)
```
**Interpretation:**
- H1 (Payment bug): 30% → 44.9% (increased 15 pp)
- H3 (API outage): 20% → 26.2% (increased 6 pp)
- H2 (DB timeout): 25% → 14.0% (decreased 11 pp)
- H4 (DDoS): 10% → 9.3% (barely changed)
**Decision**: Investigate H1 (payment bug) first, then H3 (API outage) as backup.
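The same computation can be scripted for any number of hypotheses. A minimal Python sketch (the numbers mirror the tables above; function and variable names are illustrative, not part of the methodology):
```python
def update(priors, likelihoods):
    """One Bayesian update over competing hypotheses.

    priors: dict hypothesis -> P(H); values should sum to 1.
    likelihoods: dict hypothesis -> P(E|H) for the observed evidence.
    Returns dict hypothesis -> P(H|E).
    """
    # Marginal probability of the evidence: P(E) = sum of P(E|Hi) * P(Hi)
    p_e = sum(likelihoods[h] * priors[h] for h in priors)
    # Posterior for each hypothesis: P(H|E) = P(E|H) * P(H) / P(E)
    return {h: likelihoods[h] * priors[h] / p_e for h in priors}

priors = {"H1 payment bug": 0.30, "H2 db timeout": 0.25,
          "H3 api outage": 0.20, "H4 ddos": 0.10, "H5 other": 0.15}
likelihoods = {"H1 payment bug": 0.80, "H2 db timeout": 0.30,
               "H3 api outage": 0.70, "H4 ddos": 0.50, "H5 other": 0.20}

posteriors = update(priors, likelihoods)
for h, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{h}: {p:.1%}")   # H1 44.9%, H3 26.2%, H2 14.0%, H4 9.3%, H5 5.6%
assert abs(sum(posteriors.values()) - 1.0) < 1e-9  # posteriors must sum to 1
```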
---
## Sequential Evidence Updates
### Problem: Multiple Pieces of Evidence Over Time
Evidence arrives in stages, so beliefs need to be updated sequentially.
**Example:**
- Evidence 1: "500 errors on payment endpoint" (t=0)
- Evidence 2: "External API status page shows outage" (t+5 min)
- Evidence 3: "Our other services using same API also failing" (t+10 min)
### Approach: Chain Updates (Prior → E1 → E2 → E3)
**Step 1: Update with E1** (as above)
```
Prior → P(H|E1)
P(H1|E1) = 44.9% (payment bug)
P(H3|E1) = 26.2% (API outage)
```
**Step 2: Use posterior as new prior, update with E2**
Evidence E2: "External API status page shows outage"
New prior (from the E1 posteriors; for simplicity, the remaining updates track only the two leading hypotheses, so dividing by P(E2) renormalizes over this pair):
- P(H1) = 0.449 (payment bug)
- P(H3) = 0.262 (API outage)
Likelihoods given E2:
- P(E2|H1) = 0.20 (bug wouldn't cause external API status change)
- P(E2|H3) = 0.95 (API outage would definitely show on status page)
```
P(E2) = (0.20 × 0.449) + (0.95 × 0.262) = 0.0898 + 0.2489 = 0.3387
P(H1|E1,E2) = (0.20 × 0.449) / 0.3387 = 0.265 (26.5%)
P(H3|E1,E2) = (0.95 × 0.262) / 0.3387 = 0.735 (73.5%)
```
**Interpretation**: E2 strongly favors H3 (API outage). H1 dropped from 44.9% → 26.5%, while H3 rose from 26.2% → 73.5%.
**Step 3: Update with E3**
Evidence E3: "Other services using same API also failing"
New prior:
- P(H1) = 0.265
- P(H3) = 0.735
Likelihoods:
- P(E3|H1) = 0.10 (payment bug wouldn't affect other services)
- P(E3|H3) = 0.99 (API outage would affect all services)
```
P(E3) = (0.10 × 0.265) + (0.99 × 0.735) = 0.0265 + 0.7277 = 0.7542
P(H1|E1,E2,E3) = (0.10 × 0.265) / 0.7542 = 0.035 (3.5%)
P(H3|E1,E2,E3) = (0.99 × 0.735) / 0.7542 = 0.965 (96.5%)
```
**Final conclusion**: 96.5% confidence it's an API outage, not a payment bug.
**Summary of belief evolution:**
| Hypothesis | Prior | After E1 | After E2 | After E3 |
|------------|-------|----------|----------|----------|
| H1 (Payment bug) | 30% | 44.9% | 26.5% | 3.5% |
| H3 (API outage) | 20% | 26.2% | 73.5% | 96.5% |
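Sequential updating is the same single-step computation applied repeatedly, with each posterior fed back in as the next prior. A short Python sketch under the two-hypothesis simplification used above (names are illustrative):
```python
def update(priors, likelihoods):
    """Single Bayesian update: posterior proportional to likelihood x prior, normalized."""
    p_e = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / p_e for h in priors}

# Posteriors after E1; the update renormalizes over this pair of hypotheses.
beliefs = {"H1 bug": 0.449, "H3 api": 0.262}
evidence_likelihoods = [
    {"H1 bug": 0.20, "H3 api": 0.95},   # E2: status page shows outage
    {"H1 bug": 0.10, "H3 api": 0.99},   # E3: other services on same API failing
]
for lik in evidence_likelihoods:
    beliefs = update(beliefs, lik)      # posterior becomes the next prior
    print({h: round(p, 3) for h, p in beliefs.items()})
# -> {'H1 bug': 0.265, 'H3 api': 0.735}
# -> {'H1 bug': 0.035, 'H3 api': 0.965}
```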
---
## Handling Dependent Evidence
### Problem: Evidence Items Not Independent
**The naive approach fails when**:
- E1 and E2 are correlated (not independent)
- Updating with both would double-count the same information
**Example of dependent evidence:**
- E1: "User reports payment failing"
- E2: "Another user reports payment failing"
If E1 and E2 stem from the same incident, they are not independent pieces of evidence!
### Solution 1: Treat as Single Evidence
If evidence is dependent, combine into one update:
**Instead of:**
- Update with E1: "User reports payment failing"
- Update with E2: "Another user reports payment failing"
**Do:**
- Single update with E: "Multiple users report payment failing"
Likelihood:
- P(E|Bug) = 0.90 (if bug exists, multiple users affected)
- P(E|No bug) = 0.05 (false reports rare)
### Solution 2: Conditional Likelihoods
If evidence is conditionally dependent (E2 depends on E1), use:
```
P(H|E1,E2) ∝ P(E2|E1,H) × P(E1|H) × P(H)
```
Not:
```
P(H|E1,E2) ≠ P(E2|H) × P(E1|H) × P(H) ← this factorization assumes E1 and E2 are conditionally independent given H
```
**Example:**
- E1: "Test fails on staging"
- E2: "Test fails on production" (same test, likely same cause)
Conditional:
- P(E2|E1, Bug) = 0.95 (if staging failed because of the bug, production is likely to fail the same way)
- P(E2|E1, Env) = 0.20 (if staging failed because of its environment, production runs in a different environment and is less likely to fail)
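A sketch of how the conditional form changes the update, using the staging/production example (the priors and the P(E1|H) values are illustrative assumptions, not taken from the tables above):
```python
# Two hypotheses for the test failure: caused by a bug vs. by the environment.
priors = {"bug": 0.5, "env": 0.5}          # illustrative priors

# E1: test fails on staging
p_e1 = {"bug": 0.9, "env": 0.7}            # P(E1|H), illustrative
# E2: the same test fails on production, *given* it already failed on staging
p_e2_given_e1 = {"bug": 0.95, "env": 0.20} # P(E2|E1,H), not P(E2|H)

# Unnormalized joint: P(E2|E1,H) x P(E1|H) x P(H)
joint = {h: p_e2_given_e1[h] * p_e1[h] * priors[h] for h in priors}
total = sum(joint.values())
posterior = {h: v / total for h, v in joint.items()}
print(posterior)  # bug ~ 0.86, env ~ 0.14
```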
### Red Flags for Dependent Evidence
Watch out for:
- Multiple reports of same incident (count as one)
- Cascading failures (downstream failure caused by upstream)
- Repeated measurements of same thing (not new info)
- Evidence from same source (correlated errors)
**Principle**: Each update should provide **new information**. If E2 is predictable from E1, it's not independent.
---
## Calibration Techniques
### Problem: Over/Underconfidence Bias
Common patterns:
- **Overconfidence**: Stating 90% when true rate is 70%
- **Underconfidence**: Stating 60% when true rate is 80%
### Calibration Check: Track Predictions Over Time
**Method**:
1. Make many probabilistic forecasts (P=70%, P=40%, etc.)
2. Track outcomes
3. Group by confidence level
4. Compare stated probability to actual frequency
**Example calibration check:**
| Your Confidence | # Predictions | # Correct | Actual % | Calibration |
|-----------------|---------------|-----------|----------|-------------|
| 90-100% | 20 | 16 | 80% | Overconfident |
| 70-89% | 30 | 24 | 80% | Good |
| 50-69% | 25 | 14 | 56% | Good |
| 30-49% | 15 | 5 | 33% | Good |
| 0-29% | 10 | 2 | 20% | Good |
**Interpretation**: Overconfident at high confidence levels (saying 90% but only 80% correct).
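One way to automate this check is to bucket logged forecasts by stated confidence and compare each bucket's average stated probability to its observed hit rate. A minimal sketch with a hypothetical forecast log and 20-point-wide buckets:
```python
# Each record: (stated probability, outcome as 1/0)
forecasts = [(0.9, 1), (0.95, 0), (0.8, 1), (0.7, 1), (0.6, 0), (0.55, 1), (0.3, 0)]

buckets = {}  # bucket lower bound -> list of records
for p, outcome in forecasts:
    lo = int(p * 100) // 20 * 20            # 20-point-wide confidence buckets
    buckets.setdefault(lo, []).append((p, outcome))

for lo in sorted(buckets, reverse=True):
    records = buckets[lo]
    stated = sum(p for p, _ in records) / len(records)   # average stated probability
    actual = sum(o for _, o in records) / len(records)   # observed frequency
    label = "overconfident" if stated > actual else "ok/underconfident"
    print(f"{lo}-{lo+19}%: stated avg {stated:.0%}, actual {actual:.0%} ({label})")
```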
### Calibration Curve
Plot stated probability vs actual frequency:
```
Actual %
100 ┤
 80 ┤                            ●   ← overconfident
 60 ┤                      ●
 40 ┤               ●
 20 ┤        ●
  0 └─────┬─────┬─────┬─────┬─────┬──
          20    40    60    80   100
               Stated probability %

Perfect calibration = points on the diagonal (actual = stated)
Points below the diagonal = overconfident (stated probability > actual frequency)
Points above the diagonal = underconfident (stated probability < actual frequency)
```
### Debiasing Techniques
**For overconfidence:**
- Consider alternative explanations (how could I be wrong?)
- Base rate check (what's the typical success rate?)
- Pre-mortem: "It's 6 months from now and we failed. Why?"
- Confidence intervals: State range, not point estimate
**For underconfidence:**
- Review past successes (build evidence for confidence)
- Test predictions: Am I systematically too cautious?
- Cost of inaction: What's the cost of waiting for certainty?
### Brier Score (Calibration Metric)
**Formula**:
```
Brier Score = (1/n) × Σ (p_i - o_i)²
p_i = stated probability for prediction i
o_i = actual outcome (1 if it happened, 0 if not)
```
**Example:**
```
Prediction 1: P=0.8, Outcome=1 → (0.8-1)² = 0.04
Prediction 2: P=0.6, Outcome=0 → (0.6-0)² = 0.36
Prediction 3: P=0.9, Outcome=1 → (0.9-1)² = 0.01
Brier Score = (0.04 + 0.36 + 0.01) / 3 = 0.137
```
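The same arithmetic as a runnable sketch, using the three predictions above:
```python
# (stated probability, outcome) pairs from the example above
predictions = [(0.8, 1), (0.6, 0), (0.9, 1)]

# Brier score: mean squared difference between stated probability and outcome
brier = sum((p - o) ** 2 for p, o in predictions) / len(predictions)
print(f"Brier score: {brier:.3f}")  # 0.137
```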
**Interpretation:**
- 0.00 = perfect calibration
- 0.25 = no better than always stating 50% (for binary outcomes)
- Lower is better
**Typical scores:**
- Expert forecasters: 0.10-0.15
- Average people: 0.20-0.25
- Aim for: <0.15
---
## Advanced Applications
### Application 1: Hierarchical Priors
When you're uncertain about the prior itself.
**Example:**
- Question: "Will project finish on time?"
- Uncertain about base rate: "Do similar projects finish on time 30% or 60% of the time?"
**Approach**: Model the uncertainty as a weighted mixture of candidate base rates (weights must sum to 1)
```
P(On time) = Weighted average of different base rates
Scenario 1: Base rate = 30% (if similar to past failures), Weight = 40%
Scenario 2: Base rate = 50% (if average project), Weight = 30%
Scenario 3: Base rate = 70% (if similar to past successes), Weight = 30%
Prior P(On time) = (0.30 × 0.40) + (0.50 × 0.30) + (0.70 × 0.30)
= 0.12 + 0.15 + 0.21
= 0.48 (48%)
```
Then update this 48% prior with evidence.
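A small sketch of the weighted-average step (the base rates and weights are the ones above; the weights must sum to 1):
```python
# (base rate, weight) for each scenario about which reference class applies
scenarios = [(0.30, 0.40), (0.50, 0.30), (0.70, 0.30)]
assert abs(sum(w for _, w in scenarios) - 1.0) < 1e-9  # weights must sum to 1

# Mixture prior: weighted average of the candidate base rates
prior_on_time = sum(rate * weight for rate, weight in scenarios)
print(f"Mixture prior P(on time) = {prior_on_time:.2f}")  # 0.48
```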
### Application 2: Bayesian Model Comparison
Compare which model/theory better explains data.
**Example:**
- Model A: "Bug in feature X"
- Model B: "Infrastructure issue"
Evidence: 10 data points
```
P(Model A|Data) / P(Model B|Data) = [P(Data|Model A) × P(Model A)] / [P(Data|Model B) × P(Model B)]
```
**Bayes Factor** = P(Data|Model A) / P(Data|Model B)
- BF > 10: Strong evidence for Model A
- BF = 3-10: Moderate evidence for Model A
- BF = 1-3: Weak evidence for Model A
- BF = 1: Equal evidence
- BF < 0.33: Evidence against Model A (favors Model B)
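A sketch of the comparison, assuming the 10 data points are independent given each model and each model assigns a per-point likelihood (all likelihood values below are illustrative):
```python
import math

# Per-data-point likelihoods under each model (illustrative values)
lik_a = [0.7, 0.8, 0.6, 0.75, 0.9, 0.65, 0.7, 0.8, 0.85, 0.6]   # P(data_i | Model A)
lik_b = [0.4, 0.5, 0.45, 0.5, 0.6, 0.4, 0.5, 0.55, 0.5, 0.45]   # P(data_i | Model B)

# Work in log space to avoid underflow, then exponentiate the difference.
log_bf = sum(map(math.log, lik_a)) - sum(map(math.log, lik_b))
bayes_factor = math.exp(log_bf)
print(f"Bayes factor (A vs B): {bayes_factor:.1f}")  # > 10 would be strong evidence for A
```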
### Application 3: Forecasting with Bayesian Updates
Use for repeated forecasting (elections, product launches, project timelines).
**Process:**
1. Start with base rate prior
2. Update weekly as new evidence arrives
3. Track belief evolution over time
4. Compare final forecast to outcome (calibration check)
**Example: Product launch success**
```
Week -8: Prior = 60% (base rate for similar launches)
Week -6: Beta feedback positive → Update to 70%
Week -4: Competitor launches similar product → Update to 55%
Week -2: Pre-orders exceed target → Update to 75%
Week 0: Launch → Actual success: Yes ✓
Forecast evolution: 60% → 70% → 55% → 75% → (Outcome: Yes)
```
**Calibration**: The final forecast was 75% and the outcome was Yes. A single outcome cannot confirm calibration; across many forecasts stated at 75%, roughly 7-8 out of 10 should come true.
---
## Quality Checklist for Complex Cases
**Multiple Hypotheses**:
- [ ] All hypotheses listed (including catch-all "Other")
- [ ] Priors sum to 1.0
- [ ] Likelihoods defined for all hypothesis-evidence pairs
- [ ] Posteriors sum to 1.0 (math check)
- [ ] Interpretation provided (which hypothesis favored? by how much?)
**Sequential Updates**:
- [ ] Evidence items clearly independent or conditional dependence noted
- [ ] Each update uses previous posterior as new prior
- [ ] Belief evolution tracked (how beliefs changed over time)
- [ ] Final conclusion integrates all evidence
- [ ] Timeline shows when each piece of evidence arrived
**Calibration**:
- [ ] Considered alternative explanations (not overconfident?)
- [ ] Checked against base rates (not ignoring priors?)
- [ ] Stated confidence interval or range (not just point estimate)
- [ ] Identified assumptions that could make forecast wrong
- [ ] Planned follow-up to track calibration (compare forecast to outcome)
**Minimum Standard for Complex Cases**:
- Multiple hypotheses: Score ≥ 4.0 on rubric
- High-stakes forecasts: Track calibration over 10+ predictions