# Bayesian Reasoning & Calibration Methodology
## Bayesian Reasoning Workflow
Copy this checklist and track your progress:
```
Bayesian Reasoning Progress:
- [ ] Step 1: Define hypotheses and assign priors
- [ ] Step 2: Assign likelihoods for each evidence-hypothesis pair
- [ ] Step 3: Compute posteriors and update sequentially
- [ ] Step 4: Check for dependent evidence and adjust
- [ ] Step 5: Validate calibration and check for bias
```
**Step 1: Define hypotheses and assign priors**
List all competing hypotheses (including catch-all "Other") and assign prior probabilities that sum to 1.0. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for structuring priors with justifications.
**Step 2: Assign likelihoods for each evidence-hypothesis pair**
For each hypothesis, define P(Evidence|Hypothesis) based on how likely the evidence would be if that hypothesis were true. See [Multiple Hypothesis Updates](#multiple-hypothesis-updates) for likelihood assessment techniques.
**Step 3: Compute posteriors and update sequentially**
Calculate posterior probabilities using Bayes' theorem, then chain updates for sequential evidence. See [Sequential Evidence Updates](#sequential-evidence-updates) for multi-stage updating process.
**Step 4: Check for dependent evidence and adjust**
Identify whether evidence items are independent or correlated, and use conditional likelihoods if needed. See [Handling Dependent Evidence](#handling-dependent-evidence) for dependence detection and correction.
**Step 5: Validate calibration and check for bias**
Track forecasts over time and compare stated probabilities to actual outcomes using calibration curves and Brier scores. See [Calibration Techniques](#calibration-techniques) for debiasing methods and metrics.
---
## Multiple Hypothesis Updates
### Problem: Choosing Among Many Hypotheses
Often you have 3+ competing hypotheses and need to update all simultaneously.
**Example:**
- H1: Bug in payment processor code
- H2: Database connection timeout
- H3: Third-party API outage
- H4: DDoS attack
### Approach: Compute Posterior for Each Hypothesis
**Step 1: Assign prior probabilities** (must sum to 1)
| Hypothesis | Prior P(H) | Justification |
|------------|-----------|---------------|
| H1: Payment bug | 0.30 | Common issue, recent deploy |
| H2: DB timeout | 0.25 | Has happened before |
| H3: API outage | 0.20 | Dependency on external service |
| H4: DDoS | 0.10 | Rare but possible |
| H5: Other | 0.15 | Catch-all for unknowns |
| **Total** | **1.00** | Must sum to 1 |
**Step 2: Define likelihood for each hypothesis**
Evidence E: "500 errors only on payment endpoint"
| Hypothesis | P(E\|H) | Justification |
|------------|---------|---------------|
| H1: Payment bug | 0.80 | Bug would affect payment specifically |
| H2: DB timeout | 0.30 | Would affect multiple endpoints |
| H3: API outage | 0.70 | Payment uses external API |
| H4: DDoS | 0.50 | Could target any endpoint |
| H5: Other | 0.20 | Generic catch-all |
**Step 3: Compute P(E)** (marginal probability)
```
P(E) = Σ [P(E|Hi) × P(Hi)] for all hypotheses
P(E) = (0.80 × 0.30) + (0.30 × 0.25) + (0.70 × 0.20) + (0.50 × 0.10) + (0.20 × 0.15)
P(E) = 0.24 + 0.075 + 0.14 + 0.05 + 0.03
P(E) = 0.535
```
**Step 4: Compute posterior for each hypothesis**
```
P(Hi|E) = [P(E|Hi) × P(Hi)] / P(E)
P(H1|E) = (0.80 × 0.30) / 0.535 = 0.24 / 0.535 = 0.449 (44.9%)
P(H2|E) = (0.30 × 0.25) / 0.535 = 0.075 / 0.535 = 0.140 (14.0%)
P(H3|E) = (0.70 × 0.20) / 0.535 = 0.14 / 0.535 = 0.262 (26.2%)
P(H4|E) = (0.50 × 0.10) / 0.535 = 0.05 / 0.535 = 0.093 (9.3%)
P(H5|E) = (0.20 × 0.15) / 0.535 = 0.03 / 0.535 = 0.056 (5.6%)
Total: 100% (check: posteriors must sum to 1)
```
**Interpretation:**
- H1 (Payment bug): 30% → 44.9% (increased 15 pp)
- H3 (API outage): 20% → 26.2% (increased 6 pp)
- H2 (DB timeout): 25% → 14.0% (decreased 11 pp)
- H4 (DDoS): 10% → 9.3% (barely changed)
**Decision**: Investigate H1 (payment bug) first, then H3 (API outage) as backup.
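This computation is easy to script. Below is a minimal Python sketch of the update; the hypothesis names, priors, and likelihoods are the illustrative values from the tables above, not measured data.
```
# Multi-hypothesis update: posterior P(H|E) ∝ P(E|H) × P(H).
# Values mirror the worked example above (illustrative, not measured data).

priors = {
    "H1: Payment bug": 0.30,
    "H2: DB timeout":  0.25,
    "H3: API outage":  0.20,
    "H4: DDoS":        0.10,
    "H5: Other":       0.15,
}

# P(E|H) for evidence E = "500 errors only on payment endpoint"
likelihoods = {
    "H1: Payment bug": 0.80,
    "H2: DB timeout":  0.30,
    "H3: API outage":  0.70,
    "H4: DDoS":        0.50,
    "H5: Other":       0.20,
}

def update(priors, likelihoods):
    """Return posteriors P(H|E) for every hypothesis."""
    p_e = sum(likelihoods[h] * p for h, p in priors.items())   # marginal P(E)
    return {h: likelihoods[h] * p / p_e for h, p in priors.items()}

posteriors = update(priors, likelihoods)
for h, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{h:16s}  {priors[h]:.2f} -> {p:.3f}")
print("check: posteriors sum to", round(sum(posteriors.values()), 6))   # 1.0
```
Running this reproduces the posteriors above (H1 ≈ 0.449, H3 ≈ 0.262, and so on), with the sum-to-1 sanity check built in.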
---
## Sequential Evidence Updates
### Problem: Multiple Pieces of Evidence Over Time
Evidence arrives in stages, so beliefs must be updated sequentially.
**Example:**
- Evidence 1: "500 errors on payment endpoint" (t=0)
- Evidence 2: "External API status page shows outage" (t+5 min)
- Evidence 3: "Our other services using same API also failing" (t+10 min)
### Approach: Chain Updates (Prior → E1 → E2 → E3)
**Step 1: Update with E1** (as above)
```
Prior → P(H|E1)
P(H1|E1) = 44.9% (payment bug)
P(H3|E1) = 26.2% (API outage)
```
**Step 2: Use posterior as new prior, update with E2**
Evidence E2: "External API status page shows outage"
New prior (from the E1 posteriors; for simplicity only the two leading hypotheses are carried forward, and the update below renormalizes over these two):
- P(H1) = 0.449 (payment bug)
- P(H3) = 0.262 (API outage)
Likelihoods given E2:
- P(E2|H1) = 0.20 (bug wouldn't cause external API status change)
- P(E2|H3) = 0.95 (API outage would definitely show on status page)
```
P(E2) = (0.20 × 0.449) + (0.95 × 0.262) = 0.0898 + 0.2489 = 0.3387
P(H1|E1,E2) = (0.20 × 0.449) / 0.3387 = 0.265 (26.5%)
P(H3|E1,E2) = (0.95 × 0.262) / 0.3387 = 0.735 (73.5%)
```
**Interpretation**: E2 strongly favors H3 (API outage). H1 dropped from 44.9% → 26.5%.
**Step 3: Update with E3**
Evidence E3: "Other services using same API also failing"
New prior:
- P(H1) = 0.265
- P(H3) = 0.735
Likelihoods:
- P(E3|H1) = 0.10 (payment bug wouldn't affect other services)
- P(E3|H3) = 0.99 (API outage would affect all services)
```
P(E3) = (0.10 × 0.265) + (0.99 × 0.735) = 0.0265 + 0.7277 = 0.7542
P(H1|E1,E2,E3) = (0.10 × 0.265) / 0.7542 = 0.035 (3.5%)
P(H3|E1,E2,E3) = (0.99 × 0.735) / 0.7542 = 0.965 (96.5%)
```
**Final conclusion**: 96.5% confidence that it is an API outage, not a payment bug.
**Summary of belief evolution:**
```
              Prior     After E1    After E2    After E3
H1 (Bug):      30%   →   44.9%   →   26.5%   →    3.5%
H3 (API):      20%   →   26.2%   →   73.5%   →   96.5%
```
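The chain is just the same update applied repeatedly, with each posterior feeding in as the next prior. A minimal Python sketch, picking up from the E1 posteriors for H1 and H3 above (the other hypotheses are dropped for simplicity, so the update renormalizes over these two):
```
# Sequential updating: each posterior becomes the prior for the next update.
# Restricted to H1 (payment bug) vs H3 (API outage), as in the example above;
# update() renormalizes over whichever hypotheses it is given.

def update(priors, likelihoods):
    p_e = sum(likelihoods[h] * p for h, p in priors.items())   # marginal P(E)
    return {h: likelihoods[h] * p / p_e for h, p in priors.items()}

beliefs = {"H1": 0.449, "H3": 0.262}   # posteriors after E1, from the previous section

evidence = [
    ("E2: API status page shows outage",       {"H1": 0.20, "H3": 0.95}),
    ("E3: other services on same API failing", {"H1": 0.10, "H3": 0.99}),
]

for label, likelihoods in evidence:
    beliefs = update(beliefs, likelihoods)
    print(f"after {label}:  H1={beliefs['H1']:.3f}  H3={beliefs['H3']:.3f}")
# after E2: H1=0.265, H3=0.735; after E3: H1=0.035, H3=0.965
```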
---
## Handling Dependent Evidence
### Problem: Evidence Items Not Independent
**Naive approach fails when**:
- E1 and E2 are correlated (not independent)
- You end up updating twice on the same information (double counting)
**Example of dependent evidence:**
- E1: "User reports payment failing"
- E2: "Another user reports payment failing"
If E1 and E2 are from the same incident, they're not independent evidence!
### Solution 1: Treat as Single Evidence
If evidence is dependent, combine into one update:
**Instead of:**
- Update with E1: "User reports payment failing"
- Update with E2: "Another user reports payment failing"
**Do:**
- Single update with E: "Multiple users report payment failing"
Likelihood:
- P(E|Bug) = 0.90 (if bug exists, multiple users affected)
- P(E|No bug) = 0.05 (false reports rare)
### Solution 2: Conditional Likelihoods
If evidence is conditionally dependent (E2 depends on E1), use:
```
P(H|E1,E2) ∝ P(E2|E1,H) × P(E1|H) × P(H)
```
Not:
```
P(H|E1,E2) ∝ P(E2|H) × P(E1|H) × P(H)   ← wrong here: assumes E1 and E2 are independent given H
```
**Example:**
- E1: "Test fails on staging"
- E2: "Test fails on production" (same test, likely same cause)
Conditional:
- P(E2|E1, Bug) = 0.95 (if staging failed due to bug, production likely fails too)
- P(E2|E1, Env) = 0.20 (if staging failed because of its environment, production runs in a different environment and is less likely to fail for the same reason)
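A minimal Python sketch of this two-stage update. The 50/50 prior over Bug vs Env and the E1 likelihoods are illustrative assumptions; the E2 values are the conditional likelihoods from the example above.
```
# Two-stage update with dependent evidence: use P(E1|H) for the first update
# and the conditional P(E2|E1,H) for the second, to avoid double counting.
# The 50/50 prior and the E1 likelihoods are illustrative assumptions;
# the E2 values are the conditional likelihoods from the example above.

def update(priors, likelihoods):
    p_e = sum(likelihoods[h] * p for h, p in priors.items())
    return {h: likelihoods[h] * p / p_e for h, p in priors.items()}

beliefs = {"Bug": 0.50, "Env": 0.50}

# E1: "Test fails on staging" (assumed likelihoods)
beliefs = update(beliefs, {"Bug": 0.90, "Env": 0.60})

# E2: "Test fails on production", conditioned on E1 having happened
beliefs = update(beliefs, {"Bug": 0.95, "Env": 0.20})

print(beliefs)   # ≈ {'Bug': 0.88, 'Env': 0.12} under these assumptions
```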
### Red Flags for Dependent Evidence
Watch out for:
- Multiple reports of same incident (count as one)
- Cascading failures (downstream failure caused by upstream)
- Repeated measurements of same thing (not new info)
- Evidence from same source (correlated errors)
**Principle**: Each update should provide **new information**. If E2 is predictable from E1, it's not independent.
---
## Calibration Techniques
### Problem: Over/Underconfidence Bias
Common patterns:
- **Overconfidence**: Stating 90% when true rate is 70%
- **Underconfidence**: Stating 60% when true rate is 80%
### Calibration Check: Track Predictions Over Time
**Method**:
1. Make many probabilistic forecasts (P=70%, P=40%, etc.)
2. Track outcomes
3. Group by confidence level
4. Compare stated probability to actual frequency
**Example calibration check:**
| Your Confidence | # Predictions | # Correct | Actual % | Calibration |
|-----------------|---------------|-----------|----------|-------------|
| 90-100% | 20 | 16 | 80% | Overconfident |
| 70-89% | 30 | 24 | 80% | Good |
| 50-69% | 25 | 14 | 56% | Good |
| 30-49% | 15 | 5 | 33% | Good |
| 0-29% | 10 | 2 | 20% | Good |
**Interpretation**: Overconfident at high confidence levels (saying 90% but only 80% correct).
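A minimal Python sketch of this bucketing. The forecast log and the ±0.05 tolerance for calling a band "good" are illustrative assumptions.
```
# Calibration check: bucket forecasts by confidence band, then compare the
# average stated probability with the actual hit rate in each band.
# The forecast log and the ±0.05 "good" tolerance are illustrative assumptions.

forecasts = [   # (stated probability, outcome: 1 = happened, 0 = did not)
    (0.95, 1), (0.90, 1), (0.92, 0), (0.90, 1),
    (0.85, 1), (0.80, 1), (0.75, 0), (0.70, 1),
    (0.60, 1), (0.55, 0), (0.50, 1), (0.50, 0),
    (0.40, 0), (0.35, 1), (0.30, 0),
    (0.20, 0), (0.10, 0), (0.15, 1),
]

bands = [
    ("90-100%", 0.90, 1.01),
    ("70-89%",  0.70, 0.90),
    ("50-69%",  0.50, 0.70),
    ("30-49%",  0.30, 0.50),
    ("0-29%",   0.00, 0.30),
]

for name, lo, hi in bands:
    in_band = [(p, o) for p, o in forecasts if lo <= p < hi]
    if not in_band:
        continue
    stated = sum(p for p, _ in in_band) / len(in_band)
    actual = sum(o for _, o in in_band) / len(in_band)
    verdict = ("overconfident" if stated > actual + 0.05
               else "underconfident" if stated < actual - 0.05
               else "good")
    print(f"{name:>8}: n={len(in_band):2d}  stated={stated:.2f}  "
          f"actual={actual:.2f}  -> {verdict}")
```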
### Calibration Curve
Plot stated probability vs actual frequency:
```
Actual %
100 ┤
    │
 80 ┤                         ●   ●  ← overconfident (stated ~95%, actual 80%)
    │                      ●
 60 ┤                   ●
    │                ●
 40 ┤             ●
 20 ┤       ●
  0 └─┬─────┬─────┬─────┬─────┬─────┬
      0    20    40    60    80   100
            Stated probability %

Perfect calibration = points on the diagonal (actual = stated)
Below the diagonal  = overconfident (stated probability > actual frequency)
Above the diagonal  = underconfident (stated probability < actual frequency)
```
### Debiasing Techniques
**For overconfidence:**
- Consider alternative explanations (how could I be wrong?)
- Base rate check (what's the typical success rate?)
- Pre-mortem: "It's 6 months from now and we failed. Why?"
- Confidence intervals: State range, not point estimate
**For underconfidence:**
- Review past successes (build evidence for confidence)
- Test predictions: Am I systematically too cautious?
- Cost of inaction: What's the cost of waiting for certainty?
### Brier Score (Calibration Metric)
**Formula**:
```
Brier Score = (1/n) × Σ (p_i − o_i)²
p_i = stated probability for prediction i
o_i = actual outcome (1 if it happened, 0 if it did not)
```
**Example:**
```
Prediction 1: P=0.8, Outcome=1 → (0.8-1)² = 0.04
Prediction 2: P=0.6, Outcome=0 → (0.6-0)² = 0.36
Prediction 3: P=0.9, Outcome=1 → (0.9-1)² = 0.01
Brier Score = (0.04 + 0.36 + 0.01) / 3 = 0.137
```
**Interpretation:**
- 0.00 = perfect calibration
- 0.25 = no skill (always predicting 50% on binary outcomes)
- Lower is better
**Typical scores:**
- Expert forecasters: 0.10-0.15
- Average people: 0.20-0.25
- Aim for: <0.15
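A minimal Python sketch of the Brier score, checked against the three predictions in the example above.
```
# Brier score: mean squared difference between stated probability and outcome.

def brier(predictions):
    """predictions: list of (stated probability, outcome 1/0) pairs."""
    return sum((p - o) ** 2 for p, o in predictions) / len(predictions)

example = [(0.8, 1), (0.6, 0), (0.9, 1)]   # the three predictions from the example
print(round(brier(example), 3))            # 0.137
```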
---
## Advanced Applications
### Application 1: Hierarchical Priors
When you're uncertain about the prior itself.
**Example:**
- Question: "Will project finish on time?"
- Uncertain about the base rate: "Do similar projects finish on time 30% of the time, or closer to 70%?"
**Approach**: Model uncertainty in prior
```
P(On time) = Weighted average of different base rates
Scenario 1: Base rate = 30% (if similar to past failures), Weight = 40%
Scenario 2: Base rate = 50% (if average project), Weight = 30%
Scenario 3: Base rate = 70% (if similar to past successes), Weight = 30%
Prior P(On time) = (0.30 × 0.40) + (0.50 × 0.30) + (0.70 × 0.30)
= 0.12 + 0.15 + 0.21
= 0.48 (48%)
```
Then update this 48% prior with evidence.
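A minimal Python sketch of the weighted-average prior, followed by one update. The follow-up evidence ("team is ahead of schedule at the midpoint") and its likelihoods are illustrative assumptions added for the sketch.
```
# Hierarchical prior: weight the candidate base rates by how much you believe
# each scenario applies, then update the resulting prior with evidence.
# The evidence and its likelihoods below are illustrative assumptions.

scenarios = [   # (base rate for "on time", weight on this scenario)
    (0.30, 0.40),   # similar to past failures
    (0.50, 0.30),   # average project
    (0.70, 0.30),   # similar to past successes
]
prior_on_time = sum(rate * w for rate, w in scenarios)   # 0.48

# Assumed evidence: "team is ahead of schedule at the midpoint"
p_e_if_on_time = 0.70   # assumed likelihood if the project will finish on time
p_e_if_late    = 0.25   # assumed likelihood if it will be late

p_e = p_e_if_on_time * prior_on_time + p_e_if_late * (1 - prior_on_time)
posterior_on_time = p_e_if_on_time * prior_on_time / p_e
print(round(prior_on_time, 2), "->", round(posterior_on_time, 2))   # 0.48 -> 0.72
```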
### Application 2: Bayesian Model Comparison
Compare which model/theory better explains data.
**Example:**
- Model A: "Bug in feature X"
- Model B: "Infrastructure issue"
Evidence: 10 data points
```
P(Model A|Data) / P(Model B|Data) = [P(Data|Model A) × P(Model A)] / [P(Data|Model B) × P(Model B)]
```
**Bayes Factor** = P(Data|Model A) / P(Data|Model B)
- BF > 10: Strong evidence for Model A
- BF = 3-10: Moderate evidence for Model A
- BF = 1: Equal evidence
- BF < 0.33: Evidence against Model A
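A minimal Python sketch of the comparison. The per-model data likelihoods and the 50/50 model priors are illustrative assumptions.
```
# Model comparison: posterior odds = Bayes factor × prior odds.
# The per-model data likelihoods and 50/50 priors are illustrative assumptions.

p_data_given_a = 0.020   # P(Data | Model A: bug in feature X)
p_data_given_b = 0.004   # P(Data | Model B: infrastructure issue)
prior_a, prior_b = 0.50, 0.50

bayes_factor = p_data_given_a / p_data_given_b        # 5.0 -> moderate evidence for A
posterior_odds = bayes_factor * (prior_a / prior_b)   # odds of Model A vs Model B
p_model_a = posterior_odds / (1 + posterior_odds)     # convert odds to probability
print(bayes_factor, round(p_model_a, 3))              # 5.0 0.833
```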
### Application 3: Forecasting with Bayesian Updates
Use for repeated forecasting (elections, product launches, project timelines).
**Process:**
1. Start with base rate prior
2. Update weekly as new evidence arrives
3. Track belief evolution over time
4. Compare final forecast to outcome (calibration check)
**Example: Product launch success**
```
Week -8: Prior = 60% (base rate for similar launches)
Week -6: Beta feedback positive → Update to 70%
Week -4: Competitor launches similar product → Update to 55%
Week -2: Pre-orders exceed target → Update to 75%
Week 0: Launch → Actual success: Yes ✓
Forecast evolution: 60% → 70% → 55% → 75% → (Outcome: Yes)
```
**Calibration**: The final forecast was 75% and the outcome was Yes. Calibration is good if roughly 7-8 out of every 10 forecasts made at 75% turn out correct.
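A minimal Python sketch of the bookkeeping: log each revision with its rationale, then score the final forecast once the question resolves. Entries mirror the illustrative timeline above.
```
# Forecast log for one question, scored once the question resolves.
# Entries mirror the illustrative product-launch timeline above.

forecast_log = [
    ("week -8", 0.60, "base rate for similar launches"),
    ("week -6", 0.70, "beta feedback positive"),
    ("week -4", 0.55, "competitor launched similar product"),
    ("week -2", 0.75, "pre-orders exceed target"),
]
outcome = 1   # launch succeeded

for when, p, note in forecast_log:
    print(f"{when}: {p:.0%}  ({note})")

final_p = forecast_log[-1][1]
print("Brier contribution of final forecast:", (final_p - outcome) ** 2)   # 0.0625
```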
---
## Quality Checklist for Complex Cases
**Multiple Hypotheses**:
- [ ] All hypotheses listed (including catch-all "Other")
- [ ] Priors sum to 1.0
- [ ] Likelihoods defined for all hypothesis-evidence pairs
- [ ] Posteriors sum to 1.0 (math check)
- [ ] Interpretation provided (which hypothesis favored? by how much?)
**Sequential Updates**:
- [ ] Evidence items clearly independent or conditional dependence noted
- [ ] Each update uses previous posterior as new prior
- [ ] Belief evolution tracked (how beliefs changed over time)
- [ ] Final conclusion integrates all evidence
- [ ] Timeline shows when each piece of evidence arrived
**Calibration**:
- [ ] Considered alternative explanations (not overconfident?)
- [ ] Checked against base rates (not ignoring priors?)
- [ ] Stated confidence interval or range (not just point estimate)
- [ ] Identified assumptions that could make forecast wrong
- [ ] Planned follow-up to track calibration (compare forecast to outcome)
**Minimum Standard for Complex Cases**:
- Multiple hypotheses: Score ≥ 4.0 on rubric
- High-stakes forecasts: Track calibration over 10+ predictions