# Scoring Rules and Calibration
Comprehensive guide to proper scoring rules, calibration measurement, and forecast accuracy improvement.
## Table of Contents
1. [Proper Scoring Rules](#1-proper-scoring-rules)
2. [Brier Score Deep Dive](#2-brier-score-deep-dive)
3. [Log Score](#3-log-score-logarithmic-scoring-rule)
4. [Calibration Curves](#4-calibration-curves)
5. [Resolution Analysis](#5-resolution-analysis)
6. [Sharpness](#6-sharpness)
7. [Practical Calibration Training](#7-practical-calibration-training)
8. [Comparison Table](#8-comparison-table-of-scoring-rules)
---
## 1. Proper Scoring Rules
### What is a Scoring Rule?
A **scoring rule** assigns a numerical score to a probabilistic forecast based on the forecast and actual outcome.
**Purpose:** Measure accuracy, incentivize honesty, enable comparison, track calibration over time.
### Strictly Proper vs Merely Proper
**Strictly Proper:** Reporting your true belief uniquely maximizes your expected score; any other reported probability does strictly worse in expectation.
**Why it matters:** Incentivizes honesty, eliminates gaming, optimizes for accurate beliefs.
**Merely Proper:** Reporting your true belief maximizes your expected score, but other probabilities can tie it. Less desirable for forecasting.
### Common Proper Scoring Rules
**1. Brier Score** (strictly proper)
```
Score = -(p - o)²
p = Your probability (0 to 1)
o = Outcome (0 or 1)
```
(Negated here so that higher is better; Section 2 uses the unsigned form, where lower is better.)
**2. Logarithmic Score** (strictly proper)
```
Score = log(p) if outcome occurs
Score = log(1-p) if outcome doesn't occur
```
**3. Spherical Score** (strictly proper)
```
Score = p / √(p² + (1-p)²)     if outcome occurs
Score = (1-p) / √(p² + (1-p)²) if outcome doesn't occur
```
### Common IMPROPER Scoring Rules (Avoid)
**Absolute Error:** `Score = -|p - o|` → Incentivizes extremes (NOT proper)
**Threshold Accuracy:** Binary right/wrong → Ignores calibration (NOT proper)
**Example of gaming improper rules:**
```
Using absolute error (improper):
True belief: 60% → Optimal report: 100% (dishonest)
Using Brier score (proper):
True belief: 60% → Optimal report: 60% (honest)
```
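This incentive gap is easy to check numerically. Below is a minimal sketch (function names are illustrative) that evaluates the expected score of each candidate report under a 60% true belief, first with the improper absolute-error rule and then with the Brier rule.
```
# Expected score of each candidate report, under a true belief of 60%,
# for the improper absolute-error rule vs the proper Brier rule.
def expected_abs_error(report, belief=0.60):
    # E[-|report - outcome|] taken over the forecaster's own belief
    return belief * -abs(report - 1) + (1 - belief) * -abs(report)

def expected_brier(report, belief=0.60):
    # E[-(report - outcome)^2] taken over the forecaster's own belief
    return belief * -(report - 1) ** 2 + (1 - belief) * -(report ** 2)

reports = [i / 100 for i in range(0, 101, 5)]
print(max(reports, key=expected_abs_error))  # 1.0 -> gaming pays under the improper rule
print(max(reports, key=expected_brier))      # 0.6 -> honesty is optimal under Brier
```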
**Key Principle:** Only use strictly proper scoring rules for forecast evaluation.
---
## 2. Brier Score Deep Dive
### Formula
**Single forecast:** `Brier = (p - o)²`
**Multiple forecasts:** `Brier = (1/N) × Σ(pᵢ - oᵢ)²`
**Range:** 0.00 (perfect) to 1.00 (worst). Lower is better.
### Calculation Examples
```
90% Yes → (0.90-1)² = 0.01 (good) | 90% No → (0.90-0)² = 0.81 (bad)
60% Yes → (0.60-1)² = 0.16 (medium) | 50% Any → 0.25 (baseline)
```
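A minimal sketch of the mean Brier score over a list of forecasts, reproducing the examples above (the function name is illustrative):
```
# Mean Brier score over a list of (probability, outcome) pairs.
def brier_score(forecasts):
    """forecasts: iterable of (p, o) with p in [0, 1] and o in {0, 1}."""
    pairs = list(forecasts)
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

# The four forecasts above: 90% Yes, 90% No, 60% Yes, 50% Yes
print(brier_score([(0.90, 1), (0.90, 0), (0.60, 1), (0.50, 1)]))  # ≈ 0.3075
```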
### Brier Score Decomposition
**Murphy Decomposition:**
```
Brier Score = Reliability - Resolution + Uncertainty
```
**Reliability (Calibration Error):** Are your probabilities correct on average? (Lower is better)
**Resolution:** Do you assign different probabilities to different outcomes? (Higher is better)
**Uncertainty:** Base rate variance (uncontrollable, depends on problem)
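A minimal sketch of the decomposition, assuming ten equal-width probability bins (the bin scheme and function name are illustrative choices):
```
# Murphy decomposition of the Brier score using ten equal-width probability bins.
def murphy_decomposition(probs, outcomes, n_bins=10):
    n = len(probs)
    base_rate = sum(outcomes) / n
    uncertainty = base_rate * (1 - base_rate)
    reliability = 0.0
    resolution = 0.0
    for k in range(n_bins):
        lo, hi = k / n_bins, (k + 1) / n_bins
        in_bin = [(p, o) for p, o in zip(probs, outcomes)
                  if lo <= p < hi or (k == n_bins - 1 and p == 1.0)]
        if not in_bin:
            continue
        n_k = len(in_bin)
        mean_p = sum(p for p, _ in in_bin) / n_k   # average stated probability in bin
        freq = sum(o for _, o in in_bin) / n_k     # observed frequency in bin
        reliability += n_k * (mean_p - freq) ** 2 / n
        resolution += n_k * (freq - base_rate) ** 2 / n
    # Binned Brier ≈ reliability - resolution + uncertainty
    return reliability, resolution, uncertainty
```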
**Improving Brier:**
1. Minimize reliability (fix calibration)
2. Maximize resolution (differentiate forecasts)
### Brier Score Interpretation
| Brier Score | Quality | Description |
|-------------|---------|-------------|
| 0.00 - 0.05 | Exceptional | Near-perfect |
| 0.05 - 0.10 | Excellent | Top tier |
| 0.10 - 0.15 | Good | Skilled |
| 0.15 - 0.20 | Average | Better than random |
| 0.20 - 0.25 | Below Average | Approaching random |
| 0.25+ | Poor | At or worse than random |
**Context matters:** Easy questions should yield lower scores. Compare your score to the 0.25 chance baseline and to other forecasters on the same questions.
### Improving Your Brier Score
**Path 1: Fix Calibration**
**If overconfident:** 80% predictions happen 60% → Be less extreme, widen intervals
**If underconfident:** 60% predictions happen 80% → Be more extreme when you have evidence
**Path 2: Improve Resolution**
**Problem:** All forecasts near 50% → Differentiate easy vs hard questions, research more, be bold when warranted
**Balance:** `Good Forecaster = Well-Calibrated + High Resolution`
### Brier Skill Score
```
BSS = 1 - (Your Brier / Baseline Brier)
Example:
Your Brier: 0.12, Baseline: 0.25
BSS = 1 - 0.48 = 0.52 (52% improvement over baseline)
```
**Interpretation:** BSS = 1.00 (perfect), 0.00 (same as baseline), <0 (worse than baseline)
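A one-line helper for the skill-score calculation above (the function name is illustrative):
```
# Brier skill score relative to a baseline forecaster.
def brier_skill_score(your_brier, baseline_brier=0.25):
    return 1 - your_brier / baseline_brier

print(brier_skill_score(0.12))  # ≈ 0.52 -> 52% improvement over the 0.25 baseline
```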
---
## 3. Log Score (Logarithmic Scoring Rule)
### Formula
```
Log Score = log₂(p) if outcome occurs
Log Score = log₂(1-p) if outcome doesn't occur
Range: -∞ (worst) to 0 (perfect)
Higher (less negative) is better
```
### Calculation Examples
```
90% Yes → -0.15 | 90% No → -3.32 (severe) | 50% Yes → -1.00
99% No → -6.64 (catastrophic penalty for overconfidence)
```
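A minimal sketch of the base-2 log score, reproducing the examples above (the function name is illustrative):
```
# Base-2 log score for a single binary forecast.
import math

def log_score(p, outcome):
    """log2 of the probability assigned to the outcome that actually happened."""
    return math.log2(p if outcome == 1 else 1 - p)

print(log_score(0.90, 1))  # ≈ -0.15
print(log_score(0.90, 0))  # ≈ -3.32 (severe penalty)
print(log_score(0.99, 0))  # ≈ -6.64 (catastrophic penalty)
```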
### Relationship to Information Theory
**Log score measures bits of surprise:**
```
Surprise = -log₂(p)
p = 50% → 1 bit surprise
p = 25% → 2 bits surprise
p = 12.5% → 3 bits surprise
```
**Connection to entropy:** The expected negative log score is the cross-entropy between the true outcome distribution and your forecast distribution.
### When to Use Log Score vs Brier
**Use Log Score when:**
- Severe penalty for overconfidence desired
- Tail risk matters (rare events important)
- Information-theoretic interpretation useful
- Comparing probabilistic models
**Use Brier Score when:**
- Human forecasters (less punishing)
- Easier interpretation (squared error)
- Standard benchmark (more common)
- Avoiding extreme penalties
**Key Difference:**
Brier: Quadratic penalty (grows with square)
```
Error: 10% → 0.01, 20% → 0.04, 30% → 0.09, 40% → 0.16
```
Log: Logarithmic penalty (grows faster for extremes)
```
Forecast: 90% wrong → -3.3, 95% wrong → -4.3, 99% wrong → -6.6
```
**Recommendation:** Default to Brier. Add Log for high-stakes or to penalize overconfidence. Track both for complete picture.
---
## 4. Calibration Curves
### What is a Calibration Curve?
**Visualization of forecast accuracy:**
```
Y-axis: Actual frequency (how often outcome occurred)
X-axis: Stated probability (your forecasts)
Perfect calibration: Diagonal line (y = x)
```
**Example:**
```
Actual %
100 ┤
 80 ┤                ●
 60 ┤            ●
 40 ┤        ●        ← Perfect calibration line (y = x)
 20 ┤    ●
  0 └────┬───┬───┬───┬───┬
        20  40  60  80 100
        Stated probability %
```
### How to Create
**Step 1:** Collect 50+ forecasts and outcomes
**Step 2:** Bin by probability (0-10%, 10-20%, ..., 90-100%)
**Step 3:** For each bin, calculate actual frequency
```
Example: 60-70% bin
Forecasts: 15 total, Outcomes: 9 Yes, 6 No
Actual frequency: 9/15 = 60%
Plot point: (65, 60)
```
**Step 4:** Draw perfect calibration line (diagonal from (0,0) to (100,100))
**Step 5:** Compare points to line
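Steps 2 and 3 are the only part that needs computation; here is a minimal sketch (function name and bin scheme are illustrative) that returns plottable (midpoint, observed frequency, count) triples:
```
# Steps 2-3: bin forecasts by stated probability and compute each bin's observed frequency.
def calibration_points(probs, outcomes, n_bins=10):
    points = []
    for k in range(n_bins):
        lo, hi = k / n_bins, (k + 1) / n_bins
        in_bin = [o for p, o in zip(probs, outcomes)
                  if lo <= p < hi or (k == n_bins - 1 and p == 1.0)]
        if in_bin:
            midpoint = 100 * (2 * k + 1) / (2 * n_bins)   # x-coordinate to plot
            freq = 100 * sum(in_bin) / len(in_bin)        # y-coordinate to plot
            points.append((midpoint, freq, len(in_bin)))
    return points

# The Step 3 example: fifteen forecasts in the 60-70% bin, nine resolving Yes
probs = [0.65] * 15
outcomes = [1] * 9 + [0] * 6
print(calibration_points(probs, outcomes))  # [(65.0, 60.0, 15)]
```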
### Over/Under Confidence Detection
**Overconfidence:** Points below diagonal (said 90%, happened 70%). Fix: Be less extreme, widen intervals.
**Underconfidence:** Points above diagonal (said 90%, happened 95%). Fix: Be more extreme when evidence is strong.
**Sample size:** <10/bin unreliable, 10-20 weak, 20-50 moderate, 50+ strong evidence
---
## 5. Resolution Analysis
### What is Resolution?
**Resolution** measures ability to assign different probabilities to outcomes that actually differ.
**High resolution:** Events you call 90% happen much more than events you call 10% (good)
**Low resolution:** All forecasts near 50%, can't discriminate (bad)
### Formula
```
Resolution = (1/N) × Σ nₖ(ōₖ - ō)²
nₖ = Number of forecasts in bin k
ōₖ = Observed frequency of the outcome in bin k
ō  = Overall base rate
Higher is better
```
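A small worked illustration of the formula, using hypothetical numbers: two bins of 20 forecasts each, resolving at 90% and 10%, against an overall base rate of 50%.
```
# Hypothetical numbers: two bins of 20 forecasts each, resolving at 90% and 10%,
# against an overall base rate of 50%.
N = 40
base_rate = 0.50
bins = [(20, 0.90), (20, 0.10)]  # (n_k, observed frequency in bin k)
resolution = sum(n_k * (o_k - base_rate) ** 2 for n_k, o_k in bins) / N
print(resolution)  # ≈ 0.16 -> high resolution: the forecasts discriminate well
```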
### How to Improve Resolution
**Problem: Stuck at 50%**
Bad pattern: All forecasts 48-52% → Low resolution
Good pattern: Range from 20% to 90% → High resolution
**Strategies:**
1. **Gather discriminating information** - Find features that distinguish outcomes
2. **Use decomposition** - Fermi, causal models, scenarios
3. **Be bold when warranted** - If evidence strong → Say 85% not 65%
4. **Update with evidence** - Start with base rate, update with Bayesian reasoning
### Calibration vs Resolution Tradeoff
```
Perfect Calibration Only: Say 60% for everything when base rate is 60%
→ Calibration: Perfect
→ Resolution: Zero
→ Brier: 0.24 (bad)
High Resolution Only: Say 10% or 90% (extremes) incorrectly
→ Calibration: Poor
→ Resolution: High
→ Brier: Terrible
Optimal Balance: Well-calibrated AND high resolution
→ Calibration: Good
→ Resolution: High
→ Brier: Minimized
```
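A quick check of the 0.24 figure above: always forecasting the 60% base rate when 60% of events resolve Yes.
```
# Always forecasting the 60% base rate, with 60% of events resolving Yes:
brier = 0.60 * (0.60 - 1) ** 2 + 0.40 * (0.60 - 0) ** 2
print(brier)  # ≈ 0.24
```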
**Best forecasters:** Well-calibrated (low reliability error) + High resolution (discriminate events) = Low Brier
**Recommendation:** Don't sacrifice resolution for perfect calibration. Be bold when evidence warrants.
---
## 6. Sharpness
### What is Sharpness?
**Sharpness** = Tendency to make extreme predictions (away from 50%) when appropriate.
**Sharp:** Predicts 5% or 95% when evidence supports it (decisive)
**Unsharp:** Stays near 50% (plays it safe, indecisive)
### Why Sharpness Matters
```
Scenario: Base rate 60%
Unsharp forecaster: 50% for every event → Brier: 0.25, Usefulness: Low
Sharp forecaster: Range 20-90% → Brier: 0.12 (if calibrated), Usefulness: High
```
**Insight:** Extreme predictions improve Brier significantly when accurate but hurt badly when wrong. Solution: Be sharp only when you have the evidence.
### Measuring Sharpness
```
Sharpness = Sample variance of forecast probabilities
Forecaster A: [0.45, 0.50, 0.48, 0.52, 0.49] → Var ≈ 0.0007 (unsharp)
Forecaster B: [0.15, 0.85, 0.30, 0.90, 0.20] → Var ≈ 0.133 (sharp)
```
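The same numbers can be reproduced with the standard library's sample variance (the two forecasters are the hypothetical ones above):
```
# Sample variance of each forecaster's stated probabilities.
from statistics import variance

forecaster_a = [0.45, 0.50, 0.48, 0.52, 0.49]
forecaster_b = [0.15, 0.85, 0.30, 0.90, 0.20]
print(variance(forecaster_a))  # ≈ 0.0007 (unsharp)
print(variance(forecaster_b))  # ≈ 0.133  (sharp)
```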
### When to Be Sharp
**Be sharp (extreme probabilities) when:**
- Strong discriminating evidence (multiple independent pieces align)
- Easy questions (outcome nearly certain)
- You have expertise (domain knowledge, track record)
**Stay moderate (near 50%) when:**
- High uncertainty (limited information, conflicting evidence)
- Hard questions (true probability near 50%)
- No expertise (unfamiliar domain)
**Goal:** Sharp AND well-calibrated (extreme when warranted, accurate probabilities)
---
## 7. Practical Calibration Training
### Calibration Exercises
**Exercise Set 1:** Make 10 forecasts on verifiable questions (fair coin lands heads: 50%; Paris is the capital of France: 99%; two coin flips both land heads: 25%; a die shows 6: 16.67%). Check that outcomes match the stated probabilities: near-certain (99%) claims should essentially always hold, and 50% claims should hold about half the time.
**Exercise Set 2:** Make 20 predictions at "80% confident". Expected: 16/20 correct. Common result: 12-14/20 (overconfident). If so, what feels like "80%" should probably be reported closer to 65%.
### Tracking Methods
**Method 1: Spreadsheet**
```
| Date | Question | Prob | Outcome | Brier | Notes |
Monthly: Calculate mean Brier
Quarterly: Generate calibration curve
```
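A minimal sketch for Method 1, assuming the spreadsheet is exported as a CSV with the column names shown above (the file name is illustrative):
```
# Read the tracking spreadsheet (exported as CSV with the columns shown above)
# and compute the mean Brier score. The file name is illustrative.
import csv

def mean_brier_from_csv(path="forecasts.csv"):
    scores = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["Outcome"] in ("0", "1"):          # skip unresolved questions
                p, o = float(row["Prob"]), int(row["Outcome"])
                scores.append((p - o) ** 2)
    return sum(scores) / len(scores) if scores else None

print(mean_brier_from_csv())
```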
**Method 2: Apps**
- PredictionBook.com (free, tracks calibration)
- Metaculus.com (forecasting platform)
- Good Judgment Open (tournament)
**Method 3: Focused Practice**
- Week 1: Make 20 predictions (focus on honesty)
- Week 2: Check calibration curve (identify bias)
- Week 3: Increase resolution (be bold)
- Week 4: Balance calibration + resolution
### Training Drills
**Drill 1:** Generate 10 "90% CIs" for unknowns. Target: 9/10 contain true value. Common mistake: Only 5-7 (overconfident). Fix: Widen by 1.5×.
**Drill 2:** Bayesian practice - State prior, observe evidence, update posterior, check calibration.
**Drill 3:** Make 10 predictions >80% or <20%. Force extremes when "pretty sure". Track: Are >80% happening >80%?
---
## 8. Comparison Table of Scoring Rules
### Summary
| Feature | Brier | Log | Spherical | Threshold |
|---------|-------|-----|-----------|-----------|
| **Proper** | Strictly | Strictly | Strictly | NO |
| **Range** | 0 to 1 | -∞ to 0 | 0 to 1 | 0 to 1 |
| **Penalty** | Quadratic | Logarithmic | Moderate | None |
| **Interpretation** | Squared error | Bits surprise | Geometric | Binary |
| **Usage** | Default | High-stakes | Rare | Avoid |
| **Human-friendly** | Yes | Somewhat | No | Yes (misleading) |
### Detailed Comparison
**Brier Score**
Pros: Easy to interpret, standard in competitions, moderate penalty, good for humans
Cons: Less severe penalty for overconfidence
Best for: General forecasting, calibration training, standard benchmarking
**Log Score**
Pros: Severe penalty for overconfidence, information-theoretic, strongly incentivizes honesty
Cons: Too punishing for humans, infinite at 0%/100%, less intuitive
Best for: High-stakes forecasting, penalizing overconfidence, ML models, tail risk
**Spherical Score**
Pros: Strictly proper, bounded, geometric interpretation
Cons: Uncommon, complex formula, rarely used
Best for: Theoretical analysis only
**Threshold / Binary Accuracy**
Pros: Very intuitive, easy to explain
Cons: NOT proper (incentivizes extremes), ignores calibration, can be gamed
Best for: Nothing (don't use for forecasting)
### When to Use Each
| Your Situation | Recommended |
|----------------|-------------|
| Starting out | **Brier** |
| Experienced forecaster | **Brier** or **Log** |
| High-stakes decisions | **Log** |
| Comparing to benchmarks | **Brier** |
| Building ML model | **Log** |
| Personal tracking | **Brier** |
| Teaching others | **Brier** |
**Recommendation:** Use **Brier** as default. Add **Log** for high-stakes or to penalize overconfidence.
### Conversion Example
**Forecast: 80%, Outcome: Yes**
```
Brier: (0.80-1)² = 0.04
Log (base 2): log₂(0.80) = -0.322
Spherical: 0.80/√(0.80²+0.20²) = 0.970
```
**Forecast: 80%, Outcome: No**
```
Brier: (0.80-0)² = 0.64
Log (base 2): log₂(0.20) = -2.322 (much worse penalty)
Spherical: 0.20/√(0.80²+0.20²) = 0.243
```
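A minimal sketch (function name illustrative) that reproduces both conversion examples in one place:
```
# Brier, base-2 log, and spherical scores for one binary forecast.
import math

def all_scores(p, outcome):
    p_assigned = p if outcome == 1 else 1 - p          # probability given to what happened
    brier = (p - outcome) ** 2
    log2 = math.log2(p_assigned)
    spherical = p_assigned / math.sqrt(p ** 2 + (1 - p) ** 2)
    return brier, log2, spherical

print(all_scores(0.80, 1))  # ≈ (0.04, -0.322, 0.970)
print(all_scores(0.80, 0))  # ≈ (0.64, -2.322, 0.243)
```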
---
## Return to Main Skill
[← Back to Market Mechanics & Betting](../SKILL.md)
**Related Resources:**
- [Betting Theory Fundamentals](betting-theory.md)
- [Kelly Criterion Deep Dive](kelly-criterion.md)