# Scoring Rules and Calibration

Comprehensive guide to proper scoring rules, calibration measurement, and forecast accuracy improvement.

## Table of Contents

1. [Proper Scoring Rules](#1-proper-scoring-rules)
2. [Brier Score Deep Dive](#2-brier-score-deep-dive)
3. [Log Score](#3-log-score-logarithmic-scoring-rule)
4. [Calibration Curves](#4-calibration-curves)
5. [Resolution Analysis](#5-resolution-analysis)
6. [Sharpness](#6-sharpness)
7. [Practical Calibration Training](#7-practical-calibration-training)
8. [Comparison Table](#8-comparison-table-of-scoring-rules)

---

## 1. Proper Scoring Rules

### What is a Scoring Rule?

A **scoring rule** assigns a numerical score to a probabilistic forecast based on the stated probability and the actual outcome.

**Purpose:** Measure accuracy, incentivize honesty, enable comparison, track calibration over time.

### Strictly Proper vs Merely Proper

**Strictly proper:** Reporting your true belief *uniquely* maximizes your expected score; no other probability does as well.

**Why it matters:** Incentivizes honesty, eliminates gaming, optimizes for accurate beliefs.

**Merely proper:** Your true belief maximizes expected score, but other probabilities can tie it. Less desirable for forecasting.

### Common Proper Scoring Rules

**1. Brier Score** (strictly proper)

```
Score = -(p - o)²

p = Your probability (0 to 1)
o = Outcome (0 or 1)
```

(Negated here so that higher is better; Section 2 uses the standard loss form, where lower is better.)

**2. Logarithmic Score** (strictly proper)

```
Score = log(p)    if outcome occurs
Score = log(1-p)  if outcome doesn't occur
```

**3. Spherical Score** (strictly proper)

```
Score = p / √(p² + (1-p)²)      if outcome occurs
Score = (1-p) / √(p² + (1-p)²)  if outcome doesn't occur
```

### Common IMPROPER Scoring Rules (Avoid)

**Absolute Error:** `Score = -|p - o|` → Incentivizes extremes (NOT proper)

**Threshold Accuracy:** Binary right/wrong → Ignores calibration (NOT proper)

**Example of gaming improper rules:**

```
Using absolute error (improper):
  True belief: 60% → Optimal report: 100% (dishonest)

Using Brier score (proper):
  True belief: 60% → Optimal report: 60% (honest)
```

**Key Principle:** Only use strictly proper scoring rules for forecast evaluation. (A code sketch of this incentive effect appears at the end of Section 2.)

---

## 2. Brier Score Deep Dive

### Formula

**Single forecast:** `Brier = (p - o)²`

**Multiple forecasts:** `Brier = (1/N) × Σ(pᵢ - oᵢ)²`

**Range:** 0.00 (perfect) to 1.00 (worst). Lower is better.

### Calculation Examples

```
90% Yes → (0.90-1)² = 0.01 (good)   | 90% No → (0.90-0)² = 0.81 (bad)
60% Yes → (0.60-1)² = 0.16 (medium) | 50% Any → 0.25 (baseline)
```

### Brier Score Decomposition

**Murphy decomposition:**

```
Brier Score = Reliability - Resolution + Uncertainty
```

**Reliability (calibration error):** How far your stated probabilities sit from the observed frequencies (lower is better)

**Resolution:** Do you assign different probabilities to outcomes that actually differ? (higher is better)

**Uncertainty:** Base-rate variance (uncontrollable; depends on the problem)

**Improving Brier:**

1. Minimize reliability (fix calibration)
2. Maximize resolution (differentiate forecasts)

### Brier Score Interpretation

| Brier Score | Quality | Description |
|-------------|---------|-------------|
| 0.00 - 0.05 | Exceptional | Near-perfect |
| 0.05 - 0.10 | Excellent | Top tier |
| 0.10 - 0.15 | Good | Skilled |
| 0.15 - 0.20 | Average | Better than random |
| 0.20 - 0.25 | Below Average | Approaching random |
| 0.25+ | Poor | At or worse than random |

**Context matters:** Easy questions should yield lower scores. Compare against the 0.25 baseline and against other forecasters on the same questions.
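To make the incentive argument from Section 1 concrete, here is a minimal sketch (plain Python, no dependencies) that sweeps every reported probability and finds the one maximizing the expected score. The true belief q = 0.60 and the 1%-step grid are illustrative assumptions.

```python
# Sketch: why the Brier score rewards honesty and absolute error does not.
# We assume a true belief q = 0.60 for a binary event, then sweep the
# reported probability p and compute the expected score under that belief.

def expected_brier(p, q):
    """Expected negative Brier score (higher is better) when the event
    occurs with probability q and we report p."""
    return -(q * (p - 1) ** 2 + (1 - q) * (p - 0) ** 2)

def expected_abs_error(p, q):
    """Expected negative absolute error (an IMPROPER rule)."""
    return -(q * abs(p - 1) + (1 - q) * abs(p - 0))

q = 0.60
grid = [i / 100 for i in range(101)]
best_brier = max(grid, key=lambda p: expected_brier(p, q))
best_abs = max(grid, key=lambda p: expected_abs_error(p, q))
print(f"Brier-optimal report:          {best_brier:.2f}")  # 0.60 (honest)
print(f"Absolute-error-optimal report: {best_abs:.2f}")    # 1.00 (extreme)
```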
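The Murphy decomposition can be computed directly from a forecast log. Below is a minimal Python sketch, assuming binary outcomes and ten equal-width bins; the forecast/outcome data is made up, and with finite bins the identity holds only up to a small within-bin variance term.

```python
# Sketch: mean Brier score and a binned Murphy decomposition for a set of
# binary forecasts. The forecasts and outcomes below are illustrative only.

def brier(forecasts, outcomes):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def murphy_decomposition(forecasts, outcomes, n_bins=10):
    """Return (reliability, resolution, uncertainty) so that
    Brier ≈ reliability - resolution + uncertainty (up to binning error)."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = {}
    for p, o in zip(forecasts, outcomes):
        k = min(int(p * n_bins), n_bins - 1)    # bin index 0..n_bins-1
        bins.setdefault(k, []).append((p, o))
    reliability = resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        p_k = sum(p for p, _ in members) / n_k  # mean forecast in bin
        o_k = sum(o for _, o in members) / n_k  # observed frequency in bin
        reliability += n_k * (p_k - o_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty

forecasts = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6, 0.9, 0.1]
outcomes  = [1,   1,   0,   0,   0,   1,   1,   0  ]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
print(f"Brier: {brier(forecasts, outcomes):.3f}")
print(f"Reliability {rel:.3f} - Resolution {res:.3f} + Uncertainty {unc:.3f}")
```

Reliability and resolution map directly onto the two improvement paths described next.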
### Improving Your Brier Score

**Path 1: Fix calibration**

**If overconfident:** Your 80% predictions happen 60% of the time → Be less extreme, widen intervals

**If underconfident:** Your 60% predictions happen 80% of the time → Be more extreme when you have evidence

**Path 2: Improve resolution**

**Problem:** All forecasts near 50% → Differentiate easy vs hard questions, research more, be bold when warranted

**Balance:** `Good Forecaster = Well-Calibrated + High Resolution`

### Brier Skill Score

```
BSS = 1 - (Your Brier / Baseline Brier)

Example:
  Your Brier: 0.12, Baseline: 0.25
  BSS = 1 - 0.48 = 0.52 (52% improvement over baseline)
```

**Interpretation:** BSS = 1.00 (perfect), 0.00 (same as baseline), < 0 (worse than baseline)

---

## 3. Log Score (Logarithmic Scoring Rule)

### Formula

```
Log Score = log₂(p)    if outcome occurs
Log Score = log₂(1-p)  if outcome doesn't occur

Range: -∞ (worst) to 0 (perfect)
Higher (less negative) is better
```

### Calculation Examples

```
90% Yes → -0.15 | 90% No → -3.32 (severe) | 50% Yes → -1.00
99% No  → -6.64 (catastrophic penalty for overconfidence)
```

### Relationship to Information Theory

**Log score measures bits of surprise:**

```
Surprise = -log₂(p)

p = 50%   → 1 bit of surprise
p = 25%   → 2 bits of surprise
p = 12.5% → 3 bits of surprise
```

**Connection to entropy:** The negative of your mean log score is the cross-entropy between your forecast distribution and the realized outcomes.

### When to Use Log Score vs Brier

**Use Log Score when:**

- A severe penalty for overconfidence is desired
- Tail risk matters (rare events are important)
- An information-theoretic interpretation is useful
- Comparing probabilistic models

**Use Brier Score when:**

- Scoring human forecasters (less punishing)
- Easier interpretation is needed (squared error)
- A standard benchmark matters (more common)
- Avoiding extreme penalties

**Key difference:**

Brier: quadratic penalty (grows with the square of the error)

```
Error: 10% → 0.01, 20% → 0.04, 30% → 0.09, 40% → 0.16
```

Log: penalty grows far faster at the extremes

```
Forecast: 90% wrong → -3.3, 95% wrong → -4.3, 99% wrong → -6.6
```

**Recommendation:** Default to Brier. Add Log for high-stakes settings or to penalize overconfidence. Track both for a complete picture.

---

## 4. Calibration Curves

### What is a Calibration Curve?

**Visualization of forecast accuracy:**

```
Y-axis: Actual frequency (how often the outcome occurred)
X-axis: Stated probability (your forecasts)

Perfect calibration: the diagonal line (y = x)
```

**Example:**

```
Actual %
100 ┤                    ╱
 80 ┤               ●
 60 ┤          ●
 40 ┤     ●      ← Perfect calibration: points on the diagonal
 20 ┤ ●
  0 └───────────────────────
    0   20   40   60   80  100
        Stated probability %
```

### How to Create

**Step 1:** Collect 50+ forecasts and outcomes

**Step 2:** Bin by probability (0-10%, 10-20%, ..., 90-100%)

**Step 3:** For each bin, calculate the actual frequency

```
Example: 60-70% bin
  Forecasts: 15 total; Outcomes: 9 Yes, 6 No
  Actual frequency: 9/15 = 60%
  Plot point: (65, 60)
```

**Step 4:** Draw the perfect calibration line (diagonal from (0,0) to (100,100))

**Step 5:** Compare your points to the line (see the code sketch at the end of this section)

### Over/Under Confidence Detection

**Overconfidence:** Points below the diagonal (said 90%, happened 70%). Fix: be less extreme, widen intervals.

**Underconfidence:** Points above the diagonal (said 90%, happened 95%). Fix: be more extreme when evidence is strong.

**Sample size per bin:** <10 unreliable, 10-20 weak, 20-50 moderate, 50+ strong evidence
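The binning procedure above is mechanical, so here is a minimal Python sketch of it; the bin scheme follows the 0-10%, 10-20%, ... steps, and the input data (the 60-70% bin worked example) is illustrative only.

```python
# Sketch: build calibration-curve points from logged forecasts.
from collections import defaultdict

def calibration_points(forecasts, outcomes, n_bins=10):
    """Return a list of (bin_midpoint_pct, actual_pct, count) tuples."""
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # bin index 0..n_bins-1
        bins[k].append(o)
    points = []
    for k in sorted(bins):
        hits = bins[k]
        midpoint = (k + 0.5) * 100 / n_bins   # e.g. the 60-70% bin -> 65%
        actual = 100 * sum(hits) / len(hits)  # observed frequency in the bin
        points.append((midpoint, actual, len(hits)))
    return points

# Worked example matching Step 3 above: 15 forecasts at 65%, 9 resolved Yes.
forecasts = [0.65] * 15
outcomes = [1] * 9 + [0] * 6
for mid, actual, n in calibration_points(forecasts, outcomes):
    print(f"bin {mid:.0f}%: actual {actual:.0f}% over {n} forecasts")
```

Plotting these points against the diagonal gives the calibration curve; points below it signal overconfidence, points above it underconfidence, subject to the per-bin sample-size caveats above.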
---

## 5. Resolution Analysis

### What is Resolution?

**Resolution** measures your ability to assign different probabilities to outcomes that actually differ.

**High resolution:** Events you call 90% happen much more often than events you call 10% (good)

**Low resolution:** All forecasts sit near 50%; you can't discriminate (bad)

### Formula

```
Resolution = (1/N) × Σ nₖ(oₖ - ō)²

nₖ = Number of forecasts in bin k
oₖ = Actual frequency in bin k
ō  = Overall base rate

Higher is better
```

### How to Improve Resolution

**Problem: stuck at 50%**

Bad pattern: all forecasts in the 48-52% range → low resolution
Good pattern: forecasts range from 20% to 90% → high resolution

**Strategies:**

1. **Gather discriminating information** - Find features that distinguish outcomes
2. **Use decomposition** - Fermi estimates, causal models, scenarios
3. **Be bold when warranted** - If the evidence is strong, say 85%, not 65%
4. **Update with evidence** - Start with the base rate, then update with Bayesian reasoning

### Calibration vs Resolution Tradeoff

```
Perfect Calibration Only:
  Say 60% for everything when the base rate is 60%
  → Calibration: Perfect
  → Resolution: Zero
  → Brier: 0.24 (bad)

High Resolution Only:
  Say 10% or 90% (extremes) incorrectly
  → Calibration: Poor
  → Resolution: High
  → Brier: Terrible

Optimal Balance:
  Well-calibrated AND high resolution
  → Calibration: Good
  → Resolution: High
  → Brier: Minimized
```

**Best forecasters:** Well-calibrated (low reliability error) + high resolution (discriminate between events) = low Brier

**Recommendation:** Don't sacrifice resolution for perfect calibration. Be bold when the evidence warrants it.

---

## 6. Sharpness

### What is Sharpness?

**Sharpness** = the tendency to make extreme predictions (away from 50%) when appropriate.

**Sharp:** Predicts 5% or 95% when the evidence supports it (decisive)

**Unsharp:** Stays near 50% (plays it safe, indecisive)

### Why Sharpness Matters

```
Scenario: Base rate 60%

Unsharp forecaster: 50% for every event
  → Brier: 0.25, Usefulness: Low

Sharp forecaster: Range 20-90%
  → Brier: 0.12 (if calibrated), Usefulness: High
```

**Insight:** Extreme predictions, when accurate, improve your Brier score significantly; when wrong, they hurt badly. Solution: be sharp when you have the evidence.

### Measuring Sharpness

```
Sharpness = Variance of forecast probabilities

Forecaster A: [0.45, 0.50, 0.48, 0.52, 0.49] → Var ≈ 0.0007 (unsharp)
Forecaster B: [0.15, 0.85, 0.30, 0.90, 0.20] → Var ≈ 0.133  (sharp)
```

(A code sketch computing this variance, along with the resolution term from Section 5, follows at the end of this section.)

### When to Be Sharp

**Be sharp (extreme probabilities) when:**

- There is strong discriminating evidence (multiple independent pieces align)
- The question is easy (the outcome is nearly certain)
- You have expertise (domain knowledge, track record)

**Stay moderate (near 50%) when:**

- Uncertainty is high (limited information, conflicting evidence)
- The question is hard (the true probability is near 50%)
- You lack expertise (unfamiliar domain)

**Goal:** Sharp AND well-calibrated (extreme when warranted, accurate probabilities)
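To make Sections 5 and 6 measurable, here is a small Python sketch. It assumes sample variance (dividing by n-1) as the sharpness measure, reuses the two example forecasters above, and feeds made-up bin counts into the resolution formula.

```python
# Sketch: sharpness as the sample variance of stated probabilities, plus
# the resolution term from Section 5 on illustrative binned data.

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

forecaster_a = [0.45, 0.50, 0.48, 0.52, 0.49]  # hugs 50%: unsharp
forecaster_b = [0.15, 0.85, 0.30, 0.90, 0.20]  # commits: sharp
print(f"A sharpness: {sample_variance(forecaster_a):.4f}")  # ~0.0007
print(f"B sharpness: {sample_variance(forecaster_b):.4f}")  # ~0.133

def resolution(bin_counts, bin_freqs, base_rate):
    """Resolution = (1/N) × Σ nₖ(oₖ - ō)²  (higher is better)."""
    n = sum(bin_counts)
    return sum(n_k * (o_k - base_rate) ** 2
               for n_k, o_k in zip(bin_counts, bin_freqs)) / n

# Illustrative bins: 40 forecasts near 20% that resolved Yes 25% of the
# time, 60 forecasts near 80% that resolved Yes 75% of the time; the
# implied base rate is (40*0.25 + 60*0.75)/100 = 0.55.
print(f"Resolution: {resolution([40, 60], [0.25, 0.75], 0.55):.3f}")
```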
---

## 7. Practical Calibration Training

### Calibration Exercises

**Exercise Set 1:** Make 10 forecasts on verifiable questions (fair coin lands heads: 50%; Paris is the capital of France: 99%; two coin flips both heads: 25%; a die shows 6: 16.67%). Check: did your ~99% claims essentially always come true? Did your 50% claims come true about half the time?

**Exercise Set 2:** Make 20 predictions at "80% confident." Expected: 16/20 correct. Common result: 12-14/20 (overconfident). If so, what feels like "80%" should be reported as "65%."

### Tracking Methods

**Method 1: Spreadsheet**

```
| Date | Question | Prob | Outcome | Brier | Notes |

Monthly:   Calculate mean Brier
Quarterly: Generate calibration curve
```

**Method 2: Apps**

- PredictionBook.com (free, tracks calibration)
- Metaculus.com (forecasting platform)
- Good Judgment Open (tournaments)

**Method 3: Focused Practice**

- Week 1: Make 20 predictions (focus on honesty)
- Week 2: Check your calibration curve (identify bias)
- Week 3: Increase resolution (be bold)
- Week 4: Balance calibration and resolution

### Training Drills

**Drill 1:** Generate 10 "90% confidence intervals" for unknown quantities. Target: 9/10 contain the true value. Common mistake: only 5-7 do (overconfidence). Fix: widen your intervals by about 1.5×.

**Drill 2:** Bayesian practice - state a prior, observe evidence, update to a posterior, check calibration.

**Drill 3:** Make 10 predictions above 80% or below 20%. Force yourself to the extremes when you are "pretty sure." Track: do your >80% predictions happen more than 80% of the time?

---

## 8. Comparison Table of Scoring Rules

### Summary

| Feature | Brier | Log | Spherical | Threshold |
|---------|-------|-----|-----------|-----------|
| **Proper** | Strictly | Strictly | Strictly | NO |
| **Range** | 0 to 1 | -∞ to 0 | 0 to 1 | 0 to 1 |
| **Penalty** | Quadratic | Logarithmic | Moderate | None |
| **Interpretation** | Squared error | Bits of surprise | Geometric | Binary |
| **Usage** | Default | High-stakes | Rare | Avoid |
| **Human-friendly** | Yes | Somewhat | No | Yes (misleading) |

### Detailed Comparison

**Brier Score**

- Pros: Easy to interpret, standard in competitions, moderate penalty, good for humans
- Cons: Less severe penalty for overconfidence
- Best for: General forecasting, calibration training, standard benchmarking

**Log Score**

- Pros: Severe penalty for overconfidence, information-theoretic, strongly incentivizes honesty
- Cons: Too punishing for humans, infinite at 0%/100%, less intuitive
- Best for: High-stakes forecasting, penalizing overconfidence, ML models, tail risk

**Spherical Score**

- Pros: Strictly proper, bounded, geometric interpretation
- Cons: Uncommon, complex formula, rarely used
- Best for: Theoretical analysis only

**Threshold / Binary Accuracy**

- Pros: Very intuitive, easy to explain
- Cons: NOT proper (incentivizes extremes), ignores calibration, can be gamed
- Best for: Nothing (don't use it for forecasting)

### When to Use Each

| Your Situation | Recommended |
|----------------|-------------|
| Starting out | **Brier** |
| Experienced forecaster | **Brier** or **Log** |
| High-stakes decisions | **Log** |
| Comparing to benchmarks | **Brier** |
| Building an ML model | **Log** |
| Personal tracking | **Brier** |
| Teaching others | **Brier** |

**Recommendation:** Use **Brier** as the default. Add **Log** for high-stakes settings or to penalize overconfidence.

### Conversion Example

**Forecast: 80%, Outcome: Yes**

```
Brier:        (0.80-1)² = 0.04
Log (base 2): log₂(0.80) = -0.322
Spherical:    0.80/√(0.80²+0.20²) = 0.970
```

**Forecast: 80%, Outcome: No**

```
Brier:        (0.80-0)² = 0.64
Log (base 2): log₂(0.20) = -2.322 (much worse penalty)
Spherical:    0.20/√(0.80²+0.20²) = 0.243
```

(A short script verifying these numbers appears after the links below.)

---

## Return to Main Skill

[← Back to Market Mechanics & Betting](../SKILL.md)

**Related Resources:**

- [Betting Theory Fundamentals](betting-theory.md)
- [Kelly Criterion Deep Dive](kelly-criterion.md)
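---

As a closing cross-check, this short Python script (standard library only) reproduces the Conversion Example from Section 8 by scoring a single 80% forecast under all three proper rules.

```python
# Cross-check: score an 80% forecast under Brier, log, and spherical rules.
import math

def brier(p, o):
    return (p - o) ** 2                  # lower is better

def log_score(p, o):
    return math.log2(p if o else 1 - p)  # higher (less negative) is better

def spherical(p, o):
    norm = math.sqrt(p ** 2 + (1 - p) ** 2)
    return (p if o else 1 - p) / norm    # higher is better

for outcome in (1, 0):
    label = "Yes" if outcome else "No"
    print(f"Forecast 80%, outcome {label}: "
          f"Brier {brier(0.80, outcome):.3f}, "
          f"Log {log_score(0.80, outcome):.3f}, "
          f"Spherical {spherical(0.80, outcome):.3f}")
```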