Scoring Rules and Calibration
Comprehensive guide to proper scoring rules, calibration measurement, and forecast accuracy improvement.
Table of Contents
- Proper Scoring Rules
- Brier Score Deep Dive
- Log Score
- Calibration Curves
- Resolution Analysis
- Sharpness
- Practical Calibration Training
- Comparison Table
1. Proper Scoring Rules
What is a Scoring Rule?
A scoring rule assigns a numerical score to a probabilistic forecast based on the forecast and actual outcome.
Purpose: Measure accuracy, incentivize honesty, enable comparison, track calibration over time.
Strictly Proper vs Merely Proper
Strictly Proper: Reporting your true belief uniquely maximizes your expected score. No other probability gives better expected score.
Why it matters: Incentivizes honesty, eliminates gaming, optimizes for accurate beliefs.
Merely Proper: Reporting your true belief maximizes your expected score, but other probabilities can tie it. Less useful for forecasting.
Common Proper Scoring Rules
1. Brier Score (strictly proper)
Score = -(p - o)²  (reward form; Section 2 uses the equivalent loss form (p - o)², where lower is better)
p = Your probability (0 to 1)
o = Outcome (0 or 1)
2. Logarithmic Score (strictly proper)
Score = log(p) if outcome occurs
Score = log(1-p) if outcome doesn't occur
3. Spherical Score (strictly proper)
Score = p / √(p² + (1-p)²) if outcome occurs
Common IMPROPER Scoring Rules (Avoid)
Absolute Error: Score = -|p - o| → Incentivizes extremes (NOT proper)
Threshold Accuracy: Binary right/wrong → Ignores calibration (NOT proper)
Example of gaming improper rules:
Using absolute error (improper):
True belief: 60% → Optimal report: 100% (dishonest)
Using Brier score (proper):
True belief: 60% → Optimal report: 60% (honest)
Key Principle: Only use strictly proper scoring rules for forecast evaluation.
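To make the gaming example concrete, here is a minimal sketch (a grid search in plain Python, no external libraries): when your true belief is 60%, the improper absolute-error rule is maximized by reporting 100%, while the proper Brier rule is maximized by reporting your true 60%.

```python
TRUE_BELIEF = 0.6

def expected_abs_score(report, p=TRUE_BELIEF):
    # Improper rule: expected value of -|report - outcome|
    return p * -(1 - report) + (1 - p) * -report

def expected_brier_score(report, p=TRUE_BELIEF):
    # Proper rule: expected value of -(report - outcome)^2
    return p * -((report - 1) ** 2) + (1 - p) * -(report ** 2)

grid = [i / 100 for i in range(101)]
print(max(grid, key=expected_abs_score))    # 1.0 -> improper rule rewards exaggeration
print(max(grid, key=expected_brier_score))  # 0.6 -> proper rule rewards honesty
```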
2. Brier Score Deep Dive
Formula
Single forecast: Brier = (p - o)²
Multiple forecasts: Brier = (1/N) × Σᵢ (pᵢ - oᵢ)²
Range: 0.00 (perfect) to 1.00 (worst). Lower is better.
Calculation Examples
90% Yes → (0.90-1)² = 0.01 (good) | 90% No → (0.90-0)² = 0.81 (bad)
60% Yes → (0.60-1)² = 0.16 (medium) | 50% Any → 0.25 (baseline)
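A minimal helper that reproduces these calculations (illustrative, not from any particular library):

```python
def brier_score(probs, outcomes):
    """Mean Brier score: lower is better; 0.25 is the always-50% baseline."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

print(brier_score([0.90], [1]))  # ≈ 0.01 (confident and right)
print(brier_score([0.90], [0]))  # ≈ 0.81 (confident and wrong)
print(brier_score([0.60], [1]))  # ≈ 0.16
print(brier_score([0.50], [1]))  # 0.25  (baseline)
```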
Brier Score Decomposition
Murphy Decomposition:
Brier Score = Reliability - Resolution + Uncertainty
Reliability (Calibration Error): Are your probabilities correct on average? (Lower is better)
Resolution: Do you assign different probabilities to different outcomes? (Higher is better)
Uncertainty: Base rate variance (uncontrollable, depends on problem)
Improving Brier:
- Minimize reliability (fix calibration)
- Maximize resolution (differentiate forecasts)
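A sketch of computing this decomposition by binning forecasts into equal-width probability bins. The identity Brier = Reliability - Resolution + Uncertainty is exact when every forecast within a bin has the same value; with mixed forecasts per bin there is a small within-bin residual.

```python
import numpy as np

def murphy_decomposition(probs, outcomes, n_bins=10):
    # Sketch: Brier ≈ reliability - resolution + uncertainty.
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(probs)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1 - base_rate)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)  # bin index 0..n_bins-1
    reliability = resolution = 0.0
    for k in range(n_bins):
        mask = bins == k
        if not mask.any():
            continue
        p_k = probs[mask].mean()      # mean stated probability in bin k
        o_k = outcomes[mask].mean()   # actual frequency in bin k
        reliability += mask.sum() * (p_k - o_k) ** 2
        resolution += mask.sum() * (o_k - base_rate) ** 2
    brier = np.mean((probs - outcomes) ** 2)
    return brier, reliability / n, resolution / n, uncertainty
```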
Brier Score Interpretation
| Brier Score | Quality | Description |
|---|---|---|
| 0.00 - 0.05 | Exceptional | Near-perfect |
| 0.05 - 0.10 | Excellent | Top tier |
| 0.10 - 0.15 | Good | Skilled |
| 0.15 - 0.20 | Average | Better than random |
| 0.20 - 0.25 | Below Average | Approaching random |
| 0.25+ | Poor | At or worse than random |
Context matters: Easy questions expect lower scores. Compare to baseline (0.25) and other forecasters.
Improving Your Brier Score
Path 1: Fix Calibration
If overconfident: 80% predictions happen 60% → Be less extreme, widen intervals
If underconfident: 60% predictions happen 80% → Be more extreme when you have evidence
Path 2: Improve Resolution
Problem: All forecasts near 50% → Differentiate easy vs hard questions, research more, be bold when warranted
Balance: Good Forecaster = Well-Calibrated + High Resolution
Brier Skill Score
BSS = 1 - (Your Brier / Baseline Brier)
Example:
Your Brier: 0.12, Baseline: 0.25
BSS = 1 - 0.48 = 0.52 (52% improvement over baseline)
Interpretation: BSS = 1.00 (perfect), 0.00 (same as baseline), <0 (worse than baseline)
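The same in a couple of lines (the 0.25 baseline is the always-50% forecaster from above):

```python
def brier_skill_score(brier, baseline=0.25):
    """BSS = 1 - Brier / baseline Brier; 1 is perfect, 0 matches baseline, <0 is worse."""
    return 1 - brier / baseline

print(brier_skill_score(0.12))  # 0.52
```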
3. Log Score (Logarithmic Scoring Rule)
Formula
Log Score = log₂(p) if outcome occurs
Log Score = log₂(1-p) if outcome doesn't occur
Range: -∞ (worst) to 0 (perfect)
Higher (less negative) is better
Calculation Examples
90% Yes → -0.15 | 90% No → -3.32 (severe) | 50% Yes → -1.00
99% No → -6.64 (catastrophic penalty for overconfidence)
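A small helper reproducing these log-score examples (base 2, as in the formula):

```python
import math

def log_score(p, outcome):
    """Base-2 log score: 0 is perfect, more negative is worse."""
    return math.log2(p if outcome else 1 - p)

print(log_score(0.90, 1))  # ≈ -0.15
print(log_score(0.90, 0))  # ≈ -3.32
print(log_score(0.50, 1))  # -1.0
print(log_score(0.99, 0))  # ≈ -6.64
```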
Relationship to Information Theory
Log score measures bits of surprise:
Surprise = -log₂(p)
p = 50% → 1 bit surprise
p = 25% → 2 bits surprise
p = 12.5% → 3 bits surprise
Connection to entropy: the negative of the log score is the cross-entropy between the realized outcome (as a point distribution) and your forecast; in expectation it equals the cross-entropy between the true outcome distribution and your forecast.
When to Use Log Score vs Brier
Use Log Score when:
- Severe penalty for overconfidence desired
- Tail risk matters (rare events important)
- Information-theoretic interpretation useful
- Comparing probabilistic models
Use Brier Score when:
- Human forecasters (less punishing)
- Easier interpretation (squared error)
- Standard benchmark (more common)
- Avoiding extreme penalties
Key Difference:
Brier: Quadratic penalty (grows with square)
Error: 10% → 0.01, 20% → 0.04, 30% → 0.09, 40% → 0.16
Log: Logarithmic penalty (grows faster for extremes)
Forecast: 90% wrong → -3.3, 95% wrong → -4.3, 99% wrong → -6.6
Recommendation: Default to Brier. Add Log for high-stakes or to penalize overconfidence. Track both for complete picture.
4. Calibration Curves
What is a Calibration Curve?
Visualization of forecast accuracy:
Y-axis: Actual frequency (how often outcome occurred)
X-axis: Stated probability (your forecasts)
Perfect calibration: Diagonal line (y = x)
Example: plotting actual frequency (y-axis) against stated probability (x-axis), a well-calibrated forecaster's points, e.g. (20, 20), (40, 40), (60, 60), (80, 80), sit on the diagonal perfect-calibration line y = x.
How to Create
Step 1: Collect 50+ forecasts and outcomes
Step 2: Bin by probability (0-10%, 10-20%, ..., 90-100%)
Step 3: For each bin, calculate actual frequency
Example: 60-70% bin
Forecasts: 15 total, Outcomes: 9 Yes, 6 No
Actual frequency: 9/15 = 60%
Plot point: (65, 60)
Step 4: Draw perfect calibration line (diagonal from (0,0) to (100,100))
Step 5: Compare points to line
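A sketch of Steps 2-3 in Python; `calibration_points` is an illustrative helper, not a library function. It returns (bin centre %, actual frequency %, count) triples you can plot against the diagonal:

```python
import numpy as np

def calibration_points(probs, outcomes, n_bins=10):
    # Bin forecasts by stated probability and compute the actual frequency per bin.
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    points = []
    for k in range(n_bins):
        mask = bins == k
        if not mask.any():
            continue
        centre = 100 * (k + 0.5) / n_bins        # e.g. 65 for the 60-70% bin
        actual = 100 * outcomes[mask].mean()     # e.g. 60 if 9 of 15 resolved Yes
        points.append((centre, actual, int(mask.sum())))
    return points
```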
Over/Under Confidence Detection
Overconfidence: Points below diagonal (said 90%, happened 70%). Fix: Be less extreme, widen intervals.
Underconfidence: Points above diagonal (said 90%, happened 95%). Fix: Be more extreme when evidence is strong.
Sample size: <10/bin unreliable, 10-20 weak, 20-50 moderate, 50+ strong evidence
5. Resolution Analysis
What is Resolution?
Resolution measures ability to assign different probabilities to outcomes that actually differ.
High resolution: Events you call 90% happen much more than events you call 10% (good)
Low resolution: All forecasts near 50%, can't discriminate (bad)
Formula
Resolution = (1/N) × Σₖ nₖ(ōₖ - ō)²
nₖ = number of forecasts in bin k
ōₖ = actual frequency in bin k
ō = overall base rate
Higher is better
How to Improve Resolution
Problem: Stuck at 50%
Bad pattern: All forecasts 48-52% → Low resolution
Good pattern: Range from 20% to 90% → High resolution
Strategies:
- Gather discriminating information - Find features that distinguish outcomes
- Use decomposition - Fermi, causal models, scenarios
- Be bold when warranted - If evidence strong → Say 85% not 65%
- Update with evidence - Start with base rate, update with Bayesian reasoning
Calibration vs Resolution Tradeoff
Perfect Calibration Only: Say 60% for everything when base rate is 60%
→ Calibration: Perfect
→ Resolution: Zero
→ Brier: 0.24 (bad)
High Resolution Only: Say 10% or 90% (extremes) incorrectly
→ Calibration: Poor
→ Resolution: High
→ Brier: Terrible
Optimal Balance: Well-calibrated AND high resolution
→ Calibration: Good
→ Resolution: High
→ Brier: Minimized
Best forecasters: Well-calibrated (low reliability error) + High resolution (discriminate events) = Low Brier
Recommendation: Don't sacrifice resolution for perfect calibration. Be bold when evidence warrants.
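A toy simulation of this tradeoff (the two-level true probabilities 0.2 and 0.9 are an assumed setup, weighted so the base rate is 60%): the calibrated but zero-resolution forecaster scores about 0.24, while the calibrated, discriminating forecaster scores about 0.12.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Each event's true probability is 0.2 or 0.9, mixed so the overall base rate is 0.6.
true_p = rng.choice([0.2, 0.9], size=n, p=[3 / 7, 4 / 7])
outcomes = rng.random(n) < true_p

flat = np.full(n, 0.6)   # calibrated, zero resolution
sharp = true_p           # calibrated, high resolution

print(np.mean((flat - outcomes) ** 2))   # ≈ 0.24
print(np.mean((sharp - outcomes) ** 2))  # ≈ 0.12
```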
6. Sharpness
What is Sharpness?
Sharpness = Tendency to make extreme predictions (away from 50%) when appropriate.
Sharp: Predicts 5% or 95% when evidence supports it (decisive)
Unsharp: Stays near 50% (plays it safe, indecisive)
Why Sharpness Matters
Scenario: Base rate 60%
Unsharp forecaster: 50% for every event → Brier: 0.25, Usefulness: Low
Sharp forecaster: Range 20-90% → Brier: 0.12 (if calibrated), Usefulness: High
Insight: Extreme predictions (when accurate) improve Brier significantly. When wrong, hurt badly. Solution: Be sharp when you have evidence.
Measuring Sharpness
Sharpness = Variance of forecast probabilities
Forecaster A: [0.45, 0.50, 0.48, 0.52, 0.49] → Var ≈ 0.0007 (unsharp)
Forecaster B: [0.15, 0.85, 0.30, 0.90, 0.20] → Var ≈ 0.133 (sharp)
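These variances can be checked directly (sample variance via the standard library):

```python
import statistics

forecaster_a = [0.45, 0.50, 0.48, 0.52, 0.49]
forecaster_b = [0.15, 0.85, 0.30, 0.90, 0.20]

# Sample variance of stated probabilities as a crude sharpness measure.
print(statistics.variance(forecaster_a))  # ≈ 0.0007 (unsharp)
print(statistics.variance(forecaster_b))  # ≈ 0.133 (sharp)
```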
When to Be Sharp
Be sharp (extreme probabilities) when:
- Strong discriminating evidence (multiple independent pieces align)
- Easy questions (outcome nearly certain)
- You have expertise (domain knowledge, track record)
Stay moderate (near 50%) when:
- High uncertainty (limited information, conflicting evidence)
- Hard questions (true probability near 50%)
- No expertise (unfamiliar domain)
Goal: Sharp AND well-calibrated (extreme when warranted, accurate probabilities)
7. Practical Calibration Training
Calibration Exercises
Exercise Set 1: Make 10 forecasts on questions with verifiable answers (fair coin lands heads: 50%, Paris is the capital of France: 99%, two flips both heads: 25%, a die shows 6: 16.67%). Check: did your ~99% forecasts come true essentially every time? Did your 50% forecasts come true about half the time?
Exercise Set 2: Make 20 predictions at "80% confident". Expected: 16/20 correct. Common result: 12-14/20 (overconfident). If so, what feels like "80%" to you should be reported closer to 65%.
Tracking Methods
Method 1: Spreadsheet
| Date | Question | Prob | Outcome | Brier | Notes |
Monthly: Calculate mean Brier
Quarterly: Generate calibration curve
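A sketch of the monthly rollup with pandas, assuming your spreadsheet is exported to a CSV with the columns above; `forecasts.csv` is a placeholder filename:

```python
import pandas as pd

df = pd.read_csv("forecasts.csv", parse_dates=["Date"])
df["Brier"] = (df["Prob"] - df["Outcome"]) ** 2

# Mean Brier score per calendar month
monthly = df.groupby(df["Date"].dt.to_period("M"))["Brier"].mean()
print(monthly)
```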
Method 2: Apps
- PredictionBook.com (free, tracks calibration)
- Metaculus.com (forecasting platform)
- Good Judgment Open (tournament)
Method 3: Focused Practice
- Week 1: Make 20 predictions (focus on honesty)
- Week 2: Check calibration curve (identify bias)
- Week 3: Increase resolution (be bold)
- Week 4: Balance calibration + resolution
Training Drills
Drill 1: Generate 10 "90% confidence intervals" for unknown quantities. Target: 9/10 contain the true value. Common mistake: only 5-7 do (overconfident). Fix: widen your intervals by roughly 1.5×.
Drill 2: Bayesian practice - State prior, observe evidence, update posterior, check calibration.
Drill 3: Make 10 predictions >80% or <20%. Force extremes when "pretty sure". Track: Are >80% happening >80%?
8. Comparison Table of Scoring Rules
Summary
| Feature | Brier | Log | Spherical | Threshold |
|---|---|---|---|---|
| Proper | Strictly | Strictly | Strictly | NO |
| Range | 0 to 1 (lower better) | -∞ to 0 (higher better) | 0 to 1 (higher better) | 0 to 1 |
| Penalty | Quadratic | Logarithmic | Moderate | None |
| Interpretation | Squared error | Bits surprise | Geometric | Binary |
| Usage | Default | High-stakes | Rare | Avoid |
| Human-friendly | Yes | Somewhat | No | Yes (misleading) |
Detailed Comparison
Brier Score
Pros: Easy to interpret, standard in competitions, moderate penalty, good for humans
Cons: Less severe penalty for overconfidence
Best for: General forecasting, calibration training, standard benchmarking
Log Score
Pros: Severe penalty for overconfidence, information-theoretic, strongly incentivizes honesty
Cons: Too punishing for humans, infinite at 0%/100%, less intuitive
Best for: High-stakes forecasting, penalizing overconfidence, ML models, tail risk
Spherical Score
Pros: Strictly proper, bounded, geometric interpretation
Cons: Uncommon, complex formula, rarely used
Best for: Theoretical analysis only
Threshold / Binary Accuracy
Pros: Very intuitive, easy to explain
Cons: NOT proper (incentivizes extremes), ignores calibration, can be gamed
Best for: Nothing (don't use for forecasting)
When to Use Each
| Your Situation | Recommended |
|---|---|
| Starting out | Brier |
| Experienced forecaster | Brier or Log |
| High-stakes decisions | Log |
| Comparing to benchmarks | Brier |
| Building ML model | Log |
| Personal tracking | Brier |
| Teaching others | Brier |
Recommendation: Use Brier as default. Add Log for high-stakes or to penalize overconfidence.
Conversion Example
Forecast: 80%, Outcome: Yes
Brier: (0.80-1)² = 0.04
Log (base 2): log₂(0.80) = -0.322
Spherical: 0.80/√(0.80²+0.20²) = 0.970
Forecast: 80%, Outcome: No
Brier: (0.80-0)² = 0.64
Log (base 2): log₂(0.20) = -2.322 (much worse penalty)
Spherical: 0.20/√(0.80²+0.20²) = 0.243
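A few lines that reproduce these conversions:

```python
import math

def brier(p, o):
    return (p - o) ** 2

def log2_score(p, o):
    return math.log2(p if o else 1 - p)

def spherical(p, o):
    return (p if o else 1 - p) / math.sqrt(p ** 2 + (1 - p) ** 2)

for outcome in (1, 0):  # Yes, then No
    print(brier(0.80, outcome), log2_score(0.80, outcome), spherical(0.80, outcome))
# Yes: ≈ 0.04, -0.322, 0.970
# No:  ≈ 0.64, -2.322, 0.243
```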