# Scoring Rules and Calibration

Comprehensive guide to proper scoring rules, calibration measurement, and forecast accuracy improvement.

## Table of Contents

1. [Proper Scoring Rules](#1-proper-scoring-rules)
2. [Brier Score Deep Dive](#2-brier-score-deep-dive)
3. [Log Score](#3-log-score-logarithmic-scoring-rule)
4. [Calibration Curves](#4-calibration-curves)
5. [Resolution Analysis](#5-resolution-analysis)
6. [Sharpness](#6-sharpness)
7. [Practical Calibration Training](#7-practical-calibration-training)
8. [Comparison Table](#8-comparison-table-of-scoring-rules)

---
## 1. Proper Scoring Rules

### What is a Scoring Rule?

A **scoring rule** assigns a numerical score to a probabilistic forecast based on the forecast and the actual outcome.

**Purpose:** Measure accuracy, incentivize honesty, enable comparison, track calibration over time.

### Strictly Proper vs Merely Proper

**Strictly Proper:** Reporting your true belief uniquely maximizes your expected score. No other probability gives a better expected score.

**Why it matters:** Incentivizes honesty, eliminates gaming, optimizes for accurate beliefs.

**Proper (but not strictly):** Your true belief maximizes expected score, but other probabilities might tie it. Less desirable for forecasting.

### Common Proper Scoring Rules

**1. Brier Score** (strictly proper)
```
Score = -(p - o)²

p = Your probability (0 to 1)
o = Outcome (0 or 1)
```

**2. Logarithmic Score** (strictly proper)
```
Score = log(p)   if outcome occurs
Score = log(1-p) if outcome doesn't occur
```

**3. Spherical Score** (strictly proper)
```
Score = p / √(p² + (1-p)²)     if outcome occurs
Score = (1-p) / √(p² + (1-p)²) if outcome doesn't occur
```

### Common IMPROPER Scoring Rules (Avoid)

**Absolute Error:** `Score = -|p - o|` → Incentivizes extremes (NOT proper)

**Threshold Accuracy:** Binary right/wrong → Ignores calibration (NOT proper)

**Example of gaming improper rules:**
```
Using absolute error (improper):
True belief: 60% → Optimal report: 100% (dishonest)

Using Brier score (proper):
True belief: 60% → Optimal report: 60% (honest)
```
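
The gaming effect is easy to verify numerically. Here is a minimal Python sketch (the helper names and the grid search are illustrative, not from the original) that finds which report maximizes expected score under each rule when your true belief is 60%:

```python
def expected_score(report, belief, score_fn):
    """Expected score of stating `report` when your true belief is `belief`."""
    return belief * score_fn(report, 1) + (1 - belief) * score_fn(report, 0)

brier_score = lambda p, o: -(p - o) ** 2   # strictly proper
abs_error = lambda p, o: -abs(p - o)       # improper

belief = 0.60
grid = [i / 100 for i in range(101)]       # candidate reports 0.00 .. 1.00
for name, fn in [("Brier", brier_score), ("absolute error", abs_error)]:
    best = max(grid, key=lambda p: expected_score(p, belief, fn))
    print(f"{name}: optimal report = {best:.2f}")
# Brier: optimal report = 0.60           (honest)
# absolute error: optimal report = 1.00  (extreme and dishonest)
```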

**Key Principle:** Only use strictly proper scoring rules for forecast evaluation.

---
## 2. Brier Score Deep Dive

### Formula

**Single forecast:** `Brier = (p - o)²`

**Multiple forecasts:** `Brier = (1/N) × Σ (pᵢ - oᵢ)²`

**Range:** 0.00 (perfect) to 1.00 (worst). Lower is better. (Section 1 wrote Brier as a score to maximize, `-(p - o)²`; here it is the equivalent loss to minimize.)

### Calculation Examples

```
90% Yes → (0.90-1)² = 0.01 (good)   | 90% No  → (0.90-0)² = 0.81 (bad)
60% Yes → (0.60-1)² = 0.16 (medium) | 50% Any → 0.25 (baseline)
```
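
The same arithmetic as a minimal Python sketch (function names are mine, not from the original):

```python
def brier(p: float, o: int) -> float:
    """Brier loss for one forecast: stated probability p, outcome o (1=Yes, 0=No)."""
    return (p - o) ** 2

def mean_brier(forecasts: list[tuple[float, int]]) -> float:
    """Mean Brier loss over (probability, outcome) pairs."""
    return sum(brier(p, o) for p, o in forecasts) / len(forecasts)

print(brier(0.90, 1))   # 0.01 (good)
print(brier(0.90, 0))   # 0.81 (bad)
print(mean_brier([(0.90, 1), (0.60, 1), (0.50, 0)]))  # (0.01+0.16+0.25)/3 = 0.14
```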

### Brier Score Decomposition

**Murphy Decomposition:**
```
Brier Score = Reliability - Resolution + Uncertainty
```

**Reliability (Calibration Error):** Are your probabilities correct on average? (Lower is better)

**Resolution:** Do you assign different probabilities to different outcomes? (Higher is better)

**Uncertainty:** Base rate variance (uncontrollable; depends on the problem)

**Improving Brier:**
1. Minimize reliability (fix calibration)
2. Maximize resolution (differentiate forecasts; see the sketch below)
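
A sketch of how the decomposition is computed in practice. Here each distinct stated probability is treated as its own bin (a simplification I chose so the identity holds exactly; the function names are mine):

```python
import numpy as np

def murphy_decomposition(probs, outcomes):
    """Murphy decomposition, grouping forecasts by distinct stated probability."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    n, base_rate = len(probs), outcomes.mean()
    rel = res = 0.0
    for p_k in np.unique(probs):
        mask = probs == p_k
        n_k, o_k = mask.sum(), outcomes[mask].mean()  # bin size, actual frequency
        rel += n_k * (p_k - o_k) ** 2                 # calibration error term
        res += n_k * (o_k - base_rate) ** 2           # discrimination term
    unc = base_rate * (1 - base_rate)                 # base rate variance
    return rel / n, res / n, unc

probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6, 0.5, 0.9]
outcomes = [1, 1, 0, 0, 0, 1, 1, 1]
rel, res, unc = murphy_decomposition(probs, outcomes)
print(rel - res + unc)   # 0.13625 -- equals the mean Brier score:
print(np.mean((np.asarray(probs) - np.asarray(outcomes)) ** 2))
```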

### Brier Score Interpretation

| Brier Score | Quality | Description |
|-------------|---------|-------------|
| 0.00 - 0.05 | Exceptional | Near-perfect |
| 0.05 - 0.10 | Excellent | Top tier |
| 0.10 - 0.15 | Good | Skilled |
| 0.15 - 0.20 | Average | Better than random |
| 0.20 - 0.25 | Below Average | Approaching random |
| 0.25+ | Poor | At or worse than random |

**Context matters:** Easy questions should yield lower scores. Compare to the 50% baseline (0.25) and to other forecasters.

### Improving Your Brier Score

**Path 1: Fix Calibration**

**If overconfident:** Your 80% predictions happen only 60% of the time → Be less extreme, widen intervals

**If underconfident:** Your 60% predictions happen 80% of the time → Be more extreme when you have evidence

**Path 2: Improve Resolution**

**Problem:** All forecasts near 50% → Differentiate easy vs hard questions, research more, be bold when warranted

**Balance:** `Good Forecaster = Well-Calibrated + High Resolution`

### Brier Skill Score

```
BSS = 1 - (Your Brier / Baseline Brier)

Example:
Your Brier: 0.12, Baseline: 0.25
BSS = 1 - 0.48 = 0.52 (52% improvement over baseline)
```
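
The same calculation as a one-line Python helper (a sketch; the 0.25 default assumes the always-50% baseline used above):

```python
def brier_skill_score(your_brier: float, baseline_brier: float = 0.25) -> float:
    """BSS = 1 - your_brier / baseline_brier (0.25 = always forecasting 50%)."""
    return 1 - your_brier / baseline_brier

print(brier_skill_score(0.12))   # 0.52 -> 52% improvement over the baseline
```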

**Interpretation:** BSS = 1.00 (perfect), 0.00 (same as baseline), < 0 (worse than baseline)

---

## 3. Log Score (Logarithmic Scoring Rule)

### Formula

```
Log Score = log₂(p)   if outcome occurs
Log Score = log₂(1-p) if outcome doesn't occur

Range: -∞ (worst) to 0 (perfect)
Higher (less negative) is better
```

### Calculation Examples

```
90% Yes → -0.15 | 90% No → -3.32 (severe) | 50% Yes → -1.00
99% No  → -6.64 (catastrophic penalty for overconfidence)
```
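
A sketch of the log score in Python; the `eps` clamp is my addition, there to keep a hard 0%/100% forecast from returning negative infinity:

```python
import math

def log_score(p: float, o: int, eps: float = 1e-9) -> float:
    """Log score in bits. o = 1 if the event occurred, else 0."""
    p = min(max(p, eps), 1 - eps)      # avoid log(0) at extreme forecasts
    return math.log2(p if o == 1 else 1 - p)

print(round(log_score(0.90, 1), 2))   # -0.15
print(round(log_score(0.90, 0), 2))   # -3.32 (severe)
print(round(log_score(0.99, 0), 2))   # -6.64 (catastrophic overconfidence)
```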

### Relationship to Information Theory

**Log score measures bits of surprise:**
```
Surprise = -log₂(p)

p = 50%   → 1 bit surprise
p = 25%   → 2 bits surprise
p = 12.5% → 3 bits surprise
```

**Connection to entropy:** The expected negative log score is the cross-entropy between the true outcome distribution and your forecast distribution.

### When to Use Log Score vs Brier

**Use Log Score when:**
- Severe penalty for overconfidence desired
- Tail risk matters (rare events important)
- Information-theoretic interpretation useful
- Comparing probabilistic models

**Use Brier Score when:**
- Human forecasters (less punishing)
- Easier interpretation (squared error)
- Standard benchmark (more common)
- Avoiding extreme penalties

**Key Difference:**

Brier: Quadratic penalty (grows with the square of the error)
```
Error: 10% → 0.01, 20% → 0.04, 30% → 0.09, 40% → 0.16
```

Log: Penalty grows much faster for confident misses
```
Forecast: 90% wrong → -3.3, 95% wrong → -4.3, 99% wrong → -6.6
```

**Recommendation:** Default to Brier. Add Log for high-stakes forecasts or to penalize overconfidence. Track both for a complete picture.

---

## 4. Calibration Curves

### What is a Calibration Curve?

**Visualization of forecast accuracy:**
```
Y-axis: Actual frequency (how often the outcome occurred)
X-axis: Stated probability (your forecasts)
Perfect calibration: Diagonal line (y = x)
```

**Example:**
```
Actual %
100 ┤                     ╱
 80 ┤                ●
 60 ┤           ●
 40 ┤      ●   ← Perfect calibration line
 20 ┤ ●
  0 └───────────────────────
    0   20   40   60   80   100
          Stated probability %
```

### How to Create

**Step 1:** Collect 50+ forecasts and outcomes

**Step 2:** Bin by probability (0-10%, 10-20%, ..., 90-100%)

**Step 3:** For each bin, calculate the actual frequency
```
Example: 60-70% bin
Forecasts: 15 total, Outcomes: 9 Yes, 6 No
Actual frequency: 9/15 = 60%
Plot point: (65, 60)
```

**Step 4:** Draw the perfect calibration line (diagonal from (0,0) to (100,100))

**Step 5:** Compare points to the line, as in the sketch below
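
The steps above translate to a short Python sketch (the equal-width bins and the names are illustrative choices of mine):

```python
import numpy as np

def calibration_points(probs, outcomes, n_bins=10):
    """For each probability bin, return (mean stated prob, actual frequency, count)."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    bin_ids = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    points = []
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            points.append((probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return points

# Each (stated, actual) pair is one point to plot against the diagonal y = x:
for stated, actual, n in calibration_points([0.65, 0.62, 0.68, 0.90, 0.91],
                                            [1, 0, 1, 1, 1]):
    print(f"stated {stated:.0%} -> actual {actual:.0%} (n={n})")
```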

### Over/Under Confidence Detection

**Overconfidence:** Points below the diagonal (said 90%, happened 70%). Fix: Be less extreme, widen intervals.

**Underconfidence:** Points above the diagonal (said 90%, happened 95%). Fix: Be more extreme when evidence is strong.

**Sample size:** <10 per bin unreliable, 10-20 weak, 20-50 moderate, 50+ strong evidence

---

## 5. Resolution Analysis

### What is Resolution?

**Resolution** measures the ability to assign different probabilities to outcomes that actually differ.

**High resolution:** Events you call 90% happen much more often than events you call 10% (good)

**Low resolution:** All forecasts near 50%, can't discriminate (bad)

### Formula

```
Resolution = (1/N) × Σ nₖ(oₖ - ō)²

nₖ = Number of forecasts in bin k
oₖ = Actual frequency in bin k
ō  = Overall base rate

Higher is better
```
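
A direct translation of the formula (a sketch reusing the same equal-width binning as above; names are mine):

```python
import numpy as np

def resolution(probs, outcomes, n_bins=10):
    """Resolution = (1/N) * sum over bins of n_k * (o_k - o_bar)^2."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    o_bar = outcomes.mean()                       # overall base rate
    bin_ids = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            total += mask.sum() * (outcomes[mask].mean() - o_bar) ** 2
    return total / len(probs)

print(resolution([0.5] * 6, [1, 0, 1, 0, 1, 0]))   # 0.0  -- no discrimination
print(resolution([0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
                 [1, 1, 1, 0, 0, 0]))              # 0.25 -- high resolution
```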

### How to Improve Resolution

**Problem: Stuck at 50%**

Bad pattern: All forecasts 48-52% → Low resolution

Good pattern: Range from 20% to 90% → High resolution

**Strategies:**

1. **Gather discriminating information** - Find features that distinguish outcomes
2. **Use decomposition** - Fermi estimates, causal models, scenarios
3. **Be bold when warranted** - If the evidence is strong → Say 85%, not 65%
4. **Update with evidence** - Start with the base rate, update with Bayesian reasoning

### Calibration vs Resolution Tradeoff

```
Perfect Calibration Only: Say 60% for everything when base rate is 60%
→ Calibration: Perfect
→ Resolution: Zero
→ Brier: 0.24 (bad)

High Resolution Only: Say 10% or 90% (extremes) incorrectly
→ Calibration: Poor
→ Resolution: High
→ Brier: Terrible

Optimal Balance: Well-calibrated AND high resolution
→ Calibration: Good
→ Resolution: High
→ Brier: Minimized
```

**Best forecasters:** Well-calibrated (low reliability error) + High resolution (discriminate events) = Low Brier

**Recommendation:** Don't sacrifice resolution for perfect calibration. Be bold when the evidence warrants it.

---

## 6. Sharpness

### What is Sharpness?

**Sharpness** = Tendency to make extreme predictions (away from 50%) when appropriate.

**Sharp:** Predicts 5% or 95% when the evidence supports it (decisive)

**Unsharp:** Stays near 50% (plays it safe, indecisive)

### Why Sharpness Matters

```
Scenario: Base rate 60%

Unsharp forecaster: 50% for every event → Brier: 0.25, Usefulness: Low
Sharp forecaster: Range 20-90% → Brier: 0.12 (if calibrated), Usefulness: High
```

**Insight:** Extreme predictions (when accurate) improve Brier significantly. When wrong, they hurt badly. Solution: Be sharp when you have the evidence.

### Measuring Sharpness

```
Sharpness = Variance of forecast probabilities

Forecaster A: [0.45, 0.50, 0.48, 0.52, 0.49] → Var ≈ 0.0007 (unsharp)
Forecaster B: [0.15, 0.85, 0.30, 0.90, 0.20] → Var ≈ 0.1333 (sharp)
```
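
The numbers above use the sample variance (ddof=1; whether to use sample or population variance is a convention choice, and either works as a relative measure of sharpness):

```python
import numpy as np

a = [0.45, 0.50, 0.48, 0.52, 0.49]   # Forecaster A
b = [0.15, 0.85, 0.30, 0.90, 0.20]   # Forecaster B

print(np.var(a, ddof=1))   # ≈ 0.0007 (unsharp)
print(np.var(b, ddof=1))   # ≈ 0.1333 (sharp)
```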

### When to Be Sharp

**Be sharp (extreme probabilities) when:**
- Strong discriminating evidence (multiple independent pieces align)
- Easy questions (outcome nearly certain)
- You have expertise (domain knowledge, track record)

**Stay moderate (near 50%) when:**
- High uncertainty (limited information, conflicting evidence)
- Hard questions (true probability near 50%)
- No expertise (unfamiliar domain)

**Goal:** Sharp AND well-calibrated (extreme when warranted, accurate probabilities)

---

## 7. Practical Calibration Training

### Calibration Exercises

**Exercise Set 1:** Make 10 forecasts on verifiable questions (fair coin lands heads: 50%, Paris is the capital of France: 99%, two coin flips both heads: 25%, die shows a 6: 16.67%). Check against the known probabilities: did your 50% events come true about half the time? Your 99% events nearly always?

**Exercise Set 2:** Make 20 predictions at "80% confident". Expected: 16/20 correct. Common result: 12-14/20 (overconfident). If so, what feels like "80%" to you should be reported as closer to "65%".

### Tracking Methods

**Method 1: Spreadsheet**
```
| Date | Question | Prob | Outcome | Brier | Notes |

Monthly: Calculate mean Brier
Quarterly: Generate calibration curve
```
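
The same tracking can be scripted over a CSV export of that spreadsheet (a sketch; the file name and column names are hypothetical):

```python
import csv

def mean_brier_from_log(path: str) -> float:
    """Mean Brier from a CSV log with 'prob' and 'outcome' columns."""
    with open(path, newline="") as f:
        rows = [(float(r["prob"]), int(r["outcome"])) for r in csv.DictReader(f)]
    return sum((p - o) ** 2 for p, o in rows) / len(rows)

# Example: forecasts.csv logs one forecast per row, e.g. "2024-01-05,Rain?,0.7,1"
# under a header of: date,question,prob,outcome
print(mean_brier_from_log("forecasts.csv"))
```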

**Method 2: Apps**
- PredictionBook.com (free, tracks calibration)
- Metaculus.com (forecasting platform)
- Good Judgment Open (tournament)

**Method 3: Focused Practice**
- Week 1: Make 20 predictions (focus on honesty)
- Week 2: Check calibration curve (identify bias)
- Week 3: Increase resolution (be bold)
- Week 4: Balance calibration + resolution

### Training Drills

**Drill 1:** Generate 10 "90% confidence intervals" for unknown quantities. Target: 9/10 contain the true value. Common mistake: only 5-7 do (overconfident). Fix: Widen intervals by ~1.5×.

**Drill 2:** Bayesian practice - State a prior, observe evidence, update to a posterior, check calibration.

**Drill 3:** Make 10 predictions at >80% or <20%. Force extremes when "pretty sure". Track: Are your >80% predictions happening >80% of the time?

---

## 8. Comparison Table of Scoring Rules

### Summary

| Feature | Brier | Log | Spherical | Threshold |
|---------|-------|-----|-----------|-----------|
| **Proper** | Strictly | Strictly | Strictly | NO |
| **Range** | 0 to 1 | -∞ to 0 | 0 to 1 | 0 to 1 |
| **Penalty** | Quadratic | Logarithmic | Moderate | None |
| **Interpretation** | Squared error | Bits of surprise | Geometric | Binary |
| **Usage** | Default | High-stakes | Rare | Avoid |
| **Human-friendly** | Yes | Somewhat | No | Yes (misleading) |

### Detailed Comparison

**Brier Score**

Pros: Easy to interpret, standard in competitions, moderate penalty, good for humans

Cons: Less severe penalty for overconfidence

Best for: General forecasting, calibration training, standard benchmarking

**Log Score**

Pros: Severe penalty for overconfidence, information-theoretic, strongly incentivizes honesty

Cons: Too punishing for humans, infinite at 0%/100%, less intuitive

Best for: High-stakes forecasting, penalizing overconfidence, ML models, tail risk

**Spherical Score**

Pros: Strictly proper, bounded, geometric interpretation

Cons: Uncommon, complex formula, rarely used

Best for: Theoretical analysis only

**Threshold / Binary Accuracy**

Pros: Very intuitive, easy to explain

Cons: NOT proper (incentivizes extremes), ignores calibration, can be gamed

Best for: Nothing (don't use for forecasting)

### When to Use Each

| Your Situation | Recommended |
|----------------|-------------|
| Starting out | **Brier** |
| Experienced forecaster | **Brier** or **Log** |
| High-stakes decisions | **Log** |
| Comparing to benchmarks | **Brier** |
| Building ML model | **Log** |
| Personal tracking | **Brier** |
| Teaching others | **Brier** |

**Recommendation:** Use **Brier** as the default. Add **Log** for high-stakes forecasts or to penalize overconfidence.

### Conversion Example

**Forecast: 80%, Outcome: Yes**
```
Brier: (0.80-1)² = 0.04
Log (base 2): log₂(0.80) = -0.322
Spherical: 0.80/√(0.80²+0.20²) = 0.970
```

**Forecast: 80%, Outcome: No**
```
Brier: (0.80-0)² = 0.64
Log (base 2): log₂(0.20) = -2.322 (much worse penalty)
Spherical: 0.20/√(0.80²+0.20²) = 0.243
```
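
The conversion can be scripted so every forecast is scored under all three rules at once (a sketch; the function name is mine):

```python
import math

def all_scores(p: float, o: int) -> dict:
    """Score one binary forecast under the Brier, log (base 2), and spherical rules."""
    q = p if o == 1 else 1 - p   # probability assigned to what actually happened
    return {
        "brier": (p - o) ** 2,
        "log2": math.log2(q),
        "spherical": q / math.sqrt(p ** 2 + (1 - p) ** 2),
    }

print(all_scores(0.80, 1))   # brier 0.04, log2 ≈ -0.322, spherical ≈ 0.970
print(all_scores(0.80, 0))   # brier 0.64, log2 ≈ -2.322, spherical ≈ 0.243
```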

---

## Return to Main Skill

[← Back to Market Mechanics & Betting](../SKILL.md)

**Related Resources:**
- [Betting Theory Fundamentals](betting-theory.md)
- [Kelly Criterion Deep Dive](kelly-criterion.md)