Reference Class Selection Guide
The Art and Science of Choosing Comparison Sets
Selecting the right reference class is the most critical judgment call in forecasting. Too broad and the base rate is meaningless. Too narrow and you have no data.
The Goldilocks Principle
Too Broad
- Problem: High variance, low signal
- Example: "Companies" as the reference class for a fintech startup
- Base rate: ~50% fail? 90% fail? Meaningless.
- Why it fails: Includes everything from lemonade stands to Apple
Too Narrow
- Problem: No data, overfitting
- Example: "Fintech startups founded in Q2 2024 by Stanford CS grads in SF"
- Base rate: N = 3 companies, no outcomes yet
- Why it fails: So specific there's no statistical pattern
Just Right
- Sweet spot: Specific enough to be homogeneous, broad enough to have data
- Example: "Seed-stage B2B SaaS startups in financial services"
- Base rate: Can find N = 200+ companies with 5-year outcomes
- Why it works: Specific enough to be meaningful, broad enough for statistics
Systematic Selection Method
Step 1: Define the Core Entity Type
Question: What is the fundamental category?
Examples:
- Company (startup, public company, nonprofit)
- Project (software, construction, research)
- Person (athlete, politician, scientist)
- Event (election, war, natural disaster)
- Product (consumer, enterprise, service)
Output: "This is a [TYPE]"
Step 2: Add Specificity Layers
Work through these dimensions in order of importance:
Layer 1: Stage/Size
- Startups: Pre-seed, Seed, Series A, B, C, Growth
- Projects: Small (<$1M), Medium ($1-10M), Large (>$10M)
- People: Beginner, Intermediate, Expert
- Products: MVP, Version 1.0, Mature
Layer 2: Category/Domain
- Startups: B2B, B2C, B2B2C
- Industry: Fintech, Healthcare, SaaS, Hardware
- Projects: Software, Construction, Pharmaceutical
- People: Role (CEO, Engineer, Designer)
Layer 3: Geography/Market
- US, Europe, Global
- Urban, Rural, Suburban
- Developed, Emerging markets
Layer 4: Time Period
- Current decade (2020s)
- Previous decade (2010s)
- Historical (pre-2010)
Output: "This is a [Stage] [Category] [Geography] [Type] from [Time Period]"
Example: "This is a Seed-stage B2B SaaS startup in the US from 2020-2024"
Step 3: Test for Data Availability
Search queries:
"[Reference Class] success rate"
"[Reference Class] statistics"
"[Reference Class] survival rate"
"How many [Reference Class] succeed"
Data availability check:
- ✓ Found published studies/reports → Good reference class
- ⚠ Found anecdotal data → Usable but weak
- ✗ No data found → Reference class too narrow
If no data: remove the least important specificity layer and retry (the Reference Class Hierarchy section below gives a systematic widening procedure)
Step 4: Validate Homogeneity
Question: Are members of this class similar enough that averaging makes sense?
Test: Variance Check
If you have outcome data, calculate the variance:
- Low variance → Good reference class (outcomes cluster)
- High variance → Bad reference class (outcomes all over the place)
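In code, this check is only a few lines. Below is a minimal sketch, assuming you have numeric outcome data for the candidate class members; the 0.5 coefficient-of-variation threshold is an illustrative choice, not part of the method above.

```python
# Minimal sketch of the variance check. Assumes numeric outcomes for
# candidate class members; the 0.5 coefficient-of-variation threshold
# is an illustrative choice, not a standard value.
from statistics import mean, stdev

def variance_check(outcomes: list[float], cv_threshold: float = 0.5) -> str:
    """Flag a candidate reference class as usable or too heterogeneous."""
    if len(outcomes) < 2:
        return "insufficient data"
    cv = stdev(outcomes) / mean(outcomes)  # coefficient of variation
    return "outcomes cluster: usable" if cv <= cv_threshold else "high variance: subdivide"

# Example: 5-year revenue multiples for two candidate classes
print(variance_check([2.1, 2.4, 1.9, 2.6, 2.2]))   # clusters -> usable
print(variance_check([0.1, 8.0, 0.5, 30.0, 1.2]))  # spread out -> subdivide
```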
Heuristic: The Substitution Test
Pick any two members of the reference class at random.
Ask: "If I swapped one for the other, would the prediction change dramatically?"
- No → Good homogeneity
- Yes → Too broad, needs subdivision
Example:
- "Tech startups" → Swap consumer mobile app for enterprise database company → Prediction changes drastically → Too broad
- "Seed-stage B2B SaaS" → Swap CRM tool for analytics platform → Prediction mostly same → Good homogeneity
Similarity Metrics
When You Can't Find Exact Match
If no perfect reference class exists, use similarity matching to find nearest neighbors.
Dimensions of Similarity
For Startups:
- Business model (B2B, B2C, marketplace, SaaS)
- Revenue model (subscription, transaction, ads)
- Stage/funding (seed, Series A, etc.)
- Team size
- Market size
- Technology complexity
For Projects:
- Size (budget, team size, duration)
- Complexity (simple, moderate, complex)
- Technology maturity (proven, emerging, experimental)
- Team experience
- Dependencies (few, many)
For People:
- Experience level
- Domain expertise
- Resources available
- Historical track record
- Contextual factors (support, environment)
Similarity Scoring
Method: Nearest Neighbors
- List all dimensions of similarity (5-7 dimensions)
- For each dimension, score how similar the case is to the reference class (0-10)
- Average the scores
- Threshold: If similarity < 7/10, the reference class may not apply
Example: Comparing "AI startup in 2024" to the "Software startups, 2010-2020" reference class:
- Business model: 9/10 (same)
- Revenue model: 8/10 (mostly SaaS)
- Technology maturity: 4/10 (AI is newer)
- Market size: 7/10 (comparable)
- Team size: 8/10 (similar)
- Funding environment: 5/10 (tighter in 2024)
Average: 6.8/10 → Marginal reference class; use with caution
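This scoring method is simple enough to automate. A minimal sketch, using the dimension scores from the worked example above; the 7/10 threshold is the one stated in the method:

```python
# Nearest-neighbour similarity scoring: average per-dimension scores
# (0-10) and compare against the 7/10 applicability threshold.
def similarity_score(scores: dict[str, float], threshold: float = 7.0) -> tuple[float, bool]:
    avg = sum(scores.values()) / len(scores)
    return avg, avg >= threshold

# Dimension scores from the "AI startup in 2024" example
scores = {
    "business model": 9, "revenue model": 8, "technology maturity": 4,
    "market size": 7, "team size": 8, "funding environment": 5,
}
avg, applies = similarity_score(scores)
print(f"{avg:.1f}/10 -> {'applies' if applies else 'use with caution'}")
# 6.8/10 -> use with caution
```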
Edge Cases and Judgment Calls
Case 1: Structural Regime Change
Problem: Conditions have changed fundamentally since historical data
Examples:
- Pre-internet vs post-internet business
- Pre-COVID vs post-COVID work patterns
- Pre-AI vs post-AI software development
Solution:
- Segment data by era if possible
- Use most recent data only
- Adjust base rate for known structural differences
- Increase uncertainty bounds
Case 2: The N=1 Problem
Problem: The case is literally unique (first of its kind)
Examples:
- First moon landing
- First pandemic of a novel pathogen
- First AGI system
Solution:
- Widen the class: go up one abstraction level
  - "First moon landing" → "First major engineering projects"
  - "Novel pandemic" → "Past pandemics of any type"
- Component decomposition: break into parts that have reference classes (a sketch follows this list)
  - "Moon landing" → Rocket success rate × Navigation accuracy × Life support reliability
- Expert aggregation: when no data exists, aggregate expert predictions (but with humility)
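For component decomposition, the combined estimate is the product of the component rates, under the strong assumption that components fail independently (real decompositions should question this). A sketch with hypothetical numbers:

```python
# Component decomposition with hypothetical component reliabilities.
# The product assumes the components fail independently.
components = {
    "rocket success rate": 0.90,
    "navigation accuracy": 0.95,
    "life support reliability": 0.97,
}

p_success = 1.0
for name, p in components.items():
    p_success *= p

print(f"P(mission success) = {p_success:.3f}")  # 0.90 * 0.95 * 0.97 = 0.829
```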
Case 3: Multiple Plausible Reference Classes
Problem: Event could belong to multiple classes with different base rates
Example: "Elon Musk starting a brain-computer interface company"
Possible reference classes:
- "Startups by serial entrepreneurs" → 40% success
- "Medical device startups" → 10% success
- "Moonshot technology ventures" → 5% success
- "Companies founded by Elon Musk" → 80% success
Solution: Ensemble Averaging
- Identify all plausible reference classes
- Find base rate for each
- Weight by relevance/similarity
- Calculate weighted average
Example weights:
- Medical device (40%): 10% × 0.4 = 4%
- Moonshot tech (30%): 5% × 0.3 = 1.5%
- Serial entrepreneur (20%): 40% × 0.2 = 8%
- Elon track record (10%): 80% × 0.1 = 8%
Weighted base rate: 21.5%
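The same ensemble average, expressed as code; the base rates and weights are the illustrative figures from the example:

```python
# Ensemble averaging: base rates weighted by relevance. Weights must sum to 1.
classes = {
    "medical device startups":          (0.10, 0.40),
    "moonshot technology ventures":     (0.05, 0.30),
    "startups by serial entrepreneurs": (0.40, 0.20),
    "companies founded by Elon Musk":   (0.80, 0.10),
}

assert abs(sum(w for _, w in classes.values()) - 1.0) < 1e-9

weighted = sum(rate * w for rate, w in classes.values())
print(f"Weighted base rate: {weighted:.1%}")  # 21.5%
```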
Common Selection Mistakes
Mistake 1: Cherry-Picking Success Examples
- What it looks like: "Reference class = Companies like Apple, Google, Facebook"
- Why it's wrong: Survivorship bias; you're only looking at winners
- Fix: Include all attempts, not just successes
Mistake 2: Availability Bias
- What it looks like: Reference class built from recent, memorable cases
- Why it's wrong: Recent events are overweighted in memory
- Fix: Use systematic data collection, not what comes to mind
Mistake 3: Confirmation Bias
- What it looks like: Choosing the reference class that supports your prior belief
- Why it's wrong: You're reverse-engineering the answer
- Fix: Choose the reference class BEFORE looking at the base rate
Mistake 4: Overfitting to Irrelevant Details
- What it looks like: "Female, left-handed CEOs who went to Ivy League schools"
- Why it's wrong: Most details don't matter; you're adding noise
- Fix: Only include features that causally affect outcomes
Mistake 5: Ignoring Time Decay
- What it looks like: Using data from the 1970s for a 2024 prediction
- Why it's wrong: The world has changed
- Fix: Weight recent data more heavily (one approach is sketched below), or segment by era
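One concrete way to weight recent data more heavily is an exponentially decayed average; in the sketch below, the 10-year half-life and the per-era figures are illustrative assumptions, not recommendations.

```python
# Recency-weighted base rate: each era's sample is discounted by
# 0.5^(age / half_life). All figures are hypothetical.
def decayed_base_rate(eras: list[tuple[int, float, int]],
                      now: int = 2024, half_life: float = 10.0) -> float:
    """eras: (midpoint year, success rate, sample size)."""
    weights = [(n * 0.5 ** ((now - year) / half_life), rate)
               for year, rate, n in eras]
    total = sum(w for w, _ in weights)
    return sum(w * rate for w, rate in weights) / total

eras = [(1975, 0.30, 500), (1995, 0.22, 800), (2015, 0.15, 1200)]
print(f"{decayed_base_rate(eras):.1%}")  # dominated by the most recent era
```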
Reference Class Hierarchy
Start Specific, Widen as Needed
Level 1: Maximally Specific (Try this first)
- Example: "Seed-stage B2B cybersecurity SaaS in US, 2020-2024"
- Check for data → If N ≥ 30, use this
Level 2: Drop One Feature (If L1 has no data)
- Example: "Seed-stage B2B SaaS in US, 2020-2024" (removed "cybersecurity")
- Check for data → If N ≥ 30, use this
Level 3: Drop Two Features (If L2 has no data)
- Example: "Seed-stage B2B SaaS, 2020-2024" (removed "US")
- Check for data → If N ≥ 30, use this
Level 4: Generic Category (Last resort)
- Example: "Seed-stage startups"
- Always has data, but high variance
Rule: Use the most specific level that still gives you N ≥ 30 data points.
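The widening loop is mechanical enough to sketch in code. Here `lookup_n` stands in for whatever data source you query, and the counts are hypothetical:

```python
# Start-specific, widen-as-needed: drop the least important feature
# (last in the list) until the class has at least MIN_N data points.
MIN_N = 30

def widen_until_data(features: list[str], lookup_n) -> list[str]:
    cls = list(features)  # ordered most -> least important
    while cls and lookup_n(cls) < MIN_N:
        cls.pop()
    return cls

# Hypothetical N counts per class definition
counts = {
    ("seed-stage", "B2B SaaS", "US", "cybersecurity"): 12,
    ("seed-stage", "B2B SaaS", "US"): 28,
    ("seed-stage", "B2B SaaS"): 210,
    ("seed-stage",): 50_000,
}
lookup = lambda cls: counts.get(tuple(cls), 0)
print(widen_until_data(["seed-stage", "B2B SaaS", "US", "cybersecurity"], lookup))
# ['seed-stage', 'B2B SaaS']
```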
Checklist: Is This a Good Reference Class?
Use this to validate your choice:
- Sample size ≥ 30 historical cases
- Homogeneity: Members are similar enough that averaging makes sense
- Relevance: Data is from appropriate time period (last 10 years preferred)
- Specificity: Class is narrow enough to be meaningful
- Data availability: Base rate is published or calculable
- No survivorship bias: Includes failures, not just successes
- No cherry-picking: Class chosen before looking at base rate
- Causal relevance: Features included actually affect outcomes
- If ≥ 6 checked: Good reference class
- If 4-5 checked: Acceptable, but increase uncertainty
- If < 4 checked: Find a better reference class
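If you score many candidate classes, the verdict logic reduces to a count; the criteria names below abbreviate the eight items above:

```python
# Checklist verdict: count satisfied criteria and map to a verdict.
CRITERIA = ["sample size >= 30", "homogeneity", "relevance", "specificity",
            "data availability", "no survivorship bias", "no cherry-picking",
            "causal relevance"]

def checklist_verdict(checked: set[str]) -> str:
    n = sum(c in checked for c in CRITERIA)
    if n >= 6:
        return "good reference class"
    if n >= 4:
        return "acceptable, but increase uncertainty"
    return "find a better reference class"

print(checklist_verdict({"homogeneity", "relevance", "specificity",
                         "data availability", "causal relevance"}))
# acceptable, but increase uncertainty
```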
Advanced: Bayesian Reference Class Selection
When you have multiple plausible reference classes, you can use Bayesian reasoning:
Step 1: Prior Distribution Over Classes
Assign a probability to each reference class being the "true" one
Example:
- P(Class = "B2B SaaS") = 60%
- P(Class = "All SaaS") = 30%
- P(Class = "All startups") = 10%
Step 2: Likelihood of Observed Features
How likely is this specific case under each class?
Step 3: Posterior Distribution
Update class probabilities using Bayes' rule
Step 4: Weighted Base Rate
Average base rates weighted by posterior probabilities
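A minimal sketch of the four steps, assuming you can supply a likelihood P(features | class) for each candidate; the priors follow the example above, while the likelihoods and base rates are illustrative assumptions:

```python
# Bayesian reference class selection: posterior ∝ prior × likelihood,
# then base rates are averaged under the posterior.
def bayesian_base_rate(classes: dict[str, tuple[float, float, float]]) -> float:
    """classes: name -> (prior, likelihood of observed features, base rate)."""
    unnorm = {name: prior * lik for name, (prior, lik, _) in classes.items()}
    z = sum(unnorm.values())
    posterior = {name: w / z for name, w in unnorm.items()}
    return sum(posterior[name] * rate for name, (_, _, rate) in classes.items())

classes = {
    "B2B SaaS":     (0.60, 0.80, 0.15),
    "All SaaS":     (0.30, 0.50, 0.12),
    "All startups": (0.10, 0.20, 0.10),
}
print(f"{bayesian_base_rate(classes):.1%}")  # ≈ 14.2%
```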
This is advanced. Default to the systematic selection method above unless you have strong quantitative skills.
Practical Workflow
Quick Protocol (5 minutes)
- Name the core type: "This is a [X]"
- Add 2-3 specificity layers: Stage, category, geography
- Google the base rate: "[Reference class] success rate"
- Sanity check: Does N > 30? Are members similar?
- Use it: This is your starting probability
Rigorous Protocol (30 minutes)
- Systematic selection (Steps 1-4 above)
- Similarity scoring for validation
- Check for structural regime changes
- Consider multiple reference classes
- Weighted ensemble if multiple classes
- Document assumptions and limitations
Return to: Main Skill