Reference Class Selection Guide

The Art and Science of Choosing Comparison Sets

Selecting the right reference class is the most critical judgment call in forecasting. Too broad and the base rate is meaningless. Too narrow and you have no data.


The Goldilocks Principle

Too Broad

  • Problem: High variance, low signal
  • Example: "Companies" as the reference class for a fintech startup
  • Base rate: ~50% fail? 90% fail? Meaningless.
  • Why it fails: Includes everything from lemonade stands to Apple

Too Narrow

  • Problem: No data, overfitting
  • Example: "Fintech startups founded in Q2 2024 by Stanford CS grads in SF"
  • Base rate: N = 3 companies, no outcomes yet
  • Why it fails: So specific there's no statistical pattern

Just Right

  • Sweet spot: Specific enough to be homogeneous, broad enough to have data
  • Example: "Seed-stage B2B SaaS startups in financial services"
  • Base rate: Can find N = 200+ companies with 5-year outcomes
  • Why it works: Specific enough to be meaningful, broad enough for statistics


Systematic Selection Method

Step 1: Define the Core Entity Type

Question: What is the fundamental category?

Examples:

  • Company (startup, public company, nonprofit)
  • Project (software, construction, research)
  • Person (athlete, politician, scientist)
  • Event (election, war, natural disaster)
  • Product (consumer, enterprise, service)

Output: "This is a [TYPE]"


Step 2: Add Specificity Layers

Work through these dimensions in order of importance:

Layer 1: Stage/Size

  • Startups: Pre-seed, Seed, Series A, B, C, Growth
  • Projects: Small (<$1M), Medium ($1-10M), Large (>$10M)
  • People: Beginner, Intermediate, Expert
  • Products: MVP, Version 1.0, Mature

Layer 2: Category/Domain

  • Startups: B2B, B2C, B2B2C
  • Industry: Fintech, Healthcare, SaaS, Hardware
  • Projects: Software, Construction, Pharmaceutical
  • People: Role (CEO, Engineer, Designer)

Layer 3: Geography/Market

  • US, Europe, Global
  • Urban, Rural, Suburban
  • Developed, Emerging markets

Layer 4: Time Period

  • Current decade (2020s)
  • Previous decade (2010s)
  • Historical (pre-2010)

Output: "This is a [Stage] [Category] [Geography] [Type] from [Time Period]"

Example: "This is a Seed-stage B2B SaaS startup in the US from 2020-2024"


Step 3: Test for Data Availability

Search queries:

"[Reference Class] success rate"
"[Reference Class] statistics"
"[Reference Class] survival rate"
"How many [Reference Class] succeed"

Data availability check:

  • ✓ Found published studies/reports → Good reference class
  • ⚠ Found anecdotal data → Usable but weak
  • ✗ No data found → Reference class too narrow

If no data: Remove the least important specificity layer and retry


Step 4: Validate Homogeneity

Question: Are members of this class similar enough that averaging makes sense?

Test: Variance Check. If you have outcome data, calculate the variance (a minimal sketch follows this list):

  • Low variance → Good reference class (outcomes cluster)
  • High variance → Bad reference class (outcomes all over the place)
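
For a quick quantitative version of this check, here is a minimal sketch in Python. It assumes you have numeric outcome data for candidate class members; the coefficient-of-variation cutoff of 1.0 is an illustrative assumption, not an established threshold.

```python
from statistics import mean, pstdev

def variance_check(outcomes, cv_threshold=1.0):
    """Rough homogeneity test: low relative spread suggests a usable class.

    `outcomes` is a list of numeric outcomes for reference-class members
    (e.g., 5-year revenue multiples). The cutoff is an illustrative
    assumption, not a standard value.
    """
    mu = mean(outcomes)
    if mu == 0:
        return False  # no meaningful relative spread can be formed
    cv = pstdev(outcomes) / abs(mu)  # coefficient of variation
    return cv <= cv_threshold  # True -> outcomes cluster

# Hypothetical outcome multiples for two candidate reference classes
print(variance_check([1.2, 0.8, 1.5, 1.1, 0.9]))    # True: outcomes cluster
print(variance_check([0.0, 30.0, 0.1, 12.0, 0.0]))  # False: all over the place
```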

Heuristic: The Substitution Test. Pick any two members of the reference class at random.

Ask: "If I swapped one for the other, would the prediction change dramatically?"

  • No → Good homogeneity
  • Yes → Too broad, needs subdivision

Example:

  • "Tech startups" → Swap consumer mobile app for enterprise database company → Prediction changes drastically → Too broad
  • "Seed-stage B2B SaaS" → Swap CRM tool for analytics platform → Prediction mostly same → Good homogeneity

Similarity Metrics

When You Can't Find an Exact Match

If no perfect reference class exists, use similarity matching to find nearest neighbors.

Dimensions of Similarity

For Startups:

  1. Business model (B2B, B2C, marketplace, SaaS)
  2. Revenue model (subscription, transaction, ads)
  3. Stage/funding (seed, Series A, etc.)
  4. Team size
  5. Market size
  6. Technology complexity

For Projects:

  1. Size (budget, team size, duration)
  2. Complexity (simple, moderate, complex)
  3. Technology maturity (proven, emerging, experimental)
  4. Team experience
  5. Dependencies (few, many)

For People:

  1. Experience level
  2. Domain expertise
  3. Resources available
  4. Historical track record
  5. Contextual factors (support, environment)

Similarity Scoring

Method: Nearest Neighbors

  1. List all dimensions of similarity (5-7 dimensions)
  2. For each dimension, score how similar the case is to reference class (0-10)
  3. Average the scores
  4. Threshold: If similarity < 7/10, the reference class may not apply

Example: Comparing "AI startup in 2024" to "Software startups 2010-2020" reference class:

  • Business model: 9/10 (same)
  • Revenue model: 8/10 (mostly SaaS)
  • Technology maturity: 4/10 (AI is newer)
  • Market size: 7/10 (comparable)
  • Team size: 8/10 (similar)
  • Funding environment: 5/10 (tighter in 2024)

Average: 6.8/10 → Marginal reference class; use with caution
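
The scoring itself is simple enough to script. This sketch reproduces the 6.8/10 example above; `similarity_score` is a hypothetical helper, not part of any standard toolkit.

```python
def similarity_score(dimension_scores, threshold=7.0):
    """Average per-dimension similarity scores (0-10) and flag marginal matches."""
    avg = sum(dimension_scores.values()) / len(dimension_scores)
    return avg, avg >= threshold

# "AI startup in 2024" vs. the "Software startups 2010-2020" reference class
scores = {
    "business_model": 9,
    "revenue_model": 8,
    "technology_maturity": 4,
    "market_size": 7,
    "team_size": 8,
    "funding_environment": 5,
}
avg, applies = similarity_score(scores)
print(f"{avg:.1f}/10 -> {'use it' if applies else 'use with caution'}")
# 6.8/10 -> use with caution
```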


Edge Cases and Judgment Calls

Case 1: Structural Regime Change

Problem: Conditions have changed fundamentally since historical data

Examples:

  • Pre-internet vs post-internet business
  • Pre-COVID vs post-COVID work patterns
  • Pre-AI vs post-AI software development

Solution:

  1. Segment data by era if possible
  2. Use most recent data only
  3. Adjust base rate for known structural differences
  4. Increase uncertainty bounds (a recency-weighting sketch follows this list)
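
One way to implement points 2-4 together is to down-weight older cohorts rather than discard them. A minimal sketch, assuming you have (year, base rate, sample size) data per cohort; the exponential half-life is an arbitrary knob to tune, and the cohort numbers are made up.

```python
def recency_weighted_base_rate(cohorts, half_life_years=5, current_year=2024):
    """Weight each cohort's base rate by sample size and exponential recency decay.

    `cohorts` is a list of (year, base_rate, n). The half-life is an
    illustrative assumption; widen your uncertainty bounds regardless.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for year, rate, n in cohorts:
        decay = 0.5 ** ((current_year - year) / half_life_years)
        w = n * decay
        weighted_sum += rate * w
        weight_total += w
    return weighted_sum / weight_total

# Hypothetical yearly success rates for the same class before and after a regime change
cohorts = [(2012, 0.22, 400), (2016, 0.18, 500), (2020, 0.12, 450), (2023, 0.10, 300)]
print(round(recency_weighted_base_rate(cohorts), 3))
```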

Case 2: The N=1 Problem

Problem: The case is literally unique (first of its kind)

Examples:

  • First moon landing
  • First pandemic of a novel pathogen
  • First AGI system

Solution:

  1. Widen the class - Go up one abstraction level
    • "First moon landing" → "First major engineering projects"
    • "Novel pandemic" → "Past pandemics of any type"
  2. Component decomposition - Break into parts that have reference classes (see the sketch after this list)
    • "Moon landing" → Rocket success rate × Navigation accuracy × Life support reliability
  3. Expert aggregation - When no data, aggregate expert predictions (but with humility)
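
For option 2, the decomposition reduces to multiplying component base rates, which only works if you assume the components are roughly independent. The numbers below are purely illustrative.

```python
import math

def decomposed_probability(component_probs):
    """Multiply per-component success probabilities, assuming independence.

    Independence rarely holds exactly; treat the result as an optimistic anchor.
    """
    return math.prod(component_probs.values())

# Purely illustrative component base rates for a first-of-its-kind mission
components = {
    "launch_vehicle_success": 0.95,
    "navigation_accuracy": 0.90,
    "life_support_reliability": 0.92,
}
print(round(decomposed_probability(components), 2))  # ~0.79
```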

Case 3: Multiple Plausible Reference Classes

Problem: Event could belong to multiple classes with different base rates

Example: "Elon Musk starting a brain-computer interface company"

Possible reference classes:

  • "Startups by serial entrepreneurs" → 40% success
  • "Medical device startups" → 10% success
  • "Moonshot technology ventures" → 5% success
  • "Companies founded by Elon Musk" → 80% success

Solution: Ensemble Averaging

  1. Identify all plausible reference classes
  2. Find base rate for each
  3. Weight by relevance/similarity
  4. Calculate weighted average

Example weights:

  • Medical device (40%): 10% × 0.4 = 4%
  • Moonshot tech (30%): 5% × 0.3 = 1.5%
  • Serial entrepreneur (20%): 40% × 0.2 = 8%
  • Elon track record (10%): 80% × 0.1 = 8%

Weighted base rate: 21.5%
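
The weighted average is easy to verify in a few lines. This sketch reproduces the 21.5% figure above, with the class names, weights, and base rates taken directly from the example; the helper name is just for illustration.

```python
def ensemble_base_rate(classes):
    """Weighted average of base rates; weights should sum to 1 (or be normalized)."""
    total_weight = sum(w for w, _ in classes.values())
    return sum(w * rate for w, rate in classes.values()) / total_weight

# class name -> (weight, base rate)
classes = {
    "medical_device_startups": (0.40, 0.10),
    "moonshot_tech_ventures": (0.30, 0.05),
    "serial_entrepreneur_startups": (0.20, 0.40),
    "elon_musk_companies": (0.10, 0.80),
}
print(f"{ensemble_base_rate(classes):.1%}")  # 21.5%
```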


Common Selection Mistakes

Mistake 1: Cherry-Picking Success Examples

  • What it looks like: "Reference class = Companies like Apple, Google, Facebook"
  • Why it's wrong: Survivorship bias (only looking at winners)
  • Fix: Include all attempts, not just successes

Mistake 2: Availability Bias

  • What it looks like: Reference class = recent, memorable cases
  • Why it's wrong: Recent events are overweighted in memory
  • Fix: Use systematic data collection, not what comes to mind

Mistake 3: Confirmation Bias

  • What it looks like: Choosing the reference class that supports your prior belief
  • Why it's wrong: You're reverse-engineering the answer
  • Fix: Choose the reference class BEFORE looking at the base rate

Mistake 4: Overfitting to Irrelevant Details

  • What it looks like: "Female, left-handed CEOs who went to Ivy League schools"
  • Why it's wrong: Most details don't matter; you're adding noise
  • Fix: Only include features that causally affect outcomes

Mistake 5: Ignoring Time Decay

  • What it looks like: Using data from the 1970s for a 2024 prediction
  • Why it's wrong: The world has changed
  • Fix: Weight recent data more heavily, or segment by era


Reference Class Hierarchy

Start Specific, Widen as Needed

Level 1: Maximally Specific (Try this first)

  • Example: "Seed-stage B2B cybersecurity SaaS in US, 2020-2024"
  • Check for data → If N > 30, use this

Level 2: Drop One Feature (If L1 has no data)

  • Example: "Seed-stage B2B SaaS in US, 2020-2024" (removed "cybersecurity")
  • Check for data → If N > 30, use this

Level 3: Drop Two Features (If L2 has no data)

  • Example: "Seed-stage B2B SaaS, 2020-2024" (removed "US")
  • Check for data → If N > 30, use this

Level 4: Generic Category (Last resort)

  • Example: "Seed-stage startups"
  • Always has data, but high variance

Rule: Use the most specific level that still gives you N ≥ 30 data points.
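
The widening rule can be expressed as a simple fallback loop. In this sketch, `count_cases` is a hypothetical stand-in for whatever dataset or search you actually query, and the counts are made up.

```python
def select_reference_class(features, count_cases, min_n=30):
    """Drop the least important feature until the class has at least `min_n` cases.

    `features` is ordered most -> least important; `count_cases` is a
    hypothetical callable returning how many historical cases match.
    """
    current = list(features)
    while current:
        n = count_cases(current)
        if n >= min_n:
            return current, n
        current = current[:-1]  # drop the least important remaining feature
    return [], 0  # even the generic category lacked data

# Hypothetical lookup table standing in for a real dataset query
fake_counts = {
    ("seed-stage", "B2B SaaS", "2020-2024", "US", "cybersecurity"): 12,
    ("seed-stage", "B2B SaaS", "2020-2024", "US"): 25,
    ("seed-stage", "B2B SaaS", "2020-2024"): 90,
}
chosen, n = select_reference_class(
    ["seed-stage", "B2B SaaS", "2020-2024", "US", "cybersecurity"],
    lambda feats: fake_counts.get(tuple(feats), 0),
)
print(chosen, n)  # ['seed-stage', 'B2B SaaS', '2020-2024'] 90
```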


Checklist: Is This a Good Reference Class?

Use this to validate your choice:

  • Sample size ≥ 30 historical cases
  • Homogeneity: Members are similar enough that averaging makes sense
  • Relevance: Data is from appropriate time period (last 10 years preferred)
  • Specificity: Class is narrow enough to be meaningful
  • Data availability: Base rate is published or calculable
  • No survivorship bias: Includes failures, not just successes
  • No cherry-picking: Class chosen before looking at base rate
  • Causal relevance: Features included actually affect outcomes

  • If ≥ 6 checked: Good reference class
  • If 4-5 checked: Acceptable, but increase uncertainty
  • If < 4 checked: Find a better reference class


Advanced: Bayesian Reference Class Selection

When you have multiple plausible reference classes, you can use Bayesian reasoning:

Step 1: Prior Distribution Over Classes

Assign a probability to each candidate reference class being the "true" one

Example:

  • P(Class = "B2B SaaS") = 60%
  • P(Class = "All SaaS") = 30%
  • P(Class = "All startups") = 10%

Step 2: Likelihood of Observed Features

How likely is this specific case under each class?

Step 3: Posterior Distribution

Update class probabilities using Bayes' rule

Step 4: Weighted Base Rate

Average base rates weighted by posterior probabilities

This is advanced. Default to the systematic selection method above unless you have strong quantitative skills.
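
For those who do want the Bayesian route, here is a minimal sketch of Steps 1-4. It assumes you can supply a rough prior and likelihood for each candidate class; the priors echo the example above, while the likelihoods and base rates are made-up placeholders.

```python
def bayesian_base_rate(candidates):
    """Posterior-weighted base rate over candidate reference classes.

    `candidates` maps class name -> (prior, likelihood_of_case, base_rate).
    Priors and likelihoods here are illustrative, not calibrated values.
    """
    unnormalized = {k: p * l for k, (p, l, _) in candidates.items()}
    z = sum(unnormalized.values())
    posteriors = {k: v / z for k, v in unnormalized.items()}  # Bayes' rule
    return sum(posteriors[k] * candidates[k][2] for k in candidates)

# class name -> (prior, likelihood of observed features, base rate)
candidates = {
    "B2B SaaS": (0.60, 0.8, 0.15),
    "All SaaS": (0.30, 0.5, 0.10),
    "All startups": (0.10, 0.2, 0.08),
}
print(round(bayesian_base_rate(candidates), 3))
```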


Practical Workflow

Quick Protocol (5 minutes)

  1. Name the core type: "This is a [X]"
  2. Add 2-3 specificity layers: Stage, category, geography
  3. Google the base rate: "[Reference class] success rate"
  4. Sanity check: Does N > 30? Are members similar?
  5. Use it: This is your starting probability

Rigorous Protocol (30 minutes)

  1. Systematic selection (Steps 1-4 above)
  2. Similarity scoring for validation
  3. Check for structural regime changes
  4. Consider multiple reference classes
  5. Weighted ensemble if multiple classes
  6. Document assumptions and limitations

Return to: Main Skill