Bayesian Analysis: Feature Adoption Forecast

Question

Hypothesis: New sharing feature will achieve >20% adoption within 3 months of launch

Estimating: P(adoption >20%)

Timeframe: 3 months post-launch (results measured at month 3)

Matters because: Need 20% adoption to justify ongoing development investment. Below 20%, we should sunset the feature and reallocate resources.


Prior Belief (Before Evidence)

Base Rate

What's the general frequency of similar features achieving >20% adoption?

  • Reference class: Previous features we've launched in this product category
  • Historical data:
    • Last 8 features launched: 5 achieved >20% adoption (62.5%)
    • Industry benchmarks: Social sharing features average 15-25% adoption
    • Our product has higher engagement than average
  • Base rate: 60% (the internal 62.5% comes from a small sample of 8, moderated slightly toward the lower industry benchmarks)

Adjustments

How is this case different from the base rate?

  • Factor 1: Feature complexity - This feature is simpler than average (+5%)

    • Previous successful features averaged 3 steps to use
    • This feature is 1-click sharing
    • Simpler features historically perform better
  • Factor 2: Market timing - Competitive pressure is high (-10%)

    • Two competitors launched similar features 6 months ago
    • Early adopters may have already switched to competitors
    • Late-to-market features typically see 15-20% lower adoption
  • Factor 3: User research signals - Strong user request (+10%)

    • Feature was #2 most requested in last user survey (450 responses)
    • 72% said they would use it "frequently" or "very frequently"
    • Stated intent this strong typically converts to actual usage for only 40-60% of respondents

Prior Probability

P(H) = 65%

Justification: Starting from 60% base rate, adjusted upward for simplicity (+5%) and strong user signals (+10%), adjusted down for late market entry (-10%). Net effect: 65% prior confidence that adoption will exceed 20%.

Range if uncertain: 55% to 75% (accounting for uncertainty in adjustment factors)
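
For readers who want to check the arithmetic, here is a minimal Python sketch of the prior construction; every value is taken from the adjustments listed above, nothing is derived from new data.

```python
# Sketch of the prior construction described above; all values are the ones
# stated in this analysis, not derived here.
base_rate = 0.60  # historical: 5 of the last 8 features cleared 20% adoption

adjustments = {
    "simpler than average (1-click sharing)": +0.05,
    "late market entry / competitive pressure": -0.10,
    "strong user-research signals": +0.10,
}

prior = base_rate + sum(adjustments.values())
print(f"Prior P(H) = {prior:.0%}")        # 65%
print("Range if uncertain: 55% to 75%")   # judgment band around the point estimate
```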


Evidence

What was observed: Beta test with 200 users showed 35% adoption (70 users actively used feature)

How diagnostic: This is moderately to strongly diagnostic evidence. Beta tests often show higher engagement than production (selection bias), but 35% is meaningfully above our 20% threshold. The question is whether this beta performance predicts production performance.

Likelihoods

P(E|H) = 75% - Probability of seeing 35% beta adoption IF true production adoption will be >20%

Reasoning:

  • If production adoption will be >20%, beta should show higher (beta users are early adopters)
  • Typical pattern: beta adoption is 1.5-2x production adoption for engaged features
  • If production will be 22%, beta would likely be 33-44% → 35% fits this well
  • If production will be 25%, beta would likely be 38-50% → 35% is on lower end but plausible
  • 75% accounts for variance and beta-to-production conversion uncertainty

P(E|¬H) = 15% - Probability of seeing 35% beta adoption IF true production adoption will be ≤20%

Reasoning:

  • If production adoption will be ≤20% (say, 15%), beta would typically be 22-30%
  • Seeing 35% beta when production will be ≤20% would require unusual beta-to-production drop
  • This could happen (beta selection bias, novelty effect wears off), but is uncommon
  • 15% reflects that this scenario is possible but unlikely
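
A small sketch of the beta-to-production reasoning behind both likelihood estimates, assuming the 1.5-2x multiplier cited above; the production scenarios are the illustrative ones used in this section.

```python
# If beta adoption runs 1.5-2x production adoption, what beta range does each
# hypothetical production outcome imply, and is the observed 35% consistent with it?
observed_beta = 0.35
low_mult, high_mult = 1.5, 2.0

for production in (0.15, 0.20, 0.22, 0.25):   # illustrative production outcomes
    lo, hi = production * low_mult, production * high_mult
    fits = lo <= observed_beta <= hi
    print(f"production {production:.0%}: implied beta {lo:.0%}-{hi:.0%}, "
          f"observed 35% in range: {fits}")
```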

Likelihood Ratio = 75% / 15% = 5.0

Interpretation: Evidence is moderately strong. A 35% beta result is 5 times more likely if production adoption will exceed 20% than if it won't. This is meaningful but not overwhelming evidence.


Bayesian Update

Calculation

Using odds form (simpler for this case):

Prior Odds = P(H) / P(¬H) = 65% / 35% = 1.86

Likelihood Ratio = 5.0

Posterior Odds = Prior Odds × LR = 1.86 × 5.0 = 9.3

Posterior Probability = Posterior Odds / (1 + Posterior Odds)
                      = 9.3 / 10.3
                      = 90.3%

Verification using probability form:

P(E) = [P(E|H) × P(H)] + [P(E|¬H) × P(¬H)]
P(E) = [75% × 65%] + [15% × 35%]
P(E) = 48.75% + 5.25% = 54%

P(H|E) = [P(E|H) × P(H)] / P(E)
P(H|E) = [75% × 65%] / 54%
P(H|E) = 48.75% / 54% = 90.3%
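
The same update, sketched in Python in both the odds form and the probability form, using the inputs stated above.

```python
# Bayesian update for P(adoption > 20% | 35% beta adoption).
prior = 0.65
p_e_given_h = 0.75      # P(E|H): 35% beta adoption if production adoption > 20%
p_e_given_not_h = 0.15  # P(E|~H): 35% beta adoption if production adoption <= 20%

# Odds form
prior_odds = prior / (1 - prior)                   # 1.86
likelihood_ratio = p_e_given_h / p_e_given_not_h   # 5.0
posterior_odds = prior_odds * likelihood_ratio     # 9.3
posterior_odds_form = posterior_odds / (1 + posterior_odds)

# Probability form (verification)
p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)   # 0.54
posterior_prob_form = p_e_given_h * prior / p_e

print(f"Posterior (odds form):        {posterior_odds_form:.1%}")   # 90.3%
print(f"Posterior (probability form): {posterior_prob_form:.1%}")   # 90.3%
```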

Posterior Probability

P(H|E) = 90%

Change in Belief

  • Prior: 65%
  • Posterior: 90%
  • Change: +25 percentage points
  • Interpretation: Evidence strongly supports hypothesis. Beta test results meaningfully increased confidence that production adoption will exceed 20%.

Sensitivity Analysis

How sensitive is posterior to inputs?

If Prior was different:

| Prior | Posterior | Note |
|-------|-----------|------|
| 50% | 83% | Even starting at a coin flip, evidence pushes to high confidence |
| 75% | 94% | Higher prior → very high posterior |
| 40% | 77% | Lower prior → still high confidence |

Finding: Posterior is somewhat robust. Evidence is strong enough that even with priors ranging from 40-75%, posterior stays in 77-94% range.

If P(E|H) was different:

| P(E\|H) | LR | Posterior | Note |
|---------|-----|-----------|------|
| 60% | 4.0 | 88% | Less diagnostic evidence → still high confidence |
| 85% | 5.67 | 91% | More diagnostic evidence → very high confidence |
| 50% | 3.33 | 86% | Weakest plausible evidence → still high confidence |

Finding: Posterior is only mildly sensitive to P(E|H); it stays above 85% across the plausible range.

If P(E|¬H) was different:

| P(E\|¬H) | LR | Posterior | Note |
|----------|-----|-----------|------|
| 25% | 3.0 | 85% | Less diagnostic → still high confidence |
| 10% | 7.5 | 93% | More diagnostic → very high confidence |
| 30% | 2.5 | 82% | Least diagnostic → moderate-high confidence |

Finding: Posterior is sensitive to P(E|¬H). If a large beta-to-production drop is common (i.e., P(E|¬H) is higher than assumed here), confidence decreases meaningfully.

Robustness: Conclusion is moderately robust. Across reasonable input ranges, posterior stays above 77%, supporting launch decision. Most sensitive to assumption about beta-to-production conversion rates.
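
The sensitivity tables above can be reproduced with a short one-at-a-time sweep; a sketch, holding the other inputs at their stated values.

```python
def posterior(prior, p_e_h, p_e_not_h):
    """Posterior P(H|E) for a binary hypothesis."""
    return p_e_h * prior / (p_e_h * prior + p_e_not_h * (1 - prior))

# One-at-a-time sensitivity, holding the other inputs at their stated values.
for p in (0.40, 0.50, 0.65, 0.75):
    print(f"prior={p:.0%} -> posterior={posterior(p, 0.75, 0.15):.0%}")
for peh in (0.50, 0.60, 0.75, 0.85):
    print(f"P(E|H)={peh:.0%} -> posterior={posterior(0.65, peh, 0.15):.0%}")
for penh in (0.10, 0.15, 0.25, 0.30):
    print(f"P(E|~H)={penh:.0%} -> posterior={posterior(0.65, 0.75, penh):.0%}")
```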


Calibration Check

Am I overconfident?

  • Did I anchor on initial belief?

    • No - prior (65%) was based on base rates, not arbitrary
    • Evidence substantially moved belief (+25pp)
    • Not stuck at starting point
  • Did I ignore base rates?

    • No - explicitly used historical feature adoption (60%) as starting point
    • Adjusted for known differences systematically
  • Is my posterior extreme (>90% or <10%)?

    • Yes - 90% is borderline extreme
    • Check: Is evidence truly that strong?
      • LR = 5.0 is moderately strong (not very strong)
      • Prior was already high (65%)
      • Combination pushes to 90%
    • Concern: May be slightly overconfident
    • Adjustment: Consider reporting as 85-90% range rather than point estimate
  • Would an outside observer agree with my likelihoods?

    • P(E|H) = 75%: Reasonable - beta users are engaged, expect higher than production
    • P(E|¬H) = 15%: Potentially optimistic - beta selection bias could be stronger
    • Alternative: If P(E|¬H) = 25%, posterior drops to roughly 85% (more conservative)

Red flags:

  • ✓ Posterior is not 100% or 0%
  • ✓ Update magnitude (25pp) matches evidence strength (LR=5.0)
  • ✓ Prior uses base rates
  • ⚠ Posterior is at upper end (90%) - consider uncertainty range

Calibration adjustment: Report as 85-90% confidence range to account for uncertainty in likelihoods.
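
One way to arrive at a reporting range instead of a point estimate is to propagate likelihood uncertainty jointly. The sketch below does this over an assumed grid; the specific band widths are illustrative, not figures from this analysis.

```python
import itertools

def posterior(prior, p_e_h, p_e_not_h):
    return p_e_h * prior / (p_e_h * prior + p_e_not_h * (1 - prior))

# Assumed (illustrative) uncertainty bands around the stated likelihoods.
p_e_h_grid = (0.65, 0.75, 0.85)
p_e_not_h_grid = (0.15, 0.20, 0.25)

results = [posterior(0.65, a, b)
           for a, b in itertools.product(p_e_h_grid, p_e_not_h_grid)]
print(f"Posterior range: {min(results):.0%} to {max(results):.0%}")   # ~83% to 91%
```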


Limitations & Assumptions

Key assumptions:

  1. Beta users are representative of broader user base

    • Assumption: Beta users are roughly 1.5-2x as engaged as the average user
    • Risk: If beta users are much more engaged (3x), production adoption could be lower
    • Impact: Could invalidate high posterior
  2. No major bugs or UX issues in production

    • Assumption: Production experience will match beta experience
    • Risk: Unforeseen technical issues could crater adoption
    • Impact: Would make evidence misleading
  3. Competitive landscape stays stable

    • Assumption: No major competitor moves in next 3 months
    • Risk: Competitor could launch superior version
    • Impact: Could reduce adoption below 20% despite strong beta
  4. Beta sample size is sufficient (n=200)

    • Assumption: 200 users is enough to estimate adoption
    • Confidence interval: 35% ± 6.6% at 95% CI
    • Impact: True beta adoption could be 28-42%, adding uncertainty
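
The interval quoted above matches the standard normal-approximation (Wald) interval for a proportion; a quick check:

```python
import math

n, successes = 200, 70
p_hat = successes / n                      # 0.35
se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error of the proportion
margin = 1.96 * se                         # 95% margin of error, ~6.6 points
print(f"{p_hat:.0%} +/- {margin:.1%}  ->  {p_hat - margin:.0%} to {p_hat + margin:.0%}")
```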

What could invalidate this analysis:

  • Major product changes: If we significantly alter the feature post-beta, beta results become less predictive
  • Different user segment: If we launch to a different user segment than beta testers, adoption patterns may differ
  • Seasonal effects: If beta ran during high-engagement season and launch is during low season
  • Discovery/onboarding issues: If users don't discover the feature in production (beta users were explicitly invited)

Uncertainty:

  • Most uncertain about: P(E|¬H) = 15% - How often do features with ≤20% production adoption show 35% beta adoption?

    • This is the key assumption
    • If this is actually 25-30%, posterior drops to roughly 82-85%
    • Recommendation: Review historical beta-to-production conversion data
  • Could be wrong if:

    • Beta users are much more engaged than typical users (>2x multiplier)
    • Novelty effect in beta wears off quickly in production
    • Production launch has poor discoverability/onboarding

Decision Implications

Given posterior of 90% (range: 85-90%):

Recommended action: Proceed with launch with monitoring plan

Rationale:

  • 90% confidence exceeds decision threshold for feature launches
  • Even conservative estimate (85%) supports launch
  • Risk of failure (<20% adoption) is only 10-15%
  • Cost of being wrong: Wasted 3 months of development effort
  • Cost of not launching: Missing potential high-adoption feature

If decision threshold is:

  • High confidence needed (>80%): LAUNCH - Exceeds threshold, proceed with production rollout

  • Medium confidence (>60%): LAUNCH - Well above threshold, strong conviction

  • Low bar (>40%): LAUNCH - Far exceeds minimum threshold

Monitoring plan (to validate forecast):

  1. Week 1: Check if adoption is on track for >6% (roughly one month's pro-rata share of the 20% / 3-month target, expecting launch-week adoption to front-load)

    • If <4%: Red flag, investigate onboarding/discovery issues
    • If >8%: Exceeding expectations, validate data quality
  2. Month 1: Check if adoption is trending toward >10%

    • If <7%: Update forecast downward, consider intervention
    • If >13%: Exceeding expectations, high confidence
  3. Month 3: Measure final adoption

    • If <20%: Analyze what went wrong, calibrate future forecasts
    • If >20%: Validate forecast accuracy, update priors for future features

Next evidence to gather:

  • Historical beta-to-production conversion rates: Review last 5-10 feature launches to calibrate P(E|¬H) more accurately
  • User segment analysis: Compare beta user demographics to production user base
  • Competitive feature adoption: Check competitors' sharing feature adoption rates
  • Early production data: After 1 week of production, use actual adoption data for next Bayesian update

What would change our mind:

  • Week 1 adoption <3%: Would update posterior down to ~60% (mechanics sketched below) and trigger an investigation
  • Competitor launches superior feature: Would need to recalculate with new competitive landscape
  • Discovery of major beta sampling bias: If beta users are 5x more engaged, would significantly reduce confidence
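
When week-1 production data arrives, today's posterior becomes the new prior. A sketch of the mechanics follows; the likelihoods are placeholders chosen only so the output matches the ~60% figure above, and real values would need to come from historical week-1 vs. 3-month adoption data.

```python
def update(prior, p_e_h, p_e_not_h):
    """One Bayesian update step; returns the posterior."""
    return p_e_h * prior / (p_e_h * prior + p_e_not_h * (1 - prior))

current = 0.90   # posterior after the beta-test evidence above

# Hypothetical week-1 evidence: adoption comes in below 3%.
# Placeholder likelihoods, chosen only to illustrate the ~60% figure above;
# estimate real values from historical week-1 vs. 3-month adoption patterns.
p_weak_week1_if_h = 0.10      # weak week 1 even though 3-month adoption ends >20%
p_weak_week1_if_not_h = 0.60  # weak week 1 when 3-month adoption ends <=20%

print(f"Posterior after a weak week 1: "
      f"{update(current, p_weak_week1_if_h, p_weak_week1_if_not_h):.0%}")   # ~60%
```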

Meta: Forecast Quality Assessment

Using rubric from rubric_bayesian_reasoning_calibration.json:

Self-assessment:

  • Prior Quality: 4/5 (good base rate usage, clear adjustments)
  • Likelihood Justification: 4/5 (clear reasoning, could use more empirical data)
  • Evidence Diagnosticity: 4/5 (LR=5.0 is moderately strong)
  • Calculation Correctness: 5/5 (verified with both odds and probability forms)
  • Calibration & Realism: 3/5 (posterior is 90%, borderline extreme, flagged for review)
  • Assumption Transparency: 4/5 (key assumptions stated clearly)
  • Base Rate Usage: 5/5 (explicit base rate from historical data)
  • Sensitivity Analysis: 4/5 (comprehensive sensitivity checks)
  • Interpretation Quality: 4/5 (clear decision implications with thresholds)
  • Avoidance of Common Errors: 4/5 (no prosecutor's fallacy, proper base rates)

Average: 4.1/5 - Meets "very good" threshold for medium-stakes decision

Decision: Forecast is sufficiently rigorous for feature launch decision (medium stakes). Primary area for improvement: gather more data on beta-to-production conversion to refine P(E|¬H) estimate.