# Bayesian Analysis: Feature Adoption Forecast
## Question
**Hypothesis**: New sharing feature will achieve >20% adoption within 3 months of launch
**Estimating**: P(adoption >20%)
**Timeframe**: 3 months post-launch (results measured at month 3)
**Matters because**: Need 20% adoption to justify ongoing development investment. Below 20%, we should sunset the feature and reallocate resources.
---
## Prior Belief (Before Evidence)
### Base Rate
What's the general frequency of similar features achieving >20% adoption?
- **Reference class**: Previous features we've launched in this product category
- **Historical data**:
  - Last 8 features launched: 5 achieved >20% adoption (62.5%)
  - Industry benchmarks: Social sharing features average 15-25% adoption
  - Our product has higher engagement than average
- **Base rate**: 60%
### Adjustments
How is this case different from the base rate?
- **Factor 1: Feature complexity** - This feature is simpler than average (+5%)
  - Previous successful features averaged 3 steps to use
  - This feature is 1-click sharing
  - Simpler features historically perform better
- **Factor 2: Market timing** - Competitive pressure is high (-10%)
  - Two competitors launched similar features 6 months ago
  - Early adopters may have already switched to competitors
  - Late-to-market features typically see 15-20% lower adoption
- **Factor 3: User research signals** - Strong user request (+10%)
  - Feature was #2 most requested in the last user survey (450 responses)
  - 72% said they would use it "frequently" or "very frequently"
  - Strong stated intent typically correlates with 40-60% actual usage
### Prior Probability
**P(H) = 65%**
**Justification**: Starting from 60% base rate, adjusted upward for simplicity (+5%) and strong user signals (+10%), adjusted down for late market entry (-10%). Net effect: 65% prior confidence that adoption will exceed 20%.
**Range if uncertain**: 55% to 75% (accounting for uncertainty in adjustment factors)
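A minimal sketch of the additive adjustment above; the base rate, the ±5/±10 point adjustments, and the ±10-point uncertainty range are the values stated in this section, while the clamping to [0, 1] and variable names are illustrative assumptions:
```python
# Sketch: combine the base rate with the additive adjustments above.
base_rate = 0.60
adjustments = {
    "simpler feature": +0.05,
    "late market entry": -0.10,
    "strong user signals": +0.10,
}

prior = min(max(base_rate + sum(adjustments.values()), 0.0), 1.0)
print(f"Prior P(H) = {prior:.0%}")                                      # 65%
print(f"Range if uncertain: {prior - 0.10:.0%} to {prior + 0.10:.0%}")  # 55% to 75%
```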
---
## Evidence
**What was observed**: Beta test with 200 users showed 35% adoption (70 users actively used the feature)
**How diagnostic**: This is moderately to strongly diagnostic evidence. Beta tests often show higher engagement than production (selection bias), but 35% is meaningfully above our 20% threshold. The question is whether this beta performance predicts production performance.
### Likelihoods
**P(E|H) = 75%** - Probability of seeing 35% beta adoption IF true production adoption will be >20%
**Reasoning**:
- If production adoption will be >20%, beta should show higher (beta users are early adopters)
- Typical pattern: beta adoption is 1.5-2x production adoption for engaged features
- If production will be 22%, beta would likely be 33-44% → 35% fits this well
- If production will be 25%, beta would likely be 38-50% → 35% is on lower end but plausible
- 75% accounts for variance and beta-to-production conversion uncertainty
**P(E|¬H) = 15%** - Probability of seeing 35% beta adoption IF true production adoption will be ≤20%
**Reasoning**:
- If production adoption will be ≤20% (say, 15%), beta would typically be 22-30%
- Seeing 35% beta when production will be ≤20% would require unusual beta-to-production drop
- This could happen (beta selection bias, novelty effect wears off), but is uncommon
- 15% reflects that this scenario is possible but unlikely
**Likelihood Ratio = 75% / 15% = 5.0**
**Interpretation**: Evidence is moderately strong. A 35% beta result is 5 times more likely if production adoption will exceed 20% than if it won't. This is meaningful but not overwhelming evidence.
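To make the likelihood reasoning concrete, here is a small sketch using only the assumptions stated above (the 1.5-2x beta-to-production multiplier and the 75%/15% likelihoods); nothing here is new data:
```python
# Implied beta adoption range if production adoption lands at various levels,
# under the 1.5-2x beta-to-production multiplier assumed above.
for production in (0.15, 0.20, 0.22, 0.25):
    low, high = production * 1.5, production * 2.0
    print(f"production {production:.0%} -> expected beta {low:.0%}-{high:.0%}")

# Likelihood ratio for the observed 35% beta adoption.
p_e_given_h = 0.75      # P(E|H): seeing 35% beta if production adoption will exceed 20%
p_e_given_not_h = 0.15  # P(E|not H): seeing 35% beta if production adoption will be <=20%
print(f"LR = {p_e_given_h / p_e_given_not_h:.1f}")  # 5.0
```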
---
## Bayesian Update
### Calculation
**Using odds form** (simpler for this case):
```
Prior Odds = P(H) / P(¬H) = 65% / 35% = 1.86
Likelihood Ratio = 5.0
Posterior Odds = Prior Odds × LR = 1.86 × 5.0 = 9.3
Posterior Probability = Posterior Odds / (1 + Posterior Odds)
= 9.3 / 10.3
= 90.3%
```
**Verification using probability form**:
```
P(E) = [P(E|H) × P(H)] + [P(E|¬H) × P(¬H)]
P(E) = [75% × 65%] + [15% × 35%]
P(E) = 48.75% + 5.25% = 54%
P(H|E) = [P(E|H) × P(H)] / P(E)
P(H|E) = [75% × 65%] / 54%
P(H|E) = 48.75% / 54% = 90.3%
```
### Posterior Probability
**P(H|E) = 90%**
### Change in Belief
- **Prior**: 65%
- **Posterior**: 90%
- **Change**: +25 percentage points
- **Interpretation**: Evidence strongly supports hypothesis. Beta test results meaningfully increased confidence that production adoption will exceed 20%.
---
## Sensitivity Analysis
**How sensitive is posterior to inputs?**
### If Prior was different:
| Prior | Posterior | Note |
|-------|-----------|------|
| 50% | 83% | Even starting at coin-flip, evidence pushes to high confidence |
| 75% | 94% | Higher prior → very high posterior |
| 40% | 77% | Lower prior → still high confidence |
**Finding**: Posterior is somewhat robust. Evidence is strong enough that even with priors ranging from 40-75%, posterior stays in 77-94% range.
### If P(E|H) was different:
| P(E\|H) | LR | Posterior | Note |
|---------|-----|-----------|------|
| 60% | 4.0 | 88% | Less diagnostic evidence → still high confidence |
| 85% | 5.67 | 91% | More diagnostic evidence → very high confidence |
| 50% | 3.33 | 86% | Weaker evidence → moderate-high confidence |
**Finding**: Posterior is only moderately sensitive to P(E|H), staying above 85% across the plausible range.
### If P(E|¬H) was different:
| P(E\|¬H) | LR | Posterior | Note |
|----------|-----|-----------|------|
| 25% | 3.0 | 85% | Less diagnostic → still high confidence |
| 10% | 7.5 | 93% | More diagnostic → very high confidence |
| 30% | 2.5 | 82% | Weak evidence → moderate confidence |
**Finding**: Posterior is sensitive to P(E|¬H). If beta-to-production drop is common (higher P(E|¬H)), confidence decreases meaningfully.
**Robustness**: Conclusion is **moderately robust**. Across reasonable input ranges, posterior stays above 77%, supporting launch decision. Most sensitive to assumption about beta-to-production conversion rates.
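A sketch that reproduces the sensitivity tables above by sweeping one input at a time while holding the others at their base values (rounding is to whole percentage points):
```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Posterior P(H|E) via Bayes' theorem."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

BASE = dict(prior=0.65, p_e_given_h=0.75, p_e_given_not_h=0.15)

for name, values in [
    ("prior", [0.50, 0.75, 0.40]),
    ("p_e_given_h", [0.60, 0.85, 0.50]),
    ("p_e_given_not_h", [0.25, 0.10, 0.30]),
]:
    for v in values:
        args = {**BASE, name: v}
        print(f"{name}={v:.0%} -> posterior {posterior(**args):.0%}")
```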
---
## Calibration Check
**Am I overconfident?**
- **Did I anchor on initial belief?**
  - No - prior (65%) was based on base rates, not arbitrary
  - Evidence substantially moved belief (+25pp)
  - Not stuck at starting point
- **Did I ignore base rates?**
  - No - explicitly used historical feature adoption (60%) as starting point
  - Adjusted for known differences systematically
- **Is my posterior extreme (>90% or <10%)?**
  - Yes - 90% is borderline extreme
  - **Check**: Is evidence truly that strong?
    - LR = 5.0 is moderately strong (not very strong)
    - Prior was already high (65%)
    - Combination pushes to 90%
  - **Concern**: May be slightly overconfident
  - **Adjustment**: Consider reporting an 85-90% range rather than a point estimate
- **Would an outside observer agree with my likelihoods?**
  - P(E|H) = 75%: Reasonable - beta users are engaged, so higher-than-production adoption is expected
  - P(E|¬H) = 15%: Potentially optimistic - beta selection bias could be stronger
  - **Alternative**: If P(E|¬H) = 25%, posterior drops to 85% (more conservative)
**Red flags**:
- ✓ Posterior is not 100% or 0%
- ✓ Update magnitude (25pp) matches evidence strength (LR=5.0)
- ✓ Prior uses base rates
- ⚠ Posterior is at upper end (90%) - consider uncertainty range
**Calibration adjustment**: Report as 85-90% confidence range to account for uncertainty in likelihoods.
---
## Limitations & Assumptions
**Key assumptions**:
1. **Beta users are representative of broader user base**
   - Assumption: Beta users are 1.5-2x more engaged than average
   - Risk: If beta users are much more engaged (3x), production adoption could be lower
   - Impact: Could invalidate high posterior
2. **No major bugs or UX issues in production**
   - Assumption: Production experience will match beta experience
   - Risk: Unforeseen technical issues could crater adoption
   - Impact: Would make evidence misleading
3. **Competitive landscape stays stable**
   - Assumption: No major competitor moves in next 3 months
   - Risk: Competitor could launch superior version
   - Impact: Could reduce adoption below 20% despite strong beta
4. **Beta sample size is sufficient (n=200)**
   - Assumption: 200 users is enough to estimate adoption
   - Confidence interval: 35% ± 6.6% at 95% CI (see the interval sketch after this list)
   - Impact: True beta adoption could be 28-42%, adding uncertainty
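The interval sketch for assumption 4: a quick check of the stated 95% CI using the normal approximation for a binomial proportion (n and the adopter count come from the beta test; the choice of approximation is mine):
```python
import math

n, adopters = 200, 70
p_hat = adopters / n                     # 0.35 observed beta adoption
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
margin = 1.96 * se                       # 95% normal-approximation margin
print(f"{p_hat:.0%} ± {margin:.1%}  ->  {p_hat - margin:.0%} to {p_hat + margin:.0%}")
# 35% ± 6.6%  ->  28% to 42%
```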
**What could invalidate this analysis**:
- **Major product changes**: If we significantly alter the feature post-beta, beta results become less predictive
- **Different user segment**: If we launch to a different user segment than beta testers, adoption patterns may differ
- **Seasonal effects**: If beta ran during high-engagement season and launch is during low season
- **Discovery/onboarding issues**: If users don't discover the feature in production (beta users were explicitly invited)
**Uncertainty**:
- **Most uncertain about**: P(E|¬H) = 15% - How often do features with ≤20% production adoption show 35% beta adoption?
  - This is the key assumption
  - If this is actually 25-30%, posterior drops to 82-85%
  - Recommendation: Review historical beta-to-production conversion data
- **Could be wrong if**:
  - Beta users are much more engaged than typical users (>2x multiplier)
  - Novelty effect in beta wears off quickly in production
  - Production launch has poor discoverability/onboarding
---
## Decision Implications
**Given posterior of 90% (range: 85-90%)**:
**Recommended action**: **Proceed with launch** with monitoring plan
**Rationale**:
- 90% confidence exceeds decision threshold for feature launches
- Even conservative estimate (85%) supports launch
- Risk of failure (<20% adoption) is only 10-15%
- Cost of being wrong: Wasted 3 months of development effort
- Cost of not launching: Missing potential high-adoption feature
**If decision threshold is**:
- **High confidence needed (>80%)**: ✅ **LAUNCH** - Exceeds threshold, proceed with production rollout
- **Medium confidence (>60%)**: ✅ **LAUNCH** - Well above threshold, strong conviction
- **Low bar (>40%)**: ✅ **LAUNCH** - Far exceeds minimum threshold
**Monitoring plan** (to validate forecast):
1. **Week 1**: Check whether adoption is on track for >6% by month 1 (20% / 3 months ≈ 6.7% per month under linear growth; see the pacing sketch after this list)
   - If <4%: Red flag, investigate onboarding/discovery issues
   - If >8%: Exceeding expectations, validate data quality
2. **Month 1**: Check if adoption is trending toward >10%
   - If <7%: Update forecast downward, consider intervention
   - If >13%: Exceeding expectations, high confidence
3. **Month 3**: Measure final adoption
   - If <20%: Analyze what went wrong, calibrate future forecasts
   - If >20%: Validate forecast accuracy, update priors for future features
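The pacing sketch referenced above: a minimal helper that spells out the milestones implied by the linear-growth assumption in the week-1 check (the ~4.33 weeks-per-month conversion is my simplification, not from the plan):
```python
target, months = 0.20, 3
monthly_pace = target / months                     # ~6.7% per month if growth is linear
weekly_pace = monthly_pace / 4.33                  # rough weeks-per-month conversion (assumption)

print(f"Linear monthly pace: {monthly_pace:.1%}")  # 6.7%
print(f"Linear week-1 pace:  {weekly_pace:.1%}")   # ~1.5%
for month in range(1, months + 1):
    print(f"Month {month} milestone: {monthly_pace * month:.1%}")
# Month 1: 6.7%, Month 2: 13.3%, Month 3: 20.0%
```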
**Next evidence to gather**:
- **Historical beta-to-production conversion rates**: Review last 5-10 feature launches to calibrate P(E|¬H) more accurately
- **User segment analysis**: Compare beta user demographics to production user base
- **Competitive feature adoption**: Check competitors' sharing feature adoption rates
- **Early production data**: After 1 week of production, use actual adoption data for next Bayesian update
**What would change our mind**:
- **Week 1 adoption <3%**: Would update posterior down to ~60%, trigger investigation
- **Competitor launches superior feature**: Would need to recalculate with new competitive landscape
- **Discovery of major beta sampling bias**: If beta users are 5x more engaged, would significantly reduce confidence
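Once early production data arrives, the posterior above becomes the prior for the next update. A minimal sketch of that second, sequential update for the hypothetical "week-1 adoption <3%" signal; the 10%/60% likelihoods are placeholders chosen only so the sketch reproduces the ~60% figure above, not calibrated estimates:
```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """One Bayesian update; returns P(H|E)."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

posterior_after_beta = 0.90  # posterior from the beta-test update above

# Hypothetical week-1 evidence: adoption comes in below 3%.
# Placeholder likelihoods: a start this slow is rare (10%) if the feature will
# ultimately clear 20% adoption, and common (60%) if it will not.
p_slow_given_h = 0.10
p_slow_given_not_h = 0.60

print(f"Posterior after a slow week 1: "
      f"{update(posterior_after_beta, p_slow_given_h, p_slow_given_not_h):.0%}")  # 60%
```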
---
## Meta: Forecast Quality Assessment
Using rubric from `rubric_bayesian_reasoning_calibration.json`:
**Self-assessment**:
- Prior Quality: 4/5 (good base rate usage, clear adjustments)
- Likelihood Justification: 4/5 (clear reasoning, could use more empirical data)
- Evidence Diagnosticity: 4/5 (LR=5.0 is moderately strong)
- Calculation Correctness: 5/5 (verified with both odds and probability forms)
- Calibration & Realism: 3/5 (posterior is 90%, borderline extreme, flagged for review)
- Assumption Transparency: 4/5 (key assumptions stated clearly)
- Base Rate Usage: 5/5 (explicit base rate from historical data)
- Sensitivity Analysis: 4/5 (comprehensive sensitivity checks)
- Interpretation Quality: 4/5 (clear decision implications with thresholds)
- Avoidance of Common Errors: 4/5 (no prosecutor's fallacy, proper base rates)
**Average: 4.1/5** - Meets "very good" threshold for medium-stakes decision
**Decision**: Forecast is sufficiently rigorous for feature launch decision (medium stakes). Primary area for improvement: gather more data on beta-to-production conversion to refine P(E|¬H) estimate.