# Bayesian Analysis: Feature Adoption Forecast
## Question
**Hypothesis**: New sharing feature will achieve >20% adoption within 3 months of launch
**Estimating**: P(adoption >20%)
**Timeframe**: 3 months post-launch (results measured at month 3)
**Matters because**: Need 20% adoption to justify ongoing development investment. Below 20%, we should sunset the feature and reallocate resources.
---
## Prior Belief (Before Evidence)
### Base Rate
What's the general frequency of similar features achieving >20% adoption?
- **Reference class**: Previous features we've launched in this product category
- **Historical data**:
- Last 8 features launched: 5 achieved >20% adoption (62.5%)
- Industry benchmarks: Social sharing features average 15-25% adoption
- Our product has higher engagement than average
- **Base rate**: 60% (the 62.5% internal hit rate, rounded down slightly toward the lower industry benchmark)
### Adjustments
How is this case different from the base rate?
- **Factor 1: Feature complexity** - This feature is simpler than average (+5%)
- Previous successful features averaged 3 steps to use
- This feature is 1-click sharing
- Simpler features historically perform better
- **Factor 2: Market timing** - Competitive pressure is high (-10%)
- Two competitors launched similar features 6 months ago
- Early adopters may have already switched to competitors
- Late-to-market features typically see 15-20% lower adoption
- **Factor 3: User research signals** - Strong user request (+10%)
- Feature was #2 most requested in last user survey (450 responses)
- 72% said they would use it "frequently" or "very frequently"
- Strong stated intent typically correlates with 40-60% actual usage
### Prior Probability
**P(H) = 65%**
**Justification**: Starting from the 60% base rate, adjusted upward for simplicity (+5%) and strong user signals (+10%) and downward for late market entry (-10%). Net effect: a 65% prior that adoption will exceed 20%.
**Range if uncertain**: 55% to 75% (accounting for uncertainty in adjustment factors)
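The prior arithmetic above can be reproduced with a minimal sketch. Treating the adjustment factors as additive percentage-point shifts mirrors this document's template rather than a general Bayesian rule, and the factor labels below are shorthand for the bullets above.

```python
# Minimal sketch: prior as base rate plus additive percentage-point adjustments.
# The additive treatment follows this document's template; it is a simplification.

base_rate = 0.60  # historical frequency of >20% adoption for similar features

adjustments = {
    "feature complexity (1-click vs. multi-step)": +0.05,
    "market timing (late entry vs. competitors)": -0.10,
    "user research signals (#2 requested, 72% stated intent)": +0.10,
}

prior = base_rate + sum(adjustments.values())
prior = min(max(prior, 0.0), 1.0)  # keep the result a valid probability

print(f"Base rate: {base_rate:.0%}")
for factor, delta in adjustments.items():
    print(f"  {factor}: {delta:+.0%}")
print(f"Prior P(H): {prior:.0%}")  # 65%
```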
---
## Evidence
**What was observed**: Beta test with 200 users showed 35% adoption (70 users actively used feature)
**How diagnostic**: This is moderately to strongly diagnostic evidence. Beta tests often show higher engagement than production (selection bias), but 35% is meaningfully above our 20% threshold. The question is whether this beta performance predicts production performance.
### Likelihoods
**P(E|H) = 75%** - Probability of seeing 35% beta adoption IF true production adoption will be >20%
**Reasoning**:
- If production adoption will be >20%, beta should show higher (beta users are early adopters)
- Typical pattern: beta adoption is 1.5-2x production adoption for engaged features
- If production will be 22%, beta would likely be 33-44% → 35% fits this well
- If production will be 25%, beta would likely be 38-50% → 35% is on lower end but plausible
- 75% accounts for variance and beta-to-production conversion uncertainty
**P(E|¬H) = 15%** - Probability of seeing 35% beta adoption IF true production adoption will be ≤20%
**Reasoning**:
- If production adoption will be ≤20% (say, 15%), beta would typically be 22-30%
- Seeing 35% beta when production will be ≤20% would require unusual beta-to-production drop
- This could happen (beta selection bias, novelty effect wears off), but is uncommon
- 15% reflects that this scenario is possible but unlikely
**Likelihood Ratio = 75% / 15% = 5.0**
**Interpretation**: Evidence is moderately strong. A 35% beta result is 5 times more likely if production adoption will exceed 20% than if it won't. This is meaningful but not overwhelming evidence.
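Both likelihoods lean on the assumed 1.5-2x beta-to-production multiplier. A small sketch of that arithmetic, checking the observed 35% beta figure against the production levels discussed above (the multiplier range is the assumption stated in this section; the helper function is illustrative):

```python
# Sketch of the beta-to-production arithmetic behind P(E|H) and P(E|¬H).
# The 1.5x-2.0x multiplier is the assumption stated in this section; near-misses
# are still treated as plausible in the analysis because of sampling variance.

BETA_MULTIPLIER = (1.5, 2.0)  # assumed ratio of beta adoption to production adoption
OBSERVED_BETA = 0.35          # 70 of 200 beta users

def implied_beta_range(production_adoption, multiplier=BETA_MULTIPLIER):
    """Beta adoption range implied by a given production adoption level."""
    low, high = multiplier
    return production_adoption * low, production_adoption * high

for production in (0.15, 0.20, 0.22, 0.25):
    lo, hi = implied_beta_range(production)
    verdict = "within implied range" if lo <= OBSERVED_BETA <= hi else "outside implied range"
    print(f"production {production:.0%} -> expected beta {lo:.1%}-{hi:.1%} "
          f"(observed 35%: {verdict})")
```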
---
## Bayesian Update
### Calculation
**Using odds form** (simpler for this case):
```
Prior Odds = P(H) / P(¬H) = 65% / 35% = 1.86
Likelihood Ratio = 5.0
Posterior Odds = Prior Odds × LR = 1.86 × 5.0 = 9.3
Posterior Probability = Posterior Odds / (1 + Posterior Odds)
= 9.3 / 10.3
= 90.3%
```
**Verification using probability form**:
```
P(E) = [P(E|H) × P(H)] + [P(E|¬H) × P(¬H)]
P(E) = [75% × 65%] + [15% × 35%]
P(E) = 48.75% + 5.25% = 54%
P(H|E) = [P(E|H) × P(H)] / P(E)
P(H|E) = [75% × 65%] / 54%
P(H|E) = 48.75% / 54% = 90.3%
```
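A minimal sketch that reproduces both calculations and confirms the two forms agree (the inputs are the figures used in this analysis):

```python
# Sketch: Bayes update in both odds form and probability form, using the
# figures from this analysis. The two forms should agree.

p_h = 0.65              # prior P(H): production adoption > 20%
p_e_given_h = 0.75      # P(E|H): 35% beta adoption if H is true
p_e_given_not_h = 0.15  # P(E|¬H): 35% beta adoption if H is false

# Odds form
prior_odds = p_h / (1 - p_h)
likelihood_ratio = p_e_given_h / p_e_given_not_h
posterior_odds = prior_odds * likelihood_ratio
posterior_from_odds = posterior_odds / (1 + posterior_odds)

# Probability form (direct Bayes' theorem)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
posterior_from_prob = p_e_given_h * p_h / p_e

print(f"Prior odds:       {prior_odds:.2f}")            # ~1.86
print(f"Likelihood ratio: {likelihood_ratio:.1f}")       # 5.0
print(f"Posterior (odds form): {posterior_from_odds:.1%}")  # ~90.3%
print(f"Posterior (prob form): {posterior_from_prob:.1%}")  # ~90.3%
assert abs(posterior_from_odds - posterior_from_prob) < 1e-9
```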
### Posterior Probability
**P(H|E) = 90%**
### Change in Belief
- **Prior**: 65%
- **Posterior**: 90%
- **Change**: +25 percentage points
- **Interpretation**: Evidence strongly supports hypothesis. Beta test results meaningfully increased confidence that production adoption will exceed 20%.
---
## Sensitivity Analysis
**How sensitive is posterior to inputs?**
### If Prior was different:
| Prior | Posterior | Note |
|-------|-----------|------|
| 50% | 83% | Even starting at coin-flip, evidence pushes to high confidence |
| 75% | 94% | Higher prior → very high posterior |
| 40% | 77% | Lower prior → still high confidence |
**Finding**: Posterior is somewhat robust. Evidence is strong enough that even with priors ranging from 40-75%, posterior stays in 77-94% range.
### If P(E|H) was different:
| P(E\|H) | LR | Posterior | Note |
|---------|-----|-----------|------|
| 60% | 4.0 | 88% | Less diagnostic evidence → still high confidence |
| 85% | 5.67 | 91% | More diagnostic evidence → very high confidence |
| 50% | 3.33 | 86% | Weaker evidence → still high confidence |
**Finding**: Posterior is only mildly sensitive to P(E|H), staying above 85% across the plausible range.
### If P(E|¬H) was different:
| P(E\|¬H) | LR | Posterior | Note |
|----------|-----|-----------|------|
| 25% | 3.0 | 85% | Less diagnostic → still high confidence |
| 10% | 7.5 | 93% | More diagnostic → very high confidence |
| 30% | 2.5 | 82% | Weak evidence → moderate-high confidence |
**Finding**: Posterior is sensitive to P(E|¬H). If beta-to-production drop is common (higher P(E|¬H)), confidence decreases meaningfully.
**Robustness**: Conclusion is **moderately robust**. Across reasonable input ranges, posterior stays above 77%, supporting launch decision. Most sensitive to assumption about beta-to-production conversion rates.
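The three tables above can be regenerated with a short one-at-a-time sweep. This sketch uses the same baseline inputs, so any entry can be rechecked by varying a single value:

```python
# Sketch: one-at-a-time sensitivity sweep over the prior, P(E|H), and P(E|¬H).
# Baseline values are the ones used in this analysis.

def posterior(prior, p_e_h, p_e_not_h):
    """Posterior P(H|E) via the odds form."""
    odds = (prior / (1 - prior)) * (p_e_h / p_e_not_h)
    return odds / (1 + odds)

BASE = dict(prior=0.65, p_e_h=0.75, p_e_not_h=0.15)

sweeps = {
    "prior":     [0.40, 0.50, 0.75],
    "p_e_h":     [0.50, 0.60, 0.85],
    "p_e_not_h": [0.10, 0.25, 0.30],
}

for name, values in sweeps.items():
    for v in values:
        args = {**BASE, name: v}  # vary one input, hold the others at baseline
        print(f"{name}={v:.0%}: posterior = {posterior(**args):.0%}")
```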
---
## Calibration Check
**Am I overconfident?**
- **Did I anchor on initial belief?**
- No - prior (65%) was based on base rates, not arbitrary
- Evidence substantially moved belief (+25pp)
- Not stuck at starting point
- **Did I ignore base rates?**
- No - explicitly used historical feature adoption (60%) as starting point
- Adjusted for known differences systematically
- **Is my posterior extreme (>90% or <10%)?**
- Yes - 90% is borderline extreme
- **Check**: Is evidence truly that strong?
- LR = 5.0 is moderately strong (not very strong)
- Prior was already high (65%)
- Combination pushes to 90%
- **Concern**: May be slightly overconfident
- **Adjustment**: Consider reporting as 85-90% range rather than point estimate
- **Would an outside observer agree with my likelihoods?**
- P(E|H) = 75%: Reasonable - beta users are engaged, expect higher than production
- P(E|¬H) = 15%: Potentially optimistic - beta selection bias could be stronger
- **Alternative**: If P(E|¬H) = 25%, posterior drops to 85% (more conservative)
**Red flags**:
- ✓ Posterior is not 100% or 0%
- ✓ Update magnitude (25pp) matches evidence strength (LR=5.0)
- ✓ Prior uses base rates
- ⚠ Posterior is at upper end (90%) - consider uncertainty range
**Calibration adjustment**: Report as 85-90% confidence range to account for uncertainty in likelihoods.
---
## Limitations & Assumptions
**Key assumptions**:
1. **Beta users are representative of broader user base**
- Assumption: Beta users are roughly 1.5-2x as engaged as the average user
- Risk: If beta users are much more engaged (3x), production adoption could be lower
- Impact: Could invalidate high posterior
2. **No major bugs or UX issues in production**
- Assumption: Production experience will match beta experience
- Risk: Unforeseen technical issues could crater adoption
- Impact: Would make evidence misleading
3. **Competitive landscape stays stable**
- Assumption: No major competitor moves in next 3 months
- Risk: Competitor could launch superior version
- Impact: Could reduce adoption below 20% despite strong beta
4. **Beta sample size is sufficient (n=200)**
- Assumption: 200 users is enough to estimate adoption
- Confidence interval: 35% ± 6.6% at 95% CI (computed in the sketch after this list)
- Impact: True beta adoption could be 28-42%, adding uncertainty
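The interval quoted under assumption 4 is a standard normal-approximation (Wald) interval for a proportion; a minimal sketch using only the standard library:

```python
# Sketch: 95% normal-approximation (Wald) interval for the beta adoption rate.
# 70 of 200 are the beta figures from this analysis; 1.96 is the usual z value
# for a 95% interval.

import math

adopters, n = 70, 200
p_hat = adopters / n                     # 0.35
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
margin = 1.96 * se                       # ~0.066

print(f"Observed beta adoption: {p_hat:.0%}")
print(f"95% CI: {p_hat - margin:.1%} to {p_hat + margin:.1%}")  # ~28.4% to ~41.6%
```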
**What could invalidate this analysis**:
- **Major product changes**: If we significantly alter the feature post-beta, beta results become less predictive
- **Different user segment**: If we launch to a different user segment than beta testers, adoption patterns may differ
- **Seasonal effects**: If beta ran during high-engagement season and launch is during low season
- **Discovery/onboarding issues**: If users don't discover the feature in production (beta users were explicitly invited)
**Uncertainty**:
- **Most uncertain about**: P(E|¬H) = 15% - How often do features with ≤20% production adoption show 35% beta adoption?
- This is the key assumption
- If this is actually 25-30%, posterior drops to 82-85%
- Recommendation: Review historical beta-to-production conversion data
- **Could be wrong if**:
- Beta users are much more engaged than typical users (>2x multiplier)
- Novelty effect in beta wears off quickly in production
- Production launch has poor discoverability/onboarding
---
## Decision Implications
**Given posterior of 90% (range: 85-90%)**:
**Recommended action**: **Proceed with launch** with monitoring plan
**Rationale**:
- 90% confidence exceeds decision threshold for feature launches
- Even conservative estimate (85%) supports launch
- Risk of failure (<20% adoption) is only 10-15%
- Cost of being wrong: Wasted 3 months of development effort
- Cost of not launching: Missing potential high-adoption feature
**If decision threshold is**:
- **High confidence needed (>80%)**: ✅ **LAUNCH** - Exceeds threshold, proceed with production rollout
- **Medium confidence (>60%)**: ✅ **LAUNCH** - Well above threshold, strong conviction
- **Low bar (>40%)**: ✅ **LAUNCH** - Far exceeds minimum threshold
**Monitoring plan** (to validate the forecast; the checkpoint thresholds are encoded in the sketch after this list):
1. **Week 1**: Check if adoption is on track for >6% (20% / 3 months, assuming linear growth)
- If <4%: Red flag, investigate onboarding/discovery issues
- If >8%: Exceeding expectations, validate data quality
2. **Month 1**: Check if adoption is trending toward >10%
- If <7%: Update forecast downward, consider intervention
- If >13%: Exceeding expectations, high confidence
3. **Month 3**: Measure final adoption
- If <20%: Analyze what went wrong, calibrate future forecasts
- If >20%: Validate forecast accuracy, update priors for future features
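A small sketch that encodes the checkpoint thresholds above as a simple checker. The cutoffs come straight from the plan; the helper itself is illustrative, not a production monitoring system.

```python
# Sketch: the checkpoint thresholds from the monitoring plan, encoded as a checker.
# Cutoffs are taken from the plan above; the helper is illustrative only.

CHECKPOINTS = {
    # name: (lower_cutoff, upper_cutoff) on cumulative adoption
    "week_1":  (0.04, 0.08),
    "month_1": (0.07, 0.13),
    "month_3": (0.20, 0.20),  # final threshold: below 20% is a miss, above is a hit
}

def assess(checkpoint: str, observed_adoption: float) -> str:
    lower, upper = CHECKPOINTS[checkpoint]
    if observed_adoption < lower:
        return "below threshold: red flag, investigate and update the forecast downward"
    if observed_adoption > upper:
        return "above upper bound: ahead of plan, validate data quality"
    return "within expected band: on track"

print(assess("week_1", 0.05))   # within expected band: on track
print(assess("month_1", 0.06))  # below threshold: red flag
```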
**Next evidence to gather**:
- **Historical beta-to-production conversion rates**: Review last 5-10 feature launches to calibrate P(E|¬H) more accurately
- **User segment analysis**: Compare beta user demographics to production user base
- **Competitive feature adoption**: Check competitors' sharing feature adoption rates
- **Early production data**: After 1 week of production, use actual adoption data for next Bayesian update
**What would change our mind**:
- **Week 1 adoption <3%**: Would update posterior down to ~60%, trigger investigation
- **Competitor launches superior feature**: Would need to recalculate with new competitive landscape
- **Discovery of major beta sampling bias**: If beta users are 5x more engaged, would significantly reduce confidence
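Once real production data arrives, the same update machinery can be reapplied with the current posterior as the new prior. The week-1 likelihoods below are hypothetical placeholders, not values estimated in this document; they are chosen only to show the mechanics and roughly reproduce the "~60%" figure above.

```python
# Sketch: sequential update once week-1 production data arrives. The current
# posterior becomes the new prior. The two week-1 likelihoods are hypothetical
# placeholders chosen to illustrate the mechanics; they are not estimated here.

def update(prior, p_e_given_h, p_e_given_not_h):
    """Return the posterior after observing evidence E, via the odds form."""
    odds = (prior / (1 - prior)) * (p_e_given_h / p_e_given_not_h)
    return odds / (1 + odds)

current_posterior = 0.903  # posterior after the beta-test evidence

# Hypothetical evidence: week-1 adoption comes in below 3%.
p_low_week1_given_h = 0.10      # assumed: rare if production adoption will exceed 20%
p_low_week1_given_not_h = 0.60  # assumed: common if production adoption will be <= 20%

revised = update(current_posterior, p_low_week1_given_h, p_low_week1_given_not_h)
print(f"Posterior after weak week-1 signal: {revised:.0%}")  # ~61%
```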
---
## Meta: Forecast Quality Assessment
Using rubric from `rubric_bayesian_reasoning_calibration.json`:
**Self-assessment**:
- Prior Quality: 4/5 (good base rate usage, clear adjustments)
- Likelihood Justification: 4/5 (clear reasoning, could use more empirical data)
- Evidence Diagnosticity: 4/5 (LR=5.0 is moderately strong)
- Calculation Correctness: 5/5 (verified with both odds and probability forms)
- Calibration & Realism: 3/5 (posterior is 90%, borderline extreme, flagged for review)
- Assumption Transparency: 4/5 (key assumptions stated clearly)
- Base Rate Usage: 5/5 (explicit base rate from historical data)
- Sensitivity Analysis: 4/5 (comprehensive sensitivity checks)
- Interpretation Quality: 4/5 (clear decision implications with thresholds)
- Avoidance of Common Errors: 4/5 (no prosecutor's fallacy, proper base rates)
**Average: 4.1/5** - Meets "very good" threshold for medium-stakes decision
**Decision**: Forecast is sufficiently rigorous for feature launch decision (medium stakes). Primary area for improvement: gather more data on beta-to-production conversion to refine P(E|¬H) estimate.