Initial commit

2025-11-30 09:07:22 +08:00
commit fab98d059b
179 changed files with 46209 additions and 0 deletions
--- a/skills/retrospective-validation/SKILL.md
+++ b/skills/retrospective-validation/SKILL.md
@@ -0,0 +1,290 @@
+---
+name: Retrospective Validation
+description: Validate methodology effectiveness using historical data without live deployment. Use when rich historical data exists (100+ instances), methodology targets observable patterns (error prevention, test strategy, performance optimization), pattern matching is feasible with clear detection rules, and live deployment has high friction (CI/CD integration effort, user study time, deployment risk). Enables 40-60% time reduction vs prospective validation, 60-80% cost reduction. Confidence calculation model provides statistical rigor. Validated in error recovery (1,336 errors, 23.7% prevention, 0.79 confidence).
+allowed-tools: Read, Grep, Glob, Bash
+---
+
+# Retrospective Validation
+
+**Validate methodologies with historical data, not live deployment.**
+
+> When you have 1,000 past errors, you don't need to wait for 1,000 future errors to prove your methodology works.
+
+---
+
+## When to Use This Skill
+
+Use this skill when:
+- 📊 **Rich historical data**: 100+ instances (errors, test failures, performance issues)
+- 🎯 **Observable patterns**: Methodology targets detectable issues
+- 🔍 **Pattern matching feasible**: Clear detection heuristics, measurable false positive rate
+- ⚡ **High deployment friction**: CI/CD integration costly, user studies time-consuming
+- 📈 **Statistical rigor needed**: Want a quantified confidence score, not just hunches
+- ⏰ **Time constrained**: Need validation in hours, not weeks
+
+**Don't use when**:
+- ❌ Insufficient data (<50 instances)
+- ❌ Emergent effects (human behavior change, UX improvements)
+- ❌ Pattern matching unreliable (>20% false positive rate)
+- ❌ Low deployment friction (1-2 hour CI/CD integration)
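+
+The numeric thresholds in both lists can be folded into a quick pre-check. The sketch below is illustrative only: the function name, the `integration_hours` parameter, and the "borderline" verdict for the unspecified 50-99 range are assumptions, and the emergent-effects criterion still needs human judgment.
+
+```python
+def retrospective_fit(instances: int, false_positive_rate: float, integration_hours: float) -> str:
+    """Return 'use', 'borderline', or 'avoid' based on the checklist thresholds."""
+    if instances < 50:
+        return "avoid"       # insufficient data
+    if false_positive_rate > 0.20:
+        return "avoid"       # pattern matching unreliable
+    if integration_hours <= 2:
+        return "avoid"       # live deployment is cheap, so validate prospectively
+    if instances < 100:
+        return "borderline"  # assumption: 50-99 falls below the 100+ guideline
+    return "use"
+
+
+# Hypothetical inputs: 1,336 errors, 2 of 30 sampled matches wrong, 40 h integration
+print(retrospective_fit(1336, 2 / 30, 40))  # use
+```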
+
+---
+
+## Quick Start (30 minutes)
+
+### Step 1: Check Historical Data (5 min)
+
+```bash
+# Example: Error data for meta-cc
+meta-cc query-tools --status error | jq '. | length'
+# Output: 1336 errors ✅ (>100 threshold)
+
+# Example: Test failures from CI logs
+grep "FAILED" ci-logs/*.txt | wc -l
+# Output: 427 failures ✅
+```
+
+**Threshold**: ≥100 instances for statistical confidence
+
+### Step 2: Define Detection Rule (10 min)
+
+```yaml
+Tool: validate-path.sh
+Prevents: '"File not found" errors'
+Detection:  # error message matches any of:
+  - "no such file or directory"
+  - "cannot read file"
+  - "file does not exist"
+Confidence: High (90%+), deterministic check
+```
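+
+The rule above could also be expressed as a small matcher. This is a minimal sketch, assuming case-insensitive matching (the rule does not say how case is handled); the function name and the sample messages are made up for illustration.
+
+```python
+import re
+
+# Patterns copied from the detection rule; IGNORECASE is an assumption.
+FILE_NOT_FOUND = re.compile(
+    r"no such file or directory|cannot read file|file does not exist",
+    re.IGNORECASE,
+)
+
+
+def is_file_not_found(message: str) -> bool:
+    """Return True if the error message matches the detection rule."""
+    return FILE_NOT_FOUND.search(message) is not None
+
+
+# Hypothetical error messages
+errors = [
+    "open config.yaml: No such file or directory",
+    "permission denied: /etc/shadow",
+    "cannot read file 'notes.md'",
+]
+matched = [e for e in errors if is_file_not_found(e)]
+print(len(matched))  # 2
+```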
+
+### Step 3: Apply Rule to Historical Data (10 min)
+
+```bash
+# Count matches
+grep -E "(no such file|cannot read|does not exist)" errors.log | wc -l
+# Output: 163 errors (12.2% of total)
+
+# Sample manual validation (30 errors)
+# True positives: 28/30 (93.3%)
+# Adjusted: 163 * 0.933 = 152 preventable ✅
+```
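+
+The adjustment in the comments above scales the raw match count by the true positive rate found in the manual sample. A minimal sketch of that arithmetic, using the figures from this step (the function name is hypothetical):
+
+```python
+def adjusted_preventable(matches: int, sample_size: int, true_positives: int) -> float:
+    """Scale the raw match count by the sample's true positive rate."""
+    if sample_size <= 0:
+        raise ValueError("sample_size must be positive")
+    return matches * true_positives / sample_size
+
+
+estimate = adjusted_preventable(matches=163, sample_size=30, true_positives=28)
+print(round(estimate))      # 152
+print(f"{163 / 1336:.1%}")  # 12.2%
+```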
+
+### Step 4: Calculate Confidence (5 min)
+
+```
+Confidence = Data Quality × Accuracy × Logical Correctness
+           = 0.85 × 0.933 × 1.0
+           = 0.79 (High confidence)
+```
+
+**Result**: The tool would have prevented an estimated 152 errors, with a confidence score of 0.79 (high).
+
+---
+
+## Four-Phase Process
+
+### Phase 1: Data Collection
+
+**1. Identify Data Sources**
+
+For Claude Code / meta-cc:
+```bash
+# Error history
+meta-cc query-tools --status error
+
+# User pain points
+meta-cc query-user-messages --pattern "error|fail|broken"
+
+# Error context
+meta-cc query-context --error-signature "..."
+```
+
+For other projects:
+- Git history (commits, diffs, blame)
+- CI/CD logs (test failures, build errors)
+- Application logs (runtime errors)
+- Issue trackers (bug reports)
+
+**2. Quantify Baseline**
+
+Metrics needed:
+- **Volume**: Total instances (e.g., 1,336 errors)
+- **Rate**: Frequency (e.g., 5.78% error rate)
+- **Distribution**: Category breakdown (e.g., file-not-found: 12.2%)
+- **Impact**: Cost (e.g., MTTD: 15 min, MTTR: 30 min)
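+
+Volume, rate, and distribution can be computed directly once each historical instance carries a category label. The sketch below assumes that input shape; the function name and the sample labels are invented for illustration, and impact metrics (MTTD, MTTR) are left out because they need timing data.
+
+```python
+from collections import Counter
+
+
+def baseline(categories, total_operations):
+    """Summarize volume, overall rate, and per-category share.
+
+    categories: one label per historical instance (assumed input shape).
+    total_operations: all operations observed, including successful ones.
+    """
+    volume = len(categories)
+    rate = volume / total_operations if total_operations else 0.0
+    shares = {name: count / volume for name, count in Counter(categories).most_common()}
+    return {"volume": volume, "rate": rate, "distribution": shares}
+
+
+# Made-up labels for illustration only
+labels = ["syntax"] * 5 + ["file-not-found"] * 3 + ["timeout"] * 2
+summary = baseline(labels, total_operations=200)
+print(summary["volume"], f"{summary['rate']:.1%}")  # 10 5.0%
+```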
+
+### Phase 2: Pattern Definition
+
+**1. Create Detection Rules**
+
+For each tool/methodology:
+```yaml
+what_it_prevents: Error type or failure mode
+detection_rule: Pattern matching heuristic
+confidence: Estimated accuracy (high/medium/low)
+```
+
+**2. Define Success Criteria**
+
+```yaml
+prevention: Message matches AND tool would catch it
+speedup: Tool faster than manual debugging
+reliability: True positive rate of at least 80% in the manual sample
+```
+
+### Phase 3: Validation Execution
+
+**1. Apply Rules to Historical Data**
+
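+```python
+# Sketch: classify(), find_applicable_tool(), and would_have_prevented()
+# are placeholders for your own detection rules.
+def prevention_rate(historical_data):
+    """Return the percentage of instances a tool would have prevented."""
+    count_prevented = 0
+    for instance in historical_data:
+        category = classify(instance)
+        tool = find_applicable_tool(category)
+        if tool is not None and would_have_prevented(tool, instance):
+            count_prevented += 1
+    if not historical_data:
+        return 0.0  # avoid dividing by zero on an empty data set
+    return 100 * count_prevented / len(historical_data)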
+
+**2. Sample Manual Validation**
+
+```
+Sample size: 30 randomly drawn instances (a practical minimum; the estimate carries a wide margin at this size)
+For each: "Would tool have prevented this?"
+Calculate: True positive rate, False positive rate
+Adjust: prevention_claim * true_positive_rate
+```
+
+**Example** (Bootstrap-003):
+```
+Sample: 30/317 claimed prevented
+True positives: 28 (93.3%)
+Adjusted: 317 * 0.933 = 296 errors
+Confidence: High (93%+)
+```
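+
+A sample of 30 gives only a rough estimate of the true positive rate. One way to show that uncertainty is a Wilson score interval. This is an addition for illustration, not part of the original Bootstrap-003 analysis, and the function name is hypothetical.
+
+```python
+import math
+
+
+def wilson_interval(successes: int, n: int, z: float = 1.96):
+    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
+    if n <= 0:
+        raise ValueError("n must be positive")
+    p = successes / n
+    denom = 1 + z**2 / n
+    center = (p + z**2 / (2 * n)) / denom
+    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
+    return center - margin, center + margin
+
+
+low, high = wilson_interval(28, 30)
+print(f"{low:.2f} to {high:.2f}")  # 0.79 to 0.98
+```
+
+For 28 of 30, the interval spans roughly 0.79 to 0.98, so the adjusted count is better read as a range than as a point value.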
+
+**3. Measure Performance**
+
+```bash
+# Tool time
+time tool.sh < test_input
+# Output: 0.05s
+
+# Manual time (estimate from historical data)
+# Average debug time: 15 min = 900s
+
+# Speedup: 900 / 0.05 = 18,000x
+```
+
+### Phase 4: Confidence Assessment
+
+**Confidence Formula**:
+
+```
+Confidence = D × A × L
+
+Where:
+D = Data Quality (0.5-1.0)
+A = Accuracy (True Positive Rate, 0.5-1.0)
+L = Logical Correctness (0.5-1.0)
+```
+
+**Data Quality** (D):
+- 1.0: Complete, accurate, representative
+- 0.8-0.9: Minor gaps or biases
+- 0.6-0.7: Significant gaps
+- <0.6: Unreliable data
+
+**Accuracy** (A):
+- 1.0: 100% true positive rate (verified)
+- 0.8-0.95: High (sample validation 80-95%)
+- 0.6-0.8: Medium (60-80%)
+- <0.6: Low (unreliable pattern matching)
+
+**Logical Correctness** (L):
+- 1.0: Deterministic (tool directly addresses root cause)
+- 0.8-0.9: High correlation (strong evidence)
+- 0.6-0.7: Moderate correlation
+- <0.6: Weak or speculative
+
+**Example** (Bootstrap-003):
+```
+D = 0.85 (Complete error logs, minor gaps in context)
+A = 0.933 (93.3% true positive rate from sample)
+L = 1.0 (File validation is deterministic)
+
+Confidence = 0.85 × 0.933 × 1.0 = 0.79 (High)
+```
+
+**Interpretation**:
+- ≥0.75: High confidence (publishable)
+- 0.60-0.74: Medium confidence (needs caveats)
+- 0.45-0.59: Low confidence (suggestive, not conclusive)
+- <0.45: Insufficient confidence (need prospective validation)
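+
+The formula and the interpretation bands can be combined into a small helper. This is a sketch under stated assumptions: the function names are hypothetical, inputs are validated against 0-1 rather than the 0.5-1.0 range listed above, and a score between 0.74 and 0.75 is treated as medium.
+
+```python
+def confidence_score(d: float, a: float, l: float) -> float:
+    """Multiply data quality, accuracy, and logical correctness."""
+    for name, value in (("d", d), ("a", a), ("l", l)):
+        if not 0.0 <= value <= 1.0:
+            raise ValueError(f"{name} must be between 0 and 1, got {value}")
+    return d * a * l
+
+
+def interpret(score: float) -> str:
+    """Map a confidence score to the bands listed above."""
+    if score >= 0.75:
+        return "high"
+    if score >= 0.60:
+        return "medium"
+    if score >= 0.45:
+        return "low"
+    return "insufficient"
+
+
+score = confidence_score(0.85, 0.933, 1.0)  # Bootstrap-003 inputs
+print(f"{score:.2f} {interpret(score)}")  # 0.79 high
+```
+
+Because the three factors multiply, one weak factor caps the result: with D = 0.6, the score cannot exceed 0.60 even when A and L are both 1.0.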
+
+---
+
+## Comparison: Retrospective vs Prospective
+
+| Aspect | Retrospective | Prospective |
+|--------|--------------|-------------|
+| **Time** | Hours-days | Weeks-months |
+| **Cost** | Low (queries) | High (deployment) |
+| **Risk** | Zero | May introduce issues |
+| **Confidence** | 0.60-0.95 | 0.90-1.0 |
+| **Data** | Historical | New |
+| **Scope** | Full history | Limited window |
+| **Bias** | Hindsight | No hindsight bias |
+
+**When to use each**:
+- **Retrospective**: Fast validation, high data volume, observable patterns
+- **Prospective**: Behavioral effects, UX, emergent properties
+- **Hybrid**: Retrospective first, limited prospective for edge cases
+
+---
+
+## Success Criteria
+
+Retrospective validation has succeeded when all of the following hold:
+
+1. **Sufficient data**: ≥100 instances analyzed
+2. **High confidence**: ≥0.75 overall confidence score
+3. **Sample validated**: ≥80% true positive rate
+4. **Impact quantified**: Prevention % or speedup measured
+5. **Time savings**: 40-60% faster than prospective validation
+
+**Bootstrap-003 Validation**:
+- ✅ Data: 1,336 errors analyzed
+- ✅ Confidence: 0.79 (high)
+- ✅ Sample: 93.3% true positive rate
+- ✅ Impact: 23.7% error prevention (317 of 1,336 errors matched; about 22% after the 93.3% sample adjustment)
+- ✅ Time: 3 hours vs 2+ weeks (prospective)
+
+---
+
+## Related Skills
+
+**Parent framework**:
+- [methodology-bootstrapping](../methodology-bootstrapping/SKILL.md) - Core OCA cycle
+
+**Complementary acceleration**:
+- [rapid-convergence](../rapid-convergence/SKILL.md) - Fast iteration (uses retrospective)
+- [baseline-quality-assessment](../baseline-quality-assessment/SKILL.md) - Strong iteration 0
+
+---
+
+## References
+
+**Core guide**:
+- [Four-Phase Process](reference/process.md) - Detailed methodology
+- [Confidence Calculation](reference/confidence.md) - Statistical rigor
+- [Detection Rules](reference/detection-rules.md) - Pattern matching guide
+
+**Examples**:
+- [Error Recovery Validation](examples/error-recovery-1336-errors.md) - Bootstrap-003
+
+---
+
+**Status**: ✅ Validated | Bootstrap-003 | 0.79 confidence | 40-60% time reduction