---
name: Retrospective Validation
description: Validate methodology effectiveness using historical data without live deployment. Use when rich historical data exists (100+ instances), methodology targets observable patterns (error prevention, test strategy, performance optimization), pattern matching is feasible with clear detection rules, and live deployment has high friction (CI/CD integration effort, user study time, deployment risk). Enables 40-60% time reduction vs prospective validation, 60-80% cost reduction. Confidence calculation model provides statistical rigor. Validated in error recovery (1,336 errors, 23.7% prevention, 0.79 confidence).
allowed-tools: Read, Grep, Glob, Bash
---

# Retrospective Validation

**Validate methodologies with historical data, not live deployment.**

> When you have 1,000 past errors, you don't need to wait for 1,000 future errors to prove your methodology works.

---

## When to Use This Skill

Use this skill when:
- 📊 **Rich historical data**: 100+ instances (errors, test failures, performance issues)
- 🎯 **Observable patterns**: Methodology targets detectable issues
- 🔍 **Pattern matching feasible**: Clear detection heuristics, measurable false positive rate
- ⚡ **High deployment friction**: CI/CD integration costly, user studies time-consuming
- 📈 **Statistical rigor needed**: Want confidence intervals, not just hunches
- ⏰ **Time constrained**: Need validation in hours, not weeks

**Don't use when**:
- ❌ Insufficient data (<50 instances)
- ❌ Emergent effects (human behavior change, UX improvements)
- ❌ Pattern matching unreliable (>20% false positive rate)
- ❌ Low deployment friction (1-2 hour CI/CD integration)

---

## Quick Start (30 minutes)

### Step 1: Check Historical Data (5 min)

```bash
# Example: Error data for meta-cc
meta-cc query-tools --status error | jq 'length'
# Output: 1336 errors ✅ (≥100 threshold)

# Example: Test failures from CI logs
grep "FAILED" ci-logs/*.txt | wc -l
# Output: 427 failures ✅
```

**Threshold**: ≥100 instances for statistical confidence

### Step 2: Define Detection Rule (10 min)

```yaml
Tool: validate-path.sh
Prevents: File-not-found errors
Detection:  # error message matches any of these
  - "no such file or directory"
  - "cannot read file"
  - "file does not exist"
Confidence: High (90%+), deterministic check
```

### Step 3: Apply Rule to Historical Data (10 min)

```bash
# Count matching lines (-i because messages are often capitalized,
# e.g. "No such file or directory"; -c prints the count directly)
grep -Eic "no such file|cannot read|does not exist" errors.log
# Output: 163 errors (12.2% of total)

# Sample manual validation (30 errors)
# True positives: 28/30 (93.3%)
# Adjusted: 163 * 0.933 = 152 preventable ✅
```

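
The adjustment in the last comment scales the raw match count by the true-positive rate observed in the manual sample. A minimal sketch of that arithmetic follows; the function name `adjusted_prevention` is hypothetical and not part of meta-cc.

```python
def adjusted_prevention(matched: int, sample_true_positives: int, sample_size: int) -> int:
    """Scale a raw match count by the true-positive rate seen in a manual sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    true_positive_rate = sample_true_positives / sample_size
    return round(matched * true_positive_rate)


# Numbers from the example above: 163 matches, 28 of 30 sampled matches confirmed.
preventable = adjusted_prevention(163, 28, 30)
print(preventable)  # 152
```
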
### Step 4: Calculate Confidence (5 min)

```
Confidence = Data Quality × Accuracy × Logical Correctness
           = 0.85 × 0.933 × 1.0
           = 0.79 (High confidence)
```

**Result**: The tool would have prevented an estimated 152 errors, with a confidence score of 0.79 (high).

---

## Four-Phase Process

### Phase 1: Data Collection

**1. Identify Data Sources**

For Claude Code / meta-cc:
```bash
# Error history
meta-cc query-tools --status error

# User pain points
meta-cc query-user-messages --pattern "error|fail|broken"

# Error context
meta-cc query-context --error-signature "..."
```

For other projects:
- Git history (commits, diffs, blame)
- CI/CD logs (test failures, build errors)
- Application logs (runtime errors)
- Issue trackers (bug reports)

**2. Quantify Baseline**

Metrics needed:
- **Volume**: Total instances (e.g., 1,336 errors)
- **Rate**: Frequency (e.g., 5.78% error rate)
- **Distribution**: Category breakdown (e.g., file-not-found: 12.2%)
- **Impact**: Cost (e.g., MTTD: 15 min, MTTR: 30 min)

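
The first three baseline metrics can be derived from a list of categorized instances. The sketch below assumes one category label per historical instance and a separately known count of all operations in the same window; both the input format and the `baseline` helper are assumptions for illustration, and the data is made up.

```python
from collections import Counter


def baseline(categories: list[str], total_operations: int) -> dict:
    """Summarize volume, rate, and category distribution for historical instances.

    categories: one category label per historical instance (hypothetical format).
    total_operations: all operations observed in the same window, failed or not.
    """
    volume = len(categories)
    counts = Counter(categories)
    return {
        "volume": volume,
        "rate_pct": round(100 * volume / total_operations, 2),
        "distribution_pct": {
            name: round(100 * n / volume, 1) for name, n in counts.most_common()
        },
    }


# Toy data, not taken from the Bootstrap-003 dataset.
errors = ["file-not-found"] * 3 + ["timeout"] * 2 + ["syntax"] * 5
summary = baseline(errors, total_operations=200)
print(summary)
```

Impact (detection and recovery time) usually needs timestamps or manual estimates, so it is left out of this sketch.
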
### Phase 2: Pattern Definition

**1. Create Detection Rules**

For each tool/methodology:
```yaml
what_it_prevents: Error type or failure mode
detection_rule: Pattern matching heuristic
confidence: Estimated accuracy (high/medium/low)
```

**2. Define Success Criteria**

```yaml
prevention: Message matches AND tool would catch it
speedup: Tool faster than manual debugging
reliability: True positive rate of at least 80% in the manual sample
```

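
A detection rule can be kept as data and applied mechanically. The following is a minimal sketch that encodes the Quick Start file-not-found rule as a regular expression; the `DetectionRule` class layout and the sample messages are assumptions for illustration, not an existing API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionRule:
    tool: str            # tool or methodology being validated
    prevents: str        # error type or failure mode
    pattern: re.Pattern  # pattern matching heuristic

    def matches(self, message: str) -> bool:
        return self.pattern.search(message) is not None


# Rule text from the Quick Start example; matching is case-insensitive.
FILE_NOT_FOUND = DetectionRule(
    tool="validate-path.sh",
    prevents="file-not-found errors",
    pattern=re.compile(
        r"no such file or directory|cannot read file|file does not exist",
        re.IGNORECASE,
    ),
)

# Made-up messages for demonstration only.
messages = [
    "open config.yaml: No such file or directory",
    "permission denied: /etc/shadow",
    "error: file does not exist: out/report.md",
]
matched = [m for m in messages if FILE_NOT_FOUND.matches(m)]
print(len(matched))  # 2
```
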
### Phase 3: Validation Execution

**1. Apply Rules to Historical Data**

```python
# Sketch: classify, find_applicable_tool, and would_have_prevented are
# project-specific helpers you supply; historical_data is a list of instances.
count_prevented = 0
for instance in historical_data:
    category = classify(instance)
    tool = find_applicable_tool(category)
    if tool is not None and would_have_prevented(tool, instance):
        count_prevented += 1

prevention_rate = count_prevented / len(historical_data) * 100
```

**2. Sample Manual Validation**

```
Sample size: 30 instances, chosen at random from the matches
For each: "Would tool have prevented this?"
Calculate: True positive rate, False positive rate (report with a 95% confidence interval)
Adjust: prevention_claim * true_positive_rate
```

**Example** (Bootstrap-003):
```
Sample: 30 of 317 claimed preventions
True positives: 28 (93.3%)
Adjusted: 317 * 0.933 = 296 errors
Confidence: High (93%+)
```

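
A sample of 30 gives a point estimate with noticeable uncertainty, so it can help to report an interval alongside the 93.3% figure. The sketch below uses the Wilson score interval, which is a standard choice for small samples but is not something the Bootstrap-003 write-up states it used. For 28 of 30 it gives roughly 79% to 98%, which still clears the 80% bar only at the point estimate.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 28 true positives in a manual sample of 30.
low, high = wilson_interval(28, 30)
print(f"{low:.3f} to {high:.3f}")  # 0.787 to 0.982
```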
**3. Measure Performance**

```bash
# Tool time
time tool.sh < test_input
# Output: 0.05s

# Manual time (estimate from historical data)
# Average debug time: 15 min = 900s

# Speedup: 900 / 0.05 = 18,000x
```

### Phase 4: Confidence Assessment

**Confidence Formula**:

```
Confidence = D × A × L

Where:
  D = Data Quality (0.5-1.0)
  A = Accuracy (True Positive Rate, 0.5-1.0)
  L = Logical Correctness (0.5-1.0)
```

**Data Quality** (D):
- 1.0: Complete, accurate, representative
- 0.8-0.9: Minor gaps or biases
- 0.6-0.7: Significant gaps
- <0.6: Unreliable data

**Accuracy** (A):
- 1.0: 100% true positive rate (verified)
- 0.8-0.95: High (sample validation 80-95%)
- 0.6-0.8: Medium (60-80%)
- <0.6: Low (unreliable pattern matching)

**Logical Correctness** (L):
- 1.0: Deterministic (tool directly addresses root cause)
- 0.8-0.9: High correlation (strong evidence)
- 0.6-0.7: Moderate correlation
- <0.6: Weak or speculative

**Example** (Bootstrap-003):
```
D = 0.85 (Complete error logs, minor gaps in context)
A = 0.933 (93.3% true positive rate from sample)
L = 1.0 (File validation is deterministic)

Confidence = 0.85 × 0.933 × 1.0 = 0.79 (High)
```

**Interpretation**:
- ≥0.75: High confidence (publishable)
- 0.60-0.74: Medium confidence (needs caveats)
- 0.45-0.59: Low confidence (suggestive, not conclusive)
- <0.45: Insufficient confidence (need prospective validation)

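
The formula and the interpretation bands can be combined into two small functions. This is a sketch of the scoring as described in this section; the function names `confidence` and `interpret` are hypothetical, and the score is a heuristic product of three judgments, not a statistical probability.

```python
def confidence(data_quality: float, accuracy: float, logical_correctness: float) -> float:
    """Confidence = D × A × L, each factor a score between 0 and 1."""
    factors = (("D", data_quality), ("A", accuracy), ("L", logical_correctness))
    for name, value in factors:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return data_quality * accuracy * logical_correctness


def interpret(score: float) -> str:
    """Map a confidence score to the bands listed above."""
    if score >= 0.75:
        return "high"
    if score >= 0.60:
        return "medium"
    if score >= 0.45:
        return "low"
    return "insufficient"


score = confidence(0.85, 0.933, 1.0)  # Bootstrap-003 factors
print(f"{score:.2f} -> {interpret(score)}")  # 0.79 -> high
```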
---

## Comparison: Retrospective vs Prospective

| Aspect | Retrospective | Prospective |
|--------|--------------|-------------|
| **Time** | Hours-days | Weeks-months |
| **Cost** | Low (queries) | High (deployment) |
| **Risk** | Zero | May introduce issues |
| **Confidence** | 0.60-0.95 | 0.90-1.0 |
| **Data** | Historical | New |
| **Scope** | Full history | Limited window |
| **Bias** | Hindsight | None |

**When to use each**:
- **Retrospective**: Fast validation, high data volume, observable patterns
- **Prospective**: Behavioral effects, UX, emergent properties
- **Hybrid**: Retrospective first, limited prospective for edge cases

---

## Success Criteria

Retrospective validation has succeeded when:

1. **Sufficient data**: ≥100 instances analyzed
2. **High confidence**: ≥0.75 overall confidence score
3. **Sample validated**: ≥80% true positive rate
4. **Impact quantified**: Prevention % or speedup measured
5. **Time savings**: 40-60% faster than prospective validation

**Bootstrap-003 Validation**:
- ✅ Data: 1,336 errors analyzed
- ✅ Confidence: 0.79 (high)
- ✅ Sample: 93.3% true positive rate
- ✅ Impact: 23.7% error prevention
- ✅ Time: 3 hours vs 2+ weeks (prospective)

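
The first four criteria are simple threshold checks and can be evaluated mechanically. A minimal sketch follows, with a hypothetical `check_success` helper; the time-savings criterion needs a prospective estimate to compare against, so it is left out here.

```python
def check_success(instances: int, confidence_score: float,
                  true_positive_rate: float, impact_measured: bool) -> dict[str, bool]:
    """Evaluate the first four success criteria; time savings is judged separately."""
    return {
        "sufficient_data": instances >= 100,
        "high_confidence": confidence_score >= 0.75,
        "sample_validated": true_positive_rate >= 0.80,
        "impact_quantified": impact_measured,
    }


# Bootstrap-003 figures from the list above.
result = check_success(1336, 0.79, 0.933, impact_measured=True)
print(all(result.values()))  # True
```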
---

## Related Skills

**Parent framework**:
- [methodology-bootstrapping](../methodology-bootstrapping/SKILL.md) - Core OCA cycle

**Complementary acceleration**:
- [rapid-convergence](../rapid-convergence/SKILL.md) - Fast iteration (uses retrospective)
- [baseline-quality-assessment](../baseline-quality-assessment/SKILL.md) - Strong iteration 0

---

## References

**Core guide**:
- [Four-Phase Process](reference/process.md) - Detailed methodology
- [Confidence Calculation](reference/confidence.md) - Statistical rigor
- [Detection Rules](reference/detection-rules.md) - Pattern matching guide

**Examples**:
- [Error Recovery Validation](examples/error-recovery-1336-errors.md) - Bootstrap-003

---

**Status**: ✅ Validated | Bootstrap-003 | 0.79 confidence | 40-60% time reduction