Initial commit

Zhongwei Li
2025-11-30 09:07:22 +08:00
commit fab98d059b
179 changed files with 46209 additions and 0 deletions

@@ -0,0 +1,326 @@
# Confidence Scoring Methodology
**Version**: 1.0
**Purpose**: Quantify validation confidence for methodologies
**Range**: 0.0-1.0 (threshold: 0.80 for production)
---
## Confidence Formula
```
Confidence = 0.4 × coverage +
             0.3 × validation_sample_size +
             0.2 × pattern_consistency +
             0.1 × expert_review

Where all components ∈ [0, 1]
```
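The weighted sum is simple to compute directly. A minimal sketch in Python (function and argument names are illustrative, not part of the framework):

```python
# Minimal confidence calculator; each component is clamped into [0, 1].
def confidence(coverage, validated_count, pattern_consistency, expert_review):
    clamp = lambda x: max(0.0, min(1.0, x))
    sample_size = clamp(validated_count / 50)  # 50+ cases saturate at 1.0
    return (0.4 * clamp(coverage) +
            0.3 * sample_size +
            0.2 * clamp(pattern_consistency) +
            0.1 * clamp(expert_review))

# Error Recovery example from below:
print(round(confidence(0.954, 1336, 0.91, 1.0), 3))  # 0.964
```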
---
## Component 1: Coverage (40% weight)
**Definition**: Percentage of cases methodology handles
**Calculation**:
```
coverage = handled_cases / total_cases
```
**Example** (Error Recovery):
```
coverage = 1275 classified / 1336 total
         = 0.954
```
**Thresholds**:
- 0.95-1.0: Excellent (comprehensive)
- 0.80-0.94: Good (most cases covered)
- 0.60-0.79: Fair (significant gaps)
- <0.60: Poor (incomplete)
---
## Component 2: Validation Sample Size (30% weight)
**Definition**: How much data was used for validation
**Calculation**:
```
validation_sample_size = min(validated_count / 50, 1.0)
```
**Rationale**: A sample of 50+ validated cases provides statistical confidence
**Example** (Error Recovery):
```
validation_sample_size = min(1336 / 50, 1.0)
                       = min(26.72, 1.0)
                       = 1.0
```
**Thresholds**:
- 50+ cases: 1.0 (high confidence)
- 20-49 cases: 0.4-0.98 (medium confidence)
- 10-19 cases: 0.2-0.38 (low confidence)
- <10 cases: <0.2 (insufficient data)
---
## Component 3: Pattern Consistency (20% weight)
**Definition**: Success rate when patterns are applied
**Calculation**:
```
pattern_consistency = successful_applications / total_applications
```
**Measurement**:
1. Apply each pattern to 5-10 representative cases
2. Count successes (problem solved correctly)
3. Calculate success rate per pattern
4. Average across all patterns (see the sketch below)
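A minimal sketch of steps 3-4, assuming each pattern's trial results are recorded as (successes, attempts) pairs (the `trials` structure is illustrative):

```python
# Per-pattern success rates, averaged across all patterns.
def pattern_consistency(trials):
    rates = [successes / attempts
             for successes, attempts in trials.values() if attempts]
    return sum(rates) / len(rates) if rates else 0.0

trials = {"fix-and-retry": (9, 10), "test-fixture": (10, 10),
          "path-correction": (8, 10)}
print(round(pattern_consistency(trials), 2))  # 0.9
```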
**Example** (Error Recovery):
```
Pattern 1 (Fix-and-Retry): 9/10 = 0.90
Pattern 2 (Test Fixture): 10/10 = 1.0
Pattern 3 (Path Correction): 8/10 = 0.80
...
Pattern 10 (Permission Fix): 10/10 = 1.0
Average: 91/100 = 0.91
```
**Thresholds**:
- 0.90-1.0: Excellent (reliable patterns)
- 0.75-0.89: Good (mostly reliable)
- 0.60-0.74: Fair (needs refinement)
- <0.60: Poor (unreliable)
---
## Component 4: Expert Review (10% weight)
**Definition**: Graded validation by a domain expert
**Values**:
- 1.0: Reviewed and approved by expert
- 0.5: Partially reviewed or peer-reviewed
- 0.0: Not reviewed
**Review Criteria**:
1. Patterns are correct and complete
2. No critical gaps identified
3. Transferability claims validated
4. Automation tools tested
5. Documentation is accurate
**Example** (Error Recovery):
```
expert_review = 1.0 (fully reviewed and validated)
```
---
## Complete Example: Error Recovery
**Component Values**:
```
coverage = 1275/1336 = 0.954
validation_sample_size = min(1336/50, 1.0) = 1.0
pattern_consistency = 91/100 = 0.91
expert_review = 1.0 (reviewed)
```
**Confidence Calculation**:
```
Confidence = 0.4 × 0.954 +
             0.3 × 1.0 +
             0.2 × 0.91 +
             0.1 × 1.0
           = 0.382 + 0.300 + 0.182 + 0.100
           = 0.964
```
**Interpretation**: **96.4% confidence** (High - Production Ready)
---
## Confidence Bands
### High Confidence (0.80-1.0)
**Characteristics**:
- ≥80% coverage
- ≥20 validated cases
- ≥75% pattern consistency
- Reviewed by expert
**Actions**: Deploy to production, recommend broadly
**Example Methodologies**:
- Error Recovery (0.96)
- Testing Strategy (0.87)
- CI/CD Pipeline (0.85)
---
### Medium Confidence (0.60-0.79)
**Characteristics**:
- 60-79% coverage
- 10-19 validated cases
- 60-74% pattern consistency
- May lack expert review
**Actions**: Use with caution, monitor results, refine gaps
**Example**:
- New methodology with limited validation
- Partial coverage of domain
---
### Low Confidence (<0.60)
**Characteristics**:
- <60% coverage
- <10 validated cases
- <60% pattern consistency
- Not reviewed
**Actions**: Do not use in production, requires significant refinement
**Example**:
- Untested methodology
- Insufficient validation data
---
## Adjustments for Domain Complexity
**Adjust thresholds for complex domains**:
**Simple Domain** (e.g., file operations):
- Target: 0.85+ (higher expectations)
- Coverage: ≥90%
- Patterns: 3-5 sufficient
**Medium Domain** (e.g., testing):
- Target: 0.80+ (standard)
- Coverage: ≥80%
- Patterns: 6-8 typical
**Complex Domain** (e.g., distributed systems):
- Target: 0.75+ (realistic)
- Coverage: ≥70%
- Patterns: 10-15 needed
---
## Confidence Over Time
**Track confidence across iterations**:
```
Iteration 0: N/A (baseline only)
Iteration 1: 0.42 (low - initial patterns)
Iteration 2: 0.63 (medium - expanded)
Iteration 3: 0.79 (approaching target)
Iteration 4: 0.88 (high - converged)
Iteration 5: 0.87 (stable)
```
**Convergence**: Confidence stable within ±0.05 across 2 consecutive iterations
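A minimal check encoding this rule (assumes one score per iteration, oldest first):

```python
# Converged when the last two iterations stay within ±0.05 of each other.
def converged(history, window=2, tolerance=0.05):
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) <= tolerance

print(converged([0.42, 0.63, 0.79, 0.88, 0.87]))  # True (|0.88 - 0.87| ≤ 0.05)
```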
---
## Confidence vs. V_meta
**Different but related**:
**V_meta**: Methodology quality (completeness, transferability, automation)
**Confidence**: Validation strength (how sure we are V_meta is accurate)
**Relationship**:
- High V_meta, Low Confidence: Good methodology, insufficient validation
- High V_meta, High Confidence: Production-ready
- Low V_meta, High Confidence: Well-validated but incomplete methodology
- Low V_meta, Low Confidence: Needs significant work
---
## Reporting Template
```markdown
## Validation Confidence Report
**Methodology**: [Name]
**Version**: [X.Y]
**Validation Date**: [YYYY-MM-DD]
### Confidence Score: [X.XX]
**Components**:
- Coverage: [X.XX] ([handled]/[total] cases)
- Sample Size: [X.XX] ([count] validated cases)
- Pattern Consistency: [X.XX] ([successes]/[applications])
- Expert Review: [X.XX] ([status])
**Confidence Band**: [High/Medium/Low]
**Recommendation**: [Deploy/Refine/Rework]
**Gaps Identified**:
1. [Gap description]
2. [Gap description]
**Next Steps**:
1. [Action item]
2. [Action item]
```
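The band and recommendation fields follow mechanically from the confidence bands above; a minimal mapping (thresholds taken from this document, function name illustrative):

```python
# Map a confidence score to the template's band and recommendation fields.
def band_and_recommendation(score):
    if score >= 0.80:
        return "High", "Deploy"
    if score >= 0.60:
        return "Medium", "Refine"
    return "Low", "Rework"

print(band_and_recommendation(0.96))  # ('High', 'Deploy')
```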
---
## Automation
**Confidence Calculator**:
```bash
#!/bin/bash
# scripts/calculate-confidence.sh
# Usage: calculate-confidence.sh <methodology> <history> [expert-review-score]
METHODOLOGY=$1
HISTORY=$2

# Calculate coverage (helper functions assumed to be sourced from the project library)
coverage=$(calculate_coverage "$METHODOLOGY" "$HISTORY")

# Calculate sample size score: min(validated/50, 1.0)
sample_size=$(count_validated_cases "$HISTORY")
if [ "$sample_size" -ge 50 ]; then
    sample_score=1.0
else
    sample_score=$(echo "scale=2; $sample_size / 50" | bc)
fi

# Calculate pattern consistency
consistency=$(measure_pattern_consistency "$METHODOLOGY")

# Expert review (manual input, defaults to 0.0)
expert_review=${3:-0.0}

# Weighted sum: 0.4 coverage + 0.3 sample + 0.2 consistency + 0.1 review
confidence=$(echo "scale=3; 0.4*$coverage + 0.3*$sample_score + 0.2*$consistency + 0.1*$expert_review" | bc)

echo "Confidence: $confidence"
echo "  Coverage: $coverage"
echo "  Sample Size: $sample_score"
echo "  Consistency: $consistency"
echo "  Expert Review: $expert_review"
```
---
**Source**: BAIME Retrospective Validation Framework
**Status**: Production-ready, validated across 13 methodologies
**Average Confidence**: 0.86 (median 0.87)

@@ -0,0 +1,399 @@
# Automated Detection Rules
**Version**: 1.0
**Purpose**: Automated error pattern detection for validation
**Coverage**: 95.4% of 1336 historical errors
---
## Rule Engine
**Architecture**:
```
Session JSONL → Parser → Classifier → Pattern Matcher → Report
```
**Components**:
1. **Parser**: Extract tool calls, errors, timestamps
2. **Classifier**: Categorize errors by signature
3. **Pattern Matcher**: Apply recovery patterns
4. **Reporter**: Generate validation metrics (pipeline sketched below)
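A sketch of the record type and pipeline skeleton the detection rules below assume (the `ToolCall` fields mirror the attributes used by the detectors; the JSONL field names are assumptions):

```python
import json
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str           # e.g. "Bash", "Read", "Edit"
    command: str = ""   # Bash command line, if any
    output: str = ""    # captured stdout/stderr
    error: str = ""     # error message, empty on success
    status: str = "ok"
    file_path: str = ""

def parse_session(jsonl_path):
    """Parser stage: yield ToolCall records from a session JSONL file."""
    fields = ("tool", "command", "output", "error", "status", "file_path")
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            yield ToolCall(**{k: record[k] for k in fields if k in record})

def classify(call, detectors):
    """Classifier stage: return the first matching category name, else None."""
    for name, detect in detectors.items():
        if detect(call):
            return name
    return None
```

Each detector below plugs into `classify` as an entry in a `{category_name: function}` registry.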
---
## Detection Rules (13 Categories)
### 1. Build/Compilation Errors
**Signature**:
```regex
(syntax error|undefined:|cannot find|compilation failed)
```
**Detection Logic**:
```python
def detect_build_error(tool_call):
if tool_call.tool != "Bash":
return False
error_patterns = [
r"syntax error",
r"undefined:",
r"cannot find",
r"compilation failed"
]
return any(re.search(p, tool_call.error, re.I)
for p in error_patterns)
```
**Frequency**: 15.0% (200/1336)
**Priority**: P1 (high impact)
---
### 2. Test Failures
**Signature**:
```regex
(FAIL|test.*failed|assertion.*failed)
```
**Detection Logic**:
```python
def detect_test_failure(tool_call):
    if tool_call.tool != "Bash":
        return False
    if "test" not in tool_call.command.lower():
        return False
    return re.search(r"FAIL|test.*failed|assertion.*failed",
                     tool_call.output, re.I)
```
**Frequency**: 11.2% (150/1336)
**Priority**: P2 (medium impact)
---
### 3. File Not Found
**Signature**:
```regex
(no such file|file not found|cannot open)
```
**Detection Logic**:
```python
def detect_file_not_found(tool_call):
patterns = [
r"no such file",
r"file not found",
r"cannot open"
]
return any(re.search(p, tool_call.error, re.I)
for p in patterns)
```
**Frequency**: 18.7% (250/1336)
**Priority**: P1 (preventable with validation)
**Automation**: validate-path.sh prevents 65.2%
---
### 4. File Size Exceeded
**Signature**:
```regex
(file too large|exceeds.*limit|size.*exceeded)
```
**Detection Logic**:
```python
def detect_file_size_error(tool_call):
if tool_call.tool not in ["Read", "Edit"]:
return False
return re.search(r"file too large|exceeds.*limit",
tool_call.error, re.I)
```
**Frequency**: 6.3% (84/1336)
**Priority**: P1 (100% preventable)
**Automation**: check-file-size.sh prevents 100%
---
### 5. Write Before Read
**Signature**:
```regex
(must read before|file not read|write.*without.*read)
```
**Detection Logic**:
```python
def detect_write_before_read(session):
for i, call in enumerate(session.tool_calls):
if call.tool in ["Edit", "Write"] and call.status == "error":
# Check if file was read in previous N calls
lookback = session.tool_calls[max(0, i-5):i]
if not any(c.tool == "Read" and
c.file_path == call.file_path
for c in lookback):
return True
return False
```
**Frequency**: 5.2% (70/1336)
**Priority**: P1 (100% preventable)
**Automation**: check-read-before-write.sh prevents 100%
---
### 6. Command Not Found
**Signature**:
```regex
(command not found|not recognized|no such command)
```
**Detection Logic**:
```python
def detect_command_not_found(tool_call):
if tool_call.tool != "Bash":
return False
return re.search(r"command not found", tool_call.error, re.I)
```
**Frequency**: 3.7% (50/1336)
**Priority**: P3 (low automation value)
---
### 7. JSON Parsing Errors
**Signature**:
```regex
(invalid json|parse.*error|malformed json)
```
**Detection Logic**:
```python
def detect_json_error(tool_call):
return re.search(r"invalid json|parse.*error|malformed",
tool_call.error, re.I)
```
**Frequency**: 6.0% (80/1336)
**Priority**: P2 (medium impact)
---
### 8. Request Interruption
**Signature**:
```regex
(interrupted|cancelled|aborted)
```
**Detection Logic**:
```python
def detect_interruption(tool_call):
return re.search(r"interrupted|cancelled|aborted",
tool_call.error, re.I)
```
**Frequency**: 2.2% (30/1336)
**Priority**: P3 (user-initiated, not preventable)
---
### 9. MCP Server Errors
**Signature**:
```regex
(mcp.*error|server.*unavailable|connection.*refused)
```
**Detection Logic**:
```python
def detect_mcp_error(tool_call):
if not tool_call.tool.startswith("mcp__"):
return False
patterns = [
r"server.*unavailable",
r"connection.*refused",
r"timeout"
]
return any(re.search(p, tool_call.error, re.I)
for p in patterns)
```
**Frequency**: 17.1% (228/1336)
**Priority**: P2 (infrastructure)
---
### 10. Permission Denied
**Signature**:
```regex
(permission denied|access denied|forbidden)
```
**Detection Logic**:
```python
def detect_permission_error(tool_call):
return re.search(r"permission denied|access denied",
tool_call.error, re.I)
```
**Frequency**: 0.7% (10/1336)
**Priority**: P3 (rare)
---
### 11. Empty Command String
**Signature**:
```regex
(empty command|no command|command required)
```
**Detection Logic**:
```python
def detect_empty_command(tool_call):
if tool_call.tool != "Bash":
return False
return not tool_call.parameters.get("command", "").strip()
```
**Frequency**: 1.1% (15/1336)
**Priority**: P2 (easy to prevent)
---
### 12. Go Module Already Exists
**Signature**:
```regex
(module.*already exists|go.mod.*exists)
```
**Detection Logic**:
```python
def detect_module_exists(tool_call):
if tool_call.tool != "Bash":
return False
return (re.search(r"go mod init", tool_call.command) and
re.search(r"already exists", tool_call.error, re.I))
```
**Frequency**: 0.4% (5/1336)
**Priority**: P3 (rare)
---
### 13. String Not Found (Edit)
**Signature**:
```regex
(string not found|no match|pattern.*not found)
```
**Detection Logic**:
```python
def detect_string_not_found(tool_call):
if tool_call.tool != "Edit":
return False
return re.search(r"string not found|no match",
tool_call.error, re.I)
```
**Frequency**: 3.2% (43/1336)
**Priority**: P1 (impacts workflow)
---
## Composite Detection
**Multi-stage errors**:
```python
def detect_cascading_error(session):
    """Detect errors that cause subsequent errors"""
    calls = session.tool_calls
    for i in range(len(calls) - 1):
        current = calls[i]
        next_call = calls[i + 1]
        # File not found → Write → Edit chain
        if (detect_file_not_found(current) and
                next_call.tool == "Write" and
                current.file_path == next_call.file_path):
            return "file-not-found-recovery"
        # Build error → Fix → Rebuild chain (guard the i+2 lookahead)
        if (i + 2 < len(calls) and
                detect_build_error(current) and
                next_call.tool in ["Edit", "Write"] and
                detect_build_error(calls[i + 2])):
            return "build-error-incomplete-fix"
    return None
```
---
## Validation Metrics
**Overall Coverage**:
```
Coverage = (Σ detected_errors) / total_errors
         = 1275 / 1336
         = 95.4%
```
**Per-Category Accuracy**:
- True Positives: 1265 (99.2% of 1275 detections)
- False Positives: 10 (0.8% of 1275 detections)
- False Negatives: 61 (4.6% of 1336 total errors)
**Precision**: 99.2%
**Recall**: 95.4%
**F1 Score**: 97.3%
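These values follow from the standard definitions; a quick arithmetic check using the counts above:

```python
# Precision/recall/F1 from the validation counts.
tp, fp, fn = 1265, 10, 61
precision = tp / (tp + fp)                          # 0.992
recall = tp / (tp + fn)                             # 0.954
f1 = 2 * precision * recall / (precision + recall)  # 0.973
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```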
---
## Usage
**CLI**:
```bash
# Classify all errors in session
meta-cc classify-errors session.jsonl
# Validate methodology against history
meta-cc validate \
--methodology error-recovery \
--history .claude/sessions/*.jsonl
```
**MCP**:
```python
# Query by error category
query_tools(status="error")
# Get error context
query_context(error_signature="file-not-found")
```
---
**Source**: Bootstrap-003 Error Recovery (1336 errors analyzed)
**Status**: Production-ready, 95.4% coverage, 97.3% F1 score

@@ -0,0 +1,210 @@
# Retrospective Validation Process
**Version**: 1.0
**Framework**: BAIME
**Purpose**: Validate methodologies against historical data post-creation
---
## Overview
Retrospective validation applies a newly created methodology to historical work to measure its effectiveness and identify gaps. The goal is to show whether the methodology would have improved past outcomes.
---
## Validation Process
### Phase 1: Data Collection (15 min)
**Gather historical data**:
- Session history (JSONL files)
- Error logs and recovery attempts
- Time measurements
- Quality metrics
**Tools**:
```bash
# Query session data
query_tools --status=error
query_user_messages --pattern="error|fail|bug"
query_context --error-signature="..."
```
### Phase 2: Baseline Measurement (15 min)
**Measure pre-methodology state**:
- Error frequency by category
- Mean Time To Recovery (MTTR)
- Prevention opportunities missed
- Quality metrics
**Example**:
```markdown
## Baseline (Without Methodology)
**Errors**: 1336 total
**MTTR**: 11.25 min average
**Prevention**: 0% (no automation)
**Classification**: Ad-hoc, inconsistent
```
### Phase 3: Apply Methodology (30 min)
**Retrospectively apply patterns**:
1. Classify errors using new taxonomy
2. Identify which patterns would apply
3. Calculate time saved per pattern
4. Measure coverage improvement
**Example**:
```markdown
## With Error Recovery Methodology
**Classification**: 1275/1336 = 95.4% coverage
**Patterns Applied**: 10 recovery patterns
**Time Saved**: 8.25 min per error average
**Prevention**: 317 errors (23.7%) preventable
```
### Phase 4: Calculate Impact (20 min)
**Metrics**:
```
Coverage = classified_errors / total_errors
Time_Saved = (MTTR_before - MTTR_after) × error_count
Prevention_Rate = preventable_errors / total_errors
ROI = time_saved / methodology_creation_time
```
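A minimal sketch of the whole impact calculation (function name illustrative; MTTR values in minutes, methodology time in hours; inputs follow the example below):

```python
# Impact metrics for retrospective validation.
def impact(total, classified, preventable, mttr_before, mttr_after, method_hours):
    coverage = classified / total
    time_saved_hours = (mttr_before - mttr_after) * total / 60
    prevention_rate = preventable / total
    roi = time_saved_hours / method_hours
    return coverage, time_saved_hours, prevention_rate, roi

print(impact(1336, 1275, 317, 11.25, 3.0, 5.75))
# ≈ (0.954, 183.7, 0.237, 31.9)
```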
**Example**:
```markdown
## Impact Analysis
**Coverage**: 95.4% (1275/1336)
**Time Saved**: 8.25 min × 1336 = 183.6 hours
**Prevention**: 23.7% (317 errors)
**ROI**: 183.6h saved / 5.75h invested = 31.9x
```
### Phase 5: Gap Analysis (15 min)
**Identify remaining gaps**:
- Uncategorized errors (4.6%)
- Patterns needed for edge cases
- Automation opportunities
- Transferability limits
---
## Confidence Scoring
**Formula**:
```
Confidence = 0.4 × coverage +
             0.3 × validation_sample_size +
             0.2 × pattern_consistency +
             0.1 × expert_review

Where:
- coverage = classified / total (0-1)
- validation_sample_size = min(validated/50, 1.0)
- pattern_consistency = successful_applications / total_applications
- expert_review = 0.0, 0.5, or 1.0 (review status)
```
**Thresholds**:
- Confidence ≥ 0.80: High confidence, production-ready
- Confidence 0.60-0.79: Medium confidence, needs refinement
- Confidence < 0.60: Low confidence, significant gaps
---
## Validation Criteria
**Methodology is validated if**:
1. Coverage ≥ 80% (methodology handles most cases)
2. Time savings ≥ 30% (significant efficiency gain)
3. Prevention ≥ 10% (automation provides value)
4. ROI ≥ 5x (worthwhile investment)
5. Transferability ≥ 70% (broadly applicable; see the check below)
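A sketch that encodes these five gates (metric names are illustrative; the transferability value in the example is assumed, since it is not given for Error Recovery):

```python
# True only if every validation criterion is met.
def is_validated(m):
    return (m["coverage"] >= 0.80 and
            m["time_savings"] >= 0.30 and
            m["prevention"] >= 0.10 and
            m["roi"] >= 5.0 and
            m["transferability"] >= 0.70)

error_recovery = {"coverage": 0.954, "time_savings": 0.73,
                  "prevention": 0.237, "roi": 31.9,
                  "transferability": 0.80}  # transferability assumed
print(is_validated(error_recovery))  # True
```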
---
## Example: Error Recovery Validation
**Historical Data**: 1336 errors from 15 sessions
**Baseline**:
- MTTR: 11.25 min
- No systematic classification
- No prevention tools
**Post-Methodology** (retrospective):
- Coverage: 95.4% (13 categories)
- MTTR: 3 min (73% reduction)
- Prevention: 23.7% (3 automation tools)
- Time saved: 183.6 hours
- ROI: 31.9x
**Confidence Score**:
```
Confidence = 0.4 × 0.954 +
             0.3 × 1.0 +
             0.2 × 0.91 +
             0.1 × 1.0
           = 0.38 + 0.30 + 0.18 + 0.10
           = 0.96 (High confidence)
```
**Validation Result**: ✅ VALIDATED (all criteria met)
---
## Common Pitfalls
**❌ Selection Bias**: Only validating on "easy" cases
- Fix: Use complete dataset, include edge cases
**❌ Overfitting**: Methodology too specific to validation data
- Fix: Test transferability on different project
**❌ Optimistic Timing**: Assuming perfect pattern application
- Fix: Use realistic time estimates (1.2x typical)
**❌ Ignoring Learning Curve**: Assuming immediate proficiency
- Fix: Factor in 2-3 iterations to master patterns
---
## Automation Support
**Validation Script**:
```bash
#!/bin/bash
# scripts/validate-methodology.sh
# Usage: validate-methodology.sh <methodology> <history-dir> [methodology-hours]
METHODOLOGY=$1
HISTORY_DIR=$2
METHODOLOGY_TIME=${3:-5.75}  # hours invested; 5.75 from the Error Recovery example

# Extract baseline metrics (mean duration; helpers assumed sourced from project library)
baseline=$(query_tools --scope=session | jq -r '.[] | .duration' |
           awk '{ sum += $1; n++ } END { if (n) print sum / n }')

# Apply methodology patterns
coverage=$(classify_with_patterns "$METHODOLOGY" "$HISTORY_DIR")

# Calculate impact
time_saved=$(calculate_time_savings "$baseline" "$coverage")
prevention=$(calculate_prevention_rate "$METHODOLOGY")

# Generate report
echo "Coverage: $coverage"
echo "Time Saved: $time_saved"
echo "Prevention: $prevention"
echo "ROI: $(calculate_roi "$time_saved" "$METHODOLOGY_TIME")"
```
---
**Source**: Bootstrap-003 Error Recovery Retrospective Validation
**Status**: Production-ready, 96% confidence score
**ROI**: 31.9x validated across 1336 historical errors