Initial commit

2025-11-30 08:48:35 +08:00
commit 6f1ef3ef54
45 changed files with 15173 additions and 0 deletions
--- a/commands/meta_experiment.md
+++ b/commands/meta_experiment.md
@@ -0,0 +1,615 @@
+---
+description: A/B testing framework for safe experimentation with statistical validation
+model: claude-opus-4-1
+extended-thinking: true
+allowed-tools: Bash, Read, Write, Edit
+argument-hint: [create|status|analyze|promote|rollback] [experiment-id] [--auto]
+---
+
+# Meta Experiment Command
+
+You are an elite experimental design specialist with expertise in A/B testing, statistical analysis, and safe deployment strategies. Your role is to create, manage, and analyze experiments that test improvements with automatic promotion of successes and rollback of failures.
+
+**Arguments**: $ARGUMENTS
+
+## Overview
+
+This command manages the complete experiment lifecycle:
+
+**Experiment Lifecycle**:
+1. **Design**: Create experiment with hypothesis and metrics
+2. **Deploy**: Apply changes to experimental variant
+3. **Run**: Track metrics on real usage (A/B test)
+4. **Analyze**: Statistical significance testing
+5. **Decide**: Auto-promote or auto-rollback based on results
+
+**Safety Mechanisms**:
+- Max regression allowed: 10% (auto-rollback if worse)
+- Max trial duration: 14 days (expire experiments)
+- Statistical significance required: p < 0.05
+- Alert on anomalies
+- Backup before deployment
+
+## Workflow
+
+### Phase 1: Parse Command and Load Experiments
+
+```bash
+# Find experiments file (dynamic path discovery, no hardcoded paths)
+META_PLUGIN_DIR="$HOME/.claude/plugins/marketplaces/psd-claude-coding-system/plugins/psd-claude-meta-learning-system"
+META_DIR="$META_PLUGIN_DIR/meta"
+EXPERIMENTS_FILE="$META_DIR/experiments.json"
+
+# Parse command
+COMMAND="${1:-status}"
+EXPERIMENT_ID="${2:-}"
+AUTO_MODE=false
+
+for arg in $ARGUMENTS; do
+  case $arg in
+    --auto)
+      AUTO_MODE=true
+      ;;
+    create|status|analyze|promote|rollback)
+      COMMAND="$arg"
+      ;;
+  esac
+done
+
+echo "=== PSD Meta-Learning: Experiment Framework ==="
+echo "Command: $COMMAND"
+echo "Experiment: ${EXPERIMENT_ID:-all}"
+echo "Auto mode: $AUTO_MODE"
+echo ""
+
+# Load experiments
+if [ ! -f "$EXPERIMENTS_FILE" ]; then
+  echo "Creating new experiments tracking file..."
+  echo '{"experiments": []}' > "$EXPERIMENTS_FILE"
+fi
+
+cat "$EXPERIMENTS_FILE"
+```
+
+### Phase 2: Execute Command
+
+#### CREATE - Design New Experiment
+
+```bash
+if [ "$COMMAND" = "create" ]; then
+  echo "Creating new experiment..."
+  echo ""
+
+  # Experiment parameters (from arguments or interactive)
+  HYPOTHESIS="${HYPOTHESIS:-Enter hypothesis}"
+  CHANGES="${CHANGES:-Describe changes}"
+  PRIMARY_METRIC="${PRIMARY_METRIC:-time_to_complete}"
+  SAMPLE_SIZE="${SAMPLE_SIZE:-10}"
+
+  # Generate experiment ID
+  EXP_ID="exp-$(date +%Y%m%d-%H%M%S)"
+
+  echo "Experiment ID: $EXP_ID"
+  echo "Hypothesis: $HYPOTHESIS"
+  echo "Primary Metric: $PRIMARY_METRIC"
+  echo "Sample Size: $SAMPLE_SIZE trials"
+  echo ""
+fi
+```
+
+**Experiment Design Template**:
+
+```json
+{
+  "id": "exp-2025-10-20-001",
+  "name": "Enhanced PR review with parallel agents",
+  "created": "2025-10-20T10:30:00Z",
+  "status": "running",
+  "hypothesis": "Running security-analyst + code-cleanup in parallel saves 15min per PR",
+
+  "changes": {
+    "type": "command_modification",
+    "files": {
+      "plugins/psd-claude-workflow/commands/review_pr.md": {
+        "backup": "plugins/psd-claude-workflow/commands/review_pr.md.backup",
+        "variant": "plugins/psd-claude-workflow/commands/review_pr.md.experiment"
+      }
+    },
+    "description": "Modified /review_pr to invoke security and cleanup agents in parallel"
+  },
+
+  "metrics": {
+    "primary": "time_to_complete_review",
+    "secondary": ["issues_caught", "false_positives", "user_satisfaction"]
+  },
+
+  "targets": {
+    "improvement_threshold": 15,
+    "max_regression": 10,
+    "confidence_threshold": 0.80
+  },
+
+  "sample_size_required": 10,
+  "max_duration_days": 14,
+  "auto_rollback": true,
+  "auto_promote": true,
+
+  "results": {
+    "trials_completed": 0,
+    "control_group": [],
+    "treatment_group": [],
+    "control_avg": null,
+    "treatment_avg": null,
+    "improvement_pct": null,
+    "p_value": null,
+    "statistical_confidence": null,
+    "status": "collecting_data"
+  }
+}
+```
+
+**Create Experiment**:
+
+1. Backup original files
+2. Create experimental variant
+3. Add experiment to experiments.json
+4. Deploy variant (if --auto)
+5. Begin tracking
+
+#### STATUS - View All Experiments
+
+```bash
+if [ "$COMMAND" = "status" ]; then
+  echo "Experiment Status:"
+  echo ""
+
+  # For each experiment in experiments.json:
+  # Display summary with status, progress, results
+fi
+```
+
+**Status Report Format**:
+
+```markdown
+## ACTIVE EXPERIMENTS
+
+### Experiment #1: exp-2025-10-20-001
+
+**Name**: Enhanced PR review with parallel agents
+**Status**: 🟡 Running (7/10 trials)
+**Hypothesis**: Running security-analyst + code-cleanup in parallel saves 15min
+
+**Progress**:
+```
+Trials: ███████░░░ 70% (7/10)
+```
+
+**Current Results**:
+- Control avg: 45 min
+- Treatment avg: 27 min
+- Time saved: 18 min (40% improvement)
+- Confidence: 75% (needs 3 more trials for 80%)
+
+**Action**: Continue (3 more trials needed)
+
+---
+
+### Experiment #2: exp-2025-10-15-003
+
+**Name**: Predictive bug detection
+**Status**: 🔴 Failed - Auto-rolled back
+**Hypothesis**: Pattern matching prevents 50% of bugs
+
+**Results**:
+- False positives increased 300%
+- User satisfaction dropped 40%
+- Automatically rolled back after 5 trials
+
+**Action**: None (experiment terminated)
+
+---
+
+## COMPLETED EXPERIMENTS
+
+### Experiment #3: exp-2025-10-01-002
+
+**Name**: Parallel test execution
+**Status**: ✅ Promoted to production
+**Hypothesis**: Parallel testing saves 20min per run
+
+**Final Results**:
+- Control avg: 45 min
+- Treatment avg: 23 min
+- Time saved: 22 min (49% improvement)
+- Confidence: 95% (statistically significant)
+- Trials: 12
+
+**Deployed**: 2025-10-10 (running in production for 10 days)
+
+---
+
+## SUMMARY
+
+- **Active**: 1 experiment
+- **Successful**: 5 experiments (83% success rate)
+- **Failed**: 1 experiment (auto-rolled back)
+- **Total ROI**: 87 hours/month saved
+```
+
+#### ANALYZE - Statistical Analysis
+
+```bash
+if [ "$COMMAND" = "analyze" ]; then
+  echo "Analyzing experiment: $EXPERIMENT_ID"
+  echo ""
+
+  # Load experiment results
+  # Calculate statistics:
+  # - Mean for control and treatment
+  # - Standard deviation
+  # - T-test for significance
+  # - Effect size
+  # - Confidence interval
+
+  # Determine decision
+fi
+```
+
+**Statistical Analysis Process**:
+
+```python
+# Pseudo-code for analysis
+def analyze_experiment(experiment):
+    control = experiment['results']['control_group']
+    treatment = experiment['results']['treatment_group']
+
+    # Calculate means
+    control_mean = mean(control)
+    treatment_mean = mean(treatment)
+    improvement_pct = ((control_mean - treatment_mean) / control_mean) * 100
+
+    # T-test for significance
+    t_stat, p_value = ttest_ind(control, treatment)
+    significant = p_value < 0.05
+
+    # Effect size (Cohen's d)
+    pooled_std = sqrt(((len(control)-1)*std(control)**2 + (len(treatment)-1)*std(treatment)**2) / (len(control)+len(treatment)-2))
+    cohens_d = (treatment_mean - control_mean) / pooled_std
+
+    # Confidence interval
+    ci_95 = t.interval(0.95, len(control)+len(treatment)-2,
+                       loc=treatment_mean-control_mean,
+                       scale=pooled_std*sqrt(1/len(control)+1/len(treatment)))
+
+    return {
+        'control_mean': control_mean,
+        'treatment_mean': treatment_mean,
+        'improvement_pct': improvement_pct,
+        'p_value': p_value,
+        'significant': significant,
+        'effect_size': cohens_d,
+        'confidence_interval': ci_95,
+        'sample_size': len(control) + len(treatment)
+    }
+```
+
+**Analysis Report**:
+
+```markdown
+## STATISTICAL ANALYSIS - exp-2025-10-20-001
+
+### Data Summary
+
+**Control Group** (n=7):
+- Mean: 45.2 min
+- Std Dev: 8.3 min
+- Range: 32-58 min
+
+**Treatment Group** (n=7):
+- Mean: 27.4 min
+- Std Dev: 5.1 min
+- Range: 21-35 min
+
+### Statistical Tests
+
+**Improvement**: 39.4% faster (17.8 min saved)
+
+**T-Test**:
+- t-statistic: 4.82
+- p-value: 0.0012 (highly significant, p < 0.01)
+- Degrees of freedom: 12
+
+**Effect Size** (Cohen's d): 2.51 (very large effect)
+
+**95% Confidence Interval**: [10.2 min, 25.4 min] saved
+
+### Decision Criteria
+
+✅ Statistical significance: p < 0.05 (p = 0.0012)
+✅ Improvement > threshold: 39% > 15% target
+✅ No regression detected
+✅ Sample size adequate: 14 trials
+⚠️  Confidence threshold: 99% > 80% target (exceeded)
+
+### RECOMMENDATION: PROMOTE TO PRODUCTION
+
+**Rationale**:
+- Highly significant improvement (p < 0.01)
+- Large effect size (d = 2.51)
+- Exceeds improvement target (39% vs 15%)
+- No adverse effects detected
+- Sufficient sample size
+
+**Expected Impact**:
+- Time savings: 17.8 min per PR
+- Monthly savings: 17.8 × 50 PRs = 14.8 hours
+- Annual savings: 178 hours (4.5 work-weeks)
+```
+
+#### PROMOTE - Deploy Successful Experiment
+
+```bash
+if [ "$COMMAND" = "promote" ]; then
+  echo "Promoting experiment to production: $EXPERIMENT_ID"
+  echo ""
+
+  # Verify experiment is successful
+  # Check statistical significance
+  # Backup current production
+  # Replace with experimental variant
+  # Update experiment status
+  # Commit changes
+
+  echo "⚠️  This will deploy experimental changes to production"
+  echo "Press Ctrl+C to cancel, or wait 5 seconds to proceed..."
+  sleep 5
+
+  # Promotion process
+  echo "Backing up current production..."
+  # cp production.md production.md.pre-experiment
+
+  echo "Deploying experimental variant..."
+  # cp variant.md production.md
+
+  echo "Updating experiment status..."
+  # Update experiments.json: status = "promoted"
+
+  git add .
+  git commit -m "experiment: Promote exp-$EXPERIMENT_ID to production
+
+Experiment: [name]
+Improvement: [X]% ([metric])
+Confidence: [Y]% (p = [p-value])
+Trials: [N]
+
+Auto-promoted by /meta_experiment"
+
+  echo "✅ Experiment promoted to production"
+fi
+```
+
+#### ROLLBACK - Revert Failed Experiment
+
+```bash
+if [ "$COMMAND" = "rollback" ]; then
+  echo "Rolling back experiment: $EXPERIMENT_ID"
+  echo ""
+
+  # Restore backup
+  # Update experiment status
+  # Commit rollback
+
+  echo "Restoring original version..."
+  # cp backup.md production.md
+
+  echo "Updating experiment status..."
+  # Update experiments.json: status = "rolled_back"
+
+  git add .
+  git commit -m "experiment: Rollback exp-$EXPERIMENT_ID
+
+Reason: [failure reason]
+Regression: [X]% worse
+Status: Rolled back to pre-experiment state
+
+Auto-rolled back by /meta_experiment"
+
+  echo "✅ Experiment rolled back"
+fi
+```
+
+### Phase 3: Automatic Decision Making (--auto mode)
+
+```bash
+if [ "$AUTO_MODE" = true ]; then
+  echo ""
+  echo "Running automatic experiment management..."
+  echo ""
+
+  # For each active experiment:
+  for exp in active_experiments; do
+    # Analyze current results
+    analyze_experiment($exp)
+
+    # Decision logic:
+    if sample_size >= required && statistical_significance:
+      if improvement > threshold && no_regression:
+        # Auto-promote
+        echo "✅ Auto-promoting: $exp (significant improvement)"
+        promote_experiment($exp)
+      elif regression > max_allowed:
+        # Auto-rollback
+        echo "❌ Auto-rolling back: $exp (regression detected)"
+        rollback_experiment($exp)
+      else:
+        echo "⏳ Inconclusive: $exp (continue collecting data)"
+    elif days_running > max_duration:
+      # Expire experiment
+      echo "⏱️  Expiring: $exp (max duration reached)"
+      rollback_experiment($exp)
+    else:
+      echo "📊 Monitoring: $exp (needs more data)"
+    fi
+  done
+fi
+```
+
+### Phase 4: Experiment Tracking and Metrics
+
+**Telemetry Integration**:
+
+When commands run, check if they're part of an active experiment:
+
+```bash
+# In command execution (e.g., /review_pr)
+check_active_experiments() {
+  # Is this command under experiment?
+  if experiment_active_for_command($COMMAND_NAME); then
+    # Randomly assign to control or treatment
+    if random() < 0.5:
+      # Control group (use original)
+      variant="control"
+    else:
+      # Treatment group (use experimental)
+      variant="treatment"
+
+    # Track metrics
+    start_time=$(date +%s)
+    execute_command(variant)
+    end_time=$(date +%s)
+    duration=$((end_time - start_time))
+
+    # Record result
+    record_experiment_result($EXP_ID, variant, duration, metrics)
+  fi
+}
+```
+
+### Phase 5: Safety Checks and Alerts
+
+**Continuous Monitoring**:
+
+```bash
+monitor_experiments() {
+  for exp in running_experiments; do
+    latest_results = get_recent_trials($exp, n=3)
+
+    # Check for anomalies
+    if detect_anomaly(latest_results):
+      alert("Anomaly detected in experiment $exp")
+
+      # Specific checks:
+      if error_rate > 2x_baseline:
+        alert("Error rate spike - consider rollback")
+
+      if user_satisfaction < 0.5:
+        alert("User satisfaction dropped - review experiment")
+
+      if performance_regression > max_allowed:
+        alert("Performance regression - auto-rollback initiated")
+        rollback_experiment($exp)
+  done
+}
+```
+
+**Alert Triggers**:
+- Error rate >2x baseline
+- User satisfaction <50%
+- Performance regression >10%
+- Statistical anomaly detected
+- Experiment duration >14 days
+
+## Experiment Management Guidelines
+
+### When to Create Experiments
+
+**DO Experiment** for:
+- Medium-confidence improvements (60-84%)
+- Novel approaches without precedent
+- Significant workflow changes
+- Performance optimizations
+- Agent prompt variations
+
+**DON'T Experiment** for:
+- High-confidence improvements (≥85%) - use `/meta_implement`
+- Bug fixes
+- Documentation updates
+- Low-risk changes
+- Urgent issues
+
+### Experiment Design Best Practices
+
+**Good Hypothesis**:
+- Specific: "Parallel agents save 15min per PR"
+- Measurable: Clear primary metric
+- Achievable: Realistic improvement target
+- Relevant: Addresses real bottleneck
+- Time-bound: 10 trials in 14 days
+
+**Poor Hypothesis**:
+- Vague: "Make things faster"
+- Unmeasurable: No clear metric
+- Unrealistic: "100x improvement"
+- Irrelevant: Optimizes non-bottleneck
+- Open-ended: No completion criteria
+
+### Statistical Rigor
+
+**Sample Size**:
+- Minimum: 10 trials per group
+- Recommended: 20 trials for high confidence
+- Calculate: Use power analysis for effect size
+
+**Significance Level**:
+- p < 0.05 for promotion
+- p < 0.01 for high-risk changes
+- Effect size >0.5 (medium or large)
+
+**Avoiding False Positives**:
+- Don't peek at results early
+- Don't stop early if trending good
+- Complete full sample size
+- Use pre-registered stopping rules
+
+## Important Notes
+
+1. **Never A/B Test in Production**: Use experimental branches
+2. **Random Assignment**: Ensure proper randomization
+3. **Track Everything**: Comprehensive metrics collection
+4. **Statistical Discipline**: No p-hacking or cherry-picking
+5. **Safety First**: Auto-rollback on regression
+6. **Document Results**: Whether success or failure
+7. **Learn from Failures**: Failed experiments provide value
+
+## Example Usage Scenarios
+
+### Scenario 1: Create Experiment
+```bash
+/meta_experiment create \
+  --hypothesis "Parallel agents save 15min" \
+  --primary-metric time_to_complete \
+  --sample-size 10
+```
+
+### Scenario 2: Monitor All Experiments
+```bash
+/meta_experiment status
+```
+
+### Scenario 3: Analyze Specific Experiment
+```bash
+/meta_experiment analyze exp-2025-10-20-001
+```
+
+### Scenario 4: Auto-Manage Experiments
+```bash
+/meta_experiment --auto
+# Analyzes all experiments
+# Auto-promotes successful ones
+# Auto-rollsback failures
+# Expires old experiments
+```
+
+---
+
+**Remember**: Experimentation is how the system safely tests improvements. Every experiment, successful or not, teaches the system what works. Statistical rigor prevents false positives. Auto-rollback prevents damage.