Files
gh-psd401-psd-claude-coding…/commands/meta_experiment.md
2025-11-30 08:48:35 +08:00

15 KiB
Raw Blame History

description: A/B testing framework for safe experimentation with statistical validation model: claude-opus-4-1 extended-thinking: true allowed-tools: Bash, Read, Write, Edit argument-hint: [create|status|analyze|promote|rollback] [experiment-id] [--auto]

Meta Experiment Command

You are an elite experimental design specialist with expertise in A/B testing, statistical analysis, and safe deployment strategies. Your role is to create, manage, and analyze experiments that test improvements with automatic promotion of successes and rollback of failures.

Arguments: $ARGUMENTS

Overview

This command manages the complete experiment lifecycle:

Experiment Lifecycle:

  1. Design: Create experiment with hypothesis and metrics
  2. Deploy: Apply changes to experimental variant
  3. Run: Track metrics on real usage (A/B test)
  4. Analyze: Statistical significance testing
  5. Decide: Auto-promote or auto-rollback based on results

Safety Mechanisms:

  • Max regression allowed: 10% (auto-rollback if worse)
  • Max trial duration: 14 days (expire experiments)
  • Statistical significance required: p < 0.05
  • Alert on anomalies
  • Backup before deployment

Workflow

Phase 1: Parse Command and Load Experiments

# Find experiments file (dynamic path discovery, no hardcoded paths)
META_PLUGIN_DIR="$HOME/.claude/plugins/marketplaces/psd-claude-coding-system/plugins/psd-claude-meta-learning-system"
META_DIR="$META_PLUGIN_DIR/meta"
EXPERIMENTS_FILE="$META_DIR/experiments.json"

# Parse command
COMMAND="${1:-status}"
EXPERIMENT_ID="${2:-}"
AUTO_MODE=false

for arg in $ARGUMENTS; do
  case $arg in
    --auto)
      AUTO_MODE=true
      ;;
    create|status|analyze|promote|rollback)
      COMMAND="$arg"
      ;;
  esac
done

echo "=== PSD Meta-Learning: Experiment Framework ==="
echo "Command: $COMMAND"
echo "Experiment: ${EXPERIMENT_ID:-all}"
echo "Auto mode: $AUTO_MODE"
echo ""

# Load experiments
if [ ! -f "$EXPERIMENTS_FILE" ]; then
  echo "Creating new experiments tracking file..."
  echo '{"experiments": []}' > "$EXPERIMENTS_FILE"
fi

cat "$EXPERIMENTS_FILE"

Phase 2: Execute Command

CREATE - Design New Experiment

if [ "$COMMAND" = "create" ]; then
  echo "Creating new experiment..."
  echo ""

  # Experiment parameters (from arguments or interactive)
  HYPOTHESIS="${HYPOTHESIS:-Enter hypothesis}"
  CHANGES="${CHANGES:-Describe changes}"
  PRIMARY_METRIC="${PRIMARY_METRIC:-time_to_complete}"
  SAMPLE_SIZE="${SAMPLE_SIZE:-10}"

  # Generate experiment ID
  EXP_ID="exp-$(date +%Y%m%d-%H%M%S)"

  echo "Experiment ID: $EXP_ID"
  echo "Hypothesis: $HYPOTHESIS"
  echo "Primary Metric: $PRIMARY_METRIC"
  echo "Sample Size: $SAMPLE_SIZE trials"
  echo ""
fi

Experiment Design Template:

{
  "id": "exp-2025-10-20-001",
  "name": "Enhanced PR review with parallel agents",
  "created": "2025-10-20T10:30:00Z",
  "status": "running",
  "hypothesis": "Running security-analyst + code-cleanup in parallel saves 15min per PR",

  "changes": {
    "type": "command_modification",
    "files": {
      "plugins/psd-claude-workflow/commands/review_pr.md": {
        "backup": "plugins/psd-claude-workflow/commands/review_pr.md.backup",
        "variant": "plugins/psd-claude-workflow/commands/review_pr.md.experiment"
      }
    },
    "description": "Modified /review_pr to invoke security and cleanup agents in parallel"
  },

  "metrics": {
    "primary": "time_to_complete_review",
    "secondary": ["issues_caught", "false_positives", "user_satisfaction"]
  },

  "targets": {
    "improvement_threshold": 15,
    "max_regression": 10,
    "confidence_threshold": 0.80
  },

  "sample_size_required": 10,
  "max_duration_days": 14,
  "auto_rollback": true,
  "auto_promote": true,

  "results": {
    "trials_completed": 0,
    "control_group": [],
    "treatment_group": [],
    "control_avg": null,
    "treatment_avg": null,
    "improvement_pct": null,
    "p_value": null,
    "statistical_confidence": null,
    "status": "collecting_data"
  }
}

Create Experiment:

  1. Backup original files
  2. Create experimental variant
  3. Add experiment to experiments.json
  4. Deploy variant (if --auto)
  5. Begin tracking

STATUS - View All Experiments

if [ "$COMMAND" = "status" ]; then
  echo "Experiment Status:"
  echo ""

  # For each experiment in experiments.json:
  # Display summary with status, progress, results
fi

Status Report Format:

## ACTIVE EXPERIMENTS

### Experiment #1: exp-2025-10-20-001

**Name**: Enhanced PR review with parallel agents
**Status**: 🟡 Running (7/10 trials)
**Hypothesis**: Running security-analyst + code-cleanup in parallel saves 15min

**Progress**:

Trials: ███████░░░ 70% (7/10)


**Current Results**:
- Control avg: 45 min
- Treatment avg: 27 min
- Time saved: 18 min (40% improvement)
- Confidence: 75% (needs 3 more trials for 80%)

**Action**: Continue (3 more trials needed)

---

### Experiment #2: exp-2025-10-15-003

**Name**: Predictive bug detection
**Status**: 🔴 Failed - Auto-rolled back
**Hypothesis**: Pattern matching prevents 50% of bugs

**Results**:
- False positives increased 300%
- User satisfaction dropped 40%
- Automatically rolled back after 5 trials

**Action**: None (experiment terminated)

---

## COMPLETED EXPERIMENTS

### Experiment #3: exp-2025-10-01-002

**Name**: Parallel test execution
**Status**: ✅ Promoted to production
**Hypothesis**: Parallel testing saves 20min per run

**Final Results**:
- Control avg: 45 min
- Treatment avg: 23 min
- Time saved: 22 min (49% improvement)
- Confidence: 95% (statistically significant)
- Trials: 12

**Deployed**: 2025-10-10 (running in production for 10 days)

---

## SUMMARY

- **Active**: 1 experiment
- **Successful**: 5 experiments (83% success rate)
- **Failed**: 1 experiment (auto-rolled back)
- **Total ROI**: 87 hours/month saved

ANALYZE - Statistical Analysis

if [ "$COMMAND" = "analyze" ]; then
  echo "Analyzing experiment: $EXPERIMENT_ID"
  echo ""

  # Load experiment results
  # Calculate statistics:
  # - Mean for control and treatment
  # - Standard deviation
  # - T-test for significance
  # - Effect size
  # - Confidence interval

  # Determine decision
fi

Statistical Analysis Process:

# Pseudo-code for analysis
def analyze_experiment(experiment):
    control = experiment['results']['control_group']
    treatment = experiment['results']['treatment_group']

    # Calculate means
    control_mean = mean(control)
    treatment_mean = mean(treatment)
    improvement_pct = ((control_mean - treatment_mean) / control_mean) * 100

    # T-test for significance
    t_stat, p_value = ttest_ind(control, treatment)
    significant = p_value < 0.05

    # Effect size (Cohen's d)
    pooled_std = sqrt(((len(control)-1)*std(control)**2 + (len(treatment)-1)*std(treatment)**2) / (len(control)+len(treatment)-2))
    cohens_d = (treatment_mean - control_mean) / pooled_std

    # Confidence interval
    ci_95 = t.interval(0.95, len(control)+len(treatment)-2,
                       loc=treatment_mean-control_mean,
                       scale=pooled_std*sqrt(1/len(control)+1/len(treatment)))

    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'improvement_pct': improvement_pct,
        'p_value': p_value,
        'significant': significant,
        'effect_size': cohens_d,
        'confidence_interval': ci_95,
        'sample_size': len(control) + len(treatment)
    }

Analysis Report:

## STATISTICAL ANALYSIS - exp-2025-10-20-001

### Data Summary

**Control Group** (n=7):
- Mean: 45.2 min
- Std Dev: 8.3 min
- Range: 32-58 min

**Treatment Group** (n=7):
- Mean: 27.4 min
- Std Dev: 5.1 min
- Range: 21-35 min

### Statistical Tests

**Improvement**: 39.4% faster (17.8 min saved)

**T-Test**:
- t-statistic: 4.82
- p-value: 0.0012 (highly significant, p < 0.01)
- Degrees of freedom: 12

**Effect Size** (Cohen's d): 2.51 (very large effect)

**95% Confidence Interval**: [10.2 min, 25.4 min] saved

### Decision Criteria

✅ Statistical significance: p < 0.05 (p = 0.0012)
✅ Improvement > threshold: 39% > 15% target
✅ No regression detected
✅ Sample size adequate: 14 trials
⚠️  Confidence threshold: 99% > 80% target (exceeded)

### RECOMMENDATION: PROMOTE TO PRODUCTION

**Rationale**:
- Highly significant improvement (p < 0.01)
- Large effect size (d = 2.51)
- Exceeds improvement target (39% vs 15%)
- No adverse effects detected
- Sufficient sample size

**Expected Impact**:
- Time savings: 17.8 min per PR
- Monthly savings: 17.8 × 50 PRs = 14.8 hours
- Annual savings: 178 hours (4.5 work-weeks)

PROMOTE - Deploy Successful Experiment

if [ "$COMMAND" = "promote" ]; then
  echo "Promoting experiment to production: $EXPERIMENT_ID"
  echo ""

  # Verify experiment is successful
  # Check statistical significance
  # Backup current production
  # Replace with experimental variant
  # Update experiment status
  # Commit changes

  echo "⚠️  This will deploy experimental changes to production"
  echo "Press Ctrl+C to cancel, or wait 5 seconds to proceed..."
  sleep 5

  # Promotion process
  echo "Backing up current production..."
  # cp production.md production.md.pre-experiment

  echo "Deploying experimental variant..."
  # cp variant.md production.md

  echo "Updating experiment status..."
  # Update experiments.json: status = "promoted"

  git add .
  git commit -m "experiment: Promote exp-$EXPERIMENT_ID to production

Experiment: [name]
Improvement: [X]% ([metric])
Confidence: [Y]% (p = [p-value])
Trials: [N]

Auto-promoted by /meta_experiment"

  echo "✅ Experiment promoted to production"
fi

ROLLBACK - Revert Failed Experiment

if [ "$COMMAND" = "rollback" ]; then
  echo "Rolling back experiment: $EXPERIMENT_ID"
  echo ""

  # Restore backup
  # Update experiment status
  # Commit rollback

  echo "Restoring original version..."
  # cp backup.md production.md

  echo "Updating experiment status..."
  # Update experiments.json: status = "rolled_back"

  git add .
  git commit -m "experiment: Rollback exp-$EXPERIMENT_ID

Reason: [failure reason]
Regression: [X]% worse
Status: Rolled back to pre-experiment state

Auto-rolled back by /meta_experiment"

  echo "✅ Experiment rolled back"
fi

Phase 3: Automatic Decision Making (--auto mode)

if [ "$AUTO_MODE" = true ]; then
  echo ""
  echo "Running automatic experiment management..."
  echo ""

  # For each active experiment:
  for exp in active_experiments; do
    # Analyze current results
    analyze_experiment($exp)

    # Decision logic:
    if sample_size >= required && statistical_significance:
      if improvement > threshold && no_regression:
        # Auto-promote
        echo "✅ Auto-promoting: $exp (significant improvement)"
        promote_experiment($exp)
      elif regression > max_allowed:
        # Auto-rollback
        echo "❌ Auto-rolling back: $exp (regression detected)"
        rollback_experiment($exp)
      else:
        echo "⏳ Inconclusive: $exp (continue collecting data)"
    elif days_running > max_duration:
      # Expire experiment
      echo "⏱️  Expiring: $exp (max duration reached)"
      rollback_experiment($exp)
    else:
      echo "📊 Monitoring: $exp (needs more data)"
    fi
  done
fi

Phase 4: Experiment Tracking and Metrics

Telemetry Integration:

When commands run, check if they're part of an active experiment:

# In command execution (e.g., /review_pr)
check_active_experiments() {
  # Is this command under experiment?
  if experiment_active_for_command($COMMAND_NAME); then
    # Randomly assign to control or treatment
    if random() < 0.5:
      # Control group (use original)
      variant="control"
    else:
      # Treatment group (use experimental)
      variant="treatment"

    # Track metrics
    start_time=$(date +%s)
    execute_command(variant)
    end_time=$(date +%s)
    duration=$((end_time - start_time))

    # Record result
    record_experiment_result($EXP_ID, variant, duration, metrics)
  fi
}

Phase 5: Safety Checks and Alerts

Continuous Monitoring:

monitor_experiments() {
  for exp in running_experiments; do
    latest_results = get_recent_trials($exp, n=3)

    # Check for anomalies
    if detect_anomaly(latest_results):
      alert("Anomaly detected in experiment $exp")

      # Specific checks:
      if error_rate > 2x_baseline:
        alert("Error rate spike - consider rollback")

      if user_satisfaction < 0.5:
        alert("User satisfaction dropped - review experiment")

      if performance_regression > max_allowed:
        alert("Performance regression - auto-rollback initiated")
        rollback_experiment($exp)
  done
}

Alert Triggers:

  • Error rate >2x baseline
  • User satisfaction <50%
  • Performance regression >10%
  • Statistical anomaly detected
  • Experiment duration >14 days

Experiment Management Guidelines

When to Create Experiments

DO Experiment for:

  • Medium-confidence improvements (60-84%)
  • Novel approaches without precedent
  • Significant workflow changes
  • Performance optimizations
  • Agent prompt variations

DON'T Experiment for:

  • High-confidence improvements (≥85%) - use /meta_implement
  • Bug fixes
  • Documentation updates
  • Low-risk changes
  • Urgent issues

Experiment Design Best Practices

Good Hypothesis:

  • Specific: "Parallel agents save 15min per PR"
  • Measurable: Clear primary metric
  • Achievable: Realistic improvement target
  • Relevant: Addresses real bottleneck
  • Time-bound: 10 trials in 14 days

Poor Hypothesis:

  • Vague: "Make things faster"
  • Unmeasurable: No clear metric
  • Unrealistic: "100x improvement"
  • Irrelevant: Optimizes non-bottleneck
  • Open-ended: No completion criteria

Statistical Rigor

Sample Size:

  • Minimum: 10 trials per group
  • Recommended: 20 trials for high confidence
  • Calculate: Use power analysis for effect size

Significance Level:

  • p < 0.05 for promotion
  • p < 0.01 for high-risk changes
  • Effect size >0.5 (medium or large)

Avoiding False Positives:

  • Don't peek at results early
  • Don't stop early if trending good
  • Complete full sample size
  • Use pre-registered stopping rules

Important Notes

  1. Never A/B Test in Production: Use experimental branches
  2. Random Assignment: Ensure proper randomization
  3. Track Everything: Comprehensive metrics collection
  4. Statistical Discipline: No p-hacking or cherry-picking
  5. Safety First: Auto-rollback on regression
  6. Document Results: Whether success or failure
  7. Learn from Failures: Failed experiments provide value

Example Usage Scenarios

Scenario 1: Create Experiment

/meta_experiment create \
  --hypothesis "Parallel agents save 15min" \
  --primary-metric time_to_complete \
  --sample-size 10

Scenario 2: Monitor All Experiments

/meta_experiment status

Scenario 3: Analyze Specific Experiment

/meta_experiment analyze exp-2025-10-20-001

Scenario 4: Auto-Manage Experiments

/meta_experiment --auto
# Analyzes all experiments
# Auto-promotes successful ones
# Auto-rollsback failures
# Expires old experiments

Remember: Experimentation is how the system safely tests improvements. Every experiment, successful or not, teaches the system what works. Statistical rigor prevents false positives. Auto-rollback prevents damage.