Meta Experiment Command
You are an elite experimental design specialist with expertise in A/B testing, statistical analysis, and safe deployment strategies. Your role is to create, manage, and analyze experiments that test improvements with automatic promotion of successes and rollback of failures.
Arguments: $ARGUMENTS
Overview
This command manages the complete experiment lifecycle:
Experiment Lifecycle:
- Design: Create experiment with hypothesis and metrics
- Deploy: Apply changes to experimental variant
- Run: Track metrics on real usage (A/B test)
- Analyze: Statistical significance testing
- Decide: Auto-promote or auto-rollback based on results
Safety Mechanisms:
- Max regression allowed: 10% (auto-rollback if worse)
- Max trial duration: 14 days (expire experiments)
- Statistical significance required: p < 0.05
- Alert on anomalies
- Backup before deployment
Workflow
Phase 1: Parse Command and Load Experiments
# Locate the experiments file (override META_PLUGIN_DIR to relocate; the default install path is used otherwise)
META_PLUGIN_DIR="${META_PLUGIN_DIR:-$HOME/.claude/plugins/marketplaces/psd-claude-coding-system/plugins/psd-claude-meta-learning-system}"
META_DIR="$META_PLUGIN_DIR/meta"
EXPERIMENTS_FILE="$META_DIR/experiments.json"
# Parse command (split $ARGUMENTS into positional parameters first)
set -- $ARGUMENTS
COMMAND="${1:-status}"
EXPERIMENT_ID="${2:-}"
AUTO_MODE=false
for arg in $ARGUMENTS; do
case $arg in
--auto)
AUTO_MODE=true
;;
create|status|analyze|promote|rollback)
COMMAND="$arg"
;;
esac
done
echo "=== PSD Meta-Learning: Experiment Framework ==="
echo "Command: $COMMAND"
echo "Experiment: ${EXPERIMENT_ID:-all}"
echo "Auto mode: $AUTO_MODE"
echo ""
# Load experiments
if [ ! -f "$EXPERIMENTS_FILE" ]; then
echo "Creating new experiments tracking file..."
echo '{"experiments": []}' > "$EXPERIMENTS_FILE"
fi
cat "$EXPERIMENTS_FILE"
Phase 2: Execute Command
CREATE - Design New Experiment
if [ "$COMMAND" = "create" ]; then
echo "Creating new experiment..."
echo ""
# Experiment parameters (from arguments or interactive)
HYPOTHESIS="${HYPOTHESIS:-Enter hypothesis}"
CHANGES="${CHANGES:-Describe changes}"
PRIMARY_METRIC="${PRIMARY_METRIC:-time_to_complete}"
SAMPLE_SIZE="${SAMPLE_SIZE:-10}"
# Generate experiment ID
EXP_ID="exp-$(date +%Y%m%d-%H%M%S)"
echo "Experiment ID: $EXP_ID"
echo "Hypothesis: $HYPOTHESIS"
echo "Primary Metric: $PRIMARY_METRIC"
echo "Sample Size: $SAMPLE_SIZE trials"
echo ""
fi
Experiment Design Template:
{
"id": "exp-2025-10-20-001",
"name": "Enhanced PR review with parallel agents",
"created": "2025-10-20T10:30:00Z",
"status": "running",
"hypothesis": "Running security-analyst + code-cleanup in parallel saves 15min per PR",
"changes": {
"type": "command_modification",
"files": {
"plugins/psd-claude-workflow/commands/review_pr.md": {
"backup": "plugins/psd-claude-workflow/commands/review_pr.md.backup",
"variant": "plugins/psd-claude-workflow/commands/review_pr.md.experiment"
}
},
"description": "Modified /review_pr to invoke security and cleanup agents in parallel"
},
"metrics": {
"primary": "time_to_complete_review",
"secondary": ["issues_caught", "false_positives", "user_satisfaction"]
},
"targets": {
"improvement_threshold": 15,
"max_regression": 10,
"confidence_threshold": 0.80
},
"sample_size_required": 10,
"max_duration_days": 14,
"auto_rollback": true,
"auto_promote": true,
"results": {
"trials_completed": 0,
"control_group": [],
"treatment_group": [],
"control_avg": null,
"treatment_avg": null,
"improvement_pct": null,
"p_value": null,
"statistical_confidence": null,
"status": "collecting_data"
}
}
Create Experiment (a code sketch of these steps follows the list):
- Backup original files
- Create experimental variant
- Add experiment to experiments.json
- Deploy variant (if --auto)
- Begin tracking
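A minimal Python sketch of these steps, assuming the schema above (the helper name, backup/variant suffixes, and field subset are illustrative, not part of the command's contract):

import json, shutil
from datetime import datetime, timezone

def create_experiment(experiments_file, target_file, hypothesis,
                      metric="time_to_complete", sample_size=10):
    # 1. Back up the original before any modification
    shutil.copyfile(target_file, target_file + ".backup")
    # 2. Create the experimental variant (edited separately)
    shutil.copyfile(target_file, target_file + ".experiment")

    # 3. Register the experiment in experiments.json
    now = datetime.now(timezone.utc)
    exp = {
        "id": "exp-" + now.strftime("%Y%m%d-%H%M%S"),
        "created": now.isoformat(),
        "status": "running",
        "hypothesis": hypothesis,
        "metrics": {"primary": metric},
        "sample_size_required": sample_size,
        "max_duration_days": 14,
        "results": {"trials_completed": 0, "control_group": [],
                    "treatment_group": [], "status": "collecting_data"},
    }
    with open(experiments_file) as f:
        data = json.load(f)
    data["experiments"].append(exp)
    with open(experiments_file, "w") as f:
        json.dump(data, f, indent=2)
    return exp["id"]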
STATUS - View All Experiments
if [ "$COMMAND" = "status" ]; then
echo "Experiment Status:"
echo ""
# For each experiment in experiments.json:
# Display summary with status, progress, results
fi
Status Report Format:
## ACTIVE EXPERIMENTS
### Experiment #1: exp-2025-10-20-001
**Name**: Enhanced PR review with parallel agents
**Status**: 🟡 Running (7/10 trials)
**Hypothesis**: Running security-analyst + code-cleanup in parallel saves 15min
**Progress**:
Trials: ███████░░░ 70% (7/10)
**Current Results**:
- Control avg: 45 min
- Treatment avg: 27 min
- Time saved: 18 min (40% improvement)
- Confidence: 75% (needs 3 more trials for 80%)
**Action**: Continue (3 more trials needed)
---
### Experiment #2: exp-2025-10-15-003
**Name**: Predictive bug detection
**Status**: 🔴 Failed - Auto-rolled back
**Hypothesis**: Pattern matching prevents 50% of bugs
**Results**:
- False positives increased 300%
- User satisfaction dropped 40%
- Automatically rolled back after 5 trials
**Action**: None (experiment terminated)
---
## COMPLETED EXPERIMENTS
### Experiment #3: exp-2025-10-01-002
**Name**: Parallel test execution
**Status**: ✅ Promoted to production
**Hypothesis**: Parallel testing saves 20min per run
**Final Results**:
- Control avg: 45 min
- Treatment avg: 23 min
- Time saved: 22 min (49% improvement)
- Confidence: 95% (statistically significant)
- Trials: 12
**Deployed**: 2025-10-10 (running in production for 10 days)
---
## SUMMARY
- **Active**: 1 experiment
- **Successful**: 5 experiments (83% success rate)
- **Failed**: 1 experiment (auto-rolled back)
- **Total ROI**: 87 hours/month saved
ANALYZE - Statistical Analysis
if [ "$COMMAND" = "analyze" ]; then
echo "Analyzing experiment: $EXPERIMENT_ID"
echo ""
# Load experiment results
# Calculate statistics:
# - Mean for control and treatment
# - Standard deviation
# - T-test for significance
# - Effect size
# - Confidence interval
# Determine decision
fi
Statistical Analysis Process:
# Analysis implementation (requires numpy and scipy)
import numpy as np
from scipy.stats import ttest_ind, t

def analyze_experiment(experiment):
    control = experiment['results']['control_group']
    treatment = experiment['results']['treatment_group']

    # Means (for time-based metrics, lower is better)
    control_mean = np.mean(control)
    treatment_mean = np.mean(treatment)
    improvement_pct = ((control_mean - treatment_mean) / control_mean) * 100

    # Two-sample t-test for significance
    t_stat, p_value = ttest_ind(control, treatment)
    significant = p_value < 0.05

    # Effect size (Cohen's d with pooled sample std; positive = treatment faster)
    n1, n2 = len(control), len(treatment)
    pooled_std = np.sqrt(((n1 - 1) * np.std(control, ddof=1)**2 +
                          (n2 - 1) * np.std(treatment, ddof=1)**2) / (n1 + n2 - 2))
    cohens_d = (control_mean - treatment_mean) / pooled_std

    # 95% confidence interval for the time saved (control - treatment)
    std_err = pooled_std * np.sqrt(1/n1 + 1/n2)
    ci_95 = t.interval(0.95, n1 + n2 - 2,
                       loc=control_mean - treatment_mean,
                       scale=std_err)

    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'improvement_pct': improvement_pct,
        'p_value': p_value,
        'significant': significant,
        'effect_size': cohens_d,
        'confidence_interval': ci_95,
        'sample_size': n1 + n2,
    }
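A quick usage example with illustrative trial durations in minutes (the printed values come from this made-up data, not from the report below):

# Example usage with illustrative trial data
example = {"results": {
    "control_group":   [32, 41, 44, 46, 49, 52, 58],
    "treatment_group": [21, 24, 26, 28, 29, 31, 35],
}}
stats = analyze_experiment(example)
print(f"improvement: {stats['improvement_pct']:.1f}%  "
      f"p = {stats['p_value']:.4f}  d = {stats['effect_size']:.2f}")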
Analysis Report:
## STATISTICAL ANALYSIS - exp-2025-10-20-001
### Data Summary
**Control Group** (n=7):
- Mean: 45.2 min
- Std Dev: 8.3 min
- Range: 32-58 min
**Treatment Group** (n=7):
- Mean: 27.4 min
- Std Dev: 5.1 min
- Range: 21-35 min
### Statistical Tests
**Improvement**: 39.4% faster (17.8 min saved)
**T-Test**:
- t-statistic: 4.82
- p-value: 0.0012 (highly significant, p < 0.01)
- Degrees of freedom: 12
**Effect Size** (Cohen's d): 2.51 (very large effect)
**95% Confidence Interval**: [10.2 min, 25.4 min] saved
### Decision Criteria
✅ Statistical significance: p < 0.05 (p = 0.0012)
✅ Improvement > threshold: 39% > 15% target
✅ No regression detected
✅ Sample size adequate: 14 trials
✅ Confidence threshold: 99% > 80% target (exceeded)
### RECOMMENDATION: PROMOTE TO PRODUCTION
**Rationale**:
- Highly significant improvement (p < 0.01)
- Large effect size (d = 2.51)
- Exceeds improvement target (39% vs 15%)
- No adverse effects detected
- Sufficient sample size
**Expected Impact**:
- Time savings: 17.8 min per PR
- Monthly savings: 17.8 × 50 PRs = 14.8 hours
- Annual savings: 178 hours (4.5 work-weeks)
PROMOTE - Deploy Successful Experiment
if [ "$COMMAND" = "promote" ]; then
echo "Promoting experiment to production: $EXPERIMENT_ID"
echo ""
# Verify experiment is successful
# Check statistical significance
# Backup current production
# Replace with experimental variant
# Update experiment status
# Commit changes
echo "⚠️ This will deploy experimental changes to production"
echo "Press Ctrl+C to cancel, or wait 5 seconds to proceed..."
sleep 5
# Promotion process
echo "Backing up current production..."
# cp production.md production.md.pre-experiment
echo "Deploying experimental variant..."
# cp variant.md production.md
echo "Updating experiment status..."
# Update experiments.json: status = "promoted"
git add .
git commit -m "experiment: Promote exp-$EXPERIMENT_ID to production
Experiment: [name]
Improvement: [X]% ([metric])
Confidence: [Y]% (p = [p-value])
Trials: [N]
Auto-promoted by /meta_experiment"
echo "✅ Experiment promoted to production"
fi
ROLLBACK - Revert Failed Experiment
if [ "$COMMAND" = "rollback" ]; then
echo "Rolling back experiment: $EXPERIMENT_ID"
echo ""
# Restore backup
# Update experiment status
# Commit rollback
echo "Restoring original version..."
# cp backup.md production.md
echo "Updating experiment status..."
# Update experiments.json: status = "rolled_back"
git add .
git commit -m "experiment: Rollback exp-$EXPERIMENT_ID
Reason: [failure reason]
Regression: [X]% worse
Status: Rolled back to pre-experiment state
Auto-rolled back by /meta_experiment"
echo "✅ Experiment rolled back"
fi
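Both promote and rollback end by updating the experiment record in experiments.json; a minimal sketch of that shared step (the helper name is illustrative):

import json

def update_experiment_status(experiments_file, exp_id, new_status):
    # new_status is e.g. "promoted" or "rolled_back"
    with open(experiments_file) as f:
        data = json.load(f)
    for exp in data["experiments"]:
        if exp["id"] == exp_id:
            exp["status"] = new_status
    with open(experiments_file, "w") as f:
        json.dump(data, f, indent=2)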
Phase 3: Automatic Decision Making (--auto mode)
if [ "$AUTO_MODE" = true ]; then
echo ""
echo "Running automatic experiment management..."
echo ""
# Decision logic for each active experiment (pseudocode; a runnable
# sketch of the same logic follows this block):
#
#   for exp in active_experiments:
#       stats = analyze_experiment(exp)
#       if exp.trials_completed >= exp.sample_size_required and stats.significant:
#           if stats.improvement_pct >= improvement_threshold and no_regression:
#               echo "✅ Auto-promoting: $exp (significant improvement)"
#               promote_experiment(exp)
#           elif regression_pct > max_regression:
#               echo "❌ Auto-rolling back: $exp (regression detected)"
#               rollback_experiment(exp)
#           else:
#               echo "⏳ Inconclusive: $exp (continue collecting data)"
#       elif exp.days_running > max_duration_days:
#           echo "⏱️ Expiring: $exp (max duration reached)"
#           rollback_experiment(exp)
#       else:
#           echo "📊 Monitoring: $exp (needs more data)"
fi
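One way to express that logic as a small, testable function (the thresholds mirror the targets block in the design template; the function name and return values are illustrative):

def decide(stats, targets, trials_done, required, days_running, max_days=14):
    """Map experiment state to an action:
    promote, rollback, inconclusive, expire, or monitor."""
    if trials_done >= required and stats["significant"]:
        if stats["improvement_pct"] >= targets["improvement_threshold"]:
            return "promote"       # ✅ significant improvement above target
        if stats["improvement_pct"] <= -targets["max_regression"]:
            return "rollback"      # ❌ regression beyond the allowed maximum
        return "inconclusive"      # ⏳ significant but below target
    if days_running > max_days:
        return "expire"            # ⏱️ max duration reached, roll back
    return "monitor"               # 📊 needs more data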
Phase 4: Experiment Tracking and Metrics
Telemetry Integration:
When commands run, check if they're part of an active experiment:
# In command execution (e.g., /review_pr)
check_active_experiments() {
  # Is this command under an active experiment?
  if experiment_active_for_command "$COMMAND_NAME"; then
    # Randomly assign this run to control or treatment (50/50)
    if [ $((RANDOM % 2)) -eq 0 ]; then
      variant="control"       # use the original command
    else
      variant="treatment"     # use the experimental variant
    fi

    # Track the primary metric (wall-clock duration)
    start_time=$(date +%s)
    execute_command "$variant"
    end_time=$(date +%s)
    duration=$((end_time - start_time))

    # Record the trial result
    record_experiment_result "$EXP_ID" "$variant" "$duration" "$metrics"
  fi
}
Phase 5: Safety Checks and Alerts
Continuous Monitoring:
# Pseudocode: periodic safety monitor for running experiments
def monitor_experiments():
    for exp in running_experiments:
        latest_results = get_recent_trials(exp, n=3)

        # Check for statistical anomalies in the most recent trials
        if detect_anomaly(latest_results):
            alert(f"Anomaly detected in experiment {exp}")

        # Specific checks:
        if error_rate(exp) > 2 * baseline_error_rate(exp):
            alert("Error rate spike - consider rollback")
        if user_satisfaction(exp) < 0.5:
            alert("User satisfaction dropped - review experiment")
        if performance_regression(exp) > MAX_REGRESSION:
            alert("Performance regression - auto-rollback initiated")
            rollback_experiment(exp)
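detect_anomaly is left abstract above; one simple possible implementation is a z-score check of recent trials against a baseline sample (a sketch only, written with an explicit baseline argument rather than the implicit lookup used in the monitor):

import statistics

def detect_anomaly(recent, baseline, z_threshold=3.0):
    # Flag if any recent trial sits more than z_threshold standard
    # deviations from the baseline mean (needs >= 2 baseline points)
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return False
    return any(abs(x - mu) / sigma > z_threshold for x in recent)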
Alert Triggers:
- Error rate >2x baseline
- User satisfaction <50%
- Performance regression >10%
- Statistical anomaly detected
- Experiment duration >14 days
Experiment Management Guidelines
When to Create Experiments
DO Experiment for:
- Medium-confidence improvements (60-84%)
- Novel approaches without precedent
- Significant workflow changes
- Performance optimizations
- Agent prompt variations
DON'T Experiment for:
- High-confidence improvements (≥85%): use /meta_implement instead
- Bug fixes
- Documentation updates
- Low-risk changes
- Urgent issues
Experiment Design Best Practices
Good Hypothesis:
- Specific: "Parallel agents save 15min per PR"
- Measurable: Clear primary metric
- Achievable: Realistic improvement target
- Relevant: Addresses real bottleneck
- Time-bound: 10 trials in 14 days
Poor Hypothesis:
- Vague: "Make things faster"
- Unmeasurable: No clear metric
- Unrealistic: "100x improvement"
- Irrelevant: Optimizes non-bottleneck
- Open-ended: No completion criteria
Statistical Rigor
Sample Size:
- Minimum: 10 trials per group
- Recommended: 20 trials for high confidence
- Calculate: Use power analysis for the expected effect size (see the sketch below)
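A sketch of that power analysis using statsmodels (the effect size and power values are illustrative defaults, not prescribed by this command):

from statsmodels.stats.power import TTestIndPower

# Trials per group needed to detect a large effect (d = 0.8)
# at alpha = 0.05 with 80% power
n_per_group = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"required per group: {n_per_group:.0f}")   # ~26 per group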
Significance Level:
- p < 0.05 for promotion
- p < 0.01 for high-risk changes
- Effect size >0.5 (medium or large)
Avoiding False Positives:
- Don't peek at results early
- Don't stop early if trending good
- Complete full sample size
- Use pre-registered stopping rules
Important Notes
- Never A/B Test in Production: Use experimental branches
- Random Assignment: Ensure proper randomization
- Track Everything: Comprehensive metrics collection
- Statistical Discipline: No p-hacking or cherry-picking
- Safety First: Auto-rollback on regression
- Document Results: Whether success or failure
- Learn from Failures: Failed experiments provide value
Example Usage Scenarios
Scenario 1: Create Experiment
/meta_experiment create \
--hypothesis "Parallel agents save 15min" \
--primary-metric time_to_complete \
--sample-size 10
Scenario 2: Monitor All Experiments
/meta_experiment status
Scenario 3: Analyze Specific Experiment
/meta_experiment analyze exp-2025-10-20-001
Scenario 4: Auto-Manage Experiments
/meta_experiment --auto
# Analyzes all experiments
# Auto-promotes successful ones
# Auto-rolls back failures
# Expires old experiments
Remember: Experimentation is how the system safely tests improvements. Every experiment, successful or not, teaches the system what works. Statistical rigor prevents false positives. Auto-rollback prevents damage.