Initial commit

Zhongwei Li
2025-11-30 08:48:35 +08:00
commit 6f1ef3ef54
45 changed files with 15173 additions and 0 deletions

commands/meta_experiment.md (new file, 615 lines)

---
description: A/B testing framework for safe experimentation with statistical validation
model: claude-opus-4-1
extended-thinking: true
allowed-tools: Bash, Read, Write, Edit
argument-hint: [create|status|analyze|promote|rollback] [experiment-id] [--auto]
---
# Meta Experiment Command
You are an elite experimental design specialist with expertise in A/B testing, statistical analysis, and safe deployment strategies. Your role is to create, manage, and analyze experiments that test improvements, automatically promoting successes and rolling back failures.
**Arguments**: $ARGUMENTS
## Overview
This command manages the complete experiment lifecycle.
**Experiment Lifecycle**:
1. **Design**: Create experiment with hypothesis and metrics
2. **Deploy**: Apply changes to experimental variant
3. **Run**: Track metrics on real usage (A/B test)
4. **Analyze**: Statistical significance testing
5. **Decide**: Auto-promote or auto-rollback based on results
**Safety Mechanisms**:
- Max regression allowed: 10% (auto-rollback if worse)
- Max trial duration: 14 days (expire experiments)
- Statistical significance required: p < 0.05
- Alert on anomalies
- Backup before deployment
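Taken together, these mechanisms form a simple decision rule. A minimal sketch in Python (the constants and the `decide` helper are illustrative, not part of the shipped command; sample-size gating is omitted for brevity):
```python
# Minimal sketch of the decision rule implied by the safety mechanisms above.
# Constants mirror the documented thresholds; all names are illustrative.
MAX_REGRESSION_PCT = 10    # auto-rollback if the metric is >10% worse
MAX_DURATION_DAYS = 14     # expire experiments after 14 days
ALPHA = 0.05               # required statistical significance

def decide(improvement_pct: float, p_value: float, days_running: int,
           improvement_threshold: float) -> str:
    """Return the action to take for one experiment."""
    if days_running > MAX_DURATION_DAYS:
        return "expire"        # ran too long without a verdict
    if p_value >= ALPHA:
        return "continue"      # not yet statistically significant
    if improvement_pct <= -MAX_REGRESSION_PCT:
        return "rollback"      # statistically significant regression
    if improvement_pct >= improvement_threshold:
        return "promote"       # significant and large enough to ship
    return "continue"          # significant but below the improvement target
```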
## Workflow
### Phase 1: Parse Command and Load Experiments
```bash
# Locate experiments file (default plugin path; override via META_PLUGIN_DIR)
META_PLUGIN_DIR="${META_PLUGIN_DIR:-$HOME/.claude/plugins/marketplaces/psd-claude-coding-system/plugins/psd-claude-meta-learning-system}"
META_DIR="$META_PLUGIN_DIR/meta"
EXPERIMENTS_FILE="$META_DIR/experiments.json"
# Parse command
COMMAND="${1:-status}"
EXPERIMENT_ID="${2:-}"
AUTO_MODE=false
for arg in $ARGUMENTS; do
  case $arg in
    --auto)
      AUTO_MODE=true
      ;;
    create|status|analyze|promote|rollback)
      COMMAND="$arg"
      ;;
    exp-*)
      EXPERIMENT_ID="$arg"  # experiment IDs follow the exp-<timestamp> pattern
      ;;
  esac
done
echo "=== PSD Meta-Learning: Experiment Framework ==="
echo "Command: $COMMAND"
echo "Experiment: ${EXPERIMENT_ID:-all}"
echo "Auto mode: $AUTO_MODE"
echo ""
# Load experiments
if [ ! -f "$EXPERIMENTS_FILE" ]; then
echo "Creating new experiments tracking file..."
echo '{"experiments": []}' > "$EXPERIMENTS_FILE"
fi
cat "$EXPERIMENTS_FILE"
```
### Phase 2: Execute Command
#### CREATE - Design New Experiment
```bash
if [ "$COMMAND" = "create" ]; then
echo "Creating new experiment..."
echo ""
# Experiment parameters (from arguments or interactive)
HYPOTHESIS="${HYPOTHESIS:-Enter hypothesis}"
CHANGES="${CHANGES:-Describe changes}"
PRIMARY_METRIC="${PRIMARY_METRIC:-time_to_complete}"
SAMPLE_SIZE="${SAMPLE_SIZE:-10}"
# Generate experiment ID
EXP_ID="exp-$(date +%Y%m%d-%H%M%S)"
echo "Experiment ID: $EXP_ID"
echo "Hypothesis: $HYPOTHESIS"
echo "Primary Metric: $PRIMARY_METRIC"
echo "Sample Size: $SAMPLE_SIZE trials"
echo ""
fi
```
**Experiment Design Template**:
```json
{
"id": "exp-2025-10-20-001",
"name": "Enhanced PR review with parallel agents",
"created": "2025-10-20T10:30:00Z",
"status": "running",
"hypothesis": "Running security-analyst + code-cleanup in parallel saves 15min per PR",
"changes": {
"type": "command_modification",
"files": {
"plugins/psd-claude-workflow/commands/review_pr.md": {
"backup": "plugins/psd-claude-workflow/commands/review_pr.md.backup",
"variant": "plugins/psd-claude-workflow/commands/review_pr.md.experiment"
}
},
"description": "Modified /review_pr to invoke security and cleanup agents in parallel"
},
"metrics": {
"primary": "time_to_complete_review",
"secondary": ["issues_caught", "false_positives", "user_satisfaction"]
},
"targets": {
"improvement_threshold": 15,
"max_regression": 10,
"confidence_threshold": 0.80
},
"sample_size_required": 10,
"max_duration_days": 14,
"auto_rollback": true,
"auto_promote": true,
"results": {
"trials_completed": 0,
"control_group": [],
"treatment_group": [],
"control_avg": null,
"treatment_avg": null,
"improvement_pct": null,
"p_value": null,
"statistical_confidence": null,
"status": "collecting_data"
}
}
```
**Create Experiment**:
1. Backup original files
2. Create experimental variant
3. Add experiment to experiments.json
4. Deploy variant (if --auto)
5. Begin tracking
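A minimal sketch of steps 1-3, assuming the experiments.json schema from the template above (the `create_experiment` helper and its file-naming conventions are assumptions, not the shipped implementation):
```python
# Sketch: back up a file, create its experimental variant, register the experiment.
import json
import shutil
from datetime import datetime, timezone

def create_experiment(experiments_file, target_file, hypothesis, primary_metric):
    backup = target_file + ".backup"
    variant = target_file + ".experiment"
    shutil.copy(target_file, backup)    # 1. backup original
    shutil.copy(target_file, variant)   # 2. variant starts as a copy; edit it next
    exp_id = "exp-" + datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    with open(experiments_file) as f:
        data = json.load(f)
    data["experiments"].append({        # 3. register in experiments.json
        "id": exp_id,
        "status": "running",
        "hypothesis": hypothesis,
        "changes": {"files": {target_file: {"backup": backup, "variant": variant}}},
        "metrics": {"primary": primary_metric},
        "results": {"trials_completed": 0, "control_group": [], "treatment_group": []},
    })
    with open(experiments_file, "w") as f:
        json.dump(data, f, indent=2)
    return exp_id
```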
#### STATUS - View All Experiments
```bash
if [ "$COMMAND" = "status" ]; then
echo "Experiment Status:"
echo ""
# For each experiment in experiments.json:
# Display summary with status, progress, results
fi
```
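A minimal sketch of the loop that comment describes, again assuming the template schema (the `print_status` helper is illustrative):
```python
# Sketch: print a one-line summary per experiment from experiments.json.
import json

def print_status(experiments_file):
    with open(experiments_file) as f:
        data = json.load(f)
    for exp in data["experiments"]:
        done = exp["results"]["trials_completed"]
        needed = exp.get("sample_size_required", 10)
        print(f"{exp['id']}: {exp['status']} ({done}/{needed} trials) - {exp.get('name', '')}")
```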
**Status Report Format**:
```markdown
## ACTIVE EXPERIMENTS
### Experiment #1: exp-2025-10-20-001
**Name**: Enhanced PR review with parallel agents
**Status**: 🟡 Running (7/10 trials)
**Hypothesis**: Running security-analyst + code-cleanup in parallel saves 15min
**Progress**: `███████░░░` 70% (7/10 trials)
**Current Results**:
- Control avg: 45 min
- Treatment avg: 27 min
- Time saved: 18 min (40% improvement)
- Confidence: 75% (needs 3 more trials for 80%)
**Action**: Continue (3 more trials needed)
---
### Experiment #2: exp-2025-10-15-003
**Name**: Predictive bug detection
**Status**: 🔴 Failed - Auto-rolled back
**Hypothesis**: Pattern matching prevents 50% of bugs
**Results**:
- False positives increased 300%
- User satisfaction dropped 40%
- Automatically rolled back after 5 trials
**Action**: None (experiment terminated)
---
## COMPLETED EXPERIMENTS
### Experiment #3: exp-2025-10-01-002
**Name**: Parallel test execution
**Status**: ✅ Promoted to production
**Hypothesis**: Parallel testing saves 20min per run
**Final Results**:
- Control avg: 45 min
- Treatment avg: 23 min
- Time saved: 22 min (49% improvement)
- Confidence: 95% (statistically significant)
- Trials: 12
**Deployed**: 2025-10-10 (running in production for 10 days)
---
## SUMMARY
- **Active**: 1 experiment
- **Successful**: 5 experiments (83% success rate)
- **Failed**: 1 experiment (auto-rolled back)
- **Total ROI**: 87 hours/month saved
```
#### ANALYZE - Statistical Analysis
```bash
if [ "$COMMAND" = "analyze" ]; then
echo "Analyzing experiment: $EXPERIMENT_ID"
echo ""
# Load experiment results
# Calculate statistics:
# - Mean for control and treatment
# - Standard deviation
# - T-test for significance
# - Effect size
# - Confidence interval
# Determine decision
fi
```
**Statistical Analysis Process**:
```python
# Analysis implementation (requires numpy and scipy)
from math import sqrt
from numpy import mean, std
from scipy.stats import t, ttest_ind

def analyze_experiment(experiment):
    control = experiment['results']['control_group']
    treatment = experiment['results']['treatment_group']
    # Calculate means
    control_mean = mean(control)
    treatment_mean = mean(treatment)
    improvement_pct = ((control_mean - treatment_mean) / control_mean) * 100
    # T-test for significance
    t_stat, p_value = ttest_ind(control, treatment)
    significant = p_value < 0.05
    # Effect size (Cohen's d) using the pooled sample standard deviation (ddof=1)
    pooled_std = sqrt(((len(control) - 1) * std(control, ddof=1) ** 2
                       + (len(treatment) - 1) * std(treatment, ddof=1) ** 2)
                      / (len(control) + len(treatment) - 2))
    cohens_d = (treatment_mean - control_mean) / pooled_std
    # 95% confidence interval for the difference in means
    ci_95 = t.interval(0.95, len(control) + len(treatment) - 2,
                       loc=treatment_mean - control_mean,
                       scale=pooled_std * sqrt(1 / len(control) + 1 / len(treatment)))
    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'improvement_pct': improvement_pct,
        'p_value': p_value,
        'significant': significant,
        'effect_size': cohens_d,
        'confidence_interval': ci_95,
        'sample_size': len(control) + len(treatment)
    }
```
**Analysis Report**:
```markdown
## STATISTICAL ANALYSIS - exp-2025-10-20-001
### Data Summary
**Control Group** (n=7):
- Mean: 45.2 min
- Std Dev: 8.3 min
- Range: 32-58 min
**Treatment Group** (n=7):
- Mean: 27.4 min
- Std Dev: 5.1 min
- Range: 21-35 min
### Statistical Tests
**Improvement**: 39.4% faster (17.8 min saved)
**T-Test**:
- t-statistic: 4.82
- p-value: 0.0012 (highly significant, p < 0.01)
- Degrees of freedom: 12
**Effect Size** (Cohen's d): 2.51 (very large effect)
**95% Confidence Interval**: [10.2 min, 25.4 min] saved
### Decision Criteria
✅ Statistical significance: p < 0.05 (p = 0.0012)
✅ Improvement > threshold: 39% > 15% target
✅ No regression detected
✅ Sample size adequate: 14 trials
✅ Confidence: 99%, exceeding the 80% target
### RECOMMENDATION: PROMOTE TO PRODUCTION
**Rationale**:
- Highly significant improvement (p < 0.01)
- Large effect size (d = 2.51)
- Exceeds improvement target (39% vs 15%)
- No adverse effects detected
- Sufficient sample size
**Expected Impact**:
- Time savings: 17.8 min per PR
- Monthly savings: 17.8 × 50 PRs = 14.8 hours
- Annual savings: 178 hours (4.5 work-weeks)
```
#### PROMOTE - Deploy Successful Experiment
```bash
if [ "$COMMAND" = "promote" ]; then
echo "Promoting experiment to production: $EXPERIMENT_ID"
echo ""
# Verify experiment is successful
# Check statistical significance
# Backup current production
# Replace with experimental variant
# Update experiment status
# Commit changes
echo "⚠️ This will deploy experimental changes to production"
echo "Press Ctrl+C to cancel, or wait 5 seconds to proceed..."
sleep 5
# Promotion process
echo "Backing up current production..."
# cp production.md production.md.pre-experiment
echo "Deploying experimental variant..."
# cp variant.md production.md
echo "Updating experiment status..."
# Update experiments.json: status = "promoted"
git add .
git commit -m "experiment: Promote exp-$EXPERIMENT_ID to production
Experiment: [name]
Improvement: [X]% ([metric])
Confidence: [Y]% (p = [p-value])
Trials: [N]
Auto-promoted by /meta_experiment"
echo "✅ Experiment promoted to production"
fi
```
#### ROLLBACK - Revert Failed Experiment
```bash
if [ "$COMMAND" = "rollback" ]; then
echo "Rolling back experiment: $EXPERIMENT_ID"
echo ""
# Restore backup
# Update experiment status
# Commit rollback
echo "Restoring original version..."
# cp backup.md production.md
echo "Updating experiment status..."
# Update experiments.json: status = "rolled_back"
git add .
git commit -m "experiment: Rollback exp-$EXPERIMENT_ID
Reason: [failure reason]
Regression: [X]% worse
Status: Rolled back to pre-experiment state
Auto-rolled back by /meta_experiment"
echo "✅ Experiment rolled back"
fi
```
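PROMOTE and ROLLBACK share the same shape: swap files, then record the outcome in experiments.json. A minimal Python sketch of that shared logic (the `finalize_experiment` helper is an assumption, not the shipped implementation):
```python
# Sketch: finalize an experiment by swapping files and recording the outcome.
import json
import shutil

def finalize_experiment(experiments_file, exp_id, promote):
    with open(experiments_file) as f:
        data = json.load(f)
    # Raises StopIteration if the ID is unknown; a real command would validate.
    exp = next(e for e in data["experiments"] if e["id"] == exp_id)
    for path, files in exp["changes"]["files"].items():
        if promote:
            shutil.copy(path, path + ".pre-experiment")  # backup current production
            shutil.copy(files["variant"], path)           # deploy the variant
        else:
            shutil.copy(files["backup"], path)            # restore the original
    exp["status"] = "promoted" if promote else "rolled_back"
    with open(experiments_file, "w") as f:
        json.dump(data, f, indent=2)
```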
### Phase 3: Automatic Decision Making (--auto mode)
```bash
if [ "$AUTO_MODE" = true ]; then
echo ""
echo "Running automatic experiment management..."
echo ""
# For each active experiment:
for exp in active_experiments; do
# Analyze current results
analyze_experiment($exp)
# Decision logic:
if sample_size >= required && statistical_significance:
if improvement > threshold && no_regression:
# Auto-promote
echo "✅ Auto-promoting: $exp (significant improvement)"
promote_experiment($exp)
elif regression > max_allowed:
# Auto-rollback
echo "❌ Auto-rolling back: $exp (regression detected)"
rollback_experiment($exp)
else:
echo "⏳ Inconclusive: $exp (continue collecting data)"
elif days_running > max_duration:
# Expire experiment
echo "⏱️ Expiring: $exp (max duration reached)"
rollback_experiment($exp)
else:
echo "📊 Monitoring: $exp (needs more data)"
fi
done
fi
```
### Phase 4: Experiment Tracking and Metrics
**Telemetry Integration**:
When commands run, check if they're part of an active experiment:
```bash
# In command execution (e.g., /review_pr)
check_active_experiments() {
  # Is this command under experiment? (experiment_active_for_command is a
  # helper assumed to be defined by this command; a sketch follows this block.)
  if experiment_active_for_command "$COMMAND_NAME"; then
    # Randomly assign to control (original) or treatment (experimental variant)
    if [ $((RANDOM % 2)) -eq 0 ]; then
      variant="control"
    else
      variant="treatment"
    fi
    # Track metrics
    start_time=$(date +%s)
    execute_command "$variant"
    end_time=$(date +%s)
    duration=$((end_time - start_time))
    # Record result (helper assumed to append to experiments.json)
    record_experiment_result "$EXP_ID" "$variant" "$duration"
  fi
}
```
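The `experiment_active_for_command` helper is left undefined above; one plausible sketch matches a command name against the files each running experiment modifies (function names and the matching rule are assumptions):
```python
# Sketch: find a running experiment covering a command, then assign a variant.
import json
import random

def experiment_active_for_command(experiments_file, command_name):
    """Return the running experiment whose changed files mention the command, if any."""
    with open(experiments_file) as f:
        data = json.load(f)
    for exp in data["experiments"]:
        if exp["status"] == "running" and any(
            command_name in path for path in exp["changes"]["files"]
        ):
            return exp
    return None

def assign_variant():
    # 50/50 random assignment to control or treatment
    return "control" if random.random() < 0.5 else "treatment"
```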
### Phase 5: Safety Checks and Alerts
**Continuous Monitoring**:
```bash
monitor_experiments() {
  # Helpers (list_running_experiments, get_recent_trials, alert,
  # rollback_experiment) are assumed to be defined by this command;
  # detect_anomaly is sketched after the alert-trigger list below.
  for exp in $(list_running_experiments); do
    latest_results=$(get_recent_trials "$exp" 3)
    # Check for anomalies
    if detect_anomaly "$latest_results"; then
      alert "Anomaly detected in experiment $exp"
    fi
    # Specific checks (flags set while analyzing the latest trials):
    if [ "$ERROR_RATE_DOUBLED" = true ]; then
      alert "Error rate spike - consider rollback"
    fi
    if [ "$SATISFACTION_BELOW_50PCT" = true ]; then
      alert "User satisfaction dropped - review experiment"
    fi
    if [ "$REGRESSION_EXCEEDED" = true ]; then
      alert "Performance regression - auto-rollback initiated"
      rollback_experiment "$exp"
    fi
  done
}
```
**Alert Triggers**:
- Error rate >2x baseline
- User satisfaction <50%
- Performance regression >10%
- Statistical anomaly detected
- Experiment duration >14 days
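One simple way to implement the `detect_anomaly` check from the monitoring loop is a z-score test of recent trials against the baseline distribution; a minimal sketch (the 3-standard-deviation threshold is illustrative):
```python
# Sketch: flag recent trials that deviate sharply from the baseline distribution.
from statistics import mean, stdev

def detect_anomaly(recent, baseline, z_threshold=3.0):
    """True if any recent trial is more than z_threshold std devs from the baseline mean."""
    if len(baseline) < 2:
        return False   # not enough history to judge
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return False   # baseline has no variance; z-score is undefined
    return any(abs(x - mu) / sigma > z_threshold for x in recent)
```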
## Experiment Management Guidelines
### When to Create Experiments
**DO Experiment** for:
- Medium-confidence improvements (60-84%)
- Novel approaches without precedent
- Significant workflow changes
- Performance optimizations
- Agent prompt variations
**DON'T Experiment** for:
- High-confidence improvements (≥85%) - use `/meta_implement`
- Bug fixes
- Documentation updates
- Low-risk changes
- Urgent issues
### Experiment Design Best Practices
**Good Hypothesis**:
- Specific: "Parallel agents save 15min per PR"
- Measurable: Clear primary metric
- Achievable: Realistic improvement target
- Relevant: Addresses real bottleneck
- Time-bound: 10 trials in 14 days
**Poor Hypothesis**:
- Vague: "Make things faster"
- Unmeasurable: No clear metric
- Unrealistic: "100x improvement"
- Irrelevant: Optimizes non-bottleneck
- Open-ended: No completion criteria
### Statistical Rigor
**Sample Size**:
- Minimum: 10 trials per group
- Recommended: 20 trials for high confidence
- Calculate: use power analysis for the expected effect size (see the sketch after this list)
**Significance Level**:
- p < 0.05 for promotion
- p < 0.01 for high-risk changes
- Effect size >0.5 (medium or large)
**Avoiding False Positives**:
- Don't peek at results early
- Don't stop early if trending good
- Complete full sample size
- Use pre-registered stopping rules
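For the power analysis mentioned under Sample Size, statsmodels provides a standard solver; a minimal sketch, assuming a two-sided two-sample t-test design:
```python
# Sketch: trials needed per group for a two-sample t-test at a given power.
from math import ceil
from statsmodels.stats.power import TTestIndPower

def required_sample_size(effect_size: float, alpha: float = 0.05,
                         power: float = 0.8) -> int:
    """Per-group sample size to detect `effect_size` (Cohen's d)."""
    n = TTestIndPower().solve_power(effect_size=effect_size,
                                    alpha=alpha, power=power)
    return ceil(n)  # round up to a whole trial
```
For example, detecting a large effect (d = 1.0) at alpha 0.05 and 80% power needs about 17 trials per group, consistent with the 20-trial recommendation above.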
## Important Notes
1. **Never A/B Test in Production**: Use experimental branches
2. **Random Assignment**: Ensure proper randomization
3. **Track Everything**: Comprehensive metrics collection
4. **Statistical Discipline**: No p-hacking or cherry-picking
5. **Safety First**: Auto-rollback on regression
6. **Document Results**: Whether success or failure
7. **Learn from Failures**: Failed experiments provide value
## Example Usage Scenarios
### Scenario 1: Create Experiment
```bash
/meta_experiment create \
--hypothesis "Parallel agents save 15min" \
--primary-metric time_to_complete \
--sample-size 10
```
### Scenario 2: Monitor All Experiments
```bash
/meta_experiment status
```
### Scenario 3: Analyze Specific Experiment
```bash
/meta_experiment analyze exp-2025-10-20-001
```
### Scenario 4: Auto-Manage Experiments
```bash
/meta_experiment --auto
# Analyzes all experiments
# Auto-promotes successful ones
# Auto-rolls back failures
# Expires old experiments
```
---
**Remember**: Experimentation is how the system safely tests improvements. Every experiment, successful or not, teaches the system what works. Statistical rigor prevents false positives. Auto-rollback prevents damage.