---
name: specweave-ml:ml-evaluate
description: Evaluate ML model with comprehensive metrics
---

# Evaluate ML Model

You are evaluating an ML model in a SpecWeave increment. Generate a comprehensive evaluation report following ML best practices.

## Your Task

1. **Load Model**: Load the model from the specified increment
2. **Run Evaluation**: Execute a comprehensive evaluation with appropriate metrics
3. **Generate Report**: Create an evaluation report in the increment folder

## Evaluation Steps

### Step 1: Identify Model Type

- Classification: accuracy, precision, recall, F1, ROC AUC, confusion matrix
- Regression: RMSE, MAE, MAPE, R², residual analysis
- Ranking: precision@K, recall@K, NDCG@K, MAP

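As a sketch, the classification metric family above maps directly onto scikit-learn (assumed available here; the synthetic dataset and logistic-regression model are stand-ins for the increment's real model):

```python
# Sketch only: computing the classification metrics listed above with
# scikit-learn. The synthetic dataset and model stand in for the real ones.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # P(class 1), needed for ROC AUC

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
cm = confusion_matrix(y_test, y_pred)
```

The regression and ranking cases follow the same pattern with `mean_squared_error`, `r2_score`, `ndcg_score`, and friends from the same module.
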
### Step 2: Load Test Data

```python
# Load test set from increment
X_test = load_test_data(increment_path)
y_test = load_test_labels(increment_path)
```

### Step 3: Compute Metrics

```python
from specweave import ModelEvaluator

evaluator = ModelEvaluator(model, X_test, y_test)
metrics = evaluator.compute_all_metrics()
```

### Step 4: Generate Visualizations

- Confusion matrix (classification)
- ROC curves (classification)
- Residual plots (regression)
- Calibration curves (classification)

### Step 5: Statistical Validation

- Cross-validation results
- Confidence intervals
- Comparison to baseline
- Statistical significance tests

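The first two checks can be sketched with scikit-learn and NumPy (assumed available); the synthetic data, split point, and resample count are illustrative:

```python
# Sketch only: cross-validation plus a bootstrap confidence interval for
# accuracy; the data, split point, and resample count are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: mean and spread give a quick stability check
cv_scores = cross_val_score(model, X, y, cv=5)

# Bootstrap a 95% CI over per-sample correctness on a held-out slice
model.fit(X[:300], y[:300])
correct = (model.predict(X[300:]) == y[300:]).astype(float)
rng = np.random.default_rng(0)
boot = [rng.choice(correct, size=len(correct), replace=True).mean()
        for _ in range(1000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
```

The same bootstrap loop works for any per-sample metric, and a paired test over the same resamples gives the baseline-comparison significance check.
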
### Step 6: Generate Report

Create `evaluation-report.md` in the increment folder:

```markdown
# Model Evaluation Report

## Model: [Model Name]
- Version: [Version]
- Increment: [Increment ID]
- Date: [Evaluation Date]

## Overall Performance
[Metrics table]

## Visualizations
[Embedded plots]

## Cross-Validation
[CV results]

## Comparison to Baseline
[Baseline comparison]

## Statistical Tests
[Significance tests]

## Recommendations
[Deploy/improve/investigate]
```

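One way to materialize that skeleton is to write the file from the computed metrics; a sketch with plain-Python file I/O, where the increment path and metric values are hypothetical placeholders:

```python
# Sketch only: writing a minimal evaluation-report.md; the path and the
# metric values below are hypothetical placeholders.
from pathlib import Path

increment_path = Path("increments/0001-example")  # hypothetical increment
metrics = {"accuracy": 0.91, "f1": 0.88}         # placeholder values

lines = ["# Model Evaluation Report", "", "## Overall Performance", ""]
lines += [f"- {name}: {value:.3f}" for name, value in metrics.items()]

increment_path.mkdir(parents=True, exist_ok=True)
report = increment_path / "evaluation-report.md"
report.write_text("\n".join(lines) + "\n")
```
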
## Output

After evaluation, report:

- Overall performance summary
- Key metrics
- Whether the model meets the success criteria (from spec.md)
- Recommendation (deploy/improve/investigate)