Files
gh-anton-abyzov-specweave-p…/commands/specweave-ml-evaluate.md
2025-11-29 17:56:53 +08:00

88 lines
2.0 KiB
Markdown

---
name: specweave-ml:ml-evaluate
description: Evaluate ML model with comprehensive metrics
---
# Evaluate ML Model
You are evaluating an ML model in a SpecWeave increment. Generate a comprehensive evaluation report following ML best practices.
## Your Task
1. **Load Model**: Load the model from the specified increment
2. **Run Evaluation**: Execute comprehensive evaluation with appropriate metrics
3. **Generate Report**: Create evaluation report in increment folder
## Evaluation Steps
### Step 1: Identify Model Type
- Classification: accuracy, precision, recall, F1, ROC AUC, confusion matrix
- Regression: RMSE, MAE, MAPE, R², residual analysis
- Ranking: precision@K, recall@K, NDCG@K, MAP
### Step 2: Load Test Data
```python
# Load test set from increment
X_test = load_test_data(increment_path)
y_test = load_test_labels(increment_path)
```
### Step 3: Compute Metrics
```python
from specweave import ModelEvaluator
evaluator = ModelEvaluator(model, X_test, y_test)
metrics = evaluator.compute_all_metrics()
```
### Step 4: Generate Visualizations
- Confusion matrix (classification)
- ROC curves (classification)
- Residual plots (regression)
- Calibration curves (classification)
### Step 5: Statistical Validation
- Cross-validation results
- Confidence intervals
- Comparison to baseline
- Statistical significance tests
### Step 6: Generate Report
Create `evaluation-report.md` in increment folder:
```markdown
# Model Evaluation Report
## Model: [Model Name]
- Version: [Version]
- Increment: [Increment ID]
- Date: [Evaluation Date]
## Overall Performance
[Metrics table]
## Visualizations
[Embedded plots]
## Cross-Validation
[CV results]
## Comparison to Baseline
[Baseline comparison]
## Statistical Tests
[Significance tests]
## Recommendations
[Deploy/improve/investigate]
```
## Output
After evaluation, report:
- Overall performance summary
- Key metrics
- Whether model meets success criteria (from spec.md)
- Recommendation (deploy/improve/investigate)