---
name: specweave-ml:ml-evaluate
description: Evaluate ML model with comprehensive metrics
---

# Evaluate ML Model

You are evaluating an ML model in a SpecWeave increment. Generate a comprehensive evaluation report following ML best practices.

## Your Task

1. **Load Model**: Load the model from the specified increment
2. **Run Evaluation**: Execute a comprehensive evaluation with appropriate metrics
3. **Generate Report**: Create an evaluation report in the increment folder

## Evaluation Steps

### Step 1: Identify Model Type

- Classification: accuracy, precision, recall, F1, ROC AUC, confusion matrix
- Regression: RMSE, MAE, MAPE, R², residual analysis
- Ranking: precision@K, recall@K, NDCG@K, MAP
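
If the model follows the scikit-learn estimator API, the metric family can be picked automatically. A minimal sketch: `metric_family` is a hypothetical helper, while `is_classifier`/`is_regressor` are real scikit-learn functions; rankers typically need explicit handling since they rarely follow that API.

```python
from sklearn.base import is_classifier, is_regressor

def metric_family(model):
    """Pick the metric family for an estimator (sketch, not exhaustive)."""
    if is_classifier(model):
        return ["accuracy", "precision", "recall", "f1", "roc_auc"]
    if is_regressor(model):
        return ["rmse", "mae", "mape", "r2"]
    # Rankers rarely expose a sklearn estimator type; fall back to ranking metrics
    return ["precision@k", "recall@k", "ndcg@k", "map"]
```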
### Step 2: Load Test Data

```python
# Load the held-out test set from the increment
X_test = load_test_data(increment_path)
y_test = load_test_labels(increment_path)
```
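
`load_test_data`/`load_test_labels` above are placeholders. One plausible implementation, assuming the increment stores its test split as a `test.csv` with a `target` label column (that layout is an assumption for illustration, not a SpecWeave convention):

```python
from pathlib import Path

import pandas as pd

def load_test_data(increment_path):
    # Assumed layout: feature columns plus a "target" column in test.csv
    df = pd.read_csv(Path(increment_path) / "test.csv")
    return df.drop(columns=["target"])

def load_test_labels(increment_path):
    return pd.read_csv(Path(increment_path) / "test.csv")["target"]
```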
### Step 3: Compute Metrics

```python
from specweave import ModelEvaluator

evaluator = ModelEvaluator(model, X_test, y_test)
metrics = evaluator.compute_all_metrics()
```
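
If `ModelEvaluator` is unavailable, the Step 1 classification metrics can be computed directly with scikit-learn. A sketch for the binary case; `compute_classification_metrics` is a hypothetical helper, and regression or ranking models would need their own equivalents:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def compute_classification_metrics(model, X_test, y_test):
    """Binary-classification metrics for a fitted sklearn-style model."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # positive-class probabilities
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "confusion_matrix": confusion_matrix(y_test, y_pred).tolist(),
    }
```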
### Step 4: Generate Visualizations

- Confusion matrix (classification)
- ROC curves (classification)
- Residual plots (regression)
- Calibration curves (classification)
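
As one concrete example, a confusion-matrix heatmap can be rendered with matplotlib and saved into the increment folder. The function name and output path are illustrative; the `Agg` backend keeps it working in headless environments:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this works without a display
import matplotlib.pyplot as plt
import numpy as np

def save_confusion_matrix_plot(cm, labels, out_path):
    """Save a labelled confusion-matrix heatmap to out_path."""
    cm = np.asarray(cm)
    fig, ax = plt.subplots()
    im = ax.imshow(cm, cmap="Blues")
    ax.set_xticks(range(len(labels)), labels=labels)
    ax.set_yticks(range(len(labels)), labels=labels)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    # Annotate each cell with its count
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    fig.colorbar(im, ax=ax)
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```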
### Step 5: Statistical Validation

- Cross-validation results
- Confidence intervals
- Comparison to baseline
- Statistical significance tests
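
The first two checks can be sketched together: k-fold cross-validation plus a bootstrap confidence interval over the fold scores. `cv_with_ci` is a hypothetical helper, and note the interval is rough because k is small:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def cv_with_ci(model, X, y, cv=5, n_boot=1000, seed=0):
    """Cross-validate, then bootstrap the fold scores for a rough 95% CI."""
    scores = cross_val_score(model, X, y, cv=cv)
    rng = np.random.default_rng(seed)
    boot = [rng.choice(scores, size=len(scores), replace=True).mean()
            for _ in range(n_boot)]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return {"mean": scores.mean(), "ci95": (lo, hi), "fold_scores": scores}
```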
### Step 6: Generate Report

Create `evaluation-report.md` in the increment folder:

```markdown
# Model Evaluation Report

## Model: [Model Name]
- Version: [Version]
- Increment: [Increment ID]
- Date: [Evaluation Date]

## Overall Performance
[Metrics table]

## Visualizations
[Embedded plots]

## Cross-Validation
[CV results]

## Comparison to Baseline
[Baseline comparison]

## Statistical Tests
[Significance tests]

## Recommendations
[Deploy/improve/investigate]
```
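
A minimal sketch of rendering that template from a metrics dict; the helper name and table layout are illustrative, and a real report would also embed the plots and CV results:

```python
from pathlib import Path

def write_evaluation_report(increment_path, model_name, metrics, recommendation):
    """Write a minimal evaluation-report.md into the increment folder."""
    lines = [
        "# Model Evaluation Report",
        "",
        f"## Model: {model_name}",
        "",
        "## Overall Performance",
        "| Metric | Value |",
        "|--------|-------|",
        # One table row per scalar metric
        *[f"| {name} | {value:.4f} |" for name, value in metrics.items()],
        "",
        "## Recommendations",
        recommendation,
        "",
    ]
    out = Path(increment_path) / "evaluation-report.md"
    out.write_text("\n".join(lines))
    return out
```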
## Output

After evaluation, report:

- Overall performance summary
- Key metrics
- Whether the model meets the success criteria (from spec.md)
- Recommendation (deploy/improve/investigate)