Initial commit
skills/ml-pipeline-orchestrator/SKILL.md
Normal file
@@ -0,0 +1,518 @@
---
name: ml-pipeline-orchestrator
description: |
  Orchestrates complete machine learning pipelines within SpecWeave increments. Activates when users request "ML pipeline", "train model", "build ML system", "end-to-end ML", "ML workflow", "model training pipeline", or similar. Guides users through data preprocessing, feature engineering, model training, evaluation, and deployment using SpecWeave's spec-driven approach. Integrates with increment lifecycle for reproducible ML development.
---

# ML Pipeline Orchestrator

## Overview

This skill transforms ML development into a SpecWeave increment-based workflow, ensuring every ML project follows the same disciplined approach: spec → plan → tasks → implement → validate. It orchestrates the complete ML lifecycle from data exploration to model deployment, with full traceability and living documentation.

## Core Philosophy

**SpecWeave + ML = Disciplined Data Science**

Traditional ML development often lacks structure:
- ❌ Jupyter notebooks with no version control
- ❌ Experiments without documentation
- ❌ Models deployed with no reproducibility
- ❌ Team knowledge trapped in individual notebooks

SpecWeave brings discipline:
- ✅ Every ML feature is an increment (with spec, plan, tasks)
- ✅ Experiments tracked and documented automatically
- ✅ Model versions tied to increments
- ✅ Living docs capture learnings and decisions

## How It Works

### Phase 1: ML Increment Planning

When you request "build a recommendation model", the skill:

1. **Creates ML increment structure**:
   ```
   .specweave/increments/0042-recommendation-model/
   ├── spec.md            # ML requirements, success metrics
   ├── plan.md            # Pipeline architecture
   ├── tasks.md           # Implementation tasks
   ├── tests.md           # Evaluation criteria
   ├── experiments/       # Experiment tracking
   │   ├── exp-001-baseline/
   │   ├── exp-002-xgboost/
   │   └── exp-003-neural-net/
   ├── data/              # Data samples, schemas
   │   ├── schema.yaml
   │   └── sample.csv
   ├── models/            # Trained models
   │   ├── model-v1.pkl
   │   └── model-v2.pkl
   └── notebooks/         # Exploratory notebooks
       ├── 01-eda.ipynb
       └── 02-feature-engineering.ipynb
   ```

2. **Generates ML-specific spec** (spec.md):
   ```markdown
   ## ML Problem Definition
   - Problem type: Recommendation (collaborative filtering)
   - Input: User behavior history
   - Output: Top-N product recommendations
   - Success metrics: Precision@10 > 0.25, Recall@10 > 0.15

   ## Data Requirements
   - Training data: 6 months user interactions
   - Validation: Last month
   - Features: User profile, product attributes, interaction history

   ## Model Requirements
   - Latency: <100ms inference
   - Throughput: 1000 req/sec
   - Accuracy: Better than random baseline by 3x
   - Explainability: Must explain top-3 recommendations
   ```

3. **Creates ML-specific tasks** (tasks.md):
   ```markdown
   - [ ] T-001: Data exploration and quality analysis
   - [ ] T-002: Feature engineering pipeline
   - [ ] T-003: Train baseline model (random/popularity)
   - [ ] T-004: Train candidate models (3 algorithms)
   - [ ] T-005: Hyperparameter tuning (best model)
   - [ ] T-006: Model evaluation (all metrics)
   - [ ] T-007: Model explainability (SHAP/LIME)
   - [ ] T-008: Production deployment preparation
   - [ ] T-009: A/B test plan
   ```

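The layout in step 1 can also be reproduced by hand when experimenting outside the skill. The following is a minimal sketch using only the Python standard library. The function name `scaffold_increment` is hypothetical and is not part of SpecWeave; the file and folder names simply mirror the tree shown above.

```python
from pathlib import Path
import tempfile

# Files and folders taken from the increment tree above.
INCREMENT_FILES = ("spec.md", "plan.md", "tasks.md", "tests.md")
INCREMENT_DIRS = ("experiments", "data", "models", "notebooks")


def scaffold_increment(root: Path, increment_id: str) -> Path:
    """Create an empty ML increment skeleton (hypothetical helper, not a SpecWeave API)."""
    base = root / ".specweave" / "increments" / increment_id
    for name in INCREMENT_DIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    for name in INCREMENT_FILES:
        (base / name).touch(exist_ok=True)
    return base


# Usage: build the skeleton in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_increment(Path(tmp), "0042-recommendation-model")
    created_entries = sorted(p.name for p in created.iterdir())
```

Both loops use `exist_ok=True`, so re-running the helper on an existing increment leaves existing files untouched.
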
### Phase 2: Pipeline Execution

The skill guides you through each task, applying best practices:

#### Task 1: Data Exploration
```python
# Generated template with SpecWeave integration
import pandas as pd
import mlflow
from specweave import track_experiment

# Auto-logs to .specweave/increments/0042.../experiments/
with track_experiment("exp-001-eda") as exp:
    df = pd.read_csv("data/interactions.csv")

    # EDA: record dataset size and the total count of missing cells
    exp.log_param("dataset_size", len(df))
    exp.log_metric("missing_values", int(df.isnull().sum().sum()))

    # Auto-generates report in increment folder
    exp.save_report("eda-summary.md")
```

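The template above uses `track_experiment` without showing what such a tracker does. For readers who want a concrete picture, here is a minimal file-based stand-in. Everything in it is an assumption made for illustration: the class name `LocalExperiment`, the `params.json` and `metrics.json` file names (borrowed from the output tree in this document), and the status handling are not taken from SpecWeave's implementation.

```python
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path


class LocalExperiment:
    """Hypothetical stand-in that collects params and metrics in memory."""

    def __init__(self, name: str):
        self.name = name
        self.params = {}
        self.metrics = {}
        self.status = "running"

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        self.metrics[key] = value


@contextmanager
def track_experiment(name: str, experiments_dir: Path):
    """Yield an experiment and write params.json / metrics.json on exit."""
    exp = LocalExperiment(name)
    try:
        yield exp
        exp.status = "completed"
    except Exception:
        exp.status = "failed"
        raise
    finally:
        # Results are written even when the body raised, so failures stay documented.
        out = experiments_dir / name
        out.mkdir(parents=True, exist_ok=True)
        (out / "params.json").write_text(json.dumps(exp.params, indent=2))
        (out / "metrics.json").write_text(
            json.dumps({**exp.metrics, "status": exp.status}, indent=2)
        )


# Usage with illustrative values only.
with tempfile.TemporaryDirectory() as tmp:
    with track_experiment("exp-001-eda", Path(tmp)) as exp:
        exp.log_param("dataset_size", 1000)
        exp.log_metric("missing_values", 12)
    saved = json.loads((Path(tmp) / "exp-001-eda" / "metrics.json").read_text())
```

Writing in the `finally` clause is a deliberate choice in this sketch: a crashed run still leaves a record with `status` set to `failed`.
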
#### Task 3: Train Baseline
```python
from sklearn.dummy import DummyClassifier
from specweave import track_model

# X_train and y_train come from the feature pipeline built in T-002
with track_model("baseline-random", increment="0042") as model:
    clf = DummyClassifier(strategy="uniform")
    clf.fit(X_train, y_train)

    # Automatically logged to increment (the values shown are illustrative)
    model.log_metrics({
        "accuracy": 0.12,
        "precision@10": 0.08
    })
    model.save_artifact(clf, "baseline.pkl")
```

#### Task 4: Train Candidate Models
```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
# run_experiments is assumed to be exported alongside ModelExperiment
from specweave import ModelExperiment, run_experiments

# params_xgb, params_lgbm, params_nn: hyperparameter dicts defined elsewhere
# KerasModel: your own Keras model wrapper class (not shown here)

# Parallel experiments with auto-tracking
experiments = [
    ModelExperiment("xgboost", XGBClassifier, params_xgb),
    ModelExperiment("lightgbm", LGBMClassifier, params_lgbm),
    ModelExperiment("neural-net", KerasModel, params_nn)
]

results = run_experiments(
    experiments,
    increment="0042",
    save_to="experiments/"
)

# Auto-generates comparison table in increment docs
```

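The comparison table mentioned in the last comment could be produced along these lines. This is a sketch only: `render_comparison` is a hypothetical helper, and the scores are the precision@10 values quoted in the completion summary in this document, not new results.

```python
def render_comparison(results: dict[str, float], metric: str) -> str:
    """Render a Markdown table of experiments sorted by score, best first."""
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    lines = [f"| Experiment | {metric} |", "| --- | --- |"]
    for name, score in ranked:
        lines.append(f"| {name} | {score:.2f} |")
    return "\n".join(lines)


# Scores copied from the completion summary in this document.
scores = {
    "exp-001-baseline": 0.08,
    "exp-002-popularity": 0.18,
    "exp-003-xgboost": 0.26,
    "exp-004-lightgbm": 0.24,
    "exp-005-neural-net": 0.22,
}
table = render_comparison(scores, "precision@10")
print(table)
```
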
### Phase 3: Increment Completion

When `/specweave:done 0042` runs:

1. **Validates ML-specific criteria**:
   - ✅ All experiments logged
   - ✅ Best model saved
   - ✅ Evaluation metrics documented
   - ✅ Model explainability artifacts present

2. **Generates completion summary**:
   ```markdown
   ## Recommendation Model - COMPLETE

   ### Experiments Run: 7
   1. exp-001-baseline (random): precision@10=0.08
   2. exp-002-popularity: precision@10=0.18
   3. exp-003-xgboost: precision@10=0.26 ✅ BEST
   4. exp-004-lightgbm: precision@10=0.24
   5. exp-005-neural-net: precision@10=0.22
   ...

   ### Best Model
   - Algorithm: XGBoost
   - Version: model-v3.pkl
   - Metrics: precision@10=0.26, recall@10=0.16
   - Training time: 45 min
   - Model size: 12 MB

   ### Deployment Ready
   - ✅ Inference latency: 35ms (target: <100ms)
   - ✅ Explainability: SHAP values computed
   - ✅ A/B test plan documented
   ```

3. **Syncs living docs** (via `/specweave:sync-docs`):
   - Updates architecture docs with model design
   - Adds ADR for algorithm selection
   - Documents learnings in runbooks

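The criteria in step 1 amount to looking for expected artifacts in the increment folder. A rough sketch of such a check follows. `check_ml_increment` is a hypothetical function, and the file names it looks for (`metrics.json`, `shap_values.pkl`, `models/*.pkl`) are assumptions based on the folder trees in this document rather than SpecWeave's actual validation rules.

```python
import tempfile
from pathlib import Path


def check_ml_increment(increment: Path) -> dict[str, bool]:
    """Return one pass/fail flag per completion criterion (hypothetical rules)."""
    experiments = [p for p in (increment / "experiments").glob("exp-*") if p.is_dir()]
    return {
        # Every experiment folder must contain a metrics file.
        "experiments_logged": bool(experiments)
        and all((e / "metrics.json").exists() for e in experiments),
        # At least one serialized model must exist.
        "best_model_saved": any((increment / "models").glob("*.pkl")),
        # At least one experiment must ship SHAP values.
        "explainability_present": any(
            (e / "shap_values.pkl").exists() for e in experiments
        ),
    }


# Usage: a toy increment with one logged experiment and no saved model.
with tempfile.TemporaryDirectory() as tmp:
    inc = Path(tmp) / "0042-recommendation-model"
    exp_dir = inc / "experiments" / "exp-001-baseline"
    exp_dir.mkdir(parents=True)
    (inc / "models").mkdir()
    (exp_dir / "metrics.json").write_text("{}")
    report = check_ml_increment(inc)
```
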
## When to Use This Skill

Activate this skill when you need to:

- **Build ML features end-to-end** - From idea to deployed model
- **Ensure reproducibility** - Every experiment tracked and documented
- **Follow ML best practices** - Baseline comparison, proper validation, explainability
- **Integrate ML with software engineering** - ML as increments, not isolated notebooks
- **Maintain team knowledge** - Living docs capture why decisions were made

## ML Pipeline Stages

### 1. Data Stage
- Data exploration (EDA)
- Data quality assessment
- Schema validation
- Sample data documentation

### 2. Feature Stage
- Feature engineering
- Feature selection
- Feature importance analysis
- Feature store integration (optional)

### 3. Training Stage
- Baseline model (random, rule-based)
- Candidate models (3+ algorithms)
- Hyperparameter tuning
- Cross-validation

### 4. Evaluation Stage
- Comprehensive metrics (accuracy, precision, recall, F1, AUC)
- Business metrics (latency, throughput)
- Model comparison (vs baseline, vs previous version)
- Error analysis

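Ranking problems such as the recommendation example use top-k variants of precision and recall (the example spec targets Precision@10 and Recall@10). Below is a minimal sketch of both metrics for a single user, in plain Python. The function names and the toy item IDs are illustrative; in practice the per-user values are averaged over all evaluated users.

```python
def precision_at_k(recommended: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k recommended items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


# Toy example: 2 of the top 5 recommendations are relevant, out of 4 relevant items.
recs = ["p1", "p2", "p3", "p4", "p5"]
liked = {"p2", "p5", "p8", "p9"}
p_at_5 = precision_at_k(recs, liked, 5)  # 2 / 5
r_at_5 = recall_at_k(recs, liked, 5)     # 2 / 4
```
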
### 5. Explainability Stage
- Feature importance
- SHAP values
- LIME explanations
- Example predictions with rationale

### 6. Deployment Stage
- Model packaging
- Inference pipeline
- A/B test plan
- Monitoring setup

## Integration with SpecWeave Workflow

### With Experiment Tracking
```bash
# Start ML increment
/specweave:inc "0042-recommendation-model"

# Automatically integrates experiment tracking
# All MLflow/W&B logs saved to increment folder
```

### With Living Docs
```bash
# After training best model
/specweave:sync-docs update

# Automatically:
# - Updates architecture/ml-models.md
# - Adds ADR for algorithm choice
# - Documents hyperparameters in runbooks
```

### With GitHub Sync
```bash
# Create GitHub issue for model retraining
/specweave:github:create-issue "Retrain recommendation model with new data"

# Linked to increment 0042
# Issue tracks model performance over time
```

## Best Practices

### 1. Always Start with Baseline
```python
# Before training complex models, establish a baseline.
# train_baseline_model is a project-specific helper (not shown here).
baseline_results = train_baseline_model(
    strategies=["random", "popularity", "rule-based"]
)
# Requirement: New model must beat best baseline by 20%+
```

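The 20% requirement reads most naturally as a relative improvement over the strongest baseline. One way to encode it is sketched below. `beats_baseline` is a hypothetical helper and assumes a higher-is-better metric; the scores are the precision@10 values quoted in this document's completion summary.

```python
def beats_baseline(candidate: float, baselines: dict[str, float], min_gain: float = 0.20) -> bool:
    """True if candidate exceeds the best baseline by at least min_gain (relative)."""
    best = max(baselines.values())
    if best <= 0:
        # A relative gain is undefined against a zero baseline; require any positive score.
        return candidate > 0
    return (candidate - best) / best >= min_gain


baseline_scores = {"random": 0.08, "popularity": 0.18}
xgboost_ok = beats_baseline(0.26, baseline_scores)   # relative gain of about 0.44
marginal_ok = beats_baseline(0.20, baseline_scores)  # relative gain of about 0.11
```
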
### 2. Use Cross-Validation
```python
from sklearn.model_selection import cross_val_score

# Never trust a single train/test split
cv_scores = cross_val_score(model, X, y, cv=5)
exp.log_metric("cv_mean", cv_scores.mean())
exp.log_metric("cv_std", cv_scores.std())
```

### 3. Track Everything
```python
# Hyperparameters, metrics, artifacts, environment
exp.log_params(model.get_params())
exp.log_metrics({"accuracy": acc, "f1": f1})
exp.log_artifact("model.pkl")
exp.log_artifact("requirements.txt")  # Reproducibility
```

### 4. Document Failures
```python
# Failed experiments are valuable learnings
with track_experiment("exp-006-failed-lstm") as exp:
    # ... training fails ...
    exp.log_note("FAILED: LSTM overfits badly, needs regularization")
    exp.set_status("failed")
# This documents why LSTM wasn't chosen
```

### 5. Model Versioning
```python
import mlflow

# Tie model versions to increments
model_version = f"0042-v{iteration}"
mlflow.register_model(
    f"runs:/{run_id}/model",
    f"recommendation-model-{model_version}"
)
```

## Examples

### Example 1: Classification Pipeline
```
User: "Build a fraud detection model for transactions"

Skill creates increment 0051-fraud-detection with:
- spec.md: Binary classification, 99% precision target
- plan.md: Imbalanced data handling, threshold tuning
- tasks.md: 9 tasks from EDA to deployment
- experiments/: exp-001-baseline, exp-002-xgboost, etc.

Guides through:
1. EDA → identify class imbalance (0.1% fraud)
2. Baseline → random/majority (terrible results)
3. Candidates → XGBoost, LightGBM, Neural Net
4. Threshold tuning → optimize for precision
5. SHAP → explain high-risk predictions
6. Deploy → model + threshold + explainer
```

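Step 4, threshold tuning for precision, can be illustrated without any ML library: scan the candidate thresholds in ascending order and keep the lowest one whose precision meets the target, which retains as much recall as possible. The function name, the scores, and the labels below are made up for illustration.

```python
def pick_threshold(scores: list[float], labels: list[int], target_precision: float):
    """Return the lowest threshold whose precision meets the target, or None."""
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        if tp + fp > 0 and tp / (tp + fp) >= target_precision:
            return threshold
    return None


# Toy fraud scores (label 1 = fraud). Illustrative data only.
fraud_scores = [0.05, 0.10, 0.40, 0.55, 0.70, 0.90, 0.95]
fraud_labels = [0, 0, 0, 1, 0, 1, 1]
chosen = pick_threshold(fraud_scores, fraud_labels, target_precision=0.99)
```

With the 99% target, only thresholds at or above 0.90 exclude every false positive in this toy data, so one of the three fraud cases is missed. That is the precision/recall trade-off the spec has to accept explicitly.
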
### Example 2: Regression Pipeline
```
User: "Predict customer lifetime value"

Skill creates increment 0063-ltv-prediction with:
- spec.md: Regression, RMSE < $50 target
- plan.md: Time-based validation, feature engineering
- tasks.md: Customer cohort analysis, feature importance

Key difference: Regression-specific evaluation (RMSE, MAE, R²)
```

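The three regression metrics named above have simple closed forms. Here is a plain-Python sketch; the function name and the LTV values (in dollars) are made up for illustration.

```python
import math


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """Compute RMSE, MAE and R² for paired observations.

    R² is undefined when y_true is constant, so that case raises ValueError.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined for constant targets")
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Illustrative values only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 380.0]
metrics = regression_metrics(actual, predicted)
```
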
### Example 3: Time Series Forecasting
```
User: "Forecast weekly sales for next 12 weeks"

Skill creates increment 0072-sales-forecasting with:
- spec.md: Time series, MAPE < 10% target
- plan.md: Seasonal decomposition, ARIMA vs Prophet
- tasks.md: Stationarity tests, residual analysis

Key difference: Time series validation (no random split)
```

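"No random split" means every validation fold must come strictly after its training data. The sketch below shows an expanding-window splitter in plain Python. The function name and fold sizes are illustrative; scikit-learn's `TimeSeriesSplit` offers a similar scheme.

```python
def expanding_window_splits(n_samples: int, n_folds: int, horizon: int):
    """Yield (train_indices, test_indices) with test always after train."""
    first_test_start = n_samples - n_folds * horizon
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * horizon
        train = list(range(0, test_start))  # all history before the fold
        test = list(range(test_start, test_start + horizon))
        yield train, test


# 52 weeks of history, 3 folds, 12-week forecast horizon.
folds = list(expanding_window_splits(n_samples=52, n_folds=3, horizon=12))
for train_idx, test_idx in folds:
    print(f"train weeks 0-{train_idx[-1]}, test weeks {test_idx[0]}-{test_idx[-1]}")
```

Each fold trains on all earlier weeks and tests on the next 12, matching the 12-week forecast in the example request.
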
## Framework Support

The skill provides tracking helpers for the following ML frameworks:

### Scikit-Learn
```python
from sklearn.ensemble import RandomForestClassifier
from specweave import track_sklearn_model

model = RandomForestClassifier(n_estimators=100)
with track_sklearn_model(model, increment="0042") as tracked:
    tracked.fit(X_train, y_train)
    tracked.evaluate(X_test, y_test)
```

### PyTorch
```python
import torch
from specweave import track_pytorch_model

# NeuralNet, epochs and train_loader are defined in your own training code
model = NeuralNet()
with track_pytorch_model(model, increment="0042") as tracked:
    for epoch in range(epochs):
        # train_epoch is assumed here to return the epoch's loss
        loss = tracked.train_epoch(train_loader)
        tracked.log_metric(f"loss_epoch_{epoch}", loss)
```

### TensorFlow/Keras
```python
from tensorflow import keras
from specweave import KerasCallback

model = keras.Sequential([...])  # replace [...] with your layers
model.fit(
    X_train, y_train,
    callbacks=[KerasCallback(increment="0042")]
)
```

### XGBoost/LightGBM
```python
import xgboost as xgb
from specweave import track_boosting_model

dtrain = xgb.DMatrix(X_train, label=y_train)
with track_boosting_model("xgboost", increment="0042") as tracked:
    model = xgb.train(params, dtrain, callbacks=[tracked.callback])
```

## Integration Points

### With `experiment-tracker` skill
- Auto-detects MLflow/W&B in project
- Configures tracking URI to increment folder
- Syncs experiment metadata to increment docs

### With `model-evaluator` skill
- Generates comprehensive evaluation reports
- Compares models across experiments
- Highlights best model with confidence intervals

### With `feature-engineer` skill
- Generates feature engineering pipeline
- Documents feature importance
- Creates feature store schemas

### With `ml-engineer` agent
- Delegates complex ML decisions to specialized agent
- Reviews model architecture
- Suggests improvements based on results

## Skill Outputs

After running `/specweave:do` on an ML increment, you get:

```
.specweave/increments/0042-recommendation-model/
├── spec.md ✅
├── plan.md ✅
├── tasks.md ✅ (all completed)
├── COMPLETION-SUMMARY.md ✅
├── experiments/
│   ├── exp-001-baseline/
│   │   ├── metrics.json
│   │   ├── params.json
│   │   └── logs/
│   ├── exp-002-xgboost/ ✅ BEST
│   │   ├── metrics.json
│   │   ├── params.json
│   │   ├── model.pkl
│   │   └── shap_values.pkl
│   └── comparison.md
├── models/
│   ├── model-v3.pkl (best)
│   └── model-v3.metadata.json
├── data/
│   ├── schema.yaml
│   └── sample.parquet
└── notebooks/
    ├── 01-eda.ipynb
    ├── 02-feature-engineering.ipynb
    └── 03-model-analysis.ipynb
```

## Commands

This skill integrates with SpecWeave commands:

```bash
# Create ML increment
/specweave:inc "build recommendation model"
→ Activates ml-pipeline-orchestrator
→ Creates ML-specific increment structure

# Execute ML tasks
/specweave:do
→ Guides through data → train → eval workflow
→ Auto-tracks experiments

# Validate ML increment
/specweave:validate 0042
→ Checks: experiments logged, model saved, metrics documented
→ Validates: model meets success criteria

# Complete ML increment
/specweave:done 0042
→ Generates ML completion summary
→ Syncs model metadata to living docs
```

## Tips

1. **Start simple** - Always begin with baseline, then iterate
2. **Track failures** - Document why approaches didn't work
3. **Version data** - Use DVC or similar for data versioning
4. **Reproducibility** - Log environment (requirements.txt, conda env)
5. **Incremental improvement** - Each increment improves on previous model
6. **Team collaboration** - Living docs make ML decisions visible to all

## Advanced: Multi-Increment ML Projects

For complex ML systems (e.g., recommendation system with multiple models):

```
0042-recommendation-data-pipeline
0043-recommendation-candidate-generation
0044-recommendation-ranking-model
0045-recommendation-reranking
0046-recommendation-ab-test
```

Each increment:
- Has its own spec, plan, tasks
- Builds on previous increments
- Documents model interactions
- Maintains system-level living docs