---
name: ml-pipeline-orchestrator
description: |
  Orchestrates complete machine learning pipelines within SpecWeave increments.
  Activates when users request "ML pipeline", "train model", "build ML system",
  "end-to-end ML", "ML workflow", "model training pipeline", or similar.
  Guides users through data preprocessing, feature engineering, model training,
  evaluation, and deployment using SpecWeave's spec-driven approach.
  Integrates with the increment lifecycle for reproducible ML development.
---

# ML Pipeline Orchestrator

## Overview

This skill transforms ML development into a SpecWeave increment-based workflow, ensuring every ML project follows the same disciplined approach: spec → plan → tasks → implement → validate. It orchestrates the complete ML lifecycle from data exploration to model deployment, with full traceability and living documentation.

## Core Philosophy

**SpecWeave + ML = Disciplined Data Science**

Traditional ML development often lacks structure:

- ❌ Jupyter notebooks with no version control
- ❌ Experiments without documentation
- ❌ Models deployed with no reproducibility
- ❌ Team knowledge trapped in individual notebooks

SpecWeave brings discipline:

- ✅ Every ML feature is an increment (with spec, plan, tasks)
- ✅ Experiments tracked and documented automatically
- ✅ Model versions tied to increments
- ✅ Living docs capture learnings and decisions

## How It Works

### Phase 1: ML Increment Planning

When you request "build a recommendation model", the skill:

1. **Creates the ML increment structure**:

   ```
   .specweave/increments/0042-recommendation-model/
   ├── spec.md          # ML requirements, success metrics
   ├── plan.md          # Pipeline architecture
   ├── tasks.md         # Implementation tasks
   ├── tests.md         # Evaluation criteria
   ├── experiments/     # Experiment tracking
   │   ├── exp-001-baseline/
   │   ├── exp-002-xgboost/
   │   └── exp-003-neural-net/
   ├── data/            # Data samples, schemas
   │   ├── schema.yaml
   │   └── sample.csv
   ├── models/          # Trained models
   │   ├── model-v1.pkl
   │   └── model-v2.pkl
   └── notebooks/       # Exploratory notebooks
       ├── 01-eda.ipynb
       └── 02-feature-engineering.ipynb
   ```

2. **Generates an ML-specific spec** (spec.md):

   ```markdown
   ## ML Problem Definition
   - Problem type: Recommendation (collaborative filtering)
   - Input: User behavior history
   - Output: Top-N product recommendations
   - Success metrics: Precision@10 > 0.25, Recall@10 > 0.15

   ## Data Requirements
   - Training data: 6 months of user interactions
   - Validation: Last month
   - Features: User profile, product attributes, interaction history

   ## Model Requirements
   - Latency: <100ms inference
   - Throughput: 1000 req/sec
   - Accuracy: Better than random baseline by 3x
   - Explainability: Must explain top-3 recommendations
   ```

3. **Creates ML-specific tasks** (tasks.md):

   ```markdown
   - [ ] T-001: Data exploration and quality analysis
   - [ ] T-002: Feature engineering pipeline
   - [ ] T-003: Train baseline model (random/popularity)
   - [ ] T-004: Train candidate models (3 algorithms)
   - [ ] T-005: Hyperparameter tuning (best model)
   - [ ] T-006: Model evaluation (all metrics)
   - [ ] T-007: Model explainability (SHAP/LIME)
   - [ ] T-008: Production deployment preparation
   - [ ] T-009: A/B test plan
   ```
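The success metrics in the generated spec (Precision@10 > 0.25, Recall@10 > 0.15) recur throughout the increment, from task T-006 to the completion summary. A minimal sketch of how they can be computed per user and then averaged; the helper below is illustrative and not part of the `specweave` package:

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k and Recall@k for a single user.

    recommended: ranked list of item IDs produced by the model
    relevant:    set of item IDs the user actually interacted with
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Averaging these over all users gives the increment-level metrics
# that are later checked against the spec during validation.
```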
### Phase 2: Pipeline Execution

The skill guides you through each task with best practices:

#### Task 1: Data Exploration

```python
# Generated template with SpecWeave integration
import pandas as pd
import mlflow
from specweave import track_experiment

# Auto-logs to .specweave/increments/0042.../experiments/
with track_experiment("exp-001-eda") as exp:
    df = pd.read_csv("data/interactions.csv")

    # EDA
    exp.log_param("dataset_size", len(df))
    exp.log_metric("missing_values", df.isnull().sum().sum())

    # Auto-generates report in increment folder
    exp.save_report("eda-summary.md")
```

#### Task 3: Train Baseline

```python
from sklearn.dummy import DummyClassifier
from specweave import track_model

with track_model("baseline-random", increment="0042") as model:
    clf = DummyClassifier(strategy="uniform")
    clf.fit(X_train, y_train)

    # Automatically logged to increment
    model.log_metrics({
        "accuracy": 0.12,
        "precision@10": 0.08
    })
    model.save_artifact(clf, "baseline.pkl")
```

#### Task 4: Train Candidate Models

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from specweave import ModelExperiment, run_experiments

# KerasModel is a user-defined wrapper around a Keras network;
# params_xgb, params_lgbm, params_nn are hyperparameter dicts defined earlier.
# Parallel experiments with auto-tracking
experiments = [
    ModelExperiment("xgboost", XGBClassifier, params_xgb),
    ModelExperiment("lightgbm", LGBMClassifier, params_lgbm),
    ModelExperiment("neural-net", KerasModel, params_nn)
]

results = run_experiments(
    experiments,
    increment="0042",
    save_to="experiments/"
)
# Auto-generates comparison table in increment docs
```

### Phase 3: Increment Completion

When `/specweave:done 0042` runs:

1. **Validates ML-specific criteria**:
   - ✅ All experiments logged
   - ✅ Best model saved
   - ✅ Evaluation metrics documented
   - ✅ Model explainability artifacts present

2. **Generates a completion summary**:

   ```markdown
   ## Recommendation Model - COMPLETE

   ### Experiments Run: 7
   1. exp-001-baseline (random): precision@10=0.08
   2. exp-002-popularity: precision@10=0.18
   3. exp-003-xgboost: precision@10=0.26 ✅ BEST
   4. exp-004-lightgbm: precision@10=0.24
   5. exp-005-neural-net: precision@10=0.22
   ...

   ### Best Model
   - Algorithm: XGBoost
   - Version: model-v3.pkl
   - Metrics: precision@10=0.26, recall@10=0.16
   - Training time: 45 min
   - Model size: 12 MB

   ### Deployment Ready
   - ✅ Inference latency: 35ms (target: <100ms)
   - ✅ Explainability: SHAP values computed
   - ✅ A/B test plan documented
   ```

3. **Syncs living docs** (via `/specweave:sync-docs`):
   - Updates architecture docs with model design
   - Adds ADR for algorithm selection
   - Documents learnings in runbooks

## When to Use This Skill

Activate this skill when you need to:

- **Build ML features end-to-end** - From idea to deployed model
- **Ensure reproducibility** - Every experiment tracked and documented
- **Follow ML best practices** - Baseline comparison, proper validation, explainability
- **Integrate ML with software engineering** - ML as increments, not isolated notebooks
- **Maintain team knowledge** - Living docs capture why decisions were made

## ML Pipeline Stages

### 1. Data Stage

- Data exploration (EDA)
- Data quality assessment
- Schema validation (see the sketch below)
- Sample data documentation
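Schema validation can be as simple as checking the increment's `data/schema.yaml` against the loaded DataFrame. A minimal sketch, assuming the schema file maps column names to expected dtypes (that layout is an assumption, not something the skill mandates):

```python
import pandas as pd
import yaml

def validate_schema(df: pd.DataFrame, schema_path: str) -> list:
    """Return a list of schema problems; an empty list means the data matches."""
    with open(schema_path) as f:
        schema = yaml.safe_load(f)  # assumed layout: {"columns": {"user_id": "int64", ...}}

    problems = []
    for col, expected_dtype in schema["columns"].items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != expected_dtype:
            problems.append(f"{col}: expected {expected_dtype}, got {df[col].dtype}")
    return problems

# Typical use inside Task 1: fail fast before feature engineering starts.
# issues = validate_schema(df, ".specweave/increments/0042-recommendation-model/data/schema.yaml")
```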
### 2. Feature Stage

- Feature engineering
- Feature selection
- Feature importance analysis
- Feature store integration (optional)

### 3. Training Stage

- Baseline model (random, rule-based)
- Candidate models (3+ algorithms)
- Hyperparameter tuning
- Cross-validation

### 4. Evaluation Stage

- Comprehensive metrics (accuracy, precision, recall, F1, AUC)
- Business metrics (latency, throughput)
- Model comparison (vs baseline, vs previous version)
- Error analysis

### 5. Explainability Stage

- Feature importance
- SHAP values
- LIME explanations
- Example predictions with rationale

### 6. Deployment Stage

- Model packaging
- Inference pipeline
- A/B test plan
- Monitoring setup

## Integration with SpecWeave Workflow

### With Experiment Tracking

```bash
# Start ML increment
/specweave:inc "0042-recommendation-model"

# Automatically integrates experiment tracking
# All MLflow/W&B logs saved to increment folder
```

### With Living Docs

```bash
# After training best model
/specweave:sync-docs update

# Automatically:
# - Updates architecture/ml-models.md
# - Adds ADR for algorithm choice
# - Documents hyperparameters in runbooks
```

### With GitHub Sync

```bash
# Create GitHub issue for model retraining
/specweave:github:create-issue "Retrain recommendation model with new data"

# Linked to increment 0042
# Issue tracks model performance over time
```

## Best Practices

### 1. Always Start with a Baseline

```python
# Before training complex models, establish a baseline
baseline_results = train_baseline_model(
    strategies=["random", "popularity", "rule-based"]
)

# Requirement: the new model must beat the best baseline by 20%+
```

### 2. Use Cross-Validation

```python
from sklearn.model_selection import cross_val_score

# Never trust a single train/test split
cv_scores = cross_val_score(model, X, y, cv=5)
exp.log_metric("cv_mean", cv_scores.mean())
exp.log_metric("cv_std", cv_scores.std())
```

### 3. Track Everything

```python
# Hyperparameters, metrics, artifacts, environment
exp.log_params(model.get_params())
exp.log_metrics({"accuracy": acc, "f1": f1})
exp.log_artifact("model.pkl")
exp.log_artifact("requirements.txt")  # Reproducibility
```

### 4. Document Failures

```python
# Failed experiments are valuable learnings
with track_experiment("exp-006-failed-lstm") as exp:
    # ... training fails ...
    exp.log_note("FAILED: LSTM overfits badly, needs regularization")
    exp.set_status("failed")

# This documents why LSTM wasn't chosen
```

### 5. Model Versioning

```python
import mlflow

# Tie model versions to increments
model_version = f"0042-v{iteration}"
mlflow.register_model(
    f"runs:/{run_id}/model",
    f"recommendation-model-{model_version}"
)
```

## Examples

### Example 1: Classification Pipeline

```bash
User: "Build a fraud detection model for transactions"

Skill creates increment 0051-fraud-detection with:
- spec.md: Binary classification, 99% precision target
- plan.md: Imbalanced data handling, threshold tuning
- tasks.md: 9 tasks from EDA to deployment
- experiments/: exp-001-baseline, exp-002-xgboost, etc.

Guides through:
1. EDA → identify class imbalance (0.1% fraud)
2. Baseline → random/majority (terrible results)
3. Candidates → XGBoost, LightGBM, Neural Net
4. Threshold tuning → optimize for precision
5. SHAP → explain high-risk predictions
6. Deploy → model + threshold + explainer
```
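For step 4 in Example 1 (threshold tuning), the usual approach is to sweep the decision threshold on a validation set and keep the lowest threshold that still meets the precision target from spec.md. A minimal sketch using scikit-learn; `model`, `X_val`, and `y_val` are assumed to come from the earlier training tasks:

```python
from sklearn.metrics import precision_recall_curve

# Scores for the positive (fraud) class from a fitted sklearn-style classifier
y_scores = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, y_scores)

target_precision = 0.99                                # from spec.md
meets_target = precision[:-1] >= target_precision      # precision has one more entry than thresholds
chosen_threshold = thresholds[meets_target][0] if meets_target.any() else None

# chosen_threshold (and the recall it implies) is logged with the experiment and
# shipped alongside the model, matching "Deploy → model + threshold + explainer".
```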
### Example 2: Regression Pipeline

```bash
User: "Predict customer lifetime value"

Skill creates increment 0063-ltv-prediction with:
- spec.md: Regression, RMSE < $50 target
- plan.md: Time-based validation, feature engineering
- tasks.md: Customer cohort analysis, feature importance

Key difference: Regression-specific evaluation (RMSE, MAE, R²)
```

### Example 3: Time Series Forecasting

```bash
User: "Forecast weekly sales for next 12 weeks"

Skill creates increment 0072-sales-forecasting with:
- spec.md: Time series, MAPE < 10% target
- plan.md: Seasonal decomposition, ARIMA vs Prophet
- tasks.md: Stationarity tests, residual analysis

Key difference: Time series validation (no random split)
```
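The "no random split" rule in Example 3 means validation folds must respect time order: each test window comes strictly after its training window. A minimal sketch with scikit-learn's `TimeSeriesSplit`, assuming `X` and `y` are arrays sorted by week and `model` follows the usual fit/predict interface:

```python
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_percentage_error

tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Training indices always precede the test window in time
    model.fit(X[train_idx], y[train_idx])
    mape = mean_absolute_percentage_error(y[test_idx], model.predict(X[test_idx]))
    print(f"fold {fold}: MAPE={mape:.2%}")  # compared against the <10% target from spec.md
```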
## Framework Support

This skill works with all major ML frameworks:

### Scikit-Learn

```python
from sklearn.ensemble import RandomForestClassifier
from specweave import track_sklearn_model

model = RandomForestClassifier(n_estimators=100)
with track_sklearn_model(model, increment="0042") as tracked:
    tracked.fit(X_train, y_train)
    tracked.evaluate(X_test, y_test)
```

### PyTorch

```python
import torch
from specweave import track_pytorch_model

model = NeuralNet()  # user-defined torch.nn.Module
with track_pytorch_model(model, increment="0042") as tracked:
    for epoch in range(epochs):
        tracked.train_epoch(train_loader)
        tracked.log_metric(f"loss_epoch_{epoch}", loss)
```

### TensorFlow/Keras

```python
from tensorflow import keras
from specweave import KerasCallback

model = keras.Sequential([...])
model.fit(
    X_train, y_train,
    callbacks=[KerasCallback(increment="0042")]
)
```

### XGBoost/LightGBM

```python
import xgboost as xgb
from specweave import track_boosting_model

dtrain = xgb.DMatrix(X_train, label=y_train)
with track_boosting_model("xgboost", increment="0042") as tracked:
    model = xgb.train(params, dtrain, callbacks=[tracked.callback])
```

## Integration Points

### With `experiment-tracker` skill
- Auto-detects MLflow/W&B in the project
- Configures the tracking URI to the increment folder
- Syncs experiment metadata to increment docs

### With `model-evaluator` skill
- Generates comprehensive evaluation reports
- Compares models across experiments
- Highlights the best model with confidence intervals

### With `feature-engineer` skill
- Generates the feature engineering pipeline
- Documents feature importance
- Creates feature store schemas

### With `ml-engineer` agent
- Delegates complex ML decisions to a specialized agent
- Reviews model architecture
- Suggests improvements based on results

## Skill Outputs

After running `/specweave:do` on an ML increment, you get:

```
.specweave/increments/0042-recommendation-model/
├── spec.md                   ✅
├── plan.md                   ✅
├── tasks.md                  ✅ (all completed)
├── COMPLETION-SUMMARY.md     ✅
├── experiments/
│   ├── exp-001-baseline/
│   │   ├── metrics.json
│   │   ├── params.json
│   │   └── logs/
│   ├── exp-002-xgboost/      ✅ BEST
│   │   ├── metrics.json
│   │   ├── params.json
│   │   ├── model.pkl
│   │   └── shap_values.pkl
│   └── comparison.md
├── models/
│   ├── model-v3.pkl          (best)
│   └── model-v3.metadata.json
├── data/
│   ├── schema.yaml
│   └── sample.parquet
└── notebooks/
    ├── 01-eda.ipynb
    ├── 02-feature-engineering.ipynb
    └── 03-model-analysis.ipynb
```

## Commands

This skill integrates with SpecWeave commands:

```bash
# Create ML increment
/specweave:inc "build recommendation model"
→ Activates ml-pipeline-orchestrator
→ Creates ML-specific increment structure

# Execute ML tasks
/specweave:do
→ Guides through data → train → eval workflow
→ Auto-tracks experiments

# Validate ML increment
/specweave:validate 0042
→ Checks: experiments logged, model saved, metrics documented
→ Validates: model meets success criteria

# Complete ML increment
/specweave:done 0042
→ Generates ML completion summary
→ Syncs model metadata to living docs
```

## Tips

1. **Start simple** - Always begin with a baseline, then iterate
2. **Track failures** - Document why approaches didn't work
3. **Version data** - Use DVC or similar for data versioning
4. **Reproducibility** - Log the environment (requirements.txt, conda env)
5. **Incremental improvement** - Each increment improves on the previous model
6. **Team collaboration** - Living docs make ML decisions visible to all

## Advanced: Multi-Increment ML Projects

For complex ML systems (e.g., a recommendation system with multiple models):

```
0042-recommendation-data-pipeline
0043-recommendation-candidate-generation
0044-recommendation-ranking-model
0045-recommendation-reranking
0046-recommendation-ab-test
```

Each increment:

- Has its own spec, plan, tasks
- Builds on previous increments (see the sketch below)
- Documents model interactions
- Maintains system-level living docs
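As an illustration of "builds on previous increments": a later increment can load an artifact produced by an earlier one directly from that increment's folder. A minimal sketch; the exact file name (`model-v1.pkl`) is an assumption about what increment 0043 saved:

```python
import pickle
from pathlib import Path

# Increment 0044 (ranking) consumes the candidate-generation model from increment 0043
candidate_model_path = Path(
    ".specweave/increments/0043-recommendation-candidate-generation/models/model-v1.pkl"
)

with candidate_model_path.open("rb") as f:
    candidate_model = pickle.load(f)  # assumes the earlier increment pickled its best model

# candidate_model now feeds the ranking model's training pipeline,
# and the cross-increment dependency is documented in increment 0044's plan.md.
```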