---
name: ml-pipeline-orchestrator
description: |
  Orchestrates complete machine learning pipelines within SpecWeave increments. Activates when users request "ML pipeline", "train model", "build ML system", "end-to-end ML", "ML workflow", "model training pipeline", or similar. Guides users through data preprocessing, feature engineering, model training, evaluation, and deployment using SpecWeave's spec-driven approach. Integrates with the increment lifecycle for reproducible ML development.
---
# ML Pipeline Orchestrator
## Overview
This skill transforms ML development into a SpecWeave increment-based workflow, ensuring every ML project follows the same disciplined approach: spec → plan → tasks → implement → validate. It orchestrates the complete ML lifecycle from data exploration to model deployment, with full traceability and living documentation.
## Core Philosophy
**SpecWeave + ML = Disciplined Data Science**
Traditional ML development often lacks structure:
- ❌ Jupyter notebooks with no version control
- ❌ Experiments without documentation
- ❌ Models deployed with no reproducibility
- ❌ Team knowledge trapped in individual notebooks
SpecWeave brings discipline:
- ✅ Every ML feature is an increment (with spec, plan, tasks)
- ✅ Experiments tracked and documented automatically
- ✅ Model versions tied to increments
- ✅ Living docs capture learnings and decisions
## How It Works
### Phase 1: ML Increment Planning
When you request "build a recommendation model", the skill:
1. **Creates ML increment structure**:
```
.specweave/increments/0042-recommendation-model/
├── spec.md # ML requirements, success metrics
├── plan.md # Pipeline architecture
├── tasks.md # Implementation tasks
├── tests.md # Evaluation criteria
├── experiments/ # Experiment tracking
│ ├── exp-001-baseline/
│ ├── exp-002-xgboost/
│ └── exp-003-neural-net/
├── data/ # Data samples, schemas
│ ├── schema.yaml
│ └── sample.csv
├── models/ # Trained models
│ ├── model-v1.pkl
│ └── model-v2.pkl
└── notebooks/ # Exploratory notebooks
    ├── 01-eda.ipynb
    └── 02-feature-engineering.ipynb
```
2. **Generates ML-specific spec** (spec.md):
```markdown
## ML Problem Definition
- Problem type: Recommendation (collaborative filtering)
- Input: User behavior history
- Output: Top-N product recommendations
- Success metrics: Precision@10 > 0.25, Recall@10 > 0.15
## Data Requirements
- Training data: 6 months user interactions
- Validation: Last month
- Features: User profile, product attributes, interaction history
## Model Requirements
- Latency: <100ms inference
- Throughput: 1000 req/sec
- Accuracy: Better than random baseline by 3x
- Explainability: Must explain top-3 recommendations
```
3. **Creates ML-specific tasks** (tasks.md):
```markdown
- [ ] T-001: Data exploration and quality analysis
- [ ] T-002: Feature engineering pipeline
- [ ] T-003: Train baseline model (random/popularity)
- [ ] T-004: Train candidate models (3 algorithms)
- [ ] T-005: Hyperparameter tuning (best model)
- [ ] T-006: Model evaluation (all metrics)
- [ ] T-007: Model explainability (SHAP/LIME)
- [ ] T-008: Production deployment preparation
- [ ] T-009: A/B test plan
```
### Phase 2: Pipeline Execution
The skill guides you through each task, applying best practices:
#### Task 1: Data Exploration
```python
# Generated template with SpecWeave integration
import pandas as pd
import mlflow
from specweave import track_experiment
# Auto-logs to .specweave/increments/0042.../experiments/
with track_experiment("exp-001-eda") as exp:
    df = pd.read_csv("data/interactions.csv")

    # Exploratory data analysis
    exp.log_param("dataset_size", len(df))
    exp.log_metric("missing_values", df.isnull().sum().sum())

    # Auto-generates report in increment folder
    exp.save_report("eda-summary.md")
```
#### Task 3: Train Baseline
```python
from sklearn.dummy import DummyClassifier
from specweave import track_model
with track_model("baseline-random", increment="0042") as model:
    clf = DummyClassifier(strategy="uniform")
    clf.fit(X_train, y_train)

    # Automatically logged to increment
    model.log_metrics({
        "accuracy": 0.12,
        "precision@10": 0.08,
    })
    model.save_artifact(clf, "baseline.pkl")
```
#### Task 4: Train Candidate Models
```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from specweave import ModelExperiment, run_experiments

# Parallel experiments with auto-tracking
# (params_xgb, params_lgbm, params_nn, and KerasModel are defined earlier in the increment)
experiments = [
    ModelExperiment("xgboost", XGBClassifier, params_xgb),
    ModelExperiment("lightgbm", LGBMClassifier, params_lgbm),
    ModelExperiment("neural-net", KerasModel, params_nn),
]

results = run_experiments(
    experiments,
    increment="0042",
    save_to="experiments/",
)
# Auto-generates comparison table in increment docs
```
### Phase 3: Increment Completion
When `/specweave:done 0042` runs:
1. **Validates ML-specific criteria** (see the sketch after this list):
- ✅ All experiments logged
- ✅ Best model saved
- ✅ Evaluation metrics documented
- ✅ Model explainability artifacts present
2. **Generates completion summary**:
```markdown
## Recommendation Model - COMPLETE
### Experiments Run: 7
1. exp-001-baseline (random): precision@10=0.08
2. exp-002-popularity: precision@10=0.18
3. exp-003-xgboost: precision@10=0.26 ✅ BEST
4. exp-004-lightgbm: precision@10=0.24
5. exp-005-neural-net: precision@10=0.22
...
### Best Model
- Algorithm: XGBoost
- Version: model-v3.pkl
- Metrics: precision@10=0.26, recall@10=0.16
- Training time: 45 min
- Model size: 12 MB
### Deployment Ready
- ✅ Inference latency: 35ms (target: <100ms)
- ✅ Explainability: SHAP values computed
- ✅ A/B test plan documented
```
3. **Syncs living docs** (via `/specweave:sync-docs`):
- Updates architecture docs with model design
- Adds ADR for algorithm selection
- Documents learnings in runbooks
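The validation step above could look roughly like the following. This is a minimal sketch, assuming the increment layout shown in Phase 1; the actual checks run by `/specweave:validate` and `/specweave:done` may differ, and the helper shown here is hypothetical.
```python
from pathlib import Path

def validate_ml_increment(increment_dir: str) -> list[str]:
    """Hypothetical sketch of the ML-specific checks run at completion time."""
    root = Path(increment_dir)
    problems = []

    # Every experiment folder should carry logged metrics
    experiments = list((root / "experiments").glob("exp-*"))
    if not experiments:
        problems.append("no experiments logged")
    for exp in experiments:
        if not (exp / "metrics.json").exists():
            problems.append(f"{exp.name}: metrics.json missing")

    # At least one trained model artifact must be saved
    if not list((root / "models").glob("*.pkl")):
        problems.append("no model artifact saved under models/")

    # Explainability artifacts (e.g. SHAP values) should be present
    if not list(root.rglob("shap_values.pkl")):
        problems.append("no explainability artifacts found")

    return problems

issues = validate_ml_increment(".specweave/increments/0042-recommendation-model")
print("OK" if not issues else "\n".join(issues))
```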
## When to Use This Skill
Activate this skill when you need to:
- **Build ML features end-to-end** - From idea to deployed model
- **Ensure reproducibility** - Every experiment tracked and documented
- **Follow ML best practices** - Baseline comparison, proper validation, explainability
- **Integrate ML with software engineering** - ML as increments, not isolated notebooks
- **Maintain team knowledge** - Living docs capture why decisions were made
## ML Pipeline Stages
### 1. Data Stage
- Data exploration (EDA)
- Data quality assessment
- Schema validation
- Sample data documentation
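As a rough illustration of the schema-validation step, the sketch below checks a dataframe against the `schema.yaml` stored in the increment's `data/` folder. The schema format (a simple column-to-dtype mapping) is an assumption for this example, not a SpecWeave convention.
```python
import pandas as pd
import yaml

# Assumed schema.yaml format: {columns: {user_id: int64, product_id: int64, rating: float64}}
with open(".specweave/increments/0042-recommendation-model/data/schema.yaml") as f:
    schema = yaml.safe_load(f)["columns"]

df = pd.read_csv("data/interactions.csv")

missing = set(schema) - set(df.columns)
wrong_types = {
    col: str(df[col].dtype)
    for col, expected in schema.items()
    if col in df.columns and str(df[col].dtype) != expected
}

assert not missing, f"Missing columns: {missing}"
assert not wrong_types, f"Unexpected dtypes: {wrong_types}"
```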
### 2. Feature Stage
- Feature engineering
- Feature selection
- Feature importance analysis
- Feature store integration (optional)
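For the feature-importance analysis, one model-agnostic option is scikit-learn's permutation importance; a minimal sketch (the `model`, `X_val`, and `y_val` names are placeholders from the training pipeline):
```python
from sklearn.inspection import permutation_importance

# Shuffle each feature and measure the drop in validation score
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)

ranked = sorted(
    zip(X_val.columns, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for feature, importance in ranked[:10]:
    print(f"{feature}: {importance:.4f}")
```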
### 3. Training Stage
- Baseline model (random, rule-based)
- Candidate models (3+ algorithms)
- Hyperparameter tuning
- Cross-validation
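Hyperparameter tuning and cross-validation are typically combined in one search; a sketch using scikit-learn's `RandomizedSearchCV` (the search space and scoring choice are illustrative only):
```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions={
        "n_estimators": randint(100, 500),
        "max_depth": randint(3, 10),
        "learning_rate": [0.01, 0.05, 0.1],
    },
    n_iter=20,
    cv=5,                          # 5-fold cross-validation per candidate
    scoring="average_precision",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```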
### 4. Evaluation Stage
- Comprehensive metrics (accuracy, precision, recall, F1, AUC)
- Business metrics (latency, throughput)
- Model comparison (vs baseline, vs previous version)
- Error analysis
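A compact sketch of the classification metrics listed above, using scikit-learn (the trained `model` and data splits are assumed to exist):
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_proba),
}
print(metrics)
```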
### 5. Explainability Stage
- Feature importance
- SHAP values
- LIME explanations
- Example predictions with rationale
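For tree-based models, SHAP values can be computed roughly as follows; a minimal sketch assuming an XGBoost/LightGBM-style model:
```python
import shap

# TreeExplainer covers tree ensembles (XGBoost, LightGBM, RandomForest, ...)
X_sample = X_test.sample(1000, random_state=42)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)

# Global feature-importance summary for the explainability report
shap.summary_plot(shap_values, X_sample, show=False)
```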
### 6. Deployment Stage
- Model packaging
- Inference pipeline
- A/B test plan
- Monitoring setup
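Model packaging can be as simple as serializing the model alongside a metadata file that ties it back to the increment; a sketch under that assumption (paths and metadata fields are illustrative):
```python
import json
import joblib
from datetime import datetime, timezone

model_dir = ".specweave/increments/0042-recommendation-model/models"
joblib.dump(model, f"{model_dir}/model-v3.pkl")

metadata = {
    "increment": "0042-recommendation-model",
    "version": "v3",
    "algorithm": "xgboost",
    "metrics": {"precision@10": 0.26, "recall@10": 0.16},
    "trained_at": datetime.now(timezone.utc).isoformat(),
}
with open(f"{model_dir}/model-v3.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```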
## Integration with SpecWeave Workflow
### With Experiment Tracking
```bash
# Start ML increment
/specweave:inc "0042-recommendation-model"
# Automatically integrates experiment tracking
# All MLflow/W&B logs saved to increment folder
```
### With Living Docs
```bash
# After training best model
/specweave:sync-docs update
# Automatically:
# - Updates architecture/ml-models.md
# - Adds ADR for algorithm choice
# - Documents hyperparameters in runbooks
```
### With GitHub Sync
```bash
# Create GitHub issue for model retraining
/specweave:github:create-issue "Retrain recommendation model with new data"
# Linked to increment 0042
# Issue tracks model performance over time
```
## Best Practices
### 1. Always Start with Baseline
```python
# Before training complex models, establish baseline
baseline_results = train_baseline_model(
    strategies=["random", "popularity", "rule-based"]
)
# Requirement: New model must beat best baseline by 20%+
```
### 2. Use Cross-Validation
```python
# Never trust a single train/test split
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, X, y, cv=5)
exp.log_metric("cv_mean", cv_scores.mean())
exp.log_metric("cv_std", cv_scores.std())
```
### 3. Track Everything
```python
# Hyperparameters, metrics, artifacts, environment
exp.log_params(model.get_params())
exp.log_metrics({"accuracy": acc, "f1": f1})
exp.log_artifact("model.pkl")
exp.log_artifact("requirements.txt") # Reproducibility
```
### 4. Document Failures
```python
# Failed experiments are valuable learnings
with track_experiment("exp-006-failed-lstm") as exp:
    # ... training fails ...
    exp.log_note("FAILED: LSTM overfits badly, needs regularization")
    exp.set_status("failed")
    # This documents why LSTM wasn't chosen
```
### 5. Model Versioning
```python
# Tie model versions to increments
model_version = f"0042-v{iteration}"
mlflow.register_model(
f"runs:/{run_id}/model",
f"recommendation-model-{model_version}"
)
```
## Examples
### Example 1: Classification Pipeline
```bash
User: "Build a fraud detection model for transactions"
Skill creates increment 0051-fraud-detection with:
- spec.md: Binary classification, 99% precision target
- plan.md: Imbalanced data handling, threshold tuning
- tasks.md: 9 tasks from EDA to deployment
- experiments/: exp-001-baseline, exp-002-xgboost, etc.
Guides through:
1. EDA → identify class imbalance (0.1% fraud)
2. Baseline → random/majority (terrible results)
3. Candidates → XGBoost, LightGBM, Neural Net
4. Threshold tuning → optimize for precision
5. SHAP → explain high-risk predictions
6. Deploy → model + threshold + explainer
```
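Step 4 of this example (threshold tuning for precision) could look roughly like the sketch below, which picks the lowest decision threshold that still meets the 99% precision target (variable names are placeholders):
```python
from sklearn.metrics import precision_recall_curve

y_scores = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, y_scores)

# precision/recall have one more entry than thresholds; drop the final point
meets_target = precision[:-1] >= 0.99
threshold = thresholds[meets_target].min() if meets_target.any() else None
print(f"Chosen threshold: {threshold}")
```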
### Example 2: Regression Pipeline
```bash
User: "Predict customer lifetime value"
Skill creates increment 0063-ltv-prediction with:
- spec.md: Regression, RMSE < $50 target
- plan.md: Time-based validation, feature engineering
- tasks.md: Customer cohort analysis, feature importance
Key difference: Regression-specific evaluation (RMSE, MAE, R²)
```
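The regression-specific evaluation mentioned above might be computed like this (a minimal sketch):
```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE=${rmse:.2f}, MAE=${mae:.2f}, R²={r2:.3f}")
```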
### Example 3: Time Series Forecasting
```bash
User: "Forecast weekly sales for next 12 weeks"
Skill creates increment 0072-sales-forecasting with:
- spec.md: Time series, MAPE < 10% target
- plan.md: Seasonal decomposition, ARIMA vs Prophet
- tasks.md: Stationarity tests, residual analysis
Key difference: Time series validation (no random split)
```
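The key difference (no random split) usually means walk-forward validation; a sketch with scikit-learn's `TimeSeriesSplit`:
```python
from sklearn.model_selection import TimeSeriesSplit

# Rows must be ordered by time before splitting
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    score = model.score(X.iloc[test_idx], y.iloc[test_idx])
    print(f"fold {fold}: {score:.3f}")
```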
## Framework Support
This skill works with all major ML frameworks:
### Scikit-Learn
```python
from sklearn.ensemble import RandomForestClassifier
from specweave import track_sklearn_model
model = RandomForestClassifier(n_estimators=100)
with track_sklearn_model(model, increment="0042") as tracked:
    tracked.fit(X_train, y_train)
    tracked.evaluate(X_test, y_test)
```
### PyTorch
```python
import torch
from specweave import track_pytorch_model
model = NeuralNet()
with track_pytorch_model(model, increment="0042") as tracked:
    for epoch in range(epochs):
        loss = tracked.train_epoch(train_loader)
        tracked.log_metric(f"loss_epoch_{epoch}", loss)
```
### TensorFlow/Keras
```python
from tensorflow import keras
from specweave import KerasCallback
model = keras.Sequential([...])
model.fit(
    X_train, y_train,
    callbacks=[KerasCallback(increment="0042")]
)
```
### XGBoost/LightGBM
```python
import xgboost as xgb
from specweave import track_boosting_model
dtrain = xgb.DMatrix(X_train, label=y_train)
with track_boosting_model("xgboost", increment="0042") as tracked:
    model = xgb.train(params, dtrain, callbacks=[tracked.callback])
```
## Integration Points
### With `experiment-tracker` skill
- Auto-detects MLflow/W&B in project
- Configures tracking URI to increment folder
- Syncs experiment metadata to increment docs
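With plain MLflow, pointing the tracking store at the increment folder amounts to something like the sketch below; the exact path convention used by the `experiment-tracker` skill is an assumption here.
```python
import mlflow

# Keep all runs inside the increment so they travel with the spec
mlflow.set_tracking_uri(
    "file:.specweave/increments/0042-recommendation-model/experiments/mlruns"
)
mlflow.set_experiment("0042-recommendation-model")

with mlflow.start_run(run_name="exp-003-xgboost"):
    mlflow.log_params({"n_estimators": 300, "max_depth": 6})
    mlflow.log_metric("precision_at_10", 0.26)
```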
### With `model-evaluator` skill
- Generates comprehensive evaluation reports
- Compares models across experiments
- Highlights best model with confidence intervals
### With `feature-engineer` skill
- Generates feature engineering pipeline
- Documents feature importance
- Creates feature store schemas
### With `ml-engineer` agent
- Delegates complex ML decisions to specialized agent
- Reviews model architecture
- Suggests improvements based on results
## Skill Outputs
After running `/specweave:do` on an ML increment, you get:
```
.specweave/increments/0042-recommendation-model/
├── spec.md ✅
├── plan.md ✅
├── tasks.md ✅ (all completed)
├── COMPLETION-SUMMARY.md ✅
├── experiments/
│ ├── exp-001-baseline/
│ │ ├── metrics.json
│ │ ├── params.json
│ │ └── logs/
│ ├── exp-002-xgboost/ ✅ BEST
│ │ ├── metrics.json
│ │ ├── params.json
│ │ ├── model.pkl
│ │ └── shap_values.pkl
│ └── comparison.md
├── models/
│ ├── model-v3.pkl (best)
│ └── model-v3.metadata.json
├── data/
│ ├── schema.yaml
│ └── sample.parquet
└── notebooks/
    ├── 01-eda.ipynb
    ├── 02-feature-engineering.ipynb
    └── 03-model-analysis.ipynb
```
## Commands
This skill integrates with SpecWeave commands:
```bash
# Create ML increment
/specweave:inc "build recommendation model"
→ Activates ml-pipeline-orchestrator
→ Creates ML-specific increment structure
# Execute ML tasks
/specweave:do
→ Guides through data → train → eval workflow
→ Auto-tracks experiments
# Validate ML increment
/specweave:validate 0042
→ Checks: experiments logged, model saved, metrics documented
→ Validates: model meets success criteria
# Complete ML increment
/specweave:done 0042
→ Generates ML completion summary
→ Syncs model metadata to living docs
```
## Tips
1. **Start simple** - Always begin with a baseline, then iterate
2. **Track failures** - Document why approaches didn't work
3. **Version data** - Use DVC or similar for data versioning
4. **Reproducibility** - Log environment (requirements.txt, conda env)
5. **Incremental improvement** - Each increment improves on the previous model
6. **Team collaboration** - Living docs make ML decisions visible to all
## Advanced: Multi-Increment ML Projects
For complex ML systems (e.g., recommendation system with multiple models):
```
0042-recommendation-data-pipeline
0043-recommendation-candidate-generation
0044-recommendation-ranking-model
0045-recommendation-reranking
0046-recommendation-ab-test
```
Each increment:
- Has its own spec, plan, tasks
- Builds on previous increments
- Documents model interactions
- Maintains system-level living docs