---
name: mlops-dag-builder
description: Design DAG-based MLOps pipeline architectures with Airflow, Dagster, Kubeflow, or Prefect. Activates for DAG orchestration, workflow automation, pipeline design patterns, CI/CD for ML. Use for platform-agnostic MLOps infrastructure - NOT for SpecWeave increment-based ML (use ml-pipeline-orchestrator instead).
---

# MLOps DAG Builder

Design and implement DAG-based ML pipeline architectures using production orchestration tools.

## Overview

This skill provides guidance for building **platform-agnostic MLOps pipelines** using DAG orchestrators (Airflow, Dagster, Kubeflow, Prefect). It focuses on workflow architecture, not SpecWeave integration.

**When to use this skill vs. ml-pipeline-orchestrator:**
- **Use this skill**: General MLOps architecture, Airflow/Dagster DAGs, cloud ML platforms
- **Use ml-pipeline-orchestrator**: SpecWeave increment-based ML development with experiment tracking

## When to Use This Skill

- Designing DAG-based workflow orchestration (Airflow, Dagster, Kubeflow)
- Implementing platform-agnostic ML pipeline patterns
- Setting up CI/CD automation for ML training jobs
- Creating reusable pipeline templates for teams
- Integrating with cloud ML services (SageMaker, Vertex AI, Azure ML)

## What This Skill Provides

### Core Capabilities

1. **Pipeline Architecture**
   - End-to-end workflow design
   - DAG orchestration patterns (Airflow, Dagster, Kubeflow)
   - Component dependencies and data flow
   - Error handling and retry strategies (see the sketch after this list)

2. **Data Preparation**
   - Data validation and quality checks
   - Feature engineering pipelines
   - Data versioning and lineage
   - Train/validation/test splitting strategies

3. **Model Training**
   - Training job orchestration
   - Hyperparameter management
   - Experiment tracking integration
   - Distributed training patterns

4. **Model Validation**
   - Validation frameworks and metrics
   - A/B testing infrastructure
   - Performance regression detection
   - Model comparison workflows

5. **Deployment Automation**
   - Model serving patterns
   - Canary deployments
   - Blue-green deployment strategies
   - Rollback mechanisms
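
As a concrete illustration of the error-handling and retry strategies listed under Pipeline Architecture, here is a minimal Airflow sketch with per-task retries and a failure callback. The `notify_on_failure` hook and the `train_model` callable are hypothetical placeholders, and exact DAG arguments vary slightly across Airflow 2.x versions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Hypothetical alerting hook: forward the failed task id to Slack/pager.
    print(f"Task failed: {context['task_instance'].task_id}")


def train_model():
    # Placeholder for the real training entry point.
    pass


default_args = {
    "retries": 3,                         # retry transient failures
    "retry_delay": timedelta(minutes=5),  # back off between attempts
    "on_failure_callback": notify_on_failure,
}

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    train = PythonOperator(task_id="model_training", python_callable=train_model)
```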

### Reference Documentation

See the `references/` directory for detailed guides:
- **data-preparation.md** - Data cleaning, validation, and feature engineering
- **model-training.md** - Training workflows and best practices
- **model-validation.md** - Validation strategies and metrics
- **model-deployment.md** - Deployment patterns and serving architectures

### Assets and Templates

The `assets/` directory contains:
- **pipeline-dag.yaml.template** - DAG template for workflow orchestration
- **training-config.yaml** - Training configuration template
- **validation-checklist.md** - Pre-deployment validation checklist

## Usage Patterns

### Basic Pipeline Setup

```python
# 1. Define pipeline stages
stages = [
    "data_ingestion",
    "data_validation",
    "feature_engineering",
    "model_training",
    "model_validation",
    "model_deployment",
]

# 2. Configure dependencies
# See assets/pipeline-dag.yaml.template for a full example
```
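
Below is a minimal sketch of wiring these stages into a linear Airflow DAG. The `run_stage` dispatcher is a hypothetical placeholder, and the real wiring in `assets/pipeline-dag.yaml.template` may differ.

```python
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.python import PythonOperator

stages = [
    "data_ingestion", "data_validation", "feature_engineering",
    "model_training", "model_validation", "model_deployment",
]


def run_stage(stage_name: str):
    # Hypothetical dispatcher: look up and execute the callable for a stage.
    print(f"Running stage: {stage_name}")


with DAG(
    dag_id="basic_ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually while developing
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(task_id=stage, python_callable=run_stage, op_args=[stage])
        for stage in stages
    ]
    chain(*tasks)  # linear ordering: each stage depends on the previous one
```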

### Production Workflow

1. **Data Preparation Phase**
   - Ingest raw data from sources
   - Run data quality checks
   - Apply feature transformations
   - Version processed datasets

2. **Training Phase**
   - Load versioned training data
   - Execute training jobs
   - Track experiments and metrics
   - Save trained models

3. **Validation Phase**
   - Run validation test suite
   - Compare against baseline (see the gate sketch after this list)
   - Generate performance reports
   - Approve for deployment

4. **Deployment Phase**
   - Package model artifacts
   - Deploy to serving infrastructure
   - Configure monitoring
   - Validate production traffic
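
One way to implement the baseline comparison in the Validation Phase is a gate that blocks deployment when the candidate regresses. A minimal sketch; the metric names and tolerance are illustrative assumptions.

```python
def validation_gate(candidate: dict, baseline: dict, tolerance: float = 0.01) -> bool:
    """Approve the candidate only if no metric regresses beyond `tolerance`.

    Both dicts map metric names (e.g. "auc") to floats, higher is better.
    """
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None:
            raise ValueError(f"Candidate is missing metric: {name}")
        if cand_value < base_value - tolerance:
            print(f"Regression on {name}: {cand_value:.4f} < {base_value:.4f}")
            return False
    return True


# Example: a clear AUC regression blocks deployment.
assert validation_gate({"auc": 0.91}, {"auc": 0.93}) is False
```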

## Best Practices

### Pipeline Design

- **Modularity**: Each stage should be independently testable
- **Idempotency**: Re-running stages should be safe (see the sketch below)
- **Observability**: Log metrics at every stage
- **Versioning**: Track data, code, and model versions
- **Failure Handling**: Implement retry logic and alerting
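
A minimal sketch of the idempotency practice above: key each stage's output by a hash of its inputs and parameters, so re-running with identical inputs is a safe no-op. The path layout and `compute_features` body are illustrative placeholders.

```python
import hashlib
import json
from pathlib import Path


def compute_features(raw_path: Path, out_path: Path) -> None:
    # Placeholder for the real feature transformation.
    out_path.write_text(raw_path.read_text().upper())


def run_feature_stage(raw_path: Path, params: dict, out_root: Path) -> Path:
    # Deterministic output key: identical inputs + params -> identical path.
    key = hashlib.sha256(
        raw_path.read_bytes() + json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:12]
    out_path = out_root / f"features-{key}.txt"
    if out_path.exists():
        return out_path  # already computed: re-running is safe
    out_root.mkdir(parents=True, exist_ok=True)
    compute_features(raw_path, out_path)
    return out_path
```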

### Data Management

- Use data validation libraries (Great Expectations, TFX Data Validation) - a minimal hand-rolled check is sketched below
- Version datasets with DVC or similar tools
- Document feature engineering transformations
- Maintain data lineage tracking
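
Dedicated libraries are the right long-term choice, but the core idea behind a data quality check is simple enough to sketch by hand. A minimal schema-and-nulls check in plain Python; the example schema is hypothetical.

```python
def validate_rows(rows: list[dict], schema: dict[str, type]) -> list[str]:
    """Return human-readable violations (an empty list means clean data)."""
    errors = []
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            value = row.get(column)
            if value is None:
                errors.append(f"row {i}: missing value for '{column}'")
            elif not isinstance(value, expected):
                errors.append(
                    f"row {i}: '{column}' is {type(value).__name__}, "
                    f"expected {expected.__name__}"
                )
    return errors


# Hypothetical schema for a transactions dataset.
schema = {"user_id": int, "amount": float}
print(validate_rows([{"user_id": 1, "amount": "3.5"}], schema))
# ["row 0: 'amount' is str, expected float"]
```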

### Model Operations

- Separate training and serving infrastructure
- Use model registries (MLflow, Weights & Biases) - see the sketch below
- Implement gradual rollouts for new models
- Monitor model performance drift
- Maintain rollback capabilities
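
A minimal MLflow sketch of the registry practice above: log a run, then register the resulting model under a stable name. The experiment and model names are placeholders, and `register_model` assumes a registry-capable tracking backend (behavior varies by MLflow version).

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

mlflow.set_experiment("churn-model")  # placeholder experiment name

with mlflow.start_run() as run:
    mlflow.log_metric("auc", 0.93)  # placeholder metric
    mlflow.sklearn.log_model(model, artifact_path="model")

# Register the logged artifact so deployments reference a versioned name.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-model")
```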

### Deployment Strategies

- Start with shadow deployments
- Use canary releases for validation
- Implement A/B testing infrastructure
- Set up automated rollback triggers (see the sketch below)
- Monitor latency and throughput
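
As an illustration of an automated rollback trigger, here is a sketch that compares the canary's live error rate against the stable version and rolls back past a threshold. `get_error_rate` and `route_all_traffic_to` are hypothetical hooks into your serving layer.

```python
def check_canary(get_error_rate, route_all_traffic_to, max_ratio: float = 1.5) -> str:
    """Roll back if the canary's error rate exceeds stable by `max_ratio`."""
    stable = get_error_rate("stable")
    canary = get_error_rate("canary")
    if canary > stable * max_ratio:
        route_all_traffic_to("stable")  # automated rollback
        return "rolled_back"
    return "healthy"


# Example with stubbed hooks: 2% canary errors vs. 1% stable triggers rollback.
rates = {"stable": 0.01, "canary": 0.02}
print(check_canary(rates.get, lambda v: print(f"routing all traffic to {v}")))
# routing all traffic to stable
# rolled_back
```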

## Integration Points

### Orchestration Tools

- **Apache Airflow**: DAG-based workflow orchestration
- **Dagster**: Asset-based pipeline orchestration
- **Kubeflow Pipelines**: Kubernetes-native ML workflows
- **Prefect**: Modern dataflow automation

### Experiment Tracking

- MLflow for experiment tracking and model registry
- Weights & Biases for visualization and collaboration
- TensorBoard for training metrics

### Deployment Platforms

- AWS SageMaker for managed ML infrastructure
- Google Vertex AI for GCP deployments
- Azure ML for Azure cloud deployments
- Kubernetes + KServe for cloud-agnostic serving

## Progressive Disclosure

Start with the basics and gradually add complexity:

1. **Level 1**: Simple linear pipeline (data → train → deploy)
2. **Level 2**: Add validation and monitoring stages
3. **Level 3**: Implement hyperparameter tuning
4. **Level 4**: Add A/B testing and gradual rollouts
5. **Level 5**: Multi-model pipelines with ensemble strategies

## Common Patterns

### Batch Training Pipeline

```yaml
# See assets/pipeline-dag.yaml.template
stages:
  - name: data_preparation
    dependencies: []
  - name: model_training
    dependencies: [data_preparation]
  - name: model_evaluation
    dependencies: [model_training]
  - name: model_deployment
    dependencies: [model_evaluation]
```

### Real-time Feature Pipeline

```python
# Stream processing for real-time features,
# combined with batch training.
# See references/data-preparation.md
```
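
To make the streaming half concrete, here is a minimal sketch of a rolling-window feature computed from an event stream using only the standard library; the window length and event shape are assumptions.

```python
from collections import deque
from time import time


class RollingMean:
    """Mean of event values over a sliding time window (in seconds)."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def update(self, value: float, ts: float | None = None) -> float:
        ts = time() if ts is None else ts
        self.events.append((ts, value))
        self.total += value
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] < ts - self.window_s:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)


feature = RollingMean(window_s=60.0)
print(feature.update(10.0, ts=0.0))   # 10.0
print(feature.update(20.0, ts=30.0))  # 15.0
print(feature.update(30.0, ts=90.0))  # 25.0 (the ts=0.0 event has expired)
```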

### Continuous Training

```python
# Automated retraining on a schedule,
# triggered by data drift detection.
# See references/model-training.md
```
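
A common trigger for continuous training is a drift check on incoming features. Below is a sketch using a simple mean-shift test (production systems often use PSI or KS statistics instead); `launch_retraining` is a hypothetical hook into the orchestrator.

```python
import statistics


def drift_detected(reference: list[float], recent: list[float],
                   threshold: float = 2.0) -> bool:
    """Flag drift when the recent mean shifts more than `threshold`
    reference standard deviations away from the reference mean."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(recent) - ref_mean) > threshold * ref_std


def maybe_retrain(reference, recent, launch_retraining):
    if drift_detected(reference, recent):
        launch_retraining()  # e.g. trigger the training DAG via the scheduler


# Example: a clear upward shift in the feature triggers retraining.
maybe_retrain(
    reference=[1.0, 1.1, 0.9, 1.0, 1.05],
    recent=[2.0, 2.1, 1.9],
    launch_retraining=lambda: print("triggering training DAG"),
)
```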

## Troubleshooting

### Common Issues

- **Pipeline failures**: Check dependencies and data availability
- **Training instability**: Review hyperparameters and data quality
- **Deployment issues**: Validate model artifacts and serving config
- **Performance degradation**: Monitor data drift and model metrics

### Debugging Steps

1. Check pipeline logs for each stage
2. Validate input/output data at stage boundaries
3. Test components in isolation
4. Review experiment tracking metrics
5. Inspect model artifacts and metadata

## Next Steps

After setting up your pipeline:

1. Explore the **hyperparameter-tuning** skill for optimization
2. Learn **experiment-tracking-setup** for MLflow/W&B
3. Review **model-deployment-patterns** for serving strategies
4. Implement monitoring with observability tools

## Related Skills

- **ml-pipeline-orchestrator**: SpecWeave-integrated ML development (use for increment-based ML)
- **experiment-tracker**: MLflow and Weights & Biases experiment tracking
- **automl-optimizer**: Automated hyperparameter optimization with Optuna/Hyperopt
- **ml-deployment-helper**: Model serving and deployment patterns