---
name: mlops-dag-builder
description: Design DAG-based MLOps pipeline architectures with Airflow, Dagster, Kubeflow, or Prefect. Activates for DAG orchestration, workflow automation, pipeline design patterns, CI/CD for ML. Use for platform-agnostic MLOps infrastructure - NOT for SpecWeave increment-based ML (use ml-pipeline-orchestrator instead).
---

# MLOps DAG Builder

Design and implement DAG-based ML pipeline architectures using production orchestration tools.

## Overview

This skill provides guidance for building **platform-agnostic MLOps pipelines** using DAG orchestrators (Airflow, Dagster, Kubeflow, Prefect). It focuses on workflow architecture, not SpecWeave integration.

**When to use this skill vs ml-pipeline-orchestrator:**
- **Use this skill**: General MLOps architecture, Airflow/Dagster DAGs, cloud ML platforms
- **Use ml-pipeline-orchestrator**: SpecWeave increment-based ML development with experiment tracking

## When to Use This Skill

- Designing DAG-based workflow orchestration (Airflow, Dagster, Kubeflow, Prefect)
- Implementing platform-agnostic ML pipeline patterns
- Setting up CI/CD automation for ML training jobs
- Creating reusable pipeline templates for teams
- Integrating with cloud ML services (SageMaker, Vertex AI, Azure ML)

## What This Skill Provides

### Core Capabilities

1. **Pipeline Architecture**
   - End-to-end workflow design
   - DAG orchestration patterns (Airflow, Dagster, Kubeflow, Prefect)
   - Component dependencies and data flow
   - Error handling and retry strategies

2. **Data Preparation**
   - Data validation and quality checks
   - Feature engineering pipelines
   - Data versioning and lineage
   - Train/validation/test splitting strategies

3. **Model Training**
   - Training job orchestration
   - Hyperparameter management
   - Experiment tracking integration
   - Distributed training patterns

4. **Model Validation**
   - Validation frameworks and metrics
   - A/B testing infrastructure
   - Performance regression detection
   - Model comparison workflows

5. **Deployment Automation**
   - Model serving patterns
   - Canary deployments
   - Blue-green deployment strategies
   - Rollback mechanisms

### Reference Documentation

See the `references/` directory for detailed guides:
- **data-preparation.md** - Data cleaning, validation, and feature engineering
- **model-training.md** - Training workflows and best practices
- **model-validation.md** - Validation strategies and metrics
- **model-deployment.md** - Deployment patterns and serving architectures

### Assets and Templates

The `assets/` directory contains:
- **pipeline-dag.yaml.template** - DAG template for workflow orchestration
- **training-config.yaml** - Training configuration template
- **validation-checklist.md** - Pre-deployment validation checklist

## Usage Patterns

### Basic Pipeline Setup

```python
# 1. Define pipeline stages
stages = [
    "data_ingestion",
    "data_validation",
    "feature_engineering",
    "model_training",
    "model_validation",
    "model_deployment",
]

# 2. Configure dependencies
# See assets/pipeline-dag.yaml.template for full example
```

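The "configure dependencies" step can be sketched without committing to a specific orchestrator. The example below is a minimal sketch, assuming a linear chain in which each stage depends on the one before it. The `dependencies` mapping and the `execution_order` helper are illustrative names and are not taken from `assets/pipeline-dag.yaml.template`. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

stages = [
    "data_ingestion",
    "data_validation",
    "feature_engineering",
    "model_training",
    "model_validation",
    "model_deployment",
]

# Hypothetical dependency map: each stage lists the stages it waits on.
# Here every stage depends on its predecessor (a linear chain).
dependencies = {stage: set() for stage in stages}
for upstream, downstream in zip(stages, stages[1:]):
    dependencies[downstream].add(upstream)


def execution_order(deps):
    """Return a valid run order; raises graphlib.CycleError if deps has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A real orchestrator derives the same ordering from its own DAG definition; the point of the sketch is only that stages and dependencies together fully determine the run order.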
### Production Workflow

1. **Data Preparation Phase**
   - Ingest raw data from sources
   - Run data quality checks
   - Apply feature transformations
   - Version processed datasets

2. **Training Phase**
   - Load versioned training data
   - Execute training jobs
   - Track experiments and metrics
   - Save trained models

3. **Validation Phase**
   - Run validation test suite
   - Compare against baseline
   - Generate performance reports
   - Approve for deployment

4. **Deployment Phase**
   - Package model artifacts
   - Deploy to serving infrastructure
   - Configure monitoring
   - Validate production traffic

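The validation phase's "compare against baseline" and "approve for deployment" steps can be expressed as a small gate function. This is a sketch under stated assumptions: all metrics are higher-is-better, and `approve_for_deployment`, `max_regression`, and the metric values are hypothetical names and numbers chosen for illustration.

```python
def approve_for_deployment(candidate, baseline, max_regression=0.01):
    """Approve only if no baseline metric regresses by more than max_regression.

    Assumes every metric is higher-is-better (e.g. accuracy, AUC).
    Returns (approved, report), where report maps metric name -> delta.
    """
    report = {}
    approved = True
    for name, baseline_value in baseline.items():
        if name not in candidate:
            # A missing metric is treated as a failure, not silently skipped.
            report[name] = None
            approved = False
            continue
        delta = candidate[name] - baseline_value
        report[name] = delta
        if delta < -max_regression:
            approved = False
    return approved, report


# Illustrative numbers only.
baseline = {"accuracy": 0.91, "auc": 0.95}
candidate = {"accuracy": 0.92, "auc": 0.945}
approved, report = approve_for_deployment(candidate, baseline)
print(approved, report)
```

Returning the per-metric report alongside the decision keeps the gate usable for the "generate performance reports" step as well.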
## Best Practices

### Pipeline Design

- **Modularity**: Each stage should be independently testable
- **Idempotency**: Re-running stages should be safe
- **Observability**: Log metrics at every stage
- **Versioning**: Track data, code, and model versions
- **Failure Handling**: Implement retry logic and alerting

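Idempotency and failure handling from the list above can be illustrated together. The following is a minimal sketch, not an orchestrator API: `retry`, `run_once`, and the in-memory `completed` set are hypothetical helpers, and a real pipeline would keep completion state in a durable store and would usually rely on the orchestrator's built-in retry settings.

```python
import time
from functools import wraps


def retry(max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry a stage with exponential backoff; re-raise after the last attempt."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # surface the failure so alerting can fire
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


completed = set()  # stand-in for a durable store of finished (stage, run_id) keys


def run_once(stage_name, run_id, func):
    """Idempotency guard: skip a stage that already finished for this run."""
    key = (stage_name, run_id)
    if key in completed:
        return "skipped"
    func()
    completed.add(key)
    return "ran"


calls = {"n": 0}


@retry(max_attempts=3, base_delay=0.0)
def flaky_ingest():
    """Simulated stage that fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")


first = run_once("data_ingestion", "run-001", flaky_ingest)   # retried until success
second = run_once("data_ingestion", "run-001", flaky_ingest)  # same run: not re-executed
```

The guard only marks a stage complete after it returns, so a stage that exhausts its retries stays eligible for a later re-run.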
### Data Management

- Use data validation libraries (Great Expectations, TensorFlow Data Validation from TFX)
- Version datasets with DVC or similar tools
- Document feature engineering transformations
- Maintain data lineage tracking

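To make dataset versioning and lineage concrete, the sketch below derives a version identifier from dataset content and records which inputs produced which outputs. It is a simplified stand-in for illustration and does not reflect how DVC or any other tool works internally. `dataset_fingerprint` and `lineage_record` are hypothetical helpers, and the sample records are made up.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a short, deterministic content hash for a list of dict records."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the hash independent of dict key order
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


def lineage_record(name, rows, parents=()):
    """Hypothetical lineage entry linking a dataset version to its inputs."""
    return {
        "name": name,
        "version": dataset_fingerprint(rows),
        "parents": list(parents),
    }


raw = [{"user_id": 1, "clicks": 3}, {"user_id": 2, "clicks": 0}]
features = [{"user_id": r["user_id"], "active": r["clicks"] > 0} for r in raw]

raw_entry = lineage_record("raw_events", raw)
feature_entry = lineage_record("features", features, parents=[raw_entry["version"]])
print(raw_entry)
print(feature_entry)
```

Because the version is derived from content, re-running a stage on unchanged input yields the same identifier, which also supports idempotent re-runs.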
### Model Operations

- Separate training and serving infrastructure
- Use model registries (MLflow, Weights & Biases)
- Implement gradual rollouts for new models
- Monitor model performance drift
- Maintain rollback capabilities

### Deployment Strategies

- Start with shadow deployments
- Use canary releases for validation
- Implement A/B testing infrastructure
- Set up automated rollback triggers
- Monitor latency and throughput

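Canary releases and automated rollback triggers can be sketched in a few lines. The example assumes requests carry a stable identifier and that error counts are available for the canary. `route_to_canary`, `should_rollback`, and the default `tolerance` and `min_requests` values are illustrative assumptions and do not come from any serving platform.

```python
import hashlib


def route_to_canary(request_id, canary_percent):
    """Deterministically send roughly canary_percent% of requests to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


def should_rollback(canary_errors, canary_total, baseline_error_rate,
                    tolerance=0.02, min_requests=100):
    """Trigger rollback when the canary error rate exceeds baseline + tolerance."""
    if canary_total < min_requests:
        return False  # not enough traffic to judge yet
    return canary_errors / canary_total > baseline_error_rate + tolerance


# Route 10% of traffic to the canary, then evaluate its error rate.
canary_hits = sum(route_to_canary(f"req-{i}", 10) for i in range(10_000))
print(canary_hits)
print(should_rollback(canary_errors=10, canary_total=100, baseline_error_rate=0.01))
```

Hashing the request identifier keeps a given request or user on the same variant across calls, which simplifies comparing the canary against the baseline.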
## Integration Points

### Orchestration Tools

- **Apache Airflow**: DAG-based workflow orchestration
- **Dagster**: Asset-based pipeline orchestration
- **Kubeflow Pipelines**: Kubernetes-native ML workflows
- **Prefect**: Modern dataflow automation

### Experiment Tracking

- MLflow for experiment tracking and model registry
- Weights & Biases for visualization and collaboration
- TensorBoard for training metrics

### Deployment Platforms

- AWS SageMaker for managed ML infrastructure
- Google Vertex AI for GCP deployments
- Azure ML for Azure cloud
- Kubernetes + KServe for cloud-agnostic serving

## Progressive Disclosure

Start with the basics and gradually add complexity:

1. **Level 1**: Simple linear pipeline (data → train → deploy)
2. **Level 2**: Add validation and monitoring stages
3. **Level 3**: Implement hyperparameter tuning
4. **Level 4**: Add A/B testing and gradual rollouts
5. **Level 5**: Multi-model pipelines with ensemble strategies

## Common Patterns

### Batch Training Pipeline

```yaml
# See assets/pipeline-dag.yaml.template
stages:
  - name: data_preparation
    dependencies: []
  - name: model_training
    dependencies: [data_preparation]
  - name: model_evaluation
    dependencies: [model_training]
  - name: model_deployment
    dependencies: [model_evaluation]
```

### Real-time Feature Pipeline

```python
# Stream processing for real-time features
# Combined with batch training
# See references/data-preparation.md
```

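As an illustration of a real-time feature, the sketch below maintains a rolling count and mean over a time window in plain Python. It assumes events arrive in timestamp order and uses a hypothetical `SlidingWindowFeature` class. A production system would more likely delegate this to a stream processor, and nothing here is taken from `references/data-preparation.md`.

```python
from collections import deque


class SlidingWindowFeature:
    """Rolling count and mean of event values over the last window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record an event; timestamps are assumed to arrive in order."""
        self.events.append((timestamp, value))
        self.total += value
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def features(self, now):
        """Return the current feature values as of time `now`."""
        self._evict(now)
        count = len(self.events)
        mean = self.total / count if count else 0.0
        return {"count": count, "mean": mean}


window = SlidingWindowFeature(window_seconds=60)
for ts, amount in [(0, 10.0), (30, 20.0), (70, 30.0)]:
    window.add(ts, amount)
snapshot = window.features(now=70)  # the event at t=0 is outside the 60 s window
print(snapshot)
```

To stay consistent with batch training, the same window definition would need to be applied when computing historical features offline.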
### Continuous Training

```python
# Automated retraining on schedule
# Triggered by data drift detection
# See references/model-training.md
```

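A drift-based retraining trigger can be sketched with the Population Stability Index (PSI), one common drift statistic among several. The example is a minimal sketch for a single numeric feature. The equal-width binning, the `1e-6` floor for empty bins, and the `0.2` threshold (a frequently quoted rule of thumb) are assumptions to tune per use case, and `should_retrain` is a hypothetical helper.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a recent sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps log() defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(expected, actual, threshold=0.2):
    """Hypothetical trigger: retrain when PSI exceeds the chosen threshold."""
    return population_stability_index(expected, actual) > threshold


reference = [i / 100 for i in range(100)]      # roughly uniform on [0, 1)
recent_same = [i / 100 for i in range(100)]    # unchanged distribution
recent_shifted = [0.5 + i / 200 for i in range(100)]  # mass moved to the upper half

print(should_retrain(reference, recent_same))
print(should_retrain(reference, recent_shifted))
```

In a scheduled pipeline, this check would run as its own stage, with the training stage executed only when the trigger fires or the regular schedule is due.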
## Troubleshooting

### Common Issues

- **Pipeline failures**: Check dependencies and data availability
- **Training instability**: Review hyperparameters and data quality
- **Deployment issues**: Validate model artifacts and serving config
- **Performance degradation**: Monitor data drift and model metrics

### Debugging Steps

1. Check pipeline logs for each stage
2. Validate input/output data at boundaries
3. Test components in isolation
4. Review experiment tracking metrics
5. Inspect model artifacts and metadata

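Step 2, validating data at stage boundaries, can be done with a lightweight check before reaching for a full validation library. The sketch below assumes records are plain dicts. `check_boundary` and the example schema are hypothetical and only illustrate the idea.

```python
def check_boundary(stage_name, records, required_fields):
    """Validate records crossing a stage boundary; return a list of problems."""
    problems = []
    if not records:
        problems.append(f"{stage_name}: no records received")
    for index, record in enumerate(records):
        for field, expected_type in required_fields.items():
            if field not in record or record[field] is None:
                problems.append(f"{stage_name}: record {index} missing '{field}'")
            elif not isinstance(record[field], expected_type):
                problems.append(
                    f"{stage_name}: record {index} field '{field}' "
                    f"is {type(record[field]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


schema = {"user_id": int, "score": float}
good = [{"user_id": 1, "score": 0.7}]
bad = [{"user_id": "1", "score": None}]

print(check_boundary("feature_engineering", good, schema))
for problem in check_boundary("feature_engineering", bad, schema):
    print(problem)
```

Collecting every problem instead of failing on the first one makes it easier to tell a single bad record from a systematic upstream issue.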
## Next Steps

After setting up your pipeline:

1. Explore the **hyperparameter-tuning** skill for optimization
2. Learn **experiment-tracking-setup** for MLflow/W&B
3. Review **model-deployment-patterns** for serving strategies
4. Implement monitoring with observability tools

## Related Skills

- **ml-pipeline-orchestrator**: SpecWeave-integrated ML development (use for increment-based ML)
- **experiment-tracker**: MLflow and Weights & Biases experiment tracking
- **automl-optimizer**: Automated hyperparameter optimization with Optuna/Hyperopt
- **ml-deployment-helper**: Model serving and deployment patterns