---
name: mlops-dag-builder
description: Design DAG-based MLOps pipeline architectures with Airflow, Dagster, Kubeflow, or Prefect. Activates for DAG orchestration, workflow automation, pipeline design patterns, CI/CD for ML. Use for platform-agnostic MLOps infrastructure - NOT for SpecWeave increment-based ML (use ml-pipeline-orchestrator instead).
---

# MLOps DAG Builder

Design and implement DAG-based ML pipeline architectures using production orchestration tools.

## Overview

This skill provides guidance for building **platform-agnostic MLOps pipelines** using DAG orchestrators (Airflow, Dagster, Kubeflow, Prefect). It focuses on workflow architecture, not SpecWeave integration.

**When to use this skill vs. ml-pipeline-orchestrator:**
- **Use this skill**: General MLOps architecture, Airflow/Dagster DAGs, cloud ML platforms
- **Use ml-pipeline-orchestrator**: SpecWeave increment-based ML development with experiment tracking

## When to Use This Skill

- Designing DAG-based workflow orchestration (Airflow, Dagster, Kubeflow)
- Implementing platform-agnostic ML pipeline patterns
- Setting up CI/CD automation for ML training jobs
- Creating reusable pipeline templates for teams
- Integrating with cloud ML services (SageMaker, Vertex AI, Azure ML)

## What This Skill Provides

### Core Capabilities

1. **Pipeline Architecture**
   - End-to-end workflow design
   - DAG orchestration patterns (Airflow, Dagster, Kubeflow)
   - Component dependencies and data flow
   - Error handling and retry strategies (see the sketch after this list)

2. **Data Preparation**
   - Data validation and quality checks
   - Feature engineering pipelines
   - Data versioning and lineage
   - Train/validation/test splitting strategies

3. **Model Training**
   - Training job orchestration
   - Hyperparameter management
   - Experiment tracking integration
   - Distributed training patterns

4. **Model Validation**
   - Validation frameworks and metrics
   - A/B testing infrastructure
   - Performance regression detection
   - Model comparison workflows

5. **Deployment Automation**
   - Model serving patterns
   - Canary deployments
   - Blue-green deployment strategies
   - Rollback mechanisms
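
As a concrete illustration of the error-handling and retry strategies listed under Pipeline Architecture, here is a minimal Airflow sketch with per-task retries and a failure callback. The `notify_on_failure` hook and the `train_model` callable are hypothetical placeholders, and exact DAG arguments vary slightly across Airflow 2.x versions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Hypothetical alerting hook: forward the failed task id to Slack/pager.
    print(f"Task failed: {context['task_instance'].task_id}")


def train_model():
    # Placeholder for the real training entry point.
    pass


default_args = {
    "retries": 3,                         # retry transient failures
    "retry_delay": timedelta(minutes=5),  # back off between attempts
    "on_failure_callback": notify_on_failure,
}

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    train = PythonOperator(task_id="model_training", python_callable=train_model)
```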

### Reference Documentation

See the `references/` directory for detailed guides:
- **data-preparation.md** - Data cleaning, validation, and feature engineering
- **model-training.md** - Training workflows and best practices
- **model-validation.md** - Validation strategies and metrics
- **model-deployment.md** - Deployment patterns and serving architectures

### Assets and Templates

The `assets/` directory contains:
- **pipeline-dag.yaml.template** - DAG template for workflow orchestration
- **training-config.yaml** - Training configuration template
- **validation-checklist.md** - Pre-deployment validation checklist

## Usage Patterns

### Basic Pipeline Setup

```python
# 1. Define pipeline stages
stages = [
    "data_ingestion",
    "data_validation",
    "feature_engineering",
    "model_training",
    "model_validation",
    "model_deployment",
]

# 2. Configure dependencies
# See assets/pipeline-dag.yaml.template for a full example
```
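
Below is a minimal sketch of wiring these stages into a linear Airflow DAG. The `run_stage` dispatcher is a hypothetical placeholder, and the real wiring in `assets/pipeline-dag.yaml.template` may differ.

```python
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.python import PythonOperator

stages = [
    "data_ingestion", "data_validation", "feature_engineering",
    "model_training", "model_validation", "model_deployment",
]


def run_stage(stage_name: str):
    # Hypothetical dispatcher: look up and execute the callable for a stage.
    print(f"Running stage: {stage_name}")


with DAG(
    dag_id="basic_ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually while developing
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(task_id=stage, python_callable=run_stage, op_args=[stage])
        for stage in stages
    ]
    chain(*tasks)  # linear ordering: each stage depends on the previous one
```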

### Production Workflow

1. **Data Preparation Phase**
   - Ingest raw data from sources
   - Run data quality checks
   - Apply feature transformations
   - Version processed datasets

2. **Training Phase**
   - Load versioned training data
   - Execute training jobs
   - Track experiments and metrics
   - Save trained models

3. **Validation Phase**
   - Run validation test suite
   - Compare against baseline (see the gate sketch after this list)
   - Generate performance reports
   - Approve for deployment

4. **Deployment Phase**
   - Package model artifacts
   - Deploy to serving infrastructure
   - Configure monitoring
   - Validate production traffic
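
One way to implement the baseline comparison in the Validation Phase is a gate that blocks deployment when the candidate regresses. A minimal sketch; the metric names and tolerance are illustrative assumptions.

```python
def validation_gate(candidate: dict, baseline: dict, tolerance: float = 0.01) -> bool:
    """Approve the candidate only if no metric regresses beyond `tolerance`.

    Both dicts map metric names (e.g. "auc") to floats, higher is better.
    """
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None:
            raise ValueError(f"Candidate is missing metric: {name}")
        if cand_value < base_value - tolerance:
            print(f"Regression on {name}: {cand_value:.4f} < {base_value:.4f}")
            return False
    return True


# Example: a clear AUC regression blocks deployment.
assert validation_gate({"auc": 0.91}, {"auc": 0.93}) is False
```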

## Best Practices

### Pipeline Design

- **Modularity**: Each stage should be independently testable
- **Idempotency**: Re-running stages should be safe (see the sketch below)
- **Observability**: Log metrics at every stage
- **Versioning**: Track data, code, and model versions
- **Failure Handling**: Implement retry logic and alerting
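
A minimal sketch of the idempotency practice above: key each stage's output by a hash of its inputs and parameters, so re-running with identical inputs is a safe no-op. The path layout and `compute_features` body are illustrative placeholders.

```python
import hashlib
import json
from pathlib import Path


def compute_features(raw_path: Path, out_path: Path) -> None:
    # Placeholder for the real feature transformation.
    out_path.write_text(raw_path.read_text().upper())


def run_feature_stage(raw_path: Path, params: dict, out_root: Path) -> Path:
    # Deterministic output key: identical inputs + params -> identical path.
    key = hashlib.sha256(
        raw_path.read_bytes() + json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:12]
    out_path = out_root / f"features-{key}.txt"
    if out_path.exists():
        return out_path  # already computed: re-running is safe
    out_root.mkdir(parents=True, exist_ok=True)
    compute_features(raw_path, out_path)
    return out_path
```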

### Data Management

- Use data validation libraries (Great Expectations, TFX Data Validation) - a minimal hand-rolled check is sketched below
- Version datasets with DVC or similar tools
- Document feature engineering transformations
- Maintain data lineage tracking
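
Dedicated libraries are the right long-term choice, but the core idea behind a data quality check is simple enough to sketch by hand. A minimal schema-and-nulls check in plain Python; the example schema is hypothetical.

```python
def validate_rows(rows: list[dict], schema: dict[str, type]) -> list[str]:
    """Return human-readable violations (an empty list means clean data)."""
    errors = []
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            value = row.get(column)
            if value is None:
                errors.append(f"row {i}: missing value for '{column}'")
            elif not isinstance(value, expected):
                errors.append(
                    f"row {i}: '{column}' is {type(value).__name__}, "
                    f"expected {expected.__name__}"
                )
    return errors


# Hypothetical schema for a transactions dataset.
schema = {"user_id": int, "amount": float}
print(validate_rows([{"user_id": 1, "amount": "3.5"}], schema))
# ["row 0: 'amount' is str, expected float"]
```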

### Model Operations

- Separate training and serving infrastructure
- Use model registries (MLflow, Weights & Biases) - see the sketch below
- Implement gradual rollouts for new models
- Monitor model performance drift
- Maintain rollback capabilities
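
A minimal MLflow sketch of the registry practice above: log a run, then register the resulting model under a stable name. The experiment and model names are placeholders, and `register_model` assumes a registry-capable tracking backend (behavior varies by MLflow version).

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

mlflow.set_experiment("churn-model")  # placeholder experiment name

with mlflow.start_run() as run:
    mlflow.log_metric("auc", 0.93)  # placeholder metric
    mlflow.sklearn.log_model(model, artifact_path="model")

# Register the logged artifact so deployments reference a versioned name.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-model")
```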

### Deployment Strategies

- Start with shadow deployments
- Use canary releases for validation
- Implement A/B testing infrastructure
- Set up automated rollback triggers (see the sketch below)
- Monitor latency and throughput
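
As an illustration of an automated rollback trigger, here is a sketch that compares the canary's live error rate against the stable version and rolls back past a threshold. `get_error_rate` and `route_all_traffic_to` are hypothetical hooks into your serving layer.

```python
def check_canary(get_error_rate, route_all_traffic_to, max_ratio: float = 1.5) -> str:
    """Roll back if the canary's error rate exceeds stable by `max_ratio`."""
    stable = get_error_rate("stable")
    canary = get_error_rate("canary")
    if canary > stable * max_ratio:
        route_all_traffic_to("stable")  # automated rollback
        return "rolled_back"
    return "healthy"


# Example with stubbed hooks: 2% canary errors vs. 1% stable triggers rollback.
rates = {"stable": 0.01, "canary": 0.02}
print(check_canary(rates.get, lambda v: print(f"routing all traffic to {v}")))
# routing all traffic to stable
# rolled_back
```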

## Integration Points

### Orchestration Tools

- **Apache Airflow**: DAG-based workflow orchestration
- **Dagster**: Asset-based pipeline orchestration
- **Kubeflow Pipelines**: Kubernetes-native ML workflows
- **Prefect**: Modern dataflow automation

### Experiment Tracking

- MLflow for experiment tracking and model registry
- Weights & Biases for visualization and collaboration
- TensorBoard for training metrics

### Deployment Platforms

- AWS SageMaker for managed ML infrastructure
- Google Vertex AI for GCP deployments
- Azure ML for Azure cloud deployments
- Kubernetes + KServe for cloud-agnostic serving

## Progressive Disclosure

Start with the basics and gradually add complexity:

1. **Level 1**: Simple linear pipeline (data → train → deploy)
2. **Level 2**: Add validation and monitoring stages
3. **Level 3**: Implement hyperparameter tuning
4. **Level 4**: Add A/B testing and gradual rollouts
5. **Level 5**: Multi-model pipelines with ensemble strategies

## Common Patterns

### Batch Training Pipeline

```yaml
# See assets/pipeline-dag.yaml.template
stages:
  - name: data_preparation
    dependencies: []
  - name: model_training
    dependencies: [data_preparation]
  - name: model_evaluation
    dependencies: [model_training]
  - name: model_deployment
    dependencies: [model_evaluation]
```

### Real-time Feature Pipeline

```python
# Stream processing for real-time features,
# combined with batch training.
# See references/data-preparation.md
```
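
To make the streaming half concrete, here is a minimal sketch of a rolling-window feature computed from an event stream using only the standard library; the window length and event shape are assumptions.

```python
from collections import deque
from time import time


class RollingMean:
    """Mean of event values over a sliding time window (in seconds)."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def update(self, value: float, ts: float | None = None) -> float:
        ts = time() if ts is None else ts
        self.events.append((ts, value))
        self.total += value
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] < ts - self.window_s:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)


feature = RollingMean(window_s=60.0)
print(feature.update(10.0, ts=0.0))   # 10.0
print(feature.update(20.0, ts=30.0))  # 15.0
print(feature.update(30.0, ts=90.0))  # 25.0 (the ts=0.0 event has expired)
```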

### Continuous Training

```python
# Automated retraining on a schedule,
# triggered by data drift detection.
# See references/model-training.md
```
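
A common trigger for continuous training is a drift check on incoming features. Below is a sketch using a simple mean-shift test (production systems often use PSI or KS statistics instead); `launch_retraining` is a hypothetical hook into the orchestrator.

```python
import statistics


def drift_detected(reference: list[float], recent: list[float],
                   threshold: float = 2.0) -> bool:
    """Flag drift when the recent mean shifts more than `threshold`
    reference standard deviations away from the reference mean."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(recent) - ref_mean) > threshold * ref_std


def maybe_retrain(reference, recent, launch_retraining):
    if drift_detected(reference, recent):
        launch_retraining()  # e.g. trigger the training DAG via the scheduler


# Example: a clear upward shift in the feature triggers retraining.
maybe_retrain(
    reference=[1.0, 1.1, 0.9, 1.0, 1.05],
    recent=[2.0, 2.1, 1.9],
    launch_retraining=lambda: print("triggering training DAG"),
)
```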

## Troubleshooting

### Common Issues

- **Pipeline failures**: Check dependencies and data availability
- **Training instability**: Review hyperparameters and data quality
- **Deployment issues**: Validate model artifacts and serving config
- **Performance degradation**: Monitor data drift and model metrics

### Debugging Steps

1. Check pipeline logs for each stage
2. Validate input/output data at stage boundaries
3. Test components in isolation
4. Review experiment tracking metrics
5. Inspect model artifacts and metadata

## Next Steps

After setting up your pipeline:

1. Explore the **hyperparameter-tuning** skill for optimization
2. Learn **experiment-tracking-setup** for MLflow/W&B
3. Review **model-deployment-patterns** for serving strategies
4. Implement monitoring with observability tools

## Related Skills

- **ml-pipeline-orchestrator**: SpecWeave-integrated ML development (use for increment-based ML)
- **experiment-tracker**: MLflow and Weights & Biases experiment tracking
- **automl-optimizer**: Automated hyperparameter optimization with Optuna/Hyperopt
- **ml-deployment-helper**: Model serving and deployment patterns