Initial commit

2025-11-29 17:56:53 +08:00
commit 468d045de7
24 changed files with 7204 additions and 0 deletions
--- a/commands/specweave-ml-deploy.md
+++ b/commands/specweave-ml-deploy.md
@@ -0,0 +1,116 @@
+---
+name: specweave-ml:ml-deploy
+description: Generate deployment artifacts (API, Docker, monitoring)
+---
+
+# Deploy ML Model
+
+You are preparing an ML model for production deployment. Generate all necessary deployment artifacts following MLOps best practices.
+
+## Your Task
+
+1. **Generate API**: FastAPI endpoint for model serving
+2. **Containerize**: Dockerfile for model deployment
+3. **Set Up Monitoring**: Prometheus/Grafana configuration
+4. **Create A/B Test**: Traffic splitting infrastructure
+5. **Load Test and Document**: Load-test the API, then write the deployment runbook
+
+## Deployment Steps
+
+### Step 1: Generate FastAPI App
+
+```python
+from specweave import create_model_api
+
+api = create_model_api(
+    model_path="models/model.pkl",
+    framework="fastapi"
+)
+```
+
+Creates: `api/main.py`, `api/models.py`, `api/predict.py`
+
+### Step 2: Create Dockerfile
+
+```python
+dockerfile = containerize_model(
+    model_path="models/model.pkl",
+    python_version="3.10"
+)
+```
+
+Creates: `Dockerfile`, `requirements.txt`
+
+### Step 3: Set Up Monitoring
+
+```python
+monitoring = setup_monitoring(
+    model_name="recommendation-model",
+    metrics=["latency", "throughput", "error_rate", "drift"]
+)
+```
+
+Creates: `monitoring/prometheus.yaml`, `monitoring/grafana-dashboard.json`
+
+### Step 4: A/B Testing Infrastructure
+
+```python
+ab_test = create_ab_test(
+    control_model="model-v2.pkl",
+    treatment_model="model-v3.pkl",
+    traffic_split=0.1
+)
+```
+
+Creates: `ab-test/router.py`, `ab-test/metrics.py`
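The internals of `create_ab_test` are not shown in this commit. As a rough illustration of what a traffic-splitting router might do, the sketch below hashes a user ID into a stable bucket and compares it to the split. It assumes `traffic_split=0.1` means the share of traffic sent to the treatment model; `assign_variant` and the `salt` parameter are hypothetical names, not part of SpecWeave.

```python
import hashlib


def assign_variant(user_id: str, traffic_split: float = 0.1, salt: str = "ab-test") -> str:
    """Return "treatment" for roughly `traffic_split` of users, else "control".

    Hypothetical helper: the assignment is deterministic, so a given user
    always sees the same model for a given salt.
    """
    if not 0.0 <= traffic_split <= 1.0:
        raise ValueError("traffic_split must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits (32 bits) to a float in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < traffic_split else "control"


if __name__ == "__main__":
    counts = {"control": 0, "treatment": 0}
    for i in range(10_000):
        counts[assign_variant(f"user-{i}")] += 1
    print(counts)
```

Hashing rather than sampling randomly per request keeps each user on one variant, which tends to make per-user metrics easier to compare. Changing the salt reshuffles the assignment for a new experiment.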
+
+### Step 5: Load Testing
+
+```python
+load_test_results = load_test_model(
+    api_url="http://localhost:8000/predict",
+    target_rps=100,
+    duration=60
+)
+```
+
+Creates: `load-tests/results.md`
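How `load_test_model` measures and reports results is not specified here. One plausible way to turn raw request latencies into the figures a results file would need is sketched below. `summarize_load_test` and its field names are assumptions for illustration, and percentiles use the simple nearest-rank method.

```python
import math


def summarize_load_test(latencies_ms, duration_s, target_rps, error_count=0):
    """Summarize successful-request latencies from one load-test run.

    Hypothetical helper: `latencies_ms` holds one latency per successful
    request; failed requests are passed separately as `error_count`.
    """
    if not latencies_ms or duration_s <= 0:
        raise ValueError("need at least one latency and a positive duration")
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    achieved_rps = len(ordered) / duration_s
    return {
        "achieved_rps": achieved_rps,
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
        "error_rate": error_count / (len(ordered) + error_count),
        "meets_target": achieved_rps >= target_rps,
    }


if __name__ == "__main__":
    print(summarize_load_test([12.0, 15.5, 11.2, 80.3], duration_s=1, target_rps=4))
```

Reporting p99 alongside the median matters because a service can hit its target RPS while a small fraction of requests is still unacceptably slow.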
+
+### Step 6: Deployment Runbook
+
+Create `DEPLOYMENT.md`:
+
+```markdown
+# Deployment Runbook
+
+## Pre-Deployment Checklist
+- [ ] Model versioned
+- [ ] API tested locally
+- [ ] Load testing passed
+- [ ] Monitoring configured
+- [ ] Rollback plan documented
+
+## Deployment Steps
+1. Build Docker image
+2. Push to registry
+3. Deploy to staging
+4. Validate staging
+5. Deploy to production (1% traffic)
+6. Monitor for 24 hours
+7. Ramp to 100% if stable
+
+## Rollback Procedure
+[Steps to roll back to the previous version]
+
+## Monitoring
+[Grafana dashboard URL]
+[Key metrics to watch]
+```
+
+## Output
+
+Report:
+- All deployment artifacts generated
+- Load test results (whether the API sustains the target RPS)
+- Deployment recommendation (ready/not ready)
+- Next steps for deployment
--- a/commands/specweave-ml-evaluate.md
+++ b/commands/specweave-ml-evaluate.md
@@ -0,0 +1,87 @@
+---
+name: specweave-ml:ml-evaluate
+description: Evaluate ML model with comprehensive metrics
+---
+
+# Evaluate ML Model
+
+You are evaluating an ML model in a SpecWeave increment. Generate a comprehensive evaluation report following ML best practices.
+
+## Your Task
+
+1. **Load Model**: Load the model from the specified increment
+2. **Run Evaluation**: Execute comprehensive evaluation with appropriate metrics
+3. **Generate Report**: Create the evaluation report in the increment folder
+
+## Evaluation Steps
+
+### Step 1: Identify Model Type
+- Classification: accuracy, precision, recall, F1, ROC AUC, confusion matrix
+- Regression: RMSE, MAE, MAPE, R², residual analysis
+- Ranking: precision@K, recall@K, NDCG@K, MAP
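For the classification case, the listed metrics all derive from the four confusion-matrix counts. The sketch below computes them by hand for a binary problem, mainly to make the definitions concrete. `binary_classification_metrics` is a hypothetical helper and is not claimed to match what `ModelEvaluator` does internally.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Fall back to 0.0 when a denominator is empty (for example, no positive predictions).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
    }


if __name__ == "__main__":
    print(binary_classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1]))
```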
+
+### Step 2: Load Test Data
+```python
+# Load test set from increment
+X_test = load_test_data(increment_path)
+y_test = load_test_labels(increment_path)
+```
+
+### Step 3: Compute Metrics
+```python
+from specweave import ModelEvaluator
+
+evaluator = ModelEvaluator(model, X_test, y_test)
+metrics = evaluator.compute_all_metrics()
+```
+
+### Step 4: Generate Visualizations
+- Confusion matrix (classification)
+- ROC curves (classification)
+- Residual plots (regression)
+- Calibration curves (classification)
+
+### Step 5: Statistical Validation
+- Cross-validation results
+- Confidence intervals
+- Comparison to baseline
+- Statistical significance tests
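The step does not say how confidence intervals should be obtained. A percentile bootstrap is one common, assumption-light option, sketched below for accuracy. `bootstrap_accuracy_ci` and its defaults (2000 resamples, 95% confidence, fixed seed) are illustrative choices, not taken from the source.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, confidence=0.95, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = [1 if t == p else 0 for t, p in zip(y_true, y_pred)]
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    # Resample the per-example correctness flags with replacement and re-score each resample.
    scores = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    alpha = (1 - confidence) / 2
    lower = scores[int(alpha * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return sum(correct) / n, lower, upper


if __name__ == "__main__":
    truth = [1] * 100
    preds = [1] * 80 + [0] * 20
    print(bootstrap_accuracy_ci(truth, preds))
```

If the baseline model's accuracy falls inside the interval, the comparison to baseline is probably not conclusive on this test set alone.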
+
+### Step 6: Generate Report
+
+Create `evaluation-report.md` in the increment folder:
+
+```markdown
+# Model Evaluation Report
+
+## Model: [Model Name]
+- Version: [Version]
+- Increment: [Increment ID]
+- Date: [Evaluation Date]
+
+## Overall Performance
+[Metrics table]
+
+## Visualizations
+[Embedded plots]
+
+## Cross-Validation
+[CV results]
+
+## Comparison to Baseline
+[Baseline comparison]
+
+## Statistical Tests
+[Significance tests]
+
+## Recommendations
+[Deploy/improve/investigate]
+```
+
+## Output
+
+After evaluation, report:
+- Overall performance summary
+- Key metrics
+- Whether the model meets the success criteria (from spec.md)
+- Recommendation (deploy/improve/investigate)
--- a/commands/specweave-ml-explain.md
+++ b/commands/specweave-ml-explain.md
@@ -0,0 +1,83 @@
+---
+name: specweave-ml:ml-explain
+description: Generate model explainability reports (SHAP, LIME, feature importance)
+---
+
+# Explain ML Model
+
+You are generating explainability artifacts for an ML model in a SpecWeave increment. Make the black box transparent.
+
+## Your Task
+
+1. **Load Model**: Load the model from the increment
+2. **Generate Global Explanations**: Feature importance, partial dependence
+3. **Generate Local Explanations**: SHAP/LIME for sample predictions
+4. **Create Report**: Comprehensive explainability documentation
+
+## Explainability Steps
+
+### Step 1: Feature Importance
+```python
+from specweave import ModelExplainer
+
+explainer = ModelExplainer(model, X_train)
+importance = explainer.feature_importance()
+```
+
+Create: `feature-importance.png`
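Which method `feature_importance()` uses is not stated. Permutation importance is one model-agnostic choice: shuffle one feature column at a time and measure how much accuracy drops. The sketch below assumes rows are plain lists and `predict` is any callable mapping a row to a label; `permutation_importance` here is a hand-rolled illustration, not the SpecWeave method.

```python
import random


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Return the mean accuracy drop per feature when that feature is shuffled."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(row) == target for row, target in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by its shuffled value.
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances


if __name__ == "__main__":
    rows = [[i / 20, (i * 7) % 5] for i in range(20)]
    labels = [int(row[0] > 0.5) for row in rows]
    print(permutation_importance(lambda row: int(row[0] > 0.5), rows, labels))
```

A feature the model ignores scores exactly zero, while an unexpectedly dominant feature is worth checking for leakage.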
+
+### Step 2: SHAP Summary
+```python
+shap_values = explainer.shap_summary()
+```
+
+Create: `shap-summary.png` (beeswarm plot)
+
+### Step 3: Partial Dependence Plots
+```python
+for feature in top_features:
+    pdp = explainer.partial_dependence(feature)
+```
+
+Create: `pdp-plots/` directory
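A partial dependence curve can be computed without any plotting library: force one feature to each value on a grid, keep every other feature as observed, and average the predictions. The sketch below shows that calculation under the assumption that rows are lists and `predict` returns a number; the function is a stand-in and is not the `explainer.partial_dependence` implementation.

```python
def partial_dependence(predict, X, feature_index, grid):
    """Return the average prediction over X for each forced value of one feature."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)  # copy so the caller's data is not mutated
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(X))
    return curve


if __name__ == "__main__":
    def linear_model(row):
        # Toy model used only for this example.
        return 2 * row[0] + row[1]

    print(partial_dependence(linear_model, [[0, 1], [0, 3]], feature_index=0, grid=[0, 1, 2]))
```

For a model that is linear in the chosen feature, the curve is a straight line whose slope equals that feature's coefficient, which makes a handy sanity check.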
+
+### Step 4: Local Explanations
+```python
+# Explain sample predictions
+samples = [high_confidence, low_confidence, edge_case]
+for sample in samples:
+    explanation = explainer.explain_prediction(sample)
+```
+
+Create: `local-explanations/` directory
+
+### Step 5: Generate Report
+
+Create `explainability-report.md`:
+
+```markdown
+# Model Explainability Report
+
+## Global Feature Importance
+[Top 10 features with importance scores]
+
+## SHAP Analysis
+[Summary plot and interpretation]
+
+## Partial Dependence
+[How each feature affects predictions]
+
+## Example Explanations
+[3-5 example predictions with full explanations]
+
+## Recommendations
+[Model improvements based on feature analysis]
+```
+
+## Output
+
+Report:
+- Top 10 most important features
+- Any surprising feature importances (these may indicate data leakage)
+- Model behavior insights
+- Recommendations for improvement
--- a/commands/specweave-ml-pipeline.md
+++ b/commands/specweave-ml-pipeline.md
@@ -0,0 +1,297 @@
+---
+name: specweave-ml:ml-pipeline
+description: Design and implement a complete ML pipeline with multi-agent MLOps orchestration
+---
+
+# Machine Learning Pipeline - Multi-Agent MLOps Orchestration
+
+Design and implement a complete ML pipeline for: $ARGUMENTS
+
+## Thinking
+
+This workflow orchestrates multiple specialized agents to build a production-ready ML pipeline following modern MLOps best practices. The approach emphasizes:
+
+- **Phase-based coordination**: Each phase builds upon previous outputs, with clear handoffs between agents
+- **Modern tooling integration**: MLflow/W&B for experiments, Feast/Tecton for features, KServe/Seldon for serving
+- **Production-first mindset**: Every component designed for scale, monitoring, and reliability
+- **Reproducibility**: Version control for data, models, and infrastructure
+- **Continuous improvement**: Automated retraining, A/B testing, and drift detection
+
+The multi-agent approach ensures each aspect is handled by domain experts:
+- Data engineers handle ingestion and quality
+- Data scientists design features and experiments
+- ML engineers implement training pipelines
+- MLOps engineers handle production deployment
+- Observability engineers ensure monitoring
+
+## Phase 1: Data & Requirements Analysis
+
+<Task>
+subagent_type: data-engineer
+prompt: |
+  Analyze and design data pipeline for ML system with requirements: $ARGUMENTS
+
+  Deliverables:
+  1. Data source audit and ingestion strategy:
+     - Source systems and connection patterns
+     - Schema validation using Pydantic/Great Expectations
+     - Data versioning with DVC or lakeFS
+     - Incremental loading and CDC strategies
+
+  2. Data quality framework:
+     - Profiling and statistics generation
+     - Anomaly detection rules
+     - Data lineage tracking
+     - Quality gates and SLAs
+
+  3. Storage architecture:
+     - Raw/processed/feature layers
+     - Partitioning strategy
+     - Retention policies
+     - Cost optimization
+
+  Provide implementation code for critical components and integration patterns.
+</Task>
+
+<Task>
+subagent_type: data-scientist
+prompt: |
+  Design feature engineering and model requirements for: $ARGUMENTS
+  Using data architecture from: {phase1.data-engineer.output}
+
+  Deliverables:
+  1. Feature engineering pipeline:
+     - Transformation specifications
+     - Feature store schema (Feast/Tecton)
+     - Statistical validation rules
+     - Handling strategies for missing data/outliers
+
+  2. Model requirements:
+     - Algorithm selection rationale
+     - Performance metrics and baselines
+     - Training data requirements
+     - Evaluation criteria and thresholds
+
+  3. Experiment design:
+     - Hypothesis and success metrics
+     - A/B testing methodology
+     - Sample size calculations
+     - Bias detection approach
+
+  Include feature transformation code and statistical validation logic.
+</Task>
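The "sample size calculations" deliverable can be made concrete for the common case of comparing two conversion rates. The sketch below uses the standard normal-approximation formula for a two-sided, two-proportion test, n = (z_{1-α/2} + z_{power})² · (p₁(1−p₁) + p₂(1−p₂)) / (p₁ − p₂)². The function name and the defaults (α = 0.05, power = 0.8) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    if p_control == p_treatment:
        raise ValueError("the two rates must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Detecting a lift from a 10% to a 12% conversion rate.
    print(sample_size_per_group(0.10, 0.12))
```

Because the required sample grows with the inverse square of the effect, halving the detectable lift roughly quadruples the traffic needed.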
+
+## Phase 2: Model Development & Training
+
+<Task>
+subagent_type: ml-engineer
+prompt: |
+  Implement training pipeline based on requirements: {phase1.data-scientist.output}
+  Using data pipeline: {phase1.data-engineer.output}
+
+  Build comprehensive training system:
+  1. Training pipeline implementation:
+     - Modular training code with clear interfaces
+     - Hyperparameter optimization (Optuna/Ray Tune)
+     - Distributed training support (Horovod/PyTorch DDP)
+     - Cross-validation and ensemble strategies
+
+  2. Experiment tracking setup:
+     - MLflow/Weights & Biases integration
+     - Metric logging and visualization
+     - Artifact management (models, plots, data samples)
+     - Experiment comparison and analysis tools
+
+  3. Model registry integration:
+     - Version control and tagging strategy
+     - Model metadata and lineage
+     - Promotion workflows (dev -> staging -> prod)
+     - Rollback procedures
+
+  Provide complete training code with configuration management.
+</Task>
+
+<Task>
+subagent_type: python-pro
+prompt: |
+  Optimize and productionize ML code from: {phase2.ml-engineer.output}
+
+  Focus areas:
+  1. Code quality and structure:
+     - Refactor for production standards
+     - Add comprehensive error handling
+     - Implement proper logging with structured formats
+     - Create reusable components and utilities
+
+  2. Performance optimization:
+     - Profile and optimize bottlenecks
+     - Implement caching strategies
+     - Optimize data loading and preprocessing
+     - Memory management for large-scale training
+
+  3. Testing framework:
+     - Unit tests for data transformations
+     - Integration tests for pipeline components
+     - Model quality tests (invariance, directional)
+     - Performance regression tests
+
+  Deliver production-ready, maintainable code with full test coverage.
+</Task>
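The "model quality tests (invariance, directional)" item may be unfamiliar. An invariance test checks that a prediction does not change when an irrelevant input changes, and a directional test checks that it moves the expected way when a relevant input changes. The sketch below assumes examples are dictionaries and predictions are numbers; `toy_price_model` and both checker functions are hypothetical.

```python
def check_invariance(predict, example, feature, alternatives, tolerance=1e-9):
    """Return True if swapping `feature` for any alternative leaves the prediction unchanged."""
    baseline = predict(example)
    return all(
        abs(predict({**example, feature: alt}) - baseline) <= tolerance
        for alt in alternatives
    )


def check_directional(predict, example, feature, delta, expect_increase=True):
    """Return True if adding `delta` to `feature` moves the prediction in the expected direction."""
    before = predict(example)
    after = predict({**example, feature: example[feature] + delta})
    return after >= before if expect_increase else after <= before


def toy_price_model(example):
    # Hypothetical stand-in for a trained model: price rises with size and falls with age.
    return 1000.0 * example["square_meters"] - 500.0 * example["age_years"]


if __name__ == "__main__":
    house = {"listing_id": "A1", "square_meters": 80, "age_years": 10}
    print(check_invariance(toy_price_model, house, "listing_id", ["B2", "C3"]))
    print(check_directional(toy_price_model, house, "square_meters", delta=10))
```

Checks of this shape slot into an ordinary unit-test suite, so a retrained model that violates a known business rule fails CI before it reaches the registry.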
+
+## Phase 3: Production Deployment & Serving
+
+<Task>
+subagent_type: mlops-engineer
+prompt: |
+  Design production deployment for models from: {phase2.ml-engineer.output}
+  With optimized code from: {phase2.python-pro.output}
+
+  Implementation requirements:
+  1. Model serving infrastructure:
+     - REST/gRPC APIs with FastAPI/TorchServe
+     - Batch prediction pipelines (Airflow/Kubeflow)
+     - Stream processing (Kafka/Kinesis integration)
+     - Model serving platforms (KServe/Seldon Core)
+
+  2. Deployment strategies:
+     - Blue-green deployments for zero downtime
+     - Canary releases with traffic splitting
+     - Shadow deployments for validation
+     - A/B testing infrastructure
+
+  3. CI/CD pipeline:
+     - GitHub Actions/GitLab CI workflows
+     - Automated testing gates
+     - Model validation before deployment
+     - ArgoCD for GitOps deployment
+
+  4. Infrastructure as Code:
+     - Terraform modules for cloud resources
+     - Helm charts for Kubernetes deployments
+     - Docker multi-stage builds for optimization
+     - Secret management with Vault/Secrets Manager
+
+  Provide complete deployment configuration and automation scripts.
+</Task>
+
+<Task>
+subagent_type: kubernetes-architect
+prompt: |
+  Design Kubernetes infrastructure for ML workloads from: {phase3.mlops-engineer.output}
+
+  Kubernetes-specific requirements:
+  1. Workload orchestration:
+     - Training job scheduling with Kubeflow
+     - GPU resource allocation and sharing
+     - Spot/preemptible instance integration
+     - Priority classes and resource quotas
+
+  2. Serving infrastructure:
+     - HPA/VPA for autoscaling
+     - KEDA for event-driven scaling
+     - Istio service mesh for traffic management
+     - Model caching and warm-up strategies
+
+  3. Storage and data access:
+     - PVC strategies for training data
+     - Model artifact storage with CSI drivers
+     - Distributed storage for feature stores
+     - Cache layers for inference optimization
+
+  Provide Kubernetes manifests and Helm charts for entire ML platform.
+</Task>
+
+## Phase 4: Monitoring & Continuous Improvement
+
+<Task>
+subagent_type: observability-engineer
+prompt: |
+  Implement comprehensive monitoring for ML system deployed in: {phase3.mlops-engineer.output}
+  Using Kubernetes infrastructure: {phase3.kubernetes-architect.output}
+
+  Monitoring framework:
+  1. Model performance monitoring:
+     - Prediction accuracy tracking
+     - Latency and throughput metrics
+     - Feature importance shifts
+     - Business KPI correlation
+
+  2. Data and model drift detection:
+     - Statistical drift detection (KS test, PSI)
+     - Concept drift monitoring
+     - Feature distribution tracking
+     - Automated drift alerts and reports
+
+  3. System observability:
+     - Prometheus metrics for all components
+     - Grafana dashboards for visualization
+     - Distributed tracing with Jaeger/Zipkin
+     - Log aggregation with ELK/Loki
+
+  4. Alerting and automation:
+     - PagerDuty/Opsgenie integration
+     - Automated retraining triggers
+     - Performance degradation workflows
+     - Incident response runbooks
+
+  5. Cost tracking:
+     - Resource utilization metrics
+     - Cost allocation by model/experiment
+     - Optimization recommendations
+     - Budget alerts and controls
+
+  Deliver monitoring configuration, dashboards, and alert rules.
+</Task>
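Of the drift statistics named in this task, the Population Stability Index (PSI) is simple enough to compute by hand: bin a reference sample, bin the live sample with the same edges, and sum (actual − expected) · ln(actual / expected) over the bins. The sketch below uses equal-width bins and a small floor for empty bins; the function name, bin count, and floor value are assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """Compute PSI between a reference sample and a live sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            index = int((v - lo) / width)
            # Clamp values outside the reference range into the first or last bin.
            counts[min(max(index, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]
    shifted = [0.5 + i / 200 for i in range(100)]
    print(population_stability_index(reference, reference))
    print(population_stability_index(reference, shifted))
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, though the right alert threshold depends on the feature and should be tuned rather than assumed.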
+
+## Configuration Options
+
+- **experiment_tracking**: mlflow | wandb | neptune | clearml
+- **feature_store**: feast | tecton | databricks | custom
+- **serving_platform**: kserve | seldon | torchserve | triton
+- **orchestration**: kubeflow | airflow | prefect | dagster
+- **cloud_provider**: aws | azure | gcp | multi-cloud
+- **deployment_mode**: realtime | batch | streaming | hybrid
+- **monitoring_stack**: prometheus | datadog | newrelic | custom
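The options above read like an enumerated configuration. One way an orchestrator could reject a misspelled choice early is sketched below. `PipelineConfig` and `ALLOWED_OPTIONS` are hypothetical names, and the defaults (the first value listed for each option) are an assumption, since the source does not name any defaults.

```python
from dataclasses import dataclass

# Allowed values copied from the option list; the structure itself is illustrative.
ALLOWED_OPTIONS = {
    "experiment_tracking": {"mlflow", "wandb", "neptune", "clearml"},
    "feature_store": {"feast", "tecton", "databricks", "custom"},
    "serving_platform": {"kserve", "seldon", "torchserve", "triton"},
    "orchestration": {"kubeflow", "airflow", "prefect", "dagster"},
    "cloud_provider": {"aws", "azure", "gcp", "multi-cloud"},
    "deployment_mode": {"realtime", "batch", "streaming", "hybrid"},
    "monitoring_stack": {"prometheus", "datadog", "newrelic", "custom"},
}


@dataclass(frozen=True)
class PipelineConfig:
    # Defaults are assumed for the sake of the example.
    experiment_tracking: str = "mlflow"
    feature_store: str = "feast"
    serving_platform: str = "kserve"
    orchestration: str = "kubeflow"
    cloud_provider: str = "aws"
    deployment_mode: str = "realtime"
    monitoring_stack: str = "prometheus"

    def __post_init__(self):
        # Validate every field against its allowed set as soon as the config is built.
        for name, allowed in ALLOWED_OPTIONS.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


if __name__ == "__main__":
    print(PipelineConfig(serving_platform="triton", deployment_mode="batch"))
```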
+
+## Success Criteria
+
+1. **Data Pipeline Success**:
+   - < 0.1% data quality issues in production
+   - Automated data validation passing 99.9% of the time
+   - Complete data lineage tracking
+   - Sub-second feature serving latency
+
+2. **Model Performance**:
+   - Meeting or exceeding baseline metrics
+   - < 5% performance degradation before retraining
+   - Successful A/B tests with statistical significance
+   - No model drift left undetected for more than 24 hours
+
+3. **Operational Excellence**:
+   - 99.9% uptime for model serving
+   - < 200ms p99 inference latency
+   - Automated rollback within 5 minutes
+   - Complete observability with < 1 minute alert latency
+
+4. **Development Velocity**:
+   - < 1 hour from commit to production
+   - Parallel experiment execution
+   - Reproducible training runs
+   - Self-service model deployment
+
+5. **Cost Efficiency**:
+   - < 20% infrastructure waste
+   - Optimized resource allocation
+   - Automatic scaling based on load
+   - Spot instance utilization > 60%
+
+## Final Deliverables
+
+Upon completion, the orchestrated pipeline will provide:
+- End-to-end ML pipeline with full automation
+- Comprehensive documentation and runbooks
+- Production-ready infrastructure as code
+- Complete monitoring and alerting system
+- CI/CD pipelines for continuous improvement
+- Cost optimization and scaling strategies
+- Disaster recovery and rollback procedures