Initial commit
559
skills/anomaly-detector/SKILL.md
Normal file
@@ -0,0 +1,559 @@
|
||||
---
|
||||
name: anomaly-detector
|
||||
description: |
  Anomaly and outlier detection using Isolation Forest, One-Class SVM, autoencoders, and statistical methods. Activates for "anomaly detection", "outlier detection", "fraud detection", "intrusion detection", "abnormal behavior", "unusual patterns", "detect anomalies", "system monitoring". Handles supervised and unsupervised anomaly detection with SpecWeave increment integration.
|
||||
---
|
||||
|
||||
# Anomaly Detector
|
||||
|
||||
## Overview
|
||||
|
||||
Detect unusual patterns, outliers, and anomalies in data using statistical methods, machine learning, and deep learning. Critical for fraud detection, security monitoring, quality control, and system health monitoring—all integrated with SpecWeave's increment workflow.
|
||||
|
||||
## Why Anomaly Detection is Different
|
||||
|
||||
**Challenge**: Anomalies are rare (typically 0.1% to 5% of the data).

**Why standard classification struggles**:
|
||||
- ❌ Extreme class imbalance
|
||||
- ❌ Unknown anomaly patterns
|
||||
- ❌ Expensive to label anomalies
|
||||
- ❌ Anomalies evolve over time
|
||||
|
||||
**Anomaly detection approaches**:
|
||||
- ✅ Unsupervised (no labels needed)
|
||||
- ✅ Semi-supervised (learn from normal data)
|
||||
- ✅ Statistical (deviation from expected)
|
||||
- ✅ Context-aware (what's normal for this user/time/location?)
|
||||
|
||||
## Anomaly Detection Methods
|
||||
|
||||
### 1. Statistical Methods (Baseline)
|
||||
|
||||
**Z-Score / Standard Deviation**:
|
||||
```python
|
||||
from specweave import AnomalyDetector
|
||||
|
||||
detector = AnomalyDetector(
|
||||
method="statistical",
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Flag values > 3 standard deviations from mean
|
||||
anomalies = detector.detect(
|
||||
data=transaction_amounts,
|
||||
threshold=3.0
|
||||
)
|
||||
|
||||
# Simple, fast, but assumes normal distribution
|
||||
```
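
For readers without the `specweave` package at hand, the z-score rule above can be sketched with the standard library alone. This is a minimal illustration, not SpecWeave's implementation: the function name `zscore_outliers` and the sample data are assumptions made for the example.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


# Hypothetical data: 19 ordinary amounts and one extreme amount
amounts = [10.0] * 19 + [100.0]
print(zscore_outliers(amounts))  # [19]
```

One caveat worth keeping in mind: the outlier itself inflates the mean and standard deviation, so on very small samples (ten or fewer points with a population standard deviation) no value can exceed a threshold of 3.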
|
||||
|
||||
**IQR (Interquartile Range)**:
|
||||
```python
|
||||
# More robust to non-normal distributions
|
||||
detector = AnomalyDetector(method="iqr")
|
||||
|
||||
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
|
||||
anomalies = detector.detect(data=response_times)
|
||||
|
||||
# Good for skewed distributions
|
||||
```
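
The IQR fences in the comment above can be computed directly. The sketch below uses only the standard library; `iqr_outliers` and the sample data are illustrative names, and the quartile convention (`method="inclusive"`) is an assumption, since libraries differ slightly in how they interpolate quartiles.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


# Hypothetical response times in milliseconds
response_times_ms = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers, fences = iqr_outliers(response_times_ms)
print(outliers, fences)  # [100] (-3.5, 14.5)
```

Because quartiles barely move when one extreme value is added, the fences stay tight even on skewed data, which is why this rule is described as more robust than the z-score.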
|
||||
|
||||
### 2. Isolation Forest (Recommended)
|
||||
|
||||
**Best for**: General purpose, high-dimensional data
|
||||
|
||||
```python
|
||||
from specweave import IsolationForestDetector
|
||||
|
||||
detector = IsolationForestDetector(
|
||||
contamination=0.05, # Expected anomaly rate (5%)
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Train on normal data (or mixed data)
|
||||
detector.fit(X_train)
|
||||
|
||||
# Detect anomalies
|
||||
predictions = detector.predict(X_test)
|
||||
# -1 = anomaly, 1 = normal
|
||||
|
||||
anomaly_scores = detector.score(X_test)
|
||||
# Lower score = more anomalous
|
||||
|
||||
# Generates:
|
||||
# - Anomaly scores for all samples
|
||||
# - Feature importance (which features contribute to anomaly)
|
||||
# - Threshold visualization
|
||||
# - Top anomalies ranked by score
|
||||
```
|
||||
|
||||
**Why Isolation Forest works**:
|
||||
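- Fast (building a tree over n points takes roughly O(n log n); with fixed-size subsampling, cost grows about linearly with n)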
|
||||
- Handles high dimensions well
|
||||
- No assumptions about data distribution
|
||||
- Anomalies are easier to isolate (fewer splits)
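
The last point, that anomalies need fewer random splits to isolate, can be seen in a toy one-dimensional experiment. This is a deliberately simplified sketch and not a full Isolation Forest: a real forest also subsamples the data and picks a random feature at each split. The helper names and sample values are assumptions.

```python
import random


def isolation_depth(point, data, rng, max_depth=50):
    """Count random splits until `point` is alone in its partition."""
    depth = 0
    current = list(data)
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # only duplicates remain; cannot split further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the point.
        if point < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def mean_isolation_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1, 50.0]
outlier_depth = mean_isolation_depth(50.0, values)
inlier_depth = mean_isolation_depth(10.0, values)
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

The value 50.0 is usually separated by the very first split, while a value inside the dense cluster needs several; the average depth is what Isolation Forest turns into an anomaly score.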
|
||||
|
||||
### 3. One-Class SVM
|
||||
|
||||
**Best for**: When you have only normal data for training
|
||||
|
||||
```python
|
||||
from specweave import OneClassSVMDetector
|
||||
|
||||
# Train only on normal transactions
|
||||
detector = OneClassSVMDetector(
|
||||
kernel='rbf',
|
||||
nu=0.05, # Upper bound on the fraction of training points treated as outliers
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
detector.fit(X_normal)
|
||||
|
||||
# Detect anomalies in new data
|
||||
predictions = detector.predict(X_new)
|
||||
# -1 = anomaly, 1 = normal
|
||||
|
||||
# Good for: Clean training data of normal samples
|
||||
```
|
||||
|
||||
### 4. Autoencoders (Deep Learning)
|
||||
|
||||
**Best for**: Complex patterns, high-dimensional data, images
|
||||
|
||||
```python
|
||||
from specweave import AutoencoderDetector
|
||||
|
||||
# Learn to reconstruct normal data
|
||||
detector = AutoencoderDetector(
|
||||
encoding_dim=16, # Compressed representation (the bottleneck in `layers` below)
|
||||
layers=[64, 32, 16, 32, 64],
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Train on normal data
|
||||
detector.fit(
|
||||
X_normal,
|
||||
epochs=100,
|
||||
validation_split=0.2
|
||||
)
|
||||
|
||||
# Anomalies have high reconstruction error
|
||||
anomaly_scores = detector.score(X_test)
|
||||
|
||||
# Generates:
|
||||
# - Reconstruction error distribution
|
||||
# - Threshold recommendation
|
||||
# - Top anomalies with explanations
|
||||
# - Learned representations (t-SNE plot)
|
||||
```
|
||||
|
||||
**How autoencoders work**:
|
||||
```
|
||||
Input → Encoder → Compressed → Decoder → Reconstructed
|
||||
|
||||
Normal data: Low reconstruction error (learned well)
|
||||
Anomalies: High reconstruction error (never seen before)
|
||||
```
|
||||
|
||||
### 5. LOF (Local Outlier Factor)
|
||||
|
||||
**Best for**: Density-based anomalies (sparse regions)
|
||||
|
||||
```python
|
||||
from specweave import LOFDetector
|
||||
|
||||
# Detects points in low-density regions
|
||||
detector = LOFDetector(
|
||||
n_neighbors=20,
|
||||
contamination=0.05,
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
detector.fit(X_train)
|
||||
predictions = detector.predict(X_test)
|
||||
|
||||
# Good for: Clustered data with sparse anomalies
|
||||
```
|
||||
|
||||
## Anomaly Detection Workflows
|
||||
|
||||
### Workflow 1: Fraud Detection
|
||||
|
||||
```python
|
||||
from specweave import FraudDetectionPipeline
|
||||
|
||||
pipeline = FraudDetectionPipeline(increment="0042")
|
||||
|
||||
# Features: transaction amount, location, time, merchant, etc.
|
||||
pipeline.fit(normal_transactions)
|
||||
|
||||
# Real-time fraud detection
|
||||
fraud_scores = pipeline.predict_proba(new_transactions)
|
||||
|
||||
# For each transaction:
|
||||
# - Fraud probability (0-1)
|
||||
# - Anomaly score
|
||||
# - Contributing features
|
||||
# - Similar past cases
|
||||
|
||||
# Generates:
|
||||
# - Precision-Recall curve (fraud is rare)
|
||||
# - Cost-benefit analysis (false positives vs missed fraud)
|
||||
# - Feature importance for fraud
|
||||
# - Fraud patterns identified
|
||||
```
|
||||
|
||||
**Fraud Detection Best Practices**:
|
||||
```python
|
||||
# 1. Use multiple signals
|
||||
pipeline.add_signals([
|
||||
'amount_vs_user_average',
|
||||
'distance_from_home',
|
||||
'merchant_risk_score',
|
||||
'velocity_24h' # Transactions in last 24h
|
||||
])
|
||||
|
||||
# 2. Set threshold based on cost
|
||||
# False Positive cost: $5 (manual review)
|
||||
# False Negative cost: $500 (fraud loss)
|
||||
# Optimal threshold: Maximize (savings - review_cost)
|
||||
|
||||
# 3. Provide explanations
|
||||
explanation = pipeline.explain_prediction(suspicious_transaction)
|
||||
# "Flagged because: amount 10x user average, new merchant, foreign location"
|
||||
```
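
The cost-based threshold in step 2 can be worked through by hand. The sketch below sweeps candidate thresholds and keeps the one with the lowest total cost, which is equivalent to maximizing savings minus review cost. The function name, scores, and labels are hypothetical; only the $5 and $500 costs come from the text above.

```python
def best_threshold(scores, labels, fp_cost=5.0, fn_cost=500.0):
    """Return (threshold, total_cost); a transaction is flagged when score >= threshold."""
    candidates = sorted(set(scores)) + [float("inf")]  # inf = flag nothing
    best = None
    for t in candidates:
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = false_pos * fp_cost + false_neg * fn_cost
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Hypothetical fraud scores (higher = more suspicious) and true labels (1 = fraud)
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
threshold, cost = best_threshold(scores, labels)
print(threshold, cost)  # 0.4 5.0
```

With a 100:1 cost ratio the sweep accepts one unnecessary review (the 0.60 transaction) rather than miss the fraud scored at 0.40, which is the behaviour the cost figures are meant to produce.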
|
||||
|
||||
### Workflow 2: System Anomaly Detection
|
||||
|
||||
```python
|
||||
from specweave import SystemAnomalyPipeline
|
||||
|
||||
# Monitor system metrics (CPU, memory, latency, errors)
|
||||
pipeline = SystemAnomalyPipeline(increment="0042")
|
||||
|
||||
# Train on normal system behavior
|
||||
pipeline.fit(normal_metrics)
|
||||
|
||||
# Detect system anomalies
|
||||
anomalies = pipeline.detect(current_metrics)
|
||||
|
||||
# For each anomaly:
|
||||
# - Severity (low, medium, high, critical)
|
||||
# - Affected metrics
|
||||
# - Similar past incidents
|
||||
# - Recommended actions
|
||||
|
||||
# Generates:
|
||||
# - Anomaly timeline
|
||||
# - Metric correlations (which metrics moved together)
|
||||
# - Root cause analysis
|
||||
# - Alert rules
|
||||
```
|
||||
|
||||
**System Monitoring Best Practices**:
|
||||
```python
|
||||
# 1. Use time windows
|
||||
pipeline.add_time_windows([
|
||||
'5min', # Immediate spikes
|
||||
'1hour', # Short-term trends
|
||||
'24hour' # Daily patterns
|
||||
])
|
||||
|
||||
# 2. Correlate metrics
|
||||
pipeline.detect_correlations([
|
||||
('high_cpu', 'slow_response'),
|
||||
('memory_leak', 'increasing_errors')
|
||||
])
|
||||
|
||||
# 3. Reduce alert fatigue
|
||||
pipeline.set_alert_rules(
|
||||
min_severity='medium',
|
||||
min_duration='5min', # Ignore transient spikes
|
||||
max_alerts_per_hour=5
|
||||
)
|
||||
```
|
||||
|
||||
### Workflow 3: Manufacturing Quality Control
|
||||
|
||||
```python
|
||||
from specweave import QualityControlPipeline
|
||||
|
||||
# Detect defective products from sensor data
|
||||
pipeline = QualityControlPipeline(increment="0042")
|
||||
|
||||
# Train on good products
|
||||
pipeline.fit(good_product_sensors)
|
||||
|
||||
# Detect defects in production line
|
||||
defect_scores = pipeline.predict(production_line_data)
|
||||
|
||||
# Generates:
|
||||
# - Real-time defect alerts
|
||||
# - Defect rate trends
|
||||
# - Most common defect patterns
|
||||
# - Preventive maintenance recommendations
|
||||
```
|
||||
|
||||
### Workflow 4: Network Intrusion Detection
|
||||
|
||||
```python
|
||||
from specweave import IntrusionDetectionPipeline
|
||||
|
||||
# Detect malicious network traffic
|
||||
pipeline = IntrusionDetectionPipeline(increment="0042")
|
||||
|
||||
# Features: packet size, frequency, ports, protocols, etc.
|
||||
pipeline.fit(normal_network_traffic)
|
||||
|
||||
# Detect intrusions
|
||||
intrusions = pipeline.detect(network_traffic_stream)
|
||||
|
||||
# Generates:
|
||||
# - Attack type classification (DDoS, port scan, etc.)
|
||||
# - Severity scores
|
||||
# - Source IPs
|
||||
# - Attack timeline
|
||||
```
|
||||
|
||||
## Evaluation Metrics
|
||||
|
||||
**Anomaly detection metrics** (different from classification):
|
||||
|
||||
```python
|
||||
from specweave import AnomalyEvaluator
|
||||
|
||||
evaluator = AnomalyEvaluator(increment="0042")
|
||||
|
||||
metrics = evaluator.evaluate(
|
||||
y_true=true_labels, # 0=normal, 1=anomaly
|
||||
y_pred=predictions, # same encoding as y_true; convert -1/1 detector output first
|
||||
y_scores=anomaly_scores
|
||||
)
|
||||
```
|
||||
|
||||
**Key Metrics**:
|
||||
|
||||
1. **Precision @ K** - Of top K flagged anomalies, how many are real?
|
||||
```python
|
||||
precision_at_100 = evaluator.precision_at_k(k=100)
|
||||
# "Of 100 flagged transactions, 85 were actual fraud" = 85%
|
||||
```
|
||||
|
||||
2. **Recall @ K** - Of all real anomalies, how many did we catch in top K?
|
||||
```python
|
||||
recall_at_100 = evaluator.recall_at_k(k=100)
|
||||
# "We caught 78% of all fraud in top 100 flagged"
|
||||
```
|
||||
|
||||
3. **ROC AUC** - Overall discrimination ability
|
||||
```python
|
||||
roc_auc = evaluator.roc_auc(y_true, y_scores)
|
||||
# 0.95 = excellent discrimination
|
||||
```
|
||||
|
||||
4. **PR AUC** - Better for imbalanced data
|
||||
```python
|
||||
pr_auc = evaluator.pr_auc(y_true, y_scores)
|
||||
# More informative when anomalies are rare (<5%)
|
||||
```
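
Precision @ K and Recall @ K are simple enough to compute by hand, which helps when sanity-checking a report. The sketch below is independent of `AnomalyEvaluator`; the function name and sample data are assumptions. It expects scores where higher means more anomalous, so scores from a detector where lower means more anomalous (as with the Isolation Forest example above) would need to be negated first.

```python
def precision_recall_at_k(y_true, scores, k):
    """y_true uses 1 = anomaly, 0 = normal; higher score = more anomalous."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])  # true anomalies in the top k
    total_anomalies = sum(y_true)
    precision = hits / k
    recall = hits / total_anomalies if total_anomalies else 0.0
    return precision, recall


# Hypothetical labels and scores for ten samples (three true anomalies)
y_true = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
scores = [0.1, 0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.4, 0.2, 0.05]
precision, recall = precision_recall_at_k(y_true, scores, k=3)
print(f"precision@3={precision:.2f}, recall@3={recall:.2f}")
```

Here the top three scores contain two of the three anomalies, so both values come out at two thirds.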
|
||||
|
||||
**Evaluation Report**:
|
||||
```markdown
|
||||
# Anomaly Detection Evaluation
|
||||
|
||||
## Dataset
|
||||
- Total samples: 100,000
|
||||
- Anomalies: 500 (0.5%)
|
||||
- Features: 25
|
||||
|
||||
## Method: Isolation Forest
|
||||
|
||||
## Performance Metrics
|
||||
- ROC AUC: 0.94 ✅ (excellent)
|
||||
- PR AUC: 0.78 ✅ (good for 0.5% anomaly rate)
|
||||
|
||||
## Precision-Recall Tradeoff
|
||||
- Precision @ 100: 85% (85 true anomalies in top 100)
|
||||
- Recall @ 100: 17% (caught 17% of all anomalies)
|
||||
- Precision @ 500: 62% (310 true anomalies in top 500)
|
||||
- Recall @ 500: 62% (caught 62% of all anomalies)
|
||||
|
||||
## Business Impact (Fraud Detection Example)
|
||||
- Review budget: 500 transactions/day
|
||||
- At Precision @ 500 = 62%:
  - True fraud caught: 310/day ($155,000 saved)
  - False positives: 190/day ($950 review cost)
  - Net benefit: $154,050/day ✅
|
||||
|
||||
## Recommendation
|
||||
✅ DEPLOY with threshold for top 500 (62% precision)
|
||||
```
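
The business-impact figures in the report follow from simple arithmetic, reproduced here as a check. The helper name is hypothetical, and the per-case costs ($500 per fraud, $5 per review) are assumed to be the ones quoted in the fraud detection best practices; like the report, the sketch counts review cost only for false positives.

```python
def daily_net_benefit(flagged, true_positives, loss_per_fraud, cost_per_review):
    """Return (savings, review_cost, net_benefit) for one day of reviews."""
    false_positives = flagged - true_positives
    savings = true_positives * loss_per_fraud
    review_cost = false_positives * cost_per_review
    return savings, review_cost, savings - review_cost


# 500 reviews per day at 62% precision means 310 true fraud cases
savings, review_cost, net = daily_net_benefit(500, 310, 500, 5)
print(savings, review_cost, net)  # 155000 950 154050
```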
|
||||
|
||||
## Integration with SpecWeave
|
||||
|
||||
### Increment Structure
|
||||
|
||||
```
|
||||
.specweave/increments/0042-fraud-detection/
├── spec.md (detection requirements, business impact)
├── plan.md (method selection, threshold tuning)
├── tasks.md
├── data/
│   ├── normal_transactions.csv
│   ├── labeled_fraud.csv (if available)
│   └── schema.yaml
├── experiments/
│   ├── statistical-baseline/
│   ├── isolation-forest/
│   ├── one-class-svm/
│   └── autoencoder/
├── models/
│   ├── isolation_forest_model.pkl
│   └── threshold_config.json
├── evaluation/
│   ├── precision_recall_curve.png
│   ├── roc_curve.png
│   ├── top_anomalies.csv
│   └── evaluation_report.md
└── deployment/
    ├── real_time_api.py
    ├── monitoring_dashboard.json
    └── alert_rules.yaml
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Start with Labeled Anomalies (if available)
|
||||
|
||||
```python
|
||||
# Use labeled data to validate unsupervised methods
|
||||
detector.fit(X_train) # Unlabeled
|
||||
|
||||
# Evaluate on labeled test set
|
||||
metrics = evaluator.evaluate(y_true_test, detector.predict(X_test))
|
||||
|
||||
# Choose method with best precision @ K
|
||||
```
|
||||
|
||||
### 2. Tune Contamination Parameter
|
||||
|
||||
```python
|
||||
# Try different contamination rates
|
||||
for contamination in [0.01, 0.05, 0.1, 0.2]:
    detector = IsolationForestDetector(contamination=contamination)
    detector.fit(X_train)

    metrics = evaluator.evaluate(y_test, detector.predict(X_test))
|
||||
|
||||
# Choose contamination that maximizes business value
|
||||
```
|
||||
|
||||
### 3. Explain Anomalies
|
||||
|
||||
```python
|
||||
# Don't just flag anomalies - explain why
|
||||
explainer = AnomalyExplainer(detector, increment="0042")
|
||||
|
||||
for anomaly in top_anomalies:
    explanation = explainer.explain(anomaly)
    print(f"Anomaly: {anomaly.id}")
    print("Reasons:")
    print(f"  - {explanation.top_features}")
    print(f"  - Similar cases: {explanation.similar_cases}")
|
||||
```
|
||||
|
||||
### 4. Handle Concept Drift
|
||||
|
||||
```python
|
||||
# Anomalies evolve over time
|
||||
monitor = AnomalyMonitor(increment="0042")
|
||||
|
||||
# Track detection performance
|
||||
monitor.track_daily_performance()
|
||||
|
||||
# Retrain when detection performance drops
if monitor.performance_degraded():
    detector.retrain(new_normal_data)
|
||||
```
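
What "performance degraded" means is left open above. One possible reading, sketched below, is to compare the recent average precision of reviewed alerts against a baseline. The class name, window size, and tolerance are all assumptions for illustration and are not part of SpecWeave's `AnomalyMonitor`.

```python
from collections import deque


class PrecisionDriftMonitor:
    """Tracks daily precision of reviewed alerts over a rolling window."""

    def __init__(self, baseline_precision, window=7, tolerance=0.10):
        self.baseline = baseline_precision
        self.tolerance = tolerance
        self.window = window
        self.history = deque(maxlen=window)  # keeps only the most recent days

    def record_day(self, confirmed_anomalies, flagged):
        """Store one day's precision; days without alerts are skipped."""
        if flagged > 0:
            self.history.append(confirmed_anomalies / flagged)

    def performance_degraded(self):
        """True once a full window averages below baseline minus tolerance."""
        if len(self.history) < self.window:
            return False  # not enough evidence yet
        recent = sum(self.history) / len(self.history)
        return recent < self.baseline - self.tolerance


monitor = PrecisionDriftMonitor(baseline_precision=0.62)
for _ in range(7):
    monitor.record_day(confirmed_anomalies=62, flagged=100)
stable = monitor.performance_degraded()

for _ in range(7):
    monitor.record_day(confirmed_anomalies=40, flagged=100)
drifted = monitor.performance_degraded()
print(stable, drifted)  # False True
```

Requiring a full window before raising the flag avoids retraining on a single bad day.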
|
||||
|
||||
### 5. Set Business-Driven Thresholds
|
||||
|
||||
```python
|
||||
# Balance false positives vs false negatives
|
||||
optimizer = ThresholdOptimizer(increment="0042")
|
||||
|
||||
optimal_threshold = optimizer.find_optimal(
|
||||
detector=detector,
|
||||
data=validation_data,
|
||||
false_positive_cost=5, # $5 per manual review
|
||||
false_negative_cost=500 # $500 per missed fraud
|
||||
)
|
||||
|
||||
# Use optimal threshold for deployment
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### 1. Ensemble Anomaly Detection
|
||||
|
||||
```python
|
||||
# Combine multiple detectors
|
||||
ensemble = AnomalyEnsemble(increment="0042")
|
||||
|
||||
ensemble.add_detector("isolation_forest", weight=0.4)
|
||||
ensemble.add_detector("one_class_svm", weight=0.3)
|
||||
ensemble.add_detector("autoencoder", weight=0.3)
|
||||
|
||||
# Ensemble vote (more robust)
|
||||
anomalies = ensemble.detect(X_test)
|
||||
```
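
How the weighted combination might work is sketched below: each detector's scores are rescaled to [0, 1] so they are comparable, then averaged using the weights. This is one common approach and only an assumption about what `AnomalyEnsemble` does internally; the function names and scores are made up for the example.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_scores(score_lists, weights):
    """Weighted average of per-detector scores after min-max normalization."""
    normalized = [min_max(scores) for scores in score_lists]
    total_weight = sum(weights)
    n_samples = len(score_lists[0])
    return [
        sum(w * norm[i] for w, norm in zip(weights, normalized)) / total_weight
        for i in range(n_samples)
    ]


# Hypothetical scores from three detectors (higher = more anomalous), four samples
forest_scores = [0.1, 0.2, 0.9, 0.3]
svm_scores = [1.0, 2.0, 8.0, 1.5]
autoencoder_errors = [0.01, 0.02, 0.50, 0.05]

combined = ensemble_scores(
    [forest_scores, svm_scores, autoencoder_errors],
    weights=[0.4, 0.3, 0.3],
)
most_anomalous = max(range(len(combined)), key=combined.__getitem__)
print(most_anomalous)  # 2
```

Normalizing first matters because raw scores from different detectors live on different scales; without it, the detector with the largest numeric range would dominate regardless of its weight.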
|
||||
|
||||
### 2. Contextual Anomaly Detection
|
||||
|
||||
```python
|
||||
# What's normal varies by context
|
||||
detector = ContextualAnomalyDetector(increment="0042")
|
||||
|
||||
# Different normality for different contexts
|
||||
detector.fit(data, contexts=['user_id', 'time_of_day', 'location'])
|
||||
|
||||
# $10 transaction: Normal for user A, anomaly for user B
|
||||
```
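
The user A versus user B example can be made concrete with per-user statistics. The sketch below keeps a separate mean and standard deviation per user and scores a new amount against that user's own history. It illustrates the idea only and is not the `ContextualAnomalyDetector` implementation; the names and data are hypothetical.

```python
import statistics
from collections import defaultdict


def fit_context_stats(history):
    """history: iterable of (user_id, amount). Returns {user_id: (mean, std)}."""
    by_user = defaultdict(list)
    for user_id, amount in history:
        by_user[user_id].append(amount)
    return {
        user: (statistics.fmean(amounts), statistics.pstdev(amounts))
        for user, amounts in by_user.items()
    }


def is_contextual_anomaly(stats, user_id, amount, threshold=3.0):
    """Flag an amount that is far from this particular user's usual spending."""
    mean, std = stats[user_id]
    if std == 0:
        return amount != mean
    return abs(amount - mean) / std > threshold


history = [("A", a) for a in [8, 12, 9, 11, 10]]
history += [("B", a) for a in [950, 1000, 1050, 1020, 980]]
stats = fit_context_stats(history)

print(is_contextual_anomaly(stats, "A", 10))  # False: typical for user A
print(is_contextual_anomaly(stats, "B", 10))  # True: far below user B's usual amounts
```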
|
||||
|
||||
### 3. Sequential Anomaly Detection
|
||||
|
||||
```python
|
||||
# Detect anomalous sequences (not just individual points)
|
||||
detector = SequenceAnomalyDetector(
|
||||
method='lstm',
|
||||
window_size=10,
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Example: Login from unusual sequence of locations
|
||||
```
|
||||
|
||||
## Commands
|
||||
|
||||
```bash
|
||||
# Train anomaly detector
|
||||
/ml:train-anomaly-detector 0042
|
||||
|
||||
# Evaluate detector
|
||||
/ml:evaluate-anomaly-detector 0042
|
||||
|
||||
# Explain top anomalies
|
||||
/ml:explain-anomalies 0042 --top 100
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
Anomaly detection is critical for:
|
||||
- ✅ Fraud detection (financial transactions)
|
||||
- ✅ Security monitoring (intrusion detection)
|
||||
- ✅ Quality control (manufacturing defects)
|
||||
- ✅ System health (performance monitoring)
|
||||
- ✅ Business intelligence (unusual patterns)
|
||||
|
||||
This skill provides battle-tested methods integrated with SpecWeave's increment workflow, ensuring anomaly detectors are reproducible, explainable, and business-aligned.
|
||||
485
skills/automl-optimizer/SKILL.md
Normal file
@@ -0,0 +1,485 @@
|
||||
---
|
||||
name: automl-optimizer
|
||||
description: |
  Automated machine learning with hyperparameter optimization using Optuna, Hyperopt, or AutoML libraries. Activates for "automl", "hyperparameter tuning", "optimize hyperparameters", "auto tune model", "neural architecture search", "automated ml". Systematically explores model and hyperparameter spaces, tracks all experiments, and finds optimal configurations with minimal manual intervention.
|
||||
---
|
||||
|
||||
# AutoML Optimizer
|
||||
|
||||
## Overview
|
||||
|
||||
Automates the tedious process of hyperparameter tuning and model selection. Instead of manually trying different configurations, define a search space and let AutoML find the optimal configuration through intelligent exploration.
|
||||
|
||||
## Why AutoML?
|
||||
|
||||
**Manual Tuning Problems**:
|
||||
- Time-consuming (hours/days of trial and error)
|
||||
- Subjective (depends on intuition)
|
||||
- Incomplete (can't try all combinations)
|
||||
- Not reproducible (hard to document search process)
|
||||
|
||||
**AutoML Benefits**:
|
||||
- ✅ Systematic exploration of search space
|
||||
- ✅ Intelligent sampling (Bayesian optimization)
|
||||
- ✅ All experiments tracked automatically
|
||||
- ✅ Find optimal configuration faster
|
||||
- ✅ Reproducible (search process documented)
|
||||
|
||||
## AutoML Strategies
|
||||
|
||||
### Strategy 1: Hyperparameter Optimization (Optuna)
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

from specweave import OptunaOptimizer

# Define search space
def objective(trial):
    # Suggest hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0)
    }

    # Train model
    model = XGBClassifier(**params)

    # Cross-validation score
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')

    return scores.mean()
|
||||
|
||||
# Run optimization
|
||||
optimizer = OptunaOptimizer(
|
||||
objective=objective,
|
||||
n_trials=100,
|
||||
direction='maximize',
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
best_params = optimizer.optimize()
|
||||
|
||||
# Creates:
|
||||
# - .specweave/increments/0042.../experiments/optuna-study/
|
||||
# ├── study.db (Optuna database)
|
||||
# ├── optimization_history.png
|
||||
# ├── param_importances.png
|
||||
# ├── parallel_coordinate.png
|
||||
# └── best_params.json
|
||||
```
|
||||
|
||||
**Optimization Report**:
|
||||
```markdown
|
||||
# Optuna Optimization Report
|
||||
|
||||
## Search Space
|
||||
- n_estimators: [100, 1000]
|
||||
- max_depth: [3, 10]
|
||||
- learning_rate: [0.01, 0.3] (log scale)
|
||||
- subsample: [0.5, 1.0]
|
||||
- colsample_bytree: [0.5, 1.0]
|
||||
|
||||
## Trials: 100
|
||||
- Completed: 98
|
||||
- Pruned: 2 (early stopping)
|
||||
- Failed: 0
|
||||
|
||||
## Best Trial (#47)
|
||||
- ROC AUC: 0.892 ± 0.012
|
||||
- Parameters:
  - n_estimators: 673
  - max_depth: 6
  - learning_rate: 0.094
  - subsample: 0.78
  - colsample_bytree: 0.91
|
||||
|
||||
## Parameter Importance
|
||||
1. learning_rate (0.42) - Most important
|
||||
2. n_estimators (0.28)
|
||||
3. max_depth (0.18)
|
||||
4. colsample_bytree (0.08)
|
||||
5. subsample (0.04) - Least important
|
||||
|
||||
## Improvement over Default
|
||||
- Default params: ROC AUC = 0.856
|
||||
- Optimized params: ROC AUC = 0.892
|
||||
- Improvement: +4.2%
|
||||
```
|
||||
|
||||
### Strategy 2: Algorithm Selection + Tuning
|
||||
|
||||
```python
|
||||
from specweave import AutoMLPipeline
|
||||
|
||||
# Define candidate algorithms with search spaces
|
||||
pipeline = AutoMLPipeline(increment="0042")
|
||||
|
||||
# Add candidates
|
||||
pipeline.add_candidate(
|
||||
name="xgboost",
|
||||
model=XGBClassifier,
|
||||
search_space={
|
||||
'n_estimators': (100, 1000),
|
||||
'max_depth': (3, 10),
|
||||
'learning_rate': (0.01, 0.3)
|
||||
}
|
||||
)
|
||||
|
||||
pipeline.add_candidate(
|
||||
name="lightgbm",
|
||||
model=LGBMClassifier,
|
||||
search_space={
|
||||
'n_estimators': (100, 1000),
|
||||
'max_depth': (3, 10),
|
||||
'learning_rate': (0.01, 0.3)
|
||||
}
|
||||
)
|
||||
|
||||
pipeline.add_candidate(
|
||||
name="random_forest",
|
||||
model=RandomForestClassifier,
|
||||
search_space={
|
||||
'n_estimators': (100, 500),
|
||||
'max_depth': (3, 20),
|
||||
'min_samples_split': (2, 20)
|
||||
}
|
||||
)
|
||||
|
||||
pipeline.add_candidate(
|
||||
name="logistic_regression",
|
||||
model=LogisticRegression,
|
||||
search_space={
|
||||
'C': (0.001, 100),
|
||||
'penalty': ['l1', 'l2']  # 'l1' requires solver='liblinear' or 'saga' in scikit-learn
|
||||
}
|
||||
)
|
||||
|
||||
# Run AutoML (tries all algorithms + hyperparameters)
|
||||
results = pipeline.fit(
|
||||
X_train, y_train,
|
||||
n_trials_per_model=50,
|
||||
cv_folds=5,
|
||||
metric='roc_auc'
|
||||
)
|
||||
|
||||
# Best model automatically selected
|
||||
best_model = pipeline.best_model_
|
||||
best_params = pipeline.best_params_
|
||||
```
|
||||
|
||||
**AutoML Comparison**:
|
||||
```markdown
|
||||
| Model | Trials | Best Score | Mean Score | Std | Best Params |
|
||||
|---------------------|--------|------------|------------|-------|--------------------------------------|
|
||||
| xgboost | 50 | 0.892 | 0.876 | 0.012 | n_est=673, depth=6, lr=0.094 |
|
||||
| lightgbm | 50 | 0.889 | 0.873 | 0.011 | n_est=542, depth=7, lr=0.082 |
|
||||
| random_forest | 50 | 0.871 | 0.858 | 0.015 | n_est=384, depth=12, min_split=5 |
|
||||
| logistic_regression | 50 | 0.845 | 0.840 | 0.008 | C=1.234, penalty=l2 |
|
||||
|
||||
**Winner: XGBoost** (ROC AUC = 0.892)
|
||||
```
|
||||
|
||||
### Strategy 3: Neural Architecture Search (NAS)
|
||||
|
||||
```python
|
||||
from specweave import NeuralArchitectureSearch
|
||||
|
||||
# For deep learning
|
||||
nas = NeuralArchitectureSearch(increment="0042")
|
||||
|
||||
# Define search space
|
||||
search_space = {
|
||||
'num_layers': (2, 5),
|
||||
'layer_sizes': (32, 512),
|
||||
'activation': ['relu', 'tanh', 'elu'],
|
||||
'dropout': (0.0, 0.5),
|
||||
'optimizer': ['adam', 'sgd', 'rmsprop'],
|
||||
'learning_rate': (0.0001, 0.01)
|
||||
}
|
||||
|
||||
# Search for best architecture
|
||||
best_architecture = nas.search(
|
||||
X_train, y_train,
|
||||
search_space=search_space,
|
||||
n_trials=100,
|
||||
max_epochs=50
|
||||
)
|
||||
|
||||
# Creates: Best neural network architecture
|
||||
```
|
||||
|
||||
## AutoML Frameworks Integration
|
||||
|
||||
### Optuna (Recommended)
|
||||
|
||||
```python
|
||||
import optuna
|
||||
from specweave import configure_optuna
|
||||
|
||||
# Auto-configures Optuna to log to increment
|
||||
configure_optuna(increment="0042")
|
||||
|
||||
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
    }

    model = XGBClassifier(**params)
    score = cross_val_score(model, X, y, cv=5).mean()
    return score
|
||||
|
||||
study = optuna.create_study(direction='maximize')
|
||||
study.optimize(objective, n_trials=100)
|
||||
|
||||
# Automatically logged to increment folder
|
||||
```
|
||||
|
||||
### Auto-sklearn
|
||||
|
||||
```python
|
||||
from specweave import AutoSklearnOptimizer
|
||||
|
||||
# Automated model selection + feature engineering
|
||||
optimizer = AutoSklearnOptimizer(
|
||||
time_left_for_this_task=3600, # 1 hour
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
optimizer.fit(X_train, y_train)
|
||||
|
||||
# Auto-sklearn tries:
|
||||
# - Multiple algorithms
|
||||
# - Feature preprocessing combinations
|
||||
# - Ensemble methods
|
||||
# Returns best pipeline
|
||||
```
|
||||
|
||||
### H2O AutoML
|
||||
|
||||
```python
|
||||
from specweave import H2OAutoMLOptimizer
|
||||
|
||||
optimizer = H2OAutoMLOptimizer(
|
||||
max_runtime_secs=3600, # 1 hour
|
||||
max_models=50,
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
optimizer.fit(X_train, y_train)
|
||||
|
||||
# H2O tries many algorithms in parallel
|
||||
# Returns leaderboard + best model
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Start with Default Baseline
|
||||
|
||||
```python
|
||||
# Always compare AutoML to default hyperparameters
|
||||
baseline_model = XGBClassifier() # Default params
|
||||
baseline_score = cross_val_score(baseline_model, X, y, cv=5).mean()
|
||||
|
||||
# Then optimize
|
||||
optimizer = OptunaOptimizer(objective, n_trials=100)
|
||||
optimized_params = optimizer.optimize()
optimized_score = cross_val_score(XGBClassifier(**optimized_params), X, y, cv=5).mean()
|
||||
|
||||
improvement = (optimized_score - baseline_score) / baseline_score * 100
|
||||
print(f"Improvement: {improvement:.1f}%")
|
||||
|
||||
# Only use optimized if significant improvement (>2-3%)
|
||||
```
|
||||
|
||||
### 2. Use Cross-Validation
|
||||
|
||||
```python
|
||||
# ❌ Wrong: Single train/test split
|
||||
score = model.score(X_test, y_test)
|
||||
|
||||
# ✅ Correct: Cross-validation
|
||||
scores = cross_val_score(model, X_train, y_train, cv=5)
|
||||
score = scores.mean()
|
||||
|
||||
# Prevents overfitting to specific train/test split
|
||||
```
|
||||
|
||||
### 3. Set Reasonable Search Budgets
|
||||
|
||||
```python
|
||||
# Quick exploration (development)
|
||||
optimizer.optimize(n_trials=20) # ~5-10 minutes
|
||||
|
||||
# Moderate search (iteration)
|
||||
optimizer.optimize(n_trials=100) # ~30-60 minutes
|
||||
|
||||
# Thorough search (final model)
|
||||
optimizer.optimize(n_trials=500) # ~2-4 hours
|
||||
|
||||
# Don't overdo it: diminishing returns after ~100-200 trials
|
||||
```
|
||||
|
||||
### 4. Prune Unpromising Trials
|
||||
|
||||
```python
|
||||
# Optuna can stop bad trials early
|
||||
study = optuna.create_study(
|
||||
direction='maximize',
|
||||
pruner=optuna.pruners.MedianPruner()
|
||||
)
|
||||
|
||||
# If trial is performing worse than median at epoch N, stop it
|
||||
# Saves time by not fully training bad models
|
||||
```
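
Note that in Optuna a pruner only takes effect when the objective reports intermediate values with `trial.report(value, step)` and checks `trial.should_prune()`, raising `optuna.TrialPruned` to stop. The decision rule itself is easy to picture. The sketch below is a simplified stand-in for the median rule, not Optuna's actual `MedianPruner` code, and the function name is hypothetical.

```python
import statistics


def should_prune(current_value, earlier_values_at_step, direction="maximize"):
    """Stop a trial whose intermediate value is worse than the median of
    earlier trials at the same step."""
    if not earlier_values_at_step:
        return False  # nothing to compare against yet
    median = statistics.median(earlier_values_at_step)
    if direction == "maximize":
        return current_value < median
    return current_value > median


# Validation AUC of three earlier trials at epoch 5 (hypothetical numbers)
earlier = [0.80, 0.82, 0.78]
print(should_prune(0.70, earlier))  # True: below the median of 0.80
print(should_prune(0.81, earlier))  # False: above the median
```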
|
||||
|
||||
### 5. Document Search Space Rationale
|
||||
|
||||
```python
|
||||
# Document why you chose specific ranges
|
||||
search_space = {
|
||||
# XGBoost recommends max_depth 3-10 for most tasks
|
||||
'max_depth': (3, 10),
|
||||
|
||||
# Learning rate: 0.01-0.3 covers slow to fast learning
|
||||
# Log scale to spend more trials on smaller values
|
||||
'learning_rate': (0.01, 0.3, 'log'),
|
||||
|
||||
# n_estimators: Balance accuracy vs training time
|
||||
'n_estimators': (100, 1000)
|
||||
}
|
||||
```
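
The comment about log scale spending more trials on smaller values can be checked numerically. The sketch below compares log-uniform and plain uniform sampling over the same learning-rate range; the helper name, seed, and cutoff are arbitrary choices for the illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
n_samples = 10_000
log_samples = [sample_log_uniform(0.01, 0.3, rng) for _ in range(n_samples)]
uniform_samples = [rng.uniform(0.01, 0.3) for _ in range(n_samples)]

cutoff = 0.05
share_log = sum(s < cutoff for s in log_samples) / n_samples
share_uniform = sum(s < cutoff for s in uniform_samples) / n_samples
print(f"below {cutoff}: log-uniform {share_log:.0%}, uniform {share_uniform:.0%}")
```

In expectation about 47% of log-uniform draws fall below 0.05 (ln 5 / ln 30), compared with about 14% of uniform draws (0.04 / 0.29), so the small learning rates get far more coverage.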
|
||||
|
||||
## Integration with SpecWeave
|
||||
|
||||
### Automatic Experiment Tracking
|
||||
|
||||
```python
|
||||
# All AutoML trials logged automatically
|
||||
optimizer = OptunaOptimizer(objective, increment="0042")
|
||||
optimizer.optimize(n_trials=100)
|
||||
|
||||
# Creates:
|
||||
# .specweave/increments/0042.../experiments/
|
||||
# ├── optuna-trial-001/
|
||||
# ├── optuna-trial-002/
|
||||
# ├── ...
|
||||
# ├── optuna-trial-100/
|
||||
# └── optuna-summary.md
|
||||
```
|
||||
|
||||
### Living Docs Integration
|
||||
|
||||
```bash
|
||||
/specweave:sync-docs update
|
||||
```
|
||||
|
||||
Updates:
|
||||
````markdown
<!-- .specweave/docs/internal/architecture/ml-optimization.md -->

## Hyperparameter Optimization (Increment 0042)

### Optimization Strategy
- Framework: Optuna (Bayesian optimization)
- Trials: 100
- Search space: 5 hyperparameters
- Metric: ROC AUC (5-fold CV)

### Results
- Best score: 0.892 ± 0.012
- Improvement over default: +4.2%
- Most important param: learning_rate (0.42)

### Selected Hyperparameters
```python
{
    'n_estimators': 673,
    'max_depth': 6,
    'learning_rate': 0.094,
    'subsample': 0.78,
    'colsample_bytree': 0.91
}
```

### Recommendation
XGBoost with optimized hyperparameters for production deployment.
````
|
||||
|
||||
## Commands
|
||||
|
||||
```bash
|
||||
# Run AutoML optimization
|
||||
/ml:optimize 0042 --trials 100
|
||||
|
||||
# Compare algorithms
|
||||
/ml:compare-algorithms 0042
|
||||
|
||||
# Show optimization history
|
||||
/ml:optimization-report 0042
|
||||
```
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Pattern 1: Coarse-to-Fine Optimization
|
||||
|
||||
```python
|
||||
# Step 1: Coarse search (wide ranges, few trials)
|
||||
coarse_space = {
|
||||
'n_estimators': (100, 1000, 'int'),
|
||||
'max_depth': (3, 10, 'int'),
|
||||
'learning_rate': (0.01, 0.3, 'log')
|
||||
}
|
||||
coarse_results = optimizer.optimize(coarse_space, n_trials=50)
|
||||
|
||||
# Step 2: Fine search (narrow ranges around best)
|
||||
best_params = coarse_results['best_params']
|
||||
fine_space = {
|
||||
'n_estimators': (best_params['n_estimators'] - 100,
|
||||
best_params['n_estimators'] + 100),
|
||||
'max_depth': (max(3, best_params['max_depth'] - 1),
|
||||
min(10, best_params['max_depth'] + 1)),
|
||||
'learning_rate': (best_params['learning_rate'] * 0.5,
|
||||
best_params['learning_rate'] * 1.5, 'log')
|
||||
}
|
||||
fine_results = optimizer.optimize(fine_space, n_trials=50)
|
||||
```
|
||||
|
||||
### Pattern 2: Multi-Objective Optimization
|
||||
|
||||
```python
|
||||
import time

# Optimize for multiple objectives (accuracy + speed)
def multi_objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
    }

    model = XGBClassifier(**params)

    # Objective 1: Accuracy
    accuracy = cross_val_score(model, X, y, cv=5).mean()

    # Objective 2: Training time
    start = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start

    # Return the raw training time: the second direction below is 'minimize',
    # so negating it here would reward slower models
    return accuracy, training_time

# Optuna will find Pareto-optimal solutions
study = optuna.create_study(directions=['maximize', 'minimize'])
study.optimize(multi_objective, n_trials=100)
|
||||
```
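
A Pareto-optimal trial is one that no other trial beats on both objectives at once. The sketch below filters a list of (accuracy, training time) pairs down to that set. It is a plain-Python illustration with made-up numbers, not a call into Optuna.

```python
def pareto_front(trials):
    """trials: list of (accuracy, training_time); higher accuracy and lower time are better.
    Returns the trials that are not dominated by any other trial."""
    front = []
    for i, (acc_i, time_i) in enumerate(trials):
        dominated = any(
            acc_j >= acc_i and time_j <= time_i and (acc_j > acc_i or time_j < time_i)
            for j, (acc_j, time_j) in enumerate(trials)
            if j != i
        )
        if not dominated:
            front.append((acc_i, time_i))
    return front


# Hypothetical (accuracy, training seconds) results
trials = [(0.90, 120.0), (0.88, 30.0), (0.85, 45.0), (0.91, 300.0), (0.80, 10.0)]
front = pareto_front(trials)
print(front)  # [(0.9, 120.0), (0.88, 30.0), (0.91, 300.0), (0.8, 10.0)]
```

Only (0.85, 45.0) drops out, because (0.88, 30.0) is both more accurate and faster. Choosing among the remaining trials is a business decision about how much training time an extra point of accuracy is worth.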

## Summary

AutoML accelerates ML development by:
- ✅ Automating tedious hyperparameter tuning
- ✅ Exploring search space systematically
- ✅ Finding optimal configurations faster
- ✅ Tracking all experiments automatically
- ✅ Documenting optimization process

Don't spend days manually tuning; let AutoML do it in hours.
157
skills/cv-pipeline-builder/SKILL.md
Normal file
@@ -0,0 +1,157 @@
---
name: cv-pipeline-builder
description: |
  Computer vision ML pipelines for image classification, object detection, semantic segmentation, and image generation. Activates for "computer vision", "image classification", "object detection", "CNN", "ResNet", "YOLO", "image segmentation", "image preprocessing", "data augmentation". Builds end-to-end CV pipelines with PyTorch/TensorFlow, integrated with SpecWeave increments.
---

# Computer Vision Pipeline Builder

## Overview

Specialized ML pipelines for computer vision tasks. Handles image preprocessing, data augmentation, CNN architectures, transfer learning, and deployment for production CV systems.

## CV Tasks Supported

### 1. Image Classification

```python
from specweave import CVPipeline

# Binary or multi-class classification
pipeline = CVPipeline(
    task="classification",
    num_classes=10,
    increment="0042"
)

# Automatically configures:
# - Image preprocessing (resize, normalize)
# - Data augmentation (rotation, flip, color jitter)
# - CNN architecture (ResNet, EfficientNet, ViT)
# - Transfer learning from ImageNet
# - Training loop with validation
# - Inference pipeline

pipeline.fit(train_images, train_labels)
```

### 2. Object Detection

```python
# Detect multiple objects in images
pipeline = CVPipeline(
    task="object_detection",
    classes=["person", "car", "dog", "cat"],
    increment="0042"
)

# Uses: YOLO, Faster R-CNN, or RetinaNet
# Returns: Bounding boxes + class labels + confidence scores
```

### 3. Semantic Segmentation

```python
# Pixel-level classification
pipeline = CVPipeline(
    task="segmentation",
    num_classes=21,
    increment="0042"
)

# Uses: U-Net, DeepLab, or SegFormer
# Returns: Segmentation mask for each pixel
```

## Best Practices for CV

### Data Augmentation

```python
from specweave import ImageAugmentation

aug = ImageAugmentation(increment="0042")

# Standard augmentations
aug.add_transforms([
    "random_rotation",  # ±15 degrees
    "random_flip_horizontal",
    "random_brightness",  # ±20%
    "random_contrast",  # ±20%
    "random_crop"
])

# Advanced augmentations
aug.add_advanced([
    "mixup",  # Mix two images
    "cutout",  # Random erasing
    "autoaugment"  # Learned augmentation
])
```

### Transfer Learning

```python
# Start from pre-trained ImageNet models
pipeline = CVPipeline(task="classification")

# Option 1: Feature extraction (freeze backbone)
pipeline.use_pretrained(
    model="resnet50",
    freeze_backbone=True
)

# Option 2: Fine-tuning (unfreeze after a few epochs)
pipeline.use_pretrained(
    model="resnet50",
    freeze_backbone=False,
    fine_tune_after_epoch=3
)
```

### Model Selection

**Image Classification**:
- Small datasets (<10K): ResNet18, MobileNetV2
- Medium datasets (10K-100K): ResNet50, EfficientNet-B0
- Large datasets (>100K): EfficientNet-B3, Vision Transformer

**Object Detection**:
- Real-time (>30 FPS): YOLOv8, SSDLite
- High accuracy: Faster R-CNN, RetinaNet

**Segmentation**:
- Medical imaging: U-Net
- Scene segmentation: DeepLabV3, SegFormer
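
The image-classification guidance above can be encoded as a small lookup. This is only a sketch: `suggest_classification_models` is a hypothetical helper, not a SpecWeave function, and placing exactly 10K and 100K images in the medium tier is an assumption.

```python
def suggest_classification_models(num_images):
    """Map dataset size to the candidate architectures listed above."""
    if num_images < 10_000:
        return ["ResNet18", "MobileNetV2"]            # small datasets
    if num_images <= 100_000:
        return ["ResNet50", "EfficientNet-B0"]        # medium datasets
    return ["EfficientNet-B3", "Vision Transformer"]  # large datasets


small = suggest_classification_models(5_000)
medium = suggest_classification_models(50_000)
large = suggest_classification_models(2_000_000)
```

Treat the result as a starting shortlist. Compute budget and latency targets may still favor a smaller model on a large dataset.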

## Integration with SpecWeave

```text
# CV increment structure
.specweave/increments/0042-image-classifier/
├── spec.md
├── data/
│   ├── train/
│   ├── val/
│   └── test/
├── models/
│   ├── model-v1.pth
│   └── model-v2.pth
├── experiments/
│   ├── baseline-resnet18/
│   ├── resnet50-augmented/
│   └── efficientnet-b0/
└── deployment/
    ├── onnx_model.onnx
    └── inference.py
```

## Commands

```bash
/ml:cv-pipeline --task classification --model resnet50
/ml:cv-evaluate 0042  # Evaluate on test set
/ml:cv-deploy 0042    # Export to ONNX
```

Quick setup for CV projects with production-ready pipelines.
521
skills/data-visualizer/SKILL.md
Normal file
@@ -0,0 +1,521 @@
---
name: data-visualizer
description: |
  Automated data visualization for EDA, model performance, and business reporting. Activates for "visualize data", "create plots", "EDA", "exploratory analysis", "confusion matrix", "ROC curve", "feature distribution", "correlation heatmap", "plot results", "dashboard". Generates publication-quality visualizations integrated with SpecWeave increments.
---

# Data Visualizer

## Overview

Automated visualization generation for exploratory data analysis, model performance reporting, and stakeholder communication. Creates publication-quality plots, interactive dashboards, and business-friendly reports, all integrated with SpecWeave's increment workflow.

## Visualization Categories

### 1. Exploratory Data Analysis (EDA)

**Automated EDA Report**:
```python
from specweave import EDAVisualizer

visualizer = EDAVisualizer(increment="0042")

# Generates comprehensive EDA report
report = visualizer.generate_eda_report(df)

# Creates:
# - Dataset overview (rows, columns, memory, missing values)
# - Numerical feature distributions (histograms + KDE)
# - Categorical feature counts (bar charts)
# - Correlation heatmap
# - Missing value pattern
# - Outlier detection plots
# - Feature relationships (pairplot for top features)
```

**Individual EDA Plots**:
```python
# Distribution plots
visualizer.plot_distribution(
    data=df['age'],
    title="Age Distribution",
    bins=30
)

# Correlation heatmap
visualizer.plot_correlation_heatmap(
    data=df[numerical_columns],
    method='pearson'  # or 'spearman', 'kendall'
)

# Missing value patterns
visualizer.plot_missing_values(df)

# Outlier detection (boxplots)
visualizer.plot_outliers(df[numerical_columns])
```

### 2. Model Performance Visualizations

**Classification Performance**:
```python
from specweave import ClassificationVisualizer

viz = ClassificationVisualizer(increment="0042")

# Confusion matrix
viz.plot_confusion_matrix(
    y_true=y_test,
    y_pred=y_pred,
    classes=['Negative', 'Positive']
)

# ROC curve
viz.plot_roc_curve(
    y_true=y_test,
    y_proba=y_proba
)

# Precision-Recall curve
viz.plot_precision_recall_curve(
    y_true=y_test,
    y_proba=y_proba
)

# Learning curves (train vs val)
viz.plot_learning_curve(
    train_scores=train_scores,
    val_scores=val_scores
)

# Calibration curve (are probabilities well-calibrated?)
viz.plot_calibration_curve(
    y_true=y_test,
    y_proba=y_proba
)
```
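
As a reference for what a confusion-matrix plot summarizes, the sketch below counts the four cells for a binary problem and derives precision and recall from them. The helper name `confusion_counts` and the label vectors are illustrative assumptions, not SpecWeave code or real predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts['tp' if truth == positive else 'fp'] += 1
        else:
            counts['fn' if truth == positive else 'tn'] += 1
    return counts


# Illustrative labels only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision = counts['tp'] / (counts['tp'] + counts['fp'])  # flagged items that were right
recall = counts['tp'] / (counts['tp'] + counts['fn'])     # positives that were found
```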

**Regression Performance**:
```python
from specweave import RegressionVisualizer

viz = RegressionVisualizer(increment="0042")

# Predicted vs Actual
viz.plot_predictions(
    y_true=y_test,
    y_pred=y_pred
)

# Residual plot
viz.plot_residuals(
    y_true=y_test,
    y_pred=y_pred
)

# Residual distribution (should be normal)
viz.plot_residual_distribution(
    residuals=y_test - y_pred
)

# Error by feature value
viz.plot_error_analysis(
    y_true=y_test,
    y_pred=y_pred,
    features=X_test
)
```

### 3. Feature Analysis Visualizations

**Feature Importance**:
```python
from specweave import FeatureVisualizer

viz = FeatureVisualizer(increment="0042")

# Feature importance (bar chart)
viz.plot_feature_importance(
    feature_names=feature_names,
    importances=model.feature_importances_,
    top_n=20
)

# SHAP summary plot
viz.plot_shap_summary(
    shap_values=shap_values,
    features=X_test
)

# Partial dependence plots
viz.plot_partial_dependence(
    model=model,
    features=['age', 'income'],
    X=X_train
)

# Feature interaction
viz.plot_feature_interaction(
    model=model,
    features=('age', 'income'),
    X=X_train
)
```

### 4. Time Series Visualizations

**Time Series Plots**:
```python
from specweave import TimeSeriesVisualizer

viz = TimeSeriesVisualizer(increment="0042")

# Time series with trend
viz.plot_timeseries(
    data=sales_data,
    show_trend=True
)

# Seasonal decomposition
viz.plot_seasonal_decomposition(
    data=sales_data,
    period=12  # Monthly seasonality
)

# Autocorrelation (ACF, PACF)
viz.plot_autocorrelation(data=sales_data)

# Forecast with confidence intervals
viz.plot_forecast(
    actual=test_data,
    forecast=forecast,
    confidence_intervals=(0.80, 0.95)
)
```

### 5. Model Comparison Visualizations

**Compare Multiple Models**:
```python
from specweave import ModelComparisonVisualizer

viz = ModelComparisonVisualizer(increment="0042")

# Compare metrics across models
viz.plot_model_comparison(
    models=['Baseline', 'XGBoost', 'LightGBM', 'Neural Net'],
    metrics={
        'accuracy': [0.65, 0.87, 0.86, 0.85],
        'roc_auc': [0.70, 0.92, 0.91, 0.90],
        'training_time': [1, 45, 32, 320]
    }
)

# ROC curves for multiple models
viz.plot_roc_curves_comparison(
    models_predictions={
        'XGBoost': (y_test, y_proba_xgb),
        'LightGBM': (y_test, y_proba_lgbm),
        'Neural Net': (y_test, y_proba_nn)
    }
)
```

## Interactive Visualizations

**Plotly Integration**:
```python
from specweave import InteractiveVisualizer

viz = InteractiveVisualizer(increment="0042")

# Interactive scatter plot (zoom, pan, hover)
viz.plot_interactive_scatter(
    x=X_test[:, 0],
    y=X_test[:, 1],
    colors=y_pred,
    hover_data=df[['id', 'amount', 'merchant']]
)

# Interactive confusion matrix (click for details)
viz.plot_interactive_confusion_matrix(
    y_true=y_test,
    y_pred=y_pred
)

# Interactive feature importance (sortable, filterable)
viz.plot_interactive_feature_importance(
    feature_names=feature_names,
    importances=importances
)
```

## Business Reporting

**Automated ML Report**:
```python
from specweave import MLReportGenerator

generator = MLReportGenerator(increment="0042")

# Generate executive summary report
report = generator.generate_report(
    model=model,
    test_data=(X_test, y_test),
    business_metrics={
        'false_positive_cost': 5,
        'false_negative_cost': 500
    }
)

# Creates:
# - Executive summary (1 page, non-technical)
# - Key metrics (accuracy, precision, recall)
# - Business impact ($$ saved, ROI)
# - Model performance visualizations
# - Recommendations
# - Technical appendix
```

**Report Output** (HTML/PDF):
```markdown
# Fraud Detection Model - Executive Summary

## Key Results
- **Accuracy**: 87% (target: >85%) ✅
- **Fraud Detection Rate**: 62% (catching 310 frauds/day)
- **False Positive Rate**: 38% (190 false alarms/day)

## Business Impact
- **Fraud Prevented**: $155,000/day
- **Review Cost**: $950/day (190 transactions × $5)
- **Net Benefit**: $154,050/day ✅
- **Annual Savings**: $56.2M

## Model Performance
[Confusion Matrix Visualization]
[ROC Curve]
[Feature Importance]

## Recommendations
1. ✅ Deploy to production immediately
2. Monitor fraud patterns weekly
3. Retrain model monthly with new data
```
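
The business-impact figures in the sample report follow from simple arithmetic on the daily counts and per-case costs. The sketch below reproduces them. It assumes each caught fraud is worth $500 (the `false_negative_cost` passed to the generator) and that savings accrue on all 365 days; the variable names are illustrative.

```python
frauds_caught_per_day = 310
false_alarms_per_day = 190
value_per_fraud_caught = 500  # assumed equal to false_negative_cost
review_cost_per_alarm = 5     # false_positive_cost

fraud_prevented = frauds_caught_per_day * value_per_fraud_caught  # dollars/day
review_cost = false_alarms_per_day * review_cost_per_alarm        # dollars/day
net_benefit = fraud_prevented - review_cost                       # dollars/day
annual_savings = net_benefit * 365                                # dollars/year

print(f"Fraud prevented: ${fraud_prevented:,}/day")
print(f"Review cost:     ${review_cost:,}/day")
print(f"Net benefit:     ${net_benefit:,}/day")
print(f"Annual savings:  ${annual_savings / 1e6:.1f}M")
```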

## Dashboard Creation

**Real-Time Dashboard**:
```python
from specweave import DashboardCreator

creator = DashboardCreator(increment="0042")

# Create Grafana/Plotly dashboard
dashboard = creator.create_dashboard(
    title="Model Performance Dashboard",
    panels=[
        {'type': 'metric', 'query': 'prediction_latency_p95'},
        {'type': 'metric', 'query': 'predictions_per_second'},
        {'type': 'timeseries', 'query': 'accuracy_over_time'},
        {'type': 'timeseries', 'query': 'error_rate'},
        {'type': 'heatmap', 'query': 'prediction_distribution'},
        {'type': 'table', 'query': 'recent_anomalies'}
    ]
)

# Exports to Grafana JSON or Plotly Dash app
dashboard.export(format='grafana')
```

## Visualization Best Practices

### 1. Publication-Quality Plots

```python
# Set consistent styling
visualizer.set_style(
    style='seaborn',  # Or 'ggplot', 'fivethirtyeight'
    context='paper',  # Or 'notebook', 'talk', 'poster'
    palette='colorblind'  # Accessible colors
)

# High-resolution exports
visualizer.save_figure(
    filename='model_performance.png',
    dpi=300,  # Publication quality
    bbox_inches='tight'
)
```

### 2. Accessible Visualizations

```python
# Colorblind-friendly palettes
visualizer.use_colorblind_palette()

# Add alt text for accessibility
visualizer.add_alt_text(
    plot=fig,
    description="Confusion matrix showing 87% accuracy"
)

# High contrast for presentations
visualizer.set_high_contrast_mode()
```

### 3. Annotation and Context

```python
# Add reference lines
viz.add_reference_line(
    y=0.85,  # Target accuracy
    label='Target',
    color='red',
    linestyle='--'
)

# Add annotations
viz.annotate_point(
    x=optimal_threshold,
    y=optimal_f1,
    text='Optimal threshold: 0.47'
)
```
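
The annotation example assumes `optimal_threshold` and `optimal_f1` already exist. One way to obtain them is to scan candidate thresholds and keep the one with the highest F1 score. The sketch below is a hedged illustration: `best_f1_threshold` is a hypothetical helper and the labels, probabilities, and candidate thresholds are made up.

```python
def best_f1_threshold(y_true, y_proba, thresholds):
    """Return (threshold, f1) for the candidate threshold with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for t in thresholds:
        pairs = list(zip(y_true, y_proba))
        tp = sum(1 for y, p in pairs if p >= t and y == 1)
        fp = sum(1 for y, p in pairs if p >= t and y == 0)
        fn = sum(1 for y, p in pairs if p < t and y == 1)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when nothing is correctly flagged.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Illustrative data only.
y_true = [0, 0, 1, 0, 1, 1]
y_proba = [0.1, 0.4, 0.45, 0.6, 0.7, 0.9]

optimal_threshold, optimal_f1 = best_f1_threshold(y_true, y_proba, [0.3, 0.5, 0.65])
```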

## Integration with SpecWeave

### Automated Visualization in Increments

```python
# All visualizations auto-saved to increment folder
visualizer = EDAVisualizer(increment="0042")

# Creates:
# .specweave/increments/0042-fraud-detection/
# ├── visualizations/
# │   ├── eda/
# │   │   ├── distributions.png
# │   │   ├── correlation_heatmap.png
# │   │   └── missing_values.png
# │   ├── model_performance/
# │   │   ├── confusion_matrix.png
# │   │   ├── roc_curve.png
# │   │   ├── precision_recall.png
# │   │   └── learning_curves.png
# │   ├── feature_analysis/
# │   │   ├── feature_importance.png
# │   │   ├── shap_summary.png
# │   │   └── partial_dependence/
# │   └── reports/
# │       ├── executive_summary.html
# │       └── technical_report.pdf
```

### Living Docs Integration

```bash
/specweave:sync-docs update
```

Updates:
```markdown
<!-- .specweave/docs/internal/architecture/ml-model-performance.md -->

## Fraud Detection Model Performance (Increment 0042)

### Model Accuracy


### Key Metrics
- Accuracy: 87%
- Precision: 85%
- Recall: 62%
- ROC AUC: 0.92

### Feature Importance


Top 5 features:
1. amount_vs_user_average (0.18)
2. days_since_last_purchase (0.12)
3. merchant_risk_score (0.10)
4. velocity_24h (0.08)
5. location_distance_from_home (0.07)
```

## Commands

```bash
# Generate EDA report
/ml:visualize-eda 0042

# Generate model performance report
/ml:visualize-performance 0042

# Create interactive dashboard
/ml:create-dashboard 0042

# Export all visualizations
/ml:export-visualizations 0042 --format png,pdf,html
```

## Advanced Features

### 1. Automated Report Generation

```python
# Generate full increment report with all visualizations
generator = IncrementReportGenerator(increment="0042")

report = generator.generate_full_report()

# Includes:
# - EDA visualizations
# - Experiment comparisons
# - Best model performance
# - Feature importance
# - Business impact
# - Deployment readiness
```

### 2. Custom Visualization Templates

```python
# Create reusable templates
template = VisualizationTemplate(name="fraud_analysis")

template.add_panel("confusion_matrix")
template.add_panel("roc_curve")
template.add_panel("top_fraud_features")
template.add_panel("fraud_trends_over_time")

# Apply to any increment
template.apply(increment="0042")
```

### 3. Version Control for Visualizations

```python
# Track visualization changes across model versions
viz_tracker = VisualizationTracker(increment="0042")

# Compare model v1 vs v2 visualizations
viz_tracker.compare_versions(
    version_1="model-v1",
    version_2="model-v2"
)

# Shows: Confusion matrix improved, ROC curve comparison, etc.
```

## Summary

Data visualization is critical for:
- ✅ Exploratory data analysis (understand data before modeling)
- ✅ Model performance communication (stakeholder buy-in)
- ✅ Feature analysis (understand what drives predictions)
- ✅ Business reporting (translate metrics to impact)
- ✅ Model debugging (identify issues visually)

This skill automates visualization generation, ensuring all ML work is visual, accessible, and business-friendly within SpecWeave's increment workflow.
535
skills/experiment-tracker/SKILL.md
Normal file
@@ -0,0 +1,535 @@
---
name: experiment-tracker
description: |
  Manages ML experiment tracking with MLflow, Weights & Biases, or SpecWeave's built-in tracking. Activates for "track experiments", "MLflow", "wandb", "experiment logging", "compare experiments", "hyperparameter tracking". Automatically configures tracking tools to log to SpecWeave increment folders, ensuring all experiments are documented and reproducible. Integrates with SpecWeave's living docs for persistent experiment knowledge.
---

# Experiment Tracker

## Overview

Transforms chaotic ML experimentation into organized, reproducible research. Every experiment is logged, versioned, and tied to a SpecWeave increment, ensuring team knowledge is preserved and experiments are reproducible.

## Problem This Solves

**Without structured tracking**:
- ❌ "Which hyperparameters did we use for model v2?"
- ❌ "Why did we choose XGBoost over LightGBM?"
- ❌ "Can't reproduce results from 3 months ago"
- ❌ "Team member left, all knowledge in their notebooks"

**With experiment tracking**:
- ✅ All experiments logged with params, metrics, artifacts
- ✅ Decisions documented ("XGBoost: 5% better precision, chose it")
- ✅ Reproducible (environment, data version, code hash)
- ✅ Team knowledge in living docs, not individual notebooks

## How It Works

### Auto-Configuration

When you create an ML increment, the skill detects tracking tools:

```python
# No configuration needed - automatically detects and configures
from specweave import track_experiment

# Automatically logs to:
# .specweave/increments/0042.../experiments/exp-001/
with track_experiment("baseline-model") as exp:
    model.fit(X_train, y_train)
    exp.log_metric("accuracy", accuracy)
```

### Tracking Backends

**Option 1: SpecWeave Built-in** (default, zero-config)
```python
from specweave import track_experiment

# Logs to increment folder automatically
with track_experiment("xgboost-v1") as exp:
    exp.log_param("n_estimators", 100)
    exp.log_metric("auc", 0.87)
    exp.save_model(model, "model.pkl")

# Creates:
# .specweave/increments/0042.../experiments/xgboost-v1/
# ├── params.json
# ├── metrics.json
# ├── model.pkl
# └── metadata.yaml
```
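
To illustrate how a tracker of this kind can produce the folder layout above, here is a minimal standard-library sketch. It is not SpecWeave's implementation: `track_experiment_sketch` and `_Experiment` are hypothetical names, the output root is a temporary directory, and metadata is written as JSON rather than YAML to avoid a third-party dependency.

```python
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path


class _Experiment:
    """In-memory record of one run (hypothetical helper)."""

    def __init__(self):
        self.params = {}
        self.metrics = {}
        self.status = "running"

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        self.metrics[key] = value


@contextmanager
def track_experiment_sketch(name, root):
    exp = _Experiment()
    try:
        yield exp
        exp.status = "completed"
    except Exception:
        exp.status = "failed"  # record the failure, then let it propagate
        raise
    finally:
        # Always persist what was logged, even when the run failed.
        out = Path(root) / name
        out.mkdir(parents=True, exist_ok=True)
        (out / "params.json").write_text(json.dumps(exp.params, indent=2))
        (out / "metrics.json").write_text(json.dumps(exp.metrics, indent=2))
        (out / "metadata.json").write_text(
            json.dumps({"name": name, "status": exp.status}, indent=2)
        )


root = tempfile.mkdtemp()
with track_experiment_sketch("xgboost-v1", root) as exp:
    exp.log_param("n_estimators", 100)
    exp.log_metric("auc", 0.87)

saved_metrics = json.loads((Path(root) / "xgboost-v1" / "metrics.json").read_text())
```

Writing the files in `finally` means a failed run still leaves its parameters and a `failed` status on disk.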

**Option 2: MLflow** (if detected in project)
```python
import mlflow
from specweave import configure_mlflow

# Auto-configures MLflow to log to increment
configure_mlflow(increment="0042")

with mlflow.start_run(run_name="xgboost-v1"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("auc", 0.87)
    mlflow.sklearn.log_model(model, "model")

# Still logs to increment folder, just uses MLflow as backend
```

**Option 3: Weights & Biases**
```python
import wandb
from specweave import configure_wandb

# Auto-configures W&B project = increment ID
configure_wandb(increment="0042")

run = wandb.init(name="xgboost-v1")
run.log({"auc": 0.87})
run.log_model("model.pkl")

# W&B dashboard + local logs in increment folder
```

### Experiment Comparison

```python
from specweave import compare_experiments

# Compare all experiments in increment
comparison = compare_experiments(increment="0042")

# Generates:
# .specweave/increments/0042.../experiments/comparison.md
```

**Output**:
```markdown
| Experiment         | Accuracy | Precision | Recall | F1   | Training Time |
|--------------------|----------|-----------|--------|------|---------------|
| exp-001-baseline   | 0.65     | 0.60      | 0.55   | 0.57 | 2s            |
| exp-002-xgboost    | 0.87     | 0.85      | 0.83   | 0.84 | 45s           |
| exp-003-lightgbm   | 0.86     | 0.84      | 0.82   | 0.83 | 32s           |
| exp-004-neural-net | 0.85     | 0.83      | 0.81   | 0.82 | 320s          |

**Best Model**: exp-002-xgboost
- Highest accuracy (0.87)
- Good precision/recall balance
- Reasonable training time (45s)
- Selected for deployment
```
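
A comparison table like the one above can be produced from per-experiment metric dictionaries with a few lines of code. The sketch below is an illustration only: `comparison_table` is a hypothetical helper, not the `compare_experiments` implementation, and it reuses a subset of the sample numbers from the table.

```python
def comparison_table(results, metrics):
    """Render {experiment: {metric: value}} as a markdown table."""
    lines = [
        "| Experiment | " + " | ".join(metrics) + " |",
        "|" + "---|" * (len(metrics) + 1),
    ]
    for name, scores in results.items():
        cells = " | ".join(f"{scores[m]:.2f}" for m in metrics)
        lines.append(f"| {name} | {cells} |")
    return "\n".join(lines)


results = {
    "exp-001-baseline": {"accuracy": 0.65, "f1": 0.57},
    "exp-002-xgboost": {"accuracy": 0.87, "f1": 0.84},
    "exp-003-lightgbm": {"accuracy": 0.86, "f1": 0.83},
}

table = comparison_table(results, ["accuracy", "f1"])
best = max(results, key=lambda name: results[name]["accuracy"])
print(table)
print(f"\n**Best Model**: {best}")
```

Selecting on a single metric is the simplest rule. The sample output also weighs training time, which would need an explicit trade-off.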

### Living Docs Integration

After completing the increment:

```bash
/specweave:sync-docs update
```

Automatically updates:

```markdown
<!-- .specweave/docs/internal/architecture/ml-experiments.md -->

## Recommendation Model (Increment 0042)

### Experiments Conducted: 7
- exp-001-baseline: Random classifier (acc=0.12)
- exp-002-popularity: Popularity baseline (acc=0.18)
- exp-003-xgboost: XGBoost classifier (acc=0.26) ✅ **SELECTED**
- ...

### Selection Rationale
XGBoost chosen for:
- Best accuracy (0.26 vs baseline 0.18, +44% improvement)
- Fast inference (<50ms)
- Good explainability (SHAP values)
- Stable across cross-validation (std=0.02)

### Hyperparameters (exp-003)
- n_estimators: 200
- max_depth: 6
- learning_rate: 0.1
- subsample: 0.8
```

## When to Use This Skill

Activate when you need to:

- **Track ML experiments** systematically
- **Compare multiple models** objectively
- **Document experiment decisions** for the team
- **Reproduce past results** exactly
- **Maintain experiment history** across increments

## Key Features

### 1. Automatic Logging

```python
# Logs everything automatically
from specweave import AutoTracker
from xgboost import XGBClassifier

tracker = AutoTracker(increment="0042")

# Just wrap your training code
@tracker.track(name="xgboost-auto")
def train_model():
    model = XGBClassifier(**params)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    return model, score

# Automatically logs: params, metrics, model, environment, git hash
model, score = train_model()
```

### 2. Hyperparameter Tracking

```python
from specweave import track_hyperparameters

params_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3]
}

# Tracks all parameter combinations
results = track_hyperparameters(
    model=XGBClassifier,
    param_grid=params_grid,
    X_train=X_train,
    y_train=y_train,
    increment="0042"
)

# Generates parameter importance analysis
```
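
"All parameter combinations" for the grid above means the Cartesian product of the value lists, 3 × 3 × 3 = 27 runs. A minimal sketch of that enumeration using only the standard library (the variable names are illustrative):

```python
from itertools import product

params_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3],
}

names = list(params_grid)
# One dict per combination, in the order product() yields them.
combinations = [
    dict(zip(names, values)) for values in product(*params_grid.values())
]

print(len(combinations))  # number of runs a full grid search would need
print(combinations[0])
```

The count grows multiplicatively with every added parameter, which is worth checking before launching a full grid.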

### 3. Cross-Validation Tracking

```python
from specweave import track_cross_validation

# Tracks each fold separately
cv_results = track_cross_validation(
    model=model,
    X=X,
    y=y,
    cv=5,
    increment="0042"
)

# Logs: mean, std, per-fold scores, fold distribution
```
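
The summary statistics mentioned above are straightforward to compute from per-fold scores. The sketch below uses the standard library and made-up fold scores. Using the sample standard deviation (`statistics.stdev`) rather than the population version is a choice made for this example.

```python
import statistics

# Illustrative per-fold accuracy scores from a 5-fold run.
fold_scores = [0.84, 0.86, 0.85, 0.87, 0.83]

cv_summary = {
    "mean": statistics.mean(fold_scores),
    "std": statistics.stdev(fold_scores),  # sample standard deviation
    "min": min(fold_scores),
    "max": max(fold_scores),
    "per_fold": fold_scores,
}

print(f"accuracy: {cv_summary['mean']:.3f} ± {cv_summary['std']:.3f}")
```

A large spread between folds is often a more useful warning sign than the mean itself.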

### 4. Artifact Management

```python
from specweave import track_experiment

with track_experiment("xgboost-v1") as exp:
    # Training artifacts
    exp.save_artifact("preprocessor.pkl", preprocessor)
    exp.save_artifact("model.pkl", model)

    # Evaluation artifacts
    exp.save_artifact("confusion_matrix.png", cm_plot)
    exp.save_artifact("roc_curve.png", roc_plot)

    # Data artifacts
    exp.save_artifact("feature_importance.csv", importance_df)

    # Environment artifacts
    exp.save_artifact("requirements.txt", requirements)
    exp.save_artifact("conda_env.yaml", conda_env)
```

### 5. Experiment Metadata

```python
from specweave import ExperimentMetadata, track_experiment

metadata = ExperimentMetadata(
    name="xgboost-v3",
    description="XGBoost with feature engineering v2",
    tags=["production-candidate", "feature-eng-v2"],
    git_commit="a3b8c9d",
    data_version="v2024-01",
    author="[email protected]"
)

with track_experiment(metadata) as exp:
    # ... training ...
    pass
```

## Best Practices

### 1. Name Experiments Clearly

```python
# ❌ Bad: Generic names
with track_experiment("exp1"):
    ...

# ✅ Good: Descriptive names
with track_experiment("xgboost-tuned-depth6-lr0.1"):
    ...
```

### 2. Log Everything

```python
import sys

import sklearn

# Log more than you think you need
exp.log_param("random_seed", 42)
exp.log_param("data_version", "2024-01")
exp.log_param("python_version", sys.version)
exp.log_param("sklearn_version", sklearn.__version__)

# Future you will thank present you
```

### 3. Document Failures

```python
with track_experiment("neural-net-attempt") as exp:
    try:
        model.fit(X_train, y_train)
    except Exception as e:
        # Record the failure while the experiment is still open
        exp.log_note(f"FAILED: {e}")
        exp.log_note("Reason: Out of memory, need smaller batch size")
        exp.set_status("failed")

# Failure documentation prevents repeating mistakes
```

### 4. Use Experiment Series

```python
# Related experiments in series
experiments = [
    "xgboost-baseline",
    "xgboost-tuned-v1",
    "xgboost-tuned-v2",
    "xgboost-tuned-v3-final"
]

# Track progression and improvements
```

### 5. Link to Data Versions

```python
with track_experiment("xgboost-v1") as exp:
    exp.log_param("data_commit", "dvc:a3b8c9d")
    exp.log_param("data_url", "s3://bucket/data/v2024-01")

# Enables exact reproduction
```

## Integration with SpecWeave

### With Increments

```bash
# Experiments automatically tied to increment
/specweave:inc "0042-recommendation-model"
# All experiments logged to: .specweave/increments/0042.../experiments/
```

### With Living Docs

```bash
# Sync experiment findings to docs
/specweave:sync-docs update
# Updates: architecture/ml-models.md, runbooks/model-training.md
```

### With GitHub

```bash
# Create issue for model retraining
/specweave:github:create-issue "Retrain model with Q1 2024 data"
# Links to previous experiments in increment
```

## Examples

### Example 1: Baseline Experiments

```python
from sklearn.dummy import DummyClassifier
from specweave import track_experiment

# Valid scikit-learn DummyClassifier strategies:
# "uniform" = random guessing, "most_frequent" = majority class
baselines = ["uniform", "most_frequent", "stratified"]

for strategy in baselines:
    with track_experiment(f"baseline-{strategy}") as exp:
        model = DummyClassifier(strategy=strategy)
        model.fit(X_train, y_train)

        accuracy = model.score(X_test, y_test)
        exp.log_metric("accuracy", accuracy)
        exp.log_note(f"Baseline: {strategy}")

# Generates baseline comparison report
```

### Example 2: Hyperparameter Grid Search

```python
from specweave import track_grid_search
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9]
}

# Automatically logs all combinations
best_model, results = track_grid_search(
    XGBClassifier(),
    param_grid,
    X_train,
    y_train,
    increment="0042"
)

# Creates visualization of parameter importance
```
|
||||
|
||||
### Example 3: Model Comparison
|
||||
|
||||
```python
|
||||
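from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from specweave import compare_models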
|
||||
|
||||
models = {
|
||||
"xgboost": XGBClassifier(),
|
||||
"lightgbm": LGBMClassifier(),
|
||||
"random-forest": RandomForestClassifier()
|
||||
}
|
||||
|
||||
# Trains and compares all models
|
||||
comparison = compare_models(
|
||||
models,
|
||||
X_train,
|
||||
y_train,
|
||||
X_test,
|
||||
y_test,
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Generates markdown comparison table
|
||||
```
|
||||
|
||||
## Tool Compatibility
|
||||
|
||||
### MLflow
|
||||
|
||||
```python
|
||||
# Option 1: Pure MLflow (auto-configured)
|
||||
import mlflow
|
||||
mlflow.set_tracking_uri(".specweave/increments/0042.../experiments")
|
||||
|
||||
# Option 2: SpecWeave wrapper (recommended)
|
||||
from specweave import mlflow as sw_mlflow
|
||||
with sw_mlflow.start_run("xgboost"):
    # Logs to both MLflow and increment docs
    pass
|
||||
```
|
||||
|
||||
### Weights & Biases
|
||||
|
||||
```python
|
||||
# Option 1: Pure wandb
|
||||
import wandb
|
||||
wandb.init(project="0042-recommendation-model")
|
||||
|
||||
# Option 2: SpecWeave wrapper (recommended)
|
||||
from specweave import wandb as sw_wandb
|
||||
run = sw_wandb.init(increment="0042", name="xgboost")
|
||||
# Syncs to increment folder + W&B dashboard
|
||||
```
|
||||
|
||||
### TensorBoard
|
||||
|
||||
```python
|
||||
from specweave import TensorBoardCallback
|
||||
|
||||
# Keras callback
|
||||
model.fit(
|
||||
X_train,
|
||||
y_train,
|
||||
callbacks=[
|
||||
TensorBoardCallback(
|
||||
increment="0042",
|
||||
log_dir=".specweave/increments/0042.../tensorboard"
|
||||
)
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Commands
|
||||
|
||||
```bash
|
||||
# List all experiments in increment
|
||||
/ml:list-experiments 0042
|
||||
|
||||
# Compare experiments
|
||||
/ml:compare-experiments 0042
|
||||
|
||||
# Load experiment details
|
||||
/ml:show-experiment exp-003-xgboost
|
||||
|
||||
# Export experiment data
|
||||
/ml:export-experiments 0042 --format csv
|
||||
```
|
||||
|
||||
## Tips
|
||||
|
||||
1. **Start tracking early** - Track from first experiment, not after 20 failed attempts
|
||||
2. **Tag production models** - `exp.add_tag("production")` for deployed models
|
||||
3. **Version everything** - Data, code, environment, dependencies
|
||||
4. **Document decisions** - Why model A over model B (not just metrics)
|
||||
5. **Prune old experiments** - Archive experiments >6 months old
|
||||
|
||||
## Advanced: Multi-Stage Experiments
|
||||
|
||||
For complex pipelines with multiple stages:
|
||||
|
||||
```python
|
||||
from specweave import ExperimentPipeline
|
||||
|
||||
pipeline = ExperimentPipeline("recommendation-full-pipeline")
|
||||
|
||||
# Stage 1: Data preprocessing
|
||||
with pipeline.stage("preprocessing") as stage:
    stage.log_metric("rows_before", len(df))
    df_clean = preprocess(df)
    stage.log_metric("rows_after", len(df_clean))

# Stage 2: Feature engineering
with pipeline.stage("features") as stage:
    features = engineer_features(df_clean)
    stage.log_metric("num_features", features.shape[1])

# Stage 3: Model training
with pipeline.stage("training") as stage:
    # train_model is a placeholder; assumed here to return (model, accuracy)
    model, accuracy = train_model(features)
    stage.log_metric("accuracy", accuracy)
|
||||
|
||||
# Logs entire pipeline with stage dependencies
|
||||
```
|
||||
|
||||
## Integration Points
|
||||
|
||||
- **ml-pipeline-orchestrator**: Auto-tracks experiments during pipeline execution
|
||||
- **model-evaluator**: Uses experiment data for model comparison
|
||||
- **ml-engineer agent**: Reviews experiment results and suggests improvements
|
||||
- **Living docs**: Syncs experiment findings to architecture docs
|
||||
|
||||
This skill ensures ML experimentation is never lost, always reproducible, and well-documented.
|
||||
566
skills/feature-engineer/SKILL.md
Normal file
@@ -0,0 +1,566 @@
|
||||
---
|
||||
name: feature-engineer
|
||||
description: |
|
||||
Comprehensive feature engineering for ML pipelines: data quality assessment, feature creation, selection, transformation, and encoding. Activates for "feature engineering", "create features", "feature selection", "data preprocessing", "handle missing values", "encode categorical", "scale features", "feature importance". Ensures features are production-ready with automated validation, documentation, and integration with SpecWeave increments.
|
||||
---
|
||||
|
||||
# Feature Engineer
|
||||
|
||||
## Overview
|
||||
|
||||
Feature engineering often makes the difference between mediocre and excellent ML models. This skill transforms raw data into model-ready features through systematic data quality assessment, feature creation, selection, and transformation—all integrated with SpecWeave's increment workflow.
|
||||
|
||||
## The Feature Engineering Pipeline
|
||||
|
||||
### Phase 1: Data Quality Assessment
|
||||
|
||||
**Before creating features, understand your data**:
|
||||
|
||||
```python
|
||||
from specweave import DataQualityReport
|
||||
|
||||
# Automated data quality check
|
||||
report = DataQualityReport(df, increment="0042")
|
||||
|
||||
# Generates:
|
||||
# - Missing value analysis
|
||||
# - Outlier detection
|
||||
# - Data type validation
|
||||
# - Distribution analysis
|
||||
# - Correlation matrix
|
||||
# - Duplicate detection
|
||||
```
|
||||
|
||||
**Quality Report Output**:
|
||||
```markdown
|
||||
# Data Quality Report
|
||||
|
||||
## Dataset Overview
|
||||
- Rows: 100,000
|
||||
- Columns: 45
|
||||
- Memory: 34.2 MB
|
||||
|
||||
## Missing Values
|
||||
| Column | Missing | Percentage |
|
||||
|-----------------|---------|------------|
|
||||
| email | 15,234 | 15.2% |
|
||||
| phone | 8,901 | 8.9% |
|
||||
| purchase_date | 0 | 0.0% |
|
||||
|
||||
## Outliers Detected
|
||||
- transaction_amount: 234 outliers (>3 std dev)
|
||||
- user_age: 12 outliers (<18 or >100)
|
||||
|
||||
## Data Type Issues
|
||||
- user_id: Stored as float, should be int
|
||||
- date_joined: Stored as string, should be datetime
|
||||
|
||||
## Recommendations
|
||||
1. Impute email/phone or create "missing" indicator features
|
||||
2. Cap/remove outliers in transaction_amount
|
||||
3. Convert data types for efficiency
|
||||
```
|
||||
|
||||
### Phase 2: Feature Creation
|
||||
|
||||
**Create features from domain knowledge**:
|
||||
|
||||
```python
|
||||
from specweave import FeatureCreator
|
||||
|
||||
creator = FeatureCreator(df, increment="0042")
|
||||
|
||||
# Temporal features (from datetime)
|
||||
creator.add_temporal_features(
|
||||
date_column="purchase_date",
|
||||
features=["hour", "day_of_week", "month", "is_weekend", "is_holiday"]
|
||||
)
|
||||
|
||||
# Aggregation features (user behavior)
|
||||
creator.add_aggregation_features(
|
||||
group_by="user_id",
|
||||
target="purchase_amount",
|
||||
aggs=["mean", "std", "count", "min", "max"]
|
||||
)
|
||||
# Creates: user_purchase_amount_mean, user_purchase_amount_std, etc.
|
||||
|
||||
# Interaction features
|
||||
creator.add_interaction_features(
|
||||
features=[("age", "income"), ("clicks", "impressions")],
|
||||
operations=["multiply", "divide", "subtract"]
|
||||
)
|
||||
# Creates: age_x_income, clicks_per_impression, etc.
|
||||
|
||||
# Ratio features
|
||||
creator.add_ratio_features([
|
||||
("revenue", "cost"),
|
||||
("conversions", "visits")
|
||||
])
|
||||
# Creates: revenue_to_cost_ratio, conversion_rate
|
||||
|
||||
# Binning (discretization)
|
||||
creator.add_binned_features(
|
||||
column="age",
|
||||
bins=[0, 18, 25, 35, 50, 65, 100],
|
||||
labels=["child", "young_adult", "adult", "middle_aged", "senior", "elderly"]
|
||||
)
|
||||
|
||||
# Text features (from text columns)
|
||||
creator.add_text_features(
|
||||
column="product_description",
|
||||
features=["length", "word_count", "unique_words", "sentiment"]
|
||||
)
|
||||
|
||||
# Generate all features
|
||||
df_enriched = creator.generate()
|
||||
|
||||
# Auto-documents in increment folder
|
||||
creator.save_feature_definitions(
|
||||
path=".specweave/increments/0042.../features/feature_definitions.yaml"
|
||||
)
|
||||
```
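
The temporal and aggregation steps above can be approximated in plain pandas. This is a minimal sketch on made-up data, not SpecWeave's implementation; the toy rows and the `amount_vs_user_mean` column are illustrative assumptions.

```python
import pandas as pd

# Toy purchase log; column names mirror the example above, values are invented.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "purchase_date": pd.to_datetime([
        "2024-01-05 09:30", "2024-01-06 14:00", "2024-01-07 20:15",
        "2024-01-08 08:45", "2024-01-09 12:00",
    ]),
    "purchase_amount": [20.0, 40.0, 10.0, 30.0, 50.0],
})

# Temporal features derived from the datetime column.
df["purchase_hour"] = df["purchase_date"].dt.hour
df["purchase_day_of_week"] = df["purchase_date"].dt.dayofweek  # Monday=0
df["purchase_is_weekend"] = df["purchase_day_of_week"] >= 5

# Per-user aggregation, broadcast back onto every row of that user.
df["user_purchase_amount_mean"] = (
    df.groupby("user_id")["purchase_amount"].transform("mean")
)

# Ratio-style feature: each purchase relative to the user's average spend.
df["amount_vs_user_mean"] = df["purchase_amount"] / df["user_purchase_amount_mean"]

print(df)
```

In a real pipeline the per-user statistics would be computed on the training split only and then joined onto validation and test rows, so that held-out targets never influence the features.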
|
||||
|
||||
**Feature Definitions** (auto-generated):
|
||||
```yaml
|
||||
# .specweave/increments/0042.../features/feature_definitions.yaml
|
||||
|
||||
features:
  - name: purchase_hour
    type: temporal
    source: purchase_date
    description: Hour of purchase (0-23)

  - name: user_purchase_amount_mean
    type: aggregation
    source: purchase_amount
    group_by: user_id
    description: Average purchase amount per user

  - name: age_x_income
    type: interaction
    sources: [age, income]
    operation: multiply
    description: Product of age and income

  - name: conversion_rate
    type: ratio
    sources: [conversions, visits]
    description: Conversion rate (conversions / visits)
|
||||
```
|
||||
|
||||
### Phase 3: Feature Selection
|
||||
|
||||
**Reduce dimensionality, improve performance**:
|
||||
|
||||
```python
|
||||
from specweave import FeatureSelector
|
||||
|
||||
selector = FeatureSelector(X_train, y_train, increment="0042")
|
||||
|
||||
# Method 1: Correlation-based (remove redundant features)
|
||||
selector.remove_correlated_features(threshold=0.95)
|
||||
# Removes one feature from each pair with |correlation| > 0.95
|
||||
|
||||
# Method 2: Variance-based (remove constant features)
|
||||
selector.remove_low_variance_features(threshold=0.01)
|
||||
# Removes features whose variance is below 0.01 (near-constant)
|
||||
|
||||
# Method 3: Statistical tests
|
||||
selector.select_by_statistical_test(k=50)
|
||||
# SelectKBest with chi2/f_classif
|
||||
|
||||
# Method 4: Model-based (tree importance)
|
||||
selector.select_by_model_importance(
|
||||
model=RandomForestClassifier(),
|
||||
threshold=0.01
|
||||
)
|
||||
# Removes features with <1% importance
|
||||
|
||||
# Method 5: Recursive Feature Elimination
|
||||
selector.select_by_rfe(
|
||||
model=LogisticRegression(),
|
||||
n_features=30
|
||||
)
|
||||
|
||||
# Get selected features
|
||||
selected_features = selector.get_selected_features()
|
||||
|
||||
# Generate selection report
|
||||
selector.generate_report()
|
||||
```
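
Methods 1 and 2 are simple enough to sketch directly in pandas. The helper names `drop_low_variance` and `drop_correlated` are hypothetical and only illustrate the idea; how SpecWeave's `FeatureSelector` picks which member of a correlated pair to drop is not specified here, so this sketch assumes the later column is removed.

```python
import numpy as np
import pandas as pd


def drop_low_variance(X: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
    """Drop columns whose sample variance is at or below `threshold`."""
    variances = X.var()
    return X.loc[:, variances > threshold]


def drop_correlated(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of every pair with |Pearson r| above `threshold`."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 1.0,   # linear function of "a", so perfectly correlated
    "constant": np.ones(200),  # zero variance
    "b": rng.normal(size=200), # independent signal
})

# Variance filter first: constant columns have undefined correlations.
X_reduced = drop_correlated(drop_low_variance(X))
print(list(X_reduced.columns))
```

Note that a variance threshold is scale-dependent: a feature measured in thousands will pass a 0.01 cutoff far more easily than one measured in fractions, so it is usually applied before scaling only to catch constant or near-constant columns.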
|
||||
|
||||
**Feature Selection Report**:
|
||||
```markdown
|
||||
# Feature Selection Report
|
||||
|
||||
## Original Features: 125
|
||||
## Selected Features: 35 (72% reduction)
|
||||
|
||||
## Selection Process
|
||||
1. Removed 12 correlated features (>95% correlation)
|
||||
2. Removed 8 low-variance features
|
||||
3. Statistical test: Selected top 50 (chi-squared)
|
||||
4. Model importance: Removed 15 low-importance features (<1%)
|
||||
|
||||
## Top 10 Features (by importance)
|
||||
1. user_purchase_amount_mean (0.18)
|
||||
2. days_since_last_purchase (0.12)
|
||||
3. total_purchases (0.10)
|
||||
4. age_x_income (0.08)
|
||||
5. conversion_rate (0.07)
|
||||
...
|
||||
|
||||
## Removed Features
|
||||
- user_id_hash (constant)
|
||||
- temp_feature_1 (99% correlated with temp_feature_2)
|
||||
- random_noise (0% importance)
|
||||
...
|
||||
```
|
||||
|
||||
### Phase 4: Feature Transformation
|
||||
|
||||
**Scale, normalize, encode for model compatibility**:
|
||||
|
||||
```python
|
||||
from specweave import FeatureTransformer
|
||||
|
||||
transformer = FeatureTransformer(increment="0042")
|
||||
|
||||
# Numerical transformations
|
||||
transformer.add_numerical_transformer(
|
||||
columns=["age", "income", "purchase_amount"],
|
||||
method="standard_scaler" # Or: min_max, robust, quantile
|
||||
)
|
||||
|
||||
# Categorical encoding
|
||||
transformer.add_categorical_encoder(
|
||||
columns=["country", "device_type", "product_category"],
|
||||
method="onehot", # Or: label, target, binary
|
||||
handle_unknown="ignore"
|
||||
)
|
||||
|
||||
# Ordinal encoding (for ordered categories)
|
||||
transformer.add_ordinal_encoder(
|
||||
column="education",
|
||||
order=["high_school", "bachelors", "masters", "phd"]
|
||||
)
|
||||
|
||||
# Log transformation (for skewed distributions)
|
||||
transformer.add_log_transform(
|
||||
columns=["transaction_amount", "page_views"],
|
||||
method="log1p" # log(1 + x) to handle zeros
|
||||
)
|
||||
|
||||
# Box-Cox transformation (reduces skew; requires strictly positive values)
|
||||
transformer.add_power_transform(
|
||||
columns=["revenue", "engagement_score"],
|
||||
method="box-cox"
|
||||
)
|
||||
|
||||
# Custom transformation (expects a pandas Series)
import numpy as np

def clip_outliers(x):
    return np.clip(x, x.quantile(0.01), x.quantile(0.99))
|
||||
|
||||
transformer.add_custom_transformer(
|
||||
columns=["outlier_prone_feature"],
|
||||
func=clip_outliers
|
||||
)
|
||||
|
||||
# Fit and transform
|
||||
X_train_transformed = transformer.fit_transform(X_train)
|
||||
X_test_transformed = transformer.transform(X_test)
|
||||
|
||||
# Save transformer pipeline
|
||||
transformer.save(
|
||||
path=".specweave/increments/0042.../features/transformer.pkl"
|
||||
)
|
||||
```
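
For comparison, the same kind of pipeline can be assembled from scikit-learn building blocks. This is a sketch under the assumption that `FeatureTransformer` behaves roughly like a `ColumnTransformer`; that is not confirmed by the source, and the toy data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

X_train = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "transaction_amount": [0.0, 10.0, 100.0, 1000.0],
    "country": ["US", "DE", "US", "FR"],
})
X_test = pd.DataFrame({
    "age": [33],
    "transaction_amount": [50.0],
    "country": ["JP"],  # category never seen during fit
})

preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["age"]),
        # log1p handles the zero amount; scaling follows the log transform.
        ("log", Pipeline([
            ("log1p", FunctionTransformer(np.log1p)),
            ("scale", StandardScaler()),
        ]), ["transaction_amount"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Statistics (means, variances, categories) come from the training split only.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)

print(X_train_t.shape, X_test_t.shape)
```

With `handle_unknown="ignore"`, the unseen country in the test row is encoded as all zeros across the one-hot columns instead of raising an error.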
|
||||
|
||||
### Phase 5: Feature Validation
|
||||
|
||||
**Ensure features are production-ready**:
|
||||
|
||||
```python
|
||||
from specweave import FeatureValidator
|
||||
|
||||
validator = FeatureValidator(
|
||||
X_train, X_test,
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Check for data leakage
|
||||
leakage_report = validator.check_data_leakage()
|
||||
# Detects: perfectly correlated features, future data in training
|
||||
|
||||
# Check for distribution drift
|
||||
drift_report = validator.check_distribution_drift()
|
||||
# Compares train vs test distributions
|
||||
|
||||
# Check for missing values after transformation
|
||||
missing_report = validator.check_missing_values()
|
||||
|
||||
# Check for infinite/NaN values
|
||||
invalid_report = validator.check_invalid_values()
|
||||
|
||||
# Generate validation report
|
||||
validator.generate_report()
|
||||
```
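
A drift check of this kind can be sketched with SciPy's two-sample Kolmogorov-Smirnov test. The `drifted_features` helper and the synthetic data are assumptions for illustration; the source does not show how `check_distribution_drift` is implemented.

```python
import numpy as np
from scipy.stats import ks_2samp


def drifted_features(train, test, alpha=0.05):
    """Return {feature: p_value} for numeric features whose KS test rejects equality."""
    flagged = {}
    for name in train:
        result = ks_2samp(train[name], test[name])
        if result.pvalue < alpha:
            flagged[name] = result.pvalue
    return flagged


rng = np.random.default_rng(42)
train = {
    "user_age": rng.normal(35, 10, size=2000),
    "session_length": rng.exponential(5.0, size=2000),
}
test = {
    "user_age": rng.normal(45, 10, size=2000),         # mean shifted by one std dev
    "session_length": train["session_length"].copy(),  # identical sample, no drift
}

flagged = drifted_features(train, test)
print(sorted(flagged))
```

The KS test is meant for continuous numeric features. For categorical features, a chi-squared test on category counts is the more usual choice.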
|
||||
|
||||
**Validation Report**:
|
||||
```markdown
|
||||
# Feature Validation Report
|
||||
|
||||
## Data Leakage: ✅ PASS
|
||||
No perfect correlations detected between train and test.
|
||||
|
||||
## Distribution Drift: ⚠️ WARNING
|
||||
Features with significant drift (KS test p < 0.05):
|
||||
- user_age: p=0.023 (minor drift)
|
||||
- device_type: p=0.001 (major drift)
|
||||
|
||||
Recommendation: Check if test data is from different time period.
|
||||
|
||||
## Missing Values: ✅ PASS
|
||||
No missing values after transformation.
|
||||
|
||||
## Invalid Values: ✅ PASS
|
||||
No infinite or NaN values detected.
|
||||
|
||||
## Overall: READY FOR TRAINING
|
||||
2 warnings, 0 critical issues.
|
||||
```
|
||||
|
||||
## Integration with SpecWeave
|
||||
|
||||
### Automatic Feature Documentation
|
||||
|
||||
```python
|
||||
# All feature engineering steps logged to increment
|
||||
with track_experiment("feature-engineering-v1", increment="0042") as exp:
    # Create features
    df_enriched = creator.generate()

    # Select features
    selected = selector.select()

    # Transform features
    X_transformed = transformer.fit_transform(X)

    # Validate
    validation = validator.validate()

    # Auto-logs:
    exp.log_param("original_features", 125)
    exp.log_param("created_features", 45)
    exp.log_param("selected_features", 35)
    exp.log_metric("feature_reduction", 0.72)
    exp.save_artifact("feature_definitions.yaml")
    exp.save_artifact("transformer.pkl")
    exp.save_artifact("validation_report.md")
|
||||
```
|
||||
|
||||
### Living Docs Integration
|
||||
|
||||
After completing feature engineering:
|
||||
|
||||
```bash
|
||||
/specweave:sync-docs update
|
||||
```
|
||||
|
||||
Updates:
|
||||
```markdown
|
||||
<!-- .specweave/docs/internal/architecture/feature-engineering.md -->
|
||||
|
||||
## Recommendation Model Features (Increment 0042)
|
||||
|
||||
### Feature Engineering Pipeline
|
||||
1. Data Quality: 100K rows, 45 columns
|
||||
2. Created: 45 new features (temporal, aggregation, interaction)
|
||||
3. Selected: 35 features (72% reduction via importance + RFE)
|
||||
4. Transformed: StandardScaler for numerical, OneHot for categorical
|
||||
|
||||
### Key Features
|
||||
- user_purchase_amount_mean: Average user spend (top feature, 18% importance)
|
||||
- days_since_last_purchase: Recency indicator (12% importance)
|
||||
- age_x_income: Interaction feature (8% importance)
|
||||
|
||||
### Feature Store
|
||||
All features documented in: `.specweave/increments/0042.../features/`
|
||||
- feature_definitions.yaml: Feature catalog
|
||||
- transformer.pkl: Production transformation pipeline
|
||||
- validation_report.md: Quality checks
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Document Feature Rationale
|
||||
|
||||
```python
|
||||
# Bad: Create features without explanation
|
||||
df["feature_1"] = df["col_a"] * df["col_b"]
|
||||
|
||||
# Good: Document why features were created
|
||||
creator.add_interaction_feature(
|
||||
sources=["age", "income"],
|
||||
operation="multiply",
|
||||
rationale="High-income older users have different behavior patterns"
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Handle Missing Values Systematically
|
||||
|
||||
```python
|
||||
# Options for missing values:
|
||||
# 1. Imputation (mean, median, mode)
|
||||
creator.impute_missing(column="age", strategy="median")
|
||||
|
||||
# 2. Indicator features (flag missing as signal)
|
||||
creator.add_missing_indicator(column="email")
|
||||
# Creates: email_missing (0/1)
|
||||
|
||||
# 3. Forward/backward fill (for time series)
|
||||
creator.fill_missing(column="sensor_reading", method="ffill")
|
||||
|
||||
# 4. Model-based imputation
|
||||
creator.impute_with_model(column="income", model=RandomForestRegressor())
|
||||
```
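
Options 1 and 2 combine naturally: record that a value was missing, then fill it. A minimal pandas sketch on invented data, with the fill value learned from the training split only:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0, np.nan]})
test = pd.DataFrame({"age": [np.nan, 50.0]})

# Learn the fill value from the training split only.
age_median = train["age"].median()

for frame in (train, test):
    # Record missingness before it is overwritten by the imputed value.
    frame["age_missing"] = frame["age"].isna().astype(int)
    frame["age"] = frame["age"].fillna(age_median)

print(train)
print(test)
```

Keeping the indicator column lets the model use missingness itself as a signal, which matters when values are not missing at random.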
|
||||
|
||||
### 3. Avoid Data Leakage
|
||||
|
||||
```python
|
||||
# ❌ WRONG: Fit on all data (includes test set!)
|
||||
scaler.fit(X)
|
||||
X_train = scaler.transform(X_train)
|
||||
X_test = scaler.transform(X_test)
|
||||
|
||||
# ✅ CORRECT: Fit only on train, transform both
|
||||
scaler.fit(X_train)
|
||||
X_train = scaler.transform(X_train)
|
||||
X_test = scaler.transform(X_test)
|
||||
|
||||
# SpecWeave's transformer enforces this pattern
|
||||
transformer.fit_transform(X_train) # Fits
|
||||
transformer.transform(X_test) # Only transforms
|
||||
```
|
||||
|
||||
### 4. Version Feature Engineering Pipeline
|
||||
|
||||
```python
|
||||
# Version features with increment
|
||||
transformer.save(
|
||||
path=".specweave/increments/0042.../features/transformer-v1.pkl",
|
||||
metadata={
|
||||
"version": "v1",
|
||||
"features": selected_features,
|
||||
"transformations": ["standard_scaler", "onehot"]
|
||||
}
|
||||
)
|
||||
|
||||
# Load specific version for reproducibility
|
||||
transformer_v1 = FeatureTransformer.load(
|
||||
".specweave/increments/0042.../features/transformer-v1.pkl"
|
||||
)
|
||||
```
|
||||
|
||||
### 5. Test Feature Engineering on New Data
|
||||
|
||||
```python
|
||||
# Before deploying, test on held-out data
|
||||
X_production_sample = load_production_data()
|
||||
|
||||
try:
    X_transformed = transformer.transform(X_production_sample)
except Exception as e:
    raise FeatureEngineeringError(f"Failed on production data: {e}")

# Check for unexpected values
validator = FeatureValidator(X_train, X_production_sample)
validation_report = validator.validate()

if validation_report["status"] == "CRITICAL":
    raise FeatureEngineeringError("Feature engineering failed validation")
|
||||
```
|
||||
|
||||
## Common Feature Engineering Patterns
|
||||
|
||||
### Pattern 1: RFM (Recency, Frequency, Monetary)
|
||||
|
||||
```python
|
||||
# For e-commerce / customer analytics
|
||||
creator.add_rfm_features(
|
||||
user_id="user_id",
|
||||
transaction_date="purchase_date",
|
||||
transaction_amount="purchase_amount"
|
||||
)
|
||||
# Creates:
|
||||
# - recency: days since last purchase
|
||||
# - frequency: total purchases
|
||||
# - monetary: total spend
|
||||
```
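
The three RFM quantities reduce to a single group-by. The sketch below uses invented transactions and assumes recency is measured from the day after the last transaction in the data; a real pipeline would typically use the scoring date instead.

```python
import pandas as pd

tx = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "purchase_date": pd.to_datetime([
        "2024-01-01", "2024-01-20", "2024-01-05",
        "2024-01-10", "2024-01-28", "2023-12-15",
    ]),
    "purchase_amount": [50.0, 30.0, 20.0, 25.0, 40.0, 200.0],
})

# Reference date for recency (assumption: day after the latest transaction).
snapshot = tx["purchase_date"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("user_id").agg(
    recency=("purchase_date", lambda d: (snapshot - d.max()).days),
    frequency=("purchase_date", "count"),
    monetary=("purchase_amount", "sum"),
)

print(rfm)
```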
|
||||
|
||||
### Pattern 2: Rolling Window Aggregations
|
||||
|
||||
```python
|
||||
# For time series
|
||||
creator.add_rolling_features(
|
||||
column="daily_sales",
|
||||
windows=[7, 14, 30],
|
||||
aggs=["mean", "std", "min", "max"]
|
||||
)
|
||||
# Creates: daily_sales_7day_mean, daily_sales_7day_std, etc.
|
||||
```
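
In pandas, rolling features are a `shift` followed by `rolling`. The sketch below is an illustration on invented data. Shifting by one day is a design choice made here so that each row only sees earlier days, and it is not necessarily what `add_rolling_features` does.

```python
import pandas as pd

sales = pd.DataFrame(
    {"daily_sales": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 16.0, 20.0]},
    index=pd.date_range("2024-01-01", periods=8, freq="D"),
)

for window in (3, 7):
    # shift(1) so each row only sees days strictly before it (no look-ahead).
    past = sales["daily_sales"].shift(1).rolling(window)
    sales[f"daily_sales_{window}day_mean"] = past.mean()
    sales[f"daily_sales_{window}day_std"] = past.std()

print(sales)
```

Rows without a full window of history come out as NaN, so they need to be dropped or imputed before training.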
|
||||
|
||||
### Pattern 3: Target Encoding (Categorical → Numerical)
|
||||
|
||||
```python
|
||||
# Encode categorical as target mean (careful: can leak!)
|
||||
creator.add_target_encoding(
|
||||
column="product_category",
|
||||
target="purchase_amount",
|
||||
cv_folds=5 # Cross-validation to prevent leakage
|
||||
)
|
||||
# Creates: product_category_target_encoded
|
||||
```
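
The cross-validation guard works by encoding each row with category means computed from the other folds, so a row's own target never feeds its own feature. A minimal sketch, assuming a plain K-fold scheme and a global-mean fallback for unseen categories; the helper name `target_encode_oof` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode_oof(df, column, target, n_splits=5, seed=0):
    """Out-of-fold mean target encoding.

    Categories absent from the fitting folds fall back to those folds' overall mean.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in folds.split(df):
        fit_part = df.iloc[fit_idx]
        category_means = fit_part.groupby(column)[target].mean()
        fallback = fit_part[target].mean()
        values = df.iloc[apply_idx][column].map(category_means).fillna(fallback)
        encoded.iloc[apply_idx] = values.to_numpy()
    return encoded


df = pd.DataFrame({
    "product_category": ["a"] * 4 + ["b"] * 4 + ["rare"],
    "purchase_amount": [10.0] * 4 + [20.0] * 4 + [1000.0],
})
df["product_category_target_encoded"] = target_encode_oof(
    df, "product_category", "purchase_amount"
)

print(df)
```

The single `rare` row shows the point of the scheme: its own target of 1000 never reaches its encoding, because the folds used to fit its category means do not contain it.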
|
||||
|
||||
### Pattern 4: Polynomial Features
|
||||
|
||||
```python
|
||||
# For non-linear relationships
|
||||
creator.add_polynomial_features(
|
||||
columns=["age", "income"],
|
||||
degree=2,
|
||||
interaction_only=True
|
||||
)
|
||||
# Creates: age*income only (interaction_only=True omits age^2 and income^2)
|
||||
```
|
||||
|
||||
## Commands
|
||||
|
||||
```bash
|
||||
# Generate feature engineering pipeline for increment
|
||||
/ml:engineer-features 0042
|
||||
|
||||
# Validate features before training
|
||||
/ml:validate-features 0042
|
||||
|
||||
# Generate feature importance report
|
||||
/ml:feature-importance 0042
|
||||
```
|
||||
|
||||
## Integration with Other Skills
|
||||
|
||||
- **ml-pipeline-orchestrator**: Task 2 is "Feature Engineering" (uses this skill)
|
||||
- **experiment-tracker**: Logs all feature engineering experiments
|
||||
- **model-evaluator**: Uses feature importance from models
|
||||
- **ml-deployment-helper**: Packages feature transformer for production
|
||||
|
||||
## Summary
|
||||
|
||||
Feature engineering often matters more to a model's success than the choice of algorithm. This skill ensures:
|
||||
- ✅ Systematic approach (quality → create → select → transform → validate)
|
||||
- ✅ No data leakage (train/test separation enforced)
|
||||
- ✅ Production-ready (versioned, validated, documented)
|
||||
- ✅ Reproducible (all steps tracked in increment)
|
||||
- ✅ Traceable (feature definitions in living docs)
|
||||
|
||||
Good features can make a mediocre model good; great features can make it excellent.
|
||||
345
skills/ml-deployment-helper/SKILL.md
Normal file
@@ -0,0 +1,345 @@
|
||||
---
|
||||
name: ml-deployment-helper
|
||||
description: |
|
||||
Prepares ML models for production deployment with containerization, API creation, monitoring setup, and A/B testing. Activates for "deploy model", "production deployment", "model API", "containerize model", "docker ml", "serving ml model", "model monitoring", "A/B test model". Generates deployment artifacts and ensures models are production-ready with monitoring, versioning, and rollback capabilities.
|
||||
---
|
||||
|
||||
# ML Deployment Helper
|
||||
|
||||
## Overview
|
||||
|
||||
Bridges the gap between trained models and production systems. Generates deployment artifacts, APIs, monitoring, and A/B testing infrastructure following MLOps best practices.
|
||||
|
||||
## Deployment Checklist
|
||||
|
||||
Before deploying any model, this skill ensures:
|
||||
|
||||
- ✅ Model versioned and tracked
|
||||
- ✅ Dependencies documented (requirements.txt/Dockerfile)
|
||||
- ✅ API endpoint created
|
||||
- ✅ Input validation implemented
|
||||
- ✅ Monitoring configured
|
||||
- ✅ A/B testing ready
|
||||
- ✅ Rollback plan documented
|
||||
- ✅ Performance benchmarked
|
||||
|
||||
## Deployment Patterns
|
||||
|
||||
### Pattern 1: REST API (FastAPI)
|
||||
|
||||
```python
|
||||
from specweave import create_model_api
|
||||
|
||||
# Generates production-ready API
|
||||
api = create_model_api(
|
||||
model_path="models/model-v3.pkl",
|
||||
increment="0042",
|
||||
framework="fastapi"
|
||||
)
|
||||
|
||||
# Creates:
|
||||
# - api/
|
||||
# ├── main.py (FastAPI app)
|
||||
# ├── models.py (Pydantic schemas)
|
||||
# ├── predict.py (Prediction logic)
|
||||
# ├── Dockerfile
|
||||
# ├── requirements.txt
|
||||
# └── tests/
|
||||
```
|
||||
|
||||
Generated `main.py`:
|
||||
```python
|
||||
from datetime import datetime

import joblib
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Recommendation Model API", version="0042-v3")

model = joblib.load("model-v3.pkl")


class PredictionRequest(BaseModel):
    user_id: int
    context: dict


@app.post("/predict")
async def predict(request: PredictionRequest):
    try:
        prediction = model.predict([request.dict()])
        return {
            "recommendations": prediction.tolist(),
            "model_version": "0042-v3",
            "timestamp": datetime.now()
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}
|
||||
```
|
||||
|
||||
### Pattern 2: Batch Prediction
|
||||
|
||||
```python
|
||||
from specweave import create_batch_predictor
|
||||
|
||||
# For offline scoring
|
||||
batch_predictor = create_batch_predictor(
|
||||
model_path="models/model-v3.pkl",
|
||||
increment="0042",
|
||||
input_path="s3://bucket/data/",
|
||||
output_path="s3://bucket/predictions/"
|
||||
)
|
||||
|
||||
# Creates:
|
||||
# - batch/
|
||||
# ├── predictor.py
|
||||
# ├── scheduler.yaml (Airflow/Kubernetes CronJob)
|
||||
# └── monitoring.py
|
||||
```
|
||||
|
||||
### Pattern 3: Real-Time Streaming
|
||||
|
||||
```python
|
||||
from specweave import create_streaming_predictor
|
||||
|
||||
# For Kafka/Kinesis streams
|
||||
streaming = create_streaming_predictor(
|
||||
model_path="models/model-v3.pkl",
|
||||
increment="0042",
|
||||
input_topic="user-events",
|
||||
output_topic="predictions"
|
||||
)
|
||||
|
||||
# Creates:
|
||||
# - streaming/
|
||||
# ├── consumer.py
|
||||
# ├── predictor.py
|
||||
# ├── producer.py
|
||||
# └── docker-compose.yaml
|
||||
```
|
||||
|
||||
## Containerization
|
||||
|
||||
```python
|
||||
from specweave import containerize_model
|
||||
|
||||
# Generates optimized Dockerfile
|
||||
dockerfile = containerize_model(
|
||||
model_path="models/model-v3.pkl",
|
||||
framework="sklearn",
|
||||
python_version="3.10",
|
||||
increment="0042"
|
||||
)
|
||||
```
|
||||
|
||||
Generated `Dockerfile`:
|
||||
```dockerfile
|
||||
FROM python:3.10-slim
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
# Copy model and dependencies
|
||||
COPY models/model-v3.pkl /app/model-v3.pkl
|
||||
COPY requirements.txt /app/
|
||||
|
||||
# Install dependencies
|
||||
RUN pip install --no-cache-dir -r requirements.txt
|
||||
|
||||
# Copy application
|
||||
COPY api/ /app/api/
|
||||
|
||||
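# Health check (python:3.10-slim ships without curl, so use the interpreter)
HEALTHCHECK --interval=30s --timeout=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)" || exit 1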
|
||||
|
||||
# Run API
|
||||
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
|
||||
```
|
||||
|
||||
## Monitoring Setup
|
||||
|
||||
```python
|
||||
from specweave import setup_model_monitoring
|
||||
|
||||
# Configures monitoring for production
|
||||
monitoring = setup_model_monitoring(
|
||||
model_name="recommendation-model",
|
||||
increment="0042",
|
||||
metrics=[
|
||||
"prediction_latency",
|
||||
"throughput",
|
||||
"error_rate",
|
||||
"prediction_distribution",
|
||||
"feature_drift"
|
||||
]
|
||||
)
|
||||
|
||||
# Creates:
|
||||
# - monitoring/
|
||||
# ├── prometheus.yaml
|
||||
# ├── grafana-dashboard.json
|
||||
# ├── alerts.yaml
|
||||
# └── drift-detector.py
|
||||
```
|
||||
|
||||
## A/B Testing Infrastructure
|
||||
|
||||
```python
|
||||
from specweave import create_ab_test
|
||||
|
||||
# Sets up A/B test framework
|
||||
ab_test = create_ab_test(
|
||||
control_model="model-v2.pkl",
|
||||
treatment_model="model-v3.pkl",
|
||||
traffic_split=0.1, # 10% to new model
|
||||
success_metric="click_through_rate",
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Creates:
|
||||
# - ab-test/
|
||||
# ├── router.py (traffic splitting)
|
||||
# ├── metrics.py (success tracking)
|
||||
# ├── statistical-tests.py (significance testing)
|
||||
# └── dashboard.py (real-time monitoring)
|
||||
```
|
||||
|
||||
A/B Test Router:
|
||||
```python
|
||||
import hashlib


def route_prediction(user_id, features, control_model, treatment_model):
    """Route to control or treatment based on a stable hash of user_id"""

    # Consistent hashing (same user always gets same model).
    # Built-in hash() is salted per process for strings, so use hashlib instead.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    user_bucket = int(digest, 16) % 100

    if user_bucket < 10:  # 10% to treatment
        return treatment_model.predict(features), "treatment"
    else:
        return control_model.predict(features), "control"
|
||||
```
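
The router depends on every user landing in the same bucket on every request and in every process. The sketch below isolates that bucketing step so it can be checked on its own. It assumes a SHA-256 digest salted with an experiment label, because Python's built-in `hash()` is randomized per process for strings; the function names and the `model-v3-rollout` label are illustrative, not SpecWeave API.

```python
import hashlib


def assign_bucket(user_id, experiment: str = "model-v3-rollout") -> int:
    """Map a user to a stable bucket in [0, 100) using a salted SHA-256 hash."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % 100


def assign_variant(
    user_id, treatment_percent: int = 10, experiment: str = "model-v3-rollout"
) -> str:
    """Return "treatment" for roughly `treatment_percent` percent of users."""
    bucket = assign_bucket(user_id, experiment)
    return "treatment" if bucket < treatment_percent else "control"


# Sanity check: the observed split should sit close to the configured 10%.
share = sum(assign_variant(uid) == "treatment" for uid in range(10_000)) / 10_000
print(f"treatment share: {share:.3f}")
```

Salting with the experiment name means a user's bucket in one test is independent of their bucket in the next, so the same users are not always the first to receive new models.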
|
||||
|
||||
## Model Versioning
|
||||
|
||||
```python
|
||||
from specweave import ModelVersion
|
||||
|
||||
# Register model version
|
||||
version = ModelVersion.register(
|
||||
model_path="models/model-v3.pkl",
|
||||
increment="0042",
|
||||
metadata={
|
||||
"accuracy": 0.87,
|
||||
"training_date": "2024-01-15",
|
||||
"data_version": "v2024-01",
|
||||
"framework": "xgboost==1.7.0"
|
||||
}
|
||||
)
|
||||
|
||||
# Easy rollback
|
||||
if production_metrics["error_rate"] > threshold:
    ModelVersion.rollback(to_version="0042-v2")
|
||||
```
|
||||
|
||||
## Load Testing
|
||||
|
||||
```python
|
||||
from specweave import load_test_model
|
||||
|
||||
# Benchmark model performance
|
||||
results = load_test_model(
|
||||
api_url="http://localhost:8000/predict",
|
||||
requests_per_second=[10, 50, 100, 500, 1000],
|
||||
duration_seconds=60,
|
||||
increment="0042"
|
||||
)
|
||||
```
|
||||
|
||||
Output:
|
||||
```
|
||||
Load Test Results:
|
||||
==================
|
||||
|
||||
| RPS | Latency P50 | Latency P95 | Latency P99 | Error Rate |
|
||||
|------|-------------|-------------|-------------|------------|
|
||||
| 10 | 35ms | 45ms | 50ms | 0.00% |
|
||||
| 50 | 38ms | 52ms | 65ms | 0.00% |
|
||||
| 100 | 45ms | 70ms | 95ms | 0.02% |
|
||||
| 500 | 120ms | 250ms | 400ms | 1.20% |
|
||||
| 1000 | 350ms | 800ms | 1200ms | 8.50% |
|
||||
|
||||
Recommendation: Deploy with max 100 RPS per instance
|
||||
Target: <100ms P95 latency (achieved at 100 RPS)
|
||||
```
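
The recommendation in that report follows mechanically from the table. A standard-library sketch of the two steps involved, computing percentiles from raw latencies and picking the highest load that meets the P95 target; the helper names are hypothetical and the P95 values are copied from the table above.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return P50/P95/P99 from raw per-request latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def max_rps_within_target(p95_by_rps, target_ms=100):
    """Highest tested RPS whose P95 latency stays under the target, or None."""
    passing = [rps for rps, p95 in p95_by_rps.items() if p95 < target_ms]
    return max(passing) if passing else None


# P95 values taken from the load test table.
p95_by_rps = {10: 45, 50: 52, 100: 70, 500: 250, 1000: 800}
print(max_rps_within_target(p95_by_rps))

# Percentiles of a synthetic latency sample (1 ms .. 101 ms).
print(latency_percentiles(range(1, 102)))
```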
|
||||
|
||||
## Deployment Commands
|
||||
|
||||
```bash
|
||||
# Generate deployment artifacts
|
||||
/ml:deploy-prepare 0042
|
||||
|
||||
# Create API
|
||||
/ml:create-api --increment 0042 --framework fastapi
|
||||
|
||||
# Setup monitoring
|
||||
/ml:setup-monitoring 0042
|
||||
|
||||
# Create A/B test
|
||||
/ml:create-ab-test --control v2 --treatment v3 --split 0.1
|
||||
|
||||
# Load test
|
||||
/ml:load-test 0042 --rps 100 --duration 60s
|
||||
|
||||
# Deploy to production
|
||||
/ml:deploy 0042 --environment production
|
||||
```
|
||||
|
||||
## Deployment Increment
|
||||
|
||||
The skill creates a deployment increment:
|
||||
|
||||
```
|
||||
.specweave/increments/0043-deploy-recommendation-model/
|
||||
├── spec.md (deployment requirements)
|
||||
├── plan.md (deployment strategy)
|
||||
├── tasks.md
|
||||
│ ├── [ ] Containerize model
|
||||
│ ├── [ ] Create API
|
||||
│ ├── [ ] Setup monitoring
|
||||
│ ├── [ ] Configure A/B test
|
||||
│ ├── [ ] Load test
|
||||
│ ├── [ ] Deploy to staging
|
||||
│ ├── [ ] Validate staging
|
||||
│ └── [ ] Deploy to production
|
||||
├── api/ (FastAPI app)
|
||||
├── monitoring/ (Grafana dashboards)
|
||||
├── ab-test/ (A/B testing logic)
|
||||
└── load-tests/ (Performance benchmarks)
|
||||
```
|
||||
|
||||
## Best Practices

1. **Always load test** before production
2. **Start with 1-5% traffic** in A/B test
3. **Monitor model drift** in production
4. **Version everything** (model, data, code)
5. **Document rollback plan** before deploying
6. **Set up alerts** for anomalies
7. **Gradual rollout** (canary deployment)

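Sending only a small share of traffic to the treatment model requires an assignment rule that is stable per user. One common approach, sketched here under assumptions, hashes the user ID into a bucket in [0, 1). The function name, the salt value, and the 5% share are hypothetical and do not describe how `/ml:create-ab-test` works internally.

```python
import hashlib


def assign_variant(user_id: str, treatment_share: float, salt: str = "ab-test-v3") -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # First 8 hex digits give an integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(20000)]
    treated = sum(assign_variant(u, 0.05) == "treatment" for u in users)
    print(f"treatment share: {treated / len(users):.3f}")
```

Because the bucket depends only on the salt and the user ID, a user keeps the same variant across requests, and raising `treatment_share` only moves users from control to treatment, never the other way.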
## Integration with SpecWeave

```bash
# After training model (increment 0042)
/specweave:inc "0043-deploy-recommendation-model"

# Generates deployment increment with all artifacts
/specweave:do

# Deploy to production when ready
/ml:deploy 0043 --environment production
```

Model deployment is not the end; it is the beginning of the MLOps lifecycle.

518
skills/ml-pipeline-orchestrator/SKILL.md
Normal file
@@ -0,0 +1,518 @@
---
name: ml-pipeline-orchestrator
description: |
  Orchestrates complete machine learning pipelines within SpecWeave increments. Activates when users request "ML pipeline", "train model", "build ML system", "end-to-end ML", "ML workflow", "model training pipeline", or similar. Guides users through data preprocessing, feature engineering, model training, evaluation, and deployment using SpecWeave's spec-driven approach. Integrates with increment lifecycle for reproducible ML development.
---

# ML Pipeline Orchestrator

## Overview

This skill transforms ML development into a SpecWeave increment-based workflow, ensuring every ML project follows the same disciplined approach: spec → plan → tasks → implement → validate. It orchestrates the complete ML lifecycle from data exploration to model deployment, with full traceability and living documentation.

## Core Philosophy

**SpecWeave + ML = Disciplined Data Science**

Traditional ML development often lacks structure:
- ❌ Jupyter notebooks with no version control
- ❌ Experiments without documentation
- ❌ Models deployed with no reproducibility
- ❌ Team knowledge trapped in individual notebooks

SpecWeave brings discipline:
- ✅ Every ML feature is an increment (with spec, plan, tasks)
- ✅ Experiments tracked and documented automatically
- ✅ Model versions tied to increments
- ✅ Living docs capture learnings and decisions

## How It Works

### Phase 1: ML Increment Planning

When you request "build a recommendation model", the skill:

1. **Creates ML increment structure**:
```
.specweave/increments/0042-recommendation-model/
├── spec.md                    # ML requirements, success metrics
├── plan.md                    # Pipeline architecture
├── tasks.md                   # Implementation tasks
├── tests.md                   # Evaluation criteria
├── experiments/               # Experiment tracking
│   ├── exp-001-baseline/
│   ├── exp-002-xgboost/
│   └── exp-003-neural-net/
├── data/                      # Data samples, schemas
│   ├── schema.yaml
│   └── sample.csv
├── models/                    # Trained models
│   ├── model-v1.pkl
│   └── model-v2.pkl
└── notebooks/                 # Exploratory notebooks
    ├── 01-eda.ipynb
    └── 02-feature-engineering.ipynb
```

2. **Generates ML-specific spec** (spec.md):
```markdown
## ML Problem Definition
- Problem type: Recommendation (collaborative filtering)
- Input: User behavior history
- Output: Top-N product recommendations
- Success metrics: Precision@10 > 0.25, Recall@10 > 0.15

## Data Requirements
- Training data: 6 months user interactions
- Validation: Last month
- Features: User profile, product attributes, interaction history

## Model Requirements
- Latency: <100ms inference
- Throughput: 1000 req/sec
- Accuracy: Better than random baseline by 3x
- Explainability: Must explain top-3 recommendations
```

3. **Creates ML-specific tasks** (tasks.md):
```markdown
- [ ] T-001: Data exploration and quality analysis
- [ ] T-002: Feature engineering pipeline
- [ ] T-003: Train baseline model (random/popularity)
- [ ] T-004: Train candidate models (3 algorithms)
- [ ] T-005: Hyperparameter tuning (best model)
- [ ] T-006: Model evaluation (all metrics)
- [ ] T-007: Model explainability (SHAP/LIME)
- [ ] T-008: Production deployment preparation
- [ ] T-009: A/B test plan
```

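The success metrics in the generated spec, Precision@10 and Recall@10, are ranking metrics computed per user and then averaged. A minimal sketch of the per-user calculation, assuming a ranked list of recommended item IDs and a set of items the user actually interacted with (function names and sample data are illustrative):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant.

    Divides by k even when fewer than k items were recommended,
    which is the stricter of the common conventions.
    """
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    recommended = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
    relevant = {"b", "e", "x", "y"}  # hypothetical held-out interactions
    print(precision_at_k(recommended, relevant, 10))  # 2 hits out of 10
    print(recall_at_k(recommended, relevant, 10))     # 2 of 4 relevant items found
```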
### Phase 2: Pipeline Execution

The skill guides through each task with best practices:

#### Task 1: Data Exploration
```python
# Generated template with SpecWeave integration
import pandas as pd
import mlflow
from specweave import track_experiment

# Auto-logs to .specweave/increments/0042.../experiments/
with track_experiment("exp-001-eda") as exp:
    df = pd.read_csv("data/interactions.csv")

    # EDA
    exp.log_param("dataset_size", len(df))
    exp.log_metric("missing_values", df.isnull().sum().sum())

    # Auto-generates report in increment folder
    exp.save_report("eda-summary.md")
```

#### Task 3: Train Baseline
```python
from sklearn.dummy import DummyClassifier
from specweave import track_model

# X_train and y_train are assumed to come from the feature engineering step (T-002)
with track_model("baseline-random", increment="0042") as model:
    clf = DummyClassifier(strategy="uniform")
    clf.fit(X_train, y_train)

    # Automatically logged to increment
    model.log_metrics({
        "accuracy": 0.12,
        "precision@10": 0.08
    })
    model.save_artifact(clf, "baseline.pkl")
```

#### Task 4: Train Candidate Models
```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
# run_experiments is assumed to be exported by specweave alongside ModelExperiment
from specweave import ModelExperiment, run_experiments

# KerasModel and the params_* dicts are placeholders assumed to be defined elsewhere

# Parallel experiments with auto-tracking
experiments = [
    ModelExperiment("xgboost", XGBClassifier, params_xgb),
    ModelExperiment("lightgbm", LGBMClassifier, params_lgbm),
    ModelExperiment("neural-net", KerasModel, params_nn)
]

results = run_experiments(
    experiments,
    increment="0042",
    save_to="experiments/"
)

# Auto-generates comparison table in increment docs
```

### Phase 3: Increment Completion

When `/specweave:done 0042` runs:

1. **Validates ML-specific criteria**:
   - ✅ All experiments logged
   - ✅ Best model saved
   - ✅ Evaluation metrics documented
   - ✅ Model explainability artifacts present

2. **Generates completion summary**:
```markdown
## Recommendation Model - COMPLETE

### Experiments Run: 7
1. exp-001-baseline (random): precision@10=0.08
2. exp-002-popularity: precision@10=0.18
3. exp-003-xgboost: precision@10=0.26 ✅ BEST
4. exp-004-lightgbm: precision@10=0.24
5. exp-005-neural-net: precision@10=0.22
...

### Best Model
- Algorithm: XGBoost
- Version: model-v3.pkl
- Metrics: precision@10=0.26, recall@10=0.16
- Training time: 45 min
- Model size: 12 MB

### Deployment Ready
- ✅ Inference latency: 35ms (target: <100ms)
- ✅ Explainability: SHAP values computed
- ✅ A/B test plan documented
```

3. **Syncs living docs** (via `/specweave:sync-docs`):
   - Updates architecture docs with model design
   - Adds ADR for algorithm selection
   - Documents learnings in runbooks

## When to Use This Skill

Activate this skill when you need to:

- **Build ML features end-to-end** - From idea to deployed model
- **Ensure reproducibility** - Every experiment tracked and documented
- **Follow ML best practices** - Baseline comparison, proper validation, explainability
- **Integrate ML with software engineering** - ML as increments, not isolated notebooks
- **Maintain team knowledge** - Living docs capture why decisions were made

## ML Pipeline Stages

### 1. Data Stage
- Data exploration (EDA)
- Data quality assessment
- Schema validation
- Sample data documentation

### 2. Feature Stage
- Feature engineering
- Feature selection
- Feature importance analysis
- Feature store integration (optional)

### 3. Training Stage
- Baseline model (random, rule-based)
- Candidate models (3+ algorithms)
- Hyperparameter tuning
- Cross-validation

### 4. Evaluation Stage
- Comprehensive metrics (accuracy, precision, recall, F1, AUC)
- Business metrics (latency, throughput)
- Model comparison (vs baseline, vs previous version)
- Error analysis

### 5. Explainability Stage
- Feature importance
- SHAP values
- LIME explanations
- Example predictions with rationale

### 6. Deployment Stage
- Model packaging
- Inference pipeline
- A/B test plan
- Monitoring setup

## Integration with SpecWeave Workflow

### With Experiment Tracking
```bash
# Start ML increment
/specweave:inc "0042-recommendation-model"

# Automatically integrates experiment tracking
# All MLflow/W&B logs saved to increment folder
```

### With Living Docs
```bash
# After training best model
/specweave:sync-docs update

# Automatically:
# - Updates architecture/ml-models.md
# - Adds ADR for algorithm choice
# - Documents hyperparameters in runbooks
```

### With GitHub Sync
```bash
# Create GitHub issue for model retraining
/specweave:github:create-issue "Retrain recommendation model with new data"

# Linked to increment 0042
# Issue tracks model performance over time
```

## Best Practices

### 1. Always Start with Baseline
```python
# Before training complex models, establish baseline
baseline_results = train_baseline_model(
    strategies=["random", "popularity", "rule-based"]
)
# Requirement: New model must beat best baseline by 20%+
```

### 2. Use Cross-Validation
```python
from sklearn.model_selection import cross_val_score

# Never trust single train/test split
# exp is the active tracker from track_experiment(...)
cv_scores = cross_val_score(model, X, y, cv=5)
exp.log_metric("cv_mean", cv_scores.mean())
exp.log_metric("cv_std", cv_scores.std())
```

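To make the cross-validation step concrete, here is a dependency-free sketch of what a 5-fold split and its summary statistics amount to. It illustrates the idea only and is not the scikit-learn implementation: folds are contiguous, so data would need shuffling first unless it is time-ordered, and the helper names are assumptions.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def summarize_scores(scores):
    """Mean and sample standard deviation across folds."""
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    for train_idx, val_idx in k_fold_indices(10, 3):
        print(len(train_idx), val_idx)
    # Hypothetical fold accuracies
    mean, std = summarize_scores([0.85, 0.88, 0.84, 0.87, 0.86])
    print(f"cv_mean={mean:.3f} cv_std={std:.3f}")
```

Logging both the mean and the spread matters because a model with a slightly higher mean but much larger fold-to-fold variance may not be a real improvement.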
### 3. Track Everything
```python
# Hyperparameters, metrics, artifacts, environment
exp.log_params(model.get_params())
exp.log_metrics({"accuracy": acc, "f1": f1})
exp.log_artifact("model.pkl")
exp.log_artifact("requirements.txt")  # Reproducibility
```

### 4. Document Failures
```python
# Failed experiments are valuable learnings
with track_experiment("exp-006-failed-lstm") as exp:
    # ... training fails ...
    exp.log_note("FAILED: LSTM overfits badly, needs regularization")
    exp.set_status("failed")
# This documents why LSTM wasn't chosen
```

### 5. Model Versioning
```python
import mlflow

# Tie model versions to increments
# iteration and run_id are assumed to come from the current training run
model_version = f"0042-v{iteration}"
mlflow.register_model(
    f"runs:/{run_id}/model",
    f"recommendation-model-{model_version}"
)
```

## Examples

### Example 1: Classification Pipeline
```bash
User: "Build a fraud detection model for transactions"

Skill creates increment 0051-fraud-detection with:
- spec.md: Binary classification, 99% precision target
- plan.md: Imbalanced data handling, threshold tuning
- tasks.md: 9 tasks from EDA to deployment
- experiments/: exp-001-baseline, exp-002-xgboost, etc.

Guides through:
1. EDA → identify class imbalance (0.1% fraud)
2. Baseline → random/majority (terrible results)
3. Candidates → XGBoost, LightGBM, Neural Net
4. Threshold tuning → optimize for precision
5. SHAP → explain high-risk predictions
6. Deploy → model + threshold + explainer
```

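The threshold tuning step in the fraud example can be pictured as a search over score cut-offs on a validation set. The sketch below, with hypothetical scores and labels, returns the lowest threshold that still meets a precision target, which is also the qualifying threshold with the highest recall. The function name and data are assumptions for illustration.

```python
def pick_threshold(scores, labels, target_precision):
    """Return (threshold, precision, recall) for the lowest threshold meeting the target.

    scores: predicted fraud probabilities; labels: 1 for fraud, 0 otherwise.
    Returns None when no threshold reaches the target precision.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("labels must contain at least one positive")
    best = None
    # Walk from strict to lenient; recall can only grow as the threshold drops,
    # so the last qualifying threshold has the best recall.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [label for score, label in zip(scores, labels) if score >= threshold]
        precision = sum(flagged) / len(flagged)
        recall = sum(flagged) / total_positives
        if precision >= target_precision:
            best = (threshold, precision, recall)
    return best


if __name__ == "__main__":
    scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    print(pick_threshold(scores, labels, target_precision=0.75))
```

With a target as strict as 99% precision and 0.1% fraud prevalence, the validation set needs enough positive examples for the precision estimate at the chosen threshold to be trustworthy.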
### Example 2: Regression Pipeline
```bash
User: "Predict customer lifetime value"

Skill creates increment 0063-ltv-prediction with:
- spec.md: Regression, RMSE < $50 target
- plan.md: Time-based validation, feature engineering
- tasks.md: Customer cohort analysis, feature importance

Key difference: Regression-specific evaluation (RMSE, MAE, R²)
```

### Example 3: Time Series Forecasting
```bash
User: "Forecast weekly sales for next 12 weeks"

Skill creates increment 0072-sales-forecasting with:
- spec.md: Time series, MAPE < 10% target
- plan.md: Seasonal decomposition, ARIMA vs Prophet
- tasks.md: Stationarity tests, residual analysis

Key difference: Time series validation (no random split)
```

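"No random split" means every validation window must come strictly after the data used to train on it. One way to do that is an expanding-window split, sketched here with an assumed helper name and a hypothetical 104 weeks of history evaluated on three 12-week horizons:

```python
def expanding_window_splits(n_samples, n_splits, horizon):
    """Yield (train_idx, test_idx) pairs where each test window follows its training data.

    The training window grows with each split; test windows are consecutive
    blocks of `horizon` samples taken from the end of the series.
    """
    first_test_start = n_samples - n_splits * horizon
    if first_test_start < 1:
        raise ValueError("not enough samples for the requested splits and horizon")
    for i in range(n_splits):
        test_start = first_test_start + i * horizon
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + horizon))
        yield train_idx, test_idx


if __name__ == "__main__":
    for train_idx, test_idx in expanding_window_splits(104, n_splits=3, horizon=12):
        print(f"train weeks 0-{train_idx[-1]}, test weeks {test_idx[0]}-{test_idx[-1]}")
```

Matching the test horizon to the forecast horizon (12 weeks here) keeps the validation error comparable to what the deployed forecast will face.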
## Framework Support

This skill works with all major ML frameworks:

### Scikit-Learn
```python
from sklearn.ensemble import RandomForestClassifier
from specweave import track_sklearn_model

model = RandomForestClassifier(n_estimators=100)
with track_sklearn_model(model, increment="0042") as tracked:
    tracked.fit(X_train, y_train)
    tracked.evaluate(X_test, y_test)
```

### PyTorch
```python
import torch
from specweave import track_pytorch_model

model = NeuralNet()
with track_pytorch_model(model, increment="0042") as tracked:
    for epoch in range(epochs):
        # Assumes train_epoch returns the loss for the epoch
        loss = tracked.train_epoch(train_loader)
        tracked.log_metric(f"loss_epoch_{epoch}", loss)
```

### TensorFlow/Keras
```python
from tensorflow import keras
from specweave import KerasCallback

model = keras.Sequential([...])
model.fit(
    X_train, y_train,
    callbacks=[KerasCallback(increment="0042")]
)
```

### XGBoost/LightGBM
```python
import xgboost as xgb
from specweave import track_boosting_model

dtrain = xgb.DMatrix(X_train, label=y_train)
with track_boosting_model("xgboost", increment="0042") as tracked:
    model = xgb.train(params, dtrain, callbacks=[tracked.callback])
```

## Integration Points

### With `experiment-tracker` skill
- Auto-detects MLflow/W&B in project
- Configures tracking URI to increment folder
- Syncs experiment metadata to increment docs

### With `model-evaluator` skill
- Generates comprehensive evaluation reports
- Compares models across experiments
- Highlights best model with confidence intervals

### With `feature-engineer` skill
- Generates feature engineering pipeline
- Documents feature importance
- Creates feature store schemas

### With `ml-engineer` agent
- Delegates complex ML decisions to specialized agent
- Reviews model architecture
- Suggests improvements based on results

## Skill Outputs

After running `/specweave:do` on an ML increment, you get:

```
.specweave/increments/0042-recommendation-model/
├── spec.md ✅
├── plan.md ✅
├── tasks.md ✅ (all completed)
├── COMPLETION-SUMMARY.md ✅
├── experiments/
│   ├── exp-001-baseline/
│   │   ├── metrics.json
│   │   ├── params.json
│   │   └── logs/
│   ├── exp-002-xgboost/ ✅ BEST
│   │   ├── metrics.json
│   │   ├── params.json
│   │   ├── model.pkl
│   │   └── shap_values.pkl
│   └── comparison.md
├── models/
│   ├── model-v3.pkl (best)
│   └── model-v3.metadata.json
├── data/
│   ├── schema.yaml
│   └── sample.parquet
└── notebooks/
    ├── 01-eda.ipynb
    ├── 02-feature-engineering.ipynb
    └── 03-model-analysis.ipynb
```

## Commands

This skill integrates with SpecWeave commands:

```bash
# Create ML increment
/specweave:inc "build recommendation model"
  → Activates ml-pipeline-orchestrator
  → Creates ML-specific increment structure

# Execute ML tasks
/specweave:do
  → Guides through data → train → eval workflow
  → Auto-tracks experiments

# Validate ML increment
/specweave:validate 0042
  → Checks: experiments logged, model saved, metrics documented
  → Validates: model meets success criteria

# Complete ML increment
/specweave:done 0042
  → Generates ML completion summary
  → Syncs model metadata to living docs
```

## Tips

1. **Start simple** - Always begin with baseline, then iterate
2. **Track failures** - Document why approaches didn't work
3. **Version data** - Use DVC or similar for data versioning
4. **Reproducibility** - Log environment (requirements.txt, conda env)
5. **Incremental improvement** - Each increment improves on previous model
6. **Team collaboration** - Living docs make ML decisions visible to all

## Advanced: Multi-Increment ML Projects

For complex ML systems (e.g., recommendation system with multiple models):

```
0042-recommendation-data-pipeline
0043-recommendation-candidate-generation
0044-recommendation-ranking-model
0045-recommendation-reranking
0046-recommendation-ab-test
```

Each increment:
- Has its own spec, plan, tasks
- Builds on previous increments
- Documents model interactions
- Maintains system-level living docs

249
skills/mlops-dag-builder/SKILL.md
Normal file
@@ -0,0 +1,249 @@
---
name: mlops-dag-builder
description: Design DAG-based MLOps pipeline architectures with Airflow, Dagster, Kubeflow, or Prefect. Activates for DAG orchestration, workflow automation, pipeline design patterns, CI/CD for ML. Use for platform-agnostic MLOps infrastructure - NOT for SpecWeave increment-based ML (use ml-pipeline-orchestrator instead).
---

# MLOps DAG Builder

Design and implement DAG-based ML pipeline architectures using production orchestration tools.

## Overview

This skill provides guidance for building **platform-agnostic MLOps pipelines** using DAG orchestrators (Airflow, Dagster, Kubeflow, Prefect). It focuses on workflow architecture, not SpecWeave integration.

**When to use this skill vs ml-pipeline-orchestrator:**
- **Use this skill**: General MLOps architecture, Airflow/Dagster DAGs, cloud ML platforms
- **Use ml-pipeline-orchestrator**: SpecWeave increment-based ML development with experiment tracking

## When to Use This Skill

- Designing DAG-based workflow orchestration (Airflow, Dagster, Kubeflow)
- Implementing platform-agnostic ML pipeline patterns
- Setting up CI/CD automation for ML training jobs
- Creating reusable pipeline templates for teams
- Integrating with cloud ML services (SageMaker, Vertex AI, Azure ML)

## What This Skill Provides

### Core Capabilities

1. **Pipeline Architecture**
   - End-to-end workflow design
   - DAG orchestration patterns (Airflow, Dagster, Kubeflow)
   - Component dependencies and data flow
   - Error handling and retry strategies

2. **Data Preparation**
   - Data validation and quality checks
   - Feature engineering pipelines
   - Data versioning and lineage
   - Train/validation/test splitting strategies

3. **Model Training**
   - Training job orchestration
   - Hyperparameter management
   - Experiment tracking integration
   - Distributed training patterns

4. **Model Validation**
   - Validation frameworks and metrics
   - A/B testing infrastructure
   - Performance regression detection
   - Model comparison workflows

5. **Deployment Automation**
   - Model serving patterns
   - Canary deployments
   - Blue-green deployment strategies
   - Rollback mechanisms

### Reference Documentation

See the `references/` directory for detailed guides:
- **data-preparation.md** - Data cleaning, validation, and feature engineering
- **model-training.md** - Training workflows and best practices
- **model-validation.md** - Validation strategies and metrics
- **model-deployment.md** - Deployment patterns and serving architectures

### Assets and Templates

The `assets/` directory contains:
- **pipeline-dag.yaml.template** - DAG template for workflow orchestration
- **training-config.yaml** - Training configuration template
- **validation-checklist.md** - Pre-deployment validation checklist

## Usage Patterns

### Basic Pipeline Setup

```python
# 1. Define pipeline stages
stages = [
    "data_ingestion",
    "data_validation",
    "feature_engineering",
    "model_training",
    "model_validation",
    "model_deployment"
]

# 2. Configure dependencies
# See assets/pipeline-dag.yaml.template for full example
```

### Production Workflow

1. **Data Preparation Phase**
   - Ingest raw data from sources
   - Run data quality checks
   - Apply feature transformations
   - Version processed datasets

2. **Training Phase**
   - Load versioned training data
   - Execute training jobs
   - Track experiments and metrics
   - Save trained models

3. **Validation Phase**
   - Run validation test suite
   - Compare against baseline
   - Generate performance reports
   - Approve for deployment

4. **Deployment Phase**
   - Package model artifacts
   - Deploy to serving infrastructure
   - Configure monitoring
   - Validate production traffic

## Best Practices

### Pipeline Design

- **Modularity**: Each stage should be independently testable
- **Idempotency**: Re-running stages should be safe
- **Observability**: Log metrics at every stage
- **Versioning**: Track data, code, and model versions
- **Failure Handling**: Implement retry logic and alerting

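Retry logic is usually provided by the orchestrator itself, but the underlying pattern is simple. The sketch below shows exponential backoff around a single stage. It is a generic illustration with assumed names, not the API of Airflow, Dagster, Kubeflow, or Prefect, and it is only safe when the stage is idempotent.

```python
import time


def run_with_retries(stage, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a zero-argument stage callable, retrying with exponential backoff.

    The delay doubles after each failure (base_delay, 2*base_delay, ...).
    The last failure is re-raised so the orchestrator can alert on it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return stage()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_ingestion():
        # Hypothetical stage that fails twice before succeeding.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("source unavailable")
        return "ingested"

    print(run_with_retries(flaky_ingestion, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.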
### Data Management

- Use data validation libraries (Great Expectations, TensorFlow Data Validation from TFX)
- Version datasets with DVC or similar tools
- Document feature engineering transformations
- Maintain data lineage tracking

### Model Operations

- Separate training and serving infrastructure
- Use model registries (MLflow, Weights & Biases)
- Implement gradual rollouts for new models
- Monitor model performance drift
- Maintain rollback capabilities

### Deployment Strategies

- Start with shadow deployments
- Use canary releases for validation
- Implement A/B testing infrastructure
- Set up automated rollback triggers
- Monitor latency and throughput

## Integration Points

### Orchestration Tools

- **Apache Airflow**: DAG-based workflow orchestration
- **Dagster**: Asset-based pipeline orchestration
- **Kubeflow Pipelines**: Kubernetes-native ML workflows
- **Prefect**: Modern dataflow automation

### Experiment Tracking

- MLflow for experiment tracking and model registry
- Weights & Biases for visualization and collaboration
- TensorBoard for training metrics

### Deployment Platforms

- AWS SageMaker for managed ML infrastructure
- Google Vertex AI for GCP deployments
- Azure ML for Azure cloud
- Kubernetes + KServe for cloud-agnostic serving

## Progressive Disclosure

Start with the basics and gradually add complexity:

1. **Level 1**: Simple linear pipeline (data → train → deploy)
2. **Level 2**: Add validation and monitoring stages
3. **Level 3**: Implement hyperparameter tuning
4. **Level 4**: Add A/B testing and gradual rollouts
5. **Level 5**: Multi-model pipelines with ensemble strategies

## Common Patterns

### Batch Training Pipeline

```yaml
# See assets/pipeline-dag.yaml.template
stages:
  - name: data_preparation
    dependencies: []
  - name: model_training
    dependencies: [data_preparation]
  - name: model_evaluation
    dependencies: [model_training]
  - name: model_deployment
    dependencies: [model_evaluation]
```

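A stage list with dependencies like the one in the batch training pattern is a DAG, and an orchestrator derives a valid run order from it by topological sorting. As a small sketch of that idea, the standard-library `graphlib` module (Python 3.9+) can compute the order. The dictionary below mirrors the YAML by hand and is an assumption about how the template might be loaded.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on, mirroring the YAML pattern.
STAGES = {
    "data_preparation": [],
    "model_training": ["data_preparation"],
    "model_evaluation": ["model_training"],
    "model_deployment": ["model_evaluation"],
}


def execution_order(stages):
    """Return stage names in an order that respects every dependency.

    graphlib raises CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(stages).static_order())


if __name__ == "__main__":
    for position, stage in enumerate(execution_order(STAGES), start=1):
        print(position, stage)
```

Real orchestrators add scheduling, parallel execution of independent stages, and retries on top, but the dependency resolution follows the same principle.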
### Real-time Feature Pipeline

```python
# Stream processing for real-time features
# Combined with batch training
# See references/data-preparation.md
```

### Continuous Training

```python
# Automated retraining on schedule
# Triggered by data drift detection
# See references/model-training.md
```

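Drift-triggered retraining needs a numeric drift signal. One widely used option is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its recent distribution. The sketch below is a simplified illustration: the bin count, the 0.2 threshold (a commonly quoted rule of thumb, not a standard), and the function names are all assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a recent sample of one numeric feature.

    Bins are equal-width over the range of the reference sample; values
    outside that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    ref, cur = proportions(expected), proportions(actual)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def should_retrain(psi, threshold=0.2):
    """Hypothetical trigger: retrain when drift exceeds the chosen threshold."""
    return psi > threshold


if __name__ == "__main__":
    training = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
    recent = [0.5 + i / 200 for i in range(100)]    # shifted towards high values
    psi = population_stability_index(training, recent)
    print(f"PSI={psi:.2f} retrain={should_retrain(psi)}")
```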
## Troubleshooting

### Common Issues

- **Pipeline failures**: Check dependencies and data availability
- **Training instability**: Review hyperparameters and data quality
- **Deployment issues**: Validate model artifacts and serving config
- **Performance degradation**: Monitor data drift and model metrics

### Debugging Steps

1. Check pipeline logs for each stage
2. Validate input/output data at boundaries
3. Test components in isolation
4. Review experiment tracking metrics
5. Inspect model artifacts and metadata

## Next Steps

After setting up your pipeline:

1. Explore **hyperparameter-tuning** skill for optimization
2. Learn **experiment-tracking-setup** for MLflow/W&B
3. Review **model-deployment-patterns** for serving strategies
4. Implement monitoring with observability tools

## Related Skills

- **ml-pipeline-orchestrator**: SpecWeave-integrated ML development (use for increment-based ML)
- **experiment-tracker**: MLflow and Weights & Biases experiment tracking
- **automl-optimizer**: Automated hyperparameter optimization with Optuna/Hyperopt
- **ml-deployment-helper**: Model serving and deployment patterns

155
skills/model-evaluator/SKILL.md
Normal file
@@ -0,0 +1,155 @@
---
name: model-evaluator
description: |
  Comprehensive ML model evaluation with multiple metrics, cross-validation, and statistical testing. Activates for "evaluate model", "model metrics", "model performance", "compare models", "validation metrics", "test accuracy", "precision recall", "ROC AUC". Generates detailed evaluation reports with visualizations and statistical significance tests, integrated with SpecWeave increment documentation.
---

# Model Evaluator

## Overview

Provides comprehensive, unbiased model evaluation following ML best practices. Goes beyond simple accuracy to evaluate models across multiple dimensions, ensuring confident deployment decisions.

## Core Evaluation Framework

### 1. Classification Metrics
- Accuracy, Precision, Recall, F1-score
- ROC AUC, PR AUC
- Confusion matrix
- Per-class metrics (for multi-class)
- Class imbalance handling

### 2. Regression Metrics
- RMSE, MAE, MAPE
- R² score, Adjusted R²
- Residual analysis
- Prediction interval coverage

### 3. Ranking Metrics (Recommendations)
- Precision@K, Recall@K
- NDCG@K, MAP@K
- MRR (Mean Reciprocal Rank)
- Coverage, Diversity

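Two of the ranking metrics listed above, NDCG@K and MRR, are easy to misread from their names alone. The following sketch computes both in plain Python. It is a simplified illustration: the ideal ordering for NDCG is taken from the relevance values of the retrieved list only, and the function names are assumptions rather than the `ModelEvaluator` API.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance discounted by log2 of the rank."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(ranked_hits):
    """Average of 1/rank of the first relevant item per query (0 if none is relevant).

    ranked_hits: one list of 0/1 flags per query, in ranked order.
    """
    total = 0.0
    for hits in ranked_hits:
        for pos, hit in enumerate(hits, start=1):
            if hit:
                total += 1 / pos
                break
    return total / len(ranked_hits)


if __name__ == "__main__":
    print(ndcg_at_k([3, 2, 1], k=3))   # already in ideal order
    print(ndcg_at_k([0, 1], k=2))      # relevant item ranked second
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))
```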
### 4. Statistical Validation
- Cross-validation (K-fold, stratified, time-series)
- Confidence intervals
- Statistical significance testing
- Calibration curves

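Confidence intervals for a metric such as accuracy can be estimated without distributional assumptions by bootstrapping the test set. The sketch below uses the percentile bootstrap with a fixed seed. It shows the general technique under assumed names and synthetic data, and it is not necessarily how the evaluator computes its intervals.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    Resamples the per-example correctness flags with replacement and takes
    the alpha/2 and 1 - alpha/2 quantiles of the resampled accuracies.
    """
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    resampled = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


if __name__ == "__main__":
    # Synthetic example: 87 correct predictions out of 100.
    y_true = [1] * 100
    y_pred = [1] * 87 + [0] * 13
    accuracy, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
    print(f"accuracy={accuracy:.2f} 95% CI=[{lower:.2f}, {upper:.2f}]")
```

The interval narrows as the test set grows, which is a useful reminder that small test sets rarely support fine distinctions between models.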
## Usage

```python
from specweave import ModelEvaluator

evaluator = ModelEvaluator(
    model=trained_model,
    X_test=X_test,
    y_test=y_test,
    increment="0042"
)

# Comprehensive evaluation
report = evaluator.evaluate_all()

# Generates:
# - .specweave/increments/0042.../evaluation-report.md
# - Visualizations (confusion matrix, ROC curves, etc.)
# - Statistical tests
```

## Evaluation Report Structure

```markdown
# Model Evaluation Report: XGBoost Classifier

## Overall Performance
- **Accuracy**: 0.87 ± 0.02 (95% CI: [0.85, 0.89])
- **ROC AUC**: 0.92 ± 0.01
- **F1 Score**: 0.85 ± 0.02

## Per-Class Performance
| Class   | Precision | Recall | F1   | Support |
|---------|-----------|--------|------|---------|
| Class 0 | 0.88      | 0.85   | 0.86 | 1000    |
| Class 1 | 0.84      | 0.87   | 0.86 | 800     |

## Confusion Matrix
[Visualization embedded]

## Cross-Validation Results
- 5-fold CV accuracy: 0.86 ± 0.03
- Fold scores: [0.85, 0.88, 0.84, 0.87, 0.86]
- No overfitting detected (train=0.89, val=0.86, gap=0.03)

## Statistical Tests
- Comparison vs baseline: p=0.001 (highly significant)
- Comparison vs previous model: p=0.042 (significant)

## Recommendations
✅ Deploy: Model meets accuracy threshold (>0.85)
✅ Stable: Low variance across folds
⚠️ Monitor: Class 1 precision slightly lower (0.84)
```

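The per-class rows of such a report all derive from the confusion matrix. A minimal sketch of that derivation follows. The counts are hypothetical, chosen only so that the supports (1000 and 800) match the sample report, and the function name is an assumption.

```python
def per_class_metrics(confusion):
    """Compute precision, recall, F1 and support for each class.

    confusion[i][j] is the number of samples with true class i predicted as class j.
    """
    n_classes = len(confusion)
    report = {}
    for c in range(n_classes):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp
        fn = sum(confusion[c]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,
        }
    return report


if __name__ == "__main__":
    # Hypothetical counts: rows are true classes, columns are predictions.
    confusion = [
        [850, 150],
        [104, 696],
    ]
    for label, metrics in per_class_metrics(confusion).items():
        print(label, {name: round(value, 2) for name, value in metrics.items()})
```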
## Model Comparison
|
||||
|
||||
```python
|
||||
from specweave import compare_models
|
||||
|
||||
models = {
|
||||
"baseline": baseline_model,
|
||||
"xgboost": xgb_model,
|
||||
"lightgbm": lgbm_model,
|
||||
"neural-net": nn_model
|
||||
}
|
||||
|
||||
comparison = compare_models(
|
||||
models,
|
||||
X_test,
|
||||
y_test,
|
||||
metrics=["accuracy", "auc", "f1"],
|
||||
increment="0042"
|
||||
)
|
||||
```
|
||||
|
||||
**Output**:
|
||||
```
|
||||
Model Comparison Report
|
||||
=======================
|
||||
|
||||
| Model | Accuracy | ROC AUC | F1 | Inference Time | Model Size |
|
||||
|------------|----------|---------|------|----------------|------------|
|
||||
| baseline | 0.65 | 0.70 | 0.62 | 1ms | 10KB |
|
||||
| xgboost | 0.87 | 0.92 | 0.85 | 35ms | 12MB |
|
||||
| lightgbm | 0.86 | 0.91 | 0.84 | 28ms | 8MB |
|
||||
| neural-net | 0.85 | 0.90 | 0.83 | 120ms | 45MB |
|
||||
|
||||
Recommendation: XGBoost
|
||||
- Best accuracy and AUC
|
||||
- Acceptable inference time (<50ms requirement)
|
||||
- Good size/performance tradeoff
|
||||
```
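The recommendation logic in the report (best quality among models that meet the latency requirement) can be sketched in a few lines. This is an illustration only, not how `compare_models` is implemented; the rows are copied from the table above and `pick_model` is a hypothetical helper.

```python
# Rows copied from the comparison table: (name, accuracy, roc_auc, f1, latency_ms)
results = [
    ("baseline", 0.65, 0.70, 0.62, 1),
    ("xgboost", 0.87, 0.92, 0.85, 35),
    ("lightgbm", 0.86, 0.91, 0.84, 28),
    ("neural-net", 0.85, 0.90, 0.83, 120),
]

MAX_LATENCY_MS = 50  # the "<50ms requirement" quoted in the report


def pick_model(rows, max_latency_ms):
    """Return the name of the highest-AUC model that meets the latency budget."""
    eligible = [r for r in rows if r[4] < max_latency_ms]
    if not eligible:
        raise ValueError("no model meets the latency budget")
    # Rank by ROC AUC, breaking ties with accuracy.
    best = max(eligible, key=lambda r: (r[2], r[1]))
    return best[0]


print(pick_model(results, MAX_LATENCY_MS))
```

Tightening the budget changes the answer, which is why the latency requirement belongs in the comparison rather than being checked afterwards.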
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always compare to baseline** - Random, majority, rule-based
|
||||
2. **Use cross-validation** - Never trust a single split
|
||||
3. **Check calibration** - Are probabilities meaningful?
|
||||
4. **Analyze errors** - What types of mistakes?
|
||||
5. **Test statistical significance** - Is improvement real?
|
||||
|
||||
## Integration with SpecWeave
|
||||
|
||||
```bash
|
||||
# Evaluate model in increment
|
||||
/ml:evaluate-model 0042
|
||||
|
||||
# Compare all models in increment
|
||||
/ml:compare-models 0042
|
||||
|
||||
# Generate full evaluation report
|
||||
/ml:evaluation-report 0042
|
||||
```
|
||||
|
||||
Evaluation results are automatically included in the increment's COMPLETION-SUMMARY.md.
|
||||
227
skills/model-explainer/SKILL.md
Normal file
@@ -0,0 +1,227 @@
|
||||
---
|
||||
name: model-explainer
|
||||
description: |
|
||||
Model interpretability and explainability using SHAP, LIME, feature importance, and partial dependence plots. Activates for "explain model", "model interpretability", "SHAP", "LIME", "feature importance", "why prediction", "model explanation". Generates human-readable explanations for model predictions, critical for trust, debugging, and regulatory compliance.
|
||||
---
|
||||
|
||||
# Model Explainer
|
||||
|
||||
## Overview
|
||||
|
||||
Makes black-box models interpretable. Explains why models make specific predictions, which features matter most, and how features interact. Critical for trust, debugging, and regulatory compliance.
|
||||
|
||||
## Why Explainability Matters
|
||||
|
||||
- **Trust**: Stakeholders trust models they understand
|
||||
- **Debugging**: Find model weaknesses and biases
|
||||
- **Compliance**: GDPR, fair lending laws require explanations
|
||||
- **Improvement**: Understand what to improve
|
||||
- **Safety**: Detect when model might fail
|
||||
|
||||
## Explanation Types
|
||||
|
||||
### 1. Global Explanations (Model-Level)
|
||||
|
||||
**Feature Importance**:
|
||||
```python
|
||||
from specweave import explain_model
|
||||
|
||||
explainer = explain_model(
|
||||
model=trained_model,
|
||||
X_train=X_train,
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Global feature importance
|
||||
importance = explainer.feature_importance()
|
||||
```
|
||||
|
||||
Output:
|
||||
```
|
||||
Top Features (Global):
|
||||
1. transaction_amount (importance: 0.35)
|
||||
2. user_history_days (importance: 0.22)
|
||||
3. merchant_reputation (importance: 0.18)
|
||||
4. time_since_last_transaction (importance: 0.15)
|
||||
5. device_type (importance: 0.10)
|
||||
```
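One model-agnostic way to obtain a global ranking like the one above is permutation importance: shuffle one feature column at a time and measure how much the score drops. The sketch below uses a toy model and synthetic data (both assumptions for illustration); it does not show how `explainer.feature_importance()` computes its numbers.

```python
import random


def model(row):
    """Toy classifier: flags a row when the first feature exceeds 0.5."""
    return 1 if row[0] > 0.5 else 0


def accuracy(rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(rows, labels, n_features, seed=0):
    """Accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(shuffled, labels))
    return importances


rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [model(r) for r in X]  # labels depend only on feature 0

importances = permutation_importance(X, y, n_features=2)
print(importances)
```

A feature the model ignores gets an importance of zero, while shuffling the feature it relies on produces a large drop.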
|
||||
|
||||
**Partial Dependence Plots**:
|
||||
```python
|
||||
# How does feature affect prediction?
|
||||
explainer.partial_dependence(feature="transaction_amount")
|
||||
```
|
||||
|
||||
### 2. Local Explanations (Prediction-Level)
|
||||
|
||||
**SHAP Values**:
|
||||
```python
|
||||
# Explain single prediction
|
||||
explanation = explainer.explain_prediction(X_sample)
|
||||
```
|
||||
|
||||
Output:
|
||||
```
|
||||
Prediction: FRAUD (probability: 0.92)
|
||||
|
||||
Why?
|
||||
+ transaction_amount=5000 → +0.35 (high amount increases fraud risk)
|
||||
+ user_history_days=2 → +0.30 (new user increases risk)
|
||||
+ merchant_reputation=low → +0.25 (suspicious merchant)
|
||||
- time_since_last_transaction=1hr → -0.08 (recent activity normal)
|
||||
|
||||
Base prediction: 0.10
|
||||
Final prediction: 0.92
|
||||
```
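SHAP attributions are additive: the base value plus the per-feature contributions equals the model output for that instance. The sketch below computes exact Shapley values by brute force for a three-feature toy model to show that property. The `predict` function, the instance, and the single reference instance are all hypothetical; the SHAP library uses a background dataset and faster algorithms rather than this enumeration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values, averaging marginal contributions over all orderings.

    Features not yet "switched on" keep their baseline value.
    Only practical for a handful of features (cost grows factorially).
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]
            new = predict(current)
            phi[j] += new - prev
            prev = new
    return [p / len(orderings) for p in phi]


# Hypothetical risk score with an interaction term (not the model from the text).
def predict(f):
    amount, history_days, low_reputation = f
    return (0.1 + 0.0001 * amount + 0.2 * low_reputation
            + 0.00002 * amount * low_reputation - 0.001 * history_days)


x = [5000, 2, 1]          # the instance being explained
baseline = [100, 365, 0]  # a reference "typical" instance

phi = shapley_values(predict, x, baseline)
base_value = predict(baseline)
print(base_value, phi, base_value + sum(phi), predict(x))
```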
|
||||
|
||||
**LIME Explanations**:
|
||||
```python
|
||||
# Local interpretable model
|
||||
lime_exp = explainer.lime_explanation(X_sample)
|
||||
```
|
||||
|
||||
## Usage in SpecWeave
|
||||
|
||||
```python
|
||||
from specweave import ModelExplainer
|
||||
|
||||
# Create explainer
|
||||
explainer = ModelExplainer(
|
||||
model=model,
|
||||
X_train=X_train,
|
||||
feature_names=feature_names,
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Generate all explanations
|
||||
explainer.generate_all_reports()
|
||||
|
||||
# Creates:
|
||||
# - feature-importance.png
|
||||
# - shap-summary.png
|
||||
# - pdp-plots/
|
||||
# - local-explanations/
|
||||
# - explainability-report.md
|
||||
```
|
||||
|
||||
## Real-World Examples
|
||||
|
||||
### Example 1: Fraud Detection
|
||||
|
||||
```python
|
||||
# Explain why transaction flagged as fraud
|
||||
transaction = {
|
||||
"amount": 5000,
|
||||
"user_age_days": 2,
|
||||
"merchant": "new_merchant_xyz"
|
||||
}
|
||||
|
||||
explanation = explainer.explain(transaction)
|
||||
print(explanation.to_text())
|
||||
```
|
||||
|
||||
Output:
|
||||
```
|
||||
FRAUD ALERT (92% confidence)
|
||||
|
||||
Main factors:
|
||||
1. Large transaction amount ($5000) - Very unusual for new users
|
||||
2. Account only 2 days old - Fraud pattern
|
||||
3. Merchant has low reputation score - Red flag
|
||||
|
||||
If this is legitimate:
|
||||
- User should verify identity
|
||||
- Merchant should be manually reviewed
|
||||
```
|
||||
|
||||
### Example 2: Loan Approval
|
||||
|
||||
```python
|
||||
# Explain loan rejection
|
||||
applicant = {
|
||||
"income": 45000,
|
||||
"credit_score": 620,
|
||||
"debt_ratio": 0.45
|
||||
}
|
||||
|
||||
explanation = explainer.explain(applicant)
|
||||
print(explanation.to_text())
|
||||
```
|
||||
|
||||
Output:
|
||||
```
|
||||
LOAN DENIED
|
||||
|
||||
Main reasons:
|
||||
1. Credit score (620) below threshold (650) - Primary factor
|
||||
2. High debt-to-income ratio (45%) - Risk indicator
|
||||
3. Income ($45k) adequate but not strong
|
||||
|
||||
To improve approval chances:
|
||||
- Increase credit score by 30+ points
|
||||
- Reduce debt-to-income ratio below 40%
|
||||
```
|
||||
|
||||
## Regulatory Compliance
|
||||
|
||||
### GDPR "Right to Explanation"
|
||||
|
||||
```python
|
||||
# Generate GDPR-compliant explanation
|
||||
gdpr_explanation = explainer.gdpr_explanation(prediction)
|
||||
|
||||
# Includes:
|
||||
# - Decision rationale
|
||||
# - Data used
|
||||
# - How to contest decision
|
||||
# - Impact of features
|
||||
```
|
||||
|
||||
### Fair Lending Laws (e.g., ECOA)
|
||||
|
||||
```python
|
||||
# Check for bias in protected attributes
|
||||
bias_report = explainer.fairness_report(
|
||||
sensitive_features=["gender", "race", "age"]
|
||||
)
|
||||
|
||||
# Detects:
|
||||
# - Disparate impact
|
||||
# - Feature bias
|
||||
# - Recommendations for fairness
|
||||
```
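Disparate impact is often screened with a ratio of group selection rates. The sketch below is a minimal, standard-library illustration with made-up decisions; the 0.8 cutoff is the commonly cited "four-fifths" heuristic, used here as a screening signal and not as a legal test. It is not a description of what `fairness_report` computes.

```python
from collections import defaultdict


def selection_rates(groups, approved):
    """Approval rate per group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for g, a in zip(groups, approved):
        totals[g] += 1
        positives[g] += int(a)
    return {g: positives[g] / totals[g] for g in totals}


def disparate_impact_ratio(groups, approved):
    """Lowest group approval rate divided by the highest."""
    rates = selection_rates(groups, approved)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody approved in any group, so no disparity to measure
    return min(rates.values()) / highest


# Made-up decisions for two groups, for illustration only.
groups = ["a"] * 10 + ["b"] * 10
approved = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5

ratio = disparate_impact_ratio(groups, approved)
print(f"disparate impact ratio: {ratio:.3f}")
print("flag for review" if ratio < 0.8 else "within four-fifths heuristic")
```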
|
||||
|
||||
## Visualization Types
|
||||
|
||||
1. **Feature Importance Bar Chart**
|
||||
2. **SHAP Summary Plot** (beeswarm)
|
||||
3. **SHAP Waterfall** (single prediction)
|
||||
4. **Partial Dependence Plots**
|
||||
5. **Individual Conditional Expectation** (ICE)
|
||||
6. **Force Plots** (interactive)
|
||||
7. **Decision Trees** (surrogate models)
|
||||
|
||||
## Integration with SpecWeave
|
||||
|
||||
```bash
|
||||
# Generate all explainability artifacts
|
||||
/ml:explain-model 0042
|
||||
|
||||
# Explain specific prediction
|
||||
/ml:explain-prediction --increment 0042 --sample sample.json
|
||||
|
||||
# Check for bias
|
||||
/ml:fairness-check 0042
|
||||
```
|
||||
|
||||
Explainability artifacts are automatically included in the increment documentation and COMPLETION-SUMMARY.md.
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Generate explanations for all production models** - No "black boxes" in production
|
||||
2. **Check for bias** - Test sensitive attributes
|
||||
3. **Document limitations** - What the explanations can't capture
|
||||
4. **Validate explanations** - Do they make domain sense?
|
||||
5. **Make explanations accessible** - Non-technical stakeholders should understand
|
||||
|
||||
Model explainability is non-negotiable for responsible AI deployment.
|
||||
541
skills/model-registry/SKILL.md
Normal file
@@ -0,0 +1,541 @@
|
||||
---
|
||||
name: model-registry
|
||||
description: |
|
||||
Centralized model versioning, staging, and lifecycle management. Activates for "model registry", "model versioning", "model staging", "deploy to production", "rollback model", "model metadata", "model lineage", "promote model", "model catalog". Manages ML model lifecycle from development through production with SpecWeave increment integration.
|
||||
---
|
||||
|
||||
# Model Registry
|
||||
|
||||
## Overview
|
||||
|
||||
Centralized system for managing ML model lifecycle: versioning, staging (dev/staging/prod), metadata tracking, lineage, and rollback. Ensures production models are tracked, reproducible, and can be safely deployed or rolled back—all integrated with SpecWeave's increment workflow.
|
||||
|
||||
## Why Model Registry Matters
|
||||
|
||||
**Without Model Registry**:
|
||||
- ❌ "Which model is in production?"
|
||||
- ❌ "Can't reproduce model from 3 months ago"
|
||||
- ❌ "Breaking change deployed, how to rollback?"
|
||||
- ❌ "Model metadata scattered across notebooks"
|
||||
- ❌ "No audit trail for model changes"
|
||||
|
||||
**With Model Registry**:
|
||||
- ✅ Single source of truth for all models
|
||||
- ✅ Full version history with metadata
|
||||
- ✅ Safe staging pipeline (dev → staging → prod)
|
||||
- ✅ One-command rollback
|
||||
- ✅ Complete model lineage
|
||||
- ✅ Audit trail for compliance
|
||||
|
||||
## Model Registry Structure
|
||||
|
||||
### Model Lifecycle Stages
|
||||
|
||||
```
|
||||
Development → Staging → Production → Archived
|
||||
|
||||
Dev: Training, experimentation
|
||||
Staging: Validation, A/B testing (10% traffic)
|
||||
Prod: Production deployment (100% traffic)
|
||||
Archived: Decommissioned, kept for audit
|
||||
```
|
||||
|
||||
## Core Operations
|
||||
|
||||
### 1. Model Registration
|
||||
|
||||
```python
|
||||
from specweave import ModelRegistry
|
||||
|
||||
registry = ModelRegistry(increment="0042")
|
||||
|
||||
# Register new model version
|
||||
model_version = registry.register_model(
|
||||
name="fraud-detection-model",
|
||||
model=trained_model,
|
||||
version="v3.0.0",
|
||||
metadata={
|
||||
"algorithm": "XGBoost",
|
||||
"accuracy": 0.87,
|
||||
"precision": 0.85,
|
||||
"recall": 0.62,
|
||||
"training_date": "2024-01-15",
|
||||
"training_data_version": "v2024-01",
|
||||
"hyperparameters": {
|
||||
"n_estimators": 673,
|
||||
"max_depth": 6,
|
||||
"learning_rate": 0.094
|
||||
},
|
||||
"features": feature_names,
|
||||
"framework": "xgboost==1.7.0",
|
||||
"python_version": "3.10",
|
||||
"increment": "0042"
|
||||
},
|
||||
stage="dev", # Initial stage
|
||||
tags=["fraud", "production-candidate"]
|
||||
)
|
||||
|
||||
# Creates:
|
||||
# - Model artifact (model.pkl)
|
||||
# - Model metadata (metadata.json)
|
||||
# - Model signature (inputs/outputs)
|
||||
# - Environment file (requirements.txt)
|
||||
# - Feature schema (features.yaml)
|
||||
```
|
||||
|
||||
### 2. Model Versioning
|
||||
|
||||
```python
|
||||
# Semantic versioning: major.minor.patch
|
||||
registry.version_model(
|
||||
name="fraud-detection-model",
|
||||
version_type="minor" # v3.0.0 → v3.1.0
|
||||
)
|
||||
|
||||
# Auto-increments based on changes:
|
||||
# - major: Breaking changes (different features, incompatible)
|
||||
# - minor: Improvements (better accuracy, new features added)
|
||||
# - patch: Bugfixes, retraining (same features, slight changes)
|
||||
```
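The bump rules above are simple enough to sketch directly. `bump_version` below is a hypothetical helper for illustration, assuming versions are written as `vMAJOR.MINOR.PATCH`; it is not part of the registry API shown in this document.

```python
def bump_version(version, version_type):
    """Return the next semantic version, e.g. ("v3.0.0", "minor") -> "v3.1.0"."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if version_type == "major":
        return f"v{major + 1}.0.0"
    if version_type == "minor":
        return f"v{major}.{minor + 1}.0"
    if version_type == "patch":
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version_type: {version_type!r}")


print(bump_version("v3.0.0", "minor"))
```

Note that a major or minor bump resets the lower components to zero.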
|
||||
|
||||
### 3. Model Promotion
|
||||
|
||||
**Stage Progression**:
|
||||
```python
|
||||
# Promote from dev to staging
|
||||
registry.promote_model(
|
||||
name="fraud-detection-model",
|
||||
version="v3.1.0",
|
||||
from_stage="dev",
|
||||
to_stage="staging",
|
||||
approval_required=True # Requires review
|
||||
)
|
||||
|
||||
# Validate in staging (A/B test). run_ab_test is assumed to be defined elsewhere.
|
||||
ab_test_results = run_ab_test(
|
||||
control="fraud-detection-v3.0.0",
|
||||
treatment="fraud-detection-v3.1.0",
|
||||
traffic_split=0.1, # 10% to new model
|
||||
duration_days=7
|
||||
)
|
||||
|
||||
# Promote to production if successful
|
||||
if ab_test_results['treatment_is_better']:
    registry.promote_model(
        name="fraud-detection-model",
        version="v3.1.0",
        from_stage="staging",
        to_stage="production"
    )
|
||||
```
|
||||
|
||||
### 4. Model Rollback
|
||||
|
||||
```python
|
||||
# Rollback to previous version
|
||||
registry.rollback(
|
||||
name="fraud-detection-model",
|
||||
to_version="v3.0.0", # Previous stable version
|
||||
reason="v3.1.0 causing high false positive rate"
|
||||
)
|
||||
|
||||
# Automatic rollback triggers:
|
||||
registry.set_auto_rollback_triggers(
|
||||
error_rate_threshold=0.05, # Rollback if >5% errors
|
||||
latency_threshold=200, # Rollback if p95 > 200ms
|
||||
accuracy_drop_threshold=0.10 # Rollback if accuracy drops >10%
|
||||
)
|
||||
```
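To make the trigger semantics concrete, here is a minimal sketch of how the three thresholds could be evaluated against live metrics. The class and function names are hypothetical, and the accuracy drop is assumed to be relative to the accuracy recorded at deployment; the text does not say whether the 10% is relative or absolute.

```python
from dataclasses import dataclass


@dataclass
class RollbackTriggers:
    error_rate_threshold: float = 0.05     # roll back if error rate exceeds 5%
    latency_threshold_ms: float = 200      # roll back if p95 latency exceeds 200 ms
    accuracy_drop_threshold: float = 0.10  # roll back if accuracy drops by more than 10%


def rollback_reasons(triggers, error_rate, p95_latency_ms,
                     baseline_accuracy, current_accuracy):
    """Return the list of triggers that fired (an empty list means keep serving)."""
    reasons = []
    if error_rate > triggers.error_rate_threshold:
        reasons.append("error_rate")
    if p95_latency_ms > triggers.latency_threshold_ms:
        reasons.append("latency")
    # Relative drop against the accuracy recorded at deployment time (assumption).
    drop = (baseline_accuracy - current_accuracy) / baseline_accuracy
    if drop > triggers.accuracy_drop_threshold:
        reasons.append("accuracy_drop")
    return reasons


healthy = rollback_reasons(RollbackTriggers(), 0.01, 80, 0.87, 0.86)
degraded = rollback_reasons(RollbackTriggers(), 0.08, 250, 0.87, 0.70)
print(healthy, degraded)
```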
|
||||
|
||||
### 5. Model Retrieval
|
||||
|
||||
```python
|
||||
# Get latest production model
|
||||
model = registry.get_model(
|
||||
name="fraud-detection-model",
|
||||
stage="production"
|
||||
)
|
||||
|
||||
# Get specific version
|
||||
model_v3 = registry.get_model(
|
||||
name="fraud-detection-model",
|
||||
version="v3.1.0"
|
||||
)
|
||||
|
||||
# Get model by date
|
||||
model_jan = registry.get_model_by_date(
|
||||
name="fraud-detection-model",
|
||||
date="2024-01-15"
|
||||
)
|
||||
```
|
||||
|
||||
## Model Metadata
|
||||
|
||||
### Tracked Metadata
|
||||
|
||||
```python
|
||||
model_metadata = {
|
||||
# Core Info
|
||||
"name": "fraud-detection-model",
|
||||
"version": "v3.1.0",
|
||||
"stage": "production",
|
||||
"created_at": "2024-01-15T10:30:00Z",
|
||||
"updated_at": "2024-01-20T14:00:00Z",
|
||||
|
||||
# Training Info
|
||||
"algorithm": "XGBoost",
|
||||
"framework": "xgboost==1.7.0",
|
||||
"python_version": "3.10",
|
||||
"training_duration": "45min",
|
||||
"training_data_size": "100k rows",
|
||||
|
||||
# Performance Metrics
|
||||
"accuracy": 0.87,
|
||||
"precision": 0.85,
|
||||
"recall": 0.62,
|
||||
"roc_auc": 0.92,
|
||||
"f1_score": 0.72,
|
||||
|
||||
# Deployment Info
|
||||
"inference_latency_p50": "35ms",
|
||||
"inference_latency_p95": "80ms",
|
||||
"model_size": "12MB",
|
||||
"cpu_usage": "0.2 cores",
|
||||
"memory_usage": "256MB",
|
||||
|
||||
# Lineage
|
||||
"increment": "0042-fraud-detection",
|
||||
"experiment": "exp-003-xgboost",
|
||||
"training_data_version": "v2024-01",
|
||||
"feature_engineering_version": "v1",
|
||||
"parent_model": "fraud-detection-v3.0.0",
|
||||
|
||||
# Features
|
||||
"features": [
|
||||
"amount_vs_user_average",
|
||||
"days_since_last_purchase",
|
||||
"merchant_risk_score",
|
||||
...
|
||||
],
|
||||
"num_features": 35,
|
||||
|
||||
# Tags & Labels
|
||||
"tags": ["fraud", "production", "high-precision"],
|
||||
"owner": "[email protected]",
|
||||
"approver": "[email protected]"
|
||||
}
|
||||
```
|
||||
|
||||
## Model Lineage
|
||||
|
||||
### Tracking Model Lineage
|
||||
|
||||
```python
|
||||
# Full lineage: data → features → training → model
|
||||
lineage = registry.get_lineage(
|
||||
name="fraud-detection-model",
|
||||
version="v3.1.0"
|
||||
)
|
||||
|
||||
# Lineage graph:
|
||||
"""
|
||||
data:v2024-01
|
||||
└─> feature-engineering:v1
|
||||
└─> experiment:exp-003-xgboost
|
||||
└─> model:fraud-detection-v3.1.0
|
||||
└─> deployment:production
|
||||
"""
|
||||
|
||||
# Answer questions like:
|
||||
# - "What data was used to train this model?"
|
||||
# - "Which experiments led to this model?"
|
||||
# - "What models use this feature set?"
|
||||
# - "Impact of changing feature X?"
|
||||
```
|
||||
|
||||
### Model Comparison
|
||||
|
||||
```python
|
||||
# Compare two model versions
|
||||
comparison = registry.compare_models(
|
||||
model_a="fraud-detection-v3.0.0",
|
||||
model_b="fraud-detection-v3.1.0"
|
||||
)
|
||||
|
||||
# Output:
|
||||
"""
|
||||
Comparison: v3.0.0 vs v3.1.0
|
||||
============================
|
||||
|
||||
Metrics:
|
||||
- Accuracy: 0.85 → 0.87 (+2.4%) ✅
|
||||
- Precision: 0.83 → 0.85 (+2.4%) ✅
|
||||
- Recall: 0.60 → 0.62 (+3.3%) ✅
|
||||
|
||||
Performance:
|
||||
- Latency: 40ms → 35ms (-12.5%) ✅
|
||||
- Size: 15MB → 12MB (-20.0%) ✅
|
||||
|
||||
Features:
|
||||
- Added: merchant_reputation_score
|
||||
- Removed: obsolete_feature_x
|
||||
- Modified: 3 features rescaled
|
||||
|
||||
Recommendation: ✅ v3.1.0 is better (improvement in all metrics)
|
||||
"""
|
||||
```
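The percentages in the comparison output are relative changes against the older version. A short sketch that reproduces them from the raw numbers above (`relative_change` is a hypothetical helper):

```python
def relative_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# (old, new) pairs copied from the comparison output above.
metrics = {
    "accuracy": (0.85, 0.87),
    "precision": (0.83, 0.85),
    "recall": (0.60, 0.62),
    "latency_ms": (40, 35),
    "size_mb": (15, 12),
}

for name, (old, new) in metrics.items():
    print(f"{name}: {old} -> {new} ({relative_change(old, new):+.1f}%)")
```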
|
||||
|
||||
## Integration with SpecWeave
|
||||
|
||||
### Automatic Registration
|
||||
|
||||
```python
|
||||
# Models automatically registered during increment completion
|
||||
with track_experiment("xgboost-v1", increment="0042") as exp:
    model = train_model(X_train, y_train)

    # Auto-registers model to registry
    exp.register_model(
        model=model,
        name="fraud-detection-model",
        auto_version=True  # Auto-increment version
    )
|
||||
```
|
||||
|
||||
### Increment-Model Mapping
|
||||
|
||||
```
|
||||
.specweave/increments/0042-fraud-detection/
|
||||
├── models/
|
||||
│ ├── fraud-detection-v3.0.0/
|
||||
│ │ ├── model.pkl
|
||||
│ │ ├── metadata.json
|
||||
│ │ ├── requirements.txt
|
||||
│ │ └── features.yaml
|
||||
│ └── fraud-detection-v3.1.0/
|
||||
│ ├── model.pkl
|
||||
│ ├── metadata.json
|
||||
│ ├── requirements.txt
|
||||
│ └── features.yaml
|
||||
└── registry/
|
||||
├── model_catalog.yaml
|
||||
├── lineage_graph.json
|
||||
└── deployment_history.md
|
||||
```
|
||||
|
||||
### Living Docs Integration
|
||||
|
||||
```bash
|
||||
/specweave:sync-docs update
|
||||
```
|
||||
|
||||
Updates:
|
||||
```markdown
|
||||
<!-- .specweave/docs/internal/architecture/model-registry.md -->
|
||||
|
||||
## Fraud Detection Model - Production
|
||||
|
||||
### Current Production Model
|
||||
- Version: v3.1.0
|
||||
- Deployed: 2024-01-20
|
||||
- Accuracy: 87%
|
||||
- Latency: 35ms (p50)
|
||||
|
||||
### Version History
|
||||
| Version | Stage    | Accuracy | Deployed   | Notes              |
|---------|----------|----------|------------|--------------------|
| v3.1.0  | Prod     | 0.87     | 2024-01-20 | Current ✅         |
| v3.0.0  | Archived | 0.85     | 2024-01-10 | Replaced by v3.1.0 |
| v2.5.0  | Archived | 0.83     | 2023-12-01 | Retired            |
|
||||
|
||||
### Rollback Plan
|
||||
If v3.1.0 issues detected:
|
||||
1. Rollback to v3.0.0 (tested, stable)
|
||||
2. Investigate issue in staging
|
||||
3. Deploy fix as v3.1.1
|
||||
```
|
||||
|
||||
## Model Registry Providers
|
||||
|
||||
### MLflow Model Registry
|
||||
|
||||
```python
|
||||
from specweave import MLflowRegistry
|
||||
|
||||
# Use MLflow as backend
|
||||
registry = MLflowRegistry(
|
||||
tracking_uri="http://mlflow.company.com",
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# All SpecWeave operations work with MLflow backend
|
||||
registry.register_model(...)
|
||||
registry.promote_model(...)
|
||||
```
|
||||
|
||||
### Custom Registry
|
||||
|
||||
```python
|
||||
from specweave import CustomRegistry
|
||||
|
||||
# Use custom storage (S3, GCS, Azure Blob)
|
||||
registry = CustomRegistry(
|
||||
storage_uri="s3://ml-models/registry",
|
||||
increment="0042"
|
||||
)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Semantic Versioning
|
||||
|
||||
```python
|
||||
# Breaking change (different features)
|
||||
registry.version_model(version_type="major") # v3.0.0 → v4.0.0
|
||||
|
||||
# Feature addition (backward compatible)
|
||||
registry.version_model(version_type="minor") # v3.0.0 → v3.1.0
|
||||
|
||||
# Bugfix or retraining (no API change)
|
||||
registry.version_model(version_type="patch") # v3.0.0 → v3.0.1
|
||||
```
|
||||
|
||||
### 2. Model Signatures
|
||||
|
||||
```python
|
||||
# Document input/output schema
|
||||
registry.set_model_signature(
|
||||
model="fraud-detection-v3.1.0",
|
||||
inputs={
|
||||
"amount": "float",
|
||||
"merchant_id": "int",
|
||||
"location": "str"
|
||||
},
|
||||
outputs={
|
||||
"fraud_probability": "float",
|
||||
"fraud_flag": "bool",
|
||||
"risk_score": "float"
|
||||
}
|
||||
)
|
||||
|
||||
# Prevents breaking changes (validate on registration)
|
||||
```
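A signature is only useful if incoming records are checked against it. The following is a minimal sketch of such a check, assuming the input schema above maps onto plain Python types; `validate_inputs` is a hypothetical helper, and the check is deliberately strict (an integer `amount` is rejected because the signature says `float`).

```python
SIGNATURE_INPUTS = {"amount": float, "merchant_id": int, "location": str}


def validate_inputs(record, signature=SIGNATURE_INPUTS):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, expected in signature.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    for field in record:
        if field not in signature:
            problems.append(f"unexpected field: {field}")
    return problems


ok = validate_inputs({"amount": 12.5, "merchant_id": 7, "location": "NYC"})
bad = validate_inputs({"amount": "12.5", "merchant_id": 7})
print(ok, bad)
```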
|
||||
|
||||
### 3. Model Approval Workflow
|
||||
|
||||
```python
|
||||
# Require approval before production
|
||||
registry.set_approval_required(
|
||||
stage="production",
|
||||
approvers=["[email protected]", "[email protected]"]
|
||||
)
|
||||
|
||||
# Approve model promotion
|
||||
registry.approve_model(
|
||||
name="fraud-detection-model",
|
||||
version="v3.1.0",
|
||||
approver="[email protected]",
|
||||
comments="Tested in staging, accuracy improved 2%, latency reduced 12%"
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Model Deprecation
|
||||
|
||||
```python
|
||||
# Mark old models as deprecated
|
||||
registry.deprecate_model(
|
||||
name="fraud-detection-model",
|
||||
version="v2.5.0",
|
||||
reason="Superseded by v3.x series",
|
||||
end_of_life="2024-06-01"
|
||||
)
|
||||
```
|
||||
|
||||
## Commands
|
||||
|
||||
```bash
|
||||
# List all models
|
||||
/ml:registry-list
|
||||
|
||||
# Get model info
|
||||
/ml:registry-info fraud-detection-model
|
||||
|
||||
# Promote model
|
||||
/ml:registry-promote fraud-detection-model v3.1.0 --to production
|
||||
|
||||
# Rollback model
|
||||
/ml:registry-rollback fraud-detection-model --to v3.0.0
|
||||
|
||||
# Compare models
|
||||
/ml:registry-compare fraud-detection-model v3.0.0 v3.1.0
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### 1. Model Monitoring Integration
|
||||
|
||||
```python
|
||||
# Automatically track production model performance
|
||||
monitor = ModelMonitor(registry=registry)
|
||||
|
||||
monitor.track_model(
|
||||
name="fraud-detection-model",
|
||||
stage="production",
|
||||
metrics=["accuracy", "latency", "error_rate"]
|
||||
)
|
||||
|
||||
# Auto-rollback if metrics degrade
|
||||
monitor.set_auto_rollback(
|
||||
metric="accuracy",
|
||||
threshold=0.80, # Rollback if < 80%
|
||||
window="24h"
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Model Governance
|
||||
|
||||
```python
|
||||
# Compliance and audit trail
|
||||
governance = ModelGovernance(registry=registry)
|
||||
|
||||
# Generate audit report
|
||||
audit_report = governance.generate_audit_report(
|
||||
model="fraud-detection-model",
|
||||
start_date="2023-01-01",
|
||||
end_date="2024-01-31"
|
||||
)
|
||||
|
||||
# Includes:
|
||||
# - All model versions deployed
|
||||
# - Who approved deployments
|
||||
# - Performance metrics over time
|
||||
# - Data sources used
|
||||
# - Compliance checkpoints
|
||||
```
|
||||
|
||||
### 3. Multi-Environment Registry
|
||||
|
||||
```python
|
||||
# Separate registries for dev, staging, prod
|
||||
registry_dev = ModelRegistry(environment="dev")
|
||||
registry_staging = ModelRegistry(environment="staging")
|
||||
registry_prod = ModelRegistry(environment="production")
|
||||
|
||||
# Promote across environments
|
||||
registry_dev.promote_to(
|
||||
model="fraud-detection-v3.1.0",
|
||||
target_env="staging"
|
||||
)
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
Model Registry is essential for:
|
||||
- ✅ Model versioning (track all model versions)
|
||||
- ✅ Safe deployment (dev → staging → prod pipeline)
|
||||
- ✅ Fast rollback (one-command revert to stable version)
|
||||
- ✅ Audit trail (who deployed what, when, why)
|
||||
- ✅ Model lineage (data → features → model → deployment)
|
||||
- ✅ Compliance (regulatory requirements, governance)
|
||||
|
||||
This skill brings enterprise-grade model lifecycle management to SpecWeave, ensuring all models are tracked, reproducible, and safely deployed.
|
||||
180
skills/nlp-pipeline-builder/SKILL.md
Normal file
@@ -0,0 +1,180 @@
|
||||
---
|
||||
name: nlp-pipeline-builder
|
||||
description: |
|
||||
Natural language processing ML pipelines for text classification, NER, sentiment analysis, text generation, and embeddings. Activates for "nlp", "text classification", "sentiment analysis", "named entity recognition", "BERT", "transformers", "text preprocessing", "tokenization", "word embeddings". Builds NLP pipelines with transformers, integrated with SpecWeave increments.
|
||||
---
|
||||
|
||||
# NLP Pipeline Builder
|
||||
|
||||
## Overview
|
||||
|
||||
Specialized ML pipelines for natural language processing. Handles text preprocessing, tokenization, transformer models (BERT, RoBERTa, GPT), fine-tuning, and deployment for production NLP systems.
|
||||
|
||||
## NLP Tasks Supported
|
||||
|
||||
### 1. Text Classification
|
||||
|
||||
```python
|
||||
from specweave import NLPPipeline
|
||||
|
||||
# Binary or multi-class text classification
|
||||
pipeline = NLPPipeline(
|
||||
task="classification",
|
||||
classes=["positive", "negative", "neutral"],
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Automatically configures:
|
||||
# - Text preprocessing (lowercase, clean)
|
||||
# - Tokenization (BERT tokenizer)
|
||||
# - Model (BERT, RoBERTa, DistilBERT)
|
||||
# - Fine-tuning on your data
|
||||
# - Inference pipeline
|
||||
|
||||
pipeline.fit(train_texts, train_labels)
|
||||
```
|
||||
|
||||
### 2. Named Entity Recognition (NER)
|
||||
|
||||
```python
|
||||
# Extract entities from text
|
||||
pipeline = NLPPipeline(
|
||||
task="ner",
|
||||
entities=["PERSON", "ORG", "LOC", "DATE"],
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Returns: [(entity_text, entity_type, start_pos, end_pos), ...]
|
||||
```
|
||||
|
||||
### 3. Sentiment Analysis
|
||||
|
||||
```python
|
||||
# Sentiment classification (specialized)
|
||||
pipeline = NLPPipeline(
|
||||
task="sentiment",
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Fine-tuned for sentiment (positive/negative/neutral)
|
||||
```
|
||||
|
||||
### 4. Text Generation
|
||||
|
||||
```python
|
||||
# Generate text continuations
|
||||
pipeline = NLPPipeline(
|
||||
task="generation",
|
||||
model="gpt2",
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Fine-tune on your domain-specific text
|
||||
```
|
||||
|
||||
## Best Practices for NLP
|
||||
|
||||
### Text Preprocessing
|
||||
|
||||
```python
|
||||
from specweave import TextPreprocessor
|
||||
|
||||
preprocessor = TextPreprocessor(increment="0042")
|
||||
|
||||
# Standard preprocessing
|
||||
preprocessor.add_steps([
|
||||
"lowercase",
|
||||
"remove_html",
|
||||
"remove_urls",
|
||||
"remove_emails",
|
||||
"remove_special_chars",
|
||||
"remove_extra_whitespace"
|
||||
])
|
||||
|
||||
# Advanced preprocessing
|
||||
preprocessor.add_advanced([
|
||||
"spell_correction",
|
||||
"lemmatization",
|
||||
"stopword_removal"
|
||||
])
|
||||
```
|
||||
|
||||
### Model Selection
|
||||
|
||||
**Text Classification**:
|
||||
- Small datasets (<10K): DistilBERT (about 60% faster than BERT-base)
|
||||
- Medium datasets (10K-100K): BERT-base
|
||||
- Large datasets (>100K): RoBERTa-large
|
||||
|
||||
**NER**:
|
||||
- General: BERT + CRF layer
|
||||
- Domain-specific: Fine-tune BERT on domain corpus
|
||||
|
||||
**Sentiment**:
|
||||
- Product reviews: DistilBERT fine-tuned on Amazon reviews
|
||||
- Social media: RoBERTa fine-tuned on Twitter
|
||||
|
||||
### Transfer Learning
|
||||
|
||||
```python
|
||||
# Start from pre-trained language models
|
||||
pipeline = NLPPipeline(task="classification")
|
||||
|
||||
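# Option 1: Use pre-trained weights without fine-tuning (a base checkpoint has an untrained classification head, so treat this as a starting point only)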
|
||||
pipeline.use_pretrained("distilbert-base-uncased")
|
||||
|
||||
# Option 2: Fine-tune on your data
|
||||
pipeline.use_pretrained_and_finetune(
|
||||
model="bert-base-uncased",
|
||||
epochs=3,
|
||||
learning_rate=2e-5
|
||||
)
|
||||
```
|
||||
|
||||
### Handling Long Text
|
||||
|
||||
```python
|
||||
# For text longer than 512 tokens
|
||||
pipeline = NLPPipeline(
|
||||
task="classification",
|
||||
max_length=512,
|
||||
truncation_strategy="head_and_tail" # Keep start + end
|
||||
)
|
||||
|
||||
# Or use Longformer for long documents
|
||||
pipeline.use_model("longformer") # Handles 4096 tokens
|
||||
```
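The "head and tail" strategy keeps the opening and closing parts of a long document and drops the middle. Below is a minimal sketch over a plain token list; `truncate_head_and_tail` and its `head_fraction` parameter are hypothetical, and a real tokenizer would also need room for special tokens such as `[CLS]` and `[SEP]`.

```python
def truncate_head_and_tail(tokens, max_length, head_fraction=0.25):
    """Keep the first and last tokens of a sequence that is too long.

    head_fraction is the share of the length budget given to the head;
    the rest goes to the tail.
    """
    if len(tokens) <= max_length:
        return list(tokens)
    head = int(max_length * head_fraction)
    tail = max_length - head
    # Slice from an explicit start index so tail == 0 yields an empty list.
    return list(tokens[:head]) + list(tokens[len(tokens) - tail:])


tokens = list(range(1000))
truncated = truncate_head_and_tail(tokens, max_length=512)
print(len(truncated), truncated[:3], truncated[-3:])
```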
|
||||
|
||||
## Integration with SpecWeave
|
||||
|
||||
```
|
||||
# NLP increment structure
|
||||
.specweave/increments/0042-sentiment-classifier/
|
||||
├── spec.md
|
||||
├── data/
|
||||
│ ├── train.csv
|
||||
│ ├── val.csv
|
||||
│ └── test.csv
|
||||
├── models/
|
||||
│ ├── tokenizer/
|
||||
│ ├── model-epoch-1/
|
||||
│ ├── model-epoch-2/
|
||||
│ └── model-epoch-3/
|
||||
├── experiments/
|
||||
│ ├── distilbert-baseline/
|
||||
│ ├── bert-base-finetuned/
|
||||
│ └── roberta-large/
|
||||
└── deployment/
|
||||
├── model.onnx
|
||||
└── inference.py
|
||||
```
|
||||
|
||||
## Commands
|
||||
|
||||
```bash
|
||||
/ml:nlp-pipeline --task classification --model bert-base
|
||||
/ml:nlp-evaluate 0042 # Evaluate on test set
|
||||
/ml:nlp-deploy 0042 # Export for production
|
||||
```
|
||||
|
||||
Quick setup for NLP projects with state-of-the-art transformer models.
|
||||
569
skills/time-series-forecaster/SKILL.md
Normal file
@@ -0,0 +1,569 @@
|
||||
---
|
||||
name: time-series-forecaster
|
||||
description: |
|
||||
Time series forecasting with ARIMA, Prophet, LSTM, and statistical methods. Activates for "time series", "forecasting", "predict future", "trend analysis", "seasonality", "ARIMA", "Prophet", "sales forecast", "demand prediction", "stock prediction". Handles trend decomposition, seasonality detection, multivariate forecasting, and confidence intervals with SpecWeave increment integration.
|
||||
---
|
||||
|
||||
# Time Series Forecaster
|
||||
|
||||
## Overview
|
||||
|
||||
Specialized forecasting pipelines for time-dependent data. Handles trend analysis, seasonality detection, and future predictions using statistical methods, machine learning, and deep learning approaches—all integrated with SpecWeave's increment workflow.
|
||||
|
||||
## Why Time Series is Different
|
||||
|
||||
**Standard ML assumptions violated**:
|
||||
- ❌ Data is NOT independent (temporal correlation)
|
||||
- ❌ Data is NOT identically distributed (trends, seasonality)
|
||||
- ❌ Random train/test split is WRONG (breaks temporal order)
|
||||
|
||||
**Time series requirements**:
|
||||
- ✅ Temporal order preserved
|
||||
- ✅ No data leakage from future
|
||||
- ✅ Stationarity checks
|
||||
- ✅ Autocorrelation analysis
|
||||
- ✅ Seasonality decomposition
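
The first two requirements (preserved order, no leakage from the future) come down to how the data is split. A minimal sketch, assuming the series is already sorted chronologically; both helper names are hypothetical and not part of the SpecWeave API:

```python
def temporal_split(series, test_fraction=0.2):
    """Split a chronologically ordered series so the test set is strictly later."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    cut = len(series) - round(len(series) * test_fraction)
    return series[:cut], series[cut:]


def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) with an expanding training window."""
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        yield range(0, train_end), range(train_end, train_end + fold_size)


train, test = temporal_split(list(range(100)))
folds = list(walk_forward_splits(n_samples=100, n_folds=4, min_train=20))
print(len(train), len(test), [(len(tr), len(te)) for tr, te in folds])
```

Every training index precedes every test index, which is the property a random shuffle would break.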
|
||||
|
||||
## Forecasting Methods
|
||||
|
||||
### 1. Statistical Methods (Baseline)
|
||||
|
||||
**ARIMA (AutoRegressive Integrated Moving Average)**:
|
||||
```python
|
||||
from specweave import TimeSeriesForecaster
|
||||
|
||||
forecaster = TimeSeriesForecaster(
|
||||
method="arima",
|
||||
increment="0042"
|
||||
)
|
||||
|
||||
# Automatic order selection (p, d, q)
|
||||
forecaster.fit(train_data)
|
||||
|
||||
# Forecast next 30 periods
|
||||
forecast = forecaster.predict(horizon=30)
|
||||
|
||||
# Generates:
|
||||
# - Trend analysis
|
||||
# - Seasonality decomposition
|
||||
# - Autocorrelation plots (ACF, PACF)
|
||||
# - Residual diagnostics
|
||||
# - Forecast with confidence intervals
|
||||
```
|
||||
|
||||
**Seasonal Decomposition**:
|
||||
```python
|
||||
# Decompose into trend + seasonal + residual
|
||||
decomposition = forecaster.decompose(
|
||||
data=sales_data,
|
||||
model='multiplicative', # Or 'additive'
|
||||
period=12 # Monthly seasonality
|
||||
)
|
||||
|
||||
# Creates:
|
||||
# - Trend component plot
|
||||
# - Seasonal component plot
|
||||
# - Residual component plot
|
||||
# - Strength of trend/seasonality metrics
|
||||
```
|
||||
|
||||
### 2. Prophet (Facebook)

**Best for**: Business time series (sales, website traffic, user growth)

```python
from specweave import ProphetForecaster

forecaster = ProphetForecaster(increment="0042")

# Prophet handles:
# - Multiple seasonality (daily, weekly, yearly)
# - Holidays and events
# - Missing data
# - Outliers

forecaster.fit(
    data=sales_data,
    holidays=us_holidays,  # Built-in holiday effects
    seasonality_mode='multiplicative'
)

forecast = forecaster.predict(horizon=90)

# Generates:
# - Trend + seasonality + holiday components
# - Change point detection
# - Uncertainty intervals
# - Cross-validation results
```

**Prophet with Custom Regressors**:
```python
# Add external variables (marketing spend, weather, etc.)
forecaster.add_regressor("marketing_spend")
forecaster.add_regressor("temperature")

# Prophet incorporates external factors into forecast
```

### 3. Deep Learning (LSTM/GRU)

**Best for**: Complex patterns, multivariate forecasting, non-linear relationships

```python
from specweave import LSTMForecaster

forecaster = LSTMForecaster(
    lookback_window=30,  # Use 30 past observations
    horizon=7,  # Predict 7 steps ahead
    increment="0042"
)

# Automatically handles:
# - Sequence creation
# - Train/val/test split (temporal)
# - Scaling
# - Early stopping

forecaster.fit(
    data=sensor_data,
    epochs=100,
    batch_size=32
)

forecast = forecaster.predict(horizon=7)

# Generates:
# - Training history plots
# - Validation metrics
# - Attention weights (if using attention)
# - Forecast uncertainty estimation
```

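The "sequence creation" and "temporal split" steps listed above can be sketched without any deep learning framework. The helpers `make_windows` and `temporal_split` are hypothetical names used only for illustration; how `LSTMForecaster` actually builds its sequences is not specified here.

```python
def make_windows(series, lookback, horizon):
    """Split a series into (input window, target window) training pairs."""
    inputs, targets = [], []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        split = start + lookback
        inputs.append(series[start:split])             # past `lookback` values
        targets.append(series[split:split + horizon])  # next `horizon` values
    return inputs, targets


def temporal_split(samples, val_fraction=0.2):
    """Keep the most recent samples for validation; never shuffle."""
    cut = int(len(samples) * (1 - val_fraction))
    return samples[:cut], samples[cut:]


X, y = make_windows(list(range(10)), lookback=3, horizon=2)
X_train, X_val = temporal_split(X, val_fraction=0.5)
```
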
### 4. Multivariate Forecasting

**VAR (Vector AutoRegression)** - Multiple related time series:
```python
from specweave import VARForecaster

# Forecast multiple related series simultaneously
forecaster = VARForecaster(increment="0042")

# Example: Forecast sales across multiple stores
# Each store's sales can affect the others
forecaster.fit(data={
    'store_1_sales': store1_data,
    'store_2_sales': store2_data,
    'store_3_sales': store3_data
})

forecast = forecaster.predict(horizon=30)
# Returns forecasts for all 3 stores
```

## Time Series Best Practices

### 1. Temporal Train/Test Split

```python
from sklearn.model_selection import train_test_split

# ❌ WRONG: Random split shuffles rows, so future observations leak into training
X_train, X_test = train_test_split(data, test_size=0.2)

# ✅ CORRECT: Temporal split (assumes `data` has a DatetimeIndex)
split_date = "2024-01-01"
train = data[data.index < split_date]
test = data[data.index >= split_date]

# Or use last N periods as test
train = data[:-30]  # All but last 30 observations
test = data[-30:]  # Last 30 observations
```

### 2. Stationarity Testing

```python
from specweave import TimeSeriesAnalyzer

analyzer = TimeSeriesAnalyzer(increment="0042")

# Check stationarity (ARIMA's AR and MA terms assume a stationary series)
stationarity = analyzer.check_stationarity(data)

if not stationarity['is_stationary']:
    # Make stationary via differencing
    data_diff = analyzer.difference(data, order=1)

    # Or detrend
    data_detrended = analyzer.detrend(data)
```

**Stationarity Report**:
```markdown
# Stationarity Analysis

## ADF Test (Augmented Dickey-Fuller)
- Test Statistic: -2.15
- P-value: 0.23
- Critical Value (5%): -2.89
- Result: ❌ NON-STATIONARY (p > 0.05)

## Recommendation
Apply differencing (order=1) to remove trend.

After differencing:
- ADF Test Statistic: -5.42
- P-value: 0.0001
- Result: ✅ STATIONARY
```

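The differencing recommended in the report is simple enough to sketch directly. The helper names `difference` and `undifference` below are hypothetical stand-ins, not the `analyzer` methods themselves. The ADF test is not reimplemented here; in practice it is available as `adfuller` in `statsmodels.tsa.stattools`.

```python
def difference(series, order=1):
    """Apply first differencing `order` times; each pass shortens the series by one."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


def undifference(first_value, diffs):
    """Invert one round of differencing given the first original value."""
    restored = [first_value]
    for d in diffs:
        restored.append(restored[-1] + d)
    return restored


linear = [3, 5, 7, 9]
print(difference(linear))               # a linear trend becomes a constant
print(undifference(linear[0], difference(linear)))
```
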
### 3. Seasonality Detection

```python
# Automatic seasonality detection
seasonality = analyzer.detect_seasonality(data)

# Results:
# - Daily: False
# - Weekly: True (period=7)
# - Monthly: True (period=30)
# - Yearly: False
```

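One common way to detect a seasonal period is to look for the lag with the strongest autocorrelation. The sketch below illustrates that idea only; it is not a description of how `detect_seasonality` works. The function names and the 0.5 strength threshold are assumptions chosen for the example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at `lag` (biased estimator, divides by the lag-0 sum)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0 or lag >= n:
        return 0.0
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


def dominant_period(series, candidates, threshold=0.5):
    """Return the candidate lag with the highest autocorrelation, or None if weak."""
    best = max(candidates, key=lambda lag: autocorrelation(series, lag))
    return best if autocorrelation(series, best) >= threshold else None


weekly = [10, 12, 15, 11, 9, 20, 25] * 10   # synthetic series with a 7-step cycle
print(dominant_period(weekly, candidates=range(2, 21)))
```
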
### 4. Cross-Validation for Time Series

```python
# Time series cross-validation (expanding window)
cv_results = forecaster.cross_validate(
    data=data,
    horizon=30,  # Forecast 30 steps ahead
    n_splits=5,  # 5 expanding windows
    metric='mape'
)

# Visualizes:
# - MAPE across different time periods
# - Forecast vs actual for each fold
# - Model stability over time
```

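Expanding-window cross-validation can be sketched in a few lines: each fold trains on everything up to a cutoff and tests on the next `horizon` observations, and the cutoff moves forward by `horizon` per fold. The function names and the naive baseline are illustrative assumptions; scikit-learn's `TimeSeriesSplit` provides a similar splitter.

```python
def expanding_window_splits(n_obs, horizon, n_splits):
    """Yield (train_indices, test_indices) with a growing training window."""
    first_train_end = n_obs - n_splits * horizon
    if first_train_end <= 0:
        raise ValueError("not enough observations for the requested splits")
    for fold in range(n_splits):
        train_end = first_train_end + fold * horizon
        yield range(0, train_end), range(train_end, train_end + horizon)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def cross_validate(series, horizon, n_splits, forecast_fn):
    """Score `forecast_fn(train, horizon)` on each fold and return the MAPE per fold."""
    scores = []
    for train_idx, test_idx in expanding_window_splits(len(series), horizon, n_splits):
        train = [series[i] for i in train_idx]
        test = [series[i] for i in test_idx]
        scores.append(mape(test, forecast_fn(train, horizon)))
    return scores


def naive_forecast(train, horizon):
    """Baseline: repeat the last observed value."""
    return [train[-1]] * horizon
```
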
### 5. Handling Missing Data

```python
# Time series-specific imputation
forecaster.handle_missing(
    method='interpolate',  # Or 'forward_fill', 'backward_fill'
    limit=3  # Max consecutive missing values to fill
)

# For seasonal data
forecaster.handle_missing(
    method='seasonal_interpolate',
    period=12  # Use seasonal pattern to impute
)
```

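A minimal sketch of linear interpolation with a gap limit follows. It assumes one particular reading of `limit`: gaps longer than the limit are left untouched entirely, and leading or trailing gaps are never filled. Both the helper name `interpolate_gaps` and that interpretation are assumptions; `handle_missing` may behave differently.

```python
def interpolate_gaps(values, limit=3):
    """Linearly fill runs of None that are at most `limit` long and bounded on both sides."""
    filled = list(values)
    i, n = 0, len(filled)
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                      # first missing index of this gap
        while i < n and filled[i] is None:
            i += 1
        end = i                        # first known index after the gap (or n)
        gap = end - start
        if start == 0 or end == n or gap > limit:
            continue                   # leading/trailing or too long: leave as missing
        left, right = filled[start - 1], filled[end]
        step = (right - left) / (gap + 1)
        for k in range(gap):
            filled[start + k] = left + step * (k + 1)
    return filled


print(interpolate_gaps([None, 2.0, None, None, 5.0, None]))
```
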
## Common Time Series Patterns

### Pattern 1: Sales Forecasting

```python
from specweave import SalesForecastPipeline

pipeline = SalesForecastPipeline(increment="0042")

# Handles:
# - Weekly/monthly seasonality
# - Holiday effects
# - Marketing campaign impact
# - Trend changes

pipeline.fit(
    sales_data=daily_sales,
    holidays=us_holidays,
    regressors={
        'marketing_spend': marketing_data,
        'competitor_price': competitor_data
    }
)

forecast = pipeline.predict(horizon=90)  # 90 days ahead

# Generates:
# - Point forecast
# - Prediction intervals (80%, 95%)
# - Component analysis (trend, seasonality, regressors)
# - Anomaly flags for past data
```

### Pattern 2: Demand Forecasting

```python
from specweave import DemandForecastPipeline

# Inventory optimization, supply chain planning
pipeline = DemandForecastPipeline(
    aggregation='daily',  # Or 'weekly', 'monthly'
    increment="0042"
)

# Multi-product forecasting
forecasts = pipeline.fit_predict(
    products=['product_A', 'product_B', 'product_C'],
    horizon=30
)

# Generates:
# - Demand forecast per product
# - Confidence intervals
# - Stockout risk analysis
# - Reorder point recommendations
```

### Pattern 3: Stock Price Prediction

```python
from specweave import FinancialForecastPipeline

# Stock prices, crypto, forex
pipeline = FinancialForecastPipeline(increment="0042")

# Handles:
# - Volatility clustering
# - Non-linear patterns
# - Technical indicators

pipeline.fit(
    price_data=stock_prices,
    features=['volume', 'volatility', 'RSI', 'MACD']
)

forecast = pipeline.predict(horizon=7)

# Generates:
# - Price forecast with confidence bands
# - Volatility forecast (GARCH)
# - Trading signals (optional)
# - Risk metrics
```

### Pattern 4: Sensor Data / IoT

```python
from specweave import SensorForecastPipeline

# Temperature, humidity, machine metrics
pipeline = SensorForecastPipeline(
    method='lstm',  # Deep learning for complex patterns
    increment="0042"
)

# Multivariate: Multiple sensor readings
pipeline.fit(
    sensors={
        'temperature': temp_data,
        'humidity': humidity_data,
        'pressure': pressure_data
    }
)

forecast = pipeline.predict(horizon=24)  # 24 hours ahead

# Generates:
# - Multi-sensor forecasts
# - Anomaly detection (unexpected values)
# - Maintenance alerts
```

## Evaluation Metrics

**Time series-specific metrics**:

```python
from specweave import TimeSeriesEvaluator

evaluator = TimeSeriesEvaluator(increment="0042")

metrics = evaluator.evaluate(
    y_true=test_data,
    y_pred=forecast
)

# Metrics:
# - MAPE (Mean Absolute Percentage Error) - business-friendly
# - RMSE (Root Mean Squared Error) - penalizes large errors
# - MAE (Mean Absolute Error) - less sensitive to outliers than RMSE
# - MASE (Mean Absolute Scaled Error) - scale-independent
# - Directional Accuracy - did we predict up/down correctly?
```

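For reference, the metrics listed above can be written out in plain Python. This is a sketch, not `TimeSeriesEvaluator`'s code: the function names are hypothetical, MASE is scaled by the in-sample error of a naive forecast on the training series, and directional accuracy compares the sign of the predicted move against the previous actual value, which is one of several possible definitions.

```python
import math


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Percent error; undefined when any actual value is zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mase(actual, predicted, train, season=1):
    """MAE scaled by the in-sample MAE of a (seasonal) naive forecast on `train`."""
    naive_errors = [abs(train[t] - train[t - season]) for t in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, predicted) / scale


def directional_accuracy(actual, predicted):
    """Share of steps where the predicted move from the previous actual has the right sign."""
    def sign(x):
        return (x > 0) - (x < 0)
    hits = sum(
        sign(actual[t] - actual[t - 1]) == sign(predicted[t] - actual[t - 1])
        for t in range(1, len(actual))
    )
    return hits / (len(actual) - 1)


train = [70, 80, 90, 100]
actual = [100, 110, 120, 130]
predicted = [90, 115, 120, 140]
print(mae(actual, predicted), rmse(actual, predicted), mase(actual, predicted, train))
```
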
**Evaluation Report**:
```markdown
# Time Series Forecast Evaluation

## Point Metrics
- MAPE: 8.2% (target: <10%) ✅
- RMSE: 124.5
- MAE: 98.3
- MASE: 0.85 (< 1 = better than naive forecast) ✅

## Directional Accuracy
- Correct direction: 73% (up/down predictions)

## Forecast Bias
- Mean Error: -5.2 (slight under-forecasting)
- Bias: -2.1%

## Confidence Intervals
- 80% interval coverage: 79.2% ✅
- 95% interval coverage: 94.1% ✅

## Recommendation
✅ DEPLOY: Model meets accuracy targets and is well-calibrated.
```

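The bias and interval-coverage figures in the report come down to two small calculations, sketched below. The helper names are hypothetical. The sign convention (forecast minus actual, so a negative value means under-forecasting) is an assumption chosen to match the report's wording.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    inside = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return inside / len(actual)


def mean_error(actual, predicted):
    """Signed bias: negative means the forecast runs below the actuals on average."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 20, 30, 40]
print(interval_coverage(actual, lower=[8, 21, 25, 35], upper=[12, 25, 35, 45]))
print(mean_error(actual, predicted=[9, 19, 31, 37]))
```

A well-calibrated 80% interval should cover roughly 80% of the actuals; coverage far above or below the nominal level suggests intervals that are too wide or too narrow.
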
## Integration with SpecWeave

### Increment Structure

```
.specweave/increments/0042-sales-forecast/
├── spec.md (forecasting requirements, accuracy targets)
├── plan.md (forecasting strategy, method selection)
├── tasks.md
├── data/
│   ├── train_data.csv
│   ├── test_data.csv
│   └── schema.yaml
├── experiments/
│   ├── arima-baseline/
│   ├── prophet-holidays/
│   └── lstm-multivariate/
├── models/
│   ├── prophet_model.pkl
│   └── lstm_model.h5
├── forecasts/
│   ├── forecast_2024-01.csv
│   ├── forecast_2024-02.csv
│   └── forecast_with_intervals.csv
└── analysis/
    ├── stationarity_test.md
    ├── seasonality_decomposition.png
    └── forecast_evaluation.md
```

### Living Docs Integration

```bash
/specweave:sync-docs update
```

Updates:
```markdown
<!-- .specweave/docs/internal/architecture/time-series-forecasting.md -->

## Sales Forecasting Model (Increment 0042)

### Method Selected: Prophet
- Reason: Handles multiple seasonality + holidays well
- Alternatives tried: ARIMA (MAPE 12%), LSTM (MAPE 10%)
- Prophet: MAPE 8.2% ✅ BEST

### Seasonality Detected
- Weekly: Strong (7-day cycle)
- Monthly: Moderate (30-day cycle)
- Yearly: Weak

### Holiday Effects
- Black Friday: +180% sales (strongest)
- Christmas: +120% sales
- Thanksgiving: +80% sales

### Forecast Horizon
- 90 days ahead
- Confidence intervals: 80%, 95%
- Update frequency: Weekly retraining

### Model Performance
- MAPE: 8.2% (target: <10%)
- Directional accuracy: 73%
- Deployed: 2024-01-15
```

## Commands

```bash
# Create time series forecast
/ml:forecast --horizon 30 --method prophet

# Evaluate forecast
/ml:evaluate-forecast 0042

# Decompose time series
/ml:decompose-timeseries 0042
```

## Advanced Features

### 1. Ensemble Forecasting

```python
from specweave import EnsembleForecast  # import path assumed, following the other examples

# Combine multiple methods for robustness
ensemble = EnsembleForecast(increment="0042")

ensemble.add_forecaster("arima", weight=0.3)
ensemble.add_forecaster("prophet", weight=0.5)
ensemble.add_forecaster("lstm", weight=0.2)

# Weighted average of all forecasts
forecast = ensemble.predict(horizon=30)

# Ensembles are often more accurate than any single model; confirm the gain on held-out data
```

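The weighted average itself is straightforward. Below is a minimal sketch, assuming each model's forecast is a plain list of the same length; `weighted_ensemble` is a hypothetical helper, and the weights are normalized so they need not sum to one.

```python
def weighted_ensemble(forecasts, weights):
    """Combine per-model forecasts (dict name -> list) with normalized weights."""
    if set(forecasts) != set(weights):
        raise ValueError("forecasts and weights must name the same models")
    total = sum(weights.values())
    horizon = len(next(iter(forecasts.values())))
    return [
        sum(weights[name] * forecasts[name][step] for name in forecasts) / total
        for step in range(horizon)
    ]


combined = weighted_ensemble(
    forecasts={"arima": [100, 110], "prophet": [120, 130], "lstm": [80, 90]},
    weights={"arima": 0.3, "prophet": 0.5, "lstm": 0.2},
)
print(combined)
```
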
### 2. Forecast Reconciliation

```python
from specweave import ForecastReconciler  # import path assumed, following the other examples

# For hierarchical time series (e.g., total sales = store1 + store2 + store3)
reconciler = ForecastReconciler(increment="0042")

# Ensures forecasts sum correctly
reconciled = reconciler.reconcile(
    forecasts={
        'total': total_forecast,
        'store1': store1_forecast,
        'store2': store2_forecast,
        'store3': store3_forecast
    },
    method='bottom_up'  # Or 'top_down', 'middle_out'
)
```

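To show what "sum correctly" means, here is a sketch of two simple reconciliation strategies for a two-level hierarchy. The function names are hypothetical. The top-down variant splits the total using each store's share of the summed store forecasts, which is only one of several top-down schemes (others use historical proportions), and it assumes the store forecasts do not sum to zero.

```python
def reconcile_bottom_up(forecasts, total_key="total"):
    """Replace the aggregate forecast with the sum of the bottom-level forecasts."""
    children = {k: v for k, v in forecasts.items() if k != total_key}
    reconciled = dict(children)
    reconciled[total_key] = [sum(step) for step in zip(*children.values())]
    return reconciled


def reconcile_top_down(forecasts, total_key="total"):
    """Keep the aggregate and split it using each child's share of the child sum."""
    children = {k: v for k, v in forecasts.items() if k != total_key}
    sums = [sum(step) for step in zip(*children.values())]
    reconciled = {total_key: list(forecasts[total_key])}
    for name, values in children.items():
        reconciled[name] = [
            total * value / s for total, value, s in zip(forecasts[total_key], values, sums)
        ]
    return reconciled


raw = {"total": [100, 100], "store1": [30, 40], "store2": [50, 40], "store3": [10, 10]}
print(reconcile_bottom_up(raw)["total"])
```
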
### 3. Forecast Monitoring

```python
from specweave import ForecastMonitor  # import path assumed, following the other examples

# Track forecast accuracy over time
monitor = ForecastMonitor(increment="0042")

# Compare forecasts vs actuals
monitor.track_performance(
    forecasts=past_forecasts,
    actuals=actual_values
)

# Alerts when accuracy degrades
if monitor.accuracy_degraded():
    print("⚠️ Forecast accuracy dropped 15% - retrain model!")
```

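One way such a degradation check could work is to compare a rolling MAPE against the MAPE measured at deployment. The class below is a sketch of that idea, not `ForecastMonitor`'s implementation; the class name, the 30-observation window, and the 15% tolerance are assumptions for illustration.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track absolute percentage error over a sliding window of recent forecasts."""

    def __init__(self, baseline_mape, window=30, tolerance=0.15):
        self.baseline_mape = baseline_mape      # MAPE measured at deployment, in percent
        self.tolerance = tolerance              # allowed relative increase (0.15 = 15%)
        self.errors = deque(maxlen=window)      # most recent absolute percentage errors

    def record(self, forecast, actual):
        """Store the percentage error of one forecast (actual must be non-zero)."""
        self.errors.append(abs((actual - forecast) / actual) * 100)

    def current_mape(self):
        return sum(self.errors) / len(self.errors) if self.errors else None

    def accuracy_degraded(self):
        current = self.current_mape()
        return current is not None and current > self.baseline_mape * (1 + self.tolerance)


monitor = RollingAccuracyMonitor(baseline_mape=8.0, window=3)
monitor.record(forecast=95, actual=100)
print(monitor.accuracy_degraded())
```
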
## Summary

Time series forecasting requires specialized techniques:
- ✅ Temporal validation (no random split)
- ✅ Stationarity testing
- ✅ Seasonality detection
- ✅ Trend decomposition
- ✅ Cross-validation (expanding window)
- ✅ Confidence intervals
- ✅ Forecast monitoring

This skill handles all time series complexity within SpecWeave's increment workflow, ensuring forecasts are reproducible, documented, and production-ready.