Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 17:56:53 +08:00
commit 468d045de7
24 changed files with 7204 additions and 0 deletions

@@ -0,0 +1,559 @@
---
name: anomaly-detector
description: |
Anomaly and outlier detection using Isolation Forest, One-Class SVM, autoencoders, and statistical methods. Activates for "anomaly detection", "outlier detection", "fraud detection", "intrusion detection", "abnormal behavior", "unusual patterns", "detect anomalies", "system monitoring". Handles supervised and unsupervised anomaly detection with SpecWeave increment integration.
---
# Anomaly Detector
## Overview
Detect unusual patterns, outliers, and anomalies in data using statistical methods, machine learning, and deep learning. Critical for fraud detection, security monitoring, quality control, and system health monitoring—all integrated with SpecWeave's increment workflow.
## Why Anomaly Detection is Different
**Challenge**: Anomalies are rare (0.1% - 5% of data)
**Standard classification doesn't work**:
- ❌ Extreme class imbalance
- ❌ Unknown anomaly patterns
- ❌ Expensive to label anomalies
- ❌ Anomalies evolve over time
**Anomaly detection approaches**:
- ✅ Unsupervised (no labels needed)
- ✅ Semi-supervised (learn from normal data)
- ✅ Statistical (deviation from expected)
- ✅ Context-aware (what's normal for this user/time/location?)
## Anomaly Detection Methods
### 1. Statistical Methods (Baseline)
**Z-Score / Standard Deviation**:
```python
from specweave import AnomalyDetector
detector = AnomalyDetector(
method="statistical",
increment="0042"
)
# Flag values > 3 standard deviations from mean
anomalies = detector.detect(
data=transaction_amounts,
threshold=3.0
)
# Simple, fast, but assumes normal distribution
```
**IQR (Interquartile Range)**:
```python
# More robust to non-normal distributions
detector = AnomalyDetector(method="iqr")
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
anomalies = detector.detect(data=response_times)
# Good for skewed distributions
```
### 2. Isolation Forest (Recommended)
**Best for**: General purpose, high-dimensional data
```python
from specweave import IsolationForestDetector
detector = IsolationForestDetector(
contamination=0.05, # Expected anomaly rate (5%)
increment="0042"
)
# Train on normal data (or mixed data)
detector.fit(X_train)
# Detect anomalies
predictions = detector.predict(X_test)
# -1 = anomaly, 1 = normal
anomaly_scores = detector.score(X_test)
# Lower score = more anomalous
# Generates:
# - Anomaly scores for all samples
# - Feature importance (which features contribute to anomaly)
# - Threshold visualization
# - Top anomalies ranked by score
```
**Why Isolation Forest works** (see the sketch after this list):
- Fast (O(n log n))
- Handles high dimensions well
- No assumptions about data distribution
- Anomalies are easier to isolate (fewer splits)
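For reference, the same behavior can be sketched directly with scikit-learn's `IsolationForest`, independent of the SpecWeave wrapper above (the synthetic data and parameter values are purely illustrative):
```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Mostly "normal" points plus a small cluster of far-away outliers (synthetic)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(950, 5))
X_outliers = rng.uniform(low=6.0, high=10.0, size=(50, 5))
X = np.vstack([X_normal, X_outliers])

# contamination = expected fraction of anomalies
clf = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
clf.fit(X)

labels = clf.predict(X)        # -1 = anomaly, 1 = normal
scores = clf.score_samples(X)  # lower = more anomalous

print(f"Flagged {np.sum(labels == -1)} of {len(X)} points as anomalies")
```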
### 3. One-Class SVM
**Best for**: When you have only normal data for training
```python
from specweave import OneClassSVMDetector
# Train only on normal transactions
detector = OneClassSVMDetector(
kernel='rbf',
nu=0.05, # Expected anomaly rate
increment="0042"
)
detector.fit(X_normal)
# Detect anomalies in new data
predictions = detector.predict(X_new)
# -1 = anomaly, 1 = normal
# Good for: Clean training data of normal samples
```
### 4. Autoencoders (Deep Learning)
**Best for**: Complex patterns, high-dimensional data, images
```python
from specweave import AutoencoderDetector
# Learn to reconstruct normal data
detector = AutoencoderDetector(
encoding_dim=32, # Compressed representation
layers=[64, 32, 16, 32, 64],
increment="0042"
)
# Train on normal data
detector.fit(
X_normal,
epochs=100,
validation_split=0.2
)
# Anomalies have high reconstruction error
anomaly_scores = detector.score(X_test)
# Generates:
# - Reconstruction error distribution
# - Threshold recommendation
# - Top anomalies with explanations
# - Learned representations (t-SNE plot)
```
**How autoencoders work**:
```
Input → Encoder → Compressed → Decoder → Reconstructed
Normal data: Low reconstruction error (learned well)
Anomalies: High reconstruction error (never seen before)
```
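A minimal Keras sketch of this idea, assuming TensorFlow is installed; the layer sizes, synthetic data, and 95th-percentile threshold are illustrative, not a drop-in for the `AutoencoderDetector` above:
```python
import numpy as np
from tensorflow import keras

n_features = 30

# Fully connected autoencoder: 30 -> 16 -> 8 -> 16 -> 30
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),       # compressed representation
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# Train to reconstruct normal samples only
X_normal = np.random.normal(size=(5000, n_features)).astype("float32")
autoencoder.fit(X_normal, X_normal, epochs=20, batch_size=64,
                validation_split=0.2, verbose=0)

# Score new samples by reconstruction error; high error suggests an anomaly
X_new = np.random.normal(size=(100, n_features)).astype("float32")
reconstructed = autoencoder.predict(X_new, verbose=0)
errors = np.mean((X_new - reconstructed) ** 2, axis=1)
threshold = np.percentile(errors, 95)  # illustrative threshold choice
is_anomaly = errors > threshold
```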
### 5. LOF (Local Outlier Factor)
**Best for**: Density-based anomalies (sparse regions)
```python
from specweave import LOFDetector
# Detects points in low-density regions
detector = LOFDetector(
n_neighbors=20,
contamination=0.05,
increment="0042"
)
detector.fit(X_train)
predictions = detector.predict(X_test)
# Good for: Clustered data with sparse anomalies
```
## Anomaly Detection Workflows
### Workflow 1: Fraud Detection
```python
from specweave import FraudDetectionPipeline
pipeline = FraudDetectionPipeline(increment="0042")
# Features: transaction amount, location, time, merchant, etc.
pipeline.fit(normal_transactions)
# Real-time fraud detection
fraud_scores = pipeline.predict_proba(new_transactions)
# For each transaction:
# - Fraud probability (0-1)
# - Anomaly score
# - Contributing features
# - Similar past cases
# Generates:
# - Precision-Recall curve (fraud is rare)
# - Cost-benefit analysis (false positives vs missed fraud)
# - Feature importance for fraud
# - Fraud patterns identified
```
**Fraud Detection Best Practices**:
```python
# 1. Use multiple signals
pipeline.add_signals([
'amount_vs_user_average',
'distance_from_home',
'merchant_risk_score',
'velocity_24h' # Transactions in last 24h
])
# 2. Set threshold based on cost
# False Positive cost: $5 (manual review)
# False Negative cost: $500 (fraud loss)
# Optimal threshold: Maximize (savings - review_cost)
# 3. Provide explanations
explanation = pipeline.explain_prediction(suspicious_transaction)
# "Flagged because: amount 10x user average, new merchant, foreign location"
```
### Workflow 2: System Anomaly Detection
```python
from specweave import SystemAnomalyPipeline
# Monitor system metrics (CPU, memory, latency, errors)
pipeline = SystemAnomalyPipeline(increment="0042")
# Train on normal system behavior
pipeline.fit(normal_metrics)
# Detect system anomalies
anomalies = pipeline.detect(current_metrics)
# For each anomaly:
# - Severity (low, medium, high, critical)
# - Affected metrics
# - Similar past incidents
# - Recommended actions
# Generates:
# - Anomaly timeline
# - Metric correlations (which metrics moved together)
# - Root cause analysis
# - Alert rules
```
**System Monitoring Best Practices**:
```python
# 1. Use time windows
pipeline.add_time_windows([
'5min', # Immediate spikes
'1hour', # Short-term trends
'24hour' # Daily patterns
])
# 2. Correlate metrics
pipeline.detect_correlations([
('high_cpu', 'slow_response'),
('memory_leak', 'increasing_errors')
])
# 3. Reduce alert fatigue
pipeline.set_alert_rules(
min_severity='medium',
min_duration='5min', # Ignore transient spikes
max_alerts_per_hour=5
)
```
### Workflow 3: Manufacturing Quality Control
```python
from specweave import QualityControlPipeline
# Detect defective products from sensor data
pipeline = QualityControlPipeline(increment="0042")
# Train on good products
pipeline.fit(good_product_sensors)
# Detect defects in production line
defect_scores = pipeline.predict(production_line_data)
# Generates:
# - Real-time defect alerts
# - Defect rate trends
# - Most common defect patterns
# - Preventive maintenance recommendations
```
### Workflow 4: Network Intrusion Detection
```python
from specweave import IntrusionDetectionPipeline
# Detect malicious network traffic
pipeline = IntrusionDetectionPipeline(increment="0042")
# Features: packet size, frequency, ports, protocols, etc.
pipeline.fit(normal_network_traffic)
# Detect intrusions
intrusions = pipeline.detect(network_traffic_stream)
# Generates:
# - Attack type classification (DDoS, port scan, etc.)
# - Severity scores
# - Source IPs
# - Attack timeline
```
## Evaluation Metrics
**Anomaly detection metrics** (different from classification):
```python
from specweave import AnomalyEvaluator
evaluator = AnomalyEvaluator(increment="0042")
metrics = evaluator.evaluate(
y_true=true_labels, # 0=normal, 1=anomaly
y_pred=predictions,
y_scores=anomaly_scores
)
```
**Key Metrics**:
1. **Precision @ K** - Of top K flagged anomalies, how many are real?
```python
precision_at_100 = evaluator.precision_at_k(k=100)
# "Of 100 flagged transactions, 85 were actual fraud" = 85%
```
2. **Recall @ K** - Of all real anomalies, how many did we catch in top K?
```python
recall_at_100 = evaluator.recall_at_k(k=100)
# "We caught 78% of all fraud in top 100 flagged"
```
3. **ROC AUC** - Overall discrimination ability
```python
roc_auc = evaluator.roc_auc(y_true, y_scores)
# 0.95 = excellent discrimination
```
4. **PR AUC** - Better for imbalanced data
```python
pr_auc = evaluator.pr_auc(y_true, y_scores)
# More informative when anomalies are rare (<5%)
```
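For reference, Precision@K and Recall@K reduce to a few lines of NumPy, independent of the `AnomalyEvaluator` wrapper (the example data is synthetic):
```python
import numpy as np

def precision_recall_at_k(y_true, y_scores, k):
    """Precision@K and Recall@K, assuming higher score = more anomalous."""
    y_true = np.asarray(y_true)               # 1 = anomaly, 0 = normal
    top_k = np.argsort(y_scores)[::-1][:k]    # indices of the K most anomalous samples
    hits = y_true[top_k].sum()
    return hits / k, hits / y_true.sum()

# Synthetic example: 10,000 samples, 100 true anomalies
rng = np.random.default_rng(0)
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1
y_scores = rng.random(10_000) + y_true * 0.5  # anomalies tend to score higher

precision, recall = precision_recall_at_k(y_true, y_scores, k=100)
print(f"Precision@100 = {precision:.2f}, Recall@100 = {recall:.2f}")
```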
**Evaluation Report**:
```markdown
# Anomaly Detection Evaluation
## Dataset
- Total samples: 100,000
- Anomalies: 500 (0.5%)
- Features: 25
## Method: Isolation Forest
## Performance Metrics
- ROC AUC: 0.94 ✅ (excellent)
- PR AUC: 0.78 ✅ (good for 0.5% anomaly rate)
## Precision-Recall Tradeoff
- Precision @ 100: 85% (85 true anomalies in top 100)
- Recall @ 100: 17% (caught 17% of all anomalies)
- Precision @ 500: 62% (310 true anomalies in top 500)
- Recall @ 500: 62% (caught 62% of all anomalies)
## Business Impact (Fraud Detection Example)
- Review budget: 500 transactions/day
- At Precision @ 500 = 62%:
- True fraud caught: 310/day ($155,000 saved)
- False positives: 190/day ($950 review cost)
- Net benefit: $154,050/day ✅
## Recommendation
✅ DEPLOY with threshold for top 500 (62% precision)
```
## Integration with SpecWeave
### Increment Structure
```
.specweave/increments/0042-fraud-detection/
├── spec.md (detection requirements, business impact)
├── plan.md (method selection, threshold tuning)
├── tasks.md
├── data/
│ ├── normal_transactions.csv
│ ├── labeled_fraud.csv (if available)
│ └── schema.yaml
├── experiments/
│ ├── statistical-baseline/
│ ├── isolation-forest/
│ ├── one-class-svm/
│ └── autoencoder/
├── models/
│ ├── isolation_forest_model.pkl
│ └── threshold_config.json
├── evaluation/
│ ├── precision_recall_curve.png
│ ├── roc_curve.png
│ ├── top_anomalies.csv
│ └── evaluation_report.md
└── deployment/
├── real_time_api.py
├── monitoring_dashboard.json
└── alert_rules.yaml
```
## Best Practices
### 1. Start with Labeled Anomalies (if available)
```python
# Use labeled data to validate unsupervised methods
detector.fit(X_train) # Unlabeled
# Evaluate on labeled test set
metrics = evaluator.evaluate(y_true_test, detector.predict(X_test))
# Choose method with best precision @ K
```
### 2. Tune Contamination Parameter
```python
# Try different contamination rates
for contamination in [0.01, 0.05, 0.1, 0.2]:
    detector = IsolationForestDetector(contamination=contamination)
    detector.fit(X_train)
    metrics = evaluator.evaluate(y_test, detector.predict(X_test))
# Choose contamination that maximizes business value
```
### 3. Explain Anomalies
```python
# Don't just flag anomalies - explain why
explainer = AnomalyExplainer(detector, increment="0042")
for anomaly in top_anomalies:
    explanation = explainer.explain(anomaly)
    print(f"Anomaly: {anomaly.id}")
    print("Reasons:")
    print(f" - {explanation.top_features}")
    print(f" - Similar cases: {explanation.similar_cases}")
```
### 4. Handle Concept Drift
```python
# Anomalies evolve over time
monitor = AnomalyMonitor(increment="0042")
# Track detection performance
monitor.track_daily_performance()
# Retrain when accuracy drops
if monitor.performance_degraded():
    detector.retrain(new_normal_data)
```
### 5. Set Business-Driven Thresholds
```python
# Balance false positives vs false negatives
optimizer = ThresholdOptimizer(increment="0042")
optimal_threshold = optimizer.find_optimal(
detector=detector,
data=validation_data,
false_positive_cost=5, # $5 per manual review
false_negative_cost=500 # $500 per missed fraud
)
# Use optimal threshold for deployment
```
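Under the hood, a cost-based threshold search is a simple sweep; a plain-NumPy sketch, assuming labeled validation scores and the illustrative $5 / $500 costs from above:
```python
import numpy as np

def find_cost_optimal_threshold(y_true, scores, fp_cost=5, fn_cost=500):
    """Return the score threshold that minimizes total expected cost.

    y_true: 1 = anomaly/fraud, 0 = normal; scores: higher = more anomalous.
    """
    y_true = np.asarray(y_true)
    best_threshold, best_cost = None, np.inf
    for threshold in np.unique(scores):
        flagged = scores >= threshold
        false_positives = np.sum(flagged & (y_true == 0))
        false_negatives = np.sum(~flagged & (y_true == 1))
        cost = false_positives * fp_cost + false_negatives * fn_cost
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost
```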
## Advanced Features
### 1. Ensemble Anomaly Detection
```python
# Combine multiple detectors
ensemble = AnomalyEnsemble(increment="0042")
ensemble.add_detector("isolation_forest", weight=0.4)
ensemble.add_detector("one_class_svm", weight=0.3)
ensemble.add_detector("autoencoder", weight=0.3)
# Ensemble vote (more robust)
anomalies = ensemble.detect(X_test)
```
### 2. Contextual Anomaly Detection
```python
# What's normal varies by context
detector = ContextualAnomalyDetector(increment="0042")
# Different normality for different contexts
detector.fit(data, contexts=['user_id', 'time_of_day', 'location'])
# $10 transaction: Normal for user A, anomaly for user B
```
### 3. Sequential Anomaly Detection
```python
# Detect anomalous sequences (not just individual points)
detector = SequenceAnomalyDetector(
method='lstm',
window_size=10,
increment="0042"
)
# Example: Login from unusual sequence of locations
```
## Commands
```bash
# Train anomaly detector
/ml:train-anomaly-detector 0042
# Evaluate detector
/ml:evaluate-anomaly-detector 0042
# Explain top anomalies
/ml:explain-anomalies 0042 --top 100
```
## Summary
Anomaly detection is critical for:
- ✅ Fraud detection (financial transactions)
- ✅ Security monitoring (intrusion detection)
- ✅ Quality control (manufacturing defects)
- ✅ System health (performance monitoring)
- ✅ Business intelligence (unusual patterns)
This skill provides battle-tested methods integrated with SpecWeave's increment workflow, ensuring anomaly detectors are reproducible, explainable, and business-aligned.

@@ -0,0 +1,485 @@
---
name: automl-optimizer
description: |
Automated machine learning with hyperparameter optimization using Optuna, Hyperopt, or AutoML libraries. Activates for "automl", "hyperparameter tuning", "optimize hyperparameters", "auto tune model", "neural architecture search", "automated ml". Systematically explores model and hyperparameter spaces, tracks all experiments, and finds optimal configurations with minimal manual intervention.
---
# AutoML Optimizer
## Overview
Automates the tedious process of hyperparameter tuning and model selection. Instead of manually trying different configurations, define a search space and let AutoML find the optimal configuration through intelligent exploration.
## Why AutoML?
**Manual Tuning Problems**:
- Time-consuming (hours/days of trial and error)
- Subjective (depends on intuition)
- Incomplete (can't try all combinations)
- Not reproducible (hard to document search process)
**AutoML Benefits**:
- ✅ Systematic exploration of search space
- ✅ Intelligent sampling (Bayesian optimization)
- ✅ All experiments tracked automatically
- ✅ Find optimal configuration faster
- ✅ Reproducible (search process documented)
## AutoML Strategies
### Strategy 1: Hyperparameter Optimization (Optuna)
```python
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

from specweave import OptunaOptimizer

# Define search space
def objective(trial):
    # Suggest hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0)
    }

    # Train model
    model = XGBClassifier(**params)

    # Cross-validation score
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()
# Run optimization
optimizer = OptunaOptimizer(
objective=objective,
n_trials=100,
direction='maximize',
increment="0042"
)
best_params = optimizer.optimize()
# Creates:
# - .specweave/increments/0042.../experiments/optuna-study/
# ├── study.db (Optuna database)
# ├── optimization_history.png
# ├── param_importances.png
# ├── parallel_coordinate.png
# └── best_params.json
```
**Optimization Report**:
```markdown
# Optuna Optimization Report
## Search Space
- n_estimators: [100, 1000]
- max_depth: [3, 10]
- learning_rate: [0.01, 0.3] (log scale)
- subsample: [0.5, 1.0]
- colsample_bytree: [0.5, 1.0]
## Trials: 100
- Completed: 98
- Pruned: 2 (early stopping)
- Failed: 0
## Best Trial (#47)
- ROC AUC: 0.892 ± 0.012
- Parameters:
- n_estimators: 673
- max_depth: 6
- learning_rate: 0.094
- subsample: 0.78
- colsample_bytree: 0.91
## Parameter Importance
1. learning_rate (0.42) - Most important
2. n_estimators (0.28)
3. max_depth (0.18)
4. colsample_bytree (0.08)
5. subsample (0.04) - Least important
## Improvement over Default
- Default params: ROC AUC = 0.856
- Optimized params: ROC AUC = 0.892
- Improvement: +4.2%
```
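If you have access to the underlying `optuna.Study` object, the parameter-importance and history figures in this report come straight from Optuna's own utilities (the visualization functions require plotly):
```python
import optuna

# After study.optimize(...) has finished
importances = optuna.importance.get_param_importances(study)
for name, value in importances.items():
    print(f"{name}: {value:.2f}")

# Built-in plots (plotly-based)
optuna.visualization.plot_optimization_history(study)
optuna.visualization.plot_param_importances(study)
optuna.visualization.plot_parallel_coordinate(study)
```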
### Strategy 2: Algorithm Selection + Tuning
```python
from specweave import AutoMLPipeline
# Define candidate algorithms with search spaces
pipeline = AutoMLPipeline(increment="0042")
# Add candidates
pipeline.add_candidate(
name="xgboost",
model=XGBClassifier,
search_space={
'n_estimators': (100, 1000),
'max_depth': (3, 10),
'learning_rate': (0.01, 0.3)
}
)
pipeline.add_candidate(
name="lightgbm",
model=LGBMClassifier,
search_space={
'n_estimators': (100, 1000),
'max_depth': (3, 10),
'learning_rate': (0.01, 0.3)
}
)
pipeline.add_candidate(
name="random_forest",
model=RandomForestClassifier,
search_space={
'n_estimators': (100, 500),
'max_depth': (3, 20),
'min_samples_split': (2, 20)
}
)
pipeline.add_candidate(
name="logistic_regression",
model=LogisticRegression,
search_space={
'C': (0.001, 100),
'penalty': ['l1', 'l2']
}
)
# Run AutoML (tries all algorithms + hyperparameters)
results = pipeline.fit(
X_train, y_train,
n_trials_per_model=50,
cv_folds=5,
metric='roc_auc'
)
# Best model automatically selected
best_model = pipeline.best_model_
best_params = pipeline.best_params_
```
**AutoML Comparison**:
```markdown
| Model | Trials | Best Score | Mean Score | Std | Best Params |
|---------------------|--------|------------|------------|-------|--------------------------------------|
| xgboost | 50 | 0.892 | 0.876 | 0.012 | n_est=673, depth=6, lr=0.094 |
| lightgbm | 50 | 0.889 | 0.873 | 0.011 | n_est=542, depth=7, lr=0.082 |
| random_forest | 50 | 0.871 | 0.858 | 0.015 | n_est=384, depth=12, min_split=5 |
| logistic_regression | 50 | 0.845 | 0.840 | 0.008 | C=1.234, penalty=l2 |
**Winner: XGBoost** (ROC AUC = 0.892)
```
### Strategy 3: Neural Architecture Search (NAS)
```python
from specweave import NeuralArchitectureSearch
# For deep learning
nas = NeuralArchitectureSearch(increment="0042")
# Define search space
search_space = {
'num_layers': (2, 5),
'layer_sizes': (32, 512),
'activation': ['relu', 'tanh', 'elu'],
'dropout': (0.0, 0.5),
'optimizer': ['adam', 'sgd', 'rmsprop'],
'learning_rate': (0.0001, 0.01)
}
# Search for best architecture
best_architecture = nas.search(
X_train, y_train,
search_space=search_space,
n_trials=100,
max_epochs=50
)
# Creates: Best neural network architecture
```
## AutoML Frameworks Integration
### Optuna (Recommended)
```python
import optuna
from specweave import configure_optuna
# Auto-configures Optuna to log to increment
configure_optuna(increment="0042")
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
    }
    model = XGBClassifier(**params)
    score = cross_val_score(model, X, y, cv=5).mean()
    return score
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
# Automatically logged to increment folder
```
### Auto-sklearn
```python
from specweave import AutoSklearnOptimizer
# Automated model selection + feature engineering
optimizer = AutoSklearnOptimizer(
time_left_for_this_task=3600, # 1 hour
increment="0042"
)
optimizer.fit(X_train, y_train)
# Auto-sklearn tries:
# - Multiple algorithms
# - Feature preprocessing combinations
# - Ensemble methods
# Returns best pipeline
```
### H2O AutoML
```python
from specweave import H2OAutoMLOptimizer
optimizer = H2OAutoMLOptimizer(
max_runtime_secs=3600, # 1 hour
max_models=50,
increment="0042"
)
optimizer.fit(X_train, y_train)
# H2O tries many algorithms in parallel
# Returns leaderboard + best model
```
## Best Practices
### 1. Start with Default Baseline
```python
# Always compare AutoML to default hyperparameters
baseline_model = XGBClassifier() # Default params
baseline_score = cross_val_score(baseline_model, X, y, cv=5).mean()
# Then optimize
optimizer = OptunaOptimizer(objective, n_trials=100)
optimized_params = optimizer.optimize()
optimized_score = cross_val_score(XGBClassifier(**optimized_params), X, y, cv=5).mean()
improvement = (optimized_score - baseline_score) / baseline_score * 100
print(f"Improvement: {improvement:.1f}%")
# Only use optimized if significant improvement (>2-3%)
```
### 2. Use Cross-Validation
```python
# ❌ Wrong: Single train/test split
score = model.score(X_test, y_test)
# ✅ Correct: Cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5)
score = scores.mean()
# Prevents overfitting to specific train/test split
```
### 3. Set Reasonable Search Budgets
```python
# Quick exploration (development)
optimizer.optimize(n_trials=20) # ~5-10 minutes
# Moderate search (iteration)
optimizer.optimize(n_trials=100) # ~30-60 minutes
# Thorough search (final model)
optimizer.optimize(n_trials=500) # ~2-4 hours
# Don't overdo it: diminishing returns after ~100-200 trials
```
### 4. Prune Unpromising Trials
```python
# Optuna can stop bad trials early
study = optuna.create_study(
direction='maximize',
pruner=optuna.pruners.MedianPruner()
)
# If trial is performing worse than median at epoch N, stop it
# Saves time by not fully training bad models
```
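For the pruner to act, each trial must report intermediate values; a minimal sketch with Optuna's pruning API, where `build_model` and `train_one_epoch` are hypothetical helpers standing in for your training loop:
```python
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    model = build_model(lr)  # hypothetical helper

    score = 0.0
    for epoch in range(50):
        score = train_one_epoch(model)   # hypothetical helper returning a validation score
        trial.report(score, step=epoch)  # report intermediate value to the pruner
        if trial.should_prune():         # performing worse than the median at this step?
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5),
)
study.optimize(objective, n_trials=100)
```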
### 5. Document Search Space Rationale
```python
# Document why you chose specific ranges
search_space = {
# XGBoost recommends max_depth 3-10 for most tasks
'max_depth': (3, 10),
# Learning rate: 0.01-0.3 covers slow to fast learning
# Log scale to spend more trials on smaller values
'learning_rate': (0.01, 0.3, 'log'),
# n_estimators: Balance accuracy vs training time
'n_estimators': (100, 1000)
}
```
## Integration with SpecWeave
### Automatic Experiment Tracking
```python
# All AutoML trials logged automatically
optimizer = OptunaOptimizer(objective, increment="0042")
optimizer.optimize(n_trials=100)
# Creates:
# .specweave/increments/0042.../experiments/
# ├── optuna-trial-001/
# ├── optuna-trial-002/
# ├── ...
# ├── optuna-trial-100/
# └── optuna-summary.md
```
### Living Docs Integration
```bash
/specweave:sync-docs update
```
Updates:
```markdown
<!-- .specweave/docs/internal/architecture/ml-optimization.md -->
## Hyperparameter Optimization (Increment 0042)
### Optimization Strategy
- Framework: Optuna (Bayesian optimization)
- Trials: 100
- Search space: 5 hyperparameters
- Metric: ROC AUC (5-fold CV)
### Results
- Best score: 0.892 ± 0.012
- Improvement over default: +4.2%
- Most important param: learning_rate (0.42)
### Selected Hyperparameters
```python
{
'n_estimators': 673,
'max_depth': 6,
'learning_rate': 0.094,
'subsample': 0.78,
'colsample_bytree': 0.91
}
```
### Recommendation
XGBoost with optimized hyperparameters for production deployment.
```
## Commands
```bash
# Run AutoML optimization
/ml:optimize 0042 --trials 100
# Compare algorithms
/ml:compare-algorithms 0042
# Show optimization history
/ml:optimization-report 0042
```
## Common Patterns
### Pattern 1: Coarse-to-Fine Optimization
```python
# Step 1: Coarse search (wide ranges, few trials)
coarse_space = {
'n_estimators': (100, 1000, 'int'),
'max_depth': (3, 10, 'int'),
'learning_rate': (0.01, 0.3, 'log')
}
coarse_results = optimizer.optimize(coarse_space, n_trials=50)
# Step 2: Fine search (narrow ranges around best)
best_params = coarse_results['best_params']
fine_space = {
'n_estimators': (best_params['n_estimators'] - 100,
best_params['n_estimators'] + 100),
'max_depth': (max(3, best_params['max_depth'] - 1),
min(10, best_params['max_depth'] + 1)),
'learning_rate': (best_params['learning_rate'] * 0.5,
best_params['learning_rate'] * 1.5, 'log')
}
fine_results = optimizer.optimize(fine_space, n_trials=50)
```
### Pattern 2: Multi-Objective Optimization
```python
import time

# Optimize for multiple objectives (accuracy + speed)
def multi_objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
    }
    model = XGBClassifier(**params)

    # Objective 1: Accuracy (maximize)
    accuracy = cross_val_score(model, X, y, cv=5).mean()

    # Objective 2: Training time (minimize)
    start = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start

    return accuracy, training_time

# Optuna will find Pareto-optimal solutions
study = optuna.create_study(directions=['maximize', 'minimize'])
study.optimize(multi_objective, n_trials=100)
```
## Summary
AutoML accelerates ML development by:
- ✅ Automating tedious hyperparameter tuning
- ✅ Exploring search space systematically
- ✅ Finding optimal configurations faster
- ✅ Tracking all experiments automatically
- ✅ Documenting optimization process
Don't spend days manually tuning—let AutoML do it in hours.

@@ -0,0 +1,157 @@
---
name: cv-pipeline-builder
description: |
Computer vision ML pipelines for image classification, object detection, semantic segmentation, and image generation. Activates for "computer vision", "image classification", "object detection", "CNN", "ResNet", "YOLO", "image segmentation", "image preprocessing", "data augmentation". Builds end-to-end CV pipelines with PyTorch/TensorFlow, integrated with SpecWeave increments.
---
# Computer Vision Pipeline Builder
## Overview
Specialized ML pipelines for computer vision tasks. Handles image preprocessing, data augmentation, CNN architectures, transfer learning, and deployment for production CV systems.
## CV Tasks Supported
### 1. Image Classification
```python
from specweave import CVPipeline
# Binary or multi-class classification
pipeline = CVPipeline(
task="classification",
num_classes=10,
increment="0042"
)
# Automatically configures:
# - Image preprocessing (resize, normalize)
# - Data augmentation (rotation, flip, color jitter)
# - CNN architecture (ResNet, EfficientNet, ViT)
# - Transfer learning from ImageNet
# - Training loop with validation
# - Inference pipeline
pipeline.fit(train_images, train_labels)
```
### 2. Object Detection
```python
# Detect multiple objects in images
pipeline = CVPipeline(
task="object_detection",
classes=["person", "car", "dog", "cat"],
increment="0042"
)
# Uses: YOLO, Faster R-CNN, or RetinaNet
# Returns: Bounding boxes + class labels + confidence scores
```
### 3. Semantic Segmentation
```python
# Pixel-level classification
pipeline = CVPipeline(
task="segmentation",
num_classes=21,
increment="0042"
)
# Uses: U-Net, DeepLab, or SegFormer
# Returns: Segmentation mask for each pixel
```
## Best Practices for CV
### Data Augmentation
```python
from specweave import ImageAugmentation
aug = ImageAugmentation(increment="0042")
# Standard augmentations
aug.add_transforms([
"random_rotation", # ±15 degrees
"random_flip_horizontal",
"random_brightness", # ±20%
"random_contrast", # ±20%
"random_crop"
])
# Advanced augmentations
aug.add_advanced([
"mixup", # Mix two images
"cutout", # Random erasing
"autoaugment" # Learned augmentation
])
```
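The standard augmentations above map directly onto torchvision transforms; a minimal sketch (parameter values are illustrative):
```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=15),                  # ±15 degrees
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # ±20%
    transforms.RandomResizedCrop(size=224),                 # random crop + resize
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],        # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```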
### Transfer Learning
```python
# Start from pre-trained ImageNet models
pipeline = CVPipeline(task="classification")
# Option 1: Feature extraction (freeze backbone)
pipeline.use_pretrained(
model="resnet50",
freeze_backbone=True
)
# Option 2: Fine-tuning (unfreeze after few epochs)
pipeline.use_pretrained(
model="resnet50",
freeze_backbone=False,
fine_tune_after_epoch=3
)
```
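In plain PyTorch the two options look roughly like this, using ResNet-50 as an example (assumes torchvision 0.13+ for the weights enum):
```python
import torch.nn as nn
from torchvision import models

num_classes = 10

# Load an ImageNet-pretrained backbone
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Option 1: Feature extraction - freeze the backbone, train only a new head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head stays trainable

# Option 2: Fine-tuning - after a few epochs, unfreeze and train with a small learning rate
# for param in model.parameters():
#     param.requires_grad = True
```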
### Model Selection
**Image Classification**:
- Small datasets (<10K): ResNet18, MobileNetV2
- Medium datasets (10K-100K): ResNet50, EfficientNet-B0
- Large datasets (>100K): EfficientNet-B3, Vision Transformer
**Object Detection**:
- Real-time (>30 FPS): YOLOv8, SSDLite
- High accuracy: Faster R-CNN, RetinaNet
**Segmentation**:
- Medical imaging: U-Net
- Scene segmentation: DeepLabV3, SegFormer
## Integration with SpecWeave
CV increment structure:
```
.specweave/increments/0042-image-classifier/
├── spec.md
├── data/
│   ├── train/
│   ├── val/
│   └── test/
├── models/
│   ├── model-v1.pth
│   └── model-v2.pth
├── experiments/
│   ├── baseline-resnet18/
│   ├── resnet50-augmented/
│   └── efficientnet-b0/
└── deployment/
    ├── onnx_model.onnx
    └── inference.py
```
## Commands
```bash
/ml:cv-pipeline --task classification --model resnet50
/ml:cv-evaluate 0042 # Evaluate on test set
/ml:cv-deploy 0042 # Export to ONNX
```
Quick setup for CV projects with production-ready pipelines.

@@ -0,0 +1,521 @@
---
name: data-visualizer
description: |
Automated data visualization for EDA, model performance, and business reporting. Activates for "visualize data", "create plots", "EDA", "exploratory analysis", "confusion matrix", "ROC curve", "feature distribution", "correlation heatmap", "plot results", "dashboard". Generates publication-quality visualizations integrated with SpecWeave increments.
---
# Data Visualizer
## Overview
Automated visualization generation for exploratory data analysis, model performance reporting, and stakeholder communication. Creates publication-quality plots, interactive dashboards, and business-friendly reports—all integrated with SpecWeave's increment workflow.
## Visualization Categories
### 1. Exploratory Data Analysis (EDA)
**Automated EDA Report**:
```python
from specweave import EDAVisualizer
visualizer = EDAVisualizer(increment="0042")
# Generates comprehensive EDA report
report = visualizer.generate_eda_report(df)
# Creates:
# - Dataset overview (rows, columns, memory, missing values)
# - Numerical feature distributions (histograms + KDE)
# - Categorical feature counts (bar charts)
# - Correlation heatmap
# - Missing value pattern
# - Outlier detection plots
# - Feature relationships (pairplot for top features)
```
**Individual EDA Plots**:
```python
# Distribution plots
visualizer.plot_distribution(
data=df['age'],
title="Age Distribution",
bins=30
)
# Correlation heatmap
visualizer.plot_correlation_heatmap(
data=df[numerical_columns],
method='pearson' # or 'spearman', 'kendall'
)
# Missing value patterns
visualizer.plot_missing_values(df)
# Outlier detection (boxplots)
visualizer.plot_outliers(df[numerical_columns])
```
### 2. Model Performance Visualizations
**Classification Performance**:
```python
from specweave import ClassificationVisualizer
viz = ClassificationVisualizer(increment="0042")
# Confusion matrix
viz.plot_confusion_matrix(
y_true=y_test,
y_pred=y_pred,
classes=['Negative', 'Positive']
)
# ROC curve
viz.plot_roc_curve(
y_true=y_test,
y_proba=y_proba
)
# Precision-Recall curve
viz.plot_precision_recall_curve(
y_true=y_test,
y_proba=y_proba
)
# Learning curves (train vs val)
viz.plot_learning_curve(
train_scores=train_scores,
val_scores=val_scores
)
# Calibration curve (are probabilities well-calibrated?)
viz.plot_calibration_curve(
y_true=y_test,
y_proba=y_proba
)
```
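For reference, the same plots can be produced directly with scikit-learn (1.0 or newer) and matplotlib, reusing `y_test`, `y_pred`, and `y_proba` from the snippet above:
```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
)

# Confusion matrix from hard predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred,
                                        display_labels=["Negative", "Positive"])

# ROC and Precision-Recall curves from positive-class probabilities
RocCurveDisplay.from_predictions(y_test, y_proba)
PrecisionRecallDisplay.from_predictions(y_test, y_proba)

plt.show()
```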
**Regression Performance**:
```python
from specweave import RegressionVisualizer
viz = RegressionVisualizer(increment="0042")
# Predicted vs Actual
viz.plot_predictions(
y_true=y_test,
y_pred=y_pred
)
# Residual plot
viz.plot_residuals(
y_true=y_test,
y_pred=y_pred
)
# Residual distribution (should be normal)
viz.plot_residual_distribution(
residuals=y_test - y_pred
)
# Error by feature value
viz.plot_error_analysis(
y_true=y_test,
y_pred=y_pred,
features=X_test
)
```
### 3. Feature Analysis Visualizations
**Feature Importance**:
```python
from specweave import FeatureVisualizer
viz = FeatureVisualizer(increment="0042")
# Feature importance (bar chart)
viz.plot_feature_importance(
feature_names=feature_names,
importances=model.feature_importances_,
top_n=20
)
# SHAP summary plot
viz.plot_shap_summary(
shap_values=shap_values,
features=X_test
)
# Partial dependence plots
viz.plot_partial_dependence(
model=model,
features=['age', 'income'],
X=X_train
)
# Feature interaction
viz.plot_feature_interaction(
model=model,
features=('age', 'income'),
X=X_train
)
```
### 4. Time Series Visualizations
**Time Series Plots**:
```python
from specweave import TimeSeriesVisualizer
viz = TimeSeriesVisualizer(increment="0042")
# Time series with trend
viz.plot_timeseries(
data=sales_data,
show_trend=True
)
# Seasonal decomposition
viz.plot_seasonal_decomposition(
data=sales_data,
period=12 # Monthly seasonality
)
# Autocorrelation (ACF, PACF)
viz.plot_autocorrelation(data=sales_data)
# Forecast with confidence intervals
viz.plot_forecast(
actual=test_data,
forecast=forecast,
confidence_intervals=(0.80, 0.95)
)
```
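Seasonal decomposition and autocorrelation plots like these are also available directly in statsmodels; a minimal sketch, assuming `sales_data` is a pandas Series with a regular monthly index:
```python
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Trend / seasonal / residual components (period=12 for monthly data)
decomposition = seasonal_decompose(sales_data, model="additive", period=12)
decomposition.plot()

# Autocorrelation and partial autocorrelation
plot_acf(sales_data, lags=36)
plot_pacf(sales_data, lags=36)
```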
### 5. Model Comparison Visualizations
**Compare Multiple Models**:
```python
from specweave import ModelComparisonVisualizer
viz = ModelComparisonVisualizer(increment="0042")
# Compare metrics across models
viz.plot_model_comparison(
models=['Baseline', 'XGBoost', 'LightGBM', 'Neural Net'],
metrics={
'accuracy': [0.65, 0.87, 0.86, 0.85],
'roc_auc': [0.70, 0.92, 0.91, 0.90],
'training_time': [1, 45, 32, 320]
}
)
# ROC curves for multiple models
viz.plot_roc_curves_comparison(
models_predictions={
'XGBoost': (y_test, y_proba_xgb),
'LightGBM': (y_test, y_proba_lgbm),
'Neural Net': (y_test, y_proba_nn)
}
)
```
## Interactive Visualizations
**Plotly Integration**:
```python
from specweave import InteractiveVisualizer
viz = InteractiveVisualizer(increment="0042")
# Interactive scatter plot (zoom, pan, hover)
viz.plot_interactive_scatter(
x=X_test[:, 0],
y=X_test[:, 1],
colors=y_pred,
hover_data=df[['id', 'amount', 'merchant']]
)
# Interactive confusion matrix (click for details)
viz.plot_interactive_confusion_matrix(
y_true=y_test,
y_pred=y_pred
)
# Interactive feature importance (sortable, filterable)
viz.plot_interactive_feature_importance(
feature_names=feature_names,
importances=importances
)
```
## Business Reporting
**Automated ML Report**:
```python
from specweave import MLReportGenerator
generator = MLReportGenerator(increment="0042")
# Generate executive summary report
report = generator.generate_report(
model=model,
test_data=(X_test, y_test),
business_metrics={
'false_positive_cost': 5,
'false_negative_cost': 500
}
)
# Creates:
# - Executive summary (1 page, non-technical)
# - Key metrics (accuracy, precision, recall)
# - Business impact ($$ saved, ROI)
# - Model performance visualizations
# - Recommendations
# - Technical appendix
```
**Report Output** (HTML/PDF):
```markdown
# Fraud Detection Model - Executive Summary
## Key Results
- **Accuracy**: 87% (target: >85%) ✅
- **Fraud Detection Rate**: 62% (catching 310 frauds/day)
- **False Alarms**: 38% of flagged transactions (190 false alarms/day)
## Business Impact
- **Fraud Prevented**: $155,000/day
- **Review Cost**: $950/day (190 transactions × $5)
- **Net Benefit**: $154,050/day ✅
- **Annual Savings**: $56.2M
## Model Performance
[Confusion Matrix Visualization]
[ROC Curve]
[Feature Importance]
## Recommendations
1. ✅ Deploy to production immediately
2. Monitor fraud patterns weekly
3. Retrain model monthly with new data
```
## Dashboard Creation
**Real-Time Dashboard**:
```python
from specweave import DashboardCreator
creator = DashboardCreator(increment="0042")
# Create Grafana/Plotly dashboard
dashboard = creator.create_dashboard(
title="Model Performance Dashboard",
panels=[
{'type': 'metric', 'query': 'prediction_latency_p95'},
{'type': 'metric', 'query': 'predictions_per_second'},
{'type': 'timeseries', 'query': 'accuracy_over_time'},
{'type': 'timeseries', 'query': 'error_rate'},
{'type': 'heatmap', 'query': 'prediction_distribution'},
{'type': 'table', 'query': 'recent_anomalies'}
]
)
# Exports to Grafana JSON or Plotly Dash app
dashboard.export(format='grafana')
```
## Visualization Best Practices
### 1. Publication-Quality Plots
```python
# Set consistent styling
visualizer.set_style(
style='seaborn', # Or 'ggplot', 'fivethirtyeight'
context='paper', # Or 'notebook', 'talk', 'poster'
palette='colorblind' # Accessible colors
)
# High-resolution exports
visualizer.save_figure(
filename='model_performance.png',
dpi=300, # Publication quality
bbox_inches='tight'
)
```
### 2. Accessible Visualizations
```python
# Colorblind-friendly palettes
visualizer.use_colorblind_palette()
# Add alt text for accessibility
visualizer.add_alt_text(
plot=fig,
description="Confusion matrix showing 87% accuracy"
)
# High contrast for presentations
visualizer.set_high_contrast_mode()
```
### 3. Annotation and Context
```python
# Add reference lines
viz.add_reference_line(
y=0.85, # Target accuracy
label='Target',
color='red',
linestyle='--'
)
# Add annotations
viz.annotate_point(
x=optimal_threshold,
y=optimal_f1,
text='Optimal threshold: 0.47'
)
```
## Integration with SpecWeave
### Automated Visualization in Increments
```python
# All visualizations auto-saved to increment folder
visualizer = EDAVisualizer(increment="0042")
# Creates:
# .specweave/increments/0042-fraud-detection/
# ├── visualizations/
# │ ├── eda/
# │ │ ├── distributions.png
# │ │ ├── correlation_heatmap.png
# │ │ └── missing_values.png
# │ ├── model_performance/
# │ │ ├── confusion_matrix.png
# │ │ ├── roc_curve.png
# │ │ ├── precision_recall.png
# │ │ └── learning_curves.png
# │ ├── feature_analysis/
# │ │ ├── feature_importance.png
# │ │ ├── shap_summary.png
# │ │ └── partial_dependence/
# │ └── reports/
# │ ├── executive_summary.html
# │ └── technical_report.pdf
```
### Living Docs Integration
```bash
/specweave:sync-docs update
```
Updates:
```markdown
<!-- .specweave/docs/internal/architecture/ml-model-performance.md -->
## Fraud Detection Model Performance (Increment 0042)
### Model Accuracy
![Confusion Matrix](../../../increments/0042-fraud-detection/visualizations/confusion_matrix.png)
### Key Metrics
- Accuracy: 87%
- Precision: 85%
- Recall: 62%
- ROC AUC: 0.92
### Feature Importance
![Top Features](../../../increments/0042-fraud-detection/visualizations/feature_importance.png)
Top 5 features:
1. amount_vs_user_average (0.18)
2. days_since_last_purchase (0.12)
3. merchant_risk_score (0.10)
4. velocity_24h (0.08)
5. location_distance_from_home (0.07)
```
## Commands
```bash
# Generate EDA report
/ml:visualize-eda 0042
# Generate model performance report
/ml:visualize-performance 0042
# Create interactive dashboard
/ml:create-dashboard 0042
# Export all visualizations
/ml:export-visualizations 0042 --format png,pdf,html
```
## Advanced Features
### 1. Automated Report Generation
```python
# Generate full increment report with all visualizations
generator = IncrementReportGenerator(increment="0042")
report = generator.generate_full_report()
# Includes:
# - EDA visualizations
# - Experiment comparisons
# - Best model performance
# - Feature importance
# - Business impact
# - Deployment readiness
```
### 2. Custom Visualization Templates
```python
# Create reusable templates
template = VisualizationTemplate(name="fraud_analysis")
template.add_panel("confusion_matrix")
template.add_panel("roc_curve")
template.add_panel("top_fraud_features")
template.add_panel("fraud_trends_over_time")
# Apply to any increment
template.apply(increment="0042")
```
### 3. Version Control for Visualizations
```python
# Track visualization changes across model versions
viz_tracker = VisualizationTracker(increment="0042")
# Compare model v1 vs v2 visualizations
viz_tracker.compare_versions(
version_1="model-v1",
version_2="model-v2"
)
# Shows: Confusion matrix improved, ROC curve comparison, etc.
```
## Summary
Data visualization is critical for:
- ✅ Exploratory data analysis (understand data before modeling)
- ✅ Model performance communication (stakeholder buy-in)
- ✅ Feature analysis (understand what drives predictions)
- ✅ Business reporting (translate metrics to impact)
- ✅ Model debugging (identify issues visually)
This skill automates visualization generation, ensuring all ML work is visual, accessible, and business-friendly within SpecWeave's increment workflow.

@@ -0,0 +1,535 @@
---
name: experiment-tracker
description: |
Manages ML experiment tracking with MLflow, Weights & Biases, or SpecWeave's built-in tracking. Activates for "track experiments", "MLflow", "wandb", "experiment logging", "compare experiments", "hyperparameter tracking". Automatically configures tracking tools to log to SpecWeave increment folders, ensuring all experiments are documented and reproducible. Integrates with SpecWeave's living docs for persistent experiment knowledge.
---
# Experiment Tracker
## Overview
Transforms chaotic ML experimentation into organized, reproducible research. Every experiment is logged, versioned, and tied to a SpecWeave increment, ensuring team knowledge is preserved and experiments are reproducible.
## Problem This Solves
**Without structured tracking**:
- ❌ "Which hyperparameters did we use for model v2?"
- ❌ "Why did we choose XGBoost over LightGBM?"
- ❌ "Can't reproduce results from 3 months ago"
- ❌ "Team member left, all knowledge in their notebooks"
**With experiment tracking**:
- ✅ All experiments logged with params, metrics, artifacts
- ✅ Decisions documented ("XGBoost: 5% better precision, chose it")
- ✅ Reproducible (environment, data version, code hash)
- ✅ Team knowledge in living docs, not individual notebooks
## How It Works
### Auto-Configuration
When you create an ML increment, the skill detects tracking tools:
```python
# No configuration needed - automatically detects and configures
from specweave import track_experiment
# Automatically logs to:
# .specweave/increments/0042.../experiments/exp-001/
with track_experiment("baseline-model") as exp:
model.fit(X_train, y_train)
exp.log_metric("accuracy", accuracy)
```
### Tracking Backends
**Option 1: SpecWeave Built-in** (default, zero-config)
```python
from specweave import track_experiment
# Logs to increment folder automatically
with track_experiment("xgboost-v1") as exp:
exp.log_param("n_estimators", 100)
exp.log_metric("auc", 0.87)
exp.save_model(model, "model.pkl")
# Creates:
# .specweave/increments/0042.../experiments/xgboost-v1/
# ├── params.json
# ├── metrics.json
# ├── model.pkl
# └── metadata.yaml
```
**Option 2: MLflow** (if detected in project)
```python
import mlflow
from specweave import configure_mlflow
# Auto-configures MLflow to log to increment
configure_mlflow(increment="0042")
with mlflow.start_run(run_name="xgboost-v1"):
mlflow.log_param("n_estimators", 100)
mlflow.log_metric("auc", 0.87)
mlflow.sklearn.log_model(model, "model")
# Still logs to increment folder, just uses MLflow as backend
```
**Option 3: Weights & Biases**
```python
import wandb
from specweave import configure_wandb
# Auto-configures W&B project = increment ID
configure_wandb(increment="0042")
run = wandb.init(name="xgboost-v1")
run.log({"auc": 0.87})
run.log_model("model.pkl")
# W&B dashboard + local logs in increment folder
```
### Experiment Comparison
```python
from specweave import compare_experiments
# Compare all experiments in increment
comparison = compare_experiments(increment="0042")
# Generates:
# .specweave/increments/0042.../experiments/comparison.md
```
**Output**:
```markdown
| Experiment | Accuracy | Precision | Recall | F1 | Training Time |
|--------------------|----------|-----------|--------|------|---------------|
| exp-001-baseline | 0.65 | 0.60 | 0.55 | 0.57 | 2s |
| exp-002-xgboost | 0.87 | 0.85 | 0.83 | 0.84 | 45s |
| exp-003-lightgbm | 0.86 | 0.84 | 0.82 | 0.83 | 32s |
| exp-004-neural-net | 0.85 | 0.83 | 0.81 | 0.82 | 320s |
**Best Model**: exp-002-xgboost
- Highest accuracy (0.87)
- Good precision/recall balance
- Reasonable training time (45s)
- Selected for deployment
```
### Living Docs Integration
After completing increment:
```bash
/specweave:sync-docs update
```
Automatically updates:
```markdown
<!-- .specweave/docs/internal/architecture/ml-experiments.md -->
## Recommendation Model (Increment 0042)
### Experiments Conducted: 7
- exp-001-baseline: Random classifier (acc=0.12)
- exp-002-popularity: Popularity baseline (acc=0.18)
- exp-003-xgboost: XGBoost classifier (acc=0.26) ✅ **SELECTED**
- ...
### Selection Rationale
XGBoost chosen for:
- Best accuracy (0.26 vs baseline 0.18, +44% improvement)
- Fast inference (<50ms)
- Good explainability (SHAP values)
- Stable across cross-validation (std=0.02)
### Hyperparameters (exp-003)
- n_estimators: 200
- max_depth: 6
- learning_rate: 0.1
- subsample: 0.8
```
## When to Use This Skill
Activate when you need to:
- **Track ML experiments** systematically
- **Compare multiple models** objectively
- **Document experiment decisions** for team
- **Reproduce past results** exactly
- **Maintain experiment history** across increments
## Key Features
### 1. Automatic Logging
```python
# Logs everything automatically
from specweave import AutoTracker
tracker = AutoTracker(increment="0042")
# Just wrap your training code
@tracker.track(name="xgboost-auto")
def train_model():
    model = XGBClassifier(**params)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    return model, score
# Automatically logs: params, metrics, model, environment, git hash
model, score = train_model()
```
### 2. Hyperparameter Tracking
```python
from specweave import track_hyperparameters
params_grid = {
"n_estimators": [100, 200, 500],
"max_depth": [3, 6, 9],
"learning_rate": [0.01, 0.1, 0.3]
}
# Tracks all parameter combinations
results = track_hyperparameters(
model=XGBClassifier,
param_grid=params_grid,
X_train=X_train,
y_train=y_train,
increment="0042"
)
# Generates parameter importance analysis
```
### 3. Cross-Validation Tracking
```python
from specweave import track_cross_validation
# Tracks each fold separately
cv_results = track_cross_validation(
model=model,
X=X,
y=y,
cv=5,
increment="0042"
)
# Logs: mean, std, per-fold scores, fold distribution
```
### 4. Artifact Management
```python
from specweave import track_experiment

with track_experiment("xgboost-v1") as exp:
    # Training artifacts
    exp.save_artifact("preprocessor.pkl", preprocessor)
    exp.save_artifact("model.pkl", model)

    # Evaluation artifacts
    exp.save_artifact("confusion_matrix.png", cm_plot)
    exp.save_artifact("roc_curve.png", roc_plot)

    # Data artifacts
    exp.save_artifact("feature_importance.csv", importance_df)

    # Environment artifacts
    exp.save_artifact("requirements.txt", requirements)
    exp.save_artifact("conda_env.yaml", conda_env)
```
### 5. Experiment Metadata
```python
from specweave import ExperimentMetadata
metadata = ExperimentMetadata(
name="xgboost-v3",
description="XGBoost with feature engineering v2",
tags=["production-candidate", "feature-eng-v2"],
git_commit="a3b8c9d",
data_version="v2024-01",
author="[email protected]"
)
with track_experiment(metadata) as exp:
    # ... training ...
    pass
```
## Best Practices
### 1. Name Experiments Clearly
```python
# ❌ Bad: Generic names
with track_experiment("exp1"):
...
# ✅ Good: Descriptive names
with track_experiment("xgboost-tuned-depth6-lr0.1"):
...
```
### 2. Log Everything
```python
# Log more than you think you need
exp.log_param("random_seed", 42)
exp.log_param("data_version", "2024-01")
exp.log_param("python_version", sys.version)
exp.log_param("sklearn_version", sklearn.__version__)
# Future you will thank present you
```
### 3. Document Failures
```python
try:
    with track_experiment("neural-net-attempt") as exp:
        model.fit(X_train, y_train)
except Exception as e:
    exp.log_note(f"FAILED: {str(e)}")
    exp.log_note("Reason: Out of memory, need smaller batch size")
    exp.set_status("failed")
# Failure documentation prevents repeating mistakes
```
### 4. Use Experiment Series
```python
# Related experiments in series
experiments = [
"xgboost-baseline",
"xgboost-tuned-v1",
"xgboost-tuned-v2",
"xgboost-tuned-v3-final"
]
# Track progression and improvements
```
### 5. Link to Data Versions
```python
with track_experiment("xgboost-v1") as exp:
exp.log_param("data_commit", "dvc:a3b8c9d")
exp.log_param("data_url", "s3://bucket/data/v2024-01")
# Enables exact reproduction
```
## Integration with SpecWeave
### With Increments
```bash
# Experiments automatically tied to increment
/specweave:inc "0042-recommendation-model"
# All experiments logged to: .specweave/increments/0042.../experiments/
```
### With Living Docs
```bash
# Sync experiment findings to docs
/specweave:sync-docs update
# Updates: architecture/ml-models.md, runbooks/model-training.md
```
### With GitHub
```bash
# Create issue for model retraining
/specweave:github:create-issue "Retrain model with Q1 2024 data"
# Links to previous experiments in increment
```
## Examples
### Example 1: Baseline Experiments
```python
from sklearn.dummy import DummyClassifier

from specweave import track_experiment

# DummyClassifier strategies: uniform (random), most_frequent (majority), stratified
baselines = ["uniform", "most_frequent", "stratified"]
for strategy in baselines:
    with track_experiment(f"baseline-{strategy}") as exp:
        model = DummyClassifier(strategy=strategy)
        model.fit(X_train, y_train)
        accuracy = model.score(X_test, y_test)
        exp.log_metric("accuracy", accuracy)
        exp.log_note(f"Baseline: {strategy}")
# Generates baseline comparison report
```
### Example 2: Hyperparameter Grid Search
```python
from sklearn.model_selection import GridSearchCV
from specweave import track_grid_search
param_grid = {
"n_estimators": [100, 200, 500],
"max_depth": [3, 6, 9]
}
# Automatically logs all combinations
best_model, results = track_grid_search(
XGBClassifier(),
param_grid,
X_train,
y_train,
increment="0042"
)
# Creates visualization of parameter importance
```
### Example 3: Model Comparison
```python
from specweave import compare_models
models = {
"xgboost": XGBClassifier(),
"lightgbm": LGBMClassifier(),
"random-forest": RandomForestClassifier()
}
# Trains and compares all models
comparison = compare_models(
models,
X_train,
y_train,
X_test,
y_test,
increment="0042"
)
# Generates markdown comparison table
```
## Tool Compatibility
### MLflow
```python
# Option 1: Pure MLflow (auto-configured)
import mlflow
mlflow.set_tracking_uri(".specweave/increments/0042.../experiments")
# Option 2: SpecWeave wrapper (recommended)
from specweave import mlflow as sw_mlflow
with sw_mlflow.start_run("xgboost"):
# Logs to both MLflow and increment docs
pass
```
### Weights & Biases
```python
# Option 1: Pure wandb
import wandb
wandb.init(project="0042-recommendation-model")
# Option 2: SpecWeave wrapper (recommended)
from specweave import wandb as sw_wandb
run = sw_wandb.init(increment="0042", name="xgboost")
# Syncs to increment folder + W&B dashboard
```
### TensorBoard
```python
from specweave import TensorBoardCallback
# Keras callback
model.fit(
X_train,
y_train,
callbacks=[
TensorBoardCallback(
increment="0042",
log_dir=".specweave/increments/0042.../tensorboard"
)
]
)
```
## Commands
```bash
# List all experiments in increment
/ml:list-experiments 0042
# Compare experiments
/ml:compare-experiments 0042
# Load experiment details
/ml:show-experiment exp-003-xgboost
# Export experiment data
/ml:export-experiments 0042 --format csv
```
## Tips
1. **Start tracking early** - Track from first experiment, not after 20 failed attempts
2. **Tag production models** - `exp.add_tag("production")` for deployed models
3. **Version everything** - Data, code, environment, dependencies
4. **Document decisions** - Why model A over model B (not just metrics)
5. **Prune old experiments** - Archive experiments >6 months old
## Advanced: Multi-Stage Experiments
For complex pipelines with multiple stages:
```python
from specweave import ExperimentPipeline
pipeline = ExperimentPipeline("recommendation-full-pipeline")
# Stage 1: Data preprocessing
with pipeline.stage("preprocessing") as stage:
stage.log_metric("rows_before", len(df))
df_clean = preprocess(df)
stage.log_metric("rows_after", len(df_clean))
# Stage 2: Feature engineering
with pipeline.stage("features") as stage:
features = engineer_features(df_clean)
stage.log_metric("num_features", features.shape[1])
# Stage 3: Model training
with pipeline.stage("training") as stage:
model = train_model(features)
stage.log_metric("accuracy", accuracy)
# Logs entire pipeline with stage dependencies
```
## Integration Points
- **ml-pipeline-orchestrator**: Auto-tracks experiments during pipeline execution
- **model-evaluator**: Uses experiment data for model comparison
- **ml-engineer agent**: Reviews experiment results and suggests improvements
- **Living docs**: Syncs experiment findings to architecture docs
This skill ensures ML experimentation is never lost, always reproducible, and well-documented.

@@ -0,0 +1,566 @@
---
name: feature-engineer
description: |
Comprehensive feature engineering for ML pipelines: data quality assessment, feature creation, selection, transformation, and encoding. Activates for "feature engineering", "create features", "feature selection", "data preprocessing", "handle missing values", "encode categorical", "scale features", "feature importance". Ensures features are production-ready with automated validation, documentation, and integration with SpecWeave increments.
---
# Feature Engineer
## Overview
Feature engineering often makes the difference between mediocre and excellent ML models. This skill transforms raw data into model-ready features through systematic data quality assessment, feature creation, selection, and transformation—all integrated with SpecWeave's increment workflow.
## The Feature Engineering Pipeline
### Phase 1: Data Quality Assessment
**Before creating features, understand your data**:
```python
from specweave import DataQualityReport
# Automated data quality check
report = DataQualityReport(df, increment="0042")
# Generates:
# - Missing value analysis
# - Outlier detection
# - Data type validation
# - Distribution analysis
# - Correlation matrix
# - Duplicate detection
```
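For quick ad-hoc checks outside the wrapper, a plain-pandas version of the same summary might look like this:
```python
import pandas as pd

def quick_quality_summary(df: pd.DataFrame) -> None:
    """Print a rough data-quality summary for a DataFrame."""
    print(f"Rows: {len(df):,}  Columns: {df.shape[1]}")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    print(f"Duplicate rows: {df.duplicated().sum():,}")

    # Missing values per column (worst first)
    missing = df.isna().mean().sort_values(ascending=False)
    print("\nMissing value share:")
    print(missing[missing > 0])

    # Simple z-score outlier count for numeric columns
    numeric = df.select_dtypes("number")
    z_scores = (numeric - numeric.mean()) / numeric.std()
    print("\nValues beyond 3 standard deviations:")
    print((z_scores.abs() > 3).sum())
```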
**Quality Report Output**:
```markdown
# Data Quality Report
## Dataset Overview
- Rows: 100,000
- Columns: 45
- Memory: 34.2 MB
## Missing Values
| Column | Missing | Percentage |
|-----------------|---------|------------|
| email | 15,234 | 15.2% |
| phone | 8,901 | 8.9% |
| purchase_date | 0 | 0.0% |
## Outliers Detected
- transaction_amount: 234 outliers (>3 std dev)
- user_age: 12 outliers (<18 or >100)
## Data Type Issues
- user_id: Stored as float, should be int
- date_joined: Stored as string, should be datetime
## Recommendations
1. Impute email/phone or create "missing" indicator features
2. Cap/remove outliers in transaction_amount
3. Convert data types for efficiency
```
### Phase 2: Feature Creation
**Create features from domain knowledge**:
```python
from specweave import FeatureCreator
creator = FeatureCreator(df, increment="0042")
# Temporal features (from datetime)
creator.add_temporal_features(
date_column="purchase_date",
features=["hour", "day_of_week", "month", "is_weekend", "is_holiday"]
)
# Aggregation features (user behavior)
creator.add_aggregation_features(
group_by="user_id",
target="purchase_amount",
aggs=["mean", "std", "count", "min", "max"]
)
# Creates: user_purchase_amount_mean, user_purchase_amount_std, etc.
# Interaction features
creator.add_interaction_features(
features=[("age", "income"), ("clicks", "impressions")],
operations=["multiply", "divide", "subtract"]
)
# Creates: age_x_income, clicks_per_impression, etc.
# Ratio features
creator.add_ratio_features([
("revenue", "cost"),
("conversions", "visits")
])
# Creates: revenue_to_cost_ratio, conversion_rate
# Binning (discretization)
creator.add_binned_features(
column="age",
bins=[0, 18, 25, 35, 50, 65, 100],
labels=["child", "young_adult", "adult", "middle_aged", "senior", "elderly"]
)
# Text features (from text columns)
creator.add_text_features(
column="product_description",
features=["length", "word_count", "unique_words", "sentiment"]
)
# Generate all features
df_enriched = creator.generate()
# Auto-documents in increment folder
creator.save_feature_definitions(
path=".specweave/increments/0042.../features/feature_definitions.yaml"
)
```
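The temporal, aggregation, interaction, and ratio features above correspond to a few lines of plain pandas; a sketch using the illustrative column names from this example:
```python
import pandas as pd

# Temporal features from a datetime column
df["purchase_date"] = pd.to_datetime(df["purchase_date"])
df["purchase_hour"] = df["purchase_date"].dt.hour
df["purchase_day_of_week"] = df["purchase_date"].dt.dayofweek
df["is_weekend"] = df["purchase_day_of_week"].isin([5, 6]).astype(int)

# Per-user aggregation features, merged back onto each row
user_stats = (
    df.groupby("user_id")["purchase_amount"]
      .agg(["mean", "std", "count"])
      .add_prefix("user_purchase_amount_")
      .reset_index()
)
df = df.merge(user_stats, on="user_id", how="left")

# Interaction and ratio features
df["age_x_income"] = df["age"] * df["income"]
df["conversion_rate"] = df["conversions"] / df["visits"].where(df["visits"] > 0)
```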
**Feature Definitions** (auto-generated):
```yaml
# .specweave/increments/0042.../features/feature_definitions.yaml
features:
- name: purchase_hour
type: temporal
source: purchase_date
description: Hour of purchase (0-23)
- name: user_purchase_amount_mean
type: aggregation
source: purchase_amount
group_by: user_id
description: Average purchase amount per user
- name: age_x_income
type: interaction
sources: [age, income]
operation: multiply
description: Product of age and income
- name: conversion_rate
type: ratio
sources: [conversions, visits]
description: Conversion rate (conversions / visits)
```
### Phase 3: Feature Selection
**Reduce dimensionality, improve performance**:
```python
from specweave import FeatureSelector
selector = FeatureSelector(X_train, y_train, increment="0042")
# Method 1: Correlation-based (remove redundant features)
selector.remove_correlated_features(threshold=0.95)
# Removes features with >95% correlation
# Method 2: Variance-based (remove constant features)
selector.remove_low_variance_features(threshold=0.01)
# Removes features with <1% variance
# Method 3: Statistical tests
selector.select_by_statistical_test(k=50)
# SelectKBest with chi2/f_classif
# Method 4: Model-based (tree importance)
from sklearn.ensemble import RandomForestClassifier
selector.select_by_model_importance(
    model=RandomForestClassifier(),
    threshold=0.01
)
# Removes features with <1% importance
# Method 5: Recursive Feature Elimination
from sklearn.linear_model import LogisticRegression
selector.select_by_rfe(
    model=LogisticRegression(),
    n_features=30
)
# Get selected features
selected_features = selector.get_selected_features()
# Generate selection report
selector.generate_report()
```
**Feature Selection Report**:
```markdown
# Feature Selection Report
## Original Features: 125
## Selected Features: 35 (72% reduction)
## Selection Process
1. Removed 12 correlated features (>95% correlation)
2. Removed 8 low-variance features
3. Statistical test: Selected top 50 (chi-squared)
4. Model importance: Removed 15 low-importance features (<1%)
## Top 10 Features (by importance)
1. user_purchase_amount_mean (0.18)
2. days_since_last_purchase (0.12)
3. total_purchases (0.10)
4. age_x_income (0.08)
5. conversion_rate (0.07)
...
## Removed Features
- user_id_hash (constant)
- temp_feature_1 (99% correlated with temp_feature_2)
- random_noise (0% importance)
...
```
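The selection steps above can also be approximated with plain scikit-learn. A minimal sketch on synthetic data (thresholds and `k` are illustrative; the `FeatureSelector` wrapper adds correlation pruning and report generation on top):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, VarianceThreshold, f_classif

X, y = make_classification(n_samples=500, n_features=40, n_informative=8, random_state=0)

# 1. Drop near-constant features
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# 2. Keep the top-k features by univariate F-test
X_kbest = SelectKBest(score_func=f_classif, k=20).fit_transform(X_var, y)

# 3. Model-based selection: keep features above an importance threshold
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), threshold=0.01)
X_selected = selector.fit_transform(X_kbest, y)
print(X.shape, "->", X_selected.shape)
```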
### Phase 4: Feature Transformation
**Scale, normalize, encode for model compatibility**:
```python
from specweave import FeatureTransformer
transformer = FeatureTransformer(increment="0042")
# Numerical transformations
transformer.add_numerical_transformer(
columns=["age", "income", "purchase_amount"],
method="standard_scaler" # Or: min_max, robust, quantile
)
# Categorical encoding
transformer.add_categorical_encoder(
columns=["country", "device_type", "product_category"],
method="onehot", # Or: label, target, binary
handle_unknown="ignore"
)
# Ordinal encoding (for ordered categories)
transformer.add_ordinal_encoder(
column="education",
order=["high_school", "bachelors", "masters", "phd"]
)
# Log transformation (for skewed distributions)
transformer.add_log_transform(
columns=["transaction_amount", "page_views"],
method="log1p" # log(1 + x) to handle zeros
)
# Box-Cox transformation (for normalization)
transformer.add_power_transform(
columns=["revenue", "engagement_score"],
method="box-cox"
)
# Custom transformation (expects a pandas Series; clips to the 1st-99th percentile)
import numpy as np

def clip_outliers(x):
    return np.clip(x, x.quantile(0.01), x.quantile(0.99))

transformer.add_custom_transformer(
    columns=["outlier_prone_feature"],
    func=clip_outliers
)
# Fit and transform
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
# Save transformer pipeline
transformer.save(
path=".specweave/increments/0042.../features/transformer.pkl"
)
```
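Under the hood this maps onto a standard scikit-learn `ColumnTransformer`. A minimal sketch with assumed column names; the key point is that `fit` happens on the training split only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training frame (column names assumed)
X_train = pd.DataFrame({
    "age": [25, 41, 33],
    "income": [40_000, 85_000, 62_000],
    "country": ["US", "DE", "US"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

# Fit on the training split only; reuse the fitted transformer for test/production data
X_train_t = preprocess.fit_transform(X_train)
print(X_train_t.shape)
```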
### Phase 5: Feature Validation
**Ensure features are production-ready**:
```python
from specweave import FeatureValidator
validator = FeatureValidator(
X_train, X_test,
increment="0042"
)
# Check for data leakage
leakage_report = validator.check_data_leakage()
# Detects: perfectly correlated features, future data in training
# Check for distribution drift
drift_report = validator.check_distribution_drift()
# Compares train vs test distributions
# Check for missing values after transformation
missing_report = validator.check_missing_values()
# Check for infinite/NaN values
invalid_report = validator.check_invalid_values()
# Generate validation report
validator.generate_report()
```
**Validation Report**:
```markdown
# Feature Validation Report
## Data Leakage: ✅ PASS
No target leakage detected: no feature is perfectly correlated with the target and no test-period information appears in the training features.
## Distribution Drift: ⚠️ WARNING
Features with significant drift (KS test p < 0.05):
- user_age: p=0.023 (minor drift)
- device_type: p=0.001 (major drift)
Recommendation: Check whether the test data comes from a different time period.
## Missing Values: ✅ PASS
No missing values after transformation.
## Invalid Values: ✅ PASS
No infinite or NaN values detected.
## Overall: READY FOR TRAINING
2 warnings, 0 critical issues.
```
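The distribution-drift check behind the warning above reduces to a two-sample test per feature. A minimal sketch with SciPy's Kolmogorov-Smirnov test on synthetic data (the 0.05 threshold is the same one used in the report):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_ages = rng.normal(35, 10, size=5_000)
test_ages = rng.normal(37, 10, size=1_000)  # slightly shifted distribution

statistic, p_value = ks_2samp(train_ages, test_ages)
if p_value < 0.05:
    print(f"Drift detected in user_age (KS statistic={statistic:.3f}, p={p_value:.3f})")
```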
## Integration with SpecWeave
### Automatic Feature Documentation
```python
# All feature engineering steps logged to increment
with track_experiment("feature-engineering-v1", increment="0042") as exp:
# Create features
df_enriched = creator.generate()
# Select features
selected = selector.select()
# Transform features
X_transformed = transformer.fit_transform(X)
# Validate
validation = validator.validate()
# Auto-logs:
exp.log_param("original_features", 125)
exp.log_param("created_features", 45)
exp.log_param("selected_features", 35)
exp.log_metric("feature_reduction", 0.72)
exp.save_artifact("feature_definitions.yaml")
exp.save_artifact("transformer.pkl")
exp.save_artifact("validation_report.md")
```
### Living Docs Integration
After completing feature engineering:
```bash
/specweave:sync-docs update
```
Updates:
```markdown
<!-- .specweave/docs/internal/architecture/feature-engineering.md -->
## Recommendation Model Features (Increment 0042)
### Feature Engineering Pipeline
1. Data Quality: 100K rows, 45 columns
2. Created: 45 new features (temporal, aggregation, interaction)
3. Selected: 35 features (72% reduction via importance + RFE)
4. Transformed: StandardScaler for numerical, OneHot for categorical
### Key Features
- user_purchase_amount_mean: Average user spend (top feature, 18% importance)
- days_since_last_purchase: Recency indicator (12% importance)
- age_x_income: Interaction feature (8% importance)
### Feature Store
All features documented in: `.specweave/increments/0042.../features/`
- feature_definitions.yaml: Feature catalog
- transformer.pkl: Production transformation pipeline
- validation_report.md: Quality checks
```
## Best Practices
### 1. Document Feature Rationale
```python
# Bad: Create features without explanation
df["feature_1"] = df["col_a"] * df["col_b"]
# Good: Document why features were created
creator.add_interaction_feature(
sources=["age", "income"],
operation="multiply",
rationale="High-income older users have different behavior patterns"
)
```
### 2. Handle Missing Values Systematically
```python
# Options for missing values:
# 1. Imputation (mean, median, mode)
creator.impute_missing(column="age", strategy="median")
# 2. Indicator features (flag missing as signal)
creator.add_missing_indicator(column="email")
# Creates: email_missing (0/1)
# 3. Forward/backward fill (for time series)
creator.fill_missing(column="sensor_reading", method="ffill")
# 4. Model-based imputation
from sklearn.ensemble import RandomForestRegressor
creator.impute_with_model(column="income", model=RandomForestRegressor())
```
### 3. Avoid Data Leakage
```python
# ❌ WRONG: Fit on all data (includes test set!)
scaler.fit(X)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# ✅ CORRECT: Fit only on train, transform both
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# SpecWeave's transformer enforces this pattern
transformer.fit_transform(X_train) # Fits
transformer.transform(X_test) # Only transforms
```
### 4. Version Feature Engineering Pipeline
```python
# Version features with increment
transformer.save(
path=".specweave/increments/0042.../features/transformer-v1.pkl",
metadata={
"version": "v1",
"features": selected_features,
"transformations": ["standard_scaler", "onehot"]
}
)
# Load specific version for reproducibility
transformer_v1 = FeatureTransformer.load(
".specweave/increments/0042.../features/transformer-v1.pkl"
)
```
### 5. Test Feature Engineering on New Data
```python
# Before deploying, test on held-out data
X_production_sample = load_production_data()
try:
X_transformed = transformer.transform(X_production_sample)
except Exception as e:
raise FeatureEngineeringError(f"Failed on production data: {e}")
# Check for unexpected values
validator = FeatureValidator(X_train, X_production_sample)
validation_report = validator.validate()
if validation_report["status"] == "CRITICAL":
raise FeatureEngineeringError("Feature engineering failed validation")
```
## Common Feature Engineering Patterns
### Pattern 1: RFM (Recency, Frequency, Monetary)
```python
# For e-commerce / customer analytics
creator.add_rfm_features(
user_id="user_id",
transaction_date="purchase_date",
transaction_amount="purchase_amount"
)
# Creates:
# - recency: days since last purchase
# - frequency: total purchases
# - monetary: total spend
```
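If you want to see what the helper computes, here is a minimal plain-pandas sketch of RFM, assuming a transactions table with the same three columns:

```python
import pandas as pd

tx = pd.DataFrame({
    "user_id": [1, 1, 2],
    "purchase_date": pd.to_datetime(["2024-01-02", "2024-01-20", "2024-01-10"]),
    "purchase_amount": [50.0, 20.0, 75.0],
})
as_of = tx["purchase_date"].max()  # reference date for recency

rfm = tx.groupby("user_id").agg(
    recency=("purchase_date", lambda d: (as_of - d.max()).days),
    frequency=("purchase_date", "count"),
    monetary=("purchase_amount", "sum"),
)
print(rfm)
```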
### Pattern 2: Rolling Window Aggregations
```python
# For time series
creator.add_rolling_features(
column="daily_sales",
windows=[7, 14, 30],
aggs=["mean", "std", "min", "max"]
)
# Creates: daily_sales_7day_mean, daily_sales_7day_std, etc.
```
### Pattern 3: Target Encoding (Categorical → Numerical)
```python
# Encode categorical as target mean (careful: can leak!)
creator.add_target_encoding(
column="product_category",
target="purchase_amount",
cv_folds=5 # Cross-validation to prevent leakage
)
# Creates: product_category_target_encoded
```
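The cross-validated variant can be sketched with out-of-fold target means; computing each row's encoding only from the other folds is what keeps the encoded column from leaking the target:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "product_category": ["a", "b", "a", "b", "a", "c", "c", "b"],
    "purchase_amount": [10, 50, 12, 60, 9, 30, 25, 55],
})

df["category_target_encoded"] = np.nan
for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Means computed on the training folds only, applied to the held-out fold
    fold_means = df.iloc[train_idx].groupby("product_category")["purchase_amount"].mean()
    df.loc[df.index[val_idx], "category_target_encoded"] = (
        df.iloc[val_idx]["product_category"].map(fold_means)
    )

# Categories unseen in a fold's training portion fall back to the global mean
df["category_target_encoded"] = df["category_target_encoded"].fillna(df["purchase_amount"].mean())
print(df)
```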
### Pattern 4: Polynomial Features
```python
# For non-linear relationships
creator.add_polynomial_features(
columns=["age", "income"],
degree=2,
interaction_only=True
)
# Creates: age^2, income^2, age*income
```
## Commands
```bash
# Generate feature engineering pipeline for increment
/ml:engineer-features 0042
# Validate features before training
/ml:validate-features 0042
# Generate feature importance report
/ml:feature-importance 0042
```
## Integration with Other Skills
- **ml-pipeline-orchestrator**: Task 2 is "Feature Engineering" (uses this skill)
- **experiment-tracker**: Logs all feature engineering experiments
- **model-evaluator**: Uses feature importance from models
- **ml-deployment-helper**: Packages feature transformer for production
## Summary
Feature engineering is often the single biggest lever on model quality. This skill ensures:
- ✅ Systematic approach (quality → create → select → transform → validate)
- ✅ No data leakage (train/test separation enforced)
- ✅ Production-ready (versioned, validated, documented)
- ✅ Reproducible (all steps tracked in increment)
- ✅ Traceable (feature definitions in living docs)
Good features make a mediocre model strong; great features make it excellent.


@@ -0,0 +1,345 @@
---
name: ml-deployment-helper
description: |
Prepares ML models for production deployment with containerization, API creation, monitoring setup, and A/B testing. Activates for "deploy model", "production deployment", "model API", "containerize model", "docker ml", "serving ml model", "model monitoring", "A/B test model". Generates deployment artifacts and ensures models are production-ready with monitoring, versioning, and rollback capabilities.
---
# ML Deployment Helper
## Overview
Bridges the gap between trained models and production systems. Generates deployment artifacts, APIs, monitoring, and A/B testing infrastructure following MLOps best practices.
## Deployment Checklist
Before deploying any model, this skill ensures:
- ✅ Model versioned and tracked
- ✅ Dependencies documented (requirements.txt/Dockerfile)
- ✅ API endpoint created
- ✅ Input validation implemented
- ✅ Monitoring configured
- ✅ A/B testing ready
- ✅ Rollback plan documented
- ✅ Performance benchmarked
## Deployment Patterns
### Pattern 1: REST API (FastAPI)
```python
from specweave import create_model_api
# Generates production-ready API
api = create_model_api(
model_path="models/model-v3.pkl",
increment="0042",
framework="fastapi"
)
# Creates:
# - api/
# ├── main.py (FastAPI app)
# ├── models.py (Pydantic schemas)
# ├── predict.py (Prediction logic)
# ├── Dockerfile
# ├── requirements.txt
# └── tests/
```
Generated `main.py`:
```python
from datetime import datetime

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib

app = FastAPI(title="Recommendation Model API", version="0042-v3")
model = joblib.load("model-v3.pkl")
class PredictionRequest(BaseModel):
user_id: int
context: dict
@app.post("/predict")
async def predict(request: PredictionRequest):
try:
prediction = model.predict([request.dict()])
return {
"recommendations": prediction.tolist(),
"model_version": "0042-v3",
"timestamp": datetime.now()
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health():
return {"status": "healthy", "model_loaded": model is not None}
```
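Once the service is running (for example `uvicorn api.main:app --port 8000`, as in the generated Dockerfile), calling it is a plain HTTP POST. A hedged sketch with `requests`; the payload fields mirror the `PredictionRequest` schema above:

```python
import requests

payload = {"user_id": 12345, "context": {"device": "mobile", "country": "US"}}
response = requests.post("http://localhost:8000/predict", json=payload, timeout=5)
response.raise_for_status()
print(response.json())  # {"recommendations": [...], "model_version": "0042-v3", ...}
```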
### Pattern 2: Batch Prediction
```python
from specweave import create_batch_predictor
# For offline scoring
batch_predictor = create_batch_predictor(
model_path="models/model-v3.pkl",
increment="0042",
input_path="s3://bucket/data/",
output_path="s3://bucket/predictions/"
)
# Creates:
# - batch/
# ├── predictor.py
# ├── scheduler.yaml (Airflow/Kubernetes CronJob)
# └── monitoring.py
```
### Pattern 3: Real-Time Streaming
```python
from specweave import create_streaming_predictor
# For Kafka/Kinesis streams
streaming = create_streaming_predictor(
model_path="models/model-v3.pkl",
increment="0042",
input_topic="user-events",
output_topic="predictions"
)
# Creates:
# - streaming/
# ├── consumer.py
# ├── predictor.py
# ├── producer.py
# └── docker-compose.yaml
```
## Containerization
```python
from specweave import containerize_model
# Generates optimized Dockerfile
dockerfile = containerize_model(
model_path="models/model-v3.pkl",
framework="sklearn",
python_version="3.10",
increment="0042"
)
```
Generated `Dockerfile`:
```dockerfile
FROM python:3.10-slim
WORKDIR /app
# Copy model and dependencies
COPY models/model-v3.pkl /app/model.pkl
COPY requirements.txt /app/
# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY api/ /app/api/
# Health check
HEALTHCHECK --interval=30s --timeout=3s \
CMD curl -f http://localhost:8000/health || exit 1
# Run API
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
## Monitoring Setup
```python
from specweave import setup_model_monitoring
# Configures monitoring for production
monitoring = setup_model_monitoring(
model_name="recommendation-model",
increment="0042",
metrics=[
"prediction_latency",
"throughput",
"error_rate",
"prediction_distribution",
"feature_drift"
]
)
# Creates:
# - monitoring/
# ├── prometheus.yaml
# ├── grafana-dashboard.json
# ├── alerts.yaml
# └── drift-detector.py
```
## A/B Testing Infrastructure
```python
from specweave import create_ab_test
# Sets up A/B test framework
ab_test = create_ab_test(
control_model="model-v2.pkl",
treatment_model="model-v3.pkl",
traffic_split=0.1, # 10% to new model
success_metric="click_through_rate",
increment="0042"
)
# Creates:
# - ab-test/
# ├── router.py (traffic splitting)
# ├── metrics.py (success tracking)
# ├── statistical-tests.py (significance testing)
# └── dashboard.py (real-time monitoring)
```
A/B Test Router:
```python
import hashlib

def route_prediction(user_id, features, control_model, treatment_model, treatment_pct=10):
    """Route to control or treatment based on a stable hash of user_id."""
    # Deterministic bucketing: the same user always gets the same model,
    # even across processes (the built-in hash() is salted per process)
    user_bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    if user_bucket < treatment_pct:  # default: 10% to treatment
        return treatment_model.predict(features), "treatment"
    return control_model.predict(features), "control"
```
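The significance check in the generated `statistical-tests.py` typically reduces to a two-proportion test on the success metric. A minimal sketch with statsmodels (the click and impression counts are purely illustrative):

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative click counts and impressions for control vs treatment
clicks = [1_150, 1_290]
impressions = [100_000, 100_000]

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z={z_stat:.2f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Treatment CTR differs significantly from control")
```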
## Model Versioning
```python
from specweave import ModelVersion
# Register model version
version = ModelVersion.register(
model_path="models/model-v3.pkl",
increment="0042",
metadata={
"accuracy": 0.87,
"training_date": "2024-01-15",
"data_version": "v2024-01",
"framework": "xgboost==1.7.0"
}
)
# Easy rollback
if production_metrics["error_rate"] > threshold:
ModelVersion.rollback(to_version="0042-v2")
```
## Load Testing
```python
from specweave import load_test_model
# Benchmark model performance
results = load_test_model(
api_url="http://localhost:8000/predict",
requests_per_second=[10, 50, 100, 500, 1000],
duration_seconds=60,
increment="0042"
)
```
Output:
```
Load Test Results:
==================
| RPS | Latency P50 | Latency P95 | Latency P99 | Error Rate |
|------|-------------|-------------|-------------|------------|
| 10 | 35ms | 45ms | 50ms | 0.00% |
| 50 | 38ms | 52ms | 65ms | 0.00% |
| 100 | 45ms | 70ms | 95ms | 0.02% |
| 500 | 120ms | 250ms | 400ms | 1.20% |
| 1000 | 350ms | 800ms | 1200ms | 8.50% |
Recommendation: Deploy with max 100 RPS per instance
Target: <100ms P95 latency (achieved at 100 RPS)
```
## Deployment Commands
```bash
# Generate deployment artifacts
/ml:deploy-prepare 0042
# Create API
/ml:create-api --increment 0042 --framework fastapi
# Setup monitoring
/ml:setup-monitoring 0042
# Create A/B test
/ml:create-ab-test --control v2 --treatment v3 --split 0.1
# Load test
/ml:load-test 0042 --rps 100 --duration 60s
# Deploy to production
/ml:deploy 0042 --environment production
```
## Deployment Increment
The skill creates a deployment increment:
```
.specweave/increments/0043-deploy-recommendation-model/
├── spec.md (deployment requirements)
├── plan.md (deployment strategy)
├── tasks.md
│ ├── [ ] Containerize model
│ ├── [ ] Create API
│ ├── [ ] Setup monitoring
│ ├── [ ] Configure A/B test
│ ├── [ ] Load test
│ ├── [ ] Deploy to staging
│ ├── [ ] Validate staging
│ └── [ ] Deploy to production
├── api/ (FastAPI app)
├── monitoring/ (Grafana dashboards)
├── ab-test/ (A/B testing logic)
└── load-tests/ (Performance benchmarks)
```
## Best Practices
1. **Always load test** before production
2. **Start with 1-5% traffic** in A/B test
3. **Monitor model drift** in production
4. **Version everything** (model, data, code)
5. **Document rollback plan** before deploying
6. **Set up alerts** for anomalies
7. **Gradual rollout** (canary deployment)
## Integration with SpecWeave
```bash
# After training model (increment 0042)
/specweave:inc "0043-deploy-recommendation-model"
# Generates deployment increment with all artifacts
/specweave:do
# Deploy to production when ready
/ml:deploy 0043 --environment production
```
Model deployment is not the end—it's the beginning of the MLOps lifecycle.


@@ -0,0 +1,518 @@
---
name: ml-pipeline-orchestrator
description: |
Orchestrates complete machine learning pipelines within SpecWeave increments. Activates when users request "ML pipeline", "train model", "build ML system", "end-to-end ML", "ML workflow", "model training pipeline", or similar. Guides users through data preprocessing, feature engineering, model training, evaluation, and deployment using SpecWeave's spec-driven approach. Integrates with increment lifecycle for reproducible ML development.
---
# ML Pipeline Orchestrator
## Overview
This skill transforms ML development into a SpecWeave increment-based workflow, ensuring every ML project follows the same disciplined approach: spec → plan → tasks → implement → validate. It orchestrates the complete ML lifecycle from data exploration to model deployment, with full traceability and living documentation.
## Core Philosophy
**SpecWeave + ML = Disciplined Data Science**
Traditional ML development often lacks structure:
- ❌ Jupyter notebooks with no version control
- ❌ Experiments without documentation
- ❌ Models deployed with no reproducibility
- ❌ Team knowledge trapped in individual notebooks
SpecWeave brings discipline:
- ✅ Every ML feature is an increment (with spec, plan, tasks)
- ✅ Experiments tracked and documented automatically
- ✅ Model versions tied to increments
- ✅ Living docs capture learnings and decisions
## How It Works
### Phase 1: ML Increment Planning
When you request "build a recommendation model", the skill:
1. **Creates ML increment structure**:
```
.specweave/increments/0042-recommendation-model/
├── spec.md # ML requirements, success metrics
├── plan.md # Pipeline architecture
├── tasks.md # Implementation tasks
├── tests.md # Evaluation criteria
├── experiments/ # Experiment tracking
│ ├── exp-001-baseline/
│ ├── exp-002-xgboost/
│ └── exp-003-neural-net/
├── data/ # Data samples, schemas
│ ├── schema.yaml
│ └── sample.csv
├── models/ # Trained models
│ ├── model-v1.pkl
│ └── model-v2.pkl
└── notebooks/ # Exploratory notebooks
├── 01-eda.ipynb
└── 02-feature-engineering.ipynb
```
2. **Generates ML-specific spec** (spec.md):
```markdown
## ML Problem Definition
- Problem type: Recommendation (collaborative filtering)
- Input: User behavior history
- Output: Top-N product recommendations
- Success metrics: Precision@10 > 0.25, Recall@10 > 0.15
## Data Requirements
- Training data: 6 months user interactions
- Validation: Last month
- Features: User profile, product attributes, interaction history
## Model Requirements
- Latency: <100ms inference
- Throughput: 1000 req/sec
- Accuracy: Better than random baseline by 3x
- Explainability: Must explain top-3 recommendations
```
3. **Creates ML-specific tasks** (tasks.md):
```markdown
- [ ] T-001: Data exploration and quality analysis
- [ ] T-002: Feature engineering pipeline
- [ ] T-003: Train baseline model (random/popularity)
- [ ] T-004: Train candidate models (3 algorithms)
- [ ] T-005: Hyperparameter tuning (best model)
- [ ] T-006: Model evaluation (all metrics)
- [ ] T-007: Model explainability (SHAP/LIME)
- [ ] T-008: Production deployment preparation
- [ ] T-009: A/B test plan
```
### Phase 2: Pipeline Execution
The skill guides through each task with best practices:
#### Task 1: Data Exploration
```python
# Generated template with SpecWeave integration
import pandas as pd
import mlflow
from specweave import track_experiment
# Auto-logs to .specweave/increments/0042.../experiments/
with track_experiment("exp-001-eda") as exp:
df = pd.read_csv("data/interactions.csv")
# EDA
exp.log_param("dataset_size", len(df))
exp.log_metric("missing_values", df.isnull().sum().sum())
# Auto-generates report in increment folder
exp.save_report("eda-summary.md")
```
#### Task 3: Train Baseline
```python
from sklearn.dummy import DummyClassifier
from specweave import track_model
with track_model("baseline-random", increment="0042") as model:
clf = DummyClassifier(strategy="uniform")
clf.fit(X_train, y_train)
# Automatically logged to increment
model.log_metrics({
"accuracy": 0.12,
"precision@10": 0.08
})
model.save_artifact(clf, "baseline.pkl")
```
#### Task 4: Train Candidate Models
```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from specweave import ModelExperiment
# KerasModel and the params_* dicts are assumed to be defined elsewhere in the increment
# Parallel experiments with auto-tracking
experiments = [
ModelExperiment("xgboost", XGBClassifier, params_xgb),
ModelExperiment("lightgbm", LGBMClassifier, params_lgbm),
ModelExperiment("neural-net", KerasModel, params_nn)
]
results = run_experiments(
experiments,
increment="0042",
save_to="experiments/"
)
# Auto-generates comparison table in increment docs
```
### Phase 3: Increment Completion
When `/specweave:done 0042` runs:
1. **Validates ML-specific criteria**:
- ✅ All experiments logged
- ✅ Best model saved
- ✅ Evaluation metrics documented
- ✅ Model explainability artifacts present
2. **Generates completion summary**:
```markdown
## Recommendation Model - COMPLETE
### Experiments Run: 7
1. exp-001-baseline (random): precision@10=0.08
2. exp-002-popularity: precision@10=0.18
3. exp-003-xgboost: precision@10=0.26 ✅ BEST
4. exp-004-lightgbm: precision@10=0.24
5. exp-005-neural-net: precision@10=0.22
...
### Best Model
- Algorithm: XGBoost
- Version: model-v3.pkl
- Metrics: precision@10=0.26, recall@10=0.16
- Training time: 45 min
- Model size: 12 MB
### Deployment Ready
- ✅ Inference latency: 35ms (target: <100ms)
- ✅ Explainability: SHAP values computed
- ✅ A/B test plan documented
```
3. **Syncs living docs** (via `/specweave:sync-docs`):
- Updates architecture docs with model design
- Adds ADR for algorithm selection
- Documents learnings in runbooks
## When to Use This Skill
Activate this skill when you need to:
- **Build ML features end-to-end** - From idea to deployed model
- **Ensure reproducibility** - Every experiment tracked and documented
- **Follow ML best practices** - Baseline comparison, proper validation, explainability
- **Integrate ML with software engineering** - ML as increments, not isolated notebooks
- **Maintain team knowledge** - Living docs capture why decisions were made
## ML Pipeline Stages
### 1. Data Stage
- Data exploration (EDA)
- Data quality assessment
- Schema validation
- Sample data documentation
### 2. Feature Stage
- Feature engineering
- Feature selection
- Feature importance analysis
- Feature store integration (optional)
### 3. Training Stage
- Baseline model (random, rule-based)
- Candidate models (3+ algorithms)
- Hyperparameter tuning
- Cross-validation
### 4. Evaluation Stage
- Comprehensive metrics (accuracy, precision, recall, F1, AUC)
- Business metrics (latency, throughput)
- Model comparison (vs baseline, vs previous version)
- Error analysis
### 5. Explainability Stage
- Feature importance
- SHAP values
- LIME explanations
- Example predictions with rationale
### 6. Deployment Stage
- Model packaging
- Inference pipeline
- A/B test plan
- Monitoring setup
## Integration with SpecWeave Workflow
### With Experiment Tracking
```bash
# Start ML increment
/specweave:inc "0042-recommendation-model"
# Automatically integrates experiment tracking
# All MLflow/W&B logs saved to increment folder
```
### With Living Docs
```bash
# After training best model
/specweave:sync-docs update
# Automatically:
# - Updates architecture/ml-models.md
# - Adds ADR for algorithm choice
# - Documents hyperparameters in runbooks
```
### With GitHub Sync
```bash
# Create GitHub issue for model retraining
/specweave:github:create-issue "Retrain recommendation model with new data"
# Linked to increment 0042
# Issue tracks model performance over time
```
## Best Practices
### 1. Always Start with Baseline
```python
# Before training complex models, establish baseline
baseline_results = train_baseline_model(
strategies=["random", "popularity", "rule-based"]
)
# Requirement: New model must beat best baseline by 20%+
```
### 2. Use Cross-Validation
```python
# Never trust single train/test split
cv_scores = cross_val_score(model, X, y, cv=5)
exp.log_metric("cv_mean", cv_scores.mean())
exp.log_metric("cv_std", cv_scores.std())
```
### 3. Track Everything
```python
# Hyperparameters, metrics, artifacts, environment
exp.log_params(model.get_params())
exp.log_metrics({"accuracy": acc, "f1": f1})
exp.log_artifact("model.pkl")
exp.log_artifact("requirements.txt") # Reproducibility
```
### 4. Document Failures
```python
# Failed experiments are valuable learnings
with track_experiment("exp-006-failed-lstm") as exp:
# ... training fails ...
exp.log_note("FAILED: LSTM overfits badly, needs regularization")
exp.set_status("failed")
# This documents why LSTM wasn't chosen
```
### 5. Model Versioning
```python
# Tie model versions to increments
model_version = f"0042-v{iteration}"
mlflow.register_model(
f"runs:/{run_id}/model",
f"recommendation-model-{model_version}"
)
```
## Examples
### Example 1: Classification Pipeline
```bash
User: "Build a fraud detection model for transactions"
Skill creates increment 0051-fraud-detection with:
- spec.md: Binary classification, 99% precision target
- plan.md: Imbalanced data handling, threshold tuning
- tasks.md: 9 tasks from EDA to deployment
- experiments/: exp-001-baseline, exp-002-xgboost, etc.
Guides through:
1. EDA → identify class imbalance (0.1% fraud)
2. Baseline → random/majority (terrible results)
3. Candidates → XGBoost, LightGBM, Neural Net
4. Threshold tuning → optimize for precision
5. SHAP → explain high-risk predictions
6. Deploy → model + threshold + explainer
```
### Example 2: Regression Pipeline
```bash
User: "Predict customer lifetime value"
Skill creates increment 0063-ltv-prediction with:
- spec.md: Regression, RMSE < $50 target
- plan.md: Time-based validation, feature engineering
- tasks.md: Customer cohort analysis, feature importance
Key difference: Regression-specific evaluation (RMSE, MAE, R²)
```
### Example 3: Time Series Forecasting
```bash
User: "Forecast weekly sales for next 12 weeks"
Skill creates increment 0072-sales-forecasting with:
- spec.md: Time series, MAPE < 10% target
- plan.md: Seasonal decomposition, ARIMA vs Prophet
- tasks.md: Stationarity tests, residual analysis
Key difference: Time series validation (no random split)
```
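For the forecasting case, "no random split" maps directly onto an expanding-window splitter. A minimal scikit-learn sketch (window sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

weekly_sales = np.arange(104)  # two years of weekly observations, oldest first

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4, test_size=12).split(weekly_sales)):
    # Each fold trains on the past and validates on the following 12 weeks
    print(f"fold {fold}: train weeks 0-{train_idx[-1]}, test weeks {test_idx[0]}-{test_idx[-1]}")
```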
## Framework Support
This skill works with all major ML frameworks:
### Scikit-Learn
```python
from sklearn.ensemble import RandomForestClassifier
from specweave import track_sklearn_model
model = RandomForestClassifier(n_estimators=100)
with track_sklearn_model(model, increment="0042") as tracked:
tracked.fit(X_train, y_train)
tracked.evaluate(X_test, y_test)
```
### PyTorch
```python
import torch
from specweave import track_pytorch_model
model = NeuralNet()
with track_pytorch_model(model, increment="0042") as tracked:
    for epoch in range(epochs):
        loss = tracked.train_epoch(train_loader)  # assumed to return the epoch's loss
        tracked.log_metric(f"loss_epoch_{epoch}", loss)
```
### TensorFlow/Keras
```python
from tensorflow import keras
from specweave import KerasCallback
model = keras.Sequential([...])
model.fit(
X_train, y_train,
callbacks=[KerasCallback(increment="0042")]
)
```
### XGBoost/LightGBM
```python
import xgboost as xgb
from specweave import track_boosting_model
dtrain = xgb.DMatrix(X_train, label=y_train)
with track_boosting_model("xgboost", increment="0042") as tracked:
model = xgb.train(params, dtrain, callbacks=[tracked.callback])
```
## Integration Points
### With `experiment-tracker` skill
- Auto-detects MLflow/W&B in project
- Configures tracking URI to increment folder
- Syncs experiment metadata to increment docs
### With `model-evaluator` skill
- Generates comprehensive evaluation reports
- Compares models across experiments
- Highlights best model with confidence intervals
### With `feature-engineer` skill
- Generates feature engineering pipeline
- Documents feature importance
- Creates feature store schemas
### With `ml-engineer` agent
- Delegates complex ML decisions to specialized agent
- Reviews model architecture
- Suggests improvements based on results
## Skill Outputs
After running `/specweave:do` on an ML increment, you get:
```
.specweave/increments/0042-recommendation-model/
├── spec.md ✅
├── plan.md ✅
├── tasks.md ✅ (all completed)
├── COMPLETION-SUMMARY.md ✅
├── experiments/
│ ├── exp-001-baseline/
│ │ ├── metrics.json
│ │ ├── params.json
│ │ └── logs/
│ ├── exp-002-xgboost/ ✅ BEST
│ │ ├── metrics.json
│ │ ├── params.json
│ │ ├── model.pkl
│ │ └── shap_values.pkl
│ └── comparison.md
├── models/
│ ├── model-v3.pkl (best)
│ └── model-v3.metadata.json
├── data/
│ ├── schema.yaml
│ └── sample.parquet
└── notebooks/
├── 01-eda.ipynb
├── 02-feature-engineering.ipynb
└── 03-model-analysis.ipynb
```
## Commands
This skill integrates with SpecWeave commands:
```bash
# Create ML increment
/specweave:inc "build recommendation model"
→ Activates ml-pipeline-orchestrator
→ Creates ML-specific increment structure
# Execute ML tasks
/specweave:do
→ Guides through data → train → eval workflow
→ Auto-tracks experiments
# Validate ML increment
/specweave:validate 0042
→ Checks: experiments logged, model saved, metrics documented
→ Validates: model meets success criteria
# Complete ML increment
/specweave:done 0042
→ Generates ML completion summary
→ Syncs model metadata to living docs
```
## Tips
1. **Start simple** - Always begin with baseline, then iterate
2. **Track failures** - Document why approaches didn't work
3. **Version data** - Use DVC or similar for data versioning
4. **Reproducibility** - Log environment (requirements.txt, conda env)
5. **Incremental improvement** - Each increment improves on previous model
6. **Team collaboration** - Living docs make ML decisions visible to all
## Advanced: Multi-Increment ML Projects
For complex ML systems (e.g., recommendation system with multiple models):
```
0042-recommendation-data-pipeline
0043-recommendation-candidate-generation
0044-recommendation-ranking-model
0045-recommendation-reranking
0046-recommendation-ab-test
```
Each increment:
- Has its own spec, plan, tasks
- Builds on previous increments
- Documents model interactions
- Maintains system-level living docs


@@ -0,0 +1,249 @@
---
name: mlops-dag-builder
description: Design DAG-based MLOps pipeline architectures with Airflow, Dagster, Kubeflow, or Prefect. Activates for DAG orchestration, workflow automation, pipeline design patterns, CI/CD for ML. Use for platform-agnostic MLOps infrastructure - NOT for SpecWeave increment-based ML (use ml-pipeline-orchestrator instead).
---
# MLOps DAG Builder
Design and implement DAG-based ML pipeline architectures using production orchestration tools.
## Overview
This skill provides guidance for building **platform-agnostic MLOps pipelines** using DAG orchestrators (Airflow, Dagster, Kubeflow, Prefect). It focuses on workflow architecture, not SpecWeave integration.
**When to use this skill vs ml-pipeline-orchestrator:**
- **Use this skill**: General MLOps architecture, Airflow/Dagster DAGs, cloud ML platforms
- **Use ml-pipeline-orchestrator**: SpecWeave increment-based ML development with experiment tracking
## When to Use This Skill
- Designing DAG-based workflow orchestration (Airflow, Dagster, Kubeflow)
- Implementing platform-agnostic ML pipeline patterns
- Setting up CI/CD automation for ML training jobs
- Creating reusable pipeline templates for teams
- Integrating with cloud ML services (SageMaker, Vertex AI, Azure ML)
## What This Skill Provides
### Core Capabilities
1. **Pipeline Architecture**
- End-to-end workflow design
- DAG orchestration patterns (Airflow, Dagster, Kubeflow)
- Component dependencies and data flow
- Error handling and retry strategies
2. **Data Preparation**
- Data validation and quality checks
- Feature engineering pipelines
- Data versioning and lineage
- Train/validation/test splitting strategies
3. **Model Training**
- Training job orchestration
- Hyperparameter management
- Experiment tracking integration
- Distributed training patterns
4. **Model Validation**
- Validation frameworks and metrics
- A/B testing infrastructure
- Performance regression detection
- Model comparison workflows
5. **Deployment Automation**
- Model serving patterns
- Canary deployments
- Blue-green deployment strategies
- Rollback mechanisms
### Reference Documentation
See the `references/` directory for detailed guides:
- **data-preparation.md** - Data cleaning, validation, and feature engineering
- **model-training.md** - Training workflows and best practices
- **model-validation.md** - Validation strategies and metrics
- **model-deployment.md** - Deployment patterns and serving architectures
### Assets and Templates
The `assets/` directory contains:
- **pipeline-dag.yaml.template** - DAG template for workflow orchestration
- **training-config.yaml** - Training configuration template
- **validation-checklist.md** - Pre-deployment validation checklist
## Usage Patterns
### Basic Pipeline Setup
```python
# 1. Define pipeline stages
stages = [
"data_ingestion",
"data_validation",
"feature_engineering",
"model_training",
"model_validation",
"model_deployment"
]
# 2. Configure dependencies
# See assets/pipeline-dag.yaml.template for full example
```
### Production Workflow
1. **Data Preparation Phase**
- Ingest raw data from sources
- Run data quality checks
- Apply feature transformations
- Version processed datasets
2. **Training Phase**
- Load versioned training data
- Execute training jobs
- Track experiments and metrics
- Save trained models
3. **Validation Phase**
- Run validation test suite
- Compare against baseline
- Generate performance reports
- Approve for deployment
4. **Deployment Phase**
- Package model artifacts
- Deploy to serving infrastructure
- Configure monitoring
- Validate production traffic
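As a concrete illustration of the workflow above, here is a minimal Airflow sketch wiring the four phases into a single DAG. Airflow 2.4+ is assumed, the task bodies are stubs, and the stage names match the basic pipeline setup shown earlier:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def prepare_data():
    """Ingest raw data, run quality checks, engineer and version features."""

def train_model():
    """Load versioned training data, run the training job, track the experiment."""

def validate_model():
    """Run the validation suite and compare against the current baseline."""

def deploy_model():
    """Package artifacts, deploy to serving, configure monitoring."""


with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",
    catchup=False,
) as dag:
    prepare = PythonOperator(task_id="data_preparation", python_callable=prepare_data)
    train = PythonOperator(task_id="model_training", python_callable=train_model)
    validate = PythonOperator(task_id="model_validation", python_callable=validate_model)
    deploy = PythonOperator(task_id="model_deployment", python_callable=deploy_model)

    prepare >> train >> validate >> deploy
```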
## Best Practices
### Pipeline Design
- **Modularity**: Each stage should be independently testable
- **Idempotency**: Re-running stages should be safe
- **Observability**: Log metrics at every stage
- **Versioning**: Track data, code, and model versions
- **Failure Handling**: Implement retry logic and alerting
### Data Management
- Use data validation libraries (Great Expectations, TFX)
- Version datasets with DVC or similar tools
- Document feature engineering transformations
- Maintain data lineage tracking
### Model Operations
- Separate training and serving infrastructure
- Use model registries (MLflow, Weights & Biases)
- Implement gradual rollouts for new models
- Monitor model performance drift
- Maintain rollback capabilities
### Deployment Strategies
- Start with shadow deployments
- Use canary releases for validation
- Implement A/B testing infrastructure
- Set up automated rollback triggers
- Monitor latency and throughput
## Integration Points
### Orchestration Tools
- **Apache Airflow**: DAG-based workflow orchestration
- **Dagster**: Asset-based pipeline orchestration
- **Kubeflow Pipelines**: Kubernetes-native ML workflows
- **Prefect**: Modern dataflow automation
### Experiment Tracking
- MLflow for experiment tracking and model registry
- Weights & Biases for visualization and collaboration
- TensorBoard for training metrics
### Deployment Platforms
- AWS SageMaker for managed ML infrastructure
- Google Vertex AI for GCP deployments
- Azure ML for Azure cloud
- Kubernetes + KServe for cloud-agnostic serving
## Progressive Disclosure
Start with the basics and gradually add complexity:
1. **Level 1**: Simple linear pipeline (data → train → deploy)
2. **Level 2**: Add validation and monitoring stages
3. **Level 3**: Implement hyperparameter tuning
4. **Level 4**: Add A/B testing and gradual rollouts
5. **Level 5**: Multi-model pipelines with ensemble strategies
## Common Patterns
### Batch Training Pipeline
```yaml
# See assets/pipeline-dag.yaml.template
stages:
- name: data_preparation
dependencies: []
- name: model_training
dependencies: [data_preparation]
- name: model_evaluation
dependencies: [model_training]
- name: model_deployment
dependencies: [model_evaluation]
```
### Real-time Feature Pipeline
```python
# Stream processing for real-time features
# Combined with batch training
# See references/data-preparation.md
```
### Continuous Training
```python
# Automated retraining on schedule
# Triggered by data drift detection
# See references/model-training.md
```
## Troubleshooting
### Common Issues
- **Pipeline failures**: Check dependencies and data availability
- **Training instability**: Review hyperparameters and data quality
- **Deployment issues**: Validate model artifacts and serving config
- **Performance degradation**: Monitor data drift and model metrics
### Debugging Steps
1. Check pipeline logs for each stage
2. Validate input/output data at boundaries
3. Test components in isolation
4. Review experiment tracking metrics
5. Inspect model artifacts and metadata
## Next Steps
After setting up your pipeline:
1. Explore **hyperparameter-tuning** skill for optimization
2. Learn **experiment-tracking-setup** for MLflow/W&B
3. Review **model-deployment-patterns** for serving strategies
4. Implement monitoring with observability tools
## Related Skills
- **ml-pipeline-orchestrator**: SpecWeave-integrated ML development (use for increment-based ML)
- **experiment-tracker**: MLflow and Weights & Biases experiment tracking
- **automl-optimizer**: Automated hyperparameter optimization with Optuna/Hyperopt
- **ml-deployment-helper**: Model serving and deployment patterns


@@ -0,0 +1,155 @@
---
name: model-evaluator
description: |
Comprehensive ML model evaluation with multiple metrics, cross-validation, and statistical testing. Activates for "evaluate model", "model metrics", "model performance", "compare models", "validation metrics", "test accuracy", "precision recall", "ROC AUC". Generates detailed evaluation reports with visualizations and statistical significance tests, integrated with SpecWeave increment documentation.
---
# Model Evaluator
## Overview
Provides comprehensive, unbiased model evaluation following ML best practices. Goes beyond simple accuracy to evaluate models across multiple dimensions, ensuring confident deployment decisions.
## Core Evaluation Framework
### 1. Classification Metrics
- Accuracy, Precision, Recall, F1-score
- ROC AUC, PR AUC
- Confusion matrix
- Per-class metrics (for multi-class)
- Class imbalance handling
### 2. Regression Metrics
- RMSE, MAE, MAPE
- R² score, Adjusted R²
- Residual analysis
- Prediction interval coverage
### 3. Ranking Metrics (Recommendations)
- Precision@K, Recall@K
- NDCG@K, MAP@K
- MRR (Mean Reciprocal Rank)
- Coverage, Diversity
### 4. Statistical Validation
- Cross-validation (K-fold, stratified, time-series)
- Confidence intervals
- Statistical significance testing
- Calibration curves
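For reference, a minimal scikit-learn sketch covering the core classification metrics and cross-validation listed above (synthetic, imbalanced data; the `ModelEvaluator` wrapper adds confidence intervals, reports, and visualizations on top):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Point metrics on the held-out test set
print(classification_report(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# 5-fold cross-validation for a variance estimate
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"CV ROC AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```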
## Usage
```python
from specweave import ModelEvaluator
evaluator = ModelEvaluator(
model=trained_model,
X_test=X_test,
y_test=y_test,
increment="0042"
)
# Comprehensive evaluation
report = evaluator.evaluate_all()
# Generates:
# - .specweave/increments/0042.../evaluation-report.md
# - Visualizations (confusion matrix, ROC curves, etc.)
# - Statistical tests
```
## Evaluation Report Structure
```markdown
# Model Evaluation Report: XGBoost Classifier
## Overall Performance
- **Accuracy**: 0.87 ± 0.02 (95% CI: [0.85, 0.89])
- **ROC AUC**: 0.92 ± 0.01
- **F1 Score**: 0.85 ± 0.02
## Per-Class Performance
| Class | Precision | Recall | F1 | Support |
|---------|-----------|--------|------|---------|
| Class 0 | 0.88 | 0.85 | 0.86 | 1000 |
| Class 1 | 0.84 | 0.87 | 0.86 | 800 |
## Confusion Matrix
[Visualization embedded]
## Cross-Validation Results
- 5-fold CV accuracy: 0.86 ± 0.03
- Fold scores: [0.85, 0.88, 0.84, 0.87, 0.86]
- No overfitting detected (train=0.89, val=0.86, gap=0.03)
## Statistical Tests
- Comparison vs baseline: p=0.001 (highly significant)
- Comparison vs previous model: p=0.042 (significant)
## Recommendations
✅ Deploy: Model meets accuracy threshold (>0.85)
✅ Stable: Low variance across folds
⚠️ Monitor: Class 1 recall slightly lower (0.84)
```
## Model Comparison
```python
from specweave import compare_models
models = {
"baseline": baseline_model,
"xgboost": xgb_model,
"lightgbm": lgbm_model,
"neural-net": nn_model
}
comparison = compare_models(
models,
X_test,
y_test,
metrics=["accuracy", "auc", "f1"],
increment="0042"
)
```
**Output**:
```
Model Comparison Report
=======================
| Model | Accuracy | ROC AUC | F1 | Inference Time | Model Size |
|------------|----------|---------|------|----------------|------------|
| baseline | 0.65 | 0.70 | 0.62 | 1ms | 10KB |
| xgboost | 0.87 | 0.92 | 0.85 | 35ms | 12MB |
| lightgbm | 0.86 | 0.91 | 0.84 | 28ms | 8MB |
| neural-net | 0.85 | 0.90 | 0.83 | 120ms | 45MB |
Recommendation: XGBoost
- Best accuracy and AUC
- Acceptable inference time (<50ms requirement)
- Good size/performance tradeoff
```
## Best Practices
1. **Always compare to baseline** - Random, majority, rule-based
2. **Use cross-validation** - Never trust single split
3. **Check calibration** - Are probabilities meaningful? (see the sketch after this list)
4. **Analyze errors** - What types of mistakes?
5. **Test statistical significance** - Is improvement real?
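For point 3, a minimal calibration check with scikit-learn; the toy probabilities here are constructed to be well calibrated, so observed frequencies should track predicted ones:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1_000)                           # predicted probabilities
y_true = (rng.uniform(size=1_000) < y_prob).astype(int)    # labels drawn to match them

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(mean_pred, frac_pos):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```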
## Integration with SpecWeave
```bash
# Evaluate model in increment
/ml:evaluate-model 0042
# Compare all models in increment
/ml:compare-models 0042
# Generate full evaluation report
/ml:evaluation-report 0042
```
Evaluation results automatically included in increment COMPLETION-SUMMARY.md.


@@ -0,0 +1,227 @@
---
name: model-explainer
description: |
Model interpretability and explainability using SHAP, LIME, feature importance, and partial dependence plots. Activates for "explain model", "model interpretability", "SHAP", "LIME", "feature importance", "why prediction", "model explanation". Generates human-readable explanations for model predictions, critical for trust, debugging, and regulatory compliance.
---
# Model Explainer
## Overview
Makes black-box models interpretable. Explains why models make specific predictions, which features matter most, and how features interact. Critical for trust, debugging, and regulatory compliance.
## Why Explainability Matters
- **Trust**: Stakeholders trust models they understand
- **Debugging**: Find model weaknesses and biases
- **Compliance**: GDPR, fair lending laws require explanations
- **Improvement**: Understand what to improve
- **Safety**: Detect when model might fail
## Explanation Types
### 1. Global Explanations (Model-Level)
**Feature Importance**:
```python
from specweave import explain_model
explainer = explain_model(
model=trained_model,
X_train=X_train,
increment="0042"
)
# Global feature importance
importance = explainer.feature_importance()
```
Output:
```
Top Features (Global):
1. transaction_amount (importance: 0.35)
2. user_history_days (importance: 0.22)
3. merchant_reputation (importance: 0.18)
4. time_since_last_transaction (importance: 0.15)
5. device_type (importance: 0.10)
```
**Partial Dependence Plots**:
```python
# How does feature affect prediction?
explainer.partial_dependence(feature="transaction_amount")
```
### 2. Local Explanations (Prediction-Level)
**SHAP Values**:
```python
# Explain single prediction
explanation = explainer.explain_prediction(X_sample)
```
Output:
```
Prediction: FRAUD (probability: 0.92)
Why?
+ transaction_amount=5000 → +0.45 (high amount increases fraud risk)
+ user_history_days=2 → +0.30 (new user increases risk)
+ merchant_reputation=low → +0.25 (suspicious merchant)
- time_since_last_transaction=1hr → -0.18 (recent activity normal)
Base prediction: 0.10
Final prediction: 0.92
```
**LIME Explanations**:
```python
# Local interpretable model
lime_exp = explainer.lime_explanation(X_sample)
```
## Usage in SpecWeave
```python
from specweave import ModelExplainer
# Create explainer
explainer = ModelExplainer(
model=model,
X_train=X_train,
feature_names=feature_names,
increment="0042"
)
# Generate all explanations
explainer.generate_all_reports()
# Creates:
# - feature-importance.png
# - shap-summary.png
# - pdp-plots/
# - local-explanations/
# - explainability-report.md
```
## Real-World Examples
### Example 1: Fraud Detection
```python
# Explain why transaction flagged as fraud
transaction = {
"amount": 5000,
"user_age_days": 2,
"merchant": "new_merchant_xyz"
}
explanation = explainer.explain(transaction)
print(explanation.to_text())
```
Output:
```
FRAUD ALERT (92% confidence)
Main factors:
1. Large transaction amount ($5000) - Very unusual for new users
2. Account only 2 days old - Fraud pattern
3. Merchant has low reputation score - Red flag
If this is legitimate:
- User should verify identity
- Merchant should be manually reviewed
```
### Example 2: Loan Approval
```python
# Explain loan rejection
applicant = {
"income": 45000,
"credit_score": 620,
"debt_ratio": 0.45
}
explanation = explainer.explain(applicant)
print(explanation.to_text())
```
Output:
```
LOAN DENIED
Main reasons:
1. Credit score (620) below threshold (650) - Primary factor
2. High debt-to-income ratio (45%) - Risk indicator
3. Income ($45k) adequate but not strong
To improve approval chances:
- Increase credit score by 30+ points
- Reduce debt-to-income ratio below 40%
```
## Regulatory Compliance
### GDPR "Right to Explanation"
```python
# Generate GDPR-compliant explanation
gdpr_explanation = explainer.gdpr_explanation(prediction)
# Includes:
# - Decision rationale
# - Data used
# - How to contest decision
# - Impact of features
```
### Fair Lending Act
```python
# Check for bias in protected attributes
bias_report = explainer.fairness_report(
sensitive_features=["gender", "race", "age"]
)
# Detects:
# - Disparate impact
# - Feature bias
# - Recommendations for fairness
```
## Visualization Types
1. **Feature Importance Bar Chart**
2. **SHAP Summary Plot** (beeswarm)
3. **SHAP Waterfall** (single prediction)
4. **Partial Dependence Plots**
5. **Individual Conditional Expectation** (ICE)
6. **Force Plots** (interactive)
7. **Decision Trees** (surrogate models)
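Plots 2 and 3 map directly onto the `shap` library. A minimal sketch, assuming a tree-based model (the skill's `generate_all_reports()` produces these automatically):

```python
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = xgb.XGBClassifier(n_estimators=100).fit(X, y)

explainer = shap.Explainer(model)
shap_values = explainer(X)

shap.plots.beeswarm(shap_values)      # global summary (one dot per sample per feature)
shap.plots.waterfall(shap_values[0])  # local explanation for the first prediction
```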
## Integration with SpecWeave
```bash
# Generate all explainability artifacts
/ml:explain-model 0042
# Explain specific prediction
/ml:explain-prediction --increment 0042 --sample sample.json
# Check for bias
/ml:fairness-check 0042
```
Explainability artifacts automatically included in increment documentation and COMPLETION-SUMMARY.
## Best Practices
1. **Generate explanations for all production models** - No "black boxes" in production
2. **Check for bias** - Test sensitive attributes
3. **Document limitations** - What model can't explain
4. **Validate explanations** - Do they make domain sense?
5. **Make explanations accessible** - Non-technical stakeholders should understand
Model explainability is non-negotiable for responsible AI deployment.


@@ -0,0 +1,541 @@
---
name: model-registry
description: |
Centralized model versioning, staging, and lifecycle management. Activates for "model registry", "model versioning", "model staging", "deploy to production", "rollback model", "model metadata", "model lineage", "promote model", "model catalog". Manages ML model lifecycle from development through production with SpecWeave increment integration.
---
# Model Registry
## Overview
Centralized system for managing ML model lifecycle: versioning, staging (dev/staging/prod), metadata tracking, lineage, and rollback. Ensures production models are tracked, reproducible, and can be safely deployed or rolled back—all integrated with SpecWeave's increment workflow.
## Why Model Registry Matters
**Without Model Registry**:
- ❌ "Which model is in production?"
- ❌ "Can't reproduce model from 3 months ago"
- ❌ "Breaking change deployed, how to rollback?"
- ❌ "Model metadata scattered across notebooks"
- ❌ "No audit trail for model changes"
**With Model Registry**:
- ✅ Single source of truth for all models
- ✅ Full version history with metadata
- ✅ Safe staging pipeline (dev → staging → prod)
- ✅ One-command rollback
- ✅ Complete model lineage
- ✅ Audit trail for compliance
## Model Registry Structure
### Model Lifecycle Stages
```
Development → Staging → Production → Archived
Dev: Training, experimentation
Staging: Validation, A/B testing (10% traffic)
Prod: Production deployment (100% traffic)
Archived: Decommissioned, kept for audit
```
## Core Operations
### 1. Model Registration
```python
from specweave import ModelRegistry
registry = ModelRegistry(increment="0042")
# Register new model version
model_version = registry.register_model(
name="fraud-detection-model",
model=trained_model,
version="v3",
metadata={
"algorithm": "XGBoost",
"accuracy": 0.87,
"precision": 0.85,
"recall": 0.62,
"training_date": "2024-01-15",
"training_data_version": "v2024-01",
"hyperparameters": {
"n_estimators": 673,
"max_depth": 6,
"learning_rate": 0.094
},
"features": feature_names,
"framework": "xgboost==1.7.0",
"python_version": "3.10",
"increment": "0042"
},
stage="dev", # Initial stage
tags=["fraud", "production-candidate"]
)
# Creates:
# - Model artifact (model.pkl)
# - Model metadata (metadata.json)
# - Model signature (inputs/outputs)
# - Environment file (requirements.txt)
# - Feature schema (features.yaml)
```
### 2. Model Versioning
```python
# Semantic versioning: major.minor.patch
registry.version_model(
name="fraud-detection-model",
version_type="minor" # v3.0.0 → v3.1.0
)
# Auto-increments based on changes:
# - major: Breaking changes (different features, incompatible)
# - minor: Improvements (better accuracy, new features added)
# - patch: Bugfixes, retraining (same features, slight changes)
```
### 3. Model Promotion
**Stage Progression**:
```python
# Promote from dev to staging
registry.promote_model(
name="fraud-detection-model",
version="v3.1.0",
from_stage="dev",
to_stage="staging",
approval_required=True # Requires review
)
# Validate in staging (A/B test)
ab_test_results = run_ab_test(
control="fraud-detection-v3.0.0",
treatment="fraud-detection-v3.1.0",
traffic_split=0.1, # 10% to new model
duration_days=7
)
# Promote to production if successful
if ab_test_results['treatment_is_better']:
registry.promote_model(
name="fraud-detection-model",
version="v3.1.0",
from_stage="staging",
to_stage="production"
)
```
### 4. Model Rollback
```python
# Rollback to previous version
registry.rollback(
name="fraud-detection-model",
to_version="v3.0.0", # Previous stable version
reason="v3.1.0 causing high false positive rate"
)
# Automatic rollback triggers:
registry.set_auto_rollback_triggers(
error_rate_threshold=0.05, # Rollback if >5% errors
latency_threshold=200, # Rollback if p95 > 200ms
accuracy_drop_threshold=0.10 # Rollback if accuracy drops >10%
)
```
### 5. Model Retrieval
```python
# Get latest production model
model = registry.get_model(
name="fraud-detection-model",
stage="production"
)
# Get specific version
model_v3 = registry.get_model(
name="fraud-detection-model",
version="v3.1.0"
)
# Get model by date
model_jan = registry.get_model_by_date(
name="fraud-detection-model",
date="2024-01-15"
)
```
## Model Metadata
### Tracked Metadata
```python
model_metadata = {
# Core Info
"name": "fraud-detection-model",
"version": "v3.1.0",
"stage": "production",
"created_at": "2024-01-15T10:30:00Z",
"updated_at": "2024-01-20T14:00:00Z",
# Training Info
"algorithm": "XGBoost",
"framework": "xgboost==1.7.0",
"python_version": "3.10",
"training_duration": "45min",
"training_data_size": "100k rows",
# Performance Metrics
"accuracy": 0.87,
"precision": 0.85,
"recall": 0.62,
"roc_auc": 0.92,
"f1_score": 0.72,
# Deployment Info
"inference_latency_p50": "35ms",
"inference_latency_p95": "80ms",
"model_size": "12MB",
"cpu_usage": "0.2 cores",
"memory_usage": "256MB",
# Lineage
"increment": "0042-fraud-detection",
"experiment": "exp-003-xgboost",
"training_data_version": "v2024-01",
"feature_engineering_version": "v1",
"parent_model": "fraud-detection-v3.0.0",
# Features
"features": [
"amount_vs_user_average",
"days_since_last_purchase",
"merchant_risk_score",
...
],
"num_features": 35,
# Tags & Labels
"tags": ["fraud", "production", "high-precision"],
"owner": "[email protected]",
"approver": "[email protected]"
}
```
## Model Lineage
### Tracking Model Lineage
```python
# Full lineage: data → features → training → model
lineage = registry.get_lineage(
name="fraud-detection-model",
version="v3.1.0"
)
# Lineage graph:
"""
data:v2024-01
└─> feature-engineering:v1
└─> experiment:exp-003-xgboost
└─> model:fraud-detection-v3.1.0
└─> deployment:production
"""
# Answer questions like:
# - "What data was used to train this model?"
# - "Which experiments led to this model?"
# - "What models use this feature set?"
# - "Impact of changing feature X?"
```
### Model Comparison
```python
# Compare two model versions
comparison = registry.compare_models(
model_a="fraud-detection-v3.0.0",
model_b="fraud-detection-v3.1.0"
)
# Output:
"""
Comparison: v3.0.0 vs v3.1.0
============================
Metrics:
- Accuracy: 0.85 → 0.87 (+2.4%) ✅
- Precision: 0.83 → 0.85 (+2.4%) ✅
- Recall: 0.60 → 0.62 (+3.3%) ✅
Performance:
- Latency: 40ms → 35ms (-12.5%) ✅
- Size: 15MB → 12MB (-20.0%) ✅
Features:
- Added: merchant_reputation_score
- Removed: obsolete_feature_x
- Modified: 3 features rescaled
Recommendation: ✅ v3.1.0 is better (improvement in all metrics)
"""
```
## Integration with SpecWeave
### Automatic Registration
```python
# Models automatically registered during increment completion
with track_experiment("xgboost-v1", increment="0042") as exp:
model = train_model(X_train, y_train)
# Auto-registers model to registry
exp.register_model(
model=model,
name="fraud-detection-model",
auto_version=True # Auto-increment version
)
```
### Increment-Model Mapping
```
.specweave/increments/0042-fraud-detection/
├── models/
│ ├── fraud-detection-v3.0.0/
│ │ ├── model.pkl
│ │ ├── metadata.json
│ │ ├── requirements.txt
│ │ └── features.yaml
│ └── fraud-detection-v3.1.0/
│ ├── model.pkl
│ ├── metadata.json
│ ├── requirements.txt
│ └── features.yaml
└── registry/
├── model_catalog.yaml
├── lineage_graph.json
└── deployment_history.md
```
### Living Docs Integration
```bash
/specweave:sync-docs update
```
Updates:
```markdown
<!-- .specweave/docs/internal/architecture/model-registry.md -->
## Fraud Detection Model - Production
### Current Production Model
- Version: v3.1.0
- Deployed: 2024-01-20
- Accuracy: 87%
- Latency: 35ms (p50)
### Version History
| Version | Stage | Accuracy | Deployed | Notes |
|---------|-------|----------|----------|-------|
| v3.1.0 | Prod | 0.87 | 2024-01-20 | Current ✅ |
| v3.0.0 | Archived | 0.85 | 2024-01-10 | Replaced by v3.1.0 |
| v2.5.0 | Archived | 0.83 | 2023-12-01 | Retired |
### Rollback Plan
If v3.1.0 issues detected:
1. Rollback to v3.0.0 (tested, stable)
2. Investigate issue in staging
3. Deploy fix as v3.1.1
```
## Model Registry Providers
### MLflow Model Registry
```python
from specweave import MLflowRegistry
# Use MLflow as backend
registry = MLflowRegistry(
tracking_uri="http://mlflow.company.com",
increment="0042"
)
# All SpecWeave operations work with MLflow backend
registry.register_model(...)
registry.promote_model(...)
```
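For reference, the underlying MLflow calls look roughly like this; `<run_id>` is a placeholder for a real run, and newer MLflow versions favor model aliases over stages:

```python
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow.company.com")

# Register the model logged in a run, then move it through stages
result = mlflow.register_model("runs:/<run_id>/model", "fraud-detection-model")

client = MlflowClient()
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=result.version,
    stage="Staging",  # later: "Production", with archive_existing_versions=True
)
```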
### Custom Registry
```python
from specweave import CustomRegistry
# Use custom storage (S3, GCS, Azure Blob)
registry = CustomRegistry(
storage_uri="s3://ml-models/registry",
increment="0042"
)
```
## Best Practices
### 1. Semantic Versioning
```python
# Breaking change (different features)
registry.version_model(version_type="major") # v3.0.0 → v4.0.0
# Feature addition (backward compatible)
registry.version_model(version_type="minor") # v3.0.0 → v3.1.0
# Bugfix or retraining (no API change)
registry.version_model(version_type="patch") # v3.0.0 → v3.0.1
```
### 2. Model Signatures
```python
# Document input/output schema
registry.set_model_signature(
model="fraud-detection-v3.1.0",
inputs={
"amount": "float",
"merchant_id": "int",
"location": "str"
},
outputs={
"fraud_probability": "float",
"fraud_flag": "bool",
"risk_score": "float"
}
)
# Prevents breaking changes (validate on registration)
```
### 3. Model Approval Workflow
```python
# Require approval before production
registry.set_approval_required(
stage="production",
approvers=["[email protected]", "[email protected]"]
)
# Approve model promotion
registry.approve_model(
name="fraud-detection-model",
version="v3.1.0",
approver="[email protected]",
comments="Tested in staging, accuracy improved 2%, latency reduced 12%"
)
```
### 4. Model Deprecation
```python
# Mark old models as deprecated
registry.deprecate_model(
name="fraud-detection-model",
version="v2.5.0",
reason="Superseded by v3.x series",
end_of_life="2024-06-01"
)
```
## Commands
```bash
# List all models
/ml:registry-list
# Get model info
/ml:registry-info fraud-detection-model
# Promote model
/ml:registry-promote fraud-detection-model v3.1.0 --to production
# Rollback model
/ml:registry-rollback fraud-detection-model --to v3.0.0
# Compare models
/ml:registry-compare fraud-detection-model v3.0.0 v3.1.0
```
## Advanced Features
### 1. Model Monitoring Integration
```python
# Automatically track production model performance
monitor = ModelMonitor(registry=registry)
monitor.track_model(
name="fraud-detection-model",
stage="production",
metrics=["accuracy", "latency", "error_rate"]
)
# Auto-rollback if metrics degrade
monitor.set_auto_rollback(
metric="accuracy",
threshold=0.80, # Rollback if < 80%
window="24h"
)
```
### 2. Model Governance
```python
# Compliance and audit trail
governance = ModelGovernance(registry=registry)
# Generate audit report
audit_report = governance.generate_audit_report(
model="fraud-detection-model",
start_date="2023-01-01",
end_date="2024-01-31"
)
# Includes:
# - All model versions deployed
# - Who approved deployments
# - Performance metrics over time
# - Data sources used
# - Compliance checkpoints
```
### 3. Multi-Environment Registry
```python
# Separate registries for dev, staging, prod
registry_dev = ModelRegistry(environment="dev")
registry_staging = ModelRegistry(environment="staging")
registry_prod = ModelRegistry(environment="production")
# Promote across environments
registry_dev.promote_to(
model="fraud-detection-v3.1.0",
target_env="staging"
)
```
## Summary
Model Registry is essential for:
- ✅ Model versioning (track all model versions)
- ✅ Safe deployment (dev → staging → prod pipeline)
- ✅ Fast rollback (one-command revert to stable version)
- ✅ Audit trail (who deployed what, when, why)
- ✅ Model lineage (data → features → model → deployment)
- ✅ Compliance (regulatory requirements, governance)
This skill brings enterprise-grade model lifecycle management to SpecWeave, ensuring all models are tracked, reproducible, and safely deployed.

View File

@@ -0,0 +1,180 @@
---
name: nlp-pipeline-builder
description: |
Natural language processing ML pipelines for text classification, NER, sentiment analysis, text generation, and embeddings. Activates for "nlp", "text classification", "sentiment analysis", "named entity recognition", "BERT", "transformers", "text preprocessing", "tokenization", "word embeddings". Builds NLP pipelines with transformers, integrated with SpecWeave increments.
---
# NLP Pipeline Builder
## Overview
Specialized ML pipelines for natural language processing. Handles text preprocessing, tokenization, transformer models (BERT, RoBERTa, GPT), fine-tuning, and deployment for production NLP systems.
## NLP Tasks Supported
### 1. Text Classification
```python
from specweave import NLPPipeline
# Binary or multi-class text classification
pipeline = NLPPipeline(
task="classification",
classes=["positive", "negative", "neutral"],
increment="0042"
)
# Automatically configures:
# - Text preprocessing (lowercase, clean)
# - Tokenization (BERT tokenizer)
# - Model (BERT, RoBERTa, DistilBERT)
# - Fine-tuning on your data
# - Inference pipeline
pipeline.fit(train_texts, train_labels)
```
### 2. Named Entity Recognition (NER)
```python
# Extract entities from text
pipeline = NLPPipeline(
task="ner",
entities=["PERSON", "ORG", "LOC", "DATE"],
increment="0042"
)
# Returns: [(entity_text, entity_type, start_pos, end_pos), ...]
```
### 3. Sentiment Analysis
```python
# Sentiment classification (specialized)
pipeline = NLPPipeline(
task="sentiment",
increment="0042"
)
# Fine-tuned for sentiment (positive/negative/neutral)
```
### 4. Text Generation
```python
# Generate text continuations
pipeline = NLPPipeline(
task="generation",
model="gpt2",
increment="0042"
)
# Fine-tune on your domain-specific text
```
## Best Practices for NLP
### Text Preprocessing
```python
from specweave import TextPreprocessor
preprocessor = TextPreprocessor(increment="0042")
# Standard preprocessing
preprocessor.add_steps([
"lowercase",
"remove_html",
"remove_urls",
"remove_emails",
"remove_special_chars",
"remove_extra_whitespace"
])
# Advanced preprocessing
preprocessor.add_advanced([
"spell_correction",
"lemmatization",
"stopword_removal"
])
```
### Model Selection
**Text Classification**:
- Small datasets (<10K): DistilBERT (roughly 60% faster than BERT while keeping most of its accuracy)
- Medium datasets (10K-100K): BERT-base
- Large datasets (>100K): RoBERTa-large
**NER**:
- General: BERT + CRF layer
- Domain-specific: Fine-tune BERT on domain corpus
**Sentiment**:
- Product reviews: DistilBERT fine-tuned on Amazon reviews
- Social media: RoBERTa fine-tuned on Twitter
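These heuristics can be encoded as a small helper. The sketch below is illustrative only: the size thresholds are assumptions rather than SpecWeave defaults, and the checkpoint names follow Hugging Face Hub conventions.

```python
# Illustrative helper: pick a transformer checkpoint from task and dataset size.
# Thresholds and checkpoint names are assumptions, not SpecWeave defaults.
def suggest_checkpoint(task: str, n_samples: int) -> str:
    if task == "classification":
        if n_samples < 10_000:
            return "distilbert-base-uncased"      # fast, good for small data
        if n_samples < 100_000:
            return "bert-base-uncased"
        return "roberta-large"
    if task == "ner":
        return "bert-base-cased"                  # casing helps entity detection
    if task == "sentiment":
        return "distilbert-base-uncased-finetuned-sst-2-english"
    raise ValueError(f"Unknown task: {task}")

print(suggest_checkpoint("classification", n_samples=25_000))  # bert-base-uncased
```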
### Transfer Learning
```python
# Start from pre-trained language models
pipeline = NLPPipeline(task="classification")
# Option 1: Use pre-trained (no fine-tuning)
pipeline.use_pretrained("distilbert-base-uncased")
# Option 2: Fine-tune on your data
pipeline.use_pretrained_and_finetune(
model="bert-base-uncased",
epochs=3,
learning_rate=2e-5
)
```
### Handling Long Text
```python
# For text longer than 512 tokens
pipeline = NLPPipeline(
task="classification",
max_length=512,
truncation_strategy="head_and_tail" # Keep start + end
)
# Or use Longformer for long documents
pipeline.use_model("longformer") # Handles 4096 tokens
```
## Integration with SpecWeave
NLP increment structure:
```
.specweave/increments/0042-sentiment-classifier/
├── spec.md
├── data/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
├── models/
│   ├── tokenizer/
│   ├── model-epoch-1/
│   ├── model-epoch-2/
│   └── model-epoch-3/
├── experiments/
│   ├── distilbert-baseline/
│   ├── bert-base-finetuned/
│   └── roberta-large/
└── deployment/
    ├── model.onnx
    └── inference.py
```
## Commands
```bash
/ml:nlp-pipeline --task classification --model bert-base
/ml:nlp-evaluate 0042 # Evaluate on test set
/ml:nlp-deploy 0042 # Export for production
```
Quick setup for NLP projects with state-of-the-art transformer models.

View File

@@ -0,0 +1,569 @@
---
name: time-series-forecaster
description: |
Time series forecasting with ARIMA, Prophet, LSTM, and statistical methods. Activates for "time series", "forecasting", "predict future", "trend analysis", "seasonality", "ARIMA", "Prophet", "sales forecast", "demand prediction", "stock prediction". Handles trend decomposition, seasonality detection, multivariate forecasting, and confidence intervals with SpecWeave increment integration.
---
# Time Series Forecaster
## Overview
Specialized forecasting pipelines for time-dependent data. Handles trend analysis, seasonality detection, and future predictions using statistical methods, machine learning, and deep learning approaches—all integrated with SpecWeave's increment workflow.
## Why Time Series is Different
**Standard ML assumptions violated**:
- ❌ Data is NOT independent (temporal correlation)
- ❌ Data is NOT identically distributed (trends, seasonality)
- ❌ Random train/test split is WRONG (breaks temporal order)
**Time series requirements**:
- ✅ Temporal order preserved
- ✅ No data leakage from future
- ✅ Stationarity checks
- ✅ Autocorrelation analysis
- ✅ Seasonality decomposition
## Forecasting Methods
### 1. Statistical Methods (Baseline)
**ARIMA (AutoRegressive Integrated Moving Average)**:
```python
from specweave import TimeSeriesForecaster
forecaster = TimeSeriesForecaster(
method="arima",
increment="0042"
)
# Automatic order selection (p, d, q)
forecaster.fit(train_data)
# Forecast next 30 periods
forecast = forecaster.predict(horizon=30)
# Generates:
# - Trend analysis
# - Seasonality decomposition
# - Autocorrelation plots (ACF, PACF)
# - Residual diagnostics
# - Forecast with confidence intervals
```
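For reference, the same workflow can be sketched directly with statsmodels (assuming `train_data` is a pandas Series with a datetime index; the `(1, 1, 1)` order is only a placeholder for whatever the automatic selection picks):

```python
from statsmodels.tsa.arima.model import ARIMA

# train_data: pandas Series indexed by date (assumption)
model = ARIMA(train_data, order=(1, 1, 1))        # (p, d, q) placeholder order
fitted = model.fit()

forecast = fitted.get_forecast(steps=30)
point_forecast = forecast.predicted_mean          # point estimates
conf_int = forecast.conf_int(alpha=0.05)          # 95% confidence intervals
print(fitted.summary())
```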
**Seasonal Decomposition**:
```python
# Decompose into trend + seasonal + residual
decomposition = forecaster.decompose(
data=sales_data,
model='multiplicative', # Or 'additive'
    period=12  # 12 observations per seasonal cycle (e.g., yearly pattern in monthly data)
)
# Creates:
# - Trend component plot
# - Seasonal component plot
# - Residual component plot
# - Strength of trend/seasonality metrics
```
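The decomposition itself maps closely onto statsmodels' `seasonal_decompose`; a minimal sketch, assuming monthly data in a pandas Series:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# sales_data: pandas Series with a monthly DatetimeIndex (assumption)
result = seasonal_decompose(sales_data, model="multiplicative", period=12)

trend = result.trend          # long-term direction
seasonal = result.seasonal    # repeating within-cycle pattern
residual = result.resid       # what trend + seasonality cannot explain
result.plot()                 # four-panel figure: observed, trend, seasonal, resid
```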
### 2. Prophet (Facebook)
**Best for**: Business time series (sales, website traffic, user growth)
```python
from specweave import ProphetForecaster
forecaster = ProphetForecaster(increment="0042")
# Prophet handles:
# - Multiple seasonality (daily, weekly, yearly)
# - Holidays and events
# - Missing data
# - Outliers
forecaster.fit(
data=sales_data,
holidays=us_holidays, # Built-in holiday effects
seasonality_mode='multiplicative'
)
forecast = forecaster.predict(horizon=90)
# Generates:
# - Trend + seasonality + holiday components
# - Change point detection
# - Uncertainty intervals
# - Cross-validation results
```
**Prophet with Custom Regressors**:
```python
# Add external variables (marketing spend, weather, etc.)
forecaster.add_regressor("marketing_spend")
forecaster.add_regressor("temperature")
# Prophet incorporates external factors into forecast
```
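If you prefer to call the underlying open-source prophet package directly, the equivalent looks roughly like this (Prophet expects a dataframe with `ds` and `y` columns; the regressor names and the forward-fill of future regressor values are placeholders for real inputs):

```python
from prophet import Prophet

# df columns: ds (date), y (target), marketing_spend, temperature (assumption)
m = Prophet(seasonality_mode="multiplicative")
m.add_regressor("marketing_spend")
m.add_regressor("temperature")
m.fit(df)

# The future dataframe must also contain the regressor columns
future = m.make_future_dataframe(periods=90)
future = future.merge(df[["ds", "marketing_spend", "temperature"]], on="ds", how="left")
future[["marketing_spend", "temperature"]] = future[["marketing_spend", "temperature"]].ffill()  # placeholder future values

forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```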
### 3. Deep Learning (LSTM/GRU)
**Best for**: Complex patterns, multivariate forecasting, non-linear relationships
```python
from specweave import LSTMForecaster
forecaster = LSTMForecaster(
lookback_window=30, # Use 30 past observations
horizon=7, # Predict 7 steps ahead
increment="0042"
)
# Automatically handles:
# - Sequence creation
# - Train/val/test split (temporal)
# - Scaling
# - Early stopping
forecaster.fit(
data=sensor_data,
epochs=100,
batch_size=32
)
forecast = forecaster.predict(horizon=7)
# Generates:
# - Training history plots
# - Validation metrics
# - Attention weights (if using attention)
# - Forecast uncertainty estimation
```
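Under the hood, the sequence-to-forecast setup resembles a standard Keras LSTM. This is a minimal sketch of that pattern, not the wrapper's actual architecture; the toy series, layer widths, and epoch count are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

def make_windows(series: np.ndarray, lookback: int, horizon: int):
    """Turn a 1-D series into (samples, lookback, 1) inputs and (samples, horizon) targets."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback:i + lookback + horizon])
    return np.array(X)[..., None], np.array(y)

series = np.sin(np.linspace(0, 50, 1000))           # toy stand-in for sensor_data
X, y = make_windows(series, lookback=30, horizon=7)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(30, 1)),
    tf.keras.layers.Dense(7),                       # one output per forecast step
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)

forecast = model.predict(series[-30:][None, :, None])   # next 7 steps
```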
### 4. Multivariate Forecasting
**VAR (Vector AutoRegression)** - Multiple related time series:
```python
from specweave import VARForecaster
# Forecast multiple related series simultaneously
forecaster = VARForecaster(increment="0042")
# Example: Forecast sales across multiple stores
# Each store's sales affects others
forecaster.fit(data={
'store_1_sales': store1_data,
'store_2_sales': store2_data,
'store_3_sales': store3_data
})
forecast = forecaster.predict(horizon=30)
# Returns forecasts for all 3 stores
```
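With statsmodels directly, the equivalent VAR fit looks roughly like this (assuming the three store series are columns of one dataframe aligned on date):

```python
from statsmodels.tsa.api import VAR

# df columns: store_1_sales, store_2_sales, store_3_sales (assumption), datetime index
model = VAR(df)
results = model.fit(maxlags=14, ic="aic")         # choose lag order by AIC

# Forecasting needs the last k_ar observations as starting context
last_obs = df.values[-results.k_ar:]
forecast = results.forecast(last_obs, steps=30)   # shape (30, 3): one column per store
```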
## Time Series Best Practices
### 1. Temporal Train/Test Split
```python
# ❌ WRONG: Random split (data leakage!)
X_train, X_test = train_test_split(data, test_size=0.2)
# ✅ CORRECT: Temporal split
split_date = "2024-01-01"
train = data[data.index < split_date]
test = data[data.index >= split_date]
# Or use last N periods as test
train = data[:-30] # All but last 30 observations
test = data[-30:] # Last 30 observations
```
### 2. Stationarity Testing
```python
from specweave import TimeSeriesAnalyzer
analyzer = TimeSeriesAnalyzer(increment="0042")
# Check stationarity (required for ARIMA)
stationarity = analyzer.check_stationarity(data)
if not stationarity['is_stationary']:
# Make stationary via differencing
data_diff = analyzer.difference(data, order=1)
# Or detrend
data_detrended = analyzer.detrend(data)
```
**Stationarity Report**:
```markdown
# Stationarity Analysis
## ADF Test (Augmented Dickey-Fuller)
- Test Statistic: -2.15
- P-value: 0.23
- Critical Value (5%): -2.89
- Result: ❌ NON-STATIONARY (p > 0.05)
## Recommendation
Apply differencing (order=1) to remove trend.
After differencing:
- ADF Test Statistic: -5.42
- P-value: 0.0001
- Result: ✅ STATIONARY
```
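The numbers in this report come from the Augmented Dickey-Fuller test; a minimal sketch of the same check with statsmodels (the series name and significance level are assumptions):

```python
from statsmodels.tsa.stattools import adfuller

def adf_report(series, alpha=0.05):
    stat, pvalue, _, _, critical_values, _ = adfuller(series.dropna())
    print(f"ADF statistic: {stat:.2f}")
    print(f"p-value:       {pvalue:.4f}")
    print(f"5% critical:   {critical_values['5%']:.2f}")
    return pvalue < alpha   # True => reject unit root => stationary

if not adf_report(data):
    # First-order differencing usually removes a linear trend
    adf_report(data.diff())
```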
### 3. Seasonality Detection
```python
# Automatic seasonality detection
seasonality = analyzer.detect_seasonality(data)
# Results:
# - Daily: False
# - Weekly: True (period=7)
# - Monthly: True (period=30)
# - Yearly: False
```
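One simple way such detection can work is to look for peaks in the autocorrelation function at candidate periods; a rough sketch under that assumption (the candidate lags and threshold are illustrative, not the wrapper's actual logic):

```python
from statsmodels.tsa.stattools import acf

def detect_periods(series, candidates=(7, 30, 365), threshold=0.3):
    """Flag candidate periods whose autocorrelation exceeds a rough threshold."""
    values = acf(series.dropna(), nlags=max(candidates))
    return {p: bool(values[p] > threshold) for p in candidates}

print(detect_periods(data))   # e.g. {7: True, 30: True, 365: False}
```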
### 4. Cross-Validation for Time Series
```python
# Time series cross-validation (expanding window)
cv_results = forecaster.cross_validate(
data=data,
horizon=30, # Forecast 30 steps ahead
n_splits=5, # 5 expanding windows
metric='mape'
)
# Visualizes:
# - MAPE across different time periods
# - Forecast vs actual for each fold
# - Model stability over time
```
### 5. Handling Missing Data
```python
# Time series-specific imputation
forecaster.handle_missing(
method='interpolate', # Or 'forward_fill', 'backward_fill'
limit=3 # Max consecutive missing values to fill
)
# For seasonal data
forecaster.handle_missing(
method='seasonal_interpolate',
period=12 # Use seasonal pattern to impute
)
```
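In plain pandas, the basic gap-filling strategies look like this (a sketch; the seasonal variant above has no single pandas equivalent):

```python
# series: pandas Series with a DatetimeIndex and some NaN gaps (assumption)
filled_linear = series.interpolate(method="linear", limit=3)   # fill up to 3 gaps in a row
filled_ffill = series.ffill(limit=3)                           # carry last value forward
filled_bfill = series.bfill(limit=3)                           # pull next value backward

# Time-aware interpolation weights by actual timestamps, not row position
filled_time = series.interpolate(method="time", limit=3)
```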
## Common Time Series Patterns
### Pattern 1: Sales Forecasting
```python
from specweave import SalesForecastPipeline
pipeline = SalesForecastPipeline(increment="0042")
# Handles:
# - Weekly/monthly seasonality
# - Holiday effects
# - Marketing campaign impact
# - Trend changes
pipeline.fit(
sales_data=daily_sales,
holidays=us_holidays,
regressors={
'marketing_spend': marketing_data,
'competitor_price': competitor_data
}
)
forecast = pipeline.predict(horizon=90) # 90 days ahead
# Generates:
# - Point forecast
# - Prediction intervals (80%, 95%)
# - Component analysis (trend, seasonality, regressors)
# - Anomaly flags for past data
```
### Pattern 2: Demand Forecasting
```python
from specweave import DemandForecastPipeline
# Inventory optimization, supply chain planning
pipeline = DemandForecastPipeline(
aggregation='daily', # Or 'weekly', 'monthly'
increment="0042"
)
# Multi-product forecasting
forecasts = pipeline.fit_predict(
products=['product_A', 'product_B', 'product_C'],
horizon=30
)
# Generates:
# - Demand forecast per product
# - Confidence intervals
# - Stockout risk analysis
# - Reorder point recommendations
```
### Pattern 3: Stock Price Prediction
```python
from specweave import FinancialForecastPipeline
# Stock prices, crypto, forex
pipeline = FinancialForecastPipeline(increment="0042")
# Handles:
# - Volatility clustering
# - Non-linear patterns
# - Technical indicators
pipeline.fit(
price_data=stock_prices,
features=['volume', 'volatility', 'RSI', 'MACD']
)
forecast = pipeline.predict(horizon=7)
# Generates:
# - Price forecast with confidence bands
# - Volatility forecast (GARCH)
# - Trading signals (optional)
# - Risk metrics
```
### Pattern 4: Sensor Data / IoT
```python
from specweave import SensorForecastPipeline
# Temperature, humidity, machine metrics
pipeline = SensorForecastPipeline(
method='lstm', # Deep learning for complex patterns
increment="0042"
)
# Multivariate: Multiple sensor readings
pipeline.fit(
sensors={
'temperature': temp_data,
'humidity': humidity_data,
'pressure': pressure_data
}
)
forecast = pipeline.predict(horizon=24) # 24 hours ahead
# Generates:
# - Multi-sensor forecasts
# - Anomaly detection (unexpected values)
# - Maintenance alerts
```
## Evaluation Metrics
**Time series-specific metrics**:
```python
from specweave import TimeSeriesEvaluator
evaluator = TimeSeriesEvaluator(increment="0042")
metrics = evaluator.evaluate(
y_true=test_data,
y_pred=forecast
)
# Metrics:
# - MAPE (Mean Absolute Percentage Error) - business-friendly
# - RMSE (Root Mean Squared Error) - penalizes large errors
# - MAE (Mean Absolute Error) - robust to outliers
# - MASE (Mean Absolute Scaled Error) - scale-independent
# - Directional Accuracy - did we predict up/down correctly?
```
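For reference, the headline metrics can be computed with a few lines of numpy (a sketch; the naive baseline used for MASE here is the "repeat the previous value" forecast on the training data):

```python
import numpy as np

def forecast_metrics(y_true, y_pred, y_train):
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    mape = np.mean(np.abs(errors / y_true)) * 100        # assumes no zeros in y_true
    naive_mae = np.mean(np.abs(np.diff(y_train)))        # scale from naive forecast on train
    mase = mae / naive_mae
    directional = np.mean(np.sign(np.diff(y_true)) == np.sign(np.diff(y_pred))) * 100

    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "MASE": mase,
            "DirectionalAccuracy": directional}
```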
**Evaluation Report**:
```markdown
# Time Series Forecast Evaluation
## Point Metrics
- MAPE: 8.2% (target: <10%) ✅
- RMSE: 124.5
- MAE: 98.3
- MASE: 0.85 (< 1 = better than naive forecast) ✅
## Directional Accuracy
- Correct direction: 73% (up/down predictions)
## Forecast Bias
- Mean Error: -5.2 (slight under-forecasting)
- Bias: -2.1%
## Confidence Intervals
- 80% interval coverage: 79.2% ✅
- 95% interval coverage: 94.1% ✅
## Recommendation
✅ DEPLOY: Model meets accuracy targets and is well-calibrated.
```
## Integration with SpecWeave
### Increment Structure
```
.specweave/increments/0042-sales-forecast/
├── spec.md (forecasting requirements, accuracy targets)
├── plan.md (forecasting strategy, method selection)
├── tasks.md
├── data/
│ ├── train_data.csv
│ ├── test_data.csv
│ └── schema.yaml
├── experiments/
│ ├── arima-baseline/
│ ├── prophet-holidays/
│ └── lstm-multivariate/
├── models/
│ ├── prophet_model.pkl
│ └── lstm_model.h5
├── forecasts/
│ ├── forecast_2024-01.csv
│ ├── forecast_2024-02.csv
│ └── forecast_with_intervals.csv
└── analysis/
├── stationarity_test.md
├── seasonality_decomposition.png
└── forecast_evaluation.md
```
### Living Docs Integration
```bash
/specweave:sync-docs update
```
Updates:
```markdown
<!-- .specweave/docs/internal/architecture/time-series-forecasting.md -->
## Sales Forecasting Model (Increment 0042)
### Method Selected: Prophet
- Reason: Handles multiple seasonality + holidays well
- Alternatives tried: ARIMA (MAPE 12%), LSTM (MAPE 10%)
- Prophet: MAPE 8.2% ✅ BEST
### Seasonality Detected
- Weekly: Strong (7-day cycle)
- Monthly: Moderate (30-day cycle)
- Yearly: Weak
### Holiday Effects
- Black Friday: +180% sales (strongest)
- Christmas: +120% sales
- Thanksgiving: +80% sales
### Forecast Horizon
- 90 days ahead
- Confidence intervals: 80%, 95%
- Update frequency: Weekly retraining
### Model Performance
- MAPE: 8.2% (target: <10%)
- Directional accuracy: 73%
- Deployed: 2024-01-15
```
## Commands
```bash
# Create time series forecast
/ml:forecast --horizon 30 --method prophet
# Evaluate forecast
/ml:evaluate-forecast 0042
# Decompose time series
/ml:decompose-timeseries 0042
```
## Advanced Features
### 1. Ensemble Forecasting
```python
# Combine multiple methods for robustness
ensemble = EnsembleForecast(increment="0042")
ensemble.add_forecaster("arima", weight=0.3)
ensemble.add_forecaster("prophet", weight=0.5)
ensemble.add_forecaster("lstm", weight=0.2)
# Weighted average of all forecasts
forecast = ensemble.predict(horizon=30)
# Ensembles often outperform any single model on accuracy and stability
```
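The weighted average itself is trivial to express; a numpy sketch, where the forecast arrays and weights stand in for the forecasters registered above:

```python
import numpy as np

forecasts = {"arima": arima_forecast, "prophet": prophet_forecast, "lstm": lstm_forecast}
weights = {"arima": 0.3, "prophet": 0.5, "lstm": 0.2}   # should sum to 1.0

ensemble_forecast = sum(weights[name] * np.asarray(f) for name, f in forecasts.items())
```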
### 2. Forecast Reconciliation
```python
# For hierarchical time series (e.g., total sales = store1 + store2 + store3)
reconciler = ForecastReconciler(increment="0042")
# Ensures forecasts sum correctly
reconciled = reconciler.reconcile(
forecasts={
'total': total_forecast,
'store1': store1_forecast,
'store2': store2_forecast,
'store3': store3_forecast
},
method='bottom_up' # Or 'top_down', 'middle_out'
)
```
### 3. Forecast Monitoring
```python
# Track forecast accuracy over time
monitor = ForecastMonitor(increment="0042")
# Compare forecasts vs actuals
monitor.track_performance(
forecasts=past_forecasts,
actuals=actual_values
)
# Alerts when accuracy degrades
if monitor.accuracy_degraded():
print("⚠️ Forecast accuracy dropped 15% - retrain model!")
```
## Summary
Time series forecasting requires specialized techniques:
- ✅ Temporal validation (no random split)
- ✅ Stationarity testing
- ✅ Seasonality detection
- ✅ Trend decomposition
- ✅ Cross-validation (expanding window)
- ✅ Confidence intervals
- ✅ Forecast monitoring
This skill handles all time series complexity within SpecWeave's increment workflow, ensuring forecasts are reproducible, documented, and production-ready.