# ML Scenario Manager Guide

Complete guide for machine learning in SAP Data Intelligence.

## Table of Contents

1. [Overview](#overview)
2. [ML Scenario Manager](#ml-scenario-manager)
3. [JupyterLab Environment](#jupyterlab-environment)
4. [Python SDK](#python-sdk)
5. [Training Pipelines](#training-pipelines)
6. [Metrics Explorer](#metrics-explorer)
7. [Model Deployment](#model-deployment)
8. [Versioning](#versioning)
9. [Best Practices](#best-practices)

---

## Overview

SAP Data Intelligence provides comprehensive machine learning capabilities.

**Key Components:**

- **ML Scenario Manager**: Organize and manage ML artifacts
- **JupyterLab**: Interactive data science environment
- **Python SDK**: Programmatic ML operations
- **Metrics Explorer**: Visualize and compare results
- **Pipelines**: Productionize ML workflows

---

## ML Scenario Manager

Central application for organizing data science artifacts.

### Accessing ML Scenario Manager

1. Open the SAP Data Intelligence Launchpad
2. Navigate to the ML Scenario Manager tile
3. View existing scenarios or create a new one

### Core Concepts

**ML Scenario:**

- Container for datasets, notebooks, and pipelines
- Supports versioning and branching
- Export/import for migration

**Artifacts:**

- Datasets (registered data sources)
- Jupyter notebooks
- Pipelines (training, inference)
- Model files

### Creating a Scenario

1. Click "Create" in ML Scenario Manager
2. Enter a scenario name and description
3. Choose an initial version name
4. Add artifacts (datasets, notebooks, pipelines)

### Scenario Structure

```
ML Scenario: Customer Churn Prediction
├── Datasets
│   ├── customer_data (registered)
│   └── transaction_history (registered)
├── Notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── Pipelines
│   ├── training_pipeline
│   └── inference_pipeline
└── Versions
    ├── v1.0 (initial)
    ├── v1.1 (improved features)
    └── v2.0 (new model architecture)
```

---

## JupyterLab Environment

Interactive environment for data science experimentation.

### Accessing JupyterLab

1. From ML Scenario Manager, click "Open Notebook"
2. Or access JupyterLab directly from the SAP Data Intelligence Launchpad

### Available Kernels

- Python 3 (with ML libraries)
- Custom kernels (via Docker configuration)

### Pre-installed Libraries

```python
# Data Processing
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Deep Learning (available)
import tensorflow as tf
import torch

# SAP Data Intelligence SDK
from sapdi import tracking
```

### Data Lake Access

Access the SAP Data Intelligence Data Lake from notebooks:

```python
from sapdi.datalake import DataLakeClient

client = DataLakeClient()

# Read file
df = client.read_csv('/shared/data/customers.csv')

# Write file
client.write_parquet(df, '/shared/output/processed.parquet')
```

### Virtual Environments

Create isolated environments for dependencies:

```bash
# Create virtual environment
python -m venv /home/user/myenv

# Activate
source /home/user/myenv/bin/activate

# Install packages
pip install xgboost lightgbm catboost
```

### Data Browser Extension

Use the Data Browser to:

- Browse available data sources
- Preview data
- Import data to notebooks

---

## Python SDK

Programmatic interface for ML operations.

### SDK Installation

The SDK is pre-installed in JupyterLab and in Python operators.

```python
import sapdi
from sapdi import tracking
from sapdi import context
```

### MLTrackingSDK Functions

| Function | Description | Notes / Limits |
|----------|-------------|----------------|
| `start_run()` | Begin experiment tracking | Specify `run_collection_name`, `run_name` |
| `end_run()` | Complete tracking | Start/end timestamps are added automatically |
| `log_param()` | Log configuration values | Name: 256 chars, value: 5,000 chars |
| `log_metric()` | Log a numeric metric | Name: 256 chars (case-sensitive) |
| `log_metrics()` | Batch-log metrics | Takes a list of dictionaries |
| `persist_run()` | Force save to storage | Automatic at 1.5 MB cache or on `end_run()` |
| `set_tags()` | Key-value pairs for filtering | `runName` is reserved |
| `set_labels()` | UI/semantic labels | Non-filterable |
| `delete_runs()` | Remove persisted metrics | By scenario/pipeline/execution |
| `get_runs()` | Retrieve run objects | Returns metrics, params, tags |
| `get_metrics_history()` | Get metric values | Max 1,000 values per metric |
| `update_run_info()` | Modify run metadata | Change name, collection, tags |
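
The tagging and retrieval functions above combine naturally with the run pattern used throughout this section. A minimal sketch, assuming `set_tags()`/`set_labels()` are exposed on the run object and `get_runs()` returns iterable run objects; these call shapes are assumptions to verify against your SDK release:

```python
from sapdi import tracking

# Hedged sketch: the function names come from the table above, but the
# exact call shapes (run methods, dict arguments) are assumptions.
with tracking.start_run(run_collection_name="churn_experiments",
                        run_name="rf_baseline") as run:
    run.set_tags({"team": "analytics", "stage": "baseline"})   # filterable
    run.set_labels({"notes": "first RandomForest baseline"})   # UI-only
    run.log_metric("accuracy", 0.91)

# Later: retrieve persisted runs to inspect metrics, params, and tags.
for r in tracking.get_runs():
    print(r)
```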

### Metrics Tracking

```python
import pickle

from sklearn.metrics import f1_score
from sapdi import tracking

# Initialize tracking
with tracking.start_run(run_name="experiment_001") as run:
    # Train model (train_model/evaluate are user-defined helpers)
    model = train_model(X_train, y_train)

    # Log parameters
    run.log_param("algorithm", "RandomForest")
    run.log_param("n_estimators", 100)
    run.log_param("max_depth", 10)

    # Log metrics
    accuracy = evaluate(model, X_test, y_test)
    f1 = f1_score(y_test, model.predict(X_test))
    run.log_metric("accuracy", accuracy)
    run.log_metric("f1_score", f1)

    # Log model artifact (serialize before logging)
    run.log_artifact("model.pkl", pickle.dumps(model))
```

### Tracking Parameters and Metrics

**Parameters** (static values, logged once per run):

```python
run.log_param("learning_rate", 0.01)
run.log_param("batch_size", 32)
run.log_param("epochs", 100)
```

**Metrics** (can be logged multiple times, e.g. once per step):

```python
for epoch in range(epochs):
    loss = train_epoch(model, data)        # user-defined training step
    val_loss = validate(model, val_data)   # user-defined validation step
    run.log_metric("loss", loss, step=epoch)
    run.log_metric("val_loss", val_loss, step=epoch)
```

### Artifact Management

```python
# Log files
run.log_artifact("model.pkl", model_bytes)
run.log_artifact("feature_importance.png", image_bytes)

# Log directories
run.log_artifacts("./model_output/")

# Retrieve artifacts
artifacts = tracking.get_run_artifacts(run_id)
model_data = artifacts.get("model.pkl")
```

### Artifact Class Methods

| Method | Description |
|--------|-------------|
| `add_file()` | Add a file to an artifact; returns a handler |
| `create()` | Create an artifact with initial content; returns its ID |
| `delete()` | Remove artifact metadata (not content) |
| `delete_content()` | Remove the stored data |
| `download()` | Retrieve artifact contents to local storage |
| `get()` | Get artifact metadata |
| `list()` | List all artifacts in the scenario |
| `open_file()` | Get a handler for remote file access |
| `upload()` | Add files/directories to an artifact |
| `walk()` | Depth-first traversal of the artifact structure |

### FileHandler Methods

| Method | Description |
|--------|-------------|
| `get_reader()` | Returns a file-like object for reading (use with `with`) |
| `get_writer()` | Returns an object for incremental writing |
| `read()` | Load an entire remote file at once |
| `write()` | Write strings, bytes, or files to the data lake |

**Important:** Files between 5 MB and 5 GB (inclusive) can be appended using the append functionality. For files smaller than 5 MB, use `get_writer()` for incremental writing instead.

```python
import pickle

from sapdi.artifact import Artifact

# Create artifact
artifact_id = Artifact.create(
    name="my_model",
    description="Trained model",
    content=model_bytes
)

# List artifacts
artifacts = Artifact.list()

# Download artifact
Artifact.download(artifact_id, local_path="/tmp/model/")

# Read remote file
with Artifact.open_file(artifact_id, "model.pkl").get_reader() as f:
    model = pickle.load(f)
```
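
For writing larger outputs piece by piece, `get_writer()` from the FileHandler table is the route to use. A minimal sketch, reusing `artifact_id` from the example above and assuming the returned writer behaves as a context-managed, file-like object (an assumption; check the SDK reference for exact semantics):

```python
from sapdi.artifact import Artifact

# Hedged sketch: incrementally write a CSV into an artifact file.
# Treating get_writer()'s return value as a context manager is assumed.
predictions = [0, 1, 1, 0]  # illustrative data
handler = Artifact.open_file(artifact_id, "predictions.csv")
with handler.get_writer() as writer:
    writer.write("id,prediction\n")
    for row_id, pred in enumerate(predictions):
        writer.write(f"{row_id},{pred}\n")
```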

### Context Information

```python
from sapdi import context

# Get scenario information
scenario_id = context.get_scenario_id()
version_id = context.get_version_id()

# Get environment info
tenant = context.get_tenant()
user = context.get_user()
```

---

## Training Pipelines

Productionize ML training workflows.

### Pipeline Components

```
[Data Consumer] -> [Feature Engineering] -> [Model Training] -> [Metrics Logger]
       |                    |                      |                    |
   Read data         Transform data          Train model          Log results
```

### Creating Training Pipeline

1. Create a new graph in Modeler
2. Add a data consumer operator
3. Add a Python operator for training
4. Add a Submit Metrics operator
5. Connect and configure

### Python Training Operator

```python
def on_input(msg):
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sapdi import tracking

    # Get data
    df = pd.DataFrame(msg.body)

    # Prepare features
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train model
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Evaluate
    accuracy = model.score(X_test, y_test)

    # Track metrics
    with tracking.start_run(
        run_collection_name="classification_experiments",
        run_name="rf_training_001"
    ) as run:
        run.log_param("model_type", "RandomForest")
        run.log_metric("accuracy", accuracy)
        run.log_artifact("model.pkl", pickle.dumps(model))

    api.send("output", api.Message({"accuracy": accuracy}))

api.set_port_callback("input", on_input)
```

### ML Pipeline Templates

Pre-built templates available:

- **Auto ML Training**: Automated model selection
- **HANA ML Training**: In-database training
- **TensorFlow Training**: Deep learning
- **Basic Training**: Generic template

---

## Metrics Explorer

Visualize and compare ML experiments.

### Accessing Metrics Explorer

1. Open ML Scenario Manager
2. Click "Metrics Explorer"
3. Select the scenario and version

### Viewing Runs

**Run List:**

- Run ID and name
- Status (completed, failed, running)
- Start/end time
- Logged metrics summary

### Comparing Runs

1. Select multiple runs
2. Click "Compare"
3. View side-by-side metrics
4. Visualize metric trends (runs can also be compared programmatically; see the sketch below)
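
For comparisons outside the UI, `get_runs()` from the MLTrackingSDK table can feed a quick tabular view. A minimal sketch, assuming each returned run object exposes `params` and `metrics` dictionaries (an assumption about the run object's shape; adapt to the actual SDK objects):

```python
import pandas as pd

from sapdi import tracking

# Hedged sketch: tabulate params and metrics across persisted runs.
# The .params / .metrics attribute names are illustrative assumptions.
runs = tracking.get_runs()
rows = [{**r.params, **r.metrics} for r in runs]
comparison = pd.DataFrame(rows)
print(comparison.sort_values("accuracy", ascending=False))
```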

### Metric Visualizations

**Available Charts:**

- Line charts (metrics over steps)
- Bar charts (metric comparison)
- Scatter plots (parameter vs. metric)

### Filtering and Search

Filter by:

- Date range
- Status
- Parameter values
- Metric thresholds

---

## Model Deployment

Deploy trained models for inference.

### Deployment Options

**Batch Inference:**

- Scheduled pipeline execution
- Process large datasets
- Results written to storage/database

**Real-time Inference:**

- API endpoint deployment
- Low-latency predictions
- Auto-scaling

### Creating Inference Pipeline

```
[API Input] -> [Load Model] -> [Predict] -> [API Output]
```

### Python Inference Operator

```python
import pickle

from sapdi.artifact import Artifact

# Load the model once and reuse it across requests. This is safe as long
# as the model object is treated as read-only during inference; note that
# model.predict() must be thread-safe for concurrent requests.
model = None

def load_model():
    global model
    # Get artifact metadata first
    artifacts = Artifact.list()
    model_artifact = next((a for a in artifacts if a.name == "model"), None)

    if model_artifact:
        # Download artifact and load model
        with Artifact.open_file(model_artifact.id, "model.pkl").get_reader() as f:
            model = pickle.load(f)

def on_input(msg):
    if model is None:
        load_model()

    # Get input features
    features = msg.body

    # Predict
    prediction = model.predict([features])[0]
    probability = model.predict_proba([features])[0]

    result = {
        "prediction": int(prediction),
        "probability": probability.tolist()
    }

    api.send("output", api.Message(result))

api.set_port_callback("input", on_input)
```

### Deployment Monitoring

Track deployed model performance:

```python
# Assumes an active tracking run and counters collected by the serving code
run.log_metric("inference_latency", latency_ms)
run.log_metric("prediction_count", count)
run.log_metric("error_rate", errors / total)
```
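
As one concrete way to produce these numbers, latency can be measured around the predict call inside the inference operator. A minimal sketch, assuming the operator pattern above and an active tracking run (the helper name `timed_predict` is illustrative):

```python
import time

def timed_predict(model, features, run):
    # Measure wall-clock latency of a single prediction and log it.
    start = time.perf_counter()
    prediction = model.predict([features])[0]
    latency_ms = (time.perf_counter() - start) * 1000
    run.log_metric("inference_latency", latency_ms)
    return prediction
```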
---

## Versioning

Manage ML scenario versions.

### Creating Versions

1. Open ML Scenario Manager
2. Navigate to the scenario
3. Click "Create Version"
4. Enter a version name
5. Select a base version (optional)

### Version Workflow

```
v1.0 (initial baseline)
└── v1.1 (feature improvements)
    └── v1.2 (hyperparameter tuning)
        └── v2.0 (new architecture)
            └── v2.1 (production release)
```

### Branching

Create versions from any point:

```
v1.0 ─── v1.1 ─── v1.2
           └── v1.1-experiment (branch for testing)
```

### Export and Import

**Export:**

1. Select the scenario version
2. Click "Export"
3. Download the ZIP file

**Import:**

1. Click "Import" in ML Scenario Manager
2. Upload the ZIP file
3. Configure the target location

---

## Best Practices

### Experiment Management

1. **Name Runs Descriptively**: Include key parameters
2. **Log Comprehensively**: Record all parameters and metrics
3. **Version Data**: Track data versions with runs (see the sketch below)
4. **Document Experiments**: Keep notes in notebooks
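
One lightweight way to implement points 2 and 3: record the data version alongside each run, using the logging calls shown earlier. The parameter and tag names here are illustrative conventions, not SDK requirements, and `set_tags()` as a run method is an assumption:

```python
from sapdi import tracking

# Hedged sketch: tie each run to the exact data it was trained on.
with tracking.start_run(run_name="rf_training_002") as run:
    run.log_param("dataset", "customer_data")     # illustrative names
    run.log_param("data_version", "2025-11-01")
    run.set_tags({"data_version": "2025-11-01"})  # filterable in Metrics Explorer
```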

### Pipeline Development

1. **Start in Notebooks**: Prototype in JupyterLab
2. **Modularize Code**: Write reusable functions
3. **Test Incrementally**: Validate each component
4. **Productionize Gradually**: Move from notebook to pipeline

### Model Management

1. **Version Models**: Link models to training runs
2. **Validate Before Deploy**: Test on holdout data
3. **Monitor Production**: Track drift and performance
4. **Maintain Lineage**: Data to model to prediction

### Resource Management

1. **Right-size Resources**: Allocate appropriate memory/CPU
2. **Clean Up Artifacts**: Remove unused experiments
3. **Archive Old Versions**: Export for long-term storage

---

## Documentation Links

- **Machine Learning**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning)
- **ML Scenario Manager**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/ml-scenario-manager](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/ml-scenario-manager)
- **JupyterLab**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/jupyterlab-environment](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/jupyterlab-environment)
- **Python SDK**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning) (see the python-sdk documentation)

---

**Last Updated**: 2025-11-22