# ML Scenario Manager Guide

Complete guide for machine learning in SAP Data Intelligence.

## Table of Contents

1. [Overview](#overview)
2. [ML Scenario Manager](#ml-scenario-manager)
3. [JupyterLab Environment](#jupyterlab-environment)
4. [Python SDK](#python-sdk)
5. [Training Pipelines](#training-pipelines)
6. [Metrics Explorer](#metrics-explorer)
7. [Model Deployment](#model-deployment)
8. [Versioning](#versioning)
9. [Best Practices](#best-practices)

---

## Overview

SAP Data Intelligence provides comprehensive machine learning capabilities.

**Key Components:**

- **ML Scenario Manager**: Organize and manage ML artifacts
- **JupyterLab**: Interactive data science environment
- **Python SDK**: Programmatic ML operations
- **Metrics Explorer**: Visualize and compare results
- **Pipelines**: Productionize ML workflows

---

## ML Scenario Manager

Central application for organizing data science artifacts.

### Accessing ML Scenario Manager

1. Open the SAP Data Intelligence Launchpad
2. Navigate to the ML Scenario Manager tile
3. View existing scenarios or create a new one

### Core Concepts

**ML Scenario:**

- Container for datasets, notebooks, and pipelines
- Supports versioning and branching
- Export/import for migration

**Artifacts:**

- Datasets (registered data sources)
- Jupyter notebooks
- Pipelines (training, inference)
- Model files

### Creating a Scenario

1. Click "Create" in ML Scenario Manager
2. Enter a scenario name and description
3. Choose an initial version name
4. Add artifacts (datasets, notebooks, pipelines)

### Scenario Structure

```
ML Scenario: Customer Churn Prediction
├── Datasets
│   ├── customer_data (registered)
│   └── transaction_history (registered)
├── Notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── Pipelines
│   ├── training_pipeline
│   └── inference_pipeline
└── Versions
    ├── v1.0 (initial)
    ├── v1.1 (improved features)
    └── v2.0 (new model architecture)
```

---

## JupyterLab Environment

Interactive environment for data science experimentation.

### Accessing JupyterLab

1. From ML Scenario Manager, click "Open Notebook"
2. Or access directly from the SAP Data Intelligence Launchpad

### Available Kernels

- Python 3 (with ML libraries)
- Custom kernels (via Docker configuration)

### Pre-installed Libraries

```python
# Data processing
import pandas as pd
import numpy as np

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Deep learning (available in the default image)
import tensorflow as tf
import torch

# SAP Data Intelligence SDK
from sapdi import tracking
```

### Data Lake Access

Access the SAP Data Intelligence Data Lake from notebooks:

```python
from sapdi.datalake import DataLakeClient

client = DataLakeClient()

# Read a file
df = client.read_csv('/shared/data/customers.csv')

# Write a file
client.write_parquet(df, '/shared/output/processed.parquet')
```

### Virtual Environments

Create isolated environments for dependencies:

```bash
# Create a virtual environment
python -m venv /home/user/myenv

# Activate it
source /home/user/myenv/bin/activate

# Install packages
pip install xgboost lightgbm catboost
```
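To make the environment selectable as a notebook kernel, a common approach (not SAP-specific, and assuming `ipykernel` can be installed into the environment) is to register it with Jupyter:

```bash
# Install the kernel machinery into the active virtual environment
pip install ipykernel

# Register the environment as a Jupyter kernel named "myenv"
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
```

After refreshing JupyterLab, the new kernel should appear in the launcher alongside the default Python 3 kernel.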
### Data Browser Extension

Use the Data Browser to:

- Browse available data sources
- Preview data
- Import data into notebooks

---

## Python SDK

Programmatic interface for ML operations.

### SDK Installation

The SDK is pre-installed in JupyterLab and in Python operators:

```python
import sapdi
from sapdi import tracking
from sapdi import context
```

### MLTrackingSDK Functions

| Function | Description | Notes / Limits |
|----------|-------------|----------------|
| `start_run()` | Begin experiment tracking | Specify `run_collection_name`, `run_name` |
| `end_run()` | Complete tracking | Auto-adds start/end timestamps |
| `log_param()` | Log configuration values | name: 256 chars, value: 5000 chars |
| `log_metric()` | Log a numeric metric | name: 256 chars (case-sensitive) |
| `log_metrics()` | Batch-log metrics | Dictionary list format |
| `persist_run()` | Force save to storage | Automatic at 1.5 MB cache or on `end_run()` |
| `set_tags()` | Key-value pairs for filtering | `runName` is reserved |
| `set_labels()` | UI/semantic labels | Non-filterable |
| `delete_runs()` | Remove persisted metrics | By scenario/pipeline/execution |
| `get_runs()` | Retrieve run objects | Returns metrics, params, tags |
| `get_metrics_history()` | Get metric values | Max 1000 per metric |
| `update_run_info()` | Modify run metadata | Change name, collection, tags |

### Metrics Tracking

```python
import pickle

from sklearn.metrics import f1_score

from sapdi import tracking

# Initialize tracking
with tracking.start_run(run_name="experiment_001") as run:
    # Train the model
    model = train_model(X_train, y_train)

    # Log parameters
    run.log_param("algorithm", "RandomForest")
    run.log_param("n_estimators", 100)
    run.log_param("max_depth", 10)

    # Log metrics
    accuracy = evaluate(model, X_test, y_test)
    f1 = f1_score(y_test, model.predict(X_test))
    run.log_metric("accuracy", accuracy)
    run.log_metric("f1_score", f1)

    # Log the serialized model as an artifact
    run.log_artifact("model.pkl", pickle.dumps(model))
```

### Tracking Parameters and Metrics

**Parameters** (static values, logged once per run):

```python
run.log_param("learning_rate", 0.01)
run.log_param("batch_size", 32)
run.log_param("epochs", 100)
```

**Metrics** (can be logged multiple times, e.g. once per step):

```python
for epoch in range(epochs):
    loss = train_epoch(model, data)
    run.log_metric("loss", loss, step=epoch)
    run.log_metric("val_loss", val_loss, step=epoch)
```

### Artifact Management

```python
# Log individual files
run.log_artifact("model.pkl", model_bytes)
run.log_artifact("feature_importance.png", image_bytes)

# Log a directory
run.log_artifacts("./model_output/")

# Retrieve artifacts
artifacts = tracking.get_run_artifacts(run_id)
model_data = artifacts.get("model.pkl")
```

### Artifact Class Methods

| Method | Description |
|--------|-------------|
| `add_file()` | Add a file to an artifact; returns a handler |
| `create()` | Create an artifact with initial content; returns its ID |
| `delete()` | Remove artifact metadata (not content) |
| `delete_content()` | Remove the stored data |
| `download()` | Retrieve artifact contents to local storage |
| `get()` | Get artifact metadata |
| `list()` | List all artifacts in the scenario |
| `open_file()` | Get a handler for remote file access |
| `upload()` | Add files/directories to an artifact |
| `walk()` | Depth-first traversal of the artifact structure |

### FileHandler Methods

| Method | Description |
|--------|-------------|
| `get_reader()` | Returns a file-like object for reading (use with `with`) |
| `get_writer()` | Returns an object for incremental writing |
| `read()` | Load an entire remote file at once |
| `write()` | Write strings, bytes, or files to the data lake |

**Important:** Files between 5 MB and 5 GB (inclusive) may be appended using the append functionality. For files smaller than 5 MB, use `get_writer()` for incremental writing instead.
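A minimal sketch of reader/writer usage, assuming only the method names from the tables above (exact signatures may differ by SDK version; `metrics.csv` is a hypothetical file name):

```python
from sapdi.artifact import Artifact

# artifact_id as returned by Artifact.create() (see the example below)
handler = Artifact.open_file(artifact_id, "metrics.csv")

# Incremental writing, suitable for files under 5 MB
writer = handler.get_writer()
writer.write("epoch,loss\n")
writer.write("1,0.42\n")

# Stream the file back for reading
with handler.get_reader() as f:
    content = f.read()
```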
The `Artifact` class covers the full artifact lifecycle:

```python
import pickle

from sapdi.artifact import Artifact

# Create an artifact
artifact_id = Artifact.create(
    name="my_model",
    description="Trained model",
    content=model_bytes
)

# List artifacts
artifacts = Artifact.list()

# Download an artifact
Artifact.download(artifact_id, local_path="/tmp/model/")

# Read a remote file
with Artifact.open_file(artifact_id, "model.pkl").get_reader() as f:
    model = pickle.load(f)
```

### Context Information

```python
from sapdi import context

# Get scenario information
scenario_id = context.get_scenario_id()
version_id = context.get_version_id()

# Get environment information
tenant = context.get_tenant()
user = context.get_user()
```

---

## Training Pipelines

Productionize ML training workflows.

### Pipeline Components

```
[Data Consumer] -> [Feature Engineering] -> [Model Training] -> [Metrics Logger]
       |                    |                      |                   |
   Read data         Transform data           Train model        Log results
```

### Creating a Training Pipeline

1. Create a new graph in the Modeler
2. Add a data consumer operator
3. Add a Python operator for training
4. Add a Submit Metrics operator
5. Connect and configure the operators

### Python Training Operator

```python
def on_input(msg):
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    from sapdi import tracking

    # Get the data
    df = pd.DataFrame(msg.body)

    # Prepare features
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train the model
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Evaluate
    accuracy = model.score(X_test, y_test)

    # Track metrics
    with tracking.start_run(
        run_collection_name="classification_experiments",
        run_name="rf_training_001"
    ) as run:
        run.log_param("model_type", "RandomForest")
        run.log_metric("accuracy", accuracy)
        run.log_artifact("model.pkl", pickle.dumps(model))

    api.send("output", api.Message({"accuracy": accuracy}))

api.set_port_callback("input", on_input)
```

### ML Pipeline Templates

Pre-built templates are available:

- **Auto ML Training**: Automated model selection
- **HANA ML Training**: In-database training
- **TensorFlow Training**: Deep learning
- **Basic Training**: Generic template

---

## Metrics Explorer

Visualize and compare ML experiments.

### Accessing Metrics Explorer

1. Open ML Scenario Manager
2. Click "Metrics Explorer"
3. Select a scenario and version

### Viewing Runs

**Run List:**

- Run ID and name
- Status (completed, failed, running)
- Start/end time
- Logged metrics summary

### Comparing Runs

1. Select multiple runs
2. Click "Compare"
3. View side-by-side metrics
4. Visualize metric trends

### Metric Visualizations

**Available Charts:**

- Line charts (metrics over steps)
- Bar charts (metric comparison)
- Scatter plots (parameter vs. metric)

### Filtering and Search

Filter runs by:

- Date range
- Status
- Parameter values
- Metric thresholds
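Runs can also be inspected programmatically via the `get_runs()` and `get_metrics_history()` functions listed in the MLTrackingSDK table. A hedged sketch; the argument names below are assumptions, so check the SDK reference for exact signatures:

```python
from sapdi import tracking

# Retrieve run objects with their metrics, params, and tags
# (the filter argument here is an illustrative assumption)
runs = tracking.get_runs(run_collection_name="classification_experiments")

for run in runs:
    print(run.name, run.metrics)

# Full history of a single metric (capped at 1000 values per metric)
history = tracking.get_metrics_history(run_id=runs[0].id, metric_name="loss")
```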
---

## Model Deployment

Deploy trained models for inference.

### Deployment Options

**Batch Inference:**

- Scheduled pipeline execution
- Process large datasets
- Results written to storage or a database

**Real-time Inference:**

- API endpoint deployment
- Low-latency predictions
- Auto-scaling

### Creating an Inference Pipeline

```
[API Input] -> [Load Model] -> [Predict] -> [API Output]
```

### Python Inference Operator

```python
import pickle

from sapdi.artifact import Artifact

# Load the model once and reuse it across requests. This is safe as long as
# the model object is read-only during inference, i.e. model.predict() must
# be thread-safe for concurrent requests.
model = None

def load_model():
    global model
    # Look up the artifact metadata first
    artifacts = Artifact.list()
    model_artifact = next((a for a in artifacts if a.name == "model"), None)
    if model_artifact:
        # Stream the artifact content and unpickle the model
        with Artifact.open_file(model_artifact.id, "model.pkl").get_reader() as f:
            model = pickle.load(f)

def on_input(msg):
    if model is None:
        load_model()

    # Get the input features
    features = msg.body

    # Predict
    prediction = model.predict([features])[0]
    probability = model.predict_proba([features])[0]

    result = {
        "prediction": int(prediction),
        "probability": probability.tolist()
    }
    api.send("output", api.Message(result))

api.set_port_callback("input", on_input)
```
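Once deployed as a real-time endpoint, the model is called over HTTP. A minimal client sketch; the URL path and credentials below are hypothetical placeholders, not values defined by SAP Data Intelligence, so substitute what ML Scenario Manager shows for your deployment:

```python
import requests

# Hypothetical deployment URL and credentials -- replace with the values
# shown for your deployment in ML Scenario Manager
url = "https://<di-host>/<deployment-path>/v1/predict"
auth = ("tenant\\user", "password")

# A single feature vector, matching msg.body in the operator above
payload = [5.1, 3.5, 1.4, 0.2]

response = requests.post(url, json=payload, auth=auth)
response.raise_for_status()
print(response.json())  # e.g. {"prediction": 0, "probability": [...]}
```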
### Deployment Monitoring

Track deployed model performance by logging inference metrics to a tracking run:

```python
from sapdi import tracking

with tracking.start_run(run_name="inference_monitoring") as run:
    run.log_metric("inference_latency", latency_ms)
    run.log_metric("prediction_count", count)
    run.log_metric("error_rate", errors / total)
```

---

## Versioning

Manage ML scenario versions.

### Creating Versions

1. Open ML Scenario Manager
2. Navigate to the scenario
3. Click "Create Version"
4. Enter a version name
5. Select a base version (optional)

### Version Workflow

```
v1.0 (initial baseline)
└── v1.1 (feature improvements)
    └── v1.2 (hyperparameter tuning)
        └── v2.0 (new architecture)
            └── v2.1 (production release)
```

### Branching

Create versions from any point:

```
v1.0 ─── v1.1 ─── v1.2
          └── v1.1-experiment (branch for testing)
```

### Export and Import

**Export:**

1. Select a scenario version
2. Click "Export"
3. Download the ZIP file

**Import:**

1. Click "Import" in ML Scenario Manager
2. Upload the ZIP file
3. Configure the target location

---

## Best Practices

### Experiment Management

1. **Name Runs Descriptively**: Include key parameters
2. **Log Comprehensively**: Record all parameters and metrics
3. **Version Data**: Track data versions with runs
4. **Document Experiments**: Keep notes in notebooks

### Pipeline Development

1. **Start in Notebooks**: Prototype in JupyterLab
2. **Modularize Code**: Write reusable functions
3. **Test Incrementally**: Validate each component
4. **Productionize Gradually**: Move from notebook to pipeline

### Model Management

1. **Version Models**: Link models to their training runs
2. **Validate Before Deploying**: Test on holdout data
3. **Monitor Production**: Track drift and performance
4. **Maintain Lineage**: From data to model to prediction

### Resource Management

1. **Right-size Resources**: Allocate appropriate memory/CPU
2. **Clean Up Artifacts**: Remove unused experiments
3. **Archive Old Versions**: Export for long-term storage

---

## Documentation Links

- **Machine Learning**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning)
- **ML Scenario Manager**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/ml-scenario-manager](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/ml-scenario-manager)
- **JupyterLab**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/jupyterlab-environment](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/jupyterlab-environment)
- **Python SDK**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning) (see the python-sdk documentation)

---

**Last Updated**: 2025-11-22