
ML Scenario Manager Guide

Complete guide for machine learning in SAP Data Intelligence.

Table of Contents

  1. Overview
  2. ML Scenario Manager
  3. JupyterLab Environment
  4. Python SDK
  5. Training Pipelines
  6. Metrics Explorer
  7. Model Deployment
  8. Versioning
  9. Best Practices

Overview

SAP Data Intelligence provides comprehensive machine learning capabilities:

Key Components:

  • ML Scenario Manager: Organize and manage ML artifacts
  • JupyterLab: Interactive data science environment
  • Python SDK: Programmatic ML operations
  • Metrics Explorer: Visualize and compare results
  • Pipelines: Productionize ML workflows

ML Scenario Manager

Central application for organizing data science artifacts.

Accessing ML Scenario Manager

  1. Open SAP Data Intelligence Launchpad
  2. Navigate to ML Scenario Manager tile
  3. View existing scenarios or create a new one

Core Concepts

ML Scenario:

  • Container for datasets, notebooks, pipelines
  • Supports versioning and branching
  • Export/import for migration

Artifacts:

  • Datasets (registered data sources)
  • Jupyter notebooks
  • Pipelines (training, inference)
  • Model files

Creating a Scenario

  1. Click "Create" in ML Scenario Manager
  2. Enter scenario name and description
  3. Choose initial version name
  4. Add artifacts (datasets, notebooks, pipelines)

Scenario Structure

ML Scenario: Customer Churn Prediction
├── Datasets
│   ├── customer_data (registered)
│   └── transaction_history (registered)
├── Notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── Pipelines
│   ├── training_pipeline
│   └── inference_pipeline
└── Versions
    ├── v1.0 (initial)
    ├── v1.1 (improved features)
    └── v2.0 (new model architecture)

JupyterLab Environment

Interactive environment for data science experimentation.

Accessing JupyterLab

  1. From ML Scenario Manager, click "Open Notebook"
  2. Or access directly from SAP Data Intelligence Launchpad

Available Kernels

  • Python 3 (with ML libraries)
  • Custom kernels (via Docker configuration)

Pre-installed Libraries

# Data Processing
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Deep Learning (available)
import tensorflow as tf
import torch

# SAP Data Intelligence SDK
from sapdi import tracking

Data Lake Access

Access SAP Data Intelligence Data Lake from notebooks:

from sapdi.datalake import DataLakeClient

client = DataLakeClient()

# Read file
df = client.read_csv('/shared/data/customers.csv')

# Write file
client.write_parquet(df, '/shared/output/processed.parquet')

Virtual Environments

Create isolated environments for dependencies:

# Create virtual environment
python -m venv /home/user/myenv

# Activate
source /home/user/myenv/bin/activate

# Install packages
pip install xgboost lightgbm catboost

Data Browser Extension

Use the Data Browser to:

  • Browse available data sources
  • Preview data
  • Import data to notebooks

Python SDK

Programmatic interface for ML operations.

SDK Installation

Pre-installed in JupyterLab and Python operators.

import sapdi
from sapdi import tracking
from sapdi import context

MLTrackingSDK Functions

Function                Description                      Limits
start_run()             Begin experiment tracking        Specify run_collection_name, run_name
end_run()               Complete tracking                Auto-adds start/end timestamps
log_param()             Log configuration values         name: 256 chars, value: 5000 chars
log_metric()            Log a numeric metric             name: 256 chars (case-sensitive)
log_metrics()           Batch-log metrics                Dictionary list format
persist_run()           Force save to storage            Auto-persists at 1.5 MB cache or on end_run
set_tags()              Key-value pairs for filtering    runName is reserved
set_labels()            UI/semantic labels               Non-filterable
delete_runs()           Remove persisted metrics         By scenario/pipeline/execution
get_runs()              Retrieve run objects             Returns metrics, params, tags
get_metrics_history()   Get metric values                Max 1000 values per metric
update_run_info()       Modify run metadata              Change name, collection, tags
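
The tagging and querying functions above are not demonstrated elsewhere in this guide. A minimal sketch of how they might be combined, assuming module-level calls and attribute-style access on run objects (exact signatures can vary by release):

from sapdi import tracking

# Tag the active run so it can be filtered later
# (runName is a reserved key, so use a custom one)
tracking.set_tags({"experiment_group": "baseline"})

# Retrieve persisted runs with their metrics, params, and tags
runs = tracking.get_runs()
for run in runs:
    print(run.name, run.metrics, run.params, run.tags)

# Fetch the logged values of one metric (max 1000 per metric)
history = tracking.get_metrics_history("accuracy")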

Metrics Tracking

import pickle

from sapdi import tracking

# Initialize tracking
with tracking.start_run(run_name="experiment_001") as run:
    # Train model (train_model and evaluate are placeholder helpers)
    model = train_model(X_train, y_train)

    # Log parameters
    run.log_param("algorithm", "RandomForest")
    run.log_param("n_estimators", 100)
    run.log_param("max_depth", 10)

    # Log metrics
    accuracy, f1 = evaluate(model, X_test, y_test)
    run.log_metric("accuracy", accuracy)
    run.log_metric("f1_score", f1)

    # Log model artifact (serialize to bytes first)
    run.log_artifact("model.pkl", pickle.dumps(model))

Tracking Parameters and Metrics

Parameters (static values):

run.log_param("learning_rate", 0.01)
run.log_param("batch_size", 32)
run.log_param("epochs", 100)

Metrics (can be logged multiple times):

for epoch in range(epochs):
    loss, val_loss = train_epoch(model, data)  # placeholder returning both losses
    run.log_metric("loss", loss, step=epoch)
    run.log_metric("val_loss", val_loss, step=epoch)

Artifact Management

# Log files
run.log_artifact("model.pkl", model_bytes)
run.log_artifact("feature_importance.png", image_bytes)

# Log directories
run.log_artifacts("./model_output/")

# Retrieve artifacts
artifacts = tracking.get_run_artifacts(run_id)
model_data = artifacts.get("model.pkl")

Artifact Class Methods

Method             Description
add_file()         Add a file to an artifact; returns a handler
create()           Create an artifact with initial content; returns its ID
delete()           Remove artifact metadata (not its content)
delete_content()   Remove the stored data
download()         Retrieve artifact contents to local storage
get()              Get artifact metadata
list()             List all artifacts in the scenario
open_file()        Get a handler for remote file access
upload()           Add files or directories to an artifact
walk()             Depth-first traversal of the artifact structure

FileHandler Methods

Method         Description
get_reader()   Returns a file-like object for reading (use in a with block)
get_writer()   Returns an object for incremental writing
read()         Load the entire remote file at once
write()        Write strings, bytes, or files to the data lake

Important: Append functionality is supported only for files between 5 MB and 5 GB (inclusive). For files smaller than 5 MB, use get_writer() for incremental writing instead.

import pickle

from sapdi.artifact import Artifact

# Create artifact
artifact_id = Artifact.create(
    name="my_model",
    description="Trained model",
    content=model_bytes
)

# List artifacts
artifacts = Artifact.list()

# Download artifact
Artifact.download(artifact_id, local_path="/tmp/model/")

# Read remote file
with Artifact.open_file(artifact_id, "model.pkl").get_reader() as f:
    model = pickle.load(f)
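
For files below the 5 MB append threshold, the note above recommends get_writer(). A sketch of incremental writing, assuming the writer supports use as a context manager; generate_chunks() is a hypothetical data source:

# Open a remote file handler on an existing artifact
handler = Artifact.open_file(artifact_id, "training_log.txt")

# Write in chunks instead of buffering everything in memory
# (context-manager support is an assumption)
with handler.get_writer() as writer:
    for chunk in generate_chunks():  # hypothetical data source
        writer.write(chunk)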

Context Information

from sapdi import context

# Get scenario information
scenario_id = context.get_scenario_id()
version_id = context.get_version_id()

# Get environment info
tenant = context.get_tenant()
user = context.get_user()

Training Pipelines

Productionize ML training workflows.

Pipeline Components

[Data Consumer] -> [Feature Engineering] -> [Model Training] -> [Metrics Logger]
        |                    |                     |                    |
   Read data          Transform data         Train model          Log results

Creating Training Pipeline

  1. Create new graph in Modeler
  2. Add data consumer operator
  3. Add Python operator for training
  4. Add Submit Metrics operator
  5. Connect and configure

Python Training Operator

def on_input(msg):
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sapdi import tracking

    # Get data
    df = pd.DataFrame(msg.body)

    # Prepare features
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train model
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Evaluate
    accuracy = model.score(X_test, y_test)

    # Track metrics
    with tracking.start_run(
        run_collection_name="classification_experiments",
        run_name="rf_training_001"
    ) as run:
        run.log_param("model_type", "RandomForest")
        run.log_metric("accuracy", accuracy)
        run.log_artifact("model.pkl", pickle.dumps(model))

    api.send("output", api.Message({"accuracy": accuracy}))

api.set_port_callback("input", on_input)

ML Pipeline Templates

Pre-built templates available:

  • Auto ML Training: Automated model selection
  • HANA ML Training: In-database training
  • TensorFlow Training: Deep learning
  • Basic Training: Generic template

Metrics Explorer

Visualize and compare ML experiments.

Accessing Metrics Explorer

  1. Open ML Scenario Manager
  2. Click "Metrics Explorer"
  3. Select scenario and version

Viewing Runs

Run List:

  • Run ID and name
  • Status (completed, failed, running)
  • Start/end time
  • Logged metrics summary

Comparing Runs

  1. Select multiple runs
  2. Click "Compare"
  3. View side-by-side metrics
  4. Visualize metric trends
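
Comparisons can also be scripted with get_runs() from the Python SDK. A sketch, assuming run objects expose their metrics as a dictionary:

from sapdi import tracking

# Rank persisted runs by a logged metric
runs = tracking.get_runs()
ranked = sorted(runs, key=lambda r: r.metrics.get("accuracy", 0.0),
                reverse=True)

for run in ranked:
    print(run.name, run.metrics.get("accuracy"))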

Metric Visualizations

Available Charts:

  • Line charts (metrics over steps)
  • Bar charts (metric comparison)
  • Scatter plots (parameter vs metric)

Filter by:

  • Date range
  • Status
  • Parameter values
  • Metric thresholds

Model Deployment

Deploy trained models for inference.

Deployment Options

Batch Inference (sketched in code below):

  • Scheduled pipeline execution
  • Process large datasets
  • Results to storage/database

Real-time Inference:

  • API endpoint deployment
  • Low-latency predictions
  • Auto-scaling
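
For the batch option, the Data Lake client from the JupyterLab section can pair with a stored model. A sketch, where model_artifact_id is a hypothetical ID of a previously logged model artifact:

import pickle

from sapdi.artifact import Artifact
from sapdi.datalake import DataLakeClient

# Load the trained model from its artifact
# (model_artifact_id is hypothetical)
with Artifact.open_file(model_artifact_id, "model.pkl").get_reader() as f:
    model = pickle.load(f)

# Score a full dataset in one pass and write results to storage
client = DataLakeClient()
df = client.read_csv('/shared/data/customers.csv')
df['prediction'] = model.predict(df)
client.write_parquet(df, '/shared/output/predictions.parquet')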

Creating Inference Pipeline

[API Input] -> [Load Model] -> [Predict] -> [API Output]

Python Inference Operator

import pickle

from sapdi.artifact import Artifact

# Cache the model at module level so it is loaded only once.
# model.predict() must be thread-safe for concurrent requests.
model = None

def load_model():
    global model
    # Get artifact metadata first
    artifacts = Artifact.list()
    model_artifact = next((a for a in artifacts if a.name == "model"), None)

    if model_artifact:
        # Download artifact and load model
        with Artifact.open_file(model_artifact.id, "model.pkl").get_reader() as f:
            model = pickle.load(f)

def on_input(msg):
    if model is None:
        load_model()

    # Get input features
    features = msg.body

    # Predict
    prediction = model.predict([features])[0]
    probability = model.predict_proba([features])[0]

    result = {
        "prediction": int(prediction),
        "probability": probability.tolist()
    }

    api.send("output", api.Message(result))

api.set_port_callback("input", on_input)

Deployment Monitoring

Track deployed model performance:

from sapdi import tracking

# Log inference metrics inside a tracking run
# (collection and run names are illustrative)
with tracking.start_run(run_collection_name="inference_monitoring",
                        run_name="daily_check") as run:
    run.log_metric("inference_latency", latency_ms)
    run.log_metric("prediction_count", count)
    run.log_metric("error_rate", errors / total)

Versioning

Manage ML scenario versions.

Creating Versions

  1. Open ML Scenario Manager
  2. Navigate to scenario
  3. Click "Create Version"
  4. Enter version name
  5. Select base version (optional)

Version Workflow

v1.0 (initial baseline)
  └── v1.1 (feature improvements)
        └── v1.2 (hyperparameter tuning)
              └── v2.0 (new architecture)
                    └── v2.1 (production release)

Branching

Create versions from any point:

v1.0 ─── v1.1 ─── v1.2
           └── v1.1-experiment (branch for testing)

Export and Import

Export:

  1. Select scenario version
  2. Click "Export"
  3. Download ZIP file

Import:

  1. Click "Import" in ML Scenario Manager
  2. Upload ZIP file
  3. Configure target location

Best Practices

Experiment Management

  1. Name Runs Descriptively: Include key parameters
  2. Log Comprehensively: All parameters and metrics
  3. Version Data: Track data versions with runs
  4. Document Experiments: Notes in notebooks
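
Practices 1 through 3 in code, reusing the tracking calls shown earlier; the naming convention and parameter values are illustrative:

from sapdi import tracking

# Practice 1: encode key parameters in the run name
run_name = "rf_n100_depth10_dataV3"

with tracking.start_run(
    run_collection_name="churn_experiments",
    run_name=run_name
) as run:
    # Practice 2: log all parameters and metrics
    run.log_param("n_estimators", 100)
    run.log_param("max_depth", 10)
    # Practice 3: record the data version behind this run
    run.log_param("dataset_version", "v3")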

Pipeline Development

  1. Start in Notebooks: Prototype in JupyterLab
  2. Modularize Code: Reusable functions
  3. Test Incrementally: Validate each component
  4. Productionize Gradually: Notebook to pipeline

Model Management

  1. Version Models: Link to training runs
  2. Validate Before Deploy: Test on holdout data
  3. Monitor Production: Track drift and performance
  4. Maintain Lineage: Data to model to prediction

Resource Management

  1. Right-size Resources: Appropriate memory/CPU
  2. Clean Up Artifacts: Remove unused experiments
  3. Archive Old Versions: Export for long-term storage
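
Run cleanup can be scripted with delete_runs() from the SDK table above. A sketch; the filter argument is an assumption based on the documented deletion scopes (scenario, pipeline, execution):

from sapdi import tracking

# Remove persisted metrics for an obsolete experiment collection
# (argument name is an assumption)
tracking.delete_runs(run_collection_name="old_experiments")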


Last Updated: 2025-11-22