
ML Scenario Manager Guide

Complete guide for machine learning in SAP Data Intelligence.

Table of Contents

  1. Overview
  2. ML Scenario Manager
  3. JupyterLab Environment
  4. Python SDK
  5. Training Pipelines
  6. Metrics Explorer
  7. Model Deployment
  8. Versioning
  9. Best Practices

Overview

SAP Data Intelligence provides comprehensive machine learning capabilities:

Key Components:

  • ML Scenario Manager: Organize and manage ML artifacts
  • JupyterLab: Interactive data science environment
  • Python SDK: Programmatic ML operations
  • Metrics Explorer: Visualize and compare results
  • Pipelines: Productionize ML workflows

ML Scenario Manager

Central application for organizing data science artifacts.

Accessing ML Scenario Manager

  1. Open SAP Data Intelligence Launchpad
  2. Navigate to ML Scenario Manager tile
  3. View existing scenarios or create a new one

Core Concepts

ML Scenario:

  • Container for datasets, notebooks, pipelines
  • Supports versioning and branching
  • Export/import for migration

Artifacts:

  • Datasets (registered data sources)
  • Jupyter notebooks
  • Pipelines (training, inference)
  • Model files

Creating a Scenario

  1. Click "Create" in ML Scenario Manager
  2. Enter scenario name and description
  3. Choose initial version name
  4. Add artifacts (datasets, notebooks, pipelines)

Scenario Structure

ML Scenario: Customer Churn Prediction
├── Datasets
│   ├── customer_data (registered)
│   └── transaction_history (registered)
├── Notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── Pipelines
│   ├── training_pipeline
│   └── inference_pipeline
└── Versions
    ├── v1.0 (initial)
    ├── v1.1 (improved features)
    └── v2.0 (new model architecture)

JupyterLab Environment

Interactive environment for data science experimentation.

Accessing JupyterLab

  1. From ML Scenario Manager, click "Open Notebook"
  2. Or access directly from SAP Data Intelligence Launchpad

Available Kernels

  • Python 3 (with ML libraries)
  • Custom kernels (via Docker configuration)

Pre-installed Libraries

# Data Processing
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Deep Learning (available)
import tensorflow as tf
import torch

# SAP Data Intelligence SDK
from sapdi import tracking

Data Lake Access

Access SAP Data Intelligence Data Lake from notebooks:

from sapdi.datalake import DataLakeClient

client = DataLakeClient()

# Read file
df = client.read_csv('/shared/data/customers.csv')

# Write file
client.write_parquet(df, '/shared/output/processed.parquet')

Virtual Environments

Create isolated environments for dependencies:

# Create virtual environment
python -m venv /home/user/myenv

# Activate
source /home/user/myenv/bin/activate

# Install packages
pip install xgboost lightgbm catboost

Data Browser Extension

Use the Data Browser to:

  • Browse available data sources
  • Preview data
  • Import data to notebooks

Python SDK

Programmatic interface for ML operations.

SDK Installation

Pre-installed in JupyterLab and Python operators.

import sapdi
from sapdi import tracking
from sapdi import context

MLTrackingSDK Functions

Function                Description                      Limits
start_run()             Begin experiment tracking        Specify run_collection_name, run_name
end_run()               Complete tracking                Auto-adds start/end timestamps
log_param()             Log configuration values         name: 256 chars, value: 5000 chars
log_metric()            Log a numeric metric             name: 256 chars (case-sensitive)
log_metrics()           Batch-log metrics                Dictionary list format
persist_run()           Force save to storage            Auto-persists at 1.5 MB cache or on end_run
set_tags()              Key-value pairs for filtering    runName is reserved
set_labels()            UI/semantic labels               Non-filterable
delete_runs()           Remove persisted metrics         By scenario/pipeline/execution
get_runs()              Retrieve run objects             Returns metrics, params, tags
get_metrics_history()   Get metric values                Max 1000 values per metric
update_run_info()       Modify run metadata              Change name, collection, tags
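
The tagging and querying functions above are not demonstrated elsewhere in this guide. A minimal sketch of how they might be combined, assuming module-level calls and attribute-style access on run objects (exact signatures can vary by release):

from sapdi import tracking

# Tag the active run so it can be filtered later
# (runName is a reserved key, so use a custom one)
tracking.set_tags({"experiment_group": "baseline"})

# Retrieve persisted runs with their metrics, params, and tags
runs = tracking.get_runs()
for run in runs:
    print(run.name, run.metrics, run.params, run.tags)

# Fetch the logged values of one metric (max 1000 per metric)
history = tracking.get_metrics_history("accuracy")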

Metrics Tracking

import pickle

from sapdi import tracking

# Initialize tracking
with tracking.start_run(run_name="experiment_001") as run:
    # Train model (train_model and evaluate are placeholder helpers)
    model = train_model(X_train, y_train)

    # Log parameters
    run.log_param("algorithm", "RandomForest")
    run.log_param("n_estimators", 100)
    run.log_param("max_depth", 10)

    # Log metrics
    accuracy, f1 = evaluate(model, X_test, y_test)
    run.log_metric("accuracy", accuracy)
    run.log_metric("f1_score", f1)

    # Log model artifact (serialize to bytes first)
    run.log_artifact("model.pkl", pickle.dumps(model))

Tracking Parameters and Metrics

Parameters (static values):

run.log_param("learning_rate", 0.01)
run.log_param("batch_size", 32)
run.log_param("epochs", 100)

Metrics (can be logged multiple times):

for epoch in range(epochs):
    loss, val_loss = train_epoch(model, data)  # placeholder returning both losses
    run.log_metric("loss", loss, step=epoch)
    run.log_metric("val_loss", val_loss, step=epoch)

Artifact Management

# Log files
run.log_artifact("model.pkl", model_bytes)
run.log_artifact("feature_importance.png", image_bytes)

# Log directories
run.log_artifacts("./model_output/")

# Retrieve artifacts
artifacts = tracking.get_run_artifacts(run_id)
model_data = artifacts.get("model.pkl")

Artifact Class Methods

Method             Description
add_file()         Add a file to an artifact; returns a handler
create()           Create an artifact with initial content; returns its ID
delete()           Remove artifact metadata (not its content)
delete_content()   Remove the stored data
download()         Retrieve artifact contents to local storage
get()              Get artifact metadata
list()             List all artifacts in the scenario
open_file()        Get a handler for remote file access
upload()           Add files or directories to an artifact
walk()             Depth-first traversal of the artifact structure

FileHandler Methods

Method         Description
get_reader()   Returns a file-like object for reading (use in a with block)
get_writer()   Returns an object for incremental writing
read()         Load the entire remote file at once
write()        Write strings, bytes, or files to the data lake

Important: Append functionality is supported only for files between 5 MB and 5 GB (inclusive). For files smaller than 5 MB, use get_writer() for incremental writing instead.

import pickle

from sapdi.artifact import Artifact

# Create artifact
artifact_id = Artifact.create(
    name="my_model",
    description="Trained model",
    content=model_bytes
)

# List artifacts
artifacts = Artifact.list()

# Download artifact
Artifact.download(artifact_id, local_path="/tmp/model/")

# Read remote file
with Artifact.open_file(artifact_id, "model.pkl").get_reader() as f:
    model = pickle.load(f)
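
For files below the 5 MB append threshold, the note above recommends get_writer(). A sketch of incremental writing, assuming the writer supports use as a context manager; generate_chunks() is a hypothetical data source:

# Open a remote file handler on an existing artifact
handler = Artifact.open_file(artifact_id, "training_log.txt")

# Write in chunks instead of buffering everything in memory
# (context-manager support is an assumption)
with handler.get_writer() as writer:
    for chunk in generate_chunks():  # hypothetical data source
        writer.write(chunk)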

Context Information

from sapdi import context

# Get scenario information
scenario_id = context.get_scenario_id()
version_id = context.get_version_id()

# Get environment info
tenant = context.get_tenant()
user = context.get_user()

Training Pipelines

Productionize ML training workflows.

Pipeline Components

[Data Consumer] -> [Feature Engineering] -> [Model Training] -> [Metrics Logger]
        |                    |                     |                    |
   Read data          Transform data         Train model          Log results

Creating Training Pipeline

  1. Create new graph in Modeler
  2. Add data consumer operator
  3. Add Python operator for training
  4. Add Submit Metrics operator
  5. Connect and configure

Python Training Operator

def on_input(msg):
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sapdi import tracking

    # Get data
    df = pd.DataFrame(msg.body)

    # Prepare features
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train model
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Evaluate
    accuracy = model.score(X_test, y_test)

    # Track metrics
    with tracking.start_run(
        run_collection_name="classification_experiments",
        run_name="rf_training_001"
    ) as run:
        run.log_param("model_type", "RandomForest")
        run.log_metric("accuracy", accuracy)
        run.log_artifact("model.pkl", pickle.dumps(model))

    api.send("output", api.Message({"accuracy": accuracy}))

api.set_port_callback("input", on_input)

ML Pipeline Templates

Pre-built templates available:

  • Auto ML Training: Automated model selection
  • HANA ML Training: In-database training
  • TensorFlow Training: Deep learning
  • Basic Training: Generic template

Metrics Explorer

Visualize and compare ML experiments.

Accessing Metrics Explorer

  1. Open ML Scenario Manager
  2. Click "Metrics Explorer"
  3. Select scenario and version

Viewing Runs

Run List:

  • Run ID and name
  • Status (completed, failed, running)
  • Start/end time
  • Logged metrics summary

Comparing Runs

  1. Select multiple runs
  2. Click "Compare"
  3. View side-by-side metrics
  4. Visualize metric trends
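
Comparisons can also be scripted with get_runs() from the Python SDK. A sketch, assuming run objects expose their metrics as a dictionary:

from sapdi import tracking

# Rank persisted runs by a logged metric
runs = tracking.get_runs()
ranked = sorted(runs, key=lambda r: r.metrics.get("accuracy", 0.0),
                reverse=True)

for run in ranked:
    print(run.name, run.metrics.get("accuracy"))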

Metric Visualizations

Available Charts:

  • Line charts (metrics over steps)
  • Bar charts (metric comparison)
  • Scatter plots (parameter vs metric)

Filter by:

  • Date range
  • Status
  • Parameter values
  • Metric thresholds

Model Deployment

Deploy trained models for inference.

Deployment Options

Batch Inference (sketched in code below):

  • Scheduled pipeline execution
  • Process large datasets
  • Results to storage/database

Real-time Inference:

  • API endpoint deployment
  • Low-latency predictions
  • Auto-scaling
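
For the batch option, the Data Lake client from the JupyterLab section can pair with a stored model. A sketch, where model_artifact_id is a hypothetical ID of a previously logged model artifact:

import pickle

from sapdi.artifact import Artifact
from sapdi.datalake import DataLakeClient

# Load the trained model from its artifact
# (model_artifact_id is hypothetical)
with Artifact.open_file(model_artifact_id, "model.pkl").get_reader() as f:
    model = pickle.load(f)

# Score a full dataset in one pass and write results to storage
client = DataLakeClient()
df = client.read_csv('/shared/data/customers.csv')
df['prediction'] = model.predict(df)
client.write_parquet(df, '/shared/output/predictions.parquet')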

Creating Inference Pipeline

[API Input] -> [Load Model] -> [Predict] -> [API Output]

Python Inference Operator

import pickle

from sapdi.artifact import Artifact

# Cache the model at module level so it is loaded only once.
# model.predict() must be thread-safe for concurrent requests.
model = None

def load_model():
    global model
    # Get artifact metadata first
    artifacts = Artifact.list()
    model_artifact = next((a for a in artifacts if a.name == "model"), None)

    if model_artifact:
        # Download artifact and load model
        with Artifact.open_file(model_artifact.id, "model.pkl").get_reader() as f:
            model = pickle.load(f)

def on_input(msg):
    if model is None:
        load_model()

    # Get input features
    features = msg.body

    # Predict
    prediction = model.predict([features])[0]
    probability = model.predict_proba([features])[0]

    result = {
        "prediction": int(prediction),
        "probability": probability.tolist()
    }

    api.send("output", api.Message(result))

api.set_port_callback("input", on_input)

Deployment Monitoring

Track deployed model performance:

from sapdi import tracking

# Log inference metrics inside a tracking run
# (collection and run names are illustrative)
with tracking.start_run(run_collection_name="inference_monitoring",
                        run_name="daily_check") as run:
    run.log_metric("inference_latency", latency_ms)
    run.log_metric("prediction_count", count)
    run.log_metric("error_rate", errors / total)

Versioning

Manage ML scenario versions.

Creating Versions

  1. Open ML Scenario Manager
  2. Navigate to scenario
  3. Click "Create Version"
  4. Enter version name
  5. Select base version (optional)

Version Workflow

v1.0 (initial baseline)
  └── v1.1 (feature improvements)
        └── v1.2 (hyperparameter tuning)
              └── v2.0 (new architecture)
                    └── v2.1 (production release)

Branching

Create versions from any point:

v1.0 ─── v1.1 ─── v1.2
           └── v1.1-experiment (branch for testing)

Export and Import

Export:

  1. Select scenario version
  2. Click "Export"
  3. Download ZIP file

Import:

  1. Click "Import" in ML Scenario Manager
  2. Upload ZIP file
  3. Configure target location

Best Practices

Experiment Management

  1. Name Runs Descriptively: Include key parameters
  2. Log Comprehensively: All parameters and metrics
  3. Version Data: Track data versions with runs
  4. Document Experiments: Notes in notebooks
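
Practices 1 through 3 in code, reusing the tracking calls shown earlier; the naming convention and parameter values are illustrative:

from sapdi import tracking

# Practice 1: encode key parameters in the run name
run_name = "rf_n100_depth10_dataV3"

with tracking.start_run(
    run_collection_name="churn_experiments",
    run_name=run_name
) as run:
    # Practice 2: log all parameters and metrics
    run.log_param("n_estimators", 100)
    run.log_param("max_depth", 10)
    # Practice 3: record the data version behind this run
    run.log_param("dataset_version", "v3")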

Pipeline Development

  1. Start in Notebooks: Prototype in JupyterLab
  2. Modularize Code: Reusable functions
  3. Test Incrementally: Validate each component
  4. Productionize Gradually: Notebook to pipeline

Model Management

  1. Version Models: Link to training runs
  2. Validate Before Deploy: Test on holdout data
  3. Monitor Production: Track drift and performance
  4. Maintain Lineage: Data to model to prediction

Resource Management

  1. Right-size Resources: Appropriate memory/CPU
  2. Clean Up Artifacts: Remove unused experiments
  3. Archive Old Versions: Export for long-term storage
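
Run cleanup can be scripted with delete_runs() from the SDK table above. A sketch; the filter argument is an assumption based on the documented deletion scopes (scenario, pipeline, execution):

from sapdi import tracking

# Remove persisted metrics for an obsolete experiment collection
# (argument name is an assumption)
tracking.delete_runs(run_collection_name="old_experiments")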


Last Updated: 2025-11-22