# ML Scenario Manager Guide
Complete guide for machine learning in SAP Data Intelligence.
## Table of Contents
1. [Overview](#overview)
2. [ML Scenario Manager](#ml-scenario-manager)
3. [JupyterLab Environment](#jupyterlab-environment)
4. [Python SDK](#python-sdk)
5. [Training Pipelines](#training-pipelines)
6. [Metrics Explorer](#metrics-explorer)
7. [Model Deployment](#model-deployment)
8. [Versioning](#versioning)
9. [Best Practices](#best-practices)
---
## Overview
SAP Data Intelligence provides comprehensive machine learning capabilities:
**Key Components:**
- **ML Scenario Manager**: Organize and manage ML artifacts
- **JupyterLab**: Interactive data science environment
- **Python SDK**: Programmatic ML operations
- **Metrics Explorer**: Visualize and compare results
- **Pipelines**: Productionize ML workflows
---
## ML Scenario Manager
Central application for organizing data science artifacts.
### Accessing ML Scenario Manager
1. Open SAP Data Intelligence Launchpad
2. Navigate to ML Scenario Manager tile
3. View existing scenarios or create a new one
### Core Concepts
**ML Scenario:**
- Container for datasets, notebooks, pipelines
- Supports versioning and branching
- Export/import for migration
**Artifacts:**
- Datasets (registered data sources)
- Jupyter notebooks
- Pipelines (training, inference)
- Model files
### Creating a Scenario
1. Click "Create" in ML Scenario Manager
2. Enter scenario name and description
3. Choose initial version name
4. Add artifacts (datasets, notebooks, pipelines)
### Scenario Structure
```
ML Scenario: Customer Churn Prediction
├── Datasets
│ ├── customer_data (registered)
│ └── transaction_history (registered)
├── Notebooks
│ ├── 01_data_exploration.ipynb
│ ├── 02_feature_engineering.ipynb
│ └── 03_model_training.ipynb
├── Pipelines
│ ├── training_pipeline
│ └── inference_pipeline
└── Versions
├── v1.0 (initial)
├── v1.1 (improved features)
└── v2.0 (new model architecture)
```
---
## JupyterLab Environment
Interactive environment for data science experimentation.
### Accessing JupyterLab
1. From ML Scenario Manager, click "Open Notebook"
2. Or access directly from SAP Data Intelligence Launchpad
### Available Kernels
- Python 3 (with ML libraries)
- Custom kernels (via Docker configuration)
### Pre-installed Libraries
```python
# Data Processing
import pandas as pd
import numpy as np
# Machine Learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Deep Learning (available)
import tensorflow as tf
import torch
# SAP Data Intelligence SDK
from sapdi import tracking
```
### Data Lake Access
Access SAP Data Intelligence Data Lake from notebooks:
```python
from sapdi.datalake import DataLakeClient
client = DataLakeClient()
# Read file
df = client.read_csv('/shared/data/customers.csv')
# Write file
client.write_parquet(df, '/shared/output/processed.parquet')
```
### Virtual Environments
Create isolated environments for dependencies:
```bash
# Create virtual environment
python -m venv /home/user/myenv
# Activate
source /home/user/myenv/bin/activate
# Install packages
pip install xgboost lightgbm catboost
```
### Data Browser Extension
Use the Data Browser to:
- Browse available data sources
- Preview data
- Import data to notebooks
---
## Python SDK
Programmatic interface for ML operations.
### SDK Installation
Pre-installed in JupyterLab and Python operators.
```python
import sapdi
from sapdi import tracking
from sapdi import context
```
### MLTrackingSDK Functions
| Function | Description | Limits |
|----------|-------------|--------|
| `start_run()` | Begin experiment tracking | Specify run_collection_name, run_name |
| `end_run()` | Complete tracking | Auto-adds start/end timestamps |
| `log_param()` | Log configuration values | name: 256 chars, value: 5000 chars |
| `log_metric()` | Log numeric metric | name: 256 chars (case-sensitive) |
| `log_metrics()` | Batch log metrics | Dictionary list format |
| `persist_run()` | Force save to storage | Auto at 1.5MB cache or end_run |
| `set_tags()` | Key-value pairs for filtering | runName is reserved |
| `set_labels()` | UI/semantic labels | Non-filterable |
| `delete_runs()` | Remove persisted metrics | By scenario/pipeline/execution |
| `get_runs()` | Retrieve run objects | Returns metrics, params, tags |
| `get_metrics_history()` | Get metric values | Max 1000 per metric |
| `update_run_info()` | Modify run metadata | Change name, collection, tags |
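Several of these lifecycle functions are not shown in the tracking example below. The following is a minimal sketch of tagging, labeling, and forcing persistence; the call placement on the run object and the argument formats are assumptions based on the table above, not verified signatures:

```python
from sapdi import tracking

with tracking.start_run(
    run_collection_name="churn_experiments",
    run_name="rf_baseline"
) as run:
    # Tags are filterable key-value pairs ("runName" is reserved)
    run.set_tags({"dataset": "customers_v2", "team": "data-science"})

    # Labels are UI/semantic only and cannot be used for filtering
    run.set_labels({"stage": "prototype"})

    run.log_param("n_estimators", 200)
    run.log_metric("accuracy", 0.91)

    # Force a save now instead of waiting for the 1.5 MB cache limit or end_run()
    run.persist_run()
```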
### Metrics Tracking
```python
import pickle

from sklearn.metrics import f1_score
from sapdi import tracking

# Initialize tracking (train_model and the data splits are placeholders)
with tracking.start_run(run_name="experiment_001") as run:
    # Train model
    model = train_model(X_train, y_train)

    # Log parameters
    run.log_param("algorithm", "RandomForest")
    run.log_param("n_estimators", 100)
    run.log_param("max_depth", 10)

    # Log metrics
    accuracy = model.score(X_test, y_test)
    f1 = f1_score(y_test, model.predict(X_test))
    run.log_metric("accuracy", accuracy)
    run.log_metric("f1_score", f1)

    # Log the serialized model as an artifact
    run.log_artifact("model.pkl", pickle.dumps(model))
```
### Tracking Parameters and Metrics
**Parameters** (static values):
```python
run.log_param("learning_rate", 0.01)
run.log_param("batch_size", 32)
run.log_param("epochs", 100)
```
**Metrics** (can be logged multiple times):
```python
for epoch in range(epochs):
    loss = train_epoch(model, data)             # placeholder training helper
    val_loss = validate_epoch(model, val_data)  # placeholder validation helper
    run.log_metric("loss", loss, step=epoch)
    run.log_metric("val_loss", val_loss, step=epoch)
```
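When several values are logged per step, `log_metrics()` can batch them into one call. This is a sketch only; the dictionary keys below are assumptions based on the "Dictionary list format" note in the function table:

```python
for epoch in range(epochs):
    loss = train_epoch(model, data)             # placeholder training helper
    val_loss = validate_epoch(model, val_data)  # placeholder validation helper

    # Batch-log both metrics for this step (dictionary keys are assumed)
    run.log_metrics([
        {"name": "loss", "value": loss, "step": epoch},
        {"name": "val_loss", "value": val_loss, "step": epoch},
    ])
```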
### Artifact Management
```python
# Log files
run.log_artifact("model.pkl", model_bytes)
run.log_artifact("feature_importance.png", image_bytes)
# Log directories
run.log_artifacts("./model_output/")
# Retrieve artifacts
artifacts = tracking.get_run_artifacts(run_id)
model_data = artifacts.get("model.pkl")
```
### Artifact Class Methods
| Method | Description |
|--------|-------------|
| `add_file()` | Add file to artifact, returns handler |
| `create()` | Create artifact with initial content, returns ID |
| `delete()` | Remove artifact metadata (not content) |
| `delete_content()` | Remove stored data |
| `download()` | Retrieve artifact contents to local storage |
| `get()` | Get artifact metadata |
| `list()` | List all artifacts in scenario |
| `open_file()` | Get handler for remote file access |
| `upload()` | Add files/directories to artifact |
| `walk()` | Depth-first traversal of artifact structure |
### FileHandler Methods
| Method | Description |
|--------|-------------|
| `get_reader()` | Returns file-like object for reading (use with `with`) |
| `get_writer()` | Returns object for incremental writing |
| `read()` | Load entire remote file at once |
| `write()` | Write strings, bytes, or files to data lake |
**Important:** Files between 5 MB and 5 GB (inclusive) may be appended using the append functionality. For files smaller than 5 MB, use `get_writer()` for incremental writing instead.
```python
import pickle

from sapdi.artifact import Artifact

# Create artifact (model_bytes: serialized model from the training step)
artifact_id = Artifact.create(
    name="my_model",
    description="Trained model",
    content=model_bytes
)

# List artifacts
artifacts = Artifact.list()

# Download artifact
Artifact.download(artifact_id, local_path="/tmp/model/")

# Read remote file
with Artifact.open_file(artifact_id, "model.pkl").get_reader() as f:
    model = pickle.load(f)
```
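To complement the read example above, the sketch below writes a remote file incrementally with `get_writer()`. The writer is assumed to be a context-managed, file-like object; the file name and data are illustrative:

```python
from sapdi.artifact import Artifact

# Example rows to persist (illustrative data)
predictions = [("C001", 0.87), ("C002", 0.12)]

# Open a handler on a file inside an existing artifact
# (artifact_id comes from Artifact.create() or Artifact.list())
handler = Artifact.open_file(artifact_id, "predictions.csv")

# Write in chunks instead of building the whole file in memory
# (assumes the writer supports the context-manager and write() protocol)
with handler.get_writer() as writer:
    writer.write("customer_id,churn_probability\n")
    for customer_id, prob in predictions:
        writer.write(f"{customer_id},{prob:.4f}\n")
```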
### Context Information
```python
from sapdi import context
# Get scenario information
scenario_id = context.get_scenario_id()
version_id = context.get_version_id()
# Get environment info
tenant = context.get_tenant()
user = context.get_user()
```
---
## Training Pipelines
Productionize ML training workflows.
### Pipeline Components
```
[Data Consumer] -> [Feature Engineering] -> [Model Training] -> [Metrics Logger]
       |                     |                     |                   |
   Read data          Transform data          Train model         Log results
```
### Creating Training Pipeline
1. Create new graph in Modeler
2. Add data consumer operator
3. Add Python operator for training
4. Add Submit Metrics operator
5. Connect and configure
### Python Training Operator
```python
def on_input(msg):
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sapdi import tracking

    # Get data
    df = pd.DataFrame(msg.body)

    # Prepare features
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train model
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Evaluate
    accuracy = model.score(X_test, y_test)

    # Track metrics
    with tracking.start_run(
        run_collection_name="classification_experiments",
        run_name="rf_training_001"
    ) as run:
        run.log_param("model_type", "RandomForest")
        run.log_metric("accuracy", accuracy)
        run.log_artifact("model.pkl", pickle.dumps(model))

    api.send("output", api.Message({"accuracy": accuracy}))

api.set_port_callback("input", on_input)
```
### ML Pipeline Templates
Pre-built templates available:
- **Auto ML Training**: Automated model selection
- **HANA ML Training**: In-database training
- **TensorFlow Training**: Deep learning
- **Basic Training**: Generic template
---
## Metrics Explorer
Visualize and compare ML experiments.
### Accessing Metrics Explorer
1. Open ML Scenario Manager
2. Click "Metrics Explorer"
3. Select scenario and version
### Viewing Runs
**Run List:**
- Run ID and name
- Status (completed, failed, running)
- Start/end time
- Logged metrics summary
### Comparing Runs
1. Select multiple runs
2. Click "Compare"
3. View side-by-side metrics
4. Visualize metric trends
### Metric Visualizations
**Available Charts:**
- Line charts (metrics over steps)
- Bar charts (metric comparison)
- Scatter plots (parameter vs metric)
### Filtering and Search
```
Filter by:
- Date range
- Status
- Parameter values
- Metric thresholds
```
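The run data shown in Metrics Explorer can also be retrieved programmatically for custom comparisons. A hedged sketch using `get_runs()` and `get_metrics_history()` from the SDK function table; the filter argument and the attributes on the returned run objects are assumptions for illustration:

```python
from sapdi import tracking

# Retrieve persisted runs for a collection (argument name is assumed)
runs = tracking.get_runs(run_collection_name="classification_experiments")

# Compare key parameters against the final accuracy of each run
for run in runs:
    params = run.params    # assumed attribute: logged parameters
    metrics = run.metrics  # assumed attribute: latest metric values
    print(run.name, params.get("model_type"), metrics.get("accuracy"))

# Step-by-step history of one metric (limited to 1000 values per metric)
history = tracking.get_metrics_history(runs[0].id, metric_name="accuracy")
```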
---
## Model Deployment
Deploy trained models for inference.
### Deployment Options
**Batch Inference:**
- Scheduled pipeline execution
- Process large datasets
- Results to storage/database
**Real-time Inference:**
- API endpoint deployment
- Low-latency predictions
- Auto-scaling
### Creating Inference Pipeline
```
[API Input] -> [Load Model] -> [Predict] -> [API Output]
```
### Python Inference Operator
```python
import pickle

from sapdi.artifact import Artifact

# Load the model once and reuse it across requests.
# Note: model.predict() must be thread-safe for concurrent requests;
# treat the loaded model as immutable/read-only during inference.
model = None

def load_model():
    global model
    # Get artifact metadata first
    artifacts = Artifact.list()
    model_artifact = next((a for a in artifacts if a.name == "model"), None)
    if model_artifact is None:
        raise RuntimeError("Model artifact 'model' not found in scenario")
    # Download artifact and load model
    with Artifact.open_file(model_artifact.id, "model.pkl").get_reader() as f:
        model = pickle.load(f)

def on_input(msg):
    if model is None:
        load_model()

    # Get input features
    features = msg.body

    # Predict
    prediction = model.predict([features])[0]
    probability = model.predict_proba([features])[0]

    result = {
        "prediction": int(prediction),
        "probability": probability.tolist()
    }
    api.send("output", api.Message(result))

api.set_port_callback("input", on_input)
```
### Deployment Monitoring
Track deployed model performance:
```python
from sapdi import tracking

# Log inference metrics inside an active tracking run
# (latency_ms, count, errors, total are collected by the serving code)
with tracking.start_run(run_name="inference_monitoring") as run:
    run.log_metric("inference_latency", latency_ms)
    run.log_metric("prediction_count", count)
    run.log_metric("error_rate", errors / total)
```
---
## Versioning
Manage ML scenario versions.
### Creating Versions
1. Open ML Scenario Manager
2. Navigate to scenario
3. Click "Create Version"
4. Enter version name
5. Select base version (optional)
### Version Workflow
```
v1.0 (initial baseline)
└── v1.1 (feature improvements)
└── v1.2 (hyperparameter tuning)
└── v2.0 (new architecture)
└── v2.1 (production release)
```
### Branching
Create versions from any point:
```
v1.0 ─── v1.1 ─── v1.2
           └── v1.1-experiment (branch for testing)
```
### Export and Import
**Export:**
1. Select scenario version
2. Click "Export"
3. Download ZIP file
**Import:**
1. Click "Import" in ML Scenario Manager
2. Upload ZIP file
3. Configure target location
---
## Best Practices
### Experiment Management
1. **Name Runs Descriptively**: Include key parameters
2. **Log Comprehensively**: All parameters and metrics
3. **Version Data**: Track data versions with runs
4. **Document Experiments**: Notes in notebooks
### Pipeline Development
1. **Start in Notebooks**: Prototype in JupyterLab
2. **Modularize Code**: Reusable functions
3. **Test Incrementally**: Validate each component
4. **Productionize Gradually**: Notebook to pipeline
### Model Management
1. **Version Models**: Link to training runs
2. **Validate Before Deploy**: Test on holdout data
3. **Monitor Production**: Track drift and performance
4. **Maintain Lineage**: Data to model to prediction
### Resource Management
1. **Right-size Resources**: Appropriate memory/CPU
2. **Clean Up Artifacts**: Remove unused experiments
3. **Archive Old Versions**: Export for long-term storage
---
## Documentation Links
- **Machine Learning**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning)
- **ML Scenario Manager**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/ml-scenario-manager](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/ml-scenario-manager)
- **JupyterLab**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/jupyterlab-environment](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/jupyterlab-environment)
- **Python SDK**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning) (see python-sdk documentation)
---
**Last Updated**: 2025-11-22