# ML Scenario Manager Guide
Complete guide for machine learning in SAP Data Intelligence.
## Table of Contents
1. [Overview](#overview)
2. [ML Scenario Manager](#ml-scenario-manager)
3. [JupyterLab Environment](#jupyterlab-environment)
4. [Python SDK](#python-sdk)
5. [Training Pipelines](#training-pipelines)
6. [Metrics Explorer](#metrics-explorer)
7. [Model Deployment](#model-deployment)
8. [Versioning](#versioning)
9. [Best Practices](#best-practices)
---
## Overview
SAP Data Intelligence provides comprehensive machine learning capabilities:
**Key Components:**
- **ML Scenario Manager**: Organize and manage ML artifacts
- **JupyterLab**: Interactive data science environment
- **Python SDK**: Programmatic ML operations
- **Metrics Explorer**: Visualize and compare results
- **Pipelines**: Productionize ML workflows
---
## ML Scenario Manager
Central application for organizing data science artifacts.
### Accessing ML Scenario Manager
1. Open SAP Data Intelligence Launchpad
2. Navigate to ML Scenario Manager tile
3. View existing scenarios or create a new one
### Core Concepts
**ML Scenario:**
- Container for datasets, notebooks, pipelines
- Supports versioning and branching
- Export/import for migration
**Artifacts:**
- Datasets (registered data sources)
- Jupyter notebooks
- Pipelines (training, inference)
- Model files
### Creating a Scenario
1. Click "Create" in ML Scenario Manager
2. Enter scenario name and description
3. Choose initial version name
4. Add artifacts (datasets, notebooks, pipelines)
### Scenario Structure
```
ML Scenario: Customer Churn Prediction
├── Datasets
│ ├── customer_data (registered)
│ └── transaction_history (registered)
├── Notebooks
│ ├── 01_data_exploration.ipynb
│ ├── 02_feature_engineering.ipynb
│ └── 03_model_training.ipynb
├── Pipelines
│ ├── training_pipeline
│ └── inference_pipeline
└── Versions
├── v1.0 (initial)
├── v1.1 (improved features)
└── v2.0 (new model architecture)
```
---
## JupyterLab Environment
Interactive environment for data science experimentation.
### Accessing JupyterLab
1. From ML Scenario Manager, click "Open Notebook"
2. Or access directly from SAP Data Intelligence Launchpad
### Available Kernels
- Python 3 (with ML libraries)
- Custom kernels (via Docker configuration)
### Pre-installed Libraries
```python
# Data Processing
import pandas as pd
import numpy as np
# Machine Learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Deep Learning (available)
import tensorflow as tf
import torch
# SAP Data Intelligence SDK
from sapdi import tracking
```
### Data Lake Access
Access SAP Data Intelligence Data Lake from notebooks:
```python
from sapdi.datalake import DataLakeClient
client = DataLakeClient()
# Read file
df = client.read_csv('/shared/data/customers.csv')
# Write file
client.write_parquet(df, '/shared/output/processed.parquet')
```
### Virtual Environments
Create isolated environments for dependencies:
```bash
# Create virtual environment
python -m venv /home/user/myenv
# Activate
source /home/user/myenv/bin/activate
# Install packages
pip install xgboost lightgbm catboost
```
### Data Browser Extension
Use the Data Browser to:
- Browse available data sources
- Preview data
- Import data to notebooks
---
## Python SDK
Programmatic interface for ML operations.
### SDK Installation
Pre-installed in JupyterLab and Python operators.
```python
import sapdi
from sapdi import tracking
from sapdi import context
```
### MLTrackingSDK Functions
| Function | Description | Limits |
|----------|-------------|--------|
| `start_run()` | Begin experiment tracking | Specify run_collection_name, run_name |
| `end_run()` | Complete tracking | Auto-adds start/end timestamps |
| `log_param()` | Log configuration values | name: 256 chars, value: 5000 chars |
| `log_metric()` | Log numeric metric | name: 256 chars (case-sensitive) |
| `log_metrics()` | Batch log metrics | Dictionary list format |
| `persist_run()` | Force save to storage | Auto at 1.5MB cache or end_run |
| `set_tags()` | Key-value pairs for filtering | runName is reserved |
| `set_labels()` | UI/semantic labels | Non-filterable |
| `delete_runs()` | Remove persisted metrics | By scenario/pipeline/execution |
| `get_runs()` | Retrieve run objects | Returns metrics, params, tags |
| `get_metrics_history()` | Get metric values | Max 1000 per metric |
| `update_run_info()` | Modify run metadata | Change name, collection, tags |
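Several of these lifecycle functions are not shown in the tracking example below. The following is a minimal sketch of tagging, labeling, and forcing persistence; the call placement on the run object and the argument formats are assumptions based on the table above, not verified signatures:

```python
from sapdi import tracking

with tracking.start_run(
    run_collection_name="churn_experiments",
    run_name="rf_baseline"
) as run:
    # Tags are filterable key-value pairs ("runName" is reserved)
    run.set_tags({"dataset": "customers_v2", "team": "data-science"})

    # Labels are UI/semantic only and cannot be used for filtering
    run.set_labels({"stage": "prototype"})

    run.log_param("n_estimators", 200)
    run.log_metric("accuracy", 0.91)

    # Force a save now instead of waiting for the 1.5 MB cache limit or end_run()
    run.persist_run()
```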
### Metrics Tracking
```python
import pickle

from sklearn.metrics import f1_score
from sapdi import tracking

# Initialize tracking (train_model and the data splits are placeholders)
with tracking.start_run(run_name="experiment_001") as run:
    # Train model
    model = train_model(X_train, y_train)

    # Log parameters
    run.log_param("algorithm", "RandomForest")
    run.log_param("n_estimators", 100)
    run.log_param("max_depth", 10)

    # Log metrics
    accuracy = model.score(X_test, y_test)
    f1 = f1_score(y_test, model.predict(X_test))
    run.log_metric("accuracy", accuracy)
    run.log_metric("f1_score", f1)

    # Log the serialized model as an artifact
    run.log_artifact("model.pkl", pickle.dumps(model))
```
### Tracking Parameters and Metrics
**Parameters** (static values):
```python
run.log_param("learning_rate", 0.01)
run.log_param("batch_size", 32)
run.log_param("epochs", 100)
```
**Metrics** (can be logged multiple times):
```python
for epoch in range(epochs):
    loss = train_epoch(model, data)             # placeholder training helper
    val_loss = validate_epoch(model, val_data)  # placeholder validation helper
    run.log_metric("loss", loss, step=epoch)
    run.log_metric("val_loss", val_loss, step=epoch)
```
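When several values are logged per step, `log_metrics()` can batch them into one call. This is a sketch only; the dictionary keys below are assumptions based on the "Dictionary list format" note in the function table:

```python
for epoch in range(epochs):
    loss = train_epoch(model, data)             # placeholder training helper
    val_loss = validate_epoch(model, val_data)  # placeholder validation helper

    # Batch-log both metrics for this step (dictionary keys are assumed)
    run.log_metrics([
        {"name": "loss", "value": loss, "step": epoch},
        {"name": "val_loss", "value": val_loss, "step": epoch},
    ])
```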
### Artifact Management
```python
# Log files
run.log_artifact("model.pkl", model_bytes)
run.log_artifact("feature_importance.png", image_bytes)
# Log directories
run.log_artifacts("./model_output/")
# Retrieve artifacts
artifacts = tracking.get_run_artifacts(run_id)
model_data = artifacts.get("model.pkl")
```
### Artifact Class Methods
| Method | Description |
|--------|-------------|
| `add_file()` | Add file to artifact, returns handler |
| `create()` | Create artifact with initial content, returns ID |
| `delete()` | Remove artifact metadata (not content) |
| `delete_content()` | Remove stored data |
| `download()` | Retrieve artifact contents to local storage |
| `get()` | Get artifact metadata |
| `list()` | List all artifacts in scenario |
| `open_file()` | Get handler for remote file access |
| `upload()` | Add files/directories to artifact |
| `walk()` | Depth-first traversal of artifact structure |
### FileHandler Methods
| Method | Description |
|--------|-------------|
| `get_reader()` | Returns file-like object for reading (use with `with`) |
| `get_writer()` | Returns object for incremental writing |
| `read()` | Load entire remote file at once |
| `write()` | Write strings, bytes, or files to data lake |
**Important:** Files between 5 MB and 5 GB (inclusive) may be appended using the append functionality. For files smaller than 5 MB, use `get_writer()` for incremental writing instead.
```python
import pickle

from sapdi.artifact import Artifact

# Create artifact (model_bytes: serialized model from the training step)
artifact_id = Artifact.create(
    name="my_model",
    description="Trained model",
    content=model_bytes
)

# List artifacts
artifacts = Artifact.list()

# Download artifact
Artifact.download(artifact_id, local_path="/tmp/model/")

# Read remote file
with Artifact.open_file(artifact_id, "model.pkl").get_reader() as f:
    model = pickle.load(f)
```
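To complement the read example above, the sketch below writes a remote file incrementally with `get_writer()`. The writer is assumed to be a context-managed, file-like object; the file name and data are illustrative:

```python
from sapdi.artifact import Artifact

# Example rows to persist (illustrative data)
predictions = [("C001", 0.87), ("C002", 0.12)]

# Open a handler on a file inside an existing artifact
# (artifact_id comes from Artifact.create() or Artifact.list())
handler = Artifact.open_file(artifact_id, "predictions.csv")

# Write in chunks instead of building the whole file in memory
# (assumes the writer supports the context-manager and write() protocol)
with handler.get_writer() as writer:
    writer.write("customer_id,churn_probability\n")
    for customer_id, prob in predictions:
        writer.write(f"{customer_id},{prob:.4f}\n")
```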
### Context Information
```python
from sapdi import context
# Get scenario information
scenario_id = context.get_scenario_id()
version_id = context.get_version_id()
# Get environment info
tenant = context.get_tenant()
user = context.get_user()
```
---
## Training Pipelines
Productionize ML training workflows.
### Pipeline Components
```
[Data Consumer] -> [Feature Engineering] -> [Model Training] -> [Metrics Logger]
       |                     |                     |                   |
   Read data          Transform data          Train model         Log results
```
### Creating Training Pipeline
1. Create new graph in Modeler
2. Add data consumer operator
3. Add Python operator for training
4. Add Submit Metrics operator
5. Connect and configure
### Python Training Operator
```python
def on_input(msg):
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sapdi import tracking

    # Get data
    df = pd.DataFrame(msg.body)

    # Prepare features
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train model
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Evaluate
    accuracy = model.score(X_test, y_test)

    # Track metrics
    with tracking.start_run(
        run_collection_name="classification_experiments",
        run_name="rf_training_001"
    ) as run:
        run.log_param("model_type", "RandomForest")
        run.log_metric("accuracy", accuracy)
        run.log_artifact("model.pkl", pickle.dumps(model))

    api.send("output", api.Message({"accuracy": accuracy}))

api.set_port_callback("input", on_input)
```
### ML Pipeline Templates
Pre-built templates available:
- **Auto ML Training**: Automated model selection
- **HANA ML Training**: In-database training
- **TensorFlow Training**: Deep learning
- **Basic Training**: Generic template
---
## Metrics Explorer
Visualize and compare ML experiments.
### Accessing Metrics Explorer
1. Open ML Scenario Manager
2. Click "Metrics Explorer"
3. Select scenario and version
### Viewing Runs
**Run List:**
- Run ID and name
- Status (completed, failed, running)
- Start/end time
- Logged metrics summary
### Comparing Runs
1. Select multiple runs
2. Click "Compare"
3. View side-by-side metrics
4. Visualize metric trends
### Metric Visualizations
**Available Charts:**
- Line charts (metrics over steps)
- Bar charts (metric comparison)
- Scatter plots (parameter vs metric)
### Filtering and Search
```
Filter by:
- Date range
- Status
- Parameter values
- Metric thresholds
```
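The run data shown in Metrics Explorer can also be retrieved programmatically for custom comparisons. A hedged sketch using `get_runs()` and `get_metrics_history()` from the SDK function table; the filter argument and the attributes on the returned run objects are assumptions for illustration:

```python
from sapdi import tracking

# Retrieve persisted runs for a collection (argument name is assumed)
runs = tracking.get_runs(run_collection_name="classification_experiments")

# Compare key parameters against the final accuracy of each run
for run in runs:
    params = run.params    # assumed attribute: logged parameters
    metrics = run.metrics  # assumed attribute: latest metric values
    print(run.name, params.get("model_type"), metrics.get("accuracy"))

# Step-by-step history of one metric (limited to 1000 values per metric)
history = tracking.get_metrics_history(runs[0].id, metric_name="accuracy")
```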
---
## Model Deployment
Deploy trained models for inference.
### Deployment Options
**Batch Inference:**
- Scheduled pipeline execution
- Process large datasets
- Results to storage/database
**Real-time Inference:**
- API endpoint deployment
- Low-latency predictions
- Auto-scaling
### Creating Inference Pipeline
```
[API Input] -> [Load Model] -> [Predict] -> [API Output]
```
### Python Inference Operator
```python
import pickle

from sapdi.artifact import Artifact

# Load the model once and reuse it across requests.
# Note: model.predict() must be thread-safe for concurrent requests;
# treat the loaded model as immutable/read-only during inference.
model = None

def load_model():
    global model
    # Get artifact metadata first
    artifacts = Artifact.list()
    model_artifact = next((a for a in artifacts if a.name == "model"), None)
    if model_artifact is None:
        raise RuntimeError("Model artifact 'model' not found in scenario")
    # Download artifact and load model
    with Artifact.open_file(model_artifact.id, "model.pkl").get_reader() as f:
        model = pickle.load(f)

def on_input(msg):
    if model is None:
        load_model()

    # Get input features
    features = msg.body

    # Predict
    prediction = model.predict([features])[0]
    probability = model.predict_proba([features])[0]

    result = {
        "prediction": int(prediction),
        "probability": probability.tolist()
    }
    api.send("output", api.Message(result))

api.set_port_callback("input", on_input)
```
### Deployment Monitoring
Track deployed model performance:
```python
from sapdi import tracking

# Log inference metrics inside an active tracking run
# (latency_ms, count, errors, total are collected by the serving code)
with tracking.start_run(run_name="inference_monitoring") as run:
    run.log_metric("inference_latency", latency_ms)
    run.log_metric("prediction_count", count)
    run.log_metric("error_rate", errors / total)
```
---
## Versioning
Manage ML scenario versions.
### Creating Versions
1. Open ML Scenario Manager
2. Navigate to scenario
3. Click "Create Version"
4. Enter version name
5. Select base version (optional)
### Version Workflow
```
v1.0 (initial baseline)
└── v1.1 (feature improvements)
└── v1.2 (hyperparameter tuning)
└── v2.0 (new architecture)
└── v2.1 (production release)
```
### Branching
Create versions from any point:
```
v1.0 ─── v1.1 ─── v1.2
           └── v1.1-experiment (branch for testing)
```
### Export and Import
**Export:**
1. Select scenario version
2. Click "Export"
3. Download ZIP file
**Import:**
1. Click "Import" in ML Scenario Manager
2. Upload ZIP file
3. Configure target location
---
## Best Practices
### Experiment Management
1. **Name Runs Descriptively**: Include key parameters
2. **Log Comprehensively**: All parameters and metrics
3. **Version Data**: Track data versions with runs
4. **Document Experiments**: Notes in notebooks
### Pipeline Development
1. **Start in Notebooks**: Prototype in JupyterLab
2. **Modularize Code**: Reusable functions
3. **Test Incrementally**: Validate each component
4. **Productionize Gradually**: Notebook to pipeline
### Model Management
1. **Version Models**: Link to training runs
2. **Validate Before Deploy**: Test on holdout data
3. **Monitor Production**: Track drift and performance
4. **Maintain Lineage**: Data to model to prediction
### Resource Management
1. **Right-size Resources**: Appropriate memory/CPU
2. **Clean Up Artifacts**: Remove unused experiments
3. **Archive Old Versions**: Export for long-term storage
---
## Documentation Links
- **Machine Learning**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning)
- **ML Scenario Manager**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/ml-scenario-manager](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/ml-scenario-manager)
- **JupyterLab**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/jupyterlab-environment](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning/jupyterlab-environment)
- **Python SDK**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/machinelearning) (see python-sdk documentation)
---
**Last Updated**: 2025-11-22