ML Operations Reference
Complete reference for SAP AI Core ML training and operations.
Documentation Source: https://github.com/SAP-docs/sap-artificial-intelligence/tree/main/docs/sap-ai-core
Overview
SAP AI Core runs training pipelines on Argo Workflows, which supports batch jobs for data preprocessing, model training, and batch inference.
Key Components
| Component | Description |
|---|---|
| Scenarios | AI use case implementations |
| Executables | Reusable workflow templates |
| Configurations | Parameters and artifact bindings |
| Executions | Running instances of workflows |
| Artifacts | Datasets, models, and results |
Workflow Engine
Argo Workflows
SAP AI Core uses Argo Workflows (container-native workflow engine) supporting:
- Directed Acyclic Graph (DAG) structures
- Parallel step execution
- Container-based steps
- Data ingestion and preprocessing
- Model training and batch inference
Limitation: Not optimized for time-critical tasks due to scheduling overhead.
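To make the DAG and parallel-step ideas concrete, the sketch below groups steps into batches whose members could run at the same time. It uses Python's standard graphlib module and hypothetical step names; it only illustrates the scheduling concept and is not how Argo Workflows is implemented.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group DAG steps into batches; every step in a batch has all its dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that can start now
        batches.append(ready)
        sorter.done(*ready)  # mark them finished so dependents become ready
    return batches


# Hypothetical pipeline: two preprocessing steps feed one training step.
steps = {
    "clean-text": set(),
    "build-vocab": set(),
    "train": {"clean-text", "build-vocab"},
    "evaluate": {"train"},
}
print(parallel_batches(steps))
```

Each inner list is one batch: the two preprocessing steps share a batch because neither depends on the other, while train and evaluate must wait for their predecessors.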
Prerequisites
1. Object Store Secret (Required)
Create a secret named default for training output artifacts:
curl -X POST "$AI_API_URL/v2/admin/objectStoreSecrets" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "default",
"type": "S3",
"pathPrefix": "my-bucket/training-output",
"data": {
"AWS_ACCESS_KEY_ID": "<access-key>",
"AWS_SECRET_ACCESS_KEY": "<secret-key>"
}
}'
Note: Without a default secret, training pipelines will fail.
2. Docker Registry Secret
For custom training images:
curl -X POST "$AI_API_URL/v2/admin/dockerRegistrySecrets" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "docker-registry",
"data": {
".dockerconfigjson": "<base64-encoded-docker-config>"
}
}'
3. Git Repository
Sync workflow templates from Git:
curl -X POST "$AI_API_URL/v2/admin/repositories" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "training-repo",
"url": "https://github.com/org/training-workflows",
"username": "<git-user>",
"password": "<git-token>"
}'
Workflow Template
Basic Structure
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: text-classifier-training
  annotations:
    scenarios.ai.sap.com/description: "Train text classification model"
    scenarios.ai.sap.com/name: "text-classifier"
    executables.ai.sap.com/description: "Training executable"
    executables.ai.sap.com/name: "text-classifier-train"
    artifacts.ai.sap.com/training-data.kind: "dataset"
    artifacts.ai.sap.com/trained-model.kind: "model"
  labels:
    scenarios.ai.sap.com/id: "text-classifier"
    executables.ai.sap.com/id: "text-classifier-train"
    ai.sap.com/version: "1.0.0"
spec:
  imagePullSecrets:
    - name: docker-registry
  entrypoint: main
  arguments:
    parameters:
      - name: learning_rate
        default: "0.001"
      - name: epochs
        default: "10"
    artifacts:
      - name: training-data
        path: /data/input
        archive:
          none: {}
  templates:
    - name: main
      steps:
        - - name: preprocess
            template: preprocess-data
        - - name: train
            template: train-model
        - - name: evaluate
            template: evaluate-model
    - name: preprocess-data
      container:
        image: my-registry/preprocessing:latest
        command: ["python", "preprocess.py"]
        args: ["--input", "/data/input", "--output", "/data/processed"]
    - name: train-model
      container:
        image: my-registry/training:latest
        command: ["python", "train.py"]
        args:
          - "--data=/data/processed"
          - "--lr={{workflow.parameters.learning_rate}}"
          - "--epochs={{workflow.parameters.epochs}}"
          - "--output=/data/model"
      outputs:
        artifacts:
          - name: trained-model
            path: /data/model
            globalName: trained-model
            archive:
              none: {}
    - name: evaluate-model
      container:
        image: my-registry/evaluation:latest
        command: ["python", "evaluate.py"]
        args: ["--model", "/data/model"]
Annotations Reference
| Annotation | Description |
|---|---|
| scenarios.ai.sap.com/name | Human-readable scenario name |
| scenarios.ai.sap.com/id | Scenario identifier |
| executables.ai.sap.com/name | Executable name |
| executables.ai.sap.com/id | Executable identifier |
| artifacts.ai.sap.com/<name>.kind | Artifact type (dataset, model, etc.) |
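A quick pre-commit check can catch a template that is missing these keys before it is synced. The sketch below works on the metadata section of a template already loaded into a Python dict. Treating exactly these four keys as required, and expecting the id keys under labels as in the template above, are assumptions made for illustration; the function name is hypothetical.

```python
# Keys this sketch treats as required (an assumption based on the example template).
REQUIRED_ANNOTATIONS = ("scenarios.ai.sap.com/name", "executables.ai.sap.com/name")
REQUIRED_LABELS = ("scenarios.ai.sap.com/id", "executables.ai.sap.com/id")


def missing_metadata_keys(metadata: dict) -> list[str]:
    """Return the required annotation and label keys absent from a template's metadata."""
    annotations = metadata.get("annotations") or {}
    labels = metadata.get("labels") or {}
    missing = [f"annotations.{key}" for key in REQUIRED_ANNOTATIONS if key not in annotations]
    missing += [f"labels.{key}" for key in REQUIRED_LABELS if key not in labels]
    return missing


# Example metadata with the executable id label left out on purpose.
metadata = {
    "annotations": {
        "scenarios.ai.sap.com/name": "text-classifier",
        "executables.ai.sap.com/name": "text-classifier-train",
    },
    "labels": {"scenarios.ai.sap.com/id": "text-classifier"},
}
print(missing_metadata_keys(metadata))
```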
Artifacts
Types
| Kind | Description | Use Case |
|---|---|---|
| dataset | Training/validation data | Input for training |
| model | Trained model | Output from training |
| resultset | Inference results | Output from batch inference |
| other | Miscellaneous | Logs, metrics, configs |
Register Input Artifact
curl -X POST "$AI_API_URL/v2/lm/artifacts" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "training-dataset-v1",
"kind": "dataset",
"url": "ai://default/datasets/training-v1",
"scenarioId": "text-classifier",
"description": "Training dataset version 1"
}'
URL Format
- ai://default/<path>: uses the default object store secret
- ai://<secret-name>/<path>: uses the named object store secret
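The two URL forms differ only in the secret name that follows ai://. As a minimal sketch, assuming the URL always has the shape ai://<secret-name>/<path>, the hypothetical helper below splits an artifact URL into those two parts with the standard library.

```python
from urllib.parse import urlsplit


def parse_artifact_url(url: str) -> tuple[str, str]:
    """Split an ai:// artifact URL into (object store secret name, path)."""
    parts = urlsplit(url)
    if parts.scheme != "ai" or not parts.netloc:
        raise ValueError(f"not an ai://<secret-name>/<path> URL: {url!r}")
    return parts.netloc, parts.path.lstrip("/")


secret_name, path = parse_artifact_url("ai://default/datasets/training-v1")
print(secret_name, path)
```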
List Artifacts
curl -X GET "$AI_API_URL/v2/lm/artifacts?scenarioId=text-classifier" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
Configurations
Create Training Configuration
curl -X POST "$AI_API_URL/v2/lm/configurations" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "text-classifier-config-v1",
"executableId": "text-classifier-train",
"scenarioId": "text-classifier",
"parameterBindings": [
{"key": "learning_rate", "value": "0.001"},
{"key": "epochs", "value": "20"},
{"key": "batch_size", "value": "32"}
],
"inputArtifactBindings": [
{"key": "training-data", "artifactId": "<dataset-artifact-id>"}
]
}'
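When configurations are generated from a script, the request body can be assembled from plain dictionaries. The sketch below mirrors the field names in the request above and converts every parameter value to a string, as in that example. The function name and its arguments are hypothetical, and it only builds the payload; it does not call the API.

```python
import json


def build_configuration(
    name: str,
    scenario_id: str,
    executable_id: str,
    parameters: dict,
    input_artifacts: dict,
) -> dict:
    """Build a configuration request body; parameter values are sent as strings."""
    return {
        "name": name,
        "executableId": executable_id,
        "scenarioId": scenario_id,
        "parameterBindings": [
            {"key": key, "value": str(value)} for key, value in parameters.items()
        ],
        "inputArtifactBindings": [
            {"key": key, "artifactId": artifact_id}
            for key, artifact_id in input_artifacts.items()
        ],
    }


payload = build_configuration(
    name="text-classifier-config-v1",
    scenario_id="text-classifier",
    executable_id="text-classifier-train",
    parameters={"learning_rate": 0.001, "epochs": 20},
    input_artifacts={"training-data": "<dataset-artifact-id>"},
)
print(json.dumps(payload, indent=2))
```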
Executions
Create Execution
curl -X POST "$AI_API_URL/v2/lm/executions" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"configurationId": "<configuration-id>"
}'
Execution Statuses
| Status | Description |
|---|---|
| UNKNOWN | Initial state |
| PENDING | Queued for execution |
| RUNNING | Currently executing |
| COMPLETED | Finished successfully |
| DEAD | Failed |
| STOPPED | Manually stopped |
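A client usually polls the execution until it reaches a status that will not change again. The sketch below assumes COMPLETED, DEAD, and STOPPED are the terminal statuses, based on the descriptions in the table. The get_status callable is a hypothetical stand-in for a function that fetches the current status from the API, so the demo runs without network access.

```python
import time
from typing import Callable

# Statuses assumed to be terminal, based on the table above.
TERMINAL_STATUSES = frozenset({"COMPLETED", "DEAD", "STOPPED"})


def wait_for_execution(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal status, then return that status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"no terminal status after {max_polls} polls")


# Demo with a canned status sequence instead of real API calls.
fake_statuses = iter(["UNKNOWN", "PENDING", "RUNNING", "RUNNING", "COMPLETED"])
final = wait_for_execution(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)
```

Passing sleep as a parameter keeps the function easy to test; in real use the default time.sleep applies.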
Check Execution Status
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
Get Execution Logs
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/logs" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
Stop Execution
curl -X PATCH "$AI_API_URL/v2/lm/executions/<execution-id>" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{"targetStatus": "STOPPED"}'
Metrics
Write Metrics from Training
In your training code:
import os

import requests


def log_metrics(metrics: dict, step: int) -> None:
    """Log metrics to SAP AI Core."""
    # Assumes these environment variables are set in the training container.
    api_url = os.environ["AICORE_API_URL"]
    token = os.environ["AICORE_AUTH_TOKEN"]
    execution_id = os.environ["AICORE_EXECUTION_ID"]
    response = requests.post(
        f"{api_url}/v2/lm/executions/{execution_id}/metrics",
        headers={
            "Authorization": f"Bearer {token}",
            "AI-Resource-Group": "default",
            "Content-Type": "application/json",
        },
        json={
            "metrics": [
                {"name": name, "value": value, "step": step}
                for name, value in metrics.items()
            ]
        },
        timeout=30,
    )
    # Surface HTTP errors instead of silently dropping the metrics.
    response.raise_for_status()


# Usage in training loop. train_epoch() and validate() stand in for your own
# training code; validate() is assumed to return (val_loss, accuracy).
epochs = 10
for epoch in range(epochs):
    train_loss = train_epoch()
    val_loss, accuracy = validate()
    log_metrics({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "accuracy": accuracy,
    }, step=epoch)
Read Metrics
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/metrics" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
Training Schedules
Create Schedule
curl -X POST "$AI_API_URL/v2/lm/executionSchedules" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"configurationId": "<configuration-id>",
"cron": "0 0 * * sun",
"start": "2024-01-01T00:00:00Z",
"end": "2024-12-31T23:59:59Z"
}'
Cron Expression Format
SAP AI Core uses 5-field cron expressions with 3-letter day-of-week names:
┌───────── minute (0-59)
│ ┌─────── hour (0-23)
│ │ ┌───── day of month (1-31)
│ │ │ ┌─── month (1-12)
│ │ │ │ ┌─ day of week (mon, tue, wed, thu, fri, sat, sun)
│ │ │ │ │
* * * * *
Examples:
- 0 0 * * *: Daily at midnight
- 0 0 * * sun: Weekly on Sunday
- 0 0 * * fri: Weekly on Friday
- 0 0 1 * *: Monthly on the 1st
- 0 */6 * * *: Every 6 hours
Note: Using * * * * * treats the schedule as "Run Always" (continuous check), which differs from standard cron behavior. Minimum interval for pipeline schedules is 1 hour.
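A client-side sanity check can catch malformed expressions before a schedule is created. The sketch below only checks the field count and the day-of-week names listed above, and flags the all-wildcard expression described in the note. It is not a full cron validator, the function names are hypothetical, and the service may apply rules this sketch does not know about.

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")
DAY_NAMES = {"mon", "tue", "wed", "thu", "fri", "sat", "sun"}


def split_cron(expression: str) -> dict[str, str]:
    """Split a 5-field cron expression into named fields, with light validation."""
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 fields, got {len(fields)}: {expression!r}")
    day_of_week = fields[4].lower()
    # Only purely alphabetic values are checked against the 3-letter names.
    if day_of_week.isalpha() and day_of_week not in DAY_NAMES:
        raise ValueError(f"unknown day-of-week name: {fields[4]!r}")
    return dict(zip(FIELD_NAMES, fields))


def is_run_always(expression: str) -> bool:
    """True for '* * * * *', which the note above says is treated as 'Run Always'."""
    return all(field == "*" for field in split_cron(expression).values())


print(split_cron("0 0 * * sun"))
print(is_run_always("* * * * *"))
```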
List Schedules
curl -X GET "$AI_API_URL/v2/lm/executionSchedules" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
Delete Schedule
curl -X DELETE "$AI_API_URL/v2/lm/executionSchedules/<schedule-id>" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
SAP AI Launchpad
ML Operations App
Access: Workspaces → ML Operations
Features:
- View scenarios and executables
- Create/manage configurations
- Run/monitor executions
- View training metrics
- Manage artifacts
- Create schedules
Required Roles
| Role | Capabilities |
|---|---|
| operations_manager | Access ML Operations app |
| mloperations_viewer | View-only access |
| mloperations_editor | Full edit access |
Comparing Runs
- Navigate to ML Operations → Executions
- Select multiple executions
- Click "Compare"
- View side-by-side metrics and parameters
Best Practices
Workflow Design
- Modular steps: Break workflow into reusable templates
- Parameterization: Use parameters for hyperparameters
- Artifact management: Define clear input/output artifacts
- Error handling: Include retry logic for flaky operations
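For the error-handling point, one option is application-level retry inside a step's own Python code. The sketch below retries a callable with exponential backoff. The helper name, the default values, and the flaky download function are hypothetical, and a real step would usually catch a narrower exception type than Exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the last error propagate
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Demo: a hypothetical download that fails twice before succeeding.
calls = {"count": 0}


def flaky_download() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "dataset.csv"


delays: list[float] = []
result = retry(flaky_download, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```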
Resource Management
- Appropriate sizing: Match container resources to workload
- GPU allocation: Request GPUs only when needed
- Storage: Use object store for large datasets
- Cleanup: Delete old executions and artifacts
Monitoring
- Log metrics: Track loss, accuracy, etc. during training
- Check logs: Review execution logs for errors
- Compare runs: Analyze different hyperparameter settings
- Set alerts: Monitor for failed executions
Troubleshooting
Execution Failed
- Check execution logs: GET /v2/lm/executions/{id}/logs
- Verify object store secret exists and is named default
- Check Docker image is accessible
- Verify artifact paths are correct
- Check resource quota not exceeded
Artifacts Not Found
- Verify artifact URL format: ai://default/<path>
- Check object store secret permissions
- Verify file exists in object store
- Check artifact registered in correct scenario
Schedule Not Running
- Verify schedule is active (not paused)
- Check cron expression is valid
- Verify start/end dates bracket current time
- Check configuration still exists
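The start/end check can be scripted. The sketch below assumes the timestamps use the same UTC format as the Create Schedule example (for instance 2024-01-01T00:00:00Z); the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_utc(timestamp: str) -> datetime:
    """Parse a timestamp such as 2024-01-01T00:00:00Z into an aware UTC datetime."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def schedule_window_contains(start: str, end: str, now: Optional[datetime] = None) -> bool:
    """Return True if now (default: the current UTC time) lies between start and end."""
    if now is None:
        now = datetime.now(timezone.utc)
    return parse_utc(start) <= now <= parse_utc(end)


# Fixed reference time so the demo result does not depend on when it runs.
reference = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
print(schedule_window_contains("2024-01-01T00:00:00Z", "2024-12-31T23:59:59Z", reference))
```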
Documentation Links
- Training Overview: https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/train-your-model-a9ceb06.md
- ML Operations (Launchpad): https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-launchpad/ml-operations-df78271.md
- Schedules: https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/create-a-training-schedule-bd409a9.md
- Metrics: https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/view-the-metric-resource-for-an-execution-d85dd44.md