ML Operations Reference

Complete reference for SAP AI Core ML training and operations.

Documentation Source: https://github.com/SAP-docs/sap-artificial-intelligence/tree/main/docs/sap-ai-core


Overview

SAP AI Core uses Argo Workflows for training pipelines, supporting batch jobs for model preprocessing, training, and inference.

Key Components

Component       Description
Scenarios       AI use case implementations
Executables     Reusable workflow templates
Configurations  Parameters and artifact bindings
Executions      Running instances of workflows
Artifacts       Datasets, models, and results

Workflow Engine

Argo Workflows

SAP AI Core uses Argo Workflows, a container-native workflow engine, which supports:

  • Directed acyclic graph (DAG) structures
  • Parallel step execution
  • Container-based steps
  • Data ingestion and preprocessing
  • Model training and batch inference

Limitation: Not optimized for time-critical tasks due to scheduling overhead.


Prerequisites

1. Object Store Secret (Required)

Create a secret named default for training output artifacts:

curl -X POST "$AI_API_URL/v2/admin/objectStoreSecrets" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "default",
    "type": "S3",
    "pathPrefix": "my-bucket/training-output",
    "data": {
      "AWS_ACCESS_KEY_ID": "<access-key>",
      "AWS_SECRET_ACCESS_KEY": "<secret-key>"
    }
  }'

Note: Without a default secret, training pipelines will fail.
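
The examples in this reference assume $AUTH_TOKEN holds a valid OAuth access token. A minimal sketch for fetching one with the client-credentials grant, assuming a standard SAP BTP service key with clientid, clientsecret, and url fields (adapt to your own service key):

import requests

def get_token(auth_url: str, client_id: str, client_secret: str) -> str:
    """Fetch an OAuth access token via the client-credentials grant.

    auth_url is assumed to be the `url` field of the service key
    (the authentication tenant URL).
    """
    response = requests.post(
        f"{auth_url}/oauth/token",
        data={"grant_type": "client_credentials"},
        auth=(client_id, client_secret),  # HTTP basic auth with clientid/clientsecret
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["access_token"]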

2. Docker Registry Secret

For custom training images:

curl -X POST "$AI_API_URL/v2/admin/dockerRegistrySecrets" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "docker-registry",
    "data": {
      ".dockerconfigjson": "<base64-encoded-docker-config>"
    }
  }'
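
The .dockerconfigjson value is the base64-encoded Docker config for your registry. A minimal sketch for producing it; the registry URL and credentials are placeholders:

import base64
import json

def docker_config_b64(registry: str, username: str, password: str) -> str:
    """Build a base64-encoded .dockerconfigjson for a private registry."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = {
        "auths": {
            registry: {
                "username": username,
                "password": password,
                "auth": auth,  # base64 of "username:password"
            }
        }
    }
    return base64.b64encode(json.dumps(config).encode()).decode()

# Example with placeholder values:
# print(docker_config_b64("my-registry.example.com", "robot-user", "<token>"))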

3. Git Repository

Sync workflow templates from Git:

curl -X POST "$AI_API_URL/v2/admin/repositories" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "training-repo",
    "url": "[https://github.com/org/training-workflows",](https://github.com/org/training-workflows",)
    "username": "<git-user>",
    "password": "<git-token>"
  }'

Workflow Template

Basic Structure

apiVersion: ai.sap.com/v1alpha1
kind: WorkflowTemplate
metadata:
  name: text-classifier-training
  annotations:
    scenarios.ai.sap.com/description: "Train text classification model"
    scenarios.ai.sap.com/name: "text-classifier"
    executables.ai.sap.com/description: "Training executable"
    executables.ai.sap.com/name: "text-classifier-train"
    artifacts.ai.sap.com/training-data.kind: "dataset"
    artifacts.ai.sap.com/trained-model.kind: "model"
  labels:
    scenarios.ai.sap.com/id: "text-classifier"
    executables.ai.sap.com/id: "text-classifier-train"
    ai.sap.com/version: "1.0.0"
spec:
  imagePullSecrets:
    - name: docker-registry
  entrypoint: main
  arguments:
    parameters:
      - name: learning_rate
        default: "0.001"
      - name: epochs
        default: "10"
    artifacts:
      - name: training-data
        path: /data/input
        archive:
          none: {}
  templates:
    - name: main
      steps:
        - - name: preprocess
            template: preprocess-data
        - - name: train
            template: train-model
        - - name: evaluate
            template: evaluate-model

    - name: preprocess-data
      container:
        image: my-registry/preprocessing:latest
        command: ["python", "preprocess.py"]
        args: ["--input", "/data/input", "--output", "/data/processed"]

    - name: train-model
      container:
        image: my-registry/training:latest
        command: ["python", "train.py"]
        args:
          - "--data=/data/processed"
          - "--lr={{workflow.parameters.learning_rate}}"
          - "--epochs={{workflow.parameters.epochs}}"
          - "--output=/data/model"
      outputs:
        artifacts:
          - name: trained-model
            path: /data/model
            globalName: trained-model
            archive:
              none: {}

    - name: evaluate-model
      container:
        image: my-registry/evaluation:latest
        command: ["python", "evaluate.py"]
        args: ["--model", "/data/model"]

Annotations and Labels Reference

Key                               Placed under  Description
scenarios.ai.sap.com/name         annotations   Human-readable scenario name
scenarios.ai.sap.com/id           labels        Scenario identifier
executables.ai.sap.com/name       annotations   Executable name
executables.ai.sap.com/id         labels        Executable identifier
artifacts.ai.sap.com/<name>.kind  annotations   Artifact type (dataset, model, etc.)

Artifacts

Types

Kind       Description               Use Case
dataset    Training/validation data  Input for training
model      Trained model             Output from training
resultset  Inference results         Output from batch inference
other      Miscellaneous             Logs, metrics, configs

Register Input Artifact

curl -X POST "$AI_API_URL/v2/lm/artifacts" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "training-dataset-v1",
    "kind": "dataset",
    "url": "ai://default/datasets/training-v1",
    "scenarioId": "text-classifier",
    "description": "Training dataset version 1"
  }'

URL Format

  • ai://default/<path> - Uses default object store secret
  • ai://<secret-name>/<path> - Uses named object store secret
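
Registering an artifact records metadata only; the file itself must already exist at the referenced location. A hedged sketch that uploads the data with boto3 first, assuming ai://default/<path> resolves to the path prefix configured in the default secret (bucket and key below mirror the prerequisite example):

import boto3

# Credentials come from the usual boto3 sources (env vars, profile, etc.).
s3 = boto3.client("s3")
s3.upload_file(
    Filename="train.csv",
    Bucket="my-bucket",
    # Assumption: ai://default/datasets/training-v1 resolves to
    # <pathPrefix>/datasets/training-v1 under the secret's bucket.
    Key="training-output/datasets/training-v1/train.csv",
)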

List Artifacts

curl -X GET "$AI_API_URL/v2/lm/artifacts?scenarioId=text-classifier" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Configurations

Create Training Configuration

curl -X POST "$AI_API_URL/v2/lm/configurations" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "text-classifier-config-v1",
    "executableId": "text-classifier-train",
    "scenarioId": "text-classifier",
    "parameterBindings": [
      {"key": "learning_rate", "value": "0.001"},
      {"key": "epochs", "value": "20"},
      {"key": "batch_size", "value": "32"}
    ],
    "inputArtifactBindings": [
      {"key": "training-data", "artifactId": "<dataset-artifact-id>"}
    ]
  }'

Executions

Create Execution

curl -X POST "$AI_API_URL/v2/lm/executions" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "configurationId": "<configuration-id>"
  }'

Execution Statuses

Status     Description
UNKNOWN    Initial state
PENDING    Queued for execution
RUNNING    Currently executing
COMPLETED  Finished successfully
DEAD       Failed
STOPPED    Manually stopped

Check Execution Status

curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Get Execution Logs

curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/logs" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Stop Execution

curl -X PATCH "$AI_API_URL/v2/lm/executions/<execution-id>" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{"targetStatus": "STOPPED"}'

Metrics

Write Metrics from Training

In your training code:

import requests
import os

def log_metrics(metrics: dict, step: int):
    """Log metrics to SAP AI Core for the current execution."""
    # Indexing (rather than .get) fails fast if an env var is missing,
    # instead of silently sending "Bearer None".
    api_url = os.environ["AICORE_API_URL"]
    token = os.environ["AICORE_AUTH_TOKEN"]
    execution_id = os.environ["AICORE_EXECUTION_ID"]

    response = requests.post(
        f"{api_url}/v2/lm/executions/{execution_id}/metrics",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "metrics": [
                {"name": name, "value": value, "step": step}
                for name, value in metrics.items()
            ]
        },
        timeout=30,
    )
    response.raise_for_status()  # fail fast if the metric write is rejected

# Usage in a training loop; train_epoch() and validate() are your own code
for epoch in range(epochs):
    train_loss = train_epoch()
    val_loss, accuracy = validate()
    log_metrics({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "accuracy": accuracy
    }, step=epoch)

Read Metrics

curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/metrics" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Training Schedules

Create Schedule

curl -X POST "$AI_API_URL/v2/lm/executionSchedules" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "configurationId": "<configuration-id>",
    "cron": "0 0 * * 0",
    "start": "2024-01-01T00:00:00Z",
    "end": "2024-12-31T23:59:59Z"
  }'

Cron Expression Format

SAP AI Core uses 5-field cron expressions with 3-letter day-of-week names:

┌───────── minute (0-59)
│ ┌─────── hour (0-23)
│ │ ┌───── day of month (1-31)
│ │ │ ┌─── month (1-12)
│ │ │ │ ┌─ day of week (mon, tue, wed, thu, fri, sat, sun)
│ │ │ │ │
* * * * *

Examples:

  • 0 0 * * * - Daily at midnight
  • 0 0 * * sun - Weekly on Sunday
  • 0 0 * * fri - Weekly on Friday
  • 0 0 1 * * - Monthly on the 1st
  • 0 */6 * * * - Every 6 hours

Note: Using * * * * * treats the schedule as "Run Always" (continuous check), which differs from standard cron behavior. Minimum interval for pipeline schedules is 1 hour.
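
To sanity-check an expression before creating a schedule, a standard cron parser can preview the upcoming trigger times. A sketch using the third-party croniter package; note that it models standard cron semantics, not the "Run Always" special case described above:

from datetime import datetime, timezone

from croniter import croniter  # pip install croniter

expr = "0 0 * * sun"  # weekly on Sunday at midnight
it = croniter(expr, datetime(2024, 1, 1, tzinfo=timezone.utc))
for _ in range(3):
    print(it.get_next(datetime))
# 2024-01-07 00:00:00+00:00
# 2024-01-14 00:00:00+00:00
# 2024-01-21 00:00:00+00:00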

List Schedules

curl -X GET "$AI_API_URL/v2/lm/executionSchedules" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Delete Schedule

curl -X DELETE "$AI_API_URL/v2/lm/executionSchedules/<schedule-id>" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

SAP AI Launchpad

ML Operations App

Access: Workspaces → ML Operations

Features:

  • View scenarios and executables
  • Create/manage configurations
  • Run/monitor executions
  • View training metrics
  • Manage artifacts
  • Create schedules

Required Roles

Role                 Capabilities
operations_manager   Access to the ML Operations app
mloperations_viewer  View-only access
mloperations_editor  Full edit access

Comparing Runs

  1. Navigate to ML Operations → Executions
  2. Select multiple executions
  3. Click "Compare"
  4. View side-by-side metrics and parameters
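
The same comparison can be scripted against the metrics endpoint shown earlier. A minimal sketch that fetches metrics for several executions and prints them side by side; the URL, token, and execution IDs are placeholders, and the payload is printed as returned by the API:

import requests

API_URL = "<ai-api-url>"   # value of $AI_API_URL
TOKEN = "<auth-token>"     # value of $AUTH_TOKEN

def fetch_metrics(execution_id: str, resource_group: str = "default") -> dict:
    """Return the raw metrics payload for one execution."""
    response = requests.get(
        f"{API_URL}/v2/lm/executions/{execution_id}/metrics",
        headers={"Authorization": f"Bearer {TOKEN}",
                 "AI-Resource-Group": resource_group},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Compare two runs side by side (IDs are placeholders).
for execution_id in ["<execution-id-1>", "<execution-id-2>"]:
    print(execution_id, fetch_metrics(execution_id))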

Best Practices

Workflow Design

  1. Modular steps: Break workflow into reusable templates
  2. Parameterization: Use parameters for hyperparameters
  3. Artifact management: Define clear input/output artifacts
  4. Error handling: Include retry logic for flaky operations

Resource Management

  1. Appropriate sizing: Match container resources to workload
  2. GPU allocation: Request GPUs only when needed
  3. Storage: Use object store for large datasets
  4. Cleanup: Delete old executions and artifacts

Monitoring

  1. Log metrics: Track loss, accuracy, etc. during training
  2. Check logs: Review execution logs for errors
  3. Compare runs: Analyze different hyperparameter settings
  4. Set alerts: Monitor for failed executions

Troubleshooting

Execution Failed

  1. Check execution logs: GET /v2/lm/executions/{id}/logs
  2. Verify object store secret exists and is named default
  3. Check Docker image is accessible
  4. Verify artifact paths are correct
  5. Check resource quota not exceeded
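
A small triage sketch combining the status and logs endpoints above; URL, token, and execution ID are placeholders:

import requests

API_URL = "<ai-api-url>"
TOKEN = "<auth-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}", "AI-Resource-Group": "default"}

def triage(execution_id: str) -> None:
    """Print an execution's status, then its logs, for quick failure triage."""
    status = requests.get(f"{API_URL}/v2/lm/executions/{execution_id}",
                          headers=HEADERS, timeout=30)
    status.raise_for_status()
    print("status:", status.json()["status"])

    logs = requests.get(f"{API_URL}/v2/lm/executions/{execution_id}/logs",
                        headers=HEADERS, timeout=30)
    logs.raise_for_status()
    print(logs.text)  # raw log payload; inspect for container errors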

Artifacts Not Found

  1. Verify artifact URL format: ai://default/<path>
  2. Check object store secret permissions
  3. Verify file exists in object store
  4. Check artifact registered in correct scenario

Schedule Not Running

  1. Verify schedule is active (not paused)
  2. Check cron expression is valid
  3. Verify start/end dates bracket current time
  4. Check configuration still exists