ML Operations Reference

Complete reference for SAP AI Core ML training and operations.

Documentation Source: https://github.com/SAP-docs/sap-artificial-intelligence/tree/main/docs/sap-ai-core


Overview

SAP AI Core uses Argo Workflows for training pipelines, supporting batch jobs for model preprocessing, training, and inference.

Key Components

Component       Description
Scenarios       AI use case implementations
Executables     Reusable workflow templates
Configurations  Parameters and artifact bindings
Executions      Running instances of workflows
Artifacts       Datasets, models, and results

Workflow Engine

Argo Workflows

SAP AI Core uses Argo Workflows, a container-native workflow engine, which supports:

  • Directed acyclic graph (DAG) structures
  • Parallel step execution
  • Container-based steps
  • Data ingestion and preprocessing
  • Model training and batch inference

Limitation: Not optimized for time-critical tasks due to scheduling overhead.


Prerequisites

1. Object Store Secret (Required)

Create a secret named default for training output artifacts:

curl -X POST "$AI_API_URL/v2/admin/objectStoreSecrets" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "default",
    "type": "S3",
    "pathPrefix": "my-bucket/training-output",
    "data": {
      "AWS_ACCESS_KEY_ID": "<access-key>",
      "AWS_SECRET_ACCESS_KEY": "<secret-key>"
    }
  }'

Note: Without a default secret, training pipelines will fail.
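
The examples in this reference assume $AUTH_TOKEN holds a valid OAuth access token. A minimal sketch for fetching one with the client-credentials grant, assuming a standard SAP BTP service key with clientid, clientsecret, and url fields (adapt to your own service key):

import requests

def get_token(auth_url: str, client_id: str, client_secret: str) -> str:
    """Fetch an OAuth access token via the client-credentials grant.

    auth_url is assumed to be the `url` field of the service key
    (the authentication tenant URL).
    """
    response = requests.post(
        f"{auth_url}/oauth/token",
        data={"grant_type": "client_credentials"},
        auth=(client_id, client_secret),  # HTTP basic auth with clientid/clientsecret
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["access_token"]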

2. Docker Registry Secret

For custom training images:

curl -X POST "$AI_API_URL/v2/admin/dockerRegistrySecrets" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "docker-registry",
    "data": {
      ".dockerconfigjson": "<base64-encoded-docker-config>"
    }
  }'
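
The .dockerconfigjson value is the base64-encoded Docker config for your registry. A minimal sketch for producing it; the registry URL and credentials are placeholders:

import base64
import json

def docker_config_b64(registry: str, username: str, password: str) -> str:
    """Build a base64-encoded .dockerconfigjson for a private registry."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = {
        "auths": {
            registry: {
                "username": username,
                "password": password,
                "auth": auth,  # base64 of "username:password"
            }
        }
    }
    return base64.b64encode(json.dumps(config).encode()).decode()

# Example with placeholder values:
# print(docker_config_b64("my-registry.example.com", "robot-user", "<token>"))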

3. Git Repository

Sync workflow templates from Git:

curl -X POST "$AI_API_URL/v2/admin/repositories" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "training-repo",
    "url": "[https://github.com/org/training-workflows",](https://github.com/org/training-workflows",)
    "username": "<git-user>",
    "password": "<git-token>"
  }'

Workflow Template

Basic Structure

apiVersion: ai.sap.com/v1alpha1
kind: WorkflowTemplate
metadata:
  name: text-classifier-training
  annotations:
    scenarios.ai.sap.com/description: "Train text classification model"
    scenarios.ai.sap.com/name: "text-classifier"
    executables.ai.sap.com/description: "Training executable"
    executables.ai.sap.com/name: "text-classifier-train"
    artifacts.ai.sap.com/training-data.kind: "dataset"
    artifacts.ai.sap.com/trained-model.kind: "model"
  labels:
    scenarios.ai.sap.com/id: "text-classifier"
    executables.ai.sap.com/id: "text-classifier-train"
    ai.sap.com/version: "1.0.0"
spec:
  imagePullSecrets:
    - name: docker-registry
  entrypoint: main
  arguments:
    parameters:
      - name: learning_rate
        default: "0.001"
      - name: epochs
        default: "10"
    artifacts:
      - name: training-data
        path: /data/input
        archive:
          none: {}
  templates:
    - name: main
      steps:
        - - name: preprocess
            template: preprocess-data
        - - name: train
            template: train-model
        - - name: evaluate
            template: evaluate-model

    - name: preprocess-data
      container:
        image: my-registry/preprocessing:latest
        command: ["python", "preprocess.py"]
        args: ["--input", "/data/input", "--output", "/data/processed"]

    - name: train-model
      container:
        image: my-registry/training:latest
        command: ["python", "train.py"]
        args:
          - "--data=/data/processed"
          - "--lr={{workflow.parameters.learning_rate}}"
          - "--epochs={{workflow.parameters.epochs}}"
          - "--output=/data/model"
      outputs:
        artifacts:
          - name: trained-model
            path: /data/model
            globalName: trained-model
            archive:
              none: {}

    - name: evaluate-model
      container:
        image: my-registry/evaluation:latest
        command: ["python", "evaluate.py"]
        args: ["--model", "/data/model"]

Annotations and Labels Reference

Key                               Placed under  Description
scenarios.ai.sap.com/name         annotations   Human-readable scenario name
scenarios.ai.sap.com/id           labels        Scenario identifier
executables.ai.sap.com/name       annotations   Executable name
executables.ai.sap.com/id         labels        Executable identifier
artifacts.ai.sap.com/<name>.kind  annotations   Artifact type (dataset, model, etc.)

Artifacts

Types

Kind       Description               Use Case
dataset    Training/validation data  Input for training
model      Trained model             Output from training
resultset  Inference results         Output from batch inference
other      Miscellaneous             Logs, metrics, configs

Register Input Artifact

curl -X POST "$AI_API_URL/v2/lm/artifacts" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "training-dataset-v1",
    "kind": "dataset",
    "url": "ai://default/datasets/training-v1",
    "scenarioId": "text-classifier",
    "description": "Training dataset version 1"
  }'

URL Format

  • ai://default/<path> - Uses default object store secret
  • ai://<secret-name>/<path> - Uses named object store secret
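
Registering an artifact records metadata only; the file itself must already exist at the referenced location. A hedged sketch that uploads the data with boto3 first, assuming ai://default/<path> resolves to the path prefix configured in the default secret (bucket and key below mirror the prerequisite example):

import boto3

# Credentials come from the usual boto3 sources (env vars, profile, etc.).
s3 = boto3.client("s3")
s3.upload_file(
    Filename="train.csv",
    Bucket="my-bucket",
    # Assumption: ai://default/datasets/training-v1 resolves to
    # <pathPrefix>/datasets/training-v1 under the secret's bucket.
    Key="training-output/datasets/training-v1/train.csv",
)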

List Artifacts

curl -X GET "$AI_API_URL/v2/lm/artifacts?scenarioId=text-classifier" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Configurations

Create Training Configuration

curl -X POST "$AI_API_URL/v2/lm/configurations" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "text-classifier-config-v1",
    "executableId": "text-classifier-train",
    "scenarioId": "text-classifier",
    "parameterBindings": [
      {"key": "learning_rate", "value": "0.001"},
      {"key": "epochs", "value": "20"},
      {"key": "batch_size", "value": "32"}
    ],
    "inputArtifactBindings": [
      {"key": "training-data", "artifactId": "<dataset-artifact-id>"}
    ]
  }'

Executions

Create Execution

curl -X POST "$AI_API_URL/v2/lm/executions" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "configurationId": "<configuration-id>"
  }'

Execution Statuses

Status     Description
UNKNOWN    Initial state
PENDING    Queued for execution
RUNNING    Currently executing
COMPLETED  Finished successfully
DEAD       Failed
STOPPED    Manually stopped

Check Execution Status

curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Get Execution Logs

curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/logs" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Stop Execution

curl -X PATCH "$AI_API_URL/v2/lm/executions/<execution-id>" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{"targetStatus": "STOPPED"}'

Metrics

Write Metrics from Training

In your training code:

import requests
import os

def log_metrics(metrics: dict, step: int):
    """Log metrics to SAP AI Core for the current execution."""
    # Indexing (rather than .get) fails fast if an env var is missing,
    # instead of silently sending "Bearer None".
    api_url = os.environ["AICORE_API_URL"]
    token = os.environ["AICORE_AUTH_TOKEN"]
    execution_id = os.environ["AICORE_EXECUTION_ID"]

    response = requests.post(
        f"{api_url}/v2/lm/executions/{execution_id}/metrics",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "metrics": [
                {"name": name, "value": value, "step": step}
                for name, value in metrics.items()
            ]
        },
        timeout=30,
    )
    response.raise_for_status()  # fail fast if the metric write is rejected

# Usage in a training loop; train_epoch() and validate() are your own code
for epoch in range(epochs):
    train_loss = train_epoch()
    val_loss, accuracy = validate()
    log_metrics({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "accuracy": accuracy
    }, step=epoch)

Read Metrics

curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/metrics" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Training Schedules

Create Schedule

curl -X POST "$AI_API_URL/v2/lm/executionSchedules" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "configurationId": "<configuration-id>",
    "cron": "0 0 * * 0",
    "start": "2024-01-01T00:00:00Z",
    "end": "2024-12-31T23:59:59Z"
  }'

Cron Expression Format

SAP AI Core uses 5-field cron expressions with 3-letter day-of-week names:

┌───────── minute (0-59)
│ ┌─────── hour (0-23)
│ │ ┌───── day of month (1-31)
│ │ │ ┌─── month (1-12)
│ │ │ │ ┌─ day of week (mon, tue, wed, thu, fri, sat, sun)
│ │ │ │ │
* * * * *

Examples:

  • 0 0 * * * - Daily at midnight
  • 0 0 * * sun - Weekly on Sunday
  • 0 0 * * fri - Weekly on Friday
  • 0 0 1 * * - Monthly on the 1st
  • 0 */6 * * * - Every 6 hours

Note: Using * * * * * treats the schedule as "Run Always" (continuous check), which differs from standard cron behavior. Minimum interval for pipeline schedules is 1 hour.
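
To sanity-check an expression before creating a schedule, a standard cron parser can preview the upcoming trigger times. A sketch using the third-party croniter package; note that it models standard cron semantics, not the "Run Always" special case described above:

from datetime import datetime, timezone

from croniter import croniter  # pip install croniter

expr = "0 0 * * sun"  # weekly on Sunday at midnight
it = croniter(expr, datetime(2024, 1, 1, tzinfo=timezone.utc))
for _ in range(3):
    print(it.get_next(datetime))
# 2024-01-07 00:00:00+00:00
# 2024-01-14 00:00:00+00:00
# 2024-01-21 00:00:00+00:00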

List Schedules

curl -X GET "$AI_API_URL/v2/lm/executionSchedules" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

Delete Schedule

curl -X DELETE "$AI_API_URL/v2/lm/executionSchedules/<schedule-id>" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"

SAP AI Launchpad

ML Operations App

Access: Workspaces → ML Operations

Features:

  • View scenarios and executables
  • Create/manage configurations
  • Run/monitor executions
  • View training metrics
  • Manage artifacts
  • Create schedules

Required Roles

Role                 Capabilities
operations_manager   Access to the ML Operations app
mloperations_viewer  View-only access
mloperations_editor  Full edit access

Comparing Runs

  1. Navigate to ML Operations → Executions
  2. Select multiple executions
  3. Click "Compare"
  4. View side-by-side metrics and parameters
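
The same comparison can be scripted against the metrics endpoint shown earlier. A minimal sketch that fetches metrics for several executions and prints them side by side; the URL, token, and execution IDs are placeholders, and the payload is printed as returned by the API:

import requests

API_URL = "<ai-api-url>"   # value of $AI_API_URL
TOKEN = "<auth-token>"     # value of $AUTH_TOKEN

def fetch_metrics(execution_id: str, resource_group: str = "default") -> dict:
    """Return the raw metrics payload for one execution."""
    response = requests.get(
        f"{API_URL}/v2/lm/executions/{execution_id}/metrics",
        headers={"Authorization": f"Bearer {TOKEN}",
                 "AI-Resource-Group": resource_group},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Compare two runs side by side (IDs are placeholders).
for execution_id in ["<execution-id-1>", "<execution-id-2>"]:
    print(execution_id, fetch_metrics(execution_id))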

Best Practices

Workflow Design

  1. Modular steps: Break workflow into reusable templates
  2. Parameterization: Use parameters for hyperparameters
  3. Artifact management: Define clear input/output artifacts
  4. Error handling: Include retry logic for flaky operations

Resource Management

  1. Appropriate sizing: Match container resources to workload
  2. GPU allocation: Request GPUs only when needed
  3. Storage: Use object store for large datasets
  4. Cleanup: Delete old executions and artifacts

Monitoring

  1. Log metrics: Track loss, accuracy, etc. during training
  2. Check logs: Review execution logs for errors
  3. Compare runs: Analyze different hyperparameter settings
  4. Set alerts: Monitor for failed executions

Troubleshooting

Execution Failed

  1. Check execution logs: GET /v2/lm/executions/{id}/logs
  2. Verify object store secret exists and is named default
  3. Check Docker image is accessible
  4. Verify artifact paths are correct
  5. Check resource quota not exceeded
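
A small triage sketch combining the status and logs endpoints above; URL, token, and execution ID are placeholders:

import requests

API_URL = "<ai-api-url>"
TOKEN = "<auth-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}", "AI-Resource-Group": "default"}

def triage(execution_id: str) -> None:
    """Print an execution's status, then its logs, for quick failure triage."""
    status = requests.get(f"{API_URL}/v2/lm/executions/{execution_id}",
                          headers=HEADERS, timeout=30)
    status.raise_for_status()
    print("status:", status.json()["status"])

    logs = requests.get(f"{API_URL}/v2/lm/executions/{execution_id}/logs",
                        headers=HEADERS, timeout=30)
    logs.raise_for_status()
    print(logs.text)  # raw log payload; inspect for container errors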

Artifacts Not Found

  1. Verify artifact URL format: ai://default/<path>
  2. Check object store secret permissions
  3. Verify file exists in object store
  4. Check artifact registered in correct scenario

Schedule Not Running

  1. Verify schedule is active (not paused)
  2. Check cron expression is valid
  3. Verify start/end dates bracket current time
  4. Check configuration still exists