gh-secondsky-sap-skills-ski…/references/ml-operations.md

# ML Operations Reference

Complete reference for SAP AI Core ML training and operations.

**Documentation Source:** [https://github.com/SAP-docs/sap-artificial-intelligence/tree/main/docs/sap-ai-core](https://github.com/SAP-docs/sap-artificial-intelligence/tree/main/docs/sap-ai-core)

---

## Overview

SAP AI Core uses Argo Workflows for training pipelines, supporting batch jobs for model preprocessing, training, and inference.

### Key Components

| Component | Description |
|-----------|-------------|
| **Scenarios** | AI use case implementations |
| **Executables** | Reusable workflow templates |
| **Configurations** | Parameters and artifact bindings |
| **Executions** | Running instances of workflows |
| **Artifacts** | Datasets, models, and results |

---

## Workflow Engine

### Argo Workflows

SAP AI Core uses Argo Workflows (container-native workflow engine) supporting:

- Direct Acyclic Graph (DAG) structures
- Parallel step execution
- Container-based steps
- Data ingestion and preprocessing
- Model training and batch inference

**Limitation:** Not optimized for time-critical tasks due to scheduling overhead.

---

## Prerequisites

### 1. Object Store Secret (Required)

Create a secret named `default` for training output artifacts:

```bash
curl -X POST "$AI_API_URL/v2/admin/objectStoreSecrets" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "default",
    "type": "S3",
    "pathPrefix": "my-bucket/training-output",
    "data": {
      "AWS_ACCESS_KEY_ID": "<access-key>",
      "AWS_SECRET_ACCESS_KEY": "<secret-key>"
    }
  }'
```

**Note:** Without a `default` secret, training pipelines will fail.

### 2. Docker Registry Secret

For custom training images:

```bash
curl -X POST "$AI_API_URL/v2/admin/dockerRegistrySecrets" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "docker-registry",
    "data": {
      ".dockerconfigjson": "<base64-encoded-docker-config>"
    }
  }'
```

### 3. Git Repository

Sync workflow templates from Git:

```bash
curl -X POST "$AI_API_URL/v2/admin/repositories" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "training-repo",
    "url": "[https://github.com/org/training-workflows",](https://github.com/org/training-workflows",)
    "username": "<git-user>",
    "password": "<git-token>"
  }'
```

---

## Workflow Template

### Basic Structure

```yaml
apiVersion: ai.sap.com/v1alpha1
kind: WorkflowTemplate
metadata:
  name: text-classifier-training
  annotations:
    scenarios.ai.sap.com/description: "Train text classification model"
    scenarios.ai.sap.com/name: "text-classifier"
    executables.ai.sap.com/description: "Training executable"
    executables.ai.sap.com/name: "text-classifier-train"
    artifacts.ai.sap.com/training-data.kind: "dataset"
    artifacts.ai.sap.com/trained-model.kind: "model"
  labels:
    scenarios.ai.sap.com/id: "text-classifier"
    executables.ai.sap.com/id: "text-classifier-train"
    ai.sap.com/version: "1.0.0"
spec:
  imagePullSecrets:
    - name: docker-registry
  entrypoint: main
  arguments:
    parameters:
      - name: learning_rate
        default: "0.001"
      - name: epochs
        default: "10"
    artifacts:
      - name: training-data
        path: /data/input
        archive:
          none: {}
  templates:
    - name: main
      steps:
        - - name: preprocess
            template: preprocess-data
        - - name: train
            template: train-model
        - - name: evaluate
            template: evaluate-model

    - name: preprocess-data
      container:
        image: my-registry/preprocessing:latest
        command: ["python", "preprocess.py"]
        args: ["--input", "/data/input", "--output", "/data/processed"]

    - name: train-model
      container:
        image: my-registry/training:latest
        command: ["python", "train.py"]
        args:
          - "--data=/data/processed"
          - "--lr={{workflow.parameters.learning_rate}}"
          - "--epochs={{workflow.parameters.epochs}}"
          - "--output=/data/model"
      outputs:
        artifacts:
          - name: trained-model
            path: /data/model
            globalName: trained-model
            archive:
              none: {}

    - name: evaluate-model
      container:
        image: my-registry/evaluation:latest
        command: ["python", "evaluate.py"]
        args: ["--model", "/data/model"]
```

### Annotations Reference

| Annotation | Description |
|------------|-------------|
| `scenarios.ai.sap.com/name` | Human-readable scenario name |
| `scenarios.ai.sap.com/id` | Scenario identifier |
| `executables.ai.sap.com/name` | Executable name |
| `executables.ai.sap.com/id` | Executable identifier |
| `artifacts.ai.sap.com/<name>.kind` | Artifact type (dataset, model, etc.) |

---

## Artifacts

### Types

| Kind | Description | Use Case |
|------|-------------|----------|
| `dataset` | Training/validation data | Input for training |
| `model` | Trained model | Output from training |
| `resultset` | Inference results | Output from batch inference |
| `other` | Miscellaneous | Logs, metrics, configs |

### Register Input Artifact

```bash
curl -X POST "$AI_API_URL/v2/lm/artifacts" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "training-dataset-v1",
    "kind": "dataset",
    "url": "ai://default/datasets/training-v1",
    "scenarioId": "text-classifier",
    "description": "Training dataset version 1"
  }'
```

### URL Format

- `ai://default/<path>` - Uses default object store secret
- `ai://<secret-name>/<path>` - Uses named object store secret

### List Artifacts

```bash
curl -X GET "$AI_API_URL/v2/lm/artifacts?scenarioId=text-classifier" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"
```

---

## Configurations

### Create Training Configuration

```bash
curl -X POST "$AI_API_URL/v2/lm/configurations" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "text-classifier-config-v1",
    "executableId": "text-classifier-train",
    "scenarioId": "text-classifier",
    "parameterBindings": [
      {"key": "learning_rate", "value": "0.001"},
      {"key": "epochs", "value": "20"},
      {"key": "batch_size", "value": "32"}
    ],
    "inputArtifactBindings": [
      {"key": "training-data", "artifactId": "<dataset-artifact-id>"}
    ]
  }'
```

---

## Executions

### Create Execution

```bash
curl -X POST "$AI_API_URL/v2/lm/executions" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "configurationId": "<configuration-id>"
  }'
```

### Execution Statuses

| Status | Description |
|--------|-------------|
| `UNKNOWN` | Initial state |
| `PENDING` | Queued for execution |
| `RUNNING` | Currently executing |
| `COMPLETED` | Finished successfully |
| `DEAD` | Failed |
| `STOPPED` | Manually stopped |

### Check Execution Status

```bash
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"
```

### Get Execution Logs

```bash
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/logs" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"
```

### Stop Execution

```bash
curl -X PATCH "$AI_API_URL/v2/lm/executions/<execution-id>" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{"targetStatus": "STOPPED"}'
```

---

## Metrics

### Write Metrics from Training

In your training code:

```python
import requests
import os

def log_metrics(metrics: dict, step: int):
    """Log metrics to SAP AI Core."""
    api_url = os.environ.get("AICORE_API_URL")
    token = os.environ.get("AICORE_AUTH_TOKEN")
    execution_id = os.environ.get("AICORE_EXECUTION_ID")

    response = requests.post(
        f"{api_url}/v2/lm/executions/{execution_id}/metrics",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "metrics": [
                {"name": name, "value": value, "step": step}
                for name, value in metrics.items()
            ]
        }
    )

# Usage in training loop
for epoch in range(epochs):
    train_loss = train_epoch()
    val_loss = validate()
    log_metrics({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "accuracy": accuracy
    }, step=epoch)
```

### Read Metrics

```bash
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/metrics" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"
```

---

## Training Schedules

### Create Schedule

```bash
curl -X POST "$AI_API_URL/v2/lm/executionSchedules" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default" \
  -H "Content-Type: application/json" \
  -d '{
    "configurationId": "<configuration-id>",
    "cron": "0 0 * * 0",
    "start": "2024-01-01T00:00:00Z",
    "end": "2024-12-31T23:59:59Z"
  }'
```

### Cron Expression Format

SAP AI Core uses 5-field cron expressions with **3-letter day-of-week names**:

```
┌───────── minute (0-59)
│ ┌─────── hour (0-23)
│ │ ┌───── day of month (1-31)
│ │ │ ┌─── month (1-12)
│ │ │ │ ┌─ day of week (mon, tue, wed, thu, fri, sat, sun)
│ │ │ │ │
* * * * *
```

Examples:
- `0 0 * * *` - Daily at midnight
- `0 0 * * sun` - Weekly on Sunday
- `0 0 * * fri` - Weekly on Friday
- `0 0 1 * *` - Monthly on 1st
- `0 */6 * * *` - Every 6 hours

**Note:** Using `* * * * *` treats the schedule as "Run Always" (continuous check), which differs from standard cron behavior. Minimum interval for pipeline schedules is 1 hour.

### List Schedules

```bash
curl -X GET "$AI_API_URL/v2/lm/executionSchedules" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"
```

### Delete Schedule

```bash
curl -X DELETE "$AI_API_URL/v2/lm/executionSchedules/<schedule-id>" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "AI-Resource-Group: default"
```

---

## SAP AI Launchpad

### ML Operations App

Access: **Workspaces** → **ML Operations**

Features:
- View scenarios and executables
- Create/manage configurations
- Run/monitor executions
- View training metrics
- Manage artifacts
- Create schedules

### Required Roles

| Role | Capabilities |
|------|--------------|
| `operations_manager` | Access ML Operations app |
| `mloperations_viewer` | View-only access |
| `mloperations_editor` | Full edit access |

### Comparing Runs

1. Navigate to ML Operations → Executions
2. Select multiple executions
3. Click "Compare"
4. View side-by-side metrics and parameters

---

## Best Practices

### Workflow Design

1. **Modular steps**: Break workflow into reusable templates
2. **Parameterization**: Use parameters for hyperparameters
3. **Artifact management**: Define clear input/output artifacts
4. **Error handling**: Include retry logic for flaky operations

### Resource Management

1. **Appropriate sizing**: Match container resources to workload
2. **GPU allocation**: Request GPUs only when needed
3. **Storage**: Use object store for large datasets
4. **Cleanup**: Delete old executions and artifacts

### Monitoring

1. **Log metrics**: Track loss, accuracy, etc. during training
2. **Check logs**: Review execution logs for errors
3. **Compare runs**: Analyze different hyperparameter settings
4. **Set alerts**: Monitor for failed executions

---

## Troubleshooting

### Execution Failed

1. Check execution logs: `GET /v2/lm/executions/{id}/logs`
2. Verify object store secret exists and is named `default`
3. Check Docker image is accessible
4. Verify artifact paths are correct
5. Check resource quota not exceeded

### Artifacts Not Found

1. Verify artifact URL format: `ai://default/<path>`
2. Check object store secret permissions
3. Verify file exists in object store
4. Check artifact registered in correct scenario

### Schedule Not Running

1. Verify schedule is active (not paused)
2. Check cron expression is valid
3. Verify start/end dates bracket current time
4. Check configuration still exists

---

## Documentation Links

- Training Overview: [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/train-your-model-a9ceb06.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/train-your-model-a9ceb06.md)
- ML Operations (Launchpad): [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-launchpad/ml-operations-df78271.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-launchpad/ml-operations-df78271.md)
- Schedules: [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/create-a-training-schedule-bd409a9.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/create-a-training-schedule-bd409a9.md)
- Metrics: [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/view-the-metric-resource-for-an-execution-d85dd44.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/view-the-metric-resource-for-an-execution-d85dd44.md)