506 lines
13 KiB
Markdown
506 lines
13 KiB
Markdown
# ML Operations Reference
|
|
|
|
Complete reference for SAP AI Core ML training and operations.
|
|
|
|
**Documentation Source:** [https://github.com/SAP-docs/sap-artificial-intelligence/tree/main/docs/sap-ai-core](https://github.com/SAP-docs/sap-artificial-intelligence/tree/main/docs/sap-ai-core)
|
|
|
|
---
|
|
|
|
## Overview
|
|
|
|
SAP AI Core uses Argo Workflows for training pipelines, supporting batch jobs for model preprocessing, training, and inference.
|
|
|
|
### Key Components
|
|
|
|
| Component | Description |
|
|
|-----------|-------------|
|
|
| **Scenarios** | AI use case implementations |
|
|
| **Executables** | Reusable workflow templates |
|
|
| **Configurations** | Parameters and artifact bindings |
|
|
| **Executions** | Running instances of workflows |
|
|
| **Artifacts** | Datasets, models, and results |
|
|
|
|
---
|
|
|
|
## Workflow Engine
|
|
|
|
### Argo Workflows
|
|
|
|
SAP AI Core uses Argo Workflows (container-native workflow engine) supporting:
|
|
|
|
- Direct Acyclic Graph (DAG) structures
|
|
- Parallel step execution
|
|
- Container-based steps
|
|
- Data ingestion and preprocessing
|
|
- Model training and batch inference
|
|
|
|
**Limitation:** Not optimized for time-critical tasks due to scheduling overhead.
|
|
|
|
---
|
|
|
|
## Prerequisites
|
|
|
|
### 1. Object Store Secret (Required)
|
|
|
|
Create a secret named `default` for training output artifacts:
|
|
|
|
```bash
|
|
curl -X POST "$AI_API_URL/v2/admin/objectStoreSecrets" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"name": "default",
|
|
"type": "S3",
|
|
"pathPrefix": "my-bucket/training-output",
|
|
"data": {
|
|
"AWS_ACCESS_KEY_ID": "<access-key>",
|
|
"AWS_SECRET_ACCESS_KEY": "<secret-key>"
|
|
}
|
|
}'
|
|
```
|
|
|
|
**Note:** Without a `default` secret, training pipelines will fail.
|
|
|
|
### 2. Docker Registry Secret
|
|
|
|
For custom training images:
|
|
|
|
```bash
|
|
curl -X POST "$AI_API_URL/v2/admin/dockerRegistrySecrets" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"name": "docker-registry",
|
|
"data": {
|
|
".dockerconfigjson": "<base64-encoded-docker-config>"
|
|
}
|
|
}'
|
|
```
|
|
|
|
### 3. Git Repository
|
|
|
|
Sync workflow templates from Git:
|
|
|
|
```bash
|
|
curl -X POST "$AI_API_URL/v2/admin/repositories" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"name": "training-repo",
|
|
"url": "[https://github.com/org/training-workflows",](https://github.com/org/training-workflows",)
|
|
"username": "<git-user>",
|
|
"password": "<git-token>"
|
|
}'
|
|
```
|
|
|
|
---
|
|
|
|
## Workflow Template
|
|
|
|
### Basic Structure
|
|
|
|
```yaml
|
|
apiVersion: ai.sap.com/v1alpha1
|
|
kind: WorkflowTemplate
|
|
metadata:
|
|
name: text-classifier-training
|
|
annotations:
|
|
scenarios.ai.sap.com/description: "Train text classification model"
|
|
scenarios.ai.sap.com/name: "text-classifier"
|
|
executables.ai.sap.com/description: "Training executable"
|
|
executables.ai.sap.com/name: "text-classifier-train"
|
|
artifacts.ai.sap.com/training-data.kind: "dataset"
|
|
artifacts.ai.sap.com/trained-model.kind: "model"
|
|
labels:
|
|
scenarios.ai.sap.com/id: "text-classifier"
|
|
executables.ai.sap.com/id: "text-classifier-train"
|
|
ai.sap.com/version: "1.0.0"
|
|
spec:
|
|
imagePullSecrets:
|
|
- name: docker-registry
|
|
entrypoint: main
|
|
arguments:
|
|
parameters:
|
|
- name: learning_rate
|
|
default: "0.001"
|
|
- name: epochs
|
|
default: "10"
|
|
artifacts:
|
|
- name: training-data
|
|
path: /data/input
|
|
archive:
|
|
none: {}
|
|
templates:
|
|
- name: main
|
|
steps:
|
|
- - name: preprocess
|
|
template: preprocess-data
|
|
- - name: train
|
|
template: train-model
|
|
- - name: evaluate
|
|
template: evaluate-model
|
|
|
|
- name: preprocess-data
|
|
container:
|
|
image: my-registry/preprocessing:latest
|
|
command: ["python", "preprocess.py"]
|
|
args: ["--input", "/data/input", "--output", "/data/processed"]
|
|
|
|
- name: train-model
|
|
container:
|
|
image: my-registry/training:latest
|
|
command: ["python", "train.py"]
|
|
args:
|
|
- "--data=/data/processed"
|
|
- "--lr={{workflow.parameters.learning_rate}}"
|
|
- "--epochs={{workflow.parameters.epochs}}"
|
|
- "--output=/data/model"
|
|
outputs:
|
|
artifacts:
|
|
- name: trained-model
|
|
path: /data/model
|
|
globalName: trained-model
|
|
archive:
|
|
none: {}
|
|
|
|
- name: evaluate-model
|
|
container:
|
|
image: my-registry/evaluation:latest
|
|
command: ["python", "evaluate.py"]
|
|
args: ["--model", "/data/model"]
|
|
```
|
|
|
|
### Annotations Reference
|
|
|
|
| Annotation | Description |
|
|
|------------|-------------|
|
|
| `scenarios.ai.sap.com/name` | Human-readable scenario name |
|
|
| `scenarios.ai.sap.com/id` | Scenario identifier |
|
|
| `executables.ai.sap.com/name` | Executable name |
|
|
| `executables.ai.sap.com/id` | Executable identifier |
|
|
| `artifacts.ai.sap.com/<name>.kind` | Artifact type (dataset, model, etc.) |
|
|
|
|
---
|
|
|
|
## Artifacts
|
|
|
|
### Types
|
|
|
|
| Kind | Description | Use Case |
|
|
|------|-------------|----------|
|
|
| `dataset` | Training/validation data | Input for training |
|
|
| `model` | Trained model | Output from training |
|
|
| `resultset` | Inference results | Output from batch inference |
|
|
| `other` | Miscellaneous | Logs, metrics, configs |
|
|
|
|
### Register Input Artifact
|
|
|
|
```bash
|
|
curl -X POST "$AI_API_URL/v2/lm/artifacts" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"name": "training-dataset-v1",
|
|
"kind": "dataset",
|
|
"url": "ai://default/datasets/training-v1",
|
|
"scenarioId": "text-classifier",
|
|
"description": "Training dataset version 1"
|
|
}'
|
|
```
|
|
|
|
### URL Format
|
|
|
|
- `ai://default/<path>` - Uses default object store secret
|
|
- `ai://<secret-name>/<path>` - Uses named object store secret
|
|
|
|
### List Artifacts
|
|
|
|
```bash
|
|
curl -X GET "$AI_API_URL/v2/lm/artifacts?scenarioId=text-classifier" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default"
|
|
```
|
|
|
|
---
|
|
|
|
## Configurations
|
|
|
|
### Create Training Configuration
|
|
|
|
```bash
|
|
curl -X POST "$AI_API_URL/v2/lm/configurations" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"name": "text-classifier-config-v1",
|
|
"executableId": "text-classifier-train",
|
|
"scenarioId": "text-classifier",
|
|
"parameterBindings": [
|
|
{"key": "learning_rate", "value": "0.001"},
|
|
{"key": "epochs", "value": "20"},
|
|
{"key": "batch_size", "value": "32"}
|
|
],
|
|
"inputArtifactBindings": [
|
|
{"key": "training-data", "artifactId": "<dataset-artifact-id>"}
|
|
]
|
|
}'
|
|
```
|
|
|
|
---
|
|
|
|
## Executions
|
|
|
|
### Create Execution
|
|
|
|
```bash
|
|
curl -X POST "$AI_API_URL/v2/lm/executions" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"configurationId": "<configuration-id>"
|
|
}'
|
|
```
|
|
|
|
### Execution Statuses
|
|
|
|
| Status | Description |
|
|
|--------|-------------|
|
|
| `UNKNOWN` | Initial state |
|
|
| `PENDING` | Queued for execution |
|
|
| `RUNNING` | Currently executing |
|
|
| `COMPLETED` | Finished successfully |
|
|
| `DEAD` | Failed |
|
|
| `STOPPED` | Manually stopped |
|
|
|
|
### Check Execution Status
|
|
|
|
```bash
|
|
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default"
|
|
```
|
|
|
|
### Get Execution Logs
|
|
|
|
```bash
|
|
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/logs" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default"
|
|
```
|
|
|
|
### Stop Execution
|
|
|
|
```bash
|
|
curl -X PATCH "$AI_API_URL/v2/lm/executions/<execution-id>" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"targetStatus": "STOPPED"}'
|
|
```
|
|
|
|
---
|
|
|
|
## Metrics
|
|
|
|
### Write Metrics from Training
|
|
|
|
In your training code:
|
|
|
|
```python
|
|
import requests
|
|
import os
|
|
|
|
def log_metrics(metrics: dict, step: int):
|
|
"""Log metrics to SAP AI Core."""
|
|
api_url = os.environ.get("AICORE_API_URL")
|
|
token = os.environ.get("AICORE_AUTH_TOKEN")
|
|
execution_id = os.environ.get("AICORE_EXECUTION_ID")
|
|
|
|
response = requests.post(
|
|
f"{api_url}/v2/lm/executions/{execution_id}/metrics",
|
|
headers={
|
|
"Authorization": f"Bearer {token}",
|
|
"Content-Type": "application/json"
|
|
},
|
|
json={
|
|
"metrics": [
|
|
{"name": name, "value": value, "step": step}
|
|
for name, value in metrics.items()
|
|
]
|
|
}
|
|
)
|
|
|
|
# Usage in training loop
|
|
for epoch in range(epochs):
|
|
train_loss = train_epoch()
|
|
val_loss = validate()
|
|
log_metrics({
|
|
"train_loss": train_loss,
|
|
"val_loss": val_loss,
|
|
"accuracy": accuracy
|
|
}, step=epoch)
|
|
```
|
|
|
|
### Read Metrics
|
|
|
|
```bash
|
|
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/metrics" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default"
|
|
```
|
|
|
|
---
|
|
|
|
## Training Schedules
|
|
|
|
### Create Schedule
|
|
|
|
```bash
|
|
curl -X POST "$AI_API_URL/v2/lm/executionSchedules" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"configurationId": "<configuration-id>",
|
|
"cron": "0 0 * * 0",
|
|
"start": "2024-01-01T00:00:00Z",
|
|
"end": "2024-12-31T23:59:59Z"
|
|
}'
|
|
```
|
|
|
|
### Cron Expression Format
|
|
|
|
SAP AI Core uses 5-field cron expressions with **3-letter day-of-week names**:
|
|
|
|
```
|
|
┌───────── minute (0-59)
|
|
│ ┌─────── hour (0-23)
|
|
│ │ ┌───── day of month (1-31)
|
|
│ │ │ ┌─── month (1-12)
|
|
│ │ │ │ ┌─ day of week (mon, tue, wed, thu, fri, sat, sun)
|
|
│ │ │ │ │
|
|
* * * * *
|
|
```
|
|
|
|
Examples:
|
|
- `0 0 * * *` - Daily at midnight
|
|
- `0 0 * * sun` - Weekly on Sunday
|
|
- `0 0 * * fri` - Weekly on Friday
|
|
- `0 0 1 * *` - Monthly on 1st
|
|
- `0 */6 * * *` - Every 6 hours
|
|
|
|
**Note:** Using `* * * * *` treats the schedule as "Run Always" (continuous check), which differs from standard cron behavior. Minimum interval for pipeline schedules is 1 hour.
|
|
|
|
### List Schedules
|
|
|
|
```bash
|
|
curl -X GET "$AI_API_URL/v2/lm/executionSchedules" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default"
|
|
```
|
|
|
|
### Delete Schedule
|
|
|
|
```bash
|
|
curl -X DELETE "$AI_API_URL/v2/lm/executionSchedules/<schedule-id>" \
|
|
-H "Authorization: Bearer $AUTH_TOKEN" \
|
|
-H "AI-Resource-Group: default"
|
|
```
|
|
|
|
---
|
|
|
|
## SAP AI Launchpad
|
|
|
|
### ML Operations App
|
|
|
|
Access: **Workspaces** → **ML Operations**
|
|
|
|
Features:
|
|
- View scenarios and executables
|
|
- Create/manage configurations
|
|
- Run/monitor executions
|
|
- View training metrics
|
|
- Manage artifacts
|
|
- Create schedules
|
|
|
|
### Required Roles
|
|
|
|
| Role | Capabilities |
|
|
|------|--------------|
|
|
| `operations_manager` | Access ML Operations app |
|
|
| `mloperations_viewer` | View-only access |
|
|
| `mloperations_editor` | Full edit access |
|
|
|
|
### Comparing Runs
|
|
|
|
1. Navigate to ML Operations → Executions
|
|
2. Select multiple executions
|
|
3. Click "Compare"
|
|
4. View side-by-side metrics and parameters
|
|
|
|
---
|
|
|
|
## Best Practices
|
|
|
|
### Workflow Design
|
|
|
|
1. **Modular steps**: Break workflow into reusable templates
|
|
2. **Parameterization**: Use parameters for hyperparameters
|
|
3. **Artifact management**: Define clear input/output artifacts
|
|
4. **Error handling**: Include retry logic for flaky operations
|
|
|
|
### Resource Management
|
|
|
|
1. **Appropriate sizing**: Match container resources to workload
|
|
2. **GPU allocation**: Request GPUs only when needed
|
|
3. **Storage**: Use object store for large datasets
|
|
4. **Cleanup**: Delete old executions and artifacts
|
|
|
|
### Monitoring
|
|
|
|
1. **Log metrics**: Track loss, accuracy, etc. during training
|
|
2. **Check logs**: Review execution logs for errors
|
|
3. **Compare runs**: Analyze different hyperparameter settings
|
|
4. **Set alerts**: Monitor for failed executions
|
|
|
|
---
|
|
|
|
## Troubleshooting
|
|
|
|
### Execution Failed
|
|
|
|
1. Check execution logs: `GET /v2/lm/executions/{id}/logs`
|
|
2. Verify object store secret exists and is named `default`
|
|
3. Check Docker image is accessible
|
|
4. Verify artifact paths are correct
|
|
5. Check resource quota not exceeded
|
|
|
|
### Artifacts Not Found
|
|
|
|
1. Verify artifact URL format: `ai://default/<path>`
|
|
2. Check object store secret permissions
|
|
3. Verify file exists in object store
|
|
4. Check artifact registered in correct scenario
|
|
|
|
### Schedule Not Running
|
|
|
|
1. Verify schedule is active (not paused)
|
|
2. Check cron expression is valid
|
|
3. Verify start/end dates bracket current time
|
|
4. Check configuration still exists
|
|
|
|
---
|
|
|
|
## Documentation Links
|
|
|
|
- Training Overview: [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/train-your-model-a9ceb06.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/train-your-model-a9ceb06.md)
|
|
- ML Operations (Launchpad): [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-launchpad/ml-operations-df78271.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-launchpad/ml-operations-df78271.md)
|
|
- Schedules: [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/create-a-training-schedule-bd409a9.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/create-a-training-schedule-bd409a9.md)
|
|
- Metrics: [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/view-the-metric-resource-for-an-execution-d85dd44.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/view-the-metric-resource-for-an-execution-d85dd44.md)
|