Files
2025-11-30 08:54:41 +08:00

506 lines
13 KiB
Markdown

# ML Operations Reference
Complete reference for SAP AI Core ML training and operations.
**Documentation Source:** [https://github.com/SAP-docs/sap-artificial-intelligence/tree/main/docs/sap-ai-core](https://github.com/SAP-docs/sap-artificial-intelligence/tree/main/docs/sap-ai-core)
---
## Overview
SAP AI Core uses Argo Workflows for training pipelines, supporting batch jobs for model preprocessing, training, and inference.
### Key Components
| Component | Description |
|-----------|-------------|
| **Scenarios** | AI use case implementations |
| **Executables** | Reusable workflow templates |
| **Configurations** | Parameters and artifact bindings |
| **Executions** | Running instances of workflows |
| **Artifacts** | Datasets, models, and results |
---
## Workflow Engine
### Argo Workflows
SAP AI Core uses Argo Workflows (container-native workflow engine) supporting:
- Direct Acyclic Graph (DAG) structures
- Parallel step execution
- Container-based steps
- Data ingestion and preprocessing
- Model training and batch inference
**Limitation:** Not optimized for time-critical tasks due to scheduling overhead.
---
## Prerequisites
### 1. Object Store Secret (Required)
Create a secret named `default` for training output artifacts:
```bash
curl -X POST "$AI_API_URL/v2/admin/objectStoreSecrets" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "default",
"type": "S3",
"pathPrefix": "my-bucket/training-output",
"data": {
"AWS_ACCESS_KEY_ID": "<access-key>",
"AWS_SECRET_ACCESS_KEY": "<secret-key>"
}
}'
```
**Note:** Without a `default` secret, training pipelines will fail.
### 2. Docker Registry Secret
For custom training images:
```bash
curl -X POST "$AI_API_URL/v2/admin/dockerRegistrySecrets" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "docker-registry",
"data": {
".dockerconfigjson": "<base64-encoded-docker-config>"
}
}'
```
### 3. Git Repository
Sync workflow templates from Git:
```bash
curl -X POST "$AI_API_URL/v2/admin/repositories" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "training-repo",
"url": "[https://github.com/org/training-workflows",](https://github.com/org/training-workflows",)
"username": "<git-user>",
"password": "<git-token>"
}'
```
---
## Workflow Template
### Basic Structure
```yaml
apiVersion: ai.sap.com/v1alpha1
kind: WorkflowTemplate
metadata:
name: text-classifier-training
annotations:
scenarios.ai.sap.com/description: "Train text classification model"
scenarios.ai.sap.com/name: "text-classifier"
executables.ai.sap.com/description: "Training executable"
executables.ai.sap.com/name: "text-classifier-train"
artifacts.ai.sap.com/training-data.kind: "dataset"
artifacts.ai.sap.com/trained-model.kind: "model"
labels:
scenarios.ai.sap.com/id: "text-classifier"
executables.ai.sap.com/id: "text-classifier-train"
ai.sap.com/version: "1.0.0"
spec:
imagePullSecrets:
- name: docker-registry
entrypoint: main
arguments:
parameters:
- name: learning_rate
default: "0.001"
- name: epochs
default: "10"
artifacts:
- name: training-data
path: /data/input
archive:
none: {}
templates:
- name: main
steps:
- - name: preprocess
template: preprocess-data
- - name: train
template: train-model
- - name: evaluate
template: evaluate-model
- name: preprocess-data
container:
image: my-registry/preprocessing:latest
command: ["python", "preprocess.py"]
args: ["--input", "/data/input", "--output", "/data/processed"]
- name: train-model
container:
image: my-registry/training:latest
command: ["python", "train.py"]
args:
- "--data=/data/processed"
- "--lr={{workflow.parameters.learning_rate}}"
- "--epochs={{workflow.parameters.epochs}}"
- "--output=/data/model"
outputs:
artifacts:
- name: trained-model
path: /data/model
globalName: trained-model
archive:
none: {}
- name: evaluate-model
container:
image: my-registry/evaluation:latest
command: ["python", "evaluate.py"]
args: ["--model", "/data/model"]
```
### Annotations Reference
| Annotation | Description |
|------------|-------------|
| `scenarios.ai.sap.com/name` | Human-readable scenario name |
| `scenarios.ai.sap.com/id` | Scenario identifier |
| `executables.ai.sap.com/name` | Executable name |
| `executables.ai.sap.com/id` | Executable identifier |
| `artifacts.ai.sap.com/<name>.kind` | Artifact type (dataset, model, etc.) |
---
## Artifacts
### Types
| Kind | Description | Use Case |
|------|-------------|----------|
| `dataset` | Training/validation data | Input for training |
| `model` | Trained model | Output from training |
| `resultset` | Inference results | Output from batch inference |
| `other` | Miscellaneous | Logs, metrics, configs |
### Register Input Artifact
```bash
curl -X POST "$AI_API_URL/v2/lm/artifacts" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "training-dataset-v1",
"kind": "dataset",
"url": "ai://default/datasets/training-v1",
"scenarioId": "text-classifier",
"description": "Training dataset version 1"
}'
```
### URL Format
- `ai://default/<path>` - Uses default object store secret
- `ai://<secret-name>/<path>` - Uses named object store secret
### List Artifacts
```bash
curl -X GET "$AI_API_URL/v2/lm/artifacts?scenarioId=text-classifier" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
```
---
## Configurations
### Create Training Configuration
```bash
curl -X POST "$AI_API_URL/v2/lm/configurations" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"name": "text-classifier-config-v1",
"executableId": "text-classifier-train",
"scenarioId": "text-classifier",
"parameterBindings": [
{"key": "learning_rate", "value": "0.001"},
{"key": "epochs", "value": "20"},
{"key": "batch_size", "value": "32"}
],
"inputArtifactBindings": [
{"key": "training-data", "artifactId": "<dataset-artifact-id>"}
]
}'
```
---
## Executions
### Create Execution
```bash
curl -X POST "$AI_API_URL/v2/lm/executions" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"configurationId": "<configuration-id>"
}'
```
### Execution Statuses
| Status | Description |
|--------|-------------|
| `UNKNOWN` | Initial state |
| `PENDING` | Queued for execution |
| `RUNNING` | Currently executing |
| `COMPLETED` | Finished successfully |
| `DEAD` | Failed |
| `STOPPED` | Manually stopped |
### Check Execution Status
```bash
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
```
### Get Execution Logs
```bash
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/logs" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
```
### Stop Execution
```bash
curl -X PATCH "$AI_API_URL/v2/lm/executions/<execution-id>" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{"targetStatus": "STOPPED"}'
```
---
## Metrics
### Write Metrics from Training
In your training code:
```python
import requests
import os
def log_metrics(metrics: dict, step: int):
"""Log metrics to SAP AI Core."""
api_url = os.environ.get("AICORE_API_URL")
token = os.environ.get("AICORE_AUTH_TOKEN")
execution_id = os.environ.get("AICORE_EXECUTION_ID")
response = requests.post(
f"{api_url}/v2/lm/executions/{execution_id}/metrics",
headers={
"Authorization": f"Bearer {token}",
"Content-Type": "application/json"
},
json={
"metrics": [
{"name": name, "value": value, "step": step}
for name, value in metrics.items()
]
}
)
# Usage in training loop
for epoch in range(epochs):
train_loss = train_epoch()
val_loss = validate()
log_metrics({
"train_loss": train_loss,
"val_loss": val_loss,
"accuracy": accuracy
}, step=epoch)
```
### Read Metrics
```bash
curl -X GET "$AI_API_URL/v2/lm/executions/<execution-id>/metrics" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
```
---
## Training Schedules
### Create Schedule
```bash
curl -X POST "$AI_API_URL/v2/lm/executionSchedules" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default" \
-H "Content-Type: application/json" \
-d '{
"configurationId": "<configuration-id>",
"cron": "0 0 * * 0",
"start": "2024-01-01T00:00:00Z",
"end": "2024-12-31T23:59:59Z"
}'
```
### Cron Expression Format
SAP AI Core uses 5-field cron expressions with **3-letter day-of-week names**:
```
┌───────── minute (0-59)
│ ┌─────── hour (0-23)
│ │ ┌───── day of month (1-31)
│ │ │ ┌─── month (1-12)
│ │ │ │ ┌─ day of week (mon, tue, wed, thu, fri, sat, sun)
│ │ │ │ │
* * * * *
```
Examples:
- `0 0 * * *` - Daily at midnight
- `0 0 * * sun` - Weekly on Sunday
- `0 0 * * fri` - Weekly on Friday
- `0 0 1 * *` - Monthly on 1st
- `0 */6 * * *` - Every 6 hours
**Note:** Using `* * * * *` treats the schedule as "Run Always" (continuous check), which differs from standard cron behavior. Minimum interval for pipeline schedules is 1 hour.
### List Schedules
```bash
curl -X GET "$AI_API_URL/v2/lm/executionSchedules" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
```
### Delete Schedule
```bash
curl -X DELETE "$AI_API_URL/v2/lm/executionSchedules/<schedule-id>" \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "AI-Resource-Group: default"
```
---
## SAP AI Launchpad
### ML Operations App
Access: **Workspaces****ML Operations**
Features:
- View scenarios and executables
- Create/manage configurations
- Run/monitor executions
- View training metrics
- Manage artifacts
- Create schedules
### Required Roles
| Role | Capabilities |
|------|--------------|
| `operations_manager` | Access ML Operations app |
| `mloperations_viewer` | View-only access |
| `mloperations_editor` | Full edit access |
### Comparing Runs
1. Navigate to ML Operations → Executions
2. Select multiple executions
3. Click "Compare"
4. View side-by-side metrics and parameters
---
## Best Practices
### Workflow Design
1. **Modular steps**: Break workflow into reusable templates
2. **Parameterization**: Use parameters for hyperparameters
3. **Artifact management**: Define clear input/output artifacts
4. **Error handling**: Include retry logic for flaky operations
### Resource Management
1. **Appropriate sizing**: Match container resources to workload
2. **GPU allocation**: Request GPUs only when needed
3. **Storage**: Use object store for large datasets
4. **Cleanup**: Delete old executions and artifacts
### Monitoring
1. **Log metrics**: Track loss, accuracy, etc. during training
2. **Check logs**: Review execution logs for errors
3. **Compare runs**: Analyze different hyperparameter settings
4. **Set alerts**: Monitor for failed executions
---
## Troubleshooting
### Execution Failed
1. Check execution logs: `GET /v2/lm/executions/{id}/logs`
2. Verify object store secret exists and is named `default`
3. Check Docker image is accessible
4. Verify artifact paths are correct
5. Check resource quota not exceeded
### Artifacts Not Found
1. Verify artifact URL format: `ai://default/<path>`
2. Check object store secret permissions
3. Verify file exists in object store
4. Check artifact registered in correct scenario
### Schedule Not Running
1. Verify schedule is active (not paused)
2. Check cron expression is valid
3. Verify start/end dates bracket current time
4. Check configuration still exists
---
## Documentation Links
- Training Overview: [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/train-your-model-a9ceb06.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/train-your-model-a9ceb06.md)
- ML Operations (Launchpad): [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-launchpad/ml-operations-df78271.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-launchpad/ml-operations-df78271.md)
- Schedules: [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/create-a-training-schedule-bd409a9.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/create-a-training-schedule-bd409a9.md)
- Metrics: [https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/view-the-metric-resource-for-an-execution-d85dd44.md](https://github.com/SAP-docs/sap-artificial-intelligence/blob/main/docs/sap-ai-core/view-the-metric-resource-for-an-execution-d85dd44.md)