# Graphs and Pipelines Guide
Complete guide for graph/pipeline development in SAP Data Intelligence.
## Table of Contents
1. [Overview](#overview)
2. [Graph Concepts](#graph-concepts)
3. [Creating Graphs](#creating-graphs)
4. [Running Graphs](#running-graphs)
5. [Monitoring](#monitoring)
6. [Error Recovery](#error-recovery)
7. [Scheduling](#scheduling)
8. [Advanced Topics](#advanced-topics)
9. [Best Practices](#best-practices)
10. [Troubleshooting](#troubleshooting)
11. [Documentation Links](#documentation-links)
---
## Overview
Graphs (also called pipelines) are the core execution unit in SAP Data Intelligence.
**Definition:**
A graph is a network of operators connected via typed input/output ports for data transfer.
**Key Features:**
- Visual design in Modeler
- Two generations (Gen1, Gen2)
- Execution monitoring
- Error recovery (Gen2)
- Scheduling
---
## Graph Concepts
### Operator Generations
**Gen1 Operators:**
- Legacy operator model
- Process-based execution
- Manual error handling
- Broad compatibility
**Gen2 Operators:**
- Enhanced error recovery
- State management with snapshots
- Native multiplexing
- Better performance
**Critical Rule:** Gen1 and Gen2 operators cannot be mixed in the same graph.
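
The rule above lends itself to a mechanical check before a graph is run. The following is a minimal sketch, assuming a hypothetical in-memory representation in which each operator is a dict with a `generation` field; this is not the Modeler's actual graph format.

```python
from typing import Iterable, Mapping


def generations_used(operators: Iterable[Mapping[str, object]]) -> set:
    """Collect the distinct operator generations present in a graph."""
    return {op["generation"] for op in operators}


def is_single_generation(operators: Iterable[Mapping[str, object]]) -> bool:
    # A graph passes only if all of its operators share one generation.
    return len(generations_used(operators)) <= 1


# Hypothetical operator lists for illustration.
gen2_only = [{"name": "reader", "generation": 2}, {"name": "writer", "generation": 2}]
mixed = [{"name": "reader", "generation": 1}, {"name": "writer", "generation": 2}]
```

Here `is_single_generation(gen2_only)` is true and `is_single_generation(mixed)` is false.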
### Ports and Data Types
**Port Types:**
- Input ports: Receive data
- Output ports: Send data
- Typed connections
**Common Data Types:**
- string, int32, int64, float32, float64
- blob (binary data)
- message (structured data)
- table (tabular data)
- any (flexible type)
### Graph Structure
```
[Source Operator] ─── port ──→ [Processing Operator] ─── port ──→ [Target Operator]
        │                                │                                │
   Output Port                     In/Out Ports                      Input Port
```
---
## Creating Graphs
### Using the Modeler
1. **Create New Graph**
   - Open Modeler
   - Click "+" to create graph
   - Select graph name and location
2. **Add Operators**
   - Browse operator repository
   - Drag operators to canvas
   - Or search by name
3. **Connect Operators**
   - Drag from output port to input port
   - Verify type compatibility
   - Configure connection properties
4. **Configure Operators**
   - Select operator
   - Set parameters in Properties panel
   - Configure ports if needed
5. **Validate Graph**
   - Click Validate button
   - Review warnings and errors
   - Fix issues before running
### Graph-Level Configuration
**Graph Data Types:**
Create custom data types for the graph:
```json
{
  "name": "CustomerRecord",
  "properties": {
    "id": "string",
    "name": "string",
    "amount": "float64"
  }
}
```
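
To illustrate what such a type definition constrains, here is a minimal sketch that checks a record against the `CustomerRecord` definition. The mapping from type names to Python types and the `validate_record` helper are assumptions made for this example, not part of SAP Data Intelligence.

```python
import json

# Assumed mapping from the guide's type names to Python types.
PYTHON_TYPES = {
    "string": str,
    "int32": int,
    "int64": int,
    "float32": float,
    "float64": float,
}

TYPE_DEFINITION = json.loads("""
{
  "name": "CustomerRecord",
  "properties": {"id": "string", "name": "string", "amount": "float64"}
}
""")


def validate_record(record: dict, definition: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, type_name in definition["properties"].items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], PYTHON_TYPES[type_name]):
            problems.append(f"{field}: expected {type_name}")
    return problems
```

A record such as `{"id": "C1", "name": "Ada", "amount": 12.5}` yields no problems, while a record without `name` and with a string `amount` yields one problem per offending field.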
**Graph Parameters:**
Define runtime parameters:
```
Parameters:
  - name: source_path
    type: string
    default: /data/input/
  - name: batch_size
    type: int32
    default: 1000
```
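
The way declared defaults and runtime values combine can be sketched as follows. This is an illustration only: `resolve_parameters` and the cast table are hypothetical, and the real resolution logic of the platform may differ.

```python
# Parameter declarations mirroring the example above.
PARAMETER_SPECS = [
    {"name": "source_path", "type": "string", "default": "/data/input/"},
    {"name": "batch_size", "type": "int32", "default": 1000},
]

# Assumed casts for the two types used in this example.
CASTS = {"string": str, "int32": int}


def resolve_parameters(specs, overrides):
    """Merge runtime overrides over declared defaults, casting to the declared type."""
    unknown = set(overrides) - {spec["name"] for spec in specs}
    if unknown:
        raise KeyError(f"undeclared parameters: {sorted(unknown)}")
    resolved = {}
    for spec in specs:
        raw = overrides.get(spec["name"], spec["default"])
        resolved[spec["name"]] = CASTS[spec["type"]](raw)
    return resolved
```

Calling `resolve_parameters(PARAMETER_SPECS, {"batch_size": "5000"})` keeps the default `source_path` and converts the batch size to an integer. Rejecting undeclared names is a design choice of this sketch that surfaces typos early.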
### Groups and Tags
**Groups:**
- Organize operators visually
- Share Docker configuration
- Resource allocation
**Tags:**
- Label operators
- Filter and search
- Documentation
---
## Running Graphs
### Execution Methods
**Manual Execution:**
1. Open graph in Modeler
2. Click "Run" button
3. Configure runtime parameters
4. Monitor execution
**Programmatic Execution:**
Via Pipeline API or Data Workflow operators.
### Execution Model
**Process Model:**
- Each operator runs as a process
- Communication via ports
- Coordinated by main engine
**Gen2 Features:**
- Snapshot checkpoints
- State recovery
- Exactly-once semantics (configurable)
### Runtime Parameters
Pass parameters at execution:
```
source_path = /data/2024/january/
batch_size = 5000
target_table = SALES_JAN_2024
```
### Resource Configuration
**Memory/CPU:**
```json
{
  "resources": {
    "requests": {
      "memory": "1Gi",
      "cpu": "500m"
    },
    "limits": {
      "memory": "4Gi",
      "cpu": "2000m"
    }
  }
}
```
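
The quantities in this configuration look like Kubernetes-style resource quantities (`Gi` for gibibytes, `m` for millicores). Assuming that interpretation, a small sanity check that requests do not exceed limits could be sketched like this; the helper names are hypothetical.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

# Same values as the configuration above.
RESOURCES = {
    "requests": {"memory": "1Gi", "cpu": "500m"},
    "limits": {"memory": "4Gi", "cpu": "2000m"},
}


def parse_memory(value: str) -> int:
    """Convert a quantity such as '1Gi' to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # no suffix: treat as a plain byte count


def parse_cpu(value: str) -> float:
    """Convert a quantity such as '500m' (millicores) to cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def requests_within_limits(resources: dict) -> bool:
    """Check that requested memory and CPU do not exceed the limits."""
    req, lim = resources["requests"], resources["limits"]
    return (parse_memory(req["memory"]) <= parse_memory(lim["memory"])
            and parse_cpu(req["cpu"]) <= parse_cpu(lim["cpu"]))
```

Only the suffixes that appear in this guide plus `Ki` and `Mi` are handled; a production check would need the full quantity grammar.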
---
## Monitoring
### Graph Status
| Status | Description |
|--------|-------------|
| Pending | Waiting to start |
| Running | Actively executing |
| Completed | Finished successfully |
| Failed | Error occurred |
| Dead | Terminated unexpectedly |
| Stopping | Shutdown in progress |
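
When a graph is started programmatically, the caller typically polls its status until it settles. The sketch below assumes that `Completed`, `Failed`, and `Dead` are the terminal statuses (an inference from the table above) and takes a hypothetical `get_status` callable in place of a real API client.

```python
import time
from typing import Callable

# Assumed terminal statuses, inferred from the status table.
TERMINAL_STATUSES = {"Completed", "Failed", "Dead"}


def wait_for_graph(get_status: Callable[[], str], poll_seconds: float = 5.0,
                   max_polls: int = 100) -> str:
    """Poll until the graph reaches a terminal status or the poll budget runs out."""
    status = get_status()
    for _ in range(max_polls):
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
        status = get_status()
    raise TimeoutError(f"graph still {status!r} after {max_polls} polls")
```

Passing the status source as a callable keeps the loop independent of how the status is actually fetched.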
### Operator Status
| Status | Description |
|--------|-------------|
| Initializing | Setting up |
| Running | Processing data |
| Stopped | Finished or stopped |
| Failed | Error in operator |
### Monitoring Dashboard
**Available Metrics:**
- Messages processed
- Processing time
- Memory usage
- Error counts
**Access:**
1. Open running graph
2. Click "Monitor" tab
3. View real-time statistics
### Diagnostic Information
**Collect Diagnostics:**
1. Select running/failed graph
2. Click "Download Diagnostics"
3. Review logs and state
**Archive Contents:**
- execution.json (execution details)
- graphs.json (graph definition)
- events.json (execution events)
- Operator logs
- State snapshots
---
## Error Recovery
### Gen2 Error Recovery
**Automatic Recovery:**
1. Enable in graph settings
2. Configure snapshot interval
3. System recovers from last snapshot
**Configuration:**
```json
{
  "autoRecovery": {
    "enabled": true,
    "snapshotInterval": "60s",
    "maxRetries": 3
  }
}
```
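
The interplay of snapshots and retries can be illustrated with a toy model. This sketch is not the platform's recovery mechanism: it snapshots after a fixed number of items rather than on a time interval such as `60s`, and `run_with_recovery` is a hypothetical name.

```python
def run_with_recovery(items, process, snapshot_every=2, max_retries=3):
    """Process items in order, resuming from the last snapshot after a failure."""
    snapshot = 0        # index of the first item not covered by a snapshot
    retries = 0
    processed = []      # every successful call, including replays
    position = snapshot
    while position < len(items):
        try:
            process(items[position])
            processed.append(items[position])
            position += 1
            if position % snapshot_every == 0:
                snapshot = position      # take a snapshot
        except Exception:
            retries += 1
            if retries > max_retries:
                raise                    # retry budget exhausted
            position = snapshot          # roll back to the last snapshot
    return processed


# Hypothetical operator logic that fails once on item "d".
_failures = {"d": 1}


def flaky(item):
    if _failures.get(item, 0) > 0:
        _failures[item] -= 1
        raise RuntimeError(f"transient failure on {item}")


result = run_with_recovery(["a", "b", "c", "d"], flaky)
```

In this run the failure on `d` rolls back to the snapshot taken after `b`, so `c` is processed a second time. That replay is why recovery from snapshots gives at-least-once rather than exactly-once behavior unless downstream processing is idempotent.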
### Snapshots
**What's Saved:**
- Operator state
- Message queues
- Processing position
**When Snapshots Occur:**
- Periodic (configured interval)
- On operator request
- Before shutdown
### Delivery Guarantees
| Mode | Description |
|------|-------------|
| At-most-once | May lose messages |
| At-least-once | May duplicate messages |
| Exactly-once | No loss or duplication |
**Gen2 Default:** At-least-once with recovery.
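
With at-least-once delivery, a common way to avoid duplicate effects is to make the consumer idempotent by remembering which message IDs it has already applied. A minimal sketch of that general technique follows; the `IdempotentSink` class is illustrative and not an SAP Data Intelligence API.

```python
class IdempotentSink:
    """Applies each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen_ids = set()
        self.applied = []

    def deliver(self, msg_id, body):
        if msg_id in self.seen_ids:
            return False              # duplicate from a replay: ignore it
        self.seen_ids.add(msg_id)
        self.applied.append(body)
        return True


sink = IdempotentSink()
# Message 2 is delivered twice, as can happen after a recovery.
outcomes = [sink.deliver(1, "a"), sink.deliver(2, "b"), sink.deliver(2, "b")]
```

In a real operator the set of seen IDs would have to be persisted, otherwise a restart forgets which messages were already applied.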
### Manual Error Handling
**In Script Operators:**
```python
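def on_input(msg_id, header, body):
    try:
        # process() stands for the operator's own business logic
        result = process(body)
        api.send("output", api.Message(result))
    except Exception as e:
        # Log the failure and route the offending input to the error port
        api.logger.error(f"Processing error: {e}")
        api.send("error", api.Message({
            "error": str(e),
            "input": body
        }))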
```
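
The error-port pattern can be exercised outside the platform by substituting a stub for the injected `api` object. In this sketch `FakeApi` and `process` are hypothetical stand-ins written for the test; they only mimic the calls used in the handler above.

```python
import logging


class FakeApi:
    """Stand-in for the injected `api` object; records what the handler sends."""

    class Message:
        def __init__(self, body, attributes=None):
            self.body = body
            self.attributes = attributes or {}

    def __init__(self):
        self.logger = logging.getLogger("fake-operator")
        self.sent = []                # list of (port, message) tuples

    def send(self, port, message):
        self.sent.append((port, message))


api = FakeApi()


def process(body):
    # Hypothetical business logic: works only for strings.
    return body.upper()


def on_input(msg_id, header, body):
    try:
        api.send("output", api.Message(process(body)))
    except Exception as e:
        api.logger.error(f"Processing error: {e}")
        api.send("error", api.Message({"error": str(e), "input": body}))


on_input(1, {}, "ok")     # succeeds, routed to "output"
on_input(2, {}, 42)       # int has no upper(), routed to "error"
```

After the two calls, `api.sent` holds one message on the `output` port and one on the `error` port, with the failing input preserved in the error message body.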
---
## Scheduling
### Schedule Graph Executions
**Cron Expression Format:**
```
┌───────────── second (0-59)
│ ┌───────────── minute (0-59)
│ │ ┌───────────── hour (0-23)
│ │ │ ┌───────────── day of month (1-31)
│ │ │ │ ┌───────────── month (1-12)
│ │ │ │ │ ┌───────────── day of week (0-6, Sun=0)
│ │ │ │ │ │
* * * * * *
```
**Examples:**
```
0 0 * * * *      # Every hour, on the hour
0 0 0 * * *      # Daily at midnight
0 0 0 * * 1      # Every Monday at midnight
0 0 6 1 * *      # 6 AM on the first of the month
0 */15 * * * *   # Every 15 minutes
```
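
To check how a six-field expression is read, the following sketch matches an expression against a timestamp. It is a simplified matcher written for illustration: it supports only `*`, `*/n`, and single integers, and it requires every field to match, which differs from some cron implementations when both day fields are restricted.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Match one cron field; supports '*', '*/n', and a plain integer."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return int(field) == value


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if the six-field expression fires at the given moment."""
    second, minute, hour, day, month, weekday = expression.split()
    pairs = [
        (second, moment.second),
        (minute, moment.minute),
        (hour, moment.hour),
        (day, moment.day),
        (month, moment.month),
        (weekday, moment.isoweekday() % 7),   # Sunday -> 0, Monday -> 1
    ]
    return all(field_matches(f, v) for f, v in pairs)
```

For example, `cron_matches("0 0 0 * * 1", datetime(2024, 1, 1))` is true because 1 January 2024 was a Monday, and `cron_matches("0 */15 * * * *", datetime(2024, 1, 1, 10, 31))` is false.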
### Creating Schedule
1. Open graph
2. Click "Schedule"
3. Configure cron expression
4. Set timezone
5. Activate schedule
### Managing Schedules
**Actions:**
- View scheduled runs
- Pause schedule
- Resume schedule
- Delete schedule
- View execution history
---
## Advanced Topics
### Native Multiplexing (Gen2)
Connect one output to multiple inputs:
```
[Source] ─┬──→ [Processor A]
          ├──→ [Processor B]
          └──→ [Processor C]
```
Or multiple outputs to one input:
```
[Source A] ──┐
[Source B] ──┼──→ [Processor]
[Source C] ──┘
```
### Graph Snippets
Reusable graph fragments:
1. **Create Snippet:**
   - Select operators
   - Right-click > "Save as Snippet"
   - Name and save
2. **Use Snippet:**
   - Drag snippet to canvas
   - Configure parameters
   - Connect to graph
### Parameterization
**Substitution Variables:**
```
${parameter_name}
${ENV.VARIABLE_NAME}
${SYSTEM.TENANT}
```
**In Operator Config:**
```
File Path: ${source_path}/data_${DATE}.csv
Connection: ${target_connection}
```
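
The effect of substitution can be sketched with a small regular-expression helper. This only imitates the `${name}` syntax shown above; the `substitute` function and the example values are hypothetical, and the platform's own resolution rules may differ.

```python
import re

# Matches ${name}, including dotted names such as ${ENV.VARIABLE_NAME}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def substitute(template: str, values: dict) -> str:
    """Replace each ${name} with values[name]; unknown names raise KeyError."""
    return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)


# Hypothetical values for illustration.
path = substitute(
    "${source_path}/data_${DATE}.csv",
    {"source_path": "/data/2024/january", "DATE": "20240131"},
)
```

Here `path` becomes `/data/2024/january/data_20240131.csv`. Note that a `source_path` value ending in a slash would produce a double slash with this template.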
### Import/Export
**Export Graph:**
```
1. Select graph
2. Right-click > Export
3. Include data types (optional)
4. Save as .zip
```
**Import Graph:**
```
1. Right-click in repository
2. Import > From file
3. Select .zip file
4. Map dependencies
```
---
## Best Practices
### Graph Design
1. **Clear Data Flow**: Left-to-right, top-to-bottom
2. **Meaningful Names**: Descriptive operator names
3. **Group Related Operators**: Use groups for organization
4. **Document**: Add descriptions to operators
5. **Validate Often**: Check during development
### Performance
1. **Minimize Cross-Engine Communication**
2. **Use Appropriate Batch Sizes**
3. **Configure Resources**: Memory and CPU
4. **Enable Parallel Processing**: Where applicable
5. **Monitor and Tune**: Use metrics
### Error Handling
1. **Enable Auto-Recovery** (Gen2)
2. **Configure Appropriate Snapshot Interval**
3. **Implement Error Ports**: Route errors
4. **Log Sufficiently**: Debug information
5. **Test Failure Scenarios**: Validate recovery
### Maintenance
1. **Version Control**: Use graph versioning
2. **Document Changes**: Change history
3. **Test Before Deploy**: Validate thoroughly
4. **Monitor Production**: Watch for issues
5. **Clean Up**: Remove unused graphs
---
## Troubleshooting
### Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| Port type mismatch | Incompatible types | Use converter operator |
| Graph won't start | Resource constraints | Adjust resource config |
| Slow performance | Cross-engine overhead | Optimize operator placement |
| Recovery fails | Corrupt snapshot | Clear state, restart |
| Schedule not running | Incorrect cron | Verify expression |
### Diagnostic Steps
1. Check graph status
2. Review operator logs
3. Download diagnostics
4. Check resource usage
5. Verify connections
---
## Documentation Links
- **Using Graphs**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/using-graphs](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/using-graphs)
- **Creating Graphs**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/using-graphs/creating-graphs](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/using-graphs/creating-graphs)
- **Graph Examples**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/repositoryobjects/data-intelligence-graphs](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/repositoryobjects/data-intelligence-graphs)
---
**Last Updated**: 2025-11-22