# Subengines Development Guide
Complete guide for developing operators with subengines in SAP Data Intelligence.
## Table of Contents
1. [Overview](#overview)
2. [Subengine Architecture](#subengine-architecture)
3. [Python Subengine](#python-subengine)
4. [Node.js Subengine](#nodejs-subengine)
5. [C++ Subengine](#c-subengine)
6. [FlowAgent Subengine](#flowagent-subengine)
7. [Performance Optimization](#performance-optimization)
8. [Best Practices](#best-practices)
---
## Overview
Subengines enable operators to run on different runtimes within SAP Data Intelligence.
**Supported Subengines:**
- **Python 3.9**: Data science and ML workflows
- **Node.js**: JavaScript-based processing
- **C++**: High-performance native operators
- **ABAP**: ABAP Pipeline Engine (source systems)
- **FlowAgent**: Database connectivity
**Key Benefits:**
- Language flexibility
- Performance optimization
- Specialized libraries
- Same-engine process sharing
---
## Subengine Architecture
### Execution Model
```
Main Engine (Coordinator)
├── Python Subengine Process
│ ├── Python Operator 1
│ └── Python Operator 2 (same process)
├── Node.js Subengine Process
│ └── JavaScript Operator
└── Native Engine Process
└── Native Operator
```
### Communication
**Same Engine Communication:**
- In-memory data transfer
- No serialization overhead
- Optimal performance
**Cross-Engine Communication:**
- Serialization required
- Inter-process communication
- Higher latency
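A rough sense of what a boundary crossing costs: the producing engine serializes every message and the consuming engine deserializes it. A minimal illustration of that round trip in plain Python (ordinary json/time usage, not the SAP DI runtime):
```python
import json
import time

# Simulate the serialize/deserialize round trip a message pays
# each time it crosses an engine boundary.
records = [{"id": i, "value": i * 1.5} for i in range(100_000)]

start = time.perf_counter()
wire = json.dumps(records)      # producing engine: serialize
received = json.loads(wire)     # consuming engine: deserialize
elapsed = time.perf_counter() - start

print(f"Round trip for {len(received)} records: {elapsed:.3f}s")
# Same-engine operators hand data over in memory and skip this entirely.
```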
### Engine Selection
The optimizer selects engines to minimize communication:
```
Graph: [Python Op A] -> [Python Op B] -> [JS Op C]
Execution:
- Ops A and B: Same Python process
- Op C: Separate Node.js process
- Data serialized between B and C
```
---
## Python Subengine
The most commonly used subengine for data processing and ML.
### Python 3.9 Operator (Gen2)
**Creating Python Operator:**
```python
# Operator script
import pandas as pd

def on_input(msg_id, header, body):
    """Process incoming message."""
    # Process data
    df = pd.DataFrame(body)
    result = df.groupby('category').sum()
    # Send output
    api.send("output", api.Message(result.to_dict()))

# Register callback
api.set_port_callback("input", on_input)
```
### API Reference
**Message Handling:**
```python
# Set port callback
api.set_port_callback("port_name", callback_function)
# Send message
api.send("port_name", api.Message(body, attributes={}))
# Create message with attributes
msg = api.Message(
body={"data": values},
attributes={"source": "python"}
)
```
**Configuration Access:**
```python
# Get configuration parameter
value = api.config.param_name
# Get with default
value = getattr(api.config, 'param_name', default_value)
```
**Logging:**
```python
api.logger.info("Processing started")
api.logger.warning("Potential issue detected")
api.logger.error("Error occurred")
```
### State Management
```python
# Initialize state
state = {"counter": 0, "cache": {}}
def on_input(msg_id, header, body):
global state
state["counter"] += 1
# Process with state
if body["id"] in state["cache"]:
result = state["cache"][body["id"]]
else:
result = process(body)
state["cache"][body["id"]] = result
api.send("output", api.Message(result))
api.set_port_callback("input", on_input)
```
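Note that the cache in this example grows without bound. If the id space is large, bound it; a minimal LRU sketch using only the standard library, where process() stands for the processing function from the example above:
```python
from collections import OrderedDict

MAX_CACHE = 10_000
_cache = OrderedDict()

def cached_process(body):
    """Return a cached result for body['id'], evicting old entries."""
    key = body["id"]
    if key in _cache:
        _cache.move_to_end(key)      # mark as most recently used
        return _cache[key]
    result = process(body)           # process() as in the example above
    _cache[key] = result
    if len(_cache) > MAX_CACHE:
        _cache.popitem(last=False)   # evict the least recently used entry
    return result
```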
### Using External Libraries
```python
# Pre-installed libraries available
import pandas as pd
import numpy as np
import sklearn
import tensorflow
import torch
# Custom libraries via Dockerfile
# (see Creating Dockerfiles section)
```
### Managed Connections
Access database connections from Python:
```python
def on_input(msg_id, header, body):
# Get connection
conn = api.get_connection("HANA_CONNECTION")
# Execute query
cursor = conn.cursor()
cursor.execute("SELECT * FROM TABLE")
rows = cursor.fetchall()
api.send("output", api.Message(rows))
```
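When the query depends on incoming data, avoid assembling SQL strings by hand. Assuming the connection object follows the Python DB-API (true of most database drivers, but verify for your connection type), bind parameters and close the cursor when done:
```python
def on_input(msg_id, header, body):
    conn = api.get_connection("HANA_CONNECTION")
    cursor = conn.cursor()
    try:
        # Bound parameters keep untrusted input out of the SQL text.
        # The placeholder style (?, %s, :1, ...) depends on the driver.
        cursor.execute(
            "SELECT * FROM SALES WHERE REGION = ?",
            (body["region"],),
        )
        rows = cursor.fetchall()
        api.send("output", api.Message(rows))
    finally:
        cursor.close()
```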
---
## Node.js Subengine
JavaScript-based operator development.
### Creating Node.js Operator
```javascript
// Operator script using @sap/vflow-sub-node-sdk
const { Operator } = require("@sap/vflow-sub-node-sdk");
// Get operator instance
const operator = Operator.getInstance();
// Set up input port handler
operator.getInPort("input").onMessage((ctx) => {
// Process message
const data = ctx.body;
const result = processData(data);
// Send to output port
operator.getOutPort("output").send(result);
});
function processData(data) {
// Transform data
return data.map((item) => {
return {
id: item.id,
value: item.value * 2
};
});
}
```
### API Reference
**Message Handling:**
```javascript
const { Operator } = require("@sap/vflow-sub-node-sdk");
const operator = Operator.getInstance();
// Set port callback
operator.getInPort("port_name").onMessage((ctx) => {
// Access message body via ctx.body
const data = ctx.body;
// Process message
});
// Send to output port
operator.getOutPort("output").send(data);
// Send to specific named port
operator.getOutPort("port_name").send(data);
```
**Configuration:**
```javascript
const operator = Operator.getInstance();
// Access config
const paramValue = operator.config.paramName;
```
**Logging:**
```javascript
const operator = Operator.getInstance();
// Use operator logger
operator.logger.info("Information message");
operator.logger.debug("Debug message");
operator.logger.error("Error message");
```
### Node.js Data Types
| SAP DI Type | Node.js Type |
|-------------|--------------|
| string | String |
| int32 | Number |
| int64 | BigInt |
| float32 | Number |
| float64 | Number |
| blob | Buffer |
| message | Object |
### Safe Integer Handling
```javascript
// Large integers may lose precision
// Use BigInt for int64 values
const { Operator } = require("@sap/vflow-sub-node-sdk");
const operator = Operator.getInstance();
operator.getInPort("input").onMessage((ctx) => {
const value = BigInt(ctx.body.largeNumber);
// Process safely
});
```
### Node Modules
```javascript
// Built-in modules available
const fs = require('fs');
const path = require('path');
const https = require('https');
// Custom modules via Dockerfile
```
---
## C++ Subengine
High-performance native operator development.
### Getting Started
1. Install C++ SDK
2. Create operator class
3. Implement interfaces
4. Compile and upload
### Operator Implementation
```cpp
// custom_operator.h
#include "sdi/subengine.h"
#include "sdi/operator.h"
class CustomOperator : public sdi::BaseOperator {
public:
CustomOperator(const sdi::OperatorConfig& config);
void init() override;
void start() override;
void shutdown() override;
private:
void onInput(const sdi::Message& msg);
std::string m_parameter;
};
```
```cpp
// custom_operator.cpp
#include "custom_operator.h"
CustomOperator::CustomOperator(const sdi::OperatorConfig& config)
: BaseOperator(config) {
m_parameter = config.get<std::string>("parameter");
}
void CustomOperator::init() {
registerPortCallback("input",
[this](const sdi::Message& msg) { onInput(msg); });
}
void CustomOperator::start() {
LOG_INFO("Operator started");
}
void CustomOperator::onInput(const sdi::Message& msg) {
// Process message
auto data = msg.body<std::vector<int>>();
// Transform
for (auto& val : data) {
val *= 2;
}
// Send output
send("output", sdi::Message(data));
}
void CustomOperator::shutdown() {
LOG_INFO("Operator shutdown");
}
```
### Building and Uploading
```bash
# Build
mkdir build && cd build
cmake ..
make
# Package
tar -czvf operator.tar.gz libcustom_operator.so manifest.json
# Upload via System Management
```
---
## FlowAgent Subengine
Database connectivity subengine.
### Purpose
FlowAgent provides:
- Database connection pooling
- Efficient data transfer
- Native database drivers
### Supported Databases
- SAP HANA
- SAP IQ
- Microsoft SQL Server
- Oracle
- PostgreSQL
- MySQL
- DB2
### FlowAgent Operators
Pre-built operators using FlowAgent:
- **SQL Consumer**: Execute SELECT queries
- **SQL Executor**: Execute DDL/DML
- **Table Consumer**: Read tables
- **Table Producer**: Write tables
### Configuration
```
Connection: Database Connection ID
SQL Statement: SELECT * FROM SALES WHERE YEAR = 2024
Batch Size: 10000
Fetch Size: 5000
```
---
## Performance Optimization
### Minimize Cross-Engine Communication
Each engine boundary a message crosses costs one serialization/deserialization round trip. Where the dataflow allows it, order operators so that those sharing an engine are adjacent:
```
Bad:
[Python A] -> [JS B] -> [Python C] -> [JS D]
(3 cross-engine hops, 3 serialization points)
Good:
[Python A] -> [Python C] -> [JS B] -> [JS D]
(1 cross-engine hop, 1 serialization point)
```
### Batch Processing
```python
# Process in batches; the buffer must live outside the callback,
# otherwise it is re-created empty on every message
batch = []

def on_input(msg_id, header, body):
    batch.append(body)
    if len(batch) >= 1000:
        process_batch(batch)   # process_batch() is application logic
        batch.clear()
# Remember to flush a partial final batch when the graph stops
```
### Memory Management
```python
# Stream large data instead of materializing it all at once
import io
import pandas as pd

def on_input(msg_id, header, body):
    # body is assumed to be CSV text; if it is a file path,
    # pass it to read_csv directly instead of wrapping it
    for chunk in pd.read_csv(io.StringIO(body), chunksize=10000):
        result = process(chunk)
        api.send("output", api.Message(result))
```
### Connection Pooling
```python
# Reuse connections
_connection = None
def get_connection():
global _connection
if _connection is None:
_connection = api.get_connection("DB_CONN")
return _connection
```
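If the runtime delivers messages from more than one thread, the lazy initialization above can race and open two connections. A sketch that guards it with a lock (whether callbacks run concurrently depends on your runtime; single-threaded delivery does not need this):
```python
import threading

_connection = None
_conn_lock = threading.Lock()

def get_connection():
    """Thread-safe lazy initialization of a shared connection."""
    global _connection
    if _connection is None:              # fast path without the lock
        with _conn_lock:
            if _connection is None:      # re-check under the lock
                _connection = api.get_connection("DB_CONN")
    return _connection
```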
---
## Best Practices
### Operator Design
1. **Single Responsibility**: One task per operator
2. **Stateless When Possible**: Easier recovery
3. **Handle Errors Gracefully**: Try-catch with logging
4. **Clean Up Resources**: Close connections and files (see the sketch below)
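A sketch of the cleanup pattern for point 4: keep handles in one place and release them in a single teardown function. The shutdown hook name below (api.set_prestop) is an assumption; it varies by generation and version, so confirm it in your Python operator API reference:
```python
_resources = {"conn": None, "file": None}

def teardown():
    """Release everything the operator opened, tolerating partial setup."""
    if _resources["file"] is not None:
        _resources["file"].close()
    if _resources["conn"] is not None:
        _resources["conn"].close()

# Hypothetical shutdown hook; check the exact name for your version
api.set_prestop(teardown)
```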
### Code Organization
```python
# Good: Modular code
def validate_input(data):
"""Validate input data."""
if not data:
raise ValueError("Empty input")
return True
def transform_data(data):
    """Transform data."""
    # Example transformation: double every numeric value
    return {key: value * 2 for key, value in data.items()}
def on_input(msg_id, header, body):
try:
validate_input(body)
result = transform_data(body)
api.send("output", api.Message(result))
except Exception as e:
api.logger.error(f"Error: {e}")
api.send("error", api.Message({"error": str(e)}))
```
### Testing
1. **Unit Test Logic**: Test functions independently (see the sketch below)
2. **Integration Test**: Test with sample data
3. **Performance Test**: Verify throughput
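Keeping logic in plain functions, as in the Code Organization example, makes point 1 straightforward: the functions can be tested without the api object. A minimal pytest sketch, assuming validate_input and transform_data live in a module you can import (my_operator is a hypothetical name):
```python
# test_my_operator.py -- run with: pytest test_my_operator.py
import pytest

# Hypothetical module holding the functions from the example above
from my_operator import validate_input, transform_data

def test_validate_input_rejects_empty():
    with pytest.raises(ValueError):
        validate_input(None)

def test_validate_input_accepts_data():
    assert validate_input({"amount": 10.0}) is True

def test_transform_data_doubles_values():
    assert transform_data({"amount": 10.0}) == {"amount": 20.0}
```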
### Documentation
```python
def on_input(msg_id, header, body):
"""
Process incoming sales data.
Input Schema:
- order_id: string
- amount: float
- date: string (YYYY-MM-DD)
Output Schema:
- order_id: string
- processed_amount: float
- quarter: int
"""
# Implementation
```
---
## Documentation Links
- **Subengines Overview**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/subengines](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/subengines)
- **Python Subengine**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/blob/main/docs/modelingguide/subengines/create-operators-with-the-python-subengine-7e8f7d2.md](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/blob/main/docs/modelingguide/subengines/create-operators-with-the-python-subengine-7e8f7d2.md)
- **Node.js Subengine**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/subengines](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/subengines) (Node.js SDK files)
- **C++ Subengine**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/blob/main/docs/modelingguide/subengines/working-with-the-c-subengine-to-create-operators-d8f634c.md](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/blob/main/docs/modelingguide/subengines/working-with-the-c-subengine-to-create-operators-d8f634c.md)
---
**Last Updated**: 2025-11-22