# Structured Data Operators Guide

Complete guide for structured data processing in SAP Data Intelligence.

## Table of Contents

1. [Overview](#overview)
2. [Data Transform Operator](#data-transform-operator)
3. [Projection Node](#projection-node)
4. [Aggregation Node](#aggregation-node)
5. [Join Node](#join-node)
6. [Union Node](#union-node)
7. [Case Node](#case-node)
8. [Consumer Operators](#consumer-operators)
9. [Producer Operators](#producer-operators)
10. [Resiliency](#resiliency)

---

## Overview

Structured data operators provide SQL-like data processing capabilities with visual configuration.

**Key Components:**
- **Data Transform**: Visual SQL editor for transformations
- **Consumer Operators**: Read structured data from sources
- **Producer Operators**: Write structured data to targets

**Supported Formats:**
- CSV, Parquet, ORC, JSON (files)
- Database tables (HANA, SQL databases)
- SAP applications (S/4HANA, BW)

---

## Data Transform Operator

The Data Transform operator provides a visual editor for creating SQL-like transformations.

### Custom Editor Features

**Visual Designer:**
- Drag-and-drop node creation
- Visual connection of data flows
- Schema preview at each node

**Available Nodes:**
- Source: Input data connection
- Target: Output data connection
- Projection: Column operations
- Aggregation: GROUP BY operations
- Join: Combine datasets
- Union: Merge datasets
- Case: Conditional logic
- Filter: Row filtering

### Creating a Data Transform

1. Add Data Transform operator to graph
2. Open Custom Editor
3. Add Source node (connect to input port)
4. Add transformation nodes
5. Add Target node (connect to output port)
6. Configure each node
7. Validate and close editor

### Best Practices

- Use meaningful node names
- Preview data at each step
- Optimize join order for performance
- Use filters early to reduce data volume
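
As a minimal SQL sketch of these practices (filter early, then project and aggregate); ORDERS and its columns are purely illustrative names:

```sql
-- Illustrative sketch only; ORDERS, ORDER_DATE, REGION, SALES_AMOUNT are example names.
SELECT
    REGION,
    SUM(SALES_AMOUNT) AS TOTAL_SALES
FROM ORDERS
WHERE ORDER_DATE >= '2024-01-01'   -- filter early to reduce data volume
GROUP BY REGION;
```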

---

## Projection Node

The Projection node performs column-level operations.

### Capabilities

**Column Selection:**
- Include/exclude columns
- Rename columns
- Reorder columns

**Column Transformation:**
- Apply expressions
- Use DTL functions
- Create calculated columns

### Configuration

```
Source Columns:
- CUSTOMER_ID (include)
- CUSTOMER_NAME (include, rename to NAME)
- INTERNAL_CODE (exclude)

Calculated Columns:
- FULL_NAME: CONCAT(FIRST_NAME, ' ', LAST_NAME)
- ORDER_YEAR: YEAR(ORDER_DATE)
```
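
Expressed as SQL, this projection corresponds roughly to the following; SOURCE_DATA is a hypothetical name for the Source node:

```sql
-- Sketch of the projection configuration above; SOURCE_DATA is a hypothetical source name.
SELECT
    CUSTOMER_ID,
    CUSTOMER_NAME AS NAME,                            -- rename
    CONCAT(FIRST_NAME, ' ', LAST_NAME) AS FULL_NAME,  -- calculated column
    YEAR(ORDER_DATE) AS ORDER_YEAR                    -- calculated column
FROM SOURCE_DATA;                                     -- INTERNAL_CODE is simply not selected
```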

### Expression Examples

```sql
-- String concatenation
CONCAT(FIRST_NAME, ' ', LAST_NAME)

-- Conditional value
CASE WHEN AMOUNT > 1000 THEN 'High' ELSE 'Low' END

-- Date extraction
YEAR(ORDER_DATE)

-- Null handling
COALESCE(DISCOUNT, 0)

-- Type conversion
TO_DECIMAL(QUANTITY * PRICE, 15, 2)
```

---

## Aggregation Node

The Aggregation node performs GROUP BY operations with aggregate functions.

### Aggregate Functions

| Function | Description |
|----------|-------------|
| COUNT | Count rows |
| COUNT_DISTINCT | Count unique values |
| SUM | Sum of values |
| AVG | Average of values |
| MIN | Minimum value |
| MAX | Maximum value |
| FIRST | First value |
| LAST | Last value |

### Configuration

```
Group By Columns:
- REGION
- PRODUCT_CATEGORY

Aggregations:
- TOTAL_SALES: SUM(SALES_AMOUNT)
- ORDER_COUNT: COUNT(ORDER_ID)
- AVG_ORDER: AVG(SALES_AMOUNT)
- FIRST_ORDER: MIN(ORDER_DATE)
- LAST_ORDER: MAX(ORDER_DATE)
```
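
In SQL terms, this aggregation corresponds roughly to the following GROUP BY query; ORDERS is a hypothetical source name:

```sql
-- Sketch of the aggregation configuration above; ORDERS is a hypothetical source name.
SELECT
    REGION,
    PRODUCT_CATEGORY,
    SUM(SALES_AMOUNT) AS TOTAL_SALES,
    COUNT(ORDER_ID)   AS ORDER_COUNT,
    AVG(SALES_AMOUNT) AS AVG_ORDER,
    MIN(ORDER_DATE)   AS FIRST_ORDER,
    MAX(ORDER_DATE)   AS LAST_ORDER
FROM ORDERS
GROUP BY REGION, PRODUCT_CATEGORY;
```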

### Example Output Schema

```
Input:
ORDER_ID, REGION, PRODUCT_CATEGORY, SALES_AMOUNT, ORDER_DATE

Output:
REGION, PRODUCT_CATEGORY, TOTAL_SALES, ORDER_COUNT, AVG_ORDER, FIRST_ORDER, LAST_ORDER
```

---

## Join Node

The Join node combines data from multiple sources.

### Join Types

| Type | Description |
|------|-------------|
| INNER | Only matching rows |
| LEFT | All left + matching right |
| RIGHT | All right + matching left |
| FULL | All rows from both sides |

### Configuration

```
Join Type: LEFT
Left Source: ORDERS
Right Source: CUSTOMERS

Join Conditions:
- ORDERS.CUSTOMER_ID = CUSTOMERS.ID
- ORDERS.REGION = CUSTOMERS.REGION (optional)
```
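
As SQL, this join corresponds roughly to the following; the selected output columns are illustrative:

```sql
-- Sketch of the join configuration above; output columns are illustrative.
SELECT
    o.ORDER_ID,
    o.SALES_AMOUNT,
    c.CUSTOMER_NAME
FROM ORDERS o
LEFT JOIN CUSTOMERS c
    ON  o.CUSTOMER_ID = c.ID
    AND o.REGION      = c.REGION;  -- the optional second condition
```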

### Design Considerations

**Performance Tips:**
- Put smaller table on right side for LEFT joins
- Use indexed columns in join conditions
- Filter data before joining when possible
- Avoid Cartesian products (missing join condition)

**Multiple Joins:**
- Chain Join nodes for 3+ sources
- Consider join order for performance
- Validate intermediate results

### Handling Duplicates

```
Scenario: Customer has multiple orders
Solution: Aggregate before or after join depending on requirement
```
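
For example, aggregating orders per customer before the join keeps one output row per customer instead of one per order (a sketch; column names are illustrative):

```sql
-- Sketch: pre-aggregate orders so each customer joins to exactly one row.
SELECT
    c.ID,
    c.CUSTOMER_NAME,
    o.ORDER_COUNT,
    o.TOTAL_AMOUNT
FROM CUSTOMERS c
LEFT JOIN (
    SELECT CUSTOMER_ID,
           COUNT(ORDER_ID)   AS ORDER_COUNT,
           SUM(SALES_AMOUNT) AS TOTAL_AMOUNT
    FROM ORDERS
    GROUP BY CUSTOMER_ID
) o ON o.CUSTOMER_ID = c.ID;
```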

---

## Union Node

The Union node combines rows from multiple sources.

### Union Types

| Type | Description |
|------|-------------|
| UNION ALL | Include all rows (with duplicates) |
| UNION | Include distinct rows only |

### Configuration

```
Union Type: UNION ALL

Sources:
- ORDERS_2023
- ORDERS_2024

Column Mapping:
- ORDER_ID -> ORDER_ID
- CUSTOMER -> CUSTOMER_ID (rename)
- AMOUNT -> SALES_AMOUNT (rename)
```
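
In SQL terms, this union corresponds roughly to the following, assuming the renamed columns (CUSTOMER, AMOUNT) come from ORDERS_2024:

```sql
-- Sketch: each branch aliases its columns onto the common output schema.
SELECT ORDER_ID, CUSTOMER_ID, SALES_AMOUNT
FROM ORDERS_2023
UNION ALL
SELECT ORDER_ID,
       CUSTOMER AS CUSTOMER_ID,   -- rename per the column mapping
       AMOUNT   AS SALES_AMOUNT   -- rename per the column mapping
FROM ORDERS_2024;
```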

### Requirements

- Same number of columns from each source
- Compatible data types
- Column mapping for different names

### Adding Source Identifier

Use a calculated column to track data origin:

```sql
-- In source 1 projection
'2023' AS DATA_YEAR

-- In source 2 projection
'2024' AS DATA_YEAR
```

---

## Case Node

The Case node applies conditional logic to route or transform data.

### Conditional Expressions

```sql
CASE
    WHEN ORDER_TYPE = 'SALES' THEN 'Revenue'
    WHEN ORDER_TYPE = 'RETURN' THEN 'Refund'
    ELSE 'Other'
END AS TRANSACTION_TYPE
```

### Routing Data

Configure multiple output ports based on conditions:

```
Condition 1: REGION = 'EMEA' -> Output Port 1
Condition 2: REGION = 'APAC' -> Output Port 2
Default: -> Output Port 3
```
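
Conceptually, each output port receives the rows that match its condition, and the default port receives whatever matched none of them (typically including rows with a NULL region). As SQL filters, with INPUT_DATA as a hypothetical source:

```sql
-- Port 1 receives:
SELECT * FROM INPUT_DATA WHERE REGION = 'EMEA';

-- Port 2 receives:
SELECT * FROM INPUT_DATA WHERE REGION = 'APAC';

-- Default port receives the remaining rows:
SELECT * FROM INPUT_DATA
WHERE REGION IS NULL OR REGION NOT IN ('EMEA', 'APAC');
```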

### Nested Conditions

```sql
CASE
    WHEN AMOUNT > 10000 THEN
        CASE
            WHEN CUSTOMER_TYPE = 'VIP' THEN 'Priority High'
            ELSE 'Priority Medium'
        END
    ELSE 'Priority Low'
END AS PRIORITY
```

---

## Consumer Operators

Operators that read structured data from sources.

### Structured File Consumer

Reads from file storage (S3, Azure, GCS, HDFS, local).

**Supported Formats:**
- CSV (with header options)
- Parquet
- ORC
- JSON (JSON Lines format)

**Configuration:**
```
Connection: S3 Connection
Source: s3://bucket/path/*.parquet
Format: Parquet
Schema: Auto-detect or manual
Partition Pruning: date_column > '2024-01-01'
```

**Excel Support:**
- Read Excel files (.xlsx)
- Specify sheet name
- Define header row

### Structured SQL Consumer

Reads from SQL databases.

**Configuration:**
```
Connection: HANA Connection
Table/View: SALES_DATA
Columns: Select columns
Filter: WHERE clause
```
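
Conceptually, this consumer issues a query of roughly the following shape against the source; the column list and filter are illustrative:

```sql
-- Sketch of what the SQL consumer configuration above expresses; columns and filter are illustrative.
SELECT ORDER_ID, REGION, SALES_AMOUNT      -- the selected columns
FROM SALES_DATA                            -- the configured table/view
WHERE ORDER_DATE >= '2024-01-01';          -- the configured WHERE clause
```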

### SAP Application Consumer

Reads from SAP applications via OData or RFC.

**Configuration:**
```
Connection: S/4HANA Connection
Entity: A_SalesOrder
Select: OrderID, CustomerID, NetAmount
Filter: $filter=CreationDate gt '2024-01-01'
```

---

## Producer Operators

Operators that write structured data to targets.

### Structured File Producer

Writes to file storage.

**Configuration:**
```
Connection: S3 Connection
Target: s3://bucket/output/
Format: Parquet
Partition Columns: YEAR, MONTH
Compression: SNAPPY
```

**Partitioning Strategies:**
- By column values (e.g., year, region)
- By time (hourly, daily)
- By size (max rows per file)

### Structured Table Producer

Writes to database tables.

**Write Modes:**
- INSERT: Add new rows
- UPSERT: Insert or update
- DELETE: Remove matching rows
- TRUNCATE_INSERT: Clear and reload

**Configuration:**
```
Connection: HANA Connection
Table: TARGET_TABLE
Mode: UPSERT
Key Columns: ID, DATE
Batch Size: 10000
```
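
With Mode UPSERT and key columns ID and DATE, each incoming row behaves roughly like a MERGE on those keys; the non-key column VALUE and the bind parameters are illustrative:

```sql
-- Sketch of UPSERT semantics on the key columns (ID, DATE); :new_* values and VALUE are illustrative.
-- "DATE" is quoted because it is also a SQL type keyword.
MERGE INTO TARGET_TABLE t
USING (SELECT :new_id AS ID, :new_date AS "DATE", :new_value AS VALUE FROM DUMMY) s
    ON t.ID = s.ID AND t."DATE" = s."DATE"
WHEN MATCHED THEN
    UPDATE SET t.VALUE = s.VALUE
WHEN NOT MATCHED THEN
    INSERT (ID, "DATE", VALUE) VALUES (s.ID, s."DATE", s.VALUE);
```

Because re-applying the same rows leaves the target unchanged, UPSERT writes also support the exactly-once guidance in the Resiliency section.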

### SAP Application Producer

Writes to SAP applications.

**Configuration:**
```
Connection: S/4HANA Connection
Entity: A_SalesOrder
Operation: POST (create) / PATCH (update)
```

---

## Resiliency

Structured data operators support resiliency features for reliable processing.

### Checkpoint Configuration

```
Enable Checkpointing: Yes
Checkpoint Interval: 60 seconds
Checkpoint Location: /checkpoint/path
```

### Recovery Behavior

**On Failure:**
1. Graph stops at failure point
2. State saved to checkpoint
3. Manual or auto restart
4. Resume from last checkpoint

### Best Practices

- Enable checkpointing for long-running jobs
- Use appropriate checkpoint intervals
- Store checkpoints on reliable storage
- Monitor checkpoint sizes

### Exactly-Once Processing

For exactly-once semantics:
- Use UPSERT to database targets
- Enable deduplication for file targets
- Implement idempotent transformations

---

## Example: End-to-End Pipeline

```
[Structured File Consumer] -> [Data Transform] -> [Structured Table Producer]
        (CSV files)                  |                  (HANA table)
                                     |
                          [Projection]  - Select columns
                                     |
                          [Join]        - Enrich with master data
                                     |
                          [Aggregation] - Summarize by region
                                     |
                          [Case]        - Apply business rules
```

---

## Documentation Links

- **Structured Data Operators**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/working-with-structureddata-operators](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/working-with-structureddata-operators)
- **Data Transform**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/blob/main/docs/modelingguide/working-with-structureddata-operators/data-transform-8fe8c02.md](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/blob/main/docs/modelingguide/working-with-structureddata-operators/data-transform-8fe8c02.md)

---

**Last Updated**: 2025-11-22