# Structured Data Operators Guide

Complete guide for structured data processing in SAP Data Intelligence.

## Table of Contents

1. [Overview](#overview)
2. [Data Transform Operator](#data-transform-operator)
3. [Projection Node](#projection-node)
4. [Aggregation Node](#aggregation-node)
5. [Join Node](#join-node)
6. [Union Node](#union-node)
7. [Case Node](#case-node)
8. [Consumer Operators](#consumer-operators)
9. [Producer Operators](#producer-operators)
10. [Resiliency](#resiliency)

---

## Overview

Structured data operators provide SQL-like data processing capabilities with visual configuration.

**Key Components:**

- **Data Transform**: Visual SQL editor for transformations
- **Consumer Operators**: Read structured data from sources
- **Producer Operators**: Write structured data to targets

**Supported Formats:**

- CSV, Parquet, ORC, JSON (files)
- Database tables (HANA, SQL databases)
- SAP applications (S/4HANA, BW)

---

## Data Transform Operator

The Data Transform operator provides a visual editor for creating SQL-like transformations.

### Custom Editor Features

**Visual Designer:**

- Drag-and-drop node creation
- Visual connection of data flows
- Schema preview at each node

**Available Nodes:**

- Source: Input data connection
- Target: Output data connection
- Projection: Column operations
- Aggregation: GROUP BY operations
- Join: Combine datasets
- Union: Merge datasets
- Case: Conditional logic
- Filter: Row filtering

### Creating a Data Transform

1. Add the Data Transform operator to the graph
2. Open the Custom Editor
3. Add a Source node (connect to the input port)
4. Add transformation nodes
5. Add a Target node (connect to the output port)
6. Configure each node
7. Validate and close the editor

### Best Practices

- Use meaningful node names
- Preview data at each step
- Optimize join order for performance
- Use filters early to reduce data volume

---

## Projection Node

The Projection node performs column-level operations.

### Capabilities

**Column Selection:**

- Include/exclude columns
- Rename columns
- Reorder columns

**Column Transformation:**

- Apply expressions
- Use DTL functions
- Create calculated columns

### Configuration

```
Source Columns:
- CUSTOMER_ID (include)
- CUSTOMER_NAME (include, rename to NAME)
- INTERNAL_CODE (exclude)

Calculated Columns:
- FULL_NAME: CONCAT(FIRST_NAME, ' ', LAST_NAME)
- ORDER_YEAR: YEAR(ORDER_DATE)
```

### Expression Examples

```sql
-- String concatenation
CONCAT(FIRST_NAME, ' ', LAST_NAME)

-- Conditional value
CASE WHEN AMOUNT > 1000 THEN 'High' ELSE 'Low' END

-- Date extraction
YEAR(ORDER_DATE)

-- Null handling
COALESCE(DISCOUNT, 0)

-- Type conversion
TO_DECIMAL(QUANTITY * PRICE, 15, 2)
```
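Conceptually, the projection configuration above behaves like a single SELECT: kept columns pass through, renames become aliases, calculated columns become expressions, and excluded columns are simply not selected. A minimal SQL sketch of that behavior (CUSTOMER_ORDERS is an assumed source name; the SQL the engine actually generates may differ):

```sql
-- Hypothetical SQL equivalent of the projection above;
-- CUSTOMER_ORDERS is an assumed source name.
SELECT
    CUSTOMER_ID,
    CUSTOMER_NAME AS NAME,                            -- rename
    CONCAT(FIRST_NAME, ' ', LAST_NAME) AS FULL_NAME,  -- calculated column
    YEAR(ORDER_DATE) AS ORDER_YEAR                    -- calculated column
FROM CUSTOMER_ORDERS;
-- INTERNAL_CODE is excluded simply by not selecting it
```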
---

## Aggregation Node

The Aggregation node performs GROUP BY operations with aggregate functions.

### Aggregate Functions

| Function | Description |
|----------|-------------|
| COUNT | Count rows |
| COUNT_DISTINCT | Count unique values |
| SUM | Sum of values |
| AVG | Average of values |
| MIN | Minimum value |
| MAX | Maximum value |
| FIRST | First value |
| LAST | Last value |

### Configuration

```
Group By Columns:
- REGION
- PRODUCT_CATEGORY

Aggregations:
- TOTAL_SALES: SUM(SALES_AMOUNT)
- ORDER_COUNT: COUNT(ORDER_ID)
- AVG_ORDER: AVG(SALES_AMOUNT)
- FIRST_ORDER: MIN(ORDER_DATE)
- LAST_ORDER: MAX(ORDER_DATE)
```

### Example Output Schema

```
Input:  ORDER_ID, REGION, PRODUCT_CATEGORY, SALES_AMOUNT, ORDER_DATE
Output: REGION, PRODUCT_CATEGORY, TOTAL_SALES, ORDER_COUNT, AVG_ORDER, FIRST_ORDER, LAST_ORDER
```

---

## Join Node

The Join node combines data from multiple sources.

### Join Types

| Type | Description |
|------|-------------|
| INNER | Only matching rows |
| LEFT | All left + matching right |
| RIGHT | All right + matching left |
| FULL | All rows from both sides |

### Configuration

```
Join Type: LEFT
Left Source: ORDERS
Right Source: CUSTOMERS

Join Conditions:
- ORDERS.CUSTOMER_ID = CUSTOMERS.ID
- ORDERS.REGION = CUSTOMERS.REGION (optional)
```

### Design Considerations

**Performance Tips:**

- Put the smaller table on the right side for LEFT joins
- Use indexed columns in join conditions
- Filter data before joining when possible
- Avoid Cartesian products (missing join condition)

**Multiple Joins:**

- Chain Join nodes for 3+ sources
- Consider join order for performance
- Validate intermediate results

### Handling Duplicates

See the sketch after this block for the aggregate-before-join variant.

```
Scenario: Customer has multiple orders
Solution: Aggregate before or after join depending on requirement
```
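For instance, to keep exactly one row per customer, the order side can be aggregated before the join, so the join cannot multiply customer rows. A minimal SQL sketch of the aggregate-before-join pattern, reusing the table and column names from the examples above (a sketch, not the tool's generated SQL):

```sql
-- Collapse ORDERS to one row per customer first; the subsequent
-- LEFT join can then no longer duplicate CUSTOMERS rows.
SELECT
    c.ID,
    o.ORDER_COUNT,
    o.TOTAL_SALES
FROM CUSTOMERS c
LEFT JOIN (
    SELECT
        CUSTOMER_ID,
        COUNT(ORDER_ID)   AS ORDER_COUNT,
        SUM(SALES_AMOUNT) AS TOTAL_SALES
    FROM ORDERS
    GROUP BY CUSTOMER_ID
) o ON o.CUSTOMER_ID = c.ID;
```

In the Data Transform editor, this corresponds to placing an Aggregation node on the ORDERS branch before it reaches the Join node.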
---

## Union Node

The Union node combines rows from multiple sources.

### Union Types

| Type | Description |
|------|-------------|
| UNION ALL | Include all rows (with duplicates) |
| UNION | Include distinct rows only |

### Configuration

```
Union Type: UNION ALL

Sources:
- ORDERS_2023
- ORDERS_2024

Column Mapping:
- ORDER_ID -> ORDER_ID
- CUSTOMER -> CUSTOMER_ID (rename)
- AMOUNT -> SALES_AMOUNT (rename)
```

### Requirements

- Same number of columns from each source
- Compatible data types
- Column mapping for different names

### Adding Source Identifier

Use a calculated column to track data origin:

```sql
-- In source 1 projection
'2023' AS DATA_YEAR

-- In source 2 projection
'2024' AS DATA_YEAR
```

---

## Case Node

The Case node applies conditional logic to route or transform data.

### Conditional Expressions

```sql
CASE
    WHEN ORDER_TYPE = 'SALES'  THEN 'Revenue'
    WHEN ORDER_TYPE = 'RETURN' THEN 'Refund'
    ELSE 'Other'
END AS TRANSACTION_TYPE
```

### Routing Data

Configure multiple output ports based on conditions:

```
Condition 1: REGION = 'EMEA' -> Output Port 1
Condition 2: REGION = 'APAC' -> Output Port 2
Default:                     -> Output Port 3
```

### Nested Conditions

```sql
CASE
    WHEN AMOUNT > 10000 THEN
        CASE
            WHEN CUSTOMER_TYPE = 'VIP' THEN 'Priority High'
            ELSE 'Priority Medium'
        END
    ELSE 'Priority Low'
END AS PRIORITY
```

---

## Consumer Operators

Operators that read structured data from sources.

### Structured File Consumer

Reads from file storage (S3, Azure, GCS, HDFS, local).

**Supported Formats:**

- CSV (with header options)
- Parquet
- ORC
- JSON (JSON Lines format)

**Configuration:**

```
Connection: S3 Connection
Source: s3://bucket/path/*.parquet
Format: Parquet
Schema: Auto-detect or manual
Partition Pruning: date_column > '2024-01-01'
```

**Excel Support:**

- Read Excel files (.xlsx)
- Specify sheet name
- Define header row

### Structured SQL Consumer

Reads from SQL databases.

**Configuration:**

```
Connection: HANA Connection
Table/View: SALES_DATA
Columns: Select columns
Filter: WHERE clause
```

### SAP Application Consumer

Reads from SAP applications via OData or RFC.

**Configuration:**

```
Connection: S/4HANA Connection
Entity: A_SalesOrder
Select: OrderID, CustomerID, NetAmount
Filter: $filter=CreationDate gt '2024-01-01'
```

---

## Producer Operators

Operators that write structured data to targets.

### Structured File Producer

Writes to file storage.

**Configuration:**

```
Connection: S3 Connection
Target: s3://bucket/output/
Format: Parquet
Partition Columns: YEAR, MONTH
Compression: SNAPPY
```

**Partitioning Strategies:**

- By column values (e.g., year, region)
- By time (hourly, daily)
- By size (max rows per file)

### Structured Table Producer

Writes to database tables.

**Write Modes:**

- INSERT: Add new rows
- UPSERT: Insert or update
- DELETE: Remove matching rows
- TRUNCATE_INSERT: Clear and reload

**Configuration:**

```
Connection: HANA Connection
Table: TARGET_TABLE
Mode: UPSERT
Key Columns: ID, DATE
Batch Size: 10000
```

### SAP Application Producer

Writes to SAP applications.

**Configuration:**

```
Connection: S/4HANA Connection
Entity: A_SalesOrder
Operation: POST (create) / PATCH (update)
```

---

## Resiliency

Structured data operators support resiliency features for reliable processing.

### Checkpoint Configuration

```
Enable Checkpointing: Yes
Checkpoint Interval: 60 seconds
Checkpoint Location: /checkpoint/path
```

### Recovery Behavior

**On Failure:**

1. Graph stops at the failure point
2. State is saved to the checkpoint
3. Manual or automatic restart
4. Processing resumes from the last checkpoint

### Best Practices

- Enable checkpointing for long-running jobs
- Use appropriate checkpoint intervals
- Store checkpoints on reliable storage
- Monitor checkpoint sizes

### Exactly-Once Processing

For exactly-once semantics:

- Use UPSERT to database targets
- Enable deduplication for file targets
- Implement idempotent transformations

---

## Example: End-to-End Pipeline

```
[Structured File Consumer] -> [Data Transform] -> [Structured Table Producer]
       (CSV files)                 |                     (HANA table)
                                   |
                           [Projection]  - Select columns
                                   |
                           [Join]        - Enrich with master data
                                   |
                           [Aggregation] - Summarize by region
                                   |
                           [Case]        - Apply business rules
```

---

## Documentation Links

- **Structured Data Operators**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/working-with-structureddata-operators](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/tree/main/docs/modelingguide/working-with-structureddata-operators)
- **Data Transform**: [https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/blob/main/docs/modelingguide/working-with-structureddata-operators/data-transform-8fe8c02.md](https://github.com/SAP-docs/sap-hana-cloud-data-intelligence/blob/main/docs/modelingguide/working-with-structureddata-operators/data-transform-8fe8c02.md)

---

**Last Updated**: 2025-11-22