Data Acquisition and Preparation Reference

Source: https://github.com/SAP-docs/sap-datasphere/tree/main/docs/Acquiring-Preparing-Modeling-Data/Acquiring-and-Preparing-Data-in-the-Data-Builder


Table of Contents

  1. Data Flows
  2. Replication Flows
  3. Transformation Flows
  4. Local Tables
  5. Remote Tables
  6. Task Chains
  7. Python Operators
  8. Data Transformation
  9. Semantic Onboarding
  10. File Spaces and Object Store

Data Flows

Data flows provide extract-transform-load (ETL) capabilities for transforming data and loading it into SAP Datasphere.

Prerequisites

Required Privileges:

  • Data Warehouse General (-R------) - SAP Datasphere access
  • Connection (-R------) - Read connections
  • Data Warehouse Data Builder (CRUD----) - Create/edit/delete flows
  • Space Files (CRUD----) - Manage space objects
  • Data Warehouse Data Integration (-RU-----) - Run flows
  • Data Warehouse Data Integration (-R--E---) - Schedule flows

Creating a Data Flow

  1. Navigate to Data Builder
  2. Select "New Data Flow"
  3. Add source operators
  4. Add transformation operators
  5. Add target operator
  6. Save and deploy

Key Limitations

  • No delta processing: Use replication flows for delta/CDC data instead
  • Single target table only per data flow
  • Local tables only: Data flows load exclusively to local tables in the repository
  • Double quotes unsupported in identifiers (column/table names)
  • Spatial data types not supported
  • ABAP source preview unavailable (except CDS views and LTR objects)
  • Transformation operators cannot be previewed

Advanced Properties

Dynamic Memory Allocation:

| Setting | Memory Range | Use Case |
|---------|--------------|----------|
| Small | 1-2 GB | Low volume |
| Medium | 2-3 GB | Standard volume |
| Large | 3-5 GB | High volume |

Additional Options:

  • Automatic restart on failure
  • Input parameters support

Data Flow Operators

Source Operators:

  • Remote tables
  • Local tables
  • Views
  • CSV files

Transformation Operators:

| Operator | Purpose | Configuration |
|----------|---------|---------------|
| Join | Combine sources | Join type, conditions |
| Union | Stack sources | Column mapping |
| Projection | Select columns | Include/exclude, rename |
| Filter | Row filtering | Filter conditions |
| Aggregation | Group and aggregate | Group by, aggregates |
| Script | Custom Python | Python code |
| Calculated Column | Derived values | Expression |

Target Operators:

  • Local table (new or existing)
  • Truncate and insert or delta merge

Join Operations

Join Types:

  • Inner Join: Matching rows only
  • Left Outer: All left + matching right
  • Right Outer: All right + matching left
  • Full Outer: All rows from both
  • Cross Join: Cartesian product

Join Conditions:

source1.column = source2.column
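
The join types above follow standard SQL semantics. A minimal pandas sketch of the behavior (illustrative only; in the Data Builder, joins are configured in the Join operator, not in code):

import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 2, 3], 'amount': [100, 200, 300]})
customers = pd.DataFrame({'customer_id': [1, 2, 4], 'region': ['US', 'EU', 'APAC']})

# Inner join: only the matching customers 1 and 2 survive
inner = orders.merge(customers, on='customer_id', how='inner')

# Left outer join: keeps the order for customer 3, with region = NaN
left = orders.merge(customers, on='customer_id', how='left')

# Full outer join: also keeps customer 4, who has no orders
full = orders.merge(customers, on='customer_id', how='outer')

# Cross join: Cartesian product of both sources
cross = orders.merge(customers, how='cross')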

Aggregation Operations

Aggregate Functions:

  • SUM, AVG, MIN, MAX
  • COUNT, COUNT DISTINCT
  • FIRST, LAST
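
A small pandas sketch of the Aggregation operator's behavior, combining group-by keys with aggregate functions (the column names are made up for illustration):

import pandas as pd

sales = pd.DataFrame({
    'region': ['US', 'US', 'EU'],
    'customer_id': [1, 1, 2],
    'amount': [100.0, 50.0, 200.0],
})

# SUM, AVG, and COUNT DISTINCT equivalents, grouped by region
result = sales.groupby('region').agg(
    total_amount=('amount', 'sum'),
    avg_amount=('amount', 'mean'),
    distinct_customers=('customer_id', 'nunique'),
).reset_index()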

Calculated Columns

Expression Syntax:

CASE WHEN column1 > 100 THEN 'High' ELSE 'Low' END
CONCAT(first_name, ' ', last_name)
ROUND(amount * exchange_rate, 2)

Input Parameters

Define runtime parameters for dynamic filtering:

Parameter Types:

  • String
  • Integer
  • Date
  • Timestamp

Usage in Expressions:

WHERE region = :IP_REGION

Running Data Flows

Execution Options:

  • Manual run from Data Builder
  • Scheduled via task chain
  • API trigger
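
For API triggering, a hedged Python sketch using requests. The host, endpoint path, and payload below are placeholders rather than the documented SAP Datasphere API; consult the official API reference for the actual contract:

import requests

TENANT = 'https://my-tenant.eu10.hcs.cloud.sap'   # placeholder host
RUN_PATH = '/api/v1/dataflows/my_data_flow/run'   # hypothetical endpoint

# Bearer token obtained from the tenant's OAuth client (not shown)
headers = {'Authorization': 'Bearer <access_token>'}

response = requests.post(TENANT + RUN_PATH, headers=headers, json={'runMode': 'full'})
response.raise_for_status()
print('Run triggered:', response.json())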

Run Modes:

  • Full: Process all data
  • Delta: Process changes only (requires delta capture)

Replication Flows

Replicate data from source systems to SAP Datasphere or external targets.

Creating a Replication Flow

  1. Navigate to Data Builder
  2. Select "New Replication Flow"
  3. Add source connection and objects
  4. Add target connection
  5. Configure load type and mappings
  6. Save and deploy

Source Systems

SAP Sources:

  • SAP S/4HANA Cloud (ODP, CDS views)
  • SAP S/4HANA On-Premise (ODP, SLT, CDS)
  • SAP BW/4HANA
  • SAP ECC
  • SAP HANA

Cloud Storage Sources:

  • Amazon S3
  • Azure Blob Storage
  • Google Cloud Storage
  • SFTP

Streaming Sources:

  • Apache Kafka
  • Confluent Kafka

Target Systems

SAP Datasphere Targets:

  • Local tables (managed by replication flow)

External Targets:

  • Apache Kafka
  • Confluent Kafka
  • Google BigQuery
  • Amazon S3
  • Azure Blob Storage
  • Google Cloud Storage
  • SFTP
  • SAP Signavio

Load Types

| Load Type | Description | Use Case |
|-----------|-------------|----------|
| Initial Only | One-time full load | Static data |
| Initial + Delta | Full load, then changes | Standard replication |
| Real-Time | Continuous streaming | Live data |

Configuration Options

Flow-Level Properties:

| Property | Description | Default |
|----------|-------------|---------|
| Delta Load Frequency | Interval for delta changes | Configurable |
| Skip Unmapped Target Columns | Ignore unmapped columns | Optional |
| Merge Data Automatically | Auto-merge for file space targets | Requires consent |
| Source Thread Limit | Parallel threads for source (1-160) | 16 |
| Target Thread Limit | Parallel threads for target (1-160) | 16 |
| Content Type | Template or Native format | Template |

Object-Level Properties:

| Property | Description |
|----------|-------------|
| Load Type | Initial Only, Initial + Delta, Delta Only |
| Delta Capture | Enable CDC tracking |
| ABAP Exit | Custom projection logic |
| Object Thread Count | Thread count for delta operations |
| Delete Before Load | Clear target before loading |

Critical Constraints

  • No input parameters: Replication flows do not support input parameters
  • Thread limits read-only at design time: Editable only after deployment
  • Content Type applies globally: Selection affects all replication objects in the flow
  • ABAP systems: Consult SAP Note 3297105 before creating replication flows

Content Type (ABAP Sources)

| Type | Date/Timestamp Handling | Use Case |
|------|-------------------------|----------|
| Template Type | Applies ISO format requirements | Standard integration |
| Native Type | Dates → strings, timestamps → decimals | Custom formatting |

Filters:

  • Define row-level filters on source
  • Multiple filter conditions with AND/OR
  • Important: For ODP-CDS, filters must apply to primary key fields only

Mappings:

  • Automatic column mapping
  • Manual mapping overrides
  • Exclude columns

Projections:

  • Custom SQL expressions
  • Column transformations
  • Calculated columns
  • ABAP Exit for custom projection logic

Sizing and Performance

Thread Configuration:

  • Source/Target Thread Limits: 1-160 (default: 16)
  • Higher values increase parallelism but also consume more resources
  • Consider source system capacity

Capacity Planning:

  • Estimate data volume per table
  • Consider network bandwidth
  • Plan for parallel execution
  • RFC fast serialization (SAP Note 3486245) for improved performance
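
A back-of-the-envelope estimate for capacity planning (the throughput figure is an assumption; measure your own per-thread rate):

# duration ≈ volume / (threads * per-thread throughput)
volume_gb = 500                # estimated data volume
threads = 16                   # source/target thread limit (default)
throughput_gb_per_hour = 2.0   # assumed per-thread throughput

hours = volume_gb / (threads * throughput_gb_per_hour)
print(f'Estimated initial load: {hours:.1f} hours')   # 15.6 hours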

Load Balancing:

  • Distribute across multiple flows
  • Schedule during off-peak hours
  • Monitor resource consumption

Unsupported Data Types

  • BLOB, CLOB (large objects)
  • Spatial data types
  • Custom ABAP types
  • Virtual Tables (SAP HANA Smart Data Access)
  • Row Tables (use COLUMN TABLE only)

Transformation Flows

Delta-aware transformations with automatic change propagation.

Creating a Transformation Flow

  1. Navigate to Data Builder
  2. Select "New Transformation Flow"
  3. Add source (view or graphical view)
  4. Add target table
  5. Configure run settings
  6. Save and deploy

Key Constraints and Limitations

Data Access Restrictions: Views and Open SQL schema objects cannot be used if they:

  • Reference remote tables (except BW Bridge)
  • Consume views with data access controls
  • Have data access controls applied to them

Loading Constraints:

  • Loading delta changes from views is not supported
  • Only loads data to local SAP Datasphere repository tables
  • Remote tables in BW Bridge spaces must be shared with the SAP Datasphere space

Runtime Options

| Runtime | Storage Target | Use Case |
|---------|----------------|----------|
| HANA | SAP HANA database storage | Standard transformations |
| SPARK | SAP HANA Data Lake Files storage | Large-scale file processing |

Load Types

| Load Type | Description | Requirements |
|-----------|-------------|--------------|
| Initial Only | Full dataset load | None |
| Initial and Delta | Full load, then changes | Delta capture enabled on source and target tables |

Input Parameter Constraints

  • Cannot be created/edited in Graphical View Editor
  • Scheduled flows use default values
  • Not supported in Python operations (Spark runtime)
  • Exclude from task chain input parameters

Source Options

  • Graphical view (created inline)
  • SQL view (created inline)
  • Existing views

Target Table Management

Options:

  • Create new local table
  • Use existing local table

Column Handling:

  • Add new columns automatically
  • Map columns manually
  • Exclude columns

Run Modes

| Mode | Action | Use Case |
|------|--------|----------|
| Start | Process delta changes | Regular runs |
| Delete | Remove target records | Cleanup |
| Truncate | Clear and reload | Full refresh |

Delta Processing

Transformation flows track changes automatically:

  • Insert: New records
  • Update: Modified records
  • Delete: Removed records
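
Conceptually, each delta record carries a change type that is applied to the target, as in this simplified Python sketch (an illustration of the logic, not Datasphere's implementation):

# Target keyed by primary key; delta records carry a change type
target = {1: {'name': 'Alice'}, 2: {'name': 'Bob'}}

delta = [
    {'key': 3, 'type': 'I', 'data': {'name': 'Carol'}},   # insert new record
    {'key': 1, 'type': 'U', 'data': {'name': 'Alicia'}},  # update existing record
    {'key': 2, 'type': 'D', 'data': None},                # delete removed record
]

for rec in delta:
    if rec['type'] == 'D':
        target.pop(rec['key'], None)
    else:  # 'I' and 'U' both upsert
        target[rec['key']] = rec['data']

print(target)  # {1: {'name': 'Alicia'}, 3: {'name': 'Carol'}}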

File Space Transformations

Transform data in object store (file spaces):

Supported Functions:

  • String functions
  • Numeric functions
  • Date functions
  • Conversion functions

Local Tables

Store data directly in SAP Datasphere.

Creating Local Tables

Methods:

  1. Data Builder > New Table
  2. Import from CSV
  3. Create from data flow target
  4. Create from replication flow target

Storage Options

| Storage | Target System | Use Case |
|---------|---------------|----------|
| Disk | SAP HANA Cloud, SAP HANA database | Standard persistent storage |
| In-Memory | SAP HANA Cloud, SAP HANA database | High-performance hot data |
| File | SAP HANA Cloud data lake storage | Large-scale, cost-effective storage |

Table Properties

Key Columns:

  • Primary key definition
  • Unique constraints

Data Types:

  • String (VARCHAR)
  • Integer (INT, BIGINT)
  • Decimal (DECIMAL)
  • Date, Time, Timestamp
  • Boolean
  • Binary

Partitioning

Partition Types:

  • Range partitioning (date/numeric)
  • Hash partitioning

Benefits:

  • Improved query performance
  • Parallel processing
  • Selective data loading
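
A tiny sketch of how the two partition types assign rows (conceptual only; partitioning is configured in the table editor, not in code):

from datetime import date

def range_partition(order_date: date) -> str:
    # Range partitioning: route rows by a date boundary
    return 'p2024' if order_date < date(2025, 1, 1) else 'p2025'

def hash_partition(customer_id: int, partitions: int = 4) -> int:
    # Hash partitioning: spread rows evenly across N partitions
    return hash(customer_id) % partitions

print(range_partition(date(2024, 6, 1)))  # p2024
print(hash_partition(12345))              # one of 0..3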

Delta Capture

Enable change tracking for incremental processing:

  1. Enable delta capture on table
  2. Track insert/update/delete operations
  3. Query changes with delta tokens

Important Constraint: Once delta capture is enabled and deployed, it cannot be modified or disabled.
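
Once deployed, the change-tracking columns can be queried from the space's database access user. A hedged sketch using the SAP HANA Python client (hdbcli); the "_Delta" table suffix and the "Change_Type"/"Change_Date" column names follow the delta capture convention but should be verified in your tenant:

from hdbcli import dbapi  # SAP HANA Python client

conn = dbapi.connect(address='<host>', port=443,
                     user='<user>', password='<password>')
cursor = conn.cursor()

# Read records flagged as deleted since a given date
cursor.execute(
    'SELECT * FROM "MY_SPACE"."MY_TABLE_Delta" '
    'WHERE "Change_Type" = \'D\' AND "Change_Date" >= ?',
    ('2024-01-01',))
for row in cursor.fetchall():
    print(row)
conn.close()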

Allow Data Transport

Available for dimensions on SAP Business Data Cloud formation tenants:

  • Enables data inclusion during repository package transport
  • Limited to initializing data during the initial import
  • Applies only to: Dimensions, text entities, or relational datasets

Data Maintenance

Operations:

  • Insert records
  • Update records
  • Delete records
  • Truncate table
  • Load from file

Local Table (File)

Store data in object store:

Supported Formats:

  • Parquet
  • CSV
  • JSON

Use Cases:

  • Large datasets
  • Cost-effective storage
  • Integration with data lakes

Remote Tables

Virtual access to external data without copying.

Importing Remote Tables

  1. Select connection in source browser
  2. Choose tables/views to import
  3. Configure import settings
  4. Deploy remote table

Data Access Modes

| Mode | Description | Performance |
|------|-------------|-------------|
| Remote | Query source directly | Network dependent |
| Replicated | Copy to local storage | Fast queries |

Replication Options

Full Replication:

  • Copy all data
  • Scheduled refresh

Real-Time Replication:

  • Continuous change capture
  • Near real-time updates

Partitioned Replication:

  • Divide data into partitions
  • Parallel loading

Remote Table Properties

Statistics:

  • Create statistics for query optimization
  • Update statistics periodically

Filters:

  • Define partitioning filters
  • Limit data volume

Task Chains

Orchestrate multiple data integration tasks.

Creating Task Chains

  1. Navigate to Data Builder
  2. Select "New Task Chain"
  3. Add task nodes
  4. Configure dependencies
  5. Save and deploy

Supported Task Types

Repository Objects:

| Task Type | Activity | Description |
|-----------|----------|-------------|
| Remote Tables | Replicate | Replicate remote table data |
| Views | Persist | Persist view data to storage |
| Intelligent Lookups | Run | Execute intelligent lookup |
| Data Flows | Run | Execute data flow |
| Replication Flows | Run | Run with load type Initial Only |
| Transformation Flows | Run | Execute transformation flow |
| Local Tables | Delete Records | Delete records with Change Type "Deleted" |
| Local Tables (File) | Merge | Merge delta files |
| Local Tables (File) | Optimize | Compact files |
| Local Tables (File) | Delete Records | Remove data |

Non-Repository Objects:

| Task Type | Description |
|-----------|-------------|
| Open SQL Procedure | Execute SAP HANA schema procedures |
| BW Bridge Process Chain | Run SAP BW Bridge processes |

Toolbar-Only Objects:

| Task Type | Description |
|-----------|-------------|
| API Task | Call external REST APIs |
| Notification Task | Send email notifications |

Nested Objects:

| Task Type | Description |
|-----------|-------------|
| Task Chain | Reference locally-created or shared task chains |

Object Prerequisites

  • All objects must be deployed before adding to task chains
  • SAP HANA Open SQL schema procedures require EXECUTE privileges granted to space users
  • Views cannot have data access controls assigned
  • Data flows with input parameters use default values during task chain execution
  • Views to be persisted may include at most one input parameter, and it must have a default value

Execution Control

Sequential Execution:

  • Tasks run one after another
  • The next task runs only when the previous task finishes with status Completed
  • Failure stops chain execution

Parallel Execution:

  • Multiple branches run simultaneously
  • Completion condition options:
    • ANY: Succeeds when any parallel task completes
    • ALL: Succeeds only when all parallel tasks complete
  • Synchronization at join points
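
The ANY/ALL completion conditions behave like waiting on a set of futures, as in this conceptual Python sketch (an analogy, not the task chain engine):

import time
from concurrent.futures import (ThreadPoolExecutor, wait,
                                FIRST_COMPLETED, ALL_COMPLETED)

def task(name, seconds):
    time.sleep(seconds)
    return name

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(task, 'branch_a', 1),
               pool.submit(task, 'branch_b', 3)]

    # ANY: proceed as soon as one parallel branch completes
    done, pending = wait(futures, return_when=FIRST_COMPLETED)

    # ALL: proceed only after every parallel branch completes
    done, pending = wait(futures, return_when=ALL_COMPLETED)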

Layout Options:

  • Top-Bottom orientation
  • Left-Right orientation
  • Drag tasks to reorder

Apache Spark Settings:

  • Override default Apache Spark Application Settings per task
  • Configure memory and executor settings

Input Parameters

Pass parameters to task chain tasks:

Parameter Definition:

name: region
type: string
default: "US"

Parameter Usage:

  • Pass to data flows
  • Use in filters
  • Dynamic configuration

Scheduling

Simple Schedule:

  • Daily, weekly, monthly
  • Specific time

Cron Expression:

0 0 6 * * ?   # Daily at 6 AM
0 0 */4 * * ? # Every 4 hours

Important Scheduling Constraint: If a task chain schedules remote tables that use Replicated (Real-Time) data access, the replication type is converted to batch replication at the next scheduled run, which ends real-time updates.

Email Notifications

Configure notifications for:

  • Success
  • Failure
  • Warning

Recipient Options:

  • Tenant users (searchable after task chain is deployed)
  • External email addresses (requires deployed task chain for recipient selection)

Export Constraint: CSN/JSON export does not include notification recipients


Python Operators

Custom data processing with Python.

Creating Python Operators

  1. Add Script operator to data flow
  2. Define input/output ports
  3. Write Python code
  4. Configure execution

Python Script Structure

import pandas as pd

def transform(data):
    """
    Transform input data.

    Args:
        data: pandas DataFrame

    Returns:
        pandas DataFrame
    """
    # Work on a copy so the input DataFrame is not modified
    result = data.copy()
    # Example derivation: double each value, leaving nulls untouched
    result['new_column'] = result['existing'].apply(
        lambda v: v * 2 if pd.notna(v) else v)
    return result
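
A quick local test of the function above (the column name 'existing' is the placeholder from the example):

sample = pd.DataFrame({'existing': [1, 2, None]})
print(transform(sample))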

Available Libraries

  • pandas
  • numpy
  • scipy
  • scikit-learn
  • datetime

Best Practices

  • Keep transformations simple
  • Handle null values explicitly (see the sketch after this list)
  • Log errors appropriately
  • Test with sample data
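
For explicit null handling, a minimal pandas sketch (the column name is illustrative):

import pandas as pd

df = pd.DataFrame({'amount': [100.0, None, 300.0]})

# Decide explicitly how to treat missing values: fill, drop, or flag
df['amount_filled'] = df['amount'].fillna(0.0)
df_no_nulls = df.dropna(subset=['amount'])
df['amount_missing'] = df['amount'].isna()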

Data Transformation

Column-level transformations in graphical views.

Text Transformations

| Function | Description | Example |
|----------|-------------|---------|
| Change Case | Upper/lower/title | UPPER(name) |
| Concatenate | Join columns | CONCAT(first, last) |
| Extract | Substring | SUBSTRING(text, 1, 5) |
| Split | Divide by delimiter | SPLIT(full_name, ' ') |
| Find/Replace | Text substitution | REPLACE(text, 'old', 'new') |

Numeric Transformations

| Function | Description |
|----------|-------------|
| ROUND | Round to precision |
| FLOOR | Round down |
| CEIL | Round up |
| ABS | Absolute value |
| MOD | Modulo operation |

Date Transformations

| Function | Description |
|----------|-------------|
| YEAR | Extract year |
| MONTH | Extract month |
| DAY | Extract day |
| DATEDIFF | Date difference |
| ADD_DAYS | Add days to date |

Filter Operations

-- Numeric filter
amount > 1000

-- Text filter
region IN ('US', 'EU', 'APAC')

-- Date filter
order_date >= '2024-01-01'

-- Null handling
customer_name IS NOT NULL

Semantic Onboarding

Import objects with business semantics from SAP systems.

SAP S/4HANA Import

Import CDS views with annotations:

  • Semantic types (currency, unit)
  • Associations
  • Hierarchies
  • Text relationships

SAP BW/4HANA Import

Import BW objects:

  • InfoObjects
  • CompositeProviders
  • Queries
  • Analysis Authorizations

Import Process

  1. Select source connection
  2. Browse available objects
  3. Select objects to import
  4. Review semantic mapping
  5. Deploy imported objects

File Spaces and Object Store

Store and process data in object store.

Creating File Spaces

  1. System > Configuration > Spaces
  2. Create new file space
  3. Configure object store connection
  4. Set storage limits

Data Loading

Supported Formats:

  • Parquet (recommended)
  • CSV
  • JSON

Loading Methods:

  • Replication flows
  • Transformation flows
  • API upload

In-Memory Acceleration

Enable in-memory storage for faster queries:

  1. Select table/view
  2. Enable in-memory storage
  3. Configure refresh schedule

Premium Outbound Integration

Export data to external systems:

  • Configure outbound connection
  • Schedule exports
  • Monitor transfer status


Last Updated: 2025-11-22