# Data Acquisition and Preparation Reference

**Source**: [https://github.com/SAP-docs/sap-datasphere/tree/main/docs/Acquiring-Preparing-Modeling-Data/Acquiring-and-Preparing-Data-in-the-Data-Builder](https://github.com/SAP-docs/sap-datasphere/tree/main/docs/Acquiring-Preparing-Modeling-Data/Acquiring-and-Preparing-Data-in-the-Data-Builder)

---

## Table of Contents

1. [Data Flows](#data-flows)
2. [Replication Flows](#replication-flows)
3. [Transformation Flows](#transformation-flows)
4. [Local Tables](#local-tables)
5. [Remote Tables](#remote-tables)
6. [Task Chains](#task-chains)
7. [Python Operators](#python-operators)
8. [Data Transformation](#data-transformation)
9. [Semantic Onboarding](#semantic-onboarding)
10. [File Spaces and Object Store](#file-spaces-and-object-store)

---
## Data Flows

Data flows provide ETL capabilities for data transformation and loading.

### Prerequisites

**Required Privileges**:

- Data Warehouse General (`-R------`) - SAP Datasphere access
- Connection (`-R------`) - Read connections
- Data Warehouse Data Builder (`CRUD----`) - Create/edit/delete flows
- Space Files (`CRUD----`) - Manage space objects
- Data Warehouse Data Integration (`-RU-----`) - Run flows
- Data Warehouse Data Integration (`-R--E---`) - Schedule flows
### Creating a Data Flow

1. Navigate to the Data Builder
2. Select "New Data Flow"
3. Add source operators
4. Add transformation operators
5. Add the target operator
6. Save and deploy
### Key Limitations

- **No delta processing**: Use replication flows for delta/CDC data instead
- **Single target table** only per data flow
- **Local tables only**: Data flows load exclusively to local tables in the repository
- **Double quotes unsupported** in identifiers (column/table names)
- **Spatial data types** not supported
- **ABAP source preview** unavailable (except CDS views and LTR objects)
- **Transformation operators** cannot be previewed
### Advanced Properties

**Dynamic Memory Allocation**:

| Setting | Memory Range | Use Case |
|---------|--------------|----------|
| Small | 1-2 GB | Low volume |
| Medium | 2-3 GB | Standard volume |
| Large | 3-5 GB | High volume |

**Additional Options**:

- Automatic restart on failure
- Input parameters support
### Data Flow Operators

**Source Operators**:

- Remote tables
- Local tables
- Views
- CSV files

**Transformation Operators**:

| Operator | Purpose | Configuration |
|----------|---------|---------------|
| Join | Combine sources | Join type, conditions |
| Union | Stack sources | Column mapping |
| Projection | Select columns | Include/exclude, rename |
| Filter | Row filtering | Filter conditions |
| Aggregation | Group and aggregate | Group by, aggregates |
| Script | Custom Python | Python code |
| Calculated Column | Derived values | Expression |

**Target Operators**:

- Local table (new or existing)
- Load mode: truncate and insert, or delta merge
### Join Operations

**Join Types**:

- Inner Join: Matching rows only
- Left Outer: All left + matching right
- Right Outer: All right + matching left
- Full Outer: All rows from both
- Cross Join: Cartesian product

**Join Conditions**:

```
source1.column = source2.column
```
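
To make the join semantics concrete, here is a minimal pandas sketch; the `orders`/`customers` frames and their columns are hypothetical, and pandas `merge` stands in for the Join operator, not for the data flow engine itself:

```python
import pandas as pd

# Hypothetical sample data to illustrate the join types above.
orders = pd.DataFrame({"customer_id": [1, 2, 4], "amount": [100, 250, 75]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ada", "Bo", "Cy"]})

# Inner Join: matching rows only.
inner = orders.merge(customers, on="customer_id", how="inner")

# Left Outer: all orders plus matching customers (unmatched -> NaN).
left = orders.merge(customers, on="customer_id", how="left")

# Right Outer: all customers plus matching orders.
right = orders.merge(customers, on="customer_id", how="right")

# Full Outer: all rows from both sides.
full = orders.merge(customers, on="customer_id", how="outer")

# Cross Join: Cartesian product of both frames.
cross = orders.merge(customers, how="cross")
```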
### Aggregation Operations

**Aggregate Functions**:

- SUM, AVG, MIN, MAX
- COUNT, COUNT DISTINCT
- FIRST, LAST
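
A hedged pandas equivalent of the Aggregation operator's group-by logic; the sample data and column names are assumptions:

```python
import pandas as pd

# Hypothetical input for a group-by aggregation.
sales = pd.DataFrame({
    "region": ["US", "US", "EU"],
    "product": ["A", "B", "A"],
    "amount": [100.0, 250.0, 75.0],
})

# Group by region, then apply SUM / AVG / COUNT / COUNT DISTINCT equivalents.
result = sales.groupby("region").agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
    rows=("amount", "count"),
    distinct_products=("product", "nunique"),
).reset_index()
print(result)
```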
### Calculated Columns

**Expression Syntax**:

```sql
CASE WHEN column1 > 100 THEN 'High' ELSE 'Low' END
CONCAT(first_name, ' ', last_name)
ROUND(amount * exchange_rate, 2)
```
### Input Parameters

Define runtime parameters for dynamic filtering:

**Parameter Types**:

- String
- Integer
- Date
- Timestamp

**Usage in Expressions**:

```sql
WHERE region = :IP_REGION
```
### Running Data Flows

**Execution Options**:

- Manual run from the Data Builder
- Scheduled via task chain
- API trigger

**Run Modes**:

- Full: Process all data
- Delta: Process changes only (requires delta capture)
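
To script the API trigger option, a run can be started over HTTP. The sketch below only shows the general shape of such a call: the endpoint path, tenant URL, and token handling are placeholders, not the documented SAP Datasphere API, so consult the official API reference for the real interface.

```python
import requests

# Placeholder tenant URL and token -- both are assumptions for this sketch.
BASE_URL = "https://<tenant>.hana.ondemand.com"
TOKEN = "<oauth-access-token>"  # obtained via an OAuth client, not shown here

def trigger_data_flow(space_id: str, flow_name: str) -> None:
    """Illustrative only: start a data flow run via a hypothetical REST endpoint."""
    response = requests.post(
        f"{BASE_URL}/api/v1/dataflows/{space_id}/{flow_name}/run",  # placeholder path
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()

trigger_data_flow("MY_SPACE", "MY_DATA_FLOW")
```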
---
## Replication Flows

Replicate data from source systems to SAP Datasphere or external targets.

### Creating a Replication Flow

1. Navigate to the Data Builder
2. Select "New Replication Flow"
3. Add the source connection and objects
4. Add the target connection
5. Configure load type and mappings
6. Save and deploy
### Source Systems

**SAP Sources**:

- SAP S/4HANA Cloud (ODP, CDS views)
- SAP S/4HANA On-Premise (ODP, SLT, CDS)
- SAP BW/4HANA
- SAP ECC
- SAP HANA

**Cloud Storage Sources**:

- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- SFTP

**Streaming Sources**:

- Apache Kafka
- Confluent Kafka
### Target Systems

**SAP Datasphere Targets**:

- Local tables (managed by the replication flow)

**External Targets**:

- Apache Kafka
- Confluent Kafka
- Google BigQuery
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- SFTP
- SAP Signavio
### Load Types

| Load Type | Description | Use Case |
|-----------|-------------|----------|
| Initial Only | One-time full load | Static data |
| Initial + Delta | Full load then changes | Standard replication |
| Real-Time | Continuous streaming | Live data |
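
To make *Initial + Delta* concrete, here is a hedged pandas simulation of an initial snapshot followed by delta records (upserts and deletes); the frames and change-type codes are illustrative, not the replication engine's internals:

```python
import pandas as pd

# Initial load: a full snapshot of the (hypothetical) source table.
target = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]}).set_index("id")

# Delta records captured afterwards: updates/inserts ("U") and deletions ("D").
delta = pd.DataFrame({
    "id": [2, 3, 4],
    "value": ["b2", None, "d"],
    "change_type": ["U", "D", "U"],
}).set_index("id")

# Apply deletions first, then upsert the remaining changes into the target.
target = target.drop(delta[delta["change_type"] == "D"].index, errors="ignore")
upserts = delta[delta["change_type"] != "D"].drop(columns="change_type")
target = upserts.combine_first(target)  # new keys inserted, existing keys updated
print(target.sort_index())  # id 1 -> "a", id 2 -> "b2", id 4 -> "d"
```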
### Configuration Options

**Flow-Level Properties**:

| Property | Description | Default |
|----------|-------------|---------|
| Delta Load Frequency | Interval for loading delta changes | Configurable |
| Skip Unmapped Target Columns | Ignore unmapped columns | Optional |
| Merge Data Automatically | Auto-merge for file space targets | Requires consent |
| Source Thread Limit | Parallel threads for the source (1-160) | 16 |
| Target Thread Limit | Parallel threads for the target (1-160) | 16 |
| Content Type | Template or Native format | Template |

**Object-Level Properties**:

| Property | Description |
|----------|-------------|
| Load Type | Initial Only, Initial + Delta, Delta Only |
| Delta Capture | Enable CDC tracking |
| ABAP Exit | Custom projection logic |
| Object Thread Count | Thread count for delta operations |
| Delete Before Load | Clear the target before loading |
### Critical Constraints

- **No input parameters**: Replication flows do not support input parameters
- **Thread limits read-only at design time**: Editable only after deployment
- **Content Type applies globally**: The selection affects all replication objects in the flow
- **ABAP systems**: Consult SAP Note 3297105 before creating replication flows
### Content Type (ABAP Sources)

| Type | Date/Timestamp Handling | Use Case |
|------|-------------------------|----------|
| Template Type | Applies ISO format requirements | Standard integration |
| Native Type | Dates → strings, timestamps → decimals | Custom formatting |

**Filters**:

- Define row-level filters on the source
- Combine multiple filter conditions with AND/OR
- **Important**: For ODP-CDS, filters must apply to primary key fields only

**Mappings**:

- Automatic column mapping
- Manual mapping overrides
- Exclude columns

**Projections**:

- Custom SQL expressions
- Column transformations
- Calculated columns
- ABAP Exit for custom projection logic
### Sizing and Performance

**Thread Configuration**:

- Source/Target Thread Limits: 1-160 (default: 16)
- Higher values give more parallelism but consume more resources
- Consider the source system's capacity

**Capacity Planning**:

- Estimate the data volume per table
- Consider network bandwidth
- Plan for parallel execution
- Use RFC fast serialization (SAP Note 3486245) for improved performance

**Load Balancing**:

- Distribute tables across multiple flows
- Schedule during off-peak hours
- Monitor resource consumption
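
As a worked example of capacity planning, here is a back-of-the-envelope sizing sketch; every figure in it (table volumes, per-thread throughput) is an assumption to be replaced with measurements from your own landscape:

```python
import math

# Illustrative assumptions -- measure these in your own landscape.
TABLE_VOLUMES_GB = {"SALES": 120, "DELIVERIES": 40, "MATERIALS": 5}
GB_PER_THREAD_HOUR = 2.0  # assumed effective throughput per thread
MAX_THREADS = 160         # documented upper bound for the thread limits

def suggest_thread_limit(load_window_hours: float) -> int:
    """Suggest a thread limit that finishes the initial load in the window."""
    total_gb = sum(TABLE_VOLUMES_GB.values())
    needed = total_gb / (GB_PER_THREAD_HOUR * load_window_hours)
    return max(1, min(MAX_THREADS, math.ceil(needed)))

# 165 GB at 2 GB/thread/hour within a 5-hour window -> 17 threads.
print(suggest_thread_limit(load_window_hours=5.0))
```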
### Unsupported Data Types

- BLOB, CLOB (large objects)
- Spatial data types
- Custom ABAP types
- Virtual Tables (SAP HANA Smart Data Access)
- Row Tables (use COLUMN TABLE only)

---
## Transformation Flows

Delta-aware transformations with automatic change propagation.

### Creating a Transformation Flow

1. Navigate to the Data Builder
2. Select "New Transformation Flow"
3. Add a source view (graphical or SQL)
4. Add the target table
5. Configure run settings
6. Save and deploy
### Key Constraints and Limitations

**Data Access Restrictions**:

Views and Open SQL schema objects cannot be used if they:

- Reference remote tables (except BW Bridge)
- Consume views with data access controls
- Have data access controls applied to them

**Loading Constraints**:

- Loading delta changes from views is not supported
- Transformation flows load data only to local tables in the SAP Datasphere repository
- Remote tables in BW Bridge spaces must be shared with the SAP Datasphere space
### Runtime Options

| Runtime | Storage Target | Use Case |
|---------|----------------|----------|
| HANA | SAP HANA database storage | Standard transformations |
| SPARK | SAP HANA Data Lake Files storage | Large-scale file processing |

### Load Types

| Load Type | Description | Requirements |
|-----------|-------------|--------------|
| Initial Only | Full dataset load | None |
| Initial and Delta | Full load then changes | Delta capture enabled on source and target tables |
### Input Parameter Constraints

- Cannot be created or edited in the Graphical View Editor
- Scheduled flows use default values
- **Not supported** in Python operations (SPARK runtime)
- Exclude from task chain input parameters

### Source Options

- Graphical view (created inline)
- SQL view (created inline)
- Existing views
### Target Table Management

**Options**:

- Create a new local table
- Use an existing local table

**Column Handling**:

- Add new columns automatically
- Map columns manually
- Exclude columns
### Run Modes

| Mode | Action | Use Case |
|------|--------|----------|
| Start | Process delta changes | Regular runs |
| Delete | Remove target records | Cleanup |
| Truncate | Clear and reload | Full refresh |

### Delta Processing

Transformation flows track changes automatically:

- Insert: New records
- Update: Modified records
- Delete: Removed records
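
The sketch below illustrates the bookkeeping idea: each change is logged with a type, and replaying the log (latest change per key, deletes dropped) reproduces the current state. The column names are assumptions for this sketch, not the actual technical columns of delta-capture tables:

```python
import pandas as pd

# Illustrative delta log: one row per change, newest last.
delta_log = pd.DataFrame({
    "id":          [1,   2,   2,    3,   3],
    "value":       ["a", "b", "b2", "c", None],
    "change_type": ["I", "I", "U",  "I", "D"],  # Insert / Update / Delete
})

# Replay the log: keep the latest change per key, then drop deleted keys.
latest = delta_log.groupby("id").tail(1)
state = latest[latest["change_type"] != "D"].drop(columns="change_type")
print(state)  # id 1 -> "a", id 2 -> "b2"; id 3 was inserted, then deleted
```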
### File Space Transformations

Transform data in the object store (file spaces):

**Supported Functions**:

- String functions
- Numeric functions
- Date functions
- Conversion functions

---
## Local Tables

Store data directly in SAP Datasphere.

### Creating Local Tables

**Methods**:

1. Data Builder > New Table
2. Import from CSV
3. Create as a data flow target
4. Create as a replication flow target
### Storage Options

| Storage | Target System | Use Case |
|---------|---------------|----------|
| Disk | SAP HANA Cloud, SAP HANA database | Standard persistent storage |
| In-Memory | SAP HANA Cloud, SAP HANA database | High-performance hot data |
| File | SAP HANA Cloud data lake storage | Large-scale, cost-effective storage |
### Table Properties

**Key Columns**:

- Primary key definition
- Unique constraints

**Data Types**:

- String (VARCHAR)
- Integer (INT, BIGINT)
- Decimal (DECIMAL)
- Date, Time, Timestamp
- Boolean
- Binary
### Partitioning

**Partition Types**:

- Range partitioning (date/numeric)
- Hash partitioning

**Benefits**:

- Improved query performance
- Parallel processing
- Selective data loading
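
A minimal pandas sketch of what range partitioning buys: splitting on a date column so each partition can be loaded, refreshed, or pruned independently (the frame and the per-year boundaries are hypothetical):

```python
import pandas as pd

# Hypothetical fact table with an order_date column.
orders = pd.DataFrame({
    "order_id": range(6),
    "order_date": pd.to_datetime([
        "2023-03-01", "2023-11-15", "2024-02-01",
        "2024-07-04", "2024-12-31", "2025-01-02",
    ]),
})

# Range partitioning by calendar year: each group is one partition.
for year, partition in orders.groupby(orders["order_date"].dt.year):
    # A query filtered on order_date only needs to scan matching partitions.
    print(year, len(partition), "rows")
```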
### Delta Capture

Enable change tracking for incremental processing:

1. Enable delta capture on the table
2. Track insert/update/delete operations
3. Query changes with delta tokens

**Important Constraint**: Once delta capture is enabled and deployed, it **cannot be modified or disabled**.
### Allow Data Transport

Available for dimensions on SAP Business Data Cloud formation tenants:

- Enables data inclusion during repository package transport
- Limited to initial import data initialization
- **Applies only to**: Dimensions, text entities, or relational datasets
### Data Maintenance

**Operations**:

- Insert records
- Update records
- Delete records
- Truncate table
- Load from file

### Local Table (File)

Store data in the object store:

**Supported Formats**:

- Parquet
- CSV
- JSON

**Use Cases**:

- Large datasets
- Cost-effective storage
- Integration with data lakes

---
## Remote Tables

Virtual access to external data without copying.

### Importing Remote Tables

1. Select a connection in the source browser
2. Choose tables/views to import
3. Configure import settings
4. Deploy the remote table
### Data Access Modes

| Mode | Description | Performance |
|------|-------------|-------------|
| Remote | Query the source directly | Network dependent |
| Replicated | Copy to local storage | Fast queries |

### Replication Options

**Full Replication**:

- Copy all data
- Scheduled refresh

**Real-Time Replication**:

- Continuous change capture
- Near real-time updates

**Partitioned Replication**:

- Divide data into partitions
- Parallel loading

### Remote Table Properties

**Statistics**:

- Create statistics for query optimization
- Update statistics periodically

**Filters**:

- Define partitioning filters
- Limit data volume

---
## Task Chains

Orchestrate multiple data integration tasks.

### Creating Task Chains

1. Navigate to the Data Builder
2. Select "New Task Chain"
3. Add task nodes
4. Configure dependencies
5. Save and deploy
### Supported Task Types

**Repository Objects**:

| Task Type | Activity | Description |
|-----------|----------|-------------|
| Remote Tables | Replicate | Replicate remote table data |
| Views | Persist | Persist view data to storage |
| Intelligent Lookups | Run | Execute intelligent lookup |
| Data Flows | Run | Execute data flow |
| Replication Flows | Run | Run with load type *Initial Only* |
| Transformation Flows | Run | Execute transformation flow |
| Local Tables | Delete Records | Delete records with Change Type "Deleted" |
| Local Tables (File) | Merge | Merge delta files |
| Local Tables (File) | Optimize | Compact files |
| Local Tables (File) | Delete Records | Remove data |

**Non-Repository Objects**:

| Task Type | Description |
|-----------|-------------|
| Open SQL Procedure | Execute SAP HANA schema procedures |
| BW Bridge Process Chain | Run SAP BW Bridge processes |

**Toolbar-Only Objects**:

| Task Type | Description |
|-----------|-------------|
| API Task | Call external REST APIs |
| Notification Task | Send email notifications |

**Nested Objects**:

| Task Type | Description |
|-----------|-------------|
| Task Chain | Reference locally created or shared task chains |
### Object Prerequisites

- All objects must be deployed before they can be added to a task chain
- SAP HANA Open SQL schema procedures require EXECUTE privileges granted to space users
- Views **cannot** have data access controls assigned
- Data flows with input parameters use their default values during task chain execution
- Views to be persisted may include at most one input parameter, which must have a default value
### Execution Control

**Sequential Execution**:

- Tasks run one after another
- A task runs only when the preceding task finishes with *completed* status
- A failure stops chain execution

**Parallel Execution** (illustrated in the sketch below):

- Multiple branches run simultaneously
- Completion condition options:
  - **ANY**: Succeeds when any parallel task completes
  - **ALL**: Succeeds only when all parallel tasks complete
- Synchronization at join points
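
The ANY/ALL completion conditions map naturally onto `concurrent.futures`; the sketch below uses local threads as stand-ins for parallel task chain branches, purely to illustrate the two join semantics:

```python
import concurrent.futures as cf
import time

def task(name: str, seconds: float) -> str:
    """Stand-in for a task chain node (e.g. a data flow run)."""
    time.sleep(seconds)
    return name

with cf.ThreadPoolExecutor() as pool:
    branches = [("flow_a", 0.2), ("flow_b", 0.5)]
    futures = [pool.submit(task, name, secs) for name, secs in branches]

    # ANY: the branch is satisfied as soon as one parallel task completes.
    done, _pending = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    print("ANY satisfied by:", next(iter(done)).result())

    # ALL: the join point waits until every parallel task has completed.
    done, _ = cf.wait(futures, return_when=cf.ALL_COMPLETED)
    print("ALL satisfied:", sorted(f.result() for f in done))
```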
**Layout Options**:

- Top-bottom orientation
- Left-right orientation
- Drag tasks to reorder

**Apache Spark Settings**:

- Override the default Apache Spark application settings per task
- Configure memory and executor settings
### Input Parameters

Pass parameters to task chain tasks:

**Parameter Definition**:

```yaml
name: region
type: string
default: "US"
```

**Parameter Usage**:

- Pass to data flows
- Use in filters
- Dynamic configuration
### Scheduling

**Simple Schedule**:

- Daily, weekly, monthly
- At a specific time

**Cron Expression**:

```
0 0 6 * * ?     # Daily at 6 AM
0 0 */4 * * ?   # Every 4 hours
```

**Important Scheduling Constraint**: If a schedule includes remote tables with *Replicated (Real-Time)* data access, the replication type is converted to batch replication at the next scheduled run, which ends real-time updates.
### Email Notifications

Configure notifications for:

- Success
- Failure
- Warning

**Recipient Options**:

- Tenant users (searchable after the task chain is deployed)
- External email addresses (requires a deployed task chain for recipient selection)

**Export Constraint**: CSN/JSON export does not include notification recipients.

---
## Python Operators

Custom data processing with Python.

### Creating Python Operators

1. Add a Script operator to a data flow
2. Define input/output ports
3. Write Python code
4. Configure execution
### Python Script Structure

```python
def my_function(value):
    """Placeholder for your custom row-level logic."""
    return value

def transform(data):
    """
    Transform input data.

    Args:
        data: pandas DataFrame

    Returns:
        pandas DataFrame
    """
    # Your transformation logic
    result = data.copy()
    result['new_column'] = result['existing'].apply(my_function)
    return result
```

### Available Libraries

- pandas
- numpy
- scipy
- scikit-learn
- datetime
### Best Practices

- Keep transformations simple
- Handle null values explicitly
- Log errors appropriately
- Test with sample data
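
A hedged sketch of a script-operator body that follows these practices; the column names and the logging setup are illustrative assumptions:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("script_operator")

def transform(data: pd.DataFrame) -> pd.DataFrame:
    """Derive a margin column; handle nulls explicitly and log anomalies."""
    result = data.copy()

    # Handle null values explicitly instead of letting them propagate.
    result["revenue"] = result["revenue"].fillna(0.0)
    result["cost"] = result["cost"].fillna(0.0)

    # Keep the transformation itself simple.
    result["margin"] = result["revenue"] - result["cost"]

    # Log suspicious rows rather than failing silently.
    negative = int((result["margin"] < 0).sum())
    if negative:
        log.warning("%d rows have negative margin", negative)
    return result

# Test with sample data before deploying.
sample = pd.DataFrame({"revenue": [100.0, None], "cost": [40.0, 10.0]})
print(transform(sample))
```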
---

## Data Transformation

Column-level transformations in graphical views.

### Text Transformations

| Function | Description | Example |
|----------|-------------|---------|
| Change Case | Upper/lower/title case | UPPER(name) |
| Concatenate | Join columns | CONCAT(first, last) |
| Extract | Substring | SUBSTRING(text, 1, 5) |
| Split | Divide by delimiter | SPLIT(full_name, ' ') |
| Find/Replace | Text substitution | REPLACE(text, 'old', 'new') |

### Numeric Transformations

| Function | Description |
|----------|-------------|
| ROUND | Round to precision |
| FLOOR | Round down |
| CEIL | Round up |
| ABS | Absolute value |
| MOD | Modulo operation |

### Date Transformations

| Function | Description |
|----------|-------------|
| YEAR | Extract year |
| MONTH | Extract month |
| DAY | Extract day |
| DATEDIFF | Date difference |
| ADD_DAYS | Add days to a date |

### Filter Operations

```sql
-- Numeric filter
amount > 1000

-- Text filter
region IN ('US', 'EU', 'APAC')

-- Date filter
order_date >= '2024-01-01'

-- Null handling
customer_name IS NOT NULL
```
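
For comparison, here is the same filter logic expressed in pandas (a sketch over a hypothetical frame, not the view editor's syntax):

```python
import pandas as pd

# Hypothetical orders frame mirroring the SQL filters above.
orders = pd.DataFrame({
    "amount": [500, 1500, 2500],
    "region": ["US", "MX", "EU"],
    "order_date": pd.to_datetime(["2023-12-31", "2024-01-15", "2024-06-01"]),
    "customer_name": ["Ada", None, "Cy"],
})

filtered = orders[
    (orders["amount"] > 1000)
    & (orders["region"].isin(["US", "EU", "APAC"]))
    & (orders["order_date"] >= "2024-01-01")
    & (orders["customer_name"].notna())
]
print(filtered)  # only the EU row passes every condition
```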
---

## Semantic Onboarding

Import objects with business semantics from SAP systems.

### SAP S/4HANA Import

Import CDS views with annotations:

- Semantic types (currency, unit)
- Associations
- Hierarchies
- Text relationships

### SAP BW/4HANA Import

Import BW objects:

- InfoObjects
- CompositeProviders
- Queries
- Analysis Authorizations

### Import Process

1. Select the source connection
2. Browse the available objects
3. Select objects to import
4. Review the semantic mapping
5. Deploy the imported objects
---

## File Spaces and Object Store

Store and process data in the object store.

### Creating File Spaces

1. System > Configuration > Spaces
2. Create a new file space
3. Configure the object store connection
4. Set storage limits

### Data Loading

**Supported Formats**:

- Parquet (recommended)
- CSV
- JSON

**Loading Methods**:

- Replication flows
- Transformation flows
- API upload

### In-Memory Acceleration

Enable in-memory storage for faster queries:

1. Select a table/view
2. Enable in-memory storage
3. Configure the refresh schedule

### Premium Outbound Integration

Export data to external systems:

- Configure an outbound connection
- Schedule exports
- Monitor transfer status

---
## Documentation Links

- **Data Flows**: [https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/e30fd14](https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/e30fd14)
- **Replication Flows**: [https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/25e2bd7](https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/25e2bd7)
- **Transformation Flows**: [https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/f7161e6](https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/f7161e6)
- **Task Chains**: [https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/d1afbc2](https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/d1afbc2)

---

**Last Updated**: 2025-11-22