# Data Acquisition and Preparation Reference
**Source**: [https://github.com/SAP-docs/sap-datasphere/tree/main/docs/Acquiring-Preparing-Modeling-Data/Acquiring-and-Preparing-Data-in-the-Data-Builder](https://github.com/SAP-docs/sap-datasphere/tree/main/docs/Acquiring-Preparing-Modeling-Data/Acquiring-and-Preparing-Data-in-the-Data-Builder)
---
## Table of Contents
1. [Data Flows](#data-flows)
2. [Replication Flows](#replication-flows)
3. [Transformation Flows](#transformation-flows)
4. [Local Tables](#local-tables)
5. [Remote Tables](#remote-tables)
6. [Task Chains](#task-chains)
7. [Python Operators](#python-operators)
8. [Data Transformation](#data-transformation)
9. [Semantic Onboarding](#semantic-onboarding)
10. [File Spaces and Object Store](#file-spaces-and-object-store)
---
## Data Flows
Data flows provide ETL capabilities for data transformation and loading.
### Prerequisites
**Required Privileges**:
- Data Warehouse General (`-R------`) - SAP Datasphere access
- Connection (`-R------`) - Read connections
- Data Warehouse Data Builder (`CRUD----`) - Create/edit/delete flows
- Space Files (`CRUD----`) - Manage space objects
- Data Warehouse Data Integration (`-RU-----`) - Run flows
- Data Warehouse Data Integration (`-R--E---`) - Schedule flows
### Creating a Data Flow
1. Navigate to Data Builder
2. Select "New Data Flow"
3. Add source operators
4. Add transformation operators
5. Add target operator
6. Save and deploy
### Key Limitations
- **No delta processing**: Use replication flows for delta/CDC data instead
- **Single target table** only per data flow
- **Local tables only**: Data flows load exclusively to local tables in the repository
- **Double quotes unsupported** in identifiers (column/table names)
- **Spatial data types** not supported
- **ABAP source preview** unavailable (except CDS views and LTR objects)
- **Transformation operators** cannot be previewed
### Advanced Properties
**Dynamic Memory Allocation**:
| Setting | Memory Range | Use Case |
|---------|--------------|----------|
| Small | 1-2 GB | Low volume |
| Medium | 2-3 GB | Standard volume |
| Large | 3-5 GB | High volume |
**Additional Options**:
- Automatic restart on failure
- Support for input parameters
### Data Flow Operators
**Source Operators**:
- Remote tables
- Local tables
- Views
- CSV files
**Transformation Operators**:
| Operator | Purpose | Configuration |
|----------|---------|---------------|
| Join | Combine sources | Join type, conditions |
| Union | Stack sources | Column mapping |
| Projection | Select columns | Include/exclude, rename |
| Filter | Row filtering | Filter conditions |
| Aggregation | Group and aggregate | Group by, aggregates |
| Script | Custom Python | Python code |
| Calculated Column | Derived values | Expression |
**Target Operators**:
- Local table (new or existing)
- Truncate and insert or delta merge
### Join Operations
**Join Types**:
- Inner Join: Matching rows only
- Left Outer: All left + matching right
- Right Outer: All right + matching left
- Full Outer: All rows from both
- Cross Join: Cartesian product
**Join Conditions**:
```
source1.column = source2.column
```
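For example, a left outer join between two hypothetical sources corresponds to SQL along these lines (table and column names are illustrative):
```sql
-- All orders, with customer names where a match exists
SELECT o.order_id, o.amount, c.customer_name
FROM orders AS o
LEFT OUTER JOIN customers AS c
  ON o.customer_id = c.customer_id;
```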
### Aggregation Operations
**Aggregate Functions**:
- SUM, AVG, MIN, MAX
- COUNT, COUNT DISTINCT
- FIRST, LAST
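As a sketch, a typical aggregation corresponds to SQL like the following (names are illustrative):
```sql
-- Total amount and distinct order count per region
SELECT region,
       SUM(amount) AS total_amount,
       COUNT(DISTINCT order_id) AS order_count
FROM orders
GROUP BY region;
```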
### Calculated Columns
**Expression Syntax**:
```sql
CASE WHEN column1 > 100 THEN 'High' ELSE 'Low' END
CONCAT(first_name, ' ', last_name)
ROUND(amount * exchange_rate, 2)
```
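Null values propagate through expressions, so handle them explicitly where needed; for example (column names illustrative):
```sql
COALESCE(discount, 0) * amount
IFNULL(middle_name, '')
```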
### Input Parameters
Define runtime parameters for dynamic filtering:
**Parameter Types**:
- String
- Integer
- Date
- Timestamp
**Usage in Expressions**:
```sql
WHERE region = :IP_REGION
```
### Running Data Flows
**Execution Options**:
- Manual run from Data Builder
- Scheduled via task chain
- API trigger
**Run Modes**:
- Full: Process the complete dataset on each run
- Data flows do not support delta runs (see Key Limitations above); use replication flows or transformation flows for change-based loading
---
## Replication Flows
Replicate data from source systems to SAP Datasphere or external targets.
### Creating a Replication Flow
1. Navigate to Data Builder
2. Select "New Replication Flow"
3. Add source connection and objects
4. Add target connection
5. Configure load type and mappings
6. Save and deploy
### Source Systems
**SAP Sources**:
- SAP S/4HANA Cloud (ODP, CDS views)
- SAP S/4HANA On-Premise (ODP, SLT, CDS)
- SAP BW/4HANA
- SAP ECC
- SAP HANA
**Cloud Storage Sources**:
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- SFTP
**Streaming Sources**:
- Apache Kafka
- Confluent Kafka
### Target Systems
**SAP Datasphere Targets**:
- Local tables (managed by replication flow)
**External Targets**:
- Apache Kafka
- Confluent Kafka
- Google BigQuery
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- SFTP
- SAP Signavio
### Load Types
| Load Type | Description | Use Case |
|-----------|-------------|----------|
| Initial Only | One-time full load | Static data |
| Initial + Delta | Full load then changes | Standard replication |
| Real-Time | Continuous streaming | Live data |
### Configuration Options
**Flow-Level Properties**:
| Property | Description | Default |
|----------|-------------|---------|
| Delta Load Frequency | Interval for delta changes | Configurable |
| Skip Unmapped Target Columns | Ignore unmapped columns | Optional |
| Merge Data Automatically | Auto-merge for file space targets | Requires consent |
| Source Thread Limit | Parallel threads for source (1-160) | 16 |
| Target Thread Limit | Parallel threads for target (1-160) | 16 |
| Content Type | Template or Native format | Template |
**Object-Level Properties**:
| Property | Description |
|----------|-------------|
| Load Type | Initial Only, Initial+Delta, Delta Only |
| Delta Capture | Enable CDC tracking |
| ABAP Exit | Custom projection logic |
| Object Thread Count | Thread count for delta operations |
| Delete Before Load | Clear target before loading |
### Critical Constraints
- **No input parameters**: Replication flows do not support input parameters
- **Thread limits read-only at design time**: Editable only after deployment
- **Content Type applies globally**: Selection affects all replication objects in the flow
- **ABAP systems**: Consult SAP Note 3297105 before creating replication flows
### Content Type (ABAP Sources)
| Type | Date/Timestamp Handling | Use Case |
|------|-------------------------|----------|
| Template Type | Applies ISO format requirements | Standard integration |
| Native Type | Dates → strings, timestamps → decimals | Custom formatting |
**Filters**:
- Define row-level filters on source
- Multiple filter conditions with AND/OR
- **Important**: For ODP-CDS, filters must apply to primary key fields only
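Logically, a filter corresponds to a row condition on the source object, sketched here with illustrative column names (filters are configured in the flow editor, not entered as free SQL):
```sql
COUNTRY = 'DE' AND (FISCAL_YEAR = '2023' OR FISCAL_YEAR = '2024')
```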
**Mappings**:
- Automatic column mapping
- Manual mapping overrides
- Exclude columns
**Projections**:
- Custom SQL expressions
- Column transformations
- Calculated columns
- ABAP Exit for custom projection logic
### Sizing and Performance
**Thread Configuration**:
- Source/Target Thread Limits: 1-160 (default: 16)
- Higher values = more parallelism but more resources
- Consider source system capacity
**Capacity Planning**:
- Estimate data volume per table
- Consider network bandwidth
- Plan for parallel execution
- RFC fast serialization (SAP Note 3486245) for improved performance
**Load Balancing**:
- Distribute across multiple flows
- Schedule during off-peak hours
- Monitor resource consumption
### Unsupported Data Types
- BLOB, CLOB (large objects)
- Spatial data types
- Custom ABAP types
- Virtual Tables (SAP HANA Smart Data Access)
- Row Tables (use COLUMN TABLE only)
---
## Transformation Flows
Delta-aware transformations with automatic change propagation.
### Creating a Transformation Flow
1. Navigate to Data Builder
2. Select "New Transformation Flow"
3. Add source (view or graphical view)
4. Add target table
5. Configure run settings
6. Save and deploy
### Key Constraints and Limitations
**Data Access Restrictions**:
Views and Open SQL schema objects cannot be used if they:
- Reference remote tables (except BW Bridge)
- Consume views with data access controls
- Have data access controls applied to them directly
**Loading Constraints**:
- Loading delta changes from views is not supported
- Only loads data to local SAP Datasphere repository tables
- Remote tables in BW Bridge spaces must be shared with the SAP Datasphere space
### Runtime Options
| Runtime | Storage Target | Use Case |
|---------|----------------|----------|
| HANA | SAP HANA Database storage | Standard transformations |
| SPARK | SAP HANA Data Lake Files storage | Large-scale file processing |
### Load Types
| Load Type | Description | Requirements |
|-----------|-------------|--------------|
| Initial Only | Full dataset load | None |
| Initial and Delta | Full load then changes | Delta capture enabled on source and target tables |
### Input Parameter Constraints
- Cannot be created/edited in Graphical View Editor
- Scheduled flows use default values
- **Not supported** in Python operations (Spark runtime)
- Exclude from task chain input parameters
### Source Options
- Graphical view (created inline)
- SQL view (created inline)
- Existing views
### Target Table Management
**Options**:
- Create new local table
- Use existing local table
**Column Handling**:
- Add new columns automatically
- Map columns manually
- Exclude columns
### Run Modes
| Mode | Action | Use Case |
|------|--------|----------|
| Start | Process delta changes | Regular runs |
| Delete | Remove target records | Cleanup |
| Truncate | Clear and reload | Full refresh |
### Delta Processing
Transformation flows track changes automatically:
- Insert: New records
- Update: Modified records
- Delete: Removed records
### File Space Transformations
Transform data in object store (file spaces):
**Supported Functions**:
- String functions
- Numeric functions
- Date functions
- Conversion functions
---
## Local Tables
Store data directly in SAP Datasphere.
### Creating Local Tables
**Methods**:
1. Data Builder > New Table
2. Import from CSV
3. Create from data flow target
4. Create from replication flow target
### Storage Options
| Storage | Target System | Use Case |
|---------|---------------|----------|
| Disk | SAP HANA Cloud, SAP HANA database | Standard persistent storage |
| In-Memory | SAP HANA Cloud, SAP HANA database | High-performance hot data |
| File | SAP HANA Cloud data lake storage | Large-scale cost-effective storage |
### Table Properties
**Key Columns**:
- Primary key definition
- Unique constraints
**Data Types**:
- String (VARCHAR)
- Integer (INT, BIGINT)
- Decimal (DECIMAL)
- Date, Time, Timestamp
- Boolean
- Binary
### Partitioning
**Partition Types**:
- Range partitioning (date/numeric)
- Hash partitioning
**Benefits**:
- Improved query performance
- Parallel processing
- Selective data loading
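Partitions are defined in the table editor rather than with DDL, but a range partition corresponds conceptually to SAP HANA-style DDL like this illustration (names and ranges are hypothetical):
```sql
-- Illustration only: range partitioning by order year
CREATE COLUMN TABLE sales (
  order_id   INTEGER,
  order_date DATE,
  amount     DECIMAL(15, 2)
)
PARTITION BY RANGE (YEAR(order_date)) (
  PARTITION '2023' <= VALUES < '2024',
  PARTITION '2024' <= VALUES < '2025',
  PARTITION OTHERS
);
```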
### Delta Capture
Enable change tracking for incremental processing:
1. Enable delta capture on table
2. Track insert/update/delete operations
3. Query changes with delta tokens
**Important Constraint**: Once delta capture is enabled and deployed, it **cannot be modified or disabled**.
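Once deployed, changes can be read from the table's delta columns. A minimal sketch, assuming technical columns named `Change_Type` and `Change_Date` and a hypothetical `:LAST_RUN` parameter (actual column names and change-type values may vary by release):
```sql
-- Read rows changed since the last processed point in time
SELECT *
FROM sales_orders
WHERE "Change_Date" > :LAST_RUN
  AND "Change_Type" IN ('I', 'U', 'D');
```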
### Allow Data Transport
Available for dimensions on SAP Business Data Cloud formation tenants:
- Enables data inclusion during repository package transport
- Limited to initial import data initialization
- **Applies only to**: Dimensions, text entities, or relational datasets
### Data Maintenance
**Operations**:
- Insert records
- Update records
- Delete records
- Truncate table
- Load from file
### Local Table (File)
Store data in object store:
**Supported Formats**:
- Parquet
- CSV
- JSON
**Use Cases**:
- Large datasets
- Cost-effective storage
- Integration with data lakes
---
## Remote Tables
Virtual access to external data without copying.
### Importing Remote Tables
1. Select connection in source browser
2. Choose tables/views to import
3. Configure import settings
4. Deploy remote table
### Data Access Modes
| Mode | Description | Performance |
|------|-------------|-------------|
| Remote | Query source directly | Network dependent |
| Replicated | Copy to local storage | Fast queries |
### Replication Options
**Full Replication**:
- Copy all data
- Scheduled refresh
**Real-Time Replication**:
- Continuous change capture
- Near real-time updates
**Partitioned Replication**:
- Divide data into partitions
- Parallel loading
### Remote Table Properties
**Statistics**:
- Create statistics for query optimization
- Update statistics periodically
**Filters**:
- Define partitioning filters
- Limit data volume
---
## Task Chains
Orchestrate multiple data integration tasks.
### Creating Task Chains
1. Navigate to Data Builder
2. Select "New Task Chain"
3. Add task nodes
4. Configure dependencies
5. Save and deploy
### Supported Task Types
**Repository Objects**:
| Task Type | Activity | Description |
|-----------|----------|-------------|
| Remote Tables | Replicate | Replicate remote table data |
| Views | Persist | Persist view data to storage |
| Intelligent Lookups | Run | Execute intelligent lookup |
| Data Flows | Run | Execute data flow |
| Replication Flows | Run | Run with load type *Initial Only* |
| Transformation Flows | Run | Execute transformation flow |
| Local Tables | Delete Records | Delete records with Change Type "Deleted" |
| Local Tables (File) | Merge | Merge delta files |
| Local Tables (File) | Optimize | Compact files |
| Local Tables (File) | Delete Records | Remove data |
**Non-Repository Objects**:
| Task Type | Description |
|-----------|-------------|
| Open SQL Procedure | Execute SAP HANA schema procedures |
| BW Bridge Process Chain | Run SAP BW Bridge processes |
**Toolbar-Only Objects**:
| Task Type | Description |
|-----------|-------------|
| API Task | Call external REST APIs |
| Notification Task | Send email notifications |
**Nested Objects**:
| Task Type | Description |
|-----------|-------------|
| Task Chain | Reference locally-created or shared task chains |
### Object Prerequisites
- All objects must be deployed before adding to task chains
- SAP HANA Open SQL schema procedures require EXECUTE privileges granted to space users (see the sketch after this list)
- Views **cannot** have data access controls assigned
- Data flows with input parameters use default values during task chain execution
- A view being persisted may include at most one input parameter, which must have a default value
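For the Open SQL procedure prerequisite above, the grant corresponds to standard SAP HANA SQL along these lines (schema, procedure, and user names are hypothetical):
```sql
-- Illustration: grant execute on a procedure in an Open SQL schema
GRANT EXECUTE ON "MY_SCHEMA"."MY_PROCEDURE" TO "MY_SPACE_USER";
```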
### Execution Control
**Sequential Execution**:
- Tasks run one after another
- A task runs only after its predecessor finishes with *completed* status
- Failure stops chain execution
**Parallel Execution**:
- Multiple branches run simultaneously
- Completion condition options:
- **ANY**: Succeeds when any parallel task completes
- **ALL**: Succeeds only when all parallel tasks complete
- Synchronization at join points
**Layout Options**:
- Top-Bottom orientation
- Left-Right orientation
- Drag tasks to reorder
**Apache Spark Settings**:
- Override default Apache Spark Application Settings per task
- Configure memory and executor settings
### Input Parameters
Pass parameters to task chain tasks:
**Parameter Definition**:
```yaml
name: region
type: string
default: "US"
```
**Parameter Usage**:
- Pass to data flows
- Use in filters
- Dynamic configuration
### Scheduling
**Simple Schedule**:
- Daily, weekly, monthly
- Specific time
**Cron Expression**:
```
0 0 6 * * ? # Daily at 6 AM
0 0 */4 * * ? # Every 4 hours
```
**Important Scheduling Constraint**: If a schedule includes remote tables that use *Replicated (Real-Time)* data access, their replication type is converted to batch replication at the next scheduled run, which ends real-time updates.
### Email Notifications
Configure notifications for:
- Success
- Failure
- Warning
**Recipient Options** (the task chain must be deployed before recipients can be selected):
- Tenant users (searchable)
- External email addresses
**Export Constraint**: CSN/JSON export does not include notification recipients
---
## Python Operators
Custom data processing with Python.
### Creating Python Operators
1. Add Script operator to data flow
2. Define input/output ports
3. Write Python code
4. Configure execution
### Python Script Structure
```python
import pandas as pd


def transform(data: pd.DataFrame) -> pd.DataFrame:
    """
    Transform input data.

    Args:
        data: pandas DataFrame from the input port

    Returns:
        pandas DataFrame passed to the output port
    """
    result = data.copy()
    # Your transformation logic; this derived column is illustrative
    result['new_column'] = result['existing'].apply(lambda value: str(value).upper())
    return result
```
### Available Libraries
- pandas
- numpy
- scipy
- scikit-learn
- datetime
### Best Practices
- Keep transformations simple
- Handle null values explicitly
- Log errors appropriately
- Test with sample data
---
## Data Transformation
Column-level transformations in graphical views.
### Text Transformations
| Function | Description | Example |
|----------|-------------|---------|
| Change Case | Upper/lower/title | UPPER(name) |
| Concatenate | Join columns | CONCAT(first, last) |
| Extract | Substring | SUBSTRING(text, 1, 5) |
| Split | Divide by delimiter | SPLIT(full_name, ' ') |
| Find/Replace | Text substitution | REPLACE(text, 'old', 'new') |
### Numeric Transformations
| Function | Description |
|----------|-------------|
| ROUND | Round to precision |
| FLOOR | Round down |
| CEIL | Round up |
| ABS | Absolute value |
| MOD | Modulo operation |
### Date Transformations
| Function | Description |
|----------|-------------|
| YEAR | Extract year |
| MONTH | Extract month |
| DAY | Extract day |
| DATEDIFF | Date difference |
| ADD_DAYS | Add days to date |
### Filter Operations
```sql
-- Numeric filter
amount > 1000
-- Text filter
region IN ('US', 'EU', 'APAC')
-- Date filter
order_date >= '2024-01-01'
-- Null handling
customer_name IS NOT NULL
```
---
## Semantic Onboarding
Import objects with business semantics from SAP systems.
### SAP S/4HANA Import
Import CDS views with annotations:
- Semantic types (currency, unit)
- Associations
- Hierarchies
- Text relationships
### SAP BW/4HANA Import
Import BW objects:
- InfoObjects
- CompositeProviders
- Queries
- Analysis Authorizations
### Import Process
1. Select source connection
2. Browse available objects
3. Select objects to import
4. Review semantic mapping
5. Deploy imported objects
---
## File Spaces and Object Store
Store and process data in object store.
### Creating File Spaces
1. System > Configuration > Spaces
2. Create new file space
3. Configure object store connection
4. Set storage limits
### Data Loading
**Supported Formats**:
- Parquet (recommended)
- CSV
- JSON
**Loading Methods**:
- Replication flows
- Transformation flows
- API upload
### In-Memory Acceleration
Enable in-memory storage for faster queries:
1. Select table/view
2. Enable in-memory storage
3. Configure refresh schedule
### Premium Outbound Integration
Export data to external systems:
- Configure outbound connection
- Schedule exports
- Monitor transfer status
---
## Documentation Links
- **Data Flows**: [https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/e30fd14](https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/e30fd14)
- **Replication Flows**: [https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/25e2bd7](https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/25e2bd7)
- **Transformation Flows**: [https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/f7161e6](https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/f7161e6)
- **Task Chains**: [https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/d1afbc2](https://help.sap.com/docs/SAP_DATASPHERE/c8a54ee704e94e15926551293243fd1d/d1afbc2)
---
**Last Updated**: 2025-11-22