---
name: hybrid-generate-databricks
description: Generate Databricks Delta Lake SQL from YAML configuration for ID unification
---

# Generate Databricks SQL from YAML

## Overview

Generate a production-ready Databricks SQL workflow from your `unify.yml` configuration file. This command creates Delta Lake optimized SQL files with ACID transactions, clustering, and platform-specific function conversions.

---

## What You Need

### Required Inputs
1. **YAML Configuration File**: Path to your `unify.yml`
2. **Target Catalog**: Databricks Unity Catalog name
3. **Target Schema**: Schema name within the catalog

### Optional Inputs
4. **Source Catalog**: Catalog containing source tables (defaults to target catalog)
5. **Source Schema**: Schema containing source tables (defaults to target schema)
6. **Output Directory**: Where to save generated SQL (defaults to `databricks_sql/`)

---

## What I'll Do

### Step 1: Validation
- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections

### Step 2: SQL Generation
I'll call the **databricks-sql-generator agent** to:
- Execute the `yaml_unification_to_databricks.py` Python script
- Apply Databricks-specific SQL conversions (see the example after this list):
  - `ARRAY_SIZE` → `SIZE`
  - `ARRAY_CONSTRUCT` → `ARRAY`
  - `OBJECT_CONSTRUCT` → `STRUCT`
  - `COLLECT_LIST` for aggregations
  - `FLATTEN` for array operations
  - `UNIX_TIMESTAMP()` for time functions
- Generate Delta Lake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs
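
To give a feel for these conversions, here is a minimal, hypothetical illustration in Spark SQL (the table and column names are assumptions for this sketch, not generator output):

```sql
-- Hypothetical illustration of the conversions listed above (not generated SQL).
-- Snowflake-style input:  ARRAY_SIZE(ARRAY_CONSTRUCT(email_std, phone)), OBJECT_CONSTRUCT(...)
SELECT
  customer_id,
  size(array(email_std, phone))                    AS key_count,    -- ARRAY_SIZE -> size, ARRAY_CONSTRUCT -> array
  named_struct('email', email_std, 'phone', phone) AS key_struct,   -- OBJECT_CONSTRUCT -> named_struct/struct
  unix_timestamp()                                 AS extracted_at  -- current time as epoch seconds
FROM customer_profiles;

-- COLLECT_LIST for aggregations, FLATTEN for nested arrays:
SELECT
  customer_id,
  collect_list(email_std)                          AS emails,
  flatten(collect_list(array(email_std, phone)))   AS all_key_values
FROM customer_profiles
GROUP BY customer_id;
```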

### Step 3: Output Organization
Generate the complete SQL workflow in this structure:
```
databricks_sql/unify/
├── 01_create_graph.sql              # Initialize graph with USING DELTA
├── 02_extract_merge.sql             # Extract identities with validation
├── 03_source_key_stats.sql          # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql    # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql              # Canonical ID creation with key masks
├── 06_result_key_stats.sql          # Result statistics with histograms
├── 10_enrich_*.sql                  # Enrich each source table
├── 20_master_*.sql                  # Master tables with attribute aggregation
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules lookup
└── 32_column_lookup.sql             # Column mapping lookup
```
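
For reference, "source statistics with GROUPING SETS" can be expressed in plain Spark SQL as sketched below; the table and column names are assumptions for illustration, not the schema the generator emits:

```sql
-- Illustrative only: identity counts per source table and key, plus rollup totals.
SELECT
  source_table,
  key_name,
  COUNT(DISTINCT key_value) AS distinct_keys,
  COUNT(*)                  AS total_rows
FROM extracted_keys   -- assumed intermediate table name
GROUP BY source_table, key_name
  GROUPING SETS ((source_table, key_name), (source_table), ());
```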

### Step 4: Summary Report
Provide:
- Total SQL files generated
- Estimated execution order
- Delta Lake optimizations included
- Key features enabled
- Next steps for execution

---

## Command Usage

### Basic Usage
```
/cdp-hybrid-idu:hybrid-generate-databricks

I'll prompt you for:
- YAML file path
- Target catalog
- Target schema
```

### Advanced Usage
Provide all parameters upfront:
```
YAML file: /path/to/unify.yml
Target catalog: my_catalog
Target schema: my_schema
Source catalog: source_catalog (optional)
Source schema: source_schema (optional)
Output directory: custom_output/ (optional)
```

---

## Generated SQL Features

### Delta Lake Optimizations
- **ACID Transactions**: `USING DELTA` for all tables
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **Table Properties**: Optimized for large-scale joins
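
As a rough sketch of how these settings appear in DDL (columns drawn from the graph tables used in convergence detection below; the generated files define the real schema and properties):

```sql
-- Simplified sketch of a clustered Delta graph table; not the generated DDL.
CREATE TABLE IF NOT EXISTS my_catalog.my_schema.unified_id_graph_unify_loop_0 (
  leader_ns   STRING,
  leader_id   STRING,
  follower_ns STRING,
  follower_id STRING
)
USING DELTA
CLUSTER BY (follower_id)  -- keeps the large self-joins on follower_id fast
TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');  -- example property only
```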

### Advanced Capabilities
1. **Dynamic Iteration Count**: Auto-calculated based on:
   - Number of merge keys
   - Number of tables
   - Data complexity (configurable via YAML)

2. **Key-Specific Hashing**: Each key uses a unique cryptographic mask:
   ```
   Key Type 1 (email): 0ffdbcf0c666ce190d
   Key Type 2 (customer_id): 61a821f2b646a4e890
   Key Type 3 (phone): acd2206c3f88b3ee27
   ```

3. **Validation Rules** (sketched in the example after this list):
   - `valid_regexp`: Regex pattern filtering
   - `invalid_texts`: NOT IN clause with NULL handling
   - Combined AND logic for strict validation

4. **Master Table Attributes** (also sketched below):
   - Single value: `MAX_BY(attr, order)` with COALESCE
   - Array values: `SLICE(CONCAT(arrays), 1, N)`
   - Priority-based selection
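
To make items 3 and 4 concrete, here is a hedged sketch in Spark SQL; the names `source_priority` and `enriched_customer_profiles` are assumptions for illustration, and the generated SQL expresses the same logic in its own structure:

```sql
-- Item 3: validation rules become a combined WHERE clause (valid_regexp AND invalid_texts).
SELECT customer_id, email_std
FROM customer_profiles
WHERE email_std RLIKE '.*@.*'                -- valid_regexp
  AND email_std IS NOT NULL
  AND email_std NOT IN ('', 'N/A', 'null');  -- invalid_texts with NULL handling

-- Item 4: master-table attributes via MAX_BY, COALESCE, and array slicing.
SELECT
  unified_id,
  coalesce(max_by(email_std, source_priority),
           first(email_std, true))            AS best_email,  -- single value chosen by priority
  slice(concat(collect_list(email_std)), 1, 5) AS emails       -- array attribute, first 5 values
FROM enriched_customer_profiles
GROUP BY unified_id;
```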

### Platform-Specific Conversions
The generator automatically converts:
- Presto functions → Databricks equivalents
- Snowflake functions → Databricks equivalents
- Array operations → Spark SQL syntax
- Window functions → optimized versions
- Time functions → `UNIX_TIMESTAMP()`
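
For example (a hedged illustration, not an exhaustive mapping), Presto's `to_unixtime(now())` or Snowflake's `DATE_PART(epoch_second, CURRENT_TIMESTAMP())` would come out as:

```sql
-- Databricks equivalent of "current time as epoch seconds"
SELECT unix_timestamp() AS run_epoch_seconds;
```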

---

## Example Workflow

### Input YAML (`unify.yml`)
```yaml
name: customer_unification

keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A']

tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}

canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id]
    merge_iterations: 15

master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1}
```

### Generated Output
```
databricks_sql/unify/
├── 01_create_graph.sql               # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql              # Merges customer_profiles keys
├── 03_source_key_stats.sql           # Stats by table
├── 04_unify_loop_iteration_01.sql    # First iteration
├── 04_unify_loop_iteration_02.sql    # Second iteration
├── ...                               # Up to iteration_05
├── 05_canonicalize.sql               # Creates unified_id_lookup
├── 06_result_key_stats.sql           # Final statistics
├── 10_enrich_customer_profiles.sql   # Adds unified_id column
├── 20_master_customer_master.sql     # Creates customer_master table
├── 30_unification_metadata.sql       # Metadata
├── 31_filter_lookup.sql              # Validation rules
└── 32_column_lookup.sql              # Column mappings
```
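
To give a feel for the enrichment step, here is a hedged sketch of what `10_enrich_customer_profiles.sql` might do; the lookup's column names and join logic are assumptions, and the generated file is authoritative:

```sql
-- Hypothetical sketch: attach the canonical unified_id to the source table.
CREATE OR REPLACE TABLE my_catalog.my_schema.enriched_customer_profiles
USING DELTA AS
SELECT
  p.*,
  l.canonical_id AS unified_id                     -- assumed lookup column name
FROM my_catalog.my_schema.customer_profiles AS p
LEFT JOIN my_catalog.my_schema.unified_id_lookup AS l
  ON  l.follower_ns = 'email'                      -- assumed namespace/value join keys
  AND l.follower_id = lower(p.email_std);
```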

---

## Next Steps After Generation

### Option 1: Execute Immediately
Use the execution command:
```
/cdp-hybrid-idu:hybrid-execute-databricks
```

### Option 2: Review First
1. Examine the generated SQL files
2. Verify table names and transformations
3. Test with sample data
4. Execute manually or via the execution command

### Option 3: Customize
1. Modify the generated SQL as needed
2. Add custom logic or transformations
3. Execute using the Databricks SQL editor or the execution command

---

## Technical Details

### Python Script Execution
The agent executes:
```bash
python3 scripts/databricks/yaml_unification_to_databricks.py \
  unify.yml \
  -tc my_catalog \
  -ts my_schema \
  -sc source_catalog \
  -ss source_schema \
  -o databricks_sql
```

### SQL File Naming Convention
- `01-09`: Setup and initialization
- `10-19`: Source table enrichment
- `20-29`: Master table creation
- `30-39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)

### Convergence Detection
Each loop iteration includes:
```sql
-- Check whether the graph changed since the previous iteration
SELECT COUNT(*) FROM (
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM iteration_N
  EXCEPT
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM iteration_N_minus_1
) diff
```
The loop stops when the count reaches 0.

---

## Troubleshooting

### Common Issues

**Issue**: YAML validation error
**Solution**: Check the YAML syntax, ensure proper indentation, and verify that all required fields are present.

**Issue**: Table not found error
**Solution**: Verify the source catalog/schema and check the table names in the YAML.

**Issue**: Python script error
**Solution**: Ensure Python 3.7+ is installed and the `pyyaml` dependency is available.

**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in the `canonical_ids` section of the YAML.

---

## Success Criteria

Generated SQL will:
- ✅ Be valid Databricks Spark SQL
- ✅ Use Delta Lake for ACID transactions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built in
- ✅ Support incremental processing
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Databricks

---

**Ready to generate Databricks SQL from your YAML configuration?**

Provide your YAML file path and target catalog/schema to begin!