---
name: hybrid-generate-snowflake
description: Generate Snowflake SQL from YAML configuration for ID unification
---
# Generate Snowflake SQL from YAML
## Overview
Generate a production-ready Snowflake SQL workflow from your `unify.yml` configuration file. This command creates Snowflake-native SQL files with appropriate clustering, VARIANT support, and platform-specific function conversions.
---
## What You Need
### Required Inputs
1. **YAML Configuration File**: Path to your `unify.yml`
2. **Target Database**: Snowflake database name
3. **Target Schema**: Schema name within the database
### Optional Inputs
4. **Source Database**: Database containing source tables (defaults to target database)
5. **Source Schema**: Schema containing source tables (defaults to PUBLIC)
6. **Output Directory**: Where to save generated SQL (defaults to `snowflake_sql/`)
---
## What I'll Do
### Step 1: Validation
- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections
### Step 2: SQL Generation
I'll call the **snowflake-sql-generator agent** to:
- Execute `yaml_unification_to_snowflake.py` Python script
- Generate Snowflake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs
### Step 3: Output Organization
Generate complete SQL workflow in this structure:
```
snowflake_sql/unify/
├── 01_create_graph.sql # Initialize graph table
├── 02_extract_merge.sql # Extract identities with validation
├── 03_source_key_stats.sql # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql # Canonical ID creation with key masks
├── 06_result_key_stats.sql # Result statistics with histograms
├── 10_enrich_*.sql # Enrich each source table
├── 20_master_*.sql # Master tables with attribute aggregation
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules lookup
└── 32_column_lookup.sql # Column mapping lookup
```
### Step 4: Summary Report
Provide:
- Total SQL files generated
- Estimated execution order
- Snowflake optimizations included
- Key features enabled
- Next steps for execution
---
## Command Usage
### Basic Usage
```
/cdp-hybrid-idu:hybrid-generate-snowflake
```
I'll prompt you for:
- YAML file path
- Target database
- Target schema
### Advanced Usage
Provide all parameters upfront:
```
YAML file: /path/to/unify.yml
Target database: my_database
Target schema: my_schema
Source database: source_database (optional)
Source schema: PUBLIC (optional, defaults to PUBLIC)
Output directory: custom_output/ (optional)
```
---
## Generated SQL Features
### Snowflake Optimizations
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **VARIANT Support**: Flexible data structures for arrays and objects
- **Native Functions**: Snowflake-specific optimized functions
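As a hedged illustration of the first two optimizations, a generated graph table might be shaped like this (column names follow the graph columns referenced elsewhere in this document; the exact generated DDL may differ):
```sql
-- Sketch of a clustered graph table; id_ns_array is stored as VARIANT
-- so arrays and objects can be held without a rigid schema.
CREATE TABLE IF NOT EXISTS my_database.my_schema.unified_id_graph (
    leader_ns    VARCHAR,
    leader_id    VARCHAR,
    follower_ns  VARCHAR,
    follower_id  VARCHAR,
    id_ns_array  VARIANT
)
CLUSTER BY (follower_id);
```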
### Advanced Capabilities
1. **Dynamic Iteration Count**: Auto-calculates based on:
- Number of merge keys
- Number of tables
- Data complexity (configurable via YAML)
2. **Key-Specific Hashing**: Each key uses a unique cryptographic mask (a plausible application is sketched after this list):
```
Key Type 1 (email): 0ffdbcf0c666ce190d
Key Type 2 (customer_id): 61a821f2b646a4e890
Key Type 3 (phone): acd2206c3f88b3ee27
```
3. **Validation Rules** (illustrated in the sketch after this list):
- `valid_regexp`: REGEXP_LIKE pattern filtering
- `invalid_texts`: NOT IN clause with proper NULL handling
- Combined AND logic for strict validation
4. **Master Table Attributes** (also illustrated below):
- Single value: `MAX_BY(attr, order)` with COALESCE
- Array values: `ARRAY_SLICE(ARRAY_CAT(arrays), 0, N)`
- Priority-based selection
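The sketch below illustrates items 2-4 under stated assumptions: table and column names come from the example YAML later in this document, `enriched_customer_profiles` and `source_priority` are hypothetical, and the mask application in the first query is one plausible shape rather than the generator's actual mechanism:
```sql
-- Item 2 (assumption): one plausible way a per-key mask could feed a hash;
-- the real mask handling is defined by yaml_unification_to_snowflake.py.
SELECT SHA2(CONCAT('0ffdbcf0c666ce190d', email_std), 256) AS email_key_hash
FROM customer_profiles;

-- Item 3: valid_regexp and invalid_texts combine with AND.
-- A NULL email already fails REGEXP_LIKE, so NOT IN never compares against NULL.
SELECT email_std
FROM customer_profiles
WHERE REGEXP_LIKE(email_std, '.*@.*')           -- valid_regexp
  AND email_std NOT IN ('', 'N/A', 'null');     -- invalid_texts

-- Item 4: single-value attribute picked by MAX_BY over a hypothetical
-- source_priority column, with COALESCE as the NULL fallback.
SELECT unified_id,
       COALESCE(MAX_BY(email_std, source_priority), '') AS best_email
FROM enriched_customer_profiles
GROUP BY unified_id;
```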
### Platform-Specific Conversions
The generator automatically converts:
- Presto functions → Snowflake equivalents
- Databricks functions → Snowflake equivalents
- Array operations → ARRAY_CONSTRUCT/FLATTEN syntax
- Window functions → optimized versions
- Time functions → DATE_PART(epoch_second, CURRENT_TIMESTAMP())
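For example (an illustrative input/output pair; the Presto side is an assumption about typical source SQL, and `my_table` is hypothetical):
```sql
-- Presto input:    SELECT to_unixtime(now())
-- Snowflake output:
SELECT DATE_PART(epoch_second, CURRENT_TIMESTAMP());

-- Presto input:    ... CROSS JOIN UNNEST(id_ns_array) AS t(id)
-- Snowflake output (FLATTEN-based array expansion):
SELECT f.value AS id
FROM my_table,
     LATERAL FLATTEN(input => id_ns_array) f;
```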
---
## Example Workflow
### Input YAML (`unify.yml`)
```yaml
name: customer_unification
keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A']
tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}
canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id]
    merge_iterations: 15
master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1}
```
### Generated Output
```
snowflake_sql/unify/
├── 01_create_graph.sql # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql # Merges customer_profiles keys
├── 03_source_key_stats.sql # Stats by table
├── 04_unify_loop_iteration_01.sql # First iteration
├── 04_unify_loop_iteration_02.sql # Second iteration
├── ... # Up to iteration_05
├── 05_canonicalize.sql # Creates unified_id_lookup
├── 06_result_key_stats.sql # Final statistics
├── 10_enrich_customer_profiles.sql # Adds unified_id column
├── 20_master_customer_master.sql # Creates customer_master table
├── 30_unification_metadata.sql # Metadata
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
---
## Next Steps After Generation
### Option 1: Execute Immediately
Use the execution command:
```
/cdp-hybrid-idu:hybrid-execute-snowflake
```
### Option 2: Review First
1. Examine generated SQL files
2. Verify table names and transformations
3. Test with sample data
4. Execute manually or via execution command
### Option 3: Customize
1. Modify generated SQL as needed
2. Add custom logic or transformations
3. Execute using Snowflake SQL worksheet or execution command
---
## Technical Details
### Python Script Execution
The agent executes:
```bash
python3 scripts/snowflake/yaml_unification_to_snowflake.py \
  unify.yml \
  -d my_database \
  -s my_schema \
  -sd source_database \
  -ss source_schema \
  -o snowflake_sql
```
### SQL File Naming Convention
- `01-09`: Setup and initialization
- `10-19`: Source table enrichment
- `20-29`: Master table creation
- `30-39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)
### Convergence Detection
Each loop iteration includes:
```sql
-- Check if graph changed
SELECT COUNT(*) FROM (
    SELECT leader_ns, leader_id, follower_ns, follower_id
    FROM iteration_N
    EXCEPT
    SELECT leader_ns, leader_id, follower_ns, follower_id
    FROM iteration_N_minus_1
) diff;
```
Execution stops once this count reaches 0, i.e. the graph no longer changes between iterations.
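A minimal Snowflake Scripting sketch of how an executor might apply this check between two concrete iterations (the iteration table names are illustrative; the generated files embed their own convergence logic):
```sql
EXECUTE IMMEDIATE $$
DECLARE
    diff_count INTEGER;
BEGIN
    -- Compare the two most recent iteration tables (illustrative names)
    SELECT COUNT(*) INTO :diff_count FROM (
        SELECT leader_ns, leader_id, follower_ns, follower_id FROM unify_loop_iteration_02
        EXCEPT
        SELECT leader_ns, leader_id, follower_ns, follower_id FROM unify_loop_iteration_01
    ) diff;
    IF (diff_count = 0) THEN
        RETURN 'Converged: remaining iterations are no-ops';
    END IF;
    RETURN 'Not converged: run the next iteration file';
END;
$$;
```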
### Snowflake-Specific Features
- **LATERAL FLATTEN**: Array expansion for id_ns_array processing
- **ARRAY_CONSTRUCT**: Building arrays from multiple columns
- **OBJECT_CONSTRUCT**: Creating structured objects for key-value pairs
- **ARRAYS_OVERLAP**: Checking array membership
- **SPLIT_PART**: String splitting for leader key parsing
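The query below exercises each of these functions in one place; the `unified_id_graph` table, the `leader_key` column, and the `'ns:id'` key format are assumptions for illustration only:
```sql
SELECT
    SPLIT_PART(g.leader_key, ':', 1)                            AS leader_ns_part,  -- parse a "ns:id" leader key
    ARRAY_CONSTRUCT(g.leader_ns, g.leader_id)                   AS leader_pair,     -- build an array from columns
    OBJECT_CONSTRUCT('ns', g.follower_ns, 'id', g.follower_id)  AS follower_obj,    -- structured key-value object
    f.value                                                     AS id_ns_element    -- one row per id_ns_array entry
FROM unified_id_graph g,
     LATERAL FLATTEN(input => g.id_ns_array) f                                      -- expand the VARIANT array
WHERE ARRAYS_OVERLAP(g.id_ns_array::ARRAY, ARRAY_CONSTRUCT('email'));               -- array membership test
```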
---
## Troubleshooting
### Common Issues
**Issue**: YAML validation error
**Solution**: Check the YAML syntax, ensure proper indentation, and verify that all required fields are present.

**Issue**: Table not found error
**Solution**: Verify the source database/schema and check the table names in the YAML.

**Issue**: Python script error
**Solution**: Ensure Python 3.7+ is installed and the pyyaml dependency is available.

**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in the `canonical_ids` section of the YAML.

**Issue**: VARIANT column errors
**Solution**: Snowflake's VARIANT handling is automatic; ensure proper casting in any custom SQL.
---
## Success Criteria
Generated SQL will:
- ✅ Be valid Snowflake SQL
- ✅ Use native Snowflake functions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built-in
- ✅ Support VARIANT types for flexible data
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Snowflake
---
**Ready to generate Snowflake SQL from your YAML configuration?**
Provide your YAML file path and target database/schema to begin!