| name | description |
|---|---|
| hybrid-generate-snowflake | Generate Snowflake SQL from YAML configuration for ID unification |
# Generate Snowflake SQL from YAML

## Overview
Generate a production-ready Snowflake SQL workflow from your `unify.yml` configuration file. This command creates Snowflake-native SQL files with proper clustering, VARIANT support, and platform-specific function conversions.
## What You Need

### Required Inputs

- **YAML Configuration File**: Path to your `unify.yml`
- **Target Database**: Snowflake database name
- **Target Schema**: Schema name within the database

### Optional Inputs

- **Source Database**: Database containing source tables (defaults to the target database)
- **Source Schema**: Schema containing source tables (defaults to `PUBLIC`)
- **Output Directory**: Where to save generated SQL (defaults to `snowflake_sql/`)
## What I'll Do

### Step 1: Validation

- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections
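The structural part of this validation can be sketched as a minimal check over the parsed configuration (a sketch only; the required-section names below are inferred from the example YAML later in this document, and the real command also checks YAML syntax and deeper field constraints):

```python
# Minimal structural validation of a parsed unify.yml (sketch; the actual
# validator performs more thorough checks than shown here).

REQUIRED_SECTIONS = ("name", "keys", "tables", "canonical_ids")

def validate_config(config: dict) -> list:
    """Return a list of human-readable validation errors (empty = valid)."""
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            errors.append(f"missing required section: {section}")
    for key in config.get("keys", []):
        if "name" not in key:
            errors.append("every entry under 'keys' needs a 'name'")
    for table in config.get("tables", []):
        if "table" not in table or "key_columns" not in table:
            errors.append("every entry under 'tables' needs 'table' and 'key_columns'")
    return errors

config = {
    "name": "customer_unification",
    "keys": [{"name": "email"}, {"name": "customer_id"}],
    "tables": [{"table": "customer_profiles", "key_columns": []}],
    "canonical_ids": [{"name": "unified_id"}],
}
print(validate_config(config))  # → []
```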
### Step 2: SQL Generation

I'll call the snowflake-sql-generator agent to:

- Execute the `yaml_unification_to_snowflake.py` Python script
- Generate Snowflake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs
### Step 3: Output Organization

Generate the complete SQL workflow in this structure:

```
snowflake_sql/unify/
├── 01_create_graph.sql             # Initialize graph table
├── 02_extract_merge.sql            # Extract identities with validation
├── 03_source_key_stats.sql         # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql   # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql             # Canonical ID creation with key masks
├── 06_result_key_stats.sql         # Result statistics with histograms
├── 10_enrich_*.sql                 # Enrich each source table
├── 20_master_*.sql                 # Master tables with attribute aggregation
├── 30_unification_metadata.sql     # Metadata tables
├── 31_filter_lookup.sql            # Validation rules lookup
└── 32_column_lookup.sql            # Column mapping lookup
```
### Step 4: Summary Report

Provide:
- Total SQL files generated
- Estimated execution order
- Snowflake optimizations included
- Key features enabled
- Next steps for execution
## Command Usage

### Basic Usage

```
/cdp-hybrid-idu:hybrid-generate-snowflake
```
I'll prompt you for:
- YAML file path
- Target database
- Target schema
### Advanced Usage

Provide all parameters upfront:

```
YAML file: /path/to/unify.yml
Target database: my_database
Target schema: my_schema
Source database: source_database (optional)
Source schema: PUBLIC (optional, defaults to PUBLIC)
Output directory: custom_output/ (optional)
```
## Generated SQL Features

### Snowflake Optimizations

- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **VARIANT Support**: Flexible data structures for arrays and objects
- **Native Functions**: Snowflake-specific optimized functions
### Advanced Capabilities

- **Dynamic Iteration Count**: Auto-calculated based on:
  - Number of merge keys
  - Number of tables
  - Data complexity (configurable via YAML)
- **Key-Specific Hashing**: Each key uses a unique cryptographic mask:

  ```
  Key Type 1 (email):       0ffdbcf0c666ce190d
  Key Type 2 (customer_id): 61a821f2b646a4e890
  Key Type 3 (phone):       acd2206c3f88b3ee27
  ```

- **Validation Rules**:
  - `valid_regexp`: REGEXP_LIKE pattern filtering
  - `invalid_texts`: NOT IN clause with proper NULL handling
  - Combined AND logic for strict validation
- **Master Table Attributes**:
  - Single value: `MAX_BY(attr, order)` with COALESCE
  - Array values: `ARRAY_SLICE(ARRAY_CAT(arrays), 0, N)`
  - Priority-based selection
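One way to picture key-specific masking (a sketch only: the generator's actual mask derivation and hash function are not specified here, so the `sha256`-over-prefixed-value scheme below is an assumption for illustration):

```python
import hashlib

# Hypothetical illustration: each key type gets its own mask so identical
# raw values under different keys can never collide into one hashed identity.
KEY_MASKS = {
    "email": "0ffdbcf0c666ce190d",
    "customer_id": "61a821f2b646a4e890",
    "phone": "acd2206c3f88b3ee27",
}

def hash_identity(key_type: str, value: str) -> str:
    """Combine the per-key mask with the value before hashing (assumed scheme)."""
    masked = f"{KEY_MASKS[key_type]}:{value}"
    return hashlib.sha256(masked.encode("utf-8")).hexdigest()

a = hash_identity("email", "jane@example.com")
b = hash_identity("customer_id", "jane@example.com")
print(a != b)  # same raw value, different keys → different hashes
```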
### Platform-Specific Conversions

The generator automatically converts:

- Presto functions → Snowflake equivalents
- Databricks functions → Snowflake equivalents
- Array operations → `ARRAY_CONSTRUCT`/`FLATTEN` syntax
- Window functions → optimized versions
- Time functions → `DATE_PART(epoch_second, CURRENT_TIMESTAMP())`
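The conversion layer can be pictured as a rewrite table (a sketch with a few illustrative mappings; the `to_unixtime` mapping comes from the list above, the rest are assumptions, and the real generator is token-aware rather than doing plain string replacement):

```python
# Illustrative Presto → Snowflake rewrites (assumed subset; real rewriting
# must be token-aware so e.g. "if(" never matches inside another identifier).
FUNCTION_MAP = {
    "to_unixtime(now())": "DATE_PART(epoch_second, CURRENT_TIMESTAMP())",
    "array_agg(": "ARRAY_AGG(",
}

def convert_sql(sql: str) -> str:
    """Apply each rewrite rule, longest source pattern first."""
    for src, dst in sorted(FUNCTION_MAP.items(), key=lambda kv: -len(kv[0])):
        sql = sql.replace(src, dst)
    return sql

print(convert_sql("SELECT to_unixtime(now()) AS ts"))
# → SELECT DATE_PART(epoch_second, CURRENT_TIMESTAMP()) AS ts
```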
## Example Workflow

### Input YAML (unify.yml)

```yaml
name: customer_unification
keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A']
tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}
canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id]
    merge_iterations: 15
master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1}
```
### Generated Output

```
snowflake_sql/unify/
├── 01_create_graph.sql              # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql             # Merges customer_profiles keys
├── 03_source_key_stats.sql          # Stats by table
├── 04_unify_loop_iteration_01.sql   # First iteration
├── 04_unify_loop_iteration_02.sql   # Second iteration
├── ...                              # Up to iteration_05
├── 05_canonicalize.sql              # Creates unified_id_lookup
├── 06_result_key_stats.sql          # Final statistics
├── 10_enrich_customer_profiles.sql  # Adds unified_id column
├── 20_master_customer_master.sql    # Creates customer_master table
├── 30_unification_metadata.sql      # Metadata
├── 31_filter_lookup.sql             # Validation rules
└── 32_column_lookup.sql             # Column mappings
```
## Next Steps After Generation

### Option 1: Execute Immediately

Use the execution command:

```
/cdp-hybrid-idu:hybrid-execute-snowflake
```
### Option 2: Review First

- Examine generated SQL files
- Verify table names and transformations
- Test with sample data
- Execute manually or via execution command
### Option 3: Customize

- Modify generated SQL as needed
- Add custom logic or transformations
- Execute using Snowflake SQL worksheet or execution command
## Technical Details

### Python Script Execution

The agent executes:

```bash
python3 scripts/snowflake/yaml_unification_to_snowflake.py \
  unify.yml \
  -d my_database \
  -s my_schema \
  -sd source_database \
  -ss source_schema \
  -o snowflake_sql
```
### SQL File Naming Convention

- `01`-`09`: Setup and initialization
- `10`-`19`: Source table enrichment
- `20`-`29`: Master table creation
- `30`-`39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)
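Because of this numbering, plain lexicographic filename order is also execution order. A sketch of how a runner could derive it (the file list below is hypothetical; real names come from the output directory):

```python
# Sorting the generated filenames by their numeric prefix yields the
# execution order directly (hypothetical sample of generated files).
files = [
    "20_master_customer_master.sql",
    "01_create_graph.sql",
    "04_unify_loop_iteration_02.sql",
    "04_unify_loop_iteration_01.sql",
    "10_enrich_customer_profiles.sql",
]
execution_order = sorted(files)
print(execution_order[0])  # → 01_create_graph.sql
```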
### Convergence Detection

Each loop iteration includes a check of whether the graph changed:

```sql
-- Check if the graph changed since the previous iteration
SELECT COUNT(*) FROM (
    SELECT leader_ns, leader_id, follower_ns, follower_id
    FROM iteration_N
    EXCEPT
    SELECT leader_ns, leader_id, follower_ns, follower_id
    FROM iteration_N_minus_1
) diff
```

The loop stops when the count reaches 0.
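The same convergence idea, sketched outside SQL: repeatedly propagate each node's leader and stop when the mapping no longer changes (a toy leader-pointer example, not the generator's actual SQL):

```python
# Toy convergence loop: each node points at a leader; every pass replaces a
# node's leader with its leader's leader, stopping when nothing changes —
# mirroring the SQL's "diff count = 0" check.
edges = {"b": "a", "c": "b", "d": "c", "e": "e", "a": "a"}

iteration = 0
while True:
    iteration += 1
    updated = {node: edges[leader] for node, leader in edges.items()}
    if updated == edges:  # diff count = 0 → converged
        break
    edges = updated

print(edges["d"], iteration)  # → a 3
```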
### Snowflake-Specific Features

- **LATERAL FLATTEN**: Array expansion for `id_ns_array` processing
- **ARRAY_CONSTRUCT**: Building arrays from multiple columns
- **OBJECT_CONSTRUCT**: Creating structured objects for key-value pairs
- **ARRAYS_OVERLAP**: Checking array membership
- **SPLIT_PART**: String splitting for leader key parsing
## Troubleshooting

### Common Issues

**Issue**: YAML validation error
**Solution**: Check the YAML syntax, ensure proper indentation, and verify all required fields.

**Issue**: Table not found
**Solution**: Verify the source database/schema and check the table names in the YAML.

**Issue**: Python script error
**Solution**: Ensure Python 3.7+ is installed and the `pyyaml` dependency is available.

**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in the `canonical_ids` section of the YAML.

**Issue**: VARIANT column errors
**Solution**: Snowflake VARIANT handling is automatic; ensure proper casting in any custom SQL.
## Success Criteria

Generated SQL will:
- ✅ Be valid Snowflake SQL
- ✅ Use native Snowflake functions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built-in
- ✅ Support VARIANT types for flexible data
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Snowflake
Ready to generate Snowflake SQL from your YAML configuration?
Provide your YAML file path and target database/schema to begin!