---
name: hybrid-generate-snowflake
description: Generate Snowflake SQL from YAML configuration for ID unification
---

Generate Snowflake SQL from YAML

Overview

Generate a production-ready Snowflake SQL workflow from your unify.yml configuration file. This command creates Snowflake-native SQL files with proper clustering, VARIANT support, and platform-specific function conversions.


What You Need

Required Inputs

  1. YAML Configuration File: Path to your unify.yml
  2. Target Database: Snowflake database name
  3. Target Schema: Schema name within the database

Optional Inputs

  1. Source Database: Database containing source tables (defaults to target database)
  2. Source Schema: Schema containing source tables (defaults to PUBLIC)
  3. Output Directory: Where to save generated SQL (defaults to snowflake_sql/)

What I'll Do

Step 1: Validation

  • Verify unify.yml exists and is valid
  • Check YAML syntax and structure
  • Validate keys, tables, and configuration sections

Step 2: SQL Generation

I'll call the snowflake-sql-generator agent to:

  • Execute the yaml_unification_to_snowflake.py Python script
  • Generate Snowflake table definitions with clustering (sketched below)
  • Create convergence detection logic
  • Build cryptographic hashing for canonical IDs
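
To make this concrete, here is a hedged sketch of what the generated graph DDL might look like; the actual output comes from the Python script, and the names below simply mirror the examples later in this document:

-- Minimal sketch, assuming the table/column names used elsewhere in this doc
CREATE OR REPLACE TABLE unified_id_graph_unify_loop_0 (
    leader_ns   VARCHAR,   -- key namespace of the leader identity
    leader_id   VARCHAR,   -- leader key value
    follower_ns VARCHAR,   -- key namespace of the follower identity
    follower_id VARCHAR    -- follower key value
)
CLUSTER BY (follower_id);  -- matches the clustering noted under Optimizations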

Step 3: Output Organization

Generate the complete SQL workflow in this structure; a conceptual sketch of one of these files follows the listing:

snowflake_sql/unify/
├── 01_create_graph.sql              # Initialize graph table
├── 02_extract_merge.sql             # Extract identities with validation
├── 03_source_key_stats.sql          # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql    # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql              # Canonical ID creation with key masks
├── 06_result_key_stats.sql          # Result statistics with histograms
├── 10_enrich_*.sql                  # Enrich each source table
├── 20_master_*.sql                  # Master tables with attribute aggregation
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules lookup
└── 32_column_lookup.sql             # Column mapping lookup
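
As a rough illustration (not the generator's exact output), 02_extract_merge.sql conceptually seeds the graph from each configured table, along these lines:

-- Hedged sketch using the example table and keys from the workflow below
INSERT INTO unified_id_graph_unify_loop_0
SELECT 'email'       AS leader_ns,
       email_std     AS leader_id,
       'customer_id' AS follower_ns,
       customer_id   AS follower_id
FROM customer_profiles
WHERE email_std IS NOT NULL
  AND customer_id IS NOT NULL;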

Step 4: Summary Report

Provide:

  • Total SQL files generated
  • Estimated execution order
  • Snowflake optimizations included
  • Key features enabled
  • Next steps for execution

Command Usage

Basic Usage

/cdp-hybrid-idu:hybrid-generate-snowflake

I'll prompt you for:
- YAML file path
- Target database
- Target schema

Advanced Usage

Provide all parameters upfront:

YAML file: /path/to/unify.yml
Target database: my_database
Target schema: my_schema
Source database: source_database (optional)
Source schema: PUBLIC (optional, defaults to PUBLIC)
Output directory: custom_output/ (optional)

Generated SQL Features

Snowflake Optimizations

  • Clustering: CLUSTER BY (follower_id) on graph tables
  • VARIANT Support: Flexible data structures for arrays and objects (illustrated after this list)
  • Native Functions: Snowflake-specific optimized functions
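
For example, a VARIANT column can carry the per-identity array that later stages flatten; the table and column names here are illustrative:

-- Illustrative only: VARIANT holding an array of namespace:id pairs
CREATE OR REPLACE TABLE example_identities (
    canonical_id VARCHAR,
    id_ns_array  VARIANT    -- e.g. ["email:a@b.com", "customer_id:42"]
)
CLUSTER BY (canonical_id);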

Advanced Capabilities

  1. Dynamic Iteration Count: Auto-calculates based on:

    • Number of merge keys
    • Number of tables
    • Data complexity (configurable via YAML)
  2. Key-Specific Hashing: Each key uses a unique cryptographic mask (see the combined sketch after this list):

    Key Type 1 (email):       0ffdbcf0c666ce190d
    Key Type 2 (customer_id): 61a821f2b646a4e890
    Key Type 3 (phone):       acd2206c3f88b3ee27
    
  3. Validation Rules:

    • valid_regexp: REGEXP_LIKE pattern filtering
    • invalid_texts: NOT IN clause with proper NULL handling
    • Combined AND logic for strict validation
  4. Master Table Attributes:

    • Single value: MAX_BY(attr, order) with COALESCE
    • Array values: ARRAY_SLICE(ARRAY_CAT(arrays), 0, N)
    • Priority-based selection
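
A minimal sketch combining items 2 through 4, using the example table from the workflow below; the mask value, the order column updated_at, and the exact hashing scheme are assumptions here, since the real logic is defined by the generator:

-- Key-masked hashing plus validation filtering (illustrative)
SELECT
    SHA2('0ffdbcf0c666ce190d' || ':' || email_std, 256) AS email_hash
FROM customer_profiles
WHERE email_std IS NOT NULL                      -- NOT IN never matches NULL
  AND REGEXP_LIKE(email_std, '.*@.*')            -- valid_regexp
  AND email_std NOT IN ('', 'N/A', 'null');      -- invalid_texts

-- Master-table attribute aggregation (illustrative)
SELECT
    unified_id,
    COALESCE(MAX_BY(email_std, updated_at), '') AS best_email  -- hypothetical order column
FROM enriched_customer_profiles
GROUP BY unified_id;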

Platform-Specific Conversions

The generator automatically converts:

  • Presto functions → Snowflake equivalents
  • Databricks functions → Snowflake equivalents
  • Array operations → ARRAY_CONSTRUCT/FLATTEN syntax
  • Window functions → optimized versions
  • Time functions → DATE_PART(epoch_second, CURRENT_TIMESTAMP()) (see the example below)
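
One concrete instance of these conversions, with the Presto form shown for comparison:

-- Presto / Treasure Data:  SELECT to_unixtime(now())
-- Snowflake equivalent:
SELECT DATE_PART(epoch_second, CURRENT_TIMESTAMP()) AS unix_time;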

Example Workflow

Input YAML (unify.yml)

name: customer_unification

keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A']

tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}

canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id]
    merge_iterations: 15

master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1}

Generated Output

snowflake_sql/unify/
├── 01_create_graph.sql              # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql             # Merges customer_profiles keys
├── 03_source_key_stats.sql          # Stats by table
├── 04_unify_loop_iteration_01.sql   # First iteration
├── 04_unify_loop_iteration_02.sql   # Second iteration
├── ...                              # Up to iteration_15 (merge_iterations: 15)
├── 05_canonicalize.sql              # Creates unified_id_lookup
├── 06_result_key_stats.sql          # Final statistics
├── 10_enrich_customer_profiles.sql  # Adds unified_id column
├── 20_master_customer_master.sql    # Creates customer_master table
├── 30_unification_metadata.sql      # Metadata
├── 31_filter_lookup.sql             # Validation rules
└── 32_column_lookup.sql             # Column mappings
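
To make the enrichment step concrete, 10_enrich_customer_profiles.sql plausibly resembles the following hedged sketch; the lookup table's column names (id_ns, id) are assumptions:

CREATE OR REPLACE TABLE enriched_customer_profiles AS
SELECT p.*,
       l.unified_id
FROM customer_profiles p
LEFT JOIN unified_id_lookup l
  ON l.id_ns = 'email'
 AND l.id    = p.email_std;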

Next Steps After Generation

Option 1: Execute Immediately

Use the execution command:

/cdp-hybrid-idu:hybrid-execute-snowflake

Option 2: Review First

  1. Examine generated SQL files
  2. Verify table names and transformations
  3. Test with sample data
  4. Execute manually or via execution command

Option 3: Customize

  1. Modify generated SQL as needed
  2. Add custom logic or transformations
  3. Execute using Snowflake SQL worksheet or execution command

Technical Details

Python Script Execution

The agent executes:

python3 scripts/snowflake/yaml_unification_to_snowflake.py \
    unify.yml \
    -d my_database \
    -s my_schema \
    -sd source_database \
    -ss source_schema \
    -o snowflake_sql

SQL File Naming Convention

  • 01-09: Setup and initialization
  • 10-19: Source table enrichment
  • 20-29: Master table creation
  • 30-39: Metadata and lookup tables
  • 04_*_NN: Loop iterations (auto-numbered)

Convergence Detection

Each loop iteration includes:

-- Check if graph changed
SELECT COUNT(*) FROM (
    SELECT leader_ns, leader_id, follower_ns, follower_id
    FROM iteration_N
    EXCEPT
    SELECT leader_ns, leader_id, follower_ns, follower_id
    FROM iteration_N_minus_1
) diff

The loop stops as soon as this count reaches 0, meaning no edges changed between consecutive iterations; a sketch of wiring the check into Snowflake Scripting follows.
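
One way to act on that count is a Snowflake Scripting block like the one below; this is a plumbing assumption, not the generator's exact wrapper, and iteration_1/iteration_2 stand in for two consecutive iteration tables:

EXECUTE IMMEDIATE $$
DECLARE
    diff_count INTEGER;
BEGIN
    -- Count edges that changed between the last two iterations
    SELECT COUNT(*) INTO :diff_count
    FROM (
        SELECT leader_ns, leader_id, follower_ns, follower_id FROM iteration_2
        EXCEPT
        SELECT leader_ns, leader_id, follower_ns, follower_id FROM iteration_1
    );
    IF (diff_count = 0) THEN
        RETURN 'converged';
    END IF;
    RETURN 'continue';
END;
$$;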

Snowflake-Specific Features

  • LATERAL FLATTEN: Array expansion for id_ns_array processing
  • ARRAY_CONSTRUCT: Building arrays from multiple columns
  • OBJECT_CONSTRUCT: Creating structured objects for key-value pairs
  • ARRAYS_OVERLAP: Checking array membership
  • SPLIT_PART: String splitting for leader key parsing (all five functions appear in the sketch below)
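
A self-contained query exercising all five functions; the inline data is purely illustrative:

WITH ids AS (
    SELECT ARRAY_CONSTRUCT('email:a@b.com', 'customer_id:42') AS id_ns_array
)
SELECT
    SPLIT_PART(f.value::VARCHAR, ':', 1) AS key_ns,                -- 'email', 'customer_id'
    OBJECT_CONSTRUCT(
        'ns', SPLIT_PART(f.value::VARCHAR, ':', 1),
        'id', SPLIT_PART(f.value::VARCHAR, ':', 2)) AS id_pair,
    ARRAYS_OVERLAP(id_ns_array,
                   ARRAY_CONSTRUCT('email:a@b.com')) AS has_email  -- TRUE
FROM ids,
     LATERAL FLATTEN(input => id_ns_array) f;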

Troubleshooting

Common Issues

Issue: YAML validation error
Solution: Check YAML syntax, ensure proper indentation, and verify all required fields.

Issue: Table not found error
Solution: Verify the source database/schema and check the table names in the YAML.

Issue: Python script error
Solution: Ensure Python 3.7+ is installed and the pyyaml dependency is available.

Issue: Too many/few iterations
Solution: Adjust merge_iterations in the canonical_ids section of the YAML.

Issue: VARIANT column errors
Solution: VARIANT handling is automatic in generated SQL; ensure proper casting in any custom SQL you add.


Success Criteria

Generated SQL will:

  • Be valid Snowflake SQL
  • Use native Snowflake functions
  • Include proper clustering for performance
  • Have convergence detection built-in
  • Support VARIANT types for flexible data
  • Generate comprehensive statistics
  • Work without modification on Snowflake

Ready to generate Snowflake SQL from your YAML configuration?

Provide your YAML file path and target database/schema to begin!