gh-treasure-data-aps-claude…/commands/hybrid-generate-snowflake.md

---
name: hybrid-generate-snowflake
description: Generate Snowflake SQL from YAML configuration for ID unification
---

# Generate Snowflake SQL from YAML

## Overview

Generate production-ready Snowflake SQL workflow from your `unify.yml` configuration file. This command creates Snowflake-native SQL files with proper clustering, VARIANT support, and platform-specific function conversions.

---

## What You Need

### Required Inputs
1. **YAML Configuration File**: Path to your `unify.yml`
2. **Target Database**: Snowflake database name
3. **Target Schema**: Schema name within the database

### Optional Inputs
4. **Source Database**: Database containing source tables (defaults to target database)
5. **Source Schema**: Schema containing source tables (defaults to PUBLIC)
6. **Output Directory**: Where to save generated SQL (defaults to `snowflake_sql/`)

---

## What I'll Do

### Step 1: Validation
- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections

### Step 2: SQL Generation
I'll call the **snowflake-sql-generator agent** to:
- Execute `yaml_unification_to_snowflake.py` Python script
- Generate Snowflake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs

### Step 3: Output Organization
Generate complete SQL workflow in this structure:
```
snowflake_sql/unify/
├── 01_create_graph.sql              # Initialize graph table
├── 02_extract_merge.sql             # Extract identities with validation
├── 03_source_key_stats.sql          # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql    # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql              # Canonical ID creation with key masks
├── 06_result_key_stats.sql          # Result statistics with histograms
├── 10_enrich_*.sql                  # Enrich each source table
├── 20_master_*.sql                  # Master tables with attribute aggregation
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules lookup
└── 32_column_lookup.sql             # Column mapping lookup
```

### Step 4: Summary Report
Provide:
- Total SQL files generated
- Estimated execution order
- Snowflake optimizations included
- Key features enabled
- Next steps for execution

---

## Command Usage

### Basic Usage
```
/cdp-hybrid-idu:hybrid-generate-snowflake

I'll prompt you for:
- YAML file path
- Target database
- Target schema
```

### Advanced Usage
Provide all parameters upfront:
```
YAML file: /path/to/unify.yml
Target database: my_database
Target schema: my_schema
Source database: source_database (optional)
Source schema: PUBLIC (optional, defaults to PUBLIC)
Output directory: custom_output/ (optional)
```

---

## Generated SQL Features

### Snowflake Optimizations
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **VARIANT Support**: Flexible data structures for arrays and objects
- **Native Functions**: Snowflake-specific optimized functions

### Advanced Capabilities
1. **Dynamic Iteration Count**: Auto-calculates based on:
   - Number of merge keys
   - Number of tables
   - Data complexity (configurable via YAML)

2. **Key-Specific Hashing**: Each key uses unique cryptographic mask:
   ```
   Key Type 1 (email):       0ffdbcf0c666ce190d
   Key Type 2 (customer_id): 61a821f2b646a4e890
   Key Type 3 (phone):       acd2206c3f88b3ee27
   ```

3. **Validation Rules**:
   - `valid_regexp`: REGEXP_LIKE pattern filtering
   - `invalid_texts`: NOT IN clause with proper NULL handling
   - Combined AND logic for strict validation

4. **Master Table Attributes**:
   - Single value: `MAX_BY(attr, order)` with COALESCE
   - Array values: `ARRAY_SLICE(ARRAY_CAT(arrays), 0, N)`
   - Priority-based selection

### Platform-Specific Conversions
The generator automatically converts:
- Presto functions → Snowflake equivalents
- Databricks functions → Snowflake equivalents
- Array operations → ARRAY_CONSTRUCT/FLATTEN syntax
- Window functions → optimized versions
- Time functions → DATE_PART(epoch_second, CURRENT_TIMESTAMP())

---

## Example Workflow

### Input YAML (`unify.yml`)
```yaml
name: customer_unification

keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A']

tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}

canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id]
    merge_iterations: 15

master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1}
```

### Generated Output
```
snowflake_sql/unify/
├── 01_create_graph.sql              # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql             # Merges customer_profiles keys
├── 03_source_key_stats.sql          # Stats by table
├── 04_unify_loop_iteration_01.sql   # First iteration
├── 04_unify_loop_iteration_02.sql   # Second iteration
├── ...                              # Up to iteration_05
├── 05_canonicalize.sql              # Creates unified_id_lookup
├── 06_result_key_stats.sql          # Final statistics
├── 10_enrich_customer_profiles.sql  # Adds unified_id column
├── 20_master_customer_master.sql    # Creates customer_master table
├── 30_unification_metadata.sql      # Metadata
├── 31_filter_lookup.sql             # Validation rules
└── 32_column_lookup.sql             # Column mappings
```

---

## Next Steps After Generation

### Option 1: Execute Immediately
Use the execution command:
```
/cdp-hybrid-idu:hybrid-execute-snowflake
```

### Option 2: Review First
1. Examine generated SQL files
2. Verify table names and transformations
3. Test with sample data
4. Execute manually or via execution command

### Option 3: Customize
1. Modify generated SQL as needed
2. Add custom logic or transformations
3. Execute using Snowflake SQL worksheet or execution command

---

## Technical Details

### Python Script Execution
The agent executes:
```bash
python3 scripts/snowflake/yaml_unification_to_snowflake.py \
    unify.yml \
    -d my_database \
    -s my_schema \
    -sd source_database \
    -ss source_schema \
    -o snowflake_sql
```

### SQL File Naming Convention
- `01-09`: Setup and initialization
- `10-19`: Source table enrichment
- `20-29`: Master table creation
- `30-39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)

### Convergence Detection
Each loop iteration includes:
```sql
-- Check if graph changed
SELECT COUNT(*) FROM (
    SELECT leader_ns, leader_id, follower_ns, follower_id
    FROM iteration_N
    EXCEPT
    SELECT leader_ns, leader_id, follower_ns, follower_id
    FROM iteration_N_minus_1
) diff
```
Stops when count = 0

### Snowflake-Specific Features
- **LATERAL FLATTEN**: Array expansion for id_ns_array processing
- **ARRAY_CONSTRUCT**: Building arrays from multiple columns
- **OBJECT_CONSTRUCT**: Creating structured objects for key-value pairs
- **ARRAYS_OVERLAP**: Checking array membership
- **SPLIT_PART**: String splitting for leader key parsing

---

## Troubleshooting

### Common Issues

**Issue**: YAML validation error
**Solution**: Check YAML syntax, ensure proper indentation, verify all required fields

**Issue**: Table not found error
**Solution**: Verify source database/schema, check table names in YAML

**Issue**: Python script error
**Solution**: Ensure Python 3.7+ installed, check pyyaml dependency

**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in canonical_ids section of YAML

**Issue**: VARIANT column errors
**Solution**: Snowflake VARIANT type handling is automatic, ensure proper casting in custom SQL

---

## Success Criteria

Generated SQL will:
- ✅ Be valid Snowflake SQL
- ✅ Use native Snowflake functions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built-in
- ✅ Support VARIANT types for flexible data
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Snowflake

---

**Ready to generate Snowflake SQL from your YAML configuration?**

Provide your YAML file path and target database/schema to begin!