Initial commit

Zhongwei Li
2025-11-30 09:02:39 +08:00
commit 515e7bf6be
18 changed files with 5770 additions and 0 deletions


@@ -0,0 +1,387 @@
---
name: hybrid-execute-databricks
description: Execute Databricks ID unification workflow with convergence detection and monitoring
---
# Execute Databricks ID Unification Workflow
## Overview
Execute your generated Databricks SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling. This command orchestrates the complete unification process from graph creation to master table generation.
---
## What You Need
### Required Inputs
1. **SQL Directory**: Path to generated SQL files (e.g., `databricks_sql/unify/`)
2. **Server Hostname**: Your Databricks workspace URL (e.g., `your-workspace.cloud.databricks.com`)
3. **HTTP Path**: SQL Warehouse or cluster path (e.g., `/sql/1.0/warehouses/abc123`)
4. **Catalog**: Target catalog name
5. **Schema**: Target schema name
### Authentication
**Option 1: Personal Access Token (PAT)**
- Access token from Databricks workspace
- Can be provided as argument or via environment variable `DATABRICKS_TOKEN`
**Option 2: OAuth**
- Browser-based authentication
- No token required, will open browser for login
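For orientation, a minimal connectivity check with the `databricks-sql-connector` Python package might look like the sketch below (placeholder values; the executor script performs this setup itself):
```python
# Minimal connectivity check (sketch). Hostname, path, and token are placeholders;
# the executor script handles connection setup for the real workflow.
import os
from databricks import sql

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123",
    access_token=os.environ["DATABRICKS_TOKEN"],  # PAT auth; OAuth uses the connector's browser-based options instead
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_catalog(), current_schema()")
        print(cursor.fetchone())
```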
---
## What I'll Do
### Step 1: Connection Setup
- Connect to your Databricks workspace
- Validate credentials and permissions
- Set catalog and schema context
- Verify SQL directory exists
### Step 2: Execution Plan
Display execution plan with:
- All SQL files in execution order
- File types (Setup, Loop Iteration, Enrichment, Master Table, etc.)
- Estimated steps and dependencies
### Step 3: SQL Execution
I'll call the **databricks-workflow-executor agent** to:
- Execute SQL files in proper sequence
- Skip loop iteration files (handled separately)
- Monitor progress with real-time feedback
- Track row counts and execution times
### Step 4: Unify Loop with Convergence Detection
**Intelligent Loop Execution**:
```
Iteration 1:
✓ Execute unify SQL
• Check convergence: 1500 records updated
• Optimize Delta table
→ Continue to iteration 2
Iteration 2:
✓ Execute unify SQL
• Check convergence: 450 records updated
• Optimize Delta table
→ Continue to iteration 3
Iteration 3:
✓ Execute unify SQL
• Check convergence: 0 records updated
✓ CONVERGED! Stop loop
```
**Features**:
- Runs until convergence (updated_count = 0)
- Maximum 30 iterations safety limit
- Auto-optimization after each iteration
- Creates alias table (loop_final) for downstream processing
### Step 5: Post-Loop Processing
- Execute canonicalization step
- Generate result statistics
- Enrich source tables with canonical IDs
- Create master tables
- Generate metadata and lookup tables
### Step 6: Final Report
Provide:
- Total execution time
- Files processed successfully
- Convergence statistics
- Final table row counts
- Next steps and recommendations
---
## Command Usage
### Interactive Mode (Recommended)
```
/cdp-hybrid-idu:hybrid-execute-databricks
I'll prompt you for:
- SQL directory path
- Databricks server hostname
- HTTP path
- Catalog and schema
- Authentication method
```
### Advanced Mode
Provide all parameters upfront:
```
SQL directory: databricks_sql/unify/
Server hostname: your-workspace.cloud.databricks.com
HTTP path: /sql/1.0/warehouses/abc123
Catalog: my_catalog
Schema: my_schema
Auth type: pat (or oauth)
Access token: dapi... (if using PAT)
```
---
## Execution Features
### 1. Convergence Detection
**Algorithm**:
```sql
SELECT COUNT(*) as updated_count FROM (
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM current_iteration
EXCEPT
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM previous_iteration
) diff
```
**Stops when**: updated_count = 0
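Conceptually, the loop control around this check looks roughly like the following sketch (simplified Python, not the actual executor; `execute_iteration_sql` is a hypothetical helper and the table names are illustrative):
```python
# Simplified sketch of loop control with convergence detection.
MAX_ITERATIONS = 30  # safety limit

def run_unify_loop(cursor, catalog, schema):
    for i in range(1, MAX_ITERATIONS + 1):
        current = f"{catalog}.{schema}.unified_id_graph_unify_loop_{i}"
        previous = f"{catalog}.{schema}.unified_id_graph_unify_loop_{i - 1}"
        execute_iteration_sql(cursor, i)  # hypothetical helper: runs the generated 04_* SQL for this iteration
        cursor.execute(f"""
            SELECT COUNT(*) FROM (
                SELECT leader_ns, leader_id, follower_ns, follower_id FROM {current}
                EXCEPT
                SELECT leader_ns, leader_id, follower_ns, follower_id FROM {previous}
            ) diff
        """)
        updated_count = cursor.fetchone()[0]
        cursor.execute(f"OPTIMIZE {current}")  # Delta optimization after each iteration
        if updated_count == 0:
            return i  # converged
    return MAX_ITERATIONS
```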
### 2. Delta Table Optimization
After major operations:
```sql
OPTIMIZE table_name
```
Benefits:
- Compacts small files
- Improves query performance
- Reduces storage costs
- Optimizes clustering
### 3. Interactive Error Handling
If an error occurs:
```
✗ File: 04_unify_loop_iteration_01.sql
Error: Table not found: source_table
Continue with remaining files? (y/n):
```
You can choose to:
- Continue: Skip failed file, continue with rest
- Stop: Halt execution for investigation
### 4. Real-Time Monitoring
Track progress with:
- ✓ Completed steps (green)
- • Progress indicators (cyan)
- ✗ Failed steps (red)
- ⚠ Warnings (yellow)
- Row counts and execution times
### 5. Alias Table Creation
After convergence, creates:
```sql
CREATE OR REPLACE TABLE catalog.schema.unified_id_graph_unify_loop_final
AS SELECT * FROM catalog.schema.unified_id_graph_unify_loop_3
```
This allows downstream SQL to reference `loop_final` regardless of actual iteration count.
---
## Technical Details
### Python Script Execution
The agent executes:
```bash
python3 scripts/databricks/databricks_sql_executor.py \
databricks_sql/unify/ \
--server-hostname your-workspace.databricks.com \
--http-path /sql/1.0/warehouses/abc123 \
--catalog my_catalog \
--schema my_schema \
--auth-type pat \
--optimize-tables
```
### Execution Order
1. **Setup Phase** (01-03):
- Create graph table (loop_0)
- Extract and merge identities
- Generate source statistics
2. **Unification Loop** (04):
- Run iterations until convergence
- Check after EVERY iteration
- Stop when updated_count = 0
- Create loop_final alias
3. **Canonicalization** (05):
- Create canonical ID lookup
- Create keys and tables metadata
- Rename final graph table
4. **Statistics** (06):
- Generate result key statistics
- Create histograms
- Calculate coverage metrics
5. **Enrichment** (10-19):
- Add canonical IDs to source tables
- Create enriched_* tables
6. **Master Tables** (20-29):
- Aggregate attributes
- Apply priority rules
- Create unified customer profiles
7. **Metadata** (30-39):
- Unification metadata
- Filter lookup tables
- Column lookup tables
### Connection Management
- Establishes single connection for entire workflow
- Uses connection pooling for efficiency
- Automatic reconnection on timeout
- Proper cleanup on completion or error
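A compact sketch of that single-connection pattern (illustrative only; `run_sql_file` is a hypothetical helper):
```python
# Sketch of single-connection execution with guaranteed cleanup.
from databricks import sql

def execute_workflow(sql_files, **conn_args):
    connection = sql.connect(**conn_args)
    try:
        with connection.cursor() as cursor:
            for path in sql_files:
                run_sql_file(cursor, path)  # hypothetical helper: read the file and execute its statements
    finally:
        connection.close()  # always release the connection, even on error
```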
---
## Example Execution
### Input
```
SQL directory: databricks_sql/unify/
Server hostname: dbc-12345-abc.cloud.databricks.com
HTTP path: /sql/1.0/warehouses/6789abcd
Catalog: customer_data
Schema: id_unification
Auth type: pat
```
### Output
```
✓ Connected to Databricks: dbc-12345-abc.cloud.databricks.com
• Using catalog: customer_data, schema: id_unification
Starting Databricks SQL Execution
• Catalog: customer_data
• Schema: id_unification
• Delta tables: ✓ enabled
Executing: 01_create_graph.sql
✓ 01_create_graph.sql: Executed successfully
Executing: 02_extract_merge.sql
✓ 02_extract_merge.sql: Executed successfully
• Rows affected: 125000
Executing: 03_source_key_stats.sql
✓ 03_source_key_stats.sql: Executed successfully
Executing Unify Loop Before Canonicalization
--- Iteration 1 ---
✓ Iteration 1 completed
• Rows processed: 125000
• Updated records: 1500
• Optimizing Delta table
--- Iteration 2 ---
✓ Iteration 2 completed
• Rows processed: 125000
• Updated records: 450
• Optimizing Delta table
--- Iteration 3 ---
✓ Iteration 3 completed
• Rows processed: 125000
• Updated records: 0
✓ Loop converged after 3 iterations
• Creating alias table for final iteration
✓ Alias table 'unified_id_graph_unify_loop_final' created
Executing: 05_canonicalize.sql
✓ 05_canonicalize.sql: Executed successfully
[... continues with enrichment, master tables, metadata ...]
Execution Complete
• Files processed: 18/18
• Final unified_id_lookup rows: 98,500
• Disconnected from Databricks
```
---
## Monitoring and Troubleshooting
### Check Execution Progress
During execution, you can monitor:
- Databricks SQL Warehouse query history
- Delta table sizes and row counts
- Execution logs in Databricks workspace
### Common Issues
**Issue**: Connection timeout
**Solution**: Check network access, verify credentials, ensure SQL Warehouse is running
**Issue**: Table not found
**Solution**: Verify catalog/schema permissions, check source table names in YAML
**Issue**: Loop doesn't converge
**Solution**: Check data quality, increase max_iterations, review key validation rules
**Issue**: Out of memory
**Solution**: Increase SQL Warehouse size, optimize clustering, reduce batch sizes
**Issue**: Permission denied
**Solution**: Verify catalog/schema permissions, check Unity Catalog access controls
### Performance Optimization
- Use larger SQL Warehouse for faster execution
- Enable auto-scaling for variable workloads
- Optimize Delta tables regularly
- Use clustering on frequently joined columns
---
## Post-Execution Validation
**DO NOT RUN THESE VALIDATIONS. JUST PRESENT THEM TO THE USER TO RUN ON DATABRICKS.**
### Check Coverage
```sql
SELECT
COUNT(*) as total_records,
COUNT(unified_id) as records_with_id,
COUNT(unified_id) * 100.0 / COUNT(*) as coverage_percent
FROM catalog.schema.enriched_customer_profiles;
```
### Verify Master Table
```sql
SELECT COUNT(*) as unified_customers
FROM catalog.schema.customer_master;
```
### Review Statistics
```sql
SELECT * FROM catalog.schema.unified_id_result_key_stats
WHERE from_table = '*';
```
---
## Success Criteria
Execution successful when:
- ✅ All SQL files processed without critical errors
- ✅ Unification loop converged (updated_count = 0)
- ✅ Canonical IDs generated for all eligible records
- ✅ Enriched tables created successfully
- ✅ Master tables populated with attributes
- ✅ Coverage metrics meet expectations
---
**Ready to execute your Databricks ID unification workflow?**
Provide your SQL directory path and Databricks connection details to begin!


@@ -0,0 +1,401 @@
---
name: hybrid-execute-snowflake
description: Execute Snowflake ID unification workflow with convergence detection and monitoring
---
# Execute Snowflake ID Unification Workflow
## Overview
Execute your generated Snowflake SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling. This command orchestrates the complete unification process from graph creation to master table generation.
---
## What You Need
### Required Inputs
1. **SQL Directory**: Path to generated SQL files (e.g., `snowflake_sql/unify/`)
2. **Account**: Snowflake account name (e.g., `myaccount` from `myaccount.snowflakecomputing.com`)
3. **User**: Snowflake username
4. **Database**: Target database name
5. **Schema**: Target schema name
6. **Warehouse**: Compute warehouse name (defaults to `COMPUTE_WH`)
### Authentication
**Option 1: Password**
- Can be provided as an argument, via the environment variable `SNOWFLAKE_PASSWORD`, or in an environment file (`.env`)
- Will prompt if not provided
**Option 2: SSO (externalbrowser)**
- Opens browser for authentication
- No password required
**Option 3: Key-Pair**
- Private key path via `SNOWFLAKE_PRIVATE_KEY_PATH`
- Passphrase via `SNOWFLAKE_PRIVATE_KEY_PASSPHRASE`
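For orientation, a minimal connectivity check with the `snowflake-connector-python` package might look like this sketch (placeholder values; the executor script performs this setup itself):
```python
# Minimal connectivity check (sketch). Account, user, and context values are placeholders;
# the executor script handles connection setup for the real workflow.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount",
    user="myuser",
    password=os.environ.get("SNOWFLAKE_PASSWORD"),  # omit and pass authenticator="externalbrowser" for SSO
    warehouse="COMPUTE_WH",
    database="my_database",
    schema="my_schema",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT CURRENT_DATABASE(), CURRENT_SCHEMA(), CURRENT_WAREHOUSE()")
    print(cur.fetchone())
finally:
    conn.close()
```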
---
## What I'll Do
### Step 1: Connection Setup
- Connect to your Snowflake account
- Validate credentials and permissions
- Set database and schema context
- Verify SQL directory exists
- Activate warehouse
### Step 2: Execution Plan
Display execution plan with:
- All SQL files in execution order
- File types (Setup, Loop Iteration, Enrichment, Master Table, etc.)
- Estimated steps and dependencies
### Step 3: SQL Execution
I'll call the **snowflake-workflow-executor agent** to:
- Execute SQL files in proper sequence
- Skip loop iteration files (handled separately)
- Monitor progress with real-time feedback
- Track row counts and execution times
### Step 4: Unify Loop with Convergence Detection
**Intelligent Loop Execution**:
```
Iteration 1:
✓ Execute unify SQL
• Check convergence: 1500 records updated
→ Continue to iteration 2
Iteration 2:
✓ Execute unify SQL
• Check convergence: 450 records updated
→ Continue to iteration 3
Iteration 3:
✓ Execute unify SQL
• Check convergence: 0 records updated
✓ CONVERGED! Stop loop
```
**Features**:
- Runs until convergence (updated_count = 0)
- Maximum 30 iterations safety limit
- Creates alias table (loop_final) for downstream processing
### Step 5: Post-Loop Processing
- Execute canonicalization step
- Generate result statistics
- Enrich source tables with canonical IDs
- Create master tables
- Generate metadata and lookup tables
### Step 6: Final Report
Provide:
- Total execution time
- Files processed successfully
- Convergence statistics
- Final table row counts
- Next steps and recommendations
---
## Command Usage
### Interactive Mode (Recommended)
```
/cdp-hybrid-idu:hybrid-execute-snowflake
I'll prompt you for:
- SQL directory path
- Snowflake account name
- Username
- Database and schema
- Warehouse name
- Authentication method
```
### Advanced Mode
Provide all parameters upfront:
```
SQL directory: snowflake_sql/unify/
Account: myaccount
User: myuser
Database: my_database
Schema: my_schema
Warehouse: COMPUTE_WH
Password: (will prompt if not in environment)
```
---
## Execution Features
### 1. Convergence Detection
**Algorithm**:
```sql
SELECT COUNT(*) as updated_count FROM (
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM current_iteration
EXCEPT
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM previous_iteration
) diff
```
**Stops when**: updated_count = 0
### 2. Interactive Error Handling
If an error occurs:
```
✗ File: 04_unify_loop_iteration_01.sql
Error: Table not found: source_table
Continue with remaining files? (y/n):
```
You can choose to:
- Continue: Skip failed file, continue with rest
- Stop: Halt execution for investigation
### 3. Real-Time Monitoring
Track progress with:
- ✓ Completed steps (green)
- • Progress indicators (cyan)
- ✗ Failed steps (red)
- ⚠ Warnings (yellow)
- Row counts and execution times
### 4. Alias Table Creation
After convergence, creates:
```sql
CREATE OR REPLACE TABLE database.schema.unified_id_graph_unify_loop_final
AS SELECT * FROM database.schema.unified_id_graph_unify_loop_3
```
This allows downstream SQL to reference `loop_final` regardless of actual iteration count.
---
## Technical Details
### Python Script Execution
The agent executes:
```bash
python3 scripts/snowflake/snowflake_sql_executor.py \
snowflake_sql/unify/ \
--account myaccount \
--user myuser \
--database my_database \
--schema my_schema \
--warehouse COMPUTE_WH
```
### Execution Order
1. **Setup Phase** (01-03):
- Create graph table (loop_0)
- Extract and merge identities
- Generate source statistics
2. **Unification Loop** (04):
- Run iterations until convergence
- Check after EVERY iteration
- Stop when updated_count = 0
- Create loop_final alias
3. **Canonicalization** (05):
- Create canonical ID lookup
- Create keys and tables metadata
- Rename final graph table
4. **Statistics** (06):
- Generate result key statistics
- Create histograms
- Calculate coverage metrics
5. **Enrichment** (10-19):
- Add canonical IDs to source tables
- Create enriched_* tables
6. **Master Tables** (20-29):
- Aggregate attributes
- Apply priority rules
- Create unified customer profiles
7. **Metadata** (30-39):
- Unification metadata
- Filter lookup tables
- Column lookup tables
### Connection Management
- Establishes single connection for entire workflow
- Uses connection pooling for efficiency
- Automatic reconnection on timeout
- Proper cleanup on completion or error
---
## Example Execution
### Input
```
SQL directory: snowflake_sql/unify/
Account: myorg-myaccount
User: analytics_user
Database: customer_data
Schema: id_unification
Warehouse: LARGE_WH
```
### Output
```
✓ Connected to Snowflake: myorg-myaccount
• Using database: customer_data, schema: id_unification
Starting Snowflake SQL Execution
• Database: customer_data
• Schema: id_unification
Executing: 01_create_graph.sql
✓ 01_create_graph.sql: Executed successfully
Executing: 02_extract_merge.sql
✓ 02_extract_merge.sql: Executed successfully
• Rows affected: 125000
Executing: 03_source_key_stats.sql
✓ 03_source_key_stats.sql: Executed successfully
Executing Unify Loop Before Canonicalization
--- Iteration 1 ---
✓ Iteration 1 completed
• Rows processed: 125000
• Updated records: 1500
--- Iteration 2 ---
✓ Iteration 2 completed
• Rows processed: 125000
• Updated records: 450
--- Iteration 3 ---
✓ Iteration 3 completed
• Rows processed: 125000
• Updated records: 0
✓ Loop converged after 3 iterations
• Creating alias table for final iteration
✓ Alias table 'unified_id_graph_unify_loop_final' created
Executing: 05_canonicalize.sql
✓ 05_canonicalize.sql: Executed successfully
[... continues with enrichment, master tables, metadata ...]
Execution Complete
• Files processed: 18/18
• Final unified_id_lookup rows: 98,500
• Disconnected from Snowflake
```
---
## Monitoring and Troubleshooting
### Check Execution Progress
During execution, you can monitor:
- Snowflake query history
- Table sizes and row counts
- Warehouse utilization
- Execution logs
### Common Issues
**Issue**: Connection timeout
**Solution**: Check network access, verify credentials, ensure warehouse is running
**Issue**: Table not found
**Solution**: Verify database/schema permissions, check source table names in YAML
**Issue**: Loop doesn't converge
**Solution**: Check data quality, increase max_iterations, review key validation rules
**Issue**: Warehouse suspended
**Solution**: Ensure auto-resume is enabled, manually resume warehouse if needed
**Issue**: Permission denied
**Solution**: Verify database/schema permissions, check role assignments
### Performance Optimization
- Use larger warehouse for faster execution (L, XL, 2XL, etc.)
- Enable multi-cluster warehouse for concurrency
- Use clustering keys on frequently joined columns
- Monitor query profiles for optimization opportunities
---
## Post-Execution Validation
**DO NOT RUN THESE VALIDATIONS. JUST PRESENT THEM TO THE USER TO RUN ON SNOWFLAKE.**
### Check Coverage
```sql
SELECT
COUNT(*) as total_records,
COUNT(unified_id) as records_with_id,
COUNT(unified_id) * 100.0 / COUNT(*) as coverage_percent
FROM database.schema.enriched_customer_profiles;
```
### Verify Master Table
```sql
SELECT COUNT(*) as unified_customers
FROM database.schema.customer_master;
```
### Review Statistics
```sql
SELECT * FROM database.schema.unified_id_result_key_stats
WHERE from_table = '*';
```
---
## Success Criteria
Execution successful when:
- ✅ All SQL files processed without critical errors
- ✅ Unification loop converged (updated_count = 0)
- ✅ Canonical IDs generated for all eligible records
- ✅ Enriched tables created successfully
- ✅ Master tables populated with attributes
- ✅ Coverage metrics meet expectations
---
## Authentication Examples
### Using Password
```bash
export SNOWFLAKE_PASSWORD='your_password'
/cdp-hybrid-idu:hybrid-execute-snowflake
```
### Using SSO
```bash
/cdp-hybrid-idu:hybrid-execute-snowflake
# Will prompt: Use SSO authentication? (y/n): y
# Opens browser for authentication
```
### Using Key-Pair
```bash
export SNOWFLAKE_PRIVATE_KEY_PATH='/path/to/key.p8'
export SNOWFLAKE_PRIVATE_KEY_PASSPHRASE='passphrase'
/cdp-hybrid-idu:hybrid-execute-snowflake
```
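Behind the scenes, key-pair authentication usually means loading the private key and handing it to the connector. A hedged sketch of that pattern, assuming the standard `cryptography` plus `snowflake-connector-python` approach (account and user values are placeholders):
```python
# Sketch: load an encrypted PKCS#8 private key and connect with key-pair auth.
import os
import snowflake.connector
from cryptography.hazmat.primitives import serialization

with open(os.environ["SNOWFLAKE_PRIVATE_KEY_PATH"], "rb") as f:
    private_key = serialization.load_pem_private_key(
        f.read(),
        password=os.environ.get("SNOWFLAKE_PRIVATE_KEY_PASSPHRASE", "").encode() or None,
    )

conn = snowflake.connector.connect(
    account="myaccount",  # placeholder
    user="myuser",        # placeholder
    private_key=private_key.private_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),
    ),
)
```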
---
**Ready to execute your Snowflake ID unification workflow?**
Provide your SQL directory path and Snowflake connection details to begin!


@@ -0,0 +1,285 @@
---
name: hybrid-generate-databricks
description: Generate Databricks Delta Lake SQL from YAML configuration for ID unification
---
# Generate Databricks SQL from YAML
## Overview
Generate production-ready Databricks SQL workflow from your `unify.yml` configuration file. This command creates Delta Lake optimized SQL files with ACID transactions, clustering, and platform-specific function conversions.
---
## What You Need
### Required Inputs
1. **YAML Configuration File**: Path to your `unify.yml`
2. **Target Catalog**: Databricks Unity Catalog name
3. **Target Schema**: Schema name within the catalog
### Optional Inputs
4. **Source Catalog**: Catalog containing source tables (defaults to target catalog)
5. **Source Schema**: Schema containing source tables (defaults to target schema)
6. **Output Directory**: Where to save generated SQL (defaults to `databricks_sql/`)
---
## What I'll Do
### Step 1: Validation
- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections
### Step 2: SQL Generation
I'll call the **databricks-sql-generator agent** to:
- Execute `yaml_unification_to_databricks.py` Python script
- Apply Databricks-specific SQL conversions:
- `ARRAY_SIZE` → `SIZE`
- `ARRAY_CONSTRUCT` → `ARRAY`
- `OBJECT_CONSTRUCT` → `STRUCT`
- `COLLECT_LIST` for aggregations
- `FLATTEN` for array operations
- `UNIX_TIMESTAMP()` for time functions
- Generate Delta Lake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs
### Step 3: Output Organization
Generate complete SQL workflow in this structure:
```
databricks_sql/unify/
├── 01_create_graph.sql # Initialize graph with USING DELTA
├── 02_extract_merge.sql # Extract identities with validation
├── 03_source_key_stats.sql # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql # Canonical ID creation with key masks
├── 06_result_key_stats.sql # Result statistics with histograms
├── 10_enrich_*.sql # Enrich each source table
├── 20_master_*.sql # Master tables with attribute aggregation
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules lookup
└── 32_column_lookup.sql # Column mapping lookup
```
### Step 4: Summary Report
Provide:
- Total SQL files generated
- Estimated execution order
- Delta Lake optimizations included
- Key features enabled
- Next steps for execution
---
## Command Usage
### Basic Usage
```
/cdp-hybrid-idu:hybrid-generate-databricks
I'll prompt you for:
- YAML file path
- Target catalog
- Target schema
```
### Advanced Usage
Provide all parameters upfront:
```
YAML file: /path/to/unify.yml
Target catalog: my_catalog
Target schema: my_schema
Source catalog: source_catalog (optional)
Source schema: source_schema (optional)
Output directory: custom_output/ (optional)
```
---
## Generated SQL Features
### Delta Lake Optimizations
- **ACID Transactions**: `USING DELTA` for all tables
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **Table Properties**: Optimized for large-scale joins
### Advanced Capabilities
1. **Dynamic Iteration Count**: Auto-calculates based on:
- Number of merge keys
- Number of tables
- Data complexity (configurable via YAML)
2. **Key-Specific Hashing**: Each key uses unique cryptographic mask:
```
Key Type 1 (email): 0ffdbcf0c666ce190d
Key Type 2 (customer_id): 61a821f2b646a4e890
Key Type 3 (phone): acd2206c3f88b3ee27
```
3. **Validation Rules**:
- `valid_regexp`: Regex pattern filtering
- `invalid_texts`: NOT IN clause with NULL handling
- Combined AND logic for strict validation (see the sketch after this list)
4. **Master Table Attributes**:
- Single value: `MAX_BY(attr, order)` with COALESCE
- Array values: `SLICE(CONCAT(arrays), 1, N)`
- Priority-based selection
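As referenced in the validation rules above, here is a hedged sketch of how `valid_regexp` and `invalid_texts` might be assembled into a Databricks SQL filter (the real generator's output may differ in detail):
```python
# Sketch: turn a key's validation rules into a Databricks WHERE clause fragment.
def validation_clause(column, valid_regexp=None, invalid_texts=None):
    conditions = [f"{column} IS NOT NULL"]
    if valid_regexp:
        conditions.append(f"{column} RLIKE '{valid_regexp}'")
    if invalid_texts:
        quoted = ", ".join(f"'{t}'" for t in invalid_texts)
        conditions.append(f"{column} NOT IN ({quoted})")
    return " AND ".join(conditions)

# e.g. validation_clause("email_std", ".*@.*", ["", "N/A", "null"])
# -> "email_std IS NOT NULL AND email_std RLIKE '.*@.*' AND email_std NOT IN ('', 'N/A', 'null')"
```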
### Platform-Specific Conversions
The generator automatically converts:
- Presto functions → Databricks equivalents
- Snowflake functions → Databricks equivalents
- Array operations → Spark SQL syntax
- Window functions → optimized versions
- Time functions → UNIX_TIMESTAMP()
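As a mental model, the conversion pass can be pictured as name substitution over the SQL text (a minimal sketch under that assumption; the actual script also handles argument and syntax differences, not just names):
```python
# Minimal sketch: rewrite source-dialect function names into Databricks equivalents.
import re

CONVERSIONS = {
    "ARRAY_SIZE": "SIZE",
    "ARRAY_CONSTRUCT": "ARRAY",
    "OBJECT_CONSTRUCT": "STRUCT",
}

def to_databricks(sql_text: str) -> str:
    for source_fn, target_fn in CONVERSIONS.items():
        # Replace whole function names only, case-insensitively.
        sql_text = re.sub(rf"\b{source_fn}\b", target_fn, sql_text, flags=re.IGNORECASE)
    return sql_text
```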
---
## Example Workflow
### Input YAML (`unify.yml`)
```yaml
name: customer_unification
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A']
tables:
- table: customer_profiles
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
canonical_ids:
- name: unified_id
merge_by_keys: [email, customer_id]
merge_iterations: 15
master_tables:
- name: customer_master
canonical_id: unified_id
attributes:
- name: best_email
source_columns:
- {table: customer_profiles, column: email_std, priority: 1}
```
### Generated Output
```
databricks_sql/unify/
├── 01_create_graph.sql # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql # Merges customer_profiles keys
├── 03_source_key_stats.sql # Stats by table
├── 04_unify_loop_iteration_01.sql # First iteration
├── 04_unify_loop_iteration_02.sql # Second iteration
├── ... # Up to iteration_05
├── 05_canonicalize.sql # Creates unified_id_lookup
├── 06_result_key_stats.sql # Final statistics
├── 10_enrich_customer_profiles.sql # Adds unified_id column
├── 20_master_customer_master.sql # Creates customer_master table
├── 30_unification_metadata.sql # Metadata
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
---
## Next Steps After Generation
### Option 1: Execute Immediately
Use the execution command:
```
/cdp-hybrid-idu:hybrid-execute-databricks
```
### Option 2: Review First
1. Examine generated SQL files
2. Verify table names and transformations
3. Test with sample data
4. Execute manually or via execution command
### Option 3: Customize
1. Modify generated SQL as needed
2. Add custom logic or transformations
3. Execute using Databricks SQL editor or execution command
---
## Technical Details
### Python Script Execution
The agent executes:
```bash
python3 scripts/databricks/yaml_unification_to_databricks.py \
unify.yml \
-tc my_catalog \
-ts my_schema \
-sc source_catalog \
-ss source_schema \
-o databricks_sql
```
### SQL File Naming Convention
- `01-09`: Setup and initialization
- `10-19`: Source table enrichment
- `20-29`: Master table creation
- `30-39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)
### Convergence Detection
Each loop iteration includes:
```sql
-- Check if graph changed
SELECT COUNT(*) FROM (
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM iteration_N
EXCEPT
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM iteration_N_minus_1
) diff
```
Stops when count = 0
---
## Troubleshooting
### Common Issues
**Issue**: YAML validation error
**Solution**: Check YAML syntax, ensure proper indentation, verify all required fields
**Issue**: Table not found error
**Solution**: Verify source catalog/schema, check table names in YAML
**Issue**: Python script error
**Solution**: Ensure Python 3.7+ installed, check pyyaml dependency
**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in canonical_ids section of YAML
---
## Success Criteria
Generated SQL will:
- ✅ Be valid Databricks Spark SQL
- ✅ Use Delta Lake for ACID transactions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built-in
- ✅ Support incremental processing
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Databricks
---
**Ready to generate Databricks SQL from your YAML configuration?**
Provide your YAML file path and target catalog/schema to begin!


@@ -0,0 +1,288 @@
---
name: hybrid-generate-snowflake
description: Generate Snowflake SQL from YAML configuration for ID unification
---
# Generate Snowflake SQL from YAML
## Overview
Generate production-ready Snowflake SQL workflow from your `unify.yml` configuration file. This command creates Snowflake-native SQL files with proper clustering, VARIANT support, and platform-specific function conversions.
---
## What You Need
### Required Inputs
1. **YAML Configuration File**: Path to your `unify.yml`
2. **Target Database**: Snowflake database name
3. **Target Schema**: Schema name within the database
### Optional Inputs
4. **Source Database**: Database containing source tables (defaults to target database)
5. **Source Schema**: Schema containing source tables (defaults to PUBLIC)
6. **Output Directory**: Where to save generated SQL (defaults to `snowflake_sql/`)
---
## What I'll Do
### Step 1: Validation
- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections
### Step 2: SQL Generation
I'll call the **snowflake-sql-generator agent** to:
- Execute `yaml_unification_to_snowflake.py` Python script
- Generate Snowflake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs
### Step 3: Output Organization
Generate complete SQL workflow in this structure:
```
snowflake_sql/unify/
├── 01_create_graph.sql # Initialize graph table
├── 02_extract_merge.sql # Extract identities with validation
├── 03_source_key_stats.sql # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql # Canonical ID creation with key masks
├── 06_result_key_stats.sql # Result statistics with histograms
├── 10_enrich_*.sql # Enrich each source table
├── 20_master_*.sql # Master tables with attribute aggregation
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules lookup
└── 32_column_lookup.sql # Column mapping lookup
```
### Step 4: Summary Report
Provide:
- Total SQL files generated
- Estimated execution order
- Snowflake optimizations included
- Key features enabled
- Next steps for execution
---
## Command Usage
### Basic Usage
```
/cdp-hybrid-idu:hybrid-generate-snowflake
I'll prompt you for:
- YAML file path
- Target database
- Target schema
```
### Advanced Usage
Provide all parameters upfront:
```
YAML file: /path/to/unify.yml
Target database: my_database
Target schema: my_schema
Source database: source_database (optional)
Source schema: PUBLIC (optional, defaults to PUBLIC)
Output directory: custom_output/ (optional)
```
---
## Generated SQL Features
### Snowflake Optimizations
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **VARIANT Support**: Flexible data structures for arrays and objects
- **Native Functions**: Snowflake-specific optimized functions
### Advanced Capabilities
1. **Dynamic Iteration Count**: Auto-calculates based on:
- Number of merge keys
- Number of tables
- Data complexity (configurable via YAML)
2. **Key-Specific Hashing**: Each key uses unique cryptographic mask:
```
Key Type 1 (email): 0ffdbcf0c666ce190d
Key Type 2 (customer_id): 61a821f2b646a4e890
Key Type 3 (phone): acd2206c3f88b3ee27
```
3. **Validation Rules**:
- `valid_regexp`: REGEXP_LIKE pattern filtering
- `invalid_texts`: NOT IN clause with proper NULL handling
- Combined AND logic for strict validation
4. **Master Table Attributes**:
- Single value: `MAX_BY(attr, order)` with COALESCE
- Array values: `ARRAY_SLICE(ARRAY_CAT(arrays), 0, N)`
- Priority-based selection
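A hedged sketch of how a single-value attribute might be assembled from the YAML spec into a `MAX_BY`/`COALESCE` expression (the `source_table` routing column and the helper are illustrative; the generated SQL may differ):
```python
# Sketch: build a priority-ordered single-value attribute expression (Snowflake dialect).
# Assumes order: last (order: first would use MIN_BY); source_table is an illustrative column.
def single_value_attribute(name, source_columns):
    # source_columns: list of {"table", "column", "order_by", "priority"} entries from unify.yml
    ordered = sorted(source_columns, key=lambda c: c["priority"])
    parts = [
        f"MAX_BY(CASE WHEN source_table = '{c['table']}' THEN {c['column']} END, {c['order_by']})"
        for c in ordered
    ]
    return f"COALESCE({', '.join(parts)}) AS {name}"

# single_value_attribute("best_email",
#     [{"table": "customer_profiles", "column": "email_std", "order_by": "time", "priority": 1}])
# -> "COALESCE(MAX_BY(CASE WHEN source_table = 'customer_profiles' THEN email_std END, time)) AS best_email"
```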
### Platform-Specific Conversions
The generator automatically converts:
- Presto functions → Snowflake equivalents
- Databricks functions → Snowflake equivalents
- Array operations → ARRAY_CONSTRUCT/FLATTEN syntax
- Window functions → optimized versions
- Time functions → DATE_PART(epoch_second, CURRENT_TIMESTAMP())
---
## Example Workflow
### Input YAML (`unify.yml`)
```yaml
name: customer_unification
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A']
tables:
- table: customer_profiles
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
canonical_ids:
- name: unified_id
merge_by_keys: [email, customer_id]
merge_iterations: 15
master_tables:
- name: customer_master
canonical_id: unified_id
attributes:
- name: best_email
source_columns:
- {table: customer_profiles, column: email_std, priority: 1}
```
### Generated Output
```
snowflake_sql/unify/
├── 01_create_graph.sql # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql # Merges customer_profiles keys
├── 03_source_key_stats.sql # Stats by table
├── 04_unify_loop_iteration_01.sql # First iteration
├── 04_unify_loop_iteration_02.sql # Second iteration
├── ... # Up to iteration_05
├── 05_canonicalize.sql # Creates unified_id_lookup
├── 06_result_key_stats.sql # Final statistics
├── 10_enrich_customer_profiles.sql # Adds unified_id column
├── 20_master_customer_master.sql # Creates customer_master table
├── 30_unification_metadata.sql # Metadata
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
---
## Next Steps After Generation
### Option 1: Execute Immediately
Use the execution command:
```
/cdp-hybrid-idu:hybrid-execute-snowflake
```
### Option 2: Review First
1. Examine generated SQL files
2. Verify table names and transformations
3. Test with sample data
4. Execute manually or via execution command
### Option 3: Customize
1. Modify generated SQL as needed
2. Add custom logic or transformations
3. Execute using Snowflake SQL worksheet or execution command
---
## Technical Details
### Python Script Execution
The agent executes:
```bash
python3 scripts/snowflake/yaml_unification_to_snowflake.py \
unify.yml \
-d my_database \
-s my_schema \
-sd source_database \
-ss source_schema \
-o snowflake_sql
```
### SQL File Naming Convention
- `01-09`: Setup and initialization
- `10-19`: Source table enrichment
- `20-29`: Master table creation
- `30-39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)
### Convergence Detection
Each loop iteration includes:
```sql
-- Check if graph changed
SELECT COUNT(*) FROM (
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM iteration_N
EXCEPT
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM iteration_N_minus_1
) diff
```
Stops when count = 0
### Snowflake-Specific Features
- **LATERAL FLATTEN**: Array expansion for id_ns_array processing
- **ARRAY_CONSTRUCT**: Building arrays from multiple columns
- **OBJECT_CONSTRUCT**: Creating structured objects for key-value pairs
- **ARRAYS_OVERLAP**: Checking array membership
- **SPLIT_PART**: String splitting for leader key parsing
---
## Troubleshooting
### Common Issues
**Issue**: YAML validation error
**Solution**: Check YAML syntax, ensure proper indentation, verify all required fields
**Issue**: Table not found error
**Solution**: Verify source database/schema, check table names in YAML
**Issue**: Python script error
**Solution**: Ensure Python 3.7+ installed, check pyyaml dependency
**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in canonical_ids section of YAML
**Issue**: VARIANT column errors
**Solution**: Snowflake VARIANT type handling is automatic; ensure proper casting in any custom SQL
---
## Success Criteria
Generated SQL will:
- ✅ Be valid Snowflake SQL
- ✅ Use native Snowflake functions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built-in
- ✅ Support VARIANT types for flexible data
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Snowflake
---
**Ready to generate Snowflake SQL from your YAML configuration?**
Provide your YAML file path and target database/schema to begin!

commands/hybrid-setup.md

@@ -0,0 +1,308 @@
---
name: hybrid-setup
description: Complete end-to-end hybrid ID unification setup - automatically analyzes tables, generates config, creates SQL, and executes workflow for Snowflake and Databricks
---
# Hybrid ID Unification Complete Setup
## Overview
I'll guide you through the complete hybrid ID unification setup process for Snowflake and/or Databricks platforms. This is an **automated, end-to-end workflow** that will:
1. **Analyze your tables automatically** using platform MCP tools with strict PII detection
2. **Generate YAML configuration** from real schema and data analysis
3. **Choose target platform(s)** (Snowflake, Databricks, or both)
4. **Generate platform-specific SQL** optimized for each engine
5. **Execute workflows** with convergence detection and monitoring
6. **Provide deployment guidance** and operating instructions
**Key Features**:
- 🔍 **Automated Table Analysis**: Uses Snowflake/Databricks MCP tools to analyze actual tables
- **Strict PII Detection**: Zero tolerance - only includes tables with real user identifiers
- 📊 **Real Data Validation**: Queries actual data to validate patterns and quality
- 🎯 **Smart Recommendations**: Expert analysis provides merge strategy and priorities
- 🚀 **End-to-End Automation**: From table analysis to workflow execution
---
## What You Need to Provide
### 1. Unification Requirements (For Automated Analysis)
- **Platform**: Snowflake or Databricks
- **Tables**: List of source tables to analyze
- Format (Snowflake): `database.schema.table` or `schema.table` or `table`
- Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
- **Canonical ID Name**: Name for your unified ID (e.g., `td_id`, `unified_customer_id`)
- **Merge Iterations**: Number of unification loops (default: 10)
- **Master Tables**: (Optional) Attribute aggregation specifications
**Note**: The system will automatically:
- Extract user identifiers from actual table schemas
- Validate data patterns from real data
- Apply appropriate validation rules based on data analysis
- Generate merge strategy recommendations
### 2. Platform Selection
- **Databricks**: Unity Catalog with Delta Lake
- **Snowflake**: Database with proper permissions
- **Both**: Generate SQL for both platforms
### 3. Target Configurations
**For Databricks**:
- **Catalog**: Target catalog name
- **Schema**: Target schema name
- **Source Catalog** (optional): Source data catalog
- **Source Schema** (optional): Source data schema
**For Snowflake**:
- **Database**: Target database name
- **Schema**: Target schema name
- **Source Schema** (optional): Source data schema
### 4. Execution Credentials (if executing)
**For Databricks**:
- **Server Hostname**: your-workspace.databricks.com
- **HTTP Path**: /sql/1.0/warehouses/your-warehouse-id
- **Authentication**: PAT (Personal Access Token) or OAuth
**For Snowflake**:
- **Account**: Snowflake account name
- **User**: Username
- **Password**: Password or use SSO/key-pair
- **Warehouse**: Compute warehouse name
---
## What I'll Do
### Step 1: Automated YAML Configuration Generation
I'll use the **hybrid-unif-config-creator** command to automatically generate your `unify.yml` file:
**Automated Analysis Approach** (Recommended):
- Analyze your actual tables using platform MCP tools (Snowflake/Databricks)
- Extract user identifiers with STRICT PII detection (zero tolerance for guessing)
- Validate data patterns from real table data
- Generate unify.yml with exact template compliance
- Only include tables with actual user identifiers
- Document excluded tables with detailed reasons
**What I'll do**:
- Call the **hybrid-unif-keys-extractor agent** to analyze tables
- Query actual schema and data using platform MCP tools
- Detect valid user identifiers (email, customer_id, phone, etc.)
- Exclude tables without PII with full documentation
- Generate production-ready unify.yml automatically
**Alternative - Manual Configuration**:
- If MCP tools are unavailable, I'll guide you through manual configuration
- Interactive prompts for keys, tables, and validation rules
- Step-by-step YAML building with validation
### Step 2: Platform Selection and Configuration
I'll help you:
- Choose between Databricks, Snowflake, or both
- Collect platform-specific configuration (catalog/database, schema names)
- Determine source/target separation strategy
- Decide on execution or generation-only mode
### Step 3: SQL Generation
**For Databricks** (if selected):
I'll call the **databricks-sql-generator agent** to:
- Execute `yaml_unification_to_databricks.py` script
- Generate Delta Lake optimized SQL workflow
- Create output directory: `databricks_sql/unify/`
- Generate 15+ SQL files with proper execution order
**For Snowflake** (if selected):
I'll call the **snowflake-sql-generator agent** to:
- Execute `yaml_unification_to_snowflake.py` script
- Generate Snowflake-native SQL workflow
- Create output directory: `snowflake_sql/unify/`
- Generate 15+ SQL files with proper execution order
### Step 4: Workflow Execution (Optional)
**For Databricks** (if execution requested):
I'll call the **databricks-workflow-executor agent** to:
- Execute `databricks_sql_executor.py` script
- Connect to your Databricks workspace
- Run SQL files in proper sequence
- Monitor convergence and progress
- Optimize Delta tables
- Report final statistics
**For Snowflake** (if execution requested):
I'll call the **snowflake-workflow-executor agent** to:
- Execute `snowflake_sql_executor.py` script
- Connect to your Snowflake account
- Run SQL files in proper sequence
- Monitor convergence and progress
- Report final statistics
### Step 5: Deployment Guidance
I'll provide:
- Configuration summary
- Generated files overview
- Deployment instructions
- Operating guidelines
- Monitoring recommendations
---
## Interactive Workflow
This command orchestrates the complete end-to-end flow by calling specialized commands in sequence:
### Phase 1: Configuration Creation
**I'll ask you for**:
- Platform (Snowflake or Databricks)
- Tables to analyze
- Canonical ID name
- Merge iterations
**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-unif-config-creator` internally
- Analyze your tables automatically
- Generate `unify.yml` with strict PII detection
- Show you the configuration for review
### Phase 2: SQL Generation
**I'll ask you**:
- Which platform(s) to generate SQL for (can be different from source)
- Output directory preferences
**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-generate-snowflake` (if Snowflake selected)
- Call `/cdp-hybrid-idu:hybrid-generate-databricks` (if Databricks selected)
- Generate 15+ optimized SQL files per platform
- Show you the execution plan
### Phase 3: Workflow Execution (Optional)
**I'll ask you**:
- Do you want to execute now or later?
- Connection credentials if executing
**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-execute-snowflake` (if Snowflake selected)
- Call `/cdp-hybrid-idu:hybrid-execute-databricks` (if Databricks selected)
- Monitor convergence and progress
- Report final statistics
**Throughout the process**:
- **Questions**: When I need your input
- **Suggestions**: Recommended approaches based on best practices
- **Validation**: Real-time checks on your choices
- **Explanations**: Help you understand concepts and options
---
## Expected Output
### Files Created (Platform-specific):
**For Databricks**:
```
databricks_sql/unify/
├── 01_create_graph.sql # Initialize identity graph
├── 02_extract_merge.sql # Extract and merge identities
├── 03_source_key_stats.sql # Source statistics
├── 04_unify_loop_iteration_*.sql # Iterative unification (N files)
├── 05_canonicalize.sql # Canonical ID creation
├── 06_result_key_stats.sql # Result statistics
├── 10_enrich_*.sql # Source table enrichment (N files)
├── 20_master_*.sql # Master table creation (N files)
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
**For Snowflake**:
```
snowflake_sql/unify/
├── 01_create_graph.sql # Initialize identity graph
├── 02_extract_merge.sql # Extract and merge identities
├── 03_source_key_stats.sql # Source statistics
├── 04_unify_loop_iteration_*.sql # Iterative unification (N files)
├── 05_canonicalize.sql # Canonical ID creation
├── 06_result_key_stats.sql # Result statistics
├── 10_enrich_*.sql # Source table enrichment (N files)
├── 20_master_*.sql # Master table creation (N files)
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
**Configuration**:
```
unify.yml # YAML configuration (created interactively)
```
---
## Success Criteria
All generated files will:
- ✅ Be platform-optimized and production-ready
- ✅ Use proper SQL dialects (Databricks Spark SQL or Snowflake SQL)
- ✅ Include convergence detection logic
- ✅ Support incremental processing
- ✅ Generate comprehensive statistics
- ✅ Work without modification on target platforms
---
## Getting Started
**Ready to begin?** I'll use the **hybrid-unif-config-creator** to automatically analyze your tables and generate the YAML configuration.
Please provide:
1. **Platform**: Which platform contains your data?
- Snowflake or Databricks
2. **Tables**: Which source tables should I analyze?
- Format (Snowflake): `database.schema.table` or `schema.table` or `table`
- Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
- Example: `customer_db.public.customers`, `orders`, `web_events.user_activity`
3. **Canonical ID Name**: What should I call the unified ID?
- Example: `td_id`, `unified_customer_id`, `master_id`
- Default: `td_id`
4. **Merge Iterations** (optional): How many unification loops?
- Default: 10
- Range: 2-30
5. **Target Platform(s)** for SQL generation:
- Same as source, or generate for both platforms
**Example**:
```
I want to set up hybrid ID unification for:
Platform: Snowflake
Tables:
- customer_db.public.customer_profiles
- customer_db.public.orders
- marketing_db.public.campaigns
- event_db.public.web_events
Canonical ID: unified_customer_id
Merge Iterations: 10
Generate SQL for: Snowflake (or both Snowflake and Databricks)
```
**What I'll do next**:
1. ✅ Analyze your tables using Snowflake MCP tools
2. ✅ Extract user identifiers with strict PII detection
3. ✅ Generate unify.yml automatically
4. ✅ Generate platform-specific SQL files
5. ✅ Execute workflow (if requested)
6. ✅ Provide deployment guidance
---
**Let's get started with your hybrid ID unification setup!**


@@ -0,0 +1,491 @@
---
name: hybrid-unif-config-creator
description: Auto-generate unify.yml configuration for Snowflake/Databricks by extracting user identifiers from actual tables using strict PII detection
---
# Unify Configuration Creator for Snowflake/Databricks
## Overview
I'll automatically generate a production-ready `unify.yml` configuration file for your Snowflake or Databricks ID unification by:
1. **Analyzing your actual tables** using platform-specific MCP tools
2. **Extracting user identifiers** with zero-tolerance PII detection
3. **Validating data patterns** from real table data
4. **Generating unify.yml** using the exact template format
5. **Providing recommendations** for merge strategies and priorities
**This command uses STRICT analysis - only tables with actual user identifiers will be included.**
---
## What You Need to Provide
### 1. Platform Selection
- **Snowflake**: For Snowflake databases
- **Databricks**: For Databricks Unity Catalog tables
### 2. Tables to Analyze
Provide tables you want to analyze for ID unification:
- **Format (Snowflake)**: `database.schema.table` or `schema.table` or `table`
- **Format (Databricks)**: `catalog.schema.table` or `schema.table` or `table`
- **Example**: `customer_data.public.customers`, `orders`, `web_events.user_activity`
### 3. Canonical ID Configuration
- **Name**: Name for your unified ID (default: `td_id`)
- **Merge Iterations**: Number of unification loop iterations (default: 10)
- **Incremental Iterations**: Iterations for incremental processing (default: 5)
### 4. Output Configuration (Optional)
- **Output File**: Where to save unify.yml (default: `unify.yml`)
- **Template Path**: Path to template if using custom (default: uses built-in exact template)
---
## What I'll Do
### Step 1: Platform Detection and Validation
```
1. Confirm platform (Snowflake or Databricks)
2. Verify MCP tools are available for the platform
3. Set up platform-specific query patterns
4. Inform you of the analysis approach
```
### Step 2: Key Extraction with hybrid-unif-keys-extractor Agent
I'll launch the **hybrid-unif-keys-extractor agent** to:
**Schema Analysis**:
- Use platform MCP tools to describe each table
- Extract exact column names and data types
- Identify accessible vs inaccessible tables
**User Identifier Detection**:
- Apply STRICT matching rules for user identifiers:
- ✅ Email columns (email, email_std, email_address, etc.)
- ✅ Phone columns (phone, phone_number, mobile_phone, etc.)
- ✅ User IDs (user_id, customer_id, account_id, etc.)
- ✅ Cookie/Device IDs (td_client_id, cookie_id, etc.)
- ❌ System columns (id, created_at, time, etc.)
- ❌ Complex types (arrays, maps, objects, variants, structs)
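A simplified picture of this name-based screening (a sketch only; the actual agent applies stricter rules, also rejects complex data types, and always validates candidates against real data):
```python
# Sketch: first-pass screening of column names for likely user identifiers.
import re

IDENTIFIER_PATTERNS = {
    "email": r"email",
    "phone": r"phone|mobile",
    "user_id": r"(user|customer|account)_id",
    "cookie": r"td_client_id|cookie_id|device_id",
}
SYSTEM_COLUMNS = {"id", "created_at", "updated_at", "time"}

def candidate_keys(columns):
    """Return {column_name: key_type} for columns that look like user identifiers."""
    matches = {}
    for col in columns:
        if col.lower() in SYSTEM_COLUMNS:
            continue
        for key_type, pattern in IDENTIFIER_PATTERNS.items():
            if re.search(pattern, col.lower()):
                matches[col] = key_type
                break
    return matches
```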
**Data Validation**:
- Query actual MIN/MAX values from each identified column
- Analyze data patterns and quality
- Count unique values per identifier
- Detect data quality issues
**Table Classification**:
- **INCLUDED**: Tables with valid user identifiers
- **EXCLUDED**: Tables without user identifiers (fully documented why)
**Expert Analysis**:
- 3 SQL experts review the data
- Provide priority recommendations
- Suggest validation rules based on actual data patterns
### Step 3: Unify.yml Generation
**CRITICAL**: Using the **EXACT BUILT-IN template structure** (embedded in hybrid-unif-keys-extractor agent)
**Template Usage Process**:
```
1. Receive structured data from hybrid-unif-keys-extractor agent:
- Keys with validation rules
- Tables with column mappings
- Canonical ID configuration
- Master tables specification
2. Use BUILT-IN template structure (see agent documentation)
3. ONLY replace these specific values:
- Line 1: name: {canonical_id_name}
- keys section: actual keys found
- tables section: actual tables with actual columns
- canonical_ids section: name and merge_by_keys
- master_tables section: [] or user specifications
4. PRESERVE everything else:
- ALL comment blocks (#####...)
- ALL comment text ("Declare Validation logic", etc.)
- ALL spacing and indentation (2 spaces per level)
- ALL blank lines
- EXACT YAML structure
5. Use Write tool to save populated unify.yml
```
**I'll generate**:
**Section 1: Canonical ID Name**
```yaml
name: {your_canonical_id_name}
```
**Section 2: Keys with Validation**
```yaml
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A', 'null']
- name: phone_number
invalid_texts: ['', 'N/A', 'null']
```
*Populated with actual keys found in your tables*
**Section 3: Tables with Key Column Mappings**
```yaml
tables:
- database: {database/catalog}
table: {table_name}
key_columns:
- {column: actual_column_name, key: mapped_key}
- {column: another_column, key: another_key}
```
*Only tables with valid user identifiers, with EXACT column names from schema analysis*
**Section 4: Canonical IDs Configuration**
```yaml
canonical_ids:
- name: {your_canonical_id_name}
merge_by_keys: [email, customer_id, phone_number]
merge_iterations: 15
```
*Based on extracted keys and your configuration*
**Section 5: Master Tables (Optional)**
```yaml
master_tables:
- name: {canonical_id_name}_master_table
canonical_id: {canonical_id_name}
attributes:
- name: best_email
source_columns:
- {table: table1, column: email, order: last, order_by: time, priority: 1}
- {table: table2, column: email_address, order: last, order_by: time, priority: 2}
```
*If you request master table configuration, I'll help set up attribute aggregation*
### Step 4: Validation and Review
After generation:
```
1. Show complete unify.yml content
2. Highlight key sections:
- Keys found: [list]
- Tables included: [count]
- Tables excluded: [count] with reasons
- Merge strategy: [keys and priorities]
3. Provide recommendations for optimization
4. Ask for your approval before saving
```
### Step 5: File Output
```
1. Write unify.yml to specified location
2. Create backup of existing file if present
3. Provide file summary:
- Keys configured: X
- Tables configured: Y
- Validation rules: Z
4. Show next steps for using the configuration
```
---
## Example Workflow
**Input**:
```
Platform: Snowflake
Tables:
- customer_data.public.customers
- customer_data.public.orders
- web_data.public.events
Canonical ID Name: unified_customer_id
Output: snowflake_unify.yml
```
**Process**:
```
✓ Platform: Snowflake MCP tools detected
✓ Analyzing 3 tables...
Schema Analysis:
✓ customer_data.public.customers - 12 columns
✓ customer_data.public.orders - 8 columns
✓ web_data.public.events - 15 columns
User Identifier Detection:
✓ customers: email, customer_id (2 identifiers)
✓ orders: customer_id, email_address (2 identifiers)
✗ events: NO user identifiers found
Available columns: event_id, session_id, page_url, timestamp, ...
Reason: Contains only event tracking data - no PII
Data Analysis:
✓ email: 45,123 unique values, format valid
✓ customer_id: 45,089 unique values, numeric
✓ email_address: 12,456 unique values, format valid
Expert Analysis Complete:
Priority 1: customer_id (most stable, highest coverage)
Priority 2: email (good coverage, some quality issues)
Priority 3: phone_number (not found)
Generating unify.yml...
✓ Keys section: 2 keys configured
✓ Tables section: 2 tables configured
✓ Canonical IDs: unified_customer_id
✓ Validation rules: Applied based on data patterns
Tables EXCLUDED:
- web_data.public.events: No user identifiers
```
**Output (snowflake_unify.yml)**:
```yaml
name: unified_customer_id
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A', 'null']
tables:
- database: customer_data
table: customers
key_columns:
- {column: email, key: email}
- {column: customer_id, key: customer_id}
- database: customer_data
table: orders
key_columns:
- {column: email_address, key: email}
- {column: customer_id, key: customer_id}
canonical_ids:
- name: unified_customer_id
merge_by_keys: [customer_id, email]
merge_iterations: 15
master_tables: []
```
---
## Key Features
### 🔍 **STRICT PII Detection**
- Zero tolerance for guessing
- Only includes tables with actual user identifiers
- Documents why tables are excluded
- Based on REAL schema and data analysis
### ✅ **Exact Template Compliance**
- Uses BUILT-IN exact template structure (embedded in hybrid-unif-keys-extractor agent)
- NO modifications to template format
- Preserves all comment sections
- Maintains exact YAML structure
- Portable across all systems
### 📊 **Real Data Analysis**
- Queries actual MIN/MAX values
- Counts unique identifiers
- Validates data patterns
- Identifies quality issues
### 🎯 **Platform-Aware**
- Uses correct MCP tools for each platform
- Respects platform naming conventions
- Applies platform-specific data type rules
- Generates platform-compatible SQL references
### 📋 **Complete Documentation**
- Documents all excluded tables with reasons
- Lists available columns for excluded tables
- Explains why columns don't qualify as user identifiers
- Provides expert recommendations
---
## Output Format
**The generated unify.yml will have EXACTLY this structure:**
```yaml
name: {canonical_id_name}
#####################################################
##
##Declare Validation logic for unification keys
##
#####################################################
keys:
- name: {key1}
valid_regexp: "{pattern}"
invalid_texts: ['{val1}', '{val2}', '{val3}']
- name: {key2}
invalid_texts: ['{val1}', '{val2}', '{val3}']
#####################################################
##
##Declare databases, tables, and keys to use during unification
##
#####################################################
tables:
- database: {db/catalog}
table: {table}
key_columns:
- {column: {col}, key: {key}}
#####################################################
##
##Declare hierarchy for unification. Define keys to use for each level.
##
#####################################################
canonical_ids:
- name: {canonical_id_name}
merge_by_keys: [{key1}, {key2}, ...]
merge_iterations: {number}
#####################################################
##
##Declare Similar Attributes and standardize into a single column
##
#####################################################
master_tables:
- name: {canonical_id_name}_master_table
canonical_id: {canonical_id_name}
attributes:
- name: {attribute}
source_columns:
- {table: {t}, column: {c}, order: last, order_by: time, priority: 1}
```
**NO deviations from this structure - EXACT template compliance guaranteed.**
---
## Prerequisites
### Required:
- ✅ Snowflake or Databricks platform access
- ✅ Platform-specific MCP tools configured (may use fallback if unavailable)
- ✅ Read permissions on tables to be analyzed
- ✅ Tables must exist and be accessible
### Optional:
- Custom unify.yml template path (if not using default)
- Master table attribute specifications
- Custom validation rules
---
## Expected Timeline
| Step | Duration |
|------|----------|
| Platform detection | < 1 min |
| Schema analysis (per table) | 5-10 sec |
| Data analysis (per identifier) | 10-20 sec |
| Expert analysis | 1-2 min |
| YAML generation | < 1 min |
| **Total (for 5 tables)** | **~3-5 min** |
---
## Error Handling
### Common Issues:
**Issue**: MCP tools not available for platform
**Solution**:
- I'll inform you and provide fallback options
- You can provide schema information manually
- I'll still generate unify.yml with validation warnings
**Issue**: No tables have user identifiers
**Solution**:
- I'll show you why tables were excluded
- Suggest alternative tables to analyze
- Explain what constitutes a user identifier
**Issue**: Table not accessible
**Solution**:
- Document which tables are inaccessible
- Continue with accessible tables
- Recommend permission checks
**Issue**: Complex data types found
**Solution**:
- Exclude complex type columns (arrays, structs, maps)
- Explain why they can't be used for unification
- Suggest alternative columns if available
---
## Success Criteria
Generated unify.yml will:
- ✅ Use EXACT template structure - NO modifications
- ✅ Contain ONLY tables with validated user identifiers
- ✅ Include ONLY columns that actually exist in tables
- ✅ Have validation rules based on actual data patterns
- ✅ Be ready for immediate use with hybrid-generate-snowflake or hybrid-generate-databricks
- ✅ Work without any manual edits
- ✅ Include comprehensive documentation in comments
---
## Next Steps After Generation
1. **Review the generated unify.yml**
- Verify tables and columns are correct
- Check validation rules are appropriate
- Review merge strategy and priorities
2. **Generate SQL for your platform**:
- Snowflake: `/cdp-hybrid-idu:hybrid-generate-snowflake`
- Databricks: `/cdp-hybrid-idu:hybrid-generate-databricks`
3. **Execute the workflow**:
- Snowflake: `/cdp-hybrid-idu:hybrid-execute-snowflake`
- Databricks: `/cdp-hybrid-idu:hybrid-execute-databricks`
4. **Monitor convergence and results**
---
## Getting Started
**Ready to begin?**
Please provide:
1. **Platform**: Snowflake or Databricks
2. **Tables**: List of tables to analyze (full paths)
3. **Canonical ID Name**: Name for your unified ID (e.g., `unified_customer_id`)
4. **Output File** (optional): Where to save unify.yml (default: `unify.yml`)
**Example**:
```
Platform: Snowflake
Tables:
- customer_db.public.customers
- customer_db.public.orders
- marketing_db.public.campaigns
Canonical ID: unified_id
Output: snowflake_unify.yml
```
---
**I'll analyze your tables and generate a production-ready unify.yml configuration!**

@@ -0,0 +1,337 @@
---
name: hybrid-unif-config-validate
description: Validate YAML configuration for hybrid ID unification before SQL generation
---
# Validate Hybrid ID Unification YAML
## Overview
Validate your `unify.yml` configuration file to ensure it's properly structured and ready for SQL generation. This command checks syntax, structure, validation rules, and provides recommendations for optimization.
---
## What You Need
### Required Input
1. **YAML Configuration File**: Path to your `unify.yml`
---
## What I'll Do
### Step 1: File Validation
- Verify file exists and is readable
- Check YAML syntax (proper indentation, quotes, etc.)
- Ensure all required sections are present
### Step 2: Structure Validation
Check presence and structure of:
- **name**: Unification project name
- **keys**: Key definitions with validation rules
- **tables**: Source tables with key column mappings
- **canonical_ids**: Canonical ID configuration
- **master_tables**: Master table definitions (optional)
### Step 3: Content Validation
Validate individual sections:
**Keys Section**:
- ✓ Each key has a unique name
- ✓ `valid_regexp` is a valid regex pattern (if provided)
- ✓ `invalid_texts` is an array (if provided)
- ⚠ Recommend validation rules if missing
**Tables Section**:
- ✓ Each table has a name
- ✓ Each table has at least one key_column
- ✓ All referenced keys exist in keys section
- ✓ Column names are valid identifiers
- ⚠ Check for duplicate table definitions
**Canonical IDs Section**:
- ✓ Has a name (will be canonical ID column name)
- ✓ `merge_by_keys` references existing keys
- ✓ `merge_iterations` is a positive integer (if provided)
- ⚠ Suggest optimal iteration count if not specified
**Master Tables Section** (if present):
- ✓ Each master table has a name and canonical_id
- ✓ Referenced canonical_id exists
- ✓ Attributes have proper structure
- ✓ Source tables in attributes exist
- ✓ Priority values are valid
- ⚠ Check for attribute conflicts
### Step 4: Cross-Reference Validation
- ✓ All merge_by_keys exist in keys section
- ✓ All key_columns reference defined keys
- ✓ All master table source tables exist in tables section
- ✓ Canonical ID names don't conflict with existing columns
### Step 5: Best Practices Check
Provide recommendations for:
- Key validation rules
- Iteration count optimization
- Master table attribute priorities
- Performance considerations
### Step 6: Validation Report
Generate comprehensive report with:
- ✅ Passed checks
- ⚠ Warnings (non-critical issues)
- ❌ Errors (must fix before generation)
- 💡 Recommendations for improvement
---
## Command Usage
### Basic Usage
```
/cdp-hybrid-idu:hybrid-unif-config-validate
I'll prompt you for:
- YAML file path
```
### Direct Usage
```
YAML file: /path/to/unify.yml
```
---
## Example Validation
### Input YAML
```yaml
name: customer_unification
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A']
tables:
- table: customer_profiles
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
- table: orders
key_columns:
- {column: email_address, key: email}
canonical_ids:
- name: unified_id
merge_by_keys: [email, customer_id]
merge_iterations: 15
master_tables:
- name: customer_master
canonical_id: unified_id
attributes:
- name: best_email
source_columns:
- {table: customer_profiles, column: email_std, priority: 1}
- {table: orders, column: email_address, priority: 2}
```
### Validation Report
```
✅ YAML VALIDATION SUCCESSFUL
File Structure:
✅ Valid YAML syntax
✅ All required sections present
✅ Proper indentation and formatting
Keys Section (2 keys):
✅ email: Valid regex pattern, invalid_texts defined
✅ customer_id: Invalid_texts defined
⚠ Consider adding valid_regexp for customer_id to improve validation
Tables Section (2 tables):
✅ customer_profiles: 2 key columns mapped
✅ orders: 1 key column mapped
✅ All referenced keys exist
Canonical IDs Section:
✅ Name: unified_id
✅ Merge keys: email, customer_id (both exist)
✅ Iterations: 15 (recommended range: 10-20)
Master Tables Section (1 master table):
✅ customer_master: References unified_id
✅ Attribute 'best_email': 2 sources with priorities
✅ All source tables exist
Cross-References:
✅ All merge_by_keys defined in keys section
✅ All key_columns reference existing keys
✅ All master table sources exist
✅ No canonical ID name conflicts
Recommendations:
💡 Consider adding valid_regexp for customer_id (e.g., "^[A-Z0-9]+$")
💡 Add more master table attributes for richer customer profiles
💡 Consider array attributes (top_3_emails) for historical tracking
Summary:
✅ 0 errors
⚠ 1 warning
💡 3 recommendations
✓ Configuration is ready for SQL generation!
```
---
## Validation Checks
### Required Checks (Must Pass)
- [ ] File exists and is readable
- [ ] Valid YAML syntax
- [ ] `name` field present
- [ ] `keys` section present with at least one key
- [ ] `tables` section present with at least one table
- [ ] `canonical_ids` section present
- [ ] All merge_by_keys exist in keys section
- [ ] All key_columns reference defined keys
- [ ] No duplicate key names
- [ ] No duplicate table names
### Warning Checks (Recommended)
- [ ] Keys have validation rules (valid_regexp or invalid_texts)
- [ ] Merge_iterations specified (otherwise auto-calculated)
- [ ] Master tables defined for unified customer view
- [ ] Source tables have unique key combinations
- [ ] Attribute priorities are sequential
### Best Practice Checks
- [ ] Email keys have email regex pattern
- [ ] Phone keys have phone validation
- [ ] Invalid_texts include common null values ('', 'N/A', 'null')
- [ ] Master tables use time-based order_by for recency
- [ ] Array attributes for historical data (top_3_emails, etc.)
---
## Common Validation Errors
### Syntax Errors
**Error**: `Invalid YAML: mapping values are not allowed here`
**Solution**: Check indentation (use spaces, not tabs) and make sure each colon is followed by a space
**Error**: `Invalid YAML: could not find expected ':'`
**Solution**: Check for missing colons in key-value pairs
### Structure Errors
**Error**: `Missing required section: keys`
**Solution**: Add keys section with at least one key definition
**Error**: `Empty tables section`
**Solution**: Add at least one table with key_columns
### Reference Errors
**Error**: `Key 'phone' referenced in table 'orders' but not defined in keys section`
**Solution**: Add phone key to keys section or remove reference
**Error**: `Merge key 'phone_number' not found in keys section`
**Solution**: Add phone_number to keys section or remove from merge_by_keys
**Error**: `Master table source 'customer_360' not found in tables section`
**Solution**: Add customer_360 to tables section or use correct table name
### Value Errors
**Error**: `merge_iterations must be a positive integer, got: 'auto'`
**Solution**: Either remove merge_iterations (auto-calculate) or specify integer (e.g., 15)
**Error**: `Priority must be a positive integer, got: 'high'`
**Solution**: Use numeric priority (1 for highest, 2 for second, etc.)
---
## Validation Levels
### Strict Mode (Default)
- Fails on any structural errors
- Warns on missing best practices
- Recommends optimizations
### Lenient Mode
- Only fails on critical syntax errors
- Allows missing optional fields
- Minimal warnings
---
## Platform-Specific Validation
### Databricks-Specific
- ✓ Table names compatible with Unity Catalog
- ✓ Column names valid for Spark SQL
- ⚠ Check for reserved keywords (DATABASE, TABLE, etc.)
### Snowflake-Specific
- ✓ Table names compatible with Snowflake
- ✓ Column names valid for Snowflake SQL
- ⚠ Check for reserved keywords (ACCOUNT, SCHEMA, etc.)
---
## What Happens Next
### If Validation Passes
```
✅ Configuration validated successfully!
Ready for:
• SQL generation (Databricks or Snowflake)
• Direct execution after generation
Next steps:
1. /cdp-hybrid-idu:hybrid-generate-databricks
2. /cdp-hybrid-idu:hybrid-generate-snowflake
3. /cdp-hybrid-idu:hybrid-setup (complete workflow)
```
### If Validation Fails
```
❌ Configuration has errors that must be fixed
Errors (must fix):
1. Missing required section: canonical_ids
2. Undefined key 'phone' referenced in table 'orders'
Suggestions:
• Add canonical_ids section with name and merge_by_keys
• Add phone key to keys section or remove from orders
Would you like help fixing these issues? (y/n)
```
I can help you:
- Fix syntax errors
- Add missing sections
- Define proper validation rules
- Optimize configuration
---
## Success Criteria
Validation passes when:
- ✅ YAML syntax is valid
- ✅ All required sections present
- ✅ All references resolved
- ✅ No structural errors
- ✅ Ready for SQL generation
---
**Ready to validate your YAML configuration?**
Provide your `unify.yml` file path to begin validation!

@@ -0,0 +1,726 @@
---
name: hybrid-unif-merge-stats-creator
description: Generate professional HTML/PDF merge statistics report from ID unification results for Snowflake or Databricks with expert analysis and visualizations
---
# ID Unification Merge Statistics Report Generator
## Overview
I'll generate a **comprehensive, professional HTML report** analyzing your ID unification merge statistics with:
- 📊 **Executive Summary** with key performance indicators
- 📈 **Identity Resolution Performance** analysis and deduplication rates
- 🎯 **Merge Distribution** patterns and complexity analysis
- 👥 **Top Merged Profiles** highlighting complex identity resolutions
- ✅ **Data Quality Metrics** with coverage percentages
- 🚀 **Convergence Analysis** showing iteration performance
- 💡 **Expert Recommendations** for optimization and next steps
**Platform Support:**
- ✅ Snowflake (using Snowflake MCP tools)
- ✅ Databricks (using Databricks MCP tools)
**Output Format:**
- Beautiful HTML report with charts, tables, and visualizations
- PDF-ready (print to PDF from browser)
- Consistent formatting every time
- Platform-agnostic design
---
## What You Need to Provide
### 1. Platform Selection
- **Snowflake**: For Snowflake-based ID unification
- **Databricks**: For Databricks-based ID unification
### 2. Database/Catalog Configuration
**For Snowflake:**
- **Database Name**: Where your unification tables are stored (e.g., `INDRESH_TEST`, `CUSTOMER_CDP`)
- **Schema Name**: Schema containing tables (e.g., `PUBLIC`, `ID_UNIFICATION`)
**For Databricks:**
- **Catalog Name**: Unity Catalog name (e.g., `customer_data`, `cdp_prod`)
- **Schema Name**: Schema containing tables (e.g., `id_unification`, `unified_profiles`)
### 3. Canonical ID Configuration
- **Canonical ID Name**: Name used for your unified ID (e.g., `td_id`, `unified_customer_id`, `master_id`)
- This is used to find the correct tables: `{canonical_id}_lookup`, `{canonical_id}_master_table`, etc.
### 4. Output Configuration (Optional)
- **Output File Path**: Where to save the HTML report (default: `id_unification_report.html`)
- **Report Title**: Custom title for the report (default: "ID Unification Merge Statistics Report")
---
## What I'll Do
### Step 1: Platform Detection and Validation
**Snowflake:**
```
1. Verify Snowflake MCP tools are available
2. Test connection to specified database.schema
3. Validate canonical ID tables exist:
- {database}.{schema}.{canonical_id}_lookup
- {database}.{schema}.{canonical_id}_master_table
- {database}.{schema}.{canonical_id}_source_key_stats
- {database}.{schema}.{canonical_id}_result_key_stats
4. Confirm access permissions
```
**Databricks:**
```
1. Verify Databricks MCP tools are available (or use Snowflake fallback)
2. Test connection to specified catalog.schema
3. Validate canonical ID tables exist
4. Confirm access permissions
```
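For reference, the table-existence check can be a simple catalog listing. A hedged sketch (assumes the default `{canonical_id}_` prefix; exact `SHOW TABLES` syntax differs slightly between platforms):
```sql
-- Hedged sketch: list unification output tables before querying them
-- Snowflake
SHOW TABLES LIKE '{canonical_id}_%' IN SCHEMA {database}.{schema};
-- Databricks (Unity Catalog): the pattern uses * as a wildcard
SHOW TABLES IN {catalog}.{schema} LIKE '{canonical_id}_*';
```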
### Step 2: Data Collection with Expert Analysis
I'll execute **16 specialized queries** to collect comprehensive statistics:
**Core Statistics Queries:**
1. **Source Key Statistics**
- Pre-unification identity counts
- Distinct values per key type (customer_id, email, phone, etc.)
- Per-table breakdowns
2. **Result Key Statistics**
- Post-unification canonical ID counts
- Distribution histograms
- Coverage per key type
3. **Canonical ID Metrics**
- Total identities processed
- Unique canonical IDs created
- Merge ratio calculation
4. **Top Merged Profiles**
- Top 10 most complex merges
- Identity count per canonical ID
- Merge complexity scoring
5. **Merge Distribution Analysis**
- Categorization (2, 3-5, 6-10, 10+ identities)
- Percentage distribution
- Pattern analysis
6. **Key Type Distribution**
- Identity breakdown by type
- Namespace analysis
- Cross-key coverage
7. **Master Table Quality Metrics**
- Attribute coverage percentages
- Data completeness analysis
- Sample record extraction
8. **Configuration Metadata**
- Unification settings
- Column mappings
- Validation rules
**Platform-Specific SQL Adaptation:**
For **Snowflake**:
```sql
SELECT COUNT(*) as total_identities,
COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {database}.{schema}.{canonical_id}_lookup;
```
For **Databricks**:
```sql
SELECT COUNT(*) as total_identities,
COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {catalog}.{schema}.{canonical_id}_lookup;
```
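The merge distribution analysis (query 5 above) follows the same adaptation pattern. A hedged sketch against the standard lookup layout (`canonical_id`, `id`, `id_key_type`); swap `{catalog}` for `{database}` on Snowflake:
```sql
-- Hedged sketch: bucket unified profiles by how many identities were merged into them
WITH per_profile AS (
  SELECT canonical_id, COUNT(*) AS identity_count
  FROM {catalog}.{schema}.{canonical_id}_lookup
  GROUP BY canonical_id
)
SELECT
  CASE
    WHEN identity_count <= 2  THEN '2 or fewer identities'
    WHEN identity_count <= 5  THEN '3-5 identities'
    WHEN identity_count <= 10 THEN '6-10 identities'
    ELSE '10+ identities'
  END                                                AS merge_category,
  COUNT(*)                                           AS profiles,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_profiles
FROM per_profile
GROUP BY 1
ORDER BY MIN(identity_count);
```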
### Step 3: Statistical Analysis and Calculations
I'll perform expert-level calculations:
**Deduplication Rates:**
```
For each key type:
- Source distinct count (pre-unification)
- Final canonical IDs (post-unification)
- Deduplication % = (source - final) / source * 100
```
**Merge Ratios:**
```
- Average identities per customer = total_identities / unique_canonical_ids
- Distribution across categories
- Outlier detection (10+ merges)
```
**Convergence Analysis:**
```
- Parse from execution logs if available
- Calculate from iteration metadata tables
- Estimate convergence quality
```
**Data Quality Scores:**
```
- Coverage % for each attribute
- Completeness assessment
- Quality grading (Excellent, Good, Needs Improvement)
```
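The first three calculations reduce to a few aggregate queries over the lookup and master tables. A hedged sketch (assumes the documented lookup columns and a `best_email`-style master table attribute; adjust names to your configuration):
```sql
-- Hedged sketch: overall merge ratio (average identities per unified profile)
SELECT
  COUNT(*)                                          AS total_identities,
  COUNT(DISTINCT canonical_id)                      AS unique_canonical_ids,
  ROUND(COUNT(*) / COUNT(DISTINCT canonical_id), 2) AS merge_ratio
FROM {catalog}.{schema}.{canonical_id}_lookup;

-- Hedged sketch: deduplication rate per key type
SELECT
  id_key_type,
  COUNT(DISTINCT id)           AS source_distinct,
  COUNT(DISTINCT canonical_id) AS final_canonical_ids,
  ROUND(100.0 * (COUNT(DISTINCT id) - COUNT(DISTINCT canonical_id))
        / COUNT(DISTINCT id), 1) AS deduplication_pct
FROM {catalog}.{schema}.{canonical_id}_lookup
GROUP BY id_key_type;

-- Hedged sketch: attribute coverage for one master table column
SELECT
  ROUND(100.0 * COUNT(best_email) / COUNT(*), 2) AS email_coverage_pct
FROM {catalog}.{schema}.{canonical_id}_master_table;
```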
### Step 4: HTML Report Generation
I'll generate a **pixel-perfect HTML report** with:
**Design Features:**
- ✨ Modern gradient design (purple theme)
- 📊 Interactive visualizations (progress bars, horizontal bar charts)
- 🎨 Color-coded badges and status indicators
- 📱 Responsive layout (works on all devices)
- 🖨️ Print-optimized CSS for PDF export
**Report Structure:**
```html
<!DOCTYPE html>
<html>
<head>
- Professional CSS styling
- Chart/visualization styles
- Print media queries
</head>
<body>
<header>
- Report title
- Executive tagline
</header>
<metadata-bar>
- Database/Catalog info
- Canonical ID name
- Generation timestamp
- Platform indicator
</metadata-bar>
<section: Executive Summary>
- 4 KPI metric cards
- Key findings insight box
</section>
<section: Identity Resolution Performance>
- Source vs result comparison table
- Deduplication rate analysis
- Horizontal bar charts
- Expert insights
</section>
<section: Merge Distribution Analysis>
- Category breakdown table
- Distribution visualizations
- Pattern analysis insights
</section>
<section: Top Merged Profiles>
- Top 10 ranked table
- Complexity badges
- Investigation recommendations
</section>
<section: Source Table Configuration>
- Column mapping table
- Source contributions
- Multi-key strategy analysis
</section>
<section: Master Table Data Quality>
- 6 coverage cards with progress bars
- Sample records table
- Quality assessment
</section>
<section: Convergence Performance>
- Iteration breakdown table
- Convergence progression chart
- Efficiency analysis
</section>
<section: Expert Recommendations>
- 4 recommendation cards
- Strategic next steps
- Downstream activation ideas
</section>
<section: Summary Statistics>
- Comprehensive metrics table
- All key numbers documented
</section>
<footer>
- Generation metadata
- Platform information
- Report description
</footer>
</body>
</html>
```
### Step 5: Quality Validation and Output
**Pre-Output Validation:**
```
1. Verify all sections have data
2. Check calculations are correct
3. Validate percentages sum properly
4. Ensure no missing values
5. Confirm HTML is well-formed
```
**File Output:**
```
1. Write HTML to specified path
2. Create backup if file exists
3. Set proper file permissions
4. Verify file was written successfully
```
**Report Summary:**
```
✓ Report generated: {file_path}
✓ File size: {size} KB
✓ Sections included: 9
✓ Statistics queries: 16
✓ Data quality score: {score}%
✓ Ready for: Browser viewing, PDF export, sharing
```
---
## Example Workflow
### Snowflake Example
**User Input:**
```
Platform: Snowflake
Database: INDRESH_TEST
Schema: PUBLIC
Canonical ID: td_id
Output: snowflake_merge_report.html
```
**Process:**
```
✓ Connected to Snowflake via MCP
✓ Database: INDRESH_TEST.PUBLIC validated
✓ Tables found:
- td_id_lookup (19,512 records)
- td_id_master_table (4,940 records)
- td_id_source_key_stats (4 records)
- td_id_result_key_stats (4 records)
Executing queries:
✓ Query 1: Source statistics retrieved
✓ Query 2: Result statistics retrieved
✓ Query 3: Canonical ID counts (19,512 → 4,940)
✓ Query 4: Top 10 merged profiles identified
✓ Query 5: Merge distribution calculated
✓ Query 6: Key type distribution analyzed
✓ Query 7: Master table coverage (100% email, 99.39% phone)
✓ Query 8: Sample records extracted
✓ Query 9-11: Metadata retrieved
Calculating metrics:
✓ Merge ratio: 3.95:1
✓ Fragmentation reduction: 74.7%
✓ Deduplication rates:
- customer_id: 23.9%
- email: 32.0%
- phone: 14.8%
✓ Data quality score: 99.7%
Generating HTML report:
✓ Executive summary section
✓ Performance analysis section
✓ Merge distribution section
✓ Top profiles section
✓ Source configuration section
✓ Data quality section
✓ Convergence section
✓ Recommendations section
✓ Summary statistics section
✓ Report saved: snowflake_merge_report.html (142 KB)
✓ Open in browser to view
✓ Print to PDF for distribution
```
**Generated Report Contents:**
```
Executive Summary:
- 4,940 unified profiles
- 19,512 total identities
- 3.95:1 merge ratio
- 74.7% fragmentation reduction
Identity Resolution:
- customer_id: 6,489 → 4,940 (23.9% reduction)
- email: 7,261 → 4,940 (32.0% reduction)
- phone: 5,762 → 4,910 (14.8% reduction)
Merge Distribution:
- 89.0% profiles: 3-5 identities (normal)
- 8.1% profiles: 6-10 identities (high engagement)
- 2.3% profiles: 10+ identities (complex)
Top Merged Profile:
- mS9ssBEh4EsN: 38 identities merged
Data Quality:
- Email: 100% coverage
- Phone: 99.39% coverage
- Names: 100% coverage
- Location: 100% coverage
Expert Recommendations:
- Implement incremental processing
- Monitor profiles with 20+ merges
- Enable downstream activation
- Set up quality monitoring
```
### Databricks Example
**User Input:**
```
Platform: Databricks
Catalog: customer_cdp
Schema: id_unification
Canonical ID: unified_customer_id
Output: databricks_merge_report.html
```
**Process:**
```
✓ Connected to Databricks (or using Snowflake MCP fallback)
✓ Catalog: customer_cdp.id_unification validated
✓ Tables found:
- unified_customer_id_lookup
- unified_customer_id_master_table
- unified_customer_id_source_key_stats
- unified_customer_id_result_key_stats
[Same query execution and report generation as Snowflake]
✓ Report saved: databricks_merge_report.html
```
---
## Key Features
### 🎯 **Consistency Guarantee**
- **Same report every time**: Deterministic HTML generation
- **Platform-agnostic design**: Works identically on Snowflake and Databricks
- **Version controlled**: Report structure is fixed and versioned
### 🔍 **Expert Analysis**
- **16 specialized queries**: Comprehensive data collection
- **Calculated metrics**: Deduplication rates, merge ratios, quality scores
- **Pattern detection**: Identify anomalies and outliers
- **Strategic insights**: Actionable recommendations
### 📊 **Professional Visualizations**
- **KPI metric cards**: Large, colorful summary metrics
- **Progress bars**: Coverage percentages with animations
- **Horizontal bar charts**: Distribution comparisons
- **Color-coded badges**: Status indicators (Excellent, Good, Needs Review)
- **Tables with hover effects**: Interactive data exploration
### 🌍 **Platform Flexibility**
- **Snowflake**: Uses `mcp__snowflake__execute_query` tool
- **Databricks**: Uses Databricks MCP tools (with fallback options)
- **Automatic SQL adaptation**: Platform-specific query generation
- **Table name resolution**: Handles catalog vs database differences
### 📋 **Comprehensive Coverage**
**9 Report Sections:**
1. Executive Summary (4 KPIs + findings)
2. Identity Resolution Performance (deduplication analysis)
3. Merge Distribution Analysis (categorized breakdown)
4. Top Merged Profiles (complexity ranking)
5. Source Table Configuration (mappings)
6. Master Table Data Quality (coverage metrics)
7. Convergence Performance (iteration analysis)
8. Expert Recommendations (strategic guidance)
9. Summary Statistics (complete metrics)
**16 Statistical Queries:**
- Source/result key statistics
- Canonical ID counts and distributions
- Merge pattern analysis
- Quality coverage metrics
- Configuration metadata
---
## Table Naming Conventions
The command automatically finds tables based on your canonical ID name:
### Required Tables
For canonical ID = `{canonical_id}`:
1. **Lookup Table**: `{canonical_id}_lookup`
- Contains: canonical_id, id, id_key_type
- Used for: Merge ratio, distribution, top profiles
2. **Master Table**: `{canonical_id}_master_table`
- Contains: {canonical_id}, best_* attributes
- Used for: Data quality coverage
3. **Source Stats**: `{canonical_id}_source_key_stats`
- Contains: from_table, total_distinct, distinct_*
- Used for: Pre-unification baseline
4. **Result Stats**: `{canonical_id}_result_key_stats`
- Contains: from_table, total_distinct, histogram_*
- Used for: Post-unification results
### Optional Tables
5. **Unification Metadata**: `unification_metadata`
- Contains: canonical_id_name, canonical_id_type
- Used for: Configuration documentation
6. **Column Lookup**: `column_lookup`
- Contains: table_name, column_name, key_name
- Used for: Source table mappings
7. **Filter Lookup**: `filter_lookup`
- Contains: key_name, invalid_texts, valid_regexp
- Used for: Validation rules
**All tables must be in the same database.schema (Snowflake) or catalog.schema (Databricks)**
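As a quick orientation, the lookup and master tables join on the canonical ID. A hedged sketch assuming the column layout documented above:
```sql
-- Hedged sketch: view a few unified profiles together with their merged source identities
SELECT
  m.{canonical_id},
  l.id,
  l.id_key_type
FROM {catalog}.{schema}.{canonical_id}_master_table AS m
JOIN {catalog}.{schema}.{canonical_id}_lookup       AS l
  ON l.canonical_id = m.{canonical_id}
ORDER BY m.{canonical_id}
LIMIT 20;
```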
---
## Output Format
### HTML Report Features
**Styling:**
- Gradient purple theme (#667eea to #764ba2)
- Modern typography (system fonts)
- Responsive grid layouts
- Smooth hover animations
- Print-optimized media queries
**Sections:**
- Header with gradient background
- Metadata bar with key info
- 9 content sections with analysis
- Footer with generation details
**Visualizations:**
- Metric cards (4 in executive summary)
- Progress bars (6 in data quality)
- Horizontal bar charts (3 throughout report)
- Tables with sorting and hover effects
- Insight boxes with recommendations
**Interactivity:**
- Hover effects on cards and tables
- Animated progress bars
- Expandable insight boxes
- Responsive layout adapts to screen size
### PDF Export
To create a PDF from the HTML report:
1. Open HTML file in browser
2. Press Ctrl+P (Windows) or Cmd+P (Mac)
3. Select "Save as PDF"
4. Choose landscape orientation for better chart visibility
5. Enable background graphics for full styling
---
## Error Handling
### Common Issues and Solutions
**Issue: "Tables not found"**
```
Solution:
1. Verify canonical ID name is correct
2. Check database/catalog and schema names
3. Ensure unification workflow completed successfully
4. Confirm table naming: {canonical_id}_lookup, {canonical_id}_master_table, etc.
```
**Issue: "MCP tools not available"**
```
Solution:
1. For Snowflake: Verify Snowflake MCP server is configured
2. For Databricks: Fall back to Snowflake MCP with proper connection string
3. Check network connectivity
4. Validate credentials
```
**Issue: "No data in statistics tables"**
```
Solution:
1. Verify unification workflow ran completely
2. Check that statistics SQL files were executed
3. Confirm data exists in lookup and master tables
4. Re-run the unification workflow if needed
```
**Issue: "Permission denied"**
```
Solution:
1. Verify READ access to all tables
2. For Snowflake: Grant SELECT on schema
3. For Databricks: Grant USE CATALOG, USE SCHEMA, SELECT
4. Check role/user permissions
```
---
## Success Criteria
Generated report will:
- ✅ **Open successfully** in all modern browsers (Chrome, Firefox, Safari, Edge)
- ✅ **Display all 9 sections** with complete data
- ✅ **Show accurate calculations** for all metrics
- ✅ **Include visualizations** (charts, progress bars, tables)
- ✅ **Render consistently** every time it's generated
- ✅ **Export cleanly to PDF** with proper formatting
- ✅ **Match the reference design** (same HTML/CSS structure)
- ✅ **Contain expert insights** and recommendations
- ✅ **Be production-ready** for stakeholder distribution
---
## Usage Examples
### Quick Start (Snowflake)
```
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
> Platform: Snowflake
> Database: PROD_CDP
> Schema: ID_UNIFICATION
> Canonical ID: master_customer_id
> Output: (press Enter for default)
✓ Report generated: id_unification_report.html
```
### Custom Output Path
```
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
> Platform: Databricks
> Catalog: analytics_prod
> Schema: unified_ids
> Canonical ID: td_id
> Output: /reports/weekly/td_id_stats_2025-10-15.html
✓ Report generated: /reports/weekly/td_id_stats_2025-10-15.html
```
### Multiple Environments
Generate reports for different environments:
```bash
# Production
/hybrid-unif-merge-stats-creator
Platform: Snowflake
Database: PROD_CDP
Output: prod_merge_stats.html
# Staging
/hybrid-unif-merge-stats-creator
Platform: Snowflake
Database: STAGING_CDP
Output: staging_merge_stats.html
# Compare metrics across environments
```
---
## Best Practices
### Regular Reporting
1. **Weekly Reports**: Track merge performance over time
2. **Post-Workflow Reports**: Generate after each unification run
3. **Quality Audits**: Monthly deep-dive analysis
4. **Stakeholder Updates**: Executive-friendly format
### Comparative Analysis
Generate reports at different stages:
- After initial unification setup
- After incremental updates
- After data quality improvements
- Across different customer segments
### Archive and Versioning
```
reports/
2025-10-15_td_id_merge_stats.html
2025-10-08_td_id_merge_stats.html
2025-10-01_td_id_merge_stats.html
```
Track improvements over time by comparing:
- Merge ratios
- Data quality scores
- Convergence iterations
- Deduplication rates
---
## Getting Started
**Ready to generate your merge statistics report?**
Please provide:
1. **Platform**: Snowflake or Databricks?
2. **Database/Catalog**: Where are your unification tables?
3. **Schema**: Which schema contains the tables?
4. **Canonical ID**: What's the name of your unified ID? (e.g., td_id)
5. **Output Path** (optional): Where to save the report?
**Example:**
```
I want to generate a merge statistics report for:
Platform: Snowflake
Database: INDRESH_TEST
Schema: PUBLIC
Canonical ID: td_id
Output: my_unification_report.html
```
---
**I'll analyze your ID unification results and create a comprehensive, beautiful HTML report with expert insights!**