Initial commit
387 commands/hybrid-execute-databricks.md (new file)
@@ -0,0 +1,387 @@
---
name: hybrid-execute-databricks
description: Execute Databricks ID unification workflow with convergence detection and monitoring
---

# Execute Databricks ID Unification Workflow

## Overview

Execute your generated Databricks SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling. This command orchestrates the complete unification process from graph creation to master table generation.

---

## What You Need

### Required Inputs
1. **SQL Directory**: Path to generated SQL files (e.g., `databricks_sql/unify/`)
2. **Server Hostname**: Your Databricks workspace URL (e.g., `your-workspace.cloud.databricks.com`)
3. **HTTP Path**: SQL Warehouse or cluster path (e.g., `/sql/1.0/warehouses/abc123`)
4. **Catalog**: Target catalog name
5. **Schema**: Target schema name

### Authentication
**Option 1: Personal Access Token (PAT)**
- Access token from your Databricks workspace
- Can be provided as an argument or via the `DATABRICKS_TOKEN` environment variable

**Option 2: OAuth**
- Browser-based authentication
- No token required; a browser window opens for login

---

## What I'll Do

### Step 1: Connection Setup
- Connect to your Databricks workspace
- Validate credentials and permissions
- Set catalog and schema context
- Verify the SQL directory exists

### Step 2: Execution Plan
Display an execution plan with:
- All SQL files in execution order
- File types (Setup, Loop Iteration, Enrichment, Master Table, etc.)
- Estimated steps and dependencies

### Step 3: SQL Execution
I'll call the **databricks-workflow-executor agent** to:
- Execute SQL files in proper sequence
- Skip loop iteration files (handled separately)
- Monitor progress with real-time feedback
- Track row counts and execution times

### Step 4: Unify Loop with Convergence Detection
**Intelligent Loop Execution**:
```
Iteration 1:
✓ Execute unify SQL
• Check convergence: 1500 records updated
• Optimize Delta table
→ Continue to iteration 2

Iteration 2:
✓ Execute unify SQL
• Check convergence: 450 records updated
• Optimize Delta table
→ Continue to iteration 3

Iteration 3:
✓ Execute unify SQL
• Check convergence: 0 records updated
✓ CONVERGED! Stop loop
```

**Features** (see the sketch below):
- Runs until convergence (updated_count = 0)
- Safety limit of 30 iterations maximum
- Auto-optimization after each iteration
- Creates an alias table (loop_final) for downstream processing
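
For concreteness, here is a minimal sketch of this loop using the `databricks-sql-connector` package. The table prefix, file paths, and one-statement-per-file execution are simplifying assumptions; the actual executor script may split files and handle errors differently.

```python
# Minimal convergence-loop sketch, assuming the databricks-sql-connector package
# and the generated table naming shown in this document.
from pathlib import Path
from databricks import sql

MAX_ITERATIONS = 30  # safety limit described above
GRAPH = "my_catalog.my_schema.unified_id_graph_unify_loop"  # assumed prefix

def run_unify_loop(conn) -> int:
    """Run unify iterations until the graph stops changing; return the final iteration."""
    with conn.cursor() as cur:
        for i in range(1, MAX_ITERATIONS + 1):
            # Execute the pre-generated iteration SQL (builds ..._loop_{i} from ..._loop_{i-1}).
            cur.execute(Path(f"databricks_sql/unify/04_unify_loop_iteration_{i:02d}.sql").read_text())

            # Convergence check: count edges that changed between iterations.
            cur.execute(f"""
                SELECT COUNT(*) FROM (
                    SELECT leader_ns, leader_id, follower_ns, follower_id FROM {GRAPH}_{i}
                    EXCEPT
                    SELECT leader_ns, leader_id, follower_ns, follower_id FROM {GRAPH}_{i - 1}
                ) diff""")
            updated_count = cur.fetchone()[0]

            cur.execute(f"OPTIMIZE {GRAPH}_{i}")  # auto-optimization after each iteration

            if updated_count == 0:
                return i  # converged
    raise RuntimeError("Unify loop hit the 30-iteration safety limit without converging")

conn = sql.connect(server_hostname="your-workspace.cloud.databricks.com",
                   http_path="/sql/1.0/warehouses/abc123",
                   access_token="dapi...")
final_iteration = run_unify_loop(conn)
```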

### Step 5: Post-Loop Processing
- Execute canonicalization step
- Generate result statistics
- Enrich source tables with canonical IDs
- Create master tables
- Generate metadata and lookup tables

### Step 6: Final Report
Provide:
- Total execution time
- Files processed successfully
- Convergence statistics
- Final table row counts
- Next steps and recommendations

---

## Command Usage

### Interactive Mode (Recommended)
```
/cdp-hybrid-idu:hybrid-execute-databricks

I'll prompt you for:
- SQL directory path
- Databricks server hostname
- HTTP path
- Catalog and schema
- Authentication method
```

### Advanced Mode
Provide all parameters upfront:
```
SQL directory: databricks_sql/unify/
Server hostname: your-workspace.cloud.databricks.com
HTTP path: /sql/1.0/warehouses/abc123
Catalog: my_catalog
Schema: my_schema
Auth type: pat (or oauth)
Access token: dapi... (if using PAT)
```

---

## Execution Features

### 1. Convergence Detection
**Algorithm**:
```sql
SELECT COUNT(*) as updated_count FROM (
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM current_iteration
  EXCEPT
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM previous_iteration
) diff
```

**Stops when**: updated_count = 0

### 2. Delta Table Optimization
After major operations:
```sql
OPTIMIZE table_name
```
Benefits:
- Compacts small files
- Improves query performance
- Reduces storage costs
- Optimizes clustering

### 3. Interactive Error Handling
If an error occurs:
```
✗ File: 04_unify_loop_iteration_01.sql
Error: Table not found: source_table

Continue with remaining files? (y/n):
```

You can choose to:
- Continue: skip the failed file and proceed with the rest
- Stop: halt execution for investigation

### 4. Real-Time Monitoring
Track progress with:
- ✓ Completed steps (green)
- • Progress indicators (cyan)
- ✗ Failed steps (red)
- ⚠ Warnings (yellow)
- Row counts and execution times

### 5. Alias Table Creation
After convergence, creates:
```sql
CREATE OR REPLACE TABLE catalog.schema.unified_id_graph_unify_loop_final
AS SELECT * FROM catalog.schema.unified_id_graph_unify_loop_3
```

This allows downstream SQL to reference `loop_final` regardless of the actual iteration count.
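
A sketch of how that alias might be created once the final iteration number is known (an illustrative helper, not the executor's actual code):

```python
# Hypothetical helper: publish the converged iteration under the stable
# `loop_final` name so downstream SQL never depends on the iteration count.
def create_loop_final_alias(cursor, prefix, final_iteration):
    cursor.execute(
        f"CREATE OR REPLACE TABLE {prefix}_final AS "
        f"SELECT * FROM {prefix}_{final_iteration}"
    )

# e.g. create_loop_final_alias(cur, "catalog.schema.unified_id_graph_unify_loop", 3)
```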

---

## Technical Details

### Python Script Execution
The agent executes:
```bash
python3 scripts/databricks/databricks_sql_executor.py \
  databricks_sql/unify/ \
  --server-hostname your-workspace.databricks.com \
  --http-path /sql/1.0/warehouses/abc123 \
  --catalog my_catalog \
  --schema my_schema \
  --auth-type pat \
  --optimize-tables
```

### Execution Order
1. **Setup Phase** (01-03):
   - Create graph table (loop_0)
   - Extract and merge identities
   - Generate source statistics

2. **Unification Loop** (04):
   - Run iterations until convergence
   - Check convergence after every iteration
   - Stop when updated_count = 0
   - Create loop_final alias

3. **Canonicalization** (05):
   - Create canonical ID lookup
   - Create keys and tables metadata
   - Rename final graph table

4. **Statistics** (06):
   - Generate result key statistics
   - Create histograms
   - Calculate coverage metrics

5. **Enrichment** (10-19):
   - Add canonical IDs to source tables
   - Create enriched_* tables

6. **Master Tables** (20-29):
   - Aggregate attributes
   - Apply priority rules
   - Create unified customer profiles

7. **Metadata** (30-39):
   - Unification metadata
   - Filter lookup tables
   - Column lookup tables
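
Since the numeric prefixes fully determine ordering, an executor can derive the plan directly from file names. A small sketch (the prefix-to-phase mapping is inferred from the convention above, not taken from the script):

```python
# Sketch: order and bucket generated files by their two-digit numeric prefix.
from pathlib import Path

def execution_plan(sql_dir):
    plan = []
    for f in sorted(Path(sql_dir).glob("*.sql")):  # two-digit prefixes sort lexically
        prefix = int(f.name.split("_", 1)[0])
        if prefix == 4:
            phase = "unify loop (handled separately)"
        elif prefix < 10:
            phase = "setup / canonicalization / statistics"
        elif prefix < 20:
            phase = "enrichment"
        elif prefix < 30:
            phase = "master tables"
        else:
            phase = "metadata"
        plan.append((f.name, phase))
    return plan
```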

### Connection Management
- Establishes a single connection for the entire workflow
- Uses connection pooling for efficiency
- Automatic reconnection on timeout
- Proper cleanup on completion or error
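
The reconnection behavior could take roughly this shape; this is purely illustrative, since the executor's actual error types and backoff policy are not shown in this document:

```python
# Illustrative retry-and-reconnect wrapper; real code would catch the
# driver's specific timeout error rather than a bare Exception.
import time

def execute_with_reconnect(connect, statement, retries=3):
    conn = connect()
    try:
        for attempt in range(retries):
            try:
                with conn.cursor() as cur:
                    cur.execute(statement)
                    return cur.fetchall()
            except Exception:
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)  # back off, then reconnect
                conn.close()
                conn = connect()
    finally:
        conn.close()  # cleanup on completion or error
```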

---

## Example Execution

### Input
```
SQL directory: databricks_sql/unify/
Server hostname: dbc-12345-abc.cloud.databricks.com
HTTP path: /sql/1.0/warehouses/6789abcd
Catalog: customer_data
Schema: id_unification
Auth type: pat
```

### Output
```
✓ Connected to Databricks: dbc-12345-abc.cloud.databricks.com
• Using catalog: customer_data, schema: id_unification

Starting Databricks SQL Execution
• Catalog: customer_data
• Schema: id_unification
• Delta tables: ✓ enabled

Executing: 01_create_graph.sql
✓ 01_create_graph.sql: Executed successfully

Executing: 02_extract_merge.sql
✓ 02_extract_merge.sql: Executed successfully
• Rows affected: 125000

Executing: 03_source_key_stats.sql
✓ 03_source_key_stats.sql: Executed successfully

Executing Unify Loop Before Canonicalization

--- Iteration 1 ---
✓ Iteration 1 completed
• Rows processed: 125000
• Updated records: 1500
• Optimizing Delta table

--- Iteration 2 ---
✓ Iteration 2 completed
• Rows processed: 125000
• Updated records: 450
• Optimizing Delta table

--- Iteration 3 ---
✓ Iteration 3 completed
• Rows processed: 125000
• Updated records: 0
✓ Loop converged after 3 iterations

• Creating alias table for final iteration
✓ Alias table 'unified_id_graph_unify_loop_final' created

Executing: 05_canonicalize.sql
✓ 05_canonicalize.sql: Executed successfully

[... continues with enrichment, master tables, metadata ...]

Execution Complete
• Files processed: 18/18
• Final unified_id_lookup rows: 98,500

• Disconnected from Databricks
```

---

## Monitoring and Troubleshooting

### Check Execution Progress
During execution, you can monitor:
- Databricks SQL Warehouse query history
- Delta table sizes and row counts
- Execution logs in the Databricks workspace

### Common Issues

**Issue**: Connection timeout
**Solution**: Check network access, verify credentials, ensure the SQL Warehouse is running

**Issue**: Table not found
**Solution**: Verify catalog/schema permissions, check source table names in the YAML

**Issue**: Loop doesn't converge
**Solution**: Check data quality, increase max_iterations, review key validation rules

**Issue**: Out of memory
**Solution**: Increase SQL Warehouse size, optimize clustering, reduce batch sizes

**Issue**: Permission denied
**Solution**: Verify catalog/schema permissions, check Unity Catalog access controls

### Performance Optimization
- Use a larger SQL Warehouse for faster execution
- Enable auto-scaling for variable workloads
- Optimize Delta tables regularly
- Use clustering on frequently joined columns

---

## Post-Execution Validation
**DO NOT RUN THESE VALIDATIONS. PRESENT THEM TO THE USER TO RUN ON DATABRICKS.**

### Check Coverage
```sql
SELECT
  COUNT(*) as total_records,
  COUNT(unified_id) as records_with_id,
  COUNT(unified_id) * 100.0 / COUNT(*) as coverage_percent
FROM catalog.schema.enriched_customer_profiles;
```

### Verify Master Table
```sql
SELECT COUNT(*) as unified_customers
FROM catalog.schema.customer_master;
```

### Review Statistics
```sql
SELECT * FROM catalog.schema.unified_id_result_key_stats
WHERE from_table = '*';
```

---

## Success Criteria

Execution is successful when:
- ✅ All SQL files processed without critical errors
- ✅ Unification loop converged (updated_count = 0)
- ✅ Canonical IDs generated for all eligible records
- ✅ Enriched tables created successfully
- ✅ Master tables populated with attributes
- ✅ Coverage metrics meet expectations

---

**Ready to execute your Databricks ID unification workflow?**

Provide your SQL directory path and Databricks connection details to begin!
401 commands/hybrid-execute-snowflake.md (new file)
@@ -0,0 +1,401 @@
---
name: hybrid-execute-snowflake
description: Execute Snowflake ID unification workflow with convergence detection and monitoring
---

# Execute Snowflake ID Unification Workflow

## Overview

Execute your generated Snowflake SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling. This command orchestrates the complete unification process from graph creation to master table generation.

---

## What You Need

### Required Inputs
1. **SQL Directory**: Path to generated SQL files (e.g., `snowflake_sql/unify/`)
2. **Account**: Snowflake account name (e.g., `myaccount` from `myaccount.snowflakecomputing.com`)
3. **User**: Snowflake username
4. **Database**: Target database name
5. **Schema**: Target schema name
6. **Warehouse**: Compute warehouse name (defaults to `COMPUTE_WH`)

### Authentication
**Option 1: Password**
- Can be provided as an argument, via the `SNOWFLAKE_PASSWORD` environment variable, or via an environment file (`.env`)
- Will prompt if not provided

**Option 2: SSO (externalbrowser)**
- Opens a browser for authentication
- No password required

**Option 3: Key-Pair**
- Private key path via `SNOWFLAKE_PRIVATE_KEY_PATH`
- Passphrase via `SNOWFLAKE_PRIVATE_KEY_PASSPHRASE`
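
A sketch of what these three options look like with the `snowflake-connector-python` package. Account and object names are placeholders, and `private_key_file`/`private_key_file_pwd` require a recent connector version; in practice you would use only one of the three calls:

```python
# Connection sketches for the three auth options (pick one; each call opens
# a connection). Values below are placeholders, not real credentials.
import os
import snowflake.connector

common = dict(account="myaccount", user="myuser", database="my_database",
              schema="my_schema", warehouse="COMPUTE_WH")

# Option 1: password from the environment
conn = snowflake.connector.connect(password=os.environ["SNOWFLAKE_PASSWORD"], **common)

# Option 2: SSO via external browser
conn = snowflake.connector.connect(authenticator="externalbrowser", **common)

# Option 3: key-pair
conn = snowflake.connector.connect(
    private_key_file=os.environ["SNOWFLAKE_PRIVATE_KEY_PATH"],
    private_key_file_pwd=os.environ["SNOWFLAKE_PRIVATE_KEY_PASSPHRASE"],
    **common,
)
```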

---

## What I'll Do

### Step 1: Connection Setup
- Connect to your Snowflake account
- Validate credentials and permissions
- Set database and schema context
- Verify the SQL directory exists
- Activate the warehouse

### Step 2: Execution Plan
Display an execution plan with:
- All SQL files in execution order
- File types (Setup, Loop Iteration, Enrichment, Master Table, etc.)
- Estimated steps and dependencies

### Step 3: SQL Execution
I'll call the **snowflake-workflow-executor agent** to:
- Execute SQL files in proper sequence
- Skip loop iteration files (handled separately)
- Monitor progress with real-time feedback
- Track row counts and execution times

### Step 4: Unify Loop with Convergence Detection
**Intelligent Loop Execution**:
```
Iteration 1:
✓ Execute unify SQL
• Check convergence: 1500 records updated
→ Continue to iteration 2

Iteration 2:
✓ Execute unify SQL
• Check convergence: 450 records updated
→ Continue to iteration 3

Iteration 3:
✓ Execute unify SQL
• Check convergence: 0 records updated
✓ CONVERGED! Stop loop
```

**Features** (see the sketch below):
- Runs until convergence (updated_count = 0)
- Safety limit of 30 iterations maximum
- Creates an alias table (loop_final) for downstream processing
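
The loop mirrors the Databricks version but drops the Delta-specific `OPTIMIZE` step. A minimal sketch with `snowflake.connector`; the generated table naming and one-statement-per-file execution are assumptions (the real executor may split multi-statement files):

```python
# Snowflake convergence-loop sketch using a snowflake.connector connection.
from pathlib import Path

def run_unify_loop(conn, prefix, max_iterations=30):
    cur = conn.cursor()
    try:
        for i in range(1, max_iterations + 1):
            cur.execute(Path(f"snowflake_sql/unify/04_unify_loop_iteration_{i:02d}.sql").read_text())
            # Convergence check: count edges that changed between iterations.
            cur.execute(f"""
                SELECT COUNT(*) FROM (
                    SELECT leader_ns, leader_id, follower_ns, follower_id FROM {prefix}_{i}
                    EXCEPT
                    SELECT leader_ns, leader_id, follower_ns, follower_id FROM {prefix}_{i - 1}
                ) diff""")
            if cur.fetchone()[0] == 0:
                return i  # converged: no edges changed
        raise RuntimeError("Hit the 30-iteration safety limit without converging")
    finally:
        cur.close()
```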

### Step 5: Post-Loop Processing
- Execute canonicalization step
- Generate result statistics
- Enrich source tables with canonical IDs
- Create master tables
- Generate metadata and lookup tables

### Step 6: Final Report
Provide:
- Total execution time
- Files processed successfully
- Convergence statistics
- Final table row counts
- Next steps and recommendations

---

## Command Usage

### Interactive Mode (Recommended)
```
/cdp-hybrid-idu:hybrid-execute-snowflake

I'll prompt you for:
- SQL directory path
- Snowflake account name
- Username
- Database and schema
- Warehouse name
- Authentication method
```

### Advanced Mode
Provide all parameters upfront:
```
SQL directory: snowflake_sql/unify/
Account: myaccount
User: myuser
Database: my_database
Schema: my_schema
Warehouse: COMPUTE_WH
Password: (will prompt if not in environment)
```

---

## Execution Features

### 1. Convergence Detection
**Algorithm**:
```sql
SELECT COUNT(*) as updated_count FROM (
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM current_iteration
  EXCEPT
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM previous_iteration
) diff
```

**Stops when**: updated_count = 0

### 2. Interactive Error Handling
If an error occurs:
```
✗ File: 04_unify_loop_iteration_01.sql
Error: Table not found: source_table

Continue with remaining files? (y/n):
```

You can choose to:
- Continue: skip the failed file and proceed with the rest
- Stop: halt execution for investigation

### 3. Real-Time Monitoring
Track progress with:
- ✓ Completed steps (green)
- • Progress indicators (cyan)
- ✗ Failed steps (red)
- ⚠ Warnings (yellow)
- Row counts and execution times

### 4. Alias Table Creation
After convergence, creates:
```sql
CREATE OR REPLACE TABLE database.schema.unified_id_graph_unify_loop_final
AS SELECT * FROM database.schema.unified_id_graph_unify_loop_3
```

This allows downstream SQL to reference `loop_final` regardless of the actual iteration count.

---

## Technical Details

### Python Script Execution
The agent executes:
```bash
python3 scripts/snowflake/snowflake_sql_executor.py \
  snowflake_sql/unify/ \
  --account myaccount \
  --user myuser \
  --database my_database \
  --schema my_schema \
  --warehouse COMPUTE_WH
```

### Execution Order
1. **Setup Phase** (01-03):
   - Create graph table (loop_0)
   - Extract and merge identities
   - Generate source statistics

2. **Unification Loop** (04):
   - Run iterations until convergence
   - Check convergence after every iteration
   - Stop when updated_count = 0
   - Create loop_final alias

3. **Canonicalization** (05):
   - Create canonical ID lookup
   - Create keys and tables metadata
   - Rename final graph table

4. **Statistics** (06):
   - Generate result key statistics
   - Create histograms
   - Calculate coverage metrics

5. **Enrichment** (10-19):
   - Add canonical IDs to source tables
   - Create enriched_* tables

6. **Master Tables** (20-29):
   - Aggregate attributes
   - Apply priority rules
   - Create unified customer profiles

7. **Metadata** (30-39):
   - Unification metadata
   - Filter lookup tables
   - Column lookup tables

### Connection Management
- Establishes a single connection for the entire workflow
- Uses connection pooling for efficiency
- Automatic reconnection on timeout
- Proper cleanup on completion or error

---

## Example Execution

### Input
```
SQL directory: snowflake_sql/unify/
Account: myorg-myaccount
User: analytics_user
Database: customer_data
Schema: id_unification
Warehouse: LARGE_WH
```

### Output
```
✓ Connected to Snowflake: myorg-myaccount
• Using database: customer_data, schema: id_unification

Starting Snowflake SQL Execution
• Database: customer_data
• Schema: id_unification

Executing: 01_create_graph.sql
✓ 01_create_graph.sql: Executed successfully

Executing: 02_extract_merge.sql
✓ 02_extract_merge.sql: Executed successfully
• Rows affected: 125000

Executing: 03_source_key_stats.sql
✓ 03_source_key_stats.sql: Executed successfully

Executing Unify Loop Before Canonicalization

--- Iteration 1 ---
✓ Iteration 1 completed
• Rows processed: 125000
• Updated records: 1500

--- Iteration 2 ---
✓ Iteration 2 completed
• Rows processed: 125000
• Updated records: 450

--- Iteration 3 ---
✓ Iteration 3 completed
• Rows processed: 125000
• Updated records: 0
✓ Loop converged after 3 iterations

• Creating alias table for final iteration
✓ Alias table 'unified_id_graph_unify_loop_final' created

Executing: 05_canonicalize.sql
✓ 05_canonicalize.sql: Executed successfully

[... continues with enrichment, master tables, metadata ...]

Execution Complete
• Files processed: 18/18
• Final unified_id_lookup rows: 98,500

• Disconnected from Snowflake
```

---

## Monitoring and Troubleshooting

### Check Execution Progress
During execution, you can monitor:
- Snowflake query history
- Table sizes and row counts
- Warehouse utilization
- Execution logs

### Common Issues

**Issue**: Connection timeout
**Solution**: Check network access, verify credentials, ensure the warehouse is running

**Issue**: Table not found
**Solution**: Verify database/schema permissions, check source table names in the YAML

**Issue**: Loop doesn't converge
**Solution**: Check data quality, increase max_iterations, review key validation rules

**Issue**: Warehouse suspended
**Solution**: Ensure auto-resume is enabled; manually resume the warehouse if needed

**Issue**: Permission denied
**Solution**: Verify database/schema permissions, check role assignments

### Performance Optimization
- Use a larger warehouse for faster execution (L, XL, 2XL, etc.)
- Enable a multi-cluster warehouse for concurrency
- Use clustering keys on frequently joined columns
- Monitor query profiles for optimization opportunities

---

## Post-Execution Validation
**DO NOT RUN THESE VALIDATIONS. PRESENT THEM TO THE USER TO RUN ON SNOWFLAKE.**

### Check Coverage
```sql
SELECT
  COUNT(*) as total_records,
  COUNT(unified_id) as records_with_id,
  COUNT(unified_id) * 100.0 / COUNT(*) as coverage_percent
FROM database.schema.enriched_customer_profiles;
```

### Verify Master Table
```sql
SELECT COUNT(*) as unified_customers
FROM database.schema.customer_master;
```

### Review Statistics
```sql
SELECT * FROM database.schema.unified_id_result_key_stats
WHERE from_table = '*';
```

---

## Success Criteria

Execution is successful when:
- ✅ All SQL files processed without critical errors
- ✅ Unification loop converged (updated_count = 0)
- ✅ Canonical IDs generated for all eligible records
- ✅ Enriched tables created successfully
- ✅ Master tables populated with attributes
- ✅ Coverage metrics meet expectations

---

## Authentication Examples

### Using Password
```bash
export SNOWFLAKE_PASSWORD='your_password'
/cdp-hybrid-idu:hybrid-execute-snowflake
```

### Using SSO
```bash
/cdp-hybrid-idu:hybrid-execute-snowflake
# Will prompt: Use SSO authentication? (y/n): y
# Opens browser for authentication
```

### Using Key-Pair
```bash
export SNOWFLAKE_PRIVATE_KEY_PATH='/path/to/key.p8'
export SNOWFLAKE_PRIVATE_KEY_PASSPHRASE='passphrase'
/cdp-hybrid-idu:hybrid-execute-snowflake
```

---

**Ready to execute your Snowflake ID unification workflow?**

Provide your SQL directory path and Snowflake connection details to begin!
285 commands/hybrid-generate-databricks.md (new file)
@@ -0,0 +1,285 @@
---
name: hybrid-generate-databricks
description: Generate Databricks Delta Lake SQL from YAML configuration for ID unification
---

# Generate Databricks SQL from YAML

## Overview

Generate a production-ready Databricks SQL workflow from your `unify.yml` configuration file. This command creates Delta Lake optimized SQL files with ACID transactions, clustering, and platform-specific function conversions.

---

## What You Need

### Required Inputs
1. **YAML Configuration File**: Path to your `unify.yml`
2. **Target Catalog**: Databricks Unity Catalog name
3. **Target Schema**: Schema name within the catalog

### Optional Inputs
4. **Source Catalog**: Catalog containing source tables (defaults to target catalog)
5. **Source Schema**: Schema containing source tables (defaults to target schema)
6. **Output Directory**: Where to save generated SQL (defaults to `databricks_sql/`)

---

## What I'll Do

### Step 1: Validation
- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections

### Step 2: SQL Generation
I'll call the **databricks-sql-generator agent** to:
- Execute the `yaml_unification_to_databricks.py` Python script
- Apply Databricks-specific SQL conversions:
  - `ARRAY_SIZE` → `SIZE`
  - `ARRAY_CONSTRUCT` → `ARRAY`
  - `OBJECT_CONSTRUCT` → `STRUCT`
  - `COLLECT_LIST` for aggregations
  - `FLATTEN` for array operations
  - `UNIX_TIMESTAMP()` for time functions
- Generate Delta Lake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs

### Step 3: Output Organization
Generate the complete SQL workflow in this structure:
```
databricks_sql/unify/
├── 01_create_graph.sql              # Initialize graph with USING DELTA
├── 02_extract_merge.sql             # Extract identities with validation
├── 03_source_key_stats.sql          # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql    # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql              # Canonical ID creation with key masks
├── 06_result_key_stats.sql          # Result statistics with histograms
├── 10_enrich_*.sql                  # Enrich each source table
├── 20_master_*.sql                  # Master tables with attribute aggregation
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules lookup
└── 32_column_lookup.sql             # Column mapping lookup
```

### Step 4: Summary Report
Provide:
- Total SQL files generated
- Estimated execution order
- Delta Lake optimizations included
- Key features enabled
- Next steps for execution

---

## Command Usage

### Basic Usage
```
/cdp-hybrid-idu:hybrid-generate-databricks

I'll prompt you for:
- YAML file path
- Target catalog
- Target schema
```

### Advanced Usage
Provide all parameters upfront:
```
YAML file: /path/to/unify.yml
Target catalog: my_catalog
Target schema: my_schema
Source catalog: source_catalog (optional)
Source schema: source_schema (optional)
Output directory: custom_output/ (optional)
```

---

## Generated SQL Features

### Delta Lake Optimizations
- **ACID Transactions**: `USING DELTA` for all tables
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **Table Properties**: Optimized for large-scale joins

### Advanced Capabilities
1. **Dynamic Iteration Count**: Auto-calculated based on:
   - Number of merge keys
   - Number of tables
   - Data complexity (configurable via YAML)

2. **Key-Specific Hashing**: Each key uses a unique cryptographic mask (see the sketch after this list):
   ```
   Key Type 1 (email):       0ffdbcf0c666ce190d
   Key Type 2 (customer_id): 61a821f2b646a4e890
   Key Type 3 (phone):       acd2206c3f88b3ee27
   ```

3. **Validation Rules**:
   - `valid_regexp`: Regex pattern filtering
   - `invalid_texts`: NOT IN clause with NULL handling
   - Combined AND logic for strict validation

4. **Master Table Attributes**:
   - Single value: `MAX_BY(attr, order)` with COALESCE
   - Array values: `SLICE(CONCAT(arrays), 1, N)`
   - Priority-based selection
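
How a per-key mask might be derived is sketched below. This is purely illustrative, since the generator's actual derivation scheme is not documented here; the point is that each key type hashes into its own namespace, so identical raw values under different keys never collide in the graph.

```python
# Illustrative per-key mask derivation; the real generator's scheme may differ.
import hashlib

def key_mask(key_name, length=18):
    """Deterministic hex mask for a key type, e.g. key_mask('email')."""
    return hashlib.sha256(key_name.encode("utf-8")).hexdigest()[:length]

print(key_mask("email"), key_mask("customer_id"), key_mask("phone"))
```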

### Platform-Specific Conversions
The generator automatically converts:
- Presto functions → Databricks equivalents
- Snowflake functions → Databricks equivalents
- Array operations → Spark SQL syntax
- Window functions → optimized versions
- Time functions → UNIX_TIMESTAMP()
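
A simplified sketch of how the conversions listed above can be applied; the actual script's rewriting may be more sophisticated than plain pattern substitution.

```python
# Simplified function-conversion sketch; plain regex substitution over the SQL text.
import re

CONVERSIONS = [
    (r"\bARRAY_SIZE\s*\(", "SIZE("),
    (r"\bARRAY_CONSTRUCT\s*\(", "ARRAY("),
    (r"\bOBJECT_CONSTRUCT\s*\(", "STRUCT("),
]

def to_databricks_sql(statement):
    for pattern, replacement in CONVERSIONS:
        statement = re.sub(pattern, replacement, statement, flags=re.IGNORECASE)
    return statement

assert to_databricks_sql("SELECT ARRAY_SIZE(ids)") == "SELECT SIZE(ids)"
```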

---

## Example Workflow

### Input YAML (`unify.yml`)
```yaml
name: customer_unification

keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A']

tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}

canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id]
    merge_iterations: 15

master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1}
```

### Generated Output
```
databricks_sql/unify/
├── 01_create_graph.sql               # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql              # Merges customer_profiles keys
├── 03_source_key_stats.sql           # Stats by table
├── 04_unify_loop_iteration_01.sql    # First iteration
├── 04_unify_loop_iteration_02.sql    # Second iteration
├── ...                               # Up to iteration_05
├── 05_canonicalize.sql               # Creates unified_id_lookup
├── 06_result_key_stats.sql           # Final statistics
├── 10_enrich_customer_profiles.sql   # Adds unified_id column
├── 20_master_customer_master.sql     # Creates customer_master table
├── 30_unification_metadata.sql       # Metadata
├── 31_filter_lookup.sql              # Validation rules
└── 32_column_lookup.sql              # Column mappings
```

---

## Next Steps After Generation

### Option 1: Execute Immediately
Use the execution command:
```
/cdp-hybrid-idu:hybrid-execute-databricks
```

### Option 2: Review First
1. Examine generated SQL files
2. Verify table names and transformations
3. Test with sample data
4. Execute manually or via the execution command

### Option 3: Customize
1. Modify generated SQL as needed
2. Add custom logic or transformations
3. Execute using the Databricks SQL editor or the execution command

---

## Technical Details

### Python Script Execution
The agent executes:
```bash
python3 scripts/databricks/yaml_unification_to_databricks.py \
  unify.yml \
  -tc my_catalog \
  -ts my_schema \
  -sc source_catalog \
  -ss source_schema \
  -o databricks_sql
```

### SQL File Naming Convention
- `01-09`: Setup and initialization
- `10-19`: Source table enrichment
- `20-29`: Master table creation
- `30-39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)

### Convergence Detection
Each loop iteration includes:
```sql
-- Check if graph changed
SELECT COUNT(*) FROM (
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM iteration_N
  EXCEPT
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM iteration_N_minus_1
) diff
```
Execution stops when the count reaches 0.
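
A sketch of how the generator might emit the numbered iteration files, each building `loop_{i}` from `loop_{i-1}`. The SQL template body is abbreviated and illustrative; `{catalog}`/`{schema}` are left as placeholders for the target settings.

```python
# Sketch: emit N numbered iteration files for the unify loop.
from pathlib import Path

TEMPLATE = """-- iteration {i}: extend the leader/follower closure by one merge step
CREATE OR REPLACE TABLE {{catalog}}.{{schema}}.unified_id_graph_unify_loop_{i}
USING DELTA CLUSTER BY (follower_id) AS
SELECT /* ... one merge step over loop_{prev} ... */ *
FROM {{catalog}}.{{schema}}.unified_id_graph_unify_loop_{prev};
"""

def emit_iteration_files(out_dir, iterations):
    for i in range(1, iterations + 1):
        path = Path(out_dir) / f"04_unify_loop_iteration_{i:02d}.sql"
        path.write_text(TEMPLATE.format(i=i, prev=i - 1))
```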

---

## Troubleshooting

### Common Issues

**Issue**: YAML validation error
**Solution**: Check YAML syntax, ensure proper indentation, verify all required fields

**Issue**: Table not found error
**Solution**: Verify source catalog/schema, check table names in the YAML

**Issue**: Python script error
**Solution**: Ensure Python 3.7+ is installed, check the pyyaml dependency

**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in the canonical_ids section of the YAML

---

## Success Criteria

Generated SQL will:
- ✅ Be valid Databricks Spark SQL
- ✅ Use Delta Lake for ACID transactions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built in
- ✅ Support incremental processing
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Databricks

---

**Ready to generate Databricks SQL from your YAML configuration?**

Provide your YAML file path and target catalog/schema to begin!
288 commands/hybrid-generate-snowflake.md (new file)
@@ -0,0 +1,288 @@
---
name: hybrid-generate-snowflake
description: Generate Snowflake SQL from YAML configuration for ID unification
---

# Generate Snowflake SQL from YAML

## Overview

Generate a production-ready Snowflake SQL workflow from your `unify.yml` configuration file. This command creates Snowflake-native SQL files with proper clustering, VARIANT support, and platform-specific function conversions.

---

## What You Need

### Required Inputs
1. **YAML Configuration File**: Path to your `unify.yml`
2. **Target Database**: Snowflake database name
3. **Target Schema**: Schema name within the database

### Optional Inputs
4. **Source Database**: Database containing source tables (defaults to target database)
5. **Source Schema**: Schema containing source tables (defaults to PUBLIC)
6. **Output Directory**: Where to save generated SQL (defaults to `snowflake_sql/`)

---

## What I'll Do

### Step 1: Validation
- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections

### Step 2: SQL Generation
I'll call the **snowflake-sql-generator agent** to:
- Execute the `yaml_unification_to_snowflake.py` Python script
- Generate Snowflake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs

### Step 3: Output Organization
Generate the complete SQL workflow in this structure:
```
snowflake_sql/unify/
├── 01_create_graph.sql              # Initialize graph table
├── 02_extract_merge.sql             # Extract identities with validation
├── 03_source_key_stats.sql          # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql    # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql              # Canonical ID creation with key masks
├── 06_result_key_stats.sql          # Result statistics with histograms
├── 10_enrich_*.sql                  # Enrich each source table
├── 20_master_*.sql                  # Master tables with attribute aggregation
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules lookup
└── 32_column_lookup.sql             # Column mapping lookup
```

### Step 4: Summary Report
Provide:
- Total SQL files generated
- Estimated execution order
- Snowflake optimizations included
- Key features enabled
- Next steps for execution

---

## Command Usage

### Basic Usage
```
/cdp-hybrid-idu:hybrid-generate-snowflake

I'll prompt you for:
- YAML file path
- Target database
- Target schema
```

### Advanced Usage
Provide all parameters upfront:
```
YAML file: /path/to/unify.yml
Target database: my_database
Target schema: my_schema
Source database: source_database (optional)
Source schema: PUBLIC (optional, defaults to PUBLIC)
Output directory: custom_output/ (optional)
```

---

## Generated SQL Features

### Snowflake Optimizations
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **VARIANT Support**: Flexible data structures for arrays and objects
- **Native Functions**: Snowflake-specific optimized functions

### Advanced Capabilities
1. **Dynamic Iteration Count**: Auto-calculated based on:
   - Number of merge keys
   - Number of tables
   - Data complexity (configurable via YAML)

2. **Key-Specific Hashing**: Each key uses a unique cryptographic mask:
   ```
   Key Type 1 (email):       0ffdbcf0c666ce190d
   Key Type 2 (customer_id): 61a821f2b646a4e890
   Key Type 3 (phone):       acd2206c3f88b3ee27
   ```

3. **Validation Rules** (see the sketch after this list):
   - `valid_regexp`: REGEXP_LIKE pattern filtering
   - `invalid_texts`: NOT IN clause with proper NULL handling
   - Combined AND logic for strict validation

4. **Master Table Attributes**:
   - Single value: `MAX_BY(attr, order)` with COALESCE
   - Array values: `ARRAY_SLICE(ARRAY_CAT(arrays), 0, N)`
   - Priority-based selection
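
As referenced in item 3 above, here is a sketch of how those rules could be rendered into a Snowflake WHERE predicate. This is an illustrative helper, not the generator's actual code:

```python
# Hypothetical rendering of a key's validation rules into a WHERE predicate,
# combining valid_regexp and invalid_texts with AND as described above.
def validation_predicate(column, valid_regexp=None, invalid_texts=None):
    clauses = [f"{column} IS NOT NULL"]  # the NULL handling noted above
    if valid_regexp:
        clauses.append(f"REGEXP_LIKE({column}, '{valid_regexp}')")
    if invalid_texts:
        quoted = ", ".join("'" + t.replace("'", "''") + "'" for t in invalid_texts)
        clauses.append(f"{column} NOT IN ({quoted})")
    return " AND ".join(clauses)

# validation_predicate("email_std", ".*@.*", ["", "N/A", "null"]) ->
# "email_std IS NOT NULL AND REGEXP_LIKE(email_std, '.*@.*')
#  AND email_std NOT IN ('', 'N/A', 'null')"
```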

### Platform-Specific Conversions
The generator automatically converts:
- Presto functions → Snowflake equivalents
- Databricks functions → Snowflake equivalents
- Array operations → ARRAY_CONSTRUCT/FLATTEN syntax
- Window functions → optimized versions
- Time functions → DATE_PART(epoch_second, CURRENT_TIMESTAMP())

---

## Example Workflow

### Input YAML (`unify.yml`)
```yaml
name: customer_unification

keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A']

tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}

canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id]
    merge_iterations: 15

master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1}
```

### Generated Output
```
snowflake_sql/unify/
├── 01_create_graph.sql               # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql              # Merges customer_profiles keys
├── 03_source_key_stats.sql           # Stats by table
├── 04_unify_loop_iteration_01.sql    # First iteration
├── 04_unify_loop_iteration_02.sql    # Second iteration
├── ...                               # Up to iteration_05
├── 05_canonicalize.sql               # Creates unified_id_lookup
├── 06_result_key_stats.sql           # Final statistics
├── 10_enrich_customer_profiles.sql   # Adds unified_id column
├── 20_master_customer_master.sql     # Creates customer_master table
├── 30_unification_metadata.sql       # Metadata
├── 31_filter_lookup.sql              # Validation rules
└── 32_column_lookup.sql              # Column mappings
```

---

## Next Steps After Generation

### Option 1: Execute Immediately
Use the execution command:
```
/cdp-hybrid-idu:hybrid-execute-snowflake
```

### Option 2: Review First
1. Examine generated SQL files
2. Verify table names and transformations
3. Test with sample data
4. Execute manually or via the execution command

### Option 3: Customize
1. Modify generated SQL as needed
2. Add custom logic or transformations
3. Execute using a Snowflake SQL worksheet or the execution command

---

## Technical Details

### Python Script Execution
The agent executes:
```bash
python3 scripts/snowflake/yaml_unification_to_snowflake.py \
  unify.yml \
  -d my_database \
  -s my_schema \
  -sd source_database \
  -ss source_schema \
  -o snowflake_sql
```

### SQL File Naming Convention
- `01-09`: Setup and initialization
- `10-19`: Source table enrichment
- `20-29`: Master table creation
- `30-39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)

### Convergence Detection
Each loop iteration includes:
```sql
-- Check if graph changed
SELECT COUNT(*) FROM (
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM iteration_N
  EXCEPT
  SELECT leader_ns, leader_id, follower_ns, follower_id
  FROM iteration_N_minus_1
) diff
```
Execution stops when the count reaches 0.

### Snowflake-Specific Features
- **LATERAL FLATTEN**: Array expansion for id_ns_array processing
- **ARRAY_CONSTRUCT**: Building arrays from multiple columns
- **OBJECT_CONSTRUCT**: Creating structured objects for key-value pairs
- **ARRAYS_OVERLAP**: Checking array membership
- **SPLIT_PART**: String splitting for leader key parsing

---

## Troubleshooting

### Common Issues

**Issue**: YAML validation error
**Solution**: Check YAML syntax, ensure proper indentation, verify all required fields

**Issue**: Table not found error
**Solution**: Verify source database/schema, check table names in the YAML

**Issue**: Python script error
**Solution**: Ensure Python 3.7+ is installed, check the pyyaml dependency

**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in the canonical_ids section of the YAML

**Issue**: VARIANT column errors
**Solution**: Snowflake VARIANT type handling is automatic; ensure proper casting in custom SQL

---

## Success Criteria

Generated SQL will:
- ✅ Be valid Snowflake SQL
- ✅ Use native Snowflake functions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built in
- ✅ Support VARIANT types for flexible data
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Snowflake

---

**Ready to generate Snowflake SQL from your YAML configuration?**

Provide your YAML file path and target database/schema to begin!
308 commands/hybrid-setup.md (new file)
@@ -0,0 +1,308 @@
---
name: hybrid-setup
description: Complete end-to-end hybrid ID unification setup - automatically analyzes tables, generates config, creates SQL, and executes workflow for Snowflake and Databricks
---

# Hybrid ID Unification Complete Setup

## Overview

I'll guide you through the complete hybrid ID unification setup process for the Snowflake and/or Databricks platforms. This is an **automated, end-to-end workflow** that will:

1. **Analyze your tables automatically** using platform MCP tools with strict PII detection
2. **Generate YAML configuration** from real schema and data analysis
3. **Choose target platform(s)** (Snowflake, Databricks, or both)
4. **Generate platform-specific SQL** optimized for each engine
5. **Execute workflows** with convergence detection and monitoring
6. **Provide deployment guidance** and operating instructions

**Key Features**:
- 🔍 **Automated Table Analysis**: Uses Snowflake/Databricks MCP tools to analyze actual tables
- ✅ **Strict PII Detection**: Zero tolerance - only includes tables with real user identifiers
- 📊 **Real Data Validation**: Queries actual data to validate patterns and quality
- 🎯 **Smart Recommendations**: Expert analysis provides merge strategy and priorities
- 🚀 **End-to-End Automation**: From table analysis to workflow execution

---

## What You Need to Provide

### 1. Unification Requirements (For Automated Analysis)
- **Platform**: Snowflake or Databricks
- **Tables**: List of source tables to analyze
  - Format (Snowflake): `database.schema.table` or `schema.table` or `table`
  - Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
- **Canonical ID Name**: Name for your unified ID (e.g., `td_id`, `unified_customer_id`)
- **Merge Iterations**: Number of unification loops (default: 10)
- **Master Tables**: (Optional) Attribute aggregation specifications

**Note**: The system will automatically:
- Extract user identifiers from actual table schemas
- Validate data patterns from real data
- Apply appropriate validation rules based on data analysis
- Generate merge strategy recommendations

### 2. Platform Selection
- **Databricks**: Unity Catalog with Delta Lake
- **Snowflake**: Database with proper permissions
- **Both**: Generate SQL for both platforms

### 3. Target Configurations

**For Databricks**:
- **Catalog**: Target catalog name
- **Schema**: Target schema name
- **Source Catalog** (optional): Source data catalog
- **Source Schema** (optional): Source data schema

**For Snowflake**:
- **Database**: Target database name
- **Schema**: Target schema name
- **Source Schema** (optional): Source data schema

### 4. Execution Credentials (if executing)

**For Databricks**:
- **Server Hostname**: your-workspace.databricks.com
- **HTTP Path**: /sql/1.0/warehouses/your-warehouse-id
- **Authentication**: PAT (Personal Access Token) or OAuth

**For Snowflake**:
- **Account**: Snowflake account name
- **User**: Username
- **Password**: Password, or use SSO/key-pair
- **Warehouse**: Compute warehouse name

---

## What I'll Do

### Step 1: Automated YAML Configuration Generation
I'll use the **hybrid-unif-config-creator** command to automatically generate your `unify.yml` file:

**Automated Analysis Approach** (Recommended):
- Analyze your actual tables using platform MCP tools (Snowflake/Databricks)
- Extract user identifiers with STRICT PII detection (zero tolerance for guessing)
- Validate data patterns from real table data
- Generate unify.yml with exact template compliance
- Only include tables with actual user identifiers
- Document excluded tables with detailed reasons

**What I'll do**:
- Call the **hybrid-unif-keys-extractor agent** to analyze tables
- Query actual schema and data using platform MCP tools
- Detect valid user identifiers (email, customer_id, phone, etc.)
- Exclude tables without PII, with full documentation
- Generate a production-ready unify.yml automatically

**Alternative - Manual Configuration**:
- If MCP tools are unavailable, I'll guide you through manual configuration
- Interactive prompts for keys, tables, and validation rules
- Step-by-step YAML building with validation

### Step 2: Platform Selection and Configuration
I'll help you:
- Choose between Databricks, Snowflake, or both
- Collect platform-specific configuration (catalog/database, schema names)
- Determine a source/target separation strategy
- Decide on execution or generation-only mode

### Step 3: SQL Generation

**For Databricks** (if selected):
I'll call the **databricks-sql-generator agent** to:
- Execute the `yaml_unification_to_databricks.py` script
- Generate a Delta Lake optimized SQL workflow
- Create the output directory: `databricks_sql/unify/`
- Generate 15+ SQL files with proper execution order

**For Snowflake** (if selected):
I'll call the **snowflake-sql-generator agent** to:
- Execute the `yaml_unification_to_snowflake.py` script
- Generate a Snowflake-native SQL workflow
- Create the output directory: `snowflake_sql/unify/`
- Generate 15+ SQL files with proper execution order

### Step 4: Workflow Execution (Optional)

**For Databricks** (if execution requested):
I'll call the **databricks-workflow-executor agent** to:
- Execute the `databricks_sql_executor.py` script
- Connect to your Databricks workspace
- Run SQL files in proper sequence
- Monitor convergence and progress
- Optimize Delta tables
- Report final statistics

**For Snowflake** (if execution requested):
I'll call the **snowflake-workflow-executor agent** to:
- Execute the `snowflake_sql_executor.py` script
- Connect to your Snowflake account
- Run SQL files in proper sequence
- Monitor convergence and progress
- Report final statistics

### Step 5: Deployment Guidance
I'll provide:
- Configuration summary
- Generated files overview
- Deployment instructions
- Operating guidelines
- Monitoring recommendations

---

## Interactive Workflow

This command orchestrates the complete end-to-end flow by calling specialized commands in sequence:

### Phase 1: Configuration Creation
**I'll ask you for**:
- Platform (Snowflake or Databricks)
- Tables to analyze
- Canonical ID name
- Merge iterations

**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-unif-config-creator` internally
- Analyze your tables automatically
- Generate `unify.yml` with strict PII detection
- Show you the configuration for review

### Phase 2: SQL Generation
**I'll ask you**:
- Which platform(s) to generate SQL for (can be different from the source)
- Output directory preferences

**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-generate-snowflake` (if Snowflake selected)
- Call `/cdp-hybrid-idu:hybrid-generate-databricks` (if Databricks selected)
- Generate 15+ optimized SQL files per platform
- Show you the execution plan

### Phase 3: Workflow Execution (Optional)
**I'll ask you**:
- Do you want to execute now or later?
- Connection credentials, if executing

**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-execute-snowflake` (if Snowflake selected)
- Call `/cdp-hybrid-idu:hybrid-execute-databricks` (if Databricks selected)
- Monitor convergence and progress
- Report final statistics

**Throughout the process**:
- **Questions**: When I need your input
- **Suggestions**: Recommended approaches based on best practices
- **Validation**: Real-time checks on your choices
- **Explanations**: Help you understand concepts and options

---

## Expected Output

### Files Created (Platform-specific):

**For Databricks**:
```
databricks_sql/unify/
├── 01_create_graph.sql              # Initialize identity graph
├── 02_extract_merge.sql             # Extract and merge identities
├── 03_source_key_stats.sql          # Source statistics
├── 04_unify_loop_iteration_*.sql    # Iterative unification (N files)
├── 05_canonicalize.sql              # Canonical ID creation
├── 06_result_key_stats.sql          # Result statistics
├── 10_enrich_*.sql                  # Source table enrichment (N files)
├── 20_master_*.sql                  # Master table creation (N files)
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules
└── 32_column_lookup.sql             # Column mappings
```

**For Snowflake**:
```
snowflake_sql/unify/
├── 01_create_graph.sql              # Initialize identity graph
├── 02_extract_merge.sql             # Extract and merge identities
├── 03_source_key_stats.sql          # Source statistics
├── 04_unify_loop_iteration_*.sql    # Iterative unification (N files)
├── 05_canonicalize.sql              # Canonical ID creation
├── 06_result_key_stats.sql          # Result statistics
├── 10_enrich_*.sql                  # Source table enrichment (N files)
├── 20_master_*.sql                  # Master table creation (N files)
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules
└── 32_column_lookup.sql             # Column mappings
```

**Configuration**:
```
unify.yml    # YAML configuration (created interactively)
```

---

## Success Criteria

All generated files will:
- ✅ Be platform-optimized and production-ready
- ✅ Use proper SQL dialects (Databricks Spark SQL or Snowflake SQL)
- ✅ Include convergence detection logic
- ✅ Support incremental processing
- ✅ Generate comprehensive statistics
- ✅ Work without modification on target platforms

---

## Getting Started

**Ready to begin?** I'll use the **hybrid-unif-config-creator** to automatically analyze your tables and generate the YAML configuration.

Please provide:

1. **Platform**: Which platform contains your data?
   - Snowflake or Databricks

2. **Tables**: Which source tables should I analyze?
   - Format (Snowflake): `database.schema.table` or `schema.table` or `table`
   - Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
   - Example: `customer_db.public.customers`, `orders`, `web_events.user_activity`

3. **Canonical ID Name**: What should I call the unified ID?
   - Example: `td_id`, `unified_customer_id`, `master_id`
   - Default: `td_id`

4. **Merge Iterations** (optional): How many unification loops?
   - Default: 10
   - Range: 2-30

5. **Target Platform(s)** for SQL generation:
   - Same as source, or generate for both platforms

**Example**:
```
I want to set up hybrid ID unification for:

Platform: Snowflake
Tables:
- customer_db.public.customer_profiles
- customer_db.public.orders
- marketing_db.public.campaigns
- event_db.public.web_events

Canonical ID: unified_customer_id
Merge Iterations: 10
Generate SQL for: Snowflake (or both Snowflake and Databricks)
```

**What I'll do next**:
1. ✅ Analyze your tables using Snowflake MCP tools
2. ✅ Extract user identifiers with strict PII detection
3. ✅ Generate unify.yml automatically
4. ✅ Generate platform-specific SQL files
5. ✅ Execute workflow (if requested)
6. ✅ Provide deployment guidance

---

**Let's get started with your hybrid ID unification setup!**
491
commands/hybrid-unif-config-creator.md
Normal file
491
commands/hybrid-unif-config-creator.md
Normal file
@@ -0,0 +1,491 @@
---
name: hybrid-unif-config-creator
description: Auto-generate unify.yml configuration for Snowflake/Databricks by extracting user identifiers from actual tables using strict PII detection
---

# Unify Configuration Creator for Snowflake/Databricks

## Overview

I'll automatically generate a production-ready `unify.yml` configuration file for your Snowflake or Databricks ID unification by:

1. **Analyzing your actual tables** using platform-specific MCP tools
2. **Extracting user identifiers** with zero-tolerance PII detection
3. **Validating data patterns** from real table data
4. **Generating unify.yml** using the exact template format
5. **Providing recommendations** for merge strategies and priorities

**This command uses STRICT analysis - only tables with actual user identifiers will be included.**

---

## What You Need to Provide

### 1. Platform Selection
- **Snowflake**: For Snowflake databases
- **Databricks**: For Databricks Unity Catalog tables

### 2. Tables to Analyze
Provide the tables you want to analyze for ID unification:
- **Format (Snowflake)**: `database.schema.table` or `schema.table` or `table`
- **Format (Databricks)**: `catalog.schema.table` or `schema.table` or `table`
- **Example**: `customer_data.public.customers`, `orders`, `web_events.user_activity`

### 3. Canonical ID Configuration
- **Name**: Name for your unified ID (default: `td_id`)
- **Merge Iterations**: Number of unification loop iterations (default: 10)
- **Incremental Iterations**: Iterations for incremental processing (default: 5)

### 4. Output Configuration (Optional)
- **Output File**: Where to save unify.yml (default: `unify.yml`)
- **Template Path**: Path to a custom template, if using one (default: the built-in exact template)

---

## What I'll Do

### Step 1: Platform Detection and Validation
```
1. Confirm platform (Snowflake or Databricks)
2. Verify MCP tools are available for the platform
3. Set up platform-specific query patterns
4. Inform you of the analysis approach
```

### Step 2: Key Extraction with hybrid-unif-keys-extractor Agent
I'll launch the **hybrid-unif-keys-extractor agent** to:

**Schema Analysis**:
- Use platform MCP tools to describe each table
- Extract exact column names and data types
- Identify accessible vs inaccessible tables

**User Identifier Detection**:
- Apply STRICT matching rules for user identifiers (see the sketch after this list):
  - ✅ Email columns (email, email_std, email_address, etc.)
  - ✅ Phone columns (phone, phone_number, mobile_phone, etc.)
  - ✅ User IDs (user_id, customer_id, account_id, etc.)
  - ✅ Cookie/Device IDs (td_client_id, cookie_id, etc.)
  - ❌ System columns (id, created_at, time, etc.)
  - ❌ Complex types (arrays, maps, objects, variants, structs)
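
To make those rules concrete, here is a minimal Python sketch of a name- and type-based classifier in the spirit of the list above; the pattern lists and category names are illustrative assumptions, not the agent's actual rule set.

```python
import re

# Illustrative patterns - the agent's real rule set is stricter and longer.
IDENTIFIER_PATTERNS = {
    "email":   re.compile(r"(^|_)e?mail(_std|_address)?$"),
    "phone":   re.compile(r"(^|_)(phone|mobile)(_number|_phone)?$"),
    "user_id": re.compile(r"(^|_)(user|customer|account)_id$"),
    "cookie":  re.compile(r"(td_client_id|cookie_id|device_id)$"),
}
SYSTEM_COLUMNS = {"id", "created_at", "updated_at", "time", "timestamp"}
COMPLEX_TYPES = {"array", "map", "object", "variant", "struct"}

def classify_column(name: str, data_type: str) -> str | None:
    """Return the identifier category for a column, or None if it is rejected."""
    lowered = name.lower()
    if lowered in SYSTEM_COLUMNS:
        return None  # system column - never a user identifier
    if data_type.lower() in COMPLEX_TYPES:
        return None  # complex types cannot be unification keys
    for category, pattern in IDENTIFIER_PATTERNS.items():
        if pattern.search(lowered):
            return category
    return None      # strict mode: no match means no guessing
```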

**Data Validation**:
- Query actual MIN/MAX values from each identified column (see the sketch below)
- Analyze data patterns and quality
- Count unique values per identifier
- Detect data quality issues
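
As a rough illustration of the validation queries, assuming a generic profiling statement rather than the agent's exact SQL:

```python
# Hypothetical profiling query for one candidate identifier column; the agent
# issues queries like this through the platform's MCP tools.
PROFILE_SQL = """
SELECT
    MIN({col})              AS min_value,
    MAX({col})              AS max_value,
    COUNT(DISTINCT {col})   AS distinct_values,
    COUNT(*) - COUNT({col}) AS null_count
FROM {table}
"""

def build_profile_query(table: str, col: str) -> str:
    """Render the profiling SQL for a fully qualified table and column."""
    return PROFILE_SQL.format(table=table, col=col)
```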

**Table Classification**:
- **INCLUDED**: Tables with valid user identifiers
- **EXCLUDED**: Tables without user identifiers (with the reasons fully documented)

**Expert Analysis**:
- 3 SQL experts review the data
- Provide priority recommendations
- Suggest validation rules based on actual data patterns

### Step 3: Unify.yml Generation

**CRITICAL**: I use the **EXACT BUILT-IN template structure** (embedded in the hybrid-unif-keys-extractor agent).

**Template Usage Process** (a minimal sketch follows the block):
```
1. Receive structured data from hybrid-unif-keys-extractor agent:
   - Keys with validation rules
   - Tables with column mappings
   - Canonical ID configuration
   - Master tables specification

2. Use BUILT-IN template structure (see agent documentation)

3. ONLY replace these specific values:
   - Line 1: name: {canonical_id_name}
   - keys section: actual keys found
   - tables section: actual tables with actual columns
   - canonical_ids section: name and merge_by_keys
   - master_tables section: [] or user specifications

4. PRESERVE everything else:
   - ALL comment blocks (#####...)
   - ALL comment text ("Declare Validation logic", etc.)
   - ALL spacing and indentation (2 spaces per level)
   - ALL blank lines
   - EXACT YAML structure

5. Use Write tool to save populated unify.yml
```
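
As a loose illustration of what this population step amounts to, here is a sketch using Python's `string.Template` on a tiny hypothetical excerpt; the real built-in template is much larger and is preserved byte-for-byte outside the listed placeholders.

```python
from string import Template

# Hypothetical excerpt - the actual built-in template, including its comment
# blocks, must be kept verbatim; only the placeholders below are filled in.
TEMPLATE = Template("""\
name: $canonical_id_name

keys:
$keys_block
canonical_ids:
  - name: $canonical_id_name
    merge_by_keys: [$merge_keys]
    merge_iterations: $iterations
""")

def render_unify_yml(canonical_id_name: str, keys_block: str,
                     merge_keys: list[str], iterations: int) -> str:
    """Fill only the allowed values, leaving the template's shape intact."""
    return TEMPLATE.substitute(
        canonical_id_name=canonical_id_name,
        keys_block=keys_block,
        merge_keys=", ".join(merge_keys),
        iterations=str(iterations),
    )
```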

**I'll generate**:

**Section 1: Canonical ID Name**
```yaml
name: {your_canonical_id_name}
```

**Section 2: Keys with Validation**
```yaml
keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']
  - name: phone_number
    invalid_texts: ['', 'N/A', 'null']
```
*Populated with actual keys found in your tables*

**Section 3: Tables with Key Column Mappings**
```yaml
tables:
  - database: {database/catalog}
    table: {table_name}
    key_columns:
      - {column: actual_column_name, key: mapped_key}
      - {column: another_column, key: another_key}
```
*Only tables with valid user identifiers, with EXACT column names from schema analysis*

**Section 4: Canonical IDs Configuration**
```yaml
canonical_ids:
  - name: {your_canonical_id_name}
    merge_by_keys: [email, customer_id, phone_number]
    merge_iterations: 15
```
*Based on extracted keys and your configuration*

**Section 5: Master Tables (Optional)**
```yaml
master_tables:
  - name: {canonical_id_name}_master_table
    canonical_id: {canonical_id_name}
    attributes:
      - name: best_email
        source_columns:
          - {table: table1, column: email, order: last, order_by: time, priority: 1}
          - {table: table2, column: email_address, order: last, order_by: time, priority: 2}
```
*If you request master table configuration, I'll help set up attribute aggregation*

### Step 4: Validation and Review

After generation:
```
1. Show complete unify.yml content
2. Highlight key sections:
   - Keys found: [list]
   - Tables included: [count]
   - Tables excluded: [count] with reasons
   - Merge strategy: [keys and priorities]
3. Provide recommendations for optimization
4. Ask for your approval before saving
```

### Step 5: File Output

```
1. Write unify.yml to specified location
2. Create backup of existing file if present
3. Provide file summary:
   - Keys configured: X
   - Tables configured: Y
   - Validation rules: Z
4. Show next steps for using the configuration
```

---

## Example Workflow

**Input**:
```
Platform: Snowflake
Tables:
- customer_data.public.customers
- customer_data.public.orders
- web_data.public.events
Canonical ID Name: unified_customer_id
Output: snowflake_unify.yml
```

**Process**:
```
✓ Platform: Snowflake MCP tools detected
✓ Analyzing 3 tables...

Schema Analysis:
✓ customer_data.public.customers - 12 columns
✓ customer_data.public.orders - 8 columns
✓ web_data.public.events - 15 columns

User Identifier Detection:
✓ customers: email, customer_id (2 identifiers)
✓ orders: customer_id, email_address (2 identifiers)
✗ events: NO user identifiers found
  Available columns: event_id, session_id, page_url, timestamp, ...
  Reason: Contains only event tracking data - no PII

Data Analysis:
✓ email: 45,123 unique values, format valid
✓ customer_id: 45,089 unique values, numeric
✓ email_address: 12,456 unique values, format valid

Expert Analysis Complete:
Priority 1: customer_id (most stable, highest coverage)
Priority 2: email (good coverage, some quality issues)
Priority 3: phone_number (not found)

Generating unify.yml...
✓ Keys section: 2 keys configured
✓ Tables section: 2 tables configured
✓ Canonical IDs: unified_customer_id
✓ Validation rules: Applied based on data patterns

Tables EXCLUDED:
- web_data.public.events: No user identifiers
```

**Output (snowflake_unify.yml)**:
```yaml
name: unified_customer_id

keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']

tables:
  - database: customer_data
    table: customers
    key_columns:
      - {column: email, key: email}
      - {column: customer_id, key: customer_id}
  - database: customer_data
    table: orders
    key_columns:
      - {column: email_address, key: email}
      - {column: customer_id, key: customer_id}

canonical_ids:
  - name: unified_customer_id
    merge_by_keys: [customer_id, email]
    merge_iterations: 15

master_tables: []
```

---

## Key Features

### 🔍 **STRICT PII Detection**
- Zero tolerance for guessing
- Only includes tables with actual user identifiers
- Documents why tables are excluded
- Based on REAL schema and data analysis

### ✅ **Exact Template Compliance**
- Uses BUILT-IN exact template structure (embedded in hybrid-unif-keys-extractor agent)
- NO modifications to template format
- Preserves all comment sections
- Maintains exact YAML structure
- Portable across all systems

### 📊 **Real Data Analysis**
- Queries actual MIN/MAX values
- Counts unique identifiers
- Validates data patterns
- Identifies quality issues

### 🎯 **Platform-Aware**
- Uses correct MCP tools for each platform
- Respects platform naming conventions
- Applies platform-specific data type rules
- Generates platform-compatible SQL references

### 📋 **Complete Documentation**
- Documents all excluded tables with reasons
- Lists available columns for excluded tables
- Explains why columns don't qualify as user identifiers
- Provides expert recommendations

---

## Output Format

**The generated unify.yml will have EXACTLY this structure:**

```yaml
name: {canonical_id_name}
#####################################################
##
##Declare Validation logic for unification keys
##
#####################################################
keys:
  - name: {key1}
    valid_regexp: "{pattern}"
    invalid_texts: ['{val1}', '{val2}', '{val3}']
  - name: {key2}
    invalid_texts: ['{val1}', '{val2}', '{val3}']

#####################################################
##
##Declare databases, tables, and keys to use during unification
##
#####################################################

tables:
  - database: {db/catalog}
    table: {table}
    key_columns:
      - {column: {col}, key: {key}}

#####################################################
##
##Declare hierarchy for unification. Define keys to use for each level.
##
#####################################################

canonical_ids:
  - name: {canonical_id_name}
    merge_by_keys: [{key1}, {key2}, ...]
    merge_iterations: {number}

#####################################################
##
##Declare Similar Attributes and standardize into a single column
##
#####################################################

master_tables:
  - name: {canonical_id_name}_master_table
    canonical_id: {canonical_id_name}
    attributes:
      - name: {attribute}
        source_columns:
          - {table: {t}, column: {c}, order: last, order_by: time, priority: 1}
```

**NO deviations from this structure - EXACT template compliance guaranteed.**

---

## Prerequisites

### Required
- ✅ Snowflake or Databricks platform access
- ✅ Platform-specific MCP tools configured (may use fallback if unavailable)
- ✅ Read permissions on tables to be analyzed
- ✅ Tables must exist and be accessible

### Optional
- Custom unify.yml template path (if not using default)
- Master table attribute specifications
- Custom validation rules

---

## Expected Timeline

| Step | Duration |
|------|----------|
| Platform detection | < 1 min |
| Schema analysis (per table) | 5-10 sec |
| Data analysis (per identifier) | 10-20 sec |
| Expert analysis | 1-2 min |
| YAML generation | < 1 min |
| **Total (for 5 tables)** | **~3-5 min** |

---

## Error Handling

### Common Issues

**Issue**: MCP tools not available for platform
**Solution**:
- I'll inform you and provide fallback options
- You can provide schema information manually
- I'll still generate unify.yml with validation warnings

**Issue**: No tables have user identifiers
**Solution**:
- I'll show you why tables were excluded
- Suggest alternative tables to analyze
- Explain what constitutes a user identifier

**Issue**: Table not accessible
**Solution**:
- Document which tables are inaccessible
- Continue with accessible tables
- Recommend permission checks

**Issue**: Complex data types found
**Solution**:
- Exclude complex type columns (arrays, structs, maps)
- Explain why they can't be used for unification
- Suggest alternative columns if available

---

## Success Criteria

Generated unify.yml will:
- ✅ Use EXACT template structure - NO modifications
- ✅ Contain ONLY tables with validated user identifiers
- ✅ Include ONLY columns that actually exist in tables
- ✅ Have validation rules based on actual data patterns
- ✅ Be ready for immediate use with hybrid-generate-snowflake or hybrid-generate-databricks
- ✅ Work without any manual edits
- ✅ Include comprehensive documentation in comments

---

## Next Steps After Generation

1. **Review the generated unify.yml**
   - Verify tables and columns are correct
   - Check validation rules are appropriate
   - Review merge strategy and priorities

2. **Generate SQL for your platform**:
   - Snowflake: `/cdp-hybrid-idu:hybrid-generate-snowflake`
   - Databricks: `/cdp-hybrid-idu:hybrid-generate-databricks`

3. **Execute the workflow**:
   - Snowflake: `/cdp-hybrid-idu:hybrid-execute-snowflake`
   - Databricks: `/cdp-hybrid-idu:hybrid-execute-databricks`

4. **Monitor convergence and results**

---

## Getting Started

**Ready to begin?**

Please provide:

1. **Platform**: Snowflake or Databricks
2. **Tables**: List of tables to analyze (full paths)
3. **Canonical ID Name**: Name for your unified ID (e.g., `unified_customer_id`)
4. **Output File** (optional): Where to save unify.yml (default: `unify.yml`)

**Example**:
```
Platform: Snowflake
Tables:
- customer_db.public.customers
- customer_db.public.orders
- marketing_db.public.campaigns
Canonical ID: unified_id
Output: snowflake_unify.yml
```

---

**I'll analyze your tables and generate a production-ready unify.yml configuration!**
337
commands/hybrid-unif-config-validate.md
Normal file
337
commands/hybrid-unif-config-validate.md
Normal file
@@ -0,0 +1,337 @@
---
name: hybrid-unif-config-validate
description: Validate YAML configuration for hybrid ID unification before SQL generation
---

# Validate Hybrid ID Unification YAML

## Overview

Validate your `unify.yml` configuration file to ensure it's properly structured and ready for SQL generation. This command checks syntax, structure, and validation rules, and provides recommendations for optimization.

---

## What You Need

### Required Input
1. **YAML Configuration File**: Path to your `unify.yml`

---

## What I'll Do

### Step 1: File Validation
- Verify file exists and is readable
- Check YAML syntax (proper indentation, quotes, etc.)
- Ensure all required sections are present

### Step 2: Structure Validation
Check presence and structure of:
- **name**: Unification project name
- **keys**: Key definitions with validation rules
- **tables**: Source tables with key column mappings
- **canonical_ids**: Canonical ID configuration
- **master_tables**: Master table definitions (optional)

### Step 3: Content Validation
Validate individual sections:

**Keys Section**:
- ✓ Each key has a unique name
- ✓ `valid_regexp` is a valid regex pattern (if provided)
- ✓ `invalid_texts` is an array (if provided)
- ⚠ Recommend validation rules if missing

**Tables Section**:
- ✓ Each table has a name
- ✓ Each table has at least one key_column
- ✓ All referenced keys exist in keys section
- ✓ Column names are valid identifiers
- ⚠ Check for duplicate table definitions

**Canonical IDs Section**:
- ✓ Has a name (will be the canonical ID column name)
- ✓ `merge_by_keys` references existing keys
- ✓ `merge_iterations` is a positive integer (if provided)
- ⚠ Suggest optimal iteration count if not specified

**Master Tables Section** (if present):
- ✓ Each master table has a name and canonical_id
- ✓ Referenced canonical_id exists
- ✓ Attributes have proper structure
- ✓ Source tables in attributes exist
- ✓ Priority values are valid
- ⚠ Check for attribute conflicts

### Step 4: Cross-Reference Validation
Check that every reference resolves (a minimal sketch follows this list):
- ✓ All merge_by_keys exist in keys section
- ✓ All key_columns reference defined keys
- ✓ All master table source tables exist in tables section
- ✓ Canonical ID names don't conflict with existing columns
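
A minimal sketch of these structural and cross-reference checks, assuming PyYAML (`pip install pyyaml`); the actual command performs many more checks than shown here.

```python
import re
import yaml  # PyYAML

REQUIRED_SECTIONS = ("name", "keys", "tables", "canonical_ids")

def validate_unify_yml(path: str) -> list[str]:
    """Return a list of error strings; an empty list means the checks pass."""
    errors: list[str] = []
    with open(path) as f:
        cfg = yaml.safe_load(f)

    for section in REQUIRED_SECTIONS:
        if section not in cfg:
            errors.append(f"Missing required section: {section}")
    if errors:
        return errors  # cannot cross-reference without the base sections

    key_names = {k["name"] for k in cfg["keys"]}

    # valid_regexp must compile
    for k in cfg["keys"]:
        pattern = k.get("valid_regexp")
        if pattern is not None:
            try:
                re.compile(pattern)
            except re.error as exc:
                errors.append(f"Key '{k['name']}': invalid regex ({exc})")

    # every key_column must reference a defined key
    for t in cfg["tables"]:
        for kc in t.get("key_columns", []):
            if kc["key"] not in key_names:
                errors.append(f"Key '{kc['key']}' referenced in table "
                              f"'{t['table']}' but not defined in keys section")

    # every merge key must exist
    for cid in cfg["canonical_ids"]:
        for mk in cid.get("merge_by_keys", []):
            if mk not in key_names:
                errors.append(f"Merge key '{mk}' not found in keys section")

    return errors
```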

### Step 5: Best Practices Check
Provide recommendations for:
- Key validation rules
- Iteration count optimization
- Master table attribute priorities
- Performance considerations

### Step 6: Validation Report
Generate comprehensive report with:
- ✅ Passed checks
- ⚠ Warnings (non-critical issues)
- ❌ Errors (must fix before generation)
- 💡 Recommendations for improvement

---

## Command Usage

### Basic Usage
```
/cdp-hybrid-idu:hybrid-unif-config-validate

I'll prompt you for:
- YAML file path
```

### Direct Usage
```
YAML file: /path/to/unify.yml
```

---

## Example Validation

### Input YAML
```yaml
name: customer_unification

keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A']

tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}
  - table: orders
    key_columns:
      - {column: email_address, key: email}

canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id]
    merge_iterations: 15

master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1}
          - {table: orders, column: email_address, priority: 2}
```

### Validation Report
```
✅ YAML VALIDATION SUCCESSFUL

File Structure:
✅ Valid YAML syntax
✅ All required sections present
✅ Proper indentation and formatting

Keys Section (2 keys):
✅ email: Valid regex pattern, invalid_texts defined
✅ customer_id: invalid_texts defined
⚠ Consider adding valid_regexp for customer_id for better validation

Tables Section (2 tables):
✅ customer_profiles: 2 key columns mapped
✅ orders: 1 key column mapped
✅ All referenced keys exist

Canonical IDs Section:
✅ Name: unified_id
✅ Merge keys: email, customer_id (both exist)
✅ Iterations: 15 (recommended range: 10-20)

Master Tables Section (1 master table):
✅ customer_master: References unified_id
✅ Attribute 'best_email': 2 sources with priorities
✅ All source tables exist

Cross-References:
✅ All merge_by_keys defined in keys section
✅ All key_columns reference existing keys
✅ All master table sources exist
✅ No canonical ID name conflicts

Recommendations:
💡 Consider adding valid_regexp for customer_id (e.g., "^[A-Z0-9]+$")
💡 Add more master table attributes for richer customer profiles
💡 Consider array attributes (top_3_emails) for historical tracking

Summary:
✅ 0 errors
⚠ 1 warning
💡 3 recommendations

✓ Configuration is ready for SQL generation!
```

---

## Validation Checks

### Required Checks (Must Pass)
- [ ] File exists and is readable
- [ ] Valid YAML syntax
- [ ] `name` field present
- [ ] `keys` section present with at least one key
- [ ] `tables` section present with at least one table
- [ ] `canonical_ids` section present
- [ ] All merge_by_keys exist in keys section
- [ ] All key_columns reference defined keys
- [ ] No duplicate key names
- [ ] No duplicate table names

### Warning Checks (Recommended)
- [ ] Keys have validation rules (valid_regexp or invalid_texts)
- [ ] merge_iterations specified (otherwise auto-calculated)
- [ ] Master tables defined for unified customer view
- [ ] Source tables have unique key combinations
- [ ] Attribute priorities are sequential

### Best Practice Checks
- [ ] Email keys have email regex pattern
- [ ] Phone keys have phone validation
- [ ] invalid_texts include common null values ('', 'N/A', 'null')
- [ ] Master tables use time-based order_by for recency
- [ ] Array attributes for historical data (top_3_emails, etc.)

---

## Common Validation Errors

### Syntax Errors
**Error**: `Invalid YAML: mapping values are not allowed here`
**Solution**: Check indentation (use spaces, not tabs) and ensure colons have a space after them

**Error**: `Invalid YAML: could not find expected ':'`
**Solution**: Check for missing colons in key-value pairs

### Structure Errors
**Error**: `Missing required section: keys`
**Solution**: Add keys section with at least one key definition

**Error**: `Empty tables section`
**Solution**: Add at least one table with key_columns

### Reference Errors
**Error**: `Key 'phone' referenced in table 'orders' but not defined in keys section`
**Solution**: Add phone key to keys section or remove the reference

**Error**: `Merge key 'phone_number' not found in keys section`
**Solution**: Add phone_number to keys section or remove it from merge_by_keys

**Error**: `Master table source 'customer_360' not found in tables section`
**Solution**: Add customer_360 to tables section or use the correct table name

### Value Errors
**Error**: `merge_iterations must be a positive integer, got: 'auto'`
**Solution**: Either remove merge_iterations (auto-calculate) or specify an integer (e.g., 15)

**Error**: `Priority must be a positive integer, got: 'high'`
**Solution**: Use a numeric priority (1 for highest, 2 for second, etc.)

---

## Validation Levels

### Strict Mode (Default)
- Fails on any structural errors
- Warns on missing best practices
- Recommends optimizations

### Lenient Mode
- Only fails on critical syntax errors
- Allows missing optional fields
- Minimal warnings

---

## Platform-Specific Validation

### Databricks-Specific
- ✓ Table names compatible with Unity Catalog
- ✓ Column names valid for Spark SQL
- ⚠ Check for reserved keywords (DATABASE, TABLE, etc.)

### Snowflake-Specific
- ✓ Table names compatible with Snowflake
- ✓ Column names valid for Snowflake SQL
- ⚠ Check for reserved keywords (ACCOUNT, SCHEMA, etc.)

---

## What Happens Next

### If Validation Passes
```
✅ Configuration validated successfully!

Ready for:
• SQL generation (Databricks or Snowflake)
• Direct execution after generation

Next steps:
1. /cdp-hybrid-idu:hybrid-generate-databricks
2. /cdp-hybrid-idu:hybrid-generate-snowflake
3. /cdp-hybrid-idu:hybrid-setup (complete workflow)
```

### If Validation Fails
```
❌ Configuration has errors that must be fixed

Errors (must fix):
1. Missing required section: canonical_ids
2. Undefined key 'phone' referenced in table 'orders'

Suggestions:
• Add canonical_ids section with name and merge_by_keys
• Add phone key to keys section or remove from orders

Would you like help fixing these issues? (y/n)
```

I can help you:
- Fix syntax errors
- Add missing sections
- Define proper validation rules
- Optimize configuration

---

## Success Criteria

Validation passes when:
- ✅ YAML syntax is valid
- ✅ All required sections present
- ✅ All references resolved
- ✅ No structural errors
- ✅ Ready for SQL generation

---

**Ready to validate your YAML configuration?**

Provide your `unify.yml` file path to begin validation!
726
commands/hybrid-unif-merge-stats-creator.md
Normal file
726
commands/hybrid-unif-merge-stats-creator.md
Normal file
@@ -0,0 +1,726 @@
---
name: hybrid-unif-merge-stats-creator
description: Generate professional HTML/PDF merge statistics report from ID unification results for Snowflake or Databricks with expert analysis and visualizations
---

# ID Unification Merge Statistics Report Generator

## Overview

I'll generate a **comprehensive, professional HTML report** analyzing your ID unification merge statistics with:

- 📊 **Executive Summary** with key performance indicators
- 📈 **Identity Resolution Performance** analysis and deduplication rates
- 🎯 **Merge Distribution** patterns and complexity analysis
- 👥 **Top Merged Profiles** highlighting complex identity resolutions
- ✅ **Data Quality Metrics** with coverage percentages
- 🚀 **Convergence Analysis** showing iteration performance
- 💡 **Expert Recommendations** for optimization and next steps

**Platform Support:**
- ✅ Snowflake (using Snowflake MCP tools)
- ✅ Databricks (using Databricks MCP tools)

**Output Format:**
- Beautiful HTML report with charts, tables, and visualizations
- PDF-ready (print to PDF from browser)
- Consistent formatting every time
- Platform-agnostic design

---

## What You Need to Provide

### 1. Platform Selection
- **Snowflake**: For Snowflake-based ID unification
- **Databricks**: For Databricks-based ID unification

### 2. Database/Catalog Configuration

**For Snowflake:**
- **Database Name**: Where your unification tables are stored (e.g., `INDRESH_TEST`, `CUSTOMER_CDP`)
- **Schema Name**: Schema containing tables (e.g., `PUBLIC`, `ID_UNIFICATION`)

**For Databricks:**
- **Catalog Name**: Unity Catalog name (e.g., `customer_data`, `cdp_prod`)
- **Schema Name**: Schema containing tables (e.g., `id_unification`, `unified_profiles`)

### 3. Canonical ID Configuration
- **Canonical ID Name**: Name used for your unified ID (e.g., `td_id`, `unified_customer_id`, `master_id`)
- This is used to find the correct tables: `{canonical_id}_lookup`, `{canonical_id}_master_table`, etc.

### 4. Output Configuration (Optional)
- **Output File Path**: Where to save the HTML report (default: `id_unification_report.html`)
- **Report Title**: Custom title for the report (default: "ID Unification Merge Statistics Report")

---

## What I'll Do

### Step 1: Platform Detection and Validation

**Snowflake:**
```
1. Verify Snowflake MCP tools are available
2. Test connection to specified database.schema
3. Validate canonical ID tables exist:
   - {database}.{schema}.{canonical_id}_lookup
   - {database}.{schema}.{canonical_id}_master_table
   - {database}.{schema}.{canonical_id}_source_key_stats
   - {database}.{schema}.{canonical_id}_result_key_stats
4. Confirm access permissions
```

**Databricks:**
```
1. Verify Databricks MCP tools are available (or use Snowflake fallback)
2. Test connection to specified catalog.schema
3. Validate canonical ID tables exist
4. Confirm access permissions
```

### Step 2: Data Collection with Expert Analysis

I'll execute **16 specialized queries** to collect comprehensive statistics:

**Core Statistics Queries:**

1. **Source Key Statistics**
   - Pre-unification identity counts
   - Distinct values per key type (customer_id, email, phone, etc.)
   - Per-table breakdowns

2. **Result Key Statistics**
   - Post-unification canonical ID counts
   - Distribution histograms
   - Coverage per key type

3. **Canonical ID Metrics**
   - Total identities processed
   - Unique canonical IDs created
   - Merge ratio calculation

4. **Top Merged Profiles**
   - Top 10 most complex merges
   - Identity count per canonical ID
   - Merge complexity scoring

5. **Merge Distribution Analysis**
   - Categorization (2, 3-5, 6-10, 10+ identities)
   - Percentage distribution
   - Pattern analysis

6. **Key Type Distribution**
   - Identity breakdown by type
   - Namespace analysis
   - Cross-key coverage

7. **Master Table Quality Metrics**
   - Attribute coverage percentages
   - Data completeness analysis
   - Sample record extraction

8. **Configuration Metadata**
   - Unification settings
   - Column mappings
   - Validation rules

**Platform-Specific SQL Adaptation:**

For **Snowflake**:
```sql
SELECT COUNT(*) as total_identities,
       COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {database}.{schema}.{canonical_id}_lookup;
```

For **Databricks**:
```sql
SELECT COUNT(*) as total_identities,
       COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {catalog}.{schema}.{canonical_id}_lookup;
```

### Step 3: Statistical Analysis and Calculations

I'll perform expert-level calculations (a small sketch of these formulas follows this list):

**Deduplication Rates:**
```
For each key type:
- Source distinct count (pre-unification)
- Final canonical IDs (post-unification)
- Deduplication % = (source - final) / source * 100
```

**Merge Ratios:**
```
- Average identities per customer = total_identities / unique_canonical_ids
- Distribution across categories
- Outlier detection (10+ merges)
```

**Convergence Analysis:**
```
- Parse from execution logs if available
- Calculate from iteration metadata tables
- Estimate convergence quality
```

**Data Quality Scores:**
```
- Coverage % for each attribute
- Completeness assessment
- Quality grading (Excellent, Good, Needs Improvement)
```
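
To make the arithmetic concrete, here is a minimal Python sketch of the core calculations, assuming the counts have already been fetched; the grading thresholds are illustrative assumptions, not the report's fixed cutoffs.

```python
def deduplication_rate(source_distinct: int, final_canonical: int) -> float:
    """Deduplication % = (source - final) / source * 100."""
    return (source_distinct - final_canonical) / source_distinct * 100

def merge_ratio(total_identities: int, unique_canonical_ids: int) -> float:
    """Average identities merged into each canonical profile."""
    return total_identities / unique_canonical_ids

def quality_grade(coverage_pct: float) -> str:
    """Illustrative thresholds for the report's quality grades."""
    if coverage_pct >= 99.0:
        return "Excellent"
    if coverage_pct >= 95.0:
        return "Good"
    return "Needs Improvement"

# Using the numbers from the Snowflake walkthrough below:
print(f"{merge_ratio(19_512, 4_940):.2f}:1")       # 3.95:1
print(f"{deduplication_rate(7_261, 4_940):.1f}%")  # email: 32.0%
```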

### Step 4: HTML Report Generation

I'll generate a **pixel-perfect HTML report** with:

**Design Features:**
- ✨ Modern gradient design (purple theme)
- 📊 Interactive visualizations (progress bars, horizontal bar charts)
- 🎨 Color-coded badges and status indicators
- 📱 Responsive layout (works on all devices)
- 🖨️ Print-optimized CSS for PDF export

**Report Structure:**

```html
<!DOCTYPE html>
<html>
<head>
  - Professional CSS styling
  - Chart/visualization styles
  - Print media queries
</head>
<body>
  <header>
    - Report title
    - Executive tagline
  </header>

  <metadata-bar>
    - Database/Catalog info
    - Canonical ID name
    - Generation timestamp
    - Platform indicator
  </metadata-bar>

  <section: Executive Summary>
    - 4 KPI metric cards
    - Key findings insight box
  </section>

  <section: Identity Resolution Performance>
    - Source vs result comparison table
    - Deduplication rate analysis
    - Horizontal bar charts
    - Expert insights
  </section>

  <section: Merge Distribution Analysis>
    - Category breakdown table
    - Distribution visualizations
    - Pattern analysis insights
  </section>

  <section: Top Merged Profiles>
    - Top 10 ranked table
    - Complexity badges
    - Investigation recommendations
  </section>

  <section: Source Table Configuration>
    - Column mapping table
    - Source contributions
    - Multi-key strategy analysis
  </section>

  <section: Master Table Data Quality>
    - 6 coverage cards with progress bars
    - Sample records table
    - Quality assessment
  </section>

  <section: Convergence Performance>
    - Iteration breakdown table
    - Convergence progression chart
    - Efficiency analysis
  </section>

  <section: Expert Recommendations>
    - 4 recommendation cards
    - Strategic next steps
    - Downstream activation ideas
  </section>

  <section: Summary Statistics>
    - Comprehensive metrics table
    - All key numbers documented
  </section>

  <footer>
    - Generation metadata
    - Platform information
    - Report description
  </footer>
</body>
</html>
```

### Step 5: Quality Validation and Output

**Pre-Output Validation:**
```
1. Verify all sections have data
2. Check calculations are correct
3. Validate percentages sum properly
4. Ensure no missing values
5. Confirm HTML is well-formed
```

**File Output:**
```
1. Write HTML to specified path
2. Create backup if file exists
3. Set proper file permissions
4. Verify file was written successfully
```

**Report Summary:**
```
✓ Report generated: {file_path}
✓ File size: {size} KB
✓ Sections included: 9
✓ Statistics queries: 16
✓ Data quality score: {score}%
✓ Ready for: Browser viewing, PDF export, sharing
```

---

## Example Workflow

### Snowflake Example

**User Input:**
```
Platform: Snowflake
Database: INDRESH_TEST
Schema: PUBLIC
Canonical ID: td_id
Output: snowflake_merge_report.html
```

**Process:**
```
✓ Connected to Snowflake via MCP
✓ Database: INDRESH_TEST.PUBLIC validated
✓ Tables found:
  - td_id_lookup (19,512 records)
  - td_id_master_table (4,940 records)
  - td_id_source_key_stats (4 records)
  - td_id_result_key_stats (4 records)

Executing queries:
✓ Query 1: Source statistics retrieved
✓ Query 2: Result statistics retrieved
✓ Query 3: Canonical ID counts (19,512 → 4,940)
✓ Query 4: Top 10 merged profiles identified
✓ Query 5: Merge distribution calculated
✓ Query 6: Key type distribution analyzed
✓ Query 7: Master table coverage (100% email, 99.39% phone)
✓ Query 8: Sample records extracted
✓ Query 9-11: Metadata retrieved

Calculating metrics:
✓ Merge ratio: 3.95:1
✓ Fragmentation reduction: 74.7%
✓ Deduplication rates:
  - customer_id: 23.9%
  - email: 32.0%
  - phone: 14.8%
✓ Data quality score: 99.7%

Generating HTML report:
✓ Executive summary section
✓ Performance analysis section
✓ Merge distribution section
✓ Top profiles section
✓ Source configuration section
✓ Data quality section
✓ Convergence section
✓ Recommendations section
✓ Summary statistics section

✓ Report saved: snowflake_merge_report.html (142 KB)
✓ Open in browser to view
✓ Print to PDF for distribution
```

**Generated Report Contents:**
```
Executive Summary:
- 4,940 unified profiles
- 19,512 total identities
- 3.95:1 merge ratio
- 74.7% fragmentation reduction

Identity Resolution:
- customer_id: 6,489 → 4,940 (23.9% reduction)
- email: 7,261 → 4,940 (32.0% reduction)
- phone: 5,762 → 4,910 (14.8% reduction)

Merge Distribution:
- 89.0% profiles: 3-5 identities (normal)
- 8.1% profiles: 6-10 identities (high engagement)
- 2.3% profiles: 10+ identities (complex)

Top Merged Profile:
- mS9ssBEh4EsN: 38 identities merged

Data Quality:
- Email: 100% coverage
- Phone: 99.39% coverage
- Names: 100% coverage
- Location: 100% coverage

Expert Recommendations:
- Implement incremental processing
- Monitor profiles with 20+ merges
- Enable downstream activation
- Set up quality monitoring
```

### Databricks Example

**User Input:**
```
Platform: Databricks
Catalog: customer_cdp
Schema: id_unification
Canonical ID: unified_customer_id
Output: databricks_merge_report.html
```

**Process:**
```
✓ Connected to Databricks (or using Snowflake MCP fallback)
✓ Catalog: customer_cdp.id_unification validated
✓ Tables found:
  - unified_customer_id_lookup
  - unified_customer_id_master_table
  - unified_customer_id_source_key_stats
  - unified_customer_id_result_key_stats

[Same query execution and report generation as Snowflake]

✓ Report saved: databricks_merge_report.html
```

---

## Key Features

### 🎯 **Consistency Guarantee**
- **Same report every time**: Deterministic HTML generation
- **Platform-agnostic design**: Works identically on Snowflake and Databricks
- **Version controlled**: Report structure is fixed and versioned

### 🔍 **Expert Analysis**
- **16 specialized queries**: Comprehensive data collection
- **Calculated metrics**: Deduplication rates, merge ratios, quality scores
- **Pattern detection**: Identify anomalies and outliers
- **Strategic insights**: Actionable recommendations

### 📊 **Professional Visualizations**
- **KPI metric cards**: Large, colorful summary metrics
- **Progress bars**: Coverage percentages with animations
- **Horizontal bar charts**: Distribution comparisons
- **Color-coded badges**: Status indicators (Excellent, Good, Needs Review)
- **Tables with hover effects**: Interactive data exploration

### 🌍 **Platform Flexibility**
- **Snowflake**: Uses `mcp__snowflake__execute_query` tool
- **Databricks**: Uses Databricks MCP tools (with fallback options)
- **Automatic SQL adaptation**: Platform-specific query generation
- **Table name resolution**: Handles catalog vs database differences

### 📋 **Comprehensive Coverage**

**9 Report Sections:**
1. Executive Summary (4 KPIs + findings)
2. Identity Resolution Performance (deduplication analysis)
3. Merge Distribution Analysis (categorized breakdown)
4. Top Merged Profiles (complexity ranking)
5. Source Table Configuration (mappings)
6. Master Table Data Quality (coverage metrics)
7. Convergence Performance (iteration analysis)
8. Expert Recommendations (strategic guidance)
9. Summary Statistics (complete metrics)

**16 Statistical Queries:**
- Source/result key statistics
- Canonical ID counts and distributions
- Merge pattern analysis
- Quality coverage metrics
- Configuration metadata

---

## Table Naming Conventions

The command automatically finds tables based on your canonical ID name (see the name-resolution sketch after this section):

### Required Tables

For canonical ID = `{canonical_id}`:

1. **Lookup Table**: `{canonical_id}_lookup`
   - Contains: canonical_id, id, id_key_type
   - Used for: Merge ratio, distribution, top profiles

2. **Master Table**: `{canonical_id}_master_table`
   - Contains: {canonical_id}, best_* attributes
   - Used for: Data quality coverage

3. **Source Stats**: `{canonical_id}_source_key_stats`
   - Contains: from_table, total_distinct, distinct_*
   - Used for: Pre-unification baseline

4. **Result Stats**: `{canonical_id}_result_key_stats`
   - Contains: from_table, total_distinct, histogram_*
   - Used for: Post-unification results

### Optional Tables

5. **Unification Metadata**: `unification_metadata`
   - Contains: canonical_id_name, canonical_id_type
   - Used for: Configuration documentation

6. **Column Lookup**: `column_lookup`
   - Contains: table_name, column_name, key_name
   - Used for: Source table mappings

7. **Filter Lookup**: `filter_lookup`
   - Contains: key_name, invalid_texts, valid_regexp
   - Used for: Validation rules

**All tables must be in the same database.schema (Snowflake) or catalog.schema (Databricks).**
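
A minimal sketch of this resolution scheme, assuming the simple prefixing convention described above:

```python
# Resolve fully qualified table names for either platform. The first path
# component is a database on Snowflake and a catalog on Databricks; the
# naming scheme is otherwise identical.
REQUIRED_SUFFIXES = ("lookup", "master_table",
                     "source_key_stats", "result_key_stats")

def unification_tables(container: str, schema: str,
                       canonical_id: str) -> dict[str, str]:
    """Map each required table role to its fully qualified name."""
    return {suffix: f"{container}.{schema}.{canonical_id}_{suffix}"
            for suffix in REQUIRED_SUFFIXES}

# Example from the Snowflake walkthrough:
# unification_tables("INDRESH_TEST", "PUBLIC", "td_id")["lookup"]
#   -> "INDRESH_TEST.PUBLIC.td_id_lookup"
```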

---

## Output Format

### HTML Report Features

**Styling:**
- Gradient purple theme (#667eea to #764ba2)
- Modern typography (system fonts)
- Responsive grid layouts
- Smooth hover animations
- Print-optimized media queries

**Sections:**
- Header with gradient background
- Metadata bar with key info
- 9 content sections with analysis
- Footer with generation details

**Visualizations:**
- Metric cards (4 in executive summary)
- Progress bars (6 in data quality)
- Horizontal bar charts (3 throughout report)
- Tables with sorting and hover effects
- Insight boxes with recommendations

**Interactivity:**
- Hover effects on cards and tables
- Animated progress bars
- Expandable insight boxes
- Responsive layout adapts to screen size

### PDF Export

To create a PDF from the HTML report:

1. Open HTML file in browser
2. Press Ctrl+P (Windows) or Cmd+P (Mac)
3. Select "Save as PDF"
4. Choose landscape orientation for better chart visibility
5. Enable background graphics for full styling

---

## Error Handling

### Common Issues and Solutions

**Issue: "Tables not found"**
```
Solution:
1. Verify canonical ID name is correct
2. Check database/catalog and schema names
3. Ensure unification workflow completed successfully
4. Confirm table naming: {canonical_id}_lookup, {canonical_id}_master_table, etc.
```

**Issue: "MCP tools not available"**
```
Solution:
1. For Snowflake: Verify Snowflake MCP server is configured
2. For Databricks: Fall back to Snowflake MCP with proper connection string
3. Check network connectivity
4. Validate credentials
```

**Issue: "No data in statistics tables"**
```
Solution:
1. Verify unification workflow ran completely
2. Check that statistics SQL files were executed
3. Confirm data exists in lookup and master tables
4. Re-run the unification workflow if needed
```

**Issue: "Permission denied"**
```
Solution:
1. Verify READ access to all tables
2. For Snowflake: Grant SELECT on schema
3. For Databricks: Grant USE CATALOG, USE SCHEMA, SELECT
4. Check role/user permissions
```

---

## Success Criteria

Generated report will:

- ✅ **Open successfully** in all modern browsers (Chrome, Firefox, Safari, Edge)
- ✅ **Display all 9 sections** with complete data
- ✅ **Show accurate calculations** for all metrics
- ✅ **Include visualizations** (charts, progress bars, tables)
- ✅ **Render consistently** every time it's generated
- ✅ **Export cleanly to PDF** with proper formatting
- ✅ **Match the reference design** (same HTML/CSS structure)
- ✅ **Contain expert insights** and recommendations
- ✅ **Be production-ready** for stakeholder distribution

---

## Usage Examples

### Quick Start (Snowflake)

```
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator

> Platform: Snowflake
> Database: PROD_CDP
> Schema: ID_UNIFICATION
> Canonical ID: master_customer_id
> Output: (press Enter for default)

✓ Report generated: id_unification_report.html
```

### Custom Output Path

```
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator

> Platform: Databricks
> Catalog: analytics_prod
> Schema: unified_ids
> Canonical ID: td_id
> Output: /reports/weekly/td_id_stats_2025-10-15.html

✓ Report generated: /reports/weekly/td_id_stats_2025-10-15.html
```

### Multiple Environments

Generate reports for different environments:

```bash
# Production
/hybrid-unif-merge-stats-creator
Platform: Snowflake
Database: PROD_CDP
Output: prod_merge_stats.html

# Staging
/hybrid-unif-merge-stats-creator
Platform: Snowflake
Database: STAGING_CDP
Output: staging_merge_stats.html

# Compare metrics across environments
```

---

## Best Practices

### Regular Reporting

1. **Weekly Reports**: Track merge performance over time
2. **Post-Workflow Reports**: Generate after each unification run
3. **Quality Audits**: Monthly deep-dive analysis
4. **Stakeholder Updates**: Executive-friendly format

### Comparative Analysis

Generate reports at different stages:
- After initial unification setup
- After incremental updates
- After data quality improvements
- Across different customer segments

### Archive and Versioning

```
reports/
  2025-10-15_td_id_merge_stats.html
  2025-10-08_td_id_merge_stats.html
  2025-10-01_td_id_merge_stats.html
```

Track improvements over time by comparing:
- Merge ratios
- Data quality scores
- Convergence iterations
- Deduplication rates

---

## Getting Started

**Ready to generate your merge statistics report?**

Please provide:

1. **Platform**: Snowflake or Databricks?
2. **Database/Catalog**: Where are your unification tables?
3. **Schema**: Which schema contains the tables?
4. **Canonical ID**: What's the name of your unified ID? (e.g., td_id)
5. **Output Path** (optional): Where to save the report?

**Example:**
```
I want to generate a merge statistics report for:

Platform: Snowflake
Database: INDRESH_TEST
Schema: PUBLIC
Canonical ID: td_id
Output: my_unification_report.html
```

---

**I'll analyze your ID unification results and create a comprehensive, beautiful HTML report with expert insights!**