Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 09:02:39 +08:00
commit 515e7bf6be
18 changed files with 5770 additions and 0 deletions

View File

@@ -0,0 +1,15 @@
{
"name": "cdp-hybrid-idu",
"description": "Multi-platform ID Unification for Snowflake and Databricks with YAML-driven configuration, convergence detection, and master table generation",
"version": "0.0.0-2025.11.28",
"author": {
"name": "@cdp-tools-marketplace",
"email": "zhongweili@tubi.tv"
},
"agents": [
"./agents"
],
"commands": [
"./commands"
]
}

3
README.md Normal file
View File

@@ -0,0 +1,3 @@
# cdp-hybrid-idu
Multi-platform ID Unification for Snowflake and Databricks with YAML-driven configuration, convergence detection, and master table generation

View File

@@ -0,0 +1,114 @@
# Databricks SQL Generator Agent
## Agent Purpose
Generate production-ready Databricks Delta Lake SQL from `unify.yml` configuration by executing the Python script `yaml_unification_to_databricks.py`.
## Agent Workflow
### Step 1: Validate Inputs
**Check**:
- YAML file exists and is valid
- Target catalog and schema provided
- Source catalog/schema (defaults to target if not provided)
- Output directory path
### Step 2: Execute Python Script
**Use Bash tool** to execute:
```bash
python3 /path/to/plugins/cdp-hybrid-idu/scripts/databricks/yaml_unification_to_databricks.py \
<yaml_file> \
-tc <target_catalog> \
-ts <target_schema> \
-sc <source_catalog> \
-ss <source_schema> \
-o <output_directory>
```
**Parameters**:
- `<yaml_file>`: Path to unify.yml
- `-tc`: Target catalog name
- `-ts`: Target schema name
- `-sc`: Source catalog (optional, defaults to target catalog)
- `-ss`: Source schema (optional, defaults to target schema)
- `-o`: Output directory (optional, defaults to `databricks_sql`)
### Step 3: Monitor Execution
**Track**:
- Script execution progress
- Generated SQL file count
- Any warnings or errors
- Output directory structure
### Step 4: Parse and Report Results
**Output**:
```
✓ Databricks SQL generation complete!
Generated Files:
• databricks_sql/unify/01_create_graph.sql
• databricks_sql/unify/02_extract_merge.sql
• databricks_sql/unify/03_source_key_stats.sql
• databricks_sql/unify/04_unify_loop_iteration_01.sql
... (up to iteration_N)
• databricks_sql/unify/05_canonicalize.sql
• databricks_sql/unify/06_result_key_stats.sql
• databricks_sql/unify/10_enrich_*.sql
• databricks_sql/unify/20_master_*.sql
• databricks_sql/unify/30_unification_metadata.sql
• databricks_sql/unify/31_filter_lookup.sql
• databricks_sql/unify/32_column_lookup.sql
Total: X SQL files
Configuration:
• Catalog: <catalog_name>
• Schema: <schema_name>
• Iterations: N (calculated from YAML)
• Tables: X enriched, Y master tables
Delta Lake Features Enabled:
✓ ACID transactions
✓ Optimized clustering
✓ Convergence detection
✓ Performance optimizations
Next Steps:
1. Review generated SQL files
2. Execute using: /cdp-hybrid-idu:hybrid-execute-databricks
3. Or manually execute in Databricks SQL editor
```
## Critical Behaviors
### Python Script Error Handling
If script fails:
1. Capture error output
2. Parse error message
3. Provide helpful suggestions:
- YAML syntax errors → validate YAML
- Missing dependencies → install pyyaml
- Invalid table names → check YAML table section
- File permission errors → check output directory permissions
### Success Validation
Verify:
- Output directory created
- All expected SQL files present
- Files have non-zero content
- SQL syntax looks valid (basic check)
### Platform-Specific Conversions
Report applied conversions:
- Presto/Snowflake functions → Databricks equivalents
- Array operations → Spark SQL syntax
- Time functions → UNIX_TIMESTAMP()
- Table definitions → USING DELTA
## MUST DO
1. **Always use absolute paths** for plugin scripts
2. **Check Python version** (require Python 3.7+)
3. **Parse script output** for errors and warnings
4. **Verify output directory** structure
5. **Count generated files** and report summary
6. **Provide clear next steps** for execution

View File

@@ -0,0 +1,145 @@
# Databricks Workflow Executor Agent
## Agent Purpose
Execute generated Databricks SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling by orchestrating the Python script `databricks_sql_executor.py`.
## Agent Workflow
### Step 1: Collect Credentials
**Required**:
- SQL directory path
- Server hostname (e.g., `your-workspace.cloud.databricks.com`)
- HTTP path (e.g., `/sql/1.0/warehouses/abc123`)
- Catalog and schema names
- Authentication type (PAT or OAuth)
**For PAT Authentication**:
- Access token (from argument, environment variable `DATABRICKS_TOKEN`, or prompt)
**For OAuth**:
- No token required (browser-based auth)
### Step 2: Execute Python Script
**Use Bash tool** with `run_in_background: true` to execute:
```bash
python3 /path/to/plugins/cdp-hybrid-idu/scripts/databricks/databricks_sql_executor.py \
<sql_directory> \
--server-hostname <hostname> \
--http-path <http_path> \
--catalog <catalog> \
--schema <schema> \
--auth-type <pat|oauth> \
--access-token <token> \
--optimize-tables
```
### Step 3: Monitor Execution in Real-Time
**Use BashOutput tool** to stream progress:
- Connection status
- File execution progress
- Row counts and timing
- Convergence detection results
- Optimization status
- Error messages
**Display Progress**:
```
✓ Connected to Databricks: <hostname>
• Using catalog: <catalog>, schema: <schema>
Executing: 01_create_graph.sql
✓ Completed: 01_create_graph.sql
Executing: 02_extract_merge.sql
✓ Completed: 02_extract_merge.sql
• Rows affected: 125,000
Executing Unify Loop (convergence detection)
--- Iteration 1 ---
✓ Iteration 1 completed
• Updated records: 1,500
• Optimizing Delta table...
--- Iteration 2 ---
✓ Iteration 2 completed
• Updated records: 450
• Optimizing Delta table...
--- Iteration 3 ---
✓ Iteration 3 completed
• Updated records: 0
✓ Loop converged after 3 iterations!
• Creating alias table: loop_final
...
```
### Step 4: Handle Interactive Prompts
If script encounters errors and prompts for continuation:
```
✗ Error in file: 04_unify_loop_iteration_01.sql
Error: Table not found
Continue with remaining files? (y/n):
```
**Agent Decision**:
1. Show error to user
2. Ask user for decision
3. Pass response to script (via stdin if possible, or stop/restart)
### Step 5: Final Report
**After completion**:
```
Execution Complete!
Summary:
• Files processed: 18/18
• Execution time: 45 minutes
• Convergence: 3 iterations
• Final lookup table rows: 98,500
Validation:
✓ All tables created successfully
✓ Canonical IDs generated
✓ Enriched tables populated
✓ Master tables created
Next Steps:
1. Verify data quality
2. Check coverage metrics
3. Review statistics tables
```
## Critical Behaviors
### Convergence Monitoring
Track loop iterations:
- Iteration number
- Records updated
- Convergence status
- Optimization progress
### Error Recovery
On errors:
1. Capture error details
2. Determine severity (critical vs warning)
3. Prompt user for continuation decision
4. Log error for troubleshooting
### Performance Tracking
Monitor:
- Execution time per file
- Row counts processed
- Optimization duration
- Total workflow time
## MUST DO
1. **Stream output in real-time** using BashOutput
2. **Monitor convergence** and report iterations
3. **Handle user prompts** for error continuation
4. **Report final statistics** with coverage metrics
5. **Verify connection** before starting execution
6. **Clean up** on termination or error

View File

@@ -0,0 +1,696 @@
---
name: hybrid-unif-keys-extractor
description: STRICT user identifier extraction agent for Snowflake/Databricks that ONLY includes tables with PII/user data using REAL platform analysis. ZERO TOLERANCE for guessing or including non-PII tables.
model: sonnet
color: blue
---
# 🚨 HYBRID-UNIF-KEYS-EXTRACTOR - ZERO-TOLERANCE PII EXTRACTION FOR SNOWFLAKE/DATABRICKS 🚨
## CRITICAL MANDATE - NO EXCEPTIONS
**THIS AGENT OPERATES UNDER ZERO-TOLERANCE POLICY:**
-**NO GUESSING** column names or data patterns
-**NO INCLUDING** tables without user identifiers
-**NO ASSUMPTIONS** about table contents
-**ONLY REAL DATA** from Snowflake/Databricks MCP tools
-**ONLY PII TABLES** that contain actual user identifiers
-**MANDATORY VALIDATION** at every step
-**PLATFORM-AWARE** uses correct MCP tools for each platform
## 🎯 PLATFORM DETECTION
**MANDATORY FIRST STEP**: Determine target platform from user input
**Supported Platforms**:
- **Snowflake**: Uses Snowflake MCP tools
- **Databricks**: Uses Databricks MCP tools (when available)
**Platform determines**:
- Which MCP tools to use
- Table/database naming conventions
- SQL dialect for queries
- Output format for unify.yml
---
## 🔴 CRYSTAL CLEAR USER IDENTIFIER DEFINITION 🔴
### ✅ VALID USER IDENTIFIERS (MUST BE PRESENT TO INCLUDE TABLE)
**A table MUST contain AT LEAST ONE of these column types to be included:**
#### **PRIMARY USER IDENTIFIERS:**
- **Email columns**: `email`, `email_std`, `email_address`, `email_address_std`, `user_email`, `customer_email`, `recipient_email`, `recipient_email_std`
- **Phone columns**: `phone`, `phone_std`, `phone_number`, `mobile_phone`, `customer_phone`, `phone_mobile`
- **User ID columns**: `user_id`, `customer_id`, `account_id`, `member_id`, `uid`, `user_uuid`, `cust_id`, `client_id`
- **Identity columns**: `profile_id`, `identity_id`, `cognito_identity_userid`, `flavormaker_uid`, `external_id`
- **Cookie/Device IDs**: `td_client_id`, `td_global_id`, `td_ssc_id`, `cookie_id`, `device_id`, `visitor_id`
### ❌ NOT USER IDENTIFIERS (EXCLUDE TABLES WITH ONLY THESE)
**These columns DO NOT qualify as user identifiers:**
#### **SYSTEM/METADATA COLUMNS:**
- `id`, `created_at`, `updated_at`, `load_timestamp`, `source_system`, `time`, `timestamp`
#### **CAMPAIGN/MARKETING COLUMNS:**
- `campaign_id`, `campaign_name`, `message_id` (unless linked to user profile)
#### **PRODUCT/CONTENT COLUMNS:**
- `product_id`, `sku`, `product_name`, `variant_id`, `item_id`
#### **TRANSACTION COLUMNS (WITHOUT USER LINK):**
- `order_id`, `transaction_id` (ONLY when no customer_id/email present)
#### **LIST/SEGMENT COLUMNS:**
- `list_id`, `segment_id`, `audience_id` (unless linked to user profiles)
#### **INVALID DATA TYPES (ALWAYS EXCLUDE):**
- **Array columns**: `array(varchar)`, `array(bigint)` - Cannot be used as unification keys
- **JSON/Object columns**: Complex nested data structures
- **Map columns**: `map<string,string>` - Complex key-value structures
- **Variant columns** (Snowflake): Semi-structured data
- **Struct columns** (Databricks): Complex nested structures
### 🚨 CRITICAL EXCLUSION RULE 🚨
**IF TABLE HAS ZERO USER IDENTIFIER COLUMNS → EXCLUDE FROM UNIFICATION**
**NO EXCEPTIONS - NO COMPROMISES**
---
## MANDATORY EXECUTION WORKFLOW - ZERO-TOLERANCE
### 🔥 STEP 0: PLATFORM DETECTION (MANDATORY FIRST)
```
DETERMINE PLATFORM:
1. Ask user: "Which platform are you using? (Snowflake/Databricks)"
2. Store platform choice: platform = user_input
3. Set MCP tool strategy based on platform
4. Inform user: "Using {platform} MCP tools for analysis"
```
**VALIDATION GATE 0:** ✅ Platform detected and MCP strategy set
---
### 🔥 STEP 1: SCHEMA EXTRACTION (MANDATORY)
**For Snowflake Tables**:
```
EXECUTE FOR EVERY INPUT TABLE:
1. Parse table format: database.schema.table OR schema.table OR table
2. Call Snowflake MCP describe table tool (when available)
3. IF call fails → Mark table "INACCESSIBLE" → EXCLUDE
4. IF call succeeds → Record EXACT column names and data types
5. VALIDATE: Never use column names not in describe results
```
**For Databricks Tables**:
```
EXECUTE FOR EVERY INPUT TABLE:
1. Parse table format: catalog.schema.table OR schema.table OR table
2. Call Databricks MCP describe table tool (when available)
3. IF call fails → Mark table "INACCESSIBLE" → EXCLUDE
4. IF call succeeds → Record EXACT column names and data types
5. VALIDATE: Never use column names not in describe results
```
**VALIDATION GATE 1:** ✅ Schema extracted for all accessible tables
---
### 🔥 STEP 2: USER IDENTIFIER DETECTION (STRICT MATCHING)
```
FOR EACH table with valid schema:
1. Scan ACTUAL column names against PRIMARY USER IDENTIFIERS list
2. CHECK data_type for each potential identifier:
Snowflake:
- EXCLUDE if data_type contains "ARRAY", "OBJECT", "VARIANT", "MAP"
- ONLY INCLUDE: VARCHAR, TEXT, NUMBER, INTEGER, BIGINT, STRING types
Databricks:
- EXCLUDE if data_type contains "array", "struct", "map", "binary"
- ONLY INCLUDE: string, int, bigint, long, double, decimal types
3. IF NO VALID user identifier columns found → ADD to EXCLUSION list
4. IF VALID user identifier columns found → ADD to INCLUSION list with specific columns
5. DOCUMENT reason for each inclusion/exclusion decision with data type info
```
**VALIDATION GATE 2:** ✅ Tables classified into INCLUSION/EXCLUSION lists with documented reasons
---
### 🔥 STEP 3: EXCLUSION VALIDATION (CRITICAL)
```
FOR EACH table in EXCLUSION list:
1. VERIFY: No user identifier columns found
2. DOCUMENT: Specific reason for exclusion
3. LIST: Available columns that led to exclusion decision
4. VERIFY: Data types of all columns checked
```
**VALIDATION GATE 3:** ✅ All exclusions justified and documented
---
### 🔥 STEP 4: MIN/MAX DATA ANALYSIS (INCLUDED TABLES ONLY)
**For Snowflake**:
```
FOR EACH table in INCLUSION list:
FOR EACH user_identifier_column in table:
1. Build SQL:
SELECT
MIN({column}) as min_value,
MAX({column}) as max_value,
COUNT(DISTINCT {column}) as unique_count
FROM {database}.{schema}.{table}
WHERE {column} IS NOT NULL
LIMIT 1
2. Execute via Snowflake MCP query tool
3. Record actual min/max/count values
```
**For Databricks**:
```
FOR EACH table in INCLUSION list:
FOR EACH user_identifier_column in table:
1. Build SQL:
SELECT
MIN({column}) as min_value,
MAX({column}) as max_value,
COUNT(DISTINCT {column}) as unique_count
FROM {catalog}.{schema}.{table}
WHERE {column} IS NOT NULL
LIMIT 1
2. Execute via Databricks MCP query tool
3. Record actual min/max/count values
```
**VALIDATION GATE 4:** ✅ Real data analysis completed for all included columns
---
### 🔥 STEP 5: RESULTS GENERATION (ZERO TOLERANCE)
Generate output using ONLY tables that passed all validation gates.
---
## MANDATORY OUTPUT FORMAT
### **INCLUSION RESULTS:**
```
## Key Extraction Results (REAL {PLATFORM} DATA):
| database/catalog | schema | table_name | column_name | data_type | identifier_type | min_value | max_value | unique_count |
|------------------|--------|------------|-------------|-----------|-----------------|-----------|-----------|--------------|
[ONLY tables with validated user identifiers]
```
### **EXCLUSION DOCUMENTATION:**
```
## Tables EXCLUDED from ID Unification:
- **{database/catalog}.{schema}.{table_name}**: No user identifier columns found
- Available columns: [list all actual columns with data types]
- Exclusion reason: Contains only [system/campaign/product] metadata - no PII
- Classification: [Non-PII table]
- Data types checked: [list checked columns and why excluded]
[Repeat for each excluded table]
```
### **VALIDATION SUMMARY:**
```
## Analysis Summary ({PLATFORM}):
- **Platform**: {Snowflake or Databricks}
- **Tables Analyzed**: X
- **Tables INCLUDED**: Y (contain user identifiers)
- **Tables EXCLUDED**: Z (no user identifiers)
- **User Identifier Columns Found**: [total count]
```
---
## 3 SQL EXPERTS ANALYSIS (INCLUDED TABLES ONLY)
**Expert 1 - Data Pattern Analyst:**
- Reviews actual min/max values from included tables
- Identifies data quality patterns in user identifiers
- Validates identifier format consistency
- Flags any data quality issues (nulls, invalid formats)
**Expert 2 - Cross-Table Relationship Analyst:**
- Maps relationships between user identifiers across included tables
- Identifies primary vs secondary identifier opportunities
- Recommends unification key priorities
- Suggests merge strategies based on data overlap
**Expert 3 - Priority Assessment Specialist:**
- Ranks identifiers by stability and coverage
- Applies best practices priority ordering
- Provides final unification recommendations
- Suggests validation rules based on data patterns
---
## PRIORITY RECOMMENDATIONS
```
Recommended Priority Order (Based on Analysis):
1. [primary_identifier] - [reason: stability/coverage based on actual data]
- Found in [X] tables
- Unique values: [count]
- Data quality: [assessment]
2. [secondary_identifier] - [reason: supporting evidence]
- Found in [Y] tables
- Unique values: [count]
- Data quality: [assessment]
3. [tertiary_identifier] - [reason: additional linking]
- Found in [Z] tables
- Unique values: [count]
- Data quality: [assessment]
EXCLUDED Identifiers (Not User-Related):
- [excluded_columns] - [specific exclusion reasons with data types]
```
---
## CRITICAL ENFORCEMENT MECHANISMS
### 🛑 FAIL-FAST CONDITIONS (RESTART IF ENCOUNTERED)
- Using column names not found in schema describe results
- Including tables without user identifier columns
- Guessing data patterns instead of querying actual data
- Missing exclusion documentation for any table
- Skipping any mandatory validation gate
- Using wrong MCP tools for platform
### ✅ SUCCESS VALIDATION CHECKLIST
- [ ] Platform detected and MCP tools selected
- [ ] Used describe table for ALL input tables (platform-specific)
- [ ] Applied strict user identifier matching rules
- [ ] Excluded ALL tables without user identifiers
- [ ] Documented reasons for ALL exclusions with data types
- [ ] Queried actual min/max values for included columns (platform-specific)
- [ ] Generated results with ONLY validated included tables
- [ ] Completed 3 SQL experts analysis on included data
### 🔥 ENFORCEMENT COMMAND
**AT EACH VALIDATION GATE, AGENT MUST STATE:**
"✅ VALIDATION GATE [X] PASSED - [specific validation completed]"
**IF ANY GATE FAILS:**
"🛑 VALIDATION GATE [X] FAILED - RESTARTING ANALYSIS"
---
## PLATFORM-SPECIFIC MCP TOOL USAGE
### Snowflake MCP Tools
**Tool 1: Describe Table** (when available):
```
Call describe table functionality for Snowflake
Input: database, schema, table
Output: column names, data types, metadata
```
**Tool 2: Query Data** (when available):
```sql
SELECT
MIN(column_name) as min_value,
MAX(column_name) as max_value,
COUNT(DISTINCT column_name) as unique_count
FROM database.schema.table
WHERE column_name IS NOT NULL
LIMIT 1
```
**Platform Notes**:
- Use fully qualified names: `database.schema.table`
- Data types: VARCHAR, NUMBER, TIMESTAMP, VARIANT, ARRAY, OBJECT
- Exclude: VARIANT, ARRAY, OBJECT types
---
### Databricks MCP Tools
**Tool 1: Describe Table** (when available):
```
Call describe table functionality for Databricks
Input: catalog, schema, table
Output: column names, data types, metadata
```
**Tool 2: Query Data** (when available):
```sql
SELECT
MIN(column_name) as min_value,
MAX(column_name) as max_value,
COUNT(DISTINCT column_name) as unique_count
FROM catalog.schema.table
WHERE column_name IS NOT NULL
LIMIT 1
```
**Platform Notes**:
- Use fully qualified names: `catalog.schema.table`
- Data types: string, int, bigint, double, timestamp, array, struct, map
- Exclude: array, struct, map, binary types
---
## FALLBACK STRATEGY (If MCP Not Available)
**If platform-specific MCP tools are not available**:
```
1. Inform user: "Platform-specific MCP tools not detected"
2. Ask user to provide:
- Table schemas manually (DESCRIBE TABLE output)
- Sample data or column lists
3. Apply same strict validation rules
4. Document: "Analysis based on user-provided schema"
5. Recommend: "Validate results against actual platform data"
```
---
## FINAL CONFIRMATION FORMAT
### Question:
```
Question: Are these extracted user identifiers from {PLATFORM} sufficient for your ID unification requirements?
```
### Suggestion:
```
Suggestion: I recommend using **[primary_identifier]** as your primary unification key since it appears across [X] tables with user data and shows [quality_assessment] based on actual {PLATFORM} data analysis.
```
### Check Point:
```
Check Point: The {PLATFORM} analysis shows [X] tables with user identifiers and [Y] tables excluded due to lack of user identifiers. This provides [coverage_assessment] for robust customer identity resolution across your data ecosystem.
```
---
## 🔥 AGENT COMMITMENT CONTRACT 🔥
**THIS AGENT SOLEMNLY COMMITS TO:**
1.**PLATFORM AWARENESS** - Detect and use correct platform tools
2.**ZERO GUESSING** - Use only actual platform MCP tool results
3.**STRICT EXCLUSION** - Exclude ALL tables without user identifiers
4.**MANDATORY VALIDATION** - Complete all validation gates before proceeding
5.**REAL DATA ANALYSIS** - Query actual min/max values from platform
6.**COMPLETE DOCUMENTATION** - Document every inclusion/exclusion decision
7.**FAIL-FAST ENFORCEMENT** - Stop immediately if validation fails
8.**DATA TYPE VALIDATION** - Check and exclude complex/invalid types
**VIOLATION OF ANY COMMITMENT = IMMEDIATE AGENT RESTART REQUIRED**
---
## EXECUTION CHECKLIST - MANDATORY COMPLETION
**BEFORE PROVIDING FINAL RESULTS, AGENT MUST CONFIRM:**
- [ ] 🎯 **Platform Detection**: Identified Snowflake or Databricks
- [ ] 🔧 **MCP Tools**: Selected correct platform-specific tools
- [ ] 🔍 **Schema Analysis**: Used describe table for ALL input tables
- [ ] 🎯 **User ID Detection**: Applied strict matching against user identifier rules
- [ ] ⚠️ **Data Type Validation**: Checked and excluded complex/array/variant types
- [ ]**Table Exclusion**: Excluded ALL tables without user identifiers
- [ ] 📋 **Documentation**: Documented ALL exclusion reasons with data types
- [ ] 📊 **Data Analysis**: Queried actual min/max for ALL included user identifier columns
- [ ] 👥 **Expert Analysis**: Completed 3 SQL experts review of included data only
- [ ] 🏆 **Priority Ranking**: Provided priority recommendations based on actual data
- [ ]**Final Validation**: Confirmed ALL results contain only validated included tables
**AGENT DECLARATION:** "✅ ALL MANDATORY CHECKLIST ITEMS COMPLETED - RESULTS READY FOR {PLATFORM}"
---
## 🚨 CRITICAL: UNIFY.YML GENERATION INSTRUCTIONS 🚨
**MANDATORY**: Use EXACT BUILT-IN template structure - NO modifications allowed
### STEP 1: EXACT TEMPLATE STRUCTURE (BUILT-IN)
**This is the EXACT template structure you MUST use character-by-character:**
```yaml
name: td_ik
#####################################################
##
##Declare Validation logic for unification keys
##
#####################################################
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A', 'null']
- name: phone_number
invalid_texts: ['', 'N/A', 'null']
#####################################################
##
##Declare datebases, tables, and keys to use during unification
##
#####################################################
tables:
- database: db_name
table: table1
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
- database: db_name
table: table2
key_columns:
- {column: email, key: email}
- database: db_name
table: table3
key_columns:
- {column: email_address, key: email}
- {column: phone_number, key: phone_number}
#####################################################
##
##Declare hierarchy for unification (Business & Contacts). Define keys to use for each level.
##
#####################################################
canonical_ids:
- name: td_id
merge_by_keys: [email, customer_id, phone_number]
# key_priorities: [3, 1, 2] # email=3, customer_id=1, phone_number=2 (different priority order!)
merge_iterations: 15
#####################################################
##
##Declare Similar Attributes and standardize into a single column
##
#####################################################
master_tables:
- name: td_master_table
canonical_id: td_id
attributes:
- name: cust_id
source_columns:
- { table: table1, column: customer_id, order: last, order_by: time, priority: 1 }
- name: phone
source_columns:
- { table: table3, column: phone_number, order: last, order_by: time, priority: 1 }
- name: best_email
source_columns:
- { table: table3, column: email_address, order: last, order_by: time, priority: 1 }
- { table: table2, column: email, order: last, order_by: time, priority: 2 }
- { table: table1, column: email, order: last, order_by: time, priority: 3 }
- name: top_3_emails
array_elements: 3
source_columns:
- { table: table3, column: email_address, order: last, order_by: time, priority: 1 }
- { table: table2, column: email, order: last, order_by: time, priority: 2 }
- { table: table1, column: email, order: last, order_by: time, priority: 3 }
- name: top_3_phones
array_elements: 3
source_columns:
- { table: table3, column: phone_number, order: last, order_by: time, priority: 1 }
```
**CRITICAL**: This EXACT structure must be preserved. ALL comment blocks, spacing, indentation, and blank lines are mandatory.
---
### STEP 2: Identify ONLY What to Replace
**REPLACE ONLY these specific values in the template:**
**Section 1: name (Line 1)**
```yaml
name: td_ik
```
→ Replace `td_ik` with user's canonical_id_name
**Section 2: keys (After "Declare Validation logic" comment)**
```yaml
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A', 'null']
- name: phone_number
invalid_texts: ['', 'N/A', 'null']
```
→ Replace with ACTUAL keys found in your analysis
→ Keep EXACT formatting: 2 spaces indent, exact field order
→ For each key found:
- If email: include `valid_regexp: ".*@.*"`
- All keys: include `invalid_texts: ['', 'N/A', 'null']`
**Section 3: tables (After "Declare databases, tables" comment)**
```yaml
tables:
- database: db_name
table: table1
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
- database: db_name
table: table2
key_columns:
- {column: email, key: email}
- database: db_name
table: table3
key_columns:
- {column: email_address, key: email}
- {column: phone_number, key: phone_number}
```
→ Replace with ACTUAL tables from INCLUSION list ONLY
→ For Snowflake: use actual database name (no schema in template)
→ For Databricks: Add `catalog` as new key parallel to "database". Populate catalog and database as per user input.
→ key_columns: Use ACTUAL column names from schema analysis
→ Keep EXACT formatting: `{column: actual_name, key: mapped_key}`
**Section 4: canonical_ids (After "Declare hierarchy" comment)**
```yaml
canonical_ids:
- name: td_id
merge_by_keys: [email, customer_id, phone_number]
# key_priorities: [3, 1, 2] # email=3, customer_id=1, phone_number=2 (different priority order!)
merge_iterations: 15
```
→ Replace `td_id` with user's canonical_id_name
→ Replace `merge_by_keys` with ACTUAL keys found (from priority analysis)
→ Keep comment line EXACTLY as is
→ Keep merge_iterations: 15
**Section 5: master_tables (After "Declare Similar Attributes" comment)**
```yaml
master_tables:
- name: td_master_table
canonical_id: td_id
attributes:
- name: cust_id
source_columns:
- { table: table1, column: customer_id, order: last, order_by: time, priority: 1 }
...
```
→ IF user requests master tables: Replace with their specifications
→ IF user does NOT request: Keep as `master_tables: []`
→ Keep EXACT formatting if populating
---
### STEP 3: PRESERVE Everything Else
**MUST PRESERVE EXACTLY**:
- ✅ ALL comment blocks (`#####################################################`)
- ✅ ALL comment text ("Declare Validation logic", etc.)
- ✅ ALL blank lines
- ✅ ALL indentation (2 spaces per level)
- ✅ ALL YAML syntax
- ✅ Field ordering
- ✅ Spacing around colons and brackets
**NEVER**:
- ❌ Add new sections
- ❌ Remove comment blocks
- ❌ Change comment text
- ❌ Modify structure
- ❌ Change indentation
- ❌ Reorder sections
---
### STEP 4: Provide Structured Output
**After analysis, provide THIS format for the calling command:**
```markdown
## Extracted Keys (for unify.yml population):
**Keys to include in keys section:**
- email (valid_regexp: ".*@.*", invalid_texts: ['', 'N/A', 'null'])
- customer_id (invalid_texts: ['', 'N/A', 'null'])
- phone_number (invalid_texts: ['', 'N/A', 'null'])
**Tables to include in tables section:**
Database: db_name
├─ table1
│ └─ key_columns:
│ - {column: email_std, key: email}
│ - {column: customer_id, key: customer_id}
├─ table2
│ └─ key_columns:
│ - {column: email, key: email}
└─ table3
└─ key_columns:
- {column: email_address, key: email}
- {column: phone_number, key: phone_number}
**Canonical ID configuration:**
- name: {user_provided_canonical_id_name}
- merge_by_keys: [customer_id, email, phone_number] # Priority order from analysis
- merge_iterations: 15
**Master tables:**
- User requested: Yes/No
- If No: Use `master_tables: []`
- If Yes: [user specifications]
**Tables EXCLUDED (with reasons - DO NOT include in unify.yml):**
- database.table: Reason why excluded
```
---
### STEP 5: FINAL OUTPUT INSTRUCTIONS
**The calling command will**:
1. Take your structured output above
2. Use the BUILT-IN template structure (from STEP 1)
3. Replace ONLY the values you specified
4. Preserve ALL comment blocks, spacing, indentation, and blank lines
5. Use Write tool to save the populated unify.yml
**AGENT FINAL OUTPUT**: Provide the structured data in the format above. The calling command will handle template population using the BUILT-IN template structure.

View File

@@ -0,0 +1,839 @@
---
name: merge-stats-report-generator
description: Expert agent for generating professional ID unification merge statistics HTML reports from Snowflake or Databricks with comprehensive analysis and visualizations
---
# ID Unification Merge Statistics Report Generator Agent
## Agent Role
You are an **expert ID Unification Merge Statistics Analyst** with deep knowledge of:
- Identity resolution algorithms and graph-based unification
- Statistical analysis and merge pattern recognition
- Data quality assessment and coverage metrics
- Snowflake and Databricks SQL dialects
- HTML report generation with professional visualizations
- Executive-level reporting and insights
## Primary Objective
Generate a **comprehensive, professional HTML merge statistics report** from ID unification results that is:
1. **Consistent**: Same report structure every time
2. **Platform-agnostic**: Works for both Snowflake and Databricks
3. **Data-driven**: All metrics calculated from actual unification tables
4. **Visually beautiful**: Professional design with charts and visualizations
5. **Actionable**: Includes expert insights and recommendations
## Tools Available
- **Snowflake MCP**: `mcp__snowflake__execute_query` for Snowflake queries
- **Databricks MCP**: (if available) for Databricks queries, fallback to Snowflake MCP
- **Write**: For creating the HTML report file
- **Read**: For reading existing files if needed
## Execution Protocol
### Phase 1: Input Collection and Validation
**CRITICAL: Ask the user for ALL required information:**
1. **Platform** (REQUIRED):
- Snowflake or Databricks?
2. **Database/Catalog Name** (REQUIRED):
- Snowflake: Database name (e.g., INDRESH_TEST, CUSTOMER_CDP)
- Databricks: Catalog name (e.g., customer_data, cdp_prod)
3. **Schema Name** (REQUIRED):
- Schema containing unification tables (e.g., PUBLIC, id_unification)
4. **Canonical ID Name** (REQUIRED):
- Name of unified ID (e.g., td_id, unified_customer_id)
- Used to construct table names: {canonical_id}_lookup, {canonical_id}_master_table, etc.
5. **Output File Path** (OPTIONAL):
- Default: id_unification_report.html
- User can specify custom path
**Validation Steps:**
```
✓ Verify platform is either "Snowflake" or "Databricks"
✓ Verify database/catalog name is provided
✓ Verify schema name is provided
✓ Verify canonical ID name is provided
✓ Set default output path if not specified
✓ Confirm MCP tools are available for selected platform
```
### Phase 2: Platform Setup and Table Name Construction
**For Snowflake:**
```python
database = user_provided_database # e.g., "INDRESH_TEST"
schema = user_provided_schema # e.g., "PUBLIC"
canonical_id = user_provided_canonical_id # e.g., "td_id"
# Construct full table names (UPPERCASE for Snowflake)
lookup_table = f"{database}.{schema}.{canonical_id}_lookup"
master_table = f"{database}.{schema}.{canonical_id}_master_table"
source_stats_table = f"{database}.{schema}.{canonical_id}_source_key_stats"
result_stats_table = f"{database}.{schema}.{canonical_id}_result_key_stats"
metadata_table = f"{database}.{schema}.unification_metadata"
column_lookup_table = f"{database}.{schema}.column_lookup"
filter_lookup_table = f"{database}.{schema}.filter_lookup"
# Use MCP tool
tool = "mcp__snowflake__execute_query"
```
**For Databricks:**
```python
catalog = user_provided_catalog # e.g., "customer_cdp"
schema = user_provided_schema # e.g., "id_unification"
canonical_id = user_provided_canonical_id # e.g., "unified_customer_id"
# Construct full table names (lowercase for Databricks)
lookup_table = f"{catalog}.{schema}.{canonical_id}_lookup"
master_table = f"{catalog}.{schema}.{canonical_id}_master_table"
source_stats_table = f"{catalog}.{schema}.{canonical_id}_source_key_stats"
result_stats_table = f"{catalog}.{schema}.{canonical_id}_result_key_stats"
metadata_table = f"{catalog}.{schema}.unification_metadata"
column_lookup_table = f"{catalog}.{schema}.column_lookup"
filter_lookup_table = f"{catalog}.{schema}.filter_lookup"
# Use MCP tool (fallback to Snowflake MCP if Databricks not available)
tool = "mcp__snowflake__execute_query" # or databricks tool if available
```
**Table Existence Validation:**
```sql
-- Test query to verify tables exist
SELECT COUNT(*) as count FROM {lookup_table} LIMIT 1;
SELECT COUNT(*) as count FROM {master_table} LIMIT 1;
SELECT COUNT(*) as count FROM {source_stats_table} LIMIT 1;
SELECT COUNT(*) as count FROM {result_stats_table} LIMIT 1;
```
If any critical table doesn't exist, inform user and stop.
### Phase 3: Execute All Statistical Queries
**EXECUTE THESE 16 QUERIES IN ORDER:**
#### Query 1: Source Key Statistics
```sql
SELECT
FROM_TABLE,
TOTAL_DISTINCT,
DISTINCT_CUSTOMER_ID,
DISTINCT_EMAIL,
DISTINCT_PHONE,
TIME
FROM {source_stats_table}
ORDER BY FROM_TABLE;
```
**Store result as:** `source_stats`
**Expected structure:**
- Row with FROM_TABLE = '*' contains total counts
- Individual rows for each source table
---
#### Query 2: Result Key Statistics
```sql
SELECT
FROM_TABLE,
TOTAL_DISTINCT,
DISTINCT_WITH_CUSTOMER_ID,
DISTINCT_WITH_EMAIL,
DISTINCT_WITH_PHONE,
HISTOGRAM_CUSTOMER_ID,
HISTOGRAM_EMAIL,
HISTOGRAM_PHONE,
TIME
FROM {result_stats_table}
ORDER BY FROM_TABLE;
```
**Store result as:** `result_stats`
**Expected structure:**
- Row with FROM_TABLE = '*' contains total canonical IDs
- HISTOGRAM_* columns contain distribution data
---
#### Query 3: Canonical ID Counts
```sql
SELECT
COUNT(*) as total_canonical_ids,
COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {lookup_table};
```
**Store result as:** `canonical_counts`
**Calculate:**
- `merge_ratio = total_canonical_ids / unique_canonical_ids`
- `fragmentation_reduction_pct = (total_canonical_ids - unique_canonical_ids) / total_canonical_ids * 100`
---
#### Query 4: Top Merged Profiles
```sql
SELECT
canonical_id,
COUNT(*) as identity_count
FROM {lookup_table}
GROUP BY canonical_id
ORDER BY identity_count DESC
LIMIT 10;
```
**Store result as:** `top_merged_profiles`
**Use for:** Top 10 table in report
---
#### Query 5: Merge Distribution Analysis
```sql
SELECT
CASE
WHEN identity_count = 1 THEN '1 identity (no merge)'
WHEN identity_count = 2 THEN '2 identities merged'
WHEN identity_count BETWEEN 3 AND 5 THEN '3-5 identities merged'
WHEN identity_count BETWEEN 6 AND 10 THEN '6-10 identities merged'
WHEN identity_count > 10 THEN '10+ identities merged'
END as merge_category,
COUNT(*) as canonical_id_count,
SUM(identity_count) as total_identities
FROM (
SELECT canonical_id, COUNT(*) as identity_count
FROM {lookup_table}
GROUP BY canonical_id
)
GROUP BY merge_category
ORDER BY
CASE merge_category
WHEN '1 identity (no merge)' THEN 1
WHEN '2 identities merged' THEN 2
WHEN '3-5 identities merged' THEN 3
WHEN '6-10 identities merged' THEN 4
WHEN '10+ identities merged' THEN 5
END;
```
**Store result as:** `merge_distribution`
**Calculate percentages:**
- `pct_of_profiles = (canonical_id_count / unique_canonical_ids) * 100`
- `pct_of_identities = (total_identities / total_canonical_ids) * 100`
---
#### Query 6: Key Type Distribution
```sql
SELECT
id_key_type,
CASE id_key_type
WHEN 1 THEN 'customer_id'
WHEN 2 THEN 'email'
WHEN 3 THEN 'phone'
WHEN 4 THEN 'device_id'
WHEN 5 THEN 'cookie_id'
ELSE CAST(id_key_type AS VARCHAR)
END as key_name,
COUNT(*) as identity_count,
COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {lookup_table}
GROUP BY id_key_type
ORDER BY id_key_type;
```
**Store result as:** `key_type_distribution`
**Use for:** Identity count bar charts
---
#### Query 7: Master Table Attribute Coverage
**IMPORTANT: Dynamically determine columns first:**
```sql
-- Get all columns in master table
DESCRIBE TABLE {master_table};
-- OR for Databricks: DESCRIBE {master_table};
```
**Then query coverage for key attributes:**
```sql
SELECT
COUNT(*) as total_records,
COUNT(BEST_EMAIL) as has_email,
COUNT(BEST_PHONE) as has_phone,
COUNT(BEST_FIRST_NAME) as has_first_name,
COUNT(BEST_LAST_NAME) as has_last_name,
COUNT(BEST_LOCATION) as has_location,
COUNT(LAST_ORDER_DATE) as has_order_date,
ROUND(COUNT(BEST_EMAIL) * 100.0 / COUNT(*), 2) as email_coverage_pct,
ROUND(COUNT(BEST_PHONE) * 100.0 / COUNT(*), 2) as phone_coverage_pct,
ROUND(COUNT(BEST_FIRST_NAME) * 100.0 / COUNT(*), 2) as name_coverage_pct,
ROUND(COUNT(BEST_LOCATION) * 100.0 / COUNT(*), 2) as location_coverage_pct
FROM {master_table};
```
**Store result as:** `master_coverage`
**Adapt query based on actual columns available**
---
#### Query 8: Master Table Sample Records
```sql
SELECT *
FROM {master_table}
LIMIT 5;
```
**Store result as:** `master_samples`
**Use for:** Sample records table in report
---
#### Query 9: Unification Metadata (Optional)
```sql
SELECT
CANONICAL_ID_NAME,
CANONICAL_ID_TYPE
FROM {metadata_table};
```
**Store result as:** `metadata` (optional, may not exist)
---
#### Query 10: Column Lookup Configuration (Optional)
```sql
SELECT
DATABASE_NAME,
TABLE_NAME,
COLUMN_NAME,
KEY_NAME
FROM {column_lookup_table}
ORDER BY TABLE_NAME, KEY_NAME;
```
**Store result as:** `column_mappings` (optional)
---
#### Query 11: Filter Lookup Configuration (Optional)
```sql
SELECT
KEY_NAME,
INVALID_TEXTS,
VALID_REGEXP
FROM {filter_lookup_table};
```
**Store result as:** `validation_rules` (optional)
---
#### Query 12: Master Table Record Count
```sql
SELECT COUNT(*) as total_records
FROM {master_table};
```
**Store result as:** `master_count`
**Validation:** Should equal `unique_canonical_ids`
---
#### Query 13: Deduplication Rate Calculation
```sql
WITH source_stats AS (
SELECT
DISTINCT_CUSTOMER_ID as source_customer_id,
DISTINCT_EMAIL as source_email,
DISTINCT_PHONE as source_phone
FROM {source_stats_table}
WHERE FROM_TABLE = '*'
),
result_stats AS (
SELECT TOTAL_DISTINCT as final_canonical_ids
FROM {result_stats_table}
WHERE FROM_TABLE = '*'
)
SELECT
source_customer_id,
source_email,
source_phone,
final_canonical_ids,
ROUND((source_customer_id - final_canonical_ids) * 100.0 / NULLIF(source_customer_id, 0), 1) as customer_id_dedup_pct,
ROUND((source_email - final_canonical_ids) * 100.0 / NULLIF(source_email, 0), 1) as email_dedup_pct,
ROUND((source_phone - final_canonical_ids) * 100.0 / NULLIF(source_phone, 0), 1) as phone_dedup_pct
FROM source_stats, result_stats;
```
**Store result as:** `deduplication_rates`
---
### Phase 4: Data Processing and Metric Calculation
**Calculate all derived metrics:**
1. **Executive Summary Metrics:**
```python
unified_profiles = unique_canonical_ids # from Query 3
total_identities = total_canonical_ids # from Query 3
merge_ratio = total_identities / unified_profiles
convergence_iterations = 4 # default or parse from logs if available
```
2. **Fragmentation Reduction:**
```python
reduction_pct = ((total_identities - unified_profiles) / total_identities) * 100
```
3. **Deduplication Rates:**
```python
customer_id_dedup = deduplication_rates['customer_id_dedup_pct']
email_dedup = deduplication_rates['email_dedup_pct']
phone_dedup = deduplication_rates['phone_dedup_pct']
```
4. **Merge Distribution Percentages:**
```python
for category in merge_distribution:
category['pct_profiles'] = (category['canonical_id_count'] / unified_profiles) * 100
category['pct_identities'] = (category['total_identities'] / total_identities) * 100
```
5. **Data Quality Score:**
```python
quality_scores = [
master_coverage['email_coverage_pct'],
master_coverage['phone_coverage_pct'],
master_coverage['name_coverage_pct'],
# ... other coverage metrics
]
overall_quality = sum(quality_scores) / len(quality_scores)
```
### Phase 5: HTML Report Generation
**CRITICAL: Use EXACT HTML structure from reference report**
**HTML Template Structure:**
```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>ID Unification Merge Statistics Report</title>
<style>
/* EXACT CSS from reference report */
/* Copy all styles exactly */
</style>
</head>
<body>
<div class="container">
<header>
<h1>ID Unification Merge Statistics Report</h1>
<p>Comprehensive Identity Resolution Performance Analysis</p>
</header>
<div class="metadata">
<div class="metadata-item">
<strong>Database/Catalog:</strong> {database_or_catalog}.{schema}
</div>
<div class="metadata-item">
<strong>Canonical ID:</strong> {canonical_id}
</div>
<div class="metadata-item">
<strong>Generated:</strong> {current_date}
</div>
<div class="metadata-item">
<strong>Platform:</strong> {platform}
</div>
</div>
<div class="content">
<!-- Section 1: Executive Summary -->
<div class="section">
<h2 class="section-title">Executive Summary</h2>
<div class="metrics-grid">
<div class="metric-card">
<div class="metric-label">Unified Profiles</div>
<div class="metric-value">{unified_profiles:,}</div>
<div class="metric-sublabel">Canonical Customer IDs</div>
</div>
<div class="metric-card">
<div class="metric-label">Total Identities</div>
<div class="metric-value">{total_identities:,}</div>
<div class="metric-sublabel">Raw identity records merged</div>
</div>
<div class="metric-card">
<div class="metric-label">Merge Ratio</div>
<div class="metric-value">{merge_ratio:.2f}:1</div>
<div class="metric-sublabel">Identities per customer</div>
</div>
<div class="metric-card">
<div class="metric-label">Convergence</div>
<div class="metric-value">{convergence_iterations}</div>
<div class="metric-sublabel">Iterations</div>
</div>
</div>
<div class="insight-box">
<h4>Key Findings</h4>
<ul>
<li><strong>Excellent Merge Performance:</strong> Successfully unified {total_identities:,} identity records into {unified_profiles:,} canonical customer profiles, achieving a {reduction_pct:.1f}% reduction in identity fragmentation.</li>
<!-- Add more insights based on data -->
</ul>
</div>
</div>
<!-- Section 2: Identity Resolution Performance -->
<div class="section">
<h2 class="section-title">Identity Resolution Performance</h2>
<table class="stats-table">
<thead>
<tr>
<th>Identity Key Type</th>
<th>Source Distinct Count</th>
<th>Final Canonical IDs</th>
<th>Deduplication Rate</th>
<th>Quality Score</th>
</tr>
</thead>
<tbody>
<!-- For each key type in key_type_distribution -->
<tr>
<td><strong>{key_name}</strong></td>
<td>{source_count:,}</td>
<td>{unique_canonical_ids:,}</td>
<td><span class="highlight">{dedup_pct:.1f}% reduction</span></td>
<td><span class="badge badge-success">Excellent</span></td>
</tr>
<!-- Repeat for each key -->
</tbody>
</table>
<!-- Add bar charts, insights, etc. -->
</div>
<!-- Section 3: Merge Distribution Analysis -->
<!-- Section 4: Top Merged Profiles -->
<!-- Section 5: Source Table Configuration -->
<!-- Section 6: Master Table Data Quality -->
<!-- Section 7: Convergence Performance -->
<!-- Section 8: Expert Recommendations -->
<!-- Section 9: Summary Statistics -->
</div>
<footer>
<div class="footer-note">
<p><strong>Report Generated:</strong> {current_date}</p>
<p><strong>Platform:</strong> {platform} ({database}.{schema})</p>
<p><strong>Workflow:</strong> Hybrid ID Unification</p>
</div>
</footer>
</div>
</body>
</html>
```
**Data Insertion Rules:**
1. **Numbers**: Format with commas (e.g., 19,512)
2. **Percentages**: Round to 1 decimal place (e.g., 74.7%)
3. **Ratios**: Round to 2 decimal places (e.g., 3.95:1)
4. **Dates**: Use YYYY-MM-DD format
5. **Platform**: Capitalize (Snowflake or Databricks)
**Dynamic Content Generation:**
- For each metric card: Insert actual calculated values
- For each table row: Loop through result sets
- For each bar chart: Calculate width percentages
- For each insight: Generate based on data patterns
### Phase 6: Report Validation and Output
**Pre-Output Validation:**
```python
validations = [
("All sections have data", check_all_sections_populated()),
("Calculations are correct", verify_calculations()),
("Percentages sum properly", check_percentage_sums()),
("No missing values", check_no_nulls()),
("HTML is well-formed", validate_html_syntax())
]
for validation_name, result in validations:
if not result:
raise ValueError(f"Validation failed: {validation_name}")
```
**File Output:**
```python
# Use Write tool to save HTML
Write(
file_path=output_path,
content=html_content
)
# Verify file was written
if file_exists(output_path):
file_size = get_file_size(output_path)
print(f"✓ Report generated: {output_path}")
print(f"✓ File size: {file_size} KB")
else:
raise Error("Failed to write report file")
```
**Success Summary:**
```
✓ Report generated successfully
✓ Location: {output_path}
✓ File size: {size} KB
✓ Sections: 9
✓ Statistics queries: 16
✓ Unified profiles: {unified_profiles:,}
✓ Data quality score: {overall_quality:.1f}%
✓ Ready for viewing and PDF export
Next steps:
1. Open {output_path} in your browser
2. Review merge statistics and insights
3. Print to PDF for distribution (Ctrl+P or Cmd+P)
4. Share with stakeholders
```
---
## Error Handling
### Handle These Scenarios:
1. **Tables Not Found:**
```
Error: Table {lookup_table} does not exist
Possible causes:
- Canonical ID name is incorrect
- Unification workflow not completed
- Database/schema name is wrong
Please verify:
- Database/Catalog: {database}
- Schema: {schema}
- Canonical ID: {canonical_id}
- Expected table: {canonical_id}_lookup
```
2. **No Data in Tables:**
```
Error: Tables exist but contain no data
This indicates the unification workflow may have failed.
Please:
1. Check workflow execution logs
2. Verify source tables have data
3. Re-run the unification workflow
4. Try again after successful completion
```
3. **MCP Tools Unavailable:**
```
Error: Cannot connect to {platform}
MCP tools for {platform} are not available.
Please:
1. Verify MCP server configuration
2. Check network connectivity
3. Validate credentials
4. Contact administrator if issue persists
```
4. **Permission Errors:**
```
Error: Access denied to {table}
You don't have SELECT permission on this table.
Please:
1. Request SELECT permission from administrator
2. Verify your role has access
3. For Snowflake: GRANT SELECT ON SCHEMA {schema} TO {user}
4. For Databricks: GRANT SELECT ON {table} TO {user}
```
5. **Column Not Found:**
```
Warning: Column {column_name} not found in master table
Skipping coverage calculation for this attribute.
Report will be generated without this metric.
```
---
## Quality Standards
### Report Must Meet These Criteria:
✅ **Accuracy**: All metrics calculated correctly from source data
✅ **Completeness**: All 9 sections populated with data
✅ **Consistency**: Same HTML structure every time
✅ **Readability**: Clear tables, charts, and insights
✅ **Professional**: Executive-ready formatting and language
✅ **Actionable**: Includes specific recommendations
✅ **Validated**: All calculations double-checked
✅ **Browser-compatible**: Works in Chrome, Firefox, Safari, Edge
✅ **PDF-ready**: Exports cleanly to PDF
✅ **Responsive**: Adapts to different screen sizes
---
## Expert Analysis Guidelines
### When Writing Insights:
1. **Be Data-Driven**: Reference specific metrics
- "Successfully unified 19,512 identities into 4,940 profiles"
- NOT: "Good unification performance"
2. **Provide Context**: Compare to benchmarks
- "4-iteration convergence is excellent (typical is 8-12)"
- "74.7% fragmentation reduction exceeds industry average of 60%"
3. **Identify Patterns**: Highlight interesting findings
- "89% of profiles have 3-5 identities, indicating normal multi-channel engagement"
- "Top merged profile has 38 identities - worth investigating"
4. **Give Actionable Recommendations**:
- "Review profiles with 20+ merges for data quality issues"
- "Implement incremental processing for efficiency"
5. **Assess Quality**: Grade and explain
- "Email coverage: 100% - Excellent for marketing"
- "Phone coverage: 99.39% - Near-perfect, 30 missing values"
### Badge Assignment:
- **Excellent**: 95-100% coverage or <5% deduplication
- **Good**: 85-94% coverage or 5-15% deduplication
- **Needs Improvement**: <85% coverage or >15% deduplication
---
## Platform-Specific Adaptations
### Snowflake Specifics:
- Use UPPERCASE for all identifiers (DATABASE, SCHEMA, TABLE, COLUMN)
- Use `ARRAY_CONSTRUCT()` for array creation
- Use `OBJECT_CONSTRUCT()` for objects
- Date format: `TO_CHAR(CURRENT_DATE(), 'YYYY-MM-DD')`
### Databricks Specifics:
- Use lowercase for identifiers (catalog, schema, table, column)
- Use `ARRAY()` for array creation
- Use `STRUCT()` for objects
- Date format: `DATE_FORMAT(CURRENT_DATE(), 'yyyy-MM-dd')`
---
## Success Checklist
Before marking task complete:
- [ ] All required user inputs collected
- [ ] Platform and table names validated
- [ ] All 16 queries executed successfully
- [ ] All metrics calculated correctly
- [ ] HTML report generated with all sections
- [ ] File written to specified path
- [ ] Success summary displayed to user
- [ ] No errors or warnings in output
---
## Final Agent Output
**When complete, output this exact format:**
```
════════════════════════════════════════════════════════════════
ID UNIFICATION MERGE STATISTICS REPORT - GENERATION COMPLETE
════════════════════════════════════════════════════════════════
Platform: {platform}
Database/Catalog: {database}
Schema: {schema}
Canonical ID: {canonical_id}
STATISTICS SUMMARY
──────────────────────────────────────────────────────────────
Unified Profiles: {unified_profiles:,}
Total Identities: {total_identities:,}
Merge Ratio: {merge_ratio:.2f}:1
Fragmentation Reduction: {reduction_pct:.1f}%
Data Quality Score: {quality_score:.1f}%
REPORT DETAILS
──────────────────────────────────────────────────────────────
Output File: {output_path}
File Size: {file_size} KB
Sections Included: 9
Queries Executed: 16
Generation Time: {generation_time} seconds
NEXT STEPS
──────────────────────────────────────────────────────────────
1. Open {output_path} in your web browser
2. Review merge statistics and expert insights
3. Export to PDF: Press Ctrl+P (Windows) or Cmd+P (Mac)
4. Share with stakeholders and decision makers
✓ Report generation successful!
════════════════════════════════════════════════════════════════
```
---
**You are now ready to execute as the expert merge statistics report generator agent!**

View File

@@ -0,0 +1,114 @@
# Snowflake SQL Generator Agent
## Agent Purpose
Generate production-ready Snowflake SQL from `unify.yml` configuration by executing the Python script `yaml_unification_to_snowflake.py`.
## Agent Workflow
### Step 1: Validate Inputs
**Check**:
- YAML file exists and is valid
- Target database and schema provided
- Source database/schema (defaults to target database/PUBLIC if not provided)
- Output directory path
### Step 2: Execute Python Script
**Use Bash tool** to execute:
```bash
python3 /path/to/plugins/cdp-hybrid-idu/scripts/snowflake/yaml_unification_to_snowflake.py \
<yaml_file> \
-d <target_database> \
-s <target_schema> \
-sd <source_database> \
-ss <source_schema> \
-o <output_directory>
```
**Parameters**:
- `<yaml_file>`: Path to unify.yml
- `-d`: Target database name
- `-s`: Target schema name
- `-sd`: Source database (optional, defaults to target database)
- `-ss`: Source schema (optional, defaults to PUBLIC)
- `-o`: Output directory (optional, defaults to `snowflake_sql`)
### Step 3: Monitor Execution
**Track**:
- Script execution progress
- Generated SQL file count
- Any warnings or errors
- Output directory structure
### Step 4: Parse and Report Results
**Output**:
```
✓ Snowflake SQL generation complete!
Generated Files:
• snowflake_sql/unify/01_create_graph.sql
• snowflake_sql/unify/02_extract_merge.sql
• snowflake_sql/unify/03_source_key_stats.sql
• snowflake_sql/unify/04_unify_loop_iteration_01.sql
... (up to iteration_N)
• snowflake_sql/unify/05_canonicalize.sql
• snowflake_sql/unify/06_result_key_stats.sql
• snowflake_sql/unify/10_enrich_*.sql
• snowflake_sql/unify/20_master_*.sql
• snowflake_sql/unify/30_unification_metadata.sql
• snowflake_sql/unify/31_filter_lookup.sql
• snowflake_sql/unify/32_column_lookup.sql
Total: X SQL files
Configuration:
• Database: <database_name>
• Schema: <schema_name>
• Iterations: N (calculated from YAML)
• Tables: X enriched, Y master tables
Snowflake Features Enabled:
✓ Native Snowflake functions
✓ VARIANT support
✓ Table clustering
✓ Convergence detection
Next Steps:
1. Review generated SQL files
2. Execute using: /cdp-hybrid-idu:hybrid-execute-snowflake
3. Or manually execute in Snowflake SQL worksheet
```
## Critical Behaviors
### Python Script Error Handling
If script fails:
1. Capture error output
2. Parse error message
3. Provide helpful suggestions:
- YAML syntax errors → validate YAML
- Missing dependencies → install pyyaml
- Invalid table names → check YAML table section
- File permission errors → check output directory permissions
### Success Validation
Verify:
- Output directory created
- All expected SQL files present
- Files have non-zero content
- SQL syntax looks valid (basic check)
### Platform-Specific Conversions
Report applied conversions:
- Presto/Databricks functions → Snowflake equivalents
- Array operations → ARRAY_CONSTRUCT/FLATTEN syntax
- Time functions → DATE_PART(epoch_second, ...)
- Table definitions → Snowflake syntax
## MUST DO
1. **Always use absolute paths** for plugin scripts
2. **Check Python version** (require Python 3.7+)
3. **Parse script output** for errors and warnings
4. **Verify output directory** structure
5. **Count generated files** and report summary
6. **Provide clear next steps** for execution

View File

@@ -0,0 +1,138 @@
# Snowflake Workflow Executor Agent
## Agent Purpose
Execute generated Snowflake SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling by orchestrating the Python script `snowflake_sql_executor.py`.
## Agent Workflow
### Step 1: Collect Credentials
**Required**:
- SQL directory path
- Account name
- Username
- Database and schema names
- Warehouse name (defaults to `COMPUTE_WH`)
**Authentication Options**:
- Password (from argument, environment variable `SNOWFLAKE_PASSWORD`, or prompt)
- SSO (externalbrowser)
- Key-pair (using environment variables)
### Step 2: Execute Python Script
**Use Bash tool** with `run_in_background: true` to execute:
```bash
python3 /path/to/plugins/cdp-hybrid-idu/scripts/snowflake/snowflake_sql_executor.py \
<sql_directory> \
--account <account> \
--user <user> \
--database <database> \
--schema <schema> \
--warehouse <warehouse> \
--password <password>
```
### Step 3: Monitor Execution in Real-Time
**Use BashOutput tool** to stream progress:
- Connection status
- File execution progress
- Row counts and timing
- Convergence detection results
- Error messages
**Display Progress**:
```
✓ Connected to Snowflake: <account>
• Using database: <database>, schema: <schema>
Executing: 01_create_graph.sql
✓ Completed: 01_create_graph.sql
Executing: 02_extract_merge.sql
✓ Completed: 02_extract_merge.sql
• Rows affected: 125,000
Executing Unify Loop (convergence detection)
--- Iteration 1 ---
✓ Iteration 1 completed
• Updated records: 1,500
--- Iteration 2 ---
✓ Iteration 2 completed
• Updated records: 450
--- Iteration 3 ---
✓ Iteration 3 completed
• Updated records: 0
✓ Loop converged after 3 iterations!
• Creating alias table: loop_final
...
```
### Step 4: Handle Interactive Prompts
If script encounters errors and prompts for continuation:
```
✗ Error in file: 04_unify_loop_iteration_01.sql
Error: Table not found
Continue with remaining files? (y/n):
```
**Agent Decision**:
1. Show error to user
2. Ask user for decision
3. Pass response to script
### Step 5: Final Report
**After completion**:
```
Execution Complete!
Summary:
• Files processed: 18/18
• Execution time: 45 minutes
• Convergence: 3 iterations
• Final lookup table rows: 98,500
Validation:
✓ All tables created successfully
✓ Canonical IDs generated
✓ Enriched tables populated
✓ Master tables created
Next Steps:
1. Verify data quality
2. Check coverage metrics
3. Review statistics tables
```
## Critical Behaviors
### Convergence Monitoring
Track loop iterations:
- Iteration number
- Records updated
- Convergence status
### Error Recovery
On errors:
1. Capture error details
2. Determine severity (critical vs warning)
3. Prompt user for continuation decision
4. Log error for troubleshooting
### Performance Tracking
Monitor:
- Execution time per file
- Row counts processed
- Total workflow time
## MUST DO
1. **Stream output in real-time** using BashOutput
2. **Monitor convergence** and report iterations
3. **Handle user prompts** for error continuation
4. **Report final statistics** with coverage metrics
5. **Verify connection** before starting execution
6. **Clean up** on termination or error

View File

@@ -0,0 +1,382 @@
# YAML Configuration Builder Agent
## Agent Purpose
Interactive agent to help users create proper `unify.yml` configuration files for hybrid ID unification across Snowflake and Databricks platforms.
## Agent Capabilities
- Guide users through YAML creation step-by-step
- Validate configuration in real-time
- Provide examples and best practices
- Support both simple and complex configurations
- Ensure platform compatibility (Snowflake and Databricks)
---
## Agent Workflow
### Step 1: Project Name and Scope
**Collect**:
- Unification project name
- Brief description of use case
**Example Interaction**:
```
Question: What would you like to name this unification project?
Suggestion: Use a descriptive name like 'customer_unification' or 'user_identity_resolution'
User input: customer_360
✓ Project name: customer_360
```
---
### Step 2: Define Keys (User Identifiers)
**Collect**:
- Key names (email, customer_id, phone_number, etc.)
- Validation rules for each key:
- `valid_regexp`: Regex pattern for format validation
- `invalid_texts`: Array of values to exclude
**Example Interaction**:
```
Question: What user identifier columns (keys) do you want to use for unification?
Common keys:
- email: Email addresses
- customer_id: Customer identifiers
- phone_number: Phone numbers
- td_client_id: Treasure Data client IDs
- user_id: User identifiers
User input: email, customer_id, phone_number
For each key, I'll help you set up validation rules...
Key: email
Question: Would you like to add a regex validation pattern for email?
Suggestion: Use ".*@.*" for basic email validation or more strict patterns
User input: .*@.*
Question: What values should be considered invalid?
Suggestion: Common invalid values: '', 'N/A', 'null', 'unknown'
User input: '', 'N/A', 'null'
✓ Key 'email' configured with regex validation and 3 invalid values
```
**Generate YAML Section**:
```yaml
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A', 'null']
- name: phone_number
invalid_texts: ['', 'N/A', 'null']
```
---
### Step 3: Map Tables to Keys
**Collect**:
- Source table names
- Key column mappings for each table
**Example Interaction**:
```
Question: What source tables contain user identifiers?
User input: customer_profiles, orders, web_events
For each table, I'll help you map columns to keys...
Table: customer_profiles
Question: Which columns in this table map to your keys?
Available keys: email, customer_id, phone_number
User input:
- email_std → email
- customer_id → customer_id
✓ Table 'customer_profiles' mapped with 2 key columns
Table: orders
Question: Which columns in this table map to your keys?
User input:
- email_address → email
- phone → phone_number
✓ Table 'orders' mapped with 2 key columns
```
**Generate YAML Section**:
```yaml
tables:
- table: customer_profiles
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
- table: orders
key_columns:
- {column: email_address, key: email}
- {column: phone, key: phone_number}
- table: web_events
key_columns:
- {column: user_email, key: email}
```
---
### Step 4: Configure Canonical ID
**Collect**:
- Canonical ID name
- Merge keys (priority order)
- Iteration count (optional)
**Example Interaction**:
```
Question: What would you like to name the canonical ID column?
Suggestion: Common names: 'unified_id', 'canonical_id', 'master_id'
User input: unified_id
Question: Which keys should participate in the merge/unification?
Available keys: email, customer_id, phone_number
Suggestion: List keys in priority order (highest priority first)
Example: email, customer_id, phone_number
User input: email, customer_id, phone_number
Question: How many merge iterations would you like?
Suggestion:
- Leave blank to auto-calculate based on complexity
- Typical range: 3-10 iterations
- More keys/tables = more iterations needed
User input: (blank - auto-calculate)
✓ Canonical ID 'unified_id' configured with 3 merge keys
✓ Iterations will be auto-calculated
```
**Generate YAML Section**:
```yaml
canonical_ids:
- name: unified_id
merge_by_keys: [email, customer_id, phone_number]
# merge_iterations: 15auto-calculated
```
---
### Step 5: Configure Master Tables (Optional)
**Collect**:
- Master table names
- Attributes to aggregate
- Source column priorities
**Example Interaction**:
```
Question: Would you like to create master tables with aggregated attributes?
(Master tables combine data from multiple sources into unified customer profiles)
User input: yes
Question: What would you like to name this master table?
Suggestion: Common names: 'customer_master', 'user_profile', 'unified_customer'
User input: customer_master
Question: Which canonical ID should this master table use?
Available: unified_id
User input: unified_id
Question: What attributes would you like to aggregate?
Attribute 1:
Name: best_email
Type: single value or array?
User input: single value
Source columns (priority order):
1. Table: customer_profiles, Column: email_std, Order by: time
2. Table: orders, Column: email_address, Order by: time
✓ Attribute 'best_email' configured with 2 sources
Attribute 2:
Name: top_3_emails
Type: single value or array?
User input: array
Array size: 3
Source columns (priority order):
1. Table: customer_profiles, Column: email_std, Order by: time
2. Table: orders, Column: email_address, Order by: time
✓ Attribute 'top_3_emails' configured as array with 2 sources
```
**Generate YAML Section**:
```yaml
master_tables:
- name: customer_master
canonical_id: unified_id
attributes:
- name: best_email
source_columns:
- {table: customer_profiles, column: email_std, priority: 1, order_by: time}
- {table: orders, column: email_address, priority: 2, order_by: time}
- name: top_3_emails
array_elements: 3
source_columns:
- {table: customer_profiles, column: email_std, priority: 1, order_by: time}
- {table: orders, column: email_address, priority: 2, order_by: time}
```
---
### Step 6: Validation and Finalization
**Perform**:
1. Validate complete YAML structure
2. Check all references
3. Suggest optimizations
4. Write final `unify.yml` file
**Example Output**:
```
Validating configuration...
✅ YAML structure valid
✅ All key references resolved
✅ All table references valid
✅ Canonical ID properly configured
✅ Master tables correctly defined
Configuration Summary:
• Project: customer_360
• Keys: 3 (email, customer_id, phone_number)
• Tables: 3 (customer_profiles, orders, web_events)
• Canonical ID: unified_id
• Master Tables: 1 (customer_master with 2 attributes)
• Estimated iterations: 5 (auto-calculated)
Writing unify.yml...
✓ Configuration file created successfully!
File location: ./unify.yml
```
---
## Agent Output
### Success
Returns complete `unify.yml` with:
- All sections properly structured
- Valid YAML syntax
- Optimized configuration
- Ready for SQL generation
### Validation
Performs checks:
- YAML syntax validation
- Reference integrity
- Best practices compliance
- Platform compatibility
---
## Agent Behavior Guidelines
### Be Interactive
- Ask clear questions
- Provide examples
- Suggest best practices
- Validate responses
### Be Helpful
- Explain concepts when needed
- Offer suggestions
- Show examples
- Guide through complex scenarios
### Be Thorough
- Don't skip validation
- Check all references
- Ensure completeness
- Verify platform compatibility
---
## Example Complete YAML Output
```yaml
name: customer_360
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null', 'unknown']
- name: customer_id
invalid_texts: ['', 'N/A', 'null']
- name: phone_number
invalid_texts: ['', 'N/A', 'null']
tables:
- table: customer_profiles
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
- table: orders
key_columns:
- {column: email_address, key: email}
- {column: phone, key: phone_number}
- table: web_events
key_columns:
- {column: user_email, key: email}
canonical_ids:
- name: unified_id
merge_by_keys: [email, customer_id, phone_number]
merge_iterations: 15
master_tables:
- name: customer_master
canonical_id: unified_id
attributes:
- name: best_email
source_columns:
- {table: customer_profiles, column: email_std, priority: 1, order_by: time}
- {table: orders, column: email_address, priority: 2, order_by: time}
- name: primary_phone
source_columns:
- {table: orders, column: phone, priority: 1, order_by: time}
- name: top_3_emails
array_elements: 3
source_columns:
- {table: customer_profiles, column: email_std, priority: 1, order_by: time}
- {table: orders, column: email_address, priority: 2, order_by: time}
```
---
## CRITICAL: Agent Must
1. **Always validate** YAML syntax before writing file
2. **Check all references** (keys, tables, canonical_ids)
3. **Provide examples** for complex configurations
4. **Suggest optimizations** based on use case
5. **Write valid YAML** that works with both Snowflake and Databricks generators
6. **Use proper indentation** (2 spaces per level)
7. **Quote string values** where necessary
8. **Test regex patterns** before adding to configuration

View File

@@ -0,0 +1,387 @@
---
name: hybrid-execute-databricks
description: Execute Databricks ID unification workflow with convergence detection and monitoring
---
# Execute Databricks ID Unification Workflow
## Overview
Execute your generated Databricks SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling. This command orchestrates the complete unification process from graph creation to master table generation.
---
## What You Need
### Required Inputs
1. **SQL Directory**: Path to generated SQL files (e.g., `databricks_sql/unify/`)
2. **Server Hostname**: Your Databricks workspace URL (e.g., `your-workspace.cloud.databricks.com`)
3. **HTTP Path**: SQL Warehouse or cluster path (e.g., `/sql/1.0/warehouses/abc123`)
4. **Catalog**: Target catalog name
5. **Schema**: Target schema name
### Authentication
**Option 1: Personal Access Token (PAT)**
- Access token from Databricks workspace
- Can be provided as argument or via environment variable `DATABRICKS_TOKEN`
**Option 2: OAuth**
- Browser-based authentication
- No token required, will open browser for login
---
## What I'll Do
### Step 1: Connection Setup
- Connect to your Databricks workspace
- Validate credentials and permissions
- Set catalog and schema context
- Verify SQL directory exists
### Step 2: Execution Plan
Display execution plan with:
- All SQL files in execution order
- File types (Setup, Loop Iteration, Enrichment, Master Table, etc.)
- Estimated steps and dependencies
### Step 3: SQL Execution
I'll call the **databricks-workflow-executor agent** to:
- Execute SQL files in proper sequence
- Skip loop iteration files (handled separately)
- Monitor progress with real-time feedback
- Track row counts and execution times
### Step 4: Unify Loop with Convergence Detection
**Intelligent Loop Execution**:
```
Iteration 1:
✓ Execute unify SQL
• Check convergence: 1500 records updated
• Optimize Delta table
→ Continue to iteration 2
Iteration 2:
✓ Execute unify SQL
• Check convergence: 450 records updated
• Optimize Delta table
→ Continue to iteration 3
Iteration 3:
✓ Execute unify SQL
• Check convergence: 0 records updated
✓ CONVERGED! Stop loop
```
**Features**:
- Runs until convergence (updated_count = 0)
- Maximum 30 iterations safety limit
- Auto-optimization after each iteration
- Creates alias table (loop_final) for downstream processing
### Step 5: Post-Loop Processing
- Execute canonicalization step
- Generate result statistics
- Enrich source tables with canonical IDs
- Create master tables
- Generate metadata and lookup tables
### Step 6: Final Report
Provide:
- Total execution time
- Files processed successfully
- Convergence statistics
- Final table row counts
- Next steps and recommendations
---
## Command Usage
### Interactive Mode (Recommended)
```
/cdp-hybrid-idu:hybrid-execute-databricks
I'll prompt you for:
- SQL directory path
- Databricks server hostname
- HTTP path
- Catalog and schema
- Authentication method
```
### Advanced Mode
Provide all parameters upfront:
```
SQL directory: databricks_sql/unify/
Server hostname: your-workspace.cloud.databricks.com
HTTP path: /sql/1.0/warehouses/abc123
Catalog: my_catalog
Schema: my_schema
Auth type: pat (or oauth)
Access token: dapi... (if using PAT)
```
---
## Execution Features
### 1. Convergence Detection
**Algorithm**:
```sql
SELECT COUNT(*) as updated_count FROM (
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM current_iteration
EXCEPT
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM previous_iteration
) diff
```
**Stops when**: updated_count = 0
### 2. Delta Table Optimization
After major operations:
```sql
OPTIMIZE table_name
```
Benefits:
- Compacts small files
- Improves query performance
- Reduces storage costs
- Optimizes clustering
### 3. Interactive Error Handling
If an error occurs:
```
✗ File: 04_unify_loop_iteration_01.sql
Error: Table not found: source_table
Continue with remaining files? (y/n):
```
You can choose to:
- Continue: Skip failed file, continue with rest
- Stop: Halt execution for investigation
### 4. Real-Time Monitoring
Track progress with:
- ✓ Completed steps (green)
- • Progress indicators (cyan)
- ✗ Failed steps (red)
- ⚠ Warnings (yellow)
- Row counts and execution times
### 5. Alias Table Creation
After convergence, creates:
```sql
CREATE OR REPLACE TABLE catalog.schema.unified_id_graph_unify_loop_final
AS SELECT * FROM catalog.schema.unified_id_graph_unify_loop_3
```
This allows downstream SQL to reference `loop_final` regardless of actual iteration count.
---
## Technical Details
### Python Script Execution
The agent executes:
```bash
python3 scripts/databricks/databricks_sql_executor.py \
databricks_sql/unify/ \
--server-hostname your-workspace.databricks.com \
--http-path /sql/1.0/warehouses/abc123 \
--catalog my_catalog \
--schema my_schema \
--auth-type pat \
--optimize-tables
```
### Execution Order
1. **Setup Phase** (01-03):
- Create graph table (loop_0)
- Extract and merge identities
- Generate source statistics
2. **Unification Loop** (04):
- Run iterations until convergence
- Check after EVERY iteration
- Stop when updated_count = 0
- Create loop_final alias
3. **Canonicalization** (05):
- Create canonical ID lookup
- Create keys and tables metadata
- Rename final graph table
4. **Statistics** (06):
- Generate result key statistics
- Create histograms
- Calculate coverage metrics
5. **Enrichment** (10-19):
- Add canonical IDs to source tables
- Create enriched_* tables
6. **Master Tables** (20-29):
- Aggregate attributes
- Apply priority rules
- Create unified customer profiles
7. **Metadata** (30-39):
- Unification metadata
- Filter lookup tables
- Column lookup tables
### Connection Management
- Establishes single connection for entire workflow
- Uses connection pooling for efficiency
- Automatic reconnection on timeout
- Proper cleanup on completion or error
---
## Example Execution
### Input
```
SQL directory: databricks_sql/unify/
Server hostname: dbc-12345-abc.cloud.databricks.com
HTTP path: /sql/1.0/warehouses/6789abcd
Catalog: customer_data
Schema: id_unification
Auth type: pat
```
### Output
```
✓ Connected to Databricks: dbc-12345-abc.cloud.databricks.com
• Using catalog: customer_data, schema: id_unification
Starting Databricks SQL Execution
• Catalog: customer_data
• Schema: id_unification
• Delta tables: ✓ enabled
Executing: 01_create_graph.sql
✓ 01_create_graph.sql: Executed successfully
Executing: 02_extract_merge.sql
✓ 02_extract_merge.sql: Executed successfully
• Rows affected: 125000
Executing: 03_source_key_stats.sql
✓ 03_source_key_stats.sql: Executed successfully
Executing Unify Loop Before Canonicalization
--- Iteration 1 ---
✓ Iteration 1 completed
• Rows processed: 125000
• Updated records: 1500
• Optimizing Delta table
--- Iteration 2 ---
✓ Iteration 2 completed
• Rows processed: 125000
• Updated records: 450
• Optimizing Delta table
--- Iteration 3 ---
✓ Iteration 3 completed
• Rows processed: 125000
• Updated records: 0
✓ Loop converged after 3 iterations
• Creating alias table for final iteration
✓ Alias table 'unified_id_graph_unify_loop_final' created
Executing: 05_canonicalize.sql
✓ 05_canonicalize.sql: Executed successfully
[... continues with enrichment, master tables, metadata ...]
Execution Complete
• Files processed: 18/18
• Final unified_id_lookup rows: 98,500
• Disconnected from Databricks
```
---
## Monitoring and Troubleshooting
### Check Execution Progress
During execution, you can monitor:
- Databricks SQL Warehouse query history
- Delta table sizes and row counts
- Execution logs in Databricks workspace
### Common Issues
**Issue**: Connection timeout
**Solution**: Check network access, verify credentials, ensure SQL Warehouse is running
**Issue**: Table not found
**Solution**: Verify catalog/schema permissions, check source table names in YAML
**Issue**: Loop doesn't converge
**Solution**: Check data quality, increase max_iterations, review key validation rules
**Issue**: Out of memory
**Solution**: Increase SQL Warehouse size, optimize clustering, reduce batch sizes
**Issue**: Permission denied
**Solution**: Verify catalog/schema permissions, check Unity Catalog access controls
### Performance Optimization
- Use larger SQL Warehouse for faster execution
- Enable auto-scaling for variable workloads
- Optimize Delta tables regularly
- Use clustering on frequently joined columns
---
## Post-Execution Validation
**DO NOT RUN THESE VALIDATION. JUST PRESENT TO USER TO RUN ON DATABRICKS**
### Check Coverage
```sql
SELECT
COUNT(*) as total_records,
COUNT(unified_id) as records_with_id,
COUNT(unified_id) * 100.0 / COUNT(*) as coverage_percent
FROM catalog.schema.enriched_customer_profiles;
```
### Verify Master Table
```sql
SELECT COUNT(*) as unified_customers
FROM catalog.schema.customer_master;
```
### Review Statistics
```sql
SELECT * FROM catalog.schema.unified_id_result_key_stats
WHERE from_table = '*';
```
---
## Success Criteria
Execution successful when:
- ✅ All SQL files processed without critical errors
- ✅ Unification loop converged (updated_count = 0)
- ✅ Canonical IDs generated for all eligible records
- ✅ Enriched tables created successfully
- ✅ Master tables populated with attributes
- ✅ Coverage metrics meet expectations
---
**Ready to execute your Databricks ID unification workflow?**
Provide your SQL directory path and Databricks connection details to begin!

View File

@@ -0,0 +1,401 @@
---
name: hybrid-execute-snowflake
description: Execute Snowflake ID unification workflow with convergence detection and monitoring
---
# Execute Snowflake ID Unification Workflow
## Overview
Execute your generated Snowflake SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling. This command orchestrates the complete unification process from graph creation to master table generation.
---
## What You Need
### Required Inputs
1. **SQL Directory**: Path to generated SQL files (e.g., `snowflake_sql/unify/`)
2. **Account**: Snowflake account name (e.g., `myaccount` from `myaccount.snowflakecomputing.com`)
3. **User**: Snowflake username
4. **Database**: Target database name
5. **Schema**: Target schema name
6. **Warehouse**: Compute warehouse name (defaults to `COMPUTE_WH`)
### Authentication
**Option 1: Password**
- Can be provided as argument or via environment variable `SNOWFLAKE_PASSWORD` via environment file (.env) `SNOWFLAKE_PASSWORD`
- Will prompt if not provided
**Option 2: SSO (externalbrowser)**
- Opens browser for authentication
- No password required
**Option 3: Key-Pair**
- Private key path via `SNOWFLAKE_PRIVATE_KEY_PATH`
- Passphrase via `SNOWFLAKE_PRIVATE_KEY_PASSPHRASE`
---
## What I'll Do
### Step 1: Connection Setup
- Connect to your Snowflake account
- Validate credentials and permissions
- Set database and schema context
- Verify SQL directory exists
- Activate warehouse
### Step 2: Execution Plan
Display execution plan with:
- All SQL files in execution order
- File types (Setup, Loop Iteration, Enrichment, Master Table, etc.)
- Estimated steps and dependencies
### Step 3: SQL Execution
I'll call the **snowflake-workflow-executor agent** to:
- Execute SQL files in proper sequence
- Skip loop iteration files (handled separately)
- Monitor progress with real-time feedback
- Track row counts and execution times
### Step 4: Unify Loop with Convergence Detection
**Intelligent Loop Execution**:
```
Iteration 1:
✓ Execute unify SQL
• Check convergence: 1500 records updated
→ Continue to iteration 2
Iteration 2:
✓ Execute unify SQL
• Check convergence: 450 records updated
→ Continue to iteration 3
Iteration 3:
✓ Execute unify SQL
• Check convergence: 0 records updated
✓ CONVERGED! Stop loop
```
**Features**:
- Runs until convergence (updated_count = 0)
- Maximum 30 iterations safety limit
- Creates alias table (loop_final) for downstream processing
### Step 5: Post-Loop Processing
- Execute canonicalization step
- Generate result statistics
- Enrich source tables with canonical IDs
- Create master tables
- Generate metadata and lookup tables
### Step 6: Final Report
Provide:
- Total execution time
- Files processed successfully
- Convergence statistics
- Final table row counts
- Next steps and recommendations
---
## Command Usage
### Interactive Mode (Recommended)
```
/cdp-hybrid-idu:hybrid-execute-snowflake
I'll prompt you for:
- SQL directory path
- Snowflake account name
- Username
- Database and schema
- Warehouse name
- Authentication method
```
### Advanced Mode
Provide all parameters upfront:
```
SQL directory: snowflake_sql/unify/
Account: myaccount
User: myuser
Database: my_database
Schema: my_schema
Warehouse: COMPUTE_WH
Password: (will prompt if not in environment)
```
---
## Execution Features
### 1. Convergence Detection
**Algorithm**:
```sql
SELECT COUNT(*) as updated_count FROM (
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM current_iteration
EXCEPT
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM previous_iteration
) diff
```
**Stops when**: updated_count = 0
### 2. Interactive Error Handling
If an error occurs:
```
✗ File: 04_unify_loop_iteration_01.sql
Error: Table not found: source_table
Continue with remaining files? (y/n):
```
You can choose to:
- Continue: Skip failed file, continue with rest
- Stop: Halt execution for investigation
### 3. Real-Time Monitoring
Track progress with:
- ✓ Completed steps (green)
- • Progress indicators (cyan)
- ✗ Failed steps (red)
- ⚠ Warnings (yellow)
- Row counts and execution times
### 4. Alias Table Creation
After convergence, creates:
```sql
CREATE OR REPLACE TABLE database.schema.unified_id_graph_unify_loop_final
AS SELECT * FROM database.schema.unified_id_graph_unify_loop_3
```
This allows downstream SQL to reference `loop_final` regardless of actual iteration count.
---
## Technical Details
### Python Script Execution
The agent executes:
```bash
python3 scripts/snowflake/snowflake_sql_executor.py \
snowflake_sql/unify/ \
--account myaccount \
--user myuser \
--database my_database \
--schema my_schema \
--warehouse COMPUTE_WH
```
### Execution Order
1. **Setup Phase** (01-03):
- Create graph table (loop_0)
- Extract and merge identities
- Generate source statistics
2. **Unification Loop** (04):
- Run iterations until convergence
- Check after EVERY iteration
- Stop when updated_count = 0
- Create loop_final alias
3. **Canonicalization** (05):
- Create canonical ID lookup
- Create keys and tables metadata
- Rename final graph table
4. **Statistics** (06):
- Generate result key statistics
- Create histograms
- Calculate coverage metrics
5. **Enrichment** (10-19):
- Add canonical IDs to source tables
- Create enriched_* tables
6. **Master Tables** (20-29):
- Aggregate attributes
- Apply priority rules
- Create unified customer profiles
7. **Metadata** (30-39):
- Unification metadata
- Filter lookup tables
- Column lookup tables
### Connection Management
- Establishes single connection for entire workflow
- Uses connection pooling for efficiency
- Automatic reconnection on timeout
- Proper cleanup on completion or error
---
## Example Execution
### Input
```
SQL directory: snowflake_sql/unify/
Account: myorg-myaccount
User: analytics_user
Database: customer_data
Schema: id_unification
Warehouse: LARGE_WH
```
### Output
```
✓ Connected to Snowflake: myorg-myaccount
• Using database: customer_data, schema: id_unification
Starting Snowflake SQL Execution
• Database: customer_data
• Schema: id_unification
Executing: 01_create_graph.sql
✓ 01_create_graph.sql: Executed successfully
Executing: 02_extract_merge.sql
✓ 02_extract_merge.sql: Executed successfully
• Rows affected: 125000
Executing: 03_source_key_stats.sql
✓ 03_source_key_stats.sql: Executed successfully
Executing Unify Loop Before Canonicalization
--- Iteration 1 ---
✓ Iteration 1 completed
• Rows processed: 125000
• Updated records: 1500
--- Iteration 2 ---
✓ Iteration 2 completed
• Rows processed: 125000
• Updated records: 450
--- Iteration 3 ---
✓ Iteration 3 completed
• Rows processed: 125000
• Updated records: 0
✓ Loop converged after 3 iterations
• Creating alias table for final iteration
✓ Alias table 'unified_id_graph_unify_loop_final' created
Executing: 05_canonicalize.sql
✓ 05_canonicalize.sql: Executed successfully
[... continues with enrichment, master tables, metadata ...]
Execution Complete
• Files processed: 18/18
• Final unified_id_lookup rows: 98,500
• Disconnected from Snowflake
```
---
## Monitoring and Troubleshooting
### Check Execution Progress
During execution, you can monitor:
- Snowflake query history
- Table sizes and row counts
- Warehouse utilization
- Execution logs
### Common Issues
**Issue**: Connection timeout
**Solution**: Check network access, verify credentials, ensure warehouse is running
**Issue**: Table not found
**Solution**: Verify database/schema permissions, check source table names in YAML
**Issue**: Loop doesn't converge
**Solution**: Check data quality, increase max_iterations, review key validation rules
**Issue**: Warehouse suspended
**Solution**: Ensure auto-resume is enabled, manually resume warehouse if needed
**Issue**: Permission denied
**Solution**: Verify database/schema permissions, check role assignments
### Performance Optimization
- Use larger warehouse for faster execution (L, XL, 2XL, etc.)
- Enable multi-cluster warehouse for concurrency
- Use clustering keys on frequently joined columns
- Monitor query profiles for optimization opportunities
---
## Post-Execution Validation
**DO NOT RUN THESE VALIDATION. JUST PRESENT TO USER TO RUN ON SNOWFLAKE**
### Check Coverage
```sql
SELECT
COUNT(*) as total_records,
COUNT(unified_id) as records_with_id,
COUNT(unified_id) * 100.0 / COUNT(*) as coverage_percent
FROM database.schema.enriched_customer_profiles;
```
### Verify Master Table
```sql
SELECT COUNT(*) as unified_customers
FROM database.schema.customer_master;
```
### Review Statistics
```sql
SELECT * FROM database.schema.unified_id_result_key_stats
WHERE from_table = '*';
```
---
## Success Criteria
Execution successful when:
- ✅ All SQL files processed without critical errors
- ✅ Unification loop converged (updated_count = 0)
- ✅ Canonical IDs generated for all eligible records
- ✅ Enriched tables created successfully
- ✅ Master tables populated with attributes
- ✅ Coverage metrics meet expectations
---
## Authentication Examples
### Using Password
```bash
export SNOWFLAKE_PASSWORD='your_password'
/cdp-hybrid-idu:hybrid-execute-snowflake
```
### Using SSO
```bash
/cdp-hybrid-idu:hybrid-execute-snowflake
# Will prompt: Use SSO authentication? (y/n): y
# Opens browser for authentication
```
### Using Key-Pair
```bash
export SNOWFLAKE_PRIVATE_KEY_PATH='/path/to/key.p8'
export SNOWFLAKE_PRIVATE_KEY_PASSPHRASE='passphrase'
/cdp-hybrid-idu:hybrid-execute-snowflake
```
---
**Ready to execute your Snowflake ID unification workflow?**
Provide your SQL directory path and Snowflake connection details to begin!

View File

@@ -0,0 +1,285 @@
---
name: hybrid-generate-databricks
description: Generate Databricks Delta Lake SQL from YAML configuration for ID unification
---
# Generate Databricks SQL from YAML
## Overview
Generate production-ready Databricks SQL workflow from your `unify.yml` configuration file. This command creates Delta Lake optimized SQL files with ACID transactions, clustering, and platform-specific function conversions.
---
## What You Need
### Required Inputs
1. **YAML Configuration File**: Path to your `unify.yml`
2. **Target Catalog**: Databricks Unity Catalog name
3. **Target Schema**: Schema name within the catalog
### Optional Inputs
4. **Source Catalog**: Catalog containing source tables (defaults to target catalog)
5. **Source Schema**: Schema containing source tables (defaults to target schema)
6. **Output Directory**: Where to save generated SQL (defaults to `databricks_sql/`)
---
## What I'll Do
### Step 1: Validation
- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections
### Step 2: SQL Generation
I'll call the **databricks-sql-generator agent** to:
- Execute `yaml_unification_to_databricks.py` Python script
- Apply Databricks-specific SQL conversions:
- `ARRAY_SIZE``SIZE`
- `ARRAY_CONSTRUCT``ARRAY`
- `OBJECT_CONSTRUCT``STRUCT`
- `COLLECT_LIST` for aggregations
- `FLATTEN` for array operations
- `UNIX_TIMESTAMP()` for time functions
- Generate Delta Lake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs
### Step 3: Output Organization
Generate complete SQL workflow in this structure:
```
databricks_sql/unify/
├── 01_create_graph.sql # Initialize graph with USING DELTA
├── 02_extract_merge.sql # Extract identities with validation
├── 03_source_key_stats.sql # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql # Canonical ID creation with key masks
├── 06_result_key_stats.sql # Result statistics with histograms
├── 10_enrich_*.sql # Enrich each source table
├── 20_master_*.sql # Master tables with attribute aggregation
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules lookup
└── 32_column_lookup.sql # Column mapping lookup
```
### Step 4: Summary Report
Provide:
- Total SQL files generated
- Estimated execution order
- Delta Lake optimizations included
- Key features enabled
- Next steps for execution
---
## Command Usage
### Basic Usage
```
/cdp-hybrid-idu:hybrid-generate-databricks
I'll prompt you for:
- YAML file path
- Target catalog
- Target schema
```
### Advanced Usage
Provide all parameters upfront:
```
YAML file: /path/to/unify.yml
Target catalog: my_catalog
Target schema: my_schema
Source catalog: source_catalog (optional)
Source schema: source_schema (optional)
Output directory: custom_output/ (optional)
```
---
## Generated SQL Features
### Delta Lake Optimizations
- **ACID Transactions**: `USING DELTA` for all tables
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **Table Properties**: Optimized for large-scale joins
### Advanced Capabilities
1. **Dynamic Iteration Count**: Auto-calculates based on:
- Number of merge keys
- Number of tables
- Data complexity (configurable via YAML)
2. **Key-Specific Hashing**: Each key uses unique cryptographic mask:
```
Key Type 1 (email): 0ffdbcf0c666ce190d
Key Type 2 (customer_id): 61a821f2b646a4e890
Key Type 3 (phone): acd2206c3f88b3ee27
```
3. **Validation Rules**:
- `valid_regexp`: Regex pattern filtering
- `invalid_texts`: NOT IN clause with NULL handling
- Combined AND logic for strict validation
4. **Master Table Attributes**:
- Single value: `MAX_BY(attr, order)` with COALESCE
- Array values: `SLICE(CONCAT(arrays), 1, N)`
- Priority-based selection
### Platform-Specific Conversions
The generator automatically converts:
- Presto functions → Databricks equivalents
- Snowflake functions → Databricks equivalents
- Array operations → Spark SQL syntax
- Window functions → optimized versions
- Time functions → UNIX_TIMESTAMP()
---
## Example Workflow
### Input YAML (`unify.yml`)
```yaml
name: customer_unification
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A']
tables:
- table: customer_profiles
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
canonical_ids:
- name: unified_id
merge_by_keys: [email, customer_id]
merge_iterations: 15
master_tables:
- name: customer_master
canonical_id: unified_id
attributes:
- name: best_email
source_columns:
- {table: customer_profiles, column: email_std, priority: 1}
```
### Generated Output
```
databricks_sql/unify/
├── 01_create_graph.sql # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql # Merges customer_profiles keys
├── 03_source_key_stats.sql # Stats by table
├── 04_unify_loop_iteration_01.sql # First iteration
├── 04_unify_loop_iteration_02.sql # Second iteration
├── ... # Up to iteration_05
├── 05_canonicalize.sql # Creates unified_id_lookup
├── 06_result_key_stats.sql # Final statistics
├── 10_enrich_customer_profiles.sql # Adds unified_id column
├── 20_master_customer_master.sql # Creates customer_master table
├── 30_unification_metadata.sql # Metadata
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
---
## Next Steps After Generation
### Option 1: Execute Immediately
Use the execution command:
```
/cdp-hybrid-idu:hybrid-execute-databricks
```
### Option 2: Review First
1. Examine generated SQL files
2. Verify table names and transformations
3. Test with sample data
4. Execute manually or via execution command
### Option 3: Customize
1. Modify generated SQL as needed
2. Add custom logic or transformations
3. Execute using Databricks SQL editor or execution command
---
## Technical Details
### Python Script Execution
The agent executes:
```bash
python3 scripts/databricks/yaml_unification_to_databricks.py \
unify.yml \
-tc my_catalog \
-ts my_schema \
-sc source_catalog \
-ss source_schema \
-o databricks_sql
```
### SQL File Naming Convention
- `01-09`: Setup and initialization
- `10-19`: Source table enrichment
- `20-29`: Master table creation
- `30-39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)
### Convergence Detection
Each loop iteration includes:
```sql
-- Check if graph changed
SELECT COUNT(*) FROM (
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM iteration_N
EXCEPT
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM iteration_N_minus_1
) diff
```
Stops when count = 0
---
## Troubleshooting
### Common Issues
**Issue**: YAML validation error
**Solution**: Check YAML syntax, ensure proper indentation, verify all required fields
**Issue**: Table not found error
**Solution**: Verify source catalog/schema, check table names in YAML
**Issue**: Python script error
**Solution**: Ensure Python 3.7+ installed, check pyyaml dependency
**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in canonical_ids section of YAML
---
## Success Criteria
Generated SQL will:
- ✅ Be valid Databricks Spark SQL
- ✅ Use Delta Lake for ACID transactions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built-in
- ✅ Support incremental processing
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Databricks
---
**Ready to generate Databricks SQL from your YAML configuration?**
Provide your YAML file path and target catalog/schema to begin!

View File

@@ -0,0 +1,288 @@
---
name: hybrid-generate-snowflake
description: Generate Snowflake SQL from YAML configuration for ID unification
---
# Generate Snowflake SQL from YAML
## Overview
Generate production-ready Snowflake SQL workflow from your `unify.yml` configuration file. This command creates Snowflake-native SQL files with proper clustering, VARIANT support, and platform-specific function conversions.
---
## What You Need
### Required Inputs
1. **YAML Configuration File**: Path to your `unify.yml`
2. **Target Database**: Snowflake database name
3. **Target Schema**: Schema name within the database
### Optional Inputs
4. **Source Database**: Database containing source tables (defaults to target database)
5. **Source Schema**: Schema containing source tables (defaults to PUBLIC)
6. **Output Directory**: Where to save generated SQL (defaults to `snowflake_sql/`)
---
## What I'll Do
### Step 1: Validation
- Verify `unify.yml` exists and is valid
- Check YAML syntax and structure
- Validate keys, tables, and configuration sections
### Step 2: SQL Generation
I'll call the **snowflake-sql-generator agent** to:
- Execute `yaml_unification_to_snowflake.py` Python script
- Generate Snowflake table definitions with clustering
- Create convergence detection logic
- Build cryptographic hashing for canonical IDs
### Step 3: Output Organization
Generate complete SQL workflow in this structure:
```
snowflake_sql/unify/
├── 01_create_graph.sql # Initialize graph table
├── 02_extract_merge.sql # Extract identities with validation
├── 03_source_key_stats.sql # Source statistics with GROUPING SETS
├── 04_unify_loop_iteration_*.sql # Loop iterations (auto-calculated count)
├── 05_canonicalize.sql # Canonical ID creation with key masks
├── 06_result_key_stats.sql # Result statistics with histograms
├── 10_enrich_*.sql # Enrich each source table
├── 20_master_*.sql # Master tables with attribute aggregation
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules lookup
└── 32_column_lookup.sql # Column mapping lookup
```
### Step 4: Summary Report
Provide:
- Total SQL files generated
- Estimated execution order
- Snowflake optimizations included
- Key features enabled
- Next steps for execution
---
## Command Usage
### Basic Usage
```
/cdp-hybrid-idu:hybrid-generate-snowflake
I'll prompt you for:
- YAML file path
- Target database
- Target schema
```
### Advanced Usage
Provide all parameters upfront:
```
YAML file: /path/to/unify.yml
Target database: my_database
Target schema: my_schema
Source database: source_database (optional)
Source schema: PUBLIC (optional, defaults to PUBLIC)
Output directory: custom_output/ (optional)
```
---
## Generated SQL Features
### Snowflake Optimizations
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
- **VARIANT Support**: Flexible data structures for arrays and objects
- **Native Functions**: Snowflake-specific optimized functions
### Advanced Capabilities
1. **Dynamic Iteration Count**: Auto-calculates based on:
- Number of merge keys
- Number of tables
- Data complexity (configurable via YAML)
2. **Key-Specific Hashing**: Each key uses unique cryptographic mask:
```
Key Type 1 (email): 0ffdbcf0c666ce190d
Key Type 2 (customer_id): 61a821f2b646a4e890
Key Type 3 (phone): acd2206c3f88b3ee27
```
3. **Validation Rules**:
- `valid_regexp`: REGEXP_LIKE pattern filtering
- `invalid_texts`: NOT IN clause with proper NULL handling
- Combined AND logic for strict validation
4. **Master Table Attributes**:
- Single value: `MAX_BY(attr, order)` with COALESCE
- Array values: `ARRAY_SLICE(ARRAY_CAT(arrays), 0, N)`
- Priority-based selection
### Platform-Specific Conversions
The generator automatically converts:
- Presto functions → Snowflake equivalents
- Databricks functions → Snowflake equivalents
- Array operations → ARRAY_CONSTRUCT/FLATTEN syntax
- Window functions → optimized versions
- Time functions → DATE_PART(epoch_second, CURRENT_TIMESTAMP())
---
## Example Workflow
### Input YAML (`unify.yml`)
```yaml
name: customer_unification
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A']
tables:
- table: customer_profiles
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
canonical_ids:
- name: unified_id
merge_by_keys: [email, customer_id]
merge_iterations: 15
master_tables:
- name: customer_master
canonical_id: unified_id
attributes:
- name: best_email
source_columns:
- {table: customer_profiles, column: email_std, priority: 1}
```
### Generated Output
```
snowflake_sql/unify/
├── 01_create_graph.sql # Creates unified_id_graph_unify_loop_0
├── 02_extract_merge.sql # Merges customer_profiles keys
├── 03_source_key_stats.sql # Stats by table
├── 04_unify_loop_iteration_01.sql # First iteration
├── 04_unify_loop_iteration_02.sql # Second iteration
├── ... # Up to iteration_05
├── 05_canonicalize.sql # Creates unified_id_lookup
├── 06_result_key_stats.sql # Final statistics
├── 10_enrich_customer_profiles.sql # Adds unified_id column
├── 20_master_customer_master.sql # Creates customer_master table
├── 30_unification_metadata.sql # Metadata
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
---
## Next Steps After Generation
### Option 1: Execute Immediately
Use the execution command:
```
/cdp-hybrid-idu:hybrid-execute-snowflake
```
### Option 2: Review First
1. Examine generated SQL files
2. Verify table names and transformations
3. Test with sample data
4. Execute manually or via execution command
### Option 3: Customize
1. Modify generated SQL as needed
2. Add custom logic or transformations
3. Execute using Snowflake SQL worksheet or execution command
---
## Technical Details
### Python Script Execution
The agent executes:
```bash
python3 scripts/snowflake/yaml_unification_to_snowflake.py \
unify.yml \
-d my_database \
-s my_schema \
-sd source_database \
-ss source_schema \
-o snowflake_sql
```
### SQL File Naming Convention
- `01-09`: Setup and initialization
- `10-19`: Source table enrichment
- `20-29`: Master table creation
- `30-39`: Metadata and lookup tables
- `04_*_NN`: Loop iterations (auto-numbered)
### Convergence Detection
Each loop iteration includes:
```sql
-- Check if graph changed
SELECT COUNT(*) FROM (
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM iteration_N
EXCEPT
SELECT leader_ns, leader_id, follower_ns, follower_id
FROM iteration_N_minus_1
) diff
```
Stops when count = 0
### Snowflake-Specific Features
- **LATERAL FLATTEN**: Array expansion for id_ns_array processing
- **ARRAY_CONSTRUCT**: Building arrays from multiple columns
- **OBJECT_CONSTRUCT**: Creating structured objects for key-value pairs
- **ARRAYS_OVERLAP**: Checking array membership
- **SPLIT_PART**: String splitting for leader key parsing
---
## Troubleshooting
### Common Issues
**Issue**: YAML validation error
**Solution**: Check YAML syntax, ensure proper indentation, verify all required fields
**Issue**: Table not found error
**Solution**: Verify source database/schema, check table names in YAML
**Issue**: Python script error
**Solution**: Ensure Python 3.7+ installed, check pyyaml dependency
**Issue**: Too many/few iterations
**Solution**: Adjust `merge_iterations` in canonical_ids section of YAML
**Issue**: VARIANT column errors
**Solution**: Snowflake VARIANT type handling is automatic, ensure proper casting in custom SQL
---
## Success Criteria
Generated SQL will:
- ✅ Be valid Snowflake SQL
- ✅ Use native Snowflake functions
- ✅ Include proper clustering for performance
- ✅ Have convergence detection built-in
- ✅ Support VARIANT types for flexible data
- ✅ Generate comprehensive statistics
- ✅ Work without modification on Snowflake
---
**Ready to generate Snowflake SQL from your YAML configuration?**
Provide your YAML file path and target database/schema to begin!

308
commands/hybrid-setup.md Normal file
View File

@@ -0,0 +1,308 @@
---
name: hybrid-setup
description: Complete end-to-end hybrid ID unification setup - automatically analyzes tables, generates config, creates SQL, and executes workflow for Snowflake and Databricks
---
# Hybrid ID Unification Complete Setup
## Overview
I'll guide you through the complete hybrid ID unification setup process for Snowflake and/or Databricks platforms. This is an **automated, end-to-end workflow** that will:
1. **Analyze your tables automatically** using platform MCP tools with strict PII detection
2. **Generate YAML configuration** from real schema and data analysis
3. **Choose target platform(s)** (Snowflake, Databricks, or both)
4. **Generate platform-specific SQL** optimized for each engine
5. **Execute workflows** with convergence detection and monitoring
6. **Provide deployment guidance** and operating instructions
**Key Features**:
- 🔍 **Automated Table Analysis**: Uses Snowflake/Databricks MCP tools to analyze actual tables
-**Strict PII Detection**: Zero tolerance - only includes tables with real user identifiers
- 📊 **Real Data Validation**: Queries actual data to validate patterns and quality
- 🎯 **Smart Recommendations**: Expert analysis provides merge strategy and priorities
- 🚀 **End-to-End Automation**: From table analysis to workflow execution
---
## What You Need to Provide
### 1. Unification Requirements (For Automated Analysis)
- **Platform**: Snowflake or Databricks
- **Tables**: List of source tables to analyze
- Format (Snowflake): `database.schema.table` or `schema.table` or `table`
- Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
- **Canonical ID Name**: Name for your unified ID (e.g., `td_id`, `unified_customer_id`)
- **Merge Iterations**: Number of unification loops (default: 10)
- **Master Tables**: (Optional) Attribute aggregation specifications
**Note**: The system will automatically:
- Extract user identifiers from actual table schemas
- Validate data patterns from real data
- Apply appropriate validation rules based on data analysis
- Generate merge strategy recommendations
### 2. Platform Selection
- **Databricks**: Unity Catalog with Delta Lake
- **Snowflake**: Database with proper permissions
- **Both**: Generate SQL for both platforms
### 3. Target Configurations
**For Databricks**:
- **Catalog**: Target catalog name
- **Schema**: Target schema name
- **Source Catalog** (optional): Source data catalog
- **Source Schema** (optional): Source data schema
**For Snowflake**:
- **Database**: Target database name
- **Schema**: Target schema name
- **Source Schema** (optional): Source data schema
### 4. Execution Credentials (if executing)
**For Databricks**:
- **Server Hostname**: your-workspace.databricks.com
- **HTTP Path**: /sql/1.0/warehouses/your-warehouse-id
- **Authentication**: PAT (Personal Access Token) or OAuth
**For Snowflake**:
- **Account**: Snowflake account name
- **User**: Username
- **Password**: Password or use SSO/key-pair
- **Warehouse**: Compute warehouse name
---
## What I'll Do
### Step 1: Automated YAML Configuration Generation
I'll use the **hybrid-unif-config-creator** command to automatically generate your `unify.yml` file:
**Automated Analysis Approach** (Recommended):
- Analyze your actual tables using platform MCP tools (Snowflake/Databricks)
- Extract user identifiers with STRICT PII detection (zero tolerance for guessing)
- Validate data patterns from real table data
- Generate unify.yml with exact template compliance
- Only include tables with actual user identifiers
- Document excluded tables with detailed reasons
**What I'll do**:
- Call the **hybrid-unif-keys-extractor agent** to analyze tables
- Query actual schema and data using platform MCP tools
- Detect valid user identifiers (email, customer_id, phone, etc.)
- Exclude tables without PII with full documentation
- Generate production-ready unify.yml automatically
**Alternative - Manual Configuration**:
- If MCP tools are unavailable, I'll guide you through manual configuration
- Interactive prompts for keys, tables, and validation rules
- Step-by-step YAML building with validation
### Step 2: Platform Selection and Configuration
I'll help you:
- Choose between Databricks, Snowflake, or both
- Collect platform-specific configuration (catalog/database, schema names)
- Determine source/target separation strategy
- Decide on execution or generation-only mode
### Step 3: SQL Generation
**For Databricks** (if selected):
I'll call the **databricks-sql-generator agent** to:
- Execute `yaml_unification_to_databricks.py` script
- Generate Delta Lake optimized SQL workflow
- Create output directory: `databricks_sql/unify/`
- Generate 15+ SQL files with proper execution order
**For Snowflake** (if selected):
I'll call the **snowflake-sql-generator agent** to:
- Execute `yaml_unification_to_snowflake.py` script
- Generate Snowflake-native SQL workflow
- Create output directory: `snowflake_sql/unify/`
- Generate 15+ SQL files with proper execution order
### Step 4: Workflow Execution (Optional)
**For Databricks** (if execution requested):
I'll call the **databricks-workflow-executor agent** to:
- Execute `databricks_sql_executor.py` script
- Connect to your Databricks workspace
- Run SQL files in proper sequence
- Monitor convergence and progress
- Optimize Delta tables
- Report final statistics
**For Snowflake** (if execution requested):
I'll call the **snowflake-workflow-executor agent** to:
- Execute `snowflake_sql_executor.py` script
- Connect to your Snowflake account
- Run SQL files in proper sequence
- Monitor convergence and progress
- Report final statistics
### Step 5: Deployment Guidance
I'll provide:
- Configuration summary
- Generated files overview
- Deployment instructions
- Operating guidelines
- Monitoring recommendations
---
## Interactive Workflow
This command orchestrates the complete end-to-end flow by calling specialized commands in sequence:
### Phase 1: Configuration Creation
**I'll ask you for**:
- Platform (Snowflake or Databricks)
- Tables to analyze
- Canonical ID name
- Merge iterations
**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-unif-config-creator` internally
- Analyze your tables automatically
- Generate `unify.yml` with strict PII detection
- Show you the configuration for review
### Phase 2: SQL Generation
**I'll ask you**:
- Which platform(s) to generate SQL for (can be different from source)
- Output directory preferences
**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-generate-snowflake` (if Snowflake selected)
- Call `/cdp-hybrid-idu:hybrid-generate-databricks` (if Databricks selected)
- Generate 15+ optimized SQL files per platform
- Show you the execution plan
### Phase 3: Workflow Execution (Optional)
**I'll ask you**:
- Do you want to execute now or later?
- Connection credentials if executing
**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-execute-snowflake` (if Snowflake selected)
- Call `/cdp-hybrid-idu:hybrid-execute-databricks` (if Databricks selected)
- Monitor convergence and progress
- Report final statistics
**Throughout the process**:
- **Questions**: When I need your input
- **Suggestions**: Recommended approaches based on best practices
- **Validation**: Real-time checks on your choices
- **Explanations**: Help you understand concepts and options
---
## Expected Output
### Files Created (Platform-specific):
**For Databricks**:
```
databricks_sql/unify/
├── 01_create_graph.sql # Initialize identity graph
├── 02_extract_merge.sql # Extract and merge identities
├── 03_source_key_stats.sql # Source statistics
├── 04_unify_loop_iteration_*.sql # Iterative unification (N files)
├── 05_canonicalize.sql # Canonical ID creation
├── 06_result_key_stats.sql # Result statistics
├── 10_enrich_*.sql # Source table enrichment (N files)
├── 20_master_*.sql # Master table creation (N files)
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
**For Snowflake**:
```
snowflake_sql/unify/
├── 01_create_graph.sql # Initialize identity graph
├── 02_extract_merge.sql # Extract and merge identities
├── 03_source_key_stats.sql # Source statistics
├── 04_unify_loop_iteration_*.sql # Iterative unification (N files)
├── 05_canonicalize.sql # Canonical ID creation
├── 06_result_key_stats.sql # Result statistics
├── 10_enrich_*.sql # Source table enrichment (N files)
├── 20_master_*.sql # Master table creation (N files)
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
**Configuration**:
```
unify.yml # YAML configuration (created interactively)
```
---
## Success Criteria
All generated files will:
- ✅ Be platform-optimized and production-ready
- ✅ Use proper SQL dialects (Databricks Spark SQL or Snowflake SQL)
- ✅ Include convergence detection logic
- ✅ Support incremental processing
- ✅ Generate comprehensive statistics
- ✅ Work without modification on target platforms
---
## Getting Started
**Ready to begin?** I'll use the **hybrid-unif-config-creator** to automatically analyze your tables and generate the YAML configuration.
Please provide:
1. **Platform**: Which platform contains your data?
- Snowflake or Databricks
2. **Tables**: Which source tables should I analyze?
- Format (Snowflake): `database.schema.table` or `schema.table` or `table`
- Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
- Example: `customer_db.public.customers`, `orders`, `web_events.user_activity`
3. **Canonical ID Name**: What should I call the unified ID?
- Example: `td_id`, `unified_customer_id`, `master_id`
- Default: `td_id`
4. **Merge Iterations** (optional): How many unification loops?
- Default: 10
- Range: 2-30
5. **Target Platform(s)** for SQL generation:
- Same as source, or generate for both platforms
**Example**:
```
I want to set up hybrid ID unification for:
Platform: Snowflake
Tables:
- customer_db.public.customer_profiles
- customer_db.public.orders
- marketing_db.public.campaigns
- event_db.public.web_events
Canonical ID: unified_customer_id
Merge Iterations: 10
Generate SQL for: Snowflake (or both Snowflake and Databricks)
```
**What I'll do next**:
1. ✅ Analyze your tables using Snowflake MCP tools
2. ✅ Extract user identifiers with strict PII detection
3. ✅ Generate unify.yml automatically
4. ✅ Generate platform-specific SQL files
5. ✅ Execute workflow (if requested)
6. ✅ Provide deployment guidance
---
**Let's get started with your hybrid ID unification setup!**

View File

@@ -0,0 +1,491 @@
---
name: hybrid-unif-config-creator
description: Auto-generate unify.yml configuration for Snowflake/Databricks by extracting user identifiers from actual tables using strict PII detection
---
# Unify Configuration Creator for Snowflake/Databricks
## Overview
I'll automatically generate a production-ready `unify.yml` configuration file for your Snowflake or Databricks ID unification by:
1. **Analyzing your actual tables** using platform-specific MCP tools
2. **Extracting user identifiers** with zero-tolerance PII detection
3. **Validating data patterns** from real table data
4. **Generating unify.yml** using the exact template format
5. **Providing recommendations** for merge strategies and priorities
**This command uses STRICT analysis - only tables with actual user identifiers will be included.**
---
## What You Need to Provide
### 1. Platform Selection
- **Snowflake**: For Snowflake databases
- **Databricks**: For Databricks Unity Catalog tables
### 2. Tables to Analyze
Provide tables you want to analyze for ID unification:
- **Format (Snowflake)**: `database.schema.table` or `schema.table` or `table`
- **Format (Databricks)**: `catalog.schema.table` or `schema.table` or `table`
- **Example**: `customer_data.public.customers`, `orders`, `web_events.user_activity`
### 3. Canonical ID Configuration
- **Name**: Name for your unified ID (default: `td_id`)
- **Merge Iterations**: Number of unification loop iterations (default: 10)
- **Incremental Iterations**: Iterations for incremental processing (default: 5)
### 4. Output Configuration (Optional)
- **Output File**: Where to save unify.yml (default: `unify.yml`)
- **Template Path**: Path to template if using custom (default: uses built-in exact template)
---
## What I'll Do
### Step 1: Platform Detection and Validation
```
1. Confirm platform (Snowflake or Databricks)
2. Verify MCP tools are available for the platform
3. Set up platform-specific query patterns
4. Inform you of the analysis approach
```
### Step 2: Key Extraction with hybrid-unif-keys-extractor Agent
I'll launch the **hybrid-unif-keys-extractor agent** to:
**Schema Analysis**:
- Use platform MCP tools to describe each table
- Extract exact column names and data types
- Identify accessible vs inaccessible tables
**User Identifier Detection**:
- Apply STRICT matching rules for user identifiers:
- ✅ Email columns (email, email_std, email_address, etc.)
- ✅ Phone columns (phone, phone_number, mobile_phone, etc.)
- ✅ User IDs (user_id, customer_id, account_id, etc.)
- ✅ Cookie/Device IDs (td_client_id, cookie_id, etc.)
- ❌ System columns (id, created_at, time, etc.)
- ❌ Complex types (arrays, maps, objects, variants, structs)
**Data Validation**:
- Query actual MIN/MAX values from each identified column
- Analyze data patterns and quality
- Count unique values per identifier
- Detect data quality issues
**Table Classification**:
- **INCLUDED**: Tables with valid user identifiers
- **EXCLUDED**: Tables without user identifiers (fully documented why)
**Expert Analysis**:
- 3 SQL experts review the data
- Provide priority recommendations
- Suggest validation rules based on actual data patterns
### Step 3: Unify.yml Generation
**CRITICAL**: Using the **EXACT BUILT-IN template structure** (embedded in hybrid-unif-keys-extractor agent)
**Template Usage Process**:
```
1. Receive structured data from hybrid-unif-keys-extractor agent:
- Keys with validation rules
- Tables with column mappings
- Canonical ID configuration
- Master tables specification
2. Use BUILT-IN template structure (see agent documentation)
3. ONLY replace these specific values:
- Line 1: name: {canonical_id_name}
- keys section: actual keys found
- tables section: actual tables with actual columns
- canonical_ids section: name and merge_by_keys
- master_tables section: [] or user specifications
4. PRESERVE everything else:
- ALL comment blocks (#####...)
- ALL comment text ("Declare Validation logic", etc.)
- ALL spacing and indentation (2 spaces per level)
- ALL blank lines
- EXACT YAML structure
5. Use Write tool to save populated unify.yml
```
**I'll generate**:
**Section 1: Canonical ID Name**
```yaml
name: {your_canonical_id_name}
```
**Section 2: Keys with Validation**
```yaml
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A', 'null']
- name: phone_number
invalid_texts: ['', 'N/A', 'null']
```
*Populated with actual keys found in your tables*
**Section 3: Tables with Key Column Mappings**
```yaml
tables:
- database: {database/catalog}
table: {table_name}
key_columns:
- {column: actual_column_name, key: mapped_key}
- {column: another_column, key: another_key}
```
*Only tables with valid user identifiers, with EXACT column names from schema analysis*
**Section 4: Canonical IDs Configuration**
```yaml
canonical_ids:
- name: {your_canonical_id_name}
merge_by_keys: [email, customer_id, phone_number]
merge_iterations: 15
```
*Based on extracted keys and your configuration*
**Section 5: Master Tables (Optional)**
```yaml
master_tables:
- name: {canonical_id_name}_master_table
canonical_id: {canonical_id_name}
attributes:
- name: best_email
source_columns:
- {table: table1, column: email, order: last, order_by: time, priority: 1}
- {table: table2, column: email_address, order: last, order_by: time, priority: 2}
```
*If you request master table configuration, I'll help set up attribute aggregation*
### Step 4: Validation and Review
After generation:
```
1. Show complete unify.yml content
2. Highlight key sections:
- Keys found: [list]
- Tables included: [count]
- Tables excluded: [count] with reasons
- Merge strategy: [keys and priorities]
3. Provide recommendations for optimization
4. Ask for your approval before saving
```
### Step 5: File Output
```
1. Write unify.yml to specified location
2. Create backup of existing file if present
3. Provide file summary:
- Keys configured: X
- Tables configured: Y
- Validation rules: Z
4. Show next steps for using the configuration
```
---
## Example Workflow
**Input**:
```
Platform: Snowflake
Tables:
- customer_data.public.customers
- customer_data.public.orders
- web_data.public.events
Canonical ID Name: unified_customer_id
Output: snowflake_unify.yml
```
**Process**:
```
✓ Platform: Snowflake MCP tools detected
✓ Analyzing 3 tables...
Schema Analysis:
✓ customer_data.public.customers - 12 columns
✓ customer_data.public.orders - 8 columns
✓ web_data.public.events - 15 columns
User Identifier Detection:
✓ customers: email, customer_id (2 identifiers)
✓ orders: customer_id, email_address (2 identifiers)
✗ events: NO user identifiers found
Available columns: event_id, session_id, page_url, timestamp, ...
Reason: Contains only event tracking data - no PII
Data Analysis:
✓ email: 45,123 unique values, format valid
✓ customer_id: 45,089 unique values, numeric
✓ email_address: 12,456 unique values, format valid
Expert Analysis Complete:
Priority 1: customer_id (most stable, highest coverage)
Priority 2: email (good coverage, some quality issues)
Priority 3: phone_number (not found)
Generating unify.yml...
✓ Keys section: 2 keys configured
✓ Tables section: 2 tables configured
✓ Canonical IDs: unified_customer_id
✓ Validation rules: Applied based on data patterns
Tables EXCLUDED:
- web_data.public.events: No user identifiers
```
**Output (snowflake_unify.yml)**:
```yaml
name: unified_customer_id
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A', 'null']
tables:
- database: customer_data
table: customers
key_columns:
- {column: email, key: email}
- {column: customer_id, key: customer_id}
- database: customer_data
table: orders
key_columns:
- {column: email_address, key: email}
- {column: customer_id, key: customer_id}
canonical_ids:
- name: unified_customer_id
merge_by_keys: [customer_id, email]
merge_iterations: 15
master_tables: []
```
---
## Key Features
### 🔍 **STRICT PII Detection**
- Zero tolerance for guessing
- Only includes tables with actual user identifiers
- Documents why tables are excluded
- Based on REAL schema and data analysis
### ✅ **Exact Template Compliance**
- Uses BUILT-IN exact template structure (embedded in hybrid-unif-keys-extractor agent)
- NO modifications to template format
- Preserves all comment sections
- Maintains exact YAML structure
- Portable across all systems
### 📊 **Real Data Analysis**
- Queries actual MIN/MAX values
- Counts unique identifiers
- Validates data patterns
- Identifies quality issues
### 🎯 **Platform-Aware**
- Uses correct MCP tools for each platform
- Respects platform naming conventions
- Applies platform-specific data type rules
- Generates platform-compatible SQL references
### 📋 **Complete Documentation**
- Documents all excluded tables with reasons
- Lists available columns for excluded tables
- Explains why columns don't qualify as user identifiers
- Provides expert recommendations
---
## Output Format
**The generated unify.yml will have EXACTLY this structure:**
```yaml
name: {canonical_id_name}
#####################################################
##
##Declare Validation logic for unification keys
##
#####################################################
keys:
- name: {key1}
valid_regexp: "{pattern}"
invalid_texts: ['{val1}', '{val2}', '{val3}']
- name: {key2}
invalid_texts: ['{val1}', '{val2}', '{val3}']
#####################################################
##
##Declare databases, tables, and keys to use during unification
##
#####################################################
tables:
- database: {db/catalog}
table: {table}
key_columns:
- {column: {col}, key: {key}}
#####################################################
##
##Declare hierarchy for unification. Define keys to use for each level.
##
#####################################################
canonical_ids:
- name: {canonical_id_name}
merge_by_keys: [{key1}, {key2}, ...]
merge_iterations: {number}
#####################################################
##
##Declare Similar Attributes and standardize into a single column
##
#####################################################
master_tables:
- name: {canonical_id_name}_master_table
canonical_id: {canonical_id_name}
attributes:
- name: {attribute}
source_columns:
- {table: {t}, column: {c}, order: last, order_by: time, priority: 1}
```
**NO deviations from this structure - EXACT template compliance guaranteed.**
---
## Prerequisites
### Required:
- ✅ Snowflake or Databricks platform access
- ✅ Platform-specific MCP tools configured (may use fallback if unavailable)
- ✅ Read permissions on tables to be analyzed
- ✅ Tables must exist and be accessible
### Optional:
- Custom unify.yml template path (if not using default)
- Master table attribute specifications
- Custom validation rules
---
## Expected Timeline
| Step | Duration |
|------|----------|
| Platform detection | < 1 min |
| Schema analysis (per table) | 5-10 sec |
| Data analysis (per identifier) | 10-20 sec |
| Expert analysis | 1-2 min |
| YAML generation | < 1 min |
| **Total (for 5 tables)** | **~3-5 min** |
---
## Error Handling
### Common Issues:
**Issue**: MCP tools not available for platform
**Solution**:
- I'll inform you and provide fallback options
- You can provide schema information manually
- I'll still generate unify.yml with validation warnings
**Issue**: No tables have user identifiers
**Solution**:
- I'll show you why tables were excluded
- Suggest alternative tables to analyze
- Explain what constitutes a user identifier
**Issue**: Table not accessible
**Solution**:
- Document which tables are inaccessible
- Continue with accessible tables
- Recommend permission checks
**Issue**: Complex data types found
**Solution**:
- Exclude complex type columns (arrays, structs, maps)
- Explain why they can't be used for unification
- Suggest alternative columns if available
---
## Success Criteria
Generated unify.yml will:
- ✅ Use EXACT template structure - NO modifications
- ✅ Contain ONLY tables with validated user identifiers
- ✅ Include ONLY columns that actually exist in tables
- ✅ Have validation rules based on actual data patterns
- ✅ Be ready for immediate use with hybrid-generate-snowflake or hybrid-generate-databricks
- ✅ Work without any manual edits
- ✅ Include comprehensive documentation in comments
---
## Next Steps After Generation
1. **Review the generated unify.yml**
- Verify tables and columns are correct
- Check validation rules are appropriate
- Review merge strategy and priorities
2. **Generate SQL for your platform**:
- Snowflake: `/cdp-hybrid-idu:hybrid-generate-snowflake`
- Databricks: `/cdp-hybrid-idu:hybrid-generate-databricks`
3. **Execute the workflow**:
- Snowflake: `/cdp-hybrid-idu:hybrid-execute-snowflake`
- Databricks: `/cdp-hybrid-idu:hybrid-execute-databricks`
4. **Monitor convergence and results**
---
## Getting Started
**Ready to begin?**
Please provide:
1. **Platform**: Snowflake or Databricks
2. **Tables**: List of tables to analyze (full paths)
3. **Canonical ID Name**: Name for your unified ID (e.g., `unified_customer_id`)
4. **Output File** (optional): Where to save unify.yml (default: `unify.yml`)
**Example**:
```
Platform: Snowflake
Tables:
- customer_db.public.customers
- customer_db.public.orders
- marketing_db.public.campaigns
Canonical ID: unified_id
Output: snowflake_unify.yml
```
---
**I'll analyze your tables and generate a production-ready unify.yml configuration!**

View File

@@ -0,0 +1,337 @@
---
name: hybrid-unif-config-validate
description: Validate YAML configuration for hybrid ID unification before SQL generation
---
# Validate Hybrid ID Unification YAML
## Overview
Validate your `unify.yml` configuration file to ensure it's properly structured and ready for SQL generation. This command checks syntax, structure, validation rules, and provides recommendations for optimization.
---
## What You Need
### Required Input
1. **YAML Configuration File**: Path to your `unify.yml`
---
## What I'll Do
### Step 1: File Validation
- Verify file exists and is readable
- Check YAML syntax (proper indentation, quotes, etc.)
- Ensure all required sections are present
### Step 2: Structure Validation
Check presence and structure of:
- **name**: Unification project name
- **keys**: Key definitions with validation rules
- **tables**: Source tables with key column mappings
- **canonical_ids**: Canonical ID configuration
- **master_tables**: Master table definitions (optional)
### Step 3: Content Validation
Validate individual sections:
**Keys Section**:
- ✓ Each key has a unique name
-`valid_regexp` is a valid regex pattern (if provided)
-`invalid_texts` is an array (if provided)
- ⚠ Recommend validation rules if missing
**Tables Section**:
- ✓ Each table has a name
- ✓ Each table has at least one key_column
- ✓ All referenced keys exist in keys section
- ✓ Column names are valid identifiers
- ⚠ Check for duplicate table definitions
**Canonical IDs Section**:
- ✓ Has a name (will be canonical ID column name)
-`merge_by_keys` references existing keys
-`merge_iterations` is a positive integer (if provided)
- ⚠ Suggest optimal iteration count if not specified
**Master Tables Section** (if present):
- ✓ Each master table has a name and canonical_id
- ✓ Referenced canonical_id exists
- ✓ Attributes have proper structure
- ✓ Source tables in attributes exist
- ✓ Priority values are valid
- ⚠ Check for attribute conflicts
### Step 4: Cross-Reference Validation
- ✓ All merge_by_keys exist in keys section
- ✓ All key_columns reference defined keys
- ✓ All master table source tables exist in tables section
- ✓ Canonical ID names don't conflict with existing columns
### Step 5: Best Practices Check
Provide recommendations for:
- Key validation rules
- Iteration count optimization
- Master table attribute priorities
- Performance considerations
### Step 6: Validation Report
Generate comprehensive report with:
- ✅ Passed checks
- ⚠ Warnings (non-critical issues)
- ❌ Errors (must fix before generation)
- 💡 Recommendations for improvement
---
## Command Usage
### Basic Usage
```
/cdp-hybrid-idu:hybrid-unif-config-validate
I'll prompt you for:
- YAML file path
```
### Direct Usage
```
YAML file: /path/to/unify.yml
```
---
## Example Validation
### Input YAML
```yaml
name: customer_unification
keys:
- name: email
valid_regexp: ".*@.*"
invalid_texts: ['', 'N/A', 'null']
- name: customer_id
invalid_texts: ['', 'N/A']
tables:
- table: customer_profiles
key_columns:
- {column: email_std, key: email}
- {column: customer_id, key: customer_id}
- table: orders
key_columns:
- {column: email_address, key: email}
canonical_ids:
- name: unified_id
merge_by_keys: [email, customer_id]
merge_iterations: 15
master_tables:
- name: customer_master
canonical_id: unified_id
attributes:
- name: best_email
source_columns:
- {table: customer_profiles, column: email_std, priority: 1}
- {table: orders, column: email_address, priority: 2}
```
### Validation Report
```
✅ YAML VALIDATION SUCCESSFUL
File Structure:
✅ Valid YAML syntax
✅ All required sections present
✅ Proper indentation and formatting
Keys Section (2 keys):
✅ email: Valid regex pattern, invalid_texts defined
✅ customer_id: Invalid_texts defined
⚠ Consider adding valid_regexp for customer_id for better validation
Tables Section (2 tables):
✅ customer_profiles: 2 key columns mapped
✅ orders: 1 key column mapped
✅ All referenced keys exist
Canonical IDs Section:
✅ Name: unified_id
✅ Merge keys: email, customer_id (both exist)
✅ Iterations: 15 (recommended range: 10-20)
Master Tables Section (1 master table):
✅ customer_master: References unified_id
✅ Attribute 'best_email': 2 sources with priorities
✅ All source tables exist
Cross-References:
✅ All merge_by_keys defined in keys section
✅ All key_columns reference existing keys
✅ All master table sources exist
✅ No canonical ID name conflicts
Recommendations:
💡 Consider adding valid_regexp for customer_id (e.g., "^[A-Z0-9]+$")
💡 Add more master table attributes for richer customer profiles
💡 Consider array attributes (top_3_emails) for historical tracking
Summary:
✅ 0 errors
⚠ 1 warning
💡 3 recommendations
✓ Configuration is ready for SQL generation!
```
---
## Validation Checks
### Required Checks (Must Pass)
- [ ] File exists and is readable
- [ ] Valid YAML syntax
- [ ] `name` field present
- [ ] `keys` section present with at least one key
- [ ] `tables` section present with at least one table
- [ ] `canonical_ids` section present
- [ ] All merge_by_keys exist in keys section
- [ ] All key_columns reference defined keys
- [ ] No duplicate key names
- [ ] No duplicate table names
### Warning Checks (Recommended)
- [ ] Keys have validation rules (valid_regexp or invalid_texts)
- [ ] Merge_iterations specified (otherwise auto-calculated)
- [ ] Master tables defined for unified customer view
- [ ] Source tables have unique key combinations
- [ ] Attribute priorities are sequential
### Best Practice Checks
- [ ] Email keys have email regex pattern
- [ ] Phone keys have phone validation
- [ ] Invalid_texts include common null values ('', 'N/A', 'null')
- [ ] Master tables use time-based order_by for recency
- [ ] Array attributes for historical data (top_3_emails, etc.)
---
## Common Validation Errors
### Syntax Errors
**Error**: `Invalid YAML: mapping values are not allowed here`
**Solution**: Check indentation (use spaces, not tabs), ensure colons have space after them
**Error**: `Invalid YAML: could not find expected ':'`
**Solution**: Check for missing colons in key-value pairs
### Structure Errors
**Error**: `Missing required section: keys`
**Solution**: Add keys section with at least one key definition
**Error**: `Empty tables section`
**Solution**: Add at least one table with key_columns
### Reference Errors
**Error**: `Key 'phone' referenced in table 'orders' but not defined in keys section`
**Solution**: Add phone key to keys section or remove reference
**Error**: `Merge key 'phone_number' not found in keys section`
**Solution**: Add phone_number to keys section or remove from merge_by_keys
**Error**: `Master table source 'customer_360' not found in tables section`
**Solution**: Add customer_360 to tables section or use correct table name
### Value Errors
**Error**: `merge_iterations must be a positive integer, got: 'auto'`
**Solution**: Either remove merge_iterations (auto-calculate) or specify integer (e.g., 15)
**Error**: `Priority must be a positive integer, got: 'high'`
**Solution**: Use numeric priority (1 for highest, 2 for second, etc.)
---
## Validation Levels
### Strict Mode (Default)
- Fails on any structural errors
- Warns on missing best practices
- Recommends optimizations
### Lenient Mode
- Only fails on critical syntax errors
- Allows missing optional fields
- Minimal warnings
---
## Platform-Specific Validation
### Databricks-Specific
- ✓ Table names compatible with Unity Catalog
- ✓ Column names valid for Spark SQL
- ⚠ Check for reserved keywords (DATABASE, TABLE, etc.)
### Snowflake-Specific
- ✓ Table names compatible with Snowflake
- ✓ Column names valid for Snowflake SQL
- ⚠ Check for reserved keywords (ACCOUNT, SCHEMA, etc.)
---
## What Happens Next
### If Validation Passes
```
✅ Configuration validated successfully!
Ready for:
• SQL generation (Databricks or Snowflake)
• Direct execution after generation
Next steps:
1. /cdp-hybrid-idu:hybrid-generate-databricks
2. /cdp-hybrid-idu:hybrid-generate-snowflake
3. /cdp-hybrid-idu:hybrid-setup (complete workflow)
```
### If Validation Fails
```
❌ Configuration has errors that must be fixed
Errors (must fix):
1. Missing required section: canonical_ids
2. Undefined key 'phone' referenced in table 'orders'
Suggestions:
• Add canonical_ids section with name and merge_by_keys
• Add phone key to keys section or remove from orders
Would you like help fixing these issues? (y/n)
```
I can help you:
- Fix syntax errors
- Add missing sections
- Define proper validation rules
- Optimize configuration
---
## Success Criteria
Validation passes when:
- ✅ YAML syntax is valid
- ✅ All required sections present
- ✅ All references resolved
- ✅ No structural errors
- ✅ Ready for SQL generation
---
**Ready to validate your YAML configuration?**
Provide your `unify.yml` file path to begin validation!

View File

@@ -0,0 +1,726 @@
---
name: hybrid-unif-merge-stats-creator
description: Generate professional HTML/PDF merge statistics report from ID unification results for Snowflake or Databricks with expert analysis and visualizations
---
# ID Unification Merge Statistics Report Generator
## Overview
I'll generate a **comprehensive, professional HTML report** analyzing your ID unification merge statistics with:
- 📊 **Executive Summary** with key performance indicators
- 📈 **Identity Resolution Performance** analysis and deduplication rates
- 🎯 **Merge Distribution** patterns and complexity analysis
- 👥 **Top Merged Profiles** highlighting complex identity resolutions
-**Data Quality Metrics** with coverage percentages
- 🚀 **Convergence Analysis** showing iteration performance
- 💡 **Expert Recommendations** for optimization and next steps
**Platform Support:**
- ✅ Snowflake (using Snowflake MCP tools)
- ✅ Databricks (using Databricks MCP tools)
**Output Format:**
- Beautiful HTML report with charts, tables, and visualizations
- PDF-ready (print to PDF from browser)
- Consistent formatting every time
- Platform-agnostic design
---
## What You Need to Provide
### 1. Platform Selection
- **Snowflake**: For Snowflake-based ID unification
- **Databricks**: For Databricks-based ID unification
### 2. Database/Catalog Configuration
**For Snowflake:**
- **Database Name**: Where your unification tables are stored (e.g., `INDRESH_TEST`, `CUSTOMER_CDP`)
- **Schema Name**: Schema containing tables (e.g., `PUBLIC`, `ID_UNIFICATION`)
**For Databricks:**
- **Catalog Name**: Unity Catalog name (e.g., `customer_data`, `cdp_prod`)
- **Schema Name**: Schema containing tables (e.g., `id_unification`, `unified_profiles`)
### 3. Canonical ID Configuration
- **Canonical ID Name**: Name used for your unified ID (e.g., `td_id`, `unified_customer_id`, `master_id`)
- This is used to find the correct tables: `{canonical_id}_lookup`, `{canonical_id}_master_table`, etc.
### 4. Output Configuration (Optional)
- **Output File Path**: Where to save the HTML report (default: `id_unification_report.html`)
- **Report Title**: Custom title for the report (default: "ID Unification Merge Statistics Report")
---
## What I'll Do
### Step 1: Platform Detection and Validation
**Snowflake:**
```
1. Verify Snowflake MCP tools are available
2. Test connection to specified database.schema
3. Validate canonical ID tables exist:
- {database}.{schema}.{canonical_id}_lookup
- {database}.{schema}.{canonical_id}_master_table
- {database}.{schema}.{canonical_id}_source_key_stats
- {database}.{schema}.{canonical_id}_result_key_stats
4. Confirm access permissions
```
**Databricks:**
```
1. Verify Databricks MCP tools are available (or use Snowflake fallback)
2. Test connection to specified catalog.schema
3. Validate canonical ID tables exist
4. Confirm access permissions
```
### Step 2: Data Collection with Expert Analysis
I'll execute **16 specialized queries** to collect comprehensive statistics:
**Core Statistics Queries:**
1. **Source Key Statistics**
- Pre-unification identity counts
- Distinct values per key type (customer_id, email, phone, etc.)
- Per-table breakdowns
2. **Result Key Statistics**
- Post-unification canonical ID counts
- Distribution histograms
- Coverage per key type
3. **Canonical ID Metrics**
- Total identities processed
- Unique canonical IDs created
- Merge ratio calculation
4. **Top Merged Profiles**
- Top 10 most complex merges
- Identity count per canonical ID
- Merge complexity scoring
5. **Merge Distribution Analysis**
- Categorization (2, 3-5, 6-10, 10+ identities)
- Percentage distribution
- Pattern analysis
6. **Key Type Distribution**
- Identity breakdown by type
- Namespace analysis
- Cross-key coverage
7. **Master Table Quality Metrics**
- Attribute coverage percentages
- Data completeness analysis
- Sample record extraction
8. **Configuration Metadata**
- Unification settings
- Column mappings
- Validation rules
**Platform-Specific SQL Adaptation:**
For **Snowflake**:
```sql
SELECT COUNT(*) as total_identities,
COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {database}.{schema}.{canonical_id}_lookup;
```
For **Databricks**:
```sql
SELECT COUNT(*) as total_identities,
COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {catalog}.{schema}.{canonical_id}_lookup;
```
### Step 3: Statistical Analysis and Calculations
I'll perform expert-level calculations:
**Deduplication Rates:**
```
For each key type:
- Source distinct count (pre-unification)
- Final canonical IDs (post-unification)
- Deduplication % = (source - final) / source * 100
```
**Merge Ratios:**
```
- Average identities per customer = total_identities / unique_canonical_ids
- Distribution across categories
- Outlier detection (10+ merges)
```
**Convergence Analysis:**
```
- Parse from execution logs if available
- Calculate from iteration metadata tables
- Estimate convergence quality
```
**Data Quality Scores:**
```
- Coverage % for each attribute
- Completeness assessment
- Quality grading (Excellent, Good, Needs Improvement)
```
### Step 4: HTML Report Generation
I'll generate a **pixel-perfect HTML report** with:
**Design Features:**
- ✨ Modern gradient design (purple theme)
- 📊 Interactive visualizations (progress bars, horizontal bar charts)
- 🎨 Color-coded badges and status indicators
- 📱 Responsive layout (works on all devices)
- 🖨️ Print-optimized CSS for PDF export
**Report Structure:**
```html
<!DOCTYPE html>
<html>
<head>
- Professional CSS styling
- Chart/visualization styles
- Print media queries
</head>
<body>
<header>
- Report title
- Executive tagline
</header>
<metadata-bar>
- Database/Catalog info
- Canonical ID name
- Generation timestamp
- Platform indicator
</metadata-bar>
<section: Executive Summary>
- 4 KPI metric cards
- Key findings insight box
</section>
<section: Identity Resolution Performance>
- Source vs result comparison table
- Deduplication rate analysis
- Horizontal bar charts
- Expert insights
</section>
<section: Merge Distribution Analysis>
- Category breakdown table
- Distribution visualizations
- Pattern analysis insights
</section>
<section: Top Merged Profiles>
- Top 10 ranked table
- Complexity badges
- Investigation recommendations
</section>
<section: Source Table Configuration>
- Column mapping table
- Source contributions
- Multi-key strategy analysis
</section>
<section: Master Table Data Quality>
- 6 coverage cards with progress bars
- Sample records table
- Quality assessment
</section>
<section: Convergence Performance>
- Iteration breakdown table
- Convergence progression chart
- Efficiency analysis
</section>
<section: Expert Recommendations>
- 4 recommendation cards
- Strategic next steps
- Downstream activation ideas
</section>
<section: Summary Statistics>
- Comprehensive metrics table
- All key numbers documented
</section>
<footer>
- Generation metadata
- Platform information
- Report description
</footer>
</body>
</html>
```
### Step 5: Quality Validation and Output
**Pre-Output Validation:**
```
1. Verify all sections have data
2. Check calculations are correct
3. Validate percentages sum properly
4. Ensure no missing values
5. Confirm HTML is well-formed
```
**File Output:**
```
1. Write HTML to specified path
2. Create backup if file exists
3. Set proper file permissions
4. Verify file was written successfully
```
**Report Summary:**
```
✓ Report generated: {file_path}
✓ File size: {size} KB
✓ Sections included: 9
✓ Statistics queries: 16
✓ Data quality score: {score}%
✓ Ready for: Browser viewing, PDF export, sharing
```
---
## Example Workflow
### Snowflake Example
**User Input:**
```
Platform: Snowflake
Database: INDRESH_TEST
Schema: PUBLIC
Canonical ID: td_id
Output: snowflake_merge_report.html
```
**Process:**
```
✓ Connected to Snowflake via MCP
✓ Database: INDRESH_TEST.PUBLIC validated
✓ Tables found:
- td_id_lookup (19,512 records)
- td_id_master_table (4,940 records)
- td_id_source_key_stats (4 records)
- td_id_result_key_stats (4 records)
Executing queries:
✓ Query 1: Source statistics retrieved
✓ Query 2: Result statistics retrieved
✓ Query 3: Canonical ID counts (19,512 → 4,940)
✓ Query 4: Top 10 merged profiles identified
✓ Query 5: Merge distribution calculated
✓ Query 6: Key type distribution analyzed
✓ Query 7: Master table coverage (100% email, 99.39% phone)
✓ Query 8: Sample records extracted
✓ Query 9-11: Metadata retrieved
Calculating metrics:
✓ Merge ratio: 3.95:1
✓ Fragmentation reduction: 74.7%
✓ Deduplication rates:
- customer_id: 23.9%
- email: 32.0%
- phone: 14.8%
✓ Data quality score: 99.7%
Generating HTML report:
✓ Executive summary section
✓ Performance analysis section
✓ Merge distribution section
✓ Top profiles section
✓ Source configuration section
✓ Data quality section
✓ Convergence section
✓ Recommendations section
✓ Summary statistics section
✓ Report saved: snowflake_merge_report.html (142 KB)
✓ Open in browser to view
✓ Print to PDF for distribution
```
**Generated Report Contents:**
```
Executive Summary:
- 4,940 unified profiles
- 19,512 total identities
- 3.95:1 merge ratio
- 74.7% fragmentation reduction
Identity Resolution:
- customer_id: 6,489 → 4,940 (23.9% reduction)
- email: 7,261 → 4,940 (32.0% reduction)
- phone: 5,762 → 4,910 (14.8% reduction)
Merge Distribution:
- 89.0% profiles: 3-5 identities (normal)
- 8.1% profiles: 6-10 identities (high engagement)
- 2.3% profiles: 10+ identities (complex)
Top Merged Profile:
- mS9ssBEh4EsN: 38 identities merged
Data Quality:
- Email: 100% coverage
- Phone: 99.39% coverage
- Names: 100% coverage
- Location: 100% coverage
Expert Recommendations:
- Implement incremental processing
- Monitor profiles with 20+ merges
- Enable downstream activation
- Set up quality monitoring
```
### Databricks Example
**User Input:**
```
Platform: Databricks
Catalog: customer_cdp
Schema: id_unification
Canonical ID: unified_customer_id
Output: databricks_merge_report.html
```
**Process:**
```
✓ Connected to Databricks (or using Snowflake MCP fallback)
✓ Catalog: customer_cdp.id_unification validated
✓ Tables found:
- unified_customer_id_lookup
- unified_customer_id_master_table
- unified_customer_id_source_key_stats
- unified_customer_id_result_key_stats
[Same query execution and report generation as Snowflake]
✓ Report saved: databricks_merge_report.html
```
---
## Key Features
### 🎯 **Consistency Guarantee**
- **Same report every time**: Deterministic HTML generation
- **Platform-agnostic design**: Works identically on Snowflake and Databricks
- **Version controlled**: Report structure is fixed and versioned
### 🔍 **Expert Analysis**
- **16 specialized queries**: Comprehensive data collection
- **Calculated metrics**: Deduplication rates, merge ratios, quality scores
- **Pattern detection**: Identify anomalies and outliers
- **Strategic insights**: Actionable recommendations
### 📊 **Professional Visualizations**
- **KPI metric cards**: Large, colorful summary metrics
- **Progress bars**: Coverage percentages with animations
- **Horizontal bar charts**: Distribution comparisons
- **Color-coded badges**: Status indicators (Excellent, Good, Needs Review)
- **Tables with hover effects**: Interactive data exploration
### 🌍 **Platform Flexibility**
- **Snowflake**: Uses `mcp__snowflake__execute_query` tool
- **Databricks**: Uses Databricks MCP tools (with fallback options)
- **Automatic SQL adaptation**: Platform-specific query generation
- **Table name resolution**: Handles catalog vs database differences
### 📋 **Comprehensive Coverage**
**9 Report Sections:**
1. Executive Summary (4 KPIs + findings)
2. Identity Resolution Performance (deduplication analysis)
3. Merge Distribution Analysis (categorized breakdown)
4. Top Merged Profiles (complexity ranking)
5. Source Table Configuration (mappings)
6. Master Table Data Quality (coverage metrics)
7. Convergence Performance (iteration analysis)
8. Expert Recommendations (strategic guidance)
9. Summary Statistics (complete metrics)
**16 Statistical Queries:**
- Source/result key statistics
- Canonical ID counts and distributions
- Merge pattern analysis
- Quality coverage metrics
- Configuration metadata
---
## Table Naming Conventions
The command automatically finds tables based on your canonical ID name:
### Required Tables
For canonical ID = `{canonical_id}`:
1. **Lookup Table**: `{canonical_id}_lookup`
- Contains: canonical_id, id, id_key_type
- Used for: Merge ratio, distribution, top profiles
2. **Master Table**: `{canonical_id}_master_table`
- Contains: {canonical_id}, best_* attributes
- Used for: Data quality coverage
3. **Source Stats**: `{canonical_id}_source_key_stats`
- Contains: from_table, total_distinct, distinct_*
- Used for: Pre-unification baseline
4. **Result Stats**: `{canonical_id}_result_key_stats`
- Contains: from_table, total_distinct, histogram_*
- Used for: Post-unification results
### Optional Tables
5. **Unification Metadata**: `unification_metadata`
- Contains: canonical_id_name, canonical_id_type
- Used for: Configuration documentation
6. **Column Lookup**: `column_lookup`
- Contains: table_name, column_name, key_name
- Used for: Source table mappings
7. **Filter Lookup**: `filter_lookup`
- Contains: key_name, invalid_texts, valid_regexp
- Used for: Validation rules
**All tables must be in the same database.schema (Snowflake) or catalog.schema (Databricks)**
---
## Output Format
### HTML Report Features
**Styling:**
- Gradient purple theme (#667eea to #764ba2)
- Modern typography (system fonts)
- Responsive grid layouts
- Smooth hover animations
- Print-optimized media queries
**Sections:**
- Header with gradient background
- Metadata bar with key info
- 9 content sections with analysis
- Footer with generation details
**Visualizations:**
- Metric cards (4 in executive summary)
- Progress bars (6 in data quality)
- Horizontal bar charts (3 throughout report)
- Tables with sorting and hover effects
- Insight boxes with recommendations
**Interactivity:**
- Hover effects on cards and tables
- Animated progress bars
- Expandable insight boxes
- Responsive layout adapts to screen size
### PDF Export
To create a PDF from the HTML report:
1. Open HTML file in browser
2. Press Ctrl+P (Windows) or Cmd+P (Mac)
3. Select "Save as PDF"
4. Choose landscape orientation for better chart visibility
5. Enable background graphics for full styling
---
## Error Handling
### Common Issues and Solutions
**Issue: "Tables not found"**
```
Solution:
1. Verify canonical ID name is correct
2. Check database/catalog and schema names
3. Ensure unification workflow completed successfully
4. Confirm table naming: {canonical_id}_lookup, {canonical_id}_master_table, etc.
```
**Issue: "MCP tools not available"**
```
Solution:
1. For Snowflake: Verify Snowflake MCP server is configured
2. For Databricks: Fall back to Snowflake MCP with proper connection string
3. Check network connectivity
4. Validate credentials
```
**Issue: "No data in statistics tables"**
```
Solution:
1. Verify unification workflow ran completely
2. Check that statistics SQL files were executed
3. Confirm data exists in lookup and master tables
4. Re-run the unification workflow if needed
```
**Issue: "Permission denied"**
```
Solution:
1. Verify READ access to all tables
2. For Snowflake: Grant SELECT on schema
3. For Databricks: Grant USE CATALOG, USE SCHEMA, SELECT
4. Check role/user permissions
```
---
## Success Criteria
Generated report will:
-**Open successfully** in all modern browsers (Chrome, Firefox, Safari, Edge)
-**Display all 9 sections** with complete data
-**Show accurate calculations** for all metrics
-**Include visualizations** (charts, progress bars, tables)
-**Render consistently** every time it's generated
-**Export cleanly to PDF** with proper formatting
-**Match the reference design** (same HTML/CSS structure)
-**Contain expert insights** and recommendations
-**Be production-ready** for stakeholder distribution
---
## Usage Examples
### Quick Start (Snowflake)
```
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
> Platform: Snowflake
> Database: PROD_CDP
> Schema: ID_UNIFICATION
> Canonical ID: master_customer_id
> Output: (press Enter for default)
✓ Report generated: id_unification_report.html
```
### Custom Output Path
```
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
> Platform: Databricks
> Catalog: analytics_prod
> Schema: unified_ids
> Canonical ID: td_id
> Output: /reports/weekly/td_id_stats_2025-10-15.html
✓ Report generated: /reports/weekly/td_id_stats_2025-10-15.html
```
### Multiple Environments
Generate reports for different environments:
```bash
# Production
/hybrid-unif-merge-stats-creator
Platform: Snowflake
Database: PROD_CDP
Output: prod_merge_stats.html
# Staging
/hybrid-unif-merge-stats-creator
Platform: Snowflake
Database: STAGING_CDP
Output: staging_merge_stats.html
# Compare metrics across environments
```
---
## Best Practices
### Regular Reporting
1. **Weekly Reports**: Track merge performance over time
2. **Post-Workflow Reports**: Generate after each unification run
3. **Quality Audits**: Monthly deep-dive analysis
4. **Stakeholder Updates**: Executive-friendly format
### Comparative Analysis
Generate reports at different stages:
- After initial unification setup
- After incremental updates
- After data quality improvements
- Across different customer segments
### Archive and Versioning
```
reports/
2025-10-15_td_id_merge_stats.html
2025-10-08_td_id_merge_stats.html
2025-10-01_td_id_merge_stats.html
```
Track improvements over time by comparing:
- Merge ratios
- Data quality scores
- Convergence iterations
- Deduplication rates
---
## Getting Started
**Ready to generate your merge statistics report?**
Please provide:
1. **Platform**: Snowflake or Databricks?
2. **Database/Catalog**: Where are your unification tables?
3. **Schema**: Which schema contains the tables?
4. **Canonical ID**: What's the name of your unified ID? (e.g., td_id)
5. **Output Path** (optional): Where to save the report?
**Example:**
```
I want to generate a merge statistics report for:
Platform: Snowflake
Database: INDRESH_TEST
Schema: PUBLIC
Canonical ID: td_id
Output: my_unification_report.html
```
---
**I'll analyze your ID unification results and create a comprehensive, beautiful HTML report with expert insights!**

101
plugin.lock.json Normal file
View File

@@ -0,0 +1,101 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:treasure-data/aps_claude_tools:plugins/cdp-hybrid-idu",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "58382efafa00d9c88bf68f0ba2be494e310d9827",
"treeHash": "04cbd3c0d2b818afaf15f92f7e5fb2880103cdbfd513d9926f323c5b7722f625",
"generatedAt": "2025-11-28T10:28:44.950550Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "cdp-hybrid-idu",
"description": "Multi-platform ID Unification for Snowflake and Databricks with YAML-driven configuration, convergence detection, and master table generation",
"version": null
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "4e50b588ce6c220815a4ca869c68f41fe23cbaf05846fb306e7b2cbf127ed8f8"
},
{
"path": "agents/hybrid-unif-keys-extractor.md",
"sha256": "d2c92a61393209f0835f0118240254a7fa6f209aa62ec87d0ab253723055a7da"
},
{
"path": "agents/merge-stats-report-generator.md",
"sha256": "6e8fda43a277dfef132566b44a3dee23a632171641dcde0151d7602f43bcb5e8"
},
{
"path": "agents/databricks-sql-generator.md",
"sha256": "ae3ce3874d7c00599fcef09718cb612e551aac89896e3c75aa1194332179df9d"
},
{
"path": "agents/databricks-workflow-executor.md",
"sha256": "ecc4fcf94d470fe27f078e8722297921469852d596035e3d9d5b5d32aa2b0435"
},
{
"path": "agents/yaml-configuration-builder.md",
"sha256": "da90f575f8f0f7e33fba1ad720c73556e029227fcf135cd9fe4a9a1d3fb77be3"
},
{
"path": "agents/snowflake-sql-generator.md",
"sha256": "783ee1653bca7e0bb2647b953b4c05390e08686f7454c8e1a9e572851e8e0fc8"
},
{
"path": "agents/snowflake-workflow-executor.md",
"sha256": "f5f5352f47cfdd5a52769988ed9893f5a64de2a236f145b315838d475babca2c"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "a156b276659131718eab652f7b9806ab00bf59318ee07e22a585e3cb13da5e93"
},
{
"path": "commands/hybrid-setup.md",
"sha256": "9a287a1c414323cd6db2c5f3197fcfde531d337168e2696bc7f4896113ae40b6"
},
{
"path": "commands/hybrid-generate-databricks.md",
"sha256": "aff13cf95a74cd71dff35e3a4cd4ba2f287a7b3091f84cdb914d80e00bfe29ad"
},
{
"path": "commands/hybrid-generate-snowflake.md",
"sha256": "0dc460f41ee3c8130aa9a52537686fec6818e7a37a802040b8a570d8f89eaf77"
},
{
"path": "commands/hybrid-unif-config-creator.md",
"sha256": "3e14989f811e5ef198cff306e9203ec6bfa5f3772daa3a0f08292595574ab73c"
},
{
"path": "commands/hybrid-execute-databricks.md",
"sha256": "ad78068c5b96d310d1d620c00572c100915e0706a5312c0649b09a8165bbc79c"
},
{
"path": "commands/hybrid-unif-config-validate.md",
"sha256": "a413582bc43a23ad1addde134007bd6a3174b14d71c10dcbbd5f7824a6a97fb0"
},
{
"path": "commands/hybrid-unif-merge-stats-creator.md",
"sha256": "0c00db96f02559d212e502702eea5d3a02de8fedbca92f64eebd8ed430f96341"
},
{
"path": "commands/hybrid-execute-snowflake.md",
"sha256": "63fbc27f3350cd1d910d0dc4c588a14fd39e1d7ebda29ed6fce3584967bac4c4"
}
],
"dirSha256": "04cbd3c0d2b818afaf15f92f7e5fb2880103cdbfd513d9926f323c5b7722f625"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}