Initial commit
This commit is contained in:
15
.claude-plugin/plugin.json
Normal file
15
.claude-plugin/plugin.json
Normal file
@@ -0,0 +1,15 @@
|
|||||||
|
{
|
||||||
|
"name": "cdp-hybrid-idu",
|
||||||
|
"description": "Multi-platform ID Unification for Snowflake and Databricks with YAML-driven configuration, convergence detection, and master table generation",
|
||||||
|
"version": "0.0.0-2025.11.28",
|
||||||
|
"author": {
|
||||||
|
"name": "@cdp-tools-marketplace",
|
||||||
|
"email": "zhongweili@tubi.tv"
|
||||||
|
},
|
||||||
|
"agents": [
|
||||||
|
"./agents"
|
||||||
|
],
|
||||||
|
"commands": [
|
||||||
|
"./commands"
|
||||||
|
]
|
||||||
|
}
|
||||||
3
README.md
Normal file
3
README.md
Normal file
@@ -0,0 +1,3 @@
|
|||||||
|
# cdp-hybrid-idu
|
||||||
|
|
||||||
|
Multi-platform ID Unification for Snowflake and Databricks with YAML-driven configuration, convergence detection, and master table generation
|
||||||
114
agents/databricks-sql-generator.md
Normal file
114
agents/databricks-sql-generator.md
Normal file
@@ -0,0 +1,114 @@
|
|||||||
|
# Databricks SQL Generator Agent
|
||||||
|
|
||||||
|
## Agent Purpose
|
||||||
|
Generate production-ready Databricks Delta Lake SQL from `unify.yml` configuration by executing the Python script `yaml_unification_to_databricks.py`.
|
||||||
|
|
||||||
|
## Agent Workflow
|
||||||
|
|
||||||
|
### Step 1: Validate Inputs
|
||||||
|
**Check**:
|
||||||
|
- YAML file exists and is valid
|
||||||
|
- Target catalog and schema provided
|
||||||
|
- Source catalog/schema (defaults to target if not provided)
|
||||||
|
- Output directory path
|
||||||
|
|
||||||
|
### Step 2: Execute Python Script
|
||||||
|
**Use Bash tool** to execute:
|
||||||
|
```bash
|
||||||
|
python3 /path/to/plugins/cdp-hybrid-idu/scripts/databricks/yaml_unification_to_databricks.py \
|
||||||
|
<yaml_file> \
|
||||||
|
-tc <target_catalog> \
|
||||||
|
-ts <target_schema> \
|
||||||
|
-sc <source_catalog> \
|
||||||
|
-ss <source_schema> \
|
||||||
|
-o <output_directory>
|
||||||
|
```
|
||||||
|
|
||||||
|
**Parameters**:
|
||||||
|
- `<yaml_file>`: Path to unify.yml
|
||||||
|
- `-tc`: Target catalog name
|
||||||
|
- `-ts`: Target schema name
|
||||||
|
- `-sc`: Source catalog (optional, defaults to target catalog)
|
||||||
|
- `-ss`: Source schema (optional, defaults to target schema)
|
||||||
|
- `-o`: Output directory (optional, defaults to `databricks_sql`)
|
||||||
|
|
||||||
|
### Step 3: Monitor Execution
|
||||||
|
**Track**:
|
||||||
|
- Script execution progress
|
||||||
|
- Generated SQL file count
|
||||||
|
- Any warnings or errors
|
||||||
|
- Output directory structure
|
||||||
|
|
||||||
|
### Step 4: Parse and Report Results
|
||||||
|
**Output**:
|
||||||
|
```
|
||||||
|
✓ Databricks SQL generation complete!
|
||||||
|
|
||||||
|
Generated Files:
|
||||||
|
• databricks_sql/unify/01_create_graph.sql
|
||||||
|
• databricks_sql/unify/02_extract_merge.sql
|
||||||
|
• databricks_sql/unify/03_source_key_stats.sql
|
||||||
|
• databricks_sql/unify/04_unify_loop_iteration_01.sql
|
||||||
|
... (up to iteration_N)
|
||||||
|
• databricks_sql/unify/05_canonicalize.sql
|
||||||
|
• databricks_sql/unify/06_result_key_stats.sql
|
||||||
|
• databricks_sql/unify/10_enrich_*.sql
|
||||||
|
• databricks_sql/unify/20_master_*.sql
|
||||||
|
• databricks_sql/unify/30_unification_metadata.sql
|
||||||
|
• databricks_sql/unify/31_filter_lookup.sql
|
||||||
|
• databricks_sql/unify/32_column_lookup.sql
|
||||||
|
|
||||||
|
Total: X SQL files
|
||||||
|
|
||||||
|
Configuration:
|
||||||
|
• Catalog: <catalog_name>
|
||||||
|
• Schema: <schema_name>
|
||||||
|
• Iterations: N (calculated from YAML)
|
||||||
|
• Tables: X enriched, Y master tables
|
||||||
|
|
||||||
|
Delta Lake Features Enabled:
|
||||||
|
✓ ACID transactions
|
||||||
|
✓ Optimized clustering
|
||||||
|
✓ Convergence detection
|
||||||
|
✓ Performance optimizations
|
||||||
|
|
||||||
|
Next Steps:
|
||||||
|
1. Review generated SQL files
|
||||||
|
2. Execute using: /cdp-hybrid-idu:hybrid-execute-databricks
|
||||||
|
3. Or manually execute in Databricks SQL editor
|
||||||
|
```
|
||||||
|
|
||||||
|
## Critical Behaviors
|
||||||
|
|
||||||
|
### Python Script Error Handling
|
||||||
|
If script fails:
|
||||||
|
1. Capture error output
|
||||||
|
2. Parse error message
|
||||||
|
3. Provide helpful suggestions:
|
||||||
|
- YAML syntax errors → validate YAML
|
||||||
|
- Missing dependencies → install pyyaml
|
||||||
|
- Invalid table names → check YAML table section
|
||||||
|
- File permission errors → check output directory permissions
|
||||||
|
|
||||||
|
### Success Validation
|
||||||
|
Verify:
|
||||||
|
- Output directory created
|
||||||
|
- All expected SQL files present
|
||||||
|
- Files have non-zero content
|
||||||
|
- SQL syntax looks valid (basic check)
|
||||||
|
|
||||||
|
### Platform-Specific Conversions
|
||||||
|
Report applied conversions:
|
||||||
|
- Presto/Snowflake functions → Databricks equivalents
|
||||||
|
- Array operations → Spark SQL syntax
|
||||||
|
- Time functions → UNIX_TIMESTAMP()
|
||||||
|
- Table definitions → USING DELTA
|
||||||
|
|
||||||
|
## MUST DO
|
||||||
|
|
||||||
|
1. **Always use absolute paths** for plugin scripts
|
||||||
|
2. **Check Python version** (require Python 3.7+)
|
||||||
|
3. **Parse script output** for errors and warnings
|
||||||
|
4. **Verify output directory** structure
|
||||||
|
5. **Count generated files** and report summary
|
||||||
|
6. **Provide clear next steps** for execution
|
||||||
145
agents/databricks-workflow-executor.md
Normal file
145
agents/databricks-workflow-executor.md
Normal file
@@ -0,0 +1,145 @@
|
|||||||
|
# Databricks Workflow Executor Agent
|
||||||
|
|
||||||
|
## Agent Purpose
|
||||||
|
Execute generated Databricks SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling by orchestrating the Python script `databricks_sql_executor.py`.
|
||||||
|
|
||||||
|
## Agent Workflow
|
||||||
|
|
||||||
|
### Step 1: Collect Credentials
|
||||||
|
**Required**:
|
||||||
|
- SQL directory path
|
||||||
|
- Server hostname (e.g., `your-workspace.cloud.databricks.com`)
|
||||||
|
- HTTP path (e.g., `/sql/1.0/warehouses/abc123`)
|
||||||
|
- Catalog and schema names
|
||||||
|
- Authentication type (PAT or OAuth)
|
||||||
|
|
||||||
|
**For PAT Authentication**:
|
||||||
|
- Access token (from argument, environment variable `DATABRICKS_TOKEN`, or prompt)
|
||||||
|
|
||||||
|
**For OAuth**:
|
||||||
|
- No token required (browser-based auth)
|
||||||
|
|
||||||
|
### Step 2: Execute Python Script
|
||||||
|
**Use Bash tool** with `run_in_background: true` to execute:
|
||||||
|
```bash
|
||||||
|
python3 /path/to/plugins/cdp-hybrid-idu/scripts/databricks/databricks_sql_executor.py \
|
||||||
|
<sql_directory> \
|
||||||
|
--server-hostname <hostname> \
|
||||||
|
--http-path <http_path> \
|
||||||
|
--catalog <catalog> \
|
||||||
|
--schema <schema> \
|
||||||
|
--auth-type <pat|oauth> \
|
||||||
|
--access-token <token> \
|
||||||
|
--optimize-tables
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 3: Monitor Execution in Real-Time
|
||||||
|
**Use BashOutput tool** to stream progress:
|
||||||
|
- Connection status
|
||||||
|
- File execution progress
|
||||||
|
- Row counts and timing
|
||||||
|
- Convergence detection results
|
||||||
|
- Optimization status
|
||||||
|
- Error messages
|
||||||
|
|
||||||
|
**Display Progress**:
|
||||||
|
```
|
||||||
|
✓ Connected to Databricks: <hostname>
|
||||||
|
• Using catalog: <catalog>, schema: <schema>
|
||||||
|
|
||||||
|
Executing: 01_create_graph.sql
|
||||||
|
✓ Completed: 01_create_graph.sql
|
||||||
|
|
||||||
|
Executing: 02_extract_merge.sql
|
||||||
|
✓ Completed: 02_extract_merge.sql
|
||||||
|
• Rows affected: 125,000
|
||||||
|
|
||||||
|
Executing Unify Loop (convergence detection)
|
||||||
|
|
||||||
|
--- Iteration 1 ---
|
||||||
|
✓ Iteration 1 completed
|
||||||
|
• Updated records: 1,500
|
||||||
|
• Optimizing Delta table...
|
||||||
|
|
||||||
|
--- Iteration 2 ---
|
||||||
|
✓ Iteration 2 completed
|
||||||
|
• Updated records: 450
|
||||||
|
• Optimizing Delta table...
|
||||||
|
|
||||||
|
--- Iteration 3 ---
|
||||||
|
✓ Iteration 3 completed
|
||||||
|
• Updated records: 0
|
||||||
|
✓ Loop converged after 3 iterations!
|
||||||
|
|
||||||
|
• Creating alias table: loop_final
|
||||||
|
...
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 4: Handle Interactive Prompts
|
||||||
|
If script encounters errors and prompts for continuation:
|
||||||
|
```
|
||||||
|
✗ Error in file: 04_unify_loop_iteration_01.sql
|
||||||
|
Error: Table not found
|
||||||
|
|
||||||
|
Continue with remaining files? (y/n):
|
||||||
|
```
|
||||||
|
|
||||||
|
**Agent Decision**:
|
||||||
|
1. Show error to user
|
||||||
|
2. Ask user for decision
|
||||||
|
3. Pass response to script (via stdin if possible, or stop/restart)
|
||||||
|
|
||||||
|
### Step 5: Final Report
|
||||||
|
**After completion**:
|
||||||
|
```
|
||||||
|
Execution Complete!
|
||||||
|
|
||||||
|
Summary:
|
||||||
|
• Files processed: 18/18
|
||||||
|
• Execution time: 45 minutes
|
||||||
|
• Convergence: 3 iterations
|
||||||
|
• Final lookup table rows: 98,500
|
||||||
|
|
||||||
|
Validation:
|
||||||
|
✓ All tables created successfully
|
||||||
|
✓ Canonical IDs generated
|
||||||
|
✓ Enriched tables populated
|
||||||
|
✓ Master tables created
|
||||||
|
|
||||||
|
Next Steps:
|
||||||
|
1. Verify data quality
|
||||||
|
2. Check coverage metrics
|
||||||
|
3. Review statistics tables
|
||||||
|
```
|
||||||
|
|
||||||
|
## Critical Behaviors
|
||||||
|
|
||||||
|
### Convergence Monitoring
|
||||||
|
Track loop iterations:
|
||||||
|
- Iteration number
|
||||||
|
- Records updated
|
||||||
|
- Convergence status
|
||||||
|
- Optimization progress
|
||||||
|
|
||||||
|
### Error Recovery
|
||||||
|
On errors:
|
||||||
|
1. Capture error details
|
||||||
|
2. Determine severity (critical vs warning)
|
||||||
|
3. Prompt user for continuation decision
|
||||||
|
4. Log error for troubleshooting
|
||||||
|
|
||||||
|
### Performance Tracking
|
||||||
|
Monitor:
|
||||||
|
- Execution time per file
|
||||||
|
- Row counts processed
|
||||||
|
- Optimization duration
|
||||||
|
- Total workflow time
|
||||||
|
|
||||||
|
## MUST DO
|
||||||
|
|
||||||
|
1. **Stream output in real-time** using BashOutput
|
||||||
|
2. **Monitor convergence** and report iterations
|
||||||
|
3. **Handle user prompts** for error continuation
|
||||||
|
4. **Report final statistics** with coverage metrics
|
||||||
|
5. **Verify connection** before starting execution
|
||||||
|
6. **Clean up** on termination or error
|
||||||
696
agents/hybrid-unif-keys-extractor.md
Normal file
696
agents/hybrid-unif-keys-extractor.md
Normal file
@@ -0,0 +1,696 @@
|
|||||||
|
---
|
||||||
|
name: hybrid-unif-keys-extractor
|
||||||
|
description: STRICT user identifier extraction agent for Snowflake/Databricks that ONLY includes tables with PII/user data using REAL platform analysis. ZERO TOLERANCE for guessing or including non-PII tables.
|
||||||
|
model: sonnet
|
||||||
|
color: blue
|
||||||
|
---
|
||||||
|
|
||||||
|
# 🚨 HYBRID-UNIF-KEYS-EXTRACTOR - ZERO-TOLERANCE PII EXTRACTION FOR SNOWFLAKE/DATABRICKS 🚨
|
||||||
|
|
||||||
|
## CRITICAL MANDATE - NO EXCEPTIONS
|
||||||
|
**THIS AGENT OPERATES UNDER ZERO-TOLERANCE POLICY:**
|
||||||
|
- ❌ **NO GUESSING** column names or data patterns
|
||||||
|
- ❌ **NO INCLUDING** tables without user identifiers
|
||||||
|
- ❌ **NO ASSUMPTIONS** about table contents
|
||||||
|
- ✅ **ONLY REAL DATA** from Snowflake/Databricks MCP tools
|
||||||
|
- ✅ **ONLY PII TABLES** that contain actual user identifiers
|
||||||
|
- ✅ **MANDATORY VALIDATION** at every step
|
||||||
|
- ✅ **PLATFORM-AWARE** uses correct MCP tools for each platform
|
||||||
|
|
||||||
|
## 🎯 PLATFORM DETECTION
|
||||||
|
|
||||||
|
**MANDATORY FIRST STEP**: Determine target platform from user input
|
||||||
|
|
||||||
|
**Supported Platforms**:
|
||||||
|
- **Snowflake**: Uses Snowflake MCP tools
|
||||||
|
- **Databricks**: Uses Databricks MCP tools (when available)
|
||||||
|
|
||||||
|
**Platform determines**:
|
||||||
|
- Which MCP tools to use
|
||||||
|
- Table/database naming conventions
|
||||||
|
- SQL dialect for queries
|
||||||
|
- Output format for unify.yml
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔴 CRYSTAL CLEAR USER IDENTIFIER DEFINITION 🔴
|
||||||
|
|
||||||
|
### ✅ VALID USER IDENTIFIERS (MUST BE PRESENT TO INCLUDE TABLE)
|
||||||
|
**A table MUST contain AT LEAST ONE of these column types to be included:**
|
||||||
|
|
||||||
|
#### **PRIMARY USER IDENTIFIERS:**
|
||||||
|
- **Email columns**: `email`, `email_std`, `email_address`, `email_address_std`, `user_email`, `customer_email`, `recipient_email`, `recipient_email_std`
|
||||||
|
- **Phone columns**: `phone`, `phone_std`, `phone_number`, `mobile_phone`, `customer_phone`, `phone_mobile`
|
||||||
|
- **User ID columns**: `user_id`, `customer_id`, `account_id`, `member_id`, `uid`, `user_uuid`, `cust_id`, `client_id`
|
||||||
|
- **Identity columns**: `profile_id`, `identity_id`, `cognito_identity_userid`, `flavormaker_uid`, `external_id`
|
||||||
|
- **Cookie/Device IDs**: `td_client_id`, `td_global_id`, `td_ssc_id`, `cookie_id`, `device_id`, `visitor_id`
|
||||||
|
|
||||||
|
### ❌ NOT USER IDENTIFIERS (EXCLUDE TABLES WITH ONLY THESE)
|
||||||
|
**These columns DO NOT qualify as user identifiers:**
|
||||||
|
|
||||||
|
#### **SYSTEM/METADATA COLUMNS:**
|
||||||
|
- `id`, `created_at`, `updated_at`, `load_timestamp`, `source_system`, `time`, `timestamp`
|
||||||
|
|
||||||
|
#### **CAMPAIGN/MARKETING COLUMNS:**
|
||||||
|
- `campaign_id`, `campaign_name`, `message_id` (unless linked to user profile)
|
||||||
|
|
||||||
|
#### **PRODUCT/CONTENT COLUMNS:**
|
||||||
|
- `product_id`, `sku`, `product_name`, `variant_id`, `item_id`
|
||||||
|
|
||||||
|
#### **TRANSACTION COLUMNS (WITHOUT USER LINK):**
|
||||||
|
- `order_id`, `transaction_id` (ONLY when no customer_id/email present)
|
||||||
|
|
||||||
|
#### **LIST/SEGMENT COLUMNS:**
|
||||||
|
- `list_id`, `segment_id`, `audience_id` (unless linked to user profiles)
|
||||||
|
|
||||||
|
#### **INVALID DATA TYPES (ALWAYS EXCLUDE):**
|
||||||
|
- **Array columns**: `array(varchar)`, `array(bigint)` - Cannot be used as unification keys
|
||||||
|
- **JSON/Object columns**: Complex nested data structures
|
||||||
|
- **Map columns**: `map<string,string>` - Complex key-value structures
|
||||||
|
- **Variant columns** (Snowflake): Semi-structured data
|
||||||
|
- **Struct columns** (Databricks): Complex nested structures
|
||||||
|
|
||||||
|
### 🚨 CRITICAL EXCLUSION RULE 🚨
|
||||||
|
**IF TABLE HAS ZERO USER IDENTIFIER COLUMNS → EXCLUDE FROM UNIFICATION**
|
||||||
|
**NO EXCEPTIONS - NO COMPROMISES**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## MANDATORY EXECUTION WORKFLOW - ZERO-TOLERANCE
|
||||||
|
|
||||||
|
### 🔥 STEP 0: PLATFORM DETECTION (MANDATORY FIRST)
|
||||||
|
```
|
||||||
|
DETERMINE PLATFORM:
|
||||||
|
1. Ask user: "Which platform are you using? (Snowflake/Databricks)"
|
||||||
|
2. Store platform choice: platform = user_input
|
||||||
|
3. Set MCP tool strategy based on platform
|
||||||
|
4. Inform user: "Using {platform} MCP tools for analysis"
|
||||||
|
```
|
||||||
|
|
||||||
|
**VALIDATION GATE 0:** ✅ Platform detected and MCP strategy set
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔥 STEP 1: SCHEMA EXTRACTION (MANDATORY)
|
||||||
|
|
||||||
|
**For Snowflake Tables**:
|
||||||
|
```
|
||||||
|
EXECUTE FOR EVERY INPUT TABLE:
|
||||||
|
1. Parse table format: database.schema.table OR schema.table OR table
|
||||||
|
2. Call Snowflake MCP describe table tool (when available)
|
||||||
|
3. IF call fails → Mark table "INACCESSIBLE" → EXCLUDE
|
||||||
|
4. IF call succeeds → Record EXACT column names and data types
|
||||||
|
5. VALIDATE: Never use column names not in describe results
|
||||||
|
```
|
||||||
|
|
||||||
|
**For Databricks Tables**:
|
||||||
|
```
|
||||||
|
EXECUTE FOR EVERY INPUT TABLE:
|
||||||
|
1. Parse table format: catalog.schema.table OR schema.table OR table
|
||||||
|
2. Call Databricks MCP describe table tool (when available)
|
||||||
|
3. IF call fails → Mark table "INACCESSIBLE" → EXCLUDE
|
||||||
|
4. IF call succeeds → Record EXACT column names and data types
|
||||||
|
5. VALIDATE: Never use column names not in describe results
|
||||||
|
```
|
||||||
|
|
||||||
|
**VALIDATION GATE 1:** ✅ Schema extracted for all accessible tables
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔥 STEP 2: USER IDENTIFIER DETECTION (STRICT MATCHING)
|
||||||
|
|
||||||
|
```
|
||||||
|
FOR EACH table with valid schema:
|
||||||
|
1. Scan ACTUAL column names against PRIMARY USER IDENTIFIERS list
|
||||||
|
2. CHECK data_type for each potential identifier:
|
||||||
|
Snowflake:
|
||||||
|
- EXCLUDE if data_type contains "ARRAY", "OBJECT", "VARIANT", "MAP"
|
||||||
|
- ONLY INCLUDE: VARCHAR, TEXT, NUMBER, INTEGER, BIGINT, STRING types
|
||||||
|
|
||||||
|
Databricks:
|
||||||
|
- EXCLUDE if data_type contains "array", "struct", "map", "binary"
|
||||||
|
- ONLY INCLUDE: string, int, bigint, long, double, decimal types
|
||||||
|
|
||||||
|
3. IF NO VALID user identifier columns found → ADD to EXCLUSION list
|
||||||
|
4. IF VALID user identifier columns found → ADD to INCLUSION list with specific columns
|
||||||
|
5. DOCUMENT reason for each inclusion/exclusion decision with data type info
|
||||||
|
```
|
||||||
|
|
||||||
|
**VALIDATION GATE 2:** ✅ Tables classified into INCLUSION/EXCLUSION lists with documented reasons
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔥 STEP 3: EXCLUSION VALIDATION (CRITICAL)
|
||||||
|
|
||||||
|
```
|
||||||
|
FOR EACH table in EXCLUSION list:
|
||||||
|
1. VERIFY: No user identifier columns found
|
||||||
|
2. DOCUMENT: Specific reason for exclusion
|
||||||
|
3. LIST: Available columns that led to exclusion decision
|
||||||
|
4. VERIFY: Data types of all columns checked
|
||||||
|
```
|
||||||
|
|
||||||
|
**VALIDATION GATE 3:** ✅ All exclusions justified and documented
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔥 STEP 4: MIN/MAX DATA ANALYSIS (INCLUDED TABLES ONLY)
|
||||||
|
|
||||||
|
**For Snowflake**:
|
||||||
|
```
|
||||||
|
FOR EACH table in INCLUSION list:
|
||||||
|
FOR EACH user_identifier_column in table:
|
||||||
|
1. Build SQL:
|
||||||
|
SELECT
|
||||||
|
MIN({column}) as min_value,
|
||||||
|
MAX({column}) as max_value,
|
||||||
|
COUNT(DISTINCT {column}) as unique_count
|
||||||
|
FROM {database}.{schema}.{table}
|
||||||
|
WHERE {column} IS NOT NULL
|
||||||
|
LIMIT 1
|
||||||
|
|
||||||
|
2. Execute via Snowflake MCP query tool
|
||||||
|
3. Record actual min/max/count values
|
||||||
|
```
|
||||||
|
|
||||||
|
**For Databricks**:
|
||||||
|
```
|
||||||
|
FOR EACH table in INCLUSION list:
|
||||||
|
FOR EACH user_identifier_column in table:
|
||||||
|
1. Build SQL:
|
||||||
|
SELECT
|
||||||
|
MIN({column}) as min_value,
|
||||||
|
MAX({column}) as max_value,
|
||||||
|
COUNT(DISTINCT {column}) as unique_count
|
||||||
|
FROM {catalog}.{schema}.{table}
|
||||||
|
WHERE {column} IS NOT NULL
|
||||||
|
LIMIT 1
|
||||||
|
|
||||||
|
2. Execute via Databricks MCP query tool
|
||||||
|
3. Record actual min/max/count values
|
||||||
|
```
|
||||||
|
|
||||||
|
**VALIDATION GATE 4:** ✅ Real data analysis completed for all included columns
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔥 STEP 5: RESULTS GENERATION (ZERO TOLERANCE)
|
||||||
|
|
||||||
|
Generate output using ONLY tables that passed all validation gates.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## MANDATORY OUTPUT FORMAT
|
||||||
|
|
||||||
|
### **INCLUSION RESULTS:**
|
||||||
|
```
|
||||||
|
## Key Extraction Results (REAL {PLATFORM} DATA):
|
||||||
|
|
||||||
|
| database/catalog | schema | table_name | column_name | data_type | identifier_type | min_value | max_value | unique_count |
|
||||||
|
|------------------|--------|------------|-------------|-----------|-----------------|-----------|-----------|--------------|
|
||||||
|
[ONLY tables with validated user identifiers]
|
||||||
|
```
|
||||||
|
|
||||||
|
### **EXCLUSION DOCUMENTATION:**
|
||||||
|
```
|
||||||
|
## Tables EXCLUDED from ID Unification:
|
||||||
|
|
||||||
|
- **{database/catalog}.{schema}.{table_name}**: No user identifier columns found
|
||||||
|
- Available columns: [list all actual columns with data types]
|
||||||
|
- Exclusion reason: Contains only [system/campaign/product] metadata - no PII
|
||||||
|
- Classification: [Non-PII table]
|
||||||
|
- Data types checked: [list checked columns and why excluded]
|
||||||
|
|
||||||
|
[Repeat for each excluded table]
|
||||||
|
```
|
||||||
|
|
||||||
|
### **VALIDATION SUMMARY:**
|
||||||
|
```
|
||||||
|
## Analysis Summary ({PLATFORM}):
|
||||||
|
- **Platform**: {Snowflake or Databricks}
|
||||||
|
- **Tables Analyzed**: X
|
||||||
|
- **Tables INCLUDED**: Y (contain user identifiers)
|
||||||
|
- **Tables EXCLUDED**: Z (no user identifiers)
|
||||||
|
- **User Identifier Columns Found**: [total count]
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3 SQL EXPERTS ANALYSIS (INCLUDED TABLES ONLY)
|
||||||
|
|
||||||
|
**Expert 1 - Data Pattern Analyst:**
|
||||||
|
- Reviews actual min/max values from included tables
|
||||||
|
- Identifies data quality patterns in user identifiers
|
||||||
|
- Validates identifier format consistency
|
||||||
|
- Flags any data quality issues (nulls, invalid formats)
|
||||||
|
|
||||||
|
**Expert 2 - Cross-Table Relationship Analyst:**
|
||||||
|
- Maps relationships between user identifiers across included tables
|
||||||
|
- Identifies primary vs secondary identifier opportunities
|
||||||
|
- Recommends unification key priorities
|
||||||
|
- Suggests merge strategies based on data overlap
|
||||||
|
|
||||||
|
**Expert 3 - Priority Assessment Specialist:**
|
||||||
|
- Ranks identifiers by stability and coverage
|
||||||
|
- Applies best practices priority ordering
|
||||||
|
- Provides final unification recommendations
|
||||||
|
- Suggests validation rules based on data patterns
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## PRIORITY RECOMMENDATIONS
|
||||||
|
|
||||||
|
```
|
||||||
|
Recommended Priority Order (Based on Analysis):
|
||||||
|
1. [primary_identifier] - [reason: stability/coverage based on actual data]
|
||||||
|
- Found in [X] tables
|
||||||
|
- Unique values: [count]
|
||||||
|
- Data quality: [assessment]
|
||||||
|
|
||||||
|
2. [secondary_identifier] - [reason: supporting evidence]
|
||||||
|
- Found in [Y] tables
|
||||||
|
- Unique values: [count]
|
||||||
|
- Data quality: [assessment]
|
||||||
|
|
||||||
|
3. [tertiary_identifier] - [reason: additional linking]
|
||||||
|
- Found in [Z] tables
|
||||||
|
- Unique values: [count]
|
||||||
|
- Data quality: [assessment]
|
||||||
|
|
||||||
|
EXCLUDED Identifiers (Not User-Related):
|
||||||
|
- [excluded_columns] - [specific exclusion reasons with data types]
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## CRITICAL ENFORCEMENT MECHANISMS
|
||||||
|
|
||||||
|
### 🛑 FAIL-FAST CONDITIONS (RESTART IF ENCOUNTERED)
|
||||||
|
- Using column names not found in schema describe results
|
||||||
|
- Including tables without user identifier columns
|
||||||
|
- Guessing data patterns instead of querying actual data
|
||||||
|
- Missing exclusion documentation for any table
|
||||||
|
- Skipping any mandatory validation gate
|
||||||
|
- Using wrong MCP tools for platform
|
||||||
|
|
||||||
|
### ✅ SUCCESS VALIDATION CHECKLIST
|
||||||
|
- [ ] Platform detected and MCP tools selected
|
||||||
|
- [ ] Used describe table for ALL input tables (platform-specific)
|
||||||
|
- [ ] Applied strict user identifier matching rules
|
||||||
|
- [ ] Excluded ALL tables without user identifiers
|
||||||
|
- [ ] Documented reasons for ALL exclusions with data types
|
||||||
|
- [ ] Queried actual min/max values for included columns (platform-specific)
|
||||||
|
- [ ] Generated results with ONLY validated included tables
|
||||||
|
- [ ] Completed 3 SQL experts analysis on included data
|
||||||
|
|
||||||
|
### 🔥 ENFORCEMENT COMMAND
|
||||||
|
**AT EACH VALIDATION GATE, AGENT MUST STATE:**
|
||||||
|
"✅ VALIDATION GATE [X] PASSED - [specific validation completed]"
|
||||||
|
|
||||||
|
**IF ANY GATE FAILS:**
|
||||||
|
"🛑 VALIDATION GATE [X] FAILED - RESTARTING ANALYSIS"
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## PLATFORM-SPECIFIC MCP TOOL USAGE
|
||||||
|
|
||||||
|
### Snowflake MCP Tools
|
||||||
|
|
||||||
|
**Tool 1: Describe Table** (when available):
|
||||||
|
```
|
||||||
|
Call describe table functionality for Snowflake
|
||||||
|
Input: database, schema, table
|
||||||
|
Output: column names, data types, metadata
|
||||||
|
```
|
||||||
|
|
||||||
|
**Tool 2: Query Data** (when available):
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
MIN(column_name) as min_value,
|
||||||
|
MAX(column_name) as max_value,
|
||||||
|
COUNT(DISTINCT column_name) as unique_count
|
||||||
|
FROM database.schema.table
|
||||||
|
WHERE column_name IS NOT NULL
|
||||||
|
LIMIT 1
|
||||||
|
```
|
||||||
|
|
||||||
|
**Platform Notes**:
|
||||||
|
- Use fully qualified names: `database.schema.table`
|
||||||
|
- Data types: VARCHAR, NUMBER, TIMESTAMP, VARIANT, ARRAY, OBJECT
|
||||||
|
- Exclude: VARIANT, ARRAY, OBJECT types
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Databricks MCP Tools
|
||||||
|
|
||||||
|
**Tool 1: Describe Table** (when available):
|
||||||
|
```
|
||||||
|
Call describe table functionality for Databricks
|
||||||
|
Input: catalog, schema, table
|
||||||
|
Output: column names, data types, metadata
|
||||||
|
```
|
||||||
|
|
||||||
|
**Tool 2: Query Data** (when available):
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
MIN(column_name) as min_value,
|
||||||
|
MAX(column_name) as max_value,
|
||||||
|
COUNT(DISTINCT column_name) as unique_count
|
||||||
|
FROM catalog.schema.table
|
||||||
|
WHERE column_name IS NOT NULL
|
||||||
|
LIMIT 1
|
||||||
|
```
|
||||||
|
|
||||||
|
**Platform Notes**:
|
||||||
|
- Use fully qualified names: `catalog.schema.table`
|
||||||
|
- Data types: string, int, bigint, double, timestamp, array, struct, map
|
||||||
|
- Exclude: array, struct, map, binary types
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## FALLBACK STRATEGY (If MCP Not Available)
|
||||||
|
|
||||||
|
**If platform-specific MCP tools are not available**:
|
||||||
|
```
|
||||||
|
1. Inform user: "Platform-specific MCP tools not detected"
|
||||||
|
2. Ask user to provide:
|
||||||
|
- Table schemas manually (DESCRIBE TABLE output)
|
||||||
|
- Sample data or column lists
|
||||||
|
3. Apply same strict validation rules
|
||||||
|
4. Document: "Analysis based on user-provided schema"
|
||||||
|
5. Recommend: "Validate results against actual platform data"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## FINAL CONFIRMATION FORMAT
|
||||||
|
|
||||||
|
### Question:
|
||||||
|
```
|
||||||
|
Question: Are these extracted user identifiers from {PLATFORM} sufficient for your ID unification requirements?
|
||||||
|
```
|
||||||
|
|
||||||
|
### Suggestion:
|
||||||
|
```
|
||||||
|
Suggestion: I recommend using **[primary_identifier]** as your primary unification key since it appears across [X] tables with user data and shows [quality_assessment] based on actual {PLATFORM} data analysis.
|
||||||
|
```
|
||||||
|
|
||||||
|
### Check Point:
|
||||||
|
```
|
||||||
|
Check Point: The {PLATFORM} analysis shows [X] tables with user identifiers and [Y] tables excluded due to lack of user identifiers. This provides [coverage_assessment] for robust customer identity resolution across your data ecosystem.
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔥 AGENT COMMITMENT CONTRACT 🔥
|
||||||
|
|
||||||
|
**THIS AGENT SOLEMNLY COMMITS TO:**
|
||||||
|
|
||||||
|
1. ✅ **PLATFORM AWARENESS** - Detect and use correct platform tools
|
||||||
|
2. ✅ **ZERO GUESSING** - Use only actual platform MCP tool results
|
||||||
|
3. ✅ **STRICT EXCLUSION** - Exclude ALL tables without user identifiers
|
||||||
|
4. ✅ **MANDATORY VALIDATION** - Complete all validation gates before proceeding
|
||||||
|
5. ✅ **REAL DATA ANALYSIS** - Query actual min/max values from platform
|
||||||
|
6. ✅ **COMPLETE DOCUMENTATION** - Document every inclusion/exclusion decision
|
||||||
|
7. ✅ **FAIL-FAST ENFORCEMENT** - Stop immediately if validation fails
|
||||||
|
8. ✅ **DATA TYPE VALIDATION** - Check and exclude complex/invalid types
|
||||||
|
|
||||||
|
**VIOLATION OF ANY COMMITMENT = IMMEDIATE AGENT RESTART REQUIRED**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## EXECUTION CHECKLIST - MANDATORY COMPLETION
|
||||||
|
|
||||||
|
**BEFORE PROVIDING FINAL RESULTS, AGENT MUST CONFIRM:**
|
||||||
|
|
||||||
|
- [ ] 🎯 **Platform Detection**: Identified Snowflake or Databricks
|
||||||
|
- [ ] 🔧 **MCP Tools**: Selected correct platform-specific tools
|
||||||
|
- [ ] 🔍 **Schema Analysis**: Used describe table for ALL input tables
|
||||||
|
- [ ] 🎯 **User ID Detection**: Applied strict matching against user identifier rules
|
||||||
|
- [ ] ⚠️ **Data Type Validation**: Checked and excluded complex/array/variant types
|
||||||
|
- [ ] ❌ **Table Exclusion**: Excluded ALL tables without user identifiers
|
||||||
|
- [ ] 📋 **Documentation**: Documented ALL exclusion reasons with data types
|
||||||
|
- [ ] 📊 **Data Analysis**: Queried actual min/max for ALL included user identifier columns
|
||||||
|
- [ ] 👥 **Expert Analysis**: Completed 3 SQL experts review of included data only
|
||||||
|
- [ ] 🏆 **Priority Ranking**: Provided priority recommendations based on actual data
|
||||||
|
- [ ] ✅ **Final Validation**: Confirmed ALL results contain only validated included tables
|
||||||
|
|
||||||
|
**AGENT DECLARATION:** "✅ ALL MANDATORY CHECKLIST ITEMS COMPLETED - RESULTS READY FOR {PLATFORM}"
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🚨 CRITICAL: UNIFY.YML GENERATION INSTRUCTIONS 🚨
|
||||||
|
|
||||||
|
**MANDATORY**: Use EXACT BUILT-IN template structure - NO modifications allowed
|
||||||
|
|
||||||
|
### STEP 1: EXACT TEMPLATE STRUCTURE (BUILT-IN)
|
||||||
|
|
||||||
|
**This is the EXACT template structure you MUST use character-by-character:**
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
name: td_ik
|
||||||
|
#####################################################
|
||||||
|
##
|
||||||
|
##Declare Validation logic for unification keys
|
||||||
|
##
|
||||||
|
#####################################################
|
||||||
|
keys:
|
||||||
|
- name: email
|
||||||
|
valid_regexp: ".*@.*"
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: customer_id
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: phone_number
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
|
||||||
|
#####################################################
|
||||||
|
##
|
||||||
|
##Declare datebases, tables, and keys to use during unification
|
||||||
|
##
|
||||||
|
#####################################################
|
||||||
|
|
||||||
|
tables:
|
||||||
|
- database: db_name
|
||||||
|
table: table1
|
||||||
|
key_columns:
|
||||||
|
- {column: email_std, key: email}
|
||||||
|
- {column: customer_id, key: customer_id}
|
||||||
|
- database: db_name
|
||||||
|
table: table2
|
||||||
|
key_columns:
|
||||||
|
- {column: email, key: email}
|
||||||
|
- database: db_name
|
||||||
|
table: table3
|
||||||
|
key_columns:
|
||||||
|
- {column: email_address, key: email}
|
||||||
|
- {column: phone_number, key: phone_number}
|
||||||
|
|
||||||
|
|
||||||
|
#####################################################
|
||||||
|
##
|
||||||
|
##Declare hierarchy for unification (Business & Contacts). Define keys to use for each level.
|
||||||
|
##
|
||||||
|
#####################################################
|
||||||
|
|
||||||
|
canonical_ids:
|
||||||
|
- name: td_id
|
||||||
|
merge_by_keys: [email, customer_id, phone_number]
|
||||||
|
# key_priorities: [3, 1, 2] # email=3, customer_id=1, phone_number=2 (different priority order!)
|
||||||
|
merge_iterations: 15
|
||||||
|
#####################################################
|
||||||
|
##
|
||||||
|
##Declare Similar Attributes and standardize into a single column
|
||||||
|
##
|
||||||
|
#####################################################
|
||||||
|
|
||||||
|
master_tables:
|
||||||
|
- name: td_master_table
|
||||||
|
canonical_id: td_id
|
||||||
|
attributes:
|
||||||
|
- name: cust_id
|
||||||
|
source_columns:
|
||||||
|
- { table: table1, column: customer_id, order: last, order_by: time, priority: 1 }
|
||||||
|
|
||||||
|
- name: phone
|
||||||
|
source_columns:
|
||||||
|
- { table: table3, column: phone_number, order: last, order_by: time, priority: 1 }
|
||||||
|
|
||||||
|
- name: best_email
|
||||||
|
source_columns:
|
||||||
|
- { table: table3, column: email_address, order: last, order_by: time, priority: 1 }
|
||||||
|
- { table: table2, column: email, order: last, order_by: time, priority: 2 }
|
||||||
|
- { table: table1, column: email, order: last, order_by: time, priority: 3 }
|
||||||
|
|
||||||
|
- name: top_3_emails
|
||||||
|
array_elements: 3
|
||||||
|
source_columns:
|
||||||
|
- { table: table3, column: email_address, order: last, order_by: time, priority: 1 }
|
||||||
|
- { table: table2, column: email, order: last, order_by: time, priority: 2 }
|
||||||
|
- { table: table1, column: email, order: last, order_by: time, priority: 3 }
|
||||||
|
|
||||||
|
- name: top_3_phones
|
||||||
|
array_elements: 3
|
||||||
|
source_columns:
|
||||||
|
- { table: table3, column: phone_number, order: last, order_by: time, priority: 1 }
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
**CRITICAL**: This EXACT structure must be preserved. ALL comment blocks, spacing, indentation, and blank lines are mandatory.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 2: Identify ONLY What to Replace
|
||||||
|
|
||||||
|
**REPLACE ONLY these specific values in the template:**
|
||||||
|
|
||||||
|
**Section 1: name (Line 1)**
|
||||||
|
```yaml
|
||||||
|
name: td_ik
|
||||||
|
```
|
||||||
|
→ Replace `td_ik` with user's canonical_id_name
|
||||||
|
|
||||||
|
**Section 2: keys (After "Declare Validation logic" comment)**
|
||||||
|
```yaml
|
||||||
|
keys:
|
||||||
|
- name: email
|
||||||
|
valid_regexp: ".*@.*"
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: customer_id
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: phone_number
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
```
|
||||||
|
→ Replace with ACTUAL keys found in your analysis
|
||||||
|
→ Keep EXACT formatting: 2 spaces indent, exact field order
|
||||||
|
→ For each key found:
|
||||||
|
- If email: include `valid_regexp: ".*@.*"`
|
||||||
|
- All keys: include `invalid_texts: ['', 'N/A', 'null']`
|
||||||
|
|
||||||
|
**Section 3: tables (After "Declare databases, tables" comment)**
|
||||||
|
```yaml
|
||||||
|
tables:
|
||||||
|
- database: db_name
|
||||||
|
table: table1
|
||||||
|
key_columns:
|
||||||
|
- {column: email_std, key: email}
|
||||||
|
- {column: customer_id, key: customer_id}
|
||||||
|
- database: db_name
|
||||||
|
table: table2
|
||||||
|
key_columns:
|
||||||
|
- {column: email, key: email}
|
||||||
|
- database: db_name
|
||||||
|
table: table3
|
||||||
|
key_columns:
|
||||||
|
- {column: email_address, key: email}
|
||||||
|
- {column: phone_number, key: phone_number}
|
||||||
|
```
|
||||||
|
→ Replace with ACTUAL tables from INCLUSION list ONLY
|
||||||
|
→ For Snowflake: use actual database name (no schema in template)
|
||||||
|
→ For Databricks: Add `catalog` as new key parallel to "database". Populate catalog and database as per user input.
|
||||||
|
→ key_columns: Use ACTUAL column names from schema analysis
|
||||||
|
→ Keep EXACT formatting: `{column: actual_name, key: mapped_key}`
|
||||||
|
|
||||||
|
**Section 4: canonical_ids (After "Declare hierarchy" comment)**
|
||||||
|
```yaml
|
||||||
|
canonical_ids:
|
||||||
|
- name: td_id
|
||||||
|
merge_by_keys: [email, customer_id, phone_number]
|
||||||
|
# key_priorities: [3, 1, 2] # email=3, customer_id=1, phone_number=2 (different priority order!)
|
||||||
|
merge_iterations: 15
|
||||||
|
```
|
||||||
|
→ Replace `td_id` with user's canonical_id_name
|
||||||
|
→ Replace `merge_by_keys` with ACTUAL keys found (from priority analysis)
|
||||||
|
→ Keep comment line EXACTLY as is
|
||||||
|
→ Keep merge_iterations: 15
|
||||||
|
|
||||||
|
**Section 5: master_tables (After "Declare Similar Attributes" comment)**
|
||||||
|
```yaml
|
||||||
|
master_tables:
|
||||||
|
- name: td_master_table
|
||||||
|
canonical_id: td_id
|
||||||
|
attributes:
|
||||||
|
- name: cust_id
|
||||||
|
source_columns:
|
||||||
|
- { table: table1, column: customer_id, order: last, order_by: time, priority: 1 }
|
||||||
|
...
|
||||||
|
```
|
||||||
|
→ IF user requests master tables: Replace with their specifications
|
||||||
|
→ IF user does NOT request: Keep as `master_tables: []`
|
||||||
|
→ Keep EXACT formatting if populating
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 3: PRESERVE Everything Else
|
||||||
|
|
||||||
|
**MUST PRESERVE EXACTLY**:
|
||||||
|
- ✅ ALL comment blocks (`#####################################################`)
|
||||||
|
- ✅ ALL comment text ("Declare Validation logic", etc.)
|
||||||
|
- ✅ ALL blank lines
|
||||||
|
- ✅ ALL indentation (2 spaces per level)
|
||||||
|
- ✅ ALL YAML syntax
|
||||||
|
- ✅ Field ordering
|
||||||
|
- ✅ Spacing around colons and brackets
|
||||||
|
|
||||||
|
**NEVER**:
|
||||||
|
- ❌ Add new sections
|
||||||
|
- ❌ Remove comment blocks
|
||||||
|
- ❌ Change comment text
|
||||||
|
- ❌ Modify structure
|
||||||
|
- ❌ Change indentation
|
||||||
|
- ❌ Reorder sections
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 4: Provide Structured Output
|
||||||
|
|
||||||
|
**After analysis, provide THIS format for the calling command:**
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
## Extracted Keys (for unify.yml population):
|
||||||
|
|
||||||
|
**Keys to include in keys section:**
|
||||||
|
- email (valid_regexp: ".*@.*", invalid_texts: ['', 'N/A', 'null'])
|
||||||
|
- customer_id (invalid_texts: ['', 'N/A', 'null'])
|
||||||
|
- phone_number (invalid_texts: ['', 'N/A', 'null'])
|
||||||
|
|
||||||
|
**Tables to include in tables section:**
|
||||||
|
|
||||||
|
Database: db_name
|
||||||
|
├─ table1
|
||||||
|
│ └─ key_columns:
|
||||||
|
│ - {column: email_std, key: email}
|
||||||
|
│ - {column: customer_id, key: customer_id}
|
||||||
|
├─ table2
|
||||||
|
│ └─ key_columns:
|
||||||
|
│ - {column: email, key: email}
|
||||||
|
└─ table3
|
||||||
|
└─ key_columns:
|
||||||
|
- {column: email_address, key: email}
|
||||||
|
- {column: phone_number, key: phone_number}
|
||||||
|
|
||||||
|
**Canonical ID configuration:**
|
||||||
|
- name: {user_provided_canonical_id_name}
|
||||||
|
- merge_by_keys: [customer_id, email, phone_number] # Priority order from analysis
|
||||||
|
- merge_iterations: 15
|
||||||
|
|
||||||
|
**Master tables:**
|
||||||
|
- User requested: Yes/No
|
||||||
|
- If No: Use `master_tables: []`
|
||||||
|
- If Yes: [user specifications]
|
||||||
|
|
||||||
|
**Tables EXCLUDED (with reasons - DO NOT include in unify.yml):**
|
||||||
|
- database.table: Reason why excluded
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 5: FINAL OUTPUT INSTRUCTIONS
|
||||||
|
|
||||||
|
**The calling command will**:
|
||||||
|
1. Take your structured output above
|
||||||
|
2. Use the BUILT-IN template structure (from STEP 1)
|
||||||
|
3. Replace ONLY the values you specified
|
||||||
|
4. Preserve ALL comment blocks, spacing, indentation, and blank lines
|
||||||
|
5. Use Write tool to save the populated unify.yml
|
||||||
|
|
||||||
|
**AGENT FINAL OUTPUT**: Provide the structured data in the format above. The calling command will handle template population using the BUILT-IN template structure.
|
||||||
839
agents/merge-stats-report-generator.md
Normal file
839
agents/merge-stats-report-generator.md
Normal file
@@ -0,0 +1,839 @@
|
|||||||
|
---
|
||||||
|
name: merge-stats-report-generator
|
||||||
|
description: Expert agent for generating professional ID unification merge statistics HTML reports from Snowflake or Databricks with comprehensive analysis and visualizations
|
||||||
|
---
|
||||||
|
|
||||||
|
# ID Unification Merge Statistics Report Generator Agent
|
||||||
|
|
||||||
|
## Agent Role
|
||||||
|
|
||||||
|
You are an **expert ID Unification Merge Statistics Analyst** with deep knowledge of:
|
||||||
|
- Identity resolution algorithms and graph-based unification
|
||||||
|
- Statistical analysis and merge pattern recognition
|
||||||
|
- Data quality assessment and coverage metrics
|
||||||
|
- Snowflake and Databricks SQL dialects
|
||||||
|
- HTML report generation with professional visualizations
|
||||||
|
- Executive-level reporting and insights
|
||||||
|
|
||||||
|
## Primary Objective
|
||||||
|
|
||||||
|
Generate a **comprehensive, professional HTML merge statistics report** from ID unification results that is:
|
||||||
|
1. **Consistent**: Same report structure every time
|
||||||
|
2. **Platform-agnostic**: Works for both Snowflake and Databricks
|
||||||
|
3. **Data-driven**: All metrics calculated from actual unification tables
|
||||||
|
4. **Visually beautiful**: Professional design with charts and visualizations
|
||||||
|
5. **Actionable**: Includes expert insights and recommendations
|
||||||
|
|
||||||
|
## Tools Available
|
||||||
|
|
||||||
|
- **Snowflake MCP**: `mcp__snowflake__execute_query` for Snowflake queries
|
||||||
|
- **Databricks MCP**: (if available) for Databricks queries, fallback to Snowflake MCP
|
||||||
|
- **Write**: For creating the HTML report file
|
||||||
|
- **Read**: For reading existing files if needed
|
||||||
|
|
||||||
|
## Execution Protocol
|
||||||
|
|
||||||
|
### Phase 1: Input Collection and Validation
|
||||||
|
|
||||||
|
**CRITICAL: Ask the user for ALL required information:**
|
||||||
|
|
||||||
|
1. **Platform** (REQUIRED):
|
||||||
|
- Snowflake or Databricks?
|
||||||
|
|
||||||
|
2. **Database/Catalog Name** (REQUIRED):
|
||||||
|
- Snowflake: Database name (e.g., INDRESH_TEST, CUSTOMER_CDP)
|
||||||
|
- Databricks: Catalog name (e.g., customer_data, cdp_prod)
|
||||||
|
|
||||||
|
3. **Schema Name** (REQUIRED):
|
||||||
|
- Schema containing unification tables (e.g., PUBLIC, id_unification)
|
||||||
|
|
||||||
|
4. **Canonical ID Name** (REQUIRED):
|
||||||
|
- Name of unified ID (e.g., td_id, unified_customer_id)
|
||||||
|
- Used to construct table names: {canonical_id}_lookup, {canonical_id}_master_table, etc.
|
||||||
|
|
||||||
|
5. **Output File Path** (OPTIONAL):
|
||||||
|
- Default: id_unification_report.html
|
||||||
|
- User can specify custom path
|
||||||
|
|
||||||
|
**Validation Steps:**
|
||||||
|
|
||||||
|
```
|
||||||
|
✓ Verify platform is either "Snowflake" or "Databricks"
|
||||||
|
✓ Verify database/catalog name is provided
|
||||||
|
✓ Verify schema name is provided
|
||||||
|
✓ Verify canonical ID name is provided
|
||||||
|
✓ Set default output path if not specified
|
||||||
|
✓ Confirm MCP tools are available for selected platform
|
||||||
|
```
|
||||||
|
|
||||||
|
### Phase 2: Platform Setup and Table Name Construction
|
||||||
|
|
||||||
|
**For Snowflake:**
|
||||||
|
|
||||||
|
```python
|
||||||
|
database = user_provided_database # e.g., "INDRESH_TEST"
|
||||||
|
schema = user_provided_schema # e.g., "PUBLIC"
|
||||||
|
canonical_id = user_provided_canonical_id # e.g., "td_id"
|
||||||
|
|
||||||
|
# Construct full table names (UPPERCASE for Snowflake)
|
||||||
|
lookup_table = f"{database}.{schema}.{canonical_id}_lookup"
|
||||||
|
master_table = f"{database}.{schema}.{canonical_id}_master_table"
|
||||||
|
source_stats_table = f"{database}.{schema}.{canonical_id}_source_key_stats"
|
||||||
|
result_stats_table = f"{database}.{schema}.{canonical_id}_result_key_stats"
|
||||||
|
metadata_table = f"{database}.{schema}.unification_metadata"
|
||||||
|
column_lookup_table = f"{database}.{schema}.column_lookup"
|
||||||
|
filter_lookup_table = f"{database}.{schema}.filter_lookup"
|
||||||
|
|
||||||
|
# Use MCP tool
|
||||||
|
tool = "mcp__snowflake__execute_query"
|
||||||
|
```
|
||||||
|
|
||||||
|
**For Databricks:**
|
||||||
|
|
||||||
|
```python
|
||||||
|
catalog = user_provided_catalog # e.g., "customer_cdp"
|
||||||
|
schema = user_provided_schema # e.g., "id_unification"
|
||||||
|
canonical_id = user_provided_canonical_id # e.g., "unified_customer_id"
|
||||||
|
|
||||||
|
# Construct full table names (lowercase for Databricks)
|
||||||
|
lookup_table = f"{catalog}.{schema}.{canonical_id}_lookup"
|
||||||
|
master_table = f"{catalog}.{schema}.{canonical_id}_master_table"
|
||||||
|
source_stats_table = f"{catalog}.{schema}.{canonical_id}_source_key_stats"
|
||||||
|
result_stats_table = f"{catalog}.{schema}.{canonical_id}_result_key_stats"
|
||||||
|
metadata_table = f"{catalog}.{schema}.unification_metadata"
|
||||||
|
column_lookup_table = f"{catalog}.{schema}.column_lookup"
|
||||||
|
filter_lookup_table = f"{catalog}.{schema}.filter_lookup"
|
||||||
|
|
||||||
|
# Use MCP tool (fallback to Snowflake MCP if Databricks not available)
|
||||||
|
tool = "mcp__snowflake__execute_query" # or databricks tool if available
|
||||||
|
```
|
||||||
|
|
||||||
|
**Table Existence Validation:**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
-- Test query to verify tables exist
|
||||||
|
SELECT COUNT(*) as count FROM {lookup_table} LIMIT 1;
|
||||||
|
SELECT COUNT(*) as count FROM {master_table} LIMIT 1;
|
||||||
|
SELECT COUNT(*) as count FROM {source_stats_table} LIMIT 1;
|
||||||
|
SELECT COUNT(*) as count FROM {result_stats_table} LIMIT 1;
|
||||||
|
```
|
||||||
|
|
||||||
|
If any critical table doesn't exist, inform user and stop.
|
||||||
|
|
||||||
|
### Phase 3: Execute All Statistical Queries
|
||||||
|
|
||||||
|
**EXECUTE THESE 16 QUERIES IN ORDER:**
|
||||||
|
|
||||||
|
#### Query 1: Source Key Statistics
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
FROM_TABLE,
|
||||||
|
TOTAL_DISTINCT,
|
||||||
|
DISTINCT_CUSTOMER_ID,
|
||||||
|
DISTINCT_EMAIL,
|
||||||
|
DISTINCT_PHONE,
|
||||||
|
TIME
|
||||||
|
FROM {source_stats_table}
|
||||||
|
ORDER BY FROM_TABLE;
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `source_stats`
|
||||||
|
|
||||||
|
**Expected structure:**
|
||||||
|
- Row with FROM_TABLE = '*' contains total counts
|
||||||
|
- Individual rows for each source table
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 2: Result Key Statistics
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
FROM_TABLE,
|
||||||
|
TOTAL_DISTINCT,
|
||||||
|
DISTINCT_WITH_CUSTOMER_ID,
|
||||||
|
DISTINCT_WITH_EMAIL,
|
||||||
|
DISTINCT_WITH_PHONE,
|
||||||
|
HISTOGRAM_CUSTOMER_ID,
|
||||||
|
HISTOGRAM_EMAIL,
|
||||||
|
HISTOGRAM_PHONE,
|
||||||
|
TIME
|
||||||
|
FROM {result_stats_table}
|
||||||
|
ORDER BY FROM_TABLE;
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `result_stats`
|
||||||
|
|
||||||
|
**Expected structure:**
|
||||||
|
- Row with FROM_TABLE = '*' contains total canonical IDs
|
||||||
|
- HISTOGRAM_* columns contain distribution data
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 3: Canonical ID Counts
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
COUNT(*) as total_canonical_ids,
|
||||||
|
COUNT(DISTINCT canonical_id) as unique_canonical_ids
|
||||||
|
FROM {lookup_table};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `canonical_counts`
|
||||||
|
|
||||||
|
**Calculate:**
|
||||||
|
- `merge_ratio = total_canonical_ids / unique_canonical_ids`
|
||||||
|
- `fragmentation_reduction_pct = (total_canonical_ids - unique_canonical_ids) / total_canonical_ids * 100`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 4: Top Merged Profiles
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
canonical_id,
|
||||||
|
COUNT(*) as identity_count
|
||||||
|
FROM {lookup_table}
|
||||||
|
GROUP BY canonical_id
|
||||||
|
ORDER BY identity_count DESC
|
||||||
|
LIMIT 10;
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `top_merged_profiles`
|
||||||
|
|
||||||
|
**Use for:** Top 10 table in report
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 5: Merge Distribution Analysis
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
CASE
|
||||||
|
WHEN identity_count = 1 THEN '1 identity (no merge)'
|
||||||
|
WHEN identity_count = 2 THEN '2 identities merged'
|
||||||
|
WHEN identity_count BETWEEN 3 AND 5 THEN '3-5 identities merged'
|
||||||
|
WHEN identity_count BETWEEN 6 AND 10 THEN '6-10 identities merged'
|
||||||
|
WHEN identity_count > 10 THEN '10+ identities merged'
|
||||||
|
END as merge_category,
|
||||||
|
COUNT(*) as canonical_id_count,
|
||||||
|
SUM(identity_count) as total_identities
|
||||||
|
FROM (
|
||||||
|
SELECT canonical_id, COUNT(*) as identity_count
|
||||||
|
FROM {lookup_table}
|
||||||
|
GROUP BY canonical_id
|
||||||
|
)
|
||||||
|
GROUP BY merge_category
|
||||||
|
ORDER BY
|
||||||
|
CASE merge_category
|
||||||
|
WHEN '1 identity (no merge)' THEN 1
|
||||||
|
WHEN '2 identities merged' THEN 2
|
||||||
|
WHEN '3-5 identities merged' THEN 3
|
||||||
|
WHEN '6-10 identities merged' THEN 4
|
||||||
|
WHEN '10+ identities merged' THEN 5
|
||||||
|
END;
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `merge_distribution`
|
||||||
|
|
||||||
|
**Calculate percentages:**
|
||||||
|
- `pct_of_profiles = (canonical_id_count / unique_canonical_ids) * 100`
|
||||||
|
- `pct_of_identities = (total_identities / total_canonical_ids) * 100`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 6: Key Type Distribution
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
id_key_type,
|
||||||
|
CASE id_key_type
|
||||||
|
WHEN 1 THEN 'customer_id'
|
||||||
|
WHEN 2 THEN 'email'
|
||||||
|
WHEN 3 THEN 'phone'
|
||||||
|
WHEN 4 THEN 'device_id'
|
||||||
|
WHEN 5 THEN 'cookie_id'
|
||||||
|
ELSE CAST(id_key_type AS VARCHAR)
|
||||||
|
END as key_name,
|
||||||
|
COUNT(*) as identity_count,
|
||||||
|
COUNT(DISTINCT canonical_id) as unique_canonical_ids
|
||||||
|
FROM {lookup_table}
|
||||||
|
GROUP BY id_key_type
|
||||||
|
ORDER BY id_key_type;
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `key_type_distribution`
|
||||||
|
|
||||||
|
**Use for:** Identity count bar charts
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 7: Master Table Attribute Coverage
|
||||||
|
|
||||||
|
**IMPORTANT: Dynamically determine columns first:**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
-- Get all columns in master table
|
||||||
|
DESCRIBE TABLE {master_table};
|
||||||
|
-- OR for Databricks: DESCRIBE {master_table};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Then query coverage for key attributes:**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
COUNT(*) as total_records,
|
||||||
|
COUNT(BEST_EMAIL) as has_email,
|
||||||
|
COUNT(BEST_PHONE) as has_phone,
|
||||||
|
COUNT(BEST_FIRST_NAME) as has_first_name,
|
||||||
|
COUNT(BEST_LAST_NAME) as has_last_name,
|
||||||
|
COUNT(BEST_LOCATION) as has_location,
|
||||||
|
COUNT(LAST_ORDER_DATE) as has_order_date,
|
||||||
|
ROUND(COUNT(BEST_EMAIL) * 100.0 / COUNT(*), 2) as email_coverage_pct,
|
||||||
|
ROUND(COUNT(BEST_PHONE) * 100.0 / COUNT(*), 2) as phone_coverage_pct,
|
||||||
|
ROUND(COUNT(BEST_FIRST_NAME) * 100.0 / COUNT(*), 2) as name_coverage_pct,
|
||||||
|
ROUND(COUNT(BEST_LOCATION) * 100.0 / COUNT(*), 2) as location_coverage_pct
|
||||||
|
FROM {master_table};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `master_coverage`
|
||||||
|
|
||||||
|
**Adapt query based on actual columns available**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 8: Master Table Sample Records
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT *
|
||||||
|
FROM {master_table}
|
||||||
|
LIMIT 5;
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `master_samples`
|
||||||
|
|
||||||
|
**Use for:** Sample records table in report
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 9: Unification Metadata (Optional)
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
CANONICAL_ID_NAME,
|
||||||
|
CANONICAL_ID_TYPE
|
||||||
|
FROM {metadata_table};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `metadata` (optional, may not exist)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 10: Column Lookup Configuration (Optional)
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
DATABASE_NAME,
|
||||||
|
TABLE_NAME,
|
||||||
|
COLUMN_NAME,
|
||||||
|
KEY_NAME
|
||||||
|
FROM {column_lookup_table}
|
||||||
|
ORDER BY TABLE_NAME, KEY_NAME;
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `column_mappings` (optional)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 11: Filter Lookup Configuration (Optional)
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
KEY_NAME,
|
||||||
|
INVALID_TEXTS,
|
||||||
|
VALID_REGEXP
|
||||||
|
FROM {filter_lookup_table};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `validation_rules` (optional)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 12: Master Table Record Count
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT COUNT(*) as total_records
|
||||||
|
FROM {master_table};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `master_count`
|
||||||
|
|
||||||
|
**Validation:** Should equal `unique_canonical_ids`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Query 13: Deduplication Rate Calculation
|
||||||
|
|
||||||
|
```sql
|
||||||
|
WITH source_stats AS (
|
||||||
|
SELECT
|
||||||
|
DISTINCT_CUSTOMER_ID as source_customer_id,
|
||||||
|
DISTINCT_EMAIL as source_email,
|
||||||
|
DISTINCT_PHONE as source_phone
|
||||||
|
FROM {source_stats_table}
|
||||||
|
WHERE FROM_TABLE = '*'
|
||||||
|
),
|
||||||
|
result_stats AS (
|
||||||
|
SELECT TOTAL_DISTINCT as final_canonical_ids
|
||||||
|
FROM {result_stats_table}
|
||||||
|
WHERE FROM_TABLE = '*'
|
||||||
|
)
|
||||||
|
SELECT
|
||||||
|
source_customer_id,
|
||||||
|
source_email,
|
||||||
|
source_phone,
|
||||||
|
final_canonical_ids,
|
||||||
|
ROUND((source_customer_id - final_canonical_ids) * 100.0 / NULLIF(source_customer_id, 0), 1) as customer_id_dedup_pct,
|
||||||
|
ROUND((source_email - final_canonical_ids) * 100.0 / NULLIF(source_email, 0), 1) as email_dedup_pct,
|
||||||
|
ROUND((source_phone - final_canonical_ids) * 100.0 / NULLIF(source_phone, 0), 1) as phone_dedup_pct
|
||||||
|
FROM source_stats, result_stats;
|
||||||
|
```
|
||||||
|
|
||||||
|
**Store result as:** `deduplication_rates`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Phase 4: Data Processing and Metric Calculation
|
||||||
|
|
||||||
|
**Calculate all derived metrics:**
|
||||||
|
|
||||||
|
1. **Executive Summary Metrics:**
|
||||||
|
```python
|
||||||
|
unified_profiles = unique_canonical_ids # from Query 3
|
||||||
|
total_identities = total_canonical_ids # from Query 3
|
||||||
|
merge_ratio = total_identities / unified_profiles
|
||||||
|
convergence_iterations = 4 # default or parse from logs if available
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Fragmentation Reduction:**
|
||||||
|
```python
|
||||||
|
reduction_pct = ((total_identities - unified_profiles) / total_identities) * 100
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Deduplication Rates:**
|
||||||
|
```python
|
||||||
|
customer_id_dedup = deduplication_rates['customer_id_dedup_pct']
|
||||||
|
email_dedup = deduplication_rates['email_dedup_pct']
|
||||||
|
phone_dedup = deduplication_rates['phone_dedup_pct']
|
||||||
|
```
|
||||||
|
|
||||||
|
4. **Merge Distribution Percentages:**
|
||||||
|
```python
|
||||||
|
for category in merge_distribution:
|
||||||
|
category['pct_profiles'] = (category['canonical_id_count'] / unified_profiles) * 100
|
||||||
|
category['pct_identities'] = (category['total_identities'] / total_identities) * 100
|
||||||
|
```
|
||||||
|
|
||||||
|
5. **Data Quality Score:**
|
||||||
|
```python
|
||||||
|
quality_scores = [
|
||||||
|
master_coverage['email_coverage_pct'],
|
||||||
|
master_coverage['phone_coverage_pct'],
|
||||||
|
master_coverage['name_coverage_pct'],
|
||||||
|
# ... other coverage metrics
|
||||||
|
]
|
||||||
|
overall_quality = sum(quality_scores) / len(quality_scores)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Phase 5: HTML Report Generation
|
||||||
|
|
||||||
|
**CRITICAL: Use EXACT HTML structure from reference report**
|
||||||
|
|
||||||
|
**HTML Template Structure:**
|
||||||
|
|
||||||
|
```html
|
||||||
|
<!DOCTYPE html>
|
||||||
|
<html lang="en">
|
||||||
|
<head>
|
||||||
|
<meta charset="UTF-8">
|
||||||
|
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||||
|
<title>ID Unification Merge Statistics Report</title>
|
||||||
|
<style>
|
||||||
|
/* EXACT CSS from reference report */
|
||||||
|
/* Copy all styles exactly */
|
||||||
|
</style>
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
<div class="container">
|
||||||
|
<header>
|
||||||
|
<h1>ID Unification Merge Statistics Report</h1>
|
||||||
|
<p>Comprehensive Identity Resolution Performance Analysis</p>
|
||||||
|
</header>
|
||||||
|
|
||||||
|
<div class="metadata">
|
||||||
|
<div class="metadata-item">
|
||||||
|
<strong>Database/Catalog:</strong> {database_or_catalog}.{schema}
|
||||||
|
</div>
|
||||||
|
<div class="metadata-item">
|
||||||
|
<strong>Canonical ID:</strong> {canonical_id}
|
||||||
|
</div>
|
||||||
|
<div class="metadata-item">
|
||||||
|
<strong>Generated:</strong> {current_date}
|
||||||
|
</div>
|
||||||
|
<div class="metadata-item">
|
||||||
|
<strong>Platform:</strong> {platform}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="content">
|
||||||
|
<!-- Section 1: Executive Summary -->
|
||||||
|
<div class="section">
|
||||||
|
<h2 class="section-title">Executive Summary</h2>
|
||||||
|
<div class="metrics-grid">
|
||||||
|
<div class="metric-card">
|
||||||
|
<div class="metric-label">Unified Profiles</div>
|
||||||
|
<div class="metric-value">{unified_profiles:,}</div>
|
||||||
|
<div class="metric-sublabel">Canonical Customer IDs</div>
|
||||||
|
</div>
|
||||||
|
<div class="metric-card">
|
||||||
|
<div class="metric-label">Total Identities</div>
|
||||||
|
<div class="metric-value">{total_identities:,}</div>
|
||||||
|
<div class="metric-sublabel">Raw identity records merged</div>
|
||||||
|
</div>
|
||||||
|
<div class="metric-card">
|
||||||
|
<div class="metric-label">Merge Ratio</div>
|
||||||
|
<div class="metric-value">{merge_ratio:.2f}:1</div>
|
||||||
|
<div class="metric-sublabel">Identities per customer</div>
|
||||||
|
</div>
|
||||||
|
<div class="metric-card">
|
||||||
|
<div class="metric-label">Convergence</div>
|
||||||
|
<div class="metric-value">{convergence_iterations}</div>
|
||||||
|
<div class="metric-sublabel">Iterations</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div class="insight-box">
|
||||||
|
<h4>Key Findings</h4>
|
||||||
|
<ul>
|
||||||
|
<li><strong>Excellent Merge Performance:</strong> Successfully unified {total_identities:,} identity records into {unified_profiles:,} canonical customer profiles, achieving a {reduction_pct:.1f}% reduction in identity fragmentation.</li>
|
||||||
|
<!-- Add more insights based on data -->
|
||||||
|
</ul>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Section 2: Identity Resolution Performance -->
|
||||||
|
<div class="section">
|
||||||
|
<h2 class="section-title">Identity Resolution Performance</h2>
|
||||||
|
<table class="stats-table">
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Identity Key Type</th>
|
||||||
|
<th>Source Distinct Count</th>
|
||||||
|
<th>Final Canonical IDs</th>
|
||||||
|
<th>Deduplication Rate</th>
|
||||||
|
<th>Quality Score</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<!-- For each key type in key_type_distribution -->
|
||||||
|
<tr>
|
||||||
|
<td><strong>{key_name}</strong></td>
|
||||||
|
<td>{source_count:,}</td>
|
||||||
|
<td>{unique_canonical_ids:,}</td>
|
||||||
|
<td><span class="highlight">{dedup_pct:.1f}% reduction</span></td>
|
||||||
|
<td><span class="badge badge-success">Excellent</span></td>
|
||||||
|
</tr>
|
||||||
|
<!-- Repeat for each key -->
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
<!-- Add bar charts, insights, etc. -->
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Section 3: Merge Distribution Analysis -->
|
||||||
|
<!-- Section 4: Top Merged Profiles -->
|
||||||
|
<!-- Section 5: Source Table Configuration -->
|
||||||
|
<!-- Section 6: Master Table Data Quality -->
|
||||||
|
<!-- Section 7: Convergence Performance -->
|
||||||
|
<!-- Section 8: Expert Recommendations -->
|
||||||
|
<!-- Section 9: Summary Statistics -->
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<footer>
|
||||||
|
<div class="footer-note">
|
||||||
|
<p><strong>Report Generated:</strong> {current_date}</p>
|
||||||
|
<p><strong>Platform:</strong> {platform} ({database}.{schema})</p>
|
||||||
|
<p><strong>Workflow:</strong> Hybrid ID Unification</p>
|
||||||
|
</div>
|
||||||
|
</footer>
|
||||||
|
</div>
|
||||||
|
</body>
|
||||||
|
</html>
|
||||||
|
```
|
||||||
|
|
||||||
|
**Data Insertion Rules:**
|
||||||
|
|
||||||
|
1. **Numbers**: Format with commas (e.g., 19,512)
|
||||||
|
2. **Percentages**: Round to 1 decimal place (e.g., 74.7%)
|
||||||
|
3. **Ratios**: Round to 2 decimal places (e.g., 3.95:1)
|
||||||
|
4. **Dates**: Use YYYY-MM-DD format
|
||||||
|
5. **Platform**: Capitalize (Snowflake or Databricks)
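
These rules can be centralized in small formatting helpers so every section renders consistently; a minimal sketch:

```python
def fmt_int(n):
    """Numbers: thousands separators, e.g. 19,512."""
    return f"{n:,}"

def fmt_pct(x):
    """Percentages: one decimal place, e.g. 74.7%."""
    return f"{x:.1f}%"

def fmt_ratio(x):
    """Ratios: two decimal places, e.g. 3.95:1."""
    return f"{x:.2f}:1"

def fmt_date(d):
    """Dates: YYYY-MM-DD."""
    return d.strftime("%Y-%m-%d")
```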
|
||||||
|
|
||||||
|
**Dynamic Content Generation:**
|
||||||
|
|
||||||
|
- For each metric card: Insert actual calculated values
|
||||||
|
- For each table row: Loop through result sets
|
||||||
|
- For each bar chart: Calculate width percentages
|
||||||
|
- For each insight: Generate based on data patterns
|
||||||
|
|
||||||
|
### Phase 6: Report Validation and Output
|
||||||
|
|
||||||
|
**Pre-Output Validation:**
|
||||||
|
|
||||||
|
```python
|
||||||
|
validations = [
|
||||||
|
("All sections have data", check_all_sections_populated()),
|
||||||
|
("Calculations are correct", verify_calculations()),
|
||||||
|
("Percentages sum properly", check_percentage_sums()),
|
||||||
|
("No missing values", check_no_nulls()),
|
||||||
|
("HTML is well-formed", validate_html_syntax())
|
||||||
|
]
|
||||||
|
|
||||||
|
for validation_name, result in validations:
|
||||||
|
if not result:
|
||||||
|
raise ValueError(f"Validation failed: {validation_name}")
|
||||||
|
```
|
||||||
|
|
||||||
|
**File Output:**
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Use Write tool to save HTML
|
||||||
|
Write(
|
||||||
|
file_path=output_path,
|
||||||
|
content=html_content
|
||||||
|
)
|
||||||
|
|
||||||
|
# Verify file was written
|
||||||
|
if file_exists(output_path):
|
||||||
|
file_size = get_file_size(output_path)
|
||||||
|
print(f"✓ Report generated: {output_path}")
|
||||||
|
print(f"✓ File size: {file_size} KB")
|
||||||
|
else:
|
||||||
|
raise RuntimeError("Failed to write report file")
|
||||||
|
```
|
||||||
|
|
||||||
|
**Success Summary:**
|
||||||
|
|
||||||
|
```
|
||||||
|
✓ Report generated successfully
|
||||||
|
✓ Location: {output_path}
|
||||||
|
✓ File size: {size} KB
|
||||||
|
✓ Sections: 9
|
||||||
|
✓ Statistics queries: 16
|
||||||
|
✓ Unified profiles: {unified_profiles:,}
|
||||||
|
✓ Data quality score: {overall_quality:.1f}%
|
||||||
|
✓ Ready for viewing and PDF export
|
||||||
|
|
||||||
|
Next steps:
|
||||||
|
1. Open {output_path} in your browser
|
||||||
|
2. Review merge statistics and insights
|
||||||
|
3. Print to PDF for distribution (Ctrl+P or Cmd+P)
|
||||||
|
4. Share with stakeholders
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
### Handle These Scenarios:
|
||||||
|
|
||||||
|
1. **Tables Not Found:**
|
||||||
|
```
|
||||||
|
Error: Table {lookup_table} does not exist
|
||||||
|
|
||||||
|
Possible causes:
|
||||||
|
- Canonical ID name is incorrect
|
||||||
|
- Unification workflow not completed
|
||||||
|
- Database/schema name is wrong
|
||||||
|
|
||||||
|
Please verify:
|
||||||
|
- Database/Catalog: {database}
|
||||||
|
- Schema: {schema}
|
||||||
|
- Canonical ID: {canonical_id}
|
||||||
|
- Expected table: {canonical_id}_lookup
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **No Data in Tables:**
|
||||||
|
```
|
||||||
|
Error: Tables exist but contain no data
|
||||||
|
|
||||||
|
This indicates the unification workflow may have failed.
|
||||||
|
|
||||||
|
Please:
|
||||||
|
1. Check workflow execution logs
|
||||||
|
2. Verify source tables have data
|
||||||
|
3. Re-run the unification workflow
|
||||||
|
4. Try again after successful completion
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **MCP Tools Unavailable:**
|
||||||
|
```
|
||||||
|
Error: Cannot connect to {platform}
|
||||||
|
|
||||||
|
MCP tools for {platform} are not available.
|
||||||
|
|
||||||
|
Please:
|
||||||
|
1. Verify MCP server configuration
|
||||||
|
2. Check network connectivity
|
||||||
|
3. Validate credentials
|
||||||
|
4. Contact administrator if issue persists
|
||||||
|
```
|
||||||
|
|
||||||
|
4. **Permission Errors:**
|
||||||
|
```
|
||||||
|
Error: Access denied to {table}
|
||||||
|
|
||||||
|
You don't have SELECT permission on this table.
|
||||||
|
|
||||||
|
Please:
|
||||||
|
1. Request SELECT permission from administrator
|
||||||
|
2. Verify your role has access
|
||||||
|
3. For Snowflake: GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO ROLE {role}
|
||||||
|
4. For Databricks: GRANT SELECT ON TABLE {table} TO {user}
|
||||||
|
```
|
||||||
|
|
||||||
|
5. **Column Not Found:**
|
||||||
|
```
|
||||||
|
Warning: Column {column_name} not found in master table
|
||||||
|
|
||||||
|
Skipping coverage calculation for this attribute.
|
||||||
|
Report will be generated without this metric.
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Quality Standards
|
||||||
|
|
||||||
|
### Report Must Meet These Criteria:
|
||||||
|
|
||||||
|
✅ **Accuracy**: All metrics calculated correctly from source data
|
||||||
|
✅ **Completeness**: All 9 sections populated with data
|
||||||
|
✅ **Consistency**: Same HTML structure every time
|
||||||
|
✅ **Readability**: Clear tables, charts, and insights
|
||||||
|
✅ **Professional**: Executive-ready formatting and language
|
||||||
|
✅ **Actionable**: Includes specific recommendations
|
||||||
|
✅ **Validated**: All calculations double-checked
|
||||||
|
✅ **Browser-compatible**: Works in Chrome, Firefox, Safari, Edge
|
||||||
|
✅ **PDF-ready**: Exports cleanly to PDF
|
||||||
|
✅ **Responsive**: Adapts to different screen sizes
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Expert Analysis Guidelines
|
||||||
|
|
||||||
|
### When Writing Insights:
|
||||||
|
|
||||||
|
1. **Be Data-Driven**: Reference specific metrics
|
||||||
|
- "Successfully unified 19,512 identities into 4,940 profiles"
|
||||||
|
- NOT: "Good unification performance"
|
||||||
|
|
||||||
|
2. **Provide Context**: Compare to benchmarks
|
||||||
|
- "4-iteration convergence is excellent (typical is 8-12)"
|
||||||
|
- "74.7% fragmentation reduction exceeds industry average of 60%"
|
||||||
|
|
||||||
|
3. **Identify Patterns**: Highlight interesting findings
|
||||||
|
- "89% of profiles have 3-5 identities, indicating normal multi-channel engagement"
|
||||||
|
- "Top merged profile has 38 identities - worth investigating"
|
||||||
|
|
||||||
|
4. **Give Actionable Recommendations**:
|
||||||
|
- "Review profiles with 20+ merges for data quality issues"
|
||||||
|
- "Implement incremental processing for efficiency"
|
||||||
|
|
||||||
|
5. **Assess Quality**: Grade and explain
|
||||||
|
- "Email coverage: 100% - Excellent for marketing"
|
||||||
|
- "Phone coverage: 99.39% - Near-perfect, 30 missing values"
|
||||||
|
|
||||||
|
### Badge Assignment:
|
||||||
|
|
||||||
|
- **Excellent**: 95-100% coverage or <5% deduplication
|
||||||
|
- **Good**: 85-94% coverage or 5-15% deduplication
|
||||||
|
- **Needs Improvement**: <85% coverage or >15% deduplication
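
A hedged sketch of how the coverage thresholds above might be encoded (the helper name is illustrative, not part of the plugin):

```python
def coverage_badge(coverage_pct):
    """Map a coverage percentage to the badge labels used in the report."""
    if coverage_pct >= 95:
        return "Excellent"
    if coverage_pct >= 85:
        return "Good"
    return "Needs Improvement"
```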
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Platform-Specific Adaptations
|
||||||
|
|
||||||
|
### Snowflake Specifics:
|
||||||
|
|
||||||
|
- Use UPPERCASE for all identifiers (DATABASE, SCHEMA, TABLE, COLUMN)
|
||||||
|
- Use `ARRAY_CONSTRUCT()` for array creation
|
||||||
|
- Use `OBJECT_CONSTRUCT()` for objects
|
||||||
|
- Date format: `TO_CHAR(CURRENT_DATE(), 'YYYY-MM-DD')`
|
||||||
|
|
||||||
|
### Databricks Specifics:
|
||||||
|
|
||||||
|
- Use lowercase for identifiers (catalog, schema, table, column)
|
||||||
|
- Use `ARRAY()` for array creation
|
||||||
|
- Use `STRUCT()` for objects
|
||||||
|
- Date format: `DATE_FORMAT(CURRENT_DATE(), 'yyyy-MM-dd')`
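
Because the report generator must emit slightly different SQL per platform, such differences can be kept in a small lookup; a sketch based only on the two date expressions listed above:

```python
CURRENT_DATE_SQL = {
    "snowflake":  "TO_CHAR(CURRENT_DATE(), 'YYYY-MM-DD')",
    "databricks": "DATE_FORMAT(CURRENT_DATE(), 'yyyy-MM-dd')",
}

def current_date_expr(platform):
    """Return the platform-specific expression for today's date as YYYY-MM-DD."""
    return CURRENT_DATE_SQL[platform.lower()]
```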
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Checklist
|
||||||
|
|
||||||
|
Before marking task complete:
|
||||||
|
|
||||||
|
- [ ] All required user inputs collected
|
||||||
|
- [ ] Platform and table names validated
|
||||||
|
- [ ] All 16 queries executed successfully
|
||||||
|
- [ ] All metrics calculated correctly
|
||||||
|
- [ ] HTML report generated with all sections
|
||||||
|
- [ ] File written to specified path
|
||||||
|
- [ ] Success summary displayed to user
|
||||||
|
- [ ] No errors or warnings in output
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Final Agent Output
|
||||||
|
|
||||||
|
**When complete, output this exact format:**
|
||||||
|
|
||||||
|
```
|
||||||
|
════════════════════════════════════════════════════════════════
|
||||||
|
ID UNIFICATION MERGE STATISTICS REPORT - GENERATION COMPLETE
|
||||||
|
════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
Platform: {platform}
|
||||||
|
Database/Catalog: {database}
|
||||||
|
Schema: {schema}
|
||||||
|
Canonical ID: {canonical_id}
|
||||||
|
|
||||||
|
STATISTICS SUMMARY
|
||||||
|
──────────────────────────────────────────────────────────────
|
||||||
|
Unified Profiles: {unified_profiles:,}
|
||||||
|
Total Identities: {total_identities:,}
|
||||||
|
Merge Ratio: {merge_ratio:.2f}:1
|
||||||
|
Fragmentation Reduction: {reduction_pct:.1f}%
|
||||||
|
Data Quality Score: {quality_score:.1f}%
|
||||||
|
|
||||||
|
REPORT DETAILS
|
||||||
|
──────────────────────────────────────────────────────────────
|
||||||
|
Output File: {output_path}
|
||||||
|
File Size: {file_size} KB
|
||||||
|
Sections Included: 9
|
||||||
|
Queries Executed: 16
|
||||||
|
Generation Time: {generation_time} seconds
|
||||||
|
|
||||||
|
NEXT STEPS
|
||||||
|
──────────────────────────────────────────────────────────────
|
||||||
|
1. Open {output_path} in your web browser
|
||||||
|
2. Review merge statistics and expert insights
|
||||||
|
3. Export to PDF: Press Ctrl+P (Windows) or Cmd+P (Mac)
|
||||||
|
4. Share with stakeholders and decision makers
|
||||||
|
|
||||||
|
✓ Report generation successful!
|
||||||
|
════════════════════════════════════════════════════════════════
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**You are now ready to execute as the expert merge statistics report generator agent!**
|
||||||
114
agents/snowflake-sql-generator.md
Normal file
114
agents/snowflake-sql-generator.md
Normal file
@@ -0,0 +1,114 @@
|
|||||||
|
# Snowflake SQL Generator Agent
|
||||||
|
|
||||||
|
## Agent Purpose
|
||||||
|
Generate production-ready Snowflake SQL from `unify.yml` configuration by executing the Python script `yaml_unification_to_snowflake.py`.
|
||||||
|
|
||||||
|
## Agent Workflow
|
||||||
|
|
||||||
|
### Step 1: Validate Inputs
|
||||||
|
**Check**:
|
||||||
|
- YAML file exists and is valid
|
||||||
|
- Target database and schema provided
|
||||||
|
- Source database/schema (defaults to target database/PUBLIC if not provided)
|
||||||
|
- Output directory path
|
||||||
|
|
||||||
|
### Step 2: Execute Python Script
|
||||||
|
**Use Bash tool** to execute:
|
||||||
|
```bash
|
||||||
|
python3 /path/to/plugins/cdp-hybrid-idu/scripts/snowflake/yaml_unification_to_snowflake.py \
|
||||||
|
<yaml_file> \
|
||||||
|
-d <target_database> \
|
||||||
|
-s <target_schema> \
|
||||||
|
-sd <source_database> \
|
||||||
|
-ss <source_schema> \
|
||||||
|
-o <output_directory>
|
||||||
|
```
|
||||||
|
|
||||||
|
**Parameters**:
|
||||||
|
- `<yaml_file>`: Path to unify.yml
|
||||||
|
- `-d`: Target database name
|
||||||
|
- `-s`: Target schema name
|
||||||
|
- `-sd`: Source database (optional, defaults to target database)
|
||||||
|
- `-ss`: Source schema (optional, defaults to PUBLIC)
|
||||||
|
- `-o`: Output directory (optional, defaults to `snowflake_sql`)
|
||||||
|
|
||||||
|
### Step 3: Monitor Execution
|
||||||
|
**Track**:
|
||||||
|
- Script execution progress
|
||||||
|
- Generated SQL file count
|
||||||
|
- Any warnings or errors
|
||||||
|
- Output directory structure
|
||||||
|
|
||||||
|
### Step 4: Parse and Report Results
|
||||||
|
**Output**:
|
||||||
|
```
|
||||||
|
✓ Snowflake SQL generation complete!
|
||||||
|
|
||||||
|
Generated Files:
|
||||||
|
• snowflake_sql/unify/01_create_graph.sql
|
||||||
|
• snowflake_sql/unify/02_extract_merge.sql
|
||||||
|
• snowflake_sql/unify/03_source_key_stats.sql
|
||||||
|
• snowflake_sql/unify/04_unify_loop_iteration_01.sql
|
||||||
|
... (up to iteration_N)
|
||||||
|
• snowflake_sql/unify/05_canonicalize.sql
|
||||||
|
• snowflake_sql/unify/06_result_key_stats.sql
|
||||||
|
• snowflake_sql/unify/10_enrich_*.sql
|
||||||
|
• snowflake_sql/unify/20_master_*.sql
|
||||||
|
• snowflake_sql/unify/30_unification_metadata.sql
|
||||||
|
• snowflake_sql/unify/31_filter_lookup.sql
|
||||||
|
• snowflake_sql/unify/32_column_lookup.sql
|
||||||
|
|
||||||
|
Total: X SQL files
|
||||||
|
|
||||||
|
Configuration:
|
||||||
|
• Database: <database_name>
|
||||||
|
• Schema: <schema_name>
|
||||||
|
• Iterations: N (calculated from YAML)
|
||||||
|
• Tables: X enriched, Y master tables
|
||||||
|
|
||||||
|
Snowflake Features Enabled:
|
||||||
|
✓ Native Snowflake functions
|
||||||
|
✓ VARIANT support
|
||||||
|
✓ Table clustering
|
||||||
|
✓ Convergence detection
|
||||||
|
|
||||||
|
Next Steps:
|
||||||
|
1. Review generated SQL files
|
||||||
|
2. Execute using: /cdp-hybrid-idu:hybrid-execute-snowflake
|
||||||
|
3. Or manually execute in Snowflake SQL worksheet
|
||||||
|
```
|
||||||
|
|
||||||
|
## Critical Behaviors
|
||||||
|
|
||||||
|
### Python Script Error Handling
|
||||||
|
If script fails:
|
||||||
|
1. Capture error output
|
||||||
|
2. Parse error message
|
||||||
|
3. Provide helpful suggestions:
|
||||||
|
- YAML syntax errors → validate YAML
|
||||||
|
- Missing dependencies → install pyyaml
|
||||||
|
- Invalid table names → check YAML table section
|
||||||
|
- File permission errors → check output directory permissions
|
||||||
|
|
||||||
|
### Success Validation
|
||||||
|
Verify:
|
||||||
|
- Output directory created
|
||||||
|
- All expected SQL files present
|
||||||
|
- Files have non-zero content
|
||||||
|
- SQL syntax looks valid (basic check)
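
A minimal sketch of this verification (directory name and checks are illustrative):

```python
from pathlib import Path

def verify_output(output_dir="snowflake_sql/unify"):
    """Confirm the generator produced non-empty .sql files."""
    sql_files = sorted(Path(output_dir).glob("*.sql"))
    if not sql_files:
        raise FileNotFoundError(f"No SQL files found in {output_dir}")
    empty = [f.name for f in sql_files if f.stat().st_size == 0]
    if empty:
        raise ValueError(f"Empty SQL files: {empty}")
    print(f"✓ {len(sql_files)} SQL files generated in {output_dir}")
```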
|
||||||
|
|
||||||
|
### Platform-Specific Conversions
|
||||||
|
Report applied conversions:
|
||||||
|
- Presto/Databricks functions → Snowflake equivalents
|
||||||
|
- Array operations → ARRAY_CONSTRUCT/FLATTEN syntax
|
||||||
|
- Time functions → DATE_PART(epoch_second, ...)
|
||||||
|
- Table definitions → Snowflake syntax
|
||||||
|
|
||||||
|
## MUST DO
|
||||||
|
|
||||||
|
1. **Always use absolute paths** for plugin scripts
|
||||||
|
2. **Check Python version** (require Python 3.7+)
|
||||||
|
3. **Parse script output** for errors and warnings
|
||||||
|
4. **Verify output directory** structure
|
||||||
|
5. **Count generated files** and report summary
|
||||||
|
6. **Provide clear next steps** for execution
|
||||||
138
agents/snowflake-workflow-executor.md
Normal file
138
agents/snowflake-workflow-executor.md
Normal file
@@ -0,0 +1,138 @@
|
|||||||
|
# Snowflake Workflow Executor Agent
|
||||||
|
|
||||||
|
## Agent Purpose
|
||||||
|
Execute generated Snowflake SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling by orchestrating the Python script `snowflake_sql_executor.py`.
|
||||||
|
|
||||||
|
## Agent Workflow
|
||||||
|
|
||||||
|
### Step 1: Collect Credentials
|
||||||
|
**Required**:
|
||||||
|
- SQL directory path
|
||||||
|
- Account name
|
||||||
|
- Username
|
||||||
|
- Database and schema names
|
||||||
|
- Warehouse name (defaults to `COMPUTE_WH`)
|
||||||
|
|
||||||
|
**Authentication Options**:
|
||||||
|
- Password (from argument, environment variable `SNOWFLAKE_PASSWORD`, or prompt)
|
||||||
|
- SSO (externalbrowser)
|
||||||
|
- Key-pair (using environment variables)
|
||||||
|
|
||||||
|
### Step 2: Execute Python Script
|
||||||
|
**Use Bash tool** with `run_in_background: true` to execute:
|
||||||
|
```bash
|
||||||
|
python3 /path/to/plugins/cdp-hybrid-idu/scripts/snowflake/snowflake_sql_executor.py \
|
||||||
|
<sql_directory> \
|
||||||
|
--account <account> \
|
||||||
|
--user <user> \
|
||||||
|
--database <database> \
|
||||||
|
--schema <schema> \
|
||||||
|
--warehouse <warehouse> \
|
||||||
|
--password <password>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 3: Monitor Execution in Real-Time
|
||||||
|
**Use BashOutput tool** to stream progress:
|
||||||
|
- Connection status
|
||||||
|
- File execution progress
|
||||||
|
- Row counts and timing
|
||||||
|
- Convergence detection results
|
||||||
|
- Error messages
|
||||||
|
|
||||||
|
**Display Progress**:
|
||||||
|
```
|
||||||
|
✓ Connected to Snowflake: <account>
|
||||||
|
• Using database: <database>, schema: <schema>
|
||||||
|
|
||||||
|
Executing: 01_create_graph.sql
|
||||||
|
✓ Completed: 01_create_graph.sql
|
||||||
|
|
||||||
|
Executing: 02_extract_merge.sql
|
||||||
|
✓ Completed: 02_extract_merge.sql
|
||||||
|
• Rows affected: 125,000
|
||||||
|
|
||||||
|
Executing Unify Loop (convergence detection)
|
||||||
|
|
||||||
|
--- Iteration 1 ---
|
||||||
|
✓ Iteration 1 completed
|
||||||
|
• Updated records: 1,500
|
||||||
|
|
||||||
|
--- Iteration 2 ---
|
||||||
|
✓ Iteration 2 completed
|
||||||
|
• Updated records: 450
|
||||||
|
|
||||||
|
--- Iteration 3 ---
|
||||||
|
✓ Iteration 3 completed
|
||||||
|
• Updated records: 0
|
||||||
|
✓ Loop converged after 3 iterations!
|
||||||
|
|
||||||
|
• Creating alias table: loop_final
|
||||||
|
...
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 4: Handle Interactive Prompts
|
||||||
|
If script encounters errors and prompts for continuation:
|
||||||
|
```
|
||||||
|
✗ Error in file: 04_unify_loop_iteration_01.sql
|
||||||
|
Error: Table not found
|
||||||
|
|
||||||
|
Continue with remaining files? (y/n):
|
||||||
|
```
|
||||||
|
|
||||||
|
**Agent Decision**:
|
||||||
|
1. Show error to user
|
||||||
|
2. Ask user for decision
|
||||||
|
3. Pass response to script
|
||||||
|
|
||||||
|
### Step 5: Final Report
|
||||||
|
**After completion**:
|
||||||
|
```
|
||||||
|
Execution Complete!
|
||||||
|
|
||||||
|
Summary:
|
||||||
|
• Files processed: 18/18
|
||||||
|
• Execution time: 45 minutes
|
||||||
|
• Convergence: 3 iterations
|
||||||
|
• Final lookup table rows: 98,500
|
||||||
|
|
||||||
|
Validation:
|
||||||
|
✓ All tables created successfully
|
||||||
|
✓ Canonical IDs generated
|
||||||
|
✓ Enriched tables populated
|
||||||
|
✓ Master tables created
|
||||||
|
|
||||||
|
Next Steps:
|
||||||
|
1. Verify data quality
|
||||||
|
2. Check coverage metrics
|
||||||
|
3. Review statistics tables
|
||||||
|
```
|
||||||
|
|
||||||
|
## Critical Behaviors
|
||||||
|
|
||||||
|
### Convergence Monitoring
|
||||||
|
Track loop iterations:
|
||||||
|
- Iteration number
|
||||||
|
- Records updated
|
||||||
|
- Convergence status
|
||||||
|
|
||||||
|
### Error Recovery
|
||||||
|
On errors:
|
||||||
|
1. Capture error details
|
||||||
|
2. Determine severity (critical vs warning)
|
||||||
|
3. Prompt user for continuation decision
|
||||||
|
4. Log error for troubleshooting
|
||||||
|
|
||||||
|
### Performance Tracking
|
||||||
|
Monitor:
|
||||||
|
- Execution time per file
|
||||||
|
- Row counts processed
|
||||||
|
- Total workflow time
|
||||||
|
|
||||||
|
## MUST DO
|
||||||
|
|
||||||
|
1. **Stream output in real-time** using BashOutput
|
||||||
|
2. **Monitor convergence** and report iterations
|
||||||
|
3. **Handle user prompts** for error continuation
|
||||||
|
4. **Report final statistics** with coverage metrics
|
||||||
|
5. **Verify connection** before starting execution
|
||||||
|
6. **Clean up** on termination or error
|
||||||
382
agents/yaml-configuration-builder.md
Normal file
382
agents/yaml-configuration-builder.md
Normal file
@@ -0,0 +1,382 @@
|
|||||||
|
# YAML Configuration Builder Agent
|
||||||
|
|
||||||
|
## Agent Purpose
|
||||||
|
Interactive agent to help users create proper `unify.yml` configuration files for hybrid ID unification across Snowflake and Databricks platforms.
|
||||||
|
|
||||||
|
## Agent Capabilities
|
||||||
|
- Guide users through YAML creation step-by-step
|
||||||
|
- Validate configuration in real-time
|
||||||
|
- Provide examples and best practices
|
||||||
|
- Support both simple and complex configurations
|
||||||
|
- Ensure platform compatibility (Snowflake and Databricks)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Agent Workflow
|
||||||
|
|
||||||
|
### Step 1: Project Name and Scope
|
||||||
|
**Collect**:
|
||||||
|
- Unification project name
|
||||||
|
- Brief description of use case
|
||||||
|
|
||||||
|
**Example Interaction**:
|
||||||
|
```
|
||||||
|
Question: What would you like to name this unification project?
|
||||||
|
Suggestion: Use a descriptive name like 'customer_unification' or 'user_identity_resolution'
|
||||||
|
|
||||||
|
User input: customer_360
|
||||||
|
|
||||||
|
✓ Project name: customer_360
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Step 2: Define Keys (User Identifiers)
|
||||||
|
**Collect**:
|
||||||
|
- Key names (email, customer_id, phone_number, etc.)
|
||||||
|
- Validation rules for each key:
|
||||||
|
- `valid_regexp`: Regex pattern for format validation
|
||||||
|
- `invalid_texts`: Array of values to exclude
|
||||||
|
|
||||||
|
**Example Interaction**:
|
||||||
|
```
|
||||||
|
Question: What user identifier columns (keys) do you want to use for unification?
|
||||||
|
|
||||||
|
Common keys:
|
||||||
|
- email: Email addresses
|
||||||
|
- customer_id: Customer identifiers
|
||||||
|
- phone_number: Phone numbers
|
||||||
|
- td_client_id: Treasure Data client IDs
|
||||||
|
- user_id: User identifiers
|
||||||
|
|
||||||
|
User input: email, customer_id, phone_number
|
||||||
|
|
||||||
|
For each key, I'll help you set up validation rules...
|
||||||
|
|
||||||
|
Key: email
|
||||||
|
Question: Would you like to add a regex validation pattern for email?
|
||||||
|
Suggestion: Use ".*@.*" for basic email validation or more strict patterns
|
||||||
|
|
||||||
|
User input: .*@.*
|
||||||
|
|
||||||
|
Question: What values should be considered invalid?
|
||||||
|
Suggestion: Common invalid values: '', 'N/A', 'null', 'unknown'
|
||||||
|
|
||||||
|
User input: '', 'N/A', 'null'
|
||||||
|
|
||||||
|
✓ Key 'email' configured with regex validation and 3 invalid values
|
||||||
|
```
|
||||||
|
|
||||||
|
**Generate YAML Section**:
|
||||||
|
```yaml
|
||||||
|
keys:
|
||||||
|
- name: email
|
||||||
|
valid_regexp: ".*@.*"
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: customer_id
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: phone_number
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Step 3: Map Tables to Keys
|
||||||
|
**Collect**:
|
||||||
|
- Source table names
|
||||||
|
- Key column mappings for each table
|
||||||
|
|
||||||
|
**Example Interaction**:
|
||||||
|
```
|
||||||
|
Question: What source tables contain user identifiers?
|
||||||
|
|
||||||
|
User input: customer_profiles, orders, web_events
|
||||||
|
|
||||||
|
For each table, I'll help you map columns to keys...
|
||||||
|
|
||||||
|
Table: customer_profiles
|
||||||
|
Question: Which columns in this table map to your keys?
|
||||||
|
|
||||||
|
Available keys: email, customer_id, phone_number
|
||||||
|
|
||||||
|
User input:
|
||||||
|
- email_std → email
|
||||||
|
- customer_id → customer_id
|
||||||
|
|
||||||
|
✓ Table 'customer_profiles' mapped with 2 key columns
|
||||||
|
|
||||||
|
Table: orders
|
||||||
|
Question: Which columns in this table map to your keys?
|
||||||
|
|
||||||
|
User input:
|
||||||
|
- email_address → email
|
||||||
|
- phone → phone_number
|
||||||
|
|
||||||
|
✓ Table 'orders' mapped with 2 key columns
|
||||||
|
```
|
||||||
|
|
||||||
|
**Generate YAML Section**:
|
||||||
|
```yaml
|
||||||
|
tables:
|
||||||
|
- table: customer_profiles
|
||||||
|
key_columns:
|
||||||
|
- {column: email_std, key: email}
|
||||||
|
- {column: customer_id, key: customer_id}
|
||||||
|
- table: orders
|
||||||
|
key_columns:
|
||||||
|
- {column: email_address, key: email}
|
||||||
|
- {column: phone, key: phone_number}
|
||||||
|
- table: web_events
|
||||||
|
key_columns:
|
||||||
|
- {column: user_email, key: email}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Step 4: Configure Canonical ID
|
||||||
|
**Collect**:
|
||||||
|
- Canonical ID name
|
||||||
|
- Merge keys (priority order)
|
||||||
|
- Iteration count (optional)
|
||||||
|
|
||||||
|
**Example Interaction**:
|
||||||
|
```
|
||||||
|
Question: What would you like to name the canonical ID column?
|
||||||
|
Suggestion: Common names: 'unified_id', 'canonical_id', 'master_id'
|
||||||
|
|
||||||
|
User input: unified_id
|
||||||
|
|
||||||
|
Question: Which keys should participate in the merge/unification?
|
||||||
|
Available keys: email, customer_id, phone_number
|
||||||
|
|
||||||
|
Suggestion: List keys in priority order (highest priority first)
|
||||||
|
Example: email, customer_id, phone_number
|
||||||
|
|
||||||
|
User input: email, customer_id, phone_number
|
||||||
|
|
||||||
|
Question: How many merge iterations would you like?
|
||||||
|
Suggestion:
|
||||||
|
- Leave blank to auto-calculate based on complexity
|
||||||
|
- Typical range: 3-10 iterations
|
||||||
|
- More keys/tables = more iterations needed
|
||||||
|
|
||||||
|
User input: (blank - auto-calculate)
|
||||||
|
|
||||||
|
✓ Canonical ID 'unified_id' configured with 3 merge keys
|
||||||
|
✓ Iterations will be auto-calculated
|
||||||
|
```
|
||||||
|
|
||||||
|
**Generate YAML Section**:
|
||||||
|
```yaml
|
||||||
|
canonical_ids:
|
||||||
|
- name: unified_id
|
||||||
|
merge_by_keys: [email, customer_id, phone_number]
|
||||||
|
# merge_iterations: auto-calculated (omit to let the generator decide)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Step 5: Configure Master Tables (Optional)
|
||||||
|
**Collect**:
|
||||||
|
- Master table names
|
||||||
|
- Attributes to aggregate
|
||||||
|
- Source column priorities
|
||||||
|
|
||||||
|
**Example Interaction**:
|
||||||
|
```
|
||||||
|
Question: Would you like to create master tables with aggregated attributes?
|
||||||
|
(Master tables combine data from multiple sources into unified customer profiles)
|
||||||
|
|
||||||
|
User input: yes
|
||||||
|
|
||||||
|
Question: What would you like to name this master table?
|
||||||
|
Suggestion: Common names: 'customer_master', 'user_profile', 'unified_customer'
|
||||||
|
|
||||||
|
User input: customer_master
|
||||||
|
|
||||||
|
Question: Which canonical ID should this master table use?
|
||||||
|
Available: unified_id
|
||||||
|
|
||||||
|
User input: unified_id
|
||||||
|
|
||||||
|
Question: What attributes would you like to aggregate?
|
||||||
|
|
||||||
|
Attribute 1:
|
||||||
|
Name: best_email
|
||||||
|
Type: single value or array?
|
||||||
|
User input: single value
|
||||||
|
|
||||||
|
Source columns (priority order):
|
||||||
|
1. Table: customer_profiles, Column: email_std, Order by: time
|
||||||
|
2. Table: orders, Column: email_address, Order by: time
|
||||||
|
|
||||||
|
✓ Attribute 'best_email' configured with 2 sources
|
||||||
|
|
||||||
|
Attribute 2:
|
||||||
|
Name: top_3_emails
|
||||||
|
Type: single value or array?
|
||||||
|
User input: array
|
||||||
|
Array size: 3
|
||||||
|
|
||||||
|
Source columns (priority order):
|
||||||
|
1. Table: customer_profiles, Column: email_std, Order by: time
|
||||||
|
2. Table: orders, Column: email_address, Order by: time
|
||||||
|
|
||||||
|
✓ Attribute 'top_3_emails' configured as array with 2 sources
|
||||||
|
```
|
||||||
|
|
||||||
|
**Generate YAML Section**:
|
||||||
|
```yaml
|
||||||
|
master_tables:
|
||||||
|
- name: customer_master
|
||||||
|
canonical_id: unified_id
|
||||||
|
attributes:
|
||||||
|
- name: best_email
|
||||||
|
source_columns:
|
||||||
|
- {table: customer_profiles, column: email_std, priority: 1, order_by: time}
|
||||||
|
- {table: orders, column: email_address, priority: 2, order_by: time}
|
||||||
|
- name: top_3_emails
|
||||||
|
array_elements: 3
|
||||||
|
source_columns:
|
||||||
|
- {table: customer_profiles, column: email_std, priority: 1, order_by: time}
|
||||||
|
- {table: orders, column: email_address, priority: 2, order_by: time}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Step 6: Validation and Finalization
|
||||||
|
**Perform**:
|
||||||
|
1. Validate complete YAML structure
|
||||||
|
2. Check all references
|
||||||
|
3. Suggest optimizations
|
||||||
|
4. Write final `unify.yml` file
|
||||||
|
|
||||||
|
**Example Output**:
|
||||||
|
```
|
||||||
|
Validating configuration...
|
||||||
|
|
||||||
|
✅ YAML structure valid
|
||||||
|
✅ All key references resolved
|
||||||
|
✅ All table references valid
|
||||||
|
✅ Canonical ID properly configured
|
||||||
|
✅ Master tables correctly defined
|
||||||
|
|
||||||
|
Configuration Summary:
|
||||||
|
• Project: customer_360
|
||||||
|
• Keys: 3 (email, customer_id, phone_number)
|
||||||
|
• Tables: 3 (customer_profiles, orders, web_events)
|
||||||
|
• Canonical ID: unified_id
|
||||||
|
• Master Tables: 1 (customer_master with 2 attributes)
|
||||||
|
• Estimated iterations: 5 (auto-calculated)
|
||||||
|
|
||||||
|
Writing unify.yml...
|
||||||
|
|
||||||
|
✓ Configuration file created successfully!
|
||||||
|
|
||||||
|
File location: ./unify.yml
|
||||||
|
```
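
Behind the scenes, the structural validation can be as simple as loading the file and checking cross-references; a sketch assuming `pyyaml` is installed:

```python
import yaml

def validate_unify_yaml(path="unify.yml"):
    """Light structural check: keys referenced by tables and canonical_ids must exist."""
    with open(path) as f:
        cfg = yaml.safe_load(f)

    key_names = {k["name"] for k in cfg.get("keys", [])}

    for table in cfg.get("tables", []):
        for kc in table.get("key_columns", []):
            assert kc["key"] in key_names, f"{table['table']}: unknown key {kc['key']}"

    for cid in cfg.get("canonical_ids", []):
        for key in cid.get("merge_by_keys", []):
            assert key in key_names, f"{cid['name']}: unknown merge key {key}"

    return cfg
```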
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Agent Output
|
||||||
|
|
||||||
|
### Success
|
||||||
|
Returns complete `unify.yml` with:
|
||||||
|
- All sections properly structured
|
||||||
|
- Valid YAML syntax
|
||||||
|
- Optimized configuration
|
||||||
|
- Ready for SQL generation
|
||||||
|
|
||||||
|
### Validation
|
||||||
|
Performs checks:
|
||||||
|
- YAML syntax validation
|
||||||
|
- Reference integrity
|
||||||
|
- Best practices compliance
|
||||||
|
- Platform compatibility
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Agent Behavior Guidelines
|
||||||
|
|
||||||
|
### Be Interactive
|
||||||
|
- Ask clear questions
|
||||||
|
- Provide examples
|
||||||
|
- Suggest best practices
|
||||||
|
- Validate responses
|
||||||
|
|
||||||
|
### Be Helpful
|
||||||
|
- Explain concepts when needed
|
||||||
|
- Offer suggestions
|
||||||
|
- Show examples
|
||||||
|
- Guide through complex scenarios
|
||||||
|
|
||||||
|
### Be Thorough
|
||||||
|
- Don't skip validation
|
||||||
|
- Check all references
|
||||||
|
- Ensure completeness
|
||||||
|
- Verify platform compatibility
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example Complete YAML Output
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
name: customer_360
|
||||||
|
|
||||||
|
keys:
|
||||||
|
- name: email
|
||||||
|
valid_regexp: ".*@.*"
|
||||||
|
invalid_texts: ['', 'N/A', 'null', 'unknown']
|
||||||
|
- name: customer_id
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: phone_number
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
|
||||||
|
tables:
|
||||||
|
- table: customer_profiles
|
||||||
|
key_columns:
|
||||||
|
- {column: email_std, key: email}
|
||||||
|
- {column: customer_id, key: customer_id}
|
||||||
|
- table: orders
|
||||||
|
key_columns:
|
||||||
|
- {column: email_address, key: email}
|
||||||
|
- {column: phone, key: phone_number}
|
||||||
|
- table: web_events
|
||||||
|
key_columns:
|
||||||
|
- {column: user_email, key: email}
|
||||||
|
|
||||||
|
canonical_ids:
|
||||||
|
- name: unified_id
|
||||||
|
merge_by_keys: [email, customer_id, phone_number]
|
||||||
|
merge_iterations: 15
|
||||||
|
|
||||||
|
master_tables:
|
||||||
|
- name: customer_master
|
||||||
|
canonical_id: unified_id
|
||||||
|
attributes:
|
||||||
|
- name: best_email
|
||||||
|
source_columns:
|
||||||
|
- {table: customer_profiles, column: email_std, priority: 1, order_by: time}
|
||||||
|
- {table: orders, column: email_address, priority: 2, order_by: time}
|
||||||
|
- name: primary_phone
|
||||||
|
source_columns:
|
||||||
|
- {table: orders, column: phone, priority: 1, order_by: time}
|
||||||
|
- name: top_3_emails
|
||||||
|
array_elements: 3
|
||||||
|
source_columns:
|
||||||
|
- {table: customer_profiles, column: email_std, priority: 1, order_by: time}
|
||||||
|
- {table: orders, column: email_address, priority: 2, order_by: time}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## CRITICAL: Agent Must
|
||||||
|
|
||||||
|
1. **Always validate** YAML syntax before writing file
|
||||||
|
2. **Check all references** (keys, tables, canonical_ids)
|
||||||
|
3. **Provide examples** for complex configurations
|
||||||
|
4. **Suggest optimizations** based on use case
|
||||||
|
5. **Write valid YAML** that works with both Snowflake and Databricks generators
|
||||||
|
6. **Use proper indentation** (2 spaces per level)
|
||||||
|
7. **Quote string values** where necessary
|
||||||
|
8. **Test regex patterns** before adding to configuration
|
||||||
387
commands/hybrid-execute-databricks.md
Normal file
387
commands/hybrid-execute-databricks.md
Normal file
@@ -0,0 +1,387 @@
|
|||||||
|
---
|
||||||
|
name: hybrid-execute-databricks
|
||||||
|
description: Execute Databricks ID unification workflow with convergence detection and monitoring
|
||||||
|
---
|
||||||
|
|
||||||
|
# Execute Databricks ID Unification Workflow
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Execute your generated Databricks SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling. This command orchestrates the complete unification process from graph creation to master table generation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What You Need
|
||||||
|
|
||||||
|
### Required Inputs
|
||||||
|
1. **SQL Directory**: Path to generated SQL files (e.g., `databricks_sql/unify/`)
|
||||||
|
2. **Server Hostname**: Your Databricks workspace URL (e.g., `your-workspace.cloud.databricks.com`)
|
||||||
|
3. **HTTP Path**: SQL Warehouse or cluster path (e.g., `/sql/1.0/warehouses/abc123`)
|
||||||
|
4. **Catalog**: Target catalog name
|
||||||
|
5. **Schema**: Target schema name
|
||||||
|
|
||||||
|
### Authentication
|
||||||
|
**Option 1: Personal Access Token (PAT)**
|
||||||
|
- Access token from Databricks workspace
|
||||||
|
- Can be provided as argument or via environment variable `DATABRICKS_TOKEN`
|
||||||
|
|
||||||
|
**Option 2: OAuth**
|
||||||
|
- Browser-based authentication
|
||||||
|
- No token required, will open browser for login
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What I'll Do
|
||||||
|
|
||||||
|
### Step 1: Connection Setup
|
||||||
|
- Connect to your Databricks workspace
|
||||||
|
- Validate credentials and permissions
|
||||||
|
- Set catalog and schema context
|
||||||
|
- Verify SQL directory exists
|
||||||
|
|
||||||
|
### Step 2: Execution Plan
|
||||||
|
Display execution plan with:
|
||||||
|
- All SQL files in execution order
|
||||||
|
- File types (Setup, Loop Iteration, Enrichment, Master Table, etc.)
|
||||||
|
- Estimated steps and dependencies
|
||||||
|
|
||||||
|
### Step 3: SQL Execution
|
||||||
|
I'll call the **databricks-workflow-executor agent** to:
|
||||||
|
- Execute SQL files in proper sequence
|
||||||
|
- Skip loop iteration files (handled separately)
|
||||||
|
- Monitor progress with real-time feedback
|
||||||
|
- Track row counts and execution times
|
||||||
|
|
||||||
|
### Step 4: Unify Loop with Convergence Detection
|
||||||
|
**Intelligent Loop Execution**:
|
||||||
|
```
|
||||||
|
Iteration 1:
|
||||||
|
✓ Execute unify SQL
|
||||||
|
• Check convergence: 1500 records updated
|
||||||
|
• Optimize Delta table
|
||||||
|
→ Continue to iteration 2
|
||||||
|
|
||||||
|
Iteration 2:
|
||||||
|
✓ Execute unify SQL
|
||||||
|
• Check convergence: 450 records updated
|
||||||
|
• Optimize Delta table
|
||||||
|
→ Continue to iteration 3
|
||||||
|
|
||||||
|
Iteration 3:
|
||||||
|
✓ Execute unify SQL
|
||||||
|
• Check convergence: 0 records updated
|
||||||
|
✓ CONVERGED! Stop loop
|
||||||
|
```
|
||||||
|
|
||||||
|
**Features**:
|
||||||
|
- Runs until convergence (updated_count = 0)
|
||||||
|
- Maximum 30 iterations safety limit
|
||||||
|
- Auto-optimization after each iteration
|
||||||
|
- Creates alias table (loop_final) for downstream processing
|
||||||
|
|
||||||
|
### Step 5: Post-Loop Processing
|
||||||
|
- Execute canonicalization step
|
||||||
|
- Generate result statistics
|
||||||
|
- Enrich source tables with canonical IDs
|
||||||
|
- Create master tables
|
||||||
|
- Generate metadata and lookup tables
|
||||||
|
|
||||||
|
### Step 6: Final Report
|
||||||
|
Provide:
|
||||||
|
- Total execution time
|
||||||
|
- Files processed successfully
|
||||||
|
- Convergence statistics
|
||||||
|
- Final table row counts
|
||||||
|
- Next steps and recommendations
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Command Usage
|
||||||
|
|
||||||
|
### Interactive Mode (Recommended)
|
||||||
|
```
|
||||||
|
/cdp-hybrid-idu:hybrid-execute-databricks
|
||||||
|
|
||||||
|
I'll prompt you for:
|
||||||
|
- SQL directory path
|
||||||
|
- Databricks server hostname
|
||||||
|
- HTTP path
|
||||||
|
- Catalog and schema
|
||||||
|
- Authentication method
|
||||||
|
```
|
||||||
|
|
||||||
|
### Advanced Mode
|
||||||
|
Provide all parameters upfront:
|
||||||
|
```
|
||||||
|
SQL directory: databricks_sql/unify/
|
||||||
|
Server hostname: your-workspace.cloud.databricks.com
|
||||||
|
HTTP path: /sql/1.0/warehouses/abc123
|
||||||
|
Catalog: my_catalog
|
||||||
|
Schema: my_schema
|
||||||
|
Auth type: pat (or oauth)
|
||||||
|
Access token: dapi... (if using PAT)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Execution Features
|
||||||
|
|
||||||
|
### 1. Convergence Detection
|
||||||
|
**Algorithm**:
|
||||||
|
```sql
|
||||||
|
SELECT COUNT(*) as updated_count FROM (
|
||||||
|
SELECT leader_ns, leader_id, follower_ns, follower_id
|
||||||
|
FROM current_iteration
|
||||||
|
EXCEPT
|
||||||
|
SELECT leader_ns, leader_id, follower_ns, follower_id
|
||||||
|
FROM previous_iteration
|
||||||
|
) diff
|
||||||
|
```
|
||||||
|
|
||||||
|
**Stops when**: updated_count = 0
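
In pseudocode, the driver loop the executor script follows looks roughly like this (function names are illustrative, not the script's actual API):

```python
MAX_ITERATIONS = 30  # safety limit noted above

def run_unify_loop(run_iteration_sql, count_updated, optimize_table):
    """Run iterations until the EXCEPT-based diff reports zero updates."""
    for i in range(1, MAX_ITERATIONS + 1):
        run_iteration_sql(i)        # execute the unify SQL for iteration i
        updated = count_updated(i)  # convergence query shown above
        optimize_table(i)           # OPTIMIZE the Delta table
        print(f"Iteration {i}: {updated} records updated")
        if updated == 0:
            print(f"Converged after {i} iterations")
            return i
    raise RuntimeError("Loop did not converge within the safety limit")
```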
|
||||||
|
|
||||||
|
### 2. Delta Table Optimization
|
||||||
|
After major operations:
|
||||||
|
```sql
|
||||||
|
OPTIMIZE table_name
|
||||||
|
```
|
||||||
|
Benefits:
|
||||||
|
- Compacts small files
|
||||||
|
- Improves query performance
|
||||||
|
- Reduces storage costs
|
||||||
|
- Optimizes clustering
|
||||||
|
|
||||||
|
### 3. Interactive Error Handling
|
||||||
|
If an error occurs:
|
||||||
|
```
|
||||||
|
✗ File: 04_unify_loop_iteration_01.sql
|
||||||
|
Error: Table not found: source_table
|
||||||
|
|
||||||
|
Continue with remaining files? (y/n):
|
||||||
|
```
|
||||||
|
|
||||||
|
You can choose to:
|
||||||
|
- Continue: Skip failed file, continue with rest
|
||||||
|
- Stop: Halt execution for investigation
|
||||||
|
|
||||||
|
### 4. Real-Time Monitoring
|
||||||
|
Track progress with:
|
||||||
|
- ✓ Completed steps (green)
|
||||||
|
- • Progress indicators (cyan)
|
||||||
|
- ✗ Failed steps (red)
|
||||||
|
- ⚠ Warnings (yellow)
|
||||||
|
- Row counts and execution times
|
||||||
|
|
||||||
|
### 5. Alias Table Creation
|
||||||
|
After convergence, creates:
|
||||||
|
```sql
|
||||||
|
CREATE OR REPLACE TABLE catalog.schema.unified_id_graph_unify_loop_final
|
||||||
|
AS SELECT * FROM catalog.schema.unified_id_graph_unify_loop_3
|
||||||
|
```
|
||||||
|
|
||||||
|
This allows downstream SQL to reference `loop_final` regardless of actual iteration count.
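
For illustration, building the alias statement only requires the catalog/schema, the canonical ID name, and the iteration at which the loop converged (the helper below is a sketch, not the executor's actual code):

```python
def alias_statement(catalog, schema, canonical_id, final_iteration):
    """Build the CREATE OR REPLACE statement for the loop_final alias table."""
    base = f"{catalog}.{schema}.{canonical_id}_graph_unify_loop"
    return (
        f"CREATE OR REPLACE TABLE {base}_final "
        f"AS SELECT * FROM {base}_{final_iteration}"
    )
```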
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Technical Details
|
||||||
|
|
||||||
|
### Python Script Execution
|
||||||
|
The agent executes:
|
||||||
|
```bash
|
||||||
|
python3 scripts/databricks/databricks_sql_executor.py \
|
||||||
|
databricks_sql/unify/ \
|
||||||
|
--server-hostname your-workspace.databricks.com \
|
||||||
|
--http-path /sql/1.0/warehouses/abc123 \
|
||||||
|
--catalog my_catalog \
|
||||||
|
--schema my_schema \
|
||||||
|
--auth-type pat \
|
||||||
|
--optimize-tables
|
||||||
|
```
|
||||||
|
|
||||||
|
### Execution Order
|
||||||
|
1. **Setup Phase** (01-03):
|
||||||
|
- Create graph table (loop_0)
|
||||||
|
- Extract and merge identities
|
||||||
|
- Generate source statistics
|
||||||
|
|
||||||
|
2. **Unification Loop** (04):
|
||||||
|
- Run iterations until convergence
|
||||||
|
- Check after EVERY iteration
|
||||||
|
- Stop when updated_count = 0
|
||||||
|
- Create loop_final alias
|
||||||
|
|
||||||
|
3. **Canonicalization** (05):
|
||||||
|
- Create canonical ID lookup
|
||||||
|
- Create keys and tables metadata
|
||||||
|
- Rename final graph table
|
||||||
|
|
||||||
|
4. **Statistics** (06):
|
||||||
|
- Generate result key statistics
|
||||||
|
- Create histograms
|
||||||
|
- Calculate coverage metrics
|
||||||
|
|
||||||
|
5. **Enrichment** (10-19):
|
||||||
|
- Add canonical IDs to source tables
|
||||||
|
- Create enriched_* tables
|
||||||
|
|
||||||
|
6. **Master Tables** (20-29):
|
||||||
|
- Aggregate attributes
|
||||||
|
- Apply priority rules
|
||||||
|
- Create unified customer profiles
|
||||||
|
|
||||||
|
7. **Metadata** (30-39):
|
||||||
|
- Unification metadata
|
||||||
|
- Filter lookup tables
|
||||||
|
- Column lookup tables
|
||||||
|
|
||||||
|
### Connection Management
|
||||||
|
- Establishes single connection for entire workflow
|
||||||
|
- Uses connection pooling for efficiency
|
||||||
|
- Automatic reconnection on timeout
|
||||||
|
- Proper cleanup on completion or error
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example Execution
|
||||||
|
|
||||||
|
### Input
|
||||||
|
```
|
||||||
|
SQL directory: databricks_sql/unify/
|
||||||
|
Server hostname: dbc-12345-abc.cloud.databricks.com
|
||||||
|
HTTP path: /sql/1.0/warehouses/6789abcd
|
||||||
|
Catalog: customer_data
|
||||||
|
Schema: id_unification
|
||||||
|
Auth type: pat
|
||||||
|
```
|
||||||
|
|
||||||
|
### Output
|
||||||
|
```
|
||||||
|
✓ Connected to Databricks: dbc-12345-abc.cloud.databricks.com
|
||||||
|
• Using catalog: customer_data, schema: id_unification
|
||||||
|
|
||||||
|
Starting Databricks SQL Execution
|
||||||
|
• Catalog: customer_data
|
||||||
|
• Schema: id_unification
|
||||||
|
• Delta tables: ✓ enabled
|
||||||
|
|
||||||
|
Executing: 01_create_graph.sql
|
||||||
|
✓ 01_create_graph.sql: Executed successfully
|
||||||
|
|
||||||
|
Executing: 02_extract_merge.sql
|
||||||
|
✓ 02_extract_merge.sql: Executed successfully
|
||||||
|
• Rows affected: 125000
|
||||||
|
|
||||||
|
Executing: 03_source_key_stats.sql
|
||||||
|
✓ 03_source_key_stats.sql: Executed successfully
|
||||||
|
|
||||||
|
Executing Unify Loop Before Canonicalization
|
||||||
|
|
||||||
|
--- Iteration 1 ---
|
||||||
|
✓ Iteration 1 completed
|
||||||
|
• Rows processed: 125000
|
||||||
|
• Updated records: 1500
|
||||||
|
• Optimizing Delta table
|
||||||
|
|
||||||
|
--- Iteration 2 ---
|
||||||
|
✓ Iteration 2 completed
|
||||||
|
• Rows processed: 125000
|
||||||
|
• Updated records: 450
|
||||||
|
• Optimizing Delta table
|
||||||
|
|
||||||
|
--- Iteration 3 ---
|
||||||
|
✓ Iteration 3 completed
|
||||||
|
• Rows processed: 125000
|
||||||
|
• Updated records: 0
|
||||||
|
✓ Loop converged after 3 iterations
|
||||||
|
|
||||||
|
• Creating alias table for final iteration
|
||||||
|
✓ Alias table 'unified_id_graph_unify_loop_final' created
|
||||||
|
|
||||||
|
Executing: 05_canonicalize.sql
|
||||||
|
✓ 05_canonicalize.sql: Executed successfully
|
||||||
|
|
||||||
|
[... continues with enrichment, master tables, metadata ...]
|
||||||
|
|
||||||
|
Execution Complete
|
||||||
|
• Files processed: 18/18
|
||||||
|
• Final unified_id_lookup rows: 98,500
|
||||||
|
|
||||||
|
• Disconnected from Databricks
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Monitoring and Troubleshooting
|
||||||
|
|
||||||
|
### Check Execution Progress
|
||||||
|
During execution, you can monitor:
|
||||||
|
- Databricks SQL Warehouse query history
|
||||||
|
- Delta table sizes and row counts
|
||||||
|
- Execution logs in Databricks workspace
|
||||||
|
|
||||||
|
### Common Issues
|
||||||
|
|
||||||
|
**Issue**: Connection timeout
|
||||||
|
**Solution**: Check network access, verify credentials, ensure SQL Warehouse is running
|
||||||
|
|
||||||
|
**Issue**: Table not found
|
||||||
|
**Solution**: Verify catalog/schema permissions, check source table names in YAML
|
||||||
|
|
||||||
|
**Issue**: Loop doesn't converge
|
||||||
|
**Solution**: Check data quality, increase max_iterations, review key validation rules
|
||||||
|
|
||||||
|
**Issue**: Out of memory
|
||||||
|
**Solution**: Increase SQL Warehouse size, optimize clustering, reduce batch sizes
|
||||||
|
|
||||||
|
**Issue**: Permission denied
|
||||||
|
**Solution**: Verify catalog/schema permissions, check Unity Catalog access controls
|
||||||
|
|
||||||
|
### Performance Optimization
|
||||||
|
- Use larger SQL Warehouse for faster execution
|
||||||
|
- Enable auto-scaling for variable workloads
|
||||||
|
- Optimize Delta tables regularly
|
||||||
|
- Use clustering on frequently joined columns
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Post-Execution Validation
|
||||||
|
**DO NOT RUN THESE VALIDATIONS. JUST PRESENT THEM TO THE USER TO RUN ON DATABRICKS.**
|
||||||
|
|
||||||
|
### Check Coverage
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
COUNT(*) as total_records,
|
||||||
|
COUNT(unified_id) as records_with_id,
|
||||||
|
COUNT(unified_id) * 100.0 / COUNT(*) as coverage_percent
|
||||||
|
FROM catalog.schema.enriched_customer_profiles;
|
||||||
|
```
|
||||||
|
|
||||||
|
### Verify Master Table
|
||||||
|
```sql
|
||||||
|
SELECT COUNT(*) as unified_customers
|
||||||
|
FROM catalog.schema.customer_master;
|
||||||
|
```
|
||||||
|
|
||||||
|
### Review Statistics
|
||||||
|
```sql
|
||||||
|
SELECT * FROM catalog.schema.unified_id_result_key_stats
|
||||||
|
WHERE from_table = '*';
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
Execution successful when:
|
||||||
|
- ✅ All SQL files processed without critical errors
|
||||||
|
- ✅ Unification loop converged (updated_count = 0)
|
||||||
|
- ✅ Canonical IDs generated for all eligible records
|
||||||
|
- ✅ Enriched tables created successfully
|
||||||
|
- ✅ Master tables populated with attributes
|
||||||
|
- ✅ Coverage metrics meet expectations
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Ready to execute your Databricks ID unification workflow?**
|
||||||
|
|
||||||
|
Provide your SQL directory path and Databricks connection details to begin!
|
||||||
401
commands/hybrid-execute-snowflake.md
Normal file
401
commands/hybrid-execute-snowflake.md
Normal file
@@ -0,0 +1,401 @@
|
|||||||
|
---
|
||||||
|
name: hybrid-execute-snowflake
|
||||||
|
description: Execute Snowflake ID unification workflow with convergence detection and monitoring
|
||||||
|
---
|
||||||
|
|
||||||
|
# Execute Snowflake ID Unification Workflow
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Execute your generated Snowflake SQL workflow with intelligent convergence detection, real-time monitoring, and interactive error handling. This command orchestrates the complete unification process from graph creation to master table generation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What You Need
|
||||||
|
|
||||||
|
### Required Inputs
|
||||||
|
1. **SQL Directory**: Path to generated SQL files (e.g., `snowflake_sql/unify/`)
|
||||||
|
2. **Account**: Snowflake account name (e.g., `myaccount` from `myaccount.snowflakecomputing.com`)
|
||||||
|
3. **User**: Snowflake username
|
||||||
|
4. **Database**: Target database name
|
||||||
|
5. **Schema**: Target schema name
|
||||||
|
6. **Warehouse**: Compute warehouse name (defaults to `COMPUTE_WH`)
|
||||||
|
|
||||||
|
### Authentication
|
||||||
|
**Option 1: Password**
|
||||||
|
- Can be provided as an argument, via the environment variable `SNOWFLAKE_PASSWORD`, or from an environment file (.env)
|
||||||
|
- Will prompt if not provided
|
||||||
|
|
||||||
|
**Option 2: SSO (externalbrowser)**
|
||||||
|
- Opens browser for authentication
|
||||||
|
- No password required
|
||||||
|
|
||||||
|
**Option 3: Key-Pair**
|
||||||
|
- Private key path via `SNOWFLAKE_PRIVATE_KEY_PATH`
|
||||||
|
- Passphrase via `SNOWFLAKE_PRIVATE_KEY_PASSPHRASE`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What I'll Do
|
||||||
|
|
||||||
|
### Step 1: Connection Setup
|
||||||
|
- Connect to your Snowflake account
|
||||||
|
- Validate credentials and permissions
|
||||||
|
- Set database and schema context
|
||||||
|
- Verify SQL directory exists
|
||||||
|
- Activate warehouse
|
||||||
|
|
||||||
|
### Step 2: Execution Plan
|
||||||
|
Display execution plan with:
|
||||||
|
- All SQL files in execution order
|
||||||
|
- File types (Setup, Loop Iteration, Enrichment, Master Table, etc.)
|
||||||
|
- Estimated steps and dependencies
|
||||||
|
|
||||||
|
### Step 3: SQL Execution
|
||||||
|
I'll call the **snowflake-workflow-executor agent** to:
|
||||||
|
- Execute SQL files in proper sequence
|
||||||
|
- Skip loop iteration files (handled separately)
|
||||||
|
- Monitor progress with real-time feedback
|
||||||
|
- Track row counts and execution times
|
||||||
|
|
||||||
|
### Step 4: Unify Loop with Convergence Detection
|
||||||
|
**Intelligent Loop Execution**:
|
||||||
|
```
|
||||||
|
Iteration 1:
|
||||||
|
✓ Execute unify SQL
|
||||||
|
• Check convergence: 1500 records updated
|
||||||
|
→ Continue to iteration 2
|
||||||
|
|
||||||
|
Iteration 2:
|
||||||
|
✓ Execute unify SQL
|
||||||
|
• Check convergence: 450 records updated
|
||||||
|
→ Continue to iteration 3
|
||||||
|
|
||||||
|
Iteration 3:
|
||||||
|
✓ Execute unify SQL
|
||||||
|
• Check convergence: 0 records updated
|
||||||
|
✓ CONVERGED! Stop loop
|
||||||
|
```
|
||||||
|
|
||||||
|
**Features**:
|
||||||
|
- Runs until convergence (updated_count = 0)
|
||||||
|
- Maximum 30 iterations safety limit
|
||||||
|
- Creates alias table (loop_final) for downstream processing
|
||||||
|
|
||||||
|
### Step 5: Post-Loop Processing
|
||||||
|
- Execute canonicalization step
|
||||||
|
- Generate result statistics
|
||||||
|
- Enrich source tables with canonical IDs
|
||||||
|
- Create master tables
|
||||||
|
- Generate metadata and lookup tables
|
||||||
|
|
||||||
|
### Step 6: Final Report
|
||||||
|
Provide:
|
||||||
|
- Total execution time
|
||||||
|
- Files processed successfully
|
||||||
|
- Convergence statistics
|
||||||
|
- Final table row counts
|
||||||
|
- Next steps and recommendations
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Command Usage
|
||||||
|
|
||||||
|
### Interactive Mode (Recommended)
|
||||||
|
```
|
||||||
|
/cdp-hybrid-idu:hybrid-execute-snowflake
|
||||||
|
|
||||||
|
I'll prompt you for:
|
||||||
|
- SQL directory path
|
||||||
|
- Snowflake account name
|
||||||
|
- Username
|
||||||
|
- Database and schema
|
||||||
|
- Warehouse name
|
||||||
|
- Authentication method
|
||||||
|
```
|
||||||
|
|
||||||
|
### Advanced Mode
|
||||||
|
Provide all parameters upfront:
|
||||||
|
```
|
||||||
|
SQL directory: snowflake_sql/unify/
|
||||||
|
Account: myaccount
|
||||||
|
User: myuser
|
||||||
|
Database: my_database
|
||||||
|
Schema: my_schema
|
||||||
|
Warehouse: COMPUTE_WH
|
||||||
|
Password: (will prompt if not in environment)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Execution Features
|
||||||
|
|
||||||
|
### 1. Convergence Detection
|
||||||
|
**Algorithm**:
|
||||||
|
```sql
|
||||||
|
SELECT COUNT(*) as updated_count FROM (
|
||||||
|
SELECT leader_ns, leader_id, follower_ns, follower_id
|
||||||
|
FROM current_iteration
|
||||||
|
EXCEPT
|
||||||
|
SELECT leader_ns, leader_id, follower_ns, follower_id
|
||||||
|
FROM previous_iteration
|
||||||
|
) diff
|
||||||
|
```
|
||||||
|
|
||||||
|
**Stops when**: updated_count = 0
|
||||||
|
|
||||||
|
### 2. Interactive Error Handling
|
||||||
|
If an error occurs:
|
||||||
|
```
|
||||||
|
✗ File: 04_unify_loop_iteration_01.sql
|
||||||
|
Error: Table not found: source_table
|
||||||
|
|
||||||
|
Continue with remaining files? (y/n):
|
||||||
|
```
|
||||||
|
|
||||||
|
You can choose to:
|
||||||
|
- Continue: Skip failed file, continue with rest
|
||||||
|
- Stop: Halt execution for investigation
|
||||||
|
|
||||||
|
### 3. Real-Time Monitoring
|
||||||
|
Track progress with:
|
||||||
|
- ✓ Completed steps (green)
|
||||||
|
- • Progress indicators (cyan)
|
||||||
|
- ✗ Failed steps (red)
|
||||||
|
- ⚠ Warnings (yellow)
|
||||||
|
- Row counts and execution times
|
||||||
|
|
||||||
|
### 4. Alias Table Creation
|
||||||
|
After convergence, creates:
|
||||||
|
```sql
|
||||||
|
CREATE OR REPLACE TABLE database.schema.unified_id_graph_unify_loop_final
|
||||||
|
AS SELECT * FROM database.schema.unified_id_graph_unify_loop_3
|
||||||
|
```
|
||||||
|
|
||||||
|
This allows downstream SQL to reference `loop_final` regardless of actual iteration count.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Technical Details
|
||||||
|
|
||||||
|
### Python Script Execution
|
||||||
|
The agent executes:
|
||||||
|
```bash
|
||||||
|
python3 scripts/snowflake/snowflake_sql_executor.py \
|
||||||
|
snowflake_sql/unify/ \
|
||||||
|
--account myaccount \
|
||||||
|
--user myuser \
|
||||||
|
--database my_database \
|
||||||
|
--schema my_schema \
|
||||||
|
--warehouse COMPUTE_WH
|
||||||
|
```
|
||||||
|
|
||||||
|
### Execution Order
|
||||||
|
1. **Setup Phase** (01-03):
|
||||||
|
- Create graph table (loop_0)
|
||||||
|
- Extract and merge identities
|
||||||
|
- Generate source statistics
|
||||||
|
|
||||||
|
2. **Unification Loop** (04):
|
||||||
|
- Run iterations until convergence
|
||||||
|
- Check after EVERY iteration
|
||||||
|
- Stop when updated_count = 0
|
||||||
|
- Create loop_final alias
|
||||||
|
|
||||||
|
3. **Canonicalization** (05):
|
||||||
|
- Create canonical ID lookup
|
||||||
|
- Create keys and tables metadata
|
||||||
|
- Rename final graph table
|
||||||
|
|
||||||
|
4. **Statistics** (06):
|
||||||
|
- Generate result key statistics
|
||||||
|
- Create histograms
|
||||||
|
- Calculate coverage metrics
|
||||||
|
|
||||||
|
5. **Enrichment** (10-19):
|
||||||
|
- Add canonical IDs to source tables
|
||||||
|
- Create enriched_* tables
|
||||||
|
|
||||||
|
6. **Master Tables** (20-29):
|
||||||
|
- Aggregate attributes
|
||||||
|
- Apply priority rules
|
||||||
|
- Create unified customer profiles
|
||||||
|
|
||||||
|
7. **Metadata** (30-39):
|
||||||
|
- Unification metadata
|
||||||
|
- Filter lookup tables
|
||||||
|
- Column lookup tables
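As a rough illustration of how the numeric prefixes above translate into an execution plan (the file layout matches the generated output, but the grouping logic is an assumption, not the real executor code):

```python
from pathlib import Path

def plan_execution(sql_dir: str):
    """Split generated SQL files into pre-loop, loop, and post-loop phases by prefix."""
    files = sorted(Path(sql_dir).glob("*.sql"), key=lambda p: p.name)
    before_loop = [p for p in files if p.name < "04"]             # 01-03 setup
    loop_files  = [p for p in files if p.name.startswith("04_")]  # convergence loop
    after_loop  = [p for p in files if p.name >= "05"]            # 05+ canonicalize onwards
    return before_loop, loop_files, after_loop

# Example: before, loop, after = plan_execution("snowflake_sql/unify/")
```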
|
||||||
|
|
||||||
|
### Connection Management
|
||||||
|
- Establishes single connection for entire workflow
|
||||||
|
- Uses connection pooling for efficiency
|
||||||
|
- Automatic reconnection on timeout
|
||||||
|
- Proper cleanup on completion or error
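A minimal sketch of this single-connection pattern using the standard `snowflake-connector-python` package; the real `snowflake_sql_executor.py` may handle retries, pooling, and authentication differently:

```python
import os
import snowflake.connector

def execute_workflow(statements, account, user, database, schema, warehouse):
    """Open one connection, run every statement on it, and always clean up."""
    conn = snowflake.connector.connect(
        account=account,
        user=user,
        password=os.environ.get("SNOWFLAKE_PASSWORD"),
        database=database,
        schema=schema,
        warehouse=warehouse,
    )
    try:
        cur = conn.cursor()
        for stmt in statements:
            cur.execute(stmt)   # same connection reused for the entire workflow
    finally:
        conn.close()            # cleanup on completion or error
```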
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example Execution
|
||||||
|
|
||||||
|
### Input
|
||||||
|
```
|
||||||
|
SQL directory: snowflake_sql/unify/
|
||||||
|
Account: myorg-myaccount
|
||||||
|
User: analytics_user
|
||||||
|
Database: customer_data
|
||||||
|
Schema: id_unification
|
||||||
|
Warehouse: LARGE_WH
|
||||||
|
```
|
||||||
|
|
||||||
|
### Output
|
||||||
|
```
|
||||||
|
✓ Connected to Snowflake: myorg-myaccount
|
||||||
|
• Using database: customer_data, schema: id_unification
|
||||||
|
|
||||||
|
Starting Snowflake SQL Execution
|
||||||
|
• Database: customer_data
|
||||||
|
• Schema: id_unification
|
||||||
|
|
||||||
|
Executing: 01_create_graph.sql
|
||||||
|
✓ 01_create_graph.sql: Executed successfully
|
||||||
|
|
||||||
|
Executing: 02_extract_merge.sql
|
||||||
|
✓ 02_extract_merge.sql: Executed successfully
|
||||||
|
• Rows affected: 125000
|
||||||
|
|
||||||
|
Executing: 03_source_key_stats.sql
|
||||||
|
✓ 03_source_key_stats.sql: Executed successfully
|
||||||
|
|
||||||
|
Executing Unify Loop Before Canonicalization
|
||||||
|
|
||||||
|
--- Iteration 1 ---
|
||||||
|
✓ Iteration 1 completed
|
||||||
|
• Rows processed: 125000
|
||||||
|
• Updated records: 1500
|
||||||
|
|
||||||
|
--- Iteration 2 ---
|
||||||
|
✓ Iteration 2 completed
|
||||||
|
• Rows processed: 125000
|
||||||
|
• Updated records: 450
|
||||||
|
|
||||||
|
--- Iteration 3 ---
|
||||||
|
✓ Iteration 3 completed
|
||||||
|
• Rows processed: 125000
|
||||||
|
• Updated records: 0
|
||||||
|
✓ Loop converged after 3 iterations
|
||||||
|
|
||||||
|
• Creating alias table for final iteration
|
||||||
|
✓ Alias table 'unified_id_graph_unify_loop_final' created
|
||||||
|
|
||||||
|
Executing: 05_canonicalize.sql
|
||||||
|
✓ 05_canonicalize.sql: Executed successfully
|
||||||
|
|
||||||
|
[... continues with enrichment, master tables, metadata ...]
|
||||||
|
|
||||||
|
Execution Complete
|
||||||
|
• Files processed: 18/18
|
||||||
|
• Final unified_id_lookup rows: 98,500
|
||||||
|
|
||||||
|
• Disconnected from Snowflake
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Monitoring and Troubleshooting
|
||||||
|
|
||||||
|
### Check Execution Progress
|
||||||
|
During execution, you can monitor:
|
||||||
|
- Snowflake query history
|
||||||
|
- Table sizes and row counts
|
||||||
|
- Warehouse utilization
|
||||||
|
- Execution logs
|
||||||
|
|
||||||
|
### Common Issues
|
||||||
|
|
||||||
|
**Issue**: Connection timeout
|
||||||
|
**Solution**: Check network access, verify credentials, ensure warehouse is running
|
||||||
|
|
||||||
|
**Issue**: Table not found
|
||||||
|
**Solution**: Verify database/schema permissions, check source table names in YAML
|
||||||
|
|
||||||
|
**Issue**: Loop doesn't converge
|
||||||
|
**Solution**: Check data quality, increase the iteration limit (`merge_iterations` in the YAML), and review key validation rules
|
||||||
|
|
||||||
|
**Issue**: Warehouse suspended
|
||||||
|
**Solution**: Ensure auto-resume is enabled, manually resume warehouse if needed
|
||||||
|
|
||||||
|
**Issue**: Permission denied
|
||||||
|
**Solution**: Verify database/schema permissions, check role assignments
|
||||||
|
|
||||||
|
### Performance Optimization
|
||||||
|
- Use larger warehouse for faster execution (L, XL, 2XL, etc.)
|
||||||
|
- Enable multi-cluster warehouse for concurrency
|
||||||
|
- Use clustering keys on frequently joined columns
|
||||||
|
- Monitor query profiles for optimization opportunities
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Post-Execution Validation
|
||||||
|
**DO NOT RUN THESE VALIDATION QUERIES. JUST PRESENT THEM TO THE USER TO RUN ON SNOWFLAKE.**
|
||||||
|
|
||||||
|
### Check Coverage
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
COUNT(*) as total_records,
|
||||||
|
COUNT(unified_id) as records_with_id,
|
||||||
|
COUNT(unified_id) * 100.0 / COUNT(*) as coverage_percent
|
||||||
|
FROM database.schema.enriched_customer_profiles;
|
||||||
|
```
|
||||||
|
|
||||||
|
### Verify Master Table
|
||||||
|
```sql
|
||||||
|
SELECT COUNT(*) as unified_customers
|
||||||
|
FROM database.schema.customer_master;
|
||||||
|
```
|
||||||
|
|
||||||
|
### Review Statistics
|
||||||
|
```sql
|
||||||
|
SELECT * FROM database.schema.unified_id_result_key_stats
|
||||||
|
WHERE from_table = '*';
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
Execution successful when:
|
||||||
|
- ✅ All SQL files processed without critical errors
|
||||||
|
- ✅ Unification loop converged (updated_count = 0)
|
||||||
|
- ✅ Canonical IDs generated for all eligible records
|
||||||
|
- ✅ Enriched tables created successfully
|
||||||
|
- ✅ Master tables populated with attributes
|
||||||
|
- ✅ Coverage metrics meet expectations
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Authentication Examples
|
||||||
|
|
||||||
|
### Using Password
|
||||||
|
```bash
|
||||||
|
export SNOWFLAKE_PASSWORD='your_password'
|
||||||
|
/cdp-hybrid-idu:hybrid-execute-snowflake
|
||||||
|
```
|
||||||
|
|
||||||
|
### Using SSO
|
||||||
|
```bash
|
||||||
|
/cdp-hybrid-idu:hybrid-execute-snowflake
|
||||||
|
# Will prompt: Use SSO authentication? (y/n): y
|
||||||
|
# Opens browser for authentication
|
||||||
|
```
|
||||||
|
|
||||||
|
### Using Key-Pair
|
||||||
|
```bash
|
||||||
|
export SNOWFLAKE_PRIVATE_KEY_PATH='/path/to/key.p8'
|
||||||
|
export SNOWFLAKE_PRIVATE_KEY_PASSPHRASE='passphrase'
|
||||||
|
/cdp-hybrid-idu:hybrid-execute-snowflake
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Ready to execute your Snowflake ID unification workflow?**
|
||||||
|
|
||||||
|
Provide your SQL directory path and Snowflake connection details to begin!
|
||||||
285
commands/hybrid-generate-databricks.md
Normal file
@@ -0,0 +1,285 @@
|
|||||||
|
---
|
||||||
|
name: hybrid-generate-databricks
|
||||||
|
description: Generate Databricks Delta Lake SQL from YAML configuration for ID unification
|
||||||
|
---
|
||||||
|
|
||||||
|
# Generate Databricks SQL from YAML
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Generate a production-ready Databricks SQL workflow from your `unify.yml` configuration file. This command creates Delta Lake-optimized SQL files with ACID transactions, clustering, and platform-specific function conversions.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What You Need
|
||||||
|
|
||||||
|
### Required Inputs
|
||||||
|
1. **YAML Configuration File**: Path to your `unify.yml`
|
||||||
|
2. **Target Catalog**: Databricks Unity Catalog name
|
||||||
|
3. **Target Schema**: Schema name within the catalog
|
||||||
|
|
||||||
|
### Optional Inputs
|
||||||
|
4. **Source Catalog**: Catalog containing source tables (defaults to target catalog)
|
||||||
|
5. **Source Schema**: Schema containing source tables (defaults to target schema)
|
||||||
|
6. **Output Directory**: Where to save generated SQL (defaults to `databricks_sql/`)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What I'll Do
|
||||||
|
|
||||||
|
### Step 1: Validation
|
||||||
|
- Verify `unify.yml` exists and is valid
|
||||||
|
- Check YAML syntax and structure
|
||||||
|
- Validate keys, tables, and configuration sections
|
||||||
|
|
||||||
|
### Step 2: SQL Generation
|
||||||
|
I'll call the **databricks-sql-generator agent** to:
|
||||||
|
- Execute `yaml_unification_to_databricks.py` Python script
|
||||||
|
- Apply Databricks-specific SQL conversions:
|
||||||
|
- `ARRAY_SIZE` → `SIZE`
|
||||||
|
- `ARRAY_CONSTRUCT` → `ARRAY`
|
||||||
|
- `OBJECT_CONSTRUCT` → `STRUCT`
|
||||||
|
- `COLLECT_LIST` for aggregations
|
||||||
|
- `FLATTEN` for array operations
|
||||||
|
- `UNIX_TIMESTAMP()` for time functions
|
||||||
|
- Generate Delta Lake table definitions with clustering
|
||||||
|
- Create convergence detection logic
|
||||||
|
- Build cryptographic hashing for canonical IDs
|
||||||
|
|
||||||
|
### Step 3: Output Organization
|
||||||
|
Generate complete SQL workflow in this structure:
|
||||||
|
```
|
||||||
|
databricks_sql/unify/
|
||||||
|
├── 01_create_graph.sql # Initialize graph with USING DELTA
|
||||||
|
├── 02_extract_merge.sql # Extract identities with validation
|
||||||
|
├── 03_source_key_stats.sql # Source statistics with GROUPING SETS
|
||||||
|
├── 04_unify_loop_iteration_*.sql # Loop iterations (auto-calculated count)
|
||||||
|
├── 05_canonicalize.sql # Canonical ID creation with key masks
|
||||||
|
├── 06_result_key_stats.sql # Result statistics with histograms
|
||||||
|
├── 10_enrich_*.sql # Enrich each source table
|
||||||
|
├── 20_master_*.sql # Master tables with attribute aggregation
|
||||||
|
├── 30_unification_metadata.sql # Metadata tables
|
||||||
|
├── 31_filter_lookup.sql # Validation rules lookup
|
||||||
|
└── 32_column_lookup.sql # Column mapping lookup
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 4: Summary Report
|
||||||
|
Provide:
|
||||||
|
- Total SQL files generated
|
||||||
|
- Estimated execution order
|
||||||
|
- Delta Lake optimizations included
|
||||||
|
- Key features enabled
|
||||||
|
- Next steps for execution
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Command Usage
|
||||||
|
|
||||||
|
### Basic Usage
|
||||||
|
```
|
||||||
|
/cdp-hybrid-idu:hybrid-generate-databricks
|
||||||
|
|
||||||
|
I'll prompt you for:
|
||||||
|
- YAML file path
|
||||||
|
- Target catalog
|
||||||
|
- Target schema
|
||||||
|
```
|
||||||
|
|
||||||
|
### Advanced Usage
|
||||||
|
Provide all parameters upfront:
|
||||||
|
```
|
||||||
|
YAML file: /path/to/unify.yml
|
||||||
|
Target catalog: my_catalog
|
||||||
|
Target schema: my_schema
|
||||||
|
Source catalog: source_catalog (optional)
|
||||||
|
Source schema: source_schema (optional)
|
||||||
|
Output directory: custom_output/ (optional)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Generated SQL Features
|
||||||
|
|
||||||
|
### Delta Lake Optimizations
|
||||||
|
- **ACID Transactions**: `USING DELTA` for all tables
|
||||||
|
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
|
||||||
|
- **Table Properties**: Optimized for large-scale joins
|
||||||
|
|
||||||
|
### Advanced Capabilities
|
||||||
|
1. **Dynamic Iteration Count**: Auto-calculates based on:
|
||||||
|
- Number of merge keys
|
||||||
|
- Number of tables
|
||||||
|
- Data complexity (configurable via YAML)
|
||||||
|
|
||||||
|
2. **Key-Specific Hashing**: Each key uses a unique cryptographic mask (see the sketch after this list):
|
||||||
|
```
|
||||||
|
Key Type 1 (email): 0ffdbcf0c666ce190d
|
||||||
|
Key Type 2 (customer_id): 61a821f2b646a4e890
|
||||||
|
Key Type 3 (phone): acd2206c3f88b3ee27
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Validation Rules**:
|
||||||
|
- `valid_regexp`: Regex pattern filtering
|
||||||
|
- `invalid_texts`: NOT IN clause with NULL handling
|
||||||
|
- Combined AND logic for strict validation
|
||||||
|
|
||||||
|
4. **Master Table Attributes**:
|
||||||
|
- Single value: `MAX_BY(attr, order)` with COALESCE
|
||||||
|
- Array values: `SLICE(CONCAT(arrays), 1, N)`
|
||||||
|
- Priority-based selection
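For illustration, the key-specific hashing idea from item 2 above can be sketched as follows; the mask values mirror the examples shown earlier, but the SHA-1 scheme and concatenation order are assumptions, not the generator's actual algorithm:

```python
import hashlib

# Per-key masks keep identical raw values under different key types from colliding.
KEY_MASKS = {
    "email": "0ffdbcf0c666ce190d",
    "customer_id": "61a821f2b646a4e890",
    "phone": "acd2206c3f88b3ee27",
}

def namespaced_hash(key_name: str, value: str) -> str:
    """Hash a normalized identifier value under its key-specific mask."""
    mask = KEY_MASKS[key_name]
    return hashlib.sha1(f"{mask}:{value.strip().lower()}".encode("utf-8")).hexdigest()

print(namespaced_hash("email", "User@Example.com"))
```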
|
||||||
|
|
||||||
|
### Platform-Specific Conversions
|
||||||
|
The generator automatically converts:
|
||||||
|
- Presto functions → Databricks equivalents
|
||||||
|
- Snowflake functions → Databricks equivalents
|
||||||
|
- Array operations → Spark SQL syntax
|
||||||
|
- Window functions → optimized versions
|
||||||
|
- Time functions → UNIX_TIMESTAMP()
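A simplified sketch of this kind of function-name rewrite; the mapping below covers only a few of the conversions listed above, and the real `yaml_unification_to_databricks.py` may rewrite SQL quite differently:

```python
import re

SNOWFLAKE_TO_DATABRICKS = {
    "ARRAY_SIZE": "SIZE",
    "ARRAY_CONSTRUCT": "ARRAY",
    "OBJECT_CONSTRUCT": "STRUCT",
}

def convert_functions(sql: str) -> str:
    """Replace whole-word function names, e.g. ARRAY_SIZE(ids) -> SIZE(ids)."""
    for src, dst in SNOWFLAKE_TO_DATABRICKS.items():
        sql = re.sub(rf"\b{src}\b", dst, sql, flags=re.IGNORECASE)
    return sql

print(convert_functions("SELECT ARRAY_SIZE(ids) FROM t"))  # SELECT SIZE(ids) FROM t
```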
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example Workflow
|
||||||
|
|
||||||
|
### Input YAML (`unify.yml`)
|
||||||
|
```yaml
|
||||||
|
name: customer_unification
|
||||||
|
|
||||||
|
keys:
|
||||||
|
- name: email
|
||||||
|
valid_regexp: ".*@.*"
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: customer_id
|
||||||
|
invalid_texts: ['', 'N/A']
|
||||||
|
|
||||||
|
tables:
|
||||||
|
- table: customer_profiles
|
||||||
|
key_columns:
|
||||||
|
- {column: email_std, key: email}
|
||||||
|
- {column: customer_id, key: customer_id}
|
||||||
|
|
||||||
|
canonical_ids:
|
||||||
|
- name: unified_id
|
||||||
|
merge_by_keys: [email, customer_id]
|
||||||
|
merge_iterations: 15
|
||||||
|
|
||||||
|
master_tables:
|
||||||
|
- name: customer_master
|
||||||
|
canonical_id: unified_id
|
||||||
|
attributes:
|
||||||
|
- name: best_email
|
||||||
|
source_columns:
|
||||||
|
- {table: customer_profiles, column: email_std, priority: 1}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Generated Output
|
||||||
|
```
|
||||||
|
databricks_sql/unify/
|
||||||
|
├── 01_create_graph.sql # Creates unified_id_graph_unify_loop_0
|
||||||
|
├── 02_extract_merge.sql # Merges customer_profiles keys
|
||||||
|
├── 03_source_key_stats.sql # Stats by table
|
||||||
|
├── 04_unify_loop_iteration_01.sql # First iteration
|
||||||
|
├── 04_unify_loop_iteration_02.sql # Second iteration
|
||||||
|
├── ... # Up to iteration_05
|
||||||
|
├── 05_canonicalize.sql # Creates unified_id_lookup
|
||||||
|
├── 06_result_key_stats.sql # Final statistics
|
||||||
|
├── 10_enrich_customer_profiles.sql # Adds unified_id column
|
||||||
|
├── 20_master_customer_master.sql # Creates customer_master table
|
||||||
|
├── 30_unification_metadata.sql # Metadata
|
||||||
|
├── 31_filter_lookup.sql # Validation rules
|
||||||
|
└── 32_column_lookup.sql # Column mappings
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps After Generation
|
||||||
|
|
||||||
|
### Option 1: Execute Immediately
|
||||||
|
Use the execution command:
|
||||||
|
```
|
||||||
|
/cdp-hybrid-idu:hybrid-execute-databricks
|
||||||
|
```
|
||||||
|
|
||||||
|
### Option 2: Review First
|
||||||
|
1. Examine generated SQL files
|
||||||
|
2. Verify table names and transformations
|
||||||
|
3. Test with sample data
|
||||||
|
4. Execute manually or via execution command
|
||||||
|
|
||||||
|
### Option 3: Customize
|
||||||
|
1. Modify generated SQL as needed
|
||||||
|
2. Add custom logic or transformations
|
||||||
|
3. Execute using Databricks SQL editor or execution command
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Technical Details
|
||||||
|
|
||||||
|
### Python Script Execution
|
||||||
|
The agent executes:
|
||||||
|
```bash
|
||||||
|
python3 scripts/databricks/yaml_unification_to_databricks.py \
|
||||||
|
unify.yml \
|
||||||
|
-tc my_catalog \
|
||||||
|
-ts my_schema \
|
||||||
|
-sc source_catalog \
|
||||||
|
-ss source_schema \
|
||||||
|
-o databricks_sql
|
||||||
|
```
|
||||||
|
|
||||||
|
### SQL File Naming Convention
|
||||||
|
- `01-09`: Setup and initialization
|
||||||
|
- `10-19`: Source table enrichment
|
||||||
|
- `20-29`: Master table creation
|
||||||
|
- `30-39`: Metadata and lookup tables
|
||||||
|
- `04_*_NN`: Loop iterations (auto-numbered)
|
||||||
|
|
||||||
|
### Convergence Detection
|
||||||
|
Each loop iteration includes:
|
||||||
|
```sql
|
||||||
|
-- Check if graph changed
|
||||||
|
SELECT COUNT(*) FROM (
|
||||||
|
SELECT leader_ns, leader_id, follower_ns, follower_id
|
||||||
|
FROM iteration_N
|
||||||
|
EXCEPT
|
||||||
|
SELECT leader_ns, leader_id, follower_ns, follower_id
|
||||||
|
FROM iteration_N_minus_1
|
||||||
|
) diff
|
||||||
|
```
|
||||||
|
Stops when count = 0
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Common Issues
|
||||||
|
|
||||||
|
**Issue**: YAML validation error
|
||||||
|
**Solution**: Check YAML syntax, ensure proper indentation, verify all required fields
|
||||||
|
|
||||||
|
**Issue**: Table not found error
|
||||||
|
**Solution**: Verify source catalog/schema, check table names in YAML
|
||||||
|
|
||||||
|
**Issue**: Python script error
|
||||||
|
**Solution**: Ensure Python 3.7+ is installed and the `pyyaml` dependency is available
|
||||||
|
|
||||||
|
**Issue**: Too many/few iterations
|
||||||
|
**Solution**: Adjust `merge_iterations` in canonical_ids section of YAML
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
Generated SQL will:
|
||||||
|
- ✅ Be valid Databricks Spark SQL
|
||||||
|
- ✅ Use Delta Lake for ACID transactions
|
||||||
|
- ✅ Include proper clustering for performance
|
||||||
|
- ✅ Have convergence detection built-in
|
||||||
|
- ✅ Support incremental processing
|
||||||
|
- ✅ Generate comprehensive statistics
|
||||||
|
- ✅ Work without modification on Databricks
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Ready to generate Databricks SQL from your YAML configuration?**
|
||||||
|
|
||||||
|
Provide your YAML file path and target catalog/schema to begin!
|
||||||
288
commands/hybrid-generate-snowflake.md
Normal file
@@ -0,0 +1,288 @@
|
|||||||
|
---
|
||||||
|
name: hybrid-generate-snowflake
|
||||||
|
description: Generate Snowflake SQL from YAML configuration for ID unification
|
||||||
|
---
|
||||||
|
|
||||||
|
# Generate Snowflake SQL from YAML
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Generate a production-ready Snowflake SQL workflow from your `unify.yml` configuration file. This command creates Snowflake-native SQL files with proper clustering, VARIANT support, and platform-specific function conversions.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What You Need
|
||||||
|
|
||||||
|
### Required Inputs
|
||||||
|
1. **YAML Configuration File**: Path to your `unify.yml`
|
||||||
|
2. **Target Database**: Snowflake database name
|
||||||
|
3. **Target Schema**: Schema name within the database
|
||||||
|
|
||||||
|
### Optional Inputs
|
||||||
|
4. **Source Database**: Database containing source tables (defaults to target database)
|
||||||
|
5. **Source Schema**: Schema containing source tables (defaults to PUBLIC)
|
||||||
|
6. **Output Directory**: Where to save generated SQL (defaults to `snowflake_sql/`)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What I'll Do
|
||||||
|
|
||||||
|
### Step 1: Validation
|
||||||
|
- Verify `unify.yml` exists and is valid
|
||||||
|
- Check YAML syntax and structure
|
||||||
|
- Validate keys, tables, and configuration sections
|
||||||
|
|
||||||
|
### Step 2: SQL Generation
|
||||||
|
I'll call the **snowflake-sql-generator agent** to:
|
||||||
|
- Execute `yaml_unification_to_snowflake.py` Python script
|
||||||
|
- Generate Snowflake table definitions with clustering
|
||||||
|
- Create convergence detection logic
|
||||||
|
- Build cryptographic hashing for canonical IDs
|
||||||
|
|
||||||
|
### Step 3: Output Organization
|
||||||
|
Generate complete SQL workflow in this structure:
|
||||||
|
```
|
||||||
|
snowflake_sql/unify/
|
||||||
|
├── 01_create_graph.sql # Initialize graph table
|
||||||
|
├── 02_extract_merge.sql # Extract identities with validation
|
||||||
|
├── 03_source_key_stats.sql # Source statistics with GROUPING SETS
|
||||||
|
├── 04_unify_loop_iteration_*.sql # Loop iterations (auto-calculated count)
|
||||||
|
├── 05_canonicalize.sql # Canonical ID creation with key masks
|
||||||
|
├── 06_result_key_stats.sql # Result statistics with histograms
|
||||||
|
├── 10_enrich_*.sql # Enrich each source table
|
||||||
|
├── 20_master_*.sql # Master tables with attribute aggregation
|
||||||
|
├── 30_unification_metadata.sql # Metadata tables
|
||||||
|
├── 31_filter_lookup.sql # Validation rules lookup
|
||||||
|
└── 32_column_lookup.sql # Column mapping lookup
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 4: Summary Report
|
||||||
|
Provide:
|
||||||
|
- Total SQL files generated
|
||||||
|
- Estimated execution order
|
||||||
|
- Snowflake optimizations included
|
||||||
|
- Key features enabled
|
||||||
|
- Next steps for execution
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Command Usage
|
||||||
|
|
||||||
|
### Basic Usage
|
||||||
|
```
|
||||||
|
/cdp-hybrid-idu:hybrid-generate-snowflake
|
||||||
|
|
||||||
|
I'll prompt you for:
|
||||||
|
- YAML file path
|
||||||
|
- Target database
|
||||||
|
- Target schema
|
||||||
|
```
|
||||||
|
|
||||||
|
### Advanced Usage
|
||||||
|
Provide all parameters upfront:
|
||||||
|
```
|
||||||
|
YAML file: /path/to/unify.yml
|
||||||
|
Target database: my_database
|
||||||
|
Target schema: my_schema
|
||||||
|
Source database: source_database (optional)
|
||||||
|
Source schema: PUBLIC (optional, defaults to PUBLIC)
|
||||||
|
Output directory: custom_output/ (optional)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Generated SQL Features
|
||||||
|
|
||||||
|
### Snowflake Optimizations
|
||||||
|
- **Clustering**: `CLUSTER BY (follower_id)` on graph tables
|
||||||
|
- **VARIANT Support**: Flexible data structures for arrays and objects
|
||||||
|
- **Native Functions**: Snowflake-specific optimized functions
|
||||||
|
|
||||||
|
### Advanced Capabilities
|
||||||
|
1. **Dynamic Iteration Count**: Auto-calculates based on:
|
||||||
|
- Number of merge keys
|
||||||
|
- Number of tables
|
||||||
|
- Data complexity (configurable via YAML)
|
||||||
|
|
||||||
|
2. **Key-Specific Hashing**: Each key uses a unique cryptographic mask:
|
||||||
|
```
|
||||||
|
Key Type 1 (email): 0ffdbcf0c666ce190d
|
||||||
|
Key Type 2 (customer_id): 61a821f2b646a4e890
|
||||||
|
Key Type 3 (phone): acd2206c3f88b3ee27
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Validation Rules**:
|
||||||
|
- `valid_regexp`: REGEXP_LIKE pattern filtering
|
||||||
|
- `invalid_texts`: NOT IN clause with proper NULL handling
|
||||||
|
- Combined AND logic for strict validation
|
||||||
|
|
||||||
|
4. **Master Table Attributes** (see the sketch after this list):
|
||||||
|
- Single value: `MAX_BY(attr, order)` with COALESCE
|
||||||
|
- Array values: `ARRAY_SLICE(ARRAY_CAT(arrays), 0, N)`
|
||||||
|
- Priority-based selection
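To make item 4 concrete, here is an illustrative helper that turns a `source_columns` spec into the MAX_BY/COALESCE expression described above; the generated fragment is an assumption about the output shape, not the exact SQL the generator emits:

```python
def single_value_attribute(name: str, source_columns: list) -> str:
    """Build a priority-ordered COALESCE of MAX_BY expressions for one attribute."""
    ordered = sorted(source_columns, key=lambda c: c["priority"])
    parts = [
        f"MAX_BY({c['table']}.{c['column']}, {c['table']}.{c['order_by']})"
        for c in ordered
    ]
    return f"COALESCE({', '.join(parts)}) AS {name}"

spec = [
    {"table": "customer_profiles", "column": "email_std", "order_by": "time", "priority": 1},
    {"table": "orders", "column": "email_address", "order_by": "time", "priority": 2},
]
print(single_value_attribute("best_email", spec))
```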
|
||||||
|
|
||||||
|
### Platform-Specific Conversions
|
||||||
|
The generator automatically converts:
|
||||||
|
- Presto functions → Snowflake equivalents
|
||||||
|
- Databricks functions → Snowflake equivalents
|
||||||
|
- Array operations → ARRAY_CONSTRUCT/FLATTEN syntax
|
||||||
|
- Window functions → optimized versions
|
||||||
|
- Time functions → DATE_PART(epoch_second, CURRENT_TIMESTAMP())
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example Workflow
|
||||||
|
|
||||||
|
### Input YAML (`unify.yml`)
|
||||||
|
```yaml
|
||||||
|
name: customer_unification
|
||||||
|
|
||||||
|
keys:
|
||||||
|
- name: email
|
||||||
|
valid_regexp: ".*@.*"
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: customer_id
|
||||||
|
invalid_texts: ['', 'N/A']
|
||||||
|
|
||||||
|
tables:
|
||||||
|
- table: customer_profiles
|
||||||
|
key_columns:
|
||||||
|
- {column: email_std, key: email}
|
||||||
|
- {column: customer_id, key: customer_id}
|
||||||
|
|
||||||
|
canonical_ids:
|
||||||
|
- name: unified_id
|
||||||
|
merge_by_keys: [email, customer_id]
|
||||||
|
merge_iterations: 15
|
||||||
|
|
||||||
|
master_tables:
|
||||||
|
- name: customer_master
|
||||||
|
canonical_id: unified_id
|
||||||
|
attributes:
|
||||||
|
- name: best_email
|
||||||
|
source_columns:
|
||||||
|
- {table: customer_profiles, column: email_std, priority: 1}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Generated Output
|
||||||
|
```
|
||||||
|
snowflake_sql/unify/
|
||||||
|
├── 01_create_graph.sql # Creates unified_id_graph_unify_loop_0
|
||||||
|
├── 02_extract_merge.sql # Merges customer_profiles keys
|
||||||
|
├── 03_source_key_stats.sql # Stats by table
|
||||||
|
├── 04_unify_loop_iteration_01.sql # First iteration
|
||||||
|
├── 04_unify_loop_iteration_02.sql # Second iteration
|
||||||
|
├── ... # Up to iteration_05
|
||||||
|
├── 05_canonicalize.sql # Creates unified_id_lookup
|
||||||
|
├── 06_result_key_stats.sql # Final statistics
|
||||||
|
├── 10_enrich_customer_profiles.sql # Adds unified_id column
|
||||||
|
├── 20_master_customer_master.sql # Creates customer_master table
|
||||||
|
├── 30_unification_metadata.sql # Metadata
|
||||||
|
├── 31_filter_lookup.sql # Validation rules
|
||||||
|
└── 32_column_lookup.sql # Column mappings
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps After Generation
|
||||||
|
|
||||||
|
### Option 1: Execute Immediately
|
||||||
|
Use the execution command:
|
||||||
|
```
|
||||||
|
/cdp-hybrid-idu:hybrid-execute-snowflake
|
||||||
|
```
|
||||||
|
|
||||||
|
### Option 2: Review First
|
||||||
|
1. Examine generated SQL files
|
||||||
|
2. Verify table names and transformations
|
||||||
|
3. Test with sample data
|
||||||
|
4. Execute manually or via execution command
|
||||||
|
|
||||||
|
### Option 3: Customize
|
||||||
|
1. Modify generated SQL as needed
|
||||||
|
2. Add custom logic or transformations
|
||||||
|
3. Execute using Snowflake SQL worksheet or execution command
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Technical Details
|
||||||
|
|
||||||
|
### Python Script Execution
|
||||||
|
The agent executes:
|
||||||
|
```bash
|
||||||
|
python3 scripts/snowflake/yaml_unification_to_snowflake.py \
|
||||||
|
unify.yml \
|
||||||
|
-d my_database \
|
||||||
|
-s my_schema \
|
||||||
|
-sd source_database \
|
||||||
|
-ss source_schema \
|
||||||
|
-o snowflake_sql
|
||||||
|
```
|
||||||
|
|
||||||
|
### SQL File Naming Convention
|
||||||
|
- `01-09`: Setup and initialization
|
||||||
|
- `10-19`: Source table enrichment
|
||||||
|
- `20-29`: Master table creation
|
||||||
|
- `30-39`: Metadata and lookup tables
|
||||||
|
- `04_*_NN`: Loop iterations (auto-numbered)
|
||||||
|
|
||||||
|
### Convergence Detection
|
||||||
|
Each loop iteration includes:
|
||||||
|
```sql
|
||||||
|
-- Check if graph changed
|
||||||
|
SELECT COUNT(*) FROM (
|
||||||
|
SELECT leader_ns, leader_id, follower_ns, follower_id
|
||||||
|
FROM iteration_N
|
||||||
|
EXCEPT
|
||||||
|
SELECT leader_ns, leader_id, follower_ns, follower_id
|
||||||
|
FROM iteration_N_minus_1
|
||||||
|
) diff
|
||||||
|
```
|
||||||
|
Stops when count = 0
|
||||||
|
|
||||||
|
### Snowflake-Specific Features
|
||||||
|
- **LATERAL FLATTEN**: Array expansion for id_ns_array processing
|
||||||
|
- **ARRAY_CONSTRUCT**: Building arrays from multiple columns
|
||||||
|
- **OBJECT_CONSTRUCT**: Creating structured objects for key-value pairs
|
||||||
|
- **ARRAYS_OVERLAP**: Checking array membership
|
||||||
|
- **SPLIT_PART**: String splitting for leader key parsing
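For example, the LATERAL FLATTEN pattern referenced above might appear in the generated SQL roughly like this; the table name follows the loop_final alias used earlier, but the internal layout of `id_ns_array` is an assumption:

```python
# Illustrative query string only; the generator's actual FLATTEN usage may differ.
FLATTEN_EXAMPLE = """
SELECT g.leader_id,
       f.value:id::STRING AS follower_id,
       f.value:ns::STRING AS follower_ns
FROM unified_id_graph_unify_loop_final g,
     LATERAL FLATTEN(input => g.id_ns_array) f
"""
print(FLATTEN_EXAMPLE)
```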
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Common Issues
|
||||||
|
|
||||||
|
**Issue**: YAML validation error
|
||||||
|
**Solution**: Check YAML syntax, ensure proper indentation, verify all required fields
|
||||||
|
|
||||||
|
**Issue**: Table not found error
|
||||||
|
**Solution**: Verify source database/schema, check table names in YAML
|
||||||
|
|
||||||
|
**Issue**: Python script error
|
||||||
|
**Solution**: Ensure Python 3.7+ is installed and the `pyyaml` dependency is available
|
||||||
|
|
||||||
|
**Issue**: Too many/few iterations
|
||||||
|
**Solution**: Adjust `merge_iterations` in canonical_ids section of YAML
|
||||||
|
|
||||||
|
**Issue**: VARIANT column errors
|
||||||
|
**Solution**: Snowflake handles VARIANT types automatically; ensure proper casting in any custom SQL
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
Generated SQL will:
|
||||||
|
- ✅ Be valid Snowflake SQL
|
||||||
|
- ✅ Use native Snowflake functions
|
||||||
|
- ✅ Include proper clustering for performance
|
||||||
|
- ✅ Have convergence detection built-in
|
||||||
|
- ✅ Support VARIANT types for flexible data
|
||||||
|
- ✅ Generate comprehensive statistics
|
||||||
|
- ✅ Work without modification on Snowflake
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Ready to generate Snowflake SQL from your YAML configuration?**
|
||||||
|
|
||||||
|
Provide your YAML file path and target database/schema to begin!
|
||||||
308
commands/hybrid-setup.md
Normal file
@@ -0,0 +1,308 @@
|
|||||||
|
---
|
||||||
|
name: hybrid-setup
|
||||||
|
description: Complete end-to-end hybrid ID unification setup - automatically analyzes tables, generates config, creates SQL, and executes workflow for Snowflake and Databricks
|
||||||
|
---
|
||||||
|
|
||||||
|
# Hybrid ID Unification Complete Setup
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
I'll guide you through the complete hybrid ID unification setup process for Snowflake and/or Databricks platforms. This is an **automated, end-to-end workflow** that will:
|
||||||
|
|
||||||
|
1. **Analyze your tables automatically** using platform MCP tools with strict PII detection
|
||||||
|
2. **Generate YAML configuration** from real schema and data analysis
|
||||||
|
3. **Choose target platform(s)** (Snowflake, Databricks, or both)
|
||||||
|
4. **Generate platform-specific SQL** optimized for each engine
|
||||||
|
5. **Execute workflows** with convergence detection and monitoring
|
||||||
|
6. **Provide deployment guidance** and operating instructions
|
||||||
|
|
||||||
|
**Key Features**:
|
||||||
|
- 🔍 **Automated Table Analysis**: Uses Snowflake/Databricks MCP tools to analyze actual tables
|
||||||
|
- ✅ **Strict PII Detection**: Zero tolerance - only includes tables with real user identifiers
|
||||||
|
- 📊 **Real Data Validation**: Queries actual data to validate patterns and quality
|
||||||
|
- 🎯 **Smart Recommendations**: Expert analysis provides merge strategy and priorities
|
||||||
|
- 🚀 **End-to-End Automation**: From table analysis to workflow execution
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What You Need to Provide
|
||||||
|
|
||||||
|
### 1. Unification Requirements (For Automated Analysis)
|
||||||
|
- **Platform**: Snowflake or Databricks
|
||||||
|
- **Tables**: List of source tables to analyze
|
||||||
|
- Format (Snowflake): `database.schema.table` or `schema.table` or `table`
|
||||||
|
- Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
|
||||||
|
- **Canonical ID Name**: Name for your unified ID (e.g., `td_id`, `unified_customer_id`)
|
||||||
|
- **Merge Iterations**: Number of unification loops (default: 10)
|
||||||
|
- **Master Tables**: (Optional) Attribute aggregation specifications
|
||||||
|
|
||||||
|
**Note**: The system will automatically:
|
||||||
|
- Extract user identifiers from actual table schemas
|
||||||
|
- Validate data patterns from real data
|
||||||
|
- Apply appropriate validation rules based on data analysis
|
||||||
|
- Generate merge strategy recommendations
|
||||||
|
|
||||||
|
### 2. Platform Selection
|
||||||
|
- **Databricks**: Unity Catalog with Delta Lake
|
||||||
|
- **Snowflake**: Database with proper permissions
|
||||||
|
- **Both**: Generate SQL for both platforms
|
||||||
|
|
||||||
|
### 3. Target Configurations
|
||||||
|
|
||||||
|
**For Databricks**:
|
||||||
|
- **Catalog**: Target catalog name
|
||||||
|
- **Schema**: Target schema name
|
||||||
|
- **Source Catalog** (optional): Source data catalog
|
||||||
|
- **Source Schema** (optional): Source data schema
|
||||||
|
|
||||||
|
**For Snowflake**:
|
||||||
|
- **Database**: Target database name
|
||||||
|
- **Schema**: Target schema name
|
||||||
|
- **Source Schema** (optional): Source data schema
|
||||||
|
|
||||||
|
### 4. Execution Credentials (if executing)
|
||||||
|
|
||||||
|
**For Databricks**:
|
||||||
|
- **Server Hostname**: your-workspace.databricks.com
|
||||||
|
- **HTTP Path**: /sql/1.0/warehouses/your-warehouse-id
|
||||||
|
- **Authentication**: PAT (Personal Access Token) or OAuth
|
||||||
|
|
||||||
|
**For Snowflake**:
|
||||||
|
- **Account**: Snowflake account name
|
||||||
|
- **User**: Username
|
||||||
|
- **Password**: Password or use SSO/key-pair
|
||||||
|
- **Warehouse**: Compute warehouse name
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What I'll Do
|
||||||
|
|
||||||
|
### Step 1: Automated YAML Configuration Generation
|
||||||
|
I'll use the **hybrid-unif-config-creator** command to automatically generate your `unify.yml` file:
|
||||||
|
|
||||||
|
**Automated Analysis Approach** (Recommended):
|
||||||
|
- Analyze your actual tables using platform MCP tools (Snowflake/Databricks)
|
||||||
|
- Extract user identifiers with STRICT PII detection (zero tolerance for guessing)
|
||||||
|
- Validate data patterns from real table data
|
||||||
|
- Generate unify.yml with exact template compliance
|
||||||
|
- Only include tables with actual user identifiers
|
||||||
|
- Document excluded tables with detailed reasons
|
||||||
|
|
||||||
|
**What I'll do**:
|
||||||
|
- Call the **hybrid-unif-keys-extractor agent** to analyze tables
|
||||||
|
- Query actual schema and data using platform MCP tools
|
||||||
|
- Detect valid user identifiers (email, customer_id, phone, etc.)
|
||||||
|
- Exclude tables without PII with full documentation
|
||||||
|
- Generate production-ready unify.yml automatically
|
||||||
|
|
||||||
|
**Alternative - Manual Configuration**:
|
||||||
|
- If MCP tools are unavailable, I'll guide you through manual configuration
|
||||||
|
- Interactive prompts for keys, tables, and validation rules
|
||||||
|
- Step-by-step YAML building with validation
|
||||||
|
|
||||||
|
### Step 2: Platform Selection and Configuration
|
||||||
|
I'll help you:
|
||||||
|
- Choose between Databricks, Snowflake, or both
|
||||||
|
- Collect platform-specific configuration (catalog/database, schema names)
|
||||||
|
- Determine source/target separation strategy
|
||||||
|
- Decide on execution or generation-only mode
|
||||||
|
|
||||||
|
### Step 3: SQL Generation
|
||||||
|
|
||||||
|
**For Databricks** (if selected):
|
||||||
|
I'll call the **databricks-sql-generator agent** to:
|
||||||
|
- Execute `yaml_unification_to_databricks.py` script
|
||||||
|
- Generate Delta Lake optimized SQL workflow
|
||||||
|
- Create output directory: `databricks_sql/unify/`
|
||||||
|
- Generate 15+ SQL files with proper execution order
|
||||||
|
|
||||||
|
**For Snowflake** (if selected):
|
||||||
|
I'll call the **snowflake-sql-generator agent** to:
|
||||||
|
- Execute `yaml_unification_to_snowflake.py` script
|
||||||
|
- Generate Snowflake-native SQL workflow
|
||||||
|
- Create output directory: `snowflake_sql/unify/`
|
||||||
|
- Generate 15+ SQL files with proper execution order
|
||||||
|
|
||||||
|
### Step 4: Workflow Execution (Optional)
|
||||||
|
|
||||||
|
**For Databricks** (if execution requested):
|
||||||
|
I'll call the **databricks-workflow-executor agent** to:
|
||||||
|
- Execute `databricks_sql_executor.py` script
|
||||||
|
- Connect to your Databricks workspace
|
||||||
|
- Run SQL files in proper sequence
|
||||||
|
- Monitor convergence and progress
|
||||||
|
- Optimize Delta tables
|
||||||
|
- Report final statistics
|
||||||
|
|
||||||
|
**For Snowflake** (if execution requested):
|
||||||
|
I'll call the **snowflake-workflow-executor agent** to:
|
||||||
|
- Execute `snowflake_sql_executor.py` script
|
||||||
|
- Connect to your Snowflake account
|
||||||
|
- Run SQL files in proper sequence
|
||||||
|
- Monitor convergence and progress
|
||||||
|
- Report final statistics
|
||||||
|
|
||||||
|
### Step 5: Deployment Guidance
|
||||||
|
I'll provide:
|
||||||
|
- Configuration summary
|
||||||
|
- Generated files overview
|
||||||
|
- Deployment instructions
|
||||||
|
- Operating guidelines
|
||||||
|
- Monitoring recommendations
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Interactive Workflow
|
||||||
|
|
||||||
|
This command orchestrates the complete end-to-end flow by calling specialized commands in sequence:
|
||||||
|
|
||||||
|
### Phase 1: Configuration Creation
|
||||||
|
**I'll ask you for**:
|
||||||
|
- Platform (Snowflake or Databricks)
|
||||||
|
- Tables to analyze
|
||||||
|
- Canonical ID name
|
||||||
|
- Merge iterations
|
||||||
|
|
||||||
|
**Then I'll**:
|
||||||
|
- Call `/cdp-hybrid-idu:hybrid-unif-config-creator` internally
|
||||||
|
- Analyze your tables automatically
|
||||||
|
- Generate `unify.yml` with strict PII detection
|
||||||
|
- Show you the configuration for review
|
||||||
|
|
||||||
|
### Phase 2: SQL Generation
|
||||||
|
**I'll ask you**:
|
||||||
|
- Which platform(s) to generate SQL for (can be different from source)
|
||||||
|
- Output directory preferences
|
||||||
|
|
||||||
|
**Then I'll**:
|
||||||
|
- Call `/cdp-hybrid-idu:hybrid-generate-snowflake` (if Snowflake selected)
|
||||||
|
- Call `/cdp-hybrid-idu:hybrid-generate-databricks` (if Databricks selected)
|
||||||
|
- Generate 15+ optimized SQL files per platform
|
||||||
|
- Show you the execution plan
|
||||||
|
|
||||||
|
### Phase 3: Workflow Execution (Optional)
|
||||||
|
**I'll ask you**:
|
||||||
|
- Do you want to execute now or later?
|
||||||
|
- Connection credentials if executing
|
||||||
|
|
||||||
|
**Then I'll**:
|
||||||
|
- Call `/cdp-hybrid-idu:hybrid-execute-snowflake` (if Snowflake selected)
|
||||||
|
- Call `/cdp-hybrid-idu:hybrid-execute-databricks` (if Databricks selected)
|
||||||
|
- Monitor convergence and progress
|
||||||
|
- Report final statistics
|
||||||
|
|
||||||
|
**Throughout the process**:
|
||||||
|
- **Questions**: When I need your input
|
||||||
|
- **Suggestions**: Recommended approaches based on best practices
|
||||||
|
- **Validation**: Real-time checks on your choices
|
||||||
|
- **Explanations**: Help you understand concepts and options
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Expected Output
|
||||||
|
|
||||||
|
### Files Created (Platform-specific):
|
||||||
|
|
||||||
|
**For Databricks**:
|
||||||
|
```
|
||||||
|
databricks_sql/unify/
|
||||||
|
├── 01_create_graph.sql # Initialize identity graph
|
||||||
|
├── 02_extract_merge.sql # Extract and merge identities
|
||||||
|
├── 03_source_key_stats.sql # Source statistics
|
||||||
|
├── 04_unify_loop_iteration_*.sql # Iterative unification (N files)
|
||||||
|
├── 05_canonicalize.sql # Canonical ID creation
|
||||||
|
├── 06_result_key_stats.sql # Result statistics
|
||||||
|
├── 10_enrich_*.sql # Source table enrichment (N files)
|
||||||
|
├── 20_master_*.sql # Master table creation (N files)
|
||||||
|
├── 30_unification_metadata.sql # Metadata tables
|
||||||
|
├── 31_filter_lookup.sql # Validation rules
|
||||||
|
└── 32_column_lookup.sql # Column mappings
|
||||||
|
```
|
||||||
|
|
||||||
|
**For Snowflake**:
|
||||||
|
```
|
||||||
|
snowflake_sql/unify/
|
||||||
|
├── 01_create_graph.sql # Initialize identity graph
|
||||||
|
├── 02_extract_merge.sql # Extract and merge identities
|
||||||
|
├── 03_source_key_stats.sql # Source statistics
|
||||||
|
├── 04_unify_loop_iteration_*.sql # Iterative unification (N files)
|
||||||
|
├── 05_canonicalize.sql # Canonical ID creation
|
||||||
|
├── 06_result_key_stats.sql # Result statistics
|
||||||
|
├── 10_enrich_*.sql # Source table enrichment (N files)
|
||||||
|
├── 20_master_*.sql # Master table creation (N files)
|
||||||
|
├── 30_unification_metadata.sql # Metadata tables
|
||||||
|
├── 31_filter_lookup.sql # Validation rules
|
||||||
|
└── 32_column_lookup.sql # Column mappings
|
||||||
|
```
|
||||||
|
|
||||||
|
**Configuration**:
|
||||||
|
```
|
||||||
|
unify.yml # YAML configuration (created interactively)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
All generated files will:
|
||||||
|
- ✅ Be platform-optimized and production-ready
|
||||||
|
- ✅ Use proper SQL dialects (Databricks Spark SQL or Snowflake SQL)
|
||||||
|
- ✅ Include convergence detection logic
|
||||||
|
- ✅ Support incremental processing
|
||||||
|
- ✅ Generate comprehensive statistics
|
||||||
|
- ✅ Work without modification on target platforms
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Getting Started
|
||||||
|
|
||||||
|
**Ready to begin?** I'll use the **hybrid-unif-config-creator** to automatically analyze your tables and generate the YAML configuration.
|
||||||
|
|
||||||
|
Please provide:
|
||||||
|
|
||||||
|
1. **Platform**: Which platform contains your data?
|
||||||
|
- Snowflake or Databricks
|
||||||
|
|
||||||
|
2. **Tables**: Which source tables should I analyze?
|
||||||
|
- Format (Snowflake): `database.schema.table` or `schema.table` or `table`
|
||||||
|
- Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
|
||||||
|
- Example: `customer_db.public.customers`, `orders`, `web_events.user_activity`
|
||||||
|
|
||||||
|
3. **Canonical ID Name**: What should I call the unified ID?
|
||||||
|
- Example: `td_id`, `unified_customer_id`, `master_id`
|
||||||
|
- Default: `td_id`
|
||||||
|
|
||||||
|
4. **Merge Iterations** (optional): How many unification loops?
|
||||||
|
- Default: 10
|
||||||
|
- Range: 2-30
|
||||||
|
|
||||||
|
5. **Target Platform(s)** for SQL generation:
|
||||||
|
- Same as source, or generate for both platforms
|
||||||
|
|
||||||
|
**Example**:
|
||||||
|
```
|
||||||
|
I want to set up hybrid ID unification for:
|
||||||
|
|
||||||
|
Platform: Snowflake
|
||||||
|
Tables:
|
||||||
|
- customer_db.public.customer_profiles
|
||||||
|
- customer_db.public.orders
|
||||||
|
- marketing_db.public.campaigns
|
||||||
|
- event_db.public.web_events
|
||||||
|
|
||||||
|
Canonical ID: unified_customer_id
|
||||||
|
Merge Iterations: 10
|
||||||
|
Generate SQL for: Snowflake (or both Snowflake and Databricks)
|
||||||
|
```
|
||||||
|
|
||||||
|
**What I'll do next**:
|
||||||
|
1. ✅ Analyze your tables using Snowflake MCP tools
|
||||||
|
2. ✅ Extract user identifiers with strict PII detection
|
||||||
|
3. ✅ Generate unify.yml automatically
|
||||||
|
4. ✅ Generate platform-specific SQL files
|
||||||
|
5. ✅ Execute workflow (if requested)
|
||||||
|
6. ✅ Provide deployment guidance
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Let's get started with your hybrid ID unification setup!**
|
||||||
491
commands/hybrid-unif-config-creator.md
Normal file
@@ -0,0 +1,491 @@
|
|||||||
|
---
|
||||||
|
name: hybrid-unif-config-creator
|
||||||
|
description: Auto-generate unify.yml configuration for Snowflake/Databricks by extracting user identifiers from actual tables using strict PII detection
|
||||||
|
---
|
||||||
|
|
||||||
|
# Unify Configuration Creator for Snowflake/Databricks
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
I'll automatically generate a production-ready `unify.yml` configuration file for your Snowflake or Databricks ID unification by:
|
||||||
|
|
||||||
|
1. **Analyzing your actual tables** using platform-specific MCP tools
|
||||||
|
2. **Extracting user identifiers** with zero-tolerance PII detection
|
||||||
|
3. **Validating data patterns** from real table data
|
||||||
|
4. **Generating unify.yml** using the exact template format
|
||||||
|
5. **Providing recommendations** for merge strategies and priorities
|
||||||
|
|
||||||
|
**This command uses STRICT analysis - only tables with actual user identifiers will be included.**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What You Need to Provide
|
||||||
|
|
||||||
|
### 1. Platform Selection
|
||||||
|
- **Snowflake**: For Snowflake databases
|
||||||
|
- **Databricks**: For Databricks Unity Catalog tables
|
||||||
|
|
||||||
|
### 2. Tables to Analyze
|
||||||
|
Provide tables you want to analyze for ID unification:
|
||||||
|
- **Format (Snowflake)**: `database.schema.table` or `schema.table` or `table`
|
||||||
|
- **Format (Databricks)**: `catalog.schema.table` or `schema.table` or `table`
|
||||||
|
- **Example**: `customer_data.public.customers`, `orders`, `web_events.user_activity`
|
||||||
|
|
||||||
|
### 3. Canonical ID Configuration
|
||||||
|
- **Name**: Name for your unified ID (default: `td_id`)
|
||||||
|
- **Merge Iterations**: Number of unification loop iterations (default: 10)
|
||||||
|
- **Incremental Iterations**: Iterations for incremental processing (default: 5)
|
||||||
|
|
||||||
|
### 4. Output Configuration (Optional)
|
||||||
|
- **Output File**: Where to save unify.yml (default: `unify.yml`)
|
||||||
|
- **Template Path**: Path to template if using custom (default: uses built-in exact template)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What I'll Do
|
||||||
|
|
||||||
|
### Step 1: Platform Detection and Validation
|
||||||
|
```
|
||||||
|
1. Confirm platform (Snowflake or Databricks)
|
||||||
|
2. Verify MCP tools are available for the platform
|
||||||
|
3. Set up platform-specific query patterns
|
||||||
|
4. Inform you of the analysis approach
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 2: Key Extraction with hybrid-unif-keys-extractor Agent
|
||||||
|
I'll launch the **hybrid-unif-keys-extractor agent** to:
|
||||||
|
|
||||||
|
**Schema Analysis**:
|
||||||
|
- Use platform MCP tools to describe each table
|
||||||
|
- Extract exact column names and data types
|
||||||
|
- Identify accessible vs inaccessible tables
|
||||||
|
|
||||||
|
**User Identifier Detection**:
|
||||||
|
- Apply STRICT matching rules for user identifiers:
|
||||||
|
- ✅ Email columns (email, email_std, email_address, etc.)
|
||||||
|
- ✅ Phone columns (phone, phone_number, mobile_phone, etc.)
|
||||||
|
- ✅ User IDs (user_id, customer_id, account_id, etc.)
|
||||||
|
- ✅ Cookie/Device IDs (td_client_id, cookie_id, etc.)
|
||||||
|
- ❌ System columns (id, created_at, time, etc.)
|
||||||
|
- ❌ Complex types (arrays, maps, objects, variants, structs)
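A minimal sketch of the strict classification above — the pattern and exclusion lists are illustrative, not the agent's actual rules:

```python
import re

IDENTIFIER_PATTERNS = [r"email", r"phone", r"(user|customer|account)_id", r"(td_client|cookie)_id"]
SYSTEM_COLUMNS = {"id", "created_at", "updated_at", "time"}
COMPLEX_TYPES = {"ARRAY", "MAP", "OBJECT", "VARIANT", "STRUCT"}

def is_user_identifier(column: str, data_type: str) -> bool:
    """Accept a column only if it matches an identifier pattern and is a simple scalar type."""
    name = column.lower()
    if name in SYSTEM_COLUMNS or data_type.upper().split("(")[0] in COMPLEX_TYPES:
        return False
    return any(re.search(p, name) for p in IDENTIFIER_PATTERNS)

print(is_user_identifier("email_std", "VARCHAR(255)"))  # True
print(is_user_identifier("session_id", "VARCHAR(64)"))  # False - not a user identifier
```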
|
||||||
|
|
||||||
|
**Data Validation**:
|
||||||
|
- Query actual MIN/MAX values from each identified column
|
||||||
|
- Analyze data patterns and quality
|
||||||
|
- Count unique values per identifier
|
||||||
|
- Detect data quality issues
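The data-validation queries could look roughly like the sketch below; the exact statements the agent issues through the MCP tools may differ:

```python
def profile_query(table_fqn: str, column: str) -> str:
    """Build a simple profiling query: value range, distinct count, and row count."""
    return f"""
        SELECT MIN({column})            AS min_value,
               MAX({column})            AS max_value,
               COUNT(DISTINCT {column}) AS distinct_values,
               COUNT(*)                 AS total_rows
        FROM {table_fqn}
        WHERE {column} IS NOT NULL
    """

print(profile_query("customer_data.public.customers", "email"))
```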
|
||||||
|
|
||||||
|
**Table Classification**:
|
||||||
|
- **INCLUDED**: Tables with valid user identifiers
|
||||||
|
- **EXCLUDED**: Tables without user identifiers (fully documented why)
|
||||||
|
|
||||||
|
**Expert Analysis**:
|
||||||
|
- Three SQL expert reviewers analyze the data
|
||||||
|
- Provide priority recommendations
|
||||||
|
- Suggest validation rules based on actual data patterns
|
||||||
|
|
||||||
|
### Step 3: Unify.yml Generation
|
||||||
|
|
||||||
|
**CRITICAL**: Using the **EXACT BUILT-IN template structure** (embedded in hybrid-unif-keys-extractor agent)
|
||||||
|
|
||||||
|
**Template Usage Process**:
|
||||||
|
```
|
||||||
|
1. Receive structured data from hybrid-unif-keys-extractor agent:
|
||||||
|
- Keys with validation rules
|
||||||
|
- Tables with column mappings
|
||||||
|
- Canonical ID configuration
|
||||||
|
- Master tables specification
|
||||||
|
|
||||||
|
2. Use BUILT-IN template structure (see agent documentation)
|
||||||
|
|
||||||
|
3. ONLY replace these specific values:
|
||||||
|
- Line 1: name: {canonical_id_name}
|
||||||
|
- keys section: actual keys found
|
||||||
|
- tables section: actual tables with actual columns
|
||||||
|
- canonical_ids section: name and merge_by_keys
|
||||||
|
- master_tables section: [] or user specifications
|
||||||
|
|
||||||
|
4. PRESERVE everything else:
|
||||||
|
- ALL comment blocks (#####...)
|
||||||
|
- ALL comment text ("Declare Validation logic", etc.)
|
||||||
|
- ALL spacing and indentation (2 spaces per level)
|
||||||
|
- ALL blank lines
|
||||||
|
- EXACT YAML structure
|
||||||
|
|
||||||
|
5. Use Write tool to save populated unify.yml
|
||||||
|
```
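A small sketch of the populate-only-these-values idea: the template text (with its comment banners and spacing) stays verbatim and only named placeholders are filled in. The placeholder names below are illustrative; the agent's built-in template remains the authoritative structure:

```python
from string import Template

UNIFY_TEMPLATE = Template("""name: $canonical_id_name

keys:
$keys_block

tables:
$tables_block
""")

def render_unify_yaml(canonical_id_name: str, keys_block: str, tables_block: str) -> str:
    """Substitute values into the fixed template without touching its structure."""
    return UNIFY_TEMPLATE.substitute(
        canonical_id_name=canonical_id_name,
        keys_block=keys_block,
        tables_block=tables_block,
    )
```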
|
||||||
|
|
||||||
|
**I'll generate**:
|
||||||
|
|
||||||
|
**Section 1: Canonical ID Name**
|
||||||
|
```yaml
|
||||||
|
name: {your_canonical_id_name}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Section 2: Keys with Validation**
|
||||||
|
```yaml
|
||||||
|
keys:
|
||||||
|
- name: email
|
||||||
|
valid_regexp: ".*@.*"
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: customer_id
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: phone_number
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
```
|
||||||
|
*Populated with actual keys found in your tables*
|
||||||
|
|
||||||
|
**Section 3: Tables with Key Column Mappings**
|
||||||
|
```yaml
|
||||||
|
tables:
|
||||||
|
- database: {database/catalog}
|
||||||
|
table: {table_name}
|
||||||
|
key_columns:
|
||||||
|
- {column: actual_column_name, key: mapped_key}
|
||||||
|
- {column: another_column, key: another_key}
|
||||||
|
```
|
||||||
|
*Only tables with valid user identifiers, with EXACT column names from schema analysis*
|
||||||
|
|
||||||
|
**Section 4: Canonical IDs Configuration**
|
||||||
|
```yaml
|
||||||
|
canonical_ids:
|
||||||
|
- name: {your_canonical_id_name}
|
||||||
|
merge_by_keys: [email, customer_id, phone_number]
|
||||||
|
merge_iterations: 15
|
||||||
|
```
|
||||||
|
*Based on extracted keys and your configuration*
|
||||||
|
|
||||||
|
**Section 5: Master Tables (Optional)**
|
||||||
|
```yaml
|
||||||
|
master_tables:
|
||||||
|
- name: {canonical_id_name}_master_table
|
||||||
|
canonical_id: {canonical_id_name}
|
||||||
|
attributes:
|
||||||
|
- name: best_email
|
||||||
|
source_columns:
|
||||||
|
- {table: table1, column: email, order: last, order_by: time, priority: 1}
|
||||||
|
- {table: table2, column: email_address, order: last, order_by: time, priority: 2}
|
||||||
|
```
|
||||||
|
*If you request master table configuration, I'll help set up attribute aggregation*
|
||||||
|
|
||||||
|
### Step 4: Validation and Review
|
||||||
|
|
||||||
|
After generation:
|
||||||
|
```
|
||||||
|
1. Show complete unify.yml content
|
||||||
|
2. Highlight key sections:
|
||||||
|
- Keys found: [list]
|
||||||
|
- Tables included: [count]
|
||||||
|
- Tables excluded: [count] with reasons
|
||||||
|
- Merge strategy: [keys and priorities]
|
||||||
|
3. Provide recommendations for optimization
|
||||||
|
4. Ask for your approval before saving
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 5: File Output
|
||||||
|
|
||||||
|
```
|
||||||
|
1. Write unify.yml to specified location
|
||||||
|
2. Create backup of existing file if present
|
||||||
|
3. Provide file summary:
|
||||||
|
- Keys configured: X
|
||||||
|
- Tables configured: Y
|
||||||
|
- Validation rules: Z
|
||||||
|
4. Show next steps for using the configuration
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example Workflow
|
||||||
|
|
||||||
|
**Input**:
|
||||||
|
```
|
||||||
|
Platform: Snowflake
|
||||||
|
Tables:
|
||||||
|
- customer_data.public.customers
|
||||||
|
- customer_data.public.orders
|
||||||
|
- web_data.public.events
|
||||||
|
Canonical ID Name: unified_customer_id
|
||||||
|
Output: snowflake_unify.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
**Process**:
|
||||||
|
```
|
||||||
|
✓ Platform: Snowflake MCP tools detected
|
||||||
|
✓ Analyzing 3 tables...
|
||||||
|
|
||||||
|
Schema Analysis:
|
||||||
|
✓ customer_data.public.customers - 12 columns
|
||||||
|
✓ customer_data.public.orders - 8 columns
|
||||||
|
✓ web_data.public.events - 15 columns
|
||||||
|
|
||||||
|
User Identifier Detection:
|
||||||
|
✓ customers: email, customer_id (2 identifiers)
|
||||||
|
✓ orders: customer_id, email_address (2 identifiers)
|
||||||
|
✗ events: NO user identifiers found
|
||||||
|
Available columns: event_id, session_id, page_url, timestamp, ...
|
||||||
|
Reason: Contains only event tracking data - no PII
|
||||||
|
|
||||||
|
Data Analysis:
|
||||||
|
✓ email: 45,123 unique values, format valid
|
||||||
|
✓ customer_id: 45,089 unique values, numeric
|
||||||
|
✓ email_address: 12,456 unique values, format valid
|
||||||
|
|
||||||
|
Expert Analysis Complete:
|
||||||
|
Priority 1: customer_id (most stable, highest coverage)
|
||||||
|
Priority 2: email (good coverage, some quality issues)
|
||||||
|
Priority 3: phone_number (not found)
|
||||||
|
|
||||||
|
Generating unify.yml...
|
||||||
|
✓ Keys section: 2 keys configured
|
||||||
|
✓ Tables section: 2 tables configured
|
||||||
|
✓ Canonical IDs: unified_customer_id
|
||||||
|
✓ Validation rules: Applied based on data patterns
|
||||||
|
|
||||||
|
Tables EXCLUDED:
|
||||||
|
- web_data.public.events: No user identifiers
|
||||||
|
```
|
||||||
|
|
||||||
|
**Output (snowflake_unify.yml)**:
|
||||||
|
```yaml
|
||||||
|
name: unified_customer_id
|
||||||
|
|
||||||
|
keys:
|
||||||
|
- name: email
|
||||||
|
valid_regexp: ".*@.*"
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: customer_id
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
|
||||||
|
tables:
|
||||||
|
- database: customer_data
|
||||||
|
table: customers
|
||||||
|
key_columns:
|
||||||
|
- {column: email, key: email}
|
||||||
|
- {column: customer_id, key: customer_id}
|
||||||
|
- database: customer_data
|
||||||
|
table: orders
|
||||||
|
key_columns:
|
||||||
|
- {column: email_address, key: email}
|
||||||
|
- {column: customer_id, key: customer_id}
|
||||||
|
|
||||||
|
canonical_ids:
|
||||||
|
- name: unified_customer_id
|
||||||
|
merge_by_keys: [customer_id, email]
|
||||||
|
merge_iterations: 15
|
||||||
|
|
||||||
|
master_tables: []
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Key Features
|
||||||
|
|
||||||
|
### 🔍 **STRICT PII Detection**
|
||||||
|
- Zero tolerance for guessing
|
||||||
|
- Only includes tables with actual user identifiers
|
||||||
|
- Documents why tables are excluded
|
||||||
|
- Based on REAL schema and data analysis
|
||||||
|
|
||||||
|
### ✅ **Exact Template Compliance**
|
||||||
|
- Uses BUILT-IN exact template structure (embedded in hybrid-unif-keys-extractor agent)
|
||||||
|
- NO modifications to template format
|
||||||
|
- Preserves all comment sections
|
||||||
|
- Maintains exact YAML structure
|
||||||
|
- Portable across all systems
|
||||||
|
|
||||||
|
### 📊 **Real Data Analysis**
|
||||||
|
- Queries actual MIN/MAX values
|
||||||
|
- Counts unique identifiers
|
||||||
|
- Validates data patterns
|
||||||
|
- Identifies quality issues
|
||||||
|
|
||||||
|
### 🎯 **Platform-Aware**
|
||||||
|
- Uses correct MCP tools for each platform
|
||||||
|
- Respects platform naming conventions
|
||||||
|
- Applies platform-specific data type rules
|
||||||
|
- Generates platform-compatible SQL references
|
||||||
|
|
||||||
|
### 📋 **Complete Documentation**
|
||||||
|
- Documents all excluded tables with reasons
|
||||||
|
- Lists available columns for excluded tables
|
||||||
|
- Explains why columns don't qualify as user identifiers
|
||||||
|
- Provides expert recommendations
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Output Format
|
||||||
|
|
||||||
|
**The generated unify.yml will have EXACTLY this structure:**
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
name: {canonical_id_name}
|
||||||
|
#####################################################
|
||||||
|
##
|
||||||
|
##Declare Validation logic for unification keys
|
||||||
|
##
|
||||||
|
#####################################################
|
||||||
|
keys:
|
||||||
|
- name: {key1}
|
||||||
|
valid_regexp: "{pattern}"
|
||||||
|
invalid_texts: ['{val1}', '{val2}', '{val3}']
|
||||||
|
- name: {key2}
|
||||||
|
invalid_texts: ['{val1}', '{val2}', '{val3}']
|
||||||
|
|
||||||
|
#####################################################
|
||||||
|
##
|
||||||
|
##Declare databases, tables, and keys to use during unification
|
||||||
|
##
|
||||||
|
#####################################################
|
||||||
|
|
||||||
|
tables:
|
||||||
|
- database: {db/catalog}
|
||||||
|
table: {table}
|
||||||
|
key_columns:
|
||||||
|
- {column: {col}, key: {key}}
|
||||||
|
|
||||||
|
#####################################################
|
||||||
|
##
|
||||||
|
##Declare hierarchy for unification. Define keys to use for each level.
|
||||||
|
##
|
||||||
|
#####################################################
|
||||||
|
|
||||||
|
canonical_ids:
|
||||||
|
- name: {canonical_id_name}
|
||||||
|
merge_by_keys: [{key1}, {key2}, ...]
|
||||||
|
merge_iterations: {number}
|
||||||
|
|
||||||
|
#####################################################
|
||||||
|
##
|
||||||
|
##Declare Similar Attributes and standardize into a single column
|
||||||
|
##
|
||||||
|
#####################################################
|
||||||
|
|
||||||
|
master_tables:
|
||||||
|
- name: {canonical_id_name}_master_table
|
||||||
|
canonical_id: {canonical_id_name}
|
||||||
|
attributes:
|
||||||
|
- name: {attribute}
|
||||||
|
source_columns:
|
||||||
|
- {table: {t}, column: {c}, order: last, order_by: time, priority: 1}
|
||||||
|
```
|
||||||
|
|
||||||
|
**NO deviations from this structure - EXACT template compliance guaranteed.**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
### Required:
|
||||||
|
- ✅ Snowflake or Databricks platform access
|
||||||
|
- ✅ Platform-specific MCP tools configured (may use fallback if unavailable)
|
||||||
|
- ✅ Read permissions on tables to be analyzed
|
||||||
|
- ✅ Tables must exist and be accessible
|
||||||
|
|
||||||
|
### Optional:
|
||||||
|
- Custom unify.yml template path (if not using default)
|
||||||
|
- Master table attribute specifications
|
||||||
|
- Custom validation rules
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Expected Timeline
|
||||||
|
|
||||||
|
| Step | Duration |
|
||||||
|
|------|----------|
|
||||||
|
| Platform detection | < 1 min |
|
||||||
|
| Schema analysis (per table) | 5-10 sec |
|
||||||
|
| Data analysis (per identifier) | 10-20 sec |
|
||||||
|
| Expert analysis | 1-2 min |
|
||||||
|
| YAML generation | < 1 min |
|
||||||
|
| **Total (for 5 tables)** | **~3-5 min** |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
### Common Issues:
|
||||||
|
|
||||||
|
**Issue**: MCP tools not available for platform
|
||||||
|
**Solution**:
|
||||||
|
- I'll inform you and provide fallback options
|
||||||
|
- You can provide schema information manually
|
||||||
|
- I'll still generate unify.yml with validation warnings
|
||||||
|
|
||||||
|
**Issue**: No tables have user identifiers
|
||||||
|
**Solution**:
|
||||||
|
- I'll show you why tables were excluded
|
||||||
|
- Suggest alternative tables to analyze
|
||||||
|
- Explain what constitutes a user identifier
|
||||||
|
|
||||||
|
**Issue**: Table not accessible
|
||||||
|
**Solution**:
|
||||||
|
- Document which tables are inaccessible
|
||||||
|
- Continue with accessible tables
|
||||||
|
- Recommend permission checks
|
||||||
|
|
||||||
|
**Issue**: Complex data types found
|
||||||
|
**Solution**:
|
||||||
|
- Exclude complex type columns (arrays, structs, maps)
|
||||||
|
- Explain why they can't be used for unification
|
||||||
|
- Suggest alternative columns if available
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
Generated unify.yml will:
|
||||||
|
- ✅ Use EXACT template structure - NO modifications
|
||||||
|
- ✅ Contain ONLY tables with validated user identifiers
|
||||||
|
- ✅ Include ONLY columns that actually exist in tables
|
||||||
|
- ✅ Have validation rules based on actual data patterns
|
||||||
|
- ✅ Be ready for immediate use with hybrid-generate-snowflake or hybrid-generate-databricks
|
||||||
|
- ✅ Work without any manual edits
|
||||||
|
- ✅ Include comprehensive documentation in comments
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps After Generation
|
||||||
|
|
||||||
|
1. **Review the generated unify.yml**
|
||||||
|
- Verify tables and columns are correct
|
||||||
|
- Check validation rules are appropriate
|
||||||
|
- Review merge strategy and priorities
|
||||||
|
|
||||||
|
2. **Generate SQL for your platform**:
|
||||||
|
- Snowflake: `/cdp-hybrid-idu:hybrid-generate-snowflake`
|
||||||
|
- Databricks: `/cdp-hybrid-idu:hybrid-generate-databricks`
|
||||||
|
|
||||||
|
3. **Execute the workflow**:
|
||||||
|
- Snowflake: `/cdp-hybrid-idu:hybrid-execute-snowflake`
|
||||||
|
- Databricks: `/cdp-hybrid-idu:hybrid-execute-databricks`
|
||||||
|
|
||||||
|
4. **Monitor convergence and results**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Getting Started
|
||||||
|
|
||||||
|
**Ready to begin?**
|
||||||
|
|
||||||
|
Please provide:
|
||||||
|
|
||||||
|
1. **Platform**: Snowflake or Databricks
|
||||||
|
2. **Tables**: List of tables to analyze (full paths)
|
||||||
|
3. **Canonical ID Name**: Name for your unified ID (e.g., `unified_customer_id`)
|
||||||
|
4. **Output File** (optional): Where to save unify.yml (default: `unify.yml`)
|
||||||
|
|
||||||
|
**Example**:
|
||||||
|
```
|
||||||
|
Platform: Snowflake
|
||||||
|
Tables:
|
||||||
|
- customer_db.public.customers
|
||||||
|
- customer_db.public.orders
|
||||||
|
- marketing_db.public.campaigns
|
||||||
|
Canonical ID: unified_id
|
||||||
|
Output: snowflake_unify.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**I'll analyze your tables and generate a production-ready unify.yml configuration!**
|
||||||
337
commands/hybrid-unif-config-validate.md
Normal file
337
commands/hybrid-unif-config-validate.md
Normal file
@@ -0,0 +1,337 @@
|
|||||||
|
---
|
||||||
|
name: hybrid-unif-config-validate
|
||||||
|
description: Validate YAML configuration for hybrid ID unification before SQL generation
|
||||||
|
---
|
||||||
|
|
||||||
|
# Validate Hybrid ID Unification YAML
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Validate your `unify.yml` configuration file to ensure it's properly structured and ready for SQL generation. This command checks syntax, structure, validation rules, and provides recommendations for optimization.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What You Need
|
||||||
|
|
||||||
|
### Required Input
|
||||||
|
1. **YAML Configuration File**: Path to your `unify.yml`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What I'll Do
|
||||||
|
|
||||||
|
### Step 1: File Validation
|
||||||
|
- Verify file exists and is readable
|
||||||
|
- Check YAML syntax (proper indentation, quotes, etc.)
|
||||||
|
- Ensure all required sections are present
|
||||||
|
|
||||||
|
### Step 2: Structure Validation
|
||||||
|
Check presence and structure of:
|
||||||
|
- **name**: Unification project name
|
||||||
|
- **keys**: Key definitions with validation rules
|
||||||
|
- **tables**: Source tables with key column mappings
|
||||||
|
- **canonical_ids**: Canonical ID configuration
|
||||||
|
- **master_tables**: Master table definitions (optional)
|
||||||
|
|
||||||
|
### Step 3: Content Validation
|
||||||
|
Validate individual sections:
|
||||||
|
|
||||||
|
**Keys Section**:
|
||||||
|
- ✓ Each key has a unique name
|
||||||
|
- ✓ `valid_regexp` is a valid regex pattern (if provided)
|
||||||
|
- ✓ `invalid_texts` is an array (if provided)
|
||||||
|
- ⚠ Recommend validation rules if missing
|
||||||
|
|
||||||
|
**Tables Section**:
|
||||||
|
- ✓ Each table has a name
|
||||||
|
- ✓ Each table has at least one key_column
|
||||||
|
- ✓ All referenced keys exist in keys section
|
||||||
|
- ✓ Column names are valid identifiers
|
||||||
|
- ⚠ Check for duplicate table definitions
|
||||||
|
|
||||||
|
**Canonical IDs Section**:
|
||||||
|
- ✓ Has a name (will be canonical ID column name)
|
||||||
|
- ✓ `merge_by_keys` references existing keys
|
||||||
|
- ✓ `merge_iterations` is a positive integer (if provided)
|
||||||
|
- ⚠ Suggest optimal iteration count if not specified
|
||||||
|
|
||||||
|
**Master Tables Section** (if present):
|
||||||
|
- ✓ Each master table has a name and canonical_id
|
||||||
|
- ✓ Referenced canonical_id exists
|
||||||
|
- ✓ Attributes have proper structure
|
||||||
|
- ✓ Source tables in attributes exist
|
||||||
|
- ✓ Priority values are valid
|
||||||
|
- ⚠ Check for attribute conflicts
|
||||||
|
|
||||||
|
### Step 4: Cross-Reference Validation
|
||||||
|
- ✓ All merge_by_keys exist in keys section
|
||||||
|
- ✓ All key_columns reference defined keys
|
||||||
|
- ✓ All master table source tables exist in tables section
|
||||||
|
- ✓ Canonical ID names don't conflict with existing columns
|
||||||
|
|
||||||
|
### Step 5: Best Practices Check
|
||||||
|
Provide recommendations for:
|
||||||
|
- Key validation rules
|
||||||
|
- Iteration count optimization
|
||||||
|
- Master table attribute priorities
|
||||||
|
- Performance considerations
|
||||||
|
|
||||||
|
### Step 6: Validation Report
|
||||||
|
Generate comprehensive report with:
|
||||||
|
- ✅ Passed checks
|
||||||
|
- ⚠ Warnings (non-critical issues)
|
||||||
|
- ❌ Errors (must fix before generation)
|
||||||
|
- 💡 Recommendations for improvement
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Command Usage
|
||||||
|
|
||||||
|
### Basic Usage
|
||||||
|
```
|
||||||
|
/cdp-hybrid-idu:hybrid-unif-config-validate
|
||||||
|
|
||||||
|
I'll prompt you for:
|
||||||
|
- YAML file path
|
||||||
|
```
|
||||||
|
|
||||||
|
### Direct Usage
|
||||||
|
```
|
||||||
|
YAML file: /path/to/unify.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example Validation
|
||||||
|
|
||||||
|
### Input YAML
|
||||||
|
```yaml
|
||||||
|
name: customer_unification
|
||||||
|
|
||||||
|
keys:
|
||||||
|
- name: email
|
||||||
|
valid_regexp: ".*@.*"
|
||||||
|
invalid_texts: ['', 'N/A', 'null']
|
||||||
|
- name: customer_id
|
||||||
|
invalid_texts: ['', 'N/A']
|
||||||
|
|
||||||
|
tables:
|
||||||
|
- table: customer_profiles
|
||||||
|
key_columns:
|
||||||
|
- {column: email_std, key: email}
|
||||||
|
- {column: customer_id, key: customer_id}
|
||||||
|
- table: orders
|
||||||
|
key_columns:
|
||||||
|
- {column: email_address, key: email}
|
||||||
|
|
||||||
|
canonical_ids:
|
||||||
|
- name: unified_id
|
||||||
|
merge_by_keys: [email, customer_id]
|
||||||
|
merge_iterations: 15
|
||||||
|
|
||||||
|
master_tables:
|
||||||
|
- name: customer_master
|
||||||
|
canonical_id: unified_id
|
||||||
|
attributes:
|
||||||
|
- name: best_email
|
||||||
|
source_columns:
|
||||||
|
- {table: customer_profiles, column: email_std, priority: 1}
|
||||||
|
- {table: orders, column: email_address, priority: 2}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Validation Report
|
||||||
|
```
|
||||||
|
✅ YAML VALIDATION SUCCESSFUL
|
||||||
|
|
||||||
|
File Structure:
|
||||||
|
✅ Valid YAML syntax
|
||||||
|
✅ All required sections present
|
||||||
|
✅ Proper indentation and formatting
|
||||||
|
|
||||||
|
Keys Section (2 keys):
|
||||||
|
✅ email: Valid regex pattern, invalid_texts defined
|
||||||
|
✅ customer_id: Invalid_texts defined
|
||||||
|
⚠ Consider adding valid_regexp for customer_id for better validation
|
||||||
|
|
||||||
|
Tables Section (2 tables):
|
||||||
|
✅ customer_profiles: 2 key columns mapped
|
||||||
|
✅ orders: 1 key column mapped
|
||||||
|
✅ All referenced keys exist
|
||||||
|
|
||||||
|
Canonical IDs Section:
|
||||||
|
✅ Name: unified_id
|
||||||
|
✅ Merge keys: email, customer_id (both exist)
|
||||||
|
✅ Iterations: 15 (recommended range: 10-20)
|
||||||
|
|
||||||
|
Master Tables Section (1 master table):
|
||||||
|
✅ customer_master: References unified_id
|
||||||
|
✅ Attribute 'best_email': 2 sources with priorities
|
||||||
|
✅ All source tables exist
|
||||||
|
|
||||||
|
Cross-References:
|
||||||
|
✅ All merge_by_keys defined in keys section
|
||||||
|
✅ All key_columns reference existing keys
|
||||||
|
✅ All master table sources exist
|
||||||
|
✅ No canonical ID name conflicts
|
||||||
|
|
||||||
|
Recommendations:
|
||||||
|
💡 Consider adding valid_regexp for customer_id (e.g., "^[A-Z0-9]+$")
|
||||||
|
💡 Add more master table attributes for richer customer profiles
|
||||||
|
💡 Consider array attributes (top_3_emails) for historical tracking
|
||||||
|
|
||||||
|
Summary:
|
||||||
|
✅ 0 errors
|
||||||
|
⚠ 1 warning
|
||||||
|
💡 3 recommendations
|
||||||
|
|
||||||
|
✓ Configuration is ready for SQL generation!
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Validation Checks
|
||||||
|
|
||||||
|
### Required Checks (Must Pass)
|
||||||
|
- [ ] File exists and is readable
|
||||||
|
- [ ] Valid YAML syntax
|
||||||
|
- [ ] `name` field present
|
||||||
|
- [ ] `keys` section present with at least one key
|
||||||
|
- [ ] `tables` section present with at least one table
|
||||||
|
- [ ] `canonical_ids` section present
|
||||||
|
- [ ] All merge_by_keys exist in keys section
|
||||||
|
- [ ] All key_columns reference defined keys
|
||||||
|
- [ ] No duplicate key names
|
||||||
|
- [ ] No duplicate table names
|
||||||
|
|
||||||
|
### Warning Checks (Recommended)
|
||||||
|
- [ ] Keys have validation rules (valid_regexp or invalid_texts)
|
||||||
|
- [ ] Merge_iterations specified (otherwise auto-calculated)
|
||||||
|
- [ ] Master tables defined for unified customer view
|
||||||
|
- [ ] Source tables have unique key combinations
|
||||||
|
- [ ] Attribute priorities are sequential
|
||||||
|
|
||||||
|
### Best Practice Checks
|
||||||
|
- [ ] Email keys have email regex pattern
|
||||||
|
- [ ] Phone keys have phone validation
|
||||||
|
- [ ] Invalid_texts include common null values ('', 'N/A', 'null')
|
||||||
|
- [ ] Master tables use time-based order_by for recency
|
||||||
|
- [ ] Array attributes for historical data (top_3_emails, etc.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Common Validation Errors
|
||||||
|
|
||||||
|
### Syntax Errors
|
||||||
|
**Error**: `Invalid YAML: mapping values are not allowed here`
|
||||||
|
**Solution**: Check indentation (use spaces, not tabs), ensure colons have space after them
|
||||||
|
|
||||||
|
**Error**: `Invalid YAML: could not find expected ':'`
|
||||||
|
**Solution**: Check for missing colons in key-value pairs
|
||||||
|
|
||||||
|
### Structure Errors
|
||||||
|
**Error**: `Missing required section: keys`
|
||||||
|
**Solution**: Add keys section with at least one key definition
|
||||||
|
|
||||||
|
**Error**: `Empty tables section`
|
||||||
|
**Solution**: Add at least one table with key_columns
|
||||||
|
|
||||||
|
### Reference Errors
|
||||||
|
**Error**: `Key 'phone' referenced in table 'orders' but not defined in keys section`
|
||||||
|
**Solution**: Add phone key to keys section or remove reference
|
||||||
|
|
||||||
|
**Error**: `Merge key 'phone_number' not found in keys section`
|
||||||
|
**Solution**: Add phone_number to keys section or remove from merge_by_keys
|
||||||
|
|
||||||
|
**Error**: `Master table source 'customer_360' not found in tables section`
|
||||||
|
**Solution**: Add customer_360 to tables section or use correct table name
|
||||||
|
|
||||||
|
### Value Errors
|
||||||
|
**Error**: `merge_iterations must be a positive integer, got: 'auto'`
|
||||||
|
**Solution**: Either remove merge_iterations (auto-calculate) or specify integer (e.g., 15)
|
||||||
|
|
||||||
|
**Error**: `Priority must be a positive integer, got: 'high'`
|
||||||
|
**Solution**: Use numeric priority (1 for highest, 2 for second, etc.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Validation Levels
|
||||||
|
|
||||||
|
### Strict Mode (Default)
|
||||||
|
- Fails on any structural errors
|
||||||
|
- Warns on missing best practices
|
||||||
|
- Recommends optimizations
|
||||||
|
|
||||||
|
### Lenient Mode
|
||||||
|
- Only fails on critical syntax errors
|
||||||
|
- Allows missing optional fields
|
||||||
|
- Minimal warnings
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Platform-Specific Validation
|
||||||
|
|
||||||
|
### Databricks-Specific
|
||||||
|
- ✓ Table names compatible with Unity Catalog
|
||||||
|
- ✓ Column names valid for Spark SQL
|
||||||
|
- ⚠ Check for reserved keywords (DATABASE, TABLE, etc.)
|
||||||
|
|
||||||
|
### Snowflake-Specific
|
||||||
|
- ✓ Table names compatible with Snowflake
|
||||||
|
- ✓ Column names valid for Snowflake SQL
|
||||||
|
- ⚠ Check for reserved keywords (ACCOUNT, SCHEMA, etc.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What Happens Next
|
||||||
|
|
||||||
|
### If Validation Passes
|
||||||
|
```
|
||||||
|
✅ Configuration validated successfully!
|
||||||
|
|
||||||
|
Ready for:
|
||||||
|
• SQL generation (Databricks or Snowflake)
|
||||||
|
• Direct execution after generation
|
||||||
|
|
||||||
|
Next steps:
|
||||||
|
1. /cdp-hybrid-idu:hybrid-generate-databricks
|
||||||
|
2. /cdp-hybrid-idu:hybrid-generate-snowflake
|
||||||
|
3. /cdp-hybrid-idu:hybrid-setup (complete workflow)
|
||||||
|
```
|
||||||
|
|
||||||
|
### If Validation Fails
|
||||||
|
```
|
||||||
|
❌ Configuration has errors that must be fixed
|
||||||
|
|
||||||
|
Errors (must fix):
|
||||||
|
1. Missing required section: canonical_ids
|
||||||
|
2. Undefined key 'phone' referenced in table 'orders'
|
||||||
|
|
||||||
|
Suggestions:
|
||||||
|
• Add canonical_ids section with name and merge_by_keys
|
||||||
|
• Add phone key to keys section or remove from orders
|
||||||
|
|
||||||
|
Would you like help fixing these issues? (y/n)
|
||||||
|
```
|
||||||
|
|
||||||
|
I can help you:
|
||||||
|
- Fix syntax errors
|
||||||
|
- Add missing sections
|
||||||
|
- Define proper validation rules
|
||||||
|
- Optimize configuration
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
Validation passes when:
|
||||||
|
- ✅ YAML syntax is valid
|
||||||
|
- ✅ All required sections present
|
||||||
|
- ✅ All references resolved
|
||||||
|
- ✅ No structural errors
|
||||||
|
- ✅ Ready for SQL generation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Ready to validate your YAML configuration?**
|
||||||
|
|
||||||
|
Provide your `unify.yml` file path to begin validation!
|
||||||
726
commands/hybrid-unif-merge-stats-creator.md
Normal file
726
commands/hybrid-unif-merge-stats-creator.md
Normal file
@@ -0,0 +1,726 @@
|
|||||||
|
---
|
||||||
|
name: hybrid-unif-merge-stats-creator
|
||||||
|
description: Generate professional HTML/PDF merge statistics report from ID unification results for Snowflake or Databricks with expert analysis and visualizations
|
||||||
|
---
|
||||||
|
|
||||||
|
# ID Unification Merge Statistics Report Generator
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
I'll generate a **comprehensive, professional HTML report** analyzing your ID unification merge statistics with:
|
||||||
|
|
||||||
|
- 📊 **Executive Summary** with key performance indicators
|
||||||
|
- 📈 **Identity Resolution Performance** analysis and deduplication rates
|
||||||
|
- 🎯 **Merge Distribution** patterns and complexity analysis
|
||||||
|
- 👥 **Top Merged Profiles** highlighting complex identity resolutions
|
||||||
|
- ✅ **Data Quality Metrics** with coverage percentages
|
||||||
|
- 🚀 **Convergence Analysis** showing iteration performance
|
||||||
|
- 💡 **Expert Recommendations** for optimization and next steps
|
||||||
|
|
||||||
|
**Platform Support:**
|
||||||
|
- ✅ Snowflake (using Snowflake MCP tools)
|
||||||
|
- ✅ Databricks (using Databricks MCP tools)
|
||||||
|
|
||||||
|
**Output Format:**
|
||||||
|
- Beautiful HTML report with charts, tables, and visualizations
|
||||||
|
- PDF-ready (print to PDF from browser)
|
||||||
|
- Consistent formatting every time
|
||||||
|
- Platform-agnostic design
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What You Need to Provide
|
||||||
|
|
||||||
|
### 1. Platform Selection
|
||||||
|
- **Snowflake**: For Snowflake-based ID unification
|
||||||
|
- **Databricks**: For Databricks-based ID unification
|
||||||
|
|
||||||
|
### 2. Database/Catalog Configuration
|
||||||
|
|
||||||
|
**For Snowflake:**
|
||||||
|
- **Database Name**: Where your unification tables are stored (e.g., `INDRESH_TEST`, `CUSTOMER_CDP`)
|
||||||
|
- **Schema Name**: Schema containing tables (e.g., `PUBLIC`, `ID_UNIFICATION`)
|
||||||
|
|
||||||
|
**For Databricks:**
|
||||||
|
- **Catalog Name**: Unity Catalog name (e.g., `customer_data`, `cdp_prod`)
|
||||||
|
- **Schema Name**: Schema containing tables (e.g., `id_unification`, `unified_profiles`)
|
||||||
|
|
||||||
|
### 3. Canonical ID Configuration
|
||||||
|
- **Canonical ID Name**: Name used for your unified ID (e.g., `td_id`, `unified_customer_id`, `master_id`)
|
||||||
|
- This is used to find the correct tables: `{canonical_id}_lookup`, `{canonical_id}_master_table`, etc.
|
||||||
|
|
||||||
|
### 4. Output Configuration (Optional)
|
||||||
|
- **Output File Path**: Where to save the HTML report (default: `id_unification_report.html`)
|
||||||
|
- **Report Title**: Custom title for the report (default: "ID Unification Merge Statistics Report")
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What I'll Do
|
||||||
|
|
||||||
|
### Step 1: Platform Detection and Validation
|
||||||
|
|
||||||
|
**Snowflake:**
|
||||||
|
```
|
||||||
|
1. Verify Snowflake MCP tools are available
|
||||||
|
2. Test connection to specified database.schema
|
||||||
|
3. Validate canonical ID tables exist:
|
||||||
|
- {database}.{schema}.{canonical_id}_lookup
|
||||||
|
- {database}.{schema}.{canonical_id}_master_table
|
||||||
|
- {database}.{schema}.{canonical_id}_source_key_stats
|
||||||
|
- {database}.{schema}.{canonical_id}_result_key_stats
|
||||||
|
4. Confirm access permissions
|
||||||
|
```
|
||||||
|
|
||||||
|
**Databricks:**
|
||||||
|
```
|
||||||
|
1. Verify Databricks MCP tools are available (or use Snowflake fallback)
|
||||||
|
2. Test connection to specified catalog.schema
|
||||||
|
3. Validate canonical ID tables exist
|
||||||
|
4. Confirm access permissions
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 2: Data Collection with Expert Analysis
|
||||||
|
|
||||||
|
I'll execute **16 specialized queries** to collect comprehensive statistics:
|
||||||
|
|
||||||
|
**Core Statistics Queries:**
|
||||||
|
|
||||||
|
1. **Source Key Statistics**
|
||||||
|
- Pre-unification identity counts
|
||||||
|
- Distinct values per key type (customer_id, email, phone, etc.)
|
||||||
|
- Per-table breakdowns
|
||||||
|
|
||||||
|
2. **Result Key Statistics**
|
||||||
|
- Post-unification canonical ID counts
|
||||||
|
- Distribution histograms
|
||||||
|
- Coverage per key type
|
||||||
|
|
||||||
|
3. **Canonical ID Metrics**
|
||||||
|
- Total identities processed
|
||||||
|
- Unique canonical IDs created
|
||||||
|
- Merge ratio calculation
|
||||||
|
|
||||||
|
4. **Top Merged Profiles**
|
||||||
|
- Top 10 most complex merges
|
||||||
|
- Identity count per canonical ID
|
||||||
|
- Merge complexity scoring
|
||||||
|
|
||||||
|
5. **Merge Distribution Analysis**
|
||||||
|
- Categorization (2, 3-5, 6-10, 10+ identities)
|
||||||
|
- Percentage distribution
|
||||||
|
- Pattern analysis
|
||||||
|
|
||||||
|
6. **Key Type Distribution**
|
||||||
|
- Identity breakdown by type
|
||||||
|
- Namespace analysis
|
||||||
|
- Cross-key coverage
|
||||||
|
|
||||||
|
7. **Master Table Quality Metrics**
|
||||||
|
- Attribute coverage percentages
|
||||||
|
- Data completeness analysis
|
||||||
|
- Sample record extraction
|
||||||
|
|
||||||
|
8. **Configuration Metadata**
|
||||||
|
- Unification settings
|
||||||
|
- Column mappings
|
||||||
|
- Validation rules
|
||||||
|
|
||||||
|
**Platform-Specific SQL Adaptation:**
|
||||||
|
|
||||||
|
For **Snowflake**:
|
||||||
|
```sql
|
||||||
|
SELECT COUNT(*) as total_identities,
|
||||||
|
COUNT(DISTINCT canonical_id) as unique_canonical_ids
|
||||||
|
FROM {database}.{schema}.{canonical_id}_lookup;
|
||||||
|
```
|
||||||
|
|
||||||
|
For **Databricks**:
|
||||||
|
```sql
|
||||||
|
SELECT COUNT(*) as total_identities,
|
||||||
|
COUNT(DISTINCT canonical_id) as unique_canonical_ids
|
||||||
|
FROM {catalog}.{schema}.{canonical_id}_lookup;
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 3: Statistical Analysis and Calculations
|
||||||
|
|
||||||
|
I'll perform expert-level calculations:
|
||||||
|
|
||||||
|
**Deduplication Rates:**
|
||||||
|
```
|
||||||
|
For each key type:
|
||||||
|
- Source distinct count (pre-unification)
|
||||||
|
- Final canonical IDs (post-unification)
|
||||||
|
- Deduplication % = (source - final) / source * 100
|
||||||
|
```
|
||||||
|
|
||||||
|
**Merge Ratios:**
|
||||||
|
```
|
||||||
|
- Average identities per customer = total_identities / unique_canonical_ids
|
||||||
|
- Distribution across categories
|
||||||
|
- Outlier detection (10+ merges)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Convergence Analysis:**
|
||||||
|
```
|
||||||
|
- Parse from execution logs if available
|
||||||
|
- Calculate from iteration metadata tables
|
||||||
|
- Estimate convergence quality
|
||||||
|
```
|
||||||
|
|
||||||
|
**Data Quality Scores:**
|
||||||
|
```
|
||||||
|
- Coverage % for each attribute
|
||||||
|
- Completeness assessment
|
||||||
|
- Quality grading (Excellent, Good, Needs Improvement)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 4: HTML Report Generation
|
||||||
|
|
||||||
|
I'll generate a **pixel-perfect HTML report** with:
|
||||||
|
|
||||||
|
**Design Features:**
|
||||||
|
- ✨ Modern gradient design (purple theme)
|
||||||
|
- 📊 Interactive visualizations (progress bars, horizontal bar charts)
|
||||||
|
- 🎨 Color-coded badges and status indicators
|
||||||
|
- 📱 Responsive layout (works on all devices)
|
||||||
|
- 🖨️ Print-optimized CSS for PDF export
|
||||||
|
|
||||||
|
**Report Structure:**
|
||||||
|
|
||||||
|
```html
|
||||||
|
<!DOCTYPE html>
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
- Professional CSS styling
|
||||||
|
- Chart/visualization styles
|
||||||
|
- Print media queries
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
<header>
|
||||||
|
- Report title
|
||||||
|
- Executive tagline
|
||||||
|
</header>
|
||||||
|
|
||||||
|
<metadata-bar>
|
||||||
|
- Database/Catalog info
|
||||||
|
- Canonical ID name
|
||||||
|
- Generation timestamp
|
||||||
|
- Platform indicator
|
||||||
|
</metadata-bar>
|
||||||
|
|
||||||
|
<section: Executive Summary>
|
||||||
|
- 4 KPI metric cards
|
||||||
|
- Key findings insight box
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section: Identity Resolution Performance>
|
||||||
|
- Source vs result comparison table
|
||||||
|
- Deduplication rate analysis
|
||||||
|
- Horizontal bar charts
|
||||||
|
- Expert insights
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section: Merge Distribution Analysis>
|
||||||
|
- Category breakdown table
|
||||||
|
- Distribution visualizations
|
||||||
|
- Pattern analysis insights
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section: Top Merged Profiles>
|
||||||
|
- Top 10 ranked table
|
||||||
|
- Complexity badges
|
||||||
|
- Investigation recommendations
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section: Source Table Configuration>
|
||||||
|
- Column mapping table
|
||||||
|
- Source contributions
|
||||||
|
- Multi-key strategy analysis
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section: Master Table Data Quality>
|
||||||
|
- 6 coverage cards with progress bars
|
||||||
|
- Sample records table
|
||||||
|
- Quality assessment
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section: Convergence Performance>
|
||||||
|
- Iteration breakdown table
|
||||||
|
- Convergence progression chart
|
||||||
|
- Efficiency analysis
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section: Expert Recommendations>
|
||||||
|
- 4 recommendation cards
|
||||||
|
- Strategic next steps
|
||||||
|
- Downstream activation ideas
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section: Summary Statistics>
|
||||||
|
- Comprehensive metrics table
|
||||||
|
- All key numbers documented
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<footer>
|
||||||
|
- Generation metadata
|
||||||
|
- Platform information
|
||||||
|
- Report description
|
||||||
|
</footer>
|
||||||
|
</body>
|
||||||
|
</html>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 5: Quality Validation and Output
|
||||||
|
|
||||||
|
**Pre-Output Validation:**
|
||||||
|
```
|
||||||
|
1. Verify all sections have data
|
||||||
|
2. Check calculations are correct
|
||||||
|
3. Validate percentages sum properly
|
||||||
|
4. Ensure no missing values
|
||||||
|
5. Confirm HTML is well-formed
|
||||||
|
```
|
||||||
|
|
||||||
|
**File Output:**
|
||||||
|
```
|
||||||
|
1. Write HTML to specified path
|
||||||
|
2. Create backup if file exists
|
||||||
|
3. Set proper file permissions
|
||||||
|
4. Verify file was written successfully
|
||||||
|
```
|
||||||
|
|
||||||
|
**Report Summary:**
|
||||||
|
```
|
||||||
|
✓ Report generated: {file_path}
|
||||||
|
✓ File size: {size} KB
|
||||||
|
✓ Sections included: 9
|
||||||
|
✓ Statistics queries: 16
|
||||||
|
✓ Data quality score: {score}%
|
||||||
|
✓ Ready for: Browser viewing, PDF export, sharing
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example Workflow
|
||||||
|
|
||||||
|
### Snowflake Example
|
||||||
|
|
||||||
|
**User Input:**
|
||||||
|
```
|
||||||
|
Platform: Snowflake
|
||||||
|
Database: INDRESH_TEST
|
||||||
|
Schema: PUBLIC
|
||||||
|
Canonical ID: td_id
|
||||||
|
Output: snowflake_merge_report.html
|
||||||
|
```
|
||||||
|
|
||||||
|
**Process:**
|
||||||
|
```
|
||||||
|
✓ Connected to Snowflake via MCP
|
||||||
|
✓ Database: INDRESH_TEST.PUBLIC validated
|
||||||
|
✓ Tables found:
|
||||||
|
- td_id_lookup (19,512 records)
|
||||||
|
- td_id_master_table (4,940 records)
|
||||||
|
- td_id_source_key_stats (4 records)
|
||||||
|
- td_id_result_key_stats (4 records)
|
||||||
|
|
||||||
|
Executing queries:
|
||||||
|
✓ Query 1: Source statistics retrieved
|
||||||
|
✓ Query 2: Result statistics retrieved
|
||||||
|
✓ Query 3: Canonical ID counts (19,512 → 4,940)
|
||||||
|
✓ Query 4: Top 10 merged profiles identified
|
||||||
|
✓ Query 5: Merge distribution calculated
|
||||||
|
✓ Query 6: Key type distribution analyzed
|
||||||
|
✓ Query 7: Master table coverage (100% email, 99.39% phone)
|
||||||
|
✓ Query 8: Sample records extracted
|
||||||
|
✓ Query 9-11: Metadata retrieved
|
||||||
|
|
||||||
|
Calculating metrics:
|
||||||
|
✓ Merge ratio: 3.95:1
|
||||||
|
✓ Fragmentation reduction: 74.7%
|
||||||
|
✓ Deduplication rates:
|
||||||
|
- customer_id: 23.9%
|
||||||
|
- email: 32.0%
|
||||||
|
- phone: 14.8%
|
||||||
|
✓ Data quality score: 99.7%
|
||||||
|
|
||||||
|
Generating HTML report:
|
||||||
|
✓ Executive summary section
|
||||||
|
✓ Performance analysis section
|
||||||
|
✓ Merge distribution section
|
||||||
|
✓ Top profiles section
|
||||||
|
✓ Source configuration section
|
||||||
|
✓ Data quality section
|
||||||
|
✓ Convergence section
|
||||||
|
✓ Recommendations section
|
||||||
|
✓ Summary statistics section
|
||||||
|
|
||||||
|
✓ Report saved: snowflake_merge_report.html (142 KB)
|
||||||
|
✓ Open in browser to view
|
||||||
|
✓ Print to PDF for distribution
|
||||||
|
```
|
||||||
|
|
||||||
|
**Generated Report Contents:**
|
||||||
|
```
|
||||||
|
Executive Summary:
|
||||||
|
- 4,940 unified profiles
|
||||||
|
- 19,512 total identities
|
||||||
|
- 3.95:1 merge ratio
|
||||||
|
- 74.7% fragmentation reduction
|
||||||
|
|
||||||
|
Identity Resolution:
|
||||||
|
- customer_id: 6,489 → 4,940 (23.9% reduction)
|
||||||
|
- email: 7,261 → 4,940 (32.0% reduction)
|
||||||
|
- phone: 5,762 → 4,910 (14.8% reduction)
|
||||||
|
|
||||||
|
Merge Distribution:
|
||||||
|
- 89.0% profiles: 3-5 identities (normal)
|
||||||
|
- 8.1% profiles: 6-10 identities (high engagement)
|
||||||
|
- 2.3% profiles: 10+ identities (complex)
|
||||||
|
|
||||||
|
Top Merged Profile:
|
||||||
|
- mS9ssBEh4EsN: 38 identities merged
|
||||||
|
|
||||||
|
Data Quality:
|
||||||
|
- Email: 100% coverage
|
||||||
|
- Phone: 99.39% coverage
|
||||||
|
- Names: 100% coverage
|
||||||
|
- Location: 100% coverage
|
||||||
|
|
||||||
|
Expert Recommendations:
|
||||||
|
- Implement incremental processing
|
||||||
|
- Monitor profiles with 20+ merges
|
||||||
|
- Enable downstream activation
|
||||||
|
- Set up quality monitoring
|
||||||
|
```
|
||||||
|
|
||||||
|
### Databricks Example
|
||||||
|
|
||||||
|
**User Input:**
|
||||||
|
```
|
||||||
|
Platform: Databricks
|
||||||
|
Catalog: customer_cdp
|
||||||
|
Schema: id_unification
|
||||||
|
Canonical ID: unified_customer_id
|
||||||
|
Output: databricks_merge_report.html
|
||||||
|
```
|
||||||
|
|
||||||
|
**Process:**
|
||||||
|
```
|
||||||
|
✓ Connected to Databricks (or using Snowflake MCP fallback)
|
||||||
|
✓ Catalog: customer_cdp.id_unification validated
|
||||||
|
✓ Tables found:
|
||||||
|
- unified_customer_id_lookup
|
||||||
|
- unified_customer_id_master_table
|
||||||
|
- unified_customer_id_source_key_stats
|
||||||
|
- unified_customer_id_result_key_stats
|
||||||
|
|
||||||
|
[Same query execution and report generation as Snowflake]
|
||||||
|
|
||||||
|
✓ Report saved: databricks_merge_report.html
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Key Features
|
||||||
|
|
||||||
|
### 🎯 **Consistency Guarantee**
|
||||||
|
- **Same report every time**: Deterministic HTML generation
|
||||||
|
- **Platform-agnostic design**: Works identically on Snowflake and Databricks
|
||||||
|
- **Version controlled**: Report structure is fixed and versioned
|
||||||
|
|
||||||
|
### 🔍 **Expert Analysis**
|
||||||
|
- **16 specialized queries**: Comprehensive data collection
|
||||||
|
- **Calculated metrics**: Deduplication rates, merge ratios, quality scores
|
||||||
|
- **Pattern detection**: Identify anomalies and outliers
|
||||||
|
- **Strategic insights**: Actionable recommendations
|
||||||
|
|
||||||
|
### 📊 **Professional Visualizations**
|
||||||
|
- **KPI metric cards**: Large, colorful summary metrics
|
||||||
|
- **Progress bars**: Coverage percentages with animations
|
||||||
|
- **Horizontal bar charts**: Distribution comparisons
|
||||||
|
- **Color-coded badges**: Status indicators (Excellent, Good, Needs Review)
|
||||||
|
- **Tables with hover effects**: Interactive data exploration
|
||||||
|
|
||||||
|
### 🌍 **Platform Flexibility**
|
||||||
|
- **Snowflake**: Uses `mcp__snowflake__execute_query` tool
|
||||||
|
- **Databricks**: Uses Databricks MCP tools (with fallback options)
|
||||||
|
- **Automatic SQL adaptation**: Platform-specific query generation
|
||||||
|
- **Table name resolution**: Handles catalog vs database differences
|
||||||
|
|
||||||
|
### 📋 **Comprehensive Coverage**
|
||||||
|
|
||||||
|
**9 Report Sections:**
|
||||||
|
1. Executive Summary (4 KPIs + findings)
|
||||||
|
2. Identity Resolution Performance (deduplication analysis)
|
||||||
|
3. Merge Distribution Analysis (categorized breakdown)
|
||||||
|
4. Top Merged Profiles (complexity ranking)
|
||||||
|
5. Source Table Configuration (mappings)
|
||||||
|
6. Master Table Data Quality (coverage metrics)
|
||||||
|
7. Convergence Performance (iteration analysis)
|
||||||
|
8. Expert Recommendations (strategic guidance)
|
||||||
|
9. Summary Statistics (complete metrics)
|
||||||
|
|
||||||
|
**16 Statistical Queries:**
|
||||||
|
- Source/result key statistics
|
||||||
|
- Canonical ID counts and distributions
|
||||||
|
- Merge pattern analysis
|
||||||
|
- Quality coverage metrics
|
||||||
|
- Configuration metadata
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Table Naming Conventions
|
||||||
|
|
||||||
|
The command automatically finds tables based on your canonical ID name:
|
||||||
|
|
||||||
|
### Required Tables
|
||||||
|
|
||||||
|
For canonical ID = `{canonical_id}`:
|
||||||
|
|
||||||
|
1. **Lookup Table**: `{canonical_id}_lookup`
|
||||||
|
- Contains: canonical_id, id, id_key_type
|
||||||
|
- Used for: Merge ratio, distribution, top profiles
|
||||||
|
|
||||||
|
2. **Master Table**: `{canonical_id}_master_table`
|
||||||
|
- Contains: {canonical_id}, best_* attributes
|
||||||
|
- Used for: Data quality coverage
|
||||||
|
|
||||||
|
3. **Source Stats**: `{canonical_id}_source_key_stats`
|
||||||
|
- Contains: from_table, total_distinct, distinct_*
|
||||||
|
- Used for: Pre-unification baseline
|
||||||
|
|
||||||
|
4. **Result Stats**: `{canonical_id}_result_key_stats`
|
||||||
|
- Contains: from_table, total_distinct, histogram_*
|
||||||
|
- Used for: Post-unification results
|
||||||
|
|
||||||
|
### Optional Tables
|
||||||
|
|
||||||
|
5. **Unification Metadata**: `unification_metadata`
|
||||||
|
- Contains: canonical_id_name, canonical_id_type
|
||||||
|
- Used for: Configuration documentation
|
||||||
|
|
||||||
|
6. **Column Lookup**: `column_lookup`
|
||||||
|
- Contains: table_name, column_name, key_name
|
||||||
|
- Used for: Source table mappings
|
||||||
|
|
||||||
|
7. **Filter Lookup**: `filter_lookup`
|
||||||
|
- Contains: key_name, invalid_texts, valid_regexp
|
||||||
|
- Used for: Validation rules
|
||||||
|
|
||||||
|
**All tables must be in the same database.schema (Snowflake) or catalog.schema (Databricks)**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Output Format
|
||||||
|
|
||||||
|
### HTML Report Features
|
||||||
|
|
||||||
|
**Styling:**
|
||||||
|
- Gradient purple theme (#667eea to #764ba2)
|
||||||
|
- Modern typography (system fonts)
|
||||||
|
- Responsive grid layouts
|
||||||
|
- Smooth hover animations
|
||||||
|
- Print-optimized media queries
|
||||||
|
|
||||||
|
**Sections:**
|
||||||
|
- Header with gradient background
|
||||||
|
- Metadata bar with key info
|
||||||
|
- 9 content sections with analysis
|
||||||
|
- Footer with generation details
|
||||||
|
|
||||||
|
**Visualizations:**
|
||||||
|
- Metric cards (4 in executive summary)
|
||||||
|
- Progress bars (6 in data quality)
|
||||||
|
- Horizontal bar charts (3 throughout report)
|
||||||
|
- Tables with sorting and hover effects
|
||||||
|
- Insight boxes with recommendations
|
||||||
|
|
||||||
|
**Interactivity:**
|
||||||
|
- Hover effects on cards and tables
|
||||||
|
- Animated progress bars
|
||||||
|
- Expandable insight boxes
|
||||||
|
- Responsive layout adapts to screen size
|
||||||
|
|
||||||
|
### PDF Export
|
||||||
|
|
||||||
|
To create a PDF from the HTML report:
|
||||||
|
|
||||||
|
1. Open HTML file in browser
|
||||||
|
2. Press Ctrl+P (Windows) or Cmd+P (Mac)
|
||||||
|
3. Select "Save as PDF"
|
||||||
|
4. Choose landscape orientation for better chart visibility
|
||||||
|
5. Enable background graphics for full styling
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
### Common Issues and Solutions
|
||||||
|
|
||||||
|
**Issue: "Tables not found"**
|
||||||
|
```
|
||||||
|
Solution:
|
||||||
|
1. Verify canonical ID name is correct
|
||||||
|
2. Check database/catalog and schema names
|
||||||
|
3. Ensure unification workflow completed successfully
|
||||||
|
4. Confirm table naming: {canonical_id}_lookup, {canonical_id}_master_table, etc.
|
||||||
|
```
|
||||||
|
|
||||||
|
**Issue: "MCP tools not available"**
|
||||||
|
```
|
||||||
|
Solution:
|
||||||
|
1. For Snowflake: Verify Snowflake MCP server is configured
|
||||||
|
2. For Databricks: Fall back to Snowflake MCP with proper connection string
|
||||||
|
3. Check network connectivity
|
||||||
|
4. Validate credentials
|
||||||
|
```
|
||||||
|
|
||||||
|
**Issue: "No data in statistics tables"**
|
||||||
|
```
|
||||||
|
Solution:
|
||||||
|
1. Verify unification workflow ran completely
|
||||||
|
2. Check that statistics SQL files were executed
|
||||||
|
3. Confirm data exists in lookup and master tables
|
||||||
|
4. Re-run the unification workflow if needed
|
||||||
|
```
|
||||||
|
|
||||||
|
**Issue: "Permission denied"**
|
||||||
|
```
|
||||||
|
Solution:
|
||||||
|
1. Verify READ access to all tables
|
||||||
|
2. For Snowflake: Grant SELECT on schema
|
||||||
|
3. For Databricks: Grant USE CATALOG, USE SCHEMA, SELECT
|
||||||
|
4. Check role/user permissions
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
Generated report will:
|
||||||
|
|
||||||
|
- ✅ **Open successfully** in all modern browsers (Chrome, Firefox, Safari, Edge)
|
||||||
|
- ✅ **Display all 9 sections** with complete data
|
||||||
|
- ✅ **Show accurate calculations** for all metrics
|
||||||
|
- ✅ **Include visualizations** (charts, progress bars, tables)
|
||||||
|
- ✅ **Render consistently** every time it's generated
|
||||||
|
- ✅ **Export cleanly to PDF** with proper formatting
|
||||||
|
- ✅ **Match the reference design** (same HTML/CSS structure)
|
||||||
|
- ✅ **Contain expert insights** and recommendations
|
||||||
|
- ✅ **Be production-ready** for stakeholder distribution
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Usage Examples
|
||||||
|
|
||||||
|
### Quick Start (Snowflake)
|
||||||
|
|
||||||
|
```
|
||||||
|
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
|
||||||
|
|
||||||
|
> Platform: Snowflake
|
||||||
|
> Database: PROD_CDP
|
||||||
|
> Schema: ID_UNIFICATION
|
||||||
|
> Canonical ID: master_customer_id
|
||||||
|
> Output: (press Enter for default)
|
||||||
|
|
||||||
|
✓ Report generated: id_unification_report.html
|
||||||
|
```
|
||||||
|
|
||||||
|
### Custom Output Path
|
||||||
|
|
||||||
|
```
|
||||||
|
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
|
||||||
|
|
||||||
|
> Platform: Databricks
|
||||||
|
> Catalog: analytics_prod
|
||||||
|
> Schema: unified_ids
|
||||||
|
> Canonical ID: td_id
|
||||||
|
> Output: /reports/weekly/td_id_stats_2025-10-15.html
|
||||||
|
|
||||||
|
✓ Report generated: /reports/weekly/td_id_stats_2025-10-15.html
|
||||||
|
```
|
||||||
|
|
||||||
|
### Multiple Environments
|
||||||
|
|
||||||
|
Generate reports for different environments:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Production
|
||||||
|
/hybrid-unif-merge-stats-creator
|
||||||
|
Platform: Snowflake
|
||||||
|
Database: PROD_CDP
|
||||||
|
Output: prod_merge_stats.html
|
||||||
|
|
||||||
|
# Staging
|
||||||
|
/hybrid-unif-merge-stats-creator
|
||||||
|
Platform: Snowflake
|
||||||
|
Database: STAGING_CDP
|
||||||
|
Output: staging_merge_stats.html
|
||||||
|
|
||||||
|
# Compare metrics across environments
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Best Practices
|
||||||
|
|
||||||
|
### Regular Reporting
|
||||||
|
|
||||||
|
1. **Weekly Reports**: Track merge performance over time
|
||||||
|
2. **Post-Workflow Reports**: Generate after each unification run
|
||||||
|
3. **Quality Audits**: Monthly deep-dive analysis
|
||||||
|
4. **Stakeholder Updates**: Executive-friendly format
|
||||||
|
|
||||||
|
### Comparative Analysis
|
||||||
|
|
||||||
|
Generate reports at different stages:
|
||||||
|
- After initial unification setup
|
||||||
|
- After incremental updates
|
||||||
|
- After data quality improvements
|
||||||
|
- Across different customer segments
|
||||||
|
|
||||||
|
### Archive and Versioning
|
||||||
|
|
||||||
|
```
|
||||||
|
reports/
|
||||||
|
2025-10-15_td_id_merge_stats.html
|
||||||
|
2025-10-08_td_id_merge_stats.html
|
||||||
|
2025-10-01_td_id_merge_stats.html
|
||||||
|
```
|
||||||
|
|
||||||
|
Track improvements over time by comparing:
|
||||||
|
- Merge ratios
|
||||||
|
- Data quality scores
|
||||||
|
- Convergence iterations
|
||||||
|
- Deduplication rates
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Getting Started
|
||||||
|
|
||||||
|
**Ready to generate your merge statistics report?**
|
||||||
|
|
||||||
|
Please provide:
|
||||||
|
|
||||||
|
1. **Platform**: Snowflake or Databricks?
|
||||||
|
2. **Database/Catalog**: Where are your unification tables?
|
||||||
|
3. **Schema**: Which schema contains the tables?
|
||||||
|
4. **Canonical ID**: What's the name of your unified ID? (e.g., td_id)
|
||||||
|
5. **Output Path** (optional): Where to save the report?
|
||||||
|
|
||||||
|
**Example:**
|
||||||
|
```
|
||||||
|
I want to generate a merge statistics report for:
|
||||||
|
|
||||||
|
Platform: Snowflake
|
||||||
|
Database: INDRESH_TEST
|
||||||
|
Schema: PUBLIC
|
||||||
|
Canonical ID: td_id
|
||||||
|
Output: my_unification_report.html
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**I'll analyze your ID unification results and create a comprehensive, beautiful HTML report with expert insights!**
|
||||||
101
plugin.lock.json
Normal file
101
plugin.lock.json
Normal file
@@ -0,0 +1,101 @@
|
|||||||
|
{
|
||||||
|
"$schema": "internal://schemas/plugin.lock.v1.json",
|
||||||
|
"pluginId": "gh:treasure-data/aps_claude_tools:plugins/cdp-hybrid-idu",
|
||||||
|
"normalized": {
|
||||||
|
"repo": null,
|
||||||
|
"ref": "refs/tags/v20251128.0",
|
||||||
|
"commit": "58382efafa00d9c88bf68f0ba2be494e310d9827",
|
||||||
|
"treeHash": "04cbd3c0d2b818afaf15f92f7e5fb2880103cdbfd513d9926f323c5b7722f625",
|
||||||
|
"generatedAt": "2025-11-28T10:28:44.950550Z",
|
||||||
|
"toolVersion": "publish_plugins.py@0.2.0"
|
||||||
|
},
|
||||||
|
"origin": {
|
||||||
|
"remote": "git@github.com:zhongweili/42plugin-data.git",
|
||||||
|
"branch": "master",
|
||||||
|
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
|
||||||
|
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
|
||||||
|
},
|
||||||
|
"manifest": {
|
||||||
|
"name": "cdp-hybrid-idu",
|
||||||
|
"description": "Multi-platform ID Unification for Snowflake and Databricks with YAML-driven configuration, convergence detection, and master table generation",
|
||||||
|
"version": null
|
||||||
|
},
|
||||||
|
"content": {
|
||||||
|
"files": [
|
||||||
|
{
|
||||||
|
"path": "README.md",
|
||||||
|
"sha256": "4e50b588ce6c220815a4ca869c68f41fe23cbaf05846fb306e7b2cbf127ed8f8"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "agents/hybrid-unif-keys-extractor.md",
|
||||||
|
"sha256": "d2c92a61393209f0835f0118240254a7fa6f209aa62ec87d0ab253723055a7da"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "agents/merge-stats-report-generator.md",
|
||||||
|
"sha256": "6e8fda43a277dfef132566b44a3dee23a632171641dcde0151d7602f43bcb5e8"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "agents/databricks-sql-generator.md",
|
||||||
|
"sha256": "ae3ce3874d7c00599fcef09718cb612e551aac89896e3c75aa1194332179df9d"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "agents/databricks-workflow-executor.md",
|
||||||
|
"sha256": "ecc4fcf94d470fe27f078e8722297921469852d596035e3d9d5b5d32aa2b0435"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "agents/yaml-configuration-builder.md",
|
||||||
|
"sha256": "da90f575f8f0f7e33fba1ad720c73556e029227fcf135cd9fe4a9a1d3fb77be3"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "agents/snowflake-sql-generator.md",
|
||||||
|
"sha256": "783ee1653bca7e0bb2647b953b4c05390e08686f7454c8e1a9e572851e8e0fc8"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "agents/snowflake-workflow-executor.md",
|
||||||
|
"sha256": "f5f5352f47cfdd5a52769988ed9893f5a64de2a236f145b315838d475babca2c"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": ".claude-plugin/plugin.json",
|
||||||
|
"sha256": "a156b276659131718eab652f7b9806ab00bf59318ee07e22a585e3cb13da5e93"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "commands/hybrid-setup.md",
|
||||||
|
"sha256": "9a287a1c414323cd6db2c5f3197fcfde531d337168e2696bc7f4896113ae40b6"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "commands/hybrid-generate-databricks.md",
|
||||||
|
"sha256": "aff13cf95a74cd71dff35e3a4cd4ba2f287a7b3091f84cdb914d80e00bfe29ad"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "commands/hybrid-generate-snowflake.md",
|
||||||
|
"sha256": "0dc460f41ee3c8130aa9a52537686fec6818e7a37a802040b8a570d8f89eaf77"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "commands/hybrid-unif-config-creator.md",
|
||||||
|
"sha256": "3e14989f811e5ef198cff306e9203ec6bfa5f3772daa3a0f08292595574ab73c"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "commands/hybrid-execute-databricks.md",
|
||||||
|
"sha256": "ad78068c5b96d310d1d620c00572c100915e0706a5312c0649b09a8165bbc79c"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "commands/hybrid-unif-config-validate.md",
|
||||||
|
"sha256": "a413582bc43a23ad1addde134007bd6a3174b14d71c10dcbbd5f7824a6a97fb0"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "commands/hybrid-unif-merge-stats-creator.md",
|
||||||
|
"sha256": "0c00db96f02559d212e502702eea5d3a02de8fedbca92f64eebd8ed430f96341"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": "commands/hybrid-execute-snowflake.md",
|
||||||
|
"sha256": "63fbc27f3350cd1d910d0dc4c588a14fd39e1d7ebda29ed6fce3584967bac4c4"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"dirSha256": "04cbd3c0d2b818afaf15f92f7e5fb2880103cdbfd513d9926f323c5b7722f625"
|
||||||
|
},
|
||||||
|
"security": {
|
||||||
|
"scannedAt": null,
|
||||||
|
"scannerVersion": null,
|
||||||
|
"flags": []
|
||||||
|
}
|
||||||
|
}
|
||||||
Reference in New Issue
Block a user