Initial commit
commands/hybrid-unif-config-creator.md
---
name: hybrid-unif-config-creator
description: Auto-generate unify.yml configuration for Snowflake/Databricks by extracting user identifiers from actual tables using strict PII detection
---

# Unify Configuration Creator for Snowflake/Databricks

## Overview

I'll automatically generate a production-ready `unify.yml` configuration file for your Snowflake or Databricks ID unification by:

1. **Analyzing your actual tables** using platform-specific MCP tools
2. **Extracting user identifiers** with zero-tolerance PII detection
3. **Validating data patterns** from real table data
4. **Generating unify.yml** using the exact template format
5. **Providing recommendations** for merge strategies and priorities

**This command uses STRICT analysis - only tables with actual user identifiers will be included.**

---

## What You Need to Provide

### 1. Platform Selection
- **Snowflake**: For Snowflake databases
- **Databricks**: For Databricks Unity Catalog tables

### 2. Tables to Analyze
Provide tables you want to analyze for ID unification:
- **Format (Snowflake)**: `database.schema.table` or `schema.table` or `table`
- **Format (Databricks)**: `catalog.schema.table` or `schema.table` or `table`
- **Example**: `customer_data.public.customers`, `orders`, `web_events.user_activity`
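
The three accepted table formats can be normalized before analysis. The following is a minimal sketch, not the command's actual parser: `TableRef` and `parse_table_ref` are hypothetical names, and quoted identifiers that contain dots are assumed not to occur.

```python
from typing import NamedTuple, Optional


class TableRef(NamedTuple):
    """A parsed table reference; parts that were not given stay None."""
    container: Optional[str]  # database (Snowflake) or catalog (Databricks)
    schema: Optional[str]
    table: str


def parse_table_ref(ref: str) -> TableRef:
    """Split `a.b.c`, `b.c`, or `c` into (container, schema, table)."""
    parts = [p.strip() for p in ref.strip().split(".")]
    if not 1 <= len(parts) <= 3 or any(not p for p in parts):
        raise ValueError(f"unsupported table reference: {ref!r}")
    # Left-pad with None so the table name is always the last element.
    padded = [None] * (3 - len(parts)) + parts
    return TableRef(*padded)
```

With this sketch, `orders` resolves to a bare table name, while `web_events.user_activity` is read as `schema.table`; the missing database or catalog would have to come from the session default.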

### 3. Canonical ID Configuration
- **Name**: Name for your unified ID (default: `td_id`)
- **Merge Iterations**: Number of unification loop iterations (default: 10)
- **Incremental Iterations**: Iterations for incremental processing (default: 5)

### 4. Output Configuration (Optional)
- **Output File**: Where to save unify.yml (default: `unify.yml`)
- **Template Path**: Path to template if using custom (default: uses built-in exact template)

---

## What I'll Do

### Step 1: Platform Detection and Validation
```
1. Confirm platform (Snowflake or Databricks)
2. Verify MCP tools are available for the platform
3. Set up platform-specific query patterns
4. Inform you of the analysis approach
```

### Step 2: Key Extraction with hybrid-unif-keys-extractor Agent
I'll launch the **hybrid-unif-keys-extractor agent** to:

**Schema Analysis**:
- Use platform MCP tools to describe each table
- Extract exact column names and data types
- Identify accessible vs inaccessible tables

**User Identifier Detection**:
- Apply STRICT matching rules for user identifiers:
  - ✅ Email columns (email, email_std, email_address, etc.)
  - ✅ Phone columns (phone, phone_number, mobile_phone, etc.)
  - ✅ User IDs (user_id, customer_id, account_id, etc.)
  - ✅ Cookie/Device IDs (td_client_id, cookie_id, etc.)
  - ❌ System columns (id, created_at, time, etc.)
  - ❌ Complex types (arrays, maps, objects, variants, structs)
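
The strict matching rules above can be pictured as an allow-list lookup that never guesses. This is only a sketch under stated assumptions: the name sets contain just the examples listed above (the real agent's lists are longer, as the "etc." implies), and `classify_column` and `find_identifiers` are hypothetical helpers.

```python
# Hypothetical name lists, seeded only from the examples in the rules above.
IDENTIFIER_COLUMNS = {
    "email": {"email", "email_std", "email_address"},
    "phone": {"phone", "phone_number", "mobile_phone"},
    "user_id": {"user_id", "customer_id", "account_id"},
    "cookie_or_device_id": {"td_client_id", "cookie_id"},
}
SYSTEM_COLUMNS = {"id", "created_at", "time"}
COMPLEX_TYPE_PREFIXES = ("array", "map", "object", "variant", "struct")


def classify_column(name, data_type):
    """Return the identifier category for a column, or None if it is rejected."""
    col = name.lower()
    dtype = data_type.lower()
    if dtype.startswith(COMPLEX_TYPE_PREFIXES):
        return None  # complex types are never used as identifiers
    if col in SYSTEM_COLUMNS:
        return None
    for category, names in IDENTIFIER_COLUMNS.items():
        if col in names:
            return category
    return None  # strict mode: unknown names are rejected, never guessed


def find_identifiers(columns):
    """Map qualifying column names to categories; an empty dict means EXCLUDED."""
    found = {}
    for name, data_type in columns:
        category = classify_column(name, data_type)
        if category is not None:
            found[name] = category
    return found
```

A table whose `find_identifiers` result is empty would be classified as EXCLUDED, and its column list kept for the exclusion report.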

**Data Validation**:
- Query actual MIN/MAX values from each identified column
- Analyze data patterns and quality
- Count unique values per identifier
- Detect data quality issues
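
The data validation checks above map naturally onto a single aggregate query per column. The sketch below builds such a query as a string; `build_profile_query` is a hypothetical helper, the null count is my own assumption of one useful quality signal, and the SQL uses only standard aggregates rather than any platform-specific syntax.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_profile_query(table, column):
    """Build one query returning MIN, MAX, distinct count, and null count."""
    for part in table.split(".") + [column]:
        # Accept plain identifiers only, so user input cannot inject SQL.
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return (
        f"SELECT MIN({column}) AS min_value, "
        f"MAX({column}) AS max_value, "
        f"COUNT(DISTINCT {column}) AS unique_values, "
        f"COUNT(*) - COUNT({column}) AS null_values "
        f"FROM {table}"
    )
```

For example, `build_profile_query("customer_data.public.customers", "email")` yields a statement that could be passed to whichever MCP query tool the platform provides.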

**Table Classification**:
- **INCLUDED**: Tables with valid user identifiers
- **EXCLUDED**: Tables without user identifiers (fully documented why)

**Expert Analysis**:
- 3 SQL experts review the data
- Provide priority recommendations
- Suggest validation rules based on actual data patterns

### Step 3: Unify.yml Generation

**CRITICAL**: Using the **EXACT BUILT-IN template structure** (embedded in hybrid-unif-keys-extractor agent)

**Template Usage Process**:
```
1. Receive structured data from hybrid-unif-keys-extractor agent:
   - Keys with validation rules
   - Tables with column mappings
   - Canonical ID configuration
   - Master tables specification

2. Use BUILT-IN template structure (see agent documentation)

3. ONLY replace these specific values:
   - Line 1: name: {canonical_id_name}
   - keys section: actual keys found
   - tables section: actual tables with actual columns
   - canonical_ids section: name and merge_by_keys
   - master_tables section: [] or user specifications

4. PRESERVE everything else:
   - ALL comment blocks (#####...)
   - ALL comment text ("Declare Validation logic", etc.)
   - ALL spacing and indentation (2 spaces per level)
   - ALL blank lines
   - EXACT YAML structure

5. Use Write tool to save populated unify.yml
```

**I'll generate**:

**Section 1: Canonical ID Name**
```yaml
name: {your_canonical_id_name}
```

**Section 2: Keys with Validation**
```yaml
keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']
  - name: phone_number
    invalid_texts: ['', 'N/A', 'null']
```
*Populated with actual keys found in your tables*
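
To illustrate what these two rule types mean for a single value, here is a small sketch. It is not the generated SQL: `is_valid_key_value` is a hypothetical helper, and the unanchored `re.search` is an assumption, since whether a pattern is anchored depends on the regular-expression function the target platform uses.

```python
import re

# Key definitions mirroring the YAML sample above.
KEYS = [
    {"name": "email", "valid_regexp": ".*@.*", "invalid_texts": ["", "N/A", "null"]},
    {"name": "customer_id", "invalid_texts": ["", "N/A", "null"]},
]


def is_valid_key_value(key, value):
    """Return True when `value` passes the key's validation rules."""
    if value is None:
        return False
    text = str(value)
    if text in key.get("invalid_texts", []):
        return False
    pattern = key.get("valid_regexp")
    # Assumption: unanchored search; a real engine may anchor the pattern.
    return pattern is None or re.search(pattern, text) is not None
```

A key without `valid_regexp`, such as `customer_id` here, is filtered only by its `invalid_texts` list.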

**Section 3: Tables with Key Column Mappings**
```yaml
tables:
  - database: {database/catalog}
    table: {table_name}
    key_columns:
      - {column: actual_column_name, key: mapped_key}
      - {column: another_column, key: another_key}
```
*Only tables with valid user identifiers, with EXACT column names from schema analysis*

**Section 4: Canonical IDs Configuration**
```yaml
canonical_ids:
  - name: {your_canonical_id_name}
    merge_by_keys: [email, customer_id, phone_number]
    merge_iterations: 15
```
*Based on extracted keys and your configuration*

**Section 5: Master Tables (Optional)**
```yaml
master_tables:
  - name: {canonical_id_name}_master_table
    canonical_id: {canonical_id_name}
    attributes:
      - name: best_email
        source_columns:
          - {table: table1, column: email, order: last, order_by: time, priority: 1}
          - {table: table2, column: email_address, order: last, order_by: time, priority: 2}
```
*If you request master table configuration, I'll help set up attribute aggregation*
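
One way to read the `source_columns` entries above is as an ordered fallback: try the lowest `priority` number first and, within that source, take the row that sorts last by `order_by`. The sketch below encodes that reading for a single canonical ID. It is an assumption about the semantics rather than the engine's documented behavior, and `pick_attribute` is a hypothetical name.

```python
def pick_attribute(rows, source_columns):
    """Pick one attribute value for a single canonical ID.

    rows: dicts with "table", the order_by column, and source column values.
    source_columns: entries shaped like the YAML sample above.
    """
    # Assumption: a lower `priority` number is tried first.
    for src in sorted(source_columns, key=lambda s: s["priority"]):
        candidates = [
            r for r in rows
            if r["table"] == src["table"] and r.get(src["column"]) is not None
        ]
        if not candidates:
            continue  # nothing usable here, fall through to the next priority
        candidates.sort(key=lambda r: r[src["order_by"]])
        chosen = candidates[-1] if src["order"] == "last" else candidates[0]
        return chosen[src["column"]]
    return None
```

Under this reading, `table2.email_address` contributes to `best_email` only for canonical IDs that have no non-null `table1.email`.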

### Step 4: Validation and Review

After generation:
```
1. Show complete unify.yml content
2. Highlight key sections:
   - Keys found: [list]
   - Tables included: [count]
   - Tables excluded: [count] with reasons
   - Merge strategy: [keys and priorities]
3. Provide recommendations for optimization
4. Ask for your approval before saving
```

### Step 5: File Output

```
1. Create backup of existing file if present
2. Write unify.yml to specified location
3. Provide file summary:
   - Keys configured: X
   - Tables configured: Y
   - Validation rules: Z
4. Show next steps for using the configuration
```
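
The backup has to happen before the new file is written, otherwise the previous configuration is already gone. A minimal sketch of that ordering follows; `write_with_backup` and the `.bak` suffix are hypothetical choices, not the behavior of the Write tool itself.

```python
import shutil
from pathlib import Path


def write_with_backup(path, content):
    """Write `content` to `path`, copying any existing file to `<name>.bak` first.

    Returns the backup path, or None when there was nothing to back up.
    """
    target = Path(path)
    backup = None
    if target.exists():
        backup = target.with_name(target.name + ".bak")
        shutil.copy2(target, backup)  # keep the previous configuration
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return backup
```

Calling it twice on the same path leaves the first version in `unify.yml.bak` and the second in `unify.yml`.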

---

## Example Workflow

**Input**:
```
Platform: Snowflake
Tables:
  - customer_data.public.customers
  - customer_data.public.orders
  - web_data.public.events
Canonical ID Name: unified_customer_id
Output: snowflake_unify.yml
```

**Process**:
```
✓ Platform: Snowflake MCP tools detected
✓ Analyzing 3 tables...

Schema Analysis:
  ✓ customer_data.public.customers - 12 columns
  ✓ customer_data.public.orders - 8 columns
  ✓ web_data.public.events - 15 columns

User Identifier Detection:
  ✓ customers: email, customer_id (2 identifiers)
  ✓ orders: customer_id, email_address (2 identifiers)
  ✗ events: NO user identifiers found
    Available columns: event_id, session_id, page_url, timestamp, ...
    Reason: Contains only event tracking data - no PII

Data Analysis:
  ✓ email: 45,123 unique values, format valid
  ✓ customer_id: 45,089 unique values, numeric
  ✓ email_address: 12,456 unique values, format valid

Expert Analysis Complete:
  Priority 1: customer_id (most stable, highest coverage)
  Priority 2: email (good coverage, some quality issues)
  Priority 3: phone_number (not found)

Generating unify.yml...
  ✓ Keys section: 2 keys configured
  ✓ Tables section: 2 tables configured
  ✓ Canonical IDs: unified_customer_id
  ✓ Validation rules: Applied based on data patterns

Tables EXCLUDED:
  - web_data.public.events: No user identifiers
```

**Output (snowflake_unify.yml)**:
```yaml
name: unified_customer_id

keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']

tables:
  - database: customer_data
    table: customers
    key_columns:
      - {column: email, key: email}
      - {column: customer_id, key: customer_id}
  - database: customer_data
    table: orders
    key_columns:
      - {column: email_address, key: email}
      - {column: customer_id, key: customer_id}

canonical_ids:
  - name: unified_customer_id
    merge_by_keys: [customer_id, email]
    merge_iterations: 15

master_tables: []
```
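
A quick way to sanity-check output like this is to confirm that every key referenced in `tables` and `canonical_ids` is declared under `keys`. The sketch below does so on a plain dictionary that mirrors the example; `check_unify_config` is a hypothetical helper, and loading the YAML file into that dictionary is left out.

```python
# The example output above, written as the dictionary a YAML loader would produce.
CONFIG = {
    "name": "unified_customer_id",
    "keys": [
        {"name": "email", "valid_regexp": ".*@.*", "invalid_texts": ["", "N/A", "null"]},
        {"name": "customer_id", "invalid_texts": ["", "N/A", "null"]},
    ],
    "tables": [
        {"database": "customer_data", "table": "customers", "key_columns": [
            {"column": "email", "key": "email"},
            {"column": "customer_id", "key": "customer_id"},
        ]},
        {"database": "customer_data", "table": "orders", "key_columns": [
            {"column": "email_address", "key": "email"},
            {"column": "customer_id", "key": "customer_id"},
        ]},
    ],
    "canonical_ids": [
        {"name": "unified_customer_id", "merge_by_keys": ["customer_id", "email"],
         "merge_iterations": 15},
    ],
    "master_tables": [],
}


def check_unify_config(config):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    declared = {k["name"] for k in config.get("keys", [])}
    for table in config.get("tables", []):
        for mapping in table.get("key_columns", []):
            if mapping["key"] not in declared:
                problems.append(
                    f"{table['table']}.{mapping['column']} maps to undeclared key {mapping['key']!r}"
                )
    for cid in config.get("canonical_ids", []):
        for key in cid.get("merge_by_keys", []):
            if key not in declared:
                problems.append(f"canonical ID {cid['name']!r} merges by undeclared key {key!r}")
    return problems
```

For the example configuration the check returns an empty list; adding `phone_number` to `merge_by_keys` without declaring it under `keys` would be reported.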

---

## Key Features

### 🔍 **STRICT PII Detection**
- Zero tolerance for guessing
- Only includes tables with actual user identifiers
- Documents why tables are excluded
- Based on REAL schema and data analysis

### ✅ **Exact Template Compliance**
- Uses BUILT-IN exact template structure (embedded in hybrid-unif-keys-extractor agent)
- NO modifications to template format
- Preserves all comment sections
- Maintains exact YAML structure
- Portable across all systems

### 📊 **Real Data Analysis**
- Queries actual MIN/MAX values
- Counts unique identifiers
- Validates data patterns
- Identifies quality issues

### 🎯 **Platform-Aware**
- Uses correct MCP tools for each platform
- Respects platform naming conventions
- Applies platform-specific data type rules
- Generates platform-compatible SQL references

### 📋 **Complete Documentation**
- Documents all excluded tables with reasons
- Lists available columns for excluded tables
- Explains why columns don't qualify as user identifiers
- Provides expert recommendations

---

## Output Format

**The generated unify.yml will have EXACTLY this structure:**

```yaml
name: {canonical_id_name}
#####################################################
##
##Declare Validation logic for unification keys
##
#####################################################
keys:
  - name: {key1}
    valid_regexp: "{pattern}"
    invalid_texts: ['{val1}', '{val2}', '{val3}']
  - name: {key2}
    invalid_texts: ['{val1}', '{val2}', '{val3}']

#####################################################
##
##Declare databases, tables, and keys to use during unification
##
#####################################################

tables:
  - database: {db/catalog}
    table: {table}
    key_columns:
      - {column: {col}, key: {key}}

#####################################################
##
##Declare hierarchy for unification. Define keys to use for each level.
##
#####################################################

canonical_ids:
  - name: {canonical_id_name}
    merge_by_keys: [{key1}, {key2}, ...]
    merge_iterations: {number}

#####################################################
##
##Declare Similar Attributes and standardize into a single column
##
#####################################################

master_tables:
  - name: {canonical_id_name}_master_table
    canonical_id: {canonical_id_name}
    attributes:
      - name: {attribute}
        source_columns:
          - {table: {t}, column: {c}, order: last, order_by: time, priority: 1}
```

**NO deviations from this structure - EXACT template compliance guaranteed.**

---

## Prerequisites

### Required:
- ✅ Snowflake or Databricks platform access
- ✅ Platform-specific MCP tools configured (if they are unavailable, I'll offer a fallback such as manual schema input)
- ✅ Read permissions on tables to be analyzed
- ✅ Tables must exist and be accessible

### Optional:
- Custom unify.yml template path (if not using default)
- Master table attribute specifications
- Custom validation rules

---

## Expected Timeline

| Step | Duration |
|------|----------|
| Platform detection | < 1 min |
| Schema analysis (per table) | 5-10 sec |
| Data analysis (per identifier) | 10-20 sec |
| Expert analysis | 1-2 min |
| YAML generation | < 1 min |
| **Total (for 5 tables)** | **~3-5 min** |

---

## Error Handling

### Common Issues:

**Issue**: MCP tools not available for platform
**Solution**:
- I'll inform you and provide fallback options
- You can provide schema information manually
- I'll still generate unify.yml with validation warnings

**Issue**: No tables have user identifiers
**Solution**:
- I'll show you why tables were excluded
- Suggest alternative tables to analyze
- Explain what constitutes a user identifier

**Issue**: Table not accessible
**Solution**:
- Document which tables are inaccessible
- Continue with accessible tables
- Recommend permission checks

**Issue**: Complex data types found
**Solution**:
- Exclude complex type columns (arrays, structs, maps)
- Explain why they can't be used for unification
- Suggest alternative columns if available

---

## Success Criteria

Generated unify.yml will:
- ✅ Use EXACT template structure - NO modifications
- ✅ Contain ONLY tables with validated user identifiers
- ✅ Include ONLY columns that actually exist in tables
- ✅ Have validation rules based on actual data patterns
- ✅ Be ready for immediate use with hybrid-generate-snowflake or hybrid-generate-databricks
- ✅ Work without any manual edits
- ✅ Include comprehensive documentation in comments

---

## Next Steps After Generation

1. **Review the generated unify.yml**
   - Verify tables and columns are correct
   - Check validation rules are appropriate
   - Review merge strategy and priorities

2. **Generate SQL for your platform**:
   - Snowflake: `/cdp-hybrid-idu:hybrid-generate-snowflake`
   - Databricks: `/cdp-hybrid-idu:hybrid-generate-databricks`

3. **Execute the workflow**:
   - Snowflake: `/cdp-hybrid-idu:hybrid-execute-snowflake`
   - Databricks: `/cdp-hybrid-idu:hybrid-execute-databricks`

4. **Monitor convergence and results**

---

## Getting Started

**Ready to begin?**

Please provide:

1. **Platform**: Snowflake or Databricks
2. **Tables**: List of tables to analyze (full paths)
3. **Canonical ID Name**: Name for your unified ID (e.g., `unified_customer_id`)
4. **Output File** (optional): Where to save unify.yml (default: `unify.yml`)

**Example**:
```
Platform: Snowflake
Tables:
  - customer_db.public.customers
  - customer_db.public.orders
  - marketing_db.public.campaigns
Canonical ID: unified_id
Output: snowflake_unify.yml
```

---

**I'll analyze your tables and generate a production-ready unify.yml configuration!**