| name | description |
|---|---|
| hybrid-unif-config-creator | Auto-generate unify.yml configuration for Snowflake/Databricks by extracting user identifiers from actual tables using strict PII detection |
Unify Configuration Creator for Snowflake/Databricks
Overview
I'll automatically generate a production-ready unify.yml configuration file for your Snowflake or Databricks ID unification by:
- Analyzing your actual tables using platform-specific MCP tools
- Extracting user identifiers with zero-tolerance PII detection
- Validating data patterns from real table data
- Generating unify.yml using the exact template format
- Providing recommendations for merge strategies and priorities
This command uses STRICT analysis - only tables with actual user identifiers will be included.
What You Need to Provide
1. Platform Selection
- Snowflake: For Snowflake databases
- Databricks: For Databricks Unity Catalog tables
2. Tables to Analyze
Provide tables you want to analyze for ID unification:
- Format (Snowflake): database.schema.table, schema.table, or table
- Format (Databricks): catalog.schema.table, schema.table, or table
- Example: customer_data.public.customers, orders, web_events.user_activity
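The accepted table formats can be sketched as a small parser. This is a minimal illustration, not the command's actual parsing logic: the names `TableRef`, `parse_table_ref`, and `parse_table_list` are hypothetical, and quoted identifiers that contain dots are not handled.

```python
from typing import List, NamedTuple, Optional


class TableRef(NamedTuple):
    """A table reference split into its optional and required parts."""
    database: Optional[str]  # Snowflake database or Databricks catalog
    schema: Optional[str]
    table: str


def parse_table_ref(ref: str) -> TableRef:
    """Split 'db.schema.table', 'schema.table', or 'table' into parts."""
    parts = [p.strip() for p in ref.strip().split(".")]
    if not 1 <= len(parts) <= 3 or any(not p for p in parts):
        raise ValueError(f"Unsupported table reference: {ref!r}")
    # Left-pad with None so the last element is always the table name.
    padded = [None] * (3 - len(parts)) + parts
    return TableRef(*padded)


def parse_table_list(text: str) -> List[TableRef]:
    """Parse a comma-separated list of table references."""
    return [parse_table_ref(item) for item in text.split(",") if item.strip()]
```

For the example list, `parse_table_list` yields one fully qualified reference, one bare table name, and one schema-qualified reference.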
3. Canonical ID Configuration
- Name: Name for your unified ID (default: td_id)
- Merge Iterations: Number of unification loop iterations (default: 10)
- Incremental Iterations: Iterations for incremental processing (default: 5)
4. Output Configuration (Optional)
- Output File: Where to save unify.yml (default: unify.yml)
- Template Path: Path to a custom template, if you use one (default: the built-in exact template)
What I'll Do
Step 1: Platform Detection and Validation
1. Confirm platform (Snowflake or Databricks)
2. Verify MCP tools are available for the platform
3. Set up platform-specific query patterns
4. Inform you of the analysis approach
Step 2: Key Extraction with hybrid-unif-keys-extractor Agent
I'll launch the hybrid-unif-keys-extractor agent to:
Schema Analysis:
- Use platform MCP tools to describe each table
- Extract exact column names and data types
- Identify accessible vs inaccessible tables
User Identifier Detection:
- Apply STRICT matching rules for user identifiers:
  - ✅ Email columns (email, email_std, email_address, etc.)
  - ✅ Phone columns (phone, phone_number, mobile_phone, etc.)
  - ✅ User IDs (user_id, customer_id, account_id, etc.)
  - ✅ Cookie/Device IDs (td_client_id, cookie_id, etc.)
  - ❌ System columns (id, created_at, time, etc.)
  - ❌ Complex types (arrays, maps, objects, variants, structs)
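As a rough illustration of the matching rules above, the sketch below classifies a column by its name and data type. The patterns are assumptions derived only from the example column names listed here; the agent's real rule set is not shown in this document, and `classify_column` is a hypothetical helper.

```python
import re

# Hypothetical name patterns, based only on the examples listed above.
IDENTIFIER_PATTERNS = {
    "email": re.compile(r"^email(_|$)"),
    "phone": re.compile(r"(^|_)phone(_|$)"),
    "user_id": re.compile(r"^(user|customer|account)_id$"),
    "cookie_or_device_id": re.compile(r"^(td_client_id|cookie_id)$"),
}
SYSTEM_COLUMNS = {"id", "created_at", "time"}
COMPLEX_TYPE_PREFIXES = ("array", "map", "object", "variant", "struct")


def classify_column(name, data_type):
    """Return the identifier category for a column, or None if excluded."""
    lowered = name.lower()
    # Complex types are excluded regardless of the column name.
    if data_type.lower().startswith(COMPLEX_TYPE_PREFIXES):
        return None
    if lowered in SYSTEM_COLUMNS:
        return None
    for category, pattern in IDENTIFIER_PATTERNS.items():
        if pattern.search(lowered):
            return category
    # No guessing: anything that does not match a known pattern is excluded.
    return None
```

Under these assumed patterns, `event_id` and `session_id` return `None`, which mirrors the zero-tolerance stance: a column that merely looks like an ID is not treated as a user identifier.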
Data Validation:
- Query actual MIN/MAX values from each identified column
- Analyze data patterns and quality
- Count unique values per identifier
- Detect data quality issues
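The kind of profiling query this step implies might look like the sketch below. The query shape and column aliases are assumptions for illustration; the agent's actual SQL is not shown in this document. `build_profile_query` is a hypothetical helper that only builds a string and does not connect to any platform.

```python
import re

# Plain, unquoted identifiers only; anything else is rejected.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_profile_query(table, column):
    """Build a profiling query for one identifier column.

    Uses only MIN, MAX, COUNT(DISTINCT ...), and COUNT(*), which are
    standard SQL aggregates.
    """
    for part in table.split(".") + [column]:
        if not _IDENT.match(part):
            raise ValueError(f"Unsafe identifier: {part!r}")
    return (
        f"SELECT MIN({column}) AS min_value, "
        f"MAX({column}) AS max_value, "
        f"COUNT(DISTINCT {column}) AS unique_values, "
        f"COUNT(*) - COUNT({column}) AS null_rows "
        f"FROM {table}"
    )
```

The `null_rows` expression relies on `COUNT(column)` skipping NULLs, so the difference from `COUNT(*)` gives a quick coverage signal for the identifier.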
Table Classification:
- INCLUDED: Tables with valid user identifiers
- EXCLUDED: Tables without user identifiers (fully documented why)
Expert Analysis:
- 3 SQL experts review the data
- Provide priority recommendations
- Suggest validation rules based on actual data patterns
Step 3: Unify.yml Generation
CRITICAL: I use the EXACT BUILT-IN template structure (embedded in the hybrid-unif-keys-extractor agent).
Template Usage Process:
1. Receive structured data from the hybrid-unif-keys-extractor agent:
   - Keys with validation rules
   - Tables with column mappings
   - Canonical ID configuration
   - Master tables specification
2. Use BUILT-IN template structure (see agent documentation)
3. ONLY replace these specific values:
   - Line 1: name: {canonical_id_name}
   - keys section: actual keys found
   - tables section: actual tables with actual columns
   - canonical_ids section: name and merge_by_keys
   - master_tables section: [] or user specifications
4. PRESERVE everything else:
   - ALL comment blocks (#####...)
   - ALL comment text ("Declare Validation logic", etc.)
   - ALL spacing and indentation (2 spaces per level)
   - ALL blank lines
   - EXACT YAML structure
5. Use the Write tool to save the populated unify.yml
I'll generate:
Section 1: Canonical ID Name
name: {your_canonical_id_name}
Section 2: Keys with Validation
keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']
  - name: phone_number
    invalid_texts: ['', 'N/A', 'null']
Populated with actual keys found in your tables
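To make the effect of `valid_regexp` and `invalid_texts` concrete, here is a minimal sketch of how such a rule could be applied to a single value. It assumes the regular expression must match the whole value and that `invalid_texts` is compared exactly; the unification engine's real semantics may differ. `is_valid_key_value` is a hypothetical helper.

```python
import re


def is_valid_key_value(value, rule):
    """Return True if a value passes a key's valid_regexp / invalid_texts rule."""
    if value is None:
        return False
    # Placeholder strings such as '' or 'N/A' never count as identifiers.
    if value in rule.get("invalid_texts", []):
        return False
    pattern = rule.get("valid_regexp")
    # Assumption: the pattern must match the whole value.
    if pattern is not None and re.fullmatch(pattern, value) is None:
        return False
    return True


email_rule = {
    "name": "email",
    "valid_regexp": ".*@.*",
    "invalid_texts": ["", "N/A", "null"],
}
print(is_valid_key_value("jane@example.com", email_rule))  # True
print(is_valid_key_value("N/A", email_rule))               # False
```

A key without `valid_regexp`, such as `customer_id` above, is only filtered by its `invalid_texts` list.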
Section 3: Tables with Key Column Mappings
tables:
  - database: {database/catalog}
    table: {table_name}
    key_columns:
      - {column: actual_column_name, key: mapped_key}
      - {column: another_column, key: another_key}
Only tables with valid user identifiers, with EXACT column names from schema analysis
Section 4: Canonical IDs Configuration
canonical_ids:
  - name: {your_canonical_id_name}
    merge_by_keys: [email, customer_id, phone_number]
    merge_iterations: 15
Based on extracted keys and your configuration
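Why an iteration count matters can be shown with a toy example. The sketch below is not the SQL that the generate commands produce; it is a simplified label-propagation loop, written under the assumption that each iteration lets an ID spread one hop across records that share a key value. The function name and record layout are hypothetical.

```python
def propagate_ids(records, max_iterations):
    """Toy label propagation over records that share key values.

    records: list of sets of (key, value) pairs.
    Returns (labels, iterations_used), where equal labels mean "same person".
    """
    labels = list(range(len(records)))
    for iteration in range(1, max_iterations + 1):
        # Smallest label currently attached to each key value.
        best = {}
        for label, record in zip(labels, records):
            for kv in record:
                best[kv] = min(best.get(kv, label), label)
        new_labels = [
            min([label] + [best[kv] for kv in record])
            for label, record in zip(labels, records)
        ]
        if new_labels == labels:
            return labels, iteration  # converged: nothing changed this pass
        labels = new_labels
    return labels, max_iterations


# A chain: email links rows 0-1, customer_id links 1-2, phone links 2-3.
chain = [
    {("email", "a@example.com")},
    {("email", "a@example.com"), ("customer_id", "1")},
    {("customer_id", "1"), ("phone_number", "555")},
    {("phone_number", "555")},
]
print(propagate_ids(chain, 10))  # ([0, 0, 0, 0], 4)
print(propagate_ids(chain, 2))   # ([0, 0, 0, 1], 2)
```

With too few iterations the last record in the chain keeps a separate label, which is the situation a larger `merge_iterations` value is meant to avoid.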
Section 5: Master Tables (Optional)
master_tables:
  - name: {canonical_id_name}_master_table
    canonical_id: {canonical_id_name}
    attributes:
      - name: best_email
        source_columns:
          - {table: table1, column: email, order: last, order_by: time, priority: 1}
          - {table: table2, column: email_address, order: last, order_by: time, priority: 2}
If you request master table configuration, I'll help set up attribute aggregation
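One plausible reading of the `source_columns` settings is sketched below: sources are tried in ascending `priority`, and within a source the value from the last row by `order_by` wins. This interpretation is an assumption made for illustration, not a statement of how the unification engine resolves attributes. `pick_attribute` and the row layout are hypothetical.

```python
def pick_attribute(rows_by_table, source_columns):
    """Return the first non-empty value found, walking sources by priority.

    rows_by_table: {table_name: [row_dict, ...]} for a single canonical ID.
    """
    for src in sorted(source_columns, key=lambda s: s["priority"]):
        rows = rows_by_table.get(src["table"], [])
        candidates = [r for r in rows if r.get(src["column"]) not in (None, "")]
        if not candidates:
            continue  # fall through to the next-priority source
        ordered = sorted(candidates, key=lambda r: r[src["order_by"]])
        chosen = ordered[-1] if src["order"] == "last" else ordered[0]
        return chosen[src["column"]]
    return None


sources = [
    {"table": "table1", "column": "email", "order": "last", "order_by": "time", "priority": 1},
    {"table": "table2", "column": "email_address", "order": "last", "order_by": "time", "priority": 2},
]
rows = {
    "table1": [
        {"email": "old@example.com", "time": 1},
        {"email": "new@example.com", "time": 5},
    ],
    "table2": [{"email_address": "other@example.com", "time": 9}],
}
print(pick_attribute(rows, sources))  # new@example.com
```

Here `table2` holds a more recent row, but `table1` still wins because its priority is checked first.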
Step 4: Validation and Review
After generation:
1. Show complete unify.yml content
2. Highlight key sections:
   - Keys found: [list]
   - Tables included: [count]
   - Tables excluded: [count] with reasons
   - Merge strategy: [keys and priorities]
3. Provide recommendations for optimization
4. Ask for your approval before saving
Step 5: File Output
1. Back up any existing file at the target location
2. Write unify.yml to the specified location
3. Provide file summary:
   - Keys configured: X
   - Tables configured: Y
   - Validation rules: Z
4. Show next steps for using the configuration
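The backup-and-write part of this step can be sketched with the standard library. The `.bak` suffix and the helper name `write_with_backup` are assumptions; the document does not say how the backup file is named.

```python
import shutil
from pathlib import Path


def write_with_backup(path, content):
    """Write content to path, copying any existing file to '<name>.bak' first.

    Returns the backup path, or None if there was nothing to back up.
    """
    target = Path(path)
    backup = None
    if target.exists():
        # Copy the old file aside before it is overwritten.
        backup = target.with_name(target.name + ".bak")
        shutil.copy2(target, backup)
    target.write_text(content, encoding="utf-8")
    return backup
```

Calling it twice on the same path leaves the first version in `unify.yml.bak` and the second in `unify.yml`.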
Example Workflow
Input:
Platform: Snowflake
Tables:
- customer_data.public.customers
- customer_data.public.orders
- web_data.public.events
Canonical ID Name: unified_customer_id
Output: snowflake_unify.yml
Process:
✓ Platform: Snowflake MCP tools detected
✓ Analyzing 3 tables...
Schema Analysis:
✓ customer_data.public.customers - 12 columns
✓ customer_data.public.orders - 8 columns
✓ web_data.public.events - 15 columns
User Identifier Detection:
✓ customers: email, customer_id (2 identifiers)
✓ orders: customer_id, email_address (2 identifiers)
✗ events: NO user identifiers found
Available columns: event_id, session_id, page_url, timestamp, ...
Reason: Contains only event tracking data - no PII
Data Analysis:
✓ email: 45,123 unique values, format valid
✓ customer_id: 45,089 unique values, numeric
✓ email_address: 12,456 unique values, format valid
Expert Analysis Complete:
Priority 1: customer_id (most stable, highest coverage)
Priority 2: email (good coverage, some quality issues)
Priority 3: phone_number (not found)
Generating unify.yml...
✓ Keys section: 2 keys configured
✓ Tables section: 2 tables configured
✓ Canonical IDs: unified_customer_id
✓ Validation rules: Applied based on data patterns
Tables EXCLUDED:
- web_data.public.events: No user identifiers
Output (snowflake_unify.yml):
name: unified_customer_id
keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']
tables:
  - database: customer_data
    table: customers
    key_columns:
      - {column: email, key: email}
      - {column: customer_id, key: customer_id}
  - database: customer_data
    table: orders
    key_columns:
      - {column: email_address, key: email}
      - {column: customer_id, key: customer_id}
canonical_ids:
  - name: unified_customer_id
    merge_by_keys: [customer_id, email]
    merge_iterations: 15
master_tables: []
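A quick way to sanity-check a generated configuration is to confirm that every key referenced by a table or by `merge_by_keys` is declared under `keys`. The sketch below runs that check on the example output, written as the dict a YAML parser would produce. `check_config` is a hypothetical helper and is not part of the command.

```python
def check_config(config):
    """Return a list of consistency problems in a parsed unify config."""
    problems = []
    declared = {key["name"] for key in config.get("keys", [])}
    mapped = set()
    for table in config.get("tables", []):
        for mapping in table.get("key_columns", []):
            mapped.add(mapping["key"])
            if mapping["key"] not in declared:
                problems.append(
                    f"{table['table']}.{mapping['column']}: "
                    f"key {mapping['key']!r} is not declared under keys"
                )
    for canonical in config.get("canonical_ids", []):
        for key in canonical.get("merge_by_keys", []):
            if key not in declared:
                problems.append(f"merge_by_keys: {key!r} is not declared under keys")
            elif key not in mapped:
                problems.append(f"merge_by_keys: {key!r} is not mapped by any table")
    return problems


example = {
    "name": "unified_customer_id",
    "keys": [
        {"name": "email", "valid_regexp": ".*@.*", "invalid_texts": ["", "N/A", "null"]},
        {"name": "customer_id", "invalid_texts": ["", "N/A", "null"]},
    ],
    "tables": [
        {"database": "customer_data", "table": "customers", "key_columns": [
            {"column": "email", "key": "email"},
            {"column": "customer_id", "key": "customer_id"},
        ]},
        {"database": "customer_data", "table": "orders", "key_columns": [
            {"column": "email_address", "key": "email"},
            {"column": "customer_id", "key": "customer_id"},
        ]},
    ],
    "canonical_ids": [
        {"name": "unified_customer_id", "merge_by_keys": ["customer_id", "email"],
         "merge_iterations": 15},
    ],
    "master_tables": [],
}
print(check_config(example))  # []
```

An empty list means the three sections agree with each other; it says nothing about whether the columns exist in the warehouse.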
Key Features
🔍 STRICT PII Detection
- Zero tolerance for guessing
- Only includes tables with actual user identifiers
- Documents why tables are excluded
- Based on REAL schema and data analysis
✅ Exact Template Compliance
- Uses BUILT-IN exact template structure (embedded in hybrid-unif-keys-extractor agent)
- NO modifications to template format
- Preserves all comment sections
- Maintains exact YAML structure
- Portable across all systems
📊 Real Data Analysis
- Queries actual MIN/MAX values
- Counts unique identifiers
- Validates data patterns
- Identifies quality issues
🎯 Platform-Aware
- Uses correct MCP tools for each platform
- Respects platform naming conventions
- Applies platform-specific data type rules
- Generates platform-compatible SQL references
📋 Complete Documentation
- Documents all excluded tables with reasons
- Lists available columns for excluded tables
- Explains why columns don't qualify as user identifiers
- Provides expert recommendations
Output Format
The generated unify.yml will have EXACTLY this structure:
name: {canonical_id_name}
#####################################################
##
##Declare Validation logic for unification keys
##
#####################################################
keys:
  - name: {key1}
    valid_regexp: "{pattern}"
    invalid_texts: ['{val1}', '{val2}', '{val3}']
  - name: {key2}
    invalid_texts: ['{val1}', '{val2}', '{val3}']
#####################################################
##
##Declare databases, tables, and keys to use during unification
##
#####################################################
tables:
  - database: {db/catalog}
    table: {table}
    key_columns:
      - {column: {col}, key: {key}}
#####################################################
##
##Declare hierarchy for unification. Define keys to use for each level.
##
#####################################################
canonical_ids:
  - name: {canonical_id_name}
    merge_by_keys: [{key1}, {key2}, ...]
    merge_iterations: {number}
#####################################################
##
##Declare Similar Attributes and standardize into a single column
##
#####################################################
master_tables:
  - name: {canonical_id_name}_master_table
    canonical_id: {canonical_id_name}
    attributes:
      - name: {attribute}
        source_columns:
          - {table: {t}, column: {c}, order: last, order_by: time, priority: 1}
NO deviations from this structure - EXACT template compliance guaranteed.
Prerequisites
Required:
- ✅ Snowflake or Databricks platform access
- ✅ Platform-specific MCP tools configured (may use fallback if unavailable)
- ✅ Read permissions on tables to be analyzed
- ✅ Tables must exist and be accessible
Optional:
- Custom unify.yml template path (if not using default)
- Master table attribute specifications
- Custom validation rules
Expected Timeline
| Step | Duration |
|---|---|
| Platform detection | < 1 min |
| Schema analysis (per table) | 5-10 sec |
| Data analysis (per identifier) | 10-20 sec |
| Expert analysis | 1-2 min |
| YAML generation | < 1 min |
| Total (for 5 tables) | ~3-5 min |
Error Handling
Common Issues:
Issue: MCP tools not available for platform
Solution:
- I'll inform you and provide fallback options
- You can provide schema information manually
- I'll still generate unify.yml with validation warnings
Issue: No tables have user identifiers
Solution:
- I'll show you why tables were excluded
- Suggest alternative tables to analyze
- Explain what constitutes a user identifier
Issue: Table not accessible
Solution:
- Document which tables are inaccessible
- Continue with accessible tables
- Recommend permission checks
Issue: Complex data types found
Solution:
- Exclude complex type columns (arrays, structs, maps)
- Explain why they can't be used for unification
- Suggest alternative columns if available
Success Criteria
Generated unify.yml will:
- ✅ Use EXACT template structure - NO modifications
- ✅ Contain ONLY tables with validated user identifiers
- ✅ Include ONLY columns that actually exist in tables
- ✅ Have validation rules based on actual data patterns
- ✅ Be ready for immediate use with hybrid-generate-snowflake or hybrid-generate-databricks
- ✅ Work without any manual edits
- ✅ Include comprehensive documentation in comments
Next Steps After Generation
- Review the generated unify.yml
  - Verify tables and columns are correct
  - Check validation rules are appropriate
  - Review merge strategy and priorities
- Generate SQL for your platform:
  - Snowflake: /cdp-hybrid-idu:hybrid-generate-snowflake
  - Databricks: /cdp-hybrid-idu:hybrid-generate-databricks
- Execute the workflow:
  - Snowflake: /cdp-hybrid-idu:hybrid-execute-snowflake
  - Databricks: /cdp-hybrid-idu:hybrid-execute-databricks
- Monitor convergence and results
Getting Started
Ready to begin?
Please provide:
- Platform: Snowflake or Databricks
- Tables: List of tables to analyze (full paths)
- Canonical ID Name: Name for your unified ID (e.g., unified_customer_id)
- Output File (optional): Where to save unify.yml (default: unify.yml)
Example:
Platform: Snowflake
Tables:
- customer_db.public.customers
- customer_db.public.orders
- marketing_db.public.campaigns
Canonical ID: unified_id
Output: snowflake_unify.yml
I'll analyze your tables and generate a production-ready unify.yml configuration!