Initial commit

2025-11-30 09:02:39 +08:00
commit 515e7bf6be
18 changed files with 5770 additions and 0 deletions
--- a/commands/hybrid-unif-config-creator.md
+++ b/commands/hybrid-unif-config-creator.md
@@ -0,0 +1,491 @@
+---
+name: hybrid-unif-config-creator
+description: Auto-generate unify.yml configuration for Snowflake/Databricks by extracting user identifiers from actual tables using strict PII detection
+---
+
+# Unify Configuration Creator for Snowflake/Databricks
+
+## Overview
+
+I'll automatically generate a production-ready `unify.yml` configuration file for your Snowflake or Databricks ID unification by:
+
+1. **Analyzing your actual tables** using platform-specific MCP tools
+2. **Extracting user identifiers** with zero-tolerance PII detection
+3. **Validating data patterns** from real table data
+4. **Generating unify.yml** using the exact template format
+5. **Providing recommendations** for merge strategies and priorities
+
+**This command uses STRICT analysis - only tables with actual user identifiers will be included.**
+
+---
+
+## What You Need to Provide
+
+### 1. Platform Selection
+- **Snowflake**: For Snowflake databases
+- **Databricks**: For Databricks Unity Catalog tables
+
+### 2. Tables to Analyze
+Provide tables you want to analyze for ID unification:
+- **Format (Snowflake)**: `database.schema.table` or `schema.table` or `table`
+- **Format (Databricks)**: `catalog.schema.table` or `schema.table` or `table`
+- **Example**: `customer_data.public.customers`, `orders`, `web_events.user_activity`
+
+### 3. Canonical ID Configuration
+- **Name**: Name for your unified ID (default: `td_id`)
+- **Merge Iterations**: Number of unification loop iterations (default: 10)
+- **Incremental Iterations**: Iterations for incremental processing (default: 5)
+
+### 4. Output Configuration (Optional)
+- **Output File**: Where to save unify.yml (default: `unify.yml`)
+- **Template Path**: Path to template if using custom (default: uses built-in exact template)
+
+---
+
+## What I'll Do
+
+### Step 1: Platform Detection and Validation
+```
+1. Confirm platform (Snowflake or Databricks)
+2. Verify MCP tools are available for the platform
+3. Set up platform-specific query patterns
+4. Inform you of the analysis approach
+```
+
+### Step 2: Key Extraction with hybrid-unif-keys-extractor Agent
+I'll launch the **hybrid-unif-keys-extractor agent** to:
+
+**Schema Analysis**:
+- Use platform MCP tools to describe each table
+- Extract exact column names and data types
+- Identify accessible vs inaccessible tables
+
+**User Identifier Detection**:
+- Apply STRICT matching rules for user identifiers:
+  - ✅ Email columns (email, email_std, email_address, etc.)
+  - ✅ Phone columns (phone, phone_number, mobile_phone, etc.)
+  - ✅ User IDs (user_id, customer_id, account_id, etc.)
+  - ✅ Cookie/Device IDs (td_client_id, cookie_id, etc.)
+  - ❌ System columns (id, created_at, time, etc.)
+  - ❌ Complex types (arrays, maps, objects, variants, structs)
+
+**Data Validation**:
+- Query actual MIN/MAX values from each identified column
+- Analyze data patterns and quality
+- Count unique values per identifier
+- Detect data quality issues
+
+**Table Classification**:
+- **INCLUDED**: Tables with valid user identifiers
+- **EXCLUDED**: Tables without user identifiers (fully documented why)
+
+**Expert Analysis**:
+- 3 SQL experts review the data
+- Provide priority recommendations
+- Suggest validation rules based on actual data patterns
+
+### Step 3: Unify.yml Generation
+
+**CRITICAL**: Using the **EXACT BUILT-IN template structure** (embedded in hybrid-unif-keys-extractor agent)
+
+**Template Usage Process**:
+```
+1. Receive structured data from hybrid-unif-keys-extractor agent:
+   - Keys with validation rules
+   - Tables with column mappings
+   - Canonical ID configuration
+   - Master tables specification
+
+2. Use BUILT-IN template structure (see agent documentation)
+
+3. ONLY replace these specific values:
+   - Line 1: name: {canonical_id_name}
+   - keys section: actual keys found
+   - tables section: actual tables with actual columns
+   - canonical_ids section: name and merge_by_keys
+   - master_tables section: [] or user specifications
+
+4. PRESERVE everything else:
+   - ALL comment blocks (#####...)
+   - ALL comment text ("Declare Validation logic", etc.)
+   - ALL spacing and indentation (2 spaces per level)
+   - ALL blank lines
+   - EXACT YAML structure
+
+5. Use Write tool to save populated unify.yml
+```
+
+**I'll generate**:
+
+**Section 1: Canonical ID Name**
+```yaml
+name: {your_canonical_id_name}
+```
+
+**Section 2: Keys with Validation**
+```yaml
+keys:
+  - name: email
+    valid_regexp: ".*@.*"
+    invalid_texts: ['', 'N/A', 'null']
+  - name: customer_id
+    invalid_texts: ['', 'N/A', 'null']
+  - name: phone_number
+    invalid_texts: ['', 'N/A', 'null']
+```
+*Populated with actual keys found in your tables*
+
+**Section 3: Tables with Key Column Mappings**
+```yaml
+tables:
+  - database: {database/catalog}
+    table: {table_name}
+    key_columns:
+      - {column: actual_column_name, key: mapped_key}
+      - {column: another_column, key: another_key}
+```
+*Only tables with valid user identifiers, with EXACT column names from schema analysis*
+
+**Section 4: Canonical IDs Configuration**
+```yaml
+canonical_ids:
+  - name: {your_canonical_id_name}
+    merge_by_keys: [email, customer_id, phone_number]
+    merge_iterations: 15
+```
+*Based on extracted keys and your configuration*
+
+**Section 5: Master Tables (Optional)**
+```yaml
+master_tables:
+  - name: {canonical_id_name}_master_table
+    canonical_id: {canonical_id_name}
+    attributes:
+      - name: best_email
+        source_columns:
+          - {table: table1, column: email, order: last, order_by: time, priority: 1}
+          - {table: table2, column: email_address, order: last, order_by: time, priority: 2}
+```
+*If you request master table configuration, I'll help set up attribute aggregation*
+
+### Step 4: Validation and Review
+
+After generation:
+```
+1. Show complete unify.yml content
+2. Highlight key sections:
+   - Keys found: [list]
+   - Tables included: [count]
+   - Tables excluded: [count] with reasons
+   - Merge strategy: [keys and priorities]
+3. Provide recommendations for optimization
+4. Ask for your approval before saving
+```
+
+### Step 5: File Output
+
+```
+1. Write unify.yml to specified location
+2. Create backup of existing file if present
+3. Provide file summary:
+   - Keys configured: X
+   - Tables configured: Y
+   - Validation rules: Z
+4. Show next steps for using the configuration
+```
+
+---
+
+## Example Workflow
+
+**Input**:
+```
+Platform: Snowflake
+Tables:
+  - customer_data.public.customers
+  - customer_data.public.orders
+  - web_data.public.events
+Canonical ID Name: unified_customer_id
+Output: snowflake_unify.yml
+```
+
+**Process**:
+```
+✓ Platform: Snowflake MCP tools detected
+✓ Analyzing 3 tables...
+
+Schema Analysis:
+  ✓ customer_data.public.customers - 12 columns
+  ✓ customer_data.public.orders - 8 columns
+  ✓ web_data.public.events - 15 columns
+
+User Identifier Detection:
+  ✓ customers: email, customer_id (2 identifiers)
+  ✓ orders: customer_id, email_address (2 identifiers)
+  ✗ events: NO user identifiers found
+    Available columns: event_id, session_id, page_url, timestamp, ...
+    Reason: Contains only event tracking data - no PII
+
+Data Analysis:
+  ✓ email: 45,123 unique values, format valid
+  ✓ customer_id: 45,089 unique values, numeric
+  ✓ email_address: 12,456 unique values, format valid
+
+Expert Analysis Complete:
+  Priority 1: customer_id (most stable, highest coverage)
+  Priority 2: email (good coverage, some quality issues)
+  Priority 3: phone_number (not found)
+
+Generating unify.yml...
+  ✓ Keys section: 2 keys configured
+  ✓ Tables section: 2 tables configured
+  ✓ Canonical IDs: unified_customer_id
+  ✓ Validation rules: Applied based on data patterns
+
+Tables EXCLUDED:
+  - web_data.public.events: No user identifiers
+```
+
+**Output (snowflake_unify.yml)**:
+```yaml
+name: unified_customer_id
+
+keys:
+  - name: email
+    valid_regexp: ".*@.*"
+    invalid_texts: ['', 'N/A', 'null']
+  - name: customer_id
+    invalid_texts: ['', 'N/A', 'null']
+
+tables:
+  - database: customer_data
+    table: customers
+    key_columns:
+      - {column: email, key: email}
+      - {column: customer_id, key: customer_id}
+  - database: customer_data
+    table: orders
+    key_columns:
+      - {column: email_address, key: email}
+      - {column: customer_id, key: customer_id}
+
+canonical_ids:
+  - name: unified_customer_id
+    merge_by_keys: [customer_id, email]
+    merge_iterations: 15
+
+master_tables: []
+```
+
+---
+
+## Key Features
+
+### 🔍 **STRICT PII Detection**
+- Zero tolerance for guessing
+- Only includes tables with actual user identifiers
+- Documents why tables are excluded
+- Based on REAL schema and data analysis
+
+### ✅ **Exact Template Compliance**
+- Uses BUILT-IN exact template structure (embedded in hybrid-unif-keys-extractor agent)
+- NO modifications to template format
+- Preserves all comment sections
+- Maintains exact YAML structure
+- Portable across all systems
+
+### 📊 **Real Data Analysis**
+- Queries actual MIN/MAX values
+- Counts unique identifiers
+- Validates data patterns
+- Identifies quality issues
+
+### 🎯 **Platform-Aware**
+- Uses correct MCP tools for each platform
+- Respects platform naming conventions
+- Applies platform-specific data type rules
+- Generates platform-compatible SQL references
+
+### 📋 **Complete Documentation**
+- Documents all excluded tables with reasons
+- Lists available columns for excluded tables
+- Explains why columns don't qualify as user identifiers
+- Provides expert recommendations
+
+---
+
+## Output Format
+
+**The generated unify.yml will have EXACTLY this structure:**
+
+```yaml
+name: {canonical_id_name}
+#####################################################
+##
+##Declare Validation logic for unification keys
+##
+#####################################################
+keys:
+  - name: {key1}
+    valid_regexp: "{pattern}"
+    invalid_texts: ['{val1}', '{val2}', '{val3}']
+  - name: {key2}
+    invalid_texts: ['{val1}', '{val2}', '{val3}']
+
+#####################################################
+##
+##Declare databases, tables, and keys to use during unification
+##
+#####################################################
+
+tables:
+  - database: {db/catalog}
+    table: {table}
+    key_columns:
+      - {column: {col}, key: {key}}
+
+#####################################################
+##
+##Declare hierarchy for unification. Define keys to use for each level.
+##
+#####################################################
+
+canonical_ids:
+  - name: {canonical_id_name}
+    merge_by_keys: [{key1}, {key2}, ...]
+    merge_iterations: {number}
+
+#####################################################
+##
+##Declare Similar Attributes and standardize into a single column
+##
+#####################################################
+
+master_tables:
+  - name: {canonical_id_name}_master_table
+    canonical_id: {canonical_id_name}
+    attributes:
+      - name: {attribute}
+        source_columns:
+          - {table: {t}, column: {c}, order: last, order_by: time, priority: 1}
+```
+
+**NO deviations from this structure - EXACT template compliance guaranteed.**
+
+---
+
+## Prerequisites
+
+### Required:
+- ✅ Snowflake or Databricks platform access
+- ✅ Platform-specific MCP tools configured (may use fallback if unavailable)
+- ✅ Read permissions on tables to be analyzed
+- ✅ Tables must exist and be accessible
+
+### Optional:
+- Custom unify.yml template path (if not using default)
+- Master table attribute specifications
+- Custom validation rules
+
+---
+
+## Expected Timeline
+
+| Step | Duration |
+|------|----------|
+| Platform detection | < 1 min |
+| Schema analysis (per table) | 5-10 sec |
+| Data analysis (per identifier) | 10-20 sec |
+| Expert analysis | 1-2 min |
+| YAML generation | < 1 min |
+| **Total (for 5 tables)** | **~3-5 min** |
+
+---
+
+## Error Handling
+
+### Common Issues:
+
+**Issue**: MCP tools not available for platform
+**Solution**:
+- I'll inform you and provide fallback options
+- You can provide schema information manually
+- I'll still generate unify.yml with validation warnings
+
+**Issue**: No tables have user identifiers
+**Solution**:
+- I'll show you why tables were excluded
+- Suggest alternative tables to analyze
+- Explain what constitutes a user identifier
+
+**Issue**: Table not accessible
+**Solution**:
+- Document which tables are inaccessible
+- Continue with accessible tables
+- Recommend permission checks
+
+**Issue**: Complex data types found
+**Solution**:
+- Exclude complex type columns (arrays, structs, maps)
+- Explain why they can't be used for unification
+- Suggest alternative columns if available
+
+---
+
+## Success Criteria
+
+Generated unify.yml will:
+- ✅ Use EXACT template structure - NO modifications
+- ✅ Contain ONLY tables with validated user identifiers
+- ✅ Include ONLY columns that actually exist in tables
+- ✅ Have validation rules based on actual data patterns
+- ✅ Be ready for immediate use with hybrid-generate-snowflake or hybrid-generate-databricks
+- ✅ Work without any manual edits
+- ✅ Include comprehensive documentation in comments
+
+---
+
+## Next Steps After Generation
+
+1. **Review the generated unify.yml**
+   - Verify tables and columns are correct
+   - Check validation rules are appropriate
+   - Review merge strategy and priorities
+
+2. **Generate SQL for your platform**:
+   - Snowflake: `/cdp-hybrid-idu:hybrid-generate-snowflake`
+   - Databricks: `/cdp-hybrid-idu:hybrid-generate-databricks`
+
+3. **Execute the workflow**:
+   - Snowflake: `/cdp-hybrid-idu:hybrid-execute-snowflake`
+   - Databricks: `/cdp-hybrid-idu:hybrid-execute-databricks`
+
+4. **Monitor convergence and results**
+
+---
+
+## Getting Started
+
+**Ready to begin?**
+
+Please provide:
+
+1. **Platform**: Snowflake or Databricks
+2. **Tables**: List of tables to analyze (full paths)
+3. **Canonical ID Name**: Name for your unified ID (e.g., `unified_customer_id`)
+4. **Output File** (optional): Where to save unify.yml (default: `unify.yml`)
+
+**Example**:
+```
+Platform: Snowflake
+Tables:
+  - customer_db.public.customers
+  - customer_db.public.orders
+  - marketing_db.public.campaigns
+Canonical ID: unified_id
+Output: snowflake_unify.yml
+```
+
+---
+
+**I'll analyze your tables and generate a production-ready unify.yml configuration!**