gh-treasure-data-aps-claude…/commands/hybrid-unif-config-creator.md

---
name: hybrid-unif-config-creator
description: Auto-generate unify.yml configuration for Snowflake/Databricks by extracting user identifiers from actual tables using strict PII detection
---

# Unify Configuration Creator for Snowflake/Databricks

## Overview

I'll automatically generate a production-ready `unify.yml` configuration file for your Snowflake or Databricks ID unification by:

1. **Analyzing your actual tables** using platform-specific MCP tools
2. **Extracting user identifiers** with zero-tolerance PII detection
3. **Validating data patterns** from real table data
4. **Generating unify.yml** using the exact template format
5. **Providing recommendations** for merge strategies and priorities

**This command uses STRICT analysis - only tables with actual user identifiers will be included.**

---

## What You Need to Provide

### 1. Platform Selection
- **Snowflake**: For Snowflake databases
- **Databricks**: For Databricks Unity Catalog tables

### 2. Tables to Analyze
Provide tables you want to analyze for ID unification:
- **Format (Snowflake)**: `database.schema.table` or `schema.table` or `table`
- **Format (Databricks)**: `catalog.schema.table` or `schema.table` or `table`
- **Example**: `customer_data.public.customers`, `orders`, `web_events.user_activity`

### 3. Canonical ID Configuration
- **Name**: Name for your unified ID (default: `td_id`)
- **Merge Iterations**: Number of unification loop iterations (default: 10)
- **Incremental Iterations**: Iterations for incremental processing (default: 5)

### 4. Output Configuration (Optional)
- **Output File**: Where to save unify.yml (default: `unify.yml`)
- **Template Path**: Path to template if using custom (default: uses built-in exact template)

---

## What I'll Do

### Step 1: Platform Detection and Validation
```
1. Confirm platform (Snowflake or Databricks)
2. Verify MCP tools are available for the platform
3. Set up platform-specific query patterns
4. Inform you of the analysis approach
```

### Step 2: Key Extraction with hybrid-unif-keys-extractor Agent
I'll launch the **hybrid-unif-keys-extractor agent** to:

**Schema Analysis**:
- Use platform MCP tools to describe each table
- Extract exact column names and data types
- Identify accessible vs inaccessible tables

**User Identifier Detection**:
- Apply STRICT matching rules for user identifiers:
  - ✅ Email columns (email, email_std, email_address, etc.)
  - ✅ Phone columns (phone, phone_number, mobile_phone, etc.)
  - ✅ User IDs (user_id, customer_id, account_id, etc.)
  - ✅ Cookie/Device IDs (td_client_id, cookie_id, etc.)
  - ❌ System columns (id, created_at, time, etc.)
  - ❌ Complex types (arrays, maps, objects, variants, structs)

**Data Validation**:
- Query actual MIN/MAX values from each identified column
- Analyze data patterns and quality
- Count unique values per identifier
- Detect data quality issues

**Table Classification**:
- **INCLUDED**: Tables with valid user identifiers
- **EXCLUDED**: Tables without user identifiers (fully documented why)

**Expert Analysis**:
- 3 SQL experts review the data
- Provide priority recommendations
- Suggest validation rules based on actual data patterns

### Step 3: Unify.yml Generation

**CRITICAL**: Using the **EXACT BUILT-IN template structure** (embedded in hybrid-unif-keys-extractor agent)

**Template Usage Process**:
```
1. Receive structured data from hybrid-unif-keys-extractor agent:
   - Keys with validation rules
   - Tables with column mappings
   - Canonical ID configuration
   - Master tables specification

2. Use BUILT-IN template structure (see agent documentation)

3. ONLY replace these specific values:
   - Line 1: name: {canonical_id_name}
   - keys section: actual keys found
   - tables section: actual tables with actual columns
   - canonical_ids section: name and merge_by_keys
   - master_tables section: [] or user specifications

4. PRESERVE everything else:
   - ALL comment blocks (#####...)
   - ALL comment text ("Declare Validation logic", etc.)
   - ALL spacing and indentation (2 spaces per level)
   - ALL blank lines
   - EXACT YAML structure

5. Use Write tool to save populated unify.yml
```

**I'll generate**:

**Section 1: Canonical ID Name**
```yaml
name: {your_canonical_id_name}
```

**Section 2: Keys with Validation**
```yaml
keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']
  - name: phone_number
    invalid_texts: ['', 'N/A', 'null']
```
*Populated with actual keys found in your tables*

**Section 3: Tables with Key Column Mappings**
```yaml
tables:
  - database: {database/catalog}
    table: {table_name}
    key_columns:
      - {column: actual_column_name, key: mapped_key}
      - {column: another_column, key: another_key}
```
*Only tables with valid user identifiers, with EXACT column names from schema analysis*

**Section 4: Canonical IDs Configuration**
```yaml
canonical_ids:
  - name: {your_canonical_id_name}
    merge_by_keys: [email, customer_id, phone_number]
    merge_iterations: 15
```
*Based on extracted keys and your configuration*

**Section 5: Master Tables (Optional)**
```yaml
master_tables:
  - name: {canonical_id_name}_master_table
    canonical_id: {canonical_id_name}
    attributes:
      - name: best_email
        source_columns:
          - {table: table1, column: email, order: last, order_by: time, priority: 1}
          - {table: table2, column: email_address, order: last, order_by: time, priority: 2}
```
*If you request master table configuration, I'll help set up attribute aggregation*

### Step 4: Validation and Review

After generation:
```
1. Show complete unify.yml content
2. Highlight key sections:
   - Keys found: [list]
   - Tables included: [count]
   - Tables excluded: [count] with reasons
   - Merge strategy: [keys and priorities]
3. Provide recommendations for optimization
4. Ask for your approval before saving
```

### Step 5: File Output

```
1. Write unify.yml to specified location
2. Create backup of existing file if present
3. Provide file summary:
   - Keys configured: X
   - Tables configured: Y
   - Validation rules: Z
4. Show next steps for using the configuration
```

---

## Example Workflow

**Input**:
```
Platform: Snowflake
Tables:
  - customer_data.public.customers
  - customer_data.public.orders
  - web_data.public.events
Canonical ID Name: unified_customer_id
Output: snowflake_unify.yml
```

**Process**:
```
✓ Platform: Snowflake MCP tools detected
✓ Analyzing 3 tables...

Schema Analysis:
  ✓ customer_data.public.customers - 12 columns
  ✓ customer_data.public.orders - 8 columns
  ✓ web_data.public.events - 15 columns

User Identifier Detection:
  ✓ customers: email, customer_id (2 identifiers)
  ✓ orders: customer_id, email_address (2 identifiers)
  ✗ events: NO user identifiers found
    Available columns: event_id, session_id, page_url, timestamp, ...
    Reason: Contains only event tracking data - no PII

Data Analysis:
  ✓ email: 45,123 unique values, format valid
  ✓ customer_id: 45,089 unique values, numeric
  ✓ email_address: 12,456 unique values, format valid

Expert Analysis Complete:
  Priority 1: customer_id (most stable, highest coverage)
  Priority 2: email (good coverage, some quality issues)
  Priority 3: phone_number (not found)

Generating unify.yml...
  ✓ Keys section: 2 keys configured
  ✓ Tables section: 2 tables configured
  ✓ Canonical IDs: unified_customer_id
  ✓ Validation rules: Applied based on data patterns

Tables EXCLUDED:
  - web_data.public.events: No user identifiers
```

**Output (snowflake_unify.yml)**:
```yaml
name: unified_customer_id

keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']

tables:
  - database: customer_data
    table: customers
    key_columns:
      - {column: email, key: email}
      - {column: customer_id, key: customer_id}
  - database: customer_data
    table: orders
    key_columns:
      - {column: email_address, key: email}
      - {column: customer_id, key: customer_id}

canonical_ids:
  - name: unified_customer_id
    merge_by_keys: [customer_id, email]
    merge_iterations: 15

master_tables: []
```

---

## Key Features

### 🔍 **STRICT PII Detection**
- Zero tolerance for guessing
- Only includes tables with actual user identifiers
- Documents why tables are excluded
- Based on REAL schema and data analysis

### ✅ **Exact Template Compliance**
- Uses BUILT-IN exact template structure (embedded in hybrid-unif-keys-extractor agent)
- NO modifications to template format
- Preserves all comment sections
- Maintains exact YAML structure
- Portable across all systems

### 📊 **Real Data Analysis**
- Queries actual MIN/MAX values
- Counts unique identifiers
- Validates data patterns
- Identifies quality issues

### 🎯 **Platform-Aware**
- Uses correct MCP tools for each platform
- Respects platform naming conventions
- Applies platform-specific data type rules
- Generates platform-compatible SQL references

### 📋 **Complete Documentation**
- Documents all excluded tables with reasons
- Lists available columns for excluded tables
- Explains why columns don't qualify as user identifiers
- Provides expert recommendations

---

## Output Format

**The generated unify.yml will have EXACTLY this structure:**

```yaml
name: {canonical_id_name}
#####################################################
##
##Declare Validation logic for unification keys
##
#####################################################
keys:
  - name: {key1}
    valid_regexp: "{pattern}"
    invalid_texts: ['{val1}', '{val2}', '{val3}']
  - name: {key2}
    invalid_texts: ['{val1}', '{val2}', '{val3}']

#####################################################
##
##Declare databases, tables, and keys to use during unification
##
#####################################################

tables:
  - database: {db/catalog}
    table: {table}
    key_columns:
      - {column: {col}, key: {key}}

#####################################################
##
##Declare hierarchy for unification. Define keys to use for each level.
##
#####################################################

canonical_ids:
  - name: {canonical_id_name}
    merge_by_keys: [{key1}, {key2}, ...]
    merge_iterations: {number}

#####################################################
##
##Declare Similar Attributes and standardize into a single column
##
#####################################################

master_tables:
  - name: {canonical_id_name}_master_table
    canonical_id: {canonical_id_name}
    attributes:
      - name: {attribute}
        source_columns:
          - {table: {t}, column: {c}, order: last, order_by: time, priority: 1}
```

**NO deviations from this structure - EXACT template compliance guaranteed.**

---

## Prerequisites

### Required:
- ✅ Snowflake or Databricks platform access
- ✅ Platform-specific MCP tools configured (may use fallback if unavailable)
- ✅ Read permissions on tables to be analyzed
- ✅ Tables must exist and be accessible

### Optional:
- Custom unify.yml template path (if not using default)
- Master table attribute specifications
- Custom validation rules

---

## Expected Timeline

| Step | Duration |
|------|----------|
| Platform detection | < 1 min |
| Schema analysis (per table) | 5-10 sec |
| Data analysis (per identifier) | 10-20 sec |
| Expert analysis | 1-2 min |
| YAML generation | < 1 min |
| **Total (for 5 tables)** | **~3-5 min** |

---

## Error Handling

### Common Issues:

**Issue**: MCP tools not available for platform
**Solution**:
- I'll inform you and provide fallback options
- You can provide schema information manually
- I'll still generate unify.yml with validation warnings

**Issue**: No tables have user identifiers
**Solution**:
- I'll show you why tables were excluded
- Suggest alternative tables to analyze
- Explain what constitutes a user identifier

**Issue**: Table not accessible
**Solution**:
- Document which tables are inaccessible
- Continue with accessible tables
- Recommend permission checks

**Issue**: Complex data types found
**Solution**:
- Exclude complex type columns (arrays, structs, maps)
- Explain why they can't be used for unification
- Suggest alternative columns if available

---

## Success Criteria

Generated unify.yml will:
- ✅ Use EXACT template structure - NO modifications
- ✅ Contain ONLY tables with validated user identifiers
- ✅ Include ONLY columns that actually exist in tables
- ✅ Have validation rules based on actual data patterns
- ✅ Be ready for immediate use with hybrid-generate-snowflake or hybrid-generate-databricks
- ✅ Work without any manual edits
- ✅ Include comprehensive documentation in comments

---

## Next Steps After Generation

1. **Review the generated unify.yml**
   - Verify tables and columns are correct
   - Check validation rules are appropriate
   - Review merge strategy and priorities

2. **Generate SQL for your platform**:
   - Snowflake: `/cdp-hybrid-idu:hybrid-generate-snowflake`
   - Databricks: `/cdp-hybrid-idu:hybrid-generate-databricks`

3. **Execute the workflow**:
   - Snowflake: `/cdp-hybrid-idu:hybrid-execute-snowflake`
   - Databricks: `/cdp-hybrid-idu:hybrid-execute-databricks`

4. **Monitor convergence and results**

---

## Getting Started

**Ready to begin?**

Please provide:

1. **Platform**: Snowflake or Databricks
2. **Tables**: List of tables to analyze (full paths)
3. **Canonical ID Name**: Name for your unified ID (e.g., `unified_customer_id`)
4. **Output File** (optional): Where to save unify.yml (default: `unify.yml`)

**Example**:
```
Platform: Snowflake
Tables:
  - customer_db.public.customers
  - customer_db.public.orders
  - marketing_db.public.campaigns
Canonical ID: unified_id
Output: snowflake_unify.yml
```

---

**I'll analyze your tables and generate a production-ready unify.yml configuration!**