name: hybrid-setup
description: Complete end-to-end hybrid ID unification setup - automatically analyzes tables, generates config, creates SQL, and executes workflow for Snowflake and Databricks

Hybrid ID Unification Complete Setup

Overview

I'll guide you through the complete hybrid ID unification setup process for Snowflake and/or Databricks platforms. This is an automated, end-to-end workflow that will:

  1. Analyze your tables automatically using platform MCP tools with strict PII detection
  2. Generate YAML configuration from real schema and data analysis
  3. Help you choose the target platform(s) (Snowflake, Databricks, or both)
  4. Generate platform-specific SQL optimized for each engine
  5. Execute workflows with convergence detection and monitoring
  6. Provide deployment guidance and operating instructions

Key Features:

  • 🔍 Automated Table Analysis: Uses Snowflake/Databricks MCP tools to analyze actual tables
  • Strict PII Detection: Zero tolerance - only includes tables with real user identifiers
  • 📊 Real Data Validation: Queries actual data to validate patterns and quality
  • 🎯 Smart Recommendations: Expert analysis provides merge strategy and priorities
  • 🚀 End-to-End Automation: From table analysis to workflow execution

What You Need to Provide

1. Unification Requirements (For Automated Analysis)

  • Platform: Snowflake or Databricks
  • Tables: List of source tables to analyze
    • Format (Snowflake): database.schema.table or schema.table or table
    • Format (Databricks): catalog.schema.table or schema.table or table
  • Canonical ID Name: Name for your unified ID (e.g., td_id, unified_customer_id)
  • Merge Iterations: Number of unification loops (default: 10)
  • Master Tables: (Optional) Attribute aggregation specifications

Note: The system will automatically:

  • Extract user identifiers from actual table schemas
  • Validate data patterns from real data
  • Apply appropriate validation rules based on data analysis
  • Generate merge strategy recommendations
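
For instance, validation rules inferred from real data might surface in the generated configuration roughly like this (an illustrative sketch only; field names, patterns, and invalid-value lists are determined by the automated analysis, not hand-written):

keys:
  - name: email
    valid_regexp: ".+@.+"              # pattern confirmed against sampled rows
    invalid_texts: ['', 'null', 'unknown']
  - name: customer_id
    invalid_texts: ['', '0']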

2. Platform Selection

  • Databricks: Unity Catalog with Delta Lake
  • Snowflake: Database with proper permissions
  • Both: Generate SQL for both platforms

3. Target Configurations

For Databricks:

  • Catalog: Target catalog name
  • Schema: Target schema name
  • Source Catalog (optional): Source data catalog
  • Source Schema (optional): Source data schema

For Snowflake:

  • Database: Target database name
  • Schema: Target schema name
  • Source Schema (optional): Source data schema
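
As a concrete illustration, the target settings boil down to a handful of values like these (hypothetical names; I'll collect the real values interactively in Phase 2, and the generator scripts determine how they're actually passed):

databricks:
  catalog: analytics_prod            # target catalog
  schema: id_unification             # target schema
  source_catalog: raw_lake           # optional: where source tables live
  source_schema: customer_raw        # optional
snowflake:
  database: ANALYTICS_PROD           # target database
  schema: ID_UNIFICATION             # target schema
  source_schema: CUSTOMER_RAW        # optional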

4. Execution Credentials (if executing)

For Databricks:

  • Server Hostname: your-workspace.databricks.com
  • HTTP Path: /sql/1.0/warehouses/your-warehouse-id
  • Authentication: PAT (Personal Access Token) or OAuth

For Snowflake:

  • Account: Snowflake account name
  • User: Username
  • Password: Password, or use SSO/key-pair authentication
  • Warehouse: Compute warehouse name
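
For example, with placeholder values (the executor agents prompt for these at run time; treat this as a checklist rather than a file the tools read):

databricks:
  server_hostname: your-workspace.databricks.com
  http_path: /sql/1.0/warehouses/your-warehouse-id
  access_token: <personal access token, or use OAuth>
snowflake:
  account: your-account-identifier
  user: your-username
  password: <password, or use SSO/key-pair>
  warehouse: COMPUTE_WH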

What I'll Do

Step 1: Automated YAML Configuration Generation

I'll use the hybrid-unif-config-creator command to automatically generate your unify.yml file:

Automated Analysis Approach (Recommended):

  • Analyze your actual tables using platform MCP tools (Snowflake/Databricks)
  • Extract user identifiers with STRICT PII detection (zero tolerance for guessing)
  • Validate data patterns from real table data
  • Generate unify.yml with exact template compliance
  • Only include tables with actual user identifiers
  • Document excluded tables with detailed reasons

What I'll do:

  • Call the hybrid-unif-keys-extractor agent to analyze tables
  • Query actual schema and data using platform MCP tools
  • Detect valid user identifiers (email, customer_id, phone, etc.)
  • Exclude tables without real user identifiers, documenting the reason for each
  • Generate production-ready unify.yml automatically

Alternative - Manual Configuration:

  • If MCP tools are unavailable, I'll guide you through manual configuration
  • Interactive prompts for keys, tables, and validation rules
  • Step-by-step YAML building with validation
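
Whichever path we take, the resulting unify.yml follows a structure along these lines (a simplified sketch with hypothetical table and column names; the exact schema follows the template enforced by hybrid-unif-config-creator):

keys:                                  # validation rules as sketched earlier
  - name: email
    invalid_texts: ['']
  - name: customer_id
    invalid_texts: ['', '0']
tables:
  - database: customer_db              # catalog for Databricks sources
    schema: public
    table: customer_profiles
    key_columns:
      - {column: email_address, key: email}
      - {column: cust_id, key: customer_id}
canonical_ids:
  - name: td_id                        # your canonical ID name
    merge_by_keys: [email, customer_id]
    merge_iterations: 10
master_tables:                         # optional; attribute spec shape is illustrative
  - name: customer_master
    canonical_id: td_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_address, priority: 1}

You'll get a chance to review and adjust this configuration before any SQL is generated.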

Step 2: Platform Selection and Configuration

I'll help you:

  • Choose between Databricks, Snowflake, or both
  • Collect platform-specific configuration (catalog/database, schema names)
  • Determine source/target separation strategy
  • Decide on execution or generation-only mode

Step 3: SQL Generation

For Databricks (if selected): I'll call the databricks-sql-generator agent to:

  • Execute yaml_unification_to_databricks.py script
  • Generate Delta Lake optimized SQL workflow
  • Create output directory: databricks_sql/unify/
  • Generate 15+ SQL files with proper execution order

For Snowflake (if selected): I'll call the snowflake-sql-generator agent to:

  • Execute yaml_unification_to_snowflake.py script
  • Generate Snowflake-native SQL workflow
  • Create output directory: snowflake_sql/unify/
  • Generate 15+ SQL files with proper execution order

Step 4: Workflow Execution (Optional)

For Databricks (if execution requested): I'll call the databricks-workflow-executor agent to:

  • Execute databricks_sql_executor.py script
  • Connect to your Databricks workspace
  • Run SQL files in proper sequence
  • Monitor convergence and progress
  • Optimize Delta tables
  • Report final statistics

For Snowflake (if execution requested): I'll call the snowflake-workflow-executor agent to:

  • Execute snowflake_sql_executor.py script
  • Connect to your Snowflake account
  • Run SQL files in proper sequence
  • Monitor convergence and progress
  • Report final statistics

Step 5: Deployment Guidance

I'll provide:

  • Configuration summary
  • Generated files overview
  • Deployment instructions
  • Operating guidelines
  • Monitoring recommendations

Interactive Workflow

This command orchestrates the complete end-to-end flow by calling specialized commands in sequence:

Phase 1: Configuration Creation

I'll ask you for:

  • Platform (Snowflake or Databricks)
  • Tables to analyze
  • Canonical ID name
  • Merge iterations

Then I'll:

  • Call /cdp-hybrid-idu:hybrid-unif-config-creator internally
  • Analyze your tables automatically
  • Generate unify.yml with strict PII detection
  • Show you the configuration for review

Phase 2: SQL Generation

I'll ask you:

  • Which platform(s) to generate SQL for (can be different from source)
  • Output directory preferences

Then I'll:

  • Call /cdp-hybrid-idu:hybrid-generate-snowflake (if Snowflake selected)
  • Call /cdp-hybrid-idu:hybrid-generate-databricks (if Databricks selected)
  • Generate 15+ optimized SQL files per platform
  • Show you the execution plan

Phase 3: Workflow Execution (Optional)

I'll ask you:

  • Do you want to execute now or later?
  • Connection credentials if executing

Then I'll:

  • Call /cdp-hybrid-idu:hybrid-execute-snowflake (if Snowflake selected)
  • Call /cdp-hybrid-idu:hybrid-execute-databricks (if Databricks selected)
  • Monitor convergence and progress
  • Report final statistics

Throughout the process:

  • Questions: When I need your input
  • Suggestions: Recommended approaches based on best practices
  • Validation: Real-time checks on your choices
  • Explanations: Help you understand concepts and options

Expected Output

Files Created (Platform-specific):

For Databricks:

databricks_sql/unify/
├── 01_create_graph.sql              # Initialize identity graph
├── 02_extract_merge.sql             # Extract and merge identities
├── 03_source_key_stats.sql          # Source statistics
├── 04_unify_loop_iteration_*.sql    # Iterative unification (N files)
├── 05_canonicalize.sql              # Canonical ID creation
├── 06_result_key_stats.sql          # Result statistics
├── 10_enrich_*.sql                  # Source table enrichment (N files)
├── 20_master_*.sql                  # Master table creation (N files)
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules
└── 32_column_lookup.sql             # Column mappings

For Snowflake:

snowflake_sql/unify/
├── 01_create_graph.sql              # Initialize identity graph
├── 02_extract_merge.sql             # Extract and merge identities
├── 03_source_key_stats.sql          # Source statistics
├── 04_unify_loop_iteration_*.sql    # Iterative unification (N files)
├── 05_canonicalize.sql              # Canonical ID creation
├── 06_result_key_stats.sql          # Result statistics
├── 10_enrich_*.sql                  # Source table enrichment (N files)
├── 20_master_*.sql                  # Master table creation (N files)
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules
└── 32_column_lookup.sql             # Column mappings

Configuration:

unify.yml                            # YAML configuration (generated in Step 1)

Success Criteria

All generated files will:

  • Be platform-optimized and production-ready
  • Use proper SQL dialects (Databricks Spark SQL or Snowflake SQL)
  • Include convergence detection logic
  • Support incremental processing
  • Generate comprehensive statistics
  • Work without modification on target platforms

Getting Started

Ready to begin? I'll use the hybrid-unif-config-creator to automatically analyze your tables and generate the YAML configuration.

Please provide:

  1. Platform: Which platform contains your data?
    • Snowflake or Databricks
  2. Tables: Which source tables should I analyze?
    • Format (Snowflake): database.schema.table or schema.table or table
    • Format (Databricks): catalog.schema.table or schema.table or table
    • Example: customer_db.public.customers, orders, web_events.user_activity
  3. Canonical ID Name: What should I call the unified ID?
    • Example: td_id, unified_customer_id, master_id
    • Default: td_id
  4. Merge Iterations (optional): How many unification loops?
    • Default: 10
    • Range: 2-30
  5. Target Platform(s) for SQL generation:
    • Same as source, or generate for both platforms

Example:

I want to set up hybrid ID unification for:

Platform: Snowflake
Tables:
- customer_db.public.customer_profiles
- customer_db.public.orders
- marketing_db.public.campaigns
- event_db.public.web_events

Canonical ID: unified_customer_id
Merge Iterations: 10
Generate SQL for: Snowflake (or both Snowflake and Databricks)

What I'll do next:

  1. Analyze your tables using Snowflake MCP tools
  2. Extract user identifiers with strict PII detection
  3. Generate unify.yml automatically
  4. Generate platform-specific SQL files
  5. Execute workflow (if requested)
  6. Provide deployment guidance

Let's get started with your hybrid ID unification setup!