name: hybrid-setup
description: Complete end-to-end hybrid ID unification setup - automatically analyzes tables, generates config, creates SQL, and executes workflow for Snowflake and Databricks

Hybrid ID Unification Complete Setup

Overview

I'll guide you through the complete hybrid ID unification setup process for Snowflake and/or Databricks platforms. This is an automated, end-to-end workflow that will:

  1. Analyze your tables automatically using platform MCP tools with strict PII detection
  2. Generate YAML configuration from real schema and data analysis
  3. Help you choose the target platform(s) (Snowflake, Databricks, or both)
  4. Generate platform-specific SQL optimized for each engine
  5. Execute workflows with convergence detection and monitoring
  6. Provide deployment guidance and operating instructions

Key Features:

  • 🔍 Automated Table Analysis: Uses Snowflake/Databricks MCP tools to analyze actual tables
  • Strict PII Detection: Zero tolerance - only includes tables with real user identifiers
  • 📊 Real Data Validation: Queries actual data to validate patterns and quality
  • 🎯 Smart Recommendations: Expert analysis provides merge strategy and priorities
  • 🚀 End-to-End Automation: From table analysis to workflow execution

What You Need to Provide

1. Unification Requirements (For Automated Analysis)

  • Platform: Snowflake or Databricks
  • Tables: List of source tables to analyze
    • Format (Snowflake): database.schema.table or schema.table or table
    • Format (Databricks): catalog.schema.table or schema.table or table
  • Canonical ID Name: Name for your unified ID (e.g., td_id, unified_customer_id)
  • Merge Iterations: Number of unification loops (default: 10)
  • Master Tables: (Optional) Attribute aggregation specifications

Note: The system will automatically:

  • Extract user identifiers from actual table schemas
  • Validate data patterns from real data
  • Apply appropriate validation rules based on data analysis
  • Generate merge strategy recommendations
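
For instance, validation rules inferred from real data might surface in the generated configuration roughly like this (an illustrative sketch only; field names, patterns, and invalid-value lists are determined by the automated analysis, not hand-written):

keys:
  - name: email
    valid_regexp: ".+@.+"              # pattern confirmed against sampled rows
    invalid_texts: ['', 'null', 'unknown']
  - name: customer_id
    invalid_texts: ['', '0']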

2. Platform Selection

  • Databricks: Unity Catalog with Delta Lake
  • Snowflake: Database with proper permissions
  • Both: Generate SQL for both platforms

3. Target Configurations

For Databricks:

  • Catalog: Target catalog name
  • Schema: Target schema name
  • Source Catalog (optional): Source data catalog
  • Source Schema (optional): Source data schema

For Snowflake:

  • Database: Target database name
  • Schema: Target schema name
  • Source Schema (optional): Source data schema
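
As a concrete illustration, the target settings boil down to a handful of values like these (hypothetical names; I'll collect the real values interactively in Phase 2, and the generator scripts determine how they're actually passed):

databricks:
  catalog: analytics_prod            # target catalog
  schema: id_unification             # target schema
  source_catalog: raw_lake           # optional: where source tables live
  source_schema: customer_raw        # optional
snowflake:
  database: ANALYTICS_PROD           # target database
  schema: ID_UNIFICATION             # target schema
  source_schema: CUSTOMER_RAW        # optional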

4. Execution Credentials (if executing)

For Databricks:

  • Server Hostname: your-workspace.databricks.com
  • HTTP Path: /sql/1.0/warehouses/your-warehouse-id
  • Authentication: PAT (Personal Access Token) or OAuth

For Snowflake:

  • Account: Snowflake account name
  • User: Username
  • Password: Password, or use SSO/key-pair authentication
  • Warehouse: Compute warehouse name
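
For example, with placeholder values (the executor agents prompt for these at run time; treat this as a checklist rather than a file the tools read):

databricks:
  server_hostname: your-workspace.databricks.com
  http_path: /sql/1.0/warehouses/your-warehouse-id
  access_token: <personal access token, or use OAuth>
snowflake:
  account: your-account-identifier
  user: your-username
  password: <password, or use SSO/key-pair>
  warehouse: COMPUTE_WH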

What I'll Do

Step 1: Automated YAML Configuration Generation

I'll use the hybrid-unif-config-creator command to automatically generate your unify.yml file:

Automated Analysis Approach (Recommended):

  • Analyze your actual tables using platform MCP tools (Snowflake/Databricks)
  • Extract user identifiers with STRICT PII detection (zero tolerance for guessing)
  • Validate data patterns from real table data
  • Generate unify.yml with exact template compliance
  • Only include tables with actual user identifiers
  • Document excluded tables with detailed reasons

What I'll do:

  • Call the hybrid-unif-keys-extractor agent to analyze tables
  • Query actual schema and data using platform MCP tools
  • Detect valid user identifiers (email, customer_id, phone, etc.)
  • Exclude tables without real user identifiers, documenting the reason for each
  • Generate production-ready unify.yml automatically

Alternative - Manual Configuration:

  • If MCP tools are unavailable, I'll guide you through manual configuration
  • Interactive prompts for keys, tables, and validation rules
  • Step-by-step YAML building with validation
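
Whichever path we take, the resulting unify.yml follows a structure along these lines (a simplified sketch with hypothetical table and column names; the exact schema follows the template enforced by hybrid-unif-config-creator):

keys:                                  # validation rules as sketched earlier
  - name: email
    invalid_texts: ['']
  - name: customer_id
    invalid_texts: ['', '0']
tables:
  - database: customer_db              # catalog for Databricks sources
    schema: public
    table: customer_profiles
    key_columns:
      - {column: email_address, key: email}
      - {column: cust_id, key: customer_id}
canonical_ids:
  - name: td_id                        # your canonical ID name
    merge_by_keys: [email, customer_id]
    merge_iterations: 10
master_tables:                         # optional; attribute spec shape is illustrative
  - name: customer_master
    canonical_id: td_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_address, priority: 1}

You'll get a chance to review and adjust this configuration before any SQL is generated.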

Step 2: Platform Selection and Configuration

I'll help you:

  • Choose between Databricks, Snowflake, or both
  • Collect platform-specific configuration (catalog/database, schema names)
  • Determine source/target separation strategy
  • Decide on execution or generation-only mode

Step 3: SQL Generation

For Databricks (if selected): I'll call the databricks-sql-generator agent to:

  • Execute yaml_unification_to_databricks.py script
  • Generate Delta Lake optimized SQL workflow
  • Create output directory: databricks_sql/unify/
  • Generate 15+ SQL files with proper execution order

For Snowflake (if selected): I'll call the snowflake-sql-generator agent to:

  • Execute yaml_unification_to_snowflake.py script
  • Generate Snowflake-native SQL workflow
  • Create output directory: snowflake_sql/unify/
  • Generate 15+ SQL files with proper execution order

Step 4: Workflow Execution (Optional)

For Databricks (if execution requested): I'll call the databricks-workflow-executor agent to:

  • Execute databricks_sql_executor.py script
  • Connect to your Databricks workspace
  • Run SQL files in proper sequence
  • Monitor convergence and progress
  • Optimize Delta tables
  • Report final statistics

For Snowflake (if execution requested): I'll call the snowflake-workflow-executor agent to:

  • Execute snowflake_sql_executor.py script
  • Connect to your Snowflake account
  • Run SQL files in proper sequence
  • Monitor convergence and progress
  • Report final statistics

Step 5: Deployment Guidance

I'll provide:

  • Configuration summary
  • Generated files overview
  • Deployment instructions
  • Operating guidelines
  • Monitoring recommendations

Interactive Workflow

This command orchestrates the complete end-to-end flow by calling specialized commands in sequence:

Phase 1: Configuration Creation

I'll ask you for:

  • Platform (Snowflake or Databricks)
  • Tables to analyze
  • Canonical ID name
  • Merge iterations

Then I'll:

  • Call /cdp-hybrid-idu:hybrid-unif-config-creator internally
  • Analyze your tables automatically
  • Generate unify.yml with strict PII detection
  • Show you the configuration for review

Phase 2: SQL Generation

I'll ask you:

  • Which platform(s) to generate SQL for (can be different from source)
  • Output directory preferences

Then I'll:

  • Call /cdp-hybrid-idu:hybrid-generate-snowflake (if Snowflake selected)
  • Call /cdp-hybrid-idu:hybrid-generate-databricks (if Databricks selected)
  • Generate 15+ optimized SQL files per platform
  • Show you the execution plan

Phase 3: Workflow Execution (Optional)

I'll ask you:

  • Do you want to execute now or later?
  • Connection credentials if executing

Then I'll:

  • Call /cdp-hybrid-idu:hybrid-execute-snowflake (if Snowflake selected)
  • Call /cdp-hybrid-idu:hybrid-execute-databricks (if Databricks selected)
  • Monitor convergence and progress
  • Report final statistics

Throughout the process:

  • Questions: When I need your input
  • Suggestions: Recommended approaches based on best practices
  • Validation: Real-time checks on your choices
  • Explanations: Help you understand concepts and options

Expected Output

Files Created (Platform-specific):

For Databricks:

databricks_sql/unify/
├── 01_create_graph.sql              # Initialize identity graph
├── 02_extract_merge.sql             # Extract and merge identities
├── 03_source_key_stats.sql          # Source statistics
├── 04_unify_loop_iteration_*.sql    # Iterative unification (N files)
├── 05_canonicalize.sql              # Canonical ID creation
├── 06_result_key_stats.sql          # Result statistics
├── 10_enrich_*.sql                  # Source table enrichment (N files)
├── 20_master_*.sql                  # Master table creation (N files)
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules
└── 32_column_lookup.sql             # Column mappings

For Snowflake:

snowflake_sql/unify/
├── 01_create_graph.sql              # Initialize identity graph
├── 02_extract_merge.sql             # Extract and merge identities
├── 03_source_key_stats.sql          # Source statistics
├── 04_unify_loop_iteration_*.sql    # Iterative unification (N files)
├── 05_canonicalize.sql              # Canonical ID creation
├── 06_result_key_stats.sql          # Result statistics
├── 10_enrich_*.sql                  # Source table enrichment (N files)
├── 20_master_*.sql                  # Master table creation (N files)
├── 30_unification_metadata.sql      # Metadata tables
├── 31_filter_lookup.sql             # Validation rules
└── 32_column_lookup.sql             # Column mappings

Configuration:

unify.yml                            # YAML configuration (generated in Step 1)

Success Criteria

All generated files will:

  • Be platform-optimized and production-ready
  • Use proper SQL dialects (Databricks Spark SQL or Snowflake SQL)
  • Include convergence detection logic
  • Support incremental processing
  • Generate comprehensive statistics
  • Work without modification on target platforms

Getting Started

Ready to begin? I'll use the hybrid-unif-config-creator to automatically analyze your tables and generate the YAML configuration.

Please provide:

  1. Platform: Which platform contains your data?
    • Snowflake or Databricks
  2. Tables: Which source tables should I analyze?
    • Format (Snowflake): database.schema.table or schema.table or table
    • Format (Databricks): catalog.schema.table or schema.table or table
    • Example: customer_db.public.customers, orders, web_events.user_activity
  3. Canonical ID Name: What should I call the unified ID?
    • Example: td_id, unified_customer_id, master_id
    • Default: td_id
  4. Merge Iterations (optional): How many unification loops?
    • Default: 10
    • Range: 2-30
  5. Target Platform(s) for SQL generation:
    • Same as source, or generate for both platforms

Example:

I want to set up hybrid ID unification for:

Platform: Snowflake
Tables:
- customer_db.public.customer_profiles
- customer_db.public.orders
- marketing_db.public.campaigns
- event_db.public.web_events

Canonical ID: unified_customer_id
Merge Iterations: 10
Generate SQL for: Snowflake (or both Snowflake and Databricks)

What I'll do next:

  1. Analyze your tables using Snowflake MCP tools
  2. Extract user identifiers with strict PII detection
  3. Generate unify.yml automatically
  4. Generate platform-specific SQL files
  5. Execute workflow (if requested)
  6. Provide deployment guidance

Let's get started with your hybrid ID unification setup!