---
name: hybrid-setup
description: Complete end-to-end hybrid ID unification setup - automatically analyzes tables, generates config, creates SQL, and executes workflow for Snowflake and Databricks
---
# Hybrid ID Unification Complete Setup
## Overview
I'll guide you through the complete hybrid ID unification setup process for Snowflake and/or Databricks platforms. This is an **automated, end-to-end workflow** that will:
1. **Analyze your tables automatically** using platform MCP tools with strict PII detection
2. **Generate YAML configuration** from real schema and data analysis
3. **Choose target platform(s)** (Snowflake, Databricks, or both)
4. **Generate platform-specific SQL** optimized for each engine
5. **Execute workflows** with convergence detection and monitoring
6. **Provide deployment guidance** and operating instructions
**Key Features**:
- 🔍 **Automated Table Analysis**: Uses Snowflake/Databricks MCP tools to analyze actual tables
- **Strict PII Detection**: Zero tolerance; only tables with real user identifiers are included
- 📊 **Real Data Validation**: Queries actual data to validate patterns and quality
- 🎯 **Smart Recommendations**: Expert analysis provides merge strategy and priorities
- 🚀 **End-to-End Automation**: From table analysis to workflow execution
---
## What You Need to Provide
### 1. Unification Requirements (For Automated Analysis)
- **Platform**: Snowflake or Databricks
- **Tables**: List of source tables to analyze
- Format (Snowflake): `database.schema.table` or `schema.table` or `table`
- Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
- **Canonical ID Name**: Name for your unified ID (e.g., `td_id`, `unified_customer_id`)
- **Merge Iterations**: Number of unification loops (default: 10)
- **Master Tables**: (Optional) Attribute aggregation specifications
**Note**: The system will automatically:
- Extract user identifiers from actual table schemas
- Validate data patterns from real data
- Apply appropriate validation rules based on data analysis
- Generate merge strategy recommendations
### 2. Platform Selection
- **Databricks**: Unity Catalog with Delta Lake
- **Snowflake**: Database with proper permissions
- **Both**: Generate SQL for both platforms
### 3. Target Configurations
**For Databricks**:
- **Catalog**: Target catalog name
- **Schema**: Target schema name
- **Source Catalog** (optional): Source data catalog
- **Source Schema** (optional): Source data schema
**For Snowflake**:
- **Database**: Target database name
- **Schema**: Target schema name
- **Source Schema** (optional): Source data schema
### 4. Execution Credentials (if executing)
**For Databricks**:
- **Server Hostname**: your-workspace.databricks.com
- **HTTP Path**: /sql/1.0/warehouses/your-warehouse-id
- **Authentication**: PAT (Personal Access Token) or OAuth
**For Snowflake**:
- **Account**: Snowflake account name
- **User**: Username
- **Password**: Password (or SSO/key-pair authentication)
- **Warehouse**: Compute warehouse name
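For reference, these credentials map directly onto the official platform connectors (`snowflake-connector-python` and `databricks-sql-connector`). Connection handling is done for you when the workflow is executed through this command; the sketch below only illustrates what each credential is used for, with placeholder values throughout:
```python
# Illustration only: how the credentials above map onto the official connectors.
# All values are placeholders; the executor scripts manage connections for you.
import snowflake.connector
from databricks import sql as databricks_sql

snowflake_conn = snowflake.connector.connect(
    account="your_account",        # Snowflake account name
    user="your_user",
    password="your_password",      # or configure SSO / key-pair auth
    warehouse="your_warehouse",    # compute warehouse name
)

databricks_conn = databricks_sql.connect(
    server_hostname="your-workspace.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="your_personal_access_token",  # PAT; OAuth is also supported
)
```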
---
## What I'll Do
### Step 1: Automated YAML Configuration Generation
I'll use the **hybrid-unif-config-creator** command to automatically generate your `unify.yml` file:
**Automated Analysis Approach** (Recommended):
- Analyze your actual tables using platform MCP tools (Snowflake/Databricks)
- Extract user identifiers with STRICT PII detection (zero tolerance for guessing)
- Validate data patterns from real table data
- Generate unify.yml with exact template compliance
- Only include tables with actual user identifiers
- Document excluded tables with detailed reasons
**What I'll do**:
- Call the **hybrid-unif-keys-extractor agent** to analyze tables
- Query actual schema and data using platform MCP tools
- Detect valid user identifiers (email, customer_id, phone, etc.)
- Exclude tables without PII with full documentation
- Generate production-ready unify.yml automatically
**Alternative - Manual Configuration**:
- If MCP tools are unavailable, I'll guide you through manual configuration
- Interactive prompts for keys, tables, and validation rules
- Step-by-step YAML building with validation
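If you'd like to inspect the generated configuration yourself before moving on, here is a minimal sketch using PyYAML (an optional check, not part of the workflow itself):
```python
# Optional sanity check: load the generated unify.yml and list its top-level
# sections before SQL generation. Only the file name comes from this workflow;
# no assumptions are made here about the template's internal key names.
import yaml  # pip install pyyaml

with open("unify.yml") as f:
    config = yaml.safe_load(f)

print("Top-level sections:", sorted(config))
```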
### Step 2: Platform Selection and Configuration
I'll help you:
- Choose between Databricks, Snowflake, or both
- Collect platform-specific configuration (catalog/database, schema names)
- Determine source/target separation strategy
- Decide on execution or generation-only mode
### Step 3: SQL Generation
**For Databricks** (if selected):
I'll call the **databricks-sql-generator agent** to:
- Execute `yaml_unification_to_databricks.py` script
- Generate Delta Lake optimized SQL workflow
- Create output directory: `databricks_sql/unify/`
- Generate 15+ SQL files in the proper execution order
**For Snowflake** (if selected):
I'll call the **snowflake-sql-generator agent** to:
- Execute `yaml_unification_to_snowflake.py` script
- Generate Snowflake-native SQL workflow
- Create output directory: `snowflake_sql/unify/`
- Generate 15+ SQL files in the proper execution order
### Step 4: Workflow Execution (Optional)
**For Databricks** (if execution requested):
I'll call the **databricks-workflow-executor agent** to:
- Execute `databricks_sql_executor.py` script
- Connect to your Databricks workspace
- Run SQL files in proper sequence
- Monitor convergence and progress
- Optimize Delta tables
- Report final statistics
**For Snowflake** (if execution requested):
I'll call the **snowflake-workflow-executor agent** to:
- Execute `snowflake_sql_executor.py` script
- Connect to your Snowflake account
- Run SQL files in proper sequence
- Monitor convergence and progress
- Report final statistics
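The executor agents encapsulate all of this; purely as an illustration of the execution model, here is a minimal sequential runner, assuming the numeric file prefixes shown under Expected Output define the execution order:
```python
# Illustrative sketch of sequential execution; the real executor scripts add
# convergence monitoring, error handling, and final statistics reporting.
from pathlib import Path

def run_sql_files(cursor, sql_dir: str = "snowflake_sql/unify") -> None:
    # Numeric prefixes (01_, 02_, ...) sort lexicographically into run order.
    for sql_file in sorted(Path(sql_dir).glob("*.sql")):
        for statement in sql_file.read_text().split(";"):
            if statement.strip():
                cursor.execute(statement)
        print(f"Executed {sql_file.name}")
```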
### Step 5: Deployment Guidance
I'll provide:
- Configuration summary
- Generated files overview
- Deployment instructions
- Operating guidelines
- Monitoring recommendations
---
## Interactive Workflow
This command orchestrates the complete end-to-end flow by calling specialized commands in sequence:
### Phase 1: Configuration Creation
**I'll ask you for**:
- Platform (Snowflake or Databricks)
- Tables to analyze
- Canonical ID name
- Merge iterations
**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-unif-config-creator` internally
- Analyze your tables automatically
- Generate `unify.yml` with strict PII detection
- Show you the configuration for review
### Phase 2: SQL Generation
**I'll ask you**:
- Which platform(s) to generate SQL for (can be different from source)
- Output directory preferences
**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-generate-snowflake` (if Snowflake selected)
- Call `/cdp-hybrid-idu:hybrid-generate-databricks` (if Databricks selected)
- Generate 15+ optimized SQL files per platform
- Show you the execution plan
### Phase 3: Workflow Execution (Optional)
**I'll ask you**:
- Do you want to execute now or later?
- Connection credentials if executing
**Then I'll**:
- Call `/cdp-hybrid-idu:hybrid-execute-snowflake` (if Snowflake selected)
- Call `/cdp-hybrid-idu:hybrid-execute-databricks` (if Databricks selected)
- Monitor convergence and progress
- Report final statistics
**Throughout the process**:
- **Questions**: When I need your input
- **Suggestions**: Recommended approaches based on best practices
- **Validation**: Real-time checks on your choices
- **Explanations**: Help you understand concepts and options
---
## Expected Output
### Files Created (Platform-specific):
**For Databricks**:
```
databricks_sql/unify/
├── 01_create_graph.sql # Initialize identity graph
├── 02_extract_merge.sql # Extract and merge identities
├── 03_source_key_stats.sql # Source statistics
├── 04_unify_loop_iteration_*.sql # Iterative unification (N files)
├── 05_canonicalize.sql # Canonical ID creation
├── 06_result_key_stats.sql # Result statistics
├── 10_enrich_*.sql # Source table enrichment (N files)
├── 20_master_*.sql # Master table creation (N files)
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
**For Snowflake**:
```
snowflake_sql/unify/
├── 01_create_graph.sql # Initialize identity graph
├── 02_extract_merge.sql # Extract and merge identities
├── 03_source_key_stats.sql # Source statistics
├── 04_unify_loop_iteration_*.sql # Iterative unification (N files)
├── 05_canonicalize.sql # Canonical ID creation
├── 06_result_key_stats.sql # Result statistics
├── 10_enrich_*.sql # Source table enrichment (N files)
├── 20_master_*.sql # Master table creation (N files)
├── 30_unification_metadata.sql # Metadata tables
├── 31_filter_lookup.sql # Validation rules
└── 32_column_lookup.sql # Column mappings
```
**Configuration**:
```
unify.yml # YAML configuration (created interactively)
```
---
## Success Criteria
All generated files will:
- ✅ Be platform-optimized and production-ready
- ✅ Use proper SQL dialects (Databricks Spark SQL or Snowflake SQL)
- ✅ Include convergence detection logic
- ✅ Support incremental processing
- ✅ Generate comprehensive statistics
- ✅ Work without modification on target platforms
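Convergence detection means the iterative unification can stop as soon as an additional pass no longer merges any identities. Conceptually (with hypothetical table and column names; the generated SQL expresses the equivalent check in its own dialect):
```python
# Conceptual sketch of convergence detection; table and column names are
# hypothetical. When a pass no longer reduces the number of distinct canonical
# IDs, the identity graph has stabilized and remaining iterations can be skipped.
def has_converged(cursor, graph_table: str, previous_count: int) -> tuple[bool, int]:
    cursor.execute(f"SELECT COUNT(DISTINCT canonical_id) FROM {graph_table}")
    current_count = cursor.fetchone()[0]
    return current_count == previous_count, current_count
```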
---
## Getting Started
**Ready to begin?** I'll use the **hybrid-unif-config-creator** to automatically analyze your tables and generate the YAML configuration.
Please provide:
1. **Platform**: Which platform contains your data?
- Snowflake or Databricks
2. **Tables**: Which source tables should I analyze?
- Format (Snowflake): `database.schema.table` or `schema.table` or `table`
- Format (Databricks): `catalog.schema.table` or `schema.table` or `table`
- Example: `customer_db.public.customers`, `orders`, `web_events.user_activity`
3. **Canonical ID Name**: What should I call the unified ID?
- Example: `td_id`, `unified_customer_id`, `master_id`
- Default: `td_id`
4. **Merge Iterations** (optional): How many unification loops?
- Default: 10
- Range: 2-30
5. **Target Platform(s)** for SQL generation:
- Same as source, or generate for both platforms
**Example**:
```
I want to set up hybrid ID unification for:
Platform: Snowflake
Tables:
- customer_db.public.customer_profiles
- customer_db.public.orders
- marketing_db.public.campaigns
- event_db.public.web_events
Canonical ID: unified_customer_id
Merge Iterations: 10
Generate SQL for: Snowflake (or both Snowflake and Databricks)
```
**What I'll do next**:
1. ✅ Analyze your tables using Snowflake MCP tools
2. ✅ Extract user identifiers with strict PII detection
3. ✅ Generate unify.yml automatically
4. ✅ Generate platform-specific SQL files
5. ✅ Execute workflow (if requested)
6. ✅ Provide deployment guidance
---
**Let's get started with your hybrid ID unification setup!**