---
name: histunion-batch
description: Create hist-union workflows for multiple tables in batch with parallel processing
---

# Create Batch Hist-Union Workflows

## ⚠️ CRITICAL: This command processes multiple tables efficiently with schema validation

I'll help you create hist-union workflows for multiple tables at once, with proper schema validation for each table.

---

## Required Information

### 1. Table List
Provide table names in any format (comma-separated or one per line):

**Option A - Base names:**
```
client_src.klaviyo_events, client_src.shopify_products, client_src.onetrust_profiles
```

**Option B - Hist names:**
```
client_src.klaviyo_events_hist
client_src.shopify_products_hist
client_src.onetrust_profiles_hist
```

**Option C - Mixed formats:**
```
client_src.klaviyo_events, client_src.shopify_products_hist, client_src.onetrust_profiles
```

**Option D - List format:**
```
- client_src.klaviyo_events
- client_src.shopify_products
- client_src.onetrust_profiles
```

### 2. Lookup Database (Optional)
- **Lookup/Config Database**: Database for the `inc_log` watermark table
- **Default**: `client_config` (used if not specified)

---

## What I'll Do

### Step 1: Parse All Table Names
I will parse and normalize all table names:
```
For each table in list:
  1. Extract database and base name
  2. Remove _hist or _histunion suffix if present
  3. Derive:
     - Inc table:    {database}.{base_name}
     - Hist table:   {database}.{base_name}_hist
     - Target table: {database}.{base_name}_histunion
```

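For example, a `_hist` name is normalized back to its base before the three table names are derived:

```
Input:  client_src.shopify_products_hist
Base:   shopify_products (database: client_src)
Inc:    client_src.shopify_products
Hist:   client_src.shopify_products_hist
Target: client_src.shopify_products_histunion
```
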
### Step 2: Get Schemas for All Tables via MCP Tool
**CRITICAL**: I will get exact schemas for EVERY table:
```
For each table:
  1. Call mcp__mcc_treasuredata__describe_table for the inc table
     - Get complete column list
     - Get exact column order
     - Get data types

  2. Call mcp__mcc_treasuredata__describe_table for the hist table
     - Get complete column list
     - Get exact column order
     - Get data types

  3. Compare schemas:
     - Document column differences
     - Note any extra columns in inc vs hist
     - Record exact column order
```

**Note**: This may require multiple MCP calls. I'll process them efficiently.

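As an illustration of what a comparison result can look like (hypothetical columns, matching the `shopify_products` example used later in this document):

```
inc  (shopify_products):      id, title, time, incremental_date
hist (shopify_products_hist): id, title, time
→ inc has 1 extra column: incremental_date → Case 2
```
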
### Step 3: Check Full Load Status for Each Table
I will check each table against the full load list:
```
For each table:
  IF table_name IN ('klaviyo_lists', 'klaviyo_metric_data'):
    template[table] = 'FULL_LOAD'       # Case 3
  ELSE:
    IF inc_schema == hist_schema:
      template[table] = 'IDENTICAL'     # Case 1
    ELSE:
      template[table] = 'EXTRA_COLUMNS' # Case 2
```

### Step 4: Generate SQL Files for All Tables
I will create a SQL file for each table in ONE response:
```
For each table, create: hist_union/queries/{base_name}.sql

With the correct template based on schema analysis:
  - Case 1: Identical schemas
  - Case 2: Inc has extra columns
  - Case 3: Full load

All files created in parallel using multiple Write tool calls
```

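As a rough sketch of the Case 2 shape only — the production-tested template (including its watermark filter and `inc_log` update) is what actually gets generated, and the columns here are hypothetical:

```sql
-- Hypothetical Case 2 sketch for shopify_products (illustrative columns only).
-- The hist branch pads the inc-only column with NULL so both branches align
-- in exact column order.
CREATE TABLE IF NOT EXISTS client_src.shopify_products_histunion AS
SELECT
  id,
  title,
  time,
  CAST(NULL AS varchar) AS incremental_date  -- column missing from hist
FROM client_src.shopify_products_hist
UNION ALL
SELECT
  id,
  title,
  time,
  incremental_date
FROM client_src.shopify_products
```
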
### Step 5: Update Digdag Workflow
I will update the workflow with all tables:
```
File: hist_union/hist_union_runner.dig

Structure:
  +hist_union_tasks:
    _parallel: true

    +{table1_name}_histunion:
      td>: queries/{table1_name}.sql

    +{table2_name}_histunion:
      td>: queries/{table2_name}.sql

    +{table3_name}_histunion:
      td>: queries/{table3_name}.sql

    ... (all tables)
```

### Step 6: Verify Quality Gates for All Tables
Before delivering, I will verify for EACH table:
```
For each table:
  ✅ MCP tool used for both inc and hist schemas
  ✅ Schema differences identified
  ✅ Correct template selected
  ✅ All inc columns present in exact order
  ✅ NULL handling correct for missing columns
  ✅ Watermarks included for both hist and inc
  ✅ Parallel execution configured
```

---

## Batch Processing Strategy

### Efficient MCP Usage
```
1. Collect all table names first
2. Make MCP calls for all inc tables
3. Make MCP calls for all hist tables
4. Compare all schemas in batch
5. Generate all SQL files in ONE response
6. Update workflow once with all tasks
```

### Parallel File Generation
I will use multiple Write tool calls in a SINGLE response:
```
Single Response Contains:
  - Write: hist_union/queries/table1.sql
  - Write: hist_union/queries/table2.sql
  - Write: hist_union/queries/table3.sql
  - ... (all tables)
  - Edit: hist_union/hist_union_runner.dig (add all tasks)
```

---

## Output

I will generate:

### For N Tables:
1. **hist_union/queries/{table1}.sql** - SQL for table 1
2. **hist_union/queries/{table2}.sql** - SQL for table 2
3. **hist_union/queries/{table3}.sql** - SQL for table 3
4. ... (one SQL file per table)
5. **hist_union/hist_union_runner.dig** - Updated workflow with all tables

### Workflow Structure:
```yaml
timezone: UTC

_export:
  td:
    database: {database}
  lkup_db: {lkup_db}

+create_inc_log_table:
  td>:
  query: |
    CREATE TABLE IF NOT EXISTS ${lkup_db}.inc_log (
      table_name varchar,
      project_name varchar,
      inc_value bigint
    )

+hist_union_tasks:
  _parallel: true

  +table1_histunion:
    td>: queries/table1.sql

  +table2_histunion:
    td>: queries/table2.sql

  +table3_histunion:
    td>: queries/table3.sql

  # ... all tables processed in parallel
```

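The workflow above only creates the `inc_log` table; each generated query then maintains its own watermark row. Purely as a hedged illustration shaped to the `inc_log` DDL above (the production template is authoritative, and `klaviyo_events` with its `time` column are stand-ins), such a write could look like:

```sql
-- Hypothetical watermark write, shaped to match the inc_log DDL above.
INSERT INTO ${lkup_db}.inc_log
SELECT
  'klaviyo_events' AS table_name,
  'hist_union'     AS project_name,
  MAX(time)        AS inc_value
FROM client_src.klaviyo_events
```
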
---

## Progress Reporting

During processing, I will report:

### Phase 1: Parsing
```
Parsing table names...
✅ Found 5 tables to process:
  1. client_src.klaviyo_events
  2. client_src.shopify_products
  3. client_src.onetrust_profiles
  4. client_src.klaviyo_lists (FULL LOAD)
  5. client_src.users
```

### Phase 2: Schema Retrieval
```
Retrieving schemas via MCP tool...
✅ Got schema for client_src.klaviyo_events (inc)
✅ Got schema for client_src.klaviyo_events_hist (hist)
✅ Got schema for client_src.shopify_products (inc)
✅ Got schema for client_src.shopify_products_hist (hist)
... (all tables)
```

### Phase 3: Schema Analysis
```
Analyzing schemas...
✅ Table 1: Identical schemas - Use Case 1
✅ Table 2: Inc has extra 'incremental_date' - Use Case 2
✅ Table 3: Identical schemas - Use Case 1
✅ Table 4: FULL LOAD - Use Case 3
✅ Table 5: Identical schemas - Use Case 1
```

### Phase 4: File Generation
```
Generating all files...
✅ Created hist_union/queries/klaviyo_events.sql
✅ Created hist_union/queries/shopify_products.sql
✅ Created hist_union/queries/onetrust_profiles.sql
✅ Created hist_union/queries/klaviyo_lists.sql (FULL LOAD)
✅ Created hist_union/queries/users.sql
✅ Updated hist_union/hist_union_runner.dig with 5 parallel tasks
```

---

## Special Handling

### Mixed Databases
If tables are from different databases:
```
✅ Supported - Each SQL file uses the correct database
✅ Workflow uses the primary database in _export
✅ Individual tasks can override if needed (see the sketch below)
```

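For instance, the Digdag `td>` operator accepts a per-task `database` parameter, so an override could look like this (table and database names are placeholders):

```yaml
+other_table_histunion:
  td>: queries/other_table.sql
  database: other_client_src   # overrides _export td.database for this task only
```
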
### Full Load Tables in Batch
```
✅ Automatically detected (klaviyo_lists, klaviyo_metric_data)
✅ Uses Case 3 template (DROP + CREATE, no WHERE) - sketched below
✅ Still updates watermarks
✅ Processed in parallel with other tables
```

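A hedged sketch of the Case 3 shape described above — a full rebuild with no watermark filter. The production template is authoritative and lists columns explicitly in exact order; `SELECT *` here only stands in for columns that aren't known in this example:

```sql
-- Illustrative Case 3 sketch: drop and rebuild, no WHERE clause on either read.
DROP TABLE IF EXISTS client_src.klaviyo_lists_histunion;
CREATE TABLE client_src.klaviyo_lists_histunion AS
SELECT * FROM client_src.klaviyo_lists_hist
UNION ALL
SELECT * FROM client_src.klaviyo_lists
```
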
### Schema Differences
```
✅ Each table analyzed independently
✅ NULL handling applied only where needed
✅ Exact column order maintained per table
✅ Template selection per table based on schema
```

---

## Performance Benefits

### Why Batch Processing?
- ✅ **Faster**: All files created in one response
- ✅ **Consistent**: Single workflow file with all tasks
- ✅ **Efficient**: Parallel MCP calls where possible
- ✅ **Complete**: All tables configured together
- ✅ **Parallel Execution**: All tasks run concurrently in Treasure Data

### Execution Efficiency
```
Sequential Processing:
  Table 1: 10 min
  Table 2: 10 min
  Table 3: 10 min
  Total:   30 minutes

Parallel Processing:
  All tables: ~10 minutes (bounded by the slowest table)
```

---

## Next Steps After Generation

1. **Review All Generated Files**:
   ```bash
   ls -la hist_union/queries/
   cat hist_union/hist_union_runner.dig
   ```

2. **Verify Workflow Syntax**:
   ```bash
   cd hist_union
   td wf check hist_union_runner.dig
   ```

3. **Run Batch Workflow**:
   ```bash
   td wf run hist_union_runner.dig
   ```

4. **Monitor Progress**:
   ```bash
   td wf logs hist_union_runner.dig
   ```

5. **Verify All Results**:
   ```sql
   -- Check watermarks for all tables
   SELECT * FROM {lkup_db}.inc_log
   WHERE project_name = 'hist_union'
   ORDER BY table_name;

   -- Check row counts for all histunion tables
   SELECT
     '{table1}_histunion' AS table_name,
     COUNT(*) AS row_count
   FROM {database}.{table1}_histunion
   UNION ALL
   SELECT
     '{table2}_histunion',
     COUNT(*)
   FROM {database}.{table2}_histunion
   -- ... (for all tables)
   ```

---

## Example

### Input
```
Create hist-union for these tables:
- client_src.klaviyo_events
- client_src.shopify_products_hist
- client_src.onetrust_profiles
- client_src.klaviyo_lists
```

### Output Summary
```
✅ Processed 4 tables:

1. klaviyo_events (Incremental - Case 1: Identical schemas)
   - Inc:    client_src.klaviyo_events
   - Hist:   client_src.klaviyo_events_hist
   - Target: client_src.klaviyo_events_histunion

2. shopify_products (Incremental - Case 2: Inc has extra columns)
   - Inc:    client_src.shopify_products
   - Hist:   client_src.shopify_products_hist
   - Target: client_src.shopify_products_histunion
   - Extra columns in inc: incremental_date

3. onetrust_profiles (Incremental - Case 1: Identical schemas)
   - Inc:    client_src.onetrust_profiles
   - Hist:   client_src.onetrust_profiles_hist
   - Target: client_src.onetrust_profiles_histunion

4. klaviyo_lists (FULL LOAD - Case 3)
   - Inc:    client_src.klaviyo_lists
   - Hist:   client_src.klaviyo_lists_hist
   - Target: client_src.klaviyo_lists_histunion

Created 4 SQL files + 1 workflow file
All tasks configured for parallel execution
```

---

## Production-Ready Guarantee

All generated code will:
- ✅ Use exact schemas from the MCP tool for every table
- ✅ Handle schema differences correctly per table
- ✅ Use the correct template based on individual table analysis
- ✅ Process all tables in parallel for maximum efficiency
- ✅ Maintain exact column order per table
- ✅ Include proper NULL handling where needed
- ✅ Update watermarks for all tables
- ✅ Follow Presto/Trino SQL syntax
- ✅ Be production-tested and proven

---

**Ready to proceed? Please provide your list of tables and I'll generate complete hist-union workflows for all of them, using exact schemas from the MCP tool and production-tested templates.**