Initial commit

Zhongwei Li
2025-11-30 09:02:46 +08:00
commit b81339e588
8 changed files with 2678 additions and 0 deletions

commands/transform-batch.md Normal file

@@ -0,0 +1,295 @@
---
name: transform-batch
description: Transform multiple database tables in parallel with maximum efficiency
---
# Transform Multiple Tables to Staging (Batch Mode)
## ⚠️ CRITICAL: This command enables parallel processing, which can make batch transformations roughly 2x-10x faster depending on the number of tables
I'll help you transform multiple database tables to staging format using parallel sub-agent execution for maximum performance.
---
## Required Information
Please provide the following details:
### 1. Source Tables
- **Table List**: Comma-separated list of tables (e.g., `table1, table2, table3`)
- **Format**: `database.table_name` or just `table_name` (if same database)
- **Example**: `client_src.customers_histunion, client_src.orders_histunion, client_src.products_histunion`
### 2. Source Configuration
- **Source Database**: Database containing tables (e.g., `client_src`)
- **Staging Database**: Target database (default: `client_stg`)
- **Lookup Database**: Reference database for rules (default: `client_config`)
### 3. SQL Engine (Optional)
- **Engine**: Choose one:
  - `presto` or `trino` - Presto/Trino SQL engine (default)
  - `hive` - Hive SQL engine
  - `mixed` - Specify engine per table
- If not specified, will default to Presto/Trino for all tables
### 4. Mixed Engine Example (Optional)
If you need different engines for different tables:
```
Transform table1 using Hive, table2 using Presto, table3 using Hive
```
---
## What I'll Do
### Step 1: Parse Table List
I will extract individual tables from your input:
- Parse comma-separated list
- Detect database prefix for each table
- Identify total table count
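The parsing step can be pictured with a small sketch. This is only an illustration of the rules listed above, not the command's actual implementation; `TableRef`, `parse_table_list`, and the `default_db` parameter are hypothetical names introduced here.

```python
from typing import NamedTuple, Optional


class TableRef(NamedTuple):
    database: Optional[str]  # None when no prefix and no default were given
    table: str


def parse_table_list(raw: str, default_db: Optional[str] = None) -> list[TableRef]:
    """Split a comma-separated table list into (database, table) pairs."""
    refs = []
    for item in raw.split(","):
        name = item.strip()
        if not name:
            continue  # tolerate trailing commas and doubled separators
        if "." in name:
            # "database.table_name" form: split on the first dot only
            database, table = name.split(".", 1)
        else:
            # bare "table_name" form: fall back to the shared database
            database, table = default_db, name
        refs.append(TableRef(database, table))
    return refs


tables = parse_table_list(
    "client_src.customers_histunion, client_src.orders_histunion, products_histunion",
    default_db="client_src",
)
print(f"{len(tables)} tables:", tables)
```

Splitting on the first dot only keeps the sketch simple; it assumes table names themselves never contain a dot.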
### Step 2: Detect Engine Strategy
I will determine processing strategy:
- **Single Engine**: All tables use same engine
  - Presto/Trino (default) → All tables to `staging-transformer-presto`
  - Hive → All tables to `staging-transformer-hive`
- **Mixed Engines**: Different engines per table
  - Parse engine specification per table
  - Route each table to appropriate sub-agent
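As a rough sketch of the routing rules above: the sub-agent names come from this document, but the `"<table> using <engine>"` grammar, the `route_tables` function, and the error behaviour are assumptions made for illustration.

```python
import re

# Sub-agent names are taken from the text; everything else here is assumed.
AGENT_BY_ENGINE = {
    "presto": "staging-transformer-presto",
    "trino": "staging-transformer-presto",
    "hive": "staging-transformer-hive",
}
DEFAULT_ENGINE = "presto"


def route_tables(spec: str) -> dict[str, str]:
    """Map each table in a spec like 'table1 using Hive, table2' to a sub-agent."""
    routes = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        match = re.fullmatch(r"(\S+)\s+using\s+(\w+)", item, flags=re.IGNORECASE)
        if match:
            table, engine = match.group(1), match.group(2).lower()
        else:
            # No engine given for this table: use the documented default
            table, engine = item, DEFAULT_ENGINE
        if engine not in AGENT_BY_ENGINE:
            raise ValueError(f"Unsupported engine {engine!r} for table {table!r}")
        routes[table] = AGENT_BY_ENGINE[engine]
    return routes


routes = route_tables("table1 using Hive, table2 using Presto, table3 using Hive")
for table, agent in routes.items():
    print(f"{table} -> {agent}")
```

A single-engine request is just the case where every entry resolves to the same agent, so one function covers both strategies.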
### Step 3: Launch Parallel Sub-Agents
I will create parallel sub-agent calls:
- **ONE sub-agent per table** (maximum parallelism)
- **Single message with multiple Task calls** (concurrent execution)
- **Each sub-agent processes independently** (no blocking)
- **All sub-agents skip git workflow** (consolidated at end)
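The real fan-out happens through multiple Task calls issued in one message, which cannot be shown as ordinary code. As a loose Python analogy only, the same "one worker per table, collect everything at the end" shape looks like the sketch below; `transform_table` is a hypothetical stand-in for a sub-agent, and `run_batch` is likewise invented here.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_table(table: str) -> dict:
    """Hypothetical stand-in for one sub-agent; it does no real work here."""
    return {"table": table, "ok": True}


def run_batch(tables: list[str]) -> list[dict]:
    # One worker per table mirrors "ONE sub-agent per table".
    with ThreadPoolExecutor(max_workers=max(1, len(tables))) as pool:
        # map() submits every job up front and yields results in input order,
        # so no table waits for another to finish before starting.
        return list(pool.map(transform_table, tables))


results = run_batch(["customers_histunion", "orders_histunion", "products_histunion"])
print(results)
```

Note that nothing inside `transform_table` touches git; in this analogy, as in the steps above, that work is deferred until all results are in.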
### Step 4: Monitor Parallel Execution
I will track all sub-agent progress:
- Wait for all sub-agents to complete
- Collect results from each transformation
- Identify any failures or errors
- Report partial success if needed
### Step 5: Consolidate Results
After ALL tables complete successfully:
1. **Aggregate file changes** across all tables
2. **Execute single git workflow**:
   - Create feature branch
   - Commit all changes together
   - Push to remote
   - Create comprehensive PR
3. **Report complete summary**
---
## Processing Strategy
### Parallel Processing (Recommended for 2+ Tables)
```
User requests: "Transform tables A, B, C"
Main Claude creates 3 parallel sub-agent calls:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Sub-Agent 1     │ │ Sub-Agent 2     │ │ Sub-Agent 3     │
│ (Table A)       │ │ (Table B)       │ │ (Table C)       │
│ staging-        │ │ staging-        │ │ staging-        │
│ transformer-    │ │ transformer-    │ │ transformer-    │
│ presto          │ │ presto          │ │ presto          │
└─────────────────┘ └─────────────────┘ └─────────────────┘
         ↓                   ↓                   ↓
   [Files for A]       [Files for B]       [Files for C]
         ↓                   ↓                   ↓
         └───────────────────┴───────────────────┘
                [Consolidated Git Workflow]
                [Single PR with all tables]
```
### Performance Benefits:
- **Speed**: N tables finish in roughly the time of the slowest single table, instead of the sum of all N transformation times
- **Efficiency**: Optimal resource utilization
- **User Experience**: Faster results for batch operations
- **Scalability**: Can handle 10+ tables efficiently
---
## Quality Assurance (Per Table)
Each sub-agent ensures complete compliance:
- **Column Limit Management** (max 200 columns)
- **JSON Detection & Extraction** (automatic)
- **Date Processing** (4 outputs per date column)
- **Email/Phone Validation** (with hashing)
- **String Standardization** (UPPER, TRIM, NULL handling)
- **Deduplication Logic** (if configured)
- **Join Processing** (if specified)
- **Incremental Processing** (state tracking)
- **SQL File Creation** (init, incremental, upsert)
- **DIG File Management** (conditional creation)
- **Configuration Update** (src_params.yml)
- **Treasure Data Compatibility** (VARCHAR/BIGINT timestamps)
---
## Output Files
### For Presto/Trino Engine (per table):
- `staging/init_queries/{source_db}_{table}_init.sql`
- `staging/queries/{source_db}_{table}.sql`
- `staging/queries/{source_db}_{table}_upsert.sql` (if dedup)
- Updated `staging/config/src_params.yml` (all tables)
- `staging/staging_transformation.dig` (created once if not exists)
### For Hive Engine (per table):
- `staging_hive/queries/{source_db}_{table}.sql`
- Updated `staging_hive/config/src_params.yml` (all tables)
- `staging_hive/staging_hive.dig` (created once if not exists)
- Template files (created once if not exist)
### Plus:
- Single git commit with all tables
- Comprehensive pull request
- Complete validation report for all tables
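To make the per-table naming patterns concrete, here is a minimal sketch that expands them for one table. The path templates are copied from the lists above; the `expected_files` helper and its `dedup` flag are hypothetical and only cover the per-table SQL files, not the shared config, `.dig`, or template files.

```python
def expected_files(engine: str, source_db: str, table: str, dedup: bool = False) -> list[str]:
    """List the per-table SQL files named in the Output Files section."""
    stem = f"{source_db}_{table}"
    if engine in ("presto", "trino"):
        files = [
            f"staging/init_queries/{stem}_init.sql",
            f"staging/queries/{stem}.sql",
        ]
        if dedup:
            # The upsert query is only listed for tables with deduplication
            files.append(f"staging/queries/{stem}_upsert.sql")
        return files
    if engine == "hive":
        return [f"staging_hive/queries/{stem}.sql"]
    raise ValueError(f"Unsupported engine: {engine!r}")


for path in expected_files("presto", "client_src", "customers_histunion", dedup=True):
    print(path)
```

Such a list could be compared against the working tree after a batch run to spot a table whose files are missing.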
---
## Example Usage
### Example 1: Same Engine (Presto Default)
```
User: Transform tables: client_src.customers_histunion, client_src.orders_histunion, client_src.products_histunion
→ Parallel execution with 3 staging-transformer-presto agents
→ All files to staging/ directory
→ Single consolidated git workflow
→ Time: ~1x (vs 3x sequential)
```
### Example 2: Same Engine (Hive Explicit)
```
User: Transform tables using Hive: client_src.events_histunion, client_src.profiles_histunion
→ Parallel execution with 2 staging-transformer-hive agents
→ All files to staging_hive/ directory
→ Single consolidated git workflow
→ Time: ~1x (vs 2x sequential)
```
### Example 3: Mixed Engines
```
User: Transform table1 using Hive, table2 using Presto, table3 using Hive
→ Parallel execution:
- Table1 → staging-transformer-hive
- Table2 → staging-transformer-presto
- Table3 → staging-transformer-hive
→ Files distributed to appropriate directories
→ Single consolidated git workflow
→ Time: ~1x (vs 3x sequential)
```
---
## Error Handling
### Partial Success Scenario
If some tables succeed and others fail:
1. **Report Clear Status**:
   ```
   ✅ Successfully transformed: table1, table2
   ❌ Failed: table3 (error message)
   ```
2. **Preserve Successful Work**:
   - Keep files from successful transformations
   - Allow retry of only failed tables
3. **Git Safety**:
   - Only execute git workflow if ALL tables succeed
   - Otherwise, keep changes local for review
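The partial-success rules can be summarised in a few lines. This is a sketch under assumptions: `summarize_batch` and its input shape (table name mapped to an error message, or `None` on success) are invented for illustration.

```python
from typing import Optional


def summarize_batch(errors_by_table: dict[str, Optional[str]]) -> tuple[str, bool, list[str]]:
    """Return (status report, whether to run git, tables to retry)."""
    succeeded = [t for t, err in errors_by_table.items() if err is None]
    failed = {t: err for t, err in errors_by_table.items() if err is not None}

    lines = []
    if succeeded:
        lines.append("✅ Successfully transformed: " + ", ".join(succeeded))
    for table, err in failed.items():
        lines.append(f"❌ Failed: {table} ({err})")

    # Git safety: all-or-nothing, so a single failure keeps changes local.
    run_git = bool(succeeded) and not failed
    return "\n".join(lines), run_git, list(failed)


report, run_git, retry = summarize_batch(
    {"table1": None, "table2": None, "table3": "error message"}
)
print(report)
print("Run git workflow:", run_git)
print("Retry:", retry)
```

Returning the retry list separately reflects the "allow retry of only failed tables" rule: files from `table1` and `table2` stay in place while `table3` is re-run.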
### Full Failure Scenario
If all tables fail:
- Report detailed error for each table
- No git workflow execution
- Provide troubleshooting guidance
---
## Next Steps After Batch Transformation
1. **Review Pull Request**:
   ```
   Title: "Batch transform 5 tables to staging"
   Body:
   - Transformed tables: table1, table2, table3, table4, table5
   - Engine: Presto/Trino
   - All validation gates passed ✅
   - Files created: 15 SQL files, 1 config update
   ```
2. **Verify Generated Files**:
   ```bash
   # For Presto
   ls -l staging/queries/
   ls -l staging/init_queries/
   cat staging/config/src_params.yml
   # For Hive
   ls -l staging_hive/queries/
   cat staging_hive/config/src_params.yml
   ```
3. **Test Workflow**:
   ```bash
   cd staging # or staging_hive
   # <project_name> is a placeholder; substitute your workflow project name
   td wf push <project_name>
   td wf run staging_transformation.dig # or staging_hive.dig
   ```
4. **Monitor All Tables**:
   ```sql
   SELECT table_name, inc_value, project_name
   FROM client_config.inc_log
   WHERE table_name IN ('table1', 'table2', 'table3')
   ORDER BY inc_value DESC
   ```
---
## Performance Comparison
| Tables | Sequential Time | Parallel Time | Speedup |
|--------|----------------|---------------|---------|
| 2 | ~10 min | ~5 min | 2x |
| 3 | ~15 min | ~5 min | 3x |
| 5 | ~25 min | ~5 min | 5x |
| 10 | ~50 min | ~5 min | 10x |
**Note**: Actual times vary based on table complexity and data volume.
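The figures in the table follow from a simple model: sequential time is the sum of the per-table times, while parallel time is roughly the longest single table. The sketch below assumes every sub-agent really runs concurrently and ignores orchestration overhead; `estimate` is a hypothetical helper, and the 5-minute duration is just the value the table implies.

```python
def estimate(durations_min: list[float]) -> tuple[float, float, float]:
    """Return (sequential minutes, parallel minutes, speedup)."""
    sequential = sum(durations_min)  # tables run one after another
    parallel = max(durations_min)    # batch finishes with the slowest table
    return sequential, parallel, sequential / parallel


# Uniform 5-minute tables reproduce the rows of the table above.
for n in (2, 3, 5, 10):
    seq, par, speedup = estimate([5.0] * n)
    print(f"{n} tables: ~{seq:.0f} min sequential, ~{par:.0f} min parallel, {speedup:.0f}x")

# One slow table caps the gain: the batch is only as fast as its slowest member.
print(estimate([5.0, 5.0, 20.0]))
```

The last call shows why the speedup is only N-fold when tables are of similar complexity.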
---
## Production-Ready Guarantee
All batch transformations will:
- ✅ Execute in parallel for maximum speed
- ✅ Maintain complete quality for each table
- ✅ Provide atomic git workflow (all or nothing)
- ✅ Include comprehensive error handling
- ✅ Generate maintainable code
- ✅ Match production standards exactly
---
**Ready to proceed? Please provide your table list and I'll launch parallel sub-agents for maximum efficiency!**
**Format Examples:**
- `Transform tables: table1, table2, table3` (same database)
- `Transform client_src.table1, client_src.table2` (explicit database)
- `Transform table1 using Hive, table2 using Presto` (mixed engines)