Initial commit

commands/transform-batch.md (new file, 295 lines)

---
name: transform-batch
description: Transform multiple database tables in parallel with maximum efficiency
---

# Transform Multiple Tables to Staging (Batch Mode)

## ⚠️ CRITICAL: This command enables parallel processing for 3x-10x faster transformations

I'll help you transform multiple database tables to staging format using parallel sub-agent execution for maximum performance.

---

## Required Information

Please provide the following details:

### 1. Source Tables
- **Table List**: Comma-separated list of tables (e.g., `table1, table2, table3`)
- **Format**: `database.table_name` or just `table_name` (if all tables are in the same database)
- **Example**: `client_src.customers_histunion, client_src.orders_histunion, client_src.products_histunion`

### 2. Source Configuration
- **Source Database**: Database containing the tables (e.g., `client_src`)
- **Staging Database**: Target database (default: `client_stg`)
- **Lookup Database**: Reference database for rules (default: `client_config`)

### 3. SQL Engine (Optional)
- **Engine**: Choose one:
  - `presto` or `trino` - Presto/Trino SQL engine (default)
  - `hive` - Hive SQL engine
  - `mixed` - Specify an engine per table
- If not specified, I'll default to Presto/Trino for all tables

### 4. Mixed Engine Example (Optional)
If you need different engines for different tables:
```
Transform table1 using Hive, table2 using Presto, table3 using Hive
```

---

## What I'll Do

### Step 1: Parse Table List
I will extract the individual tables from your input:
- Parse the comma-separated list
- Detect the database prefix for each table
- Identify the total table count

### Step 2: Detect Engine Strategy
I will determine the processing strategy:
- **Single Engine**: All tables use the same engine
  - Presto/Trino (default) → all tables to `staging-transformer-presto`
  - Hive → all tables to `staging-transformer-hive`
- **Mixed Engines**: Different engines per table
  - Parse the engine specification per table
  - Route each table to the appropriate sub-agent

### Step 3: Launch Parallel Sub-Agents
I will create parallel sub-agent calls:
- **ONE sub-agent per table** (maximum parallelism)
- **Single message with multiple Task calls** (concurrent execution)
- **Each sub-agent processes independently** (no blocking)
- **All sub-agents skip the git workflow** (consolidated at the end)

### Step 4: Monitor Parallel Execution
I will track the progress of all sub-agents:
- Wait for all sub-agents to complete
- Collect results from each transformation
- Identify any failures or errors
- Report partial success if needed

### Step 5: Consolidate Results
After ALL tables complete successfully:
1. **Aggregate file changes** across all tables
2. **Execute a single git workflow** (sketched below):
   - Create a feature branch
   - Commit all changes together
   - Push to remote
   - Create a comprehensive PR
3. **Report a complete summary**
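For reference, the consolidated workflow amounts to roughly the following sketch, assuming the GitHub CLI is available; the branch name, table names, and PR text are illustrative placeholders that I'll derive from your actual batch:

```bash
# Hypothetical consolidated git workflow for a 3-table batch (names are examples).
git checkout -b feature/staging-batch-transform
git add staging/ staging_hive/
git commit -m "Add staging transformations for customers, orders, products"
git push -u origin feature/staging-batch-transform
gh pr create \
  --title "Batch transform 3 tables to staging" \
  --body "Transformed tables: customers, orders, products. All validation gates passed."
```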
---

## Processing Strategy

### Parallel Processing (Recommended for 2+ Tables)
```
User requests: "Transform tables A, B, C"

Main Claude creates 3 parallel sub-agent calls:

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Sub-Agent 1    │  │  Sub-Agent 2    │  │  Sub-Agent 3    │
│  (Table A)      │  │  (Table B)      │  │  (Table C)      │
│  staging-       │  │  staging-       │  │  staging-       │
│  transformer-   │  │  transformer-   │  │  transformer-   │
│  presto         │  │  presto         │  │  presto         │
└─────────────────┘  └─────────────────┘  └─────────────────┘
        ↓                    ↓                    ↓
  [Files for A]        [Files for B]        [Files for C]
        ↓                    ↓                    ↓
        └────────────────────┴────────────────────┘
                             ↓
               [Consolidated Git Workflow]
               [Single PR with all tables]
```

### Performance Benefits:
- **Speed**: N tables finish in roughly the time of one, instead of N× the time
- **Efficiency**: Optimal resource utilization
- **User Experience**: Faster results for batch operations
- **Scalability**: Can handle 10+ tables efficiently

---

## Quality Assurance (Per Table)

Each sub-agent ensures complete compliance:

✅ **Column Limit Management** (max 200 columns)
✅ **JSON Detection & Extraction** (automatic)
✅ **Date Processing** (4 outputs per date column)
✅ **Email/Phone Validation** (with hashing; see the sketch below)
✅ **String Standardization** (UPPER, TRIM, NULL handling)
✅ **Deduplication Logic** (if configured)
✅ **Join Processing** (if specified)
✅ **Incremental Processing** (state tracking)
✅ **SQL File Creation** (init, incremental, upsert)
✅ **DIG File Management** (conditional creation)
✅ **Configuration Update** (src_params.yml)
✅ **Treasure Data Compatibility** (VARCHAR/BIGINT timestamps)
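To make two of these checks concrete, here is a minimal Presto/Trino-style sketch of the string standardization and email handling; the column names and the validation regex are illustrative assumptions, and CLAUDE.md remains the authoritative template:

```sql
-- Hypothetical cleaned-column sketch (column names and regex are placeholders).
SELECT
  NULLIF(UPPER(TRIM(first_name)), '')                          AS first_name,
  NULLIF(UPPER(TRIM(email)), '')                                AS email_std,
  TO_HEX(SHA256(TO_UTF8(LOWER(TRIM(email)))))                   AS email_hash,
  CASE
    WHEN REGEXP_LIKE(TRIM(email), '^[^@\s]+@[^@\s]+\.[^@\s]+$') THEN 'TRUE'
    ELSE 'FALSE'
  END                                                           AS email_valid
FROM client_src.customers_histunion
```

Note that the validity flag is emitted as a VARCHAR 'TRUE'/'FALSE', in line with the Treasure Data rule that BOOLEAN types are avoided.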
---

## Output Files

### For Presto/Trino Engine (per table):
- `staging/init_queries/{source_db}_{table}_init.sql`
- `staging/queries/{source_db}_{table}.sql`
- `staging/queries/{source_db}_{table}_upsert.sql` (if dedup)
- Updated `staging/config/src_params.yml` (all tables)
- `staging/staging_transformation.dig` (created once if it does not already exist)

### For Hive Engine (per table):
- `staging_hive/queries/{source_db}_{table}.sql`
- Updated `staging_hive/config/src_params.yml` (all tables)
- `staging_hive/staging_hive.dig` (created once if it does not already exist)
- Template files (created once if they do not already exist)

### Plus:
- A single git commit covering all tables
- A comprehensive pull request
- A complete validation report for all tables

---

## Example Usage

### Example 1: Same Engine (Presto Default)
```
User: Transform tables: client_src.customers_histunion, client_src.orders_histunion, client_src.products_histunion

→ Parallel execution with 3 staging-transformer-presto agents
→ All files to staging/ directory
→ Single consolidated git workflow
→ Time: ~1x (vs 3x sequential)
```

### Example 2: Same Engine (Hive Explicit)
```
User: Transform tables using Hive: client_src.events_histunion, client_src.profiles_histunion

→ Parallel execution with 2 staging-transformer-hive agents
→ All files to staging_hive/ directory
→ Single consolidated git workflow
→ Time: ~1x (vs 2x sequential)
```

### Example 3: Mixed Engines
```
User: Transform table1 using Hive, table2 using Presto, table3 using Hive

→ Parallel execution:
  - Table1 → staging-transformer-hive
  - Table2 → staging-transformer-presto
  - Table3 → staging-transformer-hive
→ Files distributed to appropriate directories
→ Single consolidated git workflow
→ Time: ~1x (vs 3x sequential)
```

---

## Error Handling

### Partial Success Scenario
If some tables succeed and others fail:

1. **Report Clear Status**:
   ```
   ✅ Successfully transformed: table1, table2
   ❌ Failed: table3 (error message)
   ```

2. **Preserve Successful Work**:
   - Keep files from successful transformations
   - Allow retry of only the failed tables

3. **Git Safety**:
   - Only execute the git workflow if ALL tables succeed
   - Otherwise, keep changes local for review

### Full Failure Scenario
If all tables fail:
- Report a detailed error for each table
- No git workflow execution
- Provide troubleshooting guidance

---

## Next Steps After Batch Transformation

1. **Review Pull Request**:
   ```
   Title: "Batch transform 5 tables to staging"

   Body:
   - Transformed tables: table1, table2, table3, table4, table5
   - Engine: Presto/Trino
   - All validation gates passed ✅
   - Files created: 15 SQL files, 1 config update
   ```

2. **Verify Generated Files**:
   ```bash
   # For Presto
   ls -l staging/queries/
   ls -l staging/init_queries/
   cat staging/config/src_params.yml

   # For Hive
   ls -l staging_hive/queries/
   cat staging_hive/config/src_params.yml
   ```

3. **Test Workflow**:
   ```bash
   cd staging   # or staging_hive
   td wf push
   td wf run staging_transformation.dig   # or staging_hive.dig
   ```

4. **Monitor All Tables**:
   ```sql
   SELECT table_name, inc_value, project_name
   FROM client_config.inc_log
   WHERE table_name IN ('table1', 'table2', 'table3')
   ORDER BY inc_value DESC
   ```

---

## Performance Comparison

| Tables | Sequential Time | Parallel Time | Speedup |
|--------|-----------------|---------------|---------|
| 2      | ~10 min         | ~5 min        | 2x      |
| 3      | ~15 min         | ~5 min        | 3x      |
| 5      | ~25 min         | ~5 min        | 5x      |
| 10     | ~50 min         | ~5 min        | 10x     |

**Note**: Actual times vary based on table complexity and data volume.

---

## Production-Ready Guarantee

All batch transformations will:
- ✅ Execute in parallel for maximum speed
- ✅ Maintain complete quality for each table
- ✅ Provide an atomic git workflow (all or nothing)
- ✅ Include comprehensive error handling
- ✅ Generate maintainable code
- ✅ Match production standards exactly

---

**Ready to proceed? Please provide your table list and I'll launch parallel sub-agents for maximum efficiency!**

**Format Examples:**
- `Transform tables: table1, table2, table3` (same database)
- `Transform client_src.table1, client_src.table2` (explicit database)
- `Transform table1 using Hive, table2 using Presto` (mixed engines)

commands/transform-table.md (new file, 186 lines)

---
name: transform-table
description: Transform a single database table to staging format with data quality improvements, PII handling, and JSON extraction
---

# Transform Single Table to Staging

## ⚠️ CRITICAL: This command enforces strict sub-agent delegation

I'll help you transform a database table to staging format using the appropriate staging-transformer sub-agent (Presto/Trino or Hive).

---

## Required Information

Please provide the following details:

### 1. Source Table
- **Database Name**: Source database (e.g., `client_src`, `demo_db`)
- **Table Name**: Source table name (e.g., `customer_profiles_histunion`)

### 2. Target Configuration
- **Staging Database**: Target database (default: `client_stg`)
- **Lookup Database**: Reference database for rules (default: `client_config`)

### 3. SQL Engine (Optional)
- **Engine**: Choose one:
  - `presto` or `trino` - Presto/Trino SQL engine (default)
  - `hive` - Hive SQL engine
- If not specified, I'll default to Presto/Trino

### 4. Transformation Requirements (Optional)
- **Deduplication**: Required? (I'll check `client_config.staging_trnsfrm_rules`)
- **JSON Columns**: I'll auto-detect and process these
- **Join Logic**: Any joins needed? (I'll check `additional_rules`)

---

## What I'll Do

### Step 1: Detect SQL Engine
I will determine the appropriate sub-agent:
- **Presto/Trino keywords** → `staging-transformer-presto`
- **Hive keywords** → `staging-transformer-hive`
- **No specification** → `staging-transformer-presto` (default)

### Step 2: Delegate to Specialized Agent
I will invoke the appropriate staging-transformer agent with:
- Complete table transformation context
- All mandatory requirements (13 rules)
- Engine-specific SQL generation
- Full compliance validation

### Step 3: Sub-Agent Will Execute
The specialized agent will:
1. **Validate table existence** (MANDATORY first step)
2. **Analyze metadata** (columns, types, data samples)
3. **Check configuration** (deduplication rules, additional rules)
4. **Detect JSON columns** (automatic processing)
5. **Generate SQL files** (the incremental filter they share is sketched after this list):
   - `staging/init_queries/{source_db}_{table_name}_init.sql` (Presto)
   - `staging/queries/{source_db}_{table_name}.sql` (Presto)
   - `staging/queries/{source_db}_{table_name}_upsert.sql` (if dedup, Presto)
   - OR `staging_hive/queries/{source_db}_{table_name}.sql` (Hive)
6. **Update configuration**: `staging/config/src_params.yml` or `staging_hive/config/src_params.yml`
7. **Create/verify DIG file**: `staging/staging_transformation.dig` or `staging_hive/staging_hive.dig`
8. **Execute git workflow**: Commit, branch, push, PR creation
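The incremental queries all hinge on the inc_log lookup pattern described in the validation command. A minimal sketch, assuming the source table keeps the standard `time` column; the table name, project name, and `${lkup_db}` value are placeholders resolved from your configuration:

```sql
-- Hypothetical incremental filter (names are placeholders).
SELECT *
FROM client_src.customer_profiles_histunion
WHERE time > (
  SELECT COALESCE(MAX(inc_value), 0)
  FROM ${lkup_db}.inc_log
  WHERE table_name = 'customer_profiles_histunion'
    AND project_name = 'staging_transformation'
)
```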
---

## Quality Assurance

The sub-agent ensures complete compliance with all requirements:

✅ **Column Limit Management** (max 200 columns)
✅ **JSON Detection & Extraction** (automatic)
✅ **Date Processing** (4 outputs per date column; see the sketch below)
✅ **Email/Phone Validation** (with hashing)
✅ **String Standardization** (UPPER, TRIM, NULL handling)
✅ **Deduplication Logic** (if configured)
✅ **Join Processing** (if specified)
✅ **Incremental Processing** (state tracking)
✅ **SQL File Creation** (init, incremental, upsert)
✅ **DIG File Management** (conditional creation)
✅ **Configuration Update** (src_params.yml)
✅ **Git Workflow** (complete automation)
✅ **Treasure Data Compatibility** (VARCHAR/BIGINT timestamps)
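As a concrete illustration of the date rule, here is a minimal Presto/Trino-style sketch of the four outputs for one date column; the column name and format string are placeholders, and CLAUDE.md defines the exact patterns:

```sql
-- Hypothetical 4-output expansion for one date column (names and format are placeholders).
SELECT
  created_at                                                      AS created_at,
  FORMAT_DATETIME(FROM_UNIXTIME(TD_TIME_PARSE(created_at)),
                  'yyyy-MM-dd HH:mm:ss')                          AS created_at_std,
  TD_TIME_PARSE(created_at)                                       AS created_at_unixtime,
  SUBSTR(FORMAT_DATETIME(FROM_UNIXTIME(TD_TIME_PARSE(created_at)),
                         'yyyy-MM-dd HH:mm:ss'), 1, 10)           AS created_at_date
FROM client_src.customer_profiles_histunion
```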
---

## Output Files

### For Presto/Trino Engine:
1. `staging/init_queries/{source_db}_{table_name}_init.sql` - Initial load SQL
2. `staging/queries/{source_db}_{table_name}.sql` - Incremental SQL
3. `staging/queries/{source_db}_{table_name}_upsert.sql` - Upsert SQL (if dedup)
4. `staging/config/src_params.yml` - Updated configuration
5. `staging/staging_transformation.dig` - Workflow (created if it does not exist)

### For Hive Engine:
1. `staging_hive/queries/{source_db}_{table_name}.sql` - Combined SQL
2. `staging_hive/config/src_params.yml` - Updated configuration
3. `staging_hive/staging_hive.dig` - Workflow (created if it does not exist)
4. `staging_hive/queries/get_max_time.sql` - Template (created if it does not exist)
5. `staging_hive/queries/get_stg_rows_for_delete.sql` - Template (created if it does not exist)

### Plus:
- Git commit with a comprehensive message
- Pull request with a transformation summary
- Validation report

---

## Example Usage

### Example 1: Presto Engine (Default)
```
User: Transform table client_src.customer_profiles_histunion
→ Engine: Presto (default)
→ Sub-agent: staging-transformer-presto
→ Output: staging/ directory files
```

### Example 2: Hive Engine (Explicit)
```
User: Transform table client_src.klaviyo_events_histunion using Hive
→ Engine: Hive
→ Sub-agent: staging-transformer-hive
→ Output: staging_hive/ directory files
```

### Example 3: With Custom Databases
```
User: Transform demo_db.orders_histunion
      Use demo_db_stg as staging database
      Use client_config for lookup
→ Engine: Presto (default)
→ Custom databases applied
```

---

## Next Steps After Transformation

1. **Review generated files**:
   ```bash
   ls -l staging/queries/
   ls -l staging/init_queries/
   cat staging/config/src_params.yml
   ```

2. **Review Pull Request**:
   - Check transformation summary
   - Verify all validation gates passed
   - Review generated SQL

3. **Test the workflow**:
   ```bash
   cd staging
   td wf push
   td wf run staging_transformation.dig
   ```

4. **Monitor execution**:
   ```sql
   SELECT * FROM client_config.inc_log
   WHERE table_name = '{your_table}'
   ORDER BY inc_value DESC
   LIMIT 1
   ```

---

## Production-Ready Guarantee

All transformations will:
- ✅ Work the first time
- ✅ Follow consistent patterns
- ✅ Include complete error handling
- ✅ Include comprehensive data quality checks
- ✅ Be maintainable and documented
- ✅ Match production standards exactly

---

**Ready to proceed? Please provide the source database and table name, and I'll delegate to the appropriate staging-transformer agent for complete processing.**

commands/transform-validation.md (new file, 363 lines)

---
name: transform-validation
description: Validate staging transformation files against CLAUDE.md compliance and quality gates
---

# Validate Staging Transformation

## ⚠️ CRITICAL: This validates against strict production quality gates

I'll validate your staging transformation files for compliance with CLAUDE.md specifications and production standards.

---

## What I'll Validate

### Quality Gates (ALL MUST PASS)

#### 1. File Structure Compliance
- ✅ Correct directory structure (staging/ or staging_hive/)
- ✅ Proper file naming conventions
- ✅ All required files present
- ✅ No missing dependencies

#### 2. SQL File Validation
- ✅ **Incremental SQL** (`staging/queries/{source_db}_{table}.sql`)
  - Correct FROM clause with `{source_database}.{source_table}`
  - Correct WHERE clause with incremental logic
  - Proper database references
- ✅ **Initial Load SQL** (`staging/init_queries/{source_db}_{table}_init.sql`)
  - Full table scan (no WHERE clause)
  - Same transformations as the incremental query
- ✅ **Upsert SQL** (`staging/queries/{source_db}_{table}_upsert.sql`)
  - Only present if deduplication exists
  - Correct DELETE and INSERT logic
  - Work table pattern used (see the sketch below)
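For orientation, a hedged sketch of the work-table flow this gate looks for; the table and key names are placeholders, and the actual upsert template in CLAUDE.md governs the exact statements:

```sql
-- Hypothetical upsert flow (tables and keys are placeholders).
-- 1. Remove staging rows that are about to be replaced.
DELETE FROM client_stg.customer_profiles
WHERE customer_id_std IN (
  SELECT customer_id_std FROM client_stg.customer_profiles_work
);

-- 2. Insert the freshly transformed rows from the work table.
INSERT INTO client_stg.customer_profiles
SELECT * FROM client_stg.customer_profiles_work;
```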
#### 3. Data Transformation Standards
- ✅ **Column Processing**:
  - All columns transformed (except `time`)
  - Column limit compliance (max 200)
  - Proper type casting
- ✅ **Date Columns** (CRITICAL):
  - ALL date columns have 4 outputs (original, _std, _unixtime, _date)
  - `time` column NOT transformed
  - Correct FORMAT_DATETIME patterns (Presto) or date_format (Hive)
- ✅ **JSON Columns** (CRITICAL):
  - ALL JSON columns detected and processed
  - Top-level key extraction completed
  - Nested objects handled (up to 2 levels)
  - Array handling with TRY_CAST
  - All extractions wrapped with NULLIF(UPPER(...), '')
- ✅ **Email Columns**:
  - Original + _std + _hash + _valid outputs
  - Correct validation regex
  - SHA256 hashing applied
- ✅ **Phone Columns**:
  - Exact pattern compliance
  - phone_std, phone_hash, phone_valid outputs
  - No deviations from templates

#### 4. Data Processing Order
- ✅ **Clean → Join → Dedupe sequence followed** (see the sketch below)
- ✅ Joins use cleaned columns (not raw)
- ✅ Deduplication uses cleaned columns (not raw)
- ✅ No raw column names in PARTITION BY
- ✅ Proper CTE structure
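A minimal sketch of what that order implies for the deduplication step, assuming a cleaned `customer_id_std` key and latest-record-wins ordering; the real partition columns come from the configured dedup rules:

```sql
-- Hypothetical clean → dedupe structure (join step omitted; names are placeholders).
WITH cleaned AS (
  SELECT
    NULLIF(UPPER(TRIM(customer_id)), '') AS customer_id_std,
    time
  FROM client_src.customer_profiles_histunion
)
SELECT customer_id_std, time
FROM (
  SELECT
    c.*,
    ROW_NUMBER() OVER (PARTITION BY customer_id_std ORDER BY time DESC) AS rn
  FROM cleaned c
) ranked
WHERE rn = 1
```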
#### 5. Configuration File Validation
- ✅ **src_params.yml structure**:
  - Valid YAML syntax
  - All required fields present
  - Correct dependency group structure
  - Proper table configuration
- ✅ **Table Configuration** (see the sketch below):
  - `name`, `source_db`, `staging_table` present
  - `has_dedup` boolean correct
  - `partition_columns` matches dedup logic
  - `mode` set correctly (inc/full)
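Pulling those fields together, a hedged sketch of a single table entry; the field names are the ones listed above, but the surrounding layout (the `tables:` list and dependency grouping) is an assumption rather than the authoritative template:

```yaml
# Hypothetical src_params.yml fragment (layout is illustrative).
tables:
  - name: customer_profiles_histunion
    source_db: client_src
    staging_table: customer_profiles
    has_dedup: true
    partition_columns: customer_id_std
    mode: inc
```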
#### 6. DIG File Validation
- ✅ **Loop-based architecture** used (see the sketch below)
- ✅ No repetitive table blocks
- ✅ Correct template structure
- ✅ Proper error handling
- ✅ Inc_log table creation
- ✅ Conditional processing logic
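As a rough illustration of loop-based rather than per-table blocks, a minimal digdag sketch assuming the table list is included from src_params.yml; the task names and parameter wiring are assumptions, and the documented DIG template remains the source of truth:

```yaml
# Hypothetical loop-based workflow fragment (names and parameters are placeholders).
_export:
  !include : config/src_params.yml

+transform_all_tables:
  for_each>:
    table: ${tables}
  _do:
    +run_incremental:
      td>: queries/${table.source_db}_${table.name}.sql
      database: ${staging_db}
```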
#### 7. Treasure Data Compatibility
- ✅ **Forbidden Types**:
  - No TIMESTAMP types (must be VARCHAR/BIGINT)
  - No BOOLEAN types (must be VARCHAR)
  - No DOUBLE for timestamps
- ✅ **Function Compatibility**:
  - Presto: FORMAT_DATETIME, TD_TIME_PARSE
  - Hive: date_format, unix_timestamp, from_unixtime
- ✅ **Type Casting** (see the sketch below):
  - TRY_CAST used for safety
  - All timestamps cast to VARCHAR or BIGINT
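A short Presto-style sketch of how those type rules typically look in a select list; the table and column names are invented for illustration:

```sql
-- Hypothetical type handling (table and column names are placeholders).
SELECT
  TRY_CAST(order_total AS DOUBLE)   AS order_total,         -- safe numeric cast
  CAST(is_active AS VARCHAR)        AS is_active,           -- BOOLEAN stored as VARCHAR
  CAST(updated_at AS VARCHAR)       AS updated_at,          -- timestamp kept as VARCHAR
  TD_TIME_PARSE(updated_at)         AS updated_at_unixtime  -- timestamp as BIGINT
FROM client_src.orders_histunion
```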
#### 8. Incremental Processing
- ✅ **State Management**:
  - Correct inc_log table usage
  - COALESCE(MAX(inc_value), 0) pattern
  - Proper WHERE clause filtering
- ✅ **Database References**:
  - Source table: `{source_database}.{source_table}`
  - Inc_log: `${lkup_db}.inc_log`
  - Correct project_name filter

#### 9. Engine-Specific Validation

**For Presto/Trino:**
- ✅ Files in `staging/` directory
- ✅ Three SQL files (init, incremental, upsert if dedup)
- ✅ Presto-compatible functions used
- ✅ JSON: json_extract_scalar with NULLIF(UPPER(...), '')

**For Hive:**
- ✅ Files in `staging_hive/` directory
- ✅ Single SQL file with INSERT OVERWRITE
- ✅ Template files present (get_max_time.sql, get_stg_rows_for_delete.sql)
- ✅ Hive-compatible functions used
- ✅ JSON: REGEXP_EXTRACT with NULLIF(UPPER(...), '')
- ✅ Timestamp keyword escaped with backticks (see the sketch below)
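For the Hive side, a hedged sketch of the last two rules; the JSON key and regex are illustrative assumptions:

```sql
-- Hypothetical Hive snippet (column, key, and regex are placeholders).
SELECT
  NULLIF(UPPER(REGEXP_EXTRACT(attributes, '"plan"\\s*:\\s*"([^"]+)"', 1)), '') AS attributes_plan,
  `timestamp`                                                                  AS `timestamp`
FROM client_src.events_histunion
```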
---

## Validation Options

### Option 1: Validate Specific Table
Provide:
- **Table Name**: e.g., `client_src.customer_profiles`
- **Engine**: `presto` or `hive`

I will:
1. Find all related files for the table
2. Check against ALL quality gates
3. Report detailed findings with line numbers
4. Provide remediation guidance

### Option 2: Validate All Tables
Say: **"validate all"** or **"validate all Presto"** or **"validate all Hive"**

I will:
1. Find all transformation files
2. Validate each table against the quality gates
3. Report comprehensive project status
4. Identify any inconsistencies

### Option 3: Validate Configuration Only
Say: **"validate config"**

I will:
1. Check `staging/config/src_params.yml` (Presto)
2. Check `staging_hive/config/src_params.yml` (Hive)
3. Validate YAML syntax
4. Verify table configurations
5. Check dependency groups

---

## Validation Process

### Step 1: File Discovery
I will locate all relevant files:
- SQL files (init, queries, upsert)
- Configuration files (src_params.yml)
- Workflow files (staging_transformation.dig or staging_hive.dig)

### Step 2: Load and Parse
I will read all files:
- Parse SQL syntax
- Parse YAML structure
- Extract transformation logic
- Identify patterns

### Step 3: Quality Gate Checks
I will verify each gate systematically:
- Execute all validation rules
- Identify violations
- Collect evidence (line numbers, code samples)
- Categorize severity

### Step 4: Generate Report

#### Pass Report (if all gates pass)
```
✅ VALIDATION PASSED

Table: client_src.customer_profiles
Engine: Presto
Files validated: 3

Quality Gates: 9/9 PASSED
✅ File Structure Compliance
✅ SQL File Validation
✅ Data Transformation Standards
✅ Data Processing Order
✅ Configuration File Validation
✅ DIG File Validation
✅ Treasure Data Compatibility
✅ Incremental Processing
✅ Engine-Specific Validation

Transformation Details:
- Columns: 45 (within 200 limit)
- Date columns: 3 (12 outputs - all have 4 outputs ✅)
- JSON columns: 2 (all processed ✅)
- Email columns: 1 (validated with hashing ✅)
- Phone columns: 1 (exact pattern compliance ✅)
- Deduplication: Yes (customer_id)
- Joins: None

No issues found. Transformation is production-ready.
```

#### Fail Report (if any gate fails)
```
❌ VALIDATION FAILED

Table: client_src.customer_profiles
Engine: Presto
Files validated: 3

Quality Gates: 6/9 PASSED

✅ File Structure Compliance
✅ SQL File Validation
❌ Data Transformation Standards - FAILED
   - Date column 'created_at' missing outputs
     Line 45: Only has _std output, missing _unixtime and _date
   - JSON column 'attributes' not processed
     Line 67: Raw column passed through without extraction

✅ Data Processing Order
✅ Configuration File Validation
❌ DIG File Validation - FAILED
   - Using old repetitive block structure
     File: staging/staging_transformation.dig
     Issue: Should use loop-based architecture

✅ Treasure Data Compatibility
✅ Incremental Processing
❌ Engine-Specific Validation - FAILED
   - JSON extraction missing NULLIF(UPPER(...), '') wrapper
     Line 72: json_extract_scalar used without wrapper
     Should be: NULLIF(UPPER(json_extract_scalar(...)), '')

CRITICAL ISSUES (Must Fix):
1. Add missing date outputs for 'created_at' column
2. Process 'attributes' JSON column with key extraction
3. Migrate DIG file to loop-based architecture
4. Wrap all JSON extractions with NULLIF(UPPER(...), '')

RECOMMENDATIONS:
- Re-read CLAUDE.md date processing requirements
- Check JSON column detection logic
- Use DIG template from documentation
- Review JSON extraction patterns

Re-validate after fixing issues.
```

---

## Common Issues Detected

### Data Transformation Violations
- Missing date column outputs (not all 4)
- JSON columns not processed
- Raw columns used in deduplication
- Phone/email patterns deviating from the templates

### Configuration Violations
- Invalid YAML syntax
- Missing required fields
- Incorrect dependency group structure
- Table name mismatches

### DIG File Violations
- Old repetitive block structure
- Missing error handling
- Incorrect template variables
- No inc_log creation

### Engine-Specific Violations
- **Presto**: Wrong date functions, missing TRY_CAST
- **Hive**: Wrong REGEXP_EXTRACT patterns, `timestamp` not escaped

### Incremental Processing Violations
- Wrong WHERE clause pattern
- Incorrect database references
- Missing COALESCE in the inc_log lookup

---

## Remediation Guidance

### For Each Failed Gate:
1. **Reference CLAUDE.md** - Re-read the relevant section
2. **Check Examples** - Review the sample code in the documentation
3. **Fix Violations** - Apply the exact patterns from the templates
4. **Re-validate** - Run validation again until it passes

### Quick Fixes:

**Date Columns Missing Outputs:**
```sql
-- Add all 4 outputs for each date column
column AS column,
FORMAT_DATETIME(...) AS column_std,
TD_TIME_PARSE(...) AS column_unixtime,
SUBSTR(...) AS column_date
```

**JSON Columns Not Processed:**
```sql
-- Sample the data, extract the keys, add the extractions
NULLIF(UPPER(json_extract_scalar(json_col, '$.key')), '') AS json_col_key
```

**DIG File Old Structure:**
- Delete the old file
- Use the loop-based template from the documentation
- Update src_params.yml instead

---

## Next Steps After Validation

### If Validation Passes
✅ The transformation is production-ready:
- Deploy with confidence
- Test workflow execution
- Monitor inc_log for health

### If Validation Fails
❌ Fix the reported issues:
1. Address critical issues first
2. Apply the exact patterns from CLAUDE.md
3. Re-validate until all gates pass
4. **DO NOT deploy failing transformations**

---

## Production Quality Assurance

This validation ensures:
- ✅ Transformations work correctly
- ✅ Data quality is maintained
- ✅ No data loss or corruption
- ✅ Consistent patterns across tables
- ✅ Maintainable code
- ✅ Compliance with standards

---

**What would you like to validate?**

Options:
1. **Validate specific table**: Provide the table name and engine
2. **Validate all tables**: Say "validate all"
3. **Validate configuration**: Say "validate config"