---
name: transform-validation
description: Validate staging transformation files against CLAUDE.md compliance and quality gates
---
# Validate Staging Transformation
## ⚠️ CRITICAL: This validates against strict production quality gates
I'll validate your staging transformation files for compliance with CLAUDE.md specifications and production standards.
---
## What I'll Validate
### Quality Gates (ALL MUST PASS)
#### 1. File Structure Compliance
- ✅ Correct directory structure (staging/ or staging_hive/)
- ✅ Proper file naming conventions
- ✅ All required files present
- ✅ No missing dependencies
#### 2. SQL File Validation
- **Incremental SQL** (`staging/queries/{source_db}_{table}.sql`)
  - Correct FROM clause with `{source_database}.{source_table}`
  - Correct WHERE clause with incremental logic
  - Proper database references
- **Initial Load SQL** (`staging/init_queries/{source_db}_{table}_init.sql`)
  - Full table scan (no WHERE clause)
  - Same transformations as incremental
- **Upsert SQL** (`staging/queries/{source_db}_{table}_upsert.sql`)
  - Only present if deduplication exists
  - Correct DELETE and INSERT logic
  - Work table pattern used (see the sketch after this list)
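
For reference, a minimal sketch of the work-table upsert pattern, assuming a hypothetical staging table `stg_db.customer_profiles` deduplicated on `customer_id_std` (all names here are illustrative, not from your project):

```sql
-- Illustrative work-table upsert (hypothetical names)
-- 1) Delete rows that the new batch will replace
DELETE FROM stg_db.customer_profiles
WHERE customer_id_std IN (
  SELECT customer_id_std FROM stg_db.customer_profiles_work
);
-- 2) Insert the freshly transformed, deduplicated batch
INSERT INTO stg_db.customer_profiles
SELECT * FROM stg_db.customer_profiles_work;
```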
#### 3. Data Transformation Standards
- **Column Processing**:
  - All columns transformed (except `time`)
  - Column limit compliance (max 200)
  - Proper type casting
- **Date Columns** (CRITICAL):
  - ALL date columns have 4 outputs (original, _std, _unixtime, _date)
  - `time` column NOT transformed
  - Correct FORMAT_DATETIME patterns (Presto) or date_format (Hive)
- **JSON Columns** (CRITICAL):
  - ALL JSON columns detected and processed
  - Top-level key extraction completed
  - Nested objects handled (up to 2 levels)
  - Array handling with TRY_CAST
  - All extractions wrapped with NULLIF(UPPER(...), '')
- **Email Columns**:
  - Original + _std + _hash + _valid outputs
  - Correct validation regex
  - SHA256 hashing applied
- **Phone Columns**:
  - Exact pattern compliance
  - phone_std, phone_hash, phone_valid outputs
  - No deviations from templates (combined example after this list)
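
To make these rules concrete, here is a hedged Presto-style sketch touching each column type. All column names, format strings, and the email/phone regexes are illustrative assumptions; the exact templates in CLAUDE.md are authoritative:

```sql
SELECT
  time,                                                   -- never transformed
  -- Date column: original + _std + _unixtime + _date
  created_at,
  FORMAT_DATETIME(FROM_UNIXTIME(TD_TIME_PARSE(created_at)), 'yyyy-MM-dd HH:mm:ss') AS created_at_std,
  TD_TIME_PARSE(created_at) AS created_at_unixtime,
  SUBSTR(FORMAT_DATETIME(FROM_UNIXTIME(TD_TIME_PARSE(created_at)), 'yyyy-MM-dd HH:mm:ss'), 1, 10) AS created_at_date,
  -- JSON column: top-level key extraction, always wrapped
  NULLIF(UPPER(JSON_EXTRACT_SCALAR(attributes, '$.tier')), '') AS attributes_tier,
  -- Email column: original + _std + _hash + _valid (regex is a placeholder)
  email,
  NULLIF(LOWER(TRIM(email)), '') AS email_std,
  TO_HEX(SHA256(TO_UTF8(NULLIF(LOWER(TRIM(email)), '')))) AS email_hash,
  CASE WHEN REGEXP_LIKE(TRIM(email), '^[^@]+@[^@]+\.[^@]+$')
       THEN 'TRUE' ELSE 'FALSE' END AS email_valid,     -- VARCHAR, not BOOLEAN (gate 7)
  -- Phone column: _std + _hash + _valid (validation rule is a placeholder)
  NULLIF(REGEXP_REPLACE(phone, '[^0-9+]', ''), '') AS phone_std,
  TO_HEX(SHA256(TO_UTF8(NULLIF(REGEXP_REPLACE(phone, '[^0-9+]', ''), '')))) AS phone_hash,
  CASE WHEN LENGTH(REGEXP_REPLACE(phone, '[^0-9]', '')) BETWEEN 7 AND 15
       THEN 'TRUE' ELSE 'FALSE' END AS phone_valid
FROM client_src.customer_profiles
```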
#### 4. Data Processing Order
- **Clean → Join → Dedupe sequence followed** (CTE skeleton after this list)
  - ✅ Joins use cleaned columns (not raw)
  - ✅ Deduplication uses cleaned columns (not raw)
  - ✅ No raw column names in PARTITION BY
  - ✅ Proper CTE structure
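
A skeletal CTE showing the required order, with hypothetical names; the point is that both the join and the ROW_NUMBER() partition reference cleaned columns, never raw ones:

```sql
WITH cleaned AS (
  SELECT
    NULLIF(UPPER(TRIM(customer_id)), '') AS customer_id_std,  -- cleaned key
    TD_TIME_PARSE(created_at) AS created_at_unixtime
  FROM client_src.customer_profiles
),
joined AS (
  SELECT c.customer_id_std, c.created_at_unixtime, r.region_name
  FROM cleaned c
  LEFT JOIN lkup.regions r
    ON c.customer_id_std = r.customer_id_std   -- join on the cleaned column
),
deduped AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id_std        -- cleaned column in PARTITION BY
           ORDER BY created_at_unixtime DESC
         ) AS rn
  FROM joined
)
SELECT customer_id_std, created_at_unixtime, region_name
FROM deduped
WHERE rn = 1
```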
#### 5. Configuration File Validation
- **src_params.yml structure** (shape sketched after this list):
  - Valid YAML syntax
  - All required fields present
  - Correct dependency group structure
  - Proper table configuration
- **Table Configuration**:
  - `name`, `source_db`, `staging_table` present
  - `has_dedup` boolean correct
  - `partition_columns` matches dedup logic
  - `mode` set correctly (inc/full)
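
A hedged sketch of the shape being checked; the field names follow the bullets above, everything else is a placeholder, and your project's template is authoritative:

```yaml
# Illustrative src_params.yml (placeholder values)
project_name: client_project
dependency_groups:
  group_1:
    tables:
      - name: customer_profiles
        source_db: client_src
        staging_table: stg_customer_profiles
        has_dedup: true
        partition_columns: customer_id_std  # must match the dedup PARTITION BY
        mode: inc                           # inc or full
```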
#### 6. DIG File Validation
- **Loop-based architecture** used (fragment sketched after this list)
  - ✅ No repetitive table blocks
  - ✅ Correct template structure
  - ✅ Proper error handling
  - ✅ inc_log table creation
  - ✅ Conditional processing logic
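
A minimal sketch of the loop-based pattern using Digdag's `for_each>` operator; task names, variables, and file paths are illustrative, not the project template:

```yaml
# Illustrative loop-based .dig fragment (hypothetical names)
+create_inc_log:
  td>: queries/create_inc_log.sql   # hypothetical setup query
  database: ${lkup_db}

+staging:
  for_each>:
    table: ${tables}                # table list loaded from src_params.yml
  _do:
    +transform:
      td>: queries/${table.source_db}_${table.name}.sql
      database: ${stg_db}
      _error:
        +notify:
          echo>: "Staging failed for ${table.name}"
```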
#### 7. Treasure Data Compatibility
- **Forbidden Types**:
  - No TIMESTAMP types (must be VARCHAR/BIGINT)
  - No BOOLEAN types (must be VARCHAR)
  - No DOUBLE for timestamps
- **Function Compatibility**:
  - Presto: FORMAT_DATETIME, TD_TIME_PARSE
  - Hive: date_format, unix_timestamp, from_unixtime
- **Type Casting** (illustrated after this list):
  - TRY_CAST used for safety
  - All timestamps cast to VARCHAR or BIGINT
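
For instance, a small hedged illustration of the casting rule (hypothetical columns):

```sql
-- Malformed values become NULL instead of failing the query
TRY_CAST(order_total AS DOUBLE) AS order_total,
-- Timestamps are carried as BIGINT unixtime or VARCHAR, never TIMESTAMP
TRY_CAST(updated_at_unixtime AS BIGINT) AS updated_at_unixtime
```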
#### 8. Incremental Processing
- **State Management**:
  - Correct inc_log table usage
  - COALESCE(MAX(inc_value), 0) pattern
  - Proper WHERE clause filtering
- **Database References** (combined sketch after this list):
  - Source table: `{source_database}.{source_table}`
  - inc_log: `${lkup_db}.inc_log`
  - Correct project_name filter
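
Combined, the incremental query typically follows this sketch; the `{...}` placeholders come from the file templates, `${...}` are workflow variables, and the per-table filter is an assumption:

```sql
SELECT *
FROM {source_database}.{source_table}
WHERE time > (
  -- Resume from the last watermark; 0 forces a full scan on the first run
  SELECT COALESCE(MAX(inc_value), 0)
  FROM ${lkup_db}.inc_log
  WHERE project_name = '${project_name}'
    AND table_name = '{source_table}'   -- hypothetical per-table filter
)
```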
#### 9. Engine-Specific Validation
**For Presto/Trino:**
- ✅ Files in `staging/` directory
- ✅ Three SQL files (init, incremental, upsert if dedup)
- ✅ Presto-compatible functions used
- ✅ JSON: json_extract_scalar with NULLIF(UPPER(...), '')
**For Hive:**
- ✅ Files in `staging_hive/` directory
- ✅ Single SQL file with INSERT OVERWRITE
- ✅ Template files present (get_max_time.sql, get_stg_rows_for_delete.sql)
- ✅ Hive-compatible functions used
- ✅ JSON: REGEXP_EXTRACT with NULLIF(UPPER(...), '')
- ✅ Timestamp keyword escaped with backticks (see the Hive sketch after this list)
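
A hedged Hive-flavored sketch of the last three bullets (regex and names are illustrative):

```sql
SELECT
  -- JSON key extraction via regex, wrapped like the Presto version
  NULLIF(UPPER(REGEXP_EXTRACT(attributes, '"tier"\\s*:\\s*"([^"]*)"', 1)), '') AS attributes_tier,
  -- Hive date functions instead of FORMAT_DATETIME/TD_TIME_PARSE
  date_format(from_unixtime(unix_timestamp(created_at)), 'yyyy-MM-dd HH:mm:ss') AS created_at_std,
  -- Reserved keyword escaped with backticks
  `timestamp` AS event_timestamp
FROM client_src.customer_profiles
```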
---
## Validation Options
### Option 1: Validate Specific Table
Provide:
- **Table Name**: e.g., `client_src.customer_profiles`
- **Engine**: `presto` or `hive`
I will:
1. Find all related files for the table
2. Check against ALL quality gates
3. Report detailed findings with line numbers
4. Provide remediation guidance
### Option 2: Validate All Tables
Say: **"validate all"** or **"validate all Presto"** or **"validate all Hive"**
I will:
1. Find all transformation files
2. Validate each table against quality gates
3. Report comprehensive project status
4. Identify any inconsistencies
### Option 3: Validate Configuration Only
Say: **"validate config"**
I will:
1. Check `staging/config/src_params.yml` (Presto)
2. Check `staging_hive/config/src_params.yml` (Hive)
3. Validate YAML syntax
4. Verify table configurations
5. Check dependency groups
---
## Validation Process
### Step 1: File Discovery
I will locate all relevant files:
- SQL files (init, queries, upsert)
- Configuration files (src_params.yml)
- Workflow files (staging_transformation.dig or staging_hive.dig)
### Step 2: Load and Parse
I will read all files:
- Parse SQL syntax
- Parse YAML structure
- Extract transformation logic
- Identify patterns
### Step 3: Quality Gate Checks
I will verify each gate systematically:
- Execute all validation rules
- Identify violations
- Collect evidence (line numbers, code samples)
- Categorize severity
### Step 4: Generate Report
#### Pass Report (if all gates pass)
```
✅ VALIDATION PASSED
Table: client_src.customer_profiles
Engine: Presto
Files validated: 3
Quality Gates: 9/9 PASSED
✅ File Structure Compliance
✅ SQL File Validation
✅ Data Transformation Standards
✅ Data Processing Order
✅ Configuration File Validation
✅ DIG File Validation
✅ Treasure Data Compatibility
✅ Incremental Processing
✅ Engine-Specific Validation
Transformation Details:
- Columns: 45 (within 200 limit)
- Date columns: 3 (12 outputs - all have 4 outputs ✅)
- JSON columns: 2 (all processed ✅)
- Email columns: 1 (validated with hashing ✅)
- Phone columns: 1 (exact pattern compliance ✅)
- Deduplication: Yes (customer_id)
- Joins: None
No issues found. Transformation is production-ready.
```
#### Fail Report (if any gate fails)
```
❌ VALIDATION FAILED
Table: client_src.customer_profiles
Engine: Presto
Files validated: 3
Quality Gates: 6/9 PASSED
✅ File Structure Compliance
✅ SQL File Validation
❌ Data Transformation Standards - FAILED
- Date column 'created_at' missing outputs
Line 45: Only has _std output, missing _unixtime and _date
- JSON column 'attributes' not processed
Line 67: Raw column passed through without extraction
✅ Data Processing Order
✅ Configuration File Validation
❌ DIG File Validation - FAILED
- Using old repetitive block structure
File: staging/staging_transformation.dig
Issue: Should use loop-based architecture
✅ Treasure Data Compatibility
✅ Incremental Processing
❌ Engine-Specific Validation - FAILED
- JSON extraction missing NULLIF(UPPER(...), '') wrapper
Line 72: json_extract_scalar used without wrapper
Should be: NULLIF(UPPER(json_extract_scalar(...)), '')
CRITICAL ISSUES (Must Fix):
1. Add missing date outputs for 'created_at' column
2. Process 'attributes' JSON column with key extraction
3. Migrate DIG file to loop-based architecture
4. Wrap all JSON extractions with NULLIF(UPPER(...), '')
RECOMMENDATIONS:
- Re-read CLAUDE.md date processing requirements
- Check JSON column detection logic
- Use DIG template from documentation
- Review JSON extraction patterns
Re-validate after fixing issues.
```
---
## Common Issues Detected
### Data Transformation Violations
- Missing date column outputs (not all 4)
- JSON columns not processed
- Raw columns used in deduplication
- Phone/email patterns deviated from templates
### Configuration Violations
- Invalid YAML syntax
- Missing required fields
- Incorrect dependency group structure
- Table name mismatches
### DIG File Violations
- Old repetitive block structure
- Missing error handling
- Incorrect template variables
- No inc_log creation
### Engine-Specific Violations
- **Presto**: Wrong date functions, missing TRY_CAST
- **Hive**: REGEXP_EXTRACT patterns wrong, timestamp not escaped
### Incremental Processing Violations
- Wrong WHERE clause pattern
- Incorrect database references
- Missing COALESCE in inc_log lookup
---
## Remediation Guidance
### For Each Failed Gate:
1. **Reference CLAUDE.md** - Re-read relevant section
2. **Check Examples** - Review sample code in documentation
3. **Fix Violations** - Apply exact patterns from templates
4. **Re-validate** - Run validation again until all gates pass
### Quick Fixes:
**Date Columns Missing Outputs:**
```sql
-- Add all 4 outputs for each date column
-- (illustrative Presto pattern; use the exact format strings from CLAUDE.md)
column,
FORMAT_DATETIME(FROM_UNIXTIME(TD_TIME_PARSE(column)), 'yyyy-MM-dd HH:mm:ss') AS column_std,
TD_TIME_PARSE(column) AS column_unixtime,
SUBSTR(FORMAT_DATETIME(FROM_UNIXTIME(TD_TIME_PARSE(column)), 'yyyy-MM-dd HH:mm:ss'), 1, 10) AS column_date
```
**JSON Columns Not Processed:**
```sql
-- Sample data, extract keys, add extractions
NULLIF(UPPER(json_extract_scalar(json_col, '$.key')), '') AS json_col_key
```
**DIG File Old Structure:**
- Delete old file
- Use loop-based template from documentation
- Update src_params.yml instead
---
## Next Steps After Validation
### If Validation Passes
✅ Transformation is production-ready:
- Deploy with confidence
- Test workflow execution
- Monitor inc_log for health
### If Validation Fails
❌ Fix reported issues:
1. Address critical issues first
2. Apply exact patterns from CLAUDE.md
3. Re-validate until all gates pass
4. **DO NOT deploy failing transformations**
---
## Production Quality Assurance
This validation ensures:
- ✅ Transformations work correctly
- ✅ Data quality maintained
- ✅ No data loss or corruption
- ✅ Consistent patterns across tables
- ✅ Maintainable code
- ✅ Compliance with standards
---
**What would you like to validate?**
Options:
1. **Validate specific table**: Provide table name and engine
2. **Validate all tables**: Say "validate all"
3. **Validate configuration**: Say "validate config"