| name | description |
|---|---|
| transform-validation | Validate staging transformation files against CLAUDE.md compliance and quality gates |
Validate Staging Transformation
⚠️ CRITICAL: This validates against strict production quality gates
I'll validate your staging transformation files for compliance with CLAUDE.md specifications and production standards.
What I'll Validate
Quality Gates (ALL MUST PASS)
1. File Structure Compliance
- ✅ Correct directory structure (staging/ or staging_hive/)
- ✅ Proper file naming conventions
- ✅ All required files present
- ✅ No missing dependencies
2. SQL File Validation
- ✅ Incremental SQL (`staging/queries/{source_db}_{table}.sql`)
- Correct FROM clause with `{source_database}.{source_table}`
- Correct WHERE clause with incremental logic
- Proper database references
- ✅ Initial Load SQL (`staging/init_queries/{source_db}_{table}_init.sql`)
- Full table scan (no WHERE clause)
- Same transformations as incremental
- ✅ Upsert SQL (`staging/queries/{source_db}_{table}_upsert.sql`)
- Only present if deduplication exists
- Correct DELETE and INSERT logic
- Work table pattern used (see the sketch below)
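For reference, a minimal sketch of the work-table upsert pattern this gate looks for. The table names (`stg_customer_profiles`, `stg_customer_profiles_work`) and the dedup key (`customer_id`) are placeholders; the exact identifiers and statement layout come from the project's CLAUDE.md templates.

```sql
-- Illustrative sketch only: table names and the dedup key are placeholders,
-- not the project's actual identifiers.

-- 1) Remove rows from the staging table that are about to be re-inserted.
DELETE FROM stg_customer_profiles
WHERE customer_id IN (SELECT customer_id FROM stg_customer_profiles_work);

-- 2) Insert the freshly transformed rows from the work table.
INSERT INTO stg_customer_profiles
SELECT * FROM stg_customer_profiles_work;
```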
3. Data Transformation Standards
- ✅ Column Processing:
- All columns transformed (except 'time')
- Column limit compliance (max 200)
- Proper type casting
- ✅ Date Columns (CRITICAL):
- ALL date columns have 4 outputs (original, _std, _unixtime, _date)
- `time` column NOT transformed
- Correct FORMAT_DATETIME patterns (Presto) or date_format (Hive); see the sketch after this list
- ✅ JSON Columns (CRITICAL):
- ALL JSON columns detected and processed
- Top-level key extraction completed
- Nested objects handled (up to 2 levels)
- Array handling with TRY_CAST
- All extractions wrapped with NULLIF(UPPER(...), '')
- ✅ Email Columns:
- Original + _std + _hash + _valid outputs
- Correct validation regex
- SHA256 hashing applied
- ✅ Phone Columns:
- Exact pattern compliance
- phone_std, phone_hash, phone_valid outputs
- No deviations from templates
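To make the required output shapes concrete, here is a minimal Presto-flavor sketch (shown as SELECT-list expressions) for one date column, one JSON column, and one email column. The column names (`created_at`, `attributes`, `email`), the format string, the JSON key, and the email regex are placeholders; the binding patterns are the ones defined in CLAUDE.md.

```sql
-- Illustrative sketch only: column names, format strings, JSON keys, and the
-- email regex are placeholders; CLAUDE.md defines the exact templates.

-- Date column: original + _std + _unixtime + _date (4 outputs); 'time' is left untouched
created_at AS created_at,
FORMAT_DATETIME(FROM_UNIXTIME(TD_TIME_PARSE(created_at)), 'yyyy-MM-dd HH:mm:ss') AS created_at_std,
TD_TIME_PARSE(created_at) AS created_at_unixtime,
SUBSTR(created_at, 1, 10) AS created_at_date,

-- JSON column: top-level key extraction wrapped with NULLIF(UPPER(...), '')
NULLIF(UPPER(json_extract_scalar(attributes, '$.plan')), '') AS attributes_plan,

-- Email column: original + _std + _hash + _valid (validity kept as VARCHAR, not BOOLEAN)
email AS email,
NULLIF(LOWER(TRIM(email)), '') AS email_std,
TO_HEX(SHA256(TO_UTF8(LOWER(TRIM(email))))) AS email_hash,
CASE WHEN REGEXP_LIKE(email, '^[^@\s]+@[^@\s]+\.[^@\s]+$') THEN 'TRUE' ELSE 'FALSE' END AS email_valid,
```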
4. Data Processing Order
- ✅ Clean → Join → Dedupe sequence followed
- ✅ Joins use cleaned columns (not raw)
- ✅ Deduplication uses cleaned columns (not raw)
- ✅ No raw column names in PARTITION BY
- ✅ Proper CTE structure (see the sketch below)
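A minimal sketch of the expected Clean → Join → Dedupe CTE order. All table and column names, including the `ref_table` join, are hypothetical placeholders used only to show the shape.

```sql
-- Illustrative sketch only: table and column names are placeholders.
WITH cleaned AS (
  SELECT
    NULLIF(UPPER(TRIM(customer_id)), '') AS customer_id,    -- cleaning happens first
    TD_TIME_PARSE(created_at)            AS created_at_unixtime
  FROM {source_database}.{source_table}
),
joined AS (
  SELECT c.*, r.segment                                     -- joins use cleaned columns, never raw ones
  FROM cleaned c
  LEFT JOIN ref_table r ON c.customer_id = r.customer_id
),
deduped AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id                              -- cleaned column, never a raw name
      ORDER BY created_at_unixtime DESC
    ) AS rn
  FROM joined
)
SELECT * FROM deduped WHERE rn = 1
```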
5. Configuration File Validation
- ✅ src_params.yml structure:
- Valid YAML syntax
- All required fields present
- Correct dependency group structure
- Proper table configuration
- ✅ Table Configuration:
- `name`, `source_db`, `staging_table` present
- `has_dedup` boolean correct
- `partition_columns` matches dedup logic
- `mode` set correctly (inc/full)
6. DIG File Validation
- ✅ Loop-based architecture used
- ✅ No repetitive table blocks
- ✅ Correct template structure
- ✅ Proper error handling
- ✅ Inc_log table creation
- ✅ Conditional processing logic
7. Treasure Data Compatibility
- ✅ Forbidden Types:
- No TIMESTAMP types (must be VARCHAR/BIGINT)
- No BOOLEAN types (must be VARCHAR)
- No DOUBLE for timestamps
- ✅ Function Compatibility:
- Presto: FORMAT_DATETIME, TD_TIME_PARSE
- Hive: date_format, unix_timestamp, from_unixtime
- ✅ Type Casting:
- TRY_CAST used for safety
- All timestamps cast to VARCHAR or BIGINT (see the sketch below)
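A minimal sketch of the casting shapes this gate checks, shown as SELECT-list expressions. The column names (`order_total`, `created_at`, `is_active`) are placeholders.

```sql
-- Illustrative sketch only: column names are placeholders.
TRY_CAST(order_total AS DECIMAL(18, 2)) AS order_total,      -- safe numeric cast, never a hard failure
TD_TIME_PARSE(created_at) AS created_at_unixtime,             -- timestamp kept as BIGINT (or VARCHAR), never TIMESTAMP
CAST(is_active AS VARCHAR) AS is_active,                      -- BOOLEAN forbidden, stored as VARCHAR
```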
8. Incremental Processing
- ✅ State Management:
- Correct inc_log table usage
- COALESCE(MAX(inc_value), 0) pattern
- Proper WHERE clause filtering
- ✅ Database References:
- Source table: `{source_database}.{source_table}`
- Inc_log: `${lkup_db}.inc_log`
- Correct project_name filter (see the sketch below)
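A minimal sketch of the incremental filter this gate checks. The `time` column, `inc_value`, and `project_name` follow the conventions above; the per-table filter on `table_name` is an assumption shown for illustration only.

```sql
-- Illustrative sketch only; the exact filters come from the CLAUDE.md templates.
SELECT *
FROM {source_database}.{source_table}
WHERE time > (
  SELECT COALESCE(MAX(inc_value), 0)
  FROM ${lkup_db}.inc_log
  WHERE project_name = '${project_name}'   -- correct project_name filter
    AND table_name = '{staging_table}'     -- assumed per-table filter; adjust to your spec
)
```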
9. Engine-Specific Validation
For Presto/Trino:
- ✅ Files in `staging/` directory
- ✅ Three SQL files (init, incremental, upsert if dedup)
- ✅ Presto-compatible functions used
- ✅ JSON: json_extract_scalar with NULLIF(UPPER(...), '')
For Hive:
- ✅ Files in `staging_hive/` directory
- ✅ Single SQL file with INSERT OVERWRITE
- ✅ Template files present (get_max_time.sql, get_stg_rows_for_delete.sql)
- ✅ Hive-compatible functions used
- ✅ JSON: REGEXP_EXTRACT with NULLIF(UPPER(...), '') (see the sketch below)
- ✅ Timestamp keyword escaped with backticks
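For comparison, a minimal sketch of the two JSON extraction shapes, shown as SELECT-list expressions. The column `attributes` and key `tier` are placeholders, and the Hive regex is illustrative rather than the project's exact template.

```sql
-- Presto/Trino shape (illustrative placeholders: attributes, tier)
NULLIF(UPPER(json_extract_scalar(attributes, '$.tier')), '') AS attributes_tier,

-- Hive shape (illustrative regex; CLAUDE.md defines the exact pattern)
NULLIF(UPPER(REGEXP_EXTRACT(attributes, '"tier"\\s*:\\s*"([^"]*)"', 1)), '') AS attributes_tier,
```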
Validation Options
Option 1: Validate Specific Table
Provide:
- Table Name: e.g., `client_src.customer_profiles`
- Engine: `presto` or `hive`
I will:
- Find all related files for the table
- Check against ALL quality gates
- Report detailed findings with line numbers
- Provide remediation guidance
Option 2: Validate All Tables
Say: "validate all" or "validate all Presto" or "validate all Hive"
I will:
- Find all transformation files
- Validate each table against quality gates
- Report comprehensive project status
- Identify any inconsistencies
Option 3: Validate Configuration Only
Say: "validate config"
I will:
- Check `staging/config/src_params.yml` (Presto)
- Check `staging_hive/config/src_params.yml` (Hive)
- Validate YAML syntax
- Verify table configurations
- Check dependency groups
Validation Process
Step 1: File Discovery
I will locate all relevant files:
- SQL files (init, queries, upsert)
- Configuration files (src_params.yml)
- Workflow files (staging_transformation.dig or staging_hive.dig)
Step 2: Load and Parse
I will read all files:
- Parse SQL syntax
- Parse YAML structure
- Extract transformation logic
- Identify patterns
Step 3: Quality Gate Checks
I will verify each gate systematically:
- Execute all validation rules
- Identify violations
- Collect evidence (line numbers, code samples)
- Categorize severity
Step 4: Generate Report
Pass Report (if all gates pass)
✅ VALIDATION PASSED
Table: client_src.customer_profiles
Engine: Presto
Files validated: 3
Quality Gates: 9/9 PASSED
✅ File Structure Compliance
✅ SQL File Validation
✅ Data Transformation Standards
✅ Data Processing Order
✅ Configuration File Validation
✅ DIG File Validation
✅ Treasure Data Compatibility
✅ Incremental Processing
✅ Engine-Specific Validation
Transformation Details:
- Columns: 45 (within 200 limit)
- Date columns: 3 (12 outputs - all have 4 outputs ✅)
- JSON columns: 2 (all processed ✅)
- Email columns: 1 (validated with hashing ✅)
- Phone columns: 1 (exact pattern compliance ✅)
- Deduplication: Yes (customer_id)
- Joins: None
No issues found. Transformation is production-ready.
Fail Report (if any gate fails)
❌ VALIDATION FAILED
Table: client_src.customer_profiles
Engine: Presto
Files validated: 3
Quality Gates: 6/9 PASSED
✅ File Structure Compliance
✅ SQL File Validation
❌ Data Transformation Standards - FAILED
- Date column 'created_at' missing outputs
Line 45: Only has _std output, missing _unixtime and _date
- JSON column 'attributes' not processed
Line 67: Raw column passed through without extraction
✅ Data Processing Order
✅ Configuration File Validation
❌ DIG File Validation - FAILED
- Using old repetitive block structure
File: staging/staging_transformation.dig
Issue: Should use loop-based architecture
✅ Treasure Data Compatibility
✅ Incremental Processing
❌ Engine-Specific Validation - FAILED
- JSON extraction missing NULLIF(UPPER(...), '') wrapper
Line 72: json_extract_scalar used without wrapper
Should be: NULLIF(UPPER(json_extract_scalar(...)), '')
CRITICAL ISSUES (Must Fix):
1. Add missing date outputs for 'created_at' column
2. Process 'attributes' JSON column with key extraction
3. Migrate DIG file to loop-based architecture
4. Wrap all JSON extractions with NULLIF(UPPER(...), '')
RECOMMENDATIONS:
- Re-read CLAUDE.md date processing requirements
- Check JSON column detection logic
- Use DIG template from documentation
- Review JSON extraction patterns
Re-validate after fixing issues.
Common Issues Detected
Data Transformation Violations
- Missing date column outputs (not all 4)
- JSON columns not processed
- Raw columns used in deduplication
- Phone/email patterns deviated from templates
Configuration Violations
- Invalid YAML syntax
- Missing required fields
- Incorrect dependency group structure
- Table name mismatches
DIG File Violations
- Old repetitive block structure
- Missing error handling
- Incorrect template variables
- No inc_log creation
Engine-Specific Violations
- Presto: Wrong date functions, missing TRY_CAST
- Hive: REGEXP_EXTRACT patterns wrong, timestamp not escaped
Incremental Processing Violations
- Wrong WHERE clause pattern
- Incorrect database references
- Missing COALESCE in inc_log lookup
Remediation Guidance
For Each Failed Gate:
- Reference CLAUDE.md - Re-read relevant section
- Check Examples - Review sample code in documentation
- Fix Violations - Apply exact patterns from templates
- Re-validate - Run validation again until passes
Quick Fixes:
Date Columns Missing Outputs:
-- Add all 4 outputs for each date column
column AS column,
FORMAT_DATETIME(...) AS column_std,
TD_TIME_PARSE(...) AS column_unixtime,
SUBSTR(...) AS column_date
JSON Columns Not Processed:
-- Sample data, extract keys, add extractions
NULLIF(UPPER(json_extract_scalar(json_col, '$.key')), '') AS json_col_key
DIG File Old Structure:
- Delete old file
- Use loop-based template from documentation
- Update src_params.yml instead
Next Steps After Validation
If Validation Passes
✅ Transformation is production-ready:
- Deploy with confidence
- Test workflow execution
- Monitor inc_log for health
If Validation Fails
❌ Fix reported issues:
- Address critical issues first
- Apply exact patterns from CLAUDE.md
- Re-validate until all gates pass
- DO NOT deploy failing transformations
Production Quality Assurance
This validation ensures:
- ✅ Transformations work correctly
- ✅ Data quality maintained
- ✅ No data loss or corruption
- ✅ Consistent patterns across tables
- ✅ Maintainable code
- ✅ Compliance with standards
What would you like to validate?
Options:
- Validate specific table: Provide table name and engine
- Validate all tables: Say "validate all"
- Validate configuration: Say "validate config"