---
name: transform-validation
description: Validate staging transformation files against CLAUDE.md compliance and quality gates
---

Validate Staging Transformation

⚠️ CRITICAL: This validates against strict production quality gates

I'll validate your staging transformation files for compliance with CLAUDE.md specifications and production standards.


What I'll Validate

Quality Gates (ALL MUST PASS)

1. File Structure Compliance

  • Correct directory structure (staging/ or staging_hive/)
  • Proper file naming conventions
  • All required files present
  • No missing dependencies

2. SQL File Validation

  • Incremental SQL (staging/queries/{source_db}_{table}.sql)
    • Correct FROM clause with {source_database}.{source_table}
    • Correct WHERE clause with incremental logic
    • Proper database references
  • Initial Load SQL (staging/init_queries/{source_db}_{table}_init.sql)
    • Full table scan (no WHERE clause)
    • Same transformations as incremental
  • Upsert SQL (staging/queries/{source_db}_{table}_upsert.sql)
    • Only present if deduplication exists
    • Correct DELETE and INSERT logic
    • Work table pattern used (see the sketch below)
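
For reference, a minimal sketch of the work-table upsert shape this gate looks for. The database variable, table names, and key column are illustrative assumptions; the canonical DELETE/INSERT template is defined in CLAUDE.md.

-- Assumed shape only: remove rows that will be replaced, then insert from the work table
DELETE FROM ${stg_db}.customer_profiles_stg
WHERE customer_id IN (
  SELECT customer_id FROM ${stg_db}.customer_profiles_work
);

INSERT INTO ${stg_db}.customer_profiles_stg
SELECT * FROM ${stg_db}.customer_profiles_work;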

3. Data Transformation Standards

  • Column Processing:
    • All columns transformed (except 'time')
    • Column limit compliance (max 200)
    • Proper type casting
  • Date Columns (CRITICAL):
    • ALL date columns have 4 outputs (original, _std, _unixtime, _date)
    • time column NOT transformed
    • Correct FORMAT_DATETIME patterns (Presto) or date_format (Hive)
  • JSON Columns (CRITICAL):
    • ALL JSON columns detected and processed
    • Top-level key extraction completed
    • Nested objects handled (up to 2 levels)
    • Array handling with TRY_CAST
    • All extractions wrapped with NULLIF(UPPER(...), '')
  • Email Columns (see the sketch after this list):
    • Original + _std + _hash + _valid outputs
    • Correct validation regex
    • SHA256 hashing applied
  • Phone Columns:
    • Exact pattern compliance
    • phone_std, phone_hash, phone_valid outputs
    • No deviations from templates

4. Data Processing Order

  • Clean → Join → Dedupe sequence followed
  • Joins use cleaned columns (not raw)
  • Deduplication uses cleaned columns (not raw)
  • No raw column names in PARTITION BY
  • Proper CTE structure (see the sketch below)
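
To make the ordering concrete, a minimal CTE sketch of the Clean → Join → Dedupe sequence. CTE names, columns, and the lookup table are illustrative assumptions, not the CLAUDE.md template itself.

-- Assumed shape: clean first, join on cleaned columns, dedupe last on cleaned columns
WITH cleaned AS (
  SELECT
    time,
    NULLIF(UPPER(TRIM(customer_id)), '') AS customer_id,
    NULLIF(UPPER(TRIM(region_code)), '') AS region_code
  FROM {source_database}.{source_table}
),
joined AS (
  SELECT c.*, r.region_name
  FROM cleaned c
  LEFT JOIN lookup_regions r ON c.region_code = r.region_code  -- join on the cleaned value
),
deduped AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY time DESC) AS rn  -- cleaned column in PARTITION BY
  FROM joined
)
SELECT * FROM deduped WHERE rn = 1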

5. Configuration File Validation

  • src_params.yml structure:
    • Valid YAML syntax
    • All required fields present
    • Correct dependency group structure
    • Proper table configuration
  • Table Configuration:
    • name, source_db, staging_table present
    • has_dedup boolean correct
    • partition_columns matches dedup logic
    • mode set correctly (inc/full)

6. DIG File Validation

  • Loop-based architecture used
  • No repetitive table blocks
  • Correct template structure
  • Proper error handling
  • Inc_log table creation
  • Conditional processing logic

7. Treasure Data Compatibility

  • Forbidden Types:
    • No TIMESTAMP types (must be VARCHAR/BIGINT)
    • No BOOLEAN types (must be VARCHAR)
    • No DOUBLE for timestamps
  • Function Compatibility:
    • Presto: FORMAT_DATETIME, TD_TIME_PARSE
    • Hive: date_format, unix_timestamp, from_unixtime
  • Type Casting:
    • TRY_CAST used for safety
    • All timestamps cast to VARCHAR or BIGINT (see the sketch below)
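
A minimal sketch of the casting shape this gate expects. Column names are illustrative, and the example assumes a BOOLEAN source column for is_active; the point is that no TIMESTAMP, BOOLEAN, or DOUBLE-for-timestamp types survive into the output.

-- Assumed casting shape: timestamps become VARCHAR or BIGINT, booleans become VARCHAR
TRY_CAST(created_at AS VARCHAR) AS created_at,                 -- not TIMESTAMP
TD_TIME_PARSE(created_at) AS created_at_unixtime,              -- BIGINT unix time, not DOUBLE
CASE WHEN is_active THEN 'true' ELSE 'false' END AS is_active  -- not BOOLEAN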

8. Incremental Processing

  • State Management:
    • Correct inc_log table usage
    • COALESCE(MAX(inc_value), 0) pattern (see the sketch after this list)
    • Proper WHERE clause filtering
  • Database References:
    • Source table: {source_database}.{source_table}
    • Inc_log: ${lkup_db}.inc_log
    • Correct project_name filter
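
A minimal sketch of the incremental filter this gate looks for. The ${lkup_db} reference, inc_value column, and project_name filter follow the references above; the ${project_name} variable, table_name column, and filtering on time are assumptions, so defer to CLAUDE.md for the exact WHERE clause.

-- Assumed incremental WHERE shape: read the last high-water mark from inc_log
SELECT *
FROM {source_database}.{source_table}
WHERE time > (
  SELECT COALESCE(MAX(inc_value), 0)
  FROM ${lkup_db}.inc_log
  WHERE project_name = '${project_name}'
    AND table_name = '{source_table}'
)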

9. Engine-Specific Validation

For Presto/Trino:

  • Files in staging/ directory
  • Three SQL files (init, incremental, upsert if dedup)
  • Presto-compatible functions used
  • JSON: json_extract_scalar with NULLIF(UPPER(...), '')

For Hive:

  • Files in staging_hive/ directory
  • Single SQL file with INSERT OVERWRITE
  • Template files present (get_max_time.sql, get_stg_rows_for_delete.sql)
  • Hive-compatible functions used
  • JSON: REGEXP_EXTRACT with NULLIF(UPPER(...), '')
  • Timestamp keyword escaped with backticks (see the sketch below)
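
A minimal Hive-flavored sketch of the last two points above. The JSON column, key name, regex, and date format are assumptions; the canonical patterns live in CLAUDE.md.

-- Assumed Hive shape: REGEXP_EXTRACT for JSON keys, backticks around the reserved word timestamp
NULLIF(UPPER(REGEXP_EXTRACT(attributes, '"tier"\\s*:\\s*"([^"]*)"', 1)), '') AS attributes_tier,
date_format(from_unixtime(unix_timestamp(`timestamp`, 'yyyy-MM-dd HH:mm:ss')), 'yyyy-MM-dd') AS timestamp_date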

Validation Options

Option 1: Validate Specific Table

Provide:

  • Table Name: e.g., client_src.customer_profiles
  • Engine: presto or hive

I will:

  1. Find all related files for the table
  2. Check against ALL quality gates
  3. Report detailed findings with line numbers
  4. Provide remediation guidance

Option 2: Validate All Tables

Say: "validate all" or "validate all Presto" or "validate all Hive"

I will:

  1. Find all transformation files
  2. Validate each table against quality gates
  3. Report comprehensive project status
  4. Identify any inconsistencies

Option 3: Validate Configuration Only

Say: "validate config"

I will:

  1. Check staging/config/src_params.yml (Presto)
  2. Check staging_hive/config/src_params.yml (Hive)
  3. Validate YAML syntax
  4. Verify table configurations
  5. Check dependency groups

Validation Process

Step 1: File Discovery

I will locate all relevant files:

  • SQL files (init, queries, upsert)
  • Configuration files (src_params.yml)
  • Workflow files (staging_transformation.dig or staging_hive.dig)

Step 2: Load and Parse

I will read all files:

  • Parse SQL syntax
  • Parse YAML structure
  • Extract transformation logic
  • Identify patterns

Step 3: Quality Gate Checks

I will verify each gate systematically:

  • Execute all validation rules
  • Identify violations
  • Collect evidence (line numbers, code samples)
  • Categorize severity

Step 4: Generate Report

Pass Report (if all gates pass)

✅ VALIDATION PASSED

Table: client_src.customer_profiles
Engine: Presto
Files validated: 3

Quality Gates: 9/9 PASSED
✅ File Structure Compliance
✅ SQL File Validation
✅ Data Transformation Standards
✅ Data Processing Order
✅ Configuration File Validation
✅ DIG File Validation
✅ Treasure Data Compatibility
✅ Incremental Processing
✅ Engine-Specific Validation

Transformation Details:
- Columns: 45 (within 200 limit)
- Date columns: 3 (12 outputs - all have 4 outputs ✅)
- JSON columns: 2 (all processed ✅)
- Email columns: 1 (validated with hashing ✅)
- Phone columns: 1 (exact pattern compliance ✅)
- Deduplication: Yes (customer_id)
- Joins: None

No issues found. Transformation is production-ready.

Fail Report (if any gate fails)

❌ VALIDATION FAILED

Table: client_src.customer_profiles
Engine: Presto
Files validated: 3

Quality Gates: 6/9 PASSED

✅ File Structure Compliance
✅ SQL File Validation
❌ Data Transformation Standards - FAILED
  - Date column 'created_at' missing outputs
    Line 45: Only has _std output, missing _unixtime and _date
  - JSON column 'attributes' not processed
    Line 67: Raw column passed through without extraction

✅ Data Processing Order
✅ Configuration File Validation
❌ DIG File Validation - FAILED
  - Using old repetitive block structure
    File: staging/staging_transformation.dig
    Issue: Should use loop-based architecture

✅ Treasure Data Compatibility
✅ Incremental Processing
❌ Engine-Specific Validation - FAILED
  - JSON extraction missing NULLIF(UPPER(...), '') wrapper
    Line 72: json_extract_scalar used without wrapper
    Should be: NULLIF(UPPER(json_extract_scalar(...)), '')

CRITICAL ISSUES (Must Fix):
1. Add missing date outputs for 'created_at' column
2. Process 'attributes' JSON column with key extraction
3. Migrate DIG file to loop-based architecture
4. Wrap all JSON extractions with NULLIF(UPPER(...), '')

RECOMMENDATIONS:
- Re-read CLAUDE.md date processing requirements
- Check JSON column detection logic
- Use DIG template from documentation
- Review JSON extraction patterns

Re-validate after fixing issues.

Common Issues Detected

Data Transformation Violations

  • Missing date column outputs (not all 4)
  • JSON columns not processed
  • Raw columns used in deduplication
  • Phone/email patterns deviated from templates

Configuration Violations

  • Invalid YAML syntax
  • Missing required fields
  • Incorrect dependency group structure
  • Table name mismatches

DIG File Violations

  • Old repetitive block structure
  • Missing error handling
  • Incorrect template variables
  • No inc_log creation

Engine-Specific Violations

  • Presto: Wrong date functions, missing TRY_CAST
  • Hive: REGEXP_EXTRACT patterns wrong, timestamp not escaped

Incremental Processing Violations

  • Wrong WHERE clause pattern
  • Incorrect database references
  • Missing COALESCE in inc_log lookup

Remediation Guidance

For Each Failed Gate:

  1. Reference CLAUDE.md - Re-read relevant section
  2. Check Examples - Review sample code in documentation
  3. Fix Violations - Apply exact patterns from templates
  4. Re-validate - Run validation again until all gates pass

Quick Fixes:

Date Columns Missing Outputs:

-- Add all 4 outputs for each date column
column AS column,
FORMAT_DATETIME(...) AS column_std,
TD_TIME_PARSE(...) AS column_unixtime,
SUBSTR(...) AS column_date

JSON Columns Not Processed:

-- Sample data, extract keys, add extractions
NULLIF(UPPER(json_extract_scalar(json_col, '$.key')), '') AS json_col_key
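
For nested objects (up to 2 levels) and arrays, a sketch of the assumed extraction shape in Presto. The key names and the ARRAY(VARCHAR) target type are assumptions based on the sampled data; adjust to the template in CLAUDE.md.

-- Assumed nested/array shape: dotted JSONPath for level-2 keys, TRY_CAST for arrays
NULLIF(UPPER(json_extract_scalar(attributes, '$.address.city')), '') AS attributes_address_city,
TRY_CAST(json_extract(attributes, '$.tags') AS ARRAY(VARCHAR)) AS attributes_tags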

DIG File Old Structure:

  • Delete old file
  • Use loop-based template from documentation
  • Update src_params.yml instead

Next Steps After Validation

If Validation Passes

Transformation is production-ready:

  • Deploy with confidence
  • Test workflow execution
  • Monitor inc_log for health

If Validation Fails

Fix reported issues:

  1. Address critical issues first
  2. Apply exact patterns from CLAUDE.md
  3. Re-validate until all gates pass
  4. DO NOT deploy failing transformations

Production Quality Assurance

This validation ensures:

  • Transformations work correctly
  • Data quality maintained
  • No data loss or corruption
  • Consistent patterns across tables
  • Maintainable code
  • Compliance with standards

What would you like to validate?

Options:

  1. Validate specific table: Provide table name and engine
  2. Validate all tables: Say "validate all"
  3. Validate configuration: Say "validate config"