---
name: transform-validation
description: Validate staging transformation files against CLAUDE.md compliance and quality gates
---

# Validate Staging Transformation

## ⚠️ CRITICAL: This validates against strict production quality gates

I'll validate your staging transformation files for compliance with CLAUDE.md specifications and production standards.

---

## What I'll Validate

### Quality Gates (ALL MUST PASS)

#### 1. File Structure Compliance
- ✅ Correct directory structure (staging/ or staging_hive/)
- ✅ Proper file naming conventions
- ✅ All required files present
- ✅ No missing dependencies

#### 2. SQL File Validation
- ✅ **Incremental SQL** (`staging/queries/{source_db}_{table}.sql`)
  - Correct FROM clause with `{source_database}.{source_table}`
  - Correct WHERE clause with incremental logic
  - Proper database references
- ✅ **Initial Load SQL** (`staging/init_queries/{source_db}_{table}_init.sql`)
  - Full table scan (no WHERE clause)
  - Same transformations as incremental
- ✅ **Upsert SQL** (`staging/queries/{source_db}_{table}_upsert.sql`)
  - Only present if deduplication exists
  - Correct DELETE and INSERT logic
  - Work table pattern used

#### 3. Data Transformation Standards
- ✅ **Column Processing**:
  - All columns transformed (except 'time')
  - Column limit compliance (max 200)
  - Proper type casting
- ✅ **Date Columns** (CRITICAL):
  - ALL date columns have 4 outputs (original, _std, _unixtime, _date)
  - `time` column NOT transformed
  - Correct FORMAT_DATETIME patterns (Presto) or date_format (Hive)
- ✅ **JSON Columns** (CRITICAL):
  - ALL JSON columns detected and processed
  - Top-level key extraction completed
  - Nested objects handled (up to 2 levels)
  - Array handling with TRY_CAST
  - All extractions wrapped with NULLIF(UPPER(...), '')
- ✅ **Email Columns**:
  - Original + _std + _hash + _valid outputs
  - Correct validation regex
  - SHA256 hashing applied
- ✅ **Phone Columns**:
  - Exact pattern compliance
  - phone_std, phone_hash, phone_valid outputs
  - No deviations from templates
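To make the date-column rule concrete, here is a minimal sketch of the four expected outputs for a date column named `created_at` on the Presto engine. The function composition and format string are illustrative assumptions — the canonical pattern, including any TRY_CAST wrapping, is the one defined in CLAUDE.md.

```sql
-- Illustrative only: the four outputs required for a date column (Presto engine).
-- Exact format strings and casting rules come from CLAUDE.md.
created_at AS created_at,                                          -- original value, untouched
FORMAT_DATETIME(FROM_UNIXTIME(TD_TIME_PARSE(created_at)),
                'yyyy-MM-dd HH:mm:ss') AS created_at_std,          -- standardized VARCHAR string
TD_TIME_PARSE(created_at) AS created_at_unixtime,                  -- BIGINT epoch seconds
SUBSTR(FORMAT_DATETIME(FROM_UNIXTIME(TD_TIME_PARSE(created_at)),
                       'yyyy-MM-dd HH:mm:ss'), 1, 10) AS created_at_date  -- yyyy-MM-dd
```

Note that every output stays VARCHAR or BIGINT, in line with the Treasure Data compatibility gate below.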
#### 4. Data Processing Order
- ✅ **Clean → Join → Dedupe sequence followed**
- ✅ Joins use cleaned columns (not raw)
- ✅ Deduplication uses cleaned columns (not raw)
- ✅ No raw column names in PARTITION BY
- ✅ Proper CTE structure

#### 5. Configuration File Validation
- ✅ **src_params.yml structure**:
  - Valid YAML syntax
  - All required fields present
  - Correct dependency group structure
  - Proper table configuration
- ✅ **Table Configuration**:
  - `name`, `source_db`, `staging_table` present
  - `has_dedup` boolean correct
  - `partition_columns` matches dedup logic
  - `mode` set correctly (inc/full)

#### 6. DIG File Validation
- ✅ **Loop-based architecture** used
- ✅ No repetitive table blocks
- ✅ Correct template structure
- ✅ Proper error handling
- ✅ Inc_log table creation
- ✅ Conditional processing logic

#### 7. Treasure Data Compatibility
- ✅ **Forbidden Types**:
  - No TIMESTAMP types (must be VARCHAR/BIGINT)
  - No BOOLEAN types (must be VARCHAR)
  - No DOUBLE for timestamps
- ✅ **Function Compatibility**:
  - Presto: FORMAT_DATETIME, TD_TIME_PARSE
  - Hive: date_format, unix_timestamp, from_unixtime
- ✅ **Type Casting**:
  - TRY_CAST used for safety
  - All timestamps cast to VARCHAR or BIGINT

#### 8. Incremental Processing
- ✅ **State Management**:
  - Correct inc_log table usage
  - COALESCE(MAX(inc_value), 0) pattern
  - Proper WHERE clause filtering
- ✅ **Database References**:
  - Source table: `{source_database}.{source_table}`
  - Inc_log: `${lkup_db}.inc_log`
  - Correct project_name filter
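As a point of reference, the incremental filter this gate looks for has roughly the following shape. The incremental column `updated_at`, the project name literal, and the `table_name` filter column are assumptions for illustration; the `{source_database}`, `{source_table}`, and `${lkup_db}` placeholders follow the project templates, and the authoritative predicate is defined in CLAUDE.md.

```sql
-- Illustrative incremental WHERE clause (Presto). Column and literal names are assumptions.
SELECT
  *  -- transformed column list goes here
FROM {source_database}.{source_table}
WHERE TD_TIME_PARSE(updated_at) > (
  SELECT COALESCE(MAX(inc_value), 0)
  FROM ${lkup_db}.inc_log
  WHERE project_name = 'my_project'       -- hypothetical project filter
    AND table_name = '{source_table}'     -- assumed per-table filter column
)
```

The `COALESCE(MAX(inc_value), 0)` fallback makes the first incremental run behave like a full scan when inc_log has no prior entry for the table.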
#### 9. Engine-Specific Validation

**For Presto/Trino:**
- ✅ Files in `staging/` directory
- ✅ Three SQL files (init, incremental, upsert if dedup)
- ✅ Presto-compatible functions used
- ✅ JSON: json_extract_scalar with NULLIF(UPPER(...), '')

**For Hive:**
- ✅ Files in `staging_hive/` directory
- ✅ Single SQL file with INSERT OVERWRITE
- ✅ Template files present (get_max_time.sql, get_stg_rows_for_delete.sql)
- ✅ Hive-compatible functions used
- ✅ JSON: REGEXP_EXTRACT with NULLIF(UPPER(...), '')
- ✅ Timestamp keyword escaped with backticks
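Side by side, the two JSON extraction shapes these gates expect look roughly like this. The column name `attributes` is taken from the fail report example further down; the key `plan` and the Hive regex are illustrative assumptions — the canonical patterns live in CLAUDE.md.

```sql
-- Presto/Trino: scalar extraction wrapped per the quality gate (key name is illustrative)
NULLIF(UPPER(json_extract_scalar(attributes, '$.plan')), '') AS attributes_plan

-- Hive: regex-based extraction with the same wrapper (pattern is illustrative)
NULLIF(UPPER(regexp_extract(attributes, '"plan"\\s*:\\s*"([^"]*)"', 1)), '') AS attributes_plan
```

Functionally, `UPPER` normalizes casing and `NULLIF(..., '')` turns empty extractions into NULLs, which is why the gates flag any extraction missing the wrapper.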
---

## Validation Options

### Option 1: Validate Specific Table
Provide:
- **Table Name**: e.g., `client_src.customer_profiles`
- **Engine**: `presto` or `hive`

I will:
1. Find all related files for the table
2. Check against ALL quality gates
3. Report detailed findings with line numbers
4. Provide remediation guidance

### Option 2: Validate All Tables
Say: **"validate all"** or **"validate all Presto"** or **"validate all Hive"**

I will:
1. Find all transformation files
2. Validate each table against quality gates
3. Report comprehensive project status
4. Identify any inconsistencies

### Option 3: Validate Configuration Only
Say: **"validate config"**

I will:
1. Check `staging/config/src_params.yml` (Presto)
2. Check `staging_hive/config/src_params.yml` (Hive)
3. Validate YAML syntax
4. Verify table configurations
5. Check dependency groups

---

## Validation Process

### Step 1: File Discovery
I will locate all relevant files:
- SQL files (init, queries, upsert)
- Configuration files (src_params.yml)
- Workflow files (staging_transformation.dig or staging_hive.dig)

### Step 2: Load and Parse
I will read all files:
- Parse SQL syntax
- Parse YAML structure
- Extract transformation logic
- Identify patterns

### Step 3: Quality Gate Checks
I will verify each gate systematically:
- Execute all validation rules
- Identify violations
- Collect evidence (line numbers, code samples)
- Categorize severity

### Step 4: Generate Report

#### Pass Report (if all gates pass)
```
✅ VALIDATION PASSED

Table: client_src.customer_profiles
Engine: Presto
Files validated: 3

Quality Gates: 9/9 PASSED
✅ File Structure Compliance
✅ SQL File Validation
✅ Data Transformation Standards
✅ Data Processing Order
✅ Configuration File Validation
✅ DIG File Validation
✅ Treasure Data Compatibility
✅ Incremental Processing
✅ Engine-Specific Validation

Transformation Details:
- Columns: 45 (within 200 limit)
- Date columns: 3 (12 outputs - all have 4 outputs ✅)
- JSON columns: 2 (all processed ✅)
- Email columns: 1 (validated with hashing ✅)
- Phone columns: 1 (exact pattern compliance ✅)
- Deduplication: Yes (customer_id)
- Joins: None

No issues found. Transformation is production-ready.
```

#### Fail Report (if any gate fails)
```
❌ VALIDATION FAILED

Table: client_src.customer_profiles
Engine: Presto
Files validated: 3

Quality Gates: 6/9 PASSED
✅ File Structure Compliance
✅ SQL File Validation
❌ Data Transformation Standards - FAILED
   - Date column 'created_at' missing outputs
     Line 45: Only has _std output, missing _unixtime and _date
   - JSON column 'attributes' not processed
     Line 67: Raw column passed through without extraction
✅ Data Processing Order
✅ Configuration File Validation
❌ DIG File Validation - FAILED
   - Using old repetitive block structure
     File: staging/staging_transformation.dig
     Issue: Should use loop-based architecture
✅ Treasure Data Compatibility
✅ Incremental Processing
❌ Engine-Specific Validation - FAILED
   - JSON extraction missing NULLIF(UPPER(...), '') wrapper
     Line 72: json_extract_scalar used without wrapper
     Should be: NULLIF(UPPER(json_extract_scalar(...)), '')

CRITICAL ISSUES (Must Fix):
1. Add missing date outputs for 'created_at' column
2. Process 'attributes' JSON column with key extraction
3. Migrate DIG file to loop-based architecture
4. Wrap all JSON extractions with NULLIF(UPPER(...), '')

RECOMMENDATIONS:
- Re-read CLAUDE.md date processing requirements
- Check JSON column detection logic
- Use DIG template from documentation
- Review JSON extraction patterns

Re-validate after fixing issues.
```

---

## Common Issues Detected

### Data Transformation Violations
- Missing date column outputs (not all 4)
- JSON columns not processed
- Raw columns used in deduplication
- Phone/email patterns that deviate from templates

### Configuration Violations
- Invalid YAML syntax
- Missing required fields
- Incorrect dependency group structure
- Table name mismatches

### DIG File Violations
- Old repetitive block structure
- Missing error handling
- Incorrect template variables
- No inc_log creation

### Engine-Specific Violations
- **Presto**: Wrong date functions, missing TRY_CAST
- **Hive**: Wrong REGEXP_EXTRACT patterns, timestamp keyword not escaped

### Incremental Processing Violations
- Wrong WHERE clause pattern
- Incorrect database references
- Missing COALESCE in inc_log lookup

---

## Remediation Guidance

### For Each Failed Gate:
1. **Reference CLAUDE.md** - Re-read the relevant section
2. **Check Examples** - Review sample code in the documentation
3. **Fix Violations** - Apply exact patterns from the templates
4. **Re-validate** - Run validation again until it passes

### Quick Fixes:

**Date Columns Missing Outputs:**
```sql
-- Add all 4 outputs for each date column
column AS column,
FORMAT_DATETIME(...) AS column_std,
TD_TIME_PARSE(...) AS column_unixtime,
SUBSTR(...) AS column_date
```

**JSON Columns Not Processed:**
```sql
-- Sample data, extract keys, add extractions
NULLIF(UPPER(json_extract_scalar(json_col, '$.key')), '') AS json_col_key
```

**DIG File Old Structure:**
- Delete the old file
- Use the loop-based template from the documentation
- Update src_params.yml instead

---

## Next Steps After Validation

### If Validation Passes ✅
Transformation is production-ready:
- Deploy with confidence
- Test workflow execution
- Monitor inc_log for health

### If Validation Fails ❌
Fix reported issues:
1. Address critical issues first
2. Apply exact patterns from CLAUDE.md
3. Re-validate until all gates pass
4. **DO NOT deploy failing transformations**

---

## Production Quality Assurance

This validation ensures:
- ✅ Transformations work correctly
- ✅ Data quality maintained
- ✅ No data loss or corruption
- ✅ Consistent patterns across tables
- ✅ Maintainable code
- ✅ Compliance with standards

---

**What would you like to validate?**

Options:
1. **Validate specific table**: Provide table name and engine
2. **Validate all tables**: Say "validate all"
3. **Validate configuration**: Say "validate config"