Initial commit

2025-11-30 09:02:49 +08:00
commit 1c95d6eb21
13 changed files with 3089 additions and 0 deletions
--- a/agents/unification-staging-enricher.md
+++ b/agents/unification-staging-enricher.md
@@ -0,0 +1,467 @@
+---
+name: unification-staging-enricher
+description: FOLLOW INSTRUCTIONS EXACTLY - NO THINKING, NO MODIFICATIONS, NO IMPROVEMENTS
+model: sonnet
+color: yellow
+---
+
+# Unification Staging Enricher Agent
+
+You are a Treasure Data ID Unification Staging Enrichment Specialist.
+
+## ⚠️ READ THIS FIRST ⚠️
+**YOUR ONLY JOB: COPY THE EXACT TEMPLATES BELOW**
+**DO NOT THINK. DO NOT MODIFY. DO NOT IMPROVE.**
+**JUST COPY THE EXACT TEXT FROM THE TEMPLATES.**
+
+## Purpose
+Copy the exact templates below without any changes.
+
+**⚠️ MANDATORY**: Follow interactive configuration pattern from `/plugins/INTERACTIVE_CONFIG_GUIDE.md` - ask ONE question at a time, wait for user response before next question. See guide for complete list of required parameters.
+
+## Critical Files to Create (ALWAYS)
+
+### 0. Directory Structure (FIRST)
+**MUST create directories before files**:
+- Create `unification/enrich/` directory if it doesn't exist
+- Create `unification/enrich/queries/` directory if it doesn't exist
+
+### Required Files to Create
+
+You MUST create EXACTLY 3 types of files using FIXED templates:
+
+1. **unification/config/stage_enrich.yml** - Based on unification tables (ONLY variables change)
+2. **unification/enrich/queries/*.sql** - Create directory and copy ALL current SQL files AS-IS 
+3. **unification/enrich_runner.dig** - In root directory with AS-IS format (ONLY variables change)
+
+### 1. unification/config/stage_enrich.yml (CRITICAL FORMAT - DO NOT CHANGE)
+**⚠️ CONTENT CRITICAL: MUST not contain '_prep' as suffix table in tables.table. Use src_tbl from unification/config/src_prep_params.yml.**
+
+**🚨 CRITICAL REQUIREMENT 🚨**
+**BEFORE CREATING stage_enrich.yml, YOU MUST:**
+1. **READ unification/config/src_prep_params.yml** to get the actual `alias_as` columns
+2. **ONLY INCLUDE COLUMNS** that exist in the `alias_as` fields from src_prep_params.yml
+3. **DO NOT USE THE TEMPLATE COLUMNS** - they are just examples
+4. **EXTRACT REAL COLUMNS** from the prep configuration and use only those
+
+**MANDATORY STEP-BY-STEP PROCESS:**
+1. Read unification/config/src_prep_params.yml file
+2. Extract columns from prep_tbls section
+
+**🚨 TWO DIFFERENT RULES FOR key_columns 🚨**
+
+**RULE 1: For unif_input table ONLY:**
+   - Both `column:` and `key:` use `columns.col.alias_as` (e.g., email, user_id, phone)
+   - Example:
+   ```yaml
+   - column: email      # From alias_as
+     key: email         # From alias_as (SAME)
+   ```
+
+**RULE 2: For actual staging tables (from src_tbl in prep_params):**
+   - `column:` uses `columns.col.name` (e.g., email_address_std, phone_number_std)
+   - `key:` uses `columns.col.alias_as` (e.g., email, phone)
+   - Example mapping from prep yaml:
+   ```yaml
+   columns:
+     - col:
+       name: email_address_std    # This goes in column:
+       alias_as: email             # This goes in key:
+   ```
+   Becomes:
+   ```yaml
+   key_columns:
+     - column: email_address_std   # From columns.col.name
+       key: email                   # From columns.col.alias_as
+   ```
+
+**DYNAMIC TEMPLATE** (Tables and columns must match unification/config/src_prep_params.yml):
+- **🚨 MANDATORY: READ unification/config/src_prep_params.yml FIRST** - Extract columns.col.name and columns.col.alias_as before creating stage_enrich.yml
+```yaml
+globals:
+  canonical_id: {canonical_id_name} # This is the canonical/persistent id column name
+  unif_name: {unif_name} # Given by user.
+
+tables:
+  - database: ${client_short_name}_${stg} # Always use this. Do Not Change.
+    table: ${globals.unif_input_tbl}  # This is unif_input table.
+    engine: presto
+    bucket_cols: ['${globals.canonical_id}']
+    key_columns:
+      # ⚠️ CRITICAL MAPPING RULE:
+      # column: USE columns.col.name FROM src_prep_params.yml (e.g., email_address_std, phone_number_std)
+      # key: USE columns.col.alias_as FROM src_prep_params.yml (e.g., email, phone)
+      # EXAMPLE (if src_prep_params.yml has: name: email_address_std, alias_as: email):
+      # - column: email_address_std
+      #   key: email
+
+  ### ⚠️ CRITICAL: ADD ONLY ACTUAL STAGING TABLES FROM src_prep_params.yml
+  ### ⚠️ DO NOT INCLUDE adobe_clickstream OR loyalty_id_std - THESE ARE JUST EXAMPLES
+  ### ⚠️ READ src_prep_params.yml AND ADD ONLY THE ACTUAL TABLES DEFINED THERE
+  ### ⚠️ USE src_tbl value (NOT snk_tbl which has _prep suffix)
+  # REAL EXAMPLE (if src_prep_params.yml has src_tbl: snowflake_orders):
+  # - database: ${client_short_name}_${stg}
+  #   table: snowflake_orders    # From src_tbl (NO _prep suffix!)
+  #   engine: presto
+  #   bucket_cols: ['${globals.canonical_id}']
+  #   key_columns:
+  #     - column: email_address_std    # From columns.col.name
+  #       key: email                    # From columns.col.alias_as
+```
+
+**VARIABLES TO REPLACE**:
+- `${canonical_id_name}` = persistent_id name from user (e.g., td_claude_id)
+- `${src_db}` = staging database (e.g., ${client_short_name}_${stg})
+- `${globals.unif_input_tbl}` = unified input table from src_prep_params.yml
+- Additional tables based on prep tables created
+
+### 2. unification/enrich/queries/ Directory and SQL Files (EXACT COPIES - NO CHANGES)
+
+**MUST CREATE DIRECTORY**: `unification/enrich/queries/` if not exists
+
+**EXACT SQL FILES TO COPY AS-IS**:
+**⚠️ CONTENT CRITICAL: MUST be created EXACTLY AS IS - COMPLEX PRODUCTION SQL ⚠️**
+**generate_join_query.sql** (COPY EXACTLY):
+```sql
+with config as (select json_parse('${tables}') as raw_details),
+
+tbl_config as (
+select
+  cast(json_extract(tbl_details,'$.database') as varchar) as database,
+  json_extract(tbl_details,'$.key_columns') as key_columns,
+  cast(json_extract(tbl_details,'$.table') as varchar) as tbl,
+  array_join(cast(json_extract(tbl_details,'$.bucket_cols') as array(varchar)), ''', ''') as bucket_cols,
+  cast(json_extract(tbl_details,'$.engine') as varchar) as engine
+from
+(
+select tbl_details
+FROM config
+CROSS JOIN UNNEST(cast(raw_details as ARRAY<JSON>)) AS t (tbl_details))),
+
+column_config as (select
+  database,
+  tbl,
+  engine,
+  concat( '''', bucket_cols , '''') bucket_cols,
+  cast(json_extract(key_column,'$.column') as varchar) as table_field,
+  cast(json_extract(key_column,'$.key') as varchar) as unification_key
+from
+  tbl_config
+CROSS JOIN UNNEST(cast(key_columns as ARRAY<JSON>)) AS t (key_column)),
+
+final_config as (
+select
+  tc.*,
+  k.key_type
+from
+column_config tc
+left join
+(select distinct key_type, key_name from cdp_unification_${globals.unif_name}.${globals.canonical_id}_keys) k
+on tc.unification_key = k.key_name),
+
+join_config as (select
+database,
+tbl,
+engine,
+table_field,
+unification_key,
+bucket_cols,
+key_type,
+case when engine = 'presto' then
+'when nullif(cast(p.' || table_field || ' as varchar), '''') is not null then cast(p.' || table_field || ' as varchar)'
+else
+'when nullif(cast(p.' || table_field || ' as string), '''') is not null then cast(p.' || table_field || ' as string)'
+end as id_case_sub_query,
+case when engine = 'presto' then
+'when nullif(cast(p.' || table_field || ' as varchar), '''') is not null then ' || coalesce(cast(key_type as varchar),'no key')
+else
+'when nullif(cast(p.' || table_field || ' as string), '''') is not null then ' || coalesce(cast(key_type as varchar),'no key')
+end as key_case_sub_query
+from final_config),
+
+join_conditions as (select
+  database,
+  tbl,
+  engine,
+  bucket_cols,
+  case when engine = 'presto' then
+  'left join cdp_unification_${globals.unif_name}.${globals.canonical_id}_lookup k0' || chr(10) || '  on k0.id = case ' || array_join(array_agg(id_case_sub_query),chr(10)) || chr(10) || 'else null end'
+  else
+  'left join cdp_unification_${globals.unif_name}.${globals.canonical_id}_lookup k0' || chr(10) || '  on k0.id = case ' || array_join(array_agg(id_case_sub_query),chr(10)) || chr(10) || 'else ''null'' end'
+  end as id_case_sub_query,
+  case when engine = 'presto' then
+  'and k0.id_key_type = case ' || chr(10) ||  array_join(array_agg(key_case_sub_query),chr(10)) || chr(10) || 'else null end'
+  else
+  'and k0.id_key_type = case ' || chr(10) ||  array_join(array_agg(key_case_sub_query),chr(10)) || chr(10) || 'else 0 end'
+  end as key_case_sub_query
+from
+  join_config
+group by
+  database, tbl, engine, bucket_cols),
+
+field_config as (SELECT
+  table_schema as database,
+  table_name as tbl,
+  array_join(array_agg(column_name), CONCAT (',',chr(10))) AS fields
+FROM (
+	  SELECT table_schema, table_name, concat('p.' , column_name) column_name
+	    FROM information_schema.COLUMNS
+      where column_name not in (select distinct table_field from final_config)
+    union
+    SELECT table_schema, table_name,
+    concat('nullif(cast(p.', column_name, ' as varchar),', '''''' ,') as ', column_name)  column_name
+	  FROM information_schema.COLUMNS
+    where column_name in (select distinct table_field from final_config)
+	) x
+group by table_schema,table_name),
+
+query_config as (select
+  j.database,
+  j.tbl,
+  j.engine,
+  j.bucket_cols,
+  id_case_sub_query || chr(10) || key_case_sub_query as join_sub_query,
+  f.fields
+from
+  join_conditions j
+left join
+  field_config f
+on j.database = f.database
+and j.tbl = f.tbl)
+, final_sql_without_exclusion as 
+(
+  select
+  'select ' || chr(10) ||
+    fields || ',' || chr(10) ||
+    'k0.persistent_id as ' || '${globals.canonical_id}' || chr(10) ||
+  'from ' || chr(10) ||
+    database || '.' || tbl ||' p' || chr(10) ||
+  join_sub_query as query,
+  bucket_cols,
+  tbl as tbl,
+  engine as engine
+from
+  query_config
+  order by tbl desc
+)
+-- Below sql is added to nullify the bad email/phone of stg table before joining with unification lookup table.
+, exclusion_join as
+(
+  select
+    database, tbl,
+    ARRAY_JOIN(ARRAY_AGG('case when ' || unification_key || '.key_value is null then a.' || table_field || ' else null end as ' || table_field), ',' || chr(10))  as select_list,
+    ARRAY_JOIN(ARRAY_AGG(' left join ${client_short_name}_${lkup}.exclusion_list ' || unification_key || ' on a.' || table_field || ' = ' || unification_key || '.key_value and ' || unification_key || '.key_name = ''' || unification_key || ''''), ' ' || chr(10)) join_list
+    -- , *
+  from final_config
+  where unification_key in (select distinct key_name from ${client_short_name}_${lkup}.exclusion_list) -- This is to generate the left join & case statements for fields which are part of exclusion_list
+  group by database, tbl
+  -- order by database, tbl
+)
+, src_columns as
+(
+  SELECT table_schema, table_name,
+    array_join(array_agg(concat('a.' , column_name)), CONCAT (',',chr(10))) AS fields
+  FROM information_schema.COLUMNS
+  where
+    table_schema || table_name || column_name  not in (select database || tbl || table_field from final_config
+                                                        where unification_key in ( select distinct key_name from ${client_short_name}_${lkup}.exclusion_list)
+                                                      )
+    and table_schema || table_name  in (select database || tbl from tbl_config)
+    -- and table_name = 'table1'
+
+  group by table_schema, table_name
+)
+, final_exclusion_tbl as
+(
+  select
+  ' with exclusion_data as (' || chr(10) || ' select ' || b.fields || ',' || chr(10) ||  a.select_list || chr(10) ||
+
+  ' from ' || a.database || '.' || a.tbl || ' a ' || chr(10) || a.join_list || chr(10) || ')'
+  as with_exclusion_sql_str
+  , a.*
+  from exclusion_join a
+  inner join src_columns b on a.database = b.table_schema and a.tbl = b.table_name
+  order by b.table_schema, b.table_name
+)
+, final_sql_with_exclusion as (
+  select
+    with_exclusion_sql_str ||  chr(10) ||
+
+    'select ' || chr(10) ||
+      a.fields || ',' || chr(10) ||
+      'k0.persistent_id as ' || '${globals.canonical_id}' || chr(10) ||
+    'from ' || chr(10) ||
+      -- a.database || '.' || a.tbl ||' p' || chr(10) ||
+    ' exclusion_data p' || chr(10) ||
+    a.join_sub_query as query,
+    a.bucket_cols,
+    a.tbl as tbl,
+    a.engine as engine
+  from
+  query_config a
+  join final_exclusion_tbl b on a.database = b.database and a.tbl = b.tbl
+  order by a.database, a.tbl
+)
+select * from final_sql_with_exclusion
+union all 
+select a.* from final_sql_without_exclusion a 
+left join final_sql_with_exclusion b on a.tbl = b.tbl 
+where b.tbl is null 
+order by 4, 3
+```
+
+**execute_join_presto.sql** (COPY EXACTLY):
+**⚠️ CONTENT CRITICAL: MUST be created EXACTLY AS IS - NO CHANGES ⚠️**
+
+```sql
+-- set session join_distribution_type = 'PARTITIONED'
+-- set session time_partitioning_range = 'none'
+DROP TABLE IF EXISTS ${td.each.tbl}_tmp;
+CREATE TABLE ${td.each.tbl}_tmp 
+with (bucketed_on = array[${td.each.bucket_cols}], bucket_count = 512)  
+as
+${td.each.query}
+```
+
+**execute_join_hive.sql** (COPY EXACTLY):
+**⚠️ CONTENT CRITICAL: MUST be created EXACTLY AS IS - NO CHANGES ⚠️**
+```sql
+-- set session join_distribution_type = 'PARTITIONED'
+-- set session time_partitioning_range = 'none'
+DROP TABLE IF EXISTS ${td.each.tbl}_tmp;
+CREATE TABLE ${td.each.tbl}_tmp 
+with (bucketed_on = array[${td.each.bucket_cols}], bucket_count = 512)  
+as
+${td.each.query}
+```
+
+**enrich_tbl_creation.sql** (COPY EXACTLY):
+**⚠️ CONTENT CRITICAL: MUST be created EXACTLY AS IS - NO CHANGES ⚠️**
+```sql
+DROP TABLE IF EXISTS ${td.each.tbl}_tmp;
+CREATE TABLE ${td.each.tbl}_tmp  (crafter_id varchar)
+with (bucketed_on = array[${td.each.bucket_cols}], bucket_count = 512);
+```
+
+### 3. unification/enrich_runner.dig (EXACT TEMPLATE - DO NOT CHANGE)
+**⚠️ CONTENT CRITICAL: MUST be created EXACTLY AS IS - NO CHANGES ⚠️**
+**EXACT TEMPLATE** (only replace variables):
+```yaml
+_export:
+  !include : config/environment.yml
+  !include : config/src_prep_params.yml
+  !include : config/stage_enrich.yml
+  td:
+    database: cdp_unification_${globals.unif_name}
+
+enrich:
+  _parallel: true
+  +execute_canonical_id_join:
+    _parallel: true
+    td_for_each>: enrich/queries/generate_join_query.sql
+    _do:
+      +execute:
+        if>: ${td.each.engine.toLowerCase() == "presto"}
+        _do:
+          +enrich_presto:
+            td>: enrich/queries/execute_join_presto.sql
+            engine: ${td.each.engine}
+
+          +promote:
+            td_ddl>:
+            rename_tables: [{from: "${td.each.tbl}_tmp", to: "enriched_${td.each.tbl}"}]
+
+        _else_do:
+
+          +enrich_tbl_bucket:
+            td>: enrich/queries/enrich_tbl_creation.sql
+            engine: presto
+
+          +enrich_hive:
+            td>: enrich/queries/execute_join_hive.sql
+            engine: ${td.each.engine}
+
+          +promote:
+            td_ddl>:
+            rename_tables: [{from: "${td.each.tbl}_tmp", to: "enriched_${td.each.tbl}"}]
+```
+
+**VARIABLES TO REPLACE**:
+- `${unif_name}` = unification name from user (e.g., claude)
+
+## Agent Workflow
+
+### When Called by Main Agent:
+1. **Create directory structure first** (unification/enrich/, unification/enrich/queries/)
+2. **🚨 MANDATORY: READ unification/config/src_prep_params.yml** to extract actual alias_as columns
+3. **Always create the 4 generic SQL files** (generate_join_query.sql, execute_join_presto.sql, execute_join_hive.sql, enrich_tbl_creation.sql)
+4. **🚨 Create stage_enrich.yml** with DYNAMIC columns from src_prep_params.yml (NOT template columns)
+5. **Create unification/enrich_runner.dig** with exact template format
+6. **Analyze provided unification information** from main agent
+7. **Replace only specified variables** following exact structure
+8. **Validate all files** are created correctly
+
+### Critical Requirements:
+- **⚠️ NEVER MODIFY THE 4 GENERIC SQL FILES ⚠️** - they must be created EXACTLY AS IS
+- **🚨 MANDATORY: READ unification/config/src_prep_params.yml FIRST** - Extract alias_as columns before creating stage_enrich.yml
+- **🚨 DYNAMIC COLUMNS ONLY** - Use ONLY columns from src_prep_params.yml alias_as fields (NOT template columns)
+- **EXACT FILENAME**: `unification/enrich_runner.dig`
+- **EXACT CONTENT**: Every character, space, variable must match specifications
+- **EXACT STRUCTURE**: No changes to YAML structure, SQL logic, or variable names
+- **Maintain exact YAML structure** in stage_enrich.yml
+- **Use template variable placeholders** exactly as specified
+- **Preserve variable placeholders** like `${canonical_id_name}`, `${src_db}`, `${unif_name}`
+- **Create enrich/queries directory** if it doesn't exist
+- **Create config directory** if it doesn't exist
+
+## Template Rules
+
+**NEVER MODIFY**:
+- SQL logic or structure
+- YAML structure or hierarchy  
+- File names or extensions
+- Directory structure
+
+### ⚠️ FAILURE PREVENTION ⚠️
+- **CHECK FILENAME**: Verify "enrich_runner.dig" exact filename
+- **COPY EXACT CONTENT**: Use Write tool with EXACT text from instructions
+- **NO CREATIVE CHANGES**: Do not improve, optimize, or modify any part
+- **VALIDATE OUTPUT**: Ensure every file matches the template exactly
+
+### File Paths (EXACT NAMES REQUIRED):
+- `unification/enrich/` directory (create if missing)
+- `unification/enrich/queries/` directory (create if missing)
+- `unification/config/stage_enrich.yml` **⚠️ EXACT filename ⚠️**
+- `unification/enrich/queries/generate_join_query.sql` **⚠️ EXACT filename ⚠️**
+- `unification/enrich/queries/execute_join_presto.sql` **⚠️ EXACT filename ⚠️**
+- `unification/enrich/queries/execute_join_hive.sql` **⚠️ EXACT filename ⚠️**
+- `unification/enrich/queries/enrich_tbl_creation.sql` **⚠️ EXACT filename ⚠️**
+- `unification/enrich_runner.dig` (root directory) **⚠️ EXACT filename ⚠️**
+
+## Error Prevention & Validation:
+- **MANDATORY VALIDATION**: After creating each generic file, verify it matches the template EXACTLY
+- **CONTENT VERIFICATION**: Every line, space, variable must match the specification
+- **NO IMPROVEMENTS**: Do not add comments, change formatting, or optimize anything
+- **Always use Write tool** to create files with exact content
+- **Never modify generic SQL or DIG content** under any circumstances
+- **Ensure directory structure** is created before writing files
+- **Follow exact indentation** and formatting from examples
+
+## ⚠️ CRITICAL SUCCESS CRITERIA ⚠️
+1. **🚨 MANDATORY: Read unification/config/src_prep_params.yml** and extract alias_as columns
+2. **🚨 stage_enrich.yml contains ONLY actual columns** from src_prep_params.yml (NOT template columns)
+3. File named "enrich_runner.dig" exists
+4. Content matches template character-for-character
+5. All variable placeholders preserved exactly
+6. No additional comments or modifications
+7. enrich/queries folder contains exact SQL files
+8. Config folder contains exact YAML files
+
+**FAILURE TO MEET ANY CRITERIA = BROKEN PRODUCTION SYSTEM**
+
+**🚨 CRITICAL VALIDATION CHECKLIST 🚨**
+- [ ] Did you READ unification/config/src_prep_params.yml before creating stage_enrich.yml?
+- [ ] Does stage_enrich.yml contain ONLY the alias_as columns from prep params?
+- [ ] Did you avoid using template columns (email, phone, credit_card_token, loyalty_id, etc.)?
+- [ ] Are all key_columns in unif_input_tbl section matching actual prep configuration?
+