---
name: cdp-implement
description: Complete CDP implementation with workflow generation, deployment, and execution
---

# CDP Complete Implementation Pipeline

## Overview

I'll orchestrate your complete CDP implementation with **automated deployment and execution** across all phases:

```
Phase 1: Ingestion   → Generate → Deploy → Execute → Monitor → Validate ✓
Phase 2: Hist-Union  → Generate → Deploy → Execute → Monitor → Validate ✓
Phase 3: Staging     → Generate → Deploy → Execute → Monitor → Validate ✓
Phase 4: Unification → Generate → Deploy → Execute → Monitor → Validate ✓
```

**Each phase MUST complete successfully before the next one begins.** This ordering satisfies the data dependencies:

- Ingestion creates source tables
- Hist-Union requires source tables
- Staging requires hist-union tables
- Unification requires staging tables

---

## Required Inputs

### Global Configuration (Required First)

**TD API Credentials:**

- **TD_API_KEY**: Your Treasure Data API key (required for deployment & execution)
  - Find at: https://console.treasuredata.com/app/users
  - Format: `12345/abcdef1234567890abcdef1234567890abcdef12`
- **TD_ENDPOINT**: Your TD regional endpoint
  - **US**: `https://api.treasuredata.com` (default)
  - **EU**: `https://api-cdp.eu01.treasuredata.com`
  - **Tokyo**: `https://api-cdp.treasuredata.co.jp`
  - **Asia Pacific**: `https://api-cdp.ap02.treasuredata.com`

**Client Information:**

- **Client Name**: Your client identifier (e.g., `mck`, `acme`, `client_name`)
  - Used for: Database naming, configuration, organization
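Before anything is generated or deployed, these two values are worth smoke-testing. A minimal pre-flight sketch, assuming `TD_API_KEY` and `TD_ENDPOINT` are exported in your shell; `td db:list` is used here only because it is a cheap, read-only command that exercises both the key and the endpoint:

```bash
#!/usr/bin/env bash
# Pre-flight credential check (sketch) - fail fast before generating anything.
set -euo pipefail

: "${TD_API_KEY:?Set TD_API_KEY first}"
: "${TD_ENDPOINT:?Set TD_ENDPOINT first}"

# Any cheap read-only call works; db:list verifies both the key and the endpoint.
if td -k "$TD_API_KEY" -e "$TD_ENDPOINT" db:list >/dev/null; then
  echo "✓ Credentials OK for ${TD_ENDPOINT}"
else
  echo "✗ Authentication failed - check TD_API_KEY and TD_ENDPOINT" >&2
  exit 1
fi
```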
---

### Phase 1: Ingestion Configuration

**Data Source:**

- **Source Name**: System name (e.g., `shopify`, `salesforce`, `klaviyo`)
- **Connector Type**: TD connector (e.g., `rest`, `salesforce`, `bigquery`)
- **Objects/Tables**: Comma-separated list (e.g., `orders,customers,products`)

**Ingestion Settings:**

- **Mode**: Choose one:
  - `incremental` - Ongoing data sync only
  - `historical` - One-time historical backfill only
  - `both` - Separate historical and incremental workflows (recommended)
- **Incremental Field**: Field for updates (e.g., `updated_at`, `modified_date`)
- **Default Start Date**: Initial load date (format: `2023-09-01T00:00:00.000000`)

**Target:**

- **Target Database**: Default: `{client}_src`
- **Project Directory**: Default: `ingestion`

**Authentication:**

- Source system credentials (configured during ingestion setup)

---

### Phase 2: Hist-Union Configuration

**Tables to Combine:**

- List of tables requiring a historical + incremental merge
- Format: `database.table_name` or `table_name` (uses the default database)
- Example: `shopify_orders`, `shopify_customers`

**Settings:**

- **Project Directory**: Default: `hist_union`
- **Source Database**: From Phase 1 output (default: `{client}_src`)
- **Target Database**: Default: `{client}_src` (overwritten with the combined data)

---

### Phase 3: Staging Configuration

**Transformation Settings:**

- **SQL Engine**: Choose one:
  - `presto` - Presto SQL (recommended for most cases)
  - `hive` - Hive SQL (for legacy compatibility)

**Tables to Transform:**

- Tables from the hist-union output
- Data cleaning, standardization, and PII handling are applied

**Settings:**

- **Project Directory**: Default: `staging`
- **Source Database**: From Phase 2 output
- **Target Database**: Default: `{client}_stg`

---

### Phase 4: Unification Configuration

**Unification Settings:**

- **Unification Name**: Project name (e.g., `customer_360`, `unified_profile`)
- **ID Method**: Choose one:
  - `persistent_id` - Stable IDs across updates (recommended)
  - `canonical_id` - Traditional merge approach
- **Update Strategy**: Choose one:
  - `incremental` - Process new/updated records only (recommended)
  - `full` - Reprocess all data each time

**Tables for Unification:**

- Staging tables with user identifiers
- Format: `database.table_name`
- Example: `acme_stg.shopify_orders`, `acme_stg.shopify_customers`

**Settings:**

- **Project Directory**: Default: `unification`
- **Regional Endpoint**: Same as the global TD_ENDPOINT

---

## Execution Process

### Step-by-Step Flow

I'll launch the **cdp-pipeline-orchestrator agent** to execute:

**1. Configuration Collection**

- Gather all inputs upfront
- Validate credentials and access
- Create the execution plan
- Show the complete configuration for approval

**2. Phase Execution Loop**

For each phase (Ingestion → Hist-Union → Staging → Unification):

```
┌─────────── PHASE EXECUTION ───────────┐
│                                       │
│ [1] GENERATE Workflows                │
│     → Invoke plugin slash command     │
│     → Wait for file generation        │
│     → Verify files created            │
│                                       │
│ [2] DEPLOY Workflows                  │
│     → Navigate to project directory   │
│     → Execute: td wf push             │
│     → Parse deployment result         │
│     → Auto-fix errors if possible     │
│     → Retry up to 3 times             │
│                                       │
│ [3] EXECUTE Workflows                 │
│     → Execute: td wf start            │
│     → Capture session ID              │
│     → Log start time                  │
│                                       │
│ [4] MONITOR Execution                 │
│     → Poll status every 30 seconds    │
│     → Show real-time progress         │
│     → Calculate elapsed time          │
│     → Wait for completion             │
│                                       │
│ [5] VALIDATE Output                   │
│     → Query TD for created tables     │
│     → Verify row counts > 0           │
│     → Check schema expectations       │
│                                       │
│ [6] PROCEED to Next Phase             │
│     → Only if validation passes       │
│     → Update progress tracker         │
│                                       │
└───────────────────────────────────────┘
```

**3. Final Report**

- Complete execution summary
- All files generated
- Deployment records
- Data quality metrics
- Next-steps guidance

---

## What Happens in Each Step

### [1] GENERATE Workflows

**Tool**: SlashCommand

**Actions**:

```
Phase 1 → /cdp-ingestion:ingest-new
Phase 2 → /cdp-histunion:histunion-batch
Phase 3 → /cdp-staging:transform-batch
Phase 4 → /cdp-unification:unify-setup
```

**Output**: Complete workflow files (`.dig`, `.yml`, `.sql`)

**Verification**: Check that the files exist using the Glob tool

---

### [2] DEPLOY Workflows

**Tool**: Bash

**Command**:

```bash
cd {project_directory}
td -k {TD_API_KEY} -e {TD_ENDPOINT} wf push {project_name}
```

**Success Indicators**:

- "Uploaded workflows"
- "Project uploaded"
- No error messages

**Error Handling**:

**Syntax Error**:
```
Error: syntax error in shopify_ingest_inc.dig:15
→ Read file to identify issue
→ Fix using Edit tool
→ Retry deployment
```

**Validation Error**:
```
Error: database 'acme_src' not found
→ Check database exists
→ Create if needed
→ Update configuration
→ Retry deployment
```

**Authentication Error**:
```
Error: authentication failed
→ Verify TD_API_KEY
→ Check endpoint URL
→ Ask user to provide correct credentials
→ Retry deployment
```

**Retry Logic**: Up to 3 attempts with auto-fixes, as sketched below.
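A minimal sketch of that retry loop, assuming fixes happen between attempts (in the real flow I read and edit the failing file; here the loop just surfaces the failure and tries again):

```bash
# Deploy with up to 3 attempts; the real flow applies auto-fixes between tries.
deploy_project() {
  local project_dir="$1" project_name="$2" attempt
  for attempt in 1 2 3; do
    echo "Deploy attempt ${attempt}/3 for ${project_name}..."
    # Run the push in a subshell so the caller's working directory is untouched.
    if (cd "$project_dir" && td -k "$TD_API_KEY" -e "$TD_ENDPOINT" wf push "$project_name"); then
      echo "✓ ${project_name} deployed"
      return 0
    fi
    echo "⚠ Attempt ${attempt} failed - fix the reported error, then retry" >&2
  done
  echo "✗ Deployment failed after 3 attempts - ask user to Skip/Abort" >&2
  return 1
}

# Usage (names are illustrative):
# deploy_project ingestion ingestion
```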
---

### [3] EXECUTE Workflows

**Tool**: Bash

**Command**:

```bash
td -k {TD_API_KEY} -e {TD_ENDPOINT} wf start {project} {workflow} --session now
```

**Output Parsing**:

```
session id: 123456789
attempt id: 987654321
```

**Captured**:

- Session ID (for monitoring)
- Attempt ID (for log retrieval)
- Start timestamp
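A sketch of capturing those IDs for the monitoring step, assuming output lines shaped exactly like the sample above (the `awk` field positions are tied to that format and worth re-checking against your toolbelt version):

```bash
# Start the workflow and keep the IDs needed for monitoring and log retrieval.
start_output=$(td -k "$TD_API_KEY" -e "$TD_ENDPOINT" \
  wf start "$PROJECT" "$WORKFLOW" --session now)

# Lines look like "session id: 123456789" / "attempt id: 987654321".
session_id=$(awk '/session id:/ {print $3}' <<< "$start_output")
attempt_id=$(awk '/attempt id:/ {print $3}' <<< "$start_output")

echo "Started ${PROJECT}/${WORKFLOW}: session=${session_id}, attempt=${attempt_id}, at $(date -u)"
```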
---

### [4] MONITOR Execution

**Tool**: Bash (polling loop)

**Pattern**:

```bash
# Check 1 (immediately)
td -k {TD_API_KEY} -e {TD_ENDPOINT} wf session {session_id}
# Output: {"status": "running"}
# → Show: ⏳ Status: running (0:00 elapsed)

# Wait 30 seconds
sleep 30

# Check 2 (after 30s)
td -k {TD_API_KEY} -e {TD_ENDPOINT} wf session {session_id}
# Output: {"status": "running"}
# → Show: ⏳ Status: running (0:30 elapsed)

# Continue until the status changes...

# Final check
td -k {TD_API_KEY} -e {TD_ENDPOINT} wf session {session_id}
# Output: {"status": "success"}
# → Show: ✓ Execution completed (15:30 elapsed)
```

**Status Handling**:

**Status: running**
- Continue polling
- Show progress indicator
- Wait 30 seconds

**Status: success**
- Show completion message
- Log final duration
- Proceed to validation

**Status: error**
- Retrieve logs: `td wf log {attempt_id}`
- Parse error message
- Show to user
- Ask: Retry / Fix / Skip / Abort

**Status: killed**
- Show killed message
- Ask user: Restart / Abort

**Maximum Wait**: 2 hours (240 checks at 30-second intervals)
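The same pattern written as an actual loop. A sketch under the assumptions above: 30-second interval, 240-check cap, and a status word that can be grepped out of the `td wf session` output (that parsing line is the fragile part; adjust it to whatever your toolbelt version actually prints):

```bash
# Poll every 30s until the session leaves "running", capped at 240 checks (2 hours).
status="unknown"
for ((check = 0; check < 240; check++)); do
  # Fragile by design: pull the first recognizable status word from the output.
  status=$(td -k "$TD_API_KEY" -e "$TD_ENDPOINT" wf session "$session_id" \
    | grep -Eo 'running|success|error|killed' | head -1)
  elapsed=$((check * 30))
  printf '⏳ Status: %s (%d:%02d elapsed)\n' "$status" $((elapsed / 60)) $((elapsed % 60))
  [ "$status" = "running" ] || break
  sleep 30
done

case "$status" in
  success) echo "✓ Execution completed" ;;
  error)   td -k "$TD_API_KEY" -e "$TD_ENDPOINT" wf log "$attempt_id" ;;  # fetch logs for diagnosis
  killed)  echo "Workflow was killed - ask user: Restart / Abort" ;;
  running) echo "✗ Still running after 2 hours - escalate" ;;
  *)       echo "Unrecognized status output: '${status}'" >&2 ;;
esac
```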
---

### [5] VALIDATE Output

**Tool**: mcp__mcc_treasuredata__query

**Validation Queries**:

**Ingestion Phase**:
```sql
-- Check tables created
SELECT table_name, row_count
FROM information_schema.tables
WHERE database_name = '{target_database}'
  AND table_name LIKE '{source_name}%'

-- Expected: {source}_orders, {source}_customers, etc.
-- Verify: row_count > 0 for each table
```

**Hist-Union Phase**:
```sql
-- Check hist-union tables
SELECT table_name, row_count
FROM information_schema.tables
WHERE database_name = '{target_database}'
  AND table_name LIKE '%_hist_union'

-- Verify: row_count >= source table counts
```

**Staging Phase**:
```sql
-- Check staging tables
SELECT table_name, row_count
FROM information_schema.tables
WHERE database_name = '{target_database}'
  AND table_name LIKE '%_stg_%'

-- Check for transformed columns
DESCRIBE {database}.{table_name}
```

**Unification Phase**:
```sql
-- Check unification tables
SELECT table_name, row_count
FROM information_schema.tables
WHERE database_name = '{target_database}'
  AND (table_name LIKE '%_prep'
    OR table_name LIKE '%unified_id%'
    OR table_name LIKE '%enriched%')

-- Verify: unified_id_lookup exists
-- Verify: enriched tables have a unified_id column
```

**Validation Results**:

- ✓ **PASS**: All tables found with data → Proceed
- ⚠ **WARN**: Tables found but empty → Ask user
- ✗ **FAIL**: Tables missing → Stop, show error

---

## Error Handling

### Deployment Errors

**Common Issues & Auto-Fixes**:

**1. YAML Syntax Error**
```
Error: syntax error at line 15: missing colon
→ Auto-fix: Add missing colon
→ Retry: Automatic
```

**2. Missing Database**
```
Error: database 'acme_src' does not exist
→ Check: Query information_schema
→ Create: If user approves
→ Retry: Automatic
```

**3. Secret Not Found**
```
Error: secret 'shopify_api_key' not found
→ Prompt: User to upload credentials
→ Wait: For user confirmation
→ Retry: After user uploads
```

**Retry Strategy**:

- Attempt 1: Auto-fix if possible
- Attempt 2: Ask user for input, fix, retry
- Attempt 3: Final attempt after user guidance
- After 3 failures: Ask user to Skip/Abort

---

### Execution Errors

**Common Issues**:

**1. Table Not Found**
```
Error: Table 'acme_src.shopify_orders' does not exist

Diagnosis:
- Previous phase (Ingestion) may have failed
- Table name mismatch in configuration
- Database permissions issue

Options:
1. Retry - Run workflow again
2. Check Previous Phase - Verify ingestion completed
3. Skip - Skip this phase (NOT RECOMMENDED)
4. Abort - Stop entire pipeline

Your choice:
```

**2. Query Timeout**
```
Error: Query exceeded timeout limit

Diagnosis:
- Data volume too large
- Query not optimized
- Warehouse too small

Options:
1. Retry - Attempt again
2. Increase Timeout - Update workflow configuration
3. Abort - Stop for investigation

Your choice:
```

**3. Authentication Failed**
```
Error: Authentication failed for data source

Diagnosis:
- Credentials expired
- Invalid API key
- Permissions changed

Options:
1. Update Credentials - Upload new credentials
2. Retry - Try again with existing credentials
3. Abort - Stop for manual fix

Your choice:
```

---

### User Decision Points

At each failure, I'll present:

```
┌─────────────────────────────────────────┐
│ ⚠ Phase X Failed                        │
├─────────────────────────────────────────┤
│ Workflow: {workflow_name}               │
│ Session ID: {session_id}                │
│ Error: {error_message}                  │
│                                         │
│ Possible Causes:                        │
│ 1. {cause_1}                            │
│ 2. {cause_2}                            │
│ 3. {cause_3}                            │
│                                         │
│ Options:                                │
│ 1. Retry - Run again with same config   │
│ 2. Fix - Let me help fix the issue      │
│ 3. Skip - Skip this phase (DANGEROUS)   │
│ 4. Abort - Stop entire pipeline         │
│                                         │
│ Your choice (1-4):                      │
└─────────────────────────────────────────┘
```

**Choice Handling**:

- **1 (Retry)**: Re-execute the workflow immediately
- **2 (Fix)**: Interactive troubleshooting, then retry
- **3 (Skip)**: Warn about the consequences, skip if confirmed
- **4 (Abort)**: Stop the pipeline, generate a partial report

---

## Progress Tracking

I'll use TodoWrite to show real-time progress:

```
✓ Pre-Flight: Configuration gathered
→ Phase 1: Ingestion
  ✓ Generate workflows
  ✓ Deploy workflows
  → Execute workflows (session: 123456789)
  ⏳ Monitor execution... (5:30 elapsed)
  ⏳ Validate output...
□ Phase 2: Hist-Union
□ Phase 3: Staging
□ Phase 4: Unification
□ Final: Report generation
```

**Status Indicators**:

- ✓ Completed
- → In Progress
- ⏳ Waiting/Monitoring
- □ Pending
- ✗ Failed

---

## Expected Timeline

**Typical Execution Times** (varies by data volume):

| Phase | Generation | Deployment | Execution | Validation | Total |
|-------|------------|------------|-----------|------------|-------|
| **Ingestion** | 2-5 min | 30 sec | 15-60 min | 1 min | ~1 hour |
| **Hist-Union** | 1-2 min | 30 sec | 10-30 min | 1 min | ~30 min |
| **Staging** | 2-5 min | 30 sec | 20-45 min | 1 min | ~45 min |
| **Unification** | 3-5 min | 30 sec | 30-90 min | 2 min | ~1.5 hours |
| **TOTAL** | **~10 min** | **~2 min** | **~2-3 hours** | **~5 min** | **~3-4 hours** |

*Actual times depend on data volume, complexity, and TD warehouse size*

---

## Final Deliverables

Upon successful completion, you'll receive:

### 1. Generated Workflow Files

**Ingestion** (`ingestion/`):
- `{source}_ingest_inc.dig` - Incremental ingestion workflow
- `{source}_ingest_hist.dig` - Historical backfill workflow (if mode=both)
- `config/{source}_datasources.yml` - Datasource configuration
- `config/{source}_{object}_load.yml` - Per-object load configs

**Hist-Union** (`hist_union/`):
- `{source}_hist_union.dig` - Main hist-union workflow
- `queries/{table}_union.sql` - Union SQL per table

**Staging** (`staging/`):
- `transform_{source}.dig` - Transformation workflow
- `queries/{table}_transform.sql` - Transform SQL per table

**Unification** (`unification/`):
- `unif_runner.dig` - Main orchestration workflow
- `dynmic_prep_creation.dig` - Prep table creation
- `id_unification.dig` - ID unification via API
- `enrich_runner.dig` - Enrichment workflow
- `config/unify.yml` - Unification configuration
- `config/environment.yml` - Client environment
- `config/src_prep_params.yml` - Prep parameters
- `config/stage_enrich.yml` - Enrichment config
- `queries/` - 15+ SQL query files

### 2. Deployment Records

- Session IDs for all workflow executions
- Execution logs saved to `pipeline_logs/{date}/`
- Deployment timestamps
- Error logs (if any)

### 3. Data Quality Metrics

- Row counts per table
- ID unification statistics
- Coverage percentages
- Data quality scores

### 4. Documentation

- Complete configuration summary
- Deployment instructions
- Operational runbook
- Monitoring guidelines
- Troubleshooting playbook

---

## Prerequisites

**Before starting, ensure**:

✅ **TD Toolbelt Installed**:
```bash
# Check version
td --version

# If not installed:
# macOS: brew install td
# Linux: https://toolbelt.treasuredata.com/
```

✅ **Valid TD API Key**:
- Get from: https://console.treasuredata.com/app/users
- Requires write permissions
- Should not expire during execution

✅ **Network Access**:
- Can reach the TD endpoint
- Can connect to source systems (for ingestion)

✅ **Sufficient Permissions**:
- Create databases
- Create tables
- Upload workflows
- Execute workflows
- Upload secrets

✅ **Source System Credentials**:
- API keys ready
- OAuth tokens current
- Service accounts configured

---

## Getting Started

**Ready to begin?** Please provide the following information:

### Step 1: Global Configuration
```
TD_API_KEY: 12345/abcd...
TD_ENDPOINT: https://api.treasuredata.com
Client Name: acme
```

### Step 2: Ingestion Details
```
Source Name: shopify
Connector Type: rest
Objects: orders,customers,products
Mode: both
Incremental Field: updated_at
Start Date: 2023-09-01T00:00:00.000000
Target Database: acme_src
```

### Step 3: Hist-Union Details
```
Tables: shopify_orders,shopify_customers,shopify_products
```

### Step 4: Staging Details
```
SQL Engine: presto
Tables: (will use hist-union output)
Target Database: acme_stg
```

### Step 5: Unification Details
```
Unification Name: customer_360
ID Method: persistent_id
Update Strategy: incremental
Tables: (will use staging output)
```

---

**Alternatively, provide everything at once in YAML format:**

```yaml
global:
  td_api_key: "12345/abcd..."
  td_endpoint: "https://api.treasuredata.com"
  client: "acme"

ingestion:
  source_name: "shopify"
  connector: "rest"
  objects: ["orders", "customers", "products"]
  mode: "both"
  incremental_field: "updated_at"
  start_date: "2023-09-01T00:00:00.000000"
  target_database: "acme_src"

hist_union:
  tables: ["shopify_orders", "shopify_customers", "shopify_products"]

staging:
  engine: "presto"
  target_database: "acme_stg"

unification:
  name: "customer_360"
  id_method: "persistent_id"
  update_strategy: "incremental"
```

---

**I'll orchestrate the complete CDP implementation from start to finish!**
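For orientation, here is the whole per-phase loop condensed into one place. This is a sketch, not the implementation: it assumes the illustrative helpers from the sections above (`deploy_project`, plus the start/monitor and validation snippets wrapped as hypothetical `run_and_monitor` and `validate_phase` functions), and the directory and workflow names are examples drawn from the deliverables list, not guaranteed outputs:

```bash
#!/usr/bin/env bash
# End-to-end driver (sketch): each phase must deploy, execute, and validate
# before the next begins, mirroring the dependency chain described above.
set -euo pipefail

# "project_dir workflow" pairs; names are illustrative examples.
phases=(
  "ingestion   shopify_ingest_inc"
  "hist_union  shopify_hist_union"
  "staging     transform_shopify"
  "unification unif_runner"
)

for phase in "${phases[@]}"; do
  read -r dir wf <<< "$phase"
  deploy_project "$dir" "$dir"    # [2] DEPLOY - retry wrapper sketched earlier
  run_and_monitor "$dir" "$wf"    # [3] EXECUTE + [4] MONITOR (hypothetical wrapper)
  validate_phase "$dir"           # [5] VALIDATE - non-zero exit aborts via set -e
done

echo "✓ All four phases completed"
```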