---
name: dbt
description: Expert guidance for using dbt (data build tool) with Treasure Data Trino. Use this skill when users need help setting up dbt with TD, creating models, using TD-specific macros, handling incremental models, or troubleshooting dbt-trino adapter issues.
---

# dbt with Treasure Data Trino

Expert assistance for using dbt (data build tool) with Treasure Data's Trino engine.

## When to Use This Skill

Use this skill when:
- Setting up dbt with Treasure Data Trino
- Creating dbt models for TD
- Writing TD-specific dbt macros
- Implementing incremental models with TD_INTERVAL
- Troubleshooting dbt-trino adapter errors
- Overriding dbt-trino macros for TD compatibility
- Managing dbt projects with TD data pipelines

## Prerequisites

### Installation

**Recommended: Using uv (modern Python package manager):**

`uv` is a fast, modern Python package and environment manager written in Rust. It is significantly faster than traditional pip and provides better dependency resolution.

```bash
# Install uv (choose one):
# Option 1: Homebrew (recommended for Mac)
brew install uv

# Option 2: Standalone installer
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dbt-core and dbt-trino (much faster than pip)
uv pip install dbt-core dbt-trino==1.9.3

# Verify installation
dbt --version
```

**Benefits of uv:**
- **10-100x faster** than pip for package installation
- **Better dependency resolution** with clearer error messages
- **Drop-in replacement** for pip (use `uv pip` instead of `pip`)
- **Built-in virtual environment management** with `uv venv`

**Alternative: Using traditional pip and venv:**

```bash
# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Note: brew install dbt doesn't work well on macOS
# Install dbt-core and dbt-trino
pip install dbt-core dbt-trino==1.9.3

# Verify installation
dbt --version
# Expected output:
# Core: 1.10.9
# Plugins: trino: 1.9.3
```

### TD Connection Setup

Create `profiles.yml` (either in `~/.dbt/profiles.yml` or at the project root):

```yaml
td:
  target: dev
  outputs:
    dev:
      type: trino
      method: none  # Use 'none' for API key authentication
      user: "{{ env_var('TD_API_KEY') }}"  # TD API key from environment variable
      password: dummy  # Password is not used with API key
      host: api-presto.treasuredata.com
      port: 443
      database: td  # Always 'td' for Treasure Data
      schema: your_dev_database  # Your dev TD database (e.g., 'dev_analytics')
      threads: 4
      http_scheme: https
      session_properties:
        query_max_run_time: 1h
    prod:
      type: trino
      method: none
      user: "{{ env_var('TD_API_KEY') }}"
      password: dummy
      host: api-presto.treasuredata.com
      port: 443
      database: td
      schema: your_prod_database  # Your prod TD database (e.g., 'production')
      threads: 4
      http_scheme: https
      session_properties:
        query_max_run_time: 1h
```

**Important TD-specific settings:**
- `method`: Set to `none` for API key authentication (not `ldap`)
- `user`: Use the TD API key from the `TD_API_KEY` environment variable
- `password`: Set to `dummy` (not used with API key authentication)
- `host`: Always `api-presto.treasuredata.com` (even though the engine is actually Trino)
- `database`: Always set to `td` for Treasure Data
- `schema`: Set to your actual TD database name (what you see in TD Console)

**Set up your TD API key:**

```bash
# Get your API key from TD Console: https://console.treasuredata.com/app/users
export TD_API_KEY="your_api_key_here"

# Or add to your shell profile (~/.bashrc, ~/.zshrc, etc.)
echo 'export TD_API_KEY="your_api_key_here"' >> ~/.zshrc
```

**Switch between dev and prod:**

```bash
# Run against dev (default)
dbt run

# Run against prod
dbt run --target prod
```
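Before building real models, you can verify the profile with `dbt debug` (covered below) or run one trivial throwaway model end to end. A minimal sketch — the model name and column are illustrative only, not part of any standard setup:

```sql
-- models/smoke_test.sql  (hypothetical throwaway model to verify the profile;
-- delete it once the connection is confirmed)
{{ config(materialized='table') }}

SELECT 1 AS ok
```

Run it with `dbt run --select smoke_test` and check that a `smoke_test` table appears in your dev TD database.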
TD_API_KEY="your_api_key_here" # Or add to your shell profile (~/.bashrc, ~/.zshrc, etc.) echo 'export TD_API_KEY="your_api_key_here"' >> ~/.zshrc ``` **Switch between dev and prod:** ```bash # Run against dev (default) dbt run # Run against prod dbt run --target prod ``` ### dbt Project Configuration Create or update `dbt_project.yml` with TD-specific settings: ```yaml name: 'my_td_project' version: '1.0.0' config-version: 2 # This setting configures which "profile" dbt uses for this project. profile: 'td' # These configurations specify where dbt should look for different types of files. model-paths: ["models"] analysis-paths: ["analyses"] test-paths: ["tests"] seed-paths: ["seeds"] macro-paths: ["macros"] snapshot-paths: ["snapshots"] target-path: "target" clean-targets: - "target" - "dbt_packages" # SSL certificate validation (required for TD) flags: require_certificate_validation: true # Global variable for default time range vars: target_range: '-3M/now' # Default: last 3 months to now # Model configuration with TD-specific settings models: my_td_project: +materialized: table +on_schema_change: "append_new_columns" # Auto-add new columns instead of failing +views_enabled: false # TD doesn't support views (use tables) # Staging models staging: +materialized: table +tags: ["staging"] # Marts models marts: +materialized: table +tags: ["marts"] # Incremental models incremental: +materialized: incremental +on_schema_change: "append_new_columns" +tags: ["incremental"] ``` **Key TD-specific settings:** - `flags.require_certificate_validation: true` - Required for SSL validation with TD - `vars.target_range: '-3M/now'` - Default time range for all models using the variable - `+on_schema_change: "append_new_columns"` - Automatically add new columns to existing tables (prevents rebuild on schema changes) - `+views_enabled: false` - Explicitly disable views since TD doesn't support `CREATE VIEW` **Benefits:** - **SSL security**: Ensures certificate validation for secure TD connections - **Schema evolution**: New columns are added automatically without dropping tables - **Default time window**: All models using `{{ var('target_range') }}` get sensible default - **No views**: Prevents accidental view creation attempts ## Required TD-Specific Overrides TD's Presto/Trino has limitations that require overriding some dbt-trino macros. You MUST create this file in your dbt project. ### Create `macros/override_dbt_trino.sql` This file overrides dbt-trino macros to work with TD Presto/Trino limitations: **Key changes:** 1. Removes table ownership queries (TD doesn't support) 2. Simplifies catalog queries 3. Replaces `CREATE VIEW` with `CREATE TABLE` (TD doesn't support views) See the full macro file in [macros/override_dbt_trino.sql](./macros/override_dbt_trino.sql) in this skill directory. **Why this is needed:** - TD Presto doesn't support `CREATE VIEW` statements - TD doesn't expose table ownership information - Some information_schema queries need simplification ## TD-Specific dbt Macros ### 1. 
## TD-Specific dbt Macros

### 1. Incremental Scan Macro

For incremental models that process new data only:

```sql
-- macros/td_incremental_scan.sql
{% macro incremental_scan(table_name) -%}
(
    SELECT *
    FROM {{ table_name }}
    WHERE TD_INTERVAL(time, '{{ var("target_range", "-3M/now") }}')
    {% if is_incremental() -%}
    AND time > {{ get_max_time(this) }}
    {%- endif %}
)
{%- endmacro %}

{% macro get_max_time(table_name) -%}
(SELECT MAX(time) FROM {{ table_name }})
{%- endmacro %}
```

Note that `get_max_time` receives `{{ this }}` (the fully qualified relation for the current model), not a bare table name, so the subquery resolves correctly regardless of the target schema.

**Default behavior:** Scans the last 3 months to now (`-3M/now`) if no `target_range` variable is provided.

**Usage in model:**

```sql
-- models/incremental_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT
    event_id,
    user_id,
    event_type,
    time
FROM {{ incremental_scan('raw_events') }}
```

**Run with default (last 3 months):**

```bash
dbt run --select incremental_events
```

**Or override with a specific range:**

```bash
# Yesterday only
dbt run --vars '{"target_range": "-1d"}' --select incremental_events

# Last 7 days
dbt run --vars '{"target_range": "-7d/now"}' --select incremental_events

# Specific date range
dbt run --vars '{"target_range": "2024-01-01/2024-01-31"}' --select incremental_events
```

**Note:** There is no need to create wrapper macros for TD time functions - they're already simple enough to use directly in your SQL.

## dbt Model Patterns for TD

### Basic Model

```sql
-- models/daily_events.sql
{{ config(
    materialized='table'
) }}

SELECT
    TD_TIME_STRING(time, 'd!', 'JST') as date,
    event_type,
    COUNT(*) as event_count,
    approx_distinct(user_id) as unique_users
FROM {{ source('raw', 'events') }}
WHERE TD_INTERVAL(time, '-30d', 'JST')
GROUP BY 1, 2
```

### Incremental Model

```sql
-- models/incremental_user_events.sql
{{ config(
    materialized='incremental',
    unique_key='user_date_key'
) }}

SELECT
    CONCAT(CAST(user_id AS VARCHAR), '_', TD_TIME_STRING(time, 'd!', 'JST')) as user_date_key,
    user_id,
    TD_TIME_STRING(time, 'd!', 'JST') as date,
    COUNT(*) as event_count
FROM {{ source('raw', 'events') }}
WHERE TD_INTERVAL(time, '{{ var('target_range', '-1d') }}', 'JST')
{% if is_incremental() %}
    -- Only process data after the last run
    AND time > (SELECT MAX(time) FROM {{ this }})
{% endif %}
GROUP BY 1, 2, 3
```

### CTE (Common Table Expression) Pattern

Sessionization uses the `TD_SESSIONIZE_WINDOW` window function (the replacement for the deprecated `TD_SESSIONIZE`), which must be computed in its own step before aggregating:

```sql
-- models/user_metrics.sql
{{ config(
    materialized='table'
) }}

WITH events_filtered AS (
    SELECT *
    FROM {{ source('raw', 'events') }}
    WHERE TD_INTERVAL(time, '-7d', 'JST')
),

sessionized AS (
    SELECT
        user_id,
        time,
        TD_SESSIONIZE_WINDOW(time, 1800)
            OVER (PARTITION BY user_id ORDER BY time) as session_id
    FROM events_filtered
),

user_sessions AS (
    SELECT
        user_id,
        session_id,
        MIN(time) as session_start,
        MAX(time) as session_end
    FROM sessionized
    GROUP BY user_id, session_id
)

SELECT
    user_id,
    COUNT(DISTINCT session_id) as session_count,
    AVG(session_end - session_start) as avg_session_duration
FROM user_sessions
GROUP BY user_id
```

## Sources Configuration

Define TD tables as sources. As in `profiles.yml`, `database` is always `td` and `schema` is the TD database that holds the tables:

```yaml
# models/sources.yml
version: 2

sources:
  - name: raw
    database: td          # Always 'td' for Treasure Data
    schema: production    # The TD database containing the raw tables
    tables:
      - name: events
        description: Raw event data from applications
        columns:
          - name: time
            description: Event timestamp (Unix time)
          - name: user_id
            description: User identifier
          - name: event_type
            description: Type of event
      - name: users
        description: User profile data
```

**Usage in models:**

```sql
SELECT * FROM {{ source('raw', 'events') }}
```
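Sources pair naturally with thin staging models that apply the time filter once and expose a clean column list. A minimal sketch — the model name and columns are illustrative, taken from the source definition above:

```sql
-- models/staging/stg_events.sql  (minimal staging sketch over the source above)
{{ config(materialized='table', tags=['staging']) }}

SELECT
    time,
    user_id,
    event_type
FROM {{ source('raw', 'events') }}
WHERE TD_INTERVAL(time, '{{ var("target_range", "-3M/now") }}', 'JST')
```

Downstream models can then `ref('stg_events')` instead of repeating the source filter.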
## Testing with TD

### Schema Tests

```yaml
# models/schema.yml
version: 2

models:
  - name: daily_events
    description: Daily event aggregations
    columns:
      - name: date
        description: Event date
        tests:
          - not_null
          - unique
      - name: event_count
        description: Number of events
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      - name: unique_users
        description: Unique user count (approximate)
        tests:
          - not_null
```

**Note:** The `dbt_utils.expression_is_true` test requires the `dbt_utils` package in `packages.yml`.

### Custom TD Tests

```sql
-- tests/assert_positive_events.sql
-- Returns records that fail the test
SELECT *
FROM {{ ref('daily_events') }}
WHERE event_count < 0
```

## Running dbt with TD

### Basic Commands

```bash
# Test connection
dbt debug

# Run all models
dbt run

# Run specific model
dbt run --select daily_events

# Run with variables
dbt run --vars '{"target_range": "-7d"}'

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve
```

### Incremental Run Pattern

```bash
# Daily incremental run
dbt run --select incremental_events --vars '{"target_range": "-1d"}'

# Full refresh
dbt run --select incremental_events --full-refresh

# Backfill a specific date
dbt run --select incremental_events --vars '{"target_range": "2024-01-15"}'
```

## Common Issues and Solutions

### Issue 1: "This connector does not support creating views"

**Error:**

```
TrinoUserError: This connector does not support creating views
```

**Solution:** Add `macros/override_dbt_trino.sql`, which overrides `trino__create_view_as` to use `CREATE TABLE` instead.

### Issue 2: Catalog Query Failures

**Error:**

```
Database Error: Table ownership information not available
```

**Solution:** Use the override macros that remove table ownership queries from catalog operations.

### Issue 3: Connection Timeout

**Error:**

```
Connection timeout
```

**Solution:** Increase the session timeout in `profiles.yml` if needed (the profile above sets 1h):

```yaml
session_properties:
  query_max_run_time: 2h  # Increase if queries legitimately need more time
```

### Issue 4: Incremental Model Not Working

**Problem:** The incremental model processes all data every time.

**Solution:** Ensure `unique_key` is set and check the incremental logic:

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id'  -- Must be specified
) }}

{% if is_incremental() %}
    -- This block only runs on incremental runs
    WHERE time > (SELECT MAX(time) FROM {{ this }})
{% endif %}
```

### Issue 5: Variable Not Found

**Error:**

```
Compilation Error: Var 'target_range' is undefined
```

**Solution:** Provide a default value:

```sql
WHERE TD_INTERVAL(time, '{{ var('target_range', '-1d') }}', 'JST')
```

Or pass the variable:

```bash
dbt run --vars '{"target_range": "-1d"}'
```

## Project Structure

```
dbt_project/
├── dbt_project.yml              # Project config with TD-specific settings
├── profiles.yml                 # Connection config (or in ~/.dbt/profiles.yml)
├── macros/
│   ├── override_dbt_trino.sql   # Required TD overrides
│   └── td_incremental_scan.sql  # Optional: Incremental helper
├── models/
│   ├── sources.yml              # Source definitions
│   ├── schema.yml               # Tests and documentation
│   ├── staging/
│   │   └── stg_events.sql
│   └── marts/
│       ├── daily_events.sql
│       └── user_metrics.sql
└── tests/
    └── assert_positive_events.sql
```

**Note:** `profiles.yml` can be placed either:
- At the project root (recommended for TD Workflow deployments)
- In `~/.dbt/profiles.yml` (for local development)

## Best Practices

1. **Include time filters in all models**
   - Use TD_INTERVAL or TD_TIME_RANGE directly (see the sketch after this list)
   - Critical for performance on large tables

2. **Use incremental models wisely**
   - Good for append-only event data
   - Requires careful unique_key selection
   - Test thoroughly before production

3. **Leverage sources**
   - Define all TD tables as sources
   - Enables lineage tracking
   - Centralizes table documentation

4. **Use variables for flexibility**
   - Date ranges
   - Environment-specific settings
   - Makes models reusable

5. **Test your models**
   - Not-null checks on key columns
   - Unique checks on IDs
   - Custom assertions for business logic

6. **Document everything**
   - Model descriptions
   - Column descriptions
   - Include TD-specific notes
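As a reference for practice 1, the two filter styles side by side: `TD_INTERVAL` for relative windows and `TD_TIME_RANGE` for explicit bounds (start inclusive, end exclusive). The window values here are illustrative:

```sql
-- Relative window: the last 7 days up to now
SELECT COUNT(*) AS recent_events
FROM {{ source('raw', 'events') }}
WHERE TD_INTERVAL(time, '-7d/now', 'JST')

-- Explicit bounds: the whole of January 2024
-- WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-02-01', 'JST')
```

Either form lets TD prune time-based partitions; a model with no time filter scans the entire table.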
## Integration with TD Workflows

### Running dbt with Custom Scripts (Recommended for TD Workflow)

TD Workflow supports running dbt using Custom Scripts with Docker containers. This is the recommended approach for production deployments.

**Create a Python wrapper (`dbt_wrapper.py`):**

```python
#!/usr/bin/env python3
import sys
from dbt.cli.main import dbtRunner


def run_dbt(command_args):
    """Run dbt commands using dbtRunner"""
    dbt = dbtRunner()
    result = dbt.invoke(command_args)
    if not result.success:
        sys.exit(1)
    return result


if __name__ == "__main__":
    # Get command from arguments (e.g., ['run', '--target', 'prod'])
    command = sys.argv[1:] if len(sys.argv) > 1 else ['run']
    print(f"Running dbt with command: {' '.join(command)}")
    run_dbt(command)
```

**Create workflow file (`dbt_workflow.dig`):**

```yaml
timezone: Asia/Tokyo

schedule:
  daily>: 03:00:00

_export:
  docker:
    image: "treasuredata/customscript-python:3.12.11-td1"

# Set TD API key from secrets
_env:
  TD_API_KEY: ${secret:td.apikey}

+setup:
  py>: tasks.InstallPackages

+dbt_run:
  py>: dbt_wrapper.run_dbt
  command_args: ['run', '--target', 'prod']

+dbt_test:
  py>: dbt_wrapper.run_dbt
  command_args: ['test']
```

**Create package installer (`tasks.py`):**

```python
def InstallPackages():
    """Install dbt and dependencies at runtime"""
    import subprocess
    import sys

    packages = [
        'dbt-core==1.10.9',
        'dbt-trino==1.9.3'
    ]

    for package in packages:
        subprocess.check_call([
            sys.executable, '-m', 'pip', 'install', package
        ])
```

**Deploy to TD Workflow:**

```bash
# 1. Clean dbt artifacts
dbt clean

# 2. Push to TD Workflow
td workflow push my_dbt_project

# 3. Set TD API key secret
td workflow secrets --project my_dbt_project --set td.apikey=YOUR_API_KEY

# 4. Run from TD Console or trigger manually
td workflow start my_dbt_project dbt_workflow --session now
```
**Important notes:**
- Use the Docker image `treasuredata/customscript-python:3.12.11-td1` (latest stable image with Python 3.12)
- Install dependencies at runtime using `py>: tasks.InstallPackages`
- Store the API key in TD secrets: `${secret:td.apikey}`
- Include your dbt project files (models, macros, profiles.yml, dbt_project.yml)

### Local Digdag + dbt Integration (Development)

For local development and testing:

```yaml
# workflow.dig
+dbt_run:
  sh>: dbt run --vars '{"target_range": "${session_date}"}'

+dbt_test:
  sh>: dbt test
```

### Scheduled dbt Runs

```yaml
# daily_dbt_workflow.dig
timezone: Asia/Tokyo

schedule:
  daily>: 03:00:00

_export:
  session_date: ${session_date}

+run_incremental_models:
  sh>: |
    cd /path/to/dbt_project
    dbt run --select tag:incremental --vars '{"target_range": "-1d"}'

+run_tests:
  sh>: |
    cd /path/to/dbt_project
    dbt test --select tag:incremental

+notify_completion:
  echo>: "dbt run completed for ${session_date}"
```

## Advanced Patterns

### Dynamic Table Selection

```sql
-- models/flexible_aggregation.sql
{{ config(
    materialized='table'
) }}

{% set table_name = var('source_table', 'events') %}
{% set metric = var('metric', 'event_count') %}

SELECT
    TD_TIME_STRING(time, 'd!', 'JST') as date,
    COUNT(*) as {{ metric }}
FROM {{ source('raw', table_name) }}
WHERE TD_INTERVAL(time, '{{ var('target_range', '-7d') }}', 'JST')
GROUP BY 1
```

### Multi-Source Union

Note that the loop variable must not be named `source`, or it would shadow dbt's `source()` function inside the loop:

```sql
-- models/unified_events.sql
{{ config(
    materialized='table'
) }}

{% set source_tables = ['mobile_events', 'web_events', 'api_events'] %}

{% for src in source_tables %}
SELECT
    '{{ src }}' as source_type,
    *
FROM {{ source('raw', src) }}
WHERE TD_INTERVAL(time, '-1d', 'JST')
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
```

## Resources

- dbt Documentation: https://docs.getdbt.com/
- dbt-trino adapter: https://github.com/starburstdata/dbt-trino
- TD Query Engine: Use Trino-specific SQL
- TD Functions: TD_INTERVAL, TD_TIME_STRING, etc.

## Migration from SQL Scripts to dbt

If migrating existing TD SQL workflows to dbt:

1. **Convert queries to models**
   - Add a config block
   - Use source() for table references
   - Add TD-specific macros
   - See the before/after sketch below

2. **Add tests**
   - Start with basic not_null tests
   - Add unique key tests
   - Create custom business logic tests

3. **Implement incrementally**
   - Start with simple table materializations
   - Add incremental models gradually
   - Test each model thoroughly

4. **Update orchestration**
   - Replace direct SQL in digdag with dbt commands
   - Maintain existing schedules
   - Add dbt test steps
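As a concrete illustration of step 1, a before/after sketch for a hypothetical daily rollup query (the table and column names are assumptions for illustration):

```sql
-- Before: a standalone query run directly against TD
--   SELECT TD_TIME_STRING(time, 'd!', 'JST') AS date, COUNT(*) AS event_count
--   FROM events
--   WHERE TD_INTERVAL(time, '-1d', 'JST')
--   GROUP BY 1

-- After: the same logic as a dbt model, e.g. models/daily_counts.sql
{{ config(materialized='table') }}

SELECT
    TD_TIME_STRING(time, 'd!', 'JST') AS date,
    COUNT(*) AS event_count
FROM {{ source('raw', 'events') }}
WHERE TD_INTERVAL(time, '{{ var("target_range", "-1d") }}', 'JST')
GROUP BY 1
```

The table reference becomes a `source()` call for lineage, and the hard-coded window becomes a `target_range` variable so the same model serves daily runs and backfills.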