# dbt Integration Patterns Guide to combining dbt with MXCP for data transformation pipelines. ## Why dbt + MXCP? **dbt creates the tables → MXCP queries them** This pattern provides: - Data transformation and quality in dbt - Fast local caching of external data - SQL queries against materialized views - Consistent data contracts ## Setup ### 1. Enable dbt in MXCP ```yaml # mxcp-site.yml dbt: enabled: true model_paths: ["models"] ``` ### 2. Create dbt Project ```bash dbt init ``` ### 3. Configure dbt Profile ```yaml # profiles.yml (auto-generated by mxcp dbt-config) covid_owid: outputs: dev: type: duckdb path: data.duckdb target: dev ``` ## Basic Pattern ### dbt Model Create `models/sales_summary.sql`: ```sql {{ config(materialized='table') }} SELECT region, DATE_TRUNC('month', sale_date) as month, SUM(amount) as total_sales, COUNT(*) as transaction_count FROM {{ source('raw', 'sales_data') }} GROUP BY region, month ``` ### Run dbt ```bash mxcp dbt run # or directly: dbt run ``` ### MXCP Tool Queries Table Create `tools/monthly_sales.yml`: ```yaml mxcp: 1 tool: name: monthly_sales description: "Get monthly sales summary" parameters: - name: region type: string return: type: array source: code: | SELECT * FROM sales_summary WHERE region = $region ORDER BY month DESC ``` ## External Data Caching ### Fetch and Cache External Data ```sql -- models/covid_data.sql {{ config(materialized='table') }} SELECT * FROM read_csv_auto('https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv') ``` Run once to cache: ```bash mxcp dbt run ``` ### Query Cached Data ```yaml # tools/covid_stats.yml tool: name: covid_stats source: code: | SELECT location, date, total_cases, new_cases FROM covid_data WHERE location = $country ORDER BY date DESC LIMIT 30 ``` ## Incremental Models ### Incremental Updates ```sql -- models/events_incremental.sql {{ config( materialized='incremental', unique_key='event_id' ) }} SELECT * FROM read_json('https://api.example.com/events') {% if is_incremental() %} WHERE created_at > (SELECT MAX(created_at) FROM {{ this }}) {% endif %} ``` ## Sources and References ### Define Sources ```yaml # models/sources.yml version: 2 sources: - name: raw tables: - name: sales_data - name: customer_data ``` ### Reference Models ```sql -- models/customer_summary.sql {{ config(materialized='table') }} WITH customers AS ( SELECT * FROM {{ source('raw', 'customer_data') }} ), sales AS ( SELECT * FROM {{ ref('sales_summary') }} ) SELECT c.customer_id, c.name, s.total_sales FROM customers c JOIN sales s ON c.customer_id = s.customer_id ``` ## Data Quality Tests ### dbt Tests ```yaml # models/schema.yml version: 2 models: - name: sales_summary columns: - name: region tests: - not_null - name: total_sales tests: - not_null - positive_value - name: month tests: - unique ``` ### Run Tests ```bash mxcp dbt test ``` ## Complete Workflow ### 1. Development ```bash # Create/modify dbt models vim models/new_analysis.sql # Run transformations mxcp dbt run --select new_analysis # Test data quality mxcp dbt test --select new_analysis # Create MXCP endpoint vim tools/new_endpoint.yml ``` ### 2. Testing ```bash # Validate MXCP endpoint mxcp validate # Test endpoint mxcp test tool new_endpoint ``` ### 3. Production ```bash # Run dbt in production mxcp dbt run --profile production # Start MXCP server mxcp serve --profile production ``` ## Advanced Patterns ### Multi-Source Aggregation ```sql -- models/unified_metrics.sql {{ config(materialized='table') }} WITH external_data AS ( SELECT * FROM read_json('https://api.example.com/metrics') ), internal_data AS ( SELECT * FROM {{ source('raw', 'internal_metrics') }} ), third_party AS ( SELECT * FROM read_parquet('s3://bucket/data/*.parquet') ) SELECT * FROM external_data UNION ALL SELECT * FROM internal_data UNION ALL SELECT * FROM third_party ``` ### Dynamic Caching Strategy ```sql -- models/live_dashboard.sql {{ config( materialized='table', post_hook="PRAGMA optimize" ) }} -- Recent data (refresh hourly) SELECT * FROM read_json('https://api.metrics.com/live') WHERE timestamp >= current_timestamp - interval '24 hours' UNION ALL -- Historical data (cached daily) SELECT * FROM {{ ref('historical_metrics') }} WHERE timestamp < current_timestamp - interval '24 hours' ``` ## Best Practices 1. **Materialization Strategy** - Use `table` for frequently queried data - Use `view` for rarely used transformations - Use `incremental` for large, append-only datasets 2. **Naming Conventions** - `stg_` for staging models - `int_` for intermediate models - `fct_` for fact tables - `dim_` for dimension tables 3. **Data Quality** - Add tests to all models - Document columns - Use sources for raw data 4. **Performance** - Materialize frequently used aggregations - Use incremental for large datasets - Add indexes where needed 5. **Version Control** - Commit dbt models - Version dbt_project.yml - Document model changes