# dbt Integration Patterns

A guide to combining dbt with MXCP for data transformation pipelines.
## Why dbt + MXCP?

**dbt creates the tables → MXCP queries them.**

This pattern provides:
- Data transformation and quality in dbt
- Fast local caching of external data
- SQL queries against materialized views
- Consistent data contracts
## Setup

### 1. Enable dbt in MXCP

```yaml
# mxcp-site.yml
dbt:
  enabled: true
  model_paths: ["models"]
```
### 2. Create dbt Project

```bash
dbt init
```
### 3. Configure dbt Profile

```yaml
# profiles.yml (auto-generated by mxcp dbt-config)
covid_owid:
  outputs:
    dev:
      type: duckdb
      path: data.duckdb
  target: dev
```
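As the comment above notes, this file does not need to be written by hand; it can be generated (invocation without arguments is assumed here):

```bash
mxcp dbt-config
```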
## Basic Pattern

### dbt Model

Create `models/sales_summary.sql`:

```sql
{{ config(materialized='table') }}

SELECT
    region,
    DATE_TRUNC('month', sale_date) AS month,
    SUM(amount) AS total_sales,
    COUNT(*) AS transaction_count
FROM {{ source('raw', 'sales_data') }}
GROUP BY region, month
```
### Run dbt

```bash
mxcp dbt run
# or directly: dbt run
```
### MXCP Tool Queries the Table

Create `tools/monthly_sales.yml`:

```yaml
mxcp: 1
tool:
  name: monthly_sales
  description: "Get monthly sales summary"
  parameters:
    - name: region
      type: string
  return:
    type: array
  source:
    code: |
      SELECT * FROM sales_summary
      WHERE region = $region
      ORDER BY month DESC
```
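Once the model has been built, the endpoint can be exercised with the same commands used in the workflow section below:

```bash
mxcp validate
mxcp test tool monthly_sales
```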
## External Data Caching

### Fetch and Cache External Data

```sql
-- models/covid_data.sql
{{ config(materialized='table') }}

SELECT *
FROM read_csv_auto('https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv')
```
Run once to cache:

```bash
mxcp dbt run
```
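Because the model is materialized as a table, the remote CSV is fetched only when dbt runs; MXCP queries then hit the local DuckDB copy. To keep that copy small, a variant model (hypothetical name `covid_data_slim`) could cache only the columns the tool below actually reads:

```sql
-- models/covid_data_slim.sql (illustrative variant)
{{ config(materialized='table') }}

-- project only the columns queried by covid_stats
SELECT location, date, total_cases, new_cases
FROM read_csv_auto('https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv')
```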
### Query Cached Data

```yaml
# tools/covid_stats.yml
mxcp: 1
tool:
  name: covid_stats
  parameters:
    - name: country
      type: string
  source:
    code: |
      SELECT location, date, total_cases, new_cases
      FROM covid_data
      WHERE location = $country
      ORDER BY date DESC
      LIMIT 30
```
## Incremental Models

### Incremental Updates

```sql
-- models/events_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT *
FROM read_json('https://api.example.com/events')
{% if is_incremental() %}
-- on incremental runs, only fetch events newer than the latest already loaded
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```
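A common refinement (a sketch, not part of the original example) is to swap the filter above for a lookback window so late-arriving events are still picked up, relying on `unique_key` to deduplicate rows that get reprocessed:

```sql
{% if is_incremental() %}
-- reprocess the last 3 days; unique_key='event_id' deduplicates overlaps
WHERE created_at > (SELECT MAX(created_at) - INTERVAL '3 days' FROM {{ this }})
{% endif %}
```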
## Sources and References

### Define Sources

```yaml
# models/sources.yml
version: 2

sources:
  - name: raw
    tables:
      - name: sales_data
      - name: customer_data
```
### Reference Models

```sql
-- models/customer_summary.sql
{{ config(materialized='table') }}

WITH customers AS (
    SELECT * FROM {{ source('raw', 'customer_data') }}
),

sales AS (
    -- ref() builds on another model; note that sales_summary above is
    -- grouped by region and month, so a real join on customer_id would
    -- need a customer-level sales model instead
    SELECT * FROM {{ ref('sales_summary') }}
)

SELECT
    c.customer_id,
    c.name,
    s.total_sales
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
```
## Data Quality Tests

### dbt Tests

```yaml
# models/schema.yml
version: 2

models:
  - name: sales_summary
    columns:
      - name: region
        tests:
          - not_null
      - name: total_sales
        tests:
          - not_null
          - positive_value  # custom generic test, defined below
      - name: month
        tests:
          - not_null
          # `month` repeats across regions, so a plain `unique` test would
          # fail; test uniqueness of (region, month) instead, e.g. with
          # dbt_utils.unique_combination_of_columns
```
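`positive_value` is not one of dbt's built-in tests, so it has to be defined as a custom generic test. A minimal sketch, placed where dbt looks for generic tests:

```sql
-- tests/generic/positive_value.sql
{% test positive_value(model, column_name) %}

-- fail rows where the value is zero or negative
-- (NULLs are covered separately by the not_null test)
SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0

{% endtest %}
```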
### Run Tests

```bash
mxcp dbt test
```
## Complete Workflow

### 1. Development

```bash
# Create/modify dbt models
vim models/new_analysis.sql

# Run transformations
mxcp dbt run --select new_analysis

# Test data quality
mxcp dbt test --select new_analysis

# Create MXCP endpoint
vim tools/new_endpoint.yml
```
### 2. Testing

```bash
# Validate MXCP endpoint
mxcp validate

# Test endpoint
mxcp test tool new_endpoint
```
### 3. Production

```bash
# Run dbt in production
mxcp dbt run --profile production

# Start MXCP server
mxcp serve --profile production
```
## Advanced Patterns

### Multi-Source Aggregation

```sql
-- models/unified_metrics.sql
{{ config(materialized='table') }}

WITH external_data AS (
    SELECT * FROM read_json('https://api.example.com/metrics')
),

internal_data AS (
    SELECT * FROM {{ source('raw', 'internal_metrics') }}
),

third_party AS (
    SELECT * FROM read_parquet('s3://bucket/data/*.parquet')
)

SELECT * FROM external_data
UNION ALL
SELECT * FROM internal_data
UNION ALL
SELECT * FROM third_party
```
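`UNION ALL` requires every branch to produce the same columns in the same order, which `SELECT *` against heterogeneous sources rarely guarantees. A more defensive variant projects an explicit shared schema in each CTE; the column names below are assumptions, not from the original:

```sql
-- illustrative: normalize each source to (source_name, metric, value, ts)
WITH external_data AS (
    SELECT 'external' AS source_name, name AS metric, value, timestamp AS ts
    FROM read_json('https://api.example.com/metrics')
),

internal_data AS (
    SELECT 'internal' AS source_name, metric, value, ts
    FROM {{ source('raw', 'internal_metrics') }}
)

SELECT * FROM external_data
UNION ALL
SELECT * FROM internal_data
```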
### Dynamic Caching Strategy

```sql
-- models/live_dashboard.sql
{{ config(
    materialized='table',
    post_hook="PRAGMA optimize"
) }}

-- Recent data (refresh hourly)
SELECT * FROM read_json('https://api.metrics.com/live')
WHERE timestamp >= current_timestamp - interval '24 hours'

UNION ALL

-- Historical data (cached daily)
SELECT * FROM {{ ref('historical_metrics') }}
WHERE timestamp < current_timestamp - interval '24 hours'
```
## Best Practices

1. **Materialization Strategy**
   - Use `table` for frequently queried data
   - Use `view` for rarely used transformations
   - Use `incremental` for large, append-only datasets

2. **Naming Conventions**
   - `stg_` for staging models
   - `int_` for intermediate models
   - `fct_` for fact tables
   - `dim_` for dimension tables

3. **Data Quality**
   - Add tests to all models
   - Document columns
   - Use sources for raw data

4. **Performance**
   - Materialize frequently used aggregations
   - Use incremental models for large datasets
   - Add indexes where needed

5. **Version Control**
   - Commit dbt models
   - Version dbt_project.yml
   - Document model changes