# dbt Integration Patterns

Guide to combining dbt with MXCP for data transformation pipelines.

## Why dbt + MXCP?

**dbt creates the tables → MXCP queries them**

This pattern provides:
- Data transformation and quality in dbt
- Fast local caching of external data
- SQL queries against materialized views
- Consistent data contracts

## Setup

### 1. Enable dbt in MXCP

```yaml
# mxcp-site.yml
dbt:
  enabled: true
  model_paths: ["models"]
```

### 2. Create dbt Project

```bash
dbt init
```

### 3. Configure dbt Profile

```yaml
# profiles.yml (auto-generated by mxcp dbt-config)
covid_owid:
  outputs:
    dev:
      type: duckdb
      path: data.duckdb
  target: dev
```

## Basic Pattern

### dbt Model

Create `models/sales_summary.sql`:
```sql
{{ config(materialized='table') }}

SELECT
    region,
    DATE_TRUNC('month', sale_date) AS month,
    SUM(amount) AS total_sales,
    COUNT(*) AS transaction_count
FROM {{ source('raw', 'sales_data') }}
GROUP BY region, month
```

### Run dbt

```bash
mxcp dbt run
# or directly: dbt run
```

### MXCP Tool Queries Table

Create `tools/monthly_sales.yml`:
```yaml
mxcp: 1
tool:
  name: monthly_sales
  description: "Get monthly sales summary"
  parameters:
    - name: region
      type: string
  return:
    type: array
  source:
    code: |
      SELECT * FROM sales_summary
      WHERE region = $region
      ORDER BY month DESC
```
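
The two halves of the pattern can be tried without dbt or MXCP installed. The sketch below is only an illustration, not how either tool works internally. It assumes Python's built-in `sqlite3` as a stand-in for DuckDB, `strftime('%Y-%m-01', ...)` in place of `DATE_TRUNC('month', ...)`, SQLite's `:region` placeholder in place of `$region`, and made-up sample rows.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the DuckDB file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data (region TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?, ?)",
    [
        ("EMEA", "2024-01-05", 100.0),
        ("EMEA", "2024-01-20", 50.0),
        ("EMEA", "2024-02-03", 75.0),
        ("APAC", "2024-01-11", 20.0),
    ],
)

# Step 1 (the dbt side): materialize the aggregate as a table.
# strftime('%Y-%m-01', ...) plays the role of DATE_TRUNC('month', ...).
conn.execute(
    """
    CREATE TABLE sales_summary AS
    SELECT region,
           strftime('%Y-%m-01', sale_date) AS month,
           SUM(amount) AS total_sales,
           COUNT(*) AS transaction_count
    FROM sales_data
    GROUP BY region, month
    """
)


def monthly_sales(region):
    """Step 2 (the MXCP side): a parameterized read against the built table."""
    return conn.execute(
        "SELECT * FROM sales_summary WHERE region = :region ORDER BY month DESC",
        {"region": region},
    ).fetchall()
```

The read side never touches `sales_data`; it only sees the table that the transformation step produced, which is the division of labor the pattern relies on.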

## External Data Caching

### Fetch and Cache External Data

```sql
-- models/covid_data.sql
{{ config(materialized='table') }}

SELECT *
FROM read_csv_auto('https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv')
```

Run once to cache:
```bash
mxcp dbt run
```
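
How often to re-run depends on how fresh the data needs to be. As a rough sketch, assuming the DuckDB file is `data.duckdb` (the path in the profile above) and that the file's modification time is an acceptable freshness signal, a small Python helper can decide when to call `mxcp dbt run` again. The function names and the 24-hour default are hypothetical.

```python
import os
import subprocess
import time


def cache_is_stale(db_path, max_age_hours, now=None):
    """Return True if db_path is missing or older than max_age_hours."""
    if not os.path.exists(db_path):
        return True
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(db_path)
    return age_seconds > max_age_hours * 3600


def refresh_if_stale(db_path="data.duckdb", max_age_hours=24, run=subprocess.run):
    """Re-run the dbt models when the cache is stale; return True if a run happened."""
    if cache_is_stale(db_path, max_age_hours):
        # check=True raises if the command exits with a non-zero status.
        run(["mxcp", "dbt", "run"], check=True)
        return True
    return False
```

The `run` parameter exists only so the command can be swapped for a stub when trying the logic without MXCP installed.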

### Query Cached Data

```yaml
# tools/covid_stats.yml
mxcp: 1
tool:
  name: covid_stats
  description: "Get the 30 most recent data points for a country"
  parameters:
    # Declares the $country placeholder used in the query below
    - name: country
      type: string
  return:
    type: array
  source:
    code: |
      SELECT location, date, total_cases, new_cases
      FROM covid_data
      WHERE location = $country
      ORDER BY date DESC
      LIMIT 30
```

## Incremental Models

### Incremental Updates

```sql
-- models/events_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT *
FROM read_json('https://api.example.com/events')

{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```
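
The behavior of this model across runs can be approximated in a few lines of Python. This is a simplified sketch rather than dbt's implementation: `incremental_run` is a hypothetical helper, the table is modeled as a dict keyed by `event_id`, and an empty table is treated as a first run.

```python
def incremental_run(existing, source_rows, unique_key="event_id", cursor="created_at"):
    """Simulate one run of an incremental model.

    existing: dict mapping unique_key value -> row (the current table).
    source_rows: list of row dicts returned by the upstream query.
    Returns (merged_table, rows_loaded_this_run).
    """
    if existing:
        # Plays the role of the is_incremental() branch:
        # only rows newer than the current high-water mark are loaded.
        high_water = max(row[cursor] for row in existing.values())
        new_rows = [r for r in source_rows if r[cursor] > high_water]
    else:
        # First run (simplification): load everything.
        new_rows = list(source_rows)

    merged = dict(existing)
    for row in new_rows:
        # unique_key: a row with a known key replaces the old row
        # instead of being appended as a duplicate.
        merged[row[unique_key]] = row
    return merged, new_rows
```

For example, a second run over a source that returns the two original events plus one newer event loads only the newer one, and the table ends up with three rows.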

## Sources and References

### Define Sources

```yaml
# models/sources.yml
version: 2

sources:
  - name: raw
    tables:
      - name: sales_data
      - name: customer_data
```

### Reference Models

```sql
-- models/customer_summary.sql
{{ config(materialized='table') }}

WITH customers AS (
    SELECT * FROM {{ source('raw', 'customer_data') }}
),
sales AS (
    SELECT * FROM {{ ref('sales_summary') }}
)
SELECT
    c.customer_id,
    c.name,
    s.total_sales
FROM customers c
-- This join needs sales_summary to expose customer_id. The sales_summary model
-- shown earlier groups by region and month only, so add customer_id to it
-- (or join on a column both relations share) before using this model.
JOIN sales s ON c.customer_id = s.customer_id
```

## Data Quality Tests

### dbt Tests

```yaml
# models/schema.yml
version: 2

models:
  - name: sales_summary
    columns:
      - name: region
        tests:
          - not_null
      - name: total_sales
        tests:
          - not_null
          # positive_value is not a dbt built-in; it must be defined
          # in the project as a custom generic test
          - positive_value
      - name: month
        tests:
          # month repeats once per region, so `unique` on this column
          # alone would fail as soon as two regions share a month
          - not_null
```

### Run Tests

```bash
mxcp dbt test
```
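
A dbt data test is, in essence, a query that selects failing rows; the test passes when the query returns nothing. The sketch below illustrates that idea with Python's built-in `sqlite3` as a stand-in for DuckDB. The sample rows and the hand-written queries are assumptions for illustration, not the SQL dbt generates. It also shows why a `unique` check on `month` alone reports failures when two regions share a month, while a check on the `(region, month)` pair does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_summary (region TEXT, month TEXT, total_sales REAL)")
conn.executemany(
    "INSERT INTO sales_summary VALUES (?, ?, ?)",
    [
        ("EMEA", "2024-01-01", 150.0),
        ("APAC", "2024-01-01", 20.0),
        (None, "2024-02-01", 75.0),  # deliberately broken row
    ],
)

# Each "test" selects the rows that violate the rule.
NOT_NULL_REGION = "SELECT * FROM sales_summary WHERE region IS NULL"
UNIQUE_MONTH = """
    SELECT month, COUNT(*) FROM sales_summary
    WHERE month IS NOT NULL
    GROUP BY month HAVING COUNT(*) > 1
"""
UNIQUE_REGION_MONTH = """
    SELECT region, month, COUNT(*) FROM sales_summary
    GROUP BY region, month HAVING COUNT(*) > 1
"""


def failing_rows(connection, sql):
    """Run a test query; an empty result means the test passes."""
    return connection.execute(sql).fetchall()
```

Here `failing_rows(conn, NOT_NULL_REGION)` returns the row with the missing region, `UNIQUE_MONTH` flags January because two regions report it, and `UNIQUE_REGION_MONTH` returns no rows.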

## Complete Workflow

### 1. Development

```bash
# Create/modify dbt models
vim models/new_analysis.sql

# Run transformations
mxcp dbt run --select new_analysis

# Test data quality
mxcp dbt test --select new_analysis

# Create MXCP endpoint
vim tools/new_endpoint.yml
```

### 2. Testing

```bash
# Validate MXCP endpoint
mxcp validate

# Test endpoint
mxcp test tool new_endpoint
```

### 3. Production

```bash
# Run dbt in production
mxcp dbt run --profile production

# Start MXCP server
mxcp serve --profile production
```
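
The development and testing steps can be chained so that a failing step stops the sequence before anything is served. The following is a minimal sketch that uses only the commands shown above; `build_pipeline` and `run_pipeline` are hypothetical helper names, and the model and endpoint names are the placeholders from the workflow.

```python
import subprocess


def build_pipeline(model, endpoint):
    """Return the build-and-check steps as argv lists, in execution order."""
    return [
        ["mxcp", "dbt", "run", "--select", model],
        ["mxcp", "dbt", "test", "--select", model],
        ["mxcp", "validate"],
        ["mxcp", "test", "tool", endpoint],
    ]


def run_pipeline(model, endpoint, run=subprocess.run):
    """Execute each step in order, stopping at the first failure."""
    for cmd in build_pipeline(model, endpoint):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against a broken build.
        run(cmd, check=True)
```

Calling `run_pipeline("new_analysis", "new_endpoint")` runs the four commands in order. Passing a stub as `run` makes it possible to inspect the sequence without executing anything.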

## Advanced Patterns

### Multi-Source Aggregation

```sql
-- models/unified_metrics.sql
{{ config(materialized='table') }}

WITH external_data AS (
    SELECT * FROM read_json('https://api.example.com/metrics')
),
internal_data AS (
    SELECT * FROM {{ source('raw', 'internal_metrics') }}
),
third_party AS (
    SELECT * FROM read_parquet('s3://bucket/data/*.parquet')
)
-- UNION ALL matches columns by position, so all three sources must
-- return the same columns in the same order with compatible types
SELECT * FROM external_data
UNION ALL
SELECT * FROM internal_data
UNION ALL
SELECT * FROM third_party
```

### Dynamic Caching Strategy

```sql
-- models/live_dashboard.sql
-- The post_hook runs after the table is built; ANALYZE refreshes table statistics
{{ config(
    materialized='table',
    post_hook="ANALYZE"
) }}

-- Recent data: re-fetched on every dbt run (schedule the run hourly)
SELECT * FROM read_json('https://api.metrics.com/live')
WHERE timestamp >= current_timestamp - interval '24 hours'

UNION ALL

-- Historical data: read from the historical_metrics model (rebuilt daily)
SELECT * FROM {{ ref('historical_metrics') }}
WHERE timestamp < current_timestamp - interval '24 hours'
```

## Best Practices

1. **Materialization Strategy**
   - Use `table` for frequently queried data
   - Use `view` for rarely used transformations
   - Use `incremental` for large, append-only datasets

2. **Naming Conventions**
   - `stg_` for staging models
   - `int_` for intermediate models
   - `fct_` for fact tables
   - `dim_` for dimension tables

3. **Data Quality**
   - Add tests to all models
   - Document columns
   - Use sources for raw data

4. **Performance**
   - Materialize frequently used aggregations
   - Use incremental for large datasets
   - Add indexes where needed

5. **Version Control**
   - Commit dbt models
   - Version dbt_project.yml
   - Document model changes
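
The naming conventions in item 2 are easy to check mechanically. As a minimal sketch (the helper name is hypothetical, and it only looks at `.sql` file names), a few lines of Python can list models that lack one of the four prefixes:

```python
from pathlib import PurePosixPath

# Layer prefixes from the naming conventions above.
PREFIXES = ("stg_", "int_", "fct_", "dim_")


def unprefixed_models(paths):
    """Return the file names of .sql models that lack a known layer prefix."""
    names = [PurePosixPath(p).name for p in paths if p.endswith(".sql")]
    return [name for name in names if not name.startswith(PREFIXES)]
```

For instance, given `models/stg_orders.sql`, `models/sales_summary.sql`, and `models/schema.yml`, the helper reports only `sales_summary.sql`, since YAML files are ignored and `stg_orders.sql` already carries a prefix.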