# dbt Integration Patterns
Guide to combining dbt with MXCP for data transformation pipelines.
## Why dbt + MXCP?
**dbt creates the tables → MXCP queries them**
This pattern provides:
- Data transformation and quality in dbt
- Fast local caching of external data
- SQL queries against materialized views
- Consistent data contracts
## Setup
### 1. Enable dbt in MXCP
```yaml
# mxcp-site.yml
dbt:
  enabled: true
  model_paths: ["models"]
```
### 2. Create dbt Project
```bash
dbt init
```
### 3. Configure dbt Profile
```yaml
# profiles.yml (auto-generated by mxcp dbt-config)
covid_owid:
  outputs:
    dev:
      type: duckdb
      path: data.duckdb
  target: dev
```
## Basic Pattern
### dbt Model
Create `models/sales_summary.sql`:
```sql
{{ config(materialized='table') }}

SELECT
    region,
    DATE_TRUNC('month', sale_date) AS month,
    SUM(amount) AS total_sales,
    COUNT(*) AS transaction_count
FROM {{ source('raw', 'sales_data') }}
GROUP BY region, month
```
### Run dbt
```bash
mxcp dbt run
# or directly: dbt run
```
### MXCP Tool Queries Table
Create `tools/monthly_sales.yml`:
```yaml
mxcp: 1
tool:
  name: monthly_sales
  description: "Get monthly sales summary"
  parameters:
    - name: region
      type: string
  return:
    type: array
  source:
    code: |
      SELECT * FROM sales_summary
      WHERE region = $region
      ORDER BY month DESC
```
## External Data Caching
### Fetch and Cache External Data
```sql
-- models/covid_data.sql
{{ config(materialized='table') }}
SELECT *
FROM read_csv_auto('https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv')
```
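Reading over HTTP relies on DuckDB's `httpfs` extension. Recent DuckDB builds autoload it, but with dbt-duckdb it can also be pinned explicitly in the profile, e.g.:

```yaml
# profiles.yml (excerpt)
covid_owid:
  outputs:
    dev:
      type: duckdb
      path: data.duckdb
      extensions:
        - httpfs
```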
Run once to cache:
```bash
mxcp dbt run
```
### Query Cached Data
```yaml
# tools/covid_stats.yml
tool:
  name: covid_stats
  source:
    code: |
      SELECT location, date, total_cases, new_cases
      FROM covid_data
      WHERE location = $country
      ORDER BY date DESC
      LIMIT 30
```
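The cached table only changes when the model is re-run, so a scheduled job keeps it fresh. A sketch using cron, assuming the project lives at `/srv/covid_owid` (hypothetical path):

```bash
# crontab entry: refresh the cached table daily at 02:00
0 2 * * * cd /srv/covid_owid && mxcp dbt run --select covid_data
```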
## Incremental Models
### Incremental Updates
```sql
-- models/events_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT *
FROM read_json('https://api.example.com/events')
{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```
## Sources and References
### Define Sources
```yaml
# models/sources.yml
version: 2

sources:
  - name: raw
    tables:
      - name: sales_data
      - name: customer_data
```
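Sources can also declare freshness expectations, so that `dbt source freshness` flags stale raw data. This assumes the raw table exposes a load-timestamp column (here `updated_at`, hypothetical):

```yaml
# models/sources.yml (extended)
sources:
  - name: raw
    tables:
      - name: sales_data
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}
```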
### Reference Models
```sql
-- models/customer_summary.sql
{{ config(materialized='table') }}

WITH customers AS (
    SELECT * FROM {{ source('raw', 'customer_data') }}
),

-- sales_summary is grained by region and month, so roll it up to one row per region
sales AS (
    SELECT region, SUM(total_sales) AS total_sales
    FROM {{ ref('sales_summary') }}
    GROUP BY region
)

SELECT
    c.customer_id,
    c.name,
    s.total_sales AS region_total_sales
FROM customers c
JOIN sales s ON c.region = s.region  -- assumes customer_data carries a region column
```
## Data Quality Tests
### dbt Tests
```yaml
# models/schema.yml
version: 2

models:
  - name: sales_summary
    columns:
      - name: region
        tests:
          - not_null
      - name: total_sales
        tests:
          - not_null
          - positive_value
      - name: month
        tests:
          - not_null  # month alone is not unique: the model is grained by region and month
```
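`positive_value` is not one of dbt's built-in generic tests (`unique`, `not_null`, `accepted_values`, `relationships`), so it has to be defined as a custom generic test. A minimal sketch in `tests/generic/positive_value.sql`:

```sql
-- tests/generic/positive_value.sql
-- The test fails if this query returns any rows, i.e. if any value is non-positive
{% test positive_value(model, column_name) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0

{% endtest %}
```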
### Run Tests
```bash
mxcp dbt test
```
## Complete Workflow
### 1. Development
```bash
# Create/modify dbt models
vim models/new_analysis.sql
# Run transformations
mxcp dbt run --select new_analysis
# Test data quality
mxcp dbt test --select new_analysis
# Create MXCP endpoint
vim tools/new_endpoint.yml
```
### 2. Testing
```bash
# Validate MXCP endpoint
mxcp validate
# Test endpoint
mxcp test tool new_endpoint
```
### 3. Production
```bash
# Run dbt in production
mxcp dbt run --profile production
# Start MXCP server
mxcp serve --profile production
```
## Advanced Patterns
### Multi-Source Aggregation
```sql
-- models/unified_metrics.sql
{{ config(materialized='table') }}

WITH external_data AS (
    SELECT * FROM read_json('https://api.example.com/metrics')
),

internal_data AS (
    SELECT * FROM {{ source('raw', 'internal_metrics') }}
),

third_party AS (
    SELECT * FROM read_parquet('s3://bucket/data/*.parquet')
)

SELECT * FROM external_data
UNION ALL
SELECT * FROM internal_data
UNION ALL
SELECT * FROM third_party
```
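`UNION ALL` requires every branch to yield the same columns in the same order, which `SELECT *` over three independently evolving feeds does not guarantee. Listing the shared columns explicitly is safer; the column names below are illustrative:

```sql
-- Illustrative shared schema; replace with the columns all three feeds actually share
SELECT metric_name, metric_value, recorded_at, 'external' AS origin FROM external_data
UNION ALL
SELECT metric_name, metric_value, recorded_at, 'internal' FROM internal_data
UNION ALL
SELECT metric_name, metric_value, recorded_at, 'third_party' FROM third_party
```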
### Dynamic Caching Strategy
```sql
-- models/live_dashboard.sql
{{ config(
    materialized='table',
    post_hook="PRAGMA optimize"
) }}

-- Recent data (refreshed hourly)
SELECT * FROM read_json('https://api.metrics.com/live')
WHERE timestamp >= current_timestamp - INTERVAL '24 hours'
UNION ALL
-- Historical data (cached daily)
SELECT * FROM {{ ref('historical_metrics') }}
WHERE timestamp < current_timestamp - INTERVAL '24 hours'
```
## Best Practices
1. **Materialization Strategy**
   - Use `table` for frequently queried data
   - Use `view` for rarely used transformations
   - Use `incremental` for large, append-only datasets
2. **Naming Conventions**
   - `stg_` for staging models
   - `int_` for intermediate models
   - `fct_` for fact tables
   - `dim_` for dimension tables
3. **Data Quality**
   - Add tests to all models
   - Document columns
   - Use sources for raw data
4. **Performance**
   - Materialize frequently used aggregations
   - Use incremental models for large datasets
   - Add indexes where needed
5. **Version Control**
   - Commit dbt models
   - Version `dbt_project.yml`
   - Document model changes
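The materialization and naming guidance above can be combined in a single small model: a view-materialized staging layer that renames raw columns. Column names here are illustrative:

```sql
-- models/staging/stg_sales.sql
{{ config(materialized='view') }}

SELECT
    id        AS sale_id,
    region,
    sale_date,
    amount
FROM {{ source('raw', 'sales_data') }}
```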