# dbt Integration Patterns

Guide to combining dbt with MXCP for data transformation pipelines.

## Why dbt + MXCP?

**dbt creates the tables → MXCP queries them**

This pattern provides:
- Data transformation and quality checks in dbt
- Fast local caching of external data
- SQL queries against materialized tables and views
- Consistent data contracts

## Setup

### 1. Enable dbt in MXCP

```yaml
# mxcp-site.yml
dbt:
  enabled: true
  model_paths: ["models"]
```

### 2. Create dbt Project

```bash
dbt init
```

### 3. Configure dbt Profile

```yaml
# profiles.yml (auto-generated by mxcp dbt-config)
covid_owid:
  outputs:
    dev:
      type: duckdb
      path: data.duckdb
  target: dev
```

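The setup steps can be followed by a quick sanity check. The sketch below is illustrative only: it assumes `mxcp dbt-config` (named in the profile comment) regenerates the profile and uses dbt's standard `dbt debug` command to confirm that dbt can connect. `check_dbt_setup` is a hypothetical helper name, not part of MXCP or dbt.

```bash
# Hypothetical helper: regenerate the dbt profile, then verify the connection.
# Assumes `mxcp` and `dbt` are on PATH and the current directory is the project root.
check_dbt_setup() {
  mxcp dbt-config || { echo "profile generation failed" >&2; return 1; }
  dbt debug       || { echo "dbt cannot connect; check profiles.yml" >&2; return 1; }
  echo "dbt setup OK"
}
```

Calling `check_dbt_setup` after editing `mxcp-site.yml` should surface profile problems before any model is built.
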
## Basic Pattern

### dbt Model

Create `models/sales_summary.sql`:
```sql
{{ config(materialized='table') }}

SELECT
  region,
  DATE_TRUNC('month', sale_date) AS month,
  SUM(amount) AS total_sales,
  COUNT(*) AS transaction_count
FROM {{ source('raw', 'sales_data') }}
GROUP BY region, month
```

### Run dbt

```bash
mxcp dbt run
# or directly: dbt run
```

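After a run it can help to confirm that the table actually exists before wiring a tool to it. This is a minimal sketch, assuming the DuckDB command-line client (`duckdb`) is installed and that the database file is `data.duckdb` as in the example profile; `build_and_count` is a hypothetical helper name.

```bash
# Hypothetical helper: build one model, then print its row count.
# Assumes the DuckDB CLI is installed and the database file is data.duckdb
# (the path from the example profile); pass another path as the second argument.
build_and_count() {
  local model="$1" db="${2:-data.duckdb}"
  mxcp dbt run --select "$model" || return 1
  duckdb "$db" -c "SELECT COUNT(*) AS row_count FROM ${model};"
}
```

For example, `build_and_count sales_summary` builds the model above and queries it. If MXCP is configured to use a different database file, pass that path instead.
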
### MXCP Tool Queries Table

Create `tools/monthly_sales.yml`:
```yaml
mxcp: 1
tool:
  name: monthly_sales
  description: "Get monthly sales summary"
  parameters:
    - name: region
      type: string
  return:
    type: array
  source:
    code: |
      SELECT * FROM sales_summary
      WHERE region = $region
      ORDER BY month DESC
```

## External Data Caching

### Fetch and Cache External Data

```sql
-- models/covid_data.sql
{{ config(materialized='table') }}

SELECT *
FROM read_csv_auto('https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv')
```

Run once to cache:
```bash
mxcp dbt run
```

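Because the cache is only as fresh as the last run, a refresh can be skipped while the cached data is still recent. The sketch below is one possible approach, not an MXCP feature: it uses the modification time of the database file as a rough freshness signal, which covers the whole database rather than a single model. `refresh_if_stale` and its defaults are hypothetical.

```bash
# Hypothetical helper: rebuild the cached model only when the database file is
# missing or older than a given number of minutes.
# Assumes the cache lives in data.duckdb (the path from the example profile).
refresh_if_stale() {
  local model="$1" max_age_min="${2:-1440}" db="${3:-data.duckdb}"
  # `find -mmin -N` prints the file only if it was modified within the last N minutes.
  if [ -f "$db" ] && [ -n "$(find "$db" -mmin "-${max_age_min}")" ]; then
    echo "cache is fresh; skipping"
    return 0
  fi
  mxcp dbt run --select "$model"
}
```

For example, `refresh_if_stale covid_data 1440` re-downloads the CSV at most once a day.
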
### Query Cached Data

```yaml
# tools/covid_stats.yml
mxcp: 1
tool:
  name: covid_stats
  parameters:
    - name: country
      type: string
  source:
    code: |
      SELECT location, date, total_cases, new_cases
      FROM covid_data
      WHERE location = $country
      ORDER BY date DESC
      LIMIT 30
```

## Incremental Models

### Incremental Updates

```sql
-- models/events_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT *
FROM read_json('https://api.example.com/events')

{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```

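The incremental filter only picks up rows newer than what is already stored, so a change to the model logic or a correction to historical rows is not applied to existing data. A minimal sketch of forcing a rebuild with dbt's standard `--full-refresh` flag follows; `rebuild_incremental` is a hypothetical helper name, and calling `dbt` directly assumes the profile is already in place.

```bash
# Hypothetical helper: rebuild an incremental model from scratch.
# --full-refresh is a standard dbt flag that drops and recreates the table,
# so is_incremental() evaluates to false for that run.
rebuild_incremental() {
  local model="${1:-events_incremental}"
  dbt run --full-refresh --select "$model"
}
```

Routine runs stay incremental; `rebuild_incremental` is meant for the occasional reset.
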
## Sources and References

### Define Sources

```yaml
# models/sources.yml
version: 2

sources:
  - name: raw
    tables:
      - name: sales_data
      - name: customer_data
```

### Reference Models

```sql
-- models/customer_summary.sql
{{ config(materialized='table') }}

WITH customers AS (
  SELECT * FROM {{ source('raw', 'customer_data') }}
),
sales AS (
  SELECT * FROM {{ ref('sales_summary') }}
)
SELECT
  c.customer_id,
  c.name,
  s.total_sales
FROM customers c
-- The join key must exist in both inputs. sales_summary as defined earlier is
-- grouped by region and month, so it needs a customer_id column for this to run.
JOIN sales s ON c.customer_id = s.customer_id
```

## Data Quality Tests

### dbt Tests

```yaml
# models/schema.yml
version: 2

models:
  - name: sales_summary
    columns:
      - name: region
        tests:
          - not_null
      - name: total_sales
        tests:
          - not_null
          - positive_value  # custom generic test; not built into dbt, define it in the project
      - name: month
        tests:
          - not_null  # month repeats across regions, so `unique` would fail here
```

### Run Tests

```bash
mxcp dbt test
```

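Tests are most useful when they gate what gets served. The sketch below chains the build and test commands shown in this guide so that a failure in either step is reported with a distinct exit code; `build_and_test` is a hypothetical helper name.

```bash
# Hypothetical helper: build a model and run its tests, stopping at the first failure.
# Exit codes: 0 = built and tested, 1 = build failed, 2 = tests failed.
build_and_test() {
  local model="$1"
  mxcp dbt run --select "$model"  || { echo "build failed: $model" >&2; return 1; }
  mxcp dbt test --select "$model" || { echo "tests failed: $model" >&2; return 2; }
  echo "ok: $model"
}
```

For example, `build_and_test sales_summary && echo "safe to query"` only prints the message when both steps pass.
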
## Complete Workflow

### 1. Development

```bash
# Create/modify dbt models
vim models/new_analysis.sql

# Run transformations
mxcp dbt run --select new_analysis

# Test data quality
mxcp dbt test --select new_analysis

# Create MXCP endpoint
vim tools/new_endpoint.yml
```

### 2. Testing

```bash
# Validate MXCP endpoint
mxcp validate

# Test endpoint
mxcp test tool new_endpoint
```

### 3. Production

```bash
# Run dbt in production
mxcp dbt run --profile production

# Start MXCP server
mxcp serve --profile production
```

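The production steps can be wrapped so that the server only starts after the data has been rebuilt and checked. This is a sketch under assumptions: passing `--profile` to `mxcp dbt test` mirrors the `mxcp dbt run` usage above but is not confirmed by this guide, and `deploy` is a hypothetical helper name.

```bash
# Hypothetical helper: rebuild, test, validate, then serve, in that order.
# Each step must succeed before the next one runs.
deploy() {
  local profile="${1:-production}"
  mxcp dbt run  --profile "$profile" || return 1
  mxcp dbt test --profile "$profile" || return 1  # assumption: same flag as `dbt run`
  mxcp validate                      || return 1
  mxcp serve --profile "$profile"
}
```

Running `deploy` with no argument targets the `production` profile; any failing step stops the sequence before `mxcp serve`.
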
## Advanced Patterns

### Multi-Source Aggregation

```sql
-- models/unified_metrics.sql
{{ config(materialized='table') }}

-- UNION ALL matches columns by position, so all three sources must expose
-- the same columns in the same order.
WITH external_data AS (
  SELECT * FROM read_json('https://api.example.com/metrics')
),
internal_data AS (
  -- internal_metrics must be declared under the raw source in sources.yml
  SELECT * FROM {{ source('raw', 'internal_metrics') }}
),
third_party AS (
  SELECT * FROM read_parquet('s3://bucket/data/*.parquet')
)
SELECT * FROM external_data
UNION ALL
SELECT * FROM internal_data
UNION ALL
SELECT * FROM third_party
```

### Dynamic Caching Strategy

```sql
-- models/live_dashboard.sql
-- ANALYZE refreshes DuckDB's table statistics after the build
{{ config(
    materialized='table',
    post_hook="ANALYZE"
) }}

-- Recent data (refresh hourly)
SELECT * FROM read_json('https://api.metrics.com/live')
WHERE timestamp >= current_timestamp - interval '24 hours'

UNION ALL

-- Historical data (cached daily)
SELECT * FROM {{ ref('historical_metrics') }}
WHERE timestamp < current_timestamp - interval '24 hours'
```

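The hourly and daily cadences in the comments above are not enforced by dbt itself; something has to trigger the runs. One way to express the two schedules is a small dispatcher that a scheduler such as cron could call. This is only a sketch: `refresh_dashboard` is a hypothetical helper, and it calls `dbt` directly with dbt's space-separated `--select` syntax.

```bash
# Hypothetical helper: pick which models to rebuild for a given cadence.
# hourly -> only the live model; daily -> the historical model and the live model on top of it.
refresh_dashboard() {
  case "$1" in
    hourly) dbt run --select live_dashboard ;;
    daily)  dbt run --select historical_metrics live_dashboard ;;
    *)      echo "usage: refresh_dashboard hourly|daily" >&2; return 2 ;;
  esac
}
```

dbt orders the selected models by their `ref()` dependencies, so `historical_metrics` is built before `live_dashboard` in the daily run.
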
## Best Practices

1. **Materialization Strategy**
   - Use `table` for frequently queried data
   - Use `view` for rarely used transformations
   - Use `incremental` for large, append-only datasets

2. **Naming Conventions**
   - `stg_` for staging models
   - `int_` for intermediate models
   - `fct_` for fact tables
   - `dim_` for dimension tables

3. **Data Quality**
   - Add tests to all models
   - Document columns
   - Use sources for raw data

4. **Performance**
   - Materialize frequently used aggregations
   - Use incremental for large datasets
   - Add indexes where needed

5. **Version Control**
   - Commit dbt models
   - Version `dbt_project.yml`
   - Document model changes

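The naming conventions above are easy to check mechanically. The sketch below lists model files that lack one of the four prefixes; `check_model_names` is a hypothetical helper, and the default `models` directory follows the `model_paths` setting shown in the setup section.

```bash
# Hypothetical check: report .sql model files without a stg_/int_/fct_/dim_ prefix.
# Returns 0 when every model is prefixed, 1 otherwise.
check_model_names() {
  local dir="${1:-models}" unprefixed
  unprefixed="$(find "$dir" -type f -name '*.sql' \
    ! -name 'stg_*' ! -name 'int_*' ! -name 'fct_*' ! -name 'dim_*' | sort)"
  if [ -n "$unprefixed" ]; then
    echo "models without a layer prefix:"
    echo "$unprefixed"
    return 1
  fi
  echo "all models follow the naming convention"
}
```

A check like this could run in CI or a pre-commit hook alongside `mxcp validate`.
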