gh-raw-labs-claude-code-mar…/skills/mxcp-expert/references/dbt-patterns.md
2025-11-30 08:49:50 +08:00

dbt Integration Patterns

Guide to combining dbt with MXCP for data transformation pipelines.

Why dbt + MXCP?

dbt creates the tables → MXCP queries them

This pattern provides:

  • Data transformation and quality in dbt
  • Fast local caching of external data
  • SQL queries against tables that dbt has already materialized
  • Consistent data contracts

Setup

1. Enable dbt in MXCP

# mxcp-site.yml
dbt:
  enabled: true
  model_paths: ["models"]

2. Create dbt Project

dbt init

3. Configure dbt Profile

# profiles.yml (auto-generated by mxcp dbt-config)
covid_owid:
  outputs:
    dev:
      type: duckdb
      path: data.duckdb
  target: dev

Basic Pattern

dbt Model

Create models/sales_summary.sql:

{{ config(materialized='table') }}

SELECT 
  region,
  DATE_TRUNC('month', sale_date) as month,
  SUM(amount) as total_sales,
  COUNT(*) as transaction_count
FROM {{ source('raw', 'sales_data') }}
GROUP BY region, month

Run dbt

mxcp dbt run
# or directly: dbt run

MXCP Tool Queries Table

Create tools/monthly_sales.yml:

mxcp: 1
tool:
  name: monthly_sales
  description: "Get monthly sales summary"
  parameters:
    - name: region
      type: string
  return:
    type: array
  source:
    code: |
      SELECT * FROM sales_summary
      WHERE region = $region
      ORDER BY month DESC

External Data Caching

Fetch and Cache External Data

-- models/covid_data.sql
{{ config(materialized='table') }}

SELECT *
FROM read_csv_auto('https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv')

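Run dbt once to download the file and materialize it as a local table (re-run whenever the cache should be refreshed):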

mxcp dbt run
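
After the run, a quick query against the DuckDB file can confirm that the download was materialized. This is only a sanity-check sketch; it assumes the `location` and `date` columns that the tool in this section selects:

```sql
-- Ad-hoc check, e.g. from the DuckDB CLI opened on data.duckdb
SELECT
  COUNT(*)                 AS row_count,
  COUNT(DISTINCT location) AS locations,
  MAX(date)                AS latest_date
FROM covid_data;
```

A zero `row_count` or a stale `latest_date` suggests the model needs to be run again.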

Query Cached Data

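# tools/covid_stats.yml
mxcp: 1
tool:
  name: covid_stats
  description: "Get the 30 most recent daily case counts for a country"
  parameters:
    - name: country
      type: string
  return:
    type: array
  source:
    code: |
      SELECT location, date, total_cases, new_cases
      FROM covid_data
      WHERE location = $country
      ORDER BY date DESC
      LIMIT 30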

Incremental Models

Incremental Updates

-- models/events_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT *
FROM read_json('https://api.example.com/events')

{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
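
On an incremental run, `{{ this }}` refers to the table built by the previous run. If that table exists but is empty, `MAX(created_at)` is NULL and the filter drops every row. One common guard is a `COALESCE` fallback. The sketch below is a variant of the model above; it assumes `created_at` is parsed as a timestamp, and the fallback value is arbitrary:

```sql
-- models/events_incremental.sql (variant with an empty-table guard)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT *
FROM read_json('https://api.example.com/events')

{% if is_incremental() %}
-- Fall back to a very old timestamp when the existing table has no rows.
WHERE created_at > (
  SELECT COALESCE(MAX(created_at), TIMESTAMP '1900-01-01 00:00:00')
  FROM {{ this }}
)
{% endif %}
```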

Sources and References

Define Sources

# models/sources.yml
version: 2

sources:
  - name: raw
    tables:
      - name: sales_data
      - name: customer_data
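
For `{{ source('raw', 'sales_data') }}` to resolve, the table has to exist in the DuckDB file before dbt runs. When a source entry has no `schema` key, dbt uses the source name as the schema, so the declaration above points at `raw.sales_data` and `raw.customer_data`. A minimal sketch of loading them, assuming hypothetical CSV files under `data/`:

```sql
-- Run once against data.duckdb (for example from the DuckDB CLI).
-- The file paths are placeholders, not part of the original project.
CREATE SCHEMA IF NOT EXISTS raw;

CREATE OR REPLACE TABLE raw.sales_data AS
SELECT * FROM read_csv_auto('data/sales.csv');

CREATE OR REPLACE TABLE raw.customer_data AS
SELECT * FROM read_csv_auto('data/customers.csv');
```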

Reference Models

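-- models/customer_summary.sql
-- Assumes the referenced model exposes a customer_id column. sales_summary as
-- defined under "Basic Pattern" is grouped by region and month only, so add
-- customer_id to its SELECT and GROUP BY (or ref a per-customer model) first.
{{ config(materialized='table') }}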

WITH customers AS (
  SELECT * FROM {{ source('raw', 'customer_data') }}
),
sales AS (
  SELECT * FROM {{ ref('sales_summary') }}
)
SELECT 
  c.customer_id,
  c.name,
  s.total_sales
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id

Data Quality Tests

dbt Tests

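# models/schema.yml
version: 2

models:
  - name: sales_summary
    columns:
      - name: region
        tests:
          - not_null
      - name: total_sales
        tests:
          - not_null
          # Not a built-in dbt test; requires a custom generic test
          # named positive_value to be defined in the project.
          - positive_value
      - name: month
        tests:
          # month repeats once per region, so `unique` would fail here.
          - not_null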
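
`not_null` and `unique` ship with dbt, but `positive_value` does not, so the schema above only works if the project defines a generic test with that name. A minimal sketch, assuming the test lives at `tests/generic/positive_value.sql` and that "positive" means strictly greater than zero:

```sql
-- tests/generic/positive_value.sql
-- A dbt test passes when its query returns zero rows,
-- so this selects the rows that violate the rule.
{% test positive_value(model, column_name) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0

{% endtest %}
```

NULL values are not caught by this comparison, which is why the column is also covered by `not_null`.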

Run Tests

mxcp dbt test

Complete Workflow

1. Development

# Create/modify dbt models
vim models/new_analysis.sql

# Run transformations
mxcp dbt run --select new_analysis

# Test data quality
mxcp dbt test --select new_analysis

# Create MXCP endpoint
vim tools/new_endpoint.yml

2. Testing

# Validate MXCP endpoint
mxcp validate

# Test endpoint
mxcp test tool new_endpoint

3. Production

# Run dbt in production
mxcp dbt run --profile production

# Start MXCP server
mxcp serve --profile production

Advanced Patterns

Multi-Source Aggregation

-- models/unified_metrics.sql
-- internal_metrics must also be listed under the raw source in models/sources.yml.
{{ config(materialized='table') }}

WITH external_data AS (
  SELECT * FROM read_json('https://api.example.com/metrics')
),
internal_data AS (
  SELECT * FROM {{ source('raw', 'internal_metrics') }}
),
third_party AS (
  SELECT * FROM read_parquet('s3://bucket/data/*.parquet')
)
SELECT * FROM external_data
UNION ALL
SELECT * FROM internal_data
UNION ALL
SELECT * FROM third_party
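
`UNION ALL` pairs columns by position, so `SELECT *` across three independent inputs only works while they all keep the same column order and types. A more defensive sketch projects an explicit column list from each input. The column names `metric_name`, `metric_value`, and `recorded_at`, and the added `origin` label, are hypothetical:

```sql
-- models/unified_metrics.sql (variant with explicit columns)
{{ config(materialized='table') }}

WITH external_data AS (
  SELECT metric_name, metric_value, recorded_at, 'external' AS origin
  FROM read_json('https://api.example.com/metrics')
),
internal_data AS (
  SELECT metric_name, metric_value, recorded_at, 'internal' AS origin
  FROM {{ source('raw', 'internal_metrics') }}
),
third_party AS (
  SELECT metric_name, metric_value, recorded_at, 'third_party' AS origin
  FROM read_parquet('s3://bucket/data/*.parquet')
)
SELECT * FROM external_data
UNION ALL
SELECT * FROM internal_data
UNION ALL
SELECT * FROM third_party
```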

Dynamic Caching Strategy

-- models/live_dashboard.sql
-- ANALYZE refreshes DuckDB's table statistics after each build.
{{ config(
  materialized='table',
  post_hook="ANALYZE"
) }}

-- Recent data (refresh hourly)
SELECT * FROM read_json('https://api.metrics.com/live')
WHERE timestamp >= current_timestamp - interval '24 hours'

UNION ALL

-- Historical data (cached daily)
SELECT * FROM {{ ref('historical_metrics') }}
WHERE timestamp < current_timestamp - interval '24 hours'

Best Practices

  1. Materialization Strategy

    • Use table for frequently queried data
    • Use view for rarely used transformations
    • Use incremental for large, append-only datasets
  2. Naming Conventions

    • stg_ for staging models
    • int_ for intermediate models
    • fct_ for fact tables
    • dim_ for dimension tables
  3. Data Quality

    • Add tests to all models
    • Document columns
    • Use sources for raw data
  4. Performance

    • Materialize frequently used aggregations
    • Use incremental for large datasets
    • Add indexes where needed
  5. Version Control

    • Commit dbt models
    • Version dbt_project.yml
    • Document model changes
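
As an illustration of the materialization and naming points above, a staging model can be a lightweight view that only casts and filters columns, leaving heavier aggregation to a downstream table. A sketch, assuming a hypothetical file `models/staging/stg_sales.sql` and the raw `sales_data` columns used earlier in this guide:

```sql
-- models/staging/stg_sales.sql
-- View: cheap to build, recomputed each time it is queried.
{{ config(materialized='view') }}

SELECT
  region,
  CAST(sale_date AS DATE) AS sale_date,
  CAST(amount AS DOUBLE)  AS amount
FROM {{ source('raw', 'sales_data') }}
WHERE amount IS NOT NULL
```

A table model such as `sales_summary` could then select from `{{ ref('stg_sales') }}` instead of reading the source directly.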