# dbt Integration Patterns

Guide to combining dbt with MXCP for data transformation pipelines.

## Why dbt + MXCP?

**dbt creates the tables → MXCP queries them**

This pattern provides:
- Data transformation and quality checks in dbt
- Fast local caching of external data
- SQL queries against tables that dbt has already materialized
- Consistent data contracts

## Setup

### 1. Enable dbt in MXCP

```yaml
# mxcp-site.yml
dbt:
  enabled: true
  model_paths: ["models"]
```

### 2. Create dbt Project

```bash
dbt init
```

### 3. Configure dbt Profile

```yaml
# profiles.yml (auto-generated by mxcp dbt-config)
covid_owid:
  outputs:
    dev:
      type: duckdb
      path: data.duckdb
  target: dev
```

## Basic Pattern

### dbt Model

Create `models/sales_summary.sql`:
```sql
{{ config(materialized='table') }}

SELECT 
  region,
  DATE_TRUNC('month', sale_date) as month,
  SUM(amount) as total_sales,
  COUNT(*) as transaction_count
FROM {{ source('raw', 'sales_data') }}
GROUP BY region, month
```

### Run dbt

```bash
mxcp dbt run
# or directly: dbt run
```

### MXCP Tool Queries Table

Create `tools/monthly_sales.yml`:
```yaml
mxcp: 1
tool:
  name: monthly_sales
  description: "Get monthly sales summary"
  parameters:
    - name: region
      type: string
  return:
    type: array
  source:
    code: |
      SELECT * FROM sales_summary
      WHERE region = $region
      ORDER BY month DESC
```

## External Data Caching

### Fetch and Cache External Data

```sql
-- models/covid_data.sql
{{ config(materialized='table') }}

SELECT *
FROM read_csv_auto('https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv')
```

Run the model once to download the CSV and cache it as a local table:
```bash
mxcp dbt run
```

### Query Cached Data

```yaml
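# tools/covid_stats.yml
mxcp: 1
tool:
  name: covid_stats
  description: "Get the 30 most recent daily COVID-19 rows for a country"
  parameters:
    - name: country
      type: string
      description: "Location name as it appears in the OWID data"
  return:
    type: array
  source:
    code: |
      SELECT location, date, total_cases, new_cases
      FROM covid_data
      WHERE location = $country
      ORDER BY date DESC
      LIMIT 30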

## Incremental Models

### Incremental Updates

```sql
-- models/events_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT *
FROM read_json('https://api.example.com/events')

{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```
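
To make the incremental logic concrete, here is a rough Python sketch of what happens on each run. It uses the standard-library `sqlite3` module as a stand-in for DuckDB, and the `run_incremental` helper and sample rows are hypothetical. It only illustrates the idea (the first run loads everything, later runs keep rows newer than the stored `MAX(created_at)` and upsert on `event_id`); it is not how dbt is implemented.

```python
import sqlite3

def run_incremental(conn, source_rows):
    """Mimic one incremental run: load only rows newer than what is stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events_incremental ("
        "event_id INTEGER PRIMARY KEY, created_at TEXT)"
    )
    # Equivalent of: SELECT MAX(created_at) FROM {{ this }}
    (high_water_mark,) = conn.execute(
        "SELECT MAX(created_at) FROM events_incremental"
    ).fetchone()

    if high_water_mark is None:
        new_rows = list(source_rows)  # first run: no filter applied
    else:
        # ISO-8601 date strings compare correctly as plain text
        new_rows = [r for r in source_rows if r[1] > high_water_mark]

    # unique_key='event_id': replace the row if its key already exists
    conn.executemany(
        "INSERT OR REPLACE INTO events_incremental (event_id, created_at) "
        "VALUES (?, ?)",
        new_rows,
    )
    conn.commit()
    return len(new_rows)

conn = sqlite3.connect(":memory:")
first = run_incremental(conn, [(1, "2024-01-01"), (2, "2024-01-02")])
second = run_incremental(
    conn, [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")]
)
total = conn.execute("SELECT COUNT(*) FROM events_incremental").fetchone()[0]
print(first, second, total)  # prints: 2 1 3
```

One consequence of filtering on `created_at`: a row that arrives late with an older timestamp is skipped until the model is rebuilt from scratch.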

## Sources and References

### Define Sources

```yaml
# models/sources.yml
version: 2

sources:
  - name: raw
    tables:
      - name: sales_data
      - name: customer_data
```

### Reference Models

```sql
-- models/customer_summary.sql
{{ config(materialized='table') }}

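-- sales_summary is aggregated by region and month and has no customer_id,
-- so the join key is region. Assumes raw.customer_data has a region column.
WITH customers AS (
  SELECT * FROM {{ source('raw', 'customer_data') }}
),
sales AS (
  SELECT * FROM {{ ref('sales_summary') }}
)
SELECT
  c.customer_id,
  c.name,
  s.month,
  s.total_sales AS region_total_sales
FROM customers c
JOIN sales s ON c.region = s.region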
```

## Data Quality Tests

### dbt Tests

```yaml
# models/schema.yml
version: 2

models:
  - name: sales_summary
    columns:
      - name: region
        tests:
          - not_null
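      - name: total_sales
        tests:
          - not_null
          - positive_value  # custom generic test; must be defined in the project
      - name: month
        tests:
          - not_null  # month repeats across regions, so `unique` would fail here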
```
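
dbt's generic tests compile to queries that select failing rows, and a test passes when the query returns nothing. The sketch below reproduces that idea in Python, with `sqlite3` standing in for DuckDB; the sample rows and the `failing_rows` helper are made up for illustration. It also shows why uniqueness for `sales_summary` holds on the `(region, month)` pair and not on `month` alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_summary (region TEXT, month TEXT, total_sales REAL)"
)
conn.executemany(
    "INSERT INTO sales_summary VALUES (?, ?, ?)",
    [
        ("EU", "2024-01-01", 100.0),
        ("US", "2024-01-01", 250.0),  # same month, different region
        ("EU", "2024-02-01", 120.0),
    ],
)

def failing_rows(sql):
    """Run a test query; an empty result means the test passes."""
    return conn.execute(sql).fetchall()

# not_null on region: select the rows where the column is NULL
not_null_failures = failing_rows(
    "SELECT region FROM sales_summary WHERE region IS NULL"
)

# unique on month alone: select values that occur more than once
unique_month_failures = failing_rows(
    "SELECT month, COUNT(*) FROM sales_summary "
    "WHERE month IS NOT NULL GROUP BY month HAVING COUNT(*) > 1"
)

# uniqueness on the model's actual grain: (region, month)
unique_grain_failures = failing_rows(
    "SELECT region, month, COUNT(*) FROM sales_summary "
    "GROUP BY region, month HAVING COUNT(*) > 1"
)

print(not_null_failures)      # [] -> passes
print(unique_month_failures)  # [('2024-01-01', 2)] -> fails
print(unique_grain_failures)  # [] -> passes
```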

### Run Tests

```bash
mxcp dbt test
```

## Complete Workflow

### 1. Development

```bash
# Create/modify dbt models
vim models/new_analysis.sql

# Run transformations
mxcp dbt run --select new_analysis

# Test data quality
mxcp dbt test --select new_analysis

# Create MXCP endpoint
vim tools/new_endpoint.yml
```

### 2. Testing

```bash
# Validate MXCP endpoint
mxcp validate

# Test endpoint
mxcp test tool new_endpoint
```

### 3. Production

```bash
# Run dbt in production
mxcp dbt run --profile production

# Start MXCP server
mxcp serve --profile production
```
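
The development and testing steps can be chained so that a failure stops the sequence early. This is a minimal Python sketch that only reuses the commands shown above; the `pipeline_commands` and `run_pipeline` helpers are hypothetical names, not part of MXCP or dbt.

```python
import subprocess

def pipeline_commands(model, tool):
    """Build the development and testing commands from the steps above, in order."""
    return [
        ["mxcp", "dbt", "run", "--select", model],
        ["mxcp", "dbt", "test", "--select", model],
        ["mxcp", "validate"],
        ["mxcp", "test", "tool", tool],
    ]

def run_pipeline(model, tool):
    """Run each step in order; check=True raises on the first failing command."""
    for cmd in pipeline_commands(model, tool):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

# Preview the commands without executing anything
for cmd in pipeline_commands("new_analysis", "new_endpoint"):
    print(" ".join(cmd))

# To execute for real (requires mxcp on PATH):
# run_pipeline("new_analysis", "new_endpoint")
```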

## Advanced Patterns

### Multi-Source Aggregation

```sql
-- models/unified_metrics.sql
{{ config(materialized='table') }}

WITH external_data AS (
  SELECT * FROM read_json('https://api.example.com/metrics')
),
internal_data AS (
  SELECT * FROM {{ source('raw', 'internal_metrics') }}
),
third_party AS (
  SELECT * FROM read_parquet('s3://bucket/data/*.parquet')
)
SELECT * FROM external_data
UNION ALL
SELECT * FROM internal_data
UNION ALL
SELECT * FROM third_party
```
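
Note that `UNION ALL` with `SELECT *` pairs columns by position, not by name, so the three sources need the same columns in the same order. The following Python sketch shows the pitfall with `sqlite3` standing in for DuckDB; the table layouts and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE external_data (metric TEXT, value REAL)")
conn.execute("CREATE TABLE internal_data (value REAL, metric TEXT)")  # swapped
conn.execute("INSERT INTO external_data VALUES ('latency', 1.5)")
conn.execute("INSERT INTO internal_data VALUES (2.5, 'errors')")

# SELECT * pairs columns by position, so the second row ends up with
# the number in the 'metric' slot and the name in the 'value' slot
by_position = conn.execute(
    "SELECT * FROM external_data UNION ALL SELECT * FROM internal_data"
).fetchall()

# Listing columns explicitly keeps each value in the right column
explicit = conn.execute(
    "SELECT metric, value FROM external_data "
    "UNION ALL SELECT metric, value FROM internal_data"
).fetchall()

print(by_position)
print(explicit)
```

DuckDB is stricter about types than SQLite and may reject or cast mismatched columns instead of mixing them silently. Listing columns explicitly in each branch avoids the problem either way, and DuckDB also offers `UNION ALL BY NAME` for matching by column name.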

### Dynamic Caching Strategy

```sql
-- models/live_dashboard.sql
{{ config(
  materialized='table',
  post_hook="ANALYZE"
) }}

-- Recent data: re-fetched from the API each time this model is rebuilt (schedule hourly)
SELECT * FROM read_json('https://api.metrics.com/live')
WHERE timestamp >= current_timestamp - interval '24 hours'

UNION ALL

-- Historical data: read from the historical_metrics model (rebuilt daily)
SELECT * FROM {{ ref('historical_metrics') }}
WHERE timestamp < current_timestamp - interval '24 hours'
```

## Best Practices

1. **Materialization Strategy**
   - Use `table` for frequently queried data
   - Use `view` for rarely used transformations
   - Use `incremental` for large, append-only datasets

2. **Naming Conventions**
   - `stg_` for staging models
   - `int_` for intermediate models
   - `fct_` for fact tables
   - `dim_` for dimension tables

3. **Data Quality**
   - Add tests to all models
   - Document columns
   - Use sources for raw data

4. **Performance**
   - Materialize frequently used aggregations
   - Use incremental for large datasets
   - Add indexes where needed

5. **Version Control**
   - Commit dbt models
   - Keep `dbt_project.yml` under version control
   - Document model changes
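
The naming conventions above are easy to check mechanically. As a rough sketch, the following Python snippet flags model files that carry none of the listed prefixes; the `model_layer` and `unprefixed_models` helpers and the file names are hypothetical.

```python
from pathlib import Path

# Prefixes from the naming conventions above, mapped to the layer they denote
LAYER_BY_PREFIX = {
    "stg_": "staging",
    "int_": "intermediate",
    "fct_": "fact",
    "dim_": "dimension",
}

def model_layer(filename):
    """Return the layer implied by a model file name, or None if no prefix matches."""
    stem = Path(filename).stem
    for prefix, layer in LAYER_BY_PREFIX.items():
        if stem.startswith(prefix):
            return layer
    return None

def unprefixed_models(filenames):
    """List model files that do not follow any of the prefixes."""
    return [f for f in filenames if model_layer(f) is None]

models = ["stg_sales.sql", "fct_monthly_sales.sql", "sales_summary.sql"]
print(unprefixed_models(models))  # ['sales_summary.sql']
```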