13 KiB
dbt Core Guide for MXCP
Essential dbt (data build tool) knowledge for building MXCP servers.
What is dbt?
dbt (data build tool) is a transformation workflow tool that enables data analysts and engineers to transform data in their data warehouse using SQL SELECT statements. In MXCP, dbt transforms raw data into clean, queryable tables that MXCP endpoints can access.
Core principle: dbt creates the tables → MXCP queries them
Core Concepts
1. Seeds
Seeds are CSV files that dbt loads into your database as tables. They are perfect for:
- Static reference data (country codes, status mappings, etc.)
- Small lookup tables (<10,000 rows)
- User-provided data files that need to be queried
Location: Place CSV files in seeds/ directory
Loading seeds:
dbt seed # Load all seeds
dbt seed --select my_file # Load specific seed
mxcp dbt seed # Load via MXCP
Example use case: User provides customers.csv → dbt loads it as a table → MXCP tools query it
Critical for MXCP: Seeds are the primary way to make CSV files queryable via MXCP tools.
2. Models
Models transform data using either SQL or Python. Each .sql or .py file in models/ becomes a table or view.
SQL Models
SQL models are SELECT statements that transform data. Best for standard transformations, aggregations, and joins.
Basic SQL model (models/customer_summary.sql):
{{ config(materialized='table') }}
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spent
FROM {{ ref('orders') }}
GROUP BY customer_id
Python Models
Python models use pandas for complex data processing. Best for Excel files, ML preprocessing, and complex transformations.
Basic Python model (models/process_data.py):
import pandas as pd
def model(dbt, session):
# Load data from dbt ref or read files
# df = dbt.ref('source_table').to_pandas() # From dbt source
df = pd.read_excel('data/input.xlsx') # From file
# Transform using pandas
df = df.dropna(how='all')
df['new_column'] = df['amount'] * 1.1
return df # Returns DataFrame that becomes a table
When to use Python models:
- Processing Excel files with complex formatting
- Data cleaning requiring pandas operations (pivoting, melting, etc.)
- ML feature engineering or preprocessing
- Complex string manipulation or regex operations
- Integration with Python libraries (sklearn, numpy, etc.)
Materialization types (for both SQL and Python models):
table- Creates a table (fast queries, slower builds)view- Creates a view (slow queries, instant builds) - Not available for Python modelsincremental- Appends new data only (best for large datasets)
3. Schema Files (schema.yml)
Schema files define structure, tests, and documentation for seeds and models.
ALWAYS create schema.yml files - they are critical for:
- Type validation
- Data quality tests
- Documentation
- Column descriptions
Example (seeds/schema.yml):
version: 2
seeds:
- name: customers
description: "Customer master data from CSV upload"
columns:
- name: customer_id
description: "Unique customer identifier"
tests:
- unique
- not_null
- name: email
description: "Customer email address"
tests:
- not_null
- name: created_at
description: "Account creation timestamp"
data_type: timestamp
Example (models/schema.yml):
version: 2
models:
- name: customer_summary
description: "Aggregated customer metrics"
columns:
- name: customer_id
tests:
- unique
- not_null
- name: total_spent
tests:
- not_null
- name: order_count
tests:
- not_null
4. Sources
Sources represent raw data already in your database (not managed by dbt).
Example (models/sources.yml):
version: 2
sources:
- name: raw_data
tables:
- name: transactions
description: "Raw transaction data"
- name: users
description: "User accounts"
Use in models:
SELECT * FROM {{ source('raw_data', 'transactions') }}
MXCP + dbt Workflow
Pattern 1: CSV File to MXCP Tool
User request: "I need to query my sales.csv file"
Steps:
-
Create seed - Place
sales.csvinseeds/directory -
Create schema - Define structure in
seeds/schema.yml:version: 2 seeds: - name: sales description: "Sales data from CSV upload" columns: - name: sale_id tests: [unique, not_null] - name: amount tests: [not_null] - name: sale_date data_type: date tests: [not_null] - name: region tests: [not_null] -
Load seed:
dbt seed --select sales -
Create MXCP tool (
tools/get_sales.yml):mxcp: 1 tool: name: get_sales description: "Query sales data by region and date range" parameters: - name: region type: string required: false - name: start_date type: string format: date required: false - name: end_date type: string format: date required: false return: type: array items: type: object source: code: | SELECT * FROM sales WHERE ($region IS NULL OR region = $region) AND ($start_date IS NULL OR sale_date >= $start_date) AND ($end_date IS NULL OR sale_date <= $end_date) ORDER BY sale_date DESC -
Test:
dbt test --select sales mxcp validate mxcp test tool get_sales
Pattern 2: Transform Then Query
User request: "Analyze monthly sales trends from my CSV"
-
Seed the raw data (as above)
-
Create transformation model (
models/monthly_sales.sql):{{ config(materialized='table') }} SELECT region, DATE_TRUNC('month', sale_date) as month, SUM(amount) as total_sales, COUNT(*) as transaction_count, AVG(amount) as avg_sale FROM {{ ref('sales') }} GROUP BY region, month -
Create schema (
models/schema.yml):version: 2 models: - name: monthly_sales description: "Monthly sales aggregations" columns: - name: region tests: [not_null] - name: month tests: [not_null] - name: total_sales tests: [not_null] -
Run dbt:
dbt seed dbt run --select monthly_sales dbt test --select monthly_sales -
Create MXCP tool to query the model:
tool: name: monthly_trends source: code: | SELECT * FROM monthly_sales WHERE region = $region ORDER BY month DESC
Pattern 3: Excel Processing with Python Models
User request: "Process this Excel file with multiple sheets and complex formatting"
-
Create Python model (
models/process_excel.py):import pandas as pd def model(dbt, session): # Read Excel file df = pd.read_excel('data/sales_data.xlsx', sheet_name='Sales') # Clean data df = df.dropna(how='all') # Remove empty rows df = df.dropna(axis=1, how='all') # Remove empty columns # Normalize column names df.columns = df.columns.str.lower().str.replace(' ', '_') # Complex transformations using pandas df['sale_date'] = pd.to_datetime(df['sale_date']) df['month'] = df['sale_date'].dt.to_period('M').astype(str) # Aggregate data result = df.groupby(['region', 'month']).agg({ 'amount': 'sum', 'quantity': 'sum' }).reset_index() return result -
Create schema (
models/schema.yml):version: 2 models: - name: process_excel description: "Processed sales data from Excel" config: materialized: table columns: - name: region tests: [not_null] - name: month tests: [not_null] - name: amount tests: [not_null] -
Run the Python model:
dbt run --select process_excel dbt test --select process_excel -
Create MXCP tool to query:
mxcp: 1 tool: name: get_sales_by_region description: "Get sales data processed from Excel" parameters: - name: region type: string default: null source: code: | SELECT * FROM process_excel WHERE $region IS NULL OR region = $region ORDER BY month DESC
Project Structure
mxcp-project/
├── mxcp-site.yml # MXCP configuration
├── dbt_project.yml # dbt configuration
├── seeds/ # CSV files
│ ├── customers.csv
│ └── schema.yml # Seed schemas (REQUIRED)
├── models/ # SQL transformations
│ ├── staging/
│ ├── intermediate/
│ ├── marts/
│ └── schema.yml # Model schemas (REQUIRED)
├── tools/ # MXCP tools that query seeds/models
└── target/ # dbt build output (gitignored)
dbt Commands for MXCP
# Initialize dbt project
dbt init
# Load CSV seeds into database
dbt seed # Load all seeds
dbt seed --select sales # Load specific seed
# Run transformations
dbt run # Run all models
dbt run --select model_name # Run specific model
dbt run --select +model_name # Run model and upstream dependencies
# Test data quality
dbt test # Run all tests
dbt test --select sales # Test specific seed/model
# Documentation
dbt docs generate # Generate documentation
dbt docs serve # Serve documentation site
# Via MXCP wrapper
mxcp dbt seed
mxcp dbt run
mxcp dbt test
Schema.yml Best Practices
ALWAYS include these in schema.yml:
- Version declaration:
version: 2 - Description for every seed/model: Helps LLMs and humans understand purpose
- Column-level descriptions: Document what each field contains
- Data type declarations: Ensure proper typing (
data_type: timestamp, etc.) - Tests for key columns:
unique- No duplicatesnot_null- Required fieldaccepted_values- Enum validationrelationships- Foreign key validation
Example comprehensive schema.yml:
version: 2
seeds:
- name: employees
description: "Employee master data"
columns:
- name: employee_id
description: "Unique employee identifier"
data_type: varchar
tests:
- unique
- not_null
- name: department
description: "Department code"
data_type: varchar
tests:
- not_null
- accepted_values:
values: ['engineering', 'sales', 'marketing']
- name: salary
description: "Annual salary in USD"
data_type: decimal
tests:
- not_null
- name: hire_date
description: "Date of hire"
data_type: date
tests:
- not_null
DuckDB Integration
MXCP uses DuckDB as its default database. dbt can target DuckDB directly.
Auto-configured by MXCP - no manual setup needed:
# profiles.yml (auto-generated)
my_project:
outputs:
dev:
type: duckdb
path: "{{ env_var('MXCP_DUCKDB_PATH', 'data/db-default.duckdb') }}"
target: dev
DuckDB reads CSVs directly:
-- In dbt models, you can read CSVs without seeding
SELECT * FROM read_csv_auto('path/to/file.csv')
But prefer seeds for user data - they provide version control and validation.
Common Issues
Issue: Seed file not loading Solution: Check CSV format, ensure no special characters in filename, verify schema.yml exists
Issue: Model not found
Solution: Run dbt compile to check for syntax errors, ensure model is in models/ directory
Issue: Tests failing Solution: Review test output, check data quality, adjust tests or fix data
Issue: Type errors
Solution: Add explicit data_type declarations in schema.yml
Summary for MXCP Builders
When building MXCP servers:
- For CSV files → Use dbt seeds
- Always create
schema.ymlfiles with tests and types - Load with
dbt seed - Transform with dbt models if needed
- Query from MXCP tools using
SELECT * FROM <table> - Validate with
dbt testbefore deploying
This ensures data quality, type safety, and proper documentation for all data sources.