DuckDB Essentials for MXCP
Essential DuckDB knowledge for building MXCP servers with embedded analytics.
What is DuckDB?
DuckDB is an embedded, in-process SQL OLAP database - think "SQLite for analytics". It runs directly in your MXCP server process without needing a separate database server.
Key characteristics:
- Embedded: No server setup, no configuration
- Fast: Vectorized execution engine, parallel processing
- Versatile: Reads CSV, Parquet, JSON directly from disk or URLs
- SQL: Full SQL support with analytical extensions
- Portable: Single-file database, easy to move/backup
MXCP uses DuckDB by default for all SQL-based tools and resources.
Core Features for MXCP
1. Direct File Reading
DuckDB can query files without importing them first:
-- Query CSV directly
SELECT * FROM 'data/sales.csv'
-- Query with explicit reader
SELECT * FROM read_csv_auto('data/sales.csv')
-- Query Parquet
SELECT * FROM 'data/sales.parquet'
-- Query JSON
SELECT * FROM read_json_auto('data/events.json')
-- Query from a URL (remote HTTP(S) reads use the httpfs extension, which DuckDB can install and load automatically)
SELECT * FROM 'https://example.com/data.csv'
Auto-detection: DuckDB automatically infers:
- Column names from headers
- Data types from values
- CSV delimiters, quotes, etc.
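To check what was inferred before relying on it, you can DESCRIBE the query; newer DuckDB releases also expose a sniff_csv table function that reports the detected dialect. This sketch reuses the data/sales.csv path from above:
-- Inspect the schema DuckDB inferred for a CSV
DESCRIBE SELECT * FROM 'data/sales.csv'
-- Newer DuckDB versions can also report the detected delimiter, quoting, and column types
SELECT * FROM sniff_csv('data/sales.csv')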
2. CSV Import and Export
Import CSV to table:
-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales.csv')
-- Or load into an existing table with COPY
COPY sales FROM 'sales.csv' (AUTO_DETECT TRUE)
Export to CSV:
-- Export query results
COPY (SELECT * FROM sales WHERE region = 'US')
TO 'us_sales.csv' (HEADER, DELIMITER ',')
CSV reading options:
SELECT * FROM read_csv_auto(
'data.csv',
header = true,
delim = ',',
quote = '"',
dateformat = '%Y-%m-%d'
)
3. Data Types
Common DuckDB types (important for MXCP type validation):
-- Numeric
INTEGER, BIGINT, DECIMAL(10,2), DOUBLE
-- String
VARCHAR, TEXT
-- Temporal
DATE, TIME, TIMESTAMP, INTERVAL
-- Complex
ARRAY, STRUCT, MAP, JSON
-- Boolean
BOOLEAN
Type casting:
-- Cast to specific type
SELECT CAST(amount AS DECIMAL(10,2)) FROM sales
-- Short syntax
SELECT amount::DECIMAL(10,2) FROM sales
-- Date parsing
SELECT CAST('2025-01-15' AS DATE)
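When source data is messy, TRY_CAST is a safer option than CAST because it returns NULL instead of raising an error, and STRPTIME parses non-ISO date strings (the format string below is just an example):
-- Returns NULL instead of failing on bad values
SELECT TRY_CAST(amount AS DECIMAL(10,2)) AS amount_clean FROM sales
-- Parse a non-ISO date string with an explicit format
SELECT STRPTIME('15/01/2025', '%d/%m/%Y') AS parsed_date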
4. SQL Extensions
DuckDB adds useful SQL extensions beyond standard SQL:
EXCLUDE clause (select all except):
-- Select all columns except sensitive ones
SELECT * EXCLUDE (ssn, salary) FROM employees
REPLACE clause (modify columns in SELECT *):
-- Replace amount with rounded version
SELECT * REPLACE (ROUND(amount, 2) AS amount) FROM sales
List aggregation:
-- Aggregate into arrays
SELECT
region,
LIST(product) AS products,
LIST(DISTINCT customer) AS customers
FROM sales
GROUP BY region
String aggregation:
SELECT
department,
STRING_AGG(employee_name, ', ') AS team_members
FROM employees
GROUP BY department
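Two more conveniences worth knowing (available in reasonably recent DuckDB versions) are GROUP BY ALL and ORDER BY ALL, which group and sort by every non-aggregated column without repeating the column list:
-- Group by all non-aggregated columns, then sort by all columns
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY ALL
ORDER BY ALL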
5. Analytical Functions
Window functions:
-- Running totals
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales
-- Ranking
SELECT
product,
sales,
RANK() OVER (ORDER BY sales DESC) AS rank
FROM product_sales
-- Partitioned windows
SELECT
region,
product,
sales,
AVG(sales) OVER (PARTITION BY region) AS regional_avg
FROM sales
Percentiles and statistics:
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95,
STDDEV(amount) AS std_dev,
CORR(amount, quantity) AS correlation
FROM sales
6. Date and Time Functions
-- Current timestamp
SELECT CURRENT_TIMESTAMP
-- Date arithmetic
SELECT date + INTERVAL '7 days' AS next_week
SELECT date - INTERVAL '1 month' AS last_month
-- Date truncation
SELECT DATE_TRUNC('month', timestamp) AS month
SELECT DATE_TRUNC('week', timestamp) AS week
-- Date parts
SELECT
YEAR(date) AS year,
MONTH(date) AS month,
DAYOFWEEK(date) AS day_of_week
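For differences between two dates and custom formatting, DATE_DIFF and STRFTIME are handy; the orders table and ship_date column below are hypothetical:
-- Days between two dates
SELECT DATE_DIFF('day', order_date, ship_date) AS days_to_ship FROM orders
-- Format a date/timestamp as a string
SELECT STRFTIME(order_date, '%Y-%m') AS order_month FROM orders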
7. JSON Support
Parse JSON strings:
-- Extract JSON fields
SELECT
json_extract(data, '$.user_id') AS user_id,
json_extract(data, '$.event_type') AS event_type
FROM events
-- Arrow notation (shorthand)
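-- -> returns a JSON value, ->> returns plain text (VARCHAR)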
SELECT
data->'user_id' AS user_id,
data->>'event_type' AS event_type
FROM events
Read JSON files:
SELECT * FROM read_json_auto('events.json')
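Nested JSON arrays can be flattened with UNNEST; a minimal sketch, assuming each record in a hypothetical data/orders.json has an order_id and an items array:
-- One output row per element of the items array
SELECT order_id, UNNEST(items) AS item
FROM read_json_auto('data/orders.json')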
8. Performance Features
Parallel execution (automatic):
- DuckDB uses all CPU cores automatically
- No configuration needed
Larger-than-memory processing:
- Spills to disk when needed
- Handles datasets larger than RAM
Columnar storage:
- Efficient for analytical queries
- Fast aggregations and filters
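The defaults are usually fine, but thread count and memory limit can be inspected or tuned with standard DuckDB settings if a workload needs it:
-- Optional tuning; adjust only if needed
SET threads = 4
SET memory_limit = '4GB'
-- Inspect current values
SELECT current_setting('threads') AS threads, current_setting('memory_limit') AS memory_limit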
Indexes (for point lookups):
CREATE INDEX idx_customer ON sales(customer_id)
MXCP Integration
Database Connection
Automatic in MXCP - no setup needed:
# mxcp-site.yml
# DuckDB is the default, no configuration required
Environment variable for custom path:
# Default database path is data/db-default.duckdb
export MXCP_DUCKDB_PATH="/path/to/data/db-default.duckdb"
mxcp serve
Profile-specific databases:
# mxcp-site.yml
profiles:
development:
database:
path: "dev.duckdb"
production:
database:
path: "prod.duckdb"
Using DuckDB in MXCP Tools
Direct SQL queries:
# tools/query_sales.yml
mxcp: 1
tool:
name: query_sales
source:
code: |
SELECT
region,
SUM(amount) as total,
COUNT(*) as count
FROM sales
WHERE sale_date >= $start_date
GROUP BY region
ORDER BY total DESC
Query CSV files directly:
tool:
name: analyze_upload
source:
code: |
SELECT
COUNT(*) as rows,
COUNT(DISTINCT customer_id) as unique_customers,
SUM(amount) as total_revenue
FROM 'uploads/$filename'
Complex analytical queries:
tool:
name: customer_cohorts
source:
code: |
WITH first_purchase AS (
SELECT
customer_id,
MIN(DATE_TRUNC('month', purchase_date)) AS cohort_month
FROM purchases
GROUP BY customer_id
),
cohort_size AS (
SELECT
cohort_month,
COUNT(DISTINCT customer_id) AS cohort_size
FROM first_purchase
GROUP BY cohort_month
)
SELECT
fp.cohort_month,
DATE_TRUNC('month', p.purchase_date) AS activity_month,
COUNT(DISTINCT p.customer_id) AS active_customers,
cs.cohort_size,
COUNT(DISTINCT p.customer_id)::FLOAT / cs.cohort_size AS retention_rate
FROM purchases p
JOIN first_purchase fp ON p.customer_id = fp.customer_id
JOIN cohort_size cs ON fp.cohort_month = cs.cohort_month
GROUP BY fp.cohort_month, activity_month, cs.cohort_size
ORDER BY fp.cohort_month, activity_month
Using DuckDB in Python Endpoints
Access via MXCP runtime:
from mxcp.runtime import db
def analyze_data(region: str) -> dict:
# Execute query
result = db.execute(
"SELECT SUM(amount) as total FROM sales WHERE region = $1",
{"region": region}
)
# Fetch results
row = result.fetchone()
return {"total": row["total"]}
def batch_insert(records: list[dict]) -> dict:
# Insert data
db.execute(
"INSERT INTO logs (timestamp, event) VALUES ($1, $2)",
[(r["timestamp"], r["event"]) for r in records]
)
return {"inserted": len(records)}
Read files in Python:
def import_csv(filepath: str) -> dict:
# Create table from CSV
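# Note: the file path is interpolated directly into the SQL string, so only
# pass trusted, validated paths here (unlike $param binding, this is not injection-safe)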
db.execute(f"""
CREATE TABLE imported_data AS
SELECT * FROM read_csv_auto('{filepath}')
""")
# Get stats
result = db.execute("SELECT COUNT(*) as count FROM imported_data")
return {"rows_imported": result.fetchone()["count"]}
Best Practices for MXCP
1. Use Parameter Binding
ALWAYS use parameterized queries to prevent SQL injection:
✅ Correct:
source:
code: |
SELECT * FROM sales WHERE region = $region
❌ WRONG (SQL injection risk):
source:
code: |
SELECT * FROM sales WHERE region = '$region'
2. Optimize Queries
Index frequently filtered columns:
CREATE INDEX idx_customer ON orders(customer_id)
CREATE INDEX idx_date ON orders(order_date)
Use EXPLAIN to analyze queries:
EXPLAIN SELECT * FROM large_table WHERE id = 123
Materialize complex aggregations (via dbt models):
-- Instead of computing this on every query,
-- materialize it as a table via a dbt model
CREATE TABLE daily_summary AS
SELECT
DATE_TRUNC('day', timestamp) AS date,
COUNT(*) AS count,
SUM(amount) AS total
FROM transactions
GROUP BY date
3. Handle Large Datasets
For large CSVs (>100MB):
- Use Parquet format instead (much faster)
- Create tables rather than querying files directly
- Use dbt to materialize transformations
Conversion to Parquet:
COPY (SELECT * FROM 'large_data.csv')
TO 'large_data.parquet' (FORMAT PARQUET)
4. Data Types in MXCP
Match DuckDB types to MXCP types:
# MXCP tool definition
parameters:
- name: amount
type: number # → DuckDB DOUBLE
- name: quantity
type: integer # → DuckDB INTEGER
- name: description
type: string # → DuckDB VARCHAR
- name: created_at
type: string
format: date-time # → DuckDB TIMESTAMP
- name: is_active
type: boolean # → DuckDB BOOLEAN
5. Database File Management
Backup:
# DuckDB is a single file - just copy it (default: data/db-default.duckdb)
cp data/db-default.duckdb data/db-default.duckdb.backup
Export to SQL:
EXPORT DATABASE 'backup_directory'
Import from SQL:
IMPORT DATABASE 'backup_directory'
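EXPORT DATABASE writes CSV by default; it can also write Parquet, which is typically smaller and faster to re-import:
-- Export tables as Parquet instead of CSV
EXPORT DATABASE 'backup_directory' (FORMAT PARQUET)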
Common Patterns in MXCP
Pattern 1: CSV → Table → Query
# 1. Load CSV via dbt seed
dbt seed --select customers
# 2. Query from MXCP tool
SELECT * FROM customers WHERE country = $country
Pattern 2: External Data Caching
-- dbt model: cache_external_data.sql
{{ config(materialized='table') }}
SELECT * FROM read_csv_auto('https://example.com/data.csv')
Pattern 3: Multi-File Aggregation
-- Query multiple CSVs
SELECT * FROM 'data/*.csv'
-- Union multiple Parquet files
SELECT * FROM 'archive/2025-*.parquet'
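When files in a glob have slightly different columns, or you need to know which file each row came from, the CSV and Parquet readers accept extra options (a sketch reusing the data/*.csv glob above):
-- Add a filename column and align differing schemas by column name
SELECT * FROM read_csv_auto('data/*.csv', filename = true, union_by_name = true)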
Pattern 4: Real-time + Historical
-- Combine recent API data with historical cache
SELECT * FROM read_json_auto('https://api.com/recent')
UNION ALL
SELECT * FROM historical_data WHERE date < CURRENT_DATE - INTERVAL '7 days'
Troubleshooting
Issue: "Table does not exist" Solution: Ensure dbt models/seeds have been run, check table name spelling
Issue: "Type mismatch" Solution: Add explicit CAST() or update schema.yml with correct data types
Issue: "Out of memory" Solution: Reduce query scope, add WHERE filters, materialize intermediate results
Issue: "CSV parsing error" Solution: Use read_csv_auto with explicit options (delim, quote, etc.)
Issue: "Slow queries" Solution: Add indexes, materialize via dbt, use Parquet instead of CSV
Summary for MXCP Builders
When building MXCP servers with DuckDB:
- Use parameterized queries ($param) to prevent injection
- Load CSVs via dbt seeds for version control and validation
- Materialize complex queries as dbt models
- Index frequently filtered columns for performance
- Use Parquet for large datasets (>100MB)
- Match MXCP types to DuckDB types in tool definitions
- Leverage DuckDB's SQL extensions (EXCLUDE, REPLACE, window functions)
DuckDB is the powerhouse behind MXCP's data capabilities - understanding it enables building robust, high-performance MCP servers.