# DuckDB Essentials for MXCP

Essential DuckDB knowledge for building MXCP servers with embedded analytics.

## What is DuckDB?

**DuckDB is an embedded, in-process SQL OLAP database** - think "SQLite for analytics". It runs directly in your MXCP server process without needing a separate database server.

**Key characteristics**:

- **Embedded**: No server setup, no configuration
- **Fast**: Vectorized execution engine, parallel processing
- **Versatile**: Reads CSV, Parquet, JSON directly from disk or URLs
- **SQL**: Full SQL support with analytical extensions
- **Portable**: Single-file database, easy to move/backup

**MXCP uses DuckDB by default** for all SQL-based tools and resources.

## Core Features for MXCP

### 1. Direct File Reading

**DuckDB can query files without importing them first**:

```sql
-- Query CSV directly
SELECT * FROM 'data/sales.csv'

-- Query with explicit reader
SELECT * FROM read_csv_auto('data/sales.csv')

-- Query Parquet
SELECT * FROM 'data/sales.parquet'

-- Query JSON
SELECT * FROM read_json_auto('data/events.json')

-- Query from URL (uses the httpfs extension, which recent DuckDB versions autoload)
SELECT * FROM 'https://example.com/data.csv'
```

**Auto-detection**: DuckDB automatically infers:

- Column names from headers
- Data types from values
- CSV delimiters, quotes, etc.
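
To sanity-check what was inferred before relying on it, `DESCRIBE` works on any query - a minimal sketch:

```sql
-- Show the column names and types DuckDB inferred for the file
DESCRIBE SELECT * FROM 'data/sales.csv'
```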

### 2. CSV Import and Export

**Import CSV to table**:

```sql
-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales.csv')

-- Or use COPY
COPY sales FROM 'sales.csv' (AUTO_DETECT TRUE)
```

**Export to CSV**:

```sql
-- Export query results
COPY (SELECT * FROM sales WHERE region = 'US')
TO 'us_sales.csv' (HEADER, DELIMITER ',')
```

**CSV reading options**:

```sql
SELECT * FROM read_csv_auto(
    'data.csv',
    header = true,
    delim = ',',
    quote = '"',
    dateformat = '%Y-%m-%d'
)
```
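
For messy files, two more `read_csv_auto` options are often useful - a sketch using standard DuckDB options:

```sql
SELECT * FROM read_csv_auto(
    'data.csv',
    ignore_errors = true,  -- skip rows that fail to parse instead of erroring
    sample_size = -1       -- infer types from the whole file, not just a sample
)
```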

### 3. Data Types

**Common DuckDB types** (important for MXCP type validation):

```sql
-- Numeric
INTEGER, BIGINT, DECIMAL(10,2), DOUBLE

-- String
VARCHAR, TEXT

-- Temporal
DATE, TIME, TIMESTAMP, INTERVAL

-- Complex (DuckDB's variable-length array type is LIST, e.g. INTEGER[])
LIST, STRUCT, MAP, JSON

-- Boolean
BOOLEAN
```

**Type casting**:

```sql
-- Cast to specific type
SELECT CAST(amount AS DECIMAL(10,2)) FROM sales

-- Short syntax
SELECT amount::DECIMAL(10,2) FROM sales

-- Date parsing
SELECT CAST('2025-01-15' AS DATE)
```
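
When input quality is uncertain, `TRY_CAST` returns NULL instead of raising an error - handy in MXCP tools that receive user-supplied values:

```sql
SELECT TRY_CAST('not-a-number' AS INTEGER)  -- NULL, no error
SELECT TRY_CAST('2025-01-15' AS DATE)       -- 2025-01-15
```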

### 4. SQL Extensions

**DuckDB adds useful SQL extensions beyond standard SQL**:

**EXCLUDE clause** (select all except):

```sql
-- Select all columns except sensitive ones
SELECT * EXCLUDE (ssn, salary) FROM employees
```

**REPLACE clause** (modify columns in SELECT *):

```sql
-- Replace amount with rounded version
SELECT * REPLACE (ROUND(amount, 2) AS amount) FROM sales
```

**List aggregation**:

```sql
-- Aggregate into arrays
SELECT
    region,
    LIST(product) AS products,
    LIST(DISTINCT customer) AS customers
FROM sales
GROUP BY region
```

**String aggregation**:

```sql
SELECT
    department,
    STRING_AGG(employee_name, ', ') AS team_members
FROM employees
GROUP BY department
```
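
Two more DuckDB shorthands worth knowing - a sketch, where `regional_products` and its `products` LIST column are assumed tables for illustration:

```sql
-- GROUP BY ALL: group by every non-aggregated column automatically
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY ALL

-- UNNEST: expand a LIST column back into one row per element
SELECT region, UNNEST(products) AS product
FROM regional_products
```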

### 5. Analytical Functions

**Window functions**:

```sql
-- Running totals
SELECT
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales

-- Ranking
SELECT
    product,
    sales,
    RANK() OVER (ORDER BY sales DESC) AS sales_rank
FROM product_sales

-- Partitioned windows
SELECT
    region,
    product,
    sales,
    AVG(sales) OVER (PARTITION BY region) AS regional_avg
FROM sales
```

**Percentiles and statistics**:

```sql
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95,
    STDDEV(amount) AS std_dev,
    CORR(amount, quantity) AS correlation
FROM sales
```
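
DuckDB's `QUALIFY` clause filters on window results without a wrapping subquery - a minimal sketch on the same `product_sales` table:

```sql
-- Keep only the top 10 products by sales
SELECT
    product,
    sales,
    RANK() OVER (ORDER BY sales DESC) AS sales_rank
FROM product_sales
QUALIFY sales_rank <= 10
```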

### 6. Date and Time Functions

```sql
-- Current timestamp
SELECT CURRENT_TIMESTAMP

-- Date arithmetic
SELECT date + INTERVAL '7 days' AS next_week
SELECT date - INTERVAL '1 month' AS last_month

-- Date truncation
SELECT DATE_TRUNC('month', timestamp) AS month
SELECT DATE_TRUNC('week', timestamp) AS week

-- Date parts
SELECT
    YEAR(date) AS year,
    MONTH(date) AS month,
    DAYOFWEEK(date) AS day_of_week
```
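
To build a continuous date spine (e.g. for filling gaps in a time series), `generate_series` also works as a table function - a sketch:

```sql
-- One row per day in January 2025
SELECT * FROM generate_series(
    TIMESTAMP '2025-01-01', TIMESTAMP '2025-01-31', INTERVAL '1 day'
)
```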

### 7. JSON Support

**Parse JSON strings**:

```sql
-- Extract JSON fields
SELECT
    json_extract(data, '$.user_id') AS user_id,
    json_extract(data, '$.event_type') AS event_type
FROM events

-- Arrow notation (shorthand)
SELECT
    data->'user_id' AS user_id,
    data->>'event_type' AS event_type
FROM events
```
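
Nested structures can be reached by chaining arrows or using a JSONPath - a sketch, where the `address` and `items` fields are assumptions:

```sql
SELECT
    data->'address'->>'city' AS city,                   -- chained key access
    json_extract(data, '$.items[0].sku') AS first_sku  -- JSONPath with array index
FROM events
```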

**Read JSON files**:

```sql
SELECT * FROM read_json_auto('events.json')
```

### 8. Performance Features

**Parallel execution** (automatic):

- DuckDB uses all CPU cores automatically
- No configuration needed (tunable if required, as sketched below)

**Larger-than-memory processing**:

- Spills to disk when needed
- Handles datasets larger than RAM

**Columnar storage**:

- Efficient for analytical queries
- Fast aggregations and filters
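
Both behaviors can be tuned when needed; `threads` and `memory_limit` are standard DuckDB settings:

```sql
-- Cap memory before DuckDB starts spilling to disk
SET memory_limit = '4GB'

-- Limit worker threads (defaults to all cores)
SET threads = 4
```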

**Indexes** (for point lookups):

```sql
CREATE INDEX idx_customer ON sales(customer_id)
```

## MXCP Integration

### Database Connection

**Automatic in MXCP** - no setup needed:

```yaml
# mxcp-site.yml
# DuckDB is the default, no configuration required
```

**Environment variable** for custom path:

```bash
# Default database path is data/db-default.duckdb
export MXCP_DUCKDB_PATH="/path/to/data/db-default.duckdb"
mxcp serve
```

**Profile-specific databases**:

```yaml
# mxcp-site.yml
profiles:
  development:
    database:
      path: "dev.duckdb"
  production:
    database:
      path: "prod.duckdb"
```

### Using DuckDB in MXCP Tools

**Direct SQL queries**:

```yaml
# tools/query_sales.yml
mxcp: 1
tool:
  name: query_sales
  source:
    code: |
      SELECT
        region,
        SUM(amount) AS total,
        COUNT(*) AS count
      FROM sales
      WHERE sale_date >= $start_date
      GROUP BY region
      ORDER BY total DESC
```

**Query CSV files directly**:

```yaml
tool:
  name: analyze_upload
  source:
    code: |
      -- Note: $filename is substituted into the path string rather than bound
      -- as a parameter - validate it against an allowlist of known files
      SELECT
        COUNT(*) AS rows,
        COUNT(DISTINCT customer_id) AS unique_customers,
        SUM(amount) AS total_revenue
      FROM 'uploads/$filename'
```

**Complex analytical queries**:

```yaml
tool:
  name: customer_cohorts
  source:
    code: |
      WITH first_purchase AS (
        SELECT
          customer_id,
          MIN(DATE_TRUNC('month', purchase_date)) AS cohort_month
        FROM purchases
        GROUP BY customer_id
      ),
      cohort_size AS (
        SELECT
          cohort_month,
          COUNT(DISTINCT customer_id) AS cohort_size
        FROM first_purchase
        GROUP BY cohort_month
      )
      SELECT
        fp.cohort_month,
        DATE_TRUNC('month', p.purchase_date) AS activity_month,
        COUNT(DISTINCT p.customer_id) AS active_customers,
        cs.cohort_size,
        COUNT(DISTINCT p.customer_id)::FLOAT / cs.cohort_size AS retention_rate
      FROM purchases p
      JOIN first_purchase fp ON p.customer_id = fp.customer_id
      JOIN cohort_size cs ON fp.cohort_month = cs.cohort_month
      GROUP BY fp.cohort_month, activity_month, cs.cohort_size
      ORDER BY fp.cohort_month, activity_month
```

### Using DuckDB in Python Endpoints

**Access via MXCP runtime**:

```python
from mxcp.runtime import db


def analyze_data(region: str) -> dict:
    # Execute a parameterized query (named parameters match the dict keys)
    result = db.execute(
        "SELECT SUM(amount) AS total FROM sales WHERE region = $region",
        {"region": region}
    )

    # Fetch results
    row = result.fetchone()
    return {"total": row["total"]}


def batch_insert(records: list[dict]) -> dict:
    # Insert each record with named parameters
    for r in records:
        db.execute(
            "INSERT INTO logs (timestamp, event) VALUES ($timestamp, $event)",
            {"timestamp": r["timestamp"], "event": r["event"]}
        )

    return {"inserted": len(records)}
```

**Read files in Python**:

```python
def import_csv(filepath: str) -> dict:
    # filepath is interpolated into the SQL string, so escape quotes
    # and only accept paths from trusted sources
    safe_path = filepath.replace("'", "''")

    # Create table from CSV
    db.execute(f"""
        CREATE TABLE imported_data AS
        SELECT * FROM read_csv_auto('{safe_path}')
    """)

    # Get stats
    result = db.execute("SELECT COUNT(*) AS count FROM imported_data")
    return {"rows_imported": result.fetchone()["count"]}
```

## Best Practices for MXCP

### 1. Use Parameter Binding

**ALWAYS use parameterized queries** to prevent SQL injection:

✅ **Correct**:

```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = $region
```

❌ **WRONG** (SQL injection risk):

```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = '$region'
```

### 2. Optimize Queries

**Index frequently filtered columns**:

```sql
CREATE INDEX idx_customer ON orders(customer_id)
CREATE INDEX idx_date ON orders(order_date)
```

**Use EXPLAIN to analyze queries**:

```sql
EXPLAIN SELECT * FROM large_table WHERE id = 123

-- Or execute the query and collect actual timings
EXPLAIN ANALYZE SELECT * FROM large_table WHERE id = 123
```

**Materialize complex aggregations** (via dbt models):

```sql
-- Instead of computing on every query,
-- materialize the result once as a table (e.g. a dbt model)
CREATE TABLE daily_summary AS
SELECT
    DATE_TRUNC('day', timestamp) AS date,
    COUNT(*) AS count,
    SUM(amount) AS total
FROM transactions
GROUP BY date
```

### 3. Handle Large Datasets

**For large CSVs** (>100MB):

- Use Parquet format instead (much faster)
- Create tables rather than querying files directly
- Use dbt to materialize transformations

**Conversion to Parquet**:

```sql
COPY (SELECT * FROM 'large_data.csv')
TO 'large_data.parquet' (FORMAT PARQUET)
```
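
For very large exports, Hive-style partitioning keeps individual files manageable - a sketch, where `PARTITION_BY` is a standard DuckDB COPY option and the `year` column is derived for illustration:

```sql
-- Writes one Parquet directory per year under sales_partitioned/
COPY (SELECT *, YEAR(sale_date) AS year FROM sales)
TO 'sales_partitioned' (FORMAT PARQUET, PARTITION_BY (year))
```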

### 4. Data Types in MXCP

**Match DuckDB types to MXCP types**:

```yaml
# MXCP tool definition
parameters:
  - name: amount
    type: number        # → DuckDB DOUBLE
  - name: quantity
    type: integer       # → DuckDB INTEGER
  - name: description
    type: string        # → DuckDB VARCHAR
  - name: created_at
    type: string
    format: date-time   # → DuckDB TIMESTAMP
  - name: is_active
    type: boolean       # → DuckDB BOOLEAN
```
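
When a date-time arrives as a string parameter, an explicit cast in the SQL keeps comparisons well-typed - a sketch using the `created_at` parameter above against an assumed `events` table:

```sql
SELECT * FROM events
WHERE created_at >= CAST($created_at AS TIMESTAMP)
```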

### 5. Database File Management

**Backup**:

```bash
# DuckDB is a single file - just copy it (default: data/db-default.duckdb)
cp data/db-default.duckdb data/db-default.duckdb.backup
```

**Export to SQL**:

```sql
EXPORT DATABASE 'backup_directory'

-- Or store table data as compressed Parquet
EXPORT DATABASE 'backup_directory' (FORMAT PARQUET)
```

**Import from SQL**:

```sql
IMPORT DATABASE 'backup_directory'
```

## Common Patterns in MXCP

### Pattern 1: CSV → Table → Query

```bash
# 1. Load CSV via dbt seed
dbt seed --select customers
```

```sql
-- 2. Query from MXCP tool
SELECT * FROM customers WHERE country = $country
```

### Pattern 2: External Data Caching

```sql
-- dbt model: cache_external_data.sql
{{ config(materialized='table') }}

SELECT * FROM read_csv_auto('https://example.com/data.csv')
```

### Pattern 3: Multi-File Aggregation

```sql
-- Query multiple CSVs
SELECT * FROM 'data/*.csv'

-- Union multiple Parquet files
SELECT * FROM 'archive/2025-*.parquet'
```
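
When aggregating many files, it helps to track provenance - a sketch, where `filename = true` is a standard `read_csv_auto` option that adds the source path as a column:

```sql
SELECT filename, COUNT(*) AS rows
FROM read_csv_auto('data/*.csv', filename = true)
GROUP BY filename
```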

### Pattern 4: Real-time + Historical

```sql
-- Combine recent API data with historical cache
SELECT * FROM read_json_auto('https://api.com/recent')
UNION ALL
SELECT * FROM historical_data WHERE date < CURRENT_DATE - INTERVAL '7 days'
```

## Troubleshooting

**Issue**: "Table does not exist"
**Solution**: Ensure dbt models/seeds have been run; check table name spelling

**Issue**: "Type mismatch"
**Solution**: Add explicit CAST() or update schema.yml with correct data types

**Issue**: "Out of memory"
**Solution**: Reduce query scope, add WHERE filters, materialize intermediate results

**Issue**: "CSV parsing error"
**Solution**: Use read_csv_auto with explicit options (delim, quote, etc.)

**Issue**: "Slow queries"
**Solution**: Add indexes, materialize via dbt, use Parquet instead of CSV

## Summary for MXCP Builders

When building MXCP servers with DuckDB:

1. **Use parameterized queries** (`$param`) to prevent injection
2. **Load CSVs via dbt seeds** for version control and validation
3. **Materialize complex queries** as dbt models
4. **Index frequently filtered columns** for performance
5. **Use Parquet for large datasets** (>100MB)
6. **Match MXCP types to DuckDB types** in tool definitions
7. **Leverage DuckDB extensions** (EXCLUDE, REPLACE, window functions)

DuckDB is the powerhouse behind MXCP's data capabilities - understanding it enables building robust, high-performance MCP servers.