DuckDB Essentials for MXCP

Essential DuckDB knowledge for building MXCP servers with embedded analytics.

What is DuckDB?

DuckDB is an embedded, in-process SQL OLAP database - think "SQLite for analytics". It runs directly in your MXCP server process without needing a separate database server.

Key characteristics:

  • Embedded: No server setup, no configuration
  • Fast: Vectorized execution engine, parallel processing
  • Versatile: Reads CSV, Parquet, JSON directly from disk or URLs
  • SQL: Full SQL support with analytical extensions
  • Portable: Single-file database, easy to move/backup

MXCP uses DuckDB by default for all SQL-based tools and resources.

Core Features for MXCP

1. Direct File Reading

DuckDB can query files without importing them first:

-- Query CSV directly
SELECT * FROM 'data/sales.csv'

-- Query with explicit reader
SELECT * FROM read_csv_auto('data/sales.csv')

-- Query Parquet
SELECT * FROM 'data/sales.parquet'

-- Query JSON
SELECT * FROM read_json_auto('data/events.json')

-- Query from URL
SELECT * FROM 'https://example.com/data.csv'

Auto-detection: DuckDB automatically infers:

  • Column names from headers
  • Data types from values
  • CSV delimiters, quotes, etc.
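
To sanity-check what was inferred before building a tool on top of a file, DESCRIBE works on any query, including direct file reads (using the data/sales.csv path from above):

-- Inspect the inferred column names and types
DESCRIBE SELECT * FROM 'data/sales.csv'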

2. CSV Import and Export

Import CSV to table:

-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales.csv')

-- Or use COPY
COPY sales FROM 'sales.csv' (AUTO_DETECT TRUE)

Export to CSV:

-- Export query results
COPY (SELECT * FROM sales WHERE region = 'US')
TO 'us_sales.csv' (HEADER, DELIMITER ',')

CSV reading options:

SELECT * FROM read_csv_auto(
  'data.csv',
  header = true,
  delim = ',',
  quote = '"',
  dateformat = '%Y-%m-%d'
)

3. Data Types

Common DuckDB types (important for MXCP type validation):

-- Numeric
INTEGER, BIGINT, DECIMAL(10,2), DOUBLE

-- String
VARCHAR, TEXT

-- Temporal
DATE, TIME, TIMESTAMP, INTERVAL

-- Complex
ARRAY, STRUCT, MAP, JSON

-- Boolean
BOOLEAN
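
The complex types all have literal syntax, which makes them easy to experiment with; a quick sketch (the values are made up):

-- List, struct, and map literals
SELECT
  [1, 2, 3] AS scores,                   -- LIST
  {'name': 'Ada', 'age': 36} AS person,  -- STRUCT
  MAP([1, 2], ['a', 'b']) AS lookup      -- MAP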

Type casting:

-- Cast to specific type
SELECT CAST(amount AS DECIMAL(10,2)) FROM sales

-- Short syntax
SELECT amount::DECIMAL(10,2) FROM sales

-- Date parsing
SELECT CAST('2025-01-15' AS DATE)

4. SQL Extensions

DuckDB adds useful SQL extensions beyond standard SQL:

EXCLUDE clause (select all except):

-- Select all columns except sensitive ones
SELECT * EXCLUDE (ssn, salary) FROM employees

REPLACE clause (modify columns in SELECT *):

-- Replace amount with rounded version
SELECT * REPLACE (ROUND(amount, 2) AS amount) FROM sales

List aggregation:

-- Aggregate into arrays
SELECT
  region,
  LIST(product) AS products,
  LIST(DISTINCT customer) AS customers
FROM sales
GROUP BY region

String aggregation:

SELECT
  department,
  STRING_AGG(employee_name, ', ') AS team_members
FROM employees
GROUP BY department

5. Analytical Functions

Window functions:

-- Running totals
SELECT
  date,
  amount,
  SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales

-- Ranking
SELECT
  product,
  sales,
  RANK() OVER (ORDER BY sales DESC) AS rank
FROM product_sales

-- Partitioned windows
SELECT
  region,
  product,
  sales,
  AVG(sales) OVER (PARTITION BY region) AS regional_avg
FROM sales

Percentiles and statistics:

SELECT
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95,
  STDDEV(amount) AS std_dev,
  CORR(amount, quantity) AS correlation
FROM sales

6. Date and Time Functions

-- Current timestamp
SELECT CURRENT_TIMESTAMP

-- Date arithmetic
SELECT date + INTERVAL '7 days' AS next_week FROM sales
SELECT date - INTERVAL '1 month' AS last_month FROM sales

-- Date truncation
SELECT DATE_TRUNC('month', timestamp) AS month FROM events
SELECT DATE_TRUNC('week', timestamp) AS week FROM events

-- Date parts
SELECT
  YEAR(date) AS year,
  MONTH(date) AS month,
  DAYOFWEEK(date) AS day_of_week
FROM sales

7. JSON Support

Parse JSON strings:

-- Extract JSON fields
SELECT
  json_extract(data, '$.user_id') AS user_id,
  json_extract(data, '$.event_type') AS event_type
FROM events

-- Arrow notation (shorthand): -> returns JSON, ->> returns text
SELECT
  data->'user_id' AS user_id,
  data->>'event_type' AS event_type
FROM events

Read JSON files:

SELECT * FROM read_json_auto('events.json')

8. Performance Features

Parallel execution (automatic):

  • DuckDB uses all CPU cores automatically
  • No configuration needed

Larger-than-memory processing:

  • Spills to disk when needed
  • Handles datasets larger than RAM

Columnar storage:

  • Efficient for analytical queries
  • Fast aggregations and filters
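
The defaults are sensible, but parallelism, memory cap, and spill location can be tuned per session when needed; the values below are illustrative:

-- Tune worker threads, memory cap, and spill directory
SET threads = 4;
SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb_spill';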

Indexes (for point lookups):

CREATE INDEX idx_customer ON sales(customer_id)

MXCP Integration

Database Connection

Automatic in MXCP - no setup needed:

# mxcp-site.yml
# DuckDB is the default, no configuration required

Environment variable for custom path:

# Default database path is data/db-default.duckdb
export MXCP_DUCKDB_PATH="/path/to/data/db-default.duckdb"
mxcp serve

Profile-specific databases:

# mxcp-site.yml
profiles:
  development:
    database:
      path: "dev.duckdb"
  production:
    database:
      path: "prod.duckdb"

Using DuckDB in MXCP Tools

Direct SQL queries:

# tools/query_sales.yml
mxcp: 1
tool:
  name: query_sales
  source:
    code: |
      SELECT
        region,
        SUM(amount) as total,
        COUNT(*) as count
      FROM sales
      WHERE sale_date >= $start_date
      GROUP BY region
      ORDER BY total DESC

Query CSV files directly:

tool:
  name: analyze_upload
  source:
    code: |
      SELECT
        COUNT(*) as rows,
        COUNT(DISTINCT customer_id) as unique_customers,
        SUM(amount) as total_revenue
      FROM 'uploads/$filename'

Complex analytical queries:

tool:
  name: customer_cohorts
  source:
    code: |
      WITH first_purchase AS (
        SELECT
          customer_id,
          MIN(DATE_TRUNC('month', purchase_date)) AS cohort_month
        FROM purchases
        GROUP BY customer_id
      ),
      cohort_size AS (
        SELECT
          cohort_month,
          COUNT(DISTINCT customer_id) AS cohort_size
        FROM first_purchase
        GROUP BY cohort_month
      )
      SELECT
        fp.cohort_month,
        DATE_TRUNC('month', p.purchase_date) AS activity_month,
        COUNT(DISTINCT p.customer_id) AS active_customers,
        cs.cohort_size,
        COUNT(DISTINCT p.customer_id)::FLOAT / cs.cohort_size AS retention_rate
      FROM purchases p
      JOIN first_purchase fp ON p.customer_id = fp.customer_id
      JOIN cohort_size cs ON fp.cohort_month = cs.cohort_month
      GROUP BY fp.cohort_month, activity_month, cs.cohort_size
      ORDER BY fp.cohort_month, activity_month

Using DuckDB in Python Endpoints

Access via MXCP runtime:

from mxcp.runtime import db

def analyze_data(region: str) -> dict:
    # Execute query
    result = db.execute(
        "SELECT SUM(amount) as total FROM sales WHERE region = $region",
        {"region": region}
    )

    # Fetch results
    row = result.fetchone()
    return {"total": row["total"]}

def batch_insert(records: list[dict]) -> dict:
    # Insert rows one at a time, binding one set of named parameters per call
    for r in records:
        db.execute(
            "INSERT INTO logs (timestamp, event) VALUES ($ts, $event)",
            {"ts": r["timestamp"], "event": r["event"]}
        )

    return {"inserted": len(records)}

Read files in Python:

def import_csv(filepath: str) -> dict:
    # Create table from CSV.
    # Note: filepath is interpolated directly into the SQL string, so
    # validate it (e.g., against an allow-list) before running this.
    db.execute(f"""
        CREATE TABLE imported_data AS
        SELECT * FROM read_csv_auto('{filepath}')
    """)

    # Get stats
    result = db.execute("SELECT COUNT(*) as count FROM imported_data")
    return {"rows_imported": result.fetchone()["count"]}

Best Practices for MXCP

1. Use Parameter Binding

ALWAYS use parameterized queries to prevent SQL injection:

Correct:

source:
  code: |
    SELECT * FROM sales WHERE region = $region

WRONG (SQL injection risk):

source:
  code: |
    SELECT * FROM sales WHERE region = '$region'

2. Optimize Queries

Index frequently filtered columns:

CREATE INDEX idx_customer ON orders(customer_id)
CREATE INDEX idx_date ON orders(order_date)

Use EXPLAIN to analyze queries:

EXPLAIN SELECT * FROM large_table WHERE id = 123
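
EXPLAIN ANALYZE goes a step further: it actually runs the query and reports per-operator execution times:

EXPLAIN ANALYZE SELECT * FROM large_table WHERE id = 123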

Materialize complex aggregations (via dbt models):

-- Instead of computing the aggregation on every query,
-- materialize a summary table via a dbt model
CREATE TABLE daily_summary AS
SELECT
  DATE_TRUNC('day', timestamp) AS date,
  COUNT(*) AS count,
  SUM(amount) AS total
FROM transactions
GROUP BY date

3. Handle Large Datasets

For large CSVs (>100MB):

  • Use Parquet format instead (much faster)
  • Create tables rather than querying files directly
  • Use dbt to materialize transformations

Conversion to Parquet:

COPY (SELECT * FROM 'large_data.csv')
TO 'large_data.parquet' (FORMAT PARQUET)

4. Data Types in MXCP

Match DuckDB types to MXCP types:

# MXCP tool definition
parameters:
  - name: amount
    type: number        # → DuckDB DOUBLE
  - name: quantity
    type: integer       # → DuckDB INTEGER
  - name: description
    type: string        # → DuckDB VARCHAR
  - name: created_at
    type: string
    format: date-time   # → DuckDB TIMESTAMP
  - name: is_active
    type: boolean       # → DuckDB BOOLEAN
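
If a date-time parameter reaches the SQL as text, an explicit cast keeps the comparison typed; a minimal sketch assuming a hypothetical $created_after parameter on the sales table:

SELECT * FROM sales
WHERE created_at >= CAST($created_after AS TIMESTAMP)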

5. Database File Management

Backup:

# DuckDB is a single file - just copy it (default: data/db-default.duckdb)
cp data/db-default.duckdb data/db-default.duckdb.backup

Export to SQL:

EXPORT DATABASE 'backup_directory'

Import from SQL:

IMPORT DATABASE 'backup_directory'

Common Patterns in MXCP

Pattern 1: CSV → Table → Query

# 1. Load CSV via dbt seed
dbt seed --select customers

# 2. Query from MXCP tool
SELECT * FROM customers WHERE country = $country

Pattern 2: External Data Caching

-- dbt model: cache_external_data.sql
{{ config(materialized='table') }}

SELECT * FROM read_csv_auto('https://example.com/data.csv')

Pattern 3: Multi-File Aggregation

-- Query multiple CSVs
SELECT * FROM 'data/*.csv'

-- Union multiple Parquet files
SELECT * FROM 'archive/2025-*.parquet'
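
When globbing many files, read_csv_auto can also tag each row with its source file, which helps verify multi-file loads; a small sketch reusing the data/*.csv glob:

-- filename = true adds a column with each row's source path
SELECT filename, COUNT(*) AS row_count
FROM read_csv_auto('data/*.csv', filename = true)
GROUP BY filename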

Pattern 4: Real-time + Historical

-- Combine recent API data with historical cache
SELECT * FROM read_json_auto('https://api.com/recent')
UNION ALL
SELECT * FROM historical_data WHERE date < CURRENT_DATE - INTERVAL '7 days'

Troubleshooting

Issue: "Table does not exist" Solution: Ensure dbt models/seeds have been run, check table name spelling

Issue: "Type mismatch" Solution: Add explicit CAST() or update schema.yml with correct data types

Issue: "Out of memory" Solution: Reduce query scope, add WHERE filters, materialize intermediate results

Issue: "CSV parsing error" Solution: Use read_csv_auto with explicit options (delim, quote, etc.)

Issue: "Slow queries" Solution: Add indexes, materialize via dbt, use Parquet instead of CSV

Summary for MXCP Builders

When building MXCP servers with DuckDB:

  1. Use parameterized queries ($param) to prevent injection
  2. Load CSVs via dbt seeds for version control and validation
  3. Materialize complex queries as dbt models
  4. Index frequently filtered columns for performance
  5. Use Parquet for large datasets (>100MB)
  6. Match MXCP types to DuckDB types in tool definitions
  7. Leverage DuckDB extensions (EXCLUDE, REPLACE, window functions)

DuckDB is the powerhouse behind MXCP's data capabilities - understanding it enables building robust, high-performance MCP servers.