DuckDB Essentials for MXCP

Essential DuckDB knowledge for building MXCP servers with embedded analytics.

What is DuckDB?

DuckDB is an embedded, in-process SQL OLAP database - think "SQLite for analytics". It runs directly in your MXCP server process without needing a separate database server.

Key characteristics:

  • Embedded: No server setup, no configuration
  • Fast: Vectorized execution engine, parallel processing
  • Versatile: Reads CSV, Parquet, JSON directly from disk or URLs
  • SQL: Full SQL support with analytical extensions
  • Portable: Single-file database, easy to move/backup

MXCP uses DuckDB by default for all SQL-based tools and resources.

Core Features for MXCP

1. Direct File Reading

DuckDB can query files without importing them first:

-- Query CSV directly
SELECT * FROM 'data/sales.csv'

-- Query with explicit reader
SELECT * FROM read_csv_auto('data/sales.csv')

-- Query Parquet
SELECT * FROM 'data/sales.parquet'

-- Query JSON
SELECT * FROM read_json_auto('data/events.json')

-- Query from URL
SELECT * FROM 'https://example.com/data.csv'

Auto-detection: DuckDB automatically infers:

  • Column names from headers
  • Data types from values
  • CSV delimiters, quotes, etc.
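
To sanity-check what was inferred before building a tool on top of a file, DESCRIBE works on any query, including direct file reads (using the data/sales.csv path from above):

-- Inspect the inferred column names and types
DESCRIBE SELECT * FROM 'data/sales.csv'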

2. CSV Import and Export

Import CSV to table:

-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales.csv')

-- Or use COPY
COPY sales FROM 'sales.csv' (AUTO_DETECT TRUE)

Export to CSV:

-- Export query results
COPY (SELECT * FROM sales WHERE region = 'US')
TO 'us_sales.csv' (HEADER, DELIMITER ',')

CSV reading options:

SELECT * FROM read_csv_auto(
  'data.csv',
  header = true,
  delim = ',',
  quote = '"',
  dateformat = '%Y-%m-%d'
)

3. Data Types

Common DuckDB types (important for MXCP type validation):

-- Numeric
INTEGER, BIGINT, DECIMAL(10,2), DOUBLE

-- String
VARCHAR, TEXT

-- Temporal
DATE, TIME, TIMESTAMP, INTERVAL

-- Complex
ARRAY, STRUCT, MAP, JSON

-- Boolean
BOOLEAN
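
The complex types all have literal syntax, which makes them easy to experiment with; a quick sketch (the values are made up):

-- List, struct, and map literals
SELECT
  [1, 2, 3] AS scores,                   -- LIST
  {'name': 'Ada', 'age': 36} AS person,  -- STRUCT
  MAP([1, 2], ['a', 'b']) AS lookup      -- MAP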

Type casting:

-- Cast to specific type
SELECT CAST(amount AS DECIMAL(10,2)) FROM sales

-- Short syntax
SELECT amount::DECIMAL(10,2) FROM sales

-- Date parsing
SELECT CAST('2025-01-15' AS DATE)

4. SQL Extensions

DuckDB adds useful SQL extensions beyond standard SQL:

EXCLUDE clause (select all except):

-- Select all columns except sensitive ones
SELECT * EXCLUDE (ssn, salary) FROM employees

REPLACE clause (modify columns in SELECT *):

-- Replace amount with rounded version
SELECT * REPLACE (ROUND(amount, 2) AS amount) FROM sales

List aggregation:

-- Aggregate into arrays
SELECT
  region,
  LIST(product) AS products,
  LIST(DISTINCT customer) AS customers
FROM sales
GROUP BY region

String aggregation:

SELECT
  department,
  STRING_AGG(employee_name, ', ') AS team_members
FROM employees
GROUP BY department

5. Analytical Functions

Window functions:

-- Running totals
SELECT
  date,
  amount,
  SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales

-- Ranking
SELECT
  product,
  sales,
  RANK() OVER (ORDER BY sales DESC) AS rank
FROM product_sales

-- Partitioned windows
SELECT
  region,
  product,
  sales,
  AVG(sales) OVER (PARTITION BY region) AS regional_avg
FROM sales

Percentiles and statistics:

SELECT
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95,
  STDDEV(amount) AS std_dev,
  CORR(amount, quantity) AS correlation
FROM sales

6. Date and Time Functions

-- Current timestamp
SELECT CURRENT_TIMESTAMP

-- Date arithmetic
SELECT date + INTERVAL '7 days' AS next_week FROM sales
SELECT date - INTERVAL '1 month' AS last_month FROM sales

-- Date truncation
SELECT DATE_TRUNC('month', timestamp) AS month FROM events
SELECT DATE_TRUNC('week', timestamp) AS week FROM events

-- Date parts
SELECT
  YEAR(date) AS year,
  MONTH(date) AS month,
  DAYOFWEEK(date) AS day_of_week
FROM sales

7. JSON Support

Parse JSON strings:

-- Extract JSON fields
SELECT
  json_extract(data, '$.user_id') AS user_id,
  json_extract(data, '$.event_type') AS event_type
FROM events

-- Arrow notation (shorthand): -> returns JSON, ->> returns text
SELECT
  data->'user_id' AS user_id,
  data->>'event_type' AS event_type
FROM events

Read JSON files:

SELECT * FROM read_json_auto('events.json')

8. Performance Features

Parallel execution (automatic):

  • DuckDB uses all CPU cores automatically
  • No configuration needed

Larger-than-memory processing:

  • Spills to disk when needed
  • Handles datasets larger than RAM

Columnar storage:

  • Efficient for analytical queries
  • Fast aggregations and filters
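
The defaults are sensible, but parallelism, memory cap, and spill location can be tuned per session when needed; the values below are illustrative:

-- Tune worker threads, memory cap, and spill directory
SET threads = 4;
SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb_spill';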

Indexes (for point lookups):

CREATE INDEX idx_customer ON sales(customer_id)

MXCP Integration

Database Connection

Automatic in MXCP - no setup needed:

# mxcp-site.yml
# DuckDB is the default, no configuration required

Environment variable for custom path:

# Default database path is data/db-default.duckdb
export MXCP_DUCKDB_PATH="/path/to/data/db-default.duckdb"
mxcp serve

Profile-specific databases:

# mxcp-site.yml
profiles:
  development:
    database:
      path: "dev.duckdb"
  production:
    database:
      path: "prod.duckdb"

Using DuckDB in MXCP Tools

Direct SQL queries:

# tools/query_sales.yml
mxcp: 1
tool:
  name: query_sales
  source:
    code: |
      SELECT
        region,
        SUM(amount) as total,
        COUNT(*) as count
      FROM sales
      WHERE sale_date >= $start_date
      GROUP BY region
      ORDER BY total DESC

Query CSV files directly:

tool:
  name: analyze_upload
  source:
    code: |
      SELECT
        COUNT(*) as rows,
        COUNT(DISTINCT customer_id) as unique_customers,
        SUM(amount) as total_revenue
      FROM 'uploads/$filename'

Complex analytical queries:

tool:
  name: customer_cohorts
  source:
    code: |
      WITH first_purchase AS (
        SELECT
          customer_id,
          MIN(DATE_TRUNC('month', purchase_date)) AS cohort_month
        FROM purchases
        GROUP BY customer_id
      ),
      cohort_size AS (
        SELECT
          cohort_month,
          COUNT(DISTINCT customer_id) AS cohort_size
        FROM first_purchase
        GROUP BY cohort_month
      )
      SELECT
        fp.cohort_month,
        DATE_TRUNC('month', p.purchase_date) AS activity_month,
        COUNT(DISTINCT p.customer_id) AS active_customers,
        cs.cohort_size,
        COUNT(DISTINCT p.customer_id)::FLOAT / cs.cohort_size AS retention_rate
      FROM purchases p
      JOIN first_purchase fp ON p.customer_id = fp.customer_id
      JOIN cohort_size cs ON fp.cohort_month = cs.cohort_month
      GROUP BY fp.cohort_month, activity_month, cs.cohort_size
      ORDER BY fp.cohort_month, activity_month

Using DuckDB in Python Endpoints

Access via MXCP runtime:

from mxcp.runtime import db

def analyze_data(region: str) -> dict:
    # Execute query
    result = db.execute(
        "SELECT SUM(amount) as total FROM sales WHERE region = $region",
        {"region": region}
    )

    # Fetch results
    row = result.fetchone()
    return {"total": row["total"]}

def batch_insert(records: list[dict]) -> dict:
    # Insert rows one at a time, binding one set of named parameters per call
    for r in records:
        db.execute(
            "INSERT INTO logs (timestamp, event) VALUES ($ts, $event)",
            {"ts": r["timestamp"], "event": r["event"]}
        )

    return {"inserted": len(records)}

Read files in Python:

def import_csv(filepath: str) -> dict:
    # Create table from CSV.
    # Note: filepath is interpolated directly into the SQL string, so
    # validate it (e.g., against an allow-list) before running this.
    db.execute(f"""
        CREATE TABLE imported_data AS
        SELECT * FROM read_csv_auto('{filepath}')
    """)

    # Get stats
    result = db.execute("SELECT COUNT(*) as count FROM imported_data")
    return {"rows_imported": result.fetchone()["count"]}

Best Practices for MXCP

1. Use Parameter Binding

ALWAYS use parameterized queries to prevent SQL injection:

Correct:

source:
  code: |
    SELECT * FROM sales WHERE region = $region

WRONG (SQL injection risk):

source:
  code: |
    SELECT * FROM sales WHERE region = '$region'

2. Optimize Queries

Index frequently filtered columns:

CREATE INDEX idx_customer ON orders(customer_id)
CREATE INDEX idx_date ON orders(order_date)

Use EXPLAIN to analyze queries:

EXPLAIN SELECT * FROM large_table WHERE id = 123
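
EXPLAIN ANALYZE goes a step further: it actually runs the query and reports per-operator execution times:

EXPLAIN ANALYZE SELECT * FROM large_table WHERE id = 123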

Materialize complex aggregations (via dbt models):

-- Instead of computing the aggregation on every query,
-- materialize a summary table via a dbt model
CREATE TABLE daily_summary AS
SELECT
  DATE_TRUNC('day', timestamp) AS date,
  COUNT(*) AS count,
  SUM(amount) AS total
FROM transactions
GROUP BY date

3. Handle Large Datasets

For large CSVs (>100MB):

  • Use Parquet format instead (much faster)
  • Create tables rather than querying files directly
  • Use dbt to materialize transformations

Conversion to Parquet:

COPY (SELECT * FROM 'large_data.csv')
TO 'large_data.parquet' (FORMAT PARQUET)

4. Data Types in MXCP

Match DuckDB types to MXCP types:

# MXCP tool definition
parameters:
  - name: amount
    type: number        # → DuckDB DOUBLE
  - name: quantity
    type: integer       # → DuckDB INTEGER
  - name: description
    type: string        # → DuckDB VARCHAR
  - name: created_at
    type: string
    format: date-time   # → DuckDB TIMESTAMP
  - name: is_active
    type: boolean       # → DuckDB BOOLEAN
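
If a date-time parameter reaches the SQL as text, an explicit cast keeps the comparison typed; a minimal sketch assuming a hypothetical $created_after parameter on the sales table:

SELECT * FROM sales
WHERE created_at >= CAST($created_after AS TIMESTAMP)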

5. Database File Management

Backup:

# DuckDB is a single file - just copy it (default: data/db-default.duckdb)
cp data/db-default.duckdb data/db-default.duckdb.backup

Export to SQL:

EXPORT DATABASE 'backup_directory'

Import from SQL:

IMPORT DATABASE 'backup_directory'

Common Patterns in MXCP

Pattern 1: CSV → Table → Query

# 1. Load CSV via dbt seed
dbt seed --select customers

# 2. Query from MXCP tool
SELECT * FROM customers WHERE country = $country

Pattern 2: External Data Caching

-- dbt model: cache_external_data.sql
{{ config(materialized='table') }}

SELECT * FROM read_csv_auto('https://example.com/data.csv')

Pattern 3: Multi-File Aggregation

-- Query multiple CSVs
SELECT * FROM 'data/*.csv'

-- Union multiple Parquet files
SELECT * FROM 'archive/2025-*.parquet'
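
When globbing many files, read_csv_auto can also tag each row with its source file, which helps verify multi-file loads; a small sketch reusing the data/*.csv glob:

-- filename = true adds a column with each row's source path
SELECT filename, COUNT(*) AS row_count
FROM read_csv_auto('data/*.csv', filename = true)
GROUP BY filename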

Pattern 4: Real-time + Historical

-- Combine recent API data with historical cache
SELECT * FROM read_json_auto('https://api.com/recent')
UNION ALL
SELECT * FROM historical_data WHERE date < CURRENT_DATE - INTERVAL '7 days'

Troubleshooting

Issue: "Table does not exist" Solution: Ensure dbt models/seeds have been run, check table name spelling

Issue: "Type mismatch" Solution: Add explicit CAST() or update schema.yml with correct data types

Issue: "Out of memory" Solution: Reduce query scope, add WHERE filters, materialize intermediate results

Issue: "CSV parsing error" Solution: Use read_csv_auto with explicit options (delim, quote, etc.)

Issue: "Slow queries" Solution: Add indexes, materialize via dbt, use Parquet instead of CSV

Summary for MXCP Builders

When building MXCP servers with DuckDB:

  1. Use parameterized queries ($param) to prevent injection
  2. Load CSVs via dbt seeds for version control and validation
  3. Materialize complex queries as dbt models
  4. Index frequently filtered columns for performance
  5. Use Parquet for large datasets (>100MB)
  6. Match MXCP types to DuckDB types in tool definitions
  7. Leverage DuckDB extensions (EXCLUDE, REPLACE, window functions)

DuckDB is the powerhouse behind MXCP's data capabilities - understanding it enables building robust, high-performance MCP servers.