# DuckDB Essentials for MXCP

Essential DuckDB knowledge for building MXCP servers with embedded analytics.

## What is DuckDB?

**DuckDB is an embedded, in-process SQL OLAP database** - think "SQLite for analytics". It runs directly in your MXCP server process without needing a separate database server.

**Key characteristics**:

- **Embedded**: No server setup, no configuration
- **Fast**: Vectorized execution engine, parallel processing
- **Versatile**: Reads CSV, Parquet, JSON directly from disk or URLs
- **SQL**: Full SQL support with analytical extensions
- **Portable**: Single-file database, easy to move/backup

**MXCP uses DuckDB by default** for all SQL-based tools and resources.

## Core Features for MXCP

### 1. Direct File Reading

**DuckDB can query files without importing them first**:

```sql
-- Query CSV directly
SELECT * FROM 'data/sales.csv'

-- Query with explicit reader
SELECT * FROM read_csv_auto('data/sales.csv')

-- Query Parquet
SELECT * FROM 'data/sales.parquet'

-- Query JSON
SELECT * FROM read_json_auto('data/events.json')

-- Query from URL (uses the httpfs extension, which recent DuckDB versions autoload)
SELECT * FROM 'https://example.com/data.csv'
```

**Auto-detection**: DuckDB automatically infers:

- Column names from headers
- Data types from values
- CSV delimiters, quotes, etc.
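
To sanity-check what was inferred before relying on it, `DESCRIBE` works on any query - a minimal sketch:

```sql
-- Show the column names and types DuckDB inferred for the file
DESCRIBE SELECT * FROM 'data/sales.csv'
```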

### 2. CSV Import and Export

**Import CSV to table**:

```sql
-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales.csv')

-- Or use COPY
COPY sales FROM 'sales.csv' (AUTO_DETECT TRUE)
```

**Export to CSV**:

```sql
-- Export query results
COPY (SELECT * FROM sales WHERE region = 'US')
TO 'us_sales.csv' (HEADER, DELIMITER ',')
```

**CSV reading options**:

```sql
SELECT * FROM read_csv_auto(
    'data.csv',
    header = true,
    delim = ',',
    quote = '"',
    dateformat = '%Y-%m-%d'
)
```
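
For messy files, two more `read_csv_auto` options are often useful - a sketch using standard DuckDB options:

```sql
SELECT * FROM read_csv_auto(
    'data.csv',
    ignore_errors = true,  -- skip rows that fail to parse instead of erroring
    sample_size = -1       -- infer types from the whole file, not just a sample
)
```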

### 3. Data Types

**Common DuckDB types** (important for MXCP type validation):

```sql
-- Numeric
INTEGER, BIGINT, DECIMAL(10,2), DOUBLE

-- String
VARCHAR, TEXT

-- Temporal
DATE, TIME, TIMESTAMP, INTERVAL

-- Complex (DuckDB's variable-length array type is LIST, e.g. INTEGER[])
LIST, STRUCT, MAP, JSON

-- Boolean
BOOLEAN
```

**Type casting**:

```sql
-- Cast to specific type
SELECT CAST(amount AS DECIMAL(10,2)) FROM sales

-- Short syntax
SELECT amount::DECIMAL(10,2) FROM sales

-- Date parsing
SELECT CAST('2025-01-15' AS DATE)
```
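
When input quality is uncertain, `TRY_CAST` returns NULL instead of raising an error - handy in MXCP tools that receive user-supplied values:

```sql
SELECT TRY_CAST('not-a-number' AS INTEGER)  -- NULL, no error
SELECT TRY_CAST('2025-01-15' AS DATE)       -- 2025-01-15
```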

### 4. SQL Extensions

**DuckDB adds useful SQL extensions beyond standard SQL**:

**EXCLUDE clause** (select all except):

```sql
-- Select all columns except sensitive ones
SELECT * EXCLUDE (ssn, salary) FROM employees
```

**REPLACE clause** (modify columns in SELECT *):

```sql
-- Replace amount with rounded version
SELECT * REPLACE (ROUND(amount, 2) AS amount) FROM sales
```

**List aggregation**:

```sql
-- Aggregate into arrays
SELECT
    region,
    LIST(product) AS products,
    LIST(DISTINCT customer) AS customers
FROM sales
GROUP BY region
```

**String aggregation**:

```sql
SELECT
    department,
    STRING_AGG(employee_name, ', ') AS team_members
FROM employees
GROUP BY department
```
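
Two more DuckDB shorthands worth knowing - a sketch, where `regional_products` and its `products` LIST column are assumed tables for illustration:

```sql
-- GROUP BY ALL: group by every non-aggregated column automatically
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY ALL

-- UNNEST: expand a LIST column back into one row per element
SELECT region, UNNEST(products) AS product
FROM regional_products
```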

### 5. Analytical Functions

**Window functions**:

```sql
-- Running totals
SELECT
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales

-- Ranking
SELECT
    product,
    sales,
    RANK() OVER (ORDER BY sales DESC) AS sales_rank
FROM product_sales

-- Partitioned windows
SELECT
    region,
    product,
    sales,
    AVG(sales) OVER (PARTITION BY region) AS regional_avg
FROM sales
```

**Percentiles and statistics**:

```sql
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95,
    STDDEV(amount) AS std_dev,
    CORR(amount, quantity) AS correlation
FROM sales
```
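
DuckDB's `QUALIFY` clause filters on window results without a wrapping subquery - a minimal sketch on the same `product_sales` table:

```sql
-- Keep only the top 10 products by sales
SELECT
    product,
    sales,
    RANK() OVER (ORDER BY sales DESC) AS sales_rank
FROM product_sales
QUALIFY sales_rank <= 10
```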

### 6. Date and Time Functions

```sql
-- Current timestamp
SELECT CURRENT_TIMESTAMP

-- Date arithmetic
SELECT date + INTERVAL '7 days' AS next_week
SELECT date - INTERVAL '1 month' AS last_month

-- Date truncation
SELECT DATE_TRUNC('month', timestamp) AS month
SELECT DATE_TRUNC('week', timestamp) AS week

-- Date parts
SELECT
    YEAR(date) AS year,
    MONTH(date) AS month,
    DAYOFWEEK(date) AS day_of_week
```
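
To build a continuous date spine (e.g. for filling gaps in a time series), `generate_series` also works as a table function - a sketch:

```sql
-- One row per day in January 2025
SELECT * FROM generate_series(
    TIMESTAMP '2025-01-01', TIMESTAMP '2025-01-31', INTERVAL '1 day'
)
```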

### 7. JSON Support

**Parse JSON strings**:

```sql
-- Extract JSON fields
SELECT
    json_extract(data, '$.user_id') AS user_id,
    json_extract(data, '$.event_type') AS event_type
FROM events

-- Arrow notation (shorthand)
SELECT
    data->'user_id' AS user_id,
    data->>'event_type' AS event_type
FROM events
```
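
Nested structures can be reached by chaining arrows or using a JSONPath - a sketch, where the `address` and `items` fields are assumptions:

```sql
SELECT
    data->'address'->>'city' AS city,                   -- chained key access
    json_extract(data, '$.items[0].sku') AS first_sku  -- JSONPath with array index
FROM events
```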

**Read JSON files**:

```sql
SELECT * FROM read_json_auto('events.json')
```

### 8. Performance Features

**Parallel execution** (automatic):

- DuckDB uses all CPU cores automatically
- No configuration needed (tunable if required, as sketched below)

**Larger-than-memory processing**:

- Spills to disk when needed
- Handles datasets larger than RAM

**Columnar storage**:

- Efficient for analytical queries
- Fast aggregations and filters
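
Both behaviors can be tuned when needed; `threads` and `memory_limit` are standard DuckDB settings:

```sql
-- Cap memory before DuckDB starts spilling to disk
SET memory_limit = '4GB'

-- Limit worker threads (defaults to all cores)
SET threads = 4
```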

**Indexes** (for point lookups):

```sql
CREATE INDEX idx_customer ON sales(customer_id)
```

## MXCP Integration

### Database Connection

**Automatic in MXCP** - no setup needed:

```yaml
# mxcp-site.yml
# DuckDB is the default, no configuration required
```

**Environment variable** for custom path:

```bash
# Default database path is data/db-default.duckdb
export MXCP_DUCKDB_PATH="/path/to/data/db-default.duckdb"
mxcp serve
```

**Profile-specific databases**:

```yaml
# mxcp-site.yml
profiles:
  development:
    database:
      path: "dev.duckdb"
  production:
    database:
      path: "prod.duckdb"
```

### Using DuckDB in MXCP Tools

**Direct SQL queries**:

```yaml
# tools/query_sales.yml
mxcp: 1
tool:
  name: query_sales
  source:
    code: |
      SELECT
        region,
        SUM(amount) AS total,
        COUNT(*) AS count
      FROM sales
      WHERE sale_date >= $start_date
      GROUP BY region
      ORDER BY total DESC
```

**Query CSV files directly**:

```yaml
tool:
  name: analyze_upload
  source:
    code: |
      -- Note: $filename is substituted into the path string rather than bound
      -- as a parameter - validate it against an allowlist of known files
      SELECT
        COUNT(*) AS rows,
        COUNT(DISTINCT customer_id) AS unique_customers,
        SUM(amount) AS total_revenue
      FROM 'uploads/$filename'
```

**Complex analytical queries**:

```yaml
tool:
  name: customer_cohorts
  source:
    code: |
      WITH first_purchase AS (
        SELECT
          customer_id,
          MIN(DATE_TRUNC('month', purchase_date)) AS cohort_month
        FROM purchases
        GROUP BY customer_id
      ),
      cohort_size AS (
        SELECT
          cohort_month,
          COUNT(DISTINCT customer_id) AS cohort_size
        FROM first_purchase
        GROUP BY cohort_month
      )
      SELECT
        fp.cohort_month,
        DATE_TRUNC('month', p.purchase_date) AS activity_month,
        COUNT(DISTINCT p.customer_id) AS active_customers,
        cs.cohort_size,
        COUNT(DISTINCT p.customer_id)::FLOAT / cs.cohort_size AS retention_rate
      FROM purchases p
      JOIN first_purchase fp ON p.customer_id = fp.customer_id
      JOIN cohort_size cs ON fp.cohort_month = cs.cohort_month
      GROUP BY fp.cohort_month, activity_month, cs.cohort_size
      ORDER BY fp.cohort_month, activity_month
```

### Using DuckDB in Python Endpoints

**Access via MXCP runtime**:

```python
from mxcp.runtime import db


def analyze_data(region: str) -> dict:
    # Execute a parameterized query (named parameters match the dict keys)
    result = db.execute(
        "SELECT SUM(amount) AS total FROM sales WHERE region = $region",
        {"region": region}
    )

    # Fetch results
    row = result.fetchone()
    return {"total": row["total"]}


def batch_insert(records: list[dict]) -> dict:
    # Insert each record with named parameters
    for r in records:
        db.execute(
            "INSERT INTO logs (timestamp, event) VALUES ($timestamp, $event)",
            {"timestamp": r["timestamp"], "event": r["event"]}
        )

    return {"inserted": len(records)}
```

**Read files in Python**:

```python
def import_csv(filepath: str) -> dict:
    # filepath is interpolated into the SQL string, so escape quotes
    # and only accept paths from trusted sources
    safe_path = filepath.replace("'", "''")

    # Create table from CSV
    db.execute(f"""
        CREATE TABLE imported_data AS
        SELECT * FROM read_csv_auto('{safe_path}')
    """)

    # Get stats
    result = db.execute("SELECT COUNT(*) AS count FROM imported_data")
    return {"rows_imported": result.fetchone()["count"]}
```

## Best Practices for MXCP

### 1. Use Parameter Binding

**ALWAYS use parameterized queries** to prevent SQL injection:

✅ **Correct**:

```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = $region
```

❌ **WRONG** (SQL injection risk):

```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = '$region'
```

### 2. Optimize Queries

**Index frequently filtered columns**:

```sql
CREATE INDEX idx_customer ON orders(customer_id)
CREATE INDEX idx_date ON orders(order_date)
```

**Use EXPLAIN to analyze queries**:

```sql
EXPLAIN SELECT * FROM large_table WHERE id = 123

-- Or execute the query and collect actual timings
EXPLAIN ANALYZE SELECT * FROM large_table WHERE id = 123
```

**Materialize complex aggregations** (via dbt models):

```sql
-- Instead of computing on every query,
-- materialize the result once as a table (e.g. a dbt model)
CREATE TABLE daily_summary AS
SELECT
    DATE_TRUNC('day', timestamp) AS date,
    COUNT(*) AS count,
    SUM(amount) AS total
FROM transactions
GROUP BY date
```

### 3. Handle Large Datasets

**For large CSVs** (>100MB):

- Use Parquet format instead (much faster)
- Create tables rather than querying files directly
- Use dbt to materialize transformations

**Conversion to Parquet**:

```sql
COPY (SELECT * FROM 'large_data.csv')
TO 'large_data.parquet' (FORMAT PARQUET)
```
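
For very large exports, Hive-style partitioning keeps individual files manageable - a sketch, where `PARTITION_BY` is a standard DuckDB COPY option and the `year` column is derived for illustration:

```sql
-- Writes one Parquet directory per year under sales_partitioned/
COPY (SELECT *, YEAR(sale_date) AS year FROM sales)
TO 'sales_partitioned' (FORMAT PARQUET, PARTITION_BY (year))
```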

### 4. Data Types in MXCP

**Match DuckDB types to MXCP types**:

```yaml
# MXCP tool definition
parameters:
  - name: amount
    type: number        # → DuckDB DOUBLE
  - name: quantity
    type: integer       # → DuckDB INTEGER
  - name: description
    type: string        # → DuckDB VARCHAR
  - name: created_at
    type: string
    format: date-time   # → DuckDB TIMESTAMP
  - name: is_active
    type: boolean       # → DuckDB BOOLEAN
```
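
When a date-time arrives as a string parameter, an explicit cast in the SQL keeps comparisons well-typed - a sketch using the `created_at` parameter above against an assumed `events` table:

```sql
SELECT * FROM events
WHERE created_at >= CAST($created_at AS TIMESTAMP)
```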

### 5. Database File Management

**Backup**:

```bash
# DuckDB is a single file - just copy it (default: data/db-default.duckdb)
cp data/db-default.duckdb data/db-default.duckdb.backup
```

**Export to SQL**:

```sql
EXPORT DATABASE 'backup_directory'

-- Or store table data as compressed Parquet
EXPORT DATABASE 'backup_directory' (FORMAT PARQUET)
```

**Import from SQL**:

```sql
IMPORT DATABASE 'backup_directory'
```

## Common Patterns in MXCP

### Pattern 1: CSV → Table → Query

```bash
# 1. Load CSV via dbt seed
dbt seed --select customers
```

```sql
-- 2. Query from MXCP tool
SELECT * FROM customers WHERE country = $country
```

### Pattern 2: External Data Caching

```sql
-- dbt model: cache_external_data.sql
{{ config(materialized='table') }}

SELECT * FROM read_csv_auto('https://example.com/data.csv')
```

### Pattern 3: Multi-File Aggregation

```sql
-- Query multiple CSVs
SELECT * FROM 'data/*.csv'

-- Union multiple Parquet files
SELECT * FROM 'archive/2025-*.parquet'
```
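
When aggregating many files, it helps to track provenance - a sketch, where `filename = true` is a standard `read_csv_auto` option that adds the source path as a column:

```sql
SELECT filename, COUNT(*) AS rows
FROM read_csv_auto('data/*.csv', filename = true)
GROUP BY filename
```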

### Pattern 4: Real-time + Historical

```sql
-- Combine recent API data with historical cache
SELECT * FROM read_json_auto('https://api.com/recent')
UNION ALL
SELECT * FROM historical_data WHERE date < CURRENT_DATE - INTERVAL '7 days'
```

## Troubleshooting

**Issue**: "Table does not exist"
**Solution**: Ensure dbt models/seeds have been run; check table name spelling

**Issue**: "Type mismatch"
**Solution**: Add explicit CAST() or update schema.yml with correct data types

**Issue**: "Out of memory"
**Solution**: Reduce query scope, add WHERE filters, materialize intermediate results

**Issue**: "CSV parsing error"
**Solution**: Use read_csv_auto with explicit options (delim, quote, etc.)

**Issue**: "Slow queries"
**Solution**: Add indexes, materialize via dbt, use Parquet instead of CSV

## Summary for MXCP Builders

When building MXCP servers with DuckDB:

1. **Use parameterized queries** (`$param`) to prevent injection
2. **Load CSVs via dbt seeds** for version control and validation
3. **Materialize complex queries** as dbt models
4. **Index frequently filtered columns** for performance
5. **Use Parquet for large datasets** (>100MB)
6. **Match MXCP types to DuckDB types** in tool definitions
7. **Leverage DuckDB extensions** (EXCLUDE, REPLACE, window functions)

DuckDB is the powerhouse behind MXCP's data capabilities - understanding it enables building robust, high-performance MCP servers.