# DuckDB Essentials for MXCP

Essential DuckDB knowledge for building MXCP servers with embedded analytics.

## What is DuckDB?

**DuckDB is an embedded, in-process SQL OLAP database** - think "SQLite for analytics". It runs directly in your MXCP server process without needing a separate database server.

**Key characteristics**:
- **Embedded**: No server setup, no configuration
- **Fast**: Vectorized execution engine, parallel processing
- **Versatile**: Reads CSV, Parquet, JSON directly from disk or URLs
- **SQL**: Full SQL support with analytical extensions
- **Portable**: Single-file database, easy to move/backup

**MXCP uses DuckDB by default** for all SQL-based tools and resources.

## Core Features for MXCP

### 1. Direct File Reading

**DuckDB can query files without importing them first**:

```sql
-- Query CSV directly
SELECT * FROM 'data/sales.csv'

-- Query with explicit reader
SELECT * FROM read_csv_auto('data/sales.csv')

-- Query Parquet
SELECT * FROM 'data/sales.parquet'

-- Query JSON
SELECT * FROM read_json_auto('data/events.json')

-- Query from URL
SELECT * FROM 'https://example.com/data.csv'
```

**Auto-detection**: DuckDB automatically infers:
- Column names from headers
- Data types from values
- CSV delimiters, quotes, etc.

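To see what DuckDB actually inferred before wiring a file into a tool, `DESCRIBE` works on any query. A minimal sketch, reusing the example path from above:

```sql
-- Inspect the inferred column names and types for a CSV
DESCRIBE SELECT * FROM 'data/sales.csv'
```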
### 2. CSV Import and Export

**Import CSV to table**:
```sql
-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales.csv')

-- Or use COPY to load into an existing table
COPY sales FROM 'sales.csv' (AUTO_DETECT TRUE)
```

**Export to CSV**:
```sql
-- Export query results
COPY (SELECT * FROM sales WHERE region = 'US')
TO 'us_sales.csv' (HEADER, DELIMITER ',')
```

**CSV reading options**:
```sql
SELECT * FROM read_csv_auto(
    'data.csv',
    header = true,
    delim = ',',
    quote = '"',
    dateformat = '%Y-%m-%d'
)
```

### 3. Data Types

**Common DuckDB types** (important for MXCP type validation):

```sql
-- Numeric
INTEGER, BIGINT, DECIMAL(10,2), DOUBLE

-- String
VARCHAR, TEXT

-- Temporal
DATE, TIME, TIMESTAMP, INTERVAL

-- Complex
ARRAY, STRUCT, MAP, JSON

-- Boolean
BOOLEAN
```

**Type casting**:
```sql
-- Cast to specific type
SELECT CAST(amount AS DECIMAL(10,2)) FROM sales

-- Short syntax
SELECT amount::DECIMAL(10,2) FROM sales

-- Date parsing
SELECT CAST('2025-01-15' AS DATE)
```

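When input data may not match the target type, `TRY_CAST` returns NULL instead of raising an error, which is handy when validating uploaded or external data. A small sketch:

```sql
-- Returns NULL rather than failing on bad input
SELECT TRY_CAST('not-a-number' AS INTEGER) AS parsed
```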
### 4. SQL Extensions

**DuckDB adds useful SQL extensions beyond standard SQL**:

**EXCLUDE clause** (select all except):
```sql
-- Select all columns except sensitive ones
SELECT * EXCLUDE (ssn, salary) FROM employees
```

**REPLACE clause** (modify columns in SELECT *):
```sql
-- Replace amount with rounded version
SELECT * REPLACE (ROUND(amount, 2) AS amount) FROM sales
```

**List aggregation**:
```sql
-- Aggregate into arrays
SELECT
    region,
    LIST(product) AS products,
    LIST(DISTINCT customer) AS customers
FROM sales
GROUP BY region
```

**String aggregation**:
```sql
SELECT
    department,
    STRING_AGG(employee_name, ', ') AS team_members
FROM employees
GROUP BY department
```

### 5. Analytical Functions

**Window functions**:
```sql
-- Running totals
SELECT
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales

-- Ranking
SELECT
    product,
    sales,
    RANK() OVER (ORDER BY sales DESC) AS rank
FROM product_sales

-- Partitioned windows
SELECT
    region,
    product,
    sales,
    AVG(sales) OVER (PARTITION BY region) AS regional_avg
FROM sales
```

**Percentiles and statistics**:
```sql
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95,
    STDDEV(amount) AS std_dev,
    CORR(amount, quantity) AS correlation
FROM sales
```

### 6. Date and Time Functions

```sql
-- Current timestamp
SELECT CURRENT_TIMESTAMP

-- Date arithmetic
SELECT date + INTERVAL '7 days' AS next_week
SELECT date - INTERVAL '1 month' AS last_month

-- Date truncation
SELECT DATE_TRUNC('month', timestamp) AS month
SELECT DATE_TRUNC('week', timestamp) AS week

-- Date parts
SELECT
    YEAR(date) AS year,
    MONTH(date) AS month,
    DAYOFWEEK(date) AS day_of_week
```

### 7. JSON Support

**Parse JSON strings**:
```sql
-- Extract JSON fields
SELECT
    json_extract(data, '$.user_id') AS user_id,
    json_extract(data, '$.event_type') AS event_type
FROM events

-- Arrow notation (shorthand)
SELECT
    data->'user_id' AS user_id,
    data->>'event_type' AS event_type
FROM events
```

**Read JSON files**:
```sql
SELECT * FROM read_json_auto('events.json')
```

### 8. Performance Features

**Parallel execution** (automatic):
- DuckDB uses all CPU cores automatically
- No configuration needed

**Larger-than-memory processing**:
- Spills to disk when needed
- Handles datasets larger than RAM

**Columnar storage**:
- Efficient for analytical queries
- Fast aggregations and filters

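These behaviors need no setup, but the knobs are exposed if you ever want to tune them. A minimal sketch of the relevant settings (the values and spill path are illustrative, not recommendations):

```sql
-- Check and adjust resource settings; the defaults are usually fine
SELECT current_setting('threads');
SET threads = 4;                            -- limit parallelism
SET memory_limit = '4GB';                   -- cap memory before spilling to disk
SET temp_directory = '/tmp/duckdb_spill';   -- where spill files are written
```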
**Indexes** (for point lookups):
```sql
CREATE INDEX idx_customer ON sales(customer_id)
```

## MXCP Integration

### Database Connection

**Automatic in MXCP** - no setup needed:
```yaml
# mxcp-site.yml
# DuckDB is the default, no configuration required
```

**Environment variable** for custom path:
```bash
# Default database path is data/db-default.duckdb
export MXCP_DUCKDB_PATH="/path/to/data/db-default.duckdb"
mxcp serve
```

**Profile-specific databases**:
```yaml
# mxcp-site.yml
profiles:
  development:
    database:
      path: "dev.duckdb"
  production:
    database:
      path: "prod.duckdb"
```

### Using DuckDB in MXCP Tools

**Direct SQL queries**:
```yaml
# tools/query_sales.yml
mxcp: 1
tool:
  name: query_sales
  source:
    code: |
      SELECT
        region,
        SUM(amount) as total,
        COUNT(*) as count
      FROM sales
      WHERE sale_date >= $start_date
      GROUP BY region
      ORDER BY total DESC
```

**Query CSV files directly**:
```yaml
tool:
  name: analyze_upload
  source:
    code: |
      SELECT
        COUNT(*) as rows,
        COUNT(DISTINCT customer_id) as unique_customers,
        SUM(amount) as total_revenue
      FROM 'uploads/$filename'
```

**Complex analytical queries**:
```yaml
tool:
  name: customer_cohorts
  source:
    code: |
      WITH first_purchase AS (
        SELECT
          customer_id,
          MIN(DATE_TRUNC('month', purchase_date)) AS cohort_month
        FROM purchases
        GROUP BY customer_id
      ),
      cohort_size AS (
        SELECT
          cohort_month,
          COUNT(DISTINCT customer_id) AS cohort_size
        FROM first_purchase
        GROUP BY cohort_month
      )
      SELECT
        fp.cohort_month,
        DATE_TRUNC('month', p.purchase_date) AS activity_month,
        COUNT(DISTINCT p.customer_id) AS active_customers,
        cs.cohort_size,
        COUNT(DISTINCT p.customer_id)::FLOAT / cs.cohort_size AS retention_rate
      FROM purchases p
      JOIN first_purchase fp ON p.customer_id = fp.customer_id
      JOIN cohort_size cs ON fp.cohort_month = cs.cohort_month
      GROUP BY fp.cohort_month, activity_month, cs.cohort_size
      ORDER BY fp.cohort_month, activity_month
```

### Using DuckDB in Python Endpoints

**Access via MXCP runtime**:
```python
from mxcp.runtime import db

def analyze_data(region: str) -> dict:
    # Execute query with a named parameter
    result = db.execute(
        "SELECT SUM(amount) as total FROM sales WHERE region = $region",
        {"region": region}
    )

    # Fetch results
    row = result.fetchone()
    return {"total": row["total"]}

def batch_insert(records: list[dict]) -> dict:
    # Insert data row by row with named parameters
    for r in records:
        db.execute(
            "INSERT INTO logs (timestamp, event) VALUES ($timestamp, $event)",
            {"timestamp": r["timestamp"], "event": r["event"]}
        )

    return {"inserted": len(records)}
```

**Read files in Python**:
```python
from mxcp.runtime import db

def import_csv(filepath: str) -> dict:
    # Create table from CSV
    # (filepath is interpolated into the SQL, so validate it against an allowlist first)
    db.execute(f"""
        CREATE TABLE imported_data AS
        SELECT * FROM read_csv_auto('{filepath}')
    """)

    # Get stats
    result = db.execute("SELECT COUNT(*) as count FROM imported_data")
    return {"rows_imported": result.fetchone()["count"]}
```

## Best Practices for MXCP

### 1. Use Parameter Binding

**ALWAYS use parameterized queries** to prevent SQL injection:

✅ **Correct**:
```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = $region
```

❌ **WRONG** (SQL injection risk):
```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = '$region'
```

### 2. Optimize Queries

**Index frequently filtered columns**:
```sql
CREATE INDEX idx_customer ON orders(customer_id)
CREATE INDEX idx_date ON orders(order_date)
```

**Use EXPLAIN to analyze queries**:
```sql
EXPLAIN SELECT * FROM large_table WHERE id = 123
```

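To see where time is actually spent rather than just the plan, `EXPLAIN ANALYZE` executes the query and reports per-operator timings. A quick sketch (the table and aggregation are illustrative):

```sql
-- Runs the query and shows the profiled plan with timings
EXPLAIN ANALYZE SELECT region, SUM(amount) FROM sales GROUP BY region
```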
**Materialize complex aggregations** (via dbt models):
```sql
-- Instead of computing on every query
-- Create a materialized view via dbt
CREATE TABLE daily_summary AS
SELECT
    DATE_TRUNC('day', timestamp) AS date,
    COUNT(*) AS count,
    SUM(amount) AS total
FROM transactions
GROUP BY date
```

### 3. Handle Large Datasets

**For large CSVs** (>100MB):
- Use Parquet format instead (much faster)
- Create tables rather than querying files directly
- Use dbt to materialize transformations

**Conversion to Parquet**:
```sql
COPY (SELECT * FROM 'large_data.csv')
TO 'large_data.parquet' (FORMAT PARQUET)
```

### 4. Data Types in MXCP

**Match DuckDB types to MXCP types**:

```yaml
# MXCP tool definition
parameters:
  - name: amount
    type: number        # → DuckDB DOUBLE
  - name: quantity
    type: integer       # → DuckDB INTEGER
  - name: description
    type: string        # → DuckDB VARCHAR
  - name: created_at
    type: string
    format: date-time   # → DuckDB TIMESTAMP
  - name: is_active
    type: boolean       # → DuckDB BOOLEAN
```

### 5. Database File Management

**Backup**:
```bash
# DuckDB is a single file - just copy it (default: data/db-default.duckdb)
cp data/db-default.duckdb data/db-default.duckdb.backup
```

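If the server is still running when you copy the file, recent writes may still be sitting in the write-ahead log (a `.wal` file next to the database). Either stop the server first or run a checkpoint so everything is flushed into the main file - a sketch:

```sql
-- Flush the write-ahead log into the database file before copying it
CHECKPOINT
```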
**Export to SQL**:
```sql
EXPORT DATABASE 'backup_directory'
```

**Import from SQL**:
```sql
IMPORT DATABASE 'backup_directory'
```

## Common Patterns in MXCP

### Pattern 1: CSV → Table → Query

```bash
# 1. Load CSV via dbt seed
dbt seed --select customers

# 2. Query from MXCP tool
SELECT * FROM customers WHERE country = $country
```

### Pattern 2: External Data Caching

```sql
-- dbt model: cache_external_data.sql
{{ config(materialized='table') }}

SELECT * FROM read_csv_auto('https://example.com/data.csv')
```

### Pattern 3: Multi-File Aggregation

```sql
-- Query multiple CSVs
SELECT * FROM 'data/*.csv'

-- Union multiple Parquet files
SELECT * FROM 'archive/2025-*.parquet'
```

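When aggregating many files it is often useful to know which file each row came from; the CSV and Parquet readers accept a `filename` option that adds it as a column. A sketch using the same glob as above:

```sql
-- Adds a filename column indicating the source file of each row
SELECT * FROM read_csv_auto('data/*.csv', filename = true)
```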
### Pattern 4: Real-time + Historical

```sql
-- Combine recent API data with historical cache
SELECT * FROM read_json_auto('https://api.com/recent')
UNION ALL
SELECT * FROM historical_data WHERE date < CURRENT_DATE - INTERVAL '7 days'
```

## Troubleshooting

**Issue**: "Table does not exist"
**Solution**: Ensure dbt models/seeds have been run and check the table name spelling

**Issue**: "Type mismatch"
**Solution**: Add an explicit CAST() or update schema.yml with the correct data types

**Issue**: "Out of memory"
**Solution**: Reduce the query scope, add WHERE filters, or materialize intermediate results

**Issue**: "CSV parsing error"
**Solution**: Use read_csv_auto with explicit options (delim, quote, etc.)

**Issue**: "Slow queries"
**Solution**: Add indexes, materialize via dbt, or use Parquet instead of CSV

## Summary for MXCP Builders

When building MXCP servers with DuckDB:

1. **Use parameterized queries** (`$param`) to prevent injection
2. **Load CSVs via dbt seeds** for version control and validation
3. **Materialize complex queries** as dbt models
4. **Index frequently filtered columns** for performance
5. **Use Parquet for large datasets** (>100MB)
6. **Match MXCP types to DuckDB types** in tool definitions
7. **Leverage DuckDB extensions** (EXCLUDE, REPLACE, window functions)

DuckDB is the powerhouse behind MXCP's data capabilities - understanding it is key to building robust, high-performance MCP servers.