# DuckDB Essentials for MXCP
Essential DuckDB knowledge for building MXCP servers with embedded analytics.
## What is DuckDB?
**DuckDB is an embedded, in-process SQL OLAP database** - think "SQLite for analytics". It runs directly in your MXCP server process without needing a separate database server.
**Key characteristics**:
- **Embedded**: No server setup, no configuration
- **Fast**: Vectorized execution engine, parallel processing
- **Versatile**: Reads CSV, Parquet, JSON directly from disk or URLs
- **SQL**: Full SQL support with analytical extensions
- **Portable**: Single-file database, easy to move/backup
**MXCP uses DuckDB by default** for all SQL-based tools and resources.
## Core Features for MXCP
### 1. Direct File Reading
**DuckDB can query files without importing them first**:
```sql
-- Query CSV directly
SELECT * FROM 'data/sales.csv'
-- Query with explicit reader
SELECT * FROM read_csv_auto('data/sales.csv')
-- Query Parquet
SELECT * FROM 'data/sales.parquet'
-- Query JSON
SELECT * FROM read_json_auto('data/events.json')
-- Query from URL
SELECT * FROM 'https://example.com/data.csv'
```
**Auto-detection**: DuckDB automatically infers:
- Column names from headers
- Data types from values
- CSV delimiters, quotes, etc.
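To check what was inferred before building a tool on top of it, DESCRIBE reports the detected column names and types, and SUMMARIZE adds per-column statistics - a quick sketch using the sales file from above:
```sql
-- Inspect the inferred schema of a CSV
DESCRIBE SELECT * FROM 'data/sales.csv'
-- Per-column statistics (min, max, null count, ...)
SUMMARIZE SELECT * FROM 'data/sales.csv'
```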
### 2. CSV Import and Export
**Import CSV to table**:
```sql
-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales.csv')
-- Or use COPY
COPY sales FROM 'sales.csv' (AUTO_DETECT TRUE)
```
**Export to CSV**:
```sql
-- Export query results
COPY (SELECT * FROM sales WHERE region = 'US')
TO 'us_sales.csv' (HEADER, DELIMITER ',')
```
**CSV reading options**:
```sql
SELECT * FROM read_csv_auto(
  'data.csv',
  header = true,
  delim = ',',
  quote = '"',
  dateformat = '%Y-%m-%d'
)
```
### 3. Data Types
**Common DuckDB types** (important for MXCP type validation):
```sql
-- Numeric
INTEGER, BIGINT, DECIMAL(10,2), DOUBLE
-- String
VARCHAR, TEXT
-- Temporal
DATE, TIME, TIMESTAMP, INTERVAL
-- Complex
ARRAY, STRUCT, MAP, JSON
-- Boolean
BOOLEAN
```
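The complex types can be constructed inline with literal syntax, which is handy for quick experiments before wiring them into a tool (a minimal sketch; the column aliases are made up):
```sql
SELECT
  [1, 2, 3] AS scores,                        -- list
  {'name': 'Ada', 'active': true} AS profile, -- struct
  MAP(['a', 'b'], [10, 20]) AS lookup         -- map
```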
**Type casting**:
```sql
-- Cast to specific type
SELECT CAST(amount AS DECIMAL(10,2)) FROM sales
-- Short syntax
SELECT amount::DECIMAL(10,2) FROM sales
-- Date parsing
SELECT CAST('2025-01-15' AS DATE)
```
### 4. SQL Extensions
**DuckDB adds useful SQL extensions beyond standard SQL**:
**EXCLUDE clause** (select all except):
```sql
-- Select all columns except sensitive ones
SELECT * EXCLUDE (ssn, salary) FROM employees
```
**REPLACE clause** (modify columns in SELECT *):
```sql
-- Replace amount with rounded version
SELECT * REPLACE (ROUND(amount, 2) AS amount) FROM sales
```
**List aggregation**:
```sql
-- Aggregate into arrays
SELECT
  region,
  LIST(product) AS products,
  LIST(DISTINCT customer) AS customers
FROM sales
GROUP BY region
```
**String aggregation**:
```sql
SELECT
  department,
  STRING_AGG(employee_name, ', ') AS team_members
FROM employees
GROUP BY department
```
### 5. Analytical Functions
**Window functions**:
```sql
-- Running totals
SELECT
  date,
  amount,
  SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales
-- Ranking
SELECT
  product,
  sales,
  RANK() OVER (ORDER BY sales DESC) AS rank
FROM product_sales
-- Partitioned windows
SELECT
  region,
  product,
  sales,
  AVG(sales) OVER (PARTITION BY region) AS regional_avg
FROM sales
```
**Percentiles and statistics**:
```sql
SELECT
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95,
  STDDEV(amount) AS std_dev,
  CORR(amount, quantity) AS correlation
FROM sales
```
### 6. Date and Time Functions
```sql
-- Current timestamp
SELECT CURRENT_TIMESTAMP
-- Date arithmetic
SELECT date + INTERVAL '7 days' AS next_week
SELECT date - INTERVAL '1 month' AS last_month
-- Date truncation
SELECT DATE_TRUNC('month', timestamp) AS month
SELECT DATE_TRUNC('week', timestamp) AS week
-- Date parts
SELECT
  YEAR(date) AS year,
  MONTH(date) AS month,
  DAYOFWEEK(date) AS day_of_week
```
### 7. JSON Support
**Parse JSON strings**:
```sql
-- Extract JSON fields
SELECT
  json_extract(data, '$.user_id') AS user_id,
  json_extract(data, '$.event_type') AS event_type
FROM events
-- Arrow notation (shorthand)
SELECT
  data->'user_id' AS user_id,
  data->>'event_type' AS event_type
FROM events
```
**Read JSON files**:
```sql
SELECT * FROM read_json_auto('events.json')
```
### 8. Performance Features
**Parallel execution** (automatic):
- DuckDB uses all CPU cores automatically
- No configuration needed
**Larger-than-memory processing**:
- Spills to disk when needed
- Handles datasets larger than RAM
**Columnar storage**:
- Efficient for analytical queries
- Fast aggregations and filters
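These defaults can be inspected and, when needed, tuned with standard DuckDB settings (a small sketch; the values shown are illustrative, not recommendations):
```sql
-- Show current execution settings
SELECT * FROM duckdb_settings() WHERE name IN ('threads', 'memory_limit')
-- Constrain a workload if necessary
SET threads = 4
SET memory_limit = '4GB'
```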
**Indexes** (for point lookups):
```sql
CREATE INDEX idx_customer ON sales(customer_id)
```
## MXCP Integration
### Database Connection
**Automatic in MXCP** - no setup needed:
```yaml
# mxcp-site.yml
# DuckDB is the default, no configuration required
```
**Environment variable** for custom path:
```bash
# Default database path is data/db-default.duckdb
export MXCP_DUCKDB_PATH="/path/to/data/db-default.duckdb"
mxcp serve
```
**Profile-specific databases**:
```yaml
# mxcp-site.yml
profiles:
  development:
    database:
      path: "dev.duckdb"
  production:
    database:
      path: "prod.duckdb"
```
### Using DuckDB in MXCP Tools
**Direct SQL queries**:
```yaml
# tools/query_sales.yml
mxcp: 1
tool:
  name: query_sales
  source:
    code: |
      SELECT
        region,
        SUM(amount) as total,
        COUNT(*) as count
      FROM sales
      WHERE sale_date >= $start_date
      GROUP BY region
      ORDER BY total DESC
```
**Query CSV files directly**:
```yaml
tool:
  name: analyze_upload
  source:
    code: |
      SELECT
        COUNT(*) as rows,
        COUNT(DISTINCT customer_id) as unique_customers,
        SUM(amount) as total_revenue
      FROM 'uploads/$filename'
```
**Complex analytical queries**:
```yaml
tool:
  name: customer_cohorts
  source:
    code: |
      WITH first_purchase AS (
        SELECT
          customer_id,
          MIN(DATE_TRUNC('month', purchase_date)) AS cohort_month
        FROM purchases
        GROUP BY customer_id
      ),
      cohort_size AS (
        SELECT
          cohort_month,
          COUNT(DISTINCT customer_id) AS cohort_size
        FROM first_purchase
        GROUP BY cohort_month
      )
      SELECT
        fp.cohort_month,
        DATE_TRUNC('month', p.purchase_date) AS activity_month,
        COUNT(DISTINCT p.customer_id) AS active_customers,
        cs.cohort_size,
        COUNT(DISTINCT p.customer_id)::FLOAT / cs.cohort_size AS retention_rate
      FROM purchases p
      JOIN first_purchase fp ON p.customer_id = fp.customer_id
      JOIN cohort_size cs ON fp.cohort_month = cs.cohort_month
      GROUP BY fp.cohort_month, activity_month, cs.cohort_size
      ORDER BY fp.cohort_month, activity_month
```
### Using DuckDB in Python Endpoints
**Access via MXCP runtime**:
```python
from mxcp.runtime import db

def analyze_data(region: str) -> dict:
    # Execute a parameterized query (named parameter bound from the dict)
    result = db.execute(
        "SELECT SUM(amount) as total FROM sales WHERE region = $region",
        {"region": region}
    )
    # Fetch results
    row = result.fetchone()
    return {"total": row["total"]}

def batch_insert(records: list[dict]) -> dict:
    # Insert each record with named parameters
    for r in records:
        db.execute(
            "INSERT INTO logs (timestamp, event) VALUES ($timestamp, $event)",
            {"timestamp": r["timestamp"], "event": r["event"]}
        )
    return {"inserted": len(records)}
```
**Read files in Python**:
```python
def import_csv(filepath: str) -> dict:
    # Create table from CSV (filepath is interpolated into the SQL,
    # so it must come from a trusted source, not raw user input)
    db.execute(f"""
        CREATE TABLE imported_data AS
        SELECT * FROM read_csv_auto('{filepath}')
    """)
    # Get stats
    result = db.execute("SELECT COUNT(*) as count FROM imported_data")
    return {"rows_imported": result.fetchone()["count"]}
```
## Best Practices for MXCP
### 1. Use Parameter Binding
**ALWAYS use parameterized queries** to prevent SQL injection:
**Correct**:
```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = $region
```
**WRONG** (SQL injection risk):
```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = '$region'
```
### 2. Optimize Queries
**Index frequently filtered columns**:
```sql
CREATE INDEX idx_customer ON orders(customer_id)
CREATE INDEX idx_date ON orders(order_date)
```
**Use EXPLAIN to analyze queries**:
```sql
EXPLAIN SELECT * FROM large_table WHERE id = 123
```
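EXPLAIN ANALYZE goes a step further and actually executes the query, reporting per-operator timings (a small sketch; the column names on `large_table` are illustrative):
```sql
EXPLAIN ANALYZE SELECT region, SUM(amount) FROM large_table GROUP BY region
```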
**Materialize complex aggregations** (via dbt models):
```sql
-- Instead of computing on every query
-- Create a materialized view via dbt
CREATE TABLE daily_summary AS
SELECT
  DATE_TRUNC('day', timestamp) AS date,
  COUNT(*) AS count,
  SUM(amount) AS total
FROM transactions
GROUP BY date
```
### 3. Handle Large Datasets
**For large CSVs** (>100MB):
- Use Parquet format instead (much faster)
- Create tables rather than querying files directly
- Use dbt to materialize transformations
**Conversion to Parquet**:
```sql
COPY (SELECT * FROM 'large_data.csv')
TO 'large_data.parquet' (FORMAT PARQUET)
```
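Once converted, the Parquet file can also be loaded into a regular table so repeated queries avoid re-reading the file each time (a sketch building on the conversion above):
```sql
CREATE TABLE large_data AS
SELECT * FROM 'large_data.parquet'
```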
### 4. Data Types in MXCP
**Match DuckDB types to MXCP types**:
```yaml
# MXCP tool definition
parameters:
  - name: amount
    type: number       # → DuckDB DOUBLE
  - name: quantity
    type: integer      # → DuckDB INTEGER
  - name: description
    type: string       # → DuckDB VARCHAR
  - name: created_at
    type: string
    format: date-time  # → DuckDB TIMESTAMP
  - name: is_active
    type: boolean      # → DuckDB BOOLEAN
```
### 5. Database File Management
**Backup**:
```bash
# DuckDB is a single file - just copy it (default: data/db-default.duckdb)
cp data/db-default.duckdb data/db-default.duckdb.backup
```
**Export to SQL**:
```sql
EXPORT DATABASE 'backup_directory'
```
**Import from SQL**:
```sql
IMPORT DATABASE 'backup_directory'
```
## Common Patterns in MXCP
### Pattern 1: CSV → Table → Query
```bash
# 1. Load CSV via dbt seed
dbt seed --select customers
# 2. Query from MXCP tool
SELECT * FROM customers WHERE country = $country
```
### Pattern 2: External Data Caching
```sql
-- dbt model: cache_external_data.sql
{{ config(materialized='table') }}
SELECT * FROM read_csv_auto('https://example.com/data.csv')
```
### Pattern 3: Multi-File Aggregation
```sql
-- Query multiple CSVs
SELECT * FROM 'data/*.csv'
-- Union multiple Parquet files
SELECT * FROM 'archive/2025-*.parquet'
```
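When aggregating across many files it often helps to know which file each row came from; read_csv_auto accepts a filename option for this (a sketch, assuming the same data/*.csv layout as above):
```sql
-- Count rows per source file
SELECT filename, COUNT(*) AS row_count
FROM read_csv_auto('data/*.csv', filename = true)
GROUP BY filename
```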
### Pattern 4: Real-time + Historical
```sql
-- Combine recent API data with historical cache
SELECT * FROM read_json_auto('https://api.com/recent')
UNION ALL
SELECT * FROM historical_data WHERE date < CURRENT_DATE - INTERVAL '7 days'
```
## Troubleshooting
**Issue**: "Table does not exist"
**Solution**: Ensure dbt models/seeds have been run, check table name spelling
**Issue**: "Type mismatch"
**Solution**: Add explicit CAST() or update schema.yml with correct data types (see the TRY_CAST sketch below)
**Issue**: "Out of memory"
**Solution**: Reduce query scope, add WHERE filters, materialize intermediate results
**Issue**: "CSV parsing error"
**Solution**: Use read_csv_auto with explicit options (delim, quote, etc.)
**Issue**: "Slow queries"
**Solution**: Add indexes, materialize via dbt, use Parquet instead of CSV
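For the type-mismatch case above, TRY_CAST returns NULL instead of failing the whole query, which makes it easy to isolate offending rows (a sketch; raw_sales and its columns are illustrative):
```sql
-- Find rows whose amount cannot be parsed as a decimal
SELECT * FROM raw_sales
WHERE amount IS NOT NULL AND TRY_CAST(amount AS DECIMAL(10,2)) IS NULL
```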
## Summary for MXCP Builders
When building MXCP servers with DuckDB:
1. **Use parameterized queries** (`$param`) to prevent injection
2. **Load CSVs via dbt seeds** for version control and validation
3. **Materialize complex queries** as dbt models
4. **Index frequently filtered columns** for performance
5. **Use Parquet for large datasets** (>100MB)
6. **Match MXCP types to DuckDB types** in tool definitions
7. **Leverage DuckDB extensions** (EXCLUDE, REPLACE, window functions)
DuckDB is the powerhouse behind MXCP's data capabilities - understanding it enables building robust, high-performance MCP servers.