# DuckDB Essentials for MXCP
Essential DuckDB knowledge for building MXCP servers with embedded analytics.
## What is DuckDB?
**DuckDB is an embedded, in-process SQL OLAP database** - think "SQLite for analytics". It runs directly in your MXCP server process without needing a separate database server.
**Key characteristics**:
- **Embedded**: No server setup, no configuration
- **Fast**: Vectorized execution engine, parallel processing
- **Versatile**: Reads CSV, Parquet, JSON directly from disk or URLs
- **SQL**: Full SQL support with analytical extensions
- **Portable**: Single-file database, easy to move/backup
**MXCP uses DuckDB by default** for all SQL-based tools and resources.
## Core Features for MXCP
### 1. Direct File Reading
**DuckDB can query files without importing them first**:
```sql
-- Query CSV directly
SELECT * FROM 'data/sales.csv'
-- Query with explicit reader
SELECT * FROM read_csv_auto('data/sales.csv')
-- Query Parquet
SELECT * FROM 'data/sales.parquet'
-- Query JSON
SELECT * FROM read_json_auto('data/events.json')
-- Query from URL
SELECT * FROM 'https://example.com/data.csv'
```
**Auto-detection**: DuckDB automatically infers:
- Column names from headers
- Data types from values
- CSV delimiters, quotes, etc.
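To check what was inferred before building a tool on top of it, DESCRIBE reports the detected column names and types, and SUMMARIZE adds per-column statistics - a quick sketch using the sales file from above:
```sql
-- Inspect the inferred schema of a CSV
DESCRIBE SELECT * FROM 'data/sales.csv'
-- Per-column statistics (min, max, null count, ...)
SUMMARIZE SELECT * FROM 'data/sales.csv'
```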
### 2. CSV Import and Export
**Import CSV to table**:
```sql
-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales.csv')
-- Or use COPY
COPY sales FROM 'sales.csv' (AUTO_DETECT TRUE)
```
**Export to CSV**:
```sql
-- Export query results
COPY (SELECT * FROM sales WHERE region = 'US')
TO 'us_sales.csv' (HEADER, DELIMITER ',')
```
**CSV reading options**:
```sql
SELECT * FROM read_csv_auto(
  'data.csv',
  header = true,
  delim = ',',
  quote = '"',
  dateformat = '%Y-%m-%d'
)
```
### 3. Data Types
**Common DuckDB types** (important for MXCP type validation):
```sql
-- Numeric
INTEGER, BIGINT, DECIMAL(10,2), DOUBLE
-- String
VARCHAR, TEXT
-- Temporal
DATE, TIME, TIMESTAMP, INTERVAL
-- Complex
ARRAY, STRUCT, MAP, JSON
-- Boolean
BOOLEAN
```
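The complex types can be constructed inline with literal syntax, which is handy for quick experiments before wiring them into a tool (a minimal sketch; the column aliases are made up):
```sql
SELECT
  [1, 2, 3] AS scores,                        -- list
  {'name': 'Ada', 'active': true} AS profile, -- struct
  MAP(['a', 'b'], [10, 20]) AS lookup         -- map
```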
**Type casting**:
```sql
-- Cast to specific type
SELECT CAST(amount AS DECIMAL(10,2)) FROM sales
-- Short syntax
SELECT amount::DECIMAL(10,2) FROM sales
-- Date parsing
SELECT CAST('2025-01-15' AS DATE)
```
### 4. SQL Extensions
**DuckDB adds useful SQL extensions beyond standard SQL**:
**EXCLUDE clause** (select all except):
```sql
-- Select all columns except sensitive ones
SELECT * EXCLUDE (ssn, salary) FROM employees
```
**REPLACE clause** (modify columns in SELECT *):
```sql
-- Replace amount with rounded version
SELECT * REPLACE (ROUND(amount, 2) AS amount) FROM sales
```
**List aggregation**:
```sql
-- Aggregate into arrays
SELECT
  region,
  LIST(product) AS products,
  LIST(DISTINCT customer) AS customers
FROM sales
GROUP BY region
```
**String aggregation**:
```sql
SELECT
  department,
  STRING_AGG(employee_name, ', ') AS team_members
FROM employees
GROUP BY department
```
### 5. Analytical Functions
**Window functions**:
```sql
-- Running totals
SELECT
  date,
  amount,
  SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales
-- Ranking
SELECT
  product,
  sales,
  RANK() OVER (ORDER BY sales DESC) AS rank
FROM product_sales
-- Partitioned windows
SELECT
  region,
  product,
  sales,
  AVG(sales) OVER (PARTITION BY region) AS regional_avg
FROM sales
```
**Percentiles and statistics**:
```sql
SELECT
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95,
  STDDEV(amount) AS std_dev,
  CORR(amount, quantity) AS correlation
FROM sales
```
### 6. Date and Time Functions
```sql
-- Current timestamp
SELECT CURRENT_TIMESTAMP
-- Date arithmetic
SELECT date + INTERVAL '7 days' AS next_week
SELECT date - INTERVAL '1 month' AS last_month
-- Date truncation
SELECT DATE_TRUNC('month', timestamp) AS month
SELECT DATE_TRUNC('week', timestamp) AS week
-- Date parts
SELECT
  YEAR(date) AS year,
  MONTH(date) AS month,
  DAYOFWEEK(date) AS day_of_week
```
### 7. JSON Support
**Parse JSON strings**:
```sql
-- Extract JSON fields
SELECT
  json_extract(data, '$.user_id') AS user_id,
  json_extract(data, '$.event_type') AS event_type
FROM events
-- Arrow notation (shorthand)
SELECT
  data->'user_id' AS user_id,
  data->>'event_type' AS event_type
FROM events
```
**Read JSON files**:
```sql
SELECT * FROM read_json_auto('events.json')
```
### 8. Performance Features
**Parallel execution** (automatic):
- DuckDB uses all CPU cores automatically
- No configuration needed
**Larger-than-memory processing**:
- Spills to disk when needed
- Handles datasets larger than RAM
**Columnar storage**:
- Efficient for analytical queries
- Fast aggregations and filters
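These defaults can be inspected and, when needed, tuned with standard DuckDB settings (a small sketch; the values shown are illustrative, not recommendations):
```sql
-- Show current execution settings
SELECT * FROM duckdb_settings() WHERE name IN ('threads', 'memory_limit')
-- Constrain a workload if necessary
SET threads = 4
SET memory_limit = '4GB'
```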
**Indexes** (for point lookups):
```sql
CREATE INDEX idx_customer ON sales(customer_id)
```
## MXCP Integration
### Database Connection
**Automatic in MXCP** - no setup needed:
```yaml
# mxcp-site.yml
# DuckDB is the default, no configuration required
```
**Environment variable** for custom path:
```bash
# Default database path is data/db-default.duckdb
export MXCP_DUCKDB_PATH="/path/to/data/db-default.duckdb"
mxcp serve
```
**Profile-specific databases**:
```yaml
# mxcp-site.yml
profiles:
  development:
    database:
      path: "dev.duckdb"
  production:
    database:
      path: "prod.duckdb"
```
### Using DuckDB in MXCP Tools
**Direct SQL queries**:
```yaml
# tools/query_sales.yml
mxcp: 1
tool:
  name: query_sales
  source:
    code: |
      SELECT
        region,
        SUM(amount) as total,
        COUNT(*) as count
      FROM sales
      WHERE sale_date >= $start_date
      GROUP BY region
      ORDER BY total DESC
```
**Query CSV files directly**:
```yaml
tool:
  name: analyze_upload
  source:
    code: |
      SELECT
        COUNT(*) as rows,
        COUNT(DISTINCT customer_id) as unique_customers,
        SUM(amount) as total_revenue
      FROM 'uploads/$filename'
```
**Complex analytical queries**:
```yaml
tool:
  name: customer_cohorts
  source:
    code: |
      WITH first_purchase AS (
        SELECT
          customer_id,
          MIN(DATE_TRUNC('month', purchase_date)) AS cohort_month
        FROM purchases
        GROUP BY customer_id
      ),
      cohort_size AS (
        SELECT
          cohort_month,
          COUNT(DISTINCT customer_id) AS cohort_size
        FROM first_purchase
        GROUP BY cohort_month
      )
      SELECT
        fp.cohort_month,
        DATE_TRUNC('month', p.purchase_date) AS activity_month,
        COUNT(DISTINCT p.customer_id) AS active_customers,
        cs.cohort_size,
        COUNT(DISTINCT p.customer_id)::FLOAT / cs.cohort_size AS retention_rate
      FROM purchases p
      JOIN first_purchase fp ON p.customer_id = fp.customer_id
      JOIN cohort_size cs ON fp.cohort_month = cs.cohort_month
      GROUP BY fp.cohort_month, activity_month, cs.cohort_size
      ORDER BY fp.cohort_month, activity_month
```
### Using DuckDB in Python Endpoints
**Access via MXCP runtime**:
```python
from mxcp.runtime import db

def analyze_data(region: str) -> dict:
    # Execute a parameterized query (named parameter bound from the dict)
    result = db.execute(
        "SELECT SUM(amount) as total FROM sales WHERE region = $region",
        {"region": region}
    )
    # Fetch results
    row = result.fetchone()
    return {"total": row["total"]}

def batch_insert(records: list[dict]) -> dict:
    # Insert each record with named parameters
    for r in records:
        db.execute(
            "INSERT INTO logs (timestamp, event) VALUES ($timestamp, $event)",
            {"timestamp": r["timestamp"], "event": r["event"]}
        )
    return {"inserted": len(records)}
```
**Read files in Python**:
```python
def import_csv(filepath: str) -> dict:
    # Create table from CSV (filepath is interpolated into the SQL,
    # so it must come from a trusted source, not raw user input)
    db.execute(f"""
        CREATE TABLE imported_data AS
        SELECT * FROM read_csv_auto('{filepath}')
    """)
    # Get stats
    result = db.execute("SELECT COUNT(*) as count FROM imported_data")
    return {"rows_imported": result.fetchone()["count"]}
```
## Best Practices for MXCP
### 1. Use Parameter Binding
**ALWAYS use parameterized queries** to prevent SQL injection:
**Correct**:
```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = $region
```
**WRONG** (SQL injection risk):
```yaml
source:
  code: |
    SELECT * FROM sales WHERE region = '$region'
```
### 2. Optimize Queries
**Index frequently filtered columns**:
```sql
CREATE INDEX idx_customer ON orders(customer_id)
CREATE INDEX idx_date ON orders(order_date)
```
**Use EXPLAIN to analyze queries**:
```sql
EXPLAIN SELECT * FROM large_table WHERE id = 123
```
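EXPLAIN ANALYZE goes a step further and actually executes the query, reporting per-operator timings (a small sketch; the column names on `large_table` are illustrative):
```sql
EXPLAIN ANALYZE SELECT region, SUM(amount) FROM large_table GROUP BY region
```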
**Materialize complex aggregations** (via dbt models):
```sql
-- Instead of computing on every query
-- Create a materialized view via dbt
CREATE TABLE daily_summary AS
SELECT
  DATE_TRUNC('day', timestamp) AS date,
  COUNT(*) AS count,
  SUM(amount) AS total
FROM transactions
GROUP BY date
```
### 3. Handle Large Datasets
**For large CSVs** (>100MB):
- Use Parquet format instead (much faster)
- Create tables rather than querying files directly
- Use dbt to materialize transformations
**Conversion to Parquet**:
```sql
COPY (SELECT * FROM 'large_data.csv')
TO 'large_data.parquet' (FORMAT PARQUET)
```
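Once converted, the Parquet file can also be loaded into a regular table so repeated queries avoid re-reading the file each time (a sketch building on the conversion above):
```sql
CREATE TABLE large_data AS
SELECT * FROM 'large_data.parquet'
```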
### 4. Data Types in MXCP
**Match DuckDB types to MXCP types**:
```yaml
# MXCP tool definition
parameters:
  - name: amount
    type: number       # → DuckDB DOUBLE
  - name: quantity
    type: integer      # → DuckDB INTEGER
  - name: description
    type: string       # → DuckDB VARCHAR
  - name: created_at
    type: string
    format: date-time  # → DuckDB TIMESTAMP
  - name: is_active
    type: boolean      # → DuckDB BOOLEAN
```
### 5. Database File Management
**Backup**:
```bash
# DuckDB is a single file - just copy it (default: data/db-default.duckdb)
cp data/db-default.duckdb data/db-default.duckdb.backup
```
**Export to SQL**:
```sql
EXPORT DATABASE 'backup_directory'
```
**Import from SQL**:
```sql
IMPORT DATABASE 'backup_directory'
```
## Common Patterns in MXCP
### Pattern 1: CSV → Table → Query
```bash
# 1. Load CSV via dbt seed
dbt seed --select customers
# 2. Query from MXCP tool
SELECT * FROM customers WHERE country = $country
```
### Pattern 2: External Data Caching
```sql
-- dbt model: cache_external_data.sql
{{ config(materialized='table') }}
SELECT * FROM read_csv_auto('https://example.com/data.csv')
```
### Pattern 3: Multi-File Aggregation
```sql
-- Query multiple CSVs
SELECT * FROM 'data/*.csv'
-- Union multiple Parquet files
SELECT * FROM 'archive/2025-*.parquet'
```
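When aggregating across many files it often helps to know which file each row came from; read_csv_auto accepts a filename option for this (a sketch, assuming the same data/*.csv layout as above):
```sql
-- Count rows per source file
SELECT filename, COUNT(*) AS row_count
FROM read_csv_auto('data/*.csv', filename = true)
GROUP BY filename
```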
### Pattern 4: Real-time + Historical
```sql
-- Combine recent API data with historical cache
SELECT * FROM read_json_auto('https://api.com/recent')
UNION ALL
SELECT * FROM historical_data WHERE date < CURRENT_DATE - INTERVAL '7 days'
```
## Troubleshooting
**Issue**: "Table does not exist"
**Solution**: Ensure dbt models/seeds have been run, check table name spelling
**Issue**: "Type mismatch"
**Solution**: Add explicit CAST() or update schema.yml with correct data types (see the TRY_CAST sketch below)
**Issue**: "Out of memory"
**Solution**: Reduce query scope, add WHERE filters, materialize intermediate results
**Issue**: "CSV parsing error"
**Solution**: Use read_csv_auto with explicit options (delim, quote, etc.)
**Issue**: "Slow queries"
**Solution**: Add indexes, materialize via dbt, use Parquet instead of CSV
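For the type-mismatch case above, TRY_CAST returns NULL instead of failing the whole query, which makes it easy to isolate offending rows (a sketch; raw_sales and its columns are illustrative):
```sql
-- Find rows whose amount cannot be parsed as a decimal
SELECT * FROM raw_sales
WHERE amount IS NOT NULL AND TRY_CAST(amount AS DECIMAL(10,2)) IS NULL
```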
## Summary for MXCP Builders
When building MXCP servers with DuckDB:
1. **Use parameterized queries** (`$param`) to prevent injection
2. **Load CSVs via dbt seeds** for version control and validation
3. **Materialize complex queries** as dbt models
4. **Index frequently filtered columns** for performance
5. **Use Parquet for large datasets** (>100MB)
6. **Match MXCP types to DuckDB types** in tool definitions
7. **Leverage DuckDB extensions** (EXCLUDE, REPLACE, window functions)
DuckDB is the powerhouse behind MXCP's data capabilities - understanding it enables building robust, high-performance MCP servers.