Project Selection Guide

Decision tree and heuristics for selecting the right MXCP approach and templates based on technical requirements.

Scope: This guide helps select implementation patterns (SQL vs Python, template selection, architecture patterns) based on data sources, authentication mechanisms, and technical constraints. It does NOT help define business requirements or determine what features to build.

Decision Tree

Use this decision tree to determine the appropriate MXCP implementation approach:

User Request
    ├─ Data File
    │   ├─ CSV file
    │   │   ├─ Static data → dbt seed + SQL tool
    │   │   ├─ Needs transformation → dbt seed + dbt model + SQL tool
    │   │   └─ Large file (>100MB) → Convert to Parquet + dbt model
    │   ├─ Excel file (.xlsx, .xls)
    │   │   ├─ Static/one-time → Convert to CSV + dbt seed
    │   │   ├─ User upload (dynamic) → Python tool with pandas + DuckDB table
    │   │   └─ Multi-sheet → Python tool to load all sheets as tables
    │   ├─ JSON/Parquet
    │   │   └─ DuckDB read_json/read_parquet directly in SQL tool
    │   └─ Synthetic data needed
    │       ├─ For testing → dbt model with GENERATE_SERIES
    │       ├─ Dynamic generation → Python tool with parameters
    │       └─ With statistics → Generate + analyze in single tool
    │
    ├─ External API Integration
    │   ├─ OAuth required
    │   │   ├─ Google (Calendar, Sheets, etc.) → google-calendar template
    │   │   ├─ Jira Cloud → jira-oauth template
    │   │   ├─ Salesforce → salesforce-oauth template
    │   │   └─ Other OAuth → Adapt google-calendar template
    │   │
    │   ├─ API Token/Basic Auth
    │   │   ├─ Jira → jira template
    │   │   ├─ Confluence → confluence template
    │   │   ├─ Salesforce → salesforce template
    │   │   ├─ Custom API → python-demo template
    │   │   └─ REST API → Create new Python tool
    │   │
    │   └─ Public API (no auth)
    │       └─ Create SQL tool with read_json/read_csv from URL
    │
    ├─ Database Connection
    │   ├─ PostgreSQL
    │   │   ├─ Direct query → DuckDB ATTACH + SQL tools
    │   │   └─ Cache data → dbt source + model + SQL tools
    │   ├─ MySQL
    │   │   ├─ Direct query → DuckDB ATTACH + SQL tools
    │   │   └─ Cache data → dbt source + model
    │   ├─ SQLite → DuckDB ATTACH + SQL tools (simple)
    │   ├─ SQL Server → DuckDB ATTACH + SQL tools
    │   └─ Other/NoSQL → Create Python tool with connection library
    │
    ├─ Complex Logic/Processing
    │   ├─ Data transformation → dbt model
    │   ├─ Business logic → Python tool
    │   ├─ ML/AI processing → Python tool with libraries
    │   └─ Async operations → Python tool with async/await
    │
    └─ Authentication/Security System
        ├─ Keycloak → keycloak template
        ├─ Custom SSO → Adapt keycloak template
        └─ Policy enforcement → Use MXCP policies

Available Project Templates

Data-Focused Templates

covid_owid

Use when: Working with external data sources, caching datasets

Features:

  • dbt integration for data caching
  • External CSV/JSON fetching
  • Data quality tests
  • Incremental updates

Example use cases:

  • "Cache COVID statistics for offline analysis"
  • "Query external datasets regularly"
  • "Download and transform public data"

Key files:

  • models/ - dbt models for data transformation
  • tools/ - SQL tools querying cached data

earthquakes

Use when: Real-time data monitoring, geospatial data

Features:

  • Real-time API queries
  • Geospatial filtering
  • Time-based queries

Example use cases:

  • "Monitor earthquake activity"
  • "Query geospatial data by region"
  • "Real-time event tracking"

API Integration Templates

google-calendar (OAuth)

Use when: Integrating with Google APIs or other OAuth 2.0 services

Features:

  • OAuth 2.0 authentication flow
  • Token management
  • Google API client integration
  • Python endpoints with async support

Example use cases:

  • "Connect to Google Calendar"
  • "Access Google Sheets data"
  • "Integrate with Gmail"
  • "Any OAuth 2.0 API integration"

Adaptation guide:

  1. Replace Google API client with target API client
  2. Update OAuth scopes and endpoints
  3. Modify tool definitions for new API methods
  4. Update configuration with new OAuth provider

jira (API Token)

Use when: Integrating with Jira using API tokens

Features:

  • API token authentication
  • JQL query support
  • Issue, user, project management
  • Python HTTP client pattern

Example use cases:

  • "Query Jira issues"
  • "Get project information"
  • "Search for users"

jira-oauth (OAuth)

Use when: Jira integration requiring OAuth

Features:

  • OAuth 1.0a for Jira
  • More secure than API tokens
  • Full Jira REST API access

confluence

Use when: Atlassian Confluence integration

Features:

  • Confluence REST API
  • Page and space queries
  • Content search

Example use cases:

  • "Search Confluence pages"
  • "Get page content"
  • "List spaces"

salesforce / salesforce-oauth

Use when: Salesforce CRM integration

Features:

  • Salesforce REST API
  • SOQL queries
  • OAuth or username/password auth

Example use cases:

  • "Query Salesforce records"
  • "Get account information"
  • "Search opportunities"

Development Templates

python-demo

Use when: Building custom Python-based tools

Features:

  • Python endpoint patterns
  • Async/await examples
  • Database access patterns
  • Error handling

Example use cases:

  • "Create custom API integration"
  • "Implement complex business logic"
  • "Build ML/AI-powered tools"

Key patterns:

import asyncio

# Sync endpoint
def simple_tool(param: str) -> dict:
    return {"result": param.upper()}

# Async endpoint (fetch_data is an application-specific coroutine defined elsewhere)
async def async_tool(ids: list[str]) -> list[dict]:
    results = await asyncio.gather(*[fetch_data(id) for id in ids])
    return results

# Database access (db is the database handle provided by the template)
def db_tool(query: str) -> list[dict]:
    return db.execute(query).fetchall()

Infrastructure Templates

plugin

Use when: Extending DuckDB with custom functions

Features:

  • DuckDB plugin development
  • Custom SQL functions
  • Compiled extensions

Example use cases:

  • "Add custom SQL functions"
  • "Integrate C/C++ libraries"
  • "Optimize performance-critical operations"

keycloak

Use when: Enterprise authentication/authorization

Features:

  • Keycloak integration
  • SSO support
  • Role-based access control

Example use cases:

  • "Integrate with Keycloak SSO"
  • "Implement role-based policies"
  • "Enterprise user management"

squirro

Use when: Enterprise search and insights integration

Features:

  • Squirro API integration
  • Search and analytics
  • Enterprise data access

Common Scenarios and Heuristics

Scenario 1: CSV File to Query

User says: "I need to connect my chat to a CSV file"

Heuristic:

  1. DO NOT use existing templates
  2. CREATE new MXCP project from scratch
  3. APPROACH:
    • Place CSV in seeds/ directory
    • Create seeds/schema.yml with schema definition and tests
    • Run dbt seed to load into DuckDB
    • Create SQL tool: SELECT * FROM <table_name>
    • Add parameters for filtering if needed

Implementation steps:

# 1. Initialize project
mkdir csv-server && cd csv-server
mxcp init --bootstrap

# 2. Setup dbt
mkdir seeds
cp /path/to/file.csv seeds/data.csv

# 3. Create schema
cat > seeds/schema.yml <<EOF
version: 2
seeds:
  - name: data
    description: "User uploaded CSV data"
    columns:
      - name: id
        tests: [unique, not_null]
      # ... add all columns
EOF

# 4. Load seed
dbt seed

# 5. Create tool
cat > tools/query_data.yml <<EOF
mxcp: 1
tool:
  name: query_data
  description: "Query the uploaded CSV data"
  parameters:
    - name: filter_column
      type: string
      required: false
  return:
    type: array
  source:
    code: |
      SELECT * FROM data
      WHERE \$filter_column IS NULL OR column_name = \$filter_column
EOF

# 6. Test
mxcp validate
mxcp test
mxcp serve

Scenario 2: API Integration (OAuth)

User says: "Connect to [OAuth-enabled API]"

Heuristic:

  1. Check if template exists (Google, Jira OAuth, Salesforce OAuth)
  2. If exists: Copy and adapt template
  3. If not: Copy google-calendar template and modify

Implementation steps:

# 1. Copy template
cp -r assets/project-templates/google-calendar my-api-project
cd my-api-project

# 2. Update mxcp-site.yml
vim mxcp-site.yml  # Change project name

# 3. Update config.yml for new OAuth provider
vim config.yml  # Update OAuth endpoints and scopes

# 4. Replace API client
pip install new-api-client-library
vim python/*.py  # Replace google-api-client with new library

# 5. Update tools for new API methods
vim tools/*.yml  # Adapt to new API endpoints

# 6. Test OAuth flow
mxcp serve
# Follow OAuth flow in browser

Scenario 3: API Integration (Token/Basic Auth)

User says: "Connect to [API with token]"

Heuristic:

  1. Check if template exists (Jira, Confluence, Salesforce)
  2. If exists: Copy and adapt template
  3. If not: Use python-demo template

Implementation steps:

# 1. Copy python-demo template
cp -r assets/project-templates/python-demo my-api-project
cd my-api-project

# 2. Create Python endpoint
cat > python/api_client.py <<EOF
import httpx
from mxcp.runtime import get_secret

async def fetch_data(endpoint: str) -> dict:
    secret = get_secret("api_token")
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://api.example.com/{endpoint}",
            headers={"Authorization": f"Bearer {secret['token']}"}
        )
        return response.json()
EOF

# 3. Create tool
# 4. Configure secret in config.yml
# 5. Test
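
Steps 3-5 are left as comments above. As a hedged illustration of how the generic client backs a concrete tool, the endpoint below builds on fetch_data; the endpoint path and function name are illustrative, not part of any template:

# python/api_client.py (continued) - thin endpoint that a tool definition maps onto
async def get_item(item_id: str) -> dict:
    # Illustrative path; adjust to the target API's resource naming
    return await fetch_data(f"items/{item_id}")

The tool definition in step 3 would then declare item_id as a parameter and reference this function.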

Scenario 4: Complex Data Transformation

User says: "Transform this data and provide analytics"

Heuristic:

  1. Use dbt for transformations
  2. Use SQL tools for queries
  3. Pattern: seed → model → tool

Implementation steps:

# 1. Load source data (seed or external)
# 2. Create dbt model for transformation
cat > models/analytics.sql <<EOF
{{ config(materialized='table') }}

SELECT
  DATE_TRUNC('month', date) as month,
  category,
  SUM(amount) as total,
  AVG(amount) as average,
  COUNT(*) as count
FROM {{ ref('source_data') }}
GROUP BY month, category
EOF

# 3. Create schema.yml
# 4. Run dbt
dbt run --select analytics
dbt test --select analytics

# 5. Create tool to query model
# 6. Test

Scenario 5: Excel File Integration

User says: "I have an Excel file with sales data" or "Read this xlsx file"

Heuristic:

  1. If static/one-time: Convert to CSV, use dbt seed
  2. If user upload: Python tool with pandas to load into DuckDB
  3. If multi-sheet: Python tool to process all sheets

Implementation steps:

# Option A: Static Excel → CSV → dbt seed
python -c "import pandas as pd; pd.read_excel('data.xlsx').to_csv('seeds/data.csv', index=False)"
cat > seeds/schema.yml  # Create schema
dbt seed

# Option B: Dynamic upload → Python tool
cat > python/excel_loader.py  # Create loader
cat > tools/load_excel.yml    # Create tool
pip install openpyxl pandas   # Add dependencies

See references/excel-integration.md for complete patterns.
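
To make Option B concrete, here is a minimal loader sketch using pandas plus the duckdb package; the table naming, file path, and database path (data/db-default.duckdb, the default mentioned under Configuration Management) are assumptions to adapt:

# python/excel_loader.py - load an uploaded workbook into DuckDB (sketch)
import duckdb
import pandas as pd

def load_excel(path: str, table: str = "sales") -> dict:
    # sheet_name=None reads every sheet into a dict of DataFrames
    sheets = pd.read_excel(path, sheet_name=None)
    con = duckdb.connect("data/db-default.duckdb")  # assumed default database path
    loaded = {}
    for name, df in sheets.items():
        # NOTE: sanitize sheet names before using them as identifiers in real use
        target = table if len(sheets) == 1 else f"{table}_{name.lower().replace(' ', '_')}"
        con.register("df_view", df)
        con.execute(f"CREATE OR REPLACE TABLE {target} AS SELECT * FROM df_view")
        con.unregister("df_view")
        loaded[target] = len(df)
    con.close()
    return {"tables": loaded}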

Scenario 6: Synthetic Data Generation

User says: "Generate test data" or "Create synthetic customer records" or "I need dummy data for testing"

Heuristic:

  1. If persistent test data: dbt model with GENERATE_SERIES
  2. If dynamic/parameterized: Python tool
  3. If with analysis: Generate + calculate statistics in one tool

Implementation steps:

# Option A: Persistent via dbt
cat > models/synthetic_customers.sql <<EOF
{{ config(materialized='table') }}
SELECT
  ROW_NUMBER() OVER () AS id,
  'customer' || ROW_NUMBER() OVER () || '@example.com' AS email
FROM GENERATE_SERIES(1, 1000)
EOF

dbt run --select synthetic_customers

# Option B: Dynamic via Python
cat > python/generate_data.py  # Create generator
cat > tools/generate_test_data.yml  # Create tool

See references/synthetic-data-patterns.md for complete patterns.
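
To make Option B concrete, a hedged sketch of a parameterized generator endpoint; it assumes the faker package is listed in requirements.txt, and the field set is purely illustrative:

# python/generate_data.py - parameterized synthetic record generator (sketch)
import random

from faker import Faker  # assumed dependency

fake = Faker()

def generate_test_data(count: int = 100, seed: int | None = None) -> list[dict]:
    # Optional seed makes the generated dataset reproducible across runs
    if seed is not None:
        Faker.seed(seed)
        random.seed(seed)
    return [
        {
            "id": i,
            "name": fake.name(),
            "email": fake.email(),
            "signup_date": fake.date_between(start_date="-2y").isoformat(),
            "amount": round(random.uniform(10, 500), 2),
        }
        for i in range(1, count + 1)
    ]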

Scenario 7: Python Library Wrapping

User says: "Wrap the Stripe API" or "Use pandas for analysis" or "Connect to Redis"

Heuristic:

  1. Check if it's an API client library (stripe, twilio, etc.)
  2. Check if it's a data/ML library (pandas, sklearn, etc.)
  3. Use python-demo as base
  4. Add library to requirements.txt
  5. Use @on_init for initialization if stateful

Implementation steps:

# 1. Copy python-demo template
cp -r assets/project-templates/python-demo my-project

# 2. Install library
echo "stripe>=5.4.0" >> requirements.txt
pip install stripe

# 3. Create wrapper
cat > python/stripe_wrapper.py  # Implement wrapper functions

# 4. Create tools
cat > tools/create_customer.yml  # Map to wrapper functions

# 5. Create project config with secrets
cat > config.yml <<EOF
mxcp: 1
profiles:
  default:
    secrets:
      - name: api_key
        type: env
        parameters:
          env_var: API_KEY
EOF

# User sets: export API_KEY=xxx
# Or user copies to ~/.mxcp/ manually if preferred

See references/python-api.md (Wrapping External Libraries section) for complete patterns.
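
As a concrete illustration of step 3, a hedged wrapper sketch using the Stripe SDK's classic module-level API; the key layout returned by get_secret is an assumption to verify against your secret definition:

# python/stripe_wrapper.py - wrap the Stripe SDK behind MXCP endpoints (sketch)
import stripe
from mxcp.runtime import get_secret, on_init

@on_init
def configure_stripe():
    secret = get_secret("api_key")  # name matches the config.yml entry above
    # Assumption: the secret resolves to a plain string or a dict with a "value" key
    stripe.api_key = secret["value"] if isinstance(secret, dict) else secret

def create_customer(email: str, name: str) -> dict:
    # Classic module-level Stripe call; newer SDK versions also offer StripeClient
    customer = stripe.Customer.create(email=email, name=name)
    return {"id": customer["id"], "email": customer["email"], "name": customer["name"]}

The tools/create_customer.yml from step 4 would then expose create_customer with email and name as parameters.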

Scenario 8: ML/AI Processing

User says: "Analyze sentiment" or "Classify images" or "Train a model"

Heuristic:

  1. Use Python tool (not SQL)
  2. Use python-demo template as base
  3. Add ML libraries (transformers, scikit-learn, etc.)
  4. Use @on_init to load models (expensive operation)

Implementation steps:

# 1. Copy python-demo
# 2. Install ML libraries
pip install transformers torch

# 3. Create Python endpoint with model loading
cat > python/ml_tool.py <<EOF
from mxcp.runtime import on_init
from transformers import pipeline

classifier = None

@on_init
def load_model():
    global classifier
    classifier = pipeline("sentiment-analysis")

async def analyze_sentiment(texts: list[str]) -> list[dict]:
    results = classifier(texts)
    return [{"text": t, **r} for t, r in zip(texts, results)]
EOF

# 4. Create tool definition
# 5. Test

Scenario 9: External Database Connection

User says: "Connect to my PostgreSQL database" or "Query my MySQL production database"

Heuristic:

  1. Ask if data can be exported to CSV (simpler approach)
  2. Ask if they need real-time data or can cache it
  3. Decide: Direct query (ATTACH) vs cached (dbt)

Implementation steps - Direct Query (ATTACH):

# 1. Create project
mkdir db-connection && cd db-connection
mxcp init --bootstrap

# 2. Create config with credentials
cat > config.yml <<EOF
mxcp: 1
profiles:
  default:
    secrets:
      - name: db_host
        type: env
        parameters:
          env_var: DB_HOST
      - name: db_user
        type: env
        parameters:
          env_var: DB_USER
      - name: db_password
        type: env
        parameters:
          env_var: DB_PASSWORD
      - name: db_name
        type: env
        parameters:
          env_var: DB_NAME
EOF

# 3. Create SQL tool with ATTACH
cat > tools/query_customers.yml <<EOF
mxcp: 1
tool:
  name: query_customers
  description: "Query customers from PostgreSQL database"
  parameters:
    - name: country
      type: string
      required: false
  return:
    type: array
  source:
    code: |
      -- Install and attach PostgreSQL
      INSTALL postgres;
      LOAD postgres;
      ATTACH IF NOT EXISTS 'host=\${DB_HOST} port=5432 dbname=\${DB_NAME} user=\${DB_USER} password=\${DB_PASSWORD}'
        AS prod_db (TYPE POSTGRES);

      -- Query attached database
      SELECT customer_id, name, email, country
      FROM prod_db.public.customers
      WHERE \$country IS NULL OR country = \$country
      LIMIT 1000
EOF

# 4. Set credentials and test
export DB_HOST="localhost"
export DB_USER="readonly_user"
export DB_PASSWORD="secure_pass"
export DB_NAME="mydb"

mxcp validate
mxcp run tool query_customers --param country="US"
mxcp serve

Implementation steps - Cached with dbt:

# 1. Create project
mkdir db-cache && cd db-cache
mxcp init --bootstrap

# 2. Create dbt source
mkdir -p models
cat > models/sources.yml <<EOF
version: 2
sources:
  - name: production
    database: postgres_db
    schema: public
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests: [unique, not_null]
EOF

# 3. Create dbt model to cache data
cat > models/customer_cache.sql <<EOF
{{ config(materialized='table') }}

-- Attach PostgreSQL
{% set attach_sql %}
INSTALL postgres;
LOAD postgres;
ATTACH IF NOT EXISTS 'host=\${DB_HOST} dbname=\${DB_NAME} user=\${DB_USER} password=\${DB_PASSWORD}'
  AS postgres_db (TYPE POSTGRES);
{% endset %}
{% do run_query(attach_sql) %}

-- Cache data
SELECT * FROM postgres_db.public.customers
EOF

# 4. Create schema
cat > models/schema.yml <<EOF
version: 2
models:
  - name: customer_cache
    columns:
      - name: customer_id
        tests: [unique, not_null]
EOF

# 5. Run dbt to cache data
export DB_HOST="localhost" DB_USER="user" DB_PASSWORD="pass" DB_NAME="mydb"
dbt run --select customer_cache
dbt test --select customer_cache

# 6. Create MXCP tool to query cache (fast!)
cat > tools/query_cached.yml <<EOF
mxcp: 1
tool:
  name: query_customers
  source:
    code: SELECT * FROM customer_cache WHERE \$country IS NULL OR country = \$country
EOF

# 7. Create refresh tool
# (see minimal-working-examples.md Example 7 for complete refresh tool)

When to use which approach:

  • ATTACH (Direct): Real-time data needed, small queries, low query frequency
  • dbt (Cached): Large tables, frequent queries, can tolerate staleness, want data quality tests

See references/database-connections.md for complete patterns.
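
The decision tree's Other/NoSQL branch (a Python tool using the store's own client library) is not shown above. As a loosely hedged sketch for a key-value store such as Redis (which Scenario 7 also mentions), assuming the redis package and an illustrative secret name:

# python/redis_tool.py - query a NoSQL store from a Python tool (sketch)
import redis
from mxcp.runtime import get_secret

def get_cache_entry(key: str) -> dict:
    secret = get_secret("redis_url")  # illustrative secret name
    # Assumption: the secret resolves to a connection URL string or a dict holding one
    url = secret if isinstance(secret, str) else secret["url"]
    client = redis.Redis.from_url(url)
    value = client.get(key)
    return {"key": key, "value": value.decode() if value is not None else None}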

When to Ask for Clarification

If the user's request is ambiguous, ask these questions:

Data Source Unclear

  • "What type of data are you working with? (CSV, API, database, etc.)"
  • "Do you have a file to upload, or are you connecting to an external source?"

API Integration Unclear

  • "Does this API require authentication? (OAuth, API token, basic auth, or none)"
  • "What operations do you need? (read data, write data, both)"

Data Volume Unclear

  • "How large is the dataset? (<1MB, 1-100MB, >100MB)"
  • "How often does the data update? (static, daily, real-time)"

Security Requirements Unclear

  • "Who should have access to this data? (everyone, specific roles, specific users)"
  • "Are there any sensitive fields that need protection?"

Functionality Unclear

  • "What questions do you want to ask about this data?"
  • "What operations should be available through the MCP server?"

Heuristics When No Interaction Available

If you cannot ask questions, use these defaults:

  1. CSV file mentioned → dbt seed + SQL tool with SELECT *
  2. Excel mentioned → Convert to CSV + dbt seed OR Python pandas tool
  3. API mentioned → Check for template, otherwise use Python tool with httpx
  4. OAuth mentioned → Use google-calendar template as base
  5. Database mentioned → DuckDB ATTACH for direct query OR dbt for caching
  6. PostgreSQL/MySQL mentioned → Use ATTACH with read-only user
  7. Transformation needed → dbt model
  8. Complex logic → Python tool
  9. Security not mentioned → No policies (user can add later)
  10. No auth mentioned for API → Assume token/basic auth

Configuration Management

ALWAYS create config.yml in the project directory, NOT ~/.mxcp/config.yml

Why?

  • User maintains control over global config
  • Project is self-contained and portable
  • Safer for agents (no global config modification)
  • User can review before copying to ~/.mxcp/

Basic config.yml template:

# config.yml (in project root)
mxcp: 1

profiles:
  default:
    # Secrets via environment variables (recommended)
    secrets:
      - name: api_token
        type: env
        parameters:
          env_var: API_TOKEN

    # Database configuration (optional, default is data/db-default.duckdb)
    database:
      path: "data/db-default.duckdb"

    # Authentication (if needed)
    auth:
      provider: github  # or google, microsoft, etc.

  production:
    database:
      path: "prod.duckdb"
    audit:
      enabled: true
      path: "audit.jsonl"

Usage options:

# Option 1: Auto-discover (mxcp looks for ./config.yml)
mxcp serve

# Option 2: Explicit path via environment variable
MXCP_CONFIG=./config.yml mxcp serve

# Option 3: User manually copies to global location
cp config.yml ~/.mxcp/config.yml
mxcp serve

In skill implementations:

# CORRECT: Create local config
cat > config.yml <<EOF
mxcp: 1
profiles:
  default:
    secrets:
      - name: github_token
        type: env
        parameters:
          env_var: GITHUB_TOKEN
EOF

echo "Config created at ./config.yml"
echo "Set environment variable: export GITHUB_TOKEN=your_token"
echo "Or copy to global config: cp config.yml ~/.mxcp/config.yml"
# WRONG: Don't edit user's global config
# DON'T DO THIS:
# vim ~/.mxcp/config.yml  # ❌ Never do this!

Secrets Management

Three approaches (in order of preference):

  1. Environment Variables (Best for development):

# config.yml
secrets:
  - name: api_key
    type: env
    parameters:
      env_var: API_KEY

# Shell
export API_KEY=your_secret_key
mxcp serve

  2. Vault/1Password (Best for production):

# config.yml
secrets:
  - name: database_creds
    type: vault
    parameters:
      path: secret/data/myapp/db
      field: password

  3. Direct in config.yml (Only for non-sensitive or example values):

# config.yml - ONLY for non-sensitive data
secrets:
  - name: api_endpoint
    type: python
    parameters:
      url: "https://api.example.com"  # Not sensitive

Instructions for users:

# After agent creates config.yml, user can:

# Option A: Use environment variables
export API_KEY=xxx
export DB_PASSWORD=yyy
mxcp serve

# Option B: Copy to global config and edit
cp config.yml ~/.mxcp/config.yml
vim ~/.mxcp/config.yml  # User edits their own config
mxcp serve

# Option C: Use with explicit path
MXCP_CONFIG=./config.yml mxcp serve

Security-First Checklist

ALWAYS consider security:

  • Authentication: What auth method is needed?
  • Authorization: Who can access this data?
  • Input validation: Add parameter validation in tool definition (see the sketch after this list)
  • Output filtering: Use policies to filter sensitive fields
  • Secrets management: Use Vault/1Password, never hardcode
  • Audit logging: Enable for production systems
  • SQL injection: Use parameterized queries ($param)
  • Rate limiting: Consider for external API calls
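
For the input-validation and SQL-injection items, a hedged Python sketch; the whitelist, table, and column names are illustrative, and db is the database handle from the python-demo patterns shown earlier (its exact execute signature follows that handle):

# Whitelist user-controlled identifiers; bind values as parameters (sketch)
ALLOWED_SORT_COLUMNS = {"created_at", "amount", "country"}  # illustrative

def list_orders(sort_by: str = "created_at", country: str | None = None) -> list[dict]:
    # Identifiers cannot be bound as parameters, so validate them against a whitelist
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"Unsupported sort column: {sort_by}")
    # Values go through bound parameters ($country), never string interpolation
    query = f"SELECT * FROM orders WHERE $country IS NULL OR country = $country ORDER BY {sort_by}"
    return db.execute(query, {"country": country})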

Robustness Checklist

ALWAYS ensure robustness:

  • Error handling: Add try/except in Python, handle nulls in SQL (see the sketch after this list)
  • Type validation: Define return types and parameter types
  • Tests: Create test cases for all tools
  • Data validation: Add dbt tests for seeds and models
  • Documentation: Add descriptions to all tools/resources
  • Schema validation: Create schema.yml for all dbt seeds/models
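
For the error-handling and type-validation items, a minimal sketch of a defensive Python endpoint around an external call (the URL is illustrative):

import httpx

async def fetch_report(report_id: str) -> dict:
    # Validate inputs before doing any work
    if not report_id.strip():
        raise ValueError("report_id must be a non-empty string")
    try:
        async with httpx.AsyncClient(timeout=10.0) as client:
            response = await client.get(f"https://api.example.com/reports/{report_id}")
            response.raise_for_status()
            return {"status": "ok", "report": response.json()}
    except httpx.HTTPStatusError as exc:
        # Surface upstream failures as structured errors instead of raw tracebacks
        return {"status": "error", "detail": f"API returned {exc.response.status_code}"}
    except httpx.RequestError as exc:
        return {"status": "error", "detail": f"request failed: {exc}"}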

Testing Checklist

ALWAYS test before deployment:

  • mxcp validate - Structure validation
  • mxcp test - Functional testing
  • mxcp lint - Metadata quality
  • dbt test - Data quality (if using dbt)
  • Manual testing with mxcp run tool <name>
  • Test with invalid inputs
  • Test with edge cases (empty data, nulls, etc.)

Summary

Quick reference for common requests:

User Request | Approach | Template | Key Steps
"Query my CSV" | dbt seed + SQL tool | None | seed → schema.yml → dbt seed/test → SQL tool
"Read Excel file" | Convert to CSV + dbt seed OR pandas tool | None | Excel→CSV → seed OR pandas → DuckDB table
"Connect to PostgreSQL" | ATTACH + SQL tool OR dbt cache | None | ATTACH → SQL tool OR dbt source/model → SQL tool
"Connect to MySQL" | ATTACH + SQL tool OR dbt cache | None | ATTACH → SQL tool OR dbt source/model → SQL tool
"Generate test data" | dbt model or Python | None | GENERATE_SERIES → dbt model or Python tool
"Wrap library X" | Python wrapper | python-demo | Install lib → wrap functions → create tools
"Connect to Google Calendar" | OAuth + Python | google-calendar | Copy template → configure OAuth
"Connect to Jira" | Token + Python | jira or jira-oauth | Copy template → configure token
"Transform data" | dbt model | None | seed/source → model → schema.yml → dbt run/test → SQL tool
"Complex logic" | Python tool | python-demo | Copy template → implement function
"ML/AI task" | Python + libraries | python-demo | Add ML libs → implement model
"External API" | Python + httpx | python-demo | Implement client → create tool

Priority order:

  1. Security (auth, policies, validation)
  2. Robustness (error handling, types, tests)
  3. Testing (validate, test, lint)
  4. Features (based on user needs)