
Excel File Integration

Guide for working with Excel files (.xlsx, .xls) in MXCP servers.

Overview

Excel files are common data sources that can be integrated into MXCP servers. DuckDB provides multiple ways to read Excel files, and dbt can be used to manage Excel data as seeds or sources.

Reading Excel Files in DuckDB

Method 1: Direct Reading with spatial Extension

DuckDB's spatial extension includes st_read, which can read Excel files:

-- Install and load spatial extension (includes Excel support)
INSTALL spatial;
LOAD spatial;

-- Read Excel file
SELECT * FROM st_read('data.xlsx');

-- Read specific sheet
SELECT * FROM st_read('data.xlsx', layer='Sheet2');
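
The same approach works from Python through the standalone duckdb package. A minimal hedged sketch (the file, sheet, and table names are illustrative):

import duckdb

con = duckdb.connect()  # in-memory database
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Materialize the sheet as a table so later queries don't re-parse the file
con.execute("CREATE TABLE sales AS SELECT * FROM st_read('data.xlsx', layer='Sheet2')")
print(con.execute("SELECT COUNT(*) FROM sales").fetchone())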

Method 2: Using Python with pandas

For more control, use Python with pandas:

# python/excel_reader.py
from mxcp.runtime import db
import pandas as pd

def load_excel_to_duckdb(filepath: str, table_name: str, sheet_name: str = None) -> dict:
    """Load Excel file into DuckDB table"""
    # Read Excel with pandas (sheet_name=None makes pandas return a dict of
    # all sheets, so fall back to the first sheet)
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)

    # Register DataFrame in DuckDB
    db.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM df")

    return {
        "table": table_name,
        "rows": len(df),
        "columns": list(df.columns)
    }

def read_excel_data(filepath: str, sheet_name: str = None) -> list[dict]:
    """Read Excel and return as list of dicts"""
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)  # first sheet by default
    return df.to_dict('records')
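
Either helper can back a Python tool directly. A hedged usage sketch (file, table, and sheet names are illustrative):

# Load the first sheet into a table called "sales"
info = load_excel_to_duckdb('sales.xlsx', 'sales')
print(info['rows'], info['columns'])

# Or fetch rows directly without creating a table
records = read_excel_data('sales.xlsx', sheet_name='Sheet1')
print(records[:3])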

Method 3: Convert to CSV, then use dbt seed

A good fit for static or infrequently updated Excel data (see Pattern 3 below):

# Convert Excel to CSV using Python
python -c "import pandas as pd; pd.read_excel('data.xlsx').to_csv('seeds/data.csv', index=False)"

# Then follow standard dbt seed workflow
cat > seeds/schema.yml <<EOF
version: 2
seeds:
  - name: data
    columns:
      - name: id
        tests: [unique, not_null]
EOF

dbt seed
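
If the workbook has several sheets, the same one-time conversion can export each sheet to its own seed. A small sketch (file and directory names are illustrative):

import pandas as pd

# sheet_name=None tells pandas to read every sheet into a dict of DataFrames
sheets = pd.read_excel('data.xlsx', sheet_name=None)
for name, df in sheets.items():
    safe = name.lower().replace(' ', '_').replace('-', '_')
    df.to_csv(f'seeds/{safe}.csv', index=False)
    print(f'{safe}: {len(df)} rows')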

Common Patterns

Pattern 1: Excel Upload → Query Tool

User request: "I have an Excel file with sales data, let me query it"

Implementation:

# tools/upload_excel.yml
mxcp: 1
tool:
  name: upload_excel
  description: "Load Excel file into queryable table"
  language: python
  parameters:
    - name: filepath
      type: string
      description: "Path to Excel file"
    - name: sheet_name
      type: string
      required: false
      description: "Sheet name (default: first sheet)"
  return:
    type: object
    properties:
      table_name: { type: string }
      rows: { type: integer }
      columns: { type: array }
  source:
    file: ../python/excel_loader.py
# python/excel_loader.py
from mxcp.runtime import db
import pandas as pd
import os

def upload_excel(filepath: str, sheet_name: str = None) -> dict:
    """Load Excel file into DuckDB for querying"""
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Excel file not found: {filepath}")

    # Read Excel
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)

    # Generate table name from filename
    table_name = os.path.splitext(os.path.basename(filepath))[0].replace('-', '_').replace(' ', '_')

    # Load into DuckDB
    db.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM df")

    return {
        "table_name": table_name,
        "rows": len(df),
        "columns": list(df.columns),
        "message": f"Excel loaded. Query with: SELECT * FROM {table_name}"
    }
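
Because the generated table name is interpolated directly into SQL (and a filename-derived name can also start with a digit), it is worth sanitizing it first. A minimal guard; the regex and t_ prefix are one reasonable convention, not an MXCP requirement:

import re

def safe_table_name(raw: str) -> str:
    """Reduce an arbitrary filename stem to a safe SQL identifier."""
    name = re.sub(r'[^A-Za-z0-9_]', '_', raw)
    if not re.match(r'[A-Za-z_]', name):
        name = f"t_{name}"  # identifiers should not start with a digit
    return name.lower()
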
# tools/query_excel_data.yml
mxcp: 1
tool:
  name: query_excel_data
  description: "Query data from uploaded Excel file"
  parameters:
    - name: table_name
      type: string
      description: "Table name (from upload_excel result)"
    - name: filter_column
      type: string
      required: false
    - name: filter_value
      type: string
      required: false
  return:
    type: array
  source:
    code: |
      -- Render the filter only when a column is supplied; an empty identifier
      -- would otherwise produce invalid SQL
      -- (assumes Jinja {% if %} blocks are supported in MXCP SQL sources)
      SELECT * FROM {{ table_name }}
      {% if filter_column %}WHERE CAST({{ filter_column }} AS VARCHAR) = $filter_value{% endif %}
      LIMIT 1000

Validation workflow for Pattern 1:

# 1. Validate MXCP structure
mxcp validate

# 2. Test upload tool
mxcp test tool upload_excel

# 3. Manual test with real Excel file
mxcp run tool upload_excel --param filepath="./test.xlsx"

# 4. Test query tool
mxcp run tool query_excel_data --param table_name="test"

# 5. All validations must pass before deployment

Pattern 2: Excel → dbt Python Model → Analytics

User request: "Process this Excel file with complex formatting and transform the data"

RECOMMENDED for complex Excel processing. Use dbt Python models when:

  • Excel has complex formatting or multiple sheets
  • Need pandas operations (pivoting, melting, complex string manipulation)
  • Data cleaning requires Python logic

Implementation:

  1. Create dbt Python model (models/process_excel.py):
import pandas as pd

def model(dbt, session):
    # Read Excel file
    df = pd.read_excel('data/sales_data.xlsx', sheet_name='Sales')

    # Clean data
    df = df.dropna(how='all')  # Remove empty rows
    df = df.dropna(axis=1, how='all')  # Remove empty columns

    # Normalize column names
    df.columns = df.columns.str.lower().str.replace(' ', '_')

    # Complex transformations using pandas
    df['sale_date'] = pd.to_datetime(df['sale_date'])
    df['month'] = df['sale_date'].dt.to_period('M').astype(str)

    # Aggregate
    result = df.groupby(['region', 'month']).agg({
        'amount': 'sum',
        'quantity': 'sum'
    }).reset_index()

    return result  # Returns DataFrame that becomes a DuckDB table
  2. Create schema (models/schema.yml):
version: 2

models:
  - name: process_excel
    description: "Processed sales data from Excel"
    config:
      materialized: table
    columns:
      - name: region
        tests: [not_null]
      - name: month
        tests: [not_null]
      - name: amount
        tests: [not_null]
  3. Run the Python model (a quick sanity check of the same pandas logic is sketched after this list):
dbt run --select process_excel
dbt test --select process_excel
  4. Create MXCP tool:
# tools/sales_analytics.yml
mxcp: 1
tool:
  name: sales_analytics
  description: "Get processed sales data from Excel"
  parameters:
    - name: region
      type: string
      default: null
  return:
    type: array
  source:
    code: |
      SELECT * FROM process_excel
      WHERE $region IS NULL OR region = $region
      ORDER BY month DESC
  5. Validate:
mxcp validate
mxcp test tool sales_analytics
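
Before running dbt, the model's pandas logic can be sanity-checked on a tiny in-memory frame, as referenced in step 3. A hedged sketch that duplicates the cleaning steps (the column names are assumptions about the Excel layout):

import pandas as pd

# Miniature frame mimicking the assumed Excel layout
df = pd.DataFrame({
    'Sale Date': ['2024-01-05', '2024-01-20', '2024-02-02'],
    'Region': ['North', 'North', 'South'],
    'Amount': [100.0, 50.0, 75.0],
    'Quantity': [1, 2, 3],
})

# Same cleaning steps as the model
df.columns = df.columns.str.lower().str.replace(' ', '_')
df['sale_date'] = pd.to_datetime(df['sale_date'])
df['month'] = df['sale_date'].dt.to_period('M').astype(str)

print(df.groupby(['region', 'month']).agg({'amount': 'sum', 'quantity': 'sum'}).reset_index())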

Pattern 3: Excel → dbt seed → Analytics

User request: "Analyze this Excel file with aggregations"

Use this approach for simpler Excel files. Convert to CSV first when:

  • Excel file is simple with standard formatting
  • Want version control for the data (CSV in git)
  • Data is static and doesn't change

Implementation:

  1. Convert Excel to CSV seed:
# One-time conversion
python -c "
import pandas as pd
df = pd.read_excel('sales_data.xlsx')
df.to_csv('seeds/sales_data.csv', index=False)
print(f'Converted {len(df)} rows')
"
  2. Create seed schema:
# seeds/schema.yml
version: 2

seeds:
  - name: sales_data
    description: "Sales data from Excel upload"
    columns:
      - name: sale_id
        tests: [unique, not_null]
      - name: sale_date
        data_type: date
        tests: [not_null]
      - name: amount
        data_type: decimal
        tests: [not_null]
      - name: region
        tests: [not_null]
      - name: product
        tests: [not_null]
  3. Load seed and validate (a direct DuckDB verification is sketched after this list):
# Load CSV into DuckDB
dbt seed --select sales_data

# Run data quality tests
dbt test --select sales_data

# Verify data loaded correctly (assumes a custom show_table macro in macros/;
# dbt has no built-in operation with this name)
dbt run-operation show_table --args '{"table_name": "sales_data"}'

CRITICAL: Always run dbt test after loading seeds to ensure data quality.

  4. Create analytics model:
-- models/sales_analytics.sql
{{ config(materialized='table') }}

SELECT
  region,
  product,
  DATE_TRUNC('month', sale_date) as month,
  COUNT(*) as transaction_count,
  SUM(amount) as total_sales,
  AVG(amount) as avg_sale,
  MIN(amount) as min_sale,
  MAX(amount) as max_sale
FROM {{ ref('sales_data') }}
GROUP BY region, product, month
  5. Create query tool:
# tools/sales_analytics.yml
mxcp: 1
tool:
  name: sales_analytics
  description: "Get sales analytics by region and product"
  parameters:
    - name: region
      type: string
      required: false
    - name: product
      type: string
      required: false
  return:
    type: array
  source:
    code: |
      SELECT * FROM sales_analytics
      WHERE ($region IS NULL OR region = $region)
        AND ($product IS NULL OR product = $product)
      ORDER BY month DESC, total_sales DESC
  6. Validate and test MXCP tool:
# Validate MXCP structure
mxcp validate

# Test tool execution
mxcp test tool sales_analytics

# Manual verification
mxcp run tool sales_analytics --param region="North"

# All checks must pass before deployment
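
For an extra check beyond dbt test, the seeded table can be inspected directly with the duckdb package, as referenced in step 3. The database path is a hypothetical dbt-duckdb target; use the path from your profiles.yml:

import duckdb

con = duckdb.connect('dev.duckdb')  # hypothetical path; match your dbt-duckdb target
print(con.execute('SELECT COUNT(*) AS rows FROM sales_data').fetchone())
print(con.execute('DESCRIBE sales_data').fetchall())
con.close()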

Pattern 4: Multi-Sheet Excel Processing

User request: "My Excel has multiple sheets, process them all"

# python/multi_sheet_loader.py
from mxcp.runtime import db
import pandas as pd

def load_all_sheets(filepath: str) -> dict:
    """Load all sheets from Excel file as separate tables"""
    # Read all sheets
    excel_file = pd.ExcelFile(filepath)

    results = {}
    for sheet_name in excel_file.sheet_names:
        df = pd.read_excel(filepath, sheet_name=sheet_name)

        # Clean table name
        table_name = sheet_name.lower().replace(' ', '_').replace('-', '_')

        # Load to DuckDB
        db.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM df")

        results[sheet_name] = {
            "table_name": table_name,
            "rows": len(df),
            "columns": list(df.columns)
        }

    return {
        "sheets_loaded": len(results),
        "sheets": results
    }
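
A hedged usage sketch for the loader (the workbook name is illustrative):

result = load_all_sheets('quarterly_report.xlsx')
print(f"Loaded {result['sheets_loaded']} sheets")
for sheet, info in result['sheets'].items():
    print(f"  {sheet} -> {info['table_name']} ({info['rows']} rows)")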

Excel-Specific Considerations

Data Type Inference

Excel doesn't have strict types. Handle type ambiguity:

def clean_excel_types(df: pd.DataFrame) -> pd.DataFrame:
    """Clean common Excel type issues"""
    for col in df.columns:
        if df[col].dtype == 'object':
            # Try to parse date-like text columns; on failure the column
            # stays text, so strip the stray whitespace Excel cells often carry
            try:
                df[col] = pd.to_datetime(df[col])
            except (ValueError, TypeError):
                df[col] = df[col].str.strip()

    return df
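
Numbers stored as text ("1,234", " 42 ") are just as common as date ambiguity. A hedged companion helper built on pd.to_numeric; the 80% success threshold is an arbitrary heuristic, not a pandas default:

import pandas as pd

def coerce_numeric_columns(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Convert object columns to numeric when most values parse cleanly."""
    for col in df.select_dtypes(include='object').columns:
        cleaned = df[col].astype(str).str.replace(',', '', regex=False)
        converted = pd.to_numeric(cleaned, errors='coerce')
        # Keep the conversion only when it succeeds for most values
        if converted.notna().mean() >= threshold:
            df[col] = converted
    return df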

Handling Headers

Excel files may have inconsistent headers:

def normalize_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize Excel column names"""
    df.columns = (
        df.columns
        .str.lower()
        .str.replace(' ', '_')
        .str.replace('-', '_')
        .str.replace('[^a-z0-9_]', '', regex=True)
    )
    return df

Empty Rows/Columns

Excel often has empty rows:

def clean_excel_data(filepath: str, sheet_name: str = None) -> pd.DataFrame:
    """Read and clean Excel data"""
    # sheet_name=None would return a dict of all sheets, so default to the first
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)

    # Remove completely empty rows
    df = df.dropna(how='all')

    # Remove completely empty columns
    df = df.dropna(axis=1, how='all')

    # Normalize headers
    df = normalize_headers(df)

    # Clean types
    df = clean_excel_types(df)

    return df

Complete Example: Excel Analytics Server

Scenario: User uploads Excel file, wants to query and get statistics

# Project structure
excel-analytics/
├── mxcp-site.yml
├── python/
│   ├── excel_loader.py
│   └── excel_analytics.py
├── tools/
│   ├── load_excel.yml
│   ├── query_data.yml
│   └── get_statistics.yml
└── seeds/
    └── schema.yml (if using dbt seed approach)

Implementation:

# python/excel_loader.py
from mxcp.runtime import db
import pandas as pd
import os

def normalize_headers(df: pd.DataFrame) -> pd.DataFrame:
    df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('[^a-z0-9_]', '', regex=True)
    return df

def load_excel(filepath: str, sheet_name: str = None) -> dict:
    """Load Excel file with cleaning"""
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"File not found: {filepath}")

    # Read and clean
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)
    df = df.dropna(how='all').dropna(axis=1, how='all')
    df = normalize_headers(df)

    # Table name from filename
    table_name = os.path.splitext(os.path.basename(filepath))[0]
    table_name = table_name.lower().replace('-', '_').replace(' ', '_')

    # Load to DuckDB
    db.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM df")

    # Get column info
    col_info = db.execute(f"DESCRIBE {table_name}").fetchall()

    return {
        "table_name": table_name,
        "rows": len(df),
        "columns": [{"name": c["column_name"], "type": c["column_type"]} for c in col_info]
    }

def get_statistics(table_name: str, numeric_columns: list[str] = None) -> dict:
    """Calculate statistics for numeric columns"""
    # Get numeric columns if not specified
    if not numeric_columns:
        schema = db.execute(f"DESCRIBE {table_name}").fetchall()
        numeric_columns = [
            c["column_name"] for c in schema
            # startswith handles parameterized types such as DECIMAL(18,2)
            if c["column_type"].startswith(('INTEGER', 'BIGINT', 'DOUBLE', 'DECIMAL', 'FLOAT'))
        ]

    if not numeric_columns:
        return {"error": "No numeric columns found"}

    # Build statistics query
    stats_parts = []
    for col in numeric_columns:
        stats_parts.append(f"""
            '{col}' as column,
            COUNT({col}) as count,
            AVG({col}) as mean,
            STDDEV({col}) as std_dev,
            MIN({col}) as min,
            PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY {col}) as q25,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY {col}) as median,
            PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY {col}) as q75,
            MAX({col}) as max
        """)

    query = f"""
        SELECT * FROM (
            {' UNION ALL '.join(f'SELECT {part} FROM {table_name}' for part in stats_parts)}
        )
    """

    results = db.execute(query).fetchall()
    return {"statistics": results}
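
A hedged usage sketch once a workbook has been loaded (the file name is illustrative):

info = load_excel('sales.xlsx')
stats = get_statistics(info['table_name'])  # numeric columns auto-detected
for row in stats['statistics']:
    print(row)
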
# tools/load_excel.yml
mxcp: 1
tool:
  name: load_excel
  description: "Load Excel file for querying and analysis"
  language: python
  parameters:
    - name: filepath
      type: string
      description: "Path to Excel file"
    - name: sheet_name
      type: string
      required: false
  return:
    type: object
  source:
    file: ../python/excel_loader.py
  tests:
    - name: "load_test_file"
      arguments:
        - key: filepath
          value: "test_data.xlsx"
      result:
        rows: 100

Dependencies

Add to requirements.txt:

openpyxl>=3.0.0    # For .xlsx files
xlrd>=2.0.0        # For .xls files (optional)
pandas>=2.0.0      # For Excel processing

Best Practices

  1. Always clean Excel data: Remove empty rows/columns, normalize headers
  2. Type validation: Excel types are unreliable, validate and cast
  3. Use dbt seeds for static data: Convert Excel → CSV → seed for version control
  4. Use Python for dynamic uploads: For user-uploaded files during runtime
  5. Document expected format: Tell users what Excel structure is expected
  6. Error handling: Excel files can be malformed, handle errors gracefully
  7. Sheet validation: Check sheet names exist before processing
  8. Memory considerations: Large Excel files can be slow to parse; for very large workbooks, stream rows in batches (see the sketch below)
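
pandas' read_excel has no chunksize option, so streaming a very large workbook means using openpyxl directly. A minimal read-only sketch (batch size, file, and sheet names are illustrative):

import openpyxl

def iter_excel_rows(filepath: str, sheet: str, batch_size: int = 5000):
    """Yield batches of rows without loading the whole workbook into memory."""
    wb = openpyxl.load_workbook(filepath, read_only=True)
    ws = wb[sheet]
    batch = []
    for row in ws.iter_rows(values_only=True):
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
    wb.close()

# Example: count rows without loading the workbook all at once
total = sum(len(batch) for batch in iter_excel_rows('big_file.xlsx', 'Sheet1'))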

Troubleshooting

Issue: "No module named 'openpyxl'"
Solution: pip install openpyxl

Issue: "Excel file empty after loading"
Solution: Check for empty rows/columns; use dropna()

Issue: "Column names have special characters"
Solution: Use the normalize_headers() function

Issue: "Date columns appear as numbers"
Solution: Use pd.to_datetime() to convert Excel serial dates

Issue: "Out of memory with large Excel files"
Solution: Convert to CSV first, use dbt seed, or process in chunks

Summary

For Excel integration in MXCP:

  1. User uploads → Python tool with pandas → DuckDB table → Query tools
  2. Static data → Convert to CSV → dbt seed → Schema validation → Query tools
  3. Multi-sheet → Load all sheets as separate tables
  4. Always validate → Clean headers, types, empty rows
  5. Add statistics tools → Provide insights on numeric columns