# Excel File Integration

Guide for working with Excel files (.xlsx, .xls) in MXCP servers.

## Overview

Excel files are common data sources that can be integrated into MXCP servers. DuckDB provides multiple ways to read Excel files, and dbt can be used to manage Excel data as seeds or sources.

## Reading Excel Files in DuckDB

### Method 1: Direct Reading with the spatial Extension

DuckDB's spatial extension includes `st_read`, which can read Excel files:

```sql
-- Install and load spatial extension (includes Excel support)
INSTALL spatial;
LOAD spatial;

-- Read Excel file
SELECT * FROM st_read('data.xlsx');

-- Read specific sheet
SELECT * FROM st_read('data.xlsx', layer='Sheet2');
```

### Method 2: Using Python with pandas

For more control, use Python with pandas:

```python
# python/excel_reader.py
from mxcp.runtime import db
import pandas as pd

def load_excel_to_duckdb(filepath: str, table_name: str, sheet_name: str = None) -> dict:
    """Load Excel file into DuckDB table"""
    # Read Excel with pandas (default to the first sheet;
    # passing sheet_name=None would return a dict of all sheets)
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)

    # Register DataFrame in DuckDB
    db.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM df")

    return {
        "table": table_name,
        "rows": len(df),
        "columns": list(df.columns)
    }

def read_excel_data(filepath: str, sheet_name: str = None) -> list[dict]:
    """Read Excel and return as list of dicts"""
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)
    return df.to_dict('records')
```

### Method 3: Convert to CSV, then use dbt seed

**Best practice for user-uploaded Excel files**:

```bash
# Convert Excel to CSV using Python
python -c "import pandas as pd; pd.read_excel('data.xlsx').to_csv('seeds/data.csv', index=False)"

# Then follow standard dbt seed workflow
cat > seeds/schema.yml << 'EOF'
version: 2

seeds:
  - name: data
    description: "Data converted from Excel"
EOF

dbt seed --select data
```

## Common Integration Patterns

### Pattern 1: Excel Upload → Python Tool → Query

**User request**: "Let me upload an Excel file and query it"

```python
# python/upload_excel.py
from mxcp.runtime import db
import pandas as pd
import os

def upload_excel(filepath: str, sheet_name: str = None) -> dict:
    """Load Excel file into DuckDB for querying"""
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Excel file not found: {filepath}")

    # Read Excel
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)

    # Generate table name from filename
    table_name = os.path.splitext(os.path.basename(filepath))[0].replace('-', '_').replace(' ', '_')

    # Load into DuckDB
    db.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM df")

    return {
        "table_name": table_name,
        "rows": len(df),
        "columns": list(df.columns),
        "message": f"Excel loaded. Query with: SELECT * FROM {table_name}"
    }
```

```yaml
# tools/query_excel_data.yml
mxcp: 1
tool:
  name: query_excel_data
  description: "Query data from uploaded Excel file"
  parameters:
    - name: table_name
      type: string
      description: "Table name (from upload_excel result)"
    - name: filter_column
      type: string
      required: false
    - name: filter_value
      type: string
      required: false
  return:
    type: array
  source:
    code: |
      SELECT * FROM {{ table_name }}
      WHERE $filter_column IS NULL
         OR CAST({{ filter_column }} AS VARCHAR) = $filter_value
      LIMIT 1000
```

**Validation workflow for Pattern 1**:

```bash
# 1. Validate MXCP structure
mxcp validate

# 2. Test upload tool
mxcp test tool upload_excel

# 3. Manual test with real Excel file
mxcp run tool upload_excel --param filepath="./test.xlsx"

# 4. Test query tool
mxcp run tool query_excel_data --param table_name="test"

# 5. All validations must pass before deployment
```
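Note that Pattern 1 interpolates a table name derived from a user-supplied filename directly into SQL, and the chained `replace()` calls only handle spaces and hyphens. If filenames are untrusted, a stricter allow-list is safer. A minimal sketch, using a hypothetical `safe_table_name` helper (not part of MXCP):

```python
# Hypothetical helper: reduce an arbitrary filename stem to a safe
# SQL identifier before interpolating it into a CREATE TABLE statement.
import re

def safe_table_name(raw: str) -> str:
    """Keep only letters, digits, and underscores in a table name."""
    name = re.sub(r'[^A-Za-z0-9_]', '_', raw)
    # Identifiers should not start with a digit
    if name and name[0].isdigit():
        name = f"t_{name}"
    if not name.strip('_'):
        raise ValueError(f"Cannot derive a table name from: {raw!r}")
    return name.lower()
```

For example, `safe_table_name("Q3 Sales-2024")` yields `"q3_sales_2024"`; it could replace the chained `replace()` calls when deriving `table_name` from `filepath`.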
### Pattern 2: Excel → dbt Python Model → Analytics

**User request**: "Process this Excel file with complex formatting and transform the data"

**RECOMMENDED for complex Excel processing** - Use dbt Python models when:

- Excel has complex formatting or multiple sheets
- You need pandas operations (pivoting, melting, complex string manipulation)
- Data cleaning requires Python logic

**Implementation**:

1. **Create dbt Python model** (`models/process_excel.py`):

```python
import pandas as pd

def model(dbt, session):
    # Read Excel file
    df = pd.read_excel('data/sales_data.xlsx', sheet_name='Sales')

    # Clean data
    df = df.dropna(how='all')          # Remove empty rows
    df = df.dropna(axis=1, how='all')  # Remove empty columns

    # Normalize column names
    df.columns = df.columns.str.lower().str.replace(' ', '_')

    # Complex transformations using pandas
    df['sale_date'] = pd.to_datetime(df['sale_date'])
    df['month'] = df['sale_date'].dt.to_period('M').astype(str)

    # Aggregate
    result = df.groupby(['region', 'month']).agg({
        'amount': 'sum',
        'quantity': 'sum'
    }).reset_index()

    return result  # Returns DataFrame that becomes a DuckDB table
```

2. **Create schema** (`models/schema.yml`):

```yaml
version: 2

models:
  - name: process_excel
    description: "Processed sales data from Excel"
    config:
      materialized: table
    columns:
      - name: region
        tests: [not_null]
      - name: month
        tests: [not_null]
      - name: amount
        tests: [not_null]
```

3. **Run the Python model**:

```bash
dbt run --select process_excel
dbt test --select process_excel
```

4. **Create MXCP tool**:

```yaml
# tools/sales_analytics.yml
mxcp: 1
tool:
  name: sales_analytics
  description: "Get processed sales data from Excel"
  parameters:
    - name: region
      type: string
      default: null
  return:
    type: array
  source:
    code: |
      SELECT * FROM process_excel
      WHERE $region IS NULL OR region = $region
      ORDER BY month DESC
```

5. **Validate**:

```bash
mxcp validate
mxcp test tool sales_analytics
```
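As an aside on Pattern 2: the materialization above is declared in `models/schema.yml`, but dbt Python models can equivalently declare it in code via `dbt.config()`. A minimal sketch of the same model header, assuming the same file path:

```python
# models/process_excel.py (config-in-code variant)
import pandas as pd

def model(dbt, session):
    # Equivalent to `config: materialized: table` in schema.yml
    dbt.config(materialized="table")

    # ...same Excel reading and transformations as shown above...
    df = pd.read_excel('data/sales_data.xlsx', sheet_name='Sales')
    return df
```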
### Pattern 3: Excel → dbt seed → Analytics

**User request**: "Analyze this Excel file with aggregations"

**Use this approach for simpler Excel files** - Convert to CSV first when:

- Excel file is simple with standard formatting
- You want version control for the data (CSV in git)
- Data is static and doesn't change

**Implementation**:

1. **Convert Excel to CSV seed**:

```bash
# One-time conversion
python -c "
import pandas as pd
df = pd.read_excel('sales_data.xlsx')
df.to_csv('seeds/sales_data.csv', index=False)
print(f'Converted {len(df)} rows')
"
```

2. **Create seed schema**:

```yaml
# seeds/schema.yml
version: 2

seeds:
  - name: sales_data
    description: "Sales data from Excel upload"
    columns:
      - name: sale_id
        tests: [unique, not_null]
      - name: sale_date
        data_type: date
        tests: [not_null]
      - name: amount
        data_type: decimal
        tests: [not_null]
      - name: region
        tests: [not_null]
      - name: product
        tests: [not_null]
```

3. **Load seed and validate**:

```bash
# Load CSV into DuckDB
dbt seed --select sales_data

# Run data quality tests
dbt test --select sales_data

# Verify data loaded correctly (assumes a custom show_table macro)
dbt run-operation show_table --args '{"table_name": "sales_data"}'
```

**CRITICAL**: Always run `dbt test` after loading seeds to ensure data quality.

4. **Create analytics model**:

```sql
-- models/sales_analytics.sql
{{ config(materialized='table') }}

SELECT
    region,
    product,
    DATE_TRUNC('month', sale_date) as month,
    COUNT(*) as transaction_count,
    SUM(amount) as total_sales,
    AVG(amount) as avg_sale,
    MIN(amount) as min_sale,
    MAX(amount) as max_sale
FROM {{ ref('sales_data') }}
GROUP BY region, product, month
```

5. **Create query tool**:

```yaml
# tools/sales_analytics.yml
mxcp: 1
tool:
  name: sales_analytics
  description: "Get sales analytics by region and product"
  parameters:
    - name: region
      type: string
      required: false
    - name: product
      type: string
      required: false
  return:
    type: array
  source:
    code: |
      SELECT * FROM sales_analytics
      WHERE ($region IS NULL OR region = $region)
        AND ($product IS NULL OR product = $product)
      ORDER BY month DESC, total_sales DESC
```

6. **Validate and test MXCP tool**:

```bash
# Validate MXCP structure
mxcp validate

# Test tool execution
mxcp test tool sales_analytics

# Manual verification
mxcp run tool sales_analytics --param region="North"

# All checks must pass before deployment
```

### Pattern 4: Multi-Sheet Excel Processing

**User request**: "My Excel has multiple sheets, process them all"

```python
# python/multi_sheet_loader.py
from mxcp.runtime import db
import pandas as pd

def load_all_sheets(filepath: str) -> dict:
    """Load all sheets from Excel file as separate tables"""
    # Read all sheets
    excel_file = pd.ExcelFile(filepath)

    results = {}
    for sheet_name in excel_file.sheet_names:
        df = pd.read_excel(filepath, sheet_name=sheet_name)

        # Clean table name
        table_name = sheet_name.lower().replace(' ', '_').replace('-', '_')

        # Load to DuckDB
        db.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM df")

        results[sheet_name] = {
            "table_name": table_name,
            "rows": len(df),
            "columns": list(df.columns)
        }

    return {
        "sheets_loaded": len(results),
        "sheets": results
    }
```
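Pattern 4 loads every sheet, but tools that accept a `sheet_name` parameter (as in Patterns 1 and 2) should check that the sheet exists before reading, per the sheet-validation best practice later in this guide. A minimal sketch with a hypothetical `require_sheet` helper:

```python
# Hypothetical helper: fail fast with a clear message when a
# requested sheet does not exist in the workbook.
import pandas as pd

def require_sheet(filepath: str, sheet_name: str) -> None:
    """Raise ValueError if sheet_name is not present in the Excel file."""
    available = pd.ExcelFile(filepath).sheet_names
    if sheet_name not in available:
        raise ValueError(
            f"Sheet {sheet_name!r} not found in {filepath}. "
            f"Available sheets: {available}"
        )
```

Calling `require_sheet(filepath, sheet_name)` before `pd.read_excel()` gives the caller the list of available sheets instead of a bare lookup error.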
## Excel-Specific Considerations

### Data Type Inference

Excel doesn't have strict types. Handle type ambiguity:

```python
import pandas as pd

def clean_excel_types(df: pd.DataFrame) -> pd.DataFrame:
    """Clean common Excel type issues"""
    for col in df.columns:
        # Convert Excel dates properly
        if df[col].dtype == 'object':
            try:
                df[col] = pd.to_datetime(df[col])
            except (ValueError, TypeError):
                pass

        # Strip whitespace from strings
        if df[col].dtype == 'object':
            df[col] = df[col].str.strip()

    return df
```

### Handling Headers

Excel files may have inconsistent headers:

```python
def normalize_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize Excel column names"""
    df.columns = (
        df.columns
        .str.lower()
        .str.replace(' ', '_')
        .str.replace('-', '_')
        .str.replace('[^a-z0-9_]', '', regex=True)
    )
    return df
```

### Empty Rows/Columns

Excel often has empty rows:

```python
def clean_excel_data(filepath: str, sheet_name: str = None) -> pd.DataFrame:
    """Read and clean Excel data"""
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)

    # Remove completely empty rows
    df = df.dropna(how='all')

    # Remove completely empty columns
    df = df.dropna(axis=1, how='all')

    # Normalize headers
    df = normalize_headers(df)

    # Clean types
    df = clean_excel_types(df)

    return df
```

## Complete Example: Excel Analytics Server

**Scenario**: User uploads an Excel file and wants to query it and get statistics.

```bash
# Project structure
excel-analytics/
├── mxcp-site.yml
├── python/
│   ├── excel_loader.py
│   └── excel_analytics.py
├── tools/
│   ├── load_excel.yml
│   ├── query_data.yml
│   └── get_statistics.yml
└── seeds/
    └── schema.yml   # (if using dbt seed approach)
```

**Implementation**:

```python
# python/excel_loader.py
from mxcp.runtime import db
import pandas as pd
import os

def normalize_headers(df: pd.DataFrame) -> pd.DataFrame:
    df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('[^a-z0-9_]', '', regex=True)
    return df

def load_excel(filepath: str, sheet_name: str = None) -> dict:
    """Load Excel file with cleaning"""
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"File not found: {filepath}")

    # Read and clean
    df = pd.read_excel(filepath, sheet_name=sheet_name or 0)
    df = df.dropna(how='all').dropna(axis=1, how='all')
    df = normalize_headers(df)

    # Table name from filename
    table_name = os.path.splitext(os.path.basename(filepath))[0]
    table_name = table_name.lower().replace('-', '_').replace(' ', '_')

    # Load to DuckDB
    db.execute(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM df")

    # Get column info
    col_info = db.execute(f"DESCRIBE {table_name}").fetchall()

    return {
        "table_name": table_name,
        "rows": len(df),
        "columns": [{"name": c["column_name"], "type": c["column_type"]} for c in col_info]
    }

def get_statistics(table_name: str, numeric_columns: list[str] = None) -> dict:
    """Calculate statistics for numeric columns"""
    # Get numeric columns if not specified (match on type prefix so
    # parameterized types like DECIMAL(18,3) are included)
    if not numeric_columns:
        schema = db.execute(f"DESCRIBE {table_name}").fetchall()
        numeric_columns = [
            c["column_name"] for c in schema
            if c["column_type"].startswith(('INTEGER', 'BIGINT', 'DOUBLE', 'DECIMAL', 'FLOAT'))
        ]

    if not numeric_columns:
        return {"error": "No numeric columns found"}

    # Build statistics query (one SELECT per column, combined with UNION ALL;
    # "column" is quoted because it is a reserved word)
    stats_parts = []
    for col in numeric_columns:
        stats_parts.append(f"""
            '{col}' as "column",
            COUNT({col}) as count,
            AVG({col}) as mean,
            STDDEV({col}) as std_dev,
            MIN({col}) as min,
            PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY {col}) as q25,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY {col}) as median,
            PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY {col}) as q75,
            MAX({col}) as max
        """)

    query = f"""
        SELECT * FROM (
            {' UNION ALL '.join(f'SELECT {part} FROM {table_name}' for part in stats_parts)}
        )
    """

    results = db.execute(query).fetchall()
    return {"statistics": results}
```
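Before wiring these functions into MXCP tools, they can be exercised directly. A minimal smoke-test sketch, assuming an environment where `mxcp.runtime` is importable, `python/` is on the import path, and a local `test_data.xlsx` exists (the same hypothetical file the tool test below references):

```python
# Quick smoke test for the loader and statistics helpers
from excel_loader import load_excel, get_statistics

info = load_excel("test_data.xlsx")
print(info["table_name"], info["rows"])      # e.g. "test_data 100"

stats = get_statistics(info["table_name"])   # defaults to all numeric columns
for row in stats["statistics"]:
    print(row)
```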
```yaml
# tools/load_excel.yml
mxcp: 1
tool:
  name: load_excel
  description: "Load Excel file for querying and analysis"
  language: python
  parameters:
    - name: filepath
      type: string
      description: "Path to Excel file"
    - name: sheet_name
      type: string
      required: false
  return:
    type: object
  source:
    file: ../python/excel_loader.py
  tests:
    - name: "load_test_file"
      arguments:
        - key: filepath
          value: "test_data.xlsx"
      result:
        rows: 100
```

## Dependencies

Add to `requirements.txt`:

```
openpyxl>=3.0.0   # For .xlsx files
xlrd>=2.0.0       # For .xls files (optional)
pandas>=2.0.0     # For Excel processing
```

## Best Practices

1. **Always clean Excel data**: Remove empty rows/columns, normalize headers
2. **Type validation**: Excel types are unreliable; validate and cast
3. **Use dbt seeds for static data**: Convert Excel → CSV → seed for version control
4. **Use Python for dynamic uploads**: For user-uploaded files during runtime
5. **Document expected format**: Tell users what Excel structure is expected
6. **Error handling**: Excel files can be malformed; handle errors gracefully
7. **Sheet validation**: Check sheet names exist before processing
8. **Memory considerations**: Large Excel files can be slow; consider pagination or streaming

## Troubleshooting

**Issue**: "No module named 'openpyxl'"
**Solution**: `pip install openpyxl`

**Issue**: "Excel file empty after loading"
**Solution**: Check for empty rows/columns, use `dropna()`

**Issue**: "Column names have special characters"
**Solution**: Use the `normalize_headers()` function

**Issue**: "Date columns appear as numbers"
**Solution**: Use `pd.to_datetime()` to convert Excel serial dates

**Issue**: "Out of memory with large Excel files"
**Solution**: Convert to CSV first, use dbt seed, or process in chunks (see the streaming sketch in the appendix below)

## Summary

For Excel integration in MXCP:

1. **User uploads** → Python tool with pandas → DuckDB table → Query tools
2. **Static data** → Convert to CSV → dbt seed → Schema validation → Query tools
3. **Multi-sheet** → Load all sheets as separate tables
4. **Always validate** → Clean headers, types, empty rows
5. **Add statistics tools** → Provide insights on numeric columns
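## Appendix: Streaming Large Excel Files to CSV

For the out-of-memory case in Troubleshooting: `pd.read_excel()` has no chunked mode, but openpyxl's read-only mode can stream an `.xlsx` sheet to CSV row by row, after which the CSV follows the dbt seed workflow from Pattern 3. A minimal sketch; the function name is hypothetical:

```python
# Stream a large .xlsx sheet to CSV without loading it all into memory.
# openpyxl's read_only mode yields row values lazily.
import csv
from openpyxl import load_workbook

def excel_to_csv_streaming(xlsx_path: str, csv_path: str, sheet_name: str = None) -> int:
    """Stream one sheet of an .xlsx file to CSV; returns rows written."""
    wb = load_workbook(xlsx_path, read_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active
    rows = 0
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in ws.iter_rows(values_only=True):
            writer.writerow(row)
            rows += 1
    wb.close()
    return rows
```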