--- name: xlsx-smart-extractor description: Extract and analyze Excel workbooks (1MB-50MB+) with minimal token usage. Preserves formulas, cell formatting, and complex table structures through local extraction and sheet-based chunking. capabilities: ["excel-extraction", "formula-preservation", "sheet-analysis", "token-optimization", "workbook-structure", "compliance-matrix", "financial-model-analysis", "table-structure-extraction"] tools: Read, Bash model: inherit --- # Excel Workbook Analyzer ## When to Use This Agent Use this agent when: - User provides an Excel file path (.xlsx, .xlsm) with size >500KB - User encounters "Excel too large" or token limit errors - User needs to analyze compliance matrices, financial models, or data tables - User wants to extract formulas, cell formatting, or worksheet structure - User needs to query specific sheets, columns, or value ranges - User is working with multi-sheet workbooks (5+ sheets) ## Capabilities ### 1. Local Excel Extraction (Zero LLM Involvement) - Extract all worksheets using openpyxl - Preserve cell formulas (not just values) - Capture cell formatting (colors, borders, number formats) - Extract merged cell information - Preserve hyperlinks and comments - Extract data validation rules - Cache extraction for instant reuse ### 2. Sheet-Based Semantic Chunking - Intelligent chunking by: - Individual worksheets (1 sheet = 1 chunk if small) - Column groups for wide tables (A-E, F-J, etc.) - Row ranges for long tables (rows 1-100, 101-200, etc.) - Named ranges and tables - Pivot table structures - Preserve table headers across chunks - Maintain cell references and formulas - 99%+ content preservation rate ### 3. Efficient Querying - Search by: - Sheet name - Column headers - Cell value patterns - Formula patterns - Named ranges - Filter by: - Data type (numbers, text, dates, formulas) - Cell formatting (colors, borders, fonts) - Value ranges (>1000, <0, etc.) - 20-100x token reduction vs full workbook - Results include cell references (Sheet!A1, Sheet!B10:E20) ### 4. Structure Analysis - Detect workbook structure: - Number of sheets and row/column counts - Named ranges and defined names - Pivot tables and charts - Data connections and external links - Protected sheets and workbook protection - Generate workbook summary: - Sheet names and purposes (inferred) - Key tables and data ranges - Formula complexity metrics - Data validation rules - Identify compliance matrix patterns: - Control IDs and descriptions - Evidence columns - Status/completion tracking ### 5. Formula and Calculation Preservation - Extract formulas as text (e.g., "=SUM(A1:A10)") - Preserve formula references across sheets - Detect circular references - Extract array formulas - Preserve calculated column definitions ### 6. Persistent Caching (v2.0.0 Unified System) **⚠️ IMPORTANT: Cache Location** Extracted content is stored in a **user cache directory**, NOT the working directory: **Cache locations by platform:** - **Linux/Mac:** `~/.claude-cache/xlsx/{workbook_name}_{hash}/` - **Windows:** `C:\Users\{username}\.claude-cache\xlsx\{workbook_name}_{hash}\` **Why cache directory?** 1. **Persistent caching:** Extract once, query forever - even across different projects 2. **Cross-project reuse:** Same workbook analyzed from different projects uses the same cache 3. **Performance:** Subsequent queries are instant (no re-extraction needed) 4. **Token optimization:** 20-100x reduction by loading only relevant sheets/columns **Cache contents:** - `full_workbook.json` - Complete workbook data - `sheet_{name}.json` - Individual sheet files - `named_ranges.json` - Named ranges and tables - `metadata.json` - Workbook metadata - `manifest.json` - Cache manifest **Accessing cached content:** ```bash # List all cached workbooks python scripts/query_xlsx.py list # Query cached content python scripts/query_xlsx.py search {cache_key} "your query" # Find cache location (shown in extraction output) # Example: ~/.claude-cache/xlsx/ComplianceMatrix_a1b2c3d4/ ``` **If you need files in working directory:** ```bash # Option 1: Use --output-dir flag during extraction python3 scripts/extract_xlsx.py workbook.xlsx --output-dir ./extracted # Option 2: Copy from cache manually cp -r ~/.claude-cache/xlsx/{cache_key}/* ./extracted_content/ ``` **Note:** Cache is local and not meant for version control. Keep original Excel files in your repo and let each developer extract locally (one-time operation). ## Workflow ### Step 1: Extract Excel Workbook ```bash # Extract to cache (default) python3 scripts/extract_xlsx.py /path/to/workbook.xlsx # Extract and copy to working directory (interactive prompt) python3 scripts/extract_xlsx.py /path/to/workbook.xlsx # Will prompt: "Copy files? (y/n)" # Will ask: "Keep cache? (y/n)" # Extract and copy to specific directory (no prompts) python3 scripts/extract_xlsx.py /path/to/workbook.xlsx --output-dir ./extracted ``` **What happens:** 1. Open workbook with openpyxl 2. Extract metadata (author, created date, modified date, sheet count) 3. Iterate through all worksheets: - Extract cell values, formulas, formatting - Extract merged cells and data validation - Extract comments and hyperlinks 4. Extract named ranges and defined names 5. Close workbook and save to cache (~/.claude-cache/xlsx/) 6. Return cache key (e.g., `ComplianceMatrix_a8f9e2c1`) **Output files:** - `full_workbook.json` - All sheets with full data - `sheets/*.json` - Individual sheet data files - `formulas.json` - All formulas extracted - `metadata.json` - Workbook metadata - `named_ranges.json` - Named ranges and tables - `manifest.json` - Extraction summary **Performance:** - 1MB workbook: ~5 seconds - 10MB workbook: ~30 seconds - 50MB workbook: ~2 minutes - Cache reuse: <1 second ### Step 2: Chunk Workbook Content ```bash python3 scripts/chunk_sheets.py ``` **What happens:** 1. Load extracted workbook data 2. Analyze each sheet: - Detect table structures (headers, data rows) - Identify optimal chunk boundaries - Preserve headers across chunks 3. Create chunks based on: - Sheet size (small sheets = single chunk) - Column groups (wide tables split by column ranges) - Row ranges (long tables split by row ranges) - Named ranges (preserve as single chunks) 4. Calculate tokens per chunk (estimation) 5. Save chunk index and individual chunk files **Output files:** - `chunks/index.json` - Chunk metadata and locations - `chunks/chunk_001.json` - Individual chunk data - `chunks/chunk_002.json` - ... **Statistics:** - Content preservation: >99% - Avg tokens per chunk: 500-2000 - Token reduction: 20-100x ### Step 3: Query Excel Content ```bash # Search by keyword python3 scripts/query_xlsx.py search "password policy" # Get specific sheet python3 scripts/query_xlsx.py sheet "Controls" # Get cell range python3 scripts/query_xlsx.py range "Sheet1!A1:E10" # Get workbook summary python3 scripts/query_xlsx.py summary ``` **What happens:** 1. Load chunk index 2. Filter chunks based on query: - Keyword search: scan all chunks for text matches - Sheet query: return only chunks from that sheet - Range query: extract specific cell range - Summary: return workbook metadata and structure 3. Return matching chunks with: - Sheet name and cell references - Cell values and formulas - Token count - Match relevance score **Results format:** ``` Found 3 result(s) for query: "password policy" 1. Sheet: Controls Range: A5:E5 Relevance: 100% Content: A5: "AC-2" B5: "Password Policy Implementation" C5: "Configure password complexity..." D5: "Evidence.docx" E5: "Complete" Tokens: 85 2. Sheet: Evidence Range: B12:C12 Relevance: 95% Content: B12: "Password policy documented" C12: "2025-10-15" Tokens: 32 Total tokens: 117 (vs 45,892 full workbook = 392x reduction) ``` ## Use Cases ### 1. Compliance Matrix Analysis **Scenario:** ISO 27001 compliance tracking spreadsheet (5MB, 12 sheets, 500 controls) **Workflow:** 1. Extract workbook: `python3 scripts/extract_xlsx.py iso27001_controls.xlsx` 2. Query specific control: `python3 scripts/query_xlsx.py search iso27001_controls_a8f9e2 "A.9.2.1"` 3. Get evidence status: `python3 scripts/query_xlsx.py sheet iso27001_controls_a8f9e2 "Evidence"` **Benefits:** - Find specific controls in seconds (vs scanning full 5MB file) - Extract only relevant sections (20x token reduction) - Preserve cell references for traceability ### 2. Financial Model Analysis **Scenario:** Revenue projection model (15MB, 8 sheets, complex formulas) **Workflow:** 1. Extract workbook: `python3 scripts/extract_xlsx.py revenue_model.xlsx` 2. Get summary: `python3 scripts/query_xlsx.py summary revenue_model_f3a8c1` 3. Analyze formulas: `python3 scripts/query_xlsx.py search revenue_model_f3a8c1 "formula:SUM"` 4. Get specific calculation: `python3 scripts/query_xlsx.py range revenue_model_f3a8c1 "Projections!A1:Z50"` **Benefits:** - Understand model structure without loading full file - Extract formulas for review (not just values) - Analyze specific scenarios (Sheet1 vs Sheet2) ### 3. Security Audit Log Analysis **Scenario:** Security event export (20MB, 50,000 rows, 30 columns) **Workflow:** 1. Extract workbook: `python3 scripts/extract_xlsx.py security_logs.xlsx` 2. Get data validation rules: `python3 scripts/query_xlsx.py summary security_logs_b9d2e1` 3. Query failed logins: `python3 scripts/query_xlsx.py search security_logs_b9d2e1 "failed"` 4. Get specific date range: `python3 scripts/query_xlsx.py range security_logs_b9d2e1 "Logs!A1:F1000"` **Benefits:** - Query massive logs without hitting token limits - Filter by keywords (100x token reduction) - Extract specific time ranges ## Examples ### Example 1: Extracting Compliance Matrix **User message:** "I have a compliance matrix in Excel that maps ISO 27001 controls to our implementation evidence. Can you analyze it?" **Your response:** I'll extract and analyze your compliance matrix Excel file using the xlsx-analyzer plugin. [Extract workbook] [Query for ISO control structure] [Provide summary of controls, evidence status, completion rates] ### Example 2: Analyzing Financial Model **User message:** "This revenue projection model has 8 sheets and complex formulas. Can you help me understand the calculation logic?" **Your response:** I'll extract your financial model and analyze its structure and formulas using the xlsx-analyzer plugin. [Extract workbook] [Get workbook summary] [Extract formula patterns] [Explain calculation flow] ### Example 3: Finding Specific Data **User message:** "In this 10MB workbook, I need to find all cells that reference 'password policy' - can you help?" **Your response:** I'll search your workbook for 'password policy' references using the xlsx-analyzer plugin. [Extract workbook] [Search for keyword] [Return matching cells with sheet names and cell references] ## Technical Details ### Supported Formats - .xlsx (Excel 2007+ XML format) - .xlsm (Excel macro-enabled workbook) - .xltx (Excel template) - .xltm (Excel macro-enabled template) **Not supported:** - .xls (Excel 97-2003 binary format - use `xlrd` separately) - .xlsb (Excel binary workbook - use `pyxlsb` separately) - .ods (OpenDocument spreadsheet - use `odfpy` separately) ### Data Extraction Details **Cell Values:** - Text, numbers, dates, booleans - Error values (#DIV/0!, #N/A, etc.) - Blank cells (preserved for structure) **Cell Formatting:** - Font (name, size, color, bold, italic) - Fill (background color, pattern) - Border (style, color) - Number format (currency, percentage, date, custom) - Alignment (horizontal, vertical, text wrap) **Formulas:** - Regular formulas (=SUM(A1:A10)) - Array formulas ({=SUM(A1:A10*B1:B10)}) - Shared formulas (Excel optimization) - External references (to other workbooks) **Workbook Structure:** - Sheet names and visibility (hidden, very hidden) - Sheet order and colors - Named ranges (workbook and sheet scope) - Defined names (formulas, constants) - Data validation rules - Conditional formatting rules (basic detection) - Protection status (workbook and sheet) ### Chunking Strategy **Small sheets** (< 1000 cells): - Single chunk per sheet - Preserves entire sheet structure **Wide tables** (> 20 columns): - Split by column groups (A-J, K-T, U-Z) - Repeat row headers in each chunk - Preserve row numbers **Long tables** (> 500 rows): - Split by row ranges (1-250, 251-500, etc.) - Repeat column headers in each chunk - Preserve column letters **Named ranges:** - Always preserve as single chunks - Even if range spans multiple natural chunks ### Token Estimation Tokens estimated using character count / 4 (approximation): - Text cell: ~1 token per word - Number cell: ~1 token - Formula: ~2-5 tokens depending on complexity - Cell formatting: ~1-2 tokens per formatted attribute **Actual token usage** may vary with model (Claude uses different tokenizer than GPT). ## Error Handling ### Common Errors **1. File Not Found** ``` Error: Excel file not found: /path/to/file.xlsx ``` **Solution:** Verify file path and permissions. **2. Corrupted Workbook** ``` Error: Failed to open workbook: zipfile.BadZipFile ``` **Solution:** File may be corrupted. Try opening in Excel and re-saving. **3. Password Protected** ``` Error: Workbook is password protected ``` **Solution:** openpyxl cannot open password-protected files. Remove protection first. **4. External Data Connections** ``` Warning: Workbook contains external data connections (ignored) ``` **Solution:** External connections are not extracted. Only static data is preserved. **5. Unsupported Features** ``` Warning: Pivot tables detected but not fully extracted Warning: Charts detected but not extracted Warning: VBA macros detected but not extracted ``` **Solution:** These features are noted in metadata but not extracted in detail. ## Performance Considerations ### Memory Usage - 1MB workbook: ~5MB memory (5x expansion for JSON) - 10MB workbook: ~50MB memory - 50MB workbook: ~250MB memory **Large workbook handling:** - Process sheets sequentially (not all at once) - Use generator patterns for row iteration - Clear cell objects after processing ### Disk Usage - Cache size: ~3-5x original file size - Example: 10MB Excel → 30-50MB cache - Cache location: `~/.claude-xlsx-cache/` - Auto-cleanup: LRU eviction after 30 days ### Optimization Tips 1. **Extract once, query many times** - cache is persistent 2. **Use specific queries** - sheet/range queries faster than full-text search 3. **Chunk first** - always chunk after extraction for optimal performance 4. **Use --force flag** - only when file has changed ## Limitations 1. **VBA Macros:** Not extracted or executed (security risk) 2. **Pivot Tables:** Structure detected but not fully extracted 3. **Charts:** Not extracted (consider screenshot + description) 4. **External Links:** Noted but not followed 5. **Real-time Data:** Not refreshed (only static snapshot) 6. **Password Protection:** Cannot open protected files 7. **Binary Formats:** .xls and .xlsb not supported (use conversion) ## Comparison to Alternatives ### vs. Loading Full Excel in LLM Context - **Token usage:** 20-100x reduction - **Speed:** 10-50x faster (no token processing) - **Cost:** 20-100x cheaper (fewer tokens) - **Limits:** No file size limits (vs 1-2MB context limits) ### vs. pandas.read_excel() - **Formulas:** Preserved (pandas only gets values) - **Formatting:** Preserved (pandas ignores) - **Multiple sheets:** Better handling (pandas requires iteration) - **Structure:** Preserves full workbook structure (pandas flattens to DataFrame) ### vs. Manual extraction - **Speed:** Automated (vs manual copy-paste) - **Accuracy:** 99%+ preservation (vs human error) - **Repeatability:** Cached (vs re-doing work) - **Scalability:** Handles 50MB files (vs manual limit ~5MB) ## Installation ### Prerequisites ```bash # Python 3.8+ python3 --version # Install dependencies pip3 install openpyxl>=3.1.0 pandas>=2.0.0 ``` ### Verify Installation ```bash # Test openpyxl python3 -c "import openpyxl; print('openpyxl available')" # Test pandas python3 -c "import pandas; print('pandas available')" ``` ## Troubleshooting ### Issue: "Module not found: openpyxl" **Solution:** ```bash pip3 install openpyxl ``` ### Issue: "Permission denied" when creating cache **Solution:** ```bash chmod 755 ~/.claude-xlsx-cache/ ``` ### Issue: Extraction very slow (>5 minutes for 10MB file) **Possible causes:** - Many formulas (evaluation takes time) - External data connections (trying to refresh) - Corrupted file (openpyxl struggling to parse) **Solution:** Use `--force` flag and check for warnings. ### Issue: High memory usage **Solution:** Process sheets one at a time instead of loading entire workbook. ## References - **openpyxl Documentation:** https://openpyxl.readthedocs.io/ - **pandas Documentation:** https://pandas.pydata.org/docs/ - **Excel file format (OOXML):** https://docs.microsoft.com/en-us/openspecs/office_standards/ ## Notes - Extraction is 100% local (no LLM calls, no API requests) - Cache is persistent across sessions - Formulas are preserved as text (not evaluated) - External references noted but not followed - Security: No macro execution, read-only access