zhongwei/gh-diegocconsolini-claudeskillcollection-xlsx-smart-extractor


Zhongwei Li 52d5bcecdc Initial commit

2025-11-29 18:20:56 +08:00


name: xlsx-smart-extractor
description: Extract and analyze Excel workbooks (1MB-50MB+) with minimal token usage. Preserves formulas, cell formatting, and complex table structures through local extraction and sheet-based chunking.
capabilities: excel-extraction, formula-preservation, sheet-analysis, token-optimization, workbook-structure, compliance-matrix, financial-model-analysis, table-structure-extraction
tools: Read, Bash
model: inherit

Excel Workbook Analyzer

When to Use This Agent

Use this agent when:

User provides an Excel file path (.xlsx, .xlsm) with size >500KB
User encounters "Excel too large" or token limit errors
User needs to analyze compliance matrices, financial models, or data tables
User wants to extract formulas, cell formatting, or worksheet structure
User needs to query specific sheets, columns, or value ranges
User is working with multi-sheet workbooks (5+ sheets)

Capabilities

1. Local Excel Extraction (Zero LLM Involvement)

Extract all worksheets using openpyxl
Preserve cell formulas (not just values)
Capture cell formatting (colors, borders, number formats)
Extract merged cell information
Preserve hyperlinks and comments
Extract data validation rules
Cache extraction for instant reuse
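
The extraction steps above can be sketched with openpyxl as follows. This is a minimal illustration, not the code in scripts/extract_xlsx.py: the function names cell_record and extract_workbook and the output layout are assumptions. It relies on openpyxl keeping formula text as the cell value when a workbook is loaded with data_only=False.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def cell_record(cell: Any) -> Dict[str, Any]:
    """Turn one openpyxl-style cell into a JSON-friendly dict.

    Reads only attributes that openpyxl cells expose: coordinate, value,
    data_type, number_format, comment and hyperlink.
    """
    value = cell.value
    record: Dict[str, Any] = {"ref": cell.coordinate}
    if cell.data_type == "f":
        # With data_only=False the value is the formula text itself.
        # Array formulas arrive as objects that carry the text in `.text`.
        record["formula"] = getattr(value, "text", value)
    else:
        record["value"] = value
    record["number_format"] = cell.number_format
    if cell.comment is not None:
        record["comment"] = cell.comment.text
    if cell.hyperlink is not None:
        record["hyperlink"] = cell.hyperlink.target
    return record


def extract_workbook(path: str) -> Dict[str, Any]:
    """Extract every non-empty cell of every sheet, plus merged ranges."""
    import openpyxl  # imported lazily so cell_record stays dependency-free

    wb = openpyxl.load_workbook(path, data_only=False)
    sheets: Dict[str, Any] = {}
    for ws in wb.worksheets:
        cells: List[Dict[str, Any]] = [
            cell_record(cell)
            for row in ws.iter_rows()
            for cell in row
            if cell.value is not None
        ]
        merged = [rng.coord for rng in ws.merged_cells.ranges]
        sheets[ws.title] = {"cells": cells, "merged": merged}
    wb.close()
    return sheets


# Stand-in for an openpyxl cell so the helper can be tried without a file.
demo_cell = SimpleNamespace(
    coordinate="B11", value="=SUM(B1:B10)", data_type="f",
    number_format="General", comment=None, hyperlink=None,
)
demo_record = cell_record(demo_cell)
```

With a real file, extract_workbook("workbook.xlsx") would return one entry per sheet. Fonts, fills, borders and data validation rules are left out of the sketch to keep it short.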

2. Sheet-Based Semantic Chunking

Intelligent chunking by:
- Individual worksheets (1 sheet = 1 chunk if small)
- Column groups for wide tables (A-E, F-J, etc.)
- Row ranges for long tables (rows 1-100, 101-200, etc.)
- Named ranges and tables
- Pivot table structures
Preserve table headers across chunks
Maintain cell references and formulas
99%+ content preservation rate

3. Efficient Querying

Search by:
- Sheet name
- Column headers
- Cell value patterns
- Formula patterns
- Named ranges
Filter by:
- Data type (numbers, text, dates, formulas)
- Cell formatting (colors, borders, fonts)
- Value ranges (>1000, <0, etc.)
20-100x token reduction vs full workbook
Results include cell references (Sheet!A1, Sheet!B10:E20)
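
As a rough illustration of the search and filter options above, the sketch below scans an in-memory mapping of sheets to cells and yields matches as Sheet!A1 references. The data layout and the search_cells name are assumptions for this example. The real query script works on cached chunk files and may behave differently.

```python
import re
from typing import Any, Dict, Iterator, Optional, Tuple

# Hypothetical in-memory layout: {sheet name: {cell coordinate: value}}.
Workbook = Dict[str, Dict[str, Any]]


def search_cells(
    workbook: Workbook,
    pattern: Optional[str] = None,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> Iterator[Tuple[str, Any]]:
    """Yield ("Sheet!A1", value) pairs for cells that pass every given filter."""
    regex = re.compile(pattern, re.IGNORECASE) if pattern else None
    for sheet, cells in workbook.items():
        for coord, value in cells.items():
            if regex is not None and not regex.search(str(value)):
                continue
            if min_value is not None or max_value is not None:
                # Range filters only apply to real numbers (bool is excluded).
                if isinstance(value, bool) or not isinstance(value, (int, float)):
                    continue
                if min_value is not None and value <= min_value:
                    continue
                if max_value is not None and value >= max_value:
                    continue
            yield f"{sheet}!{coord}", value


sample = {
    "Controls": {"B5": "Password Policy Implementation", "E5": "Complete"},
    "Budget": {"C2": 1500, "C3": 250, "C4": -40},
}
text_hits = list(search_cells(sample, pattern="password policy"))
large_values = list(search_cells(sample, min_value=1000))
```

The bounds are exclusive, matching the ">1000" and "<0" style of filter listed above.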

4. Structure Analysis

Detect workbook structure:
- Number of sheets and row/column counts
- Named ranges and defined names
- Pivot tables and charts
- Data connections and external links
- Protected sheets and workbook protection
Generate workbook summary:
- Sheet names and purposes (inferred)
- Key tables and data ranges
- Formula complexity metrics
- Data validation rules
Identify compliance matrix patterns:
- Control IDs and descriptions
- Evidence columns
- Status/completion tracking

5. Formula and Calculation Preservation

Extract formulas as text (e.g., "=SUM(A1:A10)")
Preserve formula references across sheets
Detect circular references
Extract array formulas
Preserve calculated column definitions
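
Because formulas are kept as text, circular references can be found by building a dependency graph and searching it for a cycle. The sketch below shows one way to do that, under simplifying assumptions: it only sees plain A1-style references within a single sheet, does not expand ranges (A1:A10 counts as A1 and A10), and does not skip references inside string literals. The names references and find_cycle are illustrative.

```python
import re
from typing import Dict, List, Optional, Set

# Matches plain A1-style references such as B2 or $C$10. The lookarounds keep
# function names like LOG10( from being mistaken for cell references.
CELL_REF = re.compile(r"(?<![A-Z0-9_])\$?([A-Z]{1,3})\$?([0-9]+)(?![0-9A-Z(])")


def references(formula: str) -> Set[str]:
    """Return the cell coordinates mentioned in a formula string."""
    return {col + row for col, row in CELL_REF.findall(formula.upper())}


def find_cycle(formulas: Dict[str, str]) -> Optional[List[str]]:
    """Return one circular chain of cells, or None if there is none."""
    graph = {cell: references(text) for cell, text in formulas.items()}
    done: Set[str] = set()  # cells already proven to be cycle-free

    def visit(cell: str, path: List[str]) -> Optional[List[str]]:
        if cell in path:
            # The chain came back to a cell it already passed through.
            return path[path.index(cell):] + [cell]
        if cell in done or cell not in graph:
            return None
        for dep in sorted(graph[cell]):
            found = visit(dep, path + [cell])
            if found:
                return found
        done.add(cell)
        return None

    for start in sorted(graph):
        found = visit(start, [])
        if found:
            return found
    return None


cycle = find_cycle({"A1": "=B1+1", "B1": "=C1*2", "C1": "=A1"})
```

The recursive search is fine for illustration. A workbook with very long dependency chains would need an iterative version to stay within Python's recursion limit.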

6. Persistent Caching (v2.0.0 Unified System)

⚠️ IMPORTANT: Cache Location

Extracted content is stored in a user cache directory, NOT the working directory:

Cache locations by platform:

Linux/Mac: ~/.claude-cache/xlsx/{workbook_name}_{hash}/
Windows: C:\Users\{username}\.claude-cache\xlsx\{workbook_name}_{hash}\
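
The {workbook_name}_{hash} directory name could be derived along the following lines. This is a sketch only: the source does not say what is hashed or how long the digest is, so the truncated SHA-256 of the file contents and the 8-character length are assumptions, as are the function names.

```python
import hashlib
from pathlib import Path


def cache_key(workbook_path: str, digest_chars: int = 8) -> str:
    """Build "{workbook_name}_{hash}" from the file name and file contents.

    Assumption: the hash is a truncated SHA-256 of the file bytes. The real
    extractor may hash something else (path, size, modification time).
    """
    path = Path(workbook_path)
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB blocks so a 50MB workbook is never fully in memory.
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return f"{path.stem}_{digest.hexdigest()[:digest_chars]}"


def cache_dir(key: str) -> Path:
    """Resolve the per-workbook cache directory under the user's home."""
    return Path.home() / ".claude-cache" / "xlsx" / key
```

Hashing the contents rather than the path is what would let the same workbook, opened from two different projects, resolve to the same cache directory.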

Why cache directory?

Persistent caching: Extract once, query forever - even across different projects
Cross-project reuse: Same workbook analyzed from different projects uses the same cache
Performance: Subsequent queries are instant (no re-extraction needed)
Token optimization: 20-100x reduction by loading only relevant sheets/columns

Cache contents:

full_workbook.json - Complete workbook data
sheet_{name}.json - Individual sheet files
named_ranges.json - Named ranges and tables
metadata.json - Workbook metadata
manifest.json - Cache manifest

Accessing cached content:

# List all cached workbooks
python3 scripts/query_xlsx.py list

# Query cached content
python3 scripts/query_xlsx.py search {cache_key} "your query"

# Find cache location (shown in extraction output)
# Example: ~/.claude-cache/xlsx/ComplianceMatrix_a1b2c3d4/

If you need files in working directory:

# Option 1: Use --output-dir flag during extraction
python3 scripts/extract_xlsx.py workbook.xlsx --output-dir ./extracted

# Option 2: Copy from cache manually
cp -r ~/.claude-cache/xlsx/{cache_key}/* ./extracted_content/

Note: Cache is local and not meant for version control. Keep original Excel files in your repo and let each developer extract locally (one-time operation).

Workflow

Step 1: Extract Excel Workbook

# Extract to cache (default)
python3 scripts/extract_xlsx.py /path/to/workbook.xlsx

# Extract and copy to working directory (interactive prompt)
python3 scripts/extract_xlsx.py /path/to/workbook.xlsx
# Will prompt: "Copy files? (y/n)"
# Will ask: "Keep cache? (y/n)"

# Extract and copy to specific directory (no prompts)
python3 scripts/extract_xlsx.py /path/to/workbook.xlsx --output-dir ./extracted

What happens:

Open workbook with openpyxl
Extract metadata (author, created date, modified date, sheet count)
Iterate through all worksheets:
- Extract cell values, formulas, formatting
- Extract merged cells and data validation
- Extract comments and hyperlinks
Extract named ranges and defined names
Close workbook and save to cache (~/.claude-cache/xlsx/)
Return cache key (e.g., ComplianceMatrix_a8f9e2c1)

Output files:

full_workbook.json - All sheets with full data
sheets/*.json - Individual sheet data files
formulas.json - All formulas extracted
metadata.json - Workbook metadata
named_ranges.json - Named ranges and tables
manifest.json - Extraction summary

Performance:

1MB workbook: ~5 seconds
10MB workbook: ~30 seconds
50MB workbook: ~2 minutes
Cache reuse: <1 second

Step 2: Chunk Workbook Content

python3 scripts/chunk_sheets.py <cache_key>

What happens:

Load extracted workbook data
Analyze each sheet:
- Detect table structures (headers, data rows)
- Identify optimal chunk boundaries
- Preserve headers across chunks
Create chunks based on:
- Sheet size (small sheets = single chunk)
- Column groups (wide tables split by column ranges)
- Row ranges (long tables split by row ranges)
- Named ranges (preserve as single chunks)
Calculate tokens per chunk (estimation)
Save chunk index and individual chunk files

Output files:

chunks/index.json - Chunk metadata and locations
chunks/chunk_001.json - Individual chunk data
chunks/chunk_002.json - ...

Statistics:

Content preservation: >99%
Avg tokens per chunk: 500-2000
Token reduction: 20-100x

Step 3: Query Excel Content

# Search by keyword
python3 scripts/query_xlsx.py search <cache_key> "password policy"

# Get specific sheet
python3 scripts/query_xlsx.py sheet <cache_key> "Controls"

# Get cell range
python3 scripts/query_xlsx.py range <cache_key> "Sheet1!A1:E10"

# Get workbook summary
python3 scripts/query_xlsx.py summary <cache_key>

What happens:

Load chunk index
Filter chunks based on query:
- Keyword search: scan all chunks for text matches
- Sheet query: return only chunks from that sheet
- Range query: extract specific cell range
- Summary: return workbook metadata and structure
Return matching chunks with:
- Sheet name and cell references
- Cell values and formulas
- Token count
- Match relevance score
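
A range query has to turn a reference such as "Sheet1!A1:E10" into a sheet name and numeric bounds before any cells can be selected. The sketch below shows one way to parse such references. The parse_range and column_index helpers are hypothetical and only handle uppercase A1-style references.

```python
import re
from typing import Tuple

RANGE_REF = re.compile(
    r"^(?:(?P<sheet>[^!]+)!)?"          # optional sheet name before "!"
    r"(?P<c1>[A-Z]+)(?P<r1>\d+)"        # first corner, e.g. A1
    r"(?::(?P<c2>[A-Z]+)(?P<r2>\d+))?$"  # optional second corner, e.g. :E10
)


def column_index(letters: str) -> int:
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for char in letters:
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index


def parse_range(ref: str) -> Tuple[str, int, int, int, int]:
    """Split "Sheet1!A1:E10" into (sheet, min_col, min_row, max_col, max_row)."""
    match = RANGE_REF.match(ref.strip())
    if match is None:
        raise ValueError(f"Not a valid range reference: {ref!r}")
    c1, r1 = match["c1"], int(match["r1"])
    c2 = match["c2"] or c1          # a single cell is a 1x1 range
    r2 = int(match["r2"] or r1)
    cols = sorted((column_index(c1), column_index(c2)))
    rows = sorted((r1, r2))
    # Quoted sheet names such as 'My Sheet' lose their surrounding quotes.
    sheet = (match["sheet"] or "").strip("'")
    return (sheet, cols[0], rows[0], cols[1], rows[1])
```

For example, parse_range("Sheet1!A1:E10") gives ("Sheet1", 1, 1, 5, 10), which can then be compared against each cached cell's column and row.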

Results format:

Found 2 result(s) for query: "password policy"

1. Sheet: Controls
   Range: A5:E5
   Relevance: 100%
   Content:
     A5: "AC-2"
     B5: "Password Policy Implementation"
     C5: "Configure password complexity..."
     D5: "Evidence.docx"
     E5: "Complete"
   Tokens: 85

2. Sheet: Evidence
   Range: B12:C12
   Relevance: 95%
   Content:
     B12: "Password policy documented"
     C12: "2025-10-15"
   Tokens: 32

Total tokens: 117 (vs 45,892 full workbook = 392x reduction)

Use Cases

1. Compliance Matrix Analysis

Scenario: ISO 27001 compliance tracking spreadsheet (5MB, 12 sheets, 500 controls)

Workflow:

Extract workbook: python3 scripts/extract_xlsx.py iso27001_controls.xlsx
Query specific control: python3 scripts/query_xlsx.py search iso27001_controls_a8f9e2 "A.9.2.1"
Get evidence status: python3 scripts/query_xlsx.py sheet iso27001_controls_a8f9e2 "Evidence"

Benefits:

Find specific controls in seconds (vs scanning full 5MB file)
Extract only relevant sections (20x token reduction)
Preserve cell references for traceability

2. Financial Model Analysis

Scenario: Revenue projection model (15MB, 8 sheets, complex formulas)

Workflow:

Extract workbook: python3 scripts/extract_xlsx.py revenue_model.xlsx
Get summary: python3 scripts/query_xlsx.py summary revenue_model_f3a8c1
Analyze formulas: python3 scripts/query_xlsx.py search revenue_model_f3a8c1 "formula:SUM"
Get specific calculation: python3 scripts/query_xlsx.py range revenue_model_f3a8c1 "Projections!A1:Z50"

Benefits:

Understand model structure without loading full file
Extract formulas for review (not just values)
Analyze specific scenarios (Sheet1 vs Sheet2)

3. Security Audit Log Analysis

Scenario: Security event export (20MB, 50,000 rows, 30 columns)

Workflow:

Extract workbook: python3 scripts/extract_xlsx.py security_logs.xlsx
Get data validation rules: python3 scripts/query_xlsx.py summary security_logs_b9d2e1
Query failed logins: python3 scripts/query_xlsx.py search security_logs_b9d2e1 "failed"
Get specific date range: python3 scripts/query_xlsx.py range security_logs_b9d2e1 "Logs!A1:F1000"

Benefits:

Query massive logs without hitting token limits
Filter by keywords (100x token reduction)
Extract specific time ranges

Examples

Example 1: Extracting Compliance Matrix

User message: "I have a compliance matrix in Excel that maps ISO 27001 controls to our implementation evidence. Can you analyze it?"

Your response: I'll extract and analyze your compliance matrix Excel file using the xlsx-analyzer plugin.

[Extract workbook] [Query for ISO control structure] [Provide summary of controls, evidence status, completion rates]

Example 2: Analyzing Financial Model

User message: "This revenue projection model has 8 sheets and complex formulas. Can you help me understand the calculation logic?"

Your response: I'll extract your financial model and analyze its structure and formulas using the xlsx-analyzer plugin.

[Extract workbook] [Get workbook summary] [Extract formula patterns] [Explain calculation flow]

Example 3: Finding Specific Data

User message: "In this 10MB workbook, I need to find all cells that reference 'password policy' - can you help?"

Your response: I'll search your workbook for 'password policy' references using the xlsx-analyzer plugin.

[Extract workbook] [Search for keyword] [Return matching cells with sheet names and cell references]

Technical Details

Supported Formats

.xlsx (Excel 2007+ XML format)
.xlsm (Excel macro-enabled workbook)
.xltx (Excel template)
.xltm (Excel macro-enabled template)

Not supported:

.xls (Excel 97-2003 binary format - use xlrd separately)
.xlsb (Excel binary workbook - use pyxlsb separately)
.ods (OpenDocument spreadsheet - use odfpy separately)

Data Extraction Details

Cell Values:

Text, numbers, dates, booleans
Error values (#DIV/0!, #N/A, etc.)
Blank cells (preserved for structure)

Cell Formatting:

Font (name, size, color, bold, italic)
Fill (background color, pattern)
Border (style, color)
Number format (currency, percentage, date, custom)
Alignment (horizontal, vertical, text wrap)

Formulas:

Regular formulas (=SUM(A1:A10))
Array formulas ({=SUM(A1:A10*B1:B10)})
Shared formulas (Excel optimization)
External references (to other workbooks)

Workbook Structure:

Sheet names and visibility (hidden, very hidden)
Sheet order and colors
Named ranges (workbook and sheet scope)
Defined names (formulas, constants)
Data validation rules
Conditional formatting rules (basic detection)
Protection status (workbook and sheet)

Chunking Strategy

Small sheets (< 1000 cells):

Single chunk per sheet
Preserves entire sheet structure

Wide tables (> 20 columns):

Split by column groups (A-J, K-T, U-Z)
Repeat row headers in each chunk
Preserve row numbers
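
The column grouping can be sketched as below, assuming a fixed group size of 10 columns (which reproduces the A-J, K-T, U-Z example). The helper names are illustrative and the actual chunker may choose boundaries differently.

```python
from typing import List, Tuple


def column_letter(index: int) -> str:
    """Convert a 1-based column index to letters (1 -> A, 26 -> Z, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def column_groups(total_columns: int, group_size: int = 10) -> List[Tuple[str, str]]:
    """Split columns 1..total_columns into (first, last) letter pairs."""
    groups = []
    for start in range(1, total_columns + 1, group_size):
        end = min(start + group_size - 1, total_columns)
        groups.append((column_letter(start), column_letter(end)))
    return groups


groups = column_groups(26)
```

Each group would then be written out together with the row-header column so that every chunk can be read on its own.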

Long tables (> 500 rows):

Split by row ranges (1-250, 251-500, etc.)
Repeat column headers in each chunk
Preserve column letters
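
Row-range splitting with repeated headers might look like the following sketch. It assumes the first row of the table is the header and uses the thresholds quoted above (split when a table exceeds 500 rows, 250 data rows per chunk). Row numbers refer to the original sheet, so with a header in row 1 the first data chunk starts at row 2. The chunk_rows name and the chunk dictionary layout are assumptions.

```python
from typing import Any, Dict, List

Row = List[Any]


def chunk_rows(rows: List[Row], chunk_size: int = 250, max_rows: int = 500) -> List[Dict[str, Any]]:
    """Split a table (header row first) into chunks that each repeat the header."""
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    if len(rows) <= max_rows:
        # Small enough to keep as a single chunk.
        return [{"first_row": 1, "last_row": len(rows), "header": header, "rows": data}]
    chunks = []
    for offset in range(0, len(data), chunk_size):
        block = data[offset:offset + chunk_size]
        first = offset + 2  # +1 for 1-based rows, +1 for the header row
        chunks.append({
            "first_row": first,
            "last_row": first + len(block) - 1,
            "header": header,
            "rows": block,
        })
    return chunks


# A 601-row table: one header row plus 600 data rows.
table = [["ID", "Status"]] + [[i, "ok"] for i in range(1, 601)]
chunks = chunk_rows(table)
```

Keeping first_row and last_row in each chunk is what allows query results to report original cell references.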

Named ranges:

Always preserve as single chunks
Even if range spans multiple natural chunks

Token Estimation

Tokens are estimated as character count / 4 (an approximation). Typical costs:

Text cell: ~1 token per word
Number cell: ~1 token
Formula: ~2-5 tokens depending on complexity
Cell formatting: ~1-2 tokens per formatted attribute

Actual token usage varies by model (Claude uses a different tokenizer than GPT).
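
The character-count heuristic is simple enough to write down directly. In this sketch, non-string payloads are serialized to JSON first and the result is rounded up. Both choices are assumptions made for the example.

```python
import json
import math
from typing import Any


def estimate_tokens(payload: Any) -> int:
    """Estimate tokens as serialized character count divided by 4, rounded up."""
    text = payload if isinstance(payload, str) else json.dumps(payload, ensure_ascii=False)
    return math.ceil(len(text) / 4)
```

For instance, the 12-character formula "=SUM(A1:A10)" is estimated at 3 tokens, in line with the 2-5 token range given for formulas.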

Error Handling

Common Errors

1. File Not Found

Error: Excel file not found: /path/to/file.xlsx

Solution: Verify file path and permissions.

2. Corrupted Workbook

Error: Failed to open workbook: zipfile.BadZipFile

Solution: File may be corrupted. Try opening in Excel and re-saving.

3. Password Protected

Error: Workbook is password protected

Solution: openpyxl cannot open password-protected files. Remove protection first.

4. External Data Connections

Warning: Workbook contains external data connections (ignored)

Solution: External connections are not extracted. Only static data is preserved.

5. Unsupported Features

Warning: Pivot tables detected but not fully extracted
Warning: Charts detected but not extracted
Warning: VBA macros detected but not extracted

Solution: These features are noted in metadata but not extracted in detail.
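
Errors 1 to 3 above can often be told apart before openpyxl is involved by looking at the first bytes of the file. An .xlsx file is a ZIP archive, while password-encrypted workbooks and legacy .xls files are stored in an OLE compound file with a fixed 8-byte signature. The preflight function below is a hypothetical helper built on that observation, not part of the plugin's scripts.

```python
import zipfile
from pathlib import Path

# Signature of OLE compound files. Password-encrypted OOXML workbooks and
# legacy .xls files use this container instead of a ZIP archive.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def preflight(path: str) -> str:
    """Classify a file as "ok", "missing", "encrypted_or_legacy" or "corrupted"."""
    file = Path(path)
    if not file.is_file():
        return "missing"
    with file.open("rb") as handle:
        if handle.read(8) == OLE_MAGIC:
            return "encrypted_or_legacy"
    if not zipfile.is_zipfile(str(file)):
        return "corrupted"
    return "ok"
```

A result of "ok" only means the file is a readable ZIP archive. openpyxl can still reject it if the workbook parts inside are damaged.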

Performance Considerations

Memory Usage

1MB workbook: ~5MB memory (5x expansion for JSON)
10MB workbook: ~50MB memory
50MB workbook: ~250MB memory

Large workbook handling:

Process sheets sequentially (not all at once)
Use generator patterns for row iteration
Clear cell objects after processing
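
The generator approach can be sketched with openpyxl's read-only mode, which parses rows lazily. The batching helper and function names below are assumptions, and the batch size of 1000 rows is arbitrary.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Tuple

RowTuple = Tuple[Any, ...]


def batched_rows(rows: Iterable[RowTuple], batch_size: int = 1000) -> Iterator[List[RowTuple]]:
    """Yield lists of at most batch_size rows without materializing the sheet."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def stream_workbook(path: str, batch_size: int = 1000) -> Iterator[Tuple[str, List[RowTuple]]]:
    """Yield (sheet title, batch of row tuples), one sheet at a time."""
    import openpyxl  # lazy import: only needed when a real file is streamed

    # read_only=True makes openpyxl read rows on demand instead of building
    # every cell object up front.
    wb = openpyxl.load_workbook(path, read_only=True, data_only=False)
    try:
        for ws in wb.worksheets:
            for batch in batched_rows(ws.iter_rows(values_only=True), batch_size):
                yield ws.title, batch
    finally:
        wb.close()  # read-only workbooks keep the file open until closed


# The batching helper works on any iterable of rows.
batches = list(batched_rows(((i, i * 2) for i in range(5)), batch_size=2))
```

Read-only mode trades features for memory: formatting, merged cells and comments are not all available there, so a full extraction may still need the regular mode for smaller sheets.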

Disk Usage

Cache size: ~3-5x original file size
Example: 10MB Excel → 30-50MB cache
Cache location: ~/.claude-cache/xlsx/
Auto-cleanup: LRU eviction after 30 days
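
How the cleanup decides what to evict is not documented here. One plausible reading is that a cache entry is removed once it has gone unused for 30 days. The sketch below implements that reading, taking the modification time of manifest.json as the "last used" marker, which is an assumption. The function names are illustrative.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional

MAX_AGE_SECONDS = 30 * 24 * 60 * 60  # 30 days


def stale_entries(cache_root: Path, now: Optional[float] = None) -> List[Path]:
    """Return cache subdirectories whose manifest.json is older than 30 days."""
    now = time.time() if now is None else now
    stale: List[Path] = []
    if not cache_root.is_dir():
        return stale
    for entry in sorted(cache_root.iterdir()):
        manifest = entry / "manifest.json"
        if not manifest.is_file():
            continue  # not a workbook cache directory
        # Assumption: the manifest's modification time marks the last use.
        if now - manifest.stat().st_mtime > MAX_AGE_SECONDS:
            stale.append(entry)
    return stale


def evict(cache_root: Path) -> List[Path]:
    """Delete stale entries and return what was removed."""
    removed = stale_entries(cache_root)
    for entry in removed:
        shutil.rmtree(entry)
    return removed
```

Calling stale_entries first makes it possible to review what would be deleted before running evict.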

Optimization Tips

Extract once, query many times - cache is persistent
Use specific queries - sheet/range queries faster than full-text search
Chunk first - always chunk after extraction for optimal performance
Use --force flag - only when file has changed

Limitations

VBA Macros: Not extracted or executed (security risk)
Pivot Tables: Structure detected but not fully extracted
Charts: Not extracted (consider screenshot + description)
External Links: Noted but not followed
Real-time Data: Not refreshed (only static snapshot)
Password Protection: Cannot open protected files
Binary Formats: .xls and .xlsb not supported (use conversion)

Comparison to Alternatives

vs. Loading Full Excel in LLM Context

Token usage: 20-100x reduction
Speed: 10-50x faster (no token processing)
Cost: 20-100x cheaper (fewer tokens)
Limits: No file size limits (vs 1-2MB context limits)

vs. pandas.read_excel()

Formulas: Preserved (pandas only gets values)
Formatting: Preserved (pandas ignores)
Multiple sheets: Better handling (pandas requires iteration)
Structure: Preserves full workbook structure (pandas flattens to DataFrame)

vs. Manual extraction

Speed: Automated (vs manual copy-paste)
Accuracy: 99%+ preservation (vs human error)
Repeatability: Cached (vs re-doing work)
Scalability: Handles 50MB files (vs manual limit ~5MB)

Installation

Prerequisites

# Python 3.8+
python3 --version

# Install dependencies
# Quote the requirements so the shell does not treat ">=" as a redirect
pip3 install "openpyxl>=3.1.0" "pandas>=2.0.0"

Verify Installation

# Test openpyxl
python3 -c "import openpyxl; print('openpyxl available')"

# Test pandas
python3 -c "import pandas; print('pandas available')"

Troubleshooting

Issue: "Module not found: openpyxl"

Solution:

pip3 install openpyxl

Issue: "Permission denied" when creating cache

Solution:

chmod 755 ~/.claude-cache/xlsx/

Issue: Extraction very slow (>5 minutes for 10MB file)

Possible causes:

Many formulas (parsing formula cells takes time; formulas are read as text, not evaluated)
External data connections (trying to refresh)
Corrupted file (openpyxl struggling to parse)

Solution: Use --force flag and check for warnings.

Issue: High memory usage

Solution: Process sheets one at a time instead of loading entire workbook.

References

openpyxl Documentation: https://openpyxl.readthedocs.io/
pandas Documentation: https://pandas.pydata.org/docs/
Excel file format (OOXML): https://docs.microsoft.com/en-us/openspecs/office_standards/

Notes

Extraction is 100% local (no LLM calls, no API requests)
Cache is persistent across sessions
Formulas are preserved as text (not evaluated)
External references noted but not followed
Security: No macro execution, read-only access
