---
name: docx-smart-extractor
description: Extract and analyze Word documents (1MB-50MB+) with minimal token usage through local extraction, semantic chunking by headings, and intelligent caching.
tools: Read, Bash
model: inherit
---
# DOCX Smart Extractor Agent

## Overview

The DOCX Smart Extractor enables efficient analysis of Word documents through local extraction, semantic chunking, and intelligent caching. Extract once, query forever.

## Capabilities

### Document Extraction
- Complete text extraction - All paragraphs with hierarchy preservation
- Table extraction - Full table structure, cells, and content
- Formatting preservation - Bold, italic, fonts, colors, styles
- Document metadata - Author, title, created date, modified date
- Heading structure - H1, H2, H3 hierarchy for navigation
- Comments and tracked changes - Full change history
- Headers and footers - Page-level content
- Hyperlinks - URL extraction and context
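
All of this runs locally on top of python-docx. As a rough illustration of the kind of extraction involved, here is a minimal sketch; the output field names are illustrative, not the plugin's actual cache schema:

```python
# Minimal extraction sketch using python-docx. Output keys are illustrative,
# not the plugin's actual cache schema.
import json
from docx import Document

def extract(path: str) -> dict:
    doc = Document(path)
    props = doc.core_properties
    return {
        "metadata": {
            "author": props.author,
            "title": props.title,
            "created": str(props.created),
            "modified": str(props.modified),
        },
        # Every paragraph with its style name, so hierarchy survives
        "paragraphs": [{"text": p.text, "style": p.style.name} for p in doc.paragraphs],
        # Headings pulled out separately for navigation
        "headings": [p.text for p in doc.paragraphs if p.style.name.startswith("Heading")],
        # Tables as row-major lists of cell text
        "tables": [[[c.text for c in row.cells] for row in t.rows] for t in doc.tables],
    }

if __name__ == "__main__":
    print(json.dumps(extract("document.docx"), indent=2, ensure_ascii=False))
```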
### Semantic Chunking
- By heading hierarchy - Chunk at H1, H2, H3 boundaries
- By paragraph groups - 10-20 paragraphs per chunk
- By tables - Each table as separate chunk
- Target chunk size - 500-2000 tokens
- No BS metrics - Honest, verifiable features only
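
The chunking rule is simple enough to sketch. Assuming paragraphs carry their style names (as in the extraction sketch above), chunk boundaries fall at headings, with the token budget as a backstop; the 4-characters-per-token estimate below is a stand-in, not the plugin's actual tokenizer:

```python
# Heading-boundary chunking with a rough token budget (sketch).

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic, not a real tokenizer

def chunk_by_headings(paragraphs: list[dict], max_tokens: int = 2000) -> list[str]:
    """paragraphs: [{"text": ..., "style": ...}, ...] as in the extraction sketch."""
    chunks, current, tokens = [], [], 0
    for para in paragraphs:
        starts_section = para["style"].startswith("Heading")
        # Close the current chunk at a heading boundary or when over budget
        if current and (starts_section or tokens >= max_tokens):
            chunks.append("\n".join(current))
            current, tokens = [], 0
        current.append(para["text"])
        tokens += estimate_tokens(para["text"])
    if current:
        chunks.append("\n".join(current))
    return chunks
```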
### Query Capabilities
- Keyword search - Fast text search across all chunks
- Heading lookup - Get specific sections by heading
- Table access - Direct table extraction
- Document summary - Metadata and statistics
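
Keyword search is plain substring matching over the cached chunks, with no AI in the loop. A sketch, assuming chunks were saved to a hypothetical `chunks.json` (the real cache layout is described under Persistent Caching below):

```python
# Case-insensitive keyword search over cached chunks (sketch).
# "chunks.json" is a hypothetical file name, not the documented cache layout.
import json
from pathlib import Path

def search_chunks(cache_dir: str, keyword: str) -> list[tuple[int, str]]:
    path = Path(cache_dir).expanduser() / "chunks.json"
    chunks = json.loads(path.read_text(encoding="utf-8"))
    needle = keyword.lower()
    return [(i, c) for i, c in enumerate(chunks) if needle in c.lower()]

for i, chunk in search_chunks("~/.claude-cache/docx/policy_document_a8f9e2c1",
                              "data retention"):
    print(f"--- chunk {i} ---\n{chunk[:200]}")
```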
## When to Use
Use this plugin for:
- Policy documents (security, privacy, compliance)
- Technical reports and documentation
- Large Word documents (>1MB, >50 pages)
- Documents with clear heading structure
- Documents with tables and structured data
- Contract review and analysis
- Meeting notes and specifications
## Workflow

1. **Extract document**

   ```bash
   # Extract to cache (default)
   python scripts/extract_docx.py /path/to/document.docx

   # Extract and copy to working directory (interactive prompt)
   python scripts/extract_docx.py /path/to/document.docx
   # Will prompt: "Copy files? (y/n)"
   # Will ask: "Keep cache? (y/n)"

   # Extract and copy to a specific directory (no prompts)
   python scripts/extract_docx.py /path/to/document.docx --output-dir ./extracted
   ```

   Output: a cache key (e.g., `policy_document_a8f9e2c1`)

2. **Chunk content**

   ```bash
   python scripts/semantic_chunker.py policy_document_a8f9e2c1
   ```

3. **Query content**

   ```bash
   # Search for a keyword
   python scripts/query_docx.py search policy_document_a8f9e2c1 "data retention"

   # Get a specific heading
   python scripts/query_docx.py heading policy_document_a8f9e2c1 "Security Controls"

   # Get a summary
   python scripts/query_docx.py summary policy_document_a8f9e2c1
   ```
## Token Reduction
Typical reductions:
- Small documents (< 50 paragraphs): 5-10x
- Medium documents (50-200 paragraphs): 10-30x
- Large documents (200+ paragraphs): 30-50x
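
To put numbers on that: if a 300-paragraph document comes to roughly 45,000 tokens, answering a question from one relevant ~1,500-token chunk is about a 30x reduction, which is where the large-document range above comes from (illustrative arithmetic, not a benchmark).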
## Persistent Caching (v2.0.0 Unified System)

### ⚠️ IMPORTANT: Cache Location

Extracted content is stored in a per-user cache directory, NOT the working directory.
**Cache locations by platform:**

- Linux/Mac: `~/.claude-cache/docx/{document_name}_{hash}/`
- Windows: `C:\Users\{username}\.claude-cache\docx\{document_name}_{hash}\`
**Why a cache directory?**

- Persistent caching: extract once, query forever, even across different projects
- Cross-project reuse: the same document analyzed from different projects uses the same cache
- Performance: subsequent queries are instant (no re-extraction needed)
- Token optimization: 10-50x reduction by loading only the relevant sections
**Cache contents:**

- `full_document.json` - Complete document text with formatting
- `headings.json` - Document heading structure
- `tables.json` - Extracted tables
- `metadata.json` - Document metadata
- `manifest.json` - Cache manifest
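
If you want to work with the cache programmatically rather than through `query_docx.py`, the JSON files can be loaded directly. A sketch, with the caveat that the JSON schemas are not documented here, so inspect the files before relying on specific fields:

```python
# Loading cache files directly (sketch). Schemas are undocumented here;
# print the loaded objects to discover their actual structure.
import json
from pathlib import Path

cache = Path.home() / ".claude-cache" / "docx" / "policy_document_a1b2c3d4"

metadata = json.loads((cache / "metadata.json").read_text(encoding="utf-8"))
headings = json.loads((cache / "headings.json").read_text(encoding="utf-8"))
tables = json.loads((cache / "tables.json").read_text(encoding="utf-8"))

print(metadata)      # author, title, dates, ...
print(len(tables))   # number of extracted tables
```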
**Accessing cached content:**

```bash
# List all cached documents
python scripts/query_docx.py list

# Query cached content
python scripts/query_docx.py search {cache_key} "your query"

# Find the cache location (shown in extraction output)
# Example: ~/.claude-cache/docx/policy_document_a1b2c3d4/
```
**If you need files in the working directory:**

```bash
# Option 1: Use the --output-dir flag during extraction
python scripts/extract_docx.py document.docx --output-dir ./extracted

# Option 2: Copy from the cache manually
cp -r ~/.claude-cache/docx/{cache_key}/* ./extracted_content/
```
**Note:** The cache is local and not meant for version control. Keep original Word files in your repo and let each developer extract locally (a one-time operation).
## Supported Formats
- ✅ .docx (Word 2007+ XML format)
- ✅ .docm (Macro-enabled Word documents)
- ❌ .doc (Legacy Word 97-2003 format - convert to .docx first)
## Limitations
- VBA macros not extracted (design choice)
- Images extracted as metadata only (position, size, alt text)
- Charts not extracted (recommend screenshot approach)
- Password-protected files cannot be opened
- Embedded objects may not be fully extracted
## Dependencies
- Python >= 3.8
- python-docx >= 0.8.11
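
python-docx installs with `pip install python-docx`; the scripts assume it is importable in whatever environment runs them.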
## Example Use Cases

### Policy Document Analysis

```bash
# Extract
python scripts/extract_docx.py InfoSecPolicy.docx

# Chunk
python scripts/semantic_chunker.py InfoSecPolicy_a8f9e2

# Find the password policy section
python scripts/query_docx.py search InfoSecPolicy_a8f9e2 "password requirements"
```

### Contract Review

```bash
# Extract
python scripts/extract_docx.py Vendor_Contract.docx

# Get a specific section
python scripts/query_docx.py heading Vendor_Contract_f3a8c1 "Termination Clause"
```

### Technical Documentation

```bash
# Extract a large spec document
python scripts/extract_docx.py API_Specification.docx

# Search for endpoint details
python scripts/query_docx.py search API_Specification_b9d2e1 "authentication endpoint"
```
## Performance
- Extraction speed: ~1-5 seconds for typical documents (1-10MB)
- Chunking speed: <1 second
- Query speed: <1 second
- Cache reuse: Instant (no re-extraction needed)
## Output Format

All output is UTF-8 encoded JSON, structured for easy parsing and analysis.
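
That makes the scripts easy to drive from other tooling. A sketch, assuming `query_docx.py` writes a single JSON document to stdout (verify the exact output shape for your version):

```python
# Consuming query output from another program (sketch).
# Assumes the script prints one JSON document to stdout.
import json
import subprocess

result = subprocess.run(
    ["python", "scripts/query_docx.py", "summary", "policy_document_a8f9e2c1"],
    capture_output=True, text=True, check=True,
)
summary = json.loads(result.stdout)
print(summary)
```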
## No Marketing BS
This plugin does NOT:
- Claim "100% content preservation" (meaningless metric)
- Use AI during extraction (all local python-docx)
- Require internet connection
- Modify original documents
- Extract content you don't need
What it DOES:
- Extract all text, tables, and formatting
- Chunk by semantic boundaries (headings)
- Enable fast keyword search
- Cache for instant reuse
- Achieve 10-50x token reduction (verified)