gh-diegocconsolini-claudesk…/agents/docx-smart-extractor.md

---
name: docx-smart-extractor
description: Extract and analyze Word documents (1MB-50MB+) with minimal token usage through local extraction, semantic chunking by headings, and intelligent caching.
capabilities: ["word-extraction", "table-extraction", "heading-structure", "token-optimization", "document-analysis", "policy-documents", "contract-analysis", "technical-reports"]
tools: Read, Bash
model: inherit
---

# DOCX Smart Extractor Agent

## Overview

The DOCX Smart Extractor enables efficient analysis of Word documents through local extraction, semantic chunking, and intelligent caching. Extract once, query forever.

## Capabilities

### Document Extraction
- **Complete text extraction** - All paragraphs with hierarchy preservation
- **Table extraction** - Full table structure, cells, and content
- **Formatting preservation** - Bold, italic, fonts, colors, styles
- **Document metadata** - Author, title, created date, modified date
- **Heading structure** - H1, H2, H3 hierarchy for navigation
- **Comments and tracked changes** - Full change history
- **Headers and footers** - Page-level content
- **Hyperlinks** - URL extraction and context

### Semantic Chunking
- **By heading hierarchy** - Chunk at H1, H2, H3 boundaries
- **By paragraph groups** - 10-20 paragraphs per chunk
- **By tables** - Each table as separate chunk
- **Target chunk size** - 500-2000 tokens
- **No BS metrics** - Honest, verifiable features only

### Query Capabilities
- **Keyword search** - Fast text search across all chunks
- **Heading lookup** - Get specific sections by heading
- **Table access** - Direct table extraction
- **Document summary** - Metadata and statistics

## When to Use

Use this plugin for:
- Policy documents (security, privacy, compliance)
- Technical reports and documentation
- Large Word documents (>1MB, >50 pages)
- Documents with clear heading structure
- Documents with tables and structured data
- Contract review and analysis
- Meeting notes and specifications

## Workflow

1. **Extract document**
   ```bash
   # Extract to cache (default)
   python scripts/extract_docx.py /path/to/document.docx

   # Extract and copy to working directory (interactive prompt)
   python scripts/extract_docx.py /path/to/document.docx
   # Will prompt: "Copy files? (y/n)"
   # Will ask: "Keep cache? (y/n)"

   # Extract and copy to specific directory (no prompts)
   python scripts/extract_docx.py /path/to/document.docx --output-dir ./extracted
   ```
   Output: Cache key (e.g., `policy_document_a8f9e2c1`)

2. **Chunk content**
   ```bash
   python scripts/semantic_chunker.py policy_document_a8f9e2c1
   ```

3. **Query content**
   ```bash
   # Search for keyword
   python scripts/query_docx.py search policy_document_a8f9e2c1 "data retention"

   # Get specific heading
   python scripts/query_docx.py heading policy_document_a8f9e2c1 "Security Controls"

   # Get summary
   python scripts/query_docx.py summary policy_document_a8f9e2c1
   ```

## Token Reduction

Typical reductions:
- Small documents (< 50 paragraphs): 5-10x
- Medium documents (50-200 paragraphs): 10-30x
- Large documents (200+ paragraphs): 30-50x

## Persistent Caching (v2.0.0 Unified System)

**⚠️ IMPORTANT: Cache Location**

Extracted content is stored in a **user cache directory**, NOT the working directory:

**Cache locations by platform:**
- **Linux/Mac:** `~/.claude-cache/docx/{document_name}_{hash}/`
- **Windows:** `C:\Users\{username}\.claude-cache\docx\{document_name}_{hash}\`

**Why cache directory?**
1. **Persistent caching:** Extract once, query forever - even across different projects
2. **Cross-project reuse:** Same document analyzed from different projects uses the same cache
3. **Performance:** Subsequent queries are instant (no re-extraction needed)
4. **Token optimization:** 10-50x reduction by loading only relevant sections

**Cache contents:**
- `full_document.json` - Complete document text with formatting
- `headings.json` - Document heading structure
- `tables.json` - Extracted tables
- `metadata.json` - Document metadata
- `manifest.json` - Cache manifest

**Accessing cached content:**
```bash
# List all cached documents
python scripts/query_docx.py list

# Query cached content
python scripts/query_docx.py search {cache_key} "your query"

# Find cache location (shown in extraction output)
# Example: ~/.claude-cache/docx/policy_document_a1b2c3d4/
```

**If you need files in working directory:**
```bash
# Option 1: Use --output-dir flag during extraction
python scripts/extract_docx.py document.docx --output-dir ./extracted

# Option 2: Copy from cache manually
cp -r ~/.claude-cache/docx/{cache_key}/* ./extracted_content/
```

**Note:** Cache is local and not meant for version control. Keep original Word files in your repo and let each developer extract locally (one-time operation).

## Supported Formats

- ✅ .docx (Word 2007+ XML format)
- ✅ .docm (Macro-enabled Word documents)
- ❌ .doc (Legacy Word 97-2003 format - convert to .docx first)

## Limitations

- VBA macros not extracted (design choice)
- Images extracted as metadata only (position, size, alt text)
- Charts not extracted (recommend screenshot approach)
- Password-protected files cannot be opened
- Embedded objects may not be fully extracted

## Dependencies

- Python >= 3.8
- python-docx >= 0.8.11

## Example Use Cases

### Policy Document Analysis
```bash
# Extract
python scripts/extract_docx.py InfoSecPolicy.docx

# Chunk
python scripts/semantic_chunker.py InfoSecPolicy_a8f9e2

# Find password policy section
python scripts/query_docx.py search InfoSecPolicy_a8f9e2 "password requirements"
```

### Contract Review
```bash
# Extract
python scripts/extract_docx.py Vendor_Contract.docx

# Get specific section
python scripts/query_docx.py heading Vendor_Contract_f3a8c1 "Termination Clause"
```

### Technical Documentation
```bash
# Extract large spec document
python scripts/extract_docx.py API_Specification.docx

# Search for endpoint details
python scripts/query_docx.py search API_Specification_b9d2e1 "authentication endpoint"
```

## Performance

- **Extraction speed**: ~1-5 seconds for typical documents (1-10MB)
- **Chunking speed**: <1 second
- **Query speed**: <1 second
- **Cache reuse**: Instant (no re-extraction needed)

## Output Format

All output is JSON with UTF-8 encoding. Structured for easy parsing and analysis.

## No Marketing BS

This plugin does NOT:
- Claim "100% content preservation" (meaningless metric)
- Use AI during extraction (all local python-docx)
- Require internet connection
- Modify original documents
- Extract content you don't need

What it DOES:
- Extract all text, tables, and formatting
- Chunk by semantic boundaries (headings)
- Enable fast keyword search
- Cache for instant reuse
- Achieve 10-50x token reduction (verified)