zhongwei/gh-diegocconsolini-claudeskillcollection-docx-smart-extractor

Zhongwei Li 07c3046c05 Initial commit

2025-11-29 18:20:43 +08:00

6.7 KiB

name: docx-smart-extractor
description: Extract and analyze Word documents (1MB-50MB+) with minimal token usage through local extraction, semantic chunking by headings, and intelligent caching.
capabilities: word-extraction, table-extraction, heading-structure, token-optimization, document-analysis, policy-documents, contract-analysis, technical-reports
tools: Read, Bash
model: inherit

DOCX Smart Extractor Agent

Overview

The DOCX Smart Extractor enables efficient analysis of Word documents through local extraction, semantic chunking, and intelligent caching. Extract once, query forever.

Capabilities

Document Extraction

Complete text extraction - All paragraphs with hierarchy preservation
Table extraction - Full table structure, cells, and content
Formatting preservation - Bold, italic, fonts, colors, styles
Document metadata - Author, title, created date, modified date
Heading structure - H1, H2, H3 hierarchy for navigation
Comments and tracked changes - Full change history
Headers and footers - Page-level content
Hyperlinks - URL extraction and context
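
As a rough illustration of what local extraction involves, the sketch below reads paragraph text and style IDs directly from a .docx archive using only the Python standard library. It is not the plugin's code: the plugin relies on python-docx, and the helper name `read_paragraphs` is hypothetical. It also assumes headings carry style IDs such as `Heading1`, which is typical of English-language Word templates but not guaranteed.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraphs(docx_path):
    """Return (style_id, text) tuples, one per non-empty paragraph.

    Hypothetical helper for illustration only. style_id is None when
    the paragraph has no explicit style.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # A paragraph's text is split across runs; join every w:t node.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if text.strip():
            paragraphs.append((style_id, text))
    return paragraphs
```

Because a .docx file is a ZIP archive of XML parts, this kind of extraction needs no network access and leaves the original file untouched.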

Semantic Chunking

By heading hierarchy - Chunk at H1, H2, H3 boundaries
By paragraph groups - 10-20 paragraphs per chunk
By tables - Each table as separate chunk
Target chunk size - 500-2000 tokens
No BS metrics - Honest, verifiable features only
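
The chunking rules above can be sketched roughly as follows. This is a minimal illustration, not the logic of `semantic_chunker.py`: it assumes paragraphs arrive as `(style_id, text)` tuples, that heading styles start with `Heading`, and that about 4 characters make one token. The function names and the output shape are hypothetical, and table handling is omitted.

```python
def estimate_tokens(text):
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)


def chunk_by_headings(paragraphs, max_tokens=2000):
    """Group (style_id, text) paragraphs into chunks.

    A new chunk starts at every heading, and also whenever adding the
    next paragraph would push the current chunk past max_tokens.
    """
    chunks = []
    heading, buffer, size = None, [], 0

    for style_id, text in paragraphs:
        is_heading = bool(style_id) and style_id.startswith("Heading")
        cost = estimate_tokens(text)
        if buffer and (is_heading or size + cost > max_tokens):
            chunks.append({"heading": heading, "text": "\n".join(buffer)})
            buffer, size = [], 0
        if is_heading:
            heading = text
        buffer.append(text)
        size += cost

    if buffer:
        chunks.append({"heading": heading, "text": "\n".join(buffer)})
    return chunks
```

A section that exceeds the cap is split into several chunks that share the same heading, so a later heading lookup can still return the whole section.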

Query Capabilities

Keyword search - Fast text search across all chunks
Heading lookup - Get specific sections by heading
Table access - Direct table extraction
Document summary - Metadata and statistics
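
Keyword search and heading lookup over chunks can be pictured with the sketch below. It assumes each chunk is a dict with `heading` and `text` keys, which is a hypothetical shape chosen for illustration. The real cache layout and the internals of `query_docx.py` may differ.

```python
def search_chunks(chunks, query):
    """Return (index, heading) for each chunk whose text contains the query.

    Matching is case-insensitive substring search.
    """
    needle = query.casefold()
    return [
        (i, chunk["heading"])
        for i, chunk in enumerate(chunks)
        if needle in chunk["text"].casefold()
    ]


def find_by_heading(chunks, heading):
    """Return every chunk filed under the given heading (case-insensitive)."""
    wanted = heading.casefold()
    return [c for c in chunks if (c["heading"] or "").casefold() == wanted]
```

Only the matching chunks need to be loaded into the conversation, which is where the token savings come from.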

When to Use

Use this plugin for:

Policy documents (security, privacy, compliance)
Technical reports and documentation
Large Word documents (>1MB, >50 pages)
Documents with clear heading structure
Documents with tables and structured data
Contract review and analysis
Meeting notes and specifications

Workflow

Extract document

# Extract to cache (default)
python scripts/extract_docx.py /path/to/document.docx
# Without --output-dir, the same command then prompts interactively
# about copying the extracted files to the working directory:
#   "Copy files? (y/n)"
#   "Keep cache? (y/n)"

# Extract and copy to specific directory (no prompts)
python scripts/extract_docx.py /path/to/document.docx --output-dir ./extracted

Output: a cache key (e.g., policy_document_a8f9e2c1) to pass to the chunk and query commands

Chunk content

python scripts/semantic_chunker.py policy_document_a8f9e2c1

Query content

# Search for keyword
python scripts/query_docx.py search policy_document_a8f9e2c1 "data retention"

# Get specific heading
python scripts/query_docx.py heading policy_document_a8f9e2c1 "Security Controls"

# Get summary
python scripts/query_docx.py summary policy_document_a8f9e2c1

Token Reduction

Typical reductions:

Small documents (< 50 paragraphs): 5-10x
Medium documents (50-200 paragraphs): 10-30x
Large documents (200+ paragraphs): 30-50x
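
The reduction factor is the ratio of tokens in the full document to tokens actually loaded for a query. A minimal sketch of that arithmetic, assuming a rough 4-characters-per-token heuristic (an assumption for illustration, not a figure from the plugin) and a hypothetical helper name:

```python
def reduction_factor(full_text, loaded_texts, chars_per_token=4):
    """Estimate how many times fewer tokens are loaded than the full document holds."""
    full_tokens = len(full_text) / chars_per_token
    loaded_tokens = sum(len(t) for t in loaded_texts) / chars_per_token
    if loaded_tokens == 0:
        raise ValueError("no content loaded")
    return full_tokens / loaded_tokens
```

For example, a document of 400,000 characters from which 20,000 characters of relevant sections are loaded gives a factor of 20, which falls in the medium range listed above.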

Persistent Caching (v2.0.0 Unified System)

⚠️ IMPORTANT: Cache Location

Extracted content is stored in a user cache directory, NOT the working directory:

Cache locations by platform:

Linux/macOS: ~/.claude-cache/docx/{document_name}_{hash}/
Windows: C:\Users\{username}\.claude-cache\docx\{document_name}_{hash}\
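
Both paths follow the same pattern: the user's home directory, then `.claude-cache/docx`, then the cache key. A small sketch of resolving that location portably with `pathlib` (the function name `cache_dir` is hypothetical, not part of the plugin):

```python
from pathlib import Path


def cache_dir(cache_key, home=None):
    """Build the cache directory path for a cache key.

    home defaults to the current user's home directory; passing it
    explicitly is handy for tests.
    """
    base = Path(home) if home is not None else Path.home()
    return base / ".claude-cache" / "docx" / cache_key
```

Calling `cache_dir("policy_document_a8f9e2c1")` returns the platform-appropriate path without hard-coding separators.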

Why cache directory?

Persistent caching: Extract once, query forever - even across different projects
Cross-project reuse: Same document analyzed from different projects uses the same cache
Performance: Subsequent queries are instant (no re-extraction needed)
Token optimization: 5-50x reduction, depending on document size, by loading only relevant sections

Cache contents:

full_document.json - Complete document text with formatting
headings.json - Document heading structure
tables.json - Extracted tables
metadata.json - Document metadata
manifest.json - Cache manifest
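
To check a cache entry by hand, something like the following sketch may help. The file names come from the list above. The helper names are hypothetical, and no assumption is made about the JSON schema inside each file.

```python
import json
from pathlib import Path

# File names as listed under "Cache contents".
EXPECTED_FILES = (
    "full_document.json",
    "headings.json",
    "tables.json",
    "metadata.json",
    "manifest.json",
)


def missing_cache_files(cache_path):
    """Return the expected cache files that are absent from cache_path."""
    cache_path = Path(cache_path)
    return [name for name in EXPECTED_FILES if not (cache_path / name).is_file()]


def load_cache_file(cache_path, name):
    """Parse one cache file as UTF-8 JSON and return the resulting object."""
    with open(Path(cache_path) / name, encoding="utf-8") as fh:
        return json.load(fh)
```

An empty list from `missing_cache_files` suggests the extraction completed. Anything else is a hint to re-run `extract_docx.py`.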

Accessing cached content:

# List all cached documents
python scripts/query_docx.py list

# Query cached content
python scripts/query_docx.py search {cache_key} "your query"

# Find cache location (shown in extraction output)
# Example: ~/.claude-cache/docx/policy_document_a1b2c3d4/

If you need files in working directory:

# Option 1: Use --output-dir flag during extraction
python scripts/extract_docx.py document.docx --output-dir ./extracted

# Option 2: Copy from cache manually
cp -r ~/.claude-cache/docx/{cache_key}/* ./extracted_content/

Note: Cache is local and not meant for version control. Keep original Word files in your repo and let each developer extract locally (one-time operation).

Supported Formats

✅ .docx (Word 2007+ XML format)
✅ .docm (Macro-enabled Word documents)
❌ .doc (Legacy Word 97-2003 format - convert to .docx first)
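
A pre-flight check on the file extension can catch unsupported input before extraction starts. The sketch below mirrors the list above. The function name and return strings are hypothetical.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".docx", ".docm"}


def check_format(path):
    """Classify a file by extension: supported, needs conversion, or unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix in SUPPORTED_SUFFIXES:
        return "supported"
    if suffix == ".doc":
        return "convert to .docx first"
    return "unsupported"
```

This only inspects the file name. A renamed or password-protected file will still fail at extraction time.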

Limitations

VBA macros not extracted (design choice)
Images extracted as metadata only (position, size, alt text)
Charts not extracted (recommend screenshot approach)
Password-protected files cannot be opened
Embedded objects may not be fully extracted

Dependencies

Python >= 3.8
python-docx >= 0.8.11

Example Use Cases

Policy Document Analysis

# Extract
python scripts/extract_docx.py InfoSecPolicy.docx

# Chunk (use the cache key printed by the extract step)
python scripts/semantic_chunker.py InfoSecPolicy_a8f9e2

# Find password policy section
python scripts/query_docx.py search InfoSecPolicy_a8f9e2 "password requirements"

Contract Review

# Extract
python scripts/extract_docx.py Vendor_Contract.docx

# Get specific section (use the cache key printed by the extract step)
python scripts/query_docx.py heading Vendor_Contract_f3a8c1 "Termination Clause"

Technical Documentation

# Extract large spec document
python scripts/extract_docx.py API_Specification.docx

# Search for endpoint details (use the cache key printed by the extract step)
python scripts/query_docx.py search API_Specification_b9d2e1 "authentication endpoint"

Performance

Extraction speed: ~1-5 seconds for typical documents (1-10MB)
Chunking speed: <1 second
Query speed: <1 second
Cache reuse: Instant (no re-extraction needed)

Output Format

All output is UTF-8 encoded JSON, structured for easy parsing and analysis.
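
For readers producing compatible output of their own, one way to write UTF-8 JSON in Python is shown below. This is a generic sketch, not the plugin's serializer, and `to_json` is a hypothetical name.

```python
import json


def to_json(record):
    """Serialize a record as indented JSON, keeping non-ASCII characters readable."""
    # ensure_ascii=False emits "é" as-is instead of the escape "\u00e9".
    return json.dumps(record, ensure_ascii=False, indent=2)
```

When writing the result to disk, open the file with `encoding="utf-8"` so the output does not depend on the platform's default encoding.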

No Marketing BS

This plugin does NOT:

Claim "100% content preservation" (meaningless metric)
Use AI during extraction (all local python-docx)
Require internet connection
Modify original documents
Extract content you don't need

What it DOES:

Extract all text, tables, and formatting
Chunk by semantic boundaries (headings)
Enable fast keyword search
Cache for instant reuse
Achieve 10-50x token reduction on medium and large documents, 5-10x on small ones (verified)
