--- name: pdf-smart-extractor description: Extract and analyze large PDFs (3MB-10MB+) with minimal token usage. Preserves 100% of content while achieving 12-103x token reduction through local extraction, semantic chunking, and intelligent caching. capabilities: ["pdf-extraction", "semantic-chunking", "token-optimization", "large-document-analysis", "technical-documentation", "compliance-framework-analysis"] tools: Read, Bash model: inherit --- # PDF Smart Extractor **Extract and analyze large PDFs with minimal token usage** --- ## 🚨 AUTO-TRIGGER RULES (Read This First!) **MANDATORY: You MUST use this skill when:** 1. **File Size Check:** User provides a `.pdf` file path - Run: `ls -lh ` or `stat -f%z ` (macOS) / `stat -c%s ` (Linux) - If size >1MB (1048576 bytes) → **AUTOMATICALLY suggest this plugin** - Do NOT attempt to read large PDFs directly 2. **Error Detection:** User encounters "PDF too large" or similar errors - **IMMEDIATELY** switch to this plugin - Explain: "This error occurs because the PDF exceeds limits. I'll use PDF Smart Extractor to solve this." 3. **Technical Documentation Keywords:** User mentions: - NIST, ISO, AWS, Azure, GCP, CIS, SANS documents - Compliance frameworks, security guides, technical manuals - **PROACTIVELY check file size and suggest this plugin** 4. **Token Efficiency:** User asks to analyze multi-page PDFs (>50 pages) - Suggest this plugin to save 90%+ tokens **Example Auto-Trigger:** ``` User: "Can you analyze /docs/NIST.SP.800-161r1-upd1.pdf?" You: [First, check file size] $ ls -lh /docs/NIST.SP.800-161r1-upd1.pdf -rw-r--r-- 1 user staff 3.3M Oct 20 10:30 /docs/NIST.SP.800-161r1-upd1.pdf [Detect: 3.3MB > 1MB → Auto-trigger plugin] "I notice this PDF is 3.3 MB. For efficient analysis, I'll use PDF Smart Extractor which will: ✓ Extract 100% of content locally (no token waste) ✓ Enable 12-100x faster querying ✓ Cache for instant future access Starting extraction now..." ``` --- ## When to Use This Skill **IMPORTANT: This skill should be automatically suggested when:** - User provides a PDF file path and the file is >1MB - User mentions "PDF too large" or "can't read PDF" - User tries to analyze technical documentation (NIST, ISO, AWS, Azure, GCP guides) - User mentions token limits or context window issues with PDFs - User wants to extract or analyze multi-page PDFs (>50 pages) Use this skill when users need to: - Extract content from large PDF documents (>1MB, >50 pages) - Analyze PDFs that exceed LLM context windows - Query specific sections of technical documents (NIST, ISO, AWS guides, etc.) - Preserve 100% of PDF content while minimizing token consumption - Build knowledge bases from PDF documentation - Search PDFs for specific topics or keywords - Overcome "PDF too large" errors **Trigger phrases (explicit):** - "extract this PDF" - "analyze [PDF file]" - "search [PDF] for [topic]" - "what does [PDF] say about [topic]" - "chunk this large PDF" - "process NIST document" - "read this PDF: /path/to/file.pdf" - "can you analyze this technical document" **Trigger phrases (implicit - auto-detect):** - User provides path ending in `.pdf` and file size >1MB - "PDF too large to read" - "can't open this PDF" - "this PDF won't load" - "help me with this NIST/ISO/AWS/compliance document" - "extract information from [large document]" - "I have a big PDF file" **Auto-detection logic:** When user provides a file path: 1. Check if file extension is `.pdf` 2. Check file size using `ls -lh` or `stat` 3. If size >1MB, proactively suggest: "This PDF is X MB. I can use PDF Smart Extractor to process it efficiently with 100x less tokens. Would you like me to extract and chunk it?" ## Core Capabilities ### 1. Local PDF Extraction (Zero LLM Involvement) - Extracts 100% of PDF content using PyMuPDF - No LLM calls during extraction - fully local processing - Preserves metadata, table of contents, and document structure - Caches extracted content for instant reuse ### 2. Semantic Chunking - Splits text at intelligent boundaries (chapters, sections, paragraphs) - Preserves context and meaning across chunks - Target chunk size: ~2000 tokens (configurable) - 100% content preservation guaranteed ### 3. Efficient Querying - Search chunks by keywords or topics - Load only relevant chunks (12-25x token reduction) - Ranked results by relevance - Combine multiple chunks as needed ### 4. Persistent Caching - One-time extraction per PDF - Instant access to cached content - File hash verification for integrity - Automatic cache management **⚠️ IMPORTANT: Cache Location** Extracted content is stored in a **user cache directory**, NOT the working directory: **Cache locations by platform:** - **Linux/Mac:** `~/.claude-cache/pdf/{pdf_name}_{hash}/` - **Windows:** `C:\Users\{username}\.claude-pdf-cache\{pdf_name}_{hash}\` **Why cache directory?** 1. **Persistent caching:** Extract once, query forever - even across different projects 2. **Cross-project reuse:** Same PDF analyzed from different projects uses the same cache 3. **Performance:** Subsequent queries are instant (no re-extraction needed) 4. **Token optimization:** 12-115x reduction by loading only relevant chunks **Cache contents:** - `full_text.txt` - Complete extracted text - `pages.json` - Page-by-page content - `metadata.json` - PDF metadata - `toc.json` - Table of contents - `manifest.json` - Cache manifest **Accessing cached content:** ```bash # List all cached PDFs python scripts/query_pdf.py list # Query cached content python scripts/query_pdf.py search {cache_key} "your query" # Find cache location (shown in extraction output) # Example: ~/.claude-cache/pdf/document_a1b2c3d4/ ``` **If you need files in working directory:** ```bash # Option 1: Use --output-dir flag during extraction python scripts/extract_pdf.py document.pdf --output-dir ./extracted # Option 2: Copy from cache manually cp -r ~/.claude-cache/pdf/{cache_key}/* ./extracted_content/ ``` **Note:** Cache is local and not meant for version control. Keep original PDFs in your repo and let each developer extract locally (one-time operation). ## Workflow ### Phase 1: Extract PDF (One-Time Setup) ```bash # Extract to cache (default) python scripts/extract_pdf.py /path/to/document.pdf # Extract and copy to working directory (interactive prompt) python scripts/extract_pdf.py /path/to/document.pdf # Will prompt: "Copy files? (y/n)" # Will ask: "Keep cache? (y/n)" # Extract and copy to specific directory (no prompts) python scripts/extract_pdf.py /path/to/document.pdf --output-dir ./extracted ``` **What happens:** - Reads entire PDF locally - Extracts text, metadata, table of contents - Saves to `~/.claude-pdf-cache/{cache_key}/` - Returns cache key for future queries **Output:** - `full_text.txt` - Complete document text - `pages.json` - Structured page data - `metadata.json` - PDF metadata - `toc.json` - Table of contents (if available) - `manifest.json` - Extraction statistics ### Phase 2: Chunk Content (Semantic Organization) ```python python scripts/semantic_chunker.py {cache_key} ``` **What happens:** - Detects semantic boundaries (chapters, sections, paragraphs) - Splits text at intelligent boundaries - Creates ~2000 token chunks - Preserves 100% of content **Output:** - `chunks.json` - Chunk index with metadata - `chunks/chunk_0000.txt` - Individual chunk files - Statistics: total chunks, token distribution, preservation rate ### Phase 3: Query Content (Efficient Retrieval) ```python python scripts/query_pdf.py search {cache_key} "supply chain security" ``` **What happens:** - Searches chunk index for relevant content - Ranks results by relevance - Returns only matching chunks - Displays token counts for transparency **Output:** - List of matching chunks with previews - Relevance scores - Total tokens required (vs. full document) ## Usage Examples ### Example 1: Large NIST Document **User Request:** "Extract and analyze NIST SP 800-161r1 for supply chain incident response procedures" **Your Workflow:** 1. **Extract PDF (one-time):** ```bash python scripts/extract_pdf.py /path/to/NIST.SP.800-161r1-upd1.pdf ``` Output: `Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4e5f6` 2. **Chunk content:** ```bash python scripts/semantic_chunker.py NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 ``` Output: Created 87 chunks, 98.7% content preservation 3. **Search for relevant sections:** ```bash python scripts/query_pdf.py search NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 "supply chain incident response" ``` Output: - Chunk 23 - "Supply Chain Risk Management" (relevance: 87%, 1,850 tokens) - Chunk 45 - "Incident Response in C-SCRM" (relevance: 72%, 2,010 tokens) - Total: 3,860 tokens (vs. 48,000 for full document = 12.4x reduction) 4. **Retrieve specific chunks:** ```bash python scripts/query_pdf.py get NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 23 ``` Output: Full content of chunk 23 5. **Provide context to user:** "Based on NIST SP 800-161r1, supply chain incident response involves... [use chunk content]" ### Example 2: Multiple Related Queries **User Request:** "I need to understand OT security incidents from NIST SP 800-82r3" **Your Workflow:** 1. **Extract (one-time):** ```bash python scripts/extract_pdf.py /path/to/NIST.SP.800-82r3.pdf ``` 2. **Chunk:** ```bash python scripts/semantic_chunker.py NIST.SP.800-82r3_x7y8z9 ``` 3. **First query - Overview:** ```bash python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "OT security overview" ``` 4. **Second query - Incidents:** ```bash python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "incident response ICS" ``` 5. **Third query - Specific threat:** ```bash python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "ransomware operational technology" ``` **Result:** Each query loads only relevant chunks (~2-4 chunks, ~5,000 tokens) instead of entire 8.2MB document (120,000+ tokens) ### Example 3: Table of Contents Navigation **User Request:** "Show me the structure of this AWS security guide" **Your Workflow:** 1. **Extract PDF:** ```bash python scripts/extract_pdf.py aws-security-guide.pdf ``` 2. **Get TOC:** ```bash python scripts/query_pdf.py toc aws-security-guide_abc123 ``` Output: ``` Chapter 1: Introduction (page 1) 1.1 Security Fundamentals (page 3) 1.2 Shared Responsibility Model (page 7) Chapter 2: Identity and Access Management (page 15) 2.1 IAM Best Practices (page 17) ... ``` 3. **Navigate to specific section:** Based on TOC, identify relevant chunk IDs and retrieve specific content. ## Important Guidelines ### Content Preservation - **ALWAYS preserve 100% of PDF content** - no summarization during extraction - Verify preservation rate in chunking statistics (should be >99.5%) - If preservation rate is low, investigate boundary detection issues ### Token Efficiency - **Target 12-25x token reduction** compared to loading full PDF - Search before loading - don't load chunks blindly - Combine related chunks when context requires it - Show token counts to user for transparency ### Cache Management - Cache key format: `{pdf_name}_{hash}` - Cache location: `~/.claude-pdf-cache/` - Reuse cached extractions - don't re-extract unnecessarily - Use `--force` flag only when PDF has been modified ### Error Handling - If extraction fails, check PDF encryption status - If chunking produces few chunks, document may lack structure - If search returns no results, try broader keywords - If preservation rate < 99%, review boundary detection ## Command Reference ### Extract PDF ```bash python scripts/extract_pdf.py [--force] ``` - `pdf_path`: Path to PDF file - `--force`: Re-extract even if cached ### Chunk Text ```bash python scripts/semantic_chunker.py [--target-size TOKENS] ``` - `cache_key`: Cache key from extraction - `--target-size`: Target tokens per chunk (default: 2000) ### List Cached PDFs ```bash python scripts/query_pdf.py list ``` ### Search Chunks ```bash python scripts/query_pdf.py search ``` - `cache_key`: PDF cache key - `query`: Keywords or phrase to search ### Get Specific Chunk ```bash python scripts/query_pdf.py get ``` - `chunk_id`: Chunk number to retrieve ### View Statistics ```bash python scripts/query_pdf.py stats ``` ### View Table of Contents ```bash python scripts/query_pdf.py toc ``` ## Performance Metrics ### Real-World Performance **NIST SP 800-161r1-upd1 (3.3 MB, 155 pages):** - Extraction: ~45 seconds - Chunking: ~8 seconds - Full document tokens: ~48,000 - Average query result: ~3,500 tokens - Token reduction: 13.7x **NIST SP 800-82r3 (8.2 MB, 247 pages):** - Extraction: ~90 seconds - Chunking: ~15 seconds - Full document tokens: ~124,000 - Average query result: ~5,200 tokens - Token reduction: 23.8x ### Content Preservation Verification All extractions maintain >99.5% content preservation rate: - Original characters = Sum of all chunk characters - No content lost during chunking - Semantic boundaries preserve context ## Technical Architecture ### Extraction Layer (extract_pdf.py) - **PyMuPDF (pymupdf)** - Fast, reliable PDF parsing - Handles encrypted PDFs, complex layouts, embedded images - Extracts text, metadata, TOC, page structure - File hashing for cache validation ### Chunking Layer (semantic_chunker.py) - **Semantic boundary detection** - Regex patterns for structure - **Intelligent splitting** - Respects chapters, sections, paragraphs - **Size balancing** - Splits large chunks, combines small ones - **Content preservation** - Mathematical verification ### Query Layer (query_pdf.py) - **Keyword search** - Multi-term matching with relevance scoring - **Context preservation** - Shows match previews - **Efficient retrieval** - Loads only required chunks - **Statistics tracking** - Token usage transparency ## Integration with Other Skills ### Incident Response Playbook Creator Use PDF Smart Extractor to: - Extract NIST SP 800-61r3 sections on-demand - Query specific incident types (ransomware, DDoS, etc.) - Reduce token usage for playbook generation ### Cybersecurity Policy Generator Use PDF Smart Extractor to: - Extract compliance framework requirements (ISO 27001, SOC 2) - Query specific control families - Reference authoritative sources efficiently ### Research and Analysis Tasks Use PDF Smart Extractor to: - Build knowledge bases from technical documentation - Compare multiple PDF sources - Extract specific sections for citation ## Limitations and Considerations ### What This Skill Does - ✅ Extracts 100% of PDF text content - ✅ Preserves document structure and metadata - ✅ Enables efficient querying with minimal tokens - ✅ Caches for instant reuse - ✅ Works offline (extraction is local) ### What This Skill Does NOT Do - ❌ OCR for scanned PDFs (text must be extractable) - ❌ Image analysis (focuses on text content) - ❌ PDF creation or modification - ❌ Real-time collaboration or annotation - ❌ Automatic summarization (preserves full content) ### Dependencies - Python 3.8+ - PyMuPDF (pymupdf): `pip install pymupdf` - Standard library only (json, re, pathlib, hashlib) ## Success Criteria A successful PDF extraction and query session should: 1. **Preserve 100% of content** (preservation rate >99.5%) 2. **Achieve 12-25x token reduction** for typical queries 3. **Complete extraction** in <2 minutes for documents <10MB 4. **Return relevant results** with clear relevance scoring 5. **Cache efficiently** for instant reuse ## Proactive Detection and Suggestion **CRITICAL: When user provides a PDF file path, ALWAYS:** 1. **Check file size first:** ```bash ls -lh /path/to/file.pdf # or stat -f%z /path/to/file.pdf # macOS stat -c%s /path/to/file.pdf # Linux ``` 2. **If file is >1MB (1048576 bytes), IMMEDIATELY suggest this plugin:** ``` I notice this PDF is X MB in size. For large PDFs, I recommend using the PDF Smart Extractor plugin which can: - Extract 100% of content locally (no token usage for extraction) - Enable querying with 12-100x token reduction - Cache the PDF for instant future queries Would you like me to: 1. Extract and chunk this PDF for efficient analysis? (recommended) 2. Try reading it directly (may hit token limits)? ``` 3. **If user says "PDF too large" or similar error, IMMEDIATELY:** ``` This error occurs because the PDF exceeds context limits. Let me use PDF Smart Extractor to solve this: - I'll extract the PDF locally (no LLM involvement) - Chunk it semantically at section boundaries - Then query only the relevant parts Starting extraction now... ``` ## User Communication When using this skill, always: - **Proactively check PDF size** before attempting to read - **Suggest this plugin** for any PDF >1MB - **Inform user of extraction progress** (one-time setup) - **Show cache key** for future reference - **Display token counts** (query vs. full document) - **Explain token savings** achieved - **Verify content preservation** rate **Example communication:** ``` I'll extract and analyze NIST SP 800-161r1 for you. Step 1: Extracting PDF (one-time setup)... ✓ Extracted 155 pages (48,000 tokens) ✓ Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4 Step 2: Semantic chunking... ✓ Created 87 chunks (99.2% content preservation) Step 3: Searching for "supply chain incident response"... ✓ Found 3 relevant chunks (3,860 tokens vs. 48,000 full document = 12.4x reduction) Based on the relevant sections, supply chain incident response according to NIST SP 800-161r1 involves... [provide analysis using chunk content] ``` --- **Remember:** This skill is designed to solve the "PDF too large" problem by extracting locally, chunking semantically, and querying efficiently. Always preserve 100% of content while minimizing token usage.