| name | description | capabilities | tools | model |
|---|---|---|---|---|
| pdf-smart-extractor | Extract and analyze large PDFs (3MB-10MB+) with minimal token usage. Preserves 100% of content while achieving 12-103x token reduction through local extraction, semantic chunking, and intelligent caching. | | Read, Bash | inherit |
PDF Smart Extractor
Extract and analyze large PDFs with minimal token usage
🚨 AUTO-TRIGGER RULES (Read This First!)
MANDATORY: You MUST use this skill when:
- File Size Check: User provides a `.pdf` file path
  - Run: `ls -lh <file_path>` or `stat -f%z <file_path>` (macOS) / `stat -c%s <file_path>` (Linux)
  - If size >1MB (1048576 bytes) → AUTOMATICALLY suggest this plugin
  - Do NOT attempt to read large PDFs directly
- Error Detection: User encounters "PDF too large" or similar errors
  - IMMEDIATELY switch to this plugin
  - Explain: "This error occurs because the PDF exceeds limits. I'll use PDF Smart Extractor to solve this."
- Technical Documentation Keywords: User mentions:
  - NIST, ISO, AWS, Azure, GCP, CIS, SANS documents
  - Compliance frameworks, security guides, technical manuals
  - PROACTIVELY check file size and suggest this plugin
- Token Efficiency: User asks to analyze multi-page PDFs (>50 pages)
  - Suggest this plugin to save 90%+ tokens
Example Auto-Trigger:
User: "Can you analyze /docs/NIST.SP.800-161r1-upd1.pdf?"
You: [First, check file size]
$ ls -lh /docs/NIST.SP.800-161r1-upd1.pdf
-rw-r--r-- 1 user staff 3.3M Oct 20 10:30 /docs/NIST.SP.800-161r1-upd1.pdf
[Detect: 3.3MB > 1MB → Auto-trigger plugin]
"I notice this PDF is 3.3 MB. For efficient analysis, I'll use PDF Smart Extractor which will:
✓ Extract 100% of content locally (no token waste)
✓ Enable 12-100x faster querying
✓ Cache for instant future access
Starting extraction now..."
When to Use This Skill
IMPORTANT: This skill should be automatically suggested when:
- User provides a PDF file path and the file is >1MB
- User mentions "PDF too large" or "can't read PDF"
- User tries to analyze technical documentation (NIST, ISO, AWS, Azure, GCP guides)
- User mentions token limits or context window issues with PDFs
- User wants to extract or analyze multi-page PDFs (>50 pages)
Use this skill when users need to:
- Extract content from large PDF documents (>1MB, >50 pages)
- Analyze PDFs that exceed LLM context windows
- Query specific sections of technical documents (NIST, ISO, AWS guides, etc.)
- Preserve 100% of PDF content while minimizing token consumption
- Build knowledge bases from PDF documentation
- Search PDFs for specific topics or keywords
- Overcome "PDF too large" errors
Trigger phrases (explicit):
- "extract this PDF"
- "analyze [PDF file]"
- "search [PDF] for [topic]"
- "what does [PDF] say about [topic]"
- "chunk this large PDF"
- "process NIST document"
- "read this PDF: /path/to/file.pdf"
- "can you analyze this technical document"
Trigger phrases (implicit - auto-detect):
- User provides path ending in `.pdf` and file size >1MB
- "PDF too large to read"
- "can't open this PDF"
- "this PDF won't load"
- "help me with this NIST/ISO/AWS/compliance document"
- "extract information from [large document]"
- "I have a big PDF file"
Auto-detection logic: When user provides a file path:
- Check if file extension is `.pdf`
- Check file size using `ls -lh` or `stat`
- If size >1MB, proactively suggest: "This PDF is X MB. I can use PDF Smart Extractor to process it efficiently with 100x fewer tokens. Would you like me to extract and chunk it?"
Core Capabilities
1. Local PDF Extraction (Zero LLM Involvement)
- Extracts 100% of PDF content using PyMuPDF
- No LLM calls during extraction - fully local processing
- Preserves metadata, table of contents, and document structure
- Caches extracted content for instant reuse
2. Semantic Chunking
- Splits text at intelligent boundaries (chapters, sections, paragraphs)
- Preserves context and meaning across chunks
- Target chunk size: ~2000 tokens (configurable)
- 100% content preservation guaranteed
3. Efficient Querying
- Search chunks by keywords or topics
- Load only relevant chunks (12-25x token reduction)
- Ranked results by relevance
- Combine multiple chunks as needed
4. Persistent Caching
- One-time extraction per PDF
- Instant access to cached content
- File hash verification for integrity
- Automatic cache management
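The cache key and integrity check are built on a content hash of the PDF. As a rough sketch of the idea (the exact hash algorithm and truncation length used by `extract_pdf.py` are assumptions here, not confirmed by the scripts):

```python
# Hypothetical sketch of '{pdf_name}_{hash}' cache keys; the real scripts may
# use a different hash algorithm or truncation length.
import hashlib
from pathlib import Path

def cache_key_for(pdf_path: str, hash_len: int = 12) -> str:
    """Derive a cache key from the PDF's content hash."""
    path = Path(pdf_path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return f"{path.stem}_{digest[:hash_len]}"

def cache_is_valid(pdf_path: str, recorded_digest: str) -> bool:
    """Integrity check: re-hash the PDF and compare against the cached value."""
    return hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest() == recorded_digest
```

If the hash no longer matches (i.e., the PDF has changed), re-extraction with `--force` is the documented remedy.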
⚠️ IMPORTANT: Cache Location
Extracted content is stored in a user cache directory, NOT the working directory:
Cache locations by platform:
- Linux/Mac: `~/.claude-cache/pdf/{pdf_name}_{hash}/`
- Windows: `C:\Users\{username}\.claude-pdf-cache\{pdf_name}_{hash}\`
Why cache directory?
- Persistent caching: Extract once, query forever - even across different projects
- Cross-project reuse: Same PDF analyzed from different projects uses the same cache
- Performance: Subsequent queries are instant (no re-extraction needed)
- Token optimization: 12-115x reduction by loading only relevant chunks
Cache contents:
- `full_text.txt` - Complete extracted text
- `pages.json` - Page-by-page content
- `metadata.json` - PDF metadata
- `toc.json` - Table of contents
- `manifest.json` - Cache manifest
Accessing cached content:
# List all cached PDFs
python scripts/query_pdf.py list
# Query cached content
python scripts/query_pdf.py search {cache_key} "your query"
# Find cache location (shown in extraction output)
# Example: ~/.claude-cache/pdf/document_a1b2c3d4/
If you need files in working directory:
# Option 1: Use --output-dir flag during extraction
python scripts/extract_pdf.py document.pdf --output-dir ./extracted
# Option 2: Copy from cache manually
cp -r ~/.claude-cache/pdf/{cache_key}/* ./extracted_content/
Note: Cache is local and not meant for version control. Keep original PDFs in your repo and let each developer extract locally (one-time operation).
Workflow
Phase 1: Extract PDF (One-Time Setup)
# Extract to cache (default)
python scripts/extract_pdf.py /path/to/document.pdf
# Extract and copy to working directory (interactive prompt)
python scripts/extract_pdf.py /path/to/document.pdf
# Will prompt: "Copy files? (y/n)"
# Will ask: "Keep cache? (y/n)"
# Extract and copy to specific directory (no prompts)
python scripts/extract_pdf.py /path/to/document.pdf --output-dir ./extracted
What happens:
- Reads entire PDF locally
- Extracts text, metadata, table of contents
- Saves to `~/.claude-pdf-cache/{cache_key}/`
- Returns cache key for future queries
Output:
- `full_text.txt` - Complete document text
- `pages.json` - Structured page data
- `metadata.json` - PDF metadata
- `toc.json` - Table of contents (if available)
- `manifest.json` - Extraction statistics
Phase 2: Chunk Content (Semantic Organization)
python scripts/semantic_chunker.py {cache_key}
What happens:
- Detects semantic boundaries (chapters, sections, paragraphs)
- Splits text at intelligent boundaries
- Creates ~2000 token chunks
- Preserves 100% of content
Output:
- `chunks.json` - Chunk index with metadata
- `chunks/chunk_0000.txt` - Individual chunk files
- Statistics: total chunks, token distribution, preservation rate
Phase 3: Query Content (Efficient Retrieval)
python scripts/query_pdf.py search {cache_key} "supply chain security"
What happens:
- Searches chunk index for relevant content
- Ranks results by relevance
- Returns only matching chunks
- Displays token counts for transparency
Output:
- List of matching chunks with previews
- Relevance scores
- Total tokens required (vs. full document)
Usage Examples
Example 1: Large NIST Document
User Request: "Extract and analyze NIST SP 800-161r1 for supply chain incident response procedures"
Your Workflow:
- Extract PDF (one-time):
python scripts/extract_pdf.py /path/to/NIST.SP.800-161r1-upd1.pdf
Output: Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4e5f6
- Chunk content:
python scripts/semantic_chunker.py NIST.SP.800-161r1-upd1_a1b2c3d4e5f6
Output: Created 87 chunks, 98.7% content preservation
- Search for relevant sections:
python scripts/query_pdf.py search NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 "supply chain incident response"
Output:
- Chunk 23 - "Supply Chain Risk Management" (relevance: 87%, 1,850 tokens)
- Chunk 45 - "Incident Response in C-SCRM" (relevance: 72%, 2,010 tokens)
- Total: 3,860 tokens (vs. 48,000 for full document = 12.4x reduction)
- Retrieve specific chunks:
python scripts/query_pdf.py get NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 23
Output: Full content of chunk 23
- Provide context to user: "Based on NIST SP 800-161r1, supply chain incident response involves... [use chunk content]"
Example 2: Multiple Related Queries
User Request: "I need to understand OT security incidents from NIST SP 800-82r3"
Your Workflow:
- Extract (one-time):
python scripts/extract_pdf.py /path/to/NIST.SP.800-82r3.pdf
- Chunk:
python scripts/semantic_chunker.py NIST.SP.800-82r3_x7y8z9
- First query - Overview:
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "OT security overview"
- Second query - Incidents:
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "incident response ICS"
- Third query - Specific threat:
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "ransomware operational technology"
Result: Each query loads only relevant chunks (~2-4 chunks, ~5,000 tokens) instead of entire 8.2MB document (120,000+ tokens)
Example 3: Table of Contents Navigation
User Request: "Show me the structure of this AWS security guide"
Your Workflow:
- Extract PDF:
python scripts/extract_pdf.py aws-security-guide.pdf
- Get TOC:
python scripts/query_pdf.py toc aws-security-guide_abc123
Output:
Chapter 1: Introduction (page 1)
1.1 Security Fundamentals (page 3)
1.2 Shared Responsibility Model (page 7)
Chapter 2: Identity and Access Management (page 15)
2.1 IAM Best Practices (page 17)
...
- Navigate to specific section: Based on TOC, identify relevant chunk IDs and retrieve specific content.
Important Guidelines
Content Preservation
- ALWAYS preserve 100% of PDF content - no summarization during extraction
- Verify preservation rate in chunking statistics (should be >99.5%)
- If preservation rate is low, investigate boundary detection issues
Token Efficiency
- Target 12-25x token reduction compared to loading full PDF
- Search before loading - don't load chunks blindly
- Combine related chunks when context requires it
- Show token counts to user for transparency
Cache Management
- Cache key format: `{pdf_name}_{hash}`
- Cache location: `~/.claude-pdf-cache/`
- Reuse cached extractions - don't re-extract unnecessarily
- Use `--force` flag only when PDF has been modified
Error Handling
- If extraction fails, check PDF encryption status
- If chunking produces few chunks, document may lack structure
- If search returns no results, try broader keywords
- If preservation rate < 99%, review boundary detection
Command Reference
Extract PDF
python scripts/extract_pdf.py <pdf_path> [--force]
- `pdf_path`: Path to PDF file
- `--force`: Re-extract even if cached
Chunk Text
python scripts/semantic_chunker.py <cache_key> [--target-size TOKENS]
- `cache_key`: Cache key from extraction
- `--target-size`: Target tokens per chunk (default: 2000)
List Cached PDFs
python scripts/query_pdf.py list
Search Chunks
python scripts/query_pdf.py search <cache_key> <query>
- `cache_key`: PDF cache key
- `query`: Keywords or phrase to search
Get Specific Chunk
python scripts/query_pdf.py get <cache_key> <chunk_id>
- `chunk_id`: Chunk number to retrieve
View Statistics
python scripts/query_pdf.py stats <cache_key>
View Table of Contents
python scripts/query_pdf.py toc <cache_key>
Performance Metrics
Real-World Performance
NIST SP 800-161r1-upd1 (3.3 MB, 155 pages):
- Extraction: ~45 seconds
- Chunking: ~8 seconds
- Full document tokens: ~48,000
- Average query result: ~3,500 tokens
- Token reduction: 13.7x
NIST SP 800-82r3 (8.2 MB, 247 pages):
- Extraction: ~90 seconds
- Chunking: ~15 seconds
- Full document tokens: ~124,000
- Average query result: ~5,200 tokens
- Token reduction: 23.8x
Content Preservation Verification
All extractions maintain >99.5% content preservation rate:
- Original characters = Sum of all chunk characters
- No content lost during chunking
- Semantic boundaries preserve context
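A minimal sketch of this check, assuming the cache layout described above (`full_text.txt` plus `chunks/chunk_*.txt`); the exact verification performed by `semantic_chunker.py` may differ:

```python
# Hedged sketch: preservation rate = chunk characters / original characters.
from pathlib import Path

def preservation_rate(cache_dir: str) -> float:
    cache = Path(cache_dir).expanduser()
    original = len((cache / "full_text.txt").read_text(encoding="utf-8"))
    chunked = sum(
        len(p.read_text(encoding="utf-8"))
        for p in (cache / "chunks").glob("chunk_*.txt")
    )
    return chunked / original * 100  # expected to stay above 99.5
```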
Technical Architecture
Extraction Layer (extract_pdf.py)
- PyMuPDF (pymupdf) - Fast, reliable PDF parsing
- Handles encrypted PDFs, complex layouts, embedded images
- Extracts text, metadata, TOC, page structure
- File hashing for cache validation
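For orientation only, a minimal PyMuPDF pass that produces the artifacts listed in the cache layout might look like the sketch below; this is an assumption-laden illustration, not the actual `extract_pdf.py` implementation:

```python
# Minimal sketch of a PyMuPDF extraction pass (illustrative, not the real script).
import json
from pathlib import Path
import fitz  # PyMuPDF: pip install pymupdf

def extract(pdf_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    doc = fitz.open(pdf_path)
    # Collect per-page text, then write the cache artifacts.
    pages = [{"page": i + 1, "text": page.get_text()} for i, page in enumerate(doc)]
    (out / "full_text.txt").write_text(
        "\n".join(p["text"] for p in pages), encoding="utf-8"
    )
    (out / "pages.json").write_text(json.dumps(pages), encoding="utf-8")
    (out / "metadata.json").write_text(json.dumps(doc.metadata), encoding="utf-8")
    (out / "toc.json").write_text(json.dumps(doc.get_toc()), encoding="utf-8")
```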
Chunking Layer (semantic_chunker.py)
- Semantic boundary detection - Regex patterns for structure
- Intelligent splitting - Respects chapters, sections, paragraphs
- Size balancing - Splits large chunks, combines small ones
- Content preservation - Mathematical verification
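A simplified sketch of this boundary-then-pack approach (the heading regex and the ~4-characters-per-token estimate are illustrative assumptions, not the exact rules in `semantic_chunker.py`):

```python
# Hedged sketch: split at heading-like boundaries, then pack into ~2000-token chunks.
import re
from typing import List

HEADING = re.compile(r"^(?:\d+(?:\.\d+)*\s+\S|Chapter\s+\d+|Appendix\s+[A-Z])", re.M)

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 characters per token

def chunk(text: str, target_tokens: int = 2000) -> List[str]:
    # Split at detected section boundaries; fall back to the whole text if none found.
    starts = [m.start() for m in HEADING.finditer(text)] or [0]
    sections = [text[a:b] for a, b in zip([0] + starts, starts + [len(text)]) if text[a:b]]
    chunks, current = [], ""
    for section in sections:
        if current and estimate_tokens(current + section) > target_tokens:
            chunks.append(current)
            current = section
        else:
            current += section
    if current:
        chunks.append(current)
    return chunks  # concatenating the chunks reproduces the original text (100% preserved)
```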
Query Layer (query_pdf.py)
- Keyword search - Multi-term matching with relevance scoring
- Context preservation - Shows match previews
- Efficient retrieval - Loads only required chunks
- Statistics tracking - Token usage transparency
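To illustrate the ranking idea (the actual scoring formula in `query_pdf.py` is not reproduced here), a simple multi-term match over cached chunks could look like:

```python
# Illustrative sketch: rank cached chunks by how often the query terms appear.
from pathlib import Path

def search(cache_dir: str, query: str, top_k: int = 5):
    cache = Path(cache_dir).expanduser()
    terms = [t.lower() for t in query.split()]
    results = []
    for chunk_file in sorted((cache / "chunks").glob("chunk_*.txt")):
        text = chunk_file.read_text(encoding="utf-8")
        lower = text.lower()
        hits = sum(lower.count(t) for t in terms)
        if hits:
            results.append({
                "chunk": chunk_file.name,
                "score": hits,
                "tokens": len(text) // 4,  # rough token estimate for transparency
                "preview": text[:200].replace("\n", " "),
            })
    return sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]
```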
Integration with Other Skills
Incident Response Playbook Creator
Use PDF Smart Extractor to:
- Extract NIST SP 800-61r3 sections on-demand
- Query specific incident types (ransomware, DDoS, etc.)
- Reduce token usage for playbook generation
Cybersecurity Policy Generator
Use PDF Smart Extractor to:
- Extract compliance framework requirements (ISO 27001, SOC 2)
- Query specific control families
- Reference authoritative sources efficiently
Research and Analysis Tasks
Use PDF Smart Extractor to:
- Build knowledge bases from technical documentation
- Compare multiple PDF sources
- Extract specific sections for citation
Limitations and Considerations
What This Skill Does
- ✅ Extracts 100% of PDF text content
- ✅ Preserves document structure and metadata
- ✅ Enables efficient querying with minimal tokens
- ✅ Caches for instant reuse
- ✅ Works offline (extraction is local)
What This Skill Does NOT Do
- ❌ OCR for scanned PDFs (text must be extractable)
- ❌ Image analysis (focuses on text content)
- ❌ PDF creation or modification
- ❌ Real-time collaboration or annotation
- ❌ Automatic summarization (preserves full content)
Dependencies
- Python 3.8+
- PyMuPDF (pymupdf): `pip install pymupdf`
- Otherwise standard library only (json, re, pathlib, hashlib)
Success Criteria
A successful PDF extraction and query session should:
- Preserve 100% of content (preservation rate >99.5%)
- Achieve 12-25x token reduction for typical queries
- Complete extraction in <2 minutes for documents <10MB
- Return relevant results with clear relevance scoring
- Cache efficiently for instant reuse
Proactive Detection and Suggestion
CRITICAL: When user provides a PDF file path, ALWAYS:
- Check file size first:
ls -lh /path/to/file.pdf
# or
stat -f%z /path/to/file.pdf # macOS
stat -c%s /path/to/file.pdf # Linux
- If file is >1MB (1048576 bytes), IMMEDIATELY suggest this plugin:
I notice this PDF is X MB in size. For large PDFs, I recommend using the PDF Smart Extractor plugin which can:
- Extract 100% of content locally (no token usage for extraction)
- Enable querying with 12-100x token reduction
- Cache the PDF for instant future queries
Would you like me to:
1. Extract and chunk this PDF for efficient analysis? (recommended)
2. Try reading it directly (may hit token limits)?
- If user says "PDF too large" or similar error, IMMEDIATELY:
This error occurs because the PDF exceeds context limits. Let me use PDF Smart Extractor to solve this:
- I'll extract the PDF locally (no LLM involvement)
- Chunk it semantically at section boundaries
- Then query only the relevant parts
Starting extraction now...
User Communication
When using this skill, always:
- Proactively check PDF size before attempting to read
- Suggest this plugin for any PDF >1MB
- Inform user of extraction progress (one-time setup)
- Show cache key for future reference
- Display token counts (query vs. full document)
- Explain token savings achieved
- Verify content preservation rate
Example communication:
I'll extract and analyze NIST SP 800-161r1 for you.
Step 1: Extracting PDF (one-time setup)...
✓ Extracted 155 pages (48,000 tokens)
✓ Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4
Step 2: Semantic chunking...
✓ Created 87 chunks (99.2% content preservation)
Step 3: Searching for "supply chain incident response"...
✓ Found 3 relevant chunks (3,860 tokens vs. 48,000 full document = 12.4x reduction)
Based on the relevant sections, supply chain incident response according to NIST SP 800-161r1 involves...
[provide analysis using chunk content]
Remember: This skill is designed to solve the "PDF too large" problem by extracting locally, chunking semantically, and querying efficiently. Always preserve 100% of content while minimizing token usage.