---
name: pdf-smart-extractor
description: Extract and analyze large PDFs (3MB-10MB+) with minimal token usage. Preserves 100% of content while achieving 12-103x token reduction through local extraction, semantic chunking, and intelligent caching.
capabilities:
  - pdf-extraction
  - semantic-chunking
  - token-optimization
  - large-document-analysis
  - technical-documentation
  - compliance-framework-analysis
tools: Read, Bash
model: inherit
---

PDF Smart Extractor

Extract and analyze large PDFs with minimal token usage


🚨 AUTO-TRIGGER RULES (Read This First!)

MANDATORY: You MUST use this skill when:

  1. File Size Check: User provides a .pdf file path

    • Run: ls -lh <file_path> or stat -f%z <file_path> (macOS) / stat -c%s <file_path> (Linux)
    • If size >1MB (1048576 bytes) → AUTOMATICALLY suggest this plugin
    • Do NOT attempt to read large PDFs directly
  2. Error Detection: User encounters "PDF too large" or similar errors

    • IMMEDIATELY switch to this plugin
    • Explain: "This error occurs because the PDF exceeds limits. I'll use PDF Smart Extractor to solve this."
  3. Technical Documentation Keywords: User mentions:

    • NIST, ISO, AWS, Azure, GCP, CIS, SANS documents
    • Compliance frameworks, security guides, technical manuals
    • PROACTIVELY check file size and suggest this plugin
  4. Token Efficiency: User asks to analyze multi-page PDFs (>50 pages)

    • Suggest this plugin to save 90%+ tokens

Example Auto-Trigger:

User: "Can you analyze /docs/NIST.SP.800-161r1-upd1.pdf?"

You: [First, check file size]
$ ls -lh /docs/NIST.SP.800-161r1-upd1.pdf
-rw-r--r--  1 user  staff   3.3M Oct 20 10:30 /docs/NIST.SP.800-161r1-upd1.pdf

[Detect: 3.3MB > 1MB → Auto-trigger plugin]

"I notice this PDF is 3.3 MB. For efficient analysis, I'll use PDF Smart Extractor which will:
✓ Extract 100% of content locally (no token waste)
✓ Enable 12-100x faster querying
✓ Cache for instant future access

Starting extraction now..."

When to Use This Skill

IMPORTANT: This skill should be automatically suggested when:

  • User provides a PDF file path and the file is >1MB
  • User mentions "PDF too large" or "can't read PDF"
  • User tries to analyze technical documentation (NIST, ISO, AWS, Azure, GCP guides)
  • User mentions token limits or context window issues with PDFs
  • User wants to extract or analyze multi-page PDFs (>50 pages)

Use this skill when users need to:

  • Extract content from large PDF documents (>1MB, >50 pages)
  • Analyze PDFs that exceed LLM context windows
  • Query specific sections of technical documents (NIST, ISO, AWS guides, etc.)
  • Preserve 100% of PDF content while minimizing token consumption
  • Build knowledge bases from PDF documentation
  • Search PDFs for specific topics or keywords
  • Overcome "PDF too large" errors

Trigger phrases (explicit):

  • "extract this PDF"
  • "analyze [PDF file]"
  • "search [PDF] for [topic]"
  • "what does [PDF] say about [topic]"
  • "chunk this large PDF"
  • "process NIST document"
  • "read this PDF: /path/to/file.pdf"
  • "can you analyze this technical document"

Trigger phrases (implicit - auto-detect):

  • User provides path ending in .pdf and file size >1MB
  • "PDF too large to read"
  • "can't open this PDF"
  • "this PDF won't load"
  • "help me with this NIST/ISO/AWS/compliance document"
  • "extract information from [large document]"
  • "I have a big PDF file"

Auto-detection logic: When user provides a file path:

  1. Check if file extension is .pdf
  2. Check file size using ls -lh or stat
  3. If size >1MB, proactively suggest: "This PDF is X MB. I can use PDF Smart Extractor to process it efficiently with up to 100x fewer tokens. Would you like me to extract and chunk it?" (a minimal sketch of this check appears below)
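
A minimal sketch of this detection logic in Python (the 1 MB threshold is the one stated above; the function name is illustrative and not part of the shipped scripts):

from pathlib import Path

SIZE_THRESHOLD = 1_048_576  # 1 MB - the auto-trigger threshold used above

def should_suggest_extractor(file_path: str) -> bool:
    # Trigger only for .pdf files larger than 1 MB
    path = Path(file_path)
    return path.suffix.lower() == ".pdf" and path.stat().st_size > SIZE_THRESHOLD

if should_suggest_extractor("/docs/NIST.SP.800-161r1-upd1.pdf"):
    size_mb = Path("/docs/NIST.SP.800-161r1-upd1.pdf").stat().st_size / 1_048_576
    print(f"This PDF is {size_mb:.1f} MB. I can use PDF Smart Extractor to process it efficiently.")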

Core Capabilities

1. Local PDF Extraction (Zero LLM Involvement)

  • Extracts 100% of PDF content using PyMuPDF
  • No LLM calls during extraction - fully local processing
  • Preserves metadata, table of contents, and document structure
  • Caches extracted content for instant reuse

2. Semantic Chunking

  • Splits text at intelligent boundaries (chapters, sections, paragraphs)
  • Preserves context and meaning across chunks
  • Target chunk size: ~2000 tokens (configurable)
  • 100% content preservation guaranteed

3. Efficient Querying

  • Search chunks by keywords or topics
  • Load only relevant chunks (12-25x token reduction)
  • Ranked results by relevance
  • Combine multiple chunks as needed

4. Persistent Caching

  • One-time extraction per PDF
  • Instant access to cached content
  • File hash verification for integrity
  • Automatic cache management

⚠️ IMPORTANT: Cache Location

Extracted content is stored in a user cache directory, NOT the working directory:

Cache locations by platform:

  • Linux/Mac: ~/.claude-cache/pdf/{pdf_name}_{hash}/
  • Windows: C:\Users\{username}\.claude-pdf-cache\{pdf_name}_{hash}\

Why cache directory?

  1. Persistent caching: Extract once, query forever - even across different projects
  2. Cross-project reuse: Same PDF analyzed from different projects uses the same cache
  3. Performance: Subsequent queries are instant (no re-extraction needed)
  4. Token optimization: 12-115x reduction by loading only relevant chunks

Cache contents:

  • full_text.txt - Complete extracted text
  • pages.json - Page-by-page content
  • metadata.json - PDF metadata
  • toc.json - Table of contents
  • manifest.json - Cache manifest

Accessing cached content:

# List all cached PDFs
python scripts/query_pdf.py list

# Query cached content
python scripts/query_pdf.py search {cache_key} "your query"

# Find cache location (shown in extraction output)
# Example: ~/.claude-cache/pdf/document_a1b2c3d4/

If you need files in working directory:

# Option 1: Use --output-dir flag during extraction
python scripts/extract_pdf.py document.pdf --output-dir ./extracted

# Option 2: Copy from cache manually
cp -r ~/.claude-cache/pdf/{cache_key}/* ./extracted_content/

Note: Cache is local and not meant for version control. Keep original PDFs in your repo and let each developer extract locally (one-time operation).

Workflow

Phase 1: Extract PDF (One-Time Setup)

# Extract to cache (default)
python scripts/extract_pdf.py /path/to/document.pdf

# Extract and copy to working directory (interactive prompt)
python scripts/extract_pdf.py /path/to/document.pdf
# Will prompt: "Copy files? (y/n)"
# Will ask: "Keep cache? (y/n)"

# Extract and copy to specific directory (no prompts)
python scripts/extract_pdf.py /path/to/document.pdf --output-dir ./extracted

What happens:

  • Reads entire PDF locally
  • Extracts text, metadata, table of contents
  • Saves to ~/.claude-cache/pdf/{cache_key}/ (the cache location described above)
  • Returns cache key for future queries

Output:

  • full_text.txt - Complete document text
  • pages.json - Structured page data
  • metadata.json - PDF metadata
  • toc.json - Table of contents (if available)
  • manifest.json - Extraction statistics
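
Under the hood, extraction is a local PyMuPDF pass over the document. A simplified sketch, assuming only the artifacts listed above need to be written (the real extract_pdf.py also handles hashing, caching, and error cases not shown here):

import json
from pathlib import Path
import fitz  # PyMuPDF

def extract(pdf_path: str, out_dir: str) -> None:
    doc = fitz.open(pdf_path)                   # open the PDF locally - no LLM involved
    pages = [page.get_text() for page in doc]   # text of every page, in order
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "full_text.txt").write_text("\n".join(pages), encoding="utf-8")
    (out / "pages.json").write_text(json.dumps(pages), encoding="utf-8")
    (out / "metadata.json").write_text(json.dumps(doc.metadata), encoding="utf-8")  # title, author, dates
    (out / "toc.json").write_text(json.dumps(doc.get_toc()), encoding="utf-8")      # [[level, title, page], ...]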

Phase 2: Chunk Content (Semantic Organization)

python scripts/semantic_chunker.py {cache_key}

What happens:

  • Detects semantic boundaries (chapters, sections, paragraphs)
  • Splits text at intelligent boundaries
  • Creates ~2000 token chunks
  • Preserves 100% of content

Output:

  • chunks.json - Chunk index with metadata
  • chunks/chunk_0000.txt - Individual chunk files
  • Statistics: total chunks, token distribution, preservation rate
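
A rough sketch of the boundary-then-pack approach, assuming headings such as "Chapter 3" or "3.2 Section Title" mark section starts and approximating tokens as characters/4 (the shipped semantic_chunker.py may detect boundaries differently):

import re

TARGET_TOKENS = 2000  # default target size per chunk
BOUNDARY = re.compile(r"^(Chapter \d+|\d+(\.\d+)+ \S|[A-Z][A-Z ]{8,})", re.MULTILINE)

def chunk(full_text: str) -> list[str]:
    # Split at detected section boundaries so no character is dropped
    starts = [m.start() for m in BOUNDARY.finditer(full_text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(full_text)]
    sections = [full_text[a:b] for a, b in zip(bounds, bounds[1:])]

    # Pack sections into ~TARGET_TOKENS chunks (tokens approximated as chars / 4)
    chunks, current = [], ""
    for section in sections:
        if current and (len(current) + len(section)) / 4 > TARGET_TOKENS:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks  # "".join(chunks) == full_text, i.e. 100% preservation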

Phase 3: Query Content (Efficient Retrieval)

python scripts/query_pdf.py search {cache_key} "supply chain security"

What happens:

  • Searches chunk index for relevant content
  • Ranks results by relevance
  • Returns only matching chunks
  • Displays token counts for transparency

Output:

  • List of matching chunks with previews
  • Relevance scores
  • Total tokens required (vs. full document)
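
Search over the chunk index can be as simple as term-frequency scoring with a coverage bonus. This is an illustrative sketch only; the ranking used by query_pdf.py may differ:

def search(chunks: list[str], query: str, top_k: int = 3) -> list[tuple[int, float]]:
    terms = query.lower().split()
    scored = []
    for chunk_id, text in enumerate(chunks):
        body = text.lower()
        hits = sum(body.count(term) for term in terms)               # raw term frequency
        coverage = sum(term in body for term in terms) / len(terms)  # fraction of query terms present
        if hits:
            scored.append((chunk_id, hits * coverage))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]  # best-matching chunk ids and relevance scores first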

Usage Examples

Example 1: Large NIST Document

User Request: "Extract and analyze NIST SP 800-161r1 for supply chain incident response procedures"

Your Workflow:

  1. Extract PDF (one-time):
python scripts/extract_pdf.py /path/to/NIST.SP.800-161r1-upd1.pdf

Output: Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4e5f6

  2. Chunk content:
python scripts/semantic_chunker.py NIST.SP.800-161r1-upd1_a1b2c3d4e5f6

Output: Created 87 chunks, 98.7% content preservation

  3. Search for relevant sections:
python scripts/query_pdf.py search NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 "supply chain incident response"

Output:

  • Chunk 23 - "Supply Chain Risk Management" (relevance: 87%, 1,850 tokens)
  • Chunk 45 - "Incident Response in C-SCRM" (relevance: 72%, 2,010 tokens)
  • Total: 3,860 tokens (vs. 48,000 for full document = 12.4x reduction)
  4. Retrieve specific chunks:
python scripts/query_pdf.py get NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 23

Output: Full content of chunk 23

  5. Provide context to user: "Based on NIST SP 800-161r1, supply chain incident response involves... [use chunk content]"

User Request: "I need to understand OT security incidents from NIST SP 800-82r3"

Your Workflow:

  1. Extract (one-time):
python scripts/extract_pdf.py /path/to/NIST.SP.800-82r3.pdf
  2. Chunk:
python scripts/semantic_chunker.py NIST.SP.800-82r3_x7y8z9
  3. First query - Overview:
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "OT security overview"
  4. Second query - Incidents:
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "incident response ICS"
  5. Third query - Specific threat:
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "ransomware operational technology"

Result: Each query loads only the relevant chunks (~2-4 chunks, ~5,000 tokens) instead of the entire 8.2 MB document (120,000+ tokens)

Example 3: Table of Contents Navigation

User Request: "Show me the structure of this AWS security guide"

Your Workflow:

  1. Extract PDF:
python scripts/extract_pdf.py aws-security-guide.pdf
  2. Get TOC:
python scripts/query_pdf.py toc aws-security-guide_abc123

Output:

Chapter 1: Introduction (page 1)
  1.1 Security Fundamentals (page 3)
  1.2 Shared Responsibility Model (page 7)
Chapter 2: Identity and Access Management (page 15)
  2.1 IAM Best Practices (page 17)
  ...
  3. Navigate to specific section: Based on TOC, identify relevant chunk IDs and retrieve specific content.

Important Guidelines

Content Preservation

  • ALWAYS preserve 100% of PDF content - no summarization during extraction
  • Verify preservation rate in chunking statistics (should be >99.5%)
  • If preservation rate is low, investigate boundary detection issues

Token Efficiency

  • Target 12-25x token reduction compared to loading full PDF
  • Search before loading - don't load chunks blindly
  • Combine related chunks when context requires it
  • Show token counts to user for transparency

Cache Management

  • Cache key format: {pdf_name}_{hash}
  • Cache location: ~/.claude-cache/pdf/ on Linux/macOS (see Cache Location above)
  • Reuse cached extractions - don't re-extract unnecessarily
  • Use --force flag only when PDF has been modified
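
One plausible way to derive the {pdf_name}_{hash} key is a content hash of the file, so a modified PDF naturally produces a new cache entry; the exact algorithm and digest length used by the scripts are assumptions here:

import hashlib
from pathlib import Path

def cache_key(pdf_path: str) -> str:
    path = Path(pdf_path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()[:12]  # short content fingerprint (assumed length)
    return f"{path.stem}_{digest}"  # e.g. "NIST.SP.800-161r1-upd1_a1b2c3d4e5f6"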

Error Handling

  • If extraction fails, check PDF encryption status
  • If chunking produces few chunks, document may lack structure
  • If search returns no results, try broader keywords
  • If preservation rate < 99%, review boundary detection

Command Reference

Extract PDF

python scripts/extract_pdf.py <pdf_path> [--force] [--output-dir DIR]
  • pdf_path: Path to PDF file
  • --force: Re-extract even if cached
  • --output-dir: Also copy extracted files to DIR (skips the interactive prompts)

Chunk Text

python scripts/semantic_chunker.py <cache_key> [--target-size TOKENS]
  • cache_key: Cache key from extraction
  • --target-size: Target tokens per chunk (default: 2000)

List Cached PDFs

python scripts/query_pdf.py list

Search Chunks

python scripts/query_pdf.py search <cache_key> <query>
  • cache_key: PDF cache key
  • query: Keywords or phrase to search

Get Specific Chunk

python scripts/query_pdf.py get <cache_key> <chunk_id>
  • chunk_id: Chunk number to retrieve

View Statistics

python scripts/query_pdf.py stats <cache_key>

View Table of Contents

python scripts/query_pdf.py toc <cache_key>

Performance Metrics

Real-World Performance

NIST SP 800-161r1-upd1 (3.3 MB, 155 pages):

  • Extraction: ~45 seconds
  • Chunking: ~8 seconds
  • Full document tokens: ~48,000
  • Average query result: ~3,500 tokens
  • Token reduction: 13.7x

NIST SP 800-82r3 (8.2 MB, 247 pages):

  • Extraction: ~90 seconds
  • Chunking: ~15 seconds
  • Full document tokens: ~124,000
  • Average query result: ~5,200 tokens
  • Token reduction: 23.8x

Content Preservation Verification

All extractions maintain >99.5% content preservation rate:

  • Original characters = Sum of all chunk characters
  • No content lost during chunking
  • Semantic boundaries preserve context
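
That check reduces to a single character-count comparison; a minimal sketch:

def preservation_rate(full_text: str, chunks: list[str]) -> float:
    # 100.0 means every original character appears in some chunk
    return 100.0 * sum(len(chunk) for chunk in chunks) / len(full_text)

A healthy run should report a value above 99.5, as stated above.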

Technical Architecture

Extraction Layer (extract_pdf.py)

  • PyMuPDF (pymupdf) - Fast, reliable PDF parsing
  • Handles encrypted PDFs, complex layouts, embedded images
  • Extracts text, metadata, TOC, page structure
  • File hashing for cache validation

Chunking Layer (semantic_chunker.py)

  • Semantic boundary detection - Regex patterns for structure
  • Intelligent splitting - Respects chapters, sections, paragraphs
  • Size balancing - Splits large chunks, combines small ones
  • Content preservation - Mathematical verification

Query Layer (query_pdf.py)

  • Keyword search - Multi-term matching with relevance scoring
  • Context preservation - Shows match previews
  • Efficient retrieval - Loads only required chunks
  • Statistics tracking - Token usage transparency

Integration with Other Skills

Incident Response Playbook Creator

Use PDF Smart Extractor to:

  • Extract NIST SP 800-61r3 sections on-demand
  • Query specific incident types (ransomware, DDoS, etc.)
  • Reduce token usage for playbook generation

Cybersecurity Policy Generator

Use PDF Smart Extractor to:

  • Extract compliance framework requirements (ISO 27001, SOC 2)
  • Query specific control families
  • Reference authoritative sources efficiently

Research and Analysis Tasks

Use PDF Smart Extractor to:

  • Build knowledge bases from technical documentation
  • Compare multiple PDF sources
  • Extract specific sections for citation

Limitations and Considerations

What This Skill Does

  • Extracts 100% of PDF text content
  • Preserves document structure and metadata
  • Enables efficient querying with minimal tokens
  • Caches for instant reuse
  • Works offline (extraction is local)

What This Skill Does NOT Do

  • OCR for scanned PDFs (text must be extractable)
  • Image analysis (focuses on text content)
  • PDF creation or modification
  • Real-time collaboration or annotation
  • Automatic summarization (preserves full content)

Dependencies

  • Python 3.8+
  • PyMuPDF (pymupdf): pip install pymupdf
  • All other modules are from the standard library (json, re, pathlib, hashlib)

Success Criteria

A successful PDF extraction and query session should:

  1. Preserve 100% of content (preservation rate >99.5%)
  2. Achieve 12-25x token reduction for typical queries
  3. Complete extraction in <2 minutes for documents <10MB
  4. Return relevant results with clear relevance scoring
  5. Cache efficiently for instant reuse

Proactive Detection and Suggestion

CRITICAL: When user provides a PDF file path, ALWAYS:

  1. Check file size first:
ls -lh /path/to/file.pdf
# or
stat -f%z /path/to/file.pdf  # macOS
stat -c%s /path/to/file.pdf  # Linux
  2. If file is >1MB (1048576 bytes), IMMEDIATELY suggest this plugin:
I notice this PDF is X MB in size. For large PDFs, I recommend using the PDF Smart Extractor plugin which can:
- Extract 100% of content locally (no token usage for extraction)
- Enable querying with 12-100x token reduction
- Cache the PDF for instant future queries

Would you like me to:
1. Extract and chunk this PDF for efficient analysis? (recommended)
2. Try reading it directly (may hit token limits)?
  3. If user says "PDF too large" or similar error, IMMEDIATELY:
This error occurs because the PDF exceeds context limits. Let me use PDF Smart Extractor to solve this:
- I'll extract the PDF locally (no LLM involvement)
- Chunk it semantically at section boundaries
- Then query only the relevant parts

Starting extraction now...

User Communication

When using this skill, always:

  • Proactively check PDF size before attempting to read
  • Suggest this plugin for any PDF >1MB
  • Inform user of extraction progress (one-time setup)
  • Show cache key for future reference
  • Display token counts (query vs. full document)
  • Explain token savings achieved
  • Verify content preservation rate

Example communication:

I'll extract and analyze NIST SP 800-161r1 for you.

Step 1: Extracting PDF (one-time setup)...
✓ Extracted 155 pages (48,000 tokens)
✓ Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4

Step 2: Semantic chunking...
✓ Created 87 chunks (99.2% content preservation)

Step 3: Searching for "supply chain incident response"...
✓ Found 3 relevant chunks (3,860 tokens vs. 48,000 full document = 12.4x reduction)

Based on the relevant sections, supply chain incident response according to NIST SP 800-161r1 involves...
[provide analysis using chunk content]

Remember: This skill is designed to solve the "PDF too large" problem by extracting locally, chunking semantically, and querying efficiently. Always preserve 100% of content while minimizing token usage.