---
name: pdf-smart-extractor
description: Extract and analyze large PDFs (3MB-10MB+) with minimal token usage. Preserves 100% of content while achieving 12-103x token reduction through local extraction, semantic chunking, and intelligent caching.
capabilities:
  - pdf-extraction
  - semantic-chunking
  - token-optimization
  - large-document-analysis
  - technical-documentation
  - compliance-framework-analysis
tools: Read, Bash
model: inherit
---

PDF Smart Extractor

Extract and analyze large PDFs with minimal token usage


🚨 AUTO-TRIGGER RULES (Read This First!)

MANDATORY: You MUST use this skill when:

  1. File Size Check: User provides a .pdf file path

    • Run: ls -lh <file_path> or stat -f%z <file_path> (macOS) / stat -c%s <file_path> (Linux)
    • If size >1MB (1048576 bytes) → AUTOMATICALLY suggest this plugin
    • Do NOT attempt to read large PDFs directly
  2. Error Detection: User encounters "PDF too large" or similar errors

    • IMMEDIATELY switch to this plugin
    • Explain: "This error occurs because the PDF exceeds limits. I'll use PDF Smart Extractor to solve this."
  3. Technical Documentation Keywords: User mentions:

    • NIST, ISO, AWS, Azure, GCP, CIS, SANS documents
    • Compliance frameworks, security guides, technical manuals
    • PROACTIVELY check file size and suggest this plugin
  4. Token Efficiency: User asks to analyze multi-page PDFs (>50 pages)

    • Suggest this plugin to save 90%+ tokens

Example Auto-Trigger:

User: "Can you analyze /docs/NIST.SP.800-161r1-upd1.pdf?"

You: [First, check file size]
$ ls -lh /docs/NIST.SP.800-161r1-upd1.pdf
-rw-r--r--  1 user  staff   3.3M Oct 20 10:30 /docs/NIST.SP.800-161r1-upd1.pdf

[Detect: 3.3MB > 1MB → Auto-trigger plugin]

"I notice this PDF is 3.3 MB. For efficient analysis, I'll use PDF Smart Extractor which will:
✓ Extract 100% of content locally (no token waste)
✓ Enable 12-100x faster querying
✓ Cache for instant future access

Starting extraction now..."

When to Use This Skill

IMPORTANT: This skill should be automatically suggested when:

  • User provides a PDF file path and the file is >1MB
  • User mentions "PDF too large" or "can't read PDF"
  • User tries to analyze technical documentation (NIST, ISO, AWS, Azure, GCP guides)
  • User mentions token limits or context window issues with PDFs
  • User wants to extract or analyze multi-page PDFs (>50 pages)

Use this skill when users need to:

  • Extract content from large PDF documents (>1MB, >50 pages)
  • Analyze PDFs that exceed LLM context windows
  • Query specific sections of technical documents (NIST, ISO, AWS guides, etc.)
  • Preserve 100% of PDF content while minimizing token consumption
  • Build knowledge bases from PDF documentation
  • Search PDFs for specific topics or keywords
  • Overcome "PDF too large" errors

Trigger phrases (explicit):

  • "extract this PDF"
  • "analyze [PDF file]"
  • "search [PDF] for [topic]"
  • "what does [PDF] say about [topic]"
  • "chunk this large PDF"
  • "process NIST document"
  • "read this PDF: /path/to/file.pdf"
  • "can you analyze this technical document"

Trigger phrases (implicit - auto-detect):

  • User provides path ending in .pdf and file size >1MB
  • "PDF too large to read"
  • "can't open this PDF"
  • "this PDF won't load"
  • "help me with this NIST/ISO/AWS/compliance document"
  • "extract information from [large document]"
  • "I have a big PDF file"

Auto-detection logic: When user provides a file path:

  1. Check if file extension is .pdf
  2. Check file size using ls -lh or stat
  3. If size >1MB, proactively suggest: "This PDF is X MB. I can use PDF Smart Extractor to process it efficiently with up to 100x fewer tokens. Would you like me to extract and chunk it?" (a minimal sketch of this check appears below)
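
A minimal sketch of this detection logic in Python (the 1 MB threshold is the one stated above; the function name is illustrative and not part of the shipped scripts):

from pathlib import Path

SIZE_THRESHOLD = 1_048_576  # 1 MB - the auto-trigger threshold used above

def should_suggest_extractor(file_path: str) -> bool:
    # Trigger only for .pdf files larger than 1 MB
    path = Path(file_path)
    return path.suffix.lower() == ".pdf" and path.stat().st_size > SIZE_THRESHOLD

if should_suggest_extractor("/docs/NIST.SP.800-161r1-upd1.pdf"):
    size_mb = Path("/docs/NIST.SP.800-161r1-upd1.pdf").stat().st_size / 1_048_576
    print(f"This PDF is {size_mb:.1f} MB. I can use PDF Smart Extractor to process it efficiently.")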

Core Capabilities

1. Local PDF Extraction (Zero LLM Involvement)

  • Extracts 100% of PDF content using PyMuPDF
  • No LLM calls during extraction - fully local processing
  • Preserves metadata, table of contents, and document structure
  • Caches extracted content for instant reuse

2. Semantic Chunking

  • Splits text at intelligent boundaries (chapters, sections, paragraphs)
  • Preserves context and meaning across chunks
  • Target chunk size: ~2000 tokens (configurable)
  • 100% content preservation guaranteed

3. Efficient Querying

  • Search chunks by keywords or topics
  • Load only relevant chunks (12-25x token reduction)
  • Ranked results by relevance
  • Combine multiple chunks as needed

4. Persistent Caching

  • One-time extraction per PDF
  • Instant access to cached content
  • File hash verification for integrity
  • Automatic cache management

⚠️ IMPORTANT: Cache Location

Extracted content is stored in a user cache directory, NOT the working directory:

Cache locations by platform:

  • Linux/Mac: ~/.claude-cache/pdf/{pdf_name}_{hash}/
  • Windows: C:\Users\{username}\.claude-pdf-cache\{pdf_name}_{hash}\

Why cache directory?

  1. Persistent caching: Extract once, query forever - even across different projects
  2. Cross-project reuse: Same PDF analyzed from different projects uses the same cache
  3. Performance: Subsequent queries are instant (no re-extraction needed)
  4. Token optimization: 12-115x reduction by loading only relevant chunks

Cache contents:

  • full_text.txt - Complete extracted text
  • pages.json - Page-by-page content
  • metadata.json - PDF metadata
  • toc.json - Table of contents
  • manifest.json - Cache manifest

Accessing cached content:

# List all cached PDFs
python scripts/query_pdf.py list

# Query cached content
python scripts/query_pdf.py search {cache_key} "your query"

# Find cache location (shown in extraction output)
# Example: ~/.claude-cache/pdf/document_a1b2c3d4/

If you need files in working directory:

# Option 1: Use --output-dir flag during extraction
python scripts/extract_pdf.py document.pdf --output-dir ./extracted

# Option 2: Copy from cache manually
cp -r ~/.claude-cache/pdf/{cache_key}/* ./extracted_content/

Note: Cache is local and not meant for version control. Keep original PDFs in your repo and let each developer extract locally (one-time operation).

Workflow

Phase 1: Extract PDF (One-Time Setup)

# Extract to cache (default)
python scripts/extract_pdf.py /path/to/document.pdf

# Extract and copy to working directory (interactive prompt)
python scripts/extract_pdf.py /path/to/document.pdf
# Will prompt: "Copy files? (y/n)"
# Will ask: "Keep cache? (y/n)"

# Extract and copy to specific directory (no prompts)
python scripts/extract_pdf.py /path/to/document.pdf --output-dir ./extracted

What happens:

  • Reads entire PDF locally
  • Extracts text, metadata, table of contents
  • Saves to ~/.claude-cache/pdf/{cache_key}/ (the cache location described above)
  • Returns cache key for future queries

Output:

  • full_text.txt - Complete document text
  • pages.json - Structured page data
  • metadata.json - PDF metadata
  • toc.json - Table of contents (if available)
  • manifest.json - Extraction statistics
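
Under the hood, extraction is a local PyMuPDF pass over the document. A simplified sketch, assuming only the artifacts listed above need to be written (the real extract_pdf.py also handles hashing, caching, and error cases not shown here):

import json
from pathlib import Path
import fitz  # PyMuPDF

def extract(pdf_path: str, out_dir: str) -> None:
    doc = fitz.open(pdf_path)                   # open the PDF locally - no LLM involved
    pages = [page.get_text() for page in doc]   # text of every page, in order
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "full_text.txt").write_text("\n".join(pages), encoding="utf-8")
    (out / "pages.json").write_text(json.dumps(pages), encoding="utf-8")
    (out / "metadata.json").write_text(json.dumps(doc.metadata), encoding="utf-8")  # title, author, dates
    (out / "toc.json").write_text(json.dumps(doc.get_toc()), encoding="utf-8")      # [[level, title, page], ...]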

Phase 2: Chunk Content (Semantic Organization)

python scripts/semantic_chunker.py {cache_key}

What happens:

  • Detects semantic boundaries (chapters, sections, paragraphs)
  • Splits text at intelligent boundaries
  • Creates ~2000 token chunks
  • Preserves 100% of content

Output:

  • chunks.json - Chunk index with metadata
  • chunks/chunk_0000.txt - Individual chunk files
  • Statistics: total chunks, token distribution, preservation rate
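
A rough sketch of the boundary-then-pack approach, assuming headings such as "Chapter 3" or "3.2 Section Title" mark section starts and approximating tokens as characters/4 (the shipped semantic_chunker.py may detect boundaries differently):

import re

TARGET_TOKENS = 2000  # default target size per chunk
BOUNDARY = re.compile(r"^(Chapter \d+|\d+(\.\d+)+ \S|[A-Z][A-Z ]{8,})", re.MULTILINE)

def chunk(full_text: str) -> list[str]:
    # Split at detected section boundaries so no character is dropped
    starts = [m.start() for m in BOUNDARY.finditer(full_text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(full_text)]
    sections = [full_text[a:b] for a, b in zip(bounds, bounds[1:])]

    # Pack sections into ~TARGET_TOKENS chunks (tokens approximated as chars / 4)
    chunks, current = [], ""
    for section in sections:
        if current and (len(current) + len(section)) / 4 > TARGET_TOKENS:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks  # "".join(chunks) == full_text, i.e. 100% preservation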

Phase 3: Query Content (Efficient Retrieval)

python scripts/query_pdf.py search {cache_key} "supply chain security"

What happens:

  • Searches chunk index for relevant content
  • Ranks results by relevance
  • Returns only matching chunks
  • Displays token counts for transparency

Output:

  • List of matching chunks with previews
  • Relevance scores
  • Total tokens required (vs. full document)
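
Search over the chunk index can be as simple as term-frequency scoring with a coverage bonus. This is an illustrative sketch only; the ranking used by query_pdf.py may differ:

def search(chunks: list[str], query: str, top_k: int = 3) -> list[tuple[int, float]]:
    terms = query.lower().split()
    scored = []
    for chunk_id, text in enumerate(chunks):
        body = text.lower()
        hits = sum(body.count(term) for term in terms)               # raw term frequency
        coverage = sum(term in body for term in terms) / len(terms)  # fraction of query terms present
        if hits:
            scored.append((chunk_id, hits * coverage))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]  # best-matching chunk ids and relevance scores first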

Usage Examples

Example 1: Large NIST Document

User Request: "Extract and analyze NIST SP 800-161r1 for supply chain incident response procedures"

Your Workflow:

  1. Extract PDF (one-time):
python scripts/extract_pdf.py /path/to/NIST.SP.800-161r1-upd1.pdf

Output: Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4e5f6

  2. Chunk content:
python scripts/semantic_chunker.py NIST.SP.800-161r1-upd1_a1b2c3d4e5f6

Output: Created 87 chunks, 98.7% content preservation

  3. Search for relevant sections:
python scripts/query_pdf.py search NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 "supply chain incident response"

Output:

  • Chunk 23 - "Supply Chain Risk Management" (relevance: 87%, 1,850 tokens)
  • Chunk 45 - "Incident Response in C-SCRM" (relevance: 72%, 2,010 tokens)
  • Total: 3,860 tokens (vs. 48,000 for full document = 12.4x reduction)
  4. Retrieve specific chunks:
python scripts/query_pdf.py get NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 23

Output: Full content of chunk 23

  5. Provide context to user: "Based on NIST SP 800-161r1, supply chain incident response involves... [use chunk content]"

User Request: "I need to understand OT security incidents from NIST SP 800-82r3"

Your Workflow:

  1. Extract (one-time):
python scripts/extract_pdf.py /path/to/NIST.SP.800-82r3.pdf
  2. Chunk:
python scripts/semantic_chunker.py NIST.SP.800-82r3_x7y8z9
  3. First query - Overview:
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "OT security overview"
  4. Second query - Incidents:
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "incident response ICS"
  5. Third query - Specific threat:
python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "ransomware operational technology"

Result: Each query loads only the relevant chunks (~2-4 chunks, ~5,000 tokens) instead of the entire 8.2 MB document (120,000+ tokens)

Example 3: Table of Contents Navigation

User Request: "Show me the structure of this AWS security guide"

Your Workflow:

  1. Extract PDF:
python scripts/extract_pdf.py aws-security-guide.pdf
  2. Get TOC:
python scripts/query_pdf.py toc aws-security-guide_abc123

Output:

Chapter 1: Introduction (page 1)
  1.1 Security Fundamentals (page 3)
  1.2 Shared Responsibility Model (page 7)
Chapter 2: Identity and Access Management (page 15)
  2.1 IAM Best Practices (page 17)
  ...
  3. Navigate to specific section: Based on TOC, identify relevant chunk IDs and retrieve specific content.

Important Guidelines

Content Preservation

  • ALWAYS preserve 100% of PDF content - no summarization during extraction
  • Verify preservation rate in chunking statistics (should be >99.5%)
  • If preservation rate is low, investigate boundary detection issues

Token Efficiency

  • Target 12-25x token reduction compared to loading full PDF
  • Search before loading - don't load chunks blindly
  • Combine related chunks when context requires it
  • Show token counts to user for transparency

Cache Management

  • Cache key format: {pdf_name}_{hash}
  • Cache location: ~/.claude-cache/pdf/ on Linux/macOS (see Cache Location above)
  • Reuse cached extractions - don't re-extract unnecessarily
  • Use --force flag only when PDF has been modified
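
One plausible way to derive the {pdf_name}_{hash} key is a content hash of the file, so a modified PDF naturally produces a new cache entry; the exact algorithm and digest length used by the scripts are assumptions here:

import hashlib
from pathlib import Path

def cache_key(pdf_path: str) -> str:
    path = Path(pdf_path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()[:12]  # short content fingerprint (assumed length)
    return f"{path.stem}_{digest}"  # e.g. "NIST.SP.800-161r1-upd1_a1b2c3d4e5f6"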

Error Handling

  • If extraction fails, check PDF encryption status
  • If chunking produces few chunks, document may lack structure
  • If search returns no results, try broader keywords
  • If preservation rate < 99%, review boundary detection

Command Reference

Extract PDF

python scripts/extract_pdf.py <pdf_path> [--force] [--output-dir DIR]
  • pdf_path: Path to PDF file
  • --force: Re-extract even if cached
  • --output-dir: Also copy extracted files to DIR (skips the interactive prompts)

Chunk Text

python scripts/semantic_chunker.py <cache_key> [--target-size TOKENS]
  • cache_key: Cache key from extraction
  • --target-size: Target tokens per chunk (default: 2000)

List Cached PDFs

python scripts/query_pdf.py list

Search Chunks

python scripts/query_pdf.py search <cache_key> <query>
  • cache_key: PDF cache key
  • query: Keywords or phrase to search

Get Specific Chunk

python scripts/query_pdf.py get <cache_key> <chunk_id>
  • chunk_id: Chunk number to retrieve

View Statistics

python scripts/query_pdf.py stats <cache_key>

View Table of Contents

python scripts/query_pdf.py toc <cache_key>

Performance Metrics

Real-World Performance

NIST SP 800-161r1-upd1 (3.3 MB, 155 pages):

  • Extraction: ~45 seconds
  • Chunking: ~8 seconds
  • Full document tokens: ~48,000
  • Average query result: ~3,500 tokens
  • Token reduction: 13.7x

NIST SP 800-82r3 (8.2 MB, 247 pages):

  • Extraction: ~90 seconds
  • Chunking: ~15 seconds
  • Full document tokens: ~124,000
  • Average query result: ~5,200 tokens
  • Token reduction: 23.8x

Content Preservation Verification

All extractions maintain >99.5% content preservation rate:

  • Original characters = Sum of all chunk characters
  • No content lost during chunking
  • Semantic boundaries preserve context
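
That check reduces to a single character-count comparison; a minimal sketch:

def preservation_rate(full_text: str, chunks: list[str]) -> float:
    # 100.0 means every original character appears in some chunk
    return 100.0 * sum(len(chunk) for chunk in chunks) / len(full_text)

A healthy run should report a value above 99.5, as stated above.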

Technical Architecture

Extraction Layer (extract_pdf.py)

  • PyMuPDF (pymupdf) - Fast, reliable PDF parsing
  • Handles encrypted PDFs, complex layouts, embedded images
  • Extracts text, metadata, TOC, page structure
  • File hashing for cache validation

Chunking Layer (semantic_chunker.py)

  • Semantic boundary detection - Regex patterns for structure
  • Intelligent splitting - Respects chapters, sections, paragraphs
  • Size balancing - Splits large chunks, combines small ones
  • Content preservation - Mathematical verification

Query Layer (query_pdf.py)

  • Keyword search - Multi-term matching with relevance scoring
  • Context preservation - Shows match previews
  • Efficient retrieval - Loads only required chunks
  • Statistics tracking - Token usage transparency

Integration with Other Skills

Incident Response Playbook Creator

Use PDF Smart Extractor to:

  • Extract NIST SP 800-61r3 sections on-demand
  • Query specific incident types (ransomware, DDoS, etc.)
  • Reduce token usage for playbook generation

Cybersecurity Policy Generator

Use PDF Smart Extractor to:

  • Extract compliance framework requirements (ISO 27001, SOC 2)
  • Query specific control families
  • Reference authoritative sources efficiently

Research and Analysis Tasks

Use PDF Smart Extractor to:

  • Build knowledge bases from technical documentation
  • Compare multiple PDF sources
  • Extract specific sections for citation

Limitations and Considerations

What This Skill Does

  • Extracts 100% of PDF text content
  • Preserves document structure and metadata
  • Enables efficient querying with minimal tokens
  • Caches for instant reuse
  • Works offline (extraction is local)

What This Skill Does NOT Do

  • OCR for scanned PDFs (text must be extractable)
  • Image analysis (focuses on text content)
  • PDF creation or modification
  • Real-time collaboration or annotation
  • Automatic summarization (preserves full content)

Dependencies

  • Python 3.8+
  • PyMuPDF (pymupdf): pip install pymupdf
  • All other modules are from the standard library (json, re, pathlib, hashlib)

Success Criteria

A successful PDF extraction and query session should:

  1. Preserve 100% of content (preservation rate >99.5%)
  2. Achieve 12-25x token reduction for typical queries
  3. Complete extraction in <2 minutes for documents <10MB
  4. Return relevant results with clear relevance scoring
  5. Cache efficiently for instant reuse

Proactive Detection and Suggestion

CRITICAL: When user provides a PDF file path, ALWAYS:

  1. Check file size first:
ls -lh /path/to/file.pdf
# or
stat -f%z /path/to/file.pdf  # macOS
stat -c%s /path/to/file.pdf  # Linux
  2. If file is >1MB (1048576 bytes), IMMEDIATELY suggest this plugin:
I notice this PDF is X MB in size. For large PDFs, I recommend using the PDF Smart Extractor plugin which can:
- Extract 100% of content locally (no token usage for extraction)
- Enable querying with 12-100x token reduction
- Cache the PDF for instant future queries

Would you like me to:
1. Extract and chunk this PDF for efficient analysis? (recommended)
2. Try reading it directly (may hit token limits)?
  3. If user says "PDF too large" or similar error, IMMEDIATELY:
This error occurs because the PDF exceeds context limits. Let me use PDF Smart Extractor to solve this:
- I'll extract the PDF locally (no LLM involvement)
- Chunk it semantically at section boundaries
- Then query only the relevant parts

Starting extraction now...

User Communication

When using this skill, always:

  • Proactively check PDF size before attempting to read
  • Suggest this plugin for any PDF >1MB
  • Inform user of extraction progress (one-time setup)
  • Show cache key for future reference
  • Display token counts (query vs. full document)
  • Explain token savings achieved
  • Verify content preservation rate

Example communication:

I'll extract and analyze NIST SP 800-161r1 for you.

Step 1: Extracting PDF (one-time setup)...
✓ Extracted 155 pages (48,000 tokens)
✓ Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4

Step 2: Semantic chunking...
✓ Created 87 chunks (99.2% content preservation)

Step 3: Searching for "supply chain incident response"...
✓ Found 3 relevant chunks (3,860 tokens vs. 48,000 full document = 12.4x reduction)

Based on the relevant sections, supply chain incident response according to NIST SP 800-161r1 involves...
[provide analysis using chunk content]

Remember: This skill is designed to solve the "PDF too large" problem by extracting locally, chunking semantically, and querying efficiently. Always preserve 100% of content while minimizing token usage.