Initial commit
.claude-plugin/plugin.json (new file, 12 lines)
@@ -0,0 +1,12 @@
{
  "name": "pdf-smart-extractor",
  "description": "Extract and analyze large PDFs (3MB-10MB+) with minimal token usage. Preserves 100% of content while achieving 12-103x token reduction through local extraction, semantic chunking, and intelligent caching. Perfect for technical documentation, compliance frameworks, and research papers.",
  "version": "2.2.0",
  "author": {
    "name": "Diego Consolini",
    "email": "diego@diegocon.nl"
  },
  "agents": [
    "./agents/pdf-smart-extractor.md"
  ]
}
README.md (new file, 3 lines)
@@ -0,0 +1,3 @@
# pdf-smart-extractor

Extract and analyze large PDFs (3MB-10MB+) with minimal token usage. Preserves 100% of content while achieving 12-103x token reduction through local extraction, semantic chunking, and intelligent caching. Perfect for technical documentation, compliance frameworks, and research papers.
agents/pdf-smart-extractor.md (new file, 559 lines)
@@ -0,0 +1,559 @@
---
name: pdf-smart-extractor
description: Extract and analyze large PDFs (3MB-10MB+) with minimal token usage. Preserves 100% of content while achieving 12-103x token reduction through local extraction, semantic chunking, and intelligent caching.
capabilities: ["pdf-extraction", "semantic-chunking", "token-optimization", "large-document-analysis", "technical-documentation", "compliance-framework-analysis"]
tools: Read, Bash
model: inherit
---

# PDF Smart Extractor

**Extract and analyze large PDFs with minimal token usage**

---

## 🚨 AUTO-TRIGGER RULES (Read This First!)

**MANDATORY: You MUST use this skill when:**

1. **File Size Check:** User provides a `.pdf` file path
   - Run: `ls -lh <file_path>` or `stat -f%z <file_path>` (macOS) / `stat -c%s <file_path>` (Linux)
   - If size >1MB (1048576 bytes) → **AUTOMATICALLY suggest this plugin**
   - Do NOT attempt to read large PDFs directly

2. **Error Detection:** User encounters "PDF too large" or similar errors
   - **IMMEDIATELY** switch to this plugin
   - Explain: "This error occurs because the PDF exceeds context limits. I'll use PDF Smart Extractor to solve this."

3. **Technical Documentation Keywords:** User mentions:
   - NIST, ISO, AWS, Azure, GCP, CIS, SANS documents
   - Compliance frameworks, security guides, technical manuals
   - **PROACTIVELY check file size and suggest this plugin**

4. **Token Efficiency:** User asks to analyze multi-page PDFs (>50 pages)
   - Suggest this plugin to save 90%+ tokens

**Example Auto-Trigger:**
```
User: "Can you analyze /docs/NIST.SP.800-161r1-upd1.pdf?"

You: [First, check file size]
$ ls -lh /docs/NIST.SP.800-161r1-upd1.pdf
-rw-r--r-- 1 user staff 3.3M Oct 20 10:30 /docs/NIST.SP.800-161r1-upd1.pdf

[Detect: 3.3MB > 1MB → Auto-trigger plugin]

"I notice this PDF is 3.3 MB. For efficient analysis, I'll use PDF Smart Extractor, which will:
✓ Extract 100% of content locally (no token waste)
✓ Enable querying with 12-103x token reduction
✓ Cache for instant future access

Starting extraction now..."
```

---

## When to Use This Skill

**IMPORTANT: This skill should be automatically suggested when:**
- User provides a PDF file path and the file is >1MB
- User mentions "PDF too large" or "can't read PDF"
- User tries to analyze technical documentation (NIST, ISO, AWS, Azure, GCP guides)
- User mentions token limits or context window issues with PDFs
- User wants to extract or analyze multi-page PDFs (>50 pages)

Use this skill when users need to:
- Extract content from large PDF documents (>1MB, >50 pages)
- Analyze PDFs that exceed LLM context windows
- Query specific sections of technical documents (NIST, ISO, AWS guides, etc.)
- Preserve 100% of PDF content while minimizing token consumption
- Build knowledge bases from PDF documentation
- Search PDFs for specific topics or keywords
- Overcome "PDF too large" errors

**Trigger phrases (explicit):**
- "extract this PDF"
- "analyze [PDF file]"
- "search [PDF] for [topic]"
- "what does [PDF] say about [topic]"
- "chunk this large PDF"
- "process NIST document"
- "read this PDF: /path/to/file.pdf"
- "can you analyze this technical document"

**Trigger phrases (implicit - auto-detect):**
- User provides path ending in `.pdf` and file size >1MB
- "PDF too large to read"
- "can't open this PDF"
- "this PDF won't load"
- "help me with this NIST/ISO/AWS/compliance document"
- "extract information from [large document]"
- "I have a big PDF file"

**Auto-detection logic** (see the sketch after this list):
When user provides a file path:
1. Check if the file extension is `.pdf`
2. Check the file size using `ls -lh` or `stat`
3. If size >1MB, proactively suggest: "This PDF is X MB. I can use PDF Smart Extractor to process it efficiently with up to 100x fewer tokens. Would you like me to extract and chunk it?"

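A minimal sketch of this detection in Python (the helper name and inline threshold are illustrative, not part of the plugin's scripts):

```python
import os

ONE_MB = 1_048_576  # the 1MB threshold from the rules above

def should_suggest_extractor(path: str) -> bool:
    """True when the path is a .pdf whose size exceeds 1MB."""
    return path.lower().endswith(".pdf") and os.path.getsize(path) > ONE_MB

# Usage: should_suggest_extractor("/docs/NIST.SP.800-161r1-upd1.pdf") -> True for a 3.3MB file
```
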
## Core Capabilities

### 1. Local PDF Extraction (Zero LLM Involvement)
- Extracts 100% of PDF content using PyMuPDF (see the sketch below)
- No LLM calls during extraction - fully local processing
- Preserves metadata, table of contents, and document structure
- Caches extracted content for instant reuse

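To illustrate the extraction layer, here is a minimal PyMuPDF sketch; it is not the actual `extract_pdf.py`, and the output layout is assumed from the cache contents listed later:

```python
import json
import pathlib

import fitz  # PyMuPDF: pip install pymupdf

def extract(pdf_path: str, out_dir: str) -> None:
    """Dump full text, per-page text, metadata, and TOC locally, with no LLM calls."""
    doc = fitz.open(pdf_path)
    pages = [page.get_text() for page in doc]
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "full_text.txt").write_text("".join(pages), encoding="utf-8")
    (out / "pages.json").write_text(json.dumps(pages), encoding="utf-8")
    (out / "metadata.json").write_text(json.dumps(doc.metadata), encoding="utf-8")
    (out / "toc.json").write_text(json.dumps(doc.get_toc()), encoding="utf-8")
```
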
### 2. Semantic Chunking
- Splits text at intelligent boundaries (chapters, sections, paragraphs); see the sketch below
- Preserves context and meaning across chunks
- Target chunk size: ~2000 tokens (configurable)
- 100% content preservation guaranteed

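A minimal sketch of the chunking idea, assuming a ~4 characters/token estimate and a single hypothetical heading pattern (the real `semantic_chunker.py` is not shown in this commit):

```python
import re
from typing import List

TARGET_TOKENS = 2000
CHARS_PER_TOKEN = 4  # rough estimate, an assumption

# Hypothetical boundary: "Chapter 4 ..." or numbered headings like "3.2 Title"
BOUNDARY = re.compile(r"^(?:Chapter \d+\b.*|\d+(?:\.\d+)*\s+\S.*)$", re.MULTILINE)

def chunk(text: str) -> List[str]:
    """Split at heading boundaries, then pack sections into ~2000-token chunks."""
    starts = sorted({0} | {m.start() for m in BOUNDARY.finditer(text)})
    sections = [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]
    chunks: List[str] = []
    current = ""
    for section in sections:
        if current and len(current + section) > TARGET_TOKENS * CHARS_PER_TOKEN:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks  # "".join(chunks) == text, i.e. 100% preservation
```
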
### 3. Efficient Querying
- Search chunks by keywords or topics
- Load only relevant chunks (12-25x token reduction)
- Ranked results by relevance
- Combine multiple chunks as needed

### 4. Persistent Caching
- One-time extraction per PDF
- Instant access to cached content
- File hash verification for integrity (see the sketch below)
- Automatic cache management

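For illustration, a cache key of the form `{pdf_name}_{hash}` might be derived like this; the hashing details are an assumption, not necessarily what `extract_pdf.py` does:

```python
import hashlib
import pathlib

def cache_key(pdf_path: str) -> str:
    """Derive {pdf_name}_{hash} from file bytes, so a modified PDF gets a fresh cache entry."""
    p = pathlib.Path(pdf_path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()[:12]  # 12 hex chars, assumed
    return f"{p.stem}_{digest}"

# e.g. cache_key("NIST.SP.800-161r1-upd1.pdf") -> "NIST.SP.800-161r1-upd1_a1b2c3d4e5f6"
```
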
**⚠️ IMPORTANT: Cache Location**

Extracted content is stored in a **user cache directory**, NOT the working directory:

**Cache locations by platform:**
- **Linux/Mac:** `~/.claude-cache/pdf/{pdf_name}_{hash}/`
- **Windows:** `C:\Users\{username}\.claude-pdf-cache\{pdf_name}_{hash}\`

**Why a cache directory?**
1. **Persistent caching:** Extract once, query forever - even across different projects
2. **Cross-project reuse:** The same PDF analyzed from different projects uses the same cache
3. **Performance:** Subsequent queries are instant (no re-extraction needed)
4. **Token optimization:** 12-103x reduction by loading only relevant chunks

**Cache contents:**
- `full_text.txt` - Complete extracted text
- `pages.json` - Page-by-page content
- `metadata.json` - PDF metadata
- `toc.json` - Table of contents
- `manifest.json` - Cache manifest

**Accessing cached content:**
```bash
# List all cached PDFs
python scripts/query_pdf.py list

# Query cached content
python scripts/query_pdf.py search {cache_key} "your query"

# Find cache location (shown in extraction output)
# Example: ~/.claude-cache/pdf/document_a1b2c3d4/
```

**If you need files in the working directory:**
```bash
# Option 1: Use the --output-dir flag during extraction
python scripts/extract_pdf.py document.pdf --output-dir ./extracted

# Option 2: Copy from cache manually
cp -r ~/.claude-cache/pdf/{cache_key}/* ./extracted_content/
```

**Note:** The cache is local and not meant for version control. Keep original PDFs in your repo and let each developer extract locally (a one-time operation).

## Workflow

### Phase 1: Extract PDF (One-Time Setup)
```bash
# Extract to cache (default)
python scripts/extract_pdf.py /path/to/document.pdf

# Extract and copy to working directory (interactive prompts)
python scripts/extract_pdf.py /path/to/document.pdf
# Will prompt: "Copy files? (y/n)"
# Will ask: "Keep cache? (y/n)"

# Extract and copy to a specific directory (no prompts)
python scripts/extract_pdf.py /path/to/document.pdf --output-dir ./extracted
```

**What happens:**
- Reads the entire PDF locally
- Extracts text, metadata, table of contents
- Saves to `~/.claude-cache/pdf/{cache_key}/`
- Returns a cache key for future queries

**Output:**
- `full_text.txt` - Complete document text
- `pages.json` - Structured page data
- `metadata.json` - PDF metadata
- `toc.json` - Table of contents (if available)
- `manifest.json` - Extraction statistics

### Phase 2: Chunk Content (Semantic Organization)
```bash
python scripts/semantic_chunker.py {cache_key}
```

**What happens:**
- Detects semantic boundaries (chapters, sections, paragraphs)
- Splits text at intelligent boundaries
- Creates ~2000-token chunks
- Preserves 100% of content

**Output:**
- `chunks.json` - Chunk index with metadata
- `chunks/chunk_0000.txt` - Individual chunk files
- Statistics: total chunks, token distribution, preservation rate

### Phase 3: Query Content (Efficient Retrieval)
```bash
python scripts/query_pdf.py search {cache_key} "supply chain security"
```

**What happens:**
- Searches the chunk index for relevant content
- Ranks results by relevance
- Returns only matching chunks
- Displays token counts for transparency

**Output:**
- List of matching chunks with previews
- Relevance scores
- Total tokens required (vs. full document)

## Usage Examples

### Example 1: Large NIST Document

**User Request:** "Extract and analyze NIST SP 800-161r1 for supply chain incident response procedures"

**Your Workflow:**

1. **Extract PDF (one-time):**
   ```bash
   python scripts/extract_pdf.py /path/to/NIST.SP.800-161r1-upd1.pdf
   ```
   Output: `Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4e5f6`

2. **Chunk content:**
   ```bash
   python scripts/semantic_chunker.py NIST.SP.800-161r1-upd1_a1b2c3d4e5f6
   ```
   Output: Created 87 chunks, 98.7% content preservation

3. **Search for relevant sections:**
   ```bash
   python scripts/query_pdf.py search NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 "supply chain incident response"
   ```
   Output:
   - Chunk 23 - "Supply Chain Risk Management" (relevance: 87%, 1,850 tokens)
   - Chunk 45 - "Incident Response in C-SCRM" (relevance: 72%, 2,010 tokens)
   - Total: 3,860 tokens (vs. 48,000 for full document = 12.4x reduction)

4. **Retrieve specific chunks:**
   ```bash
   python scripts/query_pdf.py get NIST.SP.800-161r1-upd1_a1b2c3d4e5f6 23
   ```
   Output: Full content of chunk 23

5. **Provide context to user:**
   "Based on NIST SP 800-161r1, supply chain incident response involves... [use chunk content]"

### Example 2: Multiple Related Queries

**User Request:** "I need to understand OT security incidents from NIST SP 800-82r3"

**Your Workflow:**

1. **Extract (one-time):**
   ```bash
   python scripts/extract_pdf.py /path/to/NIST.SP.800-82r3.pdf
   ```

2. **Chunk:**
   ```bash
   python scripts/semantic_chunker.py NIST.SP.800-82r3_x7y8z9
   ```

3. **First query - Overview:**
   ```bash
   python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "OT security overview"
   ```

4. **Second query - Incidents:**
   ```bash
   python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "incident response ICS"
   ```

5. **Third query - Specific threat:**
   ```bash
   python scripts/query_pdf.py search NIST.SP.800-82r3_x7y8z9 "ransomware operational technology"
   ```

**Result:** Each query loads only the relevant chunks (~2-4 chunks, ~5,000 tokens) instead of the entire 8.2MB document (120,000+ tokens)

### Example 3: Table of Contents Navigation

**User Request:** "Show me the structure of this AWS security guide"

**Your Workflow:**

1. **Extract PDF:**
   ```bash
   python scripts/extract_pdf.py aws-security-guide.pdf
   ```

2. **Get TOC:**
   ```bash
   python scripts/query_pdf.py toc aws-security-guide_abc123
   ```
   Output:
   ```
   Chapter 1: Introduction (page 1)
     1.1 Security Fundamentals (page 3)
     1.2 Shared Responsibility Model (page 7)
   Chapter 2: Identity and Access Management (page 15)
     2.1 IAM Best Practices (page 17)
   ...
   ```

3. **Navigate to specific section:**
   Based on the TOC, identify relevant chunk IDs and retrieve specific content.

## Important Guidelines

### Content Preservation
- **ALWAYS preserve 100% of PDF content** - no summarization during extraction
- Verify preservation rate in chunking statistics (should be >99.5%)
- If preservation rate is low, investigate boundary detection issues

### Token Efficiency
- **Target 12-25x token reduction** compared to loading the full PDF
- Search before loading - don't load chunks blindly
- Combine related chunks when context requires it
- Show token counts to user for transparency

### Cache Management
- Cache key format: `{pdf_name}_{hash}`
- Cache location: `~/.claude-cache/pdf/`
- Reuse cached extractions - don't re-extract unnecessarily
- Use `--force` flag only when the PDF has been modified

### Error Handling
- If extraction fails, check PDF encryption status
- If chunking produces few chunks, document may lack structure
- If search returns no results, try broader keywords
- If preservation rate < 99%, review boundary detection

## Command Reference

### Extract PDF
```bash
python scripts/extract_pdf.py <pdf_path> [--force]
```
- `pdf_path`: Path to PDF file
- `--force`: Re-extract even if cached

### Chunk Text
```bash
python scripts/semantic_chunker.py <cache_key> [--target-size TOKENS]
```
- `cache_key`: Cache key from extraction
- `--target-size`: Target tokens per chunk (default: 2000)

### List Cached PDFs
```bash
python scripts/query_pdf.py list
```

### Search Chunks
```bash
python scripts/query_pdf.py search <cache_key> <query>
```
- `cache_key`: PDF cache key
- `query`: Keywords or phrase to search

### Get Specific Chunk
```bash
python scripts/query_pdf.py get <cache_key> <chunk_id>
```
- `chunk_id`: Chunk number to retrieve

### View Statistics
```bash
python scripts/query_pdf.py stats <cache_key>
```

### View Table of Contents
```bash
python scripts/query_pdf.py toc <cache_key>
```

## Performance Metrics

### Real-World Performance

**NIST SP 800-161r1-upd1 (3.3 MB, 155 pages):**
- Extraction: ~45 seconds
- Chunking: ~8 seconds
- Full document tokens: ~48,000
- Average query result: ~3,500 tokens
- Token reduction: 13.7x

**NIST SP 800-82r3 (8.2 MB, 247 pages):**
- Extraction: ~90 seconds
- Chunking: ~15 seconds
- Full document tokens: ~124,000
- Average query result: ~5,200 tokens
- Token reduction: 23.8x

### Content Preservation Verification

All extractions maintain a >99.5% content preservation rate (see the sketch below):
- Original characters = sum of all chunk characters
- No content lost during chunking
- Semantic boundaries preserve context

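A sketch of that character-count check, assuming the cache layout shown earlier (this mirrors the verification idea, not the actual script):

```python
import pathlib

def preservation_rate(cache_dir: str) -> float:
    """Ratio of total chunk characters to the original full text's characters."""
    cache = pathlib.Path(cache_dir).expanduser()
    original_chars = len((cache / "full_text.txt").read_text(encoding="utf-8"))
    chunk_chars = sum(
        len(f.read_text(encoding="utf-8")) for f in cache.glob("chunks/chunk_*.txt")
    )
    return chunk_chars / original_chars  # expect > 0.995

# e.g. preservation_rate("~/.claude-cache/pdf/NIST.SP.800-161r1-upd1_a1b2c3d4e5f6")
```
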
## Technical Architecture

### Extraction Layer (extract_pdf.py)
- **PyMuPDF (pymupdf)** - Fast, reliable PDF parsing
- Handles encrypted PDFs, complex layouts, embedded images
- Extracts text, metadata, TOC, page structure
- File hashing for cache validation

### Chunking Layer (semantic_chunker.py)
- **Semantic boundary detection** - Regex patterns for structure (see the sketch below)
- **Intelligent splitting** - Respects chapters, sections, paragraphs
- **Size balancing** - Splits large chunks, combines small ones
- **Content preservation** - Mathematical verification

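Hypothetical examples of the kind of boundary patterns such a layer might use (the actual regexes are not included in this commit):

```python
import re
from typing import List

# Assumed structural markers; the real patterns may differ
BOUNDARY_PATTERNS = [
    re.compile(r"^Chapter\s+\d+\b.*$", re.MULTILINE),     # "Chapter 2: Identity ..."
    re.compile(r"^\d+(?:\.\d+)+\s+\S.*$", re.MULTILINE),  # "2.1 IAM Best Practices"
    re.compile(r"^[A-Z][A-Z /-]{6,}$", re.MULTILINE),     # ALL-CAPS headings
]

def boundary_offsets(text: str) -> List[int]:
    """Collect candidate split points where a new section appears to begin."""
    offsets = {m.start() for p in BOUNDARY_PATTERNS for m in p.finditer(text)}
    return sorted(offsets)
```
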
### Query Layer (query_pdf.py)
- **Keyword search** - Multi-term matching with relevance scoring (see the sketch below)
- **Context preservation** - Shows match previews
- **Efficient retrieval** - Loads only required chunks
- **Statistics tracking** - Token usage transparency

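One plausible shape for that scoring, shown as a sketch (the weights and formula are assumptions, not `query_pdf.py`'s actual logic):

```python
from typing import List, Tuple

def rank_chunks(query: str, chunks: List[str]) -> List[Tuple[int, float]]:
    """Rank chunks by the fraction of query terms present, nudged by term frequency."""
    terms = query.lower().split()
    scored = []
    for idx, chunk in enumerate(chunks):
        text = chunk.lower()
        coverage = sum(1 for t in terms if t in text) / len(terms)
        if coverage:
            frequency = sum(text.count(t) for t in terms)
            scored.append((idx, coverage + 0.01 * frequency))
    return sorted(scored, key=lambda item: -item[1])

# rank_chunks("supply chain incident response", chunks) -> [(23, 1.34), (45, 1.12), ...]
```
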
## Integration with Other Skills

### Incident Response Playbook Creator
Use PDF Smart Extractor to:
- Extract NIST SP 800-61r3 sections on-demand
- Query specific incident types (ransomware, DDoS, etc.)
- Reduce token usage for playbook generation

### Cybersecurity Policy Generator
Use PDF Smart Extractor to:
- Extract compliance framework requirements (ISO 27001, SOC 2)
- Query specific control families
- Reference authoritative sources efficiently

### Research and Analysis Tasks
Use PDF Smart Extractor to:
- Build knowledge bases from technical documentation
- Compare multiple PDF sources
- Extract specific sections for citation

## Limitations and Considerations

### What This Skill Does
- ✅ Extracts 100% of PDF text content
- ✅ Preserves document structure and metadata
- ✅ Enables efficient querying with minimal tokens
- ✅ Caches for instant reuse
- ✅ Works offline (extraction is local)

### What This Skill Does NOT Do
- ❌ OCR for scanned PDFs (text must be extractable)
- ❌ Image analysis (focuses on text content)
- ❌ PDF creation or modification
- ❌ Real-time collaboration or annotation
- ❌ Automatic summarization (preserves full content)

### Dependencies
- Python 3.8+
- PyMuPDF (pymupdf): `pip install pymupdf`
- Otherwise standard library only (json, re, pathlib, hashlib)

## Success Criteria

A successful PDF extraction and query session should:
1. **Preserve 100% of content** (preservation rate >99.5%)
2. **Achieve 12-25x token reduction** for typical queries
3. **Complete extraction** in <2 minutes for documents <10MB
4. **Return relevant results** with clear relevance scoring
5. **Cache efficiently** for instant reuse

## Proactive Detection and Suggestion

**CRITICAL: When user provides a PDF file path, ALWAYS:**

1. **Check file size first:**
   ```bash
   ls -lh /path/to/file.pdf
   # or
   stat -f%z /path/to/file.pdf  # macOS
   stat -c%s /path/to/file.pdf  # Linux
   ```

2. **If the file is >1MB (1048576 bytes), IMMEDIATELY suggest this plugin:**
   ```
   I notice this PDF is X MB in size. For large PDFs, I recommend using the PDF Smart Extractor plugin, which can:
   - Extract 100% of content locally (no token usage for extraction)
   - Enable querying with 12-103x token reduction
   - Cache the PDF for instant future queries

   Would you like me to:
   1. Extract and chunk this PDF for efficient analysis? (recommended)
   2. Try reading it directly (may hit token limits)?
   ```

3. **If the user says "PDF too large" or reports a similar error, IMMEDIATELY:**
   ```
   This error occurs because the PDF exceeds context limits. Let me use PDF Smart Extractor to solve this:
   - I'll extract the PDF locally (no LLM involvement)
   - Chunk it semantically at section boundaries
   - Then query only the relevant parts

   Starting extraction now...
   ```

## User Communication

When using this skill, always:
- **Proactively check PDF size** before attempting to read
- **Suggest this plugin** for any PDF >1MB
- **Inform user of extraction progress** (one-time setup)
- **Show cache key** for future reference
- **Display token counts** (query vs. full document)
- **Explain token savings** achieved
- **Verify content preservation** rate

**Example communication:**
```
I'll extract and analyze NIST SP 800-161r1 for you.

Step 1: Extracting PDF (one-time setup)...
✓ Extracted 155 pages (48,000 tokens)
✓ Cache key: NIST.SP.800-161r1-upd1_a1b2c3d4

Step 2: Semantic chunking...
✓ Created 87 chunks (99.2% content preservation)

Step 3: Searching for "supply chain incident response"...
✓ Found 3 relevant chunks (3,860 tokens vs. 48,000 full document = 12.4x reduction)

Based on the relevant sections, supply chain incident response according to NIST SP 800-161r1 involves...
[provide analysis using chunk content]
```

---

**Remember:** This skill is designed to solve the "PDF too large" problem by extracting locally, chunking semantically, and querying efficiently. Always preserve 100% of content while minimizing token usage.

plugin.lock.json (new file, 45 lines)
@@ -0,0 +1,45 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:diegocconsolini/ClaudeSkillCollection:pdf-smart-extractor",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "1c57fb012952cd286765c436facccb803afc9f23",
    "treeHash": "2c10f634be2510d0b13ceb98414aa8764f45bcc6ada4c9536abdf8d76db008d6",
    "generatedAt": "2025-11-28T10:16:26.231050Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "pdf-smart-extractor",
    "description": "Extract and analyze large PDFs (3MB-10MB+) with minimal token usage. Preserves 100% of content while achieving 12-103x token reduction through local extraction, semantic chunking, and intelligent caching. Perfect for technical documentation, compliance frameworks, and research papers.",
    "version": "2.2.0"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "5ec3dcf1ae8baf9554e419422aff27f9f19414c897db057726babfd371a72a30"
      },
      {
        "path": "agents/pdf-smart-extractor.md",
        "sha256": "70b20213d0a5923421d28c8ac5d4c4e67f9580bf119a5d055c7663a529991d3a"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "be760ebd3890be546e3bfdc25a74ea7aec02980fb315f25d47ce4cb24fc0dfdc"
      }
    ],
    "dirSha256": "2c10f634be2510d0b13ceb98414aa8764f45bcc6ada4c9536abdf8d76db008d6"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}