Initial commit

Author: Zhongwei Li
Date: 2025-11-30 09:05:19 +08:00
Commit: 09fec2555b
96 changed files with 24269 additions and 0 deletions

# PDF Extraction Benchmarks
## Enterprise Benchmark (2025 Procycons)
Production-grade comparison of ML-based PDF extraction tools.
| Tool | Table Accuracy | Text Fidelity | Speed (s/page) | Memory (GB) |
|------|----------------|---------------|----------------|-------------|
| **Docling** | **97.9%** | **100%** | 6.28 | 2.1 |
| Marker | 89.2% | 98.5% | 8.45 | 3.5 |
| MinerU | 92.1% | 99.2% | 12.33 | 4.2 |
| Unstructured.io | 75.0% | 95.8% | 51.02 | 1.8 |
| PyMuPDF4LLM | 82.3% | 97.1% | 4.12 | 1.2 |
| LlamaParse | 88.5% | 97.3% | 6.00 | N/A (cloud) |
**Test corpus:** 500 academic papers, business reports, financial statements (mixed complexity)
**Key finding:** Docling leads in table accuracy at competitive speed; Unstructured.io, despite its popularity, performs poorly on this corpus.
*Source: Procycons Enterprise PDF Processing Benchmark 2025*
## Academic PDF Test (This Research)
Real-world testing on distributed cognition literature.
### Test Environment
- **PDFs:** 4 academic books
- **Total size:** 98.2 MB
- **Pages:** ~400 pages combined
- **Content:** Multi-column layouts, tables, figures, references
### Test Results
#### Speed (90-page PDF, 1.9 MB)
| Tool | Total Time | Per Page | Speedup |
|------|------------|----------|---------|
| pdftotext | 0.63s | 0.007s/page | 60x |
| PyMuPDF | 1.18s | 0.013s/page | 33x |
| Docling | 38.86s | 0.432s/page | 1x |
| pdfplumber | 38.91s | 0.432s/page | 1x |
#### Quality (Issues per document)
| Tool | Consecutive Spaces | Excessive Newlines | Control Chars | Garbled | Total |
|------|-------------------|-------------------|---------------|---------|-------|
| pdfplumber | 0 | 0 | 0 | 0 | **0** |
| PyMuPDF | 1 | 0 | 0 | 0 | **1** |
| Docling | 48 | 2 | 0 | 0 | **50** |
| pdftotext | 85 | 5 | 0 | 0 | **90** |
#### Structure Preservation
| Tool | Headers | Tables | Lists | Images |
|------|---------|--------|-------|--------|
| Docling | ✓ 36 | ✓ 16 rows | ✓ 307 items | ✓ 4 markers |
| PyMuPDF | ✗ | ✗ | ✗ | ✗ |
| pdfplumber | ✗ | ✗ | ✗ | ✗ |
| pdftotext | ✗ | ✗ | ✗ | ✗ |
**Key finding:** Docling is the only tool tested that preserves document structure.
## Production Recommendations
### By Use Case
**Academic research / Literature review:**
- **Primary:** Docling (structure essential)
- **Secondary:** PyMuPDF (speed for large batches)
**RAG system ingestion:**
- **Recommended:** Docling (semantic structure preserved)
- **Alternative:** PyMuPDF + post-processing
**Quick text extraction:**
- **Recommended:** PyMuPDF (~33x faster than Docling)
- **Alternative:** pdftotext (fastest, lower quality)
**Maximum quality (legal, financial):**
- **Recommended:** pdfplumber (zero quality issues in testing)
- **Alternative:** Docling (structure + good quality)
### By Document Type
**Academic papers:** Docling (tables, multi-column, references)
**Books/ebooks:** PyMuPDF (simple linear text)
**Business reports:** Docling (tables, charts, sections)
**Scanned documents:** Docling with OCR enabled
**Legal contracts:** pdfplumber (maximum fidelity)
## ML Model Performance (Docling)
### RT-DETR (Layout Detection)
- **Speed:** 44-633ms per page
- **Accuracy:** ~95% layout element detection
- **Detects:** Text blocks, headers, tables, figures, captions
### TableFormer (Table Structure)
- **Speed:** 400ms-1.74s per table
- **Accuracy:** 97.9% cell-level accuracy
- **Handles:** Borderless tables, merged cells, nested tables
## Cloud vs On-Device
| Tool | Processing | Privacy | Cost | Speed |
|------|-----------|---------|------|-------|
| Docling | On-device | ✓ Private | Free | 0.43s/page |
| LlamaParse | Cloud API | ✗ Sends data | $0.003/page | 6s/page |
| Claude Vision | Cloud API | ✗ Sends data | $0.0075/page | Variable |
| Mathpix | Cloud API | ✗ Sends data | $0.004/page | 4s/page |
**Recommendation:** Use on-device (Docling) for sensitive/unpublished academic work.
## Benchmark Methodology
### Speed Testing
```python
import time

# `converter` and `page_count` come from the tool under test,
# e.g. a Docling DocumentConverter and the PDF's page count.
start = time.time()
result = converter.convert(pdf_path)
elapsed = time.time() - start
per_page = elapsed / page_count
```
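The same measurement can be generalized into a small reusable harness so every tool is timed identically (the `benchmark` name and callable signature are illustrative, not part of any tool's API):

```python
import time

def benchmark(extract_fn, pdf_path, page_count):
    """Time one extraction run; return (total_seconds, seconds_per_page)."""
    start = time.perf_counter()
    extract_fn(pdf_path)  # e.g. lambda p: converter.convert(p)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed / page_count
```

`time.perf_counter()` is preferred over `time.time()` for interval timing because it is monotonic and has higher resolution.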
### Quality Testing
```python
import re

# Count quality issues in the extracted text
consecutive_spaces = len(re.findall(r' {2,}', text))    # runs of 2+ spaces
excessive_newlines = len(re.findall(r'\n{4,}', text))   # 4+ consecutive newlines
control_chars = len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text))
garbled_chars = len(re.findall(r'\ufffd', text))        # U+FFFD replacement chars
total_issues = consecutive_spaces + excessive_newlines + control_chars + garbled_chars
```
### Structure Testing
```python
import re

# Count markdown elements in the converted output
headers = len(re.findall(r'^#{1,6}\s+.+$', markdown, re.MULTILINE))
table_rows = len(re.findall(r'\|.+\|', markdown))  # counts table rows, not tables
list_items = len(re.findall(r'^\s*[-*]\s+', markdown, re.MULTILINE))
```
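Run against a toy markdown string, these counters behave as follows (note the table pattern counts rows, which matches the "16 rows" figure reported above):

```python
import re

sample = "# Title\n\nIntro text.\n\n- item one\n- item two\n\n| a | b |\n| 1 | 2 |\n"
headers = len(re.findall(r'^#{1,6}\s+.+$', sample, re.MULTILINE))   # the one H1
table_rows = len(re.findall(r'\|.+\|', sample))                     # two table rows
list_items = len(re.findall(r'^\s*[-*]\s+', sample, re.MULTILINE))  # two bullets
```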

# PDF Extraction Quality Metrics
## Key Metrics
### 1. Consecutive Spaces
**What:** Multiple spaces in sequence (2+)
**Pattern:** ` {2,}`
**Impact:** Formatting artifacts, token waste
**Good:** < 50 occurrences
**Bad:** > 100 occurrences
### 2. Excessive Newlines
**What:** 4+ consecutive newlines
**Pattern:** `\n{4,}`
**Impact:** Page breaks treated as whitespace
**Good:** < 20 occurrences
**Bad:** > 50 occurrences
### 3. Control Characters
**What:** Non-printable characters
**Pattern:** `[\x00-\x08\x0b\x0c\x0e-\x1f]`
**Impact:** Parsing errors, display issues
**Good:** 0 occurrences
**Bad:** > 0 occurrences
### 4. Garbled Characters
**What:** Unicode replacement characters (U+FFFD, `�`)
**Pattern:** `\ufffd`
**Impact:** Lost information, encoding failures
**Good:** 0 occurrences
**Bad:** > 0 occurrences
### 5. Hyphenation Breaks
**What:** End-of-line hyphens not joined
**Pattern:** `\w+-\n\w+`
**Impact:** Word splitting affects search
**Good:** < 10 occurrences
**Bad:** > 50 occurrences
### 6. Ligature Encoding
**What:** Special character combinations
**Examples:** `/uniFB00` (ff), `/uniFB01` (fi), `/uniFB03` (ffi)
**Impact:** Search failures, readability
**Fix:** Post-process with regex replacement
## Quality Score Formula
```python
total_issues = (
    consecutive_spaces +
    excessive_newlines +
    control_chars +
    garbled_chars
)
# Garbled characters already count once in total_issues; the extra 10x
# penalty reflects unrecoverable data loss.
quality_score = garbled_chars * 10 + total_issues
# Lower is better
```
**Ranking:**
- Excellent: < 10 score
- Good: 10-50 score
- Fair: 50-100 score
- Poor: > 100 score
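The bands above encode directly as a small helper (the function name is illustrative; boundary values follow the ranking, treating a score of exactly 50 as Good and 100 as Fair):

```python
def rate_quality(score):
    """Map a quality score to its rating band."""
    if score < 10:
        return "Excellent"
    if score <= 50:
        return "Good"
    if score <= 100:
        return "Fair"
    return "Poor"
```

These bands reproduce the ratings in the test-results table below (e.g. Docling's score of 50 rates Good, pdftotext's 90 rates Fair).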
## Analysis Script
```python
import re

def analyze_quality(text):
    """Analyze PDF extraction quality."""
    return {
        'chars': len(text),
        'words': len(text.split()),
        'consecutive_spaces': len(re.findall(r' {2,}', text)),
        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
        'garbled_chars': len(re.findall(r'\ufffd', text)),
        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text)),
    }

# Usage
with open("extracted.txt") as f:
    text = f.read()
metrics = analyze_quality(text)
total = (metrics['consecutive_spaces'] + metrics['excessive_newlines']
         + metrics['control_chars'] + metrics['garbled_chars'])
print(f"Quality score: {metrics['garbled_chars'] * 10 + total}")
```
## Test Results (90-page Academic PDF)
| Tool | Total Issues | Garbled | Quality Score | Rating |
|------|--------------|---------|---------------|--------|
| pdfplumber | 0 | 0 | 0 | Excellent |
| PyMuPDF | 1 | 0 | 1 | Excellent |
| Docling | 50 | 0 | 50 | Good |
| pdftotext | 90 | 0 | 90 | Fair |
| pdfminer | 45 | 0 | 45 | Good |
| pypdf | 120 | 5 | 170 | Poor |
## Content Completeness
### Phrase Coverage Analysis
Extract 3-word phrases from each tool's output:
```python
import re

def extract_phrases(text):
    """Set of all 3-word phrases (lowercased, letters only)."""
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

# `texts` maps tool name -> extracted text for the same PDF
common = set.intersection(*[extract_phrases(t) for t in texts.values()])
```
**Results:**
- Common phrases: 10,587 (captured by all tools)
- Docling unique: 17,170 phrases (most complete)
- pdfplumber unique: 8,229 phrases (conservative)
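A self-contained toy run shows how shared and unique phrase sets fall out, using two short strings in place of full tool outputs:

```python
import re

def extract_phrases(text):
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

tool_a = extract_phrases("distributed cognition extends the mind")
tool_b = extract_phrases("distributed cognition extends beyond the mind")
common = tool_a & tool_b    # phrases both extractions captured
unique_a = tool_a - tool_b  # phrases only the first captured
```

A large `unique` set relative to `common` indicates content one tool captured that the others missed, which is how Docling's 17,170 unique phrases were interpreted as greater completeness.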
## Cleaning Strategies
### Fix Ligatures
```python
import re

def fix_ligatures(text):
    """Fix PDF ligature encoding."""
    replacements = {
        r'/uniFB00': 'ff',
        r'/uniFB01': 'fi',
        r'/uniFB02': 'fl',
        r'/uniFB03': 'ffi',
        r'/uniFB04': 'ffl',
    }
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text
```
### Normalize Whitespace
```python
import re

def normalize_whitespace(text):
    """Clean excessive whitespace."""
    text = re.sub(r' {2,}', ' ', text)        # Space runs → single space
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Many newlines → max 3
    return text.strip()
### Join Hyphenated Words
```python
import re

def join_hyphens(text):
    """Join end-of-line hyphenated words."""
    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
```
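The three cleaning strategies compose into a single pass (a sketch; the combined function name is illustrative, and ligatures are replaced as plain substrings since the glyph names contain no regex metacharacters):

```python
import re

LIGATURES = {'/uniFB00': 'ff', '/uniFB01': 'fi', '/uniFB02': 'fl',
             '/uniFB03': 'ffi', '/uniFB04': 'ffl'}

def clean_extracted_text(text):
    """Ligature fixes, then hyphen joining, then whitespace normalization."""
    for glyph, repl in LIGATURES.items():
        text = text.replace(glyph, repl)
    text = re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)  # join hyphen breaks
    text = re.sub(r' {2,}', ' ', text)                    # collapse space runs
    text = re.sub(r'\n{4,}', '\n\n\n', text)              # cap blank lines
    return text.strip()
```

Hyphen joining runs before whitespace normalization so line-break context is still intact when the `-\n` pattern is matched.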

# PDF Tool Comparison
## Summary Table
| Tool | Type | Speed | Quality Issues | Garbled | Structure | License |
|------|------|-------|----------------|---------|-----------|---------|
| **Docling** | ML | 0.43s/page | 50 | 0 | ✓ Yes | Apache 2.0 |
| **PyMuPDF** | Traditional | 0.01s/page | 1 | 0 | ✗ No | AGPL |
| **pdfplumber** | Traditional | 0.44s/page | 0 | 0 | ✗ No | MIT |
| **pdftotext** | Traditional | 0.007s/page | 90 | 0 | ✗ No | GPL |
| **pdfminer.six** | Traditional | 0.15s/page | 45 | 0 | ✗ No | MIT |
| **pypdf** | Traditional | 0.25s/page | 120 | 5 | ✗ No | BSD |
*Test environment: 90-page academic PDF, 1.9 MB*
## Detailed Comparison
### Docling (Recommended for Academic PDFs)
**Advantages:**
- Only tool tested that preserves structure (headers, tables, lists)
- AI-powered layout understanding via RT-DETR + TableFormer
- Markdown output well suited to LLM consumption
- 97.9% table accuracy in enterprise benchmarks
- On-device processing (no API calls)
**Disadvantages:**
- Slower than PyMuPDF (~33x in this test)
- Requires 500MB-1GB model download
- Some ligature encoding issues
**Use when:**
- Document structure is essential
- Processing academic papers with tables
- Preparing content for RAG systems
- LLM consumption is primary goal
### PyMuPDF (Recommended for Speed)
**Advantages:**
- Very fast (~33x faster than pdfplumber and Docling in this test)
- Excellent quality (only 1 issue in test)
- Clean output with minimal artifacts
- C-based, highly optimized
**Disadvantages:**
- No structure preservation
- AGPL license (restrictive for commercial use)
- Flat text output
**Use when:**
- Speed is critical
- Simple text extraction sufficient
- Batch processing large datasets
- Structure preservation not needed
### pdfplumber (Recommended for Quality)
**Advantages:**
- Perfect quality (0 issues)
- Character-level spatial analysis
- Geometric table detection
- MIT license
**Disadvantages:**
- Very slow (~33x slower than PyMuPDF)
- No markdown structure output
- CPU-intensive
**Use when:**
- Maximum fidelity required
- Quality more important than speed
- Processing critical documents
- Slow processing acceptable
## Traditional vs ML-Based
### Traditional Tools
**How they work:**
- Parse PDF internal structure
- Extract embedded text objects
- Follow PDF specification rules
**Advantages:**
- Fast (no ML inference)
- Small footprint (no model files)
- Deterministic output
**Disadvantages:**
- No layout understanding
- Cannot handle borderless tables
- Lose document hierarchy
### ML-Based Tools (Docling)
**How they work:**
- Computer vision to "see" document layout
- RT-DETR detects layout regions
- TableFormer understands table structure
- Hybrid: ML for layout + PDF parsing for text
**Advantages:**
- Understands visual layout
- Handles complex multi-column layouts
- Preserves semantic structure
- Works with borderless tables
**Disadvantages:**
- Slower (ML inference time)
- Larger footprint (model files)
- Non-deterministic output
## Architecture Details
### Docling Pipeline
1. **PDF Backend** - Extracts raw content and positions
2. **AI Models** - Analyze layout and structure
- RT-DETR: Layout analysis (44-633ms/page)
- TableFormer: Table structure (400ms-1.74s/table)
3. **Assembly** - Combines understanding with text
### pdfplumber Architecture
1. **Built on pdfminer.six** - Character-level extraction
2. **Spatial clustering** - Groups chars into words/lines
3. **Geometric detection** - Finds tables from lines/rectangles
4. **Character objects** - Full metadata (position, font, size, color)
## Enterprise Benchmarks (2025 Procycons)
| Tool | Table Accuracy | Text Fidelity | Speed (s/page) |
|------|----------------|---------------|----------------|
| Docling | 97.9% | 100% | 6.28 |
| Marker | 89.2% | 98.5% | 8.45 |
| MinerU | 92.1% | 99.2% | 12.33 |
| Unstructured.io | 75.0% | 95.8% | 51.02 |
| LlamaParse | 88.5% | 97.3% | 6.00 |
*Source: Procycons Enterprise PDF Processing Benchmark 2025*