Initial commit

Author: Zhongwei Li
Date: 2025-11-30 09:05:19 +08:00
Commit: 09fec2555b
96 changed files with 24269 additions and 0 deletions

# PDF Extraction Benchmarks
## Enterprise Benchmark (2025 Procycons)
Production-grade comparison of ML-based PDF extraction tools.
| Tool | Table Accuracy | Text Fidelity | Speed (s/page) | Memory (GB) |
|------|----------------|---------------|----------------|-------------|
| **Docling** | **97.9%** | **100%** | 6.28 | 2.1 |
| Marker | 89.2% | 98.5% | 8.45 | 3.5 |
| MinerU | 92.1% | 99.2% | 12.33 | 4.2 |
| Unstructured.io | 75.0% | 95.8% | 51.02 | 1.8 |
| PyMuPDF4LLM | 82.3% | 97.1% | 4.12 | 1.2 |
| LlamaParse | 88.5% | 97.3% | 6.00 | N/A (cloud) |
**Test corpus:** 500 academic papers, business reports, financial statements (mixed complexity)
**Key finding:** Docling leads in table accuracy at competitive speed; Unstructured.io, despite its popularity, performs poorly on this corpus.
*Source: Procycons Enterprise PDF Processing Benchmark 2025*
## Academic PDF Test (This Research)
Real-world testing on distributed cognition literature.
### Test Environment
- **PDFs:** 4 academic books
- **Total size:** 98.2 MB
- **Pages:** ~400 pages combined
- **Content:** Multi-column layouts, tables, figures, references
### Test Results
#### Speed (90-page PDF, 1.9 MB)
| Tool | Total Time | Per Page | Speedup |
|------|------------|----------|---------|
| pdftotext | 0.63s | 0.007s/page | 60x |
| PyMuPDF | 1.18s | 0.013s/page | 33x |
| Docling | 38.86s | 0.432s/page | 1x |
| pdfplumber | 38.91s | 0.432s/page | 1x |
#### Quality (Issues per document)
| Tool | Consecutive Spaces | Excessive Newlines | Control Chars | Garbled | Total |
|------|-------------------|-------------------|---------------|---------|-------|
| pdfplumber | 0 | 0 | 0 | 0 | **0** |
| PyMuPDF | 1 | 0 | 0 | 0 | **1** |
| Docling | 48 | 2 | 0 | 0 | **50** |
| pdftotext | 85 | 5 | 0 | 0 | **90** |
#### Structure Preservation
| Tool | Headers | Tables | Lists | Images |
|------|---------|--------|-------|--------|
| Docling | ✓ 36 | ✓ 16 rows | ✓ 307 items | ✓ 4 markers |
| PyMuPDF | ✗ | ✗ | ✗ | ✗ |
| pdfplumber | ✗ | ✗ | ✗ | ✗ |
| pdftotext | ✗ | ✗ | ✗ | ✗ |
**Key finding:** Docling is the only tool tested that preserves document structure.
## Production Recommendations
### By Use Case
**Academic research / Literature review:**
- **Primary:** Docling (structure essential)
- **Secondary:** PyMuPDF (speed for large batches)
**RAG system ingestion:**
- **Recommended:** Docling (semantic structure preserved)
- **Alternative:** PyMuPDF + post-processing
**Quick text extraction:**
- **Recommended:** PyMuPDF (~33x faster than Docling)
- **Alternative:** pdftotext (fastest, lower quality)
**Maximum quality (legal, financial):**
- **Recommended:** pdfplumber (zero quality issues in testing)
- **Alternative:** Docling (structure + good quality)
### By Document Type
**Academic papers:** Docling (tables, multi-column, references)
**Books/ebooks:** PyMuPDF (simple linear text)
**Business reports:** Docling (tables, charts, sections)
**Scanned documents:** Docling with OCR enabled
**Legal contracts:** pdfplumber (maximum fidelity)
## ML Model Performance (Docling)
### RT-DETR (Layout Detection)
- **Speed:** 44-633ms per page
- **Accuracy:** ~95% layout element detection
- **Detects:** Text blocks, headers, tables, figures, captions
### TableFormer (Table Structure)
- **Speed:** 400ms-1.74s per table
- **Accuracy:** 97.9% cell-level accuracy
- **Handles:** Borderless tables, merged cells, nested tables
## Cloud vs On-Device
| Tool | Processing | Privacy | Cost | Speed |
|------|-----------|---------|------|-------|
| Docling | On-device | ✓ Private | Free | 0.43s/page |
| LlamaParse | Cloud API | ✗ Sends data | $0.003/page | 6s/page |
| Claude Vision | Cloud API | ✗ Sends data | $0.0075/page | Variable |
| Mathpix | Cloud API | ✗ Sends data | $0.004/page | 4s/page |
**Recommendation:** Use on-device (Docling) for sensitive/unpublished academic work.
## Benchmark Methodology
### Speed Testing
```python
import time

# `converter` and `page_count` come from the tool under test,
# e.g. a Docling DocumentConverter and the PDF's page count.
start = time.time()
result = converter.convert(pdf_path)
elapsed = time.time() - start
per_page = elapsed / page_count
```
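The same measurement can be generalized into a small reusable harness so every tool is timed identically (the `benchmark` name and callable signature are illustrative, not part of any tool's API):

```python
import time

def benchmark(extract_fn, pdf_path, page_count):
    """Time one extraction run; return (total_seconds, seconds_per_page)."""
    start = time.perf_counter()
    extract_fn(pdf_path)  # e.g. lambda p: converter.convert(p)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed / page_count
```

`time.perf_counter()` is preferred over `time.time()` for interval timing because it is monotonic and has higher resolution.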
### Quality Testing
```python
import re

# Count quality issues in the extracted text
consecutive_spaces = len(re.findall(r' {2,}', text))    # runs of 2+ spaces
excessive_newlines = len(re.findall(r'\n{4,}', text))   # 4+ consecutive newlines
control_chars = len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text))
garbled_chars = len(re.findall(r'\ufffd', text))        # U+FFFD replacement chars
total_issues = consecutive_spaces + excessive_newlines + control_chars + garbled_chars
```
### Structure Testing
```python
import re

# Count markdown elements in the converted output
headers = len(re.findall(r'^#{1,6}\s+.+$', markdown, re.MULTILINE))
table_rows = len(re.findall(r'\|.+\|', markdown))  # counts table rows, not tables
list_items = len(re.findall(r'^\s*[-*]\s+', markdown, re.MULTILINE))
```
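Run against a toy markdown string, these counters behave as follows (note the table pattern counts rows, which matches the "16 rows" figure reported above):

```python
import re

sample = "# Title\n\nIntro text.\n\n- item one\n- item two\n\n| a | b |\n| 1 | 2 |\n"
headers = len(re.findall(r'^#{1,6}\s+.+$', sample, re.MULTILINE))   # the one H1
table_rows = len(re.findall(r'\|.+\|', sample))                     # two table rows
list_items = len(re.findall(r'^\s*[-*]\s+', sample, re.MULTILINE))  # two bullets
```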

# PDF Extraction Quality Metrics
## Key Metrics
### 1. Consecutive Spaces
**What:** Multiple spaces in sequence (2+)
**Pattern:** ` {2,}`
**Impact:** Formatting artifacts, token waste
**Good:** < 50 occurrences
**Bad:** > 100 occurrences
### 2. Excessive Newlines
**What:** 4+ consecutive newlines
**Pattern:** `\n{4,}`
**Impact:** Page breaks treated as whitespace
**Good:** < 20 occurrences
**Bad:** > 50 occurrences
### 3. Control Characters
**What:** Non-printable characters
**Pattern:** `[\x00-\x08\x0b\x0c\x0e-\x1f]`
**Impact:** Parsing errors, display issues
**Good:** 0 occurrences
**Bad:** > 0 occurrences
### 4. Garbled Characters
**What:** Unicode replacement characters (U+FFFD, `�`)
**Pattern:** `\ufffd`
**Impact:** Lost information, encoding failures
**Good:** 0 occurrences
**Bad:** > 0 occurrences
### 5. Hyphenation Breaks
**What:** End-of-line hyphens not joined
**Pattern:** `\w+-\n\w+`
**Impact:** Word splitting affects search
**Good:** < 10 occurrences
**Bad:** > 50 occurrences
### 6. Ligature Encoding
**What:** Special character combinations
**Examples:** `/uniFB00` (ff), `/uniFB01` (fi), `/uniFB03` (ffi)
**Impact:** Search failures, readability
**Fix:** Post-process with regex replacement
## Quality Score Formula
```python
total_issues = (
    consecutive_spaces +
    excessive_newlines +
    control_chars +
    garbled_chars
)
# Garbled characters already count once in total_issues; the extra 10x
# penalty reflects unrecoverable data loss.
quality_score = garbled_chars * 10 + total_issues
# Lower is better
```
**Ranking:**
- Excellent: < 10 score
- Good: 10-50 score
- Fair: 50-100 score
- Poor: > 100 score
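The bands above encode directly as a small helper (the function name is illustrative; boundary values follow the ranking, treating a score of exactly 50 as Good and 100 as Fair):

```python
def rate_quality(score):
    """Map a quality score to its rating band."""
    if score < 10:
        return "Excellent"
    if score <= 50:
        return "Good"
    if score <= 100:
        return "Fair"
    return "Poor"
```

These bands reproduce the ratings in the test-results table below (e.g. Docling's score of 50 rates Good, pdftotext's 90 rates Fair).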
## Analysis Script
```python
import re

def analyze_quality(text):
    """Analyze PDF extraction quality."""
    return {
        'chars': len(text),
        'words': len(text.split()),
        'consecutive_spaces': len(re.findall(r' {2,}', text)),
        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
        'garbled_chars': len(re.findall(r'\ufffd', text)),
        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text)),
    }

# Usage
with open("extracted.txt") as f:
    text = f.read()
metrics = analyze_quality(text)
total = (metrics['consecutive_spaces'] + metrics['excessive_newlines']
         + metrics['control_chars'] + metrics['garbled_chars'])
print(f"Quality score: {metrics['garbled_chars'] * 10 + total}")
```
## Test Results (90-page Academic PDF)
| Tool | Total Issues | Garbled | Quality Score | Rating |
|------|--------------|---------|---------------|--------|
| pdfplumber | 0 | 0 | 0 | Excellent |
| PyMuPDF | 1 | 0 | 1 | Excellent |
| Docling | 50 | 0 | 50 | Good |
| pdftotext | 90 | 0 | 90 | Fair |
| pdfminer | 45 | 0 | 45 | Good |
| pypdf | 120 | 5 | 170 | Poor |
## Content Completeness
### Phrase Coverage Analysis
Extract 3-word phrases from each tool's output:
```python
import re

def extract_phrases(text):
    """Set of all 3-word phrases (lowercased, letters only)."""
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

# `texts` maps tool name -> extracted text for the same PDF
common = set.intersection(*[extract_phrases(t) for t in texts.values()])
```
**Results:**
- Common phrases: 10,587 (captured by all tools)
- Docling unique: 17,170 phrases (most complete)
- pdfplumber unique: 8,229 phrases (conservative)
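A self-contained toy run shows how shared and unique phrase sets fall out, using two short strings in place of full tool outputs:

```python
import re

def extract_phrases(text):
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

tool_a = extract_phrases("distributed cognition extends the mind")
tool_b = extract_phrases("distributed cognition extends beyond the mind")
common = tool_a & tool_b    # phrases both extractions captured
unique_a = tool_a - tool_b  # phrases only the first captured
```

A large `unique` set relative to `common` indicates content one tool captured that the others missed, which is how Docling's 17,170 unique phrases were interpreted as greater completeness.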
## Cleaning Strategies
### Fix Ligatures
```python
import re

def fix_ligatures(text):
    """Fix PDF ligature encoding."""
    replacements = {
        r'/uniFB00': 'ff',
        r'/uniFB01': 'fi',
        r'/uniFB02': 'fl',
        r'/uniFB03': 'ffi',
        r'/uniFB04': 'ffl',
    }
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text
```
### Normalize Whitespace
```python
import re

def normalize_whitespace(text):
    """Clean excessive whitespace."""
    text = re.sub(r' {2,}', ' ', text)        # Space runs → single space
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Many newlines → max 3
    return text.strip()
### Join Hyphenated Words
```python
import re

def join_hyphens(text):
    """Join end-of-line hyphenated words."""
    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
```
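The three cleaning strategies compose into a single pass (a sketch; the combined function name is illustrative, and ligatures are replaced as plain substrings since the glyph names contain no regex metacharacters):

```python
import re

LIGATURES = {'/uniFB00': 'ff', '/uniFB01': 'fi', '/uniFB02': 'fl',
             '/uniFB03': 'ffi', '/uniFB04': 'ffl'}

def clean_extracted_text(text):
    """Ligature fixes, then hyphen joining, then whitespace normalization."""
    for glyph, repl in LIGATURES.items():
        text = text.replace(glyph, repl)
    text = re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)  # join hyphen breaks
    text = re.sub(r' {2,}', ' ', text)                    # collapse space runs
    text = re.sub(r'\n{4,}', '\n\n\n', text)              # cap blank lines
    return text.strip()
```

Hyphen joining runs before whitespace normalization so line-break context is still intact when the `-\n` pattern is matched.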

# PDF Tool Comparison
## Summary Table
| Tool | Type | Speed | Quality Issues | Garbled | Structure | License |
|------|------|-------|----------------|---------|-----------|---------|
| **Docling** | ML | 0.43s/page | 50 | 0 | ✓ Yes | Apache 2.0 |
| **PyMuPDF** | Traditional | 0.01s/page | 1 | 0 | ✗ No | AGPL |
| **pdfplumber** | Traditional | 0.44s/page | 0 | 0 | ✗ No | MIT |
| **pdftotext** | Traditional | 0.007s/page | 90 | 0 | ✗ No | GPL |
| **pdfminer.six** | Traditional | 0.15s/page | 45 | 0 | ✗ No | MIT |
| **pypdf** | Traditional | 0.25s/page | 120 | 5 | ✗ No | BSD |
*Test environment: 90-page academic PDF, 1.9 MB*
## Detailed Comparison
### Docling (Recommended for Academic PDFs)
**Advantages:**
- Only tool tested that preserves structure (headers, tables, lists)
- AI-powered layout understanding via RT-DETR + TableFormer
- Markdown output well suited to LLM consumption
- 97.9% table accuracy in enterprise benchmarks
- On-device processing (no API calls)
**Disadvantages:**
- Slower than PyMuPDF (~33x in this test)
- Requires 500MB-1GB model download
- Some ligature encoding issues
**Use when:**
- Document structure is essential
- Processing academic papers with tables
- Preparing content for RAG systems
- LLM consumption is primary goal
### PyMuPDF (Recommended for Speed)
**Advantages:**
- Very fast (~33x faster than pdfplumber and Docling in this test)
- Excellent quality (only 1 issue in test)
- Clean output with minimal artifacts
- C-based, highly optimized
**Disadvantages:**
- No structure preservation
- AGPL license (restrictive for commercial use)
- Flat text output
**Use when:**
- Speed is critical
- Simple text extraction sufficient
- Batch processing large datasets
- Structure preservation not needed
### pdfplumber (Recommended for Quality)
**Advantages:**
- Perfect quality (0 issues)
- Character-level spatial analysis
- Geometric table detection
- MIT license
**Disadvantages:**
- Very slow (~33x slower than PyMuPDF)
- No markdown structure output
- CPU-intensive
**Use when:**
- Maximum fidelity required
- Quality more important than speed
- Processing critical documents
- Slow processing acceptable
## Traditional vs ML-Based
### Traditional Tools
**How they work:**
- Parse PDF internal structure
- Extract embedded text objects
- Follow PDF specification rules
**Advantages:**
- Fast (no ML inference)
- Small footprint (no model files)
- Deterministic output
**Disadvantages:**
- No layout understanding
- Cannot handle borderless tables
- Lose document hierarchy
### ML-Based Tools (Docling)
**How they work:**
- Computer vision to "see" document layout
- RT-DETR detects layout regions
- TableFormer understands table structure
- Hybrid: ML for layout + PDF parsing for text
**Advantages:**
- Understands visual layout
- Handles complex multi-column layouts
- Preserves semantic structure
- Works with borderless tables
**Disadvantages:**
- Slower (ML inference time)
- Larger footprint (model files)
- Non-deterministic output
## Architecture Details
### Docling Pipeline
1. **PDF Backend** - Extracts raw content and positions
2. **AI Models** - Analyze layout and structure
- RT-DETR: Layout analysis (44-633ms/page)
- TableFormer: Table structure (400ms-1.74s/table)
3. **Assembly** - Combines understanding with text
### pdfplumber Architecture
1. **Built on pdfminer.six** - Character-level extraction
2. **Spatial clustering** - Groups chars into words/lines
3. **Geometric detection** - Finds tables from lines/rectangles
4. **Character objects** - Full metadata (position, font, size, color)
## Enterprise Benchmarks (2025 Procycons)
| Tool | Table Accuracy | Text Fidelity | Speed (s/page) |
|------|----------------|---------------|----------------|
| Docling | 97.9% | 100% | 6.28 |
| Marker | 89.2% | 98.5% | 8.45 |
| MinerU | 92.1% | 99.2% | 12.33 |
| Unstructured.io | 75.0% | 95.8% | 51.02 |
| LlamaParse | 88.5% | 97.3% | 6.00 |
*Source: Procycons Enterprise PDF Processing Benchmark 2025*