4.6 KiB
PDF Extraction Benchmarks
Enterprise Benchmark (2025 Procycons)
Production-grade comparison of ML-based PDF extraction tools.
| Tool | Table Accuracy | Text Fidelity | Speed (s/page) | Memory (GB) |
|---|---|---|---|---|
| Docling | 97.9% | 100% | 6.28 | 2.1 |
| Marker | 89.2% | 98.5% | 8.45 | 3.5 |
| MinerU | 92.1% | 99.2% | 12.33 | 4.2 |
| Unstructured.io | 75.0% | 95.8% | 51.02 | 1.8 |
| PyMuPDF4LLM | 82.3% | 97.1% | 4.12 | 1.2 |
| LlamaParse | 88.5% | 97.3% | 6.00 | N/A (cloud) |
Test corpus: 500 academic papers, business reports, financial statements (mixed complexity)
Key finding: Docling leads in table accuracy with competitive speed. Unstructured.io despite popularity has poor performance.
Source: Procycons Enterprise PDF Processing Benchmark 2025
Academic PDF Test (This Research)
Real-world testing on distributed cognition literature.
Test Environment
- PDFs: 4 academic books
- Total size: 98.2 MB
- Pages: ~400 pages combined
- Content: Multi-column layouts, tables, figures, references
Test Results
Speed (90-page PDF, 1.9 MB)
| Tool | Total Time | Per Page | Speedup |
|---|---|---|---|
| pdftotext | 0.63s | 0.007s/page | 60x |
| PyMuPDF | 1.18s | 0.013s/page | 33x |
| Docling | 38.86s | 0.432s/page | 1x |
| pdfplumber | 38.91s | 0.432s/page | 1x |
Quality (Issues per document)
| Tool | Consecutive Spaces | Excessive Newlines | Control Chars | Garbled | Total |
|---|---|---|---|---|---|
| pdfplumber | 0 | 0 | 0 | 0 | 0 |
| PyMuPDF | 1 | 0 | 0 | 0 | 1 |
| Docling | 48 | 2 | 0 | 0 | 50 |
| pdftotext | 85 | 5 | 0 | 0 | 90 |
Structure Preservation
| Tool | Headers | Tables | Lists | Images |
|---|---|---|---|---|
| Docling | ✓ 36 | ✓ 16 rows | ✓ 307 items | ✓ 4 markers |
| PyMuPDF | ✗ | ✗ | ✗ | ✗ |
| pdfplumber | ✗ | ✗ | ✗ | ✗ |
| pdftotext | ✗ | ✗ | ✗ | ✗ |
Key finding: Docling is the ONLY tool that preserves document structure.
Production Recommendations
By Use Case
Academic research / Literature review:
- Primary: Docling (structure essential)
- Secondary: PyMuPDF (speed for large batches)
RAG system ingestion:
- Recommended: Docling (semantic structure preserved)
- Alternative: PyMuPDF + post-processing
Quick text extraction:
- Recommended: PyMuPDF (60x faster)
- Alternative: pdftotext (fastest, lower quality)
Maximum quality (legal, financial):
- Recommended: pdfplumber (perfect quality)
- Alternative: Docling (structure + good quality)
By Document Type
Academic papers: Docling (tables, multi-column, references) Books/ebooks: PyMuPDF (simple linear text) Business reports: Docling (tables, charts, sections) Scanned documents: Docling with OCR enabled Legal contracts: pdfplumber (maximum fidelity)
ML Model Performance (Docling)
RT-DETR (Layout Detection)
- Speed: 44-633ms per page
- Accuracy: ~95% layout element detection
- Detects: Text blocks, headers, tables, figures, captions
TableFormer (Table Structure)
- Speed: 400ms-1.74s per table
- Accuracy: 97.9% cell-level accuracy
- Handles: Borderless tables, merged cells, nested tables
Cloud vs On-Device
| Tool | Processing | Privacy | Cost | Speed |
|---|---|---|---|---|
| Docling | On-device | ✓ Private | Free | 0.43s/page |
| LlamaParse | Cloud API | ✗ Sends data | $0.003/page | 6s/page |
| Claude Vision | Cloud API | ✗ Sends data | $0.0075/page | Variable |
| Mathpix | Cloud API | ✗ Sends data | $0.004/page | 4s/page |
Recommendation: Use on-device (Docling) for sensitive/unpublished academic work.
Benchmark Methodology
Speed Testing
import time
start = time.time()
result = converter.convert(pdf_path)
elapsed = time.time() - start
per_page = elapsed / page_count
Quality Testing
# Count quality issues
consecutive_spaces = len(re.findall(r' +', text))
excessive_newlines = len(re.findall(r'\n{4,}', text))
control_chars = len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text))
garbled_chars = len(re.findall(r'[<5B>\ufffd]', text))
total_issues = consecutive_spaces + excessive_newlines + control_chars + garbled_chars
Structure Testing
# Count markdown elements
headers = len(re.findall(r'^#{1,6}\s+.+$', markdown, re.MULTILINE))
tables = len(re.findall(r'\|.+\|', markdown))
lists = len(re.findall(r'^\s*[-*]\s+', markdown, re.MULTILINE))