skills/pdftext/references/benchmarks.md
# PDF Extraction Benchmarks

## Enterprise Benchmark (2025 Procycons)

Production-grade comparison of ML-based PDF extraction tools.

| Tool | Table Accuracy | Text Fidelity | Speed (s/page) | Memory (GB) |
|------|----------------|---------------|----------------|-------------|
| **Docling** | **97.9%** | **100%** | 6.28 | 2.1 |
| Marker | 89.2% | 98.5% | 8.45 | 3.5 |
| MinerU | 92.1% | 99.2% | 12.33 | 4.2 |
| Unstructured.io | 75.0% | 95.8% | 51.02 | 1.8 |
| PyMuPDF4LLM | 82.3% | 97.1% | 4.12 | 1.2 |
| LlamaParse | 88.5% | 97.3% | 6.00 | N/A (cloud) |

**Test corpus:** 500 academic papers, business reports, and financial statements (mixed complexity)

**Key finding:** Docling leads in table accuracy and text fidelity at a competitive speed. Unstructured.io, despite its popularity, has the lowest table accuracy (75.0%) and is by far the slowest (51.02 s/page).

*Source: Procycons Enterprise PDF Processing Benchmark 2025*

## Academic PDF Test (This Research)

Real-world testing on distributed cognition literature.

### Test Environment

- **PDFs:** 4 academic books
- **Total size:** 98.2 MB
- **Pages:** ~400 pages combined
- **Content:** Multi-column layouts, tables, figures, references

### Test Results

#### Speed (90-page PDF, 1.9 MB)

| Tool | Total Time | Per Page | Speedup |
|------|------------|----------|---------|
| pdftotext | 0.63s | 0.007s/page | 60x |
| PyMuPDF | 1.18s | 0.013s/page | 33x |
| Docling | 38.86s | 0.432s/page | 1x |
| pdfplumber | 38.91s | 0.432s/page | 1x |

Speedup is measured against Docling's total time; the 60x for pdftotext is rounded down from about 62x.

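
The Per Page and Speedup columns can be recomputed from the total times. The sketch below does that arithmetic, assuming Docling as the 1x baseline; the `summarize` helper and its parameter names are illustrative and not part of the original benchmark scripts.

```python
# Total wall-clock seconds for the 90-page PDF, copied from the speed table.
totals = {"pdftotext": 0.63, "PyMuPDF": 1.18, "Docling": 38.86, "pdfplumber": 38.91}
PAGES = 90


def summarize(totals, pages, baseline="Docling"):
    """Return {tool: (seconds_per_page, speedup_vs_baseline)}."""
    base = totals[baseline]  # assumed baseline: the tool reported as 1x
    return {tool: (t / pages, base / t) for tool, t in totals.items()}


summary = summarize(totals, PAGES)
for tool, (per_page, speedup) in summary.items():
    print(f"{tool:<11} {per_page:.3f} s/page  {speedup:5.1f}x")
```

Using pdfplumber as the baseline instead changes the ratios only in the first decimal place, since the two slowest tools finished within 0.05 s of each other.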
#### Quality (Issues per document)

| Tool | Consecutive Spaces | Excessive Newlines | Control Chars | Garbled | Total |
|------|-------------------|-------------------|---------------|---------|-------|
| pdfplumber | 0 | 0 | 0 | 0 | **0** |
| PyMuPDF | 1 | 0 | 0 | 0 | **1** |
| Docling | 48 | 2 | 0 | 0 | **50** |
| pdftotext | 85 | 5 | 0 | 0 | **90** |

#### Structure Preservation

| Tool | Headers | Tables | Lists | Images |
|------|---------|--------|-------|--------|
| Docling | ✓ 36 | ✓ 16 rows | ✓ 307 items | ✓ 4 markers |
| PyMuPDF | ✗ | ✗ | ✗ | ✗ |
| pdfplumber | ✗ | ✗ | ✗ | ✗ |
| pdftotext | ✗ | ✗ | ✗ | ✗ |

**Key finding:** Of the four tools tested, only Docling preserved document structure (headers, tables, lists, and image markers).

## Production Recommendations

### By Use Case

**Academic research / Literature review:**
- **Primary:** Docling (structure essential)
- **Secondary:** PyMuPDF (speed for large batches)

**RAG system ingestion:**
- **Recommended:** Docling (semantic structure preserved)
- **Alternative:** PyMuPDF + post-processing

**Quick text extraction:**
- **Recommended:** PyMuPDF (about 33x faster than Docling in the speed test)
- **Alternative:** pdftotext (fastest at about 60x, lower quality)

**Maximum quality (legal, financial):**
- **Recommended:** pdfplumber (zero detected issues in the quality test)
- **Alternative:** Docling (structure + good quality)

### By Document Type

- **Academic papers:** Docling (tables, multi-column, references)
- **Books/ebooks:** PyMuPDF (simple linear text)
- **Business reports:** Docling (tables, charts, sections)
- **Scanned documents:** Docling with OCR enabled
- **Legal contracts:** pdfplumber (maximum fidelity)

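
In a batch pipeline, the document-type recommendations can be encoded as a simple lookup. This is only a sketch: the `RECOMMENDED` table, the `recommend_tool` function, the key spellings, and the choice of Docling as the fallback are all assumptions made for illustration.

```python
# Document type -> recommended extractor, transcribed from the list above.
RECOMMENDED = {
    "academic paper": "Docling",
    "book": "PyMuPDF",
    "business report": "Docling",
    "scanned document": "Docling (OCR enabled)",
    "legal contract": "pdfplumber",
}


def recommend_tool(document_type, default="Docling"):
    """Look up a tool by document type; unknown types get the assumed default."""
    return RECOMMENDED.get(document_type.strip().lower(), default)


print(recommend_tool("Legal contract"))  # case and whitespace are normalized
print(recommend_tool("invoice"))         # not in the table, uses the default
```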
## ML Model Performance (Docling)

### RT-DETR (Layout Detection)

- **Speed:** 44-633ms per page
- **Accuracy:** ~95% layout element detection
- **Detects:** Text blocks, headers, tables, figures, captions

### TableFormer (Table Structure)

- **Speed:** 400ms-1.74s per table
- **Accuracy:** 97.9% cell-level accuracy
- **Handles:** Borderless tables, merged cells, nested tables

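
The two latency ranges above combine into a rough lower and upper bound on model time for a whole document. The sketch below is plain arithmetic on those figures; the function name and the example counts (90 pages, 5 tables) are hypothetical, and the estimate ignores OCR, file I/O, and export time.

```python
# Latency ranges in milliseconds, taken from the figures listed above.
LAYOUT_MS_PER_PAGE = (44, 633)    # RT-DETR layout detection
TABLE_MS_PER_TABLE = (400, 1740)  # TableFormer table structure


def model_time_range(pages, tables):
    """Return (low, high) model time in seconds for a document."""
    low = pages * LAYOUT_MS_PER_PAGE[0] + tables * TABLE_MS_PER_TABLE[0]
    high = pages * LAYOUT_MS_PER_PAGE[1] + tables * TABLE_MS_PER_TABLE[1]
    return low / 1000, high / 1000


# Hypothetical document: 90 pages containing 5 tables.
low_s, high_s = model_time_range(pages=90, tables=5)
print(f"{low_s:.2f} s to {high_s:.2f} s")
```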
## Cloud vs On-Device

| Tool | Processing | Privacy | Cost | Speed |
|------|-----------|---------|------|-------|
| Docling | On-device | ✓ Private | Free | 0.43s/page |
| LlamaParse | Cloud API | ✗ Sends data | $0.003/page | 6s/page |
| Claude Vision | Cloud API | ✗ Sends data | $0.0075/page | Variable |
| Mathpix | Cloud API | ✗ Sends data | $0.004/page | 4s/page |

**Recommendation:** Use on-device processing (Docling) for sensitive or unpublished academic work.

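
To compare cost for a given corpus, the per-page prices can be multiplied out. The sketch below uses the prices exactly as listed in the table, which may no longer match current vendor pricing; `estimate_cost` and the 400-page example (the approximate size of the academic test set) are illustrative.

```python
# USD per page, copied from the table above; treat as a snapshot, not current pricing.
PRICE_PER_PAGE_USD = {
    "Docling": 0.0,
    "LlamaParse": 0.003,
    "Claude Vision": 0.0075,
    "Mathpix": 0.004,
}


def estimate_cost(pages, prices=PRICE_PER_PAGE_USD):
    """Return {tool: estimated USD cost} for processing `pages` pages."""
    return {tool: round(pages * price, 2) for tool, price in prices.items()}


costs = estimate_cost(400)
for tool, usd in costs.items():
    print(f"{tool:<14} ${usd:.2f}")
```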
## Benchmark Methodology

### Speed Testing

```python
import time

# `converter`, `pdf_path`, and `page_count` are set up by the caller.
start = time.perf_counter()  # monotonic, high-resolution timer
result = converter.convert(pdf_path)
elapsed = time.perf_counter() - start
per_page = elapsed / page_count
```

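
A single timed run can be noisy. One way to steady the measurement is to repeat the extraction and report the median; the sketch below assumes this approach and is not the harness used for the numbers in this document. The `time_per_page` helper and the stand-in `fake_extract` function are hypothetical, so the example runs without any PDF library installed.

```python
import statistics
import time


def time_per_page(extract, pdf_path, page_count, repeats=3):
    """Call extract(pdf_path) `repeats` times; return median seconds per page."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        extract(pdf_path)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples) / page_count


def fake_extract(path):
    """Stand-in for a real extractor; sleeps briefly instead of parsing a PDF."""
    time.sleep(0.01)
    return "text"


per_page = time_per_page(fake_extract, "example.pdf", page_count=90)
print(f"{per_page:.5f} s/page")
```

With a real tool, `extract` would wrap that library's own conversion call, for example `lambda p: converter.convert(p)` for the converter object used above.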
### Quality Testing

```python
import re

# Count quality issues in the extracted `text`
consecutive_spaces = len(re.findall(r' {2,}', text))  # runs of two or more spaces
excessive_newlines = len(re.findall(r'\n{4,}', text))  # four or more newlines in a row
control_chars = len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text))
garbled_chars = len(re.findall(r'\ufffd', text))  # U+FFFD replacement character

total_issues = consecutive_spaces + excessive_newlines + control_chars + garbled_chars
```

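
The same checks can be wrapped in a function and tried on a small string. This is a sketch under two assumptions: "consecutive spaces" means runs of two or more spaces, and "garbled" means the U+FFFD replacement character. The `PATTERNS` table, `count_quality_issues`, and the sample text are illustrative names and data.

```python
import re

# One compiled pattern per issue category.
PATTERNS = {
    "consecutive_spaces": re.compile(r" {2,}"),
    "excessive_newlines": re.compile(r"\n{4,}"),
    "control_chars": re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]"),
    "garbled": re.compile("\ufffd"),
}


def count_quality_issues(text):
    """Return per-category match counts plus their total."""
    counts = {name: len(p.findall(text)) for name, p in PATTERNS.items()}
    counts["total"] = sum(counts.values())
    return counts


# Sample with exactly one issue of each kind.
sample = "Two  spaces here.\n\n\n\nToo many newlines.\x07 Bad \ufffd char."
report = count_quality_issues(sample)
print(report)
```

Each run of spaces or newlines counts as one issue regardless of its length, so the totals measure how often a problem occurs, not how many characters it spans.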
### Structure Testing

```python
import re

# Count markdown elements in the exported `markdown`
headers = len(re.findall(r'^#{1,6}\s+.+$', markdown, re.MULTILINE))
tables = len(re.findall(r'\|.+\|', markdown))  # one match per table line, so this counts rows
lists = len(re.findall(r'^\s*[-*]\s+', markdown, re.MULTILINE))
```

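
The table pattern above matches every table line, including the `|---|---|` separator under each header. The sketch below is one possible refinement that excludes separator lines from the row count; `count_structure` and the sample document are illustrative and were not used to produce the numbers in this document.

```python
import re


def count_structure(markdown):
    """Count headers, table rows (excluding separator lines), and list items."""
    headers = re.findall(r"^#{1,6}\s+.+$", markdown, re.MULTILINE)
    rows = re.findall(r"^[ \t]*\|.+\|[ \t]*$", markdown, re.MULTILINE)
    # A separator line contains only pipes, dashes, colons, and whitespace.
    separators = [r for r in rows if re.fullmatch(r"[ \t|:-]+", r)]
    items = re.findall(r"^[ \t]*[-*][ \t]+", markdown, re.MULTILINE)
    return {
        "headers": len(headers),
        "table_rows": len(rows) - len(separators),
        "list_items": len(items),
    }


sample = """# Title

## Section

| Tool | Time |
|------|------|
| A | 1s |
| B | 2s |

- first
- second
* third
"""

structure = count_structure(sample)
print(structure)
```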