2025-11-30 09:05:19 +08:00

PDF Extraction Benchmarks

Enterprise Benchmark (2025 Procycons)

Production-grade comparison of ML-based PDF extraction tools.

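| Tool | Table Accuracy | Text Fidelity | Speed (s/page) | Memory (GB) |
|------|----------------|---------------|----------------|-------------|
| Docling | 97.9% | 100% | 6.28 | 2.1 |
| Marker | 89.2% | 98.5% | 8.45 | 3.5 |
| MinerU | 92.1% | 99.2% | 12.33 | 4.2 |
| Unstructured.io | 75.0% | 95.8% | 51.02 | 1.8 |
| PyMuPDF4LLM | 82.3% | 97.1% | 4.12 | 1.2 |
| LlamaParse | 88.5% | 97.3% | 6.00 | N/A (cloud) |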

Test corpus: 500 academic papers, business reports, financial statements (mixed complexity)

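Key finding: Docling leads in table accuracy (97.9%) and text fidelity (100%) at a competitive 6.28 s/page. Unstructured.io, despite its popularity, has the lowest table accuracy (75.0%) and is by far the slowest tool (51.02 s/page).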

Source: Procycons Enterprise PDF Processing Benchmark 2025

Academic PDF Test (This Research)

Real-world testing on distributed cognition literature.

Test Environment

  • PDFs: 4 academic books
  • Total size: 98.2 MB
  • Pages: ~400 pages combined
  • Content: Multi-column layouts, tables, figures, references

Test Results

Speed (90-page PDF, 1.9 MB)

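| Tool | Total Time | Per Page | Speedup vs. Docling |
|------|------------|----------|---------------------|
| pdftotext | 0.63s | 0.007s/page | 60x |
| PyMuPDF | 1.18s | 0.013s/page | 33x |
| Docling | 38.86s | 0.432s/page | 1x |
| pdfplumber | 38.91s | 0.432s/page | 1x |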
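
The per-page and speedup columns follow from the total times and the 90-page count. Here is a minimal sketch of that arithmetic, assuming Docling (the tool listed at 1x) is the baseline; the names `TOTAL_SECONDS` and `speed_summary` are illustrative and not part of the original benchmark scripts.

```python
# Total wall-clock times (seconds) from the speed table, for the 90-page PDF.
TOTAL_SECONDS = {
    "pdftotext": 0.63,
    "PyMuPDF": 1.18,
    "Docling": 38.86,
    "pdfplumber": 38.91,
}
PAGE_COUNT = 90
BASELINE = "Docling"  # assumption: the tool listed at 1x is the baseline


def speed_summary(totals, page_count, baseline):
    """Return {tool: (seconds_per_page, speedup_vs_baseline)}."""
    base = totals[baseline]
    return {
        tool: (seconds / page_count, base / seconds)
        for tool, seconds in totals.items()
    }


summary = speed_summary(TOTAL_SECONDS, PAGE_COUNT, BASELINE)
for tool, (per_page, speedup) in summary.items():
    print(f"{tool:<11} {per_page:.3f} s/page  {speedup:5.1f}x")
```

Computed this way, the unrounded ratios come out near 62x for pdftotext and 33x for PyMuPDF, so the speedup figures in the table are rounded.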

Quality (Issues per document)

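| Tool | Consecutive Spaces | Excessive Newlines | Control Chars | Garbled | Total |
|------|--------------------|--------------------|---------------|---------|-------|
| pdfplumber | 0 | 0 | 0 | 0 | 0 |
| PyMuPDF | 1 | 0 | 0 | 0 | 1 |
| Docling | 48 | 2 | 0 | 0 | 50 |
| pdftotext | 85 | 5 | 0 | 0 | 90 |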

Structure Preservation

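| Tool | Headers | Tables | Lists | Images |
|------|---------|--------|-------|--------|
| Docling | ✓ 36 | ✓ 16 rows | ✓ 307 items | ✓ 4 markers |
| PyMuPDF | ✗ | ✗ | ✗ | ✗ |
| pdfplumber | ✗ | ✗ | ✗ | ✗ |
| pdftotext | ✗ | ✗ | ✗ | ✗ |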

Key finding: Of the four tools tested, Docling is the only one that preserves document structure (headers, tables, lists, and image markers).

Production Recommendations

By Use Case

Academic research / Literature review:

  • Primary: Docling (structure essential)
  • Secondary: PyMuPDF (speed for large batches)

RAG system ingestion:

  • Recommended: Docling (semantic structure preserved)
  • Alternative: PyMuPDF + post-processing

Quick text extraction:

  • Recommended: PyMuPDF (33x faster than Docling, near-zero quality issues)
  • Alternative: pdftotext (fastest at 60x, but the most quality issues)

Maximum quality (legal, financial):

  • Recommended: pdfplumber (zero quality issues in the test above)
  • Alternative: Docling (structure + good quality)

By Document Type

  • Academic papers: Docling (tables, multi-column, references)
  • Books/ebooks: PyMuPDF (simple linear text)
  • Business reports: Docling (tables, charts, sections)
  • Scanned documents: Docling with OCR enabled
  • Legal contracts: pdfplumber (maximum fidelity)
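
The document-type recommendations can be encoded as a simple lookup for a batch pipeline. This is only a sketch: `RECOMMENDED_TOOL`, `recommend_tool`, and the Docling fallback for unlisted types are hypothetical choices, not something the benchmark prescribes.

```python
# Document-type recommendations from this section, as a lookup table.
RECOMMENDED_TOOL = {
    "academic paper": "Docling",
    "book": "PyMuPDF",
    "business report": "Docling",
    "scanned document": "Docling (OCR enabled)",
    "legal contract": "pdfplumber",
}


def recommend_tool(document_type, default="Docling"):
    """Look up the recommended extractor.

    `default` is a hypothetical fallback for types not listed above.
    """
    return RECOMMENDED_TOOL.get(document_type.strip().lower(), default)


print(recommend_tool("Legal contract"))  # pdfplumber
print(recommend_tool("Book"))            # PyMuPDF
```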

ML Model Performance (Docling)

RT-DETR (Layout Detection)

  • Speed: 44-633ms per page
  • Accuracy: ~95% layout element detection
  • Detects: Text blocks, headers, tables, figures, captions

TableFormer (Table Structure)

  • Speed: 400ms-1.74s per table
  • Accuracy: 97.9% cell-level accuracy
  • Handles: Borderless tables, merged cells, nested tables

Cloud vs On-Device

| Tool | Processing | Privacy | Cost | Speed |
|------|------------|---------|------|-------|
| Docling | On-device | ✓ Private | Free | 0.43s/page |
| LlamaParse | Cloud API | ✗ Sends data | $0.003/page | 6s/page |
| Claude Vision | Cloud API | ✗ Sends data | $0.0075/page | Variable |
| Mathpix | Cloud API | ✗ Sends data | $0.004/page | 4s/page |

Recommendation: Use on-device (Docling) for sensitive/unpublished academic work.
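
To put the per-page prices in context, the sketch below estimates the cost of processing a corpus the size of the ~400-page test set. The prices are copied from the table in this section; the helper name `estimate_cost` is illustrative, and actual provider pricing may differ.

```python
# Per-page prices (USD) from the cloud vs. on-device table.
# Docling runs locally, so it is listed at no per-page charge.
PRICE_PER_PAGE = {
    "Docling": 0.0,
    "LlamaParse": 0.003,
    "Claude Vision": 0.0075,
    "Mathpix": 0.004,
}


def estimate_cost(page_count):
    """Return {tool: estimated USD cost} for a corpus of `page_count` pages."""
    return {tool: price * page_count for tool, price in PRICE_PER_PAGE.items()}


# ~400 pages matches the combined size of the academic test corpus.
for tool, cost in estimate_cost(400).items():
    print(f"{tool:<14} ${cost:.2f}")
```

At this scale the cloud options cost a few dollars at most, so for small corpora the privacy column is likely to matter more than the cost column.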

Benchmark Methodology

Speed Testing

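import time

# `converter`, `pdf_path`, and `page_count` are set up per tool before timing.
start = time.perf_counter()  # monotonic clock; better suited to intervals than time.time()
result = converter.convert(pdf_path)
elapsed = time.perf_counter() - start
per_page = elapsed / page_count  # seconds per page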
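
The same measurement can be wrapped in a reusable helper so that every tool is timed identically. This is a sketch under assumptions: `time_extraction` and `fake_extract` are hypothetical names, and taking the best of several runs is an added choice, not part of the methodology described here.

```python
import time


def time_extraction(extract, pdf_path, page_count, repeats=3):
    """Time `extract(pdf_path)` and return (best_elapsed_s, best_s_per_page).

    `extract` is any callable wrapping one tool. Taking the best of
    `repeats` runs is an assumption of this sketch.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        extract(pdf_path)
        best = min(best, time.perf_counter() - start)
    return best, best / page_count


# Stand-in extractor so the sketch runs without any PDF library installed.
def fake_extract(path):
    time.sleep(0.01)
    return "text"


elapsed, per_page = time_extraction(fake_extract, "example.pdf", page_count=90)
print(f"{elapsed:.3f}s total, {per_page:.5f}s/page")
```

In real use, `fake_extract` would be replaced by a small wrapper around each tool's own extraction call.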

Quality Testing

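import re

# Count quality issues in `text`, the plain-text output of the tool under test
consecutive_spaces = len(re.findall(r' {2,}', text))  # runs of two or more spaces
excessive_newlines = len(re.findall(r'\n{4,}', text))  # three or more blank lines in a row
control_chars = len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text))  # excludes \t, \n, \r
garbled_chars = len(re.findall(r'\ufffd', text))  # U+FFFD replacement character

total_issues = consecutive_spaces + excessive_newlines + control_chars + garbled_chars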
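
For repeated runs across tools, the checks can be gathered into one function that returns a row shaped like the quality table. This is a minimal sketch: the name `count_quality_issues` is illustrative, and the garbled-character pattern is assumed to target the Unicode replacement character U+FFFD.

```python
import re

# Patterns keyed by the column names used in the quality table.
QUALITY_PATTERNS = {
    "consecutive_spaces": re.compile(r" {2,}"),
    "excessive_newlines": re.compile(r"\n{4,}"),
    "control_chars": re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]"),
    "garbled": re.compile("\ufffd"),  # assumption: U+FFFD marks garbled output
}


def count_quality_issues(text):
    """Return per-category issue counts plus a 'total' entry."""
    counts = {name: len(p.findall(text)) for name, p in QUALITY_PATTERNS.items()}
    counts["total"] = sum(counts.values())
    return counts


sample = "Two  spaces,\n\n\n\nfour newlines, and a bad char \ufffd."
print(count_quality_issues(sample))
# {'consecutive_spaces': 1, 'excessive_newlines': 1, 'control_chars': 0, 'garbled': 1, 'total': 3}
```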

Structure Testing

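import re

# Count markdown elements in `markdown`, the tool's markdown output
headers = len(re.findall(r'^#{1,6}\s+.+$', markdown, re.MULTILINE))
tables = len(re.findall(r'\|.+\|', markdown))  # counts table rows (lines containing |...|), not tables
lists = len(re.findall(r'^\s*[-*]\s+', markdown, re.MULTILINE))  # unordered list items only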
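
A small worked example shows what these patterns count in practice. The function name `count_structure` and the sample document are illustrative. Note that the table pattern counts every line containing `|...|`, including the `|---|` separator row.

```python
import re


def count_structure(markdown):
    """Count markdown headers, table rows, and unordered list items."""
    return {
        "headers": len(re.findall(r"^#{1,6}\s+.+$", markdown, re.MULTILINE)),
        # Counts each line containing |...|, including the separator row.
        "table_rows": len(re.findall(r"\|.+\|", markdown)),
        "list_items": len(re.findall(r"^\s*[-*]\s+", markdown, re.MULTILINE)),
    }


sample = """# Title

## Section

| Tool | Time |
|------|------|
| A | 1s |

- first
- second
"""
print(count_structure(sample))
# {'headers': 2, 'table_rows': 3, 'list_items': 2}
```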