PDF Extraction Quality Metrics
Key Metrics
1. Consecutive Spaces
What: Multiple spaces in sequence (2+)
Pattern: " {2,}" (a literal space repeated two or more times)
Impact: Formatting artifacts, token waste
Good: < 50 occurrences
Bad: > 100 occurrences
2. Excessive Newlines
What: 4+ consecutive newlines
Pattern: \n{4,}
Impact: Page breaks treated as whitespace
Good: < 20 occurrences
Bad: > 50 occurrences
3. Control Characters
What: Non-printable characters
Pattern: [\x00-\x08\x0b\x0c\x0e-\x1f]
Impact: Parsing errors, display issues
Good: 0 occurrences
Bad: > 0 occurrences
4. Garbled Characters
What: Replacement characters (<28>)
Pattern: [<5B>\ufffd]
Impact: Lost information, encoding failures
Good: 0 occurrences
Bad: > 0 occurrences
5. Hyphenation Breaks
What: End-of-line hyphens not joined
Pattern: \w+-\n\w+
Impact: Word splitting affects search
Good: < 10 occurrences
Bad: > 50 occurrences
6. Ligature Encoding
What: Ligature glyphs emitted as raw glyph names instead of letters
Examples: /uniFB00 (ff), /uniFB01 (fi), /uniFB03 (ffi)
Impact: Search failures, readability
Fix: Post-process with regex replacement
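Metric 6 lists examples but no pattern. A minimal sketch for counting it, assuming unresolved ligatures appear as literal glyph names in the range /uniFB00 to /uniFB06 (the Unicode Latin ligature block). The name count_ligature_names is hypothetical.

```python
import re

# Assumption: unresolved ligatures appear as literal glyph names such as
# "/uniFB01". FB00-FB06 is the Unicode range for Latin ligatures.
LIGATURE_NAME = re.compile(r'/uniFB0[0-6]')

def count_ligature_names(text):
    """Count ligature glyph names left unresolved in the extracted text."""
    return len(LIGATURE_NAME.findall(text))

sample = "e/uniFB03cient work/uniFB02ow"
print(count_ligature_names(sample))
```

Any count above zero suggests the ligature fix should run before the text is indexed or searched.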
Quality Score Formula
total_issues = (
    consecutive_spaces +
    excessive_newlines +
    control_chars +
    garbled_chars
)
quality_score = garbled_chars * 10 + total_issues
# Lower is better
Ranking:
- Excellent: score < 10
- Good: score 10-50
- Fair: score 51-100
- Poor: score > 100
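The formula and ranking can be sketched as two small functions. This assumes a score of exactly 50 counts as Good and exactly 100 as Fair, which is consistent with a score of 50 being rated Good in the test results. The function names and the sample metrics are hypothetical.

```python
def quality_score(metrics):
    """Garbled characters count once in the total and carry an extra 10x penalty."""
    total_issues = (
        metrics['consecutive_spaces']
        + metrics['excessive_newlines']
        + metrics['control_chars']
        + metrics['garbled_chars']
    )
    return metrics['garbled_chars'] * 10 + total_issues

def rating(score):
    """Map a score to a rating; upper bounds are assumed inclusive."""
    if score < 10:
        return 'Excellent'
    if score <= 50:
        return 'Good'
    if score <= 100:
        return 'Fair'
    return 'Poor'

# Made-up metrics for illustration only
example = {
    'consecutive_spaces': 100,
    'excessive_newlines': 10,
    'control_chars': 5,
    'garbled_chars': 5,
}
score = quality_score(example)
print(score, rating(score))
```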
Analysis Script
import re

def analyze_quality(text):
    """Analyze PDF extraction quality."""
    return {
        'chars': len(text),
        'words': len(text.split()),
        # Runs of two or more spaces (a bare ' +' would also count single spaces)
        'consecutive_spaces': len(re.findall(r' {2,}', text)),
        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
        'garbled_chars': len(re.findall(r'\ufffd', text)),
        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text)),
    }

# Usage (assumes the extracted text was saved as UTF-8)
with open("extracted.txt", encoding="utf-8") as f:
    text = f.read()
metrics = analyze_quality(text)
# Same total as in the Quality Score Formula
total_issues = (
    metrics['consecutive_spaces'] + metrics['excessive_newlines']
    + metrics['control_chars'] + metrics['garbled_chars']
)
print(f"Quality score: {metrics['garbled_chars'] * 10 + total_issues}")
Test Results (90-page Academic PDF)
| Tool | Total Issues | Garbled | Quality Score | Rating |
|---|---|---|---|---|
| pdfplumber | 0 | 0 | 0 | Excellent |
| PyMuPDF | 1 | 0 | 1 | Excellent |
| Docling | 50 | 0 | 50 | Good |
| pdftotext | 90 | 0 | 90 | Fair |
| pdfminer | 45 | 0 | 45 | Good |
| pypdf | 120 | 5 | 170 | Poor |
Content Completeness
Phrase Coverage Analysis
Extract 3-word phrases from each tool's output:
def extract_phrases(text):
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

# texts maps each tool name to its extracted text
common = set.intersection(*[extract_phrases(t) for t in texts.values()])
Results:
- Common phrases: 10,587 (captured by all tools)
- Docling unique: 17,170 phrases (most complete)
- pdfplumber unique: 8,229 phrases (conservative)
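One way to compute per-tool counts like these, assuming "unique" means phrases that appear in one tool's output and in no other tool's. The helper unique_phrases, the tool names, and the sample strings are all hypothetical.

```python
import re

def extract_phrases(text):
    """Return the set of 3-word phrases (lowercased, letters only)."""
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

def unique_phrases(texts):
    """For each tool, keep the phrases that no other tool produced."""
    phrase_sets = {tool: extract_phrases(t) for tool, t in texts.items()}
    unique = {}
    for tool, phrases in phrase_sets.items():
        # Union of every other tool's phrases
        others = set().union(
            *(p for name, p in phrase_sets.items() if name != tool)
        )
        unique[tool] = phrases - others
    return unique

# Made-up inputs for illustration only
texts = {
    "tool_a": "the quick brown fox jumps",
    "tool_b": "the quick brown fox",
}
result = unique_phrases(texts)
print({tool: len(phrases) for tool, phrases in result.items()})
```

A high unique count can mean more complete extraction, but it can also come from garbled words, so it is worth spot-checking a sample of the unique phrases.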
Cleaning Strategies
Fix Ligatures
def fix_ligatures(text):
    """Fix PDF ligature encoding."""
    replacements = {
        r'/uniFB00': 'ff',
        r'/uniFB01': 'fi',
        r'/uniFB02': 'fl',
        r'/uniFB03': 'ffi',
        r'/uniFB04': 'ffl',
    }
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text
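Some extractors may emit a ligature as a single Unicode code point (U+FB00 to U+FB06) instead of a glyph name. A sketch for that case, assuming it is acceptable to expand only that range using each character's NFKC compatibility form. The names UNICODE_LIGATURES and expand_unicode_ligatures are hypothetical.

```python
import unicodedata

# Build a translation table for the Latin ligature block only, so that
# other compatibility characters (superscripts, full-width forms) are untouched.
UNICODE_LIGATURES = {
    cp: unicodedata.normalize('NFKC', chr(cp))
    for cp in range(0xFB00, 0xFB07)
}

def expand_unicode_ligatures(text):
    """Replace single-code-point ligatures with their plain letters."""
    return text.translate(UNICODE_LIGATURES)

print(expand_unicode_ligatures("e\ufb03cient \ufb01le"))
```

Restricting the table to this range is a design choice: running NFKC over the whole text would also rewrite characters that are not ligatures.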
Normalize Whitespace
def normalize_whitespace(text):
    """Clean excessive whitespace."""
    text = re.sub(r' {2,}', ' ', text)  # Multiple spaces → single
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Many newlines → max 3
    return text.strip()
Join Hyphenated Words
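def join_hyphens(text):
    """Join end-of-line hyphenated words."""
    # Note: genuine compounds split at a line end also lose their hyphen
    # ("well-\nknown" becomes "wellknown").
    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)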
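The three cleaning steps can be chained into one pass. A minimal sketch, assuming this order: ligatures first, so a word broken just before a glyph name can still be joined, then hyphens, then whitespace. The wrapper clean_text and the sample input are hypothetical; the patterns are the ones used above.

```python
import re

LIGATURES = {
    '/uniFB00': 'ff',
    '/uniFB01': 'fi',
    '/uniFB02': 'fl',
    '/uniFB03': 'ffi',
    '/uniFB04': 'ffl',
}

def clean_text(text):
    """Apply ligature, hyphenation, and whitespace fixes in sequence."""
    # 1. Resolve ligature glyph names (literal replacement is enough here)
    for name, repl in LIGATURES.items():
        text = text.replace(name, repl)
    # 2. Join words hyphenated across a line break
    text = re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
    # 3. Collapse runs of spaces and cap consecutive newlines at three
    text = re.sub(r' {2,}', ' ', text)
    text = re.sub(r'\n{4,}', '\n\n\n', text)
    return text.strip()

# Made-up input for illustration only
raw = "An e/uniFB03cient  extrac-\ntion   pipeline\n\n\n\n\nNext page"
print(repr(clean_text(raw)))
```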