PDF Extraction Quality Metrics

Key Metrics

1. Consecutive Spaces

What: Multiple spaces in sequence (2+)
Pattern: '  +' (two or more spaces)
Impact: Formatting artifacts, token waste
Good: < 50 occurrences
Bad: > 100 occurrences

2. Excessive Newlines

What: 4+ consecutive newlines
Pattern: \n{4,}
Impact: Page breaks treated as whitespace
Good: < 20 occurrences
Bad: > 50 occurrences

3. Control Characters

What: Non-printable characters
Pattern: [\x00-\x08\x0b\x0c\x0e-\x1f]
Impact: Parsing errors, display issues
Good: 0 occurrences
Bad: > 0 occurrences

4. Garbled Characters

What: Unicode replacement characters (U+FFFD, �)
Pattern: \ufffd
Impact: Lost information, encoding failures
Good: 0 occurrences
Bad: > 0 occurrences

5. Hyphenation Breaks

What: End-of-line hyphens not joined
Pattern: \w+-\n\w+
Impact: Word splitting affects search
Good: < 10 occurrences
Bad: > 50 occurrences

6. Ligature Encoding

What: Ligatures emitted as PDF glyph names instead of characters
Examples: /uniFB00 (ff), /uniFB01 (fi), /uniFB03 (ffi)
Impact: Search failures, readability
Fix: Post-process with regex replacement (see Cleaning Strategies; a counting sketch follows below)
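
Unlike the other metrics, the list above gives no detection pattern for this one. A minimal counting sketch, assuming the /uniFB00-/uniFB04 glyph names from the examples (the function name is ours, not part of the original script):

import re

def count_ligature_artifacts(text):
    """Count PDF ligature glyph names (e.g. /uniFB01) left in the text."""
    # /uniFB00 through /uniFB04 cover ff, fi, fl, ffi, ffl
    return len(re.findall(r'/uniFB0[0-4]', text))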

Quality Score Formula

total_issues = (
    consecutive_spaces +
    excessive_newlines +
    control_chars +
    garbled_chars
)

quality_score = garbled_chars * 10 + total_issues
# Lower is better. Garbled characters already count toward total_issues
# and are weighted 10x on top, since encoding failures are unrecoverable.

Ranking (codified in the sketch below):

  • Excellent: score < 10
  • Good: score 10-50
  • Fair: score 50-100
  • Poor: score > 100
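
These bands are straightforward to codify. A minimal sketch; the function name and the treatment of the boundary values 50 and 100 (which the bands above leave ambiguous) are our assumptions:

def rate(score):
    """Map a quality score to the rating bands above."""
    if score < 10:
        return 'Excellent'
    if score <= 50:
        return 'Good'
    if score <= 100:
        return 'Fair'
    return 'Poor'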

Analysis Script

import re

def analyze_quality(text):
    """Analyze PDF extraction quality."""
    return {
        'chars': len(text),
        'words': len(text.split()),
        'consecutive_spaces': len(re.findall(r'  +', text)),
        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
        'garbled_chars': len(re.findall(r'\ufffd', text)),
        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text))
    }

# Usage
with open("extracted.txt", encoding="utf-8") as f:
    text = f.read()
metrics = analyze_quality(text)
total_issues = (metrics['consecutive_spaces'] + metrics['excessive_newlines'] +
                metrics['control_chars'] + metrics['garbled_chars'])
print(f"Quality score: {metrics['garbled_chars'] * 10 + total_issues}")

Test Results (90-page Academic PDF)

Tool        Total Issues   Garbled   Quality Score   Rating
pdfplumber  0              0         0               Excellent
PyMuPDF     1              0         1               Excellent
Docling     50             0         50              Good
pdftotext   90             0         90              Fair
pdfminer    45             0         45              Good
pypdf       120            5         170             Poor

Content Completeness

Phrase Coverage Analysis

Extract the set of 3-word phrases from each tool's output, then intersect across tools:

import re

def extract_phrases(text):
    """Return the set of all 3-word phrases in the text."""
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

# texts: dict mapping tool name -> its extracted text
common = set.intersection(*[extract_phrases(t) for t in texts.values()])
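
The per-tool "unique" counts below can be derived from the same sets. One plausible reading, assuming "unique" means phrases a tool captured beyond the common core:

for tool, t in texts.items():
    unique = extract_phrases(t) - common
    print(f"{tool}: {len(unique)} phrases beyond the common core")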

Results:

  • Common phrases: 10,587 (captured by all tools)
  • Docling unique: 17,170 phrases (most complete)
  • pdfplumber unique: 8,229 phrases (conservative)

Cleaning Strategies

Fix Ligatures

def fix_ligatures(text):
    """Fix PDF ligature encoding."""
    replacements = {
        r'/uniFB00': 'ff',
        r'/uniFB01': 'fi',
        r'/uniFB02': 'fl',
        r'/uniFB03': 'ffi',
        r'/uniFB04': 'ffl',
    }
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text
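
For example, fix_ligatures("di/uniFB03cult e/uniFB00ort") returns "difficult effort".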

Normalize Whitespace

def normalize_whitespace(text):
    """Clean excessive whitespace."""
    text = re.sub(r'  +', ' ', text)  # Multiple spaces → single
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Many newlines → max 3
    return text.strip()

Join Hyphenated Words

def join_hyphens(text):
    """Join end-of-line hyphenated words."""
    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
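
Applied together, the three strategies make a simple cleaning pass. A minimal sketch; the wrapper name and the ordering are our choice (ligatures first, hyphens joined while line breaks are still intact, whitespace last):

def clean_extraction(text):
    """Run all three cleaning strategies in sequence."""
    text = fix_ligatures(text)         # glyph names -> characters
    text = join_hyphens(text)          # re-join words split across lines
    return normalize_whitespace(text)  # collapse spaces and newlines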