# PDF Extraction Quality Metrics

## Key Metrics

### 1. Consecutive Spaces

**What:** Runs of two or more spaces
**Pattern:** ` {2,}`
**Impact:** Formatting artifacts, token waste
**Good:** < 50 occurrences
**Bad:** > 100 occurrences

### 2. Excessive Newlines

**What:** 4+ consecutive newlines
**Pattern:** `\n{4,}`
**Impact:** Page breaks treated as whitespace
**Good:** < 20 occurrences
**Bad:** > 50 occurrences

### 3. Control Characters

**What:** Non-printable characters
**Pattern:** `[\x00-\x08\x0b\x0c\x0e-\x1f]`
**Impact:** Parsing errors, display issues
**Good:** 0 occurrences
**Bad:** > 0 occurrences

### 4. Garbled Characters

**What:** Unicode replacement characters (U+FFFD, rendered as �)
**Pattern:** `\ufffd`
**Impact:** Lost information, encoding failures
**Good:** 0 occurrences
**Bad:** > 0 occurrences

### 5. Hyphenation Breaks

**What:** End-of-line hyphens not rejoined
**Pattern:** `\w+-\n\w+`
**Impact:** Split words break search
**Good:** < 10 occurrences
**Bad:** > 50 occurrences

### 6. Ligature Encoding

**What:** Ligatures extracted as PDF glyph names instead of letters
**Examples:** `/uniFB00` (ff), `/uniFB01` (fi), `/uniFB03` (ffi)
**Impact:** Search failures, poor readability
**Fix:** Post-process with regex replacement (see Cleaning Strategies below)

## Quality Score Formula

```python
total_issues = (
    consecutive_spaces
    + excessive_newlines
    + control_chars
    + garbled_chars
)

# Garbled characters already count once in total_issues;
# the 10x term adds a heavy extra penalty on top.
quality_score = garbled_chars * 10 + total_issues  # Lower is better
```

**Ranking:**

- Excellent: score < 10
- Good: 10-50
- Fair: 51-100
- Poor: > 100
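Encoding these cutoffs as a helper keeps ratings consistent between runs. A minimal sketch; the name `rate_score` is mine, not part of the scoring code above:

```python
def rate_score(quality_score: int) -> str:
    """Map a quality score to the rating labels above (illustrative helper)."""
    if quality_score < 10:
        return "Excellent"
    if quality_score <= 50:
        return "Good"   # e.g. Docling's score of 50 in the table below
    if quality_score <= 100:
        return "Fair"
    return "Poor"
```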
## Analysis Script

```python
import re

def analyze_quality(text):
    """Analyze PDF extraction quality."""
    return {
        'chars': len(text),
        'words': len(text.split()),
        'consecutive_spaces': len(re.findall(r' {2,}', text)),
        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
        'garbled_chars': len(re.findall(r'\ufffd', text)),
        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text)),
    }

# Usage
with open("extracted.txt", encoding="utf-8") as f:
    text = f.read()
metrics = analyze_quality(text)
total_issues = (metrics['consecutive_spaces'] + metrics['excessive_newlines']
                + metrics['control_chars'] + metrics['garbled_chars'])
print(f"Quality score: {metrics['garbled_chars'] * 10 + total_issues}")
```

## Test Results (90-page Academic PDF)

| Tool | Total Issues | Garbled | Quality Score | Rating |
|------|--------------|---------|---------------|--------|
| pdfplumber | 0 | 0 | 0 | Excellent |
| PyMuPDF | 1 | 0 | 1 | Excellent |
| Docling | 50 | 0 | 50 | Good |
| pdftotext | 90 | 0 | 90 | Fair |
| pdfminer | 45 | 0 | 45 | Good |
| pypdf | 120 | 5 | 170 | Poor |

## Content Completeness

### Phrase Coverage Analysis

Extract the set of 3-word phrases from each tool's output, then intersect the sets to find what every tool captured:

```python
import re

def extract_phrases(text):
    """Return the set of lowercase 3-word phrases in the text."""
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

# texts maps tool name -> that tool's extracted text
common = set.intersection(*[extract_phrases(t) for t in texts.values()])
```

**Results:**

- Common phrases: 10,587 (captured by all tools)
- Docling unique: 17,170 phrases (most complete)
- pdfplumber unique: 8,229 phrases (conservative)

## Cleaning Strategies

### Fix Ligatures

```python
import re

def fix_ligatures(text):
    """Replace PDF ligature glyph names with the letters they encode."""
    replacements = {
        r'/uniFB00': 'ff',
        r'/uniFB01': 'fi',
        r'/uniFB02': 'fl',
        r'/uniFB03': 'ffi',
        r'/uniFB04': 'ffl',
    }
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text
```

### Normalize Whitespace

```python
def normalize_whitespace(text):
    """Clean excessive whitespace."""
    text = re.sub(r' {2,}', ' ', text)        # Runs of spaces → single space
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Many newlines → max 3
    return text.strip()
```

### Join Hyphenated Words

```python
def join_hyphens(text):
    """Join end-of-line hyphenated words."""
    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
```
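The three passes compose into a single cleanup step. A minimal sketch; `clean_extracted` is an illustrative name, not part of the functions above, and the re-scoring at the end reuses `analyze_quality` from the Analysis Script:

```python
def clean_extracted(text):
    """Apply all three cleaning passes to raw extractor output (illustrative)."""
    text = fix_ligatures(text)         # Glyph names → letters first
    text = join_hyphens(text)          # Rejoin words split across lines
    text = normalize_whitespace(text)  # Then collapse spaces and newlines
    return text

# Re-score after cleaning to confirm the fixes took effect
print(analyze_quality(clean_extracted(text)))
```

Ligatures are fixed first so the later passes see plain words; hyphen joining and whitespace normalization can run in either order, since `normalize_whitespace` leaves single newlines intact.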