PDF Extraction Quality Metrics

Key Metrics

1. Consecutive Spaces

What: Multiple spaces in sequence (2+)
Pattern: '  +' (two or more spaces)
Impact: Formatting artifacts, token waste
Good: < 50 occurrences
Bad: > 100 occurrences

2. Excessive Newlines

What: 4+ consecutive newlines
Pattern: \n{4,}
Impact: Page breaks treated as whitespace
Good: < 20 occurrences
Bad: > 50 occurrences

3. Control Characters

What: Non-printable characters
Pattern: [\x00-\x08\x0b\x0c\x0e-\x1f]
Impact: Parsing errors, display issues
Good: 0 occurrences
Bad: > 0 occurrences

4. Garbled Characters

What: Unicode replacement characters (U+FFFD, �)
Pattern: \ufffd
Impact: Lost information, encoding failures
Good: 0 occurrences
Bad: > 0 occurrences

5. Hyphenation Breaks

What: End-of-line hyphens not joined
Pattern: \w+-\n\w+
Impact: Word splitting affects search
Good: < 10 occurrences
Bad: > 50 occurrences

6. Ligature Encoding

What: Ligatures emitted as PDF glyph names instead of characters
Examples: /uniFB00 (ff), /uniFB01 (fi), /uniFB03 (ffi)
Impact: Search failures, readability
Fix: Post-process with regex replacement (see Cleaning Strategies; a counting sketch follows below)
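
Unlike the other metrics, the list above gives no detection pattern for this one. A minimal counting sketch, assuming the /uniFB00-/uniFB04 glyph names from the examples (the function name is ours, not part of the original script):

import re

def count_ligature_artifacts(text):
    """Count PDF ligature glyph names (e.g. /uniFB01) left in the text."""
    # /uniFB00 through /uniFB04 cover ff, fi, fl, ffi, ffl
    return len(re.findall(r'/uniFB0[0-4]', text))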

Quality Score Formula

total_issues = (
    consecutive_spaces +
    excessive_newlines +
    control_chars +
    garbled_chars
)

quality_score = garbled_chars * 10 + total_issues
# Lower is better. Garbled characters already count toward total_issues
# and are weighted 10x on top, since encoding failures are unrecoverable.

Ranking (codified in the sketch below):

  • Excellent: score < 10
  • Good: score 10-50
  • Fair: score 50-100
  • Poor: score > 100
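
These bands are straightforward to codify. A minimal sketch; the function name and the treatment of the boundary values 50 and 100 (which the bands above leave ambiguous) are our assumptions:

def rate(score):
    """Map a quality score to the rating bands above."""
    if score < 10:
        return 'Excellent'
    if score <= 50:
        return 'Good'
    if score <= 100:
        return 'Fair'
    return 'Poor'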

Analysis Script

import re

def analyze_quality(text):
    """Analyze PDF extraction quality."""
    return {
        'chars': len(text),
        'words': len(text.split()),
        'consecutive_spaces': len(re.findall(r'  +', text)),
        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
        'garbled_chars': len(re.findall(r'\ufffd', text)),
        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text))
    }

# Usage
with open("extracted.txt", encoding="utf-8") as f:
    text = f.read()
metrics = analyze_quality(text)
total_issues = (metrics['consecutive_spaces'] + metrics['excessive_newlines'] +
                metrics['control_chars'] + metrics['garbled_chars'])
print(f"Quality score: {metrics['garbled_chars'] * 10 + total_issues}")

Test Results (90-page Academic PDF)

Tool        Total Issues   Garbled   Quality Score   Rating
pdfplumber  0              0         0               Excellent
PyMuPDF     1              0         1               Excellent
Docling     50             0         50              Good
pdftotext   90             0         90              Fair
pdfminer    45             0         45              Good
pypdf       120            5         170             Poor

Content Completeness

Phrase Coverage Analysis

Extract the set of 3-word phrases from each tool's output, then intersect across tools:

import re

def extract_phrases(text):
    """Return the set of all 3-word phrases in the text."""
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

# texts: dict mapping tool name -> its extracted text
common = set.intersection(*[extract_phrases(t) for t in texts.values()])
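
The per-tool "unique" counts below can be derived from the same sets. One plausible reading, assuming "unique" means phrases a tool captured beyond the common core:

for tool, t in texts.items():
    unique = extract_phrases(t) - common
    print(f"{tool}: {len(unique)} phrases beyond the common core")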

Results:

  • Common phrases: 10,587 (captured by all tools)
  • Docling unique: 17,170 phrases (most complete)
  • pdfplumber unique: 8,229 phrases (conservative)

Cleaning Strategies

Fix Ligatures

def fix_ligatures(text):
    """Fix PDF ligature encoding."""
    replacements = {
        r'/uniFB00': 'ff',
        r'/uniFB01': 'fi',
        r'/uniFB02': 'fl',
        r'/uniFB03': 'ffi',
        r'/uniFB04': 'ffl',
    }
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text
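
For example, fix_ligatures("di/uniFB03cult e/uniFB00ort") returns "difficult effort".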

Normalize Whitespace

def normalize_whitespace(text):
    """Clean excessive whitespace."""
    text = re.sub(r'  +', ' ', text)  # Multiple spaces → single
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Many newlines → max 3
    return text.strip()

Join Hyphenated Words

def join_hyphens(text):
    """Join end-of-line hyphenated words."""
    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
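
Applied together, the three strategies make a simple cleaning pass. A minimal sketch; the wrapper name and the ordering are our choice (ligatures first, hyphens joined while line breaks are still intact, whitespace last):

def clean_extraction(text):
    """Run all three cleaning strategies in sequence."""
    text = fix_ligatures(text)         # glyph names -> characters
    text = join_hyphens(text)          # re-join words split across lines
    return normalize_whitespace(text)  # collapse spaces and newlines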