# PDF Extraction Quality Metrics
## Key Metrics
### 1. Consecutive Spaces
**What:** Multiple spaces in sequence (2+)
**Pattern:** ` {2,}`
**Impact:** Formatting artifacts, token waste
**Good:** < 50 occurrences
**Bad:** > 100 occurrences
### 2. Excessive Newlines
**What:** 4+ consecutive newlines
**Pattern:** `\n{4,}`
**Impact:** Page breaks treated as whitespace
**Good:** < 20 occurrences
**Bad:** > 50 occurrences
### 3. Control Characters
**What:** Non-printable characters
**Pattern:** `[\x00-\x08\x0b\x0c\x0e-\x1f]`
**Impact:** Parsing errors, display issues
**Good:** 0 occurrences
**Bad:** > 0 occurrences
### 4. Garbled Characters
**What:** Unicode replacement characters (`�`)
**Pattern:** `[\ufffd]`
**Impact:** Lost information, encoding failures
**Good:** 0 occurrences
**Bad:** > 0 occurrences
### 5. Hyphenation Breaks
**What:** End-of-line hyphens not joined
**Pattern:** `\w+-\n\w+`
**Impact:** Word splitting affects search
**Good:** < 10 occurrences
**Bad:** > 50 occurrences
### 6. Ligature Encoding
**What:** Ligature glyph names left in the text instead of the letters they represent
**Examples:** `/uniFB00` (ff), `/uniFB01` (fi), `/uniFB03` (ffi)
**Impact:** Search failures, reduced readability
**Fix:** Post-process with regex replacement (see Cleaning Strategies below)
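Ligature markers are not counted by the analysis script below; a minimal counter could look like this (the `/uniFB00`–`/uniFB04` range is an assumption based on the examples above):
```python
import re

def count_ligature_markers(text):
    # Count /uniFB00../uniFB04 glyph names (ff, fi, fl, ffi, ffl)
    return len(re.findall(r'/uniFB0[0-4]', text))
```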
## Quality Score Formula
```python
total_issues = (
consecutive_spaces +
excessive_newlines +
control_chars +
garbled_chars
)
quality_score = garbled_chars * 10 + total_issues
# Lower is better
```
**Ranking:**
- Excellent: < 10 score
- Good: 10-50 score
- Fair: 50-100 score
- Poor: > 100 score
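A small helper that applies these bands (a minimal sketch; treating a score of exactly 50 as Good matches the Docling row in the results below):
```python
def rate_quality(score):
    """Map a quality score to the rating bands above."""
    if score < 10:
        return "Excellent"
    if score <= 50:
        return "Good"
    if score <= 100:
        return "Fair"
    return "Poor"
```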
## Analysis Script
```python
import re

def analyze_quality(text):
    """Analyze PDF extraction quality."""
    return {
        'chars': len(text),
        'words': len(text.split()),
        'consecutive_spaces': len(re.findall(r' {2,}', text)),
        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
        'garbled_chars': len(re.findall(r'[\ufffd]', text)),
        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text))
    }

# Usage
text = open("extracted.txt").read()
metrics = analyze_quality(text)
total_issues = (metrics['consecutive_spaces'] + metrics['excessive_newlines'] +
                metrics['control_chars'] + metrics['garbled_chars'])
print(f"Quality score: {metrics['garbled_chars'] * 10 + total_issues}")
```
## Test Results (90-page Academic PDF)
| Tool | Total Issues | Garbled | Quality Score | Rating |
|------|--------------|---------|---------------|--------|
| pdfplumber | 0 | 0 | 0 | Excellent |
| PyMuPDF | 1 | 0 | 1 | Excellent |
| Docling | 50 | 0 | 50 | Good |
| pdftotext | 90 | 0 | 90 | Fair |
| pdfminer | 45 | 0 | 45 | Good |
| pypdf | 120 | 5 | 170 | Poor |
## Content Completeness
### Phrase Coverage Analysis
Extract 3-word phrases from each tool's output:
```python
import re

def extract_phrases(text):
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

# texts maps tool name -> extracted text for each extractor
common = set.intersection(*[extract_phrases(t) for t in texts.values()])
```
**Results:**
- Common phrases: 10,587 (captured by all tools)
- Docling unique: 17,170 phrases (most complete)
- pdfplumber unique: 8,229 phrases (conservative)
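The per-tool unique counts could be derived along these lines, reusing `extract_phrases` from above (the `texts` dict mapping tool name to extracted text is assumed, as in the snippet above):
```python
def unique_phrases(texts):
    """Phrases found by exactly one tool (texts: tool name -> extracted text)."""
    phrase_sets = {tool: extract_phrases(t) for tool, t in texts.items()}
    return {
        tool: phrases - set().union(*(s for name, s in phrase_sets.items() if name != tool))
        for tool, phrases in phrase_sets.items()
    }
```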
## Cleaning Strategies
### Fix Ligatures
```python
import re

def fix_ligatures(text):
    """Replace PDF ligature glyph names with their plain-letter equivalents."""
    replacements = {
        r'/uniFB00': 'ff',
        r'/uniFB01': 'fi',
        r'/uniFB02': 'fl',
        r'/uniFB03': 'ffi',
        r'/uniFB04': 'ffl',
    }
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text
```
### Normalize Whitespace
```python
def normalize_whitespace(text):
"""Clean excessive whitespace."""
text = re.sub(r' +', ' ', text) # Multiple spaces → single
text = re.sub(r'\n{4,}', '\n\n\n', text) # Many newlines → max 3
return text.strip()
```
### Join Hyphenated Words
```python
def join_hyphens(text):
"""Join end-of-line hyphenated words."""
return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
```
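One way to chain the three cleaners (the ordering is a judgment call; here the character-level fixes run before whitespace normalization):
```python
def clean_extracted_text(text):
    """Apply the cleaning passes above: ligatures, hyphen joins, then whitespace."""
    text = fix_ligatures(text)
    text = join_hyphens(text)
    return normalize_whitespace(text)

# Usage
cleaned = clean_extracted_text(open("extracted.txt").read())
```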