# PDF Extraction Quality Metrics

## Key Metrics

### 1. Consecutive Spaces

**What:** Multiple spaces in sequence (2+)

**Pattern:** ` {2,}`

**Impact:** Formatting artifacts, token waste

**Good:** < 50 occurrences

**Bad:** > 100 occurrences

### 2. Excessive Newlines

**What:** 4+ consecutive newlines

**Pattern:** `\n{4,}`

**Impact:** Page breaks treated as whitespace

**Good:** < 20 occurrences

**Bad:** > 50 occurrences

### 3. Control Characters

**What:** Non-printable characters

**Pattern:** `[\x00-\x08\x0b\x0c\x0e-\x1f]`

**Impact:** Parsing errors, display issues

**Good:** 0 occurrences

**Bad:** > 0 occurrences

### 4. Garbled Characters

**What:** Unicode replacement characters (U+FFFD, `�`)

**Pattern:** `\ufffd`

**Impact:** Lost information, encoding failures

**Good:** 0 occurrences

**Bad:** > 0 occurrences

### 5. Hyphenation Breaks

**What:** End-of-line hyphens not joined

**Pattern:** `\w+-\n\w+`

**Impact:** Word splitting affects search

**Good:** < 10 occurrences

**Bad:** > 50 occurrences

### 6. Ligature Encoding

**What:** PDF ligature glyph names leaking into the text

**Examples:** `/uniFB00` (ff), `/uniFB01` (fi), `/uniFB03` (ffi)

**Impact:** Search failures, readability

**Fix:** Post-process with regex replacement
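
The six detectors above can be exercised on a small sample. The sample string here is a made-up illustration containing one instance of most artifact classes, not output from any real extractor:

```python
import re

# Hypothetical garbled extraction: two U+FFFD chars, one double space,
# one hyphen break, one run of 4+ newlines, one leaked ligature name.
sample = ("The e\ufffdect of  di\ufffderent set-\ntings "
          "on /uniFB01ne-tuning.\n\n\n\n\nNext page.")

patterns = {
    "consecutive_spaces": r" {2,}",
    "excessive_newlines": r"\n{4,}",
    "control_chars": r"[\x00-\x08\x0b\x0c\x0e-\x1f]",
    "garbled_chars": r"\ufffd",
    "hyphen_breaks": r"\w+-\n\w+",
    "ligatures": r"/uniFB0[0-4]",
}

for name, pat in patterns.items():
    print(f"{name}: {len(re.findall(pat, sample))}")
```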

## Quality Score Formula

```python
total_issues = (
    consecutive_spaces +
    excessive_newlines +
    control_chars +
    garbled_chars
)

# Garbled characters are weighted 10x on top of the base count
# because they represent unrecoverable information loss.
quality_score = garbled_chars * 10 + total_issues
# Lower is better
```

**Ranking:**

- Excellent: score < 10
- Good: score 10-50
- Fair: score 50-100
- Poor: score > 100
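
The formula and ranking bands can be wrapped into helpers; the function names are illustrative, and the band boundaries (inclusive upper edges) are one reasonable reading of the ranges above:

```python
def quality_score(metrics):
    """Weighted score from raw counts: garbled chars count 10x extra."""
    total_issues = (
        metrics["consecutive_spaces"]
        + metrics["excessive_newlines"]
        + metrics["control_chars"]
        + metrics["garbled_chars"]
    )
    return metrics["garbled_chars"] * 10 + total_issues

def rating(score):
    """Map a score onto the ranking bands (lower is better)."""
    if score < 10:
        return "Excellent"
    if score <= 50:
        return "Good"
    if score <= 100:
        return "Fair"
    return "Poor"

print(rating(quality_score({
    "consecutive_spaces": 5, "excessive_newlines": 2,
    "control_chars": 0, "garbled_chars": 1,
})))
```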

## Analysis Script

```python
import re

def analyze_quality(text):
    """Analyze PDF extraction quality."""
    return {
        'chars': len(text),
        'words': len(text.split()),
        'consecutive_spaces': len(re.findall(r' {2,}', text)),
        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
        'garbled_chars': len(re.findall(r'\ufffd', text)),
        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text))
    }

# Usage
with open("extracted.txt", encoding="utf-8") as f:
    text = f.read()
metrics = analyze_quality(text)
score = (metrics['garbled_chars'] * 10
         + metrics['consecutive_spaces']
         + metrics['excessive_newlines']
         + metrics['control_chars']
         + metrics['garbled_chars'])
print(f"Quality score: {score}")
```

## Test Results (90-page Academic PDF)

| Tool | Total Issues | Garbled | Quality Score | Rating |
|------|--------------|---------|---------------|--------|
| pdfplumber | 0 | 0 | 0 | Excellent |
| PyMuPDF | 1 | 0 | 1 | Excellent |
| Docling | 50 | 0 | 50 | Good |
| pdftotext | 90 | 0 | 90 | Fair |
| pdfminer | 45 | 0 | 45 | Good |
| pypdf | 120 | 5 | 170 | Poor |

## Content Completeness

### Phrase Coverage Analysis

Extract 3-word phrases from each tool's output and intersect the sets:

```python
def extract_phrases(text):
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

# texts maps tool name -> that tool's extracted text
common = set.intersection(*[extract_phrases(t) for t in texts.values()])
```
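
A runnable version of the comparison, using tiny made-up outputs in place of real extractions (the `texts` dict and tool names are illustrative only):

```python
import re

def extract_phrases(text):
    # 3-word shingles over alphabetic tokens, as in the snippet above
    words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

# Hypothetical per-tool outputs; real inputs would be full extractions.
texts = {
    "tool_a": "The quick brown fox jumps over the lazy dog",
    "tool_b": "The quick brown fox leaps over the lazy dog",
}

phrase_sets = {tool: extract_phrases(t) for tool, t in texts.items()}
common = set.intersection(*phrase_sets.values())
unique = {tool: len(s - common) for tool, s in phrase_sets.items()}
print(len(common), unique)
```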

**Results:**

- Common phrases: 10,587 (captured by all tools)
- Docling unique: 17,170 phrases (most complete)
- pdfplumber unique: 8,229 phrases (conservative)

## Cleaning Strategies

### Fix Ligatures

```python
def fix_ligatures(text):
    """Fix PDF ligature encoding."""
    replacements = {
        r'/uniFB00': 'ff',
        r'/uniFB01': 'fi',
        r'/uniFB02': 'fl',
        r'/uniFB03': 'ffi',
        r'/uniFB04': 'ffl',
    }
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text
```

### Normalize Whitespace

```python
def normalize_whitespace(text):
    """Clean excessive whitespace."""
    text = re.sub(r' {2,}', ' ', text)        # Multiple spaces → single
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Many newlines → max 3
    return text.strip()
```

### Join Hyphenated Words

```python
def join_hyphens(text):
    """Join end-of-line hyphenated words."""
    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
```
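
A full cleaning pass can chain the three helpers. The ordering here (ligatures, then hyphens, then whitespace) is an assumption, not prescribed above; the hyphen join must run before anything that would disturb the newlines it matches. Ligature names are literal strings, so plain `str.replace` is equivalent to the regex version:

```python
import re

def fix_ligatures(text):
    # Literal glyph-name substitutions, so str.replace suffices here
    for pattern, repl in {'/uniFB00': 'ff', '/uniFB01': 'fi', '/uniFB02': 'fl',
                          '/uniFB03': 'ffi', '/uniFB04': 'ffl'}.items():
        text = text.replace(pattern, repl)
    return text

def join_hyphens(text):
    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)

def normalize_whitespace(text):
    text = re.sub(r' {2,}', ' ', text)
    text = re.sub(r'\n{4,}', '\n\n\n', text)
    return text.strip()

def clean_text(text):
    """Apply all cleaning steps in one pass (assumed ordering)."""
    return normalize_whitespace(join_hyphens(fix_ligatures(text)))

print(clean_text("di/uniFB03cult  set-\ntings"))
```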