# PDF Extraction Quality Metrics
## Key Metrics
### 1. Consecutive Spaces
**What:** Multiple spaces in sequence (2+)
**Pattern:** ` {2,}`
**Impact:** Formatting artifacts, token waste
**Good:** < 50 occurrences
**Bad:** > 100 occurrences
### 2. Excessive Newlines
**What:** 4+ consecutive newlines
**Pattern:** `\n{4,}`
**Impact:** Page breaks treated as whitespace
**Good:** < 20 occurrences
**Bad:** > 50 occurrences
### 3. Control Characters
**What:** Non-printable characters
**Pattern:** `[\x00-\x08\x0b\x0c\x0e-\x1f]`
**Impact:** Parsing errors, display issues
**Good:** 0 occurrences
**Bad:** > 0 occurrences
### 4. Garbled Characters
**What:** Unicode replacement characters (`�`)
**Pattern:** `[\ufffd]`
**Impact:** Lost information, encoding failures
**Good:** 0 occurrences
**Bad:** > 0 occurrences
### 5. Hyphenation Breaks
**What:** End-of-line hyphens not joined
**Pattern:** `\w+-\n\w+`
**Impact:** Word splitting affects search
**Good:** < 10 occurrences
**Bad:** > 50 occurrences
### 6. Ligature Encoding
**What:** Ligature glyph names left in the text instead of the letters they represent
**Examples:** `/uniFB00` (ff), `/uniFB01` (fi), `/uniFB03` (ffi)
**Impact:** Search failures, reduced readability
**Fix:** Post-process with regex replacement (see Cleaning Strategies below)
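Ligature markers are not counted by the analysis script below; a minimal counter could look like this (the `/uniFB00`–`/uniFB04` range is an assumption based on the examples above):
```python
import re

def count_ligature_markers(text):
    # Count /uniFB00../uniFB04 glyph names (ff, fi, fl, ffi, ffl)
    return len(re.findall(r'/uniFB0[0-4]', text))
```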
## Quality Score Formula
```python
total_issues = (
consecutive_spaces +
excessive_newlines +
control_chars +
garbled_chars
)
quality_score = garbled_chars * 10 + total_issues
# Lower is better
```
**Ranking:**
- Excellent: < 10 score
- Good: 10-50 score
- Fair: 50-100 score
- Poor: > 100 score
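A small helper that applies these bands (a minimal sketch; treating a score of exactly 50 as Good matches the Docling row in the results below):
```python
def rate_quality(score):
    """Map a quality score to the rating bands above."""
    if score < 10:
        return "Excellent"
    if score <= 50:
        return "Good"
    if score <= 100:
        return "Fair"
    return "Poor"
```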
## Analysis Script
```python
import re

def analyze_quality(text):
    """Analyze PDF extraction quality."""
    return {
        'chars': len(text),
        'words': len(text.split()),
        'consecutive_spaces': len(re.findall(r' {2,}', text)),
        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
        'garbled_chars': len(re.findall(r'[\ufffd]', text)),
        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text))
    }

# Usage
text = open("extracted.txt").read()
metrics = analyze_quality(text)
total_issues = (metrics['consecutive_spaces'] + metrics['excessive_newlines'] +
                metrics['control_chars'] + metrics['garbled_chars'])
print(f"Quality score: {metrics['garbled_chars'] * 10 + total_issues}")
```
## Test Results (90-page Academic PDF)
| Tool | Total Issues | Garbled | Quality Score | Rating |
|------|--------------|---------|---------------|--------|
| pdfplumber | 0 | 0 | 0 | Excellent |
| PyMuPDF | 1 | 0 | 1 | Excellent |
| Docling | 50 | 0 | 50 | Good |
| pdftotext | 90 | 0 | 90 | Fair |
| pdfminer | 45 | 0 | 45 | Good |
| pypdf | 120 | 5 | 170 | Poor |
## Content Completeness
### Phrase Coverage Analysis
Extract 3-word phrases from each tool's output:
```python
import re

def extract_phrases(text):
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return {' '.join(words[i:i+3]) for i in range(len(words) - 2)}

# texts maps tool name -> extracted text for each extractor
common = set.intersection(*[extract_phrases(t) for t in texts.values()])
```
**Results:**
- Common phrases: 10,587 (captured by all tools)
- Docling unique: 17,170 phrases (most complete)
- pdfplumber unique: 8,229 phrases (conservative)
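The per-tool unique counts could be derived along these lines, reusing `extract_phrases` from above (the `texts` dict mapping tool name to extracted text is assumed, as in the snippet above):
```python
def unique_phrases(texts):
    """Phrases found by exactly one tool (texts: tool name -> extracted text)."""
    phrase_sets = {tool: extract_phrases(t) for tool, t in texts.items()}
    return {
        tool: phrases - set().union(*(s for name, s in phrase_sets.items() if name != tool))
        for tool, phrases in phrase_sets.items()
    }
```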
## Cleaning Strategies
### Fix Ligatures
```python
import re

def fix_ligatures(text):
    """Replace PDF ligature glyph names with their plain-letter equivalents."""
    replacements = {
        r'/uniFB00': 'ff',
        r'/uniFB01': 'fi',
        r'/uniFB02': 'fl',
        r'/uniFB03': 'ffi',
        r'/uniFB04': 'ffl',
    }
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text
```
### Normalize Whitespace
```python
def normalize_whitespace(text):
"""Clean excessive whitespace."""
text = re.sub(r' +', ' ', text) # Multiple spaces → single
text = re.sub(r'\n{4,}', '\n\n\n', text) # Many newlines → max 3
return text.strip()
```
### Join Hyphenated Words
```python
def join_hyphens(text):
"""Join end-of-line hyphenated words."""
return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
```
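One way to chain the three cleaners (the ordering is a judgment call; here the character-level fixes run before whitespace normalization):
```python
def clean_extracted_text(text):
    """Apply the cleaning passes above: ligatures, hyphen joins, then whitespace."""
    text = fix_ligatures(text)
    text = join_hyphens(text)
    return normalize_whitespace(text)

# Usage
cleaned = clean_extracted_text(open("extracted.txt").read())
```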