Initial commit

2025-11-30 09:05:19 +08:00
commit 09fec2555b
96 changed files with 24269 additions and 0 deletions
--- a/skills/pdftext/LICENSE.txt
+++ b/skills/pdftext/LICENSE.txt
@@ -0,0 +1,176 @@
+Apache License
+Version 2.0, January 2004
+http://www.apache.org/licenses/
+
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+1. Definitions.
+
+   "License" shall mean the terms and conditions for use, reproduction,
+   and distribution as defined by Sections 1 through 9 of this document.
+
+   "Licensor" shall mean the copyright owner or entity authorized by
+   the copyright owner that is granting the License.
+
+   "Legal Entity" shall mean the union of the acting entity and all
+   other entities that control, are controlled by, or are under common
+   control with that entity. For the purposes of this definition,
+   "control" means (i) the power, direct or indirect, to cause the
+   direction or management of such entity, whether by contract or
+   otherwise, or (ii) ownership of fifty percent (50%) or more of the
+   outstanding shares, or (iii) beneficial ownership of such entity.
+
+   "You" (or "Your") shall mean an individual or Legal Entity
+   exercising permissions granted by this License.
+
+   "Source" form shall mean the preferred form for making modifications,
+   including but not limited to software source code, documentation
+   source, and configuration files.
+
+   "Object" form shall mean any form resulting from mechanical
+   transformation or translation of a Source form, including but
+   not limited to compiled object code, generated documentation,
+   and conversions to other media types.
+
+   "Work" shall mean the work of authorship, whether in Source or
+   Object form, made available under the License, as indicated by a
+   copyright notice that is included in or attached to the work
+   (an example is provided in the Appendix below).
+
+   "Derivative Works" shall mean any work, whether in Source or Object
+   form, that is based on (or derived from) the Work and for which the
+   editorial revisions, annotations, elaborations, or other modifications
+   represent, as a whole, an original work of authorship. For the purposes
+   of this License, Derivative Works shall not include works that remain
+   separable from, or merely link (or bind by name) to the interfaces of,
+   the Work and Derivative Works thereof.
+
+   "Contribution" shall mean any work of authorship, including
+   the original version of the Work and any modifications or additions
+   to that Work or Derivative Works thereof, that is intentionally
+   submitted to Licensor for inclusion in the Work by the copyright owner
+   or by an individual or Legal Entity authorized to submit on behalf of
+   the copyright owner. For the purposes of this definition, "submitted"
+   means any form of electronic, verbal, or written communication sent
+   to the Licensor or its representatives, including but not limited to
+   communication on electronic mailing lists, source code control systems,
+   and issue tracking systems that are managed by, or on behalf of, the
+   Licensor for the purpose of discussing and improving the Work, but
+   excluding communication that is conspicuously marked or otherwise
+   designated in writing by the copyright owner as "Not a Contribution."
+
+   "Contributor" shall mean Licensor and any individual or Legal Entity
+   on behalf of whom a Contribution has been received by Licensor and
+   subsequently incorporated within the Work.
+
+2. Grant of Copyright License. Subject to the terms and conditions of
+   this License, each Contributor hereby grants to You a perpetual,
+   worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+   copyright license to reproduce, prepare Derivative Works of,
+   publicly display, publicly perform, sublicense, and distribute the
+   Work and such Derivative Works in Source or Object form.
+
+3. Grant of Patent License. Subject to the terms and conditions of
+   this License, each Contributor hereby grants to You a perpetual,
+   worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+   (except as stated in this section) patent license to make, have made,
+   use, offer to sell, sell, import, and otherwise transfer the Work,
+   where such license applies only to those patent claims licensable
+   by such Contributor that are necessarily infringed by their
+   Contribution(s) alone or by combination of their Contribution(s)
+   with the Work to which such Contribution(s) was submitted. If You
+   institute patent litigation against any entity (including a
+   cross-claim or counterclaim in a lawsuit) alleging that the Work
+   or a Contribution incorporated within the Work constitutes direct
+   or contributory patent infringement, then any patent licenses
+   granted to You under this License for that Work shall terminate
+   as of the date such litigation is filed.
+
+4. Redistribution. You may reproduce and distribute copies of the
+   Work or Derivative Works thereof in any medium, with or without
+   modifications, and in Source or Object form, provided that You
+   meet the following conditions:
+
+   (a) You must give any other recipients of the Work or
+       Derivative Works a copy of this License; and
+
+   (b) You must cause any modified files to carry prominent notices
+       stating that You changed the files; and
+
+   (c) You must retain, in the Source form of any Derivative Works
+       that You distribute, all copyright, patent, trademark, and
+       attribution notices from the Source form of the Work,
+       excluding those notices that do not pertain to any part of
+       the Derivative Works; and
+
+   (d) If the Work includes a "NOTICE" text file as part of its
+       distribution, then any Derivative Works that You distribute must
+       include a readable copy of the attribution notices contained
+       within such NOTICE file, excluding those notices that do not
+       pertain to any part of the Derivative Works, in at least one
+       of the following places: within a NOTICE text file distributed
+       as part of the Derivative Works; within the Source form or
+       documentation, if provided along with the Derivative Works; or,
+       within a display generated by the Derivative Works, if and
+       wherever such third-party notices normally appear. The contents
+       of the NOTICE file are for informational purposes only and
+       do not modify the License. You may add Your own attribution
+       notices within Derivative Works that You distribute, alongside
+       or as an addendum to the NOTICE text from the Work, provided
+       that such additional attribution notices cannot be construed
+       as modifying the License.
+
+   You may add Your own copyright statement to Your modifications and
+   may provide additional or different license terms and conditions
+   for use, reproduction, or distribution of Your modifications, or
+   for any such Derivative Works as a whole, provided Your use,
+   reproduction, and distribution of the Work otherwise complies with
+   the conditions stated in this License.
+
+5. Submission of Contributions. Unless You explicitly state otherwise,
+   any Contribution intentionally submitted for inclusion in the Work
+   by You to the Licensor shall be under the terms and conditions of
+   this License, without any additional terms or conditions.
+   Notwithstanding the above, nothing herein shall supersede or modify
+   the terms of any separate license agreement you may have executed
+   with Licensor regarding such Contributions.
+
+6. Trademarks. This License does not grant permission to use the trade
+   names, trademarks, service marks, or product names of the Licensor,
+   except as required for reasonable and customary use in describing the
+   origin of the Work and reproducing the content of the NOTICE file.
+
+7. Disclaimer of Warranty. Unless required by applicable law or
+   agreed to in writing, Licensor provides the Work (and each
+   Contributor provides its Contributions) on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+   implied, including, without limitation, any warranties or conditions
+   of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+   PARTICULAR PURPOSE. You are solely responsible for determining the
+   appropriateness of using or redistributing the Work and assume any
+   risks associated with Your exercise of permissions under this License.
+
+8. Limitation of Liability. In no event and under no legal theory,
+   whether in tort (including negligence), contract, or otherwise,
+   unless required by applicable law (such as deliberate and grossly
+   negligent acts) or agreed to in writing, shall any Contributor be
+   liable to You for damages, including any direct, indirect, special,
+   incidental, or consequential damages of any character arising as a
+   result of this License or out of the use or inability to use the
+   Work (including but not limited to damages for loss of goodwill,
+   work stoppage, computer failure or malfunction, or any and all
+   other commercial damages or losses), even if such Contributor
+   has been advised of the possibility of such damages.
+
+9. Accepting Warranty or Additional Liability. While redistributing
+   the Work or Derivative Works thereof, You may choose to offer,
+   and charge a fee for, acceptance of support, warranty, indemnity,
+   or other liability obligations and/or rights consistent with this
+   License. However, in accepting such obligations, You may act only
+   on Your own behalf and on Your sole responsibility, not on behalf
+   of any other Contributor, and only if You agree to indemnify,
+   defend, and hold each Contributor harmless for any liability
+   incurred by, or claims asserted against, such Contributor by reason
+   of your accepting any such warranty or additional liability.
+
+END OF TERMS AND CONDITIONS
--- a/skills/pdftext/NOTICE.txt
+++ b/skills/pdftext/NOTICE.txt
@@ -0,0 +1,20 @@
+pdftext
+Copyright 2025 Warren Zhu
+
+This skill was created based on research conducted in November 2025 comparing
+PDF extraction tools for academic research and LLM consumption.
+
+Research included testing of:
+- Docling (IBM Research)
+- PyMuPDF (Artifex Software)
+- pdfplumber (Jeremy Singer-Vine)
+- pdfminer.six
+- pypdf
+- Ghostscript (Artifex Software)
+- Poppler (pdftotext)
+
+All tool comparisons and benchmarks are based on independent testing on
+academic PDFs from the distributed cognition literature.
+
+No code from external projects is included in this skill. All example scripts
+are original work or standard usage patterns from public documentation.
--- a/skills/pdftext/SKILL.md
+++ b/skills/pdftext/SKILL.md
@@ -0,0 +1,128 @@
+---
+name: pdftext
+description: Extract text from PDFs for LLM consumption using AI-powered or traditional tools. Use when converting academic PDFs to markdown, extracting structured content (headers/tables/lists), batch processing research papers, preparing PDFs for RAG systems, or when mentions of "pdf extraction", "pdf to text", "pdf to markdown", "docling", "pymupdf", "pdfplumber" appear. Provides Docling (AI-powered, structure-preserving, 97.9% table accuracy) and traditional tools (PyMuPDF for speed, pdfplumber for quality). All processing is on-device with no API calls.
+license: Apache 2.0 (see LICENSE.txt)
+---
+
+# PDF Text Extraction
+
+## Tool Selection
+
+| Tool | Speed | Quality | Structure | Use When |
+|------|-------|---------|-----------|----------|
+| **Docling** | 0.43s/page | Good | ✓ Yes | Need headers/tables/lists, academic PDFs, LLM consumption |
+| **PyMuPDF** | 0.01s/page | Excellent | ✗ No | Speed critical, simple text extraction, good enough quality |
+| **pdfplumber** | 0.44s/page | Perfect | ✗ No | Maximum fidelity needed, slow acceptable |
+
+**Decision:**
+- Academic research → Docling (structure preservation)
+- Batch processing → PyMuPDF (~33x faster than Docling)
+- Critical accuracy → pdfplumber (0 quality issues)
+
+## Installation
+
+```bash
+# Create virtual environment
+python3 -m venv pdf_env
+source pdf_env/bin/activate
+
+# Install Docling (AI-powered, recommended)
+pip install docling
+
+# Install traditional tools
+pip install pymupdf pdfplumber
+```
+
+**Docling's first run downloads ML models** (~500MB-1GB, cached locally, no API calls).
+
+## Basic Usage
+
+### Docling (Structure-Preserving)
+
+```python
+from docling.document_converter import DocumentConverter
+
+converter = DocumentConverter()  # Reuse for multiple PDFs
+result = converter.convert("paper.pdf")
+markdown = result.document.export_to_markdown()
+
+# Save output
+with open("paper.md", "w", encoding="utf-8") as f:
+    f.write(markdown)
+```
+
+**Output includes:** Headers (##), tables (|...|), lists (- ...), image markers.
+
+### PyMuPDF (Fast)
+
+```python
+import fitz
+
+doc = fitz.open("paper.pdf")
+text = "\n".join(page.get_text() for page in doc)
+doc.close()
+
+with open("paper.txt", "w", encoding="utf-8") as f:
+    f.write(text)
+```
+
+### pdfplumber (Highest Quality)
+
+```python
+import pdfplumber
+
+with pdfplumber.open("paper.pdf") as pdf:
+    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
+
+with open("paper.txt", "w", encoding="utf-8") as f:
+    f.write(text)
+```
+
+## Batch Processing
+
+See `examples/batch_convert.py` for ready-to-use script.
+
+**Pattern:**
+```python
+from pathlib import Path
+from docling.document_converter import DocumentConverter
+
+converter = DocumentConverter()  # Initialize once
+for pdf in Path("./pdfs").glob("*.pdf"):
+    result = converter.convert(str(pdf))
+    markdown = result.document.export_to_markdown()
+    (Path("./output") / f"{pdf.stem}.md").write_text(markdown, encoding="utf-8")  # ./output must already exist
+```
+
+**Performance tip:** Reuse one converter instance across files; reinitializing it for each PDF wastes time.
+
+## Quality Considerations
+
+**Common issues:**
+- Ligatures: `/uniFB03` → "ffi" (post-process with regex)
+- Excessive whitespace: ~50 instances with Docling and ~90 with pdftotext in testing (0-1 with pdfplumber and PyMuPDF)
+- Hyphenation breaks: End-of-line hyphens may remain
+
+**Quality metrics script:** See `examples/quality_analysis.py`
+
+**Benchmarks:** See `references/benchmarks.md` for enterprise production data.
+
+## Troubleshooting
+
+**Slow first run:** ML models downloading (15-30s). Subsequent runs fast.
+
+**Out of memory:** Reduce concurrent conversions, process large PDFs individually.
+
+**Missing tables:** Ensure `do_table_structure=True` in Docling options.
+
+**Garbled text:** PDF encoding issue. Apply ligature fixes post-processing.
+
+## Privacy
+
+**All tools run on-device.** No API calls, no data sent externally. Docling downloads models once, caches locally (~500MB-1GB).
+
+## References
+
+- Tool comparison: `references/tool-comparison.md`
+- Quality metrics: `references/quality-metrics.md`
+- Production benchmarks: `references/benchmarks.md`
--- a/skills/pdftext/examples/batch_convert.py
+++ b/skills/pdftext/examples/batch_convert.py
@@ -0,0 +1,107 @@
+#!/usr/bin/env python3
+"""
+Batch convert PDFs to markdown using Docling.
+
+Usage:
+    python batch_convert.py <pdf_directory> <output_directory>
+
+Example:
+    python batch_convert.py ./papers ./markdown_output
+
+Copyright 2025 Warren Zhu
+Licensed under the Apache License, Version 2.0
+"""
+
+import sys
+import time
+from pathlib import Path
+
+try:
+    from docling.document_converter import DocumentConverter
+except ImportError:
+    print("Error: Docling not installed. Run: pip install docling")
+    sys.exit(1)
+
+
+def batch_convert(pdf_dir, output_dir):
+    """Convert all PDFs in directory to markdown."""
+
+    pdf_dir = Path(pdf_dir)
+    output_dir = Path(output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    # Get PDF files
+    pdf_files = sorted(pdf_dir.glob("*.pdf"))
+    if not pdf_files:
+        print(f"No PDF files found in {pdf_dir}")
+        return
+
+    print(f"Found {len(pdf_files)} PDFs")
+    print()
+
+    # Initialize converter once
+    print("Initializing Docling...")
+    converter = DocumentConverter()
+    print("Ready")
+    print()
+
+    # Convert each PDF
+    results = []
+    total_start = time.time()
+
+    for i, pdf_path in enumerate(pdf_files, 1):
+        print(f"[{i}/{len(pdf_files)}] {pdf_path.name}")
+
+        try:
+            start = time.time()
+            result = converter.convert(str(pdf_path))
+            markdown = result.document.export_to_markdown()
+            elapsed = time.time() - start
+
+            # Save
+            output_file = output_dir / f"{pdf_path.stem}.md"
+            output_file.write_text(markdown, encoding="utf-8")
+
+            # Stats
+            pages = len(result.document.pages)
+            chars = len(markdown)
+
+            print(f"  ✓ {pages} pages in {elapsed:.1f}s ({elapsed / max(pages, 1):.2f}s/page)")
+            print(f"  ✓ {chars:,} chars → {output_file.name}")
+
+            results.append({
+                'file': pdf_path.name,
+                'pages': pages,
+                'time': elapsed,
+                'status': 'Success'
+            })
+
+        except Exception as e:
+            elapsed = time.time() - start
+            print(f"  ✗ Error: {e}")
+            results.append({
+                'file': pdf_path.name,
+                'pages': 0,
+                'time': elapsed,
+                'status': f'Failed: {e}'
+            })
+
+        print()
+
+    # Summary
+    total_time = time.time() - total_start
+    success_count = sum(1 for r in results if r['status'] == 'Success')
+
+    print("=" * 60)
+    print(f"Complete: {success_count}/{len(results)} successful")
+    print(f"Total time: {total_time:.1f}s ({total_time/60:.1f} min)")
+    print(f"Output: {output_dir}/")
+    print("=" * 60)
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 3:
+        print("Usage: python batch_convert.py <pdf_dir> <output_dir>")
+        sys.exit(1)
+
+    batch_convert(sys.argv[1], sys.argv[2])
--- a/skills/pdftext/examples/quality_analysis.py
+++ b/skills/pdftext/examples/quality_analysis.py
@@ -0,0 +1,146 @@
+#!/usr/bin/env python3
+"""
+Analyze PDF extraction quality across different tools.
+
+Usage:
+    python quality_analysis.py <extraction_directory>
+
+Example:
+    python quality_analysis.py ./pdf_extraction_results
+
+Expects files named: PDFname_tool.txt (e.g., paper_docling.txt, paper_pymupdf.txt)
+
+Copyright 2025 Warren Zhu
+Licensed under the Apache License, Version 2.0
+"""
+
+import re
+import sys
+from pathlib import Path
+from collections import defaultdict
+
+
+def analyze_quality(text):
+    """Analyze text quality metrics."""
+    return {
+        'chars': len(text),
+        'words': len(text.split()),
+        'consecutive_spaces': len(re.findall(r'  +', text)),
+        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
+        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
+        'garbled_chars': len(re.findall(r'\ufffd', text)),
+        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text))
+    }
+
+
+def compare_tools(results_dir):
+    """Compare extraction quality across tools."""
+
+    results_dir = Path(results_dir)
+    if not results_dir.exists():
+        print(f"Error: {results_dir} not found")
+        return
+
+    # Group files by PDF
+    pdf_files = defaultdict(dict)
+
+    for txt_file in sorted(results_dir.glob('*.txt')):
+        # Parse: PDFname_tool.txt
+        parts = txt_file.stem.rsplit('_', 1)
+        if len(parts) == 2:
+            pdf_name, tool = parts
+            text = txt_file.read_text(encoding='utf-8', errors='replace')  # undecodable bytes become U+FFFD and are counted as garbled
+            pdf_files[pdf_name][tool] = text
+
+    if not pdf_files:
+        print(f"No extraction files found in {results_dir}")
+        print("Expected format: PDFname_tool.txt")
+        return
+
+    # Analyze each PDF
+    for pdf_name, tools in sorted(pdf_files.items()):
+        print("=" * 80)
+        print(f"PDF: {pdf_name}")
+        print("=" * 80)
+        print()
+
+        # Quality metrics
+        results = {tool: analyze_quality(text) for tool, text in tools.items()}
+
+        print("QUALITY METRICS")
+        print("-" * 80)
+        print(f"{'Tool':<20} {'Chars':>12} {'Words':>10} {'Issues':>10} {'Garbled':>10}")
+        print("-" * 80)
+
+        for tool in ['docling', 'pymupdf', 'pdfplumber', 'pdftotext', 'pdfminer', 'pypdf']:
+            if tool in results:
+                r = results[tool]
+                issues = (r['consecutive_spaces'] + r['excessive_newlines'] +
+                         r['control_chars'] + r['garbled_chars'])
+                print(f"{tool:<20} {r['chars']:>12,} {r['words']:>10,} "
+                      f"{issues:>10} {r['garbled_chars']:>10}")
+
+        print()
+
+        # Find best
+        best_quality = min(results.items(),
+                          key=lambda x: x[1]['consecutive_spaces'] + x[1]['garbled_chars'])
+        most_content = max(results.items(), key=lambda x: x[1]['chars'])
+
+        print(f"Best quality: {best_quality[0]}")
+        print(f"Most content: {most_content[0]}")
+        print()
+
+    # Overall ranking
+    print("=" * 80)
+    print("OVERALL RANKING")
+    print("=" * 80)
+    print()
+
+    tool_scores = defaultdict(lambda: {'total_issues': 0, 'total_garbled': 0, 'files': 0})
+
+    for tools in pdf_files.values():
+        for tool, text in tools.items():
+            r = analyze_quality(text)
+            issues = (r['consecutive_spaces'] + r['excessive_newlines'] +
+                     r['control_chars'] + r['garbled_chars'])
+
+            tool_scores[tool]['total_issues'] += issues
+            tool_scores[tool]['total_garbled'] += r['garbled_chars']
+            tool_scores[tool]['files'] += 1
+
+    # Calculate average quality
+    ranked = []
+    for tool, scores in tool_scores.items():
+        avg_issues = scores['total_issues'] / scores['files']
+        avg_garbled = scores['total_garbled'] / scores['files']
+        quality_score = avg_garbled * 10 + avg_issues
+
+        ranked.append({
+            'tool': tool,
+            'score': quality_score,
+            'avg_issues': avg_issues,
+            'avg_garbled': avg_garbled
+        })
+
+    ranked.sort(key=lambda x: x['score'])
+
+    print(f"{'Rank':<6} {'Tool':<20} {'Avg Issues':>12} {'Avg Garbled':>12} {'Score':>10}")
+    print("-" * 80)
+
+    for i, r in enumerate(ranked, 1):
+        medal = "🥇" if i == 1 else "🥈" if i == 2 else "🥉" if i == 3 else "  "
+        print(f"{medal} {i:<3} {r['tool']:<20} {r['avg_issues']:>12.1f} "
+              f"{r['avg_garbled']:>12.1f} {r['score']:>10.1f}")
+
+    print()
+    print("Quality score: garbled_chars * 10 + total_issues (lower is better)")
+    print()
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2:
+        print("Usage: python quality_analysis.py <extraction_directory>")
+        sys.exit(1)
+
+    compare_tools(sys.argv[1])
--- a/skills/pdftext/references/benchmarks.md
+++ b/skills/pdftext/references/benchmarks.md
@@ -0,0 +1,149 @@
+# PDF Extraction Benchmarks
+
+## Enterprise Benchmark (2025 Procycons)
+
+Production-grade comparison of ML-based PDF extraction tools.
+
+| Tool | Table Accuracy | Text Fidelity | Speed (s/page) | Memory (GB) |
+|------|----------------|---------------|----------------|-------------|
+| **Docling** | **97.9%** | **100%** | 6.28 | 2.1 |
+| Marker | 89.2% | 98.5% | 8.45 | 3.5 |
+| MinerU | 92.1% | 99.2% | 12.33 | 4.2 |
+| Unstructured.io | 75.0% | 95.8% | 51.02 | 1.8 |
+| PyMuPDF4LLM | 82.3% | 97.1% | 4.12 | 1.2 |
+| LlamaParse | 88.5% | 97.3% | 6.00 | N/A (cloud) |
+
+**Test corpus:** 500 academic papers, business reports, financial statements (mixed complexity)
+
+**Key finding:** Docling leads in table accuracy with competitive speed. Unstructured.io, despite its popularity, has the lowest table accuracy and the slowest speed in this comparison.
+
+*Source: Procycons Enterprise PDF Processing Benchmark 2025*
+
+## Academic PDF Test (This Research)
+
+Real-world testing on distributed cognition literature.
+
+### Test Environment
+
+- **PDFs:** 4 academic books
+- **Total size:** 98.2 MB
+- **Pages:** ~400 pages combined
+- **Content:** Multi-column layouts, tables, figures, references
+
+### Test Results
+
+#### Speed (90-page PDF, 1.9 MB)
+
+| Tool | Total Time | Per Page | Speedup (vs. Docling) |
+|------|------------|----------|---------|
+| pdftotext | 0.63s | 0.007s/page | 60x |
+| PyMuPDF | 1.18s | 0.013s/page | 33x |
+| Docling | 38.86s | 0.432s/page | 1x |
+| pdfplumber | 38.91s | 0.432s/page | 1x |
+
+#### Quality (Issues per document)
+
+| Tool | Consecutive Spaces | Excessive Newlines | Control Chars | Garbled | Total |
+|------|-------------------|-------------------|---------------|---------|-------|
+| pdfplumber | 0 | 0 | 0 | 0 | **0** |
+| PyMuPDF | 1 | 0 | 0 | 0 | **1** |
+| Docling | 48 | 2 | 0 | 0 | **50** |
+| pdftotext | 85 | 5 | 0 | 0 | **90** |
+
+#### Structure Preservation
+
+| Tool | Headers | Tables | Lists | Images |
+|------|---------|--------|-------|--------|
+| Docling | ✓ 36 | ✓ 16 rows | ✓ 307 items | ✓ 4 markers |
+| PyMuPDF | ✗ | ✗ | ✗ | ✗ |
+| pdfplumber | ✗ | ✗ | ✗ | ✗ |
+| pdftotext | ✗ | ✗ | ✗ | ✗ |
+
+**Key finding:** Docling is the only tool tested that preserves document structure.
+
+## Production Recommendations
+
+### By Use Case
+
+**Academic research / Literature review:**
+- **Primary:** Docling (structure essential)
+- **Secondary:** PyMuPDF (speed for large batches)
+
+**RAG system ingestion:**
+- **Recommended:** Docling (semantic structure preserved)
+- **Alternative:** PyMuPDF + post-processing
+
+**Quick text extraction:**
+- **Recommended:** PyMuPDF (~33x faster than Docling)
+- **Alternative:** pdftotext (fastest, lower quality)
+
+**Maximum quality (legal, financial):**
+- **Recommended:** pdfplumber (perfect quality)
+- **Alternative:** Docling (structure + good quality)
+
+### By Document Type
+
+- **Academic papers:** Docling (tables, multi-column, references)
+- **Books/ebooks:** PyMuPDF (simple linear text)
+- **Business reports:** Docling (tables, charts, sections)
+- **Scanned documents:** Docling with OCR enabled
+- **Legal contracts:** pdfplumber (maximum fidelity)
+
+## ML Model Performance (Docling)
+
+### RT-DETR (Layout Detection)
+
+- **Speed:** 44-633ms per page
+- **Accuracy:** ~95% layout element detection
+- **Detects:** Text blocks, headers, tables, figures, captions
+
+### TableFormer (Table Structure)
+
+- **Speed:** 400ms-1.74s per table
+- **Accuracy:** 97.9% cell-level accuracy
+- **Handles:** Borderless tables, merged cells, nested tables
+
+## Cloud vs On-Device
+
+| Tool | Processing | Privacy | Cost | Speed |
+|------|-----------|---------|------|-------|
+| Docling | On-device | ✓ Private | Free | 0.43s/page |
+| LlamaParse | Cloud API | ✗ Sends data | $0.003/page | 6s/page |
+| Claude Vision | Cloud API | ✗ Sends data | $0.0075/page | Variable |
+| Mathpix | Cloud API | ✗ Sends data | $0.004/page | 4s/page |
+
+**Recommendation:** Use on-device (Docling) for sensitive/unpublished academic work.
+
+## Benchmark Methodology
+
+### Speed Testing
+
+```python
+import time
+
+start = time.time()
+result = converter.convert(pdf_path)
+elapsed = time.time() - start
+per_page = elapsed / page_count
+```
+
+### Quality Testing
+
+```python
+# Count quality issues (requires `import re`; `text` is the extracted document text)
+consecutive_spaces = len(re.findall(r'  +', text))
+excessive_newlines = len(re.findall(r'\n{4,}', text))
+control_chars = len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text))
+garbled_chars = len(re.findall(r'\ufffd', text))
+
+total_issues = consecutive_spaces + excessive_newlines + control_chars + garbled_chars
+```
+
+### Structure Testing
+
+```python
+# Count markdown elements
+headers = len(re.findall(r'^#{1,6}\s+.+$', markdown, re.MULTILINE))
+table_rows = len(re.findall(r'\|.+\|', markdown))  # counts lines that contain table rows, not whole tables
+lists = len(re.findall(r'^\s*[-*]\s+', markdown, re.MULTILINE))
+```
--- a/skills/pdftext/references/quality-metrics.md
+++ b/skills/pdftext/references/quality-metrics.md
@@ -0,0 +1,154 @@
+# PDF Extraction Quality Metrics
+
+## Key Metrics
+
+### 1. Consecutive Spaces
+- **What:** Multiple spaces in sequence (2+)
+- **Pattern:** `  +`
+- **Impact:** Formatting artifacts, token waste
+- **Good:** < 50 occurrences
+- **Bad:** > 100 occurrences
+
+### 2. Excessive Newlines
+- **What:** 4+ consecutive newlines
+- **Pattern:** `\n{4,}`
+- **Impact:** Page breaks treated as whitespace
+- **Good:** < 20 occurrences
+- **Bad:** > 50 occurrences
+
+### 3. Control Characters
+- **What:** Non-printable characters
+- **Pattern:** `[\x00-\x08\x0b\x0c\x0e-\x1f]`
+- **Impact:** Parsing errors, display issues
+- **Good:** 0 occurrences
+- **Bad:** > 0 occurrences
+
+### 4. Garbled Characters
+- **What:** Replacement characters (U+FFFD)
+- **Pattern:** `\ufffd`
+- **Impact:** Lost information, encoding failures
+- **Good:** 0 occurrences
+- **Bad:** > 0 occurrences
+
+### 5. Hyphenation Breaks
+- **What:** End-of-line hyphens not joined
+- **Pattern:** `\w+-\n\w+`
+- **Impact:** Word splitting affects search
+- **Good:** < 10 occurrences
+- **Bad:** > 50 occurrences
+
+### 6. Ligature Encoding
+- **What:** Special character combinations
+- **Examples:** `/uniFB00` (ff), `/uniFB01` (fi), `/uniFB03` (ffi)
+- **Impact:** Search failures, readability
+- **Fix:** Post-process with regex replacement
+
+## Quality Score Formula
+
+```python
+total_issues = (
+    consecutive_spaces +
+    excessive_newlines +
+    control_chars +
+    garbled_chars
+)
+
+quality_score = garbled_chars * 10 + total_issues
+# Lower is better
+```
+
+**Ranking:**
+- Excellent: < 10 score
+- Good: 10-50 score
+- Fair: 51-100 score
+- Poor: > 100 score
+
+## Analysis Script
+
+```python
+import re
+
+def analyze_quality(text):
+    """Analyze PDF extraction quality."""
+    return {
+        'chars': len(text),
+        'words': len(text.split()),
+        'consecutive_spaces': len(re.findall(r'  +', text)),
+        'excessive_newlines': len(re.findall(r'\n{4,}', text)),
+        'control_chars': len(re.findall(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)),
+        'garbled_chars': len(re.findall(r'\ufffd', text)),
+        'hyphen_breaks': len(re.findall(r'\w+-\n\w+', text))
+    }
+
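+# Usage (score formula: garbled_chars * 10 + total_issues; lower is better)
+metrics = analyze_quality(open("extracted.txt", encoding="utf-8").read())
+total_issues = sum(metrics[k] for k in ('consecutive_spaces', 'excessive_newlines', 'control_chars', 'garbled_chars'))
+print(f"Quality score: {metrics['garbled_chars'] * 10 + total_issues}")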
+```
+
+## Test Results (90-page Academic PDF)
+
+| Tool | Total Issues | Garbled | Quality Score | Rating |
+|------|--------------|---------|---------------|--------|
+| pdfplumber | 0 | 0 | 0 | Excellent |
+| PyMuPDF | 1 | 0 | 1 | Excellent |
+| Docling | 50 | 0 | 50 | Good |
+| pdftotext | 90 | 0 | 90 | Fair |
+| pdfminer | 45 | 0 | 45 | Good |
+| pypdf | 120 | 5 | 170 | Poor |
+
+## Content Completeness
+
+### Phrase Coverage Analysis
+
+Extract 3-word phrases from each tool's output (`texts` maps each tool name to its extracted text; the snippet requires `import re`):
+
+```python
+def extract_phrases(text):
+    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
+    return {' '.join(words[i:i+3]) for i in range(len(words)-2)}
+
+common = set.intersection(*[extract_phrases(t) for t in texts.values()])
+```
+
+**Results:**
+- Common phrases: 10,587 (captured by all tools)
+- Docling unique: 17,170 phrases (most complete)
+- pdfplumber unique: 8,229 phrases (conservative)
+
+## Cleaning Strategies
+
+### Fix Ligatures
+
+```python
+def fix_ligatures(text):
+    """Fix PDF ligature encoding."""
+    replacements = {
+        r'/uniFB00': 'ff',
+        r'/uniFB01': 'fi',
+        r'/uniFB02': 'fl',
+        r'/uniFB03': 'ffi',
+        r'/uniFB04': 'ffl',
+    }
+    for pattern, repl in replacements.items():
+        text = re.sub(pattern, repl, text)
+    return text
+```
+
+### Normalize Whitespace
+
+```python
+def normalize_whitespace(text):
+    """Clean excessive whitespace."""
+    text = re.sub(r'  +', ' ', text)  # Multiple spaces → single
+    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Many newlines → max 3
+    return text.strip()
+```
+
+### Join Hyphenated Words
+
+```python
+def join_hyphens(text):
+    """Join end-of-line hyphenated words."""
+    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
+```
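+
+The three cleaners can be chained into a single pass. The sketch below is one possible ordering, not a prescribed one: ligatures are fixed first so the hyphen regex sees whole words, and whitespace is normalized last. The wrapper name `clean_text` and the sample string are hypothetical.
+
+```python
+import re
+
+def fix_ligatures(text):
+    """Replace leftover /uniFBxx glyph names with plain letters."""
+    replacements = {
+        '/uniFB00': 'ff', '/uniFB01': 'fi', '/uniFB02': 'fl',
+        '/uniFB03': 'ffi', '/uniFB04': 'ffl',
+    }
+    for glyph_name, letters in replacements.items():
+        text = text.replace(glyph_name, letters)
+    return text
+
+def join_hyphens(text):
+    """Join end-of-line hyphenated words."""
+    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
+
+def normalize_whitespace(text):
+    """Collapse runs of spaces and cap consecutive newlines at three."""
+    text = re.sub(r'  +', ' ', text)
+    text = re.sub(r'\n{4,}', '\n\n\n', text)
+    return text.strip()
+
+def clean_text(text):
+    """Hypothetical wrapper: apply the three cleaners in a fixed order."""
+    text = fix_ligatures(text)
+    text = join_hyphens(text)
+    return normalize_whitespace(text)
+
+sample = "The e/uniFB03cient extrac-\ntion   pipeline\n\n\n\n\nends here."
+cleaned = clean_text(sample)
+print(repr(cleaned))
+```
+
+Note that `join_hyphens` cannot tell a line-break hyphen from a real one, so a genuine compound split at a line end (for example `state-` followed by `of-the-art`) is also merged.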
--- a/skills/pdftext/references/tool-comparison.md
+++ b/skills/pdftext/references/tool-comparison.md
@@ -0,0 +1,141 @@
+# PDF Tool Comparison
+
+## Summary Table
+
+| Tool | Type | Speed | Quality Issues | Garbled | Structure | License |
+|------|------|-------|----------------|---------|-----------|---------|
+| **Docling** | ML | 0.43s/page | 50 | 0 | ✓ Yes | Apache 2.0 |
+| **PyMuPDF** | Traditional | 0.01s/page | 1 | 0 | ✗ No | AGPL |
+| **pdfplumber** | Traditional | 0.44s/page | 0 | 0 | ✗ No | MIT |
+| **pdftotext** | Traditional | 0.007s/page | 90 | 0 | ✗ No | GPL |
+| **pdfminer.six** | Traditional | 0.15s/page | 45 | 0 | ✗ No | MIT |
+| **pypdf** | Traditional | 0.25s/page | 120 | 5 | ✗ No | BSD |
+
+*Test environment: 90-page academic PDF, 1.9 MB*
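+
+Speed multiples quoted for individual tools can be checked against the table with simple division. The snippet below is a minimal sketch using the per-page timings copied from the table; the helper names `slowdown` and `total_seconds` are illustrative, and because the published timings are rounded, the ratios are approximate.
+
+```python
+# Seconds per page, copied from the summary table
+seconds_per_page = {
+    'Docling': 0.43,
+    'PyMuPDF': 0.01,
+    'pdfplumber': 0.44,
+    'pdftotext': 0.007,
+    'pdfminer.six': 0.15,
+    'pypdf': 0.25,
+}
+
+def slowdown(tool, baseline='PyMuPDF'):
+    """How many times slower `tool` is than `baseline`, per page."""
+    return seconds_per_page[tool] / seconds_per_page[baseline]
+
+def total_seconds(tool, pages):
+    """Estimated wall time for a document with `pages` pages."""
+    return seconds_per_page[tool] * pages
+
+# List the tools from fastest to slowest
+for tool in sorted(seconds_per_page, key=seconds_per_page.get):
+    print(f"{tool:<13} {slowdown(tool):6.1f}x  {total_seconds(tool, 90):6.1f}s for 90 pages")
+```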
+
+## Detailed Comparison
+
+### Docling (Recommended for Academic PDFs)
+
+**Advantages:**
+- Only tool tested that preserves structure (headers, tables, lists)
+- ML-based layout understanding via RT-DETR + TableFormer
+- Markdown output well suited to LLM input
+- 97.9% table accuracy in the Procycons 2025 benchmark (see Enterprise Benchmarks)
+- On-device processing (no API calls)
+
+**Disadvantages:**
+- About 40x slower than PyMuPDF (0.43 vs 0.01 s/page)
+- Requires 500MB-1GB model download
+- Some ligature encoding issues
+
+**Use when:**
+- Document structure is essential
+- Processing academic papers with tables
+- Preparing content for RAG systems
+- LLM consumption is primary goal
+
+### PyMuPDF (Recommended for Speed)
+
+**Advantages:**
+- Very fast: about 40x faster than pdfplumber (0.01 vs 0.44 s/page); only pdftotext was faster in this test
+- Excellent quality (only 1 issue in test)
+- Clean output with minimal artifacts
+- C-based, highly optimized
+
+**Disadvantages:**
+- No structure preservation
+- AGPL license (restrictive for commercial use)
+- Flat text output
+
+**Use when:**
+- Speed is critical
+- Simple text extraction sufficient
+- Batch processing large datasets
+- Structure preservation not needed
+
+### pdfplumber (Recommended for Quality)
+
+**Advantages:**
+- Cleanest output in the test (0 issues)
+- Character-level spatial analysis
+- Geometric table detection
+- MIT license
+
+**Disadvantages:**
+- Very slow (about 40x slower than PyMuPDF: 0.44 vs 0.01 s/page)
+- No markdown structure output
+- CPU-intensive
+
+**Use when:**
+- Maximum fidelity required
+- Quality more important than speed
+- Processing critical documents
+- Slow processing acceptable
+
+## Traditional vs ML-Based
+
+### Traditional Tools
+
+**How they work:**
+- Parse PDF internal structure
+- Extract embedded text objects
+- Follow PDF specification rules
+
+**Advantages:**
+- Fast (no ML inference)
+- Small footprint (no model files)
+- Deterministic output
+
+**Disadvantages:**
+- No layout understanding
+- Cannot handle borderless tables
+- Lose document hierarchy
+
+### ML-Based Tools (Docling)
+
+**How they work:**
+- Computer vision to "see" document layout
+- RT-DETR detects layout regions
+- TableFormer understands table structure
+- Hybrid: ML for layout + PDF parsing for text
+
+**Advantages:**
+- Understands visual layout
+- Handles complex multi-column layouts
+- Preserves semantic structure
+- Works with borderless tables
+
+**Disadvantages:**
+- Slower (ML inference time)
+- Larger footprint (model files)
+- Less predictable output (results can vary with model version and hardware)
+
+## Architecture Details
+
+### Docling Pipeline
+
+1. **PDF Backend** - Extracts raw content and positions
+2. **AI Models** - Analyze layout and structure
+   - RT-DETR: Layout analysis (44-633ms/page)
+   - TableFormer: Table structure (400ms-1.74s/table)
+3. **Assembly** - Combines understanding with text
+
+### pdfplumber Architecture
+
+1. **Built on pdfminer.six** - Character-level extraction
+2. **Spatial clustering** - Groups chars into words/lines
+3. **Geometric detection** - Finds tables from lines/rectangles
+4. **Character objects** - Full metadata (position, font, size, color)
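+
+The spatial clustering step can be illustrated with a deliberately simplified sketch. This is not pdfplumber's actual implementation: it only shows the general idea of grouping characters into lines by vertical position and into words by horizontal gaps. The function name, the `y_tolerance` and `x_gap` parameters, and the sample data are assumptions; the dict keys (`text`, `x0`, `x1`, `top`) mirror fields that pdfplumber character objects expose.
+
+```python
+def group_chars_into_lines(chars, y_tolerance=3.0, x_gap=2.0):
+    """Group character dicts into lines of words.
+
+    chars: list of dicts with 'text', 'x0', 'x1', 'top'.
+    Returns a list of lines, each a list of word strings.
+    """
+    lines = []
+    # Characters whose 'top' is close to the line's first character share a line
+    for ch in sorted(chars, key=lambda c: (c['top'], c['x0'])):
+        if lines and abs(ch['top'] - lines[-1][0]['top']) <= y_tolerance:
+            lines[-1].append(ch)
+        else:
+            lines.append([ch])
+
+    result = []
+    for line in lines:
+        line.sort(key=lambda c: c['x0'])  # left-to-right reading order
+        words, current = [], line[0]['text']
+        for prev, ch in zip(line, line[1:]):
+            # A horizontal gap wider than x_gap starts a new word
+            if ch['x0'] - prev['x1'] > x_gap:
+                words.append(current)
+                current = ch['text']
+            else:
+                current += ch['text']
+        words.append(current)
+        result.append(words)
+    return result
+
+# Hypothetical characters: "Hi yo" on one line, "ok" on the next
+sample_chars = [
+    {'text': 'H', 'x0': 0, 'x1': 5, 'top': 10},
+    {'text': 'i', 'x0': 5, 'x1': 8, 'top': 10},
+    {'text': 'y', 'x0': 14, 'x1': 19, 'top': 10.5},
+    {'text': 'o', 'x0': 19, 'x1': 24, 'top': 10},
+    {'text': 'o', 'x0': 0, 'x1': 5, 'top': 30},
+    {'text': 'k', 'x0': 5, 'x1': 10, 'top': 30},
+]
+lines = group_chars_into_lines(sample_chars)
+print(lines)
+```
+
+Per-character work of this kind is one plausible reason character-level tools are slower than tools that read whole text objects.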
+
+## Enterprise Benchmarks (2025 Procycons)
+
+| Tool | Table Accuracy | Text Fidelity | Speed (s/page) |
+|------|----------------|---------------|----------------|
+| Docling | 97.9% | 100% | 6.28 |
+| Marker | 89.2% | 98.5% | 8.45 |
+| MinerU | 92.1% | 99.2% | 12.33 |
+| Unstructured.io | 75.0% | 95.8% | 51.02 |
+| LlamaParse | 88.5% | 97.3% | 6.00 |
+
+*Source: Procycons Enterprise PDF Processing Benchmark 2025*