zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

Document Conversion Reference

This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.

PDF Files

PDF conversion extracts text, tables, and structure from PDF documents.

Basic PDF Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

PDF with Azure Document Intelligence

For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:

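from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown

# The API key is passed as a credential object via docintel_credential;
# verify the parameter name against your installed MarkItDown version.
md = MarkItDown(
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_credential=AzureKeyCredential("YOUR-API-KEY")
)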
result = md.convert("complex_table.pdf")
print(result.text_content)

Benefits of Azure Document Intelligence:

Superior table extraction and reconstruction
Better handling of multi-column layouts
Form field recognition
Improved text ordering in complex documents

PDF Handling Notes

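Scanned PDFs have no text layer and need OCR; the default extractor may return little or no text for them, so check the output or route such files through Azure Document Intelligence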
Password-protected PDFs are not supported
Large PDFs may take longer to process
Vector graphics and embedded images are extracted where possible
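
A scanned PDF without a text layer typically converts to empty or near-empty output. The sketch below is one hedged way to flag such files after conversion. The function name `looks_scanned`, the caller-supplied page count, and the 50-characters-per-page threshold are illustrative assumptions, not part of MarkItDown.

```python
def looks_scanned(text_content: str, page_count: int, min_chars_per_page: int = 50) -> bool:
    """Heuristic: flag converted output that has too little text for its page count.

    The threshold is an arbitrary assumption; tune it on your own documents.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only non-whitespace characters so blank lines do not mask an empty result.
    visible_chars = sum(1 for ch in text_content if not ch.isspace())
    return visible_chars < page_count * min_chars_per_page
```

Files flagged this way are candidates for an OCR-capable path rather than proof that the document is scanned; a short cover page would trip the same check.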

Word Documents (DOCX)

Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.

Basic DOCX Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)

DOCX Structure Preservation

MarkItDown preserves:

Headings → Markdown headers (#, ##, etc.)
Bold/Italic → Markdown emphasis (**bold**, *italic*)
Lists → Markdown lists (ordered and unordered)
Tables → Markdown tables
Hyperlinks → Markdown links [text](url)
Images → Referenced with descriptions (can use LLM for descriptions)
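
To check that headings survived conversion, it can help to list the Markdown headers found in the output. The following is a minimal sketch that works on any Markdown string, such as `result.text_content`; the helper name `heading_outline` is hypothetical.

```python
import re

# ATX-style headers: one to six '#' characters followed by whitespace and a title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def heading_outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for each header, skipping fenced code blocks."""
    outline = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline
```

Comparing this outline with the heading structure of the source document is a quick way to spot headings that were styled manually in Word rather than with heading styles.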

Command-Line Usage

# Basic conversion
markitdown report.docx -o report.md

# Write into a subdirectory (the output/ directory must already exist)
markitdown report.docx -o output/report.md

DOCX with Images

To generate descriptions for images in Word documents, use LLM integration:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")

PowerPoint Presentations (PPTX)

PowerPoint conversion extracts text from slides while preserving structure.

Basic PPTX Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)

PPTX Structure

MarkItDown processes presentations as follows:

Each slide becomes a major section
Slide titles become headers
Bullet points are preserved
Tables are converted to Markdown tables
Notes are included if present
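
Because each slide becomes its own section, the output can be split back into per-slide chunks. The sketch below assumes that slides are delimited by HTML comments of the form `<!-- Slide number: N -->`; that marker format is an assumption about the converter's output and should be checked against a real conversion before relying on it. The name `split_slides` is hypothetical.

```python
import re

# Assumed slide delimiter; confirm it against actual converter output.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(text_content: str) -> dict[int, str]:
    """Map slide number to the Markdown that follows its marker."""
    # With one capture group, re.split yields:
    # [preamble, number_1, body_1, number_2, body_2, ...]
    parts = SLIDE_MARKER.split(text_content)
    slides = {}
    for i in range(1, len(parts) - 1, 2):
        slides[int(parts[i])] = parts[i + 1].strip()
    return slides
```

If no markers are present, the function returns an empty dictionary, which is a useful signal that the assumed format does not match.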

PPTX with Image Descriptions

Presentations often contain important visual information. Use LLM integration to describe images:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")

Custom prompts for presentations:

"Describe charts and graphs with their key data points"
"Explain diagrams and their relationships"
"Summarize visual content for accessibility"

Excel Spreadsheets (XLSX, XLS)

Excel conversion formats spreadsheet data as Markdown tables.

Basic XLSX Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)

Multi-Sheet Workbooks

For workbooks with multiple sheets:

Each sheet becomes a separate section
Sheet names are used as headers
Empty sheets are skipped
Formulas appear as their last computed (cached) values, not as formula text
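
Since sheet names are used as headers, the converted text can be separated back into one table per sheet. This is a minimal sketch that assumes each sheet starts with a level-2 header line (`## SheetName`); the header level and the helper name `split_sheets` are assumptions to verify against real output.

```python
def split_sheets(text_content: str) -> dict[str, str]:
    """Map sheet name to the Markdown table that follows its header."""
    sheets: dict[str, list[str]] = {}
    current = None
    for line in text_content.splitlines():
        if line.startswith("## "):
            # A new sheet section begins; table rows start with '|', so they never match.
            current = line[3:].strip()
            sheets[current] = []
        elif current is not None:
            sheets[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sheets.items()}
```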

XLSX Conversion Details

What's preserved:

Cell values (text, numbers, dates)
Table structure (rows and columns)
Sheet names
Cell formatting (bold headers)

What's not preserved:

Formulas (only computed values)
Charts and graphs (use LLM integration for descriptions)
Cell colors and conditional formatting
Comments and notes

Large Spreadsheets

For large spreadsheets, keep in mind:

Processing may be slower for files with many rows/columns
Very wide tables may not format well in Markdown
Filter or preprocess the data before conversion where possible
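
When preprocessing the spreadsheet itself is not practical, a wide or long table can also be trimmed after conversion. The sketch below keeps the header, the separator row, and a limited number of rows and columns. The helper name `truncate_table` is hypothetical, and the code assumes a simple pipe-delimited table with no escaped `\|` characters inside cells.

```python
def truncate_table(table: str, max_rows: int, max_cols: int) -> str:
    """Keep the header, the separator, and at most max_rows data rows and max_cols columns."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    kept = lines[: 2 + max_rows]  # header + separator + data rows
    out = []
    for line in kept:
        row = line.strip()
        # Drop exactly one leading and one trailing pipe so empty edge cells survive.
        if row.startswith("|"):
            row = row[1:]
        if row.endswith("|"):
            row = row[:-1]
        cells = [cell.strip() for cell in row.split("|")][:max_cols]
        out.append("| " + " | ".join(cells) + " |")
    return "\n".join(out)
```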

XLS (Legacy Excel) Files

Legacy .xls files are supported but require additional dependencies:

pip install 'markitdown[xls]'

Then use normally:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("legacy_data.xls")

Common Document Conversion Patterns

Batch Document Processing

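from markitdown import MarkItDown
import os

md = MarkItDown()

# Create the output directory if it does not exist yet
os.makedirs("markdown", exist_ok=True)

# Process all supported documents in a directory
for filename in os.listdir("documents"):
    # Match extensions case-insensitively (.PDF as well as .pdf)
    if filename.lower().endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
        result = md.convert(os.path.join("documents", filename))

        # Save under the same base name with a .md extension
        output_name = os.path.splitext(filename)[0] + ".md"
        with open(os.path.join("markdown", output_name), "w", encoding="utf-8") as f:
            f.write(result.text_content)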

Document with Mixed Content

For documents containing multiple types of content (text, tables, images):

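from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown
from openai import OpenAI

# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-ENDPOINT",
    # Key passed as a credential object; verify against your installed version
    docintel_credential=AzureKeyCredential("YOUR-KEY")
)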

result = md.convert("complex_report.pdf")

Error Handling

from markitdown import MarkItDown

md = MarkItDown()

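try:
    result = md.convert("document.pdf")
    print(result.text_content)
except FileNotFoundError as e:
    # The input path does not exist
    print(f"File not found: {e}")
except Exception as e:
    # Any other failure (unsupported format, missing optional dependency, etc.)
    print(f"Conversion failed: {e}")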
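
In a batch, one failing file should usually not stop the rest. The sketch below collects successes and failures separately. The conversion step is passed in as a plain callable so the pattern is independent of any particular library; the names `convert_all` and `convert` are hypothetical.

```python
from typing import Callable, Iterable


def convert_all(
    paths: Iterable[str], convert: Callable[[str], str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (converted, failed): path -> Markdown text, and path -> error summary."""
    converted: dict[str, str] = {}
    failed: dict[str, str] = {}
    for path in paths:
        try:
            converted[path] = convert(path)
        except Exception as exc:  # broad on purpose: one bad file should not stop the batch
            failed[path] = f"{type(exc).__name__}: {exc}"
    return converted, failed
```

With MarkItDown, `convert` could be something like `lambda p: md.convert(p).text_content`, and the `failed` mapping can then be logged or retried with different settings.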

Output Quality Tips

For best results:

Use Azure Document Intelligence for PDFs with complex tables
Enable LLM descriptions for documents with important visual content
Ensure source documents are well-structured (proper headings, etc.)
For scanned documents, ensure good scan quality for OCR accuracy
Test with sample documents to verify output quality

Performance Considerations

Conversion speed depends on:

Document size and complexity
Number of images (especially with LLM descriptions)
Use of Azure Document Intelligence
Available system resources

Optimization tips:

Disable LLM integration if image descriptions aren't needed
Use standard extraction (not Azure) for simple documents
Process large batches in parallel when possible
Consider streaming for very large documents
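
Parallel batch processing can be sketched with a thread pool from the standard library. Threads tend to help most when conversion waits on network calls (Azure Document Intelligence, LLM descriptions); for CPU-bound local conversion a process pool may be a better fit. Whether a single `MarkItDown` instance can be shared safely across threads is not stated here, so this sketch takes the conversion step as a caller-supplied callable, which could create its own instance per call. The names `convert_parallel` and `convert` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def convert_parallel(
    paths: Sequence[str], convert: Callable[[str], str], max_workers: int = 4
) -> list[str]:
    """Convert paths concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so results line up with paths.
        # An exception raised by convert() is re-raised here when its result is read.
        return list(pool.map(convert, paths))
```

Keeping `max_workers` small is a reasonable starting point when remote services with rate limits are involved.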
