Document Conversion Reference

This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.

PDF Files

PDF conversion extracts text, tables, and structure from PDF documents.

Basic PDF Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

PDF with Azure Document Intelligence

For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:

from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_table.pdf")
print(result.text_content)

Benefits of Azure Document Intelligence:

  • Superior table extraction and reconstruction
  • Better handling of multi-column layouts
  • Form field recognition
  • Improved text ordering in complex documents

PDF Handling Notes

  • Scanned PDFs have no text layer and require OCR. Do not assume OCR runs automatically: if the output is empty, run OCR on the file first or use Azure Document Intelligence
  • Password-protected PDFs are not supported
  • Large PDFs may take longer to process
  • Vector graphics and embedded images are extracted where possible
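
The scanned-PDF note can be checked cheaply after conversion. The following is a minimal sketch, assuming a PDF without a text layer shows up as near-empty output. The helper name `looks_unextracted` and the 50-character threshold are hypothetical and not part of MarkItDown.

```python
def looks_unextracted(text: str, min_chars: int = 50) -> bool:
    """Return True when converted output has too few visible characters
    to be a real text layer (a hint that the PDF may be a scan)."""
    # Count only non-whitespace characters; blank pages yield newlines only.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars


# Hypothetical usage with a real conversion (requires markitdown):
#   result = MarkItDown().convert("scan.pdf")
#   if looks_unextracted(result.text_content):
#       print("No text layer found; OCR is probably needed")
```

The threshold is a guess and should be tuned on sample documents; a short but legitimate PDF (a one-line cover page, for instance) would also trip it.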

Word Documents (DOCX)

Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.

Basic DOCX Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)

DOCX Structure Preservation

MarkItDown preserves:

  • Headings → Markdown headers (#, ##, etc.)
  • Bold/Italic → Markdown emphasis (**bold**, *italic*)
  • Lists → Markdown lists (ordered and unordered)
  • Tables → Markdown tables
  • Hyperlinks → Markdown links [text](url)
  • Images → Referenced with descriptions (can use LLM for descriptions)
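
One way to verify that headings survived conversion is to list them from the Markdown output. This is a sketch using only the standard library; `outline` is a hypothetical helper, and it assumes ATX-style headings (`#`, `##`, ...) as described in the list above.

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for every heading outside fenced code."""
    items = []
    in_fence = False
    for line in markdown.splitlines():
        # Toggle on fence lines so '#' comments inside code are ignored.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            items.append((len(match.group(1)), match.group(2)))
    return items
```

Calling `outline(result.text_content)` on a converted document and comparing the result with the headings in the Word file gives a quick structural check.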

Command-Line Usage

# Basic conversion
markitdown report.docx -o report.md

# Write into a subdirectory (the output/ directory must already exist)
markitdown report.docx -o output/report.md

DOCX with Images

To generate descriptions for images in Word documents, use LLM integration:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")

PowerPoint Presentations (PPTX)

PowerPoint conversion extracts text from slides while preserving structure.

Basic PPTX Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)

PPTX Structure

MarkItDown processes presentations as follows:

  • Each slide becomes a major section
  • Slide titles become headers
  • Bullet points are preserved
  • Tables are converted to Markdown tables
  • Notes are included if present
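
If per-slide text is needed downstream, the output can be split back into slides. The sketch below assumes slide boundaries are marked with an HTML comment of the form `<!-- Slide number: N -->`; that marker is an assumption about the converter's output, not something this document states, so inspect your own output before relying on it. `split_slides` is a hypothetical helper.

```python
import re

# Assumed slide boundary marker; adjust if your output differs.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(markdown: str) -> dict[int, str]:
    """Map slide number to the Markdown that follows its marker."""
    # With one capture group, re.split returns
    # [preamble, num1, body1, num2, body2, ...].
    parts = SLIDE_MARKER.split(markdown)
    slides = {}
    for i in range(1, len(parts) - 1, 2):
        slides[int(parts[i])] = parts[i + 1].strip()
    return slides
```

When no marker is found the function returns an empty dict, which is a useful signal that the assumption does not hold for your version.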

PPTX with Image Descriptions

Presentations often contain important visual information. Use LLM integration to describe images:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")

Example custom prompts for presentations:

  • "Describe charts and graphs with their key data points"
  • "Explain diagrams and their relationships"
  • "Summarize visual content for accessibility"

Excel Spreadsheets (XLSX, XLS)

Excel conversion formats spreadsheet data as Markdown tables.

Basic XLSX Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)

Multi-Sheet Workbooks

For workbooks with multiple sheets:

  • Each sheet becomes a separate section
  • Sheet names are used as headers
  • Empty sheets are skipped
  • Formulas are evaluated (values shown, not formulas)
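
Because each sheet becomes its own section, the output can be split into one chunk per sheet. This sketch assumes sheet names are emitted as level-2 headings (`## SheetName`); the heading level is an assumption, exposed as the `prefix` parameter, and `split_sheets` is a hypothetical helper.

```python
def split_sheets(markdown: str, prefix: str = "## ") -> dict[str, str]:
    """Map sheet name to the Markdown that follows its heading."""
    sheets: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith(prefix):
            # A new sheet heading starts a new section.
            current = line[len(prefix):].strip()
            sheets[current] = []
        elif current is not None:
            sheets[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sheets.items()}
```

For example, `split_sheets(result.text_content)["Sales"]` would return only the table for a sheet named Sales, provided the heading assumption holds.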

XLSX Conversion Details

What's preserved:

  • Cell values (text, numbers, dates)
  • Table structure (rows and columns)
  • Sheet names
  • Cell formatting (bold headers)

What's not preserved:

  • Formulas (only computed values)
  • Charts and graphs (use LLM integration for descriptions)
  • Cell colors and conditional formatting
  • Comments and notes

Large Spreadsheets

For large spreadsheets, keep in mind:

  • Processing is slower for files with many rows or columns
  • Very wide tables may not format well in Markdown
  • Filtering or preprocessing the data before conversion can help
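
For very wide tables, one option is to trim the Markdown after conversion. A minimal sketch, assuming pipe-delimited tables with leading and trailing pipes; `keep_columns` is a hypothetical helper, and it does not handle escaped pipes (`\|`) inside cells.

```python
def keep_columns(table: str, max_cols: int) -> str:
    """Keep only the first max_cols columns of each Markdown table row."""
    out = []
    for line in table.splitlines():
        stripped = line.strip()
        if not (stripped.startswith("|") and stripped.endswith("|")):
            out.append(line)  # not a table row; pass through unchanged
            continue
        # Drop the outer pipes, split into cells, keep the leading ones.
        cells = stripped[1:-1].split("|")
        out.append("|" + "|".join(cells[:max_cols]) + "|")
    return "\n".join(out)
```

Trimming after conversion is the simplest approach; filtering the spreadsheet itself before conversion avoids the conversion cost for columns that will be discarded.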

XLS (Legacy Excel) Files

Legacy .xls files are supported but require additional dependencies:

pip install 'markitdown[xls]'

Then use normally:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("legacy_data.xls")

Common Document Conversion Patterns

Batch Document Processing

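from markitdown import MarkItDown
import os

md = MarkItDown()
os.makedirs("markdown", exist_ok=True)  # the output directory must exist before writing

# Process all documents in a directory
for filename in os.listdir("documents"):
    # Match extensions case-insensitively so "REPORT.PDF" is included
    if filename.lower().endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
        result = md.convert(os.path.join("documents", filename))

        # Save to output directory as UTF-8
        output_name = os.path.splitext(filename)[0] + ".md"
        with open(os.path.join("markdown", output_name), "w", encoding="utf-8") as f:
            f.write(result.text_content)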

Document with Mixed Content

For documents containing multiple types of content (text, tables, images):

from markitdown import MarkItDown
from openai import OpenAI

# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)

result = md.convert("complex_report.pdf")

Error Handling

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Conversion failed: {e}")
    # Handle specific errors (file not found, unsupported format, etc.)
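
In a batch job it is often more convenient to record failures than to stop at the first one. The sketch below wraps any conversion callable and returns either text or an error message. `Outcome` and `safe_convert` are hypothetical names, and the converter is passed in as a function so the wrapper does not depend on a particular MarkItDown exception class.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class Outcome:
    path: Path
    text: Optional[str] = None
    error: Optional[str] = None


def safe_convert(convert: Callable[[str], str], path: str) -> Outcome:
    """Run convert(path) and capture the result or the failure reason."""
    p = Path(path)
    # Check for a missing file up front so the message is unambiguous.
    if not p.is_file():
        return Outcome(p, error="file not found")
    try:
        return Outcome(p, text=convert(str(p)))
    except Exception as exc:  # broad on purpose: record and continue
        return Outcome(p, error=f"{type(exc).__name__}: {exc}")
```

With MarkItDown, the callable could be `lambda p: md.convert(p).text_content`; the exception type name in `error` then shows which specific failures are worth handling separately.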

Output Quality Tips

For best results:

  1. Use Azure Document Intelligence for PDFs with complex tables
  2. Enable LLM descriptions for documents with important visual content
  3. Ensure source documents are well-structured (proper headings, etc.)
  4. For scanned documents, ensure good scan quality for OCR accuracy
  5. Test with sample documents to verify output quality

Performance Considerations

Conversion speed depends on:

  • Document size and complexity
  • Number of images (especially with LLM descriptions)
  • Use of Azure Document Intelligence
  • Available system resources

Optimization tips:

  • Disable LLM integration if image descriptions aren't needed
  • Use standard extraction (not Azure) for simple documents
  • Process large batches in parallel when possible
  • Consider streaming for very large documents
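
The parallel-processing tip can be sketched with a thread pool from the standard library. `convert_many` is a hypothetical helper that takes the conversion callable as a parameter. Threads mainly help when conversion waits on network calls (Azure Document Intelligence, LLM descriptions); CPU-bound parsing gains little because of Python's global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def convert_many(
    convert: Callable[[str], str],
    paths: Iterable[str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Convert several files concurrently and map each path to its output."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path
        # with its own result. An exception in any call propagates here.
        return dict(zip(paths, pool.map(convert, paths)))
```

Whether a single `MarkItDown` instance is safe to share across threads is not stated here, so a cautious callable creates one per call, for example `lambda p: MarkItDown().convert(p).text_content`.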