Document Conversion Reference
This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.
PDF Files
PDF conversion extracts text, tables, and structure from PDF documents.
Basic PDF Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
PDF with Azure Document Intelligence
For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:
from markitdown import MarkItDown
md = MarkItDown(
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_table.pdf")
print(result.text_content)
Benefits of Azure Document Intelligence:
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents
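Hard-coding the key, as in the example above, makes it easy to leak through version control. A minimal sketch of one alternative, assuming the hypothetical environment variable names DOCINTEL_ENDPOINT and DOCINTEL_KEY (chosen for this example, not defined by MarkItDown): it builds the two keyword arguments shown above and returns an empty dict when either value is missing, so the caller falls back to standard extraction.

```python
import os


def docintel_kwargs(env=None):
    """Build MarkItDown keyword arguments for Azure Document Intelligence.

    DOCINTEL_ENDPOINT and DOCINTEL_KEY are hypothetical variable names
    picked for this sketch. Returns {} when either one is unset or blank.
    """
    env = os.environ if env is None else env
    endpoint = env.get("DOCINTEL_ENDPOINT", "").strip()
    key = env.get("DOCINTEL_KEY", "").strip()
    if not endpoint or not key:
        return {}
    return {"docintel_endpoint": endpoint, "docintel_key": key}
```

The result can be unpacked into the constructor, for example MarkItDown(**docintel_kwargs()).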
PDF Handling Notes
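- Scanned PDFs have no text layer and require OCR. Do not assume it runs automatically: confirm that your installation performs OCR (for example through Azure Document Intelligence), because plain text extraction returns little or nothing for image-only pages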
- Password-protected PDFs are not supported
- Large PDFs may take longer to process
- Vector graphics and embedded images are extracted where possible
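A scanned PDF that was converted without OCR usually shows up as a result with almost no text. The following is a rough heuristic sketch, not a MarkItDown feature: the function name, the page_count argument (which the caller has to supply), and the threshold of 50 characters per page are all assumptions made for illustration.

```python
def looks_scanned(text_content, page_count=1, min_chars_per_page=50):
    """Return True when the extracted text is suspiciously short.

    Counts non-whitespace characters and compares them with a per-page
    minimum. The default threshold is an arbitrary starting point.
    """
    visible = sum(1 for ch in text_content if not ch.isspace())
    return visible < min_chars_per_page * max(page_count, 1)
```

Calling looks_scanned(result.text_content) after a conversion gives a cheap signal that the file may need OCR or Azure Document Intelligence.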
Word Documents (DOCX)
Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.
Basic DOCX Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
DOCX Structure Preservation
MarkItDown preserves:
- Headings → Markdown headers (#, ##, etc.)
- Bold/Italic → Markdown emphasis (**bold**, *italic*)
- Lists → Markdown lists (ordered and unordered)
- Tables → Markdown tables
- Hyperlinks → Markdown links ([text](url))
- Images → Referenced with descriptions (an LLM can generate the descriptions)
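One way to check that the heading structure survived a conversion is to list the headers found in the output. A small sketch, assuming the result uses ATX-style headers (lines starting with one to six # characters); heading_outline is a hypothetical helper, not part of MarkItDown.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def heading_outline(markdown_text):
    """Return (level, title) pairs for headers outside fenced code blocks."""
    outline = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline
```

Comparing heading_outline(result.text_content) with the headings visible in Word is a quick sanity check on a sample document.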
Command-Line Usage
# Basic conversion
markitdown report.docx -o report.md
# With output directory
markitdown report.docx -o output/report.md
DOCX with Images
To generate descriptions for images in Word documents, use LLM integration:
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")
PowerPoint Presentations (PPTX)
PowerPoint conversion extracts text from slides while preserving structure.
Basic PPTX Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
PPTX Structure
MarkItDown processes presentations as follows:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Notes are included if present
PPTX with Image Descriptions
Presentations often contain important visual information. Use LLM integration to describe images:
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")
Custom prompts for presentations:
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"
Excel Spreadsheets (XLSX, XLS)
Excel conversion formats spreadsheet data as Markdown tables.
Basic XLSX Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)
Multi-Sheet Workbooks
For workbooks with multiple sheets:
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas appear as their computed values, not as formula text
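Because each sheet becomes its own section, the converted text can be split back into one chunk per sheet. A minimal sketch, assuming sheet names are emitted as level-2 headers (lines starting with "## "); the source only says that sheet names are used as headers, so the level is a parameter to adjust after inspecting real output. split_by_header is a hypothetical helper.

```python
def split_by_header(markdown_text, level=2):
    """Map each header of the given level to the text that follows it.

    Text before the first matching header is ignored.
    """
    prefix = "#" * level + " "
    sections = {}
    current = None
    for line in markdown_text.splitlines():
        if line.startswith(prefix):
            current = line[len(prefix):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}
```

With a converted workbook, split_by_header(result.text_content) would then give one Markdown table per sheet name.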
XLSX Conversion Details
What's preserved:
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)
What's not preserved:
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes
Large Spreadsheets
For large spreadsheets, keep in mind:
- Processing may be slower for files with many rows/columns
- Very wide tables may not format well in Markdown
- Filter or preprocess the data before conversion where possible
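When filtering the workbook beforehand is not practical, a wide or long table can instead be trimmed after conversion. This is a post-processing sketch rather than the preprocessing mentioned above: trim_table is a hypothetical helper, it assumes a simple pipe table with a header row and a separator row, and it does not handle escaped pipe characters inside cells.

```python
def trim_table(table_text, max_cols=8, max_rows=50):
    """Keep the first max_cols columns and max_rows data rows of a pipe table."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    trimmed = []
    for index, line in enumerate(lines):
        if index >= max_rows + 2:  # header row + separator row + data rows
            break
        row = line.strip()
        if row.startswith("|"):
            row = row[1:]
        if row.endswith("|"):
            row = row[:-1]
        cells = [cell.strip() for cell in row.split("|")[:max_cols]]
        trimmed.append("| " + " | ".join(cells) + " |")
    return "\n".join(trimmed)
```

The limits of 8 columns and 50 rows are arbitrary defaults; pick values that suit where the Markdown will be displayed.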
XLS (Legacy Excel) Files
Legacy .xls files are supported but require additional dependencies:
pip install 'markitdown[xls]'
Then use normally:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("legacy_data.xls")
Common Document Conversion Patterns
Batch Document Processing
from markitdown import MarkItDown
import os
md = MarkItDown()
# Process all documents in a directory
os.makedirs("markdown", exist_ok=True)  # the output directory must exist before writing
for filename in os.listdir("documents"):
    if filename.lower().endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
        result = md.convert(os.path.join("documents", filename))
        # Save to output directory
        output_name = os.path.splitext(filename)[0] + ".md"
        with open(os.path.join("markdown", output_name), "w", encoding="utf-8") as f:
            f.write(result.text_content)
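In a larger batch, one unreadable file should not abort the whole run. The sketch below is one possible variant of the loop above: convert_directory is a hypothetical helper that accepts any object with a convert(path) method returning something with a text_content attribute (such as a MarkItDown instance), records failures, and keeps going.

```python
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx"}


def convert_directory(converter, source_dir, output_dir):
    """Convert supported files in source_dir and write .md files to output_dir.

    Returns (written, failures): a list of output file names and a dict
    mapping each failed input file name to its error message.
    """
    source, output = Path(source_dir), Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    written, failures = [], {}
    for path in sorted(source.iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        try:
            result = converter.convert(str(path))
        except Exception as exc:  # record the failure and continue the batch
            failures[path.name] = str(exc)
            continue
        target = output / (path.stem + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target.name)
    return written, failures
```

With the real library this would be called as convert_directory(MarkItDown(), "documents", "markdown"). Note that two inputs sharing a stem (report.pdf and report.docx) would write to the same output file.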
Document with Mixed Content
For documents containing multiple types of content (text, tables, images):
from markitdown import MarkItDown
from openai import OpenAI
# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)
result = md.convert("complex_report.pdf")
Error Handling
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Conversion failed: {e}")
    # Handle specific errors (file not found, unsupported format, etc.)
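The comment about handling specific errors can be made concrete with a small wrapper. This is a sketch under two assumptions: that a missing file surfaces as Python's built-in FileNotFoundError, and that every other failure is reported by its exception class name. safe_convert is a hypothetical helper; MarkItDown's own exception classes are deliberately not named here.

```python
def safe_convert(converter, path):
    """Return (text, error). Exactly one of the two values is None."""
    try:
        result = converter.convert(path)
    except FileNotFoundError:
        return None, f"file not found: {path}"
    except Exception as exc:  # broad on purpose: report the type, do not crash
        return None, f"{type(exc).__name__}: {exc}"
    return result.text_content, None
```

Returning the error as a value lets the caller log it or retry with different settings without wrapping every call site in try/except.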
Output Quality Tips
For best results:
- Use Azure Document Intelligence for PDFs with complex tables
- Enable LLM descriptions for documents with important visual content
- Ensure source documents are well-structured (proper headings, etc.)
- For scanned documents, ensure good scan quality for OCR accuracy
- Test with sample documents to verify output quality
Performance Considerations
Conversion speed depends on:
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources
Optimization tips:
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents
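Parallel batch processing can be sketched with a thread pool from the standard library. convert_many is a hypothetical helper that takes any callable returning an object with a text_content attribute. Two caveats apply: the source does not say whether a single MarkItDown instance is safe to share across threads, so passing a callable that creates its own instance per call is the conservative choice, and threads mostly help when time is spent waiting on network calls (LLM descriptions, Azure) rather than on local parsing.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_many(convert, paths, max_workers=4):
    """Run convert(path) concurrently.

    Returns a dict mapping each path to the converted text, or to the
    exception raised for that path.
    """
    paths = list(paths)

    def attempt(path):
        try:
            return convert(path).text_content
        except Exception as exc:  # keep the error next to the path that caused it
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(attempt, paths)))
```

A conservative call would be convert_many(lambda p: MarkItDown().convert(p), paths), which builds a fresh converter for every file.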