# Document Conversion Reference

This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.

## PDF Files

PDF conversion extracts text, tables, and structure from PDF documents.

### Basic PDF Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

### PDF with Azure Document Intelligence

For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:

```python
from markitdown import MarkItDown

# Keyword names can differ between MarkItDown releases (some accept
# docintel_credential rather than docintel_key); check your installed version.
md = MarkItDown(
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_table.pdf")
print(result.text_content)
```

**Benefits of Azure Document Intelligence:**
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents

### PDF Handling Notes

- Scanned PDFs require OCR. Whether OCR runs automatically depends on your MarkItDown version and the OCR tooling installed (such as tesseract); if the output comes back empty, use Azure Document Intelligence or run OCR on the file first
- Password-protected PDFs are not supported
- Large PDFs may take longer to process
- Vector graphics and embedded images are extracted where possible

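
One way to act on the scanned-PDF note is to check whether a conversion returned almost no visible text. The helper below is a minimal sketch; the function name and the `min_chars` threshold are hypothetical choices for illustration, not part of MarkItDown.

```python
def looks_like_scanned_pdf(text_content: str, min_chars: int = 20) -> bool:
    """Heuristic: True when a conversion produced almost no visible text."""
    # Drop all whitespace so blank lines and page breaks do not count as content
    visible = "".join(text_content.split())
    return len(visible) < min_chars
```

Calling `looks_like_scanned_pdf(result.text_content)` after `md.convert(...)` gives a cheap signal that the file may need OCR or Azure Document Intelligence.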
## Word Documents (DOCX)

Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.

### Basic DOCX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
```

### DOCX Structure Preservation

MarkItDown preserves:
- **Headings** → Markdown headers (`#`, `##`, etc.)
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
- **Lists** → Markdown lists (ordered and unordered)
- **Tables** → Markdown tables
- **Hyperlinks** → Markdown links `[text](url)`
- **Images** → Referenced with descriptions (can use LLM for descriptions)

### Command-Line Usage

```bash
# Basic conversion
markitdown report.docx -o report.md

# Write the output to a file inside an output directory
markitdown report.docx -o output/report.md
```

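
When the CLI is driven from a script, it can help to create the output directory up front, since this document does not say whether `-o` creates missing directories. The sketch below only assembles the same `markitdown <file> -o <path>` command shown above; the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_markitdown_command(source: str, output_dir: str) -> list[str]:
    """Return the CLI argument list, creating output_dir if it is missing."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Name the Markdown file after the source document
    target = out_dir / (Path(source).stem + ".md")
    return ["markitdown", source, "-o", str(target)]


def run_markitdown(source: str, output_dir: str) -> None:
    """Invoke the markitdown CLI; raises CalledProcessError on failure."""
    subprocess.run(build_markitdown_command(source, output_dir), check=True)
```

For example, `run_markitdown("report.docx", "output")` would write `output/report.md`, assuming the `markitdown` executable is on the `PATH`.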
### DOCX with Images

To generate descriptions for images in Word documents, use LLM integration:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")
```

## PowerPoint Presentations (PPTX)

PowerPoint conversion extracts text from slides while preserving structure.

### Basic PPTX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```

### PPTX Structure

MarkItDown processes presentations as:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Notes are included if present

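
Because each slide becomes its own section, the converted text can be split back into per-slide chunks. The sketch below assumes slides are delimited by an HTML comment of the form `<!-- Slide number: N -->`; that marker format is an assumption to verify against the output of your MarkItDown version, and `split_slides` is a hypothetical helper.

```python
import re

# Assumed marker format; verify against the output of your MarkItDown version.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(markdown: str) -> dict[int, str]:
    """Map each slide number to the Markdown that follows its marker."""
    # With one capture group, split() yields [preamble, number, body, number, body, ...]
    parts = SLIDE_MARKER.split(markdown)
    return {int(num): body.strip() for num, body in zip(parts[1::2], parts[2::2])}
```

If no markers are found, the function returns an empty dict, which is a hint that the assumed format does not match.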
### PPTX with Image Descriptions

Presentations often contain important visual information. Use LLM integration to describe images:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")
```

**Custom prompts for presentations:**
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"

## Excel Spreadsheets (XLSX, XLS)

Excel conversion formats spreadsheet data as Markdown tables.

### Basic XLSX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)
```

### Multi-Sheet Workbooks

For workbooks with multiple sheets:
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas are evaluated (values shown, not formulas)

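
Since each sheet becomes its own section headed by the sheet name, the output can be regrouped per sheet. The sketch below assumes sheet headers are emitted as level-2 headings (`## Sheet name`); that heading level is an assumption, and `split_sheets` is a hypothetical helper.

```python
def split_sheets(markdown: str) -> dict[str, str]:
    """Group converted workbook output by its '## <sheet name>' headers."""
    sheets: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            # A new sheet section starts here
            current = line[3:].strip()
            sheets[current] = []
        elif current is not None:
            sheets[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sheets.items()}
```

Table rows start with `|`, so they are not mistaken for headers; two sheets with the same name would overwrite each other in this simple version.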
### XLSX Conversion Details

**What's preserved:**
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)

**What's not preserved:**
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes

### Large Spreadsheets

Keep the following in mind for large spreadsheets:
- Processing may be slower for files with many rows/columns
- Very wide tables may not format well in Markdown
- Filter or preprocess the data before conversion if possible

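
If preprocessing the workbook is not an option, the converted table can be trimmed afterwards. The following is a minimal sketch that assumes a standard Markdown table (header row, separator row, then data rows); `truncate_markdown_table` is a hypothetical helper, not a MarkItDown function.

```python
def truncate_markdown_table(table: str, max_rows: int) -> str:
    """Keep the header, separator, and at most max_rows data rows."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    header, body = lines[:2], lines[2:]
    kept = body[:max_rows]
    dropped = len(body) - len(kept)
    out = header + kept
    if dropped > 0:
        # A blank line ends the table before the note
        out.append(f"\n*(rows omitted: {dropped})*")
    return "\n".join(out)
```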
### XLS (Legacy Excel) Files

Legacy `.xls` files are supported but require additional dependencies:

```bash
pip install 'markitdown[xls]'
```

Then convert the file as usual:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("legacy_data.xls")
```

## Common Document Conversion Patterns

### Batch Document Processing

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

# Create the output directory if it does not exist yet
os.makedirs("markdown", exist_ok=True)

# Process all documents in a directory
for filename in os.listdir("documents"):
    # Match extensions case-insensitively (e.g. REPORT.PDF)
    if filename.lower().endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
        result = md.convert(f"documents/{filename}")

        # Save to output directory
        output_name = os.path.splitext(filename)[0] + ".md"
        with open(f"markdown/{output_name}", "w", encoding="utf-8") as f:
            f.write(result.text_content)
```

### Document with Mixed Content

For documents containing multiple types of content (text, tables, images):

```python
from markitdown import MarkItDown
from openai import OpenAI

# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)

result = md.convert("complex_report.pdf")
```

### Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Conversion failed: {e}")
    # Handle specific errors (file not found, unsupported format, etc.)
```

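
The same try/except pattern extends to batches, where one bad file should not stop the rest. The sketch below takes the converter as a plain callable so it does not depend on any particular MarkItDown API; `convert_all` and its return shape are hypothetical.

```python
from pathlib import Path
from typing import Callable


def convert_all(
    paths: list[str], convert: Callable[[str], str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert each path; return (outputs, errors) keyed by file name."""
    outputs: dict[str, str] = {}
    errors: dict[str, str] = {}
    for path in paths:
        name = Path(path).name
        try:
            outputs[name] = convert(path)
        except Exception as exc:  # keep going; record what failed and why
            errors[name] = f"{type(exc).__name__}: {exc}"
    return outputs, errors
```

With MarkItDown this could be called as `convert_all(files, lambda p: md.convert(p).text_content)`, then the `errors` dict reviewed once the batch finishes.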
## Output Quality Tips

**For best results:**
1. Use Azure Document Intelligence for PDFs with complex tables
2. Enable LLM descriptions for documents with important visual content
3. Ensure source documents are well-structured (proper headings, etc.)
4. For scanned documents, ensure good scan quality for OCR accuracy
5. Test with sample documents to verify output quality

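
For the last tip, a quick structural summary of the output makes it easier to compare sample conversions. This is a rough sketch: `summarize_markdown` is a hypothetical helper, and its counts are simple line-based heuristics (for instance, `#` lines inside code fences are counted as headings).

```python
import re


def summarize_markdown(markdown: str) -> dict[str, int]:
    """Count a few structural features of converted Markdown."""
    lines = markdown.splitlines()
    return {
        # ATX headings: one to six '#' characters followed by whitespace
        "headings": sum(1 for ln in lines if re.match(r"#{1,6}\s", ln)),
        # Any line starting with '|' (includes header and separator rows)
        "table_rows": sum(1 for ln in lines if ln.lstrip().startswith("|")),
        # Inline image references of the form ![alt](target)
        "images": len(re.findall(r"!\[[^\]]*\]\([^)]*\)", markdown)),
        "words": len(markdown.split()),
    }
```

A document that is known to contain tables but reports `table_rows` of 0 is a good candidate for Azure Document Intelligence.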
## Performance Considerations

**Conversion speed depends on:**
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources

**Optimization tips:**
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents

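
Parallel batch processing can be sketched with a thread pool from the standard library. The converter is passed in as a callable; `convert_parallel` and the `max_workers` default are hypothetical. Whether a single `MarkItDown` instance is safe to share across threads is not covered in this document, so creating one instance per call is the conservative choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def convert_parallel(
    paths: list[str], convert: Callable[[str], str], max_workers: int = 4
) -> dict[str, str]:
    """Run convert over paths in a thread pool; results keyed by path."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result
        return dict(zip(paths, pool.map(convert, paths)))
```

Threads tend to help most when conversions spend time waiting on network calls (LLM descriptions, Azure Document Intelligence). An exception raised by any conversion propagates to the caller, so wrap `convert` if failures should be tolerated.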