Files
2025-11-30 08:30:10 +08:00

274 lines
6.7 KiB
Markdown

# Document Conversion Reference
This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.
## PDF Files
PDF conversion extracts text, tables, and structure from PDF documents.
### Basic PDF Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```
### PDF with Azure Document Intelligence
For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:
```python
from markitdown import MarkItDown
md = MarkItDown(
docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_table.pdf")
print(result.text_content)
```
**Benefits of Azure Document Intelligence:**
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents
### PDF Handling Notes
- Scanned PDFs require OCR (automatically handled if tesseract is installed)
- Password-protected PDFs are not supported
- Large PDFs may take longer to process
- Vector graphics and embedded images are extracted where possible
## Word Documents (DOCX)
Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.
### Basic DOCX Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
```
### DOCX Structure Preservation
MarkItDown preserves:
- **Headings** → Markdown headers (`#`, `##`, etc.)
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
- **Lists** → Markdown lists (ordered and unordered)
- **Tables** → Markdown tables
- **Hyperlinks** → Markdown links `[text](url)`
- **Images** → Referenced with descriptions (can use LLM for descriptions)
### Command-Line Usage
```bash
# Basic conversion
markitdown report.docx -o report.md
# With output directory
markitdown report.docx -o output/report.md
```
### DOCX with Images
To generate descriptions for images in Word documents, use LLM integration:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")
```
## PowerPoint Presentations (PPTX)
PowerPoint conversion extracts text from slides while preserving structure.
### Basic PPTX Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```
### PPTX Structure
MarkItDown processes presentations as:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Notes are included if present
### PPTX with Image Descriptions
Presentations often contain important visual information. Use LLM integration to describe images:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")
```
**Custom prompts for presentations:**
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"
## Excel Spreadsheets (XLSX, XLS)
Excel conversion formats spreadsheet data as Markdown tables.
### Basic XLSX Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)
```
### Multi-Sheet Workbooks
For workbooks with multiple sheets:
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas are evaluated (values shown, not formulas)
### XLSX Conversion Details
**What's preserved:**
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)
**What's not preserved:**
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes
### Large Spreadsheets
For large spreadsheets, consider:
- Processing may be slower for files with many rows/columns
- Very wide tables may not format well in Markdown
- Consider filtering or preprocessing data if possible
### XLS (Legacy Excel) Files
Legacy `.xls` files are supported but require additional dependencies:
```bash
pip install 'markitdown[xls]'
```
Then use normally:
```python
md = MarkItDown()
result = md.convert("legacy_data.xls")
```
## Common Document Conversion Patterns
### Batch Document Processing
```python
from markitdown import MarkItDown
import os
md = MarkItDown()
# Process all documents in a directory
for filename in os.listdir("documents"):
if filename.endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
result = md.convert(f"documents/{filename}")
# Save to output directory
output_name = os.path.splitext(filename)[0] + ".md"
with open(f"markdown/{output_name}", "w") as f:
f.write(result.text_content)
```
### Document with Mixed Content
For documents containing multiple types of content (text, tables, images):
```python
from markitdown import MarkItDown
from openai import OpenAI
# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o",
docintel_endpoint="YOUR-ENDPOINT",
docintel_key="YOUR-KEY"
)
result = md.convert("complex_report.pdf")
```
### Error Handling
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
result = md.convert("document.pdf")
print(result.text_content)
except Exception as e:
print(f"Conversion failed: {e}")
# Handle specific errors (file not found, unsupported format, etc.)
```
## Output Quality Tips
**For best results:**
1. Use Azure Document Intelligence for PDFs with complex tables
2. Enable LLM descriptions for documents with important visual content
3. Ensure source documents are well-structured (proper headings, etc.)
4. For scanned documents, ensure good scan quality for OCR accuracy
5. Test with sample documents to verify output quality
## Performance Considerations
**Conversion speed depends on:**
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources
**Optimization tips:**
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents