Initial commit
273
skills/markitdown/references/document_conversion.md
Normal file
@@ -0,0 +1,273 @@
# Document Conversion Reference

This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.

## PDF Files

PDF conversion extracts text, tables, and structure from PDF documents.

### Basic PDF Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

### PDF with Azure Document Intelligence

For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_table.pdf")
print(result.text_content)
```

**Benefits of Azure Document Intelligence:**
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents
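
Hard-coding the endpoint and key is convenient for a demo but risky in shared code. As a minimal sketch, the keyword arguments can be built from environment variables instead. The variable names `DOCINTEL_ENDPOINT` and `DOCINTEL_KEY` and the helper `docintel_kwargs_from_env` are hypothetical, chosen only for this example; the keyword names mirror the snippet above.

```python
import os


def docintel_kwargs_from_env(environ=os.environ):
    """Build MarkItDown keyword arguments for Azure Document Intelligence.

    DOCINTEL_ENDPOINT and DOCINTEL_KEY are hypothetical variable names used
    only in this sketch. An empty dict is returned when either is missing,
    so the caller falls back to standard extraction.
    """
    endpoint = environ.get("DOCINTEL_ENDPOINT", "").strip()
    key = environ.get("DOCINTEL_KEY", "").strip()
    if not endpoint or not key:
        return {}
    return {"docintel_endpoint": endpoint, "docintel_key": key}


# Usage (requires markitdown to be installed):
#   md = MarkItDown(**docintel_kwargs_from_env())
```

With both variables unset, the call reduces to `MarkItDown()`, so the same script runs with or without Azure credentials.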

### PDF Handling Notes

- Scanned PDFs require OCR (handled automatically if Tesseract is installed)
- Password-protected PDFs are not supported
- Large PDFs take longer to process
- Vector graphics and embedded images are extracted where possible
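
A scanned PDF without a usable text layer typically converts to little or no text. The heuristic below is a rough sketch for flagging such files after conversion. The helper name, the 50-character threshold, and the requirement that the caller supply the page count (from another tool) are all assumptions of this example, not MarkItDown behavior.

```python
def looks_like_scan(text_content, page_count, min_chars_per_page=50):
    """Return True when the extracted text is suspiciously sparse.

    text_content is the converted Markdown (e.g. result.text_content);
    page_count must be supplied by the caller. The threshold is arbitrary
    and should be tuned on real documents.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count non-whitespace characters only, so blank lines do not inflate the score.
    visible = sum(1 for ch in text_content if not ch.isspace())
    return visible / page_count < min_chars_per_page
```

Files flagged this way are candidates for an OCR pass or for Azure Document Intelligence.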

## Word Documents (DOCX)

Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.

### Basic DOCX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
```

### DOCX Structure Preservation

MarkItDown preserves:
- **Headings** → Markdown headers (`#`, `##`, etc.)
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
- **Lists** → Markdown lists (ordered and unordered)
- **Tables** → Markdown tables
- **Hyperlinks** → Markdown links `[text](url)`
- **Images** → Referenced with descriptions (can use LLM for descriptions)
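
One way to check that these elements survived a conversion is to count them in the resulting Markdown. The sketch below uses only regular expressions from the standard library. `summarize_markdown` is a hypothetical helper, and its patterns cover only the common Markdown forms listed above (pipe-delimited tables, inline links and images).

```python
import re


def summarize_markdown(text):
    """Count common structural elements in converted Markdown text."""
    multiline = re.MULTILINE
    return {
        # ATX headings: one to six '#' followed by a space and some text
        "headings": len(re.findall(r"^#{1,6}[ \t]+\S", text, flags=multiline)),
        # Bulleted (-, *, +) or numbered (1.) list items
        "list_items": len(
            re.findall(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+\S", text, flags=multiline)
        ),
        # Pipe-delimited table rows, including the separator row
        "table_rows": len(re.findall(r"^\|.*\|[ \t]*$", text, flags=multiline)),
        # Inline links, excluding images (which start with '!')
        "links": len(re.findall(r"(?<!!)\[[^\]]+\]\([^)]+\)", text)),
        "images": len(re.findall(r"!\[[^\]]*\]\([^)]+\)", text)),
    }
```

Passing `result.text_content` to this function gives a quick summary that can be compared against what the source document is known to contain.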

### Command-Line Usage

```bash
# Basic conversion
markitdown report.docx -o report.md

# Write the output file into a subdirectory
markitdown report.docx -o output/report.md
```

### DOCX with Images

To generate descriptions for images in Word documents, use LLM integration:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")
```

## PowerPoint Presentations (PPTX)

PowerPoint conversion extracts text from slides while preserving structure.

### Basic PPTX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```

### PPTX Structure

MarkItDown processes presentations as follows:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Speaker notes are included if present
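
Because each slide becomes its own section, the converted text can be split back into per-slide chunks. The sketch below assumes that slides are delimited by an HTML comment of the form `<!-- Slide number: N -->`. That marker format is an assumption about the converter's output and should be checked against a real conversion before relying on it; `split_slides` is a hypothetical helper.

```python
import re

# Assumed slide delimiter; verify against actual converted output.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(text):
    """Return {slide_number: slide_markdown} for marker-delimited text."""
    # With one capture group, re.split yields
    # [preamble, number, body, number, body, ...].
    parts = SLIDE_MARKER.split(text)
    slides = {}
    for i in range(1, len(parts) - 1, 2):
        slides[int(parts[i])] = parts[i + 1].strip()
    return slides
```

Text without any marker yields an empty dict, which is a simple signal that the assumed delimiter does not match the installed version's output.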

### PPTX with Image Descriptions

Presentations often contain important visual information. Use LLM integration to describe images:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")
```

**Custom prompts for presentations:**
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"

## Excel Spreadsheets (XLSX, XLS)

Excel conversion formats spreadsheet data as Markdown tables.

### Basic XLSX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)
```

### Multi-Sheet Workbooks

For workbooks with multiple sheets:
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas appear as their computed values, not as formula text
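
Since sheet names are used as headers, the converted workbook can be split into one Markdown chunk per sheet. This sketch assumes each sheet starts with a level-2 header (`## SheetName`). The header level is an assumption to verify against real output, and `split_sheets` is a hypothetical helper.

```python
def split_sheets(text):
    """Return {sheet_name: table_markdown}, preserving sheet order."""
    sheets = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            # A new sheet section begins; the header text is the sheet name.
            current = line[3:].strip()
            sheets[current] = []
        elif current is not None:
            sheets[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sheets.items()}
```

Splitting this way makes it possible to write each sheet to its own file or to pass only the relevant sheet to a downstream step.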

### XLSX Conversion Details

**What's preserved:**
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)

**What's not preserved:**
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes

### Large Spreadsheets

For large spreadsheets, keep in mind:
- Processing is slower for files with many rows or columns
- Very wide tables may not format well in Markdown
- Filtering or preprocessing the data before conversion helps where possible
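
As an illustration of preprocessing, the sketch below renders only the first few rows and columns of tabular data as a Markdown table. It works on plain Python lists loaded by any other tool and is independent of MarkItDown. The helper name and the default limits are assumptions of this example.

```python
def rows_to_markdown(rows, max_rows=50, max_cols=8):
    """Render a header row plus up to max_rows data rows as a Markdown table.

    rows is a list of lists whose first entry is the header. Columns beyond
    max_cols and data rows beyond max_rows are dropped.
    """
    if not rows:
        return ""
    header, *data = rows
    header = [str(cell) for cell in header[:max_cols]]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in data[:max_rows]:
        # Escape pipes so cell text cannot break the table layout.
        cells = [str(cell).replace("|", "\\|") for cell in row[:len(header)]]
        cells += [""] * (len(header) - len(cells))  # pad short rows
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

For example, `rows_to_markdown(rows, max_rows=2, max_cols=3)` keeps the header, the first two data rows, and the first three columns.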

### XLS (Legacy Excel) Files

Legacy `.xls` files are supported but require additional dependencies:

```bash
pip install 'markitdown[xls]'
```

Then convert as usual:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("legacy_data.xls")
```

## Common Document Conversion Patterns

### Batch Document Processing

```python
import os

from markitdown import MarkItDown

md = MarkItDown()
os.makedirs("markdown", exist_ok=True)  # make sure the output directory exists

# Process all supported documents in a directory
for filename in os.listdir("documents"):
    if filename.lower().endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
        result = md.convert(os.path.join("documents", filename))

        # Save to the output directory under the same base name
        output_name = os.path.splitext(filename)[0] + ".md"
        with open(os.path.join("markdown", output_name), "w", encoding="utf-8") as f:
            f.write(result.text_content)
```

### Document with Mixed Content

For documents containing multiple types of content (text, tables, images):

```python
from markitdown import MarkItDown
from openai import OpenAI

# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)

result = md.convert("complex_report.pdf")
```

### Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Conversion failed: {e}")
    # Handle specific errors (file not found, unsupported format, etc.)
```
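
In a batch job it is often more useful to record failures than to stop at the first one. The following sketch wraps any conversion callable and returns either the text or an error message. `safe_convert` and the `SUPPORTED` set are hypothetical names for this example, and the broad `except` is deliberate because this document does not list the exception types the library raises.

```python
from pathlib import Path

# Extensions covered in this document; adjust to your installation.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".xls"}


def safe_convert(convert, path):
    """Return (text, error); exactly one of the two is None.

    convert is any callable that takes a path string and returns text,
    for example: lambda p: md.convert(p).text_content
    """
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED:
        return None, f"unsupported extension: {p.suffix or '(none)'}"
    if not p.is_file():
        return None, f"file not found: {p}"
    try:
        return convert(str(p)), None
    except Exception as exc:  # broad on purpose: exception types are not documented here
        return None, f"{type(exc).__name__}: {exc}"
```

Checking the extension and file existence up front separates the two failure modes named in the comment above from errors raised during conversion itself.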

## Output Quality Tips

**For best results:**
1. Use Azure Document Intelligence for PDFs with complex tables
2. Enable LLM descriptions for documents with important visual content
3. Ensure source documents are well-structured (proper headings, etc.)
4. For scanned documents, ensure good scan quality for OCR accuracy
5. Test with sample documents to verify output quality

## Performance Considerations

**Conversion speed depends on:**
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources

**Optimization tips:**
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents
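
Parallel batch processing can be sketched with a thread pool from the standard library. `convert_many` is a hypothetical helper that accepts any conversion callable. Whether a single `MarkItDown` instance can be shared safely across threads is not stated in this document, so creating one instance inside the callable is the conservative choice.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_many(convert, paths, max_workers=4):
    """Apply convert to each path concurrently.

    Returns {path: result}, where result is the converted text or the
    exception raised for that path, so one failure does not abort the batch.
    """
    paths = list(paths)  # allow generators; the list is iterated twice below

    def run(path):
        try:
            return convert(path)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(paths, pool.map(run, paths)))
```

Threads suit conversions that spend most of their time waiting on network calls (LLM descriptions, Azure Document Intelligence); CPU-heavy local parsing may benefit more from separate processes.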