# Document Conversion Reference

This reference covers converting Office documents and PDFs to Markdown with MarkItDown.

## PDF Files

PDF conversion extracts text, tables, and structure from PDF documents.

### Basic PDF Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

### PDF with Azure Document Intelligence

For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:

```python
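from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown

# Assumption: recent MarkItDown releases accept the key via `docintel_credential`;
# check the constructor of your installed version.
md = MarkItDown(
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_credential=AzureKeyCredential("YOUR-API-KEY"),
)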
result = md.convert("complex_table.pdf")
print(result.text_content)
```

**Benefits of Azure Document Intelligence:**
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents

### PDF Handling Notes

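- Scanned (image-only) PDFs need OCR. The default converter reads the existing text layer and may return little or no text for scans; use Azure Document Intelligence or run OCR beforehand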
- Password-protected PDFs are not supported
- Large PDFs may take longer to process
- Vector graphics and embedded images are generally not carried into the Markdown output by the default converter; expect text and tables only
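
A PDF without a usable text layer typically converts to little or no text, so it can help to check the output before relying on it. The following is a minimal sketch of such a check, not part of MarkItDown: the helper name `looks_like_scanned` and the `min_chars` threshold are hypothetical and should be tuned to your documents.

```python
def looks_like_scanned(text: str, min_chars: int = 50) -> bool:
    """Heuristic: True when the text has very few non-whitespace characters."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars


# Usage with MarkItDown (requires the package to be installed):
#   result = MarkItDown().convert("scan.pdf")
#   if looks_like_scanned(result.text_content):
#       print("Little or no text extracted; the PDF may need OCR.")
```

A character count is a crude signal; a short but valid one-page PDF can trip it, so treat a positive result as a prompt to inspect the file rather than as proof.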

## Word Documents (DOCX)

Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.

### Basic DOCX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
```

### DOCX Structure Preservation

MarkItDown preserves:
- **Headings** → Markdown headers (`#`, `##`, etc.)
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
- **Lists** → Markdown lists (ordered and unordered)
- **Tables** → Markdown tables
- **Hyperlinks** → Markdown links `[text](url)`
- **Images** → Referenced with descriptions (can use LLM for descriptions)
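
To confirm that headings survived a conversion, one option is to pull an outline out of the resulting Markdown. This is a small sketch under the assumption that headings are emitted in ATX style (`#`, `##`, ...), as described above; `heading_outline` is a hypothetical helper, not a MarkItDown function.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for headings outside fenced code blocks."""
    outline = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


# Usage: heading_outline(MarkItDown().convert("document.docx").text_content)
```

Comparing this outline against the headings in the source document is a quick way to spot headings that were styled manually in Word instead of using heading styles.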

### Command-Line Usage

```bash
# Basic conversion
markitdown report.docx -o report.md

# Write into a subdirectory (create output/ first if it does not exist)
markitdown report.docx -o output/report.md
```

### DOCX with Images

To generate descriptions for images in Word documents, use LLM integration:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")
```

## PowerPoint Presentations (PPTX)

PowerPoint conversion extracts text from slides while preserving structure.

### Basic PPTX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```

### PPTX Structure

MarkItDown processes presentations as follows:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Notes are included if present
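
Because each slide becomes its own section, the converted text can be split back into per-slide chunks. The sketch below assumes slides are delimited by HTML comments of the form `<!-- Slide number: N -->`; that marker format is an assumption about the converter's output and should be checked against your installed version. `split_slides` is a hypothetical helper.

```python
import re

# Assumed slide delimiter; verify against real output before relying on it.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(markdown: str) -> dict[int, str]:
    """Map slide number -> Markdown body for that slide."""
    # With one capture group, split() yields [preamble, "1", body1, "2", body2, ...]
    parts = SLIDE_MARKER.split(markdown)
    slides = {}
    for i in range(1, len(parts) - 1, 2):
        slides[int(parts[i])] = parts[i + 1].strip()
    return slides


# Usage: split_slides(MarkItDown().convert("presentation.pptx").text_content)
```

If no markers are found the function returns an empty dict, which is a useful signal that the assumed format does not match.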

### PPTX with Image Descriptions

Presentations often contain important visual information. Use LLM integration to describe images:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")
```

**Custom prompts for presentations:**
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"

## Excel Spreadsheets (XLSX, XLS)

Excel conversion formats spreadsheet data as Markdown tables.

### Basic XLSX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)
```

### Multi-Sheet Workbooks

For workbooks with multiple sheets:
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas appear as their last saved values, not as formula text
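
Since each sheet is emitted under its own header, the output of a multi-sheet workbook can be separated again by sheet name. This sketch assumes sheet names are rendered as level-2 headings (`## Sheet1`); the heading level is an assumption, and `split_sheets` is a hypothetical helper.

```python
def split_sheets(markdown: str) -> dict[str, str]:
    """Map sheet name -> Markdown table text, assuming '## <name>' headers."""
    sheets: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()  # start collecting a new sheet
            sheets[current] = []
        elif current is not None:
            sheets[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sheets.items()}


# Usage: split_sheets(MarkItDown().convert("data.xlsx").text_content)
```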

### XLSX Conversion Details

**What's preserved:**
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)

**What's not preserved:**
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes

### Large Spreadsheets

Keep the following in mind for large spreadsheets:
- Processing may be slower for files with many rows/columns
- Very wide tables may not format well in Markdown
- Consider filtering or preprocessing data if possible
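
When filtering the source file is not practical, a long table can also be trimmed after conversion. The following sketch keeps the header, the separator row, and the first `max_rows` data rows of a single Markdown table; `truncate_table` is a hypothetical helper and assumes a standard pipe table with a separator on the second line.

```python
def truncate_table(table: str, max_rows: int) -> str:
    """Keep the header, separator, and at most max_rows data rows."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header, body = lines[:2], lines[2:]  # header row + '| --- |' separator
    return "\n".join(header + body[:max_rows])


# Usage: preview = truncate_table(sheet_markdown, max_rows=20)
```

Trimming after conversion does not reduce conversion time; it only keeps downstream prompts or previews small.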

### XLS (Legacy Excel) Files

Legacy `.xls` files are supported but require additional dependencies:

```bash
pip install 'markitdown[xls]'
```

Then use normally:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("legacy_data.xls")
```

## Common Document Conversion Patterns

### Batch Document Processing

```python
from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()
output_dir = Path("markdown")
output_dir.mkdir(exist_ok=True)  # create the output directory if missing

# Process all supported documents in a directory
for path in Path("documents").iterdir():
    # Compare lowercased suffixes so ".PDF" and ".pdf" both match
    if path.suffix.lower() in {".pdf", ".docx", ".pptx", ".xlsx"}:
        result = md.convert(str(path))

        # Save as <name>.md in the output directory
        out_path = output_dir / (path.stem + ".md")
        out_path.write_text(result.text_content, encoding="utf-8")
```

### Document with Mixed Content

For documents containing multiple types of content (text, tables, images):

```python
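from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown
from openai import OpenAI

# Use LLM for image descriptions + Azure for complex tables.
# Assumption: recent releases take the key via `docintel_credential`.
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_credential=AzureKeyCredential("YOUR-API-KEY"),
)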

result = md.convert("complex_report.pdf")
```

### Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Conversion failed: {e}")
    # Handle specific errors (file not found, unsupported format, etc.)
```
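
In batch jobs it is often more useful to log a failure and move on than to stop at the first bad file. The sketch below wraps any conversion callable; `convert_or_none` is a hypothetical helper, and the only thing it assumes about the result is a `text_content` attribute, as used throughout this document.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_or_none(convert: Callable[[str], object], path: str) -> Optional[str]:
    """Return converted text, or None if the file is missing or conversion fails."""
    if not Path(path).is_file():
        print(f"Skipping {path}: file not found")
        return None
    try:
        return convert(path).text_content
    except Exception as exc:  # broad on purpose: report and continue
        print(f"Conversion failed for {path}: {type(exc).__name__}: {exc}")
        return None


# Usage: text = convert_or_none(MarkItDown().convert, "document.pdf")
```

Printing the exception type makes it easier to tell an unsupported format from a corrupt file when reviewing logs.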

## Output Quality Tips

**For best results:**
1. Use Azure Document Intelligence for PDFs with complex tables
2. Enable LLM descriptions for documents with important visual content
3. Ensure source documents are well-structured (proper headings, etc.)
4. For scanned documents, ensure good scan quality for OCR accuracy
5. Test with sample documents to verify output quality

## Performance Considerations

**Conversion speed depends on:**
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources

**Optimization tips:**
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents
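
Parallel batch processing can be sketched with a thread pool from the standard library. `convert_many` is a hypothetical helper that takes any path-to-text callable. Whether a single `MarkItDown` instance can be shared safely across threads is not confirmed here, so the usage comment creates one instance per call as a conservative assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def convert_many(
    convert: Callable[[str], str], paths: Iterable[str], workers: int = 4
) -> dict[str, str]:
    """Convert paths concurrently and return {path: text} in input order."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order and re-raises the first exception
        texts = list(pool.map(convert, path_list))
    return dict(zip(path_list, texts))


# Usage (one MarkItDown instance per call, as a conservative choice):
#   convert_many(lambda p: MarkItDown().convert(p).text_content,
#                ["a.pdf", "b.docx"])
```

Threads help most when conversions wait on network calls such as LLM descriptions or Azure Document Intelligence; for CPU-bound local parsing the gain may be small.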