543 lines
8.8 KiB
Markdown
543 lines
8.8 KiB
Markdown
# File Format Support
|
|
|
|
This document provides detailed information about each file format supported by MarkItDown.
|
|
|
|
## Document Formats
|
|
|
|
### PDF (.pdf)
|
|
|
|
**Capabilities**:
|
|
- Text extraction
|
|
- Table detection
|
|
- Metadata extraction
|
|
- OCR for scanned documents (with dependencies)
|
|
|
|
**Dependencies**:
|
|
```bash
|
|
pip install 'markitdown[pdf]'
|
|
```
|
|
|
|
**Best For**:
|
|
- Scientific papers
|
|
- Reports
|
|
- Books
|
|
- Forms
|
|
|
|
**Limitations**:
|
|
- Complex layouts may not preserve perfect formatting
|
|
- Scanned PDFs require OCR setup
|
|
- Some PDF features (annotations, forms) may not convert
|
|
|
|
**Example**:
|
|
```python
|
|
from markitdown import MarkItDown
|
|
|
|
md = MarkItDown()
|
|
result = md.convert("research_paper.pdf")
|
|
print(result.text_content)
|
|
```
|
|
|
|
**Enhanced with Azure Document Intelligence**:
|
|
```python
|
|
md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
|
|
result = md.convert("complex_layout.pdf")
|
|
```
|
|
|
|
---
|
|
|
|
### Microsoft Word (.docx)
|
|
|
|
**Capabilities**:
|
|
- Text extraction
|
|
- Table conversion
|
|
- Heading hierarchy
|
|
- List formatting
|
|
- Basic text formatting (bold, italic)
|
|
|
|
**Dependencies**:
|
|
```bash
|
|
pip install 'markitdown[docx]'
|
|
```
|
|
|
|
**Best For**:
|
|
- Research papers
|
|
- Reports
|
|
- Documentation
|
|
- Manuscripts
|
|
|
|
**Preserved Elements**:
|
|
- Headings (converted to Markdown headers)
|
|
- Tables (converted to Markdown tables)
|
|
- Lists (bulleted and numbered)
|
|
- Basic formatting (bold, italic)
|
|
- Paragraphs
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("manuscript.docx")
|
|
```
|
|
|
|
---
|
|
|
|
### PowerPoint (.pptx)
|
|
|
|
**Capabilities**:
|
|
- Slide content extraction
|
|
- Speaker notes
|
|
- Table extraction
|
|
- Image descriptions (with AI)
|
|
|
|
**Dependencies**:
|
|
```bash
|
|
pip install 'markitdown[pptx]'
|
|
```
|
|
|
|
**Best For**:
|
|
- Presentations
|
|
- Lecture slides
|
|
- Conference talks
|
|
|
|
**Output Format**:
|
|
```markdown
|
|
# Slide 1: Title
|
|
|
|
Content from slide 1...
|
|
|
|
**Notes**: Speaker notes appear here
|
|
|
|
---
|
|
|
|
# Slide 2: Next Topic
|
|
|
|
...
|
|
```
|
|
|
|
**With AI Image Descriptions**:
|
|
```python
|
|
from openai import OpenAI
|
|
|
|
client = OpenAI()
|
|
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
|
result = md.convert("presentation.pptx")
|
|
```
|
|
|
|
---
|
|
|
|
### Excel (.xlsx, .xls)
|
|
|
|
**Capabilities**:
|
|
- Sheet extraction
|
|
- Table formatting
|
|
- Data preservation
|
|
- Formula values (calculated)
|
|
|
|
**Dependencies**:
|
|
```bash
|
|
pip install 'markitdown[xlsx]' # Modern Excel
|
|
pip install 'markitdown[xls]' # Legacy Excel
|
|
```
|
|
|
|
**Best For**:
|
|
- Data tables
|
|
- Research data
|
|
- Statistical results
|
|
- Experimental data
|
|
|
|
**Output Format**:
|
|
```markdown
|
|
# Sheet: Results
|
|
|
|
| Sample | Control | Treatment | P-value |
|
|
|--------|---------|-----------|---------|
|
|
| 1 | 10.2 | 12.5 | 0.023 |
|
|
| 2 | 9.8 | 11.9 | 0.031 |
|
|
```
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("experimental_data.xlsx")
|
|
```
|
|
|
|
---
|
|
|
|
## Image Formats
|
|
|
|
### Images (.jpg, .jpeg, .png, .gif, .webp)
|
|
|
|
**Capabilities**:
|
|
- EXIF metadata extraction
|
|
- OCR text extraction
|
|
- AI-powered image descriptions
|
|
|
|
**Dependencies**:
|
|
```bash
|
|
pip install 'markitdown[all]' # Includes image support
|
|
```
|
|
|
|
**Best For**:
|
|
- Scanned documents
|
|
- Charts and graphs
|
|
- Scientific diagrams
|
|
- Photographs with text
|
|
|
|
**Output Without AI**:
|
|
```markdown
|
|

|
|
|
|
**EXIF Data**:
|
|
- Camera: Canon EOS 5D
|
|
- Date: 2024-01-15
|
|
- Resolution: 4000x3000
|
|
```
|
|
|
|
**Output With AI**:
|
|
```python
|
|
from openai import OpenAI
|
|
|
|
client = OpenAI()
|
|
md = MarkItDown(
|
|
llm_client=client,
|
|
llm_model="gpt-4o",
|
|
llm_prompt="Describe this scientific diagram in detail"
|
|
)
|
|
result = md.convert("graph.png")
|
|
```
|
|
|
|
**OCR for Text Extraction**:
|
|
Requires Tesseract OCR:
|
|
```bash
|
|
# macOS
|
|
brew install tesseract
|
|
|
|
# Ubuntu
|
|
sudo apt-get install tesseract-ocr
|
|
```
|
|
|
|
---
|
|
|
|
## Audio Formats
|
|
|
|
### Audio (.wav, .mp3)
|
|
|
|
**Capabilities**:
|
|
- Metadata extraction
|
|
- Speech-to-text transcription
|
|
- Duration and technical info
|
|
|
|
**Dependencies**:
|
|
```bash
|
|
pip install 'markitdown[audio-transcription]'
|
|
```
|
|
|
|
**Best For**:
|
|
- Lecture recordings
|
|
- Interviews
|
|
- Podcasts
|
|
- Meeting recordings
|
|
|
|
**Output Format**:
|
|
```markdown
|
|
# Audio: interview.mp3
|
|
|
|
**Metadata**:
|
|
- Duration: 45:32
|
|
- Bitrate: 320kbps
|
|
- Sample Rate: 44100Hz
|
|
|
|
**Transcription**:
|
|
[Transcribed text appears here...]
|
|
```
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("lecture.mp3")
|
|
```
|
|
|
|
---
|
|
|
|
## Web Formats
|
|
|
|
### HTML (.html, .htm)
|
|
|
|
**Capabilities**:
|
|
- Clean HTML to Markdown conversion
|
|
- Link preservation
|
|
- Table conversion
|
|
- List formatting
|
|
|
|
**Best For**:
|
|
- Web pages
|
|
- Documentation
|
|
- Blog posts
|
|
- Online articles
|
|
|
|
**Output Format**: Clean Markdown with preserved links and structure
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("webpage.html")
|
|
```
|
|
|
|
---
|
|
|
|
### YouTube URLs
|
|
|
|
**Capabilities**:
|
|
- Fetch video transcriptions
|
|
- Extract video metadata
|
|
- Caption download
|
|
|
|
**Dependencies**:
|
|
```bash
|
|
pip install 'markitdown[youtube-transcription]'
|
|
```
|
|
|
|
**Best For**:
|
|
- Educational videos
|
|
- Lectures
|
|
- Talks
|
|
- Tutorials
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
|
|
```
|
|
|
|
---
|
|
|
|
## Data Formats
|
|
|
|
### CSV (.csv)
|
|
|
|
**Capabilities**:
|
|
- Automatic table conversion
|
|
- Delimiter detection
|
|
- Header preservation
|
|
|
|
**Output Format**: Markdown tables
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("data.csv")
|
|
```
|
|
|
|
**Output**:
|
|
```markdown
|
|
| Column1 | Column2 | Column3 |
|
|
|---------|---------|---------|
|
|
| Value1 | Value2 | Value3 |
|
|
```
|
|
|
|
---
|
|
|
|
### JSON (.json)
|
|
|
|
**Capabilities**:
|
|
- Structured representation
|
|
- Pretty formatting
|
|
- Nested data visualization
|
|
|
|
**Best For**:
|
|
- API responses
|
|
- Configuration files
|
|
- Data exports
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("data.json")
|
|
```
|
|
|
|
---
|
|
|
|
### XML (.xml)
|
|
|
|
**Capabilities**:
|
|
- Structure preservation
|
|
- Attribute extraction
|
|
- Formatted output
|
|
|
|
**Best For**:
|
|
- Configuration files
|
|
- Data interchange
|
|
- Structured documents
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("config.xml")
|
|
```
|
|
|
|
---
|
|
|
|
## Archive Formats
|
|
|
|
### ZIP (.zip)
|
|
|
|
**Capabilities**:
|
|
- Iterates through archive contents
|
|
- Converts each file individually
|
|
- Maintains directory structure in output
|
|
|
|
**Best For**:
|
|
- Document collections
|
|
- Project archives
|
|
- Batch conversions
|
|
|
|
**Output Format**:
|
|
```markdown
|
|
# Archive: documents.zip
|
|
|
|
## File: document1.pdf
|
|
[Content from document1.pdf...]
|
|
|
|
---
|
|
|
|
## File: document2.docx
|
|
[Content from document2.docx...]
|
|
```
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("archive.zip")
|
|
```
|
|
|
|
---
|
|
|
|
## E-book Formats
|
|
|
|
### EPUB (.epub)
|
|
|
|
**Capabilities**:
|
|
- Full text extraction
|
|
- Chapter structure
|
|
- Metadata extraction
|
|
|
|
**Best For**:
|
|
- E-books
|
|
- Digital publications
|
|
- Long-form content
|
|
|
|
**Output Format**: Markdown with preserved chapter structure
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("book.epub")
|
|
```
|
|
|
|
---
|
|
|
|
## Other Formats
|
|
|
|
### Outlook Messages (.msg)
|
|
|
|
**Capabilities**:
|
|
- Email content extraction
|
|
- Attachment listing
|
|
- Metadata (from, to, subject, date)
|
|
|
|
**Dependencies**:
|
|
```bash
|
|
pip install 'markitdown[outlook]'
|
|
```
|
|
|
|
**Best For**:
|
|
- Email archives
|
|
- Communication records
|
|
|
|
**Example**:
|
|
```python
|
|
result = md.convert("message.msg")
|
|
```
|
|
|
|
---
|
|
|
|
## Format-Specific Tips
|
|
|
|
### PDF Best Practices
|
|
|
|
1. **Use Azure Document Intelligence for complex layouts**:
|
|
```python
|
|
md = MarkItDown(docintel_endpoint="endpoint_url")
|
|
```
|
|
|
|
2. **For scanned PDFs, ensure OCR is set up**:
|
|
```bash
|
|
brew install tesseract # macOS
|
|
```
|
|
|
|
3. **Split very large PDFs before conversion** for better performance
|
|
|
|
### PowerPoint Best Practices
|
|
|
|
1. **Use AI for visual content**:
|
|
```python
|
|
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
|
```
|
|
|
|
2. **Check speaker notes** - they're included in output
|
|
|
|
3. **Complex animations won't be captured** - static content only
|
|
|
|
### Excel Best Practices
|
|
|
|
1. **Large spreadsheets** may take time to convert
|
|
|
|
2. **Formulas are converted to their calculated values**
|
|
|
|
3. **Multiple sheets** are all included in output
|
|
|
|
4. **Charts become text descriptions** (use AI for better descriptions)
|
|
|
|
### Image Best Practices
|
|
|
|
1. **Use AI for meaningful descriptions**:
|
|
```python
|
|
md = MarkItDown(
|
|
llm_client=client,
|
|
llm_model="gpt-4o",
|
|
llm_prompt="Describe this scientific figure in detail"
|
|
)
|
|
```
|
|
|
|
2. **For text-heavy images, ensure OCR dependencies** are installed
|
|
|
|
3. **High-resolution images** may take longer to process
|
|
|
|
### Audio Best Practices
|
|
|
|
1. **Clear audio** produces better transcriptions
|
|
|
|
2. **Long recordings** may take significant time
|
|
|
|
3. **Consider splitting long audio files** for faster processing
|
|
|
|
---
|
|
|
|
## Unsupported Formats
|
|
|
|
If you need to convert an unsupported format:
|
|
|
|
1. **Create a custom converter** (see `api_reference.md`)
|
|
2. **Look for plugins** on GitHub (#markitdown-plugin)
|
|
3. **Pre-convert to supported format** (e.g., convert .rtf to .docx)
|
|
|
|
---
|
|
|
|
## Format Detection
|
|
|
|
MarkItDown automatically detects format from:
|
|
|
|
1. **File extension** (primary method)
|
|
2. **MIME type** (fallback)
|
|
3. **File signature** (magic bytes, fallback)
|
|
|
|
**Override detection**:
|
|
```python
|
|
# Force specific format
|
|
result = md.convert("file_without_extension", file_extension=".pdf")
|
|
|
|
# With streams
|
|
with open("file", "rb") as f:
|
|
result = md.convert_stream(f, file_extension=".pdf")
|
|
```
|
|
|