File Format Support
This document provides detailed information about each file format supported by MarkItDown.
Document Formats
PDF (.pdf)
Capabilities:
- Text extraction
- Table detection
- Metadata extraction
- OCR for scanned documents (with dependencies)
Dependencies:
pip install 'markitdown[pdf]'
Best For:
- Scientific papers
- Reports
- Books
- Forms
Limitations:
- Formatting of complex layouts may not be preserved exactly
- Scanned PDFs require OCR setup
- Some PDF features (annotations, forms) may not convert
Example:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("research_paper.pdf")
print(result.text_content)
Enhanced with Azure Document Intelligence:
md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")
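Whichever converter is used, the result exposes its Markdown as `text_content`, so persisting it is a small step. The helper below is a minimal sketch, not part of MarkItDown: the name `convert_and_save` and the output layout (one `.md` file per source, named after the source's stem) are assumptions made for illustration.

```python
from pathlib import Path


def convert_and_save(converter, source, out_dir):
    """Convert `source` and write the Markdown into `out_dir`.

    `converter` is any object with a `convert(path)` method whose result
    exposes `text_content`, as in the examples above (hypothetical helper).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    result = converter.convert(str(source))

    # research_paper.pdf -> <out_dir>/research_paper.md
    target = out_dir / (Path(source).stem + ".md")
    target.write_text(result.text_content, encoding="utf-8")
    return target
```

With the library installed, a call would look like `convert_and_save(MarkItDown(), "research_paper.pdf", "converted")`.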
Microsoft Word (.docx)
Capabilities:
- Text extraction
- Table conversion
- Heading hierarchy
- List formatting
- Basic text formatting (bold, italic)
Dependencies:
pip install 'markitdown[docx]'
Best For:
- Research papers
- Reports
- Documentation
- Manuscripts
Preserved Elements:
- Headings (converted to Markdown headers)
- Tables (converted to Markdown tables)
- Lists (bulleted and numbered)
- Basic formatting (bold, italic)
- Paragraphs
Example:
result = md.convert("manuscript.docx")
PowerPoint (.pptx)
Capabilities:
- Slide content extraction
- Speaker notes
- Table extraction
- Image descriptions (with AI)
Dependencies:
pip install 'markitdown[pptx]'
Best For:
- Presentations
- Lecture slides
- Conference talks
Output Format:
# Slide 1: Title
Content from slide 1...
**Notes**: Speaker notes appear here
---
# Slide 2: Next Topic
...
With AI Image Descriptions:
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")
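If the converted text follows the layout shown under Output Format, speaker notes can be pulled out with plain string handling. This is a sketch under that assumption only: the `# Slide` and `**Notes**:` markers are taken from the sample above, real output may differ between versions, and `speaker_notes` is a hypothetical helper name.

```python
def speaker_notes(markdown_text):
    """Map each slide heading to its single-line speaker notes.

    Assumes the '# Slide N: Title' / '**Notes**: ...' layout shown above;
    slides without a notes line are simply left out of the result.
    """
    notes = {}
    current_slide = None
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# Slide"):
            current_slide = stripped[2:]  # drop the leading "# "
        elif stripped.startswith("**Notes**:") and current_slide is not None:
            notes[current_slide] = stripped[len("**Notes**:"):].strip()
    return notes
```

Notes that span several lines would need a slightly richer parser; this version keeps only the text on the `**Notes**:` line itself.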
Excel (.xlsx, .xls)
Capabilities:
- Sheet extraction
- Table formatting
- Data preservation
- Formula values (calculated)
Dependencies:
pip install 'markitdown[xlsx]' # Modern Excel
pip install 'markitdown[xls]' # Legacy Excel
Best For:
- Data tables
- Research data
- Statistical results
- Experimental data
Output Format:
# Sheet: Results
| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1 | 10.2 | 12.5 | 0.023 |
| 2 | 9.8 | 11.9 | 0.031 |
Example:
result = md.convert("experimental_data.xlsx")
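Because each sheet arrives as a Markdown table, the values can be read back into Python structures for a quick sanity check. The sketch below assumes the `# Sheet: NAME` heading and table layout shown under Output Format; `parse_sheets` is a hypothetical helper, and all cell values stay strings.

```python
def parse_sheets(markdown_text):
    """Return {sheet_name: [row_dict, ...]} from '# Sheet: NAME' sections."""
    sheets = {}
    name, header = None, None
    for line in markdown_text.splitlines():
        line = line.strip()
        if line.startswith("# Sheet:"):
            name = line[len("# Sheet:"):].strip()
            sheets[name] = []
            header = None  # the next table row is this sheet's header
        elif line.startswith("|") and name is not None:
            cells = [c.strip() for c in line.strip("|").split("|")]
            if header is None:
                header = cells
            elif all(c and set(c) <= set("-:") for c in cells):
                continue  # separator row such as |-----|-----|
            else:
                sheets[name].append(dict(zip(header, cells)))
    return sheets
```

Cells that themselves contain a pipe character are not handled; for anything beyond spot checks, reading the spreadsheet directly is the safer route.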
Image Formats
Images (.jpg, .jpeg, .png, .gif, .webp)
Capabilities:
- EXIF metadata extraction
- OCR text extraction
- AI-powered image descriptions
Dependencies:
pip install 'markitdown[all]' # Includes image support
Best For:
- Scanned documents
- Charts and graphs
- Scientific diagrams
- Photographs with text
Output Without AI:

**EXIF Data**:
- Camera: Canon EOS 5D
- Date: 2024-01-15
- Resolution: 4000x3000
Output With AI:
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific diagram in detail"
)
result = md.convert("graph.png")
OCR for Text Extraction: requires Tesseract OCR to be installed on the system:
# macOS
brew install tesseract
# Ubuntu
sudo apt-get install tesseract-ocr
Audio Formats
Audio (.wav, .mp3)
Capabilities:
- Metadata extraction
- Speech-to-text transcription
- Duration and technical info
Dependencies:
pip install 'markitdown[audio-transcription]'
Best For:
- Lecture recordings
- Interviews
- Podcasts
- Meeting recordings
Output Format:
# Audio: interview.mp3
**Metadata**:
- Duration: 45:32
- Bitrate: 320kbps
- Sample Rate: 44100Hz
**Transcription**:
[Transcribed text appears here...]
Example:
result = md.convert("lecture.mp3")
Web Formats
HTML (.html, .htm)
Capabilities:
- Clean HTML to Markdown conversion
- Link preservation
- Table conversion
- List formatting
Best For:
- Web pages
- Documentation
- Blog posts
- Online articles
Output Format: Clean Markdown with preserved links and structure
Example:
result = md.convert("webpage.html")
YouTube URLs
Capabilities:
- Fetch video transcriptions
- Extract video metadata
- Caption download
Dependencies:
pip install 'markitdown[youtube-transcription]'
Best For:
- Educational videos
- Lectures
- Talks
- Tutorials
Example:
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
Data Formats
CSV (.csv)
Capabilities:
- Automatic table conversion
- Delimiter detection
- Header preservation
Output Format: Markdown tables
Example:
result = md.convert("data.csv")
Output:
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1 | Value2 | Value3 |
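To make the three capabilities concrete (table conversion, delimiter detection, header preservation), here is a standard-library sketch of the same idea. It is an illustration only, not MarkItDown's implementation; `csv_to_markdown` is a hypothetical name and the candidate delimiter set is an assumption.

```python
import csv
import io


def csv_to_markdown(text):
    """Render CSV text as a Markdown table, guessing the delimiter."""
    try:
        dialect = csv.Sniffer().sniff(text[:2048], delimiters=",;\t|")
    except csv.Error:
        dialect = csv.excel  # fall back to plain comma-separated

    rows = [row for row in csv.reader(io.StringIO(text), dialect) if row]
    if not rows:
        return ""

    header, body = rows[0], rows[1:]
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    for row in body:
        # Pad or trim ragged rows so every line matches the header width.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

The first row is always treated as the header, which mirrors the output shown above.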
JSON (.json)
Capabilities:
- Structured representation
- Pretty formatting
- Nested data visualization
Best For:
- API responses
- Configuration files
- Data exports
Example:
result = md.convert("data.json")
XML (.xml)
Capabilities:
- Structure preservation
- Attribute extraction
- Formatted output
Best For:
- Configuration files
- Data interchange
- Structured documents
Example:
result = md.convert("config.xml")
Archive Formats
ZIP (.zip)
Capabilities:
- Iterates through archive contents
- Converts each file individually
- Maintains directory structure in output
Best For:
- Document collections
- Project archives
- Batch conversions
Output Format:
# Archive: documents.zip
## File: document1.pdf
[Content from document1.pdf...]
---
## File: document2.docx
[Content from document2.docx...]
Example:
result = md.convert("archive.zip")
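The iterate-and-convert behaviour can be pictured with the standard `zipfile` module. The sketch below reproduces the output layout shown above but is not MarkItDown's implementation: `archive_to_markdown` and the caller-supplied `convert_member` callable are hypothetical names introduced for illustration.

```python
import io
import zipfile


def archive_to_markdown(zip_bytes, archive_name, convert_member):
    """Build the '# Archive' / '## File' layout shown above.

    `convert_member(name, data)` is a caller-supplied callable that returns
    Markdown for one member, or None to skip it (hypothetical interface).
    """
    sections = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content of their own
            body = convert_member(info.filename, archive.read(info))
            if body is not None:
                # info.filename keeps the member's path inside the archive.
                sections.append(f"## File: {info.filename}\n{body}")
    return f"# Archive: {archive_name}\n" + "\n---\n".join(sections)
```

Returning `None` from `convert_member` gives a simple way to skip members in formats that cannot be converted.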
E-book Formats
EPUB (.epub)
Capabilities:
- Full text extraction
- Chapter structure
- Metadata extraction
Best For:
- E-books
- Digital publications
- Long-form content
Output Format: Markdown with preserved chapter structure
Example:
result = md.convert("book.epub")
Other Formats
Outlook Messages (.msg)
Capabilities:
- Email content extraction
- Attachment listing
- Metadata (from, to, subject, date)
Dependencies:
pip install 'markitdown[outlook]'
Best For:
- Email archives
- Communication records
Example:
result = md.convert("message.msg")
Format-Specific Tips
PDF Best Practices
- Use Azure Document Intelligence for complex layouts:
  md = MarkItDown(docintel_endpoint="endpoint_url")
- For scanned PDFs, ensure OCR is set up:
  brew install tesseract # macOS
- Split very large PDFs before conversion for better performance
PowerPoint Best Practices
- Use AI for visual content:
  md = MarkItDown(llm_client=client, llm_model="gpt-4o")
- Check speaker notes - they're included in output
- Complex animations won't be captured - static content only
Excel Best Practices
- Large spreadsheets may take time to convert
- Formulas are converted to their calculated values
- Multiple sheets are all included in output
- Charts become text descriptions (use AI for better descriptions)
Image Best Practices
- Use AI for meaningful descriptions:
  md = MarkItDown(
      llm_client=client,
      llm_model="gpt-4o",
      llm_prompt="Describe this scientific figure in detail"
  )
- For text-heavy images, ensure OCR dependencies are installed
- High-resolution images may take longer to process
Audio Best Practices
- Clear audio produces better transcriptions
- Long recordings may take significant time
- Consider splitting long audio files for faster processing
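For WAV input, splitting can be done with Python's standard `wave` module alone. The sketch below is one possible approach, not a MarkItDown feature: `split_wav`, the part-file naming scheme, and the 10-minute default are assumptions, and MP3 files would need an external tool instead.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, chunk_seconds=600):
    """Split a WAV file into consecutive chunks of at most `chunk_seconds`."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(reader.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break  # end of the recording
            chunk_path = out_dir / f"{src.stem}_part{index:03d}.wav"
            with wave.open(str(chunk_path), "wb") as writer:
                # Same channels, sample width, and rate as the source; the
                # frame count in the header is corrected when the file closes.
                writer.setparams(params)
                writer.writeframes(frames)
            paths.append(chunk_path)
            index += 1
    return paths
```

Each returned path can then be passed to `md.convert(...)` separately. Cutting at fixed intervals may split a word in two, so transcripts can be slightly off at chunk boundaries.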
Unsupported Formats
If you need to convert an unsupported format:
- Create a custom converter (see api_reference.md)
- Look for plugins on GitHub (#markitdown-plugin)
- Pre-convert to a supported format (e.g., convert .rtf to .docx)
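Before choosing one of these routes, it can help to check a file against the formats covered in this document. The table below is a sketch assembled from the extensions and `pip` extras listed in the sections above; `FORMAT_EXTRAS` and `check_format` are hypothetical names, and the check looks at the extension only.

```python
from pathlib import Path

# Extension -> pip extra named in this document (None = no extra listed).
FORMAT_EXTRAS = {
    ".pdf": "pdf", ".docx": "docx", ".pptx": "pptx",
    ".xlsx": "xlsx", ".xls": "xls",
    ".jpg": "all", ".jpeg": "all", ".png": "all", ".gif": "all", ".webp": "all",
    ".wav": "audio-transcription", ".mp3": "audio-transcription",
    ".html": None, ".htm": None, ".csv": None, ".json": None, ".xml": None,
    ".zip": None, ".epub": None,
    ".msg": "outlook",
}


def check_format(path):
    """Return (supported, extra) for a path, based on its extension only."""
    ext = Path(path).suffix.lower()
    if ext not in FORMAT_EXTRAS:
        return False, None
    return True, FORMAT_EXTRAS[ext]
```

For example, `check_format("notes.rtf")` reports the file as unsupported, which is the cue to pre-convert it or write a custom converter.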
Format Detection
MarkItDown automatically detects format from:
- File extension (primary method)
- MIME type (fallback)
- File signature (magic bytes, fallback)
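The detection order above can be sketched with the standard library. This is an illustration of the extension, then MIME type, then magic bytes fallback chain, not MarkItDown's actual detection code; `guess_extension` and the short signature table are assumptions.

```python
import mimetypes
from pathlib import Path

# A few well-known file signatures; real detectors know many more.
# Note: .docx, .xlsx, .pptx, and .epub are ZIP containers, so they share
# the ZIP signature and cannot be told apart by these bytes alone.
MAGIC_BYTES = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF8", ".gif"),
    (b"PK\x03\x04", ".zip"),
]


def guess_extension(filename, mime_type=None, head=b""):
    """Guess a file extension: name first, then MIME type, then signature."""
    ext = Path(filename).suffix.lower()
    if ext:
        return ext  # primary: the file extension itself

    if mime_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        guessed = mimetypes.guess_extension(mime_type.split(";")[0].strip())
        if guessed:
            return guessed

    for signature, sig_ext in MAGIC_BYTES:
        if head.startswith(signature):
            return sig_ext
    return None
```

Here `head` is the first few bytes of the file, for example `open(path, "rb").read(16)`.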
Override detection:
# Force specific format
result = md.convert("file_without_extension", file_extension=".pdf")
# With streams
with open("file", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")