Initial commit
399
skills/markitdown/references/api_reference.md
Normal file
@@ -0,0 +1,399 @@
# MarkItDown API Reference

## Core Classes

### MarkItDown

The main class for converting files to Markdown.

```python
from markitdown import MarkItDown

md = MarkItDown(
    llm_client=None,
    llm_model=None,
    llm_prompt=None,
    docintel_endpoint=None,
    enable_plugins=False
)
```

#### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `llm_client` | OpenAI client | `None` | OpenAI-compatible client for AI image descriptions |
| `llm_model` | str | `None` | Model name (e.g., "anthropic/claude-sonnet-4.5") for image descriptions |
| `llm_prompt` | str | `None` | Custom prompt for image description |
| `docintel_endpoint` | str | `None` | Azure Document Intelligence endpoint |
| `enable_plugins` | bool | `False` | Enable 3rd-party plugins |

#### Methods

##### convert()

Convert a file to Markdown.

```python
result = md.convert(
    source,
    file_extension=None
)
```

**Parameters**:
- `source` (str): Path or URL of the file to convert
- `file_extension` (str, optional): Override file extension detection

**Returns**: `DocumentConverterResult` object

**Example**:
```python
result = md.convert("document.pdf")
print(result.text_content)
```

##### convert_stream()

Convert from a file-like binary stream.

```python
result = md.convert_stream(
    stream,
    file_extension
)
```

**Parameters**:
- `stream` (BinaryIO): Binary file-like object (e.g., file opened in `"rb"` mode)
- `file_extension` (str): File extension to determine conversion method (e.g., ".pdf")

**Returns**: `DocumentConverterResult` object

**Example**:
```python
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
    print(result.text_content)
```

**Important**: The stream must be opened in binary mode (`"rb"`), not text mode.

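The binary-mode requirement can be checked before a stream is handed to `convert_stream()`. The following is a minimal sketch using only the standard library; `ensure_binary_stream` is a hypothetical helper name, not part of MarkItDown, and it only distinguishes text streams from everything else.

```python
import io


def ensure_binary_stream(stream):
    """Return the stream unchanged, or raise TypeError if it is a text stream.

    Hypothetical helper (not a MarkItDown function): files opened with "r"
    and io.StringIO objects are instances of io.TextIOBase.
    """
    if isinstance(stream, io.TextIOBase):
        raise TypeError("stream must be opened in binary mode ('rb'), not text mode")
    return stream


# In-memory bytes can stand in for a file opened with "rb".
buffer = ensure_binary_stream(io.BytesIO(b"%PDF-1.7 example bytes"))
```

A stream that passes the check could then be passed on, for example `md.convert_stream(buffer, file_extension=".pdf")`.
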
## Result Object

### DocumentConverterResult

The result of a conversion operation.

#### Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| `text_content` | str | The converted Markdown text |
| `title` | str or `None` | Document title, or `None` if unavailable |

#### Example

```python
result = md.convert("paper.pdf")

# Access content
content = result.text_content

# Access title (if available)
title = result.title
```

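Because the title is only present when the source document provides one, code that consumes a result usually needs a fallback. A minimal sketch, assuming only the two attributes listed above; `save_markdown` is a hypothetical helper and the `SimpleNamespace` object is a stand-in for a real conversion result.

```python
from pathlib import Path
from types import SimpleNamespace


def save_markdown(result, out_path, fallback_title="Untitled"):
    """Write result.text_content to out_path and return the title to use.

    Hypothetical helper: falls back to `fallback_title` when the result
    has no title (missing attribute, None, or empty string).
    """
    title = getattr(result, "title", None) or fallback_title
    Path(out_path).write_text(result.text_content, encoding="utf-8")
    return title


# Stand-in with the documented attributes; a real result would come from md.convert().
demo = SimpleNamespace(text_content="# Paper\n\nBody text.", title=None)
```

With a real result this might be called as `save_markdown(md.convert("paper.pdf"), "paper.md", fallback_title="paper")`.
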
## Custom Converters

You can create custom document converters by implementing the `DocumentConverter` interface.

### DocumentConverter Interface

```python
from markitdown import DocumentConverter

class CustomConverter(DocumentConverter):
    def convert(self, stream, file_extension):
        """
        Convert a document from a binary stream.

        Parameters:
            stream (BinaryIO): Binary file-like object
            file_extension (str): File extension (e.g., ".custom")

        Returns:
            DocumentConverterResult: Conversion result
        """
        # Your conversion logic here
        pass
```

### Registering Custom Converters

```python
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult

class MyCustomConverter(DocumentConverter):
    def convert(self, stream, file_extension):
        content = stream.read().decode('utf-8')
        markdown_text = f"# Custom Format\n\n{content}"
        return DocumentConverterResult(
            text_content=markdown_text,
            title="Custom Document"
        )

# Create MarkItDown instance
md = MarkItDown()

# Register custom converter for .custom files
md.register_converter(".custom", MyCustomConverter())

# Use it
result = md.convert("myfile.custom")
```

## Plugin System

### Finding Plugins

Search GitHub for the `#markitdown-plugin` hashtag.

### Using Plugins

```python
from markitdown import MarkItDown

# Enable plugins
md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")
```

### Creating Plugins

Plugins are Python packages that register converters with MarkItDown.

**Plugin Structure**:
```
my-markitdown-plugin/
├── setup.py
├── my_plugin/
│   ├── __init__.py
│   └── converter.py
└── README.md
```

**setup.py**:
```python
from setuptools import setup

setup(
    name="markitdown-my-plugin",
    version="0.1.0",
    packages=["my_plugin"],
    entry_points={
        "markitdown.plugins": [
            "my_plugin = my_plugin.converter:MyConverter",
        ],
    },
)
```

**converter.py**:
```python
from markitdown import DocumentConverter, DocumentConverterResult

class MyConverter(DocumentConverter):
    def convert(self, stream, file_extension):
        # Your conversion logic
        content = stream.read()
        markdown = self.process(content)
        return DocumentConverterResult(
            text_content=markdown,
            title="My Document"
        )

    def process(self, content):
        # Process content
        return "# Converted Content\n\n..."
```

## AI-Enhanced Conversions

### Using OpenRouter for Image Descriptions

```python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize OpenRouter client (OpenAI-compatible API)
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

# Create MarkItDown with AI support
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
    llm_prompt="Describe this image in detail for scientific documentation"
)

# Convert files with images
result = md.convert("presentation.pptx")
```

### Available Models via OpenRouter

Popular models with vision support:
- `anthropic/claude-sonnet-4.5` - **Claude Sonnet 4.5 (recommended for scientific vision)**
- `anthropic/claude-3.5-sonnet` - Claude 3.5 Sonnet
- `openai/gpt-4o` - GPT-4 Omni
- `openai/gpt-4-vision` - GPT-4 Vision
- `google/gemini-pro-vision` - Gemini Pro Vision

See https://openrouter.ai/models for the complete list.

### Custom Prompts

```python
# For scientific diagrams
# (`client` is the OpenRouter client created in the previous example)
scientific_prompt = """
Analyze this scientific diagram or chart. Describe:
1. The type of visualization (graph, chart, diagram, etc.)
2. Key data points or trends
3. Labels and axes
4. Scientific significance
Be precise and technical.
"""

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt=scientific_prompt
)
```

## Azure Document Intelligence

### Setup

1. Create an Azure Document Intelligence resource
2. Get the endpoint URL
3. Set up authentication

### Usage

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/"
)

result = md.convert("complex_document.pdf")
```

### Authentication

Set the key as an environment variable:
```bash
export AZURE_DOCUMENT_INTELLIGENCE_KEY="your-key"
```

Or pass credentials programmatically.

## Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except FileNotFoundError:
    print("File not found")
except ValueError as e:
    print(f"Invalid file format: {e}")
except Exception as e:
    print(f"Conversion error: {e}")
```

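When many files are converted, the same exception handling can be wrapped in a small function that returns an error message instead of printing it. A minimal sketch: `safe_convert` is a hypothetical helper, and it takes the conversion callable as an argument so it makes no assumption about MarkItDown beyond the `text_content` attribute.

```python
def safe_convert(convert, path):
    """Call convert(path) and return (text, error); exactly one of them is None.

    Hypothetical helper mirroring the exception order shown above:
    most specific exception first, generic Exception last.
    """
    try:
        return convert(path).text_content, None
    except FileNotFoundError:
        return None, f"File not found: {path}"
    except ValueError as exc:
        return None, f"Invalid file format: {exc}"
    except Exception as exc:
        return None, f"Conversion error: {exc}"
```

With a real instance this might be used as `text, error = safe_convert(md.convert, "document.pdf")`.
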
## Performance Tips

### 1. Reuse MarkItDown Instance

```python
# Good: Create once, use many times
md = MarkItDown()

for file in files:
    result = md.convert(file)
    process(result)
```

### 2. Use Streaming for Large Files

```python
# For large files
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
```

### 3. Batch Processing

```python
from concurrent.futures import ThreadPoolExecutor

md = MarkItDown()

def convert_file(filepath):
    return md.convert(filepath)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(convert_file, file_list)
```

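With `executor.map`, the first failing file raises when the results are iterated, which hides the outcome of the remaining files. A minimal sketch of a batch helper that records failures per file instead; `convert_many` is a hypothetical name, and sharing one converter across threads is an assumption carried over from the example above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_many(convert, paths, max_workers=4):
    """Run convert(path) for each path in a thread pool.

    Returns (results, errors): two dicts keyed by path, holding the
    conversion result or the exception raised for that path.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it was submitted for.
        futures = {executor.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                errors[path] = exc
    return results, errors
```

With a real instance this might be called as `results, errors = convert_many(md.convert, file_list)`.
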
## Breaking Changes (v0.0.1 to v0.1.0)

1. **Dependencies**: Now organized into optional feature groups
   ```bash
   # Old
   pip install markitdown

   # New
   pip install 'markitdown[all]'
   ```

2. **convert_stream()**: Now requires binary file-like object
   ```python
   # Old (also accepted text)
   with open("file.pdf", "r") as f:  # text mode
       result = md.convert_stream(f)

   # New (binary only)
   with open("file.pdf", "rb") as f:  # binary mode
       result = md.convert_stream(f, file_extension=".pdf")
   ```

3. **DocumentConverter Interface**: Changed to read from streams instead of file paths
   - No temporary files created
   - More memory efficient
   - Plugins need updating

## Version Compatibility

- **Python**: 3.10 or higher required
- **Dependencies**: Check `setup.py` for version constraints
- **OpenAI**: Compatible with OpenAI Python SDK v1.0+

## Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `OPENROUTER_API_KEY` | OpenRouter API key for image descriptions | `sk-or-v1-...` |
| `AZURE_DOCUMENT_INTELLIGENCE_KEY` | Azure DI authentication | `key123...` |
| `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` | Azure DI endpoint | `https://...` |

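The variables in the table can be collected in one place so that missing configuration is reported before a conversion starts. A minimal sketch using only the standard library; `load_settings` and `missing_settings` are hypothetical helpers, and the dictionary keys are illustrative names rather than MarkItDown parameters.

```python
import os


def load_settings(env=None):
    """Read the documented variables from `env` (defaults to os.environ).

    Variables that are not set map to None.
    """
    env = os.environ if env is None else env
    return {
        "openrouter_api_key": env.get("OPENROUTER_API_KEY"),
        "docintel_key": env.get("AZURE_DOCUMENT_INTELLIGENCE_KEY"),
        "docintel_endpoint": env.get("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"),
    }


def missing_settings(settings):
    """Return the sorted names of settings that are unset or empty."""
    return sorted(name for name, value in settings.items() if not value)
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise without touching the real environment.
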
542
skills/markitdown/references/file_formats.md
Normal file
@@ -0,0 +1,542 @@
# File Format Support

This document provides detailed information about each file format supported by MarkItDown.

## Document Formats

### PDF (.pdf)

**Capabilities**:
- Text extraction
- Table detection
- Metadata extraction
- OCR for scanned documents (with dependencies)

**Dependencies**:
```bash
pip install 'markitdown[pdf]'
```

**Best For**:
- Scientific papers
- Reports
- Books
- Forms

**Limitations**:
- Formatting of complex layouts may not be fully preserved
- Scanned PDFs require OCR setup
- Some PDF features (annotations, forms) may not convert

**Example**:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("research_paper.pdf")
print(result.text_content)
```

**Enhanced with Azure Document Intelligence**:
```python
md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")
```

---

### Microsoft Word (.docx)

**Capabilities**:
- Text extraction
- Table conversion
- Heading hierarchy
- List formatting
- Basic text formatting (bold, italic)

**Dependencies**:
```bash
pip install 'markitdown[docx]'
```

**Best For**:
- Research papers
- Reports
- Documentation
- Manuscripts

**Preserved Elements**:
- Headings (converted to Markdown headers)
- Tables (converted to Markdown tables)
- Lists (bulleted and numbered)
- Basic formatting (bold, italic)
- Paragraphs

**Example**:
```python
result = md.convert("manuscript.docx")
```

---

### PowerPoint (.pptx)

**Capabilities**:
- Slide content extraction
- Speaker notes
- Table extraction
- Image descriptions (with AI)

**Dependencies**:
```bash
pip install 'markitdown[pptx]'
```

**Best For**:
- Presentations
- Lecture slides
- Conference talks

**Output Format**:
```markdown
# Slide 1: Title

Content from slide 1...

**Notes**: Speaker notes appear here

---

# Slide 2: Next Topic

...
```

**With AI Image Descriptions**:
```python
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")
```

---

### Excel (.xlsx, .xls)

**Capabilities**:
- Sheet extraction
- Table formatting
- Data preservation
- Formula values (calculated)

**Dependencies**:
```bash
pip install 'markitdown[xlsx]'  # Modern Excel
pip install 'markitdown[xls]'   # Legacy Excel
```

**Best For**:
- Data tables
- Research data
- Statistical results
- Experimental data

**Output Format**:
```markdown
# Sheet: Results

| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1 | 10.2 | 12.5 | 0.023 |
| 2 | 9.8 | 11.9 | 0.031 |
```

**Example**:
```python
result = md.convert("experimental_data.xlsx")
```

---

## Image Formats

### Images (.jpg, .jpeg, .png, .gif, .webp)

**Capabilities**:
- EXIF metadata extraction
- OCR text extraction
- AI-powered image descriptions

**Dependencies**:
```bash
pip install 'markitdown[all]'  # Includes image support
```

**Best For**:
- Scanned documents
- Charts and graphs
- Scientific diagrams
- Photographs with text

**Output Without AI**:
```markdown


**EXIF Data**:
- Camera: Canon EOS 5D
- Date: 2024-01-15
- Resolution: 4000x3000
```

**With AI Image Descriptions**:
```python
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific diagram in detail"
)
result = md.convert("graph.png")
```

**OCR for Text Extraction**:
Requires Tesseract OCR:
```bash
# macOS
brew install tesseract

# Ubuntu
sudo apt-get install tesseract-ocr
```

---

## Audio Formats

### Audio (.wav, .mp3)

**Capabilities**:
- Metadata extraction
- Speech-to-text transcription
- Duration and technical info

**Dependencies**:
```bash
pip install 'markitdown[audio-transcription]'
```

**Best For**:
- Lecture recordings
- Interviews
- Podcasts
- Meeting recordings

**Output Format**:
```markdown
# Audio: interview.mp3

**Metadata**:
- Duration: 45:32
- Bitrate: 320kbps
- Sample Rate: 44100Hz

**Transcription**:
[Transcribed text appears here...]
```

**Example**:
```python
result = md.convert("lecture.mp3")
```

---

## Web Formats

### HTML (.html, .htm)

**Capabilities**:
- Clean HTML to Markdown conversion
- Link preservation
- Table conversion
- List formatting

**Best For**:
- Web pages
- Documentation
- Blog posts
- Online articles

**Output Format**: Clean Markdown with preserved links and structure

**Example**:
```python
result = md.convert("webpage.html")
```

---

### YouTube URLs

**Capabilities**:
- Fetch video transcriptions
- Extract video metadata
- Caption download

**Dependencies**:
```bash
pip install 'markitdown[youtube-transcription]'
```

**Best For**:
- Educational videos
- Lectures
- Talks
- Tutorials

**Example**:
```python
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
```

---

## Data Formats

### CSV (.csv)

**Capabilities**:
- Automatic table conversion
- Delimiter detection
- Header preservation

**Output Format**: Markdown tables

**Example**:
```python
result = md.convert("data.csv")
```

**Output**:
```markdown
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1 | Value2 | Value3 |
```

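The mapping from CSV rows to a Markdown table can be illustrated with the standard library alone. This is a sketch of the general idea, not MarkItDown's implementation; `csv_to_markdown` is a hypothetical function, it assumes a comma delimiter, and it does not escape `|` characters inside cells.

```python
import csv
import io


def csv_to_markdown(text):
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Pad short rows and trim long ones so every row matches the header width.
        cells = (row + [""] * len(header))[: len(header)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(csv_to_markdown("Column1,Column2\nValue1,Value2\n"))
```
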
---

### JSON (.json)

**Capabilities**:
- Structured representation
- Pretty formatting
- Nested data visualization

**Best For**:
- API responses
- Configuration files
- Data exports

**Example**:
```python
result = md.convert("data.json")
```

---

### XML (.xml)

**Capabilities**:
- Structure preservation
- Attribute extraction
- Formatted output

**Best For**:
- Configuration files
- Data interchange
- Structured documents

**Example**:
```python
result = md.convert("config.xml")
```

---

## Archive Formats

### ZIP (.zip)

**Capabilities**:
- Iterates through archive contents
- Converts each file individually
- Maintains directory structure in output

**Best For**:
- Document collections
- Project archives
- Batch conversions

**Output Format**:
```markdown
# Archive: documents.zip

## File: document1.pdf
[Content from document1.pdf...]

---

## File: document2.docx
[Content from document2.docx...]
```

**Example**:
```python
result = md.convert("archive.zip")
```

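The per-file iteration described above can be sketched with the standard `zipfile` module. This illustrates the output layout only and is not MarkItDown's implementation; `zip_to_markdown` and its `convert_member` callback are hypothetical names, with the callback standing in for whatever converts a single archive member.

```python
import io
import zipfile


def zip_to_markdown(data, archive_name, convert_member):
    """Build the archive layout shown above from raw ZIP bytes.

    convert_member(filename, binary_stream) must return Markdown text
    for one member; directory entries are skipped.
    """
    header = f"# Archive: {archive_name}"
    sections = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as member:
                body = convert_member(info.filename, member)
            sections.append(f"## File: {info.filename}\n{body}")
    if not sections:
        return header
    # Members are separated by a horizontal rule, as in the sample output.
    return header + "\n\n" + "\n\n---\n\n".join(sections)
```

Member names keep their path inside the archive (for example `docs/b.txt`), which is how the directory structure stays visible in the output.
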
---

## E-book Formats

### EPUB (.epub)

**Capabilities**:
- Full text extraction
- Chapter structure
- Metadata extraction

**Best For**:
- E-books
- Digital publications
- Long-form content

**Output Format**: Markdown with preserved chapter structure

**Example**:
```python
result = md.convert("book.epub")
```

---

## Other Formats

### Outlook Messages (.msg)

**Capabilities**:
- Email content extraction
- Attachment listing
- Metadata (from, to, subject, date)

**Dependencies**:
```bash
pip install 'markitdown[outlook]'
```

**Best For**:
- Email archives
- Communication records

**Example**:
```python
result = md.convert("message.msg")
```

---

## Format-Specific Tips

### PDF Best Practices

1. **Use Azure Document Intelligence for complex layouts**:
   ```python
   md = MarkItDown(docintel_endpoint="endpoint_url")
   ```

2. **For scanned PDFs, ensure OCR is set up**:
   ```bash
   brew install tesseract  # macOS
   ```

3. **Split very large PDFs before conversion** for better performance

### PowerPoint Best Practices

1. **Use AI for visual content**:
   ```python
   md = MarkItDown(llm_client=client, llm_model="gpt-4o")
   ```

2. **Check speaker notes** - they're included in output

3. **Complex animations won't be captured** - static content only

### Excel Best Practices

1. **Large spreadsheets** may take time to convert

2. **Formulas are converted to their calculated values**

3. **Multiple sheets** are all included in output

4. **Charts become text descriptions** (use AI for better descriptions)

### Image Best Practices

1. **Use AI for meaningful descriptions**:
   ```python
   md = MarkItDown(
       llm_client=client,
       llm_model="gpt-4o",
       llm_prompt="Describe this scientific figure in detail"
   )
   ```

2. **For text-heavy images, ensure OCR dependencies** are installed

3. **High-resolution images** may take longer to process

### Audio Best Practices

1. **Clear audio** produces better transcriptions

2. **Long recordings** may take significant time

3. **Consider splitting long audio files** for faster processing

---

## Unsupported Formats

If you need to convert an unsupported format:

1. **Create a custom converter** (see `api_reference.md`)
2. **Look for plugins** on GitHub (#markitdown-plugin)
3. **Pre-convert to a supported format** (e.g., convert .rtf to .docx)

---

## Format Detection

MarkItDown automatically detects the format from:

1. **File extension** (primary method)
2. **MIME type** (fallback)
3. **File signature** (magic bytes, fallback)

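The detection order above (extension, then MIME type, then file signature) can be sketched as a fallback chain. This is an illustration of the ordering only, not MarkItDown's detection code; `detect_extension` and the small `SIGNATURES` table are hypothetical and cover just three formats.

```python
import mimetypes
from pathlib import Path

# A few well-known magic numbers (illustrative, far from complete).
# The ZIP signature is shared by .docx, .xlsx, .pptx, and .epub containers.
SIGNATURES = {
    b"%PDF-": ".pdf",
    b"PK\x03\x04": ".zip",
    b"\x89PNG\r\n\x1a\n": ".png",
}


def detect_extension(name, mime_type=None, head=b""):
    """Guess a file extension: name suffix first, then MIME type, then magic bytes."""
    suffix = Path(name).suffix.lower()
    if suffix:
        return suffix
    if mime_type:
        guessed = mimetypes.guess_extension(mime_type)
        if guessed:
            return guessed
    for magic, extension in SIGNATURES.items():
        if head.startswith(magic):
            return extension
    return None
```

When none of the three sources yields an answer the function returns `None`, which is the case where an explicit `file_extension` override is needed.
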
**Override detection**:
```python
# Force specific format
result = md.convert("file_without_extension", file_extension=".pdf")

# With streams
with open("file", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
```