Initial commit

Zhongwei Li
2025-11-30 08:30:18 +08:00
commit 74bee324ab
335 changed files with 147377 additions and 0 deletions


@@ -0,0 +1,399 @@
# MarkItDown API Reference
## Core Classes
### MarkItDown
The main class for converting files to Markdown.
```python
from markitdown import MarkItDown
md = MarkItDown(
    llm_client=None,
    llm_model=None,
    llm_prompt=None,
    docintel_endpoint=None,
    enable_plugins=False
)
```
#### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `llm_client` | OpenAI client | `None` | OpenAI-compatible client for AI image descriptions |
| `llm_model` | str | `None` | Model name (e.g., "anthropic/claude-sonnet-4.5") for image descriptions |
| `llm_prompt` | str | `None` | Custom prompt for image description |
| `docintel_endpoint` | str | `None` | Azure Document Intelligence endpoint |
| `enable_plugins` | bool | `False` | Enable 3rd-party plugins |
#### Methods
##### convert()
Convert a file to Markdown.
```python
result = md.convert(
    source,
    file_extension=None
)
```
**Parameters**:
- `source` (str): Path or URL of the file to convert
- `file_extension` (str, optional): Override file extension detection
**Returns**: `DocumentConverterResult` object
**Example**:
```python
result = md.convert("document.pdf")
print(result.text_content)
```
##### convert_stream()
Convert from a file-like binary stream.
```python
result = md.convert_stream(
    stream,
    file_extension
)
```
**Parameters**:
- `stream` (BinaryIO): Binary file-like object (e.g., file opened in `"rb"` mode)
- `file_extension` (str): File extension to determine conversion method (e.g., ".pdf")
**Returns**: `DocumentConverterResult` object
**Example**:
```python
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
print(result.text_content)
```
**Important**: The stream must be opened in binary mode (`"rb"`), not text mode.
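When the data is already in memory (for example, bytes downloaded from a service), it can be wrapped in `io.BytesIO` to satisfy the binary-stream requirement. The following is a minimal sketch; `convert_bytes` is a hypothetical helper, not part of MarkItDown, and it only assumes that `md` exposes `convert_stream()` as described above.
```python
import io


def convert_bytes(md, data, file_extension):
    """Wrap in-memory bytes in a binary stream and pass it to convert_stream().

    `md` is any object with a convert_stream(stream, file_extension=...) method,
    such as a MarkItDown instance. This helper is illustrative only.
    """
    if not isinstance(data, (bytes, bytearray)):
        # Text must be encoded first; convert_stream() expects binary input.
        raise TypeError("data must be bytes; encode text before converting")
    stream = io.BytesIO(data)  # BytesIO behaves like a file opened in "rb" mode
    return md.convert_stream(stream, file_extension=file_extension)
```
Usage would look like `result = convert_bytes(md, pdf_bytes, ".pdf")`, where `pdf_bytes` is whatever bytes object you already hold.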
## Result Object
### DocumentConverterResult
The result of a conversion operation.
#### Attributes
| Attribute | Type | Description |
|-----------|------|-------------|
| `text_content` | str | The converted Markdown text |
| `title` | str | Document title (if available) |
#### Example
```python
result = md.convert("paper.pdf")
# Access content
content = result.text_content
# Access title (if available)
title = result.title
```
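A common next step is writing the result to a `.md` file. The sketch below is one possible way to do that; `save_markdown` is a hypothetical helper, and the choice to prepend the title as a top-level heading is an assumption, not MarkItDown behavior.
```python
from pathlib import Path


def save_markdown(result, out_path):
    """Write a conversion result to disk and return the output path.

    `result` is any object with `text_content` and (optionally) `title`,
    such as a DocumentConverterResult. Illustrative helper only.
    """
    title = getattr(result, "title", None)
    body = result.text_content
    # Prepend the title only when one exists and the body has no leading heading.
    if title and not body.lstrip().startswith("#"):
        body = f"# {title}\n\n{body}"
    path = Path(out_path)
    path.write_text(body, encoding="utf-8")
    return path
```
For example, `save_markdown(md.convert("paper.pdf"), "paper.md")` would write the converted text next to the script.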
## Custom Converters
You can create custom document converters by implementing the `DocumentConverter` interface.
### DocumentConverter Interface
```python
from markitdown import DocumentConverter

class CustomConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info, **kwargs):
        """
        Return True if this converter can handle the input.
        Parameters:
            file_stream (BinaryIO): Binary file-like object
            stream_info (StreamInfo): Detected metadata (extension, mimetype, ...)
        """
        return (stream_info.extension or "").lower() == ".custom"

    def convert(self, file_stream, stream_info, **kwargs):
        """
        Convert a document from a binary stream.
        Returns:
            DocumentConverterResult: Conversion result
        """
        # Your conversion logic here
        raise NotImplementedError
```
### Registering Custom Converters
```python
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult

class MyCustomConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info, **kwargs):
        # Claim only files with the .custom extension
        return (stream_info.extension or "").lower() == ".custom"

    def convert(self, file_stream, stream_info, **kwargs):
        content = file_stream.read().decode('utf-8')
        markdown_text = f"# Custom Format\n\n{content}"
        return DocumentConverterResult(
            markdown=markdown_text,
            title="Custom Document"
        )

# Create MarkItDown instance
md = MarkItDown()
# Register the converter; accepts() decides which files it handles
md.register_converter(MyCustomConverter())
# Use it
result = md.convert("myfile.custom")
```
## Plugin System
### Finding Plugins
Search GitHub for the `#markitdown-plugin` hashtag.
### Using Plugins
```python
from markitdown import MarkItDown
# Enable plugins
md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")
```
### Creating Plugins
Plugins are Python packages that register converters with MarkItDown.
**Plugin Structure**:
```
my-markitdown-plugin/
├── setup.py
├── my_plugin/
│ ├── __init__.py
│ └── converter.py
└── README.md
```
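**setup.py**:
```python
from setuptools import setup

setup(
    name="markitdown-my-plugin",
    version="0.1.0",
    packages=["my_plugin"],
    entry_points={
        # MarkItDown 0.1.x looks up plugins in the "markitdown.plugin" group;
        # the entry point names a module that defines register_converters().
        "markitdown.plugin": [
            "my_plugin = my_plugin.converter",
        ],
    },
)
```
**converter.py**:
```python
from markitdown import DocumentConverter, DocumentConverterResult

__plugin_interface_version__ = 1  # plugin interface version

def register_converters(markitdown, **kwargs):
    # Called by MarkItDown when plugins are enabled
    markitdown.register_converter(MyConverter())

class MyConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info, **kwargs):
        # ".custom" is an example extension
        return (stream_info.extension or "").lower() == ".custom"

    def convert(self, file_stream, stream_info, **kwargs):
        # Your conversion logic
        content = file_stream.read()
        markdown = self.process(content)
        return DocumentConverterResult(markdown=markdown, title="My Document")

    def process(self, content):
        # Process content
        return "# Converted Content\n\n..."
```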
## AI-Enhanced Conversions
### Using OpenRouter for Image Descriptions
```python
from markitdown import MarkItDown
from openai import OpenAI
# Initialize OpenRouter client (OpenAI-compatible API)
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)
# Create MarkItDown with AI support
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
    llm_prompt="Describe this image in detail for scientific documentation"
)
# Convert files with images
result = md.convert("presentation.pptx")
```
### Available Models via OpenRouter
Popular models with vision support:
- `anthropic/claude-sonnet-4.5` - **Claude Sonnet 4.5 (recommended for scientific vision)**
- `anthropic/claude-3.5-sonnet` - Claude 3.5 Sonnet
- `openai/gpt-4o` - GPT-4 Omni
- `openai/gpt-4-vision` - GPT-4 Vision
- `google/gemini-pro-vision` - Gemini Pro Vision
See https://openrouter.ai/models for the complete list.
### Custom Prompts
```python
# For scientific diagrams
scientific_prompt = """
Analyze this scientific diagram or chart. Describe:
1. The type of visualization (graph, chart, diagram, etc.)
2. Key data points or trends
3. Labels and axes
4. Scientific significance
Be precise and technical.
"""
md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt=scientific_prompt
)
```
## Azure Document Intelligence
### Setup
1. Create Azure Document Intelligence resource
2. Get endpoint URL
3. Set authentication
### Usage
```python
from markitdown import MarkItDown
md = MarkItDown(
    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/"
)
result = md.convert("complex_document.pdf")
```
### Authentication
Set environment variables:
```bash
export AZURE_DOCUMENT_INTELLIGENCE_KEY="your-key"
```
Or pass credentials programmatically.
## Error Handling
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("document.pdf")
    print(result.text_content)
except FileNotFoundError:
    print("File not found")
except ValueError as e:
    print(f"Invalid file format: {e}")
except Exception as e:
    print(f"Conversion error: {e}")
```
## Performance Tips
### 1. Reuse MarkItDown Instance
```python
# Good: Create once, use many times
md = MarkItDown()
for file in files:
    result = md.convert(file)
    process(result)
```
### 2. Use Streaming for Large Files
```python
# For large files
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
```
### 3. Batch Processing
```python
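from concurrent.futures import ThreadPoolExecutor
from markitdown import MarkItDown

md = MarkItDown()

def convert_file(filepath):
    return md.convert(filepath)

# file_list is your own list of paths to convert
with ThreadPoolExecutor(max_workers=4) as executor:
    # list() collects the results (in input order) before the pool shuts down
    results = list(executor.map(convert_file, file_list))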
```
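With `executor.map`, the first failing file raises when its result is read, and the remaining results are lost. The sketch below is one way to keep going and record failures per file. `convert_batch` is a hypothetical helper; it takes any callable (for example `md.convert`) and makes no assumption about whether a shared `MarkItDown` instance is thread-safe for every converter.
```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_batch(convert, paths, max_workers=4):
    """Run convert(path) in a thread pool.

    Returns (results, errors): two dicts keyed by path, holding the return
    value for successful conversions and the exception for failed ones.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its path so failures can be attributed.
        futures = {executor.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and continue
                errors[path] = exc
    return results, errors
```
A call such as `results, errors = convert_batch(md.convert, file_list)` then lets you retry or report only the entries in `errors`.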
## Breaking Changes (v0.0.1 to v0.1.0)
1. **Dependencies**: Now organized into optional feature groups
```bash
# Old
pip install markitdown
# New
pip install 'markitdown[all]'
```
2. **convert_stream()**: Now requires a binary file-like object
```python
# Old (also accepted text)
with open("file.pdf", "r") as f:  # text mode
    result = md.convert_stream(f)
# New (binary only)
with open("file.pdf", "rb") as f:  # binary mode
    result = md.convert_stream(f, file_extension=".pdf")
```
3. **DocumentConverter Interface**: Changed to read from streams instead of file paths
- No temporary files created
- More memory efficient
- Plugins need updating
## Version Compatibility
- **Python**: 3.10 or higher required
- **Dependencies**: Check `setup.py` for version constraints
- **OpenAI**: Compatible with OpenAI Python SDK v1.0+
## Environment Variables
| Variable | Description | Example |
|----------|-------------|---------|
| `OPENROUTER_API_KEY` | OpenRouter API key for image descriptions | `sk-or-v1-...` |
| `AZURE_DOCUMENT_INTELLIGENCE_KEY` | Azure DI authentication | `key123...` |
| `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` | Azure DI endpoint | `https://...` |
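The table lists the variable names but not how they reach the constructor. The sketch below reads them explicitly and builds keyword arguments, so nothing depends on a library picking them up implicitly. Both helper names are hypothetical; the OpenRouter base URL is the one used earlier in this reference.
```python
import os


def markitdown_kwargs_from_env(env=None):
    """Build MarkItDown keyword arguments from environment variables."""
    env = os.environ if env is None else env
    kwargs = {}
    endpoint = env.get("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT", "").strip()
    if endpoint:
        kwargs["docintel_endpoint"] = endpoint
    return kwargs


def openrouter_client_kwargs(env=None):
    """Build keyword arguments for an OpenAI-compatible client pointed at OpenRouter."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY", "").strip()
    if not api_key:
        # Fail early instead of sending unauthenticated requests.
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return {"api_key": api_key, "base_url": "https://openrouter.ai/api/v1"}
```
These would be combined as `client = OpenAI(**openrouter_client_kwargs())` followed by `MarkItDown(llm_client=client, llm_model="anthropic/claude-sonnet-4.5", **markitdown_kwargs_from_env())`.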


@@ -0,0 +1,542 @@
# File Format Support
This document provides detailed information about each file format supported by MarkItDown.
## Document Formats
### PDF (.pdf)
**Capabilities**:
- Text extraction
- Table detection
- Metadata extraction
- OCR for scanned documents (with dependencies)
**Dependencies**:
```bash
pip install 'markitdown[pdf]'
```
**Best For**:
- Scientific papers
- Reports
- Books
- Forms
**Limitations**:
- Complex layouts may not preserve perfect formatting
- Scanned PDFs require OCR setup
- Some PDF features (annotations, forms) may not convert
**Example**:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("research_paper.pdf")
print(result.text_content)
```
**Enhanced with Azure Document Intelligence**:
```python
md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")
```
---
### Microsoft Word (.docx)
**Capabilities**:
- Text extraction
- Table conversion
- Heading hierarchy
- List formatting
- Basic text formatting (bold, italic)
**Dependencies**:
```bash
pip install 'markitdown[docx]'
```
**Best For**:
- Research papers
- Reports
- Documentation
- Manuscripts
**Preserved Elements**:
- Headings (converted to Markdown headers)
- Tables (converted to Markdown tables)
- Lists (bulleted and numbered)
- Basic formatting (bold, italic)
- Paragraphs
**Example**:
```python
result = md.convert("manuscript.docx")
```
---
### PowerPoint (.pptx)
**Capabilities**:
- Slide content extraction
- Speaker notes
- Table extraction
- Image descriptions (with AI)
**Dependencies**:
```bash
pip install 'markitdown[pptx]'
```
**Best For**:
- Presentations
- Lecture slides
- Conference talks
**Output Format**:
```markdown
# Slide 1: Title
Content from slide 1...
**Notes**: Speaker notes appear here
---
# Slide 2: Next Topic
...
```
**With AI Image Descriptions**:
```python
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")
```
---
### Excel (.xlsx, .xls)
**Capabilities**:
- Sheet extraction
- Table formatting
- Data preservation
- Formula values (calculated)
**Dependencies**:
```bash
pip install 'markitdown[xlsx]' # Modern Excel
pip install 'markitdown[xls]' # Legacy Excel
```
**Best For**:
- Data tables
- Research data
- Statistical results
- Experimental data
**Output Format**:
```markdown
# Sheet: Results
| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1 | 10.2 | 12.5 | 0.023 |
| 2 | 9.8 | 11.9 | 0.031 |
```
**Example**:
```python
result = md.convert("experimental_data.xlsx")
```
---
## Image Formats
### Images (.jpg, .jpeg, .png, .gif, .webp)
**Capabilities**:
- EXIF metadata extraction
- OCR text extraction
- AI-powered image descriptions
**Dependencies**:
```bash
pip install 'markitdown[all]' # Includes image support
```
**Best For**:
- Scanned documents
- Charts and graphs
- Scientific diagrams
- Photographs with text
**Output Without AI**:
```markdown
![Image](image.jpg)
**EXIF Data**:
- Camera: Canon EOS 5D
- Date: 2024-01-15
- Resolution: 4000x3000
```
**With AI Descriptions**:
```python
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific diagram in detail"
)
result = md.convert("graph.png")
```
**OCR for Text Extraction**:
Requires Tesseract OCR:
```bash
# macOS
brew install tesseract
# Ubuntu
sudo apt-get install tesseract-ocr
```
---
## Audio Formats
### Audio (.wav, .mp3)
**Capabilities**:
- Metadata extraction
- Speech-to-text transcription
- Duration and technical info
**Dependencies**:
```bash
pip install 'markitdown[audio-transcription]'
```
**Best For**:
- Lecture recordings
- Interviews
- Podcasts
- Meeting recordings
**Output Format**:
```markdown
# Audio: interview.mp3
**Metadata**:
- Duration: 45:32
- Bitrate: 320kbps
- Sample Rate: 44100Hz
**Transcription**:
[Transcribed text appears here...]
```
**Example**:
```python
result = md.convert("lecture.mp3")
```
---
## Web Formats
### HTML (.html, .htm)
**Capabilities**:
- Clean HTML to Markdown conversion
- Link preservation
- Table conversion
- List formatting
**Best For**:
- Web pages
- Documentation
- Blog posts
- Online articles
**Output Format**: Clean Markdown with preserved links and structure
**Example**:
```python
result = md.convert("webpage.html")
```
---
### YouTube URLs
**Capabilities**:
- Fetch video transcriptions
- Extract video metadata
- Caption download
**Dependencies**:
```bash
pip install 'markitdown[youtube-transcription]'
```
**Best For**:
- Educational videos
- Lectures
- Talks
- Tutorials
**Example**:
```python
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
```
---
## Data Formats
### CSV (.csv)
**Capabilities**:
- Automatic table conversion
- Delimiter detection
- Header preservation
**Output Format**: Markdown tables
**Example**:
```python
result = md.convert("data.csv")
```
**Output**:
```markdown
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1 | Value2 | Value3 |
```
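To make the mapping from rows to a Markdown table concrete, here is a small standard-library sketch. It illustrates the idea only and is not MarkItDown's implementation; `csv_to_markdown` is a hypothetical name, and the choice to pad short rows and escape `|` is an assumption.
```python
import csv
import io


def csv_to_markdown(text):
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def render(cells):
        # Escape pipes so cell content cannot break the table layout.
        return "| " + " | ".join(c.replace("|", r"\|") for c in cells) + " |"

    lines = [render(header), render(["---"] * width)]
    for row in body:
        # Pad or trim so every row has as many cells as the header.
        lines.append(render((row + [""] * width)[:width]))
    return "\n".join(lines)
```
Calling `csv_to_markdown("Column1,Column2\nValue1,Value2\n")` yields a two-column table with one data row.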
---
### JSON (.json)
**Capabilities**:
- Structured representation
- Pretty formatting
- Nested data visualization
**Best For**:
- API responses
- Configuration files
- Data exports
**Example**:
```python
result = md.convert("data.json")
```
---
### XML (.xml)
**Capabilities**:
- Structure preservation
- Attribute extraction
- Formatted output
**Best For**:
- Configuration files
- Data interchange
- Structured documents
**Example**:
```python
result = md.convert("config.xml")
```
---
## Archive Formats
### ZIP (.zip)
**Capabilities**:
- Iterates through archive contents
- Converts each file individually
- Maintains directory structure in output
**Best For**:
- Document collections
- Project archives
- Batch conversions
**Output Format**:
```markdown
# Archive: documents.zip
## File: document1.pdf
[Content from document1.pdf...]
---
## File: document2.docx
[Content from document2.docx...]
```
**Example**:
```python
result = md.convert("archive.zip")
```
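`md.convert("archive.zip")` returns one combined document. If you need the members separately (for example, to write one Markdown file per entry), the archive can be walked with the standard `zipfile` module and each member passed to `convert_stream()`. This is a sketch under that assumption; `convert_zip_members` is a hypothetical helper, not a MarkItDown API.
```python
import io
import zipfile
from pathlib import PurePosixPath


def convert_zip_members(md, zip_path):
    """Yield (member_name, markdown_text) for each file inside a ZIP archive.

    `md` is any object with convert_stream(stream, file_extension=...) whose
    result has a `text_content` attribute, such as a MarkItDown instance.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content to convert
            # ZIP member names always use forward slashes.
            suffix = PurePosixPath(info.filename).suffix.lower()
            data = archive.read(info)
            result = md.convert_stream(io.BytesIO(data), file_extension=suffix)
            yield info.filename, result.text_content
```
Because the function is a generator, members are read and converted one at a time as you iterate.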
---
## E-book Formats
### EPUB (.epub)
**Capabilities**:
- Full text extraction
- Chapter structure
- Metadata extraction
**Best For**:
- E-books
- Digital publications
- Long-form content
**Output Format**: Markdown with preserved chapter structure
**Example**:
```python
result = md.convert("book.epub")
```
---
## Other Formats
### Outlook Messages (.msg)
**Capabilities**:
- Email content extraction
- Attachment listing
- Metadata (from, to, subject, date)
**Dependencies**:
```bash
pip install 'markitdown[outlook]'
```
**Best For**:
- Email archives
- Communication records
**Example**:
```python
result = md.convert("message.msg")
```
---
## Format-Specific Tips
### PDF Best Practices
1. **Use Azure Document Intelligence for complex layouts**:
```python
md = MarkItDown(docintel_endpoint="endpoint_url")
```
2. **For scanned PDFs, ensure OCR is set up**:
```bash
brew install tesseract # macOS
```
3. **Split very large PDFs before conversion** for better performance
### PowerPoint Best Practices
1. **Use AI for visual content**:
```python
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```
2. **Check speaker notes** - they're included in output
3. **Complex animations won't be captured** - static content only
### Excel Best Practices
1. **Large spreadsheets** may take time to convert
2. **Formulas are converted to their calculated values**
3. **Multiple sheets** are all included in output
4. **Charts become text descriptions** (use AI for better descriptions)
### Image Best Practices
1. **Use AI for meaningful descriptions**:
```python
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail"
)
```
2. **For text-heavy images, ensure OCR dependencies** are installed
3. **High-resolution images** may take longer to process
### Audio Best Practices
1. **Clear audio** produces better transcriptions
2. **Long recordings** may take significant time
3. **Consider splitting long audio files** for faster processing
---
## Unsupported Formats
If you need to convert an unsupported format:
1. **Create a custom converter** (see `api_reference.md`)
2. **Look for plugins** on GitHub (#markitdown-plugin)
3. **Pre-convert to supported format** (e.g., convert .rtf to .docx)
---
## Format Detection
MarkItDown automatically detects format from:
1. **File extension** (primary method)
2. **MIME type** (fallback)
3. **File signature** (magic bytes, fallback)
**Override detection**:
```python
# Force specific format
result = md.convert("file_without_extension", file_extension=".pdf")
# With streams
with open("file", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
```
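The three-step order above (extension, then MIME type, then file signature) can be sketched with the standard library to decide which `file_extension` override to pass. This illustrates the fallback idea only and is not MarkItDown's detection code; `guess_extension` and the short signature table are assumptions made for the example.
```python
import mimetypes
from pathlib import Path

# A few well-known file signatures (magic bytes); deliberately incomplete.
# Note: .docx, .xlsx, .pptx and .epub are ZIP containers, so the ZIP
# signature alone cannot tell them apart.
_SIGNATURES = {
    b"%PDF-": ".pdf",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"PK\x03\x04": ".zip",
}


def guess_extension(path, mime_type=None, head=b""):
    """Return a likely extension, or None if nothing matches.

    `mime_type` is an optional MIME string (e.g. from an HTTP header) and
    `head` is the first few bytes of the file.
    """
    suffix = Path(path).suffix.lower()
    if suffix:  # 1. file extension
        return suffix
    if mime_type:  # 2. MIME type, ignoring parameters such as charset
        ext = mimetypes.guess_extension(mime_type.split(";")[0].strip())
        if ext:
            return ext
    for magic, ext in _SIGNATURES.items():  # 3. file signature
        if head.startswith(magic):
            return ext
    return None
```
The returned value can then be passed as `file_extension=` to `convert()` or `convert_stream()` when the file name alone is not informative.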