Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions


@@ -0,0 +1,538 @@
# Advanced Integrations Reference
This document provides detailed information about advanced MarkItDown features, including Azure Document Intelligence integration, LLM-powered image descriptions, and the plugin system.
## Azure Document Intelligence Integration
Azure Document Intelligence (formerly Form Recognizer) provides superior PDF processing with advanced table extraction and layout analysis.
### Setup
**Prerequisites:**
1. Azure subscription
2. Document Intelligence resource created in Azure
3. Endpoint URL and API key
**Create Azure Resource:**
```bash
# Using Azure CLI
az cognitiveservices account create \
  --name my-doc-intelligence \
  --resource-group my-resource-group \
  --kind FormRecognizer \
  --sku F0 \
  --location eastus
```
### Basic Usage
```python
from markitdown import MarkItDown
md = MarkItDown(
    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_document.pdf")
print(result.text_content)
```
### Configuration from Environment Variables
```python
import os
from markitdown import MarkItDown
# For demonstration only: in practice, set these in your shell or a secrets manager
os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'
# Read the credentials from the environment instead of hard-coding them
md = MarkItDown(
    docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
    docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
)
result = md.convert("document.pdf")
```
### When to Use Azure Document Intelligence
**Use for:**
- Complex PDFs with sophisticated tables
- Multi-column layouts
- Forms and structured documents
- Scanned documents requiring OCR
- PDFs with mixed content types
- Documents with intricate formatting
**Benefits over standard extraction:**
- **Superior table extraction** - Better handling of merged cells, complex layouts
- **Layout analysis** - Understands document structure (headers, footers, columns)
- **Form fields** - Extracts key-value pairs from forms
- **Reading order** - Maintains correct text flow in complex layouts
- **OCR quality** - High-quality text extraction from scanned documents
### Comparison Example
**Standard extraction:**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("complex_table.pdf")
# May struggle with complex tables
```
**Azure Document Intelligence:**
```python
from markitdown import MarkItDown
md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)
result = md.convert("complex_table.pdf")
# Better table reconstruction and layout understanding
```
### Cost Considerations
Azure Document Intelligence is billed per page beyond a limited free tier:
- **Free tier**: 500 pages per month
- **Paid tiers**: Pay per page processed
- Monitor usage to control costs
- Use standard extraction for simple documents
### Error Handling
```python
from markitdown import MarkItDown
md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)
try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Document Intelligence error: {e}")
    # Common issues: authentication, quota exceeded, unsupported file
## LLM-Powered Image Descriptions
Generate detailed, contextual descriptions for images using large language models.
### Setup with OpenAI
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("image.jpg")
print(result.text_content)
```
### Supported Use Cases
**Images in documents:**
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
# PowerPoint with images
result = md.convert("presentation.pptx")
# Word documents with images
result = md.convert("report.docx")
# Standalone images
result = md.convert("diagram.png")
```
### Custom Prompts
Customize the LLM prompt for specific needs:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
)
# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
)
# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
)
# For scientific figures
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
)
```
### Model Selection
**GPT-4o (Recommended):**
- Best vision capabilities
- High-quality descriptions
- Good at understanding context
- Higher cost per image
**GPT-4o-mini:**
- Lower cost alternative
- Good for simpler images
- Faster processing
- May miss subtle details
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
# High quality (more expensive)
md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")
# Budget option (less expensive)
md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
```
### Configuration from Environment
```python
import os
from markitdown import MarkItDown
from openai import OpenAI
# Set API key in environment
os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'
client = OpenAI() # Uses env variable
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```
### Alternative LLM Providers
**Anthropic Claude:**
```python
from markitdown import MarkItDown
from anthropic import Anthropic
# Note: Check current compatibility with MarkItDown
client = Anthropic(api_key="YOUR-API-KEY")
# May require adapter for MarkItDown compatibility
```
**Azure OpenAI:**
```python
from markitdown import MarkItDown
from openai import AzureOpenAI
client = AzureOpenAI(
    api_key="YOUR-AZURE-KEY",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
)
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```
### Cost Management
**Strategies to reduce LLM costs:**
1. **Selective processing:**
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
# Only use the LLM for important documents.
# is_important_document() is a placeholder for your own selection logic.
if is_important_document(file):
    md = MarkItDown(llm_client=client, llm_model="gpt-4o")
else:
    md = MarkItDown()  # Standard processing
result = md.convert(file)
```
2. **Image filtering:**
```python
# Pre-process to identify images that need descriptions
# Only use LLM for complex/important images
```
3. **Batch processing:**
```python
# Process multiple images in batches
# Monitor costs and set limits
```
4. **Model selection:**
```python
# Use gpt-4o-mini for simple images
# Reserve gpt-4o for complex visualizations
```
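The image-filtering and model-selection strategies above can be combined into a simple routing rule. The sketch below is one possible heuristic, not part of MarkItDown: it assumes file size is a rough proxy for image complexity, and the threshold and the helper names `pick_model_for_size` and `pick_model` are hypothetical.

```python
import os

# Hypothetical threshold: files at or below this size are treated as "simple".
SIMPLE_IMAGE_MAX_BYTES = 200_000


def pick_model_for_size(size_bytes, threshold=SIMPLE_IMAGE_MAX_BYTES):
    """Return the cheaper model name for small images, the larger one otherwise."""
    return "gpt-4o-mini" if size_bytes <= threshold else "gpt-4o"


def pick_model(image_path, threshold=SIMPLE_IMAGE_MAX_BYTES):
    """Choose a model name from the size of the file on disk."""
    return pick_model_for_size(os.path.getsize(image_path), threshold)
```

The returned name can be passed as `llm_model` when constructing `MarkItDown`. File size is a crude signal, so tune the threshold against a sample of your own images.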
### Performance Considerations
**LLM processing adds latency:**
- Each image requires an API call
- Processing time: 1-5 seconds per image
- Network dependent
- Consider parallel processing for multiple images
**Batch optimization:**
```python
from markitdown import MarkItDown
from openai import OpenAI
import concurrent.futures
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
def process_image(image_path):
    return md.convert(image_path)
# Process multiple images in parallel
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_image, images))
```
## Combined Advanced Features
### Azure Document Intelligence + LLM Descriptions
Combine both for maximum quality:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-AZURE-ENDPOINT",
    docintel_key="YOUR-AZURE-KEY"
)
# Best possible PDF conversion with image descriptions
result = md.convert("complex_report.pdf")
```
**Use cases:**
- Research papers with figures
- Business reports with charts
- Technical documentation with diagrams
- Presentations with visual data
### Smart Document Processing Pipeline
```python
from markitdown import MarkItDown
from openai import OpenAI
import os
def smart_convert(file_path):
    """Choose a processing method based on file type."""
    ext = os.path.splitext(file_path)[1].lower()
    # PDFs with complex tables: use Azure
    if ext == '.pdf':
        md = MarkItDown(
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )
    # Documents/presentations with images: use LLM
    elif ext in ['.pptx', '.docx']:
        # Create the client only when it is needed
        client = OpenAI()
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o"
        )
    # Simple formats: standard processing
    else:
        md = MarkItDown()
    return md.convert(file_path)
# Use it
result = smart_convert("document.pdf")
```
## Plugin System
MarkItDown supports custom plugins for extending functionality.
### Plugin Architecture
Plugins are disabled by default for security:
```python
from markitdown import MarkItDown
# Enable plugins
md = MarkItDown(enable_plugins=True)
```
### Creating Custom Plugins
**Plugin structure:**
```python
class CustomConverter:
    """Custom converter plugin for MarkItDown."""
    def can_convert(self, file_path):
        """Check if this plugin can handle the file."""
        return file_path.endswith('.custom')
    def convert(self, file_path):
        """Convert file to Markdown."""
        # Your conversion logic here
        return {
            'text_content': '# Converted Content\n\n...'
        }
```
### Plugin Registration
```python
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=True)
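# Register custom plugin (illustrative call; confirm the registration method,
# such as register_converter, in your installed MarkItDown version)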
md.register_plugin(CustomConverter())
# Use normally
result = md.convert("file.custom")
```
### Plugin Use Cases
**Custom formats:**
- Proprietary document formats
- Specialized scientific data formats
- Legacy file formats
**Enhanced processing:**
- Custom OCR engines
- Specialized table extraction
- Domain-specific parsing
**Integration:**
- Enterprise document systems
- Custom databases
- Specialized APIs
### Plugin Security
**Important security considerations:**
- Plugins run with full system access
- Only enable for trusted plugins
- Validate plugin code before use
- Disable plugins in production unless required
## Error Handling for Advanced Features
```python
import os
from markitdown import MarkItDown
from openai import OpenAI
def robust_convert(file_path):
    """Convert with fallback strategies."""
    try:
        # Try with all advanced features
        client = OpenAI()
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o",
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )
        return md.convert(file_path)
    except Exception as azure_error:
        print(f"Azure failed: {azure_error}")
        try:
            # Fallback: LLM only
            client = OpenAI()
            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
            return md.convert(file_path)
        except Exception as llm_error:
            print(f"LLM failed: {llm_error}")
            # Final fallback: standard processing
            md = MarkItDown()
            return md.convert(file_path)
# Use it
result = robust_convert("document.pdf")
```
## Best Practices
### Azure Document Intelligence
- Use for complex PDFs only (cost optimization)
- Monitor usage and costs
- Store credentials securely
- Handle quota limits gracefully
- Fall back to standard processing if needed
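One way to handle quota limits gracefully is to retry with exponential backoff before giving up. The following is a minimal sketch under stated assumptions: `convert_with_retry` is a hypothetical helper, `convert` is any callable such as `md.convert`, and every exception is treated as retryable, which is broader than you may want in practice.

```python
import time


def convert_with_retry(convert, path, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call convert(path), retrying with exponential backoff on failure.

    `convert` is any callable taking a path (for example, md.convert).
    `sleep` is injectable so the delay can be stubbed out in tests.
    """
    for attempt in range(attempts):
        try:
            return convert(path)
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of retries: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
```

If the final attempt still fails, the caller can catch the exception and fall back to a plain `MarkItDown()` instance.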
### LLM Integration
- Use appropriate models for task complexity
- Customize prompts for specific use cases
- Monitor API costs
- Implement rate limiting
- Cache results when possible
- Handle API errors gracefully
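Caching can be sketched by keying results on a hash of the file contents, so an unchanged file is never sent to the LLM twice. This is an illustrative in-memory version: `file_digest` and `cached_convert` are hypothetical names, and `convert` stands in for a callable such as `md.convert`.

```python
import hashlib

_cache = {}


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_convert(convert, path, cache=_cache):
    """Return a cached result for identical file contents, converting only on a miss."""
    key = file_digest(path)
    if key not in cache:
        cache[key] = convert(path)
    return cache[key]
```

The key covers only the file contents. If you vary the model or prompt, include them in the key as well, and use a persistent store if results must survive restarts.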
### Combined Features
- Test cost/quality tradeoffs
- Use selectively for important documents
- Implement intelligent routing
- Monitor performance and costs
- Have fallback strategies
### Security
- Store API keys securely (environment variables, secrets manager)
- Never commit credentials to code
- Disable plugins unless required
- Validate all inputs
- Use least privilege access
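To keep credentials out of code, a small helper can read them from the environment and fail fast when one is missing. This is a sketch; `require_env` is a hypothetical helper, and the variable names you pass are whatever your deployment defines.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

For example, `require_env("AZURE_KEY")` raises a clear error at startup instead of letting an empty key reach the API call.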


@@ -0,0 +1,273 @@
# Document Conversion Reference
This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.
## PDF Files
PDF conversion extracts text, tables, and structure from PDF documents.
### Basic PDF Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```
### PDF with Azure Document Intelligence
For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:
```python
from markitdown import MarkItDown
md = MarkItDown(
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_table.pdf")
print(result.text_content)
```
**Benefits of Azure Document Intelligence:**
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents
### PDF Handling Notes
- Scanned PDFs require OCR (automatically handled if tesseract is installed)
- Password-protected PDFs are not supported
- Large PDFs may take longer to process
- Vector graphics and embedded images are extracted where possible
## Word Documents (DOCX)
Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.
### Basic DOCX Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
```
### DOCX Structure Preservation
MarkItDown preserves:
- **Headings** → Markdown headers (`#`, `##`, etc.)
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
- **Lists** → Markdown lists (ordered and unordered)
- **Tables** → Markdown tables
- **Hyperlinks** → Markdown links `[text](url)`
- **Images** → Referenced with descriptions (can use LLM for descriptions)
### Command-Line Usage
```bash
# Basic conversion
markitdown report.docx -o report.md
# With output directory
markitdown report.docx -o output/report.md
```
### DOCX with Images
To generate descriptions for images in Word documents, use LLM integration:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")
```
## PowerPoint Presentations (PPTX)
PowerPoint conversion extracts text from slides while preserving structure.
### Basic PPTX Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```
### PPTX Structure
MarkItDown processes presentations as:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Notes are included if present
### PPTX with Image Descriptions
Presentations often contain important visual information. Use LLM integration to describe images:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")
```
**Custom prompts for presentations:**
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"
## Excel Spreadsheets (XLSX, XLS)
Excel conversion formats spreadsheet data as Markdown tables.
### Basic XLSX Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)
```
### Multi-Sheet Workbooks
For workbooks with multiple sheets:
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas are evaluated (values shown, not formulas)
### XLSX Conversion Details
**What's preserved:**
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)
**What's not preserved:**
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes
### Large Spreadsheets
For large spreadsheets, consider:
- Processing may be slower for files with many rows/columns
- Very wide tables may not format well in Markdown
- Consider filtering or preprocessing data if possible
### XLS (Legacy Excel) Files
Legacy `.xls` files are supported but require additional dependencies:
```bash
pip install 'markitdown[xls]'
```
Then use normally:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("legacy_data.xls")
```
## Common Document Conversion Patterns
### Batch Document Processing
```python
from markitdown import MarkItDown
import os
md = MarkItDown()
# Make sure the output directory exists before writing
os.makedirs("markdown", exist_ok=True)
# Process all documents in a directory
for filename in os.listdir("documents"):
    if filename.lower().endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
        result = md.convert(os.path.join("documents", filename))
        # Save to output directory
        output_name = os.path.splitext(filename)[0] + ".md"
        with open(os.path.join("markdown", output_name), "w", encoding="utf-8") as f:
            f.write(result.text_content)
```
### Document with Mixed Content
For documents containing multiple types of content (text, tables, images):
```python
from markitdown import MarkItDown
from openai import OpenAI
# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)
result = md.convert("complex_report.pdf")
```
### Error Handling
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Conversion failed: {e}")
    # Handle specific errors (file not found, unsupported format, etc.)
```
## Output Quality Tips
**For best results:**
1. Use Azure Document Intelligence for PDFs with complex tables
2. Enable LLM descriptions for documents with important visual content
3. Ensure source documents are well-structured (proper headings, etc.)
4. For scanned documents, ensure good scan quality for OCR accuracy
5. Test with sample documents to verify output quality
## Performance Considerations
**Conversion speed depends on:**
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources
**Optimization tips:**
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents


@@ -0,0 +1,365 @@
# Media Processing Reference
This document provides detailed information about processing images and audio files with MarkItDown.
## Image Processing
MarkItDown can extract text from images using OCR and retrieve EXIF metadata.
### Basic Image Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
```
### Image Processing Features
**What's extracted:**
1. **EXIF Metadata** - Camera settings, date, location, etc.
2. **OCR Text** - Text detected in the image (requires tesseract)
3. **Image Description** - AI-generated description (with LLM integration)
### EXIF Metadata Extraction
Images from cameras and smartphones contain EXIF metadata that's automatically extracted:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)
```
**Example output includes:**
- Camera make and model
- Capture date and time
- GPS coordinates (if available)
- Exposure settings (ISO, shutter speed, aperture)
- Image dimensions
- Orientation
### OCR (Optical Character Recognition)
Extract text from images containing text (screenshots, scanned documents, photos of text):
**Requirements:**
- Install tesseract OCR engine:
```bash
# macOS
brew install tesseract
# Ubuntu/Debian
apt-get install tesseract-ocr
# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki
```
**Usage:**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content) # Contains OCR'd text
```
**Best practices for OCR:**
- Use high-resolution images for better accuracy
- Ensure good contrast between text and background
- Straighten skewed text if possible
- Use well-lit, clear images
### LLM-Generated Image Descriptions
Generate detailed, contextual descriptions of images using GPT-4o or other vision models:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)
```
**Custom prompts for specific needs:**
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)
# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)
# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
```
### Supported Image Formats
MarkItDown supports all common image formats:
- JPEG/JPG
- PNG
- GIF
- BMP
- TIFF
- WebP
- HEIC (requires additional libraries on some platforms)
## Audio Processing
MarkItDown can transcribe audio files to text using speech recognition.
### Basic Audio Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content) # Transcribed speech
```
### Audio Transcription Setup
**Installation:**
```bash
pip install 'markitdown[audio]'
```
This installs the `speech_recognition` library and dependencies.
### Supported Audio Formats
- WAV
- AIFF
- FLAC
- MP3 (requires ffmpeg or libav)
- OGG (requires ffmpeg or libav)
- Other formats supported by speech_recognition
### Audio Transcription Engines
MarkItDown uses the `speech_recognition` library, which supports multiple backends:
**Default (Google Speech Recognition):**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("audio.wav")
```
**Note:** The default Google Speech Recognition backend requires an internet connection.
### Audio Quality Considerations
For best transcription accuracy:
- Use clear audio with minimal background noise
- Prefer WAV or FLAC for better quality
- Ensure speech is clear and at good volume
- Avoid multiple overlapping speakers
- Use mono audio when possible
### Audio Preprocessing Tips
For better results, consider preprocessing audio:
```python
# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize
# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")
# Then convert with MarkItDown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")
```
## Combined Media Workflows
### Processing Multiple Images in Batch
```python
from markitdown import MarkItDown
from openai import OpenAI
import os
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
# Make sure the output directory exists before writing
os.makedirs("output", exist_ok=True)
# Process all images in directory
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(f"images/{filename}")
        # Save markdown with same name
        output = filename.rsplit('.', 1)[0] + '.md'
        with open(f"output/{output}", "w", encoding="utf-8") as f:
            f.write(result.text_content)
```
### Screenshot Analysis Pipeline
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)
screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []
for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })
# Now ready for further processing
```
### Document Images with OCR
For scanned documents or photos of documents:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []
for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)
# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)
```
### Presentation Slide Images
When you have presentation slides as images:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)
# Process slide images
for i in range(1, 21):  # 20 slides
    result = md.convert(f"slides/slide_{i}.png")
    print(f"## Slide {i}\n\n{result.text_content}\n\n")
```
## Error Handling
### Image Processing Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")
```
### Audio Processing Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
```
## Performance Optimization
### Image Processing
- **LLM descriptions**: Slower but more informative
- **OCR only**: Faster for text extraction
- **EXIF only**: Fastest, metadata only
- **Batch processing**: Process multiple images in parallel
### Audio Processing
- **File size**: Larger files take longer
- **Audio length**: Transcription time scales with duration
- **Format conversion**: WAV/FLAC are faster than MP3/OGG
- **Network dependency**: Default transcription requires internet
## Use Cases
### Document Digitization
Convert scanned documents or photos of documents to searchable text.
### Meeting Notes
Transcribe audio recordings of meetings to text for analysis.
### Presentation Analysis
Extract content from presentation slide images.
### Screenshot Documentation
Generate descriptions of UI screenshots for documentation.
### Image Archiving
Extract metadata and content from photo collections.
### Accessibility
Generate alt-text descriptions for images using LLM integration.
### Data Extraction
OCR text from images containing tables, forms, or structured data.


@@ -0,0 +1,575 @@
# Structured Data Handling Reference
This document provides detailed information about converting structured data formats (CSV, JSON, XML) to Markdown.
## CSV Files
Convert CSV (Comma-Separated Values) files to Markdown tables.
### Basic CSV Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)
```
### CSV to Markdown Table
CSV files are automatically converted to Markdown table format:
**Input CSV (`data.csv`):**
```csv
Name,Age,City
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
```
**Output Markdown:**
```markdown
| Name | Age | City |
|---------|-----|-------------|
| Alice | 30 | New York |
| Bob | 25 | Los Angeles |
| Charlie | 35 | Chicago |
```
### CSV Conversion Features
**What's preserved:**
- All column headers
- All data rows
- Cell values (text and numbers)
- Column structure
**Formatting:**
- Headers become the table's header row (most Markdown renderers display it in bold)
- Columns are aligned
- Empty cells are preserved
- Special characters are escaped
### Large CSV Files
For large CSV files:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Convert large CSV
result = md.convert("large_dataset.csv")
# Save to file instead of printing
with open("output.md", "w") as f:
    f.write(result.text_content)
```
**Performance considerations:**
- Very large files may take time to process
- Consider previewing first few rows for testing
- Memory usage scales with file size
- Very wide tables may not display well in all Markdown viewers
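Previewing the first few rows can be done by writing a truncated copy of the file and converting that instead. The sketch below uses only the standard library; `write_csv_preview` is a hypothetical helper, and it assumes a UTF-8 file whose first row is the header.

```python
import csv
from itertools import islice


def write_csv_preview(src, dst, rows=20):
    """Copy the header plus the first `rows` data rows of src into dst.

    Returns the number of data rows written.
    """
    count = 0
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        # rows + 1 accounts for the header line
        for row in islice(csv.reader(fin), rows + 1):
            writer.writerow(row)
            count += 1
    return max(count - 1, 0)
```

Converting the preview file first gives a quick check of the table layout before committing to the full dataset.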
### CSV with Special Characters
CSV files containing special characters are handled automatically:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Handles UTF-8, special characters, quotes, etc.
result = md.convert("international_data.csv")
```
### CSV Delimiters
Standard CSV delimiters are supported:
- Comma (`,`) - standard
- Semicolon (`;`) - common in European formats
- Tab (`\t`) - TSV files
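If you are unsure which delimiter a file uses, Python's `csv.Sniffer` can guess it from a sample before you convert. This is a sketch for inspection only, not a description of how MarkItDown detects delimiters; `detect_delimiter` is a hypothetical helper and it assumes UTF-8 input.

```python
import csv


def detect_delimiter(path, candidates=",;\t", sample_size=4096):
    """Guess the delimiter of a CSV-like file from its first few kilobytes.

    Raises csv.Error if no candidate delimiter fits the sample.
    """
    with open(path, newline="", encoding="utf-8") as f:
        sample = f.read(sample_size)
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```

Sniffing is heuristic, so treat the result as a hint and verify it on files with irregular rows.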
### Command-Line CSV Conversion
```bash
# Basic conversion
markitdown data.csv -o data.md
# Multiple CSV files
for file in *.csv; do
  markitdown "$file" -o "${file%.csv}.md"
done
```
## JSON Files
Convert JSON data to readable Markdown format.
### Basic JSON Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.json")
print(result.text_content)
```
### JSON Formatting
JSON is converted to a readable, structured Markdown format:
**Input JSON (`config.json`):**
```json
{
  "name": "MyApp",
  "version": "1.0.0",
  "dependencies": {
    "library1": "^2.0.0",
    "library2": "^3.1.0"
  },
  "features": ["auth", "api", "database"]
}
```
**Output Markdown:**
```markdown
## Configuration
**name:** MyApp
**version:** 1.0.0
### dependencies
- **library1:** ^2.0.0
- **library2:** ^3.1.0
### features
- auth
- api
- database
```
### JSON Array Handling
JSON arrays are converted to lists or tables:
**Array of objects:**
```json
[
  {"id": 1, "name": "Alice", "active": true},
  {"id": 2, "name": "Bob", "active": false}
]
```
**Converted to table:**
```markdown
| id | name | active |
|----|-------|--------|
| 1 | Alice | true |
| 2 | Bob | false |
```
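If you need this table layout regardless of what the converter emits for your MarkItDown version, an array of flat objects can be rendered directly. The following is a standalone sketch, not MarkItDown's implementation; `records_to_markdown_table` is a hypothetical helper that assumes each record is a flat dict.

```python
def records_to_markdown_table(records):
    """Render a list of flat dicts as a Markdown table (columns in first-seen order)."""
    if not records:
        return ""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    def cell(value):
        # Use JSON-style literals for booleans and leave missing values empty
        if isinstance(value, bool):
            text = "true" if value else "false"
        elif value is None:
            text = ""
        else:
            text = str(value)
        return text.replace("|", "\\|")  # Escape pipes so they do not split cells

    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for record in records:
        lines.append("| " + " | ".join(cell(record.get(c)) for c in columns) + " |")
    return "\n".join(lines)
```

Load the array with `json.load` and pass it in. Nested objects would need flattening first.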
### Nested JSON Structures
Nested JSON is converted with appropriate indentation and hierarchy:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Handles deeply nested structures
result = md.convert("complex_config.json")
print(result.text_content)
```
### JSON Lines (JSONL)
For JSON Lines format (one JSON object per line):
```python
from markitdown import MarkItDown
import json
md = MarkItDown()
# Read JSONL file
with open("data.jsonl", "r") as f:
    for line in f:
        if not line.strip():
            continue  # Skip blank lines
        obj = json.loads(line)
        # Write each object to a temporary JSON file
        with open("temp.json", "w") as temp:
            json.dump(obj, temp)
        result = md.convert("temp.json")
        print(result.text_content)
        print("\n---\n")
```
### Large JSON Files
For large JSON files:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Convert large JSON
result = md.convert("large_data.json")
# Save to file
with open("output.md", "w") as f:
    f.write(result.text_content)
```
## XML Files
Convert XML documents to structured Markdown.
### Basic XML Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xml")
print(result.text_content)
```
### XML Structure Preservation
XML is converted to Markdown maintaining hierarchical structure:
**Input XML (`book.xml`):**
```xml
<?xml version="1.0"?>
<book>
  <title>Example Book</title>
  <author>John Doe</author>
  <chapters>
    <chapter id="1">
      <title>Introduction</title>
      <content>Chapter 1 content...</content>
    </chapter>
    <chapter id="2">
      <title>Background</title>
      <content>Chapter 2 content...</content>
    </chapter>
  </chapters>
</book>
```
**Output Markdown:**
```markdown
# book
## title
Example Book
## author
John Doe
## chapters
### chapter (id: 1)
#### title
Introduction
#### content
Chapter 1 content...
### chapter (id: 2)
#### title
Background
#### content
Chapter 2 content...
```
### XML Attributes
XML attributes are preserved in the conversion:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xml")
# Attributes shown as (attr: value) in headings
```
### XML Namespaces
XML namespaces are handled:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Handles xmlns and namespaced elements
result = md.convert("namespaced.xml")
```
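When inspecting namespaced XML before conversion, it helps to know that Python's `xml.etree.ElementTree` exposes namespaced tags in the form `{namespace-uri}local-name`. A minimal sketch of stripping that prefix, with a hypothetical helper `local_name`:

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Return the tag without its '{namespace-uri}' prefix, if any."""
    return tag.rsplit("}", 1)[-1]

sample = (
    '<feed xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<title>Example</title><dc:creator>Jane</dc:creator></feed>"
)
root = ET.fromstring(sample)
# iter() walks the tree in document order, starting with the root element
names = [local_name(element.tag) for element in root.iter()]
print(root.tag)
print(names)
```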
### XML Use Cases
**Configuration files:**
- Convert XML configs to readable format
- Document system configurations
- Compare configuration files
**Data interchange:**
- Convert XML API responses
- Process XML data feeds
- Transform between formats
**Document processing:**
- Convert DocBook to Markdown
- Process SVG descriptions
- Extract structured data
## Structured Data Workflows
### CSV Data Analysis Pipeline
```python
from markitdown import MarkItDown
import pandas as pd
md = MarkItDown()
# Read CSV for analysis
df = pd.read_csv("data.csv")
# Do analysis
summary = df.describe()
# Convert both to Markdown
original = md.convert("data.csv")
# Save summary as CSV then convert
summary.to_csv("summary.csv")
summary_md = md.convert("summary.csv")
print("## Original Data\n")
print(original.text_content)
print("\n## Statistical Summary\n")
print(summary_md.text_content)
```
### JSON API Documentation
```python
from markitdown import MarkItDown
import requests
import json
md = MarkItDown()
# Fetch JSON from API
response = requests.get("https://api.example.com/data", timeout=30)
response.raise_for_status()  # stop early on HTTP errors
data = response.json()
# Save as JSON
with open("api_response.json", "w") as f:
    json.dump(data, f, indent=2)
# Convert to Markdown
result = md.convert("api_response.json")
# Create documentation
doc = f"""# API Response Documentation
## Endpoint
GET https://api.example.com/data
## Response
{result.text_content}
"""
with open("api_docs.md", "w") as f:
    f.write(doc)
```
### XML to Markdown Documentation
```python
from markitdown import MarkItDown
md = MarkItDown()
# Convert XML documentation
xml_files = ["config.xml", "schema.xml", "data.xml"]
for xml_file in xml_files:
    result = md.convert(xml_file)
    output_name = xml_file.replace('.xml', '.md')
    with open(f"docs/{output_name}", "w") as f:
        f.write(result.text_content)
```
### Multi-Format Data Processing
```python
from markitdown import MarkItDown
import os
md = MarkItDown()
def convert_structured_data(directory):
    """Convert all structured data files in directory."""
    extensions = {'.csv', '.json', '.xml'}
    os.makedirs("markdown", exist_ok=True)  # ensure the output directory exists
    for filename in os.listdir(directory):
        stem, ext = os.path.splitext(filename)
        if ext.lower() in extensions:
            input_path = os.path.join(directory, filename)
            result = md.convert(input_path)
            # Save Markdown
            output_name = stem + '.md'
            output_path = os.path.join("markdown", output_name)
            with open(output_path, 'w') as f:
                f.write(result.text_content)
            print(f"Converted: {filename} → {output_name}")
# Process all structured data
convert_structured_data("data")
```
### CSV to JSON to Markdown
```python
import pandas as pd
from markitdown import MarkItDown
import json
md = MarkItDown()
# Read CSV
df = pd.read_csv("data.csv")
# Convert to JSON
json_data = df.to_dict(orient='records')
with open("temp.json", "w") as f:
    json.dump(json_data, f, indent=2)
# Convert JSON to Markdown
result = md.convert("temp.json")
print(result.text_content)
```
### Database Export to Markdown
```python
from markitdown import MarkItDown
import sqlite3
import csv
md = MarkItDown()
# Export database query to CSV
conn = sqlite3.connect("database.db")
cursor = conn.execute("SELECT * FROM users")
with open("users.csv", "w", newline='') as f:
    writer = csv.writer(f)
    # Header row taken from the cursor's column names
    writer.writerow([description[0] for description in cursor.description])
    writer.writerows(cursor.fetchall())
conn.close()
# Convert to Markdown
result = md.convert("users.csv")
print(result.text_content)
```
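For small result sets, the intermediate CSV file can be skipped by rendering the rows as a Markdown table directly. The following is a standard-library sketch rather than a MarkItDown feature; `rows_to_markdown_table` is a hypothetical helper, and the in-memory `users` table stands in for a real database.

```python
import sqlite3

def rows_to_markdown_table(headers, rows):
    """Render headers and rows as a pipe-delimited Markdown table."""
    def format_row(cells):
        # Escape pipes so cell values cannot break the table layout
        return "| " + " | ".join(str(cell).replace("|", "\\|") for cell in cells) + " |"
    lines = [format_row(headers), "| " + " | ".join("---" for _ in headers) + " |"]
    lines.extend(format_row(row) for row in rows)
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
cursor = conn.execute("SELECT id, name FROM users ORDER BY id")
headers = [description[0] for description in cursor.description]
table = rows_to_markdown_table(headers, cursor.fetchall())
conn.close()
print(table)
```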
## Error Handling
### CSV Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("data.csv")
    print(result.text_content)
except FileNotFoundError:
    print("CSV file not found")
except Exception as e:
    print(f"CSV conversion error: {e}")
    # Common issues: encoding problems, malformed CSV, delimiter issues
```
### JSON Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("data.json")
    print(result.text_content)
except Exception as e:
    print(f"JSON conversion error: {e}")
    # Common issues: invalid JSON syntax, encoding issues
```
### XML Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("data.xml")
    print(result.text_content)
except Exception as e:
    print(f"XML conversion error: {e}")
    # Common issues: malformed XML, encoding problems, namespace issues
```
## Best Practices
### CSV Processing
- Check delimiter before conversion
- Verify encoding (UTF-8 recommended)
- Handle large files with streaming if needed
- Preview output for very wide tables
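Checking the delimiter can be done with the standard library's `csv.Sniffer` before handing a file to the converter. This is a minimal sketch; the helper name `detect_delimiter` and the candidate set are assumptions, and sniffing is a heuristic that can fail on unusual files.

```python
import csv

def detect_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter from a text sample; fall back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return ","

sample = "id;name;active\n1;Alice;true\n2;Bob;false\n"
delimiter = detect_delimiter(sample)
# Parse with the detected delimiter to confirm the columns split as expected
rows = list(csv.reader(sample.splitlines(), delimiter=delimiter))
print(repr(delimiter), rows[0])
```

In practice the sample would be the first few kilobytes of the file, read with the encoding you expect it to use.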
### JSON Processing
- Validate JSON before conversion
- Consider pretty-printing complex structures
- Avoid circular references when serializing in-memory objects to JSON (a JSON file itself cannot contain them)
- Be aware of large array performance
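Validation before conversion can be as simple as attempting to parse the text and reporting where parsing failed. A minimal sketch with the standard `json` module; `check_json` is a hypothetical helper name.

```python
import json

def check_json(text):
    """Return (True, None) for valid JSON, else (False, description)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return True, None

ok, problem = check_json('{"id": 1, "tags": ["a", "b"]}')
bad_ok, bad_problem = check_json('{"id": 1,}')  # trailing comma is invalid JSON
print(ok, problem)
print(bad_ok, bad_problem)
```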
### XML Processing
- Validate XML structure first
- Handle namespaces consistently
- Consider XPath for selective extraction
- Be mindful of very deep nesting
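Selective extraction can be sketched with `xml.etree.ElementTree`, which supports a limited subset of XPath (child paths and simple attribute predicates). The sample document below mirrors the `book.xml` example; full XPath would need a third-party library.

```python
import xml.etree.ElementTree as ET

book_xml = """<book>
  <title>Example Book</title>
  <chapters>
    <chapter id="1"><title>Introduction</title></chapter>
    <chapter id="2"><title>Background</title></chapter>
  </chapters>
</book>"""

root = ET.fromstring(book_xml)
# Select every chapter title, then a single chapter by attribute value
chapter_titles = [el.text for el in root.findall("./chapters/chapter/title")]
second = root.find("./chapters/chapter[@id='2']/title").text
print(chapter_titles, second)
```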
### Data Quality
- Clean data before conversion when possible
- Handle missing values appropriately
- Verify special character handling
- Test with representative samples
### Performance
- Process large files in batches
- Use streaming for very large datasets
- Monitor memory usage
- Cache converted results when appropriate
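One way to cache converted results is to key them by a hash of the input file's contents, so an unchanged file is never converted twice. The sketch below is an assumption about how such a cache might look, not a MarkItDown feature; `cached_convert` is hypothetical, and `convert` is any callable that takes a path and returns Markdown text.

```python
import hashlib
import os
import tempfile

def cached_convert(path, convert, cache_dir):
    """Return Markdown for path, reusing a cached copy keyed by content hash."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    cache_path = os.path.join(cache_dir, digest + ".md")
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return f.read()
    markdown = convert(path)
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(markdown)
    return markdown

# Demonstration with a stand-in converter that records its calls
calls = []
def fake_convert(path):
    calls.append(path)
    return "| a | b |"

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "data.csv")
    with open(source, "w", encoding="utf-8") as f:
        f.write("a,b\n")
    cache = os.path.join(tmp, "cache")
    first = cached_convert(source, fake_convert, cache)
    second = cached_convert(source, fake_convert, cache)  # served from cache
print(first == second, len(calls))
```

With MarkItDown, the `convert` argument could be `lambda p: md.convert(p).text_content`.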

View File

@@ -0,0 +1,478 @@
# Web Content Extraction Reference
This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.
## HTML Conversion
Convert HTML files and web pages to clean Markdown format.
### Basic HTML Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("webpage.html")
print(result.text_content)
```
### HTML Processing Features
**What's preserved:**
- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
- Paragraphs and text formatting
- Links (`<a>` → `[text](url)`)
- Lists (ordered and unordered)
- Tables → Markdown tables
- Code blocks and inline code
- Emphasis (bold, italic)
**What's removed:**
- Scripts and styles
- Navigation elements
- Advertising content
- Boilerplate markup
- HTML comments
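To make the mappings above concrete, here is a toy converter built on the standard `html.parser` module. It handles only headings and links and drops `<script>` and `<style>` content; it is an illustration of the idea, not how MarkItDown converts HTML. The class name `TinyMarkdownParser` is hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")

class TinyMarkdownParser(HTMLParser):
    """Toy converter: headings and links only; script/style text is dropped."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in HEADINGS:
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADINGS or tag == "p":
            self.parts.append("\n")
        elif tag == "a":
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

html = (
    "<h1>Title</h1><script>var x = 1;</script>"
    '<p>See <a href="https://example.com">the site</a>.</p>'
)
parser = TinyMarkdownParser()
parser.feed(html)
markdown = "".join(parser.parts)
print(markdown)
```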
### HTML from URLs
Convert web pages directly from URLs:
```python
from markitdown import MarkItDown
import requests
md = MarkItDown()
# Fetch and convert web page
response = requests.get("https://example.com/article", timeout=30)
response.raise_for_status()  # stop early on HTTP errors
with open("temp.html", "wb") as f:
    f.write(response.content)
result = md.convert("temp.html")
print(result.text_content)
```
### Clean Web Article Extraction
For extracting main content from web articles:
```python
from markitdown import MarkItDown
import requests
from readability import Document # pip install readability-lxml
md = MarkItDown()
# Fetch page
url = "https://example.com/article"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on HTTP errors
# Extract main content
doc = Document(response.content)
html_content = doc.summary()
# Save and convert
with open("article.html", "w", encoding="utf-8") as f:
    f.write(html_content)
result = md.convert("article.html")
print(result.text_content)
```
### HTML with Images
HTML files containing images can be enhanced with LLM descriptions:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("page_with_images.html")
```
## YouTube Transcripts
Extract video transcripts from YouTube videos.
### Basic YouTube Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```
### YouTube Installation
```bash
pip install 'markitdown[youtube-transcription]'
```
This installs the `youtube-transcript-api` dependency.
### YouTube URL Formats
MarkItDown supports various YouTube URL formats:
- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://m.youtube.com/watch?v=VIDEO_ID`
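When storing or de-duplicating transcripts, it can help to normalize these URL shapes to a bare video ID first. The sketch below covers only the four formats listed above and uses the standard `urllib.parse` module; `extract_video_id` is a hypothetical helper, not a MarkItDown function.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the video ID from a common YouTube URL shape, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        if parsed.path.startswith("/embed/"):
            return parsed.path.split("/")[2] or None
    return None

urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID",
    "https://youtu.be/VIDEO_ID",
    "https://www.youtube.com/embed/VIDEO_ID",
    "https://m.youtube.com/watch?v=VIDEO_ID",
]
video_ids = [extract_video_id(url) for url in urls]
print(video_ids)
```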
### YouTube Transcript Features
**What's included:**
- Full video transcript text
- Timestamps (optional, depending on availability)
- Video metadata (title, description)
- Captions in available languages
**Transcript languages:**
```python
from markitdown import MarkItDown
md = MarkItDown()
# No language is selected in this call, so the converter's default applies
# Language codes look like 'en', 'es', 'fr', 'de', etc.
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
```
### YouTube Playlist Processing
Process multiple videos, such as the entries of a playlist, from a list of video IDs:
```python
from markitdown import MarkItDown
md = MarkItDown()
video_ids = [
    "VIDEO_ID_1",
    "VIDEO_ID_2",
    "VIDEO_ID_3"
]
transcripts = []
for vid_id in video_ids:
    url = f"https://youtube.com/watch?v={vid_id}"
    result = md.convert(url)
    transcripts.append({
        'video_id': vid_id,
        'transcript': result.text_content
    })
```
### YouTube Use Cases
**Content Analysis:**
- Analyze video content without watching
- Extract key information from tutorials
- Build searchable transcript databases
**Research:**
- Process interview transcripts
- Extract lecture content
- Analyze presentation content
**Accessibility:**
- Generate text versions of video content
- Create searchable video archives
### YouTube Limitations
- Requires videos to have captions/transcripts available
- Auto-generated captions may have transcription errors
- Some videos may disable transcript access
- Rate limiting may apply for bulk processing
## EPUB Books
Convert EPUB e-books to Markdown format.
### Basic EPUB Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("book.epub")
print(result.text_content)
```
### EPUB Processing Features
**What's extracted:**
- Book text content
- Chapter structure
- Headings and formatting
- Tables of contents
- Footnotes and references
**What's preserved:**
- Heading hierarchy
- Text emphasis (bold, italic)
- Links and references
- Lists and tables
### EPUB with Images
EPUB files often contain images (covers, diagrams, illustrations):
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("illustrated_book.epub")
```
### EPUB Use Cases
**Research:**
- Convert textbooks to searchable format
- Extract content for analysis
- Build digital libraries
**Content Processing:**
- Prepare books for LLM training data
- Convert to different formats
- Create summaries and extracts
**Accessibility:**
- Convert to more accessible formats
- Extract text for screen readers
- Process for text-to-speech
## RSS Feeds
Process RSS feeds to extract article content.
### Basic RSS Processing
```python
from markitdown import MarkItDown
import feedparser
md = MarkItDown()
# Parse RSS feed
feed = feedparser.parse("https://example.com/feed.xml")
# Convert each entry
for entry in feed.entries:
    # Save entry HTML (some feeds omit the summary field)
    with open("temp.html", "w", encoding="utf-8") as f:
        f.write(entry.get("summary", ""))
    result = md.convert("temp.html")
    print(f"## {entry.title}\n\n{result.text_content}\n\n")
```
## Combined Web Content Workflows
### Web Scraping Pipeline
```python
from markitdown import MarkItDown
import requests
from bs4 import BeautifulSoup
md = MarkItDown()
def scrape_and_convert(url):
    """Scrape webpage and convert to Markdown."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract main content
    main_content = soup.find('article') or soup.find('main')
    if main_content:
        # Save HTML
        with open("temp.html", "w", encoding="utf-8") as f:
            f.write(str(main_content))
        # Convert to Markdown
        result = md.convert("temp.html")
        return result.text_content
    return None
# Use it
markdown = scrape_and_convert("https://example.com/article")
print(markdown)
```
### YouTube Learning Content Extraction
```python
from markitdown import MarkItDown
md = MarkItDown()
# Course videos
course_videos = [
    ("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
    ("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
    ("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
]
course_content = []
for url, title in course_videos:
    result = md.convert(url)
    course_content.append(f"# {title}\n\n{result.text_content}")
# Combine into course document
full_course = "\n\n---\n\n".join(course_content)
with open("course_transcript.md", "w") as f:
    f.write(full_course)
```
### Documentation Scraping
```python
from markitdown import MarkItDown
import requests
from urllib.parse import urljoin, urlparse
md = MarkItDown()
def scrape_documentation(base_url, page_urls):
    """Scrape multiple documentation pages."""
    docs = []
    for page_url in page_urls:
        full_url = urljoin(base_url, page_url)
        # Fetch page
        response = requests.get(full_url)
        with open("temp.html", "wb") as f:
            f.write(response.content)
        # Convert
        result = md.convert("temp.html")
        docs.append({
            'url': full_url,
            'content': result.text_content
        })
    return docs
# Example usage
base = "https://docs.example.com/"
pages = ["intro.html", "getting-started.html", "api.html"]
documentation = scrape_documentation(base, pages)
```
### EPUB Library Processing
```python
from markitdown import MarkItDown
import os
md = MarkItDown()
def process_epub_library(library_path, output_path):
    """Convert all EPUB books in a directory."""
    os.makedirs(output_path, exist_ok=True)  # ensure the output directory exists
    for filename in os.listdir(library_path):
        if filename.endswith('.epub'):
            epub_path = os.path.join(library_path, filename)
            try:
                result = md.convert(epub_path)
                # Save markdown
                output_file = filename.replace('.epub', '.md')
                output_full = os.path.join(output_path, output_file)
                with open(output_full, 'w') as f:
                    f.write(result.text_content)
                print(f"Converted: {filename}")
            except Exception as e:
                print(f"Failed to convert {filename}: {e}")
# Process library
process_epub_library("books", "markdown_books")
```
## Error Handling
### HTML Conversion Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("webpage.html")
    print(result.text_content)
except FileNotFoundError:
    print("HTML file not found")
except Exception as e:
    print(f"Conversion error: {e}")
```
### YouTube Transcript Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
    print(result.text_content)
except Exception as e:
    print(f"Failed to get transcript: {e}")
    # Common issues: No transcript available, video unavailable, network error
```
### EPUB Conversion Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("book.epub")
    print(result.text_content)
except Exception as e:
    print(f"EPUB processing error: {e}")
    # Common issues: Corrupted file, unsupported DRM, invalid format
```
## Best Practices
### HTML Processing
- Clean HTML before conversion for better results
- Use readability libraries to extract main content
- Handle different encodings appropriately
- Remove unnecessary markup
### YouTube Processing
- Check transcript availability before batch processing
- Handle API rate limits gracefully
- Store transcripts to avoid re-fetching
- Respect YouTube's terms of service
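Handling rate limits gracefully usually means retrying with increasing delays rather than failing on the first error. A minimal, library-agnostic sketch of exponential backoff follows; `with_retries` is a hypothetical helper, and `fetch` stands for any zero-argument callable such as a wrapped `md.convert` call.

```python
import time

def with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))

# Demonstration with a stand-in that fails twice before succeeding
state = {"calls": 0}
def flaky_fetch():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "transcript text"

delays = []
# Record the delays instead of sleeping so the demonstration runs instantly
transcript = with_retries(flaky_fetch, sleep=delays.append)
print(transcript, delays)
```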
### EPUB Processing
- DRM-protected EPUBs cannot be processed
- Large EPUBs may require more memory
- Some formatting may not translate perfectly
- Test with representative samples first
### Web Scraping Ethics
- Respect robots.txt
- Add delays between requests
- Identify your scraper in User-Agent
- Cache results to minimize requests
- Follow website terms of service
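The first three points can be checked in code with the standard `urllib.robotparser` module. The sketch below parses an inline robots.txt so it runs without network access; in real use the rules would be fetched from the target site. The user-agent string and URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-docs-bot/0.1 (contact@example.com)"  # hypothetical identity

robots_txt = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask before fetching, and honor the advertised delay between requests
allowed = parser.can_fetch(USER_AGENT, "https://example.com/articles/intro.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/report.html")
delay = parser.crawl_delay(USER_AGENT)
print(allowed, blocked, delay)
```

The same `USER_AGENT` value would then be sent in the `User-Agent` header of each request, with a pause of at least `delay` seconds between requests.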