# Advanced Integrations Reference

This document provides detailed information about advanced MarkItDown features, including Azure Document Intelligence integration, LLM-powered image descriptions, and the plugin system.

## Azure Document Intelligence Integration

Azure Document Intelligence (formerly Form Recognizer) provides more accurate PDF processing than the built-in extractor, with advanced table extraction and layout analysis.

### Setup

**Prerequisites:**
1. Azure subscription
2. Document Intelligence resource created in Azure
3. Endpoint URL and API key

**Create Azure Resource:**
```bash
# Using Azure CLI
az cognitiveservices account create \
  --name my-doc-intelligence \
  --resource-group my-resource-group \
  --kind FormRecognizer \
  --sku F0 \
  --location eastus
```

### Basic Usage

```python
from markitdown import MarkItDown

# Note: depending on the installed MarkItDown release, the key may need to be
# passed as docintel_credential=AzureKeyCredential("YOUR-API-KEY") rather than
# docintel_key; check the constructor of your version.
md = MarkItDown(
    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)

result = md.convert("complex_document.pdf")
print(result.text_content)
```

### Configuration from Environment Variables

```python
import os
from markitdown import MarkItDown

# Set environment variables (in practice, set these in your shell or
# deployment config rather than in code)
os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'

# Read the credentials from the environment instead of hard-coding them
md = MarkItDown(
    docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
    docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
)

result = md.convert("document.pdf")
```
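
Because `os.getenv` returns `None` for an unset variable, a missing setting would only surface later, during conversion. A minimal sketch of failing early instead, reusing the variable names from the example above; the helper `docintel_settings_from_env` is hypothetical and not part of MarkItDown:

```python
import os


def docintel_settings_from_env(env=None):
    """Return MarkItDown keyword arguments read from the environment.

    Hypothetical helper: raises as soon as a variable is missing, so the
    problem is not deferred to the first convert() call.
    """
    env = os.environ if env is None else env
    names = {
        "docintel_endpoint": "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
        "docintel_key": "AZURE_DOCUMENT_INTELLIGENCE_KEY",
    }
    # Treat empty strings the same as unset variables
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[var] for kwarg, var in names.items()}
```

The result can be unpacked into the constructor, for example `MarkItDown(**docintel_settings_from_env())`.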

### When to Use Azure Document Intelligence

**Use for:**
- Complex PDFs with sophisticated tables
- Multi-column layouts
- Forms and structured documents
- Scanned documents requiring OCR
- PDFs with mixed content types
- Documents with intricate formatting

**Benefits over standard extraction:**
- **Superior table extraction** - Better handling of merged cells, complex layouts
- **Layout analysis** - Understands document structure (headers, footers, columns)
- **Form fields** - Extracts key-value pairs from forms
- **Reading order** - Maintains correct text flow in complex layouts
- **OCR quality** - High-quality text extraction from scanned documents

### Comparison Example

**Standard extraction:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("complex_table.pdf")
# May struggle with complex tables
```

**Azure Document Intelligence:**
```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)
result = md.convert("complex_table.pdf")
# Better table reconstruction and layout understanding
```

### Cost Considerations

Azure Document Intelligence is a paid service:
- **Free tier**: 500 pages per month
- **Paid tiers**: Pay per page processed
- Monitor usage to control costs
- Use standard extraction for simple documents
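
One way to monitor usage is to count pages locally before sending a document to the service. The sketch below assumes the caller already knows each document's page count and uses the 500-page free-tier figure listed above as a default; `PageBudget` is a hypothetical helper, and it does not reset itself at the start of a month.

```python
class PageBudget:
    """Track pages sent to a metered service against a monthly allowance."""

    def __init__(self, monthly_limit=500):
        self.monthly_limit = monthly_limit
        self.used = 0

    def can_afford(self, pages):
        """Return True if `pages` more pages still fit in the allowance."""
        return self.used + pages <= self.monthly_limit

    def record(self, pages):
        """Add `pages` to the running total, refusing to exceed the limit."""
        if not self.can_afford(pages):
            raise ValueError("page budget exceeded")
        self.used += pages
```

A caller could check `budget.can_afford(n)` and fall back to standard extraction when it returns `False`.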

### Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Document Intelligence error: {e}")
    # Common issues: authentication, quota exceeded, unsupported file
```
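
Some of these failures, such as a temporarily exceeded quota or a network hiccup, may succeed on a second attempt. A minimal sketch of retrying with exponential backoff follows; `convert_with_retry` is a hypothetical wrapper that takes any callable, so it makes no assumptions about which exception types MarkItDown raises.

```python
import time


def convert_with_retry(convert, path, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call convert(path), retrying with exponential backoff on any exception.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(attempts):
        try:
            return convert(path)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
```

For example, `convert_with_retry(md.convert, "document.pdf")`. Retrying does not help with authentication errors or unsupported files, so a production version would likely filter on exception type.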

## LLM-Powered Image Descriptions

Generate detailed, contextual descriptions for images using large language models.

### Setup with OpenAI

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

result = md.convert("image.jpg")
print(result.text_content)
```

### Supported Use Cases

**Images in documents:**
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# PowerPoint with images
result = md.convert("presentation.pptx")

# Word documents with images
result = md.convert("report.docx")

# Standalone images
result = md.convert("diagram.png")
```

### Custom Prompts

Customize the LLM prompt for specific needs:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
)

# For scientific figures
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
)
```
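
When several prompts are in use, keeping them in one table avoids repeating the constructor call. The sketch below copies the prompt strings from the examples above; the labels (`"diagram"`, `"chart"`, and so on) and the helper `llm_kwargs` are hypothetical names chosen for illustration.

```python
# Prompt strings from the examples above, keyed by a caller-chosen label
PROMPTS = {
    "diagram": "Analyze this diagram and explain all components, connections, and relationships in detail",
    "chart": "Describe this chart, including the type, axes, data points, trends, and key insights",
    "ui": "Describe this user interface screenshot, listing all UI elements, their layout, and functionality",
    "figure": "Describe this scientific figure in detail, including methodology, results shown, and significance",
}


def llm_kwargs(client, kind=None, model="gpt-4o"):
    """Build MarkItDown keyword arguments for a given image kind.

    Unknown or missing kinds omit llm_prompt, leaving the choice of
    prompt to the library.
    """
    kwargs = {"llm_client": client, "llm_model": model}
    if kind in PROMPTS:
        kwargs["llm_prompt"] = PROMPTS[kind]
    return kwargs
```

Usage would look like `md = MarkItDown(**llm_kwargs(client, "chart"))`.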

### Model Selection

**GPT-4o (Recommended):**
- Best vision capabilities
- High-quality descriptions
- Good at understanding context
- Higher cost per image

**GPT-4o-mini:**
- Lower cost alternative
- Good for simpler images
- Faster processing
- May miss subtle details

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# High quality (more expensive)
md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Budget option (less expensive)
md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
```

### Configuration from Environment

```python
import os
from markitdown import MarkItDown
from openai import OpenAI

# Set API key in environment (in practice, set this in your shell or
# deployment config rather than in code)
os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'

client = OpenAI()  # Reads OPENAI_API_KEY from the environment
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```

### Alternative LLM Providers

**Anthropic Claude:**
```python
from markitdown import MarkItDown
from anthropic import Anthropic

# Note: Check current compatibility with MarkItDown
client = Anthropic(api_key="YOUR-API-KEY")
# May require adapter for MarkItDown compatibility
```

**Azure OpenAI:**
```python
from markitdown import MarkItDown
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="YOUR-AZURE-KEY",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
)

# With Azure OpenAI, llm_model must match the name of your model deployment
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```

### Cost Management

**Strategies to reduce LLM costs:**

1. **Selective processing:**
   ```python
   from markitdown import MarkItDown
   from openai import OpenAI

   client = OpenAI()

   # Only use LLM for important documents
   # (is_important_document and file are placeholders for your own logic)
   if is_important_document(file):
       md = MarkItDown(llm_client=client, llm_model="gpt-4o")
   else:
       md = MarkItDown()  # Standard processing

   result = md.convert(file)
   ```

2. **Image filtering:**
   ```python
   # Pre-process to identify images that need descriptions
   # Only use LLM for complex/important images
   ```

3. **Batch processing:**
   ```python
   # Process multiple images in batches
   # Monitor costs and set limits
   ```

4. **Model selection:**
   ```python
   # Use gpt-4o-mini for simple images
   # Reserve gpt-4o for complex visualizations
   ```
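
Strategies 2 and 4 can be combined in a small routing function. The sketch below uses file size as a crude stand-in for image complexity, which is an assumption made purely for illustration; the thresholds are arbitrary and `choose_model` is a hypothetical helper.

```python
import os


def choose_model(path, skip_below=10_000, mini_below=500_000):
    """Pick an LLM for an image based on its size in bytes.

    Returns None for tiny files (likely icons or spacers, so no LLM call),
    "gpt-4o-mini" for mid-sized files, and "gpt-4o" for large ones.
    """
    size = os.path.getsize(path)
    if size < skip_below:
        return None
    if size < mini_below:
        return "gpt-4o-mini"
    return "gpt-4o"
```

When the function returns `None`, the caller would use a plain `MarkItDown()`; otherwise it would pass the returned name as `llm_model`.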

### Performance Considerations

**LLM processing adds latency:**
- Each image requires an API call
- Processing time: 1-5 seconds per image
- Network dependent
- Consider parallel processing for multiple images

**Batch optimization:**
```python
from markitdown import MarkItDown
from openai import OpenAI
import concurrent.futures

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

def process_image(image_path):
    return md.convert(image_path)

# Process multiple images in parallel
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_image, images))
```
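
With `executor.map`, the first exception is re-raised when the results are consumed, so one failing image discards the results for the rest. A sketch of a variant that collects successes and failures separately is shown below; `convert_all` is a hypothetical helper that accepts any callable, for example the `process_image` function above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_all(convert, paths, max_workers=3):
    """Run convert(path) for every path in parallel.

    Returns (results, errors): two dicts keyed by path, so a single
    failure does not hide the conversions that succeeded.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it was submitted for
        futures = {executor.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                errors[path] = exc
    return results, errors
```

For example, `results, errors = convert_all(process_image, images)`, after which `errors` can be logged or retried.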

## Combined Advanced Features

### Azure Document Intelligence + LLM Descriptions

Combine both for maximum quality:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-AZURE-ENDPOINT",
    docintel_key="YOUR-AZURE-KEY"
)

# Best possible PDF conversion with image descriptions
result = md.convert("complex_report.pdf")
```

**Use cases:**
- Research papers with figures
- Business reports with charts
- Technical documentation with diagrams
- Presentations with visual data

### Smart Document Processing Pipeline

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

def smart_convert(file_path):
    """Choose a processing method based on the file extension."""
    ext = os.path.splitext(file_path)[1].lower()

    # PDFs with complex tables: Use Azure
    if ext == '.pdf':
        md = MarkItDown(
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )

    # Documents/presentations with images: Use LLM
    elif ext in ['.pptx', '.docx']:
        # Create the client only in this branch, so other file types
        # do not require OPENAI_API_KEY to be set
        md = MarkItDown(
            llm_client=OpenAI(),
            llm_model="gpt-4o"
        )

    # Simple formats: Standard processing
    else:
        md = MarkItDown()

    return md.convert(file_path)

# Use it
result = smart_convert("document.pdf")
```

## Plugin System

MarkItDown supports custom plugins for extending functionality.

### Plugin Architecture

Plugins are disabled by default for security:

```python
from markitdown import MarkItDown

# Enable plugins
md = MarkItDown(enable_plugins=True)
```

### Creating Custom Plugins

**Plugin structure:**
```python
from markitdown import DocumentConverter, DocumentConverterResult


class CustomConverter(DocumentConverter):
    """Custom converter plugin for MarkItDown.

    Sketch based on the stream-based converter interface of
    markitdown 0.1.x; verify against your installed version.
    """

    def accepts(self, file_stream, stream_info, **kwargs):
        """Check if this plugin can handle the file."""
        return (stream_info.extension or "").lower() == '.custom'

    def convert(self, file_stream, stream_info, **kwargs):
        """Convert file to Markdown."""
        # Your conversion logic here (read from file_stream)
        return DocumentConverterResult(markdown='# Converted Content\n\n...')
```

### Plugin Registration

```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)

# Register the custom converter on this instance
# (register_converter is the method name in markitdown 0.1.x;
# verify against your installed version)
md.register_converter(CustomConverter())

# Use normally
result = md.convert("file.custom")
```

### Plugin Use Cases

**Custom formats:**
- Proprietary document formats
- Specialized scientific data formats
- Legacy file formats

**Enhanced processing:**
- Custom OCR engines
- Specialized table extraction
- Domain-specific parsing

**Integration:**
- Enterprise document systems
- Custom databases
- Specialized APIs

### Plugin Security

**Important security considerations:**
- Plugins run with the same privileges as the host Python process
- Only enable plugins you trust
- Review plugin code before use
- Disable plugins in production unless required

## Error Handling for Advanced Features

```python
import os

from markitdown import MarkItDown
from openai import OpenAI

def robust_convert(file_path):
    """Convert with fallback strategies."""
    try:
        # Try with all advanced features
        client = OpenAI()
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o",
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )
        return md.convert(file_path)

    except Exception as full_error:
        # Either Azure or the LLM may have caused this failure
        print(f"Full-featured conversion failed: {full_error}")

        try:
            # Fallback: LLM only
            client = OpenAI()
            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
            return md.convert(file_path)

        except Exception as llm_error:
            print(f"LLM failed: {llm_error}")

            # Final fallback: Standard processing
            md = MarkItDown()
            return md.convert(file_path)

# Use it
result = robust_convert("document.pdf")
```
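
The nested `try` blocks above grow deeper with every extra fallback. A flatter sketch of the same idea takes an ordered list of strategies and returns the first result that succeeds; `convert_with_fallbacks` is a hypothetical helper, and each strategy is any callable that accepts a file path.

```python
def convert_with_fallbacks(strategies, path):
    """Try each (name, convert) pair in order.

    Returns (name, result) for the first strategy that succeeds and
    raises RuntimeError listing every failure if none does.
    """
    errors = []
    for name, convert in strategies:
        try:
            return name, convert(path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

The strategies could be, for instance, the `convert` methods of three differently configured `MarkItDown` instances, ordered from most to least capable. Returning the strategy name makes it easy to log which path produced the output.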

## Best Practices

### Azure Document Intelligence
- Use for complex PDFs only (cost optimization)
- Monitor usage and costs
- Store credentials securely
- Handle quota limits gracefully
- Fall back to standard processing if needed

### LLM Integration
- Use appropriate models for task complexity
- Customize prompts for specific use cases
- Monitor API costs
- Implement rate limiting
- Cache results when possible
- Handle API errors gracefully
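
Caching can be sketched with a content hash as the key, so an unchanged file is never sent to the API twice. In the example below, `cached_convert` is a hypothetical helper and `convert` is assumed to be any callable that returns the Markdown text as a string, for example `lambda p: md.convert(p).text_content`.

```python
import hashlib
import json
from pathlib import Path


def cached_convert(convert, path, cache_dir):
    """Return convert(path), reusing a cached result for identical file content."""
    # Key the cache on the file's bytes, not its name, so renamed copies hit too
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["text"]
    text = convert(path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"text": text}), encoding="utf-8")
    return text
```

Note that the key ignores the model and prompt, so a real cache would likely include them in the hash as well.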

### Combined Features
- Test cost/quality tradeoffs
- Use selectively for important documents
- Implement intelligent routing
- Monitor performance and costs
- Have fallback strategies

### Security
- Store API keys securely (environment variables, secrets manager)
- Never commit credentials to code
- Disable plugins unless required
- Validate all inputs
- Use least privilege access