Advanced Integrations Reference

This document describes advanced MarkItDown features: Azure Document Intelligence integration, LLM-powered image descriptions, and the plugin system.

Azure Document Intelligence Integration

Azure Document Intelligence (formerly Form Recognizer) can improve PDF conversion over standard extraction by adding advanced table extraction and layout analysis.

Setup

Prerequisites:

  1. Azure subscription
  2. Document Intelligence resource created in Azure
  3. Endpoint URL and API key

Create Azure Resource:

# Using Azure CLI
az cognitiveservices account create \
  --name my-doc-intelligence \
  --resource-group my-resource-group \
  --kind FormRecognizer \
  --sku F0 \
  --location eastus

Basic Usage

from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)

result = md.convert("complex_document.pdf")
print(result.text_content)

Configuration from Environment Variables

import os
from markitdown import MarkItDown

# For illustration only: in practice, set these in your shell or a
# secrets manager rather than in code
os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'

# Read credentials from the environment instead of hard-coding them
md = MarkItDown(
    docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
    docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
)

result = md.convert("document.pdf")
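When credentials come from the environment, it can help to fail early with a clear message if one is missing, rather than passing `None` to the constructor. The following is a minimal sketch: `load_docintel_config` is a hypothetical helper (not part of MarkItDown), and the variable and keyword names are simply the ones used in this section.

```python
import os

# Variable names as used in this section (adjust to your own setup).
REQUIRED_VARS = (
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY",
)

def load_docintel_config(env=None):
    """Return constructor kwargs, or raise if a variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "docintel_endpoint": env[REQUIRED_VARS[0]],
        "docintel_key": env[REQUIRED_VARS[1]],
    }

# Intended use (assumes the keyword arguments shown above):
# md = MarkItDown(**load_docintel_config())
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.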

When to Use Azure Document Intelligence

Use for:

  • Complex PDFs with sophisticated tables
  • Multi-column layouts
  • Forms and structured documents
  • Scanned documents requiring OCR
  • PDFs with mixed content types
  • Documents with intricate formatting

Benefits over standard extraction:

  • Superior table extraction - Better handling of merged cells, complex layouts
  • Layout analysis - Understands document structure (headers, footers, columns)
  • Form fields - Extracts key-value pairs from forms
  • Reading order - Maintains correct text flow in complex layouts
  • OCR quality - High-quality text extraction from scanned documents

Comparison Example

Standard extraction:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("complex_table.pdf")
# May struggle with complex tables

Azure Document Intelligence:

from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)
result = md.convert("complex_table.pdf")
# Better table reconstruction and layout understanding

Cost Considerations

Azure Document Intelligence is billed per page, with a limited free tier:

  • Free tier: 500 pages per month
  • Paid tiers: Pay per page processed
  • Monitor usage to control costs
  • Use standard extraction for simple documents
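One way to act on the points above is to keep a local page counter and fall back to standard extraction once the allowance is used up. This is only a sketch: `PageBudget` is a hypothetical class, the counter is local to your process and does not reflect Azure's own metering, and the page count of each document is assumed to come from elsewhere.

```python
class PageBudget:
    """Track pages sent to a metered service against a monthly allowance."""

    def __init__(self, monthly_limit=500):
        self.monthly_limit = monthly_limit
        self.used = 0

    def remaining(self):
        """Pages still available in the current allowance."""
        return max(self.monthly_limit - self.used, 0)

    def try_spend(self, pages):
        """Reserve pages if they fit; return False if the budget would be exceeded."""
        if pages < 0:
            raise ValueError("pages must be non-negative")
        if self.used + pages > self.monthly_limit:
            return False
        self.used += pages
        return True

budget = PageBudget(monthly_limit=500)
use_azure = budget.try_spend(120)  # reserve 120 pages for a document
```

When `try_spend` returns `False`, the caller can route the document to a plain `MarkItDown()` instance instead.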

Error Handling

from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Document Intelligence error: {e}")
    # Common issues: authentication, quota exceeded, unsupported file
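Some failures, such as throttling or brief network problems, are transient and may succeed on a later attempt. The sketch below wraps any conversion callable in a retry loop with exponential backoff. `convert_with_retry` is a hypothetical helper, and it retries on every exception for simplicity; in practice you would likely skip retries for authentication or unsupported-file errors.

```python
import time

def convert_with_retry(convert, path, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call convert(path), retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return convert(path)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * (2 ** attempt))

# Intended use: result = convert_with_retry(md.convert, "document.pdf")
```

The `sleep` parameter exists so the delay can be replaced in tests.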

LLM-Powered Image Descriptions

Generate detailed, contextual descriptions for images using large language models.

Setup with OpenAI

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

result = md.convert("image.jpg")
print(result.text_content)

Supported Use Cases

Images in documents:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# PowerPoint with images
result = md.convert("presentation.pptx")

# Word documents with images
result = md.convert("report.docx")

# Standalone images
result = md.convert("diagram.png")

Custom Prompts

Customize the LLM prompt for specific needs:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
)

# For scientific figures
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
)
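Rather than constructing each variant by hand, the prompts above can be kept in a lookup table and selected by image kind. This is a sketch: the `PROMPTS` table, the `prompt_for` helper, and the `DEFAULT_PROMPT` text are illustrative names and not part of MarkItDown.

```python
# Prompts taken from the examples above, keyed by a caller-chosen image kind.
PROMPTS = {
    "diagram": "Analyze this diagram and explain all components, connections, and relationships in detail",
    "chart": "Describe this chart, including the type, axes, data points, trends, and key insights",
    "screenshot": "Describe this user interface screenshot, listing all UI elements, their layout, and functionality",
    "figure": "Describe this scientific figure in detail, including methodology, results shown, and significance",
}

# Hypothetical fallback for kinds that are not in the table.
DEFAULT_PROMPT = "Describe this image in detail"

def prompt_for(kind):
    """Return the prompt for an image kind, case-insensitively."""
    return PROMPTS.get(kind.lower(), DEFAULT_PROMPT)

# Intended use (assumes the llm_prompt argument shown above):
# md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt=prompt_for("chart"))
```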

Model Selection

GPT-4o (Recommended):

  • Best vision capabilities
  • High-quality descriptions
  • Good at understanding context
  • Higher cost per image

GPT-4o-mini:

  • Lower cost alternative
  • Good for simpler images
  • Faster processing
  • May miss subtle details

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# High quality (more expensive)
md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Budget option (less expensive)
md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")

Configuration from Environment

import os
from markitdown import MarkItDown
from openai import OpenAI

# Set API key in environment
os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'

client = OpenAI()  # Uses env variable
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

Alternative LLM Providers

Anthropic Claude:

from markitdown import MarkItDown
from anthropic import Anthropic

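# MarkItDown calls the client through the OpenAI chat-completions interface,
# so the native Anthropic client likely needs an adapter (or an
# OpenAI-compatible endpoint). Check current compatibility before relying on it.
client = Anthropic(api_key="YOUR-API-KEY")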

Azure OpenAI:

from markitdown import MarkItDown
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="YOUR-AZURE-KEY",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
)

# With Azure OpenAI, llm_model is the name of your model deployment
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

Cost Management

Strategies to reduce LLM costs:

  1. Selective processing:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# `file` and `is_important_document` are placeholders for your own
# path variable and selection logic.
# Only use the LLM for important documents
if is_important_document(file):
    md = MarkItDown(llm_client=client, llm_model="gpt-4o")
else:
    md = MarkItDown()  # Standard processing

result = md.convert(file)

  2. Image filtering:

# Pre-process to identify images that need descriptions
# Only use the LLM for complex/important images

  3. Batch processing:

# Process multiple images in batches
# Monitor costs and set limits

  4. Model selection:

# Use gpt-4o-mini for simple images
# Reserve gpt-4o for complex visualizations
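The model-selection strategy can be sketched with a very crude heuristic: treat small image files as "simple" and larger ones as "complex". File size is only a rough proxy for visual complexity, the 200,000-byte threshold is arbitrary, and both function names are hypothetical.

```python
import os

def choose_model_for_size(size_bytes, small_bytes=200_000):
    """Pick the cheaper model for small files and the stronger one otherwise."""
    return "gpt-4o-mini" if size_bytes < small_bytes else "gpt-4o"

def choose_model(image_path, small_bytes=200_000):
    """Same decision, based on the size of the file on disk."""
    return choose_model_for_size(os.path.getsize(image_path), small_bytes)

# Intended use:
# md = MarkItDown(llm_client=client, llm_model=choose_model("diagram.png"))
```

A real pipeline would more likely decide from image dimensions or from the document type than from byte count alone.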

Performance Considerations

LLM processing adds latency:

  • Each image requires an API call
  • Processing time: typically 1-5 seconds per image, depending on model and image size
  • Network dependent
  • Consider parallel processing for multiple images

Batch optimization:

from markitdown import MarkItDown
from openai import OpenAI
import concurrent.futures

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

def process_image(image_path):
    return md.convert(image_path)

# Process multiple images in parallel
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_image, images))
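Note that `executor.map` re-raises the first exception when its results are iterated, so a single failed image discards the remaining results. A variant that collects successes and failures separately might look like the following sketch; `convert_many` is a hypothetical helper that accepts any conversion callable, such as `md.convert`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def convert_many(convert, paths, max_workers=3):
    """Run convert(path) concurrently; return (results, errors) dicts keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it was submitted for
        futures = {executor.submit(convert, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                errors[path] = exc  # keep going; report failures at the end
    return results, errors

# Intended use: results, errors = convert_many(md.convert, images)
```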

Combined Advanced Features

Azure Document Intelligence + LLM Descriptions

Combine both for maximum quality:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-AZURE-ENDPOINT",
    docintel_key="YOUR-AZURE-KEY"
)

# Best possible PDF conversion with image descriptions
result = md.convert("complex_report.pdf")

Use cases:

  • Research papers with figures
  • Business reports with charts
  • Technical documentation with diagrams
  • Presentations with visual data

Smart Document Processing Pipeline

from markitdown import MarkItDown
from openai import OpenAI
import os

def smart_convert(file_path):
    """Choose a processing method based on the file extension."""
    ext = os.path.splitext(file_path)[1].lower()

    # PDFs (often with complex tables): use Azure Document Intelligence
    if ext == '.pdf':
        md = MarkItDown(
            docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
        )

    # Documents/presentations with images: use the LLM
    elif ext in ['.pptx', '.docx']:
        # Create the client only when needed, so the other branches
        # do not require an OpenAI API key
        md = MarkItDown(
            llm_client=OpenAI(),
            llm_model="gpt-4o"
        )

    # Simple formats: standard processing
    else:
        md = MarkItDown()

    return md.convert(file_path)

# Use it
result = smart_convert("document.pdf")

Plugin System

MarkItDown supports custom plugins for extending functionality.

Plugin Architecture

Plugins are disabled by default for security:

from markitdown import MarkItDown

# Enable plugins
md = MarkItDown(enable_plugins=True)

Creating Custom Plugins

Plugin structure:

class CustomConverter:
    """Custom converter plugin for MarkItDown."""

    def can_convert(self, file_path):
        """Check if this plugin can handle the file."""
        return file_path.endswith('.custom')

    def convert(self, file_path):
        """Convert file to Markdown."""
        # Your conversion logic here
        return {
            'text_content': '# Converted Content\n\n...'
        }

Plugin Registration

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)

# Register custom plugin
md.register_plugin(CustomConverter())

# Use normally
result = md.convert("file.custom")

Plugin Use Cases

Custom formats:

  • Proprietary document formats
  • Specialized scientific data formats
  • Legacy file formats

Enhanced processing:

  • Custom OCR engines
  • Specialized table extraction
  • Domain-specific parsing

Integration:

  • Enterprise document systems
  • Custom databases
  • Specialized APIs

Plugin Security

Important security considerations:

  • Plugins run inside your Python process with the same permissions as your application
  • Only enable for trusted plugins
  • Validate plugin code before use
  • Disable plugins in production unless required

Error Handling for Advanced Features

import os

from markitdown import MarkItDown
from openai import OpenAI

def robust_convert(file_path):
    """Convert with fallback strategies."""
    try:
        # Try with all advanced features
        client = OpenAI()
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o",
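            docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
        )
        return md.convert(file_path)

    except Exception as full_error:
        # This attempt uses both Azure and the LLM, so either may have failed
        print(f"Full-featured conversion failed: {full_error}")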

        try:
            # Fallback: LLM only
            client = OpenAI()
            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
            return md.convert(file_path)

        except Exception as llm_error:
            print(f"LLM failed: {llm_error}")

            # Final fallback: Standard processing
            md = MarkItDown()
            return md.convert(file_path)

# Use it
result = robust_convert("document.pdf")
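The nested `try`/`except` blocks above can be flattened into an ordered list of strategies, which keeps the fallback order in one place and makes it easy to add or remove a step. This is a sketch under the assumption that each strategy is a callable taking a file path; `convert_with_fallbacks` is a hypothetical helper.

```python
def convert_with_fallbacks(path, strategies):
    """Try each (name, convert) pair in order; return (name, result) of the first success."""
    failures = []
    for name, convert in strategies:
        try:
            return name, convert(path)
        except Exception as exc:
            failures.append(f"{name}: {exc}")  # remember why this step failed
    raise RuntimeError("All strategies failed: " + "; ".join(failures))

# Intended use (md_full, md_llm and md_plain are MarkItDown instances
# configured as in the example above):
# name, result = convert_with_fallbacks("document.pdf", [
#     ("azure+llm", md_full.convert),
#     ("llm", md_llm.convert),
#     ("standard", md_plain.convert),
# ])
```

Returning the strategy name alongside the result lets callers log which path actually produced the output.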

Best Practices

Azure Document Intelligence

  • Use for complex PDFs only (cost optimization)
  • Monitor usage and costs
  • Store credentials securely
  • Handle quota limits gracefully
  • Fall back to standard processing if needed

LLM Integration

  • Use appropriate models for task complexity
  • Customize prompts for specific use cases
  • Monitor API costs
  • Implement rate limiting
  • Cache results when possible
  • Handle API errors gracefully
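As an illustration of caching, the sketch below keys results by the SHA-256 hash of the file's bytes, so an unchanged file is converted (and billed) only once per process. `ConversionCache` is a hypothetical class. The cache lives in memory, and the key ignores the model and prompt, so use a separate cache per configuration.

```python
import hashlib

class ConversionCache:
    """In-memory cache of conversion results, keyed by file content hash."""

    def __init__(self, convert):
        self._convert = convert  # any callable taking a file path
        self._store = {}

    def convert(self, path):
        # Hash the content, not the path, so renamed copies also hit the cache
        with open(path, "rb") as fh:
            key = hashlib.sha256(fh.read()).hexdigest()
        if key not in self._store:
            self._store[key] = self._convert(path)
        return self._store[key]

# Intended use:
# cache = ConversionCache(md.convert)
# result = cache.convert("diagram.png")
```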

Combined Features

  • Test cost/quality tradeoffs
  • Use selectively for important documents
  • Implement intelligent routing
  • Monitor performance and costs
  • Have fallback strategies

Security

  • Store API keys securely (environment variables, secrets manager)
  • Never commit credentials to code
  • Disable plugins unless required
  • Validate all inputs
  • Use least privilege access
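Input validation can be as simple as checking that a path points to a regular file of an expected type and a reasonable size before sending it to a paid service. The following is a minimal sketch: `validate_input`, the extension allowlist, and the 50 MiB limit are illustrative choices, not MarkItDown defaults.

```python
from pathlib import Path

# Illustrative allowlist and size limit; adjust to your own requirements.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".jpeg"}
MAX_BYTES = 50 * 1024 * 1024  # 50 MiB

def validate_input(path, allowed=ALLOWED_EXTENSIONS, max_bytes=MAX_BYTES):
    """Return a resolved Path, or raise ValueError if the file should not be processed."""
    p = Path(path).resolve()
    if not p.is_file():
        raise ValueError(f"Not a regular file: {p}")
    if p.suffix.lower() not in allowed:
        raise ValueError(f"Unsupported extension: {p.suffix!r}")
    size = p.stat().st_size
    if size > max_bytes:
        raise ValueError(f"File too large: {size} bytes")
    return p

# Intended use: result = md.convert(str(validate_input("document.pdf")))
```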