zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

12 KiB

Advanced Integrations Reference

This document describes advanced MarkItDown features: Azure Document Intelligence integration, LLM-powered image descriptions, and the plugin system.

Azure Document Intelligence Integration

Azure Document Intelligence (formerly Form Recognizer) is a cloud service that adds table extraction and layout analysis to PDF conversion. On complex documents it typically gives better results than standard extraction.

Setup

Prerequisites:

Azure subscription
Document Intelligence resource created in Azure
Endpoint URL and API key

Create Azure Resource:

# Using Azure CLI
az cognitiveservices account create \
  --name my-doc-intelligence \
  --resource-group my-resource-group \
  --kind FormRecognizer \
  --sku F0 \
  --location eastus

Basic Usage

from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)

result = md.convert("complex_document.pdf")
print(result.text_content)

Configuration from Environment Variables

import os
from markitdown import MarkItDown

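# For illustration only: in practice, set these in your shell or a secrets
# manager rather than in code
os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'

# Read the credentials from the environment instead of hard-coding them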
md = MarkItDown(
    docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
    docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
)

result = md.convert("document.pdf")
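
Reading both variables in one place makes it possible to fail early with a clear message when either is missing, rather than passing None to the constructor. The following is a minimal sketch under that assumption. The helper name `docintel_kwargs_from_env` is hypothetical and not part of MarkItDown, and the keyword names simply mirror the examples above, so they should be checked against the installed MarkItDown version.

```python
import os
from typing import Mapping, Optional


def docintel_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build constructor keyword arguments from environment variables.

    Hypothetical helper (not part of MarkItDown). Raises a clear error
    when a variable is missing or empty.
    """
    env = os.environ if env is None else env
    # Keyword name -> environment variable, as used in the examples above
    names = {
        "docintel_endpoint": "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
        "docintel_key": "AZURE_DOCUMENT_INTELLIGENCE_KEY",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[var] for kwarg, var in names.items()}
```

With MarkItDown installed, the result could be passed on as `MarkItDown(**docintel_kwargs_from_env())`.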

When to Use Azure Document Intelligence

Use for:

Complex PDFs with sophisticated tables
Multi-column layouts
Forms and structured documents
Scanned documents requiring OCR
PDFs with mixed content types
Documents with intricate formatting

Benefits over standard extraction:

Superior table extraction - Better handling of merged cells, complex layouts
Layout analysis - Understands document structure (headers, footers, columns)
Form fields - Extracts key-value pairs from forms
Reading order - Maintains correct text flow in complex layouts
OCR quality - High-quality text extraction from scanned documents

Comparison Example

Standard extraction:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("complex_table.pdf")
# May struggle with complex tables

Azure Document Intelligence:

from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)
result = md.convert("complex_table.pdf")
# Better table reconstruction and layout understanding

Cost Considerations

Azure Document Intelligence is billed per page, with a limited free tier:

Free tier (F0, as used in the setup command above): 500 pages per month
Paid tiers: Pay per page processed
Monitor usage to control costs
Use standard extraction for simple documents
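
One way to act on "monitor usage" is to count pages before sending them to the service. The sketch below is an in-memory counter only. The class name `PageBudget` is hypothetical, the default limit is the free-tier figure quoted above, and monthly resets and persistence are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class PageBudget:
    """Tracks pages sent to a paid service against a monthly allowance.

    Hypothetical helper: in-memory only, with no persistence or reset.
    """

    monthly_limit: int = 500  # free-tier figure quoted above
    used: int = 0

    def can_process(self, pages: int) -> bool:
        """Return True if `pages` more pages still fit in the allowance."""
        return self.used + pages <= self.monthly_limit

    def record(self, pages: int) -> None:
        """Add `pages` to the running total, refusing to exceed the limit."""
        if not self.can_process(pages):
            raise RuntimeError(
                f"Budget exceeded: {self.used} + {pages} > {self.monthly_limit}"
            )
        self.used += pages


budget = PageBudget()
budget.record(120)
print(budget.can_process(400))  # False: 120 + 400 exceeds 500
```

When `can_process` returns False, the caller can route the document to standard extraction instead.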

Error Handling

from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Document Intelligence error: {e}")
    # Common issues: authentication, quota exceeded, unsupported file
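
Of the issues listed, quota and transient network errors are often worth retrying, while authentication failures are not. The sketch below shows a generic retry wrapper with exponential backoff. The function name `with_retries` and its defaults are assumptions, and for simplicity it retries on any exception, which a real implementation would likely narrow.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying with exponential backoff on any exception.

    Hypothetical helper: delays are base_delay, 2*base_delay, 4*base_delay, ...
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

With the `md` object from the example above, a call might look like `with_retries(lambda: md.convert("document.pdf"))`.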

LLM-Powered Image Descriptions

Generate detailed, contextual descriptions for images using large language models.

Setup with OpenAI

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

result = md.convert("image.jpg")
print(result.text_content)

Supported Use Cases

Images in documents:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# PowerPoint with images
result = md.convert("presentation.pptx")

# Word documents (whether embedded images get LLM descriptions depends on the MarkItDown version)
result = md.convert("report.docx")

# Standalone images
result = md.convert("diagram.png")

Custom Prompts

Customize the LLM prompt for specific needs:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
)

# For scientific figures
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
)
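
The example above rebinds `md` once per prompt. When several image types are processed in one program, a lookup table may be easier to maintain. This sketch reuses the four prompts from the example. The names `PROMPTS`, `DEFAULT_PROMPT`, and `prompt_for` are hypothetical, as is the wording of the default prompt.

```python
# Prompts copied from the example above, keyed by a short image category
PROMPTS = {
    "diagram": "Analyze this diagram and explain all components, connections, and relationships in detail",
    "chart": "Describe this chart, including the type, axes, data points, trends, and key insights",
    "screenshot": "Describe this user interface screenshot, listing all UI elements, their layout, and functionality",
    "figure": "Describe this scientific figure in detail, including methodology, results shown, and significance",
}

# Hypothetical fallback used when the category is unknown
DEFAULT_PROMPT = "Describe this image in detail"


def prompt_for(kind: str) -> str:
    """Return the prompt for an image category, falling back to the default."""
    return PROMPTS.get(kind.lower(), DEFAULT_PROMPT)
```

The result would then be passed as `llm_prompt=prompt_for("chart")` when constructing `MarkItDown`.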

Model Selection

GPT-4o (Recommended):

Best vision capabilities
High-quality descriptions
Good at understanding context
Higher cost per image

GPT-4o-mini:

Lower cost alternative
Good for simpler images
Faster processing
May miss subtle details

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# High quality (more expensive)
md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Budget option (less expensive)
md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")

Configuration from Environment

import os
from markitdown import MarkItDown
from openai import OpenAI

# Set API key in environment
os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'

client = OpenAI()  # Uses env variable
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

Alternative LLM Providers

Anthropic Claude:

from markitdown import MarkItDown
from anthropic import Anthropic

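# MarkItDown calls the client through the OpenAI chat-completions interface,
# so the Anthropic client cannot be passed as llm_client directly
client = Anthropic(api_key="YOUR-API-KEY")
# An adapter exposing an OpenAI-compatible interface is required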

Azure OpenAI:

from markitdown import MarkItDown
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="YOUR-AZURE-KEY",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
)

md = MarkItDown(llm_client=client, llm_model="gpt-4o")

Cost Management

Strategies to reduce LLM costs:

Selective processing:

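from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

def is_important_document(path):
    """Placeholder policy: replace with your own criteria."""
    return "report" in path.lower()

file = "quarterly_report.pptx"

# Only use the LLM for important documents
if is_important_document(file):
    md = MarkItDown(llm_client=client, llm_model="gpt-4o")
else:
    md = MarkItDown()  # Standard processing

result = md.convert(file)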

Image filtering:

# Pre-process to identify images that need descriptions
# Only use LLM for complex/important images

Batch processing:

# Process multiple images in batches
# Monitor costs and set limits

Model selection:

# Use gpt-4o-mini for simple images
# Reserve gpt-4o for complex visualizations
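
The image filtering and model selection strategies above can be combined into one decision function. The sketch below uses file size as a crude stand-in for image complexity. Both thresholds and the function names are hypothetical, and a real filter would probably also look at image dimensions or content.

```python
import os
from typing import Optional

# Hypothetical thresholds: tune them against your own images
SKIP_BELOW_BYTES = 5_000     # likely icons, bullets, or spacer images
MINI_BELOW_BYTES = 200_000   # small files, assumed to be simpler images


def choose_model(size_bytes: int) -> Optional[str]:
    """Return the model to use for an image, or None to skip the LLM."""
    if size_bytes < SKIP_BELOW_BYTES:
        return None
    if size_bytes < MINI_BELOW_BYTES:
        return "gpt-4o-mini"
    return "gpt-4o"


def choose_model_for_file(path: str) -> Optional[str]:
    """Apply choose_model to a file on disk."""
    return choose_model(os.path.getsize(path))
```

A caller would construct `MarkItDown()` without an LLM client when the function returns None, and pass the returned name as `llm_model` otherwise.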

Performance Considerations

LLM processing adds latency:

Each image requires an API call
Processing time: typically a few seconds per image, depending on the model, image size, and network latency
Consider parallel processing for multiple images

Batch optimization:

from markitdown import MarkItDown
from openai import OpenAI
import concurrent.futures

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

def process_image(image_path):
    return md.convert(image_path)

# Process multiple images in parallel
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_image, images))
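
With `executor.map`, the first exception is raised while the results are being collected, so the results of the other images are lost. A variant that records failures per file can be sketched with `submit` and `as_completed`. The function name `convert_all` is hypothetical, and the conversion function is passed in so that the helper does not depend on MarkItDown itself.

```python
import concurrent.futures
from typing import Callable, Dict, Iterable, Tuple


def convert_all(
    paths: Iterable[str],
    convert: Callable[[str], object],
    max_workers: int = 3,
) -> Tuple[Dict[str, object], Dict[str, Exception]]:
    """Convert files in parallel, collecting results and errors separately."""
    results: Dict[str, object] = {}
    errors: Dict[str, Exception] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it was submitted for
        futures = {executor.submit(convert, path): path for path in paths}
        for future in concurrent.futures.as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                errors[path] = exc  # keep going; one failure does not stop the rest
    return results, errors
```

With the objects defined above, usage might be `results, errors = convert_all(images, md.convert)`.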

Combined Advanced Features

Azure Document Intelligence + LLM Descriptions

Combine both for maximum quality:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-AZURE-ENDPOINT",
    docintel_key="YOUR-AZURE-KEY"
)

# Layout-aware PDF conversion via Azure; the LLM client is used wherever the converter supports image descriptions
result = md.convert("complex_report.pdf")

Use cases:

Research papers with figures
Business reports with charts
Technical documentation with diagrams
Presentations with visual data

Smart Document Processing Pipeline

from markitdown import MarkItDown
from openai import OpenAI
import os

def smart_convert(file_path):
    """Choose a processing method based on the file extension."""
    ext = os.path.splitext(file_path)[1].lower()

    # PDFs: use Azure Document Intelligence
    if ext == '.pdf':
        md = MarkItDown(
            docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
        )

    # Documents/presentations with images: use the LLM
    elif ext in ['.pptx', '.docx']:
        # Create the client only when needed; OpenAI() raises if no API key is configured
        client = OpenAI()
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o"
        )

    # Simple formats: Standard processing
    else:
        md = MarkItDown()

    return md.convert(file_path)

# Use it
result = smart_convert("document.pdf")

Plugin System

MarkItDown supports custom plugins for extending functionality.

Plugin Architecture

Plugins are disabled by default for security:

from markitdown import MarkItDown

# Enable plugins
md = MarkItDown(enable_plugins=True)

Creating Custom Plugins

Plugin structure:

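from markitdown import DocumentConverter, DocumentConverterResult

class CustomConverter(DocumentConverter):
    """Custom converter for '.custom' files.

    Follows the converter interface of MarkItDown 0.1.x; check the
    version you have installed.
    """

    def accepts(self, file_stream, stream_info, **kwargs):
        """Check whether this converter can handle the input."""
        return (stream_info.extension or "").lower() == ".custom"

    def convert(self, file_stream, stream_info, **kwargs):
        """Convert the input stream to Markdown."""
        # Your conversion logic here; this placeholder treats the input as UTF-8 text
        text = file_stream.read().decode("utf-8", errors="replace")
        return DocumentConverterResult(markdown=f"# Converted Content\n\n{text}")

Plugin Registration

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)

# Register the custom converter
md.register_converter(CustomConverter())

# Use normally
result = md.convert("file.custom")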

Plugin Use Cases

Custom formats:

Proprietary document formats
Specialized scientific data formats
Legacy file formats

Enhanced processing:

Custom OCR engines
Specialized table extraction
Domain-specific parsing

Integration:

Enterprise document systems
Custom databases
Specialized APIs

Plugin Security

Important security considerations:

Plugins run with the same privileges as the Python process that loads them
Only enable for trusted plugins
Validate plugin code before use
Disable plugins in production unless required

Error Handling for Advanced Features

import os

from markitdown import MarkItDown
from openai import OpenAI

def robust_convert(file_path):
    """Convert with fallback strategies."""
    try:
        # Try with all advanced features
        client = OpenAI()
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o",
            docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
        )
        return md.convert(file_path)

    except Exception as full_error:
        # The first attempt uses both Azure and the LLM, so either may have failed
        print(f"Full-featured conversion failed: {full_error}")

        try:
            # Fallback: LLM only
            client = OpenAI()
            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
            return md.convert(file_path)

        except Exception as llm_error:
            print(f"LLM failed: {llm_error}")

            # Final fallback: Standard processing
            md = MarkItDown()
            return md.convert(file_path)

# Use it
result = robust_convert("document.pdf")
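
The nested try blocks above get deeper with every fallback that is added. The same idea can be written as an ordered list of strategies that are tried in turn. This is a sketch only: the name `first_successful` is hypothetical, and each strategy is a zero-argument callable so that the helper stays independent of MarkItDown.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_successful(
    strategies: Sequence[Tuple[str, Callable[[], T]]],
) -> Tuple[str, T]:
    """Try each (name, callable) in order; return the first result that works."""
    failures: List[str] = []
    for name, attempt in strategies:
        try:
            return name, attempt()
        except Exception as exc:
            failures.append(f"{name}: {exc}")  # remember why this one failed
    raise RuntimeError("All strategies failed: " + "; ".join(failures))
```

Each strategy would wrap one configuration, for example `("standard", lambda: MarkItDown().convert(path))`. Returning the strategy name alongside the result makes it easy to log which fallback was actually used.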

Best Practices

Azure Document Intelligence

Use for complex PDFs only (cost optimization)
Monitor usage and costs
Store credentials securely
Handle quota limits gracefully
Fall back to standard processing if needed

LLM Integration

Use appropriate models for task complexity
Customize prompts for specific use cases
Monitor API costs
Implement rate limiting
Cache results when possible
Handle API errors gracefully
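
Caching avoids paying twice for the same image or document. The sketch below keys the cache on a SHA-256 hash of the file contents and stores the converted text as JSON on disk. The function name `cached_convert` and the cache directory are hypothetical. The key ignores the model and prompt, so these would need to be added to the hash if they vary between runs.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_convert(
    file_path: str,
    convert: Callable[[str], str],
    cache_dir: str = ".markitdown_cache",
) -> str:
    """Return converted text, reusing a cached copy when the file is unchanged."""
    # Hash the file contents so renamed or copied files still hit the cache
    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["text"]
    text = convert(file_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"text": text}), encoding="utf-8")
    return text
```

With a configured `md` object, a call might look like `cached_convert("diagram.png", lambda p: md.convert(p).text_content)`.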

Combined Features

Test cost/quality tradeoffs
Use selectively for important documents
Implement intelligent routing
Monitor performance and costs
Have fallback strategies

Security

Store API keys securely (environment variables, secrets manager)
Never commit credentials to code
Disable plugins unless required
Validate all inputs
Use least privilege access
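
"Validate all inputs" can start with the file path itself, before anything is sent to a paid API. The sketch below checks that a path stays inside an allowed directory, points at a regular file, has an expected extension, and is not too large. The allow-list, the size limit, and the function name `validate_input` are all hypothetical.

```python
from pathlib import Path

# Hypothetical policy values: adjust to your deployment
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".jpeg"}
MAX_BYTES = 50 * 1024 * 1024  # 50 MiB


def validate_input(file_path: str, base_dir: str) -> Path:
    """Return the resolved path if it passes all checks, else raise ValueError."""
    base = Path(base_dir).resolve()
    path = (base / file_path).resolve()
    # Reject paths that escape base_dir (e.g. via ".." or an absolute path)
    if base not in path.parents:
        raise ValueError(f"{file_path!r} is outside {base}")
    if not path.is_file():
        raise ValueError(f"{file_path!r} is not a regular file")
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Extension {path.suffix!r} is not allowed")
    if path.stat().st_size > MAX_BYTES:
        raise ValueError(f"{file_path!r} exceeds {MAX_BYTES} bytes")
    return path
```

The returned path can then be passed to `md.convert(str(path))`.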
