Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/markitdown/references/advanced_integrations.md
+++ b/skills/markitdown/references/advanced_integrations.md
@@ -0,0 +1,538 @@
+# Advanced Integrations Reference
+
+This document covers advanced MarkItDown features: Azure Document Intelligence integration, LLM-powered image descriptions, and the plugin system.
+
+## Azure Document Intelligence Integration
+
+Azure Document Intelligence (formerly Form Recognizer) can improve PDF conversion over MarkItDown's standard extraction, particularly for table extraction and layout analysis.
+
+### Setup
+
+**Prerequisites:**
+1. Azure subscription
+2. Document Intelligence resource created in Azure
+3. Endpoint URL and API key
+
+**Create Azure Resource:**
+```bash
+# Using Azure CLI
+az cognitiveservices account create \
+  --name my-doc-intelligence \
+  --resource-group my-resource-group \
+  --kind FormRecognizer \
+  --sku F0 \
+  --location eastus
+```
+
+### Basic Usage
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
+    docintel_key="YOUR-API-KEY"
+)
+
+result = md.convert("complex_document.pdf")
+print(result.text_content)
+```
+
+### Configuration from Environment Variables
+
+```python
+import os
+from markitdown import MarkItDown
+
+# For demonstration only: in practice, set these in your shell or a secrets manager
+os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
+os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'
+
+# Read credentials from the environment instead of hardcoding them
+md = MarkItDown(
+    docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
+    docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
+)
+
+result = md.convert("document.pdf")
+```
+
+### When to Use Azure Document Intelligence
+
+**Use for:**
+- Complex PDFs with sophisticated tables
+- Multi-column layouts
+- Forms and structured documents
+- Scanned documents requiring OCR
+- PDFs with mixed content types
+- Documents with intricate formatting
+
+**Benefits over standard extraction:**
+- **Superior table extraction** - Better handling of merged cells, complex layouts
+- **Layout analysis** - Understands document structure (headers, footers, columns)
+- **Form fields** - Extracts key-value pairs from forms
+- **Reading order** - Maintains correct text flow in complex layouts
+- **OCR quality** - High-quality text extraction from scanned documents
+
+### Comparison Example
+
+**Standard extraction:**
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("complex_table.pdf")
+# May struggle with complex tables
+```
+
+**Azure Document Intelligence:**
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    docintel_endpoint="YOUR-ENDPOINT",
+    docintel_key="YOUR-KEY"
+)
+result = md.convert("complex_table.pdf")
+# Better table reconstruction and layout understanding
+```
+
+### Cost Considerations
+
+Azure Document Intelligence is billed per page, with a limited free tier:
+- **Free tier**: 500 pages per month
+- **Paid tiers**: Pay per page processed
+- Monitor usage to control costs
+- Use standard extraction for simple documents
+
+### Error Handling
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    docintel_endpoint="YOUR-ENDPOINT",
+    docintel_key="YOUR-KEY"
+)
+
+try:
+    result = md.convert("document.pdf")
+    print(result.text_content)
+except Exception as e:
+    print(f"Document Intelligence error: {e}")
+    # Common issues: authentication, quota exceeded, unsupported file
+```
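+
+Transient failures such as quota or rate-limit errors are often worth retrying before giving up. The following is a minimal sketch, assuming a hypothetical helper `convert_with_retry` that is not part of MarkItDown. It wraps any conversion callable (for example `md.convert`) and retries with exponential backoff. The exception types raised by the service are not specified here, so the sketch retries on any `Exception`; narrow this in real code.
+
+```python
+import time
+
+
+def convert_with_retry(convert, path, attempts=3, base_delay=1.0, sleep=time.sleep):
+    """Call convert(path), retrying with exponential backoff on failure.
+
+    Hypothetical helper, not part of MarkItDown. `convert` is any callable
+    that takes a path, for example `md.convert`.
+    """
+    if attempts < 1:
+        raise ValueError("attempts must be at least 1")
+
+    for attempt in range(attempts):
+        try:
+            return convert(path)
+        except Exception:
+            # Re-raise once the final attempt has failed.
+            if attempt == attempts - 1:
+                raise
+            # Wait 1x, 2x, 4x, ... the base delay before trying again.
+            sleep(base_delay * (2 ** attempt))
+```
+
+Usage would look like `result = convert_with_retry(md.convert, "document.pdf")`. The `sleep` parameter exists only so the delay can be replaced in tests.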
+
+## LLM-Powered Image Descriptions
+
+Generate detailed, contextual descriptions for images using large language models.
+
+### Setup with OpenAI
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+
+result = md.convert("image.jpg")
+print(result.text_content)
+```
+
+### Supported Use Cases
+
+**Images in documents:**
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+
+# PowerPoint with images
+result = md.convert("presentation.pptx")
+
+# Word documents with images
+result = md.convert("report.docx")
+
+# Standalone images
+result = md.convert("diagram.png")
+```
+
+### Custom Prompts
+
+Customize the LLM prompt for specific needs:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+
+# For diagrams
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
+)
+
+# For charts
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
+)
+
+# For UI screenshots
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
+)
+
+# For scientific figures
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
+)
+```
+
+### Model Selection
+
+**GPT-4o (Recommended):**
+- Best vision capabilities
+- High-quality descriptions
+- Good at understanding context
+- Higher cost per image
+
+**GPT-4o-mini:**
+- Lower cost alternative
+- Good for simpler images
+- Faster processing
+- May miss subtle details
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+
+# High quality (more expensive)
+md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")
+
+# Budget option (less expensive)
+md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
+```
+
+### Configuration from Environment
+
+```python
+import os
+from markitdown import MarkItDown
+from openai import OpenAI
+
+# For demonstration only: in practice, export OPENAI_API_KEY in your shell
+os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'
+
+client = OpenAI()  # Uses env variable
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+```
+
+### Alternative LLM Providers
+
+**Anthropic Claude:**
+```python
+from markitdown import MarkItDown
+from anthropic import Anthropic
+
+# Note: Check current compatibility with MarkItDown
+client = Anthropic(api_key="YOUR-API-KEY")
+# May require adapter for MarkItDown compatibility
+```
+
+**Azure OpenAI:**
+```python
+from markitdown import MarkItDown
+from openai import AzureOpenAI
+
+client = AzureOpenAI(
+    api_key="YOUR-AZURE-KEY",
+    api_version="2024-02-01",
+    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
+)
+
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+```
+
+### Cost Management
+
+**Strategies to reduce LLM costs:**
+
+1. **Selective processing:**
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+
+# Only use LLM for important documents
+if is_important_document(file):
+    md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+else:
+    md = MarkItDown()  # Standard processing
+
+result = md.convert(file)
+```
+
+2. **Image filtering:**
+```python
+# Pre-process to identify images that need descriptions
+# Only use LLM for complex/important images
+```
+
+3. **Batch processing:**
+```python
+# Process multiple images in batches
+# Monitor costs and set limits
+```
+
+4. **Model selection:**
+```python
+# Use gpt-4o-mini for simple images
+# Reserve gpt-4o for complex visualizations
+```
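+
+The image filtering and model selection strategies above can be sketched as a single routing function. This is a minimal sketch under stated assumptions: `pick_llm_model`, the extension set, and the 200 KB size cutoff are all hypothetical and not part of MarkItDown, and the heuristic applies to standalone image files only. File size is a crude proxy for visual complexity, so tune or replace it for your own data.
+
+```python
+import os
+
+# Assumed cutoff: images below this size are treated as "simple".
+SMALL_IMAGE_BYTES = 200 * 1024
+# Assumed set of extensions treated as standalone images.
+IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}
+
+
+def pick_llm_model(path, size_bytes=None):
+    """Return a model name for `path`, or None to skip LLM descriptions.
+
+    Hypothetical heuristic: non-image files get no LLM, small images get
+    the cheaper model, and larger images get the stronger model.
+    """
+    ext = os.path.splitext(path)[1].lower()
+    if ext not in IMAGE_EXTENSIONS:
+        return None
+    if size_bytes is None:
+        size_bytes = os.path.getsize(path)
+    return "gpt-4o-mini" if size_bytes < SMALL_IMAGE_BYTES else "gpt-4o"
+```
+
+The returned name can be passed as `llm_model` when constructing `MarkItDown`, and a `None` result can fall back to `MarkItDown()` with no LLM client.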
+
+### Performance Considerations
+
+**LLM processing adds latency:**
+- Each image requires an API call
+- Processing time: roughly 1-5 seconds per image, depending on the model and image size
+- Network dependent
+- Consider parallel processing for multiple images
+
+**Batch optimization:**
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+import concurrent.futures
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+
+def process_image(image_path):
+    return md.convert(image_path)
+
+# Process multiple images in parallel
+images = ["img1.jpg", "img2.jpg", "img3.jpg"]
+with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
+    results = list(executor.map(process_image, images))
+```
+
+## Combined Advanced Features
+
+### Azure Document Intelligence + LLM Descriptions
+
+Combine both for maximum quality:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    docintel_endpoint="YOUR-AZURE-ENDPOINT",
+    docintel_key="YOUR-AZURE-KEY"
+)
+
+# Best possible PDF conversion with image descriptions
+result = md.convert("complex_report.pdf")
+```
+
+**Use cases:**
+- Research papers with figures
+- Business reports with charts
+- Technical documentation with diagrams
+- Presentations with visual data
+
+### Smart Document Processing Pipeline
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+import os
+
+def smart_convert(file_path):
+    """Choose a processing method based on file extension."""
+    ext = os.path.splitext(file_path)[1].lower()
+
+    # PDFs (which may contain complex tables): Use Azure
+    if ext == '.pdf':
+        md = MarkItDown(
+            docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
+            docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
+        )
+
+    # Documents/presentations with images: Use LLM
+    elif ext in ['.pptx', '.docx']:
+        # Create the OpenAI client only when it is actually needed
+        md = MarkItDown(
+            llm_client=OpenAI(),
+            llm_model="gpt-4o"
+        )
+
+    # Simple formats: Standard processing
+    else:
+        md = MarkItDown()
+
+    return md.convert(file_path)
+
+# Use it
+result = smart_convert("document.pdf")
+```
+
+## Plugin System
+
+MarkItDown supports custom plugins for extending functionality.
+
+### Plugin Architecture
+
+Plugins are disabled by default for security:
+
+```python
+from markitdown import MarkItDown
+
+# Enable plugins
+md = MarkItDown(enable_plugins=True)
+```
+
+### Creating Custom Plugins
+
+**Plugin structure:**
+```python
+from markitdown import DocumentConverter, DocumentConverterResult
+
+class CustomConverter(DocumentConverter):
+    """Custom converter for MarkItDown.
+
+    Interface as of markitdown 0.1.x; verify against your installed version.
+    """
+
+    def accepts(self, file_stream, stream_info, **kwargs):
+        """Check if this converter can handle the file."""
+        return (stream_info.extension or "").lower() == ".custom"
+
+    def convert(self, file_stream, stream_info, **kwargs):
+        """Convert file to Markdown."""
+        # Your conversion logic here
+        return DocumentConverterResult(markdown="# Converted Content\n\n...")
+```
+
+### Plugin Registration
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(enable_plugins=True)
+
+# Register the custom converter on this instance
+md.register_converter(CustomConverter())
+
+# Use normally
+result = md.convert("file.custom")
+```
+
+### Plugin Use Cases
+
+**Custom formats:**
+- Proprietary document formats
+- Specialized scientific data formats
+- Legacy file formats
+
+**Enhanced processing:**
+- Custom OCR engines
+- Specialized table extraction
+- Domain-specific parsing
+
+**Integration:**
+- Enterprise document systems
+- Custom databases
+- Specialized APIs
+
+### Plugin Security
+
+**Important security considerations:**
+- Plugins run with the full privileges of the host Python process
+- Only enable for trusted plugins
+- Validate plugin code before use
+- Disable plugins in production unless required
+
+## Error Handling for Advanced Features
+
+```python
+import os
+
+from markitdown import MarkItDown
+from openai import OpenAI
+
+def robust_convert(file_path):
+    """Convert with fallback strategies."""
+    try:
+        # Try with all advanced features
+        client = OpenAI()
+        md = MarkItDown(
+            llm_client=client,
+            llm_model="gpt-4o",
+            docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
+            docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
+        )
+        return md.convert(file_path)
+
+    except Exception as advanced_error:
+        # This first attempt uses both Azure and the LLM, so either may have failed
+        print(f"Advanced conversion failed: {advanced_error}")
+
+        try:
+            # Fallback: LLM only
+            client = OpenAI()
+            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+            return md.convert(file_path)
+
+        except Exception as llm_error:
+            print(f"LLM failed: {llm_error}")
+
+            # Final fallback: Standard processing
+            md = MarkItDown()
+            return md.convert(file_path)
+
+# Use it
+result = robust_convert("document.pdf")
+```
+
+## Best Practices
+
+### Azure Document Intelligence
+- Use for complex PDFs only (cost optimization)
+- Monitor usage and costs
+- Store credentials securely
+- Handle quota limits gracefully
+- Fall back to standard processing if needed
+
+### LLM Integration
+- Use appropriate models for task complexity
+- Customize prompts for specific use cases
+- Monitor API costs
+- Implement rate limiting
+- Cache results when possible
+- Handle API errors gracefully
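+
+Caching can be sketched with a small content-addressed store on disk. This is a minimal sketch, assuming a hypothetical helper `cached_convert` that is not part of MarkItDown. It keys the cache on the SHA-256 of the file contents, so an edited file is reconverted while an unchanged one is served from disk. The key does not include the model or prompt, so use a separate cache directory per configuration.
+
+```python
+import hashlib
+import os
+
+
+def cached_convert(convert, path, cache_dir):
+    """Return Markdown text for `path`, reusing a cached copy when present.
+
+    Hypothetical helper: `convert` is any callable that takes a path and
+    returns Markdown text, e.g. `lambda p: md.convert(p).text_content`.
+    """
+    # Key the cache on file contents so edited files are reconverted.
+    with open(path, "rb") as f:
+        digest = hashlib.sha256(f.read()).hexdigest()
+    cache_path = os.path.join(cache_dir, digest + ".md")
+
+    # Cache hit: skip the (possibly paid) conversion entirely.
+    if os.path.exists(cache_path):
+        with open(cache_path, encoding="utf-8") as f:
+            return f.read()
+
+    # Cache miss: convert once and store the result for next time.
+    text = convert(path)
+    os.makedirs(cache_dir, exist_ok=True)
+    with open(cache_path, "w", encoding="utf-8") as f:
+        f.write(text)
+    return text
+```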
+
+### Combined Features
+- Test cost/quality tradeoffs
+- Use selectively for important documents
+- Implement intelligent routing
+- Monitor performance and costs
+- Have fallback strategies
+
+### Security
+- Store API keys securely (environment variables, secrets manager)
+- Never commit credentials to code
+- Disable plugins unless required
+- Validate all inputs
+- Use least privilege access
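+
+Reading credentials from the environment works best when a missing value fails loudly instead of being passed along as `None`. The following is a minimal sketch, assuming a hypothetical helper `require_env` that is not part of MarkItDown; the variable names in the usage note are the ones used earlier in this document.
+
+```python
+import os
+
+
+def require_env(name):
+    """Return the value of environment variable `name`, or raise a clear error."""
+    value = os.environ.get(name, "").strip()
+    if not value:
+        raise RuntimeError(f"Missing required environment variable: {name}")
+    return value
+```
+
+For example, `docintel_endpoint=require_env('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT')` raises immediately at startup if the variable is unset, rather than failing later during conversion.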