Initial commit
skills/markitdown/references/advanced_integrations.md (new file, 538 lines)
@@ -0,0 +1,538 @@
|
||||
# Advanced Integrations Reference
|
||||
|
||||
This document provides detailed information about advanced MarkItDown features, including Azure Document Intelligence integration, LLM-powered image descriptions, and the plugin system.
|
||||
|
||||
## Azure Document Intelligence Integration
|
||||
|
||||
Azure Document Intelligence (formerly Form Recognizer) provides superior PDF processing with advanced table extraction and layout analysis.
|
||||
|
||||
### Setup
|
||||
|
||||
**Prerequisites:**
|
||||
1. Azure subscription
|
||||
2. Document Intelligence resource created in Azure
|
||||
3. Endpoint URL and API key
|
||||
|
||||
**Create Azure Resource:**
|
||||
```bash
|
||||
# Using Azure CLI
|
||||
az cognitiveservices account create \
|
||||
--name my-doc-intelligence \
|
||||
--resource-group my-resource-group \
|
||||
--kind FormRecognizer \
|
||||
--sku F0 \
|
||||
--location eastus
|
||||
```
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown(
|
||||
docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
|
||||
docintel_key="YOUR-API-KEY"
|
||||
)
|
||||
|
||||
result = md.convert("complex_document.pdf")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### Configuration from Environment Variables
|
||||
|
||||
```python
|
||||
import os
|
||||
from markitdown import MarkItDown
|
||||
|
||||
# Set environment variables
|
||||
os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
|
||||
os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'
|
||||
|
||||
# Pass credentials read from the environment
|
||||
md = MarkItDown(
|
||||
docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
|
||||
docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
|
||||
)
|
||||
|
||||
result = md.convert("document.pdf")
|
||||
```
|
||||
|
||||
### When to Use Azure Document Intelligence
|
||||
|
||||
**Use for:**
|
||||
- Complex PDFs with sophisticated tables
|
||||
- Multi-column layouts
|
||||
- Forms and structured documents
|
||||
- Scanned documents requiring OCR
|
||||
- PDFs with mixed content types
|
||||
- Documents with intricate formatting
|
||||
|
||||
**Benefits over standard extraction:**
|
||||
- **Superior table extraction** - Better handling of merged cells, complex layouts
|
||||
- **Layout analysis** - Understands document structure (headers, footers, columns)
|
||||
- **Form fields** - Extracts key-value pairs from forms
|
||||
- **Reading order** - Maintains correct text flow in complex layouts
|
||||
- **OCR quality** - High-quality text extraction from scanned documents
|
||||
|
||||
### Comparison Example
|
||||
|
||||
**Standard extraction:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("complex_table.pdf")
|
||||
# May struggle with complex tables
|
||||
```
|
||||
|
||||
**Azure Document Intelligence:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown(
|
||||
docintel_endpoint="YOUR-ENDPOINT",
|
||||
docintel_key="YOUR-KEY"
|
||||
)
|
||||
result = md.convert("complex_table.pdf")
|
||||
# Better table reconstruction and layout understanding
|
||||
```
|
||||
|
||||
### Cost Considerations
|
||||
|
||||
Azure Document Intelligence is a paid service:
|
||||
- **Free tier**: 500 pages per month
|
||||
- **Paid tiers**: Pay per page processed
|
||||
- Monitor usage to control costs
|
||||
- Use standard extraction for simple documents
|
||||
|
||||
### Error Handling
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown(
|
||||
docintel_endpoint="YOUR-ENDPOINT",
|
||||
docintel_key="YOUR-KEY"
|
||||
)
|
||||
|
||||
try:
|
||||
result = md.convert("document.pdf")
|
||||
print(result.text_content)
|
||||
except Exception as e:
|
||||
print(f"Document Intelligence error: {e}")
|
||||
# Common issues: authentication, quota exceeded, unsupported file
|
||||
```
|
||||
|
||||
## LLM-Powered Image Descriptions
|
||||
|
||||
Generate detailed, contextual descriptions for images using large language models.
|
||||
|
||||
### Setup with OpenAI
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
|
||||
result = md.convert("image.jpg")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### Supported Use Cases
|
||||
|
||||
**Images in documents:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
|
||||
# PowerPoint with images
|
||||
result = md.convert("presentation.pptx")
|
||||
|
||||
# Word documents with images
|
||||
result = md.convert("report.docx")
|
||||
|
||||
# Standalone images
|
||||
result = md.convert("diagram.png")
|
||||
```
|
||||
|
||||
### Custom Prompts
|
||||
|
||||
Customize the LLM prompt for specific needs:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
|
||||
# For diagrams
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
|
||||
)
|
||||
|
||||
# For charts
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
|
||||
)
|
||||
|
||||
# For UI screenshots
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
|
||||
)
|
||||
|
||||
# For scientific figures
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
|
||||
)
|
||||
```
|
||||
|
||||
### Model Selection
|
||||
|
||||
**GPT-4o (Recommended):**
|
||||
- Best vision capabilities
|
||||
- High-quality descriptions
|
||||
- Good at understanding context
|
||||
- Higher cost per image
|
||||
|
||||
**GPT-4o-mini:**
|
||||
- Lower cost alternative
|
||||
- Good for simpler images
|
||||
- Faster processing
|
||||
- May miss subtle details
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
|
||||
# High quality (more expensive)
|
||||
md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
|
||||
# Budget option (less expensive)
|
||||
md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
|
||||
```
|
||||
|
||||
### Configuration from Environment
|
||||
|
||||
```python
|
||||
import os
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
# Set API key in environment
|
||||
os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'
|
||||
|
||||
client = OpenAI() # Uses env variable
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
```
|
||||
|
||||
### Alternative LLM Providers
|
||||
|
||||
**Anthropic Claude:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from anthropic import Anthropic
|
||||
|
||||
# Note: Check current compatibility with MarkItDown
|
||||
client = Anthropic(api_key="YOUR-API-KEY")
|
||||
# May require adapter for MarkItDown compatibility
|
||||
```
|
||||
|
||||
**Azure OpenAI:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import AzureOpenAI
|
||||
|
||||
client = AzureOpenAI(
|
||||
api_key="YOUR-AZURE-KEY",
|
||||
api_version="2024-02-01",
|
||||
azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
|
||||
)
|
||||
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
```
|
||||
|
||||
### Cost Management
|
||||
|
||||
**Strategies to reduce LLM costs:**
|
||||
|
||||
1. **Selective processing:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
|
||||
# Only use LLM for important documents
|
||||
# is_important_document() is a placeholder for your own selection logic
if is_important_document(file):
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
else:
|
||||
md = MarkItDown() # Standard processing
|
||||
|
||||
result = md.convert(file)
|
||||
```
|
||||
|
||||
2. **Image filtering:**
|
||||
```python
|
||||
# Pre-process to identify images that need descriptions
|
||||
# Only use LLM for complex/important images
|
||||
```
|
||||
|
||||
3. **Batch processing:**
|
||||
```python
|
||||
# Process multiple images in batches
|
||||
# Monitor costs and set limits
|
||||
```
|
||||
|
||||
4. **Model selection:**
|
||||
```python
|
||||
# Use gpt-4o-mini for simple images
|
||||
# Reserve gpt-4o for complex visualizations (see the sketch below)
|
||||
```
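
As a concrete illustration of strategies 2 and 4, the sketch below routes each image to a model based on its file size. The 200 KB threshold and the directory layout are illustrative assumptions, not MarkItDown defaults.

```python
import os
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")

SIZE_THRESHOLD = 200 * 1024  # assumed cutoff: smaller files go to the cheaper model

for filename in os.listdir("images"):
    if not filename.lower().endswith((".png", ".jpg", ".jpeg")):
        continue
    path = os.path.join("images", filename)
    # Route by file size as a rough proxy for image complexity
    md = md_budget if os.path.getsize(path) < SIZE_THRESHOLD else md_quality
    result = md.convert(path)
    print(f"{filename}: {len(result.text_content)} characters of description")
```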
|
||||
|
||||
### Performance Considerations
|
||||
|
||||
**LLM processing adds latency:**
|
||||
- Each image requires an API call
|
||||
- Processing time: 1-5 seconds per image
|
||||
- Network dependent
|
||||
- Consider parallel processing for multiple images
|
||||
|
||||
**Batch optimization:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
import concurrent.futures
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
|
||||
def process_image(image_path):
|
||||
return md.convert(image_path)
|
||||
|
||||
# Process multiple images in parallel
|
||||
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
|
||||
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
|
||||
results = list(executor.map(process_image, images))
|
||||
```
|
||||
|
||||
## Combined Advanced Features
|
||||
|
||||
### Azure Document Intelligence + LLM Descriptions
|
||||
|
||||
Combine both for maximum quality:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
docintel_endpoint="YOUR-AZURE-ENDPOINT",
|
||||
docintel_key="YOUR-AZURE-KEY"
|
||||
)
|
||||
|
||||
# Best possible PDF conversion with image descriptions
|
||||
result = md.convert("complex_report.pdf")
|
||||
```
|
||||
|
||||
**Use cases:**
|
||||
- Research papers with figures
|
||||
- Business reports with charts
|
||||
- Technical documentation with diagrams
|
||||
- Presentations with visual data
|
||||
|
||||
### Smart Document Processing Pipeline
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
import os
|
||||
|
||||
def smart_convert(file_path):
|
||||
"""Intelligently choose processing method based on file type."""
|
||||
client = OpenAI()
|
||||
ext = os.path.splitext(file_path)[1].lower()
|
||||
|
||||
# PDFs with complex tables: Use Azure
|
||||
if ext == '.pdf':
|
||||
md = MarkItDown(
|
||||
docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
|
||||
docintel_key=os.getenv('AZURE_KEY')
|
||||
)
|
||||
|
||||
# Documents/presentations with images: Use LLM
|
||||
elif ext in ['.pptx', '.docx']:
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o"
|
||||
)
|
||||
|
||||
# Simple formats: Standard processing
|
||||
else:
|
||||
md = MarkItDown()
|
||||
|
||||
return md.convert(file_path)
|
||||
|
||||
# Use it
|
||||
result = smart_convert("document.pdf")
|
||||
```
|
||||
|
||||
## Plugin System
|
||||
|
||||
MarkItDown supports custom plugins for extending functionality.
|
||||
|
||||
### Plugin Architecture
|
||||
|
||||
Plugins are disabled by default for security:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
# Enable plugins
|
||||
md = MarkItDown(enable_plugins=True)
|
||||
```
|
||||
|
||||
### Creating Custom Plugins
|
||||
|
||||
**Plugin structure:**
|
||||
```python
|
||||
class CustomConverter:
|
||||
"""Custom converter plugin for MarkItDown."""
|
||||
|
||||
def can_convert(self, file_path):
|
||||
"""Check if this plugin can handle the file."""
|
||||
return file_path.endswith('.custom')
|
||||
|
||||
def convert(self, file_path):
|
||||
"""Convert file to Markdown."""
|
||||
# Your conversion logic here
|
||||
return {
|
||||
'text_content': '# Converted Content\n\n...'
|
||||
}
|
||||
```
|
||||
|
||||
### Plugin Registration
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown(enable_plugins=True)
|
||||
|
||||
# Register custom plugin
|
||||
md.register_plugin(CustomConverter())
|
||||
|
||||
# Use normally
|
||||
result = md.convert("file.custom")
|
||||
```
|
||||
|
||||
### Plugin Use Cases
|
||||
|
||||
**Custom formats:**
|
||||
- Proprietary document formats
|
||||
- Specialized scientific data formats
|
||||
- Legacy file formats
|
||||
|
||||
**Enhanced processing:**
|
||||
- Custom OCR engines
|
||||
- Specialized table extraction
|
||||
- Domain-specific parsing
|
||||
|
||||
**Integration:**
|
||||
- Enterprise document systems
|
||||
- Custom databases
|
||||
- Specialized APIs
|
||||
|
||||
### Plugin Security
|
||||
|
||||
**Important security considerations:**
|
||||
- Plugins run with full system access
|
||||
- Only enable for trusted plugins
|
||||
- Validate plugin code before use
|
||||
- Disable plugins in production unless required
|
||||
|
||||
## Error Handling for Advanced Features
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
import os
|
||||
|
||||
def robust_convert(file_path):
|
||||
"""Convert with fallback strategies."""
|
||||
try:
|
||||
# Try with all advanced features
|
||||
client = OpenAI()
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
|
||||
docintel_key=os.getenv('AZURE_KEY')
|
||||
)
|
||||
return md.convert(file_path)
|
||||
|
||||
except Exception as azure_error:
|
||||
print(f"Azure failed: {azure_error}")
|
||||
|
||||
try:
|
||||
# Fallback: LLM only
|
||||
client = OpenAI()
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
return md.convert(file_path)
|
||||
|
||||
except Exception as llm_error:
|
||||
print(f"LLM failed: {llm_error}")
|
||||
|
||||
# Final fallback: Standard processing
|
||||
md = MarkItDown()
|
||||
return md.convert(file_path)
|
||||
|
||||
# Use it
|
||||
result = robust_convert("document.pdf")
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Azure Document Intelligence
|
||||
- Use for complex PDFs only (cost optimization)
|
||||
- Monitor usage and costs
|
||||
- Store credentials securely
|
||||
- Handle quota limits gracefully
|
||||
- Fall back to standard processing if needed
|
||||
|
||||
### LLM Integration
|
||||
- Use appropriate models for task complexity
|
||||
- Customize prompts for specific use cases
|
||||
- Monitor API costs
|
||||
- Implement rate limiting
|
||||
- Cache results when possible (see the caching sketch after this list)
|
||||
- Handle API errors gracefully
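
To avoid paying for the same conversion twice, one option is a simple on-disk cache keyed by the file's content hash. This is a minimal sketch, not a built-in MarkItDown feature; the cache directory name is an assumption.

```python
import hashlib
import os
from markitdown import MarkItDown
from openai import OpenAI

CACHE_DIR = "conversion_cache"  # assumed location for cached Markdown
os.makedirs(CACHE_DIR, exist_ok=True)

def convert_with_cache(md, file_path):
    # Key the cache on the file's content so edits invalidate old entries
    with open(file_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    cache_path = os.path.join(CACHE_DIR, f"{digest}.md")

    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return f.read()

    result = md.convert(file_path)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(result.text_content)
    return result.text_content

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
text = convert_with_cache(md, "report.docx")
```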
|
||||
|
||||
### Combined Features
|
||||
- Test cost/quality tradeoffs
|
||||
- Use selectively for important documents
|
||||
- Implement intelligent routing
|
||||
- Monitor performance and costs
|
||||
- Have fallback strategies
|
||||
|
||||
### Security
|
||||
- Store API keys securely (environment variables, secrets manager)
|
||||
- Never commit credentials to code
|
||||
- Disable plugins unless required
|
||||
- Validate all inputs
|
||||
- Use least privilege access
|
||||
skills/markitdown/references/document_conversion.md (new file, 273 lines)
@@ -0,0 +1,273 @@
|
||||
# Document Conversion Reference
|
||||
|
||||
This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.
|
||||
|
||||
## PDF Files
|
||||
|
||||
PDF conversion extracts text, tables, and structure from PDF documents.
|
||||
|
||||
### Basic PDF Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("document.pdf")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### PDF with Azure Document Intelligence
|
||||
|
||||
For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown(
|
||||
docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
|
||||
docintel_key="YOUR-API-KEY"
|
||||
)
|
||||
result = md.convert("complex_table.pdf")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
**Benefits of Azure Document Intelligence:**
|
||||
- Superior table extraction and reconstruction
|
||||
- Better handling of multi-column layouts
|
||||
- Form field recognition
|
||||
- Improved text ordering in complex documents
|
||||
|
||||
### PDF Handling Notes
|
||||
|
||||
- Scanned PDFs require OCR (automatically handled if tesseract is installed)
|
||||
- Password-protected PDFs are not supported (see the workaround sketch after this list)
|
||||
- Large PDFs may take longer to process
|
||||
- Vector graphics and embedded images are extracted where possible
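
If you run into a password-protected PDF and you know the password, one workaround is to decrypt a copy first and convert that. The sketch below uses pypdf, which is an assumption on my part rather than a MarkItDown dependency.

```python
from pypdf import PdfReader, PdfWriter
from markitdown import MarkItDown

def convert_protected_pdf(path, password):
    reader = PdfReader(path)
    if reader.is_encrypted:
        reader.decrypt(password)
        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
        # Write an unprotected copy for MarkItDown to read
        decrypted_path = "decrypted_copy.pdf"
        with open(decrypted_path, "wb") as f:
            writer.write(f)
        path = decrypted_path

    md = MarkItDown()
    return md.convert(path)

result = convert_protected_pdf("protected.pdf", "secret-password")
print(result.text_content)
```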
|
||||
|
||||
## Word Documents (DOCX)
|
||||
|
||||
Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.
|
||||
|
||||
### Basic DOCX Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("document.docx")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### DOCX Structure Preservation
|
||||
|
||||
MarkItDown preserves:
|
||||
- **Headings** → Markdown headers (`#`, `##`, etc.)
|
||||
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
|
||||
- **Lists** → Markdown lists (ordered and unordered)
|
||||
- **Tables** → Markdown tables
|
||||
- **Hyperlinks** → Markdown links `[text](url)`
|
||||
- **Images** → Referenced with descriptions (can use LLM for descriptions)
|
||||
|
||||
### Command-Line Usage
|
||||
|
||||
```bash
|
||||
# Basic conversion
|
||||
markitdown report.docx -o report.md
|
||||
|
||||
# With output directory
|
||||
markitdown report.docx -o output/report.md
|
||||
```
|
||||
|
||||
### DOCX with Images
|
||||
|
||||
To generate descriptions for images in Word documents, use LLM integration:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
result = md.convert("document_with_images.docx")
|
||||
```
|
||||
|
||||
## PowerPoint Presentations (PPTX)
|
||||
|
||||
PowerPoint conversion extracts text from slides while preserving structure.
|
||||
|
||||
### Basic PPTX Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("presentation.pptx")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### PPTX Structure
|
||||
|
||||
MarkItDown processes presentations as:
|
||||
- Each slide becomes a major section
|
||||
- Slide titles become headers
|
||||
- Bullet points are preserved
|
||||
- Tables are converted to Markdown tables
|
||||
- Notes are included if present
|
||||
|
||||
### PPTX with Image Descriptions
|
||||
|
||||
Presentations often contain important visual information. Use LLM integration to describe images:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Describe this slide image in detail, focusing on key information"
|
||||
)
|
||||
result = md.convert("presentation.pptx")
|
||||
```
|
||||
|
||||
**Custom prompts for presentations:**
|
||||
- "Describe charts and graphs with their key data points"
|
||||
- "Explain diagrams and their relationships"
|
||||
- "Summarize visual content for accessibility"
|
||||
|
||||
## Excel Spreadsheets (XLSX, XLS)
|
||||
|
||||
Excel conversion formats spreadsheet data as Markdown tables.
|
||||
|
||||
### Basic XLSX Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("data.xlsx")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### Multi-Sheet Workbooks
|
||||
|
||||
For workbooks with multiple sheets:
|
||||
- Each sheet becomes a separate section
|
||||
- Sheet names are used as headers
|
||||
- Empty sheets are skipped
|
||||
- Formulas are evaluated (values shown, not formulas)
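
If you want to know what sections to expect before converting, one option is to preview the workbook's sheets with pandas first. pandas is an assumption here, not something MarkItDown requires.

```python
import pandas as pd
from markitdown import MarkItDown

# List sheet names and sizes before converting
workbook = pd.ExcelFile("data.xlsx")
for sheet_name in workbook.sheet_names:
    frame = pd.read_excel(workbook, sheet_name=sheet_name)
    print(f"{sheet_name}: {frame.shape[0]} rows x {frame.shape[1]} columns")

md = MarkItDown()
result = md.convert("data.xlsx")
```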
|
||||
|
||||
### XLSX Conversion Details
|
||||
|
||||
**What's preserved:**
|
||||
- Cell values (text, numbers, dates)
|
||||
- Table structure (rows and columns)
|
||||
- Sheet names
|
||||
- Cell formatting (bold headers)
|
||||
|
||||
**What's not preserved:**
|
||||
- Formulas (only computed values)
|
||||
- Charts and graphs (use LLM integration for descriptions)
|
||||
- Cell colors and conditional formatting
|
||||
- Comments and notes
|
||||
|
||||
### Large Spreadsheets
|
||||
|
||||
For large spreadsheets, consider:
|
||||
- Processing may be slower for files with many rows/columns
|
||||
- Very wide tables may not format well in Markdown
|
||||
- Consider filtering or preprocessing data if possible
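
One way to act on the last point is to trim the data before conversion; the sketch below keeps only a few columns, caps the row count, and writes a temporary CSV for MarkItDown to convert. The column names and file names are illustrative assumptions.

```python
import pandas as pd
from markitdown import MarkItDown

# Keep only the columns you actually need (names are placeholders)
frame = pd.read_excel("large_dataset.xlsx", usecols=["date", "region", "revenue"])
frame = frame.dropna(how="all")                      # drop fully empty rows
frame.head(500).to_csv("trimmed.csv", index=False)   # cap the row count

md = MarkItDown()
result = md.convert("trimmed.csv")
print(result.text_content)
```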
|
||||
|
||||
### XLS (Legacy Excel) Files
|
||||
|
||||
Legacy `.xls` files are supported but require additional dependencies:
|
||||
|
||||
```bash
|
||||
pip install 'markitdown[xls]'
|
||||
```
|
||||
|
||||
Then use normally:
|
||||
```python
|
||||
from markitdown import MarkItDown

md = MarkItDown()
|
||||
result = md.convert("legacy_data.xls")
|
||||
```
|
||||
|
||||
## Common Document Conversion Patterns
|
||||
|
||||
### Batch Document Processing
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import os
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Process all documents in a directory
|
||||
for filename in os.listdir("documents"):
|
||||
if filename.endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
|
||||
result = md.convert(f"documents/{filename}")
|
||||
|
||||
# Save to output directory
|
||||
output_name = os.path.splitext(filename)[0] + ".md"
|
||||
with open(f"markdown/{output_name}", "w") as f:
|
||||
f.write(result.text_content)
|
||||
```
|
||||
|
||||
### Document with Mixed Content
|
||||
|
||||
For documents containing multiple types of content (text, tables, images):
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
# Use LLM for image descriptions + Azure for complex tables
|
||||
client = OpenAI()
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
docintel_endpoint="YOUR-ENDPOINT",
|
||||
docintel_key="YOUR-KEY"
|
||||
)
|
||||
|
||||
result = md.convert("complex_report.pdf")
|
||||
```
|
||||
|
||||
### Error Handling
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("document.pdf")
|
||||
print(result.text_content)
|
||||
except Exception as e:
|
||||
print(f"Conversion failed: {e}")
|
||||
# Handle specific errors (file not found, unsupported format, etc.)
|
||||
```
|
||||
|
||||
## Output Quality Tips
|
||||
|
||||
**For best results:**
|
||||
1. Use Azure Document Intelligence for PDFs with complex tables
|
||||
2. Enable LLM descriptions for documents with important visual content
|
||||
3. Ensure source documents are well-structured (proper headings, etc.)
|
||||
4. For scanned documents, ensure good scan quality for OCR accuracy
|
||||
5. Test with sample documents to verify output quality
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
**Conversion speed depends on:**
|
||||
- Document size and complexity
|
||||
- Number of images (especially with LLM descriptions)
|
||||
- Use of Azure Document Intelligence
|
||||
- Available system resources
|
||||
|
||||
**Optimization tips:**
|
||||
- Disable LLM integration if image descriptions aren't needed
|
||||
- Use standard extraction (not Azure) for simple documents
|
||||
- Process large batches in parallel when possible
|
||||
- Consider streaming for very large documents
|
||||
skills/markitdown/references/media_processing.md (new file, 365 lines)
@@ -0,0 +1,365 @@
|
||||
# Media Processing Reference
|
||||
|
||||
This document provides detailed information about processing images and audio files with MarkItDown.
|
||||
|
||||
## Image Processing
|
||||
|
||||
MarkItDown can extract text from images using OCR and retrieve EXIF metadata.
|
||||
|
||||
### Basic Image Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("photo.jpg")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### Image Processing Features
|
||||
|
||||
**What's extracted:**
|
||||
1. **EXIF Metadata** - Camera settings, date, location, etc.
|
||||
2. **OCR Text** - Text detected in the image (requires tesseract)
|
||||
3. **Image Description** - AI-generated description (with LLM integration)
|
||||
|
||||
### EXIF Metadata Extraction
|
||||
|
||||
Images from cameras and smartphones contain EXIF metadata that's automatically extracted:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("IMG_1234.jpg")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
**Example output includes:**
|
||||
- Camera make and model
|
||||
- Capture date and time
|
||||
- GPS coordinates (if available)
|
||||
- Exposure settings (ISO, shutter speed, aperture)
|
||||
- Image dimensions
|
||||
- Orientation
|
||||
|
||||
### OCR (Optical Character Recognition)
|
||||
|
||||
Extract text from images containing text (screenshots, scanned documents, photos of text):
|
||||
|
||||
**Requirements:**
|
||||
- Install tesseract OCR engine:
|
||||
```bash
|
||||
# macOS
|
||||
brew install tesseract
|
||||
|
||||
# Ubuntu/Debian
|
||||
apt-get install tesseract-ocr
|
||||
|
||||
# Windows
|
||||
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("screenshot.png")
|
||||
print(result.text_content) # Contains OCR'd text
|
||||
```
|
||||
|
||||
**Best practices for OCR:**
|
||||
- Use high-resolution images for better accuracy
|
||||
- Ensure good contrast between text and background
|
||||
- Straighten skewed text if possible
|
||||
- Use well-lit, clear images
|
||||
|
||||
### LLM-Generated Image Descriptions
|
||||
|
||||
Generate detailed, contextual descriptions of images using GPT-4o or other vision models:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
result = md.convert("diagram.png")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
**Custom prompts for specific needs:**
|
||||
|
||||
```python
|
||||
# For diagrams
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
|
||||
)
|
||||
|
||||
# For charts
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Analyze this chart and provide key data points and trends"
|
||||
)
|
||||
|
||||
# For UI screenshots
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Describe this user interface, listing all visible elements and their layout"
|
||||
)
|
||||
```
|
||||
|
||||
### Supported Image Formats
|
||||
|
||||
MarkItDown supports all common image formats:
|
||||
- JPEG/JPG
|
||||
- PNG
|
||||
- GIF
|
||||
- BMP
|
||||
- TIFF
|
||||
- WebP
|
||||
- HEIC (requires additional libraries on some platforms)
|
||||
|
||||
## Audio Processing
|
||||
|
||||
MarkItDown can transcribe audio files to text using speech recognition.
|
||||
|
||||
### Basic Audio Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("recording.wav")
|
||||
print(result.text_content) # Transcribed speech
|
||||
```
|
||||
|
||||
### Audio Transcription Setup
|
||||
|
||||
**Installation:**
|
||||
```bash
|
||||
pip install 'markitdown[audio]'
|
||||
```
|
||||
|
||||
This installs the `speech_recognition` library and dependencies.
|
||||
|
||||
### Supported Audio Formats
|
||||
|
||||
- WAV
|
||||
- AIFF
|
||||
- FLAC
|
||||
- MP3 (requires ffmpeg or libav)
|
||||
- OGG (requires ffmpeg or libav)
|
||||
- Other formats supported by speech_recognition
|
||||
|
||||
### Audio Transcription Engines
|
||||
|
||||
MarkItDown uses the `speech_recognition` library, which supports multiple backends:
|
||||
|
||||
**Default (Google Speech Recognition):**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("audio.wav")
|
||||
```
|
||||
|
||||
**Note:** The default Google Speech Recognition backend requires an internet connection.
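
If an internet connection is not available, one option is to call the `speech_recognition` library directly with an offline engine. This bypasses MarkItDown and assumes the optional pocketsphinx backend is installed.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("recording.wav") as source:
    audio = recognizer.record(source)

# recognize_sphinx() runs locally and requires the pocketsphinx package
text = recognizer.recognize_sphinx(audio)
print(text)
```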
|
||||
|
||||
### Audio Quality Considerations
|
||||
|
||||
For best transcription accuracy:
|
||||
- Use clear audio with minimal background noise
|
||||
- Prefer WAV or FLAC for better quality
|
||||
- Ensure speech is clear and at good volume
|
||||
- Avoid multiple overlapping speakers
|
||||
- Use mono audio when possible
|
||||
|
||||
### Audio Preprocessing Tips
|
||||
|
||||
For better results, consider preprocessing audio:
|
||||
|
||||
```python
|
||||
# Example: If you have pydub installed
|
||||
from pydub import AudioSegment
|
||||
from pydub.effects import normalize
|
||||
|
||||
# Load and normalize audio
|
||||
audio = AudioSegment.from_file("recording.mp3")
|
||||
audio = normalize(audio)
|
||||
audio.export("normalized.wav", format="wav")
|
||||
|
||||
# Then convert with MarkItDown
|
||||
from markitdown import MarkItDown
|
||||
md = MarkItDown()
|
||||
result = md.convert("normalized.wav")
|
||||
```
|
||||
|
||||
## Combined Media Workflows
|
||||
|
||||
### Processing Multiple Images in Batch
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
import os
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
|
||||
# Process all images in directory
|
||||
for filename in os.listdir("images"):
|
||||
if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
|
||||
result = md.convert(f"images/{filename}")
|
||||
|
||||
# Save markdown with same name
|
||||
output = filename.rsplit('.', 1)[0] + '.md'
|
||||
with open(f"output/{output}", "w") as f:
|
||||
f.write(result.text_content)
|
||||
```
|
||||
|
||||
### Screenshot Analysis Pipeline
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
|
||||
)
|
||||
|
||||
screenshots = ["screen1.png", "screen2.png", "screen3.png"]
|
||||
analysis = []
|
||||
|
||||
for screenshot in screenshots:
|
||||
result = md.convert(screenshot)
|
||||
analysis.append({
|
||||
'file': screenshot,
|
||||
'content': result.text_content
|
||||
})
|
||||
|
||||
# Now ready for further processing
|
||||
```
|
||||
|
||||
### Document Images with OCR
|
||||
|
||||
For scanned documents or photos of documents:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Process scanned pages
|
||||
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
|
||||
full_text = []
|
||||
|
||||
for page in pages:
|
||||
result = md.convert(page)
|
||||
full_text.append(result.text_content)
|
||||
|
||||
# Combine into single document
|
||||
document = "\n\n---\n\n".join(full_text)
|
||||
print(document)
|
||||
```
|
||||
|
||||
### Presentation Slide Images
|
||||
|
||||
When you have presentation slides as images:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(
|
||||
llm_client=client,
|
||||
llm_model="gpt-4o",
|
||||
llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
|
||||
)
|
||||
|
||||
# Process slide images
|
||||
for i in range(1, 21): # 20 slides
|
||||
result = md.convert(f"slides/slide_{i}.png")
|
||||
print(f"## Slide {i}\n\n{result.text_content}\n\n")
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Image Processing Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("image.jpg")
|
||||
print(result.text_content)
|
||||
except FileNotFoundError:
|
||||
print("Image file not found")
|
||||
except Exception as e:
|
||||
print(f"Error processing image: {e}")
|
||||
```
|
||||
|
||||
### Audio Processing Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("audio.mp3")
|
||||
print(result.text_content)
|
||||
except Exception as e:
|
||||
print(f"Transcription failed: {e}")
|
||||
# Common issues: format not supported, no speech detected, network error
|
||||
```
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Image Processing
|
||||
|
||||
- **LLM descriptions**: Slower but more informative
|
||||
- **OCR only**: Faster for text extraction
|
||||
- **EXIF only**: Fastest, metadata only
|
||||
- **Batch processing**: Process multiple images in parallel
|
||||
|
||||
### Audio Processing
|
||||
|
||||
- **File size**: Larger files take longer
|
||||
- **Audio length**: Transcription time scales with duration
|
||||
- **Format conversion**: WAV/FLAC are faster than MP3/OGG
|
||||
- **Network dependency**: Default transcription requires internet
|
||||
|
||||
## Use Cases
|
||||
|
||||
### Document Digitization
|
||||
Convert scanned documents or photos of documents to searchable text.
|
||||
|
||||
### Meeting Notes
|
||||
Transcribe audio recordings of meetings to text for analysis.
|
||||
|
||||
### Presentation Analysis
|
||||
Extract content from presentation slide images.
|
||||
|
||||
### Screenshot Documentation
|
||||
Generate descriptions of UI screenshots for documentation.
|
||||
|
||||
### Image Archiving
|
||||
Extract metadata and content from photo collections.
|
||||
|
||||
### Accessibility
|
||||
Generate alt-text descriptions for images using LLM integration.
|
||||
|
||||
### Data Extraction
|
||||
OCR text from images containing tables, forms, or structured data.
|
||||
skills/markitdown/references/structured_data.md (new file, 575 lines)
@@ -0,0 +1,575 @@
|
||||
# Structured Data Handling Reference
|
||||
|
||||
This document provides detailed information about converting structured data formats (CSV, JSON, XML) to Markdown.
|
||||
|
||||
## CSV Files
|
||||
|
||||
Convert CSV (Comma-Separated Values) files to Markdown tables.
|
||||
|
||||
### Basic CSV Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("data.csv")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### CSV to Markdown Table
|
||||
|
||||
CSV files are automatically converted to Markdown table format:
|
||||
|
||||
**Input CSV (`data.csv`):**
|
||||
```csv
|
||||
Name,Age,City
|
||||
Alice,30,New York
|
||||
Bob,25,Los Angeles
|
||||
Charlie,35,Chicago
|
||||
```
|
||||
|
||||
**Output Markdown:**
|
||||
```markdown
|
||||
| Name | Age | City |
|
||||
|---------|-----|-------------|
|
||||
| Alice | 30 | New York |
|
||||
| Bob | 25 | Los Angeles |
|
||||
| Charlie | 35 | Chicago |
|
||||
```
|
||||
|
||||
### CSV Conversion Features
|
||||
|
||||
**What's preserved:**
|
||||
- All column headers
|
||||
- All data rows
|
||||
- Cell values (text and numbers)
|
||||
- Column structure
|
||||
|
||||
**Formatting:**
|
||||
- Headers are bolded (Markdown table format)
|
||||
- Columns are aligned
|
||||
- Empty cells are preserved
|
||||
- Special characters are escaped
|
||||
|
||||
### Large CSV Files
|
||||
|
||||
For large CSV files:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Convert large CSV
|
||||
result = md.convert("large_dataset.csv")
|
||||
|
||||
# Save to file instead of printing
|
||||
with open("output.md", "w") as f:
|
||||
f.write(result.text_content)
|
||||
```
|
||||
|
||||
**Performance considerations:**
|
||||
- Very large files may take time to process
|
||||
- Consider previewing the first few rows for testing
|
||||
- Memory usage scales with file size
|
||||
- Very wide tables may not display well in all Markdown viewers
|
||||
|
||||
### CSV with Special Characters
|
||||
|
||||
CSV files containing special characters are handled automatically:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Handles UTF-8, special characters, quotes, etc.
|
||||
result = md.convert("international_data.csv")
|
||||
```
|
||||
|
||||
### CSV Delimiters
|
||||
|
||||
Standard CSV delimiters are supported:
|
||||
- Comma (`,`) - standard
|
||||
- Semicolon (`;`) - common in European formats
|
||||
- Tab (`\t`) - TSV files
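
If a semicolon- or tab-delimited file does not convert the way you expect, one option is to normalize it to standard comma-separated form first. This sketch uses pandas and reflects an assumption about your workflow rather than documented MarkItDown behavior.

```python
import pandas as pd
from markitdown import MarkItDown

# Re-save a semicolon-delimited file as standard CSV before converting
frame = pd.read_csv("european_data.csv", sep=";")
frame.to_csv("normalized.csv", index=False)

md = MarkItDown()
result = md.convert("normalized.csv")
print(result.text_content)
```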
|
||||
|
||||
### Command-Line CSV Conversion
|
||||
|
||||
```bash
|
||||
# Basic conversion
|
||||
markitdown data.csv -o data.md
|
||||
|
||||
# Multiple CSV files
|
||||
for file in *.csv; do
|
||||
markitdown "$file" -o "${file%.csv}.md"
|
||||
done
|
||||
```
|
||||
|
||||
## JSON Files
|
||||
|
||||
Convert JSON data to readable Markdown format.
|
||||
|
||||
### Basic JSON Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("data.json")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### JSON Formatting
|
||||
|
||||
JSON is converted to a readable, structured Markdown format:
|
||||
|
||||
**Input JSON (`config.json`):**
|
||||
```json
|
||||
{
|
||||
"name": "MyApp",
|
||||
"version": "1.0.0",
|
||||
"dependencies": {
|
||||
"library1": "^2.0.0",
|
||||
"library2": "^3.1.0"
|
||||
},
|
||||
"features": ["auth", "api", "database"]
|
||||
}
|
||||
```
|
||||
|
||||
**Output Markdown:**
|
||||
```markdown
|
||||
## Configuration
|
||||
|
||||
**name:** MyApp
|
||||
**version:** 1.0.0
|
||||
|
||||
### dependencies
|
||||
- **library1:** ^2.0.0
|
||||
- **library2:** ^3.1.0
|
||||
|
||||
### features
|
||||
- auth
|
||||
- api
|
||||
- database
|
||||
```
|
||||
|
||||
### JSON Array Handling
|
||||
|
||||
JSON arrays are converted to lists or tables:
|
||||
|
||||
**Array of objects:**
|
||||
```json
|
||||
[
|
||||
{"id": 1, "name": "Alice", "active": true},
|
||||
{"id": 2, "name": "Bob", "active": false}
|
||||
]
|
||||
```
|
||||
|
||||
**Converted to table:**
|
||||
```markdown
|
||||
| id | name | active |
|
||||
|----|-------|--------|
|
||||
| 1 | Alice | true |
|
||||
| 2 | Bob | false |
|
||||
```
|
||||
|
||||
### Nested JSON Structures
|
||||
|
||||
Nested JSON is converted with appropriate indentation and hierarchy:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Handles deeply nested structures
|
||||
result = md.convert("complex_config.json")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### JSON Lines (JSONL)
|
||||
|
||||
For JSON Lines format (one JSON object per line):
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import json
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Read JSONL file
|
||||
with open("data.jsonl", "r") as f:
|
||||
for line in f:
|
||||
obj = json.loads(line)
|
||||
|
||||
# Convert to JSON temporarily
|
||||
with open("temp.json", "w") as temp:
|
||||
json.dump(obj, temp)
|
||||
|
||||
result = md.convert("temp.json")
|
||||
print(result.text_content)
|
||||
print("\n---\n")
|
||||
```
|
||||
|
||||
### Large JSON Files
|
||||
|
||||
For large JSON files:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Convert large JSON
|
||||
result = md.convert("large_data.json")
|
||||
|
||||
# Save to file
|
||||
with open("output.md", "w") as f:
|
||||
f.write(result.text_content)
|
||||
```
|
||||
|
||||
## XML Files
|
||||
|
||||
Convert XML documents to structured Markdown.
|
||||
|
||||
### Basic XML Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("data.xml")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### XML Structure Preservation
|
||||
|
||||
XML is converted to Markdown maintaining hierarchical structure:
|
||||
|
||||
**Input XML (`book.xml`):**
|
||||
```xml
|
||||
<?xml version="1.0"?>
|
||||
<book>
|
||||
<title>Example Book</title>
|
||||
<author>John Doe</author>
|
||||
<chapters>
|
||||
<chapter id="1">
|
||||
<title>Introduction</title>
|
||||
<content>Chapter 1 content...</content>
|
||||
</chapter>
|
||||
<chapter id="2">
|
||||
<title>Background</title>
|
||||
<content>Chapter 2 content...</content>
|
||||
</chapter>
|
||||
</chapters>
|
||||
</book>
|
||||
```
|
||||
|
||||
**Output Markdown:**
|
||||
```markdown
|
||||
# book
|
||||
|
||||
## title
|
||||
Example Book
|
||||
|
||||
## author
|
||||
John Doe
|
||||
|
||||
## chapters
|
||||
|
||||
### chapter (id: 1)
|
||||
#### title
|
||||
Introduction
|
||||
|
||||
#### content
|
||||
Chapter 1 content...
|
||||
|
||||
### chapter (id: 2)
|
||||
#### title
|
||||
Background
|
||||
|
||||
#### content
|
||||
Chapter 2 content...
|
||||
```
|
||||
|
||||
### XML Attributes
|
||||
|
||||
XML attributes are preserved in the conversion:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("data.xml")
|
||||
# Attributes shown as (attr: value) in headings
|
||||
```
|
||||
|
||||
### XML Namespaces
|
||||
|
||||
XML namespaces are handled:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Handles xmlns and namespaced elements
|
||||
result = md.convert("namespaced.xml")
|
||||
```
|
||||
|
||||
### XML Use Cases
|
||||
|
||||
**Configuration files:**
|
||||
- Convert XML configs to readable format
|
||||
- Document system configurations
|
||||
- Compare configuration files
|
||||
|
||||
**Data interchange:**
|
||||
- Convert XML API responses
|
||||
- Process XML data feeds
|
||||
- Transform between formats
|
||||
|
||||
**Document processing:**
|
||||
- Convert DocBook to Markdown
|
||||
- Process SVG descriptions
|
||||
- Extract structured data
|
||||
|
||||
## Structured Data Workflows
|
||||
|
||||
### CSV Data Analysis Pipeline
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import pandas as pd
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Read CSV for analysis
|
||||
df = pd.read_csv("data.csv")
|
||||
|
||||
# Do analysis
|
||||
summary = df.describe()
|
||||
|
||||
# Convert both to Markdown
|
||||
original = md.convert("data.csv")
|
||||
|
||||
# Save summary as CSV then convert
|
||||
summary.to_csv("summary.csv")
|
||||
summary_md = md.convert("summary.csv")
|
||||
|
||||
print("## Original Data\n")
|
||||
print(original.text_content)
|
||||
print("\n## Statistical Summary\n")
|
||||
print(summary_md.text_content)
|
||||
```
|
||||
|
||||
### JSON API Documentation
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import requests
|
||||
import json
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Fetch JSON from API
|
||||
response = requests.get("https://api.example.com/data")
|
||||
data = response.json()
|
||||
|
||||
# Save as JSON
|
||||
with open("api_response.json", "w") as f:
|
||||
json.dump(data, f, indent=2)
|
||||
|
||||
# Convert to Markdown
|
||||
result = md.convert("api_response.json")
|
||||
|
||||
# Create documentation
|
||||
doc = f"""# API Response Documentation
|
||||
|
||||
## Endpoint
|
||||
GET https://api.example.com/data
|
||||
|
||||
## Response
|
||||
{result.text_content}
|
||||
"""
|
||||
|
||||
with open("api_docs.md", "w") as f:
|
||||
f.write(doc)
|
||||
```
|
||||
|
||||
### XML to Markdown Documentation
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Convert XML documentation
|
||||
xml_files = ["config.xml", "schema.xml", "data.xml"]
|
||||
|
||||
for xml_file in xml_files:
|
||||
result = md.convert(xml_file)
|
||||
|
||||
output_name = xml_file.replace('.xml', '.md')
|
||||
with open(f"docs/{output_name}", "w") as f:
|
||||
f.write(result.text_content)
|
||||
```
|
||||
|
||||
### Multi-Format Data Processing
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import os
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
def convert_structured_data(directory):
|
||||
"""Convert all structured data files in directory."""
|
||||
extensions = {'.csv', '.json', '.xml'}
|
||||
|
||||
for filename in os.listdir(directory):
|
||||
ext = os.path.splitext(filename)[1]
|
||||
|
||||
if ext in extensions:
|
||||
input_path = os.path.join(directory, filename)
|
||||
result = md.convert(input_path)
|
||||
|
||||
# Save Markdown
|
||||
            output_name = os.path.splitext(filename)[0] + '.md'
|
||||
output_path = os.path.join("markdown", output_name)
|
||||
|
||||
with open(output_path, 'w') as f:
|
||||
f.write(result.text_content)
|
||||
|
||||
print(f"Converted: {filename} → {output_name}")
|
||||
|
||||
# Process all structured data
|
||||
convert_structured_data("data")
|
||||
```
|
||||
|
||||
### CSV to JSON to Markdown
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
from markitdown import MarkItDown
|
||||
import json
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Read CSV
|
||||
df = pd.read_csv("data.csv")
|
||||
|
||||
# Convert to JSON
|
||||
json_data = df.to_dict(orient='records')
|
||||
with open("temp.json", "w") as f:
|
||||
json.dump(json_data, f, indent=2)
|
||||
|
||||
# Convert JSON to Markdown
|
||||
result = md.convert("temp.json")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### Database Export to Markdown
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import sqlite3
|
||||
import csv
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Export database query to CSV
|
||||
conn = sqlite3.connect("database.db")
|
||||
cursor = conn.execute("SELECT * FROM users")
|
||||
|
||||
with open("users.csv", "w", newline='') as f:
|
||||
writer = csv.writer(f)
|
||||
writer.writerow([description[0] for description in cursor.description])
|
||||
writer.writerows(cursor.fetchall())
|
||||
|
||||
# Convert to Markdown
|
||||
result = md.convert("users.csv")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### CSV Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("data.csv")
|
||||
print(result.text_content)
|
||||
except FileNotFoundError:
|
||||
print("CSV file not found")
|
||||
except Exception as e:
|
||||
print(f"CSV conversion error: {e}")
|
||||
# Common issues: encoding problems, malformed CSV, delimiter issues
|
||||
```
|
||||
|
||||
### JSON Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("data.json")
|
||||
print(result.text_content)
|
||||
except Exception as e:
|
||||
print(f"JSON conversion error: {e}")
|
||||
# Common issues: invalid JSON syntax, encoding issues
|
||||
```
|
||||
|
||||
### XML Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("data.xml")
|
||||
print(result.text_content)
|
||||
except Exception as e:
|
||||
print(f"XML conversion error: {e}")
|
||||
# Common issues: malformed XML, encoding problems, namespace issues
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### CSV Processing
|
||||
- Check delimiter before conversion
|
||||
- Verify encoding (UTF-8 recommended)
|
||||
- Handle large files with streaming if needed (see the sketch after this list)
|
||||
- Preview output for very wide tables
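
For the streaming point above, a simple approach is to convert a large CSV in chunks and stitch the pieces together. The chunk size and file names are illustrative assumptions.

```python
import pandas as pd
from markitdown import MarkItDown

md = MarkItDown()
sections = []

# Convert 10,000 rows at a time instead of loading the whole file
for i, chunk in enumerate(pd.read_csv("huge_dataset.csv", chunksize=10_000)):
    chunk_path = f"chunk_{i}.csv"
    chunk.to_csv(chunk_path, index=False)
    result = md.convert(chunk_path)
    start_row = i * 10_000 + 1
    end_row = i * 10_000 + len(chunk)
    sections.append(f"## Rows {start_row}-{end_row}\n\n{result.text_content}")

with open("huge_dataset.md", "w", encoding="utf-8") as f:
    f.write("\n\n".join(sections))
```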
|
||||
|
||||
### JSON Processing
|
||||
- Validate JSON before conversion
|
||||
- Consider pretty-printing complex structures
|
||||
- Handle circular references appropriately
|
||||
- Be aware of large array performance
|
||||
|
||||
### XML Processing
|
||||
- Validate XML structure first
|
||||
- Handle namespaces consistently
|
||||
- Consider XPath for selective extraction
|
||||
- Be mindful of very deep nesting
|
||||
|
||||
### Data Quality
|
||||
- Clean data before conversion when possible
|
||||
- Handle missing values appropriately
|
||||
- Verify special character handling
|
||||
- Test with representative samples
|
||||
|
||||
### Performance
|
||||
- Process large files in batches
|
||||
- Use streaming for very large datasets
|
||||
- Monitor memory usage
|
||||
- Cache converted results when appropriate
|
||||
skills/markitdown/references/web_content.md (new file, 478 lines)
@@ -0,0 +1,478 @@
|
||||
# Web Content Extraction Reference
|
||||
|
||||
This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.
|
||||
|
||||
## HTML Conversion
|
||||
|
||||
Convert HTML files and web pages to clean Markdown format.
|
||||
|
||||
### Basic HTML Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("webpage.html")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### HTML Processing Features
|
||||
|
||||
**What's preserved:**
|
||||
- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
|
||||
- Paragraphs and text formatting
|
||||
- Links (`<a>` → `[text](url)`)
|
||||
- Lists (ordered and unordered)
|
||||
- Tables → Markdown tables
|
||||
- Code blocks and inline code
|
||||
- Emphasis (bold, italic)
|
||||
|
||||
**What's removed:**
|
||||
- Scripts and styles
|
||||
- Navigation elements
|
||||
- Advertising content
|
||||
- Boilerplate markup
|
||||
- HTML comments
|
||||
|
||||
### HTML from URLs
|
||||
|
||||
Convert web pages directly from URLs:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import requests
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Fetch and convert web page
|
||||
response = requests.get("https://example.com/article")
|
||||
with open("temp.html", "wb") as f:
|
||||
f.write(response.content)
|
||||
|
||||
result = md.convert("temp.html")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### Clean Web Article Extraction
|
||||
|
||||
For extracting main content from web articles:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import requests
|
||||
from readability import Document # pip install readability-lxml
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Fetch page
|
||||
url = "https://example.com/article"
|
||||
response = requests.get(url)
|
||||
|
||||
# Extract main content
|
||||
doc = Document(response.content)
|
||||
html_content = doc.summary()
|
||||
|
||||
# Save and convert
|
||||
with open("article.html", "w") as f:
|
||||
f.write(html_content)
|
||||
|
||||
result = md.convert("article.html")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### HTML with Images
|
||||
|
||||
HTML files containing images can be enhanced with LLM descriptions:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
result = md.convert("page_with_images.html")
|
||||
```
|
||||
|
||||
## YouTube Transcripts
|
||||
|
||||
Extract video transcripts from YouTube videos.
|
||||
|
||||
### Basic YouTube Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### YouTube Installation
|
||||
|
||||
```bash
|
||||
pip install 'markitdown[youtube]'
|
||||
```
|
||||
|
||||
This installs the `youtube-transcript-api` dependency.
|
||||
|
||||
### YouTube URL Formats
|
||||
|
||||
MarkItDown supports various YouTube URL formats:
|
||||
- `https://www.youtube.com/watch?v=VIDEO_ID`
|
||||
- `https://youtu.be/VIDEO_ID`
|
||||
- `https://www.youtube.com/embed/VIDEO_ID`
|
||||
- `https://m.youtube.com/watch?v=VIDEO_ID`
|
||||
|
||||
### YouTube Transcript Features
|
||||
|
||||
**What's included:**
|
||||
- Full video transcript text
|
||||
- Timestamps (optional, depending on availability)
|
||||
- Video metadata (title, description)
|
||||
- Captions in available languages
|
||||
|
||||
**Transcript languages:**
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Transcript availability depends on the video's captions
# (to request a specific caption language, see the sketch after this block)
|
||||
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
|
||||
```
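
If you need captions in a specific language, one option is to call the underlying `youtube-transcript-api` directly using its `get_transcript` call; this sketch assumes the video actually has captions in the requested language.

```python
from youtube_transcript_api import YouTubeTranscriptApi

# Fetch the Spanish captions for a video, falling back to English
transcript = YouTubeTranscriptApi.get_transcript("VIDEO_ID", languages=["es", "en"])

# Each entry has 'text', 'start', and 'duration' fields
text = " ".join(entry["text"] for entry in transcript)
print(text)
```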
|
||||
|
||||
### YouTube Playlist Processing
|
||||
|
||||
Process multiple videos, such as the entries of a playlist, by iterating over their video IDs:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
video_ids = [
|
||||
"VIDEO_ID_1",
|
||||
"VIDEO_ID_2",
|
||||
"VIDEO_ID_3"
|
||||
]
|
||||
|
||||
transcripts = []
|
||||
for vid_id in video_ids:
|
||||
url = f"https://youtube.com/watch?v={vid_id}"
|
||||
result = md.convert(url)
|
||||
transcripts.append({
|
||||
'video_id': vid_id,
|
||||
'transcript': result.text_content
|
||||
})
|
||||
```
|
||||
|
||||
### YouTube Use Cases
|
||||
|
||||
**Content Analysis:**
|
||||
- Analyze video content without watching
|
||||
- Extract key information from tutorials
|
||||
- Build searchable transcript databases
|
||||
|
||||
**Research:**
|
||||
- Process interview transcripts
|
||||
- Extract lecture content
|
||||
- Analyze presentation content
|
||||
|
||||
**Accessibility:**
|
||||
- Generate text versions of video content
|
||||
- Create searchable video archives
|
||||
|
||||
### YouTube Limitations
|
||||
|
||||
- Requires videos to have captions/transcripts available
|
||||
- Auto-generated captions may have transcription errors
|
||||
- Some videos may disable transcript access
|
||||
- Rate limiting may apply for bulk processing
|
||||
|
||||
## EPUB Books
|
||||
|
||||
Convert EPUB e-books to Markdown format.
|
||||
|
||||
### Basic EPUB Conversion
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
result = md.convert("book.epub")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
### EPUB Processing Features
|
||||
|
||||
**What's extracted:**
|
||||
- Book text content
|
||||
- Chapter structure
|
||||
- Headings and formatting
|
||||
- Tables of contents
|
||||
- Footnotes and references
|
||||
|
||||
**What's preserved:**
|
||||
- Heading hierarchy
|
||||
- Text emphasis (bold, italic)
|
||||
- Links and references
|
||||
- Lists and tables
|
||||
|
||||
### EPUB with Images
|
||||
|
||||
EPUB files often contain images (covers, diagrams, illustrations):
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
result = md.convert("illustrated_book.epub")
|
||||
```
|
||||
|
||||
### EPUB Use Cases
|
||||
|
||||
**Research:**
|
||||
- Convert textbooks to searchable format
|
||||
- Extract content for analysis
|
||||
- Build digital libraries
|
||||
|
||||
**Content Processing:**
|
||||
- Prepare books for LLM training data
|
||||
- Convert to different formats
|
||||
- Create summaries and extracts
|
||||
|
||||
**Accessibility:**
|
||||
- Convert to more accessible formats
|
||||
- Extract text for screen readers
|
||||
- Process for text-to-speech
|
||||
|
||||
## RSS Feeds
|
||||
|
||||
Process RSS feeds to extract article content.
|
||||
|
||||
### Basic RSS Processing
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import feedparser
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Parse RSS feed
|
||||
feed = feedparser.parse("https://example.com/feed.xml")
|
||||
|
||||
# Convert each entry
|
||||
for entry in feed.entries:
|
||||
# Save entry HTML
|
||||
with open("temp.html", "w") as f:
|
||||
f.write(entry.summary)
|
||||
|
||||
result = md.convert("temp.html")
|
||||
print(f"## {entry.title}\n\n{result.text_content}\n\n")
|
||||
```
|
||||
|
||||
## Combined Web Content Workflows
|
||||
|
||||
### Web Scraping Pipeline
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
def scrape_and_convert(url):
|
||||
"""Scrape webpage and convert to Markdown."""
|
||||
response = requests.get(url)
|
||||
soup = BeautifulSoup(response.content, 'html.parser')
|
||||
|
||||
# Extract main content
|
||||
main_content = soup.find('article') or soup.find('main')
|
||||
|
||||
if main_content:
|
||||
# Save HTML
|
||||
with open("temp.html", "w") as f:
|
||||
f.write(str(main_content))
|
||||
|
||||
# Convert to Markdown
|
||||
result = md.convert("temp.html")
|
||||
return result.text_content
|
||||
|
||||
return None
|
||||
|
||||
# Use it
|
||||
markdown = scrape_and_convert("https://example.com/article")
|
||||
print(markdown)
|
||||
```
|
||||
|
||||
### YouTube Learning Content Extraction
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Course videos
|
||||
course_videos = [
|
||||
("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
|
||||
("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
|
||||
("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
|
||||
]
|
||||
|
||||
course_content = []
|
||||
for url, title in course_videos:
|
||||
result = md.convert(url)
|
||||
course_content.append(f"# {title}\n\n{result.text_content}")
|
||||
|
||||
# Combine into course document
|
||||
full_course = "\n\n---\n\n".join(course_content)
|
||||
with open("course_transcript.md", "w") as f:
|
||||
f.write(full_course)
|
||||
```
|
||||
|
||||
### Documentation Scraping
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import requests
|
||||
from urllib.parse import urljoin, urlparse
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
def scrape_documentation(base_url, page_urls):
|
||||
"""Scrape multiple documentation pages."""
|
||||
docs = []
|
||||
|
||||
for page_url in page_urls:
|
||||
full_url = urljoin(base_url, page_url)
|
||||
|
||||
# Fetch page
|
||||
response = requests.get(full_url)
|
||||
with open("temp.html", "wb") as f:
|
||||
f.write(response.content)
|
||||
|
||||
# Convert
|
||||
result = md.convert("temp.html")
|
||||
docs.append({
|
||||
'url': full_url,
|
||||
'content': result.text_content
|
||||
})
|
||||
|
||||
return docs
|
||||
|
||||
# Example usage
|
||||
base = "https://docs.example.com/"
|
||||
pages = ["intro.html", "getting-started.html", "api.html"]
|
||||
documentation = scrape_documentation(base, pages)
|
||||
```
|
||||
|
||||
### EPUB Library Processing
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
import os
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
def process_epub_library(library_path, output_path):
|
||||
"""Convert all EPUB books in a directory."""
|
||||
for filename in os.listdir(library_path):
|
||||
if filename.endswith('.epub'):
|
||||
epub_path = os.path.join(library_path, filename)
|
||||
|
||||
try:
|
||||
result = md.convert(epub_path)
|
||||
|
||||
# Save markdown
|
||||
output_file = filename.replace('.epub', '.md')
|
||||
output_full = os.path.join(output_path, output_file)
|
||||
|
||||
with open(output_full, 'w') as f:
|
||||
f.write(result.text_content)
|
||||
|
||||
print(f"Converted: {filename}")
|
||||
except Exception as e:
|
||||
print(f"Failed to convert {filename}: {e}")
|
||||
|
||||
# Process library
|
||||
process_epub_library("books", "markdown_books")
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### HTML Conversion Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("webpage.html")
|
||||
print(result.text_content)
|
||||
except FileNotFoundError:
|
||||
print("HTML file not found")
|
||||
except Exception as e:
|
||||
print(f"Conversion error: {e}")
|
||||
```
|
||||
|
||||
### YouTube Transcript Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
|
||||
print(result.text_content)
|
||||
except Exception as e:
|
||||
print(f"Failed to get transcript: {e}")
|
||||
# Common issues: No transcript available, video unavailable, network error
|
||||
```
|
||||
|
||||
### EPUB Conversion Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("book.epub")
|
||||
print(result.text_content)
|
||||
except Exception as e:
|
||||
print(f"EPUB processing error: {e}")
|
||||
# Common issues: Corrupted file, unsupported DRM, invalid format
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### HTML Processing
|
||||
- Clean HTML before conversion for better results
|
||||
- Use readability libraries to extract main content
|
||||
- Handle different encodings appropriately
|
||||
- Remove unnecessary markup
|
||||
|
||||
### YouTube Processing
|
||||
- Check transcript availability before batch processing
|
||||
- Handle API rate limits gracefully
|
||||
- Store transcripts to avoid re-fetching
|
||||
- Respect YouTube's terms of service
|
||||
|
||||
### EPUB Processing
|
||||
- DRM-protected EPUBs cannot be processed
|
||||
- Large EPUBs may require more memory
|
||||
- Some formatting may not translate perfectly
|
||||
- Test with representative samples first
|
||||
|
||||
### Web Scraping Ethics
|
||||
- Respect robots.txt
|
||||
- Add delays between requests
|
||||
- Identify your scraper in User-Agent
|
||||
- Cache results to minimize requests
|
||||
- Follow website terms of service
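
The points above can be folded into the fetch step of the earlier scraping pipeline. A minimal sketch, assuming the site permits scraping and the User-Agent string is your own:

```python
import time
import requests
from markitdown import MarkItDown

md = MarkItDown()
HEADERS = {"User-Agent": "my-docs-scraper/0.1 (contact: you@example.com)"}  # identify yourself

def polite_convert(urls, delay_seconds=2.0):
    """Fetch pages with a delay between requests and convert each to Markdown."""
    pages = []
    for url in urls:
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()
        with open("temp.html", "wb") as f:
            f.write(response.content)
        pages.append(md.convert("temp.html").text_content)
        time.sleep(delay_seconds)  # be gentle with the server
    return pages

docs = polite_convert(["https://example.com/a", "https://example.com/b"])
```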
|
||||