Initial commit

skills/markitdown/SKILL.md (new file, 241 lines)

---
name: markitdown
description: Convert various file formats (PDF, Office documents, images, audio, web content, structured data) to Markdown optimized for LLM processing. Use when converting documents to markdown, extracting text from PDFs/Office files, transcribing audio, performing OCR on images, extracting YouTube transcripts, or processing batches of files. Supports 20+ formats including DOCX, XLSX, PPTX, PDF, HTML, EPUB, CSV, JSON, images with OCR, and audio with transcription.
---

# MarkItDown

## Overview

MarkItDown is a Python utility that converts various file formats into Markdown, optimized for use with large language models and text analysis pipelines. It preserves document structure (headings, lists, tables, hyperlinks) while producing clean, token-efficient Markdown output.

## When to Use This Skill

Use this skill when users request:
- Converting documents to Markdown format
- Extracting text from PDF, Word, PowerPoint, or Excel files
- Performing OCR on images to extract text
- Transcribing audio files to text
- Extracting YouTube video transcripts
- Processing HTML, EPUB, or web content to Markdown
- Converting structured data (CSV, JSON, XML) to readable Markdown
- Batch converting multiple files or ZIP archives
- Preparing documents for LLM analysis or RAG systems

## Core Capabilities

### 1. Document Conversion

Convert Office documents and PDFs to Markdown while preserving structure.

**Supported formats:**
- PDF files (with optional Azure Document Intelligence integration)
- Word documents (DOCX)
- PowerPoint presentations (PPTX)
- Excel spreadsheets (XLSX, XLS)

**Basic usage:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

**Command-line:**
```bash
markitdown document.pdf -o output.md
```

See `references/document_conversion.md` for detailed documentation on document-specific features.

### 2. Media Processing

Extract text from images using OCR and transcribe audio files to text.

**Supported formats:**
- Images (JPEG, PNG, GIF, etc.) with EXIF metadata extraction
- Audio files with speech transcription (requires speech_recognition)

**Image with OCR:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("image.jpg")
print(result.text_content)  # Includes EXIF metadata and OCR text
```

**Audio transcription:**
```python
result = md.convert("audio.wav")
print(result.text_content)  # Transcribed speech
```

See `references/media_processing.md` for advanced media handling options.

### 3. Web Content Extraction

Convert web-based content and e-books to Markdown.

**Supported formats:**
- HTML files and web pages
- YouTube video transcripts (via URL)
- EPUB books
- RSS feeds

**YouTube transcript:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

See `references/web_content.md` for web extraction details.

### 4. Structured Data Handling

Convert structured data formats to readable Markdown tables.

**Supported formats:**
- CSV files
- JSON files
- XML files

**CSV to Markdown table:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)  # Formatted as Markdown table
```

See `references/structured_data.md` for format-specific options.

### 5. Advanced Integrations

Enhance conversion quality with AI-powered features.

**Azure Document Intelligence:**
For enhanced PDF processing with better table extraction and layout analysis:
```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<endpoint>", docintel_key="<key>")
result = md.convert("complex.pdf")
```

**LLM-Powered Image Descriptions:**
Generate detailed image descriptions using GPT-4o:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")  # Images described with LLM
```

See `references/advanced_integrations.md` for integration details.

### 6. Batch Processing

Process multiple files or entire ZIP archives at once.

**ZIP file processing:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.text_content)  # All files converted and concatenated
```

**Batch script:**
Use the provided batch processing script for directory conversion:
```bash
python scripts/batch_convert.py /path/to/documents /path/to/output
```

See `scripts/batch_convert.py` for implementation details.

## Installation

**Full installation (all features):**
```bash
uv pip install 'markitdown[all]'
```

**Modular installation (specific features):**
```bash
uv pip install 'markitdown[pdf]'      # PDF support
uv pip install 'markitdown[docx]'     # Word support
uv pip install 'markitdown[pptx]'     # PowerPoint support
uv pip install 'markitdown[xlsx]'     # Excel support
uv pip install 'markitdown[audio]'    # Audio transcription
uv pip install 'markitdown[youtube]'  # YouTube transcripts
```

**Requirements:**
- Python 3.10 or higher

## Output Format

MarkItDown produces clean, token-efficient Markdown optimized for LLM consumption:
- Preserves headings, lists, and tables
- Maintains hyperlinks and formatting
- Includes metadata where relevant (EXIF, document properties)
- No temporary files created (streaming approach)

## Common Workflows

**Preparing documents for RAG:**
```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert knowledge base documents
docs = ["manual.pdf", "guide.docx", "faq.html"]
markdown_content = []

for doc in docs:
    result = md.convert(doc)
    markdown_content.append(result.text_content)

# Now ready for embedding and indexing
```

**Document analysis pipeline:**
```bash
# Convert all PDFs in directory
for file in documents/*.pdf; do
  markitdown "$file" -o "markdown/$(basename "$file" .pdf).md"
done
```

## Plugin System

MarkItDown supports extensible plugins for custom conversion logic. Plugins are disabled by default for security:

```python
from markitdown import MarkItDown

# Enable plugins if needed
md = MarkItDown(enable_plugins=True)
```

## Resources

This skill includes comprehensive reference documentation for each capability:

- **references/document_conversion.md** - Detailed PDF, DOCX, PPTX, XLSX conversion options
- **references/media_processing.md** - Image OCR and audio transcription details
- **references/web_content.md** - HTML, YouTube, and EPUB extraction
- **references/structured_data.md** - CSV, JSON, XML conversion formats
- **references/advanced_integrations.md** - Azure Document Intelligence and LLM integration
- **scripts/batch_convert.py** - Batch processing utility for directories

skills/markitdown/references/advanced_integrations.md (new file, 538 lines)

# Advanced Integrations Reference

This document provides detailed information about advanced MarkItDown features, including Azure Document Intelligence integration, LLM-powered image descriptions, and the plugin system.

## Azure Document Intelligence Integration

Azure Document Intelligence (formerly Form Recognizer) provides superior PDF processing with advanced table extraction and layout analysis.

### Setup

**Prerequisites:**
1. Azure subscription
2. Document Intelligence resource created in Azure
3. Endpoint URL and API key

**Create Azure Resource:**
```bash
# Using Azure CLI
az cognitiveservices account create \
  --name my-doc-intelligence \
  --resource-group my-resource-group \
  --kind FormRecognizer \
  --sku F0 \
  --location eastus
```

### Basic Usage

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)

result = md.convert("complex_document.pdf")
print(result.text_content)
```

### Configuration from Environment Variables

```python
import os
from markitdown import MarkItDown

# Set environment variables
os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'

# Read the credentials from the environment instead of hard-coding them
md = MarkItDown(
    docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
    docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
)

result = md.convert("document.pdf")
```

### When to Use Azure Document Intelligence

**Use for:**
- Complex PDFs with sophisticated tables
- Multi-column layouts
- Forms and structured documents
- Scanned documents requiring OCR
- PDFs with mixed content types
- Documents with intricate formatting

**Benefits over standard extraction:**
- **Superior table extraction** - Better handling of merged cells, complex layouts
- **Layout analysis** - Understands document structure (headers, footers, columns)
- **Form fields** - Extracts key-value pairs from forms
- **Reading order** - Maintains correct text flow in complex layouts
- **OCR quality** - High-quality text extraction from scanned documents

### Comparison Example

**Standard extraction:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("complex_table.pdf")
# May struggle with complex tables
```

**Azure Document Intelligence:**
```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)
result = md.convert("complex_table.pdf")
# Better table reconstruction and layout understanding
```

### Cost Considerations

Azure Document Intelligence is a paid service:
- **Free tier**: 500 pages per month
- **Paid tiers**: Pay per page processed
- Monitor usage to control costs
- Use standard extraction for simple documents

### Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Document Intelligence error: {e}")
    # Common issues: authentication, quota exceeded, unsupported file
```

## LLM-Powered Image Descriptions

Generate detailed, contextual descriptions for images using large language models.

### Setup with OpenAI

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

result = md.convert("image.jpg")
print(result.text_content)
```

### Supported Use Cases

**Images in documents:**
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# PowerPoint with images
result = md.convert("presentation.pptx")

# Word documents with images
result = md.convert("report.docx")

# Standalone images
result = md.convert("diagram.png")
```

### Custom Prompts

Customize the LLM prompt for specific needs:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
)

# For scientific figures
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
)
```

### Model Selection

**GPT-4o (Recommended):**
- Best vision capabilities
- High-quality descriptions
- Good at understanding context
- Higher cost per image

**GPT-4o-mini:**
- Lower cost alternative
- Good for simpler images
- Faster processing
- May miss subtle details

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# High quality (more expensive)
md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Budget option (less expensive)
md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
```

### Configuration from Environment

```python
import os
from markitdown import MarkItDown
from openai import OpenAI

# Set API key in environment
os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'

client = OpenAI()  # Uses env variable
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```

### Alternative LLM Providers

**Anthropic Claude:**
```python
from markitdown import MarkItDown
from anthropic import Anthropic

# Note: Check current compatibility with MarkItDown
client = Anthropic(api_key="YOUR-API-KEY")
# May require adapter for MarkItDown compatibility
```

**Azure OpenAI:**
```python
from markitdown import MarkItDown
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="YOUR-AZURE-KEY",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
)

md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```

### Cost Management

**Strategies to reduce LLM costs:**

1. **Selective processing:**
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()

# Only use LLM for important documents
if is_important_document(file):
    md = MarkItDown(llm_client=client, llm_model="gpt-4o")
else:
    md = MarkItDown()  # Standard processing

result = md.convert(file)
```

2. **Image filtering:**
```python
# Pre-process to identify images that need descriptions
# Only use LLM for complex/important images
```

3. **Batch processing:**
```python
# Process multiple images in batches
# Monitor costs and set limits
```

4. **Model selection:**
```python
# Use gpt-4o-mini for simple images
# Reserve gpt-4o for complex visualizations
```
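
Strategies 2 and 4 can be combined into a simple routing rule. The sketch below is an illustrative assumption rather than a MarkItDown feature: it sends small image files to the cheaper model and reserves `gpt-4o` for larger ones, with the 200 KB threshold and file names chosen arbitrarily.

```python
import os

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md_mini = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
md_full = MarkItDown(llm_client=client, llm_model="gpt-4o")

def convert_image(path, size_threshold_bytes=200_000):
    """Route small images to gpt-4o-mini and larger ones to gpt-4o (assumed heuristic)."""
    md = md_full if os.path.getsize(path) > size_threshold_bytes else md_mini
    return md.convert(path)

# Hypothetical file names for illustration
results = [convert_image(p) for p in ["logo.png", "architecture_diagram.png"]]
```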

### Performance Considerations

**LLM processing adds latency:**
- Each image requires an API call
- Processing time: 1-5 seconds per image
- Network dependent
- Consider parallel processing for multiple images

**Batch optimization:**
```python
from markitdown import MarkItDown
from openai import OpenAI
import concurrent.futures

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

def process_image(image_path):
    return md.convert(image_path)

# Process multiple images in parallel
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_image, images))
```

## Combined Advanced Features

### Azure Document Intelligence + LLM Descriptions

Combine both for maximum quality:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-AZURE-ENDPOINT",
    docintel_key="YOUR-AZURE-KEY"
)

# Best possible PDF conversion with image descriptions
result = md.convert("complex_report.pdf")
```

**Use cases:**
- Research papers with figures
- Business reports with charts
- Technical documentation with diagrams
- Presentations with visual data

### Smart Document Processing Pipeline

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

def smart_convert(file_path):
    """Intelligently choose processing method based on file type."""
    client = OpenAI()
    ext = os.path.splitext(file_path)[1].lower()

    # PDFs with complex tables: Use Azure
    if ext == '.pdf':
        md = MarkItDown(
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )

    # Documents/presentations with images: Use LLM
    elif ext in ['.pptx', '.docx']:
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o"
        )

    # Simple formats: Standard processing
    else:
        md = MarkItDown()

    return md.convert(file_path)

# Use it
result = smart_convert("document.pdf")
```

## Plugin System

MarkItDown supports custom plugins for extending functionality.

### Plugin Architecture

Plugins are disabled by default for security:

```python
from markitdown import MarkItDown

# Enable plugins
md = MarkItDown(enable_plugins=True)
```

### Creating Custom Plugins

**Plugin structure:**
```python
class CustomConverter:
    """Custom converter plugin for MarkItDown."""

    def can_convert(self, file_path):
        """Check if this plugin can handle the file."""
        return file_path.endswith('.custom')

    def convert(self, file_path):
        """Convert file to Markdown."""
        # Your conversion logic here
        return {
            'text_content': '# Converted Content\n\n...'
        }
```

### Plugin Registration

```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)

# Register custom plugin
md.register_plugin(CustomConverter())

# Use normally
result = md.convert("file.custom")
```

### Plugin Use Cases

**Custom formats:**
- Proprietary document formats
- Specialized scientific data formats
- Legacy file formats

**Enhanced processing:**
- Custom OCR engines
- Specialized table extraction
- Domain-specific parsing

**Integration:**
- Enterprise document systems
- Custom databases
- Specialized APIs

### Plugin Security

**Important security considerations:**
- Plugins run with full system access
- Only enable for trusted plugins
- Validate plugin code before use
- Disable plugins in production unless required

## Error Handling for Advanced Features

```python
import os

from markitdown import MarkItDown
from openai import OpenAI

def robust_convert(file_path):
    """Convert with fallback strategies."""
    try:
        # Try with all advanced features
        client = OpenAI()
        md = MarkItDown(
            llm_client=client,
            llm_model="gpt-4o",
            docintel_endpoint=os.getenv('AZURE_ENDPOINT'),
            docintel_key=os.getenv('AZURE_KEY')
        )
        return md.convert(file_path)

    except Exception as azure_error:
        print(f"Azure failed: {azure_error}")

        try:
            # Fallback: LLM only
            client = OpenAI()
            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
            return md.convert(file_path)

        except Exception as llm_error:
            print(f"LLM failed: {llm_error}")

            # Final fallback: Standard processing
            md = MarkItDown()
            return md.convert(file_path)

# Use it
result = robust_convert("document.pdf")
```

## Best Practices

### Azure Document Intelligence
- Use for complex PDFs only (cost optimization)
- Monitor usage and costs
- Store credentials securely
- Handle quota limits gracefully
- Fall back to standard processing if needed

### LLM Integration
- Use appropriate models for task complexity
- Customize prompts for specific use cases
- Monitor API costs
- Implement rate limiting
- Cache results when possible (see the sketch after this list)
- Handle API errors gracefully
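
A minimal caching sketch, assuming a local cache directory keyed by the source file's SHA-256 hash; the cache location and layout are illustrative choices, not a MarkItDown convention:

```python
import hashlib
import os

from markitdown import MarkItDown

CACHE_DIR = ".markitdown_cache"  # assumed location, not defined by MarkItDown

def convert_with_cache(md: MarkItDown, path: str) -> str:
    """Return cached Markdown when the source file is unchanged; otherwise convert and cache."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    cache_file = os.path.join(CACHE_DIR, f"{digest}.md")

    if os.path.exists(cache_file):
        with open(cache_file, "r", encoding="utf-8") as f:
            return f.read()

    text = md.convert(path).text_content
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```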

### Combined Features
- Test cost/quality tradeoffs
- Use selectively for important documents
- Implement intelligent routing
- Monitor performance and costs
- Have fallback strategies

### Security
- Store API keys securely (environment variables, secrets manager)
- Never commit credentials to code
- Disable plugins unless required
- Validate all inputs
- Use least privilege access

skills/markitdown/references/document_conversion.md (new file, 273 lines)

# Document Conversion Reference

This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.

## PDF Files

PDF conversion extracts text, tables, and structure from PDF documents.

### Basic PDF Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

### PDF with Azure Document Intelligence

For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
    docintel_key="YOUR-API-KEY"
)
result = md.convert("complex_table.pdf")
print(result.text_content)
```

**Benefits of Azure Document Intelligence:**
- Superior table extraction and reconstruction
- Better handling of multi-column layouts
- Form field recognition
- Improved text ordering in complex documents

### PDF Handling Notes

- Scanned PDFs require OCR (automatically handled if tesseract is installed)
- Password-protected PDFs are not supported
- Large PDFs may take longer to process
- Vector graphics and embedded images are extracted where possible

## Word Documents (DOCX)

Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.

### Basic DOCX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
```

### DOCX Structure Preservation

MarkItDown preserves:
- **Headings** → Markdown headers (`#`, `##`, etc.)
- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
- **Lists** → Markdown lists (ordered and unordered)
- **Tables** → Markdown tables
- **Hyperlinks** → Markdown links `[text](url)`
- **Images** → Referenced with descriptions (can use LLM for descriptions)

### Command-Line Usage

```bash
# Basic conversion
markitdown report.docx -o report.md

# With output directory
markitdown report.docx -o output/report.md
```

### DOCX with Images

To generate descriptions for images in Word documents, use LLM integration:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("document_with_images.docx")
```

## PowerPoint Presentations (PPTX)

PowerPoint conversion extracts text from slides while preserving structure.

### Basic PPTX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.text_content)
```

### PPTX Structure

MarkItDown processes presentations as:
- Each slide becomes a major section
- Slide titles become headers
- Bullet points are preserved
- Tables are converted to Markdown tables
- Notes are included if present

### PPTX with Image Descriptions

Presentations often contain important visual information. Use LLM integration to describe images:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this slide image in detail, focusing on key information"
)
result = md.convert("presentation.pptx")
```

**Custom prompts for presentations:**
- "Describe charts and graphs with their key data points"
- "Explain diagrams and their relationships"
- "Summarize visual content for accessibility"

## Excel Spreadsheets (XLSX, XLS)

Excel conversion formats spreadsheet data as Markdown tables.

### Basic XLSX Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")
print(result.text_content)
```

### Multi-Sheet Workbooks

For workbooks with multiple sheets:
- Each sheet becomes a separate section
- Sheet names are used as headers
- Empty sheets are skipped
- Formulas are evaluated (values shown, not formulas)

### XLSX Conversion Details

**What's preserved:**
- Cell values (text, numbers, dates)
- Table structure (rows and columns)
- Sheet names
- Cell formatting (bold headers)

**What's not preserved:**
- Formulas (only computed values)
- Charts and graphs (use LLM integration for descriptions)
- Cell colors and conditional formatting
- Comments and notes

### Large Spreadsheets

For large spreadsheets, consider:
- Processing may be slower for files with many rows/columns
- Very wide tables may not format well in Markdown
- Consider filtering or preprocessing data if possible
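
One way to do that preprocessing is sketched below: pare the workbook down to the columns and rows you actually need with pandas, write a smaller CSV, and convert that instead. pandas and openpyxl are extra dependencies, and the column names and row limit are placeholders.

```python
import pandas as pd

from markitdown import MarkItDown

# Keep only the columns and rows of interest before conversion
df = pd.read_excel("large_dataset.xlsx")               # reading .xlsx requires openpyxl
subset = df[["date", "region", "revenue"]].head(500)   # placeholder columns and row limit
subset.to_csv("subset.csv", index=False)

md = MarkItDown()
result = md.convert("subset.csv")
print(result.text_content)
```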

### XLS (Legacy Excel) Files

Legacy `.xls` files are supported but require additional dependencies:

```bash
pip install 'markitdown[xls]'
```

Then use normally:
```python
md = MarkItDown()
result = md.convert("legacy_data.xls")
```

## Common Document Conversion Patterns

### Batch Document Processing

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

# Process all documents in a directory
for filename in os.listdir("documents"):
    if filename.endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
        result = md.convert(f"documents/{filename}")

        # Save to output directory
        output_name = os.path.splitext(filename)[0] + ".md"
        with open(f"markdown/{output_name}", "w") as f:
            f.write(result.text_content)
```

### Document with Mixed Content

For documents containing multiple types of content (text, tables, images):

```python
from markitdown import MarkItDown
from openai import OpenAI

# Use LLM for image descriptions + Azure for complex tables
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    docintel_endpoint="YOUR-ENDPOINT",
    docintel_key="YOUR-KEY"
)

result = md.convert("complex_report.pdf")
```

### Error Handling

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except Exception as e:
    print(f"Conversion failed: {e}")
    # Handle specific errors (file not found, unsupported format, etc.)
```

## Output Quality Tips

**For best results:**
1. Use Azure Document Intelligence for PDFs with complex tables
2. Enable LLM descriptions for documents with important visual content
3. Ensure source documents are well-structured (proper headings, etc.)
4. For scanned documents, ensure good scan quality for OCR accuracy
5. Test with sample documents to verify output quality

## Performance Considerations

**Conversion speed depends on:**
- Document size and complexity
- Number of images (especially with LLM descriptions)
- Use of Azure Document Intelligence
- Available system resources

**Optimization tips:**
- Disable LLM integration if image descriptions aren't needed
- Use standard extraction (not Azure) for simple documents
- Process large batches in parallel when possible
- Consider streaming for very large documents

skills/markitdown/references/media_processing.md (new file, 365 lines)

# Media Processing Reference

This document provides detailed information about processing images and audio files with MarkItDown.

## Image Processing

MarkItDown can extract text from images using OCR and retrieve EXIF metadata.

### Basic Image Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
```

### Image Processing Features

**What's extracted:**
1. **EXIF Metadata** - Camera settings, date, location, etc.
2. **OCR Text** - Text detected in the image (requires tesseract)
3. **Image Description** - AI-generated description (with LLM integration)

### EXIF Metadata Extraction

Images from cameras and smartphones contain EXIF metadata that's automatically extracted:

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)
```

**Example output includes:**
- Camera make and model
- Capture date and time
- GPS coordinates (if available)
- Exposure settings (ISO, shutter speed, aperture)
- Image dimensions
- Orientation

### OCR (Optical Character Recognition)

Extract text from images containing text (screenshots, scanned documents, photos of text):

**Requirements:**
- Install tesseract OCR engine:
```bash
# macOS
brew install tesseract

# Ubuntu/Debian
apt-get install tesseract-ocr

# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki
```

**Usage:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content)  # Contains OCR'd text
```

**Best practices for OCR:**
- Use high-resolution images for better accuracy
- Ensure good contrast between text and background
- Straighten skewed text if possible
- Use well-lit, clear images
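
If a photo is dim or low-contrast, a little cleanup before conversion often helps. This is a hedged sketch using Pillow (not part of MarkItDown): convert to grayscale, boost contrast, and upscale, then hand the cleaned file to MarkItDown. The file names are placeholders.

```python
from PIL import Image, ImageOps

from markitdown import MarkItDown

# Basic cleanup before OCR: grayscale, auto-contrast, and 2x upscale
img = Image.open("receipt_photo.jpg")
img = ImageOps.grayscale(img)
img = ImageOps.autocontrast(img)
img = img.resize((img.width * 2, img.height * 2))
img.save("receipt_cleaned.png")

md = MarkItDown()
result = md.convert("receipt_cleaned.png")
print(result.text_content)
```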

### LLM-Generated Image Descriptions

Generate detailed, contextual descriptions of images using GPT-4o or other vision models:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)
```

**Custom prompts for specific needs:**

```python
# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
```

### Supported Image Formats

MarkItDown supports all common image formats:
- JPEG/JPG
- PNG
- GIF
- BMP
- TIFF
- WebP
- HEIC (requires additional libraries on some platforms)
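
Where HEIC support is missing, one workaround is to convert the image to JPEG first. The sketch below assumes the `pillow-heif` package (`pip install pillow-heif`), which is an extra dependency rather than something MarkItDown requires; the file names are placeholders.

```python
from pillow_heif import register_heif_opener
from PIL import Image

from markitdown import MarkItDown

register_heif_opener()  # lets Pillow open .heic files

# Convert the HEIC photo to JPEG, then process it as usual
Image.open("photo.heic").convert("RGB").save("photo.jpg", quality=90)

md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
```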

## Audio Processing

MarkItDown can transcribe audio files to text using speech recognition.

### Basic Audio Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content)  # Transcribed speech
```

### Audio Transcription Setup

**Installation:**
```bash
pip install 'markitdown[audio]'
```

This installs the `speech_recognition` library and dependencies.

### Supported Audio Formats

- WAV
- AIFF
- FLAC
- MP3 (requires ffmpeg or libav)
- OGG (requires ffmpeg or libav)
- Other formats supported by speech_recognition

### Audio Transcription Engines

MarkItDown uses the `speech_recognition` library, which supports multiple backends:

**Default (Google Speech Recognition):**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("audio.wav")
```

**Note:** Default Google Speech Recognition requires an internet connection.

### Audio Quality Considerations

For best transcription accuracy:
- Use clear audio with minimal background noise
- Prefer WAV or FLAC for better quality
- Ensure speech is clear and at good volume
- Avoid multiple overlapping speakers
- Use mono audio when possible

### Audio Preprocessing Tips

For better results, consider preprocessing audio:

```python
# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize

# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")

# Then convert with MarkItDown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")
```

## Combined Media Workflows

### Processing Multiple Images in Batch

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Process all images in directory
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(f"images/{filename}")

        # Save markdown with same name
        output = filename.rsplit('.', 1)[0] + '.md'
        with open(f"output/{output}", "w") as f:
            f.write(result.text_content)
```

### Screenshot Analysis Pipeline

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)

screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []

for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })

# Now ready for further processing
```

### Document Images with OCR

For scanned documents or photos of documents:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []

for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)

# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)
```

### Presentation Slide Images

When you have presentation slides as images:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)

# Process slide images
for i in range(1, 21):  # 20 slides
    result = md.convert(f"slides/slide_{i}.png")
    print(f"## Slide {i}\n\n{result.text_content}\n\n")
```

## Error Handling

### Image Processing Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")
```

### Audio Processing Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
```

## Performance Optimization

### Image Processing

- **LLM descriptions**: Slower but more informative
- **OCR only**: Faster for text extraction
- **EXIF only**: Fastest, metadata only
- **Batch processing**: Process multiple images in parallel

### Audio Processing

- **File size**: Larger files take longer
- **Audio length**: Transcription time scales with duration
- **Format conversion**: WAV/FLAC are faster than MP3/OGG
- **Network dependency**: Default transcription requires internet

## Use Cases

### Document Digitization
Convert scanned documents or photos of documents to searchable text.

### Meeting Notes
Transcribe audio recordings of meetings to text for analysis.

### Presentation Analysis
Extract content from presentation slide images.

### Screenshot Documentation
Generate descriptions of UI screenshots for documentation.

### Image Archiving
Extract metadata and content from photo collections.

### Accessibility
Generate alt-text descriptions for images using LLM integration.

### Data Extraction
OCR text from images containing tables, forms, or structured data.

skills/markitdown/references/structured_data.md (new file, 575 lines)

# Structured Data Handling Reference

This document provides detailed information about converting structured data formats (CSV, JSON, XML) to Markdown.

## CSV Files

Convert CSV (Comma-Separated Values) files to Markdown tables.

### Basic CSV Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)
```

### CSV to Markdown Table

CSV files are automatically converted to Markdown table format:

**Input CSV (`data.csv`):**
```csv
Name,Age,City
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
```

**Output Markdown:**
```markdown
| Name    | Age | City        |
|---------|-----|-------------|
| Alice   | 30  | New York    |
| Bob     | 25  | Los Angeles |
| Charlie | 35  | Chicago     |
```

### CSV Conversion Features

**What's preserved:**
- All column headers
- All data rows
- Cell values (text and numbers)
- Column structure

**Formatting:**
- Headers are bolded (Markdown table format)
- Columns are aligned
- Empty cells are preserved
- Special characters are escaped

### Large CSV Files

For large CSV files:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert large CSV
result = md.convert("large_dataset.csv")

# Save to file instead of printing
with open("output.md", "w") as f:
    f.write(result.text_content)
```

**Performance considerations:**
- Very large files may take time to process
- Consider previewing first few rows for testing
- Memory usage scales with file size
- Very wide tables may not display well in all Markdown viewers

### CSV with Special Characters

CSV files containing special characters are handled automatically:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Handles UTF-8, special characters, quotes, etc.
result = md.convert("international_data.csv")
```

### CSV Delimiters

Standard CSV delimiters are supported (a delimiter-detection sketch follows this list):
- Comma (`,`) - standard
- Semicolon (`;`) - common in European formats
- Tab (`\t`) - TSV files
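
When you are not sure which delimiter an export uses, the standard-library `csv.Sniffer` can detect it so the file can be normalized to comma-separated form before conversion. This is a preprocessing sketch outside MarkItDown; the file names are placeholders.

```python
import csv

from markitdown import MarkItDown

# Detect the delimiter, then rewrite the data as a standard comma-separated CSV
with open("export.txt", newline="", encoding="utf-8") as f:
    dialect = csv.Sniffer().sniff(f.read(4096), delimiters=",;\t")
    f.seek(0)
    rows = list(csv.reader(f, dialect))

with open("normalized.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

md = MarkItDown()
print(md.convert("normalized.csv").text_content)
```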

### Command-Line CSV Conversion

```bash
# Basic conversion
markitdown data.csv -o data.md

# Multiple CSV files
for file in *.csv; do
  markitdown "$file" -o "${file%.csv}.md"
done
```

## JSON Files

Convert JSON data to readable Markdown format.

### Basic JSON Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.json")
print(result.text_content)
```

### JSON Formatting

JSON is converted to a readable, structured Markdown format:

**Input JSON (`config.json`):**
```json
{
  "name": "MyApp",
  "version": "1.0.0",
  "dependencies": {
    "library1": "^2.0.0",
    "library2": "^3.1.0"
  },
  "features": ["auth", "api", "database"]
}
```

**Output Markdown:**
```markdown
## Configuration

**name:** MyApp
**version:** 1.0.0

### dependencies
- **library1:** ^2.0.0
- **library2:** ^3.1.0

### features
- auth
- api
- database
```

### JSON Array Handling

JSON arrays are converted to lists or tables:

**Array of objects:**
```json
[
  {"id": 1, "name": "Alice", "active": true},
  {"id": 2, "name": "Bob", "active": false}
]
```

**Converted to table:**
```markdown
| id | name  | active |
|----|-------|--------|
| 1  | Alice | true   |
| 2  | Bob   | false  |
```

### Nested JSON Structures

Nested JSON is converted with appropriate indentation and hierarchy:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Handles deeply nested structures
result = md.convert("complex_config.json")
print(result.text_content)
```

### JSON Lines (JSONL)

For JSON Lines format (one JSON object per line):

```python
from markitdown import MarkItDown
import json

md = MarkItDown()

# Read JSONL file
with open("data.jsonl", "r") as f:
    for line in f:
        obj = json.loads(line)

        # Convert to JSON temporarily
        with open("temp.json", "w") as temp:
            json.dump(obj, temp)

        result = md.convert("temp.json")
        print(result.text_content)
        print("\n---\n")
```

### Large JSON Files

For large JSON files:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert large JSON
result = md.convert("large_data.json")

# Save to file
with open("output.md", "w") as f:
    f.write(result.text_content)
```

## XML Files

Convert XML documents to structured Markdown.

### Basic XML Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xml")
print(result.text_content)
```

### XML Structure Preservation

XML is converted to Markdown maintaining hierarchical structure:

**Input XML (`book.xml`):**
```xml
<?xml version="1.0"?>
<book>
  <title>Example Book</title>
  <author>John Doe</author>
  <chapters>
    <chapter id="1">
      <title>Introduction</title>
      <content>Chapter 1 content...</content>
    </chapter>
    <chapter id="2">
      <title>Background</title>
      <content>Chapter 2 content...</content>
    </chapter>
  </chapters>
</book>
```

**Output Markdown:**
```markdown
# book

## title
Example Book

## author
John Doe

## chapters

### chapter (id: 1)
#### title
Introduction

#### content
Chapter 1 content...

### chapter (id: 2)
#### title
Background

#### content
Chapter 2 content...
```

### XML Attributes

XML attributes are preserved in the conversion:

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xml")
# Attributes shown as (attr: value) in headings
```

### XML Namespaces

XML namespaces are handled:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Handles xmlns and namespaced elements
result = md.convert("namespaced.xml")
```

### XML Use Cases

**Configuration files:**
- Convert XML configs to readable format
- Document system configurations
- Compare configuration files

**Data interchange:**
- Convert XML API responses
- Process XML data feeds
- Transform between formats

**Document processing:**
- Convert DocBook to Markdown
- Process SVG descriptions
- Extract structured data

## Structured Data Workflows

### CSV Data Analysis Pipeline

```python
from markitdown import MarkItDown
import pandas as pd

md = MarkItDown()

# Read CSV for analysis
df = pd.read_csv("data.csv")

# Do analysis
summary = df.describe()

# Convert both to Markdown
original = md.convert("data.csv")

# Save summary as CSV then convert
summary.to_csv("summary.csv")
summary_md = md.convert("summary.csv")

print("## Original Data\n")
print(original.text_content)
print("\n## Statistical Summary\n")
print(summary_md.text_content)
```

### JSON API Documentation

```python
from markitdown import MarkItDown
import requests
import json

md = MarkItDown()

# Fetch JSON from API
response = requests.get("https://api.example.com/data")
data = response.json()

# Save as JSON
with open("api_response.json", "w") as f:
    json.dump(data, f, indent=2)

# Convert to Markdown
result = md.convert("api_response.json")

# Create documentation
doc = f"""# API Response Documentation

## Endpoint
GET https://api.example.com/data

## Response
{result.text_content}
"""

with open("api_docs.md", "w") as f:
    f.write(doc)
```

### XML to Markdown Documentation

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert XML documentation
xml_files = ["config.xml", "schema.xml", "data.xml"]

for xml_file in xml_files:
    result = md.convert(xml_file)

    output_name = xml_file.replace('.xml', '.md')
    with open(f"docs/{output_name}", "w") as f:
        f.write(result.text_content)
```

### Multi-Format Data Processing

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

def convert_structured_data(directory):
    """Convert all structured data files in directory."""
    extensions = {'.csv', '.json', '.xml'}

    for filename in os.listdir(directory):
        ext = os.path.splitext(filename)[1]

        if ext in extensions:
            input_path = os.path.join(directory, filename)
            result = md.convert(input_path)

            # Save Markdown
            output_name = filename.replace(ext, '.md')
            output_path = os.path.join("markdown", output_name)

            with open(output_path, 'w') as f:
                f.write(result.text_content)

            print(f"Converted: {filename} → {output_name}")

# Process all structured data
convert_structured_data("data")
```

### CSV to JSON to Markdown

```python
import pandas as pd
from markitdown import MarkItDown
import json

md = MarkItDown()

# Read CSV
df = pd.read_csv("data.csv")

# Convert to JSON
json_data = df.to_dict(orient='records')
with open("temp.json", "w") as f:
    json.dump(json_data, f, indent=2)

# Convert JSON to Markdown
result = md.convert("temp.json")
print(result.text_content)
```

### Database Export to Markdown

```python
from markitdown import MarkItDown
import sqlite3
import csv

md = MarkItDown()

# Export database query to CSV
conn = sqlite3.connect("database.db")
cursor = conn.execute("SELECT * FROM users")

with open("users.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow([description[0] for description in cursor.description])
    writer.writerows(cursor.fetchall())

# Convert to Markdown
result = md.convert("users.csv")
print(result.text_content)
```

## Error Handling

### CSV Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("data.csv")
    print(result.text_content)
except FileNotFoundError:
    print("CSV file not found")
except Exception as e:
    print(f"CSV conversion error: {e}")
    # Common issues: encoding problems, malformed CSV, delimiter issues
```
|
||||
|
||||
### JSON Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("data.json")
|
||||
print(result.text_content)
|
||||
except Exception as e:
|
||||
print(f"JSON conversion error: {e}")
|
||||
# Common issues: invalid JSON syntax, encoding issues
|
||||
```
|
||||
|
||||
### XML Errors
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
try:
|
||||
result = md.convert("data.xml")
|
||||
print(result.text_content)
|
||||
except Exception as e:
|
||||
print(f"XML conversion error: {e}")
|
||||
# Common issues: malformed XML, encoding problems, namespace issues
|
||||
```
|
||||
|
||||
## Best Practices

### CSV Processing
- Check the delimiter before conversion (see the sketch below)
- Verify encoding (UTF-8 recommended)
- Handle large files with streaming if needed
- Preview output for very wide tables

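A minimal pre-flight sketch for the checks above (the filename `data.csv` and the Latin-1 fallback are illustrative assumptions, not part of MarkItDown's API):

```python
import csv

from markitdown import MarkItDown

md = MarkItDown()

# Sniff the delimiter and confirm the encoding before converting
with open("data.csv", "rb") as f:
    sample_bytes = f.read(4096)

try:
    sample = sample_bytes.decode("utf-8")
except UnicodeDecodeError:
    sample = sample_bytes.decode("latin-1")  # fallback; re-save as UTF-8 if possible

dialect = csv.Sniffer().sniff(sample)
print(f"Detected delimiter: {dialect.delimiter!r}")

result = md.convert("data.csv")
print(result.text_content)
```
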
### JSON Processing
- Validate JSON before conversion (see the sketch below)
- Consider pretty-printing complex structures
- Handle circular references appropriately
- Be aware of large array performance

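A small validation sketch for the first point (assumes a local `data.json`); running `json.load` first surfaces syntax errors before conversion:

```python
import json

from markitdown import MarkItDown

md = MarkItDown()

# Validate the JSON up front so syntax errors are reported clearly
try:
    with open("data.json", "r", encoding="utf-8") as f:
        json.load(f)
except json.JSONDecodeError as e:
    print(f"Invalid JSON, skipping conversion: {e}")
else:
    result = md.convert("data.json")
    print(result.text_content)
```
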
### XML Processing
- Validate XML structure first (see the sketch below)
- Handle namespaces consistently
- Consider XPath for selective extraction
- Be mindful of very deep nesting

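A well-formedness check sketch using the standard library (the filename `data.xml` is an assumption):

```python
import xml.etree.ElementTree as ET

from markitdown import MarkItDown

md = MarkItDown()

# Parse once to confirm the XML is well-formed before conversion
try:
    ET.parse("data.xml")
except ET.ParseError as e:
    print(f"Malformed XML, skipping conversion: {e}")
else:
    result = md.convert("data.xml")
    print(result.text_content)
```
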
### Data Quality
- Clean data before conversion when possible (see the sketch below)
- Handle missing values appropriately
- Verify special character handling
- Test with representative samples

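One possible pre-cleaning sketch with pandas (the filenames and the `"N/A"` placeholder are illustrative choices):

```python
import pandas as pd

from markitdown import MarkItDown

md = MarkItDown()

# Trim header whitespace and fill missing values before conversion
df = pd.read_csv("raw_data.csv")
df.columns = [c.strip() for c in df.columns]
df = df.fillna("N/A")
df.to_csv("clean_data.csv", index=False)

result = md.convert("clean_data.csv")
print(result.text_content)
```
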
### Performance
- Process large files in batches
- Use streaming for very large datasets
- Monitor memory usage
- Cache converted results when appropriate (see the sketch below)

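A simple caching sketch for the last point; it keys cached Markdown by a content hash (the `markdown_cache` directory name is an assumption):

```python
import hashlib
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()
cache_dir = Path("markdown_cache")
cache_dir.mkdir(exist_ok=True)

def convert_with_cache(path: str) -> str:
    """Convert a file, reusing the cached result if its content is unchanged."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    cached = cache_dir / f"{digest}.md"

    if cached.exists():
        return cached.read_text(encoding="utf-8")

    text = md.convert(path).text_content
    cached.write_text(text, encoding="utf-8")
    return text

print(convert_with_cache("data.csv")[:500])
```
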
skills/markitdown/references/web_content.md (478 lines, new file)

# Web Content Extraction Reference

This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.

## HTML Conversion

Convert HTML files and web pages to clean Markdown format.

### Basic HTML Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("webpage.html")
print(result.text_content)
```

### HTML Processing Features

**What's preserved:**
- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
- Paragraphs and text formatting
- Links (`<a>` → `[text](url)`)
- Lists (ordered and unordered)
- Tables → Markdown tables
- Code blocks and inline code
- Emphasis (bold, italic)

**What's removed:**
- Scripts and styles
- Navigation elements
- Advertising content
- Boilerplate markup
- HTML comments

### HTML from URLs

Convert web pages directly from URLs:

```python
from markitdown import MarkItDown
import requests

md = MarkItDown()

# Fetch and convert web page
response = requests.get("https://example.com/article")
with open("temp.html", "wb") as f:
    f.write(response.content)

result = md.convert("temp.html")
print(result.text_content)
```

### Clean Web Article Extraction

For extracting main content from web articles:

```python
from markitdown import MarkItDown
import requests
from readability import Document  # pip install readability-lxml

md = MarkItDown()

# Fetch page
url = "https://example.com/article"
response = requests.get(url)

# Extract main content
doc = Document(response.content)
html_content = doc.summary()

# Save and convert
with open("article.html", "w") as f:
    f.write(html_content)

result = md.convert("article.html")
print(result.text_content)
```

### HTML with Images

HTML files containing images can be enhanced with LLM descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("page_with_images.html")
```

## YouTube Transcripts

Extract video transcripts from YouTube videos.

### Basic YouTube Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

### YouTube Installation

```bash
pip install 'markitdown[youtube]'
```

This installs the `youtube-transcript-api` dependency.

### YouTube URL Formats

MarkItDown supports various YouTube URL formats (a normalization sketch follows the list):
- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://m.youtube.com/watch?v=VIDEO_ID`

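MarkItDown accepts all of the forms above directly; the sketch below is only a convenience for normalizing mixed URL lists (for example, to deduplicate them) before batch conversion, and the helper name is hypothetical:

```python
from urllib.parse import parse_qs, urlparse

from markitdown import MarkItDown

md = MarkItDown()

def canonical_watch_url(url: str) -> str:
    """Reduce the URL variants listed above to a standard watch URL."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        video_id = parsed.path.lstrip("/")
    elif parsed.path.startswith("/embed/"):
        video_id = parsed.path.split("/embed/", 1)[1]
    else:
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    return f"https://www.youtube.com/watch?v={video_id}"

result = md.convert(canonical_watch_url("https://youtu.be/VIDEO_ID"))
```
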
### YouTube Transcript Features

**What's included:**
- Full video transcript text
- Timestamps (optional, depending on availability)
- Video metadata (title, description)
- Captions in available languages

**Transcript languages:**
```python
from markitdown import MarkItDown

md = MarkItDown()

# The language of the returned transcript depends on the caption tracks
# available for the video (e.g., 'en', 'es', 'fr', 'de')
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
```

### YouTube Playlist Processing

Process multiple videos from a playlist:

```python
from markitdown import MarkItDown

md = MarkItDown()

video_ids = [
    "VIDEO_ID_1",
    "VIDEO_ID_2",
    "VIDEO_ID_3"
]

transcripts = []
for vid_id in video_ids:
    url = f"https://youtube.com/watch?v={vid_id}"
    result = md.convert(url)
    transcripts.append({
        'video_id': vid_id,
        'transcript': result.text_content
    })
```

### YouTube Use Cases

**Content Analysis:**
- Analyze video content without watching
- Extract key information from tutorials
- Build searchable transcript databases

**Research:**
- Process interview transcripts
- Extract lecture content
- Analyze presentation content

**Accessibility:**
- Generate text versions of video content
- Create searchable video archives

### YouTube Limitations

- Requires videos to have captions/transcripts available
- Auto-generated captions may have transcription errors
- Some videos may disable transcript access
- Rate limiting may apply for bulk processing

## EPUB Books

Convert EPUB e-books to Markdown format.

### Basic EPUB Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("book.epub")
print(result.text_content)
```

### EPUB Processing Features

**What's extracted:**
- Book text content
- Chapter structure
- Headings and formatting
- Table of contents
- Footnotes and references

**What's preserved:**
- Heading hierarchy
- Text emphasis (bold, italic)
- Links and references
- Lists and tables

### EPUB with Images

EPUB files often contain images (covers, diagrams, illustrations):

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("illustrated_book.epub")
```

### EPUB Use Cases

**Research:**
- Convert textbooks to searchable format
- Extract content for analysis
- Build digital libraries

**Content Processing:**
- Prepare books for LLM training data
- Convert to different formats
- Create summaries and extracts

**Accessibility:**
- Convert to more accessible formats
- Extract text for screen readers
- Process for text-to-speech

## RSS Feeds

Process RSS feeds to extract article content.

### Basic RSS Processing

```python
from markitdown import MarkItDown
import feedparser

md = MarkItDown()

# Parse RSS feed
feed = feedparser.parse("https://example.com/feed.xml")

# Convert each entry
for entry in feed.entries:
    # Save entry HTML
    with open("temp.html", "w") as f:
        f.write(entry.summary)

    result = md.convert("temp.html")
    print(f"## {entry.title}\n\n{result.text_content}\n\n")
```

## Combined Web Content Workflows

### Web Scraping Pipeline

```python
from markitdown import MarkItDown
import requests
from bs4 import BeautifulSoup

md = MarkItDown()

def scrape_and_convert(url):
    """Scrape a webpage and convert it to Markdown."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract main content
    main_content = soup.find('article') or soup.find('main')

    if main_content:
        # Save HTML
        with open("temp.html", "w") as f:
            f.write(str(main_content))

        # Convert to Markdown
        result = md.convert("temp.html")
        return result.text_content

    return None

# Use it
markdown = scrape_and_convert("https://example.com/article")
print(markdown)
```

### YouTube Learning Content Extraction

```python
from markitdown import MarkItDown

md = MarkItDown()

# Course videos
course_videos = [
    ("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
    ("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
    ("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
]

course_content = []
for url, title in course_videos:
    result = md.convert(url)
    course_content.append(f"# {title}\n\n{result.text_content}")

# Combine into course document
full_course = "\n\n---\n\n".join(course_content)
with open("course_transcript.md", "w") as f:
    f.write(full_course)
```

### Documentation Scraping

```python
from markitdown import MarkItDown
import requests
from urllib.parse import urljoin

md = MarkItDown()

def scrape_documentation(base_url, page_urls):
    """Scrape multiple documentation pages."""
    docs = []

    for page_url in page_urls:
        full_url = urljoin(base_url, page_url)

        # Fetch page
        response = requests.get(full_url)
        with open("temp.html", "wb") as f:
            f.write(response.content)

        # Convert
        result = md.convert("temp.html")
        docs.append({
            'url': full_url,
            'content': result.text_content
        })

    return docs

# Example usage
base = "https://docs.example.com/"
pages = ["intro.html", "getting-started.html", "api.html"]
documentation = scrape_documentation(base, pages)
```

### EPUB Library Processing

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

def process_epub_library(library_path, output_path):
    """Convert all EPUB books in a directory."""
    os.makedirs(output_path, exist_ok=True)  # ensure the output directory exists

    for filename in os.listdir(library_path):
        if filename.endswith('.epub'):
            epub_path = os.path.join(library_path, filename)

            try:
                result = md.convert(epub_path)

                # Save markdown
                output_file = filename.replace('.epub', '.md')
                output_full = os.path.join(output_path, output_file)

                with open(output_full, 'w') as f:
                    f.write(result.text_content)

                print(f"Converted: {filename}")
            except Exception as e:
                print(f"Failed to convert {filename}: {e}")

# Process library
process_epub_library("books", "markdown_books")
```

## Error Handling

### HTML Conversion Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("webpage.html")
    print(result.text_content)
except FileNotFoundError:
    print("HTML file not found")
except Exception as e:
    print(f"Conversion error: {e}")
```

### YouTube Transcript Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
    print(result.text_content)
except Exception as e:
    print(f"Failed to get transcript: {e}")
    # Common issues: no transcript available, video unavailable, network error
```

### EPUB Conversion Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("book.epub")
    print(result.text_content)
except Exception as e:
    print(f"EPUB processing error: {e}")
    # Common issues: corrupted file, unsupported DRM, invalid format
```

## Best Practices

### HTML Processing
- Clean HTML before conversion for better results
- Use readability libraries to extract main content
- Handle different encodings appropriately (see the sketch below)
- Remove unnecessary markup

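A sketch of the encoding point, assuming the page is fetched with `requests`; it normalizes the downloaded HTML to UTF-8 before conversion:

```python
import requests

from markitdown import MarkItDown

md = MarkItDown()

response = requests.get("https://example.com/article")
if response.encoding is None:
    response.encoding = response.apparent_encoding  # fall back to the detected charset

# Re-save as UTF-8 so the converter sees a consistent encoding
with open("temp.html", "w", encoding="utf-8") as f:
    f.write(response.text)

result = md.convert("temp.html")
print(result.text_content)
```
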
### YouTube Processing
- Check transcript availability before batch processing
- Handle API rate limits gracefully
- Store transcripts to avoid re-fetching (see the sketch below)
- Respect YouTube's terms of service

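A minimal transcript-caching sketch for the third point (the `transcripts` directory and key scheme are assumptions):

```python
import hashlib
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()
cache_dir = Path("transcripts")
cache_dir.mkdir(exist_ok=True)

def get_transcript(url: str) -> str:
    """Fetch a transcript once and reuse the stored copy afterwards."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    cached = cache_dir / f"{key}.md"

    if cached.exists():
        return cached.read_text(encoding="utf-8")

    text = md.convert(url).text_content
    cached.write_text(text, encoding="utf-8")
    return text

print(get_transcript("https://youtube.com/watch?v=VIDEO_ID")[:200])
```
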
### EPUB Processing
- DRM-protected EPUBs cannot be processed
- Large EPUBs may require more memory
- Some formatting may not translate perfectly
- Test with representative samples first

### Web Scraping Ethics
- Respect robots.txt (see the sketch below)
- Add delays between requests
- Identify your scraper in the User-Agent header
- Cache results to minimize requests
- Follow website terms of service

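A polite-scraping sketch combining these points (the URLs, User-Agent string, and two-second delay are illustrative assumptions):

```python
import time
from urllib import robotparser

import requests
from markitdown import MarkItDown

md = MarkItDown()
USER_AGENT = "markitdown-example-bot/0.1 (contact@example.com)"  # identify your scraper

robots = robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()

urls = ["https://example.com/a", "https://example.com/b"]
for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        continue  # skip pages the site disallows

    response = requests.get(url, headers={"User-Agent": USER_AGENT})
    with open("temp.html", "wb") as f:
        f.write(response.content)

    print(md.convert("temp.html").text_content[:200])
    time.sleep(2)  # pause between requests
```
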
skills/markitdown/scripts/batch_convert.py (317 lines, executable file)

#!/usr/bin/env python3
"""
Batch conversion utility for MarkItDown.

Converts all supported files in a directory to Markdown format.
"""

import os
import sys
from pathlib import Path
from markitdown import MarkItDown
from typing import Optional, List
import argparse


# Supported file extensions
SUPPORTED_EXTENSIONS = {
    '.pdf', '.docx', '.pptx', '.xlsx', '.xls',
    '.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff',
    '.wav', '.mp3', '.flac', '.ogg', '.aiff',
    '.html', '.htm', '.epub',
    '.csv', '.json', '.xml',
    '.zip'
}


def setup_markitdown(
    use_llm: bool = False,
    llm_model: str = "gpt-4o",
    use_azure_di: bool = False,
    azure_endpoint: Optional[str] = None,
    azure_key: Optional[str] = None
) -> MarkItDown:
    """
    Set up a MarkItDown instance with optional advanced features.

    Args:
        use_llm: Enable LLM-powered image descriptions
        llm_model: LLM model to use (default: gpt-4o)
        use_azure_di: Enable Azure Document Intelligence
        azure_endpoint: Azure Document Intelligence endpoint
        azure_key: Azure Document Intelligence API key

    Returns:
        Configured MarkItDown instance
    """
    kwargs = {}

    if use_llm:
        try:
            from openai import OpenAI
            client = OpenAI()
            kwargs['llm_client'] = client
            kwargs['llm_model'] = llm_model
            print(f"✓ LLM integration enabled ({llm_model})")
        except ImportError:
            print("✗ Warning: OpenAI not installed, LLM features disabled")
            print("  Install with: pip install openai")

    if use_azure_di:
        if azure_endpoint and azure_key:
            kwargs['docintel_endpoint'] = azure_endpoint
            kwargs['docintel_key'] = azure_key
            print("✓ Azure Document Intelligence enabled")
        else:
            print("✗ Warning: Azure credentials not provided, Azure DI disabled")

    return MarkItDown(**kwargs)


def convert_file(
    md: MarkItDown,
    input_path: Path,
    output_dir: Path,
    verbose: bool = False
) -> bool:
    """
    Convert a single file to Markdown.

    Args:
        md: MarkItDown instance
        input_path: Path to input file
        output_dir: Directory for output files
        verbose: Print detailed progress

    Returns:
        True if successful, False otherwise
    """
    try:
        if verbose:
            print(f"  Processing: {input_path.name}")

        # Convert file
        result = md.convert(str(input_path))

        # Create output filename
        output_filename = input_path.stem + '.md'
        output_path = output_dir / output_filename

        # Write output
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(result.text_content)

        if verbose:
            print(f"  ✓ Converted: {input_path.name} → {output_filename}")

        return True

    except Exception as e:
        print(f"  ✗ Error converting {input_path.name}: {e}")
        return False


def find_files(input_dir: Path, recursive: bool = False) -> List[Path]:
    """
    Find all supported files in a directory.

    Args:
        input_dir: Directory to search
        recursive: Search subdirectories

    Returns:
        List of file paths
    """
    files = []

    if recursive:
        for ext in SUPPORTED_EXTENSIONS:
            files.extend(input_dir.rglob(f"*{ext}"))
    else:
        for ext in SUPPORTED_EXTENSIONS:
            files.extend(input_dir.glob(f"*{ext}"))

    return sorted(files)


def batch_convert(
    input_dir: str,
    output_dir: str,
    recursive: bool = False,
    use_llm: bool = False,
    llm_model: str = "gpt-4o",
    use_azure_di: bool = False,
    azure_endpoint: Optional[str] = None,
    azure_key: Optional[str] = None,
    verbose: bool = False
) -> None:
    """
    Batch convert all supported files in a directory.

    Args:
        input_dir: Input directory containing files
        output_dir: Output directory for Markdown files
        recursive: Search subdirectories
        use_llm: Enable LLM-powered descriptions
        llm_model: LLM model to use
        use_azure_di: Enable Azure Document Intelligence
        azure_endpoint: Azure DI endpoint
        azure_key: Azure DI API key
        verbose: Print detailed progress
    """
    input_path = Path(input_dir)
    output_path = Path(output_dir)

    # Validate input directory
    if not input_path.exists():
        print(f"✗ Error: Input directory '{input_dir}' does not exist")
        sys.exit(1)

    if not input_path.is_dir():
        print(f"✗ Error: '{input_dir}' is not a directory")
        sys.exit(1)

    # Create output directory
    output_path.mkdir(parents=True, exist_ok=True)

    # Setup MarkItDown
    print("Setting up MarkItDown...")
    md = setup_markitdown(
        use_llm=use_llm,
        llm_model=llm_model,
        use_azure_di=use_azure_di,
        azure_endpoint=azure_endpoint,
        azure_key=azure_key
    )

    # Find files
    print(f"\nScanning directory: {input_dir}")
    if recursive:
        print("  (including subdirectories)")

    files = find_files(input_path, recursive)

    if not files:
        print("✗ No supported files found")
        print(f"  Supported extensions: {', '.join(sorted(SUPPORTED_EXTENSIONS))}")
        sys.exit(0)

    print(f"✓ Found {len(files)} file(s) to convert\n")

    # Convert files
    successful = 0
    failed = 0

    for file_path in files:
        if convert_file(md, file_path, output_path, verbose):
            successful += 1
        else:
            failed += 1

    # Summary
    print(f"\n{'='*60}")
    print("Conversion complete!")
    print(f"  Successful: {successful}")
    print(f"  Failed: {failed}")
    print(f"  Output: {output_dir}")
    print(f"{'='*60}")


def main():
    """Main entry point."""
    parser = argparse.ArgumentParser(
        description="Batch convert files to Markdown using MarkItDown",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Basic usage
  python batch_convert.py documents/ output/

  # Recursive conversion
  python batch_convert.py documents/ output/ --recursive

  # With LLM-powered image descriptions
  python batch_convert.py documents/ output/ --llm

  # With Azure Document Intelligence
  python batch_convert.py documents/ output/ --azure \\
      --azure-endpoint https://example.cognitiveservices.azure.com/ \\
      --azure-key YOUR-KEY

  # All features enabled
  python batch_convert.py documents/ output/ --llm --azure \\
      --azure-endpoint $AZURE_ENDPOINT --azure-key $AZURE_KEY

Supported file types:
  Documents: PDF, DOCX, PPTX, XLSX, XLS
  Images: JPG, PNG, GIF, BMP, TIFF
  Audio: WAV, MP3, FLAC, OGG, AIFF
  Web: HTML, EPUB
  Data: CSV, JSON, XML
  Archives: ZIP
"""
    )

    parser.add_argument(
        'input_dir',
        help='Input directory containing files to convert'
    )
    parser.add_argument(
        'output_dir',
        help='Output directory for Markdown files'
    )
    parser.add_argument(
        '-r', '--recursive',
        action='store_true',
        help='Recursively search subdirectories'
    )
    parser.add_argument(
        '--llm',
        action='store_true',
        help='Enable LLM-powered image descriptions (requires OpenAI API key)'
    )
    parser.add_argument(
        '--llm-model',
        default='gpt-4o',
        help='LLM model to use (default: gpt-4o)'
    )
    parser.add_argument(
        '--azure',
        action='store_true',
        help='Enable Azure Document Intelligence for PDFs'
    )
    parser.add_argument(
        '--azure-endpoint',
        help='Azure Document Intelligence endpoint URL'
    )
    parser.add_argument(
        '--azure-key',
        help='Azure Document Intelligence API key'
    )
    parser.add_argument(
        '-v', '--verbose',
        action='store_true',
        help='Print detailed progress'
    )

    args = parser.parse_args()

    # Environment variable fallbacks for Azure
    azure_endpoint = args.azure_endpoint or os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT')
    azure_key = args.azure_key or os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')

    batch_convert(
        input_dir=args.input_dir,
        output_dir=args.output_dir,
        recursive=args.recursive,
        use_llm=args.llm,
        llm_model=args.llm_model,
        use_azure_di=args.azure,
        azure_endpoint=azure_endpoint,
        azure_key=azure_key,
        verbose=args.verbose
    )


if __name__ == '__main__':
    main()