Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/markitdown/references/advanced_integrations.md
+++ b/skills/markitdown/references/advanced_integrations.md
@@ -0,0 +1,538 @@
+# Advanced Integrations Reference
+
+This document provides detailed information about advanced MarkItDown features, including Azure Document Intelligence integration, LLM-powered image descriptions, and the plugin system.
+
+## Azure Document Intelligence Integration
+
+Azure Document Intelligence (formerly Form Recognizer) provides superior PDF processing with advanced table extraction and layout analysis.
+
+### Setup
+
+**Prerequisites:**
+1. Azure subscription
+2. Document Intelligence resource created in Azure
+3. Endpoint URL and API key
+
+**Create Azure Resource:**
+```bash
+# Using Azure CLI
+az cognitiveservices account create \
+  --name my-doc-intelligence \
+  --resource-group my-resource-group \
+  --kind FormRecognizer \
+  --sku F0 \
+  --location eastus
+```
+
+### Basic Usage
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    docintel_endpoint="https://YOUR-RESOURCE.cognitiveservices.azure.com/",
+    docintel_key="YOUR-API-KEY"
+)
+
+result = md.convert("complex_document.pdf")
+print(result.text_content)
+```
+
+### Configuration from Environment Variables
+
+```python
+import os
+from markitdown import MarkItDown
+
+# For demonstration only; in practice, set these in your shell or a secrets manager
+os.environ['AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'] = 'YOUR-ENDPOINT'
+os.environ['AZURE_DOCUMENT_INTELLIGENCE_KEY'] = 'YOUR-KEY'
+
+# Read the credentials from the environment instead of hard-coding them
+md = MarkItDown(
+    docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
+    docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
+)
+
+result = md.convert("document.pdf")
+```
+
+### When to Use Azure Document Intelligence
+
+**Use for:**
+- Complex PDFs with sophisticated tables
+- Multi-column layouts
+- Forms and structured documents
+- Scanned documents requiring OCR
+- PDFs with mixed content types
+- Documents with intricate formatting
+
+**Benefits over standard extraction:**
+- **Superior table extraction** - Better handling of merged cells, complex layouts
+- **Layout analysis** - Understands document structure (headers, footers, columns)
+- **Form fields** - Extracts key-value pairs from forms
+- **Reading order** - Maintains correct text flow in complex layouts
+- **OCR quality** - High-quality text extraction from scanned documents
+
+### Comparison Example
+
+**Standard extraction:**
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("complex_table.pdf")
+# May struggle with complex tables
+```
+
+**Azure Document Intelligence:**
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    docintel_endpoint="YOUR-ENDPOINT",
+    docintel_key="YOUR-KEY"
+)
+result = md.convert("complex_table.pdf")
+# Better table reconstruction and layout understanding
+```
+
+### Cost Considerations
+
+Azure Document Intelligence is a paid service:
+- **Free tier**: 500 pages per month
+- **Paid tiers**: Pay per page processed
+- Monitor usage to control costs
+- Use standard extraction for simple documents
+
+### Error Handling
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    docintel_endpoint="YOUR-ENDPOINT",
+    docintel_key="YOUR-KEY"
+)
+
+try:
+    result = md.convert("document.pdf")
+    print(result.text_content)
+except Exception as e:
+    print(f"Document Intelligence error: {e}")
+    # Common issues: authentication, quota exceeded, unsupported file
+```
+
+## LLM-Powered Image Descriptions
+
+Generate detailed, contextual descriptions for images using large language models.
+
+### Setup with OpenAI
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI(api_key="YOUR-OPENAI-API-KEY")
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+
+result = md.convert("image.jpg")
+print(result.text_content)
+```
+
+### Supported Use Cases
+
+**Images in documents:**
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+
+# PowerPoint with images
+result = md.convert("presentation.pptx")
+
+# Word documents with images
+result = md.convert("report.docx")
+
+# Standalone images
+result = md.convert("diagram.png")
+```
+
+### Custom Prompts
+
+Customize the LLM prompt for specific needs:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+
+# For diagrams
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Analyze this diagram and explain all components, connections, and relationships in detail"
+)
+
+# For charts
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this chart, including the type, axes, data points, trends, and key insights"
+)
+
+# For UI screenshots
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this user interface screenshot, listing all UI elements, their layout, and functionality"
+)
+
+# For scientific figures
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this scientific figure in detail, including methodology, results shown, and significance"
+)
+```
+
+### Model Selection
+
+**GPT-4o (Recommended):**
+- Best vision capabilities
+- High-quality descriptions
+- Good at understanding context
+- Higher cost per image
+
+**GPT-4o-mini:**
+- Lower cost alternative
+- Good for simpler images
+- Faster processing
+- May miss subtle details
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+
+# High quality (more expensive)
+md_quality = MarkItDown(llm_client=client, llm_model="gpt-4o")
+
+# Budget option (less expensive)
+md_budget = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")
+```
+
+### Configuration from Environment
+
+```python
+import os
+from markitdown import MarkItDown
+from openai import OpenAI
+
+# For demonstration only; in practice, set OPENAI_API_KEY outside the code
+os.environ['OPENAI_API_KEY'] = 'YOUR-API-KEY'
+
+client = OpenAI()  # Uses env variable
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+```
+
+### Alternative LLM Providers
+
+**Anthropic Claude:**
+```python
+from markitdown import MarkItDown
+from anthropic import Anthropic
+
+# Note: Check current compatibility with MarkItDown
+client = Anthropic(api_key="YOUR-API-KEY")
+# May require adapter for MarkItDown compatibility
+```
+
+**Azure OpenAI:**
+```python
+from markitdown import MarkItDown
+from openai import AzureOpenAI
+
+client = AzureOpenAI(
+    api_key="YOUR-AZURE-KEY",
+    api_version="2024-02-01",
+    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com"
+)
+
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+```
+
+### Cost Management
+
+**Strategies to reduce LLM costs:**
+
+1. **Selective processing:**
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+
+# Only use the LLM for important documents (is_important_document and file are placeholders to define yourself)
+if is_important_document(file):
+    md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+else:
+    md = MarkItDown()  # Standard processing
+
+result = md.convert(file)
+```
+
+2. **Image filtering:**
+```python
+# Pre-process to identify images that need descriptions
+# Only use LLM for complex/important images
+```
+
+3. **Batch processing:**
+```python
+# Process multiple images in batches
+# Monitor costs and set limits
+```
+
+4. **Model selection:**
+```python
+# Use gpt-4o-mini for simple images
+# Reserve gpt-4o for complex visualizations
+```
+
+### Performance Considerations
+
+**LLM processing adds latency:**
+- Each image requires an API call
+- Processing time: 1-5 seconds per image
+- Network dependent
+- Consider parallel processing for multiple images
+
+**Batch optimization:**
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+import concurrent.futures
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+
+def process_image(image_path):
+    return md.convert(image_path)
+
+# Process multiple images in parallel
+images = ["img1.jpg", "img2.jpg", "img3.jpg"]
+with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
+    results = list(executor.map(process_image, images))
+```
+
+## Combined Advanced Features
+
+### Azure Document Intelligence + LLM Descriptions
+
+Combine both for maximum quality:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    docintel_endpoint="YOUR-AZURE-ENDPOINT",
+    docintel_key="YOUR-AZURE-KEY"
+)
+
+# Best possible PDF conversion with image descriptions
+result = md.convert("complex_report.pdf")
+```
+
+**Use cases:**
+- Research papers with figures
+- Business reports with charts
+- Technical documentation with diagrams
+- Presentations with visual data
+
+### Smart Document Processing Pipeline
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+import os
+
+def smart_convert(file_path):
+    """Intelligently choose processing method based on file type."""
+    client = OpenAI()
+    ext = os.path.splitext(file_path)[1].lower()
+
+    # PDFs with complex tables: Use Azure
+    if ext == '.pdf':
+        md = MarkItDown(
+            docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
+            docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
+        )
+
+    # Documents/presentations with images: Use LLM
+    elif ext in ['.pptx', '.docx']:
+        md = MarkItDown(
+            llm_client=client,
+            llm_model="gpt-4o"
+        )
+
+    # Simple formats: Standard processing
+    else:
+        md = MarkItDown()
+
+    return md.convert(file_path)
+
+# Use it
+result = smart_convert("document.pdf")
+```
+
+## Plugin System
+
+MarkItDown supports custom plugins for extending functionality.
+
+### Plugin Architecture
+
+Plugins are disabled by default for security:
+
+```python
+from markitdown import MarkItDown
+
+# Enable plugins
+md = MarkItDown(enable_plugins=True)
+```
+
+### Creating Custom Plugins
+
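+**Converter structure (a sketch, assuming the `DocumentConverter` interface of markitdown 0.1.x):**
+```python
+from markitdown import DocumentConverter, DocumentConverterResult
+
+class CustomConverter(DocumentConverter):
+    """Custom converter for hypothetical `.custom` files."""
+
+    def accepts(self, file_stream, stream_info, **kwargs):
+        """Return True if this converter can handle the stream."""
+        return (stream_info.extension or "").lower() == ".custom"
+
+    def convert(self, file_stream, stream_info, **kwargs):
+        """Convert the stream to Markdown."""
+        text = file_stream.read().decode("utf-8")  # your conversion logic here
+        return DocumentConverterResult(markdown=f"# Converted Content\n\n{text}")
+```
+
+### Plugin Registration
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(enable_plugins=True)
+
+# Register the custom converter on this instance
+md.register_converter(CustomConverter())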
+
+# Use normally
+result = md.convert("file.custom")
+```
+
+### Plugin Use Cases
+
+**Custom formats:**
+- Proprietary document formats
+- Specialized scientific data formats
+- Legacy file formats
+
+**Enhanced processing:**
+- Custom OCR engines
+- Specialized table extraction
+- Domain-specific parsing
+
+**Integration:**
+- Enterprise document systems
+- Custom databases
+- Specialized APIs
+
+### Plugin Security
+
+**Important security considerations:**
+- Plugins run arbitrary code with the same privileges as the host Python process
+- Only enable for trusted plugins
+- Validate plugin code before use
+- Disable plugins in production unless required
+
+## Error Handling for Advanced Features
+
+```python
+import os
+from markitdown import MarkItDown
+from openai import OpenAI
+
+def robust_convert(file_path):
+    """Convert with fallback strategies."""
+    try:
+        # Try with all advanced features
+        client = OpenAI()
+        md = MarkItDown(
+            llm_client=client,
+            llm_model="gpt-4o",
+            docintel_endpoint=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT'),
+            docintel_key=os.getenv('AZURE_DOCUMENT_INTELLIGENCE_KEY')
+        )
+        return md.convert(file_path)
+    except Exception as advanced_error:
+        print(f"Advanced conversion (Azure + LLM) failed: {advanced_error}")
+
+        try:
+            # Fallback: LLM only
+            client = OpenAI()
+            md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+            return md.convert(file_path)
+
+        except Exception as llm_error:
+            print(f"LLM failed: {llm_error}")
+
+            # Final fallback: Standard processing
+            md = MarkItDown()
+            return md.convert(file_path)
+
+# Use it
+result = robust_convert("document.pdf")
+```
+
+## Best Practices
+
+### Azure Document Intelligence
+- Use for complex PDFs only (cost optimization)
+- Monitor usage and costs
+- Store credentials securely
+- Handle quota limits gracefully
+- Fall back to standard processing if needed
+
+### LLM Integration
+- Use appropriate models for task complexity
+- Customize prompts for specific use cases
+- Monitor API costs
+- Implement rate limiting
+- Cache results when possible
+- Handle API errors gracefully
+
+### Combined Features
+- Test cost/quality tradeoffs
+- Use selectively for important documents
+- Implement intelligent routing
+- Monitor performance and costs
+- Have fallback strategies
+
+### Security
+- Store API keys securely (environment variables, secrets manager)
+- Never commit credentials to code
+- Disable plugins unless required
+- Validate all inputs
+- Use least privilege access
--- a/skills/markitdown/references/document_conversion.md
+++ b/skills/markitdown/references/document_conversion.md
@@ -0,0 +1,273 @@
+# Document Conversion Reference
+
+This document provides detailed information about converting Office documents and PDFs to Markdown using MarkItDown.
+
+## PDF Files
+
+PDF conversion extracts text, tables, and structure from PDF documents.
+
+### Basic PDF Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("document.pdf")
+print(result.text_content)
+```
+
+### PDF with Azure Document Intelligence
+
+For complex PDFs with tables, forms, and sophisticated layouts, use Azure Document Intelligence for enhanced extraction:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/",
+    docintel_key="YOUR-API-KEY"
+)
+result = md.convert("complex_table.pdf")
+print(result.text_content)
+```
+
+**Benefits of Azure Document Intelligence:**
+- Superior table extraction and reconstruction
+- Better handling of multi-column layouts
+- Form field recognition
+- Improved text ordering in complex documents
+
+### PDF Handling Notes
+
+- Scanned PDFs have no text layer, so standard extraction may return little or no text; use Azure Document Intelligence or run OCR first
+- Password-protected PDFs are not supported
+- Large PDFs may take longer to process
+- Standard extraction focuses on text; embedded images and vector graphics may not be carried over
+
+## Word Documents (DOCX)
+
+Word document conversion preserves headings, paragraphs, lists, tables, and hyperlinks.
+
+### Basic DOCX Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("document.docx")
+print(result.text_content)
+```
+
+### DOCX Structure Preservation
+
+MarkItDown preserves:
+- **Headings** → Markdown headers (`#`, `##`, etc.)
+- **Bold/Italic** → Markdown emphasis (`**bold**`, `*italic*`)
+- **Lists** → Markdown lists (ordered and unordered)
+- **Tables** → Markdown tables
+- **Hyperlinks** → Markdown links `[text](url)`
+- **Images** → Referenced with descriptions (can use LLM for descriptions)
+
+### Command-Line Usage
+
+```bash
+# Basic conversion
+markitdown report.docx -o report.md
+
+# With output directory
+markitdown report.docx -o output/report.md
+```
+
+### DOCX with Images
+
+To generate descriptions for images in Word documents, use LLM integration:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+result = md.convert("document_with_images.docx")
+```
+
+## PowerPoint Presentations (PPTX)
+
+PowerPoint conversion extracts text from slides while preserving structure.
+
+### Basic PPTX Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("presentation.pptx")
+print(result.text_content)
+```
+
+### PPTX Structure
+
+MarkItDown processes presentations as:
+- Each slide becomes a major section
+- Slide titles become headers
+- Bullet points are preserved
+- Tables are converted to Markdown tables
+- Notes are included if present
+
+### PPTX with Image Descriptions
+
+Presentations often contain important visual information. Use LLM integration to describe images:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this slide image in detail, focusing on key information"
+)
+result = md.convert("presentation.pptx")
+```
+
+**Custom prompts for presentations:**
+- "Describe charts and graphs with their key data points"
+- "Explain diagrams and their relationships"
+- "Summarize visual content for accessibility"
+
+## Excel Spreadsheets (XLSX, XLS)
+
+Excel conversion formats spreadsheet data as Markdown tables.
+
+### Basic XLSX Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.xlsx")
+print(result.text_content)
+```
+
+### Multi-Sheet Workbooks
+
+For workbooks with multiple sheets:
+- Each sheet becomes a separate section
+- Sheet names are used as headers
+- Empty sheets are skipped
+- Formulas are evaluated (values shown, not formulas)
+
+### XLSX Conversion Details
+
+**What's preserved:**
+- Cell values (text, numbers, dates)
+- Table structure (rows and columns)
+- Sheet names
+- Cell formatting (bold headers)
+
+**What's not preserved:**
+- Formulas (only computed values)
+- Charts and graphs (use LLM integration for descriptions)
+- Cell colors and conditional formatting
+- Comments and notes
+
+### Large Spreadsheets
+
+For large spreadsheets, consider:
+- Processing may be slower for files with many rows/columns
+- Very wide tables may not format well in Markdown
+- Consider filtering or preprocessing data if possible
+
+### XLS (Legacy Excel) Files
+
+Legacy `.xls` files are supported but require additional dependencies:
+
+```bash
+pip install 'markitdown[xls]'
+```
+
+Then use normally:
+```python
+from markitdown import MarkItDown
+result = MarkItDown().convert("legacy_data.xls")
+```
+
+## Common Document Conversion Patterns
+
+### Batch Document Processing
+
+```python
+from markitdown import MarkItDown
+import os
+
+md = MarkItDown()
+os.makedirs("markdown", exist_ok=True)  # make sure the output directory exists
+
+# Process all documents in a directory
+for filename in os.listdir("documents"):
+    if filename.lower().endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
+        result = md.convert(os.path.join("documents", filename))
+        # Save to output directory
+        output_name = os.path.splitext(filename)[0] + ".md"
+        with open(os.path.join("markdown", output_name), "w", encoding="utf-8") as f:
+            f.write(result.text_content)
+```
+
+### Document with Mixed Content
+
+For documents containing multiple types of content (text, tables, images):
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+# Use LLM for image descriptions + Azure for complex tables
+client = OpenAI()
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    docintel_endpoint="YOUR-ENDPOINT",
+    docintel_key="YOUR-KEY"
+)
+
+result = md.convert("complex_report.pdf")
+```
+
+### Error Handling
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("document.pdf")
+    print(result.text_content)
+except Exception as e:
+    print(f"Conversion failed: {e}")
+    # Handle specific errors (file not found, unsupported format, etc.)
+```
+
+## Output Quality Tips
+
+**For best results:**
+1. Use Azure Document Intelligence for PDFs with complex tables
+2. Enable LLM descriptions for documents with important visual content
+3. Ensure source documents are well-structured (proper headings, etc.)
+4. For scanned documents, ensure good scan quality for OCR accuracy
+5. Test with sample documents to verify output quality
+
+## Performance Considerations
+
+**Conversion speed depends on:**
+- Document size and complexity
+- Number of images (especially with LLM descriptions)
+- Use of Azure Document Intelligence
+- Available system resources
+
+**Optimization tips:**
+- Disable LLM integration if image descriptions aren't needed
+- Use standard extraction (not Azure) for simple documents
+- Process large batches in parallel when possible
+- Consider streaming for very large documents
--- a/skills/markitdown/references/media_processing.md
+++ b/skills/markitdown/references/media_processing.md
@@ -0,0 +1,365 @@
+# Media Processing Reference
+
+This document provides detailed information about processing images and audio files with MarkItDown.
+
+## Image Processing
+
+MarkItDown can extract text from images using OCR and retrieve EXIF metadata.
+
+### Basic Image Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("photo.jpg")
+print(result.text_content)
+```
+
+### Image Processing Features
+
+**What's extracted:**
+1. **EXIF Metadata** - Camera settings, date, location, etc. (requires the `exiftool` binary to be installed)
+2. **OCR Text** - Text detected in the image (requires tesseract)
+3. **Image Description** - AI-generated description (with LLM integration)
+
+### EXIF Metadata Extraction
+
+Images from cameras and smartphones contain EXIF metadata that's automatically extracted:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("IMG_1234.jpg")
+print(result.text_content)
+```
+
+**Example output includes:**
+- Camera make and model
+- Capture date and time
+- GPS coordinates (if available)
+- Exposure settings (ISO, shutter speed, aperture)
+- Image dimensions
+- Orientation
+
+### OCR (Optical Character Recognition)
+
+Extract text from images containing text (screenshots, scanned documents, photos of text):
+
+**Requirements:**
+- Install tesseract OCR engine:
+  ```bash
+  # macOS
+  brew install tesseract
+
+  # Ubuntu/Debian
+  apt-get install tesseract-ocr
+
+  # Windows
+  # Download installer from https://github.com/UB-Mannheim/tesseract/wiki
+  ```
+
+**Usage:**
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("screenshot.png")
+print(result.text_content)  # Contains OCR'd text
+```
+
+**Best practices for OCR:**
+- Use high-resolution images for better accuracy
+- Ensure good contrast between text and background
+- Straighten skewed text if possible
+- Use well-lit, clear images
+
+### LLM-Generated Image Descriptions
+
+Generate detailed, contextual descriptions of images using GPT-4o or other vision models:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+result = md.convert("diagram.png")
+print(result.text_content)
+```
+
+**Custom prompts for specific needs:**
+
+```python
+# For diagrams
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
+)
+
+# For charts
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Analyze this chart and provide key data points and trends"
+)
+
+# For UI screenshots
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this user interface, listing all visible elements and their layout"
+)
+```
+
+### Supported Image Formats
+
+MarkItDown supports all common image formats:
+- JPEG/JPG
+- PNG
+- GIF
+- BMP
+- TIFF
+- WebP
+- HEIC (requires additional libraries on some platforms)
+
+## Audio Processing
+
+MarkItDown can transcribe audio files to text using speech recognition.
+
+### Basic Audio Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("recording.wav")
+print(result.text_content)  # Transcribed speech
+```
+
+### Audio Transcription Setup
+
+**Installation:**
+```bash
+pip install 'markitdown[audio]'
+```
+
+This installs the `SpeechRecognition` package (imported as `speech_recognition`) and its dependencies.
+
+### Supported Audio Formats
+
+- WAV
+- AIFF
+- FLAC
+- MP3 (requires ffmpeg or libav)
+- OGG (requires ffmpeg or libav)
+- Other formats supported by speech_recognition
+
+### Audio Transcription Engines
+
+MarkItDown uses the `speech_recognition` library, which supports multiple backends:
+
+**Default (Google Speech Recognition):**
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("audio.wav")
+```
+
+**Note:** Default Google Speech Recognition requires internet connection.
+
+### Audio Quality Considerations
+
+For best transcription accuracy:
+- Use clear audio with minimal background noise
+- Prefer WAV or FLAC for better quality
+- Ensure speech is clear and at good volume
+- Avoid multiple overlapping speakers
+- Use mono audio when possible
+
+### Audio Preprocessing Tips
+
+For better results, consider preprocessing audio:
+
+```python
+# Example: If you have pydub installed
+from pydub import AudioSegment
+from pydub.effects import normalize
+
+# Load and normalize audio
+audio = AudioSegment.from_file("recording.mp3")
+audio = normalize(audio)
+audio.export("normalized.wav", format="wav")
+
+# Then convert with MarkItDown
+from markitdown import MarkItDown
+md = MarkItDown()
+result = md.convert("normalized.wav")
+```
+
+## Combined Media Workflows
+
+### Processing Multiple Images in Batch
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+import os
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+os.makedirs("output", exist_ok=True)  # make sure the output directory exists
+
+# Process all images in directory
+for filename in os.listdir("images"):
+    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
+        result = md.convert(os.path.join("images", filename))
+        # Save markdown with same name
+        output = os.path.splitext(filename)[0] + '.md'
+        with open(os.path.join("output", output), "w", encoding="utf-8") as f:
+            f.write(result.text_content)
+```
+
+### Screenshot Analysis Pipeline
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
+)
+
+screenshots = ["screen1.png", "screen2.png", "screen3.png"]
+analysis = []
+
+for screenshot in screenshots:
+    result = md.convert(screenshot)
+    analysis.append({
+        'file': screenshot,
+        'content': result.text_content
+    })
+
+# Now ready for further processing
+```
+
+### Document Images with OCR
+
+For scanned documents or photos of documents:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Process scanned pages
+pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
+full_text = []
+
+for page in pages:
+    result = md.convert(page)
+    full_text.append(result.text_content)
+
+# Combine into single document
+document = "\n\n---\n\n".join(full_text)
+print(document)
+```
+
+### Presentation Slide Images
+
+When you have presentation slides as images:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
+)
+
+# Process slide images
+for i in range(1, 21):  # 20 slides
+    result = md.convert(f"slides/slide_{i}.png")
+    print(f"## Slide {i}\n\n{result.text_content}\n\n")
+```
+
+## Error Handling
+
+### Image Processing Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("image.jpg")
+    print(result.text_content)
+except FileNotFoundError:
+    print("Image file not found")
+except Exception as e:
+    print(f"Error processing image: {e}")
+```
+
+### Audio Processing Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("audio.mp3")
+    print(result.text_content)
+except Exception as e:
+    print(f"Transcription failed: {e}")
+    # Common issues: format not supported, no speech detected, network error
+```
+
+## Performance Optimization
+
+### Image Processing
+
+- **LLM descriptions**: Slower but more informative
+- **OCR only**: Faster for text extraction
+- **EXIF only**: Fastest, metadata only
+- **Batch processing**: Process multiple images in parallel
+
+### Audio Processing
+
+- **File size**: Larger files take longer
+- **Audio length**: Transcription time scales with duration
+- **Format conversion**: MP3/OGG files must be decoded before transcription, so WAV/FLAC inputs process faster
+- **Network dependency**: Default transcription requires internet
+
+## Use Cases
+
+### Document Digitization
+Convert scanned documents or photos of documents to searchable text.
+
+### Meeting Notes
+Transcribe audio recordings of meetings to text for analysis.
+
+### Presentation Analysis
+Extract content from presentation slide images.
+
+### Screenshot Documentation
+Generate descriptions of UI screenshots for documentation.
+
+### Image Archiving
+Extract metadata and content from photo collections.
+
+### Accessibility
+Generate alt-text descriptions for images using LLM integration.
+
+### Data Extraction
+OCR text from images containing tables, forms, or structured data.
--- a/skills/markitdown/references/structured_data.md
+++ b/skills/markitdown/references/structured_data.md
@@ -0,0 +1,575 @@
+# Structured Data Handling Reference
+
+This document provides detailed information about converting structured data formats (CSV, JSON, XML) to Markdown.
+
+## CSV Files
+
+Convert CSV (Comma-Separated Values) files to Markdown tables.
+
+### Basic CSV Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.csv")
+print(result.text_content)
+```
+
+### CSV to Markdown Table
+
+CSV files are automatically converted to Markdown table format:
+
+**Input CSV (`data.csv`):**
+```csv
+Name,Age,City
+Alice,30,New York
+Bob,25,Los Angeles
+Charlie,35,Chicago
+```
+
+**Output Markdown:**
+```markdown
+| Name    | Age | City        |
+|---------|-----|-------------|
+| Alice   | 30  | New York    |
+| Bob     | 25  | Los Angeles |
+| Charlie | 35  | Chicago     |
+```
+
+### CSV Conversion Features
+
+**What's preserved:**
+- All column headers
+- All data rows
+- Cell values (text and numbers)
+- Column structure
+
+**Formatting:**
+- The first row becomes the table header row (most Markdown renderers display it in bold)
+- Columns are aligned
+- Empty cells are preserved
+- Special characters are escaped
+
+### Large CSV Files
+
+For large CSV files:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Convert large CSV
+result = md.convert("large_dataset.csv")
+
+# Save to file instead of printing
+with open("output.md", "w", encoding="utf-8") as f:
+    f.write(result.text_content)
+```
+
+**Performance considerations:**
+- Very large files may take time to process
+- Consider previewing first few rows for testing
+- Memory usage scales with file size
+- Very wide tables may not display well in all Markdown viewers
+
+### CSV with Special Characters
+
+CSV files containing special characters are handled automatically:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Handles UTF-8, special characters, quotes, etc.
+result = md.convert("international_data.csv")
+```
+
+### CSV Delimiters
+
+Standard CSV delimiters are supported:
+- Comma (`,`) - standard
+- Semicolon (`;`) - common in European formats
+- Tab (`\t`) - TSV files
+
+### Command-Line CSV Conversion
+
+```bash
+# Basic conversion
+markitdown data.csv -o data.md
+
+# Multiple CSV files
+for file in *.csv; do
+    markitdown "$file" -o "${file%.csv}.md"
+done
+```
+
+## JSON Files
+
+Convert JSON data to readable Markdown format.
+
+### Basic JSON Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.json")
+print(result.text_content)
+```
+
+### JSON Formatting
+
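+Depending on the MarkItDown version, JSON may be restructured or passed through as plain text. The output below illustrates a structured rendering; verify it against your installation: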
+
+**Input JSON (`config.json`):**
+```json
+{
+  "name": "MyApp",
+  "version": "1.0.0",
+  "dependencies": {
+    "library1": "^2.0.0",
+    "library2": "^3.1.0"
+  },
+  "features": ["auth", "api", "database"]
+}
+```
+
+**Output Markdown:**
+```markdown
+## Configuration
+
+**name:** MyApp
+**version:** 1.0.0
+
+### dependencies
+- **library1:** ^2.0.0
+- **library2:** ^3.1.0
+
+### features
+- auth
+- api
+- database
+```
+
+### JSON Array Handling
+
+JSON arrays are converted to lists or tables:
+
+**Array of objects:**
+```json
+[
+  {"id": 1, "name": "Alice", "active": true},
+  {"id": 2, "name": "Bob", "active": false}
+]
+```
+
+**Converted to table:**
+```markdown
+| id | name  | active |
+|----|-------|--------|
+| 1  | Alice | true   |
+| 2  | Bob   | false  |
+```
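+
+Should the converter return the JSON unchanged, a table like the one above can be built in a few lines of plain Python. This is a minimal sketch and not MarkItDown's own logic; `records_to_markdown_table` is a hypothetical helper that assumes a list of flat objects.
+
+```python
+import json
+
+def records_to_markdown_table(records):
+    """Render a list of flat dicts as a Markdown table (hypothetical helper)."""
+    if not records:
+        return ""
+    # Collect column names in first-seen order across all records
+    columns = []
+    for record in records:
+        for key in record:
+            if key not in columns:
+                columns.append(key)
+
+    def cell(value):
+        # json.dumps keeps JSON spellings (true/false/null); strings stay unquoted
+        text = value if isinstance(value, str) else json.dumps(value)
+        return text.replace("|", "\\|")  # escape pipes so they do not split cells
+
+    lines = [
+        "| " + " | ".join(columns) + " |",
+        "| " + " | ".join("---" for _ in columns) + " |",
+    ]
+    for record in records:
+        lines.append("| " + " | ".join(cell(record.get(c, "")) for c in columns) + " |")
+    return "\n".join(lines)
+
+data = json.loads(
+    '[{"id": 1, "name": "Alice", "active": true},'
+    ' {"id": 2, "name": "Bob", "active": false}]'
+)
+print(records_to_markdown_table(data))
+```
+
+Missing keys render as empty cells, so records with slightly different fields still line up in one table.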
+
+### Nested JSON Structures
+
+Nested JSON is converted with appropriate indentation and hierarchy:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Handles deeply nested structures
+result = md.convert("complex_config.json")
+print(result.text_content)
+```
+
+### JSON Lines (JSONL)
+
+For JSON Lines format (one JSON object per line):
+
+```python
+from markitdown import MarkItDown
+import json
+
+md = MarkItDown()
+
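+# Read the JSONL file one record at a time
+with open("data.jsonl", "r", encoding="utf-8") as f:
+    for line in f:
+        line = line.strip()
+        if not line:
+            continue  # skip blank lines, which json.loads() rejects
+        obj = json.loads(line)
+
+        # Write the record to a temporary JSON file for conversion
+        with open("temp.json", "w", encoding="utf-8") as temp:
+            json.dump(obj, temp)
+
+        result = md.convert("temp.json")
+        print(result.text_content)
+        print("\n---\n")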
+```
+
+### Large JSON Files
+
+For large JSON files:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Convert large JSON
+result = md.convert("large_data.json")
+
+# Save to file
+with open("output.md", "w") as f:
+    f.write(result.text_content)
+```
+
+## XML Files
+
+Convert XML documents to structured Markdown.
+
+### Basic XML Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.xml")
+print(result.text_content)
+```
+
+### XML Structure Preservation
+
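+Generic XML may be rendered hierarchically or passed through as text, depending on the MarkItDown version. The output below illustrates a hierarchical rendering; verify it against your installation: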
+
+**Input XML (`book.xml`):**
+```xml
+<?xml version="1.0"?>
+<book>
+  <title>Example Book</title>
+  <author>John Doe</author>
+  <chapters>
+    <chapter id="1">
+      <title>Introduction</title>
+      <content>Chapter 1 content...</content>
+    </chapter>
+    <chapter id="2">
+      <title>Background</title>
+      <content>Chapter 2 content...</content>
+    </chapter>
+  </chapters>
+</book>
+```
+
+**Output Markdown:**
+```markdown
+# book
+
+## title
+Example Book
+
+## author
+John Doe
+
+## chapters
+
+### chapter (id: 1)
+#### title
+Introduction
+
+#### content
+Chapter 1 content...
+
+### chapter (id: 2)
+#### title
+Background
+
+#### content
+Chapter 2 content...
+```
+
+### XML Attributes
+
+XML attributes are preserved in the conversion:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.xml")
+# Attributes shown as (attr: value) in headings
+```
+
+### XML Namespaces
+
+XML namespaces are handled:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Handles xmlns and namespaced elements
+result = md.convert("namespaced.xml")
+```
+
+### XML Use Cases
+
+**Configuration files:**
+- Convert XML configs to readable format
+- Document system configurations
+- Compare configuration files
+
+**Data interchange:**
+- Convert XML API responses
+- Process XML data feeds
+- Transform between formats
+
+**Document processing:**
+- Convert DocBook to Markdown
+- Process SVG descriptions
+- Extract structured data
+
+## Structured Data Workflows
+
+### CSV Data Analysis Pipeline
+
+```python
+from markitdown import MarkItDown
+import pandas as pd
+
+md = MarkItDown()
+
+# Read CSV for analysis
+df = pd.read_csv("data.csv")
+
+# Do analysis
+summary = df.describe()
+
+# Convert both to Markdown
+original = md.convert("data.csv")
+
+# Save summary as CSV then convert
+summary.to_csv("summary.csv")
+summary_md = md.convert("summary.csv")
+
+print("## Original Data\n")
+print(original.text_content)
+print("\n## Statistical Summary\n")
+print(summary_md.text_content)
+```
+
+### JSON API Documentation
+
+```python
+from markitdown import MarkItDown
+import requests
+import json
+
+md = MarkItDown()
+
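+# Fetch JSON from API (fail fast on HTTP errors or a hung connection)
+response = requests.get("https://api.example.com/data", timeout=30)
+response.raise_for_status()
+data = response.json()
+
+# Save as JSON
+with open("api_response.json", "w", encoding="utf-8") as f:
+    json.dump(data, f, indent=2)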
+
+# Convert to Markdown
+result = md.convert("api_response.json")
+
+# Create documentation
+doc = f"""# API Response Documentation
+
+## Endpoint
+GET https://api.example.com/data
+
+## Response
+{result.text_content}
+"""
+
+with open("api_docs.md", "w") as f:
+    f.write(doc)
+```
+
+### XML to Markdown Documentation
+
+```python
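+from markitdown import MarkItDown
+import os
+
+md = MarkItDown()
+
+# Convert XML documentation
+xml_files = ["config.xml", "schema.xml", "data.xml"]
+
+# Make sure the output directory exists before writing
+os.makedirs("docs", exist_ok=True)
+
+for xml_file in xml_files:
+    result = md.convert(xml_file)
+
+    # Swap only the final extension for .md
+    output_name = os.path.splitext(xml_file)[0] + ".md"
+    with open(os.path.join("docs", output_name), "w", encoding="utf-8") as f:
+        f.write(result.text_content)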
+```
+
+### Multi-Format Data Processing
+
+```python
+from markitdown import MarkItDown
+import os
+
+md = MarkItDown()
+
+def convert_structured_data(directory, output_dir="markdown"):
+    """Convert all structured data files in directory."""
+    extensions = {'.csv', '.json', '.xml'}
+    os.makedirs(output_dir, exist_ok=True)  # create the output directory if missing
+
+    for filename in os.listdir(directory):
+        stem, ext = os.path.splitext(filename)
+
+        if ext.lower() in extensions:
+            input_path = os.path.join(directory, filename)
+            result = md.convert(input_path)
+
+            # Keep the source extension in the name so data.csv and
+            # data.json do not overwrite each other
+            output_name = f"{stem}_{ext.lower().lstrip('.')}.md"
+            output_path = os.path.join(output_dir, output_name)
+
+            with open(output_path, 'w', encoding='utf-8') as f:
+                f.write(result.text_content)
+
+            print(f"Converted: {filename} → {output_name}")
+
+# Process all structured data
+convert_structured_data("data")
+```
+
+### CSV to JSON to Markdown
+
+```python
+import pandas as pd
+from markitdown import MarkItDown
+import json
+
+md = MarkItDown()
+
+# Read CSV
+df = pd.read_csv("data.csv")
+
+# Convert to JSON
+json_data = df.to_dict(orient='records')
+with open("temp.json", "w") as f:
+    json.dump(json_data, f, indent=2)
+
+# Convert JSON to Markdown
+result = md.convert("temp.json")
+print(result.text_content)
+```
+
+### Database Export to Markdown
+
+```python
+from markitdown import MarkItDown
+import sqlite3
+import csv
+
+md = MarkItDown()
+
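+# Export database query to CSV
+conn = sqlite3.connect("database.db")
+try:
+    cursor = conn.execute("SELECT * FROM users")
+
+    with open("users.csv", "w", newline='', encoding="utf-8") as f:
+        writer = csv.writer(f)
+        # First row: column names taken from the cursor metadata
+        writer.writerow([description[0] for description in cursor.description])
+        writer.writerows(cursor.fetchall())
+finally:
+    conn.close()  # release the database file even if the export fails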
+
+# Convert to Markdown
+result = md.convert("users.csv")
+print(result.text_content)
+```
+
+## Error Handling
+
+### CSV Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("data.csv")
+    print(result.text_content)
+except FileNotFoundError:
+    print("CSV file not found")
+except Exception as e:
+    print(f"CSV conversion error: {e}")
+    # Common issues: encoding problems, malformed CSV, delimiter issues
+```
+
+### JSON Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("data.json")
+    print(result.text_content)
+except Exception as e:
+    print(f"JSON conversion error: {e}")
+    # Common issues: invalid JSON syntax, encoding issues
+```
+
+### XML Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("data.xml")
+    print(result.text_content)
+except Exception as e:
+    print(f"XML conversion error: {e}")
+    # Common issues: malformed XML, encoding problems, namespace issues
+```
+
+## Best Practices
+
+### CSV Processing
+- Check delimiter before conversion
+- Verify encoding (UTF-8 recommended)
+- Split very large files into smaller chunks before conversion if memory is a constraint
+- Preview output for very wide tables
+
+### JSON Processing
+- Validate JSON before conversion
+- Consider pretty-printing complex structures
+- Resolve circular references before serializing in-memory objects to JSON (a JSON file itself cannot contain them)
+- Be aware of large array performance
+
+### XML Processing
+- Validate XML structure first
+- Handle namespaces consistently
+- Consider XPath for selective extraction
+- Be mindful of very deep nesting
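+
+For selective extraction, the standard library's `xml.etree.ElementTree` supports a limited XPath subset that is often enough. The sketch below reuses the `book.xml` structure shown earlier; `chapter_outline` is a hypothetical helper, not a MarkItDown function.
+
+```python
+import xml.etree.ElementTree as ET
+
+xml_text = """<?xml version="1.0"?>
+<book>
+  <title>Example Book</title>
+  <chapters>
+    <chapter id="1"><title>Introduction</title></chapter>
+    <chapter id="2"><title>Background</title></chapter>
+  </chapters>
+</book>"""
+
+def chapter_outline(source):
+    """Return one Markdown list item per chapter (hypothetical helper)."""
+    root = ET.fromstring(source)
+    items = []
+    # "./chapters/chapter" selects only chapter elements under <chapters>
+    for chapter in root.findall("./chapters/chapter"):
+        title = chapter.findtext("title", default="")
+        items.append(f"- Chapter {chapter.get('id')}: {title}")
+    return "\n".join(items)
+
+print(chapter_outline(xml_text))
+```
+
+Extracting only the needed elements first keeps the Markdown output small when the source document is large or deeply nested.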
+
+### Data Quality
+- Clean data before conversion when possible
+- Handle missing values appropriately
+- Verify special character handling
+- Test with representative samples
+
+### Performance
+- Process large files in batches
+- Use streaming for very large datasets
+- Monitor memory usage
+- Cache converted results when appropriate
--- a/skills/markitdown/references/web_content.md
+++ b/skills/markitdown/references/web_content.md
@@ -0,0 +1,478 @@
+# Web Content Extraction Reference
+
+This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.
+
+## HTML Conversion
+
+Convert HTML files and web pages to clean Markdown format.
+
+### Basic HTML Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("webpage.html")
+print(result.text_content)
+```
+
+### HTML Processing Features
+
+**What's preserved:**
+- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
+- Paragraphs and text formatting
+- Links (`<a>` → `[text](url)`)
+- Lists (ordered and unordered)
+- Tables → Markdown tables
+- Code blocks and inline code
+- Emphasis (bold, italic)
+
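+**What's removed:**
+- Scripts and styles
+- HTML comments
+
+Navigation elements, advertising, and other page boilerplate are not reliably filtered out. Pre-clean the HTML (see "Clean Web Article Extraction") when only the main content is wanted.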
+
+### HTML from URLs
+
+Convert web pages directly from URLs:
+
+```python
+from markitdown import MarkItDown
+import requests
+
+md = MarkItDown()
+
+# Fetch and convert web page (bytes are written as-is, preserving the page's encoding)
+response = requests.get("https://example.com/article", timeout=30)
+response.raise_for_status()
+with open("temp.html", "wb") as f:
+    f.write(response.content)
+
+result = md.convert("temp.html")
+print(result.text_content)
+```
+
+### Clean Web Article Extraction
+
+For extracting main content from web articles:
+
+```python
+from markitdown import MarkItDown
+import requests
+from readability import Document  # pip install readability-lxml
+
+md = MarkItDown()
+
+# Fetch page
+url = "https://example.com/article"
+response = requests.get(url, timeout=30)
+response.raise_for_status()
+
+# Extract main content
+doc = Document(response.content)
+html_content = doc.summary()
+
+# Save and convert (summary() returns a str, so write it as UTF-8)
+with open("article.html", "w", encoding="utf-8") as f:
+    f.write(html_content)
+
+result = md.convert("article.html")
+print(result.text_content)
+```
+
+### HTML with Images
+
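+An LLM client can be supplied so that images are described where the converter supports it. Coverage for images embedded in HTML may vary by MarkItDown version, so check the output: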
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+result = md.convert("page_with_images.html")
+```
+
+## YouTube Transcripts
+
+Extract video transcripts from YouTube videos.
+
+### Basic YouTube Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
+print(result.text_content)
+```
+
+### YouTube Installation
+
+```bash
+pip install 'markitdown[youtube-transcription]'
+```
+
+This installs the `youtube-transcript-api` dependency.
+
+### YouTube URL Formats
+
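+YouTube links come in several forms. The canonical `www.youtube.com/watch` form is the safest choice; normalize the others to it if a transcript is not returned: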
+- `https://www.youtube.com/watch?v=VIDEO_ID`
+- `https://youtu.be/VIDEO_ID`
+- `https://www.youtube.com/embed/VIDEO_ID`
+- `https://m.youtube.com/watch?v=VIDEO_ID`
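+
+A small helper can normalize the forms listed above to the canonical watch URL before conversion. This is a sketch using only `urllib.parse`; `canonical_youtube_url` is a hypothetical name, and it covers only the four URL shapes in the list.
+
+```python
+from urllib.parse import urlparse, parse_qs
+
+def canonical_youtube_url(url):
+    """Return the www.youtube.com/watch form of a YouTube URL, or None."""
+    parsed = urlparse(url)
+    host = (parsed.hostname or "").lower()
+    video_id = None
+    if host == "youtu.be":
+        # Short links carry the ID as the first path segment
+        video_id = parsed.path.lstrip("/").split("/")[0]
+    elif host in {"www.youtube.com", "youtube.com", "m.youtube.com"}:
+        if parsed.path == "/watch":
+            video_id = parse_qs(parsed.query).get("v", [None])[0]
+        elif parsed.path.startswith("/embed/"):
+            video_id = parsed.path[len("/embed/"):].split("/")[0]
+    if not video_id:
+        return None
+    return f"https://www.youtube.com/watch?v={video_id}"
+
+print(canonical_youtube_url("https://youtu.be/VIDEO_ID"))
+```
+
+The result can be passed straight to `md.convert()`. A return value of `None` signals a URL that is not one of the recognized shapes.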
+
+### YouTube Transcript Features
+
+**What's included:**
+- Full video transcript text
+- Timestamps (optional, depending on availability)
+- Video metadata (title, description)
+- Captions in available languages
+
+**Transcript languages:**
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
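+# Request a transcript in specific languages, in order of preference (if available)
+# Language codes: 'en', 'es', 'fr', 'de', etc.
+# Assumes your MarkItDown version accepts the youtube_transcript_languages keyword
+result = md.convert(
+    "https://www.youtube.com/watch?v=VIDEO_ID",
+    youtube_transcript_languages=["es", "en"],
+)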
+```
+
+### YouTube Playlist Processing
+
+Process multiple videos from a playlist by listing their video IDs (playlist URLs are not expanded automatically):
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+video_ids = [
+    "VIDEO_ID_1",
+    "VIDEO_ID_2",
+    "VIDEO_ID_3"
+]
+
+transcripts = []
+for vid_id in video_ids:
+    url = f"https://www.youtube.com/watch?v={vid_id}"
+    result = md.convert(url)
+    transcripts.append({
+        'video_id': vid_id,
+        'transcript': result.text_content
+    })
+```
+
+### YouTube Use Cases
+
+**Content Analysis:**
+- Analyze video content without watching
+- Extract key information from tutorials
+- Build searchable transcript databases
+
+**Research:**
+- Process interview transcripts
+- Extract lecture content
+- Analyze presentation content
+
+**Accessibility:**
+- Generate text versions of video content
+- Create searchable video archives
+
+### YouTube Limitations
+
+- Requires videos to have captions/transcripts available
+- Auto-generated captions may have transcription errors
+- Some videos may disable transcript access
+- Rate limiting may apply for bulk processing
+
+## EPUB Books
+
+Convert EPUB e-books to Markdown format.
+
+### Basic EPUB Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("book.epub")
+print(result.text_content)
+```
+
+### EPUB Processing Features
+
+**What's extracted:**
+- Book text content
+- Chapter structure
+- Headings and formatting
+- Tables of contents
+- Footnotes and references
+
+**What's preserved:**
+- Heading hierarchy
+- Text emphasis (bold, italic)
+- Links and references
+- Lists and tables
+
+### EPUB with Images
+
+EPUB files often contain images (covers, diagrams, illustrations):
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+result = md.convert("illustrated_book.epub")
+```
+
+### EPUB Use Cases
+
+**Research:**
+- Convert textbooks to searchable format
+- Extract content for analysis
+- Build digital libraries
+
+**Content Processing:**
+- Prepare books for LLM training data
+- Convert to different formats
+- Create summaries and extracts
+
+**Accessibility:**
+- Convert to more accessible formats
+- Extract text for screen readers
+- Process for text-to-speech
+
+## RSS Feeds
+
+Process RSS feeds to extract article content.
+
+### Basic RSS Processing
+
+```python
+from markitdown import MarkItDown
+import feedparser
+
+md = MarkItDown()
+
+# Parse RSS feed
+feed = feedparser.parse("https://example.com/feed.xml")
+
+# Convert each entry
+for entry in feed.entries:
+    # Save entry HTML (some feeds omit the summary field)
+    with open("temp.html", "w", encoding="utf-8") as f:
+        f.write(entry.get("summary", ""))
+
+    result = md.convert("temp.html")
+    print(f"## {entry.get('title', 'Untitled')}\n\n{result.text_content}\n\n")
+```
+
+## Combined Web Content Workflows
+
+### Web Scraping Pipeline
+
+```python
+from markitdown import MarkItDown
+import requests
+from bs4 import BeautifulSoup
+
+md = MarkItDown()
+
+def scrape_and_convert(url):
+    """Scrape webpage and convert to Markdown."""
+    response = requests.get(url, timeout=30)
+    response.raise_for_status()
+    soup = BeautifulSoup(response.content, 'html.parser')
+
+    # Extract main content
+    main_content = soup.find('article') or soup.find('main')
+
+    if main_content:
+        # Save HTML
+        with open("temp.html", "w", encoding="utf-8") as f:
+            f.write(str(main_content))
+
+        # Convert to Markdown
+        result = md.convert("temp.html")
+        return result.text_content
+
+    return None
+
+# Use it
+markdown = scrape_and_convert("https://example.com/article")
+print(markdown)
+```
+
+### YouTube Learning Content Extraction
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Course videos
+course_videos = [
+    ("https://www.youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
+    ("https://www.youtube.com/watch?v=ID2", "Lesson 2: Basics"),
+    ("https://www.youtube.com/watch?v=ID3", "Lesson 3: Advanced")
+]
+
+course_content = []
+for url, title in course_videos:
+    result = md.convert(url)
+    course_content.append(f"# {title}\n\n{result.text_content}")
+
+# Combine into course document
+full_course = "\n\n---\n\n".join(course_content)
+with open("course_transcript.md", "w") as f:
+    f.write(full_course)
+```
+
+### Documentation Scraping
+
+```python
+from markitdown import MarkItDown
+import requests
+from urllib.parse import urljoin
+
+md = MarkItDown()
+
+def scrape_documentation(base_url, page_urls):
+    """Scrape multiple documentation pages."""
+    docs = []
+
+    for page_url in page_urls:
+        full_url = urljoin(base_url, page_url)
+
+        # Fetch page (fail fast on HTTP errors or a hung connection)
+        response = requests.get(full_url, timeout=30)
+        response.raise_for_status()
+        with open("temp.html", "wb") as f:
+            f.write(response.content)
+
+        # Convert
+        result = md.convert("temp.html")
+        docs.append({
+            'url': full_url,
+            'content': result.text_content
+        })
+
+    return docs
+
+# Example usage
+base = "https://docs.example.com/"
+pages = ["intro.html", "getting-started.html", "api.html"]
+documentation = scrape_documentation(base, pages)
+```
+
+### EPUB Library Processing
+
+```python
+from markitdown import MarkItDown
+import os
+
+md = MarkItDown()
+
+def process_epub_library(library_path, output_path):
+    """Convert all EPUB books in a directory."""
+    os.makedirs(output_path, exist_ok=True)  # create the output directory if missing
+
+    for filename in os.listdir(library_path):
+        if filename.lower().endswith('.epub'):
+            epub_path = os.path.join(library_path, filename)
+
+            try:
+                result = md.convert(epub_path)
+
+                # Save markdown, swapping only the final extension
+                output_file = os.path.splitext(filename)[0] + '.md'
+                output_full = os.path.join(output_path, output_file)
+
+                with open(output_full, 'w', encoding='utf-8') as f:
+                    f.write(result.text_content)
+
+                print(f"Converted: {filename}")
+            except Exception as e:
+                print(f"Failed to convert {filename}: {e}")
+
+# Process library
+process_epub_library("books", "markdown_books")
+```
+
+## Error Handling
+
+### HTML Conversion Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("webpage.html")
+    print(result.text_content)
+except FileNotFoundError:
+    print("HTML file not found")
+except Exception as e:
+    print(f"Conversion error: {e}")
+```
+
+### YouTube Transcript Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
+    print(result.text_content)
+except Exception as e:
+    print(f"Failed to get transcript: {e}")
+    # Common issues: No transcript available, video unavailable, network error
+```
+
+### EPUB Conversion Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("book.epub")
+    print(result.text_content)
+except Exception as e:
+    print(f"EPUB processing error: {e}")
+    # Common issues: Corrupted file, unsupported DRM, invalid format
+```
+
+## Best Practices
+
+### HTML Processing
+- Clean HTML before conversion for better results
+- Use readability libraries to extract main content
+- Handle different encodings appropriately
+- Remove unnecessary markup
+
+### YouTube Processing
+- Check transcript availability before batch processing
+- Handle API rate limits gracefully
+- Store transcripts to avoid re-fetching
+- Respect YouTube's terms of service
+
+### EPUB Processing
+- DRM-protected EPUBs cannot be processed
+- Large EPUBs may require more memory
+- Some formatting may not translate perfectly
+- Test with representative samples first
+
+### Web Scraping Ethics
+- Respect robots.txt
+- Add delays between requests
+- Identify your scraper in User-Agent
+- Cache results to minimize requests
+- Follow website terms of service
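+
+The first two points can be sketched with the standard library alone. The example below checks URLs against robots.txt rules and spaces out requests; the names `allowed_urls`, `fetch_politely`, and the `USER_AGENT` string are hypothetical, and the rules are supplied inline so the snippet runs without network access.
+
+```python
+import time
+from urllib import robotparser
+
+USER_AGENT = "example-docs-bot/0.1"  # hypothetical identifier for your scraper
+
+def allowed_urls(robots_lines, urls, agent=USER_AGENT):
+    """Keep only the URLs that the robots.txt rules permit for agent."""
+    parser = robotparser.RobotFileParser()
+    parser.parse(robots_lines)  # accepts the lines of a robots.txt file
+    return [u for u in urls if parser.can_fetch(agent, u)]
+
+def fetch_politely(urls, fetch, delay_seconds=1.0):
+    """Call fetch(url) for each URL, sleeping between consecutive requests."""
+    results = {}
+    for i, url in enumerate(urls):
+        if i:
+            time.sleep(delay_seconds)
+        results[url] = fetch(url)
+    return results
+
+rules = ["User-agent: *", "Disallow: /private/"]
+pages = [
+    "https://example.com/docs/intro.html",
+    "https://example.com/private/admin.html",
+]
+print(allowed_urls(rules, pages))
+```
+
+In practice the rules would come from the site's own `/robots.txt`, and `fetch` would be a function that sends the `User-Agent` header with each request before the response is saved and passed to `md.convert()`.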