Initial commit

2025-11-30 08:30:14 +08:00
commit 1dd5bee3b4
335 changed files with 147360 additions and 0 deletions
--- a/skills/markitdown/SKILL.md
+++ b/skills/markitdown/SKILL.md
@@ -0,0 +1,486 @@
+---
+name: markitdown
+description: "Convert files and office documents to Markdown. Supports PDF, DOCX, PPTX, XLSX, images (with OCR), audio (with transcription), HTML, CSV, JSON, XML, ZIP, YouTube URLs, EPubs and more."
+allowed-tools: [Read, Write, Edit, Bash]
+license: MIT
+source: https://github.com/microsoft/markitdown
+---
+
+# MarkItDown - File to Markdown Conversion
+
+## Overview
+
+MarkItDown is a Python tool developed by Microsoft for converting various file formats to Markdown. It is particularly useful for converting documents into an LLM-friendly text format, because Markdown is token-efficient and well understood by modern language models.
+
+**Key Benefits**:
+- Convert documents to clean, structured Markdown
+- Token-efficient format for LLM processing
+- Supports 15+ file formats
+- Optional AI-enhanced image descriptions
+- OCR for images and scanned documents
+- Speech transcription for audio files
+
+## Visual Enhancement with Scientific Schematics
+
+**When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.**
+
+If your document does not already contain schematics or diagrams:
+- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
+- Simply describe your desired diagram in natural language
+- Nano Banana Pro will automatically generate, review, and refine the schematic
+
+**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.
+
+**How to generate schematics:**
+```bash
+python scripts/generate_schematic.py "your diagram description" -o figures/output.png
+```
+
+The AI will automatically:
+- Create publication-quality images with proper formatting
+- Review and refine through multiple iterations
+- Ensure accessibility (colorblind-friendly, high contrast)
+- Save outputs in the figures/ directory
+
+**When to add schematics:**
+- Document conversion workflow diagrams
+- File format architecture illustrations
+- OCR processing pipeline diagrams
+- Integration workflow visualizations
+- System architecture diagrams
+- Data flow diagrams
+- Any complex concept that benefits from visualization
+
+For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.
+
+---
+
+## Supported Formats
+
+| Format | Description | Notes |
+|--------|-------------|-------|
+| **PDF** | Portable Document Format | Full text extraction |
+| **DOCX** | Microsoft Word | Tables, formatting preserved |
+| **PPTX** | PowerPoint | Slides with notes |
+| **XLSX** | Excel spreadsheets | Tables and data |
+| **Images** | JPEG, PNG, GIF, WebP | EXIF metadata + OCR |
+| **Audio** | WAV, MP3 | Metadata + transcription |
+| **HTML** | Web pages | Clean conversion |
+| **CSV** | Comma-separated values | Table format |
+| **JSON** | JSON data | Structured representation |
+| **XML** | XML documents | Structured format |
+| **ZIP** | Archive files | Iterates contents |
+| **EPUB** | E-books | Full text extraction |
+| **YouTube** | Video URLs | Fetch transcriptions |
+
+## Quick Start
+
+### Installation
+
+```bash
+# Install with all features
+pip install 'markitdown[all]'
+
+# Or from source
+git clone https://github.com/microsoft/markitdown.git
+cd markitdown
+pip install -e 'packages/markitdown[all]'
+```
+
+### Command-Line Usage
+
+```bash
+# Basic conversion
+markitdown document.pdf > output.md
+
+# Specify output file
+markitdown document.pdf -o output.md
+
+# Pipe content
+cat document.pdf | markitdown > output.md
+
+# Enable plugins
+markitdown --list-plugins  # List available plugins
+markitdown --use-plugins document.pdf -o output.md
+```
+
+### Python API
+
+```python
+from markitdown import MarkItDown
+
+# Basic usage
+md = MarkItDown()
+result = md.convert("document.pdf")
+print(result.text_content)
+
+# Convert from stream
+with open("document.pdf", "rb") as f:
+    result = md.convert_stream(f, file_extension=".pdf")
+    print(result.text_content)
+```
+
+## Advanced Features
+
+### 1. AI-Enhanced Image Descriptions
+
+Use LLMs via OpenRouter to generate detailed image descriptions (for PPTX and image files):
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+# Initialize OpenRouter client (OpenAI-compatible API)
+client = OpenAI(
+    api_key="your-openrouter-api-key",
+    base_url="https://openrouter.ai/api/v1"
+)
+
+md = MarkItDown(
+    llm_client=client,
+    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
+    llm_prompt="Describe this image in detail for scientific documentation"
+)
+
+result = md.convert("presentation.pptx")
+print(result.text_content)
+```
+
+### 2. Azure Document Intelligence
+
+For enhanced PDF conversion with Microsoft Document Intelligence:
+
+```bash
+# Command line
+markitdown document.pdf -o output.md -d -e "<document_intelligence_endpoint>"
+```
+
+```python
+# Python API
+from markitdown import MarkItDown
+
+md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
+result = md.convert("complex_document.pdf")
+print(result.text_content)
+```
+
+### 3. Plugin System
+
+MarkItDown supports third-party plugins that extend its functionality. Plugins are disabled by default:
+
+```bash
+# List installed plugins
+markitdown --list-plugins
+
+# Enable plugins
+markitdown --use-plugins file.pdf -o output.md
+```
+
+To find available plugins, search GitHub for the hashtag `#markitdown-plugin`.
+
+## Optional Dependencies
+
+Install only the dependencies for the file formats you need:
+
+```bash
+# Install specific formats
+pip install 'markitdown[pdf, docx, pptx]'
+
+# All available options:
+# [all]                  - All optional dependencies
+# [pptx]                 - PowerPoint files
+# [docx]                 - Word documents
+# [xlsx]                 - Excel spreadsheets
+# [xls]                  - Older Excel files
+# [pdf]                  - PDF documents
+# [outlook]              - Outlook messages
+# [az-doc-intel]         - Azure Document Intelligence
+# [audio-transcription]  - WAV and MP3 transcription
+# [youtube-transcription] - YouTube video transcription
+```
+
+## Common Use Cases
+
+### 1. Convert Scientific Papers to Markdown
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Convert PDF paper
+result = md.convert("research_paper.pdf")
+with open("paper.md", "w", encoding="utf-8") as f:
+    f.write(result.text_content)
+```
+
+### 2. Extract Data from Excel for Analysis
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.xlsx")
+
+# Result will be in Markdown table format
+print(result.text_content)
+```
+
+### 3. Process Multiple Documents
+
+```python
+from markitdown import MarkItDown
+from pathlib import Path
+
+md = MarkItDown()
+
+# Process all PDFs in a directory
+pdf_dir = Path("papers/")
+output_dir = Path("markdown_output/")
+output_dir.mkdir(exist_ok=True)
+
+for pdf_file in pdf_dir.glob("*.pdf"):
+    result = md.convert(str(pdf_file))
+    output_file = output_dir / f"{pdf_file.stem}.md"
+    output_file.write_text(result.text_content, encoding="utf-8")
+    print(f"Converted: {pdf_file.name}")
+```
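+
+When the same directory is converted repeatedly, it can help to skip files whose Markdown output is already newer than the source. The following is a minimal sketch using only the standard library; `needs_conversion` is a hypothetical helper, not part of MarkItDown:
+
+```python
+from pathlib import Path
+
+
+def needs_conversion(source: Path, target: Path) -> bool:
+    """Return True if target is missing or older than source."""
+    if not target.exists():
+        return True
+    # Compare modification times; an older target means the source changed
+    return target.stat().st_mtime < source.stat().st_mtime
+```
+
+In the loop above, computing `output_file` before the `md.convert()` call and adding `if not needs_conversion(pdf_file, output_file): continue` would skip unchanged PDFs.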
+
+### 4. Convert PowerPoint with AI Descriptions
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+# Use OpenRouter for access to multiple AI models
+client = OpenAI(
+    api_key="your-openrouter-api-key",
+    base_url="https://openrouter.ai/api/v1"
+)
+
+md = MarkItDown(
+    llm_client=client,
+    llm_model="anthropic/claude-sonnet-4.5",  # recommended for presentations
+    llm_prompt="Describe this slide image in detail, focusing on key visual elements and data"
+)
+
+result = md.convert("presentation.pptx")
+with open("presentation.md", "w", encoding="utf-8") as f:
+    f.write(result.text_content)
+```
+
+### 5. Batch Convert with Different Formats
+
+```python
+from markitdown import MarkItDown
+from pathlib import Path
+
+md = MarkItDown()
+
+# Files to convert
+files = [
+    "document.pdf",
+    "spreadsheet.xlsx",
+    "presentation.pptx",
+    "notes.docx"
+]
+
+for file in files:
+    try:
+        result = md.convert(file)
+        output = Path(file).stem + ".md"
+        with open(output, "w", encoding="utf-8") as f:
+            f.write(result.text_content)
+        print(f"✓ Converted {file}")
+    except Exception as e:
+        print(f"✗ Error converting {file}: {e}")
+```
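+
+One thing to watch for in mixed-format batches: naming outputs by stem alone means `report.pdf` and `report.docx` would both write `report.md`, and the second silently overwrites the first. A possible workaround is sketched below; `output_name` is a hypothetical helper and the file names are illustrative only:
+
+```python
+from pathlib import Path
+
+
+def output_name(source: str, taken: set[str]) -> str:
+    """Pick a Markdown file name for source that is not already in taken."""
+    path = Path(source)
+    candidate = f"{path.stem}.md"
+    if candidate in taken:
+        # Fall back to keeping the original extension: report.docx -> report.docx.md
+        candidate = f"{path.name}.md"
+    taken.add(candidate)
+    return candidate
+
+
+taken: set[str] = set()
+names = [output_name(f, taken) for f in ["report.pdf", "report.docx", "notes.docx"]]
+print(names)  # ['report.md', 'report.docx.md', 'notes.md']
+```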
+
+### 6. Extract YouTube Video Transcription
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Convert YouTube video to transcript
+result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
+print(result.text_content)
+```
+
+## Docker Usage
+
+```bash
+# Build image
+docker build -t markitdown:latest .
+
+# Run conversion
+docker run --rm -i markitdown:latest < ~/document.pdf > output.md
+```
+
+## Best Practices
+
+### 1. Choose the Right Conversion Method
+
+- **Simple documents**: Use basic `MarkItDown()`
+- **Complex PDFs**: Use Azure Document Intelligence
+- **Visual content**: Enable AI image descriptions
+- **Scanned documents**: Ensure OCR dependencies are installed
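+
+The guidance above can be expressed as a small decision helper. This is only a sketch: `pick_method`, the `complex_layout` flag, the extension set, and the returned labels are all assumptions made for illustration, not MarkItDown options:
+
+```python
+from pathlib import Path
+
+# Formats where AI image descriptions are likely to add value (assumed set)
+VISUAL_EXTENSIONS = {".jpg", ".jpeg", ".png", ".pptx"}
+
+
+def pick_method(path: str, complex_layout: bool = False) -> str:
+    """Return a label for the conversion approach suggested by the list above."""
+    ext = Path(path).suffix.lower()
+    if ext == ".pdf" and complex_layout:
+        return "document-intelligence"  # Azure Document Intelligence
+    if ext in VISUAL_EXTENSIONS:
+        return "llm-descriptions"  # MarkItDown(llm_client=..., llm_model=...)
+    return "basic"  # plain MarkItDown()
+
+
+print(pick_method("paper.pdf", complex_layout=True))  # document-intelligence
+print(pick_method("slides.PPTX"))  # llm-descriptions
+print(pick_method("notes.docx"))  # basic
+```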
+
+### 2. Handle Errors Gracefully
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("document.pdf")
+    print(result.text_content)
+except FileNotFoundError:
+    print("File not found")
+except Exception as e:
+    print(f"Conversion error: {e}")
+```
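+
+For batch jobs, the same pattern can be wrapped so that one failure does not stop the run. In the sketch below, `safe_convert` is a hypothetical helper that takes any callable mapping a path to Markdown text, so it does not depend on MarkItDown directly:
+
+```python
+from typing import Callable, Optional, Tuple
+
+
+def safe_convert(
+    convert: Callable[[str], str], path: str
+) -> Tuple[Optional[str], Optional[str]]:
+    """Run convert(path); return (text, None) on success or (None, message) on failure."""
+    try:
+        return convert(path), None
+    except FileNotFoundError:
+        return None, f"File not found: {path}"
+    except Exception as exc:  # broad on purpose: converters may raise library-specific errors
+        return None, f"Conversion error for {path}: {exc}"
+```
+
+With the `md` object from the example above, this could be called as `text, error = safe_convert(lambda p: md.convert(p).text_content, "document.pdf")`.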
+
+### 3. Process Large Files Efficiently
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# convert_stream accepts an open binary file object instead of a path
+with open("large_file.pdf", "rb") as f:
+    result = md.convert_stream(f, file_extension=".pdf")
+
+# The converted text is returned as a single string; write it straight to disk
+with open("output.md", "w", encoding="utf-8") as out:
+    out.write(result.text_content)
+```
+
+### 4. Optimize for Token Efficiency
+
+Markdown output is already token-efficient, but you can:
+- Remove excessive whitespace
+- Consolidate similar sections
+- Strip metadata if not needed
+
+```python
+from markitdown import MarkItDown
+import re
+
+md = MarkItDown()
+result = md.convert("document.pdf")
+
+# Clean up extra whitespace
+clean_text = re.sub(r'\n{3,}', '\n\n', result.text_content)
+clean_text = clean_text.strip()
+
+print(clean_text)
+```
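+
+A slightly more thorough cleanup also strips trailing whitespace, so that lines containing only spaces or tabs collapse as well. `tidy_markdown` below is a hypothetical helper and the sample text is made up. Note that it treats fenced code blocks like any other text and removes the two-space hard line breaks that Markdown allows, so it may not suit every document:
+
+```python
+import re
+
+
+def tidy_markdown(text: str) -> str:
+    """Strip trailing whitespace per line, then collapse runs of blank lines."""
+    lines = [line.rstrip() for line in text.splitlines()]
+    joined = "\n".join(lines)
+    return re.sub(r"\n{3,}", "\n\n", joined).strip()
+
+
+sample = "# Title   \n\n   \n\n\nBody text\t\n"
+cleaned = tidy_markdown(sample)
+print(repr(cleaned))  # '# Title\n\nBody text'
+print(f"{len(sample)} -> {len(cleaned)} characters")
+```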
+
+## Integration with Scientific Workflows
+
+### Convert Literature for Review
+
+```python
+from markitdown import MarkItDown
+from pathlib import Path
+
+md = MarkItDown()
+
+# Convert all papers in literature folder
+papers_dir = Path("literature/pdfs")
+output_dir = Path("literature/markdown")
+output_dir.mkdir(exist_ok=True)
+
+for paper in papers_dir.glob("*.pdf"):
+    result = md.convert(str(paper))
+    
+    # Save with metadata
+    output_file = output_dir / f"{paper.stem}.md"
+    content = f"# {paper.stem}\n\n"
+    content += f"**Source**: {paper.name}\n\n"
+    content += "---\n\n"
+    content += result.text_content
+
+    output_file.write_text(content, encoding="utf-8")
+
+# For AI-enhanced conversion with figures
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="your-openrouter-api-key",
+    base_url="https://openrouter.ai/api/v1"
+)
+
+md_ai = MarkItDown(
+    llm_client=client,
+    llm_model="anthropic/claude-sonnet-4.5",
+    llm_prompt="Describe scientific figures with technical precision"
+)
+```
+
+### Extract Tables for Analysis
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data_tables.xlsx")
+
+# Markdown tables can be parsed or used directly
+print(result.text_content)
+```
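+
+If the rows are needed as Python data, a pipe table can be parsed with a few lines of standard-library code. This sketch assumes the output contains a conventional pipe table with a `| --- |` separator row and no escaped `|` characters inside cells; `parse_markdown_table` and the sample data are hypothetical:
+
+```python
+def parse_markdown_table(text: str) -> list[dict[str, str]]:
+    """Parse the first pipe table found in text into a list of row dicts."""
+    rows = []
+    for line in text.splitlines():
+        line = line.strip()
+        if line.startswith("|") and line.endswith("|"):
+            rows.append([cell.strip() for cell in line[1:-1].split("|")])
+        elif rows:
+            break  # the first table has ended
+    if len(rows) < 2:
+        return []
+    header, body = rows[0], rows[2:]  # rows[1] is the separator row
+    return [dict(zip(header, row)) for row in body]
+
+
+sample = """## Sheet1
+| gene | count |
+| --- | --- |
+| BRCA1 | 12 |
+| TP53 | 7 |
+"""
+print(parse_markdown_table(sample))
+# [{'gene': 'BRCA1', 'count': '12'}, {'gene': 'TP53', 'count': '7'}]
+```
+
+All values come back as strings; convert them with `int()` or `float()` as needed before analysis.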
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Missing dependencies**: Install feature-specific packages
+   ```bash
+   pip install 'markitdown[pdf]'  # For PDF support
+   ```
+
+2. **Binary file errors**: Ensure files are opened in binary mode
+   ```python
+   with open("file.pdf", "rb") as f:  # Note the "rb"
+       result = md.convert_stream(f, file_extension=".pdf")
+   ```
+
+3. **OCR not working**: Install tesseract
+   ```bash
+   # macOS
+   brew install tesseract
+   
+   # Ubuntu
+   sudo apt-get install tesseract-ocr
+   ```
+
+## Performance Considerations
+
+- **PDF files**: Large PDFs may take time; consider page ranges if supported
+- **Image OCR**: OCR processing is CPU-intensive
+- **Audio transcription**: Requires additional compute resources
+- **AI image descriptions**: Requires API calls (costs may apply)
+
+## Next Steps
+
+- See `references/api_reference.md` for complete API documentation
+- Check `references/file_formats.md` for format-specific details
+- Review `scripts/batch_convert.py` for automation examples
+- Explore `scripts/convert_with_ai.py` for AI-enhanced conversions
+
+## Resources
+
+- **MarkItDown GitHub**: https://github.com/microsoft/markitdown
+- **PyPI**: https://pypi.org/project/markitdown/
+- **OpenRouter**: https://openrouter.ai (for AI-enhanced conversions)
+- **OpenRouter API Keys**: https://openrouter.ai/keys
+- **OpenRouter Models**: https://openrouter.ai/models
+- **MCP Server**: markitdown-mcp (for Claude Desktop integration)
+- **Plugin Development**: See `packages/markitdown-sample-plugin`
+