Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/markitdown/references/media_processing.md
+++ b/skills/markitdown/references/media_processing.md
@@ -0,0 +1,365 @@
+# Media Processing Reference
+
+This document provides detailed information about processing images and audio files with MarkItDown.
+
+## Image Processing
+
+MarkItDown can extract text from images using OCR and retrieve EXIF metadata.
+
+### Basic Image Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("photo.jpg")
+print(result.text_content)
+```
+
+### Image Processing Features
+
+**What's extracted:**
+1. **EXIF Metadata** - Camera settings, date, location, etc.
+2. **OCR Text** - Text detected in the image (requires tesseract)
+3. **Image Description** - AI-generated description (with LLM integration)
+
+### EXIF Metadata Extraction
+
+Images from cameras and smartphones contain EXIF metadata, which MarkItDown extracts automatically when the `exiftool` command-line tool is installed:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("IMG_1234.jpg")
+print(result.text_content)
+```
+
+**Example output includes:**
+- Camera make and model
+- Capture date and time
+- GPS coordinates (if available)
+- Exposure settings (ISO, shutter speed, aperture)
+- Image dimensions
+- Orientation
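+
+The metadata arrives as plain text inside `result.text_content`. A minimal sketch for turning it into a dictionary, assuming each field is emitted as a `Key: Value` line (an assumption about the output layout; `parse_metadata_lines` is a hypothetical helper, not part of MarkItDown):
+
+```python
+def parse_metadata_lines(text: str) -> dict[str, str]:
+    """Collect `Key: Value` lines into a dict, skipping anything else."""
+    fields = {}
+    for line in text.splitlines():
+        # Split on the first colon only, so values may contain colons
+        key, sep, value = line.partition(":")
+        if sep and key.strip() and value.strip():
+            fields[key.strip()] = value.strip()
+    return fields
+
+
+# Sample text standing in for result.text_content (illustrative values)
+sample = "ImageSize: 4032x3024\nDateTimeOriginal: 2024:05:01 10:15:00\n\nSome other text"
+metadata = parse_metadata_lines(sample)
+print(metadata["ImageSize"])
+```
+
+Any prose line that happens to contain a colon is also picked up, so filter the result by the keys you expect.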
+
+### OCR (Optical Character Recognition)
+
+Extract text from images containing text (screenshots, scanned documents, photos of text):
+
+**Requirements:**
+- Install tesseract OCR engine:
+  ```bash
+  # macOS
+  brew install tesseract
+
+  # Ubuntu/Debian
+  apt-get install tesseract-ocr
+
+  # Windows
+  # Download installer from https://github.com/UB-Mannheim/tesseract/wiki
+  ```
+
+**Usage:**
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("screenshot.png")
+print(result.text_content)  # Contains OCR'd text
+```
+
+**Best practices for OCR:**
+- Use high-resolution images for better accuracy
+- Ensure good contrast between text and background
+- Straighten skewed text if possible
+- Use well-lit, clear images
+
+### LLM-Generated Image Descriptions
+
+Generate detailed, contextual descriptions of images using GPT-4o or other vision models:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+result = md.convert("diagram.png")
+print(result.text_content)
+```
+
+**Custom prompts for specific needs:**
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+
+# For diagrams
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
+)
+
+# For charts
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Analyze this chart and provide key data points and trends"
+)
+
+# For UI screenshots
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this user interface, listing all visible elements and their layout"
+)
+```
+
+### Supported Image Formats
+
+MarkItDown supports all common image formats:
+- JPEG/JPG
+- PNG
+- GIF
+- BMP
+- TIFF
+- WebP
+- HEIC (requires additional libraries on some platforms)
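+
+When scanning a folder, it can help to select files by extension before calling `convert`. A small sketch using only the standard library; `find_images` and `IMAGE_EXTENSIONS` are hypothetical names, and the extension set simply mirrors the list above, so trim it to what your installation actually handles:
+
+```python
+from pathlib import Path
+
+# Extensions taken from the list above; adjust to your installation
+IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".heic"}
+
+
+def find_images(directory: str) -> list[Path]:
+    """Return image files in `directory`, matched case-insensitively by suffix."""
+    return sorted(
+        p for p in Path(directory).iterdir()
+        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
+    )
+```
+
+Each returned path can then be passed to `md.convert(str(path))`.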
+
+## Audio Processing
+
+MarkItDown can transcribe audio files to text using speech recognition.
+
+### Basic Audio Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("recording.wav")
+print(result.text_content)  # Transcribed speech
+```
+
+### Audio Transcription Setup
+
+**Installation:**
+```bash
+pip install 'markitdown[audio-transcription]'
+```
+
+This installs the `SpeechRecognition` package (imported as `speech_recognition`) and its dependencies.
+
+### Supported Audio Formats
+
+- WAV
+- AIFF
+- FLAC
+- MP3 (requires ffmpeg or libav)
+- OGG (requires ffmpeg or libav)
+- Other formats supported by speech_recognition
+
+### Audio Transcription Engines
+
+MarkItDown uses the `speech_recognition` library, which supports multiple backends:
+
+**Default (Google Speech Recognition):**
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("audio.wav")
+```
+
+**Note:** The default Google Speech Recognition backend requires an internet connection.
+
+### Audio Quality Considerations
+
+For best transcription accuracy:
+- Use clear audio with minimal background noise
+- Prefer WAV or FLAC for better quality
+- Ensure speech is clear and at good volume
+- Avoid multiple overlapping speakers
+- Use mono audio when possible
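+
+To check a WAV file against these points before transcribing, the header can be read with Python's standard `wave` module. This is a sketch that covers uncompressed WAV only; `describe_wav` is a hypothetical helper, not part of MarkItDown:
+
+```python
+import wave
+
+
+def describe_wav(path: str) -> dict:
+    """Read channel count, sample rate, and duration from a WAV header."""
+    with wave.open(path, "rb") as wav:
+        frames = wav.getnframes()
+        rate = wav.getframerate()
+        return {
+            "channels": wav.getnchannels(),  # 1 = mono, 2 = stereo
+            "sample_rate": rate,
+            "duration_seconds": frames / rate if rate else 0.0,
+        }
+```
+
+A result with `channels` greater than 1 suggests downmixing to mono first.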
+
+### Audio Preprocessing Tips
+
+For better results, consider preprocessing audio:
+
+```python
+# Example: requires pydub (and ffmpeg or libav for MP3 input)
+from markitdown import MarkItDown
+from pydub import AudioSegment
+from pydub.effects import normalize
+
+# Load, normalize volume, and downmix to mono
+audio = AudioSegment.from_file("recording.mp3")
+audio = normalize(audio).set_channels(1)
+audio.export("normalized.wav", format="wav")
+
+# Then convert with MarkItDown
+md = MarkItDown()
+result = md.convert("normalized.wav")
+```
+
+## Combined Media Workflows
+
+### Processing Multiple Images in Batch
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+import os
+
+client = OpenAI()
+md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+
+# Make sure the output directory exists before writing to it
+os.makedirs("output", exist_ok=True)
+
+# Process all images in directory
+for filename in os.listdir("images"):
+    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
+        result = md.convert(os.path.join("images", filename))
+
+        # Save markdown with the same base name
+        output = os.path.splitext(filename)[0] + '.md'
+        with open(os.path.join("output", output), "w", encoding="utf-8") as f:
+            f.write(result.text_content)
+```
+
+### Screenshot Analysis Pipeline
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
+)
+
+screenshots = ["screen1.png", "screen2.png", "screen3.png"]
+analysis = []
+
+for screenshot in screenshots:
+    result = md.convert(screenshot)
+    analysis.append({
+        'file': screenshot,
+        'content': result.text_content
+    })
+
+# Now ready for further processing
+```
+
+### Document Images with OCR
+
+For scanned documents or photos of documents:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Process scanned pages
+pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
+full_text = []
+
+for page in pages:
+    result = md.convert(page)
+    full_text.append(result.text_content)
+
+# Combine into single document
+document = "\n\n---\n\n".join(full_text)
+print(document)
+```
+
+### Presentation Slide Images
+
+When you have presentation slides as images:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+client = OpenAI()
+md = MarkItDown(
+    llm_client=client,
+    llm_model="gpt-4o",
+    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
+)
+
+# Process slide images
+for i in range(1, 21):  # 20 slides
+    result = md.convert(f"slides/slide_{i}.png")
+    print(f"## Slide {i}\n\n{result.text_content}\n\n")
+```
+
+## Error Handling
+
+### Image Processing Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("image.jpg")
+    print(result.text_content)
+except FileNotFoundError:
+    print("Image file not found")
+except Exception as e:
+    print(f"Error processing image: {e}")
+```
+
+### Audio Processing Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("audio.mp3")
+    print(result.text_content)
+except Exception as e:
+    print(f"Transcription failed: {e}")
+    # Common issues: format not supported, no speech detected, network error
+```
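+
+Because the default backend depends on the network, transient failures may be worth retrying. A minimal sketch with exponential backoff follows; `convert_with_retry` is a hypothetical helper, and catching every `Exception` is a simplification, since retrying will not help with an unsupported format:
+
+```python
+import time
+
+
+def convert_with_retry(convert, path, attempts=3, delay=1.0):
+    """Call convert(path), retrying on any exception with exponential backoff."""
+    for attempt in range(1, attempts + 1):
+        try:
+            return convert(path)
+        except Exception:
+            if attempt == attempts:
+                raise  # Out of retries: surface the last error
+            # Wait delay, 2*delay, 4*delay, ... between attempts
+            time.sleep(delay * 2 ** (attempt - 1))
+
+
+# Usage (assumes an existing MarkItDown instance named md):
+# result = convert_with_retry(md.convert, "audio.mp3")
+```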
+
+## Performance Optimization
+
+### Image Processing
+
+- **LLM descriptions**: Slower but more informative
+- **OCR only**: Faster for text extraction
+- **EXIF only**: Fastest, metadata only
+- **Batch processing**: Process multiple images in parallel
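+
+One way to process images in parallel is a thread pool, which suits LLM descriptions because most of the time is spent waiting on network calls. This is a sketch only: `convert_many` is a hypothetical helper, and it assumes the callable you pass (for example `md.convert`) is safe to call from several threads, which this reference does not confirm:
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+
+def convert_many(convert, paths, max_workers=4):
+    """Run convert(path) for each path in a thread pool.
+
+    Returns a dict mapping each path to its result, or to the exception raised.
+    """
+    paths = list(paths)
+
+    def safe_convert(path):
+        try:
+            return convert(path)
+        except Exception as exc:  # Keep one bad file from stopping the batch
+            return exc
+
+    with ThreadPoolExecutor(max_workers=max_workers) as pool:
+        # pool.map preserves input order, so zip pairs each path with its result
+        return dict(zip(paths, pool.map(safe_convert, paths)))
+
+
+# Usage (assumes an existing MarkItDown instance named md):
+# results = convert_many(md.convert, ["a.png", "b.png"])
+```
+
+Keep `max_workers` modest when an LLM client is configured, since API rate limits may apply.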
+
+### Audio Processing
+
+- **File size**: Larger files take longer
+- **Audio length**: Transcription time scales with duration
+- **Format conversion**: WAV and FLAC are read directly, while MP3 and OGG must be decoded first (via ffmpeg or libav), which adds time
+- **Network dependency**: Default transcription requires internet
+
+## Use Cases
+
+### Document Digitization
+Convert scanned documents or photos of documents to searchable text.
+
+### Meeting Notes
+Transcribe audio recordings of meetings to text for analysis.
+
+### Presentation Analysis
+Extract content from presentation slide images.
+
+### Screenshot Documentation
+Generate descriptions of UI screenshots for documentation.
+
+### Image Archiving
+Extract metadata and content from photo collections.
+
+### Accessibility
+Generate alt-text descriptions for images using LLM integration.
+
+### Data Extraction
+OCR text from images containing tables, forms, or structured data.