# Media Processing Reference
This document provides detailed information about processing images and audio files with MarkItDown.
## Image Processing
MarkItDown can extract text from images using OCR and retrieve EXIF metadata.
### Basic Image Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
```
### Image Processing Features
**What's extracted:**
1. **EXIF Metadata** - Camera settings, date, location, etc.
2. **OCR Text** - Text detected in the image (requires tesseract)
3. **Image Description** - AI-generated description (with LLM integration)
### EXIF Metadata Extraction
Images from cameras and smartphones contain EXIF metadata that's automatically extracted:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)
```
**Example output includes:**
- Camera make and model
- Capture date and time
- GPS coordinates (if available)
- Exposure settings (ISO, shutter speed, aperture)
- Image dimensions
- Orientation
### OCR (Optical Character Recognition)
Extract text from images containing text (screenshots, scanned documents, photos of text):
**Requirements:**
- Install tesseract OCR engine:
```bash
# macOS
brew install tesseract
# Ubuntu/Debian
apt-get install tesseract-ocr
# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki
```
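Once installed, confirm the binary is on your PATH before relying on OCR:
```bash
tesseract --version
```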
**Usage:**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content) # Contains OCR'd text
```
**Best practices for OCR:**
- Use high-resolution images for better accuracy
- Ensure good contrast between text and background
- Straighten skewed text if possible
- Use well-lit, clear images
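If source images are noisy or low-contrast, a quick preprocessing pass can noticeably improve OCR. The sketch below assumes Pillow is installed (`pip install Pillow`) and that grayscale conversion plus autocontrast is enough for your material; the file names are placeholders:
```python
from PIL import Image, ImageOps
from markitdown import MarkItDown

# Grayscale + contrast stretch before OCR (illustrative preprocessing only)
img = Image.open("photo_of_text.jpg")
img = ImageOps.autocontrast(ImageOps.grayscale(img))
img.save("preprocessed.png")

md = MarkItDown()
result = md.convert("preprocessed.png")
print(result.text_content)
```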
### LLM-Generated Image Descriptions
Generate detailed, contextual descriptions of images using GPT-4o or other vision models:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)
```
**Custom prompts for specific needs:**
```python
# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)
# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)
# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
```
### Supported Image Formats
MarkItDown supports all common image formats:
- JPEG/JPG
- PNG
- GIF
- BMP
- TIFF
- WebP
- HEIC (requires additional libraries on some platforms)
## Audio Processing
MarkItDown can transcribe audio files to text using speech recognition.
### Basic Audio Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content) # Transcribed speech
```
### Audio Transcription Setup
**Installation:**
```bash
pip install 'markitdown[audio]'
```
This installs the `speech_recognition` library and dependencies.
### Supported Audio Formats
- WAV
- AIFF
- FLAC
- MP3 (requires ffmpeg or libav)
- OGG (requires ffmpeg or libav)
- Other formats supported by speech_recognition
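If MP3 or OGG input gives you trouble, one workaround is converting to WAV yourself with ffmpeg before calling MarkItDown (file names are placeholders):
```bash
ffmpeg -i meeting.mp3 meeting.wav
```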
### Audio Transcription Engines
MarkItDown uses the `speech_recognition` library, which supports multiple backends:
**Default (Google Speech Recognition):**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("audio.wav")
```
**Note:** The default Google Speech Recognition backend requires an internet connection.
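Because the conversion is built on `speech_recognition`, you can also call that library directly when you need a different backend, for example the offline PocketSphinx recognizer. This is a sketch of the `speech_recognition` API itself, not a MarkItDown option, and it assumes `pip install pocketsphinx`:
```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

# Offline transcription via PocketSphinx (no network required)
print(recognizer.recognize_sphinx(audio))
```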
### Audio Quality Considerations
For best transcription accuracy:
- Use clear audio with minimal background noise
- Prefer WAV or FLAC for better quality
- Ensure speech is clear and at good volume
- Avoid multiple overlapping speakers
- Use mono audio when possible
### Audio Preprocessing Tips
For better results, consider preprocessing audio:
```python
# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize
# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")
# Then convert with MarkItDown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")
```
## Combined Media Workflows
### Processing Multiple Images in Batch
```python
from markitdown import MarkItDown
from openai import OpenAI
import os
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
# Process all images in the directory, writing Markdown files to an output folder
os.makedirs("output", exist_ok=True)
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(f"images/{filename}")
        # Save markdown with the same base name
        output = filename.rsplit('.', 1)[0] + '.md'
        with open(f"output/{output}", "w") as f:
            f.write(result.text_content)
```
### Screenshot Analysis Pipeline
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)
screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []
for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })
# Now ready for further processing
```
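From here, the `analysis` records built above can be persisted however you like, for example dumped to JSON for a later summarization step (the output path is arbitrary):
```python
import json

with open("screenshot_analysis.json", "w") as f:
    json.dump(analysis, f, indent=2)
```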
### Document Images with OCR
For scanned documents or photos of documents:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []
for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)
# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)
```
### Presentation Slide Images
When you have presentation slides as images:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)
# Process slide images
for i in range(1, 21): # 20 slides
result = md.convert(f"slides/slide_{i}.png")
print(f"## Slide {i}\n\n{result.text_content}\n\n")
```
## Error Handling
### Image Processing Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")
```
### Audio Processing Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
```
## Performance Optimization
### Image Processing
- **LLM descriptions**: Slower but more informative
- **OCR only**: Faster for text extraction
- **EXIF only**: Fastest, metadata only
- **Batch processing**: Process multiple images in parallel
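A simple way to parallelize image conversion is a thread pool. Whether a single MarkItDown instance can be shared safely across threads is not documented here, so this sketch conservatively builds one converter per task; the directory name and worker count are arbitrary:
```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from markitdown import MarkItDown

def convert_one(path: Path) -> str:
    md = MarkItDown()  # one converter per task; avoids assuming thread safety
    return md.convert(str(path)).text_content

images = sorted(Path("images").glob("*.png"))
with ThreadPoolExecutor(max_workers=4) as pool:
    texts = list(pool.map(convert_one, images))
```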
### Audio Processing
- **File size**: Larger files take longer
- **Audio length**: Transcription time scales with duration; long recordings can be split into chunks (see the sketch below)
- **Format conversion**: MP3/OGG need an extra decode step, so WAV/FLAC process faster
- **Network dependency**: Default transcription requires internet
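For long recordings, splitting the audio into smaller chunks before transcription keeps each request manageable. The sketch below assumes pydub is installed and uses five-minute segments, an arbitrary choice:
```python
from pydub import AudioSegment
from markitdown import MarkItDown

audio = AudioSegment.from_file("long_meeting.wav")
chunk_ms = 5 * 60 * 1000  # 5-minute chunks (arbitrary)

md = MarkItDown()
parts = []
for start in range(0, len(audio), chunk_ms):
    audio[start:start + chunk_ms].export("chunk.wav", format="wav")
    parts.append(md.convert("chunk.wav").text_content)

transcript = "\n\n".join(parts)
```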
## Use Cases
### Document Digitization
Convert scanned documents or photos of documents to searchable text.
### Meeting Notes
Transcribe audio recordings of meetings to text for analysis.
### Presentation Analysis
Extract content from presentation slide images.
### Screenshot Documentation
Generate descriptions of UI screenshots for documentation.
### Image Archiving
Extract metadata and content from photo collections.
### Accessibility
Generate alt-text descriptions for images using LLM integration.
### Data Extraction
OCR text from images containing tables, forms, or structured data.