# Media Processing Reference

This document provides detailed information about processing images and audio files with MarkItDown.

## Image Processing

MarkItDown can extract text from images using OCR and retrieve EXIF metadata.

### Basic Image Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
```
### Image Processing Features

**What's extracted:**

1. **EXIF Metadata** - Camera settings, date, location, etc.
2. **OCR Text** - Text detected in the image (requires tesseract)
3. **Image Description** - AI-generated description (with LLM integration)
### EXIF Metadata Extraction

Images from cameras and smartphones contain EXIF metadata that's automatically extracted:

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)
```
**Example output includes:**

- Camera make and model
- Capture date and time
- GPS coordinates (if available)
- Exposure settings (ISO, shutter speed, aperture)
- Image dimensions
- Orientation
### OCR (Optical Character Recognition)

Extract text from images that contain text, such as screenshots, scanned documents, and photos of text:

**Requirements:**

- Install the tesseract OCR engine:

```bash
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki
```
**Usage:**

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content)  # Contains OCR'd text
```
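If a source image is small, skewed, or low-contrast, a quick preprocessing pass before OCR can help. This is a minimal sketch, assuming Pillow (`pip install Pillow`) is available; the grayscale conversion and 2x upscale are illustrative choices, not requirements:

```python
from PIL import Image, ImageOps

from markitdown import MarkItDown

# Illustrative preprocessing: grayscale and upscale before OCR
image = Image.open("screenshot.png")
image = ImageOps.grayscale(image)
image = image.resize((image.width * 2, image.height * 2))
image.save("screenshot_clean.png")

md = MarkItDown()
result = md.convert("screenshot_clean.png")
print(result.text_content)
```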
**Best practices for OCR:**

- Use high-resolution images for better accuracy
- Ensure good contrast between text and background
- Straighten skewed text if possible
- Use well-lit, clear images
### LLM-Generated Image Descriptions

Generate detailed, contextual descriptions of images using GPT-4o or other vision models:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)
```

**Custom prompts for specific needs:**
```python
# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
```
### Supported Image Formats

MarkItDown supports all common image formats:

- JPEG/JPG
- PNG
- GIF
- BMP
- TIFF
- WebP
- HEIC (requires additional libraries on some platforms)
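On platforms where HEIC support is missing, one workaround is to convert the file to PNG first. The sketch below is only an illustration and assumes the third-party `pillow-heif` package (`pip install pillow-heif`) is installed:

```python
from PIL import Image
from pillow_heif import register_heif_opener

from markitdown import MarkItDown

# Let Pillow open HEIC files, then re-save as PNG before conversion
register_heif_opener()
Image.open("IMG_5678.HEIC").save("IMG_5678.png")

md = MarkItDown()
result = md.convert("IMG_5678.png")
print(result.text_content)
```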
## Audio Processing

MarkItDown can transcribe audio files to text using speech recognition.

### Basic Audio Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content)  # Transcribed speech
```
### Audio Transcription Setup

**Installation:**

```bash
pip install 'markitdown[audio]'
```

This installs the `speech_recognition` library and its dependencies.
### Supported Audio Formats

- WAV
- AIFF
- FLAC
- MP3 (requires ffmpeg or libav)
- OGG (requires ffmpeg or libav)
- Other formats supported by speech_recognition
### Audio Transcription Engines

MarkItDown uses the `speech_recognition` library, which supports multiple backends:

**Default (Google Speech Recognition):**

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("audio.wav")
```

**Note:** The default Google Speech Recognition backend requires an internet connection.
### Audio Quality Considerations

For best transcription accuracy:

- Use clear audio with minimal background noise
- Prefer WAV or FLAC for better quality
- Ensure speech is clear and at good volume
- Avoid multiple overlapping speakers
- Use mono audio when possible
### Audio Preprocessing Tips

For better results, consider preprocessing audio:

```python
# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize

# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")

# Then convert with MarkItDown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")
```
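Along the same lines, downmixing stereo recordings to mono and resampling can help before transcription. Another pydub sketch; the 16 kHz rate is a common choice for speech, not a requirement:

```python
from pydub import AudioSegment

from markitdown import MarkItDown

# Downmix to mono and resample to 16 kHz before transcription
audio = AudioSegment.from_file("recording.mp3")
audio = audio.set_channels(1)
audio = audio.set_frame_rate(16000)
audio.export("prepared.wav", format="wav")

md = MarkItDown()
result = md.convert("prepared.wav")
print(result.text_content)
```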
## Combined Media Workflows

### Processing Multiple Images in Batch

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Make sure the output directory exists
os.makedirs("output", exist_ok=True)

# Process all images in directory
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(f"images/{filename}")

        # Save markdown with same name
        output = filename.rsplit('.', 1)[0] + '.md'
        with open(f"output/{output}", "w") as f:
            f.write(result.text_content)
```
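For large collections, the per-image conversions can also be run concurrently. This is a sketch using a thread pool, assuming the same `images/` and `output/` layout as above; whether it speeds things up depends on the converters involved (and any LLM rate limits if `llm_client` is configured), and it assumes a shared `MarkItDown` instance can be called from multiple threads:

```python
import os
from concurrent.futures import ThreadPoolExecutor

from markitdown import MarkItDown

md = MarkItDown()  # assumed shareable across threads here; create one per worker if unsure
os.makedirs("output", exist_ok=True)

def convert_one(filename):
    # Convert a single image and write the Markdown alongside the others
    result = md.convert(f"images/{filename}")
    output = filename.rsplit('.', 1)[0] + '.md'
    with open(f"output/{output}", "w") as f:
        f.write(result.text_content)
    return filename

images = [f for f in os.listdir("images")
          if f.lower().endswith(('.png', '.jpg', '.jpeg'))]

with ThreadPoolExecutor(max_workers=4) as pool:
    for done in pool.map(convert_one, images):
        print(f"Converted {done}")
```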
### Screenshot Analysis Pipeline

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)

screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []

for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })

# Now ready for further processing
```
### Document Images with OCR

For scanned documents or photos of documents:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []

for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)

# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)
```
### Presentation Slide Images

When you have presentation slides as images:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)

# Process slide images
for i in range(1, 21):  # 20 slides
    result = md.convert(f"slides/slide_{i}.png")
    print(f"## Slide {i}\n\n{result.text_content}\n\n")
```
## Error Handling

### Image Processing Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")
```
### Audio Processing Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
```
## Performance Optimization

### Image Processing

- **LLM descriptions**: Slower but more informative
- **OCR only**: Faster for text extraction
- **EXIF only**: Fastest, metadata only
- **Batch processing**: Process multiple images in parallel

### Audio Processing

- **File size**: Larger files take longer
- **Audio length**: Transcription time scales with duration
- **Format conversion**: MP3/OGG must be converted before transcription, so WAV/FLAC are faster
- **Network dependency**: The default transcription backend requires an internet connection
## Use Cases

### Document Digitization

Convert scanned documents or photos of documents to searchable text.

### Meeting Notes

Transcribe audio recordings of meetings to text for analysis.

### Presentation Analysis

Extract content from presentation slide images.

### Screenshot Documentation

Generate descriptions of UI screenshots for documentation.

### Image Archiving

Extract metadata and content from photo collections.
### Accessibility
Generate alt-text descriptions for images using LLM integration.
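A sketch using the `llm_prompt` option shown earlier; the prompt wording and file name are illustrative:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Write a concise alt-text description of this image for screen reader users"
)

result = md.convert("hero-image.png")
print(result.text_content)
```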
### Data Extraction

OCR text from images containing tables, forms, or structured data.