Media Processing Reference
This document provides detailed information about processing images and audio files with MarkItDown.
Image Processing
MarkItDown can extract text from images using OCR and retrieve EXIF metadata.
Basic Image Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
Image Processing Features
What's extracted:
- EXIF Metadata - Camera settings, date, location, etc.
- OCR Text - Text detected in the image (requires tesseract)
- Image Description - AI-generated description (with LLM integration)
EXIF Metadata Extraction
Images from cameras and smartphones contain EXIF metadata that's automatically extracted:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)
Example output includes:
- Camera make and model
- Capture date and time
- GPS coordinates (if available)
- Exposure settings (ISO, shutter speed, aperture)
- Image dimensions
- Orientation
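If the metadata needs to be consumed programmatically, one option is to parse it back out of the converted text. The sketch below assumes the metadata appears as simple `Key: value` lines (for example `ImageSize: 4032x3024`). That layout is an assumption about the output rather than a documented contract, and `parse_metadata_lines` is a hypothetical helper, not part of MarkItDown.

```python
import re

# Matches lines such as "DateTimeOriginal: 2024:05:01 10:30:00".
# Requiring whitespace after the first colon avoids matching URLs.
_FIELD = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):[ \t]+(.+)$")


def parse_metadata_lines(text: str) -> dict:
    """Collect 'Key: value' lines from converted text into a dict.

    Assumption: metadata is emitted one field per line. Lines that do
    not fit that shape (headings, prose, blank lines) are ignored.
    """
    fields = {}
    for line in text.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            key, value = match.groups()
            fields.setdefault(key, value)  # keep the first occurrence
    return fields
```

Typical use would be `parse_metadata_lines(result.text_content)`. Ordinary prose lines that happen to look like `Word: text` will also be picked up, so treat the result as a best-effort view.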
OCR (Optical Character Recognition)
Extract text from images containing text (screenshots, scanned documents, photos of text):
Requirements:
- Install tesseract OCR engine:
# macOS
brew install tesseract
# Ubuntu/Debian
apt-get install tesseract-ocr
# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki
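Before running a large OCR job, it can help to confirm that the engine is actually on the PATH. The following is a minimal sketch using only the standard library; `find_tool` and `require_tool` are hypothetical helper names, and the check only verifies that an executable exists, not that it works.

```python
import shutil
from typing import Optional


def find_tool(name: str = "tesseract") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


def require_tool(name: str = "tesseract") -> str:
    """Return the executable's path, or raise if it is not installed."""
    path = find_tool(name)
    if path is None:
        raise RuntimeError(f"{name!r} was not found on PATH; install it first")
    return path
```

Calling `require_tool()` once at startup gives a clear error message instead of a failure partway through a batch.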
Usage:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content) # Contains OCR'd text
Best practices for OCR:
- Use high-resolution images for better accuracy
- Ensure good contrast between text and background
- Straighten skewed text if possible
- Use well-lit, clear images
LLM-Generated Image Descriptions
Generate detailed, contextual descriptions of images using GPT-4o or other vision models:
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)
Custom prompts for specific needs:
# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)
# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)
# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
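When several prompt variants are in play, keeping them in one table avoids repeating the constructor call. The sketch below reuses the three prompts above. `PROMPTS` and `build_kwargs` are hypothetical names introduced for illustration, and the returned dictionary simply mirrors the keyword arguments already shown.

```python
PROMPTS = {
    "diagram": "Describe this diagram in detail, explaining all components and their relationships",
    "chart": "Analyze this chart and provide key data points and trends",
    "ui": "Describe this user interface, listing all visible elements and their layout",
}


def build_kwargs(kind: str, client, model: str = "gpt-4o") -> dict:
    """Return keyword arguments for MarkItDown(...) for a given image kind."""
    if kind not in PROMPTS:
        raise ValueError(f"unknown image kind {kind!r}; expected one of {sorted(PROMPTS)}")
    return {"llm_client": client, "llm_model": model, "llm_prompt": PROMPTS[kind]}
```

With this in place, a chart-specific converter could be created as `MarkItDown(**build_kwargs("chart", client))`.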
Supported Image Formats
MarkItDown supports all common image formats:
- JPEG/JPG
- PNG
- GIF
- BMP
- TIFF
- WebP
- HEIC (requires additional libraries on some platforms)
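To pick up every format in this list when scanning a folder, a suffix set is usually easier to maintain than a chain of `endswith` checks. This is a minimal sketch; `IMAGE_SUFFIXES` and `find_images` are hypothetical names, and the suffix set is just a transcription of the list above.

```python
from pathlib import Path

# Lowercase suffixes corresponding to the formats listed above.
IMAGE_SUFFIXES = {
    ".jpg", ".jpeg", ".png", ".gif", ".bmp",
    ".tif", ".tiff", ".webp", ".heic",
}


def find_images(folder) -> list:
    """Return image files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

The comparison is case-insensitive, so `photo.JPG` and `photo.jpg` are both matched. Subdirectories are not searched.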
Audio Processing
MarkItDown can transcribe audio files to text using speech recognition.
Basic Audio Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content) # Transcribed speech
Audio Transcription Setup
Installation:
pip install 'markitdown[audio-transcription]'
This installs the SpeechRecognition library (imported as speech_recognition) and its dependencies.
Supported Audio Formats
- WAV
- AIFF
- FLAC
- MP3 (requires ffmpeg or libav)
- OGG (requires ffmpeg or libav)
- Other formats supported by speech_recognition
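Because MP3 and OGG depend on an external decoder, it can be useful to classify files up front and check for the decoder before converting. The sketch below is an illustration only. `classify_audio` and `decoder_available` are hypothetical helpers, and looking for `ffmpeg` or `avconv` (the libav executable) on the PATH is an assumption about how the decoder is installed.

```python
import shutil
from pathlib import Path

NATIVE = {".wav", ".aiff", ".aif", ".flac"}   # no external decoder needed
NEEDS_DECODER = {".mp3", ".ogg"}              # require ffmpeg or libav


def classify_audio(path) -> str:
    """Return 'native', 'needs-decoder', or 'unknown' based on the suffix."""
    suffix = Path(path).suffix.lower()
    if suffix in NATIVE:
        return "native"
    if suffix in NEEDS_DECODER:
        return "needs-decoder"
    return "unknown"


def decoder_available() -> bool:
    """True if an ffmpeg or avconv executable is found on PATH."""
    return any(shutil.which(tool) for tool in ("ffmpeg", "avconv"))
```

A caller might skip or warn about files where `classify_audio(path) == "needs-decoder"` and `decoder_available()` is false.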
Audio Transcription Engines
MarkItDown uses the speech_recognition library, which supports multiple backends:
Default (Google Speech Recognition):
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("audio.wav")
Note: The default Google Speech Recognition backend requires an internet connection.
Audio Quality Considerations
For best transcription accuracy:
- Use clear audio with minimal background noise
- Prefer WAV or FLAC for better quality
- Ensure speech is clear and at good volume
- Avoid multiple overlapping speakers
- Use mono audio when possible
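For WAV files, the points about channel count and quality can be checked before transcription with the standard library alone. The sketch below uses Python's `wave` module. `describe_wav` is a hypothetical helper, and it only handles uncompressed PCM WAV files that `wave` can read.

```python
import wave


def describe_wav(path: str) -> dict:
    """Report basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),          # 1 = mono, 2 = stereo
            "sample_rate": rate,                     # samples per second
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate if rate else 0.0,
        }
```

A file reporting `channels == 2` is a candidate for downmixing to mono before conversion.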
Audio Preprocessing Tips
For better results, consider preprocessing audio:
# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize
# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")
# Then convert with MarkItDown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")
Combined Media Workflows
Processing Multiple Images in Batch
from markitdown import MarkItDown
from openai import OpenAI
import os
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
# Process all images in directory
os.makedirs("output", exist_ok=True)  # make sure the output directory exists
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(os.path.join("images", filename))
        # Save markdown with same name
        output = os.path.splitext(filename)[0] + '.md'
        with open(os.path.join("output", output), "w", encoding="utf-8") as f:
            f.write(result.text_content)
Screenshot Analysis Pipeline
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)
screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []
for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })
# Now ready for further processing
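One simple form of further processing is to persist the collected results so that later steps do not have to repeat the LLM calls. The sketch below writes the `analysis` list to JSON and reads it back. `save_analysis` and `load_analysis` are hypothetical helpers, and the file name in the usage note is illustrative.

```python
import json
from pathlib import Path


def save_analysis(analysis: list, path: str) -> None:
    """Write a list of {'file': ..., 'content': ...} dicts as UTF-8 JSON."""
    text = json.dumps(analysis, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def load_analysis(path: str) -> list:
    """Read back a list previously written by save_analysis."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

For example, `save_analysis(analysis, "screenshots.json")` after the loop, then `load_analysis("screenshots.json")` in a later script.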
Document Images with OCR
For scanned documents or photos of documents:
from markitdown import MarkItDown
md = MarkItDown()
# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []
for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)
# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)
Presentation Slide Images
When you have presentation slides as images:
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)
# Process slide images
for i in range(1, 21):  # 20 slides
    result = md.convert(f"slides/slide_{i}.png")
    print(f"## Slide {i}\n\n{result.text_content}\n\n")
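The loop above assumes exactly 20 slides with known names. When the slide files are discovered from a directory instead, plain string sorting puts `slide_10.png` before `slide_2.png`, so a numeric sort key is needed. The sketch below is one way to do that; `slide_number` and `ordered_slides` are hypothetical helpers and assume the slide number is the first run of digits in the file name.

```python
import re
from pathlib import Path


def slide_number(path) -> int:
    """Extract the first run of digits from the file name (without suffix)."""
    match = re.search(r"\d+", Path(path).stem)
    return int(match.group()) if match else -1  # unnumbered files sort first


def ordered_slides(paths) -> list:
    """Return the paths sorted by slide number rather than alphabetically."""
    return sorted(paths, key=slide_number)
```

The ordered list can then be fed to `md.convert` one slide at a time.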
Error Handling
Image Processing Errors
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")
Audio Processing Errors
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
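Since network errors are transient, a retry with exponential backoff is a common way to handle them. The sketch below wraps any conversion callable, for example `md.convert`. `convert_with_retry` is a hypothetical helper, and it retries on every exception except `FileNotFoundError` because the exact exception types raised for network failures are not specified here.

```python
import time


def convert_with_retry(convert, path, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call convert(path), retrying failed calls with exponential backoff.

    The delay doubles after each failure: base_delay, 2*base_delay, ...
    The last failure is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return convert(path)
        except FileNotFoundError:
            raise  # a missing file will not appear by retrying
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Usage would look like `result = convert_with_retry(md.convert, "audio.mp3")`. Errors such as an unsupported format are retried too, which wastes a few seconds but keeps the sketch simple.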
Performance Optimization
Image Processing
- LLM descriptions: Slower but more informative
- OCR only: Faster for text extraction
- EXIF only: Fastest, metadata only
- Batch processing: Process multiple images in parallel
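Parallel batch processing can be sketched with a thread pool, which suits workloads dominated by waiting on LLM or network calls. The code below is an illustration under assumptions: `convert_many` is a hypothetical helper, and whether a single MarkItDown instance can safely be shared across threads is not stated here, so the conversion is passed in as a callable that the caller controls.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_many(paths, convert, max_workers=4) -> dict:
    """Run convert(path) for each path in a thread pool.

    Returns a dict mapping each path to either the conversion result
    or the exception raised for that path, so one bad file does not
    abort the whole batch.
    """
    def safe(path):
        try:
            return path, convert(path)
        except Exception as exc:
            return path, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe, paths))
```

For instance, `convert_many(files, lambda p: md.convert(p).text_content)` returns the Markdown text per file. If sharing one instance turns out to be unsafe, the callable can create a fresh converter for each call instead.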
Audio Processing
- File size: Larger files take longer
- Audio length: Transcription time scales with duration
- Format conversion: WAV/FLAC are typically faster than MP3/OGG, which must be decoded first
- Network dependency: Default transcription requires internet
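Because transcription time grows with duration, long recordings are sometimes split into shorter chunks that are transcribed separately. The sketch below only computes the chunk boundaries in milliseconds, the unit pydub uses when slicing an `AudioSegment`. `chunk_ranges` is a hypothetical helper, and the 60-second default is an arbitrary choice, not a recommendation from MarkItDown.

```python
def chunk_ranges(total_ms: int, chunk_ms: int = 60_000) -> list:
    """Split [0, total_ms) into consecutive (start, end) ranges.

    Every range is chunk_ms long except possibly the last one.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, max(total_ms, 0), chunk_ms)
    ]
```

With pydub, each range could be applied as `audio[start:end]` and exported to its own WAV file before conversion. Splitting at fixed offsets can cut words in half, so splitting on silence may give better transcripts.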
Use Cases
Document Digitization
Convert scanned documents or photos of documents to searchable text.
Meeting Notes
Transcribe audio recordings of meetings to text for analysis.
Presentation Analysis
Extract content from presentation slide images.
Screenshot Documentation
Generate descriptions of UI screenshots for documentation.
Image Archiving
Extract metadata and content from photo collections.
Accessibility
Generate alt-text descriptions for images using LLM integration.
Data Extraction
Extract text via OCR from images containing tables, forms, or structured data.