# Media Processing Reference

This document provides detailed information about processing images and audio files with MarkItDown.

## Image Processing

MarkItDown can extract text from images using OCR and retrieve EXIF metadata.

### Basic Image Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
```

### Image Processing Features

**What's extracted:**
1. **EXIF Metadata** - Camera settings, date, location, etc.
2. **OCR Text** - Text detected in the image (requires tesseract)
3. **Image Description** - AI-generated description (with LLM integration)

### EXIF Metadata Extraction

Images from cameras and smartphones contain EXIF metadata that's automatically extracted:

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)
```

**Example output includes:**
- Camera make and model
- Capture date and time
- GPS coordinates (if available)
- Exposure settings (ISO, shutter speed, aperture)
- Image dimensions
- Orientation

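If the extracted metadata needs to be consumed programmatically, the converted text can be split back into fields. The following is a minimal sketch, assuming the metadata appears in `text_content` as one `Key: Value` pair per line. That layout is an assumption, so inspect a real result first. `parse_metadata_lines` and the sample text are hypothetical and not part of MarkItDown:

```python
def parse_metadata_lines(text: str) -> dict[str, str]:
    """Collect 'Key: Value' lines into a dict (assumed output layout)."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip blank lines, Markdown headings, and lines without a key/value shape.
        if not sep or not key.strip() or line.startswith("#"):
            continue
        fields[key.strip()] = value.strip()
    return fields


# Hypothetical sample text; real output depends on the image and the setup.
sample = (
    "ImageSize: 4032x3024\n"
    "DateTimeOriginal: 2024:05:01 10:15:00\n"
    "\n"
    "# Description:\n"
    "A street at dusk"
)
metadata = parse_metadata_lines(sample)
print(metadata["ImageSize"])
```

Only the first colon on each line is treated as the separator, so timestamps that contain colons stay intact in the value.
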
### OCR (Optical Character Recognition)

Extract text from images that contain text (screenshots, scanned documents, photos of text):

**Requirements:**
- Install the Tesseract OCR engine:
  ```bash
  # macOS
  brew install tesseract

  # Ubuntu/Debian
  apt-get install tesseract-ocr

  # Windows
  # Download installer from https://github.com/UB-Mannheim/tesseract/wiki
  ```

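Because OCR depends on an external executable, it can help to confirm that the executable is reachable before converting a batch of images. This is a small sketch using only the standard library. `tesseract_available` is a hypothetical helper, and it assumes the binary is installed under the name `tesseract`:

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the OCR engine's executable can be found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("tesseract not found on PATH; install it before running OCR")
```
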
**Usage:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content)  # Contains OCR'd text
```

**Best practices for OCR:**
- Use high-resolution images for better accuracy
- Ensure good contrast between text and background
- Straighten skewed text if possible
- Use well-lit, clear images

### LLM-Generated Image Descriptions

Generate detailed, contextual descriptions of images using GPT-4o or other vision models:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)
```

**Custom prompts for specific needs:**

```python
# Reuses `MarkItDown` and `client` from the previous example

# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
```

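When several kinds of images are processed in one program, the prompts above can be kept in a single table and looked up by kind. This is a sketch only: `PROMPTS` and `converter_kwargs` are hypothetical names, and the returned dictionary simply mirrors the keyword arguments used in the examples above:

```python
PROMPTS = {
    "diagram": "Describe this diagram in detail, explaining all components and their relationships",
    "chart": "Analyze this chart and provide key data points and trends",
    "ui": "Describe this user interface, listing all visible elements and their layout",
}


def converter_kwargs(kind: str, client, model: str = "gpt-4o") -> dict:
    """Build the constructor keyword arguments for a given image kind."""
    if kind not in PROMPTS:
        raise ValueError(f"unknown image kind: {kind!r}")
    return {"llm_client": client, "llm_model": model, "llm_prompt": PROMPTS[kind]}


# Usage (requires markitdown and a configured client):
#   md = MarkItDown(**converter_kwargs("chart", client))
```

Keeping the prompts in one place makes it easier to adjust wording without touching the conversion code.
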
### Supported Image Formats

MarkItDown supports all common image formats:
- JPEG/JPG
- PNG
- GIF
- BMP
- TIFF
- WebP
- HEIC (requires additional libraries on some platforms)

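A file-name filter built from the list above is one way to skip non-image files before conversion. This is a sketch; the extension set is an assumption derived from the listed format names, and `is_listed_image` is a hypothetical helper:

```python
from pathlib import Path

# Extensions for the formats listed above; adjust to your installation.
IMAGE_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif", ".tiff", ".webp", ".heic",
}


def is_listed_image(path: str) -> bool:
    """Case-insensitive check of a file name against the listed formats."""
    return Path(path).suffix.lower() in IMAGE_EXTENSIONS


files = ["scan.TIFF", "notes.txt", "photo.jpeg", "archive.tar.gz"]
print([name for name in files if is_listed_image(name)])
```

The check looks only at the file extension; it does not verify that the file contents are a valid image.
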
## Audio Processing

MarkItDown can transcribe audio files to text using speech recognition.

### Basic Audio Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content)  # Transcribed speech
```

### Audio Transcription Setup

**Installation:**
```bash
pip install 'markitdown[audio-transcription]'
```

This installs the `SpeechRecognition` package (imported as `speech_recognition`) and its dependencies.

### Supported Audio Formats

- WAV
- AIFF
- FLAC
- MP3 (requires ffmpeg or libav)
- OGG (requires ffmpeg or libav)
- Other formats supported by speech_recognition

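Since some of the formats above need an external decoder, a pre-flight check can catch a missing tool before transcription starts. This is a sketch under stated assumptions: `missing_decoder` is a hypothetical helper, and it assumes the decoders are installed under their usual executable names, `ffmpeg` and `avconv` (the libav command):

```python
import shutil
from pathlib import Path

# Formats the list above marks as needing ffmpeg or libav.
NEEDS_DECODER = {".mp3", ".ogg"}


def missing_decoder(path: str) -> bool:
    """True if the file needs ffmpeg/libav and neither is found on PATH."""
    if Path(path).suffix.lower() not in NEEDS_DECODER:
        return False
    return shutil.which("ffmpeg") is None and shutil.which("avconv") is None


for name in ["interview.mp3", "memo.wav"]:
    if missing_decoder(name):
        print(f"{name}: install ffmpeg or libav before converting")
```
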
### Audio Transcription Engines

MarkItDown uses the `speech_recognition` library, which supports multiple backends:

**Default (Google Speech Recognition):**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("audio.wav")
```

**Note:** The default Google Speech Recognition backend requires an internet connection.

### Audio Quality Considerations

For best transcription accuracy:
- Use clear audio with minimal background noise
- Prefer WAV or FLAC for better quality
- Ensure speech is clear and at good volume
- Avoid multiple overlapping speakers
- Use mono audio when possible

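For WAV files, the channel count and duration can be checked with the standard library before transcription, which makes it easy to spot stereo or very long recordings. This is a minimal sketch; `describe_wav` is a hypothetical helper and handles only files that Python's `wave` module can read (uncompressed PCM WAV):

```python
import wave


def describe_wav(path: str) -> dict:
    """Read channel count, sample rate, and duration from a WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate if rate else 0.0,
        }


# Usage:
#   info = describe_wav("recording.wav")
#   if info["channels"] > 1:
#       print("Consider converting to mono before transcription")
```
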
### Audio Preprocessing Tips

For better results, consider preprocessing audio:

```python
# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize

# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")

# Then convert with MarkItDown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")
```

## Combined Media Workflows

### Processing Multiple Images in Batch

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Make sure the output directory exists before writing to it
os.makedirs("output", exist_ok=True)

# Process all images in directory
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(f"images/{filename}")

        # Save markdown with same name
        output = filename.rsplit('.', 1)[0] + '.md'
        with open(f"output/{output}", "w", encoding="utf-8") as f:
            f.write(result.text_content)
```

### Screenshot Analysis Pipeline

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)

screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []

for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })

# Now ready for further processing
```

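One simple form of further processing is to persist the collected results so they can be reviewed or diffed later. This is a sketch in which `save_analysis` is a hypothetical helper and `example` is stand-in data with the same shape as the `analysis` list built above:

```python
import json


def save_analysis(analysis: list[dict], path: str) -> None:
    """Write the collected results to a JSON file, one object per screenshot."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(analysis, f, ensure_ascii=False, indent=2)


# Stand-in data in the same shape as the `analysis` list.
example = [{"file": "screen1.png", "content": "Login form with two fields"}]
print(json.dumps(example, indent=2))
```
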
### Document Images with OCR

For scanned documents or photos of documents:

```python
from markitdown import MarkItDown

md = MarkItDown()

# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []

for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)

# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)
```

### Presentation Slide Images

When you have presentation slides as images:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)

# Process slide images
for i in range(1, 21):  # 20 slides
    result = md.convert(f"slides/slide_{i}.png")
    print(f"## Slide {i}\n\n{result.text_content}\n\n")
```

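When the number of slides is not known in advance, the files can be discovered on disk and sorted by their numeric suffix, so that `slide_2` comes before `slide_10`. This is a sketch assuming the `slide_<n>.png` naming used above; `slide_number` and `ordered_slides` are hypothetical helpers:

```python
import re
from pathlib import Path


def slide_number(path: Path) -> int:
    """Pull the trailing number out of names like slide_12.png."""
    match = re.search(r"(\d+)$", path.stem)
    if match is None:
        raise ValueError(f"no slide number in {path.name!r}")
    return int(match.group(1))


def ordered_slides(folder: str) -> list[Path]:
    """Return slide images in numeric order rather than alphabetical order."""
    return sorted(Path(folder).glob("slide_*.png"), key=slide_number)


# Usage:
#   for slide in ordered_slides("slides"):
#       result = md.convert(str(slide))
```
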
## Error Handling

### Image Processing Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")
```

### Audio Processing Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
```

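Of the common issues listed above, network errors are often transient, so a retry with a short pause may be enough to recover. This is a generic sketch: `convert_with_retry` is a hypothetical helper that wraps any callable, and it retries on every exception because this document does not specify which exception types MarkItDown raises:

```python
import time


def convert_with_retry(convert, path, attempts=3, delay=1.0):
    """Call convert(path), retrying on any exception with a growing pause."""
    for attempt in range(1, attempts + 1):
        try:
            return convert(path)
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay * attempt)


# Usage:
#   result = convert_with_retry(md.convert, "audio.mp3")
```

Retrying does not help with unsupported formats or silent recordings, which fail the same way on every attempt, so keep `attempts` small.
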
## Performance Optimization

### Image Processing

- **LLM descriptions**: Slower but more informative
- **OCR only**: Faster for text extraction
- **EXIF only**: Fastest, metadata only
- **Batch processing**: Process multiple images in parallel

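Parallel batch processing can be sketched with a thread pool, which suits work that mostly waits on network calls or external tools. `convert_many` is a hypothetical helper that accepts any callable. Whether a single `MarkItDown` instance can safely be shared across threads is not established here, so treat that as an assumption and create one instance per worker if in doubt:

```python
from concurrent.futures import ThreadPoolExecutor


def convert_many(convert, paths, max_workers=4):
    """Run convert(path) for each path on a thread pool.

    Returns a dict mapping each path to its result, or to the exception raised.
    """
    paths = list(paths)

    def safe(path):
        try:
            return convert(path)
        except Exception as exc:  # keep going if one file fails
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(safe, paths)))


# Usage:
#   results = convert_many(md.convert, ["a.png", "b.png", "c.png"])
```

Keep `max_workers` modest when LLM descriptions are enabled, since each conversion then makes an API request that may be rate limited.
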
### Audio Processing

- **File size**: Larger files take longer
- **Audio length**: Transcription time scales with duration
- **Format conversion**: WAV and FLAC are faster than MP3 and OGG, which must first be decoded with ffmpeg or libav
- **Network dependency**: Default transcription requires an internet connection

## Use Cases

### Document Digitization
Convert scanned documents or photos of documents to searchable text.

### Meeting Notes
Transcribe audio recordings of meetings to text for analysis.

### Presentation Analysis
Extract content from presentation slide images.

### Screenshot Documentation
Generate descriptions of UI screenshots for documentation.

### Image Archiving
Extract metadata and content from photo collections.

### Accessibility
Generate alt-text descriptions for images using LLM integration.

### Data Extraction
OCR text from images containing tables, forms, or structured data.