# Media Processing Reference
This document provides detailed information about processing images and audio files with MarkItDown.
## Image Processing
MarkItDown can extract text from images using OCR and retrieve EXIF metadata.
### Basic Image Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)
```
### Image Processing Features
**What's extracted:**
1. **EXIF Metadata** - Camera settings, date, location, etc.
2. **OCR Text** - Text detected in the image (requires tesseract)
3. **Image Description** - AI-generated description (with LLM integration)
### EXIF Metadata Extraction
Images from cameras and smartphones contain EXIF metadata that's automatically extracted:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)
```
**Example output includes:**
- Camera make and model
- Capture date and time
- GPS coordinates (if available)
- Exposure settings (ISO, shutter speed, aperture)
- Image dimensions
- Orientation
### OCR (Optical Character Recognition)
Extract text from images containing text (screenshots, scanned documents, photos of text):
**Requirements:**
- Install tesseract OCR engine:
```bash
# macOS
brew install tesseract
# Ubuntu/Debian
apt-get install tesseract-ocr
# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki
```
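Once installed, confirm the binary is on your PATH before relying on OCR:
```bash
tesseract --version
```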
**Usage:**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content) # Contains OCR'd text
```
**Best practices for OCR:**
- Use high-resolution images for better accuracy
- Ensure good contrast between text and background
- Straighten skewed text if possible
- Use well-lit, clear images
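If source images are noisy or low-contrast, a quick preprocessing pass can noticeably improve OCR. The sketch below assumes Pillow is installed (`pip install Pillow`) and that grayscale conversion plus autocontrast is enough for your material; the file names are placeholders:
```python
from PIL import Image, ImageOps
from markitdown import MarkItDown

# Grayscale + contrast stretch before OCR (illustrative preprocessing only)
img = Image.open("photo_of_text.jpg")
img = ImageOps.autocontrast(ImageOps.grayscale(img))
img.save("preprocessed.png")

md = MarkItDown()
result = md.convert("preprocessed.png")
print(result.text_content)
```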
### LLM-Generated Image Descriptions
Generate detailed, contextual descriptions of images using GPT-4o or other vision models:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)
```
**Custom prompts for specific needs:**
```python
# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)
# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)
# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
```
### Supported Image Formats
MarkItDown supports all common image formats:
- JPEG/JPG
- PNG
- GIF
- BMP
- TIFF
- WebP
- HEIC (requires additional libraries on some platforms)
## Audio Processing
MarkItDown can transcribe audio files to text using speech recognition.
### Basic Audio Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content) # Transcribed speech
```
### Audio Transcription Setup
**Installation:**
```bash
pip install 'markitdown[audio]'
```
This installs the `speech_recognition` library and dependencies.
### Supported Audio Formats
- WAV
- AIFF
- FLAC
- MP3 (requires ffmpeg or libav)
- OGG (requires ffmpeg or libav)
- Other formats supported by speech_recognition
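If MP3 or OGG input gives you trouble, one workaround is converting to WAV yourself with ffmpeg before calling MarkItDown (file names are placeholders):
```bash
ffmpeg -i meeting.mp3 meeting.wav
```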
### Audio Transcription Engines
MarkItDown uses the `speech_recognition` library, which supports multiple backends:
**Default (Google Speech Recognition):**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("audio.wav")
```
**Note:** The default Google Speech Recognition backend requires an internet connection.
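Because the conversion is built on `speech_recognition`, you can also call that library directly when you need a different backend, for example the offline PocketSphinx recognizer. This is a sketch of the `speech_recognition` API itself, not a MarkItDown option, and it assumes `pip install pocketsphinx`:
```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

# Offline transcription via PocketSphinx (no network required)
print(recognizer.recognize_sphinx(audio))
```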
### Audio Quality Considerations
For best transcription accuracy:
- Use clear audio with minimal background noise
- Prefer WAV or FLAC for better quality
- Ensure speech is clear and at good volume
- Avoid multiple overlapping speakers
- Use mono audio when possible
### Audio Preprocessing Tips
For better results, consider preprocessing audio:
```python
# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize
# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")
# Then convert with MarkItDown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")
```
## Combined Media Workflows
### Processing Multiple Images in Batch
```python
from markitdown import MarkItDown
from openai import OpenAI
import os
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
# Process all images in the directory, writing Markdown files to an output folder
os.makedirs("output", exist_ok=True)
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(f"images/{filename}")
        # Save markdown with the same base name
        output = filename.rsplit('.', 1)[0] + '.md'
        with open(f"output/{output}", "w") as f:
            f.write(result.text_content)
```
### Screenshot Analysis Pipeline
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)
screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []
for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })
# Now ready for further processing
```
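From here, the `analysis` records built above can be persisted however you like, for example dumped to JSON for a later summarization step (the output path is arbitrary):
```python
import json

with open("screenshot_analysis.json", "w") as f:
    json.dump(analysis, f, indent=2)
```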
### Document Images with OCR
For scanned documents or photos of documents:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []
for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)
# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)
```
### Presentation Slide Images
When you have presentation slides as images:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)
# Process slide images
for i in range(1, 21): # 20 slides
result = md.convert(f"slides/slide_{i}.png")
print(f"## Slide {i}\n\n{result.text_content}\n\n")
```
## Error Handling
### Image Processing Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")
```
### Audio Processing Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
```
## Performance Optimization
### Image Processing
- **LLM descriptions**: Slower but more informative
- **OCR only**: Faster for text extraction
- **EXIF only**: Fastest, metadata only
- **Batch processing**: Process multiple images in parallel
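A simple way to parallelize image conversion is a thread pool. Whether a single MarkItDown instance can be shared safely across threads is not documented here, so this sketch conservatively builds one converter per task; the directory name and worker count are arbitrary:
```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from markitdown import MarkItDown

def convert_one(path: Path) -> str:
    md = MarkItDown()  # one converter per task; avoids assuming thread safety
    return md.convert(str(path)).text_content

images = sorted(Path("images").glob("*.png"))
with ThreadPoolExecutor(max_workers=4) as pool:
    texts = list(pool.map(convert_one, images))
```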
### Audio Processing
- **File size**: Larger files take longer
- **Audio length**: Transcription time scales with duration; long recordings can be split into chunks (see the sketch below)
- **Format conversion**: MP3/OGG need an extra decode step, so WAV/FLAC process faster
- **Network dependency**: Default transcription requires internet
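For long recordings, splitting the audio into smaller chunks before transcription keeps each request manageable. The sketch below assumes pydub is installed and uses five-minute segments, an arbitrary choice:
```python
from pydub import AudioSegment
from markitdown import MarkItDown

audio = AudioSegment.from_file("long_meeting.wav")
chunk_ms = 5 * 60 * 1000  # 5-minute chunks (arbitrary)

md = MarkItDown()
parts = []
for start in range(0, len(audio), chunk_ms):
    audio[start:start + chunk_ms].export("chunk.wav", format="wav")
    parts.append(md.convert("chunk.wav").text_content)

transcript = "\n\n".join(parts)
```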
## Use Cases
### Document Digitization
Convert scanned documents or photos of documents to searchable text.
### Meeting Notes
Transcribe audio recordings of meetings to text for analysis.
### Presentation Analysis
Extract content from presentation slide images.
### Screenshot Documentation
Generate descriptions of UI screenshots for documentation.
### Image Archiving
Extract metadata and content from photo collections.
### Accessibility
Generate alt-text descriptions for images using LLM integration.
### Data Extraction
OCR text from images containing tables, forms, or structured data.