Files
gh-k-dense-ai-claude-scient…/skills/markitdown/references/media_processing.md
2025-11-30 08:30:10 +08:00

8.2 KiB

Media Processing Reference

This document provides detailed information about processing images and audio files with MarkItDown.

Image Processing

MarkItDown can extract text from images using OCR and retrieve EXIF metadata.

Basic Image Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)

Image Processing Features

What's extracted:

  1. EXIF Metadata - Camera settings, date, location, etc.
  2. OCR Text - Text detected in the image (requires tesseract)
  3. Image Description - AI-generated description (with LLM integration)

EXIF Metadata Extraction

Images from cameras and smartphones contain EXIF metadata that's automatically extracted:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)

Example output includes:

  • Camera make and model
  • Capture date and time
  • GPS coordinates (if available)
  • Exposure settings (ISO, shutter speed, aperture)
  • Image dimensions
  • Orientation

OCR (Optical Character Recognition)

Extract text from images containing text (screenshots, scanned documents, photos of text):

Requirements:

  • Install tesseract OCR engine:
    # macOS
    brew install tesseract
    
    # Ubuntu/Debian
    apt-get install tesseract-ocr
    
    # Windows
    # Download installer from https://github.com/UB-Mannheim/tesseract/wiki
    

Usage:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content)  # Contains OCR'd text

Best practices for OCR:

  • Use high-resolution images for better accuracy
  • Ensure good contrast between text and background
  • Straighten skewed text if possible
  • Use well-lit, clear images

LLM-Generated Image Descriptions

Generate detailed, contextual descriptions of images using GPT-4o or other vision models:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)

Custom prompts for specific needs:

# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)

Supported Image Formats

MarkItDown supports all common image formats:

  • JPEG/JPG
  • PNG
  • GIF
  • BMP
  • TIFF
  • WebP
  • HEIC (requires additional libraries on some platforms)

Audio Processing

MarkItDown can transcribe audio files to text using speech recognition.

Basic Audio Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content)  # Transcribed speech

Audio Transcription Setup

Installation:

pip install 'markitdown[audio]'

This installs the speech_recognition library and dependencies.

Supported Audio Formats

  • WAV
  • AIFF
  • FLAC
  • MP3 (requires ffmpeg or libav)
  • OGG (requires ffmpeg or libav)
  • Other formats supported by speech_recognition

Audio Transcription Engines

MarkItDown uses the speech_recognition library, which supports multiple backends:

Default (Google Speech Recognition):

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("audio.wav")

Note: Default Google Speech Recognition requires internet connection.

Audio Quality Considerations

For best transcription accuracy:

  • Use clear audio with minimal background noise
  • Prefer WAV or FLAC for better quality
  • Ensure speech is clear and at good volume
  • Avoid multiple overlapping speakers
  • Use mono audio when possible

Audio Preprocessing Tips

For better results, consider preprocessing audio:

# Example: If you have pydub installed
from pydub import AudioSegment
from pydub.effects import normalize

# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")

# Then convert with MarkItDown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")

Combined Media Workflows

Processing Multiple Images in Batch

from markitdown import MarkItDown
from openai import OpenAI
import os

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Process all images in directory
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(f"images/{filename}")

        # Save markdown with same name
        output = filename.rsplit('.', 1)[0] + '.md'
        with open(f"output/{output}", "w") as f:
            f.write(result.text_content)

Screenshot Analysis Pipeline

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)

screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []

for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })

# Now ready for further processing

Document Images with OCR

For scanned documents or photos of documents:

from markitdown import MarkItDown

md = MarkItDown()

# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []

for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)

# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)

Presentation Slide Images

When you have presentation slides as images:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)

# Process slide images
for i in range(1, 21):  # 20 slides
    result = md.convert(f"slides/slide_{i}.png")
    print(f"## Slide {i}\n\n{result.text_content}\n\n")

Error Handling

Image Processing Errors

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")

Audio Processing Errors

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error

Performance Optimization

Image Processing

  • LLM descriptions: Slower but more informative
  • OCR only: Faster for text extraction
  • EXIF only: Fastest, metadata only
  • Batch processing: Process multiple images in parallel

Audio Processing

  • File size: Larger files take longer
  • Audio length: Transcription time scales with duration
  • Format conversion: WAV/FLAC are faster than MP3/OGG
  • Network dependency: Default transcription requires internet

Use Cases

Document Digitization

Convert scanned documents or photos of documents to searchable text.

Meeting Notes

Transcribe audio recordings of meetings to text for analysis.

Presentation Analysis

Extract content from presentation slide images.

Screenshot Documentation

Generate descriptions of UI screenshots for documentation.

Image Archiving

Extract metadata and content from photo collections.

Accessibility

Generate alt-text descriptions for images using LLM integration.

Data Extraction

OCR text from images containing tables, forms, or structured data.