zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

Media Processing Reference

This document provides detailed information about processing images and audio files with MarkItDown.

Image Processing

MarkItDown can extract text from images using OCR and retrieve EXIF metadata.

Basic Image Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("photo.jpg")
print(result.text_content)

Image Processing Features

What's extracted:

EXIF Metadata - Camera settings, date, location, etc.
OCR Text - Text detected in the image (requires tesseract)
Image Description - AI-generated description (with LLM integration)

EXIF Metadata Extraction

Images from cameras and smartphones contain EXIF metadata, which MarkItDown extracts automatically during conversion (extraction relies on the exiftool utility being installed):

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("IMG_1234.jpg")
print(result.text_content)

Example output includes:

Camera make and model
Capture date and time
GPS coordinates (if available)
Exposure settings (ISO, shutter speed, aperture)
Image dimensions
Orientation
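
The extracted fields can be collected into a dictionary for downstream use. The sketch below assumes the metadata appears in the converted text as one "Key: Value" pair per line; the exact layout may differ between MarkItDown versions, and parse_metadata is a hypothetical helper, not part of the library:

```python
def parse_metadata(text_content: str) -> dict[str, str]:
    """Collect `Key: Value` lines into a dict; other lines are skipped."""
    metadata = {}
    for line in text_content.splitlines():
        # Split on the first colon only, so values such as timestamps survive
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            metadata[key.strip()] = value.strip()
    return metadata


# Sample text standing in for result.text_content (illustrative values only)
sample = "ImageSize: 4032x3024\nDateTimeOriginal: 2024:05:01 10:30:00\n\n# Description"
print(parse_metadata(sample))
```

In real use, pass result.text_content instead of the sample string.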

OCR (Optical Character Recognition)

Extract text from images that contain text, such as screenshots, scanned documents, and photos of text.

Requirements:

Install tesseract OCR engine:

# macOS
brew install tesseract

# Ubuntu/Debian (installing system packages usually needs root)
sudo apt-get install tesseract-ocr

# Windows
# Download installer from https://github.com/UB-Mannheim/tesseract/wiki

Usage:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("screenshot.png")
print(result.text_content)  # Contains OCR'd text

Best practices for OCR:

Use high-resolution images for better accuracy
Ensure good contrast between text and background
Straighten skewed text if possible
Use well-lit, clear images
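
The resolution guideline can be checked before conversion. This is a minimal sketch using only the standard library: it reads the width and height from a PNG header so that small images can be flagged before OCR. The 1000-pixel threshold and the helper names are assumptions for illustration, not MarkItDown settings:

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_WIDTH = 1000  # assumed threshold; tune for your documents


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from a PNG's IHDR chunk."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are big-endian 32-bit integers at bytes 16..24
    return struct.unpack(">II", data[16:24])


def is_high_resolution(data: bytes, min_width: int = MIN_WIDTH) -> bool:
    """True when the PNG is at least min_width pixels wide."""
    width, _height = png_size(data)
    return width >= min_width
```

Only the first 24 bytes of the file are needed, so reading them with open(path, "rb").read(24) is enough to run the check.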

LLM-Generated Image Descriptions

Generate detailed, contextual descriptions of images using GPT-4o or other vision models:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("diagram.png")
print(result.text_content)

Custom prompts for specific needs:

# For diagrams
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this diagram in detail, explaining all components and their relationships"
)

# For charts
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Analyze this chart and provide key data points and trends"
)

# For UI screenshots
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this user interface, listing all visible elements and their layout"
)
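
When several image kinds are handled in one script, the prompts above can live in a lookup table instead of three separate constructor calls. This is a sketch; prompt_for and the fallback wording are hypothetical:

```python
PROMPTS = {
    "diagram": "Describe this diagram in detail, explaining all components and their relationships",
    "chart": "Analyze this chart and provide key data points and trends",
    "ui": "Describe this user interface, listing all visible elements and their layout",
}
DEFAULT_PROMPT = "Describe this image in detail"  # assumed fallback wording


def prompt_for(kind: str) -> str:
    """Return the prompt for an image kind, falling back to a generic one."""
    return PROMPTS.get(kind.lower(), DEFAULT_PROMPT)


print(prompt_for("chart"))
```

The result can then be passed as llm_prompt=prompt_for("chart") when constructing MarkItDown.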

Supported Image Formats

MarkItDown supports all common image formats:

JPEG/JPG
PNG
GIF
BMP
TIFF
WebP
HEIC (requires additional libraries on some platforms)

Audio Processing

MarkItDown can transcribe audio files to text using speech recognition.

Basic Audio Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("recording.wav")
print(result.text_content)  # Transcribed speech

Audio Transcription Setup

Installation:

pip install 'markitdown[audio]'

This installs the SpeechRecognition package (imported as speech_recognition) and its dependencies.

Supported Audio Formats

WAV
AIFF
FLAC
MP3 (requires ffmpeg or libav)
OGG (requires ffmpeg or libav)
Other formats supported by speech_recognition
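
A small pre-flight check can tell which files need ffmpeg before a batch starts. The sketch below encodes only the formats listed above; audio_support is a hypothetical helper, and any other extension is reported as unknown rather than guessed:

```python
from pathlib import Path

NATIVE = {".wav", ".aiff", ".flac"}   # listed above without extra requirements
NEEDS_FFMPEG = {".mp3", ".ogg"}       # listed above as requiring ffmpeg or libav


def audio_support(path: str) -> str:
    """Classify a file by extension: 'native', 'needs-ffmpeg', or 'unknown'."""
    ext = Path(path).suffix.lower()
    if ext in NATIVE:
        return "native"
    if ext in NEEDS_FFMPEG:
        return "needs-ffmpeg"
    return "unknown"


print(audio_support("meeting.MP3"))
```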

Audio Transcription Engines

MarkItDown uses the speech_recognition library, which supports multiple backends:

Default (Google Speech Recognition):

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("audio.wav")

Note: The default Google Speech Recognition backend requires an internet connection.

Audio Quality Considerations

For best transcription accuracy:

Use clear audio with minimal background noise
Prefer WAV or FLAC for better quality
Ensure speech is clear and at good volume
Avoid multiple overlapping speakers
Use mono audio when possible
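
Some of these properties can be checked before transcription. This is a minimal sketch using the standard-library wave module to report channel count, sample rate, and duration of a WAV file; wav_info is a hypothetical helper, and the in-memory clip exists only to make the example self-contained:

```python
import io
import wave


def wav_info(source) -> dict:
    """Return channels, sample rate and duration for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


# Build a one-second silent mono clip in memory to demonstrate the check
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)        # 16-bit samples
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

info = wav_info(buffer)
print(info)
```

A channel count above 1 suggests downmixing to mono before conversion.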

Audio Preprocessing Tips

For better results, consider preprocessing audio:

# Example: requires pydub (and ffmpeg for MP3 input)
from markitdown import MarkItDown
from pydub import AudioSegment
from pydub.effects import normalize

# Load and normalize audio
audio = AudioSegment.from_file("recording.mp3")
audio = normalize(audio)
audio.export("normalized.wav", format="wav")

# Then convert with MarkItDown
md = MarkItDown()
result = md.convert("normalized.wav")
print(result.text_content)

Combined Media Workflows

Processing Multiple Images in Batch

from markitdown import MarkItDown
from openai import OpenAI
import os

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Process all images in directory
os.makedirs("output", exist_ok=True)  # ensure the output directory exists
for filename in os.listdir("images"):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        result = md.convert(os.path.join("images", filename))

        # Save markdown with the same base name
        output = os.path.splitext(filename)[0] + '.md'
        with open(os.path.join("output", output), "w", encoding="utf-8") as f:
            f.write(result.text_content)

Screenshot Analysis Pipeline

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this screenshot comprehensively, including UI elements, text, and layout"
)

screenshots = ["screen1.png", "screen2.png", "screen3.png"]
analysis = []

for screenshot in screenshots:
    result = md.convert(screenshot)
    analysis.append({
        'file': screenshot,
        'content': result.text_content
    })

# Now ready for further processing
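
One possible next step is to serialize the collected results so that other tools can consume them. This sketch assumes the list-of-dicts shape built above; analysis_to_json is a hypothetical helper and the sample entry is illustrative:

```python
import json


def analysis_to_json(analysis: list[dict]) -> str:
    """Serialize the analysis list as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(analysis, indent=2, ensure_ascii=False)


# Stand-in for the list built by the pipeline (illustrative content only)
analysis = [{"file": "screen1.png", "content": "Login form with two input fields"}]
print(analysis_to_json(analysis))
```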

Document Images with OCR

For scanned documents or photos of documents:

from markitdown import MarkItDown

md = MarkItDown()

# Process scanned pages
pages = ["page1.jpg", "page2.jpg", "page3.jpg"]
full_text = []

for page in pages:
    result = md.convert(page)
    full_text.append(result.text_content)

# Combine into single document
document = "\n\n---\n\n".join(full_text)
print(document)
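
When page files are discovered from a directory rather than listed by hand, plain string sorting puts page10.jpg before page2.jpg. Here is a minimal sketch of a natural sort key that keeps pages in reading order; natural_key is a hypothetical helper:

```python
import re


def natural_key(name: str) -> list:
    """Split a name into text and integer parts so 'page10' sorts after 'page2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


pages = ["page10.jpg", "page2.jpg", "page1.jpg"]
print(sorted(pages, key=natural_key))
```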

Presentation Slide Images

When you have presentation slides as images:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this presentation slide, including title, bullet points, and visual elements"
)

# Process slide images
for i in range(1, 21):  # 20 slides
    result = md.convert(f"slides/slide_{i}.png")
    print(f"## Slide {i}\n\n{result.text_content}\n\n")

Error Handling

Image Processing Errors

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("image.jpg")
    print(result.text_content)
except FileNotFoundError:
    print("Image file not found")
except Exception as e:
    print(f"Error processing image: {e}")

Audio Processing Errors

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("audio.mp3")
    print(result.text_content)
except Exception as e:
    print(f"Transcription failed: {e}")
    # Common issues: format not supported, no speech detected, network error
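
Because the default transcription backend depends on the network, transient failures may be worth retrying. This is a sketch of exponential backoff around any conversion callable; convert_with_retry, the attempt count, and the delays are assumptions, not MarkItDown features:

```python
import time


def convert_with_retry(convert, path, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call convert(path), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return convert(path)
        except FileNotFoundError:
            raise  # a missing file will not fix itself; do not retry
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

Typical use would be convert_with_retry(md.convert, "audio.mp3"). The sleep parameter exists so the delay can be replaced in tests.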

Performance Optimization

Image Processing

LLM descriptions: Slower but more informative
OCR only: Faster for text extraction
EXIF only: Fastest, metadata only
Batch processing: Process multiple images in parallel
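
Parallel batch processing can be sketched with a thread pool, which suits workloads that mostly wait on network calls such as LLM descriptions. The helper below takes the conversion function as an argument; convert_all is hypothetical, and sharing one MarkItDown instance across threads is an assumption that should be verified for your version:

```python
from concurrent.futures import ThreadPoolExecutor


def convert_all(convert, paths, max_workers=4):
    """Run convert(path) for each path in a thread pool.

    Returns a dict mapping each path to its result, or to the exception raised.
    """
    def run(path):
        try:
            return path, convert(path)
        except Exception as exc:  # keep going when one file fails
            return path, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, paths))
```

For example, convert_all(md.convert, ["a.png", "b.png"]) returns one entry per file, so failures can be inspected after the batch finishes.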

Audio Processing

File size: Larger files take longer
Audio length: Transcription time scales with duration
Format conversion: WAV/FLAC are read directly and process faster than MP3/OGG, which must be decoded first
Network dependency: Default transcription requires internet
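
Since transcription time grows with duration, long recordings are sometimes split into shorter pieces first. The sketch below only computes the split points; the one-minute chunk length and one-second overlap are assumed values, and chunk_spans is a hypothetical helper:

```python
def chunk_spans(duration_ms: int, chunk_ms: int = 60_000, overlap_ms: int = 1_000):
    """Return (start, end) millisecond spans covering the clip, with a small overlap."""
    if chunk_ms <= overlap_ms:
        raise ValueError("chunk_ms must exceed overlap_ms")
    spans = []
    start = 0
    while start < duration_ms:
        end = min(start + chunk_ms, duration_ms)
        spans.append((start, end))
        if end == duration_ms:
            break
        start = end - overlap_ms  # step back so words at a boundary are not cut
    return spans


print(chunk_spans(150_000))
```

With pydub, an AudioSegment can be sliced by milliseconds (audio[start:end]), so each span maps directly to a piece that can be exported and converted separately.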

Use Cases

Document Digitization

Convert scanned documents or photos of documents to searchable text.

Meeting Notes

Transcribe audio recordings of meetings to text for analysis.

Presentation Analysis

Extract content from presentation slide images.

Screenshot Documentation

Generate descriptions of UI screenshots for documentation.

Image Archiving

Extract metadata and content from photo collections.

Accessibility

Generate alt-text descriptions for images using LLM integration.

Data Extraction

Extract text via OCR from images containing tables, forms, or structured data.
