Vision Understanding Reference

Comprehensive guide for image analysis, object detection, and visual understanding using Gemini API.

Core Capabilities

  • Captioning: Generate descriptive text for images
  • Classification: Categorize and identify content
  • Visual Q&A: Answer questions about images
  • Object Detection: Locate objects with bounding boxes (2.0+)
  • Segmentation: Create pixel-level masks (2.5+)
  • Multi-image: Analyze and compare up to 3,600 images per request
  • OCR: Extract text from images
  • Document Understanding: Process PDFs with vision

Supported Formats

  • Images: PNG, JPEG, WEBP, HEIC, HEIF
  • Documents: PDF (up to 1,000 pages)
  • Size Limits:
    • Inline: 20MB max total request
    • File API: 2GB per file
    • Max images: 3,600 per request
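
PDFs flow through the same generate_content call as images. A minimal sketch using the File API (the client is configured as in the examples below; 'report.pdf' is a placeholder path):

# Upload a PDF once via the File API, then query its pages
doc = client.files.upload(file='report.pdf')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize the key findings in this document', doc]
)
print(response.text)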

Model Selection

Gemini 2.5 Series

  • gemini-2.5-pro: Best quality, segmentation + detection
  • gemini-2.5-flash: Fast, efficient, all features
  • gemini-2.5-flash-lite: Lightweight, all features

Gemini 2.0 Series

  • gemini-2.0-flash: Object detection support
  • gemini-2.0-flash-lite: Lightweight detection

Feature Requirements

  • Segmentation: Requires 2.5+ models
  • Object Detection: Requires 2.0+ models
  • Multi-image: All models (up to 3,600 images)
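
A tiny lookup encoding the requirements above can keep feature gating out of call sites. The dict and helper names here are illustrative, not SDK API:

MIN_MODEL_FOR_FEATURE = {
    'segmentation': 'gemini-2.5-flash',      # any 2.5+ model
    'object_detection': 'gemini-2.0-flash',  # any 2.0+ model
    'multi_image': 'gemini-2.0-flash',       # all listed models
}

def model_for(feature: str) -> str:
    """Return a default model that supports the requested feature."""
    return MIN_MODEL_FOR_FEATURE[feature]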

Basic Image Analysis

Image Captioning

from google import genai
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Local file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this image in detail',
        genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
    ]
)
print(response.text)

Image Classification

# img_part: an image Part built as above, e.g.
# genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Classify this image. Provide category and confidence level.',
        img_part
    ]
)

Visual Question Answering

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'How many people are in this image and what are they doing?',
        img_part
    ]
)

Advanced Features

Object Detection (2.0+)

response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
        'Detect all objects in this image and provide bounding boxes',
        img_part
    ]
)

# Returns bounding box coordinates: [ymin, xmin, ymax, xmax]
# Normalized to [0, 1000] range
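
To use the boxes, parse the response and denormalize. A hedged sketch that assumes the model was prompted to answer as JSON (e.g. with response_mime_type='application/json') in the conventional shape [{"label": ..., "box_2d": [ymin, xmin, ymax, xmax]}, ...]:

import json

def to_pixels(box_2d, img_width, img_height):
    """Convert [ymin, xmin, ymax, xmax] from the 0-1000 range to pixel coords."""
    ymin, xmin, ymax, xmax = box_2d
    return (
        int(xmin / 1000 * img_width),
        int(ymin / 1000 * img_height),
        int(xmax / 1000 * img_width),
        int(ymax / 1000 * img_height),
    )

for det in json.loads(response.text):
    print(det['label'], to_pixels(det['box_2d'], 1920, 1080))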

Segmentation (2.5+)

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Create a segmentation mask for all people in this image',
        img_part
    ]
)

# Returns pixel-level masks for requested objects
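
Segmentation entries typically arrive as JSON with a label, a box, and a base64-encoded PNG probability mask; the exact fields can vary, so treat this decode sketch as an assumption to verify against your own responses:

import base64
import io
import json
import PIL.Image

for entry in json.loads(response.text):
    mask_b64 = entry['mask']
    # Strip a data-URI prefix if present
    prefix = 'data:image/png;base64,'
    if mask_b64.startswith(prefix):
        mask_b64 = mask_b64[len(prefix):]
    mask = PIL.Image.open(io.BytesIO(base64.b64decode(mask_b64)))
    print(entry['label'], mask.size)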

Multi-Image Comparison

import PIL.Image

img1 = PIL.Image.open('photo1.jpg')
img2 = PIL.Image.open('photo2.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare these two images. What are the differences?',
        img1,
        img2
    ]
)

OCR and Text Extraction

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all visible text from this image',
        img_part
    ]
)

Input Methods

Inline Data (<20MB)

from google.genai import types

# From file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
    ]
)

PIL Image

import PIL.Image

img = PIL.Image.open('photo.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What is in this image?', img]
)

File API (>20MB or Reuse)

# Upload once
myfile = client.files.upload(file='large-image.jpg')

# Use multiple times
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Describe this image', myfile]
)

response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What colors dominate this image?', myfile]
)

URL (Public Images)

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_uri(
        file_uri='https://example.com/image.jpg',
            mime_type='image/jpeg'
        )
    ]
)
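
Note that arbitrary public URLs are typically a Vertex AI feature; with the Gemini Developer API, file_uri is usually a File API URI (e.g. myfile.uri). If you want one call site that picks between inline bytes and the File API by size, a hypothetical helper (make_image_content is not part of the SDK):

import os
from google.genai import types

def make_image_content(client, path, limit=20 * 1024 * 1024):
    """Inline small images; upload large ones via the File API."""
    # The 20MB cap applies to the whole request, so stay below it per image
    if os.path.getsize(path) < limit:
        with open(path, 'rb') as f:
            return types.Part.from_bytes(data=f.read(), mime_type='image/jpeg')
    return client.files.upload(file=path)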

Token Calculation

Images consume tokens based on size:

Small images (≤384px both dimensions): 258 tokens

Large images: Cropped into tiles (each tile scaled to 768×768), 258 tokens per tile

Formula:

crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258

Examples:

  • 256×256: 258 tokens (small)
  • 512×512: 258 tokens (small)
  • 960×540: 6 tiles = 1,548 tokens
  • 1920×1080: 6 tiles = 1,548 tokens
  • 3840×2160 (4K): 24 tiles = 6,192 tokens
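
The rule above translates into a small estimator. It reproduces the HD examples, though very large images may be tiled or downscaled differently by the service, so treat it as an approximation:

import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate token cost of one image under the tiling rule above."""
    if width <= 384 and height <= 384:
        return 258
    crop_unit = math.floor(min(width, height) / 1.5)
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * 258

print(estimate_image_tokens(1920, 1080))  # 6 tiles -> 1548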

Structured Output

JSON Schema Output

from pydantic import BaseModel
from typing import List

class ObjectDetection(BaseModel):
    object_name: str
    confidence: float
    bounding_box: List[int]  # [ymin, xmin, ymax, xmax]

class ImageAnalysis(BaseModel):
    description: str
    objects: List[ObjectDetection]
    scene_type: str

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze this image', img_part],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=ImageAnalysis
    )
)

result = ImageAnalysis.model_validate_json(response.text)
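
When response_schema is set, the SDK can also hand back the parsed object directly, which avoids re-validating by hand:

# Equivalent shortcut (populated when response_schema is set)
result = response.parsed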

Multi-Image Analysis

Batch Processing

images = [
    PIL.Image.open(f'image{i}.jpg')
    for i in range(10)
]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze these images and find common themes'] + images
)

Image Comparison

before = PIL.Image.open('before.jpg')
after = PIL.Image.open('after.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare before and after. List all visible changes.',
        before,
        after
    ]
)

Reference Matching

reference = PIL.Image.open('target.jpg')
candidates = [PIL.Image.open(f'option{i}.jpg') for i in range(5)]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Find which candidate images contain objects similar to the reference',
        reference
    ] + candidates
)

Best Practices

Image Quality

  1. Resolution: Use clear, non-blurry images
  2. Rotation: Verify correct orientation
  3. Lighting: Ensure good contrast and lighting
  4. Size optimization: Balance quality vs token cost
  5. Format: JPEG for photos, PNG for graphics

Prompt Engineering

Specific instructions:

  • "Identify all vehicles with their colors and positions"
  • "Count people wearing blue shirts"
  • "Extract text from the sign in the top-left corner"

Output format:

  • "Return results as JSON with fields: category, count, description"
  • "Format as markdown table"
  • "List findings as numbered items"

Few-shot examples:

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Example: For an image of a cat on a sofa, respond: "Object: cat, Location: sofa"',
        'Now analyze this image:',
        img_part
    ]
)

File Management

  1. Use File API for images >20MB
  2. Use File API for repeated queries (saves tokens)
  3. Files auto-delete after 48 hours
  4. Clean up manually:
    client.files.delete(name=myfile.name)
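
To sweep everything in one pass, client.files.list() iterates over uploaded files:

for f in client.files.list():
    client.files.delete(name=f.name)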
    

Cost Optimization

Token-efficient strategies:

  • Resize large images before upload (see the sketch after this list)
  • Use File API for repeated queries
  • Batch multiple images when related
  • Use appropriate model (Flash vs Pro)
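
A minimal resize sketch for the first bullet, assuming JPEG output and an arbitrary 1,024 px cap (tune max_dim to your quality needs):

import PIL.Image

def resize_for_upload(path, out_path, max_dim=1024):
    """Downscale so the longest side is at most max_dim; fewer tiles, fewer tokens."""
    img = PIL.Image.open(path).convert('RGB')  # JPEG needs RGB
    img.thumbnail((max_dim, max_dim))          # in-place, preserves aspect ratio
    img.save(out_path, quality=90)
    return out_path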

Token costs (illustrative rate of $1 per 1M input tokens):

  • Small image (258 tokens): $0.000258
  • HD image (1,548 tokens): $0.001548
  • 4K image (6,192 tokens): $0.006192

Common Use Cases

1. Product Analysis

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this product image:
        1. Identify the product
        2. List visible features
        3. Assess condition
        4. Estimate value range
        ''',
        img_part
    ]
)

2. Screenshot Analysis

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all text and UI elements from this screenshot',
        img_part
    ]
)

3. Medical Imaging (Informational Only)

response = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=[
        'Describe visible features in this medical image. Note: This is for informational purposes only.',
        img_part
    ]
)

4. Chart/Graph Reading

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract data from this chart and format as JSON',
        img_part
    ]
)

5. Scene Understanding

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this scene:
        1. Location type
        2. Time of day
        3. Weather conditions
        4. Activities happening
        5. Mood/atmosphere
        ''',
        img_part
    ]
)

Error Handling

import time

def analyze_image_with_retry(image_path, prompt, max_retries=3):
    """Analyze image with exponential backoff retry"""
    for attempt in range(max_retries):
        try:
            with open(image_path, 'rb') as f:
                img_bytes = f.read()

            response = client.models.generate_content(
                model='gemini-2.5-flash',
                contents=[
                    prompt,
                    genai.types.Part.from_bytes(
                        data=img_bytes,
                        mime_type='image/jpeg'
                    )
                ]
            )
            return response.text
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt
            print(f"Retry {attempt + 1} after {wait_time}s: {e}")
            time.sleep(wait_time)

Limitations

  • Maximum 3,600 images per request
  • OCR accuracy varies with text quality
  • Object detection requires 2.0+ models
  • Segmentation requires 2.5+ models
  • No video frame extraction (use video API)
  • Regional restrictions on child images (EEA, CH, UK)