# Vision Understanding Reference

Comprehensive guide to image analysis, object detection, and visual understanding with the Gemini API.

## Core Capabilities

- **Captioning**: Generate descriptive text for images
- **Classification**: Categorize and identify content
- **Visual Q&A**: Answer questions about images
- **Object Detection**: Locate objects with bounding boxes (Gemini 2.0+)
- **Segmentation**: Create pixel-level masks (Gemini 2.5+)
- **Multi-image**: Compare up to 3,600 images per request
- **OCR**: Extract text from images
- **Document Understanding**: Process PDFs with vision

## Supported Formats

- **Images**: PNG, JPEG, WEBP, HEIC, HEIF
- **Documents**: PDF (up to 1,000 pages)
- **Size Limits**:
  - Inline: 20MB max for the total request (prompt plus all inline data)
  - File API: 2GB per file
  - Max images: 3,600 per request
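
The size limits above suggest a simple rule for picking an input method. The following is a minimal sketch, not part of the SDK: the function names and the 1MB headroom reserved for the prompt are illustrative assumptions.

```python
import os

INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # 20MB total request (limit listed above)
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # 2GB per file (limit listed above)


def choose_input_method(size_bytes, headroom_bytes=1024 * 1024):
    """Return 'inline' or 'file_api' for a payload of the given size.

    headroom_bytes is a hypothetical allowance for the prompt and other
    request overhead; tune it for your own requests.
    """
    if size_bytes > FILE_API_LIMIT_BYTES:
        raise ValueError('file exceeds the 2GB File API limit')
    if size_bytes + headroom_bytes <= INLINE_LIMIT_BYTES:
        return 'inline'
    return 'file_api'


def choose_for_path(path):
    """Convenience wrapper that reads the file size from disk."""
    return choose_input_method(os.path.getsize(path))
```

When several images go into one request, pass their combined size, since the inline limit applies to the request as a whole.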

## Model Selection

### Gemini 2.5 Series
- **gemini-2.5-pro**: Best quality, segmentation + detection
- **gemini-2.5-flash**: Fast, efficient, all features
- **gemini-2.5-flash-lite**: Lightweight, all features

### Gemini 2.0 Series
- **gemini-2.0-flash**: Object detection support
- **gemini-2.0-flash-lite**: Lightweight detection

### Feature Requirements
- **Segmentation**: Requires 2.5+ models
- **Object Detection**: Requires 2.0+ models
- **Multi-image**: All models (up to 3,600 images)
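
The feature requirements above can be checked in code before sending a request. This is a sketch under the assumption that model names follow the `gemini-<major>.<minor>-...` pattern used in this document; `model_version` and `supports` are hypothetical helpers, not SDK functions.

```python
import re

# Minimum model series per feature, taken from the list above.
MIN_VERSION = {
    'segmentation': (2, 5),
    'object_detection': (2, 0),
}


def model_version(model_name):
    """Extract (major, minor) from a name like 'gemini-2.5-flash'."""
    match = re.match(r'gemini-(\d+)\.(\d+)', model_name)
    if match is None:
        raise ValueError(f'unrecognized model name: {model_name}')
    return int(match.group(1)), int(match.group(2))


def supports(model_name, feature):
    """True if the model series meets the feature's minimum version."""
    required = MIN_VERSION.get(feature, (0, 0))  # unlisted features: no minimum
    return model_version(model_name) >= required
```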

## Basic Image Analysis

### Image Captioning

```python
from google import genai
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Local file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()

# Wrap the bytes in a Part; later examples reuse `img_part`
img_part = genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Describe this image in detail', img_part]
)
print(response.text)
```

### Image Classification

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Classify this image. Provide category and confidence level.',
        img_part
    ]
)
```

### Visual Question Answering

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'How many people are in this image and what are they doing?',
        img_part
    ]
)
```

## Advanced Features

### Object Detection (2.0+)

```python
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
        'Detect all objects in this image and provide bounding boxes',
        img_part
    ]
)

# Returns bounding box coordinates: [ymin, xmin, ymax, xmax]
# Normalized to [0, 1000] range
```
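
Because the coordinates are normalized to a 0-1000 range, they need rescaling before they can be drawn on the original image. A minimal sketch of that conversion follows; `to_pixel_box` is a hypothetical helper, and the `(left, top, right, bottom)` output order is chosen here because it matches what many drawing libraries expect.

```python
def to_pixel_box(box, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to pixel coordinates.

    Returns (left, top, right, bottom) in pixels.
    """
    ymin, xmin, ymax, xmax = box
    left = round(xmin * image_width / 1000)
    top = round(ymin * image_height / 1000)
    right = round(xmax * image_width / 1000)
    bottom = round(ymax * image_height / 1000)
    return left, top, right, bottom


# A box covering the middle half of the width and rows 10%-50% of the height
print(to_pixel_box([100, 250, 500, 750], 1920, 1080))  # (480, 108, 1440, 540)
```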

### Segmentation (2.5+)

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Create a segmentation mask for all people in this image',
        img_part
    ]
)

# Returns pixel-level masks for requested objects
```
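
Mask data arrives as text and has to be decoded before use. The sketch below assumes the prompt asked for a JSON list whose items carry `box_2d`, `label`, and `mask` keys, with `mask` holding a base64-encoded PNG that may be prefixed with `data:image/png;base64,`. Those key names and the `decode_masks` helper are assumptions for illustration; check them against the response you actually receive.

```python
import base64
import json

PNG_PREFIX = 'data:image/png;base64,'


def decode_masks(response_text):
    """Parse a JSON list of mask items into (label, box, png_bytes) tuples.

    Assumes each item has 'label', 'box_2d', and 'mask' keys; adjust the
    key names if your prompt requests a different schema.
    """
    results = []
    for item in json.loads(response_text):
        mask = item['mask']
        if mask.startswith(PNG_PREFIX):
            mask = mask[len(PNG_PREFIX):]  # drop the data-URI prefix
        results.append((item['label'], item['box_2d'], base64.b64decode(mask)))
    return results
```

The returned bytes can then be opened with an image library and resized to the bounding box.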

### Multi-Image Comparison

```python
import PIL.Image

img1 = PIL.Image.open('photo1.jpg')
img2 = PIL.Image.open('photo2.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare these two images. What are the differences?',
        img1,
        img2
    ]
)
```

### OCR and Text Extraction

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all visible text from this image',
        img_part
    ]
)
```

## Input Methods

### Inline Data (<20MB)

```python
from google.genai import types

# From file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
    ]
)
```
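
The examples here hard-code `image/jpeg`, which is wrong for other formats. A small lookup keyed on the file extension is one way to pick the MIME type; `guess_mime_type` is a hypothetical helper, and the table only covers the formats listed under Supported Formats.

```python
import os

# Extension-to-MIME mapping for the formats listed under Supported Formats.
SUPPORTED_MIME_TYPES = {
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.webp': 'image/webp',
    '.heic': 'image/heic',
    '.heif': 'image/heif',
    '.pdf': 'application/pdf',
}


def guess_mime_type(path):
    """Return the MIME type for a supported file, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_MIME_TYPES:
        raise ValueError(f'unsupported file type: {ext or path}')
    return SUPPORTED_MIME_TYPES[ext]
```

The result can be passed as `mime_type` to `types.Part.from_bytes`.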

### PIL Image

```python
import PIL.Image

img = PIL.Image.open('photo.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What is in this image?', img]
)
```

### File API (>20MB or Reuse)

```python
# Upload once
myfile = client.files.upload(file='large-image.jpg')

# Use multiple times
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Describe this image', myfile]
)

response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What colors dominate this image?', myfile]
)
```

### URL (Public Images)

```python
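# Part.from_uri takes the keyword `file_uri`. Whether an arbitrary public
# URL is fetched depends on the backend; if the request is rejected,
# download the bytes yourself and pass them with types.Part.from_bytes.
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_uri(
            file_uri='https://example.com/image.jpg',
            mime_type='image/jpeg'
        )
    ]
)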
```

## Token Calculation

Images consume tokens based on size:

**Small images** (both dimensions ≤384px): 258 tokens

**Large images**: Cropped and scaled into 768×768 tiles, 258 tokens each

**Formula** (approximate):
```
crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258
```

**Examples** (estimates from the formula; use `client.models.count_tokens` for exact counts):
- 256×256: 258 tokens (small)
- 512×512: crop unit 341, 2 × 2 = 4 tiles = 1,032 tokens
- 960×540: crop unit 360, 3 × 2 = 6 tiles = 1,548 tokens
- 1920×1080: crop unit 720, 3 × 2 = 6 tiles = 1,548 tokens
- 3840×2160 (4K): crop unit 1440, 3 × 2 = 6 tiles = 1,548 tokens
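
The tile formula can be turned into a rough estimator. This sketch assumes each quotient is rounded up to a whole tile, which is what makes 960×540 come out at 6 tiles; `estimate_image_tokens` is a hypothetical helper, and its result is an estimate rather than the count the API will bill.

```python
import math

TOKENS_PER_TILE = 258
SMALL_IMAGE_MAX = 384  # px; at or below this in both dimensions counts as one tile


def estimate_image_tokens(width, height):
    """Estimate image tokens from pixel dimensions using the tile formula."""
    if width <= SMALL_IMAGE_MAX and height <= SMALL_IMAGE_MAX:
        return TOKENS_PER_TILE
    # Guard against a zero crop unit for extremely thin images
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(256, 256))  # 258
print(estimate_image_tokens(960, 540))  # 1548
```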

## Structured Output

### JSON Schema Output

```python
from pydantic import BaseModel
from typing import List

class ObjectDetection(BaseModel):
    object_name: str
    confidence: float
    bounding_box: List[int]  # [ymin, xmin, ymax, xmax]

class ImageAnalysis(BaseModel):
    description: str
    objects: List[ObjectDetection]
    scene_type: str

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze this image', img_part],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=ImageAnalysis
    )
)

result = ImageAnalysis.model_validate_json(response.text)
```

## Multi-Image Analysis

### Batch Processing

```python
images = [
    PIL.Image.open(f'image{i}.jpg')
    for i in range(10)
]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze these images and find common themes'] + images
)
```
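
A collection larger than the 3,600-image limit has to be split across several requests. A minimal sketch of that split follows; `chunk` is a hypothetical helper, and in practice the total request size will often force smaller batches than the image-count limit alone.

```python
MAX_IMAGES_PER_REQUEST = 3600  # limit stated in this document


def chunk(items, size=MAX_IMAGES_PER_REQUEST):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError('size must be at least 1')
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 7,500 image paths split into three requests
paths = [f'image{i}.jpg' for i in range(7500)]
print([len(batch) for batch in chunk(paths)])  # [3600, 3600, 300]
```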

### Image Comparison

```python
before = PIL.Image.open('before.jpg')
after = PIL.Image.open('after.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare before and after. List all visible changes.',
        before,
        after
    ]
)
```

### Visual Search

```python
reference = PIL.Image.open('target.jpg')
candidates = [PIL.Image.open(f'option{i}.jpg') for i in range(5)]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Find which candidate images contain objects similar to the reference',
        reference
    ] + candidates
)
```

## Best Practices

### Image Quality

1. **Resolution**: Use clear, non-blurry images
2. **Rotation**: Verify correct orientation
3. **Lighting**: Ensure good contrast and lighting
4. **Size optimization**: Balance quality vs token cost
5. **Format**: JPEG for photos, PNG for graphics

### Prompt Engineering

**Specific instructions**:
- "Identify all vehicles with their colors and positions"
- "Count people wearing blue shirts"
- "Extract text from the sign in the top-left corner"

**Output format**:
- "Return results as JSON with fields: category, count, description"
- "Format as markdown table"
- "List findings as numbered items"

**Few-shot examples**:
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Example: For an image of a cat on a sofa, respond: "Object: cat, Location: sofa"',
        'Now analyze this image:',
        img_part
    ]
)
```

### File Management

1. Use File API for images >20MB
2. Use File API for repeated queries (avoids re-uploading the bytes; image tokens are still counted on every request)
3. Files auto-delete after 48 hours
4. Clean up manually:
   ```python
   client.files.delete(name=myfile.name)
   ```

### Cost Optimization

**Token-efficient strategies**:
- Resize large images before upload
- Use File API for repeated queries (saves upload bandwidth, not tokens)
- Batch multiple images when related
- Use appropriate model (Flash vs Pro)

**Token costs** (illustrative, assuming $1 per 1M input tokens; check current pricing for your model):
- Small image (258 tokens): $0.000258
- HD image (1,548 tokens): $0.001548
- Large image of 24 tiles (6,192 tokens): $0.006192
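
The arithmetic behind these figures is a single multiplication, sketched below. The default rate of $1 per 1M tokens mirrors the assumption used in the list above and is not a quoted price; `image_cost_usd` is a hypothetical helper.

```python
def image_cost_usd(tokens, usd_per_million_tokens=1.0):
    """Input cost in USD for a given token count at a given rate."""
    return tokens * usd_per_million_tokens / 1_000_000


# 1,000 HD images at 1,548 tokens each, at the assumed rate
print(f'${image_cost_usd(1548 * 1000):.3f}')  # $1.548
```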

## Common Use Cases

### 1. Product Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this product image:
        1. Identify the product
        2. List visible features
        3. Assess condition
        4. Estimate value range
        ''',
        img_part
    ]
)
```

### 2. Screenshot Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all text and UI elements from this screenshot',
        img_part
    ]
)
```

### 3. Medical Imaging (Informational Only)

```python
response = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=[
        'Describe visible features in this medical image. Note: This is for informational purposes only.',
        img_part
    ]
)
```

### 4. Chart/Graph Reading

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract data from this chart and format as JSON',
        img_part
    ]
)
```
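
When JSON is requested only through the prompt, the reply may come back wrapped in a markdown code fence. The sketch below tolerates that wrapper before parsing; `parse_json_response` is a hypothetical helper, and setting `response_mime_type='application/json'` as shown under Structured Output avoids the problem at the source.

```python
import json
import re

FENCE = '`' * 3  # built at runtime so this example contains no literal fence


def parse_json_response(text):
    """Parse JSON from model output, tolerating a surrounding code fence."""
    pattern = FENCE + r'(?:json)?\s*(.*?)\s*' + FENCE
    match = re.search(pattern, text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())
```

Call it on `response.text`; a `json.JSONDecodeError` still propagates when the payload is not valid JSON.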

### 5. Scene Understanding

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this scene:
        1. Location type
        2. Time of day
        3. Weather conditions
        4. Activities happening
        5. Mood/atmosphere
        ''',
        img_part
    ]
)
```

## Error Handling

```python
import time

from google.genai import errors

def analyze_image_with_retry(image_path, prompt, max_retries=3):
    """Analyze an image, retrying API errors with exponential backoff."""
    # Read the file once; a missing or unreadable file is not retryable
    with open(image_path, 'rb') as f:
        img_bytes = f.read()

    img_part = genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')

    for attempt in range(max_retries):
        try:
            response = client.models.generate_content(
                model='gemini-2.5-flash',
                contents=[prompt, img_part]
            )
            return response.text
        except errors.APIError as e:
            # Retry only rate limits (429) and server errors (5xx)
            retryable = e.code == 429 or e.code >= 500
            if not retryable or attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Retry {attempt + 1} after {wait_time}s: {e}")
            time.sleep(wait_time)
```

## Limitations

- Maximum 3,600 images per request
- OCR accuracy varies with text quality
- Object detection requires 2.0+ models
- Segmentation requires 2.5+ models
- No video frame extraction (use video API)
- Regional restrictions on child images (EEA, CH, UK)