# Vision Understanding Reference
Comprehensive guide to image analysis, object detection, and visual understanding with the Gemini API.
## Core Capabilities
- **Captioning**: Generate descriptive text for images
- **Classification**: Categorize and identify content
- **Visual Q&A**: Answer questions about images
- **Object Detection**: Locate objects with bounding boxes (2.0+)
- **Segmentation**: Create pixel-level masks (2.5+)
- **Multi-image**: Compare up to 3,600 images
- **OCR**: Extract text from images
- **Document Understanding**: Process PDFs with vision
## Supported Formats
- **Images**: PNG, JPEG, WEBP, HEIC, HEIF
- **Documents**: PDF (up to 1,000 pages)
- **Size Limits**:
  - Inline: 20MB max total request
  - File API: 2GB per file
  - Max images: 3,600 per request
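
The size limits above suggest a simple routing rule between inline data and the File API. The following is a minimal sketch, not part of the SDK: `choose_input_method` and the `overhead_bytes` allowance are hypothetical, and the byte values assume "20MB" and "2GB" mean binary megabytes and gigabytes.

```python
# Assumed byte values for the limits listed above.
INLINE_REQUEST_LIMIT = 20 * 1024 * 1024   # 20MB for the whole request
FILE_API_LIMIT = 2 * 1024 ** 3            # 2GB per uploaded file


def choose_input_method(file_size_bytes: int, overhead_bytes: int = 1024 * 1024) -> str:
    """Return 'inline' or 'file_api' for an image of the given size.

    overhead_bytes is a hypothetical allowance for the prompt and encoding,
    since the inline limit applies to the total request, not just the image.
    """
    if file_size_bytes > FILE_API_LIMIT:
        raise ValueError("file exceeds the 2GB File API limit")
    if file_size_bytes + overhead_bytes <= INLINE_REQUEST_LIMIT:
        return "inline"
    return "file_api"
```

A caller could pass `os.path.getsize(path)` as `file_size_bytes`; a larger `overhead_bytes` is the safer choice when several images share one request.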
## Model Selection
### Gemini 2.5 Series
- **gemini-2.5-pro**: Best quality, segmentation + detection
- **gemini-2.5-flash**: Fast, efficient, all features
- **gemini-2.5-flash-lite**: Lightweight, all features
### Gemini 2.0 Series
- **gemini-2.0-flash**: Object detection support
- **gemini-2.0-flash-lite**: Lightweight detection
### Feature Requirements
- **Segmentation**: Requires 2.5+ models
- **Object Detection**: Requires 2.0+ models
- **Multi-image**: All models (up to 3,600 images)
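
The feature requirements above can be kept as a small lookup table when code has to pick a model. This is a sketch only: the feature keys and the helper `models_supporting` are hypothetical names, and the table restates the lists in this section rather than querying the API.

```python
# Feature support per model, transcribed from the lists above (not fetched from the API).
MODEL_FEATURES = {
    "gemini-2.5-pro": {"detection", "segmentation", "multi_image"},
    "gemini-2.5-flash": {"detection", "segmentation", "multi_image"},
    "gemini-2.5-flash-lite": {"detection", "segmentation", "multi_image"},
    "gemini-2.0-flash": {"detection", "multi_image"},
    "gemini-2.0-flash-lite": {"detection", "multi_image"},
}


def models_supporting(feature: str) -> list[str]:
    """Return the model names (sorted) whose feature set contains `feature`."""
    return sorted(name for name, feats in MODEL_FEATURES.items() if feature in feats)
```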
## Basic Image Analysis
### Image Captioning
```python
from google import genai
import os
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
# Local file: read the bytes and wrap them in a Part (reused as img_part below)
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()
img_part = genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this image in detail',
        img_part
    ]
)
print(response.text)
```
### Image Classification
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Classify this image. Provide category and confidence level.',
        img_part
    ]
)
```
### Visual Question Answering
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'How many people are in this image and what are they doing?',
        img_part
    ]
)
```
## Advanced Features
### Object Detection (2.0+)
```python
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
        'Detect all objects in this image and provide bounding boxes',
        img_part
    ]
)
# Returns bounding box coordinates: [ymin, xmin, ymax, xmax]
# Normalized to [0, 1000] range
```
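Because the boxes come back as `[ymin, xmin, ymax, xmax]` on a 0 to 1000 scale, they need rescaling before they can be drawn on the original image. A minimal sketch of that conversion follows; `to_pixel_box` is a hypothetical helper name, and the `(left, top, right, bottom)` output order is chosen here to match common drawing APIs.

```python
def to_pixel_box(box, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to pixel coordinates.

    Returns (left, top, right, bottom) in pixels.
    """
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin * image_width / 1000),   # left
        round(ymin * image_height / 1000),  # top
        round(xmax * image_width / 1000),   # right
        round(ymax * image_height / 1000),  # bottom
    )


# Example: a box covering the middle half of a 1920x1080 frame horizontally
print(to_pixel_box([100, 250, 500, 750], 1920, 1080))  # (480, 108, 1440, 540)
```

Note that the y coordinate comes first in the model's output, so swapping the axes is the most likely mistake when adapting this.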
### Segmentation (2.5+)
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Create a segmentation mask for all people in this image',
        img_part
    ]
)
# Returns pixel-level masks for requested objects
```
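The masks typically arrive inside a JSON list, one item per object. The sketch below assumes each item carries `box_2d`, `label`, and a `mask` field holding a base64-encoded PNG behind a `data:image/png;base64,` prefix, as in Google's published segmentation examples; treat those field names as an assumption and inspect a real response before relying on them. `decode_mask_png` is a hypothetical helper.

```python
import base64

# Assumed prefix on the "mask" field of each segmentation item.
PNG_PREFIX = "data:image/png;base64,"


def decode_mask_png(item: dict) -> bytes:
    """Return the raw PNG bytes of one segmentation item's mask."""
    mask = item["mask"]
    if not mask.startswith(PNG_PREFIX):
        raise ValueError("mask is not a base64-encoded PNG data URI")
    return base64.b64decode(mask[len(PNG_PREFIX):])
```

The decoded bytes can then be opened with an image library (for example `PIL.Image.open(io.BytesIO(png_bytes))`) and resized to the item's bounding box.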
### Multi-Image Comparison
```python
import PIL.Image
img1 = PIL.Image.open('photo1.jpg')
img2 = PIL.Image.open('photo2.jpg')
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare these two images. What are the differences?',
        img1,
        img2
    ]
)
```
### OCR and Text Extraction
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all visible text from this image',
        img_part
    ]
)
```
## Input Methods
### Inline Data (<20MB)
```python
from google.genai import types
# From file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
    ]
)
```
### PIL Image
```python
import PIL.Image
img = PIL.Image.open('photo.jpg')
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What is in this image?', img]
)
```
### File API (>20MB or Reuse)
```python
# Upload once
myfile = client.files.upload(file='large-image.jpg')
# Use multiple times
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Describe this image', myfile]
)
response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What colors dominate this image?', myfile]
)
```
### URL (Public Images)
```python
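response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        # Part.from_uri takes `file_uri` (not `uri`). Whether a plain HTTPS URL
        # is fetched depends on the backend; if it is rejected, download the
        # image yourself and pass it with Part.from_bytes instead.
        types.Part.from_uri(
            file_uri='https://example.com/image.jpg',
            mime_type='image/jpeg'
        )
    ]
)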
```
## Token Calculation
Images consume tokens based on their dimensions:

- **Small images** (both dimensions ≤384px): 258 tokens
- **Large images**: tiled into 768×768 chunks, 258 tokens each

**Formula**:
```
crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258
```
**Examples**:
- 256×256: 258 tokens (small)
- 512×512: 4 tiles = 1,032 tokens (above the 384px small-image threshold, so the tiling formula applies)
- 960×540: 6 tiles = 1,548 tokens
- 1920×1080: 6 tiles = 1,548 tokens
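- 3840×2160 (4K): 6 tiles = 1,548 tokens by this formula (the crop unit scales with the image); confirm actual usage with `client.models.count_tokens`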
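
The calculation above can be sketched as a small estimator. This is an approximation under stated assumptions: partial tiles are rounded up (which is what the 960×540 example implies), the crop unit is floored at 1 pixel to avoid division by zero, and `estimate_image_tokens` is a hypothetical helper, not an SDK function.

```python
import math

TOKENS_PER_TILE = 258
SMALL_IMAGE_MAX_SIDE = 384


def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate image tokens from pixel dimensions using the tiling formula."""
    if width <= SMALL_IMAGE_MAX_SIDE and height <= SMALL_IMAGE_MAX_SIDE:
        return TOKENS_PER_TILE
    # Crop unit derived from the shorter side; never below 1 pixel.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    # Assumption: partial tiles count as whole tiles.
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(256, 256))    # 258
print(estimate_image_tokens(960, 540))    # 1548
print(estimate_image_tokens(1920, 1080))  # 1548
```

For billing-grade numbers, `client.models.count_tokens(model=..., contents=...)` reports what the API itself counts for a given request.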
## Structured Output
### JSON Schema Output
```python
from pydantic import BaseModel
from typing import List

class ObjectDetection(BaseModel):
    object_name: str
    confidence: float
    bounding_box: List[int]  # [ymin, xmin, ymax, xmax]

class ImageAnalysis(BaseModel):
    description: str
    objects: List[ObjectDetection]
    scene_type: str

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze this image', img_part],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=ImageAnalysis
    )
)
result = ImageAnalysis.model_validate_json(response.text)
```
## Multi-Image Analysis
### Batch Processing
```python
images = [
    PIL.Image.open(f'image{i}.jpg')
    for i in range(10)
]
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze these images and find common themes'] + images
)
```
### Image Comparison
```python
before = PIL.Image.open('before.jpg')
after = PIL.Image.open('after.jpg')
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare before and after. List all visible changes.',
        before,
        after
    ]
)
```
### Visual Search
```python
reference = PIL.Image.open('target.jpg')
candidates = [PIL.Image.open(f'option{i}.jpg') for i in range(5)]
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Find which candidate images contain objects similar to the reference',
        reference
    ] + candidates
)
```
## Best Practices
### Image Quality
1. **Resolution**: Use clear, non-blurry images
2. **Rotation**: Verify correct orientation
3. **Lighting**: Ensure good contrast and lighting
4. **Size optimization**: Balance quality vs token cost
5. **Format**: JPEG for photos, PNG for graphics
### Prompt Engineering
**Specific instructions**:
- "Identify all vehicles with their colors and positions"
- "Count people wearing blue shirts"
- "Extract text from the sign in the top-left corner"

**Output format**:
- "Return results as JSON with fields: category, count, description"
- "Format as markdown table"
- "List findings as numbered items"

**Few-shot examples**:
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Example: For an image of a cat on a sofa, respond: "Object: cat, Location: sofa"',
        'Now analyze this image:',
        img_part
    ]
)
```
### File Management
1. Use File API for images >20MB
2. Use File API for repeated queries (avoids re-uploading the same bytes; image tokens are still counted on every request)
3. Files auto-delete after 48 hours
4. Clean up manually:
```python
client.files.delete(name=myfile.name)
```
### Cost Optimization
**Token-efficient strategies**:
- Resize large images before upload
- Use File API for repeated queries (saves upload bandwidth, not tokens)
- Batch multiple images when related
- Use appropriate model (Flash vs Pro)

**Token costs** (illustrative, assuming $1 per 1M input tokens; check current pricing for your model):
- Small image (258 tokens): $0.000258
- HD image (1,548 tokens): $0.001548
- Large image (24 tiles, 6,192 tokens): $0.006192
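
The per-image figures above are just token counts multiplied by a per-token rate. A minimal sketch of that arithmetic follows; `image_cost_usd` is a hypothetical helper, and the default of $1 per 1M tokens mirrors the rate used in this section rather than a quoted price.

```python
def image_cost_usd(tokens: int, usd_per_million_tokens: float = 1.0) -> float:
    """Input cost in USD for a given number of image tokens."""
    return tokens * usd_per_million_tokens / 1_000_000


# Cost of sending 10,000 HD images (1,548 tokens each) at the assumed rate
batch_cost = 10_000 * image_cost_usd(1548)
print(f"${batch_cost:.2f}")  # $15.48
```

Passing the current published rate for the chosen model as `usd_per_million_tokens` keeps the estimate in line with actual billing.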
## Common Use Cases
### 1. Product Analysis
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this product image:
1. Identify the product
2. List visible features
3. Assess condition
4. Estimate value range
''',
        img_part
    ]
)
```
### 2. Screenshot Analysis
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all text and UI elements from this screenshot',
        img_part
    ]
)
```
### 3. Medical Imaging (Informational Only)
```python
response = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=[
        'Describe visible features in this medical image. Note: This is for informational purposes only.',
        img_part
    ]
)
```
### 4. Chart/Graph Reading
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract data from this chart and format as JSON',
        img_part
    ]
)
```
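When JSON is requested through the prompt alone, without `response_mime_type`, the reply may come wrapped in a Markdown code fence. The sketch below strips such a fence before parsing; `parse_json_response` is a hypothetical helper, and it assumes the reply holds a single JSON value with no surrounding prose.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep this example readable


def parse_json_response(text: str):
    """Parse JSON from a model reply, tolerating a surrounding Markdown fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag).
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence if present.
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[: -len(FENCE)]
    return json.loads(cleaned)
```

Calling `parse_json_response(response.text)` raises `json.JSONDecodeError` when the reply is not valid JSON, which is a reasonable signal to retry or to switch to schema-constrained output.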
### 5. Scene Understanding
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this scene:
1. Location type
2. Time of day
3. Weather conditions
4. Activities happening
5. Mood/atmosphere
''',
        img_part
    ]
)
```
## Error Handling
```python
import time

def analyze_image_with_retry(image_path, prompt, max_retries=3):
    """Analyze image with exponential backoff retry"""
    for attempt in range(max_retries):
        try:
            with open(image_path, 'rb') as f:
                img_bytes = f.read()
            response = client.models.generate_content(
                model='gemini-2.5-flash',
                contents=[
                    prompt,
                    genai.types.Part.from_bytes(
                        data=img_bytes,
                        mime_type='image/jpeg'
                    )
                ]
            )
            return response.text
        except Exception as e:
            # Re-raise once the final attempt has failed
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Retry {attempt + 1} after {wait_time}s: {e}")
            time.sleep(wait_time)
```
## Limitations
- Maximum 3,600 images per request
- OCR accuracy varies with text quality
- Object detection requires 2.0+ models
- Segmentation requires 2.5+ models
- No video frame extraction (use video API)
- Regional restrictions on child images (EEA, CH, UK)
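
Collections larger than the per-request image cap have to be split across several requests. A minimal sketch of that split follows; `batch_images` is a hypothetical helper, and the default of 3,600 simply restates the limit listed above.

```python
def batch_images(paths, max_per_request=3600):
    """Split a list of image paths into chunks that fit the per-request cap."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [
        paths[start:start + max_per_request]
        for start in range(0, len(paths), max_per_request)
    ]


batches = batch_images([f"img{i}.jpg" for i in range(8000)])
print([len(b) for b in batches])  # [3600, 3600, 800]
```

In practice the request-size and token limits are likely to bind well before 3,600 images, so a smaller `max_per_request` is usually the safer starting point.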