# Vision Understanding Reference

Comprehensive guide to image analysis, object detection, and visual understanding with the Gemini API.

## Core Capabilities

- **Captioning**: Generate descriptive text for images
- **Classification**: Categorize and identify content
- **Visual Q&A**: Answer questions about images
- **Object Detection**: Locate objects with bounding boxes (2.0+)
- **Segmentation**: Create pixel-level masks (2.5+)
- **Multi-image**: Compare up to 3,600 images
- **OCR**: Extract text from images
- **Document Understanding**: Process PDFs with vision

## Supported Formats

- **Images**: PNG, JPEG, WEBP, HEIC, HEIF
- **Documents**: PDF (up to 1,000 pages)
- **Size limits**:
  - Inline: 20MB max total request size
  - File API: 2GB per file
  - Max images: 3,600 per request

## Model Selection

### Gemini 2.5 Series

- **gemini-2.5-pro**: Best quality, segmentation + detection
- **gemini-2.5-flash**: Fast, efficient, all features
- **gemini-2.5-flash-lite**: Lightweight, all features

### Gemini 2.0 Series

- **gemini-2.0-flash**: Object detection support
- **gemini-2.0-flash-lite**: Lightweight detection

### Feature Requirements

- **Segmentation**: Requires 2.5+ models
- **Object Detection**: Requires 2.0+ models
- **Multi-image**: All models (up to 3,600 images)

## Basic Image Analysis

### Image Captioning

```python
from google import genai
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Read a local file and wrap it as a reusable Part
# (img_part is reused by the examples below)
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()
img_part = genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Describe this image in detail', img_part]
)
print(response.text)
```

### Image Classification

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Classify this image. Provide category and confidence level.',
        img_part
    ]
)
```

### Visual Question Answering

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'How many people are in this image and what are they doing?',
        img_part
    ]
)
```

## Advanced Features

### Object Detection (2.0+)

```python
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
        'Detect all objects in this image and provide bounding boxes',
        img_part
    ]
)
# Returns bounding box coordinates as [ymin, xmin, ymax, xmax],
# normalized to the [0, 1000] range
```
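To draw or crop with these boxes, the normalized coordinates must be scaled back to pixel space. A minimal sketch, assuming the prompt requested a JSON array of objects with `label` and `box_2d` fields (a common convention, not a fixed output format; adjust the field names to match your prompt):

```python
import json

def to_pixel_box(box, width, height):
    """Map a [ymin, xmin, ymax, xmax] box normalized to [0, 1000]
    onto an image with the given pixel dimensions."""
    ymin, xmin, ymax, xmax = box
    return (
        int(xmin / 1000 * width),   # left
        int(ymin / 1000 * height),  # top
        int(xmax / 1000 * width),   # right
        int(ymax / 1000 * height),  # bottom
    )

# Assumes JSON output was requested (e.g. via response_mime_type);
# otherwise strip any markdown fences from response.text before parsing.
detections = json.loads(response.text)
for det in detections:
    print(det['label'], to_pixel_box(det['box_2d'], 1920, 1080))
```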
### Segmentation (2.5+)

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Create a segmentation mask for all people in this image',
        img_part
    ]
)
# Returns pixel-level masks for the requested objects
```

### Multi-Image Comparison

```python
import PIL.Image

img1 = PIL.Image.open('photo1.jpg')
img2 = PIL.Image.open('photo2.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Compare these two images. What are the differences?', img1, img2]
)
```

### OCR and Text Extraction

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Extract all visible text from this image', img_part]
)
```

## Input Methods

### Inline Data (<20MB)

```python
from google.genai import types

# From file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
    ]
)
```

### PIL Image

```python
import PIL.Image

img = PIL.Image.open('photo.jpg')
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What is in this image?', img]
)
```

### File API (>20MB or Reuse)

```python
# Upload once
myfile = client.files.upload(file='large-image.jpg')

# Use multiple times
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Describe this image', myfile]
)
response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What colors dominate this image?', myfile]
)
```

### URL (Public Images)

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_uri(
            file_uri='https://example.com/image.jpg',
            mime_type='image/jpeg'
        )
    ]
)
```

## Token Calculation

Images consume tokens based on size:

- **Small images** (≤384 px in both dimensions): 258 tokens
- **Large images**: tiled into 768×768 chunks, 258 tokens each

**Formula**:

```
crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258
```

**Examples**:

- 256×256: 258 tokens (small)
- 384×384: 258 tokens (small)
- 960×540: 6 tiles = 1,548 tokens
- 1920×1080: 6 tiles = 1,548 tokens
- 3840×2160 (4K): 24 tiles = 6,192 tokens
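The formula translates directly into a local budgeting helper. Note that the simple rule is scale-invariant (tiles depend only on aspect ratio), so very large images may be tiled more finely than it predicts (see the 4K example above); treat the estimate as approximate and use the SDK's `count_tokens` call for the authoritative figure:

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate token cost for one image using the tiling rule above."""
    if width <= 384 and height <= 384:
        return 258  # small images cost a flat 258 tokens
    crop_unit = math.floor(min(width, height) / 1.5)
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * 258

print(estimate_image_tokens(1920, 1080))  # 1548

# Authoritative count straight from the API
count = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents=['Describe this image', img_part]
)
print(count.total_tokens)
```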
## Structured Output

### JSON Schema Output

```python
from pydantic import BaseModel
from typing import List

class ObjectDetection(BaseModel):
    object_name: str
    confidence: float
    bounding_box: List[int]  # [ymin, xmin, ymax, xmax]

class ImageAnalysis(BaseModel):
    description: str
    objects: List[ObjectDetection]
    scene_type: str

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze this image', img_part],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=ImageAnalysis
    )
)
result = ImageAnalysis.model_validate_json(response.text)
```

## Multi-Image Analysis

### Batch Processing

```python
images = [PIL.Image.open(f'image{i}.jpg') for i in range(10)]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze these images and find common themes'] + images
)
```

### Image Comparison

```python
before = PIL.Image.open('before.jpg')
after = PIL.Image.open('after.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare before and after. List all visible changes.',
        before,
        after
    ]
)
```

### Visual Search

```python
reference = PIL.Image.open('target.jpg')
candidates = [PIL.Image.open(f'option{i}.jpg') for i in range(5)]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Find which candidate images contain objects similar to the reference',
        reference
    ] + candidates
)
```

## Best Practices

### Image Quality

1. **Resolution**: Use clear, non-blurry images
2. **Rotation**: Verify correct orientation
3. **Lighting**: Ensure good contrast and lighting
4. **Size optimization**: Balance quality against token cost
5. **Format**: JPEG for photos, PNG for graphics

### Prompt Engineering

**Specific instructions**:

- "Identify all vehicles with their colors and positions"
- "Count people wearing blue shirts"
- "Extract text from the sign in the top-left corner"

**Output format**:

- "Return results as JSON with fields: category, count, description"
- "Format as a markdown table"
- "List findings as numbered items"

**Few-shot examples**:

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Example: For an image of a cat on a sofa, respond: "Object: cat, Location: sofa"',
        'Now analyze this image:',
        img_part
    ]
)
```

### File Management

1. Use the File API for images over 20MB
2. Use the File API for repeated queries (saves tokens)
3. Files auto-delete after 48 hours
4. Clean up manually:

```python
client.files.delete(name=myfile.name)
```

### Cost Optimization

**Token-efficient strategies**:

- Resize large images before upload
- Use the File API for repeated queries
- Batch related images into a single request
- Use the appropriate model for the task (Flash vs Pro)

**Token costs** (Gemini 2.5 Flash at $1 per 1M input tokens):

- Small image (258 tokens): $0.000258
- HD image (1,548 tokens): $0.001548
- 4K image (6,192 tokens): $0.006192

## Common Use Cases

### 1. Product Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this product image:
1. Identify the product
2. List visible features
3. Assess condition
4. Estimate value range
''',
        img_part
    ]
)
```

### 2. Screenshot Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all text and UI elements from this screenshot',
        img_part
    ]
)
```

### 3. Medical Imaging (Informational Only)

```python
response = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=[
        'Describe visible features in this medical image. Note: This is for informational purposes only.',
        img_part
    ]
)
```

### 4. Chart/Graph Reading

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract data from this chart and format as JSON',
        img_part
    ]
)
```

### 5. Scene Understanding

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this scene:
1. Location type
2. Time of day
3. Weather conditions
4. Activities happening
5. Mood/atmosphere
''',
        img_part
    ]
)
```

## Error Handling

```python
import time

def analyze_image_with_retry(image_path, prompt, max_retries=3):
    """Analyze an image, retrying with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            with open(image_path, 'rb') as f:
                img_bytes = f.read()
            response = client.models.generate_content(
                model='gemini-2.5-flash',
                contents=[
                    prompt,
                    genai.types.Part.from_bytes(
                        data=img_bytes,
                        mime_type='image/jpeg'
                    )
                ]
            )
            return response.text
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt
            print(f"Retry {attempt + 1} after {wait_time}s: {e}")
            time.sleep(wait_time)
```

## Limitations

- Maximum 3,600 images per request
- OCR accuracy varies with text quality
- Object detection requires 2.0+ models
- Segmentation requires 2.5+ models
- No video frame extraction (use the video API)
- Regional restrictions on images of children (EEA, CH, UK)
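Most of the request-size limits can be checked client-side before spending an API call. A minimal pre-flight sketch, assuming local file paths; the helper name and structure are illustrative, and the thresholds come from the limits listed above:

```python
import pathlib

SUPPORTED = {'.png', '.jpg', '.jpeg', '.webp', '.heic', '.heif'}
INLINE_LIMIT = 20 * 1024 * 1024  # 20MB total request size for inline data
MAX_IMAGES = 3600                # per-request image cap

def preflight(paths):
    """Validate local image paths against the limits above before sending."""
    if len(paths) > MAX_IMAGES:
        raise ValueError(f'{len(paths)} images exceeds the {MAX_IMAGES}-image limit')
    total_bytes = 0
    for path in map(pathlib.Path, paths):
        if path.suffix.lower() not in SUPPORTED:
            raise ValueError(f'Unsupported format: {path.name}')
        total_bytes += path.stat().st_size
    if total_bytes > INLINE_LIMIT:
        raise ValueError('Payload exceeds 20MB; upload via the File API instead')

preflight(['photo1.jpg', 'photo2.jpg'])
```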