# Vision Understanding Reference

Comprehensive guide for image analysis, object detection, and visual understanding using the Gemini API.

## Core Capabilities

- **Captioning**: Generate descriptive text for images
- **Classification**: Categorize and identify content
- **Visual Q&A**: Answer questions about images
- **Object Detection**: Locate objects with bounding boxes (2.0+)
- **Segmentation**: Create pixel-level masks (2.5+)
- **Multi-image**: Compare up to 3,600 images
- **OCR**: Extract text from images
- **Document Understanding**: Process PDFs with vision

## Supported Formats

- **Images**: PNG, JPEG, WEBP, HEIC, HEIF
- **Documents**: PDF (up to 1,000 pages)
- **Size Limits**:
  - Inline: 20MB max total request size
  - File API: 2GB per file
  - Max images: 3,600 per request

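When building `Part.from_bytes` payloads, the MIME type must match one of these formats. A small helper sketch using only the standard library; `image_mime_type` is a hypothetical name, not part of the SDK:

```python
import mimetypes

# Supported image MIME types per the list above
SUPPORTED_IMAGE_TYPES = {'image/png', 'image/jpeg', 'image/webp', 'image/heic', 'image/heif'}

def image_mime_type(path: str) -> str:
    """Guess a file's MIME type and verify it is a supported image format."""
    mime, _ = mimetypes.guess_type(path)
    if mime not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f'Unsupported image format: {path}')
    return mime
```
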
## Model Selection

### Gemini 2.5 Series

- **gemini-2.5-pro**: Best quality, segmentation + detection
- **gemini-2.5-flash**: Fast, efficient, all features
- **gemini-2.5-flash-lite**: Lightweight, all features

### Gemini 2.0 Series

- **gemini-2.0-flash**: Object detection support
- **gemini-2.0-flash-lite**: Lightweight detection

### Feature Requirements

- **Segmentation**: Requires 2.5+ models
- **Object Detection**: Requires 2.0+ models
- **Multi-image**: All models (up to 3,600 images)

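These requirements can be encoded in a tiny lookup when choosing a model programmatically. A sketch for illustration only; the mapping below is an assumption derived from the lists above, not an SDK API:

```python
# Hypothetical helper: minimum listed model per feature (per the tables above)
FEATURE_MIN_MODEL = {
    'segmentation': 'gemini-2.5-flash',    # requires 2.5+
    'detection': 'gemini-2.0-flash',       # requires 2.0+
    'multi_image': 'gemini-2.0-flash-lite' # supported by all listed models
}

def model_for(feature: str) -> str:
    """Return a model name that supports the requested feature."""
    return FEATURE_MIN_MODEL.get(feature, 'gemini-2.5-flash')
```
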
## Basic Image Analysis

### Image Captioning

```python
from google import genai
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Local file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this image in detail',
        genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
    ]
)
print(response.text)
```

### Image Classification

```python
# 'img_part' is an image Part created as above, e.g. genai.types.Part.from_bytes(...)
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Classify this image. Provide category and confidence level.',
        img_part
    ]
)
```

### Visual Question Answering

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'How many people are in this image and what are they doing?',
        img_part
    ]
)
```

## Advanced Features

### Object Detection (2.0+)

```python
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
        'Detect all objects in this image and provide bounding boxes',
        img_part
    ]
)

# Returns bounding box coordinates: [ymin, xmin, ymax, xmax],
# normalized to the [0, 1000] range
```

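Because coordinates are normalized to [0, 1000], they must be scaled back to pixel space before drawing or cropping. A minimal sketch, assuming the model was asked to return JSON with `box_2d`/`label` entries (a common convention in Gemini detection examples; verify against the actual output):

```python
import json

def to_pixel_box(box, width, height):
    """Convert a [ymin, xmin, ymax, xmax] box normalized to 0-1000 into pixels."""
    ymin, xmin, ymax, xmax = box
    return (int(xmin * width / 1000), int(ymin * height / 1000),
            int(xmax * width / 1000), int(ymax * height / 1000))

# Assumes JSON output like: [{"box_2d": [y0, x0, y1, x1], "label": "car"}]
for det in json.loads(response.text):
    print(det['label'], to_pixel_box(det['box_2d'], width=1920, height=1080))
```
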
### Segmentation (2.5+)

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Create a segmentation mask for all people in this image',
        img_part
    ]
)

# Returns pixel-level masks for requested objects
```

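In Gemini's segmentation examples, the model returns JSON entries containing `box_2d`, `label`, and a base64-encoded PNG `mask`. A decoding sketch under that assumption (verify the response shape against actual output):

```python
import base64
import io
import json
import PIL.Image

# Assumed response shape:
# [{"box_2d": [...], "label": "person", "mask": "data:image/png;base64,..."}]
for entry in json.loads(response.text):
    b64 = entry['mask'].removeprefix('data:image/png;base64,')
    mask = PIL.Image.open(io.BytesIO(base64.b64decode(b64)))  # grayscale mask image
    print(entry['label'], mask.size)
```
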
### Multi-Image Comparison

```python
import PIL.Image

img1 = PIL.Image.open('photo1.jpg')
img2 = PIL.Image.open('photo2.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare these two images. What are the differences?',
        img1,
        img2
    ]
)
```

### OCR and Text Extraction

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all visible text from this image',
        img_part
    ]
)
```

## Input Methods

### Inline Data (<20MB)

```python
from google.genai import types

# From file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
    ]
)
```

### PIL Image

```python
import PIL.Image

img = PIL.Image.open('photo.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What is in this image?', img]
)
```

### File API (>20MB or Reuse)

```python
# Upload once
myfile = client.files.upload(file='large-image.jpg')

# Use multiple times
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Describe this image', myfile]
)

response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What colors dominate this image?', myfile]
)
```

### URL (Public Images)

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_uri(
            file_uri='https://example.com/image.jpg',
            mime_type='image/jpeg'
        )
    ]
)
```

## Token Calculation

Images consume tokens based on size:

**Small images** (≤384px in both dimensions): 258 tokens

**Large images**: Tiled into 768×768 chunks, 258 tokens each

**Formula**:
```
crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258
```

**Examples**:

- 256×256: 258 tokens (small)
- 512×512: 258 tokens (small)
- 960×540: 6 tiles = 1,548 tokens
- 1920×1080: 6 tiles = 1,548 tokens
- 3840×2160 (4K): 24 tiles = 6,192 tokens

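A quick estimator following the formula above; a sketch, so treat results as approximate (edge cases such as very elongated images or crop-size clamping are not covered here):

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate token count for an image per the tiling formula above."""
    if width <= 384 and height <= 384:
        return 258
    crop_unit = math.floor(min(width, height) / 1.5)
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * 258

print(estimate_image_tokens(960, 540))  # 6 tiles -> 1548
```
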
## Structured Output

### JSON Schema Output

```python
from pydantic import BaseModel
from typing import List

class ObjectDetection(BaseModel):
    object_name: str
    confidence: float
    bounding_box: List[int]  # [ymin, xmin, ymax, xmax]

class ImageAnalysis(BaseModel):
    description: str
    objects: List[ObjectDetection]
    scene_type: str

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze this image', img_part],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=ImageAnalysis
    )
)

result = ImageAnalysis.model_validate_json(response.text)
```

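When `response_schema` is set, the SDK also populates `response.parsed` with an instance of the schema class, which can replace the manual `model_validate_json` call; worth verifying against the SDK version in use.
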
## Multi-Image Analysis

### Batch Processing

```python
images = [
    PIL.Image.open(f'image{i}.jpg')
    for i in range(10)
]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze these images and find common themes'] + images
)
```

### Image Comparison

```python
before = PIL.Image.open('before.jpg')
after = PIL.Image.open('after.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare before and after. List all visible changes.',
        before,
        after
    ]
)
```

### Visual Search

```python
reference = PIL.Image.open('target.jpg')
candidates = [PIL.Image.open(f'option{i}.jpg') for i in range(5)]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Find which candidate images contain objects similar to the reference',
        reference
    ] + candidates
)
```

## Best Practices

### Image Quality

1. **Resolution**: Use clear, non-blurry images
2. **Rotation**: Verify correct orientation
3. **Lighting**: Ensure good contrast and lighting
4. **Size optimization**: Balance quality against token cost (see the resizing sketch after this list)
5. **Format**: JPEG for photos, PNG for graphics

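A minimal downscaling sketch with Pillow; the 768px long-edge target is an illustrative choice to keep tile counts low, not an API requirement:

```python
import PIL.Image

def downscale(src: str, dst: str, max_edge: int = 768) -> None:
    """Resize an image to fit within max_edge x max_edge, preserving aspect ratio."""
    img = PIL.Image.open(src)
    img.thumbnail((max_edge, max_edge))  # no-op if already smaller
    img.save(dst, quality=90)

downscale('photo-4k.jpg', 'photo-small.jpg')
```
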
### Prompt Engineering

**Specific instructions**:
- "Identify all vehicles with their colors and positions"
- "Count people wearing blue shirts"
- "Extract text from the sign in the top-left corner"

**Output format**:
- "Return results as JSON with fields: category, count, description"
- "Format as markdown table"
- "List findings as numbered items"

**Few-shot examples**:

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Example: For an image of a cat on a sofa, respond: "Object: cat, Location: sofa"',
        'Now analyze this image:',
        img_part
    ]
)
```

### File Management

1. Use the File API for images >20MB
2. Use the File API for repeated queries (saves tokens)
3. Files auto-delete after 48 hours
4. Clean up manually:

```python
client.files.delete(name=myfile.name)
```

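For bulk cleanup, uploaded files can be enumerated and deleted together; a sketch using the SDK's `files.list()` (double-check the method against your SDK version):

```python
# Delete every file currently stored under this API key
for f in client.files.list():
    client.files.delete(name=f.name)
```
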
### Cost Optimization

**Token-efficient strategies**:
- Resize large images before upload
- Use the File API for repeated queries
- Batch multiple images when related
- Use the appropriate model (Flash vs Pro)

**Token costs** (Gemini 2.5 Flash at $1/1M tokens):
- Small image (258 tokens): $0.000258
- HD image (1,548 tokens): $0.001548
- 4K image (6,192 tokens): $0.006192

## Common Use Cases

### 1. Product Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this product image:
1. Identify the product
2. List visible features
3. Assess condition
4. Estimate value range
''',
        img_part
    ]
)
```

### 2. Screenshot Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all text and UI elements from this screenshot',
        img_part
    ]
)
```

### 3. Medical Imaging (Informational Only)

```python
response = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=[
        'Describe visible features in this medical image. Note: This is for informational purposes only.',
        img_part
    ]
)
```

### 4. Chart/Graph Reading

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract data from this chart and format as JSON',
        img_part
    ]
)
```

### 5. Scene Understanding

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this scene:
1. Location type
2. Time of day
3. Weather conditions
4. Activities happening
5. Mood/atmosphere
''',
        img_part
    ]
)
```

## Error Handling

```python
import time

def analyze_image_with_retry(image_path, prompt, max_retries=3):
    """Analyze an image with exponential-backoff retry."""
    for attempt in range(max_retries):
        try:
            with open(image_path, 'rb') as f:
                img_bytes = f.read()

            response = client.models.generate_content(
                model='gemini-2.5-flash',
                contents=[
                    prompt,
                    genai.types.Part.from_bytes(
                        data=img_bytes,
                        mime_type='image/jpeg'
                    )
                ]
            )
            return response.text
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt
            print(f"Retry {attempt + 1} after {wait_time}s: {e}")
            time.sleep(wait_time)
```

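Usage is a single call (assumes `client` and `genai` are in scope as in the earlier examples):

```python
print(analyze_image_with_retry('photo.jpg', 'Describe this image in one sentence'))
```
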
## Limitations

- Maximum 3,600 images per request
- OCR accuracy varies with text quality
- Object detection requires 2.0+ models
- Segmentation requires 2.5+ models
- No video frame extraction (use the video API)
- Regional restrictions on images of children (EEA, CH, UK)