Initial commit

2025-11-30 08:48:52 +08:00
commit 6ec3196ecc
434 changed files with 125248 additions and 0 deletions
--- a/skills/ai-multimodal/references/audio-processing.md
+++ b/skills/ai-multimodal/references/audio-processing.md
@@ -0,0 +1,373 @@
+# Audio Processing Reference
+
+Comprehensive guide for audio analysis and speech generation using the Gemini API.
+
+## Audio Understanding
+
+### Supported Formats
+
+| Format | MIME Type | Best Use |
+|--------|-----------|----------|
+| WAV | `audio/wav` | Uncompressed, highest quality |
+| MP3 | `audio/mp3` | Compressed, widely compatible |
+| AAC | `audio/aac` | Compressed, good quality |
+| FLAC | `audio/flac` | Lossless compression |
+| OGG Vorbis | `audio/ogg` | Open format |
+| AIFF | `audio/aiff` | Apple format |
+
+### Specifications
+
+- **Maximum length**: 9.5 hours per request
+- **Multiple files**: Unlimited count, combined max 9.5 hours
+- **Token rate**: 32 tokens/second (1 minute = 1,920 tokens)
+- **Processing**: Auto-downsampled to 16 Kbps mono
+- **File size limits**:
+  - Inline: 20 MB max total request
+  - File API: 2 GB per file, 20 GB project quota
+  - Retention: 48 hours auto-delete
+
+## Transcription
+
+### Basic Transcription
+
+```python
+from google import genai
+import os
+
+client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
+
+# Upload audio
+myfile = client.files.upload(file='meeting.mp3')
+
+# Transcribe
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Generate a transcript of the speech.', myfile]
+)
+print(response.text)
+```
+
+### With Timestamps
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Generate transcript with timestamps in MM:SS format.', myfile]
+)
+```
+
+### Multi-Speaker Identification
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Transcribe with speaker labels. Format: [Speaker 1], [Speaker 2], etc.', myfile]
+)
+```
+
+### Segment-Specific Transcription
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Transcribe only the segment from 02:30 to 05:15.', myfile]
+)
+```
+
+## Audio Analysis
+
+### Summarization
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Summarize key points in 5 bullets with timestamps.', myfile]
+)
+```
+
+### Non-Speech Audio Analysis
+
+```python
+# Music analysis
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Identify the musical instruments and genre.', myfile]
+)
+
+# Environmental sounds
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Identify all sounds: voices, music, ambient noise.', myfile]
+)
+
+# Birdsong identification
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Identify bird species based on their calls.', myfile]
+)
+```
+
+### Timestamp-Based Analysis
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['What is discussed from 10:30 to 15:45? Provide key points.', myfile]
+)
+```
+
+## Input Methods
+
+### File Upload (>20MB or Reuse)
+
+```python
+# Upload once, use multiple times
+myfile = client.files.upload(file='large-audio.mp3')
+
+# First query
+response1 = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Transcribe this', myfile]
+)
+
+# Second query (reuses same file)
+response2 = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Summarize this', myfile]
+)
+```
+
+### Inline Data (<20MB)
+
+```python
+from google.genai import types
+
+with open('small-audio.mp3', 'rb') as f:
+    audio_bytes = f.read()
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Describe this audio',
+        types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
+    ]
+)
+```
+
+## Speech Generation (TTS)
+
+### Available Models
+
+| Model | Quality | Speed | Cost/1M output tokens |
+|-------|---------|-------|-----------------------|
+| `gemini-2.5-flash-preview-tts` | High | Fast | $10 |
+| `gemini-2.5-pro-preview-tts` | Premium | Slower | $20 |
+
+### Basic TTS
+
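+```python
+import wave
+from google.genai import types
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash-preview-tts',
+    contents='Generate audio: Welcome to today\'s episode.',
+    config=types.GenerateContentConfig(response_modalities=['AUDIO'])
+)
+
+# The audio comes back as raw PCM (24 kHz, 16-bit, mono); wrap it in a WAV container
+pcm = response.candidates[0].content.parts[0].inline_data.data
+with wave.open('output.wav', 'wb') as wf:
+    wf.setnchannels(1)
+    wf.setsampwidth(2)
+    wf.setframerate(24000)
+    wf.writeframes(pcm)
+```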
+
+### Controllable Voice Style
+
+```python
+# Reuse one audio-output config for every request
+tts_config = types.GenerateContentConfig(response_modalities=['AUDIO'])
+
+# Professional tone
+response = client.models.generate_content(
+    model='gemini-2.5-flash-preview-tts',
+    contents='Generate audio in a professional, clear tone: Welcome to our quarterly earnings call.',
+    config=tts_config
+)
+
+# Casual and friendly
+response = client.models.generate_content(
+    model='gemini-2.5-flash-preview-tts',
+    contents='Generate audio in a friendly, conversational tone: Hey there! Let\'s dive into today\'s topic.',
+    config=tts_config
+)
+
+# Narrative style
+response = client.models.generate_content(
+    model='gemini-2.5-flash-preview-tts',
+    contents='Generate audio in a narrative, storytelling tone: Once upon a time, in a land far away...',
+    config=tts_config
+)
+```
+
+### Voice Control Parameters
+
+- **Style**: Professional, casual, narrative, conversational
+- **Pace**: Slow, normal, fast
+- **Tone**: Friendly, serious, enthusiastic
+- **Accent**: Natural language control (e.g., "British accent", "Southern drawl")
+
+## Best Practices
+
+### File Management
+
+1. Use File API for files >20MB
+2. Use File API for repeated queries (avoids re-uploading the same audio)
+3. Files auto-delete after 48 hours
+4. Clean up manually when done:
+   ```python
+   client.files.delete(name=myfile.name)
+   ```
+
+### Prompt Engineering
+
+**Effective prompts**:
+- "Transcribe from 02:30 to 03:29 in MM:SS format"
+- "Identify speakers and extract dialogue with timestamps"
+- "Summarize key points with relevant timestamps"
+- "Transcribe and analyze sentiment for each speaker"
+
+**Context improves accuracy**:
+- "This is a medical interview - use appropriate terminology"
+- "Transcribe this legal deposition with precise terminology"
+- "This is a technical podcast about machine learning"
+
+**Combined tasks**:
+- "Transcribe and summarize in bullet points"
+- "Extract key quotes with timestamps and speaker labels"
+- "Transcribe and identify action items with timestamps"
+
+### Cost Optimization
+
+**Token calculation**:
+- 1 minute audio = 1,920 tokens
+- 1 hour audio = 115,200 tokens
+- 9.5 hours = 1,094,400 tokens
+
+**Model selection**:
+- Use `gemini-2.5-flash` ($1/1M tokens) for most tasks
+- Upgrade to `gemini-2.5-pro` ($3/1M tokens) for complex analysis
+- For high-volume: `gemini-1.5-flash` ($0.70/1M tokens)
+
+**Reduce costs**:
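+- Trim audio to the relevant segment before uploading (a timestamp in the prompt still bills the whole file)
+- Use lower-bitrate audio to shrink uploads (token count depends on duration, not quality)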
+- Batch multiple short files in one request
+- Cache context for repeated queries
+
+### Error Handling
+
+```python
+import time
+
+def transcribe_with_retry(file_path, max_retries=3):
+    """Transcribe audio, retrying with exponential backoff on failure."""
+    myfile = None
+    for attempt in range(max_retries):
+        try:
+            if myfile is None:  # upload once; later retries reuse the file
+                myfile = client.files.upload(file=file_path)
+            response = client.models.generate_content(
+                model='gemini-2.5-flash',
+                contents=['Transcribe with timestamps', myfile]
+            )
+            return response.text
+        except Exception as e:
+            if attempt == max_retries - 1:
+                raise
+            wait_time = 2 ** attempt
+            print(f"Retry {attempt + 1} after {wait_time}s: {e}")
+            time.sleep(wait_time)
+```
+
+## Common Use Cases
+
+### 1. Meeting Transcription
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Transcribe this meeting with:
+        1. Speaker labels
+        2. Timestamps for topic changes
+        3. Action items highlighted
+        ''',
+        myfile
+    ]
+)
+```
+
+### 2. Podcast Summary
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Create podcast summary with:
+        1. Main topics with timestamps
+        2. Key quotes from each speaker
+        3. Recommended episode highlights
+        ''',
+        myfile
+    ]
+)
+```
+
+### 3. Interview Analysis
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Analyze interview:
+        1. Questions asked with timestamps
+        2. Key responses from interviewee
+        3. Overall sentiment and tone
+        ''',
+        myfile
+    ]
+)
+```
+
+### 4. Content Verification
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Verify audio content:
+        1. Check for specific keywords or phrases
+        2. Identify any compliance issues
+        3. Note any concerning statements with timestamps
+        ''',
+        myfile
+    ]
+)
+```
+
+### 5. Multilingual Transcription
+
+```python
+# Gemini auto-detects language
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Transcribe this audio and translate to English if needed.', myfile]
+)
+```
+
+## Token Costs
+
+**Audio Input** (32 tokens/second):
+- 1 minute = 1,920 tokens
+- 10 minutes = 19,200 tokens
+- 1 hour = 115,200 tokens
+- 9.5 hours = 1,094,400 tokens
+
+**Example costs** (Gemini 2.5 Flash at $1/1M):
+- 1 hour audio: 115,200 tokens = $0.12
+- Full day podcast (8 hours): 921,600 tokens = $0.92
+
+## Limitations
+
+- Maximum 9.5 hours per request
+- Auto-downsampled to 16 Kbps mono (quality loss)
+- Files expire after 48 hours
+- No real-time streaming through `generate_content` (real-time audio requires the separate Live API)
+- Non-speech audio less accurate than speech
--- a/skills/ai-multimodal/references/image-generation.md
+++ b/skills/ai-multimodal/references/image-generation.md
@@ -0,0 +1,558 @@
+# Image Generation Reference
+
+Comprehensive guide for image creation, editing, and composition using the Gemini API.
+
+## Core Capabilities
+
+- **Text-to-Image**: Generate images from text prompts
+- **Image Editing**: Modify existing images with text instructions
+- **Multi-Image Composition**: Combine up to 3 images
+- **Iterative Refinement**: Refine images conversationally
+- **Aspect Ratios**: Multiple formats (1:1, 16:9, 9:16, 4:3, 3:4)
+- **Style Control**: Control artistic style and quality
+- **Text in Images**: Limited text rendering (max 25 chars)
+
+## Model
+
+**gemini-2.5-flash-image** - Specialized for image generation
+- Input tokens: 65,536
+- Output tokens: 32,768
+- Knowledge cutoff: June 2025
+- Supports: Text and image inputs, image outputs
+
+## Quick Start
+
+### Basic Generation
+
+```python
+from google import genai
+from google.genai import types
+import os
+
+client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='A serene mountain landscape at sunset with snow-capped peaks',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='16:9')
+    )
+)
+
+# Save image
+for i, part in enumerate(response.candidates[0].content.parts):
+    if part.inline_data:
+        with open(f'output-{i}.png', 'wb') as f:
+            f.write(part.inline_data.data)
+```
+
+## Aspect Ratios
+
+| Ratio | Resolution | Use Case | Token Cost |
+|-------|-----------|----------|------------|
+| 1:1 | 1024×1024 | Social media, avatars | 1290 |
+| 16:9 | 1344×768 | Landscapes, banners | 1290 |
+| 9:16 | 768×1344 | Mobile, portraits | 1290 |
+| 4:3 | 1184×864 | Traditional media | 1290 |
+| 3:4 | 864×1184 | Vertical posters | 1290 |
+
+All ratios cost the same: 1,290 tokens per image.
+
+## Response Modalities
+
+### Image Only
+
+```python
+config = types.GenerateContentConfig(
+    response_modalities=['Image'],
+    image_config=types.ImageConfig(aspect_ratio='1:1')
+)
+```
+
+### Text Only (No Image)
+
+```python
+config = types.GenerateContentConfig(
+    response_modalities=['Text']
+)
+# Returns text description instead of generating image
+```
+
+### Both Image and Text
+
+```python
+config = types.GenerateContentConfig(
+    response_modalities=['Image', 'Text'],
+    image_config=types.ImageConfig(aspect_ratio='16:9')
+)
+# Returns both generated image and description
+```
+
+## Image Editing
+
+### Modify Existing Image
+
+```python
+import PIL.Image
+
+# Load original
+img = PIL.Image.open('original.png')
+
+# Edit with instructions
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Add a red balloon floating in the sky',
+        img
+    ],
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='16:9')
+    )
+)
+```
+
+### Style Transfer
+
+```python
+img = PIL.Image.open('photo.jpg')
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Transform this into an oil painting style',
+        img
+    ]
+)
+```
+
+### Object Addition/Removal
+
+```python
+# Add object
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Add a vintage car parked on the street',
+        img
+    ]
+)
+
+# Remove object
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Remove the person on the left side',
+        img
+    ]
+)
+```
+
+## Multi-Image Composition
+
+### Combine Multiple Images
+
+```python
+img1 = PIL.Image.open('background.png')
+img2 = PIL.Image.open('foreground.png')
+img3 = PIL.Image.open('overlay.png')
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Combine these images into a cohesive scene',
+        img1,
+        img2,
+        img3
+    ],
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='16:9')
+    )
+)
+```
+
+**Note**: Recommended maximum 3 input images for best results.
+
+## Prompt Engineering
+
+### Effective Prompt Structure
+
+**Three key elements**:
+1. **Subject**: What to generate
+2. **Context**: Environmental setting
+3. **Style**: Artistic treatment
+
+**Example**: "A robot [subject] in a futuristic city [context], cyberpunk style with neon lighting [style]"
+
+### Quality Modifiers
+
+**Technical terms**:
+- "4K", "8K", "high resolution"
+- "HDR", "high dynamic range"
+- "professional photography"
+- "studio lighting"
+- "ultra detailed"
+
+**Camera settings**:
+- "35mm lens", "50mm lens"
+- "shallow depth of field"
+- "wide angle shot"
+- "macro photography"
+- "golden hour lighting"
+
+### Style Keywords
+
+**Art styles**:
+- "oil painting", "watercolor", "sketch"
+- "digital art", "concept art"
+- "photorealistic", "hyperrealistic"
+- "minimalist", "abstract"
+- "cyberpunk", "steampunk", "fantasy"
+
+**Mood and atmosphere**:
+- "dramatic lighting", "soft lighting"
+- "moody", "bright and cheerful"
+- "mysterious", "whimsical"
+- "dark and gritty", "pastel colors"
+
+### Subject Description
+
+**Be specific**:
+- ❌ "A cat"
+- ✅ "A fluffy orange tabby cat with green eyes"
+
+**Add context**:
+- ❌ "A building"
+- ✅ "A modern glass skyscraper reflecting sunset clouds"
+
+**Include details**:
+- ❌ "A person"
+- ✅ "A young woman in a red dress holding an umbrella"
+
+### Composition and Framing
+
+**Camera angles**:
+- "bird's eye view", "aerial shot"
+- "low angle", "high angle"
+- "close-up", "wide shot"
+- "centered composition"
+- "rule of thirds"
+
+**Perspective**:
+- "first person view"
+- "third person perspective"
+- "isometric view"
+- "forced perspective"
+
+### Text in Images
+
+**Limitations**:
+- Maximum 25 characters total
+- Up to 3 distinct text phrases
+- Works best with simple text
+
+**Best practices**:
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='A vintage poster with bold text "EXPLORE" at the top, mountain landscape, retro 1950s style'
+)
+```
+
+**Font control**:
+- "bold sans-serif title"
+- "handwritten script"
+- "vintage letterpress"
+- "modern minimalist font"
+
+## Advanced Techniques
+
+### Iterative Refinement
+
+```python
+# Initial generation
+response1 = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='A futuristic city skyline'
+)
+
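+# Save first version (the image is not guaranteed to be the first part)
+image_part = next(p for p in response1.candidates[0].content.parts if p.inline_data)
+with open('v1.png', 'wb') as f:
+    f.write(image_part.inline_data.data)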
+
+# Refine
+img = PIL.Image.open('v1.png')
+response2 = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Add flying vehicles and neon signs',
+        img
+    ]
+)
+```
+
+### Negative Prompts (Indirect)
+
+```python
+# Instead of "no blur", be specific about what you want
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='A crystal clear, sharp photograph of a diamond ring with perfect focus and high detail'
+)
+```
+
+### Consistent Style Across Images
+
+```python
+base_prompt = "Digital art, vibrant colors, cel-shaded style, clean lines"
+
+prompts = [
+    f"{base_prompt}, a warrior character",
+    f"{base_prompt}, a mage character",
+    f"{base_prompt}, a rogue character"
+]
+
+for i, prompt in enumerate(prompts):
+    response = client.models.generate_content(
+        model='gemini-2.5-flash-image',
+        contents=prompt
+    )
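+    # Save each character
+    for part in response.candidates[0].content.parts:
+        if part.inline_data:
+            with open(f'character-{i}.png', 'wb') as f:
+                f.write(part.inline_data.data)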
+```
+
+## Safety Settings
+
+### Configure Safety Filters
+
+```python
+config = types.GenerateContentConfig(
+    response_modalities=['Image'],
+    safety_settings=[
+        types.SafetySetting(
+            category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
+            threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
+        ),
+        types.SafetySetting(
+            category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
+            threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
+        )
+    ]
+)
+```
+
+### Available Categories
+
+- `HARM_CATEGORY_HATE_SPEECH`
+- `HARM_CATEGORY_DANGEROUS_CONTENT`
+- `HARM_CATEGORY_HARASSMENT`
+- `HARM_CATEGORY_SEXUALLY_EXPLICIT`
+
+### Thresholds
+
+- `BLOCK_NONE`: No blocking
+- `BLOCK_LOW_AND_ABOVE`: Block low probability and above
+- `BLOCK_MEDIUM_AND_ABOVE`: Block medium and above (default)
+- `BLOCK_ONLY_HIGH`: Block only high probability
+
+## Common Use Cases
+
+### 1. Marketing Assets
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Professional product photography:
+    - Sleek smartphone on minimalist white surface
+    - Dramatic side lighting creating subtle shadows
+    - Shallow depth of field, crisp focus
+    - Clean, modern aesthetic
+    - 4K quality
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='4:3')
+    )
+)
+```
+
+### 2. Concept Art
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Fantasy concept art:
+    - Ancient floating islands connected by chains
+    - Waterfalls cascading into clouds below
+    - Magical crystals glowing on the islands
+    - Epic scale, dramatic lighting
+    - Detailed digital painting style
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='16:9')
+    )
+)
+```
+
+### 3. Social Media Graphics
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Instagram post design:
+    - Pastel gradient background (pink to blue)
+    - Motivational quote layout
+    - Modern minimalist style
+    - Clean typography
+    - Mobile-friendly composition
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='1:1')
+    )
+)
+```
+
+### 4. Illustration
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Children's book illustration:
+    - Friendly cartoon dragon reading a book
+    - Bright, cheerful colors
+    - Soft, rounded shapes
+    - Whimsical forest background
+    - Warm, inviting atmosphere
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='4:3')
+    )
+)
+```
+
+### 5. UI/UX Mockups
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Modern mobile app interface:
+    - Clean dashboard design
+    - Card-based layout
+    - Soft shadows and gradients
+    - Contemporary color scheme (blue and white)
+    - Professional fintech aesthetic
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='9:16')
+    )
+)
+```
+
+## Best Practices
+
+### Prompt Quality
+
+1. **Be specific**: More detail = better results
+2. **Order matters**: Most important elements first
+3. **Use examples**: Reference known styles or artists
+4. **Avoid contradictions**: Don't ask for opposing styles
+5. **Test and iterate**: Refine prompts based on results
+
+### File Management
+
+```python
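+import time
+
+# Save with descriptive names; `image_data` holds the bytes from part.inline_data.data
+aspect_ratio = '16:9'  # example value
+timestamp = int(time.time())
+filename = f"generated_{timestamp}_{aspect_ratio.replace(':', 'x')}.png"  # ':' is not allowed in Windows filenames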
+
+with open(filename, 'wb') as f:
+    f.write(image_data)
+```
+
+### Cost Optimization
+
+**Token costs**:
+- 1 image: 1,290 tokens ≈ $0.039 (Flash Image output at $30/1M)
+- 10 images: 12,900 tokens ≈ $0.39
+- 100 images: 129,000 tokens ≈ $3.87
+
+**Strategies**:
+- Generate fewer iterations
+- Use text modality first to validate concept
+- Batch similar requests
+- Cache prompts for consistent style
+
+## Error Handling
+
+### Safety Filter Blocking
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=prompt
+)
+# A blocked prompt is reported on the response, not raised as an exception
+feedback = response.prompt_feedback
+if feedback and feedback.block_reason:
+    print(f"Blocked: {feedback.block_reason}")
+    # Modify prompt and retry
+```
+
+### Token Limit Exceeded
+
+```python
+# Keep prompts concise (character count is only a rough proxy for token count)
+if len(prompt) > 1000:
+    # Truncate or simplify
+    prompt = prompt[:1000]
+```
+
+## Limitations
+
+- Maximum 3 input images for composition
+- Text rendering limited (25 chars max)
+- No video or animation generation
+- Regional restrictions (child images in EEA, CH, UK)
+- Optimal language support: English, Spanish (Mexico), Japanese, Mandarin, Hindi
+- No real-time generation
+- Cannot perfectly replicate specific people or copyrighted characters
+
+## Troubleshooting
+
+### aspect_ratio Parameter Error
+
+**Error**: `Extra inputs are not permitted [type=extra_forbidden, input_value='1:1', input_type=str]`
+
+**Cause**: The `aspect_ratio` parameter must be nested inside an `image_config` object, not passed directly to `GenerateContentConfig`.
+
+**Incorrect Usage**:
+```python
+# ❌ This will fail
+config = types.GenerateContentConfig(
+    response_modalities=['image'],
+    aspect_ratio='16:9'  # Wrong - not a direct parameter
+)
+```
+
+**Correct Usage**:
+```python
+# ✅ Correct implementation
+config = types.GenerateContentConfig(
+    response_modalities=['Image'],  # Note: Capital 'I'
+    image_config=types.ImageConfig(
+        aspect_ratio='16:9'
+    )
+)
+```
+
+### Response Modality Case Sensitivity
+
+The `response_modalities` parameter expects capitalized values:
+- ✅ Correct: `['Image']`, `['Text']`, `['Image', 'Text']`
+- ❌ Wrong: `['image']`, `['text']`
--- a/skills/ai-multimodal/references/video-analysis.md
+++ b/skills/ai-multimodal/references/video-analysis.md
@@ -0,0 +1,502 @@
+# Video Analysis Reference
+
+Comprehensive guide for video understanding, temporal analysis, and YouTube processing using the Gemini API.
+
+## Core Capabilities
+
+- **Video Summarization**: Create concise summaries
+- **Question Answering**: Answer specific questions about content
+- **Transcription**: Audio transcription with visual descriptions
+- **Timestamp References**: Query specific moments (MM:SS format)
+- **Video Clipping**: Process specific segments
+- **Scene Detection**: Identify scene changes and transitions
+- **Multiple Videos**: Compare up to 10 videos (2.5+)
+- **YouTube Support**: Analyze YouTube videos directly
+- **Custom Frame Rate**: Adjust FPS sampling
+
+## Supported Formats
+
+- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
+
+## Model Selection
+
+### Gemini 2.5 Series
+- **gemini-2.5-pro**: Best quality, 1M-2M context
+- **gemini-2.5-flash**: Balanced, 1M-2M context
+- **gemini-2.5-flash-preview-09-2025**: Preview features, 1M context
+
+### Gemini 2.0 Series
+- **gemini-2.0-flash**: Fast processing
+- **gemini-2.0-flash-lite**: Lightweight option
+
+### Context Windows
+- **2M token models**: ~2 hours (default) or ~6 hours (low-res)
+- **1M token models**: ~1 hour (default) or ~3 hours (low-res)
+
+## Basic Video Analysis
+
+### Local Video
+
+```python
+from google import genai
+import os
+
+client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
+
+# Upload video (File API for >20MB)
+myfile = client.files.upload(file='video.mp4')
+
+# Wait for processing
+import time
+while myfile.state.name == 'PROCESSING':
+    time.sleep(1)
+    myfile = client.files.get(name=myfile.name)
+
+if myfile.state.name == 'FAILED':
+    raise ValueError('Video processing failed')
+
+# Analyze
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Summarize this video in 3 key points', myfile]
+)
+print(response.text)
+```
+
+### YouTube Video
+
+```python
+from google.genai import types
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Summarize the main topics discussed',
+        types.Part.from_uri(
+            file_uri='https://www.youtube.com/watch?v=VIDEO_ID',
+            mime_type='video/mp4'
+        )
+    ]
+)
+```
+
+### Inline Video (<20MB)
+
+```python
+with open('short-clip.mp4', 'rb') as f:
+    video_bytes = f.read()
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'What happens in this video?',
+        types.Part.from_bytes(data=video_bytes, mime_type='video/mp4')
+    ]
+)
+```
+
+## Advanced Features
+
+### Video Clipping
+
+```python
+# Analyze specific time range
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Summarize this segment',
+        types.Part(
+            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
+            video_metadata=types.VideoMetadata(start_offset='40s', end_offset='80s')
+        )
+    ]
+)
+```
+
+### Custom Frame Rate
+
+```python
+# Lower FPS for static content (saves tokens)
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Analyze this presentation',
+        types.Part(
+            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
+            video_metadata=types.VideoMetadata(fps=0.5)  # Sample every 2 seconds
+        )
+    ]
+)
+
+# Higher FPS for fast-moving content
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Analyze rapid movements in this sports video',
+        types.Part(
+            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
+            video_metadata=types.VideoMetadata(fps=5)  # Sample 5 times per second
+        )
+    ]
+)
+```
+
+### Multiple Videos (2.5+)
+
+```python
+video1 = client.files.upload(file='demo1.mp4')
+video2 = client.files.upload(file='demo2.mp4')
+
+# Wait for processing
+for video in [video1, video2]:
+    while video.state.name == 'PROCESSING':
+        time.sleep(1)
+        video = client.files.get(name=video.name)
+
+response = client.models.generate_content(
+    model='gemini-2.5-pro',
+    contents=[
+        'Compare these two product demos. Which explains features better?',
+        video1,
+        video2
+    ]
+)
+```
+
+## Temporal Understanding
+
+### Timestamp-Based Questions
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'What happens at 01:15 and how does it relate to 02:30?',
+        myfile
+    ]
+)
+```
+
+### Timeline Creation
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Create a timeline with timestamps:
+        - Key events
+        - Scene changes
+        - Important moments
+        Format: MM:SS - Description
+        ''',
+        myfile
+    ]
+)
+```
+
+### Scene Detection
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Identify all scene changes with timestamps and describe each scene',
+        myfile
+    ]
+)
+```
+
+## Transcription
+
+### Basic Transcription
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Transcribe the audio from this video',
+        myfile
+    ]
+)
+```
+
+### With Visual Descriptions
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Transcribe with visual context:
+        - Audio transcription
+        - Visual descriptions of important moments
+        - Timestamps for salient events
+        ''',
+        myfile
+    ]
+)
+```
+
+### Speaker Identification
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Transcribe with speaker labels and timestamps',
+        myfile
+    ]
+)
+```
+
+## Common Use Cases
+
+### 1. Video Summarization
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Summarize this video:
+        1. Main topic and purpose
+        2. Key points with timestamps
+        3. Conclusion or call-to-action
+        ''',
+        myfile
+    ]
+)
+```
+
+### 2. Educational Content
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Create educational materials:
+        1. List key concepts taught
+        2. Create 5 quiz questions with answers
+        3. Provide timestamp for each concept
+        ''',
+        myfile
+    ]
+)
+```
+
+### 3. Action Detection
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'List all actions performed in this tutorial with timestamps',
+        myfile
+    ]
+)
+```
+
+### 4. Content Moderation
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Review video content:
+        1. Identify any problematic content
+        2. Note timestamps of concerns
+        3. Provide content rating recommendation
+        ''',
+        myfile
+    ]
+)
+```
+
+### 5. Interview Analysis
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Analyze interview:
+        1. Questions asked (timestamps)
+        2. Key responses
+        3. Candidate body language and demeanor
+        4. Overall assessment
+        ''',
+        myfile
+    ]
+)
+```
+
+### 6. Sports Analysis
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Analyze sports video:
+        1. Key plays with timestamps
+        2. Player movements and positioning
+        3. Game strategy observations
+        ''',
+        types.Part(
+            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
+            video_metadata=types.VideoMetadata(fps=5)  # Higher FPS for fast action
+        )
+    ]
+)
+```
+
+## YouTube Specific Features
+
+### Public Video Requirements
+
+- Video must be public (not private or unlisted)
+- No age-restricted content
+- Valid video ID required
+
+### Usage Example
+
+```python
+# YouTube URL
+youtube_uri = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Create chapter markers with timestamps',
+        types.Part.from_uri(file_uri=youtube_uri, mime_type='video/mp4')
+    ]
+)
+```
+
+### Rate Limits
+
+- **Free tier**: 8 hours of YouTube video per day
+- **Paid tier**: No length-based limits
+- Public videos only
+
+## Token Calculation
+
+Video tokens depend on resolution and FPS:
+
+**Default resolution** (~300 tokens/second):
+- 1 minute = 18,000 tokens
+- 10 minutes = 180,000 tokens
+- 1 hour = 1,080,000 tokens
+
+**Low resolution** (~100 tokens/second):
+- 1 minute = 6,000 tokens
+- 10 minutes = 60,000 tokens
+- 1 hour = 360,000 tokens
+
+**Context windows** (approximate; leave headroom for the prompt and the response):
+- 2M tokens ≈ 2 hours (default) or 6 hours (low-res)
+- 1M tokens ≈ 1 hour (default) or 3 hours (low-res)
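+
+A minimal sketch of the arithmetic above, useful for checking whether a video is likely to fit a context window before uploading it. The function names and the `low_res` flag are illustrative, and the per-second rates are the approximate figures from this section, not values read from the API:
+
+```python
+# Approximate rates from this section; real usage varies with FPS and content.
+TOKENS_PER_SECOND = {'default': 300, 'low': 100}
+
+def estimate_video_tokens(duration_seconds, low_res=False):
+    """Estimate the input tokens consumed by a video of the given length."""
+    rate = TOKENS_PER_SECOND['low' if low_res else 'default']
+    return int(duration_seconds * rate)
+
+def fits_context(duration_seconds, context_tokens, low_res=False):
+    """Return True if the estimated video tokens fit in the context window."""
+    return estimate_video_tokens(duration_seconds, low_res) <= context_tokens
+
+print(estimate_video_tokens(60))                  # 18000
+print(estimate_video_tokens(3600, low_res=True))  # 360000
+```
+
+Treat the result as a planning estimate only; the prompt and the model's output also count against the window.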
+
+## Best Practices
+
+### File Management
+
+1. Use File API for videos >20MB (most videos)
+2. Wait for ACTIVE state before analysis
+3. Files auto-delete after 48 hours
+4. Clean up manually:
+   ```python
+   client.files.delete(name=myfile.name)
+   ```
+
+### Optimization Strategies
+
+**Reduce token usage**:
+- Process specific segments using start/end offsets
+- Use lower FPS for static content
+- Use low-resolution mode for long videos
+- Split very long videos into chunks
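+
+One way to plan the segment and chunking strategies above is to precompute offset pairs for each chunk. This is a sketch, not part of the SDK: `chunk_offsets` is a hypothetical helper, and the `'600s'` string form is an assumption about what the offset fields accept, so verify it against the SDK version in use:
+
+```python
+def chunk_offsets(duration_seconds, chunk_seconds=600):
+    """Return (start, end) offset strings that cover the whole video."""
+    if chunk_seconds <= 0:
+        raise ValueError('chunk_seconds must be positive')
+    offsets = []
+    start = 0
+    while start < duration_seconds:
+        # The final chunk is shortened so it never runs past the end
+        end = min(start + chunk_seconds, duration_seconds)
+        offsets.append((f'{start}s', f'{end}s'))
+        start = end
+    return offsets
+
+print(chunk_offsets(1500))
+# [('0s', '600s'), ('600s', '1200s'), ('1200s', '1500s')]
+```
+
+Each pair can then drive one request, so a long video is analyzed in pieces rather than in a single oversized call.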
+
+**Improve accuracy**:
+- Provide context in prompts
+- Use higher FPS for fast-moving content
+- Use Pro model for complex analysis
+- Be specific about what to extract
+
+### Prompt Engineering
+
+**Effective prompts**:
+- "Summarize key points with timestamps in MM:SS format"
+- "Identify all scene changes and describe each scene"
+- "Extract action items mentioned with timestamps"
+- "Compare these two videos on: X, Y, Z criteria"
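+
+Prompts that ask for `MM:SS` timestamps return them as text, so downstream code usually has to convert them to seconds (for example, to build clip offsets). A small sketch; `timestamp_to_seconds` is a hypothetical helper, and accepting `HH:MM:SS` as well is an added convenience rather than something the prompts above request:
+
+```python
+def timestamp_to_seconds(timestamp):
+    """Convert 'MM:SS' (or 'HH:MM:SS') into a whole number of seconds."""
+    parts = [int(p) for p in timestamp.strip().split(':')]
+    if len(parts) not in (2, 3) or any(p < 0 for p in parts):
+        raise ValueError(f'Unrecognized timestamp: {timestamp!r}')
+    seconds = 0
+    for part in parts:
+        # Each field is worth 60 times the field to its right
+        seconds = seconds * 60 + part
+    return seconds
+
+print(timestamp_to_seconds('01:30'))    # 90
+print(timestamp_to_seconds('1:02:03'))  # 3723
+```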
+
+**Structured output**:
+```python
+from pydantic import BaseModel
+from typing import List
+
+class VideoEvent(BaseModel):
+    timestamp: str  # MM:SS format
+    description: str
+    category: str
+
+class VideoAnalysis(BaseModel):
+    summary: str
+    events: List[VideoEvent]
+    duration: str
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Analyze this video', myfile],
+    config=genai.types.GenerateContentConfig(
+        response_mime_type='application/json',
+        response_schema=VideoAnalysis
+    )
+)
+```
+
+### Error Handling
+
+```python
+import time
+
+def upload_and_process_video(file_path, max_wait=300):
+    """Upload video and wait for processing"""
+    myfile = client.files.upload(file=file_path)
+
+    elapsed = 0
+    while myfile.state.name == 'PROCESSING' and elapsed < max_wait:
+        time.sleep(5)
+        myfile = client.files.get(name=myfile.name)
+        elapsed += 5
+
+    if myfile.state.name == 'FAILED':
+        raise ValueError(f'Video processing failed for {file_path}')
+
+    if myfile.state.name == 'PROCESSING':
+        raise TimeoutError(f'Processing timeout after {max_wait}s')
+
+    return myfile
+```
+
+## Cost Optimization
+
+**Token costs** (assuming an illustrative input price of $1 per 1M tokens; check current Gemini 2.5 Flash pricing):
+- 1 minute video (default): 18,000 tokens = $0.018
+- 10 minute video: 180,000 tokens = $0.18
+- 1 hour video: 1,080,000 tokens = $1.08
+
+**Strategies**:
+- Use video clipping for specific segments
+- Lower FPS for static content
+- Use low-resolution mode for long videos
+- Batch related queries on same video
+- Use context caching for repeated queries
+
+## Limitations
+
+- Maximum video length of about 2 hours (default resolution) or 6 hours (low resolution) with a 2M-token context window
+- YouTube videos must be public
+- No live streaming analysis
+- Files expire after 48 hours
+- Processing time varies by video length
+- No real-time processing
+- Limited to 10 videos per request (2.5+)
--- a/skills/ai-multimodal/references/vision-understanding.md
+++ b/skills/ai-multimodal/references/vision-understanding.md
@@ -0,0 +1,483 @@
+# Vision Understanding Reference
+
+Comprehensive guide for image analysis, object detection, and visual understanding using Gemini API.
+
+## Core Capabilities
+
+- **Captioning**: Generate descriptive text for images
+- **Classification**: Categorize and identify content
+- **Visual Q&A**: Answer questions about images
+- **Object Detection**: Locate objects with bounding boxes (2.0+)
+- **Segmentation**: Create pixel-level masks (2.5+)
+- **Multi-image**: Compare up to 3,600 images
+- **OCR**: Extract text from images
+- **Document Understanding**: Process PDFs with vision
+
+## Supported Formats
+
+- **Images**: PNG, JPEG, WEBP, HEIC, HEIF
+- **Documents**: PDF (up to 1,000 pages)
+- **Size Limits**:
+  - Inline: 20MB max total request
+  - File API: 2GB per file
+  - Max images: 3,600 per request
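+
+The examples in this reference hard-code `mime_type='image/jpeg'`. As a sketch, the MIME type can instead be derived from the file extension for the formats listed above. The helper name and the lookup table are illustrative, not part of the SDK:
+
+```python
+from pathlib import Path
+
+# Extensions for the image formats listed in this section
+IMAGE_MIME_TYPES = {
+    '.png': 'image/png',
+    '.jpg': 'image/jpeg',
+    '.jpeg': 'image/jpeg',
+    '.webp': 'image/webp',
+    '.heic': 'image/heic',
+    '.heif': 'image/heif',
+}
+
+def image_mime_type(path):
+    """Return the MIME type for a supported image file, based on its extension."""
+    suffix = Path(path).suffix.lower()
+    if suffix not in IMAGE_MIME_TYPES:
+        raise ValueError(f'Unsupported image format: {path}')
+    return IMAGE_MIME_TYPES[suffix]
+
+print(image_mime_type('photo.JPG'))  # image/jpeg
+```
+
+An extension is only a hint about the file contents, so this check does not replace validating the image itself.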
+
+## Model Selection
+
+### Gemini 2.5 Series
+- **gemini-2.5-pro**: Best quality, segmentation + detection
+- **gemini-2.5-flash**: Fast, efficient, all features
+- **gemini-2.5-flash-lite**: Lightweight, all features
+
+### Gemini 2.0 Series
+- **gemini-2.0-flash**: Object detection support
+- **gemini-2.0-flash-lite**: Lightweight detection
+
+### Feature Requirements
+- **Segmentation**: Requires 2.5+ models
+- **Object Detection**: Requires 2.0+ models
+- **Multi-image**: All models (up to 3,600 images)
+
+## Basic Image Analysis
+
+### Image Captioning
+
+```python
+from google import genai
+import os
+
+client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
+
+# Local file
+with open('image.jpg', 'rb') as f:
+    img_bytes = f.read()
+
+# Wrap the raw bytes in a Part; later examples reuse img_part
+img_part = genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Describe this image in detail', img_part]
+)
+print(response.text)
+```
+
+### Image Classification
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Classify this image. Provide category and confidence level.',
+        img_part
+    ]
+)
+```
+
+### Visual Question Answering
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'How many people are in this image and what are they doing?',
+        img_part
+    ]
+)
+```
+
+## Advanced Features
+
+### Object Detection (2.0+)
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.0-flash',
+    contents=[
+        'Detect all objects in this image and provide bounding boxes',
+        img_part
+    ]
+)
+
+# Returns bounding box coordinates: [ymin, xmin, ymax, xmax]
+# Normalized to [0, 1000] range
+```
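+
+Because the coordinates are normalized to a 0-1000 range, they have to be scaled by the image dimensions before drawing or cropping. A minimal sketch, assuming the box has already been parsed out of the response text into a list of four numbers; `box_to_pixels` is a hypothetical helper:
+
+```python
+def box_to_pixels(box, image_width, image_height):
+    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to pixel (left, top, right, bottom)."""
+    ymin, xmin, ymax, xmax = box
+    left = round(xmin * image_width / 1000)
+    top = round(ymin * image_height / 1000)
+    right = round(xmax * image_width / 1000)
+    bottom = round(ymax * image_height / 1000)
+    return left, top, right, bottom
+
+# A box on a 1000x500 pixel image
+print(box_to_pixels([100, 200, 500, 800], 1000, 500))  # (200, 50, 800, 250)
+```
+
+The `(left, top, right, bottom)` order is chosen to match what `PIL.Image.crop` expects; note that it swaps the y-first order the model returns.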
+
+### Segmentation (2.5+)
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Create a segmentation mask for all people in this image',
+        img_part
+    ]
+)
+
+# Returns pixel-level masks for requested objects
+```
+
+### Multi-Image Comparison
+
+```python
+import PIL.Image
+
+img1 = PIL.Image.open('photo1.jpg')
+img2 = PIL.Image.open('photo2.jpg')
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Compare these two images. What are the differences?',
+        img1,
+        img2
+    ]
+)
+```
+
+### OCR and Text Extraction
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Extract all visible text from this image',
+        img_part
+    ]
+)
+```
+
+## Input Methods
+
+### Inline Data (<20MB)
+
+```python
+from google.genai import types
+
+# From file
+with open('image.jpg', 'rb') as f:
+    img_bytes = f.read()
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Analyze this image',
+        types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
+    ]
+)
+```
+
+### PIL Image
+
+```python
+import PIL.Image
+
+img = PIL.Image.open('photo.jpg')
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['What is in this image?', img]
+)
+```
+
+### File API (>20MB or Reuse)
+
+```python
+# Upload once
+myfile = client.files.upload(file='large-image.jpg')
+
+# Use multiple times
+response1 = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Describe this image', myfile]
+)
+
+response2 = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['What colors dominate this image?', myfile]
+)
+```
+
+### URL (Public Images)
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Analyze this image',
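+        # Whether a plain HTTPS URL can be fetched directly depends on the backend;
+        # if it cannot, download the bytes and pass them with Part.from_bytes
+        types.Part.from_uri(
+            file_uri='https://example.com/image.jpg',
+            mime_type='image/jpeg'
+        )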
+    ]
+)
+```
+
+## Token Calculation
+
+Images consume tokens based on size:
+
+**Small images** (≤384px both dimensions): 258 tokens
+
+**Large images**: Tiled into 768×768 chunks, 258 tokens each
+
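+**Formula** (each quotient is rounded up to a whole number of tiles):
+```
+crop_unit = floor(min(width, height) / 1.5)
+tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
+total_tokens = tiles × 258
+```
+
+**Examples** (computed from the formula above; actual counts may differ):
+- 256×256: 258 tokens (small)
+- 512×512: crop unit 341, 2 × 2 = 4 tiles = 1,032 tokens
+- 960×540: crop unit 360, 3 × 2 = 6 tiles = 1,548 tokens
+- 1920×1080: crop unit 720, 3 × 2 = 6 tiles = 1,548 tokens
+- 3840×2160 (4K): crop unit 1440, 3 × 2 = 6 tiles = 1,548 tokens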
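+
+The calculation can be sketched in code as follows. This follows the formula as written in this section, with one stated assumption: each quotient is rounded up, which is what reproduces the 6-tile result for 960×540. The function name is illustrative, and the estimate may differ from what the API actually bills:
+
+```python
+import math
+
+TOKENS_PER_TILE = 258
+SMALL_IMAGE_MAX = 384  # both dimensions at or below this count as one tile
+
+def estimate_image_tokens(width, height):
+    """Estimate image input tokens from pixel dimensions."""
+    if width <= SMALL_IMAGE_MAX and height <= SMALL_IMAGE_MAX:
+        return TOKENS_PER_TILE
+    # Guard against a zero crop unit for extremely thin images
+    crop_unit = max(1, math.floor(min(width, height) / 1.5))
+    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
+    return tiles * TOKENS_PER_TILE
+
+print(estimate_image_tokens(256, 256))  # 258
+print(estimate_image_tokens(960, 540))  # 1548
+```
+
+For an authoritative figure, compare the estimate with the usage metadata returned alongside a real response.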
+
+## Structured Output
+
+### JSON Schema Output
+
+```python
+from pydantic import BaseModel
+from typing import List
+
+class ObjectDetection(BaseModel):
+    object_name: str
+    confidence: float
+    bounding_box: List[int]  # [ymin, xmin, ymax, xmax]
+
+class ImageAnalysis(BaseModel):
+    description: str
+    objects: List[ObjectDetection]
+    scene_type: str
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Analyze this image', img_part],
+    config=genai.types.GenerateContentConfig(
+        response_mime_type='application/json',
+        response_schema=ImageAnalysis
+    )
+)
+
+result = ImageAnalysis.model_validate_json(response.text)
+```
+
+## Multi-Image Analysis
+
+### Batch Processing
+
+```python
+images = [
+    PIL.Image.open(f'image{i}.jpg')
+    for i in range(10)
+]
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Analyze these images and find common themes'] + images
+)
+```
+
+### Image Comparison
+
+```python
+before = PIL.Image.open('before.jpg')
+after = PIL.Image.open('after.jpg')
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Compare before and after. List all visible changes.',
+        before,
+        after
+    ]
+)
+```
+
+### Visual Search
+
+```python
+reference = PIL.Image.open('target.jpg')
+candidates = [PIL.Image.open(f'option{i}.jpg') for i in range(5)]
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Find which candidate images contain objects similar to the reference',
+        reference
+    ] + candidates
+)
+```
+
+## Best Practices
+
+### Image Quality
+
+1. **Resolution**: Use clear, non-blurry images
+2. **Rotation**: Verify correct orientation
+3. **Lighting**: Ensure good contrast and lighting
+4. **Size optimization**: Balance quality vs token cost
+5. **Format**: JPEG for photos, PNG for graphics
+
+### Prompt Engineering
+
+**Specific instructions**:
+- "Identify all vehicles with their colors and positions"
+- "Count people wearing blue shirts"
+- "Extract text from the sign in the top-left corner"
+
+**Output format**:
+- "Return results as JSON with fields: category, count, description"
+- "Format as markdown table"
+- "List findings as numbered items"
+
+**Few-shot examples**:
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Example: For an image of a cat on a sofa, respond: "Object: cat, Location: sofa"',
+        'Now analyze this image:',
+        img_part
+    ]
+)
+```
+
+### File Management
+
+1. Use File API for images >20MB
+2. Use File API for repeated queries (avoids re-uploading; image tokens are still counted on every request)
+3. Files auto-delete after 48 hours
+4. Clean up manually:
+   ```python
+   client.files.delete(name=myfile.name)
+   ```
+
+### Cost Optimization
+
+**Token-efficient strategies**:
+- Resize large images before upload
+- Use File API for repeated queries to avoid re-uploading (saves bandwidth, not tokens)
+- Batch multiple images when related
+- Use appropriate model (Flash vs Pro)
+
+**Token costs** (assuming an illustrative input price of $1 per 1M tokens; check current Gemini 2.5 Flash pricing):
+- Small image (258 tokens): $0.000258
+- 6-tile image such as 1920×1080 (1,548 tokens): $0.001548
+- 24-tile image (6,192 tokens): $0.006192
+
+## Common Use Cases
+
+### 1. Product Analysis
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Analyze this product image:
+        1. Identify the product
+        2. List visible features
+        3. Assess condition
+        4. Estimate value range
+        ''',
+        img_part
+    ]
+)
+```
+
+### 2. Screenshot Analysis
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Extract all text and UI elements from this screenshot',
+        img_part
+    ]
+)
+```
+
+### 3. Medical Imaging (Informational Only)
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-pro',
+    contents=[
+        'Describe visible features in this medical image. Note: This is for informational purposes only.',
+        img_part
+    ]
+)
+```
+
+### 4. Chart/Graph Reading
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Extract data from this chart and format as JSON',
+        img_part
+    ]
+)
+```
+
+### 5. Scene Understanding
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Analyze this scene:
+        1. Location type
+        2. Time of day
+        3. Weather conditions
+        4. Activities happening
+        5. Mood/atmosphere
+        ''',
+        img_part
+    ]
+)
+```
+
+## Error Handling
+
+```python
+import time
+
+def analyze_image_with_retry(image_path, prompt, max_retries=3):
+    """Analyze image with exponential backoff retry"""
+    for attempt in range(max_retries):
+        try:
+            with open(image_path, 'rb') as f:
+                img_bytes = f.read()
+
+            response = client.models.generate_content(
+                model='gemini-2.5-flash',
+                contents=[
+                    prompt,
+                    genai.types.Part.from_bytes(
+                        data=img_bytes,
+                        mime_type='image/jpeg'
+                    )
+                ]
+            )
+            return response.text
+        except Exception as e:
+            if attempt == max_retries - 1:
+                raise
+            wait_time = 2 ** attempt
+            print(f"Retry {attempt + 1} after {wait_time}s: {e}")
+            time.sleep(wait_time)
+```
+
+## Limitations
+
+- Maximum 3,600 images per request
+- OCR accuracy varies with text quality
+- Object detection requires 2.0+ models
+- Segmentation requires 2.5+ models
+- No video frame extraction (use video API)
+- Regional restrictions on child images (EEA, CH, UK)