Initial commit
This commit is contained in:
373
skills/ai-multimodal/references/audio-processing.md
Normal file
373
skills/ai-multimodal/references/audio-processing.md
Normal file
@@ -0,0 +1,373 @@
|
||||
# Audio Processing Reference
|
||||
|
||||
Comprehensive guide for audio analysis and speech generation using Gemini API.
|
||||
|
||||
## Audio Understanding
|
||||
|
||||
### Supported Formats
|
||||
|
||||
| Format | MIME Type | Best Use |
|
||||
|--------|-----------|----------|
|
||||
| WAV | `audio/wav` | Uncompressed, highest quality |
|
||||
| MP3 | `audio/mp3` | Compressed, widely compatible |
|
||||
| AAC | `audio/aac` | Compressed, good quality |
|
||||
| FLAC | `audio/flac` | Lossless compression |
|
||||
| OGG Vorbis | `audio/ogg` | Open format |
|
||||
| AIFF | `audio/aiff` | Apple format |
|
||||
|
||||
### Specifications
|
||||
|
||||
- **Maximum length**: 9.5 hours per request
|
||||
- **Multiple files**: Unlimited count, combined max 9.5 hours
|
||||
- **Token rate**: 32 tokens/second (1 minute = 1,920 tokens)
|
||||
- **Processing**: Auto-downsampled to 16 Kbps mono
|
||||
- **File size limits**:
|
||||
- Inline: 20 MB max total request
|
||||
- File API: 2 GB per file, 20 GB project quota
|
||||
- Retention: 48 hours auto-delete
|
||||
|
||||
## Transcription
|
||||
|
||||
### Basic Transcription
|
||||
|
||||
```python
|
||||
from google import genai
|
||||
import os
|
||||
|
||||
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
|
||||
|
||||
# Upload audio
|
||||
myfile = client.files.upload(file='meeting.mp3')
|
||||
|
||||
# Transcribe
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Generate a transcript of the speech.', myfile]
|
||||
)
|
||||
print(response.text)
|
||||
```
|
||||
|
||||
### With Timestamps
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Generate transcript with timestamps in MM:SS format.', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
### Multi-Speaker Identification
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Transcribe with speaker labels. Format: [Speaker 1], [Speaker 2], etc.', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
### Segment-Specific Transcription
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Transcribe only the segment from 02:30 to 05:15.', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
## Audio Analysis
|
||||
|
||||
### Summarization
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Summarize key points in 5 bullets with timestamps.', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
### Non-Speech Audio Analysis
|
||||
|
||||
```python
|
||||
# Music analysis
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Identify the musical instruments and genre.', myfile]
|
||||
)
|
||||
|
||||
# Environmental sounds
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Identify all sounds: voices, music, ambient noise.', myfile]
|
||||
)
|
||||
|
||||
# Birdsong identification
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Identify bird species based on their calls.', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
### Timestamp-Based Analysis
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['What is discussed from 10:30 to 15:45? Provide key points.', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
## Input Methods
|
||||
|
||||
### File Upload (>20MB or Reuse)
|
||||
|
||||
```python
|
||||
# Upload once, use multiple times
|
||||
myfile = client.files.upload(file='large-audio.mp3')
|
||||
|
||||
# First query
|
||||
response1 = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Transcribe this', myfile]
|
||||
)
|
||||
|
||||
# Second query (reuses same file)
|
||||
response2 = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Summarize this', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
### Inline Data (<20MB)
|
||||
|
||||
```python
|
||||
from google.genai import types
|
||||
|
||||
with open('small-audio.mp3', 'rb') as f:
|
||||
audio_bytes = f.read()
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Describe this audio',
|
||||
types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Speech Generation (TTS)
|
||||
|
||||
### Available Models
|
||||
|
||||
| Model | Quality | Speed | Cost/1M tokens |
|
||||
|-------|---------|-------|----------------|
|
||||
| `gemini-2.5-flash-native-audio-preview-09-2025` | High | Fast | $10 |
|
||||
| `gemini-2.5-pro` TTS mode | Premium | Slower | $20 |
|
||||
|
||||
### Basic TTS
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-native-audio-preview-09-2025',
|
||||
contents='Generate audio: Welcome to today\'s episode.'
|
||||
)
|
||||
|
||||
# Save audio
|
||||
with open('output.wav', 'wb') as f:
|
||||
f.write(response.audio_data)
|
||||
```
|
||||
|
||||
### Controllable Voice Style
|
||||
|
||||
```python
|
||||
# Professional tone
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-native-audio-preview-09-2025',
|
||||
contents='Generate audio in a professional, clear tone: Welcome to our quarterly earnings call.'
|
||||
)
|
||||
|
||||
# Casual and friendly
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-native-audio-preview-09-2025',
|
||||
contents='Generate audio in a friendly, conversational tone: Hey there! Let\'s dive into today\'s topic.'
|
||||
)
|
||||
|
||||
# Narrative style
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-native-audio-preview-09-2025',
|
||||
contents='Generate audio in a narrative, storytelling tone: Once upon a time, in a land far away...'
|
||||
)
|
||||
```
|
||||
|
||||
### Voice Control Parameters
|
||||
|
||||
- **Style**: Professional, casual, narrative, conversational
|
||||
- **Pace**: Slow, normal, fast
|
||||
- **Tone**: Friendly, serious, enthusiastic
|
||||
- **Accent**: Natural language control (e.g., "British accent", "Southern drawl")
|
||||
|
||||
## Best Practices
|
||||
|
||||
### File Management
|
||||
|
||||
1. Use File API for files >20MB
|
||||
2. Use File API for repeated queries (saves tokens)
|
||||
3. Files auto-delete after 48 hours
|
||||
4. Clean up manually when done:
|
||||
```python
|
||||
client.files.delete(name=myfile.name)
|
||||
```
|
||||
|
||||
### Prompt Engineering
|
||||
|
||||
**Effective prompts**:
|
||||
- "Transcribe from 02:30 to 03:29 in MM:SS format"
|
||||
- "Identify speakers and extract dialogue with timestamps"
|
||||
- "Summarize key points with relevant timestamps"
|
||||
- "Transcribe and analyze sentiment for each speaker"
|
||||
|
||||
**Context improves accuracy**:
|
||||
- "This is a medical interview - use appropriate terminology"
|
||||
- "Transcribe this legal deposition with precise terminology"
|
||||
- "This is a technical podcast about machine learning"
|
||||
|
||||
**Combined tasks**:
|
||||
- "Transcribe and summarize in bullet points"
|
||||
- "Extract key quotes with timestamps and speaker labels"
|
||||
- "Transcribe and identify action items with timestamps"
|
||||
|
||||
### Cost Optimization
|
||||
|
||||
**Token calculation**:
|
||||
- 1 minute audio = 1,920 tokens
|
||||
- 1 hour audio = 115,200 tokens
|
||||
- 9.5 hours = 1,094,400 tokens
|
||||
|
||||
**Model selection**:
|
||||
- Use `gemini-2.5-flash` ($1/1M tokens) for most tasks
|
||||
- Upgrade to `gemini-2.5-pro` ($3/1M tokens) for complex analysis
|
||||
- For high-volume: `gemini-1.5-flash` ($0.70/1M tokens)
|
||||
|
||||
**Reduce costs**:
|
||||
- Process only relevant segments using timestamps
|
||||
- Use lower-quality audio when possible
|
||||
- Batch multiple short files in one request
|
||||
- Cache context for repeated queries
|
||||
|
||||
### Error Handling
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
def transcribe_with_retry(file_path, max_retries=3):
|
||||
"""Transcribe audio with exponential backoff retry"""
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
myfile = client.files.upload(file=file_path)
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Transcribe with timestamps', myfile]
|
||||
)
|
||||
return response.text
|
||||
except Exception as e:
|
||||
if attempt == max_retries - 1:
|
||||
raise
|
||||
wait_time = 2 ** attempt
|
||||
print(f"Retry {attempt + 1} after {wait_time}s")
|
||||
time.sleep(wait_time)
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### 1. Meeting Transcription
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Transcribe this meeting with:
|
||||
1. Speaker labels
|
||||
2. Timestamps for topic changes
|
||||
3. Action items highlighted
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Podcast Summary
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Create podcast summary with:
|
||||
1. Main topics with timestamps
|
||||
2. Key quotes from each speaker
|
||||
3. Recommended episode highlights
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Interview Analysis
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Analyze interview:
|
||||
1. Questions asked with timestamps
|
||||
2. Key responses from interviewee
|
||||
3. Overall sentiment and tone
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Content Verification
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Verify audio content:
|
||||
1. Check for specific keywords or phrases
|
||||
2. Identify any compliance issues
|
||||
3. Note any concerning statements with timestamps
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 5. Multilingual Transcription
|
||||
|
||||
```python
|
||||
# Gemini auto-detects language
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Transcribe this audio and translate to English if needed.', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
## Token Costs
|
||||
|
||||
**Audio Input** (32 tokens/second):
|
||||
- 1 minute = 1,920 tokens
|
||||
- 10 minutes = 19,200 tokens
|
||||
- 1 hour = 115,200 tokens
|
||||
- 9.5 hours = 1,094,400 tokens
|
||||
|
||||
**Example costs** (Gemini 2.5 Flash at $1/1M):
|
||||
- 1 hour audio: 115,200 tokens = $0.12
|
||||
- Full day podcast (8 hours): 921,600 tokens = $0.92
|
||||
|
||||
## Limitations
|
||||
|
||||
- Maximum 9.5 hours per request
|
||||
- Auto-downsampled to 16 Kbps mono (quality loss)
|
||||
- Files expire after 48 hours
|
||||
- No real-time streaming support
|
||||
- Non-speech audio less accurate than speech
|
||||
558
skills/ai-multimodal/references/image-generation.md
Normal file
558
skills/ai-multimodal/references/image-generation.md
Normal file
@@ -0,0 +1,558 @@
|
||||
# Image Generation Reference
|
||||
|
||||
Comprehensive guide for image creation, editing, and composition using Gemini API.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
- **Text-to-Image**: Generate images from text prompts
|
||||
- **Image Editing**: Modify existing images with text instructions
|
||||
- **Multi-Image Composition**: Combine up to 3 images
|
||||
- **Iterative Refinement**: Refine images conversationally
|
||||
- **Aspect Ratios**: Multiple formats (1:1, 16:9, 9:16, 4:3, 3:4)
|
||||
- **Style Control**: Control artistic style and quality
|
||||
- **Text in Images**: Limited text rendering (max 25 chars)
|
||||
|
||||
## Model
|
||||
|
||||
**gemini-2.5-flash-image** - Specialized for image generation
|
||||
- Input tokens: 65,536
|
||||
- Output tokens: 32,768
|
||||
- Knowledge cutoff: June 2025
|
||||
- Supports: Text and image inputs, image outputs
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Basic Generation
|
||||
|
||||
```python
|
||||
from google import genai
|
||||
from google.genai import types
|
||||
import os
|
||||
|
||||
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents='A serene mountain landscape at sunset with snow-capped peaks',
|
||||
config=types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='16:9'
|
||||
)
|
||||
)
|
||||
|
||||
# Save image
|
||||
for i, part in enumerate(response.candidates[0].content.parts):
|
||||
if part.inline_data:
|
||||
with open(f'output-{i}.png', 'wb') as f:
|
||||
f.write(part.inline_data.data)
|
||||
```
|
||||
|
||||
## Aspect Ratios
|
||||
|
||||
| Ratio | Resolution | Use Case | Token Cost |
|
||||
|-------|-----------|----------|------------|
|
||||
| 1:1 | 1024×1024 | Social media, avatars | 1290 |
|
||||
| 16:9 | 1344×768 | Landscapes, banners | 1290 |
|
||||
| 9:16 | 768×1344 | Mobile, portraits | 1290 |
|
||||
| 4:3 | 1152×896 | Traditional media | 1290 |
|
||||
| 3:4 | 896×1152 | Vertical posters | 1290 |
|
||||
|
||||
All ratios cost the same: 1,290 tokens per image.
|
||||
|
||||
## Response Modalities
|
||||
|
||||
### Image Only
|
||||
|
||||
```python
|
||||
config = types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='1:1'
|
||||
)
|
||||
```
|
||||
|
||||
### Text Only (No Image)
|
||||
|
||||
```python
|
||||
config = types.GenerateContentConfig(
|
||||
response_modalities=['text']
|
||||
)
|
||||
# Returns text description instead of generating image
|
||||
```
|
||||
|
||||
### Both Image and Text
|
||||
|
||||
```python
|
||||
config = types.GenerateContentConfig(
|
||||
response_modalities=['image', 'text'],
|
||||
aspect_ratio='16:9'
|
||||
)
|
||||
# Returns both generated image and description
|
||||
```
|
||||
|
||||
## Image Editing
|
||||
|
||||
### Modify Existing Image
|
||||
|
||||
```python
|
||||
import PIL.Image
|
||||
|
||||
# Load original
|
||||
img = PIL.Image.open('original.png')
|
||||
|
||||
# Edit with instructions
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents=[
|
||||
'Add a red balloon floating in the sky',
|
||||
img
|
||||
],
|
||||
config=types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='16:9'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### Style Transfer
|
||||
|
||||
```python
|
||||
img = PIL.Image.open('photo.jpg')
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents=[
|
||||
'Transform this into an oil painting style',
|
||||
img
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Object Addition/Removal
|
||||
|
||||
```python
|
||||
# Add object
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents=[
|
||||
'Add a vintage car parked on the street',
|
||||
img
|
||||
]
|
||||
)
|
||||
|
||||
# Remove object
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents=[
|
||||
'Remove the person on the left side',
|
||||
img
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Multi-Image Composition
|
||||
|
||||
### Combine Multiple Images
|
||||
|
||||
```python
|
||||
img1 = PIL.Image.open('background.png')
|
||||
img2 = PIL.Image.open('foreground.png')
|
||||
img3 = PIL.Image.open('overlay.png')
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents=[
|
||||
'Combine these images into a cohesive scene',
|
||||
img1,
|
||||
img2,
|
||||
img3
|
||||
],
|
||||
config=types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='16:9'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
**Note**: Recommended maximum 3 input images for best results.
|
||||
|
||||
## Prompt Engineering
|
||||
|
||||
### Effective Prompt Structure
|
||||
|
||||
**Three key elements**:
|
||||
1. **Subject**: What to generate
|
||||
2. **Context**: Environmental setting
|
||||
3. **Style**: Artistic treatment
|
||||
|
||||
**Example**: "A robot [subject] in a futuristic city [context], cyberpunk style with neon lighting [style]"
|
||||
|
||||
### Quality Modifiers
|
||||
|
||||
**Technical terms**:
|
||||
- "4K", "8K", "high resolution"
|
||||
- "HDR", "high dynamic range"
|
||||
- "professional photography"
|
||||
- "studio lighting"
|
||||
- "ultra detailed"
|
||||
|
||||
**Camera settings**:
|
||||
- "35mm lens", "50mm lens"
|
||||
- "shallow depth of field"
|
||||
- "wide angle shot"
|
||||
- "macro photography"
|
||||
- "golden hour lighting"
|
||||
|
||||
### Style Keywords
|
||||
|
||||
**Art styles**:
|
||||
- "oil painting", "watercolor", "sketch"
|
||||
- "digital art", "concept art"
|
||||
- "photorealistic", "hyperrealistic"
|
||||
- "minimalist", "abstract"
|
||||
- "cyberpunk", "steampunk", "fantasy"
|
||||
|
||||
**Mood and atmosphere**:
|
||||
- "dramatic lighting", "soft lighting"
|
||||
- "moody", "bright and cheerful"
|
||||
- "mysterious", "whimsical"
|
||||
- "dark and gritty", "pastel colors"
|
||||
|
||||
### Subject Description
|
||||
|
||||
**Be specific**:
|
||||
- ❌ "A cat"
|
||||
- ✅ "A fluffy orange tabby cat with green eyes"
|
||||
|
||||
**Add context**:
|
||||
- ❌ "A building"
|
||||
- ✅ "A modern glass skyscraper reflecting sunset clouds"
|
||||
|
||||
**Include details**:
|
||||
- ❌ "A person"
|
||||
- ✅ "A young woman in a red dress holding an umbrella"
|
||||
|
||||
### Composition and Framing
|
||||
|
||||
**Camera angles**:
|
||||
- "bird's eye view", "aerial shot"
|
||||
- "low angle", "high angle"
|
||||
- "close-up", "wide shot"
|
||||
- "centered composition"
|
||||
- "rule of thirds"
|
||||
|
||||
**Perspective**:
|
||||
- "first person view"
|
||||
- "third person perspective"
|
||||
- "isometric view"
|
||||
- "forced perspective"
|
||||
|
||||
### Text in Images
|
||||
|
||||
**Limitations**:
|
||||
- Maximum 25 characters total
|
||||
- Up to 3 distinct text phrases
|
||||
- Works best with simple text
|
||||
|
||||
**Best practices**:
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents='A vintage poster with bold text "EXPLORE" at the top, mountain landscape, retro 1950s style'
|
||||
)
|
||||
```
|
||||
|
||||
**Font control**:
|
||||
- "bold sans-serif title"
|
||||
- "handwritten script"
|
||||
- "vintage letterpress"
|
||||
- "modern minimalist font"
|
||||
|
||||
## Advanced Techniques
|
||||
|
||||
### Iterative Refinement
|
||||
|
||||
```python
|
||||
# Initial generation
|
||||
response1 = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents='A futuristic city skyline'
|
||||
)
|
||||
|
||||
# Save first version
|
||||
with open('v1.png', 'wb') as f:
|
||||
f.write(response1.candidates[0].content.parts[0].inline_data.data)
|
||||
|
||||
# Refine
|
||||
img = PIL.Image.open('v1.png')
|
||||
response2 = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents=[
|
||||
'Add flying vehicles and neon signs',
|
||||
img
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Negative Prompts (Indirect)
|
||||
|
||||
```python
|
||||
# Instead of "no blur", be specific about what you want
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents='A crystal clear, sharp photograph of a diamond ring with perfect focus and high detail'
|
||||
)
|
||||
```
|
||||
|
||||
### Consistent Style Across Images
|
||||
|
||||
```python
|
||||
base_prompt = "Digital art, vibrant colors, cel-shaded style, clean lines"
|
||||
|
||||
prompts = [
|
||||
f"{base_prompt}, a warrior character",
|
||||
f"{base_prompt}, a mage character",
|
||||
f"{base_prompt}, a rogue character"
|
||||
]
|
||||
|
||||
for i, prompt in enumerate(prompts):
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents=prompt
|
||||
)
|
||||
# Save each character
|
||||
```
|
||||
|
||||
## Safety Settings
|
||||
|
||||
### Configure Safety Filters
|
||||
|
||||
```python
|
||||
config = types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
safety_settings=[
|
||||
types.SafetySetting(
|
||||
category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
|
||||
threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
|
||||
),
|
||||
types.SafetySetting(
|
||||
category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
|
||||
threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
|
||||
)
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Available Categories
|
||||
|
||||
- `HARM_CATEGORY_HATE_SPEECH`
|
||||
- `HARM_CATEGORY_DANGEROUS_CONTENT`
|
||||
- `HARM_CATEGORY_HARASSMENT`
|
||||
- `HARM_CATEGORY_SEXUALLY_EXPLICIT`
|
||||
|
||||
### Thresholds
|
||||
|
||||
- `BLOCK_NONE`: No blocking
|
||||
- `BLOCK_LOW_AND_ABOVE`: Block low probability and above
|
||||
- `BLOCK_MEDIUM_AND_ABOVE`: Block medium and above (default)
|
||||
- `BLOCK_ONLY_HIGH`: Block only high probability
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### 1. Marketing Assets
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents='''Professional product photography:
|
||||
- Sleek smartphone on minimalist white surface
|
||||
- Dramatic side lighting creating subtle shadows
|
||||
- Shallow depth of field, crisp focus
|
||||
- Clean, modern aesthetic
|
||||
- 4K quality
|
||||
''',
|
||||
config=types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='4:3'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Concept Art
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents='''Fantasy concept art:
|
||||
- Ancient floating islands connected by chains
|
||||
- Waterfalls cascading into clouds below
|
||||
- Magical crystals glowing on the islands
|
||||
- Epic scale, dramatic lighting
|
||||
- Detailed digital painting style
|
||||
''',
|
||||
config=types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='16:9'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Social Media Graphics
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents='''Instagram post design:
|
||||
- Pastel gradient background (pink to blue)
|
||||
- Motivational quote layout
|
||||
- Modern minimalist style
|
||||
- Clean typography
|
||||
- Mobile-friendly composition
|
||||
''',
|
||||
config=types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='1:1'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Illustration
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents='''Children's book illustration:
|
||||
- Friendly cartoon dragon reading a book
|
||||
- Bright, cheerful colors
|
||||
- Soft, rounded shapes
|
||||
- Whimsical forest background
|
||||
- Warm, inviting atmosphere
|
||||
''',
|
||||
config=types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='4:3'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### 5. UI/UX Mockups
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents='''Modern mobile app interface:
|
||||
- Clean dashboard design
|
||||
- Card-based layout
|
||||
- Soft shadows and gradients
|
||||
- Contemporary color scheme (blue and white)
|
||||
- Professional fintech aesthetic
|
||||
''',
|
||||
config=types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='9:16'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Prompt Quality
|
||||
|
||||
1. **Be specific**: More detail = better results
|
||||
2. **Order matters**: Most important elements first
|
||||
3. **Use examples**: Reference known styles or artists
|
||||
4. **Avoid contradictions**: Don't ask for opposing styles
|
||||
5. **Test and iterate**: Refine prompts based on results
|
||||
|
||||
### File Management
|
||||
|
||||
```python
|
||||
# Save with descriptive names
|
||||
timestamp = int(time.time())
|
||||
filename = f'generated_{timestamp}_{aspect_ratio}.png'
|
||||
|
||||
with open(filename, 'wb') as f:
|
||||
f.write(image_data)
|
||||
```
|
||||
|
||||
### Cost Optimization
|
||||
|
||||
**Token costs**:
|
||||
- 1 image: 1,290 tokens = $0.00129 (Flash Image at $1/1M)
|
||||
- 10 images: 12,900 tokens = $0.0129
|
||||
- 100 images: 129,000 tokens = $0.129
|
||||
|
||||
**Strategies**:
|
||||
- Generate fewer iterations
|
||||
- Use text modality first to validate concept
|
||||
- Batch similar requests
|
||||
- Cache prompts for consistent style
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Safety Filter Blocking
|
||||
|
||||
```python
|
||||
try:
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash-image',
|
||||
contents=prompt
|
||||
)
|
||||
except Exception as e:
|
||||
# Check block reason
|
||||
if hasattr(e, 'prompt_feedback'):
|
||||
print(f"Blocked: {e.prompt_feedback.block_reason}")
|
||||
# Modify prompt and retry
|
||||
```
|
||||
|
||||
### Token Limit Exceeded
|
||||
|
||||
```python
|
||||
# Keep prompts concise
|
||||
if len(prompt) > 1000:
|
||||
# Truncate or simplify
|
||||
prompt = prompt[:1000]
|
||||
```
|
||||
|
||||
## Limitations
|
||||
|
||||
- Maximum 3 input images for composition
|
||||
- Text rendering limited (25 chars max)
|
||||
- No video or animation generation
|
||||
- Regional restrictions (child images in EEA, CH, UK)
|
||||
- Optimal language support: English, Spanish (Mexico), Japanese, Mandarin, Hindi
|
||||
- No real-time generation
|
||||
- Cannot perfectly replicate specific people or copyrighted characters
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### aspect_ratio Parameter Error
|
||||
|
||||
**Error**: `Extra inputs are not permitted [type=extra_forbidden, input_value='1:1', input_type=str]`
|
||||
|
||||
**Cause**: The `aspect_ratio` parameter must be nested inside an `image_config` object, not passed directly to `GenerateContentConfig`.
|
||||
|
||||
**Incorrect Usage**:
|
||||
```python
|
||||
# ❌ This will fail
|
||||
config = types.GenerateContentConfig(
|
||||
response_modalities=['image'],
|
||||
aspect_ratio='16:9' # Wrong - not a direct parameter
|
||||
)
|
||||
```
|
||||
|
||||
**Correct Usage**:
|
||||
```python
|
||||
# ✅ Correct implementation
|
||||
config = types.GenerateContentConfig(
|
||||
response_modalities=['Image'], # Note: Capital 'I'
|
||||
image_config=types.ImageConfig(
|
||||
aspect_ratio='16:9'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### Response Modality Case Sensitivity
|
||||
|
||||
The `response_modalities` parameter expects capital case values:
|
||||
- ✅ Correct: `['Image']`, `['Text']`, `['Image', 'Text']`
|
||||
- ❌ Wrong: `['image']`, `['text']`
|
||||
502
skills/ai-multimodal/references/video-analysis.md
Normal file
502
skills/ai-multimodal/references/video-analysis.md
Normal file
@@ -0,0 +1,502 @@
|
||||
# Video Analysis Reference
|
||||
|
||||
Comprehensive guide for video understanding, temporal analysis, and YouTube processing using Gemini API.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
- **Video Summarization**: Create concise summaries
|
||||
- **Question Answering**: Answer specific questions about content
|
||||
- **Transcription**: Audio transcription with visual descriptions
|
||||
- **Timestamp References**: Query specific moments (MM:SS format)
|
||||
- **Video Clipping**: Process specific segments
|
||||
- **Scene Detection**: Identify scene changes and transitions
|
||||
- **Multiple Videos**: Compare up to 10 videos (2.5+)
|
||||
- **YouTube Support**: Analyze YouTube videos directly
|
||||
- **Custom Frame Rate**: Adjust FPS sampling
|
||||
|
||||
## Supported Formats
|
||||
|
||||
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
|
||||
|
||||
## Model Selection
|
||||
|
||||
### Gemini 2.5 Series
|
||||
- **gemini-2.5-pro**: Best quality, 1M-2M context
|
||||
- **gemini-2.5-flash**: Balanced, 1M-2M context
|
||||
- **gemini-2.5-flash-preview-09-2025**: Preview features, 1M context
|
||||
|
||||
### Gemini 2.0 Series
|
||||
- **gemini-2.0-flash**: Fast processing
|
||||
- **gemini-2.0-flash-lite**: Lightweight option
|
||||
|
||||
### Context Windows
|
||||
- **2M token models**: ~2 hours (default) or ~6 hours (low-res)
|
||||
- **1M token models**: ~1 hour (default) or ~3 hours (low-res)
|
||||
|
||||
## Basic Video Analysis
|
||||
|
||||
### Local Video
|
||||
|
||||
```python
|
||||
from google import genai
|
||||
import os
|
||||
|
||||
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
|
||||
|
||||
# Upload video (File API for >20MB)
|
||||
myfile = client.files.upload(file='video.mp4')
|
||||
|
||||
# Wait for processing
|
||||
import time
|
||||
while myfile.state.name == 'PROCESSING':
|
||||
time.sleep(1)
|
||||
myfile = client.files.get(name=myfile.name)
|
||||
|
||||
if myfile.state.name == 'FAILED':
|
||||
raise ValueError('Video processing failed')
|
||||
|
||||
# Analyze
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Summarize this video in 3 key points', myfile]
|
||||
)
|
||||
print(response.text)
|
||||
```
|
||||
|
||||
### YouTube Video
|
||||
|
||||
```python
|
||||
from google.genai import types
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Summarize the main topics discussed',
|
||||
types.Part.from_uri(
|
||||
uri='https://www.youtube.com/watch?v=VIDEO_ID',
|
||||
mime_type='video/mp4'
|
||||
)
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Inline Video (<20MB)
|
||||
|
||||
```python
|
||||
with open('short-clip.mp4', 'rb') as f:
|
||||
video_bytes = f.read()
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'What happens in this video?',
|
||||
types.Part.from_bytes(data=video_bytes, mime_type='video/mp4')
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Video Clipping
|
||||
|
||||
```python
|
||||
# Analyze specific time range
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Summarize this segment',
|
||||
types.Part.from_video_metadata(
|
||||
file_uri=myfile.uri,
|
||||
start_offset='40s',
|
||||
end_offset='80s'
|
||||
)
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Custom Frame Rate
|
||||
|
||||
```python
|
||||
# Lower FPS for static content (saves tokens)
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Analyze this presentation',
|
||||
types.Part.from_video_metadata(
|
||||
file_uri=myfile.uri,
|
||||
fps=0.5 # Sample every 2 seconds
|
||||
)
|
||||
]
|
||||
)
|
||||
|
||||
# Higher FPS for fast-moving content
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Analyze rapid movements in this sports video',
|
||||
types.Part.from_video_metadata(
|
||||
file_uri=myfile.uri,
|
||||
fps=5 # Sample 5 times per second
|
||||
)
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Multiple Videos (2.5+)
|
||||
|
||||
```python
|
||||
video1 = client.files.upload(file='demo1.mp4')
|
||||
video2 = client.files.upload(file='demo2.mp4')
|
||||
|
||||
# Wait for processing
|
||||
for video in [video1, video2]:
|
||||
while video.state.name == 'PROCESSING':
|
||||
time.sleep(1)
|
||||
video = client.files.get(name=video.name)
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-pro',
|
||||
contents=[
|
||||
'Compare these two product demos. Which explains features better?',
|
||||
video1,
|
||||
video2
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Temporal Understanding
|
||||
|
||||
### Timestamp-Based Questions
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'What happens at 01:15 and how does it relate to 02:30?',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Timeline Creation
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Create a timeline with timestamps:
|
||||
- Key events
|
||||
- Scene changes
|
||||
- Important moments
|
||||
Format: MM:SS - Description
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Scene Detection
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Identify all scene changes with timestamps and describe each scene',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Transcription
|
||||
|
||||
### Basic Transcription
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Transcribe the audio from this video',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### With Visual Descriptions
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Transcribe with visual context:
|
||||
- Audio transcription
|
||||
- Visual descriptions of important moments
|
||||
- Timestamps for salient events
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Speaker Identification
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Transcribe with speaker labels and timestamps',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### 1. Video Summarization
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Summarize this video:
|
||||
1. Main topic and purpose
|
||||
2. Key points with timestamps
|
||||
3. Conclusion or call-to-action
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Educational Content
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Create educational materials:
|
||||
1. List key concepts taught
|
||||
2. Create 5 quiz questions with answers
|
||||
3. Provide timestamp for each concept
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Action Detection
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'List all actions performed in this tutorial with timestamps',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Content Moderation
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Review video content:
|
||||
1. Identify any problematic content
|
||||
2. Note timestamps of concerns
|
||||
3. Provide content rating recommendation
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 5. Interview Analysis
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Analyze interview:
|
||||
1. Questions asked (timestamps)
|
||||
2. Key responses
|
||||
3. Candidate body language and demeanor
|
||||
4. Overall assessment
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 6. Sports Analysis
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Analyze sports video:
|
||||
1. Key plays with timestamps
|
||||
2. Player movements and positioning
|
||||
3. Game strategy observations
|
||||
''',
|
||||
types.Part.from_video_metadata(
|
||||
file_uri=myfile.uri,
|
||||
fps=5 # Higher FPS for fast action
|
||||
)
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## YouTube Specific Features
|
||||
|
||||
### Public Video Requirements
|
||||
|
||||
- Video must be public (not private or unlisted)
|
||||
- No age-restricted content
|
||||
- Valid video ID required
|
||||
|
||||
### Usage Example
|
||||
|
||||
```python
|
||||
# YouTube URL
|
||||
youtube_uri = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Create chapter markers with timestamps',
|
||||
types.Part.from_uri(uri=youtube_uri, mime_type='video/mp4')
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Rate Limits
|
||||
|
||||
- **Free tier**: 8 hours of YouTube video per day
|
||||
- **Paid tier**: No length-based limits
|
||||
- Public videos only
|
||||
|
||||
## Token Calculation
|
||||
|
||||
Video tokens depend on resolution and FPS:
|
||||
|
||||
**Default resolution** (~300 tokens/second):
|
||||
- 1 minute = 18,000 tokens
|
||||
- 10 minutes = 180,000 tokens
|
||||
- 1 hour = 1,080,000 tokens
|
||||
|
||||
**Low resolution** (~100 tokens/second):
|
||||
- 1 minute = 6,000 tokens
|
||||
- 10 minutes = 60,000 tokens
|
||||
- 1 hour = 360,000 tokens
|
||||
|
||||
**Context windows**:
|
||||
- 2M tokens ≈ 2 hours (default) or 6 hours (low-res)
|
||||
- 1M tokens ≈ 1 hour (default) or 3 hours (low-res)
|
||||
|
||||
## Best Practices
|
||||
|
||||
### File Management
|
||||
|
||||
1. Use File API for videos >20MB (most videos)
|
||||
2. Wait for ACTIVE state before analysis
|
||||
3. Files auto-delete after 48 hours
|
||||
4. Clean up manually:
|
||||
```python
|
||||
client.files.delete(name=myfile.name)
|
||||
```
|
||||
|
||||
### Optimization Strategies
|
||||
|
||||
**Reduce token usage**:
|
||||
- Process specific segments using start/end offsets
|
||||
- Use lower FPS for static content
|
||||
- Use low-resolution mode for long videos
|
||||
- Split very long videos into chunks
|
||||
|
||||
**Improve accuracy**:
|
||||
- Provide context in prompts
|
||||
- Use higher FPS for fast-moving content
|
||||
- Use Pro model for complex analysis
|
||||
- Be specific about what to extract
|
||||
|
||||
### Prompt Engineering
|
||||
|
||||
**Effective prompts**:
|
||||
- "Summarize key points with timestamps in MM:SS format"
|
||||
- "Identify all scene changes and describe each scene"
|
||||
- "Extract action items mentioned with timestamps"
|
||||
- "Compare these two videos on: X, Y, Z criteria"
|
||||
|
||||
**Structured output**:
|
||||
```python
|
||||
from pydantic import BaseModel
|
||||
from typing import List
|
||||
|
||||
class VideoEvent(BaseModel):
|
||||
timestamp: str # MM:SS format
|
||||
description: str
|
||||
category: str
|
||||
|
||||
class VideoAnalysis(BaseModel):
|
||||
summary: str
|
||||
events: List[VideoEvent]
|
||||
duration: str
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Analyze this video', myfile],
|
||||
config=genai.types.GenerateContentConfig(
|
||||
response_mime_type='application/json',
|
||||
response_schema=VideoAnalysis
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### Error Handling
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
def upload_and_process_video(file_path, max_wait=300):
|
||||
"""Upload video and wait for processing"""
|
||||
myfile = client.files.upload(file=file_path)
|
||||
|
||||
elapsed = 0
|
||||
while myfile.state.name == 'PROCESSING' and elapsed < max_wait:
|
||||
time.sleep(5)
|
||||
myfile = client.files.get(name=myfile.name)
|
||||
elapsed += 5
|
||||
|
||||
if myfile.state.name == 'FAILED':
|
||||
raise ValueError(f'Video processing failed: {myfile.state.name}')
|
||||
|
||||
if myfile.state.name == 'PROCESSING':
|
||||
raise TimeoutError(f'Processing timeout after {max_wait}s')
|
||||
|
||||
return myfile
|
||||
```
|
||||
|
||||
## Cost Optimization
|
||||
|
||||
**Token costs** (Gemini 2.5 Flash at $1/1M):
|
||||
- 1 minute video (default): 18,000 tokens = $0.018
|
||||
- 10 minute video: 180,000 tokens = $0.18
|
||||
- 1 hour video: 1,080,000 tokens = $1.08
|
||||
|
||||
**Strategies**:
|
||||
- Use video clipping for specific segments
|
||||
- Lower FPS for static content
|
||||
- Use low-resolution mode for long videos
|
||||
- Batch related queries on same video
|
||||
- Use context caching for repeated queries
|
||||
|
||||
## Limitations
|
||||
|
||||
- Maximum 6 hours (low-res) or 2 hours (default)
|
||||
- YouTube videos must be public
|
||||
- No live streaming analysis
|
||||
- Files expire after 48 hours
|
||||
- Processing time varies by video length
|
||||
- No real-time processing
|
||||
- Limited to 10 videos per request (2.5+)
|
||||
483
skills/ai-multimodal/references/vision-understanding.md
Normal file
483
skills/ai-multimodal/references/vision-understanding.md
Normal file
@@ -0,0 +1,483 @@
|
||||
# Vision Understanding Reference
|
||||
|
||||
Comprehensive guide for image analysis, object detection, and visual understanding using Gemini API.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
- **Captioning**: Generate descriptive text for images
|
||||
- **Classification**: Categorize and identify content
|
||||
- **Visual Q&A**: Answer questions about images
|
||||
- **Object Detection**: Locate objects with bounding boxes (2.0+)
|
||||
- **Segmentation**: Create pixel-level masks (2.5+)
|
||||
- **Multi-image**: Compare up to 3,600 images
|
||||
- **OCR**: Extract text from images
|
||||
- **Document Understanding**: Process PDFs with vision
|
||||
|
||||
## Supported Formats
|
||||
|
||||
- **Images**: PNG, JPEG, WEBP, HEIC, HEIF
|
||||
- **Documents**: PDF (up to 1,000 pages)
|
||||
- **Size Limits**:
|
||||
- Inline: 20MB max total request
|
||||
- File API: 2GB per file
|
||||
- Max images: 3,600 per request
|
||||
|
||||
## Model Selection
|
||||
|
||||
### Gemini 2.5 Series
|
||||
- **gemini-2.5-pro**: Best quality, segmentation + detection
|
||||
- **gemini-2.5-flash**: Fast, efficient, all features
|
||||
- **gemini-2.5-flash-lite**: Lightweight, all features
|
||||
|
||||
### Gemini 2.0 Series
|
||||
- **gemini-2.0-flash**: Object detection support
|
||||
- **gemini-2.0-flash-lite**: Lightweight detection
|
||||
|
||||
### Feature Requirements
|
||||
- **Segmentation**: Requires 2.5+ models
|
||||
- **Object Detection**: Requires 2.0+ models
|
||||
- **Multi-image**: All models (up to 3,600 images)
|
||||
|
||||
## Basic Image Analysis
|
||||
|
||||
### Image Captioning
|
||||
|
||||
```python
|
||||
from google import genai
|
||||
import os
|
||||
|
||||
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
|
||||
|
||||
# Local file
|
||||
with open('image.jpg', 'rb') as f:
|
||||
img_bytes = f.read()
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Describe this image in detail',
|
||||
genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
|
||||
]
|
||||
)
|
||||
print(response.text)
|
||||
```
|
||||
|
||||
### Image Classification
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Classify this image. Provide category and confidence level.',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Visual Question Answering
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'How many people are in this image and what are they doing?',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Object Detection (2.0+)
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.0-flash',
|
||||
contents=[
|
||||
'Detect all objects in this image and provide bounding boxes',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
|
||||
# Returns bounding box coordinates: [ymin, xmin, ymax, xmax]
|
||||
# Normalized to [0, 1000] range
|
||||
```
|
||||
|
||||
### Segmentation (2.5+)
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Create a segmentation mask for all people in this image',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
|
||||
# Returns pixel-level masks for requested objects
|
||||
```
|
||||
|
||||
### Multi-Image Comparison
|
||||
|
||||
```python
|
||||
import PIL.Image
|
||||
|
||||
img1 = PIL.Image.open('photo1.jpg')
|
||||
img2 = PIL.Image.open('photo2.jpg')
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Compare these two images. What are the differences?',
|
||||
img1,
|
||||
img2
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### OCR and Text Extraction
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Extract all visible text from this image',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Input Methods
|
||||
|
||||
### Inline Data (<20MB)
|
||||
|
||||
```python
|
||||
from google.genai import types
|
||||
|
||||
# From file
|
||||
with open('image.jpg', 'rb') as f:
|
||||
img_bytes = f.read()
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Analyze this image',
|
||||
types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### PIL Image
|
||||
|
||||
```python
|
||||
import PIL.Image
|
||||
|
||||
img = PIL.Image.open('photo.jpg')
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['What is in this image?', img]
|
||||
)
|
||||
```
|
||||
|
||||
### File API (>20MB or Reuse)
|
||||
|
||||
```python
|
||||
# Upload once
|
||||
myfile = client.files.upload(file='large-image.jpg')
|
||||
|
||||
# Use multiple times
|
||||
response1 = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Describe this image', myfile]
|
||||
)
|
||||
|
||||
response2 = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['What colors dominate this image?', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
### URL (Public Images)
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Analyze this image',
|
||||
types.Part.from_uri(
|
||||
uri='https://example.com/image.jpg',
|
||||
mime_type='image/jpeg'
|
||||
)
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Token Calculation
|
||||
|
||||
Images consume tokens based on size:
|
||||
|
||||
**Small images** (≤384px both dimensions): 258 tokens
|
||||
|
||||
**Large images**: Tiled into 768×768 chunks, 258 tokens each
|
||||
|
||||
**Formula**:
|
||||
```
|
||||
crop_unit = floor(min(width, height) / 1.5)
|
||||
tiles = (width / crop_unit) × (height / crop_unit)
|
||||
total_tokens = tiles × 258
|
||||
```
|
||||
|
||||
**Examples**:
|
||||
- 256×256: 258 tokens (small)
|
||||
- 512×512: 258 tokens (small)
|
||||
- 960×540: 6 tiles = 1,548 tokens
|
||||
- 1920×1080: 6 tiles = 1,548 tokens
|
||||
- 3840×2160 (4K): 24 tiles = 6,192 tokens
|
||||
|
||||
## Structured Output
|
||||
|
||||
### JSON Schema Output
|
||||
|
||||
```python
|
||||
from pydantic import BaseModel
|
||||
from typing import List
|
||||
|
||||
class ObjectDetection(BaseModel):
|
||||
object_name: str
|
||||
confidence: float
|
||||
bounding_box: List[int] # [ymin, xmin, ymax, xmax]
|
||||
|
||||
class ImageAnalysis(BaseModel):
|
||||
description: str
|
||||
objects: List[ObjectDetection]
|
||||
scene_type: str
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Analyze this image', img_part],
|
||||
config=genai.types.GenerateContentConfig(
|
||||
response_mime_type='application/json',
|
||||
response_schema=ImageAnalysis
|
||||
)
|
||||
)
|
||||
|
||||
result = ImageAnalysis.model_validate_json(response.text)
|
||||
```
|
||||
|
||||
## Multi-Image Analysis
|
||||
|
||||
### Batch Processing
|
||||
|
||||
```python
|
||||
images = [
|
||||
PIL.Image.open(f'image{i}.jpg')
|
||||
for i in range(10)
|
||||
]
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Analyze these images and find common themes'] + images
|
||||
)
|
||||
```
|
||||
|
||||
### Image Comparison
|
||||
|
||||
```python
|
||||
before = PIL.Image.open('before.jpg')
|
||||
after = PIL.Image.open('after.jpg')
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Compare before and after. List all visible changes.',
|
||||
before,
|
||||
after
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Visual Search
|
||||
|
||||
```python
|
||||
reference = PIL.Image.open('target.jpg')
|
||||
candidates = [PIL.Image.open(f'option{i}.jpg') for i in range(5)]
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Find which candidate images contain objects similar to the reference',
|
||||
reference
|
||||
] + candidates
|
||||
)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Image Quality
|
||||
|
||||
1. **Resolution**: Use clear, non-blurry images
|
||||
2. **Rotation**: Verify correct orientation
|
||||
3. **Lighting**: Ensure good contrast and lighting
|
||||
4. **Size optimization**: Balance quality vs token cost
|
||||
5. **Format**: JPEG for photos, PNG for graphics
|
||||
|
||||
### Prompt Engineering
|
||||
|
||||
**Specific instructions**:
|
||||
- "Identify all vehicles with their colors and positions"
|
||||
- "Count people wearing blue shirts"
|
||||
- "Extract text from the sign in the top-left corner"
|
||||
|
||||
**Output format**:
|
||||
- "Return results as JSON with fields: category, count, description"
|
||||
- "Format as markdown table"
|
||||
- "List findings as numbered items"
|
||||
|
||||
**Few-shot examples**:
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Example: For an image of a cat on a sofa, respond: "Object: cat, Location: sofa"',
|
||||
'Now analyze this image:',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### File Management
|
||||
|
||||
1. Use File API for images >20MB
|
||||
2. Use File API for repeated queries (saves tokens)
|
||||
3. Files auto-delete after 48 hours
|
||||
4. Clean up manually:
|
||||
```python
|
||||
client.files.delete(name=myfile.name)
|
||||
```
|
||||
|
||||
### Cost Optimization
|
||||
|
||||
**Token-efficient strategies**:
|
||||
- Resize large images before upload
|
||||
- Use File API for repeated queries
|
||||
- Batch multiple images when related
|
||||
- Use appropriate model (Flash vs Pro)
|
||||
|
||||
**Token costs** (Gemini 2.5 Flash at $1/1M):
|
||||
- Small image (258 tokens): $0.000258
|
||||
- HD image (1,548 tokens): $0.001548
|
||||
- 4K image (6,192 tokens): $0.006192
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### 1. Product Analysis
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Analyze this product image:
|
||||
1. Identify the product
|
||||
2. List visible features
|
||||
3. Assess condition
|
||||
4. Estimate value range
|
||||
''',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Screenshot Analysis
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Extract all text and UI elements from this screenshot',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Medical Imaging (Informational Only)
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-pro',
|
||||
contents=[
|
||||
'Describe visible features in this medical image. Note: This is for informational purposes only.',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Chart/Graph Reading
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Extract data from this chart and format as JSON',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 5. Scene Understanding
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Analyze this scene:
|
||||
1. Location type
|
||||
2. Time of day
|
||||
3. Weather conditions
|
||||
4. Activities happening
|
||||
5. Mood/atmosphere
|
||||
''',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
def analyze_image_with_retry(image_path, prompt, max_retries=3):
|
||||
"""Analyze image with exponential backoff retry"""
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
with open(image_path, 'rb') as f:
|
||||
img_bytes = f.read()
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
prompt,
|
||||
genai.types.Part.from_bytes(
|
||||
data=img_bytes,
|
||||
mime_type='image/jpeg'
|
||||
)
|
||||
]
|
||||
)
|
||||
return response.text
|
||||
except Exception as e:
|
||||
if attempt == max_retries - 1:
|
||||
raise
|
||||
wait_time = 2 ** attempt
|
||||
print(f"Retry {attempt + 1} after {wait_time}s: {e}")
|
||||
time.sleep(wait_time)
|
||||
```
|
||||
|
||||
## Limitations
|
||||
|
||||
- Maximum 3,600 images per request
|
||||
- OCR accuracy varies with text quality
|
||||
- Object detection requires 2.0+ models
|
||||
- Segmentation requires 2.5+ models
|
||||
- No video frame extraction (use video API)
|
||||
- Regional restrictions on child images (EEA, CH, UK)
|
||||
Reference in New Issue
Block a user