Video Analysis Reference
Comprehensive guide for video understanding, temporal analysis, and YouTube processing using Gemini API.
Core Capabilities
- Video Summarization: Create concise summaries
- Question Answering: Answer specific questions about content
- Transcription: Audio transcription with visual descriptions
- Timestamp References: Query specific moments (MM:SS format)
- Video Clipping: Process specific segments
- Scene Detection: Identify scene changes and transitions
- Multiple Videos: Compare up to 10 videos (2.5+)
- YouTube Support: Analyze YouTube videos directly
- Custom Frame Rate: Adjust FPS sampling
Supported Formats
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
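When uploading or inlining a file, the request needs a matching MIME type. A small hypothetical helper (the mapping below is an assumption derived from the format list above; verify against current API docs) can pick one from the file extension:

```python
import os

# Assumed extension-to-MIME mapping for the formats listed above.
VIDEO_MIME_TYPES = {
    '.mp4': 'video/mp4',
    '.mpeg': 'video/mpeg',
    '.mpg': 'video/mpg',
    '.mov': 'video/mov',
    '.avi': 'video/avi',
    '.flv': 'video/x-flv',
    '.webm': 'video/webm',
    '.wmv': 'video/wmv',
    '.3gp': 'video/3gpp',
}

def video_mime_type(path: str) -> str:
    """Return the MIME type for a video file, or raise for unsupported formats."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return VIDEO_MIME_TYPES[ext]
    except KeyError:
        raise ValueError(f'Unsupported video format: {ext}')
```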
Model Selection
Gemini 2.5 Series
- gemini-2.5-pro: Best quality, 1M-2M context
- gemini-2.5-flash: Balanced, 1M-2M context
- gemini-2.5-flash-preview-09-2025: Preview features, 1M context
Gemini 2.0 Series
- gemini-2.0-flash: Fast processing
- gemini-2.0-flash-lite: Lightweight option
Context Windows
- 2M token models: ~2 hours (default) or ~6 hours (low-res)
- 1M token models: ~1 hour (default) or ~3 hours (low-res)
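As a sanity check, the durations above follow from the per-second token rates documented later (~300 tokens/s default, ~100 tokens/s low-res). A rough sketch (actual token counts vary per video):

```python
# Rough capacity estimate: how many seconds of video fit in a context
# window, at ~300 tokens/s (default) or ~100 tokens/s (low resolution).
def max_video_seconds(context_tokens: int, low_res: bool = False) -> int:
    tokens_per_second = 100 if low_res else 300
    return context_tokens // tokens_per_second
```

For a 1M-token model this gives ~3,333 s (just under an hour) at default resolution, matching the figures above.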
Basic Video Analysis
Local Video
from google import genai
import os
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
# Upload video (File API for >20MB)
myfile = client.files.upload(file='video.mp4')
# Wait for processing
import time
while myfile.state.name == 'PROCESSING':
    time.sleep(1)
    myfile = client.files.get(name=myfile.name)

if myfile.state.name == 'FAILED':
    raise ValueError('Video processing failed')

# Analyze
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this video in 3 key points', myfile]
)
print(response.text)
YouTube Video
from google.genai import types

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Summarize the main topics discussed',
        types.Part.from_uri(
            file_uri='https://www.youtube.com/watch?v=VIDEO_ID',
            mime_type='video/mp4'
        )
    ]
)
Inline Video (<20MB)
with open('short-clip.mp4', 'rb') as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'What happens in this video?',
        types.Part.from_bytes(data=video_bytes, mime_type='video/mp4')
    ]
)
Advanced Features
Video Clipping
# Analyze a specific time range (40s-80s)
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Summarize this segment',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type='video/mp4'),
            video_metadata=types.VideoMetadata(
                start_offset='40s',
                end_offset='80s'
            )
        )
    ]
)
Custom Frame Rate
# Lower FPS for static content (saves tokens)
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this presentation',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type='video/mp4'),
            video_metadata=types.VideoMetadata(fps=0.5)  # Sample every 2 seconds
        )
    ]
)

# Higher FPS for fast-moving content
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze rapid movements in this sports video',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type='video/mp4'),
            video_metadata=types.VideoMetadata(fps=5)  # Sample 5 times per second
        )
    ]
)
Multiple Videos (2.5+)
video1 = client.files.upload(file='demo1.mp4')
video2 = client.files.upload(file='demo2.mp4')
# Wait for processing, keeping the refreshed file handles
videos = []
for video in [video1, video2]:
    while video.state.name == 'PROCESSING':
        time.sleep(1)
        video = client.files.get(name=video.name)
    videos.append(video)
video1, video2 = videos

response = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=[
        'Compare these two product demos. Which explains features better?',
        video1,
        video2
    ]
)
Temporal Understanding
Timestamp-Based Questions
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'What happens at 01:15 and how does it relate to 02:30?',
        myfile
    ]
)
Timeline Creation
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Create a timeline with timestamps:
- Key events
- Scene changes
- Important moments
Format: MM:SS - Description
''',
        myfile
    ]
)
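The "MM:SS - Description" output requested above is easy to post-process. A hypothetical parser (the regex and tuple shape are illustrative choices, not part of the API):

```python
import re

# Matches lines like "01:15 - Intro ends" or "- 02:30 - Demo starts".
TIMELINE_RE = re.compile(r'^\s*-?\s*(\d{1,2}):(\d{2})\s*-\s*(.+)$')

def parse_timeline(text: str) -> list[tuple[int, str]]:
    """Parse 'MM:SS - Description' lines into (seconds, description) pairs."""
    events = []
    for line in text.splitlines():
        m = TIMELINE_RE.match(line)
        if m:
            minutes, seconds, desc = m.groups()
            events.append((int(minutes) * 60 + int(seconds), desc.strip()))
    return events
```

Non-matching lines (headers, blank lines) are simply skipped, which keeps the parser tolerant of extra prose in the model's response.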
Scene Detection
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Identify all scene changes with timestamps and describe each scene',
        myfile
    ]
)
Transcription
Basic Transcription
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Transcribe the audio from this video',
        myfile
    ]
)
With Visual Descriptions
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Transcribe with visual context:
- Audio transcription
- Visual descriptions of important moments
- Timestamps for salient events
''',
        myfile
    ]
)
Speaker Identification
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Transcribe with speaker labels and timestamps',
        myfile
    ]
)
Common Use Cases
1. Video Summarization
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Summarize this video:
1. Main topic and purpose
2. Key points with timestamps
3. Conclusion or call-to-action
''',
        myfile
    ]
)
2. Educational Content
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Create educational materials:
1. List key concepts taught
2. Create 5 quiz questions with answers
3. Provide timestamp for each concept
''',
        myfile
    ]
)
3. Action Detection
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'List all actions performed in this tutorial with timestamps',
        myfile
    ]
)
4. Content Moderation
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Review video content:
1. Identify any problematic content
2. Note timestamps of concerns
3. Provide content rating recommendation
''',
        myfile
    ]
)
5. Interview Analysis
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze interview:
1. Questions asked (timestamps)
2. Key responses
3. Candidate body language and demeanor
4. Overall assessment
''',
        myfile
    ]
)
6. Sports Analysis
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze sports video:
1. Key plays with timestamps
2. Player movements and positioning
3. Game strategy observations
''',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type='video/mp4'),
            video_metadata=types.VideoMetadata(fps=5)  # Higher FPS for fast action
        )
    ]
)
YouTube Specific Features
Public Video Requirements
- Video must be public (not private or unlisted)
- No age-restricted content
- Valid video ID required
Usage Example
# YouTube URL
youtube_uri = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Create chapter markers with timestamps',
        types.Part.from_uri(file_uri=youtube_uri, mime_type='video/mp4')
    ]
)
Rate Limits
- Free tier: 8 hours of YouTube video per day
- Paid tier: No length-based limits
- Public videos only
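The free tier's 8-hour daily allowance can be tracked with a trivial bookkeeping helper; this is purely client-side accounting, not part of the API:

```python
# Hypothetical quota tracker for the free tier's 8 h/day YouTube limit.
FREE_TIER_DAILY_SECONDS = 8 * 60 * 60  # 28,800 seconds

def remaining_quota_seconds(processed_durations: list[int]) -> int:
    """Seconds of YouTube video still available today (never negative)."""
    return max(0, FREE_TIER_DAILY_SECONDS - sum(processed_durations))
```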
Token Calculation
Video tokens depend on resolution and FPS:
Default resolution (~300 tokens/second):
- 1 minute = 18,000 tokens
- 10 minutes = 180,000 tokens
- 1 hour = 1,080,000 tokens
Low resolution (~100 tokens/second):
- 1 minute = 6,000 tokens
- 10 minutes = 60,000 tokens
- 1 hour = 360,000 tokens
Context windows:
- 2M tokens ≈ 2 hours (default) or 6 hours (low-res)
- 1M tokens ≈ 1 hour (default) or 3 hours (low-res)
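The figures above can be wrapped in a small estimator. The per-second rates are the approximations listed here, not exact API accounting, and the default $1/1M rate is illustrative:

```python
# Estimate video token usage from the approximate per-second rates above.
def estimate_video_tokens(duration_seconds: float, low_res: bool = False) -> int:
    tokens_per_second = 100 if low_res else 300
    return int(duration_seconds * tokens_per_second)

def estimate_cost_usd(tokens: int, usd_per_million: float = 1.0) -> float:
    """Convert a token count to dollars at a given per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million
```

For example, a 1-minute video at default resolution estimates to 18,000 tokens, matching the table above.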
Best Practices
File Management
- Use File API for videos >20MB (most videos)
- Wait for ACTIVE state before analysis
- Files auto-delete after 48 hours
- Clean up manually:
client.files.delete(name=myfile.name)
Optimization Strategies
Reduce token usage:
- Process specific segments using start/end offsets
- Use lower FPS for static content
- Use low-resolution mode for long videos
- Split very long videos into chunks
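The "split into chunks" strategy can be sketched as a generator that yields the start/end offset strings used for video clipping; the 10-minute default chunk length is an arbitrary example:

```python
# Produce ('<start>s', '<end>s') offset pairs covering the whole video,
# suitable for per-segment clipping requests.
def chunk_offsets(duration_seconds: int, chunk_seconds: int = 600):
    """Yield (start_offset, end_offset) strings covering the full duration."""
    for start in range(0, duration_seconds, chunk_seconds):
        end = min(start + chunk_seconds, duration_seconds)
        yield (f'{start}s', f'{end}s')
```

The final chunk is clipped to the video's actual duration, so a 25-minute video yields chunks of 10, 10, and 5 minutes.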
Improve accuracy:
- Provide context in prompts
- Use higher FPS for fast-moving content
- Use Pro model for complex analysis
- Be specific about what to extract
Prompt Engineering
Effective prompts:
- "Summarize key points with timestamps in MM:SS format"
- "Identify all scene changes and describe each scene"
- "Extract action items mentioned with timestamps"
- "Compare these two videos on: X, Y, Z criteria"
Structured output:
from pydantic import BaseModel
from typing import List

class VideoEvent(BaseModel):
    timestamp: str  # MM:SS format
    description: str
    category: str

class VideoAnalysis(BaseModel):
    summary: str
    events: List[VideoEvent]
    duration: str

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze this video', myfile],
    config=types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=VideoAnalysis
    )
)
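With `response_schema` set, `response.text` is JSON matching the schema and can be validated back into the pydantic model. A sketch using a hand-written sample payload (no API call is made here):

```python
import json
from pydantic import BaseModel
from typing import List

class VideoEvent(BaseModel):
    timestamp: str  # MM:SS format
    description: str
    category: str

class VideoAnalysis(BaseModel):
    summary: str
    events: List[VideoEvent]
    duration: str

# Sample payload standing in for response.text from a structured-output call.
sample = json.dumps({
    'summary': 'Product demo',
    'events': [{'timestamp': '01:15',
                'description': 'Feature walkthrough',
                'category': 'demo'}],
    'duration': '05:30',
})

analysis = VideoAnalysis(**json.loads(sample))
```

Validating in this way catches missing fields or type mismatches immediately instead of failing later in downstream code.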
Error Handling
import time

def upload_and_process_video(file_path, max_wait=300):
    """Upload a video and poll until processing completes or times out."""
    myfile = client.files.upload(file=file_path)
    elapsed = 0
    while myfile.state.name == 'PROCESSING' and elapsed < max_wait:
        time.sleep(5)
        myfile = client.files.get(name=myfile.name)
        elapsed += 5
    if myfile.state.name == 'FAILED':
        raise ValueError(f'Video processing failed: {myfile.state.name}')
    if myfile.state.name == 'PROCESSING':
        raise TimeoutError(f'Processing timeout after {max_wait}s')
    return myfile
Cost Optimization
Example token costs (assuming an illustrative rate of $1 per 1M input tokens; check current Gemini pricing):
- 1 minute video (default): 18,000 tokens = $0.018
- 10 minute video: 180,000 tokens = $0.18
- 1 hour video: 1,080,000 tokens = $1.08
Strategies:
- Use video clipping for specific segments
- Lower FPS for static content
- Use low-resolution mode for long videos
- Batch related queries on same video
- Use context caching for repeated queries
Limitations
- Maximum 6 hours (low-res) or 2 hours (default)
- YouTube videos must be public
- No live streaming analysis
- Files expire after 48 hours
- Processing time varies by video length
- No real-time processing
- Limited to 10 videos per request (2.5+)