# Video Analysis Reference

Comprehensive guide to video understanding, temporal analysis, and YouTube processing with the Gemini API.

## Core Capabilities

- **Video Summarization**: Create concise summaries
- **Question Answering**: Answer specific questions about content
- **Transcription**: Transcribe audio, optionally with visual descriptions
- **Timestamp References**: Query specific moments (MM:SS format)
- **Video Clipping**: Process specific segments
- **Scene Detection**: Identify scene changes and transitions
- **Multiple Videos**: Compare up to 10 videos per request (Gemini 2.5+)
- **YouTube Support**: Analyze YouTube videos directly
- **Custom Frame Rate**: Adjust FPS sampling

## Supported Formats

- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP

## Model Selection

### Gemini 2.5 Series

- **gemini-2.5-pro**: Best quality, 1M-2M context
- **gemini-2.5-flash**: Balanced, 1M-2M context
- **gemini-2.5-flash-preview-09-2025**: Preview features, 1M context

### Gemini 2.0 Series

- **gemini-2.0-flash**: Fast processing
- **gemini-2.0-flash-lite**: Lightweight option

### Context Windows

- **2M-token models**: ~2 hours (default resolution) or ~6 hours (low resolution)
- **1M-token models**: ~1 hour (default resolution) or ~3 hours (low resolution)

## Basic Video Analysis

### Local Video

```python
import os
import time

from google import genai

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Upload via the File API (needed when the request exceeds 20MB)
myfile = client.files.upload(file='video.mp4')

# Poll until processing finishes (state becomes ACTIVE or FAILED)
while myfile.state.name == 'PROCESSING':
    time.sleep(1)
    myfile = client.files.get(name=myfile.name)

if myfile.state.name == 'FAILED':
    raise ValueError('Video processing failed')

# Analyze
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this video in 3 key points', myfile]
)
print(response.text)
```

### YouTube Video

```python
from google.genai import types

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Summarize the main topics discussed',
        types.Part.from_uri(
            file_uri='https://www.youtube.com/watch?v=VIDEO_ID',
            mime_type='video/mp4'
        )
    ]
)
print(response.text)
```

### Inline Video (<20MB)

Inline bytes travel inside the request, so the total request size (video plus prompt) must stay under 20MB.

```python
from google.genai import types

with open('short-clip.mp4', 'rb') as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'What happens in this video?',
        types.Part.from_bytes(data=video_bytes, mime_type='video/mp4')
    ]
)
print(response.text)
```

## Advanced Features

### Video Clipping

Clipping and frame-rate settings go in `types.VideoMetadata` on the video part; offsets are strings in seconds such as `'40s'`.

```python
from google.genai import types

# Analyze only the 40s-80s range of an uploaded file
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Summarize this segment',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
            video_metadata=types.VideoMetadata(
                start_offset='40s',
                end_offset='80s'
            )
        )
    ]
)
```

### Custom Frame Rate

The default sampling rate is 1 frame per second.

```python
# Lower FPS for static content (saves tokens)
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this presentation',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
            video_metadata=types.VideoMetadata(fps=0.5)  # One frame every 2 seconds
        )
    ]
)

# Higher FPS for fast-moving content
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze rapid movements in this sports video',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
            video_metadata=types.VideoMetadata(fps=5)  # 5 frames per second
        )
    ]
)
```

### Multiple Videos (2.5+)

```python
import time


def wait_until_active(f):
    """Poll a File until processing finishes and return the refreshed object."""
    while f.state.name == 'PROCESSING':
        time.sleep(1)
        f = client.files.get(name=f.name)
    if f.state.name == 'FAILED':
        raise ValueError(f'Processing failed for {f.name}')
    return f


# Keep the refreshed objects; rebinding a for-loop variable would leave
# video1 and video2 pointing at the stale upload responses
video1 = wait_until_active(client.files.upload(file='demo1.mp4'))
video2 = wait_until_active(client.files.upload(file='demo2.mp4'))

response = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=[
        'Compare these two product demos. Which explains features better?',
        video1,
        video2
    ]
)
```

## Temporal Understanding

### Timestamp-Based Questions

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'What happens at 01:15 and how does it relate to 02:30?',
        myfile
    ]
)
```

### Timeline Creation

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Create a timeline with timestamps:
        - Key events
        - Scene changes
        - Important moments
        Format: MM:SS - Description
        ''',
        myfile
    ]
)
```

### Scene Detection

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Identify all scene changes with timestamps and describe each scene',
        myfile
    ]
)
```

## Transcription

### Basic Transcription

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Transcribe the audio from this video',
        myfile
    ]
)
```

### With Visual Descriptions

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Transcribe with visual context:
        - Audio transcription
        - Visual descriptions of important moments
        - Timestamps for salient events
        ''',
        myfile
    ]
)
```

### Speaker Identification

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Transcribe with speaker labels and timestamps',
        myfile
    ]
)
```

## Common Use Cases

### 1. Video Summarization

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Summarize this video:
        1. Main topic and purpose
        2. Key points with timestamps
        3. Conclusion or call-to-action
        ''',
        myfile
    ]
)
```

### 2. Educational Content

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Create educational materials:
        1. List key concepts taught
        2. Create 5 quiz questions with answers
        3. Provide a timestamp for each concept
        ''',
        myfile
    ]
)
```

### 3. Action Detection

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'List all actions performed in this tutorial with timestamps',
        myfile
    ]
)
```

### 4. Content Moderation

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Review video content:
        1. Identify any problematic content
        2. Note timestamps of concerns
        3. Provide a content rating recommendation
        ''',
        myfile
    ]
)
```

### 5. Interview Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze interview:
        1. Questions asked (timestamps)
        2. Key responses
        3. Candidate body language and demeanor
        4. Overall assessment
        ''',
        myfile
    ]
)
```

### 6. Sports Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze sports video:
        1. Key plays with timestamps
        2. Player movements and positioning
        3. Game strategy observations
        ''',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
            video_metadata=types.VideoMetadata(fps=5)  # Higher FPS for fast action
        )
    ]
)
```

## YouTube Specific Features

### Public Video Requirements

- Video must be public (not private or unlisted)
- No age-restricted content
- Valid video ID required

### Usage Example

```python
# YouTube URL
youtube_uri = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Create chapter markers with timestamps',
        types.Part.from_uri(file_uri=youtube_uri, mime_type='video/mp4')
    ]
)
```

### Rate Limits

- **Free tier**: 8 hours of YouTube video per day
- **Paid tier**: No length-based limits
- Public videos only

## Token Calculation

Video tokens depend on media resolution and frame rate. At the default 1 FPS:

**Default resolution** (~300 tokens/second):

- 1 minute = 18,000 tokens
- 10 minutes = 180,000 tokens
- 1 hour = 1,080,000 tokens

**Low resolution** (~100 tokens/second):

- 1 minute = 6,000 tokens
- 10 minutes = 60,000 tokens
- 1 hour = 360,000 tokens

**Context windows**:

- 2M tokens ≈ 2 hours (default) or 6 hours (low resolution)
- 1M tokens ≈ 1 hour (default) or 3 hours (low resolution)

## Best Practices

### File Management

1. Use the File API for videos over 20MB (most videos)
2. Wait for the ACTIVE state before analysis
3. Uploaded files are deleted automatically after 48 hours
4. To clean up sooner, delete files manually:

```python
client.files.delete(name=myfile.name)
```

### Optimization Strategies

**Reduce token usage**:

- Process specific segments using start/end offsets
- Use lower FPS for static content
- Use low media resolution for long videos (`media_resolution=types.MediaResolution.MEDIA_RESOLUTION_LOW` in `GenerateContentConfig`)
- Split very long videos into chunks

**Improve accuracy**:

- Provide context in prompts
- Use higher FPS for fast-moving content
- Use the Pro model for complex analysis
- Be specific about what to extract

### Prompt Engineering

**Effective prompts**:

- "Summarize key points with timestamps in MM:SS format"
- "Identify all scene changes and describe each scene"
- "Extract action items mentioned with timestamps"
- "Compare these two videos on: X, Y, Z criteria"

**Structured output**:

```python
from typing import List

from google.genai import types
from pydantic import BaseModel


class VideoEvent(BaseModel):
    timestamp: str  # MM:SS format
    description: str
    category: str


class VideoAnalysis(BaseModel):
    summary: str
    events: List[VideoEvent]
    duration: str


response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze this video', myfile],
    config=types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=VideoAnalysis
    )
)

# With a Pydantic schema, the SDK exposes the parsed object
analysis = response.parsed
print(analysis.summary)
```

### Error Handling

```python
import time


def upload_and_process_video(file_path, max_wait=300):
    """Upload a video and wait for processing (uses the global `client`)."""
    myfile = client.files.upload(file=file_path)

    elapsed = 0
    while myfile.state.name == 'PROCESSING' and elapsed < max_wait:
        time.sleep(5)
        myfile = client.files.get(name=myfile.name)
        elapsed += 5

    if myfile.state.name == 'FAILED':
        raise ValueError(f'Video processing failed for {myfile.name}')
    if myfile.state.name == 'PROCESSING':
        raise TimeoutError(f'Processing timeout after {max_wait}s')

    return myfile
```
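The MM:SS timestamps requested in the prompts above and the "split very long videos into chunks" strategy both end up as `start_offset`/`end_offset` strings like `'40s'`. The following is a minimal pure-Python sketch of helpers for that, assuming you already know the video's duration in seconds from your own tooling. The function names and the 10-minute default chunk length are hypothetical choices, not part of the Gemini SDK.

```python
from typing import List, Tuple


def timestamp_to_seconds(ts: str) -> int:
    """Convert 'MM:SS' or 'HH:MM:SS' (e.g. a VideoEvent.timestamp) to seconds."""
    parts = [int(p) for p in ts.strip().split(':')]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f'unexpected timestamp: {ts!r}')
    seconds = 0
    for p in parts:
        seconds = seconds * 60 + p
    return seconds


def to_offset(seconds: int) -> str:
    """Format seconds in the '<seconds>s' form used for clip offsets."""
    return f'{seconds}s'


def chunk_offsets(duration_s: int, chunk_s: int = 600,
                  overlap_s: int = 0) -> List[Tuple[str, str]]:
    """Split [0, duration_s) into (start_offset, end_offset) pairs.

    Hypothetical helper: overlap_s repeats a few seconds between neighbouring
    chunks so an event on a boundary is not cut in half.
    """
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError('need chunk_s > 0 and 0 <= overlap_s < chunk_s')
    step = chunk_s - overlap_s
    chunks = []
    start = 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((to_offset(start), to_offset(end)))
        if end == duration_s:
            break  # Last chunk reaches the end of the video
        start += step
    return chunks
```

Each `(start, end)` pair can then be passed as `types.VideoMetadata(start_offset=start, end_offset=end)` on the file part, one request per chunk. A small `overlap_s` costs a few extra tokens but keeps boundary events visible in both neighbouring chunks.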
## Cost Optimization

**Token costs** (Gemini 2.5 Flash at an illustrative $1 per 1M input tokens; check the current Gemini API pricing, since rates vary by model and input modality):

- 1-minute video (default resolution): 18,000 tokens = $0.018
- 10-minute video (default resolution): 180,000 tokens = $0.18
- 1-hour video (default resolution): 1,080,000 tokens = $1.08

**Strategies**:

- Use video clipping for specific segments
- Lower FPS for static content
- Use low-resolution mode for long videos
- Batch related queries on the same video
- Use context caching for repeated queries

## Limitations

- Maximum ~6 hours (low resolution) or ~2 hours (default resolution) with 2M-token models
- YouTube videos must be public
- No live-stream analysis
- Uploaded files expire after 48 hours
- Processing time varies with video length
- No real-time processing
- Up to 10 videos per request (Gemini 2.5+)
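The token counts, costs, and duration limits in the Token Calculation, Cost Optimization, and Limitations sections are rate × duration arithmetic. The following small sketch reproduces them, assuming the approximate rates quoted in this guide (~300 tokens/second at default resolution, ~100 at low resolution, both at 1 FPS). The `price_per_million` default of 1.0 is the illustrative rate from the cost table, not an official price; for exact counts, `client.models.count_tokens` can be called on the actual request.

```python
# Approximate tokens per second of video (1 FPS), as quoted in this guide
TOKENS_PER_SECOND = {'default': 300, 'low': 100}


def estimate_tokens(duration_s: int, resolution: str = 'default') -> int:
    """Approximate video tokens: per-second rate times duration."""
    return TOKENS_PER_SECOND[resolution] * duration_s


def estimate_cost(tokens: int, price_per_million: float = 1.0) -> float:
    """Dollar cost at an assumed input price per 1M tokens."""
    return tokens * price_per_million / 1_000_000


def fits_in_context(duration_s: int, context_tokens: int,
                    resolution: str = 'default') -> bool:
    """Rough check of whether a video alone fits in a context window."""
    return estimate_tokens(duration_s, resolution) <= context_tokens
```

At exactly 300 tokens/second a full hour (1,080,000 tokens) slightly exceeds 1,000,000 tokens, which is why the context-window durations above are stated as approximations; leave headroom for the prompt and the response.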