Audio Processing Reference
Comprehensive guide for audio analysis and speech generation using Gemini API.
Audio Understanding
Supported Formats
| Format | MIME Type | Best Use |
|---|---|---|
| WAV | `audio/wav` | Uncompressed, highest quality |
| MP3 | `audio/mp3` | Compressed, widely compatible |
| AAC | `audio/aac` | Compressed, good quality |
| FLAC | `audio/flac` | Lossless compression |
| OGG Vorbis | `audio/ogg` | Open format |
| AIFF | `audio/aiff` | Apple format |
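For inline requests (shown under Input Methods below), `types.Part.from_bytes` needs an explicit MIME type. A minimal sketch mapping file extensions to the MIME types in the table above; the helper name and error handling are illustrative, not part of the SDK:

```python
from pathlib import Path

# Extension -> MIME type, following the supported-formats table above.
AUDIO_MIME_TYPES = {
    '.wav': 'audio/wav',
    '.mp3': 'audio/mp3',
    '.aac': 'audio/aac',
    '.flac': 'audio/flac',
    '.ogg': 'audio/ogg',
    '.aiff': 'audio/aiff',
}

def audio_mime_type(path: str) -> str:
    """Return the MIME type for a supported audio file, or raise otherwise."""
    ext = Path(path).suffix.lower()
    if ext not in AUDIO_MIME_TYPES:
        raise ValueError(f'Unsupported audio format: {ext}')
    return AUDIO_MIME_TYPES[ext]
```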
Specifications
- Maximum length: 9.5 hours per request
- Multiple files: unlimited count, combined maximum of 9.5 hours
- Token rate: 32 tokens/second (1 minute = 1,920 tokens); see the estimation sketch below
- Processing: audio is auto-downsampled to 16 Kbps, mono
- File size limits:
  - Inline: 20 MB maximum total request size
  - File API: 2 GB per file, 20 GB project quota
- Retention: files are auto-deleted after 48 hours
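Because billing scales with the 32 tokens/second rate, it can be worth checking the token count of an uploaded file before running a full request. A minimal sketch using the SDK's `count_tokens` call; the file name is a placeholder:

```python
from google import genai
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Upload the audio, then ask how many input tokens it will consume.
myfile = client.files.upload(file='meeting.mp3')
token_info = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents=[myfile]
)
print(token_info.total_tokens)  # roughly 32 tokens per second of audio
```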
Transcription
Basic Transcription
```python
from google import genai
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Upload audio
myfile = client.files.upload(file='meeting.mp3')

# Transcribe
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate a transcript of the speech.', myfile]
)
print(response.text)
```
With Timestamps
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate transcript with timestamps in MM:SS format.', myfile]
)
```
Multi-Speaker Identification
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe with speaker labels. Format: [Speaker 1], [Speaker 2], etc.', myfile]
)
```
Segment-Specific Transcription
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe only the segment from 02:30 to 05:15.', myfile]
)
```
Audio Analysis
Summarization
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize key points in 5 bullets with timestamps.', myfile]
)
```
Non-Speech Audio Analysis
```python
# Music analysis
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Identify the musical instruments and genre.', myfile]
)

# Environmental sounds
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Identify all sounds: voices, music, ambient noise.', myfile]
)

# Birdsong identification
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Identify bird species based on their calls.', myfile]
)
```
Timestamp-Based Analysis
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What is discussed from 10:30 to 15:45? Provide key points.', myfile]
)
```
Input Methods
File Upload (>20MB or Reuse)
```python
# Upload once, use multiple times
myfile = client.files.upload(file='large-audio.mp3')

# First query
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this', myfile]
)

# Second query (reuses same file)
response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this', myfile]
)
```
Inline Data (<20MB)
```python
from google.genai import types

with open('small-audio.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this audio',
        types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
    ]
)
```
Speech Generation (TTS)
Available Models
| Model | Quality | Speed | Cost/1M tokens |
|---|---|---|---|
| `gemini-2.5-flash-native-audio-preview-09-2025` | High | Fast | $10 |
| `gemini-2.5-pro` (TTS mode) | Premium | Slower | $20 |
Basic TTS
```python
from google.genai import types

# Request audio output explicitly; without an AUDIO response modality the model returns text.
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents="Generate audio: Welcome to today's episode.",
    config=types.GenerateContentConfig(response_modalities=['AUDIO'])
)

# Save audio: returned as inline data on the first candidate part (raw PCM;
# wrap it in a WAV header, e.g. with the wave module, for playback)
audio_bytes = response.candidates[0].content.parts[0].inline_data.data
with open('output.wav', 'wb') as f:
    f.write(audio_bytes)
```
Controllable Voice Style
```python
# Professional tone
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents='Generate audio in a professional, clear tone: Welcome to our quarterly earnings call.',
    config=types.GenerateContentConfig(response_modalities=['AUDIO'])
)

# Casual and friendly
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents="Generate audio in a friendly, conversational tone: Hey there! Let's dive into today's topic.",
    config=types.GenerateContentConfig(response_modalities=['AUDIO'])
)

# Narrative style
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents='Generate audio in a narrative, storytelling tone: Once upon a time, in a land far away...',
    config=types.GenerateContentConfig(response_modalities=['AUDIO'])
)
```
Voice Control Parameters
- Style: Professional, casual, narrative, conversational
- Pace: Slow, normal, fast
- Tone: Friendly, serious, enthusiastic
- Accent: Natural language control (e.g., "British accent", "Southern drawl")
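These controls are plain natural-language instructions, so they can be composed programmatically before being prepended to the script. A minimal sketch; the helper and its defaults are illustrative, not part of the SDK:

```python
def tts_prompt(script: str, style: str = 'professional', pace: str = 'normal',
               tone: str = 'friendly', accent: str | None = None) -> str:
    """Compose a natural-language voice instruction for a TTS request."""
    directions = f'a {style}, {tone} tone at a {pace} pace'
    if accent:
        directions += f' with a {accent}'
    return f'Generate audio in {directions}: {script}'

# -> 'Generate audio in a narrative, friendly tone at a slow pace with a British accent: ...'
prompt = tts_prompt('Welcome back to the show.', style='narrative', pace='slow', accent='British accent')
```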
Best Practices
File Management
- Use the File API for files >20 MB
- Use the File API for repeated queries (saves tokens)
- Files auto-delete after 48 hours
- Clean up manually when done with `client.files.delete(name=myfile.name)` (see the cleanup sketch below)
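For batch cleanup at the end of a job, the File API can also list everything currently stored under the project and remove each entry:

```python
# Remove every file currently held under the project's File API quota.
for f in client.files.list():
    client.files.delete(name=f.name)
    print(f'Deleted {f.name}')
```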
Prompt Engineering
Effective prompts:
- "Transcribe from 02:30 to 03:29 in MM:SS format"
- "Identify speakers and extract dialogue with timestamps"
- "Summarize key points with relevant timestamps"
- "Transcribe and analyze sentiment for each speaker"
Context improves accuracy:
- "This is a medical interview - use appropriate terminology"
- "Transcribe this legal deposition with precise terminology"
- "This is a technical podcast about machine learning"
Combined tasks:
- "Transcribe and summarize in bullet points"
- "Extract key quotes with timestamps and speaker labels"
- "Transcribe and identify action items with timestamps"
Cost Optimization
Token calculation:
- 1 minute audio = 1,920 tokens
- 1 hour audio = 115,200 tokens
- 9.5 hours = 1,094,400 tokens
Model selection:
- Use `gemini-2.5-flash` ($1/1M tokens) for most tasks
- Upgrade to `gemini-2.5-pro` ($3/1M tokens) for complex analysis
- For high-volume work: `gemini-1.5-flash` ($0.70/1M tokens)
Reduce costs:
- Process only relevant segments using timestamps (see the trimming sketch after this list)
- Use lower-quality audio when possible
- Batch multiple short files in one request
- Cache context for repeated queries
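One way to act on the first point is to trim the recording locally before upload so fewer audio tokens are billed. A sketch assuming the third-party pydub library (not part of the Gemini SDK; requires ffmpeg):

```python
from pydub import AudioSegment  # third-party: pip install pydub

# Keep only 02:30-05:15 of the recording before uploading it.
audio = AudioSegment.from_file('meeting.mp3')
segment = audio[150 * 1000:315 * 1000]  # pydub slices in milliseconds
segment.export('meeting-segment.mp3', format='mp3')

myfile = client.files.upload(file='meeting-segment.mp3')
```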
Error Handling
```python
import time

def transcribe_with_retry(file_path, max_retries=3):
    """Transcribe audio with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            myfile = client.files.upload(file=file_path)
            response = client.models.generate_content(
                model='gemini-2.5-flash',
                contents=['Transcribe with timestamps', myfile]
            )
            return response.text
        except Exception:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt
            print(f"Retry {attempt + 1} after {wait_time}s")
            time.sleep(wait_time)
```
Common Use Cases
1. Meeting Transcription
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Transcribe this meeting with:
        1. Speaker labels
        2. Timestamps for topic changes
        3. Action items highlighted''',
        myfile
    ]
)
```
2. Podcast Summary
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Create podcast summary with:
        1. Main topics with timestamps
        2. Key quotes from each speaker
        3. Recommended episode highlights''',
        myfile
    ]
)
```
3. Interview Analysis
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze interview:
        1. Questions asked with timestamps
        2. Key responses from interviewee
        3. Overall sentiment and tone''',
        myfile
    ]
)
```
4. Content Verification
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Verify audio content:
        1. Check for specific keywords or phrases
        2. Identify any compliance issues
        3. Note any concerning statements with timestamps''',
        myfile
    ]
)
```
5. Multilingual Transcription
```python
# Gemini auto-detects language
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this audio and translate to English if needed.', myfile]
)
```
Token Costs
Audio Input (32 tokens/second):
- 1 minute = 1,920 tokens
- 10 minutes = 19,200 tokens
- 1 hour = 115,200 tokens
- 9.5 hours = 1,094,400 tokens
Example costs (Gemini 2.5 Flash at $1/1M):
- 1 hour audio: 115,200 tokens = $0.12
- Full day podcast (8 hours): 921,600 tokens = $0.92
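The same arithmetic as a small helper for estimating input cost before submitting a job (rate and prices taken from the figures above; output tokens not included):

```python
TOKENS_PER_SECOND = 32  # audio input rate quoted above

def audio_input_cost(duration_seconds: float, price_per_million_tokens: float = 1.00) -> float:
    """Estimate audio input-token cost in USD for a clip of the given length."""
    tokens = duration_seconds * TOKENS_PER_SECOND
    return tokens / 1_000_000 * price_per_million_tokens

print(audio_input_cost(3600))      # 1 hour on Gemini 2.5 Flash -> ~$0.12
print(audio_input_cost(8 * 3600))  # 8-hour podcast -> ~$0.92
```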
Limitations
- Maximum 9.5 hours per request
- Auto-downsampled to 16 Kbps mono (quality loss)
- Files expire after 48 hours
- No real-time streaming support
- Non-speech audio analysis is less accurate than speech transcription