# Audio Processing Reference

Comprehensive guide to audio analysis and speech generation with the Gemini API.

## Audio Understanding

### Supported Formats

| Format | MIME Type | Best Use |
|--------|-----------|----------|
| WAV | `audio/wav` | Uncompressed, highest quality |
| MP3 | `audio/mp3` | Compressed, widely compatible |
| AAC | `audio/aac` | Compressed, good quality |
| FLAC | `audio/flac` | Lossless compression |
| OGG Vorbis | `audio/ogg` | Open format |
| AIFF | `audio/aiff` | Apple format |

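When passing inline audio, the MIME type has to match the file. A minimal sketch of an extension lookup built from the table above; the names `AUDIO_MIME_TYPES` and `audio_mime_type` are illustrative helpers, not part of the SDK.

```python
from pathlib import Path

# Extension -> MIME type, copied from the Supported Formats table.
AUDIO_MIME_TYPES = {
    '.wav': 'audio/wav',
    '.mp3': 'audio/mp3',
    '.aac': 'audio/aac',
    '.flac': 'audio/flac',
    '.ogg': 'audio/ogg',
    '.aiff': 'audio/aiff',
}


def audio_mime_type(path):
    """Return the MIME type for a supported audio file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in AUDIO_MIME_TYPES:
        raise ValueError(f'Unsupported audio format: {path!r}')
    return AUDIO_MIME_TYPES[suffix]
```

The lookup is case-insensitive on the extension, so `meeting.MP3` and `meeting.mp3` resolve to the same type.
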
### Specifications

- **Maximum length**: 9.5 hours per request
- **Multiple files**: Unlimited count, combined max 9.5 hours
- **Token rate**: 32 tokens/second (1 minute = 1,920 tokens)
- **Processing**: Auto-downsampled to 16 Kbps mono
- **File size limits**:
  - Inline: 20 MB max total request
  - File API: 2 GB per file, 20 GB project quota
  - Retention: 48 hours auto-delete

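The token rate and length limit above lend themselves to a quick pre-flight check. The sketch below assumes you already know each clip's duration in seconds; `audio_tokens` and `fits_in_one_request` are hypothetical helper names, and the result is an estimate rather than the exact count the API will report.

```python
TOKENS_PER_SECOND = 32           # token rate from the specifications above
MAX_AUDIO_SECONDS = 9.5 * 3600   # 9.5-hour per-request limit


def audio_tokens(duration_seconds):
    """Estimate input tokens for audio of the given length in seconds."""
    if duration_seconds < 0:
        raise ValueError('duration must be non-negative')
    return int(duration_seconds * TOKENS_PER_SECOND)


def fits_in_one_request(durations_seconds):
    """True if the combined length of all clips stays within 9.5 hours."""
    return sum(durations_seconds) <= MAX_AUDIO_SECONDS


print(audio_tokens(60))                            # 1920
print(audio_tokens(3600))                          # 115200
print(fits_in_one_request([5 * 3600, 5 * 3600]))   # False
```
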
## Transcription

### Basic Transcription

```python
from google import genai
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Upload audio
myfile = client.files.upload(file='meeting.mp3')

# Transcribe
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate a transcript of the speech.', myfile]
)
print(response.text)
```

### With Timestamps

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate transcript with timestamps in MM:SS format.', myfile]
)
```

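Timestamped transcripts are easier to use once parsed. The sketch below assumes the model puts an `MM:SS` stamp (optionally in square brackets) at the start of each line; that layout is an assumption about the output, not a guaranteed format, so unmatched lines are skipped. `parse_timestamped_lines` is an illustrative helper.

```python
import re

# Matches "[MM:SS] text", "MM:SS text" and "MM:SS - text" at the start of a line.
TIMESTAMP_LINE = re.compile(r'^\[?(\d{1,3}):([0-5]\d)\]?\s*[-:]?\s*(.*)$')


def parse_timestamped_lines(transcript):
    """Return (seconds, text) pairs for lines that start with MM:SS."""
    entries = []
    for line in transcript.splitlines():
        match = TIMESTAMP_LINE.match(line.strip())
        if match:
            minutes, seconds, text = match.groups()
            entries.append((int(minutes) * 60 + int(seconds), text))
    return entries
```

In practice you would pass `response.text` to this function and inspect how many lines were skipped before relying on the result.
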
### Multi-Speaker Identification

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe with speaker labels. Format: [Speaker 1], [Speaker 2], etc.', myfile]
)
```

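If the model follows the requested `[Speaker N]` format, the transcript can be grouped per speaker. This is a rough sketch under that assumption; `group_by_speaker` is a hypothetical helper, and lines without a label are treated as a continuation of the previous speaker.

```python
import re
from collections import defaultdict

# Matches "[Speaker 1] text" and "[Speaker 1]: text".
SPEAKER_LINE = re.compile(r'^\[(Speaker \d+)\]:?\s*(.*)$')


def group_by_speaker(transcript):
    """Map each speaker label to the list of utterances attributed to it."""
    utterances = defaultdict(list)
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            current, text = match.groups()
            if text:
                utterances[current].append(text)
        elif current is not None:
            # Continuation line: attach it to the most recent speaker.
            utterances[current].append(line)
    return dict(utterances)
```
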
### Segment-Specific Transcription

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe only the segment from 02:30 to 05:15.', myfile]
)
```

## Audio Analysis

### Summarization

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize key points in 5 bullets with timestamps.', myfile]
)
```

### Non-Speech Audio Analysis

```python
# Music analysis
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Identify the musical instruments and genre.', myfile]
)

# Environmental sounds
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Identify all sounds: voices, music, ambient noise.', myfile]
)

# Birdsong identification
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Identify bird species based on their calls.', myfile]
)
```

### Timestamp-Based Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What is discussed from 10:30 to 15:45? Provide key points.', myfile]
)
```

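When the time range comes from code rather than being typed by hand, the prompt can be assembled from offsets in seconds. A small sketch; `mmss` and `range_prompt` are illustrative names, and the prompt wording simply mirrors the example above.

```python
def mmss(total_seconds):
    """Format a non-negative number of seconds as MM:SS."""
    if total_seconds < 0:
        raise ValueError('seconds must be non-negative')
    minutes, seconds = divmod(int(total_seconds), 60)
    return f'{minutes:02d}:{seconds:02d}'


def range_prompt(start_seconds, end_seconds):
    """Build a timestamp-range question for the given offsets."""
    if end_seconds <= start_seconds:
        raise ValueError('end must be after start')
    return (
        f'What is discussed from {mmss(start_seconds)} to '
        f'{mmss(end_seconds)}? Provide key points.'
    )


print(range_prompt(630, 945))
# What is discussed from 10:30 to 15:45? Provide key points.
```
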
## Input Methods

### File Upload (>20MB or Reuse)

```python
# Upload once, use multiple times
myfile = client.files.upload(file='large-audio.mp3')

# First query
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this', myfile]
)

# Second query (reuses same file)
response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this', myfile]
)
```

### Inline Data (<20MB)

```python
from google.genai import types

with open('small-audio.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this audio',
        types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
    ]
)
```

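The choice between the two input methods can be reduced to a size check. The sketch below is one possible rule of thumb: it treats "20 MB" as 20 MiB and reserves some headroom for the prompt, since the limit applies to the whole request. Both the headroom value and the `choose_input_method` helper are assumptions for illustration.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # assumes "20 MB" means 20 MiB


def choose_input_method(file_size_bytes, reuse=False, headroom_bytes=1024 * 1024):
    """Return 'file_api' or 'inline' for an audio file of the given size.

    The inline limit covers the total request, so `headroom_bytes` is
    reserved for the prompt and request overhead (an arbitrary 1 MiB here).
    """
    if reuse or file_size_bytes + headroom_bytes > INLINE_LIMIT_BYTES:
        return 'file_api'
    return 'inline'
```

A caller would typically pass `os.path.getsize(path)` as `file_size_bytes` and set `reuse=True` when the same audio will be queried more than once.
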
## Speech Generation (TTS)

### Available Models

| Model | Quality | Speed | Cost/1M tokens |
|-------|---------|-------|----------------|
| `gemini-2.5-flash-preview-tts` | High | Fast | $10 |
| `gemini-2.5-pro-preview-tts` | Premium | Slower | $20 |

### Basic TTS

TTS requests must ask for the `AUDIO` response modality. The returned bytes are raw PCM (16-bit, 24 kHz, mono), so the example wraps them in a WAV container before saving.

```python
import wave
from google.genai import types

tts_config = types.GenerateContentConfig(
    response_modalities=['AUDIO'],
    speech_config=types.SpeechConfig(voice_config=types.VoiceConfig(
        prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore'))),
)

response = client.models.generate_content(
    model='gemini-2.5-flash-preview-tts',
    contents='Generate audio: Welcome to today\'s episode.',
    config=tts_config
)

# Save audio
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open('output.wav', 'wb') as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(pcm)
```

### Controllable Voice Style

```python
# Professional tone
response = client.models.generate_content(
    model='gemini-2.5-flash-preview-tts',
    contents='Generate audio in a professional, clear tone: Welcome to our quarterly earnings call.',
    config=tts_config
)

# Casual and friendly
response = client.models.generate_content(
    model='gemini-2.5-flash-preview-tts',
    contents='Generate audio in a friendly, conversational tone: Hey there! Let\'s dive into today\'s topic.',
    config=tts_config
)

# Narrative style
response = client.models.generate_content(
    model='gemini-2.5-flash-preview-tts',
    contents='Generate audio in a narrative, storytelling tone: Once upon a time, in a land far away...',
    config=tts_config
)
```

### Voice Control Parameters

- **Style**: Professional, casual, narrative, conversational
- **Pace**: Slow, normal, fast
- **Tone**: Friendly, serious, enthusiastic
- **Accent**: Natural language control (e.g., "British accent", "Southern drawl")

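Because these controls are expressed in natural language rather than as API parameters, one way to keep them consistent is to assemble the prompt in code. The sketch below is only an illustration: `build_tts_prompt` and its prompt layout are assumptions, and the wording that works best may differ by model.

```python
def build_tts_prompt(text, style=None, pace=None, tone=None, accent=None):
    """Compose a TTS instruction from optional voice controls plus the text."""
    controls = {'style': style, 'pace': pace, 'tone': tone, 'accent': accent}
    described = [f'{name}: {value}' for name, value in controls.items() if value]
    if not described:
        return f'Generate audio: {text}'
    return f"Generate audio ({'; '.join(described)}): {text}"


print(build_tts_prompt('Welcome back.', style='professional', pace='slow'))
# Generate audio (style: professional; pace: slow): Welcome back.
```

The returned string would be passed as `contents` in a TTS request.
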
## Best Practices

### File Management

1. Use File API for files >20MB
2. Use File API for repeated queries (upload once instead of resending the audio bytes with every request; token usage per query is unchanged)
3. Files auto-delete after 48 hours
4. Clean up manually when done:
   ```python
   client.files.delete(name=myfile.name)
   ```

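For scripts that upload many files, cleanup can be done in one pass. A minimal sketch, assuming `client` is the `genai.Client` from the earlier examples; `delete_all_uploaded_files` is a hypothetical helper, and it removes every file visible to the API key, so use it only where that is intended.

```python
def delete_all_uploaded_files(client):
    """Delete every file stored under this client's API key.

    Returns the names of the files that were deleted.
    """
    deleted = []
    # client.files.list() iterates over the uploaded files.
    for uploaded in client.files.list():
        client.files.delete(name=uploaded.name)
        deleted.append(uploaded.name)
    return deleted
```
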
### Prompt Engineering

**Effective prompts**:
- "Transcribe from 02:30 to 03:29 in MM:SS format"
- "Identify speakers and extract dialogue with timestamps"
- "Summarize key points with relevant timestamps"
- "Transcribe and analyze sentiment for each speaker"

**Context improves accuracy**:
- "This is a medical interview - use appropriate terminology"
- "Transcribe this legal deposition with precise terminology"
- "This is a technical podcast about machine learning"

**Combined tasks**:
- "Transcribe and summarize in bullet points"
- "Extract key quotes with timestamps and speaker labels"
- "Transcribe and identify action items with timestamps"

### Cost Optimization

**Token calculation**:
- 1 minute audio = 1,920 tokens
- 1 hour audio = 115,200 tokens
- 9.5 hours = 1,094,400 tokens

**Model selection**:
- Use `gemini-2.5-flash` ($1/1M tokens) for most tasks
- Upgrade to `gemini-2.5-pro` ($3/1M tokens) for complex analysis
- For high-volume: `gemini-1.5-flash` ($0.70/1M tokens)

**Reduce costs**:
- Process only relevant segments: a timestamp range in the prompt shortens the output, and trimming the file before sending it also cuts input tokens
- Use lower-quality (more compressed) audio when possible: it shrinks uploads, although token count depends on duration, not bitrate
- Batch multiple short files in one request
- Cache context for repeated queries

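Batching short files has to respect the 9.5-hour combined limit per request. The sketch below groups clips greedily in their original order; it assumes durations are known in advance, and `batch_by_duration` is an illustrative helper rather than anything provided by the SDK.

```python
MAX_SECONDS_PER_REQUEST = 9.5 * 3600   # combined audio limit per request


def batch_by_duration(clips, limit=MAX_SECONDS_PER_REQUEST):
    """Group (name, seconds) clips into consecutive batches within the limit."""
    batches, current, current_total = [], [], 0.0
    for name, seconds in clips:
        if seconds > limit:
            raise ValueError(f'{name} alone exceeds the per-request limit')
        # Start a new batch when adding this clip would overflow the current one.
        if current and current_total + seconds > limit:
            batches.append(current)
            current, current_total = [], 0.0
        current.append(name)
        current_total += seconds
    if current:
        batches.append(current)
    return batches


clips = [('a.mp3', 4 * 3600), ('b.mp3', 4 * 3600), ('c.mp3', 3 * 3600), ('d.mp3', 1800)]
print(batch_by_duration(clips))   # [['a.mp3', 'b.mp3'], ['c.mp3', 'd.mp3']]
```

Greedy grouping keeps files in order, which matters when timestamps in the answer need to be mapped back to individual recordings; it does not guarantee the minimum number of requests.
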
### Error Handling

```python
import time

def transcribe_with_retry(file_path, max_retries=3):
    """Transcribe audio, retrying with exponential backoff on failure."""
    myfile = None
    for attempt in range(max_retries):
        try:
            # Upload only once; later attempts reuse the uploaded file
            if myfile is None:
                myfile = client.files.upload(file=file_path)
            response = client.models.generate_content(
                model='gemini-2.5-flash',
                contents=['Transcribe with timestamps', myfile]
            )
            return response.text
        except Exception as e:
            # Out of attempts: surface the last error to the caller
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait_time}s")
            time.sleep(wait_time)
```

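When many workers retry at once, a capped backoff with random jitter helps avoid synchronized retries. The generator below is a sketch of that schedule only; the `base`, `cap`, and `jitter` defaults are arbitrary assumptions, and `backoff_delays` is not part of the SDK.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=30.0, jitter=0.5, rng=random.random):
    """Yield one delay per retry: min(cap, base * 2**n), stretched by random jitter.

    Each delay is multiplied by a factor in [1, 1 + jitter), so the cap
    bounds the un-jittered delay rather than the final value.
    """
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        yield delay * (1 + jitter * rng())
```

A retry loop can iterate over `backoff_delays(max_retries)` and call `time.sleep(delay)` after each failed attempt.
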
## Common Use Cases

### 1. Meeting Transcription

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Transcribe this meeting with:
        1. Speaker labels
        2. Timestamps for topic changes
        3. Action items highlighted
        ''',
        myfile
    ]
)
```

### 2. Podcast Summary

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Create podcast summary with:
        1. Main topics with timestamps
        2. Key quotes from each speaker
        3. Recommended episode highlights
        ''',
        myfile
    ]
)
```

### 3. Interview Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze interview:
        1. Questions asked with timestamps
        2. Key responses from interviewee
        3. Overall sentiment and tone
        ''',
        myfile
    ]
)
```

### 4. Content Verification

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Verify audio content:
        1. Check for specific keywords or phrases
        2. Identify any compliance issues
        3. Note any concerning statements with timestamps
        ''',
        myfile
    ]
)
```

### 5. Multilingual Transcription

```python
# Gemini auto-detects language
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this audio and translate to English if needed.', myfile]
)
```

## Token Costs

**Audio Input** (32 tokens/second):
- 1 minute = 1,920 tokens
- 10 minutes = 19,200 tokens
- 1 hour = 115,200 tokens
- 9.5 hours = 1,094,400 tokens

**Example costs** (Gemini 2.5 Flash at $1/1M tokens):
- 1 hour audio: 115,200 tokens = about $0.12
- Full day podcast (8 hours): 921,600 tokens = about $0.92

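The example costs follow directly from the token rate. A short sketch of the arithmetic, using the $1 per 1M tokens figure quoted above as the default; `audio_input_cost` is an illustrative helper and covers input tokens only, not the tokens in the model's response.

```python
TOKENS_PER_SECOND = 32   # audio input token rate


def audio_input_cost(hours, usd_per_million_tokens=1.0):
    """Return (tokens, cost_usd) for audio input of the given length."""
    tokens = int(hours * 3600 * TOKENS_PER_SECOND)
    return tokens, tokens / 1_000_000 * usd_per_million_tokens


tokens, cost = audio_input_cost(1)
print(tokens, round(cost, 2))   # 115200 0.12

tokens, cost = audio_input_cost(8)
print(tokens, round(cost, 2))   # 921600 0.92
```
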
## Limitations

- Maximum 9.5 hours per request
- Auto-downsampled to 16 Kbps mono (quality loss)
- Files expire after 48 hours
- No real-time streaming support
- Non-speech audio less accurate than speech