Initial commit

2025-11-30 08:48:52 +08:00
commit 6ec3196ecc
434 changed files with 125248 additions and 0 deletions
--- a/skills/ai-multimodal/references/audio-processing.md
+++ b/skills/ai-multimodal/references/audio-processing.md
@@ -0,0 +1,373 @@
+# Audio Processing Reference
+
+Comprehensive guide for audio analysis and speech generation using Gemini API.
+
+## Audio Understanding
+
+### Supported Formats
+
+| Format | MIME Type | Best Use |
+|--------|-----------|----------|
+| WAV | `audio/wav` | Uncompressed, highest quality |
+| MP3 | `audio/mp3` | Compressed, widely compatible |
+| AAC | `audio/aac` | Compressed, good quality |
+| FLAC | `audio/flac` | Lossless compression |
+| OGG Vorbis | `audio/ogg` | Open format |
+| AIFF | `audio/aiff` | Apple format |
+
+### Specifications
+
+- **Maximum length**: 9.5 hours per request
+- **Multiple files**: Unlimited count, combined max 9.5 hours
+- **Token rate**: 32 tokens/second (1 minute = 1,920 tokens)
+- **Processing**: Auto-downsampled to 16 Kbps mono
+- **File size limits**:
+  - Inline: 20 MB max total request
+  - File API: 2 GB per file, 20 GB project quota
+  - Retention: 48 hours auto-delete
+
+## Transcription
+
+### Basic Transcription
+
+```python
+from google import genai
+import os
+
+client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
+
+# Upload audio
+myfile = client.files.upload(file='meeting.mp3')
+
+# Transcribe
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Generate a transcript of the speech.', myfile]
+)
+print(response.text)
+```
+
+### With Timestamps
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Generate transcript with timestamps in MM:SS format.', myfile]
+)
+```
+
+### Multi-Speaker Identification
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Transcribe with speaker labels. Format: [Speaker 1], [Speaker 2], etc.', myfile]
+)
+```
+
+### Segment-Specific Transcription
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Transcribe only the segment from 02:30 to 05:15.', myfile]
+)
+```
+
+## Audio Analysis
+
+### Summarization
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Summarize key points in 5 bullets with timestamps.', myfile]
+)
+```
+
+### Non-Speech Audio Analysis
+
+```python
+# Music analysis
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Identify the musical instruments and genre.', myfile]
+)
+
+# Environmental sounds
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Identify all sounds: voices, music, ambient noise.', myfile]
+)
+
+# Birdsong identification
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Identify bird species based on their calls.', myfile]
+)
+```
+
+### Timestamp-Based Analysis
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['What is discussed from 10:30 to 15:45? Provide key points.', myfile]
+)
+```
+
+## Input Methods
+
+### File Upload (>20MB or Reuse)
+
+```python
+# Upload once, use multiple times
+myfile = client.files.upload(file='large-audio.mp3')
+
+# First query
+response1 = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Transcribe this', myfile]
+)
+
+# Second query (reuses same file)
+response2 = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Summarize this', myfile]
+)
+```
+
+### Inline Data (<20MB)
+
+```python
+from google.genai import types
+
+with open('small-audio.mp3', 'rb') as f:
+    audio_bytes = f.read()
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        'Describe this audio',
+        types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
+    ]
+)
+```
+
+## Speech Generation (TTS)
+
+### Available Models
+
+| Model | Quality | Speed | Cost/1M tokens |
+|-------|---------|-------|----------------|
+| `gemini-2.5-flash-native-audio-preview-09-2025` | High | Fast | $10 |
+| `gemini-2.5-pro` TTS mode | Premium | Slower | $20 |
+
+### Basic TTS
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-native-audio-preview-09-2025',
+    contents='Generate audio: Welcome to today\'s episode.'
+)
+
+# Save audio
+with open('output.wav', 'wb') as f:
+    f.write(response.audio_data)
+```
+
+### Controllable Voice Style
+
+```python
+# Professional tone
+response = client.models.generate_content(
+    model='gemini-2.5-flash-native-audio-preview-09-2025',
+    contents='Generate audio in a professional, clear tone: Welcome to our quarterly earnings call.'
+)
+
+# Casual and friendly
+response = client.models.generate_content(
+    model='gemini-2.5-flash-native-audio-preview-09-2025',
+    contents='Generate audio in a friendly, conversational tone: Hey there! Let\'s dive into today\'s topic.'
+)
+
+# Narrative style
+response = client.models.generate_content(
+    model='gemini-2.5-flash-native-audio-preview-09-2025',
+    contents='Generate audio in a narrative, storytelling tone: Once upon a time, in a land far away...'
+)
+```
+
+### Voice Control Parameters
+
+- **Style**: Professional, casual, narrative, conversational
+- **Pace**: Slow, normal, fast
+- **Tone**: Friendly, serious, enthusiastic
+- **Accent**: Natural language control (e.g., "British accent", "Southern drawl")
+
+## Best Practices
+
+### File Management
+
+1. Use File API for files >20MB
+2. Use File API for repeated queries (saves tokens)
+3. Files auto-delete after 48 hours
+4. Clean up manually when done:
+   ```python
+   client.files.delete(name=myfile.name)
+   ```
+
+### Prompt Engineering
+
+**Effective prompts**:
+- "Transcribe from 02:30 to 03:29 in MM:SS format"
+- "Identify speakers and extract dialogue with timestamps"
+- "Summarize key points with relevant timestamps"
+- "Transcribe and analyze sentiment for each speaker"
+
+**Context improves accuracy**:
+- "This is a medical interview - use appropriate terminology"
+- "Transcribe this legal deposition with precise terminology"
+- "This is a technical podcast about machine learning"
+
+**Combined tasks**:
+- "Transcribe and summarize in bullet points"
+- "Extract key quotes with timestamps and speaker labels"
+- "Transcribe and identify action items with timestamps"
+
+### Cost Optimization
+
+**Token calculation**:
+- 1 minute audio = 1,920 tokens
+- 1 hour audio = 115,200 tokens
+- 9.5 hours = 1,094,400 tokens
+
+**Model selection**:
+- Use `gemini-2.5-flash` ($1/1M tokens) for most tasks
+- Upgrade to `gemini-2.5-pro` ($3/1M tokens) for complex analysis
+- For high-volume: `gemini-1.5-flash` ($0.70/1M tokens)
+
+**Reduce costs**:
+- Process only relevant segments using timestamps
+- Use lower-quality audio when possible
+- Batch multiple short files in one request
+- Cache context for repeated queries
+
+### Error Handling
+
+```python
+import time
+
+def transcribe_with_retry(file_path, max_retries=3):
+    """Transcribe audio with exponential backoff retry"""
+    for attempt in range(max_retries):
+        try:
+            myfile = client.files.upload(file=file_path)
+            response = client.models.generate_content(
+                model='gemini-2.5-flash',
+                contents=['Transcribe with timestamps', myfile]
+            )
+            return response.text
+        except Exception as e:
+            if attempt == max_retries - 1:
+                raise
+            wait_time = 2 ** attempt
+            print(f"Retry {attempt + 1} after {wait_time}s")
+            time.sleep(wait_time)
+```
+
+## Common Use Cases
+
+### 1. Meeting Transcription
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Transcribe this meeting with:
+        1. Speaker labels
+        2. Timestamps for topic changes
+        3. Action items highlighted
+        ''',
+        myfile
+    ]
+)
+```
+
+### 2. Podcast Summary
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Create podcast summary with:
+        1. Main topics with timestamps
+        2. Key quotes from each speaker
+        3. Recommended episode highlights
+        ''',
+        myfile
+    ]
+)
+```
+
+### 3. Interview Analysis
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Analyze interview:
+        1. Questions asked with timestamps
+        2. Key responses from interviewee
+        3. Overall sentiment and tone
+        ''',
+        myfile
+    ]
+)
+```
+
+### 4. Content Verification
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=[
+        '''Verify audio content:
+        1. Check for specific keywords or phrases
+        2. Identify any compliance issues
+        3. Note any concerning statements with timestamps
+        ''',
+        myfile
+    ]
+)
+```
+
+### 5. Multilingual Transcription
+
+```python
+# Gemini auto-detects language
+response = client.models.generate_content(
+    model='gemini-2.5-flash',
+    contents=['Transcribe this audio and translate to English if needed.', myfile]
+)
+```
+
+## Token Costs
+
+**Audio Input** (32 tokens/second):
+- 1 minute = 1,920 tokens
+- 10 minutes = 19,200 tokens
+- 1 hour = 115,200 tokens
+- 9.5 hours = 1,094,400 tokens
+
+**Example costs** (Gemini 2.5 Flash at $1/1M):
+- 1 hour audio: 115,200 tokens = $0.12
+- Full day podcast (8 hours): 921,600 tokens = $0.92
+
+## Limitations
+
+- Maximum 9.5 hours per request
+- Auto-downsampled to 16 Kbps mono (quality loss)
+- Files expire after 48 hours
+- No real-time streaming support
+- Non-speech audio less accurate than speech