# Audio Guide (Whisper & TTS)

**Last Updated**: 2025-10-25

Complete guide to OpenAI's Audio API for transcription and text-to-speech.

---

## Whisper Transcription

### Supported Formats
- mp3, mp4, mpeg, mpga, m4a, wav, webm

### Best Practices

✅ **Audio Quality**:
- Use clear audio with minimal background noise
- 16 kHz or higher sample rate recommended
- Mono or stereo both supported

✅ **File Size**:
- Max file size: 25 MB
- For larger files: split into chunks or compress
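
A minimal sketch of how a chunk plan could be derived for a file over the limit. `planChunks` and `MAX_UPLOAD_BYTES` are hypothetical names, not part of the OpenAI SDK. The sketch assumes a roughly constant bitrate and reads "25 MB" conservatively as 25,000,000 bytes. The actual cutting would be done with a separate audio tool, which is not shown.

```typescript
// Assumption: "25 MB" is interpreted conservatively as 25,000,000 bytes.
const MAX_UPLOAD_BYTES = 25_000_000;

interface ChunkPlan {
  startSeconds: number;
  endSeconds: number;
}

// Hypothetical helper: divides a recording into equal-length time ranges so
// that each range stays at or under the limit, assuming a constant bitrate.
function planChunks(
  fileBytes: number,
  durationSeconds: number,
  maxBytes: number = MAX_UPLOAD_BYTES,
): ChunkPlan[] {
  if (fileBytes <= 0 || durationSeconds <= 0) return [];
  const chunkCount = Math.ceil(fileBytes / maxBytes);
  const chunkSeconds = durationSeconds / chunkCount;
  return Array.from({ length: chunkCount }, (_, i) => ({
    startSeconds: i * chunkSeconds,
    endSeconds: Math.min((i + 1) * chunkSeconds, durationSeconds),
  }));
}
```

For variable-bitrate files, passing a smaller `maxBytes` leaves headroom. Cutting at silences rather than at exact offsets may avoid splitting a word across two chunks.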

✅ **Languages**:
- Whisper automatically detects language
- Supports 50+ languages
- Best results with English, Spanish, French, German, Chinese

❌ **Limitations**:
- May struggle with heavy accents
- Background noise reduces accuracy
- Very quiet audio may fail

---

## Text-to-Speech (TTS)

### Model Selection

| Model | Quality | Latency | Features | Best For |
|-------|---------|---------|----------|----------|
| tts-1 | Standard | Lowest | Basic TTS | Real-time streaming |
| tts-1-hd | High | Medium | Better fidelity | Offline audio, podcasts |
| gpt-4o-mini-tts | Best | Medium | Voice instructions, streaming | Maximum control |

### Voice Selection Guide

| Voice | Character | Best For |
|-------|-----------|----------|
| alloy | Neutral, balanced | General use, professional |
| ash | Clear, professional | Business, presentations |
| ballad | Warm, storytelling | Narration, audiobooks |
| coral | Soft, friendly | Customer service, greetings |
| echo | Calm, measured | Meditation, calm content |
| fable | Expressive, narrative | Stories, entertainment |
| onyx | Deep, authoritative | News, serious content |
| nova | Bright, energetic | Marketing, enthusiastic content |
| sage | Wise, thoughtful | Educational, informative |
| shimmer | Gentle, soothing | Relaxation, sleep content |
| verse | Poetic, rhythmic | Poetry, artistic content |

### Voice Instructions (gpt-4o-mini-tts only)

```typescript
// Request parameters for openai.audio.speech.create()

// Professional tone
const professional = {
  model: 'gpt-4o-mini-tts',
  voice: 'ash',
  input: 'Welcome to our service',
  instructions: 'Speak in a calm, professional, and friendly tone suitable for customer service.',
};

// Energetic marketing
const marketing = {
  model: 'gpt-4o-mini-tts',
  voice: 'nova',
  input: "Don't miss this sale!",
  instructions: 'Use an enthusiastic, energetic tone perfect for marketing and advertisements.',
};

// Meditation guidance
const meditation = {
  model: 'gpt-4o-mini-tts',
  voice: 'shimmer',
  input: 'Take a deep breath',
  instructions: 'Adopt a calm, soothing voice suitable for meditation and relaxation guidance.',
};
```
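
Because `instructions` applies only to gpt-4o-mini-tts, code that switches models at runtime may want to drop the field for other models. The sketch below shows one way to do that. `buildSpeechRequest` and `SpeechRequest` are hypothetical names. How the API reacts to an unsupported `instructions` field is not covered in this guide, so treat this as a defensive assumption.

```typescript
interface SpeechRequest {
  model: string;
  voice: string;
  input: string;
  instructions?: string;
}

// Hypothetical helper: keeps `instructions` only for the model that supports it.
function buildSpeechRequest(request: SpeechRequest): SpeechRequest {
  if (request.model === 'gpt-4o-mini-tts' || request.instructions === undefined) {
    return request;
  }
  // Copy the shared fields and leave `instructions` out.
  return { model: request.model, voice: request.voice, input: request.input };
}
```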

### Speed Control

```typescript
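// `speed` is passed alongside model, voice, and input in the speech request.

// Slow (0.5x). Good for: learning, accessibility
const slow = { speed: 0.5 };

// Normal (1.0x). Default
const normal = { speed: 1.0 };

// Fast (1.5x). Good for: previews, time-saving
const fast = { speed: 1.5 };

// Very fast (2.0x). Good for: quick previews only
const veryFast = { speed: 2.0 };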
```

Range: 0.25 to 4.0
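
When the speed comes from user input, it can be clamped to the documented range before the request is sent. This is a minimal sketch, and `clampSpeed` is a hypothetical helper rather than an SDK function.

```typescript
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;
const DEFAULT_SPEED = 1.0;

// Hypothetical helper: forces a requested speed into the 0.25 to 4.0 range.
// Non-finite values (NaN, Infinity) fall back to the default speed.
function clampSpeed(requested: number): number {
  if (!Number.isFinite(requested)) return DEFAULT_SPEED;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, requested));
}
```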

### Audio Format Selection

| Format | Compression | Quality | Best For |
|--------|-------------|---------|----------|
| mp3 | Lossy | Good | Maximum compatibility |
| opus | Lossy | Excellent | Web streaming, low bandwidth |
| aac | Lossy | Good | iOS, Apple devices |
| flac | Lossless | Best | Archiving, editing |
| wav | Uncompressed | Best | Editing, processing |
| pcm | Raw | Best | Low-level processing |
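
The `pcm` format has no container, so most players cannot open the output directly. The sketch below wraps raw samples in a standard 44-byte WAV header. `pcmToWav` is a hypothetical helper. The defaults (24 kHz, 16-bit, mono) are an assumption about the API's PCM output and should be checked against the official documentation.

```typescript
// Assumed PCM layout (verify against the official docs): 24 kHz, 16-bit, mono.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate = 24_000,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const headerSize = 44;
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;
  const wav = new Uint8Array(headerSize + pcm.length);
  const view = new DataView(wav.buffer);

  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) wav[offset + i] = text.charCodeAt(i);
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true); // total size minus the first 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true); // size of the fmt chunk for linear PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, 'data');
  view.setUint32(40, pcm.length, true); // size of the sample data
  wav.set(pcm, headerSize);
  return wav;
}
```

A Node.js `Buffer` is a `Uint8Array`, so the bytes of a `pcm` response can be passed in directly and the result written with `fs.writeFileSync`.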

---

## Common Patterns

### 1. Transcribe Interview

```typescript
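import fs from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream('./interview.mp3'),
  model: 'whisper-1',
});

// Save transcript
fs.writeFileSync('./interview.txt', transcription.text);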
```

### 2. Generate Podcast Narration

```typescript
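import fs from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const script = "Welcome to today's podcast...";

const audio = await openai.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'fable',
  input: script,
  response_format: 'mp3',
});

// The response body is binary audio; convert it to a Buffer before writing
const buffer = Buffer.from(await audio.arrayBuffer());
fs.writeFileSync('./podcast.mp3', buffer);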
```

### 3. Multi-Voice Conversation

```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Speaker 1
const speaker1 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'onyx',
  input: 'Hello, how are you?',
});

// Speaker 2
const speaker2 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'nova',
  input: "I'm doing great, thanks!",
});

// Combining the clips into one file requires an audio processing library (not shown)
```
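
One way to combine the clips without an audio library is to request `response_format: 'pcm'` for each speaker and join the raw samples. The sketch below does only the joining step. `joinPcmClips` is a hypothetical helper, and the defaults (24 kHz, 16-bit samples) are an assumption about the API's PCM output that should be verified against the official documentation.

```typescript
// Joins raw PCM clips with a short pause between speakers.
// Assumed layout (verify against the official docs): 24 kHz, 16-bit, mono.
function joinPcmClips(
  clips: Uint8Array[],
  pauseMs = 300,
  sampleRate = 24_000,
  bytesPerSample = 2,
): Uint8Array {
  // Whole samples only, so the pause never splits a 16-bit sample in half.
  const pauseBytes = Math.round((pauseMs / 1000) * sampleRate) * bytesPerSample;
  const gaps = Math.max(clips.length - 1, 0);
  const total = clips.reduce((sum, clip) => sum + clip.length, 0) + gaps * pauseBytes;

  // Uint8Array is zero-filled, and zero bytes are silence in signed PCM,
  // so the gaps need no explicit writes.
  const out = new Uint8Array(total);
  let offset = 0;
  clips.forEach((clip, i) => {
    out.set(clip, offset);
    offset += clip.length + (i < clips.length - 1 ? pauseBytes : 0);
  });
  return out;
}
```

Each clip would come from `Buffer.from(await response.arrayBuffer())` on a `pcm` response. The joined result is still headerless PCM, so it needs a container or an encoder before playback.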

---

## Cost Optimization

1. **Use tts-1 for real-time** (cheaper, faster)
2. **Use tts-1-hd for final production** (better quality)
3. **Cache generated audio** (reuse the stored file for an identical model, voice, input, and settings instead of regenerating it)
4. **Choose appropriate format** (opus for web, mp3 for compatibility)
5. **Batch transcriptions** with delays to avoid rate limits
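
A cache for generated audio needs a key that covers every parameter affecting the output. The sketch below hashes those parameters with SHA-256. `speechCacheKey` and `SpeechParams` are hypothetical names, and treating an omitted `speed` as 1.0 and an omitted `response_format` as `mp3` is an assumption about the API defaults.

```typescript
import { createHash } from 'node:crypto';

interface SpeechParams {
  model: string;
  voice: string;
  input: string;
  instructions?: string;
  speed?: number;
  response_format?: string;
}

// Hypothetical helper: derives a stable key from every parameter that can
// change the generated audio. Omitted fields are normalized to assumed defaults
// so that explicit and implicit defaults map to the same key.
function speechCacheKey(params: SpeechParams): string {
  const canonical = JSON.stringify([
    params.model,
    params.voice,
    params.input,
    params.instructions ?? '',
    params.speed ?? 1.0,
    params.response_format ?? 'mp3',
  ]);
  return createHash('sha256').update(canonical).digest('hex');
}
```

The key can serve as a file name, for example `./cache/<key>.mp3`, so a lookup is a simple existence check before calling the API.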

---

## Common Issues

### Transcription Accuracy
- Improve audio quality
- Reduce background noise
- Ensure adequate volume levels
- Use supported audio formats

### TTS Naturalness
- Test different voices
- Use voice instructions (gpt-4o-mini-tts)
- Adjust speed for better pacing
- Add punctuation for natural pauses

### File Size
- Compress audio before transcribing
- Choose lossy formats (mp3, opus) for TTS
- Use appropriate bitrates

---

**See Also**: Official Audio Guide (https://platform.openai.com/docs/guides/speech-to-text)