Audio Guide (Whisper & TTS)
Last Updated: 2025-10-25
Complete guide to OpenAI's Audio API for transcription and text-to-speech.
Whisper Transcription
Supported Formats
- mp3, mp4, mpeg, mpga, m4a, wav, webm
Best Practices
✅ Audio Quality:
- Use clear audio with minimal background noise
- 16 kHz or higher sample rate recommended
- Both mono and stereo are supported
✅ File Size:
- Max file size: 25 MB
- For larger files: split into chunks or compress
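The format list and size limit above can be checked locally before an upload is attempted. The following is a minimal sketch, not part of the OpenAI SDK: `checkAudioFile` is a hypothetical helper, and treating "25 MB" as 25,000,000 bytes is a conservative assumption.

```javascript
// Formats and size cap taken from the lists above.
const SUPPORTED_EXTENSIONS = ['mp3', 'mp4', 'mpeg', 'mpga', 'm4a', 'wav', 'webm'];
// Assumption: "25 MB" is treated conservatively as 25,000,000 bytes.
const MAX_UPLOAD_BYTES = 25 * 1000 * 1000;

// Hypothetical helper: returns { ok, reason } without touching the network.
function checkAudioFile(fileName, sizeBytes) {
  const dot = fileName.lastIndexOf('.');
  const ext = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();
  if (!SUPPORTED_EXTENSIONS.includes(ext)) {
    return { ok: false, reason: `unsupported format: ${ext || '(none)'}` };
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: 'file exceeds 25 MB: split into chunks or compress' };
  }
  return { ok: true, reason: null };
}
```

In practice the size would come from something like `fs.statSync(path).size`, and a failed check is the cue to split or compress the file as described above.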
✅ Languages:
- Whisper automatically detects language
- Supports 50+ languages
- Best results with English, Spanish, French, German, Chinese
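When the language is known ahead of time, the transcription endpoint also accepts an optional `language` parameter (an ISO-639-1 code such as `en` or `es`) instead of relying on auto-detection. A small sketch of building the request options follows; `buildTranscriptionParams` is a hypothetical helper, and the two-letter check is a simplification rather than a full ISO-639-1 validation.

```javascript
// Hypothetical helper: builds the options object for
// openai.audio.transcriptions.create(). `language` is optional.
function buildTranscriptionParams(file, language) {
  const params = { file, model: 'whisper-1' };
  if (language !== undefined) {
    // Simplified shape check only; it does not verify the code is a real language.
    if (!/^[a-z]{2}$/.test(language)) {
      throw new Error(`expected a two-letter ISO-639-1 code, got "${language}"`);
    }
    params.language = language;
  }
  return params;
}
```

Omitting the second argument leaves language detection to Whisper.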
❌ Limitations:
- May struggle with heavy accents
- Background noise reduces accuracy
- Very quiet audio may fail
Text-to-Speech (TTS)
Model Selection
| Model | Quality | Latency | Features | Best For |
|---|---|---|---|---|
| tts-1 | Standard | Lowest | Basic TTS | Real-time streaming |
| tts-1-hd | High | Medium | Better fidelity | Offline audio, podcasts |
| gpt-4o-mini-tts | Best | Medium | Voice instructions, streaming | Maximum control |
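The table above can be read as a simple decision rule. The sketch below encodes one possible reading of it; `pickTtsModel` and its option names are hypothetical, and real projects may weigh cost and latency differently.

```javascript
// Hypothetical helper encoding the model table:
// voice instructions -> gpt-4o-mini-tts, real-time -> tts-1, otherwise tts-1-hd.
function pickTtsModel({ needsInstructions = false, realtime = false } = {}) {
  if (needsInstructions) return 'gpt-4o-mini-tts';
  if (realtime) return 'tts-1';
  return 'tts-1-hd';
}
```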
Voice Selection Guide
| Voice | Character | Best For |
|---|---|---|
| alloy | Neutral, balanced | General use, professional |
| ash | Clear, professional | Business, presentations |
| ballad | Warm, storytelling | Narration, audiobooks |
| coral | Soft, friendly | Customer service, greetings |
| echo | Calm, measured | Meditation, calm content |
| fable | Expressive, narrative | Stories, entertainment |
| onyx | Deep, authoritative | News, serious content |
| nova | Bright, energetic | Marketing, enthusiastic content |
| sage | Wise, thoughtful | Educational, informative |
| shimmer | Gentle, soothing | Relaxation, sleep content |
| verse | Poetic, rhythmic | Poetry, artistic content |
Voice Instructions (gpt-4o-mini-tts only)
// Professional tone
const professionalRequest = {
  model: 'gpt-4o-mini-tts',
  voice: 'ash',
  input: 'Welcome to our service',
  instructions: 'Speak in a calm, professional, and friendly tone suitable for customer service.',
};
// Energetic marketing
const marketingRequest = {
  model: 'gpt-4o-mini-tts',
  voice: 'nova',
  input: "Don't miss this sale!",
  instructions: 'Use an enthusiastic, energetic tone perfect for marketing and advertisements.',
};
// Meditation guidance
const meditationRequest = {
  model: 'gpt-4o-mini-tts',
  voice: 'shimmer',
  input: 'Take a deep breath',
  instructions: 'Adopt a calm, soothing voice suitable for meditation and relaxation guidance.',
};
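Because `instructions` applies to `gpt-4o-mini-tts` only, code that switches between models may want to drop the field for the others. A minimal sketch, assuming a hypothetical `buildSpeechRequest` helper whose output is passed to `openai.audio.speech.create`:

```javascript
// Hypothetical helper: keeps `instructions` only for the model that supports it.
function buildSpeechRequest({ model, voice, input, instructions }) {
  const request = { model, voice, input };
  if (instructions && model === 'gpt-4o-mini-tts') {
    request.instructions = instructions;
  }
  return request;
}
```

This way the same call site can fall back to `tts-1` or `tts-1-hd` without sending a field those models do not use.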
Speed Control
// Slow (0.5x)
{ speed: 0.5 } // Good for: Learning, accessibility
// Normal (1.0x)
{ speed: 1.0 } // Default
// Fast (1.5x)
{ speed: 1.5 } // Good for: Previews, time-saving
// Very fast (2.0x)
{ speed: 2.0 } // Good for: Quick previews only
Range: 0.25 to 4.0
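If the speed comes from user input, it can be clamped to the documented range before the request is sent. The helper below is a sketch; `clampSpeed` is a hypothetical name, and falling back to 1.0 for non-numeric input is a design assumption.

```javascript
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;

// Hypothetical helper: forces `speed` into [0.25, 4.0]; non-numbers become 1.0.
function clampSpeed(speed) {
  if (typeof speed !== 'number' || Number.isNaN(speed)) return 1.0;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}
```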
Audio Format Selection
| Format | Compression | Quality | Best For |
|---|---|---|---|
| mp3 | Lossy | Good | Maximum compatibility |
| opus | Lossy | Excellent | Web streaming, low bandwidth |
| aac | Lossy | Good | iOS, Apple devices |
| flac | Lossless | Best | Archiving, editing |
| wav | Uncompressed | Best | Editing, processing |
| pcm | Raw | Best | Low-level processing |
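The "Best For" column can be turned into a lookup so the `response_format` is chosen in one place. This is only a sketch: the target names (`web`, `apple`, and so on) and the `pickAudioFormat` helper are hypothetical labels for the rows above, with mp3 as the fallback because the table lists it for maximum compatibility.

```javascript
// Hypothetical mapping from a delivery target to a response_format value.
const FORMAT_BY_TARGET = new Map([
  ['compatibility', 'mp3'],
  ['web', 'opus'],
  ['apple', 'aac'],
  ['archive', 'flac'],
  ['editing', 'wav'],
  ['raw', 'pcm'],
]);

// Unknown targets fall back to mp3.
function pickAudioFormat(target) {
  return FORMAT_BY_TARGET.get(target) ?? 'mp3';
}
```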
Common Patterns
1. Transcribe Interview
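import fs from 'fs';
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
// Upload the recording and transcribe it with Whisper
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream('./interview.mp3'),
  model: 'whisper-1',
});
// Save transcript
fs.writeFileSync('./interview.txt', transcription.text);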
2. Generate Podcast Narration
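import fs from 'fs';
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const script = "Welcome to today's podcast...";
// Request high-fidelity speech as MP3
const audio = await openai.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'fable',
  input: script,
  response_format: 'mp3',
});
// Convert the response body to a Buffer and write it to disk
const buffer = Buffer.from(await audio.arrayBuffer());
fs.writeFileSync('./podcast.mp3', buffer);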
3. Multi-Voice Conversation
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
// Speaker 1
const speaker1 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'onyx',
  input: 'Hello, how are you?',
});
// Speaker 2
const speaker2 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'nova',
  input: "I'm doing great, thanks!",
});
// Combine the audio files (requires an audio processing library)
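For compressed formats, combining clips generally does need an audio library. One case that avoids it is raw PCM, where clips can be joined by plain concatenation. The sketch below assumes both clips were requested with `response_format: 'pcm'` and that the PCM stream is 24 kHz, 16-bit, mono; that layout is an assumption to verify against the API reference. `silence` and `joinPcmClips` are hypothetical helpers.

```javascript
// Assumed PCM layout (verify against the API reference): 24 kHz, 16-bit, mono.
const SAMPLE_RATE = 24000;
const BYTES_PER_SAMPLE = 2;

// Returns a zero-filled Buffer representing `ms` milliseconds of silence.
function silence(ms) {
  const samples = Math.round((SAMPLE_RATE * ms) / 1000);
  return Buffer.alloc(samples * BYTES_PER_SAMPLE);
}

// Joins raw PCM clips in order, inserting `gapMs` of silence between them.
function joinPcmClips(clips, gapMs = 0) {
  const parts = [];
  clips.forEach((clip, i) => {
    if (i > 0 && gapMs > 0) parts.push(silence(gapMs));
    parts.push(clip);
  });
  return Buffer.concat(parts);
}
```

Each clip would be obtained with `Buffer.from(await response.arrayBuffer())`. The joined result is still headerless PCM, so it needs a WAV header or a player configured for raw samples before it can be played.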
Cost Optimization
- Use tts-1 for real-time (cheaper, faster)
- Use tts-1-hd for final production (better quality)
- Cache generated audio, keyed by model, voice, input, and settings (a repeated request is billed again and is not guaranteed to sound identical)
- Choose appropriate format (opus for web, mp3 for compatibility)
- Batch transcriptions with delays to avoid rate limits
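Two of the points above, caching and spacing out batched requests, can be sketched without any SDK-specific code. All names here (`speechCacheKey`, `withSpeechCache`, `runWithDelay`) are hypothetical; the cache is in-memory only, and the default values in the key (`speed` 1.0, `response_format` mp3) are assumptions about the API defaults.

```javascript
// Hypothetical helper: stable cache key from the fields that affect the audio.
function speechCacheKey({ model, voice, input, instructions = '', speed = 1.0, response_format = 'mp3' }) {
  return JSON.stringify([model, voice, input, instructions, speed, response_format]);
}

// Wraps any `synthesize(params) => Promise<audio>` function with an in-memory cache.
function withSpeechCache(synthesize) {
  const cache = new Map();
  return async function cachedSynthesize(params) {
    const key = speechCacheKey(params);
    if (!cache.has(key)) cache.set(key, await synthesize(params));
    return cache.get(key);
  };
}

// Runs `worker` over `items` one at a time, pausing `delayMs` between calls.
async function runWithDelay(items, worker, delayMs) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await new Promise((resolve) => setTimeout(resolve, delayMs));
    results.push(await worker(items[i]));
  }
  return results;
}
```

A typical use would wrap a function that calls `openai.audio.speech.create` with `withSpeechCache`, and pass a list of file paths plus a transcription function to `runWithDelay`. A persistent cache (disk or object storage) would follow the same keying idea.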
Common Issues
Transcription Accuracy
- Improve audio quality
- Reduce background noise
- Ensure adequate volume levels
- Use supported audio formats
TTS Naturalness
- Test different voices
- Use voice instructions (gpt-4o-mini-tts)
- Adjust speed for better pacing
- Add punctuation for natural pauses
File Size
- Compress audio before transcribing
- Choose lossy formats (mp3, opus) for TTS output when small files matter
- Use appropriate bitrates
See Also: Official Audio Guide (https://platform.openai.com/docs/guides/speech-to-text)