Audio Guide (Whisper & TTS)

Last Updated: 2025-10-25

Complete guide to OpenAI's Audio API for transcription and text-to-speech.


Whisper Transcription

Supported Formats

  • mp3, mp4, mpeg, mpga, m4a, wav, webm
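A quick pre-upload check can catch unsupported files before a request is sent. The sketch below compares a filename's extension with the list above. `isSupportedAudioFile` is a hypothetical helper, not part of the OpenAI SDK, and it looks only at the extension, not at the file contents.

```typescript
// Extensions taken from the supported-formats list in this guide.
const SUPPORTED_EXTENSIONS: string[] = ['mp3', 'mp4', 'mpeg', 'mpga', 'm4a', 'wav', 'webm'];

// Hypothetical helper: true when the filename ends in a supported extension.
function isSupportedAudioFile(filename: string): boolean {
  const dot = filename.lastIndexOf('.');
  // No extension at all, or a trailing dot with nothing after it.
  if (dot === -1 || dot === filename.length - 1) return false;
  const extension = filename.slice(dot + 1).toLowerCase();
  return SUPPORTED_EXTENSIONS.indexOf(extension) !== -1;
}
```

A renamed file can still fail on upload, so treat this as a cheap first filter rather than a guarantee.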

Best Practices

Audio Quality:

  • Use clear audio with minimal background noise
  • 16 kHz or higher sample rate recommended
  • Mono or stereo both supported

File Size:

  • Max file size: 25 MB
  • For larger files: split into chunks or compress
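To decide where to split a long recording, it helps to estimate how many seconds of audio fit under the limit. The sketch below is one way to plan the cuts, under stated assumptions: the audio has a roughly constant bitrate, "25 MB" is treated conservatively as 25,000,000 bytes, and a 5% safety margin is applied. `planChunks` and its parameters are hypothetical names used for illustration.

```typescript
// Assumption: the 25 MB limit is read as 25,000,000 bytes, minus a 5% margin.
const MAX_UPLOAD_BYTES = 25 * 1000 * 1000;
const SAFETY_MARGIN = 0.95;

interface ChunkRange {
  startSec: number;
  endSec: number;
}

// Hypothetical helper: time ranges that keep each chunk under the size limit,
// assuming a constant bitrate given in kilobits per second.
function planChunks(durationSec: number, bitrateKbps: number): ChunkRange[] {
  if (durationSec <= 0 || bitrateKbps <= 0) return [];
  const bytesPerSecond = (bitrateKbps * 1000) / 8;
  const maxChunkSec = Math.floor((MAX_UPLOAD_BYTES * SAFETY_MARGIN) / bytesPerSecond);
  if (maxChunkSec < 1) {
    throw new Error('Bitrate is too high to fit one second of audio under the limit');
  }
  const ranges: ChunkRange[] = [];
  for (let start = 0; start < durationSec; start += maxChunkSec) {
    ranges.push({ startSec: start, endSec: Math.min(start + maxChunkSec, durationSec) });
  }
  return ranges;
}
```

The resulting ranges can be handed to an audio tool such as ffmpeg to cut the file. Moving each cut to a nearby pause avoids splitting a sentence across two chunks.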

Languages:

  • Whisper automatically detects language
  • Supports 50+ languages
  • Best results with English, Spanish, French, German, Chinese

Limitations:

  • May struggle with heavy accents
  • Background noise reduces accuracy
  • Very quiet audio may fail

Text-to-Speech (TTS)

Model Selection

| Model | Quality | Latency | Features | Best For |
|---|---|---|---|---|
| tts-1 | Standard | Lowest | Basic TTS | Real-time streaming |
| tts-1-hd | High | Medium | Better fidelity | Offline audio, podcasts |
| gpt-4o-mini-tts | Best | Medium | Voice instructions, streaming | Maximum control |

Voice Selection Guide

| Voice | Character | Best For |
|---|---|---|
| alloy | Neutral, balanced | General use, professional |
| ash | Clear, professional | Business, presentations |
| ballad | Warm, storytelling | Narration, audiobooks |
| coral | Soft, friendly | Customer service, greetings |
| echo | Calm, measured | Meditation, calm content |
| fable | Expressive, narrative | Stories, entertainment |
| onyx | Deep, authoritative | News, serious content |
| nova | Bright, energetic | Marketing, enthusiastic content |
| sage | Wise, thoughtful | Educational, informative |
| shimmer | Gentle, soothing | Relaxation, sleep content |
| verse | Poetic, rhythmic | Poetry, artistic content |

Voice Instructions (gpt-4o-mini-tts only)

// Professional tone
{
  model: 'gpt-4o-mini-tts',
  voice: 'ash',
  input: 'Welcome to our service',
  instructions: 'Speak in a calm, professional, and friendly tone suitable for customer service.',
}

// Energetic marketing
{
  model: 'gpt-4o-mini-tts',
  voice: 'nova',
  input: 'Don\'t miss this sale!',
  instructions: 'Use an enthusiastic, energetic tone perfect for marketing and advertisements.',
}

// Meditation guidance
{
  model: 'gpt-4o-mini-tts',
  voice: 'shimmer',
  input: 'Take a deep breath',
  instructions: 'Adopt a calm, soothing voice suitable for meditation and relaxation guidance.',
}
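Because `instructions` applies only to gpt-4o-mini-tts, code that switches between models may want to attach it conditionally. The sketch below builds a plain request object and makes no API call. `buildSpeechRequest` and `SpeechRequest` are hypothetical names, not SDK types.

```typescript
// Hypothetical shape mirroring the request fields used in this guide.
interface SpeechRequest {
  model: string;
  voice: string;
  input: string;
  instructions?: string;
}

// Attaches `instructions` only when the model is gpt-4o-mini-tts,
// so the same call site can be reused with tts-1 or tts-1-hd.
function buildSpeechRequest(
  model: string,
  voice: string,
  input: string,
  instructions?: string,
): SpeechRequest {
  const request: SpeechRequest = { model, voice, input };
  if (instructions !== undefined && model === 'gpt-4o-mini-tts') {
    request.instructions = instructions;
  }
  return request;
}
```

The returned object can be passed to `openai.audio.speech.create(...)` in the same way as the literal objects above.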

Speed Control

// Slow (0.5x)
{ speed: 0.5 } // Good for: Learning, accessibility

// Normal (1.0x)
{ speed: 1.0 } // Default

// Fast (1.5x)
{ speed: 1.5 } // Good for: Previews, time-saving

// Very fast (2.0x)
{ speed: 2.0 } // Good for: Quick previews only

Range: 0.25 to 4.0
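When the speed comes from user input, clamping it to the documented range avoids a rejected request. A minimal sketch follows; `clampSpeed` is a hypothetical helper, and falling back to 1.0 for non-numeric input is an assumption of this example.

```typescript
// Range stated in this guide.
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;

// Hypothetical helper: keeps a requested speed inside [0.25, 4.0].
function clampSpeed(speed: number): number {
  // Assumption: fall back to the default speed when the input is not a number.
  if (isNaN(speed)) return 1.0;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}
```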

Audio Format Selection

| Format | Compression | Quality | Best For |
|---|---|---|---|
| mp3 | Lossy | Good | Maximum compatibility |
| opus | Lossy | Excellent | Web streaming, low bandwidth |
| aac | Lossy | Good | iOS, Apple devices |
| flac | Lossless | Best | Archiving, editing |
| wav | Uncompressed | Best | Editing, processing |
| pcm | Raw | Best | Low-level processing |
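The `pcm` format has no header, so most players cannot open it directly. The sketch below wraps raw samples in a standard 44-byte WAV header. It assumes the PCM data is 24 kHz, 16-bit signed little-endian, mono; that sample format is an assumption here and should be checked against the current API documentation. `pcmToWav` is a hypothetical helper.

```typescript
// Hypothetical helper: prepends a canonical 44-byte RIFF/WAVE header to raw PCM.
// Assumed defaults: 24 kHz, mono, 16-bit samples (verify against the API docs).
function pcmToWav(
  pcm: Uint8Array,
  sampleRate: number = 24000,
  channels: number = 1,
  bitsPerSample: number = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) out[offset + i] = text.charCodeAt(i);
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);             // size of the fmt chunk
  view.setUint16(20, 1, true);              // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, 'data');
  view.setUint32(40, pcm.length, true);     // size of the sample data
  out.set(pcm, 44);
  return out;
}
```

If the assumed sample rate is wrong, the file still plays, but at the wrong pitch and speed.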

Common Patterns

1. Transcribe Interview

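import fs from 'fs';
import OpenAI from 'openai';

// The client reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

// Transcribe the recording with Whisper
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream('./interview.mp3'),
  model: 'whisper-1',
});

// Save transcript
fs.writeFileSync('./interview.txt', transcription.text);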

2. Generate Podcast Narration

const script = "Welcome to today's podcast...";

const audio = await openai.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'fable',
  input: script,
  response_format: 'mp3',
});

const buffer = Buffer.from(await audio.arrayBuffer());
fs.writeFileSync('./podcast.mp3', buffer);
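A full podcast script is usually longer than a single speech request accepts. The sketch below splits a script at sentence boundaries so that each piece can be sent separately. The 4096-character default is an assumption about the per-request input limit and should be checked against the current API reference. `splitScript` is a hypothetical helper.

```typescript
// Hypothetical helper: splits a script into chunks of at most `maxChars`,
// preferring sentence boundaries. The default limit is an assumption.
function splitScript(script: string, maxChars: number = 4096): string[] {
  // Each match is one sentence plus its trailing punctuation and whitespace.
  const sentences: string[] = script.match(/[^.!?]+[.!?]*\s*/g) || [];
  const chunks: string[] = [];
  let current = '';

  for (const sentence of sentences) {
    // Flush the current chunk if this sentence would push it over the limit.
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = '';
    }
    // A single sentence longer than the limit is hard-split.
    let rest = sentence;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current += rest;
  }

  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}
```

Each chunk can then be passed as `input` to `openai.audio.speech.create(...)`, with the resulting audio segments joined afterwards.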

3. Multi-Voice Conversation

// Speaker 1
const speaker1 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'onyx',
  input: 'Hello, how are you?',
});

// Speaker 2
const speaker2 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'nova',
  input: 'I\'m doing great, thanks!',
});

// Combine audio files (requires audio processing library)
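One way to combine the two clips without an audio library is to request both with `response_format: 'pcm'` and join the raw samples, since headerless PCM can be concatenated byte for byte. The sketch below does that and inserts a short silence between speakers. It assumes both clips are 24 kHz, 16-bit, mono PCM, which should be checked against the current API documentation. `concatPcm` is a hypothetical helper.

```typescript
// Assumption: raw PCM from the speech endpoint is 24 kHz, 16-bit, mono.
const SAMPLE_RATE = 24000;
const BYTES_PER_SAMPLE = 2;

// Hypothetical helper: joins PCM segments with `gapMs` of silence between them.
function concatPcm(segments: Uint8Array[], gapMs: number = 300): Uint8Array {
  // Whole samples only, so the gap never splits a 16-bit sample.
  const gapBytes = Math.round((SAMPLE_RATE * gapMs) / 1000) * BYTES_PER_SAMPLE;
  const gapCount = Math.max(0, segments.length - 1);
  const audioBytes = segments.reduce((sum, segment) => sum + segment.length, 0);
  // Uint8Array is zero-filled, and zero samples are silence in signed PCM.
  const out = new Uint8Array(audioBytes + gapCount * gapBytes);

  let offset = 0;
  segments.forEach((segment, index) => {
    out.set(segment, offset);
    offset += segment.length;
    if (index < segments.length - 1) offset += gapBytes;
  });
  return out;
}
```

Each response can be converted with `new Uint8Array(await speaker1.arrayBuffer())` before being passed in. The combined output is still headerless PCM, so it needs a container or an encoder before most players can open it.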

Cost Optimization

  1. Use tts-1 for real-time (cheaper, faster)
  2. Use tts-1-hd for final production (better quality)
  3. Cache generated audio and reuse it for identical requests (same model, voice, input, and settings)
  4. Choose appropriate format (opus for web, mp3 for compatibility)
  5. Batch transcriptions with delays to avoid rate limits
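Caching generated audio needs a key that captures everything affecting the output. The sketch below builds such a key from the request fields used in this guide and pairs it with an in-memory `Map`. `speechCacheKey`, `SpeechCacheFields`, and `audioCache` are hypothetical names, and the defaults applied to omitted fields (`mp3`, speed 1.0) are assumptions of this example.

```typescript
// Hypothetical shape: the request fields that influence the generated audio.
interface SpeechCacheFields {
  model: string;
  voice: string;
  input: string;
  response_format?: string;
  speed?: number;
  instructions?: string;
}

// Builds a stable key; omitted optional fields fall back to assumed defaults
// so that explicit and implicit defaults map to the same cache entry.
function speechCacheKey(fields: SpeechCacheFields): string {
  return JSON.stringify([
    fields.model,
    fields.voice,
    fields.input,
    fields.response_format === undefined ? 'mp3' : fields.response_format,
    fields.speed === undefined ? 1.0 : fields.speed,
    fields.instructions === undefined ? '' : fields.instructions,
  ]);
}

// In-memory cache for illustration; a real service would likely use disk or object storage.
const audioCache = new Map<string, Uint8Array>();
```

Before calling the API, look up `audioCache.get(speechCacheKey(request))` and generate new audio only on a miss.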

Common Issues

Transcription Accuracy

  • Improve audio quality
  • Reduce background noise
  • Ensure adequate volume levels
  • Use supported audio formats

TTS Naturalness

  • Test different voices
  • Use voice instructions (gpt-4o-mini-tts)
  • Adjust speed for better pacing
  • Add punctuation for natural pauses

File Size

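  • Compress or downsample audio before transcribing to stay under the 25 MB limit
  • Choose lossy formats (mp3, opus) for TTS output when file size matters
  • Use a bitrate suited to speech; lower bitrates shrink files at some cost in quality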

See Also: Official Audio Guide (https://platform.openai.com/docs/guides/speech-to-text)