zhongwei/gh-jezweb-claude-skills-skills-openai-api

Zhongwei Li 7a35a34caa Initial commit

2025-11-30 08:25:12 +08:00

5.0 KiB

Audio Guide (Whisper & TTS)

Last Updated: 2025-10-25

Complete guide to OpenAI's Audio API for transcription and text-to-speech.

Whisper Transcription

Supported Formats

mp3, mp4, mpeg, mpga, m4a, wav, webm

Best Practices

✅ Audio Quality:

Use clear audio with minimal background noise
A sample rate of 16 kHz or higher is recommended
Both mono and stereo are supported

✅ File Size:

Max file size: 25 MB
For larger files: split into chunks or compress
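The splitting advice above can be sketched as a small planning step. This is only an illustration: `planChunks` is a hypothetical helper, it assumes a roughly constant bitrate, and it reads "25 MB" as 25 MiB with a 10% safety margin so the plan stays under the limit under either interpretation. The actual cutting would still be done with an audio tool of your choice.

```javascript
// Assumption: the 25 MB upload limit, interpreted as 25 MiB.
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;

// Hypothetical helper: plan time-based chunks so that each chunk
// should stay below the upload limit (assuming constant bitrate).
function planChunks(fileSizeBytes, durationSeconds, maxBytes = MAX_UPLOAD_BYTES, safety = 0.9) {
  const budget = maxBytes * safety; // leave headroom for bitrate variation
  const chunkCount = Math.max(1, Math.ceil(fileSizeBytes / budget));
  const chunkSeconds = durationSeconds / chunkCount;

  return Array.from({ length: chunkCount }, (_, i) => ({
    index: i,
    startSeconds: i * chunkSeconds,
    endSeconds: (i + 1) * chunkSeconds,
  }));
}

// Example: a 60 MiB, one-hour recording is planned as three 20-minute chunks.
const plan = planChunks(60 * 1024 * 1024, 3600);
console.log(plan.length, plan[0].endSeconds);
```

Splitting on time rather than on raw bytes matters because compressed formats such as mp3 cannot be cut at arbitrary byte offsets without risking corrupt frames.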

✅ Languages:

Whisper detects the spoken language automatically
Supports 50+ languages
Best results with English, Spanish, French, German, Chinese
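When the language is already known, detection can be skipped. The sketch below assumes the transcription endpoint's optional `language` parameter, which takes an ISO-639-1 code such as `en`. `buildTranscriptionParams` is a hypothetical helper, not part of the SDK.

```javascript
// Hypothetical helper: build the params object for
// openai.audio.transcriptions.create(), adding a language hint if given.
function buildTranscriptionParams(file, options = {}) {
  const params = { file, model: 'whisper-1' };

  if (options.language) {
    // Assumption: two-letter ISO-639-1 codes, normalized to lowercase.
    params.language = String(options.language).trim().toLowerCase();
  }
  return params;
}

// Usage (file would normally be fs.createReadStream('./interview.mp3')):
const params = buildTranscriptionParams('<file stream>', { language: 'ES' });
console.log(params); // includes language: 'es'
```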

❌ Limitations:

May struggle with heavy accents
Background noise reduces accuracy
Very quiet audio may fail

Text-to-Speech (TTS)

Model Selection

Model	Quality	Latency	Features	Best For
tts-1	Standard	Lowest	Basic TTS	Real-time streaming
tts-1-hd	High	Medium	Better fidelity	Offline audio, podcasts
gpt-4o-mini-tts	Best	Medium	Voice instructions, streaming	Maximum control
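The table above can be read as a simple decision rule. The following is a minimal sketch of that rule; `pickTtsModel` and its option names are hypothetical and only encode the trade-offs listed in the table.

```javascript
// Hypothetical helper: choose a TTS model from the table's criteria.
function pickTtsModel({ needsInstructions = false, realtime = false } = {}) {
  if (needsInstructions) return 'gpt-4o-mini-tts'; // only model with voice instructions
  if (realtime) return 'tts-1';                    // lowest latency
  return 'tts-1-hd';                               // higher fidelity for offline audio
}

console.log(pickTtsModel({ realtime: true }));
console.log(pickTtsModel());
```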

Voice Selection Guide

Voice	Character	Best For
alloy	Neutral, balanced	General use, professional
ash	Clear, professional	Business, presentations
ballad	Warm, storytelling	Narration, audiobooks
coral	Soft, friendly	Customer service, greetings
echo	Calm, measured	Meditation, calm content
fable	Expressive, narrative	Stories, entertainment
onyx	Deep, authoritative	News, serious content
nova	Bright, energetic	Marketing, enthusiastic content
sage	Wise, thoughtful	Educational, informative
shimmer	Gentle, soothing	Relaxation, sleep content
verse	Poetic, rhythmic	Poetry, artistic content

Voice Instructions (gpt-4o-mini-tts only)

// Professional tone
const professional = {
  model: 'gpt-4o-mini-tts',
  voice: 'ash',
  input: 'Welcome to our service',
  instructions: 'Speak in a calm, professional, and friendly tone suitable for customer service.',
};

// Energetic marketing
const marketing = {
  model: 'gpt-4o-mini-tts',
  voice: 'nova',
  input: "Don't miss this sale!",
  instructions: 'Use an enthusiastic, energetic tone perfect for marketing and advertisements.',
};

// Meditation guidance
const meditation = {
  model: 'gpt-4o-mini-tts',
  voice: 'shimmer',
  input: 'Take a deep breath',
  instructions: 'Adopt a calm, soothing voice suitable for meditation and relaxation guidance.',
};

// Pass any of these to openai.audio.speech.create(...)
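Because instructions apply only to gpt-4o-mini-tts, it can help to guard against sending them to other models. A minimal sketch, where `buildSpeechRequest` is a hypothetical helper and silently dropping the field is one possible policy among several (throwing an error would be another):

```javascript
// Hypothetical helper: build a request for openai.audio.speech.create(),
// keeping `instructions` only for the model that supports them.
function buildSpeechRequest({ model, voice, input, instructions }) {
  const request = { model, voice, input };

  if (instructions && model === 'gpt-4o-mini-tts') {
    request.instructions = instructions;
  }
  return request;
}

const request = buildSpeechRequest({
  model: 'tts-1',
  voice: 'ash',
  input: 'Welcome to our service',
  instructions: 'Speak in a calm, professional tone.',
});
console.log('instructions' in request); // false, because tts-1 was requested
```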

Speed Control

// Slow (0.5x)
{ speed: 0.5 } // Good for: Learning, accessibility

// Normal (1.0x)
{ speed: 1.0 } // Default

// Fast (1.5x)
{ speed: 1.5 } // Good for: Previews, time-saving

// Very fast (2.0x)
{ speed: 2.0 } // Good for: Quick previews only

Range: 0.25 to 4.0
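User-supplied speeds can be clamped to this range before a request is sent. A minimal sketch, where `clampSpeed` is a hypothetical helper and falling back to 1.0 for non-numeric input is an assumption:

```javascript
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;

// Hypothetical helper: keep a requested speed inside the documented range.
function clampSpeed(speed) {
  if (typeof speed !== 'number' || !Number.isFinite(speed)) return 1.0; // fall back to default
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

console.log(clampSpeed(0.1), clampSpeed(1.5), clampSpeed(10));
```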

Audio Format Selection

Format	Compression	Quality	Best For
mp3	Lossy	Good	Maximum compatibility
opus	Lossy	Excellent	Web streaming, low bandwidth
aac	Lossy	Good	iOS, Apple devices
flac	Lossless	Best	Archiving, editing
wav	Uncompressed	Best	Editing, processing
pcm	Raw	Best	Low-level processing

Common Patterns

1. Transcribe Interview

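import fs from 'fs';
import OpenAI from 'openai';

// Reads OPENAI_API_KEY from the environment
const openai = new OpenAI();

// Upload the audio file for transcription
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream('./interview.mp3'),
  model: 'whisper-1',
});

// Save transcript
fs.writeFileSync('./interview.txt', transcription.text);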

2. Generate Podcast Narration

const script = "Welcome to today's podcast...";

const audio = await openai.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'fable',
  input: script,
  response_format: 'mp3',
});

const buffer = Buffer.from(await audio.arrayBuffer());
fs.writeFileSync('./podcast.mp3', buffer);
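A real podcast script will often be longer than a single request accepts. The sketch below assumes the 4096-character input limit stated in the API reference for the speech endpoint and splits the script at sentence boundaries. `splitScript` is a hypothetical helper, and the sentence detection is deliberately naive.

```javascript
// Hypothetical helper: split text into chunks of at most maxChars,
// preferring sentence boundaries ('.', '!', '?').
function splitScript(text, maxChars = 4096) {
  // Each match is one sentence with its trailing punctuation and whitespace,
  // or the unpunctuated remainder at the end of the text.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) || [];
  const chunks = [];
  let current = '';

  for (const sentence of sentences) {
    // Start a new chunk if this sentence would overflow the current one.
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = '';
    }
    // A single sentence longer than maxChars is cut at the limit.
    let rest = sentence;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current += rest;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

console.log(splitScript('One. Two. Three.', 10));
```

Each chunk can then be sent as its own `input`, and the resulting audio segments joined afterwards.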

3. Multi-Voice Conversation

// Speaker 1
const speaker1 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'onyx',
  input: 'Hello, how are you?',
});

// Speaker 2
const speaker2 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'nova',
  input: 'I\'m doing great, thanks!',
});

// Combine audio files (requires audio processing library)
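One way to combine speakers without an audio library is to request raw `pcm` output and wrap the concatenated samples in a WAV header. This sketch assumes the documented PCM layout (24 kHz, 16-bit signed little-endian) and additionally assumes mono output. `pcmToWav` is a hypothetical helper.

```javascript
// Hypothetical helper: concatenate raw PCM buffers and prepend a
// standard 44-byte WAV (RIFF) header.
function pcmToWav(pcmChunks, { sampleRate = 24000, channels = 1, bitsPerSample = 16 } = {}) {
  const data = Buffer.concat(pcmChunks);
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const header = Buffer.alloc(44);

  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + data.length, 4);         // file size minus 8 bytes
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');
  header.writeUInt32LE(16, 16);                      // size of the fmt chunk
  header.writeUInt16LE(1, 20);                       // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * blockAlign, 28); // byte rate
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(data.length, 40);

  return Buffer.concat([header, data]);
}

// Demo with silent placeholder buffers standing in for API responses.
const wav = pcmToWav([Buffer.alloc(4), Buffer.alloc(6)]);
console.log(wav.length); // 44-byte header + 10 bytes of samples
```

With real responses, each speaker would be requested with `response_format: 'pcm'`, converted with `Buffer.from(await speakerN.arrayBuffer())`, and the result written with `fs.writeFileSync('./conversation.wav', pcmToWav([...]))`.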

Cost Optimization

Use tts-1 for real-time (cheaper, faster)
Use tts-1-hd for final production (better quality)
Cache generated audio and reuse it for identical requests (same model, voice, input, and format)
Choose appropriate format (opus for web, mp3 for compatibility)
Batch transcriptions with delays to avoid rate limits
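The caching tip can be sketched with a content-derived key, so that an identical request maps to the same cached file. `speechCacheKey`, `speechCachePath`, and the `./tts-cache` directory are hypothetical names; the `mp3` default mirrors the API's default response format.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: derive a stable key from every field that
// affects the generated audio.
function speechCacheKey({ model, voice, input, response_format = 'mp3', speed = 1.0, instructions = '' }) {
  const canonical = JSON.stringify([model, voice, input, response_format, speed, instructions]);
  return createHash('sha256').update(canonical).digest('hex');
}

// Hypothetical helper: where the cached audio for a request would live.
function speechCachePath(request, dir = './tts-cache') {
  const extension = request.response_format || 'mp3';
  return `${dir}/${speechCacheKey(request)}.${extension}`;
}

const request = { model: 'tts-1', voice: 'nova', input: 'Hello!' };
console.log(speechCachePath(request));
```

Before calling the API, check whether the file at `speechCachePath(request)` already exists and, if so, serve it instead of generating the audio again.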

Common Issues

Transcription Accuracy

Improve audio quality
Reduce background noise
Ensure adequate volume levels
Use supported audio formats

TTS Naturalness

Test different voices
Use voice instructions (gpt-4o-mini-tts)
Adjust speed for better pacing
Add punctuation for natural pauses

File Size

Compress audio before transcribing
Choose lossy formats (mp3, opus) for TTS
Use appropriate bitrates

See Also: Official Audio Guide (https://platform.openai.com/docs/guides/speech-to-text)
