# Cost Optimization Guide

*Last Updated: 2025-10-25*

Strategies to minimize OpenAI API costs while maintaining quality.
## Model Selection Strategies

### 1. Model Cascading

Start with cheaper models, escalate only when needed:
```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Placeholder quality check: replace with your own heuristic (length, format, keywords, etc.)
function isGoodEnough(result: OpenAI.Chat.Completions.ChatCompletion): boolean {
  return Boolean(result.choices[0]?.message?.content);
}

async function smartCompletion(prompt: string) {
  // Try gpt-5-nano first
  const nanoResult = await openai.chat.completions.create({
    model: 'gpt-5-nano',
    messages: [{ role: 'user', content: prompt }],
  });

  // Validate quality
  if (isGoodEnough(nanoResult)) {
    return nanoResult;
  }

  // Escalate to gpt-5-mini
  const miniResult = await openai.chat.completions.create({
    model: 'gpt-5-mini',
    messages: [{ role: 'user', content: prompt }],
  });

  if (isGoodEnough(miniResult)) {
    return miniResult;
  }

  // Final escalation to gpt-5
  return openai.chat.completions.create({
    model: 'gpt-5',
    messages: [{ role: 'user', content: prompt }],
  });
}
```
### 2. Task-Based Model Selection
| Task | Model | Why |
|---|---|---|
| Simple chat | gpt-5-nano | Fast, cheap, sufficient |
| Summarization | gpt-5-mini | Good quality, cost-effective |
| Code generation | gpt-5 | Best reasoning, worth the cost |
| Data extraction | gpt-4o + structured output | Reliable, accurate |
| Vision tasks | gpt-4o | Only model with vision |
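
A minimal routing sketch based on the table above; the task labels, the `pickModel` helper, and the mapping itself are illustrative assumptions rather than anything defined by the API:

```typescript
// Hypothetical task-to-model routing table mirroring the selection table above
type Task = 'chat' | 'summarization' | 'code' | 'extraction' | 'vision';

const MODEL_BY_TASK: Record<Task, string> = {
  chat: 'gpt-5-nano',
  summarization: 'gpt-5-mini',
  code: 'gpt-5',
  extraction: 'gpt-4o',
  vision: 'gpt-4o',
};

function pickModel(task: Task): string {
  return MODEL_BY_TASK[task];
}
```

Keeping the mapping in one place makes it easy to adjust as pricing or model quality changes.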
## Token Optimization

### 1. Limit max_tokens
```typescript
// ❌ No limit: may generate unnecessarily long responses
{
  model: 'gpt-5',
  messages,
}

// ✅ Set a reasonable limit
{
  model: 'gpt-5',
  messages,
  max_tokens: 500, // Prevent runaway generation
}
```
### 2. Trim Conversation History
```typescript
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

function trimHistory(messages: Message[], maxTokens: number = 4000) {
  // Keep the system message and the last 10 messages as a simple proxy;
  // a full version would count tokens against maxTokens with a tokenizer.
  const system = messages.find(m => m.role === 'system');
  const recent = messages.filter(m => m.role !== 'system').slice(-10);
  return system ? [system, ...recent] : recent;
}
```
### 3. Use Shorter Prompts
```typescript
// ❌ Verbose
"Please analyze the following text and provide a detailed summary of the main points, including any key takeaways and important details..."

// ✅ Concise
"Summarize key points:"
```
## Caching Strategies

### 1. Cache Embeddings
```typescript
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbedding(text: string) {
  if (embeddingCache.has(text)) {
    return embeddingCache.get(text)!;
  }

  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });

  const embedding = response.data[0].embedding;
  embeddingCache.set(text, embedding);
  return embedding;
}
```
### 2. Cache Common Completions
```typescript
const completionCache = new Map<string, string>();

async function getCachedCompletion(prompt: string, model: string = 'gpt-5-mini') {
  const cacheKey = `${model}:${prompt}`;
  if (completionCache.has(cacheKey)) {
    return completionCache.get(cacheKey)!;
  }

  const result = await openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
  });

  const content = result.choices[0].message.content;
  completionCache.set(cacheKey, content!);
  return content;
}
```
## Batch Processing

### 1. Batch Embedding Inputs
```typescript
// ❌ Individual requests: one round trip (and request overhead) per document
for (const doc of documents) {
  await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: doc,
  });
}

// ✅ Single batched request: fewer round trips and less overhead
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: documents, // Array of up to 2048 inputs
});
```
### 2. Group Similar Requests
```typescript
// Process non-urgent requests in batches (e.g. during off-peak hours)
const batchQueue: string[] = [];

function queueForBatch(prompt: string) {
  batchQueue.push(prompt);
  if (batchQueue.length >= 10) {
    void processBatch();
  }
}

async function processBatch() {
  // Snapshot and clear the queue so prompts queued mid-flight aren't lost
  const batch = batchQueue.splice(0, batchQueue.length);

  // Process all at once
  const results = await Promise.all(
    batch.map(prompt =>
      openai.chat.completions.create({
        model: 'gpt-5-nano',
        messages: [{ role: 'user', content: prompt }],
      })
    )
  );

  return results;
}
```
## Feature-Specific Optimization

### Embeddings
- Use custom dimensions: 256 instead of 1536 = 6x storage reduction
- Use text-embedding-3-small: Cheaper than large, good for most use cases
- Batch requests: Up to 2048 documents per request
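
A minimal sketch of the points above; the `dimensions` parameter is available on the text-embedding-3 models, and `documents` is assumed to be a string array as in the batching example earlier:

```typescript
// Reduced-dimension embeddings cut vector storage and downstream compute
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: documents, // batch up to 2048 inputs per request
  dimensions: 256,  // shortened vectors: 6x less storage than the default 1536
});
```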
### Images
- Use standard quality: Unless HD is critical
- Use smaller sizes: Generate 1024x1024 instead of 1792x1024 when possible
- Use natural style where it suits the content; note that price is driven by quality and size rather than style
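
A sketch of the image settings above; the model name and prompt are placeholders:

```typescript
// Standard quality at the smallest size that meets the requirement
const image = await openai.images.generate({
  model: 'dall-e-3',
  prompt: 'A simple product illustration', // placeholder prompt
  size: '1024x1024',   // smaller sizes cost less than 1792x1024 or 1024x1792
  quality: 'standard', // 'hd' costs more; use it only when detail matters
  style: 'natural',
});
```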
### Audio
- Use tts-1 for real-time: Cheaper than tts-1-hd
- Use opus format: Smaller files, good quality
- Cache generated audio: reuse the output whenever the same input text is requested again
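
A sketch of the audio points above, with a simple in-memory cache; the voice choice and the cache itself are illustrative:

```typescript
const ttsCache = new Map<string, Buffer>();

async function speak(text: string): Promise<Buffer> {
  const cached = ttsCache.get(text);
  if (cached) return cached;

  const response = await openai.audio.speech.create({
    model: 'tts-1',          // cheaper than tts-1-hd, fine for real-time use
    voice: 'alloy',          // illustrative voice choice
    input: text,
    response_format: 'opus', // smaller files, good quality
  });

  const audio = Buffer.from(await response.arrayBuffer());
  ttsCache.set(text, audio);
  return audio;
}
```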
## Monitoring and Alerts
```typescript
interface CostTracker {
  totalTokens: number;
  totalCost: number;
  requestCount: number;
}

const tracker: CostTracker = {
  totalTokens: 0,
  totalCost: 0,
  requestCount: 0,
};

// `fn` is expected to resolve to an API response that exposes `usage` and `model`
async function trackCosts(fn: () => Promise<any>) {
  const result = await fn();

  if (result.usage) {
    tracker.totalTokens += result.usage.total_tokens;
    tracker.requestCount++;

    // Estimate cost (estimateCost maps model + tokens to dollars; adjust rates to current pricing)
    const cost = estimateCost(result.model, result.usage.total_tokens);
    tracker.totalCost += cost;

    // Alert if threshold exceeded
    if (tracker.totalCost > 100) {
      console.warn('Cost threshold exceeded!', tracker);
    }
  }

  return result;
}
```
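
A possible `estimateCost` helper for the tracker above. The per-token rates below are placeholders rather than real pricing, and real pricing also distinguishes input from output tokens, so treat this as a sketch:

```typescript
// Placeholder rates in dollars per 1M total tokens; replace with the current price list.
const RATE_PER_MILLION_TOKENS: Record<string, number> = {
  'gpt-5-nano': 0.1, // assumed rate, for illustration only
  'gpt-5-mini': 0.5, // assumed rate, for illustration only
  'gpt-5': 5,        // assumed rate, for illustration only
};

function estimateCost(model: string, totalTokens: number): number {
  const rate = RATE_PER_MILLION_TOKENS[model] ?? 5; // default to the highest assumed rate
  return (totalTokens / 1_000_000) * rate;
}
```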
## Cost Reduction Checklist
- Use cheapest model that meets requirements
- Set max_tokens limits
- Trim conversation history
- Cache embeddings and common queries
- Batch requests when possible
- Use custom embedding dimensions (256-512)
- Monitor token usage
- Implement rate limiting
- Use structured outputs to avoid retries
- Compress prompts (remove unnecessary words)
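
For the structured-outputs item in the checklist, a hedged sketch of how a JSON schema response format can avoid malformed outputs and the retries they cause; the schema, prompt, and model choice are illustrative:

```typescript
const completion = await openai.chat.completions.create({
  model: 'gpt-5-mini',
  messages: [{ role: 'user', content: 'Extract the invoice number and total.' }],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'invoice_fields', // illustrative schema
      strict: true,
      schema: {
        type: 'object',
        properties: {
          invoice_number: { type: 'string' },
          total: { type: 'number' },
        },
        required: ['invoice_number', 'total'],
        additionalProperties: false,
      },
    },
  },
});
```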
**Estimated Savings**: Following these practices can reduce costs by 40-70% while maintaining quality.