# Cost Optimization Guide

**Last Updated**: 2025-10-25

Strategies to minimize OpenAI API costs while maintaining quality.

---

## Model Selection Strategies

### 1. Model Cascading

Start with cheaper models and escalate only when needed:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment

async function smartCompletion(prompt: string) {
  // Try gpt-5-nano first
  const nanoResult = await openai.chat.completions.create({
    model: 'gpt-5-nano',
    messages: [{ role: 'user', content: prompt }],
  });

  // isGoodEnough() is an application-specific quality check
  // (e.g., length, format, or a cheap validation pass)
  if (isGoodEnough(nanoResult)) {
    return nanoResult;
  }

  // Escalate to gpt-5-mini
  const miniResult = await openai.chat.completions.create({
    model: 'gpt-5-mini',
    messages: [{ role: 'user', content: prompt }],
  });

  if (isGoodEnough(miniResult)) {
    return miniResult;
  }

  // Final escalation to gpt-5
  return await openai.chat.completions.create({
    model: 'gpt-5',
    messages: [{ role: 'user', content: prompt }],
  });
}
```

### 2. Task-Based Model Selection

| Task | Model | Why |
|------|-------|-----|
| Simple chat | gpt-5-nano | Fast, cheap, sufficient |
| Summarization | gpt-5-mini | Good quality, cost-effective |
| Code generation | gpt-5 | Best reasoning, worth the cost |
| Data extraction | gpt-4o + structured output | Reliable, accurate |
| Vision tasks | gpt-4o | Supports image inputs |
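The table can be applied mechanically with a small routing map. A minimal sketch, where the `TaskType` union and `completeForTask` helper are illustrative names rather than anything from the SDK:

```typescript
type TaskType = 'chat' | 'summarization' | 'code' | 'extraction' | 'vision';

// Mirrors the table above; tune the mapping for your own workloads
const MODEL_FOR_TASK: Record<TaskType, string> = {
  chat: 'gpt-5-nano',
  summarization: 'gpt-5-mini',
  code: 'gpt-5',
  extraction: 'gpt-4o',
  vision: 'gpt-4o',
};

async function completeForTask(task: TaskType, prompt: string) {
  return openai.chat.completions.create({
    model: MODEL_FOR_TASK[task],
    messages: [{ role: 'user', content: prompt }],
  });
}
```

Centralizing the mapping in one place also makes it easy to downgrade or upgrade a task category later without touching call sites.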
---

## Token Optimization

### 1. Limit max_tokens

```typescript
// ❌ No limit: may generate unnecessarily long responses
{
  model: 'gpt-5',
  messages,
}

// ✅ Set a reasonable limit
{
  model: 'gpt-5',
  messages,
  max_tokens: 500, // Prevent runaway generation
}
```

### 2. Trim Conversation History

```typescript
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

function trimHistory(messages: Message[], maxTokens: number = 4000) {
  // Simple heuristic: keep the system message plus the last 10 messages.
  // A fuller implementation would count tokens and trim until under maxTokens.
  const system = messages.find(m => m.role === 'system');
  const recent = messages.filter(m => m.role !== 'system').slice(-10);
  return system ? [system, ...recent] : recent;
}
```

### 3. Use Shorter Prompts

```typescript
// ❌ Verbose
"Please analyze the following text and provide a detailed summary of the main points, including any key takeaways and important details..."

// ✅ Concise
"Summarize key points:"
```

---

## Caching Strategies

### 1. Cache Embeddings

```typescript
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbedding(text: string) {
  if (embeddingCache.has(text)) {
    return embeddingCache.get(text)!;
  }

  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });

  const embedding = response.data[0].embedding;
  embeddingCache.set(text, embedding);
  return embedding;
}
```

### 2. Cache Common Completions

```typescript
const completionCache = new Map<string, string>();

async function getCachedCompletion(prompt: string) {
  const model = 'gpt-5-mini';
  const cacheKey = `${model}:${prompt}`;
  if (completionCache.has(cacheKey)) {
    return completionCache.get(cacheKey)!;
  }

  const result = await openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
  });

  const content = result.choices[0].message.content;
  completionCache.set(cacheKey, content!);
  return content;
}
```

---

## Batch Processing

### 1. Batch Embedding Inputs

```typescript
// ❌ One request per document (slow, more calls against rate limits)
for (const doc of documents) {
  await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: doc,
  });
}

// ✅ Single request for the whole array (same token cost, far fewer API calls)
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: documents, // Array of up to 2048 documents
});
```

### 2. Group Similar Requests

```typescript
// Queue non-urgent prompts and process them together once enough accumulate
const batchQueue: string[] = [];

function queueForBatch(prompt: string) {
  batchQueue.push(prompt);
  if (batchQueue.length >= 10) {
    processBatch();
  }
}

async function processBatch() {
  // Snapshot and clear the queue so prompts added mid-flight aren't lost
  const batch = batchQueue.splice(0, batchQueue.length);

  // Process all at once
  const results = await Promise.all(
    batch.map(prompt =>
      openai.chat.completions.create({
        model: 'gpt-5-nano',
        messages: [{ role: 'user', content: prompt }],
      })
    )
  );

  return results;
}
```

---

## Feature-Specific Optimization

### Embeddings

1. **Use custom dimensions**: 256 instead of 1536 = 6x storage reduction (see the sketch at the end of this guide)
2. **Use text-embedding-3-small**: Cheaper than large, good for most use cases
3. **Batch requests**: Up to 2048 inputs per request

### Images

1. **Use standard quality**: Unless HD is critical
2. **Use smaller sizes**: Generate 1024x1024 instead of 1792x1024 when possible
3. **Pick size and quality first**: Style (natural vs. vivid) does not affect price

### Audio

1. **Use tts-1 for real-time**: Cheaper than tts-1-hd
2. **Use opus format**: Smaller files, good quality
3. **Cache generated audio**: Reuse output for repeated text instead of regenerating

---

## Monitoring and Alerts

```typescript
interface CostTracker {
  totalTokens: number;
  totalCost: number;
  requestCount: number;
}

const tracker: CostTracker = {
  totalTokens: 0,
  totalCost: 0,
  requestCount: 0,
};

// Placeholder rate table (USD per 1M tokens); fill in from the current pricing page
const RATES_PER_MILLION_TOKENS: Record<string, number> = {};

function estimateCost(model: string, totalTokens: number): number {
  return ((RATES_PER_MILLION_TOKENS[model] ?? 0) / 1_000_000) * totalTokens;
}

async function trackCosts<T extends { model: string; usage?: { total_tokens: number } }>(
  fn: () => Promise<T>
): Promise<T> {
  const result = await fn();

  if (result.usage) {
    tracker.totalTokens += result.usage.total_tokens;
    tracker.requestCount++;

    // Estimate cost from the rate table above
    const cost = estimateCost(result.model, result.usage.total_tokens);
    tracker.totalCost += cost;

    // Alert if threshold exceeded
    if (tracker.totalCost > 100) {
      console.warn('Cost threshold exceeded!', tracker);
    }
  }

  return result;
}
```

---

## Cost Reduction Checklist

- [ ] Use the cheapest model that meets requirements
- [ ] Set max_tokens limits
- [ ] Trim conversation history
- [ ] Cache embeddings and common queries
- [ ] Batch requests when possible
- [ ] Use custom embedding dimensions (256-512)
- [ ] Monitor token usage
- [ ] Implement rate limiting
- [ ] Use structured outputs to avoid retries
- [ ] Compress prompts (remove unnecessary words)

---

**Estimated Savings**: Following these practices can reduce costs by 40-70% while maintaining quality.
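---

As referenced above, the "custom embedding dimensions" recommendation maps directly to the `dimensions` parameter supported by the text-embedding-3 models. A minimal sketch (the `embedCompact` helper name is illustrative, not part of the SDK):

```typescript
// Request 256-dimensional vectors instead of the default 1536 for
// text-embedding-3-small: roughly 6x less vector storage, with some
// possible loss in retrieval quality that should be validated per use case.
async function embedCompact(texts: string[]) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
    dimensions: 256,
  });
  return response.data.map(d => d.embedding);
}
```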