# Cost Optimization Guide

*Last Updated: 2025-10-25*

Strategies to minimize OpenAI API costs while maintaining quality.
## Model Selection Strategies

### 1. Model Cascading

Start with cheaper models, escalate only when needed:
```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Placeholder quality check: replace with your own heuristic (length, format, keywords, etc.)
function isGoodEnough(result: OpenAI.Chat.Completions.ChatCompletion): boolean {
  return Boolean(result.choices[0]?.message?.content);
}

async function smartCompletion(prompt: string) {
  // Try gpt-5-nano first
  const nanoResult = await openai.chat.completions.create({
    model: 'gpt-5-nano',
    messages: [{ role: 'user', content: prompt }],
  });

  // Validate quality
  if (isGoodEnough(nanoResult)) {
    return nanoResult;
  }

  // Escalate to gpt-5-mini
  const miniResult = await openai.chat.completions.create({
    model: 'gpt-5-mini',
    messages: [{ role: 'user', content: prompt }],
  });

  if (isGoodEnough(miniResult)) {
    return miniResult;
  }

  // Final escalation to gpt-5
  return openai.chat.completions.create({
    model: 'gpt-5',
    messages: [{ role: 'user', content: prompt }],
  });
}
```
### 2. Task-Based Model Selection
| Task | Model | Why |
|---|---|---|
| Simple chat | gpt-5-nano | Fast, cheap, sufficient |
| Summarization | gpt-5-mini | Good quality, cost-effective |
| Code generation | gpt-5 | Best reasoning, worth the cost |
| Data extraction | gpt-4o + structured output | Reliable, accurate |
| Vision tasks | gpt-4o | Only model with vision |
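
A minimal routing sketch based on the table above; the task labels, the `pickModel` helper, and the mapping itself are illustrative assumptions rather than anything defined by the API:

```typescript
// Hypothetical task-to-model routing table mirroring the selection table above
type Task = 'chat' | 'summarization' | 'code' | 'extraction' | 'vision';

const MODEL_BY_TASK: Record<Task, string> = {
  chat: 'gpt-5-nano',
  summarization: 'gpt-5-mini',
  code: 'gpt-5',
  extraction: 'gpt-4o',
  vision: 'gpt-4o',
};

function pickModel(task: Task): string {
  return MODEL_BY_TASK[task];
}
```

Keeping the mapping in one place makes it easy to adjust as pricing or model quality changes.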
## Token Optimization

### 1. Limit max_tokens
```typescript
// ❌ No limit: may generate unnecessarily long responses
{
  model: 'gpt-5',
  messages,
}

// ✅ Set a reasonable limit
{
  model: 'gpt-5',
  messages,
  max_tokens: 500, // Prevent runaway generation
}
```
### 2. Trim Conversation History
```typescript
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

function trimHistory(messages: Message[], maxTokens: number = 4000) {
  // Keep the system message and the last 10 messages as a simple proxy;
  // a full version would count tokens against maxTokens with a tokenizer.
  const system = messages.find(m => m.role === 'system');
  const recent = messages.filter(m => m.role !== 'system').slice(-10);
  return system ? [system, ...recent] : recent;
}
```
### 3. Use Shorter Prompts
```typescript
// ❌ Verbose
"Please analyze the following text and provide a detailed summary of the main points, including any key takeaways and important details..."

// ✅ Concise
"Summarize key points:"
```
## Caching Strategies

### 1. Cache Embeddings
```typescript
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbedding(text: string) {
  if (embeddingCache.has(text)) {
    return embeddingCache.get(text)!;
  }

  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });

  const embedding = response.data[0].embedding;
  embeddingCache.set(text, embedding);
  return embedding;
}
```
### 2. Cache Common Completions
```typescript
const completionCache = new Map<string, string>();

async function getCachedCompletion(prompt: string, model: string = 'gpt-5-mini') {
  const cacheKey = `${model}:${prompt}`;
  if (completionCache.has(cacheKey)) {
    return completionCache.get(cacheKey)!;
  }

  const result = await openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
  });

  const content = result.choices[0].message.content;
  completionCache.set(cacheKey, content!);
  return content;
}
```
## Batch Processing

### 1. Batch Embedding Inputs
```typescript
// ❌ Individual requests: one round trip (and request overhead) per document
for (const doc of documents) {
  await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: doc,
  });
}

// ✅ Single batched request: fewer round trips and less overhead
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: documents, // Array of up to 2048 inputs
});
```
### 2. Group Similar Requests
```typescript
// Process non-urgent requests in batches (e.g. during off-peak hours)
const batchQueue: string[] = [];

function queueForBatch(prompt: string) {
  batchQueue.push(prompt);
  if (batchQueue.length >= 10) {
    void processBatch();
  }
}

async function processBatch() {
  // Snapshot and clear the queue so prompts queued mid-flight aren't lost
  const batch = batchQueue.splice(0, batchQueue.length);

  // Process all at once
  const results = await Promise.all(
    batch.map(prompt =>
      openai.chat.completions.create({
        model: 'gpt-5-nano',
        messages: [{ role: 'user', content: prompt }],
      })
    )
  );

  return results;
}
```
## Feature-Specific Optimization

### Embeddings
- Use custom dimensions: 256 instead of 1536 = 6x storage reduction
- Use text-embedding-3-small: Cheaper than large, good for most use cases
- Batch requests: Up to 2048 documents per request
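
A minimal sketch of the points above; the `dimensions` parameter is available on the text-embedding-3 models, and `documents` is assumed to be a string array as in the batching example earlier:

```typescript
// Reduced-dimension embeddings cut vector storage and downstream compute
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: documents, // batch up to 2048 inputs per request
  dimensions: 256,  // shortened vectors: 6x less storage than the default 1536
});
```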
### Images
- Use standard quality: Unless HD is critical
- Use smaller sizes: Generate 1024x1024 instead of 1792x1024 when possible
- Use natural style where it suits the content; note that price is driven by quality and size rather than style
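
A sketch of the image settings above; the model name and prompt are placeholders:

```typescript
// Standard quality at the smallest size that meets the requirement
const image = await openai.images.generate({
  model: 'dall-e-3',
  prompt: 'A simple product illustration', // placeholder prompt
  size: '1024x1024',   // smaller sizes cost less than 1792x1024 or 1024x1792
  quality: 'standard', // 'hd' costs more; use it only when detail matters
  style: 'natural',
});
```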
### Audio
- Use tts-1 for real-time: Cheaper than tts-1-hd
- Use opus format: Smaller files, good quality
- Cache generated audio: reuse the output whenever the same input text is requested again
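
A sketch of the audio points above, with a simple in-memory cache; the voice choice and the cache itself are illustrative:

```typescript
const ttsCache = new Map<string, Buffer>();

async function speak(text: string): Promise<Buffer> {
  const cached = ttsCache.get(text);
  if (cached) return cached;

  const response = await openai.audio.speech.create({
    model: 'tts-1',          // cheaper than tts-1-hd, fine for real-time use
    voice: 'alloy',          // illustrative voice choice
    input: text,
    response_format: 'opus', // smaller files, good quality
  });

  const audio = Buffer.from(await response.arrayBuffer());
  ttsCache.set(text, audio);
  return audio;
}
```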
## Monitoring and Alerts
```typescript
interface CostTracker {
  totalTokens: number;
  totalCost: number;
  requestCount: number;
}

const tracker: CostTracker = {
  totalTokens: 0,
  totalCost: 0,
  requestCount: 0,
};

// `fn` is expected to resolve to an API response that exposes `usage` and `model`
async function trackCosts(fn: () => Promise<any>) {
  const result = await fn();

  if (result.usage) {
    tracker.totalTokens += result.usage.total_tokens;
    tracker.requestCount++;

    // Estimate cost (estimateCost maps model + tokens to dollars; adjust rates to current pricing)
    const cost = estimateCost(result.model, result.usage.total_tokens);
    tracker.totalCost += cost;

    // Alert if threshold exceeded
    if (tracker.totalCost > 100) {
      console.warn('Cost threshold exceeded!', tracker);
    }
  }

  return result;
}
```
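
A possible `estimateCost` helper for the tracker above. The per-token rates below are placeholders rather than real pricing, and real pricing also distinguishes input from output tokens, so treat this as a sketch:

```typescript
// Placeholder rates in dollars per 1M total tokens; replace with the current price list.
const RATE_PER_MILLION_TOKENS: Record<string, number> = {
  'gpt-5-nano': 0.1, // assumed rate, for illustration only
  'gpt-5-mini': 0.5, // assumed rate, for illustration only
  'gpt-5': 5,        // assumed rate, for illustration only
};

function estimateCost(model: string, totalTokens: number): number {
  const rate = RATE_PER_MILLION_TOKENS[model] ?? 5; // default to the highest assumed rate
  return (totalTokens / 1_000_000) * rate;
}
```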
## Cost Reduction Checklist
- Use cheapest model that meets requirements
- Set max_tokens limits
- Trim conversation history
- Cache embeddings and common queries
- Batch requests when possible
- Use custom embedding dimensions (256-512)
- Monitor token usage
- Implement rate limiting
- Use structured outputs to avoid retries
- Compress prompts (remove unnecessary words)
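
For the structured-outputs item in the checklist, a hedged sketch of how a JSON schema response format can avoid malformed outputs and the retries they cause; the schema, prompt, and model choice are illustrative:

```typescript
const completion = await openai.chat.completions.create({
  model: 'gpt-5-mini',
  messages: [{ role: 'user', content: 'Extract the invoice number and total.' }],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'invoice_fields', // illustrative schema
      strict: true,
      schema: {
        type: 'object',
        properties: {
          invoice_number: { type: 'string' },
          total: { type: 'number' },
        },
        required: ['invoice_number', 'total'],
        additionalProperties: false,
      },
    },
  },
});
```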
**Estimated Savings**: Following these practices can reduce costs by 40-70% while maintaining quality.