Cost Optimization Guide

1. LLM Caching

How It Works

  • First request: Full cost (input_cache_write)
  • Subsequent requests: 10% cost (input_cache_read)
  • Cache TTL: 5 minutes to 1 hour (configurable)

Configuration

{
  "llm_config": {
    "caching": {
      "enabled": true,
      "ttl_seconds": 3600
    }
  }
}

What Gets Cached

Cached:

  • System prompts
  • Tool definitions
  • Knowledge base context
  • Recent conversation history

Not cached:

  • User messages (always fresh)
  • Dynamic variables
  • Tool responses

Savings

Up to 90% on cached inputs

Example:

  • System prompt: 500 tokens
  • Without caching: 500 tokens × 100 conversations = 50,000 tokens
  • With caching: 500 tokens (first) + 50 tokens × 99 (cached) = 5,450 tokens
  • Savings: 89%
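
A quick way to sanity-check these numbers (a minimal sketch; the 10% cache-read rate comes from the pricing above, and the token figures are the hypothetical ones from this example):

function cachedPromptTokens(
  promptTokens: number,
  conversations: number,
  cacheReadRate = 0.1 // cache reads billed at 10% of the normal input rate
): { uncached: number; cached: number; savings: string } {
  const uncached = promptTokens * conversations;
  // First conversation writes the cache at full cost; the rest read it at 10%
  const cached = promptTokens + promptTokens * cacheReadRate * (conversations - 1);
  return {
    uncached,
    cached,
    savings: (100 * (1 - cached / uncached)).toFixed(1) + '%'
  };
}

console.log(cachedPromptTokens(500, 100));
// → { uncached: 50000, cached: 5450, savings: '89.1%' }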

2. Model Swapping

Model Comparison

| Model | Cost (per 1M tokens) | Speed | Quality | Best For |
|---|---|---|---|---|
| GPT-4o | $5 | Medium | Highest | Complex reasoning |
| GPT-4o-mini | $0.15 | Fast | High | Most use cases |
| Claude Sonnet 4.5 | $3 | Medium | Highest | Long context |
| Gemini 2.5 Flash | $0.075 | Fastest | Medium | Simple tasks |

Configuration

{
  "llm_config": {
    "model": "gpt-4o-mini"
  }
}

Optimization Strategy

  1. Start with gpt-4o-mini for all agents
  2. Upgrade to gpt-4o only if:
    • Complex reasoning is required
    • High accuracy is critical
    • User feedback indicates quality issues
  3. Use Gemini 2.5 Flash (see the routing sketch below) for:
    • Simple routing/classification
    • FAQ responses
    • Order status lookups
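
This tiering can be encoded as a simple router (a hypothetical sketch: pickModel and the task categories are illustrative, not a platform API; the model IDs follow the table above):

type Task = 'routing' | 'faq' | 'order_status' | 'support' | 'complex_reasoning';

// Route cheap, high-volume tasks to the cheapest model and reserve
// expensive models for work where reasoning quality is critical.
function pickModel(task: Task): string {
  switch (task) {
    case 'routing':
    case 'faq':
    case 'order_status':
      return 'gemini-2.5-flash'; // simple lookups and classification
    case 'complex_reasoning':
      return 'gpt-4o'; // only when accuracy is critical
    default:
      return 'gpt-4o-mini'; // sensible default for most agents
  }
}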

Savings

Up to ~98% (gpt-4o → Gemini 2.5 Flash, per the token prices in the table above)


3. Burst Pricing

How It Works

  • Normal: Your subscription concurrency limit (e.g., 10 calls)
  • Burst: Up to 3× your limit (e.g., 30 calls)
  • Cost: 2× per-minute rate for burst calls

Configuration

{
  "call_limits": {
    "burst_pricing_enabled": true
  }
}

When to Use

Good fit:

  • Black Friday traffic spikes
  • Product launches
  • Seasonal demand (holidays)
  • Marketing campaigns

Poor fit:

  • Sustained high traffic (upgrade your plan instead)
  • Unpredictable usage patterns

Cost Calculation

Example:

  • Subscription: 10 concurrent calls ($0.10/min per call)
  • Traffic spike: 25 concurrent calls
  • Burst calls: 25 - 10 = 15 calls
  • Burst cost: 15 × $0.20/min = $3/min
  • Regular cost: 10 × $0.10/min = $1/min
  • Total: $4/min during spike
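
The same arithmetic as a reusable helper (a sketch; the rate, limit, and 2× burst multiplier are the example values above):

// Per-minute cost during a spike: calls above the subscription limit
// are burst calls, billed at 2x the per-minute rate.
function spikeCostPerMinute(
  concurrentCalls: number,
  subscriptionLimit: number,
  ratePerMinute: number,
  burstMultiplier = 2
): number {
  const regular = Math.min(concurrentCalls, subscriptionLimit);
  const burst = Math.max(0, concurrentCalls - subscriptionLimit);
  return regular * ratePerMinute + burst * ratePerMinute * burstMultiplier;
}

console.log(spikeCostPerMinute(25, 10, 0.10)); // → 4 ($/min during the spike)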

4. Prompt Optimization

Reduce Token Count

Before (~60 tokens):

You are a highly experienced and knowledgeable customer support specialist with extensive training in technical troubleshooting, customer service best practices, and empathetic communication. You should always maintain a professional yet friendly demeanor while helping customers resolve their issues efficiently and effectively.

After (~15 tokens):

You are an experienced support specialist. Be professional, friendly, and efficient.

Savings: ~75% token reduction

Use Tools Instead of Context

Before: FAQ content embedded in the system prompt (2,000 tokens)
After: RAG/knowledge base retrieval (100 tokens of instructions + on-demand retrieval)

Savings: up to 95% for large knowledge bases


5. Turn-Taking Optimization

Impact on Cost

| Mode | Latency | LLM Calls | Cost Impact |
|---|---|---|---|
| Eager | Low | More | Higher (more interruptions) |
| Normal | Medium | Medium | Balanced |
| Patient | High | Fewer | Lower (fewer interruptions) |

Recommendation

Use Patient mode for cost-sensitive applications where speed is less critical.
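
If your platform exposes turn-taking in the agent config, the setting likely follows the same pattern as the other blocks in this guide (a hypothetical sketch; "turn_taking" and "mode" are assumed field names, so check your platform's schema):

{
  "turn_taking": {
    "mode": "patient" // assumed field names; verify against your schema
  }
}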


6. Voice Settings

Speed vs Cost

| Speed | TTS Cost | User Experience |
|---|---|---|
| 0.7x | Higher (longer audio) | Slow |
| 1.0x | Baseline | Natural |
| 1.2x | Lower (shorter audio) | Fast |

Recommendation

Use 1.1x speed for slight cost savings without compromising experience.
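
Following the same config pattern (a hypothetical sketch; the "tts" and "speed" field names are assumptions, so verify against your platform's voice settings schema):

{
  "tts": {
    "speed": 1.1 // assumed field names; verify against your schema
  }
}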


7. Conversation Duration Limits

Configuration

{
  "conversation": {
    "max_duration_seconds": 300 // 5 minutes
  }
}

Use Cases

  • FAQ bots (limit: 2-3 minutes)
  • Order status (limit: 1 minute)
  • Full support (limit: 10-15 minutes)

Savings

Caps the worst case: no conversation can run longer, or cost more, than the configured limit allows.


8. Analytics-Driven Optimization

Monitor Metrics

  1. Average conversation duration
  2. LLM tokens per conversation
  3. Tool call frequency
  4. Resolution rate

Identify Issues

  • Long conversations → improve prompts or add escalation
  • High token count → enable caching or shorten prompts
  • Low resolution rate → upgrade model or improve knowledge base
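
This mapping can run as an automated weekly check (a sketch; the threshold values are illustrative, not platform recommendations):

interface AgentMetrics {
  avgDurationSeconds: number;
  avgTokensPerConversation: number;
  resolutionRate: number; // 0..1
}

// Flag likely cost problems from conversation analytics.
function costWarnings(m: AgentMetrics): string[] {
  const warnings: string[] = [];
  if (m.avgDurationSeconds > 300)
    warnings.push('Long conversations: improve prompts or add escalation');
  if (m.avgTokensPerConversation > 5000)
    warnings.push('High token count: enable caching or shorten prompts');
  if (m.resolutionRate < 0.7)
    warnings.push('Low resolution rate: upgrade model or improve knowledge base');
  return warnings;
}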

9. Cost Monitoring

API Usage Tracking

// `client` is assumed to be an already-initialized SDK client for your platform
const usage = await client.analytics.getLLMUsage({
  agent_id: 'agent_123',
  from_date: '2025-11-01', // inclusive reporting window
  to_date: '2025-11-30'
});

console.log('Total tokens:', usage.total_tokens);
console.log('Cached tokens:', usage.cached_tokens);
console.log('Cost:', usage.total_cost);
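
The same response can drive a cache hit rate check (a short sketch reusing the fields printed above):

// Share of input tokens served from cache; a low value suggests caching
// is disabled or prompts change too often to stay cached.
const hitRate = usage.cached_tokens / usage.total_tokens;
console.log(`Cache hit rate: ${(hitRate * 100).toFixed(1)}%`);
if (hitRate < 0.5) {
  console.log('Low hit rate: confirm caching is enabled and prompts are stable');
}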

Set Budgets

{
  "cost_limits": {
    "daily_budget_usd": 100,
    "monthly_budget_usd": 2000
  }
}

10. Cost Optimization Checklist

Before Launch

  • Enable LLM caching
  • Use gpt-4o-mini (not gpt-4o)
  • Optimize prompt length
  • Set conversation duration limits
  • Use RAG instead of large system prompts
  • Configure burst pricing if needed

During Operation

  • Monitor LLM token usage weekly
  • Review conversation analytics monthly
  • Test cheaper models quarterly
  • Optimize prompts based on analytics
  • Review and remove unused tools

Continuous Improvement

  • A/B test cheaper models
  • Analyze long conversations
  • Improve resolution rates
  • Reduce average conversation duration
  • Increase cache hit rates

Expected Savings

Baseline Configuration:

  • Model: gpt-4o
  • No caching
  • Average prompt: 1,000 tokens
  • Average conversation: 5 minutes
  • Cost: ~$0.50/conversation

Optimized Configuration:

  • Model: gpt-4o-mini
  • Caching enabled
  • Average prompt: 300 tokens
  • Average conversation: 3 minutes
  • Cost: ~$0.05/conversation

Total Savings: 90% 🎉