# Cost Optimization Guide

## 1. LLM Caching
### How It Works

- First request: full cost (`input_cache_write`)
- Subsequent requests: 10% cost (`input_cache_read`)
- Cache TTL: 5 minutes to 1 hour (configurable)
### Configuration

```json
{
  "llm_config": {
    "caching": {
      "enabled": true,
      "ttl_seconds": 3600
    }
  }
}
```
### What Gets Cached

✅ System prompts
✅ Tool definitions
✅ Knowledge base context
✅ Recent conversation history

❌ User messages (always fresh)
❌ Dynamic variables
❌ Tool responses
### Savings

Up to 90% on cached inputs.

Example:

- System prompt: 500 tokens
- Without caching: 500 tokens × 100 conversations = 50,000 tokens
- With caching: 500 tokens (first) + 50 tokens × 99 (cached) = 5,450 tokens
- Savings: ~89%
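The arithmetic above is easy to reproduce. A minimal sketch, where the token counts and the 10% cached-read rate come straight from the example:

```typescript
// Token spend for a prompt that is cache-written once, then cache-read.
// Cached reads are billed at 10% of the full rate, per the rates above.
function cachedTokenSpend(promptTokens: number, conversations: number): number {
  const firstWrite = promptTokens;                              // full-cost cache write
  const cachedReads = (conversations - 1) * promptTokens * 0.1; // 10% per read
  return firstWrite + cachedReads;
}

const withCaching = cachedTokenSpend(500, 100); // 5,450 tokens
const withoutCaching = 500 * 100;               // 50,000 tokens
console.log(`Savings: ${((1 - withCaching / withoutCaching) * 100).toFixed(1)}%`); // 89.1%
```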
## 2. Model Swapping

### Model Comparison
| Model | Cost (per 1M tokens) | Speed | Quality | Best For |
|---|---|---|---|---|
| GPT-4o | $5 | Medium | Highest | Complex reasoning |
| GPT-4o-mini | $0.15 | Fast | High | Most use cases |
| Claude Sonnet 4.5 | $3 | Medium | Highest | Long context |
| Gemini 2.5 Flash | $0.075 | Fastest | Medium | Simple tasks |
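To translate the table into per-conversation numbers, here is a rough comparison. The prices come from the table above; the 3,000-token conversation size is an illustrative assumption:

```typescript
// Approximate LLM cost per conversation at each model's per-1M-token price.
const pricePerMTok: Record<string, number> = {
  "GPT-4o": 5,
  "GPT-4o-mini": 0.15,
  "Claude Sonnet 4.5": 3,
  "Gemini 2.5 Flash": 0.075,
};

const tokensPerConversation = 3_000; // assumed for illustration
for (const [model, price] of Object.entries(pricePerMTok)) {
  const cost = (tokensPerConversation / 1_000_000) * price;
  console.log(`${model}: $${cost.toFixed(5)}/conversation`);
}
```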
### Configuration

```json
{
  "llm_config": {
    "model": "gpt-4o-mini"
  }
}
```
### Optimization Strategy

- Start with gpt-4o-mini for all agents
- Upgrade to gpt-4o only if:
  - Complex reasoning is required
  - High accuracy is critical
  - User feedback indicates quality issues
- Use Gemini 2.5 Flash (see the routing sketch below) for:
  - Simple routing/classification
  - FAQ responses
  - Order status lookups
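A routing helper that encodes this strategy might look like the following. The task labels and model IDs here are illustrative assumptions, not platform constants:

```typescript
type Task = "routing" | "faq" | "order_status" | "general" | "complex_reasoning";

// Pick the cheapest model that is adequate for the task, per the strategy above.
function pickModel(task: Task): string {
  switch (task) {
    case "routing":
    case "faq":
    case "order_status":
      return "gemini-2.5-flash"; // cheapest; fine for simple lookups
    case "complex_reasoning":
      return "gpt-4o";           // upgrade only where quality demands it
    default:
      return "gpt-4o-mini";      // sensible default for most agents
  }
}

console.log(pickModel("faq")); // "gemini-2.5-flash"
```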
### Savings

Up to ~98% on per-token cost (gpt-4o at $5/1M tokens → Gemini 2.5 Flash at $0.075/1M tokens, per the table above).
## 3. Burst Pricing

### How It Works

- Normal: your subscription concurrency limit (e.g., 10 calls)
- Burst: up to 3× your limit (e.g., 30 calls)
- Cost: 2× the per-minute rate for burst calls
### Configuration

```json
{
  "call_limits": {
    "burst_pricing_enabled": true
  }
}
```
### When to Use

✅ Black Friday traffic spikes
✅ Product launches
✅ Seasonal demand (holidays)
✅ Marketing campaigns

❌ Sustained high traffic (upgrade your plan instead)
❌ Unpredictable usage patterns
### Cost Calculation

Example:

- Subscription: 10 concurrent calls ($0.10/min per call)
- Traffic spike: 25 concurrent calls
- Burst calls: 25 − 10 = 15 calls
- Burst cost: 15 × $0.20/min = $3.00/min
- Regular cost: 10 × $0.10/min = $1.00/min
- Total: $4.00/min during the spike
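The same calculation as a helper; a sketch where the 2× burst multiplier and rates come from the example above:

```typescript
// Per-minute cost during a traffic spike, with burst calls billed at 2×.
function spikeCostPerMinute(
  concurrentCalls: number,
  subscriptionLimit: number,
  baseRatePerMin: number
): number {
  const regularCalls = Math.min(concurrentCalls, subscriptionLimit);
  const burstCalls = Math.max(concurrentCalls - subscriptionLimit, 0);
  return regularCalls * baseRatePerMin + burstCalls * baseRatePerMin * 2;
}

console.log(spikeCostPerMinute(25, 10, 0.10)); // 4 → $4/min during the spike
```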
## 4. Prompt Optimization

### Reduce Token Count

Before (500 tokens):

> You are a highly experienced and knowledgeable customer support specialist with extensive training in technical troubleshooting, customer service best practices, and empathetic communication. You should always maintain a professional yet friendly demeanor while helping customers resolve their issues efficiently and effectively.

After (150 tokens):

> You are an experienced support specialist. Be professional, friendly, and efficient.

Savings: 70% token reduction
### Use Tools Instead of Context

- Before: include the FAQ in the system prompt (2,000 tokens)
- After: use RAG/knowledge base retrieval (100 tokens + retrieval)

Savings: up to 95% for large knowledge bases; see the retrieval sketch below.
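The retrieval side might look like this sketch. A real deployment would use the platform's knowledge base or a vector store; the keyword scoring here is only an illustrative stand-in:

```typescript
// FAQ entries kept outside the prompt; only the relevant ones are injected.
const faq = [
  { question: "How do I reset my password?", answer: "Use the 'Forgot password' link." },
  { question: "Where is my order?", answer: "Check the tracking link in your confirmation email." },
  // ...hundreds more entries that would bloat the prompt if inlined
];

// Return the top-k FAQ entries that share words with the user's message.
function retrieveContext(userMessage: string, topK = 2): string {
  const words = userMessage.toLowerCase().split(/\W+/).filter(Boolean);
  return faq
    .map((entry) => ({
      entry,
      score: words.filter((w) => entry.question.toLowerCase().includes(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(({ entry }) => `Q: ${entry.question}\nA: ${entry.answer}`)
    .join("\n\n");
}

console.log(retrieveContext("I forgot my password"));
```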
## 5. Turn-Taking Optimization

### Impact on Cost
| Mode | Latency | LLM Calls | Cost Impact |
|---|---|---|---|
| Eager | Low | More | Higher (more interruptions) |
| Normal | Medium | Medium | Balanced |
| Patient | High | Fewer | Lower (fewer interruptions) |
### Recommendation

Use Patient mode for cost-sensitive applications where response speed is less critical.
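This guide does not show the turn-taking config key, so the shape below is a hypothetical sketch modeled on the other config blocks in this document:

```json
{
  "conversation": {
    "turn_taking_mode": "patient"
  }
}
```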
## 6. Voice Settings

### Speed vs Cost
| Speed | TTS Cost | User Experience |
|---|---|---|
| 0.7x | Higher (longer audio) | Slow |
| 1.0x | Baseline | Natural |
| 1.2x | Lower (shorter audio) | Fast |
### Recommendation

Use 1.1x speed for a slight cost saving without compromising the experience.
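Assuming voice speed is set alongside the other agent config (the `tts_config` key name below is hypothetical, as this guide does not show the real one):

```json
{
  "tts_config": {
    "speed": 1.1
  }
}
```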
## 7. Conversation Duration Limits

### Configuration

```json
{
  "conversation": {
    "max_duration_seconds": 300
  }
}
```

Here, 300 seconds caps each conversation at 5 minutes.
### Use Cases

- FAQ bots (limit: 2-3 minutes)
- Order status (limit: 1 minute)
- Full support (limit: 10-15 minutes)
### Savings

A hard cap prevents a single runaway conversation from accumulating unbounded LLM, TTS, and telephony minutes.
## 8. Analytics-Driven Optimization

### Monitor Metrics

- Average conversation duration
- LLM tokens per conversation
- Tool call frequency
- Resolution rate
### Identify Issues

- Long conversations → improve prompts or add escalation
- High token count → enable caching or shorten prompts
- Low resolution rate → upgrade the model or improve the knowledge base
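These rules can be mechanized. A sketch, where the thresholds and the shape of `AgentMetrics` are illustrative assumptions:

```typescript
interface AgentMetrics {
  avgDurationSeconds: number;
  avgTokensPerConversation: number;
  resolutionRate: number; // 0..1
}

// Map the metrics above to the corrective actions listed above.
function diagnose(m: AgentMetrics): string[] {
  const issues: string[] = [];
  if (m.avgDurationSeconds > 300) {
    issues.push("Long conversations → improve prompts or add escalation");
  }
  if (m.avgTokensPerConversation > 10_000) {
    issues.push("High token count → enable caching or shorten prompts");
  }
  if (m.resolutionRate < 0.7) {
    issues.push("Low resolution rate → upgrade the model or improve the knowledge base");
  }
  return issues;
}
```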
## 9. Cost Monitoring

### API Usage Tracking

```typescript
const usage = await client.analytics.getLLMUsage({
  agent_id: 'agent_123',
  from_date: '2025-11-01',
  to_date: '2025-11-30'
});

console.log('Total tokens:', usage.total_tokens);
console.log('Cached tokens:', usage.cached_tokens);
console.log('Cost:', usage.total_cost);
```
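The same response also yields the cache hit rate tracked in the checklist below, assuming the `cached_tokens` and `total_tokens` fields shown above:

```typescript
// Fraction of tokens served from cache — higher means cheaper inputs.
const cacheHitRate = usage.cached_tokens / usage.total_tokens;
console.log(`Cache hit rate: ${(cacheHitRate * 100).toFixed(1)}%`);
```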
### Set Budgets

```json
{
  "cost_limits": {
    "daily_budget_usd": 100,
    "monthly_budget_usd": 2000
  }
}
```
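A simple guard that compares tracked spend against the monthly budget might look like this sketch; `getLLMUsage` is the call shown above, and the 80% threshold is an illustrative assumption:

```typescript
const monthlyBudgetUsd = 2000; // mirrors cost_limits.monthly_budget_usd
const monthToDate = await client.analytics.getLLMUsage({
  agent_id: 'agent_123',
  from_date: '2025-11-01',
  to_date: '2025-11-30'
});

// Warn once month-to-date spend crosses 80% of the monthly budget.
if (monthToDate.total_cost > monthlyBudgetUsd * 0.8) {
  console.warn(`Spend $${monthToDate.total_cost} exceeds 80% of $${monthlyBudgetUsd} budget`);
}
```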
## 10. Cost Optimization Checklist

### Before Launch

- [ ] Enable LLM caching
- [ ] Use gpt-4o-mini (not gpt-4o)
- [ ] Optimize prompt length
- [ ] Set conversation duration limits
- [ ] Use RAG instead of large system prompts
- [ ] Configure burst pricing if needed
### During Operation

- [ ] Monitor LLM token usage weekly
- [ ] Review conversation analytics monthly
- [ ] Test cheaper models quarterly
- [ ] Optimize prompts based on analytics
- [ ] Review and remove unused tools
### Continuous Improvement

- [ ] A/B test cheaper models
- [ ] Analyze long conversations
- [ ] Improve resolution rates
- [ ] Reduce average conversation duration
- [ ] Increase cache hit rates
## Expected Savings

Baseline configuration:

- Model: gpt-4o
- No caching
- Average prompt: 1,000 tokens
- Average conversation: 5 minutes
- Cost: ~$0.50/conversation

Optimized configuration:

- Model: gpt-4o-mini
- Caching enabled
- Average prompt: 300 tokens
- Average conversation: 3 minutes
- Cost: ~$0.05/conversation

Total savings: 90% 🎉