# Cost Optimization Guide
## 1. LLM Caching
### How It Works
- **First request**: Full cost (`input_cache_write`)
- **Subsequent requests**: 10% cost (`input_cache_read`)
- **Cache TTL**: 5 minutes to 1 hour (configurable)
### Configuration
```json
{
  "llm_config": {
    "caching": {
      "enabled": true,
      "ttl_seconds": 3600
    }
  }
}
```
### What Gets Cached
✅ System prompts
✅ Tool definitions
✅ Knowledge base context
✅ Recent conversation history
❌ User messages (always fresh)
❌ Dynamic variables
❌ Tool responses
### Savings
**Up to 90%** on cached inputs
**Example**:
- System prompt: 500 tokens
- Without caching: 500 tokens × 100 conversations = 50,000 tokens
- With caching: 500 tokens (first) + 50 tokens × 99 (cached) = 5,450 tokens
- **Savings**: 89%
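The arithmetic above can be sketched in a few lines. This assumes cached reads are billed at 10% of the normal input rate, per the `input_cache_read` description; the exact rate depends on your provider:

```typescript
// Token accounting for the caching example above.
const promptTokens = 500;
const conversations = 100;

const withoutCaching = promptTokens * conversations; // 50,000 tokens
const withCaching =
  promptTokens +                                // first request: full cache write
  promptTokens * 0.1 * (conversations - 1);     // 99 cached reads billed at 10%

const savings = 1 - withCaching / withoutCaching;
console.log(withCaching, `${(savings * 100).toFixed(1)}%`); // 5450 tokens, ~89% saved
```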
---
## 2. Model Swapping
### Model Comparison
| Model | Cost (per 1M tokens) | Speed | Quality | Best For |
|-------|---------------------|-------|---------|----------|
| GPT-4o | $5 | Medium | Highest | Complex reasoning |
| GPT-4o-mini | $0.15 | Fast | High | Most use cases |
| Claude Sonnet 4.5 | $3 | Medium | Highest | Long context |
| Gemini 2.5 Flash | $0.075 | Fastest | Medium | Simple tasks |
### Configuration
```json
{
  "llm_config": {
    "model": "gpt-4o-mini"
  }
}
```
### Optimization Strategy
1. **Start with gpt-4o-mini** for all agents
2. **Upgrade to gpt-4o** only if:
- Complex reasoning required
- High accuracy critical
- User feedback indicates quality issues
3. **Use Gemini 2.5 Flash** for:
- Simple routing/classification
- FAQ responses
- Order status lookups
### Savings
**Up to ~98%** (gpt-4o → Gemini 2.5 Flash, $5 → $0.075 per 1M tokens); ~97% for gpt-4o → gpt-4o-mini
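Per the comparison table above, LLM cost scales roughly linearly with the per-token rate. A quick sketch (the 10,000-token conversation size is an illustrative assumption):

```typescript
// Approximate LLM cost per conversation at the rates from the comparison table.
const tokensPerConversation = 10_000; // illustrative assumption
const ratesPerMillion: Record<string, number> = {
  "gpt-4o": 5.0,
  "gpt-4o-mini": 0.15,
  "claude-sonnet-4.5": 3.0,
  "gemini-2.5-flash": 0.075,
};

for (const [model, rate] of Object.entries(ratesPerMillion)) {
  const cost = (tokensPerConversation / 1_000_000) * rate;
  console.log(`${model}: $${cost.toFixed(4)}/conversation`);
}
```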
---
## 3. Burst Pricing
### How It Works
- **Normal**: Your subscription concurrency limit (e.g., 10 calls)
- **Burst**: Up to 3× your limit (e.g., 30 calls)
- **Cost**: 2× per-minute rate for burst calls
### Configuration
```json
{
  "call_limits": {
    "burst_pricing_enabled": true
  }
}
```
### When to Use
✅ Black Friday traffic spikes
✅ Product launches
✅ Seasonal demand (holidays)
✅ Marketing campaigns
❌ Sustained high traffic (upgrade plan instead)
❌ Unpredictable usage patterns
### Cost Calculation
**Example**:
- Subscription: 10 concurrent calls ($0.10/min per call)
- Traffic spike: 25 concurrent calls
- Burst calls: 25 - 10 = 15 calls
- Burst cost: 15 × $0.20/min = $3/min
- Regular cost: 10 × $0.10/min = $1/min
- **Total**: $4/min during spike
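The calculation above generalizes to a small helper. A minimal sketch, assuming burst calls are billed at exactly 2× the subscription's per-minute rate as described:

```typescript
// Per-minute cost during a traffic spike with burst pricing enabled.
function spikeCostPerMinute(
  concurrentCalls: number,
  subscriptionLimit: number,
  ratePerMin: number
): number {
  const regular = Math.min(concurrentCalls, subscriptionLimit);
  const burst = Math.max(concurrentCalls - subscriptionLimit, 0);
  return regular * ratePerMin + burst * ratePerMin * 2; // burst billed at 2x
}

console.log(spikeCostPerMinute(25, 10, 0.10)); // → 4 ($1 regular + $3 burst per minute)
```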
---
## 4. Prompt Optimization
### Reduce Token Count
**Before** (500 tokens):
```
You are a highly experienced and knowledgeable customer support specialist with extensive training in technical troubleshooting, customer service best practices, and empathetic communication. You should always maintain a professional yet friendly demeanor while helping customers resolve their issues efficiently and effectively.
```
**After** (150 tokens):
```
You are an experienced support specialist. Be professional, friendly, and efficient.
```
**Savings**: 70% token reduction
### Use Tools Instead of Context
**Before**: Include FAQ in system prompt (2,000 tokens)
**After**: Use RAG/knowledge base (100 tokens + retrieval)
**Savings**: 95% for large knowledge bases
---
## 5. Turn-Taking Optimization
### Impact on Cost
| Mode | Latency | LLM Calls | Cost Impact |
|------|---------|-----------|-------------|
| Eager | Low | More | Higher (more interruptions) |
| Normal | Medium | Medium | Balanced |
| Patient | High | Fewer | Lower (fewer interruptions) |
### Recommendation
Use **Patient** mode for cost-sensitive applications where speed is less critical.
---
## 6. Voice Settings
### Speed vs Cost
| Speed | TTS Cost | User Experience |
|-------|----------|-----------------|
| 0.7x | Higher (longer audio) | Slow |
| 1.0x | Baseline | Natural |
| 1.2x | Lower (shorter audio) | Fast |
### Recommendation
Use **1.1x speed** for slight cost savings without compromising experience.
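If TTS is billed by audio duration (an assumption; some providers bill per character instead), cost scales roughly as the inverse of the speaking speed:

```typescript
// Relative audio duration (and duration-billed TTS cost) vs. playback speed.
function relativeTtsCost(speed: number): number {
  return 1 / speed; // faster speech -> shorter audio -> lower duration-billed cost
}

for (const speed of [0.7, 1.0, 1.1, 1.2]) {
  console.log(`${speed}x → ${(relativeTtsCost(speed) * 100).toFixed(0)}% of baseline`);
}
```

At 1.1x this works out to roughly 91% of baseline, i.e. the "slight savings" noted above.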
---
## 7. Conversation Duration Limits
### Configuration
```json
{
  "conversation": {
    "max_duration_seconds": 300
  }
}
```
`max_duration_seconds: 300` caps conversations at 5 minutes.
### Use Cases
- FAQ bots (limit: 2-3 minutes)
- Order status (limit: 1 minute)
- Full support (limit: 10-15 minutes)
### Savings
Caps the worst-case cost of any single call by preventing unexpectedly long conversations.
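A duration cap makes the per-call cost ceiling easy to compute. A rough sketch, assuming a flat per-minute rate:

```typescript
// Worst-case cost of one conversation under a duration limit.
function maxCostPerConversation(maxDurationSeconds: number, ratePerMin: number): number {
  return (maxDurationSeconds / 60) * ratePerMin;
}

console.log(maxCostPerConversation(300, 0.10)); // → 0.5 ($0.50 cap at $0.10/min)
```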
---
## 8. Analytics-Driven Optimization
### Monitor Metrics
1. **Average conversation duration**
2. **LLM tokens per conversation**
3. **Tool call frequency**
4. **Resolution rate**
### Identify Issues
- Long conversations → improve prompts or add escalation
- High token count → enable caching or shorten prompts
- Low resolution rate → upgrade model or improve knowledge base
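The issue-spotting rules above can be expressed as a simple check over your analytics. The thresholds here are illustrative assumptions, not platform defaults:

```typescript
// Flag optimization opportunities from conversation analytics.
interface ConversationMetrics {
  avgDurationSeconds: number;
  avgTokens: number;
  resolutionRate: number; // 0..1
}

function optimizationHints(m: ConversationMetrics): string[] {
  const hints: string[] = [];
  if (m.avgDurationSeconds > 300)
    hints.push("Long conversations: improve prompts or add escalation");
  if (m.avgTokens > 5_000)
    hints.push("High token count: enable caching or shorten prompts");
  if (m.resolutionRate < 0.7)
    hints.push("Low resolution rate: upgrade model or improve knowledge base");
  return hints;
}

console.log(optimizationHints({ avgDurationSeconds: 420, avgTokens: 2000, resolutionRate: 0.9 }));
// → ["Long conversations: improve prompts or add escalation"]
```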
---
## 9. Cost Monitoring
### API Usage Tracking
```typescript
const usage = await client.analytics.getLLMUsage({
  agent_id: 'agent_123',
  from_date: '2025-11-01',
  to_date: '2025-11-30'
});

console.log('Total tokens:', usage.total_tokens);
console.log('Cached tokens:', usage.cached_tokens);
console.log('Cost:', usage.total_cost);
```
### Set Budgets
```json
{
  "cost_limits": {
    "daily_budget_usd": 100,
    "monthly_budget_usd": 2000
  }
}
```
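A minimal sketch of acting on such a budget client-side. The `CostLimits` shape mirrors the config above; the 80% warning threshold is an illustrative assumption:

```typescript
// Alert when tracked spend approaches a configured budget.
interface CostLimits {
  daily_budget_usd: number;
  monthly_budget_usd: number;
}

function budgetStatus(spentUsd: number, budgetUsd: number): "ok" | "warn" | "over" {
  if (spentUsd >= budgetUsd) return "over";
  if (spentUsd >= budgetUsd * 0.8) return "warn"; // warn at 80% of budget
  return "ok";
}

const limits: CostLimits = { daily_budget_usd: 100, monthly_budget_usd: 2000 };
console.log(budgetStatus(85, limits.daily_budget_usd)); // → "warn"
```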
---
## 10. Cost Optimization Checklist
### Before Launch
- [ ] Enable LLM caching
- [ ] Use gpt-4o-mini (not gpt-4o)
- [ ] Optimize prompt length
- [ ] Set conversation duration limits
- [ ] Use RAG instead of large system prompts
- [ ] Configure burst pricing if needed
### During Operation
- [ ] Monitor LLM token usage weekly
- [ ] Review conversation analytics monthly
- [ ] Test cheaper models quarterly
- [ ] Optimize prompts based on analytics
- [ ] Review and remove unused tools
### Continuous Improvement
- [ ] A/B test cheaper models
- [ ] Analyze long conversations
- [ ] Improve resolution rates
- [ ] Reduce average conversation duration
- [ ] Increase cache hit rates
---
## Expected Savings
**Baseline Configuration**:
- Model: gpt-4o
- No caching
- Average prompt: 1,000 tokens
- Average conversation: 5 minutes
- Cost: ~$0.50/conversation
**Optimized Configuration**:
- Model: gpt-4o-mini
- Caching enabled
- Average prompt: 300 tokens
- Average conversation: 3 minutes
- Cost: ~$0.05/conversation
**Total Savings**: **90%** 🎉
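The headline figure follows directly from the two per-conversation estimates:

```typescript
// Savings implied by the baseline vs. optimized cost estimates above.
const baselineCost = 0.50;  // $/conversation: gpt-4o, no caching
const optimizedCost = 0.05; // $/conversation: gpt-4o-mini, caching enabled

const totalSavings = 1 - optimizedCost / baselineCost;
console.log(`${(totalSavings * 100).toFixed(0)}%`); // "90%"
```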