# Cost Optimization Guide

## 1. LLM Caching

### How It Works

- **First request**: Full cost (`input_cache_write`)
- **Subsequent requests**: 10% cost (`input_cache_read`)
- **Cache TTL**: 5 minutes to 1 hour (configurable)

### Configuration

```json
{
  "llm_config": {
    "caching": {
      "enabled": true,
      "ttl_seconds": 3600
    }
  }
}
```

### What Gets Cached

✅ System prompts
✅ Tool definitions
✅ Knowledge base context
✅ Recent conversation history

❌ User messages (always fresh)
❌ Dynamic variables
❌ Tool responses

### Savings

**Up to 90%** on cached inputs.

**Example** (a code sketch of this calculation appears at the end of section 5):

- System prompt: 500 tokens
- Without caching: 500 tokens × 100 conversations = 50,000 tokens
- With caching: 500 tokens (first) + 50 tokens × 99 (cached) = 5,450 tokens
- **Savings**: ~89%

---

## 2. Model Swapping

### Model Comparison

| Model | Input Cost (per 1M tokens) | Speed | Quality | Best For |
|-------|----------------------------|-------|---------|----------|
| GPT-4o | $5 | Medium | Highest | Complex reasoning |
| GPT-4o-mini | $0.15 | Fast | High | Most use cases |
| Claude Sonnet 4.5 | $3 | Medium | Highest | Long context |
| Gemini 2.5 Flash | $0.075 | Fastest | Medium | Simple tasks |

### Configuration

```json
{
  "llm_config": {
    "model": "gpt-4o-mini"
  }
}
```

### Optimization Strategy

1. **Start with gpt-4o-mini** for all agents.
2. **Upgrade to gpt-4o** only if:
   - Complex reasoning is required
   - High accuracy is critical
   - User feedback indicates quality issues
3. **Use Gemini 2.5 Flash** for:
   - Simple routing/classification
   - FAQ responses
   - Order status lookups

### Savings

**Up to ~98%** (gpt-4o at $5 → Gemini 2.5 Flash at $0.075 per 1M tokens)

---

## 3. Burst Pricing

### How It Works

- **Normal**: Your subscription concurrency limit (e.g., 10 calls)
- **Burst**: Up to 3× your limit (e.g., 30 calls)
- **Cost**: 2× the per-minute rate for burst calls

### Configuration

```json
{
  "call_limits": {
    "burst_pricing_enabled": true
  }
}
```

### When to Use

✅ Black Friday traffic spikes
✅ Product launches
✅ Seasonal demand (holidays)
✅ Marketing campaigns

❌ Sustained high traffic (upgrade your plan instead)
❌ Unpredictable usage patterns

### Cost Calculation

**Example** (see the second sketch at the end of section 5):

- Subscription: 10 concurrent calls ($0.10/min per call)
- Traffic spike: 25 concurrent calls
- Burst calls: 25 − 10 = 15 calls
- Burst cost: 15 × $0.20/min = $3/min
- Regular cost: 10 × $0.10/min = $1/min
- **Total**: $4/min during the spike

---

## 4. Prompt Optimization

### Reduce Token Count

**Before** (~50 tokens):

```
You are a highly experienced and knowledgeable customer support specialist with extensive training in technical troubleshooting, customer service best practices, and empathetic communication. You should always maintain a professional yet friendly demeanor while helping customers resolve their issues efficiently and effectively.
```

**After** (~15 tokens):

```
You are an experienced support specialist. Be professional, friendly, and efficient.
```

**Savings**: ~70% token reduction

### Use Tools Instead of Context

**Before**: Include the FAQ in the system prompt (2,000 tokens)
**After**: Use RAG/knowledge base retrieval (100 tokens + retrieval)
**Savings**: ~95% for large knowledge bases

---

## 5. Turn-Taking Optimization

### Impact on Cost

| Mode | Latency | LLM Calls | Cost Impact |
|------|---------|-----------|-------------|
| Eager | Low | More | Higher (more interruptions) |
| Normal | Medium | Medium | Balanced |
| Patient | High | Fewer | Lower (fewer interruptions) |

### Recommendation

Use **Patient** mode for cost-sensitive applications where response speed is less critical.
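To ground the arithmetic so far, here is a minimal sketch of the section 1 caching example. The 10% cached-read rate and the token counts come from that example; the per-token price is an illustrative assumption (the gpt-4o input rate from the section 2 table), not a quoted price.

```typescript
// Assumed price for illustration: $5 per 1M input tokens (gpt-4o row in section 2).
const PRICE_PER_TOKEN = 5 / 1_000_000;

// Cost of resending the same prompt with caching enabled:
// full price once (cache write), then 10% price on every cache hit.
function cachedPromptCost(promptTokens: number, conversations: number): number {
  const firstWrite = promptTokens;
  const cachedReads = promptTokens * 0.1 * (conversations - 1);
  return (firstWrite + cachedReads) * PRICE_PER_TOKEN;
}

// Cost without caching: full price on every conversation.
function uncachedPromptCost(promptTokens: number, conversations: number): number {
  return promptTokens * conversations * PRICE_PER_TOKEN;
}

// Reproduces the section 1 example: 500-token system prompt, 100 conversations.
const withCache = cachedPromptCost(500, 100);      // (500 + 50 × 99) = 5,450 tokens billed
const withoutCache = uncachedPromptCost(500, 100); // 50,000 tokens billed
console.log(`savings: ${((1 - withCache / withoutCache) * 100).toFixed(1)}%`); // ~89.1%
```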
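And a companion sketch for the section 3 burst calculation. The rates mirror that example: $0.10/min per call within the plan and 2× that rate for burst calls. The `BurstConfig` shape is hypothetical, for illustration only.

```typescript
// Hypothetical config shape describing a plan with burst pricing.
interface BurstConfig {
  subscriptionLimit: number; // concurrent calls included in the plan
  ratePerMinute: number;     // per-call, per-minute rate within the limit
  burstMultiplier: number;   // burst calls cost this multiple of the base rate
}

// Splits concurrent calls into regular and burst portions and prices each.
function costPerMinute(concurrentCalls: number, cfg: BurstConfig): number {
  const regular = Math.min(concurrentCalls, cfg.subscriptionLimit);
  const burst = Math.max(0, concurrentCalls - cfg.subscriptionLimit);
  return regular * cfg.ratePerMinute + burst * cfg.ratePerMinute * cfg.burstMultiplier;
}

// Reproduces the section 3 example: 25 concurrent calls on a 10-call plan at $0.10/min.
const cfg: BurstConfig = { subscriptionLimit: 10, ratePerMinute: 0.1, burstMultiplier: 2 };
console.log(costPerMinute(25, cfg)); // 10 × $0.10 + 15 × $0.20 = $4/min
```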
---

## 6. Voice Settings

### Speed vs. Cost

| Speed | TTS Cost | User Experience |
|-------|----------|-----------------|
| 0.7x | Higher (longer audio) | Slow |
| 1.0x | Baseline | Natural |
| 1.2x | Lower (shorter audio) | Fast |

### Recommendation

Use **1.1x speed** for a modest saving without compromising the experience: audio duration, and any TTS cost that scales with it, drops by roughly 9% (1/1.1 ≈ 0.91).

---

## 7. Conversation Duration Limits

### Configuration

```json
{
  "conversation": {
    "max_duration_seconds": 300
  }
}
```

Here `max_duration_seconds: 300` caps conversations at 5 minutes.

### Use Cases

- FAQ bots: 2–3 minute limit
- Order status: 1 minute limit
- Full support: 10–15 minute limit

### Savings

Bounds the worst case: no conversation can run past the limit, so the maximum per-conversation cost is capped.

---

## 8. Analytics-Driven Optimization

### Metrics to Monitor

1. **Average conversation duration**
2. **LLM tokens per conversation**
3. **Tool call frequency**
4. **Resolution rate**

### Identifying Issues

- Long conversations → improve prompts or add escalation
- High token count → enable caching or shorten prompts
- Low resolution rate → upgrade the model or improve the knowledge base

---

## 9. Cost Monitoring

### API Usage Tracking

```typescript
const usage = await client.analytics.getLLMUsage({
  agent_id: 'agent_123',
  from_date: '2025-11-01',
  to_date: '2025-11-30'
});

console.log('Total tokens:', usage.total_tokens);
console.log('Cached tokens:', usage.cached_tokens);
console.log('Cost:', usage.total_cost);
```

### Set Budgets

A sketch of enforcing these budgets in application code appears at the end of this guide.

```json
{
  "cost_limits": {
    "daily_budget_usd": 100,
    "monthly_budget_usd": 2000
  }
}
```

---

## 10. Cost Optimization Checklist

### Before Launch

- [ ] Enable LLM caching
- [ ] Use gpt-4o-mini (not gpt-4o)
- [ ] Optimize prompt length
- [ ] Set conversation duration limits
- [ ] Use RAG instead of large system prompts
- [ ] Configure burst pricing if needed

### During Operation

- [ ] Monitor LLM token usage weekly
- [ ] Review conversation analytics monthly
- [ ] Test cheaper models quarterly
- [ ] Optimize prompts based on analytics
- [ ] Review and remove unused tools

### Continuous Improvement

- [ ] A/B test cheaper models
- [ ] Analyze long conversations
- [ ] Improve resolution rates
- [ ] Reduce average conversation duration
- [ ] Increase cache hit rates

---

## Expected Savings

**Baseline configuration**:

- Model: gpt-4o
- No caching
- Average prompt: 1,000 tokens
- Average conversation: 5 minutes
- Cost: ~$0.50/conversation

**Optimized configuration**:

- Model: gpt-4o-mini
- Caching enabled
- Average prompt: 300 tokens
- Average conversation: 3 minutes
- Cost: ~$0.05/conversation

**Total savings**: **90%** 🎉
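To make the comparison concrete, a minimal sketch projecting monthly cost under each configuration. The per-conversation costs are the guide's own estimates above; the monthly volume is an assumption for illustration.

```typescript
// Per-conversation costs from the Expected Savings section above.
const baselineCostPerConversation = 0.5;   // gpt-4o, no caching, 1,000-token prompt
const optimizedCostPerConversation = 0.05; // gpt-4o-mini, caching, 300-token prompt

function monthlyCost(costPerConversation: number, conversationsPerMonth: number): number {
  return costPerConversation * conversationsPerMonth;
}

const conversationsPerMonth = 10_000; // assumed volume, for illustration only
const baseline = monthlyCost(baselineCostPerConversation, conversationsPerMonth);   // $5,000
const optimized = monthlyCost(optimizedCostPerConversation, conversationsPerMonth); // $500
console.log(
  `monthly savings: $${(baseline - optimized).toFixed(0)} ` +
  `(${((1 - optimized / baseline) * 100).toFixed(0)}%)` // $4500 (90%)
);
```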
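Finally, closing the loop on section 9's budgets: a sketch of checking actual spend against the daily budget. It assumes the same hypothetical `client.analytics.getLLMUsage` API and `total_cost` field as the tracking example; neither is a confirmed SDK surface, and what to do on a breach (alert, throttle, pause the agent) is left to the application.

```typescript
// The assumed SDK client from section 9's tracking example (not a confirmed API).
declare const client: {
  analytics: {
    getLLMUsage(params: {
      agent_id: string;
      from_date: string;
      to_date: string;
    }): Promise<{ total_tokens: number; cached_tokens: number; total_cost: number }>;
  };
};

const DAILY_BUDGET_USD = 100; // mirrors "daily_budget_usd" in the section 9 config

async function checkDailyBudget(agentId: string): Promise<void> {
  const today = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
  const usage = await client.analytics.getLLMUsage({
    agent_id: agentId,
    from_date: today,
    to_date: today,
  });

  if (usage.total_cost >= DAILY_BUDGET_USD) {
    // Logging is the minimal action; alerting or throttling belongs here too.
    console.warn(`Agent ${agentId} hit the daily budget: $${usage.total_cost}`);
  }
}
```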