# Cost Optimization Guide

## 1. LLM Caching
### How It Works

- First request: full cost (`input_cache_write`)
- Subsequent requests: 10% cost (`input_cache_read`)
- Cache TTL: 5 minutes to 1 hour (configurable)
### Configuration

```json
{
  "llm_config": {
    "caching": {
      "enabled": true,
      "ttl_seconds": 3600
    }
  }
}
```
### What Gets Cached

✅ System prompts
✅ Tool definitions
✅ Knowledge base context
✅ Recent conversation history

❌ User messages (always fresh)
❌ Dynamic variables
❌ Tool responses
### Savings

Up to 90% on cached inputs.

Example:

- System prompt: 500 tokens
- Without caching: 500 tokens × 100 conversations = 50,000 tokens
- With caching: 500 tokens (first) + 50 tokens × 99 (cached) = 5,450 tokens
- Savings: ~89%
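The arithmetic above is easy to reproduce. A minimal sketch, where the token counts and the 10% cached-read rate come straight from the example:

```typescript
// Token spend for a prompt that is cache-written once, then cache-read.
// Cached reads are billed at 10% of the full rate, per the rates above.
function cachedTokenSpend(promptTokens: number, conversations: number): number {
  const firstWrite = promptTokens;                              // full-cost cache write
  const cachedReads = (conversations - 1) * promptTokens * 0.1; // 10% per read
  return firstWrite + cachedReads;
}

const withCaching = cachedTokenSpend(500, 100); // 5,450 tokens
const withoutCaching = 500 * 100;               // 50,000 tokens
console.log(`Savings: ${((1 - withCaching / withoutCaching) * 100).toFixed(1)}%`); // 89.1%
```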
## 2. Model Swapping

### Model Comparison
| Model | Cost (per 1M tokens) | Speed | Quality | Best For |
|---|---|---|---|---|
| GPT-4o | $5 | Medium | Highest | Complex reasoning |
| GPT-4o-mini | $0.15 | Fast | High | Most use cases |
| Claude Sonnet 4.5 | $3 | Medium | Highest | Long context |
| Gemini 2.5 Flash | $0.075 | Fastest | Medium | Simple tasks |
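To translate the table into per-conversation numbers, here is a rough comparison. The prices come from the table above; the 3,000-token conversation size is an illustrative assumption:

```typescript
// Approximate LLM cost per conversation at each model's per-1M-token price.
const pricePerMTok: Record<string, number> = {
  "GPT-4o": 5,
  "GPT-4o-mini": 0.15,
  "Claude Sonnet 4.5": 3,
  "Gemini 2.5 Flash": 0.075,
};

const tokensPerConversation = 3_000; // assumed for illustration
for (const [model, price] of Object.entries(pricePerMTok)) {
  const cost = (tokensPerConversation / 1_000_000) * price;
  console.log(`${model}: $${cost.toFixed(5)}/conversation`);
}
```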
### Configuration

```json
{
  "llm_config": {
    "model": "gpt-4o-mini"
  }
}
```
### Optimization Strategy

- Start with gpt-4o-mini for all agents
- Upgrade to gpt-4o only if:
  - Complex reasoning is required
  - High accuracy is critical
  - User feedback indicates quality issues
- Use Gemini 2.5 Flash (see the routing sketch below) for:
  - Simple routing/classification
  - FAQ responses
  - Order status lookups
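A routing helper that encodes this strategy might look like the following. The task labels and model IDs here are illustrative assumptions, not platform constants:

```typescript
type Task = "routing" | "faq" | "order_status" | "general" | "complex_reasoning";

// Pick the cheapest model that is adequate for the task, per the strategy above.
function pickModel(task: Task): string {
  switch (task) {
    case "routing":
    case "faq":
    case "order_status":
      return "gemini-2.5-flash"; // cheapest; fine for simple lookups
    case "complex_reasoning":
      return "gpt-4o";           // upgrade only where quality demands it
    default:
      return "gpt-4o-mini";      // sensible default for most agents
  }
}

console.log(pickModel("faq")); // "gemini-2.5-flash"
```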
### Savings

Up to ~98% on per-token cost (gpt-4o at $5/1M tokens → Gemini 2.5 Flash at $0.075/1M tokens, per the table above).
## 3. Burst Pricing

### How It Works

- Normal: your subscription concurrency limit (e.g., 10 calls)
- Burst: up to 3× your limit (e.g., 30 calls)
- Cost: 2× the per-minute rate for burst calls
### Configuration

```json
{
  "call_limits": {
    "burst_pricing_enabled": true
  }
}
```
### When to Use

✅ Black Friday traffic spikes
✅ Product launches
✅ Seasonal demand (holidays)
✅ Marketing campaigns

❌ Sustained high traffic (upgrade your plan instead)
❌ Unpredictable usage patterns
### Cost Calculation

Example:

- Subscription: 10 concurrent calls ($0.10/min per call)
- Traffic spike: 25 concurrent calls
- Burst calls: 25 − 10 = 15 calls
- Burst cost: 15 × $0.20/min = $3.00/min
- Regular cost: 10 × $0.10/min = $1.00/min
- Total: $4.00/min during the spike
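The same calculation as a helper; a sketch where the 2× burst multiplier and rates come from the example above:

```typescript
// Per-minute cost during a traffic spike, with burst calls billed at 2×.
function spikeCostPerMinute(
  concurrentCalls: number,
  subscriptionLimit: number,
  baseRatePerMin: number
): number {
  const regularCalls = Math.min(concurrentCalls, subscriptionLimit);
  const burstCalls = Math.max(concurrentCalls - subscriptionLimit, 0);
  return regularCalls * baseRatePerMin + burstCalls * baseRatePerMin * 2;
}

console.log(spikeCostPerMinute(25, 10, 0.10)); // 4 → $4/min during the spike
```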
## 4. Prompt Optimization

### Reduce Token Count

Before (500 tokens):

> You are a highly experienced and knowledgeable customer support specialist with extensive training in technical troubleshooting, customer service best practices, and empathetic communication. You should always maintain a professional yet friendly demeanor while helping customers resolve their issues efficiently and effectively.

After (150 tokens):

> You are an experienced support specialist. Be professional, friendly, and efficient.

Savings: 70% token reduction
### Use Tools Instead of Context

- Before: include the FAQ in the system prompt (2,000 tokens)
- After: use RAG/knowledge base retrieval (100 tokens + retrieval)

Savings: up to 95% for large knowledge bases; see the retrieval sketch below.
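The retrieval side might look like this sketch. A real deployment would use the platform's knowledge base or a vector store; the keyword scoring here is only an illustrative stand-in:

```typescript
// FAQ entries kept outside the prompt; only the relevant ones are injected.
const faq = [
  { question: "How do I reset my password?", answer: "Use the 'Forgot password' link." },
  { question: "Where is my order?", answer: "Check the tracking link in your confirmation email." },
  // ...hundreds more entries that would bloat the prompt if inlined
];

// Return the top-k FAQ entries that share words with the user's message.
function retrieveContext(userMessage: string, topK = 2): string {
  const words = userMessage.toLowerCase().split(/\W+/).filter(Boolean);
  return faq
    .map((entry) => ({
      entry,
      score: words.filter((w) => entry.question.toLowerCase().includes(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(({ entry }) => `Q: ${entry.question}\nA: ${entry.answer}`)
    .join("\n\n");
}

console.log(retrieveContext("I forgot my password"));
```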
## 5. Turn-Taking Optimization

### Impact on Cost
| Mode | Latency | LLM Calls | Cost Impact |
|---|---|---|---|
| Eager | Low | More | Higher (more interruptions) |
| Normal | Medium | Medium | Balanced |
| Patient | High | Fewer | Lower (fewer interruptions) |
### Recommendation

Use Patient mode for cost-sensitive applications where response speed is less critical.
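This guide does not show the turn-taking config key, so the shape below is a hypothetical sketch modeled on the other config blocks in this document:

```json
{
  "conversation": {
    "turn_taking_mode": "patient"
  }
}
```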
## 6. Voice Settings

### Speed vs Cost
| Speed | TTS Cost | User Experience |
|---|---|---|
| 0.7x | Higher (longer audio) | Slow |
| 1.0x | Baseline | Natural |
| 1.2x | Lower (shorter audio) | Fast |
### Recommendation

Use 1.1x speed for a slight cost saving without compromising the experience.
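Assuming voice speed is set alongside the other agent config (the `tts_config` key name below is hypothetical, as this guide does not show the real one):

```json
{
  "tts_config": {
    "speed": 1.1
  }
}
```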
## 7. Conversation Duration Limits

### Configuration

```json
{
  "conversation": {
    "max_duration_seconds": 300
  }
}
```

Here, 300 seconds caps each conversation at 5 minutes.
### Use Cases

- FAQ bots (limit: 2-3 minutes)
- Order status (limit: 1 minute)
- Full support (limit: 10-15 minutes)
### Savings

A hard cap prevents a single runaway conversation from accumulating unbounded LLM, TTS, and telephony minutes.
## 8. Analytics-Driven Optimization

### Monitor Metrics

- Average conversation duration
- LLM tokens per conversation
- Tool call frequency
- Resolution rate
### Identify Issues

- Long conversations → improve prompts or add escalation
- High token count → enable caching or shorten prompts
- Low resolution rate → upgrade the model or improve the knowledge base
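These rules can be mechanized. A sketch, where the thresholds and the shape of `AgentMetrics` are illustrative assumptions:

```typescript
interface AgentMetrics {
  avgDurationSeconds: number;
  avgTokensPerConversation: number;
  resolutionRate: number; // 0..1
}

// Map the metrics above to the corrective actions listed above.
function diagnose(m: AgentMetrics): string[] {
  const issues: string[] = [];
  if (m.avgDurationSeconds > 300) {
    issues.push("Long conversations → improve prompts or add escalation");
  }
  if (m.avgTokensPerConversation > 10_000) {
    issues.push("High token count → enable caching or shorten prompts");
  }
  if (m.resolutionRate < 0.7) {
    issues.push("Low resolution rate → upgrade the model or improve the knowledge base");
  }
  return issues;
}
```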
## 9. Cost Monitoring

### API Usage Tracking

```typescript
const usage = await client.analytics.getLLMUsage({
  agent_id: 'agent_123',
  from_date: '2025-11-01',
  to_date: '2025-11-30'
});

console.log('Total tokens:', usage.total_tokens);
console.log('Cached tokens:', usage.cached_tokens);
console.log('Cost:', usage.total_cost);
```
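The same response also yields the cache hit rate tracked in the checklist below, assuming the `cached_tokens` and `total_tokens` fields shown above:

```typescript
// Fraction of tokens served from cache — higher means cheaper inputs.
const cacheHitRate = usage.cached_tokens / usage.total_tokens;
console.log(`Cache hit rate: ${(cacheHitRate * 100).toFixed(1)}%`);
```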
### Set Budgets

```json
{
  "cost_limits": {
    "daily_budget_usd": 100,
    "monthly_budget_usd": 2000
  }
}
```
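A simple guard that compares tracked spend against the monthly budget might look like this sketch; `getLLMUsage` is the call shown above, and the 80% threshold is an illustrative assumption:

```typescript
const monthlyBudgetUsd = 2000; // mirrors cost_limits.monthly_budget_usd
const monthToDate = await client.analytics.getLLMUsage({
  agent_id: 'agent_123',
  from_date: '2025-11-01',
  to_date: '2025-11-30'
});

// Warn once month-to-date spend crosses 80% of the monthly budget.
if (monthToDate.total_cost > monthlyBudgetUsd * 0.8) {
  console.warn(`Spend $${monthToDate.total_cost} exceeds 80% of $${monthlyBudgetUsd} budget`);
}
```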
## 10. Cost Optimization Checklist

### Before Launch

- [ ] Enable LLM caching
- [ ] Use gpt-4o-mini (not gpt-4o)
- [ ] Optimize prompt length
- [ ] Set conversation duration limits
- [ ] Use RAG instead of large system prompts
- [ ] Configure burst pricing if needed
### During Operation

- [ ] Monitor LLM token usage weekly
- [ ] Review conversation analytics monthly
- [ ] Test cheaper models quarterly
- [ ] Optimize prompts based on analytics
- [ ] Review and remove unused tools
### Continuous Improvement

- [ ] A/B test cheaper models
- [ ] Analyze long conversations
- [ ] Improve resolution rates
- [ ] Reduce average conversation duration
- [ ] Increase cache hit rates
## Expected Savings

Baseline configuration:

- Model: gpt-4o
- No caching
- Average prompt: 1,000 tokens
- Average conversation: 5 minutes
- Cost: ~$0.50/conversation

Optimized configuration:

- Model: gpt-4o-mini
- Caching enabled
- Average prompt: 300 tokens
- Average conversation: 3 minutes
- Cost: ~$0.05/conversation

Total savings: 90% 🎉