# Prompt Caching Guide
Complete guide to using prompt caching for cost optimization.
## Overview
Prompt caching reduces costs by up to 90% and latency by up to 85% by caching frequently used context.
### Benefits

- **Cost savings:** cache reads cost 10% of the regular input token price
- **Latency reduction:** up to 85% faster time to first token
- **Use cases:** long documents, codebases, system instructions, conversation history
## Pricing
| Operation | Cost (per MTok) | vs Regular Input |
|---|---|---|
| Regular input | $3 | 100% |
| Cache write | $3.75 | 125% |
| Cache read | $0.30 | 10% |
**Example:** 100k tokens cached, used 10 times
- Without caching: 100k × $3/MTok × 10 = $3.00
- With caching: (100k × $3.75/MTok) + (100k × $0.30/MTok × 9) = $0.375 + $0.27 = $0.645
- Savings: $2.355 (78.5%)
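The same arithmetic, spelled out in code with the pricing constants from the table:

```typescript
// Prices from the table above, in dollars per million tokens (MTok)
const INPUT_PER_MTOK = 3.0;
const WRITE_PER_MTOK = 3.75;
const READ_PER_MTOK = 0.3;

const mtok = 100_000 / 1_000_000; // 100k tokens expressed in MTok
const uses = 10;

const withoutCaching = mtok * INPUT_PER_MTOK * uses; // $3.00
const withCaching =
  mtok * WRITE_PER_MTOK +            // one cache write: $0.375
  mtok * READ_PER_MTOK * (uses - 1); // nine cache reads: $0.27
console.log(withoutCaching - withCaching); // 2.355 -> $2.355 saved (78.5%)
```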
## Requirements
### Minimum Cacheable Content
- Claude 3.5 Sonnet: 1,024 tokens minimum
- Claude 3.5 Haiku: 2,048 tokens minimum
- Claude 3.7 Sonnet: 1,024 tokens minimum
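To check whether content clears these minimums before adding `cache_control`, the token-counting endpoint can help. A sketch, assuming the `countTokens` method of the official TypeScript SDK (`@anthropic-ai/sdk`):

```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Same placeholder as in the implementation examples below
declare const LARGE_SYSTEM_INSTRUCTIONS: string;

// Count the tokens a prompt would consume, without running inference
const count = await anthropic.messages.countTokens({
  model: 'claude-sonnet-4-5-20250929',
  system: LARGE_SYSTEM_INSTRUCTIONS,
  messages: [{ role: 'user', content: 'placeholder' }],
});

// Only add cache_control if the prefix clears the model's minimum
const worthCaching = count.input_tokens >= 1024;
```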
### Cache Lifetime

- Default: 5 minutes
- Extended: 1 hour (configurable per breakpoint; see the sketch below)
- Refreshes on each use
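A minimal sketch of opting into the 1-hour lifetime, assuming a `ttl` field on `cache_control` and, on older API versions, the `extended-cache-ttl-2025-04-11` beta header (client setup as in the Implementation section below; check the prompt-caching docs linked at the end for current availability):

```typescript
const message = await anthropic.messages.create(
  {
    model: 'claude-sonnet-4-5-20250929',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: LARGE_SYSTEM_INSTRUCTIONS,
        // ttl defaults to '5m'; '1h' opts this breakpoint into the longer lifetime
        cache_control: { type: 'ephemeral', ttl: '1h' },
      },
    ],
    messages: [{ role: 'user', content: 'Your question here' }],
  },
  // May be required depending on API version; see the prompt-caching docs
  { headers: { 'anthropic-beta': 'extended-cache-ttl-2025-04-11' } },
);
```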
### Cache Matching
Cache hits require:
- ✅ Identical content (byte-for-byte)
- ✅ Same position in request
- ✅ Within TTL (5 min or 1 hour)
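"Byte-for-byte" is literal: even a whitespace difference creates a separate cache entry. For example:

```typescript
// These prompts differ only by a trailing space before the newline, so a
// request using promptB misses a cache created with promptA and pays the
// cache-write price again.
const promptA = 'You are a helpful assistant.\nAlways cite sources.';
const promptB = 'You are a helpful assistant. \nAlways cite sources.';
console.log(promptA === promptB); // false -> two distinct cache entries
```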
## Implementation
### Basic System Prompt Caching
```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_INSTRUCTIONS, // >= 1024 tokens
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Your question here' },
  ],
});
```
### Caching in User Messages
```typescript
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Analyze this document:',
        },
        {
          type: 'text',
          text: LARGE_DOCUMENT, // >= 1024 tokens
          cache_control: { type: 'ephemeral' },
        },
        {
          type: 'text',
          text: 'What are the main themes?',
        },
      ],
    },
  ],
});
```
### Multi-Turn Conversation Caching
```typescript
// Turn 1 - creates the cache
const response1 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Hello!' },
  ],
});

// Turn 2 - hits the cache (same system prompt)
const response2 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT, // identical - cache hit
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Hello!' },
    // assumes the first content block of the previous response is text
    { role: 'assistant', content: response1.content[0].text },
    { role: 'user', content: 'Tell me more' },
  ],
});
```
### Caching Conversation History
```typescript
const messages = [
  { role: 'user', content: 'Message 1' },
  { role: 'assistant', content: 'Response 1' },
  { role: 'user', content: 'Message 2' },
  { role: 'assistant', content: 'Response 2' },
];

// Cache everything up to and including the last assistant message
messages[messages.length - 1] = {
  role: 'assistant',
  content: [
    {
      type: 'text',
      text: 'Response 2',
      cache_control: { type: 'ephemeral' },
    },
  ],
};

messages.push({ role: 'user', content: 'Message 3' });

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages,
});
```
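In a chat loop, that mutation can be factored into a small helper that moves the cache breakpoint to the end of the history on every turn. `markCacheBreakpoint` is a hypothetical name for illustration, not an SDK function, and this sketch only handles plain string content:

```typescript
import type Anthropic from '@anthropic-ai/sdk';

// Hypothetical helper: rewrite the last message so its content is a text
// block carrying cache_control, which caches the whole conversation prefix.
function markCacheBreakpoint(
  messages: Anthropic.MessageParam[],
): Anthropic.MessageParam[] {
  const last = messages[messages.length - 1];
  if (typeof last.content !== 'string') return messages; // already block-structured
  return [
    ...messages.slice(0, -1),
    {
      role: last.role,
      content: [
        { type: 'text', text: last.content, cache_control: { type: 'ephemeral' } },
      ],
    },
  ];
}
```

Calling it on the history before each `create` keeps the growing prefix cached; within the TTL, each turn pays cache-read prices for everything before the breakpoint.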
## Best Practices
### ✅ Do

- Place `cache_control` on the last block of cacheable content
- Cache content >= 1,024 tokens (3.5 Sonnet) or >= 2,048 tokens (3.5 Haiku)
- Use caching for repeated context (system prompts, documents, code)
- Monitor cache usage via the `usage` fields in responses
- Cache conversation history in long chats

### ❌ Don't

- Cache content below the minimum token threshold
- Place `cache_control` in the middle of text
- Change cached content (breaks cache matching)
- Cache rarely used content (caching needs 2+ uses to pay off)
- Expect caching to work across organizations (cache entries are organization-scoped)
## Monitoring Cache Usage
```typescript
const response = await anthropic.messages.create({...}); // any request from the examples above

console.log('Input tokens:', response.usage.input_tokens);
console.log('Cache creation:', response.usage.cache_creation_input_tokens);
console.log('Cache read:', response.usage.cache_read_input_tokens);
console.log('Output tokens:', response.usage.output_tokens);

// First request
// input_tokens: 1000
// cache_creation_input_tokens: 5000
// cache_read_input_tokens: 0

// Subsequent requests (within 5 min)
// input_tokens: 1000
// cache_creation_input_tokens: 0
// cache_read_input_tokens: 5000  // 90% cost savings on these tokens
```
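Those fields can be folded into a per-request input-cost estimate. A sketch using the Sonnet prices from the pricing table above (output tokens are billed separately and omitted):

```typescript
import type Anthropic from '@anthropic-ai/sdk';

// Estimate the input-side cost of one response, in dollars, using the
// pricing table above ($/MTok). Cache fields may be null on some responses.
function inputCostUSD(usage: Anthropic.Usage): number {
  const regular = (usage.input_tokens / 1_000_000) * 3.0;
  const writes = ((usage.cache_creation_input_tokens ?? 0) / 1_000_000) * 3.75;
  const reads = ((usage.cache_read_input_tokens ?? 0) / 1_000_000) * 0.3;
  return regular + writes + reads;
}

console.log(`Input cost: $${inputCostUSD(response.usage).toFixed(4)}`);
```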
## Common Patterns
### Pattern 1: Document Analysis Chatbot
```typescript
import fs from 'node:fs';

const document = fs.readFileSync('./document.txt', 'utf-8'); // ~10k tokens

// All requests reuse the same cached document
for (const question of questions) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5-20250929',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Document:' },
          {
            type: 'text',
            text: document,
            cache_control: { type: 'ephemeral' },
          },
          { type: 'text', text: `Question: ${question}` },
        ],
      },
    ],
  });
  // First request: cache_creation_input_tokens: 10000
  // Subsequent: cache_read_input_tokens: 10000 (90% savings on those tokens)
}
```
### Pattern 2: Code Review with Codebase Context
```typescript
const codebase = await loadCodebase(); // 50k tokens

const review = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 2048,
  system: [
    { type: 'text', text: 'You are a code reviewer.' },
    {
      type: 'text',
      text: `Codebase context:\n${codebase}`,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Review this PR: ...' },
  ],
});
```
### Pattern 3: Customer Support with Knowledge Base
```typescript
const knowledgeBase = await loadKB(); // 20k tokens

// The cache persists across all customer queries (within the TTL)
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    { type: 'text', text: 'You are a customer support agent.' },
    {
      type: 'text',
      text: knowledgeBase,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: customerConversation,
});
```
## Troubleshooting
### Cache Not Activating

**Problem:** `cache_read_input_tokens` is 0

**Solutions:**
- Ensure content is >= 1,024 tokens (or 2,048 for 3.5 Haiku)
- Verify `cache_control` is on the last block
- Check content is byte-for-byte identical
- Confirm requests fall within the 5-minute window
### Unexpected Cache Misses

**Problem:** Cache hits only intermittently

**Solutions:**
- Ensure content doesn't change between requests (even whitespace)
- Check the TTL hasn't expired
- Verify requests come from the same organization (caches aren't shared across organizations)
- Monitor the cache usage fields in responses
### High Cache Creation Costs

**Problem:** `cache_creation_input_tokens` is high on most requests

**Solutions:**
- Increase request frequency (reuse the cache before it expires)
- Confirm caching is cost-effective (a cached prefix needs 2+ uses to pay for the 25% write premium)
- Extend the cache TTL to 1 hour if supported
## Cost Calculator
```typescript
function calculateCachingSavings(
  cachedTokens: number,
  uncachedTokens: number,
  requestCount: number
): {
  withoutCaching: number;
  withCaching: number;
  savings: number;
  savingsPercent: number;
} {
  const inputCostPerMTok = 3;
  const cacheCostPerMTok = 3.75;
  const cacheReadCostPerMTok = 0.3;

  const withoutCaching =
    ((cachedTokens + uncachedTokens) / 1_000_000) * inputCostPerMTok * requestCount;

  const cacheWrite = (cachedTokens / 1_000_000) * cacheCostPerMTok;
  const cacheReads =
    (cachedTokens / 1_000_000) * cacheReadCostPerMTok * (requestCount - 1);
  const uncachedInput =
    (uncachedTokens / 1_000_000) * inputCostPerMTok * requestCount;
  const withCaching = cacheWrite + cacheReads + uncachedInput;

  const savings = withoutCaching - withCaching;
  const savingsPercent = (savings / withoutCaching) * 100;

  return { withoutCaching, withCaching, savings, savingsPercent };
}

// Example: 10k cached tokens, 1k uncached, 20 requests
const result = calculateCachingSavings(10_000, 1_000, 20);
console.log(`Savings: $${result.savings.toFixed(4)} (${result.savingsPercent.toFixed(1)}%)`);
```
## Official Documentation
- Prompt Caching Guide: https://docs.claude.com/en/docs/build-with-claude/prompt-caching
- Pricing: https://www.anthropic.com/pricing
- API Reference: https://docs.claude.com/en/api/messages