# Prompt Caching Guide

Complete guide to using prompt caching for cost optimization.

## Overview

Prompt caching reduces costs by up to 90% and latency by up to 85% by caching frequently used context.

### Benefits

- **Cost Savings**: Cache reads cost 10% of the regular input token price
- **Latency Reduction**: Up to 85% faster time to first token
- **Use Cases**: Long documents, codebases, system instructions, conversation history

### Pricing

Rates shown are for Claude Sonnet models; the ratios (125% for cache writes, 10% for cache reads) hold across models.

| Operation | Cost (per MTok) | vs Regular Input |
|-----------|-----------------|------------------|
| Regular input | $3.00 | 100% |
| Cache write | $3.75 | 125% |
| Cache read | $0.30 | 10% |

**Example**: 100k tokens cached, used 10 times

- Without caching: 100k × $3/MTok × 10 = $3.00
- With caching: (100k × $3.75/MTok) + (100k × $0.30/MTok × 9) = $0.375 + $0.27 = $0.645
- **Savings: $2.355 (78.5%)**
## Requirements

### Minimum Cacheable Content

- **Claude 3.5 Sonnet**: 1,024 tokens minimum
- **Claude 3.5 Haiku**: 2,048 tokens minimum
- **Claude 3.7 Sonnet**: 1,024 tokens minimum
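
If you're unsure whether a block clears the minimum, measure it first. A minimal sketch using the SDK's token-counting endpoint; the 1,024 threshold assumes a Sonnet-class model, and `LARGE_SYSTEM_INSTRUCTIONS` is the same placeholder used in the examples below:

```typescript
// Count tokens before deciding whether a cache breakpoint is worthwhile.
const { input_tokens } = await anthropic.messages.countTokens({
  model: 'claude-sonnet-4-5-20250929',
  system: LARGE_SYSTEM_INSTRUCTIONS,
  messages: [{ role: 'user', content: 'placeholder' }],
});

// Below the model minimum, a cache_control marker is silently ignored.
const worthCaching = input_tokens >= 1024;
```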
### Cache Lifetime

- **Default**: 5 minutes
- **Extended**: 1 hour (configurable per breakpoint)
- The TTL refreshes each time the cached prefix is read
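
To opt into the 1-hour lifetime, `cache_control` accepts a `ttl` field. A minimal sketch, assuming the extended TTL is available on your account; note that the 1-hour cache has a higher write cost than the 5-minute default:

```typescript
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_INSTRUCTIONS,
      // '5m' is the default; '1h' keeps the prefix cached for an hour.
      cache_control: { type: 'ephemeral', ttl: '1h' },
    },
  ],
  messages: [{ role: 'user', content: 'Your question here' }],
});
```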
### Cache Matching

Cache hits require:

- ✅ **Identical content** (byte-for-byte, including whitespace)
- ✅ **Same position** in the request: matching is prefix-based, so everything before the breakpoint must also be unchanged
- ✅ **Within TTL** (5 minutes or 1 hour)
## Implementation

### Basic System Prompt Caching

```typescript
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_INSTRUCTIONS, // >= 1024 tokens
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Your question here' }
  ],
});
```
### Caching in User Messages

```typescript
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Analyze this document:',
        },
        {
          type: 'text',
          text: LARGE_DOCUMENT, // >= 1024 tokens
          cache_control: { type: 'ephemeral' },
        },
        {
          type: 'text',
          text: 'What are the main themes?',
        },
      ],
    },
  ],
});
```
### Multi-Turn Conversation Caching

```typescript
// Turn 1 - Creates cache
const response1 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Hello!' }
  ],
});

// Turn 2 - Hits cache (same system prompt)
const response2 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT, // Identical - cache hit
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Hello!' },
    // Pass the response blocks back directly; response1.content[0].text
    // would not type-check, since content blocks are a union type.
    { role: 'assistant', content: response1.content },
    { role: 'user', content: 'Tell me more' },
  ],
});
```
### Caching Conversation History

```typescript
const messages: Anthropic.MessageParam[] = [
  { role: 'user', content: 'Message 1' },
  { role: 'assistant', content: 'Response 1' },
  { role: 'user', content: 'Message 2' },
  { role: 'assistant', content: 'Response 2' },
];

// Re-wrap the last assistant message so it carries the cache breakpoint
messages[messages.length - 1] = {
  role: 'assistant',
  content: [
    {
      type: 'text',
      text: 'Response 2',
      cache_control: { type: 'ephemeral' },
    },
  ],
};

messages.push({ role: 'user', content: 'Message 3' });

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages,
});
```
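
In an ongoing chat you can repeat this each turn: move the breakpoint to the newest assistant message so the whole history up to it becomes the cached prefix. A sketch of that idea; `markLastAssistantCached` is our helper name, not an SDK function:

```typescript
// Move the cache breakpoint to the most recent assistant turn.
// Handles the common case where assistant content is a plain string.
// Note: the API allows at most 4 cache breakpoints per request, so in very
// long chats you should also strip breakpoints from older messages.
function markLastAssistantCached(messages: Anthropic.MessageParam[]): void {
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m.role !== 'assistant' || typeof m.content !== 'string') continue;
    messages[i] = {
      role: 'assistant',
      content: [
        { type: 'text', text: m.content, cache_control: { type: 'ephemeral' } },
      ],
    };
    return;
  }
}
```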
## Best Practices

### ✅ Do

- Place `cache_control` on the **last block** of cacheable content
- Cache content >= 1024 tokens (3.5 Sonnet) or >= 2048 tokens (3.5 Haiku)
- Use caching for repeated context (system prompts, documents, code)
- Monitor cache usage via the `usage` object in responses
- Cache conversation history in long chats

### ❌ Don't

- Cache content below the minimum token threshold (it is silently processed without caching)
- Place `cache_control` partway through the stable content (everything after the last breakpoint is never cached)
- Change cached content (breaks cache matching)
- Cache rarely used content (caching pays off only with 2+ uses inside the TTL)
- Expect caching to work across organizations (caches are not shared between them)
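
One way to honor both lists at once is to add the breakpoint conditionally. A sketch; the helper name and the ~4-characters-per-token heuristic are our assumptions, not API guarantees (use `countTokens` for an exact answer):

```typescript
// Wrap text in a content block, marking it for caching only when it is
// plausibly above the model's minimum cacheable size.
function textBlock(text: string, minTokens = 1024): Anthropic.TextBlockParam {
  const estimatedTokens = text.length / 4; // rough heuristic for English text
  return estimatedTokens >= minTokens
    ? { type: 'text', text, cache_control: { type: 'ephemeral' } }
    : { type: 'text', text };
}
```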
## Monitoring Cache Usage

```typescript
const response = await anthropic.messages.create({ /* request as above */ });

console.log('Input tokens:', response.usage.input_tokens);
console.log('Cache creation:', response.usage.cache_creation_input_tokens);
console.log('Cache read:', response.usage.cache_read_input_tokens);
console.log('Output tokens:', response.usage.output_tokens);

// First request
// input_tokens: 1000
// cache_creation_input_tokens: 5000
// cache_read_input_tokens: 0

// Subsequent requests (within 5 min)
// input_tokens: 1000
// cache_creation_input_tokens: 0
// cache_read_input_tokens: 5000 // 90% cost savings!
```
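
For dashboards or logs, the same counters fold into a hit rate. A small sketch; `cacheHitRate` is our name, and the fields are the standard `usage` counters (nullable in the SDK types, hence the `?? 0`):

```typescript
// Fraction of prompt tokens served from cache on this request.
function cacheHitRate(usage: Anthropic.Usage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}
```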
## Common Patterns

### Pattern 1: Document Analysis Chatbot

```typescript
import fs from 'node:fs';

const document = fs.readFileSync('./document.txt', 'utf-8'); // ~10k tokens

// All requests reuse the same cached document
for (const question of questions) { // questions: string[] defined elsewhere
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5-20250929',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Document:' },
          {
            type: 'text',
            text: document,
            cache_control: { type: 'ephemeral' },
          },
          { type: 'text', text: `Question: ${question}` },
        ],
      },
    ],
  });

  // First request: cache_creation_input_tokens: 10000
  // Subsequent: cache_read_input_tokens: 10000 (90% savings)
}
```
### Pattern 2: Code Review with Codebase Context

```typescript
const codebase = await loadCodebase(); // ~50k tokens

const review = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 2048,
  system: [
    { type: 'text', text: 'You are a code reviewer.' },
    {
      type: 'text',
      text: `Codebase context:\n${codebase}`,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Review this PR: ...' }
  ],
});
```
### Pattern 3: Customer Support with Knowledge Base

```typescript
const knowledgeBase = await loadKB(); // ~20k tokens

// The cached knowledge base is reused across all customer queries
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    { type: 'text', text: 'You are a customer support agent.' },
    {
      type: 'text',
      text: knowledgeBase,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: customerConversation,
});
```
## Troubleshooting

### Cache Not Activating

**Problem**: `cache_read_input_tokens` is 0

**Solutions**:

1. Ensure content >= 1024 tokens (or 2048 for Haiku)
2. Verify `cache_control` is on the **last block** of the stable prefix
3. Check content is byte-for-byte identical
4. Confirm requests fall within the cache TTL (5 minutes by default)

### Unexpected Cache Misses

**Problem**: Cache hits are intermittent

**Solutions**:

1. Ensure content doesn't change between requests (even whitespace)
2. Check the TTL hasn't expired
3. Verify requests come from the same organization
4. Monitor the `usage` fields in responses

### High Cache Creation Costs

**Problem**: Frequent `cache_creation_input_tokens`

**Solutions**:

1. Increase request frequency (reuse the cache before it expires)
2. Consider whether caching is cost-effective (it needs 2+ uses to pay off)
3. Extend the cache TTL to 1 hour if supported
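
The checklists above can be folded into a quick diagnostic; `explainCacheResult` is our name for it:

```typescript
// Classify what the cache did on a given response.
function explainCacheResult(usage: Anthropic.Usage): string {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return 'cache hit';
  if ((usage.cache_creation_input_tokens ?? 0) > 0)
    return 'cache written: identical requests within the TTL should now hit';
  return 'no caching: check minimum size, breakpoint placement, and TTL';
}
```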
## Cost Calculator

```typescript
// Assumes a single cache write and that every subsequent request hits the
// cache within the TTL. Prices are Claude Sonnet rates per million tokens.
function calculateCachingSavings(
  cachedTokens: number,
  uncachedTokens: number,
  requestCount: number
): {
  withoutCaching: number;
  withCaching: number;
  savings: number;
  savingsPercent: number;
} {
  const inputCostPerMTok = 3;
  const cacheWriteCostPerMTok = 3.75;
  const cacheReadCostPerMTok = 0.3;

  const withoutCaching =
    ((cachedTokens + uncachedTokens) / 1_000_000) * inputCostPerMTok * requestCount;

  const cacheWrite = (cachedTokens / 1_000_000) * cacheWriteCostPerMTok;
  const cacheReads =
    (cachedTokens / 1_000_000) * cacheReadCostPerMTok * (requestCount - 1);
  const uncachedInput = (uncachedTokens / 1_000_000) * inputCostPerMTok * requestCount;
  const withCaching = cacheWrite + cacheReads + uncachedInput;

  const savings = withoutCaching - withCaching;
  const savingsPercent = (savings / withoutCaching) * 100;

  return { withoutCaching, withCaching, savings, savingsPercent };
}

// Example: 10k cached tokens, 1k uncached, 20 requests
const result = calculateCachingSavings(10000, 1000, 20);
console.log(`Savings: $${result.savings.toFixed(4)} (${result.savingsPercent.toFixed(1)}%)`);
```
## Official Documentation

- **Prompt Caching Guide**: https://docs.claude.com/en/docs/build-with-claude/prompt-caching
- **Pricing**: https://www.anthropic.com/pricing
- **API Reference**: https://docs.claude.com/en/api/messages