# Prompt Caching Guide
Complete guide to using prompt caching for cost optimization.
## Overview
Prompt caching reduces costs by up to 90% and latency by up to 85% by caching frequently used context.
### Benefits
- **Cost Savings**: Cache reads = 10% of input token price
- **Latency Reduction**: up to 85% faster time to first token
- **Use Cases**: Long documents, codebases, system instructions, conversation history
### Pricing
| Operation | Cost (per MTok) | vs Regular Input |
|-----------|-----------------|------------------|
| Regular input | $3.00 | 100% |
| Cache write | $3.75 | 125% |
| Cache read | $0.30 | 10% |

Prices shown are for Sonnet-class models; the multipliers (cache writes at 125% of base input, cache reads at 10%) apply across models.
**Example**: 100k tokens cached, used 10 times
- Without caching: 100k × $3/MTok × 10 = $3.00
- With caching: (100k × $3.75/MTok) + (100k × $0.30/MTok × 9) = $0.375 + $0.27 = $0.645
- **Savings: $2.355 (78.5%)**
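For a quick cross-check, the `calculateCachingSavings` helper defined in the Cost Calculator section at the end of this guide reproduces the same arithmetic:
```typescript
// 100k cached tokens, no uncached input, 10 requests
const example = calculateCachingSavings(100_000, 0, 10);
// → { withoutCaching: 3, withCaching: 0.645, savings: 2.355, savingsPercent: 78.5 }
```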
## Requirements
### Minimum Cacheable Content
- **Claude 3.5 Sonnet**: 1,024 tokens minimum
- **Claude 3.5 Haiku**: 2,048 tokens minimum
- **Claude 3.7 Sonnet**: 1,024 tokens minimum
- **Claude Sonnet 4.x** (used in the examples below): 1,024 tokens minimum
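Content below the minimum is simply processed without caching, even if marked with `cache_control`. A rough pre-check can catch this early; a minimal sketch, assuming the common ~4 characters per token approximation for English text (the `likelyMeetsCacheMinimum` helper is illustrative, not part of the SDK):
```typescript
// Rough heuristic: English text averages ~4 characters per token.
// Only the real tokenizer is authoritative; treat this as a sanity check.
function likelyMeetsCacheMinimum(text: string, minTokens = 1024): boolean {
  const estimatedTokens = Math.ceil(text.length / 4);
  return estimatedTokens >= minTokens;
}

if (!likelyMeetsCacheMinimum(LARGE_SYSTEM_INSTRUCTIONS)) {
  console.warn('Content may be below the cacheable minimum; cache_control would be ignored.');
}
```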
### Cache Lifetime
- **Default**: 5 minutes
- **Extended**: 1 hour (configurable)
- Refreshes on each use
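Opting into the 1-hour lifetime is done with a `ttl` field on `cache_control`. A minimal sketch; at the time of writing this required the `extended-cache-ttl-2025-04-11` beta in the TypeScript SDK, so confirm against the current docs before relying on it:
```typescript
const message = await anthropic.beta.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  betas: ['extended-cache-ttl-2025-04-11'],
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_INSTRUCTIONS,
      cache_control: { type: 'ephemeral', ttl: '1h' }, // 1-hour cache instead of the 5-minute default
    },
  ],
  messages: [{ role: 'user', content: 'Your question here' }],
});
```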
### Cache Matching
Cache hits require:
- **Identical content** (byte-for-byte)
- **Same position** in the request
- **Within TTL** (5 minutes or 1 hour)
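Byte-for-byte matching is strict: even an invisible difference such as a trailing space produces a different prefix and therefore a cache miss. A small illustration (hypothetical strings, for demonstration only):
```typescript
const original = SYSTEM_PROMPT;       // request 1 writes the cache for this prefix
const modified = SYSTEM_PROMPT + ' '; // request 2 misses: one extra byte, different prefix

// Build cacheable strings once and reuse the exact same value across requests,
// rather than re-assembling them (and risking drift) on every call.
```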
## Implementation
### Basic System Prompt Caching
```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_INSTRUCTIONS, // >= 1024 tokens
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Your question here' },
  ],
});
```
### Caching in User Messages
```typescript
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Analyze this document:',
        },
        {
          type: 'text',
          text: LARGE_DOCUMENT, // >= 1024 tokens
          cache_control: { type: 'ephemeral' },
        },
        {
          type: 'text',
          text: 'What are the main themes?',
        },
      ],
    },
  ],
});
```
### Multi-Turn Conversation Caching
```typescript
// Turn 1 - creates the cache
const response1 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Hello!' },
  ],
});

// Turn 2 - hits the cache (same system prompt, same position)
const response2 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT, // identical - cache hit
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Hello!' },
    { role: 'assistant', content: response1.content[0].text }, // assumes the first content block is text
    { role: 'user', content: 'Tell me more' },
  ],
});
```
### Caching Conversation History
```typescript
const messages = [
  { role: 'user', content: 'Message 1' },
  { role: 'assistant', content: 'Response 1' },
  { role: 'user', content: 'Message 2' },
  { role: 'assistant', content: 'Response 2' },
];

// Move the cache breakpoint to the last assistant message
messages[messages.length - 1] = {
  role: 'assistant',
  content: [
    {
      type: 'text',
      text: 'Response 2',
      cache_control: { type: 'ephemeral' },
    },
  ],
};

messages.push({ role: 'user', content: 'Message 3' });

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages,
});
```
## Best Practices
### ✅ Do
- Place `cache_control` on the **last block** of cacheable content (multiple breakpoints are possible; see the sketch after this list)
- Cache content >= 1024 tokens (3.5 Sonnet) or >= 2048 tokens (3.5 Haiku)
- Use caching for repeated context (system prompts, documents, code)
- Monitor cache usage via the `usage` fields in responses
- Cache conversation history in long chats
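The single-breakpoint advice covers the common case; the API also accepts multiple cache breakpoints per request (up to four, per Anthropic's docs). This is useful when a stable prefix and a more frequently changing section should be invalidated independently. A sketch; `STABLE_INSTRUCTIONS` and `DAILY_CONTEXT` are hypothetical placeholders:
```typescript
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: STABLE_INSTRUCTIONS, // rarely changes
      cache_control: { type: 'ephemeral' }, // breakpoint 1
    },
    {
      type: 'text',
      text: DAILY_CONTEXT, // changes daily; a change here leaves breakpoint 1 cached
      cache_control: { type: 'ephemeral' }, // breakpoint 2
    },
  ],
  messages: [{ role: 'user', content: 'Your question here' }],
});
```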
### ❌ Don't
- Cache content below minimum token threshold
- Place `cache_control` in the middle of text
- Change cached content (breaks cache matching)
- Cache rarely used content (not cost-effective)
- Expect caching to work across different organizations (caches are scoped to your Anthropic organization)
## Monitoring Cache Usage
```typescript
const response = await anthropic.messages.create({ /* request params */ });
console.log('Input tokens:', response.usage.input_tokens);
console.log('Cache creation:', response.usage.cache_creation_input_tokens);
console.log('Cache read:', response.usage.cache_read_input_tokens);
console.log('Output tokens:', response.usage.output_tokens);
// First request
// input_tokens: 1000
// cache_creation_input_tokens: 5000
// cache_read_input_tokens: 0
// Subsequent requests (within 5 min)
// input_tokens: 1000
// cache_creation_input_tokens: 0
// cache_read_input_tokens: 5000 // 90% cost savings!
```
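For ongoing monitoring it helps to collapse these fields into a single hit-rate number. A minimal sketch; the `logCacheStats` helper and `CacheUsage` shape are illustrative, not part of the SDK (the fields match `response.usage` above):
```typescript
// Shape of the usage fields this helper reads (matches response.usage)
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

function logCacheStats(usage: CacheUsage): void {
  const cached = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const fresh = usage.input_tokens;
  const total = cached + written + fresh;
  // Fraction of this request's input tokens served from cache
  const hitRate = total > 0 ? cached / total : 0;
  console.log(`Cache hit rate: ${(hitRate * 100).toFixed(1)}% (${cached}/${total} tokens)`);
}

logCacheStats(response.usage);
```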
## Common Patterns
### Pattern 1: Document Analysis Chatbot
```typescript
import fs from 'node:fs';

const document = fs.readFileSync('./document.txt', 'utf-8'); // ~10k tokens

// All requests reuse the same cached document
for (const question of questions) { // questions: string[] defined elsewhere
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5-20250929',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Document:' },
          {
            type: 'text',
            text: document,
            cache_control: { type: 'ephemeral' },
          },
          { type: 'text', text: `Question: ${question}` },
        ],
      },
    ],
  });
  // First request: cache_creation_input_tokens: 10000
  // Subsequent: cache_read_input_tokens: 10000 (90% savings on those tokens)
}
```
### Pattern 2: Code Review with Codebase Context
```typescript
const codebase = await loadCodebase(); // ~50k tokens

const review = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 2048,
  system: [
    { type: 'text', text: 'You are a code reviewer.' },
    {
      type: 'text',
      text: `Codebase context:\n${codebase}`,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Review this PR: ...' },
  ],
});
```
### Pattern 3: Customer Support with Knowledge Base
```typescript
const knowledgeBase = await loadKB(); // ~20k tokens

// The cached knowledge base is reused across all customer queries
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    { type: 'text', text: 'You are a customer support agent.' },
    {
      type: 'text',
      text: knowledgeBase,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: customerConversation,
});
```
## Troubleshooting
### Cache Not Activating
**Problem**: `cache_read_input_tokens` is 0
**Solutions**:
1. Ensure content >= 1024 tokens (or 2048 for Haiku)
2. Verify `cache_control` is on **last block**
3. Check content is byte-for-byte identical
4. Confirm requests arrive within the cache TTL (5 minutes by default)
### Unexpected Cache Misses
**Problem**: Cache hits occur only intermittently
**Solutions**:
1. Ensure the content doesn't change between requests (even whitespace matters)
2. Check that the TTL hasn't expired
3. Verify requests come from the same organization
4. Monitor the cache usage fields in responses
### High Cache Creation Costs
**Problem**: Frequent `cache_creation_input_tokens`
**Solutions**:
1. Increase request frequency so the cache is reused before it expires
2. Check that caching is cost-effective for your workload (you need 2+ uses to break even; see the sketch below this list)
3. Extend the cache TTL to 1 hour (see Cache Lifetime above)
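The 2+ uses break-even from solution 2 is easy to verify with the `calculateCachingSavings` helper defined in the Cost Calculator section below: a single use pays the 25% write premium for nothing, while a second use already comes out ahead:
```typescript
// 10k cached tokens, no uncached input
calculateCachingSavings(10_000, 0, 1); // savings: -$0.0075 (-25.0%) - caching loses money
calculateCachingSavings(10_000, 0, 2); // savings: +$0.0195 (+32.5%) - already ahead
```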
## Cost Calculator
```typescript
function calculateCachingSavings(
  cachedTokens: number,
  uncachedTokens: number,
  requestCount: number
): {
  withoutCaching: number;
  withCaching: number;
  savings: number;
  savingsPercent: number;
} {
  const inputCostPerMTok = 3;
  const cacheWriteCostPerMTok = 3.75;
  const cacheReadCostPerMTok = 0.3;

  const withoutCaching =
    ((cachedTokens + uncachedTokens) / 1_000_000) * inputCostPerMTok * requestCount;

  // One cache write, then (requestCount - 1) discounted cache reads
  const cacheWrite = (cachedTokens / 1_000_000) * cacheWriteCostPerMTok;
  const cacheReads =
    (cachedTokens / 1_000_000) * cacheReadCostPerMTok * (requestCount - 1);
  const uncachedInput =
    (uncachedTokens / 1_000_000) * inputCostPerMTok * requestCount;
  const withCaching = cacheWrite + cacheReads + uncachedInput;

  const savings = withoutCaching - withCaching;
  const savingsPercent = (savings / withoutCaching) * 100;

  return { withoutCaching, withCaching, savings, savingsPercent };
}

// Example: 10k cached tokens, 1k uncached, 20 requests
const result = calculateCachingSavings(10_000, 1_000, 20);
console.log(`Savings: $${result.savings.toFixed(4)} (${result.savingsPercent.toFixed(1)}%)`);
// → Savings: $0.5055 (76.6%)
```
## Official Documentation
- **Prompt Caching Guide**: https://docs.claude.com/en/docs/build-with-claude/prompt-caching
- **Pricing**: https://www.anthropic.com/pricing
- **API Reference**: https://docs.claude.com/en/api/messages