# Prompt Caching Guide

Complete guide to using prompt caching for cost optimization.

## Overview

Prompt caching reduces costs by up to 90% and latency by up to 85% by caching frequently used context.

### Benefits

- **Cost Savings**: Cache reads cost 10% of the base input token price
- **Latency Reduction**: Up to 85% faster time to first token
- **Use Cases**: Long documents, codebases, system instructions, conversation history

### Pricing

Rates below are for Claude Sonnet; across models, cache writes cost 1.25× and cache reads 0.1× the base input price.

| Operation | Cost (per MTok) | vs Regular Input |
|-----------|-----------------|------------------|
| Regular input | $3.00 | 100% |
| Cache write | $3.75 | 125% |
| Cache read | $0.30 | 10% |

**Example**: 100k tokens cached, used 10 times

- Without caching: 100k × $3/MTok × 10 = $3.00
- With caching: (100k × $3.75/MTok) + (100k × $0.30/MTok × 9) = $0.375 + $0.27 = $0.645
- **Savings: $2.355 (78.5%)**

## Requirements

### Minimum Cacheable Content

- **Claude 3.5 Sonnet**: 1,024 tokens minimum
- **Claude 3.5 Haiku**: 2,048 tokens minimum
- **Claude 3.7 Sonnet**: 1,024 tokens minimum

Content below the minimum is processed normally but never cached; a sketch for checking token counts up front follows at the end of this section.

### Cache Lifetime

- **Default**: 5 minutes
- **Extended**: 1 hour (opt-in, at a higher write cost)
- The TTL refreshes each time the cached prefix is read

### Cache Matching

Cache hits require:

- ✅ **Identical content** (byte-for-byte, including everything before the cache breakpoint)
- ✅ **Same position** in the request
- ✅ **Same model** (caches are model-specific)
- ✅ **Within TTL** (5 minutes or 1 hour)
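Because prompts below the minimum are silently processed without caching, it can be worth confirming token counts before adding `cache_control`. Here is a minimal sketch using the SDK's token counting endpoint; the `isWorthCaching` helper name and hard-coded Sonnet threshold are illustrative:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Count tokens for a candidate system prompt before deciding to cache it.
// The count_tokens endpoint is free to call; the short placeholder user
// message adds only a few tokens to the total.
async function isWorthCaching(systemText: string): Promise<boolean> {
  const count = await anthropic.messages.countTokens({
    model: 'claude-sonnet-4-5-20250929',
    system: systemText,
    messages: [{ role: 'user', content: 'placeholder' }],
  });
  // 1,024 tokens is the Sonnet minimum; use 2,048 for Claude 3.5 Haiku.
  return count.input_tokens >= 1024;
}
```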
## Implementation

### Basic System Prompt Caching

```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_INSTRUCTIONS, // >= 1024 tokens
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: 'Your question here' }],
});
```

### Caching in User Messages

```typescript
const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Analyze this document:',
        },
        {
          type: 'text',
          text: LARGE_DOCUMENT, // >= 1024 tokens
          cache_control: { type: 'ephemeral' },
        },
        {
          type: 'text',
          text: 'What are the main themes?',
        },
      ],
    },
  ],
});
```

### Multi-Turn Conversation Caching

```typescript
// Turn 1 - creates the cache
const response1 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: 'Hello!' }],
});

// The first content block is not guaranteed to be text, so narrow its type
const firstBlock = response1.content[0];
const reply1 = firstBlock.type === 'text' ? firstBlock.text : '';

// Turn 2 - hits the cache (same system prompt)
const response2 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT, // identical - cache hit
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Hello!' },
    { role: 'assistant', content: reply1 },
    { role: 'user', content: 'Tell me more' },
  ],
});
```

### Caching Conversation History

```typescript
const messages = [
  { role: 'user', content: 'Message 1' },
  { role: 'assistant', content: 'Response 1' },
  { role: 'user', content: 'Message 2' },
  { role: 'assistant', content: 'Response 2' },
];

// Mark the last assistant message so the entire history prefix is cached
messages[messages.length - 1] = {
  role: 'assistant',
  content: [
    {
      type: 'text',
      text: 'Response 2',
      cache_control: { type: 'ephemeral' },
    },
  ],
};

messages.push({ role: 'user', content: 'Message 3' });

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages,
});
```

## Best Practices

### ✅ Do

- Place `cache_control` on the **last block** of cacheable content
- Cache content >= 1024 tokens (3.5 Sonnet) or >= 2048 tokens (3.5 Haiku)
- Use caching for repeated context (system prompts, documents, code)
- Monitor cache usage via the `usage` fields in responses
- Cache conversation history in long chats

### ❌ Don't

- Cache content below the minimum token threshold
- Place `cache_control` in the middle of cacheable text
- Change cached content (any change breaks cache matching)
- Cache rarely used content (caching only pays off with 2+ uses)
- Expect caching to work across organizations (caches are organization-scoped)

## Monitoring Cache Usage

```typescript
const response = await anthropic.messages.create({...});

console.log('Input tokens:', response.usage.input_tokens);
console.log('Cache creation:', response.usage.cache_creation_input_tokens);
console.log('Cache read:', response.usage.cache_read_input_tokens);
console.log('Output tokens:', response.usage.output_tokens);

// First request
// input_tokens: 1000
// cache_creation_input_tokens: 5000
// cache_read_input_tokens: 0

// Subsequent requests (within 5 min)
// input_tokens: 1000
// cache_creation_input_tokens: 0
// cache_read_input_tokens: 5000  // billed at 10% of the input rate
```
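These `usage` fields fold naturally into a small logging helper. Here is a sketch; the `logCacheUsage` name, the minimal `CacheUsage` type, and the hard-coded Sonnet rates are illustrative:

```typescript
// Minimal structural type matching the usage block on Messages API responses.
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Summarize cache behavior for one response, using the Claude Sonnet rates
// from the pricing table ($3.00/MTok input, $0.30/MTok cache read).
function logCacheUsage(usage: CacheUsage): void {
  const created = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;

  if (read > 0) {
    // Cached tokens bill at $0.30/MTok instead of $3.00/MTok.
    const saved = (read / 1_000_000) * (3.0 - 0.3);
    console.log(`Cache hit: ${read} cached tokens, ~$${saved.toFixed(4)} saved`);
  } else if (created > 0) {
    console.log(`Cache write: ${created} tokens cached (1.25x write surcharge)`);
  } else {
    console.log(`No cache activity: ${usage.input_tokens} uncached input tokens`);
  }
}

// Usage: logCacheUsage(response.usage);
```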
## Common Patterns

### Pattern 1: Document Analysis Chatbot

```typescript
import fs from 'node:fs';

const document = fs.readFileSync('./document.txt', 'utf-8'); // 10k tokens

// All requests reuse the same cached document
for (const question of questions) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5-20250929',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Document:' },
          {
            type: 'text',
            text: document,
            cache_control: { type: 'ephemeral' },
          },
          { type: 'text', text: `Question: ${question}` },
        ],
      },
    ],
  });

  // First request: cache_creation_input_tokens: 10000
  // Subsequent: cache_read_input_tokens: 10000 (90% savings on those tokens)
}
```

### Pattern 2: Code Review with Codebase Context

```typescript
const codebase = await loadCodebase(); // 50k tokens

const review = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 2048,
  system: [
    { type: 'text', text: 'You are a code reviewer.' },
    {
      type: 'text',
      text: `Codebase context:\n${codebase}`,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: 'Review this PR: ...' }],
});
```

### Pattern 3: Customer Support with Knowledge Base

```typescript
const knowledgeBase = await loadKB(); // 20k tokens

// The cached knowledge base is reused across all customer queries
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    { type: 'text', text: 'You are a customer support agent.' },
    {
      type: 'text',
      text: knowledgeBase,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: customerConversation,
});
```

## Troubleshooting

### Cache Not Activating

**Problem**: `cache_read_input_tokens` is 0

**Solutions**:

1. Ensure the content is >= 1024 tokens (or 2048 for Haiku)
2. Verify `cache_control` is on the **last block** of the cacheable prefix
3. Check the content is byte-for-byte identical between requests
4. Confirm requests arrive within the 5-minute window

### Unexpected Cache Misses

**Problem**: Cache hits only intermittently

**Solutions**:

1. Ensure the content doesn't change between requests (even whitespace)
2. Check the TTL hasn't expired
3. Verify requests come from the same organization
4. Monitor the cache fields in `response.usage`

### High Cache Creation Costs

**Problem**: Frequent `cache_creation_input_tokens`

**Solutions**:

1. Increase request frequency (reuse the cache before it expires)
2. Check that caching is cost-effective (it takes 2+ uses to recoup the 25% write surcharge)
3. Extend the cache TTL to 1 hour if supported

## Cost Calculator

```typescript
function calculateCachingSavings(
  cachedTokens: number,
  uncachedTokens: number,
  requestCount: number
): {
  withoutCaching: number;
  withCaching: number;
  savings: number;
  savingsPercent: number;
} {
  // Claude Sonnet rates (per million tokens)
  const inputCostPerMTok = 3;
  const cacheWriteCostPerMTok = 3.75;
  const cacheReadCostPerMTok = 0.3;

  const withoutCaching =
    ((cachedTokens + uncachedTokens) / 1_000_000) * inputCostPerMTok * requestCount;

  // One cache write, then (requestCount - 1) cache reads
  const cacheWrite = (cachedTokens / 1_000_000) * cacheWriteCostPerMTok;
  const cacheReads =
    (cachedTokens / 1_000_000) * cacheReadCostPerMTok * (requestCount - 1);
  const uncachedInput =
    (uncachedTokens / 1_000_000) * inputCostPerMTok * requestCount;

  const withCaching = cacheWrite + cacheReads + uncachedInput;
  const savings = withoutCaching - withCaching;
  const savingsPercent = (savings / withoutCaching) * 100;

  return { withoutCaching, withCaching, savings, savingsPercent };
}

// Example: 10k cached tokens, 1k uncached, 20 requests
const result = calculateCachingSavings(10_000, 1_000, 20);
console.log(
  `Savings: $${result.savings.toFixed(4)} (${result.savingsPercent.toFixed(1)}%)`
);
```

## Official Documentation

- **Prompt Caching Guide**: https://docs.claude.com/en/docs/build-with-claude/prompt-caching
- **Pricing**: https://www.anthropic.com/pricing
- **API Reference**: https://docs.claude.com/en/api/messages