Prompt Caching Guide

Complete guide to using prompt caching for cost optimization.

Overview

Prompt caching reduces costs by up to 90% and latency by up to 85% by letting repeated context (system prompts, documents, code) be served from a cache instead of being reprocessed on every request.

Benefits

  • Cost Savings: Cache reads = 10% of input token price
  • Latency Reduction: 85% faster time to first token
  • Use Cases: Long documents, codebases, system instructions, conversation history

Pricing

Rates below are for Claude Sonnet; the cache write (125%) and cache read (10%) ratios hold across models for the default 5-minute cache.

Operation       Cost (per MTok)   vs Regular Input
Regular input   $3.00             100%
Cache write     $3.75             125%
Cache read      $0.30             10%

Example: 100k tokens cached, used 10 times (see the sketch below)

  • Without caching: 100k × $3/MTok × 10 = $3.00
  • With caching: (100k × $3.75/MTok) + (100k × $0.30/MTok × 9) = $0.375 + $0.27 = $0.645
  • Savings: $2.355 (78.5%)
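
The same arithmetic in TypeScript, as a quick sketch (rates are the Claude Sonnet prices from the table above):

const MTOK = 1_000_000;
const cachedTokens = 100_000;
const uses = 10;

// Every request pays the full input price
const withoutCaching = (cachedTokens / MTOK) * 3.0 * uses; // $3.00

// One cache write at 125%, then nine reads at 10%
const withCaching =
  (cachedTokens / MTOK) * 3.75 +
  (cachedTokens / MTOK) * 0.3 * (uses - 1); // $0.375 + $0.27 = $0.645

console.log(`Savings: $${(withoutCaching - withCaching).toFixed(3)}`); // $2.355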

Requirements

Minimum Cacheable Content

  • Claude 3.5 Sonnet: 1,024 tokens minimum
  • Claude 3.5 Haiku: 2,048 tokens minimum
  • Claude 3.7 Sonnet: 1,024 tokens minimum
  • Claude Sonnet 4.x (used in the examples here): 1,024 tokens minimum

Cache Lifetime

  • Default: 5 minutes
  • Extended: 1 hour (opt-in; see the sketch after this list)
  • Refreshes on each use
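
A minimal sketch of opting into the 1-hour TTL. This assumes the extended cache TTL beta (the ttl field and the extended-cache-ttl-2025-04-11 beta flag; check the current docs before relying on them), and note that 1-hour cache writes are billed at a higher rate than 5-minute writes (2x base input at the time of writing):

// Assumes the extended cache TTL beta is available on your account
const message = await anthropic.beta.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  betas: ['extended-cache-ttl-2025-04-11'],
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_INSTRUCTIONS,
      cache_control: { type: 'ephemeral', ttl: '1h' }, // 1-hour cache instead of 5 minutes
    },
  ],
  messages: [{ role: 'user', content: 'Your question here' }],
});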

Cache Matching

Cache hits require:

  • Byte-for-byte identical content up to the cache breakpoint
  • The same prefix order (tools, then system, then messages)
  • A request within the TTL (5 minutes or 1 hour)

A change anywhere before the breakpoint, even a single character, produces a miss, as the sketch below shows.
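
A hypothetical sketch: the second request changes the text before the cached document, so the prefix no longer matches and a new cache entry is written instead of read:

const cachedDoc = {
  type: 'text' as const,
  text: LARGE_DOCUMENT,
  cache_control: { type: 'ephemeral' as const },
};

// Request 1 - writes the cache for the prefix ending at cachedDoc
await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages: [
    { role: 'user', content: [{ type: 'text', text: 'Document:' }, cachedDoc] },
  ],
});

// Request 2 - MISS: 'Doc:' differs from 'Document:', so the prefix differs
await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages: [
    { role: 'user', content: [{ type: 'text', text: 'Doc:' }, cachedDoc] },
  ],
});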

Implementation

Basic System Prompt Caching

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_INSTRUCTIONS, // >= 1024 tokens
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Your question here' }
  ],
});

Caching in User Messages

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Analyze this document:',
        },
        {
          type: 'text',
          text: LARGE_DOCUMENT, // >= 1024 tokens
          cache_control: { type: 'ephemeral' },
        },
        {
          type: 'text',
          text: 'What are the main themes?',
        },
      ],
    },
  ],
});

Multi-Turn Conversation Caching

// Turn 1 - Creates cache
const response1 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Hello!' }
  ],
});

// Turn 2 - Hits cache (same system prompt)
const response2 = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT, // Identical - cache hit
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Hello!' },
    { role: 'assistant', content: response1.content }, // pass the response blocks back verbatim
    { role: 'user', content: 'Tell me more' },
  ],
});

Caching Conversation History

const messages = [
  { role: 'user', content: 'Message 1' },
  { role: 'assistant', content: 'Response 1' },
  { role: 'user', content: 'Message 2' },
  { role: 'assistant', content: 'Response 2' },
];

// Mark the last assistant message so the whole conversation prefix is cached
messages[messages.length - 1] = {
  role: 'assistant',
  content: [
    {
      type: 'text',
      text: 'Response 2',
      cache_control: { type: 'ephemeral' },
    },
  ],
};

messages.push({ role: 'user', content: 'Message 3' });

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages,
});
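
In a long-running chat you typically repeat this each turn, moving the breakpoint to the newest message. A small helper (hypothetical, not part of the SDK; the type names and import path assume the official TypeScript SDK) that rewraps the final message with cache_control:

import type { MessageParam, TextBlockParam } from '@anthropic-ai/sdk/resources/messages';

// Hypothetical helper: copy the history and put cache_control on the last
// text block of the final message, so the entire prefix is cached.
function cacheTail(messages: MessageParam[]): MessageParam[] {
  const copy = [...messages];
  const last = copy[copy.length - 1];
  const blocks: TextBlockParam[] =
    typeof last.content === 'string'
      ? [{ type: 'text', text: last.content }]
      : [...(last.content as TextBlockParam[])];
  blocks[blocks.length - 1] = {
    ...blocks[blocks.length - 1],
    cache_control: { type: 'ephemeral' },
  };
  copy[copy.length - 1] = { ...last, content: blocks };
  return copy;
}

const cachedResponse = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  messages: cacheTail(messages),
});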

Best Practices

Do

  • Place cache_control on the last block of cacheable content
  • Cache content >= 1024 tokens (3.5 Sonnet) or >= 2048 tokens (3.5 Haiku)
  • Use caching for repeated context (system prompts, documents, code)
  • Monitor cache usage via the usage fields on responses (shown below)
  • Cache conversation history in long chats

Don't

  • Cache content below the minimum token threshold
  • Expect content placed after the last cache_control breakpoint to be cached (only the prefix is cached)
  • Change cached content between requests (even whitespace breaks matching)
  • Cache content used only once (a cache write costs 125% of regular input)
  • Expect caching to work across different API keys

Monitoring Cache Usage

const response = await anthropic.messages.create({...});

console.log('Input tokens:', response.usage.input_tokens);
console.log('Cache creation:', response.usage.cache_creation_input_tokens);
console.log('Cache read:', response.usage.cache_read_input_tokens);
console.log('Output tokens:', response.usage.output_tokens);

// First request
// input_tokens: 1000
// cache_creation_input_tokens: 5000
// cache_read_input_tokens: 0

// Subsequent requests (within 5 min)
// input_tokens: 1000
// cache_creation_input_tokens: 0
// cache_read_input_tokens: 5000  // 90% cost savings!

Common Patterns

Pattern 1: Document Analysis Chatbot

import fs from 'node:fs';

const document = fs.readFileSync('./document.txt', 'utf-8'); // ~10k tokens

// All requests use same cached document
for (const question of questions) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5-20250929',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Document:' },
          {
            type: 'text',
            text: document,
            cache_control: { type: 'ephemeral' },
          },
          { type: 'text', text: `Question: ${question}` },
        ],
      },
    ],
  });

  // First request: cache_creation_input_tokens: 10000
  // Subsequent: cache_read_input_tokens: 10000 (90% savings)
}

Pattern 2: Code Review with Codebase Context

const codebase = await loadCodebase(); // 50k tokens

const review = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 2048,
  system: [
    { type: 'text', text: 'You are a code reviewer.' },
    {
      type: 'text',
      text: `Codebase context:\n${codebase}`,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    { role: 'user', content: 'Review this PR: ...' }
  ],
});

Pattern 3: Customer Support with Knowledge Base

const knowledgeBase = await loadKB(); // 20k tokens

// Cache persists across all customer queries
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5-20250929',
  max_tokens: 1024,
  system: [
    { type: 'text', text: 'You are a customer support agent.' },
    {
      type: 'text',
      text: knowledgeBase,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: customerConversation,
});

Troubleshooting

Cache Not Activating

Problem: cache_read_input_tokens is 0

Solutions:

  1. Ensure content >= 1,024 tokens (or 2,048 for Haiku)
  2. Verify cache_control is on the last block of the stable prefix
  3. Check the content is byte-for-byte identical across requests
  4. Confirm requests fall within the 5-minute window (the helper below shows which case each response hit)
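
A small diagnostic helper (hypothetical, not part of the SDK) that classifies each response from its usage fields:

// Hypothetical helper: report whether a response wrote to or read from the cache
function logCacheStatus(usage: {
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}): void {
  const wrote = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  if (read > 0) console.log(`cache HIT: ${read} tokens read at 10% price`);
  else if (wrote > 0) console.log(`cache WRITE: ${wrote} tokens stored at 125% price`);
  else console.log('no caching: below minimum size, or cache_control missing');
}

logCacheStatus(response.usage);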

Unexpected Cache Misses

Problem: Cache hits occur only intermittently

Solutions:

  1. Ensure content doesn't change (even whitespace)
  2. Check TTL hasn't expired
  3. Verify using same API key
  4. Monitor the cache usage fields in each response

High Cache Creation Costs

Problem: Frequent cache_creation_input_tokens

Solutions:

  1. Increase request frequency so the cache is reused before it expires
  2. Confirm caching is cost-effective for your workload (needs 2+ uses; see the check below)
  3. Extend the cache TTL to 1 hour (see the Cache Lifetime sketch above)
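
The 2+ uses rule follows from the price ratios: one write at 125% plus reads at 10% must beat paying 100% on every request. A quick check in relative price units:

// Caching is cheaper once 1.25 + 0.1 * (n - 1) < n, i.e. from n = 2 onward
const cachingWins = (uses: number): boolean => 1.25 + 0.1 * (uses - 1) < uses;

console.log(cachingWins(1)); // false - a single use costs 125% instead of 100%
console.log(cachingWins(2)); // true  - 135% beats 200%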

Cost Calculator

function calculateCachingSavings(
  cachedTokens: number,
  uncachedTokens: number,
  requestCount: number
): {
  withoutCaching: number;
  withCaching: number;
  savings: number;
  savingsPercent: number;
} {
  const inputCostPerMTok = 3;
  const cacheCostPerMTok = 3.75;
  const cacheReadCostPerMTok = 0.3;

  const withoutCaching = ((cachedTokens + uncachedTokens) / 1_000_000) *
    inputCostPerMTok * requestCount;

  const cacheWrite = (cachedTokens / 1_000_000) * cacheCostPerMTok;
  const cacheReads = (cachedTokens / 1_000_000) * cacheReadCostPerMTok * (requestCount - 1);
  const uncachedInput = (uncachedTokens / 1_000_000) * inputCostPerMTok * requestCount;
  const withCaching = cacheWrite + cacheReads + uncachedInput;

  const savings = withoutCaching - withCaching;
  const savingsPercent = (savings / withoutCaching) * 100;

  return { withoutCaching, withCaching, savings, savingsPercent };
}

// Example: 10k cached tokens, 1k uncached, 20 requests
const result = calculateCachingSavings(10000, 1000, 20);
console.log(`Savings: $${result.savings.toFixed(4)} (${result.savingsPercent.toFixed(1)}%)`);
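// Prints: Savings: $0.5055 (76.6%)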

Official Documentation
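
  • Prompt caching: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching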