Embeddings Guide

Last Updated: 2025-10-25

Complete guide to OpenAI's Embeddings API for semantic search, RAG, and clustering.


Model Comparison

Model                    Default Dimensions   Custom Dimensions   Best For
text-embedding-3-large   3072                 256-3072            Highest quality semantic search
text-embedding-3-small   1536                 256-1536            Most applications, cost-effective
text-embedding-ada-002   1536                 Fixed               Legacy (use v3 models)

Dimension Selection

Full Dimensions

  • text-embedding-3-small: 1536 (default)
  • text-embedding-3-large: 3072 (default)
  • Use for maximum accuracy

Reduced Dimensions

  • 256 dims: 6-12x storage reduction vs. the full 1536/3072 dimensions, minimal quality loss
  • 512 dims: 3-6x storage reduction, good quality
  • Use for cost/storage optimization

// Setup shared by the examples in this guide
import OpenAI from 'openai';
const openai = new OpenAI();

// Full dimensions (1536)
const full = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Sample text',
});

// Reduced dimensions (256)
const reduced = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Sample text',
  dimensions: 256,
});

RAG (Retrieval-Augmented Generation) Pattern

1. Build Knowledge Base

const documents = [
  'TypeScript is a superset of JavaScript',
  'Python is a high-level programming language',
  'React is a JavaScript library for UIs',
];

const embeddings = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: documents,
});

const knowledgeBase = documents.map((text, i) => ({
  text,
  embedding: embeddings.data[i].embedding,
}));

2. Search the Knowledge Base

// Embed user query
const queryEmbedding = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'What is TypeScript?',
});

// Find similar documents
const similarities = knowledgeBase.map(doc => ({
  text: doc.text,
  similarity: cosineSimilarity(queryEmbedding.data[0].embedding, doc.embedding),
}));

similarities.sort((a, b) => b.similarity - a.similarity);
const topResults = similarities.slice(0, 3);

3. Generate Answer with Context

const context = topResults.map(r => r.text).join('\n\n');

const completion = await openai.chat.completions.create({
  model: 'gpt-5',
  messages: [
    { role: 'system', content: `Answer using this context:\n\n${context}` },
    { role: 'user', content: 'What is TypeScript?' },
  ],
});

Similarity Metrics

Cosine Similarity

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

Euclidean Distance

function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(
    a.reduce((sum, val, i) => sum + Math.pow(val - b[i], 2), 0)
  );
}
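
OpenAI embeddings are returned normalized to length 1, so for vectors taken straight from the API the plain dot product ranks results the same way as cosine similarity while skipping the magnitude computation. A minimal sketch:

// Dot product; equivalent ranking to cosine similarity for unit-length vectors
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}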

Batch Processing

// Process up to 2048 documents
const embeddings = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: documents, // Array of strings
});

embeddings.data.forEach((item, index) => {
  console.log(`Doc ${index}: ${item.embedding.length} dimensions`);
});

Limits:

  • Max tokens per input: 8192
  • Max total tokens across all inputs in one request: 300,000
  • Max inputs per request (array length): 2048 (see the batching sketch below)
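
A corpus larger than 2048 documents has to be split across multiple requests. A minimal batching sketch, assuming the openai client set up earlier in this guide; it does not check the 300,000-token cap, which would also matter for very long inputs:

// Embed any number of documents by splitting them into API-sized batches
async function embedAll(texts: string[], batchSize = 2048): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const res = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch,
    });
    vectors.push(...res.data.map(d => d.embedding));
  }
  return vectors;
}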

Best Practices

Pre-processing:

  • Normalize text (lowercase, remove special chars)
  • Be consistent across queries and documents
  • Chunk long documents (max 8192 tokens per input; see the chunking sketch below)
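
A rough chunking sketch, using an approximate characters-per-token heuristic rather than exact token counts (a tokenizer such as tiktoken would be more precise):

// Character-based chunking; 8,000 chars is roughly 2,000 tokens, well under the 8192-token limit
function chunkText(text: string, maxChars = 8000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}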

Storage:

  • Use custom dimensions (256-512) for storage optimization
  • Store embeddings in vector databases (Pinecone, Weaviate, Qdrant)
  • Cache embeddings (deterministic for the same input; see the caching sketch below)
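
A minimal caching sketch, using an in-memory Map keyed by model and input text (hypothetical helper; a persistent store or vector database would take the Map's place in production). It assumes the openai client set up earlier:

// In-memory embedding cache; repeated inputs reuse the stored vector
const embeddingCache = new Map<string, number[]>();

async function getEmbedding(text: string, model = 'text-embedding-3-small'): Promise<number[]> {
  const key = `${model}:${text}`;
  const cached = embeddingCache.get(key);
  if (cached) return cached;
  const res = await openai.embeddings.create({ model, input: text });
  const embedding = res.data[0].embedding;
  embeddingCache.set(key, embedding);
  return embedding;
}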

Search:

  • Use cosine similarity for comparison
  • Normalize embeddings before storing (L2 normalization; see the sketch after this list)
  • Pre-filter with metadata before similarity search
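
Embeddings returned by the API are already unit length, but vectors that are truncated or averaged locally should be re-normalized, especially if a dot-product index is used. A minimal L2-normalization sketch:

// Scale a vector to unit length (L2 norm of 1)
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, val) => sum + val * val, 0));
  return norm === 0 ? v : v.map(val => val / norm);
}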

Don't:

  • Mix models (embeddings from different models aren't comparable)
  • Exceed token limits (8192 tokens per input)
  • Skip normalization
  • Compare raw embeddings without a similarity metric

Use Cases

  1. Semantic Search: Find similar documents
  2. RAG: Retrieve context for generation
  3. Clustering: Group similar content
  4. Recommendations: Content-based recommendations
  5. Anomaly Detection: Detect outliers
  6. Duplicate Detection: Find similar/duplicate content (see the sketch below)
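
As an example of the last use case, duplicate detection can be a simple threshold over pairwise cosine similarity, reusing cosineSimilarity from above. The 0.9 threshold is an illustrative assumption, not a tuned value:

// Flag pairs of documents whose embeddings are nearly identical
function findDuplicates(
  items: { text: string; embedding: number[] }[],
  threshold = 0.9,
): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      if (cosineSimilarity(items[i].embedding, items[j].embedding) >= threshold) {
        pairs.push([items[i].text, items[j].text]);
      }
    }
  }
  return pairs;
}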

See Also: Official Embeddings Guide (https://platform.openai.com/docs/guides/embeddings)