# Embeddings Guide
**Last Updated**: 2025-10-25

Complete guide to OpenAI's Embeddings API for semantic search, RAG, and clustering.

---

## Model Comparison

| Model | Default Dimensions | Custom Dimensions | Best For |
|-------|-------------------|-------------------|----------|
| text-embedding-3-large | 3072 | 256-3072 | Highest quality semantic search |
| text-embedding-3-small | 1536 | 256-1536 | Most applications, cost-effective |
| text-embedding-ada-002 | 1536 | Fixed | Legacy (use v3 models) |

---

## Dimension Selection

### Full Dimensions
- **text-embedding-3-small**: 1536 (default)
- **text-embedding-3-large**: 3072 (default)
- Use for maximum accuracy

### Reduced Dimensions
- **256 dims**: 6x (small) to 12x (large) storage reduction, minimal quality loss
- **512 dims**: 3x (small) to 6x (large) storage reduction, good quality
- Use for cost/storage optimization

```typescript
import OpenAI from 'openai';

// The client reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

// Full dimensions (1536)
const full = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Sample text',
});

// Reduced dimensions (256)
const reduced = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'Sample text',
  dimensions: 256,
});
```
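
The v3 models also support shortening embeddings after the fact: truncate the vector and re-normalize it to unit length. A minimal sketch for reusing stored full-size vectors at a smaller dimension (`truncateEmbedding` is an illustrative helper, not part of the SDK):

```typescript
// Truncate a stored full-size embedding and re-normalize to unit length.
// When you pass `dimensions` to the API, the returned vectors are already
// normalized; manual truncation is only needed for vectors you stored earlier.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  const cut = embedding.slice(0, dims);
  const norm = Math.sqrt(cut.reduce((sum, v) => sum + v * v, 0));
  return cut.map(v => v / norm);
}

const short = truncateEmbedding(full.data[0].embedding, 256);
```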

---

## RAG (Retrieval-Augmented Generation) Pattern

### 1. Build Knowledge Base

```typescript
const documents = [
  'TypeScript is a superset of JavaScript',
  'Python is a high-level programming language',
  'React is a JavaScript library for UIs',
];

const embeddings = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: documents,
});

const knowledgeBase = documents.map((text, i) => ({
  text,
  embedding: embeddings.data[i].embedding,
}));
```

### 2. Query with Similarity Search

```typescript
// Embed the user query
const queryEmbedding = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'What is TypeScript?',
});

// Find similar documents (cosineSimilarity is defined below)
const similarities = knowledgeBase.map(doc => ({
  text: doc.text,
  similarity: cosineSimilarity(queryEmbedding.data[0].embedding, doc.embedding),
}));

similarities.sort((a, b) => b.similarity - a.similarity);
const topResults = similarities.slice(0, 3);
```

### 3. Generate Answer with Context

```typescript
const context = topResults.map(r => r.text).join('\n\n');

const completion = await openai.chat.completions.create({
  model: 'gpt-5',
  messages: [
    { role: 'system', content: `Answer using this context:\n\n${context}` },
    { role: 'user', content: 'What is TypeScript?' },
  ],
});
```

---

## Similarity Metrics

### Cosine Similarity (Recommended)

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}
```
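
Because OpenAI embeddings are returned already normalized to unit length, cosine similarity reduces to a plain dot product for these vectors, skipping both magnitude computations:

```typescript
// Equivalent to cosineSimilarity for unit-length vectors,
// such as those returned by the Embeddings API.
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}
```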

### Euclidean Distance

```typescript
function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(
    a.reduce((sum, val, i) => sum + Math.pow(val - b[i], 2), 0)
  );
}
```
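
For unit-length vectors the two metrics rank results identically, since d² = 2(1 − cosine similarity); cosine similarity is usually the simpler choice unless your vector database defaults to Euclidean distance.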

---

## Batch Processing

```typescript
// Process up to 2048 documents in a single request
const embeddings = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: documents, // Array of strings
});

embeddings.data.forEach((item, index) => {
  console.log(`Doc ${index}: ${item.embedding.length} dimensions`);
});
```

**Limits**:
- Max tokens per input: 8192
- Max total tokens across all inputs per request: 300,000
- Max inputs per request (array length): 2048; batch larger corpora as sketched below
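
A minimal batching loop for corpora above the per-request cap (`allDocuments` is a hypothetical string array standing in for your corpus):

```typescript
// Embed a large corpus in batches that respect the 2048-input cap.
// Very long batches may also need splitting to stay under the
// 300,000-token total per request.
declare const allDocuments: string[]; // your corpus

const BATCH_SIZE = 2048;
const allEmbeddings: number[][] = [];

for (let i = 0; i < allDocuments.length; i += BATCH_SIZE) {
  const batch = allDocuments.slice(i, i + BATCH_SIZE);
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: batch,
  });
  allEmbeddings.push(...response.data.map(d => d.embedding));
}
```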

---

## Best Practices

✅ **Pre-processing**:
- Normalize text (lowercase, remove special characters)
- Be consistent across queries and documents
- Chunk long documents to stay under the 8192-token input limit (see the sketch below)
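
A rough character-based chunker for the last bullet; it relies on the common ~4 characters per token heuristic, so use a real tokenizer (e.g. the `tiktoken` package) where exact counts matter:

```typescript
// Split text into chunks that stay under the token limit, using the
// rough heuristic of ~4 characters per token. For exact counts,
// count tokens with a tokenizer instead.
function chunkText(text: string, maxTokens = 8192): string[] {
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}
```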

✅ **Storage**:
- Use custom dimensions (256-512) for storage optimization
- Store embeddings in vector databases (Pinecone, Weaviate, Qdrant)
- Cache embeddings (deterministic for same input)

✅ **Search**:
- Use cosine similarity for comparison
- Keep stored vectors L2-normalized (API output is already unit-length; re-normalize only if you truncate vectors yourself)
- Pre-filter with metadata before similarity search

❌ **Don't**:
- Mix models (their vector spaces and dimensions are incompatible)
- Exceed token limits (8192 per input)
- Skip text normalization between indexing and querying
- Compare embeddings without a proper similarity metric

---

## Use Cases

1. **Semantic Search**: Find similar documents
2. **RAG**: Retrieve context for generation
3. **Clustering**: Group similar content
4. **Recommendations**: Content-based recommendations
5. **Anomaly Detection**: Detect outliers
6. **Duplicate Detection**: Find similar/duplicate content (see the threshold sketch below)
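
As one concrete instance of the last item, duplicate detection typically reduces to a similarity threshold; the 0.9 cutoff below is an illustrative value to tune on your own data:

```typescript
// Flag pairs of documents whose embeddings exceed a similarity threshold.
// Uses cosineSimilarity from the Similarity Metrics section above.
function findDuplicates(
  items: { text: string; embedding: number[] }[],
  threshold = 0.9,
): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      if (cosineSimilarity(items[i].embedding, items[j].embedding) >= threshold) {
        pairs.push([items[i].text, items[j].text]);
      }
    }
  }
  return pairs;
}
```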

---

**See Also**: [Official Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)