Initial commit
364
references/file-search-rag-guide.md
Normal file
@@ -0,0 +1,364 @@
# File Search & RAG Guide

Complete guide to implementing Retrieval-Augmented Generation (RAG) with the Assistants API.

---

## What is File Search?

A built-in tool for semantic search over documents using vector stores:
- **Capacity**: Up to 10,000 files per assistant (vs 20 in v1)
- **Technology**: Vector + keyword search with reranking
- **Automatic**: Chunking, embedding, and indexing handled by OpenAI
- **Pricing**: $0.10/GB/day (first 1GB free)

---

## Architecture

```
Documents (PDF, DOCX, MD, etc.)
        ↓
Vector Store (chunking + embeddings)
        ↓
Assistant with file_search tool
        ↓
Semantic Search + Reranking
        ↓
Retrieved Context + LLM Generation
```

---

## Quick Setup

### 1. Create Vector Store

```typescript
import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

const vectorStore = await openai.beta.vectorStores.create({
  name: "Product Documentation",
  expires_after: {
    anchor: "last_active_at",
    days: 30,
  },
});
```

### 2. Upload Documents

```typescript
import fs from "node:fs";

// Upload each file with the "assistants" purpose so it can be indexed
const files = await Promise.all([
  openai.files.create({ file: fs.createReadStream("doc1.pdf"), purpose: "assistants" }),
  openai.files.create({ file: fs.createReadStream("doc2.md"), purpose: "assistants" }),
]);

// Attach all uploaded files to the vector store in a single batch
const batch = await openai.beta.vectorStores.fileBatches.create(vectorStore.id, {
  file_ids: files.map(f => f.id),
});
```

### 3. Wait for Indexing

```typescript
let current = await openai.beta.vectorStores.fileBatches.retrieve(vectorStore.id, batch.id);

// Poll every 2 seconds while the batch from step 2 is still indexing
while (current.status === 'in_progress') {
  await new Promise(r => setTimeout(r, 2000));
  current = await openai.beta.vectorStores.fileBatches.retrieve(vectorStore.id, batch.id);
}

if (current.status !== 'completed') { // may also end as "failed" or "cancelled"
  throw new Error(`Indexing ended with status: ${current.status}`);
}
```

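
The loop above polls indefinitely if a batch never leaves `in_progress`. A minimal sketch of a bounded variant follows, assuming the status fetch is passed in as a callback; `waitForBatch`, `BatchStatus`, and the option names are hypothetical helpers for illustration, not part of the SDK.

```typescript
// Hypothetical helper (not part of the OpenAI SDK): polls until a terminal
// status is reported or the timeout elapses.
type BatchStatus = "in_progress" | "completed" | "failed" | "cancelled";

async function waitForBatch(
  fetchStatus: () => Promise<BatchStatus>,
  { intervalMs = 2000, timeoutMs = 5 * 60_000 } = {},
): Promise<BatchStatus> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const status = await fetchStatus();
    if (status !== "in_progress") return status; // completed, failed, or cancelled
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Indexing still in progress after ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With the objects from the steps above, the callback could wrap `openai.beta.vectorStores.fileBatches.retrieve(vectorStore.id, batch.id)` and return its `status` field. Passing the fetch as a function keeps the helper easy to exercise without network access.
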
### 4. Create Assistant

```typescript
const assistant = await openai.beta.assistants.create({
  name: "Knowledge Base Assistant",
  instructions: "Answer questions using the file search tool. Always cite your sources.",
  tools: [{ type: "file_search" }],
  tool_resources: {
    file_search: {
      vector_store_ids: [vectorStore.id],
    },
  },
  model: "gpt-4o",
});
```

---

## Supported File Formats

- `.pdf` - PDFs (most common)
- `.docx` - Word documents
- `.md`, `.txt` - Plain text
- `.html` - HTML documents
- `.json` - JSON data
- `.py`, `.js`, `.ts`, `.cpp`, `.java` - Code files

**Size Limits**:
- **Per file**: 512 MB
- **Total per vector store**: Limited by pricing ($0.10/GB/day)

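
Checking files locally before uploading avoids a round trip for files that would be rejected. The sketch below assumes only the extensions and the 512 MB limit listed above; `validateUpload` and the constant names are hypothetical, and the extension set is not exhaustive.

```typescript
// Extensions taken from the list above; the real API accepts more formats.
const SUPPORTED_EXTENSIONS = new Set([
  ".pdf", ".docx", ".md", ".txt", ".html", ".json",
  ".py", ".js", ".ts", ".cpp", ".java",
]);
const MAX_FILE_BYTES = 512 * 1024 * 1024; // 512 MB per-file limit

// Hypothetical pre-upload check: returns a list of problems (empty means OK).
function validateUpload(fileName: string, sizeBytes: number): string[] {
  const problems: string[] = [];
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";
  if (!SUPPORTED_EXTENSIONS.has(ext)) {
    problems.push(`unsupported extension: ${ext || "(none)"}`);
  }
  if (sizeBytes > MAX_FILE_BYTES) {
    problems.push(`file exceeds 512 MB: ${sizeBytes} bytes`);
  }
  return problems;
}
```
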
---

## Chunking Strategy

OpenAI automatically chunks documents using:
- **Max chunk size**: ~800 tokens by default (adjustable through the `chunking_strategy` option when adding files)
- **Overlap**: Ensures context continuity
- **Hierarchy**: Preserves document structure (headers, sections)

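
To make the size and overlap settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is illustrative only and is not OpenAI's implementation: it treats each whitespace-separated word as one "token", and the function name and the default overlap of 400 are assumptions.

```typescript
// Illustrative only: approximates tokens with whitespace-separated words.
function chunkWithOverlap(text: string, maxTokens = 800, overlap = 400): string[] {
  if (overlap >= maxTokens) throw new Error("overlap must be smaller than maxTokens");
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = maxTokens - overlap; // how far each new chunk advances
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens).join(" "));
    if (start + maxTokens >= tokens.length) break; // this chunk reached the end
  }
  return chunks;
}
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.
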
### Optimize for Better Results

**Document Structure**:
```markdown
# Main Topic

## Subtopic 1
Content here...

## Subtopic 2
Content here...
```

**Clear Sections**: Use headers to organize content
**Concise Paragraphs**: Avoid very long paragraphs (500+ words)
**Self-Contained**: Each section should make sense independently

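
The paragraph-length guideline can be checked mechanically before upload. The following is a rough sketch, assuming paragraphs are separated by blank lines and that a simple word count is accurate enough; `findLongParagraphs` is a hypothetical helper name.

```typescript
// Hypothetical lint helper: reports paragraphs longer than `maxWords` words.
function findLongParagraphs(markdown: string, maxWords = 500) {
  return markdown
    .split(/\n\s*\n/) // paragraphs are separated by one or more blank lines
    .map((text, index) => ({
      index, // zero-based position of the paragraph in the document
      words: text.split(/\s+/).filter(Boolean).length,
    }))
    .filter((paragraph) => paragraph.words > maxWords);
}
```
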
---

## Improving Search Quality

### 1. Better Instructions

```typescript
const assistant = await openai.beta.assistants.create({
  instructions: `You are a support assistant. When answering:
1. Use file_search to find relevant information
2. Synthesize information from multiple sources
3. Always provide citations with file names
4. If information isn't found, say so clearly
5. Don't make up information not in the documents`,
  tools: [{ type: "file_search" }],
  // ...
});
```

### 2. Query Refinement

Encourage users to be specific:
- ❌ "How do I install?"
- ✅ "How do I install the product on Windows 10?"

### 3. Multi-Document Answers

File Search can retrieve chunks from several documents for a single query, and the model combines them into one answer.

---

## Citations

### Accessing Citations

```typescript
const messages = await openai.beta.threads.messages.list(thread.id);
const response = messages.data[0]; // newest message first (default order)

for (const content of response.content) {
  if (content.type === 'text') {
    console.log('Answer:', content.text.value);

    // Citations
    if (content.text.annotations) {
      for (const annotation of content.text.annotations) {
        if (annotation.type === 'file_citation') {
          console.log('Source:', annotation.file_citation.file_id);
          // `quote` may be absent in v2 responses, in which case this logs undefined
          console.log('Quote:', annotation.file_citation.quote);
        }
      }
    }
  }
}
```

### Displaying Citations

```typescript
// Assumes the first content part of the message is text
let answer = response.content[0].text.value;

// Replace citation markers with clickable links
for (const annotation of response.content[0].text.annotations) {
  if (annotation.type === 'file_citation') {
    const citation = `[${annotation.text}](source: ${annotation.file_citation.file_id})`;
    // String.replace with a string pattern only rewrites the first occurrence
    answer = answer.replace(annotation.text, citation);
  }
}

console.log(answer);
```

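
An alternative presentation is numbered footnotes, which keeps the answer readable when the same file is cited several times. The sketch below is one possible approach rather than a prescribed format: `renderFootnotes` and the reduced `CitationAnnotation` shape are assumptions that mirror only the fields used in the snippets above. It prints file IDs; mapping them to file names would need a separate lookup.

```typescript
// Minimal shape of an annotation, reduced to the fields used here (assumption).
interface CitationAnnotation {
  type: string;
  text: string; // marker exactly as it appears in the answer text
  file_citation?: { file_id: string };
}

// Hypothetical helper: swaps citation markers for [n] and appends a source list.
function renderFootnotes(answer: string, annotations: CitationAnnotation[]): string {
  const sources: string[] = [];
  let body = answer;
  for (const a of annotations) {
    if (a.type !== "file_citation" || !a.file_citation) continue;
    let n = sources.indexOf(a.file_citation.file_id) + 1;
    if (n === 0) n = sources.push(a.file_citation.file_id); // push returns the new length
    body = body.split(a.text).join(`[${n}]`); // replace every occurrence of the marker
  }
  const footnotes = sources.map((id, i) => `[${i + 1}] ${id}`);
  return footnotes.length ? `${body}\n\n${footnotes.join("\n")}` : body;
}
```
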
---

## Cost Management

### Pricing Structure

- **Storage**: $0.10/GB/day
- **Free tier**: First 1GB
- **Example**: 5GB stored = 4GB billable = $0.40/day ≈ $12/month (30 days)

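
The example arithmetic generalizes to a small estimator. This is a sketch under the pricing stated above; the function and constant names are hypothetical, and it assumes the free 1GB is subtracted once from the total being estimated.

```typescript
const PRICE_PER_GB_DAY = 0.10; // storage price listed above
const FREE_GB = 1;             // free tier listed above

// Hypothetical estimator: cost of keeping `totalGB` stored for `days` days.
function estimateStorageCost(totalGB: number, days = 30) {
  const billableGB = Math.max(0, totalGB - FREE_GB);
  const perDay = billableGB * PRICE_PER_GB_DAY;
  return { billableGB, perDay, total: perDay * days };
}
```

For 5GB this gives 4 billable GB, about $0.40 per day and about $12 over 30 days, matching the example. Anything at or below 1GB comes out as zero.
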
### Optimization Strategies

1. **Auto-Expiration**:
```typescript
const vectorStore = await openai.beta.vectorStores.create({
  expires_after: {
    anchor: "last_active_at",
    days: 7, // Delete after 7 days of inactivity
  },
});
```

2. **Cleanup Old Stores**:
```typescript
async function cleanupOldVectorStores() {
  // Only inspects the first page (up to 100 stores)
  const stores = await openai.beta.vectorStores.list({ limit: 100 });

  for (const store of stores.data) {
    // created_at is a Unix timestamp in seconds
    const ageDays = (Date.now() / 1000 - store.created_at) / (60 * 60 * 24);

    if (ageDays > 30) {
      await openai.beta.vectorStores.del(store.id);
    }
  }
}
```

3. **Monitor Usage**:
```typescript
const store = await openai.beta.vectorStores.retrieve(vectorStoreId);
const sizeGB = store.usage_bytes / (1024 * 1024 * 1024);
const costPerDay = Math.max(0, (sizeGB - 1) * 0.10);
console.log(`Daily cost: $${costPerDay.toFixed(4)}`);
```

---

## Advanced Patterns

### Pattern: Multi-Tenant Knowledge Bases

```typescript
// Separate vector store per tenant
const tenantStore = await openai.beta.vectorStores.create({
  name: `Tenant ${tenantId} KB`,
  metadata: { tenant_id: tenantId },
});

// Or: Single store with namespace simulation via file metadata
await openai.files.create({
  file: fs.createReadStream("doc.pdf"),
  purpose: "assistants",
  metadata: { tenant_id: tenantId }, // Coming soon: not accepted by the Files API yet
});
```

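
With one store per tenant, the application needs a way to find (or lazily create) the right store for each request. A minimal in-memory sketch follows; `TenantStoreRegistry` and `CreateStore` are hypothetical names, and a production version would persist the mapping rather than hold it in a `Map`.

```typescript
// Hypothetical callback: creates a vector store for a tenant and returns its ID.
type CreateStore = (tenantId: string) => Promise<string>;

class TenantStoreRegistry {
  private readonly stores = new Map<string, Promise<string>>();
  private readonly createStore: CreateStore;

  constructor(createStore: CreateStore) {
    this.createStore = createStore;
  }

  // Caches the pending promise so concurrent calls for one tenant share one store.
  getStoreId(tenantId: string): Promise<string> {
    let pending = this.stores.get(tenantId);
    if (!pending) {
      pending = this.createStore(tenantId).catch((err) => {
        this.stores.delete(tenantId); // allow a retry after a failed creation
        throw err;
      });
      this.stores.set(tenantId, pending);
    }
    return pending;
  }
}
```

In practice the callback could wrap the `openai.beta.vectorStores.create` call shown above and return the new store's `id`. Injecting it keeps the registry independent of the SDK.
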
### Pattern: Versioned Documentation

```typescript
// Version 1.0
const v1Store = await openai.beta.vectorStores.create({
  name: "Docs v1.0",
  metadata: { version: "1.0" },
});

// Version 2.0
const v2Store = await openai.beta.vectorStores.create({
  name: "Docs v2.0",
  metadata: { version: "2.0" },
});

// Switch based on user preference
const storeId = userVersion === "1.0" ? v1Store.id : v2Store.id;
```

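
The ternary above works for two versions; with more versions, a lookup table with an explicit fallback tends to scale better. A small sketch, in which `resolveStoreId` and the store IDs in the usage note are hypothetical:

```typescript
// Hypothetical lookup: documentation version -> vector store ID, with a fallback.
function resolveStoreId(
  stores: Map<string, string>,
  requested: string | undefined,
  fallbackVersion: string,
): string {
  const id =
    (requested !== undefined ? stores.get(requested) : undefined) ??
    stores.get(fallbackVersion);
  if (id === undefined) {
    throw new Error(`No vector store registered for version ${fallbackVersion}`);
  }
  return id;
}
```

For example, with a map of `"1.0"` to `"vs_v1"` and `"2.0"` to `"vs_v2"`, a request for an unknown version such as `"3.0"` falls back to whichever version is passed as `fallbackVersion`.
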
### Pattern: Hybrid Search (File Search + Code Interpreter)

```typescript
const assistant = await openai.beta.assistants.create({
  model: "gpt-4o", // required; same model as in Quick Setup
  tools: [
    { type: "file_search" },
    { type: "code_interpreter" },
  ],
  tool_resources: {
    file_search: {
      vector_store_ids: [docsVectorStoreId],
    },
  },
});

// Assistant can search docs AND analyze attached data files
await openai.beta.threads.messages.create(thread.id, {
  role: "user", // required when creating a message
  content: "Compare this sales data against the targets in our planning docs",
  attachments: [{
    file_id: salesDataFileId,
    tools: [{ type: "code_interpreter" }],
  }],
});
```

---

## Troubleshooting

### No Results Found

**Causes**:
- Vector store not fully indexed
- Poor query formulation
- Documents lack relevant content

**Solutions**:
- Wait for `status: "completed"`
- Refine query to be more specific
- Check document quality and structure

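
Beyond the overall status, the per-file counts on a vector store can show whether some files are still indexing or failed. The sketch below assumes the store exposes a `file_counts` object with the fields shown; `describeIndexing` is a hypothetical helper, and treating any failed file as "not ready" is a design choice rather than an API rule.

```typescript
// Assumed shape of `file_counts` on a vector store object.
interface FileCounts {
  in_progress: number;
  completed: number;
  failed: number;
  cancelled: number;
  total: number;
}

// Hypothetical diagnostic: summarizes whether a store is ready to be searched.
function describeIndexing(counts: FileCounts): { ready: boolean; message: string } {
  if (counts.total === 0) return { ready: false, message: "store has no files" };
  if (counts.in_progress > 0) {
    return { ready: false, message: `${counts.in_progress} file(s) still indexing` };
  }
  if (counts.failed > 0) {
    return { ready: false, message: `${counts.failed} file(s) failed to index` };
  }
  return { ready: true, message: `${counts.completed} file(s) indexed` };
}
```

Calling it with the `file_counts` of a retrieved store before the first query gives a quick signal for the "not fully indexed" cause listed above.
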
### Irrelevant Results

**Causes**:
- Poor document structure
- Too much noise in documents
- Vague queries

**Solutions**:
- Add clear section headers
- Remove boilerplate/repetitive content
- Improve query specificity

### High Costs

**Causes**:
- Too many vector stores
- Large files that don't expire
- Duplicate content

**Solutions**:
- Set auto-expiration
- Deduplicate documents
- Delete unused stores

---

## Best Practices

1. **Structure documents** with clear headers and sections
2. **Wait for indexing** before using vector store
3. **Set auto-expiration** to manage costs
4. **Monitor storage** regularly
5. **Provide citations** in responses
6. **Refine queries** for better results
7. **Clean up** old vector stores

---

**Last Updated**: 2025-10-25