Files
gh-jezweb-claude-skills-ski…/references/file-search-rag-guide.md
2025-11-30 08:25:15 +08:00

365 lines
8.0 KiB
Markdown

# File Search & RAG Guide
Complete guide to implementing Retrieval-Augmented Generation (RAG) with the Assistants API.
---
## What is File Search?
A built-in tool for semantic search over documents using vector stores:
- **Capacity**: Up to 10,000 files per assistant (vs 20 in v1)
- **Technology**: Vector + keyword search with reranking
- **Automatic**: Chunking, embedding, and indexing handled by OpenAI
- **Pricing**: $0.10/GB/day (first 1GB free)
---
## Architecture
```
Documents (PDF, DOCX, MD, etc.)
Vector Store (chunking + embeddings)
Assistant with file_search tool
Semantic Search + Reranking
Retrieved Context + LLM Generation
```
---
## Quick Setup
### 1. Create Vector Store
```typescript
const vectorStore = await openai.beta.vectorStores.create({
name: "Product Documentation",
expires_after: {
anchor: "last_active_at",
days: 30,
},
});
```
### 2. Upload Documents
```typescript
const files = await Promise.all([
openai.files.create({ file: fs.createReadStream("doc1.pdf"), purpose: "assistants" }),
openai.files.create({ file: fs.createReadStream("doc2.md"), purpose: "assistants" }),
]);
const batch = await openai.beta.vectorStores.fileBatches.create(vectorStore.id, {
file_ids: files.map(f => f.id),
});
```
### 3. Wait for Indexing
```typescript
let batch = await openai.beta.vectorStores.fileBatches.retrieve(vectorStore.id, batch.id);
while (batch.status === 'in_progress') {
await new Promise(r => setTimeout(r, 2000));
batch = await openai.beta.vectorStores.fileBatches.retrieve(vectorStore.id, batch.id);
}
```
### 4. Create Assistant
```typescript
const assistant = await openai.beta.assistants.create({
name: "Knowledge Base Assistant",
instructions: "Answer questions using the file search tool. Always cite your sources.",
tools: [{ type: "file_search" }],
tool_resources: {
file_search: {
vector_store_ids: [vectorStore.id],
},
},
model: "gpt-4o",
});
```
---
## Supported File Formats
- `.pdf` - PDFs (most common)
- `.docx` - Word documents
- `.md`, `.txt` - Plain text
- `.html` - HTML documents
- `.json` - JSON data
- `.py`, `.js`, `.ts`, `.cpp`, `.java` - Code files
**Size Limits**:
- **Per file**: 512 MB
- **Total per vector store**: Limited by pricing ($0.10/GB/day)
---
## Chunking Strategy
OpenAI automatically chunks documents using:
- **Max chunk size**: ~800 tokens (configurable internally)
- **Overlap**: Ensures context continuity
- **Hierarchy**: Preserves document structure (headers, sections)
### Optimize for Better Results
**Document Structure**:
```markdown
# Main Topic
## Subtopic 1
Content here...
## Subtopic 2
Content here...
```
**Clear Sections**: Use headers to organize content
**Concise Paragraphs**: Avoid very long paragraphs (500+ words)
**Self-Contained**: Each section should make sense independently
---
## Improving Search Quality
### 1. Better Instructions
```typescript
const assistant = await openai.beta.assistants.create({
instructions: `You are a support assistant. When answering:
1. Use file_search to find relevant information
2. Synthesize information from multiple sources
3. Always provide citations with file names
4. If information isn't found, say so clearly
5. Don't make up information not in the documents`,
tools: [{ type: "file_search" }],
// ...
});
```
### 2. Query Refinement
Encourage users to be specific:
- ❌ "How do I install?"
- ✅ "How do I install the product on Windows 10?"
### 3. Multi-Document Answers
File Search automatically retrieves from multiple documents and combines information.
---
## Citations
### Accessing Citations
```typescript
const messages = await openai.beta.threads.messages.list(thread.id);
const response = messages.data[0];
for (const content of response.content) {
if (content.type === 'text') {
console.log('Answer:', content.text.value);
// Citations
if (content.text.annotations) {
for (const annotation of content.text.annotations) {
if (annotation.type === 'file_citation') {
console.log('Source:', annotation.file_citation.file_id);
console.log('Quote:', annotation.file_citation.quote);
}
}
}
}
}
```
### Displaying Citations
```typescript
let answer = response.content[0].text.value;
// Replace citation markers with clickable links
for (const annotation of response.content[0].text.annotations) {
if (annotation.type === 'file_citation') {
const citation = `[${annotation.text}](source: ${annotation.file_citation.file_id})`;
answer = answer.replace(annotation.text, citation);
}
}
console.log(answer);
```
---
## Cost Management
### Pricing Structure
- **Storage**: $0.10/GB/day
- **Free tier**: First 1GB
- **Example**: 5GB = $0.40/day = $12/month
### Optimization Strategies
1. **Auto-Expiration**:
```typescript
const vectorStore = await openai.beta.vectorStores.create({
expires_after: {
anchor: "last_active_at",
days: 7, // Delete after 7 days of inactivity
},
});
```
2. **Cleanup Old Stores**:
```typescript
async function cleanupOldVectorStores() {
const stores = await openai.beta.vectorStores.list({ limit: 100 });
for (const store of stores.data) {
const ageDays = (Date.now() / 1000 - store.created_at) / (60 * 60 * 24);
if (ageDays > 30) {
await openai.beta.vectorStores.del(store.id);
}
}
}
```
3. **Monitor Usage**:
```typescript
const store = await openai.beta.vectorStores.retrieve(vectorStoreId);
const sizeGB = store.usage_bytes / (1024 * 1024 * 1024);
const costPerDay = Math.max(0, (sizeGB - 1) * 0.10);
console.log(`Daily cost: $${costPerDay.toFixed(4)}`);
```
---
## Advanced Patterns
### Pattern: Multi-Tenant Knowledge Bases
```typescript
// Separate vector store per tenant
const tenantStore = await openai.beta.vectorStores.create({
name: `Tenant ${tenantId} KB`,
metadata: { tenant_id: tenantId },
});
// Or: Single store with namespace simulation via file metadata
await openai.files.create({
file: fs.createReadStream("doc.pdf"),
purpose: "assistants",
metadata: { tenant_id: tenantId }, // Coming soon
});
```
### Pattern: Versioned Documentation
```typescript
// Version 1.0
const v1Store = await openai.beta.vectorStores.create({
name: "Docs v1.0",
metadata: { version: "1.0" },
});
// Version 2.0
const v2Store = await openai.beta.vectorStores.create({
name: "Docs v2.0",
metadata: { version: "2.0" },
});
// Switch based on user preference
const storeId = userVersion === "1.0" ? v1Store.id : v2Store.id;
```
### Pattern: Hybrid Search (File Search + Code Interpreter)
```typescript
const assistant = await openai.beta.assistants.create({
tools: [
{ type: "file_search" },
{ type: "code_interpreter" },
],
tool_resources: {
file_search: {
vector_store_ids: [docsVectorStoreId],
},
},
});
// Assistant can search docs AND analyze attached data files
await openai.beta.threads.messages.create(thread.id, {
content: "Compare this sales data against the targets in our planning docs",
attachments: [{
file_id: salesDataFileId,
tools: [{ type: "code_interpreter" }],
}],
});
```
---
## Troubleshooting
### No Results Found
**Causes**:
- Vector store not fully indexed
- Poor query formulation
- Documents lack relevant content
**Solutions**:
- Wait for `status: "completed"`
- Refine query to be more specific
- Check document quality and structure
### Irrelevant Results
**Causes**:
- Poor document structure
- Too much noise in documents
- Vague queries
**Solutions**:
- Add clear section headers
- Remove boilerplate/repetitive content
- Improve query specificity
### High Costs
**Causes**:
- Too many vector stores
- Large files that don't expire
- Duplicate content
**Solutions**:
- Set auto-expiration
- Deduplicate documents
- Delete unused stores
---
## Best Practices
1. **Structure documents** with clear headers and sections
2. **Wait for indexing** before using vector store
3. **Set auto-expiration** to manage costs
4. **Monitor storage** regularly
5. **Provide citations** in responses
6. **Refine queries** for better results
7. **Clean up** old vector stores
---
**Last Updated**: 2025-10-25