# Index Operations Guide
Complete guide for creating and managing Vectorize indexes.
## Index Configuration
### Critical Decisions (Cannot Be Changed!)
When creating an index, these settings are **permanent**:
1. **Dimensions**: Vector width (must match embedding model)
2. **Distance Metric**: How similarity is calculated
Choose carefully - you cannot change these after creation!
### Dimensions
Dimensions must match your embedding model's output:
| Model | Provider | Dimensions | Recommended Metric |
|-------|----------|------------|-------------------|
| @cf/baai/bge-base-en-v1.5 | Workers AI | 768 | cosine |
| text-embedding-3-small | OpenAI | 1536 | cosine |
| text-embedding-3-large | OpenAI | 3072 | cosine |
| text-embedding-ada-002 | OpenAI (legacy) | 1536 | cosine |
| embed-english-v3.0 | Cohere | 1024 | cosine |
**Common Mistake**: Creating an index with 1536 dimensions but using a 768-dim model!
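A mismatch like this is easy to catch before any vector reaches the index. The following is a minimal sketch, assuming a hypothetical helper `assertDimensions` and a constant `INDEX_DIMENSIONS` that you keep in sync with the value passed to `--dimensions`; neither is part of the Vectorize API.

```typescript
// Hypothetical guard (not part of the Vectorize API): fail fast when an
// embedding's length differs from the dimensions the index was created with.
function assertDimensions(values: number[], expected: number): void {
  if (values.length !== expected) {
    throw new Error(
      `Embedding has ${values.length} dimensions but the index expects ${expected}`
    );
  }
}

// Assumption: this mirrors the --dimensions value used at index creation.
const INDEX_DIMENSIONS = 768;

// Stand-in for a real embedding returned by your model.
const sampleEmbedding = new Array<number>(768).fill(0.1);
assertDimensions(sampleEmbedding, INDEX_DIMENSIONS); // passes silently
```

Calling the guard right before each insert or query surfaces a model/index mismatch as a clear local error.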
### Distance Metrics
Choose based on your embedding model and use case:
#### Cosine Similarity (`cosine`)
- **Best for**: Text embeddings, where direction matters more than magnitude (most common)
- **Range**: -1 (opposite) to 1 (identical)
- **Use when**: Similarity should ignore vector length; embeddings do not have to be L2-normalized first
- **Most common choice** - works with Workers AI, OpenAI, Cohere
```bash
npx wrangler vectorize create my-index \
--dimensions=768 \
--metric=cosine
```
#### Euclidean Distance (`euclidean`)
- **Best for**: Absolute distance matters
- **Range**: 0 (identical) to ∞ (different)
- **Use when**: Magnitude of vectors is important
- **Example**: Geographic coordinates, image features
```bash
npx wrangler vectorize create geo-index \
--dimensions=2 \
--metric=euclidean
```
#### Dot Product (`dot-product`)
- **Best for**: Embeddings whose magnitude carries meaning
- **Range**: -∞ to ∞
- **Use when**: Embeddings are not normalized and vector length should affect ranking
- **Less common** - on normalized embeddings it ranks results the same way as cosine
```bash
npx wrangler vectorize create sparse-index \
--dimensions=1024 \
--metric=dot-product
```
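To make the difference between the three metrics concrete, here is a small sketch that computes each one by hand for two vectors pointing in the same direction with different lengths. The function names are illustrative only; Vectorize computes the chosen metric internally and this is not its implementation.

```typescript
// Illustrative implementations of the three metrics (not Vectorize internals).
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (norm(a) * norm(b));
}

function euclidean(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

const short = [3, 4];
const long = [6, 8]; // same direction as `short`, twice the magnitude

console.log(cosine(short, long));    // 1  (direction is identical)
console.log(dot(short, long));       // 50 (grows with magnitude)
console.log(euclidean(short, long)); // 5  (the vectors are not the same point)
```

Cosine treats the two vectors as identical, while dot product and Euclidean distance both react to the difference in length.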
## Creating Indexes
### Via Wrangler CLI
```bash
npx wrangler vectorize create <name> \
--dimensions=<number> \
--metric=<metric> \
[--description="<text>"]
```
### Via REST API
```typescript
const response = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${accountId}/vectorize/v2/indexes`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiToken}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      name: 'my-index',
      description: 'Production semantic search',
      config: {
        dimensions: 768,
        metric: 'cosine',
      },
    }),
  }
);
```
## Metadata Indexes
**⚠️ CRITICAL TIMING**: Create metadata indexes IMMEDIATELY after creating the main index, BEFORE inserting any vectors!
### Why Timing Matters
Vectorize builds metadata indexes **only for vectors inserted AFTER** the metadata index was created. Vectors inserted earlier are not covered by it, so filters on that property will not match them until they are upserted again.
### Best Practice Workflow
```bash
# 1. Create main index
npx wrangler vectorize create docs-search \
  --dimensions=768 \
  --metric=cosine

# 2. IMMEDIATELY create all metadata indexes
npx wrangler vectorize create-metadata-index docs-search \
  --property-name=category --type=string

npx wrangler vectorize create-metadata-index docs-search \
  --property-name=timestamp --type=number

npx wrangler vectorize create-metadata-index docs-search \
  --property-name=published --type=boolean

# 3. Verify metadata indexes exist
npx wrangler vectorize list-metadata-index docs-search

# 4. NOW safe to start inserting vectors
```
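Once the metadata indexes exist, queries can filter on those properties. The sketch below shows one way such a query might look, using the `category` and `published` properties from the workflow above. `QueryableIndex` is a hypothetical minimal type declared here so the example stands alone; in a Worker you would pass the real `VectorizeIndex` binding, and `searchPublishedDocs` is an illustrative name.

```typescript
// Hypothetical minimal type covering only what this sketch uses.
// In a Worker, the real VectorizeIndex binding is passed instead.
interface QueryableIndex {
  query(
    vector: number[],
    options: { topK: number; filter?: Record<string, unknown> }
  ): Promise<{ matches: { id: string; score: number }[] }>;
}

// Illustrative helper: restrict results to published vectors in the
// "docs" category. Both properties need a metadata index that was
// created before the vectors were inserted.
async function searchPublishedDocs(index: QueryableIndex, queryVector: number[]) {
  return index.query(queryVector, {
    topK: 5,
    filter: { category: "docs", published: true },
  });
}
```

Inside a Worker this would be called as `searchPublishedDocs(env.VECTORIZE_INDEX, queryVector)`, assuming the binding name used elsewhere in this guide.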
### Metadata Index Limits
- **Max 10 metadata indexes** per Vectorize index
- **String type**: First 64 bytes indexed (UTF-8 boundaries)
- **Number type**: Float64 precision
- **Boolean type**: true/false
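The 64-byte string limit means two long values that share their first 64 bytes may be indistinguishable to a filter. The following sketch approximates which prefix of a string would be indexed, cutting only on character boundaries. It illustrates the rule as stated above and is an assumption about behavior, not Vectorize's actual implementation; `indexedPrefix` is a hypothetical name.

```typescript
// Hypothetical illustration of "first 64 bytes, cut on UTF-8 boundaries".
// Not Vectorize's implementation; useful for spotting values that collide.
function indexedPrefix(value: string, maxBytes = 64): string {
  const encoder = new TextEncoder();
  let bytes = 0;
  let prefix = "";
  for (const ch of value) {
    const size = encoder.encode(ch).length; // UTF-8 byte length of this character
    if (bytes + size > maxBytes) break;     // never split a multi-byte character
    bytes += size;
    prefix += ch;
  }
  return prefix;
}

// A 70-character ASCII value keeps only its first 64 characters.
console.log(indexedPrefix("x".repeat(70)).length); // 64
```

If a filterable field such as a URL or path can exceed this length, consider storing a shorter derived value (for example, a category or hash) in a separate property and indexing that instead.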
### Choosing What to Index
Only create metadata indexes for fields you'll **filter** on:
**Good candidates**:
- `category` (string) - "docs", "tutorials", "guides"
- `language` (string) - "en", "es", "fr"
- `published_at` (number) - Unix timestamp
- `status` (string) - "published", "draft", "archived"
- `verified` (boolean) - true/false
**Bad candidates** (don't need indexes):
- `title` (string) - only for display, not filtering
- `content` (string) - stored in metadata but not filtered
- `url` (string) - unless filtering by URL prefix
## Wrangler Binding
After creating an index, bind it to your Worker:
### wrangler.jsonc
```jsonc
{
  "name": "my-worker",
  "main": "src/index.ts",
  "vectorize": [
    {
      "binding": "VECTORIZE_INDEX",
      "index_name": "docs-search"
    }
  ]
}
```
### TypeScript Types
```typescript
export interface Env {
  VECTORIZE_INDEX: VectorizeIndex;
}
```
## Index Management Operations
### List All Indexes
```bash
npx wrangler vectorize list
```
### Get Index Details
```bash
npx wrangler vectorize get my-index
```
**Returns**:
```json
{
  "name": "my-index",
  "description": "Production search",
  "config": {
    "dimensions": 768,
    "metric": "cosine"
  },
  "created_on": "2024-01-15T10:30:00Z",
  "modified_on": "2024-01-15T10:30:00Z"
}
```
### Get Index Info (Vector Count)
```bash
npx wrangler vectorize info my-index
```
**Returns**:
```json
{
  "vectorsCount": 12543,
  "lastProcessedMutation": {
    "id": "abc123...",
    "timestamp": "2024-01-20T14:22:00Z"
  }
}
```
### Delete Index
```bash
# With confirmation
npx wrangler vectorize delete my-index
# Skip confirmation (use with caution!)
npx wrangler vectorize delete my-index --force
```
**⚠️ WARNING**: Deletion is **irreversible**! All vectors are permanently lost.
## Index Naming Best Practices
### Good Names
- `production-docs-search` - Environment + purpose
- `dev-product-recommendations` - Environment + use case
- `customer-support-rag` - Descriptive use case
- `en-knowledge-base` - Language + type
### Bad Names
- `index1` - Not descriptive
- `my_index` - Use dashes, not underscores
- `PRODUCTION` - Use lowercase
- `this-is-a-very-long-index-name-that-exceeds-limits` - Too long
### Naming Rules
- Lowercase letters and numbers only
- Dashes allowed (not underscores or spaces)
- Must start with a letter
- Max 32 characters
- No special characters
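The rules above can be checked in code before calling Wrangler or the REST API. This is a minimal sketch that encodes the rules exactly as listed in this guide; `isValidIndexName` and `INDEX_NAME_PATTERN` are hypothetical names, and the service's own validation remains authoritative.

```typescript
// Encodes the naming rules as listed in this guide: starts with a lowercase
// letter, then lowercase letters, digits, or dashes, 32 characters at most.
const INDEX_NAME_PATTERN = /^[a-z][a-z0-9-]{0,31}$/;

// Hypothetical helper, not part of Wrangler or the Vectorize API.
function isValidIndexName(name: string): boolean {
  return INDEX_NAME_PATTERN.test(name);
}

console.log(isValidIndexName("production-docs-search")); // true
console.log(isValidIndexName("my_index"));               // false (underscore)
console.log(isValidIndexName("PRODUCTION"));             // false (uppercase)
```

A name such as `index1` passes this check; it is discouraged above only because it is not descriptive.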
## Common Patterns
### Multi-Environment Setup
```bash
# Development
npx wrangler vectorize create dev-docs-search \
--dimensions=768 --metric=cosine
# Staging
npx wrangler vectorize create staging-docs-search \
--dimensions=768 --metric=cosine
# Production
npx wrangler vectorize create prod-docs-search \
--dimensions=768 --metric=cosine
```
```jsonc
// wrangler.jsonc
{
  "env": {
    "dev": {
      "vectorize": [
        { "binding": "VECTORIZE", "index_name": "dev-docs-search" }
      ]
    },
    "staging": {
      "vectorize": [
        { "binding": "VECTORIZE", "index_name": "staging-docs-search" }
      ]
    },
    "production": {
      "vectorize": [
        { "binding": "VECTORIZE", "index_name": "prod-docs-search" }
      ]
    }
  }
}
```
### Multi-Tenant with Namespaces
Instead of creating separate indexes per customer, use one index with namespaces:
```bash
# Single index for all tenants
npx wrangler vectorize create multi-tenant-index \
--dimensions=768 --metric=cosine
```
```typescript
// Insert with namespace
await env.VECTORIZE_INDEX.upsert([{
  id: 'doc-1',
  values: embedding,
  namespace: 'customer-abc123', // Isolates by customer
  metadata: { title: 'Customer document' }
}]);

// Query within namespace
const results = await env.VECTORIZE_INDEX.query(queryVector, {
  topK: 5,
  namespace: 'customer-abc123' // Only search this customer's data
});
```
## Troubleshooting
### "Index name already exists"
```bash
# Check existing indexes
npx wrangler vectorize list
# Delete old index if needed
npx wrangler vectorize delete old-name --force
```
### "Cannot change dimensions"
**No fix** - dimensions are fixed at creation. You must create a new index with the correct dimensions and re-insert all vectors.
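Re-inserting a large set of vectors is usually done in batches. The sketch below shows one possible shape for that step. `chunk`, `reinsertAll`, `VectorRecord`, and `UpsertTarget` are hypothetical names declared here so the example stands alone, and the default batch size of 500 is an assumption; check the current Vectorize limits before choosing a value.

```typescript
// Hypothetical shapes so the sketch is self-contained; in a Worker the
// target would be the VectorizeIndex binding for the NEW index.
interface VectorRecord {
  id: string;
  values: number[];
  metadata?: Record<string, unknown>;
}

interface UpsertTarget {
  upsert(vectors: VectorRecord[]): Promise<unknown>;
}

// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Upsert every vector into the new index, one batch at a time.
// Assumption: 500 per batch; adjust to the documented limit.
async function reinsertAll(
  target: UpsertTarget,
  vectors: VectorRecord[],
  batchSize = 500
): Promise<void> {
  for (const batch of chunk(vectors, batchSize)) {
    await target.upsert(batch);
  }
}
```

The vectors passed in must already have the new index's dimensions, which normally means re-embedding the source content with the model the new index was sized for.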
### "Wrangler version 3.71.0 required"
```bash
# Update Wrangler
npm install -g wrangler@latest
# Or use npx
npx wrangler@latest vectorize create ...
```
## See Also
- [Wrangler Commands](./wrangler-commands.md)
- [Vector Operations](./vector-operations.md)
- [Metadata Guide](./metadata-guide.md)