Files
2025-11-30 08:24:34 +08:00

459 lines
9.1 KiB
Markdown

# Metadata Filtering Guide
Complete reference for metadata indexes and filtering in Vectorize.
## Overview
Metadata allows you to:
- Store additional data alongside vectors (up to 10 KiB per vector)
- Filter query results based on metadata properties
- Narrow search scope without re-indexing
## Metadata Indexes
**⚠️ CRITICAL**: Metadata indexes MUST be created BEFORE inserting vectors!
Vectors inserted before a metadata index exists won't be filterable on that property.
### Creating Metadata Indexes
```bash
npx wrangler vectorize create-metadata-index <index-name> \
--property-name=<property> \
--type=<type>
```
**Types**: `string`, `number`, `boolean`
**Limits**:
- Max **10 metadata indexes** per Vectorize index
- **String**: First 64 bytes indexed (UTF-8 boundaries)
- **Number**: Float64 precision
- **Boolean**: true/false
### Example Setup
```bash
# Create index
npx wrangler vectorize create docs-search --dimensions=768 --metric=cosine
# Create metadata indexes IMMEDIATELY
npx wrangler vectorize create-metadata-index docs-search \
--property-name=category --type=string
npx wrangler vectorize create-metadata-index docs-search \
--property-name=published_at --type=number
npx wrangler vectorize create-metadata-index docs-search \
--property-name=verified --type=boolean
# Verify
npx wrangler vectorize list-metadata-index docs-search
```
## Metadata Schema
### Valid Metadata Keys
```typescript
// ✅ Valid keys
metadata: {
category: 'docs',
title: 'Getting Started',
published_at: 1704067200,
verified: true,
nested: { allowed: true }
}
// ❌ Invalid keys
metadata: {
'': 'value', // Empty key
'user.name': 'John', // Contains dot (reserved for nesting)
'$admin': true, // Starts with $
'key"quoted': 1 // Contains "
}
```
**Key restrictions**:
- Cannot be empty
- Cannot contain `.` (dot) - reserved for nested access
- Cannot contain `"` (double quote)
- Cannot start with `$` (dollar sign)
- Max 512 characters
### Nested Metadata
Use dot notation for nested properties:
```typescript
// Store nested metadata
metadata: {
author: {
id: 'user123',
name: 'John Doe',
verified: true
}
}
// Filter with dot notation
filter: { 'author.verified': true }
// Create index for nested property
npx wrangler vectorize create-metadata-index docs-search \
--property-name=author_verified \
--type=boolean
```
## Filter Operators
### Equality
```typescript
// Implicit $eq
filter: { category: 'documentation' }
// Explicit $eq
filter: { category: { $eq: 'documentation' } }
```
### Not Equals
```typescript
filter: { status: { $ne: 'archived' } }
```
### In Array
```typescript
filter: { category: { $in: ['docs', 'tutorials', 'guides'] } }
```
### Not In Array
```typescript
filter: { status: { $nin: ['archived', 'draft', 'deleted'] } }
```
### Less Than
```typescript
filter: { published_at: { $lt: 1735689600 } }
```
### Less Than or Equal
```typescript
filter: { priority: { $lte: 5 } }
```
### Greater Than
```typescript
filter: { published_at: { $gt: 1704067200 } }
```
### Greater Than or Equal
```typescript
filter: { score: { $gte: 0.8 } }
```
## Range Queries
### Number Ranges
```typescript
// Documents published in 2024
filter: {
published_at: {
$gte: 1704067200, // >= Jan 1, 2024
$lt: 1735689600 // < Jan 1, 2025
}
}
// Scores between 0.7 and 0.9
filter: {
quality_score: {
$gte: 0.7,
$lte: 0.9
}
}
```
### String Ranges (Prefix Search)
```typescript
// URLs starting with /docs/workers/
filter: {
url: {
$gte: '/docs/workers/',
$lt: '/docs/workersz' // 'z' after all possible chars
}
}
// IDs starting with 'user-2024'
filter: {
id: {
$gte: 'user-2024',
$lt: 'user-2025'
}
}
```
## Combined Filters
Multiple conditions are combined with implicit **AND**:
```typescript
filter: {
category: 'documentation', // AND
language: 'en', // AND
published: true, // AND
published_at: { $gte: 1704067200 } // AND
}
```
**No OR operator** - for OR logic, make multiple queries.
## Complex Examples
### Multi-field with Ranges
```typescript
filter: {
category: { $in: ['docs', 'tutorials'] },
language: 'en',
status: { $ne: 'archived' },
published_at: {
$gte: 1704067200,
$lt: 1735689600
},
'author.verified': true
}
```
### Boolean and String
```typescript
filter: {
published: true,
featured: false,
category: 'documentation',
language: { $in: ['en', 'es', 'fr'] }
}
```
### Nested with Range
```typescript
filter: {
'metrics.views': { $gte: 1000 },
'metrics.rating': { $gte: 4.5 },
'author.verified': true,
published_at: { $gt: Date.now() / 1000 - 86400 * 30 } // Last 30 days
}
```
## Cardinality Considerations
**Cardinality** = Number of unique values in a field
### Low Cardinality (Good for Filtering)
```typescript
// Few unique values - efficient
category: 'docs' | 'tutorials' | 'guides' // ~3-10 values
language: 'en' | 'es' | 'fr' // ~5-20 values
published: true | false // 2 values
```
### High Cardinality (Avoid in Range Queries)
```typescript
// Many unique values - can impact performance
user_id: 'uuid-v4-...' // Millions of unique values
timestamp_ms: 1704067200123 // Unique per millisecond
email: 'user@example.com' // Unique per user
```
### Performance Impact
**Range queries** on high-cardinality fields can be slow:
```typescript
// ❌ Slow: High cardinality range
filter: {
user_id: { // Millions of unique UUIDs
$gte: '00000000-0000-0000-0000-000000000000',
$lt: 'zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz'
}
}
// ✅ Better: Low cardinality range
filter: {
published_at: { // Timestamps in seconds
$gte: 1704067200,
$lt: 1735689600
}
}
```
### Best Practices
1. **Use seconds, not milliseconds** for timestamps
2. **Categorize high-cardinality fields** (e.g., user → user_tier)
3. **Limit range span** to avoid scanning millions of values
4. **Use $eq for high cardinality**, not ranges
## Filter Size Limit
**Max 2048 bytes** (compact JSON representation)
```typescript
// Check filter size
const filterString = JSON.stringify(filter);
if (filterString.length > 2048) {
console.error('Filter too large!');
}
```
If filter is too large:
- Split into multiple queries
- Simplify conditions
- Use namespace filtering first
## Namespace vs Metadata Filtering
### Namespace Filtering
```typescript
// Insert with namespace
await env.VECTORIZE_INDEX.upsert([{
id: 'doc-1',
values: embedding,
namespace: 'customer-abc123', // Partition key
metadata: { type: 'support' }
}]);
// Query with namespace (applied FIRST)
const results = await env.VECTORIZE_INDEX.query(queryVector, {
namespace: 'customer-abc123',
filter: { type: 'support' }
});
```
### When to Use Each
| Use Namespace | Use Metadata |
|---------------|--------------|
| Multi-tenant isolation | Fine-grained filtering |
| Customer segmentation | Category filtering |
| Environment (dev/prod) | Date ranges |
| Large partitions | Boolean flags |
| Applied BEFORE metadata | Applied AFTER namespace |
### Combined Strategy
```typescript
// Namespace: Customer isolation
namespace: 'customer-abc123'
// Metadata: Detailed filtering
filter: {
category: 'support_tickets',
status: { $ne: 'closed' },
priority: { $gte: 3 },
created_at: { $gte: Date.now() / 1000 - 86400 * 7 } // Last 7 days
}
```
## Common Patterns
### Published Content Only
```typescript
filter: {
published: true,
status: { $ne: 'archived' }
}
```
### Recent Documents
```typescript
const oneWeekAgo = Math.floor(Date.now() / 1000) - (7 * 24 * 60 * 60);
filter: {
published_at: { $gte: oneWeekAgo }
}
```
### Multi-Language Support
```typescript
filter: {
language: { $in: ['en', 'es', 'fr'] },
published: true
}
```
### Verified Authors Only
```typescript
filter: {
'author.verified': true,
'author.active': true
}
```
### Time-Based Content
```typescript
// Content from specific quarter
filter: {
published_at: {
$gte: 1704067200, // Q1 2024 start
$lt: 1711929600 // Q1 2024 end
}
}
```
## Debugging Filters
### Test Filter Syntax
```bash
npx wrangler vectorize query docs-search \
--vector="[0.1,0.2,...]" \
--filter='{"category":"docs","published":true}' \
--top-k=5
```
### Check Metadata Indexes
```bash
npx wrangler vectorize list-metadata-index docs-search
```
### Verify Metadata Structure
```typescript
const vectors = await env.VECTORIZE_INDEX.getByIds(['doc-1']);
console.log(vectors[0].metadata);
```
## Error Messages
| Error | Cause | Solution |
|-------|-------|----------|
| "Metadata property not indexed" | No metadata index for property | Create metadata index |
| "Filter exceeds 2048 bytes" | Filter JSON too large | Simplify or split queries |
| "Invalid metadata key" | Key contains `.`, `"`, or starts with `$` | Rename metadata key |
| "Filter must be non-empty object" | Empty filter `{}` | Remove filter or add conditions |
## See Also
- [Vector Operations](./vector-operations.md)
- [Wrangler Commands](./wrangler-commands.md)
- [Index Operations](./index-operations.md)
- [Official Docs](https://developers.cloudflare.com/vectorize/reference/metadata-filtering/)