9.1 KiB
9.1 KiB
Metadata Filtering Guide
Complete reference for metadata indexes and filtering in Vectorize.
Overview
Metadata allows you to:
- Store additional data alongside vectors (up to 10 KiB per vector)
- Filter query results based on metadata properties
- Narrow search scope without re-indexing
Metadata Indexes
⚠️ CRITICAL: Metadata indexes MUST be created BEFORE inserting vectors!
Vectors inserted before a metadata index exists won't be filterable on that property.
Creating Metadata Indexes
npx wrangler vectorize create-metadata-index <index-name> \
--property-name=<property> \
--type=<type>
Types: string, number, boolean
Limits:
- Max 10 metadata indexes per Vectorize index
- String: First 64 bytes indexed (UTF-8 boundaries)
- Number: Float64 precision
- Boolean: true/false
Example Setup
# Create index
npx wrangler vectorize create docs-search --dimensions=768 --metric=cosine
# Create metadata indexes IMMEDIATELY
npx wrangler vectorize create-metadata-index docs-search \
--property-name=category --type=string
npx wrangler vectorize create-metadata-index docs-search \
--property-name=published_at --type=number
npx wrangler vectorize create-metadata-index docs-search \
--property-name=verified --type=boolean
# Verify
npx wrangler vectorize list-metadata-index docs-search
Metadata Schema
Valid Metadata Keys
// ✅ Valid keys
metadata: {
category: 'docs',
title: 'Getting Started',
published_at: 1704067200,
verified: true,
nested: { allowed: true }
}
// ❌ Invalid keys
metadata: {
'': 'value', // Empty key
'user.name': 'John', // Contains dot (reserved for nesting)
'$admin': true, // Starts with $
'key"quoted': 1 // Contains "
}
Key restrictions:
- Cannot be empty
- Cannot contain
.(dot) - reserved for nested access - Cannot contain
"(double quote) - Cannot start with
$(dollar sign) - Max 512 characters
Nested Metadata
Use dot notation for nested properties:
// Store nested metadata
metadata: {
author: {
id: 'user123',
name: 'John Doe',
verified: true
}
}
// Filter with dot notation
filter: { 'author.verified': true }
// Create index for nested property
npx wrangler vectorize create-metadata-index docs-search \
--property-name=author_verified \
--type=boolean
Filter Operators
Equality
// Implicit $eq
filter: { category: 'documentation' }
// Explicit $eq
filter: { category: { $eq: 'documentation' } }
Not Equals
filter: { status: { $ne: 'archived' } }
In Array
filter: { category: { $in: ['docs', 'tutorials', 'guides'] } }
Not In Array
filter: { status: { $nin: ['archived', 'draft', 'deleted'] } }
Less Than
filter: { published_at: { $lt: 1735689600 } }
Less Than or Equal
filter: { priority: { $lte: 5 } }
Greater Than
filter: { published_at: { $gt: 1704067200 } }
Greater Than or Equal
filter: { score: { $gte: 0.8 } }
Range Queries
Number Ranges
// Documents published in 2024
filter: {
published_at: {
$gte: 1704067200, // >= Jan 1, 2024
$lt: 1735689600 // < Jan 1, 2025
}
}
// Scores between 0.7 and 0.9
filter: {
quality_score: {
$gte: 0.7,
$lte: 0.9
}
}
String Ranges (Prefix Search)
// URLs starting with /docs/workers/
filter: {
url: {
$gte: '/docs/workers/',
$lt: '/docs/workersz' // 'z' after all possible chars
}
}
// IDs starting with 'user-2024'
filter: {
id: {
$gte: 'user-2024',
$lt: 'user-2025'
}
}
Combined Filters
Multiple conditions are combined with implicit AND:
filter: {
category: 'documentation', // AND
language: 'en', // AND
published: true, // AND
published_at: { $gte: 1704067200 } // AND
}
No OR operator - for OR logic, make multiple queries.
Complex Examples
Multi-field with Ranges
filter: {
category: { $in: ['docs', 'tutorials'] },
language: 'en',
status: { $ne: 'archived' },
published_at: {
$gte: 1704067200,
$lt: 1735689600
},
'author.verified': true
}
Boolean and String
filter: {
published: true,
featured: false,
category: 'documentation',
language: { $in: ['en', 'es', 'fr'] }
}
Nested with Range
filter: {
'metrics.views': { $gte: 1000 },
'metrics.rating': { $gte: 4.5 },
'author.verified': true,
published_at: { $gt: Date.now() / 1000 - 86400 * 30 } // Last 30 days
}
Cardinality Considerations
Cardinality = Number of unique values in a field
Low Cardinality (Good for Filtering)
// Few unique values - efficient
category: 'docs' | 'tutorials' | 'guides' // ~3-10 values
language: 'en' | 'es' | 'fr' // ~5-20 values
published: true | false // 2 values
High Cardinality (Avoid in Range Queries)
// Many unique values - can impact performance
user_id: 'uuid-v4-...' // Millions of unique values
timestamp_ms: 1704067200123 // Unique per millisecond
email: 'user@example.com' // Unique per user
Performance Impact
Range queries on high-cardinality fields can be slow:
// ❌ Slow: High cardinality range
filter: {
user_id: { // Millions of unique UUIDs
$gte: '00000000-0000-0000-0000-000000000000',
$lt: 'zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz'
}
}
// ✅ Better: Low cardinality range
filter: {
published_at: { // Timestamps in seconds
$gte: 1704067200,
$lt: 1735689600
}
}
Best Practices
- Use seconds, not milliseconds for timestamps
- Categorize high-cardinality fields (e.g., user → user_tier)
- Limit range span to avoid scanning millions of values
- Use $eq for high cardinality, not ranges
Filter Size Limit
Max 2048 bytes (compact JSON representation)
// Check filter size
const filterString = JSON.stringify(filter);
if (filterString.length > 2048) {
console.error('Filter too large!');
}
If filter is too large:
- Split into multiple queries
- Simplify conditions
- Use namespace filtering first
Namespace vs Metadata Filtering
Namespace Filtering
// Insert with namespace
await env.VECTORIZE_INDEX.upsert([{
id: 'doc-1',
values: embedding,
namespace: 'customer-abc123', // Partition key
metadata: { type: 'support' }
}]);
// Query with namespace (applied FIRST)
const results = await env.VECTORIZE_INDEX.query(queryVector, {
namespace: 'customer-abc123',
filter: { type: 'support' }
});
When to Use Each
| Use Namespace | Use Metadata |
|---|---|
| Multi-tenant isolation | Fine-grained filtering |
| Customer segmentation | Category filtering |
| Environment (dev/prod) | Date ranges |
| Large partitions | Boolean flags |
| Applied BEFORE metadata | Applied AFTER namespace |
Combined Strategy
// Namespace: Customer isolation
namespace: 'customer-abc123'
// Metadata: Detailed filtering
filter: {
category: 'support_tickets',
status: { $ne: 'closed' },
priority: { $gte: 3 },
created_at: { $gte: Date.now() / 1000 - 86400 * 7 } // Last 7 days
}
Common Patterns
Published Content Only
filter: {
published: true,
status: { $ne: 'archived' }
}
Recent Documents
const oneWeekAgo = Math.floor(Date.now() / 1000) - (7 * 24 * 60 * 60);
filter: {
published_at: { $gte: oneWeekAgo }
}
Multi-Language Support
filter: {
language: { $in: ['en', 'es', 'fr'] },
published: true
}
Verified Authors Only
filter: {
'author.verified': true,
'author.active': true
}
Time-Based Content
// Content from specific quarter
filter: {
published_at: {
$gte: 1704067200, // Q1 2024 start
$lt: 1711929600 // Q1 2024 end
}
}
Debugging Filters
Test Filter Syntax
npx wrangler vectorize query docs-search \
--vector="[0.1,0.2,...]" \
--filter='{"category":"docs","published":true}' \
--top-k=5
Check Metadata Indexes
npx wrangler vectorize list-metadata-index docs-search
Verify Metadata Structure
const vectors = await env.VECTORIZE_INDEX.getByIds(['doc-1']);
console.log(vectors[0].metadata);
Error Messages
| Error | Cause | Solution |
|---|---|---|
| "Metadata property not indexed" | No metadata index for property | Create metadata index |
| "Filter exceeds 2048 bytes" | Filter JSON too large | Simplify or split queries |
| "Invalid metadata key" | Key contains ., ", or starts with $ |
Rename metadata key |
| "Filter must be non-empty object" | Empty filter {} |
Remove filter or add conditions |