zhongwei/gh-jezweb-claude-skills-skills-cloudflare-vectorize

Zhongwei Li 4ad7b3dd73 Initial commit

2025-11-30 08:24:34 +08:00

9.1 KiB

Metadata Filtering Guide

Complete reference for metadata indexes and filtering in Vectorize.

Overview

Metadata allows you to:

Store additional data alongside vectors (up to 10 KiB per vector)
Filter query results based on metadata properties
Narrow search scope without re-indexing

Metadata Indexes

⚠️ CRITICAL: Metadata indexes MUST be created BEFORE inserting vectors!

Vectors inserted before a metadata index exists are not added to that index, so filters on that property will not match them until they are re-upserted.

Creating Metadata Indexes

npx wrangler vectorize create-metadata-index <index-name> \
  --property-name=<property> \
  --type=<type>

Types: string, number, boolean

Limits:

Max 10 metadata indexes per Vectorize index
String: First 64 bytes indexed (UTF-8 boundaries)
Number: Float64 precision
Boolean: true/false
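The 64-byte limit on string indexes is easier to reason about with a concrete check. The sketch below approximates which part of a string value falls inside the indexed range, assuming the cut is made on a UTF-8 character boundary as stated above. `indexedPrefix` is a hypothetical helper written for illustration, not part of the Vectorize API.

```javascript
// Hypothetical helper: approximate the portion of a string that a string
// metadata index covers (first 64 bytes, cut on a UTF-8 character boundary).
function indexedPrefix(value, maxBytes = 64) {
  const bytes = new TextEncoder().encode(value);
  if (bytes.length <= maxBytes) return value;

  let end = maxBytes;
  // Back up while the byte at `end` is a continuation byte (10xxxxxx),
  // so the cut never lands inside a multi-byte character.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}

// Two long values that only differ after byte 64 share the same prefix.
const a = 'x'.repeat(64) + '-first';
const b = 'x'.repeat(64) + '-second';
console.log(indexedPrefix(a) === indexedPrefix(b));
```

Values that differ only beyond the indexed prefix presumably cannot be told apart by a filter, so it is safer to keep distinguishing information near the start of indexed strings.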

Example Setup

# Create index
npx wrangler vectorize create docs-search --dimensions=768 --metric=cosine

# Create metadata indexes IMMEDIATELY
npx wrangler vectorize create-metadata-index docs-search \
  --property-name=category --type=string

npx wrangler vectorize create-metadata-index docs-search \
  --property-name=published_at --type=number

npx wrangler vectorize create-metadata-index docs-search \
  --property-name=verified --type=boolean

# Verify
npx wrangler vectorize list-metadata-index docs-search

Metadata Schema

Valid Metadata Keys

// ✅ Valid keys
metadata: {
  category: 'docs',
  title: 'Getting Started',
  published_at: 1704067200,
  verified: true,
  nested: { allowed: true }
}

// ❌ Invalid keys
metadata: {
  '': 'value',               // Empty key
  'user.name': 'John',       // Contains dot (reserved for nesting)
  '$admin': true,            // Starts with $
  'key"quoted': 1            // Contains "
}

Key restrictions:

Cannot be empty
Cannot contain . (dot) - reserved for nested access
Cannot contain " (double quote)
Cannot start with $ (dollar sign)
Max 512 characters
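The key restrictions above can be checked before upserting. This is a minimal sketch that encodes exactly the five rules listed; `metadataKeyProblems` is a hypothetical helper name, and the server may enforce additional rules not covered here.

```javascript
// Hypothetical pre-flight check for a top-level metadata key.
// Returns a list of violated rules (empty list = key looks valid).
function metadataKeyProblems(key) {
  const problems = [];
  if (key.length === 0) problems.push('empty');
  if (key.includes('.')) problems.push('contains dot');
  if (key.includes('"')) problems.push('contains double quote');
  if (key.startsWith('$')) problems.push('starts with $');
  if (key.length > 512) problems.push('longer than 512 characters');
  return problems;
}

console.log(metadataKeyProblems('category'));  // no problems reported
console.log(metadataKeyProblems('user.name')); // reports the dot
```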

Nested Metadata

Use dot notation for nested properties:

// Store nested metadata
metadata: {
  author: {
    id: 'user123',
    name: 'John Doe',
    verified: true
  }
}

// Filter with dot notation
filter: { 'author.verified': true }

# Create index for the nested property (same dot path as the filter)
npx wrangler vectorize create-metadata-index docs-search \
  --property-name=author.verified \
  --type=boolean
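To make the dot notation concrete, the following sketch resolves a dot path against a metadata object locally. `getPath` is a hypothetical helper for illustration; it only shows which stored value a path such as 'author.verified' refers to.

```javascript
// Hypothetical helper: resolve a dot path ('author.verified') against
// a metadata object. Returns undefined when any segment is missing.
function getPath(metadata, path) {
  return path.split('.').reduce(
    (value, key) =>
      value !== null && typeof value === 'object' ? value[key] : undefined,
    metadata
  );
}

const metadata = {
  author: { id: 'user123', name: 'John Doe', verified: true }
};

console.log(getPath(metadata, 'author.verified'));
console.log(getPath(metadata, 'author.email'));
```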

Filter Operators

Equality

// Implicit $eq
filter: { category: 'documentation' }

// Explicit $eq
filter: { category: { $eq: 'documentation' } }

Not Equals

filter: { status: { $ne: 'archived' } }

In Array

filter: { category: { $in: ['docs', 'tutorials', 'guides'] } }

Not In Array

filter: { status: { $nin: ['archived', 'draft', 'deleted'] } }

Less Than

filter: { published_at: { $lt: 1735689600 } }

Less Than or Equal

filter: { priority: { $lte: 5 } }

Greater Than

filter: { published_at: { $gt: 1704067200 } }

Greater Than or Equal

filter: { score: { $gte: 0.8 } }
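For unit tests it can help to evaluate a filter against sample metadata without calling Vectorize. The sketch below is a local approximation of the eight operators above, not Vectorize's implementation. `matchesFilter` is a hypothetical name, and the handling of missing properties and string ordering (JavaScript compares UTF-16 code units) are assumptions that may differ from the service.

```javascript
// Local approximation of the filter operators, for tests only.
const operators = {
  $eq: (actual, expected) => actual === expected,
  $ne: (actual, expected) => actual !== expected,
  $in: (actual, expected) => expected.includes(actual),
  $nin: (actual, expected) => !expected.includes(actual),
  $lt: (actual, expected) => actual < expected,
  $lte: (actual, expected) => actual <= expected,
  $gt: (actual, expected) => actual > expected,
  $gte: (actual, expected) => actual >= expected,
};

function matchesFilter(metadata, filter) {
  // Every top-level entry must match (implicit AND).
  return Object.entries(filter).every(([path, condition]) => {
    // Resolve dot paths such as 'author.verified'.
    const actual = path
      .split('.')
      .reduce((value, key) => (value == null ? undefined : value[key]), metadata);

    const isOperatorObject =
      condition !== null && typeof condition === 'object' && !Array.isArray(condition);
    if (!isOperatorObject) return actual === condition; // implicit $eq

    return Object.entries(condition).every(([op, expected]) => {
      if (!Object.prototype.hasOwnProperty.call(operators, op)) {
        throw new Error(`Unknown operator: ${op}`);
      }
      return operators[op](actual, expected);
    });
  });
}

const doc = { category: 'docs', published_at: 1710000000, author: { verified: true } };
console.log(matchesFilter(doc, { category: { $in: ['docs', 'guides'] } }));
```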

Range Queries

Number Ranges

// Documents published in 2024
filter: {
  published_at: {
    $gte: 1704067200,  // >= Jan 1, 2024
    $lt: 1735689600    // < Jan 1, 2025
  }
}

// Scores between 0.7 and 0.9
filter: {
  quality_score: {
    $gte: 0.7,
    $lte: 0.9
  }
}
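The hard-coded timestamps in the 2024 example are Unix seconds at midnight UTC. As a small sketch, they can be derived rather than typed by hand; `yearRange` is a hypothetical helper and assumes `published_at` is stored in UTC seconds.

```javascript
// Hypothetical helper: half-open range [Jan 1 of year, Jan 1 of next year)
// expressed in Unix seconds (UTC).
function yearRange(year) {
  return {
    $gte: Date.UTC(year, 0, 1) / 1000,
    $lt: Date.UTC(year + 1, 0, 1) / 1000,
  };
}

const filter = { published_at: yearRange(2024) };
console.log(filter); // { published_at: { '$gte': 1704067200, '$lt': 1735689600 } }
```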

String Ranges (Prefix Search)

// URLs starting with /docs/workers/
filter: {
  url: {
    $gte: '/docs/workers/',
    $lt: '/docs/workers0'  // '0' is the character right after '/' in byte order
  }
}

// IDs starting with 'user-2024'
filter: {
  id: {
    $gte: 'user-2024',
    $lt: 'user-2025'
  }
}
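The upper bound of a prefix range is the prefix with its last character replaced by the next one in byte order. The sketch below computes that bound; `prefixRange` is a hypothetical helper, and it deliberately rejects prefixes that end in a non-ASCII character because JavaScript string order and UTF-8 byte order can diverge there.

```javascript
// Hypothetical helper: build a { $gte, $lt } range matching all strings
// that start with `prefix`. Assumes the prefix ends in an ASCII character.
function prefixRange(prefix) {
  if (prefix.length === 0) throw new Error('prefix must not be empty');
  const last = prefix.charCodeAt(prefix.length - 1);
  if (last >= 0x7f) throw new Error('prefix must end in an ASCII character');

  // Increment the final character to get the smallest string that is
  // greater than every string starting with `prefix`.
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return { $gte: prefix, $lt: upper };
}

console.log(prefixRange('user-2024')); // { '$gte': 'user-2024', '$lt': 'user-2025' }
```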

Combined Filters

Multiple conditions are combined with implicit AND:

filter: {
  category: 'documentation',     // AND
  language: 'en',                // AND
  published: true,               // AND
  published_at: { $gte: 1704067200 } // AND
}

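There is no OR operator across fields. $in covers alternatives within a single field; for OR logic across different fields, run multiple queries and merge the results.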
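When OR logic needs several queries, the result sets have to be merged and de-duplicated. This is a minimal sketch of that merge step, assuming each match has `id` and `score` fields and that a higher score is better (as with the cosine metric used in the setup example). `mergeMatches` is a hypothetical helper.

```javascript
// Hypothetical helper: merge matches from several queries, keep the best
// score per id, and return the overall top K (higher score = better).
function mergeMatches(resultSets, topK) {
  const best = new Map();
  for (const matches of resultSets) {
    for (const match of matches) {
      const seen = best.get(match.id);
      if (!seen || match.score > seen.score) best.set(match.id, match);
    }
  }
  return [...best.values()].sort((a, b) => b.score - a.score).slice(0, topK);
}

// Usage inside a Worker (not executed here):
//   const [featured, verified] = await Promise.all([
//     env.VECTORIZE_INDEX.query(queryVector, { topK: 5, filter: { featured: true } }),
//     env.VECTORIZE_INDEX.query(queryVector, { topK: 5, filter: { verified: true } }),
//   ]);
//   const merged = mergeMatches([featured.matches, verified.matches], 5);

const merged = mergeMatches(
  [
    [{ id: 'a', score: 0.9 }, { id: 'b', score: 0.5 }],
    [{ id: 'b', score: 0.5 }, { id: 'c', score: 0.7 }],
  ],
  2
);
console.log(merged.map((m) => m.id)); // [ 'a', 'c' ]
```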

Complex Examples

Multi-field with Ranges

filter: {
  category: { $in: ['docs', 'tutorials'] },
  language: 'en',
  status: { $ne: 'archived' },
  published_at: {
    $gte: 1704067200,
    $lt: 1735689600
  },
  'author.verified': true
}

Boolean and String

filter: {
  published: true,
  featured: false,
  category: 'documentation',
  language: { $in: ['en', 'es', 'fr'] }
}

Nested with Range

filter: {
  'metrics.views': { $gte: 1000 },
  'metrics.rating': { $gte: 4.5 },
  'author.verified': true,
  published_at: { $gt: Math.floor(Date.now() / 1000) - 86400 * 30 } // Last 30 days, whole seconds
}

Cardinality Considerations

Cardinality = Number of unique values in a field

Low Cardinality (Good for Filtering)

// Few unique values - efficient
category: 'docs' | 'tutorials' | 'guides'  // ~3-10 values
language: 'en' | 'es' | 'fr'                // ~5-20 values
published: true | false                      // 2 values

High Cardinality (Avoid in Range Queries)

// Many unique values - can impact performance
user_id: 'uuid-v4-...'           // Millions of unique values
timestamp_ms: 1704067200123      // Unique per millisecond
email: 'user@example.com'        // Unique per user

Performance Impact

Range queries on high-cardinality fields can be slow:

// ❌ Slow: High cardinality range
filter: {
  user_id: {  // Millions of unique UUIDs
    $gte: '00000000-0000-0000-0000-000000000000',
    $lt: 'zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz'
  }
}

// ✅ Better: bounded range on a lower-cardinality field
filter: {
  published_at: {  // Timestamps in seconds
    $gte: 1704067200,
    $lt: 1735689600
  }
}

Best Practices

Use seconds, not milliseconds for timestamps
Categorize high-cardinality fields (e.g., user → user_tier)
Limit range span to avoid scanning millions of values
Use $eq for high cardinality, not ranges
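The first two practices amount to coarsening values before they are stored. As a sketch, millisecond timestamps can be reduced to seconds, or further to the start of the UTC day, which cuts the number of distinct values considerably. Both helper names are hypothetical.

```javascript
// Hypothetical helpers for reducing timestamp cardinality before upsert.

// Milliseconds -> whole Unix seconds.
function toEpochSeconds(ms) {
  return Math.floor(ms / 1000);
}

// Milliseconds -> Unix seconds at 00:00 UTC of the same day
// (one distinct value per day instead of one per millisecond).
function toDayBucket(ms) {
  return Math.floor(ms / 86400000) * 86400;
}

const metadata = {
  published_at: toEpochSeconds(1704067200123),
  published_day: toDayBucket(1704110400000), // noon UTC on the same day
};
console.log(metadata); // { published_at: 1704067200, published_day: 1704067200 }
```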

Filter Size Limit

Max 2048 bytes (compact JSON representation)

// Check filter size in bytes (string length undercounts non-ASCII characters)
const filterBytes = new TextEncoder().encode(JSON.stringify(filter)).length;
if (filterBytes > 2048) {
  console.error('Filter too large!');
}

If filter is too large:

Split into multiple queries
Simplify conditions
Use namespace filtering first
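A long $in list is a common reason for hitting the limit. The sketch below splits such a list into several filters that each stay under 2048 bytes, so they can be sent as separate queries. `byteSize` and `splitInFilter` are hypothetical helpers; the byte count is taken from the UTF-8 encoding of `JSON.stringify`, which is assumed to match the compact JSON representation mentioned above.

```javascript
// Size of a filter in bytes, based on its compact JSON form.
function byteSize(filter) {
  return new TextEncoder().encode(JSON.stringify(filter)).length;
}

// Hypothetical helper: split `values` into chunks so that each
// { ...baseFilter, [field]: { $in: chunk } } stays within maxBytes.
// A single value that is too large on its own is still emitted alone.
function splitInFilter(baseFilter, field, values, maxBytes = 2048) {
  const filters = [];
  let chunk = [];
  for (const value of values) {
    const candidate = { ...baseFilter, [field]: { $in: [...chunk, value] } };
    if (chunk.length > 0 && byteSize(candidate) > maxBytes) {
      filters.push({ ...baseFilter, [field]: { $in: chunk } });
      chunk = [value];
    } else {
      chunk.push(value);
    }
  }
  if (chunk.length > 0) filters.push({ ...baseFilter, [field]: { $in: chunk } });
  return filters;
}

const ids = Array.from({ length: 300 }, (_, i) => `user-${String(i).padStart(4, '0')}`);
const parts = splitInFilter({ published: true }, 'user_id', ids);
console.log(parts.length, parts.map(byteSize));
```

Each resulting filter is an independent query, so the per-query results still need to be merged afterwards.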

Namespace vs Metadata Filtering

Namespace Filtering

// Insert with namespace
await env.VECTORIZE_INDEX.upsert([{
  id: 'doc-1',
  values: embedding,
  namespace: 'customer-abc123',  // Partition key
  metadata: { type: 'support' }
}]);

// Query with namespace (applied FIRST)
const results = await env.VECTORIZE_INDEX.query(queryVector, {
  namespace: 'customer-abc123',
  filter: { type: 'support' }
});

When to Use Each

Use Namespace	Use Metadata
Multi-tenant isolation	Fine-grained filtering
Customer segmentation	Category filtering
Environment (dev/prod)	Date ranges
Large partitions	Boolean flags
Applied BEFORE metadata	Applied AFTER namespace

Combined Strategy

// Namespace: Customer isolation
namespace: 'customer-abc123'

// Metadata: Detailed filtering
filter: {
  category: 'support_tickets',
  status: { $ne: 'closed' },
  priority: { $gte: 3 },
  created_at: { $gte: Math.floor(Date.now() / 1000) - 86400 * 7 } // Last 7 days, whole seconds
}
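Put together, the namespace and the metadata filter travel in the same query options object. The following sketch builds those options for the support-ticket case above; `supportTicketQuery` is a hypothetical helper, and the `topK` value and the `nowMs` parameter (added so the cutoff is testable) are illustrative choices.

```javascript
// Hypothetical helper: query options combining tenant isolation (namespace)
// with detailed metadata filtering.
function supportTicketQuery(namespace, days, nowMs = Date.now()) {
  const since = Math.floor(nowMs / 1000) - days * 86400;
  return {
    namespace,
    topK: 5,
    filter: {
      category: 'support_tickets',
      status: { $ne: 'closed' },
      priority: { $gte: 3 },
      created_at: { $gte: since },
    },
  };
}

// Usage inside a Worker (not executed here):
//   const results = await env.VECTORIZE_INDEX.query(
//     queryVector,
//     supportTicketQuery('customer-abc123', 7)
//   );

const options = supportTicketQuery('customer-abc123', 7, 1704672000000);
console.log(options.filter.created_at); // { '$gte': 1704067200 }
```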

Common Patterns

Published Content Only

filter: {
  published: true,
  status: { $ne: 'archived' }
}

Recent Documents

const oneWeekAgo = Math.floor(Date.now() / 1000) - (7 * 24 * 60 * 60);
filter: {
  published_at: { $gte: oneWeekAgo }
}

Multi-Language Support

filter: {
  language: { $in: ['en', 'es', 'fr'] },
  published: true
}

Verified Authors Only

filter: {
  'author.verified': true,
  'author.active': true
}

Time-Based Content

// Content from specific quarter
filter: {
  published_at: {
    $gte: 1704067200,  // Q1 2024 start
    $lt: 1711929600    // Q1 2024 end
  }
}

Debugging Filters

Test Filter Syntax

npx wrangler vectorize query docs-search \
  --vector="[0.1,0.2,...]" \
  --filter='{"category":"docs","published":true}' \
  --top-k=5

Check Metadata Indexes

npx wrangler vectorize list-metadata-index docs-search
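Once the indexed property names are known, a filter can be checked against them before it is sent. This is a small sketch under the assumption that the list of indexed properties is copied by hand from the command output; `unindexedProperties` is a hypothetical helper.

```javascript
// Hypothetical helper: return the filter properties that have no
// metadata index, given a hand-maintained list of indexed properties.
function unindexedProperties(filter, indexedProperties) {
  const indexed = new Set(indexedProperties);
  return Object.keys(filter).filter((property) => !indexed.has(property));
}

// Property names from the example setup.
const indexedProperties = ['category', 'published_at', 'verified'];

const missing = unindexedProperties(
  { category: 'docs', published: true },
  indexedProperties
);
console.log(missing); // [ 'published' ]
```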

Verify Metadata Structure

const vectors = await env.VECTORIZE_INDEX.getByIds(['doc-1']);
console.log(vectors[0]?.metadata); // undefined if 'doc-1' was not found

Error Messages

Error	Cause	Solution
"Metadata property not indexed"	No metadata index for property	Create metadata index
"Filter exceeds 2048 bytes"	Filter JSON too large	Simplify or split queries
"Invalid metadata key"	Key contains `.`, `"`, or starts with `$`	Rename metadata key
"Filter must be non-empty object"	Empty filter `{}`	Remove filter or add conditions
