Files
gh-jezweb-claude-skills-ski…/references/metadata-guide.md
2025-11-30 08:24:34 +08:00

9.1 KiB

Metadata Filtering Guide

Complete reference for metadata indexes and filtering in Vectorize.

Overview

Metadata allows you to:

  • Store additional data alongside vectors (up to 10 KiB per vector)
  • Filter query results based on metadata properties
  • Narrow search scope without re-indexing

Metadata Indexes

⚠️ CRITICAL: Metadata indexes MUST be created BEFORE inserting vectors!

Vectors inserted before a metadata index exists won't be filterable on that property.

Creating Metadata Indexes

npx wrangler vectorize create-metadata-index <index-name> \
  --property-name=<property> \
  --type=<type>

Types: string, number, boolean

Limits:

  • Max 10 metadata indexes per Vectorize index
  • String: First 64 bytes indexed (UTF-8 boundaries)
  • Number: Float64 precision
  • Boolean: true/false

Example Setup

# Create index
npx wrangler vectorize create docs-search --dimensions=768 --metric=cosine

# Create metadata indexes IMMEDIATELY
npx wrangler vectorize create-metadata-index docs-search \
  --property-name=category --type=string

npx wrangler vectorize create-metadata-index docs-search \
  --property-name=published_at --type=number

npx wrangler vectorize create-metadata-index docs-search \
  --property-name=verified --type=boolean

# Verify
npx wrangler vectorize list-metadata-index docs-search

Metadata Schema

Valid Metadata Keys

// ✅ Valid keys
metadata: {
  category: 'docs',
  title: 'Getting Started',
  published_at: 1704067200,
  verified: true,
  nested: { allowed: true }
}

// ❌ Invalid keys
metadata: {
  '': 'value',               // Empty key
  'user.name': 'John',       // Contains dot (reserved for nesting)
  '$admin': true,            // Starts with $
  'key"quoted': 1            // Contains "
}

Key restrictions:

  • Cannot be empty
  • Cannot contain . (dot) - reserved for nested access
  • Cannot contain " (double quote)
  • Cannot start with $ (dollar sign)
  • Max 512 characters

Nested Metadata

Use dot notation for nested properties:

// Store nested metadata
metadata: {
  author: {
    id: 'user123',
    name: 'John Doe',
    verified: true
  }
}

// Filter with dot notation
filter: { 'author.verified': true }

// Create index for nested property
npx wrangler vectorize create-metadata-index docs-search \
  --property-name=author_verified \
  --type=boolean

Filter Operators

Equality

// Implicit $eq
filter: { category: 'documentation' }

// Explicit $eq
filter: { category: { $eq: 'documentation' } }

Not Equals

filter: { status: { $ne: 'archived' } }

In Array

filter: { category: { $in: ['docs', 'tutorials', 'guides'] } }

Not In Array

filter: { status: { $nin: ['archived', 'draft', 'deleted'] } }

Less Than

filter: { published_at: { $lt: 1735689600 } }

Less Than or Equal

filter: { priority: { $lte: 5 } }

Greater Than

filter: { published_at: { $gt: 1704067200 } }

Greater Than or Equal

filter: { score: { $gte: 0.8 } }

Range Queries

Number Ranges

// Documents published in 2024
filter: {
  published_at: {
    $gte: 1704067200,  // >= Jan 1, 2024
    $lt: 1735689600    // < Jan 1, 2025
  }
}

// Scores between 0.7 and 0.9
filter: {
  quality_score: {
    $gte: 0.7,
    $lte: 0.9
  }
}
// URLs starting with /docs/workers/
filter: {
  url: {
    $gte: '/docs/workers/',
    $lt: '/docs/workersz'  // 'z' after all possible chars
  }
}

// IDs starting with 'user-2024'
filter: {
  id: {
    $gte: 'user-2024',
    $lt: 'user-2025'
  }
}

Combined Filters

Multiple conditions are combined with implicit AND:

filter: {
  category: 'documentation',     // AND
  language: 'en',                // AND
  published: true,               // AND
  published_at: { $gte: 1704067200 } // AND
}

No OR operator - for OR logic, make multiple queries.

Complex Examples

Multi-field with Ranges

filter: {
  category: { $in: ['docs', 'tutorials'] },
  language: 'en',
  status: { $ne: 'archived' },
  published_at: {
    $gte: 1704067200,
    $lt: 1735689600
  },
  'author.verified': true
}

Boolean and String

filter: {
  published: true,
  featured: false,
  category: 'documentation',
  language: { $in: ['en', 'es', 'fr'] }
}

Nested with Range

filter: {
  'metrics.views': { $gte: 1000 },
  'metrics.rating': { $gte: 4.5 },
  'author.verified': true,
  published_at: { $gt: Date.now() / 1000 - 86400 * 30 } // Last 30 days
}

Cardinality Considerations

Cardinality = Number of unique values in a field

Low Cardinality (Good for Filtering)

// Few unique values - efficient
category: 'docs' | 'tutorials' | 'guides'  // ~3-10 values
language: 'en' | 'es' | 'fr'                // ~5-20 values
published: true | false                      // 2 values

High Cardinality (Avoid in Range Queries)

// Many unique values - can impact performance
user_id: 'uuid-v4-...'           // Millions of unique values
timestamp_ms: 1704067200123      // Unique per millisecond
email: 'user@example.com'        // Unique per user

Performance Impact

Range queries on high-cardinality fields can be slow:

// ❌ Slow: High cardinality range
filter: {
  user_id: {  // Millions of unique UUIDs
    $gte: '00000000-0000-0000-0000-000000000000',
    $lt: 'zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz'
  }
}

// ✅ Better: Low cardinality range
filter: {
  published_at: {  // Timestamps in seconds
    $gte: 1704067200,
    $lt: 1735689600
  }
}

Best Practices

  1. Use seconds, not milliseconds for timestamps
  2. Categorize high-cardinality fields (e.g., user → user_tier)
  3. Limit range span to avoid scanning millions of values
  4. Use $eq for high cardinality, not ranges

Filter Size Limit

Max 2048 bytes (compact JSON representation)

// Check filter size
const filterString = JSON.stringify(filter);
if (filterString.length > 2048) {
  console.error('Filter too large!');
}

If filter is too large:

  • Split into multiple queries
  • Simplify conditions
  • Use namespace filtering first

Namespace vs Metadata Filtering

Namespace Filtering

// Insert with namespace
await env.VECTORIZE_INDEX.upsert([{
  id: 'doc-1',
  values: embedding,
  namespace: 'customer-abc123',  // Partition key
  metadata: { type: 'support' }
}]);

// Query with namespace (applied FIRST)
const results = await env.VECTORIZE_INDEX.query(queryVector, {
  namespace: 'customer-abc123',
  filter: { type: 'support' }
});

When to Use Each

Use Namespace Use Metadata
Multi-tenant isolation Fine-grained filtering
Customer segmentation Category filtering
Environment (dev/prod) Date ranges
Large partitions Boolean flags
Applied BEFORE metadata Applied AFTER namespace

Combined Strategy

// Namespace: Customer isolation
namespace: 'customer-abc123'

// Metadata: Detailed filtering
filter: {
  category: 'support_tickets',
  status: { $ne: 'closed' },
  priority: { $gte: 3 },
  created_at: { $gte: Date.now() / 1000 - 86400 * 7 } // Last 7 days
}

Common Patterns

Published Content Only

filter: {
  published: true,
  status: { $ne: 'archived' }
}

Recent Documents

const oneWeekAgo = Math.floor(Date.now() / 1000) - (7 * 24 * 60 * 60);
filter: {
  published_at: { $gte: oneWeekAgo }
}

Multi-Language Support

filter: {
  language: { $in: ['en', 'es', 'fr'] },
  published: true
}

Verified Authors Only

filter: {
  'author.verified': true,
  'author.active': true
}

Time-Based Content

// Content from specific quarter
filter: {
  published_at: {
    $gte: 1704067200,  // Q1 2024 start
    $lt: 1711929600    // Q1 2024 end
  }
}

Debugging Filters

Test Filter Syntax

npx wrangler vectorize query docs-search \
  --vector="[0.1,0.2,...]" \
  --filter='{"category":"docs","published":true}' \
  --top-k=5

Check Metadata Indexes

npx wrangler vectorize list-metadata-index docs-search

Verify Metadata Structure

const vectors = await env.VECTORIZE_INDEX.getByIds(['doc-1']);
console.log(vectors[0].metadata);

Error Messages

Error Cause Solution
"Metadata property not indexed" No metadata index for property Create metadata index
"Filter exceeds 2048 bytes" Filter JSON too large Simplify or split queries
"Invalid metadata key" Key contains ., ", or starts with $ Rename metadata key
"Filter must be non-empty object" Empty filter {} Remove filter or add conditions

See Also