Initial commit

Zhongwei Li
2025-11-30 08:44:54 +08:00
commit eb309b7b59
133 changed files with 21979 additions and 0 deletions

@@ -0,0 +1,60 @@
---
slug: /pyseekdb-sdk-get-started
---
# Get started
## pyseekdb
pyseekdb is a Python client provided by OceanBase Database. It lets you connect to seekdb in embedded mode or in remote mode; in remote mode, it can connect to seekdb running in server mode or to OceanBase Database.
:::tip
OceanBase Database is a fully self-developed, enterprise-level, native distributed database provided by OceanBase. It achieves financial-grade high availability on ordinary hardware and sets a new standard for automatic, lossless disaster recovery across cities with the "five IDCs across three regions" architecture. It also sets a new benchmark in the TPC-C standard test, with a single cluster scale exceeding 1,500 nodes. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.
:::
pyseekdb is supported on Linux, macOS, and Windows. The supported database connection modes vary by operating system. For more information, see the table below.
| System | Embedded seekdb | Server mode seekdb | Server mode OceanBase Database |
|----|---|---|---|
| Linux | Supported | Supported | Supported |
| macOS | Not supported | Supported | Supported |
| Windows | Not supported | Supported | Supported |
On Linux, installing this client also installs seekdb in embedded mode, so you can connect to it directly and perform operations such as creating a database. Alternatively, you can connect to a deployed seekdb or OceanBase Database in client/server mode.
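The three connection modes correspond to different `Client()` arguments. The following is a minimal sketch based on the connection parameters used in the samples later in this documentation; the host, port, user, and password values are placeholders for your own deployment.
```python
import pyseekdb

# Embedded mode (Linux only): no connection parameters needed
client = pyseekdb.Client()

# Server mode: connect to a deployed seekdb server
# client = pyseekdb.Client(host="127.0.0.1", port=2881, database="test",
#                          user="root", password="")

# Server mode against OceanBase Database: additionally pass the tenant name
# client = pyseekdb.Client(host="127.0.0.1", port=2881, tenant="test",
#                          database="test", user="root", password="")
```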
## Install pyseekdb
### Prerequisites
Make sure that your environment meets the following requirements:
* Operating system: Linux (glibc >= 2.28), macOS, or Windows
* Python version: Python 3.11 or later
* System architecture: x86_64 or aarch64
### Procedure
Use pip to install pyseekdb. pip automatically detects the default Python version and platform.
```shell
pip install pyseekdb
```
If your pip version is outdated, upgrade it before installation.
```bash
pip install --upgrade pip
```
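To confirm that the installation succeeded, you can check the installed package and import it. This is a quick sanity check, not part of the official procedure.
```shell
pip show pyseekdb
python -c "import pyseekdb; print('pyseekdb imported successfully')"
```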
## What to do next
* After installing pyseekdb, you can connect to seekdb to perform operations. For information about the APIs supported by pyseekdb, see [API Reference](../50.apis/10.api-overview.md).
* You can also work through the provided SDK samples to quickly try out pyseekdb.
* [Simple sample](50.sdk-samples/10.pyseekdb-simple-sample.md)
* [Complete sample](50.sdk-samples/50.pyseekdb-complete-sample.md)
* [Hybrid search sample](50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)

@@ -0,0 +1,130 @@
---
slug: /pyseekdb-simple-sample
---
# Simple Example
This example demonstrates the basic usage of embedding functions with seekdb in embedded mode, helping you understand how embedding functions work. The example performs the following steps:
1. Connect to seekdb.
2. Create a collection with Embedding Functions.
3. Add data using documents (vectors will be automatically generated).
4. Query using texts (vectors will be automatically generated).
5. Print the query results.
## Prerequisites
This example uses seekdb in embedded mode. Before running this example, make sure that you have deployed seekdb in embedded mode.
For information about how to deploy seekdb in embedded mode, see [Embedded Mode](../../../../400.guides/400.deploy/600.python-seekdb.md).
## Example
```python
import pyseekdb
# ==================== Step 1: Create Client Connection ====================
# You can use embedded mode, server mode, or OceanBase mode
# Embedded mode (local SeekDB)
client = pyseekdb.Client()
# Alternative: Server mode (connecting to remote SeekDB server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# database="test",
# user="root",
# password=""
# )
# Alternative: Remote server mode (OceanBase Server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# tenant="test", # OceanBase default tenant
# database="test",
# user="root",
# password=""
# )
# ==================== Step 2: Create a Collection with Embedding Function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"
# Create collection with default embedding function
# The embedding function will automatically convert documents to embeddings
collection = client.create_collection(
name=collection_name,
)
print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")
# ==================== Step 3: Add Data to Collection ====================
# With embedding function, you can add documents directly without providing embeddings
# The embedding function will automatically generate embeddings from documents
documents = [
"Machine learning is a subset of artificial intelligence",
"Python is a popular programming language",
"Vector databases enable semantic search",
"Neural networks are inspired by the human brain",
"Natural language processing helps computers understand text"
]
ids = ["id1", "id2", "id3", "id4", "id5"]
# Add data with documents only - embeddings will be auto-generated by embedding function
collection.add(
ids=ids,
documents=documents, # embeddings will be automatically generated
metadatas=[
{"category": "AI", "index": 0},
{"category": "Programming", "index": 1},
{"category": "Database", "index": 2},
{"category": "AI", "index": 3},
{"category": "NLP", "index": 4}
]
)
print(f"\nAdded {len(documents)} documents to collection")
print("Note: Embeddings were automatically generated from documents using the embedding function")
# ==================== Step 4: Query the Collection ====================
# With embedding function, you can query using text directly
# The embedding function will automatically convert query text to query vector
# Query using text - query vector will be auto-generated by embedding function
query_text = "artificial intelligence and machine learning"
results = collection.query(
query_texts=query_text, # Query text - will be embedded automatically
n_results=3 # Return top 3 most similar documents
)
print(f"\nQuery: '{query_text}'")
print(f"Query results: {len(results['ids'][0])} items found")
# ==================== Step 5: Print Query Results ====================
for i in range(len(results['ids'][0])):
print(f"\nResult {i+1}:")
print(f" ID: {results['ids'][0][i]}")
print(f" Distance: {results['distances'][0][i]:.4f}")
if results.get('documents'):
print(f" Document: {results['documents'][0][i]}")
if results.get('metadatas'):
print(f" Metadata: {results['metadatas'][0][i]}")
# ==================== Step 6: Cleanup ====================
# Delete the collection
client.delete_collection(collection_name)
print(f"\nDeleted collection '{collection_name}'")
```
## References
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Complete Example](50.pyseekdb-complete-sample.md)
* [Hybrid Search Example](100.pyseekdb-hybrid-search-sample.md)

@@ -0,0 +1,350 @@
---
slug: /pyseekdb-hybrid-search-sample
---
# Hybrid search example
This example demonstrates the advantages of `hybrid_search()` over `query()`. The main advantages are:
* Supports full-text search and vector similarity search simultaneously.
* Allows separate filtering conditions for the full-text and vector searches.
* Combines the ranked results of both searches with the Reciprocal Rank Fusion (RRF) algorithm to improve relevance (a sketch of RRF follows this list).
* Handles complex scenarios that `query()` cannot cover.
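Reciprocal Rank Fusion scores each document by summing `1 / (k + rank)` over every result list it appears in, where `rank` is the document's position in that list and `k` is a smoothing constant (60 is a common default; the constant seekdb uses internally is not specified here). The following is a minimal, self-contained sketch of the idea, not the actual seekdb implementation.
```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each list is ordered best-first; a document's fused score is the sum of
    1 / (k + rank) over every list in which it appears (rank starts at 1).
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort document IDs by fused score, highest first
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a full-text ranking and a vector-similarity ranking
fulltext_ranking = ["doc_1", "doc_5", "doc_2"]
vector_ranking = ["doc_5", "doc_3", "doc_1"]
print(rrf_fuse([fulltext_ranking, vector_ranking]))
# ['doc_5', 'doc_1', 'doc_3', 'doc_2'] -- documents in both lists rank highest
```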
## Example
```python
import pyseekdb
# Setup
client = pyseekdb.Client()
collection = client.get_or_create_collection(
name="hybrid_search_demo"
)
# Sample data
documents = [
"Machine learning is revolutionizing artificial intelligence and data science",
"Python programming language is essential for machine learning developers",
"Deep learning neural networks enable advanced AI applications",
"Data science combines statistics, programming, and domain expertise",
"Natural language processing uses machine learning to understand text",
"Computer vision algorithms process images using deep learning techniques",
"Reinforcement learning trains agents through reward-based feedback",
"Python libraries like TensorFlow and PyTorch simplify machine learning",
"Artificial intelligence systems can learn from large datasets",
"Neural networks mimic the structure of biological brain connections"
]
metadatas = [
{"category": "AI", "topic": "machine learning", "year": 2023, "popularity": 95},
{"category": "Programming", "topic": "python", "year": 2023, "popularity": 88},
{"category": "AI", "topic": "deep learning", "year": 2024, "popularity": 92},
{"category": "Data Science", "topic": "data analysis", "year": 2023, "popularity": 85},
{"category": "AI", "topic": "nlp", "year": 2024, "popularity": 90},
{"category": "AI", "topic": "computer vision", "year": 2023, "popularity": 87},
{"category": "AI", "topic": "reinforcement learning", "year": 2024, "popularity": 89},
{"category": "Programming", "topic": "python", "year": 2023, "popularity": 91},
{"category": "AI", "topic": "general ai", "year": 2023, "popularity": 93},
{"category": "AI", "topic": "neural networks", "year": 2024, "popularity": 94}
]
ids = [f"doc_{i+1}" for i in range(len(documents))]
collection.add(ids=ids, documents=documents, metadatas=metadatas)
print("=" * 100)
print("SCENARIO 1: Keyword + Semantic Search")
print("=" * 100)
print("Goal: Find documents similar to 'AI research' AND containing 'machine learning'\n")
# query() approach
query_result1 = collection.query(
query_texts=["AI research"],
where_document={"$contains": "machine learning"},
n_results=5
)
# hybrid_search() approach
hybrid_result1 = collection.hybrid_search(
query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
knn={"query_texts": ["AI research"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result1['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result1['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" query() ranks 'Deep learning neural networks...' first because it's semantically similar to 'AI research',")
print(" but 'machine learning' is not its primary focus. hybrid_search() correctly prioritizes documents that")
print(" explicitly contain 'machine learning' (from full-text search) while also being semantically relevant")
print(" to 'AI research' (from vector search). The RRF fusion ensures documents matching both criteria rank higher.")
print("\n" + "=" * 100)
print("SCENARIO 2: Independent Filters for Different Search Types")
print("=" * 100)
print("Goal: Full-text='neural' (year=2024) + Vector='deep learning' (popularity>=90)\n")
# query() - same filter applies to both conditions
query_result2 = collection.query(
query_texts=["deep learning"],
where={"year": {"$eq": 2024}, "popularity": {"$gte": 90}},
where_document={"$contains": "neural"},
n_results=5
)
# hybrid_search() - different filters for each search type
hybrid_result2 = collection.hybrid_search(
query={"where_document": {"$contains": "neural"}, "where": {"year": {"$eq": 2024}}, "n_results": 10},
knn={"query_texts": ["deep learning"], "where": {"popularity": {"$gte": 90}}, "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results (same filter for both):")
for i, doc_id in enumerate(query_result2['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nhybrid_search() Results (independent filters):")
for i, doc_id in enumerate(hybrid_result2['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nAnalysis:")
print(" query() only returns 2 results because it requires documents to satisfy BOTH year=2024 AND popularity>=90")
print(" simultaneously. hybrid_search() returns 5 results by applying year=2024 filter to full-text search")
print(" and popularity>=90 filter to vector search independently, then fusing the results. This approach")
print(" captures more relevant documents that might satisfy one criterion strongly while meeting the other")
print("\n" + "=" * 100)
print("SCENARIO 3: Combining Multiple Search Strategies")
print("=" * 100)
print("Goal: Find documents about 'machine learning algorithms'\n")
# query() - vector search only
query_result3 = collection.query(
query_texts=["machine learning algorithms"],
n_results=5
)
# hybrid_search() - combines full-text and vector
hybrid_result3 = collection.hybrid_search(
query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
knn={"query_texts": ["machine learning algorithms"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results (vector similarity only):")
for i, doc_id in enumerate(query_result3['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results (full-text + vector fusion):")
for i, doc_id in enumerate(hybrid_result3['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" query() returns 'Artificial intelligence systems...' as the result, which doesn't explicitly")
print(" mention 'machine learning'. hybrid_search() combines full-text search (for 'machine learning')")
print(" with vector search (for semantic similarity to 'machine learning algorithms'), ensuring that")
print(" documents containing the exact keyword rank higher while still capturing semantically relevant content.")
print("\n" + "=" * 100)
print("SCENARIO 4: Complex Multi-Criteria Search")
print("=" * 100)
print("Goal: Full-text='learning' (category=AI) + Vector='artificial intelligence' (year>=2023)\n")
# query() - limited to single search with combined filters
query_result4 = collection.query(
query_texts=["artificial intelligence"],
where={"category": {"$eq": "AI"}, "year": {"$gte": 2023}},
where_document={"$contains": "learning"},
n_results=5
)
# hybrid_search() - separate criteria for each search type
hybrid_result4 = collection.hybrid_search(
query={"where_document": {"$contains": "learning"}, "where": {"category": {"$eq": "AI"}}, "n_results": 10},
knn={"query_texts": ["artificial intelligence"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result4['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result4['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nAnalysis:")
print(" While both methods return similar documents, hybrid_search() provides better ranking by prioritizing")
print(" documents that score highly in both full-text search (containing 'learning' with category=AI) and")
print(" vector search (semantically similar to 'artificial intelligence' with year>=2023). The RRF fusion")
print(" algorithm ensures that 'Deep learning neural networks...' ranks first because it strongly matches")
print(" both search criteria, whereas query() applies filters sequentially which may not optimize ranking.")
print("\n" + "=" * 100)
print("SCENARIO 5: Result Quality - RRF Fusion")
print("=" * 100)
print("Goal: Search for 'Python machine learning'\n")
# query() - single ranking
query_result5 = collection.query(
query_texts=["Python machine learning"],
n_results=5
)
# hybrid_search() - RRF fusion of multiple rankings
hybrid_result5 = collection.hybrid_search(
query={"where_document": {"$contains": "Python"}, "n_results": 10},
knn={"query_texts": ["Python machine learning"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results (single ranking):")
for i, doc_id in enumerate(query_result5['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results (RRF fusion):")
for i, doc_id in enumerate(hybrid_result5['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" Both methods return identical results in this case, but hybrid_search() achieves this through RRF")
print(" (Reciprocal Rank Fusion) which combines rankings from full-text search (for 'Python') and vector")
print(" search (for 'Python machine learning'). RRF provides more stable and robust ranking by considering")
print(" multiple signals, making it less sensitive to variations in individual search algorithms and ensuring")
print(" consistent high-quality results across different query formulations.")
print("\n" + "=" * 100)
print("SCENARIO 6: Different Filter Criteria for Each Search")
print("=" * 100)
print("Goal: Full-text='neural' (high popularity) + Vector='deep learning' (recent year)\n")
# query() - cannot separate filters for keyword vs semantic
query_result6 = collection.query(
query_texts=["deep learning"],
where={"popularity": {"$gte": 90}, "year": {"$gte": 2023}},
where_document={"$contains": "neural"},
n_results=5
)
# hybrid_search() - different filters for keyword search vs semantic search
hybrid_result6 = collection.hybrid_search(
query={"where_document": {"$contains": "neural"}, "where": {"popularity": {"$gte": 90}}, "n_results": 10},
knn={"query_texts": ["deep learning"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result6['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result6['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nAnalysis:")
print(" query() only returns 2 results because it requires documents to satisfy BOTH popularity>=90 AND")
print(" year>=2023 simultaneously, along with containing 'neural' and being semantically similar to")
print(" 'deep learning'. hybrid_search() returns 5 results by applying popularity>=90 filter to full-text")
print(" search (for 'neural') and year>=2023 filter to vector search (for 'deep learning') independently.")
print(" The fusion then combines results from both searches, capturing documents that strongly match either")
print(" criterion while still being relevant to the overall query intent.")
print("\n" + "=" * 100)
print("SCENARIO 7: Partial Keyword Match + Semantic Similarity")
print("=" * 100)
print("Goal: Documents containing 'Python' + Semantically similar to 'data science'\n")
# query() - filter applied after vector search
query_result7 = collection.query(
query_texts=["data science"],
where_document={"$contains": "Python"},
n_results=5
)
# hybrid_search() - parallel searches then fusion
hybrid_result7 = collection.hybrid_search(
query={"where_document": {"$contains": "Python"}, "n_results": 10},
knn={"query_texts": ["data science"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result7['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result7['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" query() only returns 2 results because it first performs vector search for 'data science', then")
print(" filters to documents containing 'Python', which severely limits the result set. hybrid_search()")
print(" returns 5 results by running full-text search (for 'Python') and vector search (for 'data science')")
print(" in parallel, then fusing the results. This captures documents that contain 'Python' (even if not")
print(" semantically closest to 'data science') and documents semantically similar to 'data science' (even")
print(" if they don't contain 'Python'), providing better recall and more comprehensive results.")
print("\n" + "=" * 100)
print("SUMMARY")
print("=" * 100)
print("""
query() limitations:
- Single search type (vector similarity)
- Filters applied after search (may miss relevant docs)
- Cannot combine full-text and vector search results
- Same filter criteria for all conditions
hybrid_search() advantages:
- Simultaneous full-text + vector search
- Independent filters for each search type
- Intelligent result fusion using RRF
- Better recall for complex queries
- Handles scenarios requiring both keyword and semantic matching
""")
```
## References
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Simple example](10.pyseekdb-simple-sample.md)
* [Complete example](50.pyseekdb-complete-sample.md)

@@ -0,0 +1,440 @@
---
slug: /pyseekdb-complete-sample
---
# Complete Example
This example demonstrates the full capabilities of pyseekdb.
The example includes the following operations:
1. Connection, including all connection modes
2. Collection management
3. DML operations, including add, update, upsert, and delete
4. DQL operations, including query, get, and hybrid_search
5. Filter operators
6. Collection information methods
## Example
```python
import uuid
import random
import pyseekdb
# ============================================================================
# PART 1: CLIENT CONNECTION
# ============================================================================
# Option 1: Embedded mode (local SeekDB)
client = pyseekdb.Client(
#path="./seekdb",
#database="test"
)
# Option 2: Server mode (remote SeekDB server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# database="test",
# user="root",
# password=""
# )
# Option 3: Remote server mode (OceanBase Server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# tenant="test", # OceanBase default tenant
# database="test",
# user="root",
# password=""
# )
# ============================================================================
# PART 2: COLLECTION MANAGEMENT
# ============================================================================
collection_name = "comprehensive_example"
dimension = 128
# 2.1 Create a collection
from pyseekdb import HNSWConfiguration
config = HNSWConfiguration(dimension=dimension, distance='cosine')
collection = client.get_or_create_collection(
name=collection_name,
configuration=config,
embedding_function=None # Explicitly set to None since we're using custom 128-dim embeddings
)
# 2.2 Check if collection exists
exists = client.has_collection(collection_name)
# 2.3 Get collection object
retrieved_collection = client.get_collection(collection_name, embedding_function=None)
# 2.4 List all collections
all_collections = client.list_collections()
# 2.5 Get or create collection (creates if doesn't exist)
config2 = HNSWConfiguration(dimension=64, distance='cosine')
collection2 = client.get_or_create_collection(
name="another_collection",
configuration=config2,
embedding_function=None # Explicitly set to None since we're using custom 64-dim embeddings
)
# ============================================================================
# PART 3: DML OPERATIONS - ADD DATA
# ============================================================================
# Generate sample data
random.seed(42)
documents = [
"Machine learning is transforming the way we solve problems",
"Python programming language is widely used in data science",
"Vector databases enable efficient similarity search",
"Neural networks mimic the structure of the human brain",
"Natural language processing helps computers understand human language",
"Deep learning requires large amounts of training data",
"Reinforcement learning agents learn through trial and error",
"Computer vision enables machines to interpret visual information"
]
# Generate embeddings (in real usage, use an embedding model)
embeddings = []
for i in range(len(documents)):
vector = [random.random() for _ in range(dimension)]
embeddings.append(vector)
ids = [str(uuid.uuid4()) for _ in documents]
# 3.1 Add single item
single_id = str(uuid.uuid4())
collection.add(
ids=single_id,
documents="This is a single document",
embeddings=[random.random() for _ in range(dimension)],
metadatas={"type": "single", "category": "test"}
)
# 3.2 Add multiple items
collection.add(
ids=ids,
documents=documents,
embeddings=embeddings,
metadatas=[
{"category": "AI", "score": 95, "tag": "ml", "year": 2023},
{"category": "Programming", "score": 88, "tag": "python", "year": 2022},
{"category": "Database", "score": 92, "tag": "vector", "year": 2023},
{"category": "AI", "score": 90, "tag": "neural", "year": 2022},
{"category": "NLP", "score": 87, "tag": "language", "year": 2023},
{"category": "AI", "score": 93, "tag": "deep", "year": 2023},
{"category": "AI", "score": 85, "tag": "reinforcement", "year": 2022},
{"category": "CV", "score": 91, "tag": "vision", "year": 2023}
]
)
# 3.3 Add with only embeddings (no documents)
vector_only_ids = [str(uuid.uuid4()) for _ in range(2)]
collection.add(
ids=vector_only_ids,
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
metadatas=[{"type": "vector_only"}, {"type": "vector_only"}]
)
# ============================================================================
# PART 4: DML OPERATIONS - UPDATE DATA
# ============================================================================
# 4.1 Update single item
collection.update(
ids=ids[0],
metadatas={"category": "AI", "score": 98, "tag": "ml", "year": 2024, "updated": True}
)
# 4.2 Update multiple items
collection.update(
ids=ids[1:3],
documents=["Updated document 1", "Updated document 2"],
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
metadatas=[
{"category": "Programming", "score": 95, "updated": True},
{"category": "Database", "score": 97, "updated": True}
]
)
# 4.3 Update embeddings
new_embeddings = [[random.random() for _ in range(dimension)] for _ in range(2)]
collection.update(
ids=ids[2:4],
embeddings=new_embeddings
)
# ============================================================================
# PART 5: DML OPERATIONS - UPSERT DATA
# ============================================================================
# 5.1 Upsert existing item (will update)
collection.upsert(
ids=ids[0],
documents="Upserted document (was updated)",
embeddings=[random.random() for _ in range(dimension)],
metadatas={"category": "AI", "upserted": True}
)
# 5.2 Upsert new item (will insert)
new_id = str(uuid.uuid4())
collection.upsert(
ids=new_id,
documents="This is a new document from upsert",
embeddings=[random.random() for _ in range(dimension)],
metadatas={"category": "New", "upserted": True}
)
# 5.3 Upsert multiple items
upsert_ids = [ids[4], str(uuid.uuid4())] # One existing, one new
collection.upsert(
ids=upsert_ids,
documents=["Upserted doc 1", "Upserted doc 2"],
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
metadatas=[{"upserted": True}, {"upserted": True}]
)
# ============================================================================
# PART 6: DQL OPERATIONS - QUERY (VECTOR SIMILARITY SEARCH)
# ============================================================================
# 6.1 Basic vector similarity query
query_vector = embeddings[0] # Query with first document's vector
results = collection.query(
query_embeddings=query_vector,
n_results=3
)
print(f"Query results: {len(results['ids'][0])} items")
# 6.2 Query with metadata filter (simplified equality)
results = collection.query(
query_embeddings=query_vector,
where={"category": "AI"},
n_results=5
)
# 6.3 Query with comparison operators
results = collection.query(
query_embeddings=query_vector,
where={"score": {"$gte": 90}},
n_results=5
)
# 6.4 Query with $in operator
results = collection.query(
query_embeddings=query_vector,
where={"tag": {"$in": ["ml", "python", "neural"]}},
n_results=5
)
# 6.5 Query with logical operators ($or) - simplified equality
results = collection.query(
query_embeddings=query_vector,
where={
"$or": [
{"category": "AI"},
{"tag": "python"}
]
},
n_results=5
)
# 6.6 Query with logical operators ($and) - simplified equality
results = collection.query(
query_embeddings=query_vector,
where={
"$and": [
{"category": "AI"},
{"score": {"$gte": 90}}
]
},
n_results=5
)
# 6.7 Query with document filter
results = collection.query(
query_embeddings=query_vector,
where_document={"$contains": "machine learning"},
n_results=5
)
# 6.8 Query with combined filters (simplified equality)
results = collection.query(
query_embeddings=query_vector,
where={"category": "AI", "year": {"$gte": 2023}},
where_document={"$contains": "learning"},
n_results=5
)
# 6.9 Query with multiple embeddings (batch query)
batch_embeddings = [embeddings[0], embeddings[1]]
batch_results = collection.query(
query_embeddings=batch_embeddings,
n_results=2
)
# batch_results["ids"][0] contains results for first query
# batch_results["ids"][1] contains results for second query
# 6.10 Query with specific fields
results = collection.query(
query_embeddings=query_vector,
include=["documents", "metadatas", "embeddings"],
n_results=2
)
# ============================================================================
# PART 7: DQL OPERATIONS - GET (RETRIEVE BY IDS OR FILTERS)
# ============================================================================
# 7.1 Get by single ID
result = collection.get(ids=ids[0])
# result["ids"] contains [ids[0]]
# result["documents"] contains document for ids[0]
# 7.2 Get by multiple IDs
results = collection.get(ids=ids[:3])
# results["ids"] contains ids[:3]
# results["documents"] contains documents for all IDs
# 7.3 Get by metadata filter (simplified equality)
results = collection.get(
where={"category": "AI"},
limit=5
)
# 7.4 Get with comparison operators
results = collection.get(
where={"score": {"$gte": 90}},
limit=5
)
# 7.5 Get with $in operator
results = collection.get(
where={"tag": {"$in": ["ml", "python"]}},
limit=5
)
# 7.6 Get with logical operators (simplified equality)
results = collection.get(
where={
"$or": [
{"category": "AI"},
{"category": "Programming"}
]
},
limit=5
)
# 7.7 Get by document filter
results = collection.get(
where_document={"$contains": "Python"},
limit=5
)
# 7.8 Get with pagination
results_page1 = collection.get(limit=2, offset=0)
results_page2 = collection.get(limit=2, offset=2)
# 7.9 Get with specific fields
results = collection.get(
ids=ids[:2],
include=["documents", "metadatas", "embeddings"]
)
# 7.10 Get all data
all_results = collection.get(limit=100)
# ============================================================================
# PART 8: DQL OPERATIONS - HYBRID SEARCH
# ============================================================================
# 8.1 Hybrid search with full-text and vector search
# Note: This requires query_embeddings to be provided directly
# In real usage, you might have an embedding function
hybrid_results = collection.hybrid_search(
query={
"where_document": {"$contains": "machine learning"},
"where": {"category": "AI"}, # Simplified equality
"n_results": 10
},
knn={
"query_embeddings": [embeddings[0]],
"where": {"year": {"$gte": 2022}},
"n_results": 10
},
rank={"rrf": {}}, # Reciprocal Rank Fusion
n_results=5,
include=["documents", "metadatas"]
)
# hybrid_results["ids"][0] contains IDs for the hybrid search
# hybrid_results["documents"][0] contains documents for the hybrid search
print(f"Hybrid search: {len(hybrid_results.get('ids', [[]])[0])} results")
# ============================================================================
# PART 9: DML OPERATIONS - DELETE DATA
# ============================================================================
# 9.1 Delete by IDs
delete_ids = [vector_only_ids[0], new_id]
collection.delete(ids=delete_ids)
# 9.2 Delete by metadata filter
collection.delete(where={"type": {"$eq": "vector_only"}})
# 9.3 Delete by document filter
collection.delete(where_document={"$contains": "Updated document"})
# 9.4 Delete with combined filters
collection.delete(
where={"category": {"$eq": "CV"}},
where_document={"$contains": "vision"}
)
# ============================================================================
# PART 10: COLLECTION INFORMATION
# ============================================================================
# 10.1 Get collection count
count = collection.count()
print(f"Collection count: {count} items")
# 10.2 Preview first few items in collection (returns all columns by default)
preview = collection.peek(limit=5)
print(f"Preview: {len(preview['ids'])} items")
for i in range(len(preview['ids'])):
print(f" ID: {preview['ids'][i]}, Document: {preview['documents'][i]}")
print(f" Metadata: {preview['metadatas'][i]}, Embedding dim: {len(preview['embeddings'][i]) if preview['embeddings'][i] else 0}")
# 10.3 Count collections in database
collection_count = client.count_collection()
print(f"Database has {collection_count} collections")
# ============================================================================
# PART 11: CLEANUP
# ============================================================================
# Delete test collections
try:
client.delete_collection("another_collection")
except Exception as e:
print(f"Could not delete 'another_collection': {e}")
# Delete the main collection
client.delete_collection(collection_name)
```
## References
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Simple Example](../50.sdk-samples/10.pyseekdb-simple-sample.md)
* [Hybrid Search Example](../50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)