Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:44:54 +08:00
commit eb309b7b59
133 changed files with 21979 additions and 0 deletions


@@ -0,0 +1,60 @@
---
slug: /pyseekdb-sdk-get-started
---
# Get started
## pyseekdb
pyseekdb is a Python client provided by OceanBase Database. It can connect to seekdb in embedded mode, or to a deployed seekdb server or OceanBase Database in remote (client/server) mode.
:::tip
OceanBase Database is a fully self-developed, enterprise-level, native distributed database provided by OceanBase. It achieves financial-grade high availability on ordinary hardware and sets a new standard for automatic, lossless disaster recovery across cities with the "five IDCs across three regions" architecture. It also sets a new benchmark in the TPC-C standard test, with a single cluster scale exceeding 1,500 nodes. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.
:::
pyseekdb is supported on Linux, macOS, and Windows. The supported database connection modes vary by operating system. For more information, see the table below.
| System | Embedded seekdb | Server mode seekdb | Server mode OceanBase Database |
|----|---|---|---|
| Linux | Supported | Supported | Supported |
| macOS | Not supported | Supported | Supported |
| Windows | Not supported | Supported | Supported |
On Linux, installing this client also installs seekdb in embedded mode, so you can connect to it directly and perform operations such as creating a database. Alternatively, you can connect to a deployed seekdb or OceanBase Database in client/server mode.
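The support table can be expressed as a small lookup, which may be handy when a script has to decide at startup whether embedded mode is an option. The following is a minimal sketch: `SUPPORTED_MODES`, `supported_modes`, and the mode labels are hypothetical names chosen for illustration and are not part of pyseekdb.

```python
import platform

# Connection modes per operating system, transcribed from the table above.
# Keys are the values returned by platform.system(); the mode labels are
# illustrative strings, not pyseekdb identifiers.
SUPPORTED_MODES = {
    "Linux": ("embedded", "server-seekdb", "server-oceanbase"),
    "Darwin": ("server-seekdb", "server-oceanbase"),  # macOS
    "Windows": ("server-seekdb", "server-oceanbase"),
}


def supported_modes(system=None):
    """Return the connection modes available on the given OS name."""
    system = system or platform.system()
    return SUPPORTED_MODES.get(system, ())


if __name__ == "__main__":
    print(platform.system(), "->", supported_modes())
```

An unknown operating system yields an empty tuple, so a caller can treat "no supported mode" as a single case.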
## Install pyseekdb
### Prerequisites
Make sure that your environment meets the following requirements:
* Operating system: Linux (glibc >= 2.28), macOS or Windows
* Python version: Python 3.11 or later
* System architecture: x86_64 or aarch64
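Two of these requirements can be checked from Python itself before installation. The sketch below is one possible way to do it, not an official tool: `check_prerequisites` is a hypothetical helper, the alias table for `arm64` and `AMD64` is an assumption about how `platform.machine()` reports those architectures, and the glibc requirement is not checked.

```python
import platform
import sys

MIN_PYTHON = (3, 11)
# platform.machine() commonly reports "arm64" on Apple silicon and "AMD64" on
# Windows; treating them as aarch64 and x86_64 is an assumption of this sketch.
ARCH_ALIASES = {"amd64": "x86_64", "arm64": "aarch64"}
SUPPORTED_ARCHS = {"x86_64", "aarch64"}


def check_prerequisites(version=None, machine=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(version or sys.version_info[:2])
    machine = (machine or platform.machine()).lower()
    machine = ARCH_ALIASES.get(machine, machine)

    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.11")
    if machine not in SUPPORTED_ARCHS:
        problems.append(f"unsupported architecture: {machine}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Problem:", problem)
```

Passing explicit `version` and `machine` values makes the function easy to exercise without changing the interpreter it runs on.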
### Procedure
Use pip to install pyseekdb. pip automatically selects the package that matches your default Python version and platform.
```shell
pip install pyseekdb
```
If your pip version is outdated, upgrade it before installation.
```bash
pip install --upgrade pip
```
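After installation, you may want to confirm that the package is visible to the interpreter you intend to use. The following sketch relies only on the standard library; `installed_version` is a hypothetical helper name, and the script reports "not installed" rather than failing when the package is absent.

```python
import importlib.metadata
import importlib.util


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. a stdlib module).
        return "unknown"


if __name__ == "__main__":
    print("pyseekdb:", installed_version("pyseekdb") or "not installed")
```

Running the script with the same `python` executable that pip installed into avoids the common mismatch between multiple Python environments.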
## What to do next
* After installing pyseekdb, you can connect to seekdb to perform operations. For information about the APIs supported by pyseekdb, see [API Reference](../50.apis/10.api-overview.md).
* To try pyseekdb quickly, see the SDK samples:
  * [Simple sample](50.sdk-samples/10.pyseekdb-simple-sample.md)
  * [Complete sample](50.sdk-samples/50.pyseekdb-complete-sample.md)
  * [Hybrid search sample](50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)


@@ -0,0 +1,130 @@
---
slug: /pyseekdb-simple-sample
---
# Simple Example
This example demonstrates the basic use of embedding functions with seekdb in embedded mode. It covers the following steps:
1. Connect to seekdb.
2. Create a collection with Embedding Functions.
3. Add data using documents (vectors will be automatically generated).
4. Query using texts (vectors will be automatically generated).
5. Print the query results.
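To make steps 2 to 4 more concrete, the sketch below shows the general idea behind "text in, vector out, nearest vectors back" without any database. It is a toy illustration only: `toy_embed` hashes words into buckets and is not the embedding function pyseekdb uses, and `cosine_distance` and `nearest` are hypothetical helper names.

```python
import hashlib
import math


def toy_embed(text, dimension=16):
    """Map text to a vector by hashing each word into one of `dimension` buckets."""
    vector = [0.0] * dimension
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % dimension] += 1.0
    return vector


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def nearest(query, documents, n_results=3):
    """Rank documents by cosine distance to the query text, closest first."""
    query_vector = toy_embed(query)
    scored = [(cosine_distance(query_vector, toy_embed(doc)), doc) for doc in documents]
    return sorted(scored)[:n_results]


if __name__ == "__main__":
    docs = ["Python is a popular programming language",
            "Vector databases enable semantic search"]
    for distance, doc in nearest("popular programming language", docs):
        print(f"{distance:.4f}  {doc}")
```

A real embedding model captures meaning rather than word overlap, which is why the sample that follows can match "artificial intelligence" to documents that use different wording.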
## Prerequisites
This example uses seekdb in embedded mode. Before using this example, make sure that you have deployed seekdb in embedded mode.
For information about how to deploy seekdb in embedded mode, see [Embedded Mode](../../../../400.guides/400.deploy/600.python-seekdb.md).
## Example
```python
import pyseekdb
# ==================== Step 1: Create Client Connection ====================
# You can use embedded mode, server mode, or OceanBase mode
# Embedded mode (local SeekDB)
client = pyseekdb.Client()
# Alternative: Server mode (connecting to remote SeekDB server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# database="test",
# user="root",
# password=""
# )
# Alternative: Remote server mode (OceanBase Server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# tenant="test", # OceanBase default tenant
# database="test",
# user="root",
# password=""
# )
# ==================== Step 2: Create a Collection with Embedding Function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"
# Create collection with default embedding function
# The embedding function will automatically convert documents to embeddings
collection = client.create_collection(
name=collection_name,
)
print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")
# ==================== Step 3: Add Data to Collection ====================
# With embedding function, you can add documents directly without providing embeddings
# The embedding function will automatically generate embeddings from documents
documents = [
"Machine learning is a subset of artificial intelligence",
"Python is a popular programming language",
"Vector databases enable semantic search",
"Neural networks are inspired by the human brain",
"Natural language processing helps computers understand text"
]
ids = ["id1", "id2", "id3", "id4", "id5"]
# Add data with documents only - embeddings will be auto-generated by embedding function
collection.add(
ids=ids,
documents=documents, # embeddings will be automatically generated
metadatas=[
{"category": "AI", "index": 0},
{"category": "Programming", "index": 1},
{"category": "Database", "index": 2},
{"category": "AI", "index": 3},
{"category": "NLP", "index": 4}
]
)
print(f"\nAdded {len(documents)} documents to collection")
print("Note: Embeddings were automatically generated from documents using the embedding function")
# ==================== Step 4: Query the Collection ====================
# With embedding function, you can query using text directly
# The embedding function will automatically convert query text to query vector
# Query using text - query vector will be auto-generated by embedding function
query_text = "artificial intelligence and machine learning"
results = collection.query(
query_texts=query_text, # Query text - will be embedded automatically
n_results=3 # Return top 3 most similar documents
)
print(f"\nQuery: '{query_text}'")
print(f"Query results: {len(results['ids'][0])} items found")
# ==================== Step 5: Print Query Results ====================
for i in range(len(results['ids'][0])):
    print(f"\nResult {i+1}:")
    print(f" ID: {results['ids'][0][i]}")
    print(f" Distance: {results['distances'][0][i]:.4f}")
    if results.get('documents'):
        print(f" Document: {results['documents'][0][i]}")
    if results.get('metadatas'):
        print(f" Metadata: {results['metadatas'][0][i]}")
# ==================== Step 6: Cleanup ====================
# Delete the collection
client.delete_collection(collection_name)
print(f"\nDeleted collection '{collection_name}'")
```
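The sample indexes the result as `results['ids'][0][i]`: one outer list per query text, one inner entry per hit. When that column-oriented shape becomes awkward, it can be flattened into rows. The helper below is a sketch under the assumption that the result behaves like a dictionary, as the sample's use of `results.get(...)` suggests; `rows_from_result` is a hypothetical name, not a pyseekdb function.

```python
def rows_from_result(results, query_index=0):
    """Turn the column-oriented result of one query into a list of row dicts."""
    rows = []
    for i, doc_id in enumerate(results["ids"][query_index]):
        row = {"id": doc_id}
        # Optional fields may be missing or None, so guard each lookup.
        for source_key, row_key in (("distances", "distance"),
                                    ("documents", "document"),
                                    ("metadatas", "metadata")):
            column = results.get(source_key)
            if column:
                row[row_key] = column[query_index][i]
        rows.append(row)
    return rows
```

With such a helper, the printing loop reduces to `for row in rows_from_result(results): print(row)`.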
## References
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Complete Example](50.pyseekdb-complete-sample.md)
* [Hybrid Search Example](100.pyseekdb-hybrid-search-sample.md)


@@ -0,0 +1,350 @@
---
slug: /pyseekdb-hybrid-search-sample
---
# Hybrid search example
This example demonstrates the advantages of `hybrid_search()` over `query()`.
The main advantages of `hybrid_search()` are:
* Supports full-text search and vector similarity search simultaneously
* Allows separate filtering conditions for full-text and vector search
* Combines the ranked results of both searches using the Reciprocal Rank Fusion algorithm to improve relevance
* Handles complex scenarios that `query()` cannot handle
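Reciprocal Rank Fusion (RRF) itself is simple enough to sketch in a few lines: every ranked list contributes `1 / (k + rank)` to a document's score, and documents are returned by descending total. The code below is an illustrative implementation, not seekdb's. The constant `k=60` is the value commonly used in the RRF literature, and whether seekdb uses the same constant is not stated here; `reciprocal_rank_fusion` is a hypothetical function name.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, n_results=5):
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:n_results]]


if __name__ == "__main__":
    full_text = ["doc_1", "doc_2", "doc_5"]        # keyword matches, best first
    vector = ["doc_3", "doc_1", "doc_9", "doc_5"]  # nearest neighbours, best first
    print(reciprocal_rank_fusion([full_text, vector]))
    # ['doc_1', 'doc_5', 'doc_3', 'doc_2', 'doc_9']
```

Documents that appear in both lists (`doc_1`, `doc_5`) accumulate two contributions and therefore outrank documents that appear in only one, even when the latter hold a top position in their own list.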
## Example
```python
import pyseekdb
# Setup
client = pyseekdb.Client()
collection = client.get_or_create_collection(
name="hybrid_search_demo"
)
# Sample data
documents = [
"Machine learning is revolutionizing artificial intelligence and data science",
"Python programming language is essential for machine learning developers",
"Deep learning neural networks enable advanced AI applications",
"Data science combines statistics, programming, and domain expertise",
"Natural language processing uses machine learning to understand text",
"Computer vision algorithms process images using deep learning techniques",
"Reinforcement learning trains agents through reward-based feedback",
"Python libraries like TensorFlow and PyTorch simplify machine learning",
"Artificial intelligence systems can learn from large datasets",
"Neural networks mimic the structure of biological brain connections"
]
metadatas = [
{"category": "AI", "topic": "machine learning", "year": 2023, "popularity": 95},
{"category": "Programming", "topic": "python", "year": 2023, "popularity": 88},
{"category": "AI", "topic": "deep learning", "year": 2024, "popularity": 92},
{"category": "Data Science", "topic": "data analysis", "year": 2023, "popularity": 85},
{"category": "AI", "topic": "nlp", "year": 2024, "popularity": 90},
{"category": "AI", "topic": "computer vision", "year": 2023, "popularity": 87},
{"category": "AI", "topic": "reinforcement learning", "year": 2024, "popularity": 89},
{"category": "Programming", "topic": "python", "year": 2023, "popularity": 91},
{"category": "AI", "topic": "general ai", "year": 2023, "popularity": 93},
{"category": "AI", "topic": "neural networks", "year": 2024, "popularity": 94}
]
ids = [f"doc_{i+1}" for i in range(len(documents))]
collection.add(ids=ids, documents=documents, metadatas=metadatas)
print("=" * 100)
print("SCENARIO 1: Keyword + Semantic Search")
print("=" * 100)
print("Goal: Find documents similar to 'AI research' AND containing 'machine learning'\n")
# query() approach
query_result1 = collection.query(
query_texts=["AI research"],
where_document={"$contains": "machine learning"},
n_results=5
)
# hybrid_search() approach
hybrid_result1 = collection.hybrid_search(
query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
knn={"query_texts": ["AI research"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result1['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result1['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" query() ranks 'Deep learning neural networks...' first because it's semantically similar to 'AI research',")
print(" but 'machine learning' is not its primary focus. hybrid_search() correctly prioritizes documents that")
print(" explicitly contain 'machine learning' (from full-text search) while also being semantically relevant")
print(" to 'AI research' (from vector search). The RRF fusion ensures documents matching both criteria rank higher.")
print("\n" + "=" * 100)
print("SCENARIO 2: Independent Filters for Different Search Types")
print("=" * 100)
print("Goal: Full-text='neural' (year=2024) + Vector='deep learning' (popularity>=90)\n")
# query() - same filter applies to both conditions
query_result2 = collection.query(
query_texts=["deep learning"],
where={"year": {"$eq": 2024}, "popularity": {"$gte": 90}},
where_document={"$contains": "neural"},
n_results=5
)
# hybrid_search() - different filters for each search type
hybrid_result2 = collection.hybrid_search(
query={"where_document": {"$contains": "neural"}, "where": {"year": {"$eq": 2024}}, "n_results": 10},
knn={"query_texts": ["deep learning"], "where": {"popularity": {"$gte": 90}}, "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results (same filter for both):")
for i, doc_id in enumerate(query_result2['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
    print(f" {metadatas[idx]}")
print("\nhybrid_search() Results (independent filters):")
for i, doc_id in enumerate(hybrid_result2['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
    print(f" {metadatas[idx]}")
print("\nAnalysis:")
print(" query() only returns 2 results because it requires documents to satisfy BOTH year=2024 AND popularity>=90")
print(" simultaneously. hybrid_search() returns 5 results by applying year=2024 filter to full-text search")
print(" and popularity>=90 filter to vector search independently, then fusing the results. This approach")
print(" captures documents that satisfy one criterion even if they do not satisfy the other.")
print("\n" + "=" * 100)
print("SCENARIO 3: Combining Multiple Search Strategies")
print("=" * 100)
print("Goal: Find documents about 'machine learning algorithms'\n")
# query() - vector search only
query_result3 = collection.query(
query_texts=["machine learning algorithms"],
n_results=5
)
# hybrid_search() - combines full-text and vector
hybrid_result3 = collection.hybrid_search(
query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
knn={"query_texts": ["machine learning algorithms"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results (vector similarity only):")
for i, doc_id in enumerate(query_result3['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results (full-text + vector fusion):")
for i, doc_id in enumerate(hybrid_result3['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" query() returns 'Artificial intelligence systems...' as the result, which doesn't explicitly")
print(" mention 'machine learning'. hybrid_search() combines full-text search (for 'machine learning')")
print(" with vector search (for semantic similarity to 'machine learning algorithms'), ensuring that")
print(" documents containing the exact keyword rank higher while still capturing semantically relevant content.")
print("\n" + "=" * 100)
print("SCENARIO 4: Complex Multi-Criteria Search")
print("=" * 100)
print("Goal: Full-text='learning' (category=AI) + Vector='artificial intelligence' (year>=2023)\n")
# query() - limited to single search with combined filters
query_result4 = collection.query(
query_texts=["artificial intelligence"],
where={"category": {"$eq": "AI"}, "year": {"$gte": 2023}},
where_document={"$contains": "learning"},
n_results=5
)
# hybrid_search() - separate criteria for each search type
hybrid_result4 = collection.hybrid_search(
query={"where_document": {"$contains": "learning"}, "where": {"category": {"$eq": "AI"}}, "n_results": 10},
knn={"query_texts": ["artificial intelligence"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result4['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
    print(f" {metadatas[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result4['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
    print(f" {metadatas[idx]}")
print("\nAnalysis:")
print(" While both methods return similar documents, hybrid_search() provides better ranking by prioritizing")
print(" documents that score highly in both full-text search (containing 'learning' with category=AI) and")
print(" vector search (semantically similar to 'artificial intelligence' with year>=2023). The RRF fusion")
print(" algorithm ensures that 'Deep learning neural networks...' ranks first because it strongly matches")
print(" both search criteria, whereas query() applies filters sequentially which may not optimize ranking.")
print("\n" + "=" * 100)
print("SCENARIO 5: Result Quality - RRF Fusion")
print("=" * 100)
print("Goal: Search for 'Python machine learning'\n")
# query() - single ranking
query_result5 = collection.query(
query_texts=["Python machine learning"],
n_results=5
)
# hybrid_search() - RRF fusion of multiple rankings
hybrid_result5 = collection.hybrid_search(
query={"where_document": {"$contains": "Python"}, "n_results": 10},
knn={"query_texts": ["Python machine learning"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results (single ranking):")
for i, doc_id in enumerate(query_result5['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results (RRF fusion):")
for i, doc_id in enumerate(hybrid_result5['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" Both methods return identical results in this case, but hybrid_search() achieves this through RRF")
print(" (Reciprocal Rank Fusion) which combines rankings from full-text search (for 'Python') and vector")
print(" search (for 'Python machine learning'). RRF provides more stable and robust ranking by considering")
print(" multiple signals, making it less sensitive to variations in individual search algorithms and ensuring")
print(" consistent high-quality results across different query formulations.")
print("\n" + "=" * 100)
print("SCENARIO 6: Different Filter Criteria for Each Search")
print("=" * 100)
print("Goal: Full-text='neural' (high popularity) + Vector='deep learning' (recent year)\n")
# query() - cannot separate filters for keyword vs semantic
query_result6 = collection.query(
query_texts=["deep learning"],
where={"popularity": {"$gte": 90}, "year": {"$gte": 2023}},
where_document={"$contains": "neural"},
n_results=5
)
# hybrid_search() - different filters for keyword search vs semantic search
hybrid_result6 = collection.hybrid_search(
query={"where_document": {"$contains": "neural"}, "where": {"popularity": {"$gte": 90}}, "n_results": 10},
knn={"query_texts": ["deep learning"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result6['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
    print(f" {metadatas[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result6['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
    print(f" {metadatas[idx]}")
print("\nAnalysis:")
print(" query() only returns 2 results because it requires documents to satisfy BOTH popularity>=90 AND")
print(" year>=2023 simultaneously, along with containing 'neural' and being semantically similar to")
print(" 'deep learning'. hybrid_search() returns 5 results by applying popularity>=90 filter to full-text")
print(" search (for 'neural') and year>=2023 filter to vector search (for 'deep learning') independently.")
print(" The fusion then combines results from both searches, capturing documents that strongly match either")
print(" criterion while still being relevant to the overall query intent.")
print("\n" + "=" * 100)
print("SCENARIO 7: Partial Keyword Match + Semantic Similarity")
print("=" * 100)
print("Goal: Documents containing 'Python' + Semantically similar to 'data science'\n")
# query() - filter applied after vector search
query_result7 = collection.query(
query_texts=["data science"],
where_document={"$contains": "Python"},
n_results=5
)
# hybrid_search() - parallel searches then fusion
hybrid_result7 = collection.hybrid_search(
query={"where_document": {"$contains": "Python"}, "n_results": 10},
knn={"query_texts": ["data science"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result7['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result7['ids'][0]):
    idx = ids.index(doc_id)
    print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" query() only returns 2 results because it first performs vector search for 'data science', then")
print(" filters to documents containing 'Python', which severely limits the result set. hybrid_search()")
print(" returns 5 results by running full-text search (for 'Python') and vector search (for 'data science')")
print(" in parallel, then fusing the results. This captures documents that contain 'Python' (even if not")
print(" semantically closest to 'data science') and documents semantically similar to 'data science' (even")
print(" if they don't contain 'Python'), providing better recall and more comprehensive results.")
print("\n" + "=" * 100)
print("SUMMARY")
print("=" * 100)
print("""
query() limitations:
- Single search type (vector similarity)
- Filters applied after search (may miss relevant docs)
- Cannot combine full-text and vector search results
- Same filter criteria for all conditions
hybrid_search() advantages:
- Simultaneous full-text + vector search
- Independent filters for each search type
- Intelligent result fusion using RRF
- Better recall for complex queries
- Handles scenarios requiring both keyword and semantic matching
""")
```
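Several scenarios in the sample hinge on one point: a combined filter keeps only records that satisfy every condition at once, whereas independent filters produce two candidate sets that are merged afterwards. The sketch below reproduces that effect with plain Python on made-up records. It ignores similarity ranking and result limits entirely, and `RECORDS` and `select` are hypothetical names, so treat it as an illustration of the set logic rather than of seekdb's execution.

```python
RECORDS = [
    {"id": "a", "text": "neural networks for vision", "year": 2024, "popularity": 85},
    {"id": "b", "text": "deep neural models", "year": 2024, "popularity": 94},
    {"id": "c", "text": "classic statistics", "year": 2023, "popularity": 96},
    {"id": "d", "text": "neural history", "year": 2022, "popularity": 70},
]


def select(records, predicate):
    """Return the IDs of the records that satisfy the predicate."""
    return [r["id"] for r in records if predicate(r)]


# query()-style: one combined filter, every condition must hold at once.
combined = select(
    RECORDS,
    lambda r: "neural" in r["text"] and r["year"] == 2024 and r["popularity"] >= 90,
)

# hybrid_search()-style: each branch filters on its own, then candidates merge.
full_text_branch = select(RECORDS, lambda r: "neural" in r["text"] and r["year"] == 2024)
vector_branch = select(RECORDS, lambda r: r["popularity"] >= 90)
merged = sorted(set(full_text_branch) | set(vector_branch))

print("combined:", combined)  # combined: ['b']
print("merged:  ", merged)    # merged:   ['a', 'b', 'c']
```

The merged set is larger because record `a` satisfies only the full-text branch and record `c` only the vector branch; a fusion step then decides their final order.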
## References
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Simple example](10.pyseekdb-simple-sample.md)
* [Complete example](50.pyseekdb-complete-sample.md)


@@ -0,0 +1,440 @@
---
slug: /pyseekdb-complete-sample
---
# Complete Example
This example demonstrates the full capabilities of pyseekdb.
The example includes the following operations:
1. Connection, including all connection modes
2. Collection management
3. DML operations, including add, update, upsert, and delete
4. DQL operations, including query, get, and hybrid_search
5. Filter operators
6. Collection information methods
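The filter operators in item 5 follow a dictionary syntax that is easier to reason about once its semantics are spelled out. The sketch below evaluates such a filter against a single metadata dictionary in plain Python. It covers only the operators that appear in this example (`$eq`, `$gte`, `$in`, `$and`, `$or`, and simplified equality), treats multiple top-level keys as a logical AND, and is an assumption about intended behavior rather than seekdb's implementation; `matches` and `COMPARATORS` are hypothetical names.

```python
import operator

# Only the comparison operators used in this example are modelled here.
COMPARATORS = {
    "$eq": operator.eq,
    "$gte": operator.ge,
    "$in": lambda value, options: value in options,
}


def matches(metadata, where):
    """Return True if the metadata dict satisfies the where filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"score": {"$gte": 90}}
            if key not in metadata:
                return False
            for op, expected in condition.items():
                if not COMPARATORS[op](metadata[key], expected):
                    return False
        else:
            # Simplified equality, e.g. {"category": "AI"}
            if metadata.get(key) != condition:
                return False
    return True


if __name__ == "__main__":
    item = {"category": "AI", "score": 95, "tag": "ml", "year": 2023}
    print(matches(item, {"category": "AI", "year": {"$gte": 2023}}))
```

A local evaluator like this can be useful for unit-testing filter dictionaries before sending them to a database.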
## Example
```python
import uuid
import random
import pyseekdb
# ============================================================================
# PART 1: CLIENT CONNECTION
# ============================================================================
# Option 1: Embedded mode (local SeekDB)
client = pyseekdb.Client(
#path="./seekdb",
#database="test"
)
# Option 2: Server mode (remote SeekDB server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# database="test",
# user="root",
# password=""
# )
# Option 3: Remote server mode (OceanBase Server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# tenant="test", # OceanBase default tenant
# database="test",
# user="root",
# password=""
# )
# ============================================================================
# PART 2: COLLECTION MANAGEMENT
# ============================================================================
collection_name = "comprehensive_example"
dimension = 128
# 2.1 Create a collection
from pyseekdb import HNSWConfiguration
config = HNSWConfiguration(dimension=dimension, distance='cosine')
collection = client.get_or_create_collection(
name=collection_name,
configuration=config,
embedding_function=None # Explicitly set to None since we're using custom 128-dim embeddings
)
# 2.2 Check if collection exists
exists = client.has_collection(collection_name)
# 2.3 Get collection object
retrieved_collection = client.get_collection(collection_name, embedding_function=None)
# 2.4 List all collections
all_collections = client.list_collections()
# 2.5 Get or create collection (creates if doesn't exist)
config2 = HNSWConfiguration(dimension=64, distance='cosine')
collection2 = client.get_or_create_collection(
name="another_collection",
configuration=config2,
embedding_function=None # Explicitly set to None since we're using custom 64-dim embeddings
)
# ============================================================================
# PART 3: DML OPERATIONS - ADD DATA
# ============================================================================
# Generate sample data
random.seed(42)
documents = [
"Machine learning is transforming the way we solve problems",
"Python programming language is widely used in data science",
"Vector databases enable efficient similarity search",
"Neural networks mimic the structure of the human brain",
"Natural language processing helps computers understand human language",
"Deep learning requires large amounts of training data",
"Reinforcement learning agents learn through trial and error",
"Computer vision enables machines to interpret visual information"
]
# Generate embeddings (in real usage, use an embedding model)
embeddings = []
for i in range(len(documents)):
    vector = [random.random() for _ in range(dimension)]
    embeddings.append(vector)
ids = [str(uuid.uuid4()) for _ in documents]
# 3.1 Add single item
single_id = str(uuid.uuid4())
collection.add(
ids=single_id,
documents="This is a single document",
embeddings=[random.random() for _ in range(dimension)],
metadatas={"type": "single", "category": "test"}
)
# 3.2 Add multiple items
collection.add(
ids=ids,
documents=documents,
embeddings=embeddings,
metadatas=[
{"category": "AI", "score": 95, "tag": "ml", "year": 2023},
{"category": "Programming", "score": 88, "tag": "python", "year": 2022},
{"category": "Database", "score": 92, "tag": "vector", "year": 2023},
{"category": "AI", "score": 90, "tag": "neural", "year": 2022},
{"category": "NLP", "score": 87, "tag": "language", "year": 2023},
{"category": "AI", "score": 93, "tag": "deep", "year": 2023},
{"category": "AI", "score": 85, "tag": "reinforcement", "year": 2022},
{"category": "CV", "score": 91, "tag": "vision", "year": 2023}
]
)
# 3.3 Add with only embeddings (no documents)
vector_only_ids = [str(uuid.uuid4()) for _ in range(2)]
collection.add(
ids=vector_only_ids,
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
metadatas=[{"type": "vector_only"}, {"type": "vector_only"}]
)
# ============================================================================
# PART 4: DML OPERATIONS - UPDATE DATA
# ============================================================================
# 4.1 Update single item
collection.update(
ids=ids[0],
metadatas={"category": "AI", "score": 98, "tag": "ml", "year": 2024, "updated": True}
)
# 4.2 Update multiple items
collection.update(
ids=ids[1:3],
documents=["Updated document 1", "Updated document 2"],
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
metadatas=[
{"category": "Programming", "score": 95, "updated": True},
{"category": "Database", "score": 97, "updated": True}
]
)
# 4.3 Update embeddings
new_embeddings = [[random.random() for _ in range(dimension)] for _ in range(2)]
collection.update(
ids=ids[2:4],
embeddings=new_embeddings
)
# ============================================================================
# PART 5: DML OPERATIONS - UPSERT DATA
# ============================================================================
# 5.1 Upsert existing item (will update)
collection.upsert(
ids=ids[0],
documents="Upserted document (was updated)",
embeddings=[random.random() for _ in range(dimension)],
metadatas={"category": "AI", "upserted": True}
)
# 5.2 Upsert new item (will insert)
new_id = str(uuid.uuid4())
collection.upsert(
ids=new_id,
documents="This is a new document from upsert",
embeddings=[random.random() for _ in range(dimension)],
metadatas={"category": "New", "upserted": True}
)
# 5.3 Upsert multiple items
upsert_ids = [ids[4], str(uuid.uuid4())] # One existing, one new
collection.upsert(
ids=upsert_ids,
documents=["Upserted doc 1", "Upserted doc 2"],
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
metadatas=[{"upserted": True}, {"upserted": True}]
)
# ============================================================================
# PART 6: DQL OPERATIONS - QUERY (VECTOR SIMILARITY SEARCH)
# ============================================================================
# 6.1 Basic vector similarity query
query_vector = embeddings[0] # Query with first document's vector
results = collection.query(
query_embeddings=query_vector,
n_results=3
)
print(f"Query results: {len(results['ids'][0])} items")
# 6.2 Query with metadata filter (simplified equality)
results = collection.query(
query_embeddings=query_vector,
where={"category": "AI"},
n_results=5
)
# 6.3 Query with comparison operators
results = collection.query(
query_embeddings=query_vector,
where={"score": {"$gte": 90}},
n_results=5
)
# 6.4 Query with $in operator
results = collection.query(
query_embeddings=query_vector,
where={"tag": {"$in": ["ml", "python", "neural"]}},
n_results=5
)
# 6.5 Query with logical operators ($or) - simplified equality
results = collection.query(
query_embeddings=query_vector,
where={
"$or": [
{"category": "AI"},
{"tag": "python"}
]
},
n_results=5
)
# 6.6 Query with logical operators ($and) - simplified equality
results = collection.query(
query_embeddings=query_vector,
where={
"$and": [
{"category": "AI"},
{"score": {"$gte": 90}}
]
},
n_results=5
)
# 6.7 Query with document filter
results = collection.query(
query_embeddings=query_vector,
where_document={"$contains": "machine learning"},
n_results=5
)
# 6.8 Query with combined filters (simplified equality)
results = collection.query(
query_embeddings=query_vector,
where={"category": "AI", "year": {"$gte": 2023}},
where_document={"$contains": "learning"},
n_results=5
)
# 6.9 Query with multiple embeddings (batch query)
batch_embeddings = [embeddings[0], embeddings[1]]
batch_results = collection.query(
query_embeddings=batch_embeddings,
n_results=2
)
# batch_results["ids"][0] contains results for first query
# batch_results["ids"][1] contains results for second query
# 6.10 Query with specific fields
results = collection.query(
query_embeddings=query_vector,
include=["documents", "metadatas", "embeddings"],
n_results=2
)
# ============================================================================
# PART 7: DQL OPERATIONS - GET (RETRIEVE BY IDS OR FILTERS)
# ============================================================================
# 7.1 Get by single ID
result = collection.get(ids=ids[0])
# result["ids"] contains [ids[0]]
# result["documents"] contains document for ids[0]
# 7.2 Get by multiple IDs
results = collection.get(ids=ids[:3])
# results["ids"] contains ids[:3]
# results["documents"] contains documents for all IDs
# 7.3 Get by metadata filter (simplified equality)
results = collection.get(
where={"category": "AI"},
limit=5
)
# 7.4 Get with comparison operators
results = collection.get(
where={"score": {"$gte": 90}},
limit=5
)
# 7.5 Get with $in operator
results = collection.get(
where={"tag": {"$in": ["ml", "python"]}},
limit=5
)
# 7.6 Get with logical operators (simplified equality)
results = collection.get(
where={
"$or": [
{"category": "AI"},
{"category": "Programming"}
]
},
limit=5
)
# 7.7 Get by document filter
results = collection.get(
where_document={"$contains": "Python"},
limit=5
)
# 7.8 Get with pagination
results_page1 = collection.get(limit=2, offset=0)
results_page2 = collection.get(limit=2, offset=2)
# 7.9 Get with specific fields
results = collection.get(
ids=ids[:2],
include=["documents", "metadatas", "embeddings"]
)
# 7.10 Get all data
all_results = collection.get(limit=100)
# ============================================================================
# PART 8: DQL OPERATIONS - HYBRID SEARCH
# ============================================================================
# 8.1 Hybrid search with full-text and vector search
# Note: This requires query_embeddings to be provided directly
# In real usage, you might have an embedding function
hybrid_results = collection.hybrid_search(
query={
"where_document": {"$contains": "machine learning"},
"where": {"category": "AI"}, # Simplified equality
"n_results": 10
},
knn={
"query_embeddings": [embeddings[0]],
"where": {"year": {"$gte": 2022}},
"n_results": 10
},
rank={"rrf": {}}, # Reciprocal Rank Fusion
n_results=5,
include=["documents", "metadatas"]
)
# hybrid_results["ids"][0] contains IDs for the hybrid search
# hybrid_results["documents"][0] contains documents for the hybrid search
print(f"Hybrid search: {len(hybrid_results.get('ids', [[]])[0])} results")
# ============================================================================
# PART 9: DML OPERATIONS - DELETE DATA
# ============================================================================
# 9.1 Delete by IDs
delete_ids = [vector_only_ids[0], new_id]
collection.delete(ids=delete_ids)
# 9.2 Delete by metadata filter
collection.delete(where={"type": {"$eq": "vector_only"}})
# 9.3 Delete by document filter
collection.delete(where_document={"$contains": "Updated document"})
# 9.4 Delete with combined filters
collection.delete(
where={"category": {"$eq": "CV"}},
where_document={"$contains": "vision"}
)
# ============================================================================
# PART 10: COLLECTION INFORMATION
# ============================================================================
# 10.1 Get collection count
count = collection.count()
print(f"Collection count: {count} items")
# 10.3 Preview first few items in collection (returns all columns by default)
preview = collection.peek(limit=5)
print(f"Preview: {len(preview['ids'])} items")
for i in range(len(preview['ids'])):
    print(f" ID: {preview['ids'][i]}, Document: {preview['documents'][i]}")
    print(f" Metadata: {preview['metadatas'][i]}, Embedding dim: {len(preview['embeddings'][i]) if preview['embeddings'][i] else 0}")
# 10.4 Count collections in database
collection_count = client.count_collection()
print(f"Database has {collection_count} collections")
# ============================================================================
# PART 11: CLEANUP
# ============================================================================
# Delete test collections
try:
    client.delete_collection("another_collection")
except Exception as e:
    print(f"Could not delete 'another_collection': {e}")
# Uncomment to delete main collection
# client.delete_collection(collection_name)
```
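The `rank={"rrf": {}}` option in Part 8 selects Reciprocal Rank Fusion (RRF). As a rough illustration of the general technique only, and not of how seekdb implements it internally, the sketch below merges two ranked ID lists in plain Python. The function name `rrf_fuse`, the sample IDs, and the constant `k=60` (a value commonly used for RRF) are assumptions made for this example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, n_results=5):
    """Merge several ranked ID lists with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the best hit in a list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    ordered = sorted(scores, key=lambda item_id: (-scores[item_id], item_id))
    return ordered[:n_results]


# Hypothetical results of the full-text branch and the vector branch.
fulltext_ids = ["doc3", "doc1", "doc7"]
vector_ids = ["doc1", "doc5", "doc3"]

fused = rrf_fuse([fulltext_ids, vector_ids])
print(fused)
```

Documents that rank well in both branches (`doc1` and `doc3` here) end up ahead of documents that appear in only one list, which is the behavior hybrid search relies on.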
## References
* For information about the API interfaces supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Simple Example](../50.sdk-samples/10.pyseekdb-simple-sample.md)
* [Hybrid Search Example](../50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)

View File

@@ -0,0 +1,66 @@
---
slug: /api-overview
---
# API Reference
You can operate seekdb through the APIs supported by pyseekdb.
## APIs
The following APIs are supported.
### Database
:::info
You can use this API only when you connect to seekdb by using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../50.apis/100.admin-client.md).
:::
| API | Description | Documentation |
|---|---|---|
| `create_database()` | Creates a database. | [Documentation](110.database/200.create-database-of-api.md) |
| `get_database()` | Retrieves a specified database. |[Documentation](110.database/300.get-database-of-api.md)|
| `list_databases()` | Retrieves a list of databases in an instance. |[Documentation](110.database/400.list-database-of-api.md)|
| `delete_database()` | Deletes a specified database.|[Documentation](110.database/500.delete-database-of-api.md)|
### Collection
:::info
You can use this API only when you connect to seekdb by using the `Client`. For more information about the `Client`, see [Client](../50.apis/50.client.md).
:::
| API | Description | Documentation |
|---|---|---|
| `create_collection()` | Creates a collection. | [Documentation](200.collection/100.create-collection-of-api.md) |
| `get_collection()` | Retrieves a specified collection. |[Documentation](200.collection/200.get-collection-of-api.md)|
| `get_or_create_collection()` | Gets a collection, creating it first if it does not exist in the database. |[Documentation](200.collection/250.get-or-create-collection-of-api.md)|
| `list_collections()` | Retrieves the collection list in a database. |[Documentation](200.collection/300.list-collection-of-api.md)|
| `count_collection()` | Counts the number of collections in a database. |[Documentation](200.collection/350.count-collection-of-api.md)|
| `delete_collection()` | Deletes a specified collection.|[Documentation](200.collection/400.delete-collection-of-api.md)|
### DML
:::info
You can use this API only when you connect to seekdb by using the `Client`. For more information about the `Client`, see [Client](../50.apis/50.client.md).
:::
| API | Description | Documentation |
|---|---|---|
| `add()` | Inserts a new record into a collection. | [Documentation](300.dml/200.add-data-of-api.md) |
| `update()` | Updates an existing record in a collection. |[Documentation](300.dml/300.update-data-of-api.md)|
| `upsert()` | Inserts a new record or updates an existing record. |[Documentation](300.dml/400.upsert-data-of-api.md)|
| `delete()` | Deletes a record from a collection.|[Documentation](300.dml/500.delete-data-of-api.md)|
### DQL
:::info
You can use this API only when you connect to seekdb by using the `Client`. For more information about the `Client`, see [Client](../50.apis/50.client.md).
:::
| API | Description | Documentation |
|---|---|---|
| `query()` | Performs vector similarity search. | [Documentation](400.dql/200.query-interfaces-of-api.md) |
| `get()` | Queries specific data from a table by using the ID, document, and metadata (non-vector). |[Documentation](400.dql/300.get-interfaces-of-api.md)|
| `hybrid_search()` | Performs full-text search and vector similarity search by using ranking. |[Documentation](400.dql/400.hybrid-search-of-api.md)|

View File

@@ -0,0 +1,93 @@
---
slug: /admin-client
---
# Admin Client
`AdminClient` provides database management operations. It uses the same database connection mode as `Client`, but only supports database management-related operations.
## Connect to an embedded seekdb instance
Connect to a local embedded seekdb instance by using `AdminClient`.
```python
import pyseekdb
# Embedded mode - Database management
admin = pyseekdb.AdminClient(path="./seekdb")
```
Parameter description:
| Parameter | Value Type | Required | Description | Example Value |
| --- | --- | --- | --- | --- |
| `path` | string | Optional | The path of the seekdb data directory. seekdb stores database files in this directory and loads them when it starts. | `./seekdb` |
## Connect to a remote server
Connect to a remote server by using `AdminClient`. This way, you can connect to a seekdb instance or an OceanBase Database instance.
:::tip
Before you connect to a remote server, make sure that you have deployed a server mode seekdb instance or an OceanBase Database instance.<br/>For information about how to deploy a server mode seekdb instance, see [Overview](../../../400.guides/400.deploy/50.deploy-overview.md).<br/>For information about how to deploy an OceanBase Database instance, see [Overview](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003976427).
:::
Example: Connect to a server mode seekdb instance
```python
import pyseekdb
# Remote server mode - Database management
admin = pyseekdb.AdminClient(
host="127.0.0.1",
port=2881,
user="root",
password="" # Can be retrieved from SEEKDB_PASSWORD environment variable
)
```
Parameter description:
| Parameter | Value Type | Required | Description | Example Value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the instance resides. | `127.0.0.1` |
| `port` | int | Yes | The port of the instance. The default value is 2881. | `2881` |
| `user` | string | Yes | The username. The default value is root. | `root` |
| `password` | string | Yes | The password corresponding to the username. If you do not specify `password` or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. | |
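
The fallback described for `password` can be pictured with a few lines of plain Python. This is only a sketch of the documented behavior, not pyseekdb's own code; the helper name `resolve_password` and the sample secret are hypothetical.

```python
import os


def resolve_password(password=""):
    """Return the explicit password, or fall back to SEEKDB_PASSWORD."""
    # An empty string (or None) counts as "not specified".
    if password:
        return password
    return os.environ.get("SEEKDB_PASSWORD", "")


os.environ["SEEKDB_PASSWORD"] = "example-secret"  # for demonstration only
print(resolve_password("explicit"))  # the explicit value wins
print(resolve_password(""))          # falls back to the environment variable
```

Keeping the password in the environment variable avoids hard-coding credentials in scripts.
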
Example: Connect to an OceanBase Database instance
```python
import pyseekdb
# Remote server mode - Database management
admin = pyseekdb.AdminClient(
host="127.0.0.1",
port=2881,
tenant="test",
user="root",
password="" # Can be retrieved from SEEKDB_PASSWORD environment variable
)
```
Parameter description:
| Parameter | Value Type | Required | Description | Example Value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the database resides. | `127.0.0.1` |
| `port` | int | Yes | The port of the OceanBase Database instance. The default value is 2881. | `2881` |
| `tenant` | string | No | The name of the tenant. This parameter is not required for a server mode seekdb instance, but is required for an OceanBase Database instance. The default value is sys. | `test` |
| `user` | string | Yes | The username corresponding to the tenant. The default value is root. | `root` |
| `password` | string | Yes | The password corresponding to the username. If you do not specify `password` or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. | |
## APIs supported when you use AdminClient to connect to a database
The following APIs are supported when you use `AdminClient` to connect to a database.
| API | Description | Documentation Link |
| --- | --- | --- |
| `create_database` | Creates a new database. |[Documentation](110.database/200.create-database-of-api.md)|
| `get_database` | Queries a specified database. |[Documentation](110.database/300.get-database-of-api.md)|
| `delete_database` | Deletes a specified database. |[Documentation](110.database/500.delete-database-of-api.md)|
| `list_databases` | Lists all databases. |[Documentation](110.database/400.list-database-of-api.md)|

View File

@@ -0,0 +1,16 @@
---
slug: /database-overview-of-api
---
# Database Management
A database contains tables, indexes, and metadata of database objects. You can create, query, and delete databases as needed.
The following APIs are available for database operations.
| API | Description | Documentation |
|---|---|---|
| `create_database()` | Creates a database. | [Documentation](200.create-database-of-api.md) |
| `get_database()` | Gets a specified database. |[Documentation](300.get-database-of-api.md)|
| `list_databases()` | Gets the list of databases in the instance. |[Documentation](400.list-database-of-api.md)|
| `delete_database()` | Deletes a specified database.|[Documentation](500.delete-database-of-api.md)|

View File

@@ -0,0 +1,76 @@
---
slug: /create-database-of-api
---
# create_database - Create a database
The `create_database()` function is used to create a new database.
:::info
* This interface can only be used when you are connected to the database using `AdminClient`. For more information about `AdminClient`, see [Admin Client](../100.admin-client.md).
* Currently, when you use `create_database` to create a database, you cannot specify the database properties. The database will be created based on the default values of the properties. If you want to create a database with specific properties, you can try to create it using SQL. For more information about how to create a database using SQL, see [Create a database](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003977077).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).
* If you are using server mode of seekdb or OceanBase Database, make sure that the connected user has the `CREATE` privilege. For more information about how to check the privileges of the current user, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Limitations
* In a seekdb instance or OceanBase Database, the name of each database must be globally unique.
* The maximum length of a database name is 128 characters.
* The name can contain only uppercase and lowercase letters, digits, underscores, dollar signs, and Chinese characters.
* Avoid using reserved keywords as database names.
For more information about reserved keywords, see [Reserved keywords](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003976774).
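
As an illustration of the naming rules above, the following sketch checks a candidate name on the client side before it is sent to the server. It is an assumption-laden approximation: the helper `check_database_name` is hypothetical, "Chinese characters" is approximated by the CJK range U+4E00 to U+9FFF, and the reserved-keyword set is a tiny sample rather than the real list.

```python
import re

# Letters, digits, underscores, dollar signs, and CJK ideographs
# (U+4E00-U+9FFF is only an approximation of "Chinese characters").
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_$\u4e00-\u9fff]+")

# A tiny illustrative sample; the real reserved-keyword list is much longer.
_SAMPLE_RESERVED = {"select", "table", "database", "index"}


def check_database_name(name):
    """Return a list of problems found in a candidate database name."""
    problems = []
    if len(name) > 128:
        problems.append("longer than 128 characters")
    if not _NAME_PATTERN.fullmatch(name):
        problems.append("contains unsupported characters or is empty")
    if name.lower() in _SAMPLE_RESERVED:
        problems.append("is a reserved keyword")
    if name.isdigit():
        problems.append("consists only of digits")
    return problems


print(check_database_name("my_database"))  # []
print(check_database_name("bad-name"))
```

An empty list means the name passed every check in this sketch; the server remains the final authority on what is accepted.
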
## Recommendations
* We recommend that you give the database a meaningful name that reflects its purpose and content. For example, you can use `Application Identifier_Sub-application name (optional)_db` as the database name.
* We recommend that you create the database and related users using the root user and assign only the necessary privileges to ensure the security and controllability of the database.
* You can create a database with a name consisting only of digits by enclosing the name in backticks (`), but this is not recommended. This is because names consisting only of digits have no clear meaning, and queries require the use of backticks (`), which can lead to unnecessary complexity and confusion.
## Request parameters
```python
create_database(name, tenant=DEFAULT_TENANT)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the database to be created. |`my_database`|
|`tenant`|string|No<ul><li>When using embedded seekdb or server mode of seekdb, this parameter is not required.</li><li>When using OceanBase Database, this parameter is required.</li></ul>|The tenant to which the database belongs. |`test_tenant`|
## Request example
```python
import pyseekdb
# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")
# Create database
admin.create_database("my_database")
```
## Response parameters
None
## References
* [Get a specific database](300.get-database-of-api.md)
* [Delete a database](500.delete-database-of-api.md)
* [List databases](400.list-database-of-api.md)

View File

@@ -0,0 +1,65 @@
---
slug: /get-database-of-api
---
# get_database - Get the specified database
The `get_database()` method is used to obtain the information of the specified database.
:::info
This method can be used only when you connect to the database by using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../100.admin-client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).
## Request parameters
```python
get_database(name, tenant=DEFAULT_TENANT)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the database to be queried. |`my_database`|
|`tenant`|string|No<ul><li>When you use embedded seekdb and server mode seekdb, you do not need to specify this parameter.</li><li>When you use OceanBase Database, you must specify this parameter.</li></ul>|The tenant to which the database belongs. |test_tenant|
## Request example
```python
import pyseekdb
# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")
# Get database
db = admin.get_database("my_database")
print(f"Database: {db.name}, Charset: {db.charset}, collation:{db.collation}, metadata:{db.metadata}")
```
## Response parameters
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried database. |`my_database`|
|`tenant`|string|No<br/>When you use embedded seekdb or server mode seekdb, this parameter does not exist. |The tenant to which the queried database belongs. |`test_tenant`|
|`charset`|string|No|The character set used by the queried database. |`utf8mb4`|
|`collation`|string|No|The collation used by the queried database. |`utf8mb4_general_ci`|
|`metadata`|dict|No|Reserved field. | {} |
## Response example
```python
Database: my_database, Charset: utf8mb4, collation:utf8mb4_general_ci, metadata:{}
```
## References
* [Create a database](200.create-database-of-api.md)
* [Delete a database](500.delete-database-of-api.md)
* [Get the database list](400.list-database-of-api.md)

View File

@@ -0,0 +1,70 @@
---
slug: /list-database-of-api
---
# list_databases - Get the database list
The `list_databases()` method is used to retrieve the database list in the instance.
:::info
This API is only available when using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../100.admin-client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).
## Request parameters
```python
list_databases(limit=None, offset=None, tenant=DEFAULT_TENANT)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`limit`|int|Optional|The maximum number of databases to return. |2|
|`offset`|int|Optional|The number of databases to skip. |3|
|`tenant`|string|Optional<ul><li>When using embedded seekdb and server mode seekdb, this parameter is not required.</li><li>When using OceanBase Database, this parameter is required. The default value is `sys`.</li></ul>|The tenant to which the queried database belongs. |test_tenant|
## Request example
```python
import pyseekdb
# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")
# List databases: return at most 2 databases, skipping the first 3
databases = admin.list_databases(limit=2, offset=3)
for db in databases:
    print(f"Database: {db.name}, Charset: {db.charset}, collation:{db.collation}, metadata:{db.metadata}")
```
## Response parameters
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried database. |`my_database`|
|`tenant`|string|Optional<br/>When using embedded seekdb or server mode seekdb, this parameter is not available. |The tenant to which the queried database belongs. |`test_tenant`|
|`charset`|string|Optional|The character set of the queried database. |`utf8mb4`|
|`collation`|string|Optional|The collation of the queried database. |`utf8mb4_general_ci`|
|`metadata`|dict|Optional|Reserved field. No data is returned. | {} |
## Response example
```python
Database: test, Charset: utf8mb4, collation:utf8mb4_general_ci, metadata:{}
Database: my_database, Charset: utf8mb4, collation:utf8mb4_general_ci, metadata:{}
```
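The `limit` and `offset` parameters follow the usual skip-then-take pattern. The sketch below imitates that with a plain Python list so the effect of a limit of 2 and an offset of 3 is easy to see. The helper `paginate` and the database names are made up for illustration, and no seekdb connection is involved.

```python
def paginate(items, limit=None, offset=None):
    """Skip `offset` items, then return at most `limit` items."""
    start = offset or 0
    end = None if limit is None else start + limit
    return items[start:end]


# Hypothetical database names, in the order a server might return them.
names = ["information_schema", "mysql", "oceanbase", "test", "my_database", "demo"]

print(paginate(names, limit=2, offset=3))  # ['test', 'my_database']
print(paginate(names))                     # all six names
```

Leaving both parameters at `None` returns everything, which mirrors the defaults in the method signature.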
## References
* [Create a database](200.create-database-of-api.md)
* [Delete a database](500.delete-database-of-api.md)
* [Get a specific database](300.get-database-of-api.md)

View File

@@ -0,0 +1,54 @@
---
slug: /delete-database-of-api
---
# delete_database - Delete a database
The `delete_database()` method is used to delete a database.
:::info
This method is only available when using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../100.admin-client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).
* If you are using server mode of seekdb or OceanBase Database, ensure that the user has the `DROP` privilege. For more information about how to view the privileges of the current user, see [View User Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have the privilege, contact the administrator to grant the privilege. For more information about how to directly grant privileges, see [Directly Grant Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
delete_database(name,tenant=DEFAULT_TENANT)
```
|Parameter|Type|Required|Description|Example Value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the database to be deleted. |my_database|
|`tenant`|string|No<ul><li>If you are using embedded seekdb or server mode of seekdb, you do not need to specify this parameter.</li><li>If you are using OceanBase Database, this parameter is required. The default value is `sys`.</li></ul>|The tenant to which the database belongs. |test_tenant|
## Request example
```python
import pyseekdb
# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")
# Delete database
admin.delete_database("my_database")
```
## Response parameters
None
## References
* [Create a database](200.create-database-of-api.md)
* [Get a specific database](300.get-database-of-api.md)
* [Obtain a database list](400.list-database-of-api.md)

View File

@@ -0,0 +1,93 @@
---
slug: /create-collection-of-api
---
# create_collection - Create a collection
`create_collection()` is used to create a new collection, which is a table in the database.
:::info
This API is only available when you are connected to the database using a client. For more information about the client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, make sure that the user has the `CREATE` privilege. For more information about how to view the privileges of the current user, see [View user privileges](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971368). If the user does not have the privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly grant privileges](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974754).
## Define the table name
When creating a table, you must first define its name. The following requirements apply when defining the table name:
* In seekdb, each table name must be unique within the database.
* The table name cannot exceed 64 characters.
* We recommend that you give the table a meaningful name instead of using generic names such as t1 or table1. For more information about table naming conventions, see [Table naming conventions](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003977289).
## Request parameters
```python
create_collection(name = name,configuration = configuration, embedding_function = embedding_function )
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to be created. |my_collection|
|`configuration`|HNSWConfiguration|No|The index configuration, which specifies the dimension and distance metric. If not provided, the default values `dimension=384` and `distance='cosine'` are used. If set to `None`, the dimension is calculated from the `embedding_function` value. |HNSWConfiguration(dimension=384, distance='cosine')|
|`embedding_function`|EmbeddingFunction|No|The function used to convert data into vectors. If not provided, `DefaultEmbeddingFunction()` (384 dimensions) is used. If set to `None`, the collection has no embedding function and vectors must be provided manually. If a function is provided, its output dimension must match `configuration.dimension`.|DefaultEmbeddingFunction()|
:::info
When you provide `embedding_function`, the system will automatically calculate the vector dimension by calling this function. If you also provide `configuration.dimension`, it must match the dimension of `embedding_function`. Otherwise, a ValueError will be raised.
:::
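To make the dimension rule concrete, here is a minimal sketch in plain Python. It assumes that an embedding function is simply a callable that maps a list of texts to a list of equal-length vectors. The class `ToyHashEmbeddingFunction` and the helper `check_dimension` are hypothetical and only mimic the check described above; they are not part of pyseekdb.

```python
import hashlib


class ToyHashEmbeddingFunction:
    """A stand-in embedding function that hashes text into a fixed-size vector."""

    def __init__(self, dimension=8):
        # A SHA-256 digest has 32 bytes, so this toy supports at most 32 dimensions.
        if not 1 <= dimension <= 32:
            raise ValueError("dimension must be between 1 and 32")
        self.dimension = dimension

    def __call__(self, texts):
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dimension` bytes to floats in [0, 1].
            vectors.append([b / 255 for b in digest[: self.dimension]])
        return vectors


def check_dimension(embedding_function, configured_dimension):
    """Probe the function once and compare its output size with the configuration."""
    actual = len(embedding_function(["probe"])[0])
    if actual != configured_dimension:
        raise ValueError(
            f"configuration.dimension={configured_dimension} does not match "
            f"embedding function dimension={actual}"
        )
    return actual


ef = ToyHashEmbeddingFunction(dimension=8)
print(check_dimension(ef, 8))  # 8
```

Calling `check_dimension(ef, 384)` with this toy function raises `ValueError`, which is the kind of mismatch the note above warns about.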
## Request example
```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration
# Create a client
client = pyseekdb.Client()
# Create a collection with default embedding function (auto-calculates dimension)
collection = client.create_collection(
name="my_collection"
)
# Create a collection with custom embedding function
ef = UserDefinedEmbeddingFunction()  # Define your own embedding function (see section 6)
config = HNSWConfiguration(dimension=384, distance='cosine') # Must match EF dimension
collection = client.create_collection(
name="my_collection2",
configuration=config,
embedding_function=ef
)
# Create a collection without embedding function (vectors must be provided manually)
collection = client.create_collection(
name="my_collection3",
configuration=HNSWConfiguration(dimension=384, distance='cosine'),
embedding_function=None # Explicitly disable embedding function
)
```
## Response parameters
None
## References
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a collection list](300.list-collection-of-api.md)
* [Count the number of collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,89 @@
---
slug: /get-collection-of-api
---
# get_collection - Get a collection
The `get_collection()` function is used to retrieve a specified collection.
:::info
This API is only available when connected using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).
* The collection you want to retrieve exists. If the collection does not exist, an error will be returned.
## Request parameters
```python
client.get_collection(name,configuration = configuration,embedding_function = embedding_function)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to retrieve. |my_collection|
|`configuration`|HNSWConfiguration|No|The index configuration, which specifies the dimension and distance metric. If not provided, the default value `dimension=384, distance='cosine'` will be used. If set to `None`, the dimension will be calculated from the `embedding_function` value. |HNSWConfiguration(dimension=384, distance='cosine')|
|`embedding_function`|EmbeddingFunction|No|The function used to convert text to vectors. If not provided, `DefaultEmbeddingFunction()` (384 dimensions) will be used. If set to `None`, the collection will not contain an embedding function. If an embedding function is provided, it will be calculated based on `configuration.dimension`.|DefaultEmbeddingFunction()|
:::info
When vectors are not provided for documents/texts, the embedding function set here will be used for all operations on this collection, including add, upsert, update, query, and hybrid_search.
:::
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
# Get an existing collection (uses default embedding function if collection doesn't have one)
collection = client.get_collection("my_collection")
print(f"Database: {collection.name}, dimension: {collection.dimension}, embedding_function:{collection.embedding_function}, distance:{collection.distance}, metadata:{collection.metadata}")
# Get collection with specific embedding function
ef = UserDefinedEmbeddingFunction()  # Define your own embedding function (see section 6)
collection = client.get_collection("my_collection", embedding_function=ef)
print(f"Database: {collection.name}, dimension: {collection.dimension}, embedding_function:{collection.embedding_function}, distance:{collection.distance}, metadata:{collection.metadata}")
# Get collection without embedding function
collection = client.get_collection("my_collection", embedding_function=None)
# Check if collection exists
if client.has_collection("my_collection"):
    collection = client.get_collection("my_collection")
    print(f"Database: {collection.name}, dimension: {collection.dimension}, embedding_function:{collection.embedding_function}, distance:{collection.distance}, metadata:{collection.metadata}")
```
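The existence check at the end of the example can be wrapped in a small helper. The sketch below is duck-typed: it only assumes a client object exposing `has_collection()` and `get_collection()` as used above. The helper name `get_collection_or_none` and the `FakeClient` stand-in (present so the snippet runs without a database) are hypothetical.

```python
def get_collection_or_none(client, name, **kwargs):
    """Return the collection if it exists, otherwise None."""
    if client.has_collection(name):
        return client.get_collection(name, **kwargs)
    return None


class FakeClient:
    """In-memory stand-in for a real client, for demonstration only."""

    def __init__(self, collections):
        self._collections = dict(collections)

    def has_collection(self, name):
        return name in self._collections

    def get_collection(self, name, **kwargs):
        return self._collections[name]


client = FakeClient({"my_collection": "collection-object"})
print(get_collection_or_none(client, "my_collection"))
print(get_collection_or_none(client, "missing"))
```

Returning `None` instead of raising lets calling code decide whether a missing collection is an error or a cue to create one.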
## Response parameters
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to query. |my_collection|
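|`dimension`|int|No|The vector dimension of the collection. |384|
|`embedding_function`|EmbeddingFunction|No|The embedding function associated with the collection. |DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')|
|`distance`|string|No|The distance metric used by the collection. |cosine|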
|`metadata`|dict|No|Reserved field, currently no data| {} |
## Response example
```python
Database: my_collection, dimension: 384, embedding_function:DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2'), distance:cosine, metadata:{}
Database: my_collection1, dimension: 384, embedding_function:DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2'), distance:cosine, metadata:{}
```
## References
* [Create a collection](100.create-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a list of collections](300.list-collection-of-api.md)
* [Count the number of collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,79 @@
---
slug: /get-or-create-collection-of-api
---
# get_or_create_collection - Create or query a collection
The `get_or_create_collection()` function gets a collection by name. If the collection does not exist in the database, it is created first; if it already exists, the existing collection is returned.
:::info
This API is only available when using a client. For more information about the client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, ensure that the connected user has the `CREATE` privilege. For more information about how to check the privileges of the current user, see [Check User Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly Grant Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Define a table name
When creating a table, you need to define a table name. The following requirements must be met:
* In seekdb, each table name must be unique within the database.
* The table name must be no longer than 64 characters.
* It is recommended to use meaningful names for tables instead of generic names like t1 or table1. For more information about table naming conventions, see [Table Naming Conventions](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003977289).
## Request parameters
```python
get_or_create_collection(name = name,configuration = configuration, embedding_function = embedding_function )
```
|Parameter|Value Type|Required|Description|Example Value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to be retrieved or created. |my_collection|
|`configuration`|HNSWConfiguration|No|The index configuration with dimension and distance metric. If not provided, the default value is used, which is `dimension=384, distance='cosine'`. If set to `None`, the dimension will be calculated from the `embedding_function` value. |HNSWConfiguration(dimension=384, distance='cosine')|
|`embedding_function`|EmbeddingFunction|No|The function used to convert data into vectors. If not provided, `DefaultEmbeddingFunction()` (384 dimensions) is used. If set to `None`, the collection will not include embedding functionality. If an embedding function is provided, its output dimension must match `configuration.dimension`. |DefaultEmbeddingFunction()|
:::info
When `embedding_function` is provided, the system will automatically calculate the vector dimension by calling the function. If `configuration.dimension` is also provided, it must match the dimension of `embedding_function`, otherwise a ValueError will be raised.
:::
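The dimension check described in the note can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the pyseekdb implementation: `resolve_dimension` and `toy_embedding` are hypothetical names, and probing the function with a sample text is only one plausible way to discover its output size.

```python
from typing import Callable, Optional, Sequence


def resolve_dimension(
    embed: Optional[Callable[[Sequence[str]], list]],
    configured: Optional[int],
) -> Optional[int]:
    """Return the dimension to use, raising ValueError on a mismatch."""
    if embed is None:
        # No embedding function: only the configured dimension is available.
        return configured
    # Call the function once and measure the length of the vector it returns.
    actual = len(embed(["probe"])[0])
    if configured is not None and configured != actual:
        raise ValueError(
            f"configuration.dimension is {configured}, "
            f"but embedding_function produces {actual}-dimensional vectors"
        )
    return actual


def toy_embedding(texts: Sequence[str]) -> list:
    """Hypothetical embedding function that always returns 3 numbers per text."""
    return [[float(len(text)), 0.0, 1.0] for text in texts]


print(resolve_dimension(toy_embedding, None))  # dimension taken from the function
print(resolve_dimension(toy_embedding, 3))     # configured value agrees
try:
    resolve_dimension(toy_embedding, 384)      # configured value disagrees
except ValueError as error:
    print(f"ValueError: {error}")
```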
## Request example
```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration
# Create a client
client = pyseekdb.Client()
# Get or create collection (creates if doesn't exist)
collection = client.get_or_create_collection(
    name="my_collection4",
    configuration=HNSWConfiguration(dimension=384, distance='cosine'),
    embedding_function=DefaultEmbeddingFunction()
)
```
## Response parameters
None
## References
* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Get a list of collections](300.list-collection-of-api.md)
* [Count collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,65 @@
---
slug: /list-collection-of-api
---
# list_collections - Get a list of collections
The `list_collections()` API is used to obtain all collections.
:::info
This API is supported only when you use a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
## Request parameters
```python
client.list_collections()
```
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
# List all collections
collections = client.list_collections()
for coll in collections:
    print(f"Collection: {coll.name}, Dimension: {coll.dimension}, embedding_function: {coll.embedding_function}, distance: {coll.distance}, metadata: {coll.metadata}")
```
## Response parameters
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried collection. |my_collection|
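|`dimension`|int|No|The vector dimension of the collection. | 384 |
|`embedding_function`|EmbeddingFunction|No|The embedding function associated with the collection. |DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')|
|`distance`|string|No|The distance metric of the collection. |cosine|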
|`metadata`|dict|No|Reserved field. No data is returned. | {} |
## Response example
```python
Collection: my_collection, Dimension: 384, embedding_function: DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2'), distance: cosine, metadata: {}
```
## References
* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Count collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,56 @@
---
slug: /count-collection-of-api
---
# count_collection - Count the number of collections
The `count_collection()` method is used to count the number of collections in the database.
:::info
This API is only available when you are connected to the database using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
## Request parameters
```python
client.count_collection()
```
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
# Count collections in database
collection_count = client.count_collection()
print(f"Database has {collection_count} collections")
```
## Return parameters
An integer that indicates the number of collections in the database.
## Return example
```python
Database has 1 collections
```
## Related operations
* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a collection list](300.list-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,55 @@
---
slug: /delete-collection-of-api
---
# delete_collection - Delete a Collection
The `delete_collection()` method is used to delete a specified Collection.
:::info
This API is only available when you are connected to the database using a client. For more information about the client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* The Collection you want to delete exists. If the Collection does not exist, an error will be returned.
## Request parameters
```python
client.delete_collection(name)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the Collection to be deleted. |my_collection|
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
# Delete a collection
client.delete_collection("my_collection")
```
## Response parameters
None
## References
* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a collection list](300.list-collection-of-api.md)
* [Count the number of collections](350.count-collection-of-api.md)

View File

@@ -0,0 +1,18 @@
---
slug: /collection-overview-of-api
---
# Manage collections
In pyseekdb, a collection is a set similar to a table in a database. You can create, query, and delete collections.
The following API interfaces are supported for managing collections.
| API interface | Description | Documentation |
|---|---|---|
| `create_collection()` | Creates a collection. | [Documentation](100.create-collection-of-api.md) |
| `get_collection()` | Gets a specified collection. |[Documentation](200.get-collection-of-api.md)|
| `get_or_create_collection()` | Creates or queries a collection. If the collection does not exist in the database, it is created. If the collection exists, the corresponding result is obtained. |[Documentation](250.get-or-create-collection-of-api.md)|
| `list_collections()` | Gets the collection list of a database. |[Documentation](300.list-collection-of-api.md)|
| `count_collection()` | Counts the number of collections in a database. |[Documentation](350.count-collection-of-api.md)|
| `delete_collection()` | Deletes a specified collection.|[Documentation](400.delete-collection-of-api.md)|

View File

@@ -0,0 +1,16 @@
---
slug: /dml-overview-of-api
---
# DML operations
DML (Data Manipulation Language) operations allow you to insert, update, and delete data in a collection.
For DML operations, you can use the following APIs.
| API | Description | Documentation |
|---|---|---|
| `add()` | Inserts a new record into a collection. | [Documentation](200.add-data-of-api.md) |
| `update()` | Updates an existing record in a collection. |[Documentation](300.update-data-of-api.md)|
| `upsert()` | Inserts a new record or updates an existing record. |[Documentation](400.upsert-data-of-api.md)|
| `delete()` | Deletes a record from a collection.|[Documentation](500.delete-data-of-api.md)|

View File

@@ -0,0 +1,117 @@
---
slug: /add-data-of-api
---
# add - Insert data
The `add()` method inserts new data into a collection. If a record with the same ID already exists, an error is returned.
:::info
This API is only available when using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, make sure that the connected user has the `INSERT` privilege on the target table. For more information about how to view the privileges of the current user, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have the required privilege, contact the administrator to grant it. For more information about how to directly grant a privilege, see [Directly grant a privilege](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
add(
ids=ids,
embeddings=embeddings,
documents=documents,
metadatas=metadatas
)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Yes|The ID of the data to be inserted. You can specify a single ID or an array of IDs.|item1|
|`embeddings`|List[float] or List[List[float]]|No|The vector or vectors of the data to be inserted. If you specify this parameter, the value of `embedding_function` is ignored. If you do not specify this parameter, you must specify `documents`, and the `collection` must have an `embedding_function`.|[0.1, 0.2, 0.3]|
|`documents`|string or List[str]|No|The document or documents to be inserted. If you do not specify `embeddings`, `documents` will be converted to vectors using the `embedding_function` of the `collection`.|"This is a document"|
|`metadatas`|dict or List[dict]|No|The metadata or metadata list of the data to be inserted. |`{"category": "AI", "score": 95}`|
:::info
The `embedding_function` associated with the collection is set during `create_collection()` or `get_collection()`. You cannot override it for each operation.
:::
## Request example
```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration
# Create a client
client = pyseekdb.Client()
collection = client.create_collection(
name="my_collection",
configuration=HNSWConfiguration(dimension=3, distance='cosine'),
embedding_function=None
)
# Add single item
collection.add(
ids="item1",
embeddings=[0.1, 0.2, 0.3],
documents="This is a document",
metadatas={"category": "AI", "score": 95}
)
# Add multiple items
collection.add(
ids=["item4", "item2", "item3"],
embeddings=[
[0.1, 0.2, 0.4],
[0.4, 0.5, 0.6],
[0.7, 0.8, 0.9]
],
documents=[
"Document 1",
"Document 2",
"Document 3"
],
metadatas=[
{"category": "AI", "score": 95},
{"category": "ML", "score": 88},
{"category": "DL", "score": 92}
]
)
# Add with only embeddings
collection.add(
ids=["vec1", "vec2"],
embeddings=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
)
collection1 = client.create_collection(
name="my_collection1"
)
# Add with only documents - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
collection1.add(
ids=["doc1", "doc2"],
documents=["Text document 1", "Text document 2"],
metadatas=[{"tag": "A"}, {"tag": "B"}]
)
```
## Response parameters
None
## References
* [Update data](300.update-data-of-api.md)
* [Update or insert data](400.upsert-data-of-api.md)
* [Delete data](500.delete-data-of-api.md)

View File

@@ -0,0 +1,88 @@
---
slug: /update-data-of-api
---
# update - Update data
The `update()` method is used to update existing records in a collection. The record must exist, otherwise an error will be raised.
:::info
This API is only available when using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, make sure that the connected user has the `UPDATE` privilege on the target table. For more information about how to view the privileges of the current user, see [View User Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly Grant Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
update(
ids=ids,
embeddings=embeddings,
documents=documents,
metadatas=metadatas
)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Yes|The ID to be modified. It can be a single ID or an array of IDs.|item1|
|`embeddings`|List[float] or List[List[float]]|No|The new vectors. If provided, they will be used directly (ignoring `embedding_function`). If not provided, you can provide `documents` to automatically generate vectors.|[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]|
|`documents`|string or List[str]|No|The new documents. If `embeddings` are not provided, `documents` will be converted to vectors using the collection's `embedding_function`.|"New document text"|
|`metadatas`|dict or List[dict]|No|The new metadata.|`{"category": "AI"}`|
:::info
You can update `metadatas` alone, without providing `embeddings` or `documents`. When vectors are generated from `documents`, the `embedding_function` associated with the collection is used.
:::
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")
# Update single item
collection.update(
ids="item1",
metadatas={"category": "AI", "score": 98} # Update metadata only
)
# Update multiple items
collection.update(
ids=["item1", "item2"],
embeddings=[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]], # Update embeddings
documents=["Updated document 1", "Updated document 2"] # Update documents
)
# Update with documents only - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
collection1.update(
ids="doc1",
documents="New document text", # Embeddings will be auto-generated
metadatas={"category": "AI"}
)
```
## Response parameters
None
## References
* [Insert data](200.add-data-of-api.md)
* [Update or insert data](400.upsert-data-of-api.md)
* [Delete data](500.delete-data-of-api.md)

View File

@@ -0,0 +1,93 @@
---
slug: /upsert-data-of-api
---
# upsert - Update or insert data
The `upsert()` method is used to insert new records or update existing records. If a record with the given ID already exists, it will be updated; otherwise, a new record will be inserted.
:::info
This API is only available when using a Client connection. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, ensure that the connected user has the `INSERT` and `UPDATE` privileges on the target table. For more information about how to view the current user privileges, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have the required privileges, contact the administrator to grant them. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
upsert(
ids=ids,
embeddings=embeddings,
documents=documents,
metadatas=metadatas
)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Yes|The ID to be added or modified. It can be a single ID or an array of IDs.|item1|
|`embeddings`|List[float] or List[List[float]]|No|The vectors. If provided, they will be used directly (ignoring `embedding_function`). If not provided, you can provide `documents` to automatically generate vectors.|[0.1, 0.2, 0.3]|
|`documents`|string or List[str]|No|The documents. If `embeddings` are not provided, `documents` will be converted to vectors using the collection's `embedding_function`.|"Document text"|
|`metadatas`|dict or List[dict]|No|The metadata. |`{"category": "AI"}`|
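The difference between `add()`, `update()`, and `upsert()` comes down to how each one treats an ID that is already present. The sketch below models this with a plain dictionary. It is an illustration only: `ToyCollection` is a hypothetical class, the use of `ValueError` is an assumption (the documentation only says that an error is raised), and replacing the whole record on update is a simplification.

```python
class ToyCollection:
    """In-memory stand-in for a collection; maps record ID to document text."""

    def __init__(self):
        self.records = {}

    def add(self, record_id, document):
        # add(): the ID must not exist yet.
        if record_id in self.records:
            raise ValueError(f"record {record_id!r} already exists")
        self.records[record_id] = document

    def update(self, record_id, document):
        # update(): the ID must already exist.
        if record_id not in self.records:
            raise ValueError(f"record {record_id!r} does not exist")
        self.records[record_id] = document

    def upsert(self, record_id, document):
        # upsert(): insert when the ID is absent, overwrite when it is present.
        self.records[record_id] = document


toy = ToyCollection()
toy.upsert("item1", "Document text")  # inserted, because the ID was absent
toy.upsert("item1", "Revised text")   # updated, because the ID was present
print(toy.records)
```

Because `upsert()` never fails on the presence or absence of an ID, it is the convenient choice for idempotent loading jobs that may be re-run.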
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")
# Upsert single item (insert or update)
collection.upsert(
ids="item1",
embeddings=[0.1, 0.2, 0.3],
documents="Document text",
metadatas={"category": "AI", "score": 95}
)
# Upsert multiple items
collection.upsert(
ids=["item1", "item2", "item3"],
embeddings=[
[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6],
[0.7, 0.8, 0.9]
],
documents=["Doc 1", "Doc 2", "Doc 3"],
metadatas=[
{"category": "AI"},
{"category": "ML"},
{"category": "DL"}
]
)
# Upsert with documents only - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
collection1.upsert(
ids=["item1", "item2"],
documents=["Document 1", "Document 2"],
metadatas=[{"category": "AI"}, {"category": "ML"}]
)
```
## Response parameters
None
## References
* [Insert data](200.add-data-of-api.md)
* [Update data](300.update-data-of-api.md)
* [Delete data](500.delete-data-of-api.md)

View File

@@ -0,0 +1,87 @@
---
slug: /delete-data-of-api
---
# delete - Delete data
`delete()` is used to delete records from a collection. You can delete records by ID, metadata filter, or document filter.
:::info
This API is only available when you are connected to the database using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, make sure that the connected user has the `DELETE` privilege on the target table. For more information about how to view the privileges of the current user, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
delete(
    ids=ids,
    where=where,
    where_document=where_document
)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Optional|The ID of the record to be deleted. You can specify a single ID or an array of IDs.|item1|
|`where`|dict|Optional|The metadata filter.|`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|Optional|The document filter.|`{"$contains": "obsolete"}`|
:::info
At least one of the `ids`, `where`, or `where_document` parameters must be specified.
:::
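The rule that at least one selector must be given, and the way the three selectors narrow the set of records, can be sketched in plain Python. This is not the pyseekdb implementation: `toy_delete` is a hypothetical helper, only the `$eq` and `$contains` operators are handled, and combining several selectors with AND is an assumption.

```python
def toy_delete(records, ids=None, where=None, where_document=None):
    """Delete matching entries from `records` and return the deleted IDs.

    `records` maps a record ID to {"document": str, "metadata": dict}.
    """
    if ids is None and where is None and where_document is None:
        raise ValueError("specify at least one of ids, where, or where_document")
    if isinstance(ids, str):
        ids = [ids]  # accept a single ID as well as a list of IDs

    def matches(record_id, record):
        if ids is not None and record_id not in ids:
            return False
        if where is not None:
            for field, condition in where.items():
                # Only the $eq operator is handled in this sketch.
                if record["metadata"].get(field) != condition["$eq"]:
                    return False
        if where_document is not None:
            if where_document["$contains"] not in (record["document"] or ""):
                return False
        return True

    doomed = [rid for rid, rec in records.items() if matches(rid, rec)]
    for rid in doomed:
        del records[rid]
    return doomed


store = {
    "item1": {"document": "obsolete note", "metadata": {"category": "AI"}},
    "item2": {"document": "current note", "metadata": {"category": "AI"}},
    "item3": {"document": "obsolete draft", "metadata": {"category": "ML"}},
}
print(toy_delete(store, where={"category": {"$eq": "AI"}},
                 where_document={"$contains": "obsolete"}))
print(sorted(store))
```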
## Request examples
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
# Delete by IDs
collection.delete(ids=["item1", "item2", "item3"])
# Delete by single ID
collection.delete(ids="item1")
# Delete by metadata filter
collection.delete(where={"category": {"$eq": "AI"}})
# Delete by comparison operator
collection.delete(where={"score": {"$lt": 50}})
# Delete by document filter
collection.delete(where_document={"$contains": "obsolete"})
# Delete with combined filters
collection.delete(
where={"category": {"$eq": "AI"}},
where_document={"$contains": "deprecated"}
)
```
## Response parameters
None
## References
* [Insert data](200.add-data-of-api.md)
* [Update data](300.update-data-of-api.md)
* [Update or insert data](400.upsert-data-of-api.md)

View File

@@ -0,0 +1,15 @@
---
slug: /dql-overview-of-api
---
# Overview of DQL
DQL (Data Query Language) operations allow you to retrieve data from collections using various query methods.
For DQL operations, the following API interfaces are supported.
| API Interface | Description | Documentation Link |
|---|---|---|
| `query()` | A vector similarity search method. | [Documentation](200.query-interfaces-of-api.md) |
| `get()` | Queries specific data from a table using an ID, document, or metadata (excluding vectors). | [Documentation](300.get-interfaces-of-api.md) |
| `hybrid_search()` | Combines full-text search and vector similarity search using a ranking method. | [Documentation](400.hybrid-search-of-api.md) |

View File

@@ -0,0 +1,161 @@
---
slug: /query-interfaces-of-api
---
# query - Vector query
The `query()` method is used to perform vector similarity search to find the most similar documents to the query vector.
:::info
This interface is only available when using the Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* You have created a collection and inserted data. For more information about how to create a collection and insert data, see [create_collection - Create a collection](../200.collection/100.create-collection-of-api.md) and [add - Insert data](../300.dml/200.add-data-of-api.md).
## Request parameters
```python
query()
```
|Parameter|Value type|Required|Description|Example value|
|---|---|---|---|---|
|`query_embeddings`|List[float] or List[List[float]] |No|A single vector or a list of vectors for batch queries. If provided, it is used directly and `embedding_function` is ignored. If not provided, `query_texts` must be provided and the `collection` must have an `embedding_function`.|[1.0, 2.0, 3.0]|
|`query_texts`|str or List[str]|No|A single text or a list of texts to query. The texts are converted to vectors using the `embedding_function` of the `collection`. Required if `query_embeddings` is not provided.|["my query text"]|
|`n_results`|int|No|The number of similar results to return. The default value is 10.|3|
|`where`|dict |No|Metadata filter conditions.|`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|No|Document filter conditions.|`{"$contains": "machine"}`|
|`include`|List[str]|No|List of fields to include: `["documents", "metadatas", "embeddings"]`|["documents", "metadatas", "embeddings"]|
:::info
The `embedding_function` used is associated with the collection (set during `create_collection()` or `get_collection()`). You cannot override it for each operation.
:::
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")
# Basic vector similarity query (embedding_function not used)
results = collection.query(
query_embeddings=[1.0, 2.0, 3.0],
n_results=3
)
# Iterate over results
for i in range(len(results["ids"][0])):
    print(f"ID: {results['ids'][0][i]}, Distance: {results['distances'][0][i]}")
    if results.get("documents"):
        print(f"Document: {results['documents'][0][i]}")
    if results.get("metadatas"):
        print(f"Metadata: {results['metadatas'][0][i]}")
# Query by texts - vectors auto-generated by embedding_function
# Requires: collection must have embedding_function set
results = collection1.query(
query_texts=["my query text"],
n_results=10
)
# The collection's embedding_function will automatically convert query_texts to query_embeddings
# Query by multiple texts (batch query)
results = collection1.query(
query_texts=["query text 1", "query text 2"],
n_results=5
)
# Returns dict with lists of lists, one list per query text
for i in range(len(results["ids"])):
    print(f"Query {i}: {len(results['ids'][i])} results")
# Query with metadata filter (using query_texts)
results = collection1.query(
query_texts=["AI research"],
where={"category": {"$eq": "AI"}},
n_results=5
)
# Query with comparison operator (using query_texts)
results = collection1.query(
query_texts=["machine learning"],
where={"score": {"$gte": 90}},
n_results=5
)
# Query with document filter (using query_texts)
results = collection1.query(
query_texts=["neural networks"],
where_document={"$contains": "machine learning"},
n_results=5
)
# Query with combined filters (using query_texts)
results = collection1.query(
query_texts=["AI research"],
where={"category": {"$eq": "AI"}, "score": {"$gte": 90}},
where_document={"$contains": "machine"},
n_results=5
)
# Query with multiple vectors (batch query)
results = collection.query(
query_embeddings=[[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
n_results=2
)
# Returns dict with lists of lists, one list per query vector
for i in range(len(results["ids"])):
    print(f"Query {i}: {len(results['ids'][i])} results")
# Query with specific fields
results = collection.query(
query_embeddings=[1.0, 2.0, 3.0],
include=["documents", "metadatas", "embeddings"],
n_results=3
)
```
## Return parameters
|Parameter|Value type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|List[List[str]] |Yes|The IDs of the matching records, one inner list per query.|`[["vec1", "vec2"]]`|
|`embeddings`|List[List[List[float]]]|No|The vectors of the matching records.|`[[[1.0, 2.0, 3.0]]]`|
|`documents`|List[List[str]]|No|The documents of the matching records.|`[["Document text"]]`|
|`metadatas`|List[List[Dict]]|No|The metadata of the matching records.|`[[{"category": "AI"}]]`|
|`distances`|List[List[float]]|No|The distance between each matching record and the query vector.|`[[0.0, 0.025368153802923787]]`|
## Return example
```python
ID: vec1, Distance: 0.0
Document: None
Metadata: {}
ID: vec2, Distance: 0.025368153802923787
Document: None
Metadata: {}
Query 0: 4 results
Query 1: 4 results
Query 0: 2 results
Query 1: 2 results
```
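The distances in the return example are consistent with cosine distance defined as one minus cosine similarity, applied to the query vector `[1.0, 2.0, 3.0]` and the stored vectors `[1.0, 2.0, 3.0]` and `[4.0, 5.0, 6.0]` from the `add()` example. The following plain-Python sketch reproduces the numbers; it assumes that definition and is not taken from the seekdb source.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 2.0, 3.0]
# Identical direction: distance is zero up to floating-point rounding.
print(cosine_distance(query, [1.0, 2.0, 3.0]))
# Slightly different direction: approximately 0.02537.
print(cosine_distance(query, [4.0, 5.0, 6.0]))
```

Cosine distance depends only on direction, so scaling a vector by a positive constant does not change its distance to the query.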
## Related operations
* [get - Retrieve](300.get-interfaces-of-api.md)
* [Hybrid search](400.hybrid-search-of-api.md)
* [Operators](500.filter-operators-of-api.md)

View File

@@ -0,0 +1,127 @@
---
slug: /get-interfaces-of-api
---
# get - Retrieve
`get()` is used to retrieve documents from a collection without performing vector similarity search.
It supports filtering by IDs, metadata, and documents.
:::info
This interface is only available when using the Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* You have created a collection and inserted data. For more information about how to create a collection and insert data, see [create_collection - Create a collection](../200.collection/100.create-collection-of-api.md) and [add - Insert data](../300.dml/200.add-data-of-api.md).
## Request parameters
```python
get()
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|str or List[str]|No|The ID or list of IDs to retrieve.|`["1", "2", "3"]`|
|`where`|dict |No|The metadata filter. |`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|No|The document filter. |`{"$contains": "machine"}`|
|`limit`|int|No|The maximum number of results to return. |10|
|`offset`|int|No|The number of results to skip for pagination. |1|
|`include`|List[str]|No|The list of fields to include: `["documents", "metadatas", "embeddings"]`. |["documents", "metadatas", "embeddings"]|
:::info
If no parameters are provided, all data is returned.
:::
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
# Get by single ID
results = collection.get(ids="123")
# Get by multiple IDs
results = collection.get(ids=["1", "2", "3"])
# Get by metadata filter
results = collection.get(
where={"category": {"$eq": "AI"}},
limit=10
)
# Get by comparison operator
results = collection.get(
where={"score": {"$gte": 90}},
limit=10
)
# Get by $in operator
results = collection.get(
where={"tag": {"$in": ["ml", "python"]}},
limit=10
)
# Get by logical operators ($or)
results = collection.get(
where={
"$or": [
{"category": {"$eq": "AI"}},
{"tag": {"$eq": "python"}}
]
},
limit=10
)
# Get by document content filter
results = collection.get(
where_document={"$contains": "machine learning"},
limit=10
)
# Get with combined filters
results = collection.get(
where={"category": {"$eq": "AI"}},
where_document={"$contains": "machine"},
limit=10
)
# Get with pagination
results = collection.get(limit=2, offset=1)
# Get with specific fields
results = collection.get(
ids=["1", "2"],
include=["documents", "metadatas", "embeddings"]
)
# Get all data (up to limit)
results = collection.get(limit=100)
```
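To make the meaning of the `where` operators and of `limit` and `offset` in the example above concrete, the sketch below evaluates them against an in-memory list. It is an illustration in plain Python, not the pyseekdb implementation: `matches` and `toy_get` are hypothetical helpers, only the operators shown on this page (`$eq`, `$gte`, `$lt`, `$in`, `$or`) are handled, and combining several keys in one `where` dictionary with AND is an assumption.

```python
def matches(metadata, where):
    """Return True if `metadata` satisfies the `where` filter."""
    for key, condition in where.items():
        if key == "$or":
            # condition is a list of sub-filters; one match is enough.
            if not any(matches(metadata, sub) for sub in condition):
                return False
            continue
        value = metadata.get(key)
        for operator, operand in condition.items():
            if operator == "$eq":
                ok = value == operand
            elif operator == "$gte":
                ok = value is not None and value >= operand
            elif operator == "$lt":
                ok = value is not None and value < operand
            elif operator == "$in":
                ok = value in operand
            else:
                raise ValueError(f"operator {operator} is not handled in this sketch")
            if not ok:
                return False
    return True


def toy_get(rows, where=None, limit=None, offset=0):
    """Filter `rows`, then skip `offset` rows and return at most `limit` rows."""
    selected = [r for r in rows if where is None or matches(r["metadata"], where)]
    end = None if limit is None else offset + limit
    return selected[offset:end]


rows = [
    {"id": "1", "metadata": {"category": "AI", "score": 95, "tag": "ml"}},
    {"id": "2", "metadata": {"category": "ML", "score": 88, "tag": "python"}},
    {"id": "3", "metadata": {"category": "DL", "score": 92, "tag": "go"}},
]
print([r["id"] for r in toy_get(rows, where={"score": {"$gte": 90}})])
print([r["id"] for r in toy_get(rows, limit=2, offset=1)])
```

Filtering happens before pagination in this sketch, so `offset` counts matching rows rather than all rows.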
## Response parameters
* If a single ID is provided: The result contains the get object for that ID.
* If multiple IDs are provided: A list of QueryResult objects, one for each ID.
* If filters are provided: A QueryResult object containing all matching results.
## Related operations
* [Vector query](200.query-interfaces-of-api.md)
* [Hybrid search](400.hybrid-search-of-api.md)
* [Operators](500.filter-operators-of-api.md)

View File

@@ -0,0 +1,140 @@
---
slug: /hybrid-search-of-api
---
# hybrid_search - Hybrid search
`hybrid_search()` combines full-text search and vector similarity search with ranking.
:::info
This API is only available when using the Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* You have created a collection and inserted data. For more information about how to create a collection and insert data, see [create_collection - Create a collection](../200.collection/100.create-collection-of-api.md) and [add - Insert Data](../300.dml/200.add-data-of-api.md).
## Request parameters
```python
hybrid_search(
query={
"where_document": ,
"where": ,
"n_results":
},
knn={
"query_texts": ,
"where": ,
"n_results":
},
rank=,
n_results=,
include=
)
```
* query: full-text search configuration, including the following parameters:
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`where`|dict |Optional|Metadata filter conditions. |`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|Optional|Document filter conditions. |`{"$contains": "machine"}`|
|`n_results`|int|Yes|Number of results for full-text search.||
* knn: vector search configuration, including the following parameters:
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`query_embeddings`|List[float] or List[List[float]] |Optional|A single vector or list of vectors for batch queries. If provided, it is used directly and `embedding_function` is ignored. If not provided, `query_texts` must be provided and the `collection` must have an `embedding_function`.|[1.0, 2.0, 3.0]|
|`query_texts`|str or List[str]|Optional|A single text or list of texts to query. The texts are converted to vectors using the `embedding_function` of the `collection`. Required if `query_embeddings` is not provided.|["my query text"]|
|`where`|dict |Optional|Metadata filter conditions. |`{"category": {"$eq": "AI"}}`|
|`n_results`|int|Yes|Number of results for vector search.||
* Other parameters are as follows:
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`rank`|dict |Optional|Ranking configuration, for example: `{"rrf": {"rank_window_size": 60, "rank_constant": 60}}`.|`{"rrf": {}}`|
|`n_results`|int|Yes|Number of results to return. The default value is 10.|`3`|
|`include`|List[str]|Optional|List of fields to include in the results, for example `["documents", "metadatas", "embeddings"]`.|`["documents", "metadatas", "embeddings"]`|
:::info
The `embedding_function` used is associated with the collection (set during `create_collection()` or `get_collection()`). You cannot override it for each operation.
:::
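The `rrf` ranking option refers to Reciprocal Rank Fusion. The following is a minimal sketch of the standard RRF formula, in which each document scores `1 / (rank_constant + rank)` in every list where it appears and the scores are summed. It assumes seekdb follows this standard formula, which this document does not confirm. The function name `rrf_fuse` and the sample IDs are hypothetical, and `rank_window_size` is not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    rank_constant: int = 60,
    n_results: int = 10,
) -> List[str]:
    """Fuse several ranked ID lists with standard Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best hit in each list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties are broken by ID to keep the order stable.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:n_results]


full_text_hits = ["a", "b", "c"]  # hypothetical full-text ranking
vector_hits = ["c", "a", "d"]     # hypothetical vector ranking
fused = rrf_fuse([full_text_hits, vector_hits], rank_constant=60, n_results=3)
print(fused)
```

In this sketch, the per-branch `n_results` values correspond to the length of each input list, and the top-level `n_results` limits the fused list.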
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")
# Hybrid search with query_embeddings (embedding_function not used)
results = collection.hybrid_search(
    query={
        "where_document": {"$contains": "machine learning"},
        "n_results": 10
    },
    knn={
        "query_embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # Used directly
        "n_results": 10
    },
    rank={"rrf": {}},
    n_results=5
)
# Hybrid search with both full-text and vector search (using query_texts)
results = collection1.hybrid_search(
    query={
        "where_document": {"$contains": "machine learning"},
        "where": {"category": {"$eq": "science"}},
        "n_results": 10
    },
    knn={
        "query_texts": ["AI research"],  # Will be embedded automatically
        "where": {"year": {"$gte": 2020}},
        "n_results": 10
    },
    rank={"rrf": {}},  # Reciprocal Rank Fusion
    n_results=5,
    include=["documents", "metadatas", "embeddings"]
)
# Hybrid search with multiple query texts (batch)
results = collection1.hybrid_search(
    query={
        "where_document": {"$contains": "AI"},
        "n_results": 10
    },
    knn={
        "query_texts": ["machine learning", "neural networks"],  # Multiple queries
        "n_results": 10
    },
    rank={"rrf": {}},
    n_results=5
)
```
## Return parameters
A dictionary that contains the search results, such as the IDs, distances, metadata, and documents of the matched records.
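
As a rough illustration of how such a dictionary can be consumed, the sketch below flattens one query's hits into a list of per-record dictionaries. It assumes the nested per-query layout that the `query()` examples in these docs index as `results['ids'][0][i]`; whether `hybrid_search()` returns exactly the same layout is an assumption. The helper name `iter_hits` and the sample data are hypothetical.

```python
from typing import Any, Dict, Iterator


def iter_hits(results: Dict[str, Any], query_index: int = 0) -> Iterator[Dict[str, Any]]:
    """Yield one dict per hit for the given query, skipping absent fields."""
    ids = results["ids"][query_index]
    for position, doc_id in enumerate(ids):
        hit: Dict[str, Any] = {"id": doc_id}
        for key in ("documents", "metadatas", "distances"):
            column = results.get(key)
            if column is not None:
                hit[key] = column[query_index][position]
        yield hit


# Hand-written stand-in for a search result (not real output).
sample = {
    "ids": [["1", "3"]],
    "documents": [["Hello world", "Sentence transformers are great"]],
    "metadatas": [[{"source": "test1"}, {"source": "test3"}]],
    "distances": [[0.12, 0.48]],
}
hits = list(iter_hits(sample))
for hit in hits:
    print(hit)
```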
## Related operations
* [Vector query](200.query-interfaces-of-api.md)
* [get - Retrieve](300.get-interfaces-of-api.md)
* [Operators](500.filter-operators-of-api.md)

View File

@@ -0,0 +1,151 @@
---
slug: /filter-operators-of-api
---
# Operators
Operators are used to connect operands or parameters and return results. In terms of syntax, operators can appear before, after, or between operands.
## Operator examples
### Data filtering (where)
#### Equal to
Use `$eq` to indicate equal to, as shown in the following example:
```python
where={"category": {"$eq": "AI"}}
```
#### Not equal to
Use `$ne` to indicate not equal to, as shown in the following example:
```python
where={"status": {"$ne": "deleted"}}
```
#### Greater than
Use `$gt` to indicate greater than, as shown in the following example:
```python
where={"score": {"$gt": 90}}
```
#### Greater than or equal to
Use `$gte` to indicate greater than or equal to, as shown in the following example:
```python
where={"score": {"$gte": 90}}
```
#### Less than
Use `$lt` to indicate less than, as shown in the following example:
```python
where={"score": {"$lt": 50}}
```
#### Less than or equal to
Use `$lte` to indicate less than or equal to, as shown in the following example:
```python
where={"score": {"$lte": 50}}
```
#### Contains
Use `$in` to match records whose field value is contained in the given list, as shown in the following example:
```python
where={"tag": {"$in": ["ml", "python", "ai"]}}
```
#### Does not contain
Use `$nin` to match records whose field value is not contained in the given list, as shown in the following example:
```python
where={"tag": {"$nin": ["deprecated", "old"]}}
```
#### Logical OR
Use `$or` to indicate logical OR, as shown in the following example:
```python
where={
    "$or": [
        {"category": {"$eq": "AI"}},
        {"tag": {"$eq": "python"}}
    ]
}
```
#### Logical AND
Use `$and` to indicate logical AND, as shown in the following example:
```python
where={
    "$and": [
        {"category": {"$eq": "AI"}},
        {"score": {"$gte": 90}}
    ]
}
```
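
To make the semantics of the `where` operators above concrete, here is a small pure-Python evaluator that checks one metadata dictionary against a filter. It is only a sketch of the intended meaning, not how seekdb evaluates filters internally. The function name `matches` is hypothetical, and treating a missing field as "no match" is an assumption.

```python
from typing import Any, Callable, Dict

# One comparison function per operator shown above.
_COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": lambda value, target: value == target,
    "$ne": lambda value, target: value != target,
    "$gt": lambda value, target: value > target,
    "$gte": lambda value, target: value >= target,
    "$lt": lambda value, target: value < target,
    "$lte": lambda value, target: value <= target,
    "$in": lambda value, target: value in target,
    "$nin": lambda value, target: value not in target,
}


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies the `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # assumption: a missing field never matches
            value = metadata[key]
            for operator, target in condition.items():
                if not _COMPARATORS[operator](value, target):
                    return False
    return True


record = {"category": "AI", "score": 95, "tag": "python"}
print(matches(record, {"$and": [{"category": {"$eq": "AI"}}, {"score": {"$gte": 90}}]}))
```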
### Text filtering (where_document)
#### Full-text search (contains substring)
Use `$contains` to indicate full-text search, as shown in the following example:
```python
where_document={"$contains": "machine learning"}
```
#### Regular expression
Use `$regex` to indicate regular expression, as shown in the following example:
```python
where_document={"$regex": "pattern.*"}
```
#### Logical OR
Use `$or` to indicate logical OR, as shown in the following example:
```python
where_document={
    "$or": [
        {"$contains": "machine learning"},
        {"$contains": "artificial intelligence"}
    ]
}
```
#### Logical AND
Use `$and` to indicate logical AND, as shown in the following example:
```python
where_document={
    "$and": [
        {"$contains": "machine"},
        {"$contains": "learning"}
    ]
}
```
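
The `where_document` operators can be read the same way. The sketch below evaluates a filter against a single document string. It treats `$contains` as a case-sensitive substring test and `$regex` as a Python `re.search`, both of which are simplifying assumptions: the database runs a full-text search and may tokenize text or use a different regular expression dialect. The function name `document_matches` is hypothetical.

```python
import re
from typing import Any, Dict


def document_matches(document: str, where_document: Dict[str, Any]) -> bool:
    """Return True if `document` satisfies the `where_document` filter."""
    for operator, operand in where_document.items():
        if operator == "$contains":
            ok = operand in document
        elif operator == "$regex":
            ok = re.search(operand, document) is not None
        elif operator == "$and":
            ok = all(document_matches(document, sub) for sub in operand)
        elif operator == "$or":
            ok = any(document_matches(document, sub) for sub in operand)
        else:
            raise ValueError(f"Unsupported operator: {operator}")
        if not ok:
            return False
    return True


doc = "machine learning is a subset of artificial intelligence"
print(document_matches(doc, {"$and": [{"$contains": "machine"}, {"$contains": "learning"}]}))
```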
## Related operations
* [Vector query](200.query-interfaces-of-api.md)
* [get - Retrieve](300.get-interfaces-of-api.md)
* [Hybrid search](400.hybrid-search-of-api.md)

View File

@@ -0,0 +1,107 @@
---
slug: /client
---
# Client
The `Client` class is used to connect to a database in either embedded mode or server mode. It automatically selects the appropriate connection mode based on the provided parameters.
:::tip
OceanBase Database is a fully self-developed, enterprise-level, native distributed database developed by OceanBase. It achieves financial-grade high availability on ordinary hardware and sets a new standard for automatic, lossless disaster recovery across five IDCs in three regions. It also sets a new benchmark in the TPC-C benchmark test, with a single cluster size exceeding 1,500 nodes. OceanBase Database is cloud-native, highly consistent, and highly compatible with Oracle and MySQL. For more information about OceanBase Database, see [OceanBase Database](https://www.oceanbase.com/docs/oceanbase-database-cn).
:::
## Connect to an embedded seekdb instance
Use the `Client` class to connect to a local embedded seekdb instance.
```python
import pyseekdb
# Create embedded client
client = pyseekdb.Client(
    # path="./seekdb",   # Path to the seekdb data directory
    # database="test"    # Database name
)
```
The following table describes the parameters.
| Parameter | Value type | Required | Description | Example value |
| --- | --- | --- | --- | --- |
| `path` | string | No | The path to the seekdb data directory. seekdb stores database files in this directory and loads them when it starts. | `./seekdb` |
| `database` | string | No | The name of the database. | `test` |
## Connect to a remote server
Use the `Client` class to connect to a remote server, which runs seekdb or OceanBase Database.
:::tip
Before you connect to a remote server, make sure that you have deployed a server instance of seekdb or OceanBase Database. <br/>For information about how to deploy a server instance of seekdb, see [Overview](../../../400.guides/400.deploy/50.deploy-overview.md).<br/>For information about how to deploy OceanBase Database, see [Overview](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003976427).
:::
Example: Connect to a server instance of seekdb
```python
import pyseekdb
# Create remote server client (SeekDB Server)
client = pyseekdb.Client(
    host="127.0.0.1",   # Server host
    port=2881,          # Server port
    database="test",    # Database name
    user="root",        # Username
    password=""         # Password (can be retrieved from SEEKDB_PASSWORD environment variable)
)
```
The following table describes the parameters.
| Parameter | Value type | Required | Description | Example value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the instance is located. | `127.0.0.1` |
| `port` | string | Yes | The port number of the instance. The default value is 2881. | `2881` |
| `database` | string | Yes | The name of the database. | `test` |
| `user` | string | Yes | The username. The default value is root. | `root` |
| `password` | string | Yes | The password corresponding to the user. If you do not provide the `password` parameter or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. ||
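
The password fallback described in the table can be pictured as follows. This is a minimal sketch of the documented behavior, not pyseekdb's actual code. The helper name `resolve_password` and its `env` parameter are hypothetical, and returning an empty string when neither source is set is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_password(
    password: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit password, else SEEKDB_PASSWORD, else an empty string."""
    if password:
        # A non-empty explicit password always wins.
        return password
    # None or "" falls back to the environment variable.
    return env.get("SEEKDB_PASSWORD", "")


print(resolve_password("", {"SEEKDB_PASSWORD": "from-env"}))
```

Keeping the password in `SEEKDB_PASSWORD` avoids hard-coding credentials in scripts.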
Example: Connect to OceanBase Database
```python
import pyseekdb
# Create remote server client (OceanBase Server)
client = pyseekdb.Client(
    host="127.0.0.1",   # Server host
    port=2881,          # Server port (default: 2881)
    tenant="test",      # Tenant name
    database="test",    # Database name
    user="root",        # Username (default: "root")
    password=""         # Password (can be retrieved from SEEKDB_PASSWORD environment variable)
)
```
The following table describes the parameters.
| Parameter | Value type | Required | Description | Example value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the database is located. | `127.0.0.1` |
| `port` | string | Yes | The port number of OceanBase Database. The default value is 2881. | `2881` |
| `tenant` | string | No | The name of the tenant. This parameter is not required for seekdb. For OceanBase Database, the default value is sys. | `test` |
| `database` | string | Yes | The name of the database. | `test` |
| `user` | string | Yes | The username corresponding to the tenant. The default value is root. | `root` |
| `password` | string | Yes | The password corresponding to the user. If you do not provide the `password` parameter or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. ||
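
The three connection forms above differ only in the keyword arguments that are passed. The sketch below shows one plausible way to tell them apart, based purely on the examples in this topic. The real selection rules are internal to pyseekdb and may differ. The function name `infer_connection_mode` and the returned labels are hypothetical.

```python
from typing import Any, Dict


def infer_connection_mode(params: Dict[str, Any]) -> str:
    """Guess the connection mode from Client keyword arguments (illustrative only)."""
    if "host" not in params:
        # Only `path` and `database`, or no arguments at all.
        return "embedded"
    if "tenant" in params:
        # In the examples above, `tenant` appears only for OceanBase Database.
        return "oceanbase-server"
    return "seekdb-server"


print(infer_connection_mode({"path": "./seekdb", "database": "test"}))
print(infer_connection_mode({"host": "127.0.0.1", "port": 2881, "database": "test"}))
print(infer_connection_mode({"host": "127.0.0.1", "port": 2881, "tenant": "test"}))
```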
## APIs supported when you use the Client class to connect to a database
When you use the `Client` class to connect to a database, you can call the following APIs.
| API | Description | Document link |
| --- | --- | --- |
| `create_collection()` | Creates a new collection. | [Document](200.collection/100.create-collection-of-api.md) |
| `get_collection()` | Queries a specified collection. |[Document](200.collection/200.get-collection-of-api.md)|
| `delete_collection()` | Deletes a specified collection. |[Document](200.collection/400.delete-collection-of-api.md)|
| `list_collections()` | Lists all collections in the current database.|[Document](200.collection/300.list-collection-of-api.md)|
| `get_or_create_collection()` | Queries a specified collection. If the collection does not exist, it is created.|[Document](200.collection/250.get-or-create-collection-of-api.md)|
| `count_collection()` | Queries the number of collections in the current database. |[Document](200.collection/350.count-collection-of-api.md)|

View File

@@ -0,0 +1,35 @@
---
slug: /default-embedding-function-of-api
---
# Default embedding function
An embedding function converts text documents into vector embeddings for similarity search. pyseekdb supports built-in and custom embedding functions.
The `DefaultEmbeddingFunction` is the default embedding function if none is specified. This function is already available in seekdb and does not need to be created separately.
Here is an example:
```python
from pyseekdb import DefaultEmbeddingFunction
# Use default model (all-MiniLM-L6-v2, 384 dimensions)
ef = DefaultEmbeddingFunction()
# Use custom model (specify the model name explicitly)
ef = DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')
# Get embedding dimension
print(f"Dimension: {ef.dimension}") # 384
# Generate embeddings
embeddings = ef(["Hello world", "How are you?"])
print(f"Generated {len(embeddings)} embeddings, each with {len(embeddings[0])} dimensions")
```
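
Once texts are converted to vectors, similarity search compares those vectors numerically. The sketch below computes cosine similarity, one common measure, on toy 3-dimensional vectors that stand in for real 384-dimensional embeddings. The function name `cosine_similarity` and the sample vectors are illustrative and not part of pyseekdb. Returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]  # points in almost the same direction as `query`
far = [0.0, 1.0, 0.0]    # orthogonal to `query`
print(cosine_similarity(query, close) > cosine_similarity(query, far))
```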
## Related operations
If you want to use a custom function, you can refer to the following topics to create and use a custom function:
* [Create a custom embedding function](200.create-custim-embedding-functions-of-api.md)
* [Use a custom embedding function](300.using-custom-embedding-functions-of-api.md)

View File

@@ -0,0 +1,271 @@
---
slug: /create-custim-embedding-functions-of-api
---
# Create a custom embedding function
You can create a custom embedding function by implementing the `EmbeddingFunction` protocol. A custom embedding function must do the following:
* Implement the `__call__` method, which accepts `Documents` (`str` or `List[str]`) and returns `Embeddings` (`List[List[float]]`).
* Optionally implement a `dimension` attribute that returns the vector dimension.
## Prerequisites
Before creating a custom embedding function, make sure that it meets the following requirements:
* Implement the `__call__` method:
  * Input: a single document or multiple documents, of type `str` or `List[str]`.
  * Output: the embedding vectors, of type `List[List[float]]`.
  * Each vector must have the same dimension.
* (Recommended) Implement the `dimension` attribute:
  * Output: the dimension of the vectors generated by this function, of type `int`.
  * This helps verify dimension consistency when a collection is created.
* Handle special cases:
  * Convert a single string input to a list.
  * Return an empty list for empty inputs.
  * All vectors in the output must have the same dimension.
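
The requirements above can be checked against a dependency-free toy implementation. The sketch below derives a fixed-length vector from a SHA-256 hash of each document. The vectors carry no semantic meaning and are useful only for exercising the contract. The class name `ToyHashEmbeddingFunction` is hypothetical. It is written as a plain class so that it runs without pyseekdb installed, whereas a real implementation would subclass `EmbeddingFunction`.

```python
import hashlib
from typing import List, Union

Documents = Union[str, List[str]]
Embeddings = List[List[float]]


class ToyHashEmbeddingFunction:
    """Deterministic stand-in embedding function; not semantically meaningful."""

    def __init__(self, dimension: int = 8):
        # A SHA-256 digest has 32 bytes, so at most 32 components are available.
        if not 1 <= dimension <= 32:
            raise ValueError("dimension must be between 1 and 32")
        self._dimension = dimension

    @property
    def dimension(self) -> int:
        """Dimension of every vector produced by this function."""
        return self._dimension

    def __call__(self, input: Documents) -> Embeddings:
        # Special case: convert a single string to a list.
        if isinstance(input, str):
            input = [input]
        # Special case: return an empty list for empty input.
        if not input:
            return []
        embeddings: Embeddings = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Scale the first `dimension` bytes into the range [0, 1].
            embeddings.append([byte / 255.0 for byte in digest[: self._dimension]])
        return embeddings


ef = ToyHashEmbeddingFunction(dimension=8)
vectors = ef(["Hello world", "How are you?"])
print(len(vectors), len(vectors[0]))
```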
## Example 1: Sentence Transformer custom embedding function
```python
from typing import List, Union

from pyseekdb import EmbeddingFunction, Client, HNSWConfiguration

Documents = Union[str, List[str]]
Embeddings = List[List[float]]


class SentenceTransformerCustomEmbeddingFunction(EmbeddingFunction[Documents]):
    """
    A custom embedding function using sentence-transformers with a specific model.
    """

    def __init__(self, model_name: str = "all-mpnet-base-v2", device: str = "cpu"):  # TODO: your own model name and device
        """
        Initialize the sentence-transformer embedding function.

        Args:
            model_name: Name of the sentence-transformers model to use
            device: Device to run the model on ('cpu' or 'cuda')
        """
        self.model_name = model_name
        self.device = device
        self._model = None
        self._dimension = None

    def _ensure_model_loaded(self):
        """Lazy load the embedding model"""
        if self._model is None:
            try:
                from sentence_transformers import SentenceTransformer
                self._model = SentenceTransformer(self.model_name, device=self.device)
                # Get dimension from model
                test_embedding = self._model.encode(["test"], convert_to_numpy=True)
                self._dimension = len(test_embedding[0])
            except ImportError:
                raise ImportError(
                    "sentence-transformers is not installed. "
                    "Please install it with: pip install sentence-transformers"
                )

    @property
    def dimension(self) -> int:
        """Get the dimension of embeddings produced by this function"""
        self._ensure_model_loaded()
        return self._dimension

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings for the given documents.

        Args:
            input: Single document (str) or list of documents (List[str])

        Returns:
            List of embedding vectors
        """
        self._ensure_model_loaded()
        # Handle single string input
        if isinstance(input, str):
            input = [input]
        # Handle empty input
        if not input:
            return []
        # Generate embeddings
        embeddings = self._model.encode(
            input,
            convert_to_numpy=True,
            show_progress_bar=False
        )
        # Convert numpy arrays to lists
        return [embedding.tolist() for embedding in embeddings]
# Use the custom embedding function
client = Client()
# Initialize embedding function with all-mpnet-base-v2 model (768 dimensions)
ef = SentenceTransformerCustomEmbeddingFunction(
    model_name='all-mpnet-base-v2',  # TODO: your own model name
    device='cpu'  # TODO: your own device
)
# Get the dimension from the embedding function
dimension = ef.dimension
print(f"Embedding dimension: {dimension}")
# Create collection with matching dimension
collection_name = "my_collection"
if client.has_collection(collection_name):
    client.delete_collection(collection_name)
collection = client.create_collection(
    name=collection_name,
    configuration=HNSWConfiguration(dimension=dimension, distance='cosine'),
    embedding_function=ef
)
# Test the embedding function
print("\nTesting embedding function...")
test_documents = ["Hello world", "This is a test", "Sentence transformers are great"]
embeddings = ef(test_documents)
print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding has {len(embeddings[0])} dimensions")
# Add some documents to the collection
print("\nAdding documents to collection...")
collection.add(
    ids=["1", "2", "3"],
    documents=test_documents,
    metadatas=[{"source": "test1"}, {"source": "test2"}, {"source": "test3"}]
)
# Query the collection
print("\nQuerying collection...")
results = collection.query(
    query_texts="Hello",
    n_results=2
)
print("\nQuery results:")
for i in range(len(results['ids'][0])):
    print(f"ID: {results['ids'][0][i]}")
    print(f"Document: {results['documents'][0][i]}")
    print(f"Distance: {results['distances'][0][i]}")
    print()
# Clean up
client.delete_collection(name=collection_name)
print("Test completed successfully!")
```
## Example 2: OpenAI embedding function
```python
from typing import List, Union
import os

from openai import OpenAI
from pyseekdb import EmbeddingFunction
import pyseekdb

Documents = Union[str, List[str]]
Embeddings = List[List[float]]


class QWenEmbeddingFunction(EmbeddingFunction[Documents]):
    """
    A custom embedding function using OpenAI's embedding API.
    """

    def __init__(self, model_name: str = "", api_key: str = ""):  # TODO: your own model name and api key
        """
        Initialize the OpenAI embedding function.

        Args:
            model_name: Name of the OpenAI embedding model
            api_key: OpenAI API key (if not provided, uses OPENAI_API_KEY env var)
        """
        self.model_name = model_name
        self.api_key = api_key or os.environ.get('OPENAI_API_KEY')  # TODO: your own api key
        if not self.api_key:
            raise ValueError("OpenAI API key is required")
        self._dimension = 1024  # TODO: your own dimension

    @property
    def dimension(self) -> int:
        """Get the dimension of embeddings produced by this function"""
        if self._dimension is None:
            # Call API to get dimension (or use known values)
            raise ValueError("Dimension not set for this model")
        return self._dimension

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings using OpenAI API.

        Args:
            input: Single document (str) or list of documents (List[str])

        Returns:
            List of embedding vectors
        """
        # Handle single string input
        if isinstance(input, str):
            input = [input]
        # Handle empty input
        if not input:
            return []
        # Call OpenAI API
        client = OpenAI(
            api_key=self.api_key,
            base_url=""  # TODO: your own base url
        )
        response = client.embeddings.create(
            model=self.model_name,
            input=input
        )
        # Extract embeddings
        embeddings = [item.embedding for item in response.data]
        return embeddings
# Use the custom embedding function
collection_name = "my_collection"
ef = QWenEmbeddingFunction()
client = pyseekdb.Client()
if client.has_collection(collection_name):
    client.delete_collection(collection_name)
collection = client.create_collection(
    name=collection_name,
    embedding_function=ef
)
collection.add(
    ids=["1", "2", "3"],
    documents=["Hello", "World", "Hello World"],
    metadatas=[{"tag": "A"}, {"tag": "B"}, {"tag": "C"}]
)
results = collection.query(
    query_texts="Hello",
    n_results=2
)
for i in range(len(results['ids'][0])):
    print(results['ids'][0][i])
    print(results['documents'][0][i])
    print(results['metadatas'][0][i])
    print(results['distances'][0][i])
    print()
client.delete_collection(name=collection_name)
```

View File

@@ -0,0 +1,41 @@
---
slug: /using-custom-embedding-functions-of-api
---
# Use a custom embedding function
After you create a custom embedding function, you can use it when you create or get a collection.
Here is an example:
```python
import pyseekdb
from pyseekdb import HNSWConfiguration
# Create a client
client = pyseekdb.Client()
# Create collection with custom embedding function
# (assumes SentenceTransformerCustomEmbeddingFunction is already defined or imported,
# for example, the class from "Create a custom embedding function")
ef = SentenceTransformerCustomEmbeddingFunction()
collection = client.create_collection(
    name="my_collection",
    configuration=HNSWConfiguration(dimension=ef.dimension, distance='cosine'),
    embedding_function=ef
)
# Get collection with custom embedding function
collection = client.get_collection("my_collection", embedding_function=ef)
# Use the collection - documents will be automatically embedded
collection.add(
    ids=["doc1", "doc2"],
    documents=["Document 1", "Document 2"],  # Vectors auto-generated
    metadatas=[{"tag": "A"}, {"tag": "B"}]
)
# Query with texts - query vectors auto-generated
results = collection.query(
    query_texts=["my query"],
    n_results=10
)
```