---
slug: /pyseekdb-sdk-get-started
---

# Get started

## pyseekdb

pyseekdb is a Python client provided by OceanBase. It lets you connect to seekdb in embedded mode, or connect remotely to seekdb in server mode or to OceanBase Database.

:::tip
OceanBase Database is a fully self-developed, enterprise-level, native distributed database provided by OceanBase. It achieves financial-grade high availability on ordinary hardware and sets a new standard for automatic, lossless disaster recovery across cities with the "five IDCs across three regions" architecture. It also sets a new benchmark in the TPC-C standard test, with a single cluster exceeding 1,500 nodes. OceanBase Database is cloud native, strongly consistent, and highly compatible with Oracle and MySQL.
:::

pyseekdb is supported on Linux, macOS, and Windows. The supported database connection modes vary by operating system, as shown in the following table.

| System | Embedded seekdb | Server mode seekdb | Server mode OceanBase Database |
|----|---|---|---|
| Linux | Supported | Supported | Supported |
| macOS | Not supported | Supported | Supported |
| Windows | Not supported | Supported | Supported |

On Linux, installing this client also installs seekdb in embedded mode, so you can connect to it directly and perform operations such as creating a database. Alternatively, you can connect to a deployed seekdb or OceanBase Database in client/server mode.

## Install pyseekdb

### Prerequisites

Make sure that your environment meets the following requirements:

* Operating system: Linux (glibc >= 2.28), macOS, or Windows
* Python version: Python 3.11 or later
* System architecture: x86_64 or aarch64
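The version and architecture checks above can be scripted before you install. A minimal sketch (the thresholds mirror the list above; treating macOS's `arm64` as equivalent to `aarch64` is an assumption):

```python
import platform
import sys

# Minimum requirements from the prerequisites list above
MIN_PYTHON = (3, 11)
SUPPORTED_ARCHES = {"x86_64", "aarch64", "arm64"}  # arm64: macOS's name for aarch64 (assumption)

def environment_ok(version=sys.version_info, arch=platform.machine()):
    """Return True if the Python version and CPU architecture meet the prerequisites."""
    return tuple(version[:2]) >= MIN_PYTHON and arch in SUPPORTED_ARCHES

print("Environment OK:", environment_ok())
```

Note that this does not check the glibc version, which matters only for the embedded mode on Linux.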

### Procedure

Use pip to install pyseekdb. pip automatically detects the default Python version and platform.

```shell
pip install pyseekdb
```

If your pip version is outdated, upgrade it before installation.

```shell
pip install --upgrade pip
```

## What to do next

* After installing pyseekdb, you can connect to seekdb to perform operations. For information about the API interfaces supported by pyseekdb, see [API Reference](../50.apis/10.api-overview.md).

* You can also refer to the provided SDK samples to quickly try out pyseekdb.

  * [Simple sample](50.sdk-samples/10.pyseekdb-simple-sample.md)

  * [Complete sample](50.sdk-samples/50.pyseekdb-complete-sample.md)

  * [Hybrid search sample](50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)
---
slug: /pyseekdb-simple-sample
---

# Simple Example

This example demonstrates the basic operations of embedding functions with seekdb in embedded mode, helping you understand how to use them. It performs the following steps:

1. Connect to seekdb.
2. Create a collection with an embedding function.
3. Add data using documents (vectors are generated automatically).
4. Query using text (the query vector is generated automatically).
5. Print the query results.

## Prerequisites

This example uses seekdb in embedded mode. Before you run it, make sure that you have deployed seekdb in embedded mode.

For information about how to deploy seekdb in embedded mode, see [Embedded Mode](../../../../400.guides/400.deploy/600.python-seekdb.md).

## Example
```python
import pyseekdb

# ==================== Step 1: Create Client Connection ====================
# You can use embedded mode, server mode, or OceanBase mode

# Embedded mode (local SeekDB)
client = pyseekdb.Client()

# Alternative: Server mode (connecting to remote SeekDB server)
# client = pyseekdb.Client(
#     host="127.0.0.1",
#     port=2881,
#     database="test",
#     user="root",
#     password=""
# )

# Alternative: Remote server mode (OceanBase Server)
# client = pyseekdb.Client(
#     host="127.0.0.1",
#     port=2881,
#     tenant="test",  # OceanBase default tenant
#     database="test",
#     user="root",
#     password=""
# )

# ==================== Step 2: Create a Collection with Embedding Function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"

# Create collection with default embedding function
# The embedding function will automatically convert documents to embeddings
collection = client.create_collection(
    name=collection_name,
)

print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")

# ==================== Step 3: Add Data to Collection ====================
# With embedding function, you can add documents directly without providing embeddings
# The embedding function will automatically generate embeddings from documents

documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text"
]

ids = ["id1", "id2", "id3", "id4", "id5"]

# Add data with documents only - embeddings will be auto-generated by embedding function
collection.add(
    ids=ids,
    documents=documents,  # embeddings will be automatically generated
    metadatas=[
        {"category": "AI", "index": 0},
        {"category": "Programming", "index": 1},
        {"category": "Database", "index": 2},
        {"category": "AI", "index": 3},
        {"category": "NLP", "index": 4}
    ]
)

print(f"\nAdded {len(documents)} documents to collection")
print("Note: Embeddings were automatically generated from documents using the embedding function")

# ==================== Step 4: Query the Collection ====================
# With embedding function, you can query using text directly
# The embedding function will automatically convert query text to query vector

# Query using text - query vector will be auto-generated by embedding function
query_text = "artificial intelligence and machine learning"

results = collection.query(
    query_texts=query_text,  # Query text - will be embedded automatically
    n_results=3  # Return top 3 most similar documents
)

print(f"\nQuery: '{query_text}'")
print(f"Query results: {len(results['ids'][0])} items found")

# ==================== Step 5: Print Query Results ====================
for i in range(len(results['ids'][0])):
    print(f"\nResult {i+1}:")
    print(f"  ID: {results['ids'][0][i]}")
    print(f"  Distance: {results['distances'][0][i]:.4f}")
    if results.get('documents'):
        print(f"  Document: {results['documents'][0][i]}")
    if results.get('metadatas'):
        print(f"  Metadata: {results['metadatas'][0][i]}")

# ==================== Step 6: Cleanup ====================
# Delete the collection
client.delete_collection(collection_name)
print(f"\nDeleted collection '{collection_name}'")
```

## References

* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).

* [Complete Example](50.pyseekdb-complete-sample.md)

* [Hybrid Search Example](100.pyseekdb-hybrid-search-sample.md)
---
slug: /pyseekdb-hybrid-search-sample
---

# Hybrid search example

This example demonstrates the advantages of `hybrid_search()` over `query()`.

The main advantages of `hybrid_search()` are:

* Supports full-text search and vector similarity search simultaneously

* Allows separate filtering conditions for full-text search and vector search

* Combines the ranked results of both searches using the Reciprocal Rank Fusion (RRF) algorithm to improve relevance

* Handles complex scenarios that `query()` cannot handle
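Reciprocal Rank Fusion scores each document by summing `1 / (k + rank)` over every result list it appears in, so documents that rank well in both searches rise to the top. A minimal sketch of the fusion step, independent of pyseekdb (the smoothing constant `k = 60` is the value commonly used in the RRF literature, not necessarily what seekdb uses):

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each document's score is the sum of 1 / (k + rank) over every
    ranking it appears in (rank is 1-based).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# doc_1 ranks high in both lists, so it comes out first after fusion,
# ahead of documents that appear in only one list.
fulltext = ["doc_1", "doc_2", "doc_4"]
vector = ["doc_3", "doc_1", "doc_2"]
print(rrf_fuse([fulltext, vector]))
```

This is why, in the scenarios below, `hybrid_search()` favors documents that match both the keyword condition and the semantic query.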

## Example

```python
import pyseekdb

# Setup
client = pyseekdb.Client()
collection = client.get_or_create_collection(
    name="hybrid_search_demo"
)

# Sample data
documents = [
    "Machine learning is revolutionizing artificial intelligence and data science",
    "Python programming language is essential for machine learning developers",
    "Deep learning neural networks enable advanced AI applications",
    "Data science combines statistics, programming, and domain expertise",
    "Natural language processing uses machine learning to understand text",
    "Computer vision algorithms process images using deep learning techniques",
    "Reinforcement learning trains agents through reward-based feedback",
    "Python libraries like TensorFlow and PyTorch simplify machine learning",
    "Artificial intelligence systems can learn from large datasets",
    "Neural networks mimic the structure of biological brain connections"
]

metadatas = [
    {"category": "AI", "topic": "machine learning", "year": 2023, "popularity": 95},
    {"category": "Programming", "topic": "python", "year": 2023, "popularity": 88},
    {"category": "AI", "topic": "deep learning", "year": 2024, "popularity": 92},
    {"category": "Data Science", "topic": "data analysis", "year": 2023, "popularity": 85},
    {"category": "AI", "topic": "nlp", "year": 2024, "popularity": 90},
    {"category": "AI", "topic": "computer vision", "year": 2023, "popularity": 87},
    {"category": "AI", "topic": "reinforcement learning", "year": 2024, "popularity": 89},
    {"category": "Programming", "topic": "python", "year": 2023, "popularity": 91},
    {"category": "AI", "topic": "general ai", "year": 2023, "popularity": 93},
    {"category": "AI", "topic": "neural networks", "year": 2024, "popularity": 94}
]

ids = [f"doc_{i+1}" for i in range(len(documents))]
collection.add(ids=ids, documents=documents, metadatas=metadatas)

print("=" * 100)
print("SCENARIO 1: Keyword + Semantic Search")
print("=" * 100)
print("Goal: Find documents similar to 'AI research' AND containing 'machine learning'\n")

# query() approach
query_result1 = collection.query(
    query_texts=["AI research"],
    where_document={"$contains": "machine learning"},
    n_results=5
)

# hybrid_search() approach
hybrid_result1 = collection.hybrid_search(
    query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
    knn={"query_texts": ["AI research"], "n_results": 10},
    rank={"rrf": {}},
    n_results=5
)

print("query() Results:")
for i, doc_id in enumerate(query_result1['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")

print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result1['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")

print("\nAnalysis:")
print("  query() ranks 'Deep learning neural networks...' first because it's semantically similar to 'AI research',")
print("  but 'machine learning' is not its primary focus. hybrid_search() correctly prioritizes documents that")
print("  explicitly contain 'machine learning' (from full-text search) while also being semantically relevant")
print("  to 'AI research' (from vector search). The RRF fusion ensures documents matching both criteria rank higher.")

print("\n" + "=" * 100)
print("SCENARIO 2: Independent Filters for Different Search Types")
print("=" * 100)
print("Goal: Full-text='neural' (year=2024) + Vector='deep learning' (popularity>=90)\n")

# query() - same filter applies to both conditions
query_result2 = collection.query(
    query_texts=["deep learning"],
    where={"year": {"$eq": 2024}, "popularity": {"$gte": 90}},
    where_document={"$contains": "neural"},
    n_results=5
)

# hybrid_search() - different filters for each search type
hybrid_result2 = collection.hybrid_search(
    query={"where_document": {"$contains": "neural"}, "where": {"year": {"$eq": 2024}}, "n_results": 10},
    knn={"query_texts": ["deep learning"], "where": {"popularity": {"$gte": 90}}, "n_results": 10},
    rank={"rrf": {}},
    n_results=5
)

print("query() Results (same filter for both):")
for i, doc_id in enumerate(query_result2['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")
    print(f"     {metadatas[idx]}")

print("\nhybrid_search() Results (independent filters):")
for i, doc_id in enumerate(hybrid_result2['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")
    print(f"     {metadatas[idx]}")

print("\nAnalysis:")
print("  query() only returns 2 results because it requires documents to satisfy BOTH year=2024 AND popularity>=90")
print("  simultaneously. hybrid_search() returns 5 results by applying the year=2024 filter to full-text search")
print("  and the popularity>=90 filter to vector search independently, then fusing the results. This approach")
print("  captures more relevant documents that satisfy one criterion strongly while only partially meeting the other.")

print("\n" + "=" * 100)
print("SCENARIO 3: Combining Multiple Search Strategies")
print("=" * 100)
print("Goal: Find documents about 'machine learning algorithms'\n")

# query() - vector search only
query_result3 = collection.query(
    query_texts=["machine learning algorithms"],
    n_results=5
)

# hybrid_search() - combines full-text and vector
hybrid_result3 = collection.hybrid_search(
    query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
    knn={"query_texts": ["machine learning algorithms"], "n_results": 10},
    rank={"rrf": {}},
    n_results=5
)

print("query() Results (vector similarity only):")
for i, doc_id in enumerate(query_result3['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")

print("\nhybrid_search() Results (full-text + vector fusion):")
for i, doc_id in enumerate(hybrid_result3['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")

print("\nAnalysis:")
print("  query() returns 'Artificial intelligence systems...' as the result, which doesn't explicitly")
print("  mention 'machine learning'. hybrid_search() combines full-text search (for 'machine learning')")
print("  with vector search (for semantic similarity to 'machine learning algorithms'), ensuring that")
print("  documents containing the exact keyword rank higher while still capturing semantically relevant content.")

print("\n" + "=" * 100)
print("SCENARIO 4: Complex Multi-Criteria Search")
print("=" * 100)
print("Goal: Full-text='learning' (category=AI) + Vector='artificial intelligence' (year>=2023)\n")

# query() - limited to single search with combined filters
query_result4 = collection.query(
    query_texts=["artificial intelligence"],
    where={"category": {"$eq": "AI"}, "year": {"$gte": 2023}},
    where_document={"$contains": "learning"},
    n_results=5
)

# hybrid_search() - separate criteria for each search type
hybrid_result4 = collection.hybrid_search(
    query={"where_document": {"$contains": "learning"}, "where": {"category": {"$eq": "AI"}}, "n_results": 10},
    knn={"query_texts": ["artificial intelligence"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
    rank={"rrf": {}},
    n_results=5
)

print("query() Results:")
for i, doc_id in enumerate(query_result4['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")
    print(f"     {metadatas[idx]}")

print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result4['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")
    print(f"     {metadatas[idx]}")

print("\nAnalysis:")
print("  While both methods return similar documents, hybrid_search() provides better ranking by prioritizing")
print("  documents that score highly in both full-text search (containing 'learning' with category=AI) and")
print("  vector search (semantically similar to 'artificial intelligence' with year>=2023). The RRF fusion")
print("  algorithm ensures that 'Deep learning neural networks...' ranks first because it strongly matches")
print("  both search criteria, whereas query() applies filters sequentially, which may not optimize ranking.")

print("\n" + "=" * 100)
print("SCENARIO 5: Result Quality - RRF Fusion")
print("=" * 100)
print("Goal: Search for 'Python machine learning'\n")

# query() - single ranking
query_result5 = collection.query(
    query_texts=["Python machine learning"],
    n_results=5
)

# hybrid_search() - RRF fusion of multiple rankings
hybrid_result5 = collection.hybrid_search(
    query={"where_document": {"$contains": "Python"}, "n_results": 10},
    knn={"query_texts": ["Python machine learning"], "n_results": 10},
    rank={"rrf": {}},
    n_results=5
)

print("query() Results (single ranking):")
for i, doc_id in enumerate(query_result5['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")

print("\nhybrid_search() Results (RRF fusion):")
for i, doc_id in enumerate(hybrid_result5['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")

print("\nAnalysis:")
print("  Both methods return identical results in this case, but hybrid_search() achieves this through RRF")
print("  (Reciprocal Rank Fusion), which combines rankings from full-text search (for 'Python') and vector")
print("  search (for 'Python machine learning'). RRF provides more stable and robust ranking by considering")
print("  multiple signals, making it less sensitive to variations in individual search algorithms and ensuring")
print("  consistent high-quality results across different query formulations.")

print("\n" + "=" * 100)
print("SCENARIO 6: Different Filter Criteria for Each Search")
print("=" * 100)
print("Goal: Full-text='neural' (high popularity) + Vector='deep learning' (recent year)\n")

# query() - cannot separate filters for keyword vs semantic
query_result6 = collection.query(
    query_texts=["deep learning"],
    where={"popularity": {"$gte": 90}, "year": {"$gte": 2023}},
    where_document={"$contains": "neural"},
    n_results=5
)

# hybrid_search() - different filters for keyword search vs semantic search
hybrid_result6 = collection.hybrid_search(
    query={"where_document": {"$contains": "neural"}, "where": {"popularity": {"$gte": 90}}, "n_results": 10},
    knn={"query_texts": ["deep learning"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
    rank={"rrf": {}},
    n_results=5
)

print("query() Results:")
for i, doc_id in enumerate(query_result6['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")
    print(f"     {metadatas[idx]}")

print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result6['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")
    print(f"     {metadatas[idx]}")

print("\nAnalysis:")
print("  query() only returns 2 results because it requires documents to satisfy BOTH popularity>=90 AND")
print("  year>=2023 simultaneously, along with containing 'neural' and being semantically similar to")
print("  'deep learning'. hybrid_search() returns 5 results by applying the popularity>=90 filter to full-text")
print("  search (for 'neural') and the year>=2023 filter to vector search (for 'deep learning') independently.")
print("  The fusion then combines results from both searches, capturing documents that strongly match either")
print("  criterion while still being relevant to the overall query intent.")

print("\n" + "=" * 100)
print("SCENARIO 7: Partial Keyword Match + Semantic Similarity")
print("=" * 100)
print("Goal: Documents containing 'Python' + Semantically similar to 'data science'\n")

# query() - filter applied after vector search
query_result7 = collection.query(
    query_texts=["data science"],
    where_document={"$contains": "Python"},
    n_results=5
)

# hybrid_search() - parallel searches, then fusion
hybrid_result7 = collection.hybrid_search(
    query={"where_document": {"$contains": "Python"}, "n_results": 10},
    knn={"query_texts": ["data science"], "n_results": 10},
    rank={"rrf": {}},
    n_results=5
)

print("query() Results:")
for i, doc_id in enumerate(query_result7['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")

print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result7['ids'][0]):
    idx = ids.index(doc_id)
    print(f"  {i+1}. {documents[idx]}")

print("\nAnalysis:")
print("  query() only returns 2 results because it first performs vector search for 'data science', then")
print("  filters to documents containing 'Python', which severely limits the result set. hybrid_search()")
print("  returns 5 results by running full-text search (for 'Python') and vector search (for 'data science')")
print("  in parallel, then fusing the results. This captures documents that contain 'Python' (even if not")
print("  semantically closest to 'data science') and documents semantically similar to 'data science' (even")
print("  if they don't contain 'Python'), providing better recall and more comprehensive results.")

print("\n" + "=" * 100)
print("SUMMARY")
print("=" * 100)
print("""
query() limitations:
- Single search type (vector similarity)
- Filters applied after search (may miss relevant docs)
- Cannot combine full-text and vector search results
- Same filter criteria for all conditions

hybrid_search() advantages:
- Simultaneous full-text + vector search
- Independent filters for each search type
- Intelligent result fusion using RRF
- Better recall for complex queries
- Handles scenarios requiring both keyword and semantic matching
""")
```

## References

* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).

* [Simple example](10.pyseekdb-simple-sample.md)

* [Complete example](50.pyseekdb-complete-sample.md)
---
slug: /pyseekdb-complete-sample
---

# Complete Example

This example demonstrates the full capabilities of pyseekdb.

The example includes the following operations:

1. Connection, including all connection modes
2. Collection management
3. DML operations, including add, update, upsert, and delete
4. DQL operations, including query, get, and hybrid_search
5. Filter operators
6. Collection information methods

## Example

```python
|
||||
import uuid
|
||||
import random
|
||||
import pyseekdb
|
||||
|
||||
# ============================================================================
|
||||
# PART 1: CLIENT CONNECTION
|
||||
# ============================================================================
|
||||
|
||||
# Option 1: Embedded mode (local SeekDB)
|
||||
client = pyseekdb.Client(
|
||||
#path="./seekdb",
|
||||
#database="test"
|
||||
)
|
||||
|
||||
# Option 2: Server mode (remote SeekDB server)
|
||||
# client = pyseekdb.Client(
|
||||
# host="127.0.0.1",
|
||||
# port=2881,
|
||||
# database="test",
|
||||
# user="root",
|
||||
# password=""
|
||||
# )
|
||||
|
||||
# Option 3: Remote server mode (OceanBase Server)
|
||||
# client = pyseekdb.Client(
|
||||
# host="127.0.0.1",
|
||||
# port=2881,
|
||||
# tenant="test", # OceanBase default tenant
|
||||
# database="test",
|
||||
# user="root",
|
||||
# password=""
|
||||
# )
|
||||
|
||||
# ============================================================================
|
||||
# PART 2: COLLECTION MANAGEMENT
|
||||
# ============================================================================
|
||||
|
||||
collection_name = "comprehensive_example"
|
||||
dimension = 128
|
||||
|
||||
# 2.1 Create a collection
|
||||
from pyseekdb import HNSWConfiguration
|
||||
config = HNSWConfiguration(dimension=dimension, distance='cosine')
|
||||
collection = client.get_or_create_collection(
|
||||
name=collection_name,
|
||||
configuration=config,
|
||||
embedding_function=None # Explicitly set to None since we're using custom 128-dim embeddings
|
||||
)
|
||||
|
||||
# 2.2 Check if collection exists
|
||||
exists = client.has_collection(collection_name)
|
||||
|
||||
# 2.3 Get collection object
|
||||
retrieved_collection = client.get_collection(collection_name, embedding_function=None)
|
||||
|
||||
# 2.4 List all collections
|
||||
all_collections = client.list_collections()
|
||||
|
||||
# 2.5 Get or create collection (creates if doesn't exist)
|
||||
config2 = HNSWConfiguration(dimension=64, distance='cosine')
|
||||
collection2 = client.get_or_create_collection(
|
||||
name="another_collection",
|
||||
configuration=config2,
|
||||
embedding_function=None # Explicitly set to None since we're using custom 64-dim embeddings
|
||||
)
|
||||
|
||||
# ============================================================================
|
||||
# PART 3: DML OPERATIONS - ADD DATA
|
||||
# ============================================================================
|
||||
|
||||
# Generate sample data
|
||||
random.seed(42)
|
||||
documents = [
|
||||
"Machine learning is transforming the way we solve problems",
|
||||
"Python programming language is widely used in data science",
|
||||
"Vector databases enable efficient similarity search",
|
||||
"Neural networks mimic the structure of the human brain",
|
||||
"Natural language processing helps computers understand human language",
|
||||
"Deep learning requires large amounts of training data",
|
||||
"Reinforcement learning agents learn through trial and error",
|
||||
"Computer vision enables machines to interpret visual information"
|
||||
]
|
||||
|
||||
# Generate embeddings (in real usage, use an embedding model)
|
||||
embeddings = []
|
||||
for i in range(len(documents)):
|
||||
vector = [random.random() for _ in range(dimension)]
|
||||
embeddings.append(vector)
|
||||
|
||||
ids = [str(uuid.uuid4()) for _ in documents]
|
||||
|
||||
# 3.1 Add single item
|
||||
single_id = str(uuid.uuid4())
|
||||
collection.add(
|
||||
ids=single_id,
|
||||
documents="This is a single document",
|
||||
embeddings=[random.random() for _ in range(dimension)],
|
||||
metadatas={"type": "single", "category": "test"}
|
||||
)
|
||||
|
||||
# 3.2 Add multiple items
|
||||
collection.add(
|
||||
ids=ids,
|
||||
documents=documents,
|
||||
embeddings=embeddings,
|
||||
metadatas=[
|
||||
{"category": "AI", "score": 95, "tag": "ml", "year": 2023},
|
||||
{"category": "Programming", "score": 88, "tag": "python", "year": 2022},
|
||||
{"category": "Database", "score": 92, "tag": "vector", "year": 2023},
|
||||
{"category": "AI", "score": 90, "tag": "neural", "year": 2022},
|
||||
{"category": "NLP", "score": 87, "tag": "language", "year": 2023},
|
||||
{"category": "AI", "score": 93, "tag": "deep", "year": 2023},
|
||||
{"category": "AI", "score": 85, "tag": "reinforcement", "year": 2022},
|
||||
{"category": "CV", "score": 91, "tag": "vision", "year": 2023}
|
||||
]
|
||||
)
|
||||
|
||||
# 3.3 Add with only embeddings (no documents)
|
||||
vector_only_ids = [str(uuid.uuid4()) for _ in range(2)]
|
||||
collection.add(
|
||||
ids=vector_only_ids,
|
||||
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
|
||||
metadatas=[{"type": "vector_only"}, {"type": "vector_only"}]
|
||||
)
|
||||
|
||||
# ============================================================================
|
||||
# PART 4: DML OPERATIONS - UPDATE DATA
|
||||
# ============================================================================
|
||||
|
||||
# 4.1 Update single item
|
||||
collection.update(
|
||||
ids=ids[0],
|
||||
metadatas={"category": "AI", "score": 98, "tag": "ml", "year": 2024, "updated": True}
|
||||
)
|
||||
|
||||
# 4.2 Update multiple items
|
||||
collection.update(
|
||||
ids=ids[1:3],
|
||||
documents=["Updated document 1", "Updated document 2"],
|
||||
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
|
||||
metadatas=[
|
||||
{"category": "Programming", "score": 95, "updated": True},
|
||||
{"category": "Database", "score": 97, "updated": True}
|
||||
]
|
||||
)
|
||||
|
||||
# 4.3 Update embeddings
|
||||
new_embeddings = [[random.random() for _ in range(dimension)] for _ in range(2)]
|
||||
collection.update(
|
||||
ids=ids[2:4],
|
||||
embeddings=new_embeddings
|
||||
)

# ============================================================================
# PART 5: DML OPERATIONS - UPSERT DATA
# ============================================================================

# 5.1 Upsert existing item (will update)
collection.upsert(
    ids=ids[0],
    documents="Upserted document (was updated)",
    embeddings=[random.random() for _ in range(dimension)],
    metadatas={"category": "AI", "upserted": True}
)

# 5.2 Upsert new item (will insert)
new_id = str(uuid.uuid4())
collection.upsert(
    ids=new_id,
    documents="This is a new document from upsert",
    embeddings=[random.random() for _ in range(dimension)],
    metadatas={"category": "New", "upserted": True}
)

# 5.3 Upsert multiple items
upsert_ids = [ids[4], str(uuid.uuid4())]  # One existing, one new
collection.upsert(
    ids=upsert_ids,
    documents=["Upserted doc 1", "Upserted doc 2"],
    embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
    metadatas=[{"upserted": True}, {"upserted": True}]
)

# ============================================================================
# PART 6: DQL OPERATIONS - QUERY (VECTOR SIMILARITY SEARCH)
# ============================================================================

# 6.1 Basic vector similarity query
query_vector = embeddings[0]  # Query with first document's vector
results = collection.query(
    query_embeddings=query_vector,
    n_results=3
)
print(f"Query results: {len(results['ids'][0])} items")

# 6.2 Query with metadata filter (simplified equality)
results = collection.query(
    query_embeddings=query_vector,
    where={"category": "AI"},
    n_results=5
)

# 6.3 Query with comparison operators
results = collection.query(
    query_embeddings=query_vector,
    where={"score": {"$gte": 90}},
    n_results=5
)

# 6.4 Query with $in operator
results = collection.query(
    query_embeddings=query_vector,
    where={"tag": {"$in": ["ml", "python", "neural"]}},
    n_results=5
)

# 6.5 Query with logical operators ($or) - simplified equality
results = collection.query(
    query_embeddings=query_vector,
    where={
        "$or": [
            {"category": "AI"},
            {"tag": "python"}
        ]
    },
    n_results=5
)

# 6.6 Query with logical operators ($and) - simplified equality
results = collection.query(
    query_embeddings=query_vector,
    where={
        "$and": [
            {"category": "AI"},
            {"score": {"$gte": 90}}
        ]
    },
    n_results=5
)

# 6.7 Query with document filter
results = collection.query(
    query_embeddings=query_vector,
    where_document={"$contains": "machine learning"},
    n_results=5
)

# 6.8 Query with combined filters (simplified equality)
results = collection.query(
    query_embeddings=query_vector,
    where={"category": "AI", "year": {"$gte": 2023}},
    where_document={"$contains": "learning"},
    n_results=5
)

# 6.9 Query with multiple embeddings (batch query)
batch_embeddings = [embeddings[0], embeddings[1]]
batch_results = collection.query(
    query_embeddings=batch_embeddings,
    n_results=2
)
# batch_results["ids"][0] contains results for first query
# batch_results["ids"][1] contains results for second query

# 6.10 Query with specific fields
results = collection.query(
    query_embeddings=query_vector,
    include=["documents", "metadatas", "embeddings"],
    n_results=2
)
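
# 6.11 Pairing up query results (helper sketch)
# The query() result is column-oriented: results["ids"][i], results["documents"][i]
# and results["metadatas"][i] are parallel lists for the i-th query vector.
# The helper below is an illustrative sketch, not part of the pyseekdb API;
# it pairs those parallel lists into per-item rows:
def rows_from_query(res, qi=0):
    return list(zip(res["ids"][qi], res["documents"][qi], res["metadatas"][qi]))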

# ============================================================================
# PART 7: DQL OPERATIONS - GET (RETRIEVE BY IDS OR FILTERS)
# ============================================================================

# 7.1 Get by single ID
result = collection.get(ids=ids[0])
# result["ids"] contains [ids[0]]
# result["documents"] contains document for ids[0]

# 7.2 Get by multiple IDs
results = collection.get(ids=ids[:3])
# results["ids"] contains ids[:3]
# results["documents"] contains documents for all IDs

# 7.3 Get by metadata filter (simplified equality)
results = collection.get(
    where={"category": "AI"},
    limit=5
)

# 7.4 Get with comparison operators
results = collection.get(
    where={"score": {"$gte": 90}},
    limit=5
)

# 7.5 Get with $in operator
results = collection.get(
    where={"tag": {"$in": ["ml", "python"]}},
    limit=5
)

# 7.6 Get with logical operators (simplified equality)
results = collection.get(
    where={
        "$or": [
            {"category": "AI"},
            {"category": "Programming"}
        ]
    },
    limit=5
)

# 7.7 Get by document filter
results = collection.get(
    where_document={"$contains": "Python"},
    limit=5
)

# 7.8 Get with pagination
results_page1 = collection.get(limit=2, offset=0)
results_page2 = collection.get(limit=2, offset=2)

# 7.9 Get with specific fields
results = collection.get(
    ids=ids[:2],
    include=["documents", "metadatas", "embeddings"]
)

# 7.10 Get all data (up to the given limit)
all_results = collection.get(limit=100)

# ============================================================================
# PART 8: DQL OPERATIONS - HYBRID SEARCH
# ============================================================================

# 8.1 Hybrid search with full-text and vector search
# Note: This requires query_embeddings to be provided directly
# In real usage, you might have an embedding function
hybrid_results = collection.hybrid_search(
    query={
        "where_document": {"$contains": "machine learning"},
        "where": {"category": "AI"},  # Simplified equality
        "n_results": 10
    },
    knn={
        "query_embeddings": [embeddings[0]],
        "where": {"year": {"$gte": 2022}},
        "n_results": 10
    },
    rank={"rrf": {}},  # Reciprocal Rank Fusion
    n_results=5,
    include=["documents", "metadatas"]
)
# hybrid_results["ids"][0] contains IDs for the hybrid search
# hybrid_results["documents"][0] contains documents for the hybrid search
print(f"Hybrid search: {len(hybrid_results.get('ids', [[]])[0])} results")
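
# 8.2 How RRF fusion works (illustrative)
# Reciprocal Rank Fusion scores each ID as sum(1 / (k + rank)) over the
# ranked lists it appears in (k is commonly 60). Illustrative sketch only -
# the actual fusion happens server-side and its parameters may differ:
def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, item_id in enumerate(ranked, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)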

# ============================================================================
# PART 9: DML OPERATIONS - DELETE DATA
# ============================================================================

# 9.1 Delete by IDs
delete_ids = [vector_only_ids[0], new_id]
collection.delete(ids=delete_ids)

# 9.2 Delete by metadata filter
collection.delete(where={"type": {"$eq": "vector_only"}})

# 9.3 Delete by document filter
collection.delete(where_document={"$contains": "Updated document"})

# 9.4 Delete with combined filters
collection.delete(
    where={"category": {"$eq": "CV"}},
    where_document={"$contains": "vision"}
)

# ============================================================================
# PART 10: COLLECTION INFORMATION
# ============================================================================

# 10.1 Get collection count
count = collection.count()
print(f"Collection count: {count} items")

# 10.2 Preview first few items in collection (returns all columns by default)
preview = collection.peek(limit=5)
print(f"Preview: {len(preview['ids'])} items")
for i in range(len(preview['ids'])):
    print(f" ID: {preview['ids'][i]}, Document: {preview['documents'][i]}")
    print(f" Metadata: {preview['metadatas'][i]}, Embedding dim: {len(preview['embeddings'][i]) if preview['embeddings'][i] else 0}")

# 10.3 Count collections in database
collection_count = client.count_collection()
print(f"Database has {collection_count} collections")

# ============================================================================
# PART 11: CLEANUP
# ============================================================================

# Delete test collections
try:
    client.delete_collection("another_collection")
except Exception as e:
    print(f"Could not delete 'another_collection': {e}")

# Delete the main collection
client.delete_collection(collection_name)
```
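
The `limit`/`offset` pagination shown in step 7.8 can be wrapped in a small generator that walks an entire collection page by page. This is an illustrative sketch, not part of the pyseekdb API; it assumes only that `collection.get(limit=..., offset=...)` behaves as shown above:

```python
def iter_pages(collection, page_size=100):
    """Yield successive result pages from collection.get() until exhausted."""
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset)
        if not page["ids"]:  # An empty page means we've read everything
            break
        yield page
        offset += len(page["ids"])
```

You can then process arbitrarily large collections without loading everything at once, for example `for page in iter_pages(collection): ...`.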

## References

* For information about the API interfaces supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Simple Example](../50.sdk-samples/10.pyseekdb-simple-sample.md)
* [Hybrid Search Example](../50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)