Initial commit

Zhongwei Li
2025-11-30 08:44:54 +08:00
commit eb309b7b59
133 changed files with 21979 additions and 0 deletions


@@ -0,0 +1,84 @@
---
slug: /vector-search-intro
---
# Overview of vector search
This topic introduces the core concepts of vector databases and vector search.
seekdb supports dense float vectors with up to 16,000 dimensions, as well as sparse vectors. It supports various vector distance calculations, including Manhattan distance, Euclidean distance, inner product, and cosine distance. seekdb also supports the creation of HNSW/IVF-based vector indexes, as well as incremental updates and deletions, without degrading the recall rate.
seekdb vector search offers hybrid retrieval capabilities with scalar filtering. It also provides flexible access interfaces: you can use SQL via the MySQL protocol from clients in various programming languages, or access it using a Python SDK. In addition, seekdb is fully adapted to AI application development frameworks such as LlamaIndex, DB-GPT, and the AI application development platform Dify, offering better support for AI application development.
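For a quick first look at the SQL interface, the following minimal sketch creates a table with a vector column and an HNSW index, then runs an approximate nearest neighbor query (the table, data, and index parameters are illustrative; the full syntax is covered in the topics that follow):
```sql
-- Illustrative example: a 3-dimensional vector column with an HNSW index
CREATE TABLE items (
  id INT PRIMARY KEY,
  embedding VECTOR(3),
  VECTOR INDEX idx_emb (embedding) WITH (distance=cosine, type=hnsw)
);
INSERT INTO items VALUES (1, '[0.1, 0.2, 0.3]');
-- Approximate nearest neighbor query using cosine distance
SELECT id FROM items
ORDER BY cosine_distance(embedding, '[0.1, 0.2, 0.25]') APPROX LIMIT 5;
```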
<video data-code="9002093" src="https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/video/03%20OceanBase%20Vector%20Search-An%20Official%20In-depth%20Perspective.mp4" controls width="811px" height="456.188px"></video>
## Key concepts
### Unstructured data
Unstructured data is data that does not have a predefined data format or organizational structure. It typically includes data in forms such as text, images, audio, and video, as well as social media content, emails, and log files. Due to the complexity and diversity of unstructured data, processing it requires specific tools and techniques, such as natural language processing, image recognition, and machine learning.
### Vector
A vector is the projection of an object in a high-dimensional space. Mathematically, a vector is a floating-point array with the following characteristics:
* Each element in the array is a floating-point number that represents a dimension of the vector.
* The number of elements in the array is the dimensionality of the vector space.
### Vector embedding
Vector embedding is the process of using a deep learning neural network to extract content and semantics from unstructured data such as images and videos, and convert them into feature vectors. Embedding technology maps original data from a high-dimensional space to a low-dimensional space and converts multimodal data with rich features into multi-dimensional vector data.
### Vector similarity search
In today's era of information explosion, users often need to quickly retrieve specific information from massive datasets. Whether it's online literature databases, e-commerce product catalogs, or rapidly growing multimedia content libraries, efficient retrieval systems are essential for locating content of interest. As data volumes continue to grow, traditional keyword-based search methods can no longer meet the demands for both accuracy and speed, giving rise to vector search technology. Vector similarity search uses feature extraction and vectorization techniques to convert unstructured data—such as text, images, and audio—into vectors. By applying similarity measurement methods to compare these vectors, it captures the deeper semantic meaning of the data. This approach delivers more precise and efficient search results, addressing the shortcomings of traditional search methods.
## Why seekdb vector search?
seekdb's vector search capabilities are built on its integrated multi-model capabilities, excelling in areas such as hybrid retrieval, high performance, high availability, cost efficiency, and data security.
### Hybrid retrieval
seekdb supports hybrid retrieval across multiple data types, including vector data, spatial data, document data, and scalar data. With support for various indexes such as vector indexes, spatial indexes, and full-text indexes, seekdb delivers exceptional performance in multi-model hybrid retrieval. It enables a single database to handle diverse storage and retrieval needs for applications.
### Scalability
seekdb vector search supports the storage and retrieval of massive amounts of vector data, meeting the requirements of large-scale vector data applications.
### High performance
seekdb vector search capabilities integrate the VSAG indexing algorithm library, which demonstrates outstanding performance on the 960-dimensional GIST dataset. In the ANN-Benchmarks tests, the VSAG library significantly outperformed other algorithms.
### High availability
seekdb vector search provides reliable data storage and access capabilities. For in-memory HNSW indexes, it ensures stable retrieval performance.
### Transactions
seekdb's transaction capabilities ensure the consistency and integrity of vector data. It also offers effective concurrency control and fault recovery mechanisms.
### Cost efficiency
seekdb's storage encoding and compression capabilities significantly reduce the storage space required for vectors, helping to lower application storage costs.
### Data security
seekdb already supports comprehensive enterprise-grade security features, including identity authentication and verification, access control, data encryption, monitoring and alerts, and security auditing. These features effectively ensure data security in vector search scenarios.
### Ease of use
seekdb vector search provides flexible access interfaces, enabling SQL access through MySQL protocol clients across various programming languages, as well as seamless integration via a Python SDK. Furthermore, seekdb has been optimized for AI application development frameworks like LangChain and LlamaIndex, offering better support for AI application development.
### Comprehensive toolset
seekdb features a comprehensive database toolset, supporting data development, migration, operations, diagnostics, and full lifecycle data management, safeguarding the development and maintenance of AI applications.
## Application scenarios
* Retrieval-Augmented Generation (RAG): RAG is an artificial intelligence (AI) framework that retrieves facts from external knowledge bases to provide accurate, up-to-date information for large language models (LLMs) and to give users insight into how an LLM generates its responses. RAG is commonly used in intelligent Q&A systems and knowledge bases.
* Personalized recommendation: A recommendation system suggests items that users may be interested in based on their historical behavior and preferences. When a recommendation request is initiated, the system computes similarity based on the user's feature vector and returns likely items of interest, such as recommended restaurants or scenic spots.
* Image search/Text search: An image or text search task finds the entries in a large-scale image or text database that are most similar to a given query. The image or text features used in the search can be stored in a vector database, and high-performance index-based storage enables efficient similarity calculation, returning results that match the search criteria. Typical scenarios include facial recognition.


@@ -0,0 +1,28 @@
---
slug: /vector-search-workflow
---
# AI application workflow using seekdb vector search
This topic describes the AI application workflow using seekdb vector search.
## Convert unstructured data into feature vectors through vector embedding
Unstructured data (such as videos, documents, and images) is the starting point of the entire workflow. Various forms of unstructured data, including videos, text files (documents), and images, are transformed into vector representations through vector embedding models. The task of these models is to convert raw, unstructured data that is difficult to directly calculate similarity into high-dimensional vector data. These vectors capture the semantic information and features of the data, and can express the similarity of data through distances in the vector space. For more information, see [Vector embedding technology](../150.vector-embedding-technology.md).
## Store vector embeddings and create vector indexes in seekdb
As the core storage layer, seekdb is responsible for storing all data. This includes traditional relational tables (used for storing business data), the original unstructured data, and the vector data generated after vector embedding. For more information, see [Store vector data](../160.store-vector-data.md).
To enable efficient vector search, seekdb internally builds vector indexes for the vector data. Vector indexes are specialized data structures that significantly accelerate nearest neighbor searches in high-dimensional vector spaces. Since calculating vector similarity is computationally expensive, exact searches (calculating distances for all vectors one by one) ensure accuracy but can severely impact query performance. Through vector indexes, the system can quickly locate candidate vectors, significantly reducing the number of vectors that need distance calculations, thereby improving query efficiency while maintaining high accuracy. For more information, see [Create vector indexes](../200.vector-index/200.dense-vector-index.md).
## Perform nearest neighbor search and hybrid search through SQL/SDK
Users interact with the AI application through clients or programming languages by submitting queries that may involve text, images, or other formats. For more information, see [Supported clients and languages](../700.vector-search-reference/900.vector-search-supported-clients-and-languages/100.vector-search-supported-clients-and-languages-overview.md).
seekdb uses SQL statements to query and manage relational data, enabling hybrid searches that combine scalar and vector data. When a user initiates a query—if it is unstructured—the system first converts it into a vector using the embedding model. Then, leveraging both vector and scalar indexes, the system quickly retrieves the most similar vectors that also meet scalar filter conditions, thus identifying the most relevant unstructured data. For detailed information about nearest neighbor search, see [Nearest neighbor search](../300.vector-similarity-search.md).
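For example, a hybrid query that combines a scalar filter with an approximate nearest neighbor search might look like the following sketch (the table and columns are illustrative):
```sql
-- Return the 5 documents most similar to the query vector among those matching the scalar filter
SELECT id, title
FROM docs
WHERE category = 'faq'
ORDER BY l2_distance(embedding, '[0.12, 0.08, 0.95]') APPROX LIMIT 5;
```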
## Generate prompts and send them to the LLM for inference
In the final stage, an optimized prompt is generated based on the hybrid search results and sent to the large language model (LLM) to complete the inference process. The LLM generates a natural language response based on this contextual information. There is a feedback loop between the LLM and the vector embedding model, meaning that the output of the LLM or user feedback can be used to optimize the embedding model, creating a cycle of continuous learning and improvement.


@@ -0,0 +1,339 @@
---
slug: /vector-embedding-technology
---
# Vector embedding technology
This topic introduces vector embedding technology in vector retrieval.
## What is vector embedding?
Vector embedding is a technique for converting unstructured data into numerical vectors. These vectors can capture the semantic information of unstructured data, enabling computers to "understand" and process the meaning of such data. Specifically:
* Vector embedding maps unstructured data such as text, images, or audio/video to points in a high-dimensional vector space.
* In this vector space, semantically similar unstructured data is mapped to nearby locations.
* Vectors are typically composed of hundreds of numbers (such as 512 or 1024 dimensions).
* Mathematical methods (such as cosine similarity, shown in the formula below) can be used to calculate the similarity between vectors.
* Common vector embedding models include Word2Vec, BERT, and BGE. For example, when developing RAG applications, text data is often embedded into vector data and stored in a vector database, while other structured data is stored in a relational database.
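The cosine similarity mentioned above, for two $n$-dimensional vectors $\mathbf{a}$ and $\mathbf{b}$, is computed as:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert\,\lVert \mathbf{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
$$

A value close to 1 indicates that the two pieces of content are semantically similar, while a value near 0 indicates little similarity.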
In seekdb, vector data can be stored as a data type in a relational table, allowing vectors and traditional scalar data to be stored in an orderly and efficient manner within seekdb.
## Generate vector embeddings using AI function service in seekdb
In seekdb, you can use the AI function service to generate vector embeddings without installing any dependencies: simply register the model information and call the service. For details, see [AI function service usage and examples](../300.ai-function/200.ai-function.md).
## Common text embedding methods
This section introduces common text embedding methods.
### Prerequisites
You need to have the `pip` command installed in advance.
### Use an offline, locally pre-trained embedding model
Using pre-trained models for local text embedding is the most flexible approach, but it requires significant computing resources. Commonly used models include:
#### Use Sentence Transformers
Sentence Transformers is an NLP framework designed to convert sentences or paragraphs into vector embeddings. It uses deep learning, particularly the Transformer architecture, to effectively capture the semantic information of text. If direct access to Hugging Face's domain times out in your region (as is common in China), set the Hugging Face mirror address with `export HF_ENDPOINT=https://hf-mirror.com` in advance. Then run the code below:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-m3")
sentences = [
"That is a happy person",
"That is a happy dog",
"That is a very happy person",
"Today is a sunny day"
]
embeddings = model.encode(sentences)
print(embeddings)
# [[-0.01178016 0.00884024 -0.05844684 ... 0.00750248 -0.04790139
# 0.00330675]
# [-0.03470375 -0.00886354 -0.05242309 ... 0.00899352 -0.02396279
# 0.02985837]
# [-0.01356584 0.01900942 -0.05800966 ... 0.00523864 -0.05689549
# 0.00077098]
# [-0.02149693 0.02998871 -0.05638731 ... 0.01443702 -0.02131325
# -0.00112451]]
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])
```
#### Use Hugging Face Transformers
Hugging Face Transformers is an open-source library that provides a wide range of pre-trained deep learning models, especially for NLP tasks. Due to geographical reasons, direct access to Hugging Face's domain may time out. Please set the Hugging Face mirror address `export HF_ENDPOINT=https://hf-mirror.com` in advance. After setting it, run the code below:
```python
from transformers import AutoTokenizer, AutoModel
import torch
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")
# Prepare the input
texts = ["This is an example text."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Generate embeddings
with torch.no_grad():
outputs = model(**inputs)
embeddings = outputs.last_hidden_state[:, 0] # Use the [CLS] token's output
print(embeddings)
# tensor([[-1.4136, 0.7477, -0.9914, ..., 0.0937, -0.0362, -0.1650]])
print(embeddings.shape)
# torch.Size([1, 1024])
```
#### Ollama
[Ollama](https://ollama.com) is an open-source model runtime that allows users to easily run, manage, and use various large language models locally. In addition to supporting open-source language models like Llama 3 and Mistral, it also supports embedding models like bge-m3.
1. Deploy Ollama
On macOS and Windows, you can download and install the package directly from the official website. For installation instructions, refer to Ollama's official website. After installation, Ollama runs as a background service.
To install Ollama on Linux:
```shell
curl -fsSL https://ollama.ai/install.sh | sh
```
2. Pull an embedding model
Ollama supports using the bge-m3 model for text embeddings:
```shell
ollama pull bge-m3
```
3. Use Ollama for text embeddings
You can use Ollama's embedding capabilities through the HTTP API or the Python SDK:
* HTTP API
```python
import requests
def get_embedding(text: str) -> list:
"""Get text embeddings using Ollama's HTTP API"""
response = requests.post(
'http://localhost:11434/api/embeddings',
json={
'model': 'bge-m3',
'prompt': text
}
)
return response.json()['embedding']
# Example usage
text = "This is an example text."
embedding = get_embedding(text)
print(embedding)
# [-1.4269912242889404, 0.9092104434967041, ...]
```
* Python SDK
First, install Ollama's Python SDK:
```shell
pip install ollama
```
Then you can use it like this:
```python
import ollama
# Example usage
texts = ["First sentence", "Second sentence"]
embeddings = ollama.embed(model="bge-m3", input=texts)['embeddings']
print(embeddings)
# [[0.03486196, 0.0625187, ...], [...]]
```
4. Advantages and limitations of Ollama
Advantages:
* Fully local deployment, no internet connection required
* Open-source and free, no API Key required
* Supports multiple models, easy to switch and compare
* Relatively low resource usage
Limitations:
* Limited selection of embedding models
* Performance may not match commercial services
* Requires self-maintenance and updates
* Lacks enterprise-level support
When choosing whether to use Ollama, you need to weigh these factors. If your application scenario has high privacy requirements, or you want to run completely offline, Ollama is a good choice. However, if you need more stable service quality and better performance, you may need to consider commercial services.
### Use online, remote embedding services
Using offline, local embedding models usually requires high hardware specifications for the deployment machine and also demands careful management of processes such as model loading and unloading. As a result, many users prefer online embedding services, and many AI inference service providers offer text embedding services. Taking Tongyi Qwen's text embedding service as an example, you can first register for an account with [Alibaba Cloud Model Studio](https://bailian.console.aliyun.com) and obtain an API Key. Then, you can call its public API to get text embeddings.
![Click to activate the model service](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/cloud/tutorial/activate-models.png)
![Confirm to activate the model service](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/cloud/tutorial/confirm-to-activate-models.png)
![Alibaba Cloud Model Studio](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/cloud/tutorial/dashboard.png)
![Get Alibaba Cloud Model Studio API Key](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/cloud/tutorial/get-api-key.png)
#### HTTP call
After obtaining the credentials, you can perform text embedding with the following code. If the `requests` package is not installed in your Python environment, install it first with `pip install requests`.
```python
import requests
from typing import List
class RemoteEmbedding():
def __init__(
self,
base_url: str,
api_key: str,
model: str,
dimensions: int = 1024,
**kwargs,
):
self._base_url = base_url
self._api_key = api_key
self._model = model
self._dimensions = dimensions
"""
OpenAI compatible embedding API. Tongyi, Baichuan, Doubao, etc.
"""
def embed_documents(
self,
texts: List[str],
) -> List[List[float]]:
"""Embed search docs.
Args:
texts: List of text to embed.
Returns:
List of embeddings.
"""
res = requests.post(
f"{self._base_url}",
headers={"Authorization": f"Bearer {self._api_key}"},
json={
"input": texts,
"model": self._model,
"encoding_format": "float",
"dimensions": self._dimensions,
},
)
data = res.json()
embeddings = []
try:
for d in data["data"]:
embeddings.append(d["embedding"][: self._dimensions])
return embeddings
except Exception as e:
print(data)
print("Error", e)
raise e
def embed_query(self, text: str, **kwargs) -> List[float]:
"""Embed query text.
Args:
text: Text to embed.
Returns:
Embedding.
"""
return self.embed_documents([text])[0]
embedding = RemoteEmbedding(
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",
api_key="your-api-key", # Enter your API Key
model="text-embedding-v3",
)
print("Embedding result:", embedding.embed_query("The weather is nice today"), "\n")
# Embedding result: [-0.03573227673768997, 0.0645645260810852, ...]
print("Embedding results:", embedding.embed_documents(["The weather is nice today", "What about tomorrow?"]), "\n")
# Embedding results: [[-0.03573227673768997, 0.0645645260810852, ...], [-0.05443647876381874, 0.07368793338537216, ...]]
```
#### Use Qwen SDK
Qwen provides an SDK called dashscope for quickly accessing model capabilities. After installing it using `pip install dashscope`, you can obtain text embeddings.
```python
import dashscope
from dashscope import TextEmbedding
# Set the API Key
dashscope.api_key = "your-api-key"
# Prepare the input text
texts = ["This is the first sentence", "This is the second sentence"]
# Call the embedding service
response = TextEmbedding.call(
model="text-embedding-v3",
input=texts
)
# Get the embedding results
if response.status_code == 200:
print(response.output['embeddings'])
# [{"embedding": [-0.03193652629852295, 0.08152323216199875, ...]}, {"embedding": [...]}]
```
## Common image embedding methods
This section introduces image embedding methods.
### Use an offline, locally pre-trained embedding model
#### Use CLIP
CLIP (Contrastive Language-Image Pretraining) is a model proposed by OpenAI for multimodal learning by combining images and text. CLIP can understand and process the relationships between images and text, making it perform well in various tasks such as image classification, image retrieval, and text generation.
```python
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Prepare the input image
image = Image.open("path_to_your_image.jpg")
texts = ["This is the first sentence", "This is the second sentence"]
# Preprocess the inputs and run the model
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Obtain the embedding results: CLIPModel returns text and image embeddings directly
text_embeddings = outputs.text_embeds    # Shape: [number of texts, 512]
image_embeddings = outputs.image_embeds  # Shape: [number of images, 512]
print(image_embeddings.shape)
# torch.Size([1, 512])
```
## References
* [Store vector embeddings](160.store-vector-data.md)
* [Vector data types](700.vector-search-reference/100.vector-data-type.md)


@@ -0,0 +1,50 @@
---
slug: /store-vector-data
---
# Store vector data
This topic introduces how to store unstructured, semi-structured, and structured data in a unified way within seekdb. This not only fully leverages the foundational capabilities of seekdb, but also provides strong support for hybrid search.
## How it works
seekdb can store data of different modalities and supports hybrid search by converting various types of data (such as text, images, and videos) into vectors. Searches are performed by calculating the distances between these vectors. Hybrid search can be divided into two types: simple search, which is based on similarity search for a single vector, and complex search, which involves combining vector and scalar searches.
Because vector search is inherently approximate, practical applications combine multiple techniques to improve accuracy; only sufficiently accurate search results deliver real business value.
## Create a vector column
The following example shows a table that stores vector data, spatial data, and relational data. The data type of the vector column is `VECTOR`, and the dimension must be specified when the column is created. The maximum supported dimension is 16,000. The data type of the spatial column is `GEOMETRY`:
```sql
CREATE TABLE t (
-- Store relational data (structured data).
id INT PRIMARY KEY,
-- Store spatial data (semi-structured data).
g GEOMETRY,
-- Store vector data (unstructured data).
vec VECTOR(3)
);
```
## Use the `INSERT` statement to insert vector data
Once you create a table that contains a column of the `VECTOR` data type, you can directly use the `INSERT` statement to insert vectors into the table. The inserted vector must match the dimension specified when the table was created; otherwise, an error is returned. This design ensures data consistency and query efficiency. Vectors are written as standard floating-point arrays, and each dimension must contain a valid floating-point number. Here is a simple example:
```sql
INSERT INTO t (id, g, vec) VALUES (
-- Insert structured data.
1,
-- Insert semi-structured data.
ST_GeomFromText('POINT(1 1)'),
-- Insert unstructured data.
'[0.1, 0.2, 0.3]'
);
```
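As noted above, the dimension is enforced on write. For instance, the following statement fails because the value has only 2 dimensions while the column is defined as `VECTOR(3)` (the exact error message may vary by version):
```sql
-- Fails with an error: dimension mismatch with VECTOR(3)
INSERT INTO t (id, g, vec) VALUES (2, ST_GeomFromText('POINT(2 2)'), '[0.1, 0.2]');
```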
## References
* [Vector embedding technology](150.vector-embedding-technology.md)
* [Create vector indexes](200.vector-index/200.dense-vector-index.md)


@@ -0,0 +1,600 @@
---
slug: /dense-vector-index
---
# Dense vector index
This topic describes how to create, query, maintain, and drop a dense vector index in seekdb.
## Index types
The following table describes the vector index types supported by seekdb.
| Index type | Description | Scenarios |
|-----------|------|----------|
| HNSW | The maximum dimension of indexed columns is 4096. The HNSW index is a memory-based index that must be fully loaded into memory. It supports DML and real-time queries. | Scenarios with high performance and recall rate requirements. |
| HNSW_SQ | The HNSW_SQ index offers similar construction speed, query performance, and recall rate as the HNSW index, but reduces overall memory usage to 1/2 to 1/3 of the original. | Scenarios with high performance and recall rate requirements. |
| HNSW_BQ | The HNSW_BQ index has a slightly lower recall rate compared to the HNSW index, but significantly reduces memory usage. The BQ quantization compression algorithm (RaBitQ) can compress vectors to 1/32 of their original size. The memory optimization effect of the HNSW_BQ index becomes more pronounced as the vector dimension increases. | Scenarios with high performance and recall rate requirements. |
| IVF| An IVF index implemented based on database tables, which does not require resident memory. | Scenarios with lower performance requirements but large data volumes and cost sensitivity. |
| IVF_PQ| An IVF_PQ index implemented based on database tables, which does not require resident memory. On top of IVF, PQ quantization technology is applied. The recall rate of the index is slightly lower than that of the IVF index, but the performance is higher. The PQ quantization compression algorithm can generally compress vectors to 1/16 to 1/32 of their original size. | Scenarios with lower performance requirements but large data volumes and cost sensitivity. |
| IVF_SQ (Experimental feature)| An IVF_SQ index implemented based on database tables, which does not require resident memory. On top of IVF, SQ quantization technology is applied. The recall rate of the index is slightly lower than that of the IVF index, but the performance is higher. The SQ quantization compression algorithm can generally compress vectors to 1/3 to 1/4 of their original size. | Scenarios with lower performance requirements but large data volumes and cost sensitivity. |
Some other notes:
* Dense vector indexes support L2, inner product (IP), and cosine distance as the index distance algorithm.
* Vector index queries support calling some distance functions. For more information, see [Use SQL functions](../250.vector-function.md).
* Vector queries with filter conditions are supported (see the sketch after this list). The filter conditions can be scalar conditions or spatial relationships, such as `ST_Intersects`. Multi-value indexes, full-text indexes, and global indexes are not supported as pre-filters.
* You can create vector and full-text indexes on the same table.
* For more information about how vector indexes support offline DDL operations, see [Offline DDL](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974221).
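As an example of a filtered vector query, the following sketch combines a scalar predicate with an approximate nearest neighbor search (the table `t` is assumed to match the example in the "Store vector data" topic):
```sql
-- Scalar filter plus approximate vector search on the same table
SELECT id
FROM t
WHERE id < 100
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]') APPROX LIMIT 10;
```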
The limitations are described as follows:
* In V1.0.0, creating columnstore vector indexes is not supported.
## Index memory estimation and actual usage query
You can estimate the memory required for vector indexes using the `DBMS_VECTOR` system package:
* Before creating a table, you can estimate index memory requirements by using the [INDEX_VECTOR_MEMORY_ADVISOR](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754002) procedure.
* After a table is created and data is inserted, you can estimate index memory requirements by using the [INDEX_VECTOR_MEMORY_ESTIMATE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754001) procedure.
The vector index memory estimation provides two key pieces of information: the minimum memory configuration required to create a vector index, and the actual memory usage after creating HNSW_SQ and IVF indexes.
We also provide the configuration item `load_vector_index_on_follower` to control whether the follower role automatically loads in-memory vector indexes. For syntax and examples, see [load_vector_index_on_follower](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002969407). If weak reads are not needed, you can disable this configuration item to reduce the memory used by vector indexes.
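For example, to disable automatic loading on followers (a sketch; check the linked configuration reference for the exact value format):
```sql
ALTER SYSTEM SET load_vector_index_on_follower = false;
```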
## Creation syntax and description
seekdb vector indexes can be created during table creation or after the table is created. When creating a vector index, note the following:
* The `VECTOR` keyword is required when creating a vector index.
* The parameters and descriptions for an index created after the table is created are the same as those for an index created during table creation.
* If a large amount of data is involved, we recommend that you write the data first and then create the index to achieve the optimal query performance.
* It is recommended to create HNSW_SQ, IVF, IVF_SQ, and IVF_PQ indexes after data is inserted, and to rebuild the indexes after a significant amount of new data is added. For detailed instructions on creating each index, see the specific examples below.
:::tab
tab HNSW/HNSW_SQ/HNSW_BQ
Syntax for creating an index during table creation:
```sql
CREATE TABLE table_name (
column_name1 data_type1,
column_name2 data_type2,
...,
VECTOR INDEX index_name (column_name) WITH (param1=value1, param2=value2, ...)
);
```
Syntax for creating an index after table creation:
```sql
-- Creating an index after table creation supports setting a degree of parallelism to speed up index construction. The degree of parallelism should not exceed 2x the number of CPU cores.
CREATE [/*+ PARALLEL($value) */] VECTOR INDEX index_name ON table_name(column_name) WITH (param1=value1, param2=value2, ...);
```
`param` parameter description:
| Parameter | Default value | Value range | Required | Description | Remarks |
|------|--------|----------|----------|------|------|
| distance | | l2/inner_product/cosine | Yes | The vector distance function type. | l2 indicates the Euclidean distance, inner_product indicates the inner product distance, and cosine indicates the cosine distance. |
| type | | `hnsw`/`hnsw_sq`/`hnsw_bq` | Yes | The index type. | |
| lib | vsag | vsag | No | The vector index library type. | At present, only the VSAG vector library is supported. |
| m | 16 | [5,128] | No | The maximum number of neighbors of each node. | The larger the value, the slower the index construction, but the better the query performance. |
| ef_construction | 200 | [5,1000] | No | The size of the candidate set during index construction. | The larger the value, the slower the index construction, but the better the index quality. `ef_construction` must be greater than `m`. |
| ef_search | 64 | [1,1000] | No | The size of the candidate set during a query. | The larger the value, the slower the query, but the higher the recall rate. |
| extra_info_max_size | 0 | [0,16384] | No | The maximum size of the primary key information (in bytes). Storing the table's primary key in the index can speed up queries. | <code>0</code>: The primary key information is not stored.<br/><code>1</code>: The primary key information is forcibly stored, regardless of the size limit. In this case, the primary key type must be a supported type (see below).<br/><code>Greater than 1</code>: Specifies the maximum size of the primary key information (in bytes). In this case, the following conditions must be met:<ul><li>The size of the primary key information (see the calculation method below) must be less than the specified size limit.</li><li>The primary key type must be a supported type.</li><li>The table must have a primary key.</li></ul> |
| refine_k | 4.0 | [1.0,1000.0] | No | <main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an HNSW_BQ index. </p></main> This parameter is a floating-point number used to adjust the rearrangement ratio for quantized vector indexes. | This parameter can be specified when you create an index or during a query:<ul><li>If this parameter is not specified during a query, the value specified when the index is created is used. </li><li>If this parameter is specified during a query, the value specified during the query is used. </li></ul> |
| refine_type | sq8 | | No | <main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an HNSW_BQ index. </p></main> This parameter specifies the construction precision of quantized vector indexes. | This parameter improves the efficiency of index construction by reducing the memory usage and the construction time, but may affect the recall rate. |
| bq_bits_query | 32 | 0/4/32 | No | <main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an HNSW_BQ index. </p></main> This parameter specifies the query precision of quantized vector indexes in bits. | This parameter improves the efficiency of index construction by reducing the memory usage and the construction time, but may affect the recall rate. |
| bq_use_fht | true | true/false | No | <main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an HNSW_BQ index. </p></main> This parameter specifies whether to use FHT (Fast Hadamard Transform), an algorithm that accelerates vector inner product calculations, for queries. | |
The supported primary key types for `extra_info_max_size` include:
* [Numeric types](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975803): Integer types, floating-point types, and BIT_VALUE types.
* [Datetime types](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975805)
* [Character types](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975810): VARCHAR type is supported.
The calculation method for the primary key information size:
```sql
SET @table_name = 'test'; -- Replace this with the table name to be queried.
SELECT
CASE
WHEN COUNT(*) <> COUNT(result_value) THEN 'not support'
ELSE COALESCE(SUM(result_value), 'not support')
END AS extra_info_size
FROM (
SELECT
CASE
WHEN vdt.data_type_class IN (1, 2, 3, 4, 6, 8, 9, 14, 27, 28) THEN 8 -- For numeric types, extra_info_size += 8
WHEN oc.data_type = 22 THEN oc.data_length -- For varchar types, extra_info_size += data_length
ELSE NULL -- Other types are not supported
END AS result_value
FROM
oceanbase.__all_column oc
JOIN
oceanbase.__all_virtual_data_type vdt
ON
oc.data_type = vdt.data_type
WHERE
oc.rowkey_position != 0
AND oc.table_id = (SELECT table_id FROM oceanbase.__all_table WHERE table_name = @table_name)
) AS result_table;
-- The result is 8 bytes.
```
tab IVF/IVF_SQ (Experimental feature)/IVF_PQ
Syntax for creating an index during table creation:
```sql
CREATE TABLE table_name (
column_name1 data_type1,
column_name2 data_type2,
...,
VECTOR INDEX index_name (column_name) WITH (param1=value1, param2=value2, ...)
);
```
Syntax for creating an index after table creation:
```sql
-- Creating an index after table creation supports setting a degree of parallelism to speed up index construction. The degree of parallelism should not exceed 2x the number of CPU cores.
CREATE [/*+ PARALLEL($value) */] VECTOR INDEX index_name ON table_name(column_name) WITH (param1=value1, param2=value2, ...);
```
`param` parameter description:
| Parameter | Default value | Value range | Required? | Description | Remarks |
|------|--------|----------|----------|------|------|
| distance | | l2/inner_product/cosine | Yes | The vector distance function type. | l2 indicates the Euclidean distance, inner_product indicates the inner product distance, and cosine indicates the cosine distance. |
| type | | ivf_flat/ivf_sq8/ivf_pq | Yes | The IVF index type. | |
| lib | ob | ob | No | The vector index library type. | |
| nlist | 128 | [1,65536] | No | The number of clusters. | |
| sample_per_nlist | 256 | [1,int64_max] | Yes | The number of samples for each cluster center, which is used when creating an index after table creation. | |
| nbits | 8 | [1,24] | No | The number of quantization bits.<main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an IVF_PQ index.</p></main> | The recommended value is 8. The recommended value range is [8,10]. The larger the value, the higher the quantization accuracy and query accuracy, but the query performance will be affected. |
| m | No default value, must be specified | [1,65536] | Yes | The dimension of the quantized vectors.<main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an IVF_PQ index.</p></main> | The larger the value, the slower the index construction, and the higher the query accuracy, but the query performance will be affected. |
:::
## Query syntax and description
Vector index queries are approximate nearest neighbor queries and do not guarantee 100% accuracy. The accuracy of vector queries is measured by recall. For example, if a query for the 10 nearest neighbors can stably return 9 correct results, the recall is 90%. The recall is described as follows:
* The recall is affected by the build parameters and query parameters.
* The query-time parameters are fixed when the index is created and cannot be modified directly. However, you can override them for the current session by setting session variables: `ob_hnsw_ef_search` applies to HNSW/HNSW_SQ/HNSW_BQ indexes, and `ob_ivf_nprobes` applies to IVF indexes. If a session variable is set, its value takes precedence. For more information, see [ob_hnsw_ef_search](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001976680) and [ob_ivf_nprobes](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002179539).
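For example, the following statements override the query-time parameters for the current session (the values shown are illustrative):
```sql
SET ob_hnsw_ef_search = 128;  -- Applies to HNSW/HNSW_SQ/HNSW_BQ indexes
SET ob_ivf_nprobes = 32;      -- Applies to IVF/IVF_SQ/IVF_PQ indexes
```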
The syntax for dense vector indexes is as follows:
```sql
SELECT ... FROM $table_name ORDER BY $distance_function($column_name, $vector_expr) [APPROXIMATE|APPROX] LIMIT $num [OFFSET $num];
```
Query usage notes are as follows:
* Syntax requirements:
* The `APPROXIMATE`/`APPROX` keyword must be specified for the query to use the vector index instead of a full table scan.
* The query must include the `ORDER BY` and `LIMIT` clauses.
* The `ORDER BY` clause only supports a single vector condition.
* The value of `LIMIT + OFFSET` must be in the range `(0, 16384]`.
* Rules for distance functions:
* If `APPROXIMATE`/`APPROX` is specified, a supported distance function is called, and it matches the vector index algorithm, the query will use the vector index.
* If `APPROXIMATE`/`APPROX` is specified, but the distance function does not match the vector index algorithm, the query will not use the vector index, but no error is returned.
* If `APPROXIMATE`/`APPROX` is specified, but the distance function is not supported in the current version, the query will not use the vector index, and an error is returned.
* If `APPROXIMATE`/`APPROX` is not specified, and a supported distance function is called, the query will not use the vector index, but no error is returned.
* Other notes:
* The `WHERE` condition will serve as a filter after the vector index query.
* Specifying the `LIMIT` clause is required; otherwise, an error will be returned.
## Create, query, and delete examples
### Create an index during table creation
#### Example of dense vector index
##### HNSW example
:::tip
When you create an HNSW index, the index name must be less than 25 characters in length. Otherwise, an exception may occur because the auxiliary table name exceeds the <code>index_name</code> length limit. This limit may be relaxed in future versions.
:::
Create a test table.
```sql
CREATE TABLE t1(c1 INT, c0 INT, c2 VECTOR(10), c3 VECTOR(10), PRIMARY KEY(c1), VECTOR INDEX idx1(c2) WITH (distance=l2, type=hnsw, lib=vsag), VECTOR INDEX idx2(c3) WITH (distance=l2, type=hnsw, lib=vsag));
```
Write test data.
```sql
INSERT INTO t1 VALUES(1, 1,'[0.203846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]', '[0.203846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]');
INSERT INTO t1 VALUES(2, 2, '[0.735541,0.670776,0.903237,0.447223,0.232028,0.659316,0.765661,0.226980,0.579658,0.933939]', '[0.213846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]');
INSERT INTO t1 VALUES(3, 3, '[0.327936,0.048756,0.084670,0.389642,0.970982,0.370915,0.181664,0.940780,0.013905,0.628127]', '[0.223846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]');
```
Perform an approximate nearest neighbor query.
```sql
SELECT * FROM t1 ORDER BY l2_distance(c2, '[0.712338,0.603321,0.133444,0.428146,0.876387,0.763293,0.408760,0.765300,0.560072,0.900498]') APPROXIMATE LIMIT 1;
```
The query result is as follows:
```shell
+----+------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
| c1 | c0 | c2 | c3 |
+----+------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
| 3 | 3 | [0.327936,0.048756,0.08467,0.389642,0.970982,0.370915,0.181664,0.94078,0.013905,0.628127] | [0.223846,0.205289,0.880265,0.82434,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833] |
+----+------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
1 row in set
```
##### HNSW_SQ example
```sql
CREATE TABLE t2 (c1 INT AUTO_INCREMENT, c2 VECTOR(3), PRIMARY KEY(c1), VECTOR INDEX idx1(c2) WITH (distance=l2, type=hnsw_sq, lib=vsag));
```
##### HNSW_BQ example
```sql
CREATE TABLE t3 (c1 INT AUTO_INCREMENT, c2 VECTOR(3), PRIMARY KEY(c1), VECTOR INDEX idx3(c2) WITH (distance=l2, type=hnsw_bq, lib=vsag));
```
The `distance` parameter of HNSW_BQ supports only the l2 value.
##### IVF example
:::tip
When you create an IVF index, the index name must be less than 33 characters in length. Otherwise, an exception may occur because the auxiliary table name exceeds the <code>index_name</code> length limit. This limit may be relaxed in future versions.
:::
```sql
CREATE TABLE ivf_vecindex_suite_table_test (c1 INT, c2 VECTOR(3), PRIMARY KEY(c1), VECTOR INDEX idx2(c2) WITH (distance=l2, type=ivf_flat));
```
### Create an index after table creation
:::tip
Currently, only dense vector indexes can be created after table creation.
:::
#### Example of HNSW index
Create a test table.
```sql
CREATE TABLE vec_table_hnsw (id INT, c2 VECTOR(10));
```
Create an HNSW index.
```sql
CREATE VECTOR INDEX vec_idx1 ON vec_table_hnsw(c2) WITH (distance=l2, type=hnsw);
```
View the created table.
```sql
SHOW CREATE TABLE vec_table_hnsw;
```
The return result is as follows:
```shell
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| vec_table_hnsw | CREATE TABLE `vec_table_hnsw` (
`id` int(11) DEFAULT NULL,
`c2` VECTOR(10) DEFAULT NULL,
VECTOR KEY `vec_idx1` (`c2`) WITH (DISTANCE=L2, TYPE=HNSW, LIB=VSAG, M=16, EF_CONSTRUCTION=200, EF_SEARCH=64) BLOCK_SIZE 16384
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 2 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
View the index.
```sql
SHOW INDEX FROM vec_table_hnsw;
```
The return result is as follows:
```shell
+----------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+-----------+---------------+---------+------------+
| Table          | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment   | Index_comment | Visible | Expression |
+----------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+-----------+---------------+---------+------------+
| vec_table_hnsw | 1          | vec_idx1 | 1            | c2          | A         | NULL        | NULL     | NULL   | YES  | VECTOR     | available |               | YES     | NULL       |
+----------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+-----------+---------------+---------+------------+
1 row in set
```
#### Example of HNSW_SQ index
Create a test table.
```sql
CREATE TABLE vec_table_hnsw_sq (c1 INT AUTO_INCREMENT, c2 VECTOR(3), PRIMARY KEY(c1));
```
Create an HNSW_SQ index.
```sql
CREATE VECTOR INDEX vec_idx2 ON vec_table_hnsw_sq(c2) WITH (distance=l2, type=hnsw_sq, lib=vsag, m=16, ef_construction = 200);
```
#### Example of HNSW_BQ index
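Create a test table (an assumed schema, mirroring the HNSW_SQ example above):
```sql
CREATE TABLE vec_table_hnsw_bq (c1 INT AUTO_INCREMENT, c2 VECTOR(3), PRIMARY KEY(c1));
```
Create an HNSW_BQ index.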
```sql
CREATE VECTOR INDEX vec_idx3 ON vec_table_hnsw_bq(c2) WITH (distance=l2, type=hnsw_bq, lib=vsag, m=16, ef_construction = 200);
```
The `distance` parameter of the HNSW_BQ index can be used only with the L2 algorithm.
#### Example of IVF index
Create a test table.
```sql
CREATE TABLE vec_table_ivf (c1 INT, c2 VECTOR(3), PRIMARY KEY(c1));
```
Create an IVF index.
```sql
CREATE VECTOR INDEX vec_idx3 ON vec_table_ivf(c2) WITH (distance=l2, type=ivf_flat);
```
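For an IVF_PQ index, the `m` parameter must also be specified. The following is a sketch that assumes `m` must evenly divide the vector dimension (here, three sub-vectors of one dimension each):
```sql
CREATE TABLE vec_table_ivf_pq (c1 INT, c2 VECTOR(3), PRIMARY KEY(c1));
CREATE VECTOR INDEX vec_idx4 ON vec_table_ivf_pq(c2) WITH (distance=l2, type=ivf_pq, m=3, nbits=8);
```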
### Drop an index
```sql
DROP INDEX vec_idx1 ON vec_table_hnsw;
```
View the dropped index.
```sql
SHOW INDEX FROM vec_table_hnsw;
```
The return result is as follows:
```shell
Empty set
```
<!--## Monitoring
seekdb vector indexes provide monitoring capabilities:
* You can view the basic information and real-time status of HNSW/HNSW_SQ/HNSW_BQ indexes through the [GV$OB_HNSW_INDEX_INFO](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000004017373) view.
* You can view the basic information and real-time status of IVF/IVF_SQ/IVF_PQ indexes through the [GV$OB_IVF_INDEX_INFO](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000004017374) view.-->
## Maintenance
When there is a large amount of incremental data, the query performance decreases. To reduce the amount of incremental data in the table, seekdb introduced the `DBMS_VECTOR` package for maintaining vector indexes.
### Incremental refresh
:::tip
IVF/IVF_SQ/IVF_PQ indexes do not support incremental refresh.
:::
If a large amount of data is written after the index is created, we recommend that you perform an incremental refresh by using the `REFRESH_INDEX` procedure. For more information, see [REFRESH_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002753999).
The system checks for incremental data every 15 minutes. If more than 10,000 incremental data records are found, the system automatically performs an incremental refresh.
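A manual refresh call might look like the following sketch; the parameter list here is abbreviated, so check the linked `REFRESH_INDEX` reference for the exact signature:
```sql
CALL DBMS_VECTOR.REFRESH_INDEX('vec_idx1', 'vec_table_hnsw');
```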
### Full refresh (rebuild)
#### Manual full table rebuild
If a large amount of data is updated or deleted after an index is created, it is recommended to use the `REBUILD_INDEX` procedure to perform a full refresh. For details and examples, see [REBUILD_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754000).
A full refresh is automatically checked every 24 hours. If the newly added data exceeds 20% of the original data, a full refresh will be triggered automatically. The full refresh runs asynchronously in the background: a new index is created first, and then the old index is replaced. During the rebuild process, the old index remains available, but the overall process is relatively slow.
We also provide the configuration item `vector_index_memory_saving_mode` to control the memory usage during index rebuild. Enabling this mode can reduce the memory consumption during vector index rebuild for partitioned tables. Typically, vector index rebuild requires memory equivalent to twice the index size. After enabling the memory-saving mode, the system will temporarily delete the memory index of a partition after building that partition to release memory, effectively reducing the total memory required for the rebuild operation. For syntax and examples, see [vector_index_memory_saving_mode](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002969408).
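For example (assuming a boolean configuration value; see the linked reference for details):
```sql
ALTER SYSTEM SET vector_index_memory_saving_mode = true;
```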
Notes:
* When executing offline DDL operations (such as `ALTER TABLE` to modify the table structure or primary key), the index table will be rebuilt. Since parallel degree cannot be specified for index rebuild, the system uses single-threaded execution by default. Therefore, when the data volume is large, the rebuild process will be slow, affecting the efficiency of the entire offline DDL operation.
* When rebuilding an index, if you need to modify index parameters, you must specify both `type` and `distance` in the parameter list, and `type` and `distance` must match the original index type. For example, if the original index type is `hnsw` and the distance algorithm is `l2`, you must specify both `type=hnsw` and `distance=l2` during rebuild.
* When rebuilding an index, the following are supported:
* Modifying `m`, `ef_search`, and `ef_construction` values.
* Online rebuild of the `ef_search` parameter.
* Index type rebuild between `hnsw` - `hnsw_sq`.
* Index type rebuild between `ivf_flat` - `ivf_flat`, `ivf_sq8` - `ivf_sq8`, `ivf_pq` - `ivf_pq`.
* Setting parallel degree during rebuild. For examples, see [REBUILD_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754000).
* When rebuilding an index, the following are not supported:
* Modifying `type` and `distance` types.
* Index rebuild between `hnsw` - `ivf`.
* Index rebuild between `hnsw` - `hnsw_bq`.
* Cross rebuild between `ivf_flat`, `ivf_pq`, and `ivf_sq8`.
#### Automatic partition rebuild (recommended)
:::tip
* This feature is supported starting from V1.0.0. If your vector database is upgraded from an earlier version to V1.0.0, you need to manually rebuild all vector indexes for the entire table after the upgrade. Otherwise, automatic partition rebuild tasks may not be executed after the upgrade.
* This feature only supports HNSW/HNSW_SQ/HNSW_BQ indexes.
:::
There are two scenarios that trigger automatic partition rebuild tasks in the current version:
* When executing vector index query statements.
* Scheduled checks, which run on a configurable execution cycle.
1. Configure execution cycle
In the `seekdb` database, configure the execution cycle through the configuration item `vector_index_optimize_duty_time`. Example:
```sql
ALTER SYSTEM SET vector_index_optimize_duty_time='[23:00:00, 24:00:00]';
```
After the above configuration, partition rebuild tasks will only be executed between 23:00:00 and 24:00:00, and will not be initiated at other times. For detailed parameter descriptions, see the corresponding configuration item documentation.
2. View task progress/history
You can view task progress and history through the `CDB/DBA_OB_VECTOR_INDEX_TASKS` or `CDB/DBA_OB_VECTOR_INDEX_TASK_HISTORY` view.
Determine the current task status through the `status` field:
* 0 (PREPARE): The task is waiting to be executed.
* 1 (RUNNING): The task is being executed.
* 2 (PENDING): The task is paused.
* 3 (FINISHED): The task has been completed.
Completed tasks, i.e., tasks with `status=FINISHED`, will be archived to the history table regardless of whether they succeeded. For detailed usage examples, see the corresponding view documentation.
3. Cancel task
To cancel a task, obtain the trace_id from the `DBA_OB_VECTOR_INDEX_TASKS` or `CDB_OB_VECTOR_INDEX_TASKS` view, then execute the following command:
```sql
ALTER SYSTEM CANCEL TASK <trace_id>;
```
Example:
```sql
ALTER SYSTEM CANCEL TASK "Y61480BA2D976-00063084E80435E2-0-1";
```
## Performance optimization
:::tip
Only the IVF index is supported.
:::
seekdb provides an automatic performance optimization mechanism for the IVF index to improve query performance through cache management and regular maintenance.
### Optimization mechanism
IVF index performance optimization includes two types of automated tasks:
1. Cache warming task: Periodically checks all IVF indexes. If it finds that the cache corresponding to an index does not exist, it automatically triggers cache warming and loads the index data into memory. Additionally, cache warming is automatically performed when an IVF index is created.
2. Cache cleanup task: Periodically checks all IVF caches. If it finds that the cache corresponds to an index that has been deleted, it automatically cleans up the invalid cache and releases memory resources. Additionally, cache cleanup is automatically performed when an IVF index is deleted.
### Configure the optimization cycle
The system allows you to customize the execution time window for performance optimization tasks to avoid impacting performance during peak business hours.
In the `seekdb` database, you can set the execution cycle using the `vector_index_optimize_duty_time` parameter:
```sql
ALTER SYSTEM SET vector_index_optimize_duty_time='[23:00:00, 24:00:00]';
```
The configuration is described as follows:
* The time format is `[start time, end time]`.
* The above configuration means that optimization tasks will only be executed between 23:00:00 and 24:00:00.
* Optimization tasks will not be initiated at other times to avoid impacting normal business operations.
### Monitor optimization tasks
seekdb vector indexes provide monitoring capabilities for optimization tasks:
* You can view tasks that are being executed or waiting to be executed through the `DBA_OB_VECTOR_INDEX_TASKS` view.
* You can view historical task records through the `DBA_OB_VECTOR_INDEX_TASK_HISTORY` view.
Usage examples:
1. View the current task status
View tasks that are being executed or waiting to be executed through the `DBA_OB_VECTOR_INDEX_TASKS` view:
```sql
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASKS;
```
Sample return result:
```shell
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| TABLE_ID | TABLET_ID | TASK_ID | START_TIME | MODIFY_TIME | TRIGGER_TYPE | STATUS | TASK_TYPE | TASK_SCN | RET_CODE | TRACE_ID |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| 500020 | 1152921504606846990 | 2002281 | 1970-08-23 17:10:23.174127 | 1970-08-23 17:10:23.174137 | USER | FINISHED | 2 | 1750671687770026 | 0 | YAFF00B9E4D97-00063839E6BD9BBC-0-1 |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
1 row in set
```
Description of the task status:
* `STATUS = 0`: PREPARE, the task is waiting to be executed.
* `STATUS = 1`: RUNNING, the task is being executed.
* `STATUS = 3`: FINISHED, the task has been completed.
Description of the task type:
* `TASK_TYPE = 2`: IVF cache warming task.
* `TASK_TYPE = 3`: IVF invalid cache cleanup task.
2. View historical task records
Completed tasks (with `STATUS = 3`) are automatically archived to the history table every 10 seconds, regardless of whether they succeeded. View the history through the `DBA_OB_VECTOR_INDEX_TASK_HISTORY` view:
```sql
-- Query the history of a specified task ID
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASK_HISTORY WHERE TASK_ID=2002281;
```
Sample return result:
```shell
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| TABLE_ID | TABLET_ID | TASK_ID | START_TIME | MODIFY_TIME | TRIGGER_TYPE | STATUS | TASK_TYPE | TASK_SCN | RET_CODE | TRACE_ID |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| 500020 | 1152921504606846990 | 2002281 | 1970-08-23 17:10:23.174127 | 1970-08-23 17:10:23.174137 | AUTO | FINISHED | 2 | 1750671687770026 | 0 | YAFF00B9E4D97-00063839E6BD9BBC-0-1 |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
1 row in set
```
### Cancel an optimization task
You can cancel a specified task by using the following command:
```sql
-- trace_id is obtained from the DBA_OB_VECTOR_INDEX_TASK_HISTORY view
ALTER SYSTEM CANCEL TASK <trace_id>;
```
:::tip
You can cancel a task by executing the <code>ALTER SYSTEM CANCEL TASK</code> statement only while the task is in the failed-retry phase. A background task that is stuck in a specific execution phase cannot be canceled with this statement.
:::
Example:
```sql
-- Log in to the system and obtain the trace_id of the specified task
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASK_HISTORY WHERE TASK_ID=2037736;
```
The return result is as follows:
```shell
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| TABLE_ID | TABLET_ID | TASK_ID | START_TIME | MODIFY_TIME | TRIGGER_TYPE | STATUS | TASK_TYPE | TASK_SCN | RET_CODE | TRACE_ID |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| 500041 | 1152921504606847008 | 2037736 | 1970-08-23 17:10:23.203821 | 1970-08-23 17:10:23.203821 | USER | PREPARED | 2 | 1750682301145225 | -1 | YAFF00B9E4D97-00063839E6BDDEE0-0-1 |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
1 row in set
```
```sql
-- Cancel the task
ALTER SYSTEM CANCEL TASK "YAFF00B9E4D97-00063839E6BDDEE0-0-1";
```
After the task is canceled, it is archived to the history table with `RET_CODE = -4072`, which indicates that the task was canceled:
```sql
-- Log in to the user database and query the task status
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASK_HISTORY;
```
The return result is as follows:
```shell
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| TABLE_ID | TABLET_ID | TASK_ID | START_TIME | MODIFY_TIME | TRIGGER_TYPE | STATUS | TASK_TYPE | TASK_SCN | RET_CODE | TRACE_ID |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| 500041 | 1152921504606847008 | 2037736 | 1970-08-23 17:10:23.203821 | 1970-08-23 17:10:23.203821 | USER | FINISHED | 2 | 1750682301145225 | -4072 | YAFF00B9E4D97-00063839E6BDDEE0-0-1 |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
1 row in set
```
## References
* [Use SQL functions](../250.vector-function.md)


@@ -0,0 +1,374 @@
---
slug: /hybrid-vector-index
---
# Create a hybrid vector index
This topic describes how to create a hybrid vector index in seekdb.
## Overview
Hybrid vector indexes leverage seekdb's built-in embedding capabilities to greatly simplify the vector index usage process. They make the vector concept transparent to users: you can directly write raw data (such as text) that needs to be stored, and seekdb will automatically convert it to vectors and build indexes internally. During retrieval, you only need to provide the raw query content, and seekdb will also automatically perform embedding and retrieve the vector index, significantly improving ease of use.
Considering the performance overhead of embedding models, hybrid vector indexes provide two embedding modes for users to choose from:
* Synchronous mode: Embedding and indexing are performed immediately after data is written, ensuring real-time data visibility.
* Asynchronous mode: Background tasks perform data embedding and indexing in batches, which can significantly improve write performance. You can flexibly set the trigger cycle of background tasks based on your requirements for real-time data visibility.
In addition, this feature also provides the capability to perform brute-force search on hybrid vector indexes to help verify the correctness of search results. Brute-force search refers to performing a search using a full table scan to obtain the exact results of the n nearest rows.
## Feature support
:::tip
This feature currently supports only HNSW/HNSW_BQ indexes.
:::
This feature supports the full lifecycle of hybrid vector indexes, including creation, update, deletion, and retrieval, and is compatible with `REFRESH_INDEX` and `REBUILD_INDEX` in the `DBMS_VECTOR` system package. The syntax for update, deletion, and retrieval is exactly the same as that for regular vector indexes. In asynchronous mode, `REFRESH_INDEX` will additionally trigger data embedding. For details about creation and retrieval, see the sections below.
The supported features are as follows:
| Module | Feature | Description |
|------|--------|------|
| DDL | Create a hybrid vector index during table creation | You can create a hybrid vector index on a `VARCHAR` column when creating a table |
| DDL | Create a hybrid vector index after table creation | Supports creating a hybrid vector index on a `VARCHAR` column of an existing table |
| Retrieval | `semantic_distance` function | Pass raw data through this function for vector retrieval |
| Retrieval | `semantic_vector_distance` function | Pass vectors through this function for retrieval. There are two usage modes: <ul><li>When the SQL statement includes the `APPROXIMATE`/`APPROX` clause, vector index retrieval is used.</li><li>When the `APPROXIMATE`/`APPROX` clause is not included, brute-force search using a full table scan is performed.</li></ul> |
| DBMS_VECTOR | `REFRESH_INDEX` | The usage is the same as that for regular vector indexes. Performs incremental index refresh and embedding in asynchronous mode |
| DBMS_VECTOR | `REBUILD_INDEX` | The usage is the same as that for regular vector indexes. Performs full index rebuild |
Some usage notes are as follows:
* In synchronous mode, write performance may be affected by embedding performance. In asynchronous mode, data visibility will be delayed.
* For repeated retrieval scenarios, it is recommended to use AI Function Service to pre-obtain query vectors so that embedding does not run on every retrieval; see the sketch below.
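This is a minimal sketch of that pattern. It assumes the `ob_embed` model registered in the next section, and uses the `AI_EMBED` function and the `items` table that appear in the examples later in this topic:
```sql
-- Embed the query text once, then reuse the cached vector across queries
SET @qv = AI_EMBED('ob_embed', 'Sunflower');
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @qv)
APPROXIMATE LIMIT 3;
```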
## Prerequisites
Before using hybrid vector indexes, you must register an embedding model and endpoint. The following is a registration example:
```sql
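-- Clean up any previous registration with the same names before re-creating them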
CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_embed');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_embed_endpoint');
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
'ob_embed', '{
"type": "dense_embedding",
"model_name": "BAAI/bge-m3"
}');
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
'ob_embed_endpoint', '{
"ai_model_name": "ob_embed",
"url": "https://api.siliconflow.cn/v1/embeddings",
"access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
"provider": "siliconflow"
}');
```
:::info
Replace <code>access_key</code> with your actual API Key. The BAAI/bge-m3 model has a vector dimension of 1024, so you need to use <code>dim=1024</code> when creating a hybrid vector index.
:::
## Creation syntax and description
Hybrid vector indexes support two creation methods: **creation during table creation** and **creation after table creation**. When creating an index, note the following:
* The index must be created on a column of the `VARCHAR` type.
* The `model` and `sync_mode` parameters are not supported for regular vector indexes.
* The parameters and descriptions for an index created after table creation are the same as those for an index created during table creation.
### Create during table creation
You can use the `CREATE TABLE` statement to create a hybrid vector index. Through index parameters, background tasks can be initiated synchronously or asynchronously. In synchronous mode, `VARCHAR` data is automatically converted to vector data when data is inserted. In asynchronous mode, data conversion is performed periodically or manually.
#### Syntax
```sql
CREATE TABLE table_name (
column_name1 data_type1,
  column_name2 VARCHAR(length), -- Text column
...,
VECTOR INDEX index_name (column_name2) WITH (param1=value1, param2=value2, ...)
);
```
#### Parameter description
| Parameter | Default value | Value range | Required | Description | Remarks |
|------|--------|----------|----------|------|------|
| `distance` | | `l2`/`inner_product`/`cosine` | Yes | Specifies the vector distance algorithm type. | `l2` indicates Euclidean distance, `inner_product` indicates inner product distance, and `cosine` indicates cosine distance. |
| `type` | | Currently supports `hnsw` / `hnsw_bq` | Yes | Specifies the index algorithm type. | |
| `lib` | `vsag` | `vsag` | No | Specifies the vector index library type. | Currently, only the VSAG vector library is supported. |
| `model` | | Registered model name | Yes | Specifies the large language model name used for embedding. | The model must be registered using AI Function Service before creating the index.<main id="notice" type='notice'><h4>Note</h4><p>Regular vector indexes do not support setting this parameter.</p></main> |
| `dim` | | Positive integer, maximum 4096 | Yes | Specifies the vector dimension after embedding. | Must match the dimension provided by the model. |
| `sync_mode` | `async` | `immediate`/`manual`/`async` | No | Specifies the data and index synchronization mode. | `immediate` indicates synchronous mode, `manual` indicates manual mode, and `async` indicates asynchronous mode.<main id="notice" type='notice'><h4>Note</h4><p>Regular vector indexes do not support setting this parameter.</p></main> |
| `sync_interval` | `10s` | Time interval, such as `10s`, `1h`, `1d`, etc. | No | Sets the trigger cycle of background tasks in asynchronous mode. | The numeric part must be positive. Units supported include seconds (s), hours (h), days (d), etc. |
The usage of other vector index parameters (such as `m`, `ef_construction`, `ef_search`, etc.) is the same as that for regular vector indexes. For details, see the related documentation.
### Create after table creation
Supports creating a hybrid vector index on a `VARCHAR` column of an existing table. When creating an index after table creation, synchronous or asynchronous background tasks are initiated through the provided index parameters. In synchronous mode, all existing `VARCHAR` data is converted to vector data. In asynchronous mode, data conversion is performed periodically or manually.
#### Syntax
```sql
CREATE VECTOR INDEX index_name
ON table_name(varchar_column_name)
WITH (param1=value1, param2=value2, ...);
```
#### Parameter description
The parameter description is the same as that for creating an index during table creation. For details, see the section above.
## Create, update, and delete examples
DML operations (`INSERT`, `UPDATE`, `DELETE`) for hybrid vector indexes are exactly the same as those for regular vector indexes. When inserting or updating data of the `VARCHAR` type, the system automatically or asynchronously performs embedding based on the `sync_mode` parameter setting.
### Create during table creation
Create the `vector_idx` index when creating the test table `items`:
```sql
-- Assume that the ob_embed model has been created previously (please refer to the "Prerequisites" section to register the model)
CREATE TABLE items (
id BIGINT PRIMARY KEY,
doc VARCHAR(100),
VECTOR INDEX vector_idx(doc)
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=async, sync_interval=10s)
);
```
Insert a row of data into the test table `items`. The system will automatically perform embedding:
```sql
INSERT INTO items(id, doc) VALUES(1, 'Rose');
```
### Create after table creation
After creating the test table `items`, use the `CREATE VECTOR INDEX` statement to create the `vector_idx` index:
```sql
CREATE TABLE items (
id BIGINT PRIMARY KEY,
doc VARCHAR(100)
);
-- Assume that the ob_embed model has been created previously (please refer to the "Prerequisites" section to register the model)
CREATE VECTOR INDEX vector_idx
ON items (doc)
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=async, sync_interval=10s);
```
Insert a row of data into the test table `items`. The system will automatically perform embedding:
```sql
INSERT INTO items(id, doc) VALUES(1, 'Rose');
```
### Update
When updating data of the `VARCHAR` type, the system will re-perform embedding:
* Synchronous mode: Re-embedding is performed immediately after update.
* Asynchronous mode: Re-embedding is performed by background tasks at the next trigger cycle after update.
Usage example:
```sql
UPDATE items SET doc = 'Lily' WHERE id = 1;
```
### Delete
The delete operation is the same as that for regular vector indexes. You can directly delete the data.
Usage example:
```sql
DELETE FROM items WHERE id = 1;
```
## Retrieval
Hybrid vector indexes support two retrieval methods:
* Retrieve using raw text
* Retrieve using vectors
For detailed usage of the `APPROXIMATE`/`APPROX` clause, see the related documentation on creating vector indexes at the end of this topic.
### Retrieve using raw text
Use the `semantic_distance` expression to pass raw text for vector retrieval.
#### Syntax
```sql
SELECT ... FROM table_name
ORDER BY semantic_distance(column_name, 'query_text') [APPROXIMATE|APPROX]
LIMIT n;
```
Where:
* `column_name`: The text column specified when creating the hybrid vector index.
* `query_text`: The raw text for retrieval.
* `n`: The number of result rows to return.
#### Usage example
```sql
-- Assume that the ob_embed model has been created previously
CREATE TABLE items (
id INT PRIMARY KEY,
doc varchar(100),
VECTOR INDEX vector_idx(doc)
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate)
);
INSERT INTO items(id, doc) VALUES(1, 'Rose');
INSERT INTO items(id, doc) VALUES(2, 'Sunflower');
INSERT INTO items(id, doc) VALUES(3, 'Rose');
INSERT INTO items(id, doc) VALUES(4, 'Sunflower');
INSERT INTO items(id, doc) VALUES(5, 'Rose');
-- Retrieve using raw text
SELECT id, doc FROM items
ORDER BY semantic_distance(doc, 'Sunflower')
APPROXIMATE LIMIT 3;
```
The return result is as follows:
```shell
+----+-----------+
| id | doc |
+----+-----------+
| 2 | Sunflower |
| 4 | Sunflower |
| 5 | Rose |
+----+-----------+
3 rows in set
```
### Retrieve using vectors (with APPROXIMATE clause)
Use the `semantic_vector_distance` expression to pass vectors for retrieval. When the retrieval statement includes the `APPROXIMATE`/`APPROX` clause, vector index retrieval is used.
#### Syntax
```sql
SELECT ... FROM table_name
ORDER BY semantic_vector_distance(column_name, 'query_vector') [APPROXIMATE|APPROX]
LIMIT n;
```
Where:
* `column_name`: The text column specified when creating the hybrid vector index.
* `query_vector`: The query vector.
* `n`: The number of result rows to return.
#### Usage example
```sql
-- Assume that the ob_embed model has been created previously (please refer to the "Prerequisites" section to register the model)
CREATE TABLE items (
id INT PRIMARY KEY,
doc varchar(100),
VECTOR INDEX vector_idx(doc)
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate)
);
INSERT INTO items(id, doc) VALUES(1, 'Rose');
INSERT INTO items(id, doc) VALUES(2, 'Lily');
INSERT INTO items(id, doc) VALUES(3, 'Sunflower');
INSERT INTO items(id, doc) VALUES(4, 'Rose');
-- First, obtain the query vector
SET @query_vector = AI_EMBED('ob_embed', 'Sunflower');
-- Retrieve using vectors with index
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @query_vector)
APPROXIMATE LIMIT 3;
```
The return result is as follows:
```shell
+----+-----------+
| id | doc |
+----+-----------+
| 3 | Sunflower |
| 1 | Rose |
| 4 | Rose |
+----+-----------+
3 rows in set
```
### Retrieve using vectors (without APPROXIMATE clause)
Use the `semantic_vector_distance` expression to pass vectors for retrieval. When the `APPROXIMATE`/`APPROX` clause is not included, brute-force search using a full table scan is performed to obtain the exact results of the n nearest rows. During retrieval execution, the `distance` type is obtained from the table schema, and then a full table scan is performed. Vector distance is calculated for each row to ensure accurate results.
#### Syntax
```sql
SELECT ... FROM table_name
ORDER BY semantic_vector_distance(column_name, 'query_vector')
LIMIT n;
```
Where:
* `column_name`: The text column specified when creating the hybrid vector index.
* `query_vector`: The query vector.
* `n`: The number of result rows to return.
#### Usage example
```sql
-- Retrieve using vectors with brute-force search (exact results)
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @query_vector)
LIMIT 3;
```
The return result is as follows:
```shell
+----+-----------+
| id | doc |
+----+-----------+
| 3 | Sunflower |
| 4 | Rose |
| 1 | Rose |
+----+-----------+
3 rows in set
```
## Index maintenance
Hybrid vector indexes support using the `DBMS_VECTOR` system package for index maintenance, including incremental refresh and full rebuild.
### Incremental refresh
If a large amount of data is written after the index is created, it is recommended to use the `REFRESH_INDEX` procedure for incremental refresh. For descriptions and examples, see the related documentation.
Special notes for hybrid vector indexes:
* The usage is the same as that for regular vector indexes. For details, see the related documentation.
* In asynchronous mode, `REFRESH_INDEX` will additionally trigger data embedding to ensure that incremental data is correctly converted to vectors and added to the index, as sketched below.
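A hedged sketch of a manual refresh call. The two-argument form (index name, then table name) follows the OceanBase `DBMS_VECTOR.REFRESH_INDEX` convention, but treat this call signature as an assumption and check the reference linked below for the authoritative signature and optional arguments:
```sql
-- Assumed argument order: index name, then table name
CALL DBMS_VECTOR.REFRESH_INDEX('vector_idx', 'items');
```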
### Full refresh (rebuild)
If a large amount of data is updated or deleted after the index is created, it is recommended to use the `REBUILD_INDEX` procedure for full refresh. For descriptions and examples, see the related documentation.
Special notes for hybrid vector indexes:
* The usage is the same as that for regular vector indexes. For details, see the related documentation.
* The task merges incremental data and snapshots.
## Related documentation
* [AI Function Service](../../300.ai-function/200.ai-function.md)
* [Create a vector index](200.dense-vector-index.md)
* [REFRESH_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002753999)
* [REBUILD_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754000)


@@ -0,0 +1,279 @@
---
slug: /in-memory-sparse-vector-index
---
# In-memory sparse vector index
This topic describes how to create, query, and use in-memory sparse vector indexes in seekdb.
## Overview
In-memory sparse vector indexes are an efficient index type provided by seekdb for sparse vector data (vectors where most elements are zero). In-memory sparse vector indexes must be fully loaded into memory and support DML and real-time queries.
To improve the query performance of sparse vectors, seekdb integrates the sparse vector index (SINDI) from the VSAG algorithm library. This index performs better than disk-based sparse vector indexes and is suitable for use when memory resources are sufficient.
## Feature support
In-memory sparse vector indexes support the following features:
| Module | Feature | Description |
|------|--------|------|
| DDL | Create a sparse vector index during table creation | You can create a sparse vector index on a `SPARSEVECTOR` column when creating a table. The maximum supported dimension is 500,000. |
| DDL | Create a sparse vector index after table creation | Supports creating a sparse vector index on a `SPARSEVECTOR` column of an existing table. The maximum supported dimension is 500,000. |
| DML | Insert, update, delete | The syntax for DML operations is exactly the same as that for regular vector indexes. |
| Retrieval | Vector retrieval | Supports retrieval using SQL functions. |
| Retrieval | Query parameters | Supports setting query-level parameters through the `parameters` clause during retrieval. |
| DBMS_VECTOR | `REFRESH_INDEX` | Performs incremental index refresh. |
| DBMS_VECTOR | `REBUILD_INDEX` | Performs full index rebuild. |
## Index memory estimation and actual usage query
Supports index memory estimation through the `DBMS_VECTOR` system package. The usage is the same as that for dense indexes. Here, only the special requirements for sparse vector indexes are described:
* The `IDX_TYPE` parameter must be set to `SINDI`, case-insensitive.
## Creation syntax and description
In-memory sparse vector indexes support two creation methods: **creation during table creation** and **creation after table creation**. When creating an index, note the following:
* The maximum supported dimension for columns on which sparse vector indexes are created is 500,000.
* Sparse vector indexes must be created on columns of the `SPARSEVECTOR` type.
* The `VECTOR` keyword is required when creating an index.
* The index type must be set to `sindi`, which indicates creating an in-memory sparse vector index.
* Only the `inner_product` (inner product) distance algorithm is supported.
* The parameters and descriptions for an index created after table creation are the same as those for an index created during table creation.
### Create during table creation
Supports using the `CREATE TABLE` statement to create a sparse vector index.
#### Syntax
```sql
CREATE TABLE table_name (
column_name1 data_type1,
column_name2 SPARSEVECTOR,
...,
VECTOR INDEX index_name (column_name2) WITH (param1=value1, param2=value2, ...)
);
```
#### Parameter description
| Parameter | Default value | Value range | Required | Description | Remarks |
|------|--------|----------|----------|------|------|
| `distance` | | `inner_product` | Yes | Specifies the vector distance algorithm type. | Sparse vector indexes support only inner product (`inner_product`) as the distance algorithm. |
| `type` | | `sindi` | Yes | Specifies the index algorithm type. | Indicates creating an in-memory sparse vector index. |
| `lib` | `vsag` | `vsag` | No | Specifies the vector index library type. | Currently, only the VSAG vector library is supported. |
| `prune` | `false` | `true`/`false` | No | Whether to perform pruning on vectors. | When `prune` is `true`, you need to set the `refine` and `drop_ratio_build` parameters. When `prune` is `false`, full-precision retrieval can be provided. If `refine` is set to `true` or `drop_ratio_build` is not `0`, an error will be returned. |
| `refine` | `false` | `true`/`false` | No | Whether reranking is needed. | When set to `true`, the original sparse vectors are retrieved for the search results to perform high-precision distance calculation and reranking, which means an additional copy of the original vector data needs to be stored. Can be set only when `prune=true`. |
| `drop_ratio_build` | `0` | `[0, 0.9]` | No | The pruning ratio for sparse vector data. | When a new sparse vector is inserted, the `query_length * drop_ratio_build` smallest values are pruned based on value size. If `refine` is `true`, the original vector data is preserved. Otherwise, only the pruned data is retained. Can be set only when `prune=true`. |
| `drop_ratio_search` | `0` | `[0, 0.9]` | No | The pruning ratio for sparse vector values during retrieval. | The larger the value, the more pruning is performed, the lower the accuracy, and the higher the performance. Can also be set through the `parameters` clause during retrieval, and query parameters have higher priority. |
| `refine_k` | `4.0` | `[1.0, 1000.0]` | No | Indicates the proportion of results participating in reranking. | Retrieves `limit_k * refine_k` results and obtains the original vectors for reranking. Meaningful only when `refine=true`. Can also be set through the `parameters` clause during retrieval, and query parameters have higher priority. |
### Create after table creation
Supports creating a sparse vector index on a `SPARSEVECTOR` column of an existing table.
#### Syntax
```sql
CREATE VECTOR INDEX index_name ON table_name(column_name) WITH (param1=value1, param2=value2, ...);
```
#### Parameter description
The parameter description is the same as that for creating an index during table creation. For details, see the section above.
## Create, update, and delete examples
### Create during table creation
Create the test table `sparse_t1` and create a sparse vector index:
```sql
CREATE TABLE sparse_t1 (
c1 INT PRIMARY KEY,
c2 SPARSEVECTOR,
VECTOR INDEX sparse_idx1(c2)
WITH (lib=vsag, type=sindi, distance=inner_product)
);
```
Insert sparse vector data into the test table:
```sql
INSERT INTO sparse_t1 VALUES(1, '{1:0.1, 2:0.2, 3:0.3}');
INSERT INTO sparse_t1 VALUES(2, '{3:0.3, 2:0.2, 4:0.4}');
INSERT INTO sparse_t1 VALUES(3, '{3:0.3, 4:0.4, 5:0.5}');
```
Query the test table:
```sql
SELECT * FROM sparse_t1;
```
The return result is as follows:
```shell
+----+---------------------+
| c1 | c2 |
+----+---------------------+
| 1 | {1:0.1,2:0.2,3:0.3} |
| 2 | {2:0.2,3:0.3,4:0.4} |
| 3 | {3:0.3,4:0.4,5:0.5} |
+----+---------------------+
3 rows in set
```
### Create after table creation
Create a sparse vector index after creating the test table:
```sql
CREATE TABLE sparse_t2 (
c1 INT PRIMARY KEY,
c2 SPARSEVECTOR
);
CREATE VECTOR INDEX sparse_idx2 ON sparse_t2(c2)
WITH (lib=vsag, type=sindi, distance=inner_product,
prune=true, refine=true, drop_ratio_build=0.1,
drop_ratio_search=0.5, refine_k=2.0);
```
Insert sparse vector data into the test table:
```sql
INSERT INTO sparse_t2 VALUES(1, '{1:0.1, 2:0.2, 3:0.3}');
```
Query the test table:
```sql
SELECT * FROM sparse_t2;
```
The return result is as follows:
```shell
+----+---------------------+
| c1 | c2 |
+----+---------------------+
| 1 | {1:0.1,2:0.2,3:0.3} |
+----+---------------------+
1 row in set
```
### Update
When updating sparse vector data, the index is automatically maintained:
```sql
UPDATE sparse_t1 SET c2 = '{1:0.1}' WHERE c1 = 1;
```
### Delete
The delete operation is the same as that for regular vector indexes. You can directly delete the data:
```sql
DELETE FROM sparse_t1 WHERE c1 = 1;
```
## Retrieval
The retrieval syntax for sparse vector indexes is similar to that for dense vector indexes, using the `APPROXIMATE`/`APPROX` keyword for approximate nearest neighbor retrieval.
### Syntax
```sql
SELECT ... FROM table_name
ORDER BY negative_inner_product(column_name, query_vector) [APPROXIMATE|APPROX]
LIMIT n [PARAMETERS(param1=value1, param2=value2)];
```
Where:
* `column_name`: The `SPARSEVECTOR` column specified when creating the sparse vector index.
* `query_vector`: The query vector, which can be a string in sparse vector format, such as `'{1:2.4, 3:1.5}'`.
* `n`: The number of result rows to return.
* `PARAMETERS`: Optional query-level parameters for setting `drop_ratio_search` and `refine_k`.
### Retrieval usage notes
For detailed requirements, see [Dense vector index](../200.dense-vector-index.md). Here, only the special requirements for sparse vector indexes are described:
* Query parameter priority: Query-level parameters set by `PARAMETERS` > Query parameters set when building the index > Default values.
* `drop_ratio_search`: Value range `[0, 0.9]`, default value `0`. The pruning ratio for sparse vector values during retrieval. The larger the value, the more pruning is performed, the lower the accuracy, and the higher the performance. Prunes the `query_length * drop_ratio_search` smallest values based on value size. Since pruning all values is meaningless, at least one value is always retained.
* `refine_k`: Value range `[1.0, 1000.0]`, default value `4.0`. Indicates the proportion of results participating in reranking. Queries `limit_k * refine_k` results and obtains the original vectors for reranking. Effective only when `refine=true`.
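As an illustrative sketch, both query-level parameters can be combined at retrieval time. This reuses the `sparse_t2` table created earlier, whose index was built with `prune=true, refine=true`, so `refine_k` takes effect:
```sql
-- Query-level parameters override the values set when the index was built
SELECT * FROM sparse_t2
ORDER BY negative_inner_product(c2, '{1:0.1, 2:0.2}')
APPROXIMATE LIMIT 2
PARAMETERS(drop_ratio_search=0.2, refine_k=8.0);
```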
### Usage examples
#### Regular query
```sql
CREATE TABLE t1 (
c1 INT PRIMARY KEY,
c2 SPARSEVECTOR,
VECTOR INDEX idx1(c2)
WITH (lib=vsag, type=sindi, distance=inner_product)
);
INSERT INTO t1 VALUES(1, '{1:0.1, 2:0.2, 3:0.3}');
INSERT INTO t1 VALUES(2, '{3:0.3, 2:0.2, 4:0.4}');
INSERT INTO t1 VALUES(3, '{3:0.3, 4:0.4, 5:0.5}');
INSERT INTO t1 VALUES(4, '{5:0.5, 4:0.4, 6:0.6}');
INSERT INTO t1 VALUES(5, '{5:0.5, 6:0.6, 7:0.7}');
SELECT * FROM t1
ORDER BY negative_inner_product(c2, '{3:0.3, 4:0.4}')
APPROXIMATE LIMIT 4;
```
The return result is as follows:
```shell
+----+---------------------+
| c1 | c2 |
+----+---------------------+
| 2 | {2:0.2,3:0.3,4:0.4} |
| 3 | {3:0.3,4:0.4,5:0.5} |
| 4 | {4:0.4,5:0.5,6:0.6} |
| 1 | {1:0.1,2:0.2,3:0.3} |
+----+---------------------+
```
#### Use query parameters
```sql
SELECT *, negative_inner_product(c2, '{3:0.3, 4:0.4}')
AS score FROM t1
ORDER BY score APPROXIMATE LIMIT 4
PARAMETERS(drop_ratio_search=0.5);
```
The return result is as follows:
```shell
+----+---------------------+---------------------+
| c1 | c2 | score |
+----+---------------------+---------------------+
| 4 | {4:0.4,5:0.5,6:0.6} | -0.1600000113248825 |
| 3 | {3:0.3,4:0.4,5:0.5} | -0.2500000149011612 |
| 2 | {2:0.2,3:0.3,4:0.4} | -0.2500000149011612 |
+----+---------------------+---------------------+
3 rows in set
```
## Index monitoring and maintenance
In-memory sparse vector indexes provide monitoring views and support using the `DBMS_VECTOR` system package for index maintenance, including incremental refresh and full rebuild. The usage is the same as that for dense indexes.
## Related documentation
* For detailed information about sparse vector data types, see [Vector data type](../../700.vector-search-reference/100.vector-data-type.md).
* For detailed information about vector distance functions, see [Vector functions](../../250.vector-function.md).
* For monitoring and maintenance of dense vector indexes, see [Vector index monitoring/maintenance](../200.dense-vector-index.md).
* For index memory estimation and actual usage query of vector indexes, see [Index memory estimation and actual usage query](../200.dense-vector-index.md).


@@ -0,0 +1,611 @@
---
slug: /vector-function
---
# Use SQL functions
This topic describes the vector functions supported by seekdb and the considerations for using them.
## Considerations
* The operations below cannot be performed on vectors of different dimensions; attempting to do so returns the error `different vector dimensions %d and %d`.
* When the result exceeds the floating-point number range, an error `value out of range: overflow / underflow` is returned.
* Dense vector indexes support L2, inner product, and cosine distance as index distance algorithms. Memory-based sparse vector indexes support inner product as the distance algorithm. For details, see [Create vector indexes](200.vector-index/200.dense-vector-index.md).
* Vector index search supports calling the `L2_distance`, `Cosine_distance`, and `Inner_product` distance functions in this document.
## Distance functions
Distance functions are used to calculate the distance between two vectors. The calculation method varies depending on the distance algorithm used.
### L2_distance
Euclidean distance is essentially the straight-line distance between two vectors, reflecting how far apart their coordinates are. It is calculated by applying the Pythagorean theorem to the vector coordinates:
![Pythagorean theorem](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer/V4.3.5/vector_search/%E5%8B%BE%E8%82%A1%E5%AE%9A%E7%90%86.jpg)
The function syntax is as follows:
```sql
l2_distance(vector v1, vector v2)
```
The parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
* The dimensions of the two parameters must be the same.
* If a single-level array type parameter exists, its elements cannot be `NULL`.
The return values are described as follows:
* The return value is a `distance(double)` distance value.
* If any parameter is `NULL`, the return value is `NULL`.
Here is an example:
```sql
CREATE TABLE t1(c1 vector(3));
INSERT INTO t1 VALUES('[1,2,3]');
SELECT l2_distance(c1, [1,2,3]), l2_distance([1,2,3],[1,1,1]), l2_distance('[1,1,1]','[1,2,3]') FROM t1;
```
The return result is as follows:
```shell
+--------------------------+------------------------------+----------------------------------+
| l2_distance(c1, [1,2,3]) | l2_distance([1,2,3],[1,1,1]) | l2_distance('[1,1,1]','[1,2,3]') |
+--------------------------+------------------------------+----------------------------------+
| 0 | 2.23606797749979 | 2.23606797749979 |
+--------------------------+------------------------------+----------------------------------+
1 row in set
```
### L2_squared
L2 Squared distance is the square of the Euclidean distance (L2 Distance). It omits the square root operation in the Euclidean distance formula, thereby reducing computational cost while maintaining the relative order of distances. The calculation method is as follows:
![L2 Squared](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer/V4.4.1/%E5%90%91%E9%87%8F/l2-squared.jpg)
The syntax is as follows:
```sql
l2_squared(vector v1, vector v2)
```
The parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
* The dimensions of the two parameters must be the same.
* If a single-level array type parameter exists, its elements cannot be `NULL`.
The return values are described as follows:
* The return value is a `distance(double)` distance value.
* If any parameter is `NULL`, the return value is `NULL`.
Here is an example:
```sql
CREATE TABLE t1(c1 vector(3));
INSERT INTO t1 VALUES('[1,2,3]');
SELECT l2_squared(c1, [1,2,3]), l2_squared([1,2,3],[1,1,1]), l2_squared('[1,1,1]','[1,2,3]') FROM t1;
```
The return result is as follows:
```shell
+-------------------------+-----------------------------+---------------------------------+
| l2_squared(c1, [1,2,3]) | l2_squared([1,2,3],[1,1,1]) | l2_squared('[1,1,1]','[1,2,3]') |
+-------------------------+-----------------------------+---------------------------------+
| 0 | 5 | 5 |
+-------------------------+-----------------------------+---------------------------------+
1 row in set
```
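Running both functions on the same inputs shows the relationship directly: 5 is the square of 2.23606797749979, so the squared distance preserves the nearest-neighbor ordering while skipping the square root:
```sql
-- Compare L2 distance and its square on the same pair of vectors
SELECT l2_distance([1,2,3],[1,1,1]) AS l2,
       l2_squared([1,2,3],[1,1,1]) AS l2_sq;
```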
### L1_distance
The Manhattan distance is used to calculate the sum of absolute axis distances between two points in a standard coordinate system. The calculation formula is as follows:
![Manhattan distance](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer/V4.3.5/vector_search/l1.jpg)
The function syntax is as follows:
```sql
l1_distance(vector v1, vector v2)
```
The parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
* The dimensions of the two parameters must be the same.
* If a single-level array type parameter exists, its elements cannot be `NULL`.
The return values are described as follows:
* The return value is a `distance(double)` distance value.
* If any parameter is `NULL`, the return value is `NULL`.
Here is an example:
```sql
CREATE TABLE t2(c1 vector(3));
INSERT INTO t2 VALUES('[1,2,3]');
INSERT INTO t2 VALUES('[1,1,1]');
SELECT l1_distance(c1, [1,2,3]) FROM t2;
```
The return result is as follows:
```shell
+--------------------------+
| l1_distance(c1, [1,2,3]) |
+--------------------------+
| 0 |
| 3 |
+--------------------------+
2 rows in set
```
### Cosine_distance
Cosine similarity measures the angular difference between two vectors and reflects their directional similarity, regardless of their lengths (magnitude). The value range of cosine similarity is `[-1, 1]`, where `1` indicates vectors in exactly the same direction, `0` indicates orthogonality, and `-1` indicates completely opposite directions.
The calculation method for cosine similarity is as follows:
![Cosine similarity](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer/V4.3.5/vector_search/%E4%BD%99%E5%BC%A6%E7%9B%B8%E4%BC%BC%E5%BA%A6.jpg)
Since cosine similarity closer to 1 indicates greater similarity, cosine distance (or cosine dissimilarity) is sometimes used as a measure of distance between vectors. Cosine distance can be calculated by subtracting cosine similarity from 1:
![Cosine distance](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer/V4.3.5/vector_search/%E4%BD%99%E5%BC%A6%E8%B7%9D%E7%A6%BB.jpg)
The value range of cosine distance is `[0, 2]`, where `0` indicates exactly the same direction (no distance) and `2` indicates completely opposite directions.
The function syntax is as follows:
```sql
cosine_distance(vector v1, vector v2)
```
The parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
* The dimensions of the two parameters must be the same.
* If a single-level array type parameter exists, its elements cannot be `NULL`.
The return values are described as follows:
* The return value is a `distance(double)` distance value.
* If any parameter is `NULL`, the return value is `NULL`.
Here is an example:
```sql
CREATE TABLE t3(c1 vector(3));
INSERT INTO t3 VALUES('[1,2,3]');
INSERT INTO t3 VALUES('[1,2,1]');
SELECT cosine_distance(c1, [1,2,3]) FROM t3;
```
```shell
+------------------------------+
| cosine_distance(c1, [1,2,3]) |
+------------------------------+
| 0 |
| 0.12712843905603044 |
+------------------------------+
2 rows in set
```
### Inner_product
The inner product, also known as the dot product or scalar product, represents a type of multiplication between two vectors. In geometric terms, the inner product indicates the direction and magnitude relationship between two vectors. The calculation method for the inner product is as follows:
![Inner product](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer/V4.3.5/vector_search/%E5%86%85%E7%A7%AF.jpg)
The syntax is as follows:
```sql
inner_product(vector v1, vector v2)
```
The parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
* The dimensions of the two parameters must be the same.
* If a single-level array type parameter exists, its elements cannot be `NULL`.
* For sparse vectors, one of the two parameters can be a string in sparse vector format, for example `inner_product(c2, '{1:2.4}')`. The two parameters cannot both be strings.
The return values are described as follows:
* The return value is a `distance(double)` distance value.
* If any parameter is `NULL`, the return value is `NULL`.
Dense vector example:
```sql
CREATE TABLE t4(c1 vector(3));
INSERT INTO t4 VALUES('[1,2,3]');
INSERT INTO t4 VALUES('[1,2,1]');
SELECT inner_product(c1, [1,2,3]) FROM t4;
```
The return result is as follows:
```shell
+----------------------------+
| inner_product(c1, [1,2,3]) |
+----------------------------+
| 14 |
| 8 |
+----------------------------+
2 rows in set
```
Sparse vector example:
```sql
CREATE TABLE t4s(c1 INT, c2 SPARSEVECTOR, c3 SPARSEVECTOR);
INSERT INTO t4s VALUES(1, '{1:1.1, 2:2.2}', '{1:2.4}');
INSERT INTO t4s VALUES(2, '{1:1.5, 3:3.6}', '{4:4.5}');
SELECT inner_product(c2,c3) FROM t4s;
```
The return result is as follows:
```shell
+----------------------+
| inner_product(c2,c3) |
+----------------------+
| 2.640000104904175 |
| 0 |
+----------------------+
2 rows in set
```
### Vector_distance
The vector_distance function calculates the distance between two vectors. You can specify parameters to select different distance algorithms.
The syntax is as follows:
```sql
vector_distance(vector v1, vector v2 [, string metric])
```
The `vector v1/v2` parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
* The dimensions of the two parameters must be the same.
* If a single-level array type parameter exists, its elements cannot be `NULL`.
The `metric` parameter is used to specify the distance algorithm. Options:
* If not specified, the default algorithm is `euclidean`.
* If specified, the only valid values are:
* `euclidean`. Represents Euclidean distance, same as L2_distance.
* `manhattan`. Represents Manhattan distance, same as L1_distance.
* `cosine`. Represents cosine distance, same as Cosine_distance.
* `dot`. Represents inner product, same as Inner_product.
The return values are described as follows:
* The return value is a `distance(double)` distance value.
* If any parameter is `NULL`, the return value is `NULL`.
Here is an example:
```sql
CREATE TABLE t5(c1 vector(3));
INSERT INTO t5 VALUES('[1,2,3]');
INSERT INTO t5 VALUES('[1,2,1]');
SELECT vector_distance(c1, [1,2,3], euclidean) FROM t5;
```
The return result is as follows:
```shell
+-----------------------------------------+
| vector_distance(c1, [1,2,3], euclidean) |
+-----------------------------------------+
| 0 |
| 2 |
+-----------------------------------------+
2 rows in set
```
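As a quick cross-check of the equivalences above, each named metric should return the same value as its dedicated function; for example, `manhattan` versus `l1_distance` (both yield 3 for these inputs):
```sql
-- The named metric matches the dedicated distance function
SELECT vector_distance([1,2,3], [1,1,1], manhattan) AS vd,
       l1_distance([1,2,3], [1,1,1]) AS l1;
```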
## Arithmetic functions
Arithmetic functions provide element-wise addition (+), subtraction (-), and multiplication (*). The operands can be two vectors, a vector combined with a single-level array or vector-formatted string, two single-level arrays, or a single-level array combined with a vector-formatted string. The calculation is element-wise, as shown here for addition:
![Addition](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer/V4.3.5/vector_search/%E5%8A%A0%E6%B3%95.jpg)
The syntax is as follows:
```sql
v1 + v2
v1 - v2
v1 * v2
```
The parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`). **Note**: Two parameters cannot both be string types. At least one parameter must be a vector or single-level array type.
* The dimensions of the two parameters must be the same.
* If a single-level array type parameter exists, its elements cannot be `NULL`.
The return values are described as follows:
* When at least one of the two parameters is a vector type, the return value is of the same vector type as the vector parameter.
* When both parameters are single-level array types, the return value is of the `array(float)` type.
* If any parameter is `NULL`, the return value is `NULL`.
Here is an example:
```sql
CREATE TABLE t6(c1 vector(3));
INSERT INTO t6 VALUES('[1,2,3]');
SELECT [1,2,3] + '[1.12,1000.0001, -1.2222]', c1 - [1,2,3] FROM t6;
```
The return result is as follows:
```shell
+---------------------------------------+--------------+
| [1,2,3] + '[1.12,1000.0001, -1.2222]' | c1 - [1,2,3] |
+---------------------------------------+--------------+
| [2.12,1002,1.7778] | [0,0,0] |
+---------------------------------------+--------------+
1 row in set
```
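Multiplication follows the same element-wise rule; a short sketch reusing the same table, where `c1` is `[1,2,3]`:
```sql
-- Expected result: [2,4,6]
SELECT c1 * [2,2,2] FROM t6;
```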
## Comparison functions
Comparison functions provide comparisons between vector types and vector, single-level array, and vector-formatted string types, using the operators `=`, `!=`, `>`, `>=`, `<`, and `<=`. Vectors are compared element by element in lexicographic order.
The syntax is as follows:
```sql
v1 = v2
v1 != v2
v1 > v2
v1 < v2
v1 >= v2
v1 <= v2
```
The parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
:::tip
One of the two parameters must be a vector type.
:::
* The dimensions of the two parameters must be the same.
* If a single-level array type parameter exists, its elements cannot be `NULL`.
The return values are described as follows:
* The return value is of the bool type.
* If any parameter is `NULL`, the return value is `NULL`.
Here is an example:
```sql
CREATE TABLE t7(c1 vector(3));
INSERT INTO t7 VALUES('[1,2,3]');
SELECT c1 = '[1,2,3]' FROM t7;
```
The return result is as follows:
```shell
+----------------+
| c1 = '[1,2,3]' |
+----------------+
| 1 |
+----------------+
1 row in set
```
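Ordering comparisons are decided at the first differing element; a sketch reusing the same table, where `c1` is `[1,2,3]`:
```sql
-- Returns 1 (true): the second element decides (2 > 1), even though 3 < 9
SELECT c1 > '[1,1,9]' FROM t7;
```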
## Aggregate functions
:::tip
Vector columns cannot be used as GROUP BY conditions, and DISTINCT is not supported.
:::
### Sum
The Sum function is used to calculate the sum of vectors in a vector column of a table, using element-wise accumulation to obtain the sum vector.
The syntax is as follows:
```sql
sum(vector v1)
```
The parameters are described as follows:
* Only the vector type is supported.
The return values are described as follows:
* The return value is a `sum (vector)` value.
Here is an example:
```sql
CREATE TABLE t8(c1 vector(3));
INSERT INTO t8 VALUES('[1,2,3]');
SELECT sum(c1) FROM t8;
```
The return result is as follows:
```shell
+---------+
| sum(c1) |
+---------+
| [1,2,3] |
+---------+
1 row in set
```
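The element-wise accumulation becomes visible once the column holds more than one row; a sketch that adds a second row to the same table:
```sql
INSERT INTO t8 VALUES('[4,5,6]');
-- Expected result: [5,7,9]
SELECT sum(c1) FROM t8;
```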
### Avg
The Avg function is used to calculate the average of vectors in a vector column of a table.
The syntax is as follows:
```sql
avg(vector v1)
```
The parameters are described as follows:
* Only the vector type is supported.
The return values are described as follows:
* The return value is an `avg (vector)` value.
* `NULL` rows in the vector column are not counted.
* When the input parameter is empty, the output is `NULL`.
Here is an example:
```sql
CREATE TABLE t9(c1 vector(3));
INSERT INTO t9 VALUES('[1,2,3]');
SELECT avg(c1) FROM t9;
```
The return result is as follows:
```shell
+---------+
| avg(c1) |
+---------+
| [1,2,3] |
+---------+
1 row in set
```
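Similarly, a sketch of element-wise averaging over two rows:
```sql
INSERT INTO t9 VALUES('[3,4,5]');
-- Expected result: [2,3,4]
SELECT avg(c1) FROM t9;
```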
## Other common vector functions
### Vector_norm
The Vector_norm function calculates the Euclidean norm (or length) of a vector, which represents the Euclidean distance between the vector and the origin. The calculation formula is as follows:
![Euclidean norm](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer/V4.3.5/vector_search/%E6%AC%A7%E5%87%A0%E9%87%8C%E5%BE%B7.jpg)
The syntax is as follows:
```sql
vector_norm(vector v1)
```
The parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
* If a single-level array type parameter exists, its elements cannot be `NULL`.
The return values are described as follows:
* The return value is a `norm(double)` value, that is, the Euclidean norm of the vector.
* If the parameter is `NULL`, the return value is `NULL`.
Here is an example:
```sql
CREATE TABLE t10(c1 vector(3));
INSERT INTO t10 VALUES('[1,2,3]');
SELECT vector_norm(c1),vector_norm([1,2,3]) FROM t10;
```
The return result is as follows:
```shell
+--------------------+----------------------+
| vector_norm(c1) | vector_norm([1,2,3]) |
+--------------------+----------------------+
| 3.7416573867739413 | 3.7416573867739413 |
+--------------------+----------------------+
1 row in set
```
### Vector_dims
The Vector_dims function is used to return the vector dimension.
The syntax is as follows:
```sql
vector_dims(vector v1)
```
The parameters are described as follows:
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
The return values are described as follows:
* The return value is a `dims(int64)` dimension value.
* If the parameter is `NULL`, an error is returned.
Here is an example:
```sql
CREATE TABLE t11(c1 vector(3));
INSERT INTO t11 VALUES('[1,2,3]');
INSERT INTO t11 VALUES('[1,1,1]');
SELECT vector_dims(c1), vector_dims('[1,2,3]') FROM t11;
```
The return result is as follows:
```shell
+-----------------+------------------------+
| vector_dims(c1) | vector_dims('[1,2,3]') |
+-----------------+------------------------+
| 3 | 3 |
| 3 | 3 |
+-----------------+------------------------+
2 rows in set
```


@@ -0,0 +1,232 @@
---
slug: /vector-similarity-search
---
# Vector similarity search
Vector similarity search, also known as nearest neighbor search, is a search method based on distance metrics in vector space. Its core objective is to find the set of vectors most similar to a given query vector. Although specific distance metrics are used during computation, the final output is the Top K nearest vectors, sorted in ascending order of distance.
This topic describes two vector search methods in seekdb: exact nearest neighbor search based on full-scan and approximate nearest neighbor search based on vector index. It also provides examples to illustrate how to use these methods.
:::tip
For readability, this document refers to vector nearest neighbor search as "vector search," exact nearest neighbor search as "exact search," and approximate nearest neighbor search as "approximate search."
:::
## Perform exact search
Exact search uses a full scan strategy, calculating the distance between the query vector and all vectors in the dataset to perform an exact search. This method ensures complete accuracy of the search results, but because it requires calculating the distance for all data, the search performance decreases significantly as the dataset grows.
When executing an exact search, the system calculates the distance between the query vector and every vector in the vector space, and then selects the k vectors closest to the query as the search results.
### Example: Euclidean search
Euclidean similarity search is used to retrieve the top-k vectors in the vector space that are closest to the query vector, using Euclidean distance as the metric. The following example demonstrates how to use exact search to retrieve the top 5 vectors from a table that are closest to the query vector:
```sql
-- Create a test table
CREATE TABLE t1 (
id INT PRIMARY KEY,
c1 VECTOR(3)
);
-- Insert data
INSERT INTO t1 VALUES
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');
-- Perform an exact search
SELECT c1
FROM t1
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;
```
The return result is as follows:
```shell
+---------------+
| c1 |
+---------------+
| [0.1,0.2,0.3] |
| [0.2,0.3,0.4] |
| [0.3,0.4,0.5] |
| [0.4,0.5,0.6] |
| [0.5,0.6,0.7] |
+---------------+
5 rows in set
```
### Analyze the execution plan
Obtain the execution plan of the preceding example:
```sql
EXPLAIN SELECT c1
FROM t1
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;
```
The return result is as follows:
```shell
+---------------------------------------------------------------------------------------------+
| Query Plan |
+---------------------------------------------------------------------------------------------+
| ================================================= |
| |ID|OPERATOR |NAME|EST.ROWS|EST.TIME(us)| |
| ------------------------------------------------- |
| |0 |TOP-N SORT | |5 |3 | |
| |1 |└─TABLE FULL SCAN|t1 |10 |3 | |
| ================================================= |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([t1.c1]), filter(nil), rowset=16 |
| sort_keys([l2_distance(t1.c1, cast('[0.1, 0.2, 0.3]', ARRAY(18, -1))), ASC]), topn(5) |
| 1 - output([t1.c1]), filter(nil), rowset=16 |
| access([t1.c1]), partitions(p0) |
| is_index_back=false, is_global_index=false, |
| range_key([t1.id]), range(MIN ; MAX)always true |
+---------------------------------------------------------------------------------------------+
14 rows in set
```
The analysis is as follows:
* Execution method:
* The full-table scan method is used, which requires traversing all data in the table. The `TABLE FULL SCAN` operation in the execution plan scans all data in the `t1` table.
* The system calculates the vector distance for each record and then sorts the records by distance. The `TOP-N SORT` operation in the execution plan calculates the vector distance using the `l2_distance` function and sorts the records by distance in ascending order.
* The system returns the five records with the smallest distances. The `topn(5)` setting in the execution plan indicates that only the first five records of the sorted list are returned.
* Performance characteristics:
* Advantages: The search results are completely accurate and ensure that the true nearest neighbors are returned.
* Disadvantages: The system must scan all data in the table and calculate the distance between all vectors, leading to a significant drop in performance as the data volume increases.
* Applicable scenarios:
* Scenarios with a small amount of data.
* Scenarios where high result accuracy is required.
* Offline or batch scenarios that do not require real-time responses over large datasets.
## Perform approximate search by using vector indexes
Vector index search uses an approximate nearest neighbor (ANN) strategy, accelerating the search process through pre-built index structures. While it cannot guarantee 100% result accuracy, it can significantly improve search performance, allowing for a good balance between accuracy and efficiency in practical applications.
### Example: Approximate search by using the HNSW index
```sql
-- Create a HNSW vector index with the table.
CREATE TABLE t2 (
id INT PRIMARY KEY,
vec VECTOR(3),
VECTOR INDEX idx(vec) WITH (distance=l2, type=hnsw, lib=vsag)
);
-- Insert test data
INSERT INTO t2 VALUES
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');
-- Perform approximate search and return the 5 most similar data records
SELECT id, vec
FROM t2
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
APPROXIMATE
LIMIT 5;
```
The return result is as follows. The result is the same as that of the exact search because the data volume is small:
```shell
+------+---------------+
| id | vec |
+------+---------------+
| 1 | [0.1,0.2,0.3] |
| 2 | [0.2,0.3,0.4] |
| 3 | [0.3,0.4,0.5] |
| 4 | [0.4,0.5,0.6] |
| 5 | [0.5,0.6,0.7] |
+------+---------------+
5 rows in set
```
### Analyze the execution plan
Obtain the execution plan of the preceding example:
```sql
EXPLAIN SELECT id, vec
FROM t2
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
APPROXIMATE
LIMIT 5;
```
The return result is as follows:
```shell
+--------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+--------------------------------------------------------------------------------------------------------------------+
| ==================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ---------------------------------------------------- |
| |0 |VECTOR INDEX SCAN|t2(idx)|10 |29 | |
| ==================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([t2.id], [t2.vec]), filter(nil), rowset=16 |
| access([t2.id], [t2.vec]), partitions(p0) |
| is_index_back=true, is_global_index=false, |
| range_key([t2.__vid_1750162978114053], [t2.__type_17_1750162978114364]), range(MIN,MIN ; MAX,MAX)always true |
+--------------------------------------------------------------------------------------------------------------------+
11 rows in set
```
The analysis is as follows:
* Execution method:
* The vector index scan method is used, directly locating similar vectors through the pre-built HNSW index. The `VECTOR INDEX SCAN` operation in the execution plan uses the index `t2(idx)` for retrieval.
* The graph structure of the index is used to quickly locate nearest neighbors without calculating the distance between all vectors. The `is_index_back=true` setting in the execution plan indicates that complete data is retrieved through index back-lookup.
* The five records that the index considers to be the most similar are returned. The `output([t2.id], [t2.vec])` in the execution plan indicates that the id and vector data are returned.
* Performance characteristics:
* Advantage: The search performance is high and remains stable as the data volume increases.
* Disadvantage: A small amount of error may exist in the results, and 100% accuracy is not guaranteed.
* Applicable scenarios:
* Real-time search for large-scale datasets.
* Scenarios with high requirements for search performance.
* Scenarios that can tolerate a small amount of result error.
## Summary
A comparison of the two search methods is as follows:
| Item | Exact search | Approximate search |
|--------|----------------|----------------|
| Execution method | Full-table scan (`TABLE FULL SCAN`) followed by sorting | Direct search through the vector index (`VECTOR INDEX SCAN`) |
| Execution plan | Contains two operators: `TABLE FULL SCAN` and `TOP-N SORT` | Contains only one operator: `VECTOR INDEX SCAN` |
| Performance characteristics | Requires full-table scan and sorting, and performance decreases significantly as the data volume increases | Directly locates target data through the index, and performance is stable |
| Result accuracy | 100% accurate, ensuring real nearest neighbors are returned | Approximately accurate, with a small amount of error possible |
| Applicable scenarios | Scenarios with small data volumes and high accuracy requirements | Scenarios with large-scale datasets and high performance requirements |
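In SQL, the two methods differ only in the `APPROXIMATE` keyword. The following minimal contrast assumes the `t2` table created above; without the keyword, the query is expected to fall back to the exact full-scan plan even though a vector index exists:
```sql
-- Exact search: full-table scan plus Top-N sort.
SELECT id FROM t2 ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]') LIMIT 5;
-- Approximate search: served by the HNSW index.
SELECT id FROM t2 ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]') APPROXIMATE LIMIT 5;
```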
### References
* For more information about SQL functions, see [Use SQL functions](250.vector-function.md).
* For more information about vector indexes and examples, see [Create vector indexes](200.vector-index/200.dense-vector-index.md).
* To perform large-scale performance tests, we recommend that you use the [VectorDBBench tool](700.vector-search-benchmark-test.md) to generate a test dataset and better compare the performance differences between exact search and approximate search.

View File

@@ -0,0 +1,127 @@
---
slug: /vector-search-benchmark-test
---
# Benchmark testing with VectorDBBench
VectorDBBench is a benchmarking tool that provides benchmark results for mainstream vector databases and cloud services. This topic explains how to use VectorDBBench to test the performance of the seekdb vector database. VectorDBBench is designed for ease of use, so you can readily replicate test results or evaluate new systems.
## Prerequisites
* Deploy seekdb.
* Install Python 3.11 or later. The following example uses Conda for installation:
```bash
# Download and install Conda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
# Reopen your terminal and initialize Conda
source ~/miniconda3/bin/activate
conda init --all
# Create and initialize the Python environment required by VectorDBBench
conda create -n vdb python=3.11
conda activate vdb
```
* Connect to the database and optimize memory and query parameters for HNSW vector index searches:
```sql
-- Set ob_vector_memory_limit_percentage to 30%.
ALTER SYSTEM SET ob_vector_memory_limit_percentage = 30;
-- Set ob_query_timeout to 24 hours.
SET GLOBAL ob_query_timeout = 86400000000;
-- Set max_allowed_packet to 1 GB.
SET GLOBAL max_allowed_packet=1073741824;
-- Set ddl_thread_score and parallel_servers_target to configure parallelism when creating indexes
ALTER SYSTEM SET ddl_thread_score = 8; -- Parallelism for DDL operations
SET GLOBAL parallel_servers_target = 624; -- Number of parallel queries the database server can handle simultaneously
```
Here, `ob_vector_memory_limit_percentage = 30` is only an example value. Adjust it based on the database memory and workload.
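  Before starting a long benchmark run, you can read the settings back to confirm that they took effect. This is a minimal check; `SHOW PARAMETERS` is assumed to behave as in OceanBase:
  ```sql
  -- Verify the cluster parameter and the global variables set above.
  SHOW PARAMETERS LIKE 'ob_vector_memory_limit_percentage';
  SHOW GLOBAL VARIABLES LIKE 'ob_query_timeout';
  SHOW GLOBAL VARIABLES LIKE 'max_allowed_packet';
  ```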
## Recommended configuration
The recommended resource specifications for the database are as follows:
| Parameter | Value |
|-------|----|
| Memory | 64 GB |
| CPU | 16 cores |
## Testing methods
### Clone the VectorDBBench code
:::tip
We recommend that you deploy VectorDBBench and seekdb on separate servers to avoid CPU resource contention and improve the reliability of test results.
:::
Clone the VectorDBBench test tool code to your local server.
```bash
git clone https://github.com/zilliztech/VectorDBBench.git
```
### Install dependencies
Go to the VectorDBBench directory and install the dependencies.
```bash
cd VectorDBBench
pip install .
```
### Run the test
Run VectorDBBench. Two examples are provided here: HNSW index and IVF index.
#### HNSW index example
```bash
# Replace $host, $port, and $user with the actual seekdb connection information.
vectordbbench oceanbasehnsw --host $host --port $port --user $user --database test --m 16 --ef-construction 200 --ef-search 40 --k 10 --case-type Performance768D1M --index-type HNSW
```
For more information about the parameters, run the following command:
```bash
vectordbbench oceanbasehnsw --help
```
Commonly used options are described as follows:
* `--num-concurrency`: Used to adjust the concurrency level. VectorDBBench executes vector queries with the specified concurrency and selects the highest QPS (Queries Per Second) as the final result.
* `--skip-drop-old`/`--skip-load`: Skips the deletion of old data and the data loading step. After adding these two options to the command line, the command only performs vector query operations and does not delete old data or reload data.
* `--k`: Specifies the number of top-k nearest neighbor results to return in a vector query.
* `--ef-search`: HNSW query parameter that indicates the size of the candidate set during query.
* `--index-type`: Specifies the index type. Currently supports `HNSW`, `HNSW_SQ`, and `HNSW_BQ`.
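For example, once the dataset has been loaded by a first run, you can rerun only the query phase with a larger candidate set to observe the recall/latency trade-off. The flag values below are illustrative; replace the placeholders as in the example above:
```bash
# Query-only rerun: keep the existing data and index, raise ef-search.
vectordbbench oceanbasehnsw --host $host --port $port --user $user --database test \
  --skip-drop-old --skip-load \
  --m 16 --ef-construction 200 --ef-search 200 --k 10 \
  --case-type Performance768D1M --index-type HNSW
```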
#### IVF index example
```bash
vectordbbench oceanbaseivf --host $host --port $port --user $user --database test --nlist 1000 --sample_per_nlist 256 --ivf_nprobes 100 --case-type Performance768D1M --index-type IVF_FLAT
```
Commonly used options are described as follows:
* `--sample_per_nlist`: The number of rows sampled per cluster center. Default value is `256`.
* `--ivf_nprobes`: Specifies how many of the nearest cluster centers to probe in a vector index query. Default value is `8`. A larger value yields a higher recall rate but a longer search time.
* `--index-type`: Specifies the index type. Currently supports `IVF_FLAT`.
For more information about the parameters, run the following command:
```bash
vectordbbench oceanbaseivf --help
```
## FAQs
### Is it normal for the first test execution to be slow?
The first test execution requires downloading the required dataset from AWS S3 storage, which may take relatively longer. This is normal.
### Can I customize and modify the test code?
Yes, you can. If you customize and modify the test code, you need to run `pip install .` again before running the test.

View File

@@ -0,0 +1,61 @@
---
slug: /vector-data-type
---
# Overview of vector data types
seekdb provides vector data types to support AI vector search applications. By using vector data types, you can store and query an array of floating-point numbers, such as `[0.1, 0.3, -0.9, ...]`. Before using vector data, you need to be aware of the following:
* Both dense and sparse vector data are supported, and all data elements must be single-precision floating-point numbers.
* Element values in vector data cannot be NaN (not a number) or Inf (infinity); otherwise, a runtime error will be thrown.
* You must specify the vector dimension when creating a vector column, for example, `VECTOR(3)`.
* Creating dense/sparse vector indexes is supported. For details, see [vector index](../200.vector-index/200.dense-vector-index.md).
* Vector data in seekdb is stored in array form.
* Both dense and sparse vectors support [hybrid search](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001970893).
## Syntax
A dense vector value can contain up to 16,000 floating-point numbers. The syntax is as follows:
```sql
-- Dense vector
'[<float>, <float>, ...]'
```
A sparse vector is based on the MAP data type and contains unordered key-value pairs. The syntax is as follows:
```sql
-- Sparse vector
'{<uint>:<float>, <uint>:<float>, ...}'
```
Examples of creating vector columns and indexes are as follows:
```sql
-- Create a dense vector column and index
CREATE TABLE t1(
c1 INT,
c2 VECTOR(3),
PRIMARY KEY(c1),
VECTOR INDEX idx1(c2) WITH (distance=L2, type=hnsw)
);
```
```sql
-- Create a sparse vector column
CREATE TABLE t2 (
c1 INT,
c2 SPARSEVECTOR
);
```
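As an illustration of the literal syntax above, the following statements insert one dense vector into `t1` and one sparse vector into `t2`. The values are arbitrary example data; the sparse literal maps unsigned-integer dimension indexes to float values:
```sql
-- Dense vector: the literal must match the declared dimension, VECTOR(3).
INSERT INTO t1(c1, c2) VALUES (1, '[0.1, 0.2, 0.3]');
-- Sparse vector: unordered <uint>:<float> pairs.
INSERT INTO t2(c1, c2) VALUES (1, '{1:0.5, 7:0.25}');
```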
## References
* [Create vector indexes](../200.vector-index/200.dense-vector-index.md)
* [Use SQL functions](../250.vector-function.md)

View File

@@ -0,0 +1,391 @@
---
slug: /vector-sdk-refer
---
# Compatibility
This topic describes the data model mappings, SDK interface compatibility, and concept mappings between seekdb's vector search feature and Milvus.
## Concept mappings
To help users familiar with Milvus quickly get started with seekdb's vector storage capabilities, we analyze the similarities and differences between the two systems and provide a mapping of related concepts.
### Data models
| **Data model layer** | **Milvus** | **seekdb** | **Description** |
|------------|---------|-----------|-----------|
| First layer | Shards | Partition | Milvus specifies partition rules by setting some columns as `partition_key` in the schema definition.<br/>seekdb supports range/range columns, list/list columns, hash, key, and subpartitioning strategies. |
| Second layer | Partitions | ≈Tablet | Milvus enhances read performance by chunking the same shard (shards are usually partitioned by primary key) based on other columns.<br />seekdb implements this by sorting keys within a partition. |
| Third layer | Segments | MemTable+SSTable | Both have a minor compaction mechanism. |
### SDKs
This section introduces the conceptual differences between seekdb's vector storage SDK (pyobvector) and Milvus's SDK (pymilvus).
pyobvector supports two usage modes:
1. pymilvus MilvusClient lightweight compatible mode: This mode is compatible with common interfaces of Milvus clients. Users familiar with Milvus can easily use this mode without concept mapping.
2. SQLAlchemy extension mode: This mode can be used as a vector feature extension of Python SQLAlchemy, retaining the operation mode of a relational database. Concept mapping is required.
For more information about pyobvector's APIs, see [pyobvector Python SDK API reference](900.vector-search-supported-clients-and-languages/200.vector-pyobvector.md).
The following table describes the concept mappings between pyobvector's SQLAlchemy extension mode and pymilvus:
| **pymilvus** | **pyobvector** | **Description** |
|---------|------------|---------------|
| Database | Database | Database |
| Collection | Table | Table |
| Field | Column | Column |
| Primary Key | Primary Key | Primary key |
| Vector Field | Vector Column | Vector column |
| Index | Index | Index |
| Partition | Partition | Partition |
| DataType | DataType | Data type |
| Metric Type | Distance Function | Distance function |
| Search | Query | Query |
| Insert | Insert | Insert |
| Delete | Delete | Delete |
| Update | Update | Update |
| Batch | Batch | Batch operations |
| Transaction | Transaction | Transaction |
| NONE | Not supported| NULL value |
| BOOL | Boolean | Corresponds to the MySQL TINYINT type |
| INT8 | Boolean | Corresponds to the MySQL TINYINT type |
| INT16 | SmallInteger | Corresponds to the MySQL SMALLINT type |
| INT32 | Integer | Corresponds to the MySQL INT type |
| INT64 | BigInteger | Corresponds to the MySQL BIGINT type |
| FLOAT | Float | Corresponds to the MySQL FLOAT type |
| DOUBLE | Double | Corresponds to the MySQL DOUBLE type |
| STRING | LONGTEXT | Corresponds to the MySQL LONGTEXT type |
| VARCHAR | STRING | Corresponds to the MySQL VARCHAR type |
| JSON | JSON | For differences and similarities in JSON operations, see [pyobvector Python SDK API reference](900.vector-search-supported-clients-and-languages/200.vector-pyobvector.md). |
| FLOAT_VECTOR | VECTOR | Vector type |
| BINARY_VECTOR | Not supported | |
| FLOAT16_VECTOR | Not supported | |
| BFLOAT16_VECTOR | Not supported | |
| SPARSE_FLOAT_VECTOR | Not supported | |
| dynamic_field | Not needed | The hidden `$meta` metadata column in Milvus.<br/>In seekdb, you can explicitly create a JSON-type column. |
## Compatibility with Milvus
### Milvus SDK
Except `load_collection()`, `release_collection()`, and `close()`, which are supported through SQLAlchemy, all operations listed in the following tables are supported.
**Collection operations**
| **Interface** | **Description** |
|---|---|
| create_collection() | Creates a vector table based on the given schema. |
| get_collection_stats() | Queries table statistics, such as the number of rows. |
| describe_collection() | Provides detailed metadata of a vector table. |
| has_collection() | Checks whether a table exists. |
| list_collections() | Lists existing tables. |
| drop_collection() | Drops a table. |
**Field and schema definition**
| **Interface** | **Description** |
|---|---|
| create_schema() | Creates a schema in memory and adds column definitions. |
| add_field() | The call sequence is: create_schema->add_field->...->add_field<br/>You can also manually build a FieldSchema list and then use the CollectionSchema constructor to create a schema. |
**Vector indexes**
| **Interface** | **Description** |
|---|---|
| list_indexes() | Lists all indexes. |
| create_index() | Supports creating multiple vector indexes in a single call. First, use prepare_index_params to initialize an index parameter list object, call add_index multiple times to set multiple index parameters, and finally call create_index to create the indexes. |
| drop_index() | Drops a vector index. |
| describe_index() | Gets the metadata (schema) of an index. |
**Vector search and DML operations**
| **Interface** | **Description** |
|---|---|
| search() | ANN query interface:<ul><li>collection_name: the table name</li><li>data: the query vectors</li><li>filter: filtering operation, equivalent to `WHERE`</li><li>limit: top K</li><li>output_fields: projected columns, equivalent to `SELECT`</li><li>partition_names: partition names (not supported in Milvus Lite)</li><li>anns_field: the index column name</li><li>search_params: vector distance function name and index algorithm-related parameters</li></ul> |
| query() | Point query with filter, namely `SELECT ... WHERE ids IN (..., ...) AND <filters>`. |
| get() | Point query without filter, namely `SELECT ... WHERE ids IN (..., ...)`. |
| delete() | Deletes a group of vectors, `DELETE FROM ... WHERE ids IN (..., ...)`. |
| insert() | Inserts a group of vectors. |
| upsert() | Insert with update on primary key conflict. |
**Collection metadata synchronization**
| **Interface** | **Description** |
|---|---|
| load_collection() | Loads the table structure from the database to the Python application memory, enabling the application to operate the database table in an object-oriented manner. This is a standard feature of an object-relational mapping (ORM) framework. |
| release_collection() | Releases the loaded table structure from the Python application memory and releases related resources. This is a standard feature of an ORM framework for memory management. |
| close() | Closes the database connection and releases related resources. This is a standard feature of an ORM framework. |
### pymilvus
#### Data model
The data model of Milvus comprises three levels: Shards->Partitions->Segments. Compatibility with seekdb is described as follows:
* Shards correspond to seekdb's Partition concept.
* Partitions currently have no corresponding concept in seekdb.
* Milvus allows you to partition a shard into blocks by other columns to improve read performance (shards are usually partitioned by primary key). seekdb implements this by sorting by primary key within a partition.
* Segments are similar to [MemTable](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001973721) + [SSTable](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001973722).
#### Milvus Lite API compatibility
##### Collection operations
1. Milvus create_collection():
```python
create_collection(
collection_name: str,
dimension: int,
primary_field_name: str = "id",
id_type: str = DataType,
vector_field_name: str = "vector",
metric_type: str = "COSINE",
auto_id: bool = False,
timeout: Optional[float] = None,
schema: Optional[CollectionSchema] = None, # Used for custom setup
index_params: Optional[IndexParams] = None, # Used for custom setup
**kwargs,
) -> None
```
seekdb compatibility is described as follows:
* collection_name: compatible, corresponds to table_name.
* dimension: compatible, vector(dim).
* primary_field_name: compatible, the primary key column name.
* id_type: compatible, the primary key column type.
* vector_field_name: compatible, the vector column name.
* auto_id: compatible, auto increment.
* timeout: compatible, seekdb supports it through hint.
* schema: compatible.
* index_params: compatible.
2. Milvus get_collection_stats():
```python
get_collection_stats(
collection_name: str,
timeout: Optional[float] = None
) -> Dict
```
seekdb compatibility is described as follows:
* API is compatible.
* Return value is compatible: `{ 'row_count': ... }`.
3. Milvus has_collection():
```python
has_collection(
collection_name: str,
timeout: Optional[float] = None
) -> Bool
```
seekdb is compatible with Milvus has_collection().
4. Milvus drop_collection():
```python
drop_collection(collection_name: str) -> None
```
seekdb is compatible with Milvus drop_collection().
5. Milvus rename_collection():
```python
rename_collection(
old_name: str,
new_name: str,
timeout: Optional[float] = None
) -> None
```
seekdb is compatible with Milvus rename_collection().
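The collection operations above can be exercised through pyobvector's Milvus-compatible client. The following is a minimal sketch, assuming a local seekdb instance reachable at `127.0.0.1:2881`; see the pyobvector API reference for the full signatures:
```python
# Minimal sketch of Milvus-style collection operations against seekdb.
from pyobvector import MilvusLikeClient

client = MilvusLikeClient(uri="127.0.0.1:2881", user="root@test", db_name="test")
# Quick setup: primary key "id" and vector field "vector" by default.
client.create_collection(collection_name="demo", dimension=3)
print(client.has_collection("demo"))        # True
print(client.get_collection_stats("demo"))  # {'row_count': 0}
client.drop_collection("demo")
```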
##### Schema-related
1. Milvus create_schema():
```python
create_schema(
auto_id: bool,
enable_dynamic_field: bool,
primary_field: str,
partition_key_field: str,
) -> CollectionSchema
```
seekdb compatibility is described as follows:
* auto_id: whether the primary key column is auto-increment, compatible.
* primary_field & partition_key_field: compatible.
2. Milvus add_field():
```python
add_field(
field_name: str,
datatype: DataType,
is_primary: bool,
max_length: int,
element_type: str,
max_capacity: int,
dim: int,
is_partition_key: bool,
)
```
seekdb is compatible with Milvus add_field().
##### Insert/Search-related
1. Milvus search():
```python
search(
collection_name: str,
data: Union[List[list], list],
filter: str = "",
limit: int = 10,
output_fields: Optional[List[str]] = None,
search_params: Optional[dict] = None,
timeout: Optional[float] = None,
partition_names: Optional[List[str]] = None,
**kwargs,
) -> List[dict]
```
seekdb compatibility is described as follows:
* filter: string expression. For usage examples, see: [Milvus Filtering Explained](https://milvus.io/docs/boolean.md). It is generally similar to SQL's `WHERE` expression.
* search_params:
* metric_type: compatible.
* radius & range filter: related to RNN, currently not supported.
* group_by_field: groups ANN results, currently not supported.
* max_empty_result_buckets: used for IVF series indexes, currently not supported.
* ignore_growing: skips incremental data and directly reads baseline index, currently not supported.
* partition_names: partition read, supported.
* kwargs:
* offset: the number of records to skip in search results, currently not supported.
* round_decimal: rounds results to specified decimal places, currently not supported.
2. Milvus get():
```python
get(
collection_name: str,
ids: Union[list, str, int],
output_fields: Optional[List[str]] = None,
timeout: Optional[float] = None,
partition_names: Optional[List[str]] = None,
**kwargs,
) -> List[dict]
```
seekdb is compatible with Milvus get().
3. Milvus delete()
```python
delete(
collection_name: str,
ids: Optional[Union[list, str, int]] = None,
timeout: Optional[float] = None,
filter: Optional[str] = "",
partition_name: Optional[str] = "",
**kwargs,
) -> dict
```
seekdb is compatible with Milvus delete().
4. Milvus insert()
```python
insert(
collection_name: str,
data: Union[Dict, List[Dict]],
timeout: Optional[float] = None,
partition_name: Optional[str] = "",
) -> List[Union[str, int]]
```
seekdb is compatible with Milvus insert().
5. Milvus upsert()
```python
upsert(
collection_name: str,
data: Union[Dict, List[Dict]],
timeout: Optional[float] = None,
partition_name: Optional[str] = "",
) -> List[Union[str, int]]
```
seekdb is compatible with Milvus upsert().
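A minimal end-to-end sketch of the insert/search interfaces above, again via pyobvector's Milvus-compatible client; the collection layout relies on the quick-setup defaults (`id` primary key, `vector` vector field), and the data values are examples:
```python
from pyobvector import MilvusLikeClient

client = MilvusLikeClient(uri="127.0.0.1:2881", user="root@test", db_name="test")
client.create_collection(collection_name="demo", dimension=3)
# insert(): rows are dictionaries keyed by column name.
client.insert(
    collection_name="demo",
    data=[{"id": 1, "vector": [0.1, 0.2, 0.3]},
          {"id": 2, "vector": [0.2, 0.3, 0.4]}],
)
# search(): ANN query returning the top-2 nearest rows.
res = client.search(
    collection_name="demo",
    data=[0.1, 0.2, 0.3],
    anns_field="vector",
    limit=2,
    output_fields=["id"],
)
print(res)
```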
##### Index-related
1. Milvus create_index()
```python
create_index(
collection_name: str,
index_params: IndexParams,
timeout: Optional[float] = None,
**kwargs,
)
```
seekdb is compatible with Milvus create_index().
2. Milvus drop_index()
```python
drop_index(
collection_name: str,
index_name: str,
timeout: Optional[float] = None,
**kwargs,
)
```
seekdb is compatible with Milvus drop_index().
## Compatibility with MySQL protocol
* In terms of request initiation: All APIs are implemented through general query SQL, and there are no compatibility issues.
* In terms of response result set processing: Only the handling of the new vector data elements needs to be considered. Currently, parsing vector elements as strings and bytes is supported. Even if the transmission mode of vector data elements changes in the future, compatibility can be maintained by updating the SDK.

View File

@@ -0,0 +1,12 @@
---
slug: /vector-search-supported-clients-and-languages-overview
---
# Supported clients and languages for vector search
| Client/Language | Version |
|---|---|
| MySQL client | All versions |
| Python SDK | 3.9+ |
| Java SDK | 1.8 |

View File

@@ -0,0 +1,318 @@
---
slug: /vector-pyobvector
---
# pyobvector Python SDK API reference
pyobvector is the Python SDK for seekdb's vector storage feature. It provides two operating modes:
* pymilvus-compatible mode: Operates the database using the MilvusLikeClient object, offering commonly used APIs compatible with the lightweight MilvusClient.
* SQLAlchemy extension mode: Operates the database using the ObVecClient object, serving as an extension of Python's SDK for relational databases.
This topic describes the APIs in the two modes and provides examples.
## MilvusLikeClient
### Constructor
```python
def __init__(
self,
uri: str = "127.0.0.1:2881",
user: str = "root@test",
password: str = "",
db_name: str = "test",
**kwargs,
)
```
### Collection-related APIs
| API | Description | Example |
|------|------|------|
| `def create_schema(self, **kwargs) -> CollectionSchema:` | Creates a `CollectionSchema` object.<ul><li>Parameters are optional, allowing the initialization of an empty schema definition.</li><li>Optional parameters include:<ul><li>`fields`: A list of `FieldSchema` objects (see the `add_field` interface below for details).</li><li>`partitions`: Partitioning rules (see the section on defining partition rules using `ObPartition`).</li><li>`description`: Compatible with Milvus, but currently has no practical effect in seekdb.</li></ul></li></ul> | |
| `def create_collection(<br/>self,<br/>collection_name: str,<br/>dimension: Optional[int] = None,<br/>primary_field_name: str = "id",<br/>id_type: Union[DataType, str] = DataType.INT64,<br/>vector_field_name: str = "vector",<br/>metric_type: str = "l2",<br/>auto_id: bool = False,<br/>timeout: Optional[float] = None,<br/>schema: Optional[CollectionSchema] = None, # Used for custom setup<br/>index_params: Optional[IndexParams] = None, # Used for custom setup<br/>max_length: int = 16384,<br/>**kwargs,<br/>)` | Creates a table: <ul><li>collection_name: the table name</li><li>dimension: the vector data dimension</li><li>primary_field_name: the primary field name</li><li>id_type: the primary field data type (only supports VARCHAR and INT types)</li><li>vector_field_name: the vector field name</li><li>metric_type: not used in seekdb, but maintained for API compatibility (because the main table definition does not need to specify a vector distance function)</li><li>auto_id: specifies whether the primary field value increases automatically</li><li>timeout: not used in seekdb, but maintained for API compatibility</li><li>schema: the custom collection schema. When `schema` is not None, the parameters from dimension to metric_type will be ignored</li><li>index_params: the custom vector index parameters</li><li>max_length: the maximum varchar length when the primary field data type is VARCHAR and `schema` is not None</li></ul> | `client.create_collection(<br/>collection_name=test_collection_name,<br/>schema=schema,<br/>index_params=idx_params,<br/>)` |
| `def get_collection_stats(<br/>self, collection_name: str, timeout: Optional[float] = None # pylint: disable=unused-argument<br/>) -> Dict:` | Queries the record count of a table.<ul><li>collection_name: the table name</li><li>timeout: not used in seekdb, but maintained for API compatibility</li></ul> | |
| `def has_collection(self, collection_name: str, timeout: Optional[float] = None) -> bool` | Verifies whether a table exists.<ul><li>collection_name: the table name</li><li>timeout: not used in seekdb, but maintained for API compatibility</li></ul> | |
| `def drop_collection(self, collection_name: str) -> None` | Drops a table.<ul><li>collection_name: the table name</li></ul> | |
| `def load_table(self, collection_name: str,)` | Reads the metadata of a table to the SQLAlchemy metadata cache.<ul><li>collection_name: the table name</li></ul> | |
### CollectionSchema & FieldSchema
MilvusLikeClient describes the schema of a table by using a CollectionSchema. A CollectionSchema contains multiple FieldSchemas, and a FieldSchema describes the column schema of a table.
#### Create a CollectionSchema by using the create_schema method of the MilvusLikeClient
```python
def __init__(
self,
fields: Optional[List[FieldSchema]] = None,
partitions: Optional[ObPartition] = None,
description: str = "", # ignored in oceanbase
**kwargs,
)
```
The parameters are described as follows:
* fields: an optional parameter that specifies a list of FieldSchema objects.
* partitions: partition rules (for more information, see the ObPartition section).
* description: compatible with Milvus, but currently has no practical effect in seekdb.
#### Create a FieldSchema and register it to a CollectionSchema
```python
def add_field(self, field_name: str, datatype: DataType, **kwargs)
```
* field_name: the column name.
* datatype: the column data type. For supported data types, see [Compatibility reference](../800.vector-sdk-refer.md).
* kwargs: additional parameters for configuring column properties, as shown below:
```python
def __init__(
self,
name: str,
dtype: DataType,
description: str = "",
is_primary: bool = False,
auto_id: bool = False,
nullable: bool = False,
**kwargs,
)
```
The parameters are described as follows:
* is_primary: specifies whether the column is a primary key.
* auto_id: specifies whether the column value increases automatically.
* nullable: specifies whether the column can be null.
#### Example
```python
schema = self.client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(
field_name="title_vector", datatype=DataType.FLOAT_VECTOR, dim=768
)
schema.add_field(field_name="link", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="reading_time", datatype=DataType.INT64)
schema.add_field(
field_name="publication", datatype=DataType.VARCHAR, max_length=512
)
schema.add_field(field_name="claps", datatype=DataType.INT64)
schema.add_field(field_name="responses", datatype=DataType.INT64)
self.client.create_collection(
collection_name="medium_articles_2020", schema=schema
)
```
### Index-related APIs
| API | Description | Example/Remarks |
|-----|-----|-----|
| `def create_index(<br/>self,<br/>collection_name: str,<br/>index_params: IndexParams,<br/>timeout: Optional[float] = None,<br/>**kwargs,<br/>)` | Creates a vector index table based on the constructed IndexParams (for more information about how to use IndexParams, see the prepare_index_params and add_index APIs).<ul><li>collection_name: the table name</li><li>index_params: the index parameters</li><li>timeout: not used in seekdb, but maintained for API compatibility</li><li>kwargs: other parameters, currently not used, maintained for compatibility</li></ul> | |
| `def drop_index(<br/>self,<br/>collection_name: str,<br/>index_name: str,<br/>timeout: Optional[float] = None,<br/>**kwargs,<br/>)` | Drops an index table.<ul><li>collection_name: the table name</li><li>index_name: the index name</li></ul> | |
| `def refresh_index(<br/>self,<br/>collection_name: str,<br/>index_name: str,<br/>trigger_threshold: int = 10000,<br/>)` | Refreshes a vector index table to improve read performance. It can be understood as a process of moving incremental data.<ul><li>collection_name: the table name</li><li>index_name: the index name</li><li>trigger_threshold: the trigger threshold of the refresh action. A refresh is triggered when the data volume of the index table exceeds the threshold.</li></ul> | An API introduced by seekdb.<br/>Not compatible with Milvus. |
| `def rebuild_index(<br/>self,<br/>collection_name: str,<br/>index_name: str,<br/>trigger_threshold: float = 0.2,<br/>)` | Rebuilds a vector index table to improve read performance. It can be understood as a process of merging incremental data into baseline index data.<ul><li>collection_name: the table name</li><li>index_name: the index name</li><li>trigger_threshold: the trigger threshold of the rebuild action. The value range is 0 to 1. A rebuild is triggered when the proportion of incremental data to full data reaches the threshold.</li></ul> | An API introduced by seekdb.<br/>Not compatible with Milvus. |
| `def search(<br/>self,<br/>collection_name: str,<br/>data: list,<br/>anns_field: str,<br/>with_dist: bool = False,<br/>filter=None,<br/>limit: int = 10,<br/>output_fields: Optional[List[str]] = None,<br/>search_params: Optional[dict] = None,<br/>timeout: Optional[float] = None,<br/>partition_names: Optional[List[str]] = None,<br/>**kwargs,<br/>) -> List[dict]` | Executes a vector approximate nearest neighbor search.<ul><li>collection_name: the table name</li><li>data: the vector data to be searched</li><li>anns_field: the name of the vector column to be searched</li><li>with_dist: specifies whether to return results with vector distances</li><li>filter: the filter condition applied to the approximate nearest neighbor search</li><li>limit: top K</li><li>output_fields: the output columns (also known as projection columns)</li><li>search_params: supports only the `metric_type` values `l2`/`neg_ip` (for example: `search_params = {"metric_type": "neg_ip"}`)</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_names: limits the query to some partitions</li></ul><ul>Return value:<br/>A list of records, where each record is a dictionary<br/>representing a mapping from column_name to column values.</ul> | `res = self.client.search(<br/>collection_name=test_collection_name,<br/>data=[0, 0, 1],<br/>anns_field="embedding",<br/>limit=5,<br/>output_fields=["id"],<br/>search_params={"metric_type": "neg_ip"}<br/>)<br/>self.assertEqual(<br/> set([r['id'] for r in res]), set([12, 111, 11, 112, 10]))` |
| `def query(<br/>self,<br/>collection_name: str,<br/>flter=None,<br/>output_fields: Optional[List[str]] = None,<br/>timeout: Optional[float] = None,<br/>partition_names: Optional[List[str]] = None,<br/>**kwargs,<br/>) -> List[dict]` | Reads data records using the specified filter condition.<ul><li>collection_name: the table name</li><li>flter: the filter condition, expressed as a list of SQLAlchemy expressions (see the example)</li><li>output_fields: the output columns (also known as projection columns)</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_names: limits the query to some partitions</li></ul><ul>Return value:<br/>A list of records, where each record is a dictionary<br/>representing a mapping from column_name to column values.</ul> | `table = self.client.load_table(collection_name=test_collection_name)<br/>where_clause = [table.c["id"] < 100]<br/>res = self.client.query(<br/> collection_name=test_collection_name,<br/> output_fields=["id"],<br/> flter=where_clause,<br/>)` |
| `def get(<br/>self,<br/>collection_name: str,<br/>ids: Union[list, str, int],<br/>output_fields: Optional[List[str]] = None,<br/>timeout: Optional[float] = None,<br/>partition_names: Optional[List[str]] = None,<br/>**kwargs,<br/>) -> List[dict]` | Retrieves records based on the specified primary keys `ids`:<ul><li>collection_name: the table name</li><li>ids: a single ID or a list of IDs. Note: The ids parameter of MilvusLikeClient get interface is different from ObVecClient get. For details, see <a href="#DML%20operations">ObVecClient get</a></li><li>output_fields: the output columns (also known as projection columns)</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_names: limits the query to some partitions</li></ul>Return value:<br/>A list of records, where each record is a dictionary<br/>representing a mapping from column_name to column values. | `res = self.client.get(<br/> collection_name=test_collection_name,<br/> output_fields=["id", "meta"],<br/> ids=[80, 12, 112],<br/>)` |
| `def delete(<br/>self,<br/>collection_name: str,<br/>ids: Optional[Union[list, str, int]] = None,<br/>timeout: Optional[float] = None, # pylint: disable=unused-argument<br/>flter=None,<br/>partition_name: Optional[str] = "",<br/>**kwargs, # pylint: disable=unused-argument<br/>)` | Deletes data in a collection.<ul><li>collection_name: the table name</li><li>ids: a single ID or a list of IDs</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>flter: the filter condition for the deletion</li><li>partition_name: limits the deletion operation to a partition</li></ul> | `self.client.delete(<br/> collection_name=test_collection_name, ids=[12, 112], partition_name="p0"<br/>)` |
| `def insert(<br/> self, <br/> collection_name: str, <br/> data: Union[Dict, List[Dict]], <br/> timeout: Optional[float] = None, <br/> partition_name: Optional[str] = ""<br/>)` | Inserts data into a table.<ul><li>collection_name: the table name</li><li>data: the data to be inserted, described in Key-Value form</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_name: limits the insertion operation to a partition</li></ul> | `data = [<br/> {"id": 12, "embedding": [1, 2, 3], "meta": {"doc": "document 1"}},<br/> {<br/> "id": 90,<br/> "embedding": [0.13, 0.123, 1.213],<br/> "meta": {"doc": "document 1"},<br/> },<br/> {"id": 112, "embedding": [1, 2, 3], "meta": None},<br/> {"id": 190, "embedding": [0.13, 0.123, 1.213], "meta": None},<br/>]<br/>self.client.insert(collection_name=test_collection_name, data=data)` |
| `def upsert(<br/>self,<br/>collection_name: str,<br/>data: Union[Dict, List[Dict]],<br/>timeout: Optional[float] = None, # pylint: disable=unused-argument<br/>partition_name: Optional[str] = "",<br/>) -> List[Union[str, int]]` | Updates data in a table. If a primary key already exists, updates the corresponding record; otherwise, inserts a new record.<ul><li>collection_name: the table name</li><li>data: the data to be inserted or updated, in the same format as the insert interface</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_name: limits the operation to a specified partition</li></ul> | `data = [<br/> {"id": 112, "embedding": [1, 2, 3], "meta": {'doc':'hhh1'}},<br/> {"id": 190, "embedding": [0.13, 0.123, 1.213], "meta": {'doc':'hhh2'}},<br/>]<br/>self.client.upsert(collection_name=test_collection_name, data=data)` |
| `def perform_raw_text_sql(self, text_sql: str):<br/> return super().perform_raw_text_sql(text_sql)` | Executes an SQL statement directly.<ul><li>text_sql: the SQL statement to be executed</li></ul>Return value:<br/>Returns an iterator that provides result sets from SQLAlchemy. | |
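For operations not covered by the Milvus-style interfaces, `perform_raw_text_sql` runs arbitrary SQL. A small sketch, assuming `client` is a connected `MilvusLikeClient` and the `medium_articles_2020` table from the earlier example exists:
```python
# The return value is a SQLAlchemy result iterator.
res = client.perform_raw_text_sql("SELECT COUNT(*) FROM medium_articles_2020")
for row in res:
    print(row)  # each row is a tuple
```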
## ObVecClient
### Constructor
```python
def __init__(
self,
uri: str = "127.0.0.1:2881",
user: str = "root@test",
password: str = "",
db_name: str = "test",
**kwargs,
)
```
### Table mode-related operations
| API | Description | Example/Remarks |
|-----|-----|-----|
| `def check_table_exists(self, table_name: str)` | Checks whether a table exists.<ul><li>table_name: the table name</li></ul> | |
| `def create_table(<br/>self,<br/>table_name: str,<br/>columns: List[Column],<br/>indexes: Optional[List[Index]] = None,<br/>partitions: Optional[ObPartition] = None,<br/>)` | Creates a table.<ul><li>table_name: the table name</li><li>columns: the column schema of the table, defined using SQLAlchemy</li><li>indexes: a set of index schemas, defined using SQLAlchemy</li><li>partitions: optional partition rules (see the section on using ObPartition to define partition rules)</li></ul> | |
| `@classmethod<br/>def prepare_index_params(cls)` | Creates an IndexParams object to record the schema definition of a vector index table.<br/>`class IndexParams:<br/> """Vector index parameters for MilvusLikeClient"""<br/> def __init__(self):<br/> self._indexes = {}`<br/>The definition of IndexParams is very simple: it contains a single dictionary member<br/>that stores a mapping from a tuple of (column name, index name) to an IndexParam structure.<br/>The constructor of the IndexParam class is:<br/>`def __init__(<br/> self,<br/> index_name: str,<br/> field_name: str,<br/> index_type: Union[VecIndexType, str],<br/> **kwargs<br/>)`<ul><li>index_name: the vector index table name</li><li>field_name: the vector column name</li><li>index_type: an enumerated class for vector index algorithm types. Currently, only HNSW is supported.</li></ul>After obtaining an IndexParams by calling `prepare_index_params`, you can register an IndexParam using the `add_index` interface:<br/>`def add_index(<br/> self,<br/> field_name: str,<br/> index_type: VecIndexType,<br/> index_name: str,<br/> **kwargs<br/>)`<br/>The parameter meanings are the same as those in the IndexParam constructor. | Here is a usage example for creating a vector index:<br/>`idx_params = self.client.prepare_index_params()<br/>idx_params.add_index(<br/> field_name="title_vector",<br/> index_type="HNSW",<br/> index_name="vidx_title_vector",<br/> metric_type="L2",<br/> params={"M": 16, "efConstruction": 256},<br/>)<br/>self.client.create_collection(<br/> collection_name=test_collection_name,<br/> schema=schema,<br/> index_params=idx_params,<br/>)`<br/>Note that the `prepare_index_params` function is recommended for use in MilvusLikeClient, not in ObVecClient. In ObVecClient mode, you should use the `create_index` interface to define a vector index table. (For details, see the create_index interface.) |
| `def create_table_with_index_params(<br/>self,<br/>table_name: str,<br/>columns: List[Column],<br/>indexes: Optional[List[Index]] = None,<br/>vidxs: Optional[IndexParams] = None,<br/>partitions: Optional[ObPartition] = None,<br/>)` | Creates a table and a vector index at the same time using optional index_params.<ul><li>table_name: the table name</li><li>columns: the column schema of the table, defined using SQLAlchemy</li><li>indexes: a set of index schemas, defined using SQLAlchemy</li><li>vidxs: the vector index schema, specified using IndexParams</li><li>partitions: optional partition rules (see the section on using ObPartition to define partition rules)</li></ul> | Recommended for use in MilvusLikeClient, not recommended for use in ObVecClient |
| `def create_index(<br/>self,<br/>table_name: str,<br/>is_vec_index: bool,<br/>index_name: str,<br/>column_names: List[str],<br/>vidx_params: Optional[str] = None,<br/>**kw,<br/>)` | Supports creating both normal indexes and vector indexes.<ul><li>table_name: the table name</li><li>is_vec_index: specifies whether to create a normal index or a vector index</li><li>index_name: the index name</li><li>column_names: the columns on which to create the index</li><li>vidx_params: the vector index parameters, for example: `"distance=l2, type=hnsw, lib=vsag"`</li></ul>Currently, seekdb supports only `type=hnsw` and `lib=vsag`. Please retain these settings. The distance can be set to `l2` or `inner_product`. | `self.client.create_index(<br/> test_collection_name,<br/> is_vec_index=True,<br/> index_name="vidx",<br/> column_names=["embedding"],<br/> vidx_params="distance=l2, type=hnsw, lib=vsag",<br/>)` |
| `def create_vidx_with_vec_index_param(<br/>self,<br/>table_name: str,<br/>vidx_param: IndexParam,<br/>)` | Creates a vector index using vector index parameters.<ul><li>table_name: the table name</li><li>vidx_param: the vector index parameters constructed using IndexParam</li></ul> | |
| `def drop_table_if_exist(self, table_name: str)` | Drops a table.<ul><li>table_name: the table name</li></ul> | |
| `def drop_index(self, table_name: str, index_name: str)` | Drops an index.<ul><li>table_name: the table name</li><li>index_name: the index name</li></ul> | |
| `def refresh_index(<br/>self,<br/>table_name: str,<br/>index_name: str,<br/>trigger_threshold: int = 10000,<br/>)` | Refreshes a vector index table to improve read performance. It can be understood as a process of moving incremental data.<ul><li>table_name: the table name</li><li>index_name: the index name</li><li>trigger_threshold: the trigger threshold of the refresh action. A refresh is triggered when the data volume of the index table exceeds the threshold.</li></ul> | |
| `def rebuild_index(<br/>self,<br/>table_name: str,<br/>index_name: str,<br/>trigger_threshold: float = 0.2,<br/>)` | Rebuilds a vector index table to improve read performance. It can be understood as a process of merging incremental data into baseline index data.<ul><li>table_name: the table name</li><li>index_name: the index name</li><li>trigger_threshold: the trigger threshold of the rebuild action. The value range is 0 to 1. A rebuild is triggered when the proportion of incremental data to full data reaches the threshold.</li></ul> | |
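The following sketch pulls the table-mode interfaces together: it creates a table from an SQLAlchemy column list and then adds an HNSW vector index. The `VECTOR` column type is assumed to be exported by pyobvector, as in its repository examples:
```python
from pyobvector import ObVecClient, VECTOR
from sqlalchemy import Column, Integer

client = ObVecClient(uri="127.0.0.1:2881", user="root@test", db_name="test")
client.create_table(
    table_name="items",
    columns=[
        Column("id", Integer, primary_key=True),
        Column("embedding", VECTOR(3)),
    ],
)
# Vector index parameters as documented for create_index above.
client.create_index(
    "items",
    is_vec_index=True,
    index_name="vidx",
    column_names=["embedding"],
    vidx_params="distance=l2, type=hnsw, lib=vsag",
)
```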
### DML operations
| API | Description | Example/Remarks |
|-----|-----|-----|
| `def insert(<br/>self,<br/>table_name: str,<br/>data: Union[Dict, List[Dict]],<br/>partition_name: Optional[str] = "",<br/>)` | Inserts data into a table.<ul><li>table_name: the table name</li><li>data: the data to be inserted, described in Key-Value form</li><li>partition_name: limits the insertion operation to a partition</li></ul> | `vector_value1 = [0.748479, 0.276979, 0.555195]<br/>vector_value2 = [0, 0, 0]<br/>data1 = [{"id": i, "embedding": vector_value1} for i in range(10)]<br/>data1.extend([{"id": i, "embedding": vector_value2} for i in range(10, 13)])<br/>data1.extend([{"id": i, "embedding": vector_value2} for i in range(111, 113)])<br/>self.client.insert(test_collection_name, data=data1)` |
| `def upsert(<br/>self,<br/>table_name: str,<br/>data: Union[Dict, List[Dict]],<br/>partition_name: Optional[str] = "",<br/>)` | Inserts or updates data in a table. If a primary key already exists, updates the corresponding record; otherwise, inserts a new record.<ul><li>table_name: the table name</li><li>data: the data to be inserted or updated, in Key-Value format</li><li>partition_name: limits the operation to a specified partition</li></ul> | |
| `def update(<br/>self,<br/>table_name: str,<br/>values_clause,<br/>where_clause=None,<br/>partition_name: Optional[str] = "",<br/>)` | Updates the rows in a table that match the specified condition.<ul><li>table_name: the table name</li><li>values_clause: the values of the columns to be updated</li><li>where_clause: the condition for updating</li><li>partition_name: limits the update operation to some partitions</li></ul> | `data = [<br/> {"id": 112, "embedding": [1, 2, 3], "meta": {'doc':'hhh1'}},<br/> {"id": 190, "embedding": [0.13, 0.123, 1.213], "meta": {'doc':'hhh2'}},<br/>]<br/>client.insert(collection_name=test_collection_name, data=data)<br/>client.update(<br/> table_name=test_collection_name,<br/> values_clause=[{'meta':{'doc':'HHH'}}],<br/> where_clause=[text("id=112")]<br/>)` |
| `def delete(<br/>self,<br/>table_name: str,<br/>ids: Optional[Union[list, str, int]] = None,<br/>where_clause=None,<br/>partition_name: Optional[str] = "",<br/>)` | Deletes data from a table.<ul><li>table_name: the table name</li><li>ids: a single ID or a list of IDs</li><li>where_clause: the condition for deletion</li><li>partition_name: limits the deletion operation to some partitions</li></ul> | `self.client.delete(test_collection_name, ids=["bcd", "def"])` |
| `def get(<br/>self,<br/>table_name: str,<br/>ids: Optional[Union[list, str, int]],<br/>where_clause = None,<br/>output_column_name: Optional[List[str]] = None,<br/>partition_names: Optional[List[str]] = None,<br/>)` | Retrieves records based on the specified primary keys `ids`.<ul><li>table_name: the table name</li><li>ids: a single ID or a list of IDs. Optional parameter, can be `ids=None` if not provided. The ids parameter of ObVecClient get interface is different from MilvusLikeClient get. For details, see <a href="#Index-related%20APIs">MilvusLikeClient get</a></li><li>where_clause: the condition for retrieval</li><li>output_column_name: a list of output column or projection column names</li><li>partition_names: limits the retrieval operation to some partitions</li></ul>Return value:<br/>Unlike MilvusLikeClient, the return value of ObVecClient is a tuple list, with each tuple representing a row of records. | `res = self.client.get(<br/> test_collection_name,<br/> ids=["abc", "bcd", "cde", "def"],<br/> where_clause=[text("meta->'$.page' > 1")],<br/> output_column_name=['id']<br/>)` |
| `def set_ob_hnsw_ef_search(self, ob_hnsw_ef_search: int)` | Set the efSearch parameter of the HNSW index. This is a session-level variable. The larger the value of ef_search, the higher the recall rate but the poorer the query performance. <ul><li>ob_hnsw_ef_search: the efSearch parameter of the HNSW index</li></ul> | |
| `def get_ob_hnsw_ef_search(self) -> int` | Get the efSearch parameter of the HNSW index. | |
| `def ann_search(<br/>self,<br/>table_name: str,<br/>vec_data: list,<br/>vec_column_name: str,<br/>distance_func,<br/>with_dist: bool = False,<br/>topk: int = 10,<br/>output_column_names: Optional[List[str]] = None,<br/>extra_output_cols: Optional[List] = None,<br/>where_clause=None,<br/>partition_names: Optional[List[str]] = None,<br/>**kwargs,<br/>)` | Executes a vector approximate nearest neighbor search.<ul><li>table_name: the table name</li><li>vec_data: the vector data to be searched</li><li>vec_column_name: the name of the vector column to be searched</li><li>distance_func: the distance function. Provides an extension of SQLAlchemy func, with optional values: `func.l2_distance`/`func.cosine_distance`/`func.inner_product`/`func.negative_inner_product`, representing the l2 distance function, cosine distance function, inner product distance function, and negative inner product distance function, respectively</li><li>with_dist: specifies whether to return results with vector distances</li><li>topk: the number of nearest vectors to retrieve</li><li>output_column_names: a list of output column or projection column names</li><li>extra_output_cols: additional output columns that allow more complex output expressions</li><li>where_clause: the filter condition</li><li>partition_names: limits the query to some partitions</li></ul>Return value:<br/>Unlike MilvusLikeClient, the return value of ObVecClient is a tuple list, with each tuple representing a row of records. | `res = self.client.ann_search(<br/> test_collection_name,<br/> vec_data=[0, 0, 0],<br/> vec_column_name="embedding",<br/> distance_func=func.l2_distance,<br/> with_dist=True,<br/> topk=5,<br/> output_column_names=["id"],<br/>)` |
| `def precise_search(<br/>self,<br/>table_name: str,<br/>vec_data: list,<br/>vec_column_name: str,<br/>distance_func,<br/>topk: int = 10,<br/>output_column_names: Optional[List[str]] = None,<br/>where_clause=None,<br/>**kwargs,<br/>)` | Executes a precise neighbor search algorithm.<ul><li>table_name: the table name</li><li>vec_data: the query vector</li><li>vec_column_name: the vector column name</li><li>distance_func: the vector distance function. Provides an extension of SQLAlchemy func, with optional values: func.l2_distance/func.cosine_distance/func.inner_product/func.negative_inner_product, representing the l2 distance function, cosine distance function, inner product distance function, and negative inner product distance function, respectively</li><li>topk: the number of nearest vectors to retrieve</li><li>output_column_names: a list of output column or projection column names</li><li>where_clause: the filter condition</li></ul>Return value:<br/>Unlike MilvusLikeClient, the return value of ObVecClient is a tuple list, with each tuple representing a row of records. | |
| `def perform_raw_text_sql(self, text_sql: str)` | Executes an SQL statement directly.<ul><li>text_sql: the SQL statement to be executed</li></ul>Return value:<br/>Returns an iterator that provides result sets from SQLAlchemy. | |
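Building on the `items` table from the previous sketch, the DML interfaces combine as follows; note that, unlike MilvusLikeClient, results come back as tuples. The row values are examples:
```python
from sqlalchemy import func

# insert(): rows as dictionaries keyed by column name.
client.insert("items", data=[{"id": i, "embedding": [0.1 * i] * 3} for i in range(5)])
# ann_search(): top-3 nearest neighbors with distances.
rows = client.ann_search(
    "items",
    vec_data=[0.1, 0.1, 0.1],
    vec_column_name="embedding",
    distance_func=func.l2_distance,
    with_dist=True,
    topk=3,
    output_column_names=["id"],
)
for row in rows:
    print(row)  # each row is a tuple, e.g. (1, 0.0)
```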
## Define partitioning rules by using ObPartition
pyobvector supports range/range-columns, list/list-columns, hash, and key partitioning, as well as subpartitioning, through the following classes:
* ObRangePartition: specifies to perform range partitioning. Set `is_range_columns` to `True` when you construct this object to create range column partitioning.
* ObListPartition: specifies to perform list partitioning. Set `is_list_columns` to `True` when you construct this object to create list column partitioning.
* ObHashPartition: specifies to perform hash partitioning.
* ObKeyPartition: specifies to perform key partitioning.
* ObSubRangePartition: specifies to perform sub-range partitioning. Set `is_range_columns` to `True` when you construct this object to create sub-range column partitioning.
* ObSubListPartition: specifies to perform sub-list partitioning. Set `is_list_columns` to `True` when you construct this object to create sub-list column partitioning.
* ObSubHashPartition: specifies to perform sub-hash partitioning.
* ObSubKeyPartition: specifies to perform sub-key partitioning.
### Example of range partitioning
```python
range_part = ObRangePartition(
False,
range_part_infos=[
RangeListPartInfo("p0", 100),
RangeListPartInfo("p1", "maxvalue"),
],
range_expr="id",
)
```
### Example of list partitioning
```python
list_part = ObListPartition(
False,
list_part_infos=[
RangeListPartInfo("p0", [1, 2, 3]),
RangeListPartInfo("p1", [5, 6]),
RangeListPartInfo("p2", "DEFAULT"),
],
list_expr="col1",
)
```
### Example of hash partitioning
```python
hash_part = ObHashPartition("col1", part_count=60)
```
### Example of multi-level partitioning
```python
# Perform range partitioning
range_columns_part = ObRangePartition(
True,
range_part_infos=[
RangeListPartInfo("p0", 100),
RangeListPartInfo("p1", 200),
RangeListPartInfo("p2", 300),
],
col_name_list=["col1"],
)
# Perform sub-range partitioning
range_sub_part = ObSubRangePartition(
False,
range_part_infos=[
RangeListPartInfo("mp0", 1000),
RangeListPartInfo("mp1", 2000),
RangeListPartInfo("mp2", 3000),
],
range_expr="col3",
)
range_columns_part.add_subpartition(range_sub_part)
```
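A partition object built this way is passed to table creation through the `partitions` parameter. A short sketch, assuming an `ObVecClient` instance `client` and the `range_part` object from the range-partitioning example above:
```python
from sqlalchemy import Column, Integer

# The table is range-partitioned on "id", matching range_expr="id" above.
client.create_table(
    table_name="partitioned_items",
    columns=[Column("id", Integer, primary_key=True)],
    partitions=range_part,
)
```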
## Pure SQLAlchemy API mode
If you prefer to use the pure SQLAlchemy API for seekdb's vector search functionality, you can obtain a synchronous database engine in either of the following ways:
* Method 1: Use ObVecClient to create a database engine
```python
from pyobvector import ObVecClient
client = ObVecClient(uri="127.0.0.1:2881", user="test@test")
engine = client.engine
# Proceed to create a session as usual with SQLAlchemy and use its API.
```
* Method 2: Register the seekdb dialect and call SQLAlchemy's `create_engine` directly
```python
import pyobvector
from sqlalchemy.dialects import registry
from sqlalchemy import create_engine
uri: str = "127.0.0.1:2881"
user: str = "root@test"
password: str = ""
db_name: str = "test"
registry.register("mysql.oceanbase", "pyobvector.schema.dialect", "OceanBaseDialect")
connection_str = (
# mysql+oceanbase indicates using the MySQL standard with seekdb's synchronous driver.
f"mysql+oceanbase://{user}:{password}@{uri}/{db_name}?charset=utf8mb4"
)
  engine = create_engine(connection_str)
# Proceed to create a session as usual with SQLAlchemy and use its API.
```
If you want to use asynchronous APIs of SQLAlchemy, you can use seekdb's asynchronous driver:
```python
import pyobvector
from sqlalchemy.dialects import registry
from sqlalchemy.ext.asyncio import create_async_engine
uri: str = "127.0.0.1:2881"
user: str = "root@test"
password: str = ""
db_name: str = "test"
registry.register("mysql.aoceanbase", "pyobvector", "AsyncOceanBaseDialect")
connection_str = (
# mysql+aoceanbase indicates using the MySQL standard with seekdb's asynchronous driver.
f"mysql+aoceanbase://{user}:{password}@{uri}/{db_name}?charset=utf8mb4"
)
engine = create_async_engine(connection_str)
# Proceed to create a session as usual with SQLAlchemy and use its API.
```
## More examples
For more examples, visit the [pyobvector repository](https://github.com/oceanbase/pyobvector).

View File

@@ -0,0 +1,470 @@
---
slug: /vector-search-java-sdk
---
# Java SDK API reference
obvec_jdbc is a Java SDK specifically designed for seekdb vector storage scenarios and JSON Table virtual table scenarios. This topic explains how to use obvec_jdbc.
## Installation
You can install obvec_jdbc using either of the following methods.
### Maven dependency
Add the obvec_jdbc dependency to the `pom.xml` file of your project.
```xml
<dependency>
<groupId>com.oceanbase</groupId>
<artifactId>obvec_jdbc</artifactId>
<version>1.0.4</version>
</dependency>
```
### Source code installation
1. Install obvec_jdbc.
```bash
# Clone the obvec_jdbc repository.
git clone https://github.com/oceanbase/obvec_jdbc.git
# Go to the obvec_jdbc directory.
cd obvec_jdbc
# Install obvec_jdbc.
mvn install
```
2. Add the dependency.
```xml
<dependency>
<groupId>com.oceanbase</groupId>
<artifactId>obvec_jdbc</artifactId>
<version>1.0.4</version>
</dependency>
```
## API definition and usage
obvec_jdbc provides the `ObVecClient` object for working with seekdb's vector search features and JSON Table virtual table functionalities.
### Use vector search
#### Create a client
You can use the following interface definition to construct an ObVecClient object:
```java
// uri: the connection string, which contains the address, port, and name of the database to which you want to connect.
// user: the username.
// password: the password.
public ObVecClient(String uri, String user, String password);
```
Here is an example:
```java
import com.oceanbase.obvec_jdbc.ObVecClient;
String uri = "jdbc:oceanbase://127.0.0.1:2881/test";
String user = "root@test";
String password = "";
String tb_name = "JAVA_TEST";
ObVecClient ob = new ObVecClient(uri, user, password);
```
#### ObFieldSchema class
This class is used to define the column schema of a table. The constructor is as follows:
```java
// name: the column name.
// dataType: the data type.
public ObFieldSchema(String name, DataType dataType);
```
The following table describes the data types supported by the class.
| Data type | Description |
|---|---|
| BOOL | Equivalent to TINYINT |
| INT8 | Equivalent to TINYINT |
| INT16 | Equivalent to SMALLINT |
| INT32 | Equivalent to INT |
| INT64 | Equivalent to BIGINT |
| FLOAT | Equivalent to FLOAT |
| DOUBLE | Equivalent to DOUBLE |
| STRING | Equivalent to LONGTEXT |
| VARCHAR | Equivalent to VARCHAR |
| JSON | Equivalent to JSON |
| FLOAT_VECTOR | Equivalent to VECTOR |
:::tip
For more complex types, constraints, and other functionalities, you can use seekdb JDBC's interface directly instead of using obvec_jdbc.
:::
The interface is defined as follows:
| API | Description |
|---|---|
| String getName() | Obtains the column name. |
| ObFieldSchema Name(String name) | Sets the column name and returns the object itself to support method chaining. |
| ObFieldSchema DataType(DataType dataType) | Sets the data type. |
| boolean getIsPrimary() | Returns whether the column is the primary key. |
| ObFieldSchema IsPrimary(boolean isPrimary) | Specifies whether the column is the primary key. |
| ObFieldSchema IsAutoInc(boolean isAutoInc) | Specifies whether the column is auto-increment. **Notice**: IsAutoInc takes effect only if IsPrimary is true. |
| ObFieldSchema IsNullable(boolean isNullable) | Specifies whether the column can contain NULL values. **Notice**: IsNullable is set to false by default, which differs from the MySQL behavior. |
| ObFieldSchema MaxLength(int maxLength) | Sets the maximum length for the VARCHAR data type. |
| ObFieldSchema Dim(int dim) | Sets the dimension for the VECTOR data type. |
#### IndexParams/IndexParam
IndexParam defines the parameters of a single vector index. IndexParams holds a group of vector index definitions and is used when multiple vector indexes are created on a table.
:::tip
obvec_jdbc supports only the creation of vector indexes. To create other indexes, use seekdb JDBC.
:::
The constructor of IndexParam is as follows:
```java
// vidx_name: the index name.
// vector_field_name: the name of the vector column.
public IndexParam(String vidx_name, String vector_field_name);
```
The interface is defined as follows:
| API | Description |
|---|---|
| IndexParam M(int m) | Sets the maximum number of neighbors for each vector in the HNSW algorithm. |
| IndexParam EfConstruction(int ef_construction) | Sets the maximum number of candidate vectors for search during the construction of the HNSW algorithm. |
| IndexParam EfSearch(int ef_search) | Sets the maximum number of candidate vectors for search in the HNSW algorithm. |
| IndexParam Lib(String lib) | Sets the type of the vector library. |
| IndexParam MetricType(String metric_type) | Sets the type of the vector distance function. |
The constructor of IndexParams is as follows:
```java
public IndexParams();
```
The interface is defined as follows:
| API | Description |
|---|---|
| void addIndex(IndexParam index_param) | Adds an index definition. |
#### ObCollectionSchema class
To create a table, you must configure an ObCollectionSchema object. Its constructor and interfaces are described below.
The constructor of ObCollectionSchema is as follows:
```java
public ObCollectionSchema();
```
The interface is defined as follows:
| API | Description |
|---|---|
| void addField(ObFieldSchema field) | Adds a column definition. |
| void setIndexParams(IndexParams index_params) | Sets the vector index parameters of the table. |
#### Drop a table
The method is defined as follows:
```java
// table_name: the name of the target table.
public void dropCollection(String table_name);
```
#### Check whether a table exists
The method is defined as follows:
```java
// table_name: the name of the target table.
public boolean hasCollection(String table_name);
```
#### Create a table
The method is defined as follows:
```java
// table_name: the name of the table to be created.
// collection: an ObCollectionSchema object that specifies the schema of the table.
public void createCollection(String table_name, ObCollectionSchema collection);
```
You can use ObFieldSchema, ObCollectionSchema, and IndexParams to create a table. Here is an example:
```java
import com.oceanbase.obvec_jdbc.DataType;
import com.oceanbase.obvec_jdbc.ObCollectionSchema;
import com.oceanbase.obvec_jdbc.ObFieldSchema;
import com.oceanbase.obvec_jdbc.IndexParam;
import com.oceanbase.obvec_jdbc.IndexParams;
// Define the schema of the table.
ObCollectionSchema collectionSchema = new ObCollectionSchema();
ObFieldSchema c1_field = new ObFieldSchema("c1", DataType.INT32);
c1_field.IsPrimary(true).IsAutoInc(true);
ObFieldSchema c2_field = new ObFieldSchema("c2", DataType.FLOAT_VECTOR);
c2_field.Dim(3).IsNullable(false);
ObFieldSchema c3_field = new ObFieldSchema("c3", DataType.JSON);
c3_field.IsNullable(true);
collectionSchema.addField(c1_field);
collectionSchema.addField(c2_field);
collectionSchema.addField(c3_field);
// Define the index.
IndexParams index_params = new IndexParams();
IndexParam index_param = new IndexParam("vidx1", "c2");
index_params.addIndex(index_param);
collectionSchema.setIndexParams(index_params);
ob.createCollection(tb_name, collectionSchema);
```
#### Create a vector index after table creation
The method is defined as follows:
```java
// table_name: the name of the table.
// index_param: an IndexParam object that specifies the vector index parameters of the table.
public void createIndex(String table_name, IndexParam index_param);
```
#### Insert data
The method is defined as follows:
```java
// table_name: the name of the target table.
// column_names: an array of column names in the target table.
// rows: the data rows, of type ArrayList<Sqlizable[]>; each row is a Sqlizable array. Sqlizable is a wrapper class that converts Java data types to SQL data types.
public void insert(String table_name, String[] column_names, ArrayList<Sqlizable[]> rows);
```
The supported data types for rows include:
* SqlInteger: wraps integer data.
* SqlFloat: wraps floating-point data.
* SqlDouble: wraps double-precision data.
* SqlText: wraps string data.
* SqlVector: wraps vector data.
Here is an example:
```java
import java.util.ArrayList;
import com.oceanbase.obvec_jdbc.SqlInteger;
import com.oceanbase.obvec_jdbc.SqlText;
import com.oceanbase.obvec_jdbc.SqlVector;
import com.oceanbase.obvec_jdbc.Sqlizable;
ArrayList<Sqlizable[]> insert_rows = new ArrayList<>();
Sqlizable[] ir1 = { new SqlVector(new float[] {1.0f, 2.0f, 3.0f}), new SqlText("{\"doc\": \"oceanbase doc 1\"}") };
insert_rows.add(ir1);
Sqlizable[] ir2 = { new SqlVector(new float[] {1.1f, 2.2f, 3.3f}), new SqlText("{\"doc\": \"oceanbase doc 2\"}") };
insert_rows.add(ir2);
Sqlizable[] ir3 = { new SqlVector(new float[] {0f, 0f, 0f}), new SqlText("{\"doc\": \"oceanbase doc 3\"}") };
insert_rows.add(ir3);
ob.insert(tb_name, new String[] {"c2", "c3"}, insert_rows);
```
#### Delete data
The method is defined as follows:
```java
// table_name: the name of the target table.
// primary_key_name: the name of the primary key column.
// primary_keys: an array of primary key column values for the target rows.
public void delete(String table_name, String primary_key_name, ArrayList<Sqlizable> primary_keys);
```
Here is an example:
```java
ArrayList<Sqlizable> ids = new ArrayList<>();
ids.add(new SqlInteger(2));
ids.add(new SqlInteger(1));
ob.delete(tb_name, "c1", ids);
```
#### ANN queries
The method is defined as follows:
```java
// table_name: the name of the target table.
// vec_col_name: the name of the vector column.
// metric_type: the type of the vector distance function. l2: the L2 (Euclidean) distance function. cosine: the cosine distance function. ip: the negative inner product distance function.
// qv: the vector value to be queried.
// topk: the number of the most similar results to be returned.
// output_fields: the projected columns, that is, the array of the fields to be returned.
// output_datatypes: the data types of the projected columns, for direct conversion to Java data types.
// where_expr: the WHERE condition expression.
public ArrayList<HashMap<String, Sqlizable>> query(
String table_name,
String vec_col_name,
String metric_type,
float[] qv,
int topk,
String[] output_fields,
DataType[] output_datatypes,
String where_expr);
```
Here is an example:
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

ArrayList<HashMap<String, Sqlizable>> res = ob.query(tb_name, "c2", "l2",
    new float[] {0f, 0f, 0f}, 10,
    new String[] {"c1", "c3", "c2"},
    new DataType[] {
        DataType.INT32,
        DataType.JSON,
        DataType.FLOAT_VECTOR},
    "c1 > 0");
if (res != null) {
for (int i = 0; i < res.size(); i++) {
for (Map.Entry<String, Sqlizable> entry : res.get(i).entrySet()) {
System.out.printf("%s : %s, ", entry.getKey(), entry.getValue().toString());
}
System.out.print("\n");
}
} else {
System.out.println("res is null");
}
```
### Use the JSON table feature
The JSON table feature of obvec_jdbc relies on seekdb's ability to handle JSON data types (including `JSON_VALUE`/`JSON_TABLE`/`JSON_REPLACE`, etc.) to implement a virtual table mechanism. Multiple users (distinguished by user ID) can perform DDL or DML operations on virtual tables over the same physical table while ensuring data isolation between users. Admin users can perform DDL operations, while regular users can perform DML operations.
This design combines the structured management capabilities of relational databases with the flexibility of JSON, showcasing seekdb's multi-model integration capabilities. Users can enjoy the power and ease of use of SQL while also handling semi-structured data, meeting the diverse data model requirements of modern applications. Although operations are still performed on "tables," data is stored in a more flexible JSON format at the underlying level, better supporting complex and varied application scenarios.
#### How it works
<!-- The following figure illustrates the principle of JSON Table.
![JSON Table principle](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer/V4.3.5/vector_search/JSON-Table%E5%8E%9F%E7%90%86%E5%9B%BE.png)
Detailed explanation:-->
1. User operations: Users still interact with the system using familiar standard SQL statements (such as `CREATE TABLE` to create table structures, `INSERT` to insert data, and `SELECT` to query data). They do not need to worry about how data is stored at the underlying level, just like operating ordinary relational database tables. The tables created by users using SQL statements are logical tables, which correspond to two physical tables (`meta_json_t` and `data_json_t`) within seekdb.
2. JSON Table SDK: Within the application, there is a JSON Table SDK (Software Development Kit). This SDK is the key that connects users' SQL operations and seekdb's actual storage. When SQL statements are executed, the SDK intercepts these requests and intelligently converts them into read and write operations on seekdb's internal tables `meta_json_t` and `data_json_t`.
3. seekdb internal storage:
* `meta_json_t` (stores table schema): stores the metadata of the logical tables created by users, which is the schema information of the table (for example, which columns are created and what data type each column is). When `CREATE TABLE` is executed, the SDK records this schema information in `meta_json_t`.
* `data_json_t` (stores row data as JSON type): stores the actual inserted data. Unlike traditional relational databases that directly store row data, the JSON Table feature encapsulates each row of inserted data into a JSON object and stores it in a column of the `data_json_t` table. This allows for efficient storage even with flexible data structures.
4. Data query: When query operations such as `SELECT` are executed, the SDK reads JSON-format data from `data_json_t` and combines it with the schema information from `meta_json_t` to re-parse and present the JSON data in a familiar tabular format, returning it to your application.
The `meta_json_t` table stores the metadata of the JSON table, which is the logical table schema defined by the user using the `CREATE TABLE` statement. It records the column information of each logical table, with the following schema:
| Field | Description | Example |
|--------|------|------|
| `user_id` | The user ID, used to distinguish the logical tables of different users. | `0`, `1`, `2` |
| `jtable_name` | The name of the logical table. | `test_count` |
| `jcol_id` | The column ID of the logical table. | `1`, `2`, `3` |
| `jcol_name` | The column name of the logical table. | `c1`, `c2`, `c3` |
| `jcol_type` | The data type of the column. | `INT`, `VARCHAR(124)`, `DECIMAL(10,2)` |
| `jcol_nullable` | Indicates whether the column allows null values. | `0`, `1` |
| `jcol_has_default` | Indicates whether the column has a default value. | `0`, `1` |
| `jcol_default` | The default value of the column. | `{'default': null}` |
When a user executes the `CREATE TABLE` statement, the JSON table SDK parses and inserts the column definition information into the `meta_json_t` table.
The `data_json_t` table stores the actual data of the JSON table, which is the data inserted by the user using the `INSERT` statement. It records the row data of each logical table, with the following schema:
| Field | Description | Example |
|--------|------|------|
| `user_id` | The user ID, used to distinguish the logical tables of different users. | `0`, `1`, `2` |
| `admin_id` | The administrator user ID. | `0` |
| `jtable_name` | The name of the logical table, used to associate the metadata in `meta_json_t`. | `test_count` |
| `jdata_id` | The data ID, a unique identifier for the JSON data, corresponding to each row in the logical table. | `1`, `2`, `3` |
| `jdata` | A column of the JSON type, used to store the actual row data of the logical table. | `{"c1": 1, "c2": "test", "c3": 1.23}` |
#### Examples
1. Create a client
The constructor is as follows:
```java
// uri: the connection string, which contains the address, port, and name of the database to which you want to connect.
// user: the username.
// password: the password.
// user_id: the user ID.
// log_level: the log level.
public ObVecJsonClient(String uri, String user, String password, String user_id, Level log_level);
```
Here is an example:
```java
import com.oceanbase.obvec_jdbc.ObVecJsonClient;
String uri = "jdbc:oceanbase://127.0.0.1:2881/test";
String user = "root@test";
String password = "";
ObVecJsonClient client = new ObVecJsonClient(uri, user, password, "0", Level.INFO);
```
2. Execute DDL statements
You can directly call the `parseJsonTableSQL2NormalSQL` interface and pass in the SQL statement to execute.
* Create a table
```java
String sql = "CREATE TABLE `t2` (c1 INT NOT NULL DEFAULT 10, c2 VARCHAR(30) DEFAULT 'ca', c3 VARCHAR NOT NULL, c4 DECIMAL(10, 2), c5 TIMESTAMP DEFAULT CURRENT_TIMESTAMP);";
client.parseJsonTableSQL2NormalSQL(sql);
```
* ALTER TABLE CHANGE COLUMN
```java
sql = "ALTER TABLE t2 CHANGE COLUMN c2 changed_col INT";
client.parseJsonTableSQL2NormalSQL(sql);
```
* ALTER TABLE ADD COLUMN
```java
sql = "ALTER TABLE t2 ADD COLUMN email VARCHAR(100) default 'example@example.com'";
client.parseJsonTableSQL2NormalSQL(sql);
```
* ALTER TABLE MODIFY COLUMN
```java
sql = "ALTER TABLE t2 MODIFY COLUMN changed_col TIMESTAMP NOT NULL DEFAULT current_timestamp";
client.parseJsonTableSQL2NormalSQL(sql);
```
* ALTER TABLE DROP COLUMN
```java
sql = "ALTER TABLE t2 DROP c1";
client.parseJsonTableSQL2NormalSQL(sql);
```
* ALTER TABLE RENAME
```java
sql = "ALTER TABLE t2 RENAME TO alter_test";
client.parseJsonTableSQL2NormalSQL(sql);
```

View File

@@ -0,0 +1,20 @@
---
slug: /vector-search-faq
---
# Vector search FAQs
This topic describes some common issues that may occur when using vector search, as well as their causes and solutions.
## Must all rows in a vector column have the same data dimensionality?
Yes. The dimensionality must be specified when a vector column is defined, and it is verified whenever vector data is written into the column.
## What is the maximum number of rows of vector data that can be written?
There is no fixed upper limit; the practical limit depends on the database's memory resources.
## How do I create an index on a vector with more than 4096 dimensions?
You need to compress the data to 4,096 dimensions or fewer before creating the index, as sketched below.
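A minimal sketch of one common compression approach, PCA-based dimensionality reduction (this assumes numpy and scikit-learn are available; the dimensions are illustrative):
```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 1,000 embeddings with 8,192 dimensions each.
embeddings = np.random.rand(1000, 8192).astype(np.float32)

# Reduce to 512 dimensions, well within the 4,096-dimension index limit.
pca = PCA(n_components=512)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)  # (1000, 512)
```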

View File

@@ -0,0 +1,520 @@
---
slug: /using-seekdb-in-python-mode
---
# Experience embedded seekdb
seekdb provides an embedded product form that can be integrated into user applications as a library, offering developers a more powerful and flexible data management solution. This enables data management everywhere (microcontrollers, IoT devices, edge computing, mobile applications, data centers, etc.), allowing users to quickly get started with seekdb's All-in-one (TP, AP, AI Native) capabilities.
![python](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/SeekDB/seekdb-python.png)
## Installation and configuration
### Environment requirements
* Supported operating systems: Linux (glibc >= 2.28)
* Supported Python versions: CPython 3.8 ~ 3.14
* Supported system architectures: x86_64, aarch64
You can run the following command to check whether your environment meets the requirements.
```bash
python -c 'import sys;import platform; print(f"Python: {platform.python_implementation()} {platform.python_version()}, System: {platform.system()} {platform.machine()}, {platform.libc_ver()[0]}: {platform.libc_ver()[1]}");'
```
The output is similar to the following:
```shell
Python: CPython 3.8.17, System: Linux x86_64, glibc: 2.32
```
### Installation
Install with pip, which automatically detects your default Python version and platform.
```bash
pip install pylibseekdb
# Or specify a mirror source for faster installation
pip install pylibseekdb -i https://pypi.tuna.tsinghua.edu.cn/simple
```
If your pip version is outdated, upgrade it before installing:
```bash
pip install --upgrade pip
```
## Experience seekdb
After completing the installation of seekdb, you can start experiencing seekdb.
### Considerations
* Multi-statement queries are not supported. By default, only the first statement is executed; for example, the following call inserts only the first row:
```python
cur.execute("insert into t1 values(100);insert into t1 values(200)")
```
* The tmpfs file system cannot be used as the database directory.
* Only streaming query mode is supported. For example:
```python
cur = con.cursor()
cur.execute("select * from t1")
cur.fetchall()
```
* The `execute` method does not support parameterization (placeholder binding); see the sketch after this list.
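A minimal sketch of working within the two query-related limitations above (assuming a cursor `cur` obtained as in the examples below): execute statements one at a time, and build each complete SQL string yourself, taking care to validate any values that come from user input.
```python
rows = [(1, 100), (2, 200)]
for c1, c2 in rows:
    # One statement per execute call; values are interpolated into the
    # SQL string because placeholder binding is not available.
    cur.execute(f"insert into t1 values({c1}, {c2})")
```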
### Experience basic seekdb operations
The following examples demonstrate some basic operations of seekdb. You can create databases, connect to databases, create tables, write and query data, and more.
:::info
For detailed information about seekdb SQL syntax, see <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001972315">SQL syntax</a>.
:::
seekdb provides the `test` database by default. The following example demonstrates how to open and connect to the `test` database using default parameters, and how to create tables, write data, commit transactions, query data, and safely close the database.
```python
import pylibseekdb
# Open the database directory seekdb by default
pylibseekdb.open()
# Connect to the test database by default
conn = pylibseekdb.connect()
# Create a cursor for data operations
cursor = conn.cursor()
# Execute table creation statement
cursor.execute("create table t1(c1 int primary key, c2 int)")
# Execute data insertion
cursor.execute("insert into t1 values(1, 100)")
cursor.execute("insert into t1 values(2, 200)")
# Manually commit the transaction
conn.commit()
# Execute query
cursor.execute("select * from t1")
# Fetch results
print(cursor.fetchall())
# Close connections
cursor.close()
conn.close()
```
You can also manually specify the database directory and create and use a new database.
```python
import pylibseekdb
# Specify the database directory
pylibseekdb.open("mydb")
# Do not connect to any database
conn = pylibseekdb.connect("")
# Create a cursor for data operations
cursor = conn.cursor()
# Manually create a database
cursor.execute("create database db1")
# Use the newly created database
cursor.execute("use db1")
# Close connections
cursor.close()
conn.close()
```
The following example demonstrates how to enable autocommit mode for transactions.
```python
import pylibseekdb
# Specify the database directory
pylibseekdb.open("seekdb")
# Connect to the test database
conn = pylibseekdb.connect(database="test", autocommit=True)
# Create a cursor for data operations
cursor = conn.cursor()
# Execute table creation statement
cursor.execute("create table t2(c1 int primary key, c2 int)")
# Execute data insertion, transaction is automatically committed
cursor.execute("insert into t2 values(1, 100)")
# Execute data insertion, transaction is automatically committed
cursor.execute("insert into t2 values(2, 200)")
# Query data using a new connection
conn2 = pylibseekdb.connect("test")
cursor2 = conn2.cursor()
cursor2.execute("select * from t2")
# View data row by row
print(cursor2.fetchone())
print(cursor2.fetchone())
# Close connections
cursor.close()
conn.close()
cursor2.close()
conn2.close()
```
### Experience AI Native
#### Experience vector search
seekdb supports up to 16,000 dimensions of float-type dense vectors, sparse vectors, and various types of vector distance calculations such as Manhattan distance, Euclidean distance, inner product, and cosine distance. It supports creating vector indexes based on HNSW/IVF, and supports incremental updates and deletions without affecting recall.
:::info
For more detailed information about seekdb vector search, see <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001976351">Vector search</a>.
:::
The following example demonstrates how to use vector search in seekdb.
```python
import pylibseekdb
pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()
# Create a table with a vector index
cursor.execute("create table test_vector(c1 int primary key, c2 vector(2), vector index idx1(c2) with (distance=l2, type=hnsw, lib=vsag))")
# Insert data
cursor.execute("insert into test_vector values(1, [1, 1])")
cursor.execute("insert into test_vector values(2, [1, 2])")
cursor.execute("insert into test_vector values(3, [1, 3])")
conn.commit()
# Execute vector search
cursor.execute("SELECT c1,c2 FROM test_vector ORDER BY l2_distance(c2, '[1, 2.5]') APPROXIMATE LIMIT 2;")
# Fetch results
print(cursor.fetchall())
# Close connections
cursor.close()
conn.close()
```
#### Experience full-text search
seekdb provides full-text indexing capabilities. A full-text index covers entire documents or large text fields, significantly improving query performance on large-scale text data and complex search requirements so that users can find the information they need more efficiently.
The following example demonstrates how to use seekdb's full-text search feature.
```python
import pylibseekdb
pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()
# Create a table with a full-text index
sql='''create table articles (title VARCHAR(200) primary key, body Text,
FULLTEXT fts_idx(title, body));
'''
cursor.execute(sql)
# Insert data
sql='''insert into articles(title, body) values
('OceanBase Tutorial', 'This is a tutorial about OceanBase Fulltext.'),
('Fulltext Index', 'Fulltext index can be very useful.'),
('OceanBase Test Case', 'Writing test cases helps ensure quality.')
'''
cursor.execute(sql)
conn.commit()
# Execute full-text search
sql='''select
title,
match (title, body) against ("OceanBase") as score
from
articles
where
match (title, body) against ("OceanBase")
order by
score desc
'''
cursor.execute(sql)
# Fetch results
print(cursor.fetchall())
# Close connections
cursor.close()
conn.close()
```
#### Experience hybrid search
Hybrid Search combines vector-based semantic search and full-text index-based keyword search, providing more accurate and comprehensive search results through comprehensive ranking. Vector search excels at semantic approximate matching but is weak at matching exact keywords, numbers, and proper nouns, while full-text search effectively compensates for this deficiency. Therefore, hybrid search has become one of the key features of vector databases and is widely used in various products.
Based on multi-model integration, seekdb provides hybrid search capabilities for multi-modal data on the basis of SQL+AI, enabling fusion queries of multiple types of data in a single database system.
The following example demonstrates how to use seekdb's hybrid search feature.
```python
import pylibseekdb
pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()
# Create a table with vector indexes and full-text indexes
cursor.execute("create table doc_table(c1 int, vector vector(3), query varchar(255), content varchar(255), vector index idx1(vector) with (distance=l2, type=hnsw, lib=vsag), fulltext idx2(query), fulltext idx3(content))")
# Insert data
sql = '''insert into doc_table values(1, '[1,2,3]', "hello world", "oceanbase Elasticsearch database"),
(2, '[1,2,1]', "hello world, what is your name", "oceanbase mysql database"),
(3, '[1,1,1]', "hello world, how are you", "oceanbase oracle database"),
(4, '[1,3,1]', "real world, where are you from", "postgres oracle database"),
(5, '[1,3,2]', "real world, how old are you", "redis oracle database"),
(6, '[2,1,1]', "hello world, where are you from", "starrocks oceanbase database");'''
cursor.execute(sql)
conn.commit()
sql = '''set @parm = '{
"query": {
"bool": {
"must": [
{"match": {"query": "hi hello"}},
{"match": { "content": "oceanbase mysql" }}
]
}
},
"knn" : {
"field": "vector",
"k": 5,
"num_candidates": 10,
"query_vector": [1,2,3],
"boost": 0.7
},
"_source" : ["query", "content", "_keyword_score", "_semantic_score"]
}';'''
cursor.execute(sql)
# Execute hybrid search
sql = '''select dbms_hybrid_search.search('doc_table', @parm);'''
cursor.execute(sql)
# Fetch results
print(cursor.fetchall())
# Close connections
cursor.close()
conn.close()
```
### Experience analytical capabilities (OLAP)
seekdb combines transaction processing (TP) with analytical processing (AP). Based on the LSM-Tree architecture, it achieves unified row and column storage, and introduces a new vectorized engine and cost evaluation model based on column storage, significantly improving the efficiency of processing wide tables and enhancing query performance in AP scenarios. It also supports real-time import, secondary indexes, high-concurrency primary key queries, and other common real-time OLAP requirements.
#### Experience data import
seekdb supports a variety of flexible methods for importing data from multiple sources into the database. Choose an import tool based on the data source type, the data file format, and your business scenario; in complex scenarios, several methods can be combined. If the data source and file format are fixed, design the import solution around them; if you already have a familiar import tool, verify that it supports your source and format before importing.
The following example uses the `load data` method to demonstrate how to quickly import CSV data into seekdb.
1. Prepare the CSV data file to import.
```bash
cat /data/1/example.csv
1,10
2,20
3,30
```
2. Import external data using the embedded method.
```python
import pylibseekdb
pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()
# Create a table
cursor.execute("create table test_olap(c1 int, c2 int)")
# Execute fast import
cursor.execute("load data /*+ direct(true, 0) */ infile '/data/1/example.csv' into table test_olap fields terminated by ','")
# Query data
cursor.execute("select count(*) from test_olap")
# Fetch results
print(cursor.fetchall())
# Close connections
cursor.close()
conn.close()
```
#### Experience columnar storage
In complex analysis of large-scale data and ad-hoc queries over massive datasets, columnar storage is one of the key capabilities of an AP database. Building on its row-store support, the seekdb storage engine also provides columnar storage and unified row-column storage: with one codebase, one architecture, and one instance, columnar data and row data coexist.
The following example demonstrates how to create a columnar table in seekdb.
```python
import pylibseekdb
pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()
# Create a columnar table
sql='''create table each_column_group (col1 varchar(30) not null, col2 varchar(30) not null, col3 varchar(30) not null, col4 varchar(30) not null, col5 int)
with column group (each column);
'''
cursor.execute(sql)
# Insert data
sql='''insert into each_column_group values('a', 'b', 'c', 'd', 1)
'''
cursor.execute(sql)
conn.commit()
# Execute query
cursor.execute("select col1,col2 from each_column_group")
# Fetch results
print(cursor.fetchall())
# Close connections
cursor.close()
conn.close()
```
#### Experience materialized views
Materialized views are a key feature supporting AP business. They improve query performance and simplify complex query logic by precomputing and storing query results of views, reducing real-time computation. They are commonly used in fast report generation and data analysis scenarios. seekdb supports non-real-time and real-time materialized views, supports specifying primary keys or creating indexes for materialized views, and introduces nested materialized views, which can significantly improve query performance.
The following example demonstrates how to use materialized views in seekdb.
```python
import pylibseekdb
import time
pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()
# Create base tables
cursor.execute("create table base_t1(a int primary key, b int)")
cursor.execute("create table base_t2(c int primary key, d int)")
# Create materialized view logs
cursor.execute("create materialized view log on base_t1 with(b)")
cursor.execute("create materialized view log on base_t2 with(d)")
# Create a materialized view named mv based on tables base_t1 and base_t2, specify the refresh strategy as incremental refresh, and set the initial refresh time in the refresh plan to the current date, then refresh the materialized view every 1 second thereafter.
cursor.execute("create materialized view mv REFRESH fast START WITH sysdate() NEXT sysdate() + INTERVAL 1 second as select a,b,c,d from base_t1 join base_t2 on base_t1.a=base_t2.c")
# Insert data into base tables
cursor.execute("insert into base_t1 values(1, 10)")
cursor.execute("insert into base_t2 values(1, 100)")
conn.commit()
# Wait for the materialized view background refresh to complete
time.sleep(10)
# Query data
cursor.execute("select * from mv")
# Fetch results
print(cursor.fetchall())
# Close connections
cursor.close()
conn.close()
```
#### Experience external tables
Typically, table data in a database is stored in the database's storage space, while external table data is stored in external storage services. When creating an external table, you need to define the data file path and data file format. After creation, users can read data from files in external storage services through external tables.
The following example demonstrates how to access external CSV files through seekdb's external table feature.
1. Prepare the CSV data file in the external storage location.
```bash
cat /data/1/example.csv
1,10
2,20
3,30
```
2. Access external table data using the embedded method.
```python
import pylibseekdb
pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()
# Create an external table
sql='''CREATE EXTERNAL TABLE test_external_table(c1 int, c2 int) LOCATION='/data/1' FORMAT=(TYPE='CSV' FIELD_DELIMITER=',') PATTERN='example.csv';
'''
cursor.execute(sql)
# Query data
cursor.execute("select * from test_external_table")
# Fetch results
print(cursor.fetchall())
# Close connections
cursor.close()
conn.close()
```
### Experience transaction capabilities (OLTP)
The following example demonstrates seekdb's transaction capabilities.
```python
import pylibseekdb
pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()
# Create a table
cursor.execute("create table test_oltp(c1 int primary key, c2 int)")
# Insert data
cursor.execute("insert into test_oltp values(1, 10)")
cursor.execute("insert into test_oltp values(2, 20)")
cursor.execute("insert into test_oltp values(3, 30)")
# Commit transaction
conn.commit()
# Query data, ORA_ROWSCN is the data commit version number
cursor.execute("select *,ORA_ROWSCN from test_oltp")
# Fetch results
print(cursor.fetchall())
# Close connections
cursor.close()
conn.close()
```
## Smooth transition to the distributed version
After quickly validating a product prototype with the embedded version, you can switch to seekdb server mode or to an OceanBase distributed cluster by modifying only the imported package and related configuration; the main application logic remains unchanged.
```python
import pylibseekdb
pylibseekdb.open()
conn = pylibseekdb.connect()
```
Replace the three lines above with the two lines below: use the pymysql package in place of pylibseekdb, remove the pylibseekdb open step, and connect to the database server with pymysql's connect method.
```python
import pymysql
conn = pymysql.connect(host='127.0.0.1', port=11002, user='root@sys', database='test')
```
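For example, a minimal sketch of the same cursor-level logic running against a server (the host, port, and credentials below are illustrative):
```python
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=11002, user='root@sys', database='test')
cursor = conn.cursor()
# The cursor-level code is identical to the embedded examples above.
cursor.execute("select 1")
print(cursor.fetchall())
cursor.close()
conn.close()
```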

View File

@@ -0,0 +1,871 @@
---
slug: /vector-index-hybrid-search
---
# Hybrid search with vector indexes
This topic describes hybrid search with full-text indexes and vector indexes in seekdb.
Hybrid search combines vector-based semantic search and full-text index-based keyword search, providing more accurate and comprehensive search results through integrated ranking. Vector search excels at semantic approximate matching but has weaker capabilities for matching exact keywords, numbers, and proper nouns. Full-text search effectively compensates for this limitation. Therefore, hybrid search has become a key feature of vector databases and is widely used in various products. seekdb achieves efficient hybrid queries by integrating its full-text and vector indexing capabilities.
## Usage
The hybrid search feature is provided through the new system package `DBMS_HYBRID_SEARCH`, which contains two functions:
| Method name | Description |
| ------------ | -------- |
| `DBMS_HYBRID_SEARCH.SEARCH` | Returns search results in JSON format. Results are sorted by relevance. |
| `DBMS_HYBRID_SEARCH.GET_SQL` | Returns the actual executed SQL statement as a string. |
<!--For detailed syntax and parameter descriptions, see [DBMS_HYBRID_SEARCH](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000004020384).-->
## Use cases and examples
### Create example tables and insert data
To demonstrate the hybrid search feature, this section creates and inserts data into several example tables that will be used in different search scenarios below.
* **`products` table**: A basic product information table used to demonstrate regular scalar search. It contains product ID, name, description, brand, category, tags, price, stock quantity, release date, on-sale status, and a vector field `vec`.
:::collapse
```sql
CREATE TABLE products (
`product_id` varchar(50) DEFAULT NULL,
`product_name` varchar(255) DEFAULT NULL,
`description` text DEFAULT NULL,
`brand` varchar(100) DEFAULT NULL,
`category` varchar(100) DEFAULT NULL,
`tags` varchar(255) DEFAULT NULL,
`price` decimal(10,2) DEFAULT NULL,
`stock_quantity` int(11) DEFAULT NULL,
`release_date` datetime DEFAULT NULL,
`is_on_sale` tinyint(1) DEFAULT NULL,
`vec` VECTOR(4) DEFAULT NULL
);
```
Insert data.
```sql
INSERT INTO products VALUES
('prod-001', 'Gamer-Pro Mechanical Keyboard', 'A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,rgb', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.5,0.1,0.6,0.9]'),
('prod-002', 'Gamer-Pro Headset', 'High-fidelity gaming headset with a noise-cancelling microphone.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,audio', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.1,0.9,0.2,0]'),
('prod-003', 'Eco-Friendly Yoga Mat', 'A non-slip yoga mat made from sustainable and eco-friendly materials.',
'NatureFirst', 'Sports', 'eco-friendly,health', 49.99, 200, '2023-04-22 00:00:00.000000', 0, '[0.1,0.9,0.3,0]');
```
:::
* **`products_fulltext` table**: Based on the `products` table, full-text indexes are created on the `product_name`, `description`, and `tags` columns to demonstrate full-text search.
:::collapse
```sql
CREATE TABLE products_fulltext (
product_id VARCHAR(50),
product_name VARCHAR(255),
description TEXT,
brand VARCHAR(100),
category VARCHAR(100),
tags VARCHAR(255),
price DECIMAL(10, 2),
stock_quantity INT,
release_date DATETIME,
is_on_sale TINYINT(1),
vec vector(4),
-- Create full-text indexes on columns that need full-text search
FULLTEXT INDEX idx_product_name(product_name),
FULLTEXT INDEX idx_description(description),
FULLTEXT INDEX idx_tags(tags)
);
```
Insert data.
```sql
INSERT INTO products_fulltext VALUES
('prod-001', 'Gamer-Pro Mechanical Keyboard', 'A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,rgb', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.5,0.1,0.6,0.9]'),
('prod-002', 'Gamer-Pro Headset', 'High-fidelity gaming headset with a noise-cancelling microphone.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,audio', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.1,0.9,0.2,0]'),
('prod-003', 'Eco-Friendly Yoga Mat', 'A non-slip yoga mat made from sustainable and eco-friendly materials.',
'NatureFirst', 'Sports', 'eco-friendly,health', 49.99, 200, '2023-04-22 00:00:00.000000', 0, '[0.1,0.9,0.3,0]');
```
:::
* **`doc_table` table**: A document table containing scalar columns, vector columns, and full-text indexed columns, used to demonstrate full-text search with scalar filtering conditions and hybrid search.
:::collapse
```sql
CREATE TABLE doc_table(
c1 INT,
vector VECTOR(3),
query VARCHAR(255),
content VARCHAR(255),
VECTOR INDEX idx1(vector) WITH (distance=l2, type=hnsw, lib=vsag),
FULLTEXT INDEX idx2(query),
FULLTEXT INDEX idx3(content)
);
```
Insert data.
```sql
INSERT INTO doc_table VALUES
(1, '[1,2,3]', "hello world", "oceanbase Elasticsearch database"),
(2, '[1,2,1]', "hello world, what is your name", "oceanbase mysql database"),
(3, '[1,1,1]', "hello world, how are you", "oceanbase oracle database"),
(4, '[1,3,1]', "real world, where are you from", "postgres oracle database"),
(5, '[1,3,2]', "real world, how old are you", "redis oracle database"),
(6, '[2,1,1]', "hello world, where are you from", "starrocks oceanbase database");
```
:::
* **`products_vector` table**: Similar to the `products` table structure, but with a vector index explicitly created on the `vec` column to demonstrate pure vector search.
:::collapse
```sql
CREATE TABLE products_vector (
`product_id` varchar(50) DEFAULT NULL,
`product_name` varchar(255) DEFAULT NULL,
`description` text DEFAULT NULL,
`brand` varchar(100) DEFAULT NULL,
`category` varchar(100) DEFAULT NULL,
`tags` varchar(255) DEFAULT NULL,
`price` decimal(10,2) DEFAULT NULL,
`stock_quantity` int(11) DEFAULT NULL,
`release_date` datetime DEFAULT NULL,
`is_on_sale` tinyint(1) DEFAULT NULL,
`vec` VECTOR(4) DEFAULT NULL,
-- Create a vector index on the column that needs vector search
VECTOR INDEX idx1(vec) WITH (distance=l2, type=hnsw, lib=vsag)
);
```
Insert data.
```sql
INSERT INTO products_vector VALUES
('prod-001', 'Gamer-Pro Mechanical Keyboard', 'A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,rgb', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.5,0.1,0.6,0.9]'),
('prod-002', 'Gamer-Pro Headset', 'High-fidelity gaming headset with a noise-cancelling microphone.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,audio', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.1,0.9,0.2,0]'),
('prod-003', 'Eco-Friendly Yoga Mat', 'A non-slip yoga mat made from sustainable and eco-friendly materials.',
'NatureFirst', 'Sports', 'eco-friendly,health', 49.99, 200, '2023-04-22 00:00:00.000000', 0, '[0.1,0.9,0.3,0]');
```
:::
* **`products_multi_vector` table**: A table containing multiple vector fields, used to demonstrate multi-vector search.
:::collapse
```sql
CREATE TABLE products_multi_vector (
product_id VARCHAR(50),
product_name VARCHAR(255),
description TEXT,
vec1 VECTOR(4),
vec2 VECTOR(4),
vec3 VECTOR(4),
VECTOR INDEX idx1(vec1) WITH (distance=l2, type=hnsw, lib=vsag),
VECTOR INDEX idx2(vec2) WITH (distance=l2, type=hnsw, lib=vsag),
VECTOR INDEX idx3(vec3) WITH (distance=l2, type=hnsw, lib=vsag)
);
```
Insert data.
```sql
INSERT INTO products_multi_vector VALUES
('prod-001', 'Gamer-Pro Mechanical Keyboard', 'A responsive mechanical keyboard', '[0.5,0.1,0.6,0.9]', '[0.2,0.3,0.4,0.5]', '[0.1,0.2,0.3,0.4]'),
('prod-002', 'Gamer-Pro Headset', 'High-fidelity gaming headset', '[0.1,0.9,0.2,0]', '[0.3,0.4,0.5,0.6]', '[0.2,0.3,0.4,0.5]'),
('prod-003', 'Eco-Friendly Yoga Mat', 'A non-slip yoga mat', '[0.1,0.9,0.3,0]', '[0.4,0.5,0.6,0.7]', '[0.3,0.4,0.5,0.6]');
```
:::
### Regular scalar search
Some use cases for regular scalar search are as follows:
* E-commerce platform product filtering: Users want to view all products from a specific brand. For example, users want to view all products from the `GamerZone` brand.
* Content management systems: Administrators need to filter articles or documents by specific categories. For example, finding all articles by a specific author.
* User management systems: Finding users with specific statuses or roles. For example, finding all VIP users.
Example:
1. Set search parameters.
```sql
SET @parm = '{
"query": {
"bool": {
"must": [
{"term": {"brand": "GamerZone"}}
]
}
}
}';
```
2. Search for all records where `brand` is `"GamerZone"`.
```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products', @parm));
```
The return result is as follows:
:::collapse{title="Return result"}
```shell
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products', @parm)) |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"vec": "[0.5,0.1,0.6,0.9]",
"tags": "best-seller,gaming-gear,rgb",
"brand": "GamerZone",
"price": 149.00,
"_score": 1,
"category": "Gaming",
"is_on_sale": 1,
"product_id": "prod-004",
"description": "A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.",
"product_name": "Gamer-Pro Mechanical Keyboard",
"release_date": "2023-07-20 00:00:00.000000",
"stock_quantity": 100
},
{
"vec": "[0.1,0.9,0.2,0]",
"tags": "best-seller,gaming-gear,audio",
"brand": "GamerZone",
"price": 149.00,
"_score": 1,
"category": "Gaming",
"is_on_sale": 1,
"product_id": "prod-009",
"description": "High-fidelity gaming headset with a noise-cancelling microphone.",
"product_name": "Gamer-Pro Headset",
"release_date": "2023-07-20 00:00:00.000000",
"stock_quantity": 100
}
] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::
### Range search for regular scalars
Some use cases for regular scalar range search are as follows:
* Price range filtering: E-commerce platforms filter products by price range. For example, finding products with prices in the `[30~80]` range.
* Time range queries: Finding orders or logs within a specific time period. For example, finding orders from the last 30 days.
* Numeric range filtering: Filtering by rating, stock quantity, and other numeric ranges. For example, finding products with ratings between `[4~5]`.
Example:
1. Set search parameters.
```sql
SET @parm = '{
"query": {
"range" : {
"price" : {
"gte" : 30,
"lte" : 80
}
}
}
}';
```
2. Search for all records where `price` is in the `[30~80]` range.
```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products', @parm));
```
The return result is as follows:
:::collapse{title="Return result"}
```shell
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products', @parm)) |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"vec": "[0.1,0.9,0.3,0]",
"tags": "eco-friendly,health",
"brand": "NatureFirst",
"price": 49.99,
"_score": true,
"category": "Sports",
"is_on_sale": 0,
"product_id": "prod-003",
"description": "A non-slip yoga mat made from sustainable and eco-friendly materials.",
"product_name": "Eco-Friendly Yoga Mat",
"release_date": "2023-04-22 00:00:00.000000",
"stock_quantity": 200
}
] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::
### Full-text search
Some use cases for full-text search are as follows:
* Document search: Searching for content containing specific keywords in a large number of documents. For example, searching for documents containing `"how to use"` in FAQs.
* Product search: Fuzzy search based on product names and descriptions. For example, searching for products containing `"database"`.
* Knowledge base retrieval: Searching for related questions in FAQs and help documents. For example, searching for answers to related questions in a customer service system's knowledge base.
Example:
1. Set search parameters.
```sql
SET @query_str_with_mini = '{
"query": {
"query_string": {
"type": "best_fields",
"fields": ["product_name^3", "description^2.5", "tags^1.5"],
"query": "Gamer-Pro^2 keyboard^1.5 audio^1.2",
"boost": 1.5
}
}
}';
```
2. Search for records where the `product_name`, `description`, and `tags` fields contain the keywords `"Gamer-Pro"`, `"keyboard"`, and `"audio"`, and sort them according to the set field and keyword weights.
```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_fulltext', @query_str_with_mini));
```
The return result is as follows:
:::collapse{title="Return result"}
```shell
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_fulltext', @query_str_with_mini)) |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"vec": "[0.5,0.1,0.6,0.9]",
"tags": "best-seller,gaming-gear,rgb",
"brand": "GamerZone",
"price": 149.00,
"_score": 4.569735248749978,
"category": "Gaming",
"is_on_sale": 1,
"product_id": "prod-001",
"description": "A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.",
"product_name": "Gamer-Pro Mechanical Keyboard",
"release_date": "2023-07-20 00:00:00.000000",
"stock_quantity": 100
},
{
"vec": "[0.1,0.9,0.2,0]",
"tags": "best-seller,gaming-gear,audio",
"brand": "GamerZone",
"price": 149.00,
"_score": 1.7338881172399914,
"category": "Gaming",
"is_on_sale": 1,
"product_id": "prod-002",
"description": "High-fidelity gaming headset with a noise-cancelling microphone.",
"product_name": "Gamer-Pro Headset",
"release_date": "2023-07-20 00:00:00.000000",
"stock_quantity": 100
}
] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::
### Full-text search with scalar filtering conditions
Some use cases for full-text search with scalar filtering conditions are as follows:
* Precise search: Performing text search under specific conditions. For example, searching for specific keywords in articles with published status.
* Permission control: Searching within data ranges that users have permission to access. For example, an order system searching for product information in orders within a specific time period.
* Category search: Performing keyword search within specific categories. For example, a user system searching for specific user information among active users.
Example:
1. Set search parameters.
```sql
-- Filter condition: specify scalar filter condition c1 >= 2
SET @query_str = '{
"query": {
"bool" : {
"must" : [
{"query_string": {
"fields": ["query", "content"],
"query": "hello what oceanbase mysql"}
}
],
"filter" : [
{"range": {"c1": {"gte" : 2}}}
]
}
}
}';
```
2. Search for records matching the keywords where `c1` is greater than or equal to 2.
```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @query_str));
```
The return result is as follows:
:::collapse{title="Return result"}
```shell
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @query_str)) |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"c1": 2,
"query": "hello world, what is your name",
"_score": 2.170969786679347,
"vector": "[1,2,1]",
"content": "oceanbase mysql database"
},
{
"c1": 3,
"query": "hello world, how are you",
"_score": 0.3503184713375797,
"vector": "[1,1,1]",
"content": "oceanbase oracle database"
},
{
"c1": 6,
"query": "hello world, where are you from",
"_score": 0.3503184713375797,
"vector": "[2,1,1]",
"content": "starrocks oceanbase database"
}
] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::
### Vector search
Some use cases for vector search are as follows:
* Semantic search: Finding related content based on semantic similarity. For example, finding semantically related questions and answers in a knowledge base.
* Recommendation systems: Recommending similar products based on user preferences. For example, recommending similar products on e-commerce platforms.
* Image search: Finding similar images through image features. For example, finding similar images in an image library.
* Intelligent Q&A: Finding semantically related questions and answers in a knowledge base. For example, finding semantically related questions and answers in a customer service system's knowledge base.
Example:
1. Set search parameters.
```sql
-- field specifies the vector field, k specifies the number of results to return (the k nearest results), query_vector specifies the query vector
SET @parm = '{
"knn" : {
"field": "vec",
"k": 3,
"query_vector": [0.5,0.1,0.6,0.9]
}
}';
```
2. Search for all records where `vec` is similar to `[0.5,0.1,0.6,0.9]`.
```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_vector', @parm));
```
The return result is as follows:
:::collapse{title="Return result"}
```shell
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_vector', @parm)) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"vec": "[0.5,0.1,0.6,0.9]",
"tags": "best-seller,gaming-gear,rgb",
"brand": "GamerZone",
"price": 149.00,
"_score": 1.0,
"category": "Gaming",
"is_on_sale": 1,
"product_id": "prod-001",
"description": "A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.",
"product_name": "Gamer-Pro Mechanical Keyboard",
"release_date": "2023-07-20 00:00:00.000000",
"stock_quantity": 100
},
{
"vec": "[0.1,0.9,0.3,0]",
"tags": "eco-friendly,health",
"brand": "NatureFirst",
"price": 49.99,
"_score": 0.43405784,
"category": "Sports",
"is_on_sale": 0,
"product_id": "prod-003",
"description": "A non-slip yoga mat made from sustainable and eco-friendly materials.",
"product_name": "Eco-Friendly Yoga Mat",
"release_date": "2023-04-22 00:00:00.000000",
"stock_quantity": 200
},
{
"vec": "[0.1,0.9,0.2,0]",
"tags": "best-seller,gaming-gear,audio",
"brand": "GamerZone",
"price": 149.00,
"_score": 0.42910841,
"category": "Gaming",
"is_on_sale": 1,
"product_id": "prod-002",
"description": "High-fidelity gaming headset with a noise-cancelling microphone.",
"product_name": "Gamer-Pro Headset",
"release_date": "2023-07-20 00:00:00.000000",
"stock_quantity": 100
}
] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::
### Vector search with scalar filtering conditions
Some use cases for vector search with scalar filtering conditions are as follows:
* Precise search: Performing similarity search under specific conditions. For example, searching for similar articles among those in published status.
* Permission control: Searching within data ranges that users have permission to access. For example, an order system searching for similar products in orders within a specific time period.
* Category search: Performing similarity search within specific categories. For example, searching for similar products within a specific brand.
Example:
1. Set search parameters.
```sql
-- Specify scalar filter condition brand = "GamerZone"
SET @parm = '{
"knn" : {
"field": "vec",
"k": 3,
"query_vector": [0.1,0.5,0.3,0.7],
"filter" : [
{"term" : {"brand": "GamerZone"} }
]
}
}';
```
2. Search for all records where `vec` is similar to `[0.1,0.5,0.3,0.7]` and `brand` is `"GamerZone"`.
```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_vector', @parm));
```
The return result is as follows:
:::collapse{title="Return result"}
```shell
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_vector', @parm)) |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"vec": "[0.5,0.1,0.6,0.9]",
"tags": "best-seller,gaming-gear,rgb",
"brand": "GamerZone",
"price": 149.00,
"_score": 0.59850837,
"category": "Gaming",
"is_on_sale": 1,
"product_id": "prod-001",
"description": "A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.",
"product_name": "Gamer-Pro Mechanical Keyboard",
"release_date": "2023-07-20 00:00:00.000000",
"stock_quantity": 100
},
{
"vec": "[0.1,0.9,0.2,0]",
"tags": "best-seller,gaming-gear,audio",
"brand": "GamerZone",
"price": 149.00,
"_score": 0.55175342,
"category": "Gaming",
"is_on_sale": 1,
"product_id": "prod-002",
"description": "High-fidelity gaming headset with a noise-cancelling microphone.",
"product_name": "Gamer-Pro Headset",
"release_date": "2023-07-20 00:00:00.000000",
"stock_quantity": 100
}
] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::
### Multi-vector search
Multi-vector search refers to searching across multiple vector indexes and returning the most similar records.
Example:
1. Set search parameters.
```sql
-- Specify three vector sub-queries; each specifies the vector index field, the number of results to return, and the query vector
SET @param_multi_knn = '{
"knn" : [{
"field": "vec1",
"k": 5,
"query_vector": [0.5,0.1,0.6,0.9]
},
{
"field": "vec2",
"k": 5,
"query_vector": [0.2,0.3,0.4,0.5]
},
{
"field": "vec3",
"k": 5,
"query_vector": [0.1,0.2,0.3,0.4]
}
],
"size" : 5
}';
```
2. Execute the query and return the query results.
```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_multi_vector', @param_multi_knn));
```
:::collapse{title="Return result"}
```shell
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_multi_vector', @param_multi_knn)) |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"vec1": "[0.5,0.1,0.6,0.9]",
"vec2": "[0.2,0.3,0.4,0.5]",
"vec3": "[0.1,0.2,0.3,0.4]",
"_score": 3.0,
"product_id": "prod-001",
"description": "A responsive mechanical keyboard",
"product_name": "Gamer-Pro Mechanical Keyboard"
},
{
"vec1": "[0.1,0.9,0.2,0]",
"vec2": "[0.3,0.4,0.5,0.6]",
"vec3": "[0.2,0.3,0.4,0.5]",
"_score": 2.0957750699999997,
"product_id": "prod-002",
"description": "High-fidelity gaming headset",
"product_name": "Gamer-Pro Headset"
},
{
"vec1": "[0.1,0.9,0.3,0]",
"vec2": "[0.4,0.5,0.6,0.7]",
"vec3": "[0.3,0.4,0.5,0.6]",
"_score": 1.86262927,
"product_id": "prod-003",
"description": "A non-slip yoga mat",
"product_name": "Eco-Friendly Yoga Mat"
}
] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::
### Full-text and vector hybrid search
Some use cases for full-text and vector hybrid search are as follows:
* Intelligent search: Comprehensive search combining keywords and semantic understanding. For example, when a user inputs `"I need a gaming keyboard"`, the system matches both the keywords `"gaming"` and `"keyboard"`, and understands the semantics of `"gaming equipment"`.
* Document search: Supporting both exact keyword matching and semantic understanding in large document collections. For example, when searching for `"database optimization"`, it matches documents containing these words and also finds semantically related content about `"performance tuning"` and `"query optimization"`.
* Product recommendation: E-commerce platforms support both product name search and requirement description search. For example, based on a user's description `"laptop suitable for office work"`, it matches keywords and understands the semantic requirement of `"business office"`.
Example:
1. Set search parameters.
```sql
SET @parm = '{
"query": {
"bool": {
"should": [
{"match": {"query": "hi hello"}},
{"match": { "content": "oceanbase mysql" }}
]
}
},
"knn" : {
"field": "vector",
"k": 5,
"query_vector": [1,2,3]
},
"_source" : ["query", "content", "_keyword_score", "_semantic_score"]
}';
```
2. Execute the query and return the query results.
```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));
```
The return result is as follows:
:::collapse{title="Return result"}
```shell
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(dbms_hybrid_search.search('doc_table', @parm)) |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"query": "hello world, what is your name",
"_score": 2.835628417884166,
"content": "oceanbase mysql database",
"_keyword_score": 2.5022950878841663,
"_semantic_score": 0.33333333
},
{
"query": "hello world",
"_score": 1.7219400929592013,
"content": "oceanbase Elasticsearch database",
"_keyword_score": 0.7219400929592014,
"_semantic_score": 1.0
},
{
"query": "hello world, how are you",
"_score": 1.0096539326751595,
"content": "oceanbase oracle database",
"_keyword_score": 0.7006369426751594,
"_semantic_score": 0.30901699
},
{
"query": "real world, how old are you",
"_score": 0.41421356,
"content": "redis oracle database",
"_keyword_score": null,
"_semantic_score": 0.41421356
},
{
"query": "real world, where are you from",
"_score": 0.30901699,
"content": "postgres oracle database",
"_keyword_score": null,
"_semantic_score": 0.30901699
}
] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::
### Full-text and vector RRF hybrid search
The result sets of full-text sub-queries and vector sub-queries are fused with weighted scoring by default. You can switch the fusion method to RRF (Reciprocal Rank Fusion) through the `rank` syntax. Some use cases are as follows:
* Multi-dimensional ranking: Requiring comprehensive consideration of results from multiple search dimensions. For example, in academic search systems, when searching in a paper database, both keyword matching degree and semantic relevance need to be considered.
* Fairness requirements: Ensuring that results from different search methods are reasonably displayed. For example, on e-commerce platforms, both textual information such as product titles and descriptions, and visual information such as product images and videos need to be considered.
* Complex queries: Complex search scenarios involving multiple query conditions. For example, in medical systems, both patient symptom descriptions and patient medical history and examination results need to be considered.
Example:
1. Set search parameters.
```sql
SET @rrf_query_param = '{
"query": {
"query_string": {
"fields": ["title", "author", "description"],
"query": "fiction American Dream"
}
},
"knn" : {
"field": "vector_embedding",
"k": 5,
"query_vector": [0.1, 0.2, 0.3, 0.4]
},
"rank" : {
"rrf" : {
"rank_window_size" : 10,
"rank_constant" : 60
}
}
}';
```
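2. Execute the query. The table name `books_table` below is illustrative; substitute the hybrid-search table that actually contains the `title`, `author`, `description`, and `vector_embedding` columns used above.
```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('books_table', @rrf_query_param));
```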
The RRF algorithm calculates the final relevance score by fusing the rankings of multiple sub-query result sets. The calculation formula is as follows:
```
score = 0.0
for q in queries:
    if d in result(q):
        score += 1.0 / ( k + rank( result(q), d ) )  # k is the configured rank_constant
return score
```
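For example, with `rank_constant = 60`, a document ranked first by one sub-query and second by the other scores `1/61 + 1/62 ≈ 0.0325`. The following is a minimal runnable sketch of the fusion logic; the document IDs and rankings are illustrative and this is not the server-side implementation:
```python
# Minimal RRF sketch: fuse two ranked lists of document IDs.
def rrf_fuse(result_lists, rank_constant=60):
    scores = {}
    for ranked_docs in result_lists:
        for rank, doc in enumerate(ranked_docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (rank_constant + rank)
    # Sort fused scores in descending order, like the search result list.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

keyword_hits = ["doc-1", "doc-2", "doc-3"]   # ranking from the full-text sub-query
semantic_hits = ["doc-3", "doc-1", "doc-4"]  # ranking from the vector sub-query
print(rrf_fuse([keyword_hits, semantic_hits]))
# doc-1 wins: 1/(60+1) + 1/(60+2) ≈ 0.0325, just ahead of doc-3 at 1/(60+1) + 1/(60+3) ≈ 0.0323
```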
### Summary
The examples in this topic demonstrate the powerful application value of the hybrid search feature:
* Intelligent search upgrade: Integrating semantic understanding into traditional keyword search to provide more accurate search results that better match user intent.
* Optimized user experience: Supporting natural language queries, simplifying operations, and improving information retrieval efficiency.
* Empowering diverse businesses: Widely applied in scenarios such as e-commerce, content management, knowledge bases, and intelligent customer service, achieving comprehensive coverage from basic filtering to intelligent recommendations.
* Combined technical advantages: Combining exact matching with semantic understanding to significantly improve the accuracy and comprehensiveness of search results.
The hybrid search feature is an ideal choice for processing massive unstructured data and building intelligent search and recommendation systems.
<!--## Related documentation
* [DBMS_HYBRID_SEARCH sub-function overview](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000004020384)-->
View File
@@ -0,0 +1,165 @@
---
slug: /ai-function-permission
---
# AI function privileges
This topic describes the AI function privileges, including `AI MODEL` and `ACCESS AI MODEL`, which are used for managing AI models and calling AI functions, respectively.
## AI MODEL
AI MODEL privileges are used for managing AI models. These include three specific privileges: `CREATE AI MODEL`, `ALTER AI MODEL`, and `DROP AI MODEL`.
### Syntax
The syntax for granting privileges is as follows:
```sql
-- Grant the privilege to create an AI model.
GRANT CREATE AI MODEL ON *.* TO 'username'@'host';
-- Grant the privilege to change an AI model.
GRANT ALTER AI MODEL ON *.* TO 'username'@'host';
-- Grant the privilege to drop an AI model.
GRANT DROP AI MODEL ON *.* TO 'username'@'host';
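-- Grant all three AI model management privileges in one statement.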
GRANT CREATE AI MODEL, ALTER AI MODEL, DROP AI MODEL ON *.* TO 'username'@'host';
```
The syntax for revoking privileges is as follows:
```sql
-- Revoke the privilege to create an AI model.
REVOKE CREATE AI MODEL ON *.* FROM 'username'@'host';
-- Revoke the privilege to change an AI model.
REVOKE ALTER AI MODEL ON *.* FROM 'username'@'host';
-- Revoke the privilege to drop an AI model.
REVOKE DROP AI MODEL ON *.* FROM 'username'@'host';
-- Check the privileges.
SHOW GRANTS FOR 'username'@'host';
```
### Examples
1. Create a user.
```sql
CREATE USER test_ai_user@'%' IDENTIFIED BY '123456';
```
2. Log in as the `test_ai_user` user.
```shell
obclient -h127.0.0.1 -P2881 -utest_ai_user -p****** -A -Dtest
```
3. Call the `CREATE_AI_MODEL_ENDPOINT` procedure.
```sql
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
-> 'user_ai_model_endpoint_1', '{
'> "ai_model_name": "my_model1",
'> "url": "https://https://api.deepseek.com",
'> "access_key": "sk-xxxxxxxxxxxx",
'> "request_model_name": "deepseek-chat",
'> "provider": "deepseek"
'> }');
```
Since the user does not have the `CREATE AI MODEL` privilege, an error is returned:
```shell
ERROR 42501: Access denied; you need (at least one of) the create ai model endpoint privilege(s) for this operation
```
4. Grant the `CREATE AI MODEL` privilege to the `test_ai_user` user.
```sql
GRANT CREATE AI MODEL ON *.* TO test_ai_user@'%';
```
5. Verify the privilege.
```sql
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
-> 'user_ai_model_endpoint_1', '{
'> "ai_model_name": "my_model1",
'> "url": "https://https://api.deepseek.com",
'> "access_key": "sk-xxxxxxxxxxxx",
'> "request_model_name": "deepseek-caht",
'> "provider": "deepseek"
'> }');
```
This time, the statement executes successfully.
## ACCESS AI MODEL
The `ACCESS AI MODEL` privilege is used for calling AI functions, including `AI_COMPLETE`, `AI_EMBED`, `AI_RERANK`, and `AI_PROMPT`.
### Syntax
The syntax for granting this privilege is as follows:
```sql
GRANT ACCESS AI MODEL ON *.* TO 'username'@'host';
```
The syntax for revoking this privilege is as follows:
```sql
REVOKE ACCESS AI MODEL ON *.* FROM 'username'@'host';
```
### Examples
1. Call the `AI_COMPLETE` function.
```sql
SELECT AI_COMPLETE("ob_complete","Your task is to perform sentiment analysis on the provided text and determine whether the sentiment is positive or negative.
The text to analyze is as follows:
<text>
What a beautiful day!
</text>
Judgment criteria:
If the text expresses a positive sentiment, output 1; if it expresses a negative sentiment, output -1. Do not output anything else.\n") AS ans;
```
Since the user does not have the `ACCESS AI MODEL` privilege, an error is returned:
```shell
ERROR 42501: Access denied; you need (at least one of) the access ai model endpoint privilege(s) for this operation
```
2. Grant the `ACCESS AI MODEL` privilege to the `test_ai_user` user.
```sql
GRANT ACCESS AI MODEL ON *.* TO test_ai_user@'%';
```
3. Verify the privilege.
```sql
SELECT AI_COMPLETE("ob_complete","Your task is to perform sentiment analysis on the provided text and determine whether the sentiment is positive or negative.
The text to analyze is as follows:
<text>
What a beautiful day!
</text>
Judgment criteria:
If the text expresses a positive sentiment, output 1; if it expresses a negative sentiment, output -1. Do not output anything else.\n") AS ans;
```
This time, the statement executes successfully.
```sql
+-----+
| ans |
+-----+
| 1 |
+-----+
```
View File
@@ -0,0 +1,332 @@
---
slug: /ai-function
---
# Use cases and examples of AI functions
This topic describes the features of AI functions in seekdb.
AI functions integrate AI model capabilities directly into data processing within databases through SQL expressions. This greatly simplifies operations such as data extraction, analysis, summarization, and storage using large AI models, making it an important new feature in the fields of databases and data warehouses. seekdb provides AI model and endpoint management through the `DBMS_AI_SERVICE` package, introduces several built-in AI function expressions, and supports monitoring AI model usage through views.
## Prerequisites
Before using AI functions, make sure you have the necessary privileges. For more information about the privileges, see [AI function privileges](100.ai-function-permission.md).
## Considerations
Hybrid search relies on the model management and embedding capabilities of the AI function service. Before dropping an AI model, check whether it is referenced by hybrid search to avoid potential issues.
## AI model management
The `DBMS_AI_SERVICE` package provides the ability to manage AI models and endpoints. It supports the following operations:
| **Operation** | **Description** |
| ------ | ------ |
| CREATE_AI_MODEL | Creates an AI model object. |
| DROP_AI_MODEL | Drops an AI model object. |
| CREATE_AI_MODEL_ENDPOINT | Creates an AI model endpoint object. |
| ALTER_AI_MODEL_ENDPOINT | Modifies an AI model endpoint object. |
| DROP_AI_MODEL_ENDPOINT | Drops an AI model endpoint object. |
By using this system package, you can directly manage AI models and endpoints within seekdb, without relying on external services.
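For example, registering a model endpoint looks like the following sketch; the endpoint name, URL, access key, and model name are placeholders, and a complete walkthrough appears in [AI function privileges](100.ai-function-permission.md):
```sql
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT('my_model_endpoint', '{
  "ai_model_name": "my_model1",
  "url": "https://api.deepseek.com",
  "access_key": "sk-xxxxxxxxxxxx",
  "request_model_name": "deepseek-chat",
  "provider": "deepseek"
}');
```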
## Monitor AI model usage
seekdb allows you to query and monitor information about AI models and their usage through the following views:
* CDB/DBA_OB_AI_MODELS: Query information about AI models.
* CDB/DBA_OB_AI_MODEL_ENDPOINTS: Monitor the calls of AI models.
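For example, to list the registered models (a sketch; this assumes the views are accessed through the `oceanbase` schema, as with other `DBA_OB_*` views):
```sql
SELECT * FROM oceanbase.DBA_OB_AI_MODELS;
```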
## AI function expressions
seekdb supports the following AI function expressions, allowing you to call AI models directly within seekdb using SQL statements and greatly simplifying the process:
| Function | Description |
|-----|------|
|`AI_COMPLETE`|Calls a specified text generation large language model (LLM) to process prompts and data, and then parses the results.|
| `AI_PROMPT` |Constructs and formats prompts. Supports dynamic data insertion.|
|`AI_EMBED`|Calls an embedding model to convert text data into vector data.|
|`AI_RERANK`|Calls a reranking model to sort text based on prompts by similarity.|
:::info
When using AI function expressions, make sure you have registered the AI models and endpoint information in the database.
:::
### AI_COMPLETE and AI_PROMPT
The `AI_COMPLETE` function specifies a registered large language model (LLM) for text generation using the `model_key`, processes the user-provided prompt and data, and returns the text generated by the model. Users can customize the prompt and the format of the data from the database through the `prompt` parameter. This approach not only enables flexible processing of textual data, but also allows for batch processing directly within the database, effectively avoiding the overhead of repeatedly transferring data between the database and the language model.
In many AI application scenarios, prompts are often highly structured and require dynamic injection of specific data. Manually concatenating prompts and input content using functions like `CONCAT` each time is not only costly in terms of development, but also prone to formatting errors. To support prompt reuse and dynamic combination of prompts and data, seekdb provides the `AI_PROMPT` function. `AI_PROMPT` upgrades prompts from "static text" to a "reusable, parameterizable" function template format, which can be used directly in place of the `prompt` parameter within `AI_COMPLETE`. This greatly simplifies the process of constructing prompts, improving both development efficiency and accuracy.
#### AI_PROMPT function
The `AI_PROMPT` function is used to construct and format prompts, supporting dynamic data insertion.
##### Syntax
The syntax for the `AI_PROMPT` function is as follows:
```sql
AI_PROMPT('template', expr0 [ , expr1, ... ]);
```
Parameters:
| Parameter | Description | Type | Nullable |
|-----------|-------------|------|----------|
| template | The prompt template entered by the user. | VARCHAR(max_length) | No |
| expr | The data entered by the user. | VARCHAR(max_length) | No |
Both the `template` and `expr` parameters are required and cannot be null. The `expr` parameter only supports the `VARCHAR` type and does not support the `JSON` type.
Return value:
* JSON: a JSON object that contains the prompt template and its argument list.
##### Examples
The `AI_PROMPT` function organizes the template string and dynamic data into JSON format:
* The first parameter (the template string `template`) is placed in the `template` field of the returned JSON.
* Subsequent parameters (data values `expr0`, `expr1`, ...) are placed in the `args` array of the returned JSON.
* Placeholders in the template such as `{0}`, `{1}`, etc., correspond by index to the data in the `args` array and will be automatically replaced when used in the `AI_COMPLETE` function.
For example:
```sql
SELECT AI_PROMPT('Recommend {0} of the most popular {1} to me.', 'ten', 'mobile phones');
```
Return result:
```json
{
"template": "Recommend {0} of the most popular {1} to me.",
"args": ["ten", "mobile phones"]
}
```
Based on the previous example, using the `AI_PROMPT` function within the `AI_COMPLETE` function:
```sql
SELECT AI_COMPLETE("ob_complete", AI_PROMPT('Recommend {0} of the most popular {1} to me. just output name in json array format', 'two', 'mobile phones')) AS ans;
```
Return result:
```sql
+--------------------------------------------------+
| ans                                              |
+--------------------------------------------------+
| ["iPhone 15 Pro Max","Samsung Galaxy S24 Ultra"] |
+--------------------------------------------------+
```
#### AI_COMPLETE function
##### Syntax
The syntax for the `AI_COMPLETE` function is as follows:
```sql
AI_COMPLETE(model_key, prompt[, parameters])
-- If you use the AI_PROMPT function, replace the prompt parameter with the AI_PROMPT function. See the AI_PROMPT function example.
AI_COMPLETE(model_key, AI_PROMPT(prompt_template, data))
```
Parameters:
| Parameter | Description | Type | Nullable |
|------|------|------|----------|
| model_key | The model registered in the database. | VARCHAR(128) | No |
| prompt | The prompt provided by the user. | VARCHAR/TEXT(LONGTEXT) | No |
| parameters | Optional configuration for the API, such as `temperature`, `top_p`, and `max_tokens`. These options vary by vendor and are added directly to the message body. Typically, you can use the default settings without specifying these options. | JSON | Yes |
Both `model_key` and `prompt` are required. If either is `NULL`, the function will return an error.
Return value:
* text: The text generated by the LLM based on the prompt.
##### Examples
1. Sentiment analysis example
```sql
SELECT AI_COMPLETE("ob_complete","Your task is to perform sentiment analysis on the provided text and determine whether the sentiment is positive or negative.
The text to analyze is as follows:
<text>
What a beautiful day!
</text>
Judgment criteria:
If the text expresses a positive sentiment, output 1; if it expresses a negative sentiment, output -1. Do not output anything else.\n") AS ans;
```
Result:
```sql
+-----+
| ans |
+-----+
| 1 |
+-----+
```
2. Translation example
```sql
CREATE TABLE comments (
id INT AUTO_INCREMENT PRIMARY KEY,
content TEXT
);
INSERT INTO comments (content) VALUES ('hello world!');
-- Use the concat expression to replace the processed data with column names from the database table, enabling batch processing of database data without copying data to and from the LLM.
SELECT AI_COMPLETE("ob_complete",
concat("You are a professional translator. Please translate the following English text into Chinese. The text to be translated is:<text>",
content,
"</text>")) AS ans FROM comments;
```
Result:
```sql
+-------------+
| ans         |
+-------------+
| 你好,世界! |
+-------------+
```
3. Classification example
```sql
SELECT AI_COMPLETE("ob_complete","You are a classification expert. You will receive various issue texts and need to categorize them into the appropriate department. The department list is [\"Hardware\",\"Software\",\"Other\"]. The text to analyze is as follows:
<text>
The screen quality is terrible.
</text>") AS res;
```
Result:
```sql
+----------+
| res      |
+----------+
| Hardware |
+----------+
```
### AI_EMBED
The `AI_EMBED` function uses the `model_key` parameter to specify a registered embedding model, which converts your text data into vector representations. If the model supports multiple dimensions, you can use the `dim` parameter to specify the output dimension.
#### Use AI_EMBED
Syntax:
```sql
AI_EMBED(model_key, input[, dim])
```
Parameters:
| Parameter | Description | Type | Nullable |
|------|------|------|----------|
| model_key | The embedding model registered in your database. | VARCHAR(128) | No |
| input | The text you want to convert into a vector. | VARCHAR | No |
| dim | Specifies the output dimension of the vector. Some model providers support configuring this value. | INT64 | Yes |
Both `model_key` and `input` are required. If either is `NULL`, the function will return an error.
Return value:
* A string in vector format, that is, the embedding model's vector representation of your text.
#### Examples
1. Embed a single row of data.
```sql
SELECT AI_EMBED("ob_embed","Hello world") AS embedding;
```
Result:
```sql
+----------------+
| embedding      |
+----------------+
| [0.1, 0.2, 0.3]|
+----------------+
```
2. Embed table columns.
```sql
CREATE TABLE comments (
id INT AUTO_INCREMENT PRIMARY KEY,
content TEXT
);
INSERT INTO comments (content) VALUES ('hello world!');
SELECT AI_EMBED("ob_embed",content) AS embedding FROM comments;
```
Result:
```sql
+----------------+
| embedding      |
+----------------+
| [0.1, 0.2, 0.3]|
+----------------+
```
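A common follow-up is to write the embeddings back into a vector column so they can be used for vector search. The following is a sketch, assuming the `comments` table is extended with a `VECTOR(3)` column (the dimension must match the embedding model's output):
```sql
-- Add a vector column sized to the model's output dimension (3 here is illustrative).
ALTER TABLE comments ADD COLUMN embedding VECTOR(3);
-- Embed every row's content and store the result in the vector column.
UPDATE comments SET embedding = AI_EMBED("ob_embed", content);
```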
### AI_RERANK
The `AI_RERANK` function uses the `model_key` parameter to specify a registered reranking model. It organizes your query and document list according to the provider's rules, sends them to the specified model, and returns the sorted results. This function is suitable for reranking scenarios in Retrieval-Augmented Generation (RAG).
#### Use AI_RERANK
Syntax:
```sql
AI_RERANK(model_key, query, documents[, document_key])
```
Parameters:
| Parameter | Description | Type | Nullable |
|------|------|------|----------|
| model_key | The reranking model registered in your database. | VARCHAR(128) | No |
| query | The search text you want to use. | VARCHAR(1024) | No |
| documents | The list of documents to be ranked. | JSON array, for example, `'["apple", "banana"]'` | No |
The `model_key`, `query`, and `documents` parameters are required. If any of them is `NULL`, the function will return an error.
Return value:
* A JSON array containing the documents and their relevance scores, sorted in descending order by relevance.
#### Examples
```sql
SELECT AI_RERANK("ob_rerank","Apple",'["apple","banana","fruit","vegetable"]');
```
Result:
```sql
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ai_rerank("ob_rerank","Apple",'["apple","banana","fruit","vegetable"]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"index": 0, "document": {"text": "apple"}, "relevance_score": 0.9912109375}, {"index": 1, "document": {"text": "banana"}, "relevance_score": 0.0033512115478515625}, {"index": 2, "document": {"text": "fruit"}, "relevance_score": 0.0003669261932373047}, {"index": 3, "document": {"text": "vegetable"}, "relevance_score": 0.00001996755599975586}] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
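In a RAG pipeline, the document list is usually assembled from retrieved candidates rather than written inline. The following is a sketch, assuming a `candidate_chunks` table whose `content` column holds the retrieved passages; `JSON_ARRAYAGG` packs them into the JSON array that `AI_RERANK` expects:
```sql
-- Rerank previously retrieved passages against the user question.
SELECT AI_RERANK(
  "ob_rerank",
  "What is a distributed database?",
  (SELECT JSON_ARRAYAGG(content) FROM candidate_chunks)
);
```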
## References
* [Vector embedding technology](../100.vector-search/150.vector-embedding-technology.md)
* [Privilege types in MySQL-compatible mode](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974758)
View File
@@ -0,0 +1,248 @@
---
slug: /oceanbase-mcp-server-and-ai-tool-integration-guide
---
# OceanBase MCP Server
## Background information
AI tools are evolving rapidly, from graphical solutions like Cursor, Windsurf, and Trae to command-line options such as Claude Code, Gemini CLI, and Qwen Code. Empowered by Agent-based frameworks, these tools are remarkably capable. However, a key limitation remains: AI tools cannot directly access databases, leaving a gap between data and analysis. The MCP protocol bridges this gap. By leveraging the MCP protocol, OceanBase MCP Server enables AI tools to interact directly with databases and retrieve data seamlessly.
Traditionally, data analysis tasks—such as user analytics, product analysis, order tracking, and user behavior analysis—require developers to build backend systems for data retrieval and frontend interfaces for data visualization. Even with BI tools, some familiarity with SQL is often necessary. While data can be displayed, understanding the underlying logic or making business decisions based on that data still depends on the expertise of data analysts.
The combination of AI tools, MCP, and large language models (LLMs) is transforming the way data analysis is performed. Analysts no longer need to rely on developers or have SQL knowledge. They can simply describe their requirements to AI tools and instantly receive the results they need—complete with attractive charts and initial data insights.
## Functional architecture
### Core toolkit
OceanBase MCP Server offers standardized interfaces that enable AI tools to interact directly with the database:
| Tool | Description |
|-----------------------------|-------------|
| `execute_sql`               | Executes any SQL statement, such as `SELECT`, `INSERT`, `UPDATE`, `DELETE`, and DDL statements. |
| `get_ob_ash_report` | Gets seekdb Active Session History (ASH) reports for performance diagnostics. |
| `get_current_time` | Returns the current time of the seekdb instance. |
| `oceanbase_text_search` | Performs full-text searches across seekdb database tables. |
| `oceanbase_vector_search` | Executes vector similarity searches within seekdb database tables. |
| `oceanbase_hybrid_search` | Conducts hybrid searches, combining relational filtering with vector search. |
| `ob_memory_query` | Retrieves historical conversation records from the AI memory system using semantic search. (AI Memory System Tool) |
| `ob_memory_insert` | Automatically captures and stores important conversation content to build a knowledge base. (AI Memory System Tool) |
| `ob_memory_delete` | Deletes outdated or redundant conversation memories. (AI Memory System Tool) |
| `ob_memory_update` | Updates or evolves memory content based on new information. (AI Memory System Tool) |
### Resource endpoints
AI tools can directly access these resource endpoints via the MCP protocol:
| Resource path | Description |
|--------------------------------|-------------|
| `oceanbase://tables` | Lists all tables in the database. |
| `oceanbase://sample/{table}` | Retrieves sample data (first 100 rows) from the specified table. `{table}` can be dynamically replaced with the actual table name. |
## Prerequisites
* You have installed Cursor or another tool that supports the MCP protocol (such as Windsurf or Qwen Code).
* You have deployed seekdb.
* For deployment details, see [Deploy seekdb](../../400.guides/400.deploy/100.prepare-servers.md).
* You have a Python environment set up (version 3.10 to 3.12).
* Download the Python installer from the [official Python website](https://www.python.org/downloads/).
## Procedure
### Step 1: Obtain database connection information
Contact the database administrator or deployment team to get the required database connection string. For example:
```shell
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```
**Parameters:**
* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$user_name`: The account for connecting to the instance. Default is `root`.
* `$password`: The account password. Default is empty.
* `$database_name`: The name of the database to access.
**Here is an example:**
```shell
obclient -hxxx.xxx.xxx.xxx -P2881 -uroot -p****** -Dtest
```
### Step 2: Install Python dependencies
#### Environment setup
1. Install the uv package manager.
* On macOS or Linux, run:
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```
* On Windows, run:
```shell
irm https://astral.sh/uv/install.ps1 | iex
```
* Alternatively, you can install uv using pip:
```shell
pip install uv
```
2. Verify the installation:
```shell
uv --version
```
#### Install OceanBase MCP Server
1. Choose a directory and create a virtual environment:
```shell
uv venv
```
2. Activate the virtual environment:
```shell
source .venv/bin/activate
```
3. Install OceanBase MCP Server:
```shell
uv pip install oceanbase-mcp
```
### Step 3: Configure the MCP Server environment
1. Create a `.env` file with your database connection info:
```shell
cat > .env <<EOF
OB_HOST=127.0.0.1
OB_PORT=2881
OB_USER=root
OB_PASSWORD=your_password
OB_DATABASE=test
EOF
```
2. Start the MCP Server:
```shell
# --transport: sse here; stdio and streamable-http are also supported.
# --host: 0.0.0.0 allows external access; use 127.0.0.1 for local access only.
# --port: custom port; update the configuration in later steps if you change it.
uv run oceanbase_mcp_server --transport sse --host 0.0.0.0 --port 8000
```
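To quickly confirm that the server is listening, you can open the SSE endpoint (an illustrative check; `-N` keeps the event stream open without buffering):
```shell
curl -N http://127.0.0.1:8000/sse
```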
### Step 4: Connect Cursor to the MCP Server
1. This procedure uses Cursor V2.0.64 as an example. Click the **Open Settings** icon in the upper right corner, select **Tools & MCP**, and click **New MCP Server**.
![1](https://obportal.s3.ap-southeast-1.amazonaws.com/doc/img/SeekDB-EN/1-mcp-settings.jpg)
2. Edit the `mcp.json` configuration file. Keep `disabled` set to `false` to enable the service, and make sure the port in `url` matches the one you set in Step 3:
```json
{
  "mcpServers": {
    "ob-sse": {
      "autoApprove": [],
      "disabled": false,
      "timeout": 60,
      "type": "sse",
      "url": "http://127.0.0.1:8000/sse"
    }
  }
}
```
3. Verify the connection.
After saving the configuration, return to the **Tools & MCP** page. Then you will find the newly added MCP server.
4. Once added, Cursor will automatically use MCP tools when you ask questions in the Chat window.
![2](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/observer-enterprise/V4.4.1/690.practical-tutorial/900.oceanbase-mcp-server-ai/2.png)
## Quick start examples
Once you set up seekdb and the MCP Server, you can quickly try out data analysis capabilities. The following examples use the [Online Retail Dataset](https://www.kaggle.com/datasets/ulrikthygepedersen/online-retail-dataset) and Cursor to demonstrate how AI tools can seamlessly work with the OceanBase MCP Server for common analytics tasks.
### Customer distribution analysis
1. Instruction input:
```
Please analyze the customer data and show the distribution of customers by country.
```
2. Cursor execution process:
1. Cursor calls `execute_sql` to run an aggregation query:
```sql
SELECT Country,
COUNT(DISTINCT CustomerID) AS customer_count
FROM dataanalysis_english.invoice_items
WHERE CustomerID IS NOT NULL
AND Country IS NOT NULL
GROUP BY Country
ORDER BY customer_count DESC;
```
2. Cursor automatically generates a structured analysis result.
![3](https://obportal.s3.ap-southeast-1.amazonaws.com/doc/img/SeekDB-EN/2-cursor-mcp-1.jpg)
3. Further request:
```
Please convert the above results into a table.
```
4. Output:
![4](https://obportal.s3.ap-southeast-1.amazonaws.com/doc/img/SeekDB-EN/2-cursor-mcp-2.jpg)
### Best-selling products analysis
1. Instruction input:
```
Find the most popular products and show their sales performance.
```
2. Output:
Cursor summarizes the most popular products with performance insights and displays them in a ranked table or bar chart.
![5](https://obportal.s3.ap-southeast-1.amazonaws.com/doc/img/SeekDB-EN/2-cursor-mcp-3.jpg)
### Sales trend over time
1. Instruction input:
```
Analyze monthly sales trends and identify peak periods.
```
2. Output:
Cursor generates a line chart showing monthly sales trends with peak periods highlighted (for example, November and December for holiday shopping).
![6](https://obportal.s3.ap-southeast-1.amazonaws.com/doc/img/SeekDB-EN/2-cursor-mcp-4.jpg)
View File
@@ -0,0 +1,20 @@
---
slug: /json-formatted-data-types
---
# Overview of JSON data types
seekdb supports the JavaScript Object Notation (JSON) data type in compliance with the RFC 7159 standard. You can use it to store semi-structured JSON data and access or modify the data within JSON documents.
The JSON data type offers the following advantages:
* **Automatic validation**: JSON documents stored in JSON columns are automatically validated. Invalid documents will trigger an error.
* **Optimized storage format**: JSON documents stored in JSON columns are converted into an optimized format that enables fast reading and access. When the server reads a JSON value stored in binary format, it doesn't need to parse the value from text.
* **Semi-structured encoding**: This feature further reduces storage costs by splitting a JSON document into multiple sub-columns, with each sub-column encoded individually. This improves compression rates and reduces the storage space required for JSON data. For more information, see [Create a JSON value](200.create-a-json-value.md) and [Semi-structured encoding](600.json-semi-struct.md).
## References
* [Overview of JSON functions](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974794)
View File
@@ -0,0 +1,257 @@
---
slug: /create-a-json-value
---
# Create a JSON value
A JSON value must be one of the following: an object, an array, a string, a number, a boolean value (`true`/`false`), or `null`. Note that `true`, `false`, and `null` must be written in lowercase.
## JSON text structure
A JSON text structure includes characters, strings, numbers, and three literal names. Whitespace characters (spaces, horizontal tabs, line feeds, and carriage returns) are allowed before or after any structural character.
```sql
begin-array = [ left square bracket
begin-object = { left curly bracket
end-array = ] right square bracket
end-object = } right curly bracket
name-separator = : colon
value-separator = , comma
```
### Objects
An object is represented by a pair of curly brackets containing zero or more name/value pairs (also called members). Names within an object must be unique. Each name is a string followed by a colon that separates the name from its value. Multiple name/value pairs are separated by commas.
Here is an example:
```sql
{ "NAME": "SAM", "Height": 175, "Weight": 100, "Registered" : false}
```
### Arrays
An array is represented by square brackets containing zero or more values (also called elements). Array elements are separated by commas, and values in an array do not need to be of the same type.
Here is an example:
```sql
["abc", 10, null, true, false]
```
### Numbers
Numbers use decimal format and contain an integer component that may optionally be prefixed with a minus sign (-). This can be followed by a fractional part and/or an exponent part. Leading zeros are not allowed. The fractional part consists of a decimal point followed by one or more digits. The exponent part begins with an uppercase or lowercase letter E, optionally followed by a plus (+) or minus (-) sign and one or more digits.
Here is an example:
```sql
[100, 0, -100, 100.11, -12.11, 10.22e2, -10.22e2]
```
### Strings
A string begins and ends with quotation marks ("). All Unicode characters can be placed within the quotation marks, except characters that must be escaped (including quotation marks, backslashes, and control characters).
JSON text must be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8.
Here is an example:
```sql
{"Url": "http://www.example.com/image/481989943"}
```
## Create JSON values
seekdb supports the following DDL operations on JSON types:
* Create tables with JSON columns.
* Add or drop JSON columns.
* Create indexes on generated columns based on JSON columns.
* Enable semi-structured encoding when creating tables.
* Enable semi-structured encoding on existing tables.
### Limitations
You can create multiple JSON columns in each table, with the following limitations:
* JSON columns cannot be used as `PRIMARY KEY`, `FOREIGN KEY`, or `UNIQUE KEY`, but you can add `NOT NULL` or `CHECK` constraints.
* JSON columns cannot have default values.
* JSON columns cannot be used as partitioning keys.
* The length of JSON data cannot exceed the length of `LONGTEXT`, and the maximum depth of each JSON object or array is 99.
### Examples
#### Create or modify JSON columns
```sql
obclient> CREATE TABLE tbl1 (id INT PRIMARY KEY, docs JSON NOT NULL, docs1 JSON);
Query OK, 0 rows affected
obclient> ALTER TABLE tbl1 MODIFY docs JSON CHECK(docs < '{"a" : 100}');
Query OK, 0 rows affected
obclient> CREATE TABLE json_tab(
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'Primary key',
json_info JSON COMMENT 'JSON data',
json_id INT GENERATED ALWAYS AS (json_info -> '$.id') COMMENT 'Virtual field from JSON data',
json_name VARCHAR(5) GENERATED ALWAYS AS (json_info -> '$.NAME'),
index json_info_id_idx (json_id)
)COMMENT 'Example JSON table';
Query OK, 0 rows affected
obclient> ALTER TABLE json_tab ADD COLUMN json_info1 JSON;
Query OK, 0 rows affected
obclient> ALTER TABLE json_tab ADD INDEX (json_name);
Query OK, 0 rows affected
obclient> ALTER TABLE json_tab DROP COLUMN json_info1;
Query OK, 0 rows affected
```
#### Create an index on a specific key using a generated column
```sql
obclient> CREATE TABLE jn ( c JSON, g INT GENERATED ALWAYS AS (c->"$.id"));
Query OK, 0 rows affected
obclient> CREATE INDEX idx1 ON jn(g);
Query OK, 0 rows affected
Records: 0 Duplicates: 0 Warnings: 0
obclient> INSERT INTO jn (c) VALUES
('{"id": "1", "name": "Fred"}'), ('{"id": "2", "name": "Wilma"}'),
('{"id": "3", "name": "Barney"}'), ('{"id": "4", "name": "Betty"}');
Query OK, 4 rows affected
Records: 4 Duplicates: 0 Warnings: 0
obclient> SELECT c->>"$.name" AS name FROM jn WHERE g <= 2;
+-------+
| name |
+-------+
| Fred |
| Wilma |
+-------+
2 rows in set
obclient> EXPLAIN SELECT c->>"$.name" AS name FROM jn WHERE g <= 2\G
*************************** 1. row ***************************
Query Plan: =========================================
|ID|OPERATOR |NAME |EST. ROWS|COST|
-----------------------------------------
|0 |TABLE SCAN|jn(idx1) |2 |92 |
=========================================
Outputs & filters:
-------------------------------------
0 - output([JSON_UNQUOTE(JSON_EXTRACT(jn.c, '$.name'))]), filter(nil),
access([jn.c]), partitions(p0)
1 row in set
```
#### Use semi-structured encoding
seekdb supports enabling semi-structured encoding when creating tables, primarily controlled by the table-level parameter `SEMISTRUCT_PROPERTIES`. You must also set `ROW_FORMAT=COMPRESSED` for the table, otherwise an error will occur:
* When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)`, the table is considered a semi-structured table, meaning all JSON columns in the table will have semi-structured encoding enabled.
* When `SEMISTRUCT_PROPERTIES=(encoding_type=none)`, the table is considered a structured table.
* You can also set the frequency threshold using the `freq_threshold` parameter.
* Currently, `encoding_type` and `freq_threshold` can only be modified using online DDL statements, not offline DDL statements.
1. Enable semi-structured encoding.
:::tip
If you enable semi-structured encoding, make sure that the parameter [micro_block_merge_verify_level](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971939) is set to the default value `2`. Do not disable micro-block major compaction verification.
:::
:::tab
tab Example: Enable semi-structured encoding during table creation
```sql
CREATE TABLE t1( j json)
ROW_FORMAT=COMPRESSED
SEMISTRUCT_PROPERTIES=(encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [CREATE TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974140).
tab Example: Enable semi-structured encoding for existing table
```sql
CREATE TABLE t1(j json);
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [ALTER TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974126).
Some modification limitations:
* If semi-structured encoding is not enabled, modifying the frequent column threshold will not report an error but will have no effect.
* The `freq_threshold` parameter cannot be modified during direct load operations or when the table is locked.
* Modifying one sub-parameter does not affect the others.
:::
2. Disable semi-structured encoding.
When `SEMISTRUCT_PROPERTIES` is set to `(encoding_type=none)`, semi-structured encoding is disabled. This operation does not affect existing data and only applies to data written afterward. Here is an example of disabling semi-structured encoding:
```sql
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=none);
```
3. Query semi-structured encoding configuration.
Use the `SHOW CREATE TABLE` statement to query the semi-structured encoding configuration. Here is an example statement:
```sql
SHOW CREATE TABLE t1;
```
The result is as follows:
```shell
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1 | CREATE TABLE `t1` (
`j` json DEFAULT NULL
) ORGANIZATION INDEX DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPRESSED COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE ENABLE_MACRO_BLOCK_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 SEMISTRUCT_PROPERTIES=(ENCODING_TYPE=ENCODING, FREQ_THRESHOLD=50) |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)` is specified, the query displays this parameter, indicating that semi-structured encoding is enabled.
Using semi-structured encoding can improve the performance of conditional filtering queries with the [JSON_VALUE() function](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975890). Based on JSON semi-structured encoding technology, seekdb optimizes the performance of `JSON_VALUE` expression conditional filtering query scenarios. Since JSON data is split into sub-columns, the system can filter directly based on the encoded sub-column data without reconstructing the complete JSON structure, significantly improving query efficiency.
Here is an example query:
```sql
-- Query rows where the value of the name field is 'Devin'
SELECT * FROM t WHERE JSON_VALUE(j_doc, '$.name' RETURNING CHAR) = 'Devin';
```
Character set considerations:
- seekdb uses `utf8_bin` encoding for JSON.
- To ensure string whitebox filtering works properly, we recommend the following settings:
```sql
SET @@collation_server = 'utf8mb4_bin';
SET @@collation_connection='utf8mb4_bin';
```
View File
@@ -0,0 +1,174 @@
---
slug: /querying-and-modifying-json-values
---
# Query and modify JSON values
seekdb supports querying and referencing JSON values. Using path expressions, you can extract or modify specific portions of a JSON document.
## Reference JSON values
seekdb provides two methods for querying and referencing JSON values:
* Use the `->` operator to return a key's value as JSON, with string values keeping their surrounding double quotes.
* Use the `->>` operator to return a key's value with the double quotes removed. It is equivalent to wrapping the `->` result in `JSON_UNQUOTE()`.
Examples:
```sql
obclient> SELECT c->"$.name" AS name FROM jn WHERE g <= 2;
+---------+
| name |
+---------+
| "Fred" |
| "Wilma" |
+---------+
2 rows in set
obclient> SELECT c->>"$.name" AS name FROM jn WHERE g <= 2;
+-------+
| name |
+-------+
| Fred |
| Wilma |
+-------+
2 rows in set
obclient> SELECT JSON_UNQUOTE(c->'$.name') AS name
FROM jn WHERE g <= 2;
+-------+
| name |
+-------+
| Fred |
| Wilma |
+-------+
2 rows in set
```
Because JSON documents are hierarchical, JSON functions use path expressions to extract or modify portions of a document and to specify where in the document the operation should occur.
seekdb uses a path syntax consisting of a leading `$` character followed by a selector to represent the JSON document being accessed. The selector types are as follows:
* The `.` symbol represents the key name to access. Key names that are not valid unquoted identifiers (for example, names containing spaces) must be enclosed in double quotes.
Example:
```sql
obclient> SELECT JSON_EXTRACT('{"id": 14, "name": "Aztalan"}', '$.name');
+---------------------------------------------------------+
| JSON_EXTRACT('{"id": 14, "name": "Aztalan"}', '$.name') |
+---------------------------------------------------------+
| "Aztalan" |
+---------------------------------------------------------+
1 row in set
```
* The `[N]` symbol is placed after the path of the selected array and represents the value at position N in the array, where N is a non-negative integer. Array positions are zero-indexed. If `path` does not select an array value, then `path[0]` evaluates to the same value as `path`.
Example:
```sql
obclient> SELECT JSON_SET('"x"', '$[0]', 'a');
+------------------------------+
| JSON_SET('"x"', '$[0]', 'a') |
+------------------------------+
| "a" |
+------------------------------+
1 row in set
```
* The `[M to N]` symbol specifies a subset or range of array values, starting from position M and ending at position N.
Example:
```sql
obclient> SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[1 to 3]');
+----------------------------------------------+
| JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[1 to 3]') |
+----------------------------------------------+
| [2, 3, 4] |
+----------------------------------------------+
1 row in set
```
* Path expressions can also include `*` or `**` wildcard characters:
* `.*` represents the values of all members in a JSON object.
* `[*]` represents the values of all elements in a JSON array.
* `prefix**suffix` represents all paths that begin with the specified prefix and end with the specified suffix. The prefix is optional, but the suffix is required. Using `**` or `***` alone to match arbitrary paths is not allowed. See the sketch after this list.
:::info
Paths that do not exist in the document (evaluating to non-existent data) evaluate to <code>NULL</code>.
:::
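The following sketch illustrates the wildcard selectors (assuming MySQL-compatible path evaluation; the literals are illustrative):
```sql
-- `$.*`: the values of all members of the top-level object.
SELECT JSON_EXTRACT('{"a": 1, "b": 2}', '$.*');                   -- [1, 2]

-- `$**.b`: every value reachable under a key `b`, at any depth.
SELECT JSON_EXTRACT('{"a": {"b": 1}, "c": {"b": 2}}', '$**.b');   -- [1, 2]
```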
## Modify JSON values
seekdb also supports modifying complete JSON values using DML statements, and using the JSON_SET(), JSON_REPLACE(), or JSON_REMOVE() functions in `UPDATE` statements to modify partial JSON values.
Examples:
```sql
-- Insert a complete JSON document.
INSERT INTO json_tab(json_info) VALUES ('[1, {"a": "b"}, [2, "qwe"]]');
-- Append an element to part of the document.
UPDATE json_tab SET json_info=JSON_ARRAY_APPEND(json_info, '$', 2) WHERE id=1;
-- Replace the complete document.
UPDATE json_tab SET json_info='[1, {"a": "b"}]';
-- Replace part of the document.
UPDATE json_tab SET json_info=JSON_REPLACE(json_info, '$[2]', 'aaa') WHERE id=1;
-- Delete a row.
DELETE FROM json_tab WHERE id=1;
-- Remove part of the document using a function.
UPDATE json_tab SET json_info=JSON_REMOVE(json_info, '$[2]') WHERE id=1;
```
## JSON path syntax
A path consists of a scope and one or more path segments. For paths used in JSON functions, the scope is the document being searched or otherwise operated on, represented by the leading `$` character.
Path segments are separated by periods (.). Array elements are represented by `[N]`, where N is a non-negative integer. Key names must be either double-quoted strings or valid ECMAScript identifiers.
Path expressions (like JSON text) should be encoded using the ascii, utf8, or utf8mb4 character set. Other character encodings are implicitly converted to utf8mb4.
The complete syntax is as follows:
```sql
pathExpression: // Path expression
scope[(pathLeg)*] // Scope is represented by the leading $ character
pathLeg:
member | arrayLocation | doubleAsterisk
member:
period ( keyName | asterisk )
arrayLocation:
leftBracket ( nonNegativeInteger | asterisk ) rightBracket
keyName:
ESIdentifier | doubleQuotedString
doubleAsterisk:
'**'
period:
'.'
asterisk:
'*'
leftBracket:
'['
rightBracket:
']'
```
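For reference, here are two paths that are valid under this grammar (illustrative examples, assuming MySQL-compatible behavior):
```sql
-- A doubleQuotedString key name followed by an array location.
SELECT JSON_EXTRACT('{"my key": [10, 20, 30]}', '$."my key"[2]');   -- 30

-- ESIdentifier key names chained with periods.
SELECT JSON_EXTRACT('{"a": {"b": 42}}', '$.a.b');                   -- 42
```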
View File
@@ -0,0 +1,54 @@
---
slug: /json-formatted-data-type-conversion
---
# Convert JSON data types
seekdb supports the CAST function for converting between JSON and other data types.
The following table describes the conversion rules for JSON data types.
| Other data types | CAST(other_type AS JSON) | CAST(JSON AS other_type) |
|-------------------------------------|---------------------------------------------|----------------------------------------------------------|
| JSON | No change. | No change. |
| UTF-8 character types (including utf8mb4, utf8, and ascii) | The characters are converted to JSON values and validated. | The data is serialized into utf8mb4 strings. |
| Other character sets | First converted to utf8mb4 encoding, then processed as UTF-8 character type. | First serialized into utf8mb4-encoded strings, then converted to the corresponding character set. |
| NULL | An empty JSON value is returned. | Not applicable. |
| Other types | Only scalar values are converted to JSON values containing that single value. | If the JSON value contains only one scalar value that matches the target type, it is converted to the corresponding type; otherwise, NULL is returned and a warning is issued. |
:::info
<code>other_type</code> specifies a data type other than JSON.
:::
Here are some conversion examples:
```sql
obclient> SELECT CAST("123" AS JSON);
+---------------------+
| CAST("123" AS JSON) |
+---------------------+
| 123 |
+---------------------+
1 row in set
obclient> SELECT CAST(null AS JSON);
+--------------------+
| CAST(null AS JSON) |
+--------------------+
| NULL |
+--------------------+
1 row in set
CREATE TABLE tj1 (c1 JSON,c2 VARCHAR(20));
INSERT INTO tj1 VALUES ('{"id": 17, "color": "red"}','apple'),('{"id": 18, "color": "yellow"}', 'banana'),('{"id": 16, "color": "orange"}','orange');
obclient> SELECT * FROM tj1 ORDER BY CAST(JSON_EXTRACT(c1, '$.id') AS UNSIGNED);
+-------------------------------+--------+
| c1 | c2 |
+-------------------------------+--------+
| {"id": 16, "color": "orange"} | orange |
| {"id": 17, "color": "red"} | apple |
| {"id": 18, "color": "yellow"} | banana |
+-------------------------------+--------+
3 rows in set
```
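Conversion in the opposite direction, from JSON to another type, follows the scalar rule in the table above. A minimal sketch:
```sql
-- The extracted JSON value is a single scalar that fits the target type.
SELECT CAST(JSON_EXTRACT('{"id": 17, "color": "red"}', '$.id') AS UNSIGNED);  -- 17
```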
View File
@@ -0,0 +1,328 @@
---
slug: /json-partial-update
---
# Partial JSON data updates
seekdb supports partial JSON data updates (JSON Partial Update). When only specific fields in a JSON document need to be modified, this feature allows you to update only the changed portions without having to update the entire JSON document.
## Enable or disable JSON Partial Update
The JSON Partial Update feature in seekdb is disabled by default. It is controlled by the system variable `log_row_value_options`. For more information, see [log_row_value_options](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001972193).
**Here are some examples:**
* Enable the JSON Partial Update feature.
* Session level:
```sql
SET log_row_value_options="partial_json";
```
* Global level:
```sql
SET GLOBAL log_row_value_options="partial_json";
```
* Disable the JSON Partial Update feature.
* Session level:
```sql
SET log_row_value_options="";
```
* Global level:
```sql
SET GLOBAL log_row_value_options="";
```
* Query the value of `log_row_value_options`.
```sql
SHOW VARIABLES LIKE 'log_row_value_options';
```
The result is as follows:
```sql
+-----------------------+-------+
| Variable_name | Value |
+-----------------------+-------+
| log_row_value_options | |
+-----------------------+-------+
1 row in set
```
## JSON expressions for partial updates
In addition to the JSON Partial Update feature switch `log_row_value_options`, you must use specific expressions to update JSON documents to trigger JSON Partial Update.
The following JSON expressions in seekdb currently support partial updates:
* json_set or json_replace: updates the value of a JSON field.
* json_remove: deletes a JSON field.
:::tip
<ol><li>Ensure that the left operand of the <code>SET</code> assignment clause and the first parameter of the JSON expression are the same and both are JSON columns in the table. For example, in <code>j = json_replace(j, '$.name', 'ab')</code>, the parameter on the left side of the equals sign and the first parameter of the JSON expression <code>json_replace</code> on the right side are both <code>j</code>.</li><li>JSON Partial Update is only triggered when the current JSON column data is stored as <code>outrow</code>. Whether data is stored as <code>outrow</code> or <code>inrow</code> is controlled by the <code>lob_inrow_threshold</code> parameter when creating the table. <code>lob_inrow_threshold</code> is used to configure the <code>INROW</code> threshold. When the LOB data size exceeds this threshold, it is stored as <code>OUTROW</code> in the LOB Meta table. The default value is 4 KB.</li></ol>
:::
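For example, to make even relatively small documents eligible for partial updates, you can lower the `INROW` threshold when creating the table. A minimal sketch based on the `lob_inrow_threshold` option described in the tip above (the table name is illustrative):
```sql
-- Documents larger than 1 KB are stored as OUTROW and can be partially updated.
CREATE TABLE json_big(pk INT PRIMARY KEY, j JSON) LOB_INROW_THRESHOLD = 1024;
```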
**Examples:**
1. Create a table named `json_test`.
```sql
CREATE TABLE json_test(pk INT PRIMARY KEY, j JSON);
```
2. Insert data.
```sql
INSERT INTO json_test VALUES(1, CONCAT('{"name": "John", "content": "', repeat('x',8), '"}'));
```
The result is as follows:
```shell
Query OK, 1 row affected
```
3. Query the data in the JSON column `j`.
```sql
SELECT j FROM json_test;
```
The result is as follows:
```shell
+-----------------------------------------+
| j |
+-----------------------------------------+
| {"name": "John", "content": "xxxxxxxx"} |
+-----------------------------------------+
1 row in set
```
4. Use `json_replace` to update the value of the `name` field in the JSON column.
```sql
UPDATE json_test SET j = json_replace(j, '$.name', 'ab') WHERE pk = 1;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
5. Query the modified data in JSON column `j`.
```sql
SELECT j FROM json_test;
```
Result:
```shell
+---------------------------------------+
| j |
+---------------------------------------+
| {"name": "ab", "content": "xxxxxxxx"} |
+---------------------------------------+
1 row in set
```
6. Use `json_set` to update the value of the `name` field in the JSON column.
```sql
UPDATE json_test SET j = json_set(j, '$.name', 'cd') WHERE pk = 1;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
7. Query the modified data in JSON column `j`.
```sql
SELECT j FROM json_test;
```
Result:
```shell
+---------------------------------------+
| j |
+---------------------------------------+
| {"name": "cd", "content": "xxxxxxxx"} |
+---------------------------------------+
1 row in set
```
8. Use `json_remove` to delete the `name` field value in the JSON column.
```sql
UPDATE json_test SET j = json_remove(j, '$.name') WHERE pk = 1;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
9. Query the modified data in JSON column `j`.
```sql
SELECT j FROM json_test;
```
Result:
```shell
+-------------------------+
| j |
+-------------------------+
| {"content": "xxxxxxxx"} |
+-------------------------+
1 row in set
```
## Granularity of updates
JSON data in seekdb is stored based on LOB storage, and LOBs in seekdb are stored in chunks at the underlying level. Therefore, the minimum unit of data written by a partial update is one LOB chunk: the smaller the LOB chunk, the less data is written. seekdb provides DDL syntax for specifying the LOB chunk size when defining a column.
**Example:**
```sql
CREATE TABLE json_test(pk INT PRIMARY KEY, j JSON CHUNK '4k');
```
The chunk size cannot be infinitely small, as too small a size will affect the performance of `SELECT`, `INSERT`, and `DELETE` operations. It is generally recommended to set it based on the average field size of JSON documents. If most fields are very small, you can set it to 1K. To optimize LOB type reads, seekdb stores data smaller than 4K directly as `INROW`, in which case partial update will not be performed. Partial Update is mainly intended to improve the performance of updating large documents; for small documents, full updates actually perform better.
## Rebuild
JSON Partial Update does not impose restrictions on the data length before and after updating a JSON column. When the length of the new value is less than or equal to the length of the old value, the data at the original location is directly replaced with the new data. When the length of the new value is greater than the length of the old value, the new data is appended at the end. seekdb sets a threshold: when the length of the appended data exceeds 30% of the original data length, a rebuild is triggered. In this case, Partial Update is not performed; instead, a full overwrite is performed.
You can use the `JSON_STORAGE_SIZE` expression to get the actual storage length of JSON data, and `JSON_STORAGE_FREE` to get the additional storage overhead.
**Example:**
1. Enable JSON Partial Update.
```sql
SET log_row_value_options = "partial_json";
```
2. Create a test table named `json_test`.
```sql
CREATE TABLE json_test(pk INT PRIMARY KEY, j JSON CHUNK '1K');
```
3. Insert a row of data into the `json_test` table.
```sql
INSERT INTO json_test VALUES(10 , json_object('name', 'zero', 'age', 100, 'position', 'software engineer', 'profile', repeat('x', 4096), 'like', json_array('a', 'b', 'c'), 'tags', json_array('sql boy', 'football', 'summer', 1), 'money' , json_object('RMB', 10000, 'Dollers', 20000, 'BTC', 100), 'nickname', 'noone'));
```
Result:
```shell
Query OK, 1 row affected
```
4. Use `JSON_STORAGE_SIZE` to query the storage size of the JSON column (actual occupied storage space) and `JSON_STORAGE_FREE` to estimate the storage space that can be freed from the JSON column.
```sql
SELECT JSON_STORAGE_SIZE(j), JSON_STORAGE_FREE(j) FROM json_test WHERE pk = 10;
```
Result:
```shell
+----------------------+----------------------+
| JSON_STORAGE_SIZE(j) | JSON_STORAGE_FREE(j) |
+----------------------+----------------------+
| 4335 | 0 |
+----------------------+----------------------+
1 row in set
```
Since no partial update has been performed, the value of `JSON_STORAGE_FREE` is 0.
5. Use `json_replace` to update the value of the `position` field in the JSON column, where the length of the new value is less than the length of the old value.
```sql
UPDATE json_test SET j = json_replace(j, '$.position', 'software enginee') WHERE pk = 10;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
6. Again, use `JSON_STORAGE_SIZE` to query the storage size of the JSON column and `JSON_STORAGE_FREE` to estimate the storage space that can be freed from the JSON column.
```sql
SELECT JSON_STORAGE_SIZE(j), JSON_STORAGE_FREE(j) FROM json_test WHERE pk = 10;
```
Result:
```shell
+----------------------+----------------------+
| JSON_STORAGE_SIZE(j) | JSON_STORAGE_FREE(j) |
+----------------------+----------------------+
| 4335 | 1 |
+----------------------+----------------------+
1 row in set
```
After the JSON column data is updated, since the new data is one byte less than the old data, the `JSON_STORAGE_FREE` result is 1.
7. Use `json_replace` to update the value of the `position` field in the JSON column, where the length of the new value is greater than the length of the old value.
```sql
UPDATE json_test SET j = json_replace(j, '$.position', 'software engineer') WHERE pk = 10;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
8. Use `JSON_STORAGE_SIZE` again to query the JSON column storage size, and `JSON_STORAGE_FREE` to estimate the storage space that can be freed from the JSON column.
```sql
SELECT JSON_STORAGE_SIZE(j), JSON_STORAGE_FREE(j) FROM json_test WHERE pk = 10;
```
Result:
```shell
+----------------------+----------------------+
| JSON_STORAGE_SIZE(j) | JSON_STORAGE_FREE(j) |
+----------------------+----------------------+
| 4355 | 19 |
+----------------------+----------------------+
1 row in set
```
After appending new data to the JSON column, `JSON_STORAGE_FREE` returns 19, indicating that 19 bytes can be freed by a rebuild.
View File
@@ -0,0 +1,124 @@
---
slug: /json-semi-struct
---
# Semi-structured encoding
This topic describes the semi-structured encoding feature supported by seekdb.
seekdb supports enabling semi-structured encoding when creating tables, primarily controlled by the table-level parameter `SEMISTRUCT_PROPERTIES`. You must also set `ROW_FORMAT=COMPRESSED` for the table, otherwise an error will occur.
## Considerations
* When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)`, the table is considered a semi-structured table, meaning all JSON columns in the table will have semi-structured encoding enabled.
* When `SEMISTRUCT_PROPERTIES=(encoding_type=none)`, the table is considered a structured table.
* You can also set the frequency threshold using the `freq_threshold` parameter. When semi-structured encoding is enabled, the system analyzes the frequency of each path in the JSON data and stores paths with frequencies exceeding the specified threshold as independent subcolumns, known as frequent columns. For example, if you have a user table where the JSON field stores user information and 90% of users have the `name` and `age` fields, the system will automatically extract `name` and `age` as independent frequent columns. During queries, these columns are accessed directly without parsing the entire JSON, thereby improving query performance.
* Currently, `encoding_type` and `freq_threshold` can only be modified using online DDL statements, not offline DDL statements.
## Data format
JSON data is split and stored as structured columns in a specific format. The columns split from JSON columns are called subcolumns. Subcolumns can be categorized into different types, including sparse columns and frequent columns.
* Sparse columns: Subcolumns that exist in some JSON documents but not in others, with an occurrence frequency lower than the threshold specified by the table-level parameter `freq_threshold`.
* Frequent columns: Subcolumns that appear in JSON data with a frequency higher than the threshold specified by the table-level parameter `freq_threshold`. These subcolumns are stored as independent columns to improve filtering query performance.
For example:
```sql
{"id": 1001, "name": "n1", "nickname": "nn1"}
{"id": 1002, "name": "n2", "nickname": "nn2"}
{"id": 1003, "name": "n3", "nickname": "nn3"}
{"id": 1004, "name": "n4", "nickname": "nn4"}
{"id": 1005, "name": "n5"}
```
In this example, `id` and `name` are fields that exist in every JSON document with an occurrence frequency of 100%, while `nickname` exists in only four JSON documents with an occurrence frequency of 80%.
If `freq_threshold` is set to 100%, then `nickname` will be inferred as a sparse column, while `id` and `name` will be inferred as frequent columns. If set to 80%, then `nickname`, `id`, and `name` will all be inferred as frequent columns.
## Examples
1. Enable semi-structured encoding.
:::tip
If you enable semi-structured encoding, make sure that the parameter <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971939">micro_block_merge_verify_level</a> is set to the default value <code>2</code>. Do not disable micro-block major compaction verification.
:::
:::tab
tab Example: Enable semi-structured encoding during table creation
```sql
CREATE TABLE t1( j json)
ROW_FORMAT=COMPRESSED
SEMISTRUCT_PROPERTIES=(encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [CREATE TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974140).
tab Example: Enable semi-structured encoding for existing table
```sql
CREATE TABLE t1(j json);
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [ALTER TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974126).
Some modification limitations:
* If semi-structured encoding is not enabled, modifying the frequent column threshold will not report an error but will have no effect.
* The `freq_threshold` parameter cannot be modified during direct load operations or when the table is locked.
* Modifying one sub-parameter does not affect the others.
:::
2. Disable semi-structured encoding.
When `SEMISTRUCT_PROPERTIES` is set to `(encoding_type=none)`, semi-structured encoding is disabled. This operation does not affect existing data and only applies to data written afterward. Here is an example of disabling semi-structured encoding:
```sql
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=none);
```
3. Query semi-structured encoding configuration.
Use the `SHOW CREATE TABLE` statement to query the semi-structured encoding configuration. Here is an example statement:
```sql
SHOW CREATE TABLE t1;
```
The result is as follows:
```shell
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1 | CREATE TABLE `t1` (
`j` json DEFAULT NULL
) ORGANIZATION INDEX DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPRESSED COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE ENABLE_MACRO_BLOCK_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 SEMISTRUCT_PROPERTIES=(ENCODING_TYPE=ENCODING, FREQ_THRESHOLD=50) |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)` is specified, the query displays this parameter, indicating that semi-structured encoding is enabled.
Using semi-structured encoding can improve the performance of conditional filtering queries with the [JSON_VALUE() function](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975890). Based on JSON semi-structured encoding technology, seekdb optimizes the performance of `JSON_VALUE` expression conditional filtering query scenarios. Since JSON data is split into sub-columns, the system can filter directly based on the encoded sub-column data without reconstructing the complete JSON structure, significantly improving query efficiency.
Here is an example query:
```sql
-- Query rows where the value of the name field is 'Devin'
SELECT * FROM t WHERE JSON_VALUE(j_doc, '$.name' RETURNING CHAR) = 'Devin';
```
Character set considerations:
- seekdb uses `utf8_bin` encoding for JSON.
- To ensure string whitebox filtering works properly, we recommend the following settings:
```sql
SET @@collation_server = 'utf8mb4_bin';
SET @@collation_connection='utf8mb4_bin';
```
View File
@@ -0,0 +1,26 @@
---
slug: /spatial-data-type-overview
---
# Overview of spatial data types
The Geographic Information System (GIS) feature of seekdb includes the following spatial data types:
* `GEOMETRY`
* `POINT`
* `LINESTRING`
* `POLYGON`
* `MULTIPOINT`
* `MULTILINESTRING`
* `MULTIPOLYGON`
* `GEOMETRYCOLLECTION`
Among these, `POINT`, `LINESTRING`, and `POLYGON` are the three most fundamental types, used to store individual spatial data. They respectively extend into three collection types: `MULTIPOINT`, `MULTILINESTRING`, and `MULTIPOLYGON`, which are used to store collections of spatial data but can only represent collections of their respective specified base types. `GEOMETRY` is an abstract type that can represent any base type, and `GEOMETRYCOLLECTION` can be a collection of any `GEOMETRY` types.
View File
@@ -0,0 +1,39 @@
---
slug: /spacial-reference-system
---
# Spatial reference systems
A spatial reference system (SRS) for spatial data is a coordinate-based system for defining geographic locations. The current version of seekdb only supports the default SRS provided by the system.
Spatial reference systems generally include the following types:
* Projected SRS: A projected SRS is a projection of the Earth onto a plane, essentially a flat map. The coordinate system on this plane is a Cartesian coordinate system that uses units of length (meters, feet, etc.) rather than longitude and latitude.
* Geographic SRS: A geographic SRS is a non-projected SRS that represents longitude-latitude (or latitude-longitude) coordinates on an ellipsoid, expressed in angular units.
Additionally, there is an infinitely flat Cartesian plane represented by `SRID 0`, whose axes have no assigned units. Unlike a projected SRS, it is an abstract plane with no geographic reference and does not necessarily represent the Earth. `SRID 0` is the default `SRID` for spatial data.
SRS content can be obtained through the `INFORMATION_SCHEMA ST_SPATIAL_REFERENCE_SYSTEMS` table, as shown in the following example:
```sql
obclient> SELECT * FROM INFORMATION_SCHEMA.ST_SPATIAL_REFERENCE_SYSTEMS
WHERE SRS_ID = 4326\G
*************************** 1. row ***************************
SRS_NAME: WGS 84
SRS_ID: 4326
ORGANIZATION: EPSG
ORGANIZATION_COORDSYS_ID: 4326
DEFINITION: GEOGCS["WGS 84",DATUM["World Geodetic System 1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.017453292519943278,AUTHORITY["EPSG","9122"]],AXIS["Lat",NORTH],AXIS["Lon",EAST],AUTHORITY["EPSG","4326"]]
DESCRIPTION: NULL
1 row in set
```
The above example describes the SRS used by GPS systems, with the name (SRS_NAME) WGS 84 and ID (SRS_ID) 4326.
The SRS definition in the `DEFINITION` column is a `WKT` value. WKT is defined based on Extended Backus Naur Form (EBNF). WKT can be used both as a data format (referred to as WKT-Data in this document) and for SRS definitions in GIS.
The `SRS_ID` value represents the same type of value as the `SRID` of a geometry value, or is passed as an `SRID` parameter to spatial functions. `SRID 0` (unitless Cartesian plane) is a special, valid spatial reference system ID that can be used for any spatial data calculations that depend on `SRID` values.
For calculations involving multiple geometry values, all values must have the same SRID; otherwise, an error will occur.
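For example, both arguments below carry SRID 4326, so the calculation succeeds; mixing SRID 0 and SRID 4326 in the same call would raise an error. A minimal sketch, assuming MySQL-compatible `ST_Distance()` behavior:
```sql
-- Both geometries use SRID 4326 (WGS 84), so the SRIDs match.
SELECT ST_Distance(
    ST_GeomFromText('POINT(0 0)', 4326),
    ST_GeomFromText('POINT(0 1)', 4326)
) AS distance;
```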
View File
@@ -0,0 +1,45 @@
---
slug: /create-spatial-columns
---
# Create a spatial column
seekdb allows you to create a spatial column using the `CREATE TABLE` or `ALTER TABLE` statement.
To create a table with spatial columns using the `CREATE TABLE` statement, see the following syntax example:
```sql
CREATE TABLE geom (g GEOMETRY);
```
To add or remove spatial columns in an existing table using the `ALTER TABLE` statement, see the following syntax example:
```sql
ALTER TABLE geom ADD pt POINT;
ALTER TABLE geom DROP pt;
```
Examples:
```sql
obclient> CREATE TABLE geom (
p POINT SRID 0,
g GEOMETRY NOT NULL SRID 4326
);
Query OK, 0 rows affected
```
The following constraints apply when creating spatial columns:
* You can explicitly specify an SRID when defining a spatial column. If no SRID is defined on the column, the optimizer will not select the spatial index during queries, but index records will still be generated during insert/update operations.
* A spatial index can be defined on a spatial column only after specifying the `NOT NULL` constraint and an SRID. In other words, only columns with a defined SRID can use spatial indexes.
* Once an SRID is defined on a spatial column, attempting to insert values with a different SRID will result in an error.
The following constraints apply to `SRID`:
* You must explicitly specify `SRID` for a spatial column.
* All objects in the column must have the same SRID.
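For example, with the `geom` table above, an insert succeeds only when the value's SRID matches the column definition (a minimal sketch):
```sql
-- Succeeds: the SRID matches the column definition (4326).
INSERT INTO geom(g) VALUES (ST_GeomFromText('POINT(10 20)', 4326));

-- Fails with an error: SRID 0 does not match the column's SRID 4326.
INSERT INTO geom(g) VALUES (ST_GeomFromText('POINT(10 20)', 0));
```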
View File
@@ -0,0 +1,71 @@
---
slug: /create-spatial-indexes
---
# Create a spatial index
seekdb allows you to create a spatial index using the `SPATIAL` keyword. When creating a table, the spatial index column must be declared as `NOT NULL`. Spatial indexes can be created on stored (STORED) generated columns, but not on virtual (VIRTUAL) generated columns.
## Constraints
* The column definition for creating a spatial index must include the `NOT NULL` constraint.
* The column with a spatial index must have an SRID defined. Otherwise, the spatial index on this column will not take effect during queries.
* If you create a spatial index on a STORED generated column, you must explicitly specify the `STORED` keyword in the DDL when creating the column. If neither the `VIRTUAL` nor `STORED` keyword is specified when creating a generated column, a VIRTUAL generated column is created by default.
* After an index is created, comparisons use the coordinate system corresponding to the SRID defined in the column. Spatial indexes store the Minimum Bounding Rectangle (MBR) of geometric objects, and the comparison method for MBRs also depends on the SRID.
## Preparations
Before using the GIS feature, you need to configure GIS metadata. After connecting to the server, execute the following command to import the `default_srs_data_mysql.sql` file into the database:
```sql
-- module specifies the module to import.
-- infile specifies the relative path of the SQL file to import.
ALTER SYSTEM LOAD MODULE DATA module=gis infile = 'etc/default_srs_data_mysql.sql';
```
<!-- For more information about the syntax, see [LOAD MODULE DATA](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980607). -->
The following result indicates that the data file was successfully imported:
```shell
Query OK, 0 rows affected
```
## Examples
The following examples show how to create a spatial index on a regular column:
* Using `CREATE TABLE`:
```sql
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326, SPATIAL INDEX(g));
```
* Using `ALTER TABLE`:
```sql
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326);
ALTER TABLE geom ADD SPATIAL INDEX(g);
```
* Using `CREATE INDEX`:
```sql
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326);
CREATE SPATIAL INDEX g ON geom (g);
```
The following examples show how to drop a spatial index:
* Using `ALTER TABLE`:
```sql
ALTER TABLE geom DROP INDEX g;
```
* Using `DROP INDEX`:
```sql
DROP INDEX g ON geom;
```
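Once a spatial index exists, spatial predicates on the indexed column can use it. A hedged sketch of a typical lookup (the polygon literal is illustrative):
```sql
-- Count the stored geometries that fall within the given polygon.
SELECT COUNT(*)
FROM geom
WHERE ST_Within(g, ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))', 4326));
```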
View File
@@ -0,0 +1,275 @@
---
slug: /spatial-data-format
---
# Spatial data formats
seekdb supports two standard spatial data formats for representing geometric objects in queries:
* Well-Known Text (WKT)
* Well-Known Binary (WKB)
## WKT
WKT is defined based on Extended Backus-Naur Form (EBNF). WKT can be used both as a data format (referred to as WKT-Data in this document) and for defining spatial reference systems (SRS) in Geographic Information System (GIS) (referred to as WKT-SRS in this document).
### Point
A point does not use commas as separators. Example format:
```sql
POINT(15 20)
```
The following example uses `ST_X()` to extract the `X` coordinate from a point object. The first example directly generates the object using the `Point()` function. The second example uses the WKT representation converted to point through `ST_GeomFromText()`.
```sql
obclient> SELECT ST_X(Point(15, 20));
+---------------------+
| ST_X(Point(15, 20)) |
+---------------------+
| 15 |
+---------------------+
1 row in set
obclient> SELECT ST_X(ST_GeomFromText('POINT(15 20)'));
+---------------------------------------+
| ST_X(ST_GeomFromText('POINT(15 20)')) |
+---------------------------------------+
| 15 |
+---------------------------------------+
1 row in set
```
### Line
A line consists of multiple points separated by commas. Example format:
```sql
LINESTRING(0 0, 10 10, 20 25, 50 60)
```
### Polygon
A polygon consists of at least one exterior ring (closed line) and any number (can be 0) of interior rings (closed lines). Example format:
```sql
POLYGON((0 0,10 0,10 10,0 10,0 0),(5 5,7 5,7 7,5 7, 5 5))
```
### MultiPoint
A MultiPoint consists of multiple points, similar to a line but with different semantics. Multiple connected points form a line, while discrete multiple points form a MultiPoint. Example format:
```sql
MULTIPOINT(0 0, 20 20, 60 60)
```
In the functions `ST_MPointFromText()` and `ST_GeomFromText()`, it is also valid to enclose each point in a MultiPoint in parentheses. Example format:
```sql
ST_MPointFromText('MULTIPOINT (1 1, 2 2, 3 3)')
ST_MPointFromText('MULTIPOINT ((1 1), (2 2), (3 3))')
```
### MultiLineString
A MultiLineString is a collection of multiple lines. Example format:
```sql
MULTILINESTRING((10 10, 20 20), (15 15, 30 15))
```
### MultiPolygon
A MultiPolygon is a collection of multiple polygons. Example format:
```sql
MULTIPOLYGON(((0 0,10 0,10 10,0 10,0 0)),((5 5,7 5,7 7,5 7, 5 5)))
```
### GeometryCollection
A GeometryCollection can be a collection of multiple basic types and collection types.
```sql
GEOMETRYCOLLECTION(POINT(10 10), POINT(30 30), LINESTRING(15 15, 20 20))
```
## WKB
WKB is developed based on the OpenGIS specification and supports seven types (Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection) with corresponding format definitions.
### Point
Using `POINT(1 -1)` as an example, the format definition is shown in the following table.
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 01000000 |
| **X coordinate** | 8 bytes | double-precision | 000000000000F03F |
| **Y coordinate** | 8 bytes | double-precision | 000000000000F0BF |
### LineString
Using `LINESTRING(1 -1, -1 1)` as an example, the format definition is shown in the following table. `Num points` must be greater than or equal to 2.
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 02000000 |
| **Num points** | 4 bytes | unsigned int | 02000000 |
| **X coordinate** | 8 bytes | double-precision | 000000000000F03F |
| **Y coordinate** | 8 bytes | double-precision | 000000000000F0BF |
| **X coordinate** | 8 bytes | double-precision | 000000000000F0BF |
| **Y coordinate** | 8 bytes | double-precision | 000000000000F03F |
### Polygon
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 03000000 |
| **Num rings** | 4 bytes | unsigned int | Greater than or equal to 1 |
| **repeat ring** | - |- | - |
### MultiPoint
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 04000000 |
| **Num points** | 4 bytes | unsigned int | Num points >= 1 |
| **repeat POINT** | - |- | - |
### MultiLineString
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 05000000 |
| **Num linestrings** | 4 bytes | unsigned int | Greater than or equal to 1 |
| **repeat LINESTRING** | - | - | - |
### MultiPolygon
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 06000000 |
| **Num polygons** | 4 bytes | unsigned int | Greater than or equal to 1 |
| **repeat POLYGON** | - | - | - |
### GeometryCollection
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 07000000 |
| **Num wkbs** | 4 bytes | unsigned int | - |
| **repeat WKB** | - | - | - |
>Note:
>
>* Only GeometryCollection can be empty, indicating that 0 elements are stored. All other types cannot be empty.
>* When `LENGTH()` is applied to a GIS object, it returns the length of the stored binary data.
```sql
obclient [test]> SET @g = ST_GeomFromText('POINT(1 -1)');
Query OK, 0 rows affected
obclient [test]> SELECT LENGTH(@g);
+------------+
| LENGTH(@g) |
+------------+
| 25 |
+------------+
1 row in set
obclient [test]> SELECT HEX(@g);
+----------------------------------------------------+
| HEX(@g) |
+----------------------------------------------------+
| 000000000101000000000000000000F03F000000000000F0BF |
+----------------------------------------------------+
1 row in set
```
## Syntax and geometric validity
### Syntax validity
Syntax validity must satisfy the following conditions:
- A linestring must have at least two points.
- A polygon must have at least one ring.
- A polygon must be closed (the first and last points are the same).
- A polygon's ring must have at least four points (the smallest polygon is a triangle, where the first and last points are the same).
- Except for GeometryCollection, other collection types cannot be empty.
### Geometric validity
Geometric validity must satisfy the following conditions:
- A polygon cannot intersect with itself.
- The exterior ring of a Polygon must be outside the interior rings.
- Multipolygons cannot contain overlapping polygons.
You can explicitly check the geometric validity of a geometry object using the ST_IsValid() function.
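For example, assuming MySQL-compatible `ST_IsValid()` semantics:
```sql
-- A well-formed rectangle: syntactically and geometrically valid, returns 1.
SELECT ST_IsValid(ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'));

-- A self-intersecting "bowtie" ring: syntactically valid but geometrically invalid, returns 0.
SELECT ST_IsValid(ST_GeomFromText('POLYGON((0 0, 10 10, 10 0, 0 10, 0 0))'));
```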
## GIS Examples
### Insert data
```sql
-- Both the conversion function and the WKT literal appear directly in the SQL statement.
INSERT INTO geom VALUES (ST_GeomFromText('POINT(1 1)'));
-- The WKT literal is provided as a parameter.
SET @g = 'POINT(1 1)';
INSERT INTO geom VALUES (ST_GeomFromText(@g));
-- The conversion expression is embedded in the parameter.
SET @g = ST_GeomFromText('POINT(1 1)');
INSERT INTO geom VALUES (@g);
-- The unified conversion function handles any geometry type.
SET @g = 'LINESTRING(0 0,1 1,2 2)';
INSERT INTO geom VALUES (ST_GeomFromText(@g));
SET @g = 'POLYGON((0 0,10 0,10 10,0 10,0 0),(5 5,7 5,7 7,5 7, 5 5))';
INSERT INTO geom VALUES (ST_GeomFromText(@g));
SET @g ='GEOMETRYCOLLECTION(POINT(1 1),LINESTRING(0 0,1 1,2 2,3 3,4 4))';
INSERT INTO geom VALUES (ST_GeomFromText(@g));
-- Type-specific conversion functions can be used instead.
SET @g = 'POINT(1 1)';
INSERT INTO geom VALUES (ST_PointFromText(@g));
SET @g = 'LINESTRING(0 0,1 1,2 2)';
INSERT INTO geom VALUES (ST_LineStringFromText(@g));
SET @g = 'POLYGON((0 0,10 0,10 10,0 10,0 0),(5 5,7 5,7 7,5 7, 5 5))';
INSERT INTO geom VALUES (ST_PolygonFromText(@g));
SET @g =
'GEOMETRYCOLLECTION(POINT(1 1),LINESTRING(0 0,1 1,2 2,3 3,4 4))';
INSERT INTO geom VALUES (ST_GeomCollFromText(@g));
-- Data can also be inserted directly as WKB.
INSERT INTO geom VALUES(ST_GeomFromWKB(X'0101000000000000000000F03F000000000000F03F'));
```
### Query data
```sql
-- Query data and convert it to WKT format for output.
SELECT ST_AsText(g) FROM geom;
-- Query data and convert it to WKB format for output.
SELECT ST_AsBinary(g) FROM geom;
```
View File
@@ -0,0 +1,46 @@
---
slug: /char-and-varchar
---
# CHAR and VARCHAR
`CHAR` and `VARCHAR` types are similar, but differ in how they are stored and retrieved, their maximum length, and whether trailing spaces are preserved.
## CHAR
The declared length of the `CHAR` type is the maximum number of characters that can be stored. For example, `CHAR(30)` can contain up to 30 characters.
Syntax:
```sql
[NATIONAL] CHAR[(M)] [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `CHAR` becomes `BINARY`.
`CHAR` column length can be any value between 0 and 256. When storing `CHAR` values, they are right-padded with spaces to the specified length.
For `CHAR` columns, excess trailing spaces in inserted values are silently truncated regardless of the SQL mode. When retrieving `CHAR` values, trailing spaces are removed unless the `PAD_CHAR_TO_FULL_LENGTH` SQL mode is enabled.
## VARCHAR
The declared length `M` of the `VARCHAR` type is the maximum number of characters that can be stored. For example, `VARCHAR(50)` can contain up to 50 characters.
Syntax:
```sql
[NATIONAL] VARCHAR(M) [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `VARCHAR` becomes `VARBINARY`.
`VARCHAR` column length can be specified as any value between 0 and 262144.
Compared with `CHAR`, `VARCHAR` values are stored as a 1-byte or 2-byte length prefix plus the data. The length prefix indicates the number of bytes in the value. If the value does not exceed 255 bytes, the column uses one byte; if the value may exceed 255 bytes, it uses two bytes.
For `VARCHAR` columns, trailing spaces that exceed the column length are truncated before insertion and generate a warning, regardless of the SQL mode.
`VARCHAR` values are not padded when stored. According to standard SQL, trailing spaces are preserved during both storage and retrieval.
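The difference in trailing-space handling can be observed directly. A minimal sketch (the table `vc` is hypothetical):
```sql
CREATE TABLE vc (c CHAR(4), v VARCHAR(4));
INSERT INTO vc VALUES ('ab  ', 'ab  ');

-- CHAR drops trailing spaces on retrieval; VARCHAR preserves them.
SELECT CONCAT('(', c, ')'), CONCAT('(', v, ')') FROM vc;
-- Expected: (ab) and (ab  )
```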
Additionally, seekdb also supports the extended type `CHARACTER VARYING(m)`, but `VARCHAR(m)` is recommended.
View File
@@ -0,0 +1,64 @@
---
slug: /text
---
# TEXT types
The `TEXT` type is used to store all types of text data.
There are four text types: `TINYTEXT`, `TEXT`, `MEDIUMTEXT`, and `LONGTEXT`. They correspond to the four `BLOB` types and have the same maximum length and storage requirements.
`TEXT` values are treated as non-binary strings. They have a character set other than binary, and values are sorted and compared according to the collation rules of the character set.
When strict SQL mode is not enabled, if a value assigned to a `TEXT` column exceeds the column's maximum length, the portion that exceeds the length is truncated and a warning is generated. When using strict SQL mode, an error occurs (rather than a warning) if non-space characters are truncated, and the value insertion is prohibited. Regardless of the SQL mode, truncating excess trailing spaces from values inserted into `TEXT` columns always generates a warning.
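For example, in non-strict mode an oversized value is truncated with a warning, while strict mode rejects it. A sketch using a hypothetical table:
```sql
CREATE TABLE txt_demo (c TINYTEXT);

-- TINYTEXT holds at most 255 bytes: truncated with a warning in
-- non-strict mode, rejected with an error in strict mode.
INSERT INTO txt_demo VALUES (REPEAT('a', 300));
SELECT LENGTH(c) FROM txt_demo;  -- 255 in non-strict mode
```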
## TINYTEXT
`TINYTEXT` is a `TEXT` type with a maximum length of 255 bytes.
`TINYTEXT` syntax:
```sql
TINYTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
## TEXT
The maximum length of a `TEXT` column is 65,535 bytes.
An optional length `M` can be specified for the `TEXT` type. Syntax:
```sql
TEXT[(M)] [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
## MEDIUMTEXT
`MEDIUMTEXT` is a `TEXT` type with a maximum length of 16,777,215 bytes.
`MEDIUMTEXT` syntax:
```sql
MEDIUMTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
Additionally, seekdb also supports the extended type `LONG`, but `MEDIUMTEXT` is recommended.
## LONGTEXT
`LONGTEXT` is a `TEXT` type with a maximum length of 536,870,910 bytes. The effective maximum length of a `LONGTEXT` column also depends on the maximum packet size configured in the client/server protocol and available memory.
`LONGTEXT` syntax:
```sql
LONGTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
View File
@@ -0,0 +1,329 @@
---
slug: /full-text-index
---
# Full-text indexes
In seekdb, full-text indexes can be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types. Additionally, seekdb allows multiple full-text indexes to be created on the primary table, and multiple full-text indexes can also be created on the same column.
Full-text indexes can be created on both partitioned and non-partitioned tables, regardless of whether they have a primary key. The limitations for creating full-text indexes are as follows:
* Full-text indexes can only be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types.
* The current version only supports creating local (`LOCAL`) full-text indexes.
* The `UNIQUE` keyword cannot be specified when creating a full-text index.
* If you want to create a full-text index involving multiple columns, you must ensure that these columns have the same character set.
By using these syntax rules and guidelines, seekdb's full-text indexing functionality provides efficient search and retrieval capabilities for text data.
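As a quick end-to-end illustration, a full-text index is typically queried with `MATCH ... AGAINST`. The following sketch sets up an `articles` table consistent with the DML examples below (the column definitions are assumptions):
```sql
CREATE TABLE articles (
    title   VARCHAR(200) PRIMARY KEY,
    context TEXT,
    FULLTEXT INDEX ft_idx(title, context)
);

INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index quick start');

-- Retrieve rows whose indexed columns match the search terms.
SELECT * FROM articles WHERE MATCH(title, context) AGAINST('fulltext search');
```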
## DML operations
For tables with full-text indexes, complex DML operations are supported, including `INSERT INTO ON DUPLICATE KEY`, `REPLACE INTO`, multi-table updates/deletes, and updatable views.
**Examples:**
* `INSERT INTO ON DUPLICATE KEY`:
```sql
INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key')
ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
```
* `REPLACE INTO`:
```sql
REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
```
* Multi-table updates and deletes.
1. Create table `tbl1`.
```sql
CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
```
2. Create table `tbl2`.
```sql
CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
```
3. Perform an update (`UPDATE`) operation on multiple tables.
```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a
SET tbl1.b = 'dddd', tbl2.b = 'eeee';
```
```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
```
```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
```
4. Perform a delete (`DELETE`) operation on multiple tables.
```sql
DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
```
```sql
DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
```
* DML operations on updatable views.
1. Create view `fts_view`.
```sql
CREATE VIEW fts_view AS SELECT * FROM tbl1;
```
2. Perform an `INSERT` operation on the updatable view.
```sql
INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
```
3. Perform an `UPDATE` operation on the updatable view.
```sql
UPDATE fts_view SET b = 'dddd';
```
```sql
UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a
SET fts_view.b = 'dddd', tbl2.b = 'eeee';
```
4. Perform a `DELETE` operation on the updatable view.
```sql
DELETE FROM fts_view WHERE b = 'dddd';
```
```sql
DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
```
## Full-text index tokenizer
seekdb's full-text index functionality supports multiple built-in tokenizers, helping users select the optimal text tokenization strategy based on their business scenarios. The default tokenizer is **Space**, while other tokenizers need to be explicitly specified using the `WITH PARSER` parameter.
**List of tokenizers**:
* **Space tokenizer**
* **Basic English tokenizer**
* **IK tokenizer**
* **Ngram tokenizer**
* **Jieba tokenizer**
**Configuration example**:
When creating or modifying a table, specify the tokenizer type for the full-text index by setting the `WITH PARSER tokenizer_option` parameter in the `CREATE TABLE/ALTER TABLE` statement.
```sql
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
    FULLTEXT INDEX full_idx1_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3));
-- For an existing table, add a full-text index with the desired tokenizer.
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx2_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3); -- Ngram example
```
### Space tokenizer (default)
**Concepts**:
* This tokenizer splits text using spaces, punctuation marks (such as commas, periods), or non-alphanumeric characters (except underscore `_`) as delimiters.
* The tokenization results include only valid tokens with lengths between `min_token_size` (default 3) and `max_token_size` (default 84).
* A consecutive run of Chinese characters is treated as a single token.
**Applicable scenarios**:
* Languages separated by spaces such as English (for example "apple watch series 9").
* Chinese text with manually added delimiters (for example, "南京 长江大桥").
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM一平方公里也很小 hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM一平方公里也很小 hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
|"详见www", "一平方公里也很小", "xxx", "南京市长江大桥有1千米长", "邮箱xx", "word", "hello”, "h_name" |
+-------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* Spaces, commas, periods, and other symbols serve as delimiters, and consecutive Chinese characters are treated as words.
### Basic English (Beng) tokenizer
**Concepts**:
* Similar to the Space tokenizer, but treats underscores `_` as separators instead of preserving them.
* Suitable for separating English phrases, but has limited effectiveness in splitting terms without spaces (such as "iPhone15").
**Applicable scenarios**:
* Basic retrieval of English documents (such as logs, comments).
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng') |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| ["user", "log", "system", "admin", "contact", "server", "active", "visit", "status", "entry", "example", "name", "time", "response", "150ms"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* Tokens are split at underscores `_` (for example, `server_status` -> `server`, `status`, and `user_name` -> `user`, `name`). The core difference from the Space tokenizer lies in how underscores `_` are handled.
### Ngram tokenizer
**Concepts**:
* **Fixed n-value tokenization**: By default, `n=2`. This tokenizer splits consecutive non-delimiter characters into subsequences of length `n`.
* Delimiter rules follow the Space tokenizer (preserving `_`, digits, and letters).
* **Does not support length limit parameters**; it outputs all possible tokens of length `n`.
**Applicable scenarios**:
* Fuzzy matching for short text (such as user IDs, order numbers).
* Scenarios requiring fixed-length feature extraction (such as password policy analysis).
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram');
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram') |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["ab", "hn", "am", "r_", "em", "le", "po", "ma", "ou", "xy", "jo", "pl", "_d", "89", "yz", "xa", "ck", "in", "se", "tr", "oh", "12", "d1", "il", "oe", "45", "un", "ac", "co", "ex", "us", "23", "34", "or", "er", "mp", "up", "de", "su", "rt", "pp", "n_", "nt", "ki", "rd", "_a", "bc", "ng", "cc", "od", "om", "78", "ra", "ai", "do", "id"] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* With the default setting `n=2`, this tokenizer outputs all consecutive 2-character tokens, including overlapping ones (for example, `ORD12345` -> `OR`, `RD`, `D1`, `12`, `23`, `34`, `45`; `user_account` -> `us`, `se`, `er`, `r_`, `_a`, `ac`, `cc`, `co`, `ou`, `un`, `nt`).
### Ngram2 tokenizer
**Concepts**:
* Supports a **dynamic n-value range**: sets the token length range through the `min_ngram_size` and `max_ngram_size` parameters.
* Suitable for scenarios requiring multi-length token coverage.
**Applicable scenarios**: Scenarios that require multiple fixed-length tokens simultaneously.
:::info
When using the ngram2 tokenizer, be aware of its memory consumption: a large range between `min_ngram_size` and `max_ngram_size` generates a large number of token combinations, which may lead to excessive resource consumption.
:::
**Tokenization result**:
```sql
OceanBase [(root@oceanbase)]> select tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["io", "lo", "r_lo", "_ses", "_l", "r_", "ss", "user", "ses", "_s", "ogin", "sion", "on", "ess", "20", "logi", "er_", "on_", "use", "essi", "in", "se", "sio", "log", "202", "gin_", "_2", "ssi", "ogi", "us", "n_se", "r_l", "er", "024", "es", "n_2", "og", "_lo", "n_", "_log", "2024", "n_20", "gi", "er_l", "ser", "24", "ssio", "n_s", "gin", "in_", "_se", "02", "_20", "si", "sess", "on_2", "ion_", "ser_", "ion", "_202", "in_s"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* This tokenizer outputs all consecutive subsequences with lengths between 2-4 characters, with overlapping tokens allowed (for example, `user_login_session_2024` generates tokens like `us`, `use`, `user`, `se`, `ser`, `ser_`, `er_`, `er_l`, `r_lo`, `log`, `logi`, `ogin`, etc.).
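To use ngram2 in a full-text index rather than in ad hoc `tokenize()` calls, the size parameters would be supplied through `PARSER_PROPERTIES`, by analogy with the `ngram_token_size` example above. A sketch, assuming `PARSER_PROPERTIES` accepts these two keys:
```sql
-- Sketch only: assumes min_ngram_size/max_ngram_size are valid
-- PARSER_PROPERTIES keys, by analogy with ngram_token_size.
CREATE TABLE tbl3(id INT, code VARCHAR(64),
    FULLTEXT INDEX full_idx1_tbl3(code)
    WITH PARSER NGRAM2
    PARSER_PROPERTIES=(min_ngram_size=2, max_ngram_size=4));
```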
### IK tokenizer
**Concepts**:
* A Chinese tokenizer based on the open-source IK Analyzer tool, supporting two modes:
* **Smart mode**: Prioritizes outputting longer words, reducing the number of splits (for example, "南京市" is not split into "南京" and "市").
* **Max Word mode**: Outputs all possible shorter words (for example, "南京市" is split into "南京" and "市").
* Automatically recognizes English words, email addresses, URLs (the scheme prefix such as `http://` is tokenized separately), IP addresses, and other formats.
**Applicable scenarios**: Chinese word segmentation
**Business scenarios**:
* E-commerce product description search (for example, precise matching for "华为Mate60").
* Social media content analysis (for example, keyword extraction from user comments).
* **Smart mode**: Ensures that each character belongs to only one word with no overlap, and guarantees that individual words are as long as possible while minimizing the total number of words. Attempts to combine numerals and quantifiers into a single token.
```sql
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["邮箱", "hello_word", "192.168.1.1", "hello-word", "长江大桥", "www.baidu.com", "www.xxx.com", "xx@ob.com", "长", "http", "1千米", "详见", "南京市", "有"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
* **Max Word mode**: Allows the same character to appear in multiple tokens, producing as many candidate words as possible.
```sql
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见www.xxx.com, 邮箱xx@ob.com", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见www.xxx.com, 邮箱xx@ob.com", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["千米", "长江大桥", "市", "千", "南京市", "南京", "米", "xx", "www.xxx.com", "长", "www", "xx@ob.com", "长江", "ob", "xxx", "com", "详见", "1", "有", "大桥", "邮箱"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
### jieba tokenizer
**Concept**: A tokenizer based on the open-source `jieba` tool from the Python ecosystem, supporting precise mode, full mode, and search engine mode.
**Features**:
* **Precise mode**: Strictly segments words according to the dictionary (for example, "不能" is not segmented into "不" and "能").
* **Full mode**: Lists all possible segmentation combinations.
* **Search engine mode**: Balances precision and recall rate (for example, "南京市长江大桥" is segmented into "南京", "市长", and "长江大桥").
* Supports custom dictionaries and new word discovery, and is compatible with multiple languages (Chinese, English, Japanese, etc.).
**Applicable scenarios**:
* Medical/technology domain terminology analysis (e.g., precise segmentation of "人工智能").
* Multi-language mixed text processing (e.g., social media content with mixed Chinese and English).
To use the jieba tokenizer plugin, you need to install it yourself. For instructions on how to build and install it, see [Tokenizer plugin](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002414801).
:::tip
The current tokenizer plugin is an experimental feature and is not recommended for use in production environments.
:::
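For reference, a basic invocation would look like the following sketch. This assumes the plugin has been installed as described above and that `jieba` is the parser name accepted by `tokenize()`; the argument for switching segmentation modes is not shown here.
```sql
-- Sketch only: assumes the jieba plugin is installed and registered as 'jieba'.
select tokenize("南京市长江大桥", 'jieba');
```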
### Tokenizer selection strategy
| **Business scenario** | **Recommended tokenizer** | **Reason** |
| --- | --- | --- |
| Search for English product titles | **Space** or **Basic English** | Simple and efficient, aligns with English tokenization conventions. |
| Retrieval of Chinese product descriptions | **IK tokenizer** | Accurately recognizes Chinese terminology, supports custom dictionaries. |
| Fuzzy matching of logs (such as error codes) | **Ngram tokenizer** | No dictionary required, covers fuzzy query needs for text without spaces. |
| Keyword extraction from technology papers | **jieba tokenizer** | Supports new word discovery and complex mode switching. |
## References
For more information about creating full-text indexes, see the **Create full-text indexes** section in [Create an index](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971660).

View File

@@ -0,0 +1,60 @@
---
slug: /pyseekdb-sdk-get-started
---
# Get started
## pyseekdb
pyseekdb is a Python client provided by OceanBase. It allows you to connect to seekdb in embedded mode, to seekdb in server mode, or to OceanBase Database.
:::tip
OceanBase Database is a fully self-developed, enterprise-level, native distributed database provided by OceanBase. It achieves financial-grade high availability on ordinary hardware and sets a new standard for automatic, lossless disaster recovery across cities with the "five IDCs across three regions" architecture. It also sets a new benchmark in the TPC-C standard test, with a single cluster scale exceeding 1,500 nodes. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.
:::
pyseekdb is supported on Linux, macOS, and Windows. The supported database connection modes vary by operating system. For more information, see the table below.
| System | Embedded seekdb | Server mode seekdb | Server mode OceanBase Database |
|----|---|---|---|
| Linux | Supported | Supported | Supported |
| macOS | Not supported | Supported | Supported |
| Windows | Not supported | Supported | Supported |
On Linux, installing this client also installs seekdb in embedded mode, so you can connect to it directly and perform operations such as creating a database. Alternatively, you can connect to a deployed seekdb or OceanBase Database instance in client/server mode.
## Install pyseekdb
### Prerequisites
Make sure that your environment meets the following requirements:
* Operating system: Linux (glibc >= 2.28), macOS, or Windows
* Python version: Python 3.11 or later
* System architecture: x86_64 or aarch64
### Procedure
Use pip to install pyseekdb. pip automatically selects the build that matches your Python version and platform.
```shell
pip install pyseekdb
```
If your pip version is outdated, upgrade it before installation.
```bash
pip install --upgrade pip
```
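After installation, you can run a quick import check to confirm that the package is available (this does not start a database):
```shell
python -c "import pyseekdb; print('pyseekdb imported successfully')"
```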
## What to do next
* After installing pyseekdb, you can connect to seekdb to perform operations. For information about the API interfaces supported by pyseekdb, see [API Reference](../50.apis/10.api-overview.md).
* You can also refer to the SDK samples provided to quickly experience pyseekdb.
* [Simple sample](50.sdk-samples/10.pyseekdb-simple-sample.md)
* [Complete sample](50.sdk-samples/50.pyseekdb-complete-sample.md)
* [Hybrid search sample](50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)

View File

@@ -0,0 +1,130 @@
---
slug: /pyseekdb-simple-sample
---
# Simple Example
This example demonstrates the basic usage of embedding functions with seekdb in embedded mode.
1. Connect to seekdb.
2. Create a collection with Embedding Functions.
3. Add data using documents (vectors will be automatically generated).
4. Query using texts (vectors will be automatically generated).
5. Print the query results.
## Prerequisites
This example uses seekdb in embedded mode. Before running it, make sure that you have deployed seekdb in embedded mode.
For information about how to deploy seekdb in embedded mode, see [Embedded Mode](../../../../400.guides/400.deploy/600.python-seekdb.md).
## Example
```python
import pyseekdb
# ==================== Step 1: Create Client Connection ====================
# You can use embedded mode, server mode, or OceanBase mode
# Embedded mode (local SeekDB)
client = pyseekdb.Client()
# Alternative: Server mode (connecting to remote SeekDB server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# database="test",
# user="root",
# password=""
# )
# Alternative: Remote server mode (OceanBase Server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# tenant="test", # OceanBase default tenant
# database="test",
# user="root",
# password=""
# )
# ==================== Step 2: Create a Collection with Embedding Function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"
# Create collection with default embedding function
# The embedding function will automatically convert documents to embeddings
collection = client.create_collection(
name=collection_name,
)
print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")
# ==================== Step 3: Add Data to Collection ====================
# With embedding function, you can add documents directly without providing embeddings
# The embedding function will automatically generate embeddings from documents
documents = [
"Machine learning is a subset of artificial intelligence",
"Python is a popular programming language",
"Vector databases enable semantic search",
"Neural networks are inspired by the human brain",
"Natural language processing helps computers understand text"
]
ids = ["id1", "id2", "id3", "id4", "id5"]
# Add data with documents only - embeddings will be auto-generated by embedding function
collection.add(
ids=ids,
documents=documents, # embeddings will be automatically generated
metadatas=[
{"category": "AI", "index": 0},
{"category": "Programming", "index": 1},
{"category": "Database", "index": 2},
{"category": "AI", "index": 3},
{"category": "NLP", "index": 4}
]
)
print(f"\nAdded {len(documents)} documents to collection")
print("Note: Embeddings were automatically generated from documents using the embedding function")
# ==================== Step 4: Query the Collection ====================
# With embedding function, you can query using text directly
# The embedding function will automatically convert query text to query vector
# Query using text - query vector will be auto-generated by embedding function
query_text = "artificial intelligence and machine learning"
results = collection.query(
query_texts=query_text, # Query text - will be embedded automatically
n_results=3 # Return top 3 most similar documents
)
print(f"\nQuery: '{query_text}'")
print(f"Query results: {len(results['ids'][0])} items found")
# ==================== Step 5: Print Query Results ====================
for i in range(len(results['ids'][0])):
print(f"\nResult {i+1}:")
print(f" ID: {results['ids'][0][i]}")
print(f" Distance: {results['distances'][0][i]:.4f}")
if results.get('documents'):
print(f" Document: {results['documents'][0][i]}")
if results.get('metadatas'):
print(f" Metadata: {results['metadatas'][0][i]}")
# ==================== Step 6: Cleanup ====================
# Delete the collection
client.delete_collection(collection_name)
print(f"\nDeleted collection '{collection_name}'")
```
## References
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Complete Example](50.pyseekdb-complete-sample.md)
* [Hybrid Search Example](100.pyseekdb-hybrid-search-sample.md)

View File

@@ -0,0 +1,350 @@
---
slug: /pyseekdb-hybrid-search-sample
---
# Hybrid search example
This example demonstrates the advantages of `hybrid_search()` over `query()`.
The main advantages of `hybrid_search()` are:
* Supports full-text search and vector similarity search simultaneously.
* Allows separate filtering conditions for full-text search and vector search.
* Combines the ranked results of both searches using the Reciprocal Rank Fusion (RRF) algorithm to improve relevance (see the sketch below).
* Handles complex scenarios that `query()` cannot handle.
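The sketch below illustrates how RRF combines two ranked lists. It is illustrative only: the constant `k = 60` is a conventional choice, and seekdb's internal implementation and parameters are not specified here.
```python
# Illustrative Reciprocal Rank Fusion (RRF); not seekdb's internal code.
# Each document's fused score is the sum of 1 / (k + rank) over every
# ranked list it appears in, so items ranked well by both searches win.
def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fulltext_hits = ["doc_2", "doc_5", "doc_1"]  # ranked by keyword relevance
vector_hits = ["doc_1", "doc_2", "doc_9"]    # ranked by vector similarity
print(rrf_fuse([fulltext_hits, vector_hits]))  # doc_2 and doc_1 come out on top
```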
## Example
```python
import pyseekdb
# Setup
client = pyseekdb.Client()
collection = client.get_or_create_collection(
name="hybrid_search_demo"
)
# Sample data
documents = [
"Machine learning is revolutionizing artificial intelligence and data science",
"Python programming language is essential for machine learning developers",
"Deep learning neural networks enable advanced AI applications",
"Data science combines statistics, programming, and domain expertise",
"Natural language processing uses machine learning to understand text",
"Computer vision algorithms process images using deep learning techniques",
"Reinforcement learning trains agents through reward-based feedback",
"Python libraries like TensorFlow and PyTorch simplify machine learning",
"Artificial intelligence systems can learn from large datasets",
"Neural networks mimic the structure of biological brain connections"
]
metadatas = [
{"category": "AI", "topic": "machine learning", "year": 2023, "popularity": 95},
{"category": "Programming", "topic": "python", "year": 2023, "popularity": 88},
{"category": "AI", "topic": "deep learning", "year": 2024, "popularity": 92},
{"category": "Data Science", "topic": "data analysis", "year": 2023, "popularity": 85},
{"category": "AI", "topic": "nlp", "year": 2024, "popularity": 90},
{"category": "AI", "topic": "computer vision", "year": 2023, "popularity": 87},
{"category": "AI", "topic": "reinforcement learning", "year": 2024, "popularity": 89},
{"category": "Programming", "topic": "python", "year": 2023, "popularity": 91},
{"category": "AI", "topic": "general ai", "year": 2023, "popularity": 93},
{"category": "AI", "topic": "neural networks", "year": 2024, "popularity": 94}
]
ids = [f"doc_{i+1}" for i in range(len(documents))]
collection.add(ids=ids, documents=documents, metadatas=metadatas)
print("=" * 100)
print("SCENARIO 1: Keyword + Semantic Search")
print("=" * 100)
print("Goal: Find documents similar to 'AI research' AND containing 'machine learning'\n")
# query() approach
query_result1 = collection.query(
query_texts=["AI research"],
where_document={"$contains": "machine learning"},
n_results=5
)
# hybrid_search() approach
hybrid_result1 = collection.hybrid_search(
query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
knn={"query_texts": ["AI research"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result1['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result1['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" query() ranks 'Deep learning neural networks...' first because it's semantically similar to 'AI research',")
print(" but 'machine learning' is not its primary focus. hybrid_search() correctly prioritizes documents that")
print(" explicitly contain 'machine learning' (from full-text search) while also being semantically relevant")
print(" to 'AI research' (from vector search). The RRF fusion ensures documents matching both criteria rank higher.")
print("\n" + "=" * 100)
print("SCENARIO 2: Independent Filters for Different Search Types")
print("=" * 100)
print("Goal: Full-text='neural' (year=2024) + Vector='deep learning' (popularity>=90)\n")
# query() - same filter applies to both conditions
query_result2 = collection.query(
query_texts=["deep learning"],
where={"year": {"$eq": 2024}, "popularity": {"$gte": 90}},
where_document={"$contains": "neural"},
n_results=5
)
# hybrid_search() - different filters for each search type
hybrid_result2 = collection.hybrid_search(
query={"where_document": {"$contains": "neural"}, "where": {"year": {"$eq": 2024}}, "n_results": 10},
knn={"query_texts": ["deep learning"], "where": {"popularity": {"$gte": 90}}, "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results (same filter for both):")
for i, doc_id in enumerate(query_result2['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nhybrid_search() Results (independent filters):")
for i, doc_id in enumerate(hybrid_result2['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nAnalysis:")
print(" query() only returns 2 results because it requires documents to satisfy BOTH year=2024 AND popularity>=90")
print(" simultaneously. hybrid_search() returns 5 results by applying year=2024 filter to full-text search")
print(" and popularity>=90 filter to vector search independently, then fusing the results. This approach")
print(" captures more relevant documents that might satisfy one criterion strongly while meeting the other")
print("\n" + "=" * 100)
print("SCENARIO 3: Combining Multiple Search Strategies")
print("=" * 100)
print("Goal: Find documents about 'machine learning algorithms'\n")
# query() - vector search only
query_result3 = collection.query(
query_texts=["machine learning algorithms"],
n_results=5
)
# hybrid_search() - combines full-text and vector
hybrid_result3 = collection.hybrid_search(
query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
knn={"query_texts": ["machine learning algorithms"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results (vector similarity only):")
for i, doc_id in enumerate(query_result3['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results (full-text + vector fusion):")
for i, doc_id in enumerate(hybrid_result3['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" query() returns 'Artificial intelligence systems...' as the result, which doesn't explicitly")
print(" mention 'machine learning'. hybrid_search() combines full-text search (for 'machine learning')")
print(" with vector search (for semantic similarity to 'machine learning algorithms'), ensuring that")
print(" documents containing the exact keyword rank higher while still capturing semantically relevant content.")
print("\n" + "=" * 100)
print("SCENARIO 4: Complex Multi-Criteria Search")
print("=" * 100)
print("Goal: Full-text='learning' (category=AI) + Vector='artificial intelligence' (year>=2023)\n")
# query() - limited to single search with combined filters
query_result4 = collection.query(
query_texts=["artificial intelligence"],
where={"category": {"$eq": "AI"}, "year": {"$gte": 2023}},
where_document={"$contains": "learning"},
n_results=5
)
# hybrid_search() - separate criteria for each search type
hybrid_result4 = collection.hybrid_search(
query={"where_document": {"$contains": "learning"}, "where": {"category": {"$eq": "AI"}}, "n_results": 10},
knn={"query_texts": ["artificial intelligence"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result4['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result4['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nAnalysis:")
print(" While both methods return similar documents, hybrid_search() provides better ranking by prioritizing")
print(" documents that score highly in both full-text search (containing 'learning' with category=AI) and")
print(" vector search (semantically similar to 'artificial intelligence' with year>=2023). The RRF fusion")
print(" algorithm ensures that 'Deep learning neural networks...' ranks first because it strongly matches")
print(" both search criteria, whereas query() applies filters sequentially which may not optimize ranking.")
print("\n" + "=" * 100)
print("SCENARIO 5: Result Quality - RRF Fusion")
print("=" * 100)
print("Goal: Search for 'Python machine learning'\n")
# query() - single ranking
query_result5 = collection.query(
query_texts=["Python machine learning"],
n_results=5
)
# hybrid_search() - RRF fusion of multiple rankings
hybrid_result5 = collection.hybrid_search(
query={"where_document": {"$contains": "Python"}, "n_results": 10},
knn={"query_texts": ["Python machine learning"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results (single ranking):")
for i, doc_id in enumerate(query_result5['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results (RRF fusion):")
for i, doc_id in enumerate(hybrid_result5['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" Both methods return identical results in this case, but hybrid_search() achieves this through RRF")
print(" (Reciprocal Rank Fusion) which combines rankings from full-text search (for 'Python') and vector")
print(" search (for 'Python machine learning'). RRF provides more stable and robust ranking by considering")
print(" multiple signals, making it less sensitive to variations in individual search algorithms and ensuring")
print(" consistent high-quality results across different query formulations.")
print("\n" + "=" * 100)
print("SCENARIO 6: Different Filter Criteria for Each Search")
print("=" * 100)
print("Goal: Full-text='neural' (high popularity) + Vector='deep learning' (recent year)\n")
# query() - cannot separate filters for keyword vs semantic
query_result6 = collection.query(
query_texts=["deep learning"],
where={"popularity": {"$gte": 90}, "year": {"$gte": 2023}},
where_document={"$contains": "neural"},
n_results=5
)
# hybrid_search() - different filters for keyword search vs semantic search
hybrid_result6 = collection.hybrid_search(
query={"where_document": {"$contains": "neural"}, "where": {"popularity": {"$gte": 90}}, "n_results": 10},
knn={"query_texts": ["deep learning"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result6['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result6['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print(f" {metadatas[idx]}")
print("\nAnalysis:")
print(" query() only returns 2 results because it requires documents to satisfy BOTH popularity>=90 AND")
print(" year>=2023 simultaneously, along with containing 'neural' and being semantically similar to")
print(" 'deep learning'. hybrid_search() returns 5 results by applying popularity>=90 filter to full-text")
print(" search (for 'neural') and year>=2023 filter to vector search (for 'deep learning') independently.")
print(" The fusion then combines results from both searches, capturing documents that strongly match either")
print(" criterion while still being relevant to the overall query intent.")
print("\n" + "=" * 100)
print("SCENARIO 7: Partial Keyword Match + Semantic Similarity")
print("=" * 100)
print("Goal: Documents containing 'Python' + Semantically similar to 'data science'\n")
# query() - filter applied after vector search
query_result7 = collection.query(
query_texts=["data science"],
where_document={"$contains": "Python"},
n_results=5
)
# hybrid_search() - parallel searches then fusion
hybrid_result7 = collection.hybrid_search(
query={"where_document": {"$contains": "Python"}, "n_results": 10},
knn={"query_texts": ["data science"], "n_results": 10},
rank={"rrf": {}},
n_results=5
)
print("query() Results:")
for i, doc_id in enumerate(query_result7['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nhybrid_search() Results:")
for i, doc_id in enumerate(hybrid_result7['ids'][0]):
idx = ids.index(doc_id)
print(f" {i+1}. {documents[idx]}")
print("\nAnalysis:")
print(" query() only returns 2 results because it first performs vector search for 'data science', then")
print(" filters to documents containing 'Python', which severely limits the result set. hybrid_search()")
print(" returns 5 results by running full-text search (for 'Python') and vector search (for 'data science')")
print(" in parallel, then fusing the results. This captures documents that contain 'Python' (even if not")
print(" semantically closest to 'data science') and documents semantically similar to 'data science' (even")
print(" if they don't contain 'Python'), providing better recall and more comprehensive results.")
print("\n" + "=" * 100)
print("SUMMARY")
print("=" * 100)
print("""
query() limitations:
- Single search type (vector similarity)
- Filters applied after search (may miss relevant docs)
- Cannot combine full-text and vector search results
- Same filter criteria for all conditions
hybrid_search() advantages:
- Simultaneous full-text + vector search
- Independent filters for each search type
- Intelligent result fusion using RRF
- Better recall for complex queries
- Handles scenarios requiring both keyword and semantic matching
""")
```
## References
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Simple example](10.pyseekdb-simple-sample.md)
* [Complete example](50.pyseekdb-complete-sample.md)

View File

@@ -0,0 +1,440 @@
---
slug: /pyseekdb-complete-sample
---
# Complete Example
This example demonstrates the full capabilities of pyseekdb.
The example includes the following operations:
1. Connection, including all connection modes
2. Collection management
3. DML operations, including add, update, upsert, and delete
4. DQL operations, including query, get, and hybrid_search
5. Filter operators
6. Collection information methods
## Example
```python
import uuid
import random
import pyseekdb
# ============================================================================
# PART 1: CLIENT CONNECTION
# ============================================================================
# Option 1: Embedded mode (local SeekDB)
client = pyseekdb.Client(
#path="./seekdb",
#database="test"
)
# Option 2: Server mode (remote SeekDB server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# database="test",
# user="root",
# password=""
# )
# Option 3: Remote server mode (OceanBase Server)
# client = pyseekdb.Client(
# host="127.0.0.1",
# port=2881,
# tenant="test", # OceanBase default tenant
# database="test",
# user="root",
# password=""
# )
# ============================================================================
# PART 2: COLLECTION MANAGEMENT
# ============================================================================
collection_name = "comprehensive_example"
dimension = 128
# 2.1 Create a collection
from pyseekdb import HNSWConfiguration
config = HNSWConfiguration(dimension=dimension, distance='cosine')
collection = client.get_or_create_collection(
name=collection_name,
configuration=config,
embedding_function=None # Explicitly set to None since we're using custom 128-dim embeddings
)
# 2.2 Check if collection exists
exists = client.has_collection(collection_name)
# 2.3 Get collection object
retrieved_collection = client.get_collection(collection_name, embedding_function=None)
# 2.4 List all collections
all_collections = client.list_collections()
# 2.5 Get or create collection (creates if doesn't exist)
config2 = HNSWConfiguration(dimension=64, distance='cosine')
collection2 = client.get_or_create_collection(
name="another_collection",
configuration=config2,
embedding_function=None # Explicitly set to None since we're using custom 64-dim embeddings
)
# ============================================================================
# PART 3: DML OPERATIONS - ADD DATA
# ============================================================================
# Generate sample data
random.seed(42)
documents = [
"Machine learning is transforming the way we solve problems",
"Python programming language is widely used in data science",
"Vector databases enable efficient similarity search",
"Neural networks mimic the structure of the human brain",
"Natural language processing helps computers understand human language",
"Deep learning requires large amounts of training data",
"Reinforcement learning agents learn through trial and error",
"Computer vision enables machines to interpret visual information"
]
# Generate embeddings (in real usage, use an embedding model)
embeddings = []
for i in range(len(documents)):
vector = [random.random() for _ in range(dimension)]
embeddings.append(vector)
ids = [str(uuid.uuid4()) for _ in documents]
# 3.1 Add single item
single_id = str(uuid.uuid4())
collection.add(
ids=single_id,
documents="This is a single document",
embeddings=[random.random() for _ in range(dimension)],
metadatas={"type": "single", "category": "test"}
)
# 3.2 Add multiple items
collection.add(
ids=ids,
documents=documents,
embeddings=embeddings,
metadatas=[
{"category": "AI", "score": 95, "tag": "ml", "year": 2023},
{"category": "Programming", "score": 88, "tag": "python", "year": 2022},
{"category": "Database", "score": 92, "tag": "vector", "year": 2023},
{"category": "AI", "score": 90, "tag": "neural", "year": 2022},
{"category": "NLP", "score": 87, "tag": "language", "year": 2023},
{"category": "AI", "score": 93, "tag": "deep", "year": 2023},
{"category": "AI", "score": 85, "tag": "reinforcement", "year": 2022},
{"category": "CV", "score": 91, "tag": "vision", "year": 2023}
]
)
# 3.3 Add with only embeddings (no documents)
vector_only_ids = [str(uuid.uuid4()) for _ in range(2)]
collection.add(
ids=vector_only_ids,
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
metadatas=[{"type": "vector_only"}, {"type": "vector_only"}]
)
# ============================================================================
# PART 4: DML OPERATIONS - UPDATE DATA
# ============================================================================
# 4.1 Update single item
collection.update(
ids=ids[0],
metadatas={"category": "AI", "score": 98, "tag": "ml", "year": 2024, "updated": True}
)
# 4.2 Update multiple items
collection.update(
ids=ids[1:3],
documents=["Updated document 1", "Updated document 2"],
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
metadatas=[
{"category": "Programming", "score": 95, "updated": True},
{"category": "Database", "score": 97, "updated": True}
]
)
# 4.3 Update embeddings
new_embeddings = [[random.random() for _ in range(dimension)] for _ in range(2)]
collection.update(
ids=ids[2:4],
embeddings=new_embeddings
)
# ============================================================================
# PART 5: DML OPERATIONS - UPSERT DATA
# ============================================================================
# 5.1 Upsert existing item (will update)
collection.upsert(
ids=ids[0],
documents="Upserted document (was updated)",
embeddings=[random.random() for _ in range(dimension)],
metadatas={"category": "AI", "upserted": True}
)
# 5.2 Upsert new item (will insert)
new_id = str(uuid.uuid4())
collection.upsert(
ids=new_id,
documents="This is a new document from upsert",
embeddings=[random.random() for _ in range(dimension)],
metadatas={"category": "New", "upserted": True}
)
# 5.3 Upsert multiple items
upsert_ids = [ids[4], str(uuid.uuid4())] # One existing, one new
collection.upsert(
ids=upsert_ids,
documents=["Upserted doc 1", "Upserted doc 2"],
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
metadatas=[{"upserted": True}, {"upserted": True}]
)
# ============================================================================
# PART 6: DQL OPERATIONS - QUERY (VECTOR SIMILARITY SEARCH)
# ============================================================================
# 6.1 Basic vector similarity query
query_vector = embeddings[0] # Query with first document's vector
results = collection.query(
query_embeddings=query_vector,
n_results=3
)
print(f"Query results: {len(results['ids'][0])} items")
# 6.2 Query with metadata filter (simplified equality)
results = collection.query(
query_embeddings=query_vector,
where={"category": "AI"},
n_results=5
)
# 6.3 Query with comparison operators
results = collection.query(
query_embeddings=query_vector,
where={"score": {"$gte": 90}},
n_results=5
)
# 6.4 Query with $in operator
results = collection.query(
query_embeddings=query_vector,
where={"tag": {"$in": ["ml", "python", "neural"]}},
n_results=5
)
# 6.5 Query with logical operators ($or) - simplified equality
results = collection.query(
query_embeddings=query_vector,
where={
"$or": [
{"category": "AI"},
{"tag": "python"}
]
},
n_results=5
)
# 6.6 Query with logical operators ($and) - simplified equality
results = collection.query(
query_embeddings=query_vector,
where={
"$and": [
{"category": "AI"},
{"score": {"$gte": 90}}
]
},
n_results=5
)
# 6.7 Query with document filter
results = collection.query(
query_embeddings=query_vector,
where_document={"$contains": "machine learning"},
n_results=5
)
# 6.8 Query with combined filters (simplified equality)
results = collection.query(
query_embeddings=query_vector,
where={"category": "AI", "year": {"$gte": 2023}},
where_document={"$contains": "learning"},
n_results=5
)
# 6.9 Query with multiple embeddings (batch query)
batch_embeddings = [embeddings[0], embeddings[1]]
batch_results = collection.query(
query_embeddings=batch_embeddings,
n_results=2
)
# batch_results["ids"][0] contains results for first query
# batch_results["ids"][1] contains results for second query
# 6.10 Query with specific fields
results = collection.query(
query_embeddings=query_vector,
include=["documents", "metadatas", "embeddings"],
n_results=2
)
# ============================================================================
# PART 7: DQL OPERATIONS - GET (RETRIEVE BY IDS OR FILTERS)
# ============================================================================
# 7.1 Get by single ID
result = collection.get(ids=ids[0])
# result["ids"] contains [ids[0]]
# result["documents"] contains document for ids[0]
# 7.2 Get by multiple IDs
results = collection.get(ids=ids[:3])
# results["ids"] contains ids[:3]
# results["documents"] contains documents for all IDs
# 7.3 Get by metadata filter (simplified equality)
results = collection.get(
where={"category": "AI"},
limit=5
)
# 7.4 Get with comparison operators
results = collection.get(
where={"score": {"$gte": 90}},
limit=5
)
# 7.5 Get with $in operator
results = collection.get(
where={"tag": {"$in": ["ml", "python"]}},
limit=5
)
# 7.6 Get with logical operators (simplified equality)
results = collection.get(
where={
"$or": [
{"category": "AI"},
{"category": "Programming"}
]
},
limit=5
)
# 7.7 Get by document filter
results = collection.get(
where_document={"$contains": "Python"},
limit=5
)
# 7.8 Get with pagination
results_page1 = collection.get(limit=2, offset=0)
results_page2 = collection.get(limit=2, offset=2)
# 7.9 Get with specific fields
results = collection.get(
ids=ids[:2],
include=["documents", "metadatas", "embeddings"]
)
# 7.10 Get all data
all_results = collection.get(limit=100)
# ============================================================================
# PART 8: DQL OPERATIONS - HYBRID SEARCH
# ============================================================================
# 8.1 Hybrid search with full-text and vector search
# Note: This requires query_embeddings to be provided directly
# In real usage, you might have an embedding function
hybrid_results = collection.hybrid_search(
query={
"where_document": {"$contains": "machine learning"},
"where": {"category": "AI"}, # Simplified equality
"n_results": 10
},
knn={
"query_embeddings": [embeddings[0]],
"where": {"year": {"$gte": 2022}},
"n_results": 10
},
rank={"rrf": {}}, # Reciprocal Rank Fusion
n_results=5,
include=["documents", "metadatas"]
)
# hybrid_results["ids"][0] contains IDs for the hybrid search
# hybrid_results["documents"][0] contains documents for the hybrid search
print(f"Hybrid search: {len(hybrid_results.get('ids', [[]])[0])} results")
# ============================================================================
# PART 9: DML OPERATIONS - DELETE DATA
# ============================================================================
# 9.1 Delete by IDs
delete_ids = [vector_only_ids[0], new_id]
collection.delete(ids=delete_ids)
# 9.2 Delete by metadata filter
collection.delete(where={"type": {"$eq": "vector_only"}})
# 9.3 Delete by document filter
collection.delete(where_document={"$contains": "Updated document"})
# 9.4 Delete with combined filters
collection.delete(
where={"category": {"$eq": "CV"}},
where_document={"$contains": "vision"}
)
# ============================================================================
# PART 10: COLLECTION INFORMATION
# ============================================================================
# 10.1 Get collection count
count = collection.count()
print(f"Collection count: {count} items")
# 10.2 Preview first few items in collection (returns all columns by default)
preview = collection.peek(limit=5)
print(f"Preview: {len(preview['ids'])} items")
for i in range(len(preview['ids'])):
print(f" ID: {preview['ids'][i]}, Document: {preview['documents'][i]}")
print(f" Metadata: {preview['metadatas'][i]}, Embedding dim: {len(preview['embeddings'][i]) if preview['embeddings'][i] else 0}")
# 10.3 Count collections in database
collection_count = client.count_collection()
print(f"Database has {collection_count} collections")
# ============================================================================
# PART 11: CLEANUP
# ============================================================================
# Delete test collections
try:
client.delete_collection("another_collection")
except Exception as e:
print(f"Could not delete 'another_collection': {e}")
# Delete the main collection
client.delete_collection(collection_name)
```
## References
* For information about the API interfaces supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
* [Simple Example](../50.sdk-samples/10.pyseekdb-simple-sample.md)
* [Hybrid Search Example](../50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)

View File

@@ -0,0 +1,66 @@
---
slug: /api-overview
---
# API Reference
pyseekdb allows you to operate seekdb through Python APIs.
## APIs
The following APIs are supported.
### Database
:::info
You can use these APIs only when you connect to seekdb by using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../50.apis/100.admin-client.md).
:::
| API | Description | Documentation |
|---|---|---|
| `create_database()` | Creates a database. | [Documentation](110.database/200.create-database-of-api.md) |
| `get_database()` | Retrieves a specified database. |[Documentation](110.database/300.get-database-of-api.md)|
| `list_databases()` | Retrieves a list of databases in an instance. |[Documentation](110.database/400.list-database-of-api.md)|
| `delete_database()` | Deletes a specified database.|[Documentation](110.database/500.delete-database-of-api.md)|
### Collection
:::info
You can use these APIs only when you connect to seekdb by using the `Client`. For more information about the `Client`, see [Client](../50.apis/50.client.md).
:::
| API | Description | Documentation |
|---|---|---|
| `create_collection()` | Creates a collection. | [Documentation](200.collection/100.create-collection-of-api.md) |
| `get_collection()` | Retrieves a specified collection. |[Documentation](200.collection/200.get-collection-of-api.md)|
| `get_or_create_collection()` | Creates or queries a collection. If the collection does not exist in the database, it is created. If the collection exists, the corresponding result is obtained. |[Documentation](200.collection/250.get-or-create-collection-of-api.md)|
| `list_collections()` | Retrieves the collection list in a database. |[Documentation](200.collection/300.list-collection-of-api.md)|
| `count_collection()` | Counts the number of collections in a database. |[Documentation](200.collection/350.count-collection-of-api.md)|
| `delete_collection()` | Deletes a specified collection.|[Documentation](200.collection/400.delete-collection-of-api.md)|
### DML
:::info
You can use these APIs only when you connect to seekdb by using the `Client`. For more information about the `Client`, see [Client](../50.apis/50.client.md).
:::
| API | Description | Documentation |
|---|---|---|
| `add()` | Inserts a new record into a collection. | [Documentation](300.dml/200.add-data-of-api.md) |
| `update()` | Updates an existing record in a collection. |[Documentation](300.dml/300.update-data-of-api.md)|
| `upsert()` | Inserts a new record or updates an existing record. |[Documentation](300.dml/400.upsert-data-of-api.md)|
| `delete()` | Deletes a record from a collection.|[Documentation](300.dml/500.delete-data-of-api.md)|
### DQL
:::info
You can use these APIs only when you connect to seekdb by using the `Client`. For more information about the `Client`, see [Client](../50.apis/50.client.md).
:::
| API | Description | Documentation |
|---|---|---|
| `query()` | Performs vector similarity search. | [Documentation](400.dql/200.query-interfaces-of-api.md) |
| `get()` | Queries specific data from a table by using the ID, document, and metadata (non-vector). |[Documentation](400.dql/300.get-interfaces-of-api.md)|
| `hybrid_search()` | Performs full-text search and vector similarity search by using ranking. |[Documentation](400.dql/400.hybrid-search-of-api.md)|
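The sketch below shows which client each API group belongs to, using only calls listed in this reference; the database and collection names are placeholders.
```python
import pyseekdb

# Database APIs require AdminClient (embedded mode shown here).
admin = pyseekdb.AdminClient(path="./seekdb")
admin.create_database("demo_db")

# Collection, DML, and DQL APIs require Client.
client = pyseekdb.Client()
collection = client.create_collection(name="demo_collection")
collection.add(ids="id1", documents="hello seekdb")           # DML
print(collection.query(query_texts="hello", n_results=1))    # DQL
client.delete_collection("demo_collection")
```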

View File

@@ -0,0 +1,93 @@
---
slug: /admin-client
---
# Admin Client
`AdminClient` provides database management operations. It uses the same connection modes as `Client`, but supports only database management operations.
## Connect to an embedded seekdb instance
Connect to a local embedded seekdb instance by using `AdminClient`.
```python
import pyseekdb
# Embedded mode - Database management
admin = pyseekdb.AdminClient(path="./seekdb")
```
Parameter description:
| Parameter | Value Type | Required | Description | Example Value |
| --- | --- | --- | --- | --- |
| `path` | string | Optional | The path of the seekdb data directory. seekdb stores database files in this directory and loads them when it starts. | `./seekdb` |
## Connect to a remote server
Use `AdminClient` to connect to a remote server, which can be a server mode seekdb instance or an OceanBase Database instance.
:::tip
Before you connect to a remote server, make sure that you have deployed a server mode seekdb instance or an OceanBase Database instance.<br/>For information about how to deploy a server mode seekdb instance, see [Overview](../../../400.guides/400.deploy/50.deploy-overview.md).<br/>For information about how to deploy an OceanBase Database instance, see [Overview](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003976427).
:::
Example: Connect to a server mode seekdb instance
```python
import pyseekdb
# Remote server mode - Database management
admin = pyseekdb.AdminClient(
host="127.0.0.1",
port=2881,
user="root",
password="" # Can be retrieved from SEEKDB_PASSWORD environment variable
)
```
Parameter description:
| Parameter | Value Type | Required | Description | Example Value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the instance resides. | `127.0.0.1` |
| `port` | string | Yes | The port of the instance. The default value is 2881. | `2881` |
| `user` | string | Yes | The username. The default value is root. | `root` |
| `password` | string | Yes | The password corresponding to the username. If you do not specify `password` or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. | |
Example: Connect to an OceanBase Database instance
```python
import pyseekdb
# Remote server mode - Database management
admin = pyseekdb.AdminClient(
host="127.0.0.1",
port=2881,
tenant="test"
user="root",
password="" # Can be retrieved from SEEKDB_PASSWORD environment variable
)
```
Parameter description:
| Parameter | Value Type | Required | Description | Example Value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the database resides. | `127.0.0.1` |
| `port` | string | Yes | The port of the OceanBase Database instance. The default value is 2881. | `2881` |
| `tenant` | string | No | The name of the tenant. This parameter is not required for a server mode seekdb instance, but is required for an OceanBase Database instance. The default value is sys. | `test` |
| `user` | string | Yes | The username corresponding to the tenant. The default value is root. | `root` |
| `password` | string | Yes | The password corresponding to the username. If you do not specify `password` or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. | |
## APIs supported when you use AdminClient to connect to a database
The following APIs are supported when you use `AdminClient` to connect to a database.
| API | Description | Documentation Link |
| --- | --- | --- |
| `create_database` | Creates a new database. |[Documentation](110.database/200.create-database-of-api.md)|
| `get_database` | Queries a specified database. |[Documentation](110.database/300.get-database-of-api.md)|
| `delete_database` | Deletes a specified database. |[Documentation](110.database/500.delete-database-of-api.md)|
| `list_databases` | Lists all databases. |[Documentation](110.database/400.list-database-of-api.md)|

View File

@@ -0,0 +1,16 @@
---
slug: /database-overview-of-api
---
# Database Management
A database contains tables, indexes, and metadata of database objects. You can create, query, and delete databases as needed.
The following APIs are available for database operations.
| API | Description | Documentation |
|---|---|---|
| `create_database()` | Creates a database. | [Documentation](200.create-database-of-api.md) |
| `get_database()` | Gets a specified database. |[Documentation](300.get-database-of-api.md)|
| `list_databases()` | Gets the list of databases in the instance. |[Documentation](400.list-database-of-api.md)|
| `delete_database()` | Deletes a specified database.|[Documentation](500.delete-database-of-api.md)|

View File

@@ -0,0 +1,76 @@
---
slug: /create-database-of-api
---
# create_database - Create a database
The `create_database()` function is used to create a new database.
:::info
* This interface can only be used when you are connected to the database using `AdminClient`. For more information about `AdminClient`, see [Admin Client](../100.admin-client.md).
* Currently, when you use `create_database` to create a database, you cannot specify the database properties. The database will be created based on the default values of the properties. If you want to create a database with specific properties, you can try to create it using SQL. For more information about how to create a database using SQL, see [Create a database](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003977077).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).
* If you are using server mode of seekdb or OceanBase Database, make sure that the connected user has the `CREATE` privilege. For more information about how to check the privileges of the current user, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Limitations
* In a seekdb instance or OceanBase Database, the name of each database must be globally unique.
* The maximum length of a database name is 128 characters.
* The name can contain only uppercase and lowercase letters, digits, underscores, dollar signs, and Chinese characters.
* Avoid using reserved keywords as database names. For more information about reserved keywords, see [Reserved keywords](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003976774).
## Recommendations
* We recommend that you give the database a meaningful name that reflects its purpose and content. For example, you can use `Application Identifier_Sub-application name (optional)_db` as the database name.
* We recommend that you create the database and related users using the root user and assign only the necessary privileges to ensure the security and controllability of the database.
* You can create a database whose name consists only of digits by enclosing the name in backticks, but this is not recommended: such names have no clear meaning, and every query must quote them with backticks, which adds unnecessary complexity and confusion.
## Request parameters
```python
create_database(name, tenant=DEFAULT_TENANT)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the database to be created. |`my_database`|
|`tenant`|string|No<ul><li>When using embedded seekdb or server mode of seekdb, this parameter is not required.</li><li>When using OceanBase Database, this parameter is required.</li></ul>|The tenant to which the database belongs. |`test_tenant`|
## Request example
```python
import pyseekdb
# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")
# Create database
admin.create_database("my_database")
```
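When you connect to an OceanBase Database instance, pass the target tenant explicitly, in line with the signature above. A sketch (`test_tenant` and the connection values are placeholders):
```python
import pyseekdb

# Remote server mode against OceanBase Database: tenant is required.
admin = pyseekdb.AdminClient(host="127.0.0.1", port=2881, tenant="test_tenant", user="root", password="")
admin.create_database("my_database", tenant="test_tenant")
```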
## Response parameters
None
## References
* [Get a specific database](300.get-database-of-api.md)
* [Delete a database](500.delete-database-of-api.md)
* [List databases](400.list-database-of-api.md)

View File

@@ -0,0 +1,65 @@
---
slug: /get-database-of-api
---
# get_database - Get the specified database
The `get_database()` method is used to obtain the information of the specified database.
:::info
This method can be used only when you connect to the database by using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../100.admin-client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).
## Request parameters
```python
get_database(name, tenant=DEFAULT_TENANT)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the database to be queried. |`my_database`|
|`tenant`|string|No<ul><li>When you use embedded seekdb and server mode seekdb, you do not need to specify this parameter.</li><li>When you use OceanBase Database, you must specify this parameter.</li></ul>|The tenant to which the database belongs. |test_tenant|
## Request example
```python
import pyseekdb
# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")
# Get database
db = admin.get_database("my_database")
print(f"Database: {db.name}, Charset: {db.charset}, collation:{db.collation}, metadata:{db.metadata}")
```
## Response parameters
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried database. |`my_database`|
|`tenant`|string|No<br/>When you use embedded seekdb or server mode seekdb, this parameter is not returned. |The tenant to which the queried database belongs. |`test_tenant`|
|`charset`|string|No|The character set used by the queried database. |`utf8mb4`|
|`collation`|string|No|The collation used by the queried database. |`utf8mb4_general_ci`|
|`metadata`|dict|No|Reserved field. | {} |
## Response example
```python
Database: my_database, Charset: utf8mb4, collation:utf8mb4_general_ci, metadata:{}
```
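The following defensive sketch assumes that requesting a database that does not exist raises an exception; the exact behavior and exception type are not specified here:

```python
try:
    db = admin.get_database("missing_db")
except Exception as e:  # the exact exception class depends on pyseekdb
    print(f"Failed to get database: {e}")
```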
## References
* [Create a database](200.create-database-of-api.md)
* [Delete a database](500.delete-database-of-api.md)
* [Get the database list](400.list-database-of-api.md)

View File

@@ -0,0 +1,70 @@
---
slug: /list-database-of-api
---
# list_databases - Get the database list
The `list_databases()` method is used to retrieve the database list in the instance.
:::info
This API is only available when using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../100.admin-client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).
## Request parameters
```python
list_databases(limit=None, offset=None, tenant=DEFAULT_TENANT)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`limit`|int|Optional|The maximum number of databases to return. |2|
|`offset`|int|Optional|The number of databases to skip. |3|
|`tenant`|string|Optional<ul><li>When using embedded seekdb and server mode seekdb, this parameter is not required.</li><li>When using OceanBase Database, this parameter is required. The default value is `sys`.</li></ul>|The tenant to which the queried database belongs. |test_tenant|
## Request example
```python
# List databases
import pyseekdb
# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")
# List with pagination: skip 3 databases, return up to 2
databases = admin.list_databases(limit=2, offset=3)
for db in databases:
print(f"Database: {db.name}, Charset: {db.charset}, collation:{db.collation}, metadata:{db.metadata}")
```
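To walk through all databases page by page, you can combine `limit` and `offset`. A minimal sketch, assuming `list_databases` returns an empty list once the offset is past the end:

```python
# Page through databases two at a time
offset = 0
while True:
    page = admin.list_databases(limit=2, offset=offset)
    if not page:
        break
    for db in page:
        print(db.name)
    offset += 2
```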
## Response parameters
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried database. |`my_database`|
|`tenant`|string|Optional<br/>When using embedded seekdb or server mode seekdb, this parameter is not returned. |The tenant to which the queried database belongs. |`test_tenant`|
|`charset`|string|Optional|The character set of the queried database. |`utf8mb4`|
|`collation`|string|Optional|The collation of the queried database. |`utf8mb4_general_ci`|
|`metadata`|dict|Optional|Reserved field. No data is returned. | {} |
## Response example
```python
Database: test, Charset: utf8mb4, collation:utf8mb4_general_ci, metadata:{}
Database: my_database, Charset: utf8mb4, collation:utf8mb4_general_ci, metadata:{}
```
## References
* [Create a database](200.create-database-of-api.md)
* [Delete a database](500.delete-database-of-api.md)
* [Get a specific database](300.get-database-of-api.md)

View File

@@ -0,0 +1,54 @@
---
slug: /delete-database-of-api
---
# delete_database - Delete a database
The `delete_database()` method is used to delete a database.
:::info
This method is only available when using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../100.admin-client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).
* If you are using server mode of seekdb or OceanBase Database, ensure that the user has the `DROP` privilege. For more information about how to view the privileges of the current user, see [View User Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have the privilege, contact the administrator to grant the privilege. For more information about how to directly grant privileges, see [Directly Grant Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
delete_database(name, tenant=DEFAULT_TENANT)
```
|Parameter|Type|Required|Description|Example Value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the database to be deleted. |my_database|
|`tenant`|string|No<ul><li>If you are using embedded seekdb or server mode of seekdb, you do not need to specify this parameter.</li><li>If you are using OceanBase Database, this parameter is required. The default value is `sys`.</li></ul>|The tenant to which the database belongs. |test_tenant|
## Request example
```python
import pyseekdb
# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")
# Delete database
admin.delete_database("my_database")
```
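A simple existence guard, sketched with the documented `list_databases()` API (whether `delete_database` errors on a missing name is not specified here):

```python
# Delete only if the database is present
names = [db.name for db in admin.list_databases()]
if "my_database" in names:
    admin.delete_database("my_database")
```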
## Response parameters
None
## References
* [Create a database](200.create-database-of-api.md)
* [Get a specific database](300.get-database-of-api.md)
* [Get the database list](400.list-database-of-api.md)

View File

@@ -0,0 +1,93 @@
---
slug: /create-collection-of-api
---
# create_collection - Create a collection
`create_collection()` is used to create a new collection, which is a table in the database.
:::info
This API is only available when you are connected to the database using a client. For more information about the client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, make sure that the user has the `CREATE` privilege. For more information about how to view the privileges of the current user, see [View user privileges](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971368). If the user does not have the privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly grant privileges](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974754).
## Define the table name
When creating a table, you must first define its name. The following requirements apply when defining the table name:
* In seekdb, each table name must be unique within the database.
* The table name cannot exceed 64 characters.
* We recommend that you give the table a meaningful name instead of using generic names such as t1 or table1. For more information about table naming conventions, see [Table naming conventions](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003977289).
## Request parameters
```python
create_collection(name, configuration=configuration, embedding_function=embedding_function)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to be created. |my_collection|
|`configuration`|HNSWConfiguration|No|The index configuration, which specifies the vector dimension and distance metric. If not provided, the defaults `dimension=384` and `distance='cosine'` are used. If set to `None`, the dimension is derived from the `embedding_function`. |HNSWConfiguration(dimension=384, distance='cosine')|
|`embedding_function`|EmbeddingFunction|No|The function that converts data into vectors. If not provided, `DefaultEmbeddingFunction()` (384 dimensions) is used. If set to `None`, the collection has no embedding function and vectors must be provided manually.|DefaultEmbeddingFunction()|
:::info
When you provide `embedding_function`, the system will automatically calculate the vector dimension by calling this function. If you also provide `configuration.dimension`, it must match the dimension of `embedding_function`. Otherwise, a ValueError will be raised.
:::
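For illustration, a mismatched configuration, sketched under the assumption that the default embedding function produces 384-dimensional vectors:

```python
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration

# DefaultEmbeddingFunction produces 384-dimensional vectors, but this
# configuration declares 128 dimensions, so create_collection would raise
# ValueError (call commented out; `client` is created as in the example below):
# client.create_collection(
#     name="bad_collection",
#     configuration=HNSWConfiguration(dimension=128, distance='cosine'),
#     embedding_function=DefaultEmbeddingFunction()
# )
```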
## Request example
```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration
# Create a client
client = pyseekdb.Client()
# Create a collection with default embedding function (auto-calculates dimension)
collection = client.create_collection(
name="my_collection"
)
# Create a collection with custom embedding function
ef = UserDefinedEmbeddingFunction()  # Define your own embedding function; see the embedding function docs
config = HNSWConfiguration(dimension=384, distance='cosine') # Must match EF dimension
collection = client.create_collection(
name="my_collection2",
configuration=config,
embedding_function=ef
)
# Create a collection without embedding function (vectors must be provided manually)
collection = client.create_collection(
name="my_collection3",
configuration=HNSWConfiguration(dimension=384, distance='cosine'),
embedding_function=None # Explicitly disable embedding function
)
```
## Response parameters
None
## References
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a collection list](300.list-collection-of-api.md)
* [Count the number of collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,89 @@
---
slug: /get-collection-of-api
---
# get_collection - Get a collection
The `get_collection()` function is used to retrieve a specified collection.
:::info
This API is only available when connected using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).
* The collection you want to retrieve exists. If the collection does not exist, an error will be returned.
## Request parameters
```python
client.get_collection(name, configuration=configuration, embedding_function=embedding_function)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to retrieve. |my_collection|
|`configuration`|HNSWConfiguration|No|The index configuration, which specifies the vector dimension and distance metric. If not provided, the defaults `dimension=384` and `distance='cosine'` are used. If set to `None`, the dimension is derived from the `embedding_function`. |HNSWConfiguration(dimension=384, distance='cosine')|
|`embedding_function`|EmbeddingFunction|No|The function that converts text into vectors. If not provided, `DefaultEmbeddingFunction()` (384 dimensions) is used. If set to `None`, the collection has no embedding function and vectors must be provided manually.|DefaultEmbeddingFunction()|
:::info
When vectors are not provided for documents/texts, the embedding function set here will be used for all operations on this collection, including add, upsert, update, query, and hybrid_search.
:::
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
# Get an existing collection (uses default embedding function if collection doesn't have one)
collection = client.get_collection("my_collection")
print(f"Database: {collection.name}, dimension: {collection.dimension}, embedding_function:{collection.embedding_function}, distance:{collection.distance}, metadata:{collection.metadata}")
# Get collection with specific embedding function
ef = UserDefinedEmbeddingFunction()  # Define your own embedding function; see the embedding function docs
collection = client.get_collection("my_collection", embedding_function=ef)
print(f"Database: {collection.name}, dimension: {collection.dimension}, embedding_function:{collection.embedding_function}, distance:{collection.distance}, metadata:{collection.metadata}")
# Get collection without embedding function
collection = client.get_collection("my_collection", embedding_function=None)
# Check if collection exists
if client.has_collection("my_collection"):
collection = client.get_collection("my_collection")
print(f"Database: {collection.name}, dimension: {collection.dimension}, embedding_function:{collection.embedding_function}, distance:{collection.distance}, metadata:{collection.metadata}")
```
## Response parameters
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to query. |my_collection|
|`dimension`|int|No|The vector dimension of the collection. |384|
|`embedding_function`|EmbeddingFunction|No|The embedding function associated with the collection. |DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')|
|`distance`|string|No|The distance metric of the collection. |cosine|
|`metadata`|dict|No|Reserved field. No data is returned. | {} |
## Response example
```python
Database: my_collection, dimension: 384, embedding_function:DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2'), distance:cosine, metadata:{}
Database: my_collection1, dimension: 384, embedding_function:DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2'), distance:cosine, metadata:{}
```
## References
* [Create a collection](100.create-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a list of collections](300.list-collection-of-api.md)
* [Count the number of collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,79 @@
---
slug: /get-or-create-collection-of-api
---
# get_or_create_collection - Create or query a collection
The `get_or_create_collection()` function creates or queries a collection. If the collection does not exist in the database, it is created. If it exists, the corresponding result is obtained.
:::info
This API is only available when using a client. For more information about the client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, ensure that the connected user has the `CREATE` privilege. For more information about how to check the privileges of the current user, see [Check User Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly Grant Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Define a table name
When creating a table, you need to define a table name. The following requirements must be met:
* In seekdb, each table name must be unique within the database.
* The table name must be no longer than 64 characters.
* It is recommended to use meaningful names for tables instead of generic names like t1 or table1. For more information about table naming conventions, see [Table Naming Conventions](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003977289).
## Request parameters
```python
get_or_create_collection(name, configuration=configuration, embedding_function=embedding_function)
```
|Parameter|Value Type|Required|Description|Example Value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to be created. |my_collection|
|`configuration`|HNSWConfiguration|No|The index configuration, which specifies the vector dimension and distance metric. If not provided, the defaults `dimension=384` and `distance='cosine'` are used. If set to `None`, the dimension is derived from the `embedding_function`. |HNSWConfiguration(dimension=384, distance='cosine')|
|`embedding_function`|EmbeddingFunction|No|The function that converts data into vectors. If not provided, `DefaultEmbeddingFunction()` (384 dimensions) is used. If set to `None`, the collection has no embedding function and vectors must be provided manually. |DefaultEmbeddingFunction()|
:::info
When `embedding_function` is provided, the system will automatically calculate the vector dimension by calling the function. If `configuration.dimension` is also provided, it must match the dimension of `embedding_function`, otherwise a ValueError will be raised.
:::
## Request example
```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration
# Create a client
client = pyseekdb.Client()
# Get or create collection (creates if doesn't exist)
collection = client.get_or_create_collection(
name="my_collection4",
configuration=HNSWConfiguration(dimension=384, distance='cosine'),
embedding_function=DefaultEmbeddingFunction()
)
```
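Because the call is idempotent with respect to existence, invoking it again with the same name simply returns the existing collection instead of raising an error:

```python
# Second call fetches the collection created above (sketch)
same_collection = client.get_or_create_collection(name="my_collection4")
assert same_collection.name == collection.name
```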
## Response parameters
None
## References
* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Get a list of collections](300.list-collection-of-api.md)
* [Count collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,65 @@
---
slug: /list-collection-of-api
---
# list_collections - Get a list of collections
The `list_collections()` API is used to obtain all collections.
:::info
This API is supported only when you use a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
## Request parameters
```python
client.list_collections()
```
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
# List all collections
collections = client.list_collections()
for coll in collections:
print(f"Collection: {coll.name}, Dimension: {coll.dimension}, embedding_function: {coll.embedding_function}, distance: {coll.distance}, metadata: {coll.metadata}")
```
## Response parameters
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried collection. |my_collection|
|`dimension`|int|No|The vector dimension of the queried collection. | 384 |
|`embedding_function`|EmbeddingFunction|No|The embedding function associated with the queried collection. |DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')|
|`distance`|string|No|The distance metric of the queried collection. |cosine|
|`metadata`|dict|No|Reserved field. No data is returned. | {} |
## Response example
```python
Collection: my_collection, Dimension: 384, embedding_function: DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2'), distance: cosine, metadata: {}
```
## References
* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Count collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,56 @@
---
slug: /count-collection-of-api
---
# count_collection - Count the number of collections
The `count_collection()` method is used to count the number of collections in the database.
:::info
This API is only available when you are connected to the database using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
## Request parameters
```python
client.count_collection()
```
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
# Count collections in database
collection_count = client.count_collection()
print(f"Database has {collection_count} collections")
```
## Return parameters
An `int`: the number of collections in the database.
## Return example
```python
Database has 1 collections
```
## Related operations
* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a collection list](300.list-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)

View File

@@ -0,0 +1,55 @@
---
slug: /delete-collection-of-api
---
# delete_collection - Delete a collection
The `delete_collection()` method is used to delete a specified collection.
:::info
This API is only available when you are connected to the database using a client. For more information about the client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* The collection you want to delete exists. If the collection does not exist, an error will be returned.
## Request parameters
```python
client.delete_collection(name)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to be deleted. |my_collection|
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
# Delete a collection
client.delete_collection("my_collection")
```
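To avoid the error on a missing collection, you can check for existence first with `has_collection()`, which appears in the get_collection examples:

```python
# Delete only if the collection exists
if client.has_collection("my_collection"):
    client.delete_collection("my_collection")
```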
## Response parameters
None
## References
* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a collection list](300.list-collection-of-api.md)
* [Count the number of collections](350.count-collection-of-api.md)

View File

@@ -0,0 +1,18 @@
---
slug: /collection-overview-of-api
---
# Manage collections
In pyseekdb, a collection is a set similar to a table in a database. You can create, query, and delete collections.
The following API interfaces are supported for managing collections.
| API interface | Description | Documentation |
|---|---|---|
| `create_collection()` | Creates a collection. | [Documentation](100.create-collection-of-api.md) |
| `get_collection()` | Gets a specified collection. |[Documentation](200.get-collection-of-api.md)|
| `get_or_create_collection()` | Creates or queries a collection. If the collection does not exist in the database, it is created. If the collection exists, the corresponding result is obtained. |[Documentation](250.get-or-create-collection-of-api.md)|
| `list_collections()` | Gets the collection list of a database. |[Documentation](300.list-collection-of-api.md)|
| `count_collection()` | Counts the number of collections in a database. |[Documentation](350.count-collection-of-api.md)|
| `delete_collection()` | Deletes a specified collection.|[Documentation](400.delete-collection-of-api.md)|

View File

@@ -0,0 +1,16 @@
---
slug: /dml-overview-of-api
---
# DML operations
DML (Data Manipulation Language) operations allow you to insert, update, and delete data in a collection.
For DML operations, you can use the following APIs.
| API | Description | Documentation |
|---|---|---|
| `add()` | Inserts a new record into a collection. | [Documentation](200.add-data-of-api.md) |
| `update()` | Updates an existing record in a collection. |[Documentation](300.update-data-of-api.md)|
| `upsert()` | Inserts a new record or updates an existing record. |[Documentation](400.upsert-data-of-api.md)|
| `delete()` | Deletes a record from a collection.|[Documentation](500.delete-data-of-api.md)|

View File

@@ -0,0 +1,117 @@
---
slug: /add-data-of-api
---
# add - Insert data
The `add()` method inserts new data into a collection. If a record with the same ID already exists, an error is returned.
:::info
This API is only available when using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, make sure that the connected user has the `INSERT` privilege on the target table. For more information about how to view the privileges of the current user, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have the required privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
add(
ids=ids,
embeddings=embeddings,
documents=documents,
metadatas=metadatas
)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Yes|The ID of the data to be inserted. You can specify a single ID or an array of IDs.|item1|
|`embeddings`|List[float] or List[List[float]]|No|The vector or vectors of the data to be inserted. If you specify this parameter, the value of `embedding_function` is ignored. If you do not specify this parameter, you must specify `documents`, and the `collection` must have an `embedding_function`.|[0.1, 0.2, 0.3]|
|`documents`|string or List[str]|No|The document or documents to be inserted. If you do not specify `embeddings`, `documents` are converted to vectors using the `embedding_function` of the `collection`.|"This is a document"|
|`metadatas`|dict or List[dict]|No|The metadata or metadata list of the data to be inserted. |`{"category": "AI", "score": 95}`|
:::info
The `embedding_function` associated with the collection is set during `create_collection()` or `get_collection()`. You cannot override it for each operation.
:::
## Request example
```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration
# Create a client
client = pyseekdb.Client()
collection = client.create_collection(
name="my_collection",
configuration=HNSWConfiguration(dimension=3, distance='cosine'),
embedding_function=None
)
# Add single item
collection.add(
ids="item1",
embeddings=[0.1, 0.2, 0.3],
documents="This is a document",
metadatas={"category": "AI", "score": 95}
)
# Add multiple items
collection.add(
ids=["item4", "item2", "item3"],
embeddings=[
[0.1, 0.2, 0.4],
[0.4, 0.5, 0.6],
[0.7, 0.8, 0.9]
],
documents=[
"Document 1",
"Document 2",
"Document 3"
],
metadatas=[
{"category": "AI", "score": 95},
{"category": "ML", "score": 88},
{"category": "DL", "score": 92}
]
)
# Add with only embeddings
collection.add(
ids=["vec1", "vec2"],
embeddings=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
)
collection1 = client.create_collection(
name="my_collection1"
)
# Add with only documents - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
collection1.add(
ids=["doc1", "doc2"],
documents=["Text document 1", "Text document 2"],
metadatas=[{"tag": "A"}, {"tag": "B"}]
)
```
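Because `add()` rejects duplicate IDs, re-inserting an existing ID fails; use `upsert()` when you want insert-or-overwrite semantics. A defensive sketch (the exact exception type depends on pyseekdb):

```python
# item1 was inserted above, so adding it again raises an error
try:
    collection.add(ids="item1", embeddings=[0.1, 0.2, 0.3])
except Exception as e:  # the exact exception class is not specified here
    print(f"Duplicate ID rejected: {e}")
```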
## Response parameters
None
## References
* [Update data](300.update-data-of-api.md)
* [Update or insert data](400.upsert-data-of-api.md)
* [Delete data](500.delete-data-of-api.md)

View File

@@ -0,0 +1,88 @@
---
slug: /update-data-of-api
---
# update - Update data
The `update()` method is used to update existing records in a collection. The record must exist, otherwise an error will be raised.
:::info
This API is only available when using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, make sure that the connected user has the `UPDATE` privilege on the target table. For more information about how to view the privileges of the current user, see [View User Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly Grant Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
update(
ids=ids,
embeddings=embeddings,
documents=documents,
metadatas=metadatas
)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Yes|The ID to be modified. It can be a single ID or an array of IDs.|item1|
|`embeddings`|List[float] or List[List[float]]|No|The new vectors. If provided, they will be used directly (ignoring `embedding_function`). If not provided, you can provide `documents` to automatically generate vectors.|[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]|
|`documents`|string or List[str]|No|The new documents. If `embeddings` are not provided, `documents` will be converted to vectors using the collection's `embedding_function`.|"New document text"|
|`metadatas`|dict or List[dict]|No|The new metadata.|`{"category": "AI"}`|
:::info
You can also update only the `metadatas`. The `embedding_function` used is the one associated with the collection; it cannot be overridden per call.
:::
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")
# Update single item
collection.update(
ids="item1",
metadatas={"category": "AI", "score": 98} # Update metadata only
)
# Update multiple items
collection.update(
ids=["item1", "item2"],
embeddings=[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]], # Update embeddings
documents=["Updated document 1", "Updated document 2"] # Update documents
)
# Update with documents only - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
collection1.update(
ids="doc1",
documents="New document text", # Embeddings will be auto-generated
metadatas={"category": "AI"}
)
```
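Unlike `upsert()`, `update()` requires the record to exist. A defensive sketch (the exact exception type depends on pyseekdb):

```python
# Updating an ID that was never inserted raises an error
try:
    collection.update(ids="no_such_id", metadatas={"category": "AI"})
except Exception as e:  # the exact exception class is not specified here
    print(f"Update failed: {e}")
```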
## Response parameters
None
## References
* [Insert data](200.add-data-of-api.md)
* [Update or insert data](400.upsert-data-of-api.md)
* [Delete data](500.delete-data-of-api.md)

View File

@@ -0,0 +1,93 @@
---
slug: /upsert-data-of-api
---
# upsert - Update or insert data
The `upsert()` method is used to insert new records or update existing records. If a record with the given ID already exists, it will be updated; otherwise, a new record will be inserted.
:::info
This API is only available when using a Client connection. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, ensure that the connected user has the `INSERT` and `UPDATE` privileges on the target table. For more information about how to view the current user privileges, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have the required privileges, contact the administrator to grant them. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
upsert(
ids=ids,
embeddings=embeddings,
documents=documents,
metadatas=metadatas
)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Yes|The ID to be added or modified. It can be a single ID or an array of IDs.|item1|
|`embeddings`|List[float] or List[List[float]]|No|The vectors. If provided, they will be used directly (ignoring `embedding_function`). If not provided, you can provide `documents` to automatically generate vectors.|[0.1, 0.2, 0.3]|
|`documents`|string or List[str]|No|The documents. If `embeddings` are not provided, `documents` will be converted to vectors using the collection's `embedding_function`.|"Document text"|
|`metadatas`|dict or List[dict]|No|The metadata. |`{"category": "AI"}`|
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")
# Upsert single item (insert or update)
collection.upsert(
ids="item1",
embeddings=[0.1, 0.2, 0.3],
documents="Document text",
metadatas={"category": "AI", "score": 95}
)
# Upsert multiple items
collection.upsert(
ids=["item1", "item2", "item3"],
embeddings=[
[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6],
[0.7, 0.8, 0.9]
],
documents=["Doc 1", "Doc 2", "Doc 3"],
metadatas=[
{"category": "AI"},
{"category": "ML"},
{"category": "DL"}
]
)
# Upsert with documents only - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
collection1.upsert(
ids=["item1", "item2"],
documents=["Document 1", "Document 2"],
metadatas=[{"category": "AI"}, {"category": "ML"}]
)
```
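The same call can be issued repeatedly: the first invocation inserts the record, and later ones overwrite it in place, with no duplicate-ID error as with `add()`:

```python
# First pass inserts item9; second pass updates it (sketch)
for _ in range(2):
    collection.upsert(ids="item9", embeddings=[0.2, 0.2, 0.2], documents="Doc 9")
```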
## Response parameters
None
## References
* [Insert data](200.add-data-of-api.md)
* [Update data](300.update-data-of-api.md)
* [Delete data](500.delete-data-of-api.md)

View File

@@ -0,0 +1,87 @@
---
slug: /delete-data-of-api
---
# delete - Delete data
`delete()` is used to delete records from a collection. You can delete records by ID, metadata filter, or document filter.
:::info
This API is only available when you are connected to the database using a Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* If you are using seekdb in server mode or OceanBase Database, make sure that the connected user has the `DELETE` privilege on the target table. For more information about how to view the privileges of the current user, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).
## Request parameters
```python
delete(
    ids=ids,
    where=where,
    where_document=where_document
)
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Optional|The ID of the record to be deleted. You can specify a single ID or an array of IDs.|item1|
|`where`|dict|Optional|The metadata filter.|`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|Optional|The document filter.|`{"$contains": "obsolete"}`|
:::info
At least one of the `ids`, `where`, or `where_document` parameters must be specified.
:::
## Request examples
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
# Delete by IDs
collection.delete(ids=["item1", "item2", "item3"])
# Delete by single ID
collection.delete(ids="item1")
# Delete by metadata filter
collection.delete(where={"category": {"$eq": "AI"}})
# Delete by comparison operator
collection.delete(where={"score": {"$lt": 50}})
# Delete by document filter
collection.delete(where_document={"$contains": "obsolete"})
# Delete with combined filters
collection.delete(
where={"category": {"$eq": "AI"}},
where_document={"$contains": "deprecated"}
)
```
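You can confirm a deletion with `get()` from the DQL section; the exact shape of the returned result is documented there:

```python
# After deletion, looking up the ID should return no matching record (sketch)
remaining = collection.get(ids="item1")
print(remaining)
```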
## Response parameters
None
## References
* [Insert data](200.add-data-of-api.md)
* [Update data](300.update-data-of-api.md)
* [Update or insert data](400.upsert-data-of-api.md)

View File

@@ -0,0 +1,15 @@
---
slug: /dql-overview-of-api
---
# DQL operations
DQL (Data Query Language) operations allow you to retrieve data from collections using various query methods.
For DQL operations, the following API interfaces are supported.
| API Interface | Description | Documentation Link |
|---|---|---|
| `query()` | A vector similarity search method. | [Documentation](200.query-interfaces-of-api.md) |
| `get()` | Queries specific data from a table using an ID, document, or metadata (excluding vectors). | [Documentation](300.get-interfaces-of-api.md) |
| `hybrid_search()` | Combines full-text search and vector similarity search using a ranking method. | [Documentation](400.hybrid-search-of-api.md) |

View File

@@ -0,0 +1,161 @@
---
slug: /query-interfaces-of-api
---
# query - Vector query
The `query()` method is used to perform vector similarity search to find the most similar documents to the query vector.
:::info
This interface is only available when using the Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* You have created a collection and inserted data. For more information about how to create a collection and insert data, see [create_collection - Create a collection](../200.collection/100.create-collection-of-api.md) and [add - Insert data](../300.dml/200.add-data-of-api.md).
## Request parameters
```python
query(
    query_embeddings=query_embeddings,
    query_texts=query_texts,
    n_results=n_results,
    where=where,
    where_document=where_document,
    include=include
)
```
|Parameter|Value type|Required|Description|Example value|
|---|---|---|---|---|
|`query_embeddings`|List[float] or List[List[float]] |No|A single vector or a list of vectors for batch queries. If provided, it is used directly and the `embedding_function` is ignored. If not provided, `query_texts` must be provided and the `collection` must have an `embedding_function`.|[1.0, 2.0, 3.0]|
|`query_texts`|str or List[str]|No|A single text or a list of texts. They are converted to vectors by the collection's `embedding_function`. Required if `query_embeddings` is not provided.|["my query text"]|
|`n_results`|int|No|The number of similar results to return. The default value is 10.|3|
|`where`|dict |No|Metadata filter conditions.|`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|No|Document filter conditions.|`{"$contains": "machine"}`|
|`include`|List[str]|No|List of fields to include: `["documents", "metadatas", "embeddings"]`|["documents", "metadatas", "embeddings"]|
:::info
The `embedding_function` used is associated with the collection (set during `create_collection()` or `get_collection()`). You cannot override it for each operation.
:::
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")
# Basic vector similarity query (embedding_function not used)
results = collection.query(
query_embeddings=[1.0, 2.0, 3.0],
n_results=3
)
# Iterate over results
for i in range(len(results["ids"][0])):
print(f"ID: {results['ids'][0][i]}, Distance: {results['distances'][0][i]}")
if results.get("documents"):
print(f"Document: {results['documents'][0][i]}")
if results.get("metadatas"):
print(f"Metadata: {results['metadatas'][0][i]}")
# Query by texts - vectors auto-generated by embedding_function
# Requires: collection must have embedding_function set
results = collection1.query(
query_texts=["my query text"],
n_results=10
)
# The collection's embedding_function will automatically convert query_texts to query_embeddings
# Query by multiple texts (batch query)
results = collection1.query(
query_texts=["query text 1", "query text 2"],
n_results=5
)
# Returns dict with lists of lists, one list per query text
for i in range(len(results["ids"])):
print(f"Query {i}: {len(results['ids'][i])} results")
# Query with metadata filter (using query_texts)
results = collection1.query(
query_texts=["AI research"],
where={"category": {"$eq": "AI"}},
n_results=5
)
# Query with comparison operator (using query_texts)
results = collection1.query(
query_texts=["machine learning"],
where={"score": {"$gte": 90}},
n_results=5
)
# Query with document filter (using query_texts)
results = collection1.query(
query_texts=["neural networks"],
where_document={"$contains": "machine learning"},
n_results=5
)
# Query with combined filters (using query_texts)
results = collection1.query(
query_texts=["AI research"],
where={"category": {"$eq": "AI"}, "score": {"$gte": 90}},
where_document={"$contains": "machine"},
n_results=5
)
# Query with multiple vectors (batch query)
results = collection.query(
query_embeddings=[[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
n_results=2
)
# Returns dict with lists of lists, one list per query vector
for i in range(len(results["ids"])):
print(f"Query {i}: {len(results['ids'][i])} results")
# Query with specific fields
results = collection.query(
query_embeddings=[1.0, 2.0, 3.0],
include=["documents", "metadatas", "embeddings"],
n_results=3
)
```
## Return parameters
|Parameter|Value type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|List[List[str]] |Yes|The IDs of the matched records, one inner list per query.|[["item1", "item2"]]|
|`embeddings`|List[List[List[float]]]|No|The vectors of the matched records, returned when `embeddings` is listed in `include`.|[[[0.1, 0.2, 0.3]]]|
|`documents`|List[List[str]]|No|The documents of the matched records.|[["Document text"]]|
|`metadatas`|List[List[dict]]|No|The metadata of the matched records.|`[[{"category": "AI"}]]`|
|`distances`|List[List[float]]|No|The distance between each matched record and the query vector.|[[0.0, 0.025]]|
## Return example
```python
ID: vec1, Distance: 0.0
Document: None
Metadata: {}
ID: vec2, Distance: 0.025368153802923787
Document: None
Metadata: {}
Query 0: 4 results
Query 1: 4 results
Query 0: 2 results
Query 1: 2 results
```
## Related operations
* [get - Retrieve](300.get-interfaces-of-api.md)
* [Hybrid search](400.hybrid-search-of-api.md)
* [Operators](500.filter-operators-of-api.md)

View File

@@ -0,0 +1,127 @@
---
slug: /get-interfaces-of-api
---
# get - Retrieve
`get()` is used to retrieve documents from a collection without performing vector similarity search.
It supports filtering by IDs, metadata, and documents.
:::info
This interface is only available when using the Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* You have created a collection and inserted data. For more information about how to create a collection and insert data, see [create_collection - Create a collection](../200.collection/100.create-collection-of-api.md) and [add - Insert data](../300.dml/200.add-data-of-api.md).
## Request parameters
```python
get()
```
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str] |No|The ID or list of IDs to retrieve.|["1", "2", "3"]|
|`where`|dict |No|The metadata filter. |`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|No|The document filter. |`{"$contains": "machine"}`|
|`limit`|int |No|The maximum number of results to return. |10|
|`offset`|int|No|The number of results to skip, for pagination. |1|
|`include`|List[str]|No|The list of fields to include: `["documents", "metadatas", "embeddings"]`. |["documents", "metadatas", "embeddings"]|
:::info
If no parameters are provided, all data is returned.
:::
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
# Get by single ID
results = collection.get(ids="123")
# Get by multiple IDs
results = collection.get(ids=["1", "2", "3"])
# Get by metadata filter
results = collection.get(
where={"category": {"$eq": "AI"}},
limit=10
)
# Get by comparison operator
results = collection.get(
where={"score": {"$gte": 90}},
limit=10
)
# Get by $in operator
results = collection.get(
where={"tag": {"$in": ["ml", "python"]}},
limit=10
)
# Get by logical operators ($or)
results = collection.get(
where={
"$or": [
{"category": {"$eq": "AI"}},
{"tag": {"$eq": "python"}}
]
},
limit=10
)
# Get by document content filter
results = collection.get(
where_document={"$contains": "machine learning"},
limit=10
)
# Get with combined filters
results = collection.get(
where={"category": {"$eq": "AI"}},
where_document={"$contains": "machine"},
limit=10
)
# Get with pagination
results = collection.get(limit=2, offset=1)
# Get with specific fields
results = collection.get(
ids=["1", "2"],
include=["documents", "metadatas", "embeddings"]
)
# Get all data (up to limit)
results = collection.get(limit=100)
```
## Response parameters
* If a single ID is provided: a result object for that ID.
* If multiple IDs are provided: a list of QueryResult objects, one per ID.
* If filters are provided: a QueryResult object containing all matching records.
## Related operations
* [Vector query](200.query-interfaces-of-api.md)
* [Hybrid search](400.hybrid-search-of-api.md)
* [Operators](500.filter-operators-of-api.md)

View File

@@ -0,0 +1,140 @@
---
slug: /hybrid-search-of-api
---
# hybrid_search - Hybrid search
`hybrid_search()` combines full-text search and vector similarity search with ranking.
:::info
This API is only available when using the Client. For more information about the Client, see [Client](../50.client.md).
:::
## Prerequisites
* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).
* You have created a collection and inserted data. For more information about how to create a collection and insert data, see [create_collection - Create a collection](../200.collection/100.create-collection-of-api.md) and [add - Insert Data](../300.dml/200.add-data-of-api.md).
## Request parameters
```python
hybrid_search(
    query={
        "where_document": where_document,
        "where": where,
        "n_results": n_results
    },
    knn={
        "query_texts": query_texts,
        "where": where,
        "n_results": n_results
    },
    rank=rank,
    n_results=n_results,
    include=include
)
```
* query: full-text search configuration, including the following parameters:
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`where`|dict |Optional|Metadata filter conditions. |`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|Optional|Document filter conditions. |`{"$contains": "machine"}`|
|`n_results`|int|Yes|The number of results for the full-text search.|10|
* knn: vector search configuration, including the following parameters:
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`query_embeddings`|List[float] or List[List[float]] |Optional|A single vector or a list of vectors for batch queries. If provided, it is used directly and the `embedding_function` is ignored. If not provided, `query_texts` must be provided and the `collection` must have an `embedding_function`.|[1.0, 2.0, 3.0]|
|`query_texts`|str or List[str]|Optional|A single text or a list of texts. They are converted to vectors by the collection's `embedding_function`. Required if `query_embeddings` is not provided.|["my query text"]|
|`where`|dict |Optional|Metadata filter conditions. |`{"category": {"$eq": "AI"}}`|
|`n_results`|int|Yes|The number of results for the vector search.|10|
* Other parameters are as follows:
|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`rank`|dict |Optional|The ranking configuration.|`{"rrf": {"rank_window_size": 60, "rank_constant": 60}}`|
|`n_results`|int|No|The number of fused results to return. The default value is 10.|3|
|`include`|List[str]|Optional|The list of fields to include: `["documents", "metadatas", "embeddings"]`.|["documents", "metadatas", "embeddings"]|
:::info
The `embedding_function` used is associated with the collection (set during `create_collection()` or `get_collection()`). You cannot override it for each operation.
:::
## Request example
```python
import pyseekdb
# Create a client
client = pyseekdb.Client()
collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")
# Hybrid search with query_embeddings (embedding_function not used)
results = collection.hybrid_search(
query={
"where_document": {"$contains": "machine learning"},
"n_results": 10
},
knn={
"query_embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], # Used directly
"n_results": 10
},
rank={"rrf": {}},
n_results=5
)
# Hybrid search with both full-text and vector search (using query_texts)
results = collection1.hybrid_search(
query={
"where_document": {"$contains": "machine learning"},
"where": {"category": {"$eq": "science"}},
"n_results": 10
},
knn={
"query_texts": ["AI research"], # Will be embedded automatically
"where": {"year": {"$gte": 2020}},
"n_results": 10
},
rank={"rrf": {}}, # Reciprocal Rank Fusion
n_results=5,
include=["documents", "metadatas", "embeddings"]
)
# Hybrid search with multiple query texts (batch)
results = collection1.hybrid_search(
query={
"where_document": {"$contains": "AI"},
"n_results": 10
},
knn={
"query_texts": ["machine learning", "neural networks"], # Multiple queries
"n_results": 10
},
rank={"rrf": {}},
n_results=5
)
```
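The returned structure mirrors `query()`: lists of lists, one inner list per query. A sketch, assuming `ids` is present and `distances` was requested:

```python
# Iterate fused results per query
for i in range(len(results["ids"])):
    for j, item_id in enumerate(results["ids"][i]):
        distance = results["distances"][i][j] if results.get("distances") else None
        print(f"Query {i}: id={item_id}, distance={distance}")
```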
## Return parameters
A dictionary containing the search results, including the IDs, distances, metadatas, and documents.
## Related operations
* [Vector query](200.query-interfaces-of-api.md)
* [get - Retrieve](300.get-interfaces-of-api.md)
* [Operators](500.filter-operators-of-api.md)

View File

@@ -0,0 +1,151 @@
---
slug: /filter-operators-of-api
---
# Operators
Operators are used to connect operands or parameters and return results. In terms of syntax, operators can appear before, after, or between operands.
## Operator examples
### Data filtering (where)
#### Equal to
Use `$eq` to indicate equal to, as shown in the following example:
```python
where={"category": {"$eq": "AI"}}
```
#### Not equal to
Use `$ne` to indicate not equal to, as shown in the following example:
```python
where={"status": {"$ne": "deleted"}}
```
#### Greater than
Use `$gt` to indicate greater than, as shown in the following example:
```python
where={"score": {"$gt": 90}}
```
#### Greater than or equal to
Use `$gte` to indicate greater than or equal to, as shown in the following example:
```python
where={"score": {"$gte": 90}}
```
#### Less than
Use `$lt` to indicate less than, as shown in the following example:
```python
where={"score": {"$lt": 50}}
```
#### Less than or equal to
Use `$lte` to indicate less than or equal to, as shown in the following example:
```python
where={"score": {"$lte": 50}}
```
#### In a list
Use `$in` to match any of the listed values, as shown in the following example:
```python
where={"tag": {"$in": ["ml", "python", "ai"]}}
```
#### Not in a list
Use `$nin` to exclude the listed values, as shown in the following example:
```python
where={"tag": {"$nin": ["deprecated", "old"]}}
```
#### Logical OR
Use `$or` to indicate logical OR, as shown in the following example:
```python
where={
"$or": [
{"category": {"$eq": "AI"}},
{"tag": {"$eq": "python"}}
]
}
```
#### Logical AND
Use `$and` to indicate logical AND, as shown in the following example:
```python
where={
"$and": [
{"category": {"$eq": "AI"}},
{"score": {"$gte": 90}}
]
}
```
### Text filtering (where_document)
#### Full-text search (contains substring)
Use `$contains` to indicate full-text search, as shown in the following example:
```python
where_document={"$contains": "machine learning"}
```
#### Regular expression
Use `$regex` to indicate regular expression, as shown in the following example:
```python
where_document={"$regex": "pattern.*"}
```
#### Logical OR
Use `$or` to indicate logical OR, as shown in the following example:
```python
where_document={
"$or": [
{"$contains": "machine learning"},
{"$contains": "artificial intelligence"}
]
}
```
#### Logical AND
Use `$and` to indicate logical AND, as shown in the following example:
```python
where_document={
"$and": [
{"$contains": "machine"},
{"$contains": "learning"}
]
}
```
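These filters compose with the DQL methods shown earlier. A sketch combining a metadata filter and a document filter in one `get()` call, assuming `collection` was obtained as in the get documentation:

```python
# Records in category "AI" with score >= 90 whose document mentions "learning"
results = collection.get(
    where={"$and": [{"category": {"$eq": "AI"}}, {"score": {"$gte": 90}}]},
    where_document={"$contains": "learning"},
    limit=10
)
```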
## Related operations
* [Vector query](200.query-interfaces-of-api.md)
* [get - Retrieve](300.get-interfaces-of-api.md)
* [Hybrid search](400.hybrid-search-of-api.md)

View File

@@ -0,0 +1,107 @@
---
slug: /client
---
# Client
The `Client` class is used to connect to a database in either embedded mode or server mode. It automatically selects the appropriate connection mode based on the provided parameters.
:::tip
OceanBase Database is a fully self-developed, enterprise-level, native distributed database developed by OceanBase. It achieves financial-grade high availability on ordinary hardware and sets a new standard for automatic, lossless disaster recovery across five IDCs in three regions. It also sets a new benchmark in the TPC-C benchmark test, with a single cluster size exceeding 1,500 nodes. OceanBase Database is cloud-native, highly consistent, and highly compatible with Oracle and MySQL. For more information about OceanBase Database, see [OceanBase Database](https://www.oceanbase.com/docs/oceanbase-database-cn).
:::
## Connect to an embedded seekdb instance
Use the `Client` class to connect to a local embedded seekdb instance.
```python
import pyseekdb
# Create embedded client
client = pyseekdb.Client(
#path="./seekdb", # Path to SeekDB data directory
#database="test" # Database name
)
```
The following table describes the parameters.
| Parameter | Value type | Required | Description | Example value |
| --- | --- | --- | --- | --- |
| `path` | string | No | The path to the seekdb data directory. seekdb stores database files in this directory and loads them when it starts. | `./seekdb` |
| `database` | string | No | The name of the database. | `test` |
## Connect to a remote server
Use the `Client` class to connect to a remote server, which runs seekdb or OceanBase Database.
:::tip
Before you connect to a remote server, make sure that you have deployed a server instance of seekdb or OceanBase Database. <br/>For information about how to deploy a server instance of seekdb, see [Overview](../../../400.guides/400.deploy/50.deploy-overview.md).<br/>For information about how to deploy OceanBase Database, see [Overview](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003976427).
:::
Example: Connect to a server instance of seekdb
```python
import pyseekdb

# Create a remote server client (seekdb server)
client = pyseekdb.Client(
    host="127.0.0.1",   # Server host
    port=2881,          # Server port
    database="test",    # Database name
    user="root",        # Username
    password=""         # Password (falls back to the SEEKDB_PASSWORD environment variable)
)
```
The following table describes the parameters.
| Parameter | Value type | Required | Description | Example value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the instance is located. | `127.0.0.1` |
| `port` | int | No | The port number of the instance. The default value is `2881`. | `2881` |
| `database` | string | Yes | The name of the database. | `test` |
| `user` | string | No | The username. The default value is `root`. | `root` |
| `password` | string | No | The password corresponding to the user. If you do not provide the `password` parameter or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. ||
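To avoid hard-coding credentials, you can rely on the environment-variable fallback described in the table. The following is a minimal sketch; the variable value is a placeholder, and in practice you would export it in your shell or deployment environment:
```python
import os

import pyseekdb

# Hedged sketch: set SEEKDB_PASSWORD here only for illustration
os.environ["SEEKDB_PASSWORD"] = "***"  # placeholder value

client = pyseekdb.Client(
    host="127.0.0.1",
    port=2881,
    database="test",
    user="root"
    # password omitted: read from SEEKDB_PASSWORD
)
```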
Example: Connect to OceanBase Database
```python
import pyseekdb

# Create a remote server client (OceanBase Database)
client = pyseekdb.Client(
    host="127.0.0.1",   # Server host
    port=2881,          # Server port (default: 2881)
    tenant="test",      # Tenant name
    database="test",    # Database name
    user="root",        # Username (default: "root")
    password=""         # Password (falls back to the SEEKDB_PASSWORD environment variable)
)
```
The following table describes the parameters.
| Parameter | Value type | Required | Description | Example value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the database is located. | `127.0.0.1` |
| `port` | int | No | The port number of OceanBase Database. The default value is `2881`. | `2881` |
| `tenant` | string | No | The name of the tenant. This parameter is not required for seekdb. For OceanBase Database, the default value is sys. | `test` |
| `database` | string | Yes | The name of the database. | `test` |
| `user` | string | No | The username corresponding to the tenant. The default value is `root`. | `root` |
| `password` | string | No | The password corresponding to the user. If you do not provide the `password` parameter or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. ||
## APIs supported when you use the Client class to connect to a database
When you use the `Client` class to connect to a database, you can call the following APIs.
| API | Description | Document link |
| --- | --- | --- |
| `create_collection()` | Creates a new collection. | [Document](200.collection/100.create-collection-of-api.md) |
| `get_collection()` | Queries a specified collection. |[Document](200.collection/200.get-collection-of-api.md)|
| `delete_collection()` | Deletes a specified collection. |[Document](200.collection/400.delete-collection-of-api.md)|
| `list_collections()` | Lists all collections in the current database.|[Document](200.collection/300.list-collection-of-api.md)|
| `get_or_create_collection()` | Queries a specified collection. If the collection does not exist, it is created.|[Document](200.collection/250.get-or-create-collection-of-api.md)|
| `count_collection()` | Queries the number of collections in the current database. |[Document](200.collection/350.count-collection-of-api.md)|
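The following minimal sketch chains several of these APIs together; `demo_collection` is a placeholder name:
```python
import pyseekdb

client = pyseekdb.Client()  # embedded mode; see the sections above

# Create the collection if it does not exist
collection = client.get_or_create_collection("demo_collection")

# List all collections in the current database and count them
print(client.list_collections())
print(client.count_collection())

# Fetch the same collection later
collection = client.get_collection("demo_collection")

# Clean up
client.delete_collection("demo_collection")
```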

View File

@@ -0,0 +1,35 @@
---
slug: /default-embedding-function-of-api
---
# Default embedding function
An embedding function converts text documents into vector embeddings for similarity search. pyseekdb supports built-in and custom embedding functions.
The `DefaultEmbeddingFunction` is the default embedding function if none is specified. This function is already available in seekdb and does not need to be created separately.
Here is an example:
```python
from pyseekdb import DefaultEmbeddingFunction

# Use the default model (all-MiniLM-L6-v2, 384 dimensions)
ef = DefaultEmbeddingFunction()

# Get the embedding dimension
print(f"Dimension: {ef.dimension}")  # 384

# Generate embeddings
embeddings = ef(["Hello world", "How are you?"])
print(f"Generated {len(embeddings)} embeddings, each with {len(embeddings[0])} dimensions")
```
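Because `DefaultEmbeddingFunction` is applied whenever no embedding function is specified, you can also rely on it implicitly when working with collections. A minimal sketch, with `default_ef_demo` as a placeholder collection name:
```python
import pyseekdb

client = pyseekdb.Client()

# No embedding_function is specified, so DefaultEmbeddingFunction is used
collection = client.get_or_create_collection("default_ef_demo")

# Documents are embedded with the default model (384 dimensions) on insert
collection.add(
    ids=["1", "2"],
    documents=["Hello world", "How are you?"]
)

# Query texts are embedded the same way
results = collection.query(query_texts="Hello", n_results=1)
print(results['ids'][0])

client.delete_collection("default_ef_demo")
```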
## Related operations
To use a custom embedding function instead, see the following topics:
* [Create a custom embedding function](200.create-custim-embedding-functions-of-api.md)
* [Use a custom embedding function](300.using-custom-embedding-functions-of-api.md)

View File

@@ -0,0 +1,271 @@
---
slug: /create-custim-embedding-functions-of-api
---
# Create a custom embedding function
You can create a custom embedding function by implementing the `EmbeddingFunction` protocol. A custom embedding function has the following characteristics:
* It implements the `__call__` method, which accepts `Documents` (`str` or `List[str]`) and returns `Embeddings` (`List[List[float]]`).
* It optionally implements a `dimension` attribute that returns the vector dimension.
## Prerequisites
Before you create a custom embedding function, make sure that it meets the following requirements:
* Implement the `__call__` method:
  * Input: a single document (`str`) or multiple documents (`List[str]`).
  * Output: the embedding vectors, of type `List[List[float]]`.
  * All vectors in the output must have the same dimension.
* (Recommended) Implement the `dimension` attribute:
  * Output: the dimension of the vectors generated by this function, of type `int`.
  * Exposing the dimension makes it easy to verify consistency when you create collections.
* Handle special cases (see the skeleton after this list):
  * Convert a single string input to a list.
  * Return an empty list for empty input.
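As a reference before the full examples, here is a minimal skeleton that satisfies these requirements. It is a sketch only: the fixed zero vectors and the placeholder dimension stand in for real model output:
```python
from typing import List, Union

from pyseekdb import EmbeddingFunction

Documents = Union[str, List[str]]
Embeddings = List[List[float]]

class SkeletonEmbeddingFunction(EmbeddingFunction[Documents]):
    """Minimal sketch: replace the zero vectors with real model output."""

    @property
    def dimension(self) -> int:
        return 4  # placeholder dimension for illustration

    def __call__(self, input: Documents) -> Embeddings:
        # Convert a single string input to a list
        if isinstance(input, str):
            input = [input]
        # Return an empty list for empty input
        if not input:
            return []
        # Every output vector has the same dimension
        return [[0.0] * self.dimension for _ in input]
```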
## Example 1: Sentence Transformer custom embedding function
```python
from typing import List, Union

from pyseekdb import EmbeddingFunction, Client, HNSWConfiguration

Documents = Union[str, List[str]]
Embeddings = List[List[float]]

class SentenceTransformerCustomEmbeddingFunction(EmbeddingFunction[Documents]):
    """
    A custom embedding function using sentence-transformers with a specific model.
    """
    def __init__(self, model_name: str = "all-mpnet-base-v2", device: str = "cpu"):  # TODO: your own model name and device
        """
        Initialize the sentence-transformer embedding function.

        Args:
            model_name: Name of the sentence-transformers model to use
            device: Device to run the model on ('cpu' or 'cuda')
        """
        self.model_name = model_name
        self.device = device
        self._model = None
        self._dimension = None

    def _ensure_model_loaded(self):
        """Lazily load the embedding model"""
        if self._model is None:
            try:
                from sentence_transformers import SentenceTransformer
                self._model = SentenceTransformer(self.model_name, device=self.device)
                # Get the dimension from the model
                test_embedding = self._model.encode(["test"], convert_to_numpy=True)
                self._dimension = len(test_embedding[0])
            except ImportError:
                raise ImportError(
                    "sentence-transformers is not installed. "
                    "Please install it with: pip install sentence-transformers"
                )

    @property
    def dimension(self) -> int:
        """Get the dimension of embeddings produced by this function"""
        self._ensure_model_loaded()
        return self._dimension

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings for the given documents.

        Args:
            input: Single document (str) or list of documents (List[str])

        Returns:
            List of embedding vectors
        """
        self._ensure_model_loaded()
        # Handle single string input
        if isinstance(input, str):
            input = [input]
        # Handle empty input
        if not input:
            return []
        # Generate embeddings
        embeddings = self._model.encode(
            input,
            convert_to_numpy=True,
            show_progress_bar=False
        )
        # Convert numpy arrays to lists
        return [embedding.tolist() for embedding in embeddings]

# Use the custom embedding function
client = Client()

# Initialize the embedding function with the all-mpnet-base-v2 model (768 dimensions)
ef = SentenceTransformerCustomEmbeddingFunction(
    model_name='all-mpnet-base-v2',  # TODO: your own model name
    device='cpu'  # TODO: your own device
)

# Get the dimension from the embedding function
dimension = ef.dimension
print(f"Embedding dimension: {dimension}")

# Create a collection with a matching dimension
collection_name = "my_collection"
if client.has_collection(collection_name):
    client.delete_collection(collection_name)
collection = client.create_collection(
    name=collection_name,
    configuration=HNSWConfiguration(dimension=dimension, distance='cosine'),
    embedding_function=ef
)

# Test the embedding function
print("\nTesting embedding function...")
test_documents = ["Hello world", "This is a test", "Sentence transformers are great"]
embeddings = ef(test_documents)
print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding has {len(embeddings[0])} dimensions")

# Add some documents to the collection
print("\nAdding documents to collection...")
collection.add(
    ids=["1", "2", "3"],
    documents=test_documents,
    metadatas=[{"source": "test1"}, {"source": "test2"}, {"source": "test3"}]
)

# Query the collection
print("\nQuerying collection...")
results = collection.query(
    query_texts="Hello",
    n_results=2
)
print("\nQuery results:")
for i in range(len(results['ids'][0])):
    print(f"ID: {results['ids'][0][i]}")
    print(f"Document: {results['documents'][0][i]}")
    print(f"Distance: {results['distances'][0][i]}")
    print()

# Clean up
client.delete_collection(name=collection_name)
print("Test completed successfully!")
```
## Example 2: OpenAI embedding function
```python
from typing import List, Union
import os

from openai import OpenAI
from pyseekdb import EmbeddingFunction
import pyseekdb

Documents = Union[str, List[str]]
Embeddings = List[List[float]]

class QWenEmbeddingFunction(EmbeddingFunction[Documents]):
    """
    A custom embedding function using OpenAI's embedding API.
    """
    def __init__(self, model_name: str = "", api_key: str = ""):  # TODO: your own model name and api key
        """
        Initialize the OpenAI embedding function.

        Args:
            model_name: Name of the OpenAI embedding model
            api_key: OpenAI API key (if not provided, uses the OPENAI_API_KEY env var)
        """
        self.model_name = model_name
        self.api_key = api_key or os.environ.get('OPENAI_API_KEY')  # TODO: your own api key
        if not self.api_key:
            raise ValueError("OpenAI API key is required")
        self._dimension = 1024  # TODO: your own dimension

    @property
    def dimension(self) -> int:
        """Get the dimension of embeddings produced by this function"""
        if self._dimension is None:
            # Call the API to get the dimension (or use known values)
            raise ValueError("Dimension not set for this model")
        return self._dimension

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings using the OpenAI API.

        Args:
            input: Single document (str) or list of documents (List[str])

        Returns:
            List of embedding vectors
        """
        # Handle single string input
        if isinstance(input, str):
            input = [input]
        # Handle empty input
        if not input:
            return []
        # Call the OpenAI API
        client = OpenAI(
            api_key=self.api_key,
            base_url=""  # TODO: your own base url
        )
        response = client.embeddings.create(
            model=self.model_name,
            input=input
        )
        # Extract the embeddings
        embeddings = [item.embedding for item in response.data]
        return embeddings

# Use the custom embedding function
collection_name = "my_collection"
ef = QWenEmbeddingFunction()
client = pyseekdb.Client()
if client.has_collection(collection_name):
    client.delete_collection(collection_name)
collection = client.create_collection(
    name=collection_name,
    embedding_function=ef
)
collection.add(
    ids=["1", "2", "3"],
    documents=["Hello", "World", "Hello World"],
    metadatas=[{"tag": "A"}, {"tag": "B"}, {"tag": "C"}]
)
results = collection.query(
    query_texts="Hello",
    n_results=2
)
for i in range(len(results['ids'][0])):
    print(results['ids'][0][i])
    print(results['documents'][0][i])
    print(results['metadatas'][0][i])
    print(results['distances'][0][i])
    print()
client.delete_collection(name=collection_name)
```

View File

@@ -0,0 +1,41 @@
---
slug: /using-custom-embedding-functions-of-api
---
# Use a custom embedding function
After you create a custom embedding function, you can use it when you create or get a collection.
Here is an example:
```python
import pyseekdb
from pyseekdb import HNSWConfiguration

# Create a client
client = pyseekdb.Client()

# Create a collection with a custom embedding function
# (SentenceTransformerCustomEmbeddingFunction is defined in
# "Create a custom embedding function")
ef = SentenceTransformerCustomEmbeddingFunction()
collection = client.create_collection(
    name="my_collection",
    configuration=HNSWConfiguration(dimension=ef.dimension, distance='cosine'),
    embedding_function=ef
)

# Get the collection with the same custom embedding function
collection = client.get_collection("my_collection", embedding_function=ef)

# Use the collection - documents will be automatically embedded
collection.add(
    ids=["doc1", "doc2"],
    documents=["Document 1", "Document 2"],  # Vectors auto-generated
    metadatas=[{"tag": "A"}, {"tag": "B"}]
)

# Query with texts - query vectors auto-generated
results = collection.query(
    query_texts=["my query"],
    n_results=10
)
```
```