Initial commit

Zhongwei Li
2025-11-30 08:44:54 +08:00
commit eb309b7b59
133 changed files with 21979 additions and 0 deletions


@@ -0,0 +1,193 @@
---
sidebar_label: Jina AI
slug: /jina
---
# Integrate seekdb vector search with Jina AI
seekdb supports vector data storage, vector indexes, and embedding-based vector search. You can store vectorized data in seekdb and then run similarity searches over it.
Jina AI is an AI platform focused on multimodal search and vector search. It offers core components and tools for building enterprise-grade Retrieval-Augmented Generation (RAG) applications based on multimodal search, helping organizations and developers create advanced search-driven generative AI solutions.
## Prerequisites
* You have deployed seekdb.
* You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.
* You have installed Python 3.11 or later.
* You have installed required dependencies:
```shell
python3 -m pip install pyobvector requests sqlalchemy
```
## Step 1: Obtain the database connection information
Contact your seekdb deployment engineer or administrator to obtain the database connection string. For example:
```shell
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```
**Parameters:**
* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$database_name`: The name of the database to access.
:::tip
The connected user must have <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::
* `$user_name`: The username for connecting to the database.
* `$password`: The password for the account.
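**Here is an example** (the values below are placeholders; replace them with your actual connection information):
```shell
obclient -hxxx.xxx.xxx.xxx -P2881 -utest_user001 -p****** -Dtest
```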
## Step 2: Build your AI assistant
### Set your Jina AI API key as an environment variable
Get your [Jina AI API key](https://jina.ai/api-dashboard/reader) and configure it, along with your seekdb connection details, as environment variables:
```shell
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
```
### Example code snippets
#### Get embeddings from Jina AI
Jina AI offers several embedding models. You can choose the one that best fits your needs.
| Model | Parameter size | Embedding dimension | Supported text |
| --- | --- | --- | --- |
| [jina-embeddings-v3](https://zilliz.com/ai-models/jina-embeddings-v3) | 570M | Flexible (default: 1024) | Multilingual text embeddings; supports 94 languages in total |
| [jina-embeddings-v2-small-en](https://zilliz.com/ai-models/jina-embeddings-v2-small-en) | 33M | 512 | English monolingual embeddings |
| [jina-embeddings-v2-base-en](https://zilliz.com/ai-models/jina-embeddings-v2-base-en) | 137M | 768 | English monolingual embeddings |
| [jina-embeddings-v2-base-zh](https://zilliz.com/ai-models/jina-embeddings-v2-base-zh) | 161M | 768 | Chinese-English bilingual embeddings |
| [jina-embeddings-v2-base-de](https://zilliz.com/ai-models/jina-embeddings-v2-base-de) | 161M | 768 | German-English bilingual embeddings |
| [jina-embeddings-v2-base-code](https://zilliz.com/ai-models/jina-embeddings-v2-base-code) | 161M | 768 | English and programming languages |
Here is an example using `jina-embeddings-v3`. The following helper function, `generate_embeddings`, calls the Jina AI embedding API:
```python
import os
import requests
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance

JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')

# Step 1. Text data vectorization
def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    response_json = response.json()
    return response_json['data'][0]['embedding']

TEXTS = [
    'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.',
    'OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.',
    'OceanBase is a native distributed relational database that supports HTAP hybrid transaction analysis and processing. It features enterprise-level characteristics such as high availability, transparent scalability, and multi-tenancy, and is compatible with MySQL/Oracle protocols.'
]

data = []
for text in TEXTS:
    # Generate the embedding for the text via the Jina AI API.
    embedding = generate_embeddings(text)
    data.append({
        'content': text,
        'content_vec': embedding
    })

print(f"Successfully processed {len(data)} texts")
```
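Before creating the table in the next step, you can optionally confirm that the returned vectors have the 1024 dimensions expected by the `VECTOR(1024)` column. This is a small illustrative sanity check, not part of the original script:
```python
# Optional sanity check (illustrative): jina-embeddings-v3 returns 1024-dimensional
# vectors by default, which must match the VECTOR(1024) column defined in the next step.
assert all(len(item['content_vec']) == 1024 for item in data), "Unexpected embedding dimension"
```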
#### Define the vector table structure and store vectors in seekdb
Create a table called `jinaai_oceanbase_demo_documents` with columns for the text (`content`) and the embedding vector (`content_vec`), create an HNSW vector index on the vector column, and then insert the vector data into seekdb:
```python
# Step 2. Connect to seekdb.
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')

client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME
)

# Step 3. Create the vector table.
table_name = "jinaai_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(500), nullable=False),
    Column("content_vec", VECTOR(1024))
]

# Create the vector index.
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)

client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)

print('- Inserting data into seekdb...')
client.insert(table_name, data=data)
```
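To confirm that the rows were written, you can optionally run a quick count through the same client. This is a small illustrative check using `perform_raw_text_sql`, the same helper used by the other examples in this guide:
```python
# Optional check (illustrative): count the rows that were just inserted.
cursor = client.perform_raw_text_sql(f"SELECT COUNT(*) FROM {table_name}")
print(f"Rows in {table_name}: {cursor.fetchone()[0]}")
```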
#### Semantic search
Use the Jina AI embedding API to generate an embedding for your query text. Then, search for the most relevant document by calculating the cosine distance between the query embedding and each embedding in the vector table:
```python
# Step 4. Query the most relevant document based on the query.
query = 'What is OceanBase?'

# Generate the embedding for the query via the Jina AI API.
query_embedding = generate_embeddings(query)

res = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,  # Use the cosine distance function.
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)

print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f'- ID: {row[0]}\n'
          f'  content: {row[1]}\n'
          f'  distance: {row[2]}')
```
#### Expected result
```plain
- ID: 2
content: OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.
distance: 0.14733879001870276
```


@@ -0,0 +1,228 @@
---
sidebar_label: OpenAI
slug: /openai
---
# OpenAI
OpenAI is an artificial intelligence company that has developed several large language models. These models excel at understanding and generating natural language, making them highly effective for tasks such as text generation, answering questions, and engaging in conversations. Access to these models is available through an API.
seekdb offers features such as vector storage, vector indexing, and embedding-based vector search. By using OpenAI's API, you can convert data into vectors, store these vectors in seekdb, and then take advantage of seekdb's vector search capabilities to find relevant data.
## Prerequisites
* You have deployed seekdb.
* You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.
* You have installed [Python 3.9 or later](https://www.python.org/downloads/) and [pip](https://pip.pypa.io/en/stable/installation/).
* You have installed [Poetry](https://python-poetry.org/docs/), [Pyobvector](https://github.com/oceanbase/pyobvector), and the OpenAI SDK. The installation commands are as follows:
```shell
python3 -m pip install poetry
python3 -m pip install pyobvector
python3 -m pip install openai
```
* You have obtained an [OpenAI API key](https://platform.openai.com/api-keys).
## Step 1: Obtain the connection string of seekdb
Contact the seekdb deployment engineer or administrator to obtain the connection string of seekdb, for example:
```shell
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```
**Parameters:**
* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$database_name`: The name of the database to be accessed.
:::tip
The user for connection must have the <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::
* `$user_name`: The database account.
* `$password`: The password of the account.
**Here is an example:**
```shell
obclient -hxxx.xxx.xxx.xxx -P2881 -utest_user001 -p****** -Dtest
```
## Step 2: Register an LLM account
Obtain an OpenAI API key.
1. Log in to the [OpenAI](https://platform.openai.com/) platform.
2. Click **API Keys** in the upper-right corner.
3. Click **Create API Key**.
4. Specify the required information and click **Create API Key**.
Then set the API key as an environment variable.
* For a Unix-based system such as Ubuntu or macOS, you can run the following command in a terminal:
```shell
export OPENAI_API_KEY='your-api-key'
```
* For a Windows system, you can run the following command in Command Prompt:
```shell
set OPENAI_API_KEY=your-api-key
```
You must replace `your-api-key` with the actual OpenAI API key.
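To confirm that the key is visible to the SDK, you can run a minimal check like the following (an illustrative sketch; the OpenAI Python SDK reads `OPENAI_API_KEY` from the environment by default):
```python
import os
from openai import OpenAI

# Illustrative check: the OpenAI SDK reads OPENAI_API_KEY from the environment by default.
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
client = OpenAI()
print("OpenAI client initialized")
```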
## Step 3: Store vector data in seekdb
1. Prepare test data.
Download the [CSV file](https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240827/srxyhu/fine_food_reviews.csv) that already contains the vectorized data. This CSV file includes 1,000 food review entries, and the last column contains the vector values. Therefore, you do not need to calculate the vectors yourself. If you want to recalculate the embeddings for the "embedding" column (the vector column), you can use the following code to generate a new CSV file:
```python
from openai import OpenAI
import pandas as pd

input_datapath = "./fine_food_reviews.csv"

client = OpenAI()

# Here the text-embedding-ada-002 model is used. You can change the model as needed.
def embedding_text(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see
    # https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = client.embeddings.create(input=text, model=model)
    return res.data[0].embedding

df = pd.read_csv(input_datapath, index_col=0)

# It takes a few minutes to generate the CSV file by calling the OpenAI Embedding API row by row.
df["embedding"] = df.combined.apply(embedding_text)

output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
```
2. Run the following script to insert the test data into seekdb. The script must be located in the same directory as the test data.
```python
import os
import csv
import json
from pyobvector import *
from sqlalchemy import Column, Integer, String

# Connect to seekdb by using pyobvector and replace the at sign (@) in the username and password with %40, if any.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# The test dataset has been vectorized and is stored in the same directory as the Python script by default.
# If you vectorize the dataset again, specify the new file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)

# Define columns. The last column is a vector column.
cols = [
    Column('id', Integer, primary_key=True, autoincrement=False),
    Column('product_id', String(256), nullable=True),
    Column('user_id', String(256), nullable=True),
    Column('score', Integer, nullable=True),
    Column('summary', String(2048), nullable=True),
    Column('text', String(8192), nullable=True),
    Column('combined', String(8192), nullable=True),
    Column('n_tokens', Integer, nullable=True),
    Column('embedding', VECTOR(1536))
]

# Define the table name.
table_name = 'fine_food_reviews'

# If the table does not exist, create it.
if not client.check_table_exists(table_name):
    client.create_table(table_name, columns=cols)
    # Create an index on the vector column.
    client.create_index(
        table_name=table_name,
        is_vec_index=True,
        index_name='vidx',
        column_names=['embedding'],
        vidx_params='distance=l2, type=hnsw, lib=vsag',
    )

# Open and read the CSV file.
with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the header line.
    headers = next(csvreader)
    print("Headers:", headers)

    batch = []  # Buffer rows and insert 10 rows into the database each time.
    for i, row in enumerate(csvreader):
        # The CSV file contains nine columns: `id`, `product_id`, `user_id`, `score`, `summary`,
        # `text`, `combined`, `n_tokens`, and `embedding`.
        if not row:
            break
        food_review_line = {
            'id': row[0], 'product_id': row[1], 'user_id': row[2], 'score': row[3],
            'summary': row[4], 'text': row[5], 'combined': row[6], 'n_tokens': row[7],
            'embedding': json.loads(row[8])
        }
        batch.append(food_review_line)
        # Insert 10 rows each time.
        if (i + 1) % 10 == 0:
            client.insert(table_name, batch)
            batch = []  # Clear the buffer.

# Insert the remaining rows, if any.
if batch:
    client.insert(table_name, batch)

# Check the data in the table and make sure that all data has been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of inserted rows: {result[0]}")
```
## Step 4: Query seekdb data
1. Save the following Python script as `openAIQuery.py`.
```python
import sys
from pyobvector import *
from sqlalchemy import func
from openai import OpenAI

# Obtain the command-line argument.
if len(sys.argv) != 2:
    print("Enter a query statement.")
    sys.exit()
queryStatement = sys.argv[1]

# Connect to seekdb by using pyobvector and replace the at sign (@) in the username and password with %40, if any.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

openAIclient = OpenAI()

# Define the function for generating text vectors.
def generate_embeddings(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see
    # https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = openAIclient.embeddings.create(input=text, model=model)
    return res.data[0].embedding

def query_ob(query, tableName, vector_name="embedding", top_k=1):
    embedding = generate_embeddings(query)
    # Perform an approximate nearest neighbor search (ANNS).
    res = client.ann_search(
        table_name=tableName,
        vec_data=embedding,
        vec_column_name=vector_name,
        distance_func=func.l2_distance,
        topk=top_k,
        output_column_names=['combined']
    )
    for row in res:
        print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))

# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 'embedding', 1)
```
2. Run the script with your question to get an answer.
```shell
python3 openAIQuery.py 'pet food'
```
The expected result is as follows:
```shell
Crack for dogs.: These thing are like crack for dogs. I am not sure of the make-up but the doggies sure love them.
```

View File

@@ -0,0 +1,205 @@
---
sidebar_label: Qwen
slug: /qwen
---
# Qwen
[Tongyi Qianwen (Qwen)](https://tongyi.aliyun.com) is a large language model (LLM) developed by Alibaba Cloud for interpreting and analyzing user inputs. You can access the Qwen API through [Alibaba Cloud Model Studio](https://bailian.console.alibabacloud.com/#/home).
seekdb offers features such as vector storage, vector indexing, and embedding-based vector search. By using Qwen's API, you can convert data into vectors, store these vectors in seekdb, and then take advantage of seekdb's vector search capabilities to find relevant data.
## Prerequisites
* You have deployed seekdb.
* You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.
* You have installed [Python 3.9 or later](https://www.python.org/downloads/) and [pip](https://pip.pypa.io/en/stable/installation/).
* You have installed [Poetry](https://python-poetry.org/docs/), [Pyobvector](https://github.com/oceanbase/pyobvector), and the DashScope SDK. The installation commands are as follows:
```shell
pip install poetry
pip install pyobvector
pip install dashscope
```
* You have obtained the [Qwen API key](https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key).
## Step 1: Obtain the connection string of seekdb
Contact the seekdb deployment engineer or administrator to obtain the connection string of seekdb, for example:
```shell
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```
**Parameters:**
* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$database_name`: The name of the database to be accessed.
:::tip
The user for connection must have the <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::
* `$user_name`: The database account.
* `$password`: The password of the account.
## Step 2: Configure the environment variable for the Qwen API key
For a Unix-based system (such as Ubuntu or macOS), run the following command in the terminal:
```shell
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```
For Windows, run the following command in the command prompt:
```shell
set DASHSCOPE_API_KEY=YOUR_DASHSCOPE_API_KEY
```
You must replace `YOUR_DASHSCOPE_API_KEY` with the actual Qwen API key.
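To confirm that the key is picked up, you can run a minimal check like the following (an illustrative sketch; the DashScope SDK reads `DASHSCOPE_API_KEY` from the environment by default, and you can also assign `dashscope.api_key` explicitly):
```python
import os
import dashscope

# Illustrative check: the DashScope SDK reads DASHSCOPE_API_KEY from the environment by default.
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
assert dashscope.api_key, "DASHSCOPE_API_KEY is not set"
print("DashScope API key configured")
```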
## Step 3: Store the vector data in seekdb
1. Prepare the test data.
Download the [CSV file](https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240827/srxyhu/fine_food_reviews.csv) that already contains the vectorized data. This CSV file includes 1,000 food review entries, and the last column contains the vector values. Therefore, you do not need to calculate the vectors yourself. If you want to recalculate the embeddings for the "embedding" column (the vector column), you can use the following code to generate a new CSV file:
```python
import dashscope
import pandas as pd

input_datapath = "./fine_food_reviews.csv"

# Here the text_embedding_v1 model is used. You can change the model as needed.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=dashscope.TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

df = pd.read_csv(input_datapath, index_col=0)

# It takes a few minutes to generate the CSV file by calling the Tongyi Qianwen Embedding API row by row.
df["embedding"] = df.combined.apply(generate_embeddings)

output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
```
2. Run the following script to insert the test data into seekdb. The script must be located in the same directory as the test data.
```python
import os
import csv
import json
from pyobvector import *
from sqlalchemy import Column, Integer, String

# Use pyobvector to connect to seekdb. If @ is in the username or password, replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# The test dataset is prepared in advance and has been vectorized. By default, it is placed in the same
# directory as the Python script. If you have vectorized it yourself, replace it with the corresponding file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)

# Define the columns. The vector column is the last field.
cols = [
    Column('id', Integer, primary_key=True, autoincrement=False),
    Column('product_id', String(256), nullable=True),
    Column('user_id', String(256), nullable=True),
    Column('score', Integer, nullable=True),
    Column('summary', String(2048), nullable=True),
    Column('text', String(8192), nullable=True),
    Column('combined', String(8192), nullable=True),
    Column('n_tokens', Integer, nullable=True),
    Column('embedding', VECTOR(1536))
]

# Table name.
table_name = 'fine_food_reviews'

# If the table does not exist, create it.
if not client.check_table_exists(table_name):
    client.create_table(table_name, columns=cols)
    # Create an index for the vector column.
    client.create_index(
        table_name=table_name,
        is_vec_index=True,
        index_name='vidx',
        column_names=['embedding'],
        vidx_params='distance=l2, type=hnsw, lib=vsag',
    )

# Open and read the CSV file.
with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the header row.
    headers = next(csvreader)
    print("Headers:", headers)

    batch = []  # Buffer rows and insert them into the database every 10 rows.
    for i, row in enumerate(csvreader):
        # The CSV file has 9 fields: id, product_id, user_id, score, summary, text, combined, n_tokens, embedding.
        if not row:
            break
        food_review_line = {
            'id': row[0], 'product_id': row[1], 'user_id': row[2], 'score': row[3],
            'summary': row[4], 'text': row[5], 'combined': row[6], 'n_tokens': row[7],
            'embedding': json.loads(row[8])
        }
        batch.append(food_review_line)
        # Insert data every 10 rows.
        if (i + 1) % 10 == 0:
            client.insert(table_name, batch)
            batch = []  # Clear the buffer.

# Insert the remaining rows (if any).
if batch:
    client.insert(table_name, batch)

# Check the data in the table to ensure that all data has been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of inserted rows: {result[0]}")
```
## Step 4: Query seekdb data
1. Save the following Python script as `query.py`.
```python
import sys
from pyobvector import *
from sqlalchemy import func
import dashscope

# Get the command-line argument.
if len(sys.argv) != 2:
    print("Please enter a query statement.")
    sys.exit()
queryStatement = sys.argv[1]

# Use pyobvector to connect to seekdb. If the username or password contains @, replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# Define a function to generate text vectors.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=dashscope.TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

def query_ob(query, tableName, vector_name="embedding", top_k=1):
    embedding = generate_embeddings(query)
    # Perform an approximate nearest neighbor search (ANNS).
    res = client.ann_search(
        table_name=tableName,
        vec_data=embedding,
        vec_column_name=vector_name,
        distance_func=func.l2_distance,
        topk=top_k,
        output_column_names=['combined']
    )
    for row in res:
        print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))

# Table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 'embedding', 1)
```
2. Run the script with your question to obtain the related answer.
```shell
python3 query.py 'pet food'
```
The expected result is as follows:
```shell
This is so good!: I purchased this after my sister sent a small bag to me in a gift box. I loved it so much I wanted to find it to buy for myself and keep it around. I always look on Amazon because you can find everything here and true enough, I found this wonderful candy. It is nice to keep in your purse for when you are out and about and get a dry throat or a tickle in the back of your throat. It is also nice to have in a candy dish at home for guests to try.
```