---
sidebar_label: Jina AI
slug: /jina
---

# Integrate seekdb vector search with Jina AI

seekdb supports vector data storage, vector indexes, and embedding vector search. You can store vectorized data in seekdb for further search.

Jina AI is an AI platform focused on multimodal search and vector search. It offers core components and tools for building enterprise-grade Retrieval-Augmented Generation (RAG) applications based on multimodal search, helping organizations and developers create advanced search-driven generative AI solutions.

## Prerequisites

* You have deployed seekdb.
* You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.
* You have installed Python 3.11 or later.
* You have installed the required dependencies:

    ```shell
    python3 -m pip install pyobvector requests sqlalchemy
    ```

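Before continuing, you may want to confirm that the interpreter and packages are in place. The following is a minimal sketch using only the standard library; the helper names `missing_modules` and `python_is_supported` are illustrative and not part of seekdb or pyobvector.

```python
import importlib.util
import sys

# Import names of the packages installed in the prerequisites.
REQUIRED_MODULES = ["pyobvector", "requests", "sqlalchemy"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 11)):
    """Return True when the interpreter meets the minimum version listed above."""
    return tuple(version_info[:2]) >= minimum


print("Python 3.11 or later:", python_is_supported())
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty "Missing modules" list and `True` for the version check indicate that the prerequisites on the Python side are met.
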
## Step 1: Obtain the database connection information

Contact your seekdb deployment engineer or administrator to obtain the database connection string. For example:

```shell
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. The default is `2881`.
* `$database_name`: The name of the database to access.

    :::tip
    The connected user must have <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
    :::

* `$user_name`: The username for connecting to the database.
* `$password`: The password for the account.

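The Python examples later in this guide pass these values to `ObVecClient` as `uri`, `user`, `password`, and `db_name`. The sketch below shows one way to map the `obclient` parameters onto those arguments. It assumes the `uri` takes the form `host:port`; the helper `build_connection_settings` and the sample values (`127.0.0.1`, `root`, `test`) are hypothetical placeholders, so substitute the values from your own deployment.

```python
def build_connection_settings(host, port, user_name, password, database_name):
    """Map the obclient parameters onto the keyword arguments used later in this guide."""
    return {
        "uri": f"{host}:{port}",  # assumption: the client expects "host:port"
        "user": user_name,
        "password": password,
        "db_name": database_name,
    }


# Placeholder values for illustration only.
settings = build_connection_settings("127.0.0.1", 2881, "root", "", "test")
print(settings["uri"])
```

The value printed for `uri` is what you would export as `OCEANBASE_DATABASE_URL` in the next step, under the same `host:port` assumption.
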
## Step 2: Build your AI assistant

### Set your Jina AI API key as an environment variable

Get your [Jina AI API key](https://jina.ai/api-dashboard/reader) and configure it, along with your seekdb connection details, as environment variables:

```shell
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
```

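Because `os.getenv` returns `None` for an unset variable, a missing export tends to surface later as a confusing connection or authentication error. As a hedged sketch, a script could check the variables up front; the helper `missing_env_vars` is illustrative and not part of any library.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_ENV_VARS = [
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
    "JINAAI_API_KEY",
]


def missing_env_vars(names, environ=None):
    """Return the names that are not defined in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if name not in environ]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

The check only tests whether a variable is defined, so an intentionally empty password still passes.
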
### Example code snippets

#### Get embeddings from Jina AI

Jina AI offers several embedding models. You can choose the one that best fits your needs.

| Model | Parameter size | Embedding dimension | Supported text |
| --- | --- | --- | --- |
| [jina-embeddings-v3](https://zilliz.com/ai-models/jina-embeddings-v3) | 570M | Flexible (default: 1024) | Multilingual text embeddings; supports 94 languages in total |
| [jina-embeddings-v2-small-en](https://zilliz.com/ai-models/jina-embeddings-v2-small-en) | 33M | 512 | English monolingual embeddings |
| [jina-embeddings-v2-base-en](https://zilliz.com/ai-models/jina-embeddings-v2-base-en) | 137M | 768 | English monolingual embeddings |
| [jina-embeddings-v2-base-zh](https://zilliz.com/ai-models/jina-embeddings-v2-base-zh) | 161M | 768 | Chinese-English bilingual embeddings |
| [jina-embeddings-v2-base-de](https://zilliz.com/ai-models/jina-embeddings-v2-base-de) | 161M | 768 | German-English bilingual embeddings |
| [jina-embeddings-v2-base-code](https://zilliz.com/ai-models/jina-embeddings-v2-base-code) | 161M | 768 | English and programming languages |

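The embedding dimension matters because the vector column in seekdb is declared with a fixed size, so it must match the model you pick. One way to keep the two in sync is a small lookup built from the table above. This is only a sketch: `MODEL_DIMENSIONS` and `embedding_dimension` are illustrative names, and the `jina-embeddings-v3` entry records the default size rather than every size the model supports.

```python
# Embedding dimensions copied from the model table above.
MODEL_DIMENSIONS = {
    "jina-embeddings-v3": 1024,  # default; the model supports flexible sizes
    "jina-embeddings-v2-small-en": 512,
    "jina-embeddings-v2-base-en": 768,
    "jina-embeddings-v2-base-zh": 768,
    "jina-embeddings-v2-base-de": 768,
    "jina-embeddings-v2-base-code": 768,
}


def embedding_dimension(model_name):
    """Return the embedding dimension for a model listed in the table."""
    try:
        return MODEL_DIMENSIONS[model_name]
    except KeyError:
        raise ValueError(f"Unknown model: {model_name}") from None


print(embedding_dimension("jina-embeddings-v3"))  # 1024
```

The returned value is the number you would pass to the vector column type when defining the table.
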
The following example uses `jina-embeddings-v3`. The helper function `generate_embeddings` calls the Jina AI embedding API and returns the embedding for a single text:

```python
import os
import requests
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance

JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')

# Step 1. Text data vectorization
def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'
    }

    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    # Fail early on HTTP errors such as an invalid API key.
    response.raise_for_status()
    response_json = response.json()
    return response_json['data'][0]['embedding']


TEXTS = [
    'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.',
    'OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.',
    'OceanBase is a native distributed relational database that supports HTAP hybrid transaction analysis and processing. It features enterprise-level characteristics such as high availability, transparent scalability, and multi-tenancy, and is compatible with MySQL/Oracle protocols.'
]
data = []
for text in TEXTS:
    # Generate the embedding for the text via Jina AI API.
    embedding = generate_embeddings(text)
    data.append({
        'content': text,
        'content_vec': embedding
    })

print(f"Successfully processed {len(data)} texts")
```

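The helper above sends one request per text. Since the request body already carries `input` as a list, several texts can in principle go out in one call. The sketch below is an assumption-laden variant, not the documented approach: it assumes the endpoint returns one entry in `data` per input and that each entry carries an `index` field giving its position. Check both against the Jina AI API reference before relying on it. The names `build_request_data`, `extract_embeddings`, and `generate_embeddings_batch` are illustrative.

```python
import os

JINAAI_API_URL = "https://api.jina.ai/v1/embeddings"


def build_request_data(texts, model="jina-embeddings-v3"):
    """Build the JSON body for embedding several texts in one request."""
    return {"input": list(texts), "model": model}


def extract_embeddings(response_json):
    """Return the embeddings ordered by their `index` field (assumed to be present)."""
    items = sorted(response_json["data"], key=lambda item: item.get("index", 0))
    return [item["embedding"] for item in items]


def generate_embeddings_batch(texts):
    """Request embeddings for all texts in a single API call."""
    # Imported here so the two pure helpers above work without the dependency.
    import requests

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.getenv('JINAAI_API_KEY')}",
    }
    response = requests.post(
        JINAAI_API_URL, headers=headers, json=build_request_data(texts), timeout=30
    )
    response.raise_for_status()
    return extract_embeddings(response.json())
```

With this variant, the loop over `TEXTS` could be replaced by a single call to `generate_embeddings_batch(TEXTS)` and a `zip` of the texts with the returned embeddings.
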
#### Define the vector table structure and store vectors in seekdb

Create a table named `jinaai_oceanbase_demo_documents` with a column for the text (`content`), a column for the embedding vector (`content_vec`), and an HNSW vector index on `content_vec`. Then insert the vector data into seekdb:

```python
# Step 2. Connect to seekdb.
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')

client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME
)

# Step 3. Create the vector table.
table_name = "jinaai_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(500), nullable=False),
    # The dimension matches the default output size of jina-embeddings-v3.
    Column("content_vec", VECTOR(1024))
]

# Create vector index
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)

client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)

print('- Inserting data into seekdb...')
client.insert(table_name, data=data)
```

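The table definition fixes two limits: `content` holds at most 500 characters and `content_vec` holds exactly 1024 dimensions. A row that violates either limit would be rejected at insert time, so it can help to check the rows first. The following sketch shows one possible pre-insert check; `validate_rows` is an illustrative helper, not a pyobvector function.

```python
MAX_CONTENT_LENGTH = 500  # matches String(500) in the table definition
VECTOR_DIMENSION = 1024   # matches VECTOR(1024) in the table definition


def validate_rows(rows, max_length=MAX_CONTENT_LENGTH, dimension=VECTOR_DIMENSION):
    """Return a list of problems; an empty list means every row fits the table."""
    problems = []
    for i, row in enumerate(rows):
        content_length = len(row["content"])
        if content_length > max_length:
            problems.append(
                f"row {i}: content has {content_length} characters, limit is {max_length}"
            )
        vector_length = len(row["content_vec"])
        if vector_length != dimension:
            problems.append(
                f"row {i}: vector has {vector_length} dimensions, expected {dimension}"
            )
    return problems
```

Calling `validate_rows(data)` before `client.insert(...)` and stopping when the list is non-empty gives a clearer error message than a failed insert.
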
#### Semantic search

Use the Jina AI embedding API to generate an embedding for your query text. Then, search for the most relevant document by calculating the cosine distance between the query embedding and each embedding in the vector table:

```python
# Step 4. Query the most relevant document based on the query.
query = 'What is OceanBase?'
# Generate the embedding for the query via Jina AI API.
query_embedding = generate_embeddings(query)

res = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,  # Use cosine distance function
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)

print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f' - ID: {row[0]}\n'
          f' content: {row[1]}\n'
          f' distance: {row[2]}')
```

#### Expected result

```plain
 - ID: 2
 content: OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.
 distance: 0.14733879001870276
```

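To interpret the `distance` value, it helps to recall how cosine distance is usually defined: one minus the cosine similarity of the two vectors, so values near 0 mean the vectors point in almost the same direction. The sketch below computes that definition in plain Python. It assumes seekdb follows this common convention, and the function name `cosine_distance_reference` is chosen here to avoid clashing with the `cosine_distance` imported from pyobvector.

```python
import math


def cosine_distance_reference(a, b):
    """Compute 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance_reference([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 1.0
```

Under this definition, a distance of about 0.147 corresponds to a cosine similarity of about 0.853 between the query and the returned document.
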