Initial commit

skills/seekdb-docs/SKILL.md (new file, 18 lines)
@@ -0,0 +1,18 @@
---
name: seekdb-docs
description: Provides documentation and a knowledge base for the seekdb database.
---

# seekdb Documentation Index Tool

This tool provides documentation for the seekdb database. All seekdb-related knowledge and information can be found here.

## Directory Structure

The official-docs directory contains all database documentation. Each file is named after its title, and all titles are in English.

## Query All Documents

You can view all document file names by checking the file tree under official-docs, for example:
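A minimal sketch, assuming a POSIX shell with this skill's directory (`skills/seekdb-docs/`) as the working directory:

```shell
# List every markdown document under official-docs
find official-docs -name '*.md' | sort
```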

## Find Documents by Index

You can locate a document through the category index files (such as get-started.md, develop.md, etc.), each of which maps file paths to short descriptions. For example, to find the vector search documentation, open develop.md and look under the "Vector Search" heading.

skills/seekdb-docs/develop.md (new file, 217 lines)
@@ -0,0 +1,217 @@
# Development Guide

This category contains development guides and technical documentation for the SeekDB database.

## File List

### Vector Search

- **File Path**: `official-docs/200.develop/100.vector-search/100.vector-search-overview/100.vector-search-intro.md`
- **Description**: Vector search introduction

- **File Path**: `official-docs/200.develop/100.vector-search/100.vector-search-overview/300.vector-search-workflow.md`
- **Description**: Vector search workflow

- **File Path**: `official-docs/200.develop/100.vector-search/150.vector-embedding-technology.md`
- **Description**: Vector embedding technology

- **File Path**: `official-docs/200.develop/100.vector-search/160.store-vector-data.md`
- **Description**: Store vector data

- **File Path**: `official-docs/200.develop/100.vector-search/200.vector-index/200.dense-vector-index.md`
- **Description**: Dense vector index

- **File Path**: `official-docs/200.develop/100.vector-search/200.vector-index/300.hybrid-vector-index.md`
- **Description**: Hybrid vector index

- **File Path**: `official-docs/200.develop/100.vector-search/200.vector-index/400.sparse-vector-index/100.in-memory-sparse-vector-index.md`
- **Description**: In-memory sparse vector index

- **File Path**: `official-docs/200.develop/100.vector-search/250.vector-function.md`
- **Description**: Vector function

- **File Path**: `official-docs/200.develop/100.vector-search/300.vector-similarity-search.md`
- **Description**: Vector similarity search

- **File Path**: `official-docs/200.develop/100.vector-search/700.vector-search-benchmark-test.md`
- **Description**: Vector search benchmark test

- **File Path**: `official-docs/200.develop/100.vector-search/700.vector-search-reference/100.vector-data-type.md`
- **Description**: Vector data type reference

- **File Path**: `official-docs/200.develop/100.vector-search/700.vector-search-reference/800.vector-sdk-refer.md`
- **Description**: Vector SDK reference

- **File Path**: `official-docs/200.develop/100.vector-search/700.vector-search-reference/900.vector-search-supported-clients-and-languages/100.vector-search-supported-clients-and-languages-overview.md`
- **Description**: Vector search supported clients and languages overview

- **File Path**: `official-docs/200.develop/100.vector-search/700.vector-search-reference/900.vector-search-supported-clients-and-languages/200.vector-pyobvector.md`
- **Description**: Python vector client

- **File Path**: `official-docs/200.develop/100.vector-search/700.vector-search-reference/900.vector-search-supported-clients-and-languages/300.vector-search-java-sdk.md`
- **Description**: Java vector search SDK

- **File Path**: `official-docs/200.develop/100.vector-search/800.vector-search-faq.md`
- **Description**: Vector search FAQ

### Hybrid Search

- **File Path**: `official-docs/200.develop/200.hybrid-search/100.vector-index-hybrid-search.md`
- **Description**: Vector index hybrid search

### AI Function

- **File Path**: `official-docs/200.develop/300.ai-function/100.ai-function-permission.md`
- **Description**: AI function permissions

- **File Path**: `official-docs/200.develop/300.ai-function/200.ai-function.md`
- **Description**: AI function usage guide

### MCP Server

- **File Path**: `official-docs/200.develop/400.mcp-server/400.oceanbase-mcp-server-and-ai-tool-integration-guide.md`
- **Description**: OceanBase MCP server and AI tool integration guide

### Multi-Model Data

#### JSON

- **File Path**: `official-docs/200.develop/500.multi-model/100.json/100.json-formatted-data-types.md`
- **Description**: JSON formatted data types

- **File Path**: `official-docs/200.develop/500.multi-model/100.json/200.create-a-json-value.md`
- **Description**: Create a JSON value

- **File Path**: `official-docs/200.develop/500.multi-model/100.json/300.querying-and-modifying-json-values.md`
- **Description**: Querying and modifying JSON values

- **File Path**: `official-docs/200.develop/500.multi-model/100.json/400.json-formatted-data-type-conversion.md`
- **Description**: JSON formatted data type conversion

- **File Path**: `official-docs/200.develop/500.multi-model/100.json/500.json-partial-update.md`
- **Description**: JSON partial update

- **File Path**: `official-docs/200.develop/500.multi-model/100.json/600.json-semi-struct.md`
- **Description**: JSON semi-structured data

#### Spatial Data

- **File Path**: `official-docs/200.develop/500.multi-model/200.spatial/100.spatial-data-type-overview.md`
- **Description**: Spatial data type overview

- **File Path**: `official-docs/200.develop/500.multi-model/200.spatial/200.spacial-reference-system.md`
- **Description**: Spatial reference system

- **File Path**: `official-docs/200.develop/500.multi-model/200.spatial/300.create-spatial-columns.md`
- **Description**: Create spatial columns

- **File Path**: `official-docs/200.develop/500.multi-model/200.spatial/400.create-spatial-indexes.md`
- **Description**: Create spatial indexes

- **File Path**: `official-docs/200.develop/500.multi-model/200.spatial/500.spatial-data-format.md`
- **Description**: Spatial data format

#### Character and Text

- **File Path**: `official-docs/200.develop/500.multi-model/300.char-and-text/100.char-and-varchar.md`
- **Description**: CHAR and VARCHAR types

- **File Path**: `official-docs/200.develop/500.multi-model/300.char-and-text/200.text.md`
- **Description**: TEXT type

- **File Path**: `official-docs/200.develop/500.multi-model/300.char-and-text/300.full-text-index.md`
- **Description**: Full-text index

### Python Development

- **File Path**: `official-docs/200.develop/1000.python/20.using-seekdb-in-python-mode.md`
- **Description**: Using SeekDB in Python mode

### SDK and API

- **File Path**: `official-docs/200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md`
- **Description**: pyseekdb SDK quick start

- **File Path**: `official-docs/200.develop/900.sdk/10.pyseekdb-sdk/50.sdk-samples/10.pyseekdb-simple-sample.md`
- **Description**: pyseekdb simple example

- **File Path**: `official-docs/200.develop/900.sdk/10.pyseekdb-sdk/50.sdk-samples/100.pyseekdb-hybrid-search-sample.md`
- **Description**: pyseekdb hybrid search example

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/10.api-overview.md`
- **Description**: API overview

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/100.admin-client.md`
- **Description**: Admin client API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/50.client.md`
- **Description**: Client API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/110.database/100.database-overview-of-api.md`
- **Description**: Database API overview

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/110.database/200.create-database-of-api.md`
- **Description**: Create database API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/110.database/300.get-database-of-api.md`
- **Description**: Get database API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/110.database/400.list-database-of-api.md`
- **Description**: List database API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/110.database/500.delete-database-of-api.md`
- **Description**: Delete database API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/200.collection/50.collection-overview-of-api.md`
- **Description**: Collection API overview

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/200.collection/100.create-collection-of-api.md`
- **Description**: Create collection API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/200.collection/200.get-collection-of-api.md`
- **Description**: Get collection API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/200.collection/250.get-or-create-collection-of-api.md`
- **Description**: Get or create collection API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/200.collection/300.list-collection-of-api.md`
- **Description**: List collection API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/200.collection/350.count-collection-of-api.md`
- **Description**: Count collection API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/200.collection/400.delete-collection-of-api.md`
- **Description**: Delete collection API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/300.dml/100.dml-overview-of-api.md`
- **Description**: DML API overview

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/300.dml/200.add-data-of-api.md`
- **Description**: Add data API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/300.dml/300.update-data-of-api.md`
- **Description**: Update data API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/300.dml/400.upsert-data-of-api.md`
- **Description**: Upsert data API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/300.dml/500.delete-data-of-api.md`
- **Description**: Delete data API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/400.dql/100.dql-overview-of-api.md`
- **Description**: DQL API overview

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/400.dql/200.query-interfaces-of-api.md`
- **Description**: Query interfaces API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/400.dql/300.get-interfaces-of-api.md`
- **Description**: Get interfaces API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/400.dql/400.hybrid-search-of-api.md`
- **Description**: Hybrid search API

- **File Path**: `official-docs/200.develop/900.sdk/50.apis/400.dql/500.filter-operators-of-api.md`
- **Description**: Filter operators API

- **File Path**: `official-docs/200.develop/900.sdk/60.embedding-funcations/100.default-embedding-function-of-api.md`
- **Description**: Default embedding function API

- **File Path**: `official-docs/200.develop/900.sdk/60.embedding-funcations/200.create-custim-embedding-functions-of-api.md`
- **Description**: Create custom embedding functions API

- **File Path**: `official-docs/200.develop/900.sdk/60.embedding-funcations/300.using-custom-embedding-functions-of-api.md`
- **Description**: Using custom embedding functions API

## Summary

This category contains numerous documentation files covering development-related content, including vector search, hybrid search, AI functions, multi-model data, SDKs, and APIs.

skills/seekdb-docs/get-started.md (new file, 46 lines)
@@ -0,0 +1,46 @@
# Get Started

This category contains quick start tutorials and basic operation guides for the SeekDB database.

## File List

### SeekDB Overview

- **File Path**: `official-docs/100.get-started/10.overview/10.seekdb-overview.md`
- **Description**: Introduction to what SeekDB is, its core features and capabilities

### Embedded Mode

- **File Path**: `official-docs/100.get-started/50.embedded-mode/25.using-seekdb-in-python-sdk.md`
- **Description**: Using SeekDB embedded mode in the Python SDK

### Client-Server Mode

- **File Path**: `official-docs/100.get-started/100.client-server-mode/10.deploy-seekdb-testing-environment.md`
- **Description**: Deploy a SeekDB testing environment

- **File Path**: `official-docs/100.get-started/100.client-server-mode/15.basic-sql-operations.md`
- **Description**: Basic SQL operations

- **File Path**: `official-docs/100.get-started/100.client-server-mode/30.experience-vector-search.md`
- **Description**: Experience vector search functionality

- **File Path**: `official-docs/100.get-started/100.client-server-mode/40.experience-full-text-indexing.md`
- **Description**: Experience full-text indexing functionality

- **File Path**: `official-docs/100.get-started/100.client-server-mode/50.experience-hybrid-search.md`
- **Description**: Experience hybrid search functionality

- **File Path**: `official-docs/100.get-started/100.client-server-mode/60.experience-ai-function.md`
- **Description**: Experience AI functions

- **File Path**: `official-docs/100.get-started/100.client-server-mode/70.experience-hybrid-vector-index.md`
- **Description**: Experience hybrid vector index functionality

- **File Path**: `official-docs/100.get-started/100.client-server-mode/80.experience-vibe-coding-paradigm-with-cursor-agent-oceanbase-mcp.md`
- **Description**: Experience the Vibe coding paradigm using Cursor Agent and OceanBase MCP

### Build AI Applications

- **File Path**: `official-docs/100.get-started/150.build-ai-apps/90.build-ai-apps.md`
- **Description**: Guide to building AI applications

## Summary

This category contains **11** documentation files covering SeekDB quick start, basic operations, feature experiences, and AI application building.

skills/seekdb-docs/guides.md (new file, 64 lines)
@@ -0,0 +1,64 @@
# Operations Guide

This category contains deployment, operations, and reference documentation for the SeekDB database.

## File List

### Deployment

- **File Path**: `official-docs/400.guides/400.deploy/50.deploy-overview.md`
- **Description**: Deployment overview

- **File Path**: `official-docs/400.guides/400.deploy/100.prepare-servers.md`
- **Description**: Prepare servers

- **File Path**: `official-docs/400.guides/400.deploy/600.python-seekdb.md`
- **Description**: Python SeekDB deployment

- **File Path**: `official-docs/400.guides/400.deploy/700.server-mode/100.deploy-by-systemd.md`
- **Description**: Deploy using systemd

- **File Path**: `official-docs/400.guides/400.deploy/700.server-mode/200.deploy-by-docker.md`
- **Description**: Deploy using Docker

- **File Path**: `official-docs/400.guides/400.deploy/700.server-mode/300.deploy-oceanbase-desktop.md`
- **Description**: Deploy OceanBase Desktop

- **File Path**: `official-docs/400.guides/400.deploy/500.environment-and-configuration-checks/`
- **Description**: Environment and configuration check related documentation

### OBShell

- **File Path**: `official-docs/400.guides/1000.obshell/100.obshell-overview.md`
- **Description**: OBShell overview

- **File Path**: `official-docs/400.guides/1000.obshell/300.obshell-clients/100.agent-commands.md`
- **Description**: Agent commands

- **File Path**: `official-docs/400.guides/1000.obshell/300.obshell-clients/200.seekdb-commands.md`
- **Description**: SeekDB commands

- **File Path**: `official-docs/400.guides/1000.obshell/300.obshell-clients/300.utilities-commands.md`
- **Description**: Utility commands

- **File Path**: `official-docs/400.guides/1000.obshell/350.obshell-dashboard/`
- **Description**: OBShell Dashboard related documentation

- **File Path**: `official-docs/400.guides/1000.obshell/900.configure-monitor.md`
- **Description**: Configure monitoring

- **File Path**: `official-docs/400.guides/1000.obshell/1000.error.md`
- **Description**: Error handling

### Reference Documentation

- **File Path**: `official-docs/400.guides/1200.reference/1100.mysql-compatibility.md`
- **Description**: MySQL compatibility

- **File Path**: `official-docs/400.guides/1200.reference/1500.telemetry.md`
- **Description**: Telemetry functionality

### Release Notes

- **File Path**: `official-docs/400.guides/1300.release-notes/10.v1.0.0.md`
- **Description**: V1.0.0 release notes

## Summary

This category contains multiple documentation files covering deployment, operations, OBShell usage, reference documentation, and release notes.

skills/seekdb-docs/integrations.md (new file, 48 lines)
@@ -0,0 +1,48 @@
# Integration Guide

This category contains integration guides for SeekDB with third-party platforms and tools.

## File List

### Model Integration

- **File Path**: `official-docs/300.integrations/100.model/100.jina.md`
- **Description**: Jina model integration

- **File Path**: `official-docs/300.integrations/100.model/200.openai.md`
- **Description**: OpenAI model integration

- **File Path**: `official-docs/300.integrations/100.model/300.qwen.md`
- **Description**: Qwen model integration

### Framework Integration

- **File Path**: `official-docs/300.integrations/200.frame/100.langchain.md`
- **Description**: LangChain framework integration

- **File Path**: `official-docs/300.integrations/200.frame/200.llamaindex.md`
- **Description**: LlamaIndex framework integration

- **File Path**: `official-docs/300.integrations/200.frame/300.springai.md`
- **Description**: SpringAI framework integration

- **File Path**: `official-docs/300.integrations/200.frame/400.dify.md`
- **Description**: Dify framework integration

- **File Path**: `official-docs/300.integrations/200.frame/500.n8n.md`
- **Description**: n8n framework integration

### MCP Client Integration

- **File Path**: `official-docs/300.integrations/300.mcp-client/100.cursor.md`
- **Description**: Cursor MCP client integration

- **File Path**: `official-docs/300.integrations/300.mcp-client/200.cline.md`
- **Description**: Cline MCP client integration

- **File Path**: `official-docs/300.integrations/300.mcp-client/300.continue.md`
- **Description**: Continue MCP client integration

- **File Path**: `official-docs/300.integrations/300.mcp-client/400.trae.md`
- **Description**: Trae MCP client integration

## Summary

This category contains **12** documentation files covering integration guides for models, frameworks, and MCP clients.

skills/seekdb-docs/official-docs/10.doc-overview.md (new file, 158 lines)
@@ -0,0 +1,158 @@
# seekdb documentation

import DocsCard from '@components/global/DocsCard';
import DocsCards from '@components/global/DocsCards';

The seekdb documentation provides a wide range of resources, including step-by-step getting started guides, examples of building AI applications with live demos, SDK and API code samples, comprehensive feature overviews, and detailed user manuals—all designed to help you quickly get up to speed and make the most of seekdb.

## Get started

A minimalist API design that keeps you focused on building your AI applications.

<DocsCards>
<DocsCard header="Use embedded seekdb" href="./100.get-started/50.embedded-mode/25.using-seekdb-in-python-sdk.md">
<p>A lightweight library that embeds directly in your Python application, ideal for personal learning, prototyping, and end devices.</p>
</DocsCard>

<DocsCard header="Use server mode seekdb" href="./100.get-started/100.client-server-mode/10.deploy-seekdb-testing-environment.md">
<p>A lightweight and easy-to-use deployment mode recommended for both testing and production, delivering stable and efficient service.</p>
</DocsCard>
</DocsCards>

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

<Tabs>
<TabItem value="embedded" label="Embedded mode" default>

1. Set up

   ```python
   import pyseekdb

   # Embedded mode: the client runs seekdb in-process, no server required
   client = pyseekdb.Client()

   # Create a knowledge base
   collection = client.get_or_create_collection("product_database")
   ```

2. Insert

   ```python
   # Add product documents
   collection.upsert(
       documents=[
           "Laptop Pro with 16GB RAM, 512GB SSD, and high-speed processor",
           "Gaming Laptop with 32GB RAM, 1TB SSD, and high-performance graphics",
           "Business Ultrabook with 8GB RAM, 256GB SSD, and long battery life",
           "Tablet with 6GB RAM, 128GB storage, and 10-inch display"
       ],
       metadatas=[
           {"category": "laptop", "ram": 16, "storage": 512, "price": 12000, "type": "professional"},
           {"category": "laptop", "ram": 32, "storage": 1000, "price": 25000, "type": "gaming"},
           {"category": "laptop", "ram": 8, "storage": 256, "price": 9000, "type": "business"},
           {"category": "tablet", "ram": 6, "storage": 128, "price": 6000, "type": "consumer"}
       ],
       ids=["1", "2", "3", "4"]
   )

   print("Product database built\n")
   ```

3. Query

   ```python
   # Hybrid search for high-performance laptops
   print("Hybrid Search: High-performance laptops for professional work")
   results = collection.query(
       query_texts=["powerful computer for professional work"],  # Vector search
       where={                                                   # Relational filter
           "category": "laptop",
           "ram": {"$gte": 16}
       },
       where_document={"$contains": "RAM"},                      # Full-text search
       n_results=2
   )

   print("\nResults:")
   for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
       print(f"  {i+1}. {doc}")
   ```

</TabItem>
<TabItem value="server" label="Server mode">

1. Set up

   ```python
   import pyseekdb

   # Server mode: connect to a running seekdb instance
   client = pyseekdb.Client(
       host="127.0.0.1",  # server host
       port=2881          # server port (default: 2881)
   )

   # Create a knowledge base
   collection = client.get_or_create_collection("product_database")
   ```

2. Insert

   ```python
   # Add product documents
   collection.upsert(
       documents=[
           "Laptop Pro with 16GB RAM, 512GB SSD, and high-speed processor",
           "Gaming Laptop with 32GB RAM, 1TB SSD, and high-performance graphics",
           "Business Ultrabook with 8GB RAM, 256GB SSD, and long battery life",
           "Tablet with 6GB RAM, 128GB storage, and 10-inch display"
       ],
       metadatas=[
           {"category": "laptop", "ram": 16, "storage": 512, "price": 12000, "type": "professional"},
           {"category": "laptop", "ram": 32, "storage": 1000, "price": 25000, "type": "gaming"},
           {"category": "laptop", "ram": 8, "storage": 256, "price": 9000, "type": "business"},
           {"category": "tablet", "ram": 6, "storage": 128, "price": 6000, "type": "consumer"}
       ],
       ids=["1", "2", "3", "4"]
   )

   print("Product database built\n")
   ```

3. Query

   ```python
   # Hybrid search for high-performance laptops
   print("Hybrid Search: High-performance laptops for professional work")
   results = collection.query(
       query_texts=["powerful computer for professional work"],  # Vector search
       where={                                                   # Relational filter
           "category": "laptop",
           "ram": {"$gte": 16}
       },
       where_document={"$contains": "RAM"},                      # Full-text search
       n_results=2
   )

   print("\nResults:")
   for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
       print(f"  {i+1}. {doc}")
   ```

</TabItem>
</Tabs>

## Start building

<DocsCards>
<DocsCard header="pyseekdb (Python SDK)" href="./200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md">
<p>Overview and examples for using the seekdb Python SDK and API.</p>
</DocsCard>

<DocsCard header="Integrations" href="./300.integrations/100.model/100.jina.md">
<p>See how seekdb connects with third-party platforms, with practical examples.</p>
</DocsCard>

<DocsCard header="Tutorials" href="./500.tutorials/100.create-ai-app-demo/100.build-kb-in-seekdb.md">
<p>Step-by-step guides to using seekdb's AI features and building AI apps.</p>
</DocsCard>
</DocsCards>

skills/seekdb-docs/official-docs/100.get-started/10.overview/10.seekdb-overview.md (new file, 146 lines)
@@ -0,0 +1,146 @@
---
slug: /seekdb-overview
---

# What is seekdb

OceanBase seekdb (referred to as seekdb) is an AI-native search database. It unifies relational, vector, text, JSON, and GIS data in a single engine, enabling hybrid search and in-database AI workflows.

## Capability matrix

| Feature | OceanBase seekdb | OceanBase Database | MySQL 9.0 | Chroma | Elasticsearch | DuckDB | Milvus | PostgreSQL and pgvector |
|---------|------------------|--------------------|-----------|--------|---------------|--------|--------|-------------------------|
| **Embedded database** | Supported | Not supported | Not supported (removed in 8.0) | Supported | Not supported | Supported | Supported | Not supported |
| **Standalone database** | Supported | Supported | Supported | Supported | Supported | Supported | Supported | Supported |
| **Distributed database** | Not supported | Supported | Not supported | Not supported | Supported | Not supported | Supported | Not supported |
| **MySQL compatibility** | Supported | Supported | Supported | Not supported | Not supported | Supported | Not supported | Not supported |
| **Vector search** | Supported | Supported | Not supported | Supported | Supported | Supported | Supported | Supported |
| **Full-text search** | Supported | Supported | Supported | Not supported | Supported | Supported | Partially supported | Supported |
| **Hybrid search** | Supported | Supported | Not supported | Not supported | Supported | Not supported | Supported | Partially supported |
| **Online Transaction Processing (OLTP)** | Supported | Supported | Supported | Not supported | Not supported | Not supported | Not supported | Supported |
| **Online Analytical Processing (OLAP)** | Supported | Supported | Not supported | Not supported | Partially supported | Supported | Not supported | Supported |
| **Open-source license** | Apache 2.0 | MulanPubL 2.0 | GPL 2.0 | Apache 2.0 | AGPLv3 + SSPLv1 + Elastic 2.0 | MIT | Apache 2.0 | PostgreSQL License |

## Product architecture



* Deployment modes: seekdb supports both embedded and client/server deployments. The embedded mode allows seamless integration into Python applications, making it especially convenient for individual developers.

* Multi-model data and indexing layer: seekdb accommodates a wide range of data types, including vectors, text, JSON, and GIS, and provides robust indexing capabilities. It features HNSW/IVF vector indexes and quantization algorithms, full-text indexes based on BM25 relevance that support various tokenizers and query modes, hybrid indexes for mixed search scenarios, JSON indexes for metadata searches, as well as primary, secondary, and GIS indexes.

* Multi-model compute layer for hybrid workloads: seekdb enables hybrid searches across vectors, full-text, and scalar conditions, enhancing the accuracy of query results in Retrieval-Augmented Generation (RAG) scenarios. It offers built-in AI function capabilities for real-time inference within the database. seekdb supports full ACID transactions and multi-version concurrency control (MVCC), along with a query optimizer designed for hybrid workloads, an adaptive execution engine, and flexible PL UDF functions to address diverse business needs.

* Unified application interface: seekdb is compatible with native MySQL drivers and provides a unified SQL-based query language for multi-model data. Additionally, it offers developer-friendly SDKs for vector databases and hybrid search. seekdb integrates with nearly 30 application development frameworks, including popular AI frameworks such as LangChain, LlamaIndex, and Dify, and features an MCP server for seamless connection to the AI ecosystem.

## Core advantages

* Build fast

  From prototype to production in minutes: create AI apps using Python, run VectorDBBench on 1C2G.

* Hybrid search

  Combine vector search, full-text search, and relational queries in a single statement.

* Multi-model

  Support relational, vector, text, JSON, and GIS data in a single engine.

* AI inside

  Run embedding, reranking, LLM inference, and prompt management inside the database, supporting a complete document-in/data-out RAG workflow.

* SQL inside

  Powered by the proven OceanBase engine, delivering real-time writes and queries with full ACID compliance and seamless MySQL ecosystem compatibility.

## AI native

Full-stack AI capabilities for application development—from search to inference.

### Hybrid search

* Supports multi-path retrieval in a single SQL query, combining vector-based semantic search with keyword-based search for optimized recall.
* Query reranking supports weighted scores, Reciprocal Rank Fusion (RRF), and LLM-based reranking for enhanced results.
* Relational filters are pushed down to storage for optimized performance, and multi-table joins allow relational data retrieval.
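
As a concrete illustration, the quick-start example in the documentation overview issues exactly this kind of multi-path query through pyseekdb. A minimal sketch, reusing the names from that example (it assumes `collection` was created with `client.get_or_create_collection`):

```python
# All three retrieval paths in one call (from the pyseekdb quick-start example):
results = collection.query(
    query_texts=["powerful computer for professional work"],  # vector (semantic) path
    where={"category": "laptop", "ram": {"$gte": 16}},        # relational filter path
    where_document={"$contains": "RAM"},                      # full-text (keyword) path
    n_results=2
)
```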

### Vector & full-text search

* Supports dense vectors and sparse vectors, with multiple distance metrics including Manhattan, Euclidean, inner product, and cosine similarity.
* Vector indexes support in-memory types such as HNSW, HNSW-SQ, and HNSW-BQ, as well as disk-based types including IVF and IVF-PQ, optimizing storage costs.
* Full-text search supports keyword, phrase, and Boolean queries, with BM25 ranking for relevance.

### AI functions

* Manage built-in AI services via the `DBMS_AI_SERVICE` package in SQL, and register external LLM services.
* Convert text to vector embeddings directly in SQL using the `AI_EMBED` function.
* Generate text in SQL with `AI_COMPLETE`, supporting reusable prompt templates.
* Rerank text using LLM-based models in SQL via `AI_RERANK`.
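
The sketch below shows how these functions might compose in SQL. Only the function and package names come from this overview; the argument lists and model identifiers are assumptions, so consult the AI function guide under `official-docs/200.develop/300.ai-function/` for the actual syntax.

```sql
-- Hypothetical usage sketch: argument names and order are assumptions;
-- only the function names come from this overview.

-- Convert a question into a vector embedding:
SELECT AI_EMBED('my_embedding_model', 'What is seekdb?');

-- Generate text with a registered LLM service:
SELECT AI_COMPLETE('my_llm_model', 'Summarize the seekdb architecture in one sentence.');

-- Rerank candidate passages (doc_text is a hypothetical column) against a query:
SELECT AI_RERANK('my_rerank_model', 'seekdb deployment modes', doc_text) FROM docs;
```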

## Applicable scenarios

### RAG & knowledge retrieval

Large language models are limited by their training data. RAG introduces timely and trusted external knowledge to improve answer quality and reduce hallucination. seekdb enhances search accuracy through vector search, full-text search, hybrid search, built-in AI functions, and efficient indexing, while multi-level access control safeguards data privacy across heterogeneous knowledge sources.

Applicable scenarios:

* Enterprise QA
* Customer support
* Industry insights
* Personal knowledge

### AI-assisted programming

seekdb is well-suited for AI-powered programming tasks. It can build vector and full-text indexes for code repositories, making it easy to search for code or generate completions based on keywords or code semantics. seekdb also excels at organizing code data, supporting both structured storage (like syntax trees and dependency graphs) and unstructured storage (such as raw code text). Its dynamic metadata management allows developers to flexibly extend and efficiently query code attributes—like language type, function names, and parameter lists.

### Semantic search engine

Traditional keyword search struggles to capture intent. Semantic search leverages embeddings and vector search to understand meaning and connect text, images, and other modalities. seekdb's hybrid search and multi-model querying deliver more precise, context-aware results across complex search scenarios.

Applicable scenarios:

* Product search
* Text-to-image
* Image-to-product

### Agentic AI applications

Agentic AI requires memory, planning, perception, and reasoning. seekdb provides a unified foundation for agents through metadata management, vector/text/mixed queries, multimodal data processing, RAG, built-in AI functions and inference, and robust privacy controls—enabling scalable, production-grade agent systems.

Applicable scenarios:

* Personal assistants
* Enterprise automation
* Vertical agents
* Agent platforms

### AI-assisted coding & development

AI-powered coding combines natural-language understanding and code semantic analysis to enable generation, completion, debugging, testing, and refactoring. seekdb enhances code intelligence with semantic search, multi-model storage for code and documents, isolated multi-project management, and time-travel queries—supporting both local and cloud IDE environments.

Applicable scenarios:

* IDE plugins
* Design-to-web
* Local IDEs
* Web IDEs

### Enterprise application intelligence

AI transforms enterprise systems from passive tools into proactive collaborators. seekdb provides a unified AI-ready storage layer, fully compatible with MySQL syntax and views, and accelerates mixed workloads with parallel execution and hybrid row-column storage. Legacy applications gain intelligent capabilities with minimal migration across office, workflow, and business analytics scenarios.

Applicable scenarios:

* Document intelligence
* Business insights
* Finance systems

### AI on-device & edge AI applications

Edge devices—from mobile to vehicle and industrial terminals—operate with constrained compute and storage. seekdb's lightweight architecture supports embedded and micro-server modes, delivering full SQL, JSON, and hybrid search under low resource usage. It integrates seamlessly with OceanBase cloud services to enable unified edge-to-cloud intelligent systems.

Applicable scenarios:

* Personal assistants
* In-vehicle systems
* AI education
* Companion robots
* Healthcare devices

skills/seekdb-docs/official-docs/100.get-started/100.client-server-mode/10.deploy-seekdb-testing-environment.md (new file, 131 lines)
@@ -0,0 +1,131 @@
---
slug: /deploy-seekdb-testing-environment
---

# Quickly deploy seekdb in client/server mode

seekdb provides embedded mode and client/server mode. You can choose the appropriate deployment mode based on your business scenario. This topic introduces how to quickly deploy seekdb in client/server mode.

:::info
For information about using seekdb in embedded mode, see [Experience embedded seekdb](../50.embedded-mode/25.using-seekdb-in-python-sdk.md).
:::

## Deployment modes

seekdb provides flexible deployment modes that support everything from rapid prototyping to large-scale user workloads, meeting the full range of your application needs.

* Embedded mode

  seekdb embeds as a lightweight library installable with a single pip command (see the sketch after this section), ideal for personal learning or prototyping, and it can easily run on a variety of end devices.

* Client/Server mode

  A lightweight and easy-to-use deployment mode recommended for both testing and production, delivering stable and efficient service.

:::info
For more detailed and comprehensive deployment methods for seekdb, see [Deployment overview](../../400.guides/400.deploy/50.deploy-overview.md).
:::
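
A minimal sketch of the embedded path, assuming the Python package is named `pyseekdb` as in the SDK documentation:

```python
# First install the SDK (package name assumed from the pyseekdb SDK docs):
#   pip install pyseekdb
import pyseekdb

# Embedded mode: no server process to deploy; seekdb runs inside your application
client = pyseekdb.Client()
```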

## Prerequisites

Before performing the operations in this topic, you need to confirm the following information:

* Your environment runs one of the following verified Linux distributions:

  * Anolis OS 8.X (Linux kernel 3.10.0 or later)
  * Alibaba Cloud Linux 2/3 (Linux kernel 3.10.0 or later)
  * Red Hat Enterprise Linux Server 7.X, 8.X (Linux kernel 3.10.0 or later)
  * CentOS Linux 7.X, 8.X (Linux kernel 3.10.0 or later)
  * Debian 9.X or later (Linux kernel 3.10.0 or later)
  * Ubuntu 20.X or later (Linux kernel 3.10.0 or later)
  * SUSE / OpenSUSE 15.X or later (Linux kernel 3.10.0 or later)
  * openEuler 22.03 and 24.03 (Linux kernel 5.10.0 or later)
  * KylinOS V10
  * UOS 1020a/1021a/1021e/1001c
  * NFSChina 4.0 or later
  * Inspur KOS 5.8

* The minimum CPU requirement for the current environment is 1 core.

* The minimum available memory requirement for the current environment is 2 GB.

* You have installed a database connection tool (MySQL client or OBClient) in your environment.

* The user you are using has permission to execute sudo commands.

* Requirements for deploying using yum install:

  * You have installed the jq command-line tool in your environment and correctly configured systemd as the system and service manager.

* Requirements for deploying using Docker:

  * You have installed Docker and started the Docker service.

## Quickly deploy seekdb using yum install

1. Add the seekdb repository.

   ```shell
   [admin@test001 ~]$ sudo yum-config-manager --add-repo https://mirrors.aliyun.com/oceanbase/OceanBase.repo
   ```

2. Install seekdb.

   ```shell
   [admin@test001 ~]$ sudo yum install seekdb obclient
   ```

3. Start seekdb.

   ```shell
   [admin@test001 ~]$ sudo systemctl start seekdb
   ```

4. Check the startup status of seekdb.

   ```shell
   [admin@test001 ~]$ sudo systemctl status seekdb
   ```

   When the status shows `Service is ready`, seekdb has started successfully.

5. Connect to seekdb.

   ```shell
   mysql -h127.0.0.1 -uroot -P2881 -A oceanbase
   ```

## Quickly deploy seekdb in a container environment

If Docker is installed and the Docker service is started in your environment, you can also deploy seekdb using Docker containers. For more information about Docker deployment, see [Deploy seekdb in a container environment](../../400.guides/400.deploy/700.server-mode/200.deploy-by-docker.md).

1. Start a seekdb instance directly.

   ```shell
   [admin@test001 ~]$ sudo docker run -d -p 2881:2881 oceanbase/seekdb
   ```

   :::info
   If pulling the Docker image fails, you can also pull the image from the quay.io or ghcr.io repository. Simply replace <code>oceanbase/seekdb</code> in the above command with <code>quay.io/oceanbase/seekdb</code> or <code>ghcr.io/oceanbase/seekdb</code>. For example, execute <code>sudo docker run -d -p 2881:2881 quay.io/oceanbase/seekdb</code> to pull the image from quay.io.
   :::

2. Connect to seekdb.

   ```shell
   mysql -h127.0.0.1 -uroot -P2881 -A oceanbase
   ```

## What's next

After deploying and connecting to seekdb, you can further experience seekdb's AI Native features and try building AI applications based on seekdb:

* [Experience vector search](30.experience-vector-search.md)
* [Experience full-text indexing](40.experience-full-text-indexing.md)
* [Experience hybrid search](50.experience-hybrid-search.md)
* [Experience AI function service](60.experience-ai-function.md)
* [Experience semantic indexing](70.experience-hybrid-vector-index.md)
* [Experience the Vibe Coding paradigm with Cursor Agent + OceanBase MCP](80.experience-vibe-coding-paradigm-with-cursor-agent-oceanbase-mcp.md)
* [Build a knowledge base desktop application based on seekdb](../../500.tutorials/100.create-ai-app-demo/100.build-kb-in-seekdb.md)
* [Build a cultural tourism assistant with multi-model integration based on seekdb](../../500.tutorials/100.create-ai-app-demo/300.build-multi-model-application-based-on-oceanbase.md)
* [Build an image search application based on seekdb](../../500.tutorials/100.create-ai-app-demo/400.build-image-search-app-in-seekdb.md)

skills/seekdb-docs/official-docs/100.get-started/100.client-server-mode/15.basic-sql-operations.md (new file, 861 lines)
@@ -0,0 +1,861 @@
---
slug: /basic-sql-operations
---

# Basic SQL operations

This topic introduces some basic SQL operations in seekdb.

## Create a database

Use the `CREATE DATABASE` statement to create a database.

Example: Create a database named `db1`, specify the character set as `utf8mb4`, and set the read-write attribute.

```sql
obclient> CREATE DATABASE db1 DEFAULT CHARACTER SET utf8mb4 READ WRITE;
Query OK, 1 row affected
```

For more information about the `CREATE DATABASE` statement, see [CREATE DATABASE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974111).

After creation, you can use the `SHOW DATABASES` command to view all databases in the current database server.

```sql
obclient> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| db1                |
| information_schema |
| mysql              |
| oceanbase          |
| sys_external_tbs   |
| test               |
+--------------------+
6 rows in set
```

## Table operations

In seekdb, a table is the most basic data storage unit that contains all data accessible to users. Each table contains multiple rows of records, and each record consists of multiple columns. This section provides the syntax and examples for creating, viewing, modifying, and deleting tables in a database.

### Create a table

Use the `CREATE TABLE` statement to create a new table in a database.

Example: Create a table named `test` in the database `db1`.

```sql
obclient> USE db1;
Database changed

obclient> CREATE TABLE test (c1 INT PRIMARY KEY, c2 VARCHAR(3));
Query OK, 0 rows affected
```

For more information about the `CREATE TABLE` statement, see [CREATE TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974140).

### View tables

Use the `SHOW CREATE TABLE` statement to view the table creation statement.

Examples:

* View the table creation statement for the table `test`.

  ```sql
  obclient> SHOW CREATE TABLE test\G
  *************************** 1. row ***************************
         Table: test
  Create Table: CREATE TABLE `test` (
    `c1` int(11) NOT NULL,
    `c2` varchar(3) DEFAULT NULL,
    PRIMARY KEY (`c1`)
  ) ORGANIZATION INDEX DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE ENABLE_MACRO_BLOCK_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0
  1 row in set
  ```

* Use the `SHOW TABLES` statement to view all tables in the database `db1`.

  ```sql
  obclient> SHOW TABLES FROM db1;
  +---------------+
  | Tables_in_db1 |
  +---------------+
  | test          |
  +---------------+
  1 row in set
  ```

### Modify a table

Use the `ALTER TABLE` statement to modify the structure of an existing table, including modifying table attributes, adding columns, modifying columns and their attributes, and deleting columns.

Examples:

* Rename the column `c2` to `c3` in the table `test` and change its data type.

  ```sql
  obclient> DESCRIBE test;
  +-------+------------+------+-----+---------+-------+
  | Field | Type       | Null | Key | Default | Extra |
  +-------+------------+------+-----+---------+-------+
  | c1    | int(11)    | NO   | PRI | NULL    |       |
  | c2    | varchar(3) | YES  |     | NULL    |       |
  +-------+------------+------+-----+---------+-------+
  2 rows in set

  obclient> ALTER TABLE test CHANGE COLUMN c2 c3 CHAR(10);
  Query OK, 0 rows affected

  obclient> DESCRIBE test;
  +-------+----------+------+-----+---------+-------+
  | Field | Type     | Null | Key | Default | Extra |
  +-------+----------+------+-----+---------+-------+
  | c1    | int(11)  | NO   | PRI | NULL    |       |
  | c3    | char(10) | YES  |     | NULL    |       |
  +-------+----------+------+-----+---------+-------+
  2 rows in set
  ```

* Add and delete columns in the table `test`.

  ```sql
  obclient> DESCRIBE test;
  +-------+----------+------+-----+---------+-------+
  | Field | Type     | Null | Key | Default | Extra |
  +-------+----------+------+-----+---------+-------+
  | c1    | int(11)  | NO   | PRI | NULL    |       |
  | c3    | char(10) | YES  |     | NULL    |       |
  +-------+----------+------+-----+---------+-------+
  2 rows in set

  obclient> ALTER TABLE test ADD c4 int;
  Query OK, 0 rows affected

  obclient> DESCRIBE test;
  +-------+----------+------+-----+---------+-------+
  | Field | Type     | Null | Key | Default | Extra |
  +-------+----------+------+-----+---------+-------+
  | c1    | int(11)  | NO   | PRI | NULL    |       |
  | c3    | char(10) | YES  |     | NULL    |       |
  | c4    | int(11)  | YES  |     | NULL    |       |
  +-------+----------+------+-----+---------+-------+
  3 rows in set

  obclient> ALTER TABLE test DROP c3;
  Query OK, 0 rows affected

  obclient> DESCRIBE test;
  +-------+---------+------+-----+---------+-------+
  | Field | Type    | Null | Key | Default | Extra |
  +-------+---------+------+-----+---------+-------+
  | c1    | int(11) | NO   | PRI | NULL    |       |
  | c4    | int(11) | YES  |     | NULL    |       |
  +-------+---------+------+-----+---------+-------+
  2 rows in set
  ```

For more information about the `ALTER TABLE` statement, see [ALTER TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974126).

### Delete a table

Use the `DROP TABLE` statement to delete a table.

Example: Delete the table `test`.

```sql
obclient> DROP TABLE test;
Query OK, 0 rows affected
```

For more information about the `DROP TABLE` statement, see [DROP TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974139).
## Index operations

An index is a structure created on a table that sorts the values of one or more columns in the database table. Its main purpose is to improve query speed and reduce the performance overhead of the database system. This section introduces the syntax and examples for creating, viewing, and deleting indexes in a database.

### Create an index

Use the `CREATE INDEX` statement to create an index on a table.

Example: Create an index on the table `test`.

```sql
obclient> CREATE TABLE test (c1 INT PRIMARY KEY, c2 VARCHAR(3));
Query OK, 0 rows affected (0.10 sec)

obclient> DESCRIBE test;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| c1    | int(11)    | NO   | PRI | NULL    |       |
| c2    | varchar(3) | YES  |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
2 rows in set

obclient> CREATE INDEX test_index ON test (c1, c2);
Query OK, 0 rows affected
```

For more information about the `CREATE INDEX` statement, see [CREATE INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974165).

### View indexes

Use the `SHOW INDEX` statement to view indexes on a table.

Example: View index information for the table `test`.

```sql
obclient> SHOW INDEX FROM test\G
*************************** 1. row ***************************
        Table: test
   Non_unique: 0
     Key_name: PRIMARY
 Seq_in_index: 1
  Column_name: c1
    Collation: A
  Cardinality: NULL
     Sub_part: NULL
       Packed: NULL
         Null:
   Index_type: BTREE
      Comment: available
Index_comment:
      Visible: YES
   Expression: NULL
*************************** 2. row ***************************
        Table: test
   Non_unique: 1
     Key_name: test_index
 Seq_in_index: 1
  Column_name: c1
    Collation: A
  Cardinality: NULL
     Sub_part: NULL
       Packed: NULL
         Null:
   Index_type: BTREE
      Comment: available
Index_comment:
      Visible: YES
   Expression: NULL
*************************** 3. row ***************************
        Table: test
   Non_unique: 1
     Key_name: test_index
 Seq_in_index: 2
  Column_name: c2
    Collation: A
  Cardinality: NULL
     Sub_part: NULL
       Packed: NULL
         Null: YES
   Index_type: BTREE
      Comment: available
Index_comment:
      Visible: YES
   Expression: NULL
3 rows in set
```

### Delete an index

Use the `DROP INDEX` statement to delete an index on a table.

Example: Delete the index on the table `test`.

```sql
obclient> DROP INDEX test_index ON test;
Query OK, 0 rows affected
```

For more information about the `DROP INDEX` statement, see [DROP INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974168).
## Insert data

Use the `INSERT` statement to insert data into an existing table.

Examples:

* Create a table `t1` and insert one row of data.

  ```sql
  obclient> CREATE TABLE t1(c1 INT PRIMARY KEY, c2 int) PARTITION BY KEY(c1) PARTITIONS 4;
  Query OK, 0 rows affected

  obclient> SELECT * FROM t1;
  Empty set

  obclient> INSERT t1 VALUES(1,1);
  Query OK, 1 row affected

  obclient> SELECT * FROM t1;
  +----+------+
  | c1 | c2   |
  +----+------+
  |  1 |    1 |
  +----+------+
  1 row in set
  ```

* Insert multiple rows of data into the table `t1`.

  ```sql
  obclient> INSERT t1 VALUES(2,2),(3,default),(2+2,3*4);
  Query OK, 3 rows affected
  Records: 3  Duplicates: 0  Warnings: 0

  obclient> SELECT * FROM t1;
  +----+------+
  | c1 | c2   |
  +----+------+
  |  1 |    1 |
  |  2 |    2 |
  |  3 | NULL |
  |  4 |   12 |
  +----+------+
  4 rows in set
  ```

For more information about the `INSERT` statement, see [INSERT](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974718).
## Delete data
|
||||
|
||||
Use the `DELETE` statement to delete data. It supports deleting data from a single table or multiple tables.
|
||||
|
||||
Examples:
|
||||
|
||||
* Create tables `t2` and `t3` using `CREATE TABLE`. Delete the row where `c1=2`, where `c1` is the `PRIMARY KEY` column in the table `t2`.
|
||||
|
||||
```sql
|
||||
/*Table `t3` is a `KEY` partitioned table, and the partition names are automatically generated by the system according to the partition naming rules, that is, the partition names are `p0`, `p1`, `p2`, and `p3`*/
|
||||
obclient> CREATE TABLE t2(c1 INT PRIMARY KEY, c2 INT);
|
||||
Query OK, 0 rows affected
|
||||
|
||||
obclient> INSERT t2 VALUES(1,1),(2,2),(3,3),(5,5);
|
||||
Query OK, 4 rows affected
|
||||
Records: 4 Duplicates: 0 Warnings: 0
|
||||
|
||||
obclient> SELECT * FROM t2;
|
||||
+----+------+
|
||||
| c1 | c2 |
|
||||
+----+------+
|
||||
| 1 | 1 |
|
||||
| 2 | 2 |
|
||||
| 3 | 3 |
|
||||
| 5 | 5 |
|
||||
+----+------+
|
||||
4 rows in set
|
||||
|
||||
obclient> CREATE TABLE t3(c1 INT PRIMARY KEY, c2 INT) PARTITION BY KEY(c1) PARTITIONS 4;
|
||||
Query OK, 0 rows affected
|
||||
|
||||
obclient> INSERT INTO t3 VALUES(5,5),(1,1),(2,2),(3,3);
|
||||
Query OK, 4 rows affected
|
||||
Records: 4 Duplicates: 0 Warnings: 0
|
||||
|
||||
obclient> SELECT * FROM t3;
|
||||
+----+------+
|
||||
| c1 | c2 |
|
||||
+----+------+
|
||||
| 5 | 5 |
|
||||
| 1 | 1 |
|
||||
| 2 | 2 |
|
||||
| 3 | 3 |
|
||||
+----+------+
|
||||
4 rows in set
|
||||
|
||||
obclient> DELETE FROM t2 WHERE c1 = 2;
|
||||
Query OK, 1 row affected
|
||||
|
||||
obclient> SELECT * FROM t2;
|
||||
+----+------+
|
||||
| c1 | c2 |
|
||||
+----+------+
|
||||
| 1 | 1 |
|
||||
| 3 | 3 |
|
||||
| 5 | 5 |
|
||||
+----+------+
|
||||
3 rows in set
|
||||
```
|
||||
|
||||
* Delete the first row of data from the table `t2` after sorting by the `c2` column.
|
||||
|
||||
```sql
|
||||
obclient> DELETE FROM t2 ORDER BY c2 LIMIT 1;
|
||||
Query OK, 1 row affected
|
||||
|
||||
obclient> SELECT * FROM t2;
|
||||
+----+------+
|
||||
| c1 | c2 |
|
||||
+----+------+
|
||||
| 3 | 3 |
|
||||
| 5 | 5 |
|
||||
+----+------+
|
||||
2 rows in set
|
||||
```
|
||||
|
||||
* Delete data from the `p2` partition of the table `t3`.

```sql
obclient> SELECT * FROM t3 PARTITION(p2);
+----+------+
| c1 | c2   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
+----+------+
3 rows in set

obclient> DELETE FROM t3 PARTITION(p2);
Query OK, 3 rows affected

obclient> SELECT * FROM t3;
+----+------+
| c1 | c2   |
+----+------+
|  5 |    5 |
+----+------+
1 row in set
```

* Delete data from tables `t2` and `t3` where `t2.c1 = t3.c1`.

```sql
obclient> SELECT * FROM t2;
+----+------+
| c1 | c2   |
+----+------+
|  3 |    3 |
|  5 |    5 |
+----+------+
2 rows in set

obclient> SELECT * FROM t3;
+----+------+
| c1 | c2   |
+----+------+
|  5 |    5 |
+----+------+
1 row in set

obclient> DELETE t2, t3 FROM t2, t3 WHERE t2.c1 = t3.c1;
Query OK, 2 rows affected
/*Equivalent to
obclient> DELETE FROM t2, t3 USING t2, t3 WHERE t2.c1 = t3.c1;
*/

obclient> SELECT * FROM t2;
+----+------+
| c1 | c2   |
+----+------+
|  3 |    3 |
+----+------+
1 row in set

obclient> SELECT * FROM t3;
Empty set
```

For more information about the `DELETE` statement, see [DELETE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974138).

## Update data

Use the `UPDATE` statement to modify field values in a table.

Examples:

* Create tables `t4` and `t5` using `CREATE TABLE`. Modify the `c2` column value to `100` for the row where `t4.c1=10` in the table `t4`.

```sql
obclient> CREATE TABLE t4(c1 INT PRIMARY KEY, c2 INT);
Query OK, 0 rows affected

obclient> INSERT t4 VALUES(10,10),(20,20),(30,30),(40,40);
Query OK, 4 rows affected
Records: 4 Duplicates: 0 Warnings: 0

obclient> SELECT * FROM t4;
+----+------+
| c1 | c2   |
+----+------+
| 10 |   10 |
| 20 |   20 |
| 30 |   30 |
| 40 |   40 |
+----+------+
4 rows in set

obclient> CREATE TABLE t5(c1 INT PRIMARY KEY, c2 INT) PARTITION BY KEY(c1) PARTITIONS 4;
Query OK, 0 rows affected

obclient> INSERT t5 VALUES(50,50),(10,10),(20,20),(30,30);
Query OK, 4 rows affected
Records: 4 Duplicates: 0 Warnings: 0

obclient> SELECT * FROM t5;
+----+------+
| c1 | c2   |
+----+------+
| 20 |   20 |
| 10 |   10 |
| 50 |   50 |
| 30 |   30 |
+----+------+
4 rows in set

obclient> UPDATE t4 SET t4.c2 = 100 WHERE t4.c1 = 10;
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0

obclient> SELECT * FROM t4;
+----+------+
| c1 | c2   |
+----+------+
| 10 |  100 |
| 20 |   20 |
| 30 |   30 |
| 40 |   40 |
+----+------+
4 rows in set
```

* Modify the `c2` column value to `100` for the first two rows of data in the table `t4` after sorting by the `c2` column.

```sql
obclient> UPDATE t4 SET t4.c2 = 100 ORDER BY c2 LIMIT 2;
Query OK, 2 rows affected
Rows matched: 2 Changed: 2 Warnings: 0

obclient> SELECT * FROM t4;
+----+------+
| c1 | c2   |
+----+------+
| 10 |  100 |
| 20 |  100 |
| 30 |  100 |
| 40 |   40 |
+----+------+
4 rows in set
```

* Modify the `c2` column value to `100` for the rows in the `p1` partition of the table `t5` where `t5.c1 > 20`.

```sql
obclient> SELECT * FROM t5 PARTITION(p1);
+----+------+
| c1 | c2   |
+----+------+
| 10 |   10 |
| 50 |   50 |
+----+------+
2 rows in set

obclient> UPDATE t5 PARTITION(p1) SET t5.c2 = 100 WHERE t5.c1 > 20;
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0

obclient> SELECT * FROM t5 PARTITION(p1);
+----+------+
| c1 | c2   |
+----+------+
| 10 |   10 |
| 50 |  100 |
+----+------+
2 rows in set
```

* For rows in tables `t4` and `t5` that satisfy `t4.c2 = t5.c2`, modify the `c2` column value in the table `t4` to `100` and the `c2` column value in the table `t5` to `200`.

```sql
obclient> UPDATE t4,t5 SET t4.c2 = 100, t5.c2 = 200 WHERE t4.c2 = t5.c2;
Query OK, 1 row affected
Rows matched: 4 Changed: 1 Warnings: 0

obclient> SELECT * FROM t4;
+----+------+
| c1 | c2   |
+----+------+
| 10 |  100 |
| 20 |  100 |
| 30 |  100 |
| 40 |   40 |
+----+------+
4 rows in set

obclient> SELECT * FROM t5;
+----+------+
| c1 | c2   |
+----+------+
| 20 |   20 |
| 10 |   10 |
| 50 |  200 |
| 30 |   30 |
+----+------+
4 rows in set
```

For more information about the `UPDATE` statement, see [UPDATE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974152).

## Query data

Use the `SELECT` statement to query the contents of a table.

Examples:

* Create a table `t6` using `CREATE TABLE`. Read the `name` column from the table `t6`.

```sql
obclient> CREATE TABLE t6 (id INT, name VARCHAR(50), num INT);
Query OK, 0 rows affected

obclient> INSERT INTO t6 VALUES(1,'a',100),(2,'b',200),(3,'a',50);
Query OK, 3 rows affected
Records: 3 Duplicates: 0 Warnings: 0

obclient> SELECT * FROM t6;
+------+------+------+
| ID   | NAME | NUM  |
+------+------+------+
|    1 | a    |  100 |
|    2 | b    |  200 |
|    3 | a    |   50 |
+------+------+------+
3 rows in set

obclient> SELECT name FROM t6;
+------+
| NAME |
+------+
| a    |
| b    |
| a    |
+------+
3 rows in set
```

* Remove duplicates from the `name` column in the query results.

```sql
obclient> SELECT DISTINCT name FROM t6;
+------+
| NAME |
+------+
| a    |
| b    |
+------+
2 rows in set
```

* Output the corresponding `id`, `name`, and `num` from the table `t6` based on the filter condition `name = 'a'`.

```sql
obclient> SELECT id, name, num FROM t6 WHERE name = 'a';
+------+------+------+
| ID   | NAME | NUM  |
+------+------+------+
|    1 | a    |  100 |
|    3 | a    |   50 |
+------+------+------+
2 rows in set
```
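
Aggregate queries follow the same pattern. A small sketch on the same table, grouping the rows shown above by `name` (the totals below are computed from that data):

```sql
-- Sum num per distinct name: a -> 100 + 50, b -> 200
obclient> SELECT name, SUM(num) AS total FROM t6 GROUP BY name;
+------+-------+
| NAME | TOTAL |
+------+-------+
| a    |   150 |
| b    |   200 |
+------+-------+
2 rows in set
```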

For more information about the `SELECT` statement, see [SELECT](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974942).

## Commit a transaction

Use the `COMMIT` statement to commit a transaction.

Before committing a transaction (COMMIT):

* Your modifications are visible only to the current session and not visible to other database sessions.
* Your modifications are not persisted. You can undo the modifications using the ROLLBACK statement.

After committing a transaction (COMMIT):

* Your modifications are visible to all database sessions.
* Your modifications are successfully persisted and cannot be rolled back using the ROLLBACK statement.

Example: Create a table `t_insert` using `CREATE TABLE`. Use the `COMMIT` statement to commit the transaction.

```sql
obclient> CREATE TABLE t_insert(
    id NUMBER NOT NULL PRIMARY KEY,
    name VARCHAR(10) NOT NULL,
    value NUMBER,
    gmt_create DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Query OK, 0 rows affected

obclient> BEGIN;
Query OK, 0 rows affected

obclient> INSERT INTO t_insert(id, name, value, gmt_create) VALUES(1,'CN',10001, current_timestamp),(2,'US',10002, current_timestamp),(3,'EN',10003, current_timestamp);
Query OK, 3 rows affected
Records: 3 Duplicates: 0 Warnings: 0

obclient> SELECT * FROM t_insert;
+----+------+-------+---------------------+
| id | name | value | gmt_create          |
+----+------+-------+---------------------+
|  1 | CN   | 10001 | 2025-11-07 16:01:53 |
|  2 | US   | 10002 | 2025-11-07 16:01:53 |
|  3 | EN   | 10003 | 2025-11-07 16:01:53 |
+----+------+-------+---------------------+
3 rows in set

obclient> INSERT INTO t_insert(id,name) VALUES(4,'JP');
Query OK, 1 row affected

obclient> COMMIT;
Query OK, 0 rows affected

obclient> exit;
Bye

$ obclient -h127.0.0.1 -uroot -P2881 -Ddb1

obclient> SELECT * FROM t_insert;
+------+------+-------+---------------------+
| id   | name | value | gmt_create          |
+------+------+-------+---------------------+
|    1 | CN   | 10001 | 2025-11-07 16:01:53 |
|    2 | US   | 10002 | 2025-11-07 16:01:53 |
|    3 | EN   | 10003 | 2025-11-07 16:01:53 |
|    4 | JP   |  NULL | 2025-11-07 16:02:02 |
+------+------+-------+---------------------+
4 rows in set
```

For more information about transaction control statements, see [Transaction management overview](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971667).

## Roll back a transaction

Use the `ROLLBACK` statement to roll back a transaction.

Rolling back a transaction means undoing all modifications made in the transaction. You can roll back the entire uncommitted transaction or roll back to any savepoint in the transaction. To roll back to a savepoint, you must use the `ROLLBACK` statement together with `TO SAVEPOINT`.

* If you roll back the entire transaction:
  * The transaction ends.
  * All modifications are discarded.
  * All savepoints are cleared.
  * All locks held by the transaction are released.

* If you roll back to a savepoint (see the sketch after the example below):
  * The transaction does not end.
  * Modifications before the savepoint are retained, and modifications after the savepoint are discarded.
  * Savepoints set after that savepoint are cleared; the savepoint itself is retained.
  * All locks acquired by the transaction after the savepoint are released.

Example: Roll back all modifications in a transaction.

```sql
obclient> SELECT * FROM t_insert;
+------+------+-------+---------------------+
| id   | name | value | gmt_create          |
+------+------+-------+---------------------+
|    1 | CN   | 10001 | 2025-11-07 16:01:53 |
|    2 | US   | 10002 | 2025-11-07 16:01:53 |
|    3 | EN   | 10003 | 2025-11-07 16:01:53 |
|    4 | JP   |  NULL | 2025-11-07 16:02:02 |
+------+------+-------+---------------------+
4 rows in set

obclient> BEGIN;
Query OK, 0 rows affected

obclient> INSERT INTO t_insert(id, name, value) VALUES(5,'JP',10004),(6,'FR',10005),(7,'RU',10006);
Query OK, 3 rows affected
Records: 3 Duplicates: 0 Warnings: 0

obclient> SELECT * FROM t_insert;
+------+------+-------+---------------------+
| id   | name | value | gmt_create          |
+------+------+-------+---------------------+
|    1 | CN   | 10001 | 2025-11-07 16:01:53 |
|    2 | US   | 10002 | 2025-11-07 16:01:53 |
|    3 | EN   | 10003 | 2025-11-07 16:01:53 |
|    4 | JP   |  NULL | 2025-11-07 16:02:02 |
|    5 | JP   | 10004 | 2025-11-07 16:04:14 |
|    6 | FR   | 10005 | 2025-11-07 16:04:14 |
|    7 | RU   | 10006 | 2025-11-07 16:04:14 |
+------+------+-------+---------------------+
7 rows in set

obclient> ROLLBACK;
Query OK, 0 rows affected

obclient> SELECT * FROM t_insert;
+------+------+-------+---------------------+
| id   | name | value | gmt_create          |
+------+------+-------+---------------------+
|    1 | CN   | 10001 | 2025-11-07 16:01:53 |
|    2 | US   | 10002 | 2025-11-07 16:01:53 |
|    3 | EN   | 10003 | 2025-11-07 16:01:53 |
|    4 | JP   |  NULL | 2025-11-07 16:02:02 |
+------+------+-------+---------------------+
4 rows in set
```
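
The example above rolls back an entire transaction. The following is a minimal sketch of rolling back to a savepoint instead, continuing with the same table (the savepoint name `sp1` is illustrative):

```sql
obclient> BEGIN;
obclient> INSERT INTO t_insert(id, name, value) VALUES(5,'JP',10004);
obclient> SAVEPOINT sp1;
obclient> INSERT INTO t_insert(id, name, value) VALUES(6,'FR',10005);
-- Discards only the insert of id = 6; the insert of id = 5 is retained
-- and the transaction remains open
obclient> ROLLBACK TO SAVEPOINT sp1;
obclient> COMMIT;
```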

For more information about transaction control statements, see [Transaction management overview](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971667).

## Create a user

Use the `CREATE USER` statement to create a user.

Example:

Create a user named `test`.

```sql
obclient> CREATE USER 'test' IDENTIFIED BY '******';
Query OK, 0 rows affected
```

For more information about the `CREATE USER` statement, see [CREATE USER](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974176).

## Grant user privileges

Use the `GRANT` statement to grant privileges to a user.

Example:

Grant the user `test` the `SELECT` privilege on all tables in the database `db1`.

```sql
obclient> GRANT SELECT ON db1.* TO test;
Query OK, 0 rows affected
```

Check the privileges of the user `test`.

```sql
obclient> SHOW GRANTS FOR test;
+-----------------------------------+
| Grants for test@%                 |
+-----------------------------------+
| GRANT USAGE ON *.* TO 'test'      |
| GRANT SELECT ON `db1`.* TO 'test' |
+-----------------------------------+
2 rows in set
```
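
A granted privilege can be withdrawn with the matching `REVOKE` statement; a minimal sketch:

```sql
-- Remove the SELECT privilege granted above
obclient> REVOKE SELECT ON db1.* FROM test;
Query OK, 0 rows affected
```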

For more information about the `GRANT` statement, see [GRANT](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974144).

## Delete a user

Use the `DROP USER` statement to delete a user.

Example:

Delete the user `test`.

```sql
obclient> DROP USER test;
Query OK, 0 rows affected
```

For more information about the `DROP USER` statement, see [DROP USER](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974172).

@@ -0,0 +1,230 @@

---
slug: /experience-vector-search
---

# Experience vector search

## Vector search overview

In today's era of information explosion, users often need to quickly retrieve the information they need from massive amounts of data. For example, online literature databases, e-commerce platform product catalogs, and growing multimedia content libraries all require efficient retrieval systems to quickly locate content of interest to users. As data volumes continue to grow, traditional keyword-based retrieval methods can no longer meet users' needs for retrieval accuracy and speed. Vector search technology can effectively solve these problems: it encodes different types of data such as text, images, and audio into mathematical vectors and performs retrieval in vector space. This approach allows systems to capture the deep semantic information of the data, thereby providing more accurate and efficient retrieval results.

seekdb provides the capability to store, index, and search embedding vector data, and supports storing vector data together with other data.

seekdb supports float-type dense vectors with up to 16,000 dimensions, as well as sparse vectors, and various vector distance calculations such as Manhattan distance, Euclidean distance, inner product, and cosine distance. It supports creating vector indexes based on HNSW/IVF, with incremental updates and deletions that do not affect recall.
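
As a quick illustration of these metrics, the distance functions can be called directly on vector literals. A minimal sketch; `l2_distance` is used later in this guide, while `cosine_distance` and `inner_product` are assumed here to follow the same naming convention:

```sql
-- Compare two 3-dimensional vectors under different metrics
-- (cosine_distance and inner_product are assumed function names)
SELECT l2_distance('[1,2,3]', '[2,2,2]')     AS l2,
       cosine_distance('[1,2,3]', '[2,2,2]') AS cosine,
       inner_product('[1,2,3]', '[2,2,2]')   AS ip;
```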

seekdb vector search provides hybrid search with scalar filtering. It also offers flexible access interfaces, supporting SQL access through MySQL protocol clients in various languages as well as Python SDK access. It has also been adapted to the AI application development frameworks LlamaIndex and DB-GPT and to the AI application development platform Dify, better serving AI application development.

This topic demonstrates how to quickly perform vector search using SQL.

## Prerequisites

* seekdb is installed.
* You are connected to seekdb.

## Quick start

1. Create vector columns and indexes.

When creating a table, you can use the `VECTOR(dim)` data type to declare a column as a vector column and specify its dimension. Vector indexes must be created on vector columns, and at least the two parameters `type` and `distance` must be provided.

The example creates a vector column `embedding` with a dimension of `3`, and creates an HNSW index on the `embedding` column, specifying L2 as the distance algorithm.

```sql
CREATE TABLE t1(
    id INT PRIMARY KEY,
    doc VARCHAR(200),
    embedding VECTOR(3),
    VECTOR INDEX idx1(embedding) WITH (distance=L2, type=hnsw)
);
```

2. Insert vector data.

To simulate a vector search scenario, you need to construct some vector data first. Each row of data includes a description of the data and the corresponding vector. In the example, it is assumed that `'apple'` corresponds to the vector `'[1.2,0.7,1.1]'`, `'carrot'` corresponds to the vector `'[5.3,4.8,5.4]'`, and so on.

```sql
INSERT INTO t1
VALUES (1, 'apple', '[1.2,0.7,1.1]'),
       (2, 'banana', '[0.6,1.2,0.8]'),
       (3, 'orange', '[1.1,1.1,0.9]'),
       (4, 'carrot', '[5.3,4.8,5.4]'),
       (5, 'spinach', '[4.9,5.3,4.8]'),
       (6, 'tomato', '[5.2,4.9,5.1]');
```

For convenience of demonstration, this example simplifies the vectors to only 3 dimensions, and the vectors are manually generated. In actual applications, you need to use embedding models to generate vectors from real text, and the dimensions can reach hundreds or thousands.

You can check whether the data is inserted successfully by querying the table.

```sql
SELECT * FROM t1;
```

The expected result is as follows:

```shell
+----+---------+---------------+
| id | doc     | embedding     |
+----+---------+---------------+
|  1 | apple   | [1.2,0.7,1.1] |
|  2 | banana  | [0.6,1.2,0.8] |
|  3 | orange  | [1.1,1.1,0.9] |
|  4 | carrot  | [5.3,4.8,5.4] |
|  5 | spinach | [4.9,5.3,4.8] |
|  6 | tomato  | [5.2,4.9,5.1] |
+----+---------+---------------+
6 rows in set
```

3. Perform vector search.

To perform a vector search, you provide a query vector as the search condition. Suppose you want to find all fruits and the corresponding query vector is `[0.9, 1.0, 0.9]`; the corresponding SQL is:

```sql
SELECT id, doc FROM t1
ORDER BY l2_distance(embedding, '[0.9, 1.0, 0.9]')
APPROXIMATE LIMIT 3;
```

The expected result is as follows:

```shell
+----+--------+
| id | doc    |
+----+--------+
|  3 | orange |
|  2 | banana |
|  1 | apple  |
+----+--------+
3 rows in set
```
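
The overview above mentions hybrid search with scalar filtering; a minimal sketch on the same table, adding an ordinary `WHERE` predicate on a scalar column to the approximate search:

```sql
-- Restrict the candidate rows by a scalar condition before ranking by distance
SELECT id, doc FROM t1
WHERE id > 1
ORDER BY l2_distance(embedding, '[0.9, 1.0, 0.9]')
APPROXIMATE LIMIT 3;
```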

## Comparison between exact search and approximate search

### Perform exact search

Exact search uses a full scan strategy: it calculates the distance between the query vector and every vector in the dataset. This method guarantees completely accurate search results, but since a full distance calculation is required, search performance degrades significantly as the data scale grows.

When performing an exact search, the system calculates and compares the distance between the query vector and all vectors in the vector space. After completing the full distance calculation, the system selects the k vectors with the closest distance as the search results.

#### Example: Euclidean similarity search

Euclidean similarity search retrieves the top-k vectors closest to the query vector in vector space, using Euclidean distance as the metric. The following example demonstrates how to use exact search to retrieve the top 5 vectors closest to the query vector from a table:

```sql
-- Create a test table (drop the table from the quick start first if it still exists)
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
    id INT PRIMARY KEY,
    c1 VECTOR(3)
);

-- Insert data
INSERT INTO t1 VALUES
    (1, '[0.1, 0.2, 0.3]'),
    (2, '[0.2, 0.3, 0.4]'),
    (3, '[0.3, 0.4, 0.5]'),
    (4, '[0.4, 0.5, 0.6]'),
    (5, '[0.5, 0.6, 0.7]'),
    (6, '[0.6, 0.7, 0.8]'),
    (7, '[0.7, 0.8, 0.9]'),
    (8, '[0.8, 0.9, 1.0]'),
    (9, '[0.9, 1.0, 0.1]'),
    (10, '[1.0, 0.1, 0.2]');

-- Perform exact search
SELECT c1
FROM t1
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;
```

The result is as follows:

```shell
+---------------+
| c1            |
+---------------+
| [0.1,0.2,0.3] |
| [0.2,0.3,0.4] |
| [0.3,0.4,0.5] |
| [0.4,0.5,0.6] |
| [0.5,0.6,0.7] |
+---------------+
5 rows in set
```

### Perform approximate search using vector indexes

Vector index search uses an approximate nearest neighbor (ANN) strategy, accelerating the search process through pre-built index structures. Although it cannot guarantee 100% accuracy of the results, it significantly improves search performance and achieves a good balance between accuracy and performance in practical applications.

#### Example: HNSW index approximate search

```sql
-- Create the table with an HNSW vector index
CREATE TABLE t2 (
    id INT PRIMARY KEY,
    vec VECTOR(3),
    VECTOR INDEX idx(vec) WITH (distance=l2, type=hnsw, lib=vsag)
);

-- Insert test data
INSERT INTO t2 VALUES
    (1, '[0.1, 0.2, 0.3]'),
    (2, '[0.2, 0.3, 0.4]'),
    (3, '[0.3, 0.4, 0.5]'),
    (4, '[0.4, 0.5, 0.6]'),
    (5, '[0.5, 0.6, 0.7]'),
    (6, '[0.6, 0.7, 0.8]'),
    (7, '[0.7, 0.8, 0.9]'),
    (8, '[0.8, 0.9, 1.0]'),
    (9, '[0.9, 1.0, 0.1]'),
    (10, '[1.0, 0.1, 0.2]');

-- Perform approximate search, returning the 5 most similar records
SELECT id, vec
FROM t2
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
APPROXIMATE
LIMIT 5;
```

The result is as follows. Due to the small data volume, it is consistent with the exact search result above:

```shell
+------+---------------+
| id   | vec           |
+------+---------------+
|    1 | [0.1,0.2,0.3] |
|    2 | [0.2,0.3,0.4] |
|    3 | [0.3,0.4,0.5] |
|    4 | [0.4,0.5,0.6] |
|    5 | [0.5,0.6,0.7] |
+------+---------------+
5 rows in set
```

### Summary

A comparison of the two search methods is as follows:

| Comparison item | Exact search | Approximate search |
|-----------------|--------------|--------------------|
| Execution method | Full table scan (`TABLE FULL SCAN`) followed by sorting | Direct search through the vector index (`VECTOR INDEX SCAN`) |
| Performance characteristics | Scans all table data and sorts it; performance decreases significantly as data volume grows | Locates target data directly through the index; stable performance |
| Result accuracy | 100% accurate; guarantees returning the true nearest neighbors | Approximately accurate; may have minor errors |
| Applicable scenarios | Small data volumes, scenarios with high accuracy requirements | Large-scale datasets, scenarios with high performance requirements |
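
To confirm which access path a query actually takes, you can inspect the plan with `EXPLAIN`; a minimal sketch (the exact plan output depends on your version and statistics):

```sql
-- An approximate query should show a vector index scan rather than a full table scan
EXPLAIN SELECT id, vec
FROM t2
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
APPROXIMATE
LIMIT 5;
```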

## What's next

For more guides on experiencing seekdb's AI Native features and building AI applications based on seekdb, see:

* [Experience full-text indexing](40.experience-full-text-indexing.md)
* [Experience hybrid search](50.experience-hybrid-search.md)
* [Experience AI function service](60.experience-ai-function.md)
* [Experience semantic indexing](70.experience-hybrid-vector-index.md)
* [Experience the Vibe Coding paradigm with Cursor Agent + OceanBase MCP](80.experience-vibe-coding-paradigm-with-cursor-agent-oceanbase-mcp.md)
* [Build a knowledge base desktop application based on seekdb](../../500.tutorials/100.create-ai-app-demo/100.build-kb-in-seekdb.md)
* [Build a cultural tourism assistant with multi-model integration based on seekdb](../../500.tutorials/100.create-ai-app-demo/300.build-multi-model-application-based-on-oceanbase.md)
* [Build an image search application based on seekdb](../../500.tutorials/100.create-ai-app-demo/400.build-image-search-app-in-seekdb.md)

In addition to using SQL for operations, you can also use the Python SDK (pyseekdb) provided by seekdb. For usage instructions, see [Experience embedded seekdb](../50.embedded-mode/25.using-seekdb-in-python-sdk.md) and [pyseekdb overview](../../200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

@@ -0,0 +1,354 @@

---
slug: /experience-full-text-indexing
---

# Experience full-text indexing

## Background information

seekdb's full-text indexing feature can effectively solve various problems encountered in production, especially in scenarios such as system log analysis and user behavior and profile analysis. The feature filters and screens data quickly and efficiently, and performs high-quality relevance evaluation. In addition, combined with a multi-path recall architecture over sparse and dense vectors, it enables more efficient recall for RAG systems in specific knowledge domains.

This tutorial uses document retrieval scenarios as an example. In such scenarios, three core challenges place higher demands on retrieval systems:

- **Real-time requirements**: Quickly locate target information from TB-level data.
- **Semantic complexity**: Solve natural language processing challenges such as word segmentation and synonym processing.
- **Hybrid query requirements**: Improve the joint optimization capability of text retrieval and structured queries.

This tutorial demonstrates how to quickly find target documents from massive information by using the full-text indexing feature. We will use keywords in queries to demonstrate the improvements of seekdb's full-text indexing in terms of functionality, performance, and ease of use.

## How it works

In seekdb's storage engine, user documents and queries are split into multiple keywords (words/tokens) by a tokenizer. These keywords and the statistical features of the documents are stored in internal auxiliary tables (tablets) for relevance evaluation (ranking) during the information retrieval phase. seekdb uses the BM25 algorithm, which effectively calculates the relevance score between the keywords in a user's query statement and the stored documents, and finally outputs the documents that meet the conditions together with their scores.
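
For reference, the standard BM25 scoring function (seekdb's exact parameterization is not documented here) computes, for a query $Q = q_1, \dots, q_n$ and a document $D$:

$$
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $f(q_i, D)$ is the term frequency of $q_i$ in $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are tunable constants.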

In the full-text indexing query process, combined with seekdb's high-performance query engine, seekdb has optimized the TAAT/DAAT process and supports union merges between multiple indexes. These improvements enable full-text indexing to handle more complex query features and meet users' data retrieval needs.

## Prerequisites

To successfully operate and experience seekdb's full-text indexing feature, ensure that the following prerequisites are met:

1. **Environment requirements**: seekdb is deployed.

2. **Database creation**: Ensure that a database is created. For detailed steps, see [Create a database](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971662).

## Procedure

The following steps guide you through seekdb's full-text indexing, along with common views and query techniques.

### Step 1: Import a dataset

seekdb has a built-in Beng tokenizer that provides efficient word segmentation for English documents, as well as a Boolean query mode that is more efficient than traditional natural language processing. seekdb's other built-in tokenizers include IK (for Chinese), space (for space-separated languages), and ngram (which splits by character length).

We will create a table named `wikir1k` with a `document` column, build a full-text index on the `document` field using the Beng tokenizer, and import the [wikIR1k dataset](https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/img/SeekDB/get-started/documents.csv) into seekdb.

:::tip
All query results and performance metrics shown in the examples are for reference only. Your actual results may vary depending on your data volume, machine specifications, and query patterns.
:::

```sql
-- Create a table and use the Beng tokenizer for full-text indexing
CREATE TABLE wikir1k (
    id INT AUTO_INCREMENT PRIMARY KEY,
    document TEXT,
    FULLTEXT INDEX ft_idx1_document(document)
    WITH PARSER beng
);
```

Import the dataset into the table from a local file on the client.

```sql
-- Import data
LOAD DATA /*+ PARALLEL(8) */ LOCAL INFILE '/home/admin/documents10k.csv' INTO TABLE wikir1k
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
```

After the import completes, check the number of documents and their average length.

```sql
-- Check the number of imported records and the average document length
SELECT AVG(LENGTH(document)), COUNT(*) FROM wikir1k;
```

The following result is returned:

```sql
+-----------------------+----------+
| AVG(LENGTH(document)) | COUNT(*) |
+-----------------------+----------+
|             1144.6949 |   369721 |
+-----------------------+----------+
1 row in set (1.07 sec)
```

```sql
-- Query the space usage view to check the result
SELECT * FROM oceanbase.DBA_OB_TABLE_SPACE_USAGE WHERE DATABASE_NAME = 'test' AND TABLE_NAME LIKE '%wikir1k%';
```

The following result is returned:

```sql
+----------+---------------+------------+-------------+---------------+
| TABLE_ID | DATABASE_NAME | TABLE_NAME | OCCUPY_SIZE | REQUIRED_SIZE |
+----------+---------------+------------+-------------+---------------+
|   500252 | test          | wikir1k    |   185571540 |     190853120 |
+----------+---------------+------------+-------------+---------------+
1 row in set (0.05 sec)
```

### Step 2: Query using full-text indexing

Using the stored document dataset and index, we can perform multi-condition or highly filtered retrieval. For example, to search for documents containing both "london" and "mayfair", you can use Boolean mode.

Compared to string `LIKE` matching without an index, Boolean mode has simpler syntax and faster query speed.

```sql
-- Use Boolean mode to find documents that contain both "london" and "mayfair"
SELECT COUNT(*) FROM wikir1k
WHERE MATCH (document) AGAINST ('+london +mayfair' IN BOOLEAN MODE);
```

The following result is returned:

```sql
+----------+
| COUNT(*) |
+----------+
|       58 |
+----------+
1 row in set (0.01 sec)
```

In contrast, using the `LIKE` query method:

```sql
-- Use LIKE syntax to query
SELECT COUNT(*) FROM wikir1k
WHERE document LIKE '%london%' AND document LIKE '%mayfair%';
```

The same result is returned, but the query takes noticeably longer:

```sql
+----------+
| COUNT(*) |
+----------+
|       58 |
+----------+
1 row in set (3.48 sec)
```

For the returned documents, we can further rank them by the score in the output to determine which documents are more relevant to the query.

```sql
-- Return the id and score of the documents to help determine relevance
SELECT id, MATCH (document) AGAINST ('london mayfair') AS score
FROM wikir1k
WHERE MATCH (document) AGAINST ('+london +mayfair' IN BOOLEAN MODE)
LIMIT 10;
```

The following result is returned:

```sql
+---------+--------------------+
| id      | score              |
+---------+--------------------+
|  425035 | 17.661768297948015 |
| 1122217 | 16.349131415195043 |
|   34959 | 14.813025094926918 |
| 1576669 | 14.620715555483576 |
| 2100682 |  13.40354137543347 |
| 1179964 |  13.40354137543347 |
| 1642217 | 13.391619146335605 |
|  123391 |  13.36985391637557 |
|  852529 | 13.336357369363272 |
|  380931 | 13.249691534256172 |
+---------+--------------------+
10 rows in set (0.03 sec)
```

Boolean mode also allows you to exclude keywords. For example, to find documents about "london" that do not mention "westminster", use the `-` operator in Boolean mode.

```sql
-- Query documents about london while excluding westminster
SELECT COUNT(*) FROM wikir1k
WHERE MATCH (document) AGAINST ('+london -westminster' IN BOOLEAN MODE);
```

The following result is returned:

```sql
+----------+
| COUNT(*) |
+----------+
|    18771 |
+----------+
1 row in set (0.01 sec)
```

### Step 3: Tuning

#### Tune using the `TOKENIZE` function

When the query results of full-text indexing do not meet expectations, it is usually because the tokenization results are not ideal. seekdb provides a fast `TOKENIZE` function to assist in testing tokenization effects. The function supports all tokenizers and their corresponding properties.

For example, the tokenization results in the following example show how the Beng tokenizer splits English text into words, which helps verify that the tokenization is working correctly.

1. Use the `TOKENIZE` function to verify how the tokenizer processes the text:

```sql
-- Verify English document tokenization using the Beng tokenizer
SELECT TOKENIZE('The computer system provides efficient processing and information management capabilities', 'beng', '[]');
```

The following result is returned:

```sql
+---------------------------------------------------------------------------------------------------------------------+
| TOKENIZE('The computer system provides efficient processing and information management capabilities', 'beng', '[]') |
+---------------------------------------------------------------------------------------------------------------------+
| ["efficient", "processing", "capabilities", "system", "computer", "provides", "and", "information", "management"]   |
+---------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
```

The above result shows that the text has been correctly split into individual words.

2. Next, execute the following statement to check whether the query statement hits the target documents:

```sql
-- Use Boolean mode to retrieve documents about computer systems
SELECT COUNT(*)
FROM wikir1k
WHERE MATCH (document) AGAINST ('+computer +system' IN BOOLEAN MODE);
```

The following result is returned:

```sql
+----------+
| COUNT(*) |
+----------+
|     1010 |
+----------+
1 row in set (0.01 sec)
```

The above result shows that target records were matched.

## Performance comparison with MySQL

To compare the full-text indexing performance of seekdb and MySQL, we use MySQL's full-text indexing feature as a reference. The complete `wikir1k` dataset (369,721 rows, with an average of 200 words per row) is used for the comparison.

:::tip
The test results are provided for reference only and may vary depending on your specific environment, data volume, and query patterns.
:::

The following are the comparison results for various scenarios in natural language mode and Boolean mode. In scenarios that require a large amount of tokenization or return large result sets, seekdb's performance is significantly better than MySQL's. For small result sets, the calculation proportion is small, the query engine's advantage is not obvious, and the performance of the two engines is similar.

**Test environment**: seekdb runs on an 8c16g specification; the MySQL version is 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL).

### Natural language mode

```sql
-- q1: Query documents containing "and"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('and');

-- q2: Query documents containing "and", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('and') LIMIT 10;

-- q3: Query documents containing "librettists"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('librettists');

-- q4: Query documents containing "librettists", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('librettists') LIMIT 10;

-- q5: Query documents containing "alleviating librettists"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists');

-- q6: Query documents containing "black spotted white yellow"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('black spotted white yellow');

-- q7: Query documents containing "black spotted white yellow", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('black spotted white yellow') LIMIT 10;

-- q8: Query documents containing "between up and down"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('between up and down');

-- q9: Query documents containing "between up and down", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('between up and down') LIMIT 10;

-- q10: Query with many tokens
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire');

-- q11: Query with many tokens, with "and" appended
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire and');

-- q12: Query with many tokens, limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire and') LIMIT 10;
```

| **Scenario** | **seekdb** | **MySQL** |
|--------------|------------|-----------|
| q1 Single token, high-frequency word | 3820458us | 5718430us |
| q2 Single token, high-frequency word, limit | 231861us | 503772us |
| q3 Single token, low-frequency word | 879us | 672us |
| q4 Single token, low-frequency word, limit | 720us | 700us |
| q5 Multiple tokens, small result set | 1591us | 1100us |
| q6 Multiple tokens, medium result set | 259700us | 602221us |
| q7 Multiple tokens, medium result set, limit | 25502us | 42620us |
| q8 Multiple tokens, large result set | 3842391us | 6846847us |
| q9 Multiple tokens, large result set, limit | 301362us | 784024us |
| q10 Many tokens, small result set | 22143us | 10161us |
| q11 Many tokens, large result set | 3905829us | 5929343us |
| q12 Many tokens, large result set, limit | 345968us | 769970us |

### Boolean mode

```sql
-- q1: +high-frequency word -medium-frequency words
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and -which -his' IN BOOLEAN MODE);

-- q2: +medium-frequency word (+high-frequency word -medium-frequency word)
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+which (+and -his)' IN BOOLEAN MODE);

-- q3: +high-frequency word -low-frequency words
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and -carabantes -bufera' IN BOOLEAN MODE);

-- q4: +high-frequency word +low-frequency word
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and +librettists' IN BOOLEAN MODE);
```

| **Scenario** | **seekdb** | **MySQL** |
|--------------|------------|-----------|
| q1: +high-frequency word -medium-frequency words | 1586657us | 2440798us |
| q2: +medium-frequency word (+high-frequency word -medium-frequency word) | 3726508us | 7974832us |
| q3: +high-frequency word -low-frequency words | 3080644us | 5612041us |
| q4: +high-frequency word +low-frequency word | 230284us | 357580us |

### Performance comparison summary

The data above shows that for complex full-text retrieval, seekdb performs significantly better than MySQL in both natural language mode and Boolean mode, and its advantage is most pronounced for queries that require a large amount of tokenization or return large result sets. This provides a useful reference for developers and data analysts when choosing a database, especially for applications that require efficient retrieval over massive data, where seekdb demonstrates strong performance and flexible query capabilities.

seekdb's full-text indexing provides consistently fast response times for complex queries, making it well suited to application scenarios that require high-concurrency, high-performance retrieval.

## What's next

For more guides on experiencing seekdb's AI Native features and building AI applications based on seekdb, see:

* [Experience vector search](30.experience-vector-search.md)
* [Experience hybrid search](50.experience-hybrid-search.md)
* [Experience AI function service](60.experience-ai-function.md)
* [Experience semantic indexing](70.experience-hybrid-vector-index.md)
* [Experience the Vibe Coding paradigm with Cursor Agent + OceanBase MCP](80.experience-vibe-coding-paradigm-with-cursor-agent-oceanbase-mcp.md)
* [Build a knowledge base desktop application based on seekdb](../../500.tutorials/100.create-ai-app-demo/100.build-kb-in-seekdb.md)
* [Build a cultural tourism assistant with multi-model integration based on seekdb](../../500.tutorials/100.create-ai-app-demo/300.build-multi-model-application-based-on-oceanbase.md)
* [Build an image search application based on seekdb](../../500.tutorials/100.create-ai-app-demo/400.build-image-search-app-in-seekdb.md)

In addition to using SQL for operations, you can also use the Python SDK (pyseekdb) provided by seekdb. For usage instructions, see [Experience embedded seekdb using Python SDK](../50.embedded-mode/25.using-seekdb-in-python-sdk.md) and [pyseekdb overview](../../200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

@@ -0,0 +1,360 @@

---
slug: /experience-hybrid-search
---

# Experience hybrid search in seekdb

This tutorial guides you through getting started with seekdb's hybrid search feature, demonstrating how hybrid search leverages the advantages of both full-text index keyword matching and vector index semantic search, to help you better understand its practical applications.

## Overview

Hybrid search combines vector-based semantic retrieval and full-text index-based keyword retrieval, providing more accurate and comprehensive retrieval results through comprehensive ranking. Vector search excels at approximate semantic matching but is weak at matching exact keywords, numbers, and proper nouns, while full-text retrieval effectively compensates for this deficiency. seekdb provides hybrid search functionality through the DBMS_HYBRID_SEARCH system package, supporting the following scenarios:

* Pure vector search: Finds relevant content based on semantic similarity; suitable for semantic search, recommendation systems, and other scenarios.
* Pure full-text search: Finds content based on keyword matching; suitable for document search, product search, and other scenarios.
* Hybrid search: Combines keyword matching and semantic understanding to provide more accurate and comprehensive search results.

This feature is widely used in intelligent search, document search, product recommendation, and other scenarios.

## Prerequisites

* Contact the administrator to obtain the corresponding database connection string, then execute the following command to connect to the database:

  ```shell
  # - host: seekdb database connection IP
  # - port: seekdb database connection port
  # - database_name: name of the database to access
  # - user_name: database username
  # - password: database password
  obclient -h$host -P$port -u$user_name -p$password -D$database_name
  ```

* A test table has been created, and vector indexes and full-text indexes have been created in the table:

  :::collapse
  ```sql
  CREATE TABLE doc_table(
      c1 INT,
      vector VECTOR(3),
      query VARCHAR(255),
      content VARCHAR(255),
      VECTOR INDEX idx1(vector) WITH (distance=l2, type=hnsw, lib=vsag),
      FULLTEXT INDEX idx2(query),
      FULLTEXT INDEX idx3(content)
  );

  INSERT INTO doc_table VALUES
      (1, '[1,2,3]', "hello world", "oceanbase Elasticsearch database"),
      (2, '[1,2,1]', "hello world, what is your name", "oceanbase mysql database"),
      (3, '[1,1,1]', "hello world, how are you", "oceanbase oracle database"),
      (4, '[1,3,1]', "real world, where are you from", "postgres oracle database"),
      (5, '[1,3,2]', "real world, how old are you", "redis oracle database"),
      (6, '[2,1,1]', "hello world, where are you from", "starrocks oceanbase database");
  ```
  :::

## Step 1: Pure vector search

Vector search finds semantically relevant content by calculating vector similarity, and is suitable for semantic search, recommendation systems, and other scenarios.

Set the search parameters and use vector search to find the records most similar to the query vector `[1,2,3]`:

```sql
SET @parm = '{
  "knn" : {
    "field": "vector",
    "k": 3,
    "query_vector": [1,2,3]
  }
}';

SELECT JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));
```

The following result is returned:

:::collapse
```shell
+------------------------------------------------------------+
| JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm)) |
+------------------------------------------------------------+
| [
  {
    "c1": 1,
    "query": "hello world",
    "_score": 1.0,
    "vector": "[1,2,3]",
    "content": "oceanbase Elasticsearch database"
  },
  {
    "c1": 5,
    "query": "real world, how old are you",
    "_score": 0.41421356,
    "vector": "[1,3,2]",
    "content": "redis oracle database"
  },
  {
    "c1": 2,
    "query": "hello world, what is your name",
    "_score": 0.33333333,
    "vector": "[1,2,1]",
    "content": "oceanbase mysql database"
  }
] |
+------------------------------------------------------------+
1 row in set
```
:::

The results are sorted by vector similarity, where `_score` represents the similarity score. A higher score indicates greater similarity.
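
The scores above are consistent with converting the L2 distance d into a similarity as 1/(1+d): for example, the distance between `[1,2,3]` and `[1,3,2]` is √2, and 1/(1+√2) ≈ 0.41421356. Assuming this observed relation holds, it can be checked directly:

```sql
-- Recompute the reported _score values from the L2 distance (observed relation)
SELECT c1, 1 / (1 + l2_distance(vector, '[1,2,3]')) AS expected_score
FROM doc_table
ORDER BY expected_score DESC
LIMIT 3;
```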

## Step 2: Pure full-text search

Full-text search finds content through keyword matching, and is suitable for document search, product search, and other scenarios.

Set the search parameters and use full-text search to find records containing the keywords in the `query` and `content` fields:

```sql
SET @parm = '{
  "query": {
    "query_string": {
      "fields": ["query", "content"],
      "query": "hello oceanbase"
    }
  }
}';

SELECT JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));
```

The following result is returned:

:::collapse
```shell
+------------------------------------------------------------+
| JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm)) |
+------------------------------------------------------------+
| [
  {
    "c1": 1,
    "query": "hello world",
    "_score": 0.37162162162162166,
    "vector": "[1,2,3]",
    "content": "oceanbase Elasticsearch database"
  },
  {
    "c1": 2,
    "query": "hello world, what is your name",
    "_score": 0.3503184713375797,
    "vector": "[1,2,1]",
    "content": "oceanbase mysql database"
  },
  {
    "c1": 3,
    "query": "hello world, how are you",
    "_score": 0.3503184713375797,
    "vector": "[1,1,1]",
    "content": "oceanbase oracle database"
  },
  {
    "c1": 6,
    "query": "hello world, where are you from",
    "_score": 0.3503184713375797,
    "vector": "[2,1,1]",
    "content": "starrocks oceanbase database"
  }
] |
+------------------------------------------------------------+
1 row in set
```
:::

The results are sorted by keyword matching degree, where `_score` represents the matching score. A higher score indicates a better match.

## Step 3: Hybrid search

Hybrid search combines keyword matching and semantic understanding to provide more accurate and comprehensive search results, leveraging the advantages of both full-text indexes and vector indexes.

Set the search parameters to perform both full-text search and vector search simultaneously:

```sql
SET @parm = '{
  "query": {
    "query_string": {
      "fields": ["query", "content"],
      "query": "hello oceanbase"
    }
  },
  "knn" : {
    "field": "vector",
    "k": 5,
    "query_vector": [1,2,3]
  }
}';

SELECT JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));
```

The following result is returned:

:::collapse
```shell
+------------------------------------------------------------+
| JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm)) |
+------------------------------------------------------------+
| [
  {
    "c1": 1,
    "query": "hello world",
    "_score": 0.37162162162162166,
    "vector": "[1,2,3]",
    "content": "oceanbase Elasticsearch database"
  },
  {
    "c1": 2,
    "query": "hello world, what is your name",
    "_score": 0.3503184713375797,
    "vector": "[1,2,1]",
    "content": "oceanbase mysql database"
  },
  {
    "c1": 3,
    "query": "hello world, how are you",
    "_score": 0.3503184713375797,
    "vector": "[1,1,1]",
    "content": "oceanbase oracle database"
  },
  {
    "c1": 6,
    "query": "hello world, where are you from",
    "_score": 0.3503184713375797,
    "vector": "[2,1,1]",
    "content": "starrocks oceanbase database"
  }
] |
+------------------------------------------------------------+
1 row in set (0.00 sec)
|
||||
|
||||
MySQL [test]> SET @parm = '{
|
||||
'> "query": {
|
||||
'> "query_string": {
|
||||
'> "fields": ["query", "content"],
|
||||
'> "query": "hello oceanbase"
|
||||
'> }
|
||||
'> },
|
||||
'> "knn" : {
|
||||
'> "field": "vector",
|
||||
'> "k": 5,
|
||||
'> "query_vector": [1,2,3]
|
||||
'> }
|
||||
'> }';
|
||||
Query OK, 0 rows affected (0.00 sec)
|
||||
|
||||
MySQL [test]>
|
||||
MySQL [test]> SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));
|
||||
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm))                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
  {
    "c1": 1,
    "query": "hello world",
    "_score": 1.3716216216216217,
    "vector": "[1,2,3]",
    "content": "oceanbase Elasticsearch database"
  },
  {
    "c1": 2,
    "query": "hello world, what is your name",
    "_score": 0.6836518013375796,
    "vector": "[1,2,1]",
    "content": "oceanbase mysql database"
  },
  {
    "c1": 3,
    "query": "hello world, how are you",
    "_score": 0.6593354613375797,
    "vector": "[1,1,1]",
    "content": "oceanbase oracle database"
  },
  {
    "c1": 5,
    "query": "real world, how old are you",
    "_score": 0.41421356,
    "vector": "[1,3,2]",
    "content": "redis oracle database"
  },
  {
    "c1": 6,
    "query": "hello world, where are you from",
    "_score": 0.3503184713375797,
    "vector": "[2,1,1]",
    "content": "starrocks oceanbase database"
  },
  {
    "c1": 4,
    "query": "real world, where are you from",
    "_score": 0.30901699,
    "vector": "[1,3,1]",
    "content": "postgres oracle database"
  }
] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::

The hybrid search results take both the keyword matching score (`_keyword_score`) and the semantic similarity score (`_semantic_score`) into account: the final `_score` is the sum of the two and determines the overall ranking. For example, the row with `c1 = 1` has a keyword matching score of about 0.3716 and, because its vector `[1,2,3]` exactly matches the query vector, a semantic score of 1.0, which gives the final `_score` of about 1.3716.

## Parameter tuning

In hybrid search, you can adjust the relative weight of full-text search and vector search through the `boost` parameter to optimize search results. For example, to increase the weight of full-text search:

```sql
SET @parm = '{
  "query": {
    "query_string": {
      "fields": ["query", "content"],
      "query": "hello oceanbase",
      "boost": 2.0
    }
  },
  "knn" : {
    "field": "vector",
    "k": 5,
    "query_vector": [1,2,3],
    "boost": 1.0
  }
}';

SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));
```

By adjusting the `boost` parameters, you control how much keyword matching and semantic similarity each contribute to the final ranking: if you care more about keyword matching, increase the `boost` value of `query_string`; if you care more about semantic similarity, increase the `boost` value of `knn`.
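
To get a concrete sense of the effect: assuming `boost` scales its component's contribution to the combined score, the row with `c1 = 1` above would score roughly 2.0 × 0.3716 + 1.0 × 1.0 ≈ 1.74 instead of 1.37, pushing keyword-heavy matches further up the ranking.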

## Summary

Through this tutorial, you have mastered the core features of seekdb hybrid search:

* Pure vector search: Finds relevant content through semantic similarity, suitable for semantic search scenarios.
* Pure full-text search: Finds content through keyword matching, suitable for precise search scenarios.
* Hybrid search: Combines keyword matching and semantic understanding to provide more comprehensive and accurate search results.

The hybrid search feature is an ideal choice for processing massive unstructured data and building intelligent search and recommendation systems, significantly improving the accuracy and comprehensiveness of retrieval results.

### What's next

* Explore [AI function service features](../../200.develop/300.ai-function/200.ai-function.md)
* View [hybrid vector index](70.experience-hybrid-vector-index.md) to simplify vector search processes

## More information

For more guides on experiencing seekdb's AI Native features and building AI applications based on seekdb, see:

* [Experience vector search](30.experience-vector-search.md)
* [Experience full-text indexing](40.experience-full-text-indexing.md)
* [Experience AI function service](60.experience-ai-function.md)
* [Experience semantic indexing](70.experience-hybrid-vector-index.md)
* [Experience the Vibe Coding paradigm with Cursor Agent + OceanBase MCP](80.experience-vibe-coding-paradigm-with-cursor-agent-oceanbase-mcp.md)
* [Build a knowledge base desktop application based on seekdb](../../500.tutorials/100.create-ai-app-demo/100.build-kb-in-seekdb.md)
* [Build a cultural tourism assistant with multi-model integration based on seekdb](../../500.tutorials/100.create-ai-app-demo/300.build-multi-model-application-based-on-oceanbase.md)
* [Build an image search application based on seekdb](../../500.tutorials/100.create-ai-app-demo/400.build-image-search-app-in-seekdb.md)

In addition to using SQL for operations, you can also use the Python SDK (pyseekdb) provided by seekdb. For usage instructions, see [Experience embedded seekdb using Python SDK](../50.embedded-mode/25.using-seekdb-in-python-sdk.md) and [pyseekdb overview](../../200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
@@ -0,0 +1,332 @@
---
slug: /experience-ai-function
---

# Experience AI function service in seekdb

This tutorial guides you through getting started with seekdb's AI function service, helping you understand how it brings AI model capabilities into the database, see its practical applications, and experience the features of an AI-native database.

## Overview

AI functions integrate AI model capabilities directly into in-database data processing through SQL expressions. They greatly simplify operations such as data extraction, analysis, summarization, and storage with large AI models, and are an important new capability in the database and data warehouse field. seekdb provides comprehensive AI model and endpoint management through the `DBMS_AI_SERVICE` package, ships multiple built-in AI function expressions, and supports monitoring AI model calls through views. You can call AI models directly in SQL without writing any additional code, and experience the core functions `AI_COMPLETE`, `AI_EMBED`, `AI_RERANK`, and `AI_PROMPT` in just a few minutes:

* `AI_EMBED`: Converts text data to vector data by calling an embedding model.
* `AI_COMPLETE`: Processes prompts and data by calling a specified text generation model and parses the results.
* `AI_PROMPT`: Organizes prompt templates and dynamic data into JSON format, which can be passed directly to the `AI_COMPLETE` function in place of the `prompt` parameter.
* `AI_RERANK`: Ranks texts by similarity to a prompt by calling a rerank model.

This feature can be applied to text generation, text conversion, text reranking, and other scenarios.

## Prerequisites

* Contact the administrator to obtain the corresponding database connection string, then execute the following command to connect to the database:

  ```shell
  # host: seekdb database connection IP.
  # port: seekdb database connection port.
  # database_name: Name of the database to access.
  # user_name: Database username.
  # password: Database password.
  obclient -h$host -P$port -u$user_name -p$password -D$database_name
  ```

* Ensure that you have the relevant permissions for the [AI function service](../../200.develop/300.ai-function/200.ai-function.md). Complete model and endpoint registration statements are provided before each example, which you can copy and use directly.

## Step 1: Use AI_EMBED to generate vectors

`AI_EMBED` converts text to vectors for vector retrieval. This is the fundamental step of vector retrieval: text data is converted into high-dimensional vector representations that can be compared for similarity.

### Register embedding model and endpoint

```sql
CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_embed');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_embed_endpoint');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
  'ob_embed', '{
    "type": "dense_embedding",
    "model_name": "BAAI/bge-m3"
  }');

-- Replace the access_key value below with your actual API key.
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
  'ob_embed_endpoint', '{
    "ai_model_name": "ob_embed",
    "url": "https://api.siliconflow.cn/v1/embeddings",
    "access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "provider": "siliconflow"
  }');
```

### Try embedding a single row of data

```sql
SELECT AI_EMBED("ob_embed", "Hello world") AS embedding;
```

The expected result is a vector array, such as `[0.1, 0.2, 0.3]`. In the same way, you can batch convert text stored in tables to vectors for subsequent vector retrieval, as shown in the sketch below.
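
A minimal sketch of such a batch conversion, assuming a hypothetical table `docs` with a text column `content` and a column `embedding` to hold the generated vectors (the same pattern appears in Step 4 of this tutorial):

```sql
-- Vectorize every row's content in one statement
-- (the table and column names here are illustrative)
UPDATE docs
SET embedding = AI_EMBED("ob_embed", content);
```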

## Step 2: Use AI_COMPLETE and AI_PROMPT to generate text

`AI_COMPLETE` can call large language models directly in SQL to implement text generation, translation, analysis, and other functions. The `AI_PROMPT` function organizes prompt templates and dynamic data into JSON format, which can be passed directly to the `AI_COMPLETE` function in place of the `prompt` parameter.

### Register text generation model and endpoint

```sql
CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_complete');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_complete_endpoint');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
  'ob_complete', '{
    "type": "completion",
    "model_name": "THUDM/GLM-4-9B-0414"
  }');

-- Replace the access_key value below with your actual API key.
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
  'ob_complete_endpoint', '{
    "ai_model_name": "ob_complete",
    "url": "https://api.siliconflow.cn/v1/chat/completions",
    "access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "provider": "siliconflow"
  }');
```

### Try sentiment analysis

```sql
SELECT AI_COMPLETE("ob_complete", AI_PROMPT('Your task is to perform sentiment analysis on the provided text and determine whether its emotional tendency is positive or negative.
The following is the text to be analyzed:
<text>
{0}
</text>
The judgment criteria are as follows:
If the text expresses positive emotions, output 1; if the text expresses negative emotions, output -1. Do not output anything else.', 'The weather is really good.')) AS sentiment;
```

The following result is returned:

```sql
+-----------+
| sentiment |
+-----------+
| 1         |
+-----------+
```
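
Because `{0}` is substituted on each call, the same template also works row by row. A hedged sketch, assuming a hypothetical table `reviews` with a text column `content`:

```sql
-- Classify every review with the same prompt template
-- (the `reviews` table is illustrative)
SELECT content,
       AI_COMPLETE("ob_complete",
                   AI_PROMPT('Determine whether the following text is positive or negative. Output 1 for positive and -1 for negative. Do not output anything else.
<text>
{0}
</text>', content)) AS sentiment
FROM reviews;
```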

## Step 3: Use AI_RERANK to optimize retrieval results

`AI_RERANK` intelligently reranks retrieval results, reordering a document list by relevance to the query text.

### Register rerank model and endpoint

```sql
CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_rerank');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_rerank_endpoint');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
  'ob_rerank', '{
    "type": "rerank",
    "model_name": "BAAI/bge-reranker-v2-m3"
  }');

-- Replace the access_key value below with your actual API key.
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
  'ob_rerank_endpoint', '{
    "ai_model_name": "ob_rerank",
    "url": "https://api.siliconflow.cn/v1/rerank",
    "access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "provider": "siliconflow"
  }');
```

### Try reranking

```sql
SELECT AI_RERANK("ob_rerank", "Apple", '["apple", "banana", "fruit", "vegetable"]');
```

The following result is returned:

:::collapse
```sql
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| AI_RERANK("ob_rerank", "Apple", '["apple", "banana", "fruit", "vegetable"]')                                                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"index": 0, "relevance_score": 0.9911285638809204}, {"index": 1, "relevance_score": 0.0030552432872354984}, {"index": 2, "relevance_score": 0.0003349370090290904}, {"index": 3, "relevance_score": 0.00001892922773549799}]   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::

Reranking can significantly improve the accuracy of retrieval results, which makes it especially suitable for RAG scenarios.
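
The return value is ordinary JSON, so you can post-process it with SQL JSON functions. A minimal sketch, assuming the MySQL-compatible `JSON_EXTRACT` behavior used later in this tutorial:

```sql
-- Pick out the index of the best-matching document
SELECT JSON_EXTRACT(
         AI_RERANK("ob_rerank", "Apple", '["apple", "banana", "fruit", "vegetable"]'),
         '$[0].index') AS best_match_index;
```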

## Step 4: Build an intelligent Q&A system

Combine the three AI functions to build a simple intelligent Q&A system in three steps.

### Register all required models and endpoints

This example uses an embedding model, a text generation model, and a rerank model at the same time. Ensure that the following models and endpoints are registered:

:::collapse
```sql
-- Register embedding model (skip if already registered in Step 1)
CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_embed');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_embed_endpoint');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
  'ob_embed', '{
    "type": "dense_embedding",
    "model_name": "BAAI/bge-m3"
  }');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
  'ob_embed_endpoint', '{
    "ai_model_name": "ob_embed",
    "url": "https://api.siliconflow.cn/v1/embeddings",
    "access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "provider": "siliconflow"
  }');

-- Register text generation model (skip if already registered in Step 2)
CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_complete');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_complete_endpoint');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
  'ob_complete', '{
    "type": "completion",
    "model_name": "THUDM/GLM-4-9B-0414"
  }');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
  'ob_complete_endpoint', '{
    "ai_model_name": "ob_complete",
    "url": "https://api.siliconflow.cn/v1/chat/completions",
    "access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "provider": "siliconflow"
  }');

-- Register rerank model (skip if already registered in Step 3)
CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_rerank');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_rerank_endpoint');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
  'ob_rerank', '{
    "type": "rerank",
    "model_name": "BAAI/bge-reranker-v2-m3"
  }');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
  'ob_rerank_endpoint', '{
    "ai_model_name": "ob_rerank",
    "url": "https://api.siliconflow.cn/v1/rerank",
    "access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "provider": "siliconflow"
  }');
```
:::

:::info
Replace all <code>access_key</code> values with actual API keys. If you have already registered the corresponding models in the previous steps, you can skip those registrations.
:::

### Prepare data and generate vectors

```sql
CREATE TABLE knowledge_base (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255),
  content TEXT,
  embedding TEXT
);

INSERT INTO knowledge_base (title, content) VALUES
('seekdb Introduction', 'seekdb is a powerful database system that supports vector retrieval and AI functions.'),
('Vector Retrieval', 'Vector retrieval can be used for semantic search to find similar content.'),
('AI Functions', 'AI functions can directly call AI models in SQL.');

UPDATE knowledge_base
SET embedding = AI_EMBED("ob_embed", content);
```
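
As a quick sanity check, you can confirm that every row now carries a non-empty embedding; this is a hedged sketch using ordinary SQL:

```sql
-- Each row should report a non-NULL, non-trivial embedding length
SELECT id, title, LENGTH(embedding) AS embedding_len
FROM knowledge_base;
```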

### Vector retrieval and reranking

```sql
SET @query = "What is vector retrieval?";
SET @query_vector = AI_EMBED("ob_embed", @query);

-- In a complete flow, @query_vector would drive a vector search over the
-- knowledge base to produce the candidate list; for brevity, this example
-- constructs the document list directly as a string array:
SET @candidate_docs = '["seekdb is a powerful database system that supports vector retrieval and AI functions.", "Vector retrieval can be used for semantic search to find similar content."]';

SELECT AI_RERANK("ob_rerank", @query, @candidate_docs) AS ranked_results;
```

The following result is returned. `index` is the document index, and `relevance_score` is the relevance score:

```sql
+-------------------------------------------------------------------------------------------------------------+
| ranked_results                                                                                               |
+-------------------------------------------------------------------------------------------------------------+
| [{"index": 1, "relevance_score": 0.9904329776763916}, {"index": 0, "relevance_score": 0.16993996500968933}]  |
+-------------------------------------------------------------------------------------------------------------+
1 row in set
```

### Generate answers

Based on the retrieval and reranking results in the previous steps, generate an answer. Note that `'$[1]'` selects the document at index 1 of `@candidate_docs`, which is the document reranking scored highest:

```sql
SELECT AI_COMPLETE("ob_complete",
  AI_PROMPT('Based on the following document content, answer the user''s question.
User question: {0}

Relevant document: {1}

Please answer the user''s question concisely and accurately based on the above document content.', @query, CAST(JSON_EXTRACT(@candidate_docs, '$[1]') AS CHAR))) AS answer;
```

The following result is returned:

:::collapse
```sql
+--------------------------------------------------------------------------------------------------------------------------------------------+
| answer                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------+
| According to the provided document content, vector retrieval is a technology used for semantic search, aimed at finding similar content by comparing vector data. |
+--------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
:::

Through these three steps, you have completed an end-to-end AI application flow inside the seekdb database: vectorization, retrieval, reranking, and answer generation.

## Summary

Through this tutorial, you have mastered the core features of seekdb's AI function service:

* `AI_EMBED`: Converts text to vectors to prepare data for vector retrieval.
* `AI_COMPLETE`: Calls LLMs directly in SQL to implement text generation, translation, analysis, and other functions.
* `AI_RERANK`: Optimizes the accuracy of retrieval results and improves RAG application effectiveness.

### What's next

* View and monitor AI model information and call status through views in the [AI function service usage and examples - AI model call monitoring](../../200.develop/300.ai-function/200.ai-function.md) section
* Learn about [vector retrieval](../../200.develop/100.vector-search/100.vector-search-overview/100.vector-search-intro.md)
* Explore [hybrid search](50.experience-hybrid-search.md) features
* View [hybrid vector index](70.experience-hybrid-vector-index.md) to simplify vector search processes

## More information

For more guides on experiencing seekdb's AI Native features and building AI applications based on seekdb, see:

* [Experience vector search](30.experience-vector-search.md)
* [Experience full-text indexing](40.experience-full-text-indexing.md)
* [Experience hybrid search](50.experience-hybrid-search.md)
* [Experience semantic indexing](70.experience-hybrid-vector-index.md)
* [Experience the Vibe Coding paradigm with Cursor Agent + OceanBase MCP](80.experience-vibe-coding-paradigm-with-cursor-agent-oceanbase-mcp.md)
* [Build a knowledge base desktop application based on seekdb](../../500.tutorials/100.create-ai-app-demo/100.build-kb-in-seekdb.md)
* [Build a cultural tourism assistant with multi-model integration based on seekdb](../../500.tutorials/100.create-ai-app-demo/300.build-multi-model-application-based-on-oceanbase.md)
* [Build an image search application based on seekdb](../../500.tutorials/100.create-ai-app-demo/400.build-image-search-app-in-seekdb.md)

In addition to using SQL for operations, you can also use the Python SDK (pyseekdb) provided by seekdb. For usage instructions, see [Experience embedded seekdb using Python SDK](../50.embedded-mode/25.using-seekdb-in-python-sdk.md) and [pyseekdb overview](../../200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
@@ -0,0 +1,186 @@
---
slug: /experience-hybrid-vector-index
---

# Experience hybrid vector index in seekdb

This tutorial guides you through getting started with seekdb's hybrid vector index, helping you understand its practical applications and experience its capabilities. With a hybrid vector index, you can achieve semantic retrieval by storing text directly, without manually converting it to vectors.

## Overview

A hybrid vector index is a vector index that automatically converts text to vectors and builds the index. It is a powerful feature provided by seekdb that makes the vector concept transparent to users. Compared with an ordinary vector index, a hybrid vector index greatly simplifies the workflow.

* Workflow with an ordinary vector index:

  ```shell
  Text → Manually call the `AI_EMBED` function to generate vectors → Insert vectors → Use vector retrieval
  ```

* Workflow with a hybrid vector index:

  ```shell
  Text → Direct insertion → Direct text retrieval
  ```

seekdb automatically converts text to vectors and builds the index internally. During retrieval, you only need to provide the original text; the system automatically performs embedding and searches the vector index, significantly improving ease of use.

## Prerequisites

* Contact the administrator to obtain the corresponding database connection string, then execute the following command to connect to the database:

  ```shell
  # host: seekdb database connection IP.
  # port: seekdb database connection port.
  # database_name: Name of the database to access.
  # user_name: Database username.
  # password: Database password.
  obclient -h$host -P$port -u$user_name -p$password -D$database_name
  ```

* Ensure that you have the relevant permissions for the [AI function service](../../200.develop/300.ai-function/200.ai-function.md), and that an embedding model has been registered in the database using the `CREATE_AI_MODEL` and `CREATE_AI_MODEL_ENDPOINT` procedures:

  :::collapse
  ```sql
  CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_embed');
  CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_embed_endpoint');

  CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
    'ob_embed', '{
      "type": "dense_embedding",
      "model_name": "BAAI/bge-m3"
    }');

  -- Replace the access_key value below with your actual API key.
  CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
    'ob_embed_endpoint', '{
      "ai_model_name": "ob_embed",
      "url": "https://api.siliconflow.cn/v1/embeddings",
      "access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "provider": "siliconflow"
    }');
  ```
  :::

:::info
The hybrid vector index feature currently supports only the HNSW/HNSW_BQ index types.
:::

## Step 1: Create a hybrid vector index

A hybrid vector index can be created in two ways: **during table creation** or **after table creation**.

:::info
When creating a hybrid vector index, you must define it on a <code>VARCHAR</code> column and specify the embedding model and vector dimension.
:::

### Create during table creation

```sql
CREATE TABLE items (
  id INT PRIMARY KEY,
  doc VARCHAR(100),
  VECTOR INDEX vector_idx(doc)
    WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate)
);
```

### Create after table creation

```sql
CREATE TABLE items1 (
  id INT PRIMARY KEY,
  doc VARCHAR(100)
);

CREATE VECTOR INDEX vector_idx
  ON items1 (doc)
  WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate);
```

## Step 2: Insert text data (no manual vectorization required)

When you insert text data, the system performs embedding automatically; there is no need to call the `AI_EMBED` function manually:

```sql
INSERT INTO items(id, doc) VALUES(1, 'Rose');
INSERT INTO items(id, doc) VALUES(2, 'Sunflower');
INSERT INTO items(id, doc) VALUES(3, 'Lily');
```

## Step 3: Use text for direct retrieval

Use the `semantic_distance` function and pass in the original query text; there is no need to generate a query vector manually:

```sql
SELECT id, doc FROM items
ORDER BY semantic_distance(doc, 'flower')
APPROXIMATE LIMIT 3;
```

The following result is returned:

```sql
+----+-----------+
| id | doc       |
+----+-----------+
|  1 | Rose      |
|  2 | Sunflower |
|  3 | Lily      |
+----+-----------+
3 rows in set
```

The system automatically converts the query text `'flower'` to a vector and then retrieves the most similar text from the vector index.

## Advanced: Use vector retrieval

If you already have vector representations of the query content (for example, pre-generated with the `AI_EMBED` function), you can also query the hybrid vector index with these vectors directly, avoiding a repeated embedding operation on every retrieval:

```sql
-- First obtain the query vector
SET @query_vector = AI_EMBED("ob_embed", "flower");

-- Use the vector to search the index
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @query_vector)
APPROXIMATE LIMIT 3;
```

The following result is returned:

```sql
+----+-----------+
| id | doc       |
+----+-----------+
|  1 | Rose      |
|  2 | Sunflower |
|  3 | Lily      |
+----+-----------+
3 rows in set
```

## Summary

Through this tutorial, you have mastered the core features of seekdb's hybrid vector index:

* Simplified workflow: Achieve semantic retrieval by storing text directly, without manually converting it to vectors.
* Automatic embedding: The system automatically converts text to vectors and builds the index. During retrieval, you only need to provide the original text.
* Performance optimization: Supports direct vector retrieval to avoid repeated embedding operations.

The hybrid vector index feature greatly simplifies the workflow of vector retrieval and is an ideal choice for building intelligent search applications.

### What's next

* Learn about [vector index maintenance and monitoring](../../200.develop/100.vector-search/200.vector-index/200.dense-vector-index.md)
* Learn more about [AI function service features](../../200.develop/300.ai-function/200.ai-function.md)
* Explore [hybrid search](50.experience-hybrid-search.md) to combine keyword matching and semantic understanding for more accurate and comprehensive search results.

## More information

For more guides on experiencing seekdb's AI Native features and building AI applications based on seekdb, see:

* [Experience vector search](30.experience-vector-search.md)
* [Experience full-text indexing](40.experience-full-text-indexing.md)
* [Experience hybrid search](50.experience-hybrid-search.md)
* [Experience AI function service](60.experience-ai-function.md)
* [Experience the Vibe Coding paradigm with Cursor Agent + OceanBase MCP](80.experience-vibe-coding-paradigm-with-cursor-agent-oceanbase-mcp.md)
* [Build a knowledge base desktop application based on seekdb](../../500.tutorials/100.create-ai-app-demo/100.build-kb-in-seekdb.md)
* [Build a cultural tourism assistant with multi-model integration based on seekdb](../../500.tutorials/100.create-ai-app-demo/300.build-multi-model-application-based-on-oceanbase.md)
* [Build an image search application based on seekdb](../../500.tutorials/100.create-ai-app-demo/400.build-image-search-app-in-seekdb.md)

In addition to using SQL for operations, you can also use the Python SDK (pyseekdb) provided by seekdb. For usage instructions, see [Experience embedded seekdb using Python SDK](../50.embedded-mode/25.using-seekdb-in-python-sdk.md) and [pyseekdb overview](../../200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
@@ -0,0 +1,287 @@
---
slug: /experience-vibe-coding-paradigm-with-cursor-agent-oceanbase-mcp
---

# Experience the vibe coding paradigm with Cursor Agent + OceanBase MCP

Can you launch a product without writing a single line of code? In the AI era, "writing code" may no longer be about "writing" code. This vision is gradually becoming reality as AI brings transformative changes to how people live and work. Vibe coding, a term proposed by AI researcher Andrej Karpathy in 2025, describes a new development approach: developers express their intent to AI directly in natural language (voice or text), and the rest, from generating code and optimizing structure to partial debugging, is handled by AI. In this mode, developers only need to express clearly "what I want", such as "create a multi-language login module for me", and the AI outputs the corresponding project structure and implementation code. In other words, developers focus on "goals", and AI handles "implementation".

## Vibe coding vs. traditional AI-assisted development

AI-assisted development is not a new concept. However, unlike traditional AI development assistants such as Gemini Code Assist, which emphasize manual line-by-line review and code completion, vibe coding tends to automate entire development stages and reduces developers' involvement in low-level details.

Of course, current industry practice is still far from fully "black-box adoption" of unreviewed code, and real implementations still mostly use a hybrid process of human-machine collaboration and review.

* **Trust mechanisms are not yet mature:** In the ideal vibe coding scenario, developers adopt AI output directly, but this is difficult to achieve in real projects. Especially in scenarios with high security requirements and business complexity, manual review and testing remain essential.

* **Auxiliary tools continue to evolve:** New IDEs keep improving their natural language processing and context awareness, enhancing AI code quality and the interaction experience. However, these tools are still mostly suited to standardized or prototyping tasks, and complex systems still require engineers to actively oversee and participate.

* **Collaboration and requirements management are becoming trends:** As vibe coding evolves, collaboration between developers and coordination across projects are becoming new trends. Protocols such as MCP (Model Context Protocol) enable multiple developers to collaborate through fragmented descriptions, synchronize adjustments, and integrate requirements simultaneously, while agents (such as Cursor) make these processes smoother.

## Cursor: A new generation of AI-native development environment

As AI-driven development modes such as vibe coding become mainstream, a range of AI-native tools has emerged to give developers more convenient and user-friendly environments. Cursor is one of the leaders: an AI-driven code editor that deeply understands codebases and supports efficient programming through natural language. You can download Cursor from [https://cursor.com](https://cursor.com).

Compared with traditional "code suggestion" tools, AI-native IDEs such as Cursor support deeper natural language-to-code interaction, automatic context association, and intelligent debugging assistance. For example, a single natural language description can build a crawler, configure dependencies, and even add tests and exception handling, lowering development barriers and improving engineering efficiency.

## Cursor Agent + OceanBase MCP: A new vibe coding paradigm

seekdb, built on a unified architecture, provides vector capabilities and supports multi-modal fusion queries. It can meet diverse business needs without introducing new technology stacks, reducing learning costs and accelerating AI development, which makes it a preferred vector database for many developers.

Currently, both seekdb and Cursor support the MCP (Model Context Protocol). With MCP, developers can easily implement the new vibe coding paradigm based on Cursor Agent + seekdb.

The MCP protocol can be regarded as an "adapter" connecting AI models with real business systems. Through MCP, large models can access external applications, such as Git version management and the database software commonly used in development, to obtain more environmental context and automatically complete various development stages. This means enterprises can seamlessly integrate seekdb's data service capabilities into AI application workflows directly through MCP, significantly lowering the barriers to data interface development and integration.

### Prerequisites

* You have deployed seekdb.

* Install [Python 3.11 or later](https://www.python.org/downloads/) and the corresponding [pip](https://pip.pypa.io/en/stable/installation/). If your machine has an older Python version, you can use Miniconda to create a Python 3.11 or later environment. For details, see the [Miniconda installation guide](https://docs.anaconda.com/miniconda/install/).

* Install [Git](https://git-scm.com/downloads) for your operating system.

* Install the Python package manager uv. After installation, run `uv --version` to verify that the installation succeeded:

  ```shell
  pip install uv
  uv --version
  ```

* Download [Cursor](https://cursor.com/downloads) and install the appropriate version for your operating system. Note that when using Cursor for the first time, you need to register a new account or log in with an existing one. After logging in, you can create a new project or open an existing project.

### Vibe coding practice

Following the vibe coding approach, we will combine coding with the database to quickly build an API service.

#### Step 1: Obtain database connection information

Contact the deployment personnel or administrator to obtain the corresponding database connection string, for example:

```shell
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameter description:**

* `$host`: The seekdb connection IP.
* `$port`: The seekdb database connection port. The default is `2881`, which can be customized when deploying the seekdb database.
* `$database_name`: The name of the database to access.
* `$user_name`: The connection account for the tenant. The default is `root`.
* `$password`: The account password.

#### Step 2: Clone the OceanBase MCP server project

Clone the project to your local machine:

```shell
git clone https://github.com/oceanbase/mcp-oceanbase.git
```

#### Step 3: Prepare the Cursor environment

1. Create a Cursor client working directory and configure the OceanBase MCP server.

   Manually create a Cursor working directory and open it with Cursor. Files generated by Cursor will be placed in this directory. The example directory name is `cursor`.

   Use the shortcut `Ctrl + L` (Windows) or `Command + L` (macOS) to open the chat dialog, click the gear icon in the upper right corner, select `MCP Tools`, and click `Add Custom MCP` to fill in the configuration file.

   The example configuration file is as follows. Replace `path/to/your/mcp-oceanbase/src/oceanbase_mcp_server` with the absolute path of the `oceanbase_mcp_server` folder, and replace `OB_HOST`, `OB_PORT`, `OB_USER`, `OB_PASSWORD`, and `OB_DATABASE` with your database information:

   ```json
   {
     "mcpServers": {
       "oceanbase": {
         "command": "uv",
         "args": [
           "--directory",
           "/path/to/your/mcp-oceanbase/src/oceanbase_mcp_server",
           "run",
           "oceanbase_mcp_server"
         ],
         "env": {
           "OB_HOST": "***",
           "OB_PORT": "***",
           "OB_USER": "***",
           "OB_PASSWORD": "***",
           "OB_DATABASE": "***"
         }
       }
     }
   }
   ```

2. If the configuration is successful, the server shows the `Available` status.

#### Step 4: Build a RESTful API

1. Create a customer table.

   Enter the instruction `Create a customer table with ID as the primary key, including name, age, telephone, and location fields`. After confirming the generated SQL statement, click the `Run tool` button to execute it. A sketch of what the generated DDL might look like follows.
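
   The DDL that Cursor generates and runs through the MCP server might resemble the following sketch (column types inferred from the API model later in this guide; your generated SQL may differ):

   ```sql
   CREATE TABLE customer (
     id        INT PRIMARY KEY,
     name      VARCHAR(100),
     age       INT,
     telephone VARCHAR(20),
     location  VARCHAR(100)
   );
   ```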

2. Insert test data.

   Enter the instruction `Insert 10 test records` in the dialog box. After confirming the SQL statement, click `Run tool`. After a successful insertion, you will see a message such as `10 test records have been successfully inserted into the customer table...`. A sketch of the generated statement follows.
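
   Based on the records returned by the API later in this guide, the generated statement might resemble this hedged sketch:

   ```sql
   INSERT INTO customer (id, name, age, telephone, location) VALUES
   (1, 'Alice', 28, '1234567890', 'Beijing'),
   (2, 'Bob', 32, '2345678901', 'Shanghai');
   -- ...and so on for the remaining test rows
   ```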

3. Create a FastAPI project.

   Enter the prompt `Create a FastAPI project and generate a RESTful API based on the customer table` in the dialog box. After confirming the generated operations, click the `Run tool` button. Cursor automatically generates files such as `main.py`, and you can continue to issue new instructions, for example to start the service automatically.

4. Create a virtual environment and install dependencies.

   Execute the following commands to create a virtual environment in the current directory with the uv package manager and install the dependency packages:

   ```shell
   uv venv
   source .venv/bin/activate
   uv pip install -r requirements.txt
   ```

5. Start the FastAPI project.

   Execute the following command to start the FastAPI project:

   ```shell
   uvicorn main:app --reload
   ```

6. View data in the table.

   Run the following command in the command line, or use another request tool, to view the data in the table:

   ```shell
   curl http://127.0.0.1:8000/customers
   ```

   The result is as follows:

   ```json
   [{"id":1,"name":"Alice","age":28,"telephone":"1234567890","location":"Beijing"},{"id":2,"name":"Bob","age":32,"telephone":"2345678901","location":"Shanghai"},{"id":3,"name":"Charlie","age":25,"telephone":"3456789012","location":"Guangzhou"},{"id":4,"name":"David","age":40,"telephone":"4567890123","location":"Shenzhen"},{"id":5,"name":"Eve","age":22,"telephone":"5678901234","location":"Chengdu"},{"id":6,"name":"Frank","age":35,"telephone":"6789012345","location":"Wuhan"},{"id":7,"name":"Grace","age":30,"telephone":"7890123456","location":"Hangzhou"},{"id":8,"name":"Heidi","age":27,"telephone":"8901234567","location":"Nanjing"},{"id":9,"name":"Ivan","age":29,"telephone":"9012345678","location":"Tianjin"},{"id":10,"name":"Judy","age":31,"telephone":"0123456789","location":"Chongqing"}]
   ```

You can see that a RESTful API with create, read, update, and delete operations has been generated:

```python
from fastapi import FastAPI, HTTPException, Depends
from pydantic import BaseModel
from typing import List
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, Session

# OceanBase connection configuration (modify according to your actual situation)
DATABASE_URL = "mysql://***:***@***:***/***"

engine = create_engine(DATABASE_URL, echo=True)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String(100))
    age = Column(Integer)
    telephone = Column(String(20))
    location = Column(String(100))

class CustomerCreate(BaseModel):
    id: int
    name: str
    age: int
    telephone: str
    location: str

class CustomerUpdate(BaseModel):
    name: str = None
    age: int = None
    telephone: str = None
    location: str = None

class CustomerOut(BaseModel):
    id: int
    name: str
    age: int
    telephone: str
    location: str
    class Config:
        orm_mode = True

def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

app = FastAPI()

@app.post("/customers/", response_model=CustomerOut)
def create_customer(customer: CustomerCreate, db: Session = Depends(get_db)):
    db_customer = Customer(**customer.dict())
    db.add(db_customer)
    try:
        db.commit()
        db.refresh(db_customer)
    except Exception as e:
        db.rollback()
        raise HTTPException(status_code=400, detail=str(e))
    return db_customer

@app.get("/customers/", response_model=List[CustomerOut])
def read_customers(skip: int = 0, limit: int = 100, db: Session = Depends(get_db)):
    return db.query(Customer).offset(skip).limit(limit).all()

@app.get("/customers/{customer_id}", response_model=CustomerOut)
def read_customer(customer_id: int, db: Session = Depends(get_db)):
    customer = db.query(Customer).filter(Customer.id == customer_id).first()
    if customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    return customer

@app.put("/customers/{customer_id}", response_model=CustomerOut)
def update_customer(customer_id: int, customer: CustomerUpdate, db: Session = Depends(get_db)):
    db_customer = db.query(Customer).filter(Customer.id == customer_id).first()
    if db_customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    for var, value in vars(customer).items():
        if value is not None:
            setattr(db_customer, var, value)
    db.commit()
    db.refresh(db_customer)
    return db_customer

@app.delete("/customers/{customer_id}")
def delete_customer(customer_id: int, db: Session = Depends(get_db)):
    db_customer = db.query(Customer).filter(Customer.id == customer_id).first()
    if db_customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    db.delete(db_customer)
    db.commit()
    return {"ok": True}
```

## What's next

For more guides on experiencing seekdb's AI Native features and building AI applications based on seekdb, see:

* [Experience vector search](30.experience-vector-search.md)
* [Experience full-text indexing](40.experience-full-text-indexing.md)
* [Experience hybrid search](50.experience-hybrid-search.md)
* [Experience AI function service](60.experience-ai-function.md)
* [Experience semantic indexing](70.experience-hybrid-vector-index.md)
* [Build a knowledge base desktop application based on seekdb](../../500.tutorials/100.create-ai-app-demo/100.build-kb-in-seekdb.md)
* [Build a cultural tourism assistant with multi-model integration based on seekdb](../../500.tutorials/100.create-ai-app-demo/300.build-multi-model-application-based-on-oceanbase.md)
* [Build an image search application based on seekdb](../../500.tutorials/100.create-ai-app-demo/400.build-image-search-app-in-seekdb.md)

In addition to using SQL for operations, you can also use the Python SDK (pyseekdb) provided by seekdb. For usage instructions, see [Experience embedded seekdb using Python SDK](../50.embedded-mode/25.using-seekdb-in-python-sdk.md) and [pyseekdb overview](../../200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
@@ -0,0 +1,23 @@
---
slug: /build-ai-apps
---

# Build AI applications

import DocsCard from '@components/global/DocsCard';
import DocsCards from '@components/global/DocsCards';

Get started with seekdb through a set of short, practical tutorials that show you how to build real AI applications with seekdb.

<DocsCards>
<DocsCard header="Build a knowledge base desktop application based on seekdb" href="/build-kb-in-seekdb">
<p>This tutorial walks you through creating MineKB (a personal, local knowledge base desktop application) using seekdb. You will learn how to combine vector search with large language models (LLMs) to deliver an intelligent question-answering experience.</p>
</DocsCard>

<DocsCard header="Build a cultural tourism assistant with seekdb multi-model integration" href="/build-multi-model-application-based-on-oceanbase">
<p>This tutorial demonstrates how to build a simple travel assistant using seekdb's multi-model integration. By combining spatial data with vector search, you will create a location-aware recommendation application that can retrieve relevant points of interest using mixed GIS and vector queries. Together with an LLM-driven Agent workflow, you can assemble a lightweight trip-planning assistant.</p>
</DocsCard>

<DocsCard header="Build an image search application based on seekdb" href="/build-image-search-app-in-seekdb">
<p>This tutorial shows how to build an image-to-image search application using seekdb's vector search features. seekdb brings vector search directly into SQL, enabling you to use advanced deep learning models to extract image embeddings and implement a simple image search system. The seamless integration of SQL and AI significantly simplifies AI application development.</p>
</DocsCard>
</DocsCards>
@@ -0,0 +1,184 @@
|
||||
---
|
||||
|
||||
slug: /using-seekdb-in-python-sdk
|
||||
---
|
||||
|
||||
# Experience embedded seekdb with Python SDK
|
||||
|
||||
This example demonstrates how to quickly experience embedded seekdb through pyseekdb (a Python client provided by OceanBase) in a Linux environment.
|
||||
|
||||
:::tip
|
||||
In addition to Linux, you can also use pyseekdb in macOS and Windows. However, only server mode of seekdb is supported. For more information about how to use pyseekdb in macOS and Windows, see [Get started with pyseekdb](../../200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
|
||||
:::
|
||||
|
||||
|
||||
## Background information
|
||||
|
||||
### pyseekdb
|
||||
|
||||
pyseekdb is a Python client provided by OceanBase. It implements a unified API interface that provides three database connection modes, supporting connections to embedded-mode seekdb, server-mode seekdb, and OceanBase databases.
|
||||
|
||||
Installing this client also installs embedded-mode seekdb, allowing you to directly connect to embedded seekdb to perform operations such as creating databases. Alternatively, you can choose to remotely connect to a deployed seekdb in client/server mode or OceanBase database.
|
||||
|
||||
### seekdb deployment modes
|
||||
|
||||
seekdb provides flexible deployment modes that support everything from rapid prototyping to large-scale user workloads, meeting the full range of your application needs.
|
||||
|
||||
* Embedded mode
|
||||
|
||||
seekdb embeds as a lightweight library installable with a single pip command, ideal for personal learning or prototyping, and can easily run on various end devices.
|
||||
|
||||
* Client/Server mode
|
||||
|
||||
A lightweight and easy-to-use deployment mode recommended for both testing and production, delivering stable and efficient service.
|
||||
|
||||
For information about using seekdb in client/server mode, see [Experience seekdb in client/server mode](../100.client-server-mode/10.deploy-seekdb-testing-environment.md).
|
||||
|
||||
## Install pyseekdb
|
||||
|
||||
### Prerequisites
|
||||
|
||||
Ensure that your environment meets the following requirements:
|
||||
|
||||
* Operating system: Linux (glibc >= 2.28)
|
||||
* Python version: Python 3.11 and later
|
||||
* System architecture: x86_64, aarch64
|
||||
|
||||
### Installation
|
||||
|
||||
Use pip to install, which automatically detects the default Python version and platform.
|
||||
|
||||
```shell
|
||||
pip install pyseekdb
|
||||
```
|
||||
|
||||
If your pip version is low, upgrade pip first before installing.
|
||||
|
||||
```bash
|
||||
pip install --upgrade pip
|
||||
```
|
||||
|
||||
## Experience seekdb with Python SDK
|
||||
|
||||
The following example uses embedded-mode seekdb to demonstrate basic operations with embedding functions, helping you quickly understand how to use seekdb.
|
||||
|
||||
1. Connect to seekdb.
|
||||
2. Create a collection with embedding functions.
|
||||
3. Add data using documents (vectors are automatically generated).
|
||||
4. Query using texts (vectors are automatically generated).
|
||||
5. Print query results.
|
||||
|
||||
|
||||
```python
|
||||
import pyseekdb
|
||||
|
||||
# ==================== Step 1: Create Client Connection ====================
|
||||
# You can use embedded mode, server mode, or OceanBase mode
|
||||
# For this example, we'll use embedded mode (you can switch to server-mode seekdb or OceanBase)
|
||||
|
||||
# Embedded mode (local SeekDB)
|
||||
client = pyseekdb.Client()
|
||||
# Alternative: Server mode (connecting to remote SeekDB server)
|
||||
# client = pyseekdb.Client(
|
||||
# host="127.0.0.1",
|
||||
# port=2881,
|
||||
# database="test",
|
||||
# user="root",
|
||||
# password=""
|
||||
# )
|
||||
|
||||
# Alternative: OceanBase mode (connecting to an OceanBase cluster)
|
||||
# client = pyseekdb.Client(
|
||||
# host="127.0.0.1",
|
||||
# port=2881,
|
||||
# tenant="test", # OceanBase default tenant
|
||||
# database="test",
|
||||
# user="root",
|
||||
# password=""
|
||||
# )
|
||||
|
||||
# ==================== Step 2: Create a Collection with Embedding Function ====================
|
||||
# A collection is like a table that stores documents with vector embeddings
|
||||
collection_name = "my_simple_collection"
|
||||
|
||||
# Create collection with default embedding function
|
||||
# The embedding function will automatically convert documents to embeddings
|
||||
collection = client.create_collection(
|
||||
name=collection_name,
|
||||
)
|
||||
|
||||
print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
|
||||
print(f"Embedding function: {collection.embedding_function}")
|
||||
|
||||
# ==================== Step 3: Add Data to Collection ====================
|
||||
# With embedding function, you can add documents directly without providing embeddings
|
||||
# The embedding function will automatically generate embeddings from documents
|
||||
|
||||
documents = [
|
||||
"Machine learning is a subset of artificial intelligence",
|
||||
"Python is a popular programming language",
|
||||
"Vector databases enable semantic search",
|
||||
"Neural networks are inspired by the human brain",
|
||||
"Natural language processing helps computers understand text"
|
||||
]
|
||||
|
||||
ids = ["id1", "id2", "id3", "id4", "id5"]
|
||||
|
||||
# Add data with documents only - embeddings will be auto-generated by embedding function
|
||||
collection.add(
|
||||
ids=ids,
|
||||
documents=documents, # embeddings will be automatically generated
|
||||
metadatas=[
|
||||
{"category": "AI", "index": 0},
|
||||
{"category": "Programming", "index": 1},
|
||||
{"category": "Database", "index": 2},
|
||||
{"category": "AI", "index": 3},
|
||||
{"category": "NLP", "index": 4}
|
||||
]
|
||||
)
|
||||
|
||||
print(f"\nAdded {len(documents)} documents to collection")
|
||||
print("Note: Embeddings were automatically generated from documents using the embedding function")
|
||||
|
||||
# ==================== Step 4: Query the Collection ====================
|
||||
# With embedding function, you can query using text directly
|
||||
# The embedding function will automatically convert query text to query vector
|
||||
|
||||
# Query using text - query vector will be auto-generated by embedding function
|
||||
query_text = "artificial intelligence and machine learning"
|
||||
|
||||
results = collection.query(
|
||||
query_texts=query_text, # Query text - will be embedded automatically
|
||||
n_results=3 # Return top 3 most similar documents
|
||||
)
|
||||
|
||||
print(f"\nQuery: '{query_text}'")
|
||||
print(f"Query results: {len(results['ids'][0])} items found")
|
||||
|
||||
# ==================== Step 5: Print Query Results ====================
|
||||
for i in range(len(results['ids'][0])):
|
||||
print(f"\nResult {i+1}:")
|
||||
print(f" ID: {results['ids'][0][i]}")
|
||||
print(f" Distance: {results['distances'][0][i]:.4f}")
|
||||
if results.get('documents'):
|
||||
print(f" Document: {results['documents'][0][i]}")
|
||||
if results.get('metadatas'):
|
||||
print(f" Metadata: {results['metadatas'][0][i]}")
|
||||
|
||||
# ==================== Step 6: Cleanup ====================
|
||||
# Delete the collection
|
||||
client.delete_collection(collection_name)
|
||||
print(f"\nDeleted collection '{collection_name}'")
|
||||
```
|
||||
|
||||
## More information
|
||||
|
||||
* For more detailed introduction and usage of pyseekdb, see [pyseekdb](../../200.develop/900.sdk/10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).
|
||||
|
||||
* For more pyseekdb usage examples, see:
|
||||
|
||||
* [Complete example](../../200.develop/900.sdk/10.pyseekdb-sdk/50.sdk-samples/50.pyseekdb-complete-sample.md): Demonstrates all capabilities currently supported by pyseekdb.
|
||||
|
||||
* [Hybrid search example](../../200.develop/900.sdk/10.pyseekdb-sdk/50.sdk-samples/100.pyseekdb-hybrid-search-sample.md): Demonstrates the usage of seekdb hybrid search.
|
||||
|
||||
* In addition to the Python SDK, seekdb also supports operations through SQL. For SQL usage, see [Experience seekdb in client/server mode](../100.client-server-mode/10.deploy-seekdb-testing-environment.md).
|
||||
@@ -0,0 +1,84 @@
|
||||
---
|
||||
|
||||
slug: /vector-search-intro
|
||||
---
|
||||
|
||||
# Overview of vector search
|
||||
|
||||
This topic introduces the core concepts of vector databases and vector search.
|
||||
|
||||
seekdb supports dense float vectors with up to 16,000 dimensions, as well as sparse vectors. It supports various types of vector distance calculations, including Manhattan distance, Euclidean distance, inner product, and cosine distance. seekdb also supports the creation of HNSW/IVF-based vector indexes, as well as incremental updates and deletions, with these operations having no impact on recall rate.
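A minimal SQL sketch of these capabilities, using the `VECTOR` column type and the `l2_distance` function described in later topics (table and column names here are illustrative):

```sql
-- Store 3-dimensional vectors and rank rows by Euclidean distance to a query vector.
CREATE TABLE items (id INT PRIMARY KEY, embedding VECTOR(3));
INSERT INTO items VALUES (1, '[0.1, 0.2, 0.3]'), (2, '[0.9, 0.8, 0.7]');
SELECT id FROM items ORDER BY l2_distance(embedding, '[0.1, 0.2, 0.25]') LIMIT 1;
```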
|
||||
|
||||
seekdb vector search offers hybrid retrieval capabilities with scalar filtering. It also provides flexible access interfaces: you can use SQL via the MySQL protocol from clients in various programming languages, or access it using a Python SDK. In addition, seekdb is fully adapted to AI application development frameworks such as LlamaIndex, DB-GPT, and the AI application development platform Dify, offering better support for AI application development.
|
||||
|
||||
<video data-code="9002093" src="https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/video/03%20OceanBase%20Vector%20Search-An%20Official%20In-depth%20Perspective.mp4" controls width="811px" height="456.188px"></video>
|
||||
|
||||
## Key concepts
|
||||
|
||||
### Unstructured data
|
||||
|
||||
Unstructured data is data that does not have a predefined data format or organizational structure. It typically includes data in forms such as text, images, audio, and video, as well as social media content, emails, and log files. Due to the complexity and diversity of unstructured data, processing it requires specific tools and techniques, such as natural language processing, image recognition, and machine learning.
|
||||
|
||||
### Vector
|
||||
|
||||
A vector is the projection of an object in a high-dimensional space. Mathematically, a vector is a floating-point array with the following characteristics:
|
||||
|
||||
* Each element in the array is a floating-point number that represents a dimension of the vector.
|
||||
|
||||
* The size of the array, that is, the number of elements, indicates the dimensionality of the entire vector space.
|
||||
|
||||
### Vector embedding
|
||||
|
||||
Vector embedding is the process of using a deep learning neural network to extract content and semantics from unstructured data such as images and videos, and convert them into feature vectors. Embedding technology maps original data from a high-dimensional space to a low-dimensional space and converts multimodal data with rich features into multi-dimensional vector data.
|
||||
|
||||
### Vector similarity search
|
||||
|
||||
In today's era of information explosion, users often need to quickly retrieve specific information from massive datasets. Whether it's online literature databases, e-commerce product catalogs, or rapidly growing multimedia content libraries, efficient retrieval systems are essential for locating content of interest. As data volumes continue to grow, traditional keyword-based search methods can no longer meet the demands for both accuracy and speed, giving rise to vector search technology. Vector similarity search uses feature extraction and vectorization techniques to convert unstructured data—such as text, images, and audio—into vectors. By applying similarity measurement methods to compare these vectors, it captures the deeper semantic meaning of the data. This approach delivers more precise and efficient search results, addressing the shortcomings of traditional search methods.
|
||||
|
||||
## Why seekdb vector search?
|
||||
|
||||
seekdb's vector search capabilities are built on its integrated multi-model capabilities, excelling in areas such as hybrid retrieval, high performance, high availability, cost efficiency, and data security.
|
||||
|
||||
### Hybrid retrieval
|
||||
|
||||
seekdb supports hybrid retrieval across multiple data types, including vector data, spatial data, document data, and scalar data. With support for various indexes such as vector indexes, spatial indexes, and full-text indexes, seekdb delivers exceptional performance in multi-model hybrid retrieval. It enables a single database to handle diverse storage and retrieval needs for applications.
|
||||
|
||||
### Scalability
|
||||
|
||||
seekdb vector search supports the storage and retrieval of massive amounts of vector data, meeting the requirements of large-scale vector data applications.
|
||||
|
||||
### High performance
|
||||
|
||||
seekdb vector search capabilities integrate the VSAG indexing algorithm library, which demonstrates outstanding performance on the 960-dimensional GIST dataset. In the ANN-Benchmarks tests, the VSAG library significantly outperformed other algorithms.
|
||||
|
||||
### High availability
|
||||
|
||||
seekdb vector search provides reliable data storage and access capabilities. For in-memory HNSW indexes, it ensures stable retrieval performance.
|
||||
|
||||
### Transactions
|
||||
|
||||
seekdb's transaction capabilities ensure the consistency and integrity of vector data. It also offers effective concurrency control and fault recovery mechanisms.
|
||||
|
||||
### Cost efficiency
|
||||
|
||||
seekdb's storage encoding and compression capabilities significantly reduce the storage space required for vectors, helping to lower application storage costs.
|
||||
|
||||
### Data security
|
||||
|
||||
seekdb already supports comprehensive enterprise-grade security features, including identity authentication and verification, access control, data encryption, monitoring and alerts, and security auditing. These features effectively ensure data security in vector search scenarios.
|
||||
|
||||
### Ease of use
|
||||
|
||||
seekdb vector search provides flexible access interfaces, enabling SQL access through MySQL protocol clients across various programming languages, as well as seamless integration via a Python SDK. Furthermore, seekdb has been optimized for AI application development frameworks like LangChain and LlamaIndex, offering better support for AI application development.
|
||||
|
||||
### Comprehensive toolset
|
||||
|
||||
seekdb features a comprehensive database toolset, supporting data development, migration, operations, diagnostics, and full lifecycle data management, safeguarding the development and maintenance of AI applications.
|
||||
|
||||
## Application scenarios
|
||||
|
||||
* Retrieval-Augmented Generation (RAG): RAG is an artificial intelligence (AI) framework that retrieves facts from external knowledge bases to provide the most accurate and up-to-date information to large language models (LLMs), and gives users insight into an LLM's generation process. RAG is commonly used in intelligent Q&A systems and knowledge bases.
|
||||
|
||||
* Personalized recommendation: A recommendation system suggests items that users may be interested in based on their historical behavior and preferences. When a recommendation request is initiated, the system calculates similarity based on the user's characteristics and returns items the user is likely to be interested in, such as recommended restaurants and scenic spots.
|
||||
|
||||
* Image search/Text search: An image or text search task aims to find the results most similar to a given image or text in a large-scale database. The text or image features used in the search can be stored in a vector database, and efficient similarity calculation can be achieved with high-performance index-based storage, returning the image or text results that match the search criteria. This applies to scenarios such as facial recognition.
|
||||
@@ -0,0 +1,28 @@
|
||||
---
|
||||
|
||||
slug: /vector-search-workflow
|
||||
---
|
||||
|
||||
# AI application workflow using seekdb vector search
|
||||
|
||||
This topic describes the AI application workflow using seekdb vector search.
|
||||
|
||||
## Convert unstructured data into feature vectors through vector embedding
|
||||
|
||||
Unstructured data, such as videos, documents, and images, is the starting point of the entire workflow. A vector embedding model transforms this data into vector representations: it converts raw unstructured data, for which similarity is difficult to compute directly, into high-dimensional vectors that capture the semantic information and features of the data, so that similarity can be expressed through distances in the vector space. For more information, see [Vector embedding technology](../150.vector-embedding-technology.md).
|
||||
|
||||
## Store vector embeddings and create vector indexes in seekdb
|
||||
|
||||
As the core storage layer, seekdb is responsible for storing all data. This includes traditional relational tables (used for storing business data), the original unstructured data, and the vector data generated after vector embedding. For more information, see [Store vector data](../160.store-vector-data.md).
|
||||
|
||||
To enable efficient vector search, seekdb internally builds vector indexes for the vector data. Vector indexes are specialized data structures that significantly accelerate nearest neighbor searches in high-dimensional vector spaces. Since calculating vector similarity is computationally expensive, exact searches (calculating distances for all vectors one by one) ensure accuracy but can severely impact query performance. Through vector indexes, the system can quickly locate candidate vectors, significantly reducing the number of vectors that need distance calculations, thereby improving query efficiency while maintaining high accuracy. For more information, see [Create vector indexes](../200.vector-index/200.dense-vector-index.md).
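As an illustrative sketch, a vector index can be declared directly in SQL (the table and column names are hypothetical; full syntax and parameters are described in the dense vector index topic):

```sql
-- Build an HNSW index on an existing VECTOR column to accelerate nearest neighbor search.
CREATE VECTOR INDEX doc_vec_idx ON docs(embedding) WITH (distance=l2, type=hnsw);
```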
|
||||
|
||||
## Perform nearest neighbor search and hybrid search through SQL/SDK
|
||||
|
||||
Users interact with the AI application through clients or programming languages by submitting queries that may involve text, images, or other formats. For more information, see [Supported clients and languages](../700.vector-search-reference/900.vector-search-supported-clients-and-languages/100.vector-search-supported-clients-and-languages-overview.md).
|
||||
|
||||
seekdb uses SQL statements to query and manage relational data, enabling hybrid searches that combine scalar and vector data. When a user initiates a query—if it is unstructured—the system first converts it into a vector using the embedding model. Then, leveraging both vector and scalar indexes, the system quickly retrieves the most similar vectors that also meet scalar filter conditions, thus identifying the most relevant unstructured data. For detailed information about nearest neighbor search, see [Nearest neighbor search](../300.vector-similarity-search.md).
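A sketch of such a hybrid query (hypothetical table and column names; the `APPROX` keyword and filtering behavior are described in the dense vector index topic):

```sql
-- Combine a scalar filter with approximate nearest neighbor ranking in one statement.
SELECT id, title
FROM docs
WHERE category = 'faq'  -- scalar filter condition
ORDER BY l2_distance(embedding, '[0.1, 0.2, 0.3]') APPROX
LIMIT 5;
```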
|
||||
|
||||
## Generate prompts and send them to the LLM for inference
|
||||
|
||||
In the final stage, an optimized prompt is generated based on the hybrid search results and sent to the large language model (LLM) to complete the inference process. The LLM generates a natural language response based on this contextual information. There is a feedback loop between the LLM and the vector embedding model, meaning that the output of the LLM or user feedback can be used to optimize the embedding model, creating a cycle of continuous learning and improvement.
|
||||
@@ -0,0 +1,339 @@
|
||||
---
|
||||
|
||||
slug: /vector-embedding-technology
|
||||
---
|
||||
|
||||
# Vector embedding technology
|
||||
|
||||
This topic introduces vector embedding technology in vector retrieval.
|
||||
|
||||
## What is vector embedding?
|
||||
|
||||
Vector embedding is a technique for converting unstructured data into numerical vectors. These vectors can capture the semantic information of unstructured data, enabling computers to "understand" and process the meaning of such data. Specifically:
|
||||
|
||||
* Vector embedding maps unstructured data such as text, images, or audio/video to points in a high-dimensional vector space.
|
||||
|
||||
* In this vector space, semantically similar unstructured data is mapped to nearby locations.
|
||||
|
||||
* Vectors typically consist of hundreds to thousands of floating-point numbers (for example, 512 or 1024 dimensions).
|
||||
|
||||
* Mathematical methods (such as cosine similarity) can be used to calculate the similarity between vectors.
|
||||
|
||||
* Common vector embedding models include Word2Vec, BERT, and BGE. For example, when developing RAG applications, text data is often embedded into vector data and stored in a vector database, while other structured data is stored in a relational database.
|
||||
|
||||
In seekdb, vector data can be stored as a data type in a relational table, allowing vectors and traditional scalar data to be stored in an orderly and efficient manner within seekdb.
|
||||
|
||||
## Generate vector embeddings using AI function service in seekdb
|
||||
|
||||
In seekdb, you can use the AI function service to generate vector embeddings without installing any dependencies: after registering the model information, you can generate vector embeddings directly in seekdb. For details, see [AI function service usage and examples](../300.ai-function/200.ai-function.md).
|
||||
|
||||
## Common text embedding methods
|
||||
|
||||
This section introduces common text embedding methods.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
You need to have the `pip` command installed in advance.
|
||||
|
||||
### Use an offline, locally pre-trained embedding model
|
||||
|
||||
Using pre-trained models for local text embedding is the most flexible approach, but it requires significant computing resources. The following sections describe commonly used options.
|
||||
|
||||
#### Use Sentence Transformers
|
||||
|
||||
Sentence Transformers is an NLP framework designed to convert sentences or paragraphs into vector embeddings. It uses deep learning, particularly the Transformer architecture, to effectively capture the semantic information of text. Since direct access to Hugging Face's domain may time out in some regions (such as mainland China), set the Hugging Face mirror address with `export HF_ENDPOINT=https://hf-mirror.com` in advance. After setting it, run the code below:
|
||||
|
||||
```python
|
||||
from sentence_transformers import SentenceTransformer
|
||||
|
||||
model = SentenceTransformer("BAAI/bge-m3")
|
||||
|
||||
sentences = [
|
||||
"That is a happy person",
|
||||
"That is a happy dog",
|
||||
"That is a very happy person",
|
||||
"Today is a sunny day"
|
||||
]
|
||||
embeddings = model.encode(sentences)
|
||||
print(embeddings)
|
||||
# [[-0.01178016 0.00884024 -0.05844684 ... 0.00750248 -0.04790139
|
||||
# 0.00330675]
|
||||
# [-0.03470375 -0.00886354 -0.05242309 ... 0.00899352 -0.02396279
|
||||
# 0.02985837]
|
||||
# [-0.01356584 0.01900942 -0.05800966 ... 0.00523864 -0.05689549
|
||||
# 0.00077098]
|
||||
# [-0.02149693 0.02998871 -0.05638731 ... 0.01443702 -0.02131325
|
||||
# -0.00112451]]
|
||||
similarities = model.similarity(embeddings, embeddings)
|
||||
print(similarities.shape)
|
||||
# torch.Size([4, 4])
|
||||
```
|
||||
|
||||
#### Use Hugging Face Transformers
|
||||
|
||||
Hugging Face Transformers is an open-source library that provides a wide range of pre-trained deep learning models, especially for NLP tasks. Due to geographical reasons, direct access to Hugging Face's domain may time out. Please set the Hugging Face mirror address `export HF_ENDPOINT=https://hf-mirror.com` in advance. After setting it, run the code below:
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
import torch
|
||||
|
||||
# Load the model and tokenizer
|
||||
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
|
||||
model = AutoModel.from_pretrained("BAAI/bge-m3")
|
||||
|
||||
# Prepare the input
|
||||
texts = ["This is an example text."]
|
||||
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
|
||||
|
||||
# Generate embeddings
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
embeddings = outputs.last_hidden_state[:, 0] # Use the [CLS] token's output
|
||||
print(embeddings)
|
||||
# tensor([[-1.4136, 0.7477, -0.9914, ..., 0.0937, -0.0362, -0.1650]])
|
||||
print(embeddings.shape)
|
||||
# torch.Size([1, 1024])
|
||||
```
|
||||
|
||||
#### Ollama
|
||||
|
||||
[Ollama](https://ollama.com) is an open-source model runtime that allows users to easily run, manage, and use various large language models locally. In addition to supporting open-source language models like Llama 3 and Mistral, it also supports embedding models like bge-m3.
|
||||
|
||||
1. Deploy Ollama
|
||||
|
||||
On macOS and Windows, you can download and install the package directly from the official website; see Ollama's official website for installation instructions. After installation, Ollama runs as a background service.
|
||||
|
||||
To install Ollama on Linux:
|
||||
|
||||
```shell
|
||||
curl -fsSL https://ollama.ai/install.sh | sh
|
||||
```
|
||||
|
||||
2. Pull an embedding model
|
||||
|
||||
Ollama supports using the bge-m3 model for text embeddings:
|
||||
|
||||
```shell
|
||||
ollama pull bge-m3
|
||||
```
|
||||
|
||||
3. Use Ollama for text embeddings
|
||||
|
||||
You can use Ollama's embedding capabilities through HTTP API or Python SDK:
|
||||
|
||||
* HTTP API
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
def get_embedding(text: str) -> list:
|
||||
"""Get text embeddings using Ollama's HTTP API"""
|
||||
response = requests.post(
|
||||
'http://localhost:11434/api/embeddings',
|
||||
json={
|
||||
'model': 'bge-m3',
|
||||
'prompt': text
|
||||
}
|
||||
)
|
||||
return response.json()['embedding']
|
||||
|
||||
# Example usage
|
||||
text = "This is an example text."
|
||||
embedding = get_embedding(text)
|
||||
print(embedding)
|
||||
# [-1.4269912242889404, 0.9092104434967041, ...]
|
||||
```
|
||||
|
||||
* Python SDK
|
||||
|
||||
First, install Ollama's Python SDK:
|
||||
|
||||
```shell
|
||||
pip install ollama
|
||||
```
|
||||
|
||||
Then you can use it like this:
|
||||
|
||||
```python
|
||||
import ollama
|
||||
|
||||
# Example usage
|
||||
texts = ["First sentence", "Second sentence"]
|
||||
embeddings = ollama.embed(model="bge-m3", input=texts)['embeddings']
|
||||
print(embeddings)
|
||||
# [[0.03486196, 0.0625187, ...], [...]]
|
||||
```
|
||||
|
||||
4. Advantages and limitations of Ollama
|
||||
|
||||
Advantages:
|
||||
|
||||
* Fully local deployment, no internet connection required
|
||||
* Open-source and free, no API Key required
|
||||
* Supports multiple models, easy to switch and compare
|
||||
* Relatively low resource usage
|
||||
|
||||
Limitations:
|
||||
|
||||
* Limited selection of embedding models
|
||||
* Performance may not match commercial services
|
||||
* Requires self-maintenance and updates
|
||||
* Lacks enterprise-level support
|
||||
|
||||
When choosing whether to use Ollama, you need to weigh these factors. If your application scenario has high privacy requirements, or you want to run completely offline, Ollama is a good choice. However, if you need more stable service quality and better performance, you may need to consider commercial services.
|
||||
|
||||
### Use online, remote embedding services

Using offline, local embedding models usually requires high hardware specifications for the deployment machine and also demands advanced management of processes such as model loading and unloading. As a result, many users have a strong need for online embedding services. Currently, many AI inference service providers offer corresponding text embedding services. Taking Tongyi Qwen's text embedding service as an example, you can first register for an account with [Alibaba Cloud Model Studio](https://bailian.console.aliyun.com) and obtain an API Key. Then, you can call its public API to get text embeddings.
|
||||
|
||||
#### HTTP call
|
||||
|
||||
After obtaining the credentials, you can try performing text embedding with the following code. If the requests package is not installed in your Python environment, you need to install it first with `pip install requests` to enable sending network requests.
|
||||
|
||||
```python
|
||||
import requests
|
||||
from typing import List
|
||||
|
||||
class RemoteEmbedding():
|
||||
def __init__(
|
||||
self,
|
||||
base_url: str,
|
||||
api_key: str,
|
||||
model: str,
|
||||
dimensions: int = 1024,
|
||||
**kwargs,
|
||||
):
|
||||
self._base_url = base_url
|
||||
self._api_key = api_key
|
||||
self._model = model
|
||||
self._dimensions = dimensions
|
||||
|
||||
"""
|
||||
OpenAI compatible embedding API. Tongyi, Baichuan, Doubao, etc.
|
||||
"""
|
||||
|
||||
def embed_documents(
|
||||
self,
|
||||
texts: List[str],
|
||||
) -> List[List[float]]:
|
||||
"""Embed search docs.
|
||||
|
||||
Args:
|
||||
texts: List of text to embed.
|
||||
|
||||
Returns:
|
||||
List of embeddings.
|
||||
"""
|
||||
res = requests.post(
|
||||
f"{self._base_url}",
|
||||
headers={"Authorization": f"Bearer {self._api_key}"},
|
||||
json={
|
||||
"input": texts,
|
||||
"model": self._model,
|
||||
"encoding_format": "float",
|
||||
"dimensions": self._dimensions,
|
||||
},
|
||||
)
|
||||
data = res.json()
|
||||
embeddings = []
|
||||
try:
|
||||
for d in data["data"]:
|
||||
embeddings.append(d["embedding"][: self._dimensions])
|
||||
return embeddings
|
||||
except Exception as e:
|
||||
print(data)
|
||||
print("Error", e)
|
||||
raise e
|
||||
|
||||
def embed_query(self, text: str, **kwargs) -> List[float]:
|
||||
"""Embed query text.
|
||||
|
||||
Args:
|
||||
text: Text to embed.
|
||||
|
||||
Returns:
|
||||
Embedding.
|
||||
"""
|
||||
return self.embed_documents([text])[0]
|
||||
|
||||
embedding = RemoteEmbedding(
|
||||
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",
|
||||
api_key="your-api-key", # Enter your API Key
|
||||
model="text-embedding-v3",
|
||||
)
|
||||
|
||||
print("Embedding result:", embedding.embed_query("The weather is nice today"), "\n")
|
||||
# Embedding result: [-0.03573227673768997, 0.0645645260810852, ...]
|
||||
print("Embedding results:", embedding.embed_documents(["The weather is nice today", "What about tomorrow?"]), "\n")
|
||||
# Embedding results: [[-0.03573227673768997, 0.0645645260810852, ...], [-0.05443647876381874, 0.07368793338537216, ...]]
|
||||
```
|
||||
|
||||
#### Use Qwen SDK
|
||||
|
||||
Qwen provides an SDK called dashscope for quickly accessing model capabilities. After installing it using `pip install dashscope`, you can obtain text embeddings.
|
||||
|
||||
```python
|
||||
import dashscope
|
||||
from dashscope import TextEmbedding
|
||||
|
||||
# Set the API Key
|
||||
dashscope.api_key = "your-api-key"
|
||||
|
||||
# Prepare the input text
|
||||
texts = ["This is the first sentence", "This is the second sentence"]
|
||||
|
||||
# Call the embedding service
|
||||
response = TextEmbedding.call(
|
||||
model="text-embedding-v3",
|
||||
input=texts
|
||||
)
|
||||
|
||||
# Get the embedding results
|
||||
if response.status_code == 200:
|
||||
print(response.output['embeddings'])
|
||||
# [{"embedding": [-0.03193652629852295, 0.08152323216199875, ...]}, {"embedding": [...]}]
|
||||
```
|
||||
|
||||
## Common image embedding methods
|
||||
|
||||
This section introduces image embedding methods.
|
||||
|
||||
### Use an offline, locally pre-trained embedding model
|
||||
|
||||
#### Use CLIP
|
||||
|
||||
CLIP (Contrastive Language-Image Pretraining) is a model proposed by OpenAI for multimodal learning by combining images and text. CLIP can understand and process the relationships between images and text, making it perform well in various tasks such as image classification, image retrieval, and text generation.
|
||||
|
||||
```python
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare the input image and texts
image = Image.open("path_to_your_image.jpg")
texts = ["This is the first sentence", "This is the second sentence"]

# Generate embeddings for both modalities
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Obtain the embedding results
print(outputs.image_embeds)  # Image embedding, shape [1, 512]
print(outputs.text_embeds)   # Text embeddings, shape [2, 512]
```
|
||||
|
||||
## References
|
||||
|
||||
* [Store vector embeddings](160.store-vector-data.md)
|
||||
* [Vector data types](700.vector-search-reference/100.vector-data-type.md)
|
||||
@@ -0,0 +1,50 @@
|
||||
---
|
||||
|
||||
slug: /store-vector-data
|
||||
---
|
||||
|
||||
# Store vector data
|
||||
|
||||
This topic introduces how to store unstructured, semi-structured, and structured data in a unified way within seekdb. This not only fully leverages the foundational capabilities of seekdb, but also provides strong support for hybrid search.
|
||||
|
||||
## How it works
|
||||
|
||||
seekdb can store data of different modalities and supports hybrid search by converting various types of data (such as text, images, and videos) into vectors. Searches are performed by calculating the distances between these vectors. Hybrid search can be divided into two types: simple search, which is based on similarity search for a single vector, and complex search, which involves combining vector and scalar searches.
|
||||
|
||||
Since vector search is inherently approximate, it is necessary to employ multiple techniques in practical applications to improve accuracy. Only precise search results can deliver greater value to your business.
|
||||
|
||||
## Create a vector column
|
||||
|
||||
The following example shows a table that stores vector data, spatial data, and relational data. The data type of the vector column is `VECTOR`, and the dimension must be specified when the column is created. The maximum supported dimension is 16,000. The data type of the spatial column is `GEOMETRY`:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t (
|
||||
-- Store relational data (structured data).
|
||||
id INT PRIMARY KEY,
|
||||
-- Store spatial data (semi-structured data).
|
||||
g GEOMETRY,
|
||||
-- Store vector data (unstructured data).
|
||||
vec VECTOR(3)
|
||||
);
|
||||
```
|
||||
|
||||
|
||||
## Use the `INSERT` statement to insert vector data
|
||||
|
||||
Once you create a table that contains a column of the `VECTOR` data type, you can directly use the `INSERT` statement to insert vectors into the table. When you insert data, the vector must match the dimension specified when the table is created. Otherwise, an error will be returned. This design ensures data consistency and query efficiency. Vectors are represented in standard floating-point number arrays. Each dimension must have a valid floating-point number. Here is a simple example:
|
||||
|
||||
```sql
|
||||
INSERT INTO t (id, g, vec) VALUES (
|
||||
-- Insert structured data.
|
||||
1,
|
||||
-- Insert semi-structured data.
|
||||
ST_GeomFromText('POINT(1 1)'),
|
||||
-- Insert unstructured data.
|
||||
'[0.1, 0.2, 0.3]'
|
||||
);
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
* [Vector embedding technology](150.vector-embedding-technology.md)
|
||||
* [Create vector indexes](200.vector-index/200.dense-vector-index.md)
|
||||
@@ -0,0 +1,600 @@
|
||||
---
|
||||
|
||||
slug: /dense-vector-index
|
||||
---
|
||||
|
||||
# Dense vector index
|
||||
|
||||
This topic describes how to create, query, maintain, and drop a dense vector index in seekdb.
|
||||
|
||||
## Index types
|
||||
|
||||
The following table describes the vector index types supported by seekdb.
|
||||
|
||||
| Index type | Description | Scenarios |
|
||||
|-----------|------|----------|
|
||||
| HNSW | The maximum dimension of indexed columns is 4096. The HNSW index is a memory-based index that must be fully loaded into memory. It supports DML and real-time queries. | Scenarios with high performance and recall rate requirements. |
| HNSW_SQ | The HNSW_SQ index offers similar construction speed, query performance, and recall rate as the HNSW index, but reduces overall memory usage to 1/2 to 1/3 of the original. | Scenarios with high performance and recall rate requirements. |
| HNSW_BQ | The HNSW_BQ index has a slightly lower recall rate compared to the HNSW index, but significantly reduces memory usage. The BQ quantization compression algorithm (RaBitQ) can compress vectors to 1/32 of their original size. The memory optimization effect of the HNSW_BQ index becomes more pronounced as the vector dimension increases. | Scenarios with high performance and recall rate requirements. |
| IVF | An IVF index implemented based on database tables, which does not require resident memory. | Scenarios with lower performance requirements but large data volumes and cost sensitivity. |
| IVF_PQ | An IVF_PQ index implemented based on database tables, which does not require resident memory. On top of IVF, PQ quantization technology is applied. The recall rate of the index is slightly lower than that of the IVF index, but the performance is higher. The PQ quantization compression algorithm can generally compress vectors to 1/16 to 1/32 of their original size. | Scenarios with lower performance requirements but large data volumes and cost sensitivity. |
| IVF_SQ (Experimental feature) | An IVF_SQ index implemented based on database tables, which does not require resident memory. On top of IVF, SQ quantization technology is applied. The recall rate of the index is slightly lower than that of the IVF index, but the performance is higher. The SQ quantization compression algorithm can generally compress vectors to 1/3 to 1/4 of their original size. | Scenarios with lower performance requirements but large data volumes and cost sensitivity. |
|
||||
|
||||
Some other notes:
|
||||
|
||||
* Dense vector indexes support L2, inner product (IP), and cosine distance as the index distance algorithm.
|
||||
* Vector index queries support calling some distance functions. For more information, see [Use SQL functions](../250.vector-function.md).
|
||||
* Vector queries with filter conditions are supported. The filter conditions can be scalar conditions or spatial relationships, such as ST_Intersects. Multi-value indexes, full-text indexes, and global indexes are not supported as pre-filters.
|
||||
* You can create vector and full-text indexes on the same table.
|
||||
* For more information about how vector indexes support offline DDL operations, see [Offline DDL](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974221).
|
||||
|
||||
The limitations are described as follows:
|
||||
|
||||
* In V1.0.0, creating columnstore vector indexes is not supported.
|
||||
|
||||
## Index memory estimation and actual usage query
|
||||
|
||||
You can estimate the memory required for vector indexes using the `DBMS_VECTOR` system package:
|
||||
|
||||
* Before creating a table, you can estimate index memory requirements by using the [INDEX_VECTOR_MEMORY_ADVISOR](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754002) procedure.
|
||||
* After a table is created and data is inserted, you can estimate index memory requirements by using the [INDEX_VECTOR_MEMORY_ESTIMATE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754001) procedure.
|
||||
|
||||
The vector index memory estimation provides two key pieces of information: the minimum memory configuration required to create a vector index, and the actual memory usage after creating HNSW_SQ and IVF indexes.
|
||||
|
||||
seekdb also provides the configuration item `load_vector_index_on_follower` to control whether the follower role automatically loads in-memory vector indexes. For syntax and examples, see [load_vector_index_on_follower](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002969407). If weak reads are not needed, you can disable this configuration item to reduce the memory used by vector indexes.
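A sketch of disabling the item, assuming the standard syntax for setting configuration items (see the linked reference for the exact usage):

```sql
-- Followers will no longer automatically load in-memory vector indexes, reducing memory usage.
ALTER SYSTEM SET load_vector_index_on_follower = False;
```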
|
||||
|
||||
## Creation syntax and description
|
||||
|
||||
seekdb vector indexes can be created during table creation or after the table is created. When creating a vector index, note the following:
|
||||
|
||||
* The `VECTOR` keyword is required when creating a vector index.
|
||||
* The parameters and descriptions for an index created after the table is created are the same as those for an index created during table creation.
|
||||
* If a large amount of data is involved, we recommend that you write the data first and then create the index to achieve the optimal query performance.
|
||||
* It is recommended to create HNSW_SQ, IVF, IVF_SQ, and IVF_PQ indexes after data is inserted, and to rebuild the indexes after a significant amount of new data is added. For detailed instructions on creating each index, see the specific examples below.
|
||||
|
||||
:::tab
|
||||
tab HNSW/HNSW_SQ/HNSW_BQ
|
||||
|
||||
Syntax for creating an index during table creation:
|
||||
|
||||
```sql
|
||||
CREATE TABLE table_name (
|
||||
column_name1 data_type1,
|
||||
column_name2 data_type2,
|
||||
...,
|
||||
VECTOR INDEX index_name (column_name) WITH (param1=value1, param2=value2, ...)
|
||||
);
|
||||
```
|
||||
|
||||
Syntax for creating an index after table creation:
|
||||
|
||||
```sql
|
||||
-- Creating an index after table creation supports setting parallel degree to improve index construction performance. The maximum parallel degree should not exceed CPU cores * 2
|
||||
CREATE [/*+ parallel $value */] VECTOR INDEX index_name ON table_name(column_name) WITH (param1=value1, param2=value2, ...);
|
||||
```
|
||||
|
||||
`param` parameter description:
|
||||
|
||||
| Parameter | Default value | Value range | Required | Description | Remarks |
|
||||
|------|--------|----------|----------|------|------|
|
||||
| distance | | l2/inner_product/cosine | Yes | The vector distance function type. | l2 indicates the Euclidean distance, inner_product indicates the inner product distance, and cosine indicates the cosine distance. |
|
||||
| type | | hnsw/hnsw_sq/hnsw_bq | Yes | The index type. | Currently, `hnsw`, `hnsw_sq`, and `hnsw_bq` are supported. |
|
||||
| lib | vsag | vsag | No | The vector index library type. | At present, only the VSAG vector library is supported. |
|
||||
| m | 16 | [5,128] | No | The maximum number of neighbors of each node. | The larger the value, the slower the index construction, but the better the query performance. |
|
||||
| ef_construction | 200 | [5,1000] | No | The size of the candidate set during index construction. | The larger the value, the slower the index construction, but the better the index quality. `ef_construction` must be greater than `m`. |
|
||||
| ef_search | 64 | [1,1000] | No | The size of the candidate set during a query. | The larger the value, the slower the query, but the higher the recall rate. |
|
||||
| extra_info_max_size | 0 | [0,16384] | No | The maximum size of each primary key information (in bytes). Storing the primary key of the table in the index can speed up queries. | <code>0</code>: The primary key information is not stored.<br/><code>1</code>: The primary key information is forcibly stored, regardless of the size limit. In this case, the primary key type (see below) must be a supported type.<br/><code>Greater than 1</code>: The maximum size of the primary key information (in bytes) is specified. In this case, the following conditions must be met:<ul><li>The size of the primary key information (calculation method see below) must be less than the specified size limit.</li><li>The primary key type must be a supported type.</li><li>The table is not a table without a primary key.</li></ul> |
|
||||
| refine_k | 4.0 | [1.0,1000.0] | No | <main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an HNSW_BQ index. </p></main> This parameter is a floating-point number used to adjust the rearrangement ratio for quantized vector indexes. | This parameter can be specified when you create an index or during a query:<ul><li>If this parameter is not specified during a query, the value specified when the index is created is used. </li><li>If this parameter is specified during a query, the value specified during the query is used. </li></ul> |
|
||||
| refine_type | sq8 | | No | <main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an HNSW_BQ index. </p></main> This parameter specifies the construction precision of quantized vector indexes. | This parameter improves the efficiency of index construction by reducing the memory usage and the construction time, but may affect the recall rate. |
|
||||
| bq_bits_query | 32 | 0/4/32 | No | <main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an HNSW_BQ index. </p></main> This parameter specifies the query precision of quantized vector indexes in bits. | This parameter improves the efficiency of index construction by reducing the memory usage and the construction time, but may affect the recall rate. |
|
||||
| bq_use_fht | true | true/false | No | <main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an HNSW_BQ index. </p></main> This parameter specifies whether to use FHT (Fast Hadamard Transform) for queries, an algorithm that accelerates vector inner product calculations. | |
|
||||
|
||||
The supported primary key types for `extra_info_max_size` include:
|
||||
|
||||
* [Numeric types](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975803): Integer types, floating-point types, and BIT_VALUE types.
|
||||
* [Datetime types](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975805)
|
||||
* [Character types](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975810): VARCHAR type is supported.
|
||||
|
||||
The calculation method for the primary key information size:
|
||||
|
||||
```sql
|
||||
SET @table_name = 'test'; -- Replace this with the table name to be queried.
|
||||
|
||||
SELECT
|
||||
CASE
|
||||
WHEN COUNT(*) <> COUNT(result_value) THEN 'not support'
|
||||
ELSE COALESCE(SUM(result_value), 'not support')
|
||||
END AS extra_info_size
|
||||
FROM (
|
||||
SELECT
|
||||
CASE
|
||||
WHEN vdt.data_type_class IN (1, 2, 3, 4, 6, 8, 9, 14, 27, 28) THEN 8 -- For numeric types, extra_info_size += 8
|
||||
WHEN oc.data_type = 22 THEN oc.data_length -- For varchar types, extra_info_size += data_length
|
||||
ELSE NULL -- Other types are not supported
|
||||
END AS result_value
|
||||
FROM
|
||||
oceanbase.__all_column oc
|
||||
JOIN
|
||||
oceanbase.__all_virtual_data_type vdt
|
||||
ON
|
||||
oc.data_type = vdt.data_type
|
||||
WHERE
|
||||
oc.rowkey_position != 0
|
||||
AND oc.table_id = (SELECT table_id FROM oceanbase.__all_table WHERE table_name = @table_name)
|
||||
) AS result_table;
|
||||
|
||||
-- The result is 8 bytes.
|
||||
```
|
||||
|
||||
tab IVF/IVF_SQ (Experimental feature)/IVF_PQ
|
||||
|
||||
Syntax for creating an index during table creation:
|
||||
|
||||
```sql
|
||||
CREATE TABLE table_name (
|
||||
column_name1 data_type1,
|
||||
column_name2 data_type2,
|
||||
...,
|
||||
VECTOR INDEX index_name (column_name) WITH (param1=value1, param2=value2, ...)
|
||||
);
|
||||
```
|
||||
|
||||
Syntax for creating an index after table creation:
|
||||
|
||||
```sql
|
||||
-- Creating an index after table creation supports setting parallel degree to improve index construction performance. The maximum parallel degree should not exceed CPU cores * 2
|
||||
CREATE [/*+ parallel $value */] VECTOR INDEX index_name ON table_name(column_name) WITH (param1=value1, param2=value2, ...);
|
||||
```
|
||||
|
||||
`param` parameter description:
|
||||
|
||||
| Parameter | Default value | Value range | Required? | Description | Remarks |
|
||||
|------|--------|----------|----------|------|------|
|
||||
| distance | | l2/inner_product/cosine | Yes | The vector distance function type. | l2 indicates the Euclidean distance, inner_product indicates the inner product distance, and cosine indicates the cosine distance. |
|
||||
| type | | ivf_flat/ivf_sq8/ivf_pq | Yes | The IVF index type. | |
|
||||
| lib | ob | ob | No | The vector index library type. | |
|
||||
| nlist | 128 | [1,65536] | No | The number of clusters. | |
|
||||
| sample_per_nlist | 256 | [1,int64_max] | Yes | The number of samples for each cluster center, which is used when creating an index after table creation. | |
|
||||
| nbits | 8 | [1,24] | No | The number of quantization bits.<main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an IVF_PQ index.</p></main> | The recommended value is 8. The recommended value range is [8,10]. The larger the value, the higher the quantization accuracy and query accuracy, but the query performance will be affected. |
|
||||
| m | No default value, must be specified | [1,65536] | Yes | The dimension of the quantized vectors.<main id="notice" type="notice"><p>This parameter is supported starting from V1.0.0. You can specify this parameter only when you create an IVF_PQ index.</p></main> | The larger the value, the slower the index construction, and the higher the query accuracy, but the query performance will be affected. |
|
||||
|
||||
:::
|
||||
|
||||
## Query syntax and description
|
||||
|
||||
Vector index queries are approximate nearest neighbor queries and do not guarantee 100% accuracy. The accuracy of vector queries is measured by recall. For example, if a query for the 10 nearest neighbors can stably return 9 correct results, the recall is 90%. The recall is described as follows:
|
||||
|
||||
* The recall is affected by the build parameters and query parameters.
|
||||
* The index query parameters are fixed when the index is created and cannot be modified directly. However, you can override them with session variables, as shown in the sketch after this list: the `ob_hnsw_ef_search` variable applies to HNSW/HNSW_SQ/HNSW_BQ indexes, and the `ob_ivf_nprobes` variable applies to IVF indexes. If a session variable is set, its value takes precedence. For more information, see [ob_hnsw_ef_search](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001976680) and [ob_ivf_nprobes](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002179539).
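For example, to enlarge the query-time candidate set for the current session (a sketch; see the linked variable references for exact semantics):

```sql
-- Takes precedence over the ef_search value set at index creation.
SET ob_hnsw_ef_search = 128;
-- For IVF indexes, ob_ivf_nprobes plays the analogous role.
SET ob_ivf_nprobes = 32;
```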
|
||||
|
||||
The syntax for dense vector indexes is as follows:
|
||||
|
||||
```sql
|
||||
SELECT ... FROM $table_name ORDER BY $distance_function($column_name, $vector_expr) [APPROXIMATE|APPROX] LIMIT $num (OFFSET $num);
|
||||
```
|
||||
|
||||
Query usage notes are as follows:
|
||||
|
||||
* Syntax requirements:
|
||||
* The `APPROXIMATE`/`APPROX` keyword must be specified for the query to use the vector index instead of a full table scan.
|
||||
* The query must include the `ORDER BY` and `LIMIT` clauses.
|
||||
* The `ORDER BY` clause only supports a single vector condition.
|
||||
* The value of `LIMIT + OFFSET` must be in the range `(0, 16384]`.
|
||||
|
||||
* Rules for distance functions:
|
||||
* If `APPROXIMATE`/`APPROX` is specified, a supported distance function is called, and it matches the vector index algorithm, the query will use the vector index.
|
||||
* If `APPROXIMATE`/`APPROX` is specified, but the distance function does not match the vector index algorithm, the query will not use the vector index, but no error is returned.
|
||||
* If `APPROXIMATE`/`APPROX` is specified, but the distance function is not supported in the current version, the query will not use the vector index, and an error is returned.
|
||||
* If `APPROXIMATE`/`APPROX` is not specified, and a supported distance function is called, the query will not use the vector index, but no error is returned.
|
||||
|
||||
* Other notes:
|
||||
* The `WHERE` condition will serve as a filter after the vector index query.
|
||||
* Specifying the `LIMIT` clause is required; otherwise, an error will be returned.
|
||||
|
||||
## Create, query, and delete examples
|
||||
|
||||
### Create an index during table creation
|
||||
|
||||
#### Example of dense vector index
|
||||
|
||||
##### HNSW example
|
||||
|
||||
:::tip
|
||||
|
||||
When you create an HNSW index, the index name must be less than 25 characters in length. Otherwise, an exception may occur because the auxiliary table name exceeds the <code>index_name</code> limit. Future versions will support longer index names.
|
||||
:::
|
||||
|
||||
Create a test table.
|
||||
|
||||
```sql
|
||||
CREATE TABLE t1(c1 INT, c0 INT, c2 VECTOR(10), c3 VECTOR(10), PRIMARY KEY(c1), VECTOR INDEX idx1(c2) WITH (distance=l2, type=hnsw, lib=vsag), VECTOR INDEX idx2(c3) WITH (distance=l2, type=hnsw, lib=vsag));
|
||||
```
|
||||
|
||||
Write test data.
|
||||
|
||||
```sql
|
||||
INSERT INTO t1 VALUES(1, 1,'[0.203846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]', '[0.203846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]');
|
||||
|
||||
INSERT INTO t1 VALUES(2, 2, '[0.735541,0.670776,0.903237,0.447223,0.232028,0.659316,0.765661,0.226980,0.579658,0.933939]', '[0.213846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]');
|
||||
|
||||
INSERT INTO t1 VALUES(3, 3, '[0.327936,0.048756,0.084670,0.389642,0.970982,0.370915,0.181664,0.940780,0.013905,0.628127]', '[0.223846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]');
|
||||
```
|
||||
|
||||
Perform an approximate nearest neighbor query.
|
||||
|
||||
```sql
|
||||
SELECT * FROM t1 ORDER BY l2_distance(c2, [0.712338,0.603321,0.133444,0.428146,0.876387,0.763293,0.408760,0.765300,0.560072,0.900498]) APPROXIMATE LIMIT 1;
|
||||
```
|
||||
|
||||
The query result is as follows:
|
||||
|
||||
```shell
|
||||
+----+------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|
||||
| c1 | c0 | c2 | c3 |
|
||||
+----+------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|
||||
| 3 | 3 | [0.327936,0.048756,0.08467,0.389642,0.970982,0.370915,0.181664,0.94078,0.013905,0.628127] | [0.223846,0.205289,0.880265,0.82434,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833] |
|
||||
+----+------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
##### HNSW_SQ example
|
||||
|
||||
```sql
|
||||
CREATE TABLE t2 (c1 INT AUTO_INCREMENT, c2 VECTOR(3), PRIMARY KEY(c1), VECTOR INDEX idx1(c2) WITH (distance=l2, type=hnsw_sq, lib=vsag));
|
||||
```
|
||||
|
||||
##### HNSW_BQ example
|
||||
|
||||
```sql
|
||||
CREATE TABLE t3 (c1 INT AUTO_INCREMENT, c2 VECTOR(3), PRIMARY KEY(c1), VECTOR INDEX idx3(c2) WITH (distance=l2, type=hnsw_bq, lib=vsag));
|
||||
```
|
||||
|
||||
The `distance` parameter of HNSW_BQ supports only the l2 value.
|
||||
|
||||
##### IVF example
|
||||
|
||||
:::tip
|
||||
|
||||
When you create an IVF index, the index name must be less than 33 characters in length. Otherwise, an exception may occur because the auxiliary table name exceeds the <code>index_name</code> limit. Future versions will support longer index names.
|
||||
:::
|
||||
|
||||
```sql
|
||||
CREATE TABLE ivf_vecindex_suite_table_test (c1 INT, c2 VECTOR(3), PRIMARY KEY(c1), VECTOR INDEX idx2(c2) WITH (distance=l2, type=ivf_flat));
|
||||
```
|
||||
|
||||
### Create an index after table creation
|
||||
|
||||
:::tip
|
||||
|
||||
Currently, only dense vector indexes can be created after table creation.
|
||||
:::
|
||||
|
||||
#### Example of HNSW index
|
||||
|
||||
Create a test table.
|
||||
|
||||
```sql
|
||||
CREATE TABLE vec_table_hnsw (id INT, c2 VECTOR(10));
|
||||
```
|
||||
|
||||
Create an HNSW index.
|
||||
|
||||
```sql
|
||||
CREATE VECTOR INDEX vec_idx1 ON vec_table_hnsw(c2) WITH (distance=l2, type=hnsw);
|
||||
```
|
||||
|
||||
View the created table.
|
||||
|
||||
```sql
|
||||
SHOW CREATE TABLE vec_table_hnsw;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| Table | Create Table |
|
||||
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| vec_table_hnsw | CREATE TABLE `vec_table_hnsw` (
|
||||
`id` int(11) DEFAULT NULL,
|
||||
`c2` VECTOR(10) DEFAULT NULL,
|
||||
VECTOR KEY `vec_idx1` (`c2`) WITH (DISTANCE=L2, TYPE=HNSW, LIB=VSAG, M=16, EF_CONSTRUCTION=200, EF_SEARCH=64) BLOCK_SIZE 16384
|
||||
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 2 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 |
|
||||
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
```sql
SHOW INDEX FROM vec_table_hnsw;
```

The return result is as follows:

```shell
+----------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+-----------+---------------+---------+------------+
| Table          | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment   | Index_comment | Visible | Expression |
+----------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+-----------+---------------+---------+------------+
| vec_table_hnsw | 1          | vec_idx1 | 1            | c2          | A         | NULL        | NULL     | NULL   | YES  | VECTOR     | available |               | YES     | NULL       |
+----------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+-----------+---------------+---------+------------+
1 row in set
```
|
||||
|
||||
#### Example of HNSW_SQ index

Create a test table.

```sql
CREATE TABLE vec_table_hnsw_sq (c1 INT AUTO_INCREMENT, c2 VECTOR(3), PRIMARY KEY(c1));
```

Create an HNSW_SQ index.

```sql
CREATE VECTOR INDEX vec_idx2 ON vec_table_hnsw_sq(c2) WITH (distance=l2, type=hnsw_sq, lib=vsag, m=16, ef_construction = 200);
```

#### Example of HNSW_BQ index
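
Create a test table. The statement below is an assumption for this example: it mirrors the HNSW_SQ table above, since the index statement that follows expects a table named `vec_table_hnsw_bq` with a vector column `c2`:

```sql
-- Assumed table definition, mirroring the HNSW_SQ example
CREATE TABLE vec_table_hnsw_bq (c1 INT AUTO_INCREMENT, c2 VECTOR(3), PRIMARY KEY(c1));
```

Create an HNSW_BQ index.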

```sql
CREATE VECTOR INDEX vec_idx3 ON vec_table_hnsw_bq(c2) WITH (distance=l2, type=hnsw_bq, lib=vsag, m=16, ef_construction = 200);
```

The HNSW_BQ index supports only `l2` as the `distance` algorithm.

#### Example of IVF index

Create a test table.

```sql
CREATE TABLE vec_table_ivf (c1 INT, c2 VECTOR(3), PRIMARY KEY(c1));
```

Create an IVF index.

```sql
CREATE VECTOR INDEX vec_idx3 ON vec_table_ivf(c2) WITH (distance=l2, type=ivf_flat);
```

### Drop an index

```sql
DROP INDEX vec_idx1 ON vec_table_hnsw;
```

Verify that the index has been dropped.

```sql
SHOW INDEX FROM vec_table_hnsw;
```

The return result is as follows:

```shell
Empty set
```

<!--## Monitoring

seekdb vector indexes provide monitoring capabilities:

* You can view the basic information and real-time status of HNSW/HNSW_SQ/HNSW_BQ indexes through the [GV$OB_HNSW_INDEX_INFO](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000004017373) view.
* You can view the basic information and real-time status of IVF/IVF_SQ/IVF_PQ indexes through the [GV$OB_IVF_INDEX_INFO](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000004017374) view.-->

## Maintenance

When a table accumulates a large amount of incremental data, query performance decreases. To reduce the amount of incremental data in the table, seekdb provides the `DBMS_VECTOR` package for maintaining vector indexes.

### Incremental refresh

:::tip

IVF/IVF_SQ/IVF_PQ indexes do not support incremental refresh.
:::

If a large amount of data is written after the index is created, we recommend that you perform an incremental refresh by using the `REFRESH_INDEX` procedure, as sketched below. For more information, see [REFRESH_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002753999).

The system checks for incremental data every 15 minutes. If more than 10,000 incremental data records are found, the system automatically performs an incremental refresh.
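
A minimal invocation sketch follows. It assumes the procedure takes the index name and table name as its first two arguments; the authoritative signature and optional parameters are in the linked REFRESH_INDEX reference.

```sql
-- Sketch: incrementally refresh the HNSW index created earlier.
-- Argument names/order are assumptions; see the REFRESH_INDEX reference.
CALL DBMS_VECTOR.REFRESH_INDEX('vec_idx1', 'vec_table_hnsw');
```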

### Full refresh (rebuild)

#### Manual full table rebuild

If a large amount of data is updated or deleted after an index is created, we recommend that you perform a full refresh by using the `REBUILD_INDEX` procedure, as sketched below. For details and examples, see [REBUILD_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754000).

The system automatically checks whether a full refresh is needed every 24 hours. If the newly added data exceeds 20% of the original data, a full refresh is triggered automatically. The full refresh runs asynchronously in the background: a new index is built first and then swapped in for the old one. During the rebuild, the old index remains available, but the overall process is relatively slow.
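
A minimal invocation sketch follows, assuming the same first two arguments as REFRESH_INDEX (index name, table name); consult the linked REBUILD_INDEX reference for the full parameter list, including how to pass new index parameters and a degree of parallelism.

```sql
-- Sketch: fully rebuild the HNSW index created earlier.
-- Argument names/order are assumptions; see the REBUILD_INDEX reference.
CALL DBMS_VECTOR.REBUILD_INDEX('vec_idx1', 'vec_table_hnsw');
```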

seekdb also provides the configuration item `vector_index_memory_saving_mode` to control memory usage during index rebuilds. Enabling this mode reduces memory consumption when rebuilding vector indexes on partitioned tables. A vector index rebuild normally requires memory equal to about twice the index size; with memory-saving mode enabled, the system releases a partition's in-memory index as soon as that partition has been built, effectively reducing the total memory required for the rebuild. For syntax and examples, see [vector_index_memory_saving_mode](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002969408).
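
A configuration sketch follows; the boolean value shown is an assumption, so check the linked configuration-item documentation for the exact value format and scope.

```sql
-- Sketch: enable memory-saving mode for vector index rebuilds.
-- The value format is an assumption; see the configuration-item doc.
ALTER SYSTEM SET vector_index_memory_saving_mode = true;
```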

Notes:

* When you execute offline DDL operations (such as `ALTER TABLE` statements that modify the table structure or primary key), the index table is rebuilt. Because a degree of parallelism cannot be specified for index rebuilds triggered by offline DDL, the system executes them single-threaded by default. When the data volume is large, the rebuild is therefore slow and reduces the efficiency of the entire offline DDL operation.
* When rebuilding an index, if you need to modify index parameters, you must specify both `type` and `distance` in the parameter list, and they must match the original index. For example, if the original index type is `hnsw` and the distance algorithm is `l2`, you must specify both `type=hnsw` and `distance=l2` during the rebuild.
* When rebuilding an index, the following are supported:

  * Modifying the `m`, `ef_search`, and `ef_construction` values.
  * Online rebuild of the `ef_search` parameter.
  * Rebuilding between the `hnsw` and `hnsw_sq` index types.
  * Rebuilding to the same IVF type: `ivf_flat` to `ivf_flat`, `ivf_sq8` to `ivf_sq8`, and `ivf_pq` to `ivf_pq`.
  * Setting a degree of parallelism during the rebuild. For examples, see [REBUILD_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754000).
* When rebuilding an index, the following are not supported:

  * Modifying the `type` or `distance` values.
  * Rebuilding between the `hnsw` and `ivf` index types.
  * Rebuilding between `hnsw` and `hnsw_bq`.
  * Cross rebuilds among `ivf_flat`, `ivf_pq`, and `ivf_sq8`.

#### Automatic partition rebuild (recommended)

:::tip

- This feature is supported starting from V1.0.0. If your vector database was upgraded to V1.0.0 from an earlier version, you need to manually rebuild all vector indexes for the entire table after the upgrade. Otherwise, automatic partition rebuild tasks may not be executed after the upgrade.
- This feature supports only HNSW/HNSW_SQ/HNSW_BQ indexes.
:::

Two scenarios trigger automatic partition rebuild tasks in the current version:

* Execution of vector index query statements.
* Scheduled checks, with a configurable execution cycle.

1. Configure the execution cycle

In the `seekdb` database, configure the execution cycle through the configuration item `vector_index_optimize_duty_time`. Example:

```sql
ALTER SYSTEM SET vector_index_optimize_duty_time='[23:00:00, 24:00:00]';
```
After the above configuration, partition rebuild tasks are executed only between 23:00:00 and 24:00:00 and are not initiated at other times. For detailed parameter descriptions, see the corresponding configuration item documentation.

2. View task progress and history

You can view task progress and history through the `CDB/DBA_OB_VECTOR_INDEX_TASKS` and `CDB/DBA_OB_VECTOR_INDEX_TASK_HISTORY` views, as in the sketch below.
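
For example, in MySQL mode the views can be queried from the `oceanbase` catalog, as in the monitoring examples later in this topic:

```sql
-- View in-progress or waiting partition rebuild tasks
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASKS;

-- View archived (completed) tasks
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASK_HISTORY;
```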

Determine the current task status through the `status` field:

* 0 (PREPARE): The task is waiting to be executed.
* 1 (RUNNING): The task is being executed.
* 2 (PENDING): The task is paused.
* 3 (FINISHED): The task has been completed.

Completed tasks, that is, tasks with `status=FINISHED`, are archived to the history table regardless of whether they succeeded. For detailed usage examples, see the corresponding view documentation.

3. Cancel a task

To cancel a task, obtain the `trace_id` from the `DBA_OB_VECTOR_INDEX_TASKS` or `CDB_OB_VECTOR_INDEX_TASKS` view, then execute the following command:

```sql
ALTER SYSTEM CANCEL TASK <trace_id>;
```
Example:
```sql
ALTER SYSTEM CANCEL TASK "Y61480BA2D976-00063084E80435E2-0-1";
```

## Performance optimization

:::tip
This feature applies only to IVF indexes.
:::

seekdb provides an automatic performance optimization mechanism for IVF indexes, improving query performance through cache management and regular maintenance.

### Optimization mechanism

IVF index performance optimization includes two types of automated tasks:

1. Cache warming task: periodically checks all IVF indexes. If the cache for an index does not exist, the task automatically triggers cache warming and loads the index data into memory. Cache warming is also performed automatically when an IVF index is created.
2. Cache cleanup task: periodically checks all IVF caches. If a cache corresponds to an index that has been dropped, the task automatically cleans up the invalid cache and releases the memory. Cache cleanup is also performed automatically when an IVF index is dropped.

### Configure the optimization cycle

The system allows you to customize the execution time window for performance optimization tasks to avoid impacting performance during peak business hours.

In the `seekdb` database, you can set the execution cycle using the `vector_index_optimize_duty_time` parameter:

```sql
ALTER SYSTEM SET vector_index_optimize_duty_time='[23:00:00, 24:00:00]';
```

The configuration is described as follows:

* The time format is `[start time, end time]`.
* The above configuration means that optimization tasks are executed only between 23:00:00 and 24:00:00.
* Optimization tasks are not initiated at other times, to avoid impacting normal business operations.

### Monitor optimization tasks

seekdb vector indexes provide monitoring capabilities for optimization tasks:

* You can view tasks that are being executed or waiting to be executed through the `DBA_OB_VECTOR_INDEX_TASKS` view.
* You can view historical task records through the `DBA_OB_VECTOR_INDEX_TASK_HISTORY` view.

Usage examples:

1. View the current task status

View tasks that are being executed or waiting to be executed through the `DBA_OB_VECTOR_INDEX_TASKS` view:

```sql
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASKS;
```

Sample return result:

```shell
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| TABLE_ID | TABLET_ID | TASK_ID | START_TIME | MODIFY_TIME | TRIGGER_TYPE | STATUS | TASK_TYPE | TASK_SCN | RET_CODE | TRACE_ID |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| 500020 | 1152921504606846990 | 2002281 | 1970-08-23 17:10:23.174127 | 1970-08-23 17:10:23.174137 | USER | FINISHED | 2 | 1750671687770026 | 0 | YAFF00B9E4D97-00063839E6BD9BBC-0-1 |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
1 row in set
```

Description of the task status:

* `STATUS = 0`: PREPARE, the task is waiting to be executed.
* `STATUS = 1`: RUNNING, the task is being executed.
* `STATUS = 2`: PENDING, the task is paused.
* `STATUS = 3`: FINISHED, the task has been completed.

Description of the task type:

* `TASK_TYPE = 2`: IVF cache warming task.
* `TASK_TYPE = 3`: IVF invalid cache cleanup task.

2. View historical task records

Completed tasks (tasks with `STATUS = 3`) are automatically archived to the history table every 10 seconds, regardless of whether they succeeded. View the history through the `DBA_OB_VECTOR_INDEX_TASK_HISTORY` view:

```sql
-- Query the history of a specified task ID
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASK_HISTORY WHERE TASK_ID=2002281;
```

Sample return result:

```shell
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| TABLE_ID | TABLET_ID | TASK_ID | START_TIME | MODIFY_TIME | TRIGGER_TYPE | STATUS | TASK_TYPE | TASK_SCN | RET_CODE | TRACE_ID |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| 500020 | 1152921504606846990 | 2002281 | 1970-08-23 17:10:23.174127 | 1970-08-23 17:10:23.174137 | AUTO | FINISHED | 2 | 1750671687770026 | 0 | YAFF00B9E4D97-00063839E6BD9BBC-0-1 |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
1 row in set
```

### Cancel an optimization task

You can cancel a specified task by using the following command:

```sql
-- Obtain the trace_id from the DBA_OB_VECTOR_INDEX_TASKS or DBA_OB_VECTOR_INDEX_TASK_HISTORY view
ALTER SYSTEM CANCEL TASK <trace_id>;
```

:::tip
You can cancel a task only in the failed-retry phase by executing the <code>ALTER SYSTEM CANCEL TASK</code> statement. If a background task is stuck in a specific execution phase, it cannot be canceled by using this statement.
:::

Example:

```sql
-- Log in to the system and obtain the trace_id of the specified task
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASK_HISTORY WHERE TASK_ID=2037736;
```

```shell
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| TABLE_ID | TABLET_ID | TASK_ID | START_TIME | MODIFY_TIME | TRIGGER_TYPE | STATUS | TASK_TYPE | TASK_SCN | RET_CODE | TRACE_ID |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| 500041 | 1152921504606847008 | 2037736 | 1970-08-23 17:10:23.203821 | 1970-08-23 17:10:23.203821 | USER | PREPARED | 2 | 1750682301145225 | -1 | YAFF00B9E4D97-00063839E6BDDEE0-0-1 |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
1 row in set
```

```sql
-- Cancel the task
ALTER SYSTEM CANCEL TASK "YAFF00B9E4D97-00063839E6BDDEE0-0-1";
```

After the task is canceled, it is archived to the history table with `RET_CODE = -4072`, which indicates that the task was canceled.

```sql
-- Log in to the user database and query the task status
SELECT * FROM oceanbase.DBA_OB_VECTOR_INDEX_TASK_HISTORY;
```

```shell
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| TABLE_ID | TABLET_ID | TASK_ID | START_TIME | MODIFY_TIME | TRIGGER_TYPE | STATUS | TASK_TYPE | TASK_SCN | RET_CODE | TRACE_ID |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
| 500041 | 1152921504606847008 | 2037736 | 1970-08-23 17:10:23.203821 | 1970-08-23 17:10:23.203821 | USER | FINISHED | 2 | 1750682301145225 | -4072 | YAFF00B9E4D97-00063839E6BDDEE0-0-1 |
+----------+---------------------+---------+----------------------------+----------------------------+--------------+----------+-----------+------------------+----------+------------------------------------+
1 row in set
```

## References

* [Use SQL functions](../250.vector-function.md)

@@ -0,0 +1,374 @@

---

slug: /hybrid-vector-index
---

# Create a hybrid vector index

This topic describes how to create a hybrid vector index in seekdb.

## Overview

Hybrid vector indexes leverage seekdb's built-in embedding capabilities to greatly simplify the use of vector indexes. They make vectors transparent to users: you directly write the raw data (such as text) to be stored, and seekdb automatically converts it to vectors and builds the index internally. During retrieval, you only provide the raw query content; seekdb automatically performs embedding and searches the vector index, which significantly improves ease of use.

Considering the performance overhead of embedding models, hybrid vector indexes provide two embedding modes:
* Synchronous mode: embedding and indexing are performed immediately after data is written, ensuring real-time data visibility.
* Asynchronous mode: background tasks perform embedding and indexing in batches, which can significantly improve write performance. You can set the trigger cycle of the background tasks based on how quickly written data must become visible.

In addition, this feature provides brute-force search on hybrid vector indexes to help verify the correctness of search results. Brute-force search performs a full table scan to obtain the exact results of the n nearest rows.

## Feature support

:::tip
This feature currently supports only HNSW/HNSW_BQ indexes.
:::

This feature supports the full lifecycle of hybrid vector indexes, including creation, update, deletion, and retrieval, and is compatible with `REFRESH_INDEX` and `REBUILD_INDEX` in the `DBMS_VECTOR` system package. The syntax for update, deletion, and retrieval is exactly the same as that for regular vector indexes. In asynchronous mode, `REFRESH_INDEX` additionally triggers data embedding. For details about creation and retrieval, see the sections below.

The supported features are as follows:

| Module | Feature | Description |
|------|--------|------|
| DDL | Create a hybrid vector index during table creation | You can create a hybrid vector index on a `VARCHAR` column when creating a table |
| DDL | Create a hybrid vector index after table creation | Supports creating a hybrid vector index on a `VARCHAR` column of an existing table |
| Retrieval | `semantic_distance` function | Pass raw data through this function for vector retrieval |
| Retrieval | `semantic_vector_distance` function | Pass vectors through this function for retrieval. There are two usage modes: <ul><li>When the SQL statement includes the `APPROXIMATE`/`APPROX` clause, vector index retrieval is used.</li><li>When the `APPROXIMATE`/`APPROX` clause is not included, brute-force search using a full table scan is performed.</li></ul> |
| DBMS_VECTOR | `REFRESH_INDEX` | The usage is the same as that for regular vector indexes. Performs incremental index refresh and embedding in asynchronous mode |
| DBMS_VECTOR | `REBUILD_INDEX` | The usage is the same as that for regular vector indexes. Performs full index rebuild |

Some usage notes are as follows:

* In synchronous mode, write performance may be affected by embedding performance. In asynchronous mode, data visibility is delayed.
* For repeated retrieval scenarios, we recommend using the AI Function Service to obtain query vectors in advance to avoid embedding on every retrieval, as in the sketch below.
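
For example, you can compute a query vector once with `AI_EMBED` (the same call used in the retrieval examples later in this topic, with the `ob_embed` model registered in the Prerequisites section below) and reuse it across queries:

```sql
-- Compute the embedding once, then reuse @query_vector in multiple queries
SET @query_vector = AI_EMBED('ob_embed', 'Sunflower');
```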

## Prerequisites

Before using hybrid vector indexes, you must register an embedding model and endpoint. The following is a registration example:

```sql
CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_embed');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_embed_endpoint');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
  'ob_embed', '{
    "type": "dense_embedding",
    "model_name": "BAAI/bge-m3"
  }');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
  'ob_embed_endpoint', '{
    "ai_model_name": "ob_embed",
    "url": "https://api.siliconflow.cn/v1/embeddings",
    "access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "provider": "siliconflow"
  }');
```

:::info
Replace <code>access_key</code> with your actual API key. The BAAI/bge-m3 model produces 1024-dimensional vectors, so you must use <code>dim=1024</code> when creating a hybrid vector index.
:::

## Creation syntax and description

Hybrid vector indexes support two creation methods: **creation during table creation** and **creation after table creation**. When creating an index, note the following:

* The index must be created on a column of the `VARCHAR` type.
* The `model` and `sync_mode` parameters are not supported for regular vector indexes.
* The parameters and descriptions for an index created after table creation are the same as those for an index created during table creation.

### Create during table creation

You can use the `CREATE TABLE` statement to create a hybrid vector index. Through index parameters, background tasks can be initiated synchronously or asynchronously. In synchronous mode, `VARCHAR` data is automatically converted to vector data when data is inserted. In asynchronous mode, data conversion is performed periodically or manually.

#### Syntax

```sql
CREATE TABLE table_name (
    column_name1 data_type1,
    column_name2 VARCHAR,  -- Text column
    ...,
    VECTOR INDEX index_name (column_name2) WITH (param1=value1, param2=value2, ...)
);
```

#### Parameter description

| Parameter | Default value | Value range | Required | Description | Remarks |
|------|--------|----------|----------|------|------|
| `distance` | | `l2`/`inner_product`/`cosine` | Yes | Specifies the vector distance algorithm type. | `l2` indicates Euclidean distance, `inner_product` indicates inner product distance, and `cosine` indicates cosine distance. |
| `type` | | Currently supports `hnsw` / `hnsw_bq` | Yes | Specifies the index algorithm type. | |
| `lib` | `vsag` | `vsag` | No | Specifies the vector index library type. | Currently, only the VSAG vector library is supported. |
| `model` | | Registered model name | Yes | Specifies the name of the embedding model. | The model must be registered using the AI Function Service before you create the index.<main id="notice" type='notice'><h4>Note</h4><p>Regular vector indexes do not support setting this parameter.</p></main> |
| `dim` | | Positive integer, maximum 4096 | Yes | Specifies the vector dimension after embedding. | Must match the dimension produced by the model. |
| `sync_mode` | `async` | `immediate`/`manual`/`async` | No | Specifies the data and index synchronization mode. | `immediate` indicates synchronous mode, `manual` indicates manual mode, and `async` indicates asynchronous mode.<main id="notice" type='notice'><h4>Note</h4><p>Regular vector indexes do not support setting this parameter.</p></main> |
| `sync_interval` | `10s` | Time interval, such as `10s`, `1h`, or `1d` | No | Sets the trigger cycle of background tasks in asynchronous mode. | The numeric part must be positive. Supported units include seconds (s), hours (h), and days (d). |

The usage of other vector index parameters (such as `m`, `ef_construction`, and `ef_search`) is the same as that for regular vector indexes. For details, see the related documentation.

### Create after table creation

Supports creating a hybrid vector index on a `VARCHAR` column of an existing table. When you create the index, synchronous or asynchronous background tasks are initiated based on the provided index parameters. In synchronous mode, all existing `VARCHAR` data is converted to vector data. In asynchronous mode, data conversion is performed periodically or manually.

#### Syntax

```sql
CREATE VECTOR INDEX index_name
ON table_name(varchar_column_name)
WITH (param1=value1, param2=value2, ...);
```

#### Parameter description

The parameter description is the same as that for creating an index during table creation. For details, see the section above.

## Create, update, and delete examples

DML operations (`INSERT`, `UPDATE`, `DELETE`) on hybrid vector indexes are exactly the same as those on regular vector indexes. When you insert or update `VARCHAR` data, the system performs embedding synchronously or asynchronously based on the `sync_mode` parameter setting.

### Create during table creation

Create the `vector_idx` index when creating the test table `items`:

```sql
-- Assume that the ob_embed model has been created previously (see the "Prerequisites" section to register the model)
CREATE TABLE items (
    id BIGINT PRIMARY KEY,
    doc VARCHAR(100),
    VECTOR INDEX vector_idx(doc)
    WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=async, sync_interval=10s)
);
```

Insert a row of data into the test table `items`. The system automatically performs embedding:

```sql
INSERT INTO items(id, doc) VALUES(1, 'Rose');
```

### Create after table creation

After creating the test table `items`, use the `CREATE VECTOR INDEX` statement to create the `vector_idx` index:

```sql
CREATE TABLE items (
    id BIGINT PRIMARY KEY,
    doc VARCHAR(100)
);

-- Assume that the ob_embed model has been created previously (see the "Prerequisites" section to register the model)
CREATE VECTOR INDEX vector_idx
ON items (doc)
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=async, sync_interval=10s);
```

Insert a row of data into the test table `items`. The system automatically performs embedding:

```sql
INSERT INTO items(id, doc) VALUES(1, 'Rose');
```

### Update

When you update `VARCHAR` data, the system re-performs embedding:

* Synchronous mode: re-embedding is performed immediately after the update.
* Asynchronous mode: re-embedding is performed by background tasks at the next trigger cycle after the update.

Usage example:

```sql
UPDATE items SET doc = 'Lily' WHERE id = 1;
```

### Delete

The delete operation is the same as that for regular vector indexes. You can directly delete the data.

Usage example:

```sql
DELETE FROM items WHERE id = 1;
```

## Retrieval

Hybrid vector indexes support two retrieval methods:

* Retrieve using raw text
* Retrieve using vectors

For detailed usage of the `APPROXIMATE`/`APPROX` clause, see the related documentation on creating vector indexes at the end of this topic.

### Retrieve using raw text

Use the `semantic_distance` expression to pass raw text for vector retrieval.

#### Syntax

```sql
SELECT ... FROM table_name
ORDER BY semantic_distance(column_name, 'query_text') [APPROXIMATE|APPROX]
LIMIT n;
```

Where:
* `column_name`: The text column specified when creating the hybrid vector index.
* `query_text`: The raw text for retrieval.
* `n`: The number of result rows to return.

#### Usage example

```sql
-- Assume that the ob_embed model has been created previously
CREATE TABLE items (
    id INT PRIMARY KEY,
    doc varchar(100),
    VECTOR INDEX vector_idx(doc)
    WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate)
);

INSERT INTO items(id, doc) VALUES(1, 'Rose');
INSERT INTO items(id, doc) VALUES(2, 'Sunflower');
INSERT INTO items(id, doc) VALUES(3, 'Rose');
INSERT INTO items(id, doc) VALUES(4, 'Sunflower');
INSERT INTO items(id, doc) VALUES(5, 'Rose');

-- Retrieve using raw text
SELECT id, doc FROM items
ORDER BY semantic_distance(doc, 'Sunflower')
APPROXIMATE LIMIT 3;
```

The return result is as follows:

```shell
+----+-----------+
| id | doc |
+----+-----------+
| 2 | Sunflower |
| 4 | Sunflower |
| 5 | Rose |
+----+-----------+
3 rows in set
```

### Retrieve using vectors (with APPROXIMATE clause)

Use the `semantic_vector_distance` expression to pass vectors for retrieval. When the retrieval statement includes the `APPROXIMATE`/`APPROX` clause, vector index retrieval is used.

#### Syntax

```sql
SELECT ... FROM table_name
ORDER BY semantic_vector_distance(column_name, query_vector) [APPROXIMATE|APPROX]
LIMIT n;
```

Where:
* `column_name`: The text column specified when creating the hybrid vector index.
* `query_vector`: The query vector.
* `n`: The number of result rows to return.

#### Usage example

```sql
-- Assume that the ob_embed model has been created previously (see the "Prerequisites" section to register the model)
CREATE TABLE items (
    id INT PRIMARY KEY,
    doc varchar(100),
    VECTOR INDEX vector_idx(doc)
    WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate)
);

INSERT INTO items(id, doc) VALUES(1, 'Rose');
INSERT INTO items(id, doc) VALUES(2, 'Lily');
INSERT INTO items(id, doc) VALUES(3, 'Sunflower');
INSERT INTO items(id, doc) VALUES(4, 'Rose');

-- First, obtain the query vector
SET @query_vector = AI_EMBED('ob_embed', 'Sunflower');

-- Retrieve using vectors with the index
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @query_vector)
APPROXIMATE LIMIT 3;
```

The return result is as follows:

```shell
+----+-----------+
| id | doc |
+----+-----------+
| 3 | Sunflower |
| 1 | Rose |
| 4 | Rose |
+----+-----------+
3 rows in set
```

### Retrieve using vectors (without APPROXIMATE clause)

Use the `semantic_vector_distance` expression to pass vectors for retrieval. When the `APPROXIMATE`/`APPROX` clause is not included, a brute-force search with a full table scan is performed to obtain the exact results of the n nearest rows. During execution, the `distance` type is read from the table schema, a full table scan is performed, and the vector distance is calculated for each row to ensure accurate results.

#### Syntax

```sql
SELECT ... FROM table_name
ORDER BY semantic_vector_distance(column_name, query_vector)
LIMIT n;
```

Where:
* `column_name`: The text column specified when creating the hybrid vector index.
* `query_vector`: The query vector.
* `n`: The number of result rows to return.

#### Usage example

```sql
-- Retrieve using vectors with brute-force search (exact results)
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @query_vector)
LIMIT 3;
```

The return result is as follows:

```shell
+----+-----------+
| id | doc |
+----+-----------+
| 3 | Sunflower |
| 4 | Rose |
| 1 | Rose |
+----+-----------+
3 rows in set
```

## Index maintenance

Hybrid vector indexes support using the `DBMS_VECTOR` system package for index maintenance, including incremental refresh and full rebuild.

### Incremental refresh

If a large amount of data is written after the index is created, we recommend that you perform an incremental refresh by using the `REFRESH_INDEX` procedure. For descriptions and examples, see the related documentation.

Special notes for hybrid vector indexes (a sketch follows this list):
* The usage is the same as that for regular vector indexes. For details, see the related documentation.
* In asynchronous mode, `REFRESH_INDEX` additionally triggers data embedding to ensure that incremental data is correctly converted to vectors and added to the index.
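
The following sketch assumes the procedure takes the index name and table name as its first two arguments; see the REFRESH_INDEX reference linked at the end of this topic for the authoritative signature.

```sql
-- Sketch: refresh the hybrid index created above; in async mode this also
-- embeds rows that have not been embedded yet (argument order is an assumption).
CALL DBMS_VECTOR.REFRESH_INDEX('vector_idx', 'items');
```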

### Full refresh (rebuild)

If a large amount of data is updated or deleted after the index is created, we recommend that you perform a full refresh by using the `REBUILD_INDEX` procedure. For descriptions and examples, see the related documentation.

Special notes for hybrid vector indexes:
* The usage is the same as that for regular vector indexes. For details, see the related documentation.
* The task merges incremental data and snapshots.

## Related documentation

* [AI Function Service](../../300.ai-function/200.ai-function.md)
* [Create a vector index](200.dense-vector-index.md)
* [REFRESH_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002753999)
* [REBUILD_INDEX](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002754000)

@@ -0,0 +1,279 @@

---

slug: /in-memory-sparse-vector-index
---

# In-memory sparse vector index

This topic describes how to create, query, and use in-memory sparse vector indexes in seekdb.

## Overview

In-memory sparse vector indexes are an efficient index type provided by seekdb for sparse vector data (vectors in which most elements are zero). In-memory sparse vector indexes must be fully loaded into memory, and they support DML and real-time queries.

To improve the query performance of sparse vectors, seekdb integrates the sparse vector index (SINDI) from the VSAG algorithm library. This index performs better than disk-based sparse vector indexes and is suitable when memory resources are sufficient.

## Feature support

In-memory sparse vector indexes support the following features:

| Module | Feature | Description |
|------|--------|------|
| DDL | Create a sparse vector index during table creation | You can create a sparse vector index on a `SPARSEVECTOR` column when creating a table. The maximum supported dimension is 500,000. |
| DDL | Create a sparse vector index after table creation | Supports creating a sparse vector index on a `SPARSEVECTOR` column of an existing table. The maximum supported dimension is 500,000. |
| DML | Insert, update, delete | The syntax for DML operations is exactly the same as that for regular vector indexes. |
| Retrieval | Vector retrieval | Supports retrieval using SQL functions. |
| Retrieval | Query parameters | Supports setting query-level parameters through the `PARAMETERS` clause during retrieval. |
| DBMS_VECTOR | `REFRESH_INDEX` | Performs incremental index refresh. |
| DBMS_VECTOR | `REBUILD_INDEX` | Performs full index rebuild. |

## Index memory estimation and actual usage query

Index memory can be estimated through the `DBMS_VECTOR` system package. The usage is the same as that for dense indexes; the only special requirement for sparse vector indexes is the following:

* The `IDX_TYPE` parameter must be set to `SINDI` (case-insensitive).

## Creation syntax and description

In-memory sparse vector indexes support two creation methods: **creation during table creation** and **creation after table creation**. When creating an index, note the following:

* The maximum supported dimension for columns on which sparse vector indexes are created is 500,000.
* Sparse vector indexes must be created on columns of the `SPARSEVECTOR` type.
* The `VECTOR` keyword is required when creating an index.
* The index type must be set to `sindi`, which indicates an in-memory sparse vector index.
* Only the `inner_product` (inner product) distance algorithm is supported.
* The parameters and descriptions for an index created after table creation are the same as those for an index created during table creation.

### Create during table creation

Supports using the `CREATE TABLE` statement to create a sparse vector index.

#### Syntax

```sql
CREATE TABLE table_name (
    column_name1 data_type1,
    column_name2 SPARSEVECTOR,
    ...,
    VECTOR INDEX index_name (column_name2) WITH (param1=value1, param2=value2, ...)
);
```

#### Parameter description

| Parameter | Default value | Value range | Required | Description | Remarks |
|------|--------|----------|----------|------|------|
| `distance` | | `inner_product` | Yes | Specifies the vector distance algorithm type. | Sparse vector indexes support only inner product (`inner_product`) as the distance algorithm. |
| `type` | | `sindi` | Yes | Specifies the index algorithm type. | Indicates an in-memory sparse vector index. |
| `lib` | `vsag` | `vsag` | No | Specifies the vector index library type. | Currently, only the VSAG vector library is supported. |
| `prune` | `false` | `true`/`false` | No | Whether to prune vectors. | When `prune` is `true`, you need to set the `refine` and `drop_ratio_build` parameters. When `prune` is `false`, full-precision retrieval is provided; in this case, setting `refine` to `true` or `drop_ratio_build` to a non-zero value returns an error. |
| `refine` | `false` | `true`/`false` | No | Whether reranking is needed. | When set to `true`, the original sparse vectors are fetched for the search results to perform high-precision distance calculation and reranking, which means an additional copy of the original vector data must be stored. Can be set only when `prune=true`. |
| `drop_ratio_build` | `0` | `[0, 0.9]` | No | The pruning ratio for sparse vector data. | When a new sparse vector is inserted, the smallest values are pruned in proportion to `drop_ratio_build`, based on value size. If `refine` is `true`, the original vector data is preserved; otherwise, only the pruned data is retained. Can be set only when `prune=true`. |
| `drop_ratio_search` | `0` | `[0, 0.9]` | No | The pruning ratio for sparse vector values during retrieval. | The larger the value, the more pruning is performed, the lower the accuracy, and the higher the performance. Can also be set through the `PARAMETERS` clause during retrieval; query-level parameters take precedence. |
| `refine_k` | `4.0` | `[1.0, 1000.0]` | No | The proportion of results that participate in reranking. | Retrieves `limit_k * refine_k` results and fetches the original vectors for reranking. Meaningful only when `refine=true`. Can also be set through the `PARAMETERS` clause during retrieval; query-level parameters take precedence. |

### Create after table creation

Supports creating a sparse vector index on a `SPARSEVECTOR` column of an existing table.

#### Syntax

```sql
CREATE VECTOR INDEX index_name ON table_name(column_name) WITH (param1=value1, param2=value2, ...);
```

#### Parameter description

The parameter description is the same as that for creating an index during table creation. For details, see the section above.

## Create, update, and delete examples

### Create during table creation

Create the test table `sparse_t1` and create a sparse vector index:

```sql
CREATE TABLE sparse_t1 (
    c1 INT PRIMARY KEY,
    c2 SPARSEVECTOR,
    VECTOR INDEX sparse_idx1(c2)
    WITH (lib=vsag, type=sindi, distance=inner_product)
);
```

Insert sparse vector data into the test table:

```sql
INSERT INTO sparse_t1 VALUES(1, '{1:0.1, 2:0.2, 3:0.3}');
INSERT INTO sparse_t1 VALUES(2, '{3:0.3, 2:0.2, 4:0.4}');
INSERT INTO sparse_t1 VALUES(3, '{3:0.3, 4:0.4, 5:0.5}');
```

Query the test table:

```sql
SELECT * FROM sparse_t1;
```

The return result is as follows:

```shell
+----+---------------------+
| c1 | c2 |
+----+---------------------+
| 1 | {1:0.1,2:0.2,3:0.3} |
| 2 | {2:0.2,3:0.3,4:0.4} |
| 3 | {3:0.3,4:0.4,5:0.5} |
+----+---------------------+
3 rows in set
```

### Create after table creation

Create a sparse vector index after creating the test table:

```sql
CREATE TABLE sparse_t2 (
    c1 INT PRIMARY KEY,
    c2 SPARSEVECTOR
);

CREATE VECTOR INDEX sparse_idx2 ON sparse_t2(c2)
WITH (lib=vsag, type=sindi, distance=inner_product,
      prune=true, refine=true, drop_ratio_build=0.1,
      drop_ratio_search=0.5, refine_k=2.0);
```

Insert sparse vector data into the test table:

```sql
INSERT INTO sparse_t2 VALUES(1, '{1:0.1, 2:0.2, 3:0.3}');
```

Query the test table:

```sql
SELECT * FROM sparse_t2;
```

The return result is as follows:

```shell
+----+---------------------+
| c1 | c2 |
+----+---------------------+
| 1 | {1:0.1,2:0.2,3:0.3} |
+----+---------------------+
1 row in set
```

### Update

When sparse vector data is updated, the index is automatically maintained:

```sql
UPDATE sparse_t1 SET c2 = '{1:0.1}' WHERE c1 = 1;
```

### Delete

The delete operation is the same as that for regular vector indexes. You can directly delete the data:

```sql
DELETE FROM sparse_t1 WHERE c1 = 1;
```

## Retrieval

The retrieval syntax for sparse vector indexes is similar to that for dense vector indexes, using the `APPROXIMATE`/`APPROX` keyword for approximate nearest neighbor retrieval.

### Syntax

```sql
SELECT ... FROM table_name
ORDER BY negative_inner_product(column_name, query_vector) [APPROXIMATE|APPROX]
LIMIT n [PARAMETERS(param1=value1, param2=value2)];
```

Where:
* `column_name`: The `SPARSEVECTOR` column specified when creating the sparse vector index.
* `query_vector`: The query vector, which can be a string in sparse vector format, such as `'{1:2.4, 3:1.5}'`.
* `n`: The number of result rows to return.
* `PARAMETERS`: Optional query-level parameters for setting `drop_ratio_search` and `refine_k`.

### Retrieval usage notes

For detailed requirements, see [Dense vector index](../200.dense-vector-index.md). Only the special requirements for sparse vector indexes are described here:

* Query parameter priority: query-level parameters set in `PARAMETERS` > parameters set when building the index > default values.
* `drop_ratio_search`: value range `[0, 0.9]`, default value `0`. The pruning ratio for sparse vector values during retrieval. The larger the value, the more pruning is performed, the lower the accuracy, and the higher the performance. Prunes the smallest `query_length * drop_ratio_search` values based on value size. Because pruning all values is meaningless, at least one value is always retained.
* `refine_k`: value range `[1.0, 1000.0]`, default value `4.0`. The proportion of results that participate in reranking. Queries `limit_k * refine_k` results and fetches the original vectors for reranking. Effective only when `refine=true` (see the additional sketch after the usage examples below).

### Usage examples

#### Regular query

```sql
CREATE TABLE t1 (
    c1 INT PRIMARY KEY,
    c2 SPARSEVECTOR,
    VECTOR INDEX idx1(c2)
    WITH (lib=vsag, type=sindi, distance=inner_product)
);

INSERT INTO t1 VALUES(1, '{1:0.1, 2:0.2, 3:0.3}');
INSERT INTO t1 VALUES(2, '{3:0.3, 2:0.2, 4:0.4}');
INSERT INTO t1 VALUES(3, '{3:0.3, 4:0.4, 5:0.5}');
INSERT INTO t1 VALUES(4, '{5:0.5, 4:0.4, 6:0.6}');
INSERT INTO t1 VALUES(5, '{5:0.5, 6:0.6, 7:0.7}');

SELECT * FROM t1
ORDER BY negative_inner_product(c2, '{3:0.3, 4:0.4}')
APPROXIMATE LIMIT 4;
```

The return result is as follows:

```shell
+----+---------------------+
| c1 | c2 |
+----+---------------------+
| 2 | {2:0.2,3:0.3,4:0.4} |
| 3 | {3:0.3,4:0.4,5:0.5} |
| 4 | {4:0.4,5:0.5,6:0.6} |
| 1 | {1:0.1,2:0.2,3:0.3} |
+----+---------------------+
4 rows in set
```

#### Use query parameters

```sql
SELECT *, negative_inner_product(c2, '{3:0.3, 4:0.4}')
AS score FROM t1
ORDER BY score APPROXIMATE LIMIT 4
PARAMETERS(drop_ratio_search=0.5);
```

The return result is as follows:

```shell
+----+---------------------+---------------------+
| c1 | c2 | score |
+----+---------------------+---------------------+
| 4 | {4:0.4,5:0.5,6:0.6} | -0.1600000113248825 |
| 3 | {3:0.3,4:0.4,5:0.5} | -0.2500000149011612 |
| 2 | {2:0.2,3:0.3,4:0.4} | -0.2500000149011612 |
+----+---------------------+---------------------+
3 rows in set
```
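
A variant sketch using `refine_k` follows. It assumes an index built with `refine=true` (such as `sparse_idx2` above); the query vector and parameter values are illustrative.

```sql
-- Sketch: widen the rerank candidate set on a refine-enabled index
SELECT *, negative_inner_product(c2, '{1:0.1, 2:0.2}') AS score
FROM sparse_t2
ORDER BY score APPROXIMATE LIMIT 2
PARAMETERS(drop_ratio_search=0.2, refine_k=8.0);
```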

## Index monitoring and maintenance

In-memory sparse vector indexes provide monitoring views and support using the `DBMS_VECTOR` system package for index maintenance, including incremental refresh and full rebuild. The usage is the same as that for dense indexes.

## Related documentation

* For detailed information about sparse vector data types, see [Vector data type](../../700.vector-search-reference/100.vector-data-type.md).
* For detailed information about vector distance functions, see [Vector functions](../../250.vector-function.md).
* For monitoring and maintenance of dense vector indexes, see [Vector index monitoring/maintenance](../200.dense-vector-index.md).
* For index memory estimation and actual usage query of vector indexes, see [Index memory estimation and actual usage query](../200.dense-vector-index.md).

@@ -0,0 +1,611 @@

---

slug: /vector-function
---

# Use SQL functions

This topic describes the vector functions supported by seekdb and the considerations for using them.

## Considerations

* Vectors of different dimensions cannot be used together in the operations below; the error `different vector dimensions %d and %d` is returned (see the sketch after this list).

* When the result exceeds the floating-point range, the error `value out of range: overflow / underflow` is returned.

* Dense vector indexes support L2, inner product, and cosine distance as index distance algorithms. In-memory sparse vector indexes support inner product as the distance algorithm. For details, see [Create vector indexes](200.vector-index/200.dense-vector-index.md).

* Vector index search supports the `L2_distance`, `Cosine_distance`, and `Inner_product` distance functions described in this topic.
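
For example, the following query is expected to fail with the dimension-mismatch error, because the two vector literals have 3 and 2 dimensions (the exact error text follows the format quoted above):

```sql
-- Expected to fail: different vector dimensions 3 and 2
SELECT l2_distance('[1,2,3]', '[1,2]');
```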

## Distance functions

Distance functions are used to calculate the distance between two vectors. The calculation method varies depending on the distance algorithm used.

### L2_distance

Euclidean distance is the straight-line distance between two vectors, computed from their coordinates by applying the Pythagorean theorem:

$$
\mathrm{l2\_distance}(v_1, v_2) = \sqrt{\sum_{i=1}^{n} (v_{1,i} - v_{2,i})^2}
$$

The function syntax is as follows:

```sql
l2_distance(vector v1, vector v2)
```

The parameters are described as follows:

* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).

* The dimensions of the two parameters must be the same.

* If a single-level array type parameter exists, its elements cannot be `NULL`.

The return values are described as follows:

* The return value is a `distance(double)` distance value.

* If any parameter is `NULL`, the return value is `NULL`.

Here is an example:

```sql
CREATE TABLE t1(c1 vector(3));
INSERT INTO t1 VALUES('[1,2,3]');
SELECT l2_distance(c1, [1,2,3]), l2_distance([1,2,3],[1,1,1]), l2_distance('[1,1,1]','[1,2,3]') FROM t1;
```

The return result is as follows:

```shell
+--------------------------+------------------------------+----------------------------------+
| l2_distance(c1, [1,2,3]) | l2_distance([1,2,3],[1,1,1]) | l2_distance('[1,1,1]','[1,2,3]') |
+--------------------------+------------------------------+----------------------------------+
| 0 | 2.23606797749979 | 2.23606797749979 |
+--------------------------+------------------------------+----------------------------------+
1 row in set
```

### L2_squared

L2 squared distance is the square of the Euclidean distance (L2 distance). It omits the square root in the Euclidean distance formula, reducing computational cost while preserving the relative order of distances:

$$
\mathrm{l2\_squared}(v_1, v_2) = \sum_{i=1}^{n} (v_{1,i} - v_{2,i})^2
$$

The syntax is as follows:

```sql
l2_squared(vector v1, vector v2)
```

The parameters are described as follows:

* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).

* The dimensions of the two parameters must be the same.

* If a single-level array type parameter exists, its elements cannot be `NULL`.

The return values are described as follows:

* The return value is a `distance(double)` distance value.

* If any parameter is `NULL`, the return value is `NULL`.

Here is an example:

```sql
CREATE TABLE t1(c1 vector(3));
INSERT INTO t1 VALUES('[1,2,3]');
SELECT l2_squared(c1, [1,2,3]), l2_squared([1,2,3],[1,1,1]), l2_squared('[1,1,1]','[1,2,3]') FROM t1;
```

The return result is as follows:

```shell
+-------------------------+-----------------------------+---------------------------------+
| l2_squared(c1, [1,2,3]) | l2_squared([1,2,3],[1,1,1]) | l2_squared('[1,1,1]','[1,2,3]') |
+-------------------------+-----------------------------+---------------------------------+
| 0 | 5 | 5 |
+-------------------------+-----------------------------+---------------------------------+
1 row in set
```

### L1_distance

The Manhattan distance is the sum of the absolute axis distances between two points in a standard coordinate system:

$$
\mathrm{l1\_distance}(v_1, v_2) = \sum_{i=1}^{n} \lvert v_{1,i} - v_{2,i} \rvert
$$

The function syntax is as follows:

```sql
l1_distance(vector v1, vector v2)
```

The parameters are described as follows:

* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).

* The dimensions of the two parameters must be the same.

* If a single-level array type parameter exists, its elements cannot be `NULL`.

The return values are described as follows:

* The return value is a `distance(double)` distance value.

* If any parameter is `NULL`, the return value is `NULL`.

Here is an example:

```sql
CREATE TABLE t2(c1 vector(3));
INSERT INTO t2 VALUES('[1,2,3]');
INSERT INTO t2 VALUES('[1,1,1]');
SELECT l1_distance(c1, [1,2,3]) FROM t2;
```

The return result is as follows:

```shell
+--------------------------+
| l1_distance(c1, [1,2,3]) |
+--------------------------+
| 0 |
| 3 |
+--------------------------+
2 rows in set
```
|
||||
|
||||
### Cosine_distance
|
||||
|
||||
Cosine similarity measures the angular difference between two vectors and reflects their directional similarity, regardless of their lengths (magnitude). The value range of cosine similarity is `[-1, 1]`, where `1` indicates vectors in exactly the same direction, `0` indicates orthogonality, and `-1` indicates completely opposite directions.
|
||||
|
||||
The calculation method for cosine similarity is as follows:
|
||||
|
||||

|
||||
|
||||
Since cosine similarity closer to 1 indicates greater similarity, cosine distance (or cosine dissimilarity) is sometimes used as a measure of distance between vectors. Cosine distance can be calculated by subtracting cosine similarity from 1:
|
||||
|
||||

|
||||
|
||||
The value range of cosine distance is `[0, 2]`, where `0` indicates exactly the same direction (no distance) and `2` indicates completely opposite directions.
|
||||
|
||||
The function syntax is as follows:
|
||||
|
||||
```sql
|
||||
cosine_distance(vector v1, vector v2)
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
|
||||
|
||||
* The dimensions of the two parameters must be the same.
|
||||
|
||||
* If a single-level array type parameter exists, its elements cannot be `NULL`.
|
||||
|
||||
The return values are described as follows:
|
||||
|
||||
* The return value is a `distance(double)` distance value.
|
||||
|
||||
* If any parameter is `NULL`, the return value is `NULL`.
|
||||
|
||||
Here is an example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t3(c1 vector(3));
|
||||
INSERT INTO t3 VALUES('[1,2,3]');
|
||||
INSERT INTO t3 VALUES('[1,2,1]');
|
||||
SELECT cosine_distance(c1, [1,2,3]) FROM t3;
|
||||
```

The return result is as follows:

```shell
|
||||
+------------------------------+
|
||||
| cosine_distance(c1, [1,2,3]) |
|
||||
+------------------------------+
|
||||
| 0 |
|
||||
| 0.12712843905603044 |
|
||||
+------------------------------+
|
||||
2 rows in set
|
||||
```
|
||||
|
||||
### Inner_product
|
||||
|
||||
The inner product, also known as the dot product or scalar product, is a form of multiplication between two vectors. Geometrically, it reflects the relationship between the directions and magnitudes of the two vectors. The calculation method for the inner product is as follows:
|
||||
|
||||

|
||||
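
For example, $[1,2,1] \cdot [1,2,3] = 1 \times 1 + 2 \times 2 + 1 \times 3 = 8$, which matches the second row of the dense example below.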
|
||||
The syntax is as follows:
|
||||
|
||||
```sql
|
||||
inner_product(vector v1, vector v2)
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
|
||||
|
||||
* The dimensions of the two parameters must be the same.
|
||||
|
||||
* If a single-level array type parameter exists, its elements cannot be `NULL`.
|
||||
|
||||
* When this function is used with sparse vectors, one of the parameters can be a string in sparse vector format, for example `inner_product(c2, '{1:2.4}')`. The two parameters cannot both be strings.
|
||||
|
||||
The return values are described as follows:
|
||||
|
||||
* The return value is a `distance(double)` distance value.
|
||||
|
||||
* If any parameter is `NULL`, the return value is `NULL`.
|
||||
|
||||
Dense vector example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t4(c1 vector(3));
|
||||
INSERT INTO t4 VALUES('[1,2,3]');
|
||||
INSERT INTO t4 VALUES('[1,2,1]');
|
||||
SELECT inner_product(c1, [1,2,3]) FROM t4;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+----------------------------+
|
||||
| inner_product(c1, [1,2,3]) |
|
||||
+----------------------------+
|
||||
| 14 |
|
||||
| 8 |
|
||||
+----------------------------+
|
||||
2 rows in set
|
||||
```
|
||||
|
||||
Sparse vector example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t4_sparse(c1 INT, c2 SPARSEVECTOR, c3 SPARSEVECTOR);
INSERT INTO t4_sparse VALUES(1, '{1:1.1, 2:2.2}', '{1:2.4}');
INSERT INTO t4_sparse VALUES(2, '{1:1.5, 3:3.6}', '{4:4.5}');
SELECT inner_product(c2,c3) FROM t4_sparse;
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+----------------------+
|
||||
| inner_product(c2,c3) |
|
||||
+----------------------+
|
||||
| 2.640000104904175 |
|
||||
| 0 |
|
||||
+----------------------+
|
||||
2 rows in set
|
||||
```
|
||||
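
Only dimensions present in both sparse vectors contribute to the result: the first row yields $1.1 \times 2.4 = 2.64$, and the second row yields $0$ because the two vectors share no dimension.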
|
||||
### Vector_distance
|
||||
|
||||
The vector_distance function calculates the distance between two vectors. You can specify parameters to select different distance algorithms.
|
||||
|
||||
The syntax is as follows:
|
||||
|
||||
```sql
|
||||
vector_distance(vector v1, vector v2 [, string metric])
|
||||
```
|
||||
|
||||
The `vector v1/v2` parameters are described as follows:
|
||||
|
||||
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
|
||||
|
||||
* The dimensions of the two parameters must be the same.
|
||||
|
||||
* If a single-level array type parameter exists, its elements cannot be `NULL`.
|
||||
|
||||
The `metric` parameter specifies the distance algorithm. The options are described as follows:
|
||||
|
||||
* If not specified, the default algorithm is `euclidean`.
|
||||
|
||||
* If specified, the only valid values are:
|
||||
|
||||
    * `euclidean`: Euclidean distance, equivalent to `L2_distance`.

    * `manhattan`: Manhattan distance, equivalent to `L1_distance`.

    * `cosine`: cosine distance, equivalent to `Cosine_distance`.

    * `dot`: inner product, equivalent to `Inner_product`.
|
||||
|
||||
The return values are described as follows:
|
||||
|
||||
* The return value is a `distance(double)` distance value.
|
||||
|
||||
* If any parameter is `NULL`, the return value is `NULL`.
|
||||
|
||||
Here is an example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t5(c1 vector(3));
|
||||
INSERT INTO t5 VALUES('[1,2,3]');
|
||||
INSERT INTO t5 VALUES('[1,2,1]');
|
||||
SELECT vector_distance(c1, [1,2,3], euclidean) FROM t5;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+-----------------------------------------+
|
||||
| vector_distance(c1, [1,2,3], euclidean) |
|
||||
+-----------------------------------------+
|
||||
| 0 |
|
||||
| 2 |
|
||||
+-----------------------------------------+
|
||||
2 rows in set
|
||||
```
|
||||
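
To compare the algorithms, the following minimal sketch reuses table `t5` from the preceding example and queries all four metrics. The expected values in the comments are computed by hand rather than captured seekdb output, and depending on the version the metric may need to be quoted as a string (for example, `'euclidean'`) per the syntax above:

```sql
-- c1 holds [1,2,3] and [1,2,1]; expected values per row are shown as comments
SELECT vector_distance(c1, [1,2,3], euclidean) AS l2_d,   -- 0 and 2
       vector_distance(c1, [1,2,3], manhattan) AS l1_d,   -- 0 and 2
       vector_distance(c1, [1,2,3], cosine)    AS cos_d,  -- 0 and ~0.1271
       vector_distance(c1, [1,2,3], dot)       AS ip_d    -- 14 and 8
FROM t5;
```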
|
||||
## Arithmetic functions
|
||||
|
||||
Arithmetic functions provide element-wise addition (`+`), subtraction (`-`), and multiplication (`*`). The operands can be a vector combined with another vector, a single-level array, or a vector-formatted string, or a single-level array combined with another single-level array or a vector-formatted string. The calculation is performed element by element, as shown below for addition:
|
||||
|
||||

|
||||
|
||||
The syntax is as follows:
|
||||
|
||||
```sql
|
||||
v1 + v2
|
||||
v1 - v2
|
||||
v1 * v2
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`). **Note**: Two parameters cannot both be string types. At least one parameter must be a vector or single-level array type.
|
||||
|
||||
* The dimensions of the two parameters must be the same.
|
||||
|
||||
* If a single-level array type parameter exists, its elements cannot be `NULL`.
|
||||
|
||||
The return values are described as follows:
|
||||
|
||||
* When at least one of the two parameters is a vector type, the return value is of the same vector type as the vector parameter.
|
||||
|
||||
* When both parameters are single-level array types, the return value is of the `array(float)` type.
|
||||
|
||||
* If any parameter is `NULL`, the return value is `NULL`.
|
||||
|
||||
Here is an example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t6(c1 vector(3));
|
||||
INSERT INTO t6 VALUES('[1,2,3]');
|
||||
SELECT [1,2,3] + '[1.12,1000.0001, -1.2222]', c1 - [1,2,3] FROM t6;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+---------------------------------------+--------------+
|
||||
| [1,2,3] + '[1.12,1000.0001, -1.2222]' | c1 - [1,2,3] |
|
||||
+---------------------------------------+--------------+
|
||||
| [2.12,1002,1.7778] | [0,0,0] |
|
||||
+---------------------------------------+--------------+
|
||||
1 row in set
|
||||
```
|
||||
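
Subtraction and multiplication follow the same element-wise rule. A brief sketch reusing `t6` from the example above, with the expected value computed by hand rather than captured output:

```sql
-- c1 is [1,2,3]; element-wise multiplication
SELECT c1 * [2,2,2] FROM t6;  -- expected: [2,4,6]
```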
|
||||
## Comparison functions
|
||||
|
||||
Comparison functions provide comparisons between vector types and other vector types, single-level array types, and vector-formatted string types, using the operators `=`, `!=`, `>`, `>=`, `<`, and `<=`. Vectors are compared element by element in lexicographic order.
|
||||
|
||||
The syntax is as follows:
|
||||
|
||||
```sql
|
||||
v1 = v2
|
||||
v1 != v2
|
||||
v1 > v2
|
||||
v1 < v2
|
||||
v1 >= v2
|
||||
v1 <= v2
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
|
||||
|
||||
:::tip
|
||||
One of the two parameters must be a vector type.
|
||||
:::
|
||||
|
||||
* The dimensions of the two parameters must be the same.
|
||||
|
||||
* If a single-level array type parameter exists, its elements cannot be `NULL`.
|
||||
|
||||
The return values are described as follows:
|
||||
|
||||
* The return value is of the bool type.
|
||||
|
||||
* If any parameter is `NULL`, the return value is `NULL`.
|
||||
|
||||
Here is an example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t7(c1 vector(3));
|
||||
INSERT INTO t7 VALUES('[1,2,3]');
|
||||
SELECT c1 = '[1,2,3]' FROM t7;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+----------------+
|
||||
| c1 = '[1,2,3]' |
|
||||
+----------------+
|
||||
| 1 |
|
||||
+----------------+
|
||||
1 row in set
|
||||
```
|
||||
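
The lexicographic rule means the comparison is decided at the first differing element. A minimal sketch reusing `t7` from the example above, with expected results computed by hand rather than captured output:

```sql
-- c1 is [1,2,3]
SELECT c1 > [1,2,1], c1 < [2,0,0] FROM t7;
-- expected: 1, 1  (3 > 1 decides the first; 1 < 2 decides the second)
```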
|
||||
## Aggregate functions
|
||||
|
||||
:::tip
|
||||
Vector columns cannot be used as GROUP BY conditions, and DISTINCT is not supported.
|
||||
:::
|
||||
|
||||
### Sum
|
||||
|
||||
The Sum function is used to calculate the sum of vectors in a vector column of a table, using element-wise accumulation to obtain the sum vector.
|
||||
|
||||
The syntax is as follows:
|
||||
|
||||
```sql
|
||||
sum(vector v1)
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* Only the vector type is supported.
|
||||
|
||||
The return values are described as follows:
|
||||
|
||||
* The return value is a `sum(vector)` value.
|
||||
|
||||
Here is an example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t8(c1 vector(3));
|
||||
INSERT INTO t8 VALUES('[1,2,3]');
|
||||
SELECT sum(c1) FROM t8;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+---------+
|
||||
| sum(c1) |
|
||||
+---------+
|
||||
| [1,2,3] |
|
||||
+---------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
### Avg
|
||||
|
||||
The Avg function is used to calculate the average of vectors in a vector column of a table.
|
||||
|
||||
The syntax is as follows:
|
||||
|
||||
```sql
|
||||
avg(vector v1)
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* Only the vector type is supported.
|
||||
|
||||
The return values are described as follows:
|
||||
|
||||
* The return value is an `avg(vector)` value.
|
||||
|
||||
* `NULL` rows in the vector column are not counted.
|
||||
|
||||
* When the input parameter is empty, the output is `NULL`.
|
||||
|
||||
Here is an example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t9(c1 vector(3));
|
||||
INSERT INTO t9 VALUES('[1,2,3]');
|
||||
SELECT avg(c1) FROM t9;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+---------+
|
||||
| avg(c1) |
|
||||
+---------+
|
||||
| [1,2,3] |
|
||||
+---------+
|
||||
1 row in set
|
||||
```
|
||||
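
With a single row, `sum` and `avg` simply return that row's vector. To see element-wise aggregation across rows, consider this sketch reusing `t9`; the expected values are computed by hand rather than captured output:

```sql
-- t9 already contains [1,2,3]
INSERT INTO t9 VALUES('[3,4,5]');
SELECT sum(c1), avg(c1) FROM t9;
-- expected: sum(c1) = [4,6,8], avg(c1) = [2,3,4]
```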
|
||||
## Other common vector functions
|
||||
|
||||
### Vector_norm
|
||||
|
||||
The Vector_norm function calculates the Euclidean norm (or length) of a vector, which represents the Euclidean distance between the vector and the origin. The calculation formula is as follows:
|
||||
|
||||

|
||||
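
For example, $\|[1,2,3]\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.7417$, which matches the query result below.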
|
||||
The syntax is as follows:
|
||||
|
||||
```sql
|
||||
vector_norm(vector v1)
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
|
||||
|
||||
* If a single-level array type parameter exists, its elements cannot be `NULL`.
|
||||
|
||||
The return values are described as follows:
|
||||
|
||||
* The return value is a `norm(double)` modulus value.
|
||||
|
||||
* If the parameter is `NULL`, the return value is `NULL`.
|
||||
|
||||
Here is an example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t10(c1 vector(3));
|
||||
INSERT INTO t10 VALUES('[1,2,3]');
|
||||
SELECT vector_norm(c1),vector_norm([1,2,3]) FROM t10;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+--------------------+----------------------+
|
||||
| vector_norm(c1) | vector_norm([1,2,3]) |
|
||||
+--------------------+----------------------+
|
||||
| 3.7416573867739413 | 3.7416573867739413 |
|
||||
+--------------------+----------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
### Vector_dims
|
||||
|
||||
The Vector_dims function is used to return the vector dimension.
|
||||
|
||||
The syntax is as follows:
|
||||
|
||||
```sql
|
||||
vector_dims(vector v1)
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* Apart from the vector type, other types that can be forcibly converted to the vector type are accepted, including single-level array types (such as `[1,2,3,..]`) and string types (such as `'[1,2,3]'`).
|
||||
|
||||
The return values are described as follows:
|
||||
|
||||
* The return value is a `dims(int64)` dimension value.
|
||||
|
||||
* If the parameter is `NULL`, an error is returned.
|
||||
|
||||
Here is an example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE t11(c1 vector(3));
|
||||
INSERT INTO t11 VALUES('[1,2,3]');
|
||||
INSERT INTO t11 VALUES('[1,1,1]');
|
||||
SELECT vector_dims(c1), vector_dims('[1,2,3]') FROM t11;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+-----------------+------------------------+
|
||||
| vector_dims(c1) | vector_dims('[1,2,3]') |
|
||||
+-----------------+------------------------+
|
||||
| 3 | 3 |
|
||||
| 3 | 3 |
|
||||
+-----------------+------------------------+
|
||||
2 rows in set
|
||||
```
|
||||
@@ -0,0 +1,232 @@
|
||||
---
|
||||
|
||||
slug: /vector-similarity-search
|
||||
---
|
||||
|
||||
# Vector similarity search
|
||||
|
||||
Vector similarity search, also known as nearest neighbor search, is a search method based on distance metrics in vector space. Its core objective is to find the set of vectors most similar to a given query vector. Although specific distance metrics are used during computation, the final output is the Top K nearest vectors, sorted in ascending order of distance.
|
||||
|
||||
This topic describes two vector search methods in seekdb: exact nearest neighbor search based on a full table scan and approximate nearest neighbor search based on vector indexes. It also provides examples to illustrate how to use these methods.
|
||||
|
||||
:::tip
|
||||
For readability, this document refers to vector nearest neighbor search as "vector search," exact nearest neighbor search as "exact search," and approximate nearest neighbor search as "approximate search."
|
||||
:::
|
||||
|
||||
## Perform exact search
|
||||
|
||||
Exact search uses a full scan strategy, calculating the distance between the query vector and all vectors in the dataset to perform an exact search. This method ensures complete accuracy of the search results, but because it requires calculating the distance for all data, the search performance decreases significantly as the dataset grows.
|
||||
|
||||
When executing an exact search, the system calculates and compares the distances between the query vector vₑ and all other vectors in the vector space. After completing the full distance calculations, the system selects the k vectors closest to the query as the search results.
|
||||
|
||||
### Example: Euclidean search
|
||||
|
||||
Euclidean similarity search is used to retrieve the top-k vectors in the vector space that are closest to the query vector, using Euclidean distance as the metric. The following example demonstrates how to use exact search to retrieve the top 5 vectors from a table that are closest to the query vector:
|
||||
|
||||
```sql
|
||||
-- Create a test table
|
||||
CREATE TABLE t1 (
|
||||
id INT PRIMARY KEY,
|
||||
c1 VECTOR(3)
|
||||
);
|
||||
|
||||
-- Insert data
|
||||
INSERT INTO t1 VALUES
|
||||
(1, '[0.1, 0.2, 0.3]'),
|
||||
(2, '[0.2, 0.3, 0.4]'),
|
||||
(3, '[0.3, 0.4, 0.5]'),
|
||||
(4, '[0.4, 0.5, 0.6]'),
|
||||
(5, '[0.5, 0.6, 0.7]'),
|
||||
(6, '[0.6, 0.7, 0.8]'),
|
||||
(7, '[0.7, 0.8, 0.9]'),
|
||||
(8, '[0.8, 0.9, 1.0]'),
|
||||
(9, '[0.9, 1.0, 0.1]'),
|
||||
(10, '[1.0, 0.1, 0.2]');
|
||||
|
||||
-- Perform an exact search
|
||||
SELECT c1
|
||||
FROM t1
|
||||
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+---------------+
|
||||
| c1 |
|
||||
+---------------+
|
||||
| [0.1,0.2,0.3] |
|
||||
| [0.2,0.3,0.4] |
|
||||
| [0.3,0.4,0.5] |
|
||||
| [0.4,0.5,0.6] |
|
||||
| [0.5,0.6,0.7] |
|
||||
+---------------+
|
||||
5 rows in set
|
||||
```
|
||||
|
||||
### Analyze the execution plan
|
||||
|
||||
Obtain the execution plan of the preceding example:
|
||||
|
||||
```sql
|
||||
EXPLAIN SELECT c1
|
||||
FROM t1
|
||||
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+---------------------------------------------------------------------------------------------+
|
||||
| Query Plan |
|
||||
+---------------------------------------------------------------------------------------------+
|
||||
| ================================================= |
|
||||
| |ID|OPERATOR |NAME|EST.ROWS|EST.TIME(us)| |
|
||||
| ------------------------------------------------- |
|
||||
| |0 |TOP-N SORT | |5 |3 | |
|
||||
| |1 |└─TABLE FULL SCAN|t1 |10 |3 | |
|
||||
| ================================================= |
|
||||
| Outputs & filters: |
|
||||
| ------------------------------------- |
|
||||
| 0 - output([t1.c1]), filter(nil), rowset=16 |
|
||||
| sort_keys([l2_distance(t1.c1, cast('[0.1, 0.2, 0.3]', ARRAY(18, -1))), ASC]), topn(5) |
|
||||
| 1 - output([t1.c1]), filter(nil), rowset=16 |
|
||||
| access([t1.c1]), partitions(p0) |
|
||||
| is_index_back=false, is_global_index=false, |
|
||||
| range_key([t1.id]), range(MIN ; MAX)always true |
|
||||
+---------------------------------------------------------------------------------------------+
|
||||
14 rows in set
|
||||
```
|
||||
|
||||
The analysis is as follows:
|
||||
|
||||
* Execution method:
|
||||
* The full-table scan method is used, which requires traversing all data in the table. The `TABLE FULL SCAN` operation in the execution plan scans all data in the `t1` table.
|
||||
* The system calculates the vector distance for each record and then sorts the records by distance. The `TOP-N SORT` operation in the execution plan calculates the vector distance using the `l2_distance` function and sorts the records by distance in ascending order.
|
||||
* The system returns the five records with the smallest distances. The `topn(5)` setting in the execution plan indicates that only the first five records of the sorted list are returned.
|
||||
|
||||
* Performance characteristics:
|
||||
* Advantages: The search results are completely accurate and ensure that the true nearest neighbors are returned.
|
||||
* Disadvantages: The system must scan all data in the table and calculate the distance between all vectors, leading to a significant drop in performance as the data volume increases.
|
||||
|
||||
* Applicable scenarios:
|
||||
* Scenarios with a small amount of data.
|
||||
* Scenarios where high result accuracy is required.
|
||||
* Scenarios that do not require real-time queries over large datasets.
|
||||
|
||||
## Perform approximate search by using vector indexes
|
||||
|
||||
Vector index search uses an approximate nearest neighbor (ANN) strategy, accelerating the search process through pre-built index structures. While it cannot guarantee 100% result accuracy, it can significantly improve search performance, allowing for a good balance between accuracy and efficiency in practical applications.
|
||||
|
||||
### Example: Approximate search by using the HNSW index
|
||||
|
||||
```sql
|
||||
-- Create an HNSW vector index together with the table.
|
||||
CREATE TABLE t2 (
|
||||
id INT PRIMARY KEY,
|
||||
vec VECTOR(3),
|
||||
VECTOR INDEX idx(vec) WITH (distance=l2, type=hnsw, lib=vsag)
|
||||
);
|
||||
|
||||
-- Insert test data
|
||||
INSERT INTO t2 VALUES
|
||||
(1, '[0.1, 0.2, 0.3]'),
|
||||
(2, '[0.2, 0.3, 0.4]'),
|
||||
(3, '[0.3, 0.4, 0.5]'),
|
||||
(4, '[0.4, 0.5, 0.6]'),
|
||||
(5, '[0.5, 0.6, 0.7]'),
|
||||
(6, '[0.6, 0.7, 0.8]'),
|
||||
(7, '[0.7, 0.8, 0.9]'),
|
||||
(8, '[0.8, 0.9, 1.0]'),
|
||||
(9, '[0.9, 1.0, 0.1]'),
|
||||
(10, '[1.0, 0.1, 0.2]');
|
||||
|
||||
-- Perform approximate search and return the 5 most similar data records
|
||||
SELECT id, vec
|
||||
FROM t2
|
||||
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
|
||||
APPROXIMATE
|
||||
LIMIT 5;
|
||||
```
|
||||
|
||||
The return result is as follows. The result is the same as that of the exact search because the data volume is small:
|
||||
|
||||
```shell
|
||||
+------+---------------+
|
||||
| id | vec |
|
||||
+------+---------------+
|
||||
| 1 | [0.1,0.2,0.3] |
|
||||
| 2 | [0.2,0.3,0.4] |
|
||||
| 3 | [0.3,0.4,0.5] |
|
||||
| 4 | [0.4,0.5,0.6] |
|
||||
| 5 | [0.5,0.6,0.7] |
|
||||
+------+---------------+
|
||||
5 rows in set
|
||||
```
|
||||
|
||||
### Execution plan analysis
|
||||
|
||||
Obtain the execution plan of the preceding example:
|
||||
|
||||
```sql
|
||||
EXPLAIN SELECT id, vec
|
||||
FROM t2
|
||||
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
|
||||
APPROXIMATE
|
||||
LIMIT 5;
|
||||
```
|
||||
|
||||
The return result is as follows:
|
||||
|
||||
```shell
|
||||
+--------------------------------------------------------------------------------------------------------------------+
|
||||
| Query Plan |
|
||||
+--------------------------------------------------------------------------------------------------------------------+
|
||||
| ==================================================== |
|
||||
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
|
||||
| ---------------------------------------------------- |
|
||||
| |0 |VECTOR INDEX SCAN|t2(idx)|10 |29 | |
|
||||
| ==================================================== |
|
||||
| Outputs & filters: |
|
||||
| ------------------------------------- |
|
||||
| 0 - output([t2.id], [t2.vec]), filter(nil), rowset=16 |
|
||||
| access([t2.id], [t2.vec]), partitions(p0) |
|
||||
| is_index_back=true, is_global_index=false, |
|
||||
| range_key([t2.__vid_1750162978114053], [t2.__type_17_1750162978114364]), range(MIN,MIN ; MAX,MAX)always true |
|
||||
+--------------------------------------------------------------------------------------------------------------------+
|
||||
11 rows in set
|
||||
```
|
||||
|
||||
The analysis is as follows:
|
||||
|
||||
* Execution method:
|
||||
* The vector index scan method is used, directly locating similar vectors through the pre-built HNSW index. The `VECTOR INDEX SCAN` operation in the execution plan uses the index `t2(idx)` for retrieval.
|
||||
* The graph structure of the index is used to quickly locate nearest neighbors without calculating the distance between all vectors. The `is_index_back=true` setting in the execution plan indicates that complete data is retrieved through index back-lookup.
|
||||
* The five records that the index considers to be the most similar are returned. The `output([t2.id], [t2.vec])` in the execution plan indicates that the id and vector data are returned.
|
||||
|
||||
* Performance characteristics:
|
||||
* Advantage: The search performance is high and remains stable as the data volume increases.
|
||||
* Disadvantage: A small amount of error may exist in the results, and 100% accuracy is not guaranteed.
|
||||
|
||||
* Applicable scenarios:
|
||||
* Real-time search for large-scale datasets.
|
||||
* Scenarios with high requirements for search performance.
|
||||
* Scenarios that can tolerate a small amount of result error.
|
||||
|
||||
## Summary
|
||||
|
||||
A comparison of the two search methods is as follows:
|
||||
|
||||
| Item | Exact search | Approximate search |
|
||||
|--------|----------------|----------------|
|
||||
| Execution method | Full-table scan (`TABLE FULL SCAN`) followed by sorting | Direct search through the vector index (`VECTOR INDEX SCAN`) |
|
||||
| Execution plan | Contains two operators: `TABLE FULL SCAN` and `TOP-N SORT` | Contains only one operator: `VECTOR INDEX SCAN` |
|
||||
| Performance characteristics | Requires full-table scan and sorting, and performance decreases significantly as the data volume increases | Directly locates target data through the index, and performance is stable |
|
||||
| Result accuracy | 100% accurate, ensuring real nearest neighbors are returned | Approximately accurate, with a small amount of error possible |
|
||||
| Applicable scenarios | Scenarios with small data volumes and high accuracy requirements | Scenarios with large-scale datasets and high performance requirements |
|
||||
|
||||
### References
|
||||
|
||||
* For more information about SQL functions, see [Use SQL functions](250.vector-function.md).
|
||||
* For more information about vector indexes and examples, see [Create vector indexes](200.vector-index/200.dense-vector-index.md).
|
||||
* To perform large-scale performance tests, we recommend that you use the [VectorDBBenchmark tool](700.vector-search-benchmark-test.md) to generate a test dataset to better compare the performance differences between exact search and approximate search.
|
||||
@@ -0,0 +1,127 @@
|
||||
---
|
||||
|
||||
slug: /vector-search-benchmark-test
|
||||
---
|
||||
|
||||
# Benchmark testing with VectorDBBench
|
||||
|
||||
VectorDBBench is a benchmarking tool designed to provide benchmark test results for mainstream vector databases and cloud services. This topic explains how to use VectorDBBench to test the performance of the seekdb vector database. Designed for ease of use, VectorDBBench allows you to easily reproduce test results or test new systems.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
* Deploy seekdb.
|
||||
* Install Python 3.11 or later. The following example uses Conda for installation:
|
||||
```bash
|
||||
# Download and install Conda
|
||||
mkdir -p ~/miniconda3
|
||||
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
|
||||
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
|
||||
rm ~/miniconda3/miniconda.sh
|
||||
|
||||
# Reopen your terminal and initialize Conda
|
||||
source ~/miniconda3/bin/activate
|
||||
conda init --all
|
||||
|
||||
# Create and initialize the Python environment required by VectorDBBench
|
||||
conda create -n vdb python=3.11
|
||||
conda activate vdb
|
||||
```
|
||||
* Connect to the database and optimize memory and query parameters for HNSW vector index searches:
|
||||
|
||||
```sql
|
||||
-- Set ob_vector_memory_limit_percentage to 30%.
|
||||
ALTER SYSTEM SET ob_vector_memory_limit_percentage = 30;
|
||||
-- Set ob_query_timeout to 24 hours.
|
||||
SET GLOBAL ob_query_timeout = 86400000000;
|
||||
-- Set max_allowed_packet to 1 GB.
|
||||
SET GLOBAL max_allowed_packet=1073741824;
|
||||
-- Set ddl_thread_score and parallel_servers_target to configure parallelism when creating indexes
|
||||
ALTER SYSTEM SET ddl_thread_score = 8; -- Parallelism for DDL operations
|
||||
SET GLOBAL parallel_servers_target = 624; -- Number of parallel queries the database server can handle simultaneously
|
||||
```
|
||||
Here, `ob_vector_memory_limit_percentage = 30` is only an example value. Adjust it based on the database memory and workload.
|
||||
|
||||
## Recommended configuration
|
||||
|
||||
The recommended resource specifications for the database are as follows:
|
||||
|
||||
| Parameter | Value |
|
||||
|-------|----|
|
||||
| Memory | 64 GB |
|
||||
| CPU | 16 cores |
|
||||
|
||||
## Testing methods
|
||||
|
||||
### Clone the VectorDBBench code
|
||||
|
||||
:::tip
|
||||
We recommend that you deploy VectorDBBench and seekdb on separate servers to avoid CPU resource contention and improve the reliability of test results.
|
||||
:::
|
||||
|
||||
Clone the VectorDBBench test tool code to your local server.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/zilliztech/VectorDBBench.git
|
||||
```
|
||||
|
||||
### Install dependencies
|
||||
|
||||
Go to the VectorDBBench directory and install the dependencies.
|
||||
|
||||
```bash
|
||||
cd VectorDBBench
|
||||
pip install .
|
||||
```
|
||||
|
||||
### Run the test
|
||||
|
||||
Run VectorDBBench. Two examples are provided here: HNSW index and IVF index.
|
||||
|
||||
#### HNSW index example
|
||||
|
||||
```bash
|
||||
# Replace $host, $port, and $user with the actual seekdb connection information.
|
||||
vectordbbench oceanbasehnsw --host $host --port $port --user $user --database test --m 16 --ef-construction 200 --ef-search 40 --k 10 --case-type Performance768D1M --index-type HNSW
|
||||
```
|
||||
|
||||
For more information about the parameters, run the following command:
|
||||
|
||||
```bash
|
||||
vectordbbench oceanbasehnsw --help
|
||||
```
|
||||
|
||||
Commonly used options are described as follows:
|
||||
|
||||
* `--num-concurrency`: Used to adjust the concurrency level. VectorDBBench executes vector queries with the specified concurrency and selects the highest QPS (Queries Per Second) as the final result.
|
||||
* `--skip-drop-old`/`--skip-load`: Skips the deletion of old data and the data loading step. After adding these two options to the command line, the command only performs vector query operations and does not delete old data or reload data.
|
||||
* `--k`: Specifies the number of top-k nearest neighbor results to return in a vector query.
|
||||
* `--ef-search`: HNSW query parameter that indicates the size of the candidate set during query.
|
||||
* `--index-type`: Specifies the index type. Currently supports `HNSW`, `HNSW_SQ`, and `HNSW_BQ`.
|
||||
|
||||
#### IVF index example
|
||||
|
||||
```bash
|
||||
vectordbbench oceanbaseivf --host $host --port $port --user $user --database test --nlist 1000 --sample_per_nlist 256 --ivf_nprobes 100 --case-type Performance768D1M --index-type IVF_FLAT
|
||||
```
|
||||
|
||||
Commonly used options are described as follows:
|
||||
|
||||
* `--sample_per_nlist`: the number of data points sampled per cluster center. The default value is `256`.

* `--ivf_nprobes`: the number of nearest cluster centers to search during a vector index query. The default value is `8`. A larger value yields a higher recall rate but also a longer search time.

* `--index-type`: specifies the index type. Currently, only `IVF_FLAT` is supported.
|
||||
|
||||
For more information about the parameters, run the following command:
|
||||
|
||||
```bash
|
||||
vectordbbench oceanbaseivf --help
|
||||
```
|
||||
|
||||
## FAQs
|
||||
|
||||
### Is it normal for the first test execution to be slow?
|
||||
|
||||
The first test execution requires downloading the required dataset from AWS S3 storage, which may take relatively longer. This is normal.
|
||||
|
||||
### Can I customize and modify the test code?
|
||||
|
||||
Yes, you can. If you customize and modify the test code, you need to run `pip install .` again before running the test.
|
||||
@@ -0,0 +1,61 @@
|
||||
---
|
||||
|
||||
slug: /vector-data-type
|
||||
---
|
||||
|
||||
# Overview of vector data types
|
||||
|
||||
seekdb provides vector data types to support AI vector search applications. By using vector data types, you can store and query an array of floating-point numbers, such as `[0.1, 0.3, -0.9, ...]`. Before using vector data, you need to be aware of the following:
|
||||
|
||||
* Both dense and sparse vector data are supported, and all data elements must be single-precision floating-point numbers.
|
||||
|
||||
* Element values in vector data cannot be NaN (not a number) or Inf (infinity); otherwise, a runtime error will be thrown.
|
||||
|
||||
* You must specify the vector dimension when creating a vector column, for example, `VECTOR(3)`.
|
||||
|
||||
* Creating dense/sparse vector indexes is supported. For details, see [vector index](../200.vector-index/200.dense-vector-index.md).
|
||||
|
||||
* Vector data in seekdb is stored in array form.
|
||||
|
||||
* Both dense and sparse vectors support [hybrid search](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001970893).
|
||||
|
||||
## Syntax
|
||||
|
||||
A dense vector value can contain any number of floating-point numbers up to 16,000. The syntax is as follows:
|
||||
|
||||
```sql
|
||||
-- Dense vector
|
||||
'[<float>, <float>, ...]'
|
||||
```
|
||||
|
||||
A sparse vector is based on the MAP data type and contains unordered key-value pairs. The syntax is as follows:
|
||||
|
||||
```sql
|
||||
-- Sparse vector
|
||||
'{<uint:float>, <uint:float>...}'
|
||||
```
|
||||
|
||||
Examples of creating vector columns and indexes are as follows:
|
||||
|
||||
```sql
|
||||
-- Create a dense vector column and index
|
||||
CREATE TABLE t1(
|
||||
c1 INT,
|
||||
c2 VECTOR(3),
|
||||
PRIMARY KEY(c1),
|
||||
VECTOR INDEX idx1(c2) WITH (distance=L2, type=hnsw)
|
||||
);
|
||||
```
|
||||
|
||||
```sql
|
||||
-- Create a sparse vector column
|
||||
CREATE TABLE t2 (
|
||||
c1 INT,
|
||||
c2 SPARSEVECTOR
|
||||
);
|
||||
```
|
||||
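
As a minimal sketch of the two literal formats, the following inserts target the tables `t1` and `t2` defined above (the specific values and sparse keys are illustrative assumptions):

```sql
-- Dense vector literal: a bracketed list of floats matching the declared dimension
INSERT INTO t1(c1, c2) VALUES(1, '[0.1, 0.3, -0.9]');
-- Sparse vector literal: unordered uint:float key-value pairs
INSERT INTO t2(c1, c2) VALUES(1, '{0:0.5, 7:1.25}');
```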
|
||||
## References
|
||||
|
||||
* [Create vector indexes](../200.vector-index/200.dense-vector-index.md)
|
||||
* [Use SQL functions](../250.vector-function.md)
|
||||
@@ -0,0 +1,391 @@
|
||||
---
|
||||
|
||||
slug: /vector-sdk-refer
|
||||
---
|
||||
|
||||
# Compatibility
|
||||
|
||||
This topic describes the data model mappings, SDK interface compatibility, and concept mappings between seekdb's vector search feature and Milvus.
|
||||
|
||||
## Concept mappings
|
||||
|
||||
To help users familiar with Milvus quickly get started with seekdb's vector storage capabilities, we analyze the similarities and differences between the two systems and provide a mapping of related concepts.
|
||||
|
||||
### Data models
|
||||
|
||||
| **Data model layer** | **Milvus** | **seekdb** | **Description** |
|
||||
|------------|---------|-----------|-----------|
|
||||
| First layer | Shards | Partition | Milvus specifies partition rules by setting some columns as `partition_key` in the schema definition.<br/>seekdb supports range/range columns, list/list columns, hash, key, and subpartitioning strategies. |
|
||||
| Second layer | Partitions | ≈Tablet | Milvus enhances read performance by chunking the same shard (shards are usually partitioned by primary key) based on other columns.<br />seekdb implements this by sorting keys within a partition. |
|
||||
| Third layer | Segments | MemTable+SSTable | Both have a minor compaction mechanism. |
|
||||
|
||||
### SDKs
|
||||
|
||||
This section introduces the conceptual differences between seekdb's vector storage SDK (pyobvector) and Milvus's SDK (pymilvus).
|
||||
|
||||
pyobvector supports two usage modes:
|
||||
|
||||
1. pymilvus MilvusClient lightweight compatible mode: This mode is compatible with common interfaces of Milvus clients. Users familiar with Milvus can easily use this mode without concept mapping.
|
||||
|
||||
2. SQLAlchemy extension mode: This mode can be used as a vector feature extension of Python SQLAlchemy, retaining the operation mode of a relational database. Concept mapping is required.
|
||||
|
||||
For more information about pyobvector's APIs, see [pyobvector Python SDK API reference](900.vector-search-supported-clients-and-languages/200.vector-pyobvector.md).
|
||||
|
||||
The following table describes the concept mappings between pyobvector's SQLAlchemy extension mode and pymilvus:
|
||||
|
||||
| **pymilvus** | **pyobvector** | **Description** |
|
||||
|---------|------------|---------------|
|
||||
| Database | Database | Database |
|
||||
| Collection | Table | Table |
|
||||
| Field | Column | Column |
|
||||
| Primary Key | Primary Key | Primary key |
|
||||
| Vector Field | Vector Column | Vector column |
|
||||
| Index | Index | Index |
|
||||
| Partition | Partition | Partition |
|
||||
| DataType | DataType | Data type |
|
||||
| Metric Type | Distance Function | Distance function |
|
||||
| Search | Query | Query |
|
||||
| Insert | Insert | Insert |
|
||||
| Delete | Delete | Delete |
|
||||
| Update | Update | Update |
|
||||
| Batch | Batch | Batch operations |
|
||||
| Transaction | Transaction | Transaction |
|
||||
| NONE | Not supported | NULL value |
|
||||
| BOOL | Boolean | Corresponds to the MySQL TINYINT type |
|
||||
| INT8 | Boolean | Corresponds to the MySQL TINYINT type |
|
||||
| INT16 | SmallInteger | Corresponds to the MySQL SMALLINT type |
|
||||
| INT32 | Integer | Corresponds to the MySQL INT type |
|
||||
| INT64 | BigInteger | Corresponds to the MySQL BIGINT type |
|
||||
| FLOAT | Float | Corresponds to the MySQL FLOAT type |
|
||||
| DOUBLE | Double | Corresponds to the MySQL DOUBLE type |
|
||||
| STRING | LONGTEXT | Corresponds to the MySQL LONGTEXT type |
|
||||
| VARCHAR | STRING | Corresponds to the MySQL VARCHAR type |
|
||||
| JSON | JSON | For differences and similarities in JSON operations, see [pyobvector Python SDK API reference](900.vector-search-supported-clients-and-languages/200.vector-pyobvector.md). |
|
||||
| FLOAT_VECTOR | VECTOR | Vector type |
|
||||
| BINARY_VECTOR | Not supported | |
|
||||
| FLOAT16_VECTOR | Not supported | |
|
||||
| BFLOAT16_VECTOR | Not supported | |
|
||||
| SPARSE_FLOAT_VECTOR | Not supported | |
|
||||
| dynamic_field | Not needed | The hidden `$meta` metadata column in Milvus.<br/>In seekdb, you can explicitly create a JSON-type column. |
|
||||
|
||||
## Compatibility with Milvus
|
||||
|
||||
### Milvus SDK
|
||||
|
||||
All operations listed in the following tables are supported. Note that `load_collection()`, `release_collection()`, and `close()` are supported through SQLAlchemy.
|
||||
|
||||
**Collection operations**
|
||||
|
||||
| **Interface** | **Description** |
|
||||
|---|---|
|
||||
| create_collection() | Creates a vector table based on the given schema. |
|
||||
| get_collection_stats() | Queries table statistics, such as the number of rows. |
|
||||
| describe_collection() | Provides detailed metadata of a vector table. |
|
||||
| has_collection() | Checks whether a table exists. |
|
||||
| list_collections() | Lists existing tables. |
|
||||
| drop_collection() | Drops a table. |
|
||||
|
||||
**Field and schema definition**
|
||||
|
||||
| **Interface** | **Description** |
|
||||
|---|---|
|
||||
| create_schema() | Creates a schema in memory and adds column definitions. |
|
||||
| add_field() | The call sequence is: create_schema->add_field->...->add_field<br/>You can also manually build a FieldSchema list and then use the CollectionSchema constructor to create a schema. |
|
||||
|
||||
**Vector indexes**
|
||||
|
||||
| **Interface** | **Description** |
|
||||
|---|---|
|
||||
| list_indexes() | Lists all indexes. |
|
||||
| create_index() | Supports creating multiple vector indexes in a single call. First, use prepare_index_params to initialize an index parameter list object, call add_index multiple times to set multiple index parameters, and finally call create_index to create the indexes. |
|
||||
| drop_index() | Drops a vector index. |
|
||||
| describe_index() | Gets the metadata (schema) of an index. |
|
||||
|
||||
**Vector data operations**
|
||||
|
||||
| **Interface** | **Description** |
|
||||
|---|---|
|
||||
| search() | ANN query interface:<ul><li>collection_name: the table name</li><li>data: the query vectors</li><li>filter: filtering operation, equivalent to `WHERE`</li><li>limit: top K</li><li>output_fields: projected columns, equivalent to `SELECT`</li><li>partition_names: partition names (not supported in Milvus Lite)</li><li>anns_field: the index column name</li><li>search_params: vector distance function name and index algorithm-related parameters</li></ul> |
|
||||
| query() | Point query with filter, namely `SELECT ... WHERE ids IN (..., ...) AND <filters>`. |
|
||||
| get() | Point query without filter, namely `SELECT ... WHERE ids IN (..., ...)`. |
|
||||
| delete() | Deletes a group of vectors, `DELETE FROM ... WHERE ids IN (..., ...)`. |
|
||||
| insert() | Inserts a group of vectors. |
|
||||
| upsert() | Insert with update on primary key conflict. |
|
||||
|
||||
**Collection metadata synchronization**
|
||||
|
||||
| **Interface** | **Description** |
|
||||
|---|---|
|
||||
| load_collection() | Loads the table structure from the database to the Python application memory, enabling the application to operate the database table in an object-oriented manner. This is a standard feature of an object-relational mapping (ORM) framework. |
|
||||
| release_collection() | Releases the loaded table structure from the Python application memory and releases related resources. This is a standard feature of an ORM framework for memory management. |
|
||||
| close() | Closes the database connection and releases related resources. This is a standard feature of an ORM framework. |
|
||||
|
||||
### pymilvus
|
||||
|
||||
#### Data model
|
||||
|
||||
The data model of Milvus comprises three levels: Shards->Partitions->Segments. Compatibility with seekdb is described as follows:
|
||||
|
||||
* Shards correspond to seekdb's Partition concept.
|
||||
|
||||
* Partitions currently have no corresponding concept in seekdb.
|
||||
|
||||
* Milvus allows you to partition a shard into blocks by other columns to improve read performance (shards are usually partitioned by primary key). seekdb implements this by sorting by primary key within a partition.
|
||||
|
||||
* Segments are similar to [MemTable](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001973721) + [SSTable](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001973722).
|
||||
|
||||
#### Milvus Lite API compatibility
|
||||
|
||||
##### collection operations
|
||||
|
||||
1. Milvus create_collection():
|
||||
|
||||
```python
|
||||
create_collection(
|
||||
collection_name: str,
|
||||
dimension: int,
|
||||
primary_field_name: str = "id",
|
||||
id_type: str = DataType,
|
||||
vector_field_name: str = "vector",
|
||||
metric_type: str = "COSINE",
|
||||
auto_id: bool = False,
|
||||
timeout: Optional[float] = None,
|
||||
schema: Optional[CollectionSchema] = None, # Used for custom setup
|
||||
index_params: Optional[IndexParams] = None, # Used for custom setup
|
||||
**kwargs,
|
||||
) -> None
|
||||
```
|
||||
|
||||
seekdb compatibility is described as follows:
|
||||
|
||||
* collection_name: compatible, corresponds to table_name.
|
||||
|
||||
* dimension: compatible, vector(dim).
|
||||
|
||||
* primary_field_name: compatible, the primary key column name.
|
||||
|
||||
* id_type: compatible, the primary key column type.
|
||||
|
||||
* vector_field_name: compatible, the vector column name.
|
||||
|
||||
* auto_id: compatible, auto increment.
|
||||
|
||||
* timeout: compatible, seekdb supports it through hint.
|
||||
|
||||
* schema: compatible.
|
||||
|
||||
* index_params: compatible.
|
||||
|
||||
2. Milvus get_collection_stats():
|
||||
|
||||
```python
|
||||
get_collection_stats(
|
||||
collection_name: str,
|
||||
timeout: Optional[float] = None
|
||||
) -> Dict
|
||||
```
|
||||
|
||||
seekdb compatibility is described as follows:
|
||||
|
||||
* API is compatible.
|
||||
|
||||
* Return value is compatible: `{ 'row_count': ... }`.
|
||||
|
||||
3. Milvus has_collection():
|
||||
|
||||
```python
|
||||
has_collection(
|
||||
collection_name: str,
|
||||
timeout: Optional[float] = None
|
||||
) -> Bool
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus has_collection().
|
||||
|
||||
4. Milvus drop_collection():
|
||||
|
||||
```python
|
||||
drop_collection(collection_name: str) -> None
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus drop_collection().
|
||||
|
||||
5. Milvus rename_collection():
|
||||
|
||||
```python
|
||||
rename_collection(
|
||||
old_name: str,
|
||||
new_name: str,
|
||||
timeout: Optional[float] = None
|
||||
) -> None
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus rename_collection().
|
||||
|
||||
##### Schema-related
|
||||
|
||||
1. Milvus create_schema():
|
||||
|
||||
```python
|
||||
create_schema(
|
||||
auto_id: bool,
|
||||
enable_dynamic_field: bool,
|
||||
primary_field: str,
|
||||
partition_key_field: str,
|
||||
) -> CollectionSchema
|
||||
```
|
||||
|
||||
seekdb compatibility is described as follows:
|
||||
|
||||
* auto_id: whether the primary key column is auto-increment, compatible.
|
||||
|
||||
* primary_field & partition_key_field: compatible.
|
||||
|
||||
2. Milvus add_field():
|
||||
|
||||
```python
|
||||
add_field(
|
||||
field_name: str,
|
||||
datatype: DataType,
|
||||
is_primary: bool,
|
||||
max_length: int,
|
||||
element_type: str,
|
||||
max_capacity: int,
|
||||
dim: int,
|
||||
is_partition_key: bool,
|
||||
)
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus add_field().
|
||||
|
||||
##### Insert/Search-related
|
||||
|
||||
1. Milvus search():
|
||||
|
||||
```python
|
||||
search(
|
||||
collection_name: str,
|
||||
data: Union[List[list], list],
|
||||
filter: str = "",
|
||||
limit: int = 10,
|
||||
output_fields: Optional[List[str]] = None,
|
||||
search_params: Optional[dict] = None,
|
||||
timeout: Optional[float] = None,
|
||||
partition_names: Optional[List[str]] = None,
|
||||
**kwargs,
|
||||
) -> List[dict]
|
||||
```
|
||||
|
||||
seekdb compatibility is described as follows:
|
||||
|
||||
* filter: string expression. For usage examples, see: [Milvus Filtering Explained](https://milvus.io/docs/boolean.md). It is generally similar to SQL's `WHERE` expression.
|
||||
|
||||
* search_params:
|
||||
|
||||
* metric_type: compatible.
|
||||
|
||||
* radius & range filter: related to RNN, currently not supported.
|
||||
|
||||
* group_by_field: groups ANN results, currently not supported.
|
||||
|
||||
* max_empty_result_buckets: used for IVF series indexes, currently not supported.
|
||||
|
||||
* ignore_growing: skips incremental data and directly reads baseline index, currently not supported.
|
||||
|
||||
* partition_names: partition read, supported.
|
||||
|
||||
* kwargs:
|
||||
|
||||
* offset: the number of records to skip in search results, currently not supported.
|
||||
|
||||
* round_decimal: rounds results to specified decimal places, currently not supported.
|
||||
|
||||
2. Milvus get():
|
||||
|
||||
```python
|
||||
get(
|
||||
collection_name: str,
|
||||
ids: Union[list, str, int],
|
||||
output_fields: Optional[List[str]] = None,
|
||||
timeout: Optional[float] = None,
|
||||
partition_names: Optional[List[str]] = None,
|
||||
**kwargs,
|
||||
) -> List[dict]
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus get().
|
||||
|
||||
3. Milvus delete()
|
||||
|
||||
```python
|
||||
delete(
|
||||
collection_name: str,
|
||||
ids: Optional[Union[list, str, int]] = None,
|
||||
timeout: Optional[float] = None,
|
||||
filter: Optional[str] = "",
|
||||
partition_name: Optional[str] = "",
|
||||
**kwargs,
|
||||
) -> dict
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus delete().
|
||||
|
||||
4. Milvus insert()
|
||||
|
||||
```python
|
||||
insert(
|
||||
collection_name: str,
|
||||
data: Union[Dict, List[Dict]],
|
||||
timeout: Optional[float] = None,
|
||||
partition_name: Optional[str] = "",
|
||||
) -> List[Union[str, int]]
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus insert().
|
||||
|
||||
5. Milvus upsert()
|
||||
|
||||
```python
|
||||
upsert(
|
||||
collection_name: str,
|
||||
data: Union[Dict, List[Dict]],
|
||||
timeout: Optional[float] = None,
|
||||
partition_name: Optional[str] = "",
|
||||
) -> List[Union[str, int]]
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus upsert().
|
||||
|
||||
##### Index-related
|
||||
|
||||
1. Milvus create_index()
|
||||
|
||||
```python
|
||||
create_index(
|
||||
collection_name: str,
|
||||
index_params: IndexParams,
|
||||
timeout: Optional[float] = None,
|
||||
**kwargs,
|
||||
)
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus create_index().
|
||||
|
||||
2. Milvus drop_index()
|
||||
|
||||
```python
|
||||
drop_index(
|
||||
collection_name: str,
|
||||
index_name: str,
|
||||
timeout: Optional[float] = None,
|
||||
**kwargs,
|
||||
)
|
||||
```
|
||||
|
||||
seekdb is compatible with Milvus drop_index().
|
||||
|
||||
## Compatibility with MySQL protocol
|
||||
|
||||
* In terms of request initiation: All APIs are implemented through general query SQL, and there are no compatibility issues.
|
||||
|
||||
* In terms of response result set processing: Only processing of new vector data elements needs to be considered. Currently, string and bytes element parsing are supported. Even if the transmission mode of vector data elements changes in the future, compatibility can be achieved by updating the SDK.
|
||||
@@ -0,0 +1,12 @@
|
||||
---
|
||||
|
||||
slug: /vector-search-supported-clients-and-languages-overview
|
||||
---
|
||||
|
||||
# Supported clients and languages for vector search
|
||||
|
||||
| Client/Language | Version |
|
||||
|---|---|
|
||||
| MySQL client | All versions |
|
||||
| Python SDK | 3.9+ |
|
||||
| Java SDK | 1.8 |
|
||||
@@ -0,0 +1,318 @@
|
||||
---
|
||||
|
||||
slug: /vector-pyobvector
|
||||
---
|
||||
|
||||
# pyobvector Python SDK API reference
|
||||
|
||||
pyobvector is the Python SDK for seekdb's vector storage feature. It provides two operating modes:
|
||||
|
||||
* pymilvus-compatible mode: Operates the database using the MilvusLikeClient object, offering commonly used APIs compatible with the lightweight MilvusClient.
|
||||
|
||||
* SQLAlchemy extension mode: Operates the database using the ObVecClient object, serving as an extension of Python's SDK for relational databases.
|
||||
|
||||
This topic describes the APIs in the two modes and provides examples.
|
||||
|
||||
## MilvusLikeClient
|
||||
|
||||
### Constructor
|
||||
|
||||
```python
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
uri: str = "127.0.0.1:2881",
|
||||
user: str = "root@test",
|
||||
password: str = "",
|
||||
db_name: str = "test",
|
||||
**kwargs,
|
||||
)
|
||||
```
|
||||
|
||||
### collection-related APIs
|
||||
|
||||
| API | Description | Example |
|
||||
|------|------|------|
|
||||
| `def create_schema(self, **kwargs) -> CollectionSchema:` | <ul>Creates a `CollectionSchema` object.<li>Parameters are optional, allowing the initialization of an empty schema definition.</li><li>Optional parameters include:</li><ul><li>`fields`: A list of `FieldSchema` objects (see the `add_field` interface below for details).</li><li>`partitions`: Partitioning rules (see the section on defining partition rules using `ObPartition`).</li><li>`description`: Compatible with Milvus, but currently has no practical effect in seekdb.</li></ul></ul> | |
|
||||
| `def create_collection(<br/>self,<br/>collection_name: str,<br/>dimension: Optional[int] = None,<br/>primary_field_name: str = "id",<br/>id_type: Union[DataType, str] = DataType.INT64,<br/>vector_field_name: str = "vector",<br/>metric_type: str = "l2",<br/>auto_id: bool = False,<br/>timeout: Optional[float] = None,<br/>schema: Optional[CollectionSchema] = None, # Used for custom setup<br/>index_params: Optional[IndexParams] = None, # Used for custom setup<br/>max_length: int = 16384,<br/>**kwargs,<br/>)` | Creates a table: <ul><li>collection_name: the table name</li><li>dimension: the vector data dimension</li><li>primary_field_name: the primary field name</li><li>id_type: the primary field data type (only supports VARCHAR and INT types)</li><li>vector_field_name: the vector field name</li><li>metric_type: not used in seekdb, but maintained for API compatibility (because the main table definition does not need to specify a vector distance function)</li><li>auto_id: specifies whether the primary field value increases automatically</li><li>timeout: not used in seekdb, but maintained for API compatibility</li><li>schema: the custom collection schema. When `schema` is not None, the parameters from dimension to metric_type will be ignored</li><li>index_params: the custom vector index parameters</li><li>max_length: the maximum varchar length when the primary field data type is VARCHAR and `schema` is not None</li></ul> | `client.create_collection(<br/>collection_name=test_collection_name,<br/>schema=schema,<br/>index_params=idx_params,<br/>)` |
|
||||
| `def get_collection_stats(<br/>self, collection_name: str, timeout: Optional[float] = None # pylint: disable=unused-argument<br/>) -> Dict:` | Queries the record count of a table.<ul><li>collection_name: the table name</li><li>timeout: not used in seekdb, but maintained for API compatibility</li></ul> | |
|
||||
| `def has_collection(self, collection_name: str, timeout: Optional[float] = None) -> bool` | Verifies whether a table exists.<ul><li>collection_name: the table name</li><li>timeout: not used in seekdb, but maintained for API compatibility</li></ul> | |
|
||||
| `def drop_collection(self, collection_name: str) -> None` | Drops a table.<ul><li>collection_name: the table name</li></ul> | |
|
||||
| `def load_table(self, collection_name: str,)` | Reads the metadata of a table to the SQLAlchemy metadata cache.<ul><li>collection_name: the table name</li></ul> | |
|
||||
|
||||
### CollectionSchema & FieldSchema
|
||||
|
||||
MilvusLikeClient describes the schema of a table by using a CollectionSchema. A CollectionSchema contains multiple FieldSchemas, and a FieldSchema describes the column schema of a table.
|
||||
|
||||
#### Create a CollectionSchema by using the create_schema method of the MilvusLikeClient
|
||||
|
||||
```python
|
||||
def __init__(
|
||||
self,
|
||||
fields: Optional[List[FieldSchema]] = None,
|
||||
partitions: Optional[ObPartition] = None,
|
||||
description: str = "", # ignored in oceanbase
|
||||
**kwargs,
|
||||
)
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* fields: an optional parameter that specifies a list of FieldSchema objects.
|
||||
|
||||
* partitions: partition rules (for more information, see the ObPartition section).
|
||||
|
||||
* description: compatible with Milvus, but currently has no practical effect in seekdb.
|
||||
|
||||
#### Create a FieldSchema and register it to a CollectionSchema
|
||||
|
||||
```python
|
||||
def add_field(self, field_name: str, datatype: DataType, **kwargs)
|
||||
```
|
||||
|
||||
* field_name: the column name.
|
||||
|
||||
* datatype: the column data type. For supported data types, see [Compatibility reference](../800.vector-sdk-refer.md).
|
||||
|
||||
* kwargs: additional parameters for configuring column properties, as shown below:
|
||||
|
||||
```python
|
||||
def __init__(
|
||||
self,
|
||||
name: str,
|
||||
dtype: DataType,
|
||||
description: str = "",
|
||||
is_primary: bool = False,
|
||||
auto_id: bool = False,
|
||||
nullable: bool = False,
|
||||
**kwargs,
|
||||
)
|
||||
```
|
||||
|
||||
The parameters are described as follows:
|
||||
|
||||
* is_primary: specifies whether the column is a primary key.
|
||||
|
||||
* auto_id: specifies whether the column value increases automatically.
|
||||
|
||||
* nullable: specifies whether the column can be null.
|
||||
|
||||
#### Example
|
||||
|
||||
```python
|
||||
schema = self.client.create_schema()
|
||||
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
|
||||
schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=512)
|
||||
schema.add_field(
|
||||
field_name="title_vector", datatype=DataType.FLOAT_VECTOR, dim=768
|
||||
)
|
||||
schema.add_field(field_name="link", datatype=DataType.VARCHAR, max_length=512)
|
||||
schema.add_field(field_name="reading_time", datatype=DataType.INT64)
|
||||
schema.add_field(
|
||||
field_name="publication", datatype=DataType.VARCHAR, max_length=512
|
||||
)
|
||||
schema.add_field(field_name="claps", datatype=DataType.INT64)
|
||||
schema.add_field(field_name="responses", datatype=DataType.INT64)
|
||||
|
||||
self.client.create_collection(
|
||||
collection_name="medium_articles_2020", schema=schema
|
||||
)
|
||||
```
|
||||
|
||||
### Index-related APIs
|
||||
|
||||
| API | Description | Example/Remarks |
|
||||
|-----|-----|-----|
|
||||
| `def create_index(<br/>self,<br/>collection_name: str,<br/>index_params: IndexParams,<br/>timeout: Optional[float] = None,<br/>**kwargs,<br/>)` | Creates a vector index table based on the constructed IndexParams (for more information about how to use IndexParams, see the prepare_index_params and add_index APIs).<ul><li>collection_name: the table name</li><li>index_params: the index parameters</li><li>timeout: not used in seekdb, but maintained for API compatibility</li><li>kwargs: other parameters, currently not used, maintained for compatibility</li></ul> | |
|
||||
| `def drop_index(<br/>self,<br/>collection_name: str,<br/>index_name: str,<br/>timeout: Optional[float] = None,<br/>**kwargs,<br/>)` | Drops an index table.<ul><li>collection_name: the table name</li><li>index_name: the index name</li></ul> | |
|
||||
| `def refresh_index(<br/>self,<br/>collection_name: str,<br/>index_name: str,<br/>trigger_threshold: int = 10000,<br/>)` | Refreshes a vector index table to improve read performance. It can be understood as a process of moving incremental data.<ul><li>collection_name: the table name</li><li>index_name: the index name</li><li>trigger_threshold: the trigger threshold of the refresh action. A refresh is triggered when the data volume of the index table exceeds the threshold.</li></ul> | An API introduced by seekdb.<br/>Not compatible with Milvus. |
|
||||
| `def rebuild_index(<br/>self,<br/>collection_name: str,<br/>index_name: str,<br/>trigger_threshold: float = 0.2,<br/>)` | Rebuilds a vector index table to improve read performance. It can be understood as a process of merging incremental data into baseline index data.<ul><li>collection_name: the table name</li><li>index_name: the index name</li><li>trigger_threshold: the trigger threshold of the rebuild action. The value range is 0 to 1. A rebuild is triggered when the proportion of incremental data to full data reaches the threshold.</li></ul> | An API introduced by seekdb.<br/>Not compatible with Milvus. |
|
||||
| `def search(<br/>self,<br/>collection_name: str,<br/>data: list,<br/>anns_field: str,<br/>with_dist: bool = False,<br/>filter=None,limit: int = 10,output_fields: Optional[List[str]] = None,<br/>search_params: Optional[dict] = None,<br/>timeout: Optional[float] = None,<br/>partition_names: Optional[List[str]] = None,<br/>**kwargs,<br/>) -> List[dict]` | Executes a vector approximate nearest neighbor search.<ul><li>collection_name: the table name</li><li>data: the vector data to be searched</li><li>anns_field: the name of the vector column to be searched</li><li>with_dist: specifies whether to return results with vector distances</li><li>filter: uses vector approximate nearest neighbor search with filter conditions</li><li>limit: top K</li><li>output_fields: the output columns (also known as projection columns)</li><li>search_params: supports only the `metric_type` value of `l2`/`neg_ip` (`for example: search_params = {"metric_type": "neg_ip"}`)</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_names: limits the query to some partitions</li></ul><ul>Return value:<br/>A list of records, where each record is a dictionary<br/>representing a mapping from column_name to column values.</ul> | `res = self.client.search(<br/>collection_name=test_collection_name,<br/>data=[0, 0, 1],<br/>anns_field="embedding",<br/>limit=5,<br/>output_fields=["id"],<br/>search_params={"metric_type": "neg_ip"}<br/>)<br/>self.assertEqual(<br/> set([r['id'] for r in res]), set([12, 111, 11, 112, 10]))` |
|
||||
| `def query(<br/>self,<br/>collection_name: str,<br/>flter=None,<br/>output_fields: Optional[List[str]] = None,<br/>timeout: Optional[float] = None,<br/>partition_names: Optional[List[str]] = None,<br/>**kwargs,<br/>) -> List[dict]` | Reads data records using the specified filter condition.<ul><li>collection_name: the table name</li><li>flter: uses vector approximate nearest neighbor search with filter conditions</li><li>output_fields: the output columns (also known as projection columns)</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_names: limits the query to some partitions</li></ul><ul>Return value:<br/>A list of records, where each record is a dictionary<br/>representing a mapping from column_name to column values.</ul> | `table = self.client.load_table(collection_name=test_collection_name)<br/>where_clause = [table.c["id"] < 100]<br/>res = self.client.query(<br/> collection_name=test_collection_name,<br/> output_fields=["id"],<br/> flter=where_clause,<br/>)` |
|
||||
| `def get(<br/>self,<br/>collection_name: str,<br/>ids: Union[list, str, int],<br/>output_fields: Optional[List[str]] = None,<br/>timeout: Optional[float] = None,<br/>partition_names: Optional[List[str]] = None,<br/>**kwargs,<br/>) -> List[dict]` | Retrieves records based on the specified primary keys `ids`:<ul><li>collection_name: the table name</li><li>ids: a single ID or a list of IDs. Note: The ids parameter of MilvusLikeClient get interface is different from ObVecClient get. For details, see <a href="#DML%20operations">ObVecClient get</a></li><li>output_fields: the output columns (also known as projection columns)</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_names: limits the query to some partitions</li></ul>Return value:<br/>A list of records, where each record is a dictionary<br/>representing a mapping from column_name to column values. | `res = self.client.get(<br/> collection_name=test_collection_name,<br/> output_fields=["id", "meta"],<br/> ids=[80, 12, 112],<br/>)` |
|
||||
| `def delete(<br/>self,<br/>collection_name: str,<br/>ids: Optional[Union[list, str, int]] = None,<br/>timeout: Optional[float] = None, # pylint: disable=unused-argument<br/>flter=None,<br/>partition_name: Optional[str] = "",<br/>**kwargs, # pylint: disable=unused-argument<br/>)` | Deletes data in a collection.<ul><li>collection_name: the table name</li><li>ids: a single ID or a list of IDs</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>flter: uses vector approximate nearest neighbor search with filter conditions</li><li>partition_name: limits the deletion operation to a partition</li></ul> | `self.client.delete(<br/> collection_name=test_collection_name, ids=[12, 112], partition_name="p0"<br/>)` |
|
||||
| `def insert(<br/> self, <br/> collection_name: str, <br/> data: Union[Dict, List[Dict]], <br/> timeout: Optional[float] = None, <br/> partition_name: Optional[str] = ""<br/>)` | Inserts data into a table.<ul><li>collection_name: the table name</li><li>data: the data to be inserted, described in Key-Value form</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_name: limits the insertion operation to a partition</li></ul> | `data = [<br/> {"id": 12, "embedding": [1, 2, 3], "meta": {"doc": "document 1"}},<br/> {<br/> "id": 90,<br/> "embedding": [0.13, 0.123, 1.213],<br/> "meta": {"doc": "document 1"},<br/> },<br/> {"id": 112, "embedding": [1, 2, 3], "meta": None},<br/> {"id": 190, "embedding": [0.13, 0.123, 1.213], "meta": None},<br/>]<br/>self.client.insert(collection_name=test_collection_name, data=data)` |
|
||||
| `def upsert(<br/>self,<br/>collection_name: str,<br/>data: Union[Dict, List[Dict]],<br/>timeout: Optional[float] = None, # pylint: disable=unused-argument<br/>partition_name: Optional[str] = "",<br/>) -> List[Union[str, int]]` | Updates data in a table. If a primary key already exists, updates the corresponding record; otherwise, inserts a new record.<ul><li>collection_name: the table name</li><li>data: the data to be inserted or updated, in the same format as the insert interface</li><li>timeout: not used in seekdb, maintained for compatibility only</li><li>partition_name: limits the operation to a specified partition</li></ul> | `data = [<br/> {"id": 112, "embedding": [1, 2, 3], "meta": {'doc':'hhh1'}},<br/> {"id": 190, "embedding": [0.13, 0.123, 1.213], "meta": {'doc':'hhh2'}},<br/>]<br/>self.client.upsert(collection_name=test_collection_name, data=data)` |
|
||||
| `def perform_raw_text_sql(self, text_sql: str):<br/> return super().perform_raw_text_sql(text_sql)` | Executes an SQL statement directly.<ul><li>text_sql: the SQL statement to be executed</li></ul>Return value:<br/>Returns an iterator that provides result sets from SQLAlchemy. | |
|
||||
|
||||
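The refresh and rebuild APIs above have no Milvus counterpart; as a brief illustration, a hedged sketch of periodic index maintenance (the client, collection, and index names reuse the earlier example and are assumptions):

```python
# Move incremental data into the vector index once it exceeds 10,000 rows.
self.client.refresh_index(
    collection_name="medium_articles_2020",
    index_name="vidx_title_vector",
    trigger_threshold=10000,
)

# Merge incremental data into the baseline index once it reaches 20% of the total.
self.client.rebuild_index(
    collection_name="medium_articles_2020",
    index_name="vidx_title_vector",
    trigger_threshold=0.2,
)
```
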
## ObVecClient

### Constructor

```python
def __init__(
    self,
    uri: str = "127.0.0.1:2881",
    user: str = "root@test",
    password: str = "",
    db_name: str = "test",
    **kwargs,
)
```

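A minimal construction sketch for illustration (the constructor defaults shown above are used; the table name `t1` is hypothetical):

```python
from pyobvector import ObVecClient

# Connect using the constructor defaults shown above.
client = ObVecClient(
    uri="127.0.0.1:2881",
    user="root@test",
    password="",
    db_name="test",
)

# A quick sanity check with an API from the table below.
print(client.check_table_exists("t1"))
```
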
### Table mode-related operations

| API | Description | Example/Remarks |
|-----|-----|-----|
| `def check_table_exists(self, table_name: str)` | Checks whether a table exists.<ul><li>table_name: the table name</li></ul> | |
| `def create_table(<br/>self,<br/>table_name: str,<br/>columns: List[Column],<br/>indexes: Optional[List[Index]] = None,<br/>partitions: Optional[ObPartition] = None,<br/>)` | Creates a table.<ul><li>table_name: the table name</li><li>columns: the column schema of the table, defined using SQLAlchemy</li><li>indexes: a set of index schemas, defined using SQLAlchemy</li><li>partitions: optional partition rules (see the section on using ObPartition to define partition rules)</li></ul> | |
| `@classmethod<br/>def prepare_index_params(cls)` | Creates an IndexParams object to record the schema definition of a vector index table.<br/>`class IndexParams:<br/> """Vector index parameters for MilvusLikeClient"""<br/> def __init__(self):<br/> self._indexes = {}`<br/>The definition of IndexParams is very simple, with only one dictionary member internally<br/>that stores a mapping from a tuple of (column name, index name) to an IndexParam structure.<br/>The constructor of the IndexParam class is: `def __init__(<br/> self,<br/> index_name: str,<br/> field_name: str,<br/> index_type: Union[VecIndexType, str],<br/> **kwargs<br/>)`<ul><li>index_name: the vector index table name</li><li>field_name: the vector column name</li><li>index_type: an enumerated class for vector index algorithm types. Currently, only HNSW is supported.</li></ul>After obtaining an IndexParams by calling `prepare_index_params`, you can register an IndexParam using the `add_index` interface: `def add_index(<br/> self,<br/> field_name: str,<br/> index_type: VecIndexType,<br/> index_name: str,<br/> **kwargs<br/>)`<br/>The parameter meanings are the same as those in the IndexParam constructor. | Here is a usage example for creating a vector index: `idx_params = self.client.prepare_index_params()<br/>idx_params.add_index(<br/> field_name="title_vector",<br/> index_type="HNSW",<br/> index_name="vidx_title_vector",<br/> metric_type="L2",<br/> params={"M": 16, "efConstruction": 256},<br/>)<br/>self.client.create_collection(<br/> collection_name=test_collection_name,<br/> schema=schema,<br/> index_params=idx_params,<br/>)`<br/>Note that the `prepare_index_params` function is recommended for use in MilvusLikeClient, not in ObVecClient. In ObVecClient mode, you should use the `create_index` interface to define a vector index table. (For details, see the create_index interface.) |
| `def create_table_with_index_params(<br/>self,<br/>table_name: str,<br/>columns: List[Column],<br/>indexes: Optional[List[Index]] = None,<br/>vidxs: Optional[IndexParams] = None,<br/>partitions: Optional[ObPartition] = None,<br/>)` | Creates a table and a vector index at the same time using optional index_params.<ul><li>table_name: the table name</li><li>columns: the column schema of the table, defined using SQLAlchemy</li><li>indexes: a set of index schemas, defined using SQLAlchemy</li><li>vidxs: the vector index schema, specified using IndexParams</li><li>partitions: optional partition rules (see the section on using ObPartition to define partition rules)</li></ul> | Recommended for use in MilvusLikeClient, not recommended for use in ObVecClient |
| `def create_index(<br/>self,<br/>table_name: str,<br/>is_vec_index: bool,<br/>index_name: str,<br/>column_names: List[str],<br/>vidx_params: Optional[str] = None,<br/>**kw,<br/>)` | Supports creating both normal indexes and vector indexes.<ul><li>table_name: the table name</li><li>is_vec_index: specifies whether to create a normal index or a vector index</li><li>index_name: the index name</li><li>column_names: the columns on which to create the index</li><li>vidx_params: the vector index parameters, for example: `"distance=l2, type=hnsw, lib=vsag"`</li></ul>Currently, seekdb supports only `type=hnsw` and `lib=vsag`. Please retain these settings. The distance can be set to `l2` or `inner_product`. | `self.client.create_index(<br/> test_collection_name,<br/> is_vec_index=True,<br/> index_name="vidx",<br/> column_names=["embedding"],<br/> vidx_params="distance=l2, type=hnsw, lib=vsag",<br/>)` |
| `def create_vidx_with_vec_index_param(<br/>self,<br/>table_name: str,<br/>vidx_param: IndexParam,<br/>)` | Creates a vector index using vector index parameters.<ul><li>table_name: the table name</li><li>vidx_param: the vector index parameters constructed using IndexParam</li></ul> | |
| `def drop_table_if_exist(self, table_name: str)` | Drops a table.<ul><li>table_name: the table name</li></ul> | |
| `def drop_index(self, table_name: str, index_name: str)` | Drops an index.<ul><li>table_name: the table name</li><li>index_name: the index name</li></ul> | |
| `def refresh_index(<br/>self,<br/>table_name: str,<br/>index_name: str,<br/>trigger_threshold: int = 10000,<br/>)` | Refreshes a vector index table to improve read performance. It can be understood as a process of moving incremental data.<ul><li>table_name: the table name</li><li>index_name: the index name</li><li>trigger_threshold: the trigger threshold of the refresh action. A refresh is triggered when the data volume of the index table exceeds the threshold.</li></ul> | |
| `def rebuild_index(<br/>self,<br/>table_name: str,<br/>index_name: str,<br/>trigger_threshold: float = 0.2,<br/>)` | Rebuilds a vector index table to improve read performance. It can be understood as a process of merging incremental data into baseline index data.<ul><li>table_name: the table name</li><li>index_name: the index name</li><li>trigger_threshold: the trigger threshold of the rebuild action. The value range is 0 to 1. A rebuild is triggered when the proportion of incremental data to full data reaches the threshold.</li></ul> | |

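To tie the table-mode APIs together, here is a hedged end-to-end sketch of creating a table and an HNSW vector index with ObVecClient. It assumes pyobvector exports a SQLAlchemy `VECTOR` column type (check your pyobvector version); the table and column names are illustrative:

```python
from sqlalchemy import Column, Integer
from pyobvector import ObVecClient, VECTOR  # VECTOR column type assumed to be exported

client = ObVecClient(uri="127.0.0.1:2881", user="root@test")

# Define the table schema with SQLAlchemy columns.
client.create_table(
    "demo_vec_t",
    columns=[
        Column("id", Integer, primary_key=True),
        Column("embedding", VECTOR(3)),
    ],
)

# Create an HNSW vector index on the embedding column.
client.create_index(
    "demo_vec_t",
    is_vec_index=True,
    index_name="vidx",
    column_names=["embedding"],
    vidx_params="distance=l2, type=hnsw, lib=vsag",
)

# Optionally move incremental data into the index once it accumulates.
client.refresh_index("demo_vec_t", "vidx", trigger_threshold=10000)
```
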
### DML operations

| API | Description | Example/Remarks |
|-----|-----|-----|
| `def insert(<br/>self,<br/>table_name: str,<br/>data: Union[Dict, List[Dict]],<br/>partition_name: Optional[str] = "",<br/>)` | Inserts data into a table.<ul><li>table_name: the table name</li><li>data: the data to be inserted, described in Key-Value form</li><li>partition_name: limits the insertion operation to a partition</li></ul> | `vector_value1 = [0.748479, 0.276979, 0.555195]<br/>vector_value2 = [0, 0, 0]<br/>data1 = [{"id": i, "embedding": vector_value1} for i in range(10)]<br/>data1.extend([{"id": i, "embedding": vector_value2} for i in range(10, 13)])<br/>data1.extend([{"id": i, "embedding": vector_value2} for i in range(111, 113)])<br/>self.client.insert(test_collection_name, data=data1)` |
| `def upsert(<br/>self,<br/>table_name: str,<br/>data: Union[Dict, List[Dict]],<br/>partition_name: Optional[str] = "",<br/>)` | Inserts or updates data in a table. If a primary key already exists, updates the corresponding record; otherwise, inserts a new record.<ul><li>table_name: the table name</li><li>data: the data to be inserted or updated, in Key-Value format</li><li>partition_name: limits the operation to a specified partition</li></ul> | |
| `def update(<br/>self,<br/>table_name: str,<br/>values_clause,<br/>where_clause=None,<br/>partition_name: Optional[str] = "",<br/>)` | Updates the rows in a table that match the specified condition.<ul><li>table_name: the table name</li><li>values_clause: the values of the columns to be updated</li><li>where_clause: the condition for updating</li><li>partition_name: limits the update operation to some partitions</li></ul> | `data = [<br/> {"id": 112, "embedding": [1, 2, 3], "meta": {'doc':'hhh1'}},<br/> {"id": 190, "embedding": [0.13, 0.123, 1.213], "meta": {'doc':'hhh2'}},<br/>]<br/>client.insert(table_name=test_collection_name, data=data)<br/>client.update(<br/> table_name=test_collection_name,<br/> values_clause=[{'meta':{'doc':'HHH'}}],<br/> where_clause=[text("id=112")]<br/>)` |
| `def delete(<br/>self,<br/>table_name: str,<br/>ids: Optional[Union[list, str, int]] = None,<br/>where_clause=None,<br/>partition_name: Optional[str] = "",<br/>)` | Deletes data from a table.<ul><li>table_name: the table name</li><li>ids: a single ID or a list of IDs</li><li>where_clause: the condition for deletion</li><li>partition_name: limits the deletion operation to some partitions</li></ul> | `self.client.delete(test_collection_name, ids=["bcd", "def"])` |
| `def get(<br/>self,<br/>table_name: str,<br/>ids: Optional[Union[list, str, int]],<br/>where_clause = None,<br/>output_column_name: Optional[List[str]] = None,<br/>partition_names: Optional[List[str]] = None,<br/>)` | Retrieves records based on the specified primary keys `ids`.<ul><li>table_name: the table name</li><li>ids: a single ID or a list of IDs. Optional parameter, can be `ids=None` if not provided. The ids parameter of ObVecClient get interface is different from MilvusLikeClient get. For details, see <a href="#Index-related%20APIs">MilvusLikeClient get</a></li><li>where_clause: the condition for retrieval</li><li>output_column_name: a list of output column or projection column names</li><li>partition_names: limits the retrieval operation to some partitions</li></ul>Return value:<br/>Unlike MilvusLikeClient, the return value of ObVecClient is a tuple list, with each tuple representing a row of records. | `res = self.client.get(<br/> test_collection_name,<br/> ids=["abc", "bcd", "cde", "def"],<br/> where_clause=[text("meta->'$.page' > 1")],<br/> output_column_name=['id']<br/>)` |
| `def set_ob_hnsw_ef_search(self, ob_hnsw_ef_search: int)` | Sets the efSearch parameter of the HNSW index. This is a session-level variable. The larger the value of ef_search, the higher the recall rate but the poorer the query performance. <ul><li>ob_hnsw_ef_search: the efSearch parameter of the HNSW index</li></ul> | |
| `def get_ob_hnsw_ef_search(self) -> int` | Gets the efSearch parameter of the HNSW index. | |
| `def ann_search(<br/>self,<br/>table_name: str,<br/>vec_data: list,<br/>vec_column_name: str,<br/>distance_func,<br/>with_dist: bool = False,<br/>topk: int = 10,<br/>output_column_names: Optional[List[str]] = None,<br/>extra_output_cols: Optional[List] = None,<br/>where_clause=None,<br/>partition_names: Optional[List[str]] = None,<br/>**kwargs,<br/>)` | Executes a vector approximate nearest neighbor search.<ul><li>table_name: the table name</li><li>vec_data: the vector data to be searched</li><li>vec_column_name: the name of the vector column to be searched</li><li>distance_func: the distance function. Provides an extension of SQLAlchemy func, with optional values: `func.l2_distance`/`func.cosine_distance`/`func.inner_product`/`func.negative_inner_product`, representing the l2 distance function, cosine distance function, inner product distance function, and negative inner product distance function, respectively</li><li>with_dist: specifies whether to return results with vector distances</li><li>topk: the number of nearest vectors to retrieve</li><li>output_column_names: a list of output column or projection column names</li><li>extra_output_cols: additional output columns that allow more complex output expressions</li><li>where_clause: the filter condition</li><li>partition_names: limits the query to some partitions</li></ul>Return value:<br/>Unlike MilvusLikeClient, the return value of ObVecClient is a tuple list, with each tuple representing a row of records. | `res = self.client.ann_search(<br/> test_collection_name,<br/> vec_data=[0, 0, 0],<br/> vec_column_name="embedding",<br/> distance_func=func.l2_distance,<br/> with_dist=True,<br/> topk=5,<br/> output_column_names=["id"],<br/>)` |
| `def precise_search(<br/>self,<br/>table_name: str,<br/>vec_data: list,<br/>vec_column_name: str,<br/>distance_func,<br/>topk: int = 10,<br/>output_column_names: Optional[List[str]] = None,<br/>where_clause=None,<br/>**kwargs,<br/>)` | Executes a precise neighbor search algorithm.<ul><li>table_name: the table name</li><li>vec_data: the query vector</li><li>vec_column_name: the vector column name</li><li>distance_func: the vector distance function. Provides an extension of SQLAlchemy func, with optional values: func.l2_distance/func.cosine_distance/func.inner_product/func.negative_inner_product, representing the l2 distance function, cosine distance function, inner product distance function, and negative inner product distance function, respectively</li><li>topk: the number of nearest vectors to retrieve</li><li>output_column_names: a list of output column or projection column names</li><li>where_clause: the filter condition</li></ul>Return value:<br/>Unlike MilvusLikeClient, the return value of ObVecClient is a tuple list, with each tuple representing a row of records. | |
| `def perform_raw_text_sql(self, text_sql: str)` | Executes an SQL statement directly.<ul><li>text_sql: the SQL statement to be executed</li></ul>Return value:<br/>Returns an iterator that provides result sets from SQLAlchemy. | |

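As a brief illustration of the session-level HNSW knob together with `ann_search`, a sketch under the assumption that the `demo_vec_t` table from the earlier sketch exists and is populated:

```python
from sqlalchemy import func

# Larger ef_search: higher recall, slower queries (session-level setting).
client.set_ob_hnsw_ef_search(64)
print(client.get_ob_hnsw_ef_search())

res = client.ann_search(
    "demo_vec_t",
    vec_data=[0, 0, 0],
    vec_column_name="embedding",
    distance_func=func.l2_distance,
    with_dist=True,
    topk=5,
    output_column_names=["id"],
)
print(res)  # a list of tuples, one per row
```
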
## Define partitioning rules by using ObPartition

pyobvector supports range/range columns, list/list columns, hash, and key partitioning, as well as the corresponding subpartitioning types:

* ObRangePartition: specifies to perform range partitioning. Set `is_range_columns` to `True` when you construct this object to create range column partitioning.

* ObListPartition: specifies to perform list partitioning. Set `is_list_columns` to `True` when you construct this object to create list column partitioning.

* ObHashPartition: specifies to perform hash partitioning.

* ObKeyPartition: specifies to perform key partitioning.

* ObSubRangePartition: specifies to perform sub-range partitioning. Set `is_range_columns` to `True` when you construct this object to create sub-range column partitioning.

* ObSubListPartition: specifies to perform sub-list partitioning. Set `is_list_columns` to `True` when you construct this object to create sub-list column partitioning.

* ObSubHashPartition: specifies to perform sub-hash partitioning.

* ObSubKeyPartition: specifies to perform sub-key partitioning.

### Example of range partitioning

```python
range_part = ObRangePartition(
    False,
    range_part_infos=[
        RangeListPartInfo("p0", 100),
        RangeListPartInfo("p1", "maxvalue"),
    ],
    range_expr="id",
)
```

### Example of list partitioning

```python
list_part = ObListPartition(
    False,
    list_part_infos=[
        RangeListPartInfo("p0", [1, 2, 3]),
        RangeListPartInfo("p1", [5, 6]),
        RangeListPartInfo("p2", "DEFAULT"),
    ],
    list_expr="col1",
)
```

### Example of hash partitioning

```python
hash_part = ObHashPartition("col1", part_count=60)
```

### Example of multi-level partitioning

```python
# Range column partitioning (is_range_columns=True)
range_columns_part = ObRangePartition(
    True,
    range_part_infos=[
        RangeListPartInfo("p0", 100),
        RangeListPartInfo("p1", 200),
        RangeListPartInfo("p2", 300),
    ],
    col_name_list=["col1"],
)
# Sub-range partitioning
range_sub_part = ObSubRangePartition(
    False,
    range_part_infos=[
        RangeListPartInfo("mp0", 1000),
        RangeListPartInfo("mp1", 2000),
        RangeListPartInfo("mp2", 3000),
    ],
    range_expr="col3",
)
# Attach the subpartition rule to the top-level partition rule.
range_columns_part.add_subpartition(range_sub_part)
```

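A partition rule object is passed to table creation through the `partitions` parameter of `create_table`; a minimal sketch (the table and column definitions are illustrative):

```python
from sqlalchemy import Column, Integer

client.create_table(
    "partitioned_t",
    columns=[Column("id", Integer, primary_key=True)],
    partitions=range_part,  # the ObRangePartition object built above
)
```
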
## Pure SQLAlchemy API mode

If you prefer to use a pure SQLAlchemy API for seekdb's vector retrieval functionality, you can obtain a synchronous database engine in either of the following ways:

* Method 1: Use ObVecClient to create a database engine

```python
from pyobvector import ObVecClient

client = ObVecClient(uri="127.0.0.1:2881", user="test@test")
engine = client.engine
# Proceed to create a session as usual with SQLAlchemy and use its API.
```

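Once you have the engine, standard SQLAlchemy usage applies; for instance (a generic sketch, not specific to seekdb):

```python
from sqlalchemy import text
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
with Session() as session:
    rows = session.execute(text("SELECT 1")).fetchall()
    print(rows)
```
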
* Method 2: Register the pyobvector dialect and call SQLAlchemy's `create_engine` to create a database engine

```python
import pyobvector
from sqlalchemy.dialects import registry
from sqlalchemy import create_engine

uri: str = "127.0.0.1:2881"
user: str = "root@test"
password: str = ""
db_name: str = "test"
registry.register("mysql.oceanbase", "pyobvector.schema.dialect", "OceanBaseDialect")
connection_str = (
    # mysql+oceanbase indicates using the MySQL standard with seekdb's synchronous driver.
    f"mysql+oceanbase://{user}:{password}@{uri}/{db_name}?charset=utf8mb4"
)
engine = create_engine(connection_str)  # pass extra engine options here as needed
# Proceed to create a session as usual with SQLAlchemy and use its API.
```

If you want to use asynchronous APIs of SQLAlchemy, you can use seekdb's asynchronous driver:

```python
import pyobvector
from sqlalchemy.dialects import registry
from sqlalchemy.ext.asyncio import create_async_engine

uri: str = "127.0.0.1:2881"
user: str = "root@test"
password: str = ""
db_name: str = "test"
registry.register("mysql.aoceanbase", "pyobvector", "AsyncOceanBaseDialect")
connection_str = (
    # mysql+aoceanbase indicates using the MySQL standard with seekdb's asynchronous driver.
    f"mysql+aoceanbase://{user}:{password}@{uri}/{db_name}?charset=utf8mb4"
)
engine = create_async_engine(connection_str)
# Proceed to create a session as usual with SQLAlchemy and use its API.
```

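For completeness, a generic sketch of using the asynchronous engine above with SQLAlchemy's asyncio API (nothing seekdb-specific):

```python
import asyncio
from sqlalchemy import text

async def main():
    async with engine.connect() as conn:
        result = await conn.execute(text("SELECT 1"))
        print(result.fetchall())

asyncio.run(main())
```
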
## More examples

For more examples, visit the [pyobvector repository](https://github.com/oceanbase/pyobvector).

@@ -0,0 +1,470 @@
---
slug: /vector-search-java-sdk
---

# Java SDK API reference

obvec_jdbc is a Java SDK specifically designed for seekdb vector storage scenarios and JSON Table virtual table scenarios. This topic explains how to use obvec_jdbc.

## Installation

You can install obvec_jdbc using either of the following methods.

### Maven dependency

Add the obvec_jdbc dependency to the `pom.xml` file of your project.

```xml
<dependency>
    <groupId>com.oceanbase</groupId>
    <artifactId>obvec_jdbc</artifactId>
    <version>1.0.4</version>
</dependency>
```

### Source code installation

1. Install obvec_jdbc.

   ```bash
   # Clone the obvec_jdbc repository.
   git clone https://github.com/oceanbase/obvec_jdbc.git
   # Go to the obvec_jdbc directory.
   cd obvec_jdbc
   # Install obvec_jdbc.
   mvn install
   ```

2. Add the dependency.

   ```xml
   <dependency>
       <groupId>com.oceanbase</groupId>
       <artifactId>obvec_jdbc</artifactId>
       <version>1.0.4</version>
   </dependency>
   ```

## API definition and usage

obvec_jdbc provides the `ObVecClient` object for working with seekdb's vector search features and JSON Table virtual table functionalities.

### Use vector search

#### Create a client

You can use the following interface definition to construct an ObVecClient object:

```java
// uri: the connection string, which contains the address, port, and name of the database to which you want to connect.
// user: the username.
// password: the password.
public ObVecClient(String uri, String user, String password);
```

Here is an example:

```java
import com.oceanbase.obvec_jdbc.ObVecClient;

String uri = "jdbc:oceanbase://127.0.0.1:2881/test";
String user = "root@test";
String password = "";
String tb_name = "JAVA_TEST";

ObVecClient ob = new ObVecClient(uri, user, password);
```

#### ObFieldSchema class

This class is used to define the column schema of a table. The constructor is as follows:

```java
// name: the column name.
// dataType: the data type.
public ObFieldSchema(String name, DataType dataType);
```

The following table describes the data types supported by the class.

| Data type | Description |
|---|---|
| BOOL | Equivalent to TINYINT |
| INT8 | Equivalent to TINYINT |
| INT16 | Equivalent to SMALLINT |
| INT32 | Equivalent to INT |
| INT64 | Equivalent to BIGINT |
| FLOAT | Equivalent to FLOAT |
| DOUBLE | Equivalent to DOUBLE |
| STRING | Equivalent to LONGTEXT |
| VARCHAR | Equivalent to VARCHAR |
| JSON | Equivalent to JSON |
| FLOAT_VECTOR | Equivalent to VECTOR |

:::tip
For more complex types, constraints, and other functionalities, you can use seekdb JDBC's interface directly instead of using obvec_jdbc.
:::

The interface is defined as follows:

| API | Description |
|---|---|
| String getName() | Obtains the column name. |
| ObFieldSchema Name(String name) | Sets the column name and returns the object itself to support chained calls. |
| ObFieldSchema DataType(DataType dataType) | Sets the data type. |
| boolean getIsPrimary() | Returns whether the column is the primary key. |
| ObFieldSchema IsPrimary(boolean isPrimary) | Specifies whether the column is the primary key. |
| ObFieldSchema IsAutoInc(boolean isAutoInc) | Specifies whether the column is auto-increment. <main id="notice" type='notice'><h4>Notice</h4><p>IsAutoInc takes effect only if IsPrimary is true. </p></main> |
| ObFieldSchema IsNullable(boolean isNullable) | Specifies whether the column can contain NULL values. <main id="notice" type='notice'><h4>Notice</h4><p>IsNullable is set to false by default, which is different from the behavior in MySQL. </p></main> |
| ObFieldSchema MaxLength(int maxLength) | Sets the maximum length for the VARCHAR data type. |
| ObFieldSchema Dim(int dim) | Sets the dimension for the VECTOR data type. |

#### IndexParams/IndexParam

IndexParam is used to set the parameters of a single vector index. IndexParams is used to set a group of vector index parameters, which is useful when multiple vector indexes are created on a table.

:::tip
obvec_jdbc supports only the creation of vector indexes. To create other indexes, use seekdb JDBC.
:::

The constructor of IndexParam is as follows:

```java
// vidx_name: the index name.
// vector_field_name: the name of the vector column.
public IndexParam(String vidx_name, String vector_field_name);
```

The interface is defined as follows:

| API | Description |
|---|---|
| IndexParam M(int m) | Sets the maximum number of neighbors for each vector in the HNSW algorithm. |
| IndexParam EfConstruction(int ef_construction) | Sets the maximum number of candidate vectors for search during the construction of the HNSW algorithm. |
| IndexParam EfSearch(int ef_search) | Sets the maximum number of candidate vectors for search in the HNSW algorithm. |
| IndexParam Lib(String lib) | Sets the type of the vector library. |
| IndexParam MetricType(String metric_type) | Sets the type of the vector distance function. |

The constructor of IndexParams is as follows:

```java
public IndexParams();
```

The interface is defined as follows:

| API | Description |
|---|---|
| void addIndex(IndexParam index_param) | Adds an index definition. |

#### ObCollectionSchema class

Creating a table relies on a configured ObCollectionSchema object. Its constructor and interface are described below.

The constructor of ObCollectionSchema is as follows:

```java
public ObCollectionSchema();
```

The interface is defined as follows:

| API | Description |
|---|---|
| void addField(ObFieldSchema field) | Adds a column definition. |
| void setIndexParams(IndexParams index_params) | Sets the vector index parameters of the table. |

#### Drop a table

The method signature is as follows:

```java
// table_name: the name of the target table.
public void dropCollection(String table_name);
```

#### Check whether a table exists

The method signature is as follows:

```java
// table_name: the name of the target table.
public boolean hasCollection(String table_name);
```

#### Create a table

The method signature is as follows:

```java
// table_name: the name of the table to be created.
// collection: an ObCollectionSchema object that specifies the schema of the table.
public void createCollection(String table_name, ObCollectionSchema collection);
```

You can use ObFieldSchema, ObCollectionSchema, and IndexParams to create a table. Here is an example:

```java
import com.oceanbase.obvec_jdbc.DataType;
import com.oceanbase.obvec_jdbc.ObCollectionSchema;
import com.oceanbase.obvec_jdbc.ObFieldSchema;
import com.oceanbase.obvec_jdbc.IndexParam;
import com.oceanbase.obvec_jdbc.IndexParams;

// Define the schema of the table.
ObCollectionSchema collectionSchema = new ObCollectionSchema();
ObFieldSchema c1_field = new ObFieldSchema("c1", DataType.INT32);
c1_field.IsPrimary(true).IsAutoInc(true);
ObFieldSchema c2_field = new ObFieldSchema("c2", DataType.FLOAT_VECTOR);
c2_field.Dim(3).IsNullable(false);
ObFieldSchema c3_field = new ObFieldSchema("c3", DataType.JSON);
c3_field.IsNullable(true);
collectionSchema.addField(c1_field);
collectionSchema.addField(c2_field);
collectionSchema.addField(c3_field);

// Define the index.
IndexParams index_params = new IndexParams();
IndexParam index_param = new IndexParam("vidx1", "c2");
index_params.addIndex(index_param);
collectionSchema.setIndexParams(index_params);

ob.createCollection(tb_name, collectionSchema);
```

#### Create a vector index after table creation

The method signature is as follows:

```java
// table_name: the name of the table.
// index_param: an IndexParam object that specifies the vector index parameters of the table.
public void createIndex(String table_name, IndexParam index_param)
```

#### Insert data

The method signature is as follows:

```java
// table_name: the name of the target table.
// column_names: an array of column names in the target table.
// rows: the data rows, as an ArrayList<Sqlizable[]> in which each row is a Sqlizable array. Sqlizable is a wrapper class that converts Java data types to SQL data types.
public void insert(String table_name, String[] column_names, ArrayList<Sqlizable[]> rows);
```

The supported data types for rows include:

* SqlInteger: wraps integer data.
* SqlFloat: wraps floating-point data.
* SqlDouble: wraps double-precision data.
* SqlText: wraps string data.
* SqlVector: wraps vector data.

Here is an example:

```java
import com.oceanbase.obvec_jdbc.SqlInteger;
import com.oceanbase.obvec_jdbc.SqlText;
import com.oceanbase.obvec_jdbc.SqlVector;
import com.oceanbase.obvec_jdbc.Sqlizable;

ArrayList<Sqlizable[]> insert_rows = new ArrayList<>();
Sqlizable[] ir1 = { new SqlVector(new float[] {1.0f, 2.0f, 3.0f}), new SqlText("{\"doc\": \"oceanbase doc 1\"}") };
insert_rows.add(ir1);
Sqlizable[] ir2 = { new SqlVector(new float[] {1.1f, 2.2f, 3.3f}), new SqlText("{\"doc\": \"oceanbase doc 2\"}") };
insert_rows.add(ir2);
Sqlizable[] ir3 = { new SqlVector(new float[] {0f, 0f, 0f}), new SqlText("{\"doc\": \"oceanbase doc 3\"}") };
insert_rows.add(ir3);
ob.insert(tb_name, new String[] {"c2", "c3"}, insert_rows);
```

#### Delete data

The method signature is as follows:

```java
// table_name: the name of the target table.
// primary_key_name: the name of the primary key column.
// primary_keys: an array of primary key column values for the target rows.
public void delete(String table_name, String primary_key_name, ArrayList<Sqlizable> primary_keys);
```

Here is an example:

```java
ArrayList<Sqlizable> ids = new ArrayList<>();
ids.add(new SqlInteger(2));
ids.add(new SqlInteger(1));
ob.delete(tb_name, "c1", ids);
```

#### ANN queries

The method signature is as follows:

```java
// table_name: the name of the target table.
// vec_col_name: the name of the vector column.
// metric_type: the type of the vector distance function. l2: corresponds to the L2 distance function. cosine: corresponds to the cosine distance function. ip: corresponds to the negative inner product distance function.
// qv: the vector value to be queried.
// topk: the number of the most similar results to be returned.
// output_fields: the projected columns, that is, the array of the fields to be returned.
// output_datatypes: the data types of the projected columns, that is, the data types of the fields to be returned, for direct conversion to Java data types.
// where_expr: the WHERE condition expression.
public ArrayList<HashMap<String, Sqlizable>> query(
    String table_name,
    String vec_col_name,
    String metric_type,
    float[] qv,
    int topk,
    String[] output_fields,
    DataType[] output_datatypes,
    String where_expr);
```

Here is an example:

```java
ArrayList<HashMap<String, Sqlizable>> res = ob.query(tb_name, "c2", "l2",
    new float[] {0f, 0f, 0f}, 10,
    new String[] {"c1", "c3", "c2"},
    new DataType[] {
        DataType.INT32,
        DataType.JSON,
        DataType.FLOAT_VECTOR,
    },
    "c1 > 0");
if (res != null) {
    for (int i = 0; i < res.size(); i++) {
        for (HashMap.Entry<String, Sqlizable> entry : res.get(i).entrySet()) {
            System.out.printf("%s : %s, ", entry.getKey(), entry.getValue().toString());
        }
        System.out.print("\n");
    }
} else {
    System.out.println("res is null");
}
```

### Use the JSON table feature

The JSON table feature of obvec_jdbc relies on seekdb's ability to handle JSON data types (including `JSON_VALUE`/`JSON_TABLE`/`JSON_REPLACE`, etc.) to implement a virtual table mechanism. Multiple users (distinguished by user ID) can perform DDL or DML operations on virtual tables over the same physical table while ensuring data isolation between users. Admin users can perform DDL operations, while regular users can perform DML operations.

This design combines the structured management capabilities of relational databases with the flexibility of JSON, showcasing seekdb's multi-model integration capabilities. Users can enjoy the power and ease of use of SQL while also handling semi-structured data, meeting the diverse data model requirements of modern applications. Although operations are still performed on "tables," data is stored in a more flexible JSON format at the underlying level, better supporting complex and varied application scenarios.

#### How it works

<!-- The following figure illustrates the principle of JSON Table.

![image](/img/jsontable-works.jpg)

Detailed explanation:-->

1. User operations: Users still interact with the system using familiar standard SQL statements (such as `CREATE TABLE` to create table structures, `INSERT` to insert data, and `SELECT` to query data). They do not need to worry about how data is stored at the underlying level, just like operating ordinary relational database tables. The tables created by users using SQL statements are logical tables, which correspond to two physical tables (`meta_json_t` and `data_json_t`) within seekdb.

2. JSON Table SDK: Within the application, there is a JSON Table SDK (Software Development Kit). This SDK is the key that connects users' SQL operations and seekdb's actual storage. When SQL statements are executed, the SDK intercepts these requests and intelligently converts them into read and write operations on seekdb's internal tables `meta_json_t` and `data_json_t`.

3. seekdb internal storage:

   * `meta_json_t` (stores table schema): stores the metadata of the logical tables created by users, which is the schema information of the table (for example, which columns are created and what data type each column is). When `CREATE TABLE` is executed, the SDK records this schema information in `meta_json_t`.
   * `data_json_t` (stores row data as JSON type): stores the actual inserted data. Unlike traditional relational databases that directly store row data, the JSON Table feature encapsulates each row of inserted data into a JSON object and stores it in a column of the `data_json_t` table. This allows for efficient storage even with flexible data structures.

4. Data query: When query operations such as `SELECT` are executed, the SDK reads JSON-format data from `data_json_t` and combines it with the schema information from `meta_json_t` to re-parse and present the JSON data in a familiar tabular format, returning it to your application.

The `meta_json_t` table stores the metadata of the JSON table, which is the logical table schema defined by the user using the `CREATE TABLE` statement. It records the column information of each logical table, with the following schema:

| Field | Description | Example |
|--------|------|------|
| `user_id` | The user ID, used to distinguish the logical tables of different users. | `0`, `1`, `2` |
| `jtable_name` | The name of the logical table. | `test_count` |
| `jcol_id` | The column ID of the logical table. | `1`, `2`, `3` |
| `jcol_name` | The column name of the logical table. | `c1`, `c2`, `c3` |
| `jcol_type` | The data type of the column. | `INT`, `VARCHAR(124)`, `DECIMAL(10,2)` |
| `jcol_nullable` | Indicates whether the column allows null values. | `0`, `1` |
| `jcol_has_default` | Indicates whether the column has a default value. | `0`, `1` |
| `jcol_default` | The default value of the column. | `{'default': null}` |

When a user executes the `CREATE TABLE` statement, the JSON table SDK parses and inserts the column definition information into the `meta_json_t` table.

The `data_json_t` table stores the actual data of the JSON table, which is the data inserted by the user using the `INSERT` statement. It records the row data of each logical table, with the following schema:

| Field | Description | Example |
|--------|------|------|
| `user_id` | The user ID, used to distinguish the logical tables of different users. | `0`, `1`, `2` |
| `admin_id` | The administrator user ID. | `0` |
| `jtable_name` | The name of the logical table, used to associate the metadata in `meta_json_t`. | `test_count` |
| `jdata_id` | The data ID, a unique identifier for the JSON data, corresponding to each row in the logical table. | `1`, `2`, `3` |
| `jdata` | A column of the JSON type, used to store the actual row data of the logical table. | `{"c1": 1, "c2": "test", "c3": 1.23}` |

#### Examples

1. Create a client

   The constructor is as follows:

   ```java
   // uri: the connection string, which contains the address, port, and name of the database to which you want to connect.
   // user: the username.
   // password: the password.
   // user_id: the user ID.
   // log_level: the log level.
   public ObVecJsonClient(String uri, String user, String password, String user_id, Level log_level);
   ```

   Here is an example:

   ```java
   import com.oceanbase.obvec_jdbc.ObVecJsonClient;
   import java.util.logging.Level; // assumed source of the Level type

   String uri = "jdbc:oceanbase://127.0.0.1:2881/test";
   String user = "root@test";
   String password = "";
   // user_id is declared as a String in the constructor, so it is passed as "0" here.
   ObVecJsonClient client = new ObVecJsonClient(uri, user, password, "0", Level.INFO);
   ```

2. Execute DDL statements

   You can directly call the `parseJsonTableSQL2NormalSQL` interface and pass in the specific SQL statements.

   * Create a table

     ```java
     String sql = "CREATE TABLE `t2` (c1 INT NOT NULL DEFAULT 10, c2 VARCHAR(30) DEFAULT 'ca', c3 VARCHAR NOT NULL, c4 DECIMAL(10, 2), c5 TIMESTAMP DEFAULT CURRENT_TIMESTAMP);";
     client.parseJsonTableSQL2NormalSQL(sql);
     ```

   * ALTER TABLE CHANGE COLUMN

     ```java
     sql = "ALTER TABLE t2 CHANGE COLUMN c2 changed_col INT";
     client.parseJsonTableSQL2NormalSQL(sql);
     ```

   * ALTER TABLE ADD COLUMN

     ```java
     sql = "ALTER TABLE t2 ADD COLUMN email VARCHAR(100) default 'example@example.com'";
     client.parseJsonTableSQL2NormalSQL(sql);
     ```

   * ALTER TABLE MODIFY COLUMN

     ```java
     sql = "ALTER TABLE t2 MODIFY COLUMN changed_col TIMESTAMP NOT NULL DEFAULT current_timestamp";
     client.parseJsonTableSQL2NormalSQL(sql);
     ```

   * ALTER TABLE DROP COLUMN

     ```java
     sql = "ALTER TABLE t2 DROP c1";
     client.parseJsonTableSQL2NormalSQL(sql);
     ```

   * ALTER TABLE RENAME

     ```java
     sql = "ALTER TABLE t2 RENAME TO alter_test";
     client.parseJsonTableSQL2NormalSQL(sql);
     ```

@@ -0,0 +1,20 @@
---
slug: /vector-search-faq
---

# Vector search FAQs

This topic describes some common issues that may occur when using vector search, as well as their causes and solutions.

## Must all rows in a vector column have the same data dimensionality?

Yes. A dimensionality must be specified when a vector column is defined, and it is verified whenever vector data is written into the column.

## What is the maximum number of rows of vector data that can be written?

There is no hard limit; the practical maximum depends on the database's memory resources.

## How do I create an index on a vector with more than 4096 dimensions?

You need to reduce the data to 4096 dimensions or fewer (for example, through dimensionality reduction) before creating the index, as shown in the sketch below.

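For illustration only, one common way to reduce dimensionality is PCA; a minimal scikit-learn sketch (the input size and target dimensionality are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# 10,000 example embeddings with 8192 dimensions (assumed input).
embeddings = np.random.rand(10000, 8192).astype(np.float32)

# Project down to 1024 dimensions; any value <= 4096 can then be indexed.
pca = PCA(n_components=1024)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)  # (10000, 1024)
```
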
@@ -0,0 +1,520 @@
---
slug: /using-seekdb-in-python-mode
---

# Experience embedded seekdb

seekdb provides an embedded product form that can be integrated into user applications as a library, offering developers a more powerful and flexible data management solution. This enables data management everywhere (microcontrollers, IoT devices, edge computing, mobile applications, data centers, etc.), allowing users to quickly get started with seekdb's All-in-one (TP, AP, AI Native) capabilities.



## Installation and configuration

### Environment requirements

* Supported operating systems: Linux (glibc >= 2.28)

* Supported Python versions: CPython 3.8 ~ 3.14

* Supported system architectures: x86_64, aarch64

You can run the following command to check whether your environment meets the requirements.

```bash
python -c 'import sys;import platform; print(f"Python: {platform.python_implementation()} {platform.python_version()}, System: {platform.system()} {platform.machine()}, {platform.libc_ver()[0]}: {platform.libc_ver()[1]}");'
```

The output should be like this:

```
Python: CPython 3.8.17, System: Linux x86_64, glibc: 2.32
```

### Installation

Use pip to install. It automatically detects the default Python version and platform.

```bash
pip install pylibseekdb
# Or specify a mirror source for faster installation
pip install pylibseekdb -i https://pypi.tuna.tsinghua.edu.cn/simple
```

If your pip version is outdated, upgrade it first:

```bash
pip install --upgrade pip
```

## Experience seekdb

After completing the installation, you can start exploring seekdb.

### Considerations

* Multi-statement queries are not supported. By default, only the first statement is executed. For example, only the first INSERT below takes effect:

  ```python
  cur.execute("insert into t1 values(100);insert into t1 values(200)")
  ```

* The tmpfs file system cannot be used as the database directory.

* Only streaming query mode is supported. For example:

  ```python
  cur = con.cursor()
  cur.execute("select * from t1")
  cur.fetchall()
  ```

* The execute method does not support parameterization; statements must be passed as complete SQL strings, as shown in the sketch after this list.

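To illustrate the first and last considerations together, a short sketch (assuming a connection `con` and a table `t1` already exist): execute statements one at a time and format values into the SQL text yourself, since parameter binding is unavailable:

```python
cur = con.cursor()
for value in (100, 200):
    # No parameter binding, so build the complete SQL string first.
    cur.execute(f"insert into t1 values({value})")
con.commit()
```
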
### Experience basic seekdb operations

The following examples demonstrate some basic operations of seekdb. You can create databases, connect to databases, create tables, write and query data, and more.

:::info
For detailed information about seekdb SQL syntax, see <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001972315">SQL syntax</a>.
:::

seekdb provides the `test` database by default. The following example demonstrates how to open and connect to the `test` database using default parameters, and how to create tables, write data, commit transactions, query data, and safely close the database.

```python
import pylibseekdb

# Open the database directory seekdb by default
pylibseekdb.open()
# Connect to the test database by default
conn = pylibseekdb.connect()
# Create a cursor for data operations
cursor = conn.cursor()

# Execute table creation statement
cursor.execute("create table t1(c1 int primary key, c2 int)")
# Execute data insertion
cursor.execute("insert into t1 values(1, 100)")
cursor.execute("insert into t1 values(2, 200)")
# Manually commit the transaction
conn.commit()

# Execute query
cursor.execute("select * from t1")
# Fetch results
print(cursor.fetchall())

# Close connections
cursor.close()
conn.close()
```

You can also manually specify the database directory and create and use a new database.

```python
import pylibseekdb

# Specify the database directory
pylibseekdb.open("mydb")
# Do not connect to any database
conn = pylibseekdb.connect("")
# Create a cursor for data operations
cursor = conn.cursor()
# Manually create a database
cursor.execute("create database db1")
# Use the newly created database
cursor.execute("use db1")

# Close connections
cursor.close()
conn.close()
```

The following example demonstrates how to enable autocommit mode for transactions.

```python
import pylibseekdb

# Specify the database directory
pylibseekdb.open("seekdb")
# Connect to the test database
conn = pylibseekdb.connect(database="test", autocommit=True)
# Create a cursor for data operations
cursor = conn.cursor()

# Execute table creation statement
cursor.execute("create table t2(c1 int primary key, c2 int)")
# Execute data insertion, transaction is automatically committed
cursor.execute("insert into t2 values(1, 100)")
# Execute data insertion, transaction is automatically committed
cursor.execute("insert into t2 values(2, 200)")

# Query data using a new connection
conn2 = pylibseekdb.connect("test")
cursor2 = conn2.cursor()
cursor2.execute("select * from t2")
# View data row by row
print(cursor2.fetchone())
print(cursor2.fetchone())

# Close connections
cursor.close()
conn.close()
cursor2.close()
conn2.close()
```

### Experience AI Native

#### Experience vector search

seekdb supports up to 16,000 dimensions of float-type dense vectors, sparse vectors, and various types of vector distance calculations such as Manhattan distance, Euclidean distance, inner product, and cosine distance. It supports creating vector indexes based on HNSW/IVF, and supports incremental updates and deletions without affecting recall.

:::info
For more detailed information about seekdb vector search, see <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001976351">Vector search</a>.
:::

The following example demonstrates how to use vector search in seekdb.

```python
import pylibseekdb

pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()

# Create a table with a vector index
cursor.execute("create table test_vector(c1 int primary key, c2 vector(2), vector index idx1(c2) with (distance=l2, type=hnsw, lib=vsag))")

# Insert data
cursor.execute("insert into test_vector values(1, [1, 1])")
cursor.execute("insert into test_vector values(2, [1, 2])")
cursor.execute("insert into test_vector values(3, [1, 3])")
conn.commit()

# Execute vector search
cursor.execute("SELECT c1,c2 FROM test_vector ORDER BY l2_distance(c2, '[1, 2.5]') APPROXIMATE LIMIT 2;")
# Fetch results
print(cursor.fetchall())

# Close connections
cursor.close()
conn.close()
```

#### Experience full-text search

seekdb provides full-text indexing capabilities. By building full-text indexes on entire documents or large text content, you can significantly improve query performance for large-scale text data and complex search requirements, so that the required information can be retrieved more efficiently.

The following example demonstrates how to use seekdb's full-text search feature.

```python
import pylibseekdb

pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()

# Create a table with a full-text index
sql = '''create table articles (title VARCHAR(200) primary key, body Text,
FULLTEXT fts_idx(title, body));
'''
cursor.execute(sql)

# Insert data
sql = '''insert into articles(title, body) values
('OceanBase Tutorial', 'This is a tutorial about OceanBase Fulltext.'),
('Fulltext Index', 'Fulltext index can be very useful.'),
('OceanBase Test Case', 'Writing test cases helps ensure quality.')
'''
cursor.execute(sql)
conn.commit()

# Execute full-text search
sql = '''select
    title,
    match (title, body) against ("OceanBase") as score
from
    articles
where
    match (title, body) against ("OceanBase")
order by
    score desc
'''
cursor.execute(sql)
# Fetch results
print(cursor.fetchall())

# Close connections
cursor.close()
conn.close()
```

#### Experience hybrid search

Hybrid search combines vector-based semantic search and full-text index-based keyword search, providing more accurate and comprehensive search results through comprehensive ranking. Vector search excels at semantic approximate matching but is weak at matching exact keywords, numbers, and proper nouns, while full-text search effectively compensates for this deficiency. Therefore, hybrid search has become one of the key features of vector databases and is widely used in various products.

Based on multi-model integration, seekdb provides hybrid search capabilities for multi-modal data on the basis of SQL+AI, enabling fusion queries of multiple types of data in a single database system.

The following example demonstrates how to use seekdb's hybrid search feature.

```python
|
||||
import pylibseekdb
|
||||
|
||||
pylibseekdb.open("seekdb")
|
||||
conn = pylibseekdb.connect("test")
|
||||
cursor = conn.cursor()
|
||||
|
||||
# Create a table with vector indexes and full-text indexes
|
||||
cursor.execute("create table doc_table(c1 int, vector vector(3), query varchar(255), content varchar(255), vector index idx1(vector) with (distance=l2, type=hnsw, lib=vsag), fulltext idx2(query), fulltext idx3(content))")
|
||||
|
||||
# Insert data
|
||||
sql = '''insert into doc_table values(1, '[1,2,3]', "hello world", "oceanbase Elasticsearch database"),
|
||||
(2, '[1,2,1]', "hello world, what is your name", "oceanbase mysql database"),
|
||||
(3, '[1,1,1]', "hello world, how are you", "oceanbase oracle database"),
|
||||
(4, '[1,3,1]', "real world, where are you from", "postgres oracle database"),
|
||||
(5, '[1,3,2]', "real world, how old are you", "redis oracle database"),
|
||||
(6, '[2,1,1]', "hello world, where are you from", "starrocks oceanbase database");'''
|
||||
cursor.execute(sql)
|
||||
conn.commit()
|
||||
|
||||
|
||||
sql = '''set @parm = '{
|
||||
"query": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{"match": {"query": "hi hello"}},
|
||||
{"match": { "content": "oceanbase mysql" }}
|
||||
]
|
||||
}
|
||||
},
|
||||
"knn" : {
|
||||
"field": "vector",
|
||||
"k": 5,
|
||||
"num_candidates": 10,
|
||||
"query_vector": [1,2,3],
|
||||
"boost": 0.7
|
||||
},
|
||||
"_source" : ["query", "content", "_keyword_score", "_semantic_score"]
|
||||
}';'''
|
||||
cursor.execute(sql)
|
||||
|
||||
# Execute hybrid search
|
||||
sql = '''select dbms_hybrid_search.search('doc_table', @parm);'''
|
||||
cursor.execute(sql)
|
||||
# Fetch results
|
||||
print(cursor.fetchall())
|
||||
|
||||
# Close connections
|
||||
cursor.close()
|
||||
conn.close()
|
||||
```
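
In the parameter document, the fields follow Elasticsearch-style conventions: `k` is the number of nearest neighbors to return, `num_candidates` is the size of the candidate pool examined in the vector index (larger values trade speed for recall), `boost` weights the vector score in the fused ranking, and `_source` selects the fields to include in the result.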

### Experience analytical capabilities (OLAP)

seekdb combines transaction processing (TP) with analytical processing (AP). Based on the LSM-Tree architecture, it achieves unified row and columnar storage, and introduces a new vectorized engine and a column-store-based cost model, significantly improving the efficiency of processing wide tables and query performance in AP scenarios. It also supports real-time import, secondary indexes, high-concurrency primary key queries, and other common real-time OLAP requirements.

#### Experience data import

seekdb supports multiple flexible data import methods, allowing you to import data from a variety of sources. Different methods suit different scenarios: choose an import tool based on the data source type and your business scenario, and combine methods as scenarios become more complex and diverse. Besides the data source, also consider the data file format and how well each import tool supports it. If your scenario has a clearly defined data source and file format, design the import solution starting from the data source and the matching tools; if your team already has a familiar import tool, evaluate that tool's support and the feasibility of the import for your scenario.

The following example uses the `load data` method to demonstrate how to quickly import CSV data into seekdb.

1. Prepare the CSV data file.

```bash
cat /data/1/example.csv
1,10
2,20
3,30
```

2. Import the data using the embedded method.

```python
import pylibseekdb

pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()

# Create a table
cursor.execute("create table test_olap(c1 int, c2 int)")
# Execute fast import with the direct-load hint
cursor.execute("load data /*+ direct(true, 0) */ infile '/data/1/example.csv' into table test_olap fields terminated by ','")
# Query data
cursor.execute("select count(*) from test_olap")
# Fetch results
print(cursor.fetchall())

# Close connections
cursor.close()
conn.close()
```

#### Experience columnar storage

For complex analysis of large-scale data and ad-hoc queries over massive datasets, columnar storage is one of the key capabilities of an AP database. The seekdb storage engine extends its row storage with support for columnar storage and unified row-column storage: with one codebase, one architecture, and one instance, columnar data and row data coexist.

The following example demonstrates how to create a columnar table in seekdb.

```python
import pylibseekdb

pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()

# Create a columnar table (each column is stored in its own column group)
sql = '''create table each_column_group (col1 varchar(30) not null, col2 varchar(30) not null, col3 varchar(30) not null, col4 varchar(30) not null, col5 int)
with column group (each column);
'''
cursor.execute(sql)

# Insert data
sql = '''insert into each_column_group values('a', 'b', 'c', 'd', 1)
'''
cursor.execute(sql)
conn.commit()

# Execute query
cursor.execute("select col1,col2 from each_column_group")

# Fetch results
print(cursor.fetchall())

# Close connections
cursor.close()
conn.close()
```

#### Experience materialized views

Materialized views are a key feature for AP workloads. By precomputing and storing the query results of a view, they improve query performance and simplify complex query logic while reducing real-time computation; they are commonly used for fast report generation and data analysis. seekdb supports both non-real-time and real-time materialized views, allows specifying primary keys or creating indexes on materialized views, and introduces nested materialized views, which can significantly improve query performance.

The following example demonstrates how to use materialized views in seekdb.

```python
import pylibseekdb
import time

pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()

# Create base tables
cursor.execute("create table base_t1(a int primary key, b int)")
cursor.execute("create table base_t2(c int primary key, d int)")

# Create materialized view logs
cursor.execute("create materialized view log on base_t1 with(b)")
cursor.execute("create materialized view log on base_t2 with(d)")

# Create a materialized view named mv over base_t1 and base_t2 with incremental (fast) refresh:
# the first refresh starts immediately, and the view then refreshes every 1 second.
cursor.execute("create materialized view mv REFRESH fast START WITH sysdate() NEXT sysdate() + INTERVAL 1 second as select a,b,c,d from base_t1 join base_t2 on base_t1.a=base_t2.c")

# Insert data into base tables
cursor.execute("insert into base_t1 values(1, 10)")
cursor.execute("insert into base_t2 values(1, 100)")
conn.commit()

# Wait for the materialized view background refresh to complete
time.sleep(10)

# Query data
cursor.execute("select * from mv")
# Fetch results
print(cursor.fetchall())

# Close connections
cursor.close()
conn.close()
```
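
The materialized view logs record incremental changes to the base tables, and the fast (incremental) refresh applies those logged changes instead of recomputing the full join; this is what makes the one-second refresh interval practical.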

#### Experience external tables

Typically, table data is stored in the database's own storage space, whereas external table data resides in external storage services. When you create an external table, you define the data file path and file format; after creation, you can read data from files in the external storage service through the external table.

The following example demonstrates how to access external CSV files through seekdb's external table feature.

1. Create an external data source.

```bash
cat /data/1/example.csv
1,10
2,20
3,30
```

2. Access external table data using the embedded method.

```python
import pylibseekdb

pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()

# Create an external table over the CSV files under /data/1
sql = '''CREATE EXTERNAL TABLE test_external_table(c1 int, c2 int) LOCATION='/data/1' FORMAT=(TYPE='CSV' FIELD_DELIMITER=',') PATTERN='example.csv';
'''
cursor.execute(sql)

# Query data
cursor.execute("select * from test_external_table")

# Fetch results
print(cursor.fetchall())

# Close connections
cursor.close()
conn.close()
```

### Experience transaction capabilities (OLTP)

The following example demonstrates seekdb's transaction capabilities.

```python
import pylibseekdb

pylibseekdb.open("seekdb")
conn = pylibseekdb.connect("test")
cursor = conn.cursor()

# Create a table
cursor.execute("create table test_oltp(c1 int primary key, c2 int)")

# Insert data
cursor.execute("insert into test_oltp values(1, 10)")
cursor.execute("insert into test_oltp values(2, 20)")
cursor.execute("insert into test_oltp values(3, 30)")
# Commit the transaction
conn.commit()

# Query data; ORA_ROWSCN is the commit version number of each row
cursor.execute("select *,ORA_ROWSCN from test_oltp")
# Fetch results
print(cursor.fetchall())

# Close connections
cursor.close()
conn.close()
```

## Smooth transition to the distributed version

After quickly validating a product prototype with the embedded version, users who want to switch to seekdb server mode or use the cluster processing capabilities of OceanBase's distributed version only need to change the imported package and related configuration; the main application logic remains unchanged.

```python
import pylibseekdb
pylibseekdb.open()
conn = pylibseekdb.connect()
```

Simply replace the three lines above with the two lines below: substitute the pymysql package for pylibseekdb, remove the pylibseekdb open step, and use pymysql's connect method to connect to the database server.

```python
import pymysql
conn = pymysql.connect(host='127.0.0.1', port=11002, user='root@sys', database='test')
```
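
For example, the query logic from the data import example runs unchanged against the server (a minimal sketch, assuming the `test_olap` table already exists on the server):

```python
import pymysql

# Connect to the database server instead of opening an embedded instance
conn = pymysql.connect(host='127.0.0.1', port=11002, user='root@sys', database='test')

# The cursor-based application logic is identical to the embedded version
cursor = conn.cursor()
cursor.execute("select count(*) from test_olap")
print(cursor.fetchall())

cursor.close()
conn.close()
```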

@@ -0,0 +1,871 @@

---
slug: /vector-index-hybrid-search
---

# Hybrid search with vector indexes

This topic describes hybrid search with full-text indexes and vector indexes in seekdb.

Hybrid search combines vector-based semantic search and full-text index-based keyword search, providing more accurate and comprehensive search results through integrated ranking. Vector search excels at semantic approximate matching but has weaker capabilities for matching exact keywords, numbers, and proper nouns. Full-text search effectively compensates for this limitation. Therefore, hybrid search has become a key feature of vector databases and is widely used in various products. seekdb achieves efficient hybrid queries by integrating its full-text and vector indexing capabilities.

## Usage

The hybrid search feature is provided through the new system package `DBMS_HYBRID_SEARCH`, which contains two functions:

| Method name | Description |
| ----------- | ----------- |
| `DBMS_HYBRID_SEARCH.SEARCH` | Returns search results in JSON format. Results are sorted by relevance. |
| `DBMS_HYBRID_SEARCH.GET_SQL` | Returns the actual executed SQL statement as a string. |

<!--For detailed syntax and parameter descriptions, see [DBMS_HYBRID_SEARCH](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000004020384).-->
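
For example, `GET_SQL` can be used to inspect the SQL that a search request expands to before running it (a minimal sketch; it reuses the `products` table and the term query from the scalar search example below):

```sql
-- Define the search parameters (same shape as in the scalar search example below)
SET @parm = '{
  "query": {
    "bool": {
      "must": [
        {"term": {"brand": "GamerZone"}}
      ]
    }
  }
}';

-- Return the SQL statement that SEARCH would execute, as a string
SELECT DBMS_HYBRID_SEARCH.GET_SQL('products', @parm);
```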

## Use cases and examples

### Create example tables and insert data

To demonstrate the hybrid search feature, this section creates several example tables and inserts data into them; the tables are used in the different search scenarios below.

* **`products` table**: A basic product information table used to demonstrate regular scalar search. It contains product ID, name, description, brand, category, tags, price, stock quantity, release date, on-sale status, and a vector field `vec`.
:::collapse

```sql
CREATE TABLE products (
  `product_id` varchar(50) DEFAULT NULL,
  `product_name` varchar(255) DEFAULT NULL,
  `description` text DEFAULT NULL,
  `brand` varchar(100) DEFAULT NULL,
  `category` varchar(100) DEFAULT NULL,
  `tags` varchar(255) DEFAULT NULL,
  `price` decimal(10,2) DEFAULT NULL,
  `stock_quantity` int(11) DEFAULT NULL,
  `release_date` datetime DEFAULT NULL,
  `is_on_sale` tinyint(1) DEFAULT NULL,
  `vec` VECTOR(4) DEFAULT NULL
);
```

Insert data.

```sql
INSERT INTO products VALUES
('prod-001', 'Gamer-Pro Mechanical Keyboard', 'A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,rgb', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.5,0.1,0.6,0.9]'),
('prod-002', 'Gamer-Pro Headset', 'High-fidelity gaming headset with a noise-cancelling microphone.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,audio', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.1,0.9,0.2,0]'),
('prod-003', 'Eco-Friendly Yoga Mat', 'A non-slip yoga mat made from sustainable and eco-friendly materials.',
'NatureFirst', 'Sports', 'eco-friendly,health', 49.99, 200, '2023-04-22 00:00:00.000000', 0, '[0.1,0.9,0.3,0]');
```

:::

* **`products_fulltext` table**: Based on the `products` table, full-text indexes are created on the `product_name`, `description`, and `tags` columns to demonstrate full-text search.
:::collapse

```sql
CREATE TABLE products_fulltext (
  product_id VARCHAR(50),
  product_name VARCHAR(255),
  description TEXT,
  brand VARCHAR(100),
  category VARCHAR(100),
  tags VARCHAR(255),
  price DECIMAL(10, 2),
  stock_quantity INT,
  release_date DATETIME,
  is_on_sale TINYINT(1),
  vec VECTOR(4),
  -- Create full-text indexes on columns that need full-text search
  FULLTEXT INDEX idx_product_name(product_name),
  FULLTEXT INDEX idx_description(description),
  FULLTEXT INDEX idx_tags(tags)
);
```

Insert data.

```sql
INSERT INTO products_fulltext VALUES
('prod-001', 'Gamer-Pro Mechanical Keyboard', 'A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,rgb', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.5,0.1,0.6,0.9]'),
('prod-002', 'Gamer-Pro Headset', 'High-fidelity gaming headset with a noise-cancelling microphone.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,audio', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.1,0.9,0.2,0]'),
('prod-003', 'Eco-Friendly Yoga Mat', 'A non-slip yoga mat made from sustainable and eco-friendly materials.',
'NatureFirst', 'Sports', 'eco-friendly,health', 49.99, 200, '2023-04-22 00:00:00.000000', 0, '[0.1,0.9,0.3,0]');
```

:::

* **`doc_table` table**: A document table containing scalar columns, vector columns, and full-text indexed columns, used to demonstrate full-text search with scalar filtering conditions and hybrid search.
:::collapse

```sql
CREATE TABLE doc_table(
  c1 INT,
  vector VECTOR(3),
  query VARCHAR(255),
  content VARCHAR(255),
  VECTOR INDEX idx1(vector) WITH (distance=l2, type=hnsw, lib=vsag),
  FULLTEXT INDEX idx2(query),
  FULLTEXT INDEX idx3(content)
);
```

Insert data.

```sql
INSERT INTO doc_table VALUES
(1, '[1,2,3]', "hello world", "oceanbase Elasticsearch database"),
(2, '[1,2,1]', "hello world, what is your name", "oceanbase mysql database"),
(3, '[1,1,1]', "hello world, how are you", "oceanbase oracle database"),
(4, '[1,3,1]', "real world, where are you from", "postgres oracle database"),
(5, '[1,3,2]', "real world, how old are you", "redis oracle database"),
(6, '[2,1,1]', "hello world, where are you from", "starrocks oceanbase database");
```

:::

* **`products_vector` table**: Similar to the `products` table structure, but with a vector index explicitly created on the `vec` column to demonstrate pure vector search.
:::collapse

```sql
CREATE TABLE products_vector (
  `product_id` varchar(50) DEFAULT NULL,
  `product_name` varchar(255) DEFAULT NULL,
  `description` text DEFAULT NULL,
  `brand` varchar(100) DEFAULT NULL,
  `category` varchar(100) DEFAULT NULL,
  `tags` varchar(255) DEFAULT NULL,
  `price` decimal(10,2) DEFAULT NULL,
  `stock_quantity` int(11) DEFAULT NULL,
  `release_date` datetime DEFAULT NULL,
  `is_on_sale` tinyint(1) DEFAULT NULL,
  `vec` VECTOR(4) DEFAULT NULL,
  -- Create a vector index on the column that needs vector search
  VECTOR INDEX idx1(vec) WITH (distance=l2, type=hnsw, lib=vsag)
);
```

Insert data.

```sql
INSERT INTO products_vector VALUES
('prod-001', 'Gamer-Pro Mechanical Keyboard', 'A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,rgb', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.5,0.1,0.6,0.9]'),
('prod-002', 'Gamer-Pro Headset', 'High-fidelity gaming headset with a noise-cancelling microphone.',
'GamerZone', 'Gaming', 'best-seller,gaming-gear,audio', 149.00, 100, '2023-07-20 00:00:00.000000', 1, '[0.1,0.9,0.2,0]'),
('prod-003', 'Eco-Friendly Yoga Mat', 'A non-slip yoga mat made from sustainable and eco-friendly materials.',
'NatureFirst', 'Sports', 'eco-friendly,health', 49.99, 200, '2023-04-22 00:00:00.000000', 0, '[0.1,0.9,0.3,0]');
```

:::

* **`products_multi_vector` table**: A table containing multiple vector fields, used to demonstrate multi-vector search.
:::collapse

```sql
CREATE TABLE products_multi_vector (
  product_id VARCHAR(50),
  product_name VARCHAR(255),
  description TEXT,
  vec1 VECTOR(4),
  vec2 VECTOR(4),
  vec3 VECTOR(4),
  VECTOR INDEX idx1(vec1) WITH (distance=l2, type=hnsw, lib=vsag),
  VECTOR INDEX idx2(vec2) WITH (distance=l2, type=hnsw, lib=vsag),
  VECTOR INDEX idx3(vec3) WITH (distance=l2, type=hnsw, lib=vsag)
);
```

Insert data.

```sql
INSERT INTO products_multi_vector VALUES
('prod-001', 'Gamer-Pro Mechanical Keyboard', 'A responsive mechanical keyboard', '[0.5,0.1,0.6,0.9]', '[0.2,0.3,0.4,0.5]', '[0.1,0.2,0.3,0.4]'),
('prod-002', 'Gamer-Pro Headset', 'High-fidelity gaming headset', '[0.1,0.9,0.2,0]', '[0.3,0.4,0.5,0.6]', '[0.2,0.3,0.4,0.5]'),
('prod-003', 'Eco-Friendly Yoga Mat', 'A non-slip yoga mat', '[0.1,0.9,0.3,0]', '[0.4,0.5,0.6,0.7]', '[0.3,0.4,0.5,0.6]');
```

:::

### Regular scalar search

Some use cases for regular scalar search are as follows:

* E-commerce platform product filtering: Users want to view all products from a specific brand, for example the `GamerZone` brand.
* Content management systems: Administrators need to filter articles or documents by specific categories. For example, finding all articles by a specific author.
* User management systems: Finding users with specific statuses or roles. For example, finding all VIP users.

Example:

1. Set search parameters.

```sql
SET @parm = '{
  "query": {
    "bool": {
      "must": [
        {"term": {"brand": "GamerZone"}}
      ]
    }
  }
}';
```

2. Search for all records where `brand` is `"GamerZone"`.

```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products', @parm));
```

The return result is as follows:

:::collapse{title="Return result"}

```shell
+------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products', @parm))  |
+------------------------------------------------------------+
| [
  {
    "vec": "[0.5,0.1,0.6,0.9]",
    "tags": "best-seller,gaming-gear,rgb",
    "brand": "GamerZone",
    "price": 149.00,
    "_score": 1,
    "category": "Gaming",
    "is_on_sale": 1,
    "product_id": "prod-001",
    "description": "A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.",
    "product_name": "Gamer-Pro Mechanical Keyboard",
    "release_date": "2023-07-20 00:00:00.000000",
    "stock_quantity": 100
  },
  {
    "vec": "[0.1,0.9,0.2,0]",
    "tags": "best-seller,gaming-gear,audio",
    "brand": "GamerZone",
    "price": 149.00,
    "_score": 1,
    "category": "Gaming",
    "is_on_sale": 1,
    "product_id": "prod-002",
    "description": "High-fidelity gaming headset with a noise-cancelling microphone.",
    "product_name": "Gamer-Pro Headset",
    "release_date": "2023-07-20 00:00:00.000000",
    "stock_quantity": 100
  }
] |
+------------------------------------------------------------+
1 row in set
```

:::

### Range search for regular scalars

Some use cases for regular scalar range search are as follows:

* Price range filtering: E-commerce platforms filter products by price range. For example, finding products with prices in the `[30~80]` range.
* Time range queries: Finding orders or logs within a specific time period. For example, finding orders from the last 30 days.
* Numeric range filtering: Filtering by rating, stock quantity, and other numeric ranges. For example, finding products with ratings between `[4~5]`.

Example:

1. Set search parameters.

```sql
SET @parm = '{
  "query": {
    "range" : {
      "price" : {
        "gte" : 30,
        "lte" : 80
      }
    }
  }
}';
```

2. Search for all records where `price` is in the `[30~80]` range.

```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products', @parm));
```

The return result is as follows:

:::collapse{title="Return result"}

```shell
+------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products', @parm))  |
+------------------------------------------------------------+
| [
  {
    "vec": "[0.1,0.9,0.3,0]",
    "tags": "eco-friendly,health",
    "brand": "NatureFirst",
    "price": 49.99,
    "_score": true,
    "category": "Sports",
    "is_on_sale": 0,
    "product_id": "prod-003",
    "description": "A non-slip yoga mat made from sustainable and eco-friendly materials.",
    "product_name": "Eco-Friendly Yoga Mat",
    "release_date": "2023-04-22 00:00:00.000000",
    "stock_quantity": 200
  }
] |
+------------------------------------------------------------+
1 row in set
```

:::

### Full-text search

Some use cases for full-text search are as follows:

* Document search: Searching for content containing specific keywords in a large number of documents. For example, searching for documents containing `"how to use"` in FAQs.
* Product search: Fuzzy search based on product names and descriptions. For example, searching for products containing `"database"`.
* Knowledge base retrieval: Searching for related questions in FAQs and help documents. For example, searching for answers to related questions in a customer service system's knowledge base.

Example:

1. Set search parameters.

```sql
SET @query_str_with_mini = '{
  "query": {
    "query_string": {
      "type": "best_fields",
      "fields": ["product_name^3", "description^2.5", "tags^1.5"],
      "query": "Gamer-Pro^2 keyboard^1.5 audio^1.2",
      "boost": 1.5
    }
  }
}';
```

2. Search for records where the `product_name`, `description`, and `tags` fields contain the keywords `"Gamer-Pro"`, `"keyboard"`, or `"audio"`, ranked according to the configured field and keyword weights.

```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_fulltext', @query_str_with_mini));
```

The return result is as follows:

:::collapse{title="Return result"}

```shell
+------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_fulltext', @query_str_with_mini))  |
+------------------------------------------------------------------------------------+
| [
  {
    "vec": "[0.5,0.1,0.6,0.9]",
    "tags": "best-seller,gaming-gear,rgb",
    "brand": "GamerZone",
    "price": 149.00,
    "_score": 4.569735248749978,
    "category": "Gaming",
    "is_on_sale": 1,
    "product_id": "prod-001",
    "description": "A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.",
    "product_name": "Gamer-Pro Mechanical Keyboard",
    "release_date": "2023-07-20 00:00:00.000000",
    "stock_quantity": 100
  },
  {
    "vec": "[0.1,0.9,0.2,0]",
    "tags": "best-seller,gaming-gear,audio",
    "brand": "GamerZone",
    "price": 149.00,
    "_score": 1.7338881172399914,
    "category": "Gaming",
    "is_on_sale": 1,
    "product_id": "prod-002",
    "description": "High-fidelity gaming headset with a noise-cancelling microphone.",
    "product_name": "Gamer-Pro Headset",
    "release_date": "2023-07-20 00:00:00.000000",
    "stock_quantity": 100
  }
] |
+------------------------------------------------------------------------------------+
1 row in set
```

:::

### Full-text search with scalar filtering conditions

Some use cases for full-text search with scalar filtering conditions are as follows:

* Precise search: Performing text search under specific conditions. For example, searching for specific keywords in articles with published status.
* Permission control: Searching within data ranges that users have permission to access. For example, an order system searching for product information in orders within a specific time period.
* Category search: Performing keyword search within specific categories. For example, a user system searching for specific user information among active users.

Example:

1. Set search parameters.

```sql
-- Filter condition: specify scalar filter condition c1 >= 2
SET @query_str = '{
  "query": {
    "bool" : {
      "must" : [
        {"query_string": {
          "fields": ["query", "content"],
          "query": "hello what oceanbase mysql"}
        }
      ],
      "filter" : [
        {"range": {"c1": {"gte" : 2}}}
      ]
    }
  }
}';
```

2. Search for records matching the keywords where `c1` is greater than or equal to 2.

```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @query_str));
```

The return result is as follows:

:::collapse{title="Return result"}

```shell
+------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @query_str))  |
+------------------------------------------------------------------+
| [
  {
    "c1": 2,
    "query": "hello world, what is your name",
    "_score": 2.170969786679347,
    "vector": "[1,2,1]",
    "content": "oceanbase mysql database"
  },
  {
    "c1": 3,
    "query": "hello world, how are you",
    "_score": 0.3503184713375797,
    "vector": "[1,1,1]",
    "content": "oceanbase oracle database"
  },
  {
    "c1": 6,
    "query": "hello world, where are you from",
    "_score": 0.3503184713375797,
    "vector": "[2,1,1]",
    "content": "starrocks oceanbase database"
  }
] |
+------------------------------------------------------------------+
1 row in set
```

:::

### Vector search

Some use cases for vector search are as follows:

* Semantic search: Finding related content based on semantic similarity. For example, finding semantically related questions and answers in a knowledge base.
* Recommendation systems: Recommending similar products based on user preferences. For example, recommending similar products on e-commerce platforms.
* Image search: Finding similar images through image features. For example, finding similar images in an image library.
* Intelligent Q&A: Answering user questions through semantic retrieval. For example, finding semantically related questions and answers in a customer service system's knowledge base.

Example:

1. Set search parameters.

```sql
-- field specifies the vector field, k specifies the number of results to return (the k nearest results), query_vector specifies the query vector
SET @parm = '{
  "knn" : {
    "field": "vec",
    "k": 3,
    "query_vector": [0.5,0.1,0.6,0.9]
  }
}';
```

2. Search for all records where `vec` is similar to `[0.5,0.1,0.6,0.9]`.

```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_vector', @parm));
```

The return result is as follows:

:::collapse{title="Return result"}

```shell
+-------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_vector', @parm))  |
+-------------------------------------------------------------------+
| [
  {
    "vec": "[0.5,0.1,0.6,0.9]",
    "tags": "best-seller,gaming-gear,rgb",
    "brand": "GamerZone",
    "price": 149.00,
    "_score": 1.0,
    "category": "Gaming",
    "is_on_sale": 1,
    "product_id": "prod-001",
    "description": "A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.",
    "product_name": "Gamer-Pro Mechanical Keyboard",
    "release_date": "2023-07-20 00:00:00.000000",
    "stock_quantity": 100
  },
  {
    "vec": "[0.1,0.9,0.3,0]",
    "tags": "eco-friendly,health",
    "brand": "NatureFirst",
    "price": 49.99,
    "_score": 0.43405784,
    "category": "Sports",
    "is_on_sale": 0,
    "product_id": "prod-003",
    "description": "A non-slip yoga mat made from sustainable and eco-friendly materials.",
    "product_name": "Eco-Friendly Yoga Mat",
    "release_date": "2023-04-22 00:00:00.000000",
    "stock_quantity": 200
  },
  {
    "vec": "[0.1,0.9,0.2,0]",
    "tags": "best-seller,gaming-gear,audio",
    "brand": "GamerZone",
    "price": 149.00,
    "_score": 0.42910841,
    "category": "Gaming",
    "is_on_sale": 1,
    "product_id": "prod-002",
    "description": "High-fidelity gaming headset with a noise-cancelling microphone.",
    "product_name": "Gamer-Pro Headset",
    "release_date": "2023-07-20 00:00:00.000000",
    "stock_quantity": 100
  }
] |
+-------------------------------------------------------------------+
1 row in set
```

:::

### Vector search with scalar filtering conditions

Some use cases for vector search with scalar filtering conditions are as follows:

* Precise search: Performing semantic search under specific conditions. For example, searching for semantically similar articles among those with published status.
* Permission control: Searching within data ranges that users have permission to access. For example, an order system searching for similar products among orders within a specific time period.
* Category search: Performing similarity search within specific categories. For example, recommending similar products within a specific product category.

Example:

1. Set search parameters.

```sql
-- Specify scalar filter condition brand = "GamerZone"
SET @parm = '{
  "knn" : {
    "field": "vec",
    "k": 3,
    "query_vector": [0.1,0.5,0.3,0.7],
    "filter" : [
      {"term" : {"brand": "GamerZone"} }
    ]
  }
}';
```

2. Search for all records where `vec` is similar to `[0.1,0.5,0.3,0.7]` and `brand` is `"GamerZone"`.

```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_vector', @parm));
```

The return result is as follows:

:::collapse{title="Return result"}

```shell
+-------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_vector', @parm))  |
+-------------------------------------------------------------------+
| [
  {
    "vec": "[0.5,0.1,0.6,0.9]",
    "tags": "best-seller,gaming-gear,rgb",
    "brand": "GamerZone",
    "price": 149.00,
    "_score": 0.59850837,
    "category": "Gaming",
    "is_on_sale": 1,
    "product_id": "prod-001",
    "description": "A responsive mechanical keyboard with customizable RGB lighting for the ultimate gaming experience.",
    "product_name": "Gamer-Pro Mechanical Keyboard",
    "release_date": "2023-07-20 00:00:00.000000",
    "stock_quantity": 100
  },
  {
    "vec": "[0.1,0.9,0.2,0]",
    "tags": "best-seller,gaming-gear,audio",
    "brand": "GamerZone",
    "price": 149.00,
    "_score": 0.55175342,
    "category": "Gaming",
    "is_on_sale": 1,
    "product_id": "prod-002",
    "description": "High-fidelity gaming headset with a noise-cancelling microphone.",
    "product_name": "Gamer-Pro Headset",
    "release_date": "2023-07-20 00:00:00.000000",
    "stock_quantity": 100
  }
] |
+-------------------------------------------------------------------+
1 row in set
```

:::

### Multi-vector search

Multi-vector search refers to searching across multiple vector indexes and returning the most similar records.

Example:

1. Set search parameters.

```sql
-- Specify 3-way vector queries; each query specifies the vector index field, the number of results to return, and the query vector
SET @param_multi_knn = '{
  "knn" : [{
    "field": "vec1",
    "k": 5,
    "query_vector": [0.5,0.1,0.6,0.9]
  },
  {
    "field": "vec2",
    "k": 5,
    "query_vector": [0.2,0.3,0.4,0.5]
  },
  {
    "field": "vec3",
    "k": 5,
    "query_vector": [0.1,0.2,0.3,0.4]
  }
  ],
  "size" : 5
}';
```

2. Execute the query and return the query results.

```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_multi_vector', @param_multi_knn));
```

:::collapse{title="Return result"}

```shell
+------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('products_multi_vector', @param_multi_knn))  |
+------------------------------------------------------------------------------------+
| [
  {
    "vec1": "[0.5,0.1,0.6,0.9]",
    "vec2": "[0.2,0.3,0.4,0.5]",
    "vec3": "[0.1,0.2,0.3,0.4]",
    "_score": 3.0,
    "product_id": "prod-001",
    "description": "A responsive mechanical keyboard",
    "product_name": "Gamer-Pro Mechanical Keyboard"
  },
  {
    "vec1": "[0.1,0.9,0.2,0]",
    "vec2": "[0.3,0.4,0.5,0.6]",
    "vec3": "[0.2,0.3,0.4,0.5]",
    "_score": 2.0957750699999997,
    "product_id": "prod-002",
    "description": "High-fidelity gaming headset",
    "product_name": "Gamer-Pro Headset"
  },
  {
    "vec1": "[0.1,0.9,0.3,0]",
    "vec2": "[0.4,0.5,0.6,0.7]",
    "vec3": "[0.3,0.4,0.5,0.6]",
    "_score": 1.86262927,
    "product_id": "prod-003",
    "description": "A non-slip yoga mat",
    "product_name": "Eco-Friendly Yoga Mat"
  }
] |
+------------------------------------------------------------------------------------+
1 row in set
```

:::

### Full-text and vector hybrid search

Some use cases for full-text and vector hybrid search are as follows:

* Intelligent search: Comprehensive search combining keywords and semantic understanding. For example, when a user inputs `"I need a gaming keyboard"`, the system matches both the keywords `"gaming"` and `"keyboard"`, and understands the semantics of `"gaming equipment"`.
* Document search: Supporting both exact keyword matching and semantic understanding in large document collections. For example, when searching for `"database optimization"`, it matches documents containing these words and also finds semantically related content about `"performance tuning"` and `"query optimization"`.
* Product recommendation: E-commerce platforms support both product name search and requirement description search. For example, based on a user's description `"laptop suitable for office work"`, it matches keywords and understands the semantic requirement of `"business office"`.

Example:

1. Set search parameters.

```sql
SET @parm = '{
  "query": {
    "bool": {
      "should": [
        {"match": {"query": "hi hello"}},
        {"match": { "content": "oceanbase mysql" }}
      ]
    }
  },
  "knn" : {
    "field": "vector",
    "k": 5,
    "query_vector": [1,2,3]
  },
  "_source" : ["query", "content", "_keyword_score", "_semantic_score"]
}';
```

2. Execute the query and return the query results.

```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));
```
|
||||
|
||||
The return result is as follows:

:::collapse{title="Return result"}

```shell
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(dbms_hybrid_search.search('doc_table', @parm)) |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
  {
    "query": "hello world, what is your name",
    "_score": 2.835628417884166,
    "content": "oceanbase mysql database",
    "_keyword_score": 2.5022950878841663,
    "_semantic_score": 0.33333333
  },
  {
    "query": "hello world",
    "_score": 1.7219400929592013,
    "content": "oceanbase Elasticsearch database",
    "_keyword_score": 0.7219400929592014,
    "_semantic_score": 1.0
  },
  {
    "query": "hello world, how are you",
    "_score": 1.0096539326751595,
    "content": "oceanbase oracle database",
    "_keyword_score": 0.7006369426751594,
    "_semantic_score": 0.30901699
  },
  {
    "query": "real world, how old are you",
    "_score": 0.41421356,
    "content": "redis oracle database",
    "_keyword_score": null,
    "_semantic_score": 0.41421356
  },
  {
    "query": "real world, where are you from",
    "_score": 0.30901699,
    "content": "postgres oracle database",
    "_keyword_score": null,
    "_semantic_score": 0.30901699
  }
] |
```

:::
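
In the last two rows, `_keyword_score` is `null`: those documents were returned only by the vector sub-query, so their final `_score` equals the semantic score alone.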

### Full-text and vector RRF hybrid search

The result sets of full-text sub-queries and vector sub-queries are fused with weighted scoring by default. You can switch the fusion method to RRF (Reciprocal Rank Fusion) ranking through the `rank` syntax. Some use cases are as follows:

* Multi-dimensional ranking: Requiring comprehensive consideration of results from multiple search dimensions. For example, in academic search systems, when searching in a paper database, both keyword matching degree and semantic relevance need to be considered.
* Fairness requirements: Ensuring that results from different search methods are reasonably displayed. For example, on e-commerce platforms, both textual information such as product titles and descriptions, and visual information such as product images and videos need to be considered.
* Complex queries: Complex search scenarios involving multiple query conditions. For example, in medical systems, both patient symptom descriptions and patient medical history and examination results need to be considered.

Example:

Set search parameters.

```sql
SET @rrf_query_param = '{
  "query": {
    "query_string": {
      "fields": ["title", "author", "description"],
      "query": "fiction American Dream"
    }
  },
  "knn": {
    "field": "vector_embedding",
    "k": 5,
    "query_vector": [0.1, 0.2, 0.3, 0.4]
  },
  "rank": {
    "rrf": {
      "rank_window_size": 10,
      "rank_constant": 60
    }
  }
}';
```
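
Then execute the fused query in the same way as in the previous examples. The table name `book_table` below is illustrative; substitute the table you actually created:

```sql
SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('book_table', @rrf_query_param));
```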

The RRF algorithm calculates the final relevance score by fusing the rankings of multiple sub-query result sets. The calculation formula is as follows:

```
score = 0.0
for q in queries:
    if d in result(q):
        score += 1.0 / (k + rank(result(q), d))  # k is the configured rank_constant
return score
```
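
For example, with the configured `rank_constant` of 60, a document ranked first by the full-text sub-query and third by the vector sub-query receives a fused score of 1/(60+1) + 1/(60+3) ≈ 0.0323, while a document returned by only one sub-query accumulates only that sub-query's term.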

### Summary

The examples in this topic demonstrate the practical value of the hybrid search feature:

* Intelligent search upgrade: Integrating semantic understanding into traditional keyword search to provide more accurate search results that better match user intent.
* Optimized user experience: Supporting natural language queries, simplifying operations, and improving information retrieval efficiency.
* Empowering diverse businesses: Widely applied in scenarios such as e-commerce, content management, knowledge bases, and intelligent customer service, achieving comprehensive coverage from basic filtering to intelligent recommendations.
* Combined technical advantages: Combining exact matching with semantic understanding to significantly improve the accuracy and comprehensiveness of search results.

The hybrid search feature is an ideal choice for processing massive unstructured data and building intelligent search and recommendation systems.

<!--## Related documentation

* [DBMS_HYBRID_SEARCH sub-function overview](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000004020384)-->
@@ -0,0 +1,165 @@

---
slug: /ai-function-permission
---

# AI function privileges

This topic describes the AI function privileges, including `AI MODEL` and `ACCESS AI MODEL`, which are used for managing AI models and calling AI functions, respectively.

## AI MODEL

AI MODEL privileges are used for managing AI models. These include three specific privileges: `CREATE AI MODEL`, `ALTER AI MODEL`, and `DROP AI MODEL`.

### Syntax

The syntax for granting privileges is as follows:

```sql
-- Grant the privilege to create an AI model.
GRANT CREATE AI MODEL ON *.* TO 'username'@'host';

-- Grant the privilege to change an AI model.
GRANT ALTER AI MODEL ON *.* TO 'username'@'host';

-- Grant the privilege to drop an AI model.
GRANT DROP AI MODEL ON *.* TO 'username'@'host';

-- Grant all three privileges in one statement.
GRANT CREATE AI MODEL, ALTER AI MODEL, DROP AI MODEL ON *.* TO 'username'@'host';
```

The syntax for revoking privileges is as follows:

```sql
-- Revoke the privilege to create an AI model.
REVOKE CREATE AI MODEL ON *.* FROM 'username'@'host';

-- Revoke the privilege to change an AI model.
REVOKE ALTER AI MODEL ON *.* FROM 'username'@'host';

-- Revoke the privilege to drop an AI model.
REVOKE DROP AI MODEL ON *.* FROM 'username'@'host';

-- Check the privileges.
SHOW GRANTS FOR 'username'@'host';
```

### Examples

1. Create a user.

```sql
CREATE USER test_ai_user@'%' IDENTIFIED BY '123456';
```

2. Log in as the `test_ai_user` user.

```shell
obclient -h 127.0.0.1 -P 2881 -u test_ai_user -p'***' -A -D test
```

3. Call the `CREATE_AI_MODEL_ENDPOINT` procedure.

```sql
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
    -> 'user_ai_model_endpoint_1', '{
    '> "ai_model_name": "my_model1",
    '> "url": "https://api.deepseek.com",
    '> "access_key": "sk-xxxxxxxxxxxx",
    '> "request_model_name": "deepseek-chat",
    '> "provider": "deepseek"
    '> }');
```

Since the user does not have the `CREATE AI MODEL` privilege, an error is returned:

```shell
ERROR 42501: Access denied; you need (at least one of) the create ai model endpoint privilege(s) for this operation
```

4. Grant the `CREATE AI MODEL` privilege to the `test_ai_user` user.

```sql
GRANT CREATE AI MODEL ON *.* TO test_ai_user@'%';
```

5. Verify the privilege.

```sql
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
    -> 'user_ai_model_endpoint_1', '{
    '> "ai_model_name": "my_model1",
    '> "url": "https://api.deepseek.com",
    '> "access_key": "sk-xxxxxxxxxxxx",
    '> "request_model_name": "deepseek-chat",
    '> "provider": "deepseek"
    '> }');
```

This time, the statement executes successfully.

## ACCESS AI MODEL

The `ACCESS AI MODEL` privilege is used for calling AI functions, including `AI_COMPLETE`, `AI_EMBED`, `AI_RERANK`, and `AI_PROMPT`.

### Syntax

The syntax for granting this privilege is as follows:

```sql
GRANT ACCESS AI MODEL ON *.* TO 'username'@'host';
```

The syntax for revoking this privilege is as follows:

```sql
REVOKE ACCESS AI MODEL ON *.* FROM 'username'@'host';
```

### Examples

1. Call the `AI_COMPLETE` function.

```sql
SELECT AI_COMPLETE("ob_complete","Your task is to perform sentiment analysis on the provided text and determine whether the sentiment is positive or negative.
The text to analyze is as follows:
<text>
What a beautiful day!
</text>
Judgment criteria:
If the text expresses a positive sentiment, output 1; if it expresses a negative sentiment, output -1. Do not output anything else.\n") AS ans;
```

Since the user does not have the `ACCESS AI MODEL` privilege, an error is returned:

```shell
ERROR 42501: Access denied; you need (at least one of) the access ai model endpoint privilege(s) for this operation
```

2. Grant the `ACCESS AI MODEL` privilege to the `test_ai_user` user.

```sql
GRANT ACCESS AI MODEL ON *.* TO test_ai_user@'%';
```

3. Verify the privilege.

```sql
SELECT AI_COMPLETE("ob_complete","Your task is to perform sentiment analysis on the provided text and determine whether the sentiment is positive or negative.
The text to analyze is as follows:
<text>
What a beautiful day!
</text>
Judgment criteria:
If the text expresses a positive sentiment, output 1; if it expresses a negative sentiment, output -1. Do not output anything else.\n") AS ans;
```

This time, the statement executes successfully.

```sql
+-----+
| ans |
+-----+
| 1   |
+-----+
```
@@ -0,0 +1,332 @@

---
slug: /ai-function
---

# Use cases and examples of AI functions

This topic describes the features of AI functions in seekdb.

AI functions integrate AI model capabilities directly into data processing within databases through SQL expressions. This greatly simplifies operations such as data extraction, analysis, summarization, and storage using large AI models, making it an important new feature in the fields of databases and data warehouses. seekdb provides AI model and endpoint management through the `DBMS_AI_SERVICE` package, introduces several built-in AI function expressions, and supports monitoring AI model usage through views.

## Prerequisites

Before using AI functions, make sure you have the necessary privileges. For more information about the privileges, see [AI function privileges](100.ai-function-permission.md).

## Considerations

Hybrid search relies on the model management and embedding capabilities of the AI function service. Before deleting an AI model, check whether it is referenced by hybrid search to avoid potential issues.

## AI model management

The `DBMS_AI_SERVICE` package provides the ability to manage AI models and endpoints. It supports the following operations:

| **Operation** | **Description** |
| ------ | ------ |
| CREATE_AI_MODEL | Creates an AI model object. |
| DROP_AI_MODEL | Drops an AI model object. |
| CREATE_AI_MODEL_ENDPOINT | Creates an AI model endpoint object. |
| ALTER_AI_MODEL_ENDPOINT | Modifies an AI model endpoint object. |
| DROP_AI_MODEL_ENDPOINT | Drops an AI model endpoint object. |

By using this system package, you can directly manage AI models and endpoints within seekdb, without relying on external services.
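
For example, an endpoint that is no longer needed can be removed with a single call. This is a minimal sketch: `demo_endpoint` is an illustrative name, and the exact parameter list of each procedure is described in its reference documentation.

```sql
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT('demo_endpoint');
```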

## Monitor AI model usage

seekdb allows you to query and monitor information about AI models and their usage through the following views:

* CDB/DBA_OB_AI_MODELS: Query information about AI models.
* CDB/DBA_OB_AI_MODEL_ENDPOINTS: Monitor the calls of AI models.
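
For example, the following query lists the registered AI models. This is a minimal sketch: it assumes the view is exposed in the `oceanbase` schema, and the available columns depend on your seekdb version.

```sql
SELECT * FROM oceanbase.DBA_OB_AI_MODELS;
```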

## AI function expressions

seekdb supports the following AI function expressions, allowing you to call AI models directly within seekdb using SQL statements and greatly simplifying the process:

| Function | Description |
|-----|------|
| `AI_COMPLETE` | Calls a specified text generation large language model (LLM) to process prompts and data, and then parses the results. |
| `AI_PROMPT` | Constructs and formats prompts. Supports dynamic data insertion. |
| `AI_EMBED` | Calls an embedding model to convert text data into vector data. |
| `AI_RERANK` | Calls a reranking model to sort text based on prompts by similarity. |

:::info
When using AI function expressions, make sure you have registered the AI models and endpoint information in the database.
:::

### AI_COMPLETE and AI_PROMPT

The `AI_COMPLETE` function specifies a registered large language model (LLM) for text generation using the `model_key`, processes the user-provided prompt and data, and returns the text generated by the model. You can customize the prompt and the format of the data from the database through the `prompt` parameter. This approach not only enables flexible processing of textual data, but also allows for batch processing directly within the database, effectively avoiding the overhead of repeatedly transferring data between the database and the language model.

In many AI application scenarios, prompts are highly structured and require dynamic injection of specific data. Manually concatenating prompts and input content with functions like `CONCAT` each time is costly to develop and prone to formatting errors. To support prompt reuse and the dynamic combination of prompts and data, seekdb provides the `AI_PROMPT` function. `AI_PROMPT` turns prompts from static text into reusable, parameterizable templates that can be used directly in place of the `prompt` parameter within `AI_COMPLETE`. This greatly simplifies prompt construction, improving both development efficiency and accuracy.

#### AI_PROMPT function

The `AI_PROMPT` function is used to construct and format prompts, supporting dynamic data insertion.

##### Syntax

The syntax for the `AI_PROMPT` function is as follows:

```sql
AI_PROMPT('template', expr0[, expr1, ...]);
```

Parameters:

| Parameter | Description | Type | Nullable |
|-----------|-------------|------|----------|
| template | The prompt template entered by the user. | VARCHAR(max_length) | No |
| expr | The data entered by the user. | VARCHAR(max_length) | No |

Both the `template` and `expr` parameters are required and cannot be null. The `expr` parameter only supports the `VARCHAR` type and does not support the `JSON` type.

Return value:

* A JSON value containing the formatted prompt template and its arguments.

##### Examples

The `AI_PROMPT` function organizes the template string and dynamic data into JSON format:

* The first parameter (the template string `template`) is placed in the `template` field of the returned JSON.
* Subsequent parameters (data values `expr0`, `expr1`, ...) are placed in the `args` array of the returned JSON.
* Placeholders in the template such as `{0}`, `{1}`, etc., correspond by index to the data in the `args` array and are automatically replaced when used in the `AI_COMPLETE` function.

For example:

```sql
SELECT AI_PROMPT('Recommend {0} of the most popular {1} to me.', 'ten', 'mobile phones');
```

Return result:

```json
{
  "template": "Recommend {0} of the most popular {1} to me.",
  "args": ["ten", "mobile phones"]
}
```

Based on the previous example, using the `AI_PROMPT` function within the `AI_COMPLETE` function:

```sql
SELECT AI_COMPLETE("ob_complete", AI_PROMPT('Recommend {0} of the most popular {1} to me. just output name in json array format', 'two', 'mobile phones')) AS ans;
```

Return result:

```shell
+--------------------------------------------------+
| ans                                              |
+--------------------------------------------------+
| ["iPhone 15 Pro Max","Samsung Galaxy S24 Ultra"] |
+--------------------------------------------------+
```
#### AI_COMPLETE function

##### Syntax

The syntax for the `AI_COMPLETE` function is as follows:

```sql
AI_COMPLETE(model_key, prompt[, parameters])

-- If you use the AI_PROMPT function, replace the prompt parameter with the AI_PROMPT function. See the AI_PROMPT function example.
AI_COMPLETE(model_key, AI_PROMPT(prompt_template, data))
```

Parameters:

| Parameter | Description | Type | Nullable |
|------|------|------|----------|
| model_key | The model registered in the database. | VARCHAR(128) | No |
| prompt | The prompt provided by the user. | VARCHAR/TEXT(LONGTEXT) | No |
| parameters | Optional configuration for the API, such as `temperature`, `top_p`, and `max_tokens`. These options vary by vendor and are added directly to the message body. Typically, you can use the default settings without specifying these options. | JSON | Yes |

Both `model_key` and `prompt` are required. If either is `NULL`, the function returns an error.
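
The optional `parameters` argument is passed through to the model provider. A hedged sketch, assuming the provider accepts `temperature` and `max_tokens`:

```sql
SELECT AI_COMPLETE('ob_complete',
                   'Summarize seekdb in one sentence.',
                   '{"temperature": 0.2, "max_tokens": 128}') AS ans;
```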

Return value:

* text: The text generated by the LLM based on the prompt.

##### Examples

1. Sentiment analysis example

```sql
SELECT AI_COMPLETE("ob_complete","Your task is to perform sentiment analysis on the provided text and determine whether the sentiment is positive or negative.
The text to analyze is as follows:
<text>
What a beautiful day!
</text>
Judgment criteria:
If the text expresses a positive sentiment, output 1; if it expresses a negative sentiment, output -1. Do not output anything else.\n") AS ans;
```

Result:

```sql
+-----+
| ans |
+-----+
| 1   |
+-----+
```

2. Translation example

```sql
CREATE TABLE comments (
  id INT AUTO_INCREMENT PRIMARY KEY,
  content TEXT
);

INSERT INTO comments (content) VALUES ('hello world!');

-- Use the CONCAT expression to splice column values from the table into the prompt, enabling batch processing of database data without copying data to and from the LLM.
SELECT AI_COMPLETE("ob_complete",
  CONCAT("You are a professional translator. Please translate the following English text into Chinese. The text to be translated is:<text>",
         content,
         "</text>")) AS ans FROM comments;
```

Result:

```sql
+--------------+
| ans          |
+--------------+
| 你好,世界!  |
+--------------+
```

3. Classification example

```sql
SELECT AI_COMPLETE("ob_complete","You are a classification expert. You will receive various issue texts and need to categorize them into the appropriate department. The department list is [\"Hardware\",\"Software\",\"Other\"]. The text to analyze is as follows:
<text>
The screen quality is terrible.
</text>") AS res;
```

Result:

```sql
+----------+
| res      |
+----------+
| Hardware |
+----------+
```

### AI_EMBED

The `AI_EMBED` function uses the `model_key` parameter to specify a registered embedding model, which converts your text data into vector representations. If the model supports multiple dimensions, you can use the `dim` parameter to specify the output dimension.

#### Use AI_EMBED

Syntax:

```sql
AI_EMBED(model_key, input[, dim])
```

Parameters:

| Parameter | Description | Type | Nullable |
|------|------|------|----------|
| model_key | The embedding model registered in your database. | VARCHAR(128) | No |
| input | The text you want to convert into a vector. | VARCHAR | No |
| dim | Specifies the output dimension of the vector. Some model providers support configuring this value. | INT64 | Yes |

Both `model_key` and `input` are required. If either is `NULL`, the function returns an error.
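
A hedged sketch of the optional `dim` argument, assuming the registered provider supports configurable output dimensions:

```sql
SELECT AI_EMBED('ob_embed', 'Hello world', 512) AS embedding;
```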

Return value:

* A string in vector format, that is, the embedding model's vector representation of your text.

#### Examples

1. Embed a single row of data.

```sql
SELECT AI_EMBED("ob_embed","Hello world") AS embedding;
```

Result:

```sql
+-----------------+
| embedding       |
+-----------------+
| [0.1, 0.2, 0.3] |
+-----------------+
```

2. Embed table columns.

```sql
CREATE TABLE comments (
  id INT AUTO_INCREMENT PRIMARY KEY,
  content TEXT
);

INSERT INTO comments (content) VALUES ('hello world!');

SELECT AI_EMBED("ob_embed",content) AS embedding FROM comments;
```

Result:

```sql
+-----------------+
| embedding       |
+-----------------+
| [0.1, 0.2, 0.3] |
+-----------------+
```

### AI_RERANK

The `AI_RERANK` function uses the `model_key` parameter to specify a registered reranking model. It organizes your query and document list according to the provider's rules, sends them to the specified model, and returns the sorted results. This function is suitable for reranking scenarios in Retrieval-Augmented Generation (RAG).

#### Use AI_RERANK

Syntax:

```sql
AI_RERANK(model_key, query, documents[, document_key])
```

Parameters:

| Parameter | Description | Type | Nullable |
|------|------|------|----------|
| model_key | The reranking model registered in your database. | VARCHAR(128) | No |
| query | The search text you want to use. | VARCHAR(1024) | No |
| documents | The list of documents to be ranked. | JSON array, for example, `'["apple", "banana"]'` | No |

The `model_key`, `query`, and `documents` parameters are required. If any of them is `NULL`, the function returns an error.

Return value:

* A JSON array containing the documents and their relevance scores, sorted in descending order by relevance.

#### Examples

```sql
SELECT AI_RERANK("ob_rerank","Apple",'["apple","banana","fruit","vegetable"]');
```

Result:

```sql
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ai_rerank("ob_rerank","Apple",'["apple","banana","fruit","vegetable"]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"index": 0, "document": {"text": "apple"}, "relevance_score": 0.9912109375}, {"index": 1, "document": {"text": "banana"}, "relevance_score": 0.0033512115478515625}, {"index": 2, "document": {"text": "fruit"}, "relevance_score": 0.0003669261932373047}, {"index": 3, "document": {"text": "vegetable"}, "relevance_score": 0.00001996755599975586}] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

## References

* [Vector embedding technology](../100.vector-search/150.vector-embedding-technology.md)
* [Privilege types in MySQL-compatible mode](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974758)
@@ -0,0 +1,248 @@

---
slug: /oceanbase-mcp-server-and-ai-tool-integration-guide
---

# OceanBase MCP Server

## Background information

AI tools are evolving rapidly, from graphical solutions like Cursor, Windsurf, and Trae to command-line options such as Claude Code, Gemini CLI, and Qwen Code. Empowered by Agent-based frameworks, these tools are remarkably capable. However, a key limitation remains: AI tools cannot directly access databases, leaving a gap between data and analysis. The MCP protocol bridges this gap. By leveraging the MCP protocol, OceanBase MCP Server enables AI tools to interact directly with databases and retrieve data seamlessly.

Traditionally, data analysis tasks, such as user analytics, product analysis, order tracking, and user behavior analysis, require developers to build backend systems for data retrieval and frontend interfaces for data visualization. Even with BI tools, some familiarity with SQL is often necessary. While data can be displayed, understanding the underlying logic or making business decisions based on that data still depends on the expertise of data analysts.

The combination of AI tools, MCP, and large language models (LLMs) is transforming the way data analysis is performed. Analysts no longer need to rely on developers or have SQL knowledge. They can simply describe their requirements to AI tools and instantly receive the results they need, complete with attractive charts and initial data insights.

## Functional architecture

### Core toolkit

OceanBase MCP Server offers standardized interfaces that enable AI tools to interact directly with the database:

| Tool | Description |
|-----------------------------|-------------|
| `execute_sql` | Executes any SQL statement (such as `SELECT`, `INSERT`, `UPDATE`, `DELETE`, and DDL statements). |
| `get_ob_ash_report` | Gets seekdb Active Session History (ASH) reports for performance diagnostics. |
| `get_current_time` | Returns the current time of the seekdb instance. |
| `oceanbase_text_search` | Performs full-text searches across seekdb database tables. |
| `oceanbase_vector_search` | Executes vector similarity searches within seekdb database tables. |
| `oceanbase_hybrid_search` | Conducts hybrid searches, combining relational filtering with vector search. |
| `ob_memory_query` | Retrieves historical conversation records from the AI memory system using semantic search. (AI Memory System Tool) |
| `ob_memory_insert` | Automatically captures and stores important conversation content to build a knowledge base. (AI Memory System Tool) |
| `ob_memory_delete` | Deletes outdated or redundant conversation memories. (AI Memory System Tool) |
| `ob_memory_update` | Updates or evolves memory content based on new information. (AI Memory System Tool) |

### Resource endpoints

AI tools can directly access these resource endpoints via the MCP protocol:

| Resource path | Description |
|--------------------------------|-------------|
| `oceanbase://tables` | Lists all tables in the database. |
| `oceanbase://sample/{table}` | Retrieves sample data (first 100 rows) from the specified table. `{table}` can be dynamically replaced with the actual table name. |
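
For example, `oceanbase://sample/orders` would return the first 100 rows of a table named `orders` (an illustrative table name).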

## Prerequisites

* You have installed Cursor or another tool that supports the MCP protocol (such as Windsurf or Qwen Code).
* You have deployed seekdb.

  * For deployment details, see [Deploy seekdb](../../400.guides/400.deploy/100.prepare-servers.md).

* You have a Python environment set up (version 3.10 to 3.12).

  * Download the Python installer from the [official Python website](https://www.python.org/downloads/).

## Procedure

### Step 1: Obtain database connection information

Contact the database administrator or deployment team to get the required database connection string. For example:

```shell
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$database_name`: The name of the database to access.
* `$user_name`: The account for connecting to the instance. Default is `root`.
* `$password`: The account password. Default is empty.

**Here is an example:**

```shell
obclient -hxxx.xxx.xxx.xxx -P2881 -uroot -p****** -Dtest
```

### Step 2: Install Python dependencies

#### Environment setup

1. Install the uv package manager.

* On macOS or Linux, run:

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```

* On Windows, run:

```shell
irm https://astral.sh/uv/install.ps1 | iex
```

* Alternatively, you can install uv using pip:

```shell
pip install uv
```

2. Verify the installation:

```shell
uv --version
```

#### Install OceanBase MCP Server

1. Choose a directory and create a virtual environment:

```shell
uv venv
```

2. Activate the virtual environment:

```shell
source .venv/bin/activate
```

3. Install OceanBase MCP Server:

```shell
uv pip install oceanbase-mcp
```

### Step 3: Configure the MCP Server environment

1. Create a `.env` file with your database connection info:

```shell
cat > .env <<EOF
OB_HOST=127.0.0.1
OB_PORT=2881
OB_USER=root
OB_PASSWORD=your_password
OB_DATABASE=test
EOF
```

2. Start the MCP Server:

```shell
# --transport: supports the stdio, streamable-http, and sse modes
# --host: 0.0.0.0 allows external access; use 127.0.0.1 for local access only
# --port: custom port (if you change it, update the later configuration accordingly)
uv run oceanbase_mcp_server --transport sse --host 0.0.0.0 --port 8000
```

### Step 4: Connect Cursor to the MCP Server

1. Use Cursor V2.0.64 as an example. Click the **Open Settings** icon in the upper right corner, select **Tools & MCP**, and click **New MCP Server**.

2. Edit the `mcp.json` configuration file:

```json
{
  "mcpServers": {
    "ob-sse": {
      "autoApprove": [],
      "disabled": false,
      "timeout": 60,
      "type": "sse",
      "url": "http://127.0.0.1:8000/sse"
    }
  }
}
```

Set `disabled` to `false` to enable the service, and make sure the port in `url` matches the one you configured in Step 3.

3. Verify the connection.

After saving the configuration, return to the **Tools & MCP** page. Then you will find the newly added MCP server.

4. Once added, Cursor will automatically use MCP tools when you ask questions in the Chat window.

## Quick start examples

Once you set up seekdb and the MCP Server, you can quickly try out data analysis capabilities. The following examples use the [Online Retail Dataset](https://www.kaggle.com/datasets/ulrikthygepedersen/online-retail-dataset) and Cursor to demonstrate how AI tools can seamlessly work with the OceanBase MCP Server for common analytics tasks.

### Customer distribution analysis

1. Instruction input:

```
Please analyze the customer data and show the distribution of customers by country.
```

2. Cursor execution process:

1. Cursor calls `execute_sql` to run an aggregation query:

```sql
SELECT Country,
       COUNT(DISTINCT CustomerID) AS customer_count
FROM dataanalysis_english.invoice_items
WHERE CustomerID IS NOT NULL
  AND Country IS NOT NULL
GROUP BY Country
ORDER BY customer_count DESC;
```

2. Cursor automatically generates a structured analysis result.

3. Further request:

```
Please convert the above results into a table.
```

4. Output:

Cursor renders the country distribution as a table.

### Best-selling products analysis

1. Instruction input:

```
Find the most popular products and show their sales performance.
```

2. Output:

Cursor summarizes the most popular products with performance insights and displays them in a ranked table or bar chart.

### Sales trend over time

1. Instruction input:

```
Analyze monthly sales trends and identify peak periods.
```

2. Output:

Cursor generates a line chart showing monthly sales trends with peak periods highlighted (for example, November and December for holiday shopping).

@@ -0,0 +1,20 @@

---
slug: /json-formatted-data-types
---

# Overview of JSON data types

seekdb supports the JavaScript Object Notation (JSON) data type in compliance with the RFC 7159 standard. You can use it to store semi-structured JSON data and access or modify the data within JSON documents.

The JSON data type offers the following advantages:

* **Automatic validation**: JSON documents stored in JSON columns are automatically validated. Invalid documents trigger an error.

* **Optimized storage format**: JSON documents stored in JSON columns are converted into an optimized format that enables fast reading and access. When the server reads a JSON value stored in binary format, it doesn't need to parse the value from text.

* **Semi-structured encoding**: This feature further reduces storage costs by splitting a JSON document into multiple sub-columns, with each sub-column encoded individually. This improves compression rates and reduces the storage space required for JSON data. For more information, see [Create a JSON value](200.create-a-json-value.md) and [Semi-structured encoding](600.json-semi-struct.md).

## References

* [Overview of JSON functions](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974794)
@@ -0,0 +1,257 @@

---
slug: /create-a-json-value
---

# Create a JSON value

A JSON value must be one of the following: an object, an array, a string, a number, a boolean value (true/false), or the null value. Note that false, true, and null must be in lowercase.

## JSON text structure

A JSON text structure includes characters, strings, numbers, and three literal names. Whitespace characters (spaces, horizontal tabs, line feeds, and carriage returns) are allowed before or after any structural character.

```
begin-array     = [ left square bracket
begin-object    = { left curly bracket
end-array       = ] right square bracket
end-object      = } right curly bracket
name-separator  = : colon
value-separator = , comma
```

### Objects

An object is represented by a pair of curly brackets containing zero or more name/value pairs (also called members). Names within an object must be unique. Each name is a string followed by a colon that separates the name from its value. Multiple name/value pairs are separated by commas.

Here is an example:

```json
{ "NAME": "SAM", "Height": 175, "Weight": 100, "Registered": false }
```

### Arrays

An array is represented by square brackets containing zero or more values (also called elements). Array elements are separated by commas, and values in an array do not need to be of the same type.

Here is an example:

```json
["abc", 10, null, true, false]
```

### Numbers

Numbers use decimal format and contain an integer component that may optionally be prefixed with a minus sign (-). This can be followed by a fractional part and/or an exponent part. Leading zeros are not allowed. The fractional part consists of a decimal point followed by one or more digits. The exponent part begins with an uppercase or lowercase letter E, optionally followed by a plus (+) or minus (-) sign and one or more digits.

Here is an example:

```json
[100, 0, -100, 100.11, -12.11, 10.22e2, -10.22e2]
```

### Strings

A string begins and ends with quotation marks ("). All Unicode characters can be placed within the quotation marks, except characters that must be escaped (including quotation marks, backslashes, and control characters).

JSON text must be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8.

Here is an example:

```json
{"Url": "http://www.example.com/image/481989943"}
```

## Create JSON values

seekdb supports the following DDL operations on JSON types:

* Create tables with JSON columns.

* Add or drop JSON columns.

* Create indexes on generated columns based on JSON columns.

* Enable semi-structured encoding when creating tables.

* Enable semi-structured encoding on existing tables.

### Limitations

You can create multiple JSON columns in each table, with the following limitations:

* JSON columns cannot be used as `PRIMARY KEY`, `FOREIGN KEY`, or `UNIQUE KEY`, but you can add `NOT NULL` or `CHECK` constraints (see the sketch after this list).

* JSON columns cannot have default values.

* JSON columns cannot be used as partitioning keys.

* The length of JSON data cannot exceed the length of `LONGTEXT`, and the maximum depth of each JSON object or array is 99.
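
For instance, the first limitation can be verified directly. This is a minimal sketch, and the exact error message may differ:

```sql
-- Rejected: a JSON column cannot be a primary key.
CREATE TABLE bad_tbl (docs JSON PRIMARY KEY);

-- Accepted: NOT NULL and CHECK constraints are allowed on JSON columns.
CREATE TABLE ok_tbl (id INT PRIMARY KEY, docs JSON NOT NULL CHECK (JSON_VALID(docs)));
```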

### Examples

#### Create or modify JSON columns

```sql
obclient> CREATE TABLE tbl1 (id INT PRIMARY KEY, docs JSON NOT NULL, docs1 JSON);
Query OK, 0 rows affected

obclient> ALTER TABLE tbl1 MODIFY docs JSON CHECK(docs < '{"a" : 100}');
Query OK, 0 rows affected

obclient> CREATE TABLE json_tab(
    id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'Primary key',
    json_info JSON COMMENT 'JSON data',
    json_id INT GENERATED ALWAYS AS (json_info -> '$.id') COMMENT 'Virtual field from JSON data',
    json_name VARCHAR(5) GENERATED ALWAYS AS (json_info -> '$.NAME'),
    INDEX json_info_id_idx (json_id)
) COMMENT 'Example JSON table';
Query OK, 0 rows affected

obclient> ALTER TABLE json_tab ADD COLUMN json_info1 JSON;
Query OK, 0 rows affected

obclient> ALTER TABLE json_tab ADD INDEX (json_name);
Query OK, 0 rows affected

obclient> ALTER TABLE json_tab DROP COLUMN json_info1;
Query OK, 0 rows affected
```

#### Create an index on a specific key using a generated column

```sql
obclient> CREATE TABLE jn ( c JSON, g INT GENERATED ALWAYS AS (c->"$.id"));
Query OK, 0 rows affected

obclient> CREATE INDEX idx1 ON jn(g);
Query OK, 0 rows affected
Records: 0  Duplicates: 0  Warnings: 0

obclient> INSERT INTO jn (c) VALUES
  ('{"id": "1", "name": "Fred"}'), ('{"id": "2", "name": "Wilma"}'),
  ('{"id": "3", "name": "Barney"}'), ('{"id": "4", "name": "Betty"}');
Query OK, 4 rows affected
Records: 4  Duplicates: 0  Warnings: 0

obclient> SELECT c->>"$.name" AS name FROM jn WHERE g <= 2;
+-------+
| name  |
+-------+
| Fred  |
| Wilma |
+-------+
2 rows in set

obclient> EXPLAIN SELECT c->>"$.name" AS name FROM jn WHERE g <= 2\G
*************************** 1. row ***************************
Query Plan: =========================================
|ID|OPERATOR  |NAME    |EST. ROWS|COST|
-----------------------------------------
|0 |TABLE SCAN|jn(idx1)|2        |92  |
=========================================

Outputs & filters:
-------------------------------------
  0 - output([JSON_UNQUOTE(JSON_EXTRACT(jn.c, '$.name'))]), filter(nil),
      access([jn.c]), partitions(p0)

1 row in set
```

#### Use semi-structured encoding

seekdb supports enabling semi-structured encoding when creating tables, primarily controlled by the table-level parameter `SEMISTRUCT_PROPERTIES`. You must also set `ROW_FORMAT=COMPRESSED` for the table; otherwise an error occurs:

* When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)`, the table is considered a semi-structured table, meaning all JSON columns in the table have semi-structured encoding enabled.
* When `SEMISTRUCT_PROPERTIES=(encoding_type=none)`, the table is considered a structured table.
* You can also set the frequency threshold using the `freq_threshold` parameter.
* Currently, `encoding_type` and `freq_threshold` can only be modified using online DDL statements, not offline DDL statements.

1. Enable semi-structured encoding.

:::tip
If you enable semi-structured encoding, make sure that the parameter <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971939">micro_block_merge_verify_level</a> is set to the default value <code>2</code>. Do not disable micro-block major compaction verification.
:::

:::tab
tab Example: Enable semi-structured encoding during table creation

```sql
CREATE TABLE t1 (j JSON)
ROW_FORMAT = COMPRESSED
SEMISTRUCT_PROPERTIES = (encoding_type=encoding, freq_threshold=50);
```

For more information about the syntax, see [CREATE TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974140).

tab Example: Enable semi-structured encoding for an existing table

```sql
CREATE TABLE t1 (j JSON);
ALTER TABLE t1 SET ROW_FORMAT = COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=encoding, freq_threshold=50);
```

For more information about the syntax, see [ALTER TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974126).

Some limitations on modification:

* If semi-structured encoding is not enabled, modifying the frequent-column threshold does not report an error but has no effect.
* The `freq_threshold` parameter cannot be modified during direct load operations or when the table is locked.
* Modifying one sub-parameter does not affect the others.
:::

2. Disable semi-structured encoding.

When `SEMISTRUCT_PROPERTIES` is set to `(encoding_type=none)`, semi-structured encoding is disabled. This operation does not affect existing data and only applies to data written afterward. Here is an example of disabling semi-structured encoding:

```sql
ALTER TABLE t1 SET ROW_FORMAT = COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=none);
```

3. Query the semi-structured encoding configuration.

Use the `SHOW CREATE TABLE` statement to query the semi-structured encoding configuration. Here is an example statement:

```sql
SHOW CREATE TABLE t1;
```

The result is as follows:

```shell
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1    | CREATE TABLE `t1` (
  `j` json DEFAULT NULL
) ORGANIZATION INDEX DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPRESSED COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE ENABLE_MACRO_BLOCK_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 SEMISTRUCT_PROPERTIES=(ENCODING_TYPE=ENCODING, FREQ_THRESHOLD=50) |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```

When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)` is specified, the query displays this parameter, indicating that semi-structured encoding is enabled.

Using semi-structured encoding can improve the performance of conditional filtering queries with the [JSON_VALUE() function](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975890). Based on JSON semi-structured encoding technology, seekdb optimizes the performance of `JSON_VALUE` expression conditional filtering query scenarios. Since JSON data is split into sub-columns, the system can filter directly on the encoded sub-column data without reconstructing the complete JSON structure, significantly improving query efficiency.

Here is an example query:

```sql
-- Query rows where the value of the name field is 'Devin'.
SELECT * FROM t WHERE JSON_VALUE(j_doc, '$.name' RETURNING CHAR) = 'Devin';
```

Character set considerations:

- seekdb uses `utf8_bin` encoding for JSON.

- To ensure string white-box filtering works properly, we recommend the following settings:

```sql
SET @@collation_server = 'utf8mb4_bin';
SET @@collation_connection = 'utf8mb4_bin';
```
@@ -0,0 +1,174 @@

---
slug: /querying-and-modifying-json-values
---

# Query and modify JSON values

seekdb supports querying and referencing JSON values. Using path expressions, you can extract or modify specific portions of a JSON document.

## Reference JSON values

seekdb provides two methods for querying and referencing JSON values:

* Use the `->` operator to return a key's value with double quotes in JSON data.

* Use the `->>` operator to return a key's value without double quotes in JSON data.

Examples:

```sql
obclient> SELECT c->"$.name" AS name FROM jn WHERE g <= 2;
+---------+
| name    |
+---------+
| "Fred"  |
| "Wilma" |
+---------+
2 rows in set

obclient> SELECT c->>"$.name" AS name FROM jn WHERE g <= 2;
+-------+
| name  |
+-------+
| Fred  |
| Wilma |
+-------+
2 rows in set

obclient> SELECT JSON_UNQUOTE(c->'$.name') AS name FROM jn WHERE g <= 2;
+-------+
| name  |
+-------+
| Fred  |
| Wilma |
+-------+
2 rows in set
```

Because JSON documents are hierarchical, JSON functions use path expressions to extract or modify portions of a document and to specify where in the document the operation should occur.

seekdb uses a path syntax consisting of a leading `$` character, which represents the JSON document being accessed, followed by selectors. The selector types are as follows:

* The `.` symbol is followed by the key name to access. Key names that are not valid unquoted names in path expressions (for example, names containing spaces) must be enclosed in double quotes.

Example:

```sql
obclient> SELECT JSON_EXTRACT('{"id": 14, "name": "Aztalan"}', '$.name');
+---------------------------------------------------------+
| JSON_EXTRACT('{"id": 14, "name": "Aztalan"}', '$.name') |
+---------------------------------------------------------+
| "Aztalan"                                               |
+---------------------------------------------------------+
1 row in set
```

* The `[N]` symbol is placed after the path of the selected array and represents the value at position N in the array, where N is a non-negative integer. Array positions are zero-indexed. If `path` does not select an array value, then `path[0]` evaluates to the same value as `path`.

Example:

```sql
obclient> SELECT JSON_SET('"x"', '$[0]', 'a');
+------------------------------+
| JSON_SET('"x"', '$[0]', 'a') |
+------------------------------+
| "a"                          |
+------------------------------+
1 row in set
```

* The `[M to N]` symbol specifies a subset or range of array values, starting from position M and ending at position N.

Example:

```sql
obclient> SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[1 to 3]');
+----------------------------------------------+
| JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[1 to 3]') |
+----------------------------------------------+
| [2, 3, 4]                                    |
+----------------------------------------------+
1 row in set
```

* Path expressions can also include `*` or `**` wildcard characters:

* `.*` represents the values of all members in a JSON object.

* `[*]` represents the values of all elements in a JSON array.

* `prefix**suffix` represents all paths that begin with the specified prefix and end with the specified suffix. The prefix is optional, but the suffix is required. Using `**` or `***` alone to match arbitrary paths is not allowed.

:::info
Paths that do not exist in the document (evaluating to non-existent data) evaluate to <code>NULL</code>.
:::

## Modify JSON values

seekdb also supports modifying complete JSON values using DML statements, and modifying partial JSON values by using the JSON_SET(), JSON_REPLACE(), or JSON_REMOVE() functions in `UPDATE` statements.

Examples:

```sql
-- Insert complete data.
INSERT INTO json_tab(json_info) VALUES ('[1, {"a": "b"}, [2, "qwe"]]');

-- Update partial data (append an element).
UPDATE json_tab SET json_info=JSON_ARRAY_APPEND(json_info, '$', 2) WHERE id=1;

-- Update complete data.
UPDATE json_tab SET json_info='[1, {"a": "b"}]';

-- Update partial data.
UPDATE json_tab SET json_info=JSON_REPLACE(json_info, '$[2]', 'aaa') WHERE id=1;

-- Delete data.
DELETE FROM json_tab WHERE id=1;

-- Update partial data using a function.
UPDATE json_tab SET json_info=JSON_REMOVE(json_info, '$[2]') WHERE id=1;
```

## JSON path syntax

A path consists of a scope and one or more path segments. For paths used in JSON functions, the scope is the document being searched or otherwise operated on, represented by the leading `$` character.

Path segments are separated by periods (.). Array elements are represented by `[N]`, where N is a non-negative integer. Key names must be either double-quoted strings or valid ECMAScript identifiers.

Path expressions (like JSON text) should be encoded using the ascii, utf8, or utf8mb4 character set. Other character encodings are implicitly converted to utf8mb4.

The complete syntax is as follows:

```
pathExpression:            // Path expression
    scope[(pathLeg)*]      // Scope is represented by the leading $ character

pathLeg:
    member | arrayLocation | doubleAsterisk

member:
    period ( keyName | asterisk )

arrayLocation:
    leftBracket ( nonNegativeInteger | asterisk ) rightBracket

keyName:
    ESIdentifier | doubleQuotedString

doubleAsterisk:
    '**'

period:
    '.'

asterisk:
    '*'

leftBracket:
    '['

rightBracket:
    ']'
```
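
For example, the path `$.items[2].name` consists of the scope `$`, the member leg `.items`, the array location `[2]`, and the member leg `.name` (`items` and `name` are illustrative key names).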
@@ -0,0 +1,54 @@
|
||||
---
|
||||
|
||||
slug: /json-formatted-data-type-conversion
|
||||
---
|
||||
|
||||
# Convert JSON data types
|
||||
|
||||
seekdb supports the CAST function for converting between JSON and other data types.
|
||||
|
||||
The following table describes the conversion rules for JSON data types.
|
||||
|
||||
| Other data types | CAST(other_type AS JSON) | CAST(JSON AS other_type) |
|
||||
|-------------------------------------|---------------------------------------------|----------------------------------------------------------|
|
||||
| JSON | No change. | No change. |
|
||||
| UTF-8 character types (including utf8mb4, utf8, and ascii) | The characters are converted to JSON values and validated. | The data is serialized into utf8mb4 strings. |
|
||||
| Other character sets | First converted to utf8mb4 encoding, then processed as UTF-8 character type. | First serialized into utf8mb4-encoded strings, then converted to the corresponding character set. |
|
||||
| NULL | `NULL` is returned, as shown in the example below. | Not applicable. |
|
||||
| Other types | Only scalar values are converted to JSON values containing that single value. | If the JSON value contains only one scalar value that matches the target type, it is converted to the corresponding type; otherwise, NULL is returned and a warning is issued. |
|
||||
|
||||
:::info
|
||||
<code>other_type</code> specifies a data type other than JSON.
|
||||
:::
|
||||
|
||||
Here are some conversion examples:
|
||||
|
||||
```sql
|
||||
obclient> SELECT CAST("123" AS JSON);
|
||||
+---------------------+
|
||||
| CAST("123" AS JSON) |
|
||||
+---------------------+
|
||||
| 123 |
|
||||
+---------------------+
|
||||
1 row in set
|
||||
|
||||
obclient> SELECT CAST(null AS JSON);
|
||||
+--------------------+
|
||||
| CAST(null AS JSON) |
|
||||
+--------------------+
|
||||
| NULL |
|
||||
+--------------------+
|
||||
1 row in set
|
||||
|
||||
CREATE TABLE tj1 (c1 JSON, c2 VARCHAR(20));
|
||||
INSERT INTO tj1 VALUES ('{"id": 17, "color": "red"}','apple'),('{"id": 18, "color": "yellow"}', 'banana'),('{"id": 16, "color": "orange"}','orange');
|
||||
obclient> SELECT * FROM tj1 ORDER BY CAST(JSON_EXTRACT(c1, '$.id') AS UNSIGNED);
|
||||
+-------------------------------+--------+
|
||||
| c1 | c2 |
|
||||
+-------------------------------+--------+
|
||||
| {"id": 16, "color": "orange"} | orange |
|
||||
| {"id": 17, "color": "red"} | apple |
|
||||
| {"id": 18, "color": "yellow"} | banana |
|
||||
+-------------------------------+--------+
|
||||
3 rows in set
|
||||
```
|
||||
@@ -0,0 +1,328 @@
|
||||
---
|
||||
|
||||
slug: /json-partial-update
|
||||
---
|
||||
|
||||
# Partial JSON data updates
|
||||
|
||||
seekdb supports partial JSON data updates (JSON Partial Update). When only specific fields in a JSON document need to be modified, this feature allows you to update only the changed portions without having to update the entire JSON document.
|
||||
|
||||
## Limitations
|
||||
The JSON Partial Update feature is subject to the following limitations, which are detailed in the sections below:
|
||||
* The feature is disabled by default and must be enabled through the system variable `log_row_value_options`.
* Only updates performed through specific JSON expressions (`json_set`, `json_replace`, or `json_remove`) can trigger a partial update.
* A partial update is triggered only when the JSON column data is stored as `outrow`.
|
||||
|
||||
## Enable or disable JSON Partial Update
|
||||
|
||||
The JSON Partial Update feature in seekdb is disabled by default. It is controlled by the system variable `log_row_value_options`. For more information, see [log_row_value_options](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001972193).
|
||||
|
||||
**Here are some examples:**
|
||||
|
||||
* Enable the JSON Partial Update feature.
|
||||
|
||||
* Session level:
|
||||
|
||||
```sql
|
||||
SET log_row_value_options="partial_json";
|
||||
```
|
||||
|
||||
* Global level:
|
||||
|
||||
```sql
|
||||
SET GLOBAL log_row_value_options="partial_json";
|
||||
```
|
||||
|
||||
* Disable the JSON Partial Update feature.
|
||||
|
||||
* Session level:
|
||||
|
||||
```sql
|
||||
SET log_row_value_options="";
|
||||
```
|
||||
|
||||
* Global level:
|
||||
|
||||
```sql
|
||||
SET GLOBAL log_row_value_options="";
|
||||
```
|
||||
|
||||
* Query the value of `log_row_value_options`.
|
||||
|
||||
```sql
|
||||
SHOW VARIABLES LIKE 'log_row_value_options';
|
||||
```
|
||||
|
||||
The result is as follows:
|
||||
|
||||
```sql
|
||||
+-----------------------+-------+
|
||||
| Variable_name | Value |
|
||||
+-----------------------+-------+
|
||||
| log_row_value_options | |
|
||||
+-----------------------+-------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
## JSON expressions for partial updates
|
||||
|
||||
In addition to enabling the feature switch `log_row_value_options`, you must update the JSON document through specific expressions to trigger JSON Partial Update.
|
||||
|
||||
The following JSON expressions in seekdb currently support partial updates:
|
||||
|
||||
* `json_set` or `json_replace`: updates the value of a JSON field.
|
||||
* `json_remove`: deletes a JSON field.
|
||||
|
||||
:::tip
|
||||
<ol><li>Ensure that the left operand of the <code>SET</code> assignment clause and the first parameter of the JSON expression are the same and both are JSON columns in the table. For example, in <code>j = json_replace(j, '$.name', 'ab')</code>, the parameter on the left side of the equals sign and the first parameter of the JSON expression <code>json_replace</code> on the right side are both <code>j</code>.</li><li>JSON Partial Update is only triggered when the current JSON column data is stored as <code>outrow</code>. Whether data is stored as <code>outrow</code> or <code>inrow</code> is controlled by the <code>lob_inrow_threshold</code> parameter when creating the table. <code>lob_inrow_threshold</code> is used to configure the <code>INROW</code> threshold. When the LOB data size exceeds this threshold, it is stored as <code>OUTROW</code> in the LOB Meta table. The default value is 4 KB.</li></ol>
|
||||
:::
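For instance, you can lower the `INROW` threshold when creating the table so that smaller JSON documents are stored as `OUTROW` and become eligible for partial updates (a sketch; it assumes the `lob_inrow_threshold` table option is available in your version):

```sql
CREATE TABLE json_small (pk INT PRIMARY KEY, j JSON) lob_inrow_threshold = 1024;
```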
|
||||
|
||||
**Examples:**
|
||||
|
||||
1. Create a table named `json_test`.
|
||||
|
||||
```sql
|
||||
CREATE TABLE json_test(pk INT PRIMARY KEY, j JSON);
|
||||
```
|
||||
|
||||
2. Insert data.
|
||||
|
||||
```sql
|
||||
INSERT INTO json_test VALUES(1, CONCAT('{"name": "John", "content": "', repeat('x',8), '"}'));
|
||||
```
|
||||
|
||||
The result is as follows:
|
||||
|
||||
```shell
|
||||
Query OK, 1 row affected
|
||||
```
|
||||
|
||||
3. Query the data in the JSON column `j`.
|
||||
|
||||
```sql
|
||||
SELECT j FROM json_test;
|
||||
```
|
||||
|
||||
The result is as follows:
|
||||
|
||||
```shell
|
||||
+-----------------------------------------+
|
||||
| j |
|
||||
+-----------------------------------------+
|
||||
| {"name": "John", "content": "xxxxxxxx"} |
|
||||
+-----------------------------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
4. Use `json_replace` to update the value of the `name` field in the JSON column.
|
||||
|
||||
```sql
|
||||
UPDATE json_test SET j = json_replace(j, '$.name', 'ab') WHERE pk = 1;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
Query OK, 1 row affected
|
||||
Rows matched: 1 Changed: 1 Warnings: 0
|
||||
```
|
||||
|
||||
5. Query the modified data in JSON column `j`.
|
||||
|
||||
```sql
|
||||
SELECT j FROM json_test;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
+---------------------------------------+
|
||||
| j |
|
||||
+---------------------------------------+
|
||||
| {"name": "ab", "content": "xxxxxxxx"} |
|
||||
+---------------------------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
6. Use `json_set` to update the value of the `name` field in the JSON column.
|
||||
|
||||
```sql
|
||||
UPDATE json_test SET j = json_set(j, '$.name', 'cd') WHERE pk = 1;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
Query OK, 1 row affected
|
||||
Rows matched: 1 Changed: 1 Warnings: 0
|
||||
```
|
||||
|
||||
7. Query the modified data in JSON column `j`.
|
||||
|
||||
```sql
|
||||
SELECT j FROM json_test;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
+---------------------------------------+
|
||||
| j |
|
||||
+---------------------------------------+
|
||||
| {"name": "cd", "content": "xxxxxxxx"} |
|
||||
+---------------------------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
8. Use `json_remove` to delete the `name` field value in the JSON column.
|
||||
|
||||
```sql
|
||||
UPDATE json_test SET j = json_remove(j, '$.name') WHERE pk = 1;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
Query OK, 1 row affected
|
||||
Rows matched: 1 Changed: 1 Warnings: 0
|
||||
```
|
||||
|
||||
9. Query the modified data in JSON column `j`.
|
||||
|
||||
```sql
|
||||
SELECT j FROM json_test;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
+-------------------------+
|
||||
| j |
|
||||
+-------------------------+
|
||||
| {"content": "xxxxxxxx"} |
|
||||
+-------------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
## Granularity of updates
|
||||
|
||||
JSON data in seekdb is stored based on LOB storage, and LOBs in seekdb are stored in chunks at the underlying level. Therefore, the minimum data amount for each partial update is one LOB chunk. The smaller the LOB chunk, the smaller the amount of data written. A DDL syntax is provided to set the LOB chunk size, which can be specified when creating a column.
|
||||
|
||||
**Example:**
|
||||
|
||||
```sql
|
||||
CREATE TABLE json_test(pk INT PRIMARY KEY, j JSON CHUNK '4k');
|
||||
```
|
||||
|
||||
The chunk size cannot be infinitely small, as too small a size will affect the performance of `SELECT`, `INSERT`, and `DELETE` operations. It is generally recommended to set it based on the average field size of JSON documents. If most fields are very small, you can set it to 1K. To optimize LOB type reads, seekdb stores data smaller than 4K directly as `INROW`, in which case partial update will not be performed. Partial Update is mainly intended to improve the performance of updating large documents; for small documents, full updates actually perform better.
|
||||
|
||||
## Rebuild
|
||||
|
||||
JSON Partial Update does not impose restrictions on the data length before and after updating a JSON column. When the length of the new value is less than or equal to the length of the old value, the data at the original location is directly replaced with the new data. When the length of the new value is greater than the length of the old value, the new data is appended at the end. seekdb sets a threshold: when the length of the appended data exceeds 30% of the original data length, a rebuild is triggered. In this case, Partial Update is not performed; instead, a full overwrite is performed.
|
||||
|
||||
You can use the `JSON_STORAGE_SIZE` expression to get the actual storage length of JSON data, and `JSON_STORAGE_FREE` to get the additional storage overhead.
|
||||
|
||||
**Example:**
|
||||
|
||||
1. Enable JSON Partial Update.
|
||||
|
||||
```sql
|
||||
SET log_row_value_options = "partial_json";
|
||||
```
|
||||
|
||||
2. Create a test table named `json_test`.
|
||||
|
||||
```sql
|
||||
CREATE TABLE json_test(pk INT PRIMARY KEY, j JSON CHUNK '1K');
|
||||
```
|
||||
|
||||
3. Insert a row of data into the `json_test` table.
|
||||
|
||||
```sql
|
||||
INSERT INTO json_test VALUES(10, json_object('name', 'zero', 'age', 100, 'position', 'software engineer', 'profile', repeat('x', 4096), 'like', json_array('a', 'b', 'c'), 'tags', json_array('sql boy', 'football', 'summer', 1), 'money', json_object('RMB', 10000, 'Dollars', 20000, 'BTC', 100), 'nickname', 'noone'));
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
Query OK, 1 row affected
|
||||
```
|
||||
|
||||
4. Use `JSON_STORAGE_SIZE` to query the storage size of the JSON column (actual occupied storage space) and `JSON_STORAGE_FREE` to estimate the storage space that can be freed from the JSON column.
|
||||
|
||||
```sql
|
||||
SELECT JSON_STORAGE_SIZE(j), JSON_STORAGE_FREE(j) FROM json_test WHERE pk = 10;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
+----------------------+----------------------+
|
||||
| JSON_STORAGE_SIZE(j) | JSON_STORAGE_FREE(j) |
|
||||
+----------------------+----------------------+
|
||||
| 4335 | 0 |
|
||||
+----------------------+----------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
Since no partial update has been performed, the value of `JSON_STORAGE_FREE` is 0.
|
||||
|
||||
5. Use `json_replace` to update the value of the `position` field in the JSON column, where the length of the new value is less than the length of the old value.
|
||||
|
||||
```sql
|
||||
UPDATE json_test SET j = json_replace(j, '$.position', 'software enginee') WHERE pk = 10;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
Query OK, 1 row affected
|
||||
Rows matched: 1 Changed: 1 Warnings: 0
|
||||
```
|
||||
|
||||
6. Again, use `JSON_STORAGE_SIZE` to query the storage size of the JSON column and `JSON_STORAGE_FREE` to estimate the storage space that can be freed from the JSON column.
|
||||
|
||||
```sql
|
||||
SELECT JSON_STORAGE_SIZE(j), JSON_STORAGE_FREE(j) FROM json_test WHERE pk = 10;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
+----------------------+----------------------+
|
||||
| JSON_STORAGE_SIZE(j) | JSON_STORAGE_FREE(j) |
|
||||
+----------------------+----------------------+
|
||||
| 4335 | 1 |
|
||||
+----------------------+----------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
After the JSON column data is updated, since the new data is one byte less than the old data, the `JSON_STORAGE_FREE` result is 1.
|
||||
|
||||
7. Use `json_replace` to update the value of the `position` field in the JSON column, where the length of the new value is greater than the length of the old value.
|
||||
|
||||
```sql
|
||||
UPDATE json_test SET j = json_replace(j, '$.position', 'software engineer') WHERE pk = 10;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
Query OK, 1 row affected
|
||||
Rows matched: 1 Changed: 1 Warnings: 0
|
||||
```
|
||||
|
||||
8. Use `JSON_STORAGE_SIZE` again to query the JSON column storage size, and `JSON_STORAGE_FREE` to estimate the storage space that can be freed from the JSON column.
|
||||
|
||||
```sql
|
||||
SELECT JSON_STORAGE_SIZE(j), JSON_STORAGE_FREE(j) FROM json_test WHERE pk = 10;
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```shell
|
||||
+----------------------+----------------------+
|
||||
| JSON_STORAGE_SIZE(j) | JSON_STORAGE_FREE(j) |
|
||||
+----------------------+----------------------+
|
||||
| 4355 | 19 |
|
||||
+----------------------+----------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
After appending new data to the JSON column, the value of `JSON_STORAGE_FREE` is 19, indicating that 19 bytes can be freed after a rebuild.
|
||||
@@ -0,0 +1,124 @@
|
||||
---
|
||||
|
||||
slug: /json-semi-struct
|
||||
---
|
||||
|
||||
# Semi-structured encoding
|
||||
|
||||
This topic describes the semi-structured encoding feature supported by seekdb.
|
||||
|
||||
seekdb supports enabling semi-structured encoding when creating tables, primarily controlled by the table-level parameter `SEMISTRUCT_PROPERTIES`. You must also set `ROW_FORMAT=COMPRESSED` for the table, otherwise an error will occur.
|
||||
|
||||
## Considerations
|
||||
|
||||
* When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)`, the table is considered a semi-structured table, meaning all JSON columns in the table will have semi-structured encoding enabled.
|
||||
* When `SEMISTRUCT_PROPERTIES=(encoding_type=none)`, the table is considered a structured table.
|
||||
* You can also set the frequency threshold using the `freq_threshold` parameter. When semi-structured encoding is enabled, the system analyzes the frequency of each path in the JSON data and stores paths with frequencies exceeding the specified threshold as independent subcolumns, known as frequent columns. For example, if you have a user table where the JSON field stores user information and 90% of users have the `name` and `age` fields, the system will automatically extract `name` and `age` as independent frequent columns. During queries, these columns are accessed directly without parsing the entire JSON, thereby improving query performance.
|
||||
* Currently, `encoding_type` and `freq_threshold` can only be modified using online DDL statements, not offline DDL statements.
|
||||
|
||||
## Data format
|
||||
|
||||
JSON data is split and stored as structured columns in a specific format. The columns split from JSON columns are called subcolumns. Subcolumns can be categorized into different types, including sparse columns and frequent columns.
|
||||
|
||||
* Sparse columns: Subcolumns that exist in some JSON documents but not in others, with an occurrence frequency lower than the threshold specified by the table-level parameter `freq_threshold`.
|
||||
* Frequent columns: Subcolumns that appear in JSON data with a frequency higher than the threshold specified by the table-level parameter `freq_threshold`. These subcolumns are stored as independent columns to improve filtering query performance.
|
||||
|
||||
For example:
|
||||
|
||||
```sql
|
||||
{"id": 1001, "name": "n1", "nickname": "nn1"}
|
||||
{"id": 1002, "name": "n2", "nickname": "nn2"}
|
||||
{"id": 1003, "name": "n3", "nickname": "nn3"}
|
||||
{"id": 1004, "name": "n4", "nickname": "nn4"}
|
||||
{"id": 1005, "name": "n5"}
|
||||
```
|
||||
|
||||
In this example, `id` and `name` are fields that exist in every JSON document with an occurrence frequency of 100%, while `nickname` exists in only four JSON documents with an occurrence frequency of 80%.
|
||||
|
||||
If `freq_threshold` is set to 100%, then `nickname` will be inferred as a sparse column, while `id` and `name` will be inferred as frequent columns. If set to 80%, then `nickname`, `id`, and `name` will all be inferred as frequent columns.
|
||||
|
||||
## Examples
|
||||
|
||||
1. Enable semi-structured encoding.
|
||||
|
||||
:::tip
|
||||
If you enable semi-structured encoding, make sure that the parameter <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971939">micro_block_merge_verify_level</a> is set to the default value <code>2</code>. Do not disable micro-block major compaction verification.
|
||||
:::
|
||||
|
||||
:::tab
|
||||
tab Example: Enable semi-structured encoding during table creation
|
||||
|
||||
```sql
|
||||
CREATE TABLE t1 (j json)
|
||||
ROW_FORMAT=COMPRESSED
|
||||
SEMISTRUCT_PROPERTIES=(encoding_type=encoding, freq_threshold=50);
|
||||
```
|
||||
|
||||
For more information about the syntax, see [CREATE TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974140).
|
||||
|
||||
tab Example: Enable semi-structured encoding for existing table
|
||||
|
||||
```sql
|
||||
CREATE TABLE t1(j json);
|
||||
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=encoding, freq_threshold=50);
|
||||
```
|
||||
|
||||
For more information about the syntax, see [ALTER TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974126).
|
||||
|
||||
Some modification limitations:
|
||||
|
||||
* If semi-structured encoding is not enabled, modifying the frequent column threshold will not report an error but will have no effect.
|
||||
* The `freq_threshold` parameter cannot be modified during direct load operations or when the table is locked.
|
||||
* Modifying one sub-parameter does not affect the others.
|
||||
:::
|
||||
|
||||
2. Disable semi-structured encoding.
|
||||
|
||||
When `SEMISTRUCT_PROPERTIES` is set to `(encoding_type=none)`, semi-structured encoding is disabled. This operation does not affect existing data and only applies to data written afterward. Here is an example of disabling semi-structured encoding:
|
||||
|
||||
```sql
|
||||
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=none);
|
||||
```
|
||||
|
||||
3. Query semi-structured encoding configuration.
|
||||
|
||||
Use the `SHOW CREATE TABLE` statement to query the semi-structured encoding configuration. Here is an example statement:
|
||||
|
||||
```sql
|
||||
SHOW CREATE TABLE t1;
|
||||
```
|
||||
|
||||
The result is as follows:
|
||||
|
||||
```shell
|
||||
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| Table | Create Table |
|
||||
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| t1 | CREATE TABLE `t1` (
|
||||
`j` json DEFAULT NULL
|
||||
) ORGANIZATION INDEX DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPRESSED COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE ENABLE_MACRO_BLOCK_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 SEMISTRUCT_PROPERTIES=(ENCODING_TYPE=ENCODING, FREQ_THRESHOLD=50) |
|
||||
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)` is specified, the query displays this parameter, indicating that semi-structured encoding is enabled.
|
||||
|
||||
Using semi-structured encoding can improve the performance of conditional filtering queries with the [JSON_VALUE() function](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975890). Based on JSON semi-structured encoding technology, seekdb optimizes the performance of `JSON_VALUE` expression conditional filtering query scenarios. Since JSON data is split into sub-columns, the system can filter directly based on the encoded sub-column data without reconstructing the complete JSON structure, significantly improving query efficiency.
|
||||
|
||||
Here is an example query:
|
||||
|
||||
```sql
|
||||
-- Query rows where the value of the name field is 'Devin'
|
||||
SELECT * FROM t WHERE JSON_VALUE(j_doc, '$.name' RETURNING CHAR) = 'Devin';
|
||||
```
|
||||
|
||||
Character set considerations:
|
||||
|
||||
- seekdb uses `utf8_bin` encoding for JSON.
|
||||
|
||||
- To ensure string whitebox filtering works properly, we recommend the following settings:
|
||||
|
||||
```sql
|
||||
SET @@collation_server = 'utf8mb4_bin';
|
||||
SET @@collation_connection = 'utf8mb4_bin';
|
||||
```
|
||||
@@ -0,0 +1,26 @@
|
||||
---
|
||||
|
||||
slug: /spatial-data-type-overview
|
||||
---
|
||||
|
||||
# Overview of spatial data types
|
||||
|
||||
The Geographic Information System (GIS) feature of seekdb includes the following spatial data types:
|
||||
|
||||
* `GEOMETRY`
|
||||
|
||||
* `POINT`
|
||||
|
||||
* `LINESTRING`
|
||||
|
||||
* `POLYGON`
|
||||
|
||||
* `MULTIPOINT`
|
||||
|
||||
* `MULTILINESTRING`
|
||||
|
||||
* `MULTIPOLYGON`
|
||||
|
||||
* `GEOMETRYCOLLECTION`
|
||||
|
||||
Among these, `POINT`, `LINESTRING`, and `POLYGON` are the three most fundamental types, used to store individual spatial data. They respectively extend into three collection types: `MULTIPOINT`, `MULTILINESTRING`, and `MULTIPOLYGON`, which are used to store collections of spatial data but can only represent collections of their respective specified base types. `GEOMETRY` is an abstract type that can represent any base type, and `GEOMETRYCOLLECTION` can be a collection of any `GEOMETRY` types.
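For example, all of the following WKT values can be stored in a `GEOMETRY` column, while a `MULTIPOINT` column could hold only the second one (an illustrative sketch; the conversion functions are covered in the spatial data formats topic):

```sql
SELECT ST_AsText(ST_GeomFromText('POINT(1 1)'));
SELECT ST_AsText(ST_GeomFromText('MULTIPOINT(0 0, 20 20, 60 60)'));
SELECT ST_AsText(ST_GeomFromText('GEOMETRYCOLLECTION(POINT(10 10), LINESTRING(15 15, 20 20))'));
```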
|
||||
@@ -0,0 +1,39 @@
|
||||
---
|
||||
|
||||
slug: /spacial-reference-system
|
||||
---
|
||||
|
||||
# Spatial reference systems
|
||||
|
||||
A spatial reference system (SRS) for spatial data is a coordinate-based system for defining geographic locations. The current version of seekdb only supports the default SRS provided by the system.
|
||||
|
||||
Spatial reference systems generally include the following types:
|
||||
|
||||
* Projected SRS: A projected SRS is a projection of the Earth onto a plane, essentially a flat map. The coordinate system on this plane is a Cartesian coordinate system that uses units of length (meters, feet, etc.) rather than longitude and latitude.
|
||||
|
||||
* Geographic SRS: A geographic SRS is a non-projected SRS that represents longitude-latitude (or latitude-longitude) coordinates on an ellipsoid, expressed in angular units.
|
||||
|
||||
Additionally, there is an infinitely flat Cartesian plane represented by `SRID 0`, whose axes have no assigned units. Unlike a projected SRS, it is an abstract plane with no geographic reference and does not necessarily represent the Earth. `SRID 0` is the default `SRID` for spatial data.
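For example, a geometry created without an explicit SRID uses `SRID 0` (a minimal sketch):

```sql
obclient> SELECT ST_SRID(ST_GeomFromText('POINT(1 1)'));
-- Returns 0, the default SRID.
```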
|
||||
|
||||
SRS content can be obtained from the `INFORMATION_SCHEMA.ST_SPATIAL_REFERENCE_SYSTEMS` table, as shown in the following example:
|
||||
|
||||
```sql
|
||||
obclient> SELECT * FROM INFORMATION_SCHEMA.ST_SPATIAL_REFERENCE_SYSTEMS
|
||||
WHERE SRS_ID = 4326\G
|
||||
*************************** 1. row ***************************
|
||||
SRS_NAME: WGS 84
|
||||
SRS_ID: 4326
|
||||
ORGANIZATION: EPSG
|
||||
ORGANIZATION_COORDSYS_ID: 4326
|
||||
DEFINITION: GEOGCS["WGS 84",DATUM["World Geodetic System 1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.017453292519943278,AUTHORITY["EPSG","9122"]],AXIS["Lat",NORTH],AXIS["Lon",EAST],AUTHORITY["EPSG","4326"]]
|
||||
DESCRIPTION: NULL
|
||||
1 row in set
|
||||
```
|
||||
|
||||
The above example describes the SRS used by GPS systems, with the name (SRS_NAME) WGS 84 and ID (SRS_ID) 4326.
|
||||
|
||||
The SRS definition in the `DEFINITION` column is a `WKT` value. WKT is defined based on Extended Backus Naur Form (EBNF). WKT can be used both as a data format (referred to as WKT-Data in this document) and for SRS definitions in GIS.
|
||||
|
||||
The `SRS_ID` value represents the same type of value as the `SRID` of a geometry value, or is passed as an `SRID` parameter to spatial functions. `SRID 0` (unitless Cartesian plane) is a special, valid spatial reference system ID that can be used for any spatial data calculations that depend on `SRID` values.
|
||||
|
||||
For calculations involving multiple geometry values, all values must have the same SRID; otherwise, an error will occur.
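For example, the following calculation mixes an `SRID 0` operand with an `SRID 4326` operand and is therefore expected to fail (an illustrative sketch; the exact error message depends on the version):

```sql
SELECT ST_Distance(
  ST_GeomFromText('POINT(0 0)', 0),
  ST_GeomFromText('POINT(1 1)', 4326)
);
-- Expected to raise an error because the two geometries have different SRIDs.
```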
|
||||
@@ -0,0 +1,45 @@
|
||||
---
|
||||
|
||||
slug: /create-spatial-columns
|
||||
---
|
||||
|
||||
# Create a spatial column
|
||||
|
||||
seekdb allows you to create a spatial column using the `CREATE TABLE` or `ALTER TABLE` statement.
|
||||
|
||||
To create a table with spatial columns using the `CREATE TABLE` statement, see the following syntax example:
|
||||
|
||||
```sql
|
||||
CREATE TABLE geom (g GEOMETRY);
|
||||
```
|
||||
|
||||
To add or remove spatial columns in an existing table using the `ALTER TABLE` statement, see the following syntax example:
|
||||
|
||||
```sql
|
||||
ALTER TABLE geom ADD pt POINT;
|
||||
ALTER TABLE geom DROP pt;
|
||||
```
|
||||
|
||||
Examples:
|
||||
|
||||
```sql
|
||||
obclient> CREATE TABLE geom (
|
||||
p POINT SRID 0,
|
||||
g GEOMETRY NOT NULL SRID 4326
|
||||
);
|
||||
Query OK, 0 rows affected
|
||||
```
|
||||
|
||||
The following constraints apply when creating spatial columns:
|
||||
|
||||
* You can explicitly specify an SRID when defining a spatial column. If no SRID is defined on the column, the optimizer will not select the spatial index during queries, but index records will still be generated during insert/update operations.
|
||||
|
||||
* A spatial index can be defined on a spatial column only after specifying the `NOT NULL` constraint and an SRID. In other words, only columns with a defined SRID can use spatial indexes.
|
||||
|
||||
* Once an SRID is defined on a spatial column, attempting to insert values with a different SRID will result in an error.
|
||||
|
||||
The following constraints apply to `SRID`:
|
||||
|
||||
* For a spatial index to take effect, you must explicitly specify an `SRID` for the spatial column.
|
||||
|
||||
* All objects in the column must have the same SRID.
|
||||
@@ -0,0 +1,71 @@
|
||||
---
|
||||
|
||||
slug: /create-spatial-indexes
|
||||
---
|
||||
|
||||
# Create a spatial index
|
||||
|
||||
seekdb allows you to create a spatial index using the `SPATIAL` keyword. When creating a table, the spatial index column must be declared as `NOT NULL`. Spatial indexes can be created on stored (STORED) generated columns, but not on virtual (VIRTUAL) generated columns.
|
||||
|
||||
## Constraints
|
||||
|
||||
* The column definition for creating a spatial index must include the `NOT NULL` constraint.
|
||||
* The column with a spatial index must have an SRID defined. Otherwise, the spatial index on this column will not take effect during queries.
|
||||
* If you create a spatial index on a STORED generated column, you must explicitly specify the `STORED` keyword in the DDL when creating the column. If neither the `VIRTUAL` nor `STORED` keyword is specified when creating a generated column, a VIRTUAL generated column is created by default.
|
||||
* After an index is created, comparisons use the coordinate system corresponding to the SRID defined in the column. Spatial indexes store the Minimum Bounding Rectangle (MBR) of geometric objects, and the comparison method for MBRs also depends on the SRID.
|
||||
|
||||
## Preparations
|
||||
|
||||
Before using the GIS feature, you need to configure GIS metadata. After connecting to the server, execute the following command to import the `default_srs_data_mysql.sql` file into the database:
|
||||
|
||||
```sql
|
||||
-- module specifies the module to import.
|
||||
-- infile specifies the relative path of the SQL file to import.
|
||||
ALTER SYSTEM LOAD MODULE DATA module=gis infile = 'etc/default_srs_data_mysql.sql';
|
||||
```
|
||||
|
||||
|
||||
|
||||
The following result indicates that the data file was successfully imported:
|
||||
|
||||
```shell
|
||||
Query OK, 0 rows affected
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
The following examples show how to create a spatial index on a regular column:
|
||||
|
||||
* Using `CREATE TABLE`:
|
||||
|
||||
```sql
|
||||
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326, SPATIAL INDEX(g));
|
||||
```
|
||||
|
||||
* Using `ALTER TABLE`:
|
||||
|
||||
```sql
|
||||
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326);
|
||||
ALTER TABLE geom ADD SPATIAL INDEX(g);
|
||||
```
|
||||
|
||||
* Using `CREATE INDEX`:
|
||||
|
||||
```sql
|
||||
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326);
|
||||
CREATE SPATIAL INDEX g ON geom (g);
|
||||
```
|
||||
|
||||
The following examples show how to drop a spatial index:
|
||||
|
||||
* Using `ALTER TABLE`:
|
||||
|
||||
```sql
|
||||
ALTER TABLE geom DROP INDEX g;
|
||||
```
|
||||
|
||||
* Using `DROP INDEX`:
|
||||
|
||||
```sql
|
||||
DROP INDEX g ON geom;
|
||||
```
|
||||
@@ -0,0 +1,275 @@
|
||||
---
|
||||
|
||||
slug: /spatial-data-format
|
||||
---
|
||||
|
||||
# Spatial data formats
|
||||
|
||||
seekdb supports two standard spatial data formats for representing geometric objects in queries:
|
||||
|
||||
* Well-Known Text (WKT)
|
||||
|
||||
* Well-Known Binary (WKB)
|
||||
|
||||
## WKT
|
||||
|
||||
WKT is defined based on Extended Backus-Naur Form (EBNF). WKT can be used both as a data format (referred to as WKT-Data in this document) and for defining spatial reference systems (SRS) in Geographic Information System (GIS) (referred to as WKT-SRS in this document).
|
||||
|
||||
### Point
|
||||
|
||||
The coordinates of a point are separated by a space, not by a comma. Example format:
|
||||
|
||||
```sql
|
||||
POINT(15 20)
|
||||
```
|
||||
|
||||
The following examples use `ST_X()` to extract the `X` coordinate from a point object. The first example generates the object directly with the `Point()` function. The second example converts a WKT representation into a point with `ST_GeomFromText()`.
|
||||
|
||||
```sql
|
||||
obclient> SELECT ST_X(Point(15, 20));
|
||||
+---------------------+
|
||||
| ST_X(Point(15, 20)) |
|
||||
+---------------------+
|
||||
| 15 |
|
||||
+---------------------+
|
||||
1 row in set
|
||||
|
||||
obclient> SELECT ST_X(ST_GeomFromText('POINT(15 20)'));
|
||||
+---------------------------------------+
|
||||
| ST_X(ST_GeomFromText('POINT(15 20)')) |
|
||||
+---------------------------------------+
|
||||
| 15 |
|
||||
+---------------------------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
### Line
|
||||
|
||||
A line consists of multiple points separated by commas. Example format:
|
||||
|
||||
```sql
|
||||
LINESTRING(0 0, 10 10, 20 25, 50 60)
|
||||
```
|
||||
|
||||
### Polygon
|
||||
|
||||
A polygon consists of one exterior ring (a closed line) and any number (possibly zero) of interior rings (closed lines). Example format:
|
||||
|
||||
```sql
|
||||
POLYGON((0 0,10 0,10 10,0 10,0 0),(5 5,7 5,7 7,5 7, 5 5))
|
||||
```
|
||||
|
||||
### MultiPoint
|
||||
|
||||
A MultiPoint consists of multiple points, similar to a line but with different semantics. Multiple connected points form a line, while discrete multiple points form a MultiPoint. Example format:
|
||||
|
||||
```sql
|
||||
MULTIPOINT(0 0, 20 20, 60 60)
|
||||
```
|
||||
|
||||
In the functions `ST_MPointFromText()` and `ST_GeomFromText()`, it is also valid to enclose each point of a MultiPoint in parentheses. Example format:
|
||||
|
||||
```sql
|
||||
ST_MPointFromText('MULTIPOINT (1 1, 2 2, 3 3)')
|
||||
ST_MPointFromText('MULTIPOINT ((1 1), (2 2), (3 3))')
|
||||
```
|
||||
|
||||
### MultiLineString
|
||||
|
||||
A MultiLineString is a collection of multiple lines. Example format:
|
||||
|
||||
```sql
|
||||
MULTILINESTRING((10 10, 20 20), (15 15, 30 15))
|
||||
```
|
||||
|
||||
### MultiPolygon
|
||||
|
||||
A MultiPolygon is a collection of multiple polygons. Example format:
|
||||
|
||||
```sql
|
||||
MULTIPOLYGON(((0 0,10 0,10 10,0 10,0 0)),((5 5,7 5,7 7,5 7, 5 5)))
|
||||
```
|
||||
|
||||
### GeometryCollection
|
||||
|
||||
A GeometryCollection can be a collection of multiple basic types and collection types.
|
||||
|
||||
```sql
|
||||
GEOMETRYCOLLECTION(POINT(10 10), POINT(30 30), LINESTRING(15 15, 20 20))
|
||||
```
|
||||
|
||||
## WKB
|
||||
|
||||
WKB is developed based on the OpenGIS specification and supports seven types (Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection) with corresponding format definitions.
|
||||
|
||||
### Point
|
||||
|
||||
Using `POINT(1 -1)` as an example, the format definition is shown in the following table.
|
||||
|
||||
| **Component** | **Specification** | **Type** | **Value** |
|
||||
| --- | --- | --- | --- |
|
||||
| **Byte order** | 1 byte | unsigned int | 01 |
|
||||
| **WKB type** | 4 bytes | unsigned int | 01000000 |
|
||||
| **X coordinate** | 8 bytes | double-precision | 000000000000F03F |
|
||||
| **Y coordinate** | 8 bytes | double-precision | 000000000000F0BF |
|
||||
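This encoding can be reproduced by converting the point to WKB and printing it as hex. Note that `ST_AsBinary()` returns the bare WKB, without the 4-byte SRID prefix shown in the storage example later in this topic:

```sql
obclient> SELECT HEX(ST_AsBinary(ST_GeomFromText('POINT(1 -1)')));
-- 0101000000000000000000F03F000000000000F0BF
-- 01: little-endian byte order; 01000000: WKB type POINT;
-- 000000000000F03F: X = 1.0; 000000000000F0BF: Y = -1.0
```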
|
||||
### LineString
|
||||
|
||||
Using `LINESTRING(1 -1, -1 1)` as an example, the format definition is shown in the following table. `Num points` must be greater than or equal to 2.
|
||||
|
||||
| **Component** | **Specification** | **Type** | **Value** |
|
||||
| --- | --- | --- | --- |
|
||||
| **Byte order** | 1 byte | unsigned int | 01 |
|
||||
| **WKB type** | 4 bytes | unsigned int | 02000000 |
|
||||
| **Num points** | 4 bytes | unsigned int | 02000000 |
|
||||
| **X coordinate** | 8 bytes | double-precision | 000000000000F03F |
|
||||
| **Y coordinate** | 8 bytes | double-precision | 000000000000F0BF |
|
||||
| **X coordinate** | 8 bytes | double-precision | 000000000000F0BF |
|
||||
| **Y coordinate** | 8 bytes | double-precision | 000000000000F03F |
|
||||
|
||||
### Polygon
|
||||
|
||||
| **Component** | **Specification** | **Type** | **Value** |
|
||||
| --- | --- | --- | --- |
|
||||
| **Byte order** | 1 byte | unsigned int | 01 |
|
||||
| **WKB type** | 4 bytes | unsigned int | 03000000 |
|
||||
| **Num rings** | 4 bytes | unsigned int | Greater than or equal to 1 |
|
||||
| **repeat ring** | - |- | - |
|
||||
|
||||
### MultiPoint
|
||||
|
||||
| **Component** | **Specification** | **Type** | **Value** |
|
||||
| --- | --- | --- | --- |
|
||||
| **Byte order** | 1 byte | unsigned int | 01 |
|
||||
| **WKB type** | 4 bytes | unsigned int | 04000000 |
|
||||
| **Num points** | 4 bytes | unsigned int | Greater than or equal to 1 |
|
||||
| **repeat POINT** | - |- | - |
|
||||
|
||||
### MultiLineString
|
||||
|
||||
| **Component** | **Specification** | **Type** | **Value** |
|
||||
| --- | --- | --- | --- |
|
||||
| **Byte order** | 1 byte | unsigned int | 01 |
|
||||
| **WKB type** | 4 bytes | unsigned int | 05000000 |
|
||||
| **Num linestrings** | 4 bytes | unsigned int | Greater than or equal to 1 |
|
||||
| **repeat LINESTRING** | - | - | - |
|
||||
|
||||
### MultiPolygon
|
||||
|
||||
| **Component** | **Specification** | **Type** | **Value** |
|
||||
| --- | --- | --- | --- |
|
||||
| **Byte order** | 1 byte | unsigned int | 01 |
|
||||
| **WKB type** | 4 bytes | unsigned int | 06000000 |
|
||||
| **Num polygons** | 4 bytes | unsigned int | Greater than or equal to 1 |
|
||||
| **repeat POLYGON** | - | - | - |
|
||||
|
||||
### GeometryCollection
|
||||
|
||||
| **Component** | **Specification** | **Type** | **Value** |
|
||||
| --- | --- | --- | --- |
|
||||
| **Byte order** | 1 byte | unsigned int | 01 |
|
||||
| **WKB type** | 4 bytes | unsigned int | 07000000 |
|
||||
| **Num wkbs** | 4 bytes | unsigned int | - |
|
||||
| **repeat WKB** | - | - | - |
|
||||
|
||||
>Note:
|
||||
>
|
||||
>* Only GeometryCollection can be empty, indicating that 0 elements are stored. All other types cannot be empty.
|
||||
>* When `LENGTH()` is applied to a GIS object, it returns the length of the stored binary data.
|
||||
|
||||
```sql
|
||||
obclient [test]> SET @g = ST_GeomFromText('POINT(1 -1)');
|
||||
Query OK, 0 rows affected
|
||||
|
||||
obclient [test]> SELECT LENGTH(@g);
|
||||
+------------+
|
||||
| LENGTH(@g) |
|
||||
+------------+
|
||||
| 25 |
|
||||
+------------+
|
||||
1 row in set
|
||||
|
||||
obclient [test]> SELECT HEX(@g);
|
||||
+----------------------------------------------------+
|
||||
| HEX(@g) |
|
||||
+----------------------------------------------------+
|
||||
| 000000000101000000000000000000F03F000000000000F0BF |
|
||||
+----------------------------------------------------+
|
||||
1 row in set
|
||||
```
|
||||
|
||||
## Syntax and geometric validity
|
||||
|
||||
### Syntax validity
|
||||
|
||||
Syntax validity must satisfy the following conditions:
|
||||
|
||||
- A linestring must have at least two points.
|
||||
- A polygon must have at least one ring.
|
||||
- A polygon must be closed (the first and last points are the same).
|
||||
- A polygon's ring must have at least four points (the smallest polygon is a triangle, where the first and last points are the same).
|
||||
- Except for GeometryCollection, other collection types cannot be empty.
|
||||
|
||||
### Geometric validity
|
||||
|
||||
Geometric validity must satisfy the following conditions:
|
||||
|
||||
- A polygon cannot intersect with itself.
|
||||
- The exterior ring of a Polygon must be outside the interior rings.
|
||||
- Multipolygons cannot contain overlapping polygons.
|
||||
|
||||
You can explicitly check the geometric validity of a geometry object using the `ST_IsValid()` function.
|
||||
|
||||
## GIS Examples
|
||||
|
||||
### Insert data
|
||||
|
||||
```sql
|
||||
-- The conversion function and the WKT literal are embedded directly in the SQL statement.
|
||||
INSERT INTO geom VALUES (ST_GeomFromText('POINT(1 1)'));
|
||||
|
||||
-- The WKT string is passed in through a user variable.
|
||||
SET @g = 'POINT(1 1)';
|
||||
INSERT INTO geom VALUES (ST_GeomFromText(@g));
|
||||
|
||||
-- The conversion result is stored in a user variable and then inserted.
|
||||
SET @g = ST_GeomFromText('POINT(1 1)');
|
||||
INSERT INTO geom VALUES (@g);
|
||||
|
||||
-- The generic conversion function ST_GeomFromText() works for all geometry types.
|
||||
SET @g = 'LINESTRING(0 0,1 1,2 2)';
|
||||
INSERT INTO geom VALUES (ST_GeomFromText(@g));
|
||||
|
||||
SET @g = 'POLYGON((0 0,10 0,10 10,0 10,0 0),(5 5,7 5,7 7,5 7, 5 5))';
|
||||
INSERT INTO geom VALUES (ST_GeomFromText(@g));
|
||||
|
||||
SET @g ='GEOMETRYCOLLECTION(POINT(1 1),LINESTRING(0 0,1 1,2 2,3 3,4 4))';
|
||||
INSERT INTO geom VALUES (ST_GeomFromText(@g));
|
||||
|
||||
-- Type-specific conversion functions are used.
|
||||
SET @g = 'POINT(1 1)';
|
||||
INSERT INTO geom VALUES (ST_PointFromText(@g));
|
||||
|
||||
SET @g = 'LINESTRING(0 0,1 1,2 2)';
|
||||
INSERT INTO geom VALUES (ST_LineStringFromText(@g));
|
||||
|
||||
SET @g = 'POLYGON((0 0,10 0,10 10,0 10,0 0),(5 5,7 5,7 7,5 7, 5 5))';
|
||||
INSERT INTO geom VALUES (ST_PolygonFromText(@g));
|
||||
|
||||
SET @g =
|
||||
'GEOMETRYCOLLECTION(POINT(1 1),LINESTRING(0 0,1 1,2 2,3 3,4 4))';
|
||||
INSERT INTO geom VALUES (ST_GeomCollFromText(@g));
|
||||
|
||||
-- Data can also be inserted directly as WKB.
|
||||
INSERT INTO geom VALUES(ST_GeomFromWKB(X'0101000000000000000000F03F000000000000F03F'));
|
||||
```
|
||||
|
||||
### Query data
|
||||
|
||||
```sql
|
||||
-- Query data and return it in WKT format.
|
||||
SELECT ST_AsText(g) FROM geom;
|
||||
|
||||
-- Query data and return it in WKB format.
|
||||
SELECT ST_AsBinary(g) FROM geom;
|
||||
```
|
||||
@@ -0,0 +1,46 @@
|
||||
---
|
||||
|
||||
slug: /char-and-varchar
|
||||
---
|
||||
|
||||
# CHAR and VARCHAR
|
||||
|
||||
`CHAR` and `VARCHAR` types are similar, but differ in how they are stored and retrieved, their maximum length, and whether trailing spaces are preserved.
|
||||
|
||||
## CHAR
|
||||
|
||||
The declared length of the `CHAR` type is the maximum number of characters that can be stored. For example, `CHAR(30)` can contain up to 30 characters.
|
||||
|
||||
Syntax:
|
||||
|
||||
```sql
|
||||
[NATIONAL] CHAR[(M)] [CHARACTER SET charset_name] [COLLATE collation_name]
|
||||
```
|
||||
|
||||
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `CHAR` becomes `BINARY`.
|
||||
|
||||
`CHAR` column length can be any value between 0 and 256. When storing `CHAR` values, they are right-padded with spaces to the specified length.
|
||||
|
||||
For `CHAR` columns, excess trailing spaces in inserted values are silently truncated regardless of the SQL mode. When retrieving `CHAR` values, trailing spaces are removed unless the `PAD_CHAR_TO_FULL_LENGTH` SQL mode is enabled.
|
||||
|
||||
## VARCHAR
|
||||
|
||||
The declared length `M` of the `VARCHAR` type is the maximum number of characters that can be stored. For example, `VARCHAR(50)` can contain up to 50 characters.
|
||||
|
||||
Syntax:
|
||||
|
||||
```sql
|
||||
[NATIONAL] VARCHAR(M) [CHARACTER SET charset_name] [COLLATE collation_name]
|
||||
```
|
||||
|
||||
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `VARCHAR` becomes `VARBINARY`.
|
||||
|
||||
`VARCHAR` column length can be specified as any value between 0 and 262144.
|
||||
|
||||
Compared with `CHAR`, `VARCHAR` values are stored as a 1-byte or 2-byte length prefix plus the data. The length prefix indicates the number of bytes in the value. If the value does not exceed 255 bytes, the column uses one byte; if the value may exceed 255 bytes, it uses two bytes.
|
||||
|
||||
For `VARCHAR` columns, trailing spaces that exceed the column length are truncated before insertion and generate a warning, regardless of the SQL mode.
|
||||
|
||||
`VARCHAR` values are not padded when stored. According to standard SQL, trailing spaces are preserved during both storage and retrieval.
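A minimal sketch that illustrates the difference in trailing-space handling (hypothetical table and data):

```sql
CREATE TABLE t_str (c CHAR(4), v VARCHAR(4));
INSERT INTO t_str VALUES ('ab  ', 'ab  ');
-- CHAR strips trailing spaces on retrieval; VARCHAR preserves them.
SELECT CONCAT('(', c, ')') AS c_quoted, CONCAT('(', v, ')') AS v_quoted FROM t_str;
-- Expected: (ab) and (ab  )
```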
|
||||
|
||||
Additionally, seekdb also supports the extended type `CHARACTER VARYING(m)`, but `VARCHAR(m)` is recommended.
|
||||
@@ -0,0 +1,64 @@
|
||||
---
|
||||
|
||||
slug: /text
|
||||
---
|
||||
|
||||
# TEXT types
|
||||
|
||||
The `TEXT` type is used to store all types of text data.
|
||||
|
||||
There are four text types: `TINYTEXT`, `TEXT`, `MEDIUMTEXT`, and `LONGTEXT`. They correspond to the four `BLOB` types and have the same maximum length and storage requirements.
|
||||
|
||||
`TEXT` values are treated as non-binary strings. They have a character set other than binary, and values are sorted and compared according to the collation rules of the character set.
|
||||
|
||||
When strict SQL mode is not enabled, if a value assigned to a `TEXT` column exceeds the column's maximum length, the portion that exceeds the length is truncated and a warning is generated. When using strict SQL mode, an error occurs (rather than a warning) if non-space characters are truncated, and the value insertion is prohibited. Regardless of the SQL mode, truncating excess trailing spaces from values inserted into `TEXT` columns always generates a warning.
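A minimal sketch of this truncation behavior (hypothetical table; the exact warning text may vary):

```sql
CREATE TABLE t_txt (t TINYTEXT);
SET sql_mode = '';                             -- non-strict mode
INSERT INTO t_txt VALUES (REPEAT('a', 300));   -- exceeds 255 bytes, so it is truncated with a warning
SHOW WARNINGS;
```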
|
||||
|
||||
## TINYTEXT
|
||||
|
||||
`TINYTEXT` is a `TEXT` type with a maximum length of 255 bytes.
|
||||
|
||||
`TINYTEXT` syntax:
|
||||
|
||||
```sql
|
||||
TINYTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
|
||||
```
|
||||
|
||||
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
|
||||
|
||||
## TEXT
|
||||
|
||||
The maximum length of a `TEXT` column is 65,535 bytes.
|
||||
|
||||
An optional length `M` can be specified for the `TEXT` type. Syntax:
|
||||
|
||||
```sql
|
||||
TEXT[(M)] [CHARACTER SET charset_name] [COLLATE collation_name]
|
||||
```
|
||||
|
||||
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
|
||||
|
||||
## MEDIUMTEXT
|
||||
|
||||
`MEDIUMTEXT` is a `TEXT` type with a maximum length of 16,777,215 bytes.
|
||||
|
||||
`MEDIUMTEXT` syntax:
|
||||
|
||||
```sql
|
||||
MEDIUMTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
|
||||
```
|
||||
|
||||
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
|
||||
|
||||
Additionally, seekdb also supports the extended type `LONG`, but `MEDIUMTEXT` is recommended.
|
||||
|
||||
## LONGTEXT
|
||||
|
||||
`LONGTEXT` is a `TEXT` type with a maximum length of 536,870,910 bytes. The effective maximum length of a `LONGTEXT` column also depends on the maximum packet size configured in the client/server protocol and available memory.
|
||||
|
||||
`LONGTEXT` syntax:
|
||||
|
||||
```sql
|
||||
LONGTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
|
||||
```
|
||||
|
||||
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
|
||||
@@ -0,0 +1,329 @@
|
||||
---
|
||||
|
||||
slug: /full-text-index
|
||||
---
|
||||
|
||||
# Full-text indexes
|
||||
|
||||
In seekdb, full-text indexes can be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types. Additionally, seekdb allows multiple full-text indexes to be created on the primary table, and multiple full-text indexes can also be created on the same column.
|
||||
|
||||
Full-text indexes can be created on both partitioned and non-partitioned tables, regardless of whether they have a primary key. The limitations for creating full-text indexes are as follows:
|
||||
|
||||
* Full-text indexes can only be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types.
|
||||
* The current version only supports creating local (`LOCAL`) full-text indexes.
|
||||
* The `UNIQUE` keyword cannot be specified when creating a full-text index.
|
||||
* If you want to create a full-text index involving multiple columns, you must ensure that these columns have the same character set.
|
||||
|
||||
By using these syntax rules and guidelines, seekdb's full-text indexing functionality provides efficient search and retrieval capabilities for text data.
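Once a full-text index exists, you can query it with `MATCH ... AGAINST` (a minimal sketch; it assumes an `articles` table with a full-text index defined on `(title, context)`, as used in the DML examples below):

```sql
-- Natural-language full-text query over the indexed columns.
SELECT * FROM articles
WHERE MATCH(title, context) AGAINST('fulltext search' IN NATURAL LANGUAGE MODE);
```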
|
||||
|
||||
## DML operations
|
||||
|
||||
For tables with full-text indexes, complex DML operations are supported, including `INSERT INTO ON DUPLICATE KEY`, `REPLACE INTO`, multi-table updates/deletes, and updatable views.
|
||||
|
||||
**Examples:**
|
||||
|
||||
* `INSERT INTO ON DUPLICATE KEY`:
|
||||
|
||||
```sql
|
||||
INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key')
|
||||
ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
|
||||
```
|
||||
|
||||
* `REPLACE INTO`:
|
||||
|
||||
```sql
|
||||
REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
|
||||
```
|
||||
|
||||
* Multi-table updates and deletes.
|
||||
|
||||
1. Create table `tbl1`.
|
||||
|
||||
```sql
|
||||
CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
|
||||
```
|
||||
|
||||
2. Create table `tbl2`.
|
||||
|
||||
```sql
|
||||
CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
|
||||
```
|
||||
|
||||
3. Perform an update (`UPDATE`) operation on multiple tables.
|
||||
|
||||
```sql
|
||||
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a
|
||||
SET tbl1.b = 'dddd', tbl2.b = 'eeee';
|
||||
```
|
||||
|
||||
```sql
|
||||
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
|
||||
```
|
||||
|
||||
```sql
|
||||
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
|
||||
```
|
||||
|
||||
4. Perform a delete (`DELETE`) operation on multiple tables.
|
||||
|
||||
```sql
|
||||
DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
|
||||
```
|
||||
|
||||
```sql
|
||||
DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
|
||||
```
|
||||
|
||||
* DML operations on updatable views.
|
||||
|
||||
1. Create view `fts_view`.
|
||||
|
||||
```sql
|
||||
CREATE VIEW fts_view AS SELECT * FROM tbl1;
|
||||
```
|
||||
|
||||
2. Perform an `INSERT` operation on the updatable view.
|
||||
|
||||
```sql
|
||||
INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
|
||||
```
|
||||
|
||||
3. Perform an `UPDATE` operation on the updatable view.
|
||||
|
||||
```sql
|
||||
UPDATE fts_view SET b = 'dddd';
|
||||
```
|
||||
|
||||
```sql
|
||||
UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a
|
||||
SET fts_view.b = 'dddd', tbl2.b = 'eeee';
|
||||
```
|
||||
|
||||
4. Perform a `DELETE` operation on the updatable view.
|
||||
|
||||
```sql
|
||||
DELETE FROM fts_view WHERE b = 'dddd';
|
||||
```
|
||||
|
||||
```sql
|
||||
DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
|
||||
```
|
||||
|
||||
## Full-text index tokenizer
|
||||
|
||||
seekdb's full-text index functionality supports multiple built-in tokenizers, helping users select the optimal text tokenization strategy based on their business scenarios. The default tokenizer is **Space**, while other tokenizers need to be explicitly specified using the `WITH PARSER` parameter.
|
||||
|
||||
**List of tokenizers**:
|
||||
|
||||
* **Space tokenizer**
|
||||
* **Basic English tokenizer**
|
||||
* **IK tokenizer**
|
||||
* **Ngram tokenizer**
|
||||
* **Ngram2 tokenizer**
|
||||
* **Jieba tokenizer**
|
||||
|
||||
**Configuration example**:
|
||||
|
||||
When creating or modifying a table, specify the tokenizer type for the full-text index by setting the `WITH PARSER tokenizer_option` parameter in the `CREATE TABLE/ALTER TABLE` statement.
|
||||
|
||||
```sql
|
||||
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
|
||||
FULLTEXT INDEX full_idx1_tbl2(name, doc)
|
||||
WITH PARSER NGRAM
|
||||
PARSER_PROPERTIES=(ngram_token_size=3));
|
||||
|
||||
|
||||
-- Add a full-text index with the specified tokenizer to an existing table.
|
||||
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx2_tbl2(name, doc)
|
||||
WITH PARSER NGRAM
|
||||
PARSER_PROPERTIES=(ngram_token_size=3); -- Ngram example
|
||||
```
|
||||
|
||||
### Space tokenizer (default)
|
||||
|
||||
**Concepts**:
|
||||
|
||||
* This tokenizer splits text using spaces, punctuation marks (such as commas, periods), or non-alphanumeric characters (except underscore `_`) as delimiters.
|
||||
* The tokenization results include only valid tokens with lengths between `min_token_size` (default 3) and `max_token_size` (default 84).
|
||||
* Chinese characters are treated as single tokens.
|
||||
|
||||
**Applicable scenarios**:
|
||||
|
||||
* Languages separated by spaces such as English (for example "apple watch series 9").
|
||||
* Chinese text with manually added delimiters (for example, "南京 长江大桥").
|
||||
|
||||
**Tokenization result**:
|
||||
|
||||
```shell
|
||||
OceanBase [(rooteoceanbase)]> select tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM,一平方公里也很小 hello-word h_name", 'space');
|
||||
+-------------------------------------------------------------------------------------------------------------+
|
||||
| tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM,一平方公里也很小 hello-word h_name", 'space') |
|
||||
+-------------------------------------------------------------------------------------------------------------+
|
||||
|["详见www", "一平方公里也很小", "xxx", "南京市长江大桥有1千米长", "邮箱xx", "word", "hello”, "h_name"] |
|
||||
+-------------------------------------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
**Example explanation**:
|
||||
|
||||
* Spaces, commas, periods, and other symbols serve as delimiters, and consecutive Chinese characters are treated as words.
|
||||
|
||||
### Basic English (Beng) tokenizer
|
||||
|
||||
**Concepts**:
|
||||
|
||||
* Similar to the Space tokenizer, but treats underscores `_` as separators instead of preserving them.
|
||||
* Suitable for separating English phrases, but has limited effectiveness in splitting terms without spaces (such as "iPhone15").
|
||||
|
||||
**Applicable scenarios**:
|
||||
|
||||
* Basic retrieval of English documents (such as logs, comments).
|
||||
|
||||
**Tokenization result**:
|
||||
|
||||
```shell
|
||||
OceanBase [(rooteoceanbase)]> select tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng');
|
||||
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng') |
|
||||
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| ["user", "log", "system", "admin", "contact", "server", "active", "visit", "status", "entry", "example", "name", "time", "response", "150ms"] |
|
||||
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
**Example explanation**:
|
||||
|
||||
* Tokens containing underscores `_` are split at the underscore (for example, `server_status` -> `server`, `status`; `user_name` -> `user`, `name`). The core difference from the Space tokenizer lies in how underscores `_` are handled.
|
||||
|
||||
### Ngram tokenizer
|
||||
|
||||
**Concepts**:
|
||||
|
||||
* **Fixed n-value tokenization**: By default, `n=2`. This tokenizer splits consecutive non-delimiter characters into subsequences of length `n`.
|
||||
* Delimiter rules follow the Space tokenizer (preserving `_`, digits, and letters).
|
||||
* **Does not support length limit parameters**, outputs all possible tokens of length `n`.
|
||||
|
||||
**Applicable scenarios**:
|
||||
|
||||
* Fuzzy matching for short text (such as user IDs, order numbers).
|
||||
* Scenarios requiring fixed-length feature extraction (such as password policy analysis).
|
||||
|
||||
**Tokenization result**:
|
||||
|
||||
```shell
|
||||
OceanBase [(root@oceanbase)]> select tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram');
|
||||
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram') |
|
||||
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| ["ab", "hn", "am", "r_", "em", "le", "po", "ma", "ou", "xy", "jo", "pl", "_d", "89", "yz", "xa", "ck", "in", "se", "tr", "oh", "12", "d1", "il", "oe", "45", "un", "ac", "co", "ex", "us", "23", "34", "or", "er", "mp", "up", "de", "su", "rt", "pp", "n_", "nt", "ki", "rd", "_a", "bc", "ng", "cc", "od", "om", "78", "ra", "ai", "do", "id"] |
|
||||
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
**Example explanation**:
|
||||
|
||||
* With the default setting `n=2`, this tokenizer outputs all consecutive 2-character tokens, including overlapping ones (for example, `ORD12345` -> `OR`, `RD`, `D1`, `12`, `23`, `34`, `45`; `user_account` -> `us`, `se`, `er`, `r_`, `_a`, `ac`, `cc`, `co`, `ou`, `un`, `nt`). See the sketch below.
|
||||
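For intuition, the fixed-length splitting described above can be sketched in a few lines of Python. This is an illustration only, not seekdb code; the `ngrams` helper is hypothetical.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    # All consecutive substrings of length n, overlapping ones included
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("ORD12345"))  # ['OR', 'RD', 'D1', '12', '23', '34', '45']
```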
|
||||
### Ngram2 tokenizer
|
||||
|
||||
**Concepts**:
|
||||
|
||||
* Supports **dynamic n-value range**: Sets token length range through `min_ngram_size` and `max_ngram_size` parameters.
|
||||
* Suitable for scenarios requiring multi-length token coverage.
|
||||
|
||||
**Applicable scenarios**: Scenarios that require multiple fixed-length tokens simultaneously.
|
||||
|
||||
:::info
|
||||
When using the ngram2 tokenizer, be aware of its high memory consumption. For example, setting a large range for the `min_ngram_size` and `max_ngram_size` parameters will generate a large number of token combinations, which may lead to excessive resource consumption.
|
||||
:::
|
||||
|
||||
**Tokenization result**:
|
||||
|
||||
```sql
|
||||
OceanBase [(root@oceanbase)]> select tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]');
|
||||
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]') |
|
||||
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| ["io", "lo", "r_lo", "_ses", "_l", "r_", "ss", "user", "ses", "_s", "ogin", "sion", "on", "ess", "20", "logi", "er_", "on_", "use", "essi", "in", "se", "sio", "log", "202", "gin_", "_2", "ssi", "ogi", "us", "n_se", "r_l", "er", "024", "es", "n_2", "og", "_lo", "n_", "_log", "2024", "n_20", "gi", "er_l", "ser", "24", "ssio", "n_s", "gin", "in_", "_se", "02", "_20", "si", "sess", "on_2", "ion_", "ser_", "ion", "_202", "in_s"] |
|
||||
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
**Example explanation**:
|
||||
|
||||
* This tokenizer outputs all consecutive subsequences with lengths between 2 and 4 characters, overlapping tokens included (for example, `user_login_session_2024` generates tokens like `us`, `use`, `user`, `se`, `ser`, `ser_`, `er_`, `er_l`, `r_lo`, `log`, `logi`, `ogin`, etc.).
|
||||
|
||||
### IK tokenizer
|
||||
|
||||
**Concepts**:
|
||||
|
||||
* A Chinese tokenizer based on the open-source IK Analyzer tool, supporting two modes:
|
||||
|
||||
* **Smart mode**: Prioritizes outputting longer words, reducing the number of splits (for example, "南京市" is not split into "南京" and "市").
|
||||
* **Max Word mode**: Outputs all possible shorter words (for example, "南京市" is split into "南京" and "市").
|
||||
|
||||
* Automatically recognizes English words, email addresses, URLs (without `://`), IP addresses, and other formats.
|
||||
|
||||
**Applicable scenarios**: Chinese word segmentation.
|
||||
|
||||
**Business scenarios**:
|
||||
|
||||
* E-commerce product description search (for example, precise matching for "华为Mate60").
|
||||
* Social media content analysis (for example, keyword extraction from user comments).
|
||||
|
||||
* **Smart mode**: Ensures that each character belongs to only one word with no overlap, and guarantees that individual words are as long as possible while minimizing the total number of words. Attempts to combine numerals and quantifiers into a single token.
|
||||
|
||||
```sql
|
||||
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
|
||||
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
|
||||
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
|["邮箱", "hello_word", "192.168.1.1", "hello-word", "长江大桥", "www.baidu.com", "www.xxx.com", "xx@ob.com", "长", "http", "1千米", "详见", "南京市", "有"] |
|
||||
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
* **max_word mode**: Allows the same character to appear in multiple tokens, producing as many candidate words as possible.
|
||||
|
||||
```sql
|
||||
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见www.xxx.com, 邮箱xx@ob.com", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
+------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见www.xxx.com, 邮箱xx@ob.com", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
+------------------------------------------------------------------------------------------------------------------------------------------+
| ["南京市", "南京", "市", "长江大桥", "长江", "大桥", "有", "1千米", "1", "千米", "千", "长", "详见", "www.xxx.com", "www", "xxx", "com", "邮箱", "xx@ob.com", "xx", "ob"] |
+------------------------------------------------------------------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
### jieba tokenizer
|
||||
|
||||
**Concept**: A tokenizer based on the open-source `jieba` tool from the Python ecosystem, supporting precise mode, full mode, and search engine mode.
|
||||
|
||||
**Features**:
|
||||
|
||||
* **Precise mode**: Strictly segments words according to the dictionary (for example, "不能" is not segmented into "不" and "能").
|
||||
* **Full mode**: Lists all possible segmentation combinations.
|
||||
* **Search engine mode**: Balances precision and recall rate (for example, "南京市长江大桥" is segmented into "南京", "市长", and "长江大桥").
|
||||
* Supports custom dictionaries and new word discovery, and is compatible with multiple languages (Chinese, English, Japanese, etc.).
|
||||
|
||||
**Applicable scenarios**:
|
||||
|
||||
* Medical/technology domain terminology analysis (e.g., precise segmentation of "人工智能").
|
||||
* Multi-language mixed text processing (e.g., social media content with mixed Chinese and English).
|
||||
|
||||
To use the jieba tokenizer plugin, you need to install it yourself. For instructions on how to compile and install it, see [Tokenizer plugin](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002414801).
|
||||
|
||||
:::tip
|
||||
The current tokenizer plugin is an experimental feature and is not recommended for use in production environments.
|
||||
:::
|
||||
|
||||
### Tokenizer selection strategy
|
||||
|
||||
| **Business scenario** | **Recommended tokenizer** | **Reason** |
|
||||
| --- | --- | --- |
|
||||
| Search for English product titles | **Space** or **Basic English** | Simple and efficient, aligns with English tokenization conventions. |
|
||||
| Retrieval of Chinese product descriptions | **IK tokenizer** | Accurately recognizes Chinese terminology, supports custom dictionaries. |
|
||||
| Fuzzy matching of logs (such as error codes) | **Ngram tokenizer** | No dictionary required, covers fuzzy query needs for text without spaces. |
|
||||
| Keyword extraction from technology papers | **jieba tokenizer** | Supports new word discovery and complex mode switching. |
|
||||
|
||||
## References
|
||||
|
||||
For more information about creating full-text indexes, see the **Create full-text indexes** section in [Create an index](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971660).
|
||||
@@ -0,0 +1,60 @@
|
||||
---
|
||||
slug: /pyseekdb-sdk-get-started
|
||||
---
|
||||
|
||||
# Get started
|
||||
|
||||
## pyseekdb
|
||||
|
||||
pyseekdb is a Python client provided by OceanBase Database. It lets you connect to seekdb in embedded mode, to a seekdb instance running in server mode, or to an OceanBase Database instance.
|
||||
|
||||
:::tip
|
||||
OceanBase Database is a fully self-developed, enterprise-level, native distributed database provided by OceanBase. It achieves financial-grade high availability on ordinary hardware and sets a new standard for automatic, lossless disaster recovery across cities with the "five IDCs across three regions" architecture. It also sets a new benchmark in the TPC-C standard test, with a single cluster scale exceeding 1,500 nodes. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.
|
||||
:::
|
||||
|
||||
pyseekdb is supported on Linux, macOS, and Windows. The supported database connection modes vary by operating system. For more information, see the table below.
|
||||
|
||||
| System | Embedded seekdb | Server mode seekdb | Server mode OceanBase Database |
|
||||
|----|---|---|---|
|
||||
| Linux | Supported | Supported | Supported |
|
||||
| macOS | Not supported | Supported | Supported |
|
||||
| Windows | Not supported | Supported | Supported |
|
||||
|
||||
On Linux, installing this client also installs seekdb in embedded mode, allowing you to connect to it directly and perform operations such as creating a database. Alternatively, you can connect to a deployed seekdb or OceanBase Database instance in client/server mode.
|
||||
|
||||
## Install pyseekdb
|
||||
|
||||
### Prerequisites
|
||||
|
||||
Make sure that your environment meets the following requirements:
|
||||
|
||||
* Operating system: Linux (glibc >= 2.28), macOS, or Windows
* Python version: Python 3.11 or later
* System architecture: x86_64 or aarch64
|
||||
|
||||
### Procedure
|
||||
|
||||
Use pip to install pyseekdb. pip automatically detects the default Python version and platform.
|
||||
|
||||
```shell
|
||||
pip install pyseekdb
|
||||
```
|
||||
|
||||
If your pip version is outdated, upgrade it before installation.
|
||||
|
||||
```bash
|
||||
pip install --upgrade pip
|
||||
```
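To verify the installation, you can open an embedded instance and list its collections. This is a minimal sanity check; embedded mode is available on Linux only, so on macOS or Windows pass the server connection parameters shown in the SDK samples instead.

```python
import pyseekdb

client = pyseekdb.Client()        # embedded mode (Linux only)
print(client.list_collections())  # expect an empty list on a fresh install
```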
|
||||
|
||||
## What to do next
|
||||
|
||||
* After installing pyseekdb, you can connect to seekdb to perform operations. For information about the APIs supported by pyseekdb, see [API Reference](../50.apis/10.api-overview.md).
|
||||
|
||||
* You can also refer to the SDK samples provided to quickly experience pyseekdb.
|
||||
|
||||
* [Simple sample](50.sdk-samples/10.pyseekdb-simple-sample.md)
|
||||
|
||||
* [Complete sample](50.sdk-samples/50.pyseekdb-complete-sample.md)
|
||||
|
||||
* [Hybrid search sample](50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)
|
||||
|
||||
@@ -0,0 +1,130 @@
|
||||
---
|
||||
slug: /pyseekdb-simple-sample
|
||||
---
|
||||
|
||||
# Simple Example
|
||||
|
||||
This example demonstrates the basic operations of embedding functions with seekdb in embedded mode, to help you understand how to use them.
|
||||
|
||||
1. Connect to seekdb.
|
||||
2. Create a collection with an embedding function.
|
||||
3. Add data using documents (vectors will be automatically generated).
|
||||
4. Query using texts (vectors will be automatically generated).
|
||||
5. Print the query results.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This example uses seekdb in embedded mode. Before using this example, make sure that you have deployed seekdb in embedded mode.
|
||||
|
||||
For information about how to deploy seekdb in embedded mode, see [Embedded Mode](../../../../400.guides/400.deploy/600.python-seekdb.md).
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
import pyseekdb
|
||||
|
||||
# ==================== Step 1: Create Client Connection ====================
|
||||
# You can use embedded mode, server mode, or OceanBase mode
|
||||
|
||||
# Embedded mode (local SeekDB)
|
||||
client = pyseekdb.Client()
|
||||
# Alternative: Server mode (connecting to remote SeekDB server)
|
||||
# client = pyseekdb.Client(
|
||||
# host="127.0.0.1",
|
||||
# port=2881,
|
||||
# database="test",
|
||||
# user="root",
|
||||
# password=""
|
||||
# )
|
||||
|
||||
# Alternative: Remote server mode (OceanBase Server)
|
||||
# client = pyseekdb.Client(
|
||||
# host="127.0.0.1",
|
||||
# port=2881,
|
||||
# tenant="test", # OceanBase default tenant
|
||||
# database="test",
|
||||
# user="root",
|
||||
# password=""
|
||||
# )
|
||||
|
||||
# ==================== Step 2: Create a Collection with Embedding Function ====================
|
||||
# A collection is like a table that stores documents with vector embeddings
|
||||
collection_name = "my_simple_collection"
|
||||
|
||||
# Create collection with default embedding function
|
||||
# The embedding function will automatically convert documents to embeddings
|
||||
collection = client.create_collection(
|
||||
name=collection_name,
|
||||
)
|
||||
|
||||
print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
|
||||
print(f"Embedding function: {collection.embedding_function}")
|
||||
|
||||
# ==================== Step 3: Add Data to Collection ====================
|
||||
# With embedding function, you can add documents directly without providing embeddings
|
||||
# The embedding function will automatically generate embeddings from documents
|
||||
|
||||
documents = [
|
||||
"Machine learning is a subset of artificial intelligence",
|
||||
"Python is a popular programming language",
|
||||
"Vector databases enable semantic search",
|
||||
"Neural networks are inspired by the human brain",
|
||||
"Natural language processing helps computers understand text"
|
||||
]
|
||||
|
||||
ids = ["id1", "id2", "id3", "id4", "id5"]
|
||||
|
||||
# Add data with documents only - embeddings will be auto-generated by embedding function
|
||||
collection.add(
|
||||
ids=ids,
|
||||
documents=documents, # embeddings will be automatically generated
|
||||
metadatas=[
|
||||
{"category": "AI", "index": 0},
|
||||
{"category": "Programming", "index": 1},
|
||||
{"category": "Database", "index": 2},
|
||||
{"category": "AI", "index": 3},
|
||||
{"category": "NLP", "index": 4}
|
||||
]
|
||||
)
|
||||
|
||||
print(f"\nAdded {len(documents)} documents to collection")
|
||||
print("Note: Embeddings were automatically generated from documents using the embedding function")
|
||||
|
||||
# ==================== Step 4: Query the Collection ====================
|
||||
# With embedding function, you can query using text directly
|
||||
# The embedding function will automatically convert query text to query vector
|
||||
|
||||
# Query using text - query vector will be auto-generated by embedding function
|
||||
query_text = "artificial intelligence and machine learning"
|
||||
|
||||
results = collection.query(
|
||||
query_texts=query_text, # Query text - will be embedded automatically
|
||||
n_results=3 # Return top 3 most similar documents
|
||||
)
|
||||
|
||||
print(f"\nQuery: '{query_text}'")
|
||||
print(f"Query results: {len(results['ids'][0])} items found")
|
||||
|
||||
# ==================== Step 5: Print Query Results ====================
|
||||
for i in range(len(results['ids'][0])):
|
||||
print(f"\nResult {i+1}:")
|
||||
print(f" ID: {results['ids'][0][i]}")
|
||||
print(f" Distance: {results['distances'][0][i]:.4f}")
|
||||
if results.get('documents'):
|
||||
print(f" Document: {results['documents'][0][i]}")
|
||||
if results.get('metadatas'):
|
||||
print(f" Metadata: {results['metadatas'][0][i]}")
|
||||
|
||||
# ==================== Step 6: Cleanup ====================
|
||||
# Delete the collection
|
||||
client.delete_collection(collection_name)
|
||||
print(f"\nDeleted collection '{collection_name}'")
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
|
||||
|
||||
* [Complete Example](50.pyseekdb-complete-sample.md)
|
||||
|
||||
* [Hybrid Search Example](100.pyseekdb-hybrid-search-sample.md)
|
||||
@@ -0,0 +1,350 @@
|
||||
---
|
||||
slug: /pyseekdb-hybrid-search-sample
|
||||
---
|
||||
|
||||
# Hybrid search example
|
||||
|
||||
This example demonstrates the advantages of `hybrid_search()` over `query()`.
|
||||
|
||||
The main advantages of `hybrid_search()` are:
|
||||
|
||||
* Supports full-text search and vector similarity search simultaneously.

* Allows separate filtering conditions for full-text search and vector search.

* Combines the ranked results of both searches using the Reciprocal Rank Fusion (RRF) algorithm to improve relevance (see the sketch after this list).

* Handles complex scenarios that `query()` cannot handle.
|
||||
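The fusion step can be sketched in plain Python. The sketch below illustrates the RRF idea only and is not pyseekdb internals; the smoothing constant `k = 60` is a common default in the literature and an assumption here.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # score(d) = sum over result lists of 1 / (k + rank of d in that list)
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" ranks high in both lists, so it tops the fused ranking:
print(rrf_fuse([["d1", "d2", "d3"], ["d2", "d4", "d1"]]))  # ['d2', 'd1', 'd4', 'd3']
```

Documents that appear near the top of both the full-text ranking and the vector ranking accumulate the largest fused scores, which is why hybrid results favor items matching both criteria.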
|
||||
## Example
|
||||
|
||||
```python
|
||||
import pyseekdb
|
||||
|
||||
# Setup
|
||||
client = pyseekdb.Client()
|
||||
collection = client.get_or_create_collection(
|
||||
name="hybrid_search_demo"
|
||||
)
|
||||
|
||||
# Sample data
|
||||
documents = [
|
||||
"Machine learning is revolutionizing artificial intelligence and data science",
|
||||
"Python programming language is essential for machine learning developers",
|
||||
"Deep learning neural networks enable advanced AI applications",
|
||||
"Data science combines statistics, programming, and domain expertise",
|
||||
"Natural language processing uses machine learning to understand text",
|
||||
"Computer vision algorithms process images using deep learning techniques",
|
||||
"Reinforcement learning trains agents through reward-based feedback",
|
||||
"Python libraries like TensorFlow and PyTorch simplify machine learning",
|
||||
"Artificial intelligence systems can learn from large datasets",
|
||||
"Neural networks mimic the structure of biological brain connections"
|
||||
]
|
||||
|
||||
metadatas = [
|
||||
{"category": "AI", "topic": "machine learning", "year": 2023, "popularity": 95},
|
||||
{"category": "Programming", "topic": "python", "year": 2023, "popularity": 88},
|
||||
{"category": "AI", "topic": "deep learning", "year": 2024, "popularity": 92},
|
||||
{"category": "Data Science", "topic": "data analysis", "year": 2023, "popularity": 85},
|
||||
{"category": "AI", "topic": "nlp", "year": 2024, "popularity": 90},
|
||||
{"category": "AI", "topic": "computer vision", "year": 2023, "popularity": 87},
|
||||
{"category": "AI", "topic": "reinforcement learning", "year": 2024, "popularity": 89},
|
||||
{"category": "Programming", "topic": "python", "year": 2023, "popularity": 91},
|
||||
{"category": "AI", "topic": "general ai", "year": 2023, "popularity": 93},
|
||||
{"category": "AI", "topic": "neural networks", "year": 2024, "popularity": 94}
|
||||
]
|
||||
|
||||
ids = [f"doc_{i+1}" for i in range(len(documents))]
|
||||
collection.add(ids=ids, documents=documents, metadatas=metadatas)
|
||||
|
||||
print("=" * 100)
|
||||
print("SCENARIO 1: Keyword + Semantic Search")
|
||||
print("=" * 100)
|
||||
print("Goal: Find documents similar to 'AI research' AND containing 'machine learning'\n")
|
||||
|
||||
# query() approach
|
||||
query_result1 = collection.query(
|
||||
query_texts=["AI research"],
|
||||
where_document={"$contains": "machine learning"},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# hybrid_search() approach
|
||||
hybrid_result1 = collection.hybrid_search(
|
||||
query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
|
||||
knn={"query_texts": ["AI research"], "n_results": 10},
|
||||
rank={"rrf": {}},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
print("query() Results:")
|
||||
for i, doc_id in enumerate(query_result1['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
|
||||
print("\nhybrid_search() Results:")
|
||||
for i, doc_id in enumerate(hybrid_result1['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
|
||||
print("\nAnalysis:")
|
||||
print(" query() ranks 'Deep learning neural networks...' first because it's semantically similar to 'AI research',")
|
||||
print(" but 'machine learning' is not its primary focus. hybrid_search() correctly prioritizes documents that")
|
||||
print(" explicitly contain 'machine learning' (from full-text search) while also being semantically relevant")
|
||||
print(" to 'AI research' (from vector search). The RRF fusion ensures documents matching both criteria rank higher.")
|
||||
|
||||
print("\n" + "=" * 100)
|
||||
print("SCENARIO 2: Independent Filters for Different Search Types")
|
||||
print("=" * 100)
|
||||
print("Goal: Full-text='neural' (year=2024) + Vector='deep learning' (popularity>=90)\n")
|
||||
|
||||
# query() - same filter applies to both conditions
|
||||
query_result2 = collection.query(
|
||||
query_texts=["deep learning"],
|
||||
where={"year": {"$eq": 2024}, "popularity": {"$gte": 90}},
|
||||
where_document={"$contains": "neural"},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# hybrid_search() - different filters for each search type
|
||||
hybrid_result2 = collection.hybrid_search(
|
||||
query={"where_document": {"$contains": "neural"}, "where": {"year": {"$eq": 2024}}, "n_results": 10},
|
||||
knn={"query_texts": ["deep learning"], "where": {"popularity": {"$gte": 90}}, "n_results": 10},
|
||||
rank={"rrf": {}},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
print("query() Results (same filter for both):")
|
||||
for i, doc_id in enumerate(query_result2['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
print(f" {metadatas[idx]}")
|
||||
|
||||
print("\nhybrid_search() Results (independent filters):")
|
||||
for i, doc_id in enumerate(hybrid_result2['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
print(f" {metadatas[idx]}")
|
||||
|
||||
print("\nAnalysis:")
|
||||
print(" query() only returns 2 results because it requires documents to satisfy BOTH year=2024 AND popularity>=90")
|
||||
print(" simultaneously. hybrid_search() returns 5 results by applying year=2024 filter to full-text search")
|
||||
print(" and popularity>=90 filter to vector search independently, then fusing the results. This approach")
|
||||
print(" captures more relevant documents that might satisfy one criterion strongly while meeting the other")
|
||||
|
||||
print("\n" + "=" * 100)
|
||||
print("SCENARIO 3: Combining Multiple Search Strategies")
|
||||
print("=" * 100)
|
||||
print("Goal: Find documents about 'machine learning algorithms'\n")
|
||||
|
||||
# query() - vector search only
|
||||
query_result3 = collection.query(
|
||||
query_texts=["machine learning algorithms"],
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# hybrid_search() - combines full-text and vector
|
||||
hybrid_result3 = collection.hybrid_search(
|
||||
query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
|
||||
knn={"query_texts": ["machine learning algorithms"], "n_results": 10},
|
||||
rank={"rrf": {}},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
print("query() Results (vector similarity only):")
|
||||
for i, doc_id in enumerate(query_result3['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
|
||||
print("\nhybrid_search() Results (full-text + vector fusion):")
|
||||
for i, doc_id in enumerate(hybrid_result3['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
|
||||
print("\nAnalysis:")
|
||||
print(" query() returns 'Artificial intelligence systems...' as the result, which doesn't explicitly")
|
||||
print(" mention 'machine learning'. hybrid_search() combines full-text search (for 'machine learning')")
|
||||
print(" with vector search (for semantic similarity to 'machine learning algorithms'), ensuring that")
|
||||
print(" documents containing the exact keyword rank higher while still capturing semantically relevant content.")
|
||||
|
||||
print("\n" + "=" * 100)
|
||||
print("SCENARIO 4: Complex Multi-Criteria Search")
|
||||
print("=" * 100)
|
||||
print("Goal: Full-text='learning' (category=AI) + Vector='artificial intelligence' (year>=2023)\n")
|
||||
|
||||
# query() - limited to single search with combined filters
|
||||
query_result4 = collection.query(
|
||||
query_texts=["artificial intelligence"],
|
||||
where={"category": {"$eq": "AI"}, "year": {"$gte": 2023}},
|
||||
where_document={"$contains": "learning"},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# hybrid_search() - separate criteria for each search type
|
||||
hybrid_result4 = collection.hybrid_search(
|
||||
query={"where_document": {"$contains": "learning"}, "where": {"category": {"$eq": "AI"}}, "n_results": 10},
|
||||
knn={"query_texts": ["artificial intelligence"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
|
||||
rank={"rrf": {}},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
print("query() Results:")
|
||||
for i, doc_id in enumerate(query_result4['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
print(f" {metadatas[idx]}")
|
||||
|
||||
print("\nhybrid_search() Results:")
|
||||
for i, doc_id in enumerate(hybrid_result4['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
print(f" {metadatas[idx]}")
|
||||
|
||||
print("\nAnalysis:")
|
||||
print(" While both methods return similar documents, hybrid_search() provides better ranking by prioritizing")
|
||||
print(" documents that score highly in both full-text search (containing 'learning' with category=AI) and")
|
||||
print(" vector search (semantically similar to 'artificial intelligence' with year>=2023). The RRF fusion")
|
||||
print(" algorithm ensures that 'Deep learning neural networks...' ranks first because it strongly matches")
|
||||
print(" both search criteria, whereas query() applies filters sequentially which may not optimize ranking.")
|
||||
|
||||
print("\n" + "=" * 100)
|
||||
print("SCENARIO 5: Result Quality - RRF Fusion")
|
||||
print("=" * 100)
|
||||
print("Goal: Search for 'Python machine learning'\n")
|
||||
|
||||
# query() - single ranking
|
||||
query_result5 = collection.query(
|
||||
query_texts=["Python machine learning"],
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# hybrid_search() - RRF fusion of multiple rankings
|
||||
hybrid_result5 = collection.hybrid_search(
|
||||
query={"where_document": {"$contains": "Python"}, "n_results": 10},
|
||||
knn={"query_texts": ["Python machine learning"], "n_results": 10},
|
||||
rank={"rrf": {}},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
print("query() Results (single ranking):")
|
||||
for i, doc_id in enumerate(query_result5['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
|
||||
print("\nhybrid_search() Results (RRF fusion):")
|
||||
for i, doc_id in enumerate(hybrid_result5['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
|
||||
print("\nAnalysis:")
|
||||
print(" Both methods return identical results in this case, but hybrid_search() achieves this through RRF")
|
||||
print(" (Reciprocal Rank Fusion) which combines rankings from full-text search (for 'Python') and vector")
|
||||
print(" search (for 'Python machine learning'). RRF provides more stable and robust ranking by considering")
|
||||
print(" multiple signals, making it less sensitive to variations in individual search algorithms and ensuring")
|
||||
print(" consistent high-quality results across different query formulations.")
|
||||
|
||||
print("\n" + "=" * 100)
|
||||
print("SCENARIO 6: Different Filter Criteria for Each Search")
|
||||
print("=" * 100)
|
||||
print("Goal: Full-text='neural' (high popularity) + Vector='deep learning' (recent year)\n")
|
||||
|
||||
# query() - cannot separate filters for keyword vs semantic
|
||||
query_result6 = collection.query(
|
||||
query_texts=["deep learning"],
|
||||
where={"popularity": {"$gte": 90}, "year": {"$gte": 2023}},
|
||||
where_document={"$contains": "neural"},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# hybrid_search() - different filters for keyword search vs semantic search
|
||||
hybrid_result6 = collection.hybrid_search(
|
||||
query={"where_document": {"$contains": "neural"}, "where": {"popularity": {"$gte": 90}}, "n_results": 10},
|
||||
knn={"query_texts": ["deep learning"], "where": {"year": {"$gte": 2023}}, "n_results": 10},
|
||||
rank={"rrf": {}},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
print("query() Results:")
|
||||
for i, doc_id in enumerate(query_result6['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
print(f" {metadatas[idx]}")
|
||||
|
||||
print("\nhybrid_search() Results:")
|
||||
for i, doc_id in enumerate(hybrid_result6['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
print(f" {metadatas[idx]}")
|
||||
|
||||
print("\nAnalysis:")
|
||||
print(" query() only returns 2 results because it requires documents to satisfy BOTH popularity>=90 AND")
|
||||
print(" year>=2023 simultaneously, along with containing 'neural' and being semantically similar to")
|
||||
print(" 'deep learning'. hybrid_search() returns 5 results by applying popularity>=90 filter to full-text")
|
||||
print(" search (for 'neural') and year>=2023 filter to vector search (for 'deep learning') independently.")
|
||||
print(" The fusion then combines results from both searches, capturing documents that strongly match either")
|
||||
print(" criterion while still being relevant to the overall query intent.")
|
||||
|
||||
print("\n" + "=" * 100)
|
||||
print("SCENARIO 7: Partial Keyword Match + Semantic Similarity")
|
||||
print("=" * 100)
|
||||
print("Goal: Documents containing 'Python' + Semantically similar to 'data science'\n")
|
||||
|
||||
# query() - filter applied after vector search
|
||||
query_result7 = collection.query(
|
||||
query_texts=["data science"],
|
||||
where_document={"$contains": "Python"},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# hybrid_search() - parallel searches then fusion
|
||||
hybrid_result7 = collection.hybrid_search(
|
||||
query={"where_document": {"$contains": "Python"}, "n_results": 10},
|
||||
knn={"query_texts": ["data science"], "n_results": 10},
|
||||
rank={"rrf": {}},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
print("query() Results:")
|
||||
for i, doc_id in enumerate(query_result7['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
|
||||
print("\nhybrid_search() Results:")
|
||||
for i, doc_id in enumerate(hybrid_result7['ids'][0]):
|
||||
idx = ids.index(doc_id)
|
||||
print(f" {i+1}. {documents[idx]}")
|
||||
|
||||
print("\nAnalysis:")
|
||||
print(" query() only returns 2 results because it first performs vector search for 'data science', then")
|
||||
print(" filters to documents containing 'Python', which severely limits the result set. hybrid_search()")
|
||||
print(" returns 5 results by running full-text search (for 'Python') and vector search (for 'data science')")
|
||||
print(" in parallel, then fusing the results. This captures documents that contain 'Python' (even if not")
|
||||
print(" semantically closest to 'data science') and documents semantically similar to 'data science' (even")
|
||||
print(" if they don't contain 'Python'), providing better recall and more comprehensive results.")
|
||||
|
||||
print("\n" + "=" * 100)
|
||||
print("SUMMARY")
|
||||
print("=" * 100)
|
||||
print("""
|
||||
query() limitations:
|
||||
- Single search type (vector similarity)
|
||||
- Filters applied after search (may miss relevant docs)
|
||||
- Cannot combine full-text and vector search results
|
||||
- Same filter criteria for all conditions
|
||||
|
||||
hybrid_search() advantages:
|
||||
- Simultaneous full-text + vector search
|
||||
- Independent filters for each search type
|
||||
- Intelligent result fusion using RRF
|
||||
- Better recall for complex queries
|
||||
- Handles scenarios requiring both keyword and semantic matching
|
||||
""")
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
|
||||
|
||||
* [Simple example](10.pyseekdb-simple-sample.md)
|
||||
|
||||
* [Complete example](50.pyseekdb-complete-sample.md)
|
||||
@@ -0,0 +1,440 @@
|
||||
---
|
||||
slug: /pyseekdb-complete-sample
|
||||
---
|
||||
|
||||
# Complete Example
|
||||
|
||||
This example demonstrates the full capabilities of pyseekdb.
|
||||
|
||||
The example includes the following operations:
|
||||
|
||||
1. Connection, including all connection modes
|
||||
2. Collection management
|
||||
3. DML operations, including add, update, upsert, and delete
|
||||
4. DQL operations, including query, get, and hybrid_search
|
||||
5. Filter operators
|
||||
6. Collection information methods
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
import uuid
|
||||
import random
|
||||
import pyseekdb
|
||||
|
||||
# ============================================================================
|
||||
# PART 1: CLIENT CONNECTION
|
||||
# ============================================================================
|
||||
|
||||
# Option 1: Embedded mode (local SeekDB)
|
||||
client = pyseekdb.Client(
|
||||
#path="./seekdb",
|
||||
#database="test"
|
||||
)
|
||||
|
||||
# Option 2: Server mode (remote SeekDB server)
|
||||
# client = pyseekdb.Client(
|
||||
# host="127.0.0.1",
|
||||
# port=2881,
|
||||
# database="test",
|
||||
# user="root",
|
||||
# password=""
|
||||
# )
|
||||
|
||||
# Option 3: Remote server mode (OceanBase Server)
|
||||
# client = pyseekdb.Client(
|
||||
# host="127.0.0.1",
|
||||
# port=2881,
|
||||
# tenant="test", # OceanBase default tenant
|
||||
# database="test",
|
||||
# user="root",
|
||||
# password=""
|
||||
# )
|
||||
|
||||
# ============================================================================
|
||||
# PART 2: COLLECTION MANAGEMENT
|
||||
# ============================================================================
|
||||
|
||||
collection_name = "comprehensive_example"
|
||||
dimension = 128
|
||||
|
||||
# 2.1 Create a collection
|
||||
from pyseekdb import HNSWConfiguration
|
||||
config = HNSWConfiguration(dimension=dimension, distance='cosine')
|
||||
collection = client.get_or_create_collection(
|
||||
name=collection_name,
|
||||
configuration=config,
|
||||
embedding_function=None # Explicitly set to None since we're using custom 128-dim embeddings
|
||||
)
|
||||
|
||||
# 2.2 Check if collection exists
|
||||
exists = client.has_collection(collection_name)
|
||||
|
||||
# 2.3 Get collection object
|
||||
retrieved_collection = client.get_collection(collection_name, embedding_function=None)
|
||||
|
||||
# 2.4 List all collections
|
||||
all_collections = client.list_collections()
|
||||
|
||||
# 2.5 Get or create collection (creates if doesn't exist)
|
||||
config2 = HNSWConfiguration(dimension=64, distance='cosine')
|
||||
collection2 = client.get_or_create_collection(
|
||||
name="another_collection",
|
||||
configuration=config2,
|
||||
embedding_function=None # Explicitly set to None since we're using custom 64-dim embeddings
|
||||
)
|
||||
|
||||
# ============================================================================
|
||||
# PART 3: DML OPERATIONS - ADD DATA
|
||||
# ============================================================================
|
||||
|
||||
# Generate sample data
|
||||
random.seed(42)
|
||||
documents = [
|
||||
"Machine learning is transforming the way we solve problems",
|
||||
"Python programming language is widely used in data science",
|
||||
"Vector databases enable efficient similarity search",
|
||||
"Neural networks mimic the structure of the human brain",
|
||||
"Natural language processing helps computers understand human language",
|
||||
"Deep learning requires large amounts of training data",
|
||||
"Reinforcement learning agents learn through trial and error",
|
||||
"Computer vision enables machines to interpret visual information"
|
||||
]
|
||||
|
||||
# Generate embeddings (in real usage, use an embedding model)
|
||||
embeddings = []
|
||||
for i in range(len(documents)):
|
||||
vector = [random.random() for _ in range(dimension)]
|
||||
embeddings.append(vector)
|
||||
|
||||
ids = [str(uuid.uuid4()) for _ in documents]
|
||||
|
||||
# 3.1 Add single item
|
||||
single_id = str(uuid.uuid4())
|
||||
collection.add(
|
||||
ids=single_id,
|
||||
documents="This is a single document",
|
||||
embeddings=[random.random() for _ in range(dimension)],
|
||||
metadatas={"type": "single", "category": "test"}
|
||||
)
|
||||
|
||||
# 3.2 Add multiple items
|
||||
collection.add(
|
||||
ids=ids,
|
||||
documents=documents,
|
||||
embeddings=embeddings,
|
||||
metadatas=[
|
||||
{"category": "AI", "score": 95, "tag": "ml", "year": 2023},
|
||||
{"category": "Programming", "score": 88, "tag": "python", "year": 2022},
|
||||
{"category": "Database", "score": 92, "tag": "vector", "year": 2023},
|
||||
{"category": "AI", "score": 90, "tag": "neural", "year": 2022},
|
||||
{"category": "NLP", "score": 87, "tag": "language", "year": 2023},
|
||||
{"category": "AI", "score": 93, "tag": "deep", "year": 2023},
|
||||
{"category": "AI", "score": 85, "tag": "reinforcement", "year": 2022},
|
||||
{"category": "CV", "score": 91, "tag": "vision", "year": 2023}
|
||||
]
|
||||
)
|
||||
|
||||
# 3.3 Add with only embeddings (no documents)
|
||||
vector_only_ids = [str(uuid.uuid4()) for _ in range(2)]
|
||||
collection.add(
|
||||
ids=vector_only_ids,
|
||||
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
|
||||
metadatas=[{"type": "vector_only"}, {"type": "vector_only"}]
|
||||
)
|
||||
|
||||
# ============================================================================
|
||||
# PART 4: DML OPERATIONS - UPDATE DATA
|
||||
# ============================================================================
|
||||
|
||||
# 4.1 Update single item
|
||||
collection.update(
|
||||
ids=ids[0],
|
||||
metadatas={"category": "AI", "score": 98, "tag": "ml", "year": 2024, "updated": True}
|
||||
)
|
||||
|
||||
# 4.2 Update multiple items
|
||||
collection.update(
|
||||
ids=ids[1:3],
|
||||
documents=["Updated document 1", "Updated document 2"],
|
||||
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
|
||||
metadatas=[
|
||||
{"category": "Programming", "score": 95, "updated": True},
|
||||
{"category": "Database", "score": 97, "updated": True}
|
||||
]
|
||||
)
|
||||
|
||||
# 4.3 Update embeddings
|
||||
new_embeddings = [[random.random() for _ in range(dimension)] for _ in range(2)]
|
||||
collection.update(
|
||||
ids=ids[2:4],
|
||||
embeddings=new_embeddings
|
||||
)
|
||||
|
||||
# ============================================================================
|
||||
# PART 5: DML OPERATIONS - UPSERT DATA
|
||||
# ============================================================================
|
||||
|
||||
# 5.1 Upsert existing item (will update)
|
||||
collection.upsert(
|
||||
ids=ids[0],
|
||||
documents="Upserted document (was updated)",
|
||||
embeddings=[random.random() for _ in range(dimension)],
|
||||
metadatas={"category": "AI", "upserted": True}
|
||||
)
|
||||
|
||||
# 5.2 Upsert new item (will insert)
|
||||
new_id = str(uuid.uuid4())
|
||||
collection.upsert(
|
||||
ids=new_id,
|
||||
documents="This is a new document from upsert",
|
||||
embeddings=[random.random() for _ in range(dimension)],
|
||||
metadatas={"category": "New", "upserted": True}
|
||||
)
|
||||
|
||||
# 5.3 Upsert multiple items
|
||||
upsert_ids = [ids[4], str(uuid.uuid4())] # One existing, one new
|
||||
collection.upsert(
|
||||
ids=upsert_ids,
|
||||
documents=["Upserted doc 1", "Upserted doc 2"],
|
||||
embeddings=[[random.random() for _ in range(dimension)] for _ in range(2)],
|
||||
metadatas=[{"upserted": True}, {"upserted": True}]
|
||||
)
|
||||
|
||||
# ============================================================================
|
||||
# PART 6: DQL OPERATIONS - QUERY (VECTOR SIMILARITY SEARCH)
|
||||
# ============================================================================
|
||||
|
||||
# 6.1 Basic vector similarity query
|
||||
query_vector = embeddings[0] # Query with first document's vector
|
||||
results = collection.query(
|
||||
query_embeddings=query_vector,
|
||||
n_results=3
|
||||
)
|
||||
print(f"Query results: {len(results['ids'][0])} items")
|
||||
|
||||
# 6.2 Query with metadata filter (simplified equality)
|
||||
results = collection.query(
|
||||
query_embeddings=query_vector,
|
||||
where={"category": "AI"},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# 6.3 Query with comparison operators
|
||||
results = collection.query(
|
||||
query_embeddings=query_vector,
|
||||
where={"score": {"$gte": 90}},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# 6.4 Query with $in operator
|
||||
results = collection.query(
|
||||
query_embeddings=query_vector,
|
||||
where={"tag": {"$in": ["ml", "python", "neural"]}},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# 6.5 Query with logical operators ($or) - simplified equality
|
||||
results = collection.query(
|
||||
query_embeddings=query_vector,
|
||||
where={
|
||||
"$or": [
|
||||
{"category": "AI"},
|
||||
{"tag": "python"}
|
||||
]
|
||||
},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# 6.6 Query with logical operators ($and) - simplified equality
|
||||
results = collection.query(
|
||||
query_embeddings=query_vector,
|
||||
where={
|
||||
"$and": [
|
||||
{"category": "AI"},
|
||||
{"score": {"$gte": 90}}
|
||||
]
|
||||
},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# 6.7 Query with document filter
|
||||
results = collection.query(
|
||||
query_embeddings=query_vector,
|
||||
where_document={"$contains": "machine learning"},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# 6.8 Query with combined filters (simplified equality)
|
||||
results = collection.query(
|
||||
query_embeddings=query_vector,
|
||||
where={"category": "AI", "year": {"$gte": 2023}},
|
||||
where_document={"$contains": "learning"},
|
||||
n_results=5
|
||||
)
|
||||
|
||||
# 6.9 Query with multiple embeddings (batch query)
|
||||
batch_embeddings = [embeddings[0], embeddings[1]]
|
||||
batch_results = collection.query(
|
||||
query_embeddings=batch_embeddings,
|
||||
n_results=2
|
||||
)
|
||||
# batch_results["ids"][0] contains results for first query
|
||||
# batch_results["ids"][1] contains results for second query
|
||||
|
||||
# 6.10 Query with specific fields
|
||||
results = collection.query(
|
||||
query_embeddings=query_vector,
|
||||
include=["documents", "metadatas", "embeddings"],
|
||||
n_results=2
|
||||
)
|
||||
|
||||
# ============================================================================
|
||||
# PART 7: DQL OPERATIONS - GET (RETRIEVE BY IDS OR FILTERS)
|
||||
# ============================================================================
|
||||
|
||||
# 7.1 Get by single ID
|
||||
result = collection.get(ids=ids[0])
|
||||
# result["ids"] contains [ids[0]]
|
||||
# result["documents"] contains document for ids[0]
|
||||
|
||||
# 7.2 Get by multiple IDs
|
||||
results = collection.get(ids=ids[:3])
|
||||
# results["ids"] contains ids[:3]
|
||||
# results["documents"] contains documents for all IDs
|
||||
|
||||
# 7.3 Get by metadata filter (simplified equality)
|
||||
results = collection.get(
|
||||
where={"category": "AI"},
|
||||
limit=5
|
||||
)
|
||||
|
||||
# 7.4 Get with comparison operators
|
||||
results = collection.get(
|
||||
where={"score": {"$gte": 90}},
|
||||
limit=5
|
||||
)
|
||||
|
||||
# 7.5 Get with $in operator
|
||||
results = collection.get(
|
||||
where={"tag": {"$in": ["ml", "python"]}},
|
||||
limit=5
|
||||
)
|
||||
|
||||
# 7.6 Get with logical operators (simplified equality)
|
||||
results = collection.get(
|
||||
where={
|
||||
"$or": [
|
||||
{"category": "AI"},
|
||||
{"category": "Programming"}
|
||||
]
|
||||
},
|
||||
limit=5
|
||||
)
|
||||
|
||||
# 7.7 Get by document filter
|
||||
results = collection.get(
|
||||
where_document={"$contains": "Python"},
|
||||
limit=5
|
||||
)
|
||||
|
||||
# 7.8 Get with pagination
|
||||
results_page1 = collection.get(limit=2, offset=0)
|
||||
results_page2 = collection.get(limit=2, offset=2)
|
||||
|
||||
# 7.9 Get with specific fields
|
||||
results = collection.get(
|
||||
ids=ids[:2],
|
||||
include=["documents", "metadatas", "embeddings"]
|
||||
)
|
||||
|
||||
# 7.10 Get all data
|
||||
all_results = collection.get(limit=100)
|
||||
|
||||
# ============================================================================
|
||||
# PART 8: DQL OPERATIONS - HYBRID SEARCH
|
||||
# ============================================================================
|
||||
|
||||
# 8.1 Hybrid search with full-text and vector search
|
||||
# Note: This requires query_embeddings to be provided directly
|
||||
# In real usage, you might have an embedding function
|
||||
hybrid_results = collection.hybrid_search(
|
||||
query={
|
||||
"where_document": {"$contains": "machine learning"},
|
||||
"where": {"category": "AI"}, # Simplified equality
|
||||
"n_results": 10
|
||||
},
|
||||
knn={
|
||||
"query_embeddings": [embeddings[0]],
|
||||
"where": {"year": {"$gte": 2022}},
|
||||
"n_results": 10
|
||||
},
|
||||
rank={"rrf": {}}, # Reciprocal Rank Fusion
|
||||
n_results=5,
|
||||
include=["documents", "metadatas"]
|
||||
)
|
||||
# hybrid_results["ids"][0] contains IDs for the hybrid search
|
||||
# hybrid_results["documents"][0] contains documents for the hybrid search
|
||||
print(f"Hybrid search: {len(hybrid_results.get('ids', [[]])[0])} results")
|
||||
|
||||
# ============================================================================
|
||||
# PART 9: DML OPERATIONS - DELETE DATA
|
||||
# ============================================================================
|
||||
|
||||
# 9.1 Delete by IDs
|
||||
delete_ids = [vector_only_ids[0], new_id]
|
||||
collection.delete(ids=delete_ids)
|
||||
|
||||
# 9.2 Delete by metadata filter
|
||||
collection.delete(where={"type": {"$eq": "vector_only"}})
|
||||
|
||||
# 9.3 Delete by document filter
|
||||
collection.delete(where_document={"$contains": "Updated document"})
|
||||
|
||||
# 9.4 Delete with combined filters
|
||||
collection.delete(
|
||||
where={"category": {"$eq": "CV"}},
|
||||
where_document={"$contains": "vision"}
|
||||
)
|
||||
|
||||
# ============================================================================
|
||||
# PART 10: COLLECTION INFORMATION
|
||||
# ============================================================================
|
||||
|
||||
# 10.1 Get collection count
|
||||
count = collection.count()
|
||||
print(f"Collection count: {count} items")
|
||||
|
||||
|
||||
# 10.2 Preview first few items in collection (returns all columns by default)
|
||||
preview = collection.peek(limit=5)
|
||||
print(f"Preview: {len(preview['ids'])} items")
|
||||
for i in range(len(preview['ids'])):
|
||||
print(f" ID: {preview['ids'][i]}, Document: {preview['documents'][i]}")
|
||||
print(f" Metadata: {preview['metadatas'][i]}, Embedding dim: {len(preview['embeddings'][i]) if preview['embeddings'][i] else 0}")
|
||||
|
||||
# 10.3 Count collections in database
|
||||
collection_count = client.count_collection()
|
||||
print(f"Database has {collection_count} collections")
|
||||
|
||||
# ============================================================================
|
||||
# PART 11: CLEANUP
|
||||
# ============================================================================
|
||||
|
||||
# Delete test collections
|
||||
try:
|
||||
client.delete_collection("another_collection")
|
||||
except Exception as e:
|
||||
print(f"Could not delete 'another_collection': {e}")
|
||||
|
||||
# Delete the main collection
|
||||
client.delete_collection(collection_name)
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
* For information about the APIs supported by pyseekdb, see [API Reference](../../50.apis/10.api-overview.md).
|
||||
|
||||
* [Simple Example](../50.sdk-samples/10.pyseekdb-simple-sample.md)
|
||||
|
||||
* [Hybrid Search Example](../50.sdk-samples/100.pyseekdb-hybrid-search-sample.md)
|
||||
@@ -0,0 +1,66 @@
|
||||
---
|
||||
slug: /api-overview
|
||||
---
|
||||
|
||||
# API Reference
|
||||
|
||||
seekdb allows you to perform database operations through its APIs.
|
||||
|
||||
## APIs
|
||||
|
||||
The following APIs are supported.
|
||||
|
||||
### Database
|
||||
|
||||
:::info
|
||||
You can use these APIs only when you connect to seekdb by using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../50.apis/100.admin-client.md).
|
||||
:::
|
||||
|
||||
| API | Description | Documentation |
|
||||
|---|---|---|
|
||||
| `create_database()` | Creates a database. | [Documentation](110.database/200.create-database-of-api.md) |
|
||||
| `get_database()` | Retrieves a specified database. |[Documentation](110.database/300.get-database-of-api.md)|
|
||||
| `list_databases()` | Retrieves a list of databases in an instance. |[Documentation](110.database/400.list-database-of-api.md)|
|
||||
| `delete_database()` | Deletes a specified database.|[Documentation](110.database/500.delete-database-of-api.md)|
|
||||
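A condensed sketch of these calls is shown below. The name-only signatures are assumptions for illustration; see the linked documentation pages for the exact parameters.

```python
import pyseekdb

admin = pyseekdb.AdminClient()    # see Admin Client for connection options
admin.create_database("demo_db")  # hypothetical name-only signature
print(admin.get_database("demo_db"))
print(admin.list_databases())
admin.delete_database("demo_db")
```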
|
||||
|
||||
### Collection
|
||||
|
||||
:::info
|
||||
You can use these APIs only when you connect to seekdb by using the `Client`. For more information about the `Client`, see [Client](../50.apis/50.client.md).
|
||||
:::
|
||||
|
||||
| API | Description | Documentation |
|
||||
|---|---|---|
|
||||
| `create_collection()` | Creates a collection. | [Documentation](200.collection/100.create-collection-of-api.md) |
|
||||
| `get_collection()` | Retrieves a specified collection. |[Documentation](200.collection/200.get-collection-of-api.md)|
|
||||
| `get_or_create_collection()` | Creates or queries a collection. If the collection does not exist in the database, it is created. If the collection exists, the corresponding result is obtained. |[Documentation](200.collection/250.get-or-create-collection-of-api.md)|
|
||||
| `list_collections()` | Retrieves the collection list in a database. |[Documentation](200.collection/300.list-collection-of-api.md)|
|
||||
| `count_collection()` | Counts the number of collections in a database. |[Documentation](200.collection/350.count-collection-of-api.md)|
|
||||
| `delete_collection()` | Deletes a specified collection.|[Documentation](200.collection/400.delete-collection-of-api.md)|
|
||||
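These calls are demonstrated together in the complete SDK sample; a condensed sketch (embedded mode assumed, collection configuration omitted):

```python
import pyseekdb

client = pyseekdb.Client()                          # embedded mode
col = client.get_or_create_collection(name="demo")  # create or fetch
print(client.get_collection("demo"))                # look up by name
print(client.list_collections())                    # all collections
print(client.count_collection())                    # number of collections
client.delete_collection("demo")                    # clean up
```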
|
||||
|
||||
### DML
|
||||
|
||||
:::info
|
||||
You can use these APIs only when you connect to seekdb by using the `Client`. For more information about the `Client`, see [Client](../50.apis/50.client.md).
|
||||
:::
|
||||
|
||||
| API | Description | Documentation |
|
||||
|---|---|---|
|
||||
| `add()` | Inserts a new record into a collection. | [Documentation](300.dml/200.add-data-of-api.md) |
|
||||
| `update()` | Updates an existing record in a collection. |[Documentation](300.dml/300.update-data-of-api.md)|
|
||||
| `upsert()` | Inserts a new record or updates an existing record. |[Documentation](300.dml/400.upsert-data-of-api.md)|
|
||||
| `delete()` | Deletes a record from a collection.|[Documentation](300.dml/500.delete-data-of-api.md)|
|
||||
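A condensed sketch of the four operations, assuming a collection `col` created as in the SDK samples; the 3-dimensional vectors are placeholders for real embeddings:

```python
col.add(ids="doc1", documents="hello world", embeddings=[0.1, 0.2, 0.3])
col.update(ids="doc1", metadatas={"lang": "en"})
col.upsert(ids="doc2", documents="second doc", embeddings=[0.3, 0.2, 0.1])
col.delete(ids=["doc1", "doc2"])
```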
|
||||
### DQL
|
||||
|
||||
:::info
|
||||
You can use these APIs only when you connect to seekdb by using the `Client`. For more information about the `Client`, see [Client](../50.apis/50.client.md).
|
||||
:::
|
||||
|
||||
| API | Description | Documentation |
|
||||
|---|---|---|
|
||||
| `query()` | Performs vector similarity search. | [Documentation](400.dql/200.query-interfaces-of-api.md) |
|
||||
| `get()` | Retrieves specific data from a collection by ID, document, or metadata (non-vector) filters. |[Documentation](400.dql/300.get-interfaces-of-api.md)|
|
||||
| `hybrid_search()` | Performs full-text search and vector similarity search, and fuses the ranked results. |[Documentation](400.dql/400.hybrid-search-of-api.md)|
|
||||
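A condensed sketch of the three operations, assuming a populated collection `col` as in the SDK samples:

```python
res = col.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=3)   # vector search
row = col.get(ids="doc2")                                          # fetch by ID
fused = col.hybrid_search(                                         # full-text + vector
    query={"where_document": {"$contains": "doc"}, "n_results": 10},
    knn={"query_embeddings": [[0.1, 0.2, 0.3]], "n_results": 10},
    rank={"rrf": {}},
    n_results=3,
)
```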
@@ -0,0 +1,93 @@
|
||||
---
|
||||
slug: /admin-client
|
||||
---
|
||||
|
||||
# Admin Client
|
||||
|
||||
`AdminClient` provides database management operations. It supports the same connection modes as `Client`, but only database management operations.
|
||||
|
||||
## Connect to an embedded seekdb instance
|
||||
|
||||
Connect to a local embedded seekdb instance by using `AdminClient`.
|
||||
|
||||
```python
|
||||
import pyseekdb
|
||||
|
||||
# Embedded mode - Database management
|
||||
admin = pyseekdb.AdminClient(path="./seekdb")
|
||||
```
|
||||
|
||||
Parameter description:
|
||||
|
||||
| Parameter | Value Type | Required | Description | Example Value |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| `path` | string | Optional | The path of the seekdb data directory. seekdb stores database files in this directory and loads them when it starts. | `./seekdb` |
|
||||
|
||||
## Connect to a remote server
|
||||
|
||||
Connect to a remote server by using `AdminClient`. This way, you can connect to a seekdb instance or an OceanBase Database instance.
|
||||
|
||||
:::tip
|
||||
|
||||
Before you connect to a remote server, make sure that you have deployed a server mode seekdb instance or an OceanBase Database instance.<br/>For information about how to deploy a server mode seekdb instance, see [Overview](../../../400.guides/400.deploy/50.deploy-overview.md).<br/>For information about how to deploy an OceanBase Database instance, see [Overview](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003976427).
|
||||
|
||||
:::
|
||||
|
||||
Example: Connect to a server mode seekdb instance
|
||||
|
||||
```python
|
||||
import pyseekdb
|
||||
|
||||
# Remote server mode - Database management
|
||||
admin = pyseekdb.AdminClient(
|
||||
host="127.0.0.1",
|
||||
port=2881,
|
||||
user="root",
|
||||
password="" # Can be retrieved from SEEKDB_PASSWORD environment variable
|
||||
)
|
||||
```
|
||||
|
||||
Parameter description:
|
||||
|
||||
| Parameter | Value Type | Required | Description | Example Value |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| `host` | string | Yes | The IP address of the server where the instance resides. | `127.0.0.1` |
|
||||
| `prot` | string | Yes | The port of the instance. The default value is 2881. | `2881` |
|
||||
| `user` | string | Yes | The username. The default value is root. | `root` |
|
||||
| `password` | string | Yes | The password corresponding to the username. If you do not specify `password` or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. | |
|
||||
|
||||
Example: Connect to an OceanBase Database instance
|
||||
|
||||
```python
|
||||
import pyseekdb
|
||||
|
||||
# Remote server mode - Database management
|
||||
admin = pyseekdb.AdminClient(
|
||||
host="127.0.0.1",
|
||||
port=2881,
|
||||
tenant="test"
|
||||
user="root",
|
||||
password="" # Can be retrieved from SEEKDB_PASSWORD environment variable
|
||||
)
|
||||
```
|
||||
|
||||
Parameter description:
|
||||
|
||||
| Parameter | Value Type | Required | Description | Example Value |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| `host` | string | Yes | The IP address of the server where the database resides. | `127.0.0.1` |
|
||||
| `prot` | string | Yes | The port of the OceanBase Database instance. The default value is 2881. | `2881` |
|
||||
| `tenant` | string | No | The name of the tenant. This parameter is not required for a server mode seekdb instance, but is required for an OceanBase Database instance. The default value is sys. | `test` |
|
||||
| `user` | string | Yes | The username corresponding to the tenant. The default value is root. | `root` |
|
||||
| `password` | string | Yes | The password corresponding to the username. If you do not specify `password` or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. | |
|
||||
|
||||
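The password fallback described in the tables above can be exercised as follows. This is a minimal sketch; it assumes the `SEEKDB_PASSWORD` fallback behaves exactly as documented.

```python
import os
import pyseekdb

# With password left empty, the client is documented to fall back to
# the SEEKDB_PASSWORD environment variable.
os.environ["SEEKDB_PASSWORD"] = "your_password_here"
admin = pyseekdb.AdminClient(host="127.0.0.1", port=2881, user="root", password="")
```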
## APIs supported when you use AdminClient to connect to a database

The following APIs are supported when you use `AdminClient` to connect to a database.

| API | Description | Documentation Link |
| --- | --- | --- |
| `create_database()` | Creates a new database. |[Documentation](110.database/200.create-database-of-api.md)|
| `get_database()` | Queries a specified database. |[Documentation](110.database/300.get-database-of-api.md)|
| `delete_database()` | Deletes a specified database. |[Documentation](110.database/500.delete-database-of-api.md)|
| `list_databases()` | Lists all databases. |[Documentation](110.database/400.list-database-of-api.md)|
@@ -0,0 +1,16 @@

---
slug: /database-overview-of-api
---

# Database Management

A database contains tables, indexes, and metadata of database objects. You can create, query, and delete databases as needed.

The following APIs are available for database operations; a combined usage sketch follows the table.

| API | Description | Documentation |
|---|---|---|
| `create_database()` | Creates a database. | [Documentation](200.create-database-of-api.md) |
| `get_database()` | Gets a specified database. |[Documentation](300.get-database-of-api.md)|
| `list_databases()` | Gets the list of databases in the instance. |[Documentation](400.list-database-of-api.md)|
| `delete_database()` | Deletes a specified database. |[Documentation](500.delete-database-of-api.md)|
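The following minimal sketch chains the four calls. It assumes an embedded seekdb instance at `./seekdb`; the exact behavior of each call is described on the linked pages.

```python
import pyseekdb

admin = pyseekdb.AdminClient(path="./seekdb")

admin.create_database("demo_db")      # create a database
db = admin.get_database("demo_db")    # query a single database
for d in admin.list_databases():      # list all databases
    print(d.name)
admin.delete_database("demo_db")      # delete the database
```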
@@ -0,0 +1,76 @@

---
slug: /create-database-of-api
---

# create_database - Create a database

The `create_database()` function is used to create a new database.

:::info
* This interface can only be used when you are connected to the database using `AdminClient`. For more information about `AdminClient`, see [Admin Client](../100.admin-client.md).

* Currently, when you use `create_database` to create a database, you cannot specify the database properties. The database is created with the default property values. If you want to create a database with specific properties, you can create it using SQL. For more information about how to create a database using SQL, see [Create a database](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003977077).
:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You are connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).

* If you are using server mode of seekdb or OceanBase Database, make sure that the connected user has the `CREATE` privilege. For more information about how to check the privileges of the current user, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).

## Limitations

* In a seekdb instance or OceanBase Database, the name of each database must be globally unique.

* The maximum length of a database name is 128 characters.

* The name can contain only uppercase and lowercase letters, digits, underscores, dollar signs, and Chinese characters.

* Avoid using reserved keywords as database names.

  For more information about reserved keywords, see [Reserved keywords](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003976774).

## Recommendations

* We recommend that you give the database a meaningful name that reflects its purpose and content. For example, you can use `Application Identifier_Sub-application name (optional)_db` as the database name.

* We recommend that you create the database and related users as the root user and assign only the necessary privileges to ensure the security and controllability of the database.

* You can create a database whose name consists only of digits by enclosing the name in backticks (`` ` ``), but this is not recommended. Names consisting only of digits have no clear meaning, and queries require the backticks, which leads to unnecessary complexity and confusion.

## Request parameters

```python
create_database(name, tenant=DEFAULT_TENANT)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the database to be created. |`my_database`|
|`tenant`|string|No<ul><li>When using embedded seekdb or server mode of seekdb, this parameter is not required.</li><li>When using OceanBase Database, this parameter is required.</li></ul>|The tenant to which the database belongs. |`test_tenant`|

## Request example

```python
import pyseekdb

# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")

# Create database
admin.create_database("my_database")
```
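When connecting to an OceanBase Database instance, the `tenant` parameter is required, as noted in the table above. A sketch with placeholder connection values:

```python
# Remote OceanBase Database instance - tenant is required
admin = pyseekdb.AdminClient(host="127.0.0.1", port=2881, tenant="test", user="root", password="")
admin.create_database("my_database", tenant="test")
```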

## Response parameters

None

## References

* [Get a specific database](300.get-database-of-api.md)
* [Delete a database](500.delete-database-of-api.md)
* [List databases](400.list-database-of-api.md)
@@ -0,0 +1,65 @@

---
slug: /get-database-of-api
---

# get_database - Get the specified database

The `get_database()` method is used to obtain information about the specified database.

:::info

This method can be used only when you connect to the database by using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../100.admin-client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).

## Request parameters

```python
get_database(name, tenant=DEFAULT_TENANT)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the database to be queried. |`my_database`|
|`tenant`|string|No<ul><li>When you use embedded seekdb or server mode seekdb, you do not need to specify this parameter.</li><li>When you use OceanBase Database, you must specify this parameter.</li></ul>|The tenant to which the database belongs. |`test_tenant`|

## Request example

```python
import pyseekdb

# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")

# Get database
db = admin.get_database("my_database")
print(f"Database: {db.name}, Charset: {db.charset}, collation:{db.collation}, metadata:{db.metadata}")
```

## Response parameters

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried database. |`my_database`|
|`tenant`|string|No<br/>When you use embedded seekdb or server mode seekdb, this parameter does not exist. |The tenant to which the queried database belongs. |`test_tenant`|
|`charset`|string|No|The character set used by the queried database. |`utf8mb4`|
|`collation`|string|No|The collation used by the queried database. |`utf8mb4_general_ci`|
|`metadata`|dict|No|Reserved field. | {} |

## Response example

```python
Database: my_database, Charset: utf8mb4, collation:utf8mb4_general_ci, metadata:{}
```

## References

* [Create a database](200.create-database-of-api.md)
* [Delete a database](500.delete-database-of-api.md)
* [Get the database list](400.list-database-of-api.md)
@@ -0,0 +1,70 @@

---
slug: /list-database-of-api
---

# list_databases - Get the database list

The `list_databases()` method is used to retrieve the database list in the instance.

:::info

This API is only available when using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../100.admin-client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).

## Request parameters

```python
list_databases(limit=None, offset=None, tenant=DEFAULT_TENANT)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`limit`|int|Optional|The maximum number of databases to return. |2|
|`offset`|int|Optional|The number of databases to skip. |3|
|`tenant`|string|Optional<ul><li>When using embedded seekdb or server mode seekdb, this parameter is not required.</li><li>When using OceanBase Database, this parameter is required. The default value is `sys`.</li></ul>|The tenant to which the queried databases belong. |test_tenant|

## Request example

```python
import pyseekdb

# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")

# List databases: skip the first 3, return at most 2
databases = admin.list_databases(limit=2, offset=3)
for db in databases:
    print(f"Database: {db.name}, Charset: {db.charset}, collation:{db.collation}, metadata:{db.metadata}")
```
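Building on `limit` and `offset`, you can page through a large number of databases. This is a sketch only; it assumes that an empty page signals the end of the list.

```python
# Page through databases 10 at a time (assumes an empty page means done)
offset = 0
while True:
    page = admin.list_databases(limit=10, offset=offset)
    if not page:
        break
    for db in page:
        print(db.name)
    offset += 10
```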

## Response parameters

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried database. |`my_database`|
|`tenant`|string|Optional<br/>When using embedded seekdb or server mode seekdb, this parameter is not available. |The tenant to which the queried database belongs. |`test_tenant`|
|`charset`|string|Optional|The character set of the queried database. |`utf8mb4`|
|`collation`|string|Optional|The collation of the queried database. |`utf8mb4_general_ci`|
|`metadata`|dict|Optional|Reserved field. No data is returned. | {} |

## Response example

```python
Database: test, Charset: utf8mb4, collation:utf8mb4_general_ci, metadata:{}
Database: my_database, Charset: utf8mb4, collation:utf8mb4_general_ci, metadata:{}
```

## References

* [Create a database](200.create-database-of-api.md)
* [Delete a database](500.delete-database-of-api.md)
* [Get a specific database](300.get-database-of-api.md)
@@ -0,0 +1,54 @@

---
slug: /delete-database-of-api
---

# delete_database - Delete a database

The `delete_database()` method is used to delete a database.

:::info

This method is only available when using the `AdminClient`. For more information about the `AdminClient`, see [Admin Client](../100.admin-client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect to the database, see [Admin Client](../100.admin-client.md).

* If you are using server mode of seekdb or OceanBase Database, ensure that the user has the `DROP` privilege. For more information about how to view the privileges of the current user, see [View User Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have the privilege, contact the administrator to grant the privilege. For more information about how to directly grant privileges, see [Directly Grant Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).

## Request parameters

```python
delete_database(name, tenant=DEFAULT_TENANT)
```

|Parameter|Type|Required|Description|Example Value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the database to be deleted. |my_database|
|`tenant`|string|No<ul><li>If you are using embedded seekdb or server mode of seekdb, you do not need to specify this parameter.</li><li>If you are using OceanBase Database, this parameter is required. The default value is `sys`.</li></ul>|The tenant to which the database belongs. |test_tenant|

## Request example

```python
import pyseekdb

# Embedded mode
admin = pyseekdb.AdminClient(path="./seekdb")

# Delete database
admin.delete_database("my_database")
```

## Response parameters

None

## References

* [Create a database](200.create-database-of-api.md)
* [Get a specific database](300.get-database-of-api.md)
* [Get the database list](400.list-database-of-api.md)
@@ -0,0 +1,93 @@

---
slug: /create-collection-of-api
---

# create_collection - Create a collection

`create_collection()` is used to create a new collection, which is a table in the database.

:::info

This API is only available when you are connected to the database using a client. For more information about the client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).

* If you are using seekdb in server mode or OceanBase Database, make sure that the user has the `CREATE` privilege. For more information about how to view the privileges of the current user, see [View user privileges](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971368). If the user does not have the privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly grant privileges](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974754).

## Define the table name

When creating a table, you must first define its name. The following requirements apply:

* In seekdb, each table name must be unique within the database.

* The table name cannot exceed 64 characters.

* We recommend that you give the table a meaningful name instead of a generic name such as t1 or table1. For more information about table naming conventions, see [Table naming conventions](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003977289).

## Request parameters

```python
create_collection(name, configuration=configuration, embedding_function=embedding_function)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to be created. |my_collection|
|`configuration`|HNSWConfiguration|No|The index configuration, which specifies the dimension and distance metric. If not provided, the default values `dimension=384` and `distance='cosine'` are used. If set to `None`, the dimension is calculated from the `embedding_function` value. |HNSWConfiguration(dimension=384, distance='cosine')|
|`embedding_function`|EmbeddingFunction|No|The function to convert data into vectors. If not provided, `DefaultEmbeddingFunction()` (384 dimensions) is used. If set to `None`, the collection will not include embedding functionality and vectors must be provided manually. If provided, its output dimension must match `configuration.dimension`. |DefaultEmbeddingFunction()|

:::info

When you provide `embedding_function`, the system automatically calculates the vector dimension by calling this function. If you also provide `configuration.dimension`, it must match the dimension of `embedding_function`. Otherwise, a ValueError is raised.

:::

## Request example

```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration

# Create a client
client = pyseekdb.Client()

# Create a collection with default embedding function (auto-calculates dimension)
collection = client.create_collection(
    name="my_collection"
)

# Create a collection with custom embedding function
ef = UserDefinedEmbeddingFunction()  # define your own embedding function, see section 6
config = HNSWConfiguration(dimension=384, distance='cosine')  # Must match EF dimension
collection = client.create_collection(
    name="my_collection2",
    configuration=config,
    embedding_function=ef
)

# Create a collection without embedding function (vectors must be provided manually)
collection = client.create_collection(
    name="my_collection3",
    configuration=HNSWConfiguration(dimension=384, distance='cosine'),
    embedding_function=None  # Explicitly disable embedding function
)
```
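As a negative example of the dimension check described in the note above, a configuration whose dimension does not match the embedding function's output is expected to raise `ValueError`. This is a sketch; `DefaultEmbeddingFunction()` is assumed to output 384-dimensional vectors, as documented.

```python
# DefaultEmbeddingFunction() outputs 384-dimensional vectors, so a
# 128-dimensional index configuration should be rejected with ValueError.
try:
    client.create_collection(
        name="mismatched_collection",
        configuration=HNSWConfiguration(dimension=128, distance='cosine'),
        embedding_function=DefaultEmbeddingFunction()
    )
except ValueError as e:
    print(f"Rejected as expected: {e}")
```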

## Response parameters

None

## References

* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a collection list](300.list-collection-of-api.md)
* [Count the number of collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)
@@ -0,0 +1,89 @@

---
slug: /get-collection-of-api
---

# get_collection - Get a collection

The `get_collection()` function is used to retrieve a specified collection.

:::info

This API is only available when connected using a Client. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).

* The collection you want to retrieve exists. If the collection does not exist, an error is returned.

## Request parameters

```python
client.get_collection(name, configuration=configuration, embedding_function=embedding_function)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to retrieve. |my_collection|
|`configuration`|HNSWConfiguration|No|The index configuration, which specifies the dimension and distance metric. If not provided, the default values `dimension=384` and `distance='cosine'` are used. If set to `None`, the dimension is calculated from the `embedding_function` value. |HNSWConfiguration(dimension=384, distance='cosine')|
|`embedding_function`|EmbeddingFunction|No|The function used to convert text to vectors. If not provided, `DefaultEmbeddingFunction()` (384 dimensions) is used. If set to `None`, the collection will not contain an embedding function. If provided, its output dimension must match `configuration.dimension`. |DefaultEmbeddingFunction()|

:::info

When vectors are not provided for documents/texts, the embedding function set here is used for all operations on this collection, including add, upsert, update, query, and hybrid_search.

:::

## Request example

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

# Get an existing collection (uses default embedding function if collection doesn't have one)
collection = client.get_collection("my_collection")
print(f"Database: {collection.name}, dimension: {collection.dimension}, embedding_function:{collection.embedding_function}, distance:{collection.distance}, metadata:{collection.metadata}")

# Get collection with specific embedding function
ef = UserDefinedEmbeddingFunction()  # define your own embedding function, see section 6
collection = client.get_collection("my_collection", embedding_function=ef)
print(f"Database: {collection.name}, dimension: {collection.dimension}, embedding_function:{collection.embedding_function}, distance:{collection.distance}, metadata:{collection.metadata}")

# Get collection without embedding function
collection = client.get_collection("my_collection", embedding_function=None)

# Check if collection exists
if client.has_collection("my_collection"):
    collection = client.get_collection("my_collection")
    print(f"Database: {collection.name}, dimension: {collection.dimension}, embedding_function:{collection.embedding_function}, distance:{collection.distance}, metadata:{collection.metadata}")
```

## Response parameters

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried collection. |my_collection|
|`dimension`|int|No|The vector dimension of the collection. |384|
|`embedding_function`|EmbeddingFunction|No|The embedding function bound to the collection. |DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')|
|`distance`|string|No|The distance metric of the collection. |cosine|
|`metadata`|dict|No|Reserved field. No data is returned. | {} |

## Response example

```python
Database: my_collection, dimension: 384, embedding_function:DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2'), distance:cosine, metadata:{}
Database: my_collection1, dimension: 384, embedding_function:DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2'), distance:cosine, metadata:{}
```

## References

* [Create a collection](100.create-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a list of collections](300.list-collection-of-api.md)
* [Count the number of collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)
@@ -0,0 +1,79 @@

---
slug: /get-or-create-collection-of-api
---

# get_or_create_collection - Create or query a collection

The `get_or_create_collection()` function creates or queries a collection. If the collection does not exist in the database, it is created. If it exists, the corresponding result is obtained.

:::info

This API is only available when using a client. For more information about the client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).

* If you are using seekdb in server mode or OceanBase Database, ensure that the connected user has the `CREATE` privilege. For more information about how to check the privileges of the current user, see [Check User Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have this privilege, contact the administrator to grant it. For more information about how to directly grant privileges, see [Directly Grant Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).

## Define a table name

When creating a table, you need to define a table name. The following requirements must be met:

* In seekdb, each table name must be unique within the database.

* The table name must be no longer than 64 characters.

* It is recommended to use meaningful names for tables instead of generic names like t1 or table1. For more information about table naming conventions, see [Table Naming Conventions](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003977289).

## Request parameters

```python
get_or_create_collection(name, configuration=configuration, embedding_function=embedding_function)
```

|Parameter|Value Type|Required|Description|Example Value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to be created or queried. |my_collection|
|`configuration`|HNSWConfiguration|No|The index configuration with dimension and distance metric. If not provided, the default values `dimension=384` and `distance='cosine'` are used. If set to `None`, the dimension is calculated from the `embedding_function` value. |HNSWConfiguration(dimension=384, distance='cosine')|
|`embedding_function`|EmbeddingFunction|No|The function used to convert data into vectors. If not provided, `DefaultEmbeddingFunction()` (384 dimensions) is used. If set to `None`, the collection will not include embedding functionality. If provided, its output dimension must match `configuration.dimension`. |DefaultEmbeddingFunction()|

:::info

When `embedding_function` is provided, the system automatically calculates the vector dimension by calling the function. If `configuration.dimension` is also provided, it must match the dimension of `embedding_function`; otherwise, a ValueError is raised.

:::

## Request example

```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration

# Create a client
client = pyseekdb.Client()

# Get or create collection (creates if it doesn't exist)
collection = client.get_or_create_collection(
    name="my_collection4",
    configuration=HNSWConfiguration(dimension=384, distance='cosine'),
    embedding_function=DefaultEmbeddingFunction()
)
```
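Because the call is idempotent with respect to the collection name, repeating it is safe. A short sketch following the example above:

```python
# A second call with the same name returns the existing collection
# instead of raising an error, so the call is safe to repeat.
same_collection = client.get_or_create_collection(name="my_collection4")
print(same_collection.name)  # my_collection4
```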

## Response parameters

None

## References

* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Get a list of collections](300.list-collection-of-api.md)
* [Count collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)
@@ -0,0 +1,65 @@

---
slug: /list-collection-of-api
---

# list_collections - Get a list of collections

The `list_collections()` API is used to obtain all collections.

:::info

This API is supported only when you use a Client. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).

## Request parameters

```python
client.list_collections()
```

## Request example

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

# List all collections
collections = client.list_collections()
for coll in collections:
    print(f"Collection: {coll.name}, Dimension: {coll.dimension}, embedding_function: {coll.embedding_function}, distance: {coll.distance}, metadata: {coll.metadata}")
```

## Response parameters

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the queried collection. |my_collection|
|`dimension`|int|No|The vector dimension of the collection. | 384 |
|`embedding_function`|EmbeddingFunction|No|The embedding function bound to the collection. |DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')|
|`distance`|string|No|The distance metric of the collection. |cosine|
|`metadata`|dict|No|Reserved field. No data is returned. | {} |

## Response example

```python
Collection: my_collection, Dimension: 384, embedding_function: DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2'), distance: cosine, metadata: {}
```

## References

* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Count collections](350.count-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)
@@ -0,0 +1,56 @@

---
slug: /count-collection-of-api
---

# count_collection - Count the number of collections

The `count_collection()` method is used to count the number of collections in the database.

:::info

This API is only available when you are connected to the database using a Client. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).

## Request parameters

```python
client.count_collection()
```

## Request example

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

# Count collections in database
collection_count = client.count_collection()
print(f"Database has {collection_count} collections")
```

## Response parameters

None

## Response example

```python
Database has 1 collections
```

## References

* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a collection list](300.list-collection-of-api.md)
* [Delete a collection](400.delete-collection-of-api.md)
@@ -0,0 +1,55 @@

---
slug: /delete-collection-of-api
---

# delete_collection - Delete a collection

The `delete_collection()` method is used to delete a specified collection.

:::info

This API is only available when you are connected to the database using a client. For more information about the client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).

* The collection you want to delete exists. If the collection does not exist, an error is returned.

## Request parameters

```python
client.delete_collection(name)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`name`|string|Yes|The name of the collection to be deleted. |my_collection|

## Request example

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

# Delete a collection
client.delete_collection("my_collection")
```
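Because deleting a nonexistent collection returns an error, you can guard the call with `has_collection()` (shown in the get_collection example); a minimal sketch:

```python
# Guarded delete: avoids the error raised for a nonexistent collection
if client.has_collection("my_collection"):
    client.delete_collection("my_collection")
```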

## Response parameters

None

## References

* [Create a collection](100.create-collection-of-api.md)
* [Query a collection](200.get-collection-of-api.md)
* [Create or query a collection](250.get-or-create-collection-of-api.md)
* [Get a collection list](300.list-collection-of-api.md)
* [Count the number of collections](350.count-collection-of-api.md)
@@ -0,0 +1,18 @@

---
slug: /collection-overview-of-api
---

# Manage collections

In pyseekdb, a collection is similar to a table in a database. You can create, query, and delete collections.

The following APIs are supported for managing collections; a combined usage sketch follows the table.

| API | Description | Documentation |
|---|---|---|
| `create_collection()` | Creates a collection. | [Documentation](100.create-collection-of-api.md) |
| `get_collection()` | Gets a specified collection. |[Documentation](200.get-collection-of-api.md)|
| `get_or_create_collection()` | Creates or queries a collection. If the collection does not exist in the database, it is created. If the collection exists, the corresponding result is obtained. |[Documentation](250.get-or-create-collection-of-api.md)|
| `list_collections()` | Gets the collection list of a database. |[Documentation](300.list-collection-of-api.md)|
| `count_collection()` | Counts the number of collections in a database. |[Documentation](350.count-collection-of-api.md)|
| `delete_collection()` | Deletes a specified collection. |[Documentation](400.delete-collection-of-api.md)|
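The following minimal sketch ties these calls together. It assumes a default local `Client`; the exact behavior of each call is described on the linked pages.

```python
import pyseekdb

client = pyseekdb.Client()

collection = client.create_collection(name="demo_collection")  # create
collection = client.get_collection("demo_collection")          # query one
print([c.name for c in client.list_collections()])             # list all
print(client.count_collection())                               # count
client.delete_collection("demo_collection")                    # delete
```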
@@ -0,0 +1,16 @@

---
slug: /dml-overview-of-api
---

# DML operations

DML (Data Manipulation Language) operations allow you to insert, update, and delete data in a collection.

For DML operations, you can use the following APIs; a combined usage sketch follows the table.

| API | Description | Documentation |
|---|---|---|
| `add()` | Inserts a new record into a collection. | [Documentation](200.add-data-of-api.md) |
| `update()` | Updates an existing record in a collection. |[Documentation](300.update-data-of-api.md)|
| `upsert()` | Inserts a new record or updates an existing record. |[Documentation](400.upsert-data-of-api.md)|
| `delete()` | Deletes a record from a collection. |[Documentation](500.delete-data-of-api.md)|
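The following minimal sketch chains the four calls. It assumes a collection created without an embedding function, so 3-dimensional vectors are supplied directly; parameter details are on the linked pages.

```python
import pyseekdb
from pyseekdb import HNSWConfiguration

client = pyseekdb.Client()
collection = client.create_collection(
    name="dml_demo",
    configuration=HNSWConfiguration(dimension=3, distance='cosine'),
    embedding_function=None
)

collection.add(ids="item1", embeddings=[0.1, 0.2, 0.3], documents="v1")     # insert
collection.update(ids="item1", metadatas={"rev": 2})                        # update in place
collection.upsert(ids="item2", embeddings=[0.4, 0.5, 0.6], documents="v2")  # insert or update
collection.delete(ids=["item1", "item2"])                                   # delete
```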
@@ -0,0 +1,117 @@

---
slug: /add-data-of-api
---

# add - Insert data

The `add()` method inserts new data into a collection. If a record with the same ID already exists, an error is returned.

:::info

This API is only available when using a Client. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).

* If you are using seekdb in server mode or OceanBase Database, make sure that the connected user has the `INSERT` privilege on the target table. For more information about how to view the privileges of the current user, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If you do not have the required privilege, contact the administrator to grant it. For more information about how to directly grant a privilege, see [Directly grant a privilege](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).

## Request parameters

```python
add(
    ids=ids,
    embeddings=embeddings,
    documents=documents,
    metadatas=metadatas
)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Yes|The ID of the data to be inserted. You can specify a single ID or an array of IDs.|item1|
|`embeddings`|List[float] or List[List[float]]|No|The vector or vectors of the data to be inserted. If you specify this parameter, the value of `embedding_function` is ignored. If you do not specify this parameter, you must specify `documents`, and the collection must have an `embedding_function`.|[0.1, 0.2, 0.3]|
|`documents`|string or List[str]|No|The document or documents to be inserted. If you do not specify `embeddings`, `documents` are converted to vectors using the collection's `embedding_function`.|"This is a document"|
|`metadatas`|dict or List[dict]|No|The metadata or metadata list of the data to be inserted. |`{"category": "AI", "score": 95}`|

:::info

The `embedding_function` associated with the collection is set during `create_collection()` or `get_collection()`. You cannot override it per operation.

:::

## Request example

```python
import pyseekdb
from pyseekdb import DefaultEmbeddingFunction, HNSWConfiguration

# Create a client
client = pyseekdb.Client()

collection = client.create_collection(
    name="my_collection",
    configuration=HNSWConfiguration(dimension=3, distance='cosine'),
    embedding_function=None
)

# Add single item
collection.add(
    ids="item1",
    embeddings=[0.1, 0.2, 0.3],
    documents="This is a document",
    metadatas={"category": "AI", "score": 95}
)

# Add multiple items
collection.add(
    ids=["item4", "item2", "item3"],
    embeddings=[
        [0.1, 0.2, 0.4],
        [0.4, 0.5, 0.6],
        [0.7, 0.8, 0.9]
    ],
    documents=[
        "Document 1",
        "Document 2",
        "Document 3"
    ],
    metadatas=[
        {"category": "AI", "score": 95},
        {"category": "ML", "score": 88},
        {"category": "DL", "score": 92}
    ]
)

# Add with only embeddings
collection.add(
    ids=["vec1", "vec2"],
    embeddings=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
)

collection1 = client.create_collection(
    name="my_collection1"
)

# Add with only documents - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
collection1.add(
    ids=["doc1", "doc2"],
    documents=["Text document 1", "Text document 2"],
    metadatas=[{"tag": "A"}, {"tag": "B"}]
)
```
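Following the duplicate-ID behavior described above, re-adding an existing ID is expected to fail. A sketch (the exact exception type is not documented here, so a broad `except` is used):

```python
# Re-adding an existing ID fails; use upsert() for insert-or-update semantics
try:
    collection.add(ids="item1", embeddings=[0.1, 0.2, 0.3])
except Exception as e:
    print(f"add() rejected the duplicate ID: {e}")
```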

## Response parameters

None

## References

* [Update data](300.update-data-of-api.md)
* [Update or insert data](400.upsert-data-of-api.md)
* [Delete data](500.delete-data-of-api.md)
@@ -0,0 +1,88 @@

---
slug: /update-data-of-api
---

# update - Update data

The `update()` method is used to update existing records in a collection. The record must exist; otherwise, an error is raised.

:::info

This API is only available when using a Client. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).

* If you are using seekdb in server mode or OceanBase Database, make sure that the connected user has the `UPDATE` privilege on the target table. For more information about how to view the privileges of the current user, see [View User Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If you do not have this privilege, contact the administrator to grant it to you. For more information about how to directly grant privileges, see [Directly Grant Privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).

## Request parameters

```python
update(
    ids=ids,
    embeddings=embeddings,
    documents=documents,
    metadatas=metadatas
)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Yes|The ID to be modified. It can be a single ID or an array of IDs.|item1|
|`embeddings`|List[float] or List[List[float]]|No|The new vectors. If provided, they are used directly (ignoring `embedding_function`). If not provided, you can provide `documents` to automatically generate vectors.|[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]|
|`documents`|string or List[str]|No|The new documents. If `embeddings` are not provided, `documents` are converted to vectors using the collection's `embedding_function`.|"New document text"|
|`metadatas`|dict or List[dict]|No|The new metadata.|`{"category": "AI"}`|

:::info

You can update only the `metadatas` without touching the vectors or documents. The `embedding_function` used is the one associated with the collection.

:::

## Request example

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")

# Update single item
collection.update(
    ids="item1",
    metadatas={"category": "AI", "score": 98}  # Update metadata only
)

# Update multiple items
collection.update(
    ids=["item1", "item2"],
    embeddings=[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]],  # Update embeddings
    documents=["Updated document 1", "Updated document 2"]  # Update documents
)

# Update with documents only - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
collection1.update(
    ids="doc1",
    documents="New document text",  # Embeddings will be auto-generated
    metadatas={"category": "AI"}
)
```

## Response parameters

None

## References

* [Insert data](200.add-data-of-api.md)
* [Update or insert data](400.upsert-data-of-api.md)
* [Delete data](500.delete-data-of-api.md)
@@ -0,0 +1,93 @@

---
slug: /upsert-data-of-api
---

# upsert - Update or insert data

The `upsert()` method is used to insert new records or update existing records. If a record with the given ID already exists, it is updated; otherwise, a new record is inserted.

:::info

This API is only available when using a Client connection. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect, see [Client](../50.client.md).

* If you are using seekdb in server mode or OceanBase Database, ensure that the connected user has the `INSERT` and `UPDATE` privileges on the target table. For more information about how to view the current user privileges, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If the user does not have the required privileges, contact the administrator to grant them. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).

## Request parameters

```python
upsert(
    ids=ids,
    embeddings=embeddings,
    documents=documents,
    metadatas=metadatas
)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Yes|The ID to be added or modified. It can be a single ID or an array of IDs.|item1|
|`embeddings`|List[float] or List[List[float]]|No|The vectors. If provided, they are used directly (ignoring `embedding_function`). If not provided, you can provide `documents` to automatically generate vectors.|[0.1, 0.2, 0.3]|
|`documents`|string or List[str]|No|The documents. If `embeddings` are not provided, `documents` are converted to vectors using the collection's `embedding_function`.|"Document text"|
|`metadatas`|dict or List[dict]|No|The metadata. |`{"category": "AI"}`|

## Request example

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")

# Upsert single item (insert or update)
collection.upsert(
    ids="item1",
    embeddings=[0.1, 0.2, 0.3],
    documents="Document text",
    metadatas={"category": "AI", "score": 95}
)

# Upsert multiple items
collection.upsert(
    ids=["item1", "item2", "item3"],
    embeddings=[
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6],
        [0.7, 0.8, 0.9]
    ],
    documents=["Doc 1", "Doc 2", "Doc 3"],
    metadatas=[
        {"category": "AI"},
        {"category": "ML"},
        {"category": "DL"}
    ]
)

# Upsert with documents only - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
collection1.upsert(
    ids=["item1", "item2"],
    documents=["Document 1", "Document 2"],
    metadatas=[{"category": "AI"}, {"category": "ML"}]
)
```

## Response parameters

None

## References

* [Insert data](200.add-data-of-api.md)
* [Update data](300.update-data-of-api.md)
* [Delete data](500.delete-data-of-api.md)
@@ -0,0 +1,87 @@

---
slug: /delete-data-of-api
---

# delete - Delete data

`delete()` is used to delete records from a collection. You can delete records by ID, metadata filter, or document filter.

:::info

This API is only available when you are connected to the database using a Client. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Quick Start](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You are connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).

* If you are using seekdb in server mode or OceanBase Database, make sure that the connected user has the `DELETE` privilege on the target table. For more information about how to view the privileges of the current user, see [View user privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980135). If you do not have this privilege, contact the administrator to grant it to you. For more information about how to directly grant privileges, see [Directly grant privileges](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980140).

## Request parameters

```python
delete(
    ids=ids,
    where=where,
    where_document=where_document
)
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|string or List[str]|Optional|The ID of the record to be deleted. You can specify a single ID or an array of IDs.|item1|
|`where`|dict|Optional|The metadata filter.|`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|Optional|The document filter.|`{"$contains": "obsolete"}`|

:::info

At least one of the `ids`, `where`, or `where_document` parameters must be specified.

:::

## Request examples

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

collection = client.get_collection("my_collection")

# Delete by IDs
collection.delete(ids=["item1", "item2", "item3"])

# Delete by single ID
collection.delete(ids="item1")

# Delete by metadata filter
collection.delete(where={"category": {"$eq": "AI"}})

# Delete by comparison operator
collection.delete(where={"score": {"$lt": 50}})

# Delete by document filter
collection.delete(where_document={"$contains": "obsolete"})

# Delete with combined filters
collection.delete(
    where={"category": {"$eq": "AI"}},
    where_document={"$contains": "deprecated"}
)
```

## Response parameters

None

## References

* [Insert data](200.add-data-of-api.md)
* [Update data](300.update-data-of-api.md)
* [Update or insert data](400.upsert-data-of-api.md)
@@ -0,0 +1,15 @@

---
slug: /dql-overview-of-api
---

# Overview of DQL

DQL (Data Query Language) operations allow you to retrieve data from collections using various query methods.

For DQL operations, the following APIs are supported; a short usage sketch follows the table.

| API | Description | Documentation |
|---|---|---|
| `query()` | Performs vector similarity search. | [Documentation](200.query-interfaces-of-api.md) |
| `get()` | Queries specific data from a table using an ID, document, or metadata (excluding vectors). | [Documentation](300.get-interfaces-of-api.md) |
| `hybrid_search()` | Combines full-text search and vector similarity search using a ranking method. | [Documentation](400.hybrid-search-of-api.md) |
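The following minimal sketch contrasts the two point-lookup styles. It assumes the 3-dimensional collection from the DML examples; the `get(ids=...)` signature is an assumption based on the description above, and the exact parameters are on the linked pages. `hybrid_search()` is covered on its own page.

```python
import pyseekdb

client = pyseekdb.Client()
collection = client.get_collection("my_collection")

# Vector similarity search: nearest neighbours of a query vector
results = collection.query(query_embeddings=[0.1, 0.2, 0.3], n_results=3)

# Exact retrieval by ID, with no similarity ranking involved
# (assumed signature; see the get() reference page)
records = collection.get(ids=["item1"])
```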
@@ -0,0 +1,161 @@
---
slug: /query-interfaces-of-api
---

# query - Vector query

The `query()` method performs a vector similarity search to find the documents most similar to the query vector.

:::info

This interface is only available when using the Client. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).

* You have created a collection and inserted data. For more information about how to create a collection and insert data, see [create_collection - Create a collection](../200.collection/100.create-collection-of-api.md) and [add - Insert data](../300.dml/200.add-data-of-api.md).

## Request parameters

```python
query()
```

|Parameter|Value type|Required|Description|Example value|
|---|---|---|---|---|
|`query_embeddings`|List[float] or List[List[float]]|No|A single vector or a list of vectors for batch queries. If provided, it is used directly (ignoring `embedding_function`); if not provided, `query_texts` must be provided, and the collection must have an `embedding_function`.|[1.0, 2.0, 3.0]|
|`query_texts`|str or List[str]|No|A single text or a list of texts to query. The collection's `embedding_function` converts the texts to vectors; if not provided, `query_embeddings` must be provided.|["my query text"]|
|`n_results`|int|No|The number of similar results to return. The default value is 10.|3|
|`where`|dict|No|Metadata filter conditions.|`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|No|Document filter conditions.|`{"$contains": "machine"}`|
|`include`|List[str]|No|List of fields to include: `["documents", "metadatas", "embeddings"]`.|["documents", "metadatas", "embeddings"]|

:::info

The `embedding_function` used is associated with the collection (set during `create_collection()` or `get_collection()`). You cannot override it for each operation.

:::
## Request example

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")

# Basic vector similarity query (embedding_function not used)
results = collection.query(
    query_embeddings=[1.0, 2.0, 3.0],
    n_results=3
)

# Iterate over results
for i in range(len(results["ids"][0])):
    print(f"ID: {results['ids'][0][i]}, Distance: {results['distances'][0][i]}")
    if results.get("documents"):
        print(f"Document: {results['documents'][0][i]}")
    if results.get("metadatas"):
        print(f"Metadata: {results['metadatas'][0][i]}")

# Query by texts - vectors auto-generated by embedding_function
# Requires: collection must have embedding_function set
results = collection1.query(
    query_texts=["my query text"],
    n_results=10
)
# The collection's embedding_function will automatically convert query_texts to query_embeddings

# Query by multiple texts (batch query)
results = collection1.query(
    query_texts=["query text 1", "query text 2"],
    n_results=5
)
# Returns dict with lists of lists, one list per query text
for i in range(len(results["ids"])):
    print(f"Query {i}: {len(results['ids'][i])} results")

# Query with metadata filter (using query_texts)
results = collection1.query(
    query_texts=["AI research"],
    where={"category": {"$eq": "AI"}},
    n_results=5
)

# Query with comparison operator (using query_texts)
results = collection1.query(
    query_texts=["machine learning"],
    where={"score": {"$gte": 90}},
    n_results=5
)

# Query with document filter (using query_texts)
results = collection1.query(
    query_texts=["neural networks"],
    where_document={"$contains": "machine learning"},
    n_results=5
)

# Query with combined filters (using query_texts)
results = collection1.query(
    query_texts=["AI research"],
    where={"category": {"$eq": "AI"}, "score": {"$gte": 90}},
    where_document={"$contains": "machine"},
    n_results=5
)

# Query with multiple vectors (batch query)
results = collection.query(
    query_embeddings=[[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    n_results=2
)
# Returns dict with lists of lists, one list per query vector
for i in range(len(results["ids"])):
    print(f"Query {i}: {len(results['ids'][i])} results")

# Query with specific fields
results = collection.query(
    query_embeddings=[1.0, 2.0, 3.0],
    include=["documents", "metadatas", "embeddings"],
    n_results=3
)
```
## Return parameters

|Parameter|Value type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|List[List[str]]|Yes|The IDs of the matched results, one inner list per query.|[["item1", "item2"]]|
|`embeddings`|List[List[List[float]]]|No|The vectors of the matched results. Returned only when `"embeddings"` is included in `include`.|[[[0.1, 0.2, 0.3]]]|
|`documents`|List[List[str]]|No|The documents of the matched results. Returned only when `"documents"` is included in `include`.|"Document text"|
|`metadatas`|List[List[Dict]]|No|The metadata of the matched results.|`{"category": "AI"}`|
|`distances`|List[List[float]]|No|The distance between the query vector and each matched vector. A smaller distance indicates higher similarity.|0.0253|

## Return example

```plain
ID: vec1, Distance: 0.0
Document: None
Metadata: {}
ID: vec2, Distance: 0.025368153802923787
Document: None
Metadata: {}
Query 0: 4 results
Query 1: 4 results
Query 0: 2 results
Query 1: 2 results
```

## Related operations

* [get - Retrieve](300.get-interfaces-of-api.md)
* [Hybrid search](400.hybrid-search-of-api.md)
* [Operators](500.filter-operators-of-api.md)
@@ -0,0 +1,127 @@
---
slug: /get-interfaces-of-api
---

# get - Retrieve

`get()` is used to retrieve documents from a collection without performing vector similarity search.

It supports filtering by IDs, metadata, and documents.

:::info

This interface is only available when using the Client. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).

* You have created a collection and inserted data. For more information about how to create a collection and insert data, see [create_collection - Create a collection](../200.collection/100.create-collection-of-api.md) and [add - Insert data](../300.dml/200.add-data-of-api.md).

## Request parameters

```python
get()
```

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`ids`|str or List[str]|No|The ID or list of IDs to retrieve.|["1", "2", "3"]|
|`where`|dict|No|The metadata filter.|`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|No|The document filter.|`{"$contains": "machine"}`|
|`limit`|int|No|The maximum number of results to return.|10|
|`offset`|int|No|The number of results to skip for pagination.|1|
|`include`|List[str]|No|The list of fields to include: `["documents", "metadatas", "embeddings"]`.|["documents", "metadatas", "embeddings"]|

:::info

If no parameters are provided, all data is returned.

:::
## Request example

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

collection = client.get_collection("my_collection")

# Get by single ID
results = collection.get(ids="123")

# Get by multiple IDs
results = collection.get(ids=["1", "2", "3"])

# Get by metadata filter
results = collection.get(
    where={"category": {"$eq": "AI"}},
    limit=10
)

# Get by comparison operator
results = collection.get(
    where={"score": {"$gte": 90}},
    limit=10
)

# Get by $in operator
results = collection.get(
    where={"tag": {"$in": ["ml", "python"]}},
    limit=10
)

# Get by logical operators ($or)
results = collection.get(
    where={
        "$or": [
            {"category": {"$eq": "AI"}},
            {"tag": {"$eq": "python"}}
        ]
    },
    limit=10
)

# Get by document content filter
results = collection.get(
    where_document={"$contains": "machine learning"},
    limit=10
)

# Get with combined filters
results = collection.get(
    where={"category": {"$eq": "AI"}},
    where_document={"$contains": "machine"},
    limit=10
)

# Get with pagination
results = collection.get(limit=2, offset=1)

# Get with specific fields
results = collection.get(
    ids=["1", "2"],
    include=["documents", "metadatas", "embeddings"]
)

# Get all data (up to limit)
results = collection.get(limit=100)
```
## Response parameters

* If a single ID is provided: the result contains the matching record for that ID.
* If multiple IDs are provided: a list of QueryResult objects, one for each ID.
* If filters are provided: a QueryResult object containing all matching results.
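
As a rough illustration, assuming the result exposes dict-style fields the same way the `query()` examples do, the matches can be inspected like this (the exact result layout may vary by version):

```python
results = collection.get(
    where={"category": {"$eq": "AI"}},
    limit=10
)

# Assumption: dict-style access with one flat list per field.
for i, item_id in enumerate(results["ids"]):
    print(f"ID: {item_id}")
    if results.get("documents"):
        print(f"Document: {results['documents'][i]}")
    if results.get("metadatas"):
        print(f"Metadata: {results['metadatas'][i]}")
```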
## Related operations

* [Vector query](200.query-interfaces-of-api.md)
* [Hybrid search](400.hybrid-search-of-api.md)
* [Operators](500.filter-operators-of-api.md)
@@ -0,0 +1,140 @@
---
slug: /hybrid-search-of-api
---

# hybrid_search - Hybrid search

`hybrid_search()` combines full-text search and vector similarity search with ranking.

:::info

This API is only available when using the Client. For more information about the Client, see [Client](../50.client.md).

:::

## Prerequisites

* You have installed pyseekdb. For more information about how to install pyseekdb, see [Get Started](../../10.pyseekdb-sdk/10.pyseekdb-sdk-get-started.md).

* You have connected to the database. For more information about how to connect to the database, see [Client](../50.client.md).

* You have created a collection and inserted data. For more information about how to create a collection and insert data, see [create_collection - Create a collection](../200.collection/100.create-collection-of-api.md) and [add - Insert data](../300.dml/200.add-data-of-api.md).

## Request parameters

```python
hybrid_search(
    query={
        "where_document": ...,  # document filter
        "where": ...,           # metadata filter
        "n_results": ...        # number of full-text results
    },
    knn={
        "query_texts": ...,     # or "query_embeddings"
        "where": ...,           # metadata filter
        "n_results": ...        # number of vector results
    },
    rank=...,
    n_results=...,
    include=...
)
```

* `query`: the full-text search configuration, including the following parameters:

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`where`|dict|No|Metadata filter conditions.|`{"category": {"$eq": "AI"}}`|
|`where_document`|dict|No|Document filter conditions.|`{"$contains": "machine"}`|
|`n_results`|int|Yes|The number of results for full-text search.||

* `knn`: the vector search configuration, including the following parameters:

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`query_embeddings`|List[float] or List[List[float]]|No|A single vector or a list of vectors for batch queries. If provided, it is used directly (ignoring `embedding_function`); if not provided, `query_texts` must be provided, and the collection must have an `embedding_function`.|[1.0, 2.0, 3.0]|
|`query_texts`|str or List[str]|No|A single text or a list of texts. The collection's `embedding_function` converts the texts to vectors; if not provided, `query_embeddings` must be provided.|["my query text"]|
|`where`|dict|No|Metadata filter conditions.|`{"category": {"$eq": "AI"}}`|
|`n_results`|int|Yes|The number of results for vector search.||

* Other parameters are as follows:

|Parameter|Type|Required|Description|Example value|
|---|---|---|---|---|
|`rank`|dict|No|The ranking configuration.|`{"rrf": {"rank_window_size": 60, "rank_constant": 60}}`|
|`n_results`|int|No|The number of similar results to return. The default value is 10.|3|
|`include`|List[str]|No|The list of fields to include: `["documents", "metadatas", "embeddings"]`.|["documents", "metadatas", "embeddings"]|

:::info

The `embedding_function` used is associated with the collection (set during `create_collection()` or `get_collection()`). You cannot override it for each operation.

:::
## Request example

```python
import pyseekdb

# Create a client
client = pyseekdb.Client()

collection = client.get_collection("my_collection")
collection1 = client.get_collection("my_collection1")

# Hybrid search with query_embeddings (embedding_function not used)
results = collection.hybrid_search(
    query={
        "where_document": {"$contains": "machine learning"},
        "n_results": 10
    },
    knn={
        "query_embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # Used directly
        "n_results": 10
    },
    rank={"rrf": {}},
    n_results=5
)

# Hybrid search with both full-text and vector search (using query_texts)
results = collection1.hybrid_search(
    query={
        "where_document": {"$contains": "machine learning"},
        "where": {"category": {"$eq": "science"}},
        "n_results": 10
    },
    knn={
        "query_texts": ["AI research"],  # Will be embedded automatically
        "where": {"year": {"$gte": 2020}},
        "n_results": 10
    },
    rank={"rrf": {}},  # Reciprocal Rank Fusion
    n_results=5,
    include=["documents", "metadatas", "embeddings"]
)

# Hybrid search with multiple query texts (batch)
results = collection1.hybrid_search(
    query={
        "where_document": {"$contains": "AI"},
        "n_results": 10
    },
    knn={
        "query_texts": ["machine learning", "neural networks"],  # Multiple queries
        "n_results": 10
    },
    rank={"rrf": {}},
    n_results=5
)
```
## Return parameters

A dictionary containing the search results, including `ids`, `distances`, `metadatas`, `documents`, and so on.
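
Continuing from the request example above, a minimal sketch of consuming the result, assuming the same nested-list layout as `query()` results (one inner list per query):

```python
# Assumption: hybrid_search results share the query() layout.
for i in range(len(results["ids"][0])):
    print(f"ID: {results['ids'][0][i]}, Distance: {results['distances'][0][i]}")
```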
## Related operations

* [Vector query](200.query-interfaces-of-api.md)
* [get - Retrieve](300.get-interfaces-of-api.md)
* [Operators](500.filter-operators-of-api.md)
@@ -0,0 +1,151 @@
---
slug: /filter-operators-of-api
---

# Operators

Operators are used to connect operands or parameters and return results. In terms of syntax, operators can appear before, after, or between operands.

## Operator examples

### Data filtering (where)

#### Equal to

Use `$eq` to indicate equal to, as shown in the following example:

```python
where={"category": {"$eq": "AI"}}
```

#### Not equal to

Use `$ne` to indicate not equal to, as shown in the following example:

```python
where={"status": {"$ne": "deleted"}}
```

#### Greater than

Use `$gt` to indicate greater than, as shown in the following example:

```python
where={"score": {"$gt": 90}}
```

#### Greater than or equal to

Use `$gte` to indicate greater than or equal to, as shown in the following example:

```python
where={"score": {"$gte": 90}}
```

#### Less than

Use `$lt` to indicate less than, as shown in the following example:

```python
where={"score": {"$lt": 50}}
```

#### Less than or equal to

Use `$lte` to indicate less than or equal to, as shown in the following example:

```python
where={"score": {"$lte": 50}}
```
#### Contained in a list

Use `$in` to indicate that a value is contained in a list, as shown in the following example:

```python
where={"tag": {"$in": ["ml", "python", "ai"]}}
```

#### Not contained in a list

Use `$nin` to indicate that a value is not contained in a list, as shown in the following example:

```python
where={"tag": {"$nin": ["deprecated", "old"]}}
```
#### Logical OR

Use `$or` to indicate logical OR, as shown in the following example:

```python
where={
    "$or": [
        {"category": {"$eq": "AI"}},
        {"tag": {"$eq": "python"}}
    ]
}
```

#### Logical AND

Use `$and` to indicate logical AND, as shown in the following example:

```python
where={
    "$and": [
        {"category": {"$eq": "AI"}},
        {"score": {"$gte": 90}}
    ]
}
```
### Text filtering (where_document)

#### Full-text search (contains substring)

Use `$contains` to indicate full-text search, as shown in the following example:

```python
where_document={"$contains": "machine learning"}
```

#### Regular expression

Use `$regex` to indicate a regular expression match, as shown in the following example:

```python
where_document={"$regex": "pattern.*"}
```

#### Logical OR

Use `$or` to indicate logical OR, as shown in the following example:

```python
where_document={
    "$or": [
        {"$contains": "machine learning"},
        {"$contains": "artificial intelligence"}
    ]
}
```

#### Logical AND

Use `$and` to indicate logical AND, as shown in the following example:

```python
where_document={
    "$and": [
        {"$contains": "machine"},
        {"$contains": "learning"}
    ]
}
```
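
### Using the operators in queries

These operators plug directly into the `where` and `where_document` parameters of `get()`, `query()`, and `hybrid_search()`. A minimal sketch combining several of them (the collection and field names are illustrative):

```python
import pyseekdb

client = pyseekdb.Client()
collection = client.get_collection("my_collection")

# Metadata and document filters applied in one call
results = collection.get(
    where={
        "$and": [
            {"category": {"$eq": "AI"}},
            {"score": {"$gte": 90}}
        ]
    },
    where_document={"$contains": "machine"},
    limit=10
)
```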
## Related operations

* [Vector query](200.query-interfaces-of-api.md)
* [get - Retrieve](300.get-interfaces-of-api.md)
* [Hybrid search](400.hybrid-search-of-api.md)
@@ -0,0 +1,107 @@
---
slug: /client
---

# Client

The `Client` class is used to connect to a database in either embedded mode or server mode. It automatically selects the appropriate connection mode based on the provided parameters.

:::tip
OceanBase Database is a fully self-developed, enterprise-level, native distributed database developed by OceanBase. It achieves financial-grade high availability on ordinary hardware and sets a new standard for automatic, lossless disaster recovery across five IDCs in three regions. It also sets a new benchmark in the TPC-C benchmark test, with a single cluster size exceeding 1,500 nodes. OceanBase Database is cloud-native, highly consistent, and highly compatible with Oracle and MySQL. For more information about OceanBase Database, see [OceanBase Database](https://www.oceanbase.com/docs/oceanbase-database-cn).
:::

## Connect to an embedded seekdb instance

Use the `Client` class to connect to a local embedded seekdb instance.

```python
import pyseekdb

# Create embedded client
client = pyseekdb.Client(
    # path="./seekdb",  # Path to SeekDB data directory
    # database="test"   # Database name
)
```

The following table describes the parameters.

| Parameter | Value type | Required | Description | Example value |
| --- | --- | --- | --- | --- |
| `path` | string | No | The path to the seekdb data directory. seekdb stores database files in this directory and loads them when it starts. | `./seekdb` |
| `database` | string | No | The name of the database. | `test` |
## Connect to a remote server

Use the `Client` class to connect to a remote server, which runs seekdb or OceanBase Database.

:::tip

Before you connect to a remote server, make sure that you have deployed a server instance of seekdb or OceanBase Database. <br/>For information about how to deploy a server instance of seekdb, see [Overview](../../../400.guides/400.deploy/50.deploy-overview.md).<br/>For information about how to deploy OceanBase Database, see [Overview](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003976427).

:::

Example: Connect to a server instance of seekdb

```python
import pyseekdb

# Create remote server client (SeekDB Server)
client = pyseekdb.Client(
    host="127.0.0.1",  # Server host
    port=2881,         # Server port
    database="test",   # Database name
    user="root",       # Username
    password=""        # Password (can be retrieved from SEEKDB_PASSWORD environment variable)
)
```

The following table describes the parameters.

| Parameter | Value type | Required | Description | Example value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the instance is located. | `127.0.0.1` |
| `port` | string | Yes | The port number of the instance. The default value is 2881. | `2881` |
| `database` | string | Yes | The name of the database. | `test` |
| `user` | string | Yes | The username. The default value is root. | `root` |
| `password` | string | Yes | The password corresponding to the user. If you do not provide the `password` parameter or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. ||
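
As noted in the table, an empty `password` makes the client fall back to the `SEEKDB_PASSWORD` environment variable. A minimal sketch of that pattern:

```python
import os
import pyseekdb

# Assumption: SEEKDB_PASSWORD was exported before the process started,
# e.g. `export SEEKDB_PASSWORD=******` in the shell.
print("SEEKDB_PASSWORD set:", "SEEKDB_PASSWORD" in os.environ)

client = pyseekdb.Client(
    host="127.0.0.1",
    port=2881,
    database="test",
    user="root",
    password=""  # Empty: resolved from SEEKDB_PASSWORD
)
```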
Example: Connect to OceanBase Database

```python
import pyseekdb

# Create remote server client (OceanBase Server)
client = pyseekdb.Client(
    host="127.0.0.1",  # Server host
    port=2881,         # Server port (default: 2881)
    tenant="test",     # Tenant name
    database="test",   # Database name
    user="root",       # Username (default: "root")
    password=""        # Password (can be retrieved from SEEKDB_PASSWORD environment variable)
)
```

The following table describes the parameters.

| Parameter | Value type | Required | Description | Example value |
| --- | --- | --- | --- | --- |
| `host` | string | Yes | The IP address of the server where the database is located. | `127.0.0.1` |
| `port` | string | Yes | The port number of OceanBase Database. The default value is 2881. | `2881` |
| `tenant` | string | No | The name of the tenant. This parameter is not required for seekdb. For OceanBase Database, the default value is sys. | `test` |
| `database` | string | Yes | The name of the database. | `test` |
| `user` | string | Yes | The username corresponding to the tenant. The default value is root. | `root` |
| `password` | string | Yes | The password corresponding to the user. If you do not provide the `password` parameter or specify an empty string, the system retrieves the password from the `SEEKDB_PASSWORD` environment variable. ||

## APIs supported when you use the Client class to connect to a database

When you use the `Client` class to connect to a database, you can call the following APIs.

| API | Description | Document link |
| --- | --- | --- |
| `create_collection()` | Creates a new collection. | [Document](200.collection/100.create-collection-of-api.md) |
| `get_collection()` | Queries a specified collection. | [Document](200.collection/200.get-collection-of-api.md) |
| `delete_collection()` | Deletes a specified collection. | [Document](200.collection/400.delete-collection-of-api.md) |
| `list_collections()` | Lists all collections in the current database. | [Document](200.collection/300.list-collection-of-api.md) |
| `get_or_create_collection()` | Queries a specified collection. If the collection does not exist, it is created. | [Document](200.collection/250.get-or-create-collection-of-api.md) |
| `count_collection()` | Queries the number of collections in the current database. | [Document](200.collection/350.count-collection-of-api.md) |
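
As a quick tour, here is a sketch that strings a few of these APIs together. The parameter usage is abbreviated and illustrative; the full parameter lists are documented in the linked topics.

```python
import pyseekdb

client = pyseekdb.Client()

# Create the collection if it is missing, then enumerate and clean up.
collection = client.get_or_create_collection("my_collection")
print(client.list_collections())
client.delete_collection("my_collection")
```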
@@ -0,0 +1,35 @@
---
slug: /default-embedding-function-of-api
---

# Default embedding function

An embedding function converts text documents into vector embeddings for similarity search. pyseekdb supports built-in and custom embedding functions.

The `DefaultEmbeddingFunction` is the default embedding function if none is specified. This function is already available in seekdb and does not need to be created separately.

Here is an example:

```python
from pyseekdb import DefaultEmbeddingFunction

# Use the default model (all-MiniLM-L6-v2, 384 dimensions)
ef = DefaultEmbeddingFunction()

# Get the embedding dimension
print(f"Dimension: {ef.dimension}")  # 384

# Generate embeddings
embeddings = ef(["Hello world", "How are you?"])
print(f"Generated {len(embeddings)} embeddings, each with {len(embeddings[0])} dimensions")
```

## Related operations

If you want to use a custom function, you can refer to the following topics to create and use a custom function:

* [Create a custom embedding function](200.create-custim-embedding-functions-of-api.md)
* [Use a custom embedding function](300.using-custom-embedding-functions-of-api.md)
@@ -0,0 +1,271 @@
---
slug: /create-custim-embedding-functions-of-api
---

# Create a custom embedding function

You can create a custom embedding function by implementing the `EmbeddingFunction` protocol. Such a function must:

* Implement the `__call__` method, which accepts `Documents (str or List[str])` and returns `Embeddings (List[List[float]])`.

* Optionally implement a `dimension` attribute that returns the vector dimension.

## Prerequisites

Before creating a custom embedding function, ensure the following:

* Implement the `__call__` method:

  * Input: a single document or multiple documents, typed as str or List[str].
  * Output: the embedding vectors, typed as `List[List[float]]`.
  * Each vector must have the same dimension.

* (Recommended) Implement the `dimension` attribute:

  * Output: the dimension of the vectors generated by this function, as an `int`.
  * Providing the dimension helps verify dimension consistency when creating collections.

* Handle special cases:

  * Convert a single string input to a list.
  * Return an empty list for empty inputs.
  * All vectors in the output must have the same dimension.

A minimal sketch of this contract follows; two full examples come after it.
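
To make the contract concrete before the full examples, here is a minimal sketch. The constant vectors are placeholders rather than a real model, and the class name is illustrative.

```python
from typing import List, Union
from pyseekdb import EmbeddingFunction

Documents = Union[str, List[str]]
Embeddings = List[List[float]]

class ToyEmbeddingFunction(EmbeddingFunction[Documents]):
    """Minimal protocol sketch: fixed-dimension placeholder vectors."""

    @property
    def dimension(self) -> int:
        return 3

    def __call__(self, input: Documents) -> Embeddings:
        if isinstance(input, str):  # Convert a single string input to a list
            input = [input]
        if not input:               # Return an empty list for empty inputs
            return []
        # Every output vector has the same dimension (3 here).
        return [[0.0, 0.0, 0.0] for _ in input]
```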
## Example 1: Sentence Transformer custom embedding function

```python
from typing import List, Union
from pyseekdb import EmbeddingFunction, Client, HNSWConfiguration

Documents = Union[str, List[str]]
Embeddings = List[List[float]]

class SentenceTransformerCustomEmbeddingFunction(EmbeddingFunction[Documents]):
    """
    A custom embedding function using sentence-transformers with a specific model.
    """

    def __init__(self, model_name: str = "all-mpnet-base-v2", device: str = "cpu"):  # TODO: your own model name and device
        """
        Initialize the sentence-transformer embedding function.

        Args:
            model_name: Name of the sentence-transformers model to use
            device: Device to run the model on ('cpu' or 'cuda')
        """
        self.model_name = model_name
        self.device = device
        self._model = None
        self._dimension = None

    def _ensure_model_loaded(self):
        """Lazy load the embedding model"""
        if self._model is None:
            try:
                from sentence_transformers import SentenceTransformer
                self._model = SentenceTransformer(self.model_name, device=self.device)
                # Get dimension from model
                test_embedding = self._model.encode(["test"], convert_to_numpy=True)
                self._dimension = len(test_embedding[0])
            except ImportError:
                raise ImportError(
                    "sentence-transformers is not installed. "
                    "Please install it with: pip install sentence-transformers"
                )

    @property
    def dimension(self) -> int:
        """Get the dimension of embeddings produced by this function"""
        self._ensure_model_loaded()
        return self._dimension

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings for the given documents.

        Args:
            input: Single document (str) or list of documents (List[str])

        Returns:
            List of embedding vectors
        """
        self._ensure_model_loaded()

        # Handle single string input
        if isinstance(input, str):
            input = [input]

        # Handle empty input
        if not input:
            return []

        # Generate embeddings
        embeddings = self._model.encode(
            input,
            convert_to_numpy=True,
            show_progress_bar=False
        )

        # Convert numpy arrays to lists
        return [embedding.tolist() for embedding in embeddings]


# Use the custom embedding function
client = Client()

# Initialize embedding function with all-mpnet-base-v2 model (768 dimensions)
ef = SentenceTransformerCustomEmbeddingFunction(
    model_name='all-mpnet-base-v2',  # TODO: your own model name
    device='cpu'                     # TODO: your own device
)

# Get the dimension from the embedding function
dimension = ef.dimension
print(f"Embedding dimension: {dimension}")

# Create collection with matching dimension
collection_name = "my_collection"
if client.has_collection(collection_name):
    client.delete_collection(collection_name)

collection = client.create_collection(
    name=collection_name,
    configuration=HNSWConfiguration(dimension=dimension, distance='cosine'),
    embedding_function=ef
)

# Test the embedding function
print("\nTesting embedding function...")
test_documents = ["Hello world", "This is a test", "Sentence transformers are great"]
embeddings = ef(test_documents)
print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding has {len(embeddings[0])} dimensions")

# Add some documents to the collection
print("\nAdding documents to collection...")
collection.add(
    ids=["1", "2", "3"],
    documents=test_documents,
    metadatas=[{"source": "test1"}, {"source": "test2"}, {"source": "test3"}]
)

# Query the collection
print("\nQuerying collection...")
results = collection.query(
    query_texts="Hello",
    n_results=2
)

print("\nQuery results:")
for i in range(len(results['ids'][0])):
    print(f"ID: {results['ids'][0][i]}")
    print(f"Document: {results['documents'][0][i]}")
    print(f"Distance: {results['distances'][0][i]}")
    print()

# Clean up
client.delete_collection(name=collection_name)
print("Test completed successfully!")
```
## Example 2: OpenAI embedding function

```python
from typing import List, Union
import os
from openai import OpenAI
from pyseekdb import EmbeddingFunction
import pyseekdb

Documents = Union[str, List[str]]
Embeddings = List[List[float]]

class QWenEmbeddingFunction(EmbeddingFunction[Documents]):
    """
    A custom embedding function using OpenAI's embedding API.
    """

    def __init__(self, model_name: str = "", api_key: str = ""):  # TODO: your own model name and api key
        """
        Initialize the OpenAI embedding function.

        Args:
            model_name: Name of the OpenAI embedding model
            api_key: OpenAI API key (if not provided, uses OPENAI_API_KEY env var)
        """
        self.model_name = model_name
        self.api_key = api_key or os.environ.get('OPENAI_API_KEY')  # TODO: your own api key
        if not self.api_key:
            raise ValueError("OpenAI API key is required")

        self._dimension = 1024  # TODO: your own dimension

    @property
    def dimension(self) -> int:
        """Get the dimension of embeddings produced by this function"""
        if self._dimension is None:
            # Call API to get dimension (or use known values)
            raise ValueError("Dimension not set for this model")
        return self._dimension

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings using OpenAI API.

        Args:
            input: Single document (str) or list of documents (List[str])

        Returns:
            List of embedding vectors
        """
        # Handle single string input
        if isinstance(input, str):
            input = [input]

        # Handle empty input
        if not input:
            return []

        # Call OpenAI API
        client = OpenAI(
            api_key=self.api_key,
            base_url=""  # TODO: your own base url
        )
        response = client.embeddings.create(
            model=self.model_name,
            input=input
        )

        # Extract embeddings
        embeddings = [item.embedding for item in response.data]
        return embeddings


# Use the custom embedding function
collection_name = "my_collection"
ef = QWenEmbeddingFunction()
client = pyseekdb.Client()

if client.has_collection(collection_name):
    client.delete_collection(collection_name)

collection = client.create_collection(
    name=collection_name,
    embedding_function=ef
)

collection.add(
    ids=["1", "2", "3"],
    documents=["Hello", "World", "Hello World"],
    metadatas=[{"tag": "A"}, {"tag": "B"}, {"tag": "C"}]
)

results = collection.query(
    query_texts="Hello",
    n_results=2
)
for i in range(len(results['ids'][0])):
    print(results['ids'][0][i])
    print(results['documents'][0][i])
    print(results['metadatas'][0][i])
    print(results['distances'][0][i])
    print()

client.delete_collection(name=collection_name)
```
@@ -0,0 +1,41 @@
---
slug: /using-custom-embedding-functions-of-api
---

# Use a custom embedding function

After you create a custom embedding function, you can use it when you create or get a collection.

Here is an example. `SentenceTransformerCustomEmbeddingFunction` is the class defined in [Create a custom embedding function](200.create-custim-embedding-functions-of-api.md):

```python
import pyseekdb
from pyseekdb import HNSWConfiguration

# Create a client
client = pyseekdb.Client()

# Create collection with custom embedding function
ef = SentenceTransformerCustomEmbeddingFunction()
collection = client.create_collection(
    name="my_collection",
    configuration=HNSWConfiguration(dimension=ef.dimension, distance='cosine'),
    embedding_function=ef
)

# Get collection with custom embedding function
collection = client.get_collection("my_collection", embedding_function=ef)

# Use the collection - documents will be automatically embedded
collection.add(
    ids=["doc1", "doc2"],
    documents=["Document 1", "Document 2"],  # Vectors auto-generated
    metadatas=[{"tag": "A"}, {"tag": "B"}]
)

# Query with texts - query vectors auto-generated
results = collection.query(
    query_texts=["my query"],
    n_results=10
)
```
@@ -0,0 +1,193 @@
---
sidebar_label: Jina AI
slug: /jina
---

# Integrate seekdb vector search with Jina AI

seekdb supports vector data storage, vector indexes, and embedding vector search. You can store vectorized data in seekdb for further search.

Jina AI is an AI platform focused on multimodal search and vector search. It offers core components and tools for building enterprise-grade Retrieval-Augmented Generation (RAG) applications based on multimodal search, helping organizations and developers create advanced search-driven generative AI solutions.

## Prerequisites

* You have deployed seekdb.

* You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.

* You have installed Python 3.11 or later.

* You have installed the required dependencies:

```shell
python3 -m pip install pyobvector requests sqlalchemy
```

## Step 1: Obtain the database connection information

Contact your seekdb deployment engineer or administrator to obtain the database connection string. For example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$database_name`: The name of the database to access.

:::tip
The connected user must have <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::

* `$user_name`: The username for connecting to the database.
* `$password`: The password for the account.
## Step 2: Build your AI assistant

### Set your Jina AI API key as an environment variable

Get your [Jina AI API key](https://jina.ai/api-dashboard/reader) and configure it, along with your seekdb connection details, as environment variables:

```shell
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
```

### Example code snippets

#### Get embeddings from Jina AI

Jina AI offers several embedding models. You can choose the one that best fits your needs.

| Model | Parameter size | Embedding dimension | Text |
| --- | --- | --- | --- |
| [jina-embeddings-v3](https://zilliz.com/ai-models/jina-embeddings-v3) | 570M | Flexible embedding size (default: 1024) | Multilingual text embeddings; supports 94 languages in total |
| [jina-embeddings-v2-small-en](https://zilliz.com/ai-models/jina-embeddings-v2-small-en) | 33M | 512 | English monolingual embeddings |
| [jina-embeddings-v2-base-en](https://zilliz.com/ai-models/jina-embeddings-v2-base-en) | 137M | 768 | English monolingual embeddings |
| [jina-embeddings-v2-base-zh](https://zilliz.com/ai-models/jina-embeddings-v2-base-zh) | 161M | 768 | Chinese-English bilingual embeddings |
| [jina-embeddings-v2-base-de](https://zilliz.com/ai-models/jina-embeddings-v2-base-de) | 161M | 768 | German-English bilingual embeddings |
| [jina-embeddings-v2-base-code](https://zilliz.com/ai-models/jina-embeddings-v2-base-code) | 161M | 768 | English and programming languages |

Here is an example using `jina-embeddings-v3`. The following helper function, `generate_embeddings`, calls the Jina AI embedding API:
```python
import os
import requests
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance

JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')

# Step 1. Text data vectorization
def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'
    }

    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    response_json = response.json()
    return response_json['data'][0]['embedding']


TEXTS = [
    'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.',
    'OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.',
    'OceanBase is a native distributed relational database that supports HTAP hybrid transaction analysis and processing. It features enterprise-level characteristics such as high availability, transparent scalability, and multi-tenancy, and is compatible with MySQL/Oracle protocols.'
]
data = []
for text in TEXTS:
    # Generate the embedding for the text via Jina AI API.
    embedding = generate_embeddings(text)
    data.append({
        'content': text,
        'content_vec': embedding
    })

print(f"Successfully processed {len(data)} texts")
```
#### Define the vector table structure and store vectors in seekdb

Create a table called `jinaai_oceanbase_demo_documents` with columns for the text (`content`), the embedding vector (`content_vec`), and vector index information. Then insert the vector data into seekdb:

```python
# Step 2. Connect to seekdb
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')

client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME
)

# Step 3. Create the vector table.
table_name = "jinaai_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(500), nullable=False),
    Column("content_vec", VECTOR(1024))
]

# Create vector index
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)

client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)

print('- Inserting Data to OceanBase...')
client.insert(table_name, data=data)
```
#### Semantic search

Use the Jina AI embedding API to generate an embedding for your query text. Then, search for the most relevant document by calculating the cosine distance between the query embedding and each embedding in the vector table:

```python
# Step 4. Query the most relevant document based on the query.
query = 'What is OceanBase?'
# Generate the embedding for the query via Jina AI API.
query_embedding = generate_embeddings(query)

res = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,  # Use cosine distance function
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)

print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f'  - ID: {row[0]}\n'
          f'    content: {row[1]}\n'
          f'    distance: {row[2]}')
```

#### Expected result

```plain
- ID: 2
  content: OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.
  distance: 0.14733879001870276
```
@@ -0,0 +1,228 @@
---
sidebar_label: OpenAI
slug: /openai
---

# OpenAI

OpenAI is an artificial intelligence company that has developed several large language models. These models excel at understanding and generating natural language, making them highly effective for tasks such as text generation, answering questions, and engaging in conversations. Access to these models is available through an API.

seekdb offers features such as vector storage, vector indexing, and embedding-based vector search. By using OpenAI's API, you can convert data into vectors, store these vectors in seekdb, and then take advantage of seekdb's vector search capabilities to find relevant data.

## Prerequisites

* You have deployed seekdb.
* You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.
* You have installed [Python 3.9 or later](https://www.python.org/downloads/) and [pip](https://pip.pypa.io/en/stable/installation/).
* You have installed [Poetry](https://python-poetry.org/docs/), [Pyobvector](https://github.com/oceanbase/pyobvector), and the OpenAI SDK. The installation commands are as follows:

```shell
python3 -m pip install poetry
python3 -m pip install pyobvector
python3 -m pip install openai
```

* You have obtained an [OpenAI API key](https://platform.openai.com/api-keys).

## Step 1: Obtain the connection string of seekdb

Contact the seekdb deployment engineer or administrator to obtain the connection string of seekdb, for example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$database_name`: The name of the database to be accessed.

:::tip
The user for connection must have the <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::

* `$user_name`: The database account.
* `$password`: The password of the account.

**Here is an example:**

```shell
obclient -hxxx.xxx.xxx.xxx -P2881 -utest_user001 -p****** -Dtest
```
## Step 2: Register an LLM account

Obtain an OpenAI API key.

1. Log in to the [OpenAI](https://platform.openai.com/) platform.

2. Click **API Keys** in the upper-right corner.

3. Click **Create API Key**.

4. Specify the required information and click **Create API Key**.

Specify the API key for the relevant environment variable.

* For a Unix-based system such as Ubuntu or macOS, you can run the following command in a terminal:

```shell
export OPENAI_API_KEY='your-api-key'
```

* For a Windows system, you can run the following command in Command Prompt:

```shell
set OPENAI_API_KEY=your-api-key
```

You must replace `your-api-key` with the actual OpenAI API key.
## Step 3: Store vector data in seekdb

### Store vector data in seekdb

1. Prepare test data.

Download the [CSV file](https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240827/srxyhu/fine_food_reviews.csv) that already contains the vectorized data. This CSV file includes 1,000 food review entries, and the last column contains the vector values. Therefore, you do not need to calculate the vectors yourself. If you want to recalculate the embeddings for the "embedding" column (the vector column), you can use the following code to generate a new CSV file:

```python
from openai import OpenAI
import pandas as pd

input_datapath = "./fine_food_reviews.csv"
client = OpenAI()

# Here the text-embedding-ada-002 model is used. You can change the model as needed.
def embedding_text(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = client.embeddings.create(input=text, model=model)
    return res.data[0].embedding

df = pd.read_csv(input_datapath, index_col=0)
# It takes a few minutes to generate the CSV file by calling the OpenAI Embedding API row by row.
df["embedding"] = df.combined.apply(embedding_text)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
```
2. Run the following script to insert the test data into seekdb. The script must be located in the same directory as the test data.

```python
import os
import sys
import csv
import json
from pyobvector import *
from sqlalchemy import Column, Integer, String

# Connect to seekdb by using pyobvector and replace the at (@) sign in the username and password with %40, if any.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# The test dataset has been vectorized and is stored in the same directory as the Python script by default. If you vectorize the dataset again, specify the new file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)

# Define columns. The last column is a vector column.
cols = [
    Column('id', Integer, primary_key=True, autoincrement=False),
    Column('product_id', String(256), nullable=True),
    Column('user_id', String(256), nullable=True),
    Column('score', Integer, nullable=True),
    Column('summary', String(2048), nullable=True),
    Column('text', String(8192), nullable=True),
    Column('combined', String(8192), nullable=True),
    Column('n_tokens', Integer, nullable=True),
    Column('embedding', VECTOR(1536))
]

# Define the table name.
table_name = 'fine_food_reviews'

# If the table does not exist, create it.
if not client.check_table_exists(table_name):
    client.create_table(table_name, columns=cols)
    # Create an index on the vector column.
    client.create_index(
        table_name=table_name,
        is_vec_index=True,
        index_name='vidx',
        column_names=['embedding'],
        vidx_params='distance=l2, type=hnsw, lib=vsag',
    )

# Open and read the CSV file.
with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the header line.
    headers = next(csvreader)
    print("Headers:", headers)
    batch = []  # Buffer rows and insert 10 rows into the database each time.
    for i, row in enumerate(csvreader):
        # The CSV file contains nine columns: `id`, `product_id`, `user_id`, `score`, `summary`, `text`, `combined`, `n_tokens`, and `embedding`.
        if not row:
            break
        food_review_line = {'id': row[0], 'product_id': row[1], 'user_id': row[2], 'score': row[3], 'summary': row[4], 'text': row[5],
                            'combined': row[6], 'n_tokens': row[7], 'embedding': json.loads(row[8])}
        batch.append(food_review_line)
        # Insert 10 rows each time.
        if (i + 1) % 10 == 0:
            client.insert(table_name, batch)
            batch = []  # Clear the buffer.
    # Insert the remaining rows, if any.
    if batch:
        client.insert(table_name, batch)

# Check the data in the table and make sure that all data has been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of inserted rows: {result[0]}")
```
### Query seekdb data

1. Save the following Python script and name it `openAIQuery.py`.

```python
import os
import sys
import csv
import json
from pyobvector import *
from sqlalchemy import func
from openai import OpenAI

# Obtain command-line options.
if len(sys.argv) != 2:
    print("Enter a query statement.")
    sys.exit()
queryStatement = sys.argv[1]

# Connect to seekdb by using pyobvector and replace the at (@) sign in the username and password with %40, if any.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")
openAIclient = OpenAI()

# Define the function for generating text vectors.
def generate_embeddings(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = openAIclient.embeddings.create(input=text, model=model)
    return res.data[0].embedding


def query_ob(query, tableName, vector_name="embedding", top_k=1):
    embedding = generate_embeddings(query)
    # Perform an approximate nearest neighbor search (ANNS).
    res = client.ann_search(
        table_name=tableName,
        vec_data=embedding,
        vec_column_name=vector_name,
        distance_func=func.l2_distance,
        topk=top_k,
        output_column_names=['combined']
    )
    for row in res:
        print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))

# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 'embedding', 1)
```

2. Enter a question for an answer.

```shell
python3 openAIQuery.py 'pet food'
```

The expected result is as follows:

```plain
Crack for dogs.: These thing are like crack for dogs. I am not sure of the make-up but the doggies sure love them.
```
@@ -0,0 +1,205 @@

---
sidebar_label: Qwen
slug: /qwen
---

# Qwen

[Tongyi Qianwen (Qwen)](https://tongyi.aliyun.com) is a large language model (LLM) developed by Alibaba Cloud for interpreting and analyzing user inputs. You can access the Qwen API through [Alibaba Cloud Model Studio](https://bailian.console.alibabacloud.com/?spm=a2c63.p38356.0.0.948073b58ycZ3f&accounttraceid=ffba8dd7c8ef4dfd95c06513316ac8cfacdj#/home).

seekdb offers features such as vector storage, vector indexing, and embedding-based vector search. By using the Qwen API, you can convert data into vectors, store these vectors in seekdb, and then take advantage of seekdb's vector search capabilities to find relevant data.

## Prerequisites

* You have deployed seekdb.
* You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.
* You have installed [Python 3.9 or later](https://www.python.org/downloads/) and [pip](https://pip.pypa.io/en/stable/installation/).
* You have installed [Poetry](https://python-poetry.org/docs/), [pyobvector](https://github.com/oceanbase/pyobvector), and the DashScope SDK. The installation commands are as follows:

```shell
pip install poetry
pip install pyobvector
pip install dashscope
```

* You have obtained a [Qwen API key](https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key).

## Step 1: Obtain the connection string of seekdb

Contact the seekdb deployment engineer or administrator to obtain the connection string of seekdb, for example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. The default value is `2881`.
* `$database_name`: The name of the database to be accessed.

:::tip
The user for connection must have the <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::

* `$user_name`: The database account.
* `$password`: The password of the account.
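
For example, with a locally deployed instance listening on the default port, the command might look like the following. All values here are placeholders for illustration only:

```shell
obclient -h127.0.0.1 -P2881 -uroot@test -p'******' -Dtest
```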

## Step 2: Configure the environment variable for the Qwen API key

For a Unix-based system (such as Ubuntu or macOS), run the following command in the terminal:

```shell
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```

For Windows, run the following command in the command prompt:

```shell
set DASHSCOPE_API_KEY=YOUR_DASHSCOPE_API_KEY
```

Replace `YOUR_DASHSCOPE_API_KEY` with your actual Qwen API key.
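
To verify that the variable is visible to the current shell, you can print it back:

```shell
echo $DASHSCOPE_API_KEY
```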

## Step 3: Store the vector data in seekdb

1. Prepare the test data.

Download the [CSV file](https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240827/srxyhu/fine_food_reviews.csv) that already contains the vectorized data. This CSV file includes 1,000 food review entries, and the last column contains the vector values, so you do not need to calculate the vectors yourself. If you want to recalculate the embeddings for the `embedding` column (the vector column), you can use the following code to generate a new CSV file:

```python
import dashscope
import pandas as pd

input_datapath = "./fine_food_reviews.csv"

# The text_embedding_v1 model is used here. You can change the model as needed.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=dashscope.TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

df = pd.read_csv(input_datapath, index_col=0)
# Generating the CSV file takes a few minutes because the Qwen embedding API is called row by row.
df["embedding"] = df.combined.apply(generate_embeddings)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
```

2. Execute the following script to insert the test data into seekdb. The script must be located in the same directory as the test data. The sketch after the code block shows how to run it.

```python
import csv
import json
import os

from pyobvector import *
from sqlalchemy import Column, Integer, String

# Use pyobvector to connect to seekdb. If the username or password contains an at sign (@), replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# The test dataset is prepared in advance and has been vectorized. By default, it is placed in the same directory as the Python script. If you vectorized the data yourself, replace it with the corresponding file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)

# Define the columns. The vector column is the last field.
cols = [
    Column('id', Integer, primary_key=True, autoincrement=False),
    Column('product_id', String(256), nullable=True),
    Column('user_id', String(256), nullable=True),
    Column('score', Integer, nullable=True),
    Column('summary', String(2048), nullable=True),
    Column('text', String(8192), nullable=True),
    Column('combined', String(8192), nullable=True),
    Column('n_tokens', Integer, nullable=True),
    Column('embedding', VECTOR(1536))
]

# Specify the table name.
table_name = 'fine_food_reviews'

# If the table does not exist, create it and index the vector column.
if not client.check_table_exists(table_name):
    client.create_table(table_name, columns=cols)
    # Create an index on the vector column.
    client.create_index(
        table_name=table_name,
        is_vec_index=True,
        index_name='vidx',
        column_names=['embedding'],
        vidx_params='distance=l2, type=hnsw, lib=vsag',
    )

# Open and read the CSV file.
with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the header row.
    headers = next(csvreader)
    print("Headers:", headers)
    batch = []  # Buffer rows and insert them into the database every 10 rows.
    for i, row in enumerate(csvreader):
        # The CSV file has 9 fields: id, product_id, user_id, score, summary, text, combined, n_tokens, embedding.
        if not row:
            break
        food_review_line = {'id': row[0], 'product_id': row[1], 'user_id': row[2], 'score': row[3],
                            'summary': row[4], 'text': row[5], 'combined': row[6], 'n_tokens': row[7],
                            'embedding': json.loads(row[8])}
        batch.append(food_review_line)
        # Insert data every 10 rows.
        if (i + 1) % 10 == 0:
            client.insert(table_name, batch)
            batch = []  # Clear the buffer.
    # Insert the remaining rows, if any.
    if batch:
        client.insert(table_name, batch)

# Check the data in the table to ensure that all data has been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of inserted rows: {result[0]}")
```
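
For example, save the script as `insertData.py` (the file name here is illustrative) and run it from the directory that contains the CSV file:

```shell
python3 insertData.py
```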

## Step 4: Query seekdb data

1. Save the following Python script as `query.py`.

```python
import sys

from pyobvector import *
from sqlalchemy import func
import dashscope

# Obtain the command-line argument.
if len(sys.argv) != 2:
    print("Please enter a query statement.")
    sys.exit()
queryStatement = sys.argv[1]

# Use pyobvector to connect to seekdb. If the username or password contains an at sign (@), replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# Define a function to generate text vectors.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=dashscope.TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]


def query_ob(query, tableName, vector_name="embedding", top_k=1):
    embedding = generate_embeddings(query)
    # Perform an approximate nearest neighbor search.
    res = client.ann_search(
        table_name=tableName,
        vec_data=embedding,
        vec_column_name=vector_name,
        distance_func=func.l2_distance,
        topk=top_k,
        output_column_names=['combined']
    )
    for row in res:
        print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))

# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 'embedding', 1)
```

2. Enter a question to obtain a related answer.

```shell
python3 query.py 'pet food'
```

The expected result is as follows:

```shell
This is so good!: I purchased this after my sister sent a small bag to me in a gift box. I loved it so much I wanted to find it to buy for myself and keep it around. I always look on Amazon because you can find everything here and true enough, I found this wonderful candy. It is nice to keep in your purse for when you are out and about and get a dry throat or a tickle in the back of your throat. It is also nice to have in a candy dish at home for guests to try.
```

@@ -0,0 +1,183 @@

---
sidebar_label: LangChain
slug: /langchain
---

# Integrate seekdb vector search with LangChain

seekdb supports vector data storage, vector indexing, and embedding-based vector search. You can store vectorized data in seekdb for further search.

LangChain is a framework for developing applications driven by language models. It gives an application the following capabilities:

* Context awareness: The application can connect language models to context sources, such as prompt instructions, few-shot examples, and the content to ground responses in.
* Reasoning: The application can reason based on language models, for example, to decide how to answer a question or what actions to take based on the provided context.

This topic describes how to integrate the [vector search feature](../../200.develop/100.vector-search/100.vector-search-overview/100.vector-search-intro.md) of seekdb with the [Tongyi Qianwen (Qwen) API](https://www.alibabacloud.com/en/solutions/generative-ai/qwen?_p_lc=1) and [LangChain](https://python.langchain.com/) for Document Question Answering (DQA).

## Prerequisites

* You have deployed seekdb.
* Your environment has a database and an account with read and write privileges.
* You have installed Python 3.9 or later.
* You have installed the required dependencies:

```shell
python3 -m pip install -U langchain-oceanbase
python3 -m pip install langchain_community
python3 -m pip install dashscope
```

* Vector search is controlled by the `ob_vector_memory_limit_percentage` parameter. We recommend keeping the default value of `0` (adaptive mode). If you need finer control, see the relevant configuration documentation and the sketch after this list.
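
If you do want to set the parameter explicitly, it can be changed with a system-level statement such as the following sketch. The value `30` is only an illustrative percentage of memory; choose a value that fits your memory budget:

```sql
ALTER SYSTEM SET ob_vector_memory_limit_percentage = 30;
```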

## Step 1: Obtain the database connection information

Contact the seekdb database deployment personnel or administrator to obtain the database connection string. For example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to the seekdb database.
* `$port`: The port for connecting to the seekdb database. The default value is `2881`, which can be customized during deployment.
* `$database_name`: The name of the database to access.

<main id="notice" type='notice'>
<h4>Notice</h4>
<p>The user connecting to the database must have the <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.</p>
</main>

* `$user_name`: The database account, in the format of `username`.
* `$password`: The password for the account.

For more information about the connection string, see [Connect to OceanBase Database by using OBClient](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971649).

## Step 2: Build your AI assistant

### Set the environment variable for the Qwen API key

Create a [Qwen API key](https://www.alibabacloud.com/help/en/model-studio/get-api-key?spm=a2c63.l28256.help-menu-2400256.d_2.47db1b76nM44Ut) and [configure it in the environment variables](https://www.alibabacloud.com/help/en/model-studio/configure-api-key-through-environment-variables?spm=a2c63.p38356.help-menu-2400256.d_2_0_1.56069f6b3m576u).

```shell
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```

### Load and split the documents

Download the sample data and split it into chunks of approximately 1,000 characters by using the `CharacterTextSplitter` class.

```python
import os

import requests
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_oceanbase.vectorstores import OceanbaseVectorStore
from langchain_text_splitters import CharacterTextSplitter

DASHSCOPE_API = os.environ.get("DASHSCOPE_API_KEY", "")
embeddings = DashScopeEmbeddings(
    model="text-embedding-v1", dashscope_api_key=DASHSCOPE_API
)

url = "https://raw.githubusercontent.com/GITHUBear/langchain/refs/heads/master/docs/docs/how_to/state_of_the_union.txt"
res = requests.get(url)
with open("state_of_the_union.txt", "w") as f:
    f.write(res.text)

loader = TextLoader('./state_of_the_union.txt')

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
```

### Insert the data into seekdb

```python
connection_args = {
    "host": "127.0.0.1",
    "port": "2881",
    "user": "root@user_name",
    "password": "",
    "db_name": "test",
}
DEMO_TABLE_NAME = "demo_ann"
ob = OceanbaseVectorStore(
    embedding_function=embeddings,
    table_name=DEMO_TABLE_NAME,
    connection_args=connection_args,
    drop_old=True,
    normalize=True,
)
res = ob.add_documents(documents=docs)
```

### Vector search

This step shows how to query `"What did the president say about Ketanji Brown Jackson"` from the document `state_of_the_union.txt`.

```python
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = ob.similarity_search_with_score(query, k=3)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
```

Expected output:

```shell
--------------------------------------------------------------------------------
Score: 1.204783671324283
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score: 1.2146663629717394
It is going to transform America and put us on a path to win the economic competition of the 21st Century that we face with the rest of the world—particularly with China.

As I’ve told Xi Jinping, it is never a good bet to bet against the American people.

We’ll create good jobs for millions of Americans, modernizing roads, airports, ports, and waterways all across America.

And we’ll do it all to withstand the devastating effects of the climate crisis and promote environmental justice.

We’ll build a national network of 500,000 electric vehicle charging stations, begin to replace poisonous lead pipes—so every child—and every American—has clean water to drink at home and at school, provide affordable high-speed internet for every American—urban, suburban, rural, and tribal communities.

4,000 projects have already been announced.

And tonight, I’m announcing that this year we will start fixing over 65,000 miles of highway and 1,500 bridges in disrepair.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score: 1.2193955178945004
Vice President Harris and I ran for office with a new economic vision for America.

Invest in America. Educate Americans. Grow the workforce. Build the economy from the bottom up
and the middle out, not from the top down.

Because we know that when the middle class grows, the poor have a ladder up and the wealthy do very well.

America used to have the best roads, bridges, and airports on Earth.

Now our infrastructure is ranked 13th in the world.

We won’t be able to compete for the jobs of the 21st Century if we don’t fix that.

That’s why it was so important to pass the Bipartisan Infrastructure Law—the most sweeping investment to rebuild America in history.

This was a bipartisan effort, and I want to thank the members of both parties who worked to make it happen.

We’re done talking about infrastructure weeks.

We’re going to have an infrastructure decade.
--------------------------------------------------------------------------------
```

@@ -0,0 +1,125 @@

---
sidebar_label: LlamaIndex
slug: /llamaindex
---

# Integrate seekdb vector search with LlamaIndex

seekdb supports vector data storage, vector indexing, and embedding-based vector search. You can store vectorized data in seekdb for further search.

LlamaIndex is a framework for building context-augmented generative AI applications with large language models (LLMs), including agents and workflows. It provides a rich set of capabilities such as data connectors, data indexes, agents, observability and evaluation integrations, and workflows.

This topic demonstrates how to integrate the vector search feature of seekdb with the Tongyi Qianwen (Qwen) API and LlamaIndex for Document Question Answering (DQA).

## Prerequisites

* You have deployed the seekdb database.
* Your environment has a database and an account with read and write privileges.
* Vector search is controlled by the `ob_vector_memory_limit_percentage` parameter. We recommend keeping the default value of `0` (adaptive mode). For more precise configuration, see the relevant configuration documentation.
* You have installed Python 3.9 or later.
* You have installed the required dependencies:

```shell
python3 -m pip install llama-index-vector-stores-oceanbase llama-index
python3 -m pip install llama-index-embeddings-dashscope
python3 -m pip install llama-index-llms-dashscope
```

* You have obtained a Qwen API key.

## Step 1: Obtain the database connection information

Contact the seekdb database deployment personnel or administrator to obtain the database connection string. For example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to the seekdb database.
* `$port`: The port for connecting to the seekdb database. The default value is `2881`, which can be customized during deployment.
* `$database_name`: The name of the database to access.

<main id="notice" type='notice'>
<h4>Notice</h4>
<p>The user connecting to the database must have the <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.</p>
</main>

* `$user_name`: The database account, in the format of `username`.
* `$password`: The password for the account.

For more information about the connection string, see [Connect to OceanBase Database by using OBClient](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971649).

## Step 2: Build your AI assistant

### Set the environment variable for the Qwen API key

Create a [Qwen API key](https://www.alibabacloud.com/help/en/model-studio/get-api-key?spm=a2c63.l28256.help-menu-2400256.d_2.47db1b76nM44Ut) and [configure it in the environment variables](https://www.alibabacloud.com/help/en/model-studio/configure-api-key-through-environment-variables?spm=a2c63.p38356.help-menu-2400256.d_2_0_1.56069f6b3m576u).

```shell
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```

### Download the sample data

```shell
mkdir -p '/root/llamaindex/paul_graham/'
wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O '/root/llamaindex/paul_graham/paul_graham_essay.txt'
```

### Load the data text

```python
import os

from pyobvector import ObVecClient
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.embeddings.dashscope import DashScopeEmbedding
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
from llama_index.vector_stores.oceanbase import OceanBaseVectorStore

# Set up the seekdb client.
client = ObVecClient(uri="127.0.0.1:2881", user="root@test", password="", db_name="test")

# Global settings: use DashScope embeddings.
Settings.embed_model = DashScopeEmbedding()

# Configure the LLM.
dashscope_llm = DashScope(
    model_name=DashScopeGenerationModels.QWEN_MAX,
    api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
)

# Load the documents.
documents = SimpleDirectoryReader("/root/llamaindex/paul_graham/").load_data()

oceanbase = OceanBaseVectorStore(
    client=client,
    dim=1536,
    drop_old=True,
    normalize=True,
)

storage_context = StorageContext.from_defaults(vector_store=oceanbase)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
```

## Vector search

This step shows how to query `"What did the author do growing up?"` from the document `paul_graham_essay.txt`.

```python
# Set logging to DEBUG for more detailed output.
query_engine = index.as_query_engine(llm=dashscope_llm)
res = query_engine.query("What did the author do growing up?")
res.response
```

Expected result:

```python
'Growing up, the author worked on writing and programming outside of school. In terms of writing, he wrote short stories, which he now considers to be awful, as they had very little plot and focused mainly on characters with strong feelings. For programming, he started in 9th grade by trying to write programs on an IBM 1401 at his school, using an early version of Fortran. Later, after getting a TRS-80 microcomputer, he began to write more practical programs, including simple games, a program to predict the flight height of model rockets, and a word processor that his father used for writing.'
```

@@ -0,0 +1,338 @@

---
sidebar_label: Spring AI
slug: /springai
---

# Integrate seekdb vector search with Spring AI Alibaba

seekdb supports vector data storage, vector indexing, and embedding-based vector search. You can store vectorized data in seekdb for further search.

Spring AI Alibaba is an open-source project built on Spring AI that provides best practices for developing AI-powered Java applications. It simplifies the AI application development process and adapts to cloud-native infrastructure, helping developers quickly build AI applications.

This topic describes how to integrate the vector search capability of seekdb with Spring AI Alibaba to implement data import and similarity search. By configuring the vector storage and search services, developers can easily build AI applications based on seekdb that support advanced features such as text similarity search and content recommendation.

## Prerequisites

* You have deployed seekdb.

* Download [JDK 17+](https://www.oracle.com/java/technologies/downloads/#java17). Make sure that you have installed Java 17 and configured the environment variables.

* Download [Maven](https://dlcdn.apache.org/maven/). Make sure that you have installed Maven 3.6+ for building the project and managing dependencies.

* Download [IntelliJ IDEA](https://www.jetbrains.com/idea/download/) or [Eclipse](https://www.eclipse.org/downloads/). Choose the version that suits your operating system and install it.

## Step 1: Obtain the database connection information

Contact the seekdb deployment personnel or administrator to obtain the database connection string. For example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to the seekdb database.
* `$port`: The port for connecting to the seekdb database. The default value is `2881`.
* `$database_name`: The name of the database to access.

<main id="notice" type='notice'>
<h4>Notice</h4>
<p>The user connecting to the database must have the <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.</p>
</main>

* `$user_name`: The database account.
* `$password`: The password for the account.

## Step 2: Set up the Maven project

Maven is the project management and build tool used in this topic. This step describes how to create a Maven project and add the project dependencies by configuring the `pom.xml` file.

### Create a project

1. Run the following Maven command to create a project.

```shell
mvn archetype:generate -DgroupId=com.alibaba.cloud.ai.example -DartifactId=vector-oceanbase-example -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
```

2. Go to the project directory.

```shell
cd vector-oceanbase-example
```

### Configure the `pom.xml` file

The `pom.xml` file is the core configuration file of a Maven project; it manages the project's dependencies, plugins, and other configuration. To build the project, modify the `pom.xml` file and add Spring AI Alibaba, seekdb vector storage, and the other necessary dependencies.

Open the `pom.xml` file and replace the existing content with the following:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>com.alibaba.cloud.ai.example</groupId>
        <artifactId>spring-ai-alibaba-vector-databases-example</artifactId>
        <version>1.0.0</version>
    </parent>

    <artifactId>vector-oceanbase-example</artifactId>

    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <!-- Alibaba Cloud AI starter -->
        <dependency>
            <groupId>com.alibaba.cloud.ai</groupId>
            <artifactId>spring-ai-alibaba-starter</artifactId>
        </dependency>

        <!-- Spring Boot Web support -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <!-- Spring AI auto-configuration -->
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-spring-boot-autoconfigure</artifactId>
        </dependency>

        <!-- Spring JDBC support -->
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-jdbc</artifactId>
        </dependency>

        <!-- Transformers model support -->
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-transformers</artifactId>
        </dependency>

        <!-- OceanBase Vector Database starter -->
        <dependency>
            <groupId>com.alibaba.cloud.ai</groupId>
            <artifactId>spring-ai-alibaba-starter-oceanbase-store</artifactId>
            <version>1.0.0-M6.2-SNAPSHOT</version>
        </dependency>

        <!-- OceanBase JDBC driver -->
        <dependency>
            <groupId>com.oceanbase</groupId>
            <artifactId>oceanbase-client</artifactId>
            <version>2.4.14</version>
        </dependency>
    </dependencies>

    <!-- SNAPSHOT repository configuration -->
    <repositories>
        <repository>
            <id>sonatype-snapshots</id>
            <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>
</project>
```

## Step 3: Configure the connection information of seekdb

This step configures the `application.yml` file to add the connection information of seekdb.

Create the `application.yml` file in the `src/main/resources` directory of the project and add the following content:

```yaml
server:
  port: 8080

spring:
  application:
    name: oceanbase-example
  ai:
    dashscope:
      api-key: ${DASHSCOPE_API_KEY} # Replace with your DashScope API key
    vectorstore:
      oceanbase:
        enabled: true
        url: jdbc:oceanbase://xxx:xxx/xxx # URL for connecting to seekdb
        username: xxx # Username of seekdb
        password: xxx # Password of seekdb
        tableName: vector_table # Name of the vector table (automatically created)
        defaultTopK: 2 # Default number of similar results to return
        defaultSimilarityThreshold: 0.8 # Similarity threshold (0 to 1; higher values require more similar results)
```
## Step 4: Create the main application class and controller

Create the startup class and the controller class of the Spring Boot application to implement the data import and similarity search features.

### Create an application startup class

Create a file named `OceanBaseApplication.java` in the `src/main/java/com/alibaba/cloud/ai/example/vector` directory, and add the following code to the file:

```java
package com.alibaba.cloud.ai.example.vector; // The package name must match the directory structure.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication // Enable Spring Boot auto-configuration.
public class OceanBaseApplication {
    public static void main(String[] args) {
        SpringApplication.run(OceanBaseApplication.class, args); // Start the Spring Boot application.
    }
}
```

The sample code creates the core startup class of the project, which is used to start the Spring Boot application.

### Create a vector storage controller

Create the `OceanBaseController.java` file in the `src/main/java/com/alibaba/cloud/ai/example/vector/controller` directory (so that the directory matches the package declaration below) and add the following code:

```java
package com.alibaba.cloud.ai.example.vector.controller; // The package name must match the directory structure.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.alibaba.cloud.ai.vectorstore.oceanbase.OceanBaseVectorStore;

@RestController // Mark the class as a REST controller.
@RequestMapping("/oceanbase") // Set the base path to /oceanbase.
public class OceanBaseController {

    private static final Logger logger = LoggerFactory.getLogger(OceanBaseController.class); // The logger.

    @Autowired // Automatically inject the seekdb vector store service.
    private OceanBaseVectorStore oceanBaseVectorStore;

    // The data import endpoint.
    @GetMapping("/import")
    public void importData() {
        logger.info("Start importing data");

        // Create sample metadata.
        HashMap<String, Object> map = new HashMap<>();
        map.put("id", "12345");
        map.put("year", "2025");
        map.put("name", "yingzi");

        // Create a list that contains three documents.
        List<Document> documents = List.of(
            new Document("The World is Big and Salvation Lurks Around the Corner"),
            new Document("You walk forward facing the past and you turn back toward the future.", Map.of("year", 2024)),
            new Document("Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!!", map)
        );

        // Add the documents to the vector store.
        oceanBaseVectorStore.add(documents);
    }

    // The similar-document search endpoint.
    @GetMapping("/search")
    public List<Document> search() {
        logger.info("Start searching data");

        // Perform a similarity search for documents related to "Spring" and return the two most similar results.
        return oceanBaseVectorStore.similaritySearch(SearchRequest.builder()
            .query("Spring")
            .topK(2)
            .build());
    }
}
```

## Step 5: Start and test the Maven project

### Start the project using an IDE

The following example shows how to start the project using IntelliJ IDEA.

The steps are as follows:

1. Open the project by clicking **File** > **Open** and selecting `pom.xml`.
2. Select **Open as a project**.
3. Find the main class `OceanBaseApplication.java`.
4. Right-click and select **Run 'OceanBaseApplication.main()'**.

### Test the project

1. Import the test data by visiting the following URL:

```shell
http://localhost:8080/oceanbase/import
```

2. Perform a vector search by visiting the following URL:

```shell
http://localhost:8080/oceanbase/search
```
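
You can also exercise both endpoints from the command line instead of a browser, for example with `curl` (assuming the application is running locally on port 8080 as configured above):

```shell
curl http://localhost:8080/oceanbase/import
curl http://localhost:8080/oceanbase/search
```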

The expected result is as follows:

```json
[
  {
    "id": "03fe9aad-13cc-4d25-807b-ca1bc314f571",
    "text": "Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!!",
    "metadata": {
      "name": "yingzi",
      "id": "12345",
      "year": "2025",
      "distance": "7.274442499114312"
    }
  },
  {
    "id": "75864954-0a23-4fa1-8e18-b78fd870d474",
    "text": "Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!!",
    "metadata": {
      "name": "yingzi",
      "id": "12345",
      "year": "2025",
      "distance": "7.274442499114312"
    }
  }
]
```

## FAQ

### seekdb connection failure

* Cause: The URL, username, or password is incorrect.
* Solution: Check the seekdb configuration in `application.yml` and make sure that the database service is running.

### Dependency conflict

* Cause: Conflicts between multiple Spring Boot versions.
* Solution: Run `mvn dependency:tree` to inspect the dependency tree and exclude the conflicting versions, as sketched below.
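
A minimal sketch of excluding a conflicting transitive artifact in `pom.xml`. The group and artifact IDs here are placeholders; substitute the ones reported by `mvn dependency:tree`:

```xml
<dependency>
    <groupId>com.example</groupId>
    <artifactId>some-starter</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.conflicting</groupId>
            <artifactId>conflicting-lib</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```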

### SNAPSHOT dependency cannot be downloaded

* Cause: The SNAPSHOT repository is not configured.
* Solution: Make sure that the `sonatype-snapshots` repository is added in `pom.xml`.

@@ -0,0 +1,72 @@

---
sidebar_label: Dify
slug: /dify
---

# Integrate seekdb vector search with Dify

seekdb supports vector data storage, vector indexing, and embedding-based vector search. You can store vectorized data in seekdb for further search.

Dify is an open-source large language model (LLM) application development platform. Combining Backend-as-a-Service (BaaS) and LLMOps concepts, it enables developers to quickly build production-ready generative AI applications. Even non-technical users can participate in defining AI applications and managing data operations.

Dify includes the essential technologies for building LLM applications: support for hundreds of models, an intuitive prompt orchestration interface, a high-quality RAG engine, a robust agent framework, and flexible workflow orchestration, along with user-friendly interfaces and APIs. This eliminates redundant development effort and lets developers focus on innovation and business needs.

This topic describes how to integrate the vector search capability of seekdb with Dify.

## Prerequisites

* Before deploying Dify, ensure that your machine meets the following minimum system requirements:

  * CPU: 2 cores
  * Memory: 4 GB

* This integration tutorial runs on the Docker container platform. Make sure that you have set up the [Docker platform](https://docs.docker.com/get-started/get-docker/).

* You have deployed seekdb.

## Step 1: Obtain the database connection information

Contact the seekdb deployment personnel or administrator to obtain the database connection string. For example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to the seekdb database.
* `$port`: The port for connecting to the seekdb database. The default value is `2881`.
* `$database_name`: The name of the database to access.

<main id="notice" type='notice'>
<h4>Notice</h4>
<p>The user connecting to the database must have the <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.</p>
</main>

* `$user_name`: The database account.
* `$password`: The password for the account.

## Step 2: Deploy Dify

### Method 1

For Dify deployment, refer to [Deploy with Docker Compose](https://docs.dify.ai/getting-started/install-self-hosted/docker-compose) with these modifications (a sketch of the relevant variables follows this list):

* Change the value of the `VECTOR_STORE` variable to `oceanbase` in the `.env` file.
* Start the services by using `docker compose --profile oceanbase up -d`.
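
A sketch of the `.env` entries involved. The exact variable names and defaults may differ across Dify versions, so check the `.env.example` shipped with your Dify release; the connection values below are placeholders:

```shell
VECTOR_STORE=oceanbase
OCEANBASE_VECTOR_HOST=127.0.0.1
OCEANBASE_VECTOR_PORT=2881
OCEANBASE_VECTOR_USER=root@test
OCEANBASE_VECTOR_PASSWORD=******
OCEANBASE_VECTOR_DATABASE=test
```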

### Method 2

Alternatively, you can refer to [Dify on MySQL](https://github.com/oceanbase/dify-on-mysql) to quickly start the Dify service.

To start the service, run the following commands:

```shell
cd docker
bash setup-mysql-env.sh
docker compose up -d
```

## Step 3: Use Dify

For information about connecting LLMs in Dify, refer to [Model Configuration](https://docs.dify.ai/guides/model-configuration).

@@ -0,0 +1,265 @@

---
sidebar_label: n8n
slug: /n8n
---

# Integrate seekdb vector search with n8n

n8n is a workflow automation platform with native AI capabilities that gives technical teams the flexibility of code and the speed of no-code. With over 400 integrations, native AI features, and a fair-code license, n8n lets you build robust automations while maintaining full control over your data and deployments.

This topic demonstrates how to build a Chat to seekdb workflow template by using the features of n8n.

## Prerequisites

* You have deployed seekdb.

* This integration tutorial is performed in a Docker container. Make sure that you have [set up Docker](https://docs.docker.com/get-started/get-docker/).

## Step 1: Obtain the database connection information

Contact the seekdb deployment personnel or administrator to obtain the database connection string, for example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to the seekdb database.
* `$port`: The port for connecting to the seekdb database. The default value is `2881`.
* `$database_name`: The name of the database to access.

<main id="notice" type='notice'>
<h4>Notice</h4>
<p>The user connecting to the database must have the <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.</p>
</main>

* `$user_name`: The database account.
* `$password`: The password for the account.

## Step 2: Create a test table and insert data

Before you build the workflow, create a sample table in seekdb to store book information and insert some sample data. You can verify the inserted rows with the query shown after this SQL block.

```sql
CREATE TABLE books (
    id VARCHAR(255) PRIMARY KEY,
    isbn13 VARCHAR(255),
    author TEXT,
    title VARCHAR(255),
    publisher VARCHAR(255),
    category TEXT,
    pages INT,
    price DECIMAL(10,2),
    format VARCHAR(50),
    rating DECIMAL(3,1),
    release_year YEAR
);

INSERT INTO books (
    id, isbn13, author, title, publisher, category, pages, price, format, rating, release_year
) VALUES (
    'database-internals',
    '978-1492040347',
    '"Alexander Petrov"',
    'Database Internals: A deep-dive into how distributed data systems work',
    'O\'Reilly',
    '["databases","information systems"]',
    350,
    47.28,
    'paperback',
    4.5,
    2019
);

INSERT INTO books (
    id, isbn13, author, title, publisher, category, pages, price, format, rating, release_year
) VALUES (
    'designing-data-intensive-applications',
    '978-1449373320',
    '"Martin Kleppmann"',
    'Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems',
    'O\'Reilly',
    '["databases"]',
    590,
    31.06,
    'paperback',
    4.4,
    2017
);

INSERT INTO books (
    id, isbn13, author, title, publisher, category, pages, price, format, rating, release_year
) VALUES (
    'kafka-the-definitive-guide',
    '978-1491936160',
    '["Neha Narkhede", "Gwen Shapira", "Todd Palino"]',
    'Kafka: The Definitive Guide: Real-time data and stream processing at scale',
    'O\'Reilly',
    '["databases"]',
    297,
    37.31,
    'paperback',
    3.9,
    2017
);

INSERT INTO books (
    id, isbn13, author, title, publisher, category, pages, price, format, rating, release_year
) VALUES (
    'effective-java',
    '978-1491936160',
    '"Joshua Bloch"',
    'Effective Java',
    'Addison-Wesley',
    '["programming languages", "java"]',
    412,
    27.91,
    'paperback',
    4.2,
    2017
);

INSERT INTO books (
    id, isbn13, author, title, publisher, category, pages, price, format, rating, release_year
) VALUES (
    'daemon',
    '978-1847249616',
    '"Daniel Suarez"',
    'Daemon',
    'Quercus',
    '["dystopia","novel"]',
    448,
    12.03,
    'paperback',
    4.0,
    2011
);

INSERT INTO books (
    id, isbn13, author, title, publisher, category, pages, price, format, rating, release_year
) VALUES (
    'cryptonomicon',
    '978-1847249616',
    '"Neal Stephenson"',
    'Cryptonomicon',
    'Avon',
    '["thriller", "novel"]',
    1152,
    6.99,
    'paperback',
    4.0,
    2002
);

INSERT INTO books (
    id, isbn13, author, title, publisher, category, pages, price, format, rating, release_year
) VALUES (
    'garbage-collection-handbook',
    '978-1420082791',
    '["Richard Jones", "Antony Hosking", "Eliot Moss"]',
    'The Garbage Collection Handbook: The Art of Automatic Memory Management',
    'Taylor & Francis',
    '["programming algorithms"]',
    511,
    87.85,
    'paperback',
    5.0,
    2011
);

INSERT INTO books (
    id, isbn13, author, title, publisher, category, pages, price, format, rating, release_year
) VALUES (
    'radical-candor',
    '978-1250258403',
    '"Kim Scott"',
    'Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity',
    'Macmillan',
    '["human resources","management", "new work"]',
    404,
    7.29,
    'paperback',
    4.0,
    2018
);
```
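
After running the statements, a quick sanity check confirms that all eight rows landed in the table:

```sql
SELECT COUNT(*) FROM books;
SELECT title, rating FROM books ORDER BY rating DESC LIMIT 3;
```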

## Step 3: Deploy the tools

### Private deployment of n8n

n8n is a workflow automation platform based on Node.js. It provides extensive integration capabilities and flexible extensibility. By privately deploying n8n, you can better control the runtime environment of your workflows and ensure the security and privacy of your data. This section describes how to deploy n8n in a Docker environment.

```shell
sudo docker run -d --name n8n -p 5678:5678 -e N8N_SECURE_COOKIE=false n8nio/n8n
```

### Deploy the Qwen3 model using Ollama

Ollama is an open-source AI model server that supports the deployment and management of multiple AI models. With Ollama, you can easily deploy the Qwen3 model locally to enable the AI Agent functionality. This section describes how to deploy Ollama in a Docker environment and then use Ollama to deploy the Qwen3 model.

```shell
# Deploy Ollama in a Docker environment.
sudo docker run -d -p 11434:11434 --name ollama ollama/ollama

# Deploy the Qwen3 model.
sudo docker exec -it ollama sh -c 'ollama run qwen3:latest'
```
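
To confirm that the model was pulled successfully, you can list the models known to the Ollama container:

```shell
sudo docker exec -it ollama ollama list
```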

## Step 4: Build an AI Agent workflow

n8n provides a variety of nodes for building an AI Agent workflow. This section shows you how to build a Chat to seekdb workflow template. The workflow consists of five nodes. The steps are as follows:

1. Add a trigger.

Add an HTTP trigger node to receive HTTP requests.



2. Add an AI Agent node.

Add an AI Agent node to process AI Agent requests.



3. Add an Ollama Chat Model node.

Select a free Ollama chat model, such as Qwen3, and configure the Ollama account.





4. Add a Simple Memory node.

The Simple Memory node provides short-term memory and can remember the previous five interactions in the chat.



5. Add a Tool node.

The Tool node performs database operations in seekdb. Add a MySQL tool under the AI Agent's **Tool** option.



Configure the MySQL tool as follows:



Click the edit icon shown in the preceding figure to configure the MySQL connection information.



After the configuration is complete, close the window. Click **Test step** in the upper-right corner of the configuration panel to test the database connection with a simple SQL statement, or click **Back to canvas** to return to the main interface.

6. Click **Save** to complete the workflow.

After all five nodes are configured, click **Save** to complete the workflow construction. You can then test the workflow.

<!--  -->

## Workflow demo

The completed workflow is displayed as follows:



@@ -0,0 +1,284 @@

---
sidebar_label: Cursor
slug: /cursor
---

# Integrate OceanBase MCP Server with Cursor

[MCP (Model Context Protocol)](https://modelcontextprotocol.io/introduction) is an open-source protocol introduced by Anthropic in November 2024. It allows large language models to interact with external tools and data sources. With MCP, you do not need to manually copy and execute the output of a large language model; instead, the model can directly instruct tools to perform specific actions.

[MCP Server](https://github.com/oceanbase/mcp-oceanbase/tree/main/src/oceanbase_mcp_server) enables large language models to interact with OceanBase Database through the MCP protocol and execute SQL statements. With the right client, you can quickly build project prototypes. The server is open source on GitHub.

[Cursor](https://cursordocs.com) is an AI-powered code editor that supports multiple operating systems, including Windows, macOS, and Linux.

This topic demonstrates how to integrate Cursor with OceanBase MCP Server to quickly build a backend application.

## Prerequisites

* You have deployed seekdb.

* You have installed [Python 3.11 or later](https://www.python.org/downloads/) and the corresponding [pip](https://pip.pypa.io/en/stable/installation/). If your machine has an older version of Python, you can use Miniconda to create an environment with Python 3.11 or later. For more information, see the [Miniconda installation guide](https://docs.anaconda.com/miniconda/install/).

* You have installed [Git](https://git-scm.com//downloads) for your operating system.

* You have installed the Python package manager uv. After the installation, run the `uv --version` command to verify that the installation succeeded:

```shell
pip install uv
uv --version
```

* You have downloaded [Cursor](https://cursor.com/cn/downloads) and installed the version that matches your operating system. When you use Cursor for the first time, register a new account or log in with an existing one. After logging in, you can create a new project or open an existing one.

## Step 1: Obtain the database connection information

Contact your seekdb deployment engineer or administrator to obtain the database connection string. For example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. The default value is `2881`.
* `$database_name`: The name of the database to access.

:::tip
The connected user must have <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::

* `$user_name`: The username for connecting to the database.
* `$password`: The password for the account.

## Step 2: Configure the OceanBase MCP Server

### Clone the OceanBase MCP Server repository

Run the following command to download the source code to your local device:

```shell
git clone https://github.com/oceanbase/mcp-oceanbase.git
```

Go to the source code directory:

```shell
cd mcp-oceanbase
```

### Install dependencies

Run the following commands in the `mcp-oceanbase` directory to create a virtual environment and install the dependencies:

```shell
uv venv
source .venv/bin/activate
uv pip install .
```
|
||||
|
||||
### Create a working directory for the Cursor client
|
||||
|
||||
Manually create a working directory (such as `cursor`) for the Cursor client and open it with Cursor. The files generated by Cursor will be stored in this directory.
|
||||
|
||||
### Add and configure the OceanBase MCP Server
|
||||
|
||||
1. Use Cursor V2.0.64 as an example. Click the **Open Settings** icon in the upper-right corner, select **Tools & MCP**, and click **New MCP Server**.
|
||||
|
||||

|
||||
|
||||
2. Edit the `mcp.json` configuration file.
|
||||
|
||||

|
||||
|
||||
Replace `path/to/your/mcp-oceanbase/src/oceanbase_mcp_server` with the absolute path of the `oceanbase_mcp_server` folder. Replace `OB_HOST`, `OB_PORT`, `OB_USER`, `OB_PASSWORD`, and `OB_DATABASE` with the corresponding information of your database:
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"oceanbase": {
|
||||
"command": "uv",
|
||||
"args": [
|
||||
"--directory",
|
||||
"/path/to/your/mcp-oceanbase/src/oceanbase_mcp_server",
|
||||
"run",
|
||||
"oceanbase_mcp_server"
|
||||
],
|
||||
"env": {
|
||||
"OB_HOST": "***",
|
||||
"OB_PORT": "***",
|
||||
"OB_USER": "***",
|
||||
"OB_PASSWORD": "***",
|
||||
"OB_DATABASE": "***"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
3. If the configuration is successful, the MCP Server is displayed in ready status.
|
||||
|
||||

|
||||
|
||||
### Test the MCP Server
|
||||
|
||||
1. In the chat dialog box, enter the prompt: `How many tables are there in the dataanalysis_english database?`. The Cursor client will display the SQL statement to be executed. Confirm that it is correct and click the `Run` button to execute the query. The Cursor client will display all the table names in the `dataanalysis_english` database, indicating that we have successfully connected to seekdb.
|
||||
|
||||

|
||||
|
||||
### Use FastAPI to quickly create a RESTful API project

You can use FastAPI to quickly create a RESTful API project. FastAPI is a Python web framework for building RESTful APIs.

1. Create a customer table

    In the dialog box, enter the prompt: `Create a customer table with the ID as the primary key and name, age, telephone, and location as fields`, confirm the SQL statement, and click `Run` to execute the query.

    ![cursor-005](/img/mcp/cursor/005.png)

2. Insert test data

    In the dialog box, enter the prompt: `Insert 10 rows of data into the customer table`, confirm the SQL statement, and click `Run` to execute the query. After the data is inserted, a message will be displayed: `Inserted 10 rows into the customer table. The data includes...`.

    ![cursor-006](/img/mcp/cursor/006.png)

3. Create a FastAPI project

    In the dialog box, enter the prompt: `Create a FastAPI project and generate a RESTful API based on the customer table`, confirm the SQL statement, and click `Run` to execute the query.

    ![cursor-007](/img/mcp/cursor/007.png)

    This step automatically generates the necessary files. We recommend selecting `Accept All` for the first use; the content of AI-generated files can vary, so you can adjust them as needed later.

4. Create a virtual environment and install dependencies

    Execute the following command to use the uv package manager to create a virtual environment and install the dependencies in the current directory:

    ```shell
    uv venv
    source .venv/bin/activate
    uv pip install -r requirements.txt
    ```

5. Start the FastAPI project

    Execute the following command to start the FastAPI project:

    ```shell
    uvicorn main:app --reload
    ```

6. View the data in the table

    Run the following command in the command line or use other request tools to view the data in the table:

    ```shell
    curl http://127.0.0.1:8000/customers
    ```

    The return result is as follows:

    ```json
    [{"ID":1,"name":"John Smith","age":28,"telephone":"555-0101","location":"New York, NY"},{"ID":2,"name":"Emily Johnson","age":35,"telephone":"555-0102","location":"Los Angeles, CA"},{"ID":3,"name":"Michael Brown","age":42,"telephone":"555-0103","location":"Chicago, IL"},{"ID":4,"name":"Sarah Davis","age":29,"telephone":"555-0104","location":"Houston, TX"},{"ID":5,"name":"David Wilson","age":51,"telephone":"555-0105","location":"Phoenix, AZ"},{"ID":6,"name":"Jessica Martinez","age":33,"telephone":"555-0106","location":"Philadelphia, PA"},{"ID":7,"name":"Robert Taylor","age":45,"telephone":"555-0107","location":"San Antonio, TX"},{"ID":8,"name":"Amanda Anderson","age":27,"telephone":"555-0108","location":"San Diego, CA"},{"ID":9,"name":"James Thomas","age":38,"telephone":"555-0109","location":"Dallas, TX"},{"ID":10,"name":"Lisa Jackson","age":31,"telephone":"555-0110","location":"San Jose, CA"}]
    ```

You can see that the RESTful APIs for creating, deleting, updating, and querying data have been successfully generated:

```python
from fastapi import FastAPI, HTTPException, Depends
from pydantic import BaseModel
from typing import List
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, Session

# seekdb connection configuration (modify it as needed)
DATABASE_URL = "mysql://***:***@***:***/***"

engine = create_engine(DATABASE_URL, echo=True)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String(100))
    age = Column(Integer)
    telephone = Column(String(20))
    location = Column(String(100))

class CustomerCreate(BaseModel):
    id: int
    name: str
    age: int
    telephone: str
    location: str

class CustomerUpdate(BaseModel):
    name: str = None
    age: int = None
    telephone: str = None
    location: str = None

class CustomerOut(BaseModel):
    id: int
    name: str
    age: int
    telephone: str
    location: str

    class Config:
        orm_mode = True

def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

app = FastAPI()

@app.post("/customers/", response_model=CustomerOut)
def create_customer(customer: CustomerCreate, db: Session = Depends(get_db)):
    db_customer = Customer(**customer.dict())
    db.add(db_customer)
    try:
        db.commit()
        db.refresh(db_customer)
    except Exception as e:
        db.rollback()
        raise HTTPException(status_code=400, detail=str(e))
    return db_customer

@app.get("/customers/", response_model=List[CustomerOut])
def read_customers(skip: int = 0, limit: int = 100, db: Session = Depends(get_db)):
    return db.query(Customer).offset(skip).limit(limit).all()

@app.get("/customers/{customer_id}", response_model=CustomerOut)
def read_customer(customer_id: int, db: Session = Depends(get_db)):
    customer = db.query(Customer).filter(Customer.id == customer_id).first()
    if customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    return customer

@app.put("/customers/{customer_id}", response_model=CustomerOut)
def update_customer(customer_id: int, customer: CustomerUpdate, db: Session = Depends(get_db)):
    db_customer = db.query(Customer).filter(Customer.id == customer_id).first()
    if db_customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    for var, value in vars(customer).items():
        if value is not None:
            setattr(db_customer, var, value)
    db.commit()
    db.refresh(db_customer)
    return db_customer

@app.delete("/customers/{customer_id}")
def delete_customer(customer_id: int, db: Session = Depends(get_db)):
    db_customer = db.query(Customer).filter(Customer.id == customer_id).first()
    if db_customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    db.delete(db_customer)
    db.commit()
    return {"ok": True}
```
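
Assuming the generated routes are left exactly as shown above, you can exercise the remaining endpoints with requests like the following. The record values are made up for illustration:

```shell
# Create a customer (this API expects an explicit id)
curl -X POST http://127.0.0.1:8000/customers/ \
  -H "Content-Type: application/json" \
  -d '{"id": 11, "name": "Test User", "age": 30, "telephone": "555-0111", "location": "Austin, TX"}'

# Partially update a customer (fields omitted from the body stay unchanged)
curl -X PUT http://127.0.0.1:8000/customers/11 \
  -H "Content-Type: application/json" \
  -d '{"age": 31}'

# Delete a customer
curl -X DELETE http://127.0.0.1:8000/customers/11
```
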

@@ -0,0 +1,288 @@

---
sidebar_label: Cline
slug: /cline
---

# Integrate OceanBase MCP Server with Cline

seekdb supports vector data storage, vector indexing, and embedding-based vector search. You can store vectorized data in seekdb for further search.

[Cline](https://cline.bot/) is an open-source AI coding assistant that supports the MCP protocol.

This topic uses Cline to demonstrate how to quickly build a backend application using OceanBase MCP Server.

## Prerequisites

* You have deployed seekdb.

* You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.

* You have installed [Python 3.11 or later](https://www.python.org/downloads/) and the corresponding [pip](https://pip.pypa.io/en/stable/installation/). If your machine has an earlier Python version, you can use Miniconda to create a new Python 3.11 or later environment. For more information, see [Miniconda installation guide](https://docs.anaconda.com/miniconda/install/).

* Install [Git](https://git-scm.com/downloads) based on your operating system.

* Install uv, a Python package manager. After the installation, run the `uv --version` command to verify the installation:

    ```shell
    pip install uv
    uv --version
    ```

* Install Cline:

    * If you are using Visual Studio Code IDE, search for the Cline extension and install it in the `Extensions` section. The extension name is `Cline`. After the installation, click the settings icon to configure the large model API for Cline as follows:

        ![cline-001](/img/mcp/cline/001.png)

    * If you do not have an IDE, download Cline from [Cline](https://cline.bot/) and follow the [installation guide](https://docs.cline.bot/getting-started/installing-cline).

## Step 1: Obtain the database connection information

Contact your seekdb deployment engineer or administrator to obtain the database connection string. For example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$database_name`: The name of the database to access.

:::tip
The connected user must have <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::

* `$user_name`: The username for connecting to the database.
* `$password`: The password for the account.

## Step 2: Configure the OceanBase MCP Server

This example uses Visual Studio Code to demonstrate how to configure the OceanBase MCP Server.

### Clone the OceanBase MCP Server repository

Run the following command to download the source code to your local device:

```shell
git clone https://github.com/oceanbase/mcp-oceanbase.git
```

Go to the source code directory:

```shell
cd mcp-oceanbase
```

### Install dependencies

Run the following command in the `mcp-oceanbase` directory to create a virtual environment and install dependencies:

```shell
uv venv
source .venv/bin/activate
uv pip install .
```

### Create a working directory for Visual Studio Code

Manually create a working directory for Visual Studio Code on your local device and open it with Visual Studio Code. The files generated by Cline will be placed in this directory. The name of the sample directory is `cline-generate`.

<!-- ![cline-002](/img/mcp/cline/002.png) -->

### Configure the OceanBase MCP Server in Cline

Click the Cline icon on the left-side navigation pane to open the Cline dialog box.

![cline-003](/img/mcp/cline/003.png)

### Add and configure MCP servers

1. Click the **MCP Servers** icon as shown in the following figure.

    ![cline-004](/img/mcp/cline/004.png)

2. Manually configure the OceanBase MCP Server according to the numbered instructions in the figure below.

    ![cline-005](/img/mcp/cline/005.png)

3. Edit the configuration file.

    In the `cline_mcp_settings.json` file that was opened in the previous step, enter the following configuration information and save the file. Replace `/path/to/your/mcp-oceanbase/src/oceanbase_mcp_server` with the absolute path of the `oceanbase_mcp_server` folder, and replace `OB_HOST`, `OB_PORT`, `OB_USER`, `OB_PASSWORD`, and `OB_DATABASE` with your database information.

    The configuration file is as follows:

    ```json
    {
      "mcpServers": {
        "oceanbase": {
          "command": "uv",
          "args": [
            "--directory",
            "/path/to/your/mcp-oceanbase/src/oceanbase_mcp_server",
            "run",
            "oceanbase_mcp_server"
          ],
          "env": {
            "OB_HOST": "***",
            "OB_PORT": "***",
            "OB_USER": "***",
            "OB_PASSWORD": "***",
            "OB_DATABASE": "***"
          }
        }
      }
    }
    ```

4. If the configuration is successful, the MCP Server is displayed in ready status, and the MCP `Tools` and `Resources` information will be displayed, as shown in the following figure:

    ![cline-006](/img/mcp/cline/006.png)

5. Click the switch button in the following figure to enable the MCP Server so that Cline can use it:

    ![cline-007](/img/mcp/cline/007.png)

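
Optionally, before testing through Cline, you can confirm that the server starts at all by running the configured command directly in a terminal (same placeholder path as in `cline_mcp_settings.json`):

```shell
uv --directory /path/to/your/mcp-oceanbase/src/oceanbase_mcp_server run oceanbase_mcp_server
```
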
### Test the MCP Server

Open the Cline session dialog box and enter the prompt `How many tables are there in the dataanalysis_english database`. Cline will display the SQL statement about to be executed. Confirm the SQL statement and click the `Act` button.

![cline-008](/img/mcp/cline/008.png)

Cline will display the table names in the `dataanalysis_english` database, indicating that it can properly connect to seekdb.

![cline-009](/img/mcp/cline/009.png)

### Create a RESTful API project using FastAPI

You can use FastAPI to quickly create a RESTful API project. FastAPI is a Python web framework that allows you to build RESTful APIs efficiently.

1. Create the customer table

    In the dialog box, enter the prompt: `Create a "customer" table with "ID" as the primary key, including the fields "name", "age", "telephone", and "location"`. Confirm the SQL statement and click the `Act` button.

    ![cline-010](/img/mcp/cline/010.png)

2. Insert test data

    In the dialog box, enter the prompt: `Insert 10 rows of test data`. Confirm the SQL statement and click the `Act` button.

    ![cline-011](/img/mcp/cline/011.png)

    After the data is inserted, the execution result will be displayed.

    ![cline-012](/img/mcp/cline/012.png)

3. Create a FastAPI project

    In the dialog box, enter the prompt: `Create a FastAPI project and generate a RESTful API based on the "customer" table`. Confirm the SQL statement and click the `Act` button.

    ![cline-013](/img/mcp/cline/013.png)

    This step automatically generates files. We recommend selecting "Accept All" the first time; the content of AI-generated files can vary, so you can adjust them later as needed.

4. Create a virtual environment and install dependencies

    Run the following command to create a virtual environment using the uv package manager and install the dependency packages in the current directory:

    ```shell
    uv venv
    source .venv/bin/activate
    uv pip install -r requirements.txt
    ```

5. Start the FastAPI project

    Run the following command to start the FastAPI project:

    ```shell
    uvicorn main:app --reload
    ```

6. View data in the table

    Run the following command in the command line, or use another request tool, to view the data in the table:

    ```shell
    curl http://127.0.0.1:8000/customers
    ```

    The return result is as follows:

    ```json
    [{"ID":1,"name":"Alice Johnson","age":28,"telephone":"123-456-7890","location":"New York"},{"ID":2,"name":"Bob Smith","age":34,"telephone":"234-567-8901","location":"Los Angeles"},{"ID":3,"name":"Charlie Brown","age":45,"telephone":"345-678-9012","location":"Chicago"},{"ID":4,"name":"David Wilson","age":56,"telephone":"456-789-0123","location":"Houston"},{"ID":5,"name":"Eve Davis","age":67,"telephone":"567-890-1234","location":"Phoenix"},{"ID":6,"name":"Frank Garcia","age":78,"telephone":"678-901-2345","location":"Philadelphia"},{"ID":7,"name":"Grace Martinez","age":89,"telephone":"789-012-3456","location":"San Antonio"},{"ID":8,"name":"Hannah Robinson","age":19,"telephone":"890-123-4567","location":"San Diego"},{"ID":9,"name":"Ian Clark","age":23,"telephone":"901-234-5678","location":"Dallas"},{"ID":10,"name":"Julia Lewis","age":31,"telephone":"012-345-6789","location":"San Jose"}]
    ```

You can see that the RESTful APIs for creating, reading, updating, and deleting data have been successfully generated:

```python
from fastapi import FastAPI, Depends, HTTPException
from sqlalchemy.orm import Session
from models import Customer
from database import SessionLocal, engine
from pydantic import BaseModel

app = FastAPI()

# Database dependency
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

# Request model
class CustomerCreate(BaseModel):
    name: str
    age: int
    telephone: str
    location: str

# Response model
class CustomerResponse(CustomerCreate):
    id: int

    class Config:
        from_attributes = True

@app.post("/customers/")
def create_customer(customer: CustomerCreate, db: Session = Depends(get_db)):
    db_customer = Customer(**customer.model_dump())
    db.add(db_customer)
    db.commit()
    db.refresh(db_customer)
    return db_customer

@app.get("/customers/{customer_id}")
def read_customer(customer_id: int, db: Session = Depends(get_db)):
    customer = db.query(Customer).filter(Customer.id == customer_id).first()
    if customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    return customer

@app.get("/customers/")
def read_customers(skip: int = 0, limit: int = 10, db: Session = Depends(get_db)):
    return db.query(Customer).offset(skip).limit(limit).all()

@app.put("/customers/{customer_id}")
def update_customer(customer_id: int, customer: CustomerCreate, db: Session = Depends(get_db)):
    db_customer = db.query(Customer).filter(Customer.id == customer_id).first()
    if db_customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    for field, value in customer.model_dump().items():
        setattr(db_customer, field, value)
    db.commit()
    db.refresh(db_customer)
    return db_customer

@app.delete("/customers/{customer_id}")
def delete_customer(customer_id: int, db: Session = Depends(get_db)):
    customer = db.query(Customer).filter(Customer.id == customer_id).first()
    if customer is None:
        raise HTTPException(status_code=404, detail="Customer not found")
    db.delete(customer)
    db.commit()
    return {"message": "Customer deleted successfully"}
```
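
Assuming the generated routes are left as shown above, you can try the write endpoints with requests like the following. The values are illustrative, and the create call assumes the `customer` table auto-assigns primary keys:

```shell
# Create a customer (no id in the body; the table is assumed to generate one)
curl -X POST http://127.0.0.1:8000/customers/ \
  -H "Content-Type: application/json" \
  -d '{"name": "Test User", "age": 30, "telephone": "555-0199", "location": "Austin"}'

# Replace all fields of customer 11, then delete it
curl -X PUT http://127.0.0.1:8000/customers/11 \
  -H "Content-Type: application/json" \
  -d '{"name": "Test User", "age": 31, "telephone": "555-0199", "location": "Austin"}'
curl -X DELETE http://127.0.0.1:8000/customers/11
```
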

@@ -0,0 +1,154 @@

---
sidebar_label: Continue
slug: /continue
---

# Integrate OceanBase MCP Server with Continue

[MCP (Model Context Protocol)](https://modelcontextprotocol.io/introduction) is an open-source protocol released by Anthropic in November 2024. It enables large language models to interact with external tools or data sources. With MCP, users do not need to manually copy and execute the output of large models; instead, the models can directly instruct tools to perform specific actions (Actions).

[MCP Server](https://github.com/oceanbase/mcp-oceanbase/tree/main/src/oceanbase_mcp_server) provides the capability for large models to interact with seekdb through the MCP protocol, allowing the execution of SQL statements. With an appropriate client, you can use it to quickly build a project prototype; the server is open source on GitHub.

[Continue](https://continue.dev) is an IDE extension that integrates with the MCP Server, supporting Visual Studio Code and IntelliJ IDEA.

This topic will guide you on how to integrate Continue with the OceanBase MCP Server to quickly build backend applications.

## Prerequisites

* You have deployed seekdb.

* You have installed [Python 3.11 or later](https://www.python.org/downloads/) and the corresponding [pip](https://pip.pypa.io/en/stable/installation/). If your machine has an earlier Python version, you can use Miniconda to create a new Python 3.11 or later environment. For more information, see [Miniconda installation guide](https://docs.anaconda.com/miniconda/install/).

* Install [Git](https://git-scm.com/downloads) based on your operating system.

* Install uv, a Python package manager. After the installation, run the `uv --version` command to verify the installation:

    ```shell
    pip install uv
    uv --version
    ```

* Install the Continue extension in Visual Studio Code or IntelliJ IDEA. The extension name is `Continue`.

    ![continue-001](/img/mcp/continue/001.png)

* After the installation is complete, click `Add Models` to configure the large model API for Continue. The API configuration is as follows:

    ![continue-002](/img/mcp/continue/002.png)

* The configuration file is as follows:

    ```yaml
    name: Local Assistant
    version: 1.0.0
    schema: v1
    models:
      # Model name
      - name: DeepSeek-R1-671B
        # Model provider
        provider: deepseek
        # Model type
        model: DeepSeek-R1-671B
        # URL address for accessing the model
        apiBase: *********
        # API key for accessing the model
        apiKey: ********
    # Context provider
    context:
      - provider: code
      - provider: docs
      - provider: diff
      - provider: terminal
      - provider: problems
      - provider: folder
      - provider: codebase
    ```

## Step 1: Obtain the database connection information

Contact your seekdb deployment engineer or administrator to obtain the database connection string. For example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$database_name`: The name of the database to access.

:::tip
The connected user must have <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::

* `$user_name`: The username for connecting to the database.
* `$password`: The password for the account.

## Step 2: Configure the OceanBase MCP Server

### Clone the OceanBase MCP Server repository

Run the following command to download the source code to your local device:

```shell
git clone https://github.com/oceanbase/mcp-oceanbase.git
```

Go to the source code directory:

```shell
cd mcp-oceanbase
```

### Install dependencies

Run the following command in the `mcp-oceanbase` directory to create a virtual environment and install dependencies:

```shell
uv venv
source .venv/bin/activate
uv pip install .
```

### Add and configure MCP servers

1. Click the button in the upper-right corner of the menu bar to open the MCP panel.

    ![continue-003](/img/mcp/continue/003.png)

2. Click Add `MCP Servers`.

    :::tip
    MCP can be used only in the Continue Agent mode.
    :::

    ![continue-005](/img/mcp/continue/005.png)

3. Fill in the configuration file and click OK.

    Replace `/path/to/your/mcp-oceanbase/src/oceanbase_mcp_server` with the absolute path of the `oceanbase_mcp_server` folder. Replace `OB_HOST`, `OB_PORT`, `OB_USER`, `OB_PASSWORD`, and `OB_DATABASE` with the corresponding information of your database:

    ```yaml
    name: SeekDB
    version: 0.0.1
    schema: v1
    mcpServers:
      - name: SeekDB
        command: uv
        args:
          - --directory
          - /path/to/your/mcp-oceanbase/src/oceanbase_mcp_server
          - run
          - oceanbase_mcp_server
        env:
          OB_HOST: "****"
          OB_PORT: "***"
          OB_USER: "***"
          OB_PASSWORD: "***"
          OB_DATABASE: "***"
    ```

4. If the configuration is successful, the following message is displayed:

    ![continue-004](/img/mcp/continue/004.png)

@@ -0,0 +1,291 @@

---
sidebar_label: TRAE
slug: /trae
---

# Integrate OceanBase MCP Server with TRAE

[Model Context Protocol (MCP)](https://modelcontextprotocol.io/introduction) is an open-source protocol introduced by Anthropic in November 2024. It allows large language models to interact with external tools or data sources. With MCP, you do not need to manually copy and execute the output of large language models. Instead, the large language model can directly command tools to perform specific actions.

[MCP Server](https://github.com/oceanbase/mcp-oceanbase/tree/main/src/oceanbase_mcp_server) enables large language models to interact with OceanBase Database through the MCP protocol and execute SQL statements. It allows you to quickly build a project prototype with the help of an appropriate client and has been open-sourced on GitHub.

[TRAE](https://www.trae.ai/) is an IDE that can integrate with MCP Server and can be downloaded from its official website.

This topic will guide you through the process of integrating TRAE IDE with OceanBase MCP Server to quickly build a backend application.

## Prerequisites

* You have deployed seekdb.

* You have installed [Python 3.11 or later](https://www.python.org/downloads/) and the corresponding [pip](https://pip.pypa.io/en/stable/installation/). If your system has an earlier Python version, you can use Miniconda to create a new Python 3.11 or later environment. For more information, see [Install Miniconda](https://docs.anaconda.com/miniconda/install/).

* You have installed [Git](https://git-scm.com/downloads) based on your operating system.

* You have installed uv, a Python package manager. After the installation, run the `uv --version` command to check whether the installation was successful:

    ```shell
    pip install uv
    uv --version
    ```

* You have downloaded [TRAE IDE](https://www.trae.ai/download) and installed the version suitable for your operating system.

## Step 1: Obtain the database connection information

Contact your seekdb deployment engineer or administrator to obtain the database connection string. For example:

```sql
obclient -h$host -P$port -u$user_name -p$password -D$database_name
```

**Parameters:**

* `$host`: The IP address for connecting to seekdb.
* `$port`: The port number for connecting to seekdb. Default is `2881`.
* `$database_name`: The name of the database to access.

:::tip
The connected user must have <code>CREATE</code>, <code>INSERT</code>, <code>DROP</code>, and <code>SELECT</code> privileges on the database.
:::

* `$user_name`: The username for connecting to the database.
* `$password`: The password for the account.

## Step 2: Configure the OceanBase MCP Server

### Clone the OceanBase MCP Server repository

Run the following command to download the source code to your local device:

```shell
git clone https://github.com/oceanbase/mcp-oceanbase.git
```

Go to the source code directory:

```shell
cd mcp-oceanbase
```

### Install the dependencies

Run the following commands in the `mcp-oceanbase` directory to create a virtual environment and install the dependencies:

```shell
uv venv
source .venv/bin/activate
uv pip install .
```

### Create a working directory for the TRAE client

Manually create a working directory for TRAE and open it. TRAE will generate files in this directory. The example directory name is `trae-generate`.

![trae-001](/img/mcp/trae/001.png)

### Configure the OceanBase MCP Server in TRAE

Press `Ctrl + U` (Windows) or `Command + U` (macOS) to open the chat box. Click the gear icon in the upper-right corner and select **MCP**.

![trae-002](/img/mcp/trae/002.png)

### Add and configure MCP servers

1. Click **Add MCP Servers** and select **Add Manually**.

    ![trae-003](/img/mcp/trae/003.png)

    ![trae-004](/img/mcp/trae/004.png)

2. Delete the sample content in the edit box.

    ![trae-005](/img/mcp/trae/005.png)

    Then enter the following contents:

    ```json
    {
      "mcpServers": {
        "oceanbase": {
          "command": "uv",
          "args": [
            "--directory",
            // Replace with the absolute path of the oceanbase_mcp_server folder.
            "/path/to/your/mcp-oceanbase/src/oceanbase_mcp_server",
            "run",
            "oceanbase_mcp_server"
          ],
          "env": {
            // Replace with your OceanBase Database connection information.
            "OB_HOST": "***",
            "OB_PORT": "***",
            "OB_USER": "***",
            "OB_PASSWORD": "***",
            "OB_DATABASE": "***"
          }
        }
      }
    }
    ```

3. If the configuration is successful, the following is displayed:

    ![trae-006](/img/mcp/trae/006.png)

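
If the server does not show as connected, running the configured command in a terminal (with your real path substituted for the placeholder) is a quick way to see any startup errors:

```shell
uv --directory /path/to/your/mcp-oceanbase/src/oceanbase_mcp_server run oceanbase_mcp_server
```
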
### Test the MCP server

1. Select the **Builder with MCP** agent.

    ![trae-007](/img/mcp/trae/007.png)

2. In the dialog box, enter `How many tables are there in the test database`. The TRAE client will display the SQL statement to be executed. Confirm the SQL statement and click the `Run` button.

    ![trae-008](/img/mcp/trae/008.png)

3. The TRAE client will display the number of tables in the `test` database. This indicates that you have successfully connected to seekdb.

    ![trae-009](/img/mcp/trae/009.png)

### Create a RESTful API project using FastAPI

You can use FastAPI to quickly create a RESTful API project. FastAPI is a Python web framework that enables you to build RESTful APIs efficiently.

1. Create the customer table.

    In the dialog box, enter `Create a "customer" table with "Id" as the primary key, including the fields of "name", "age", "telephone", and "location"`. Confirm the SQL statement and click the `Run` button.

    ![trae-010](/img/mcp/trae/010.png)

2. Insert test data.

    In the dialog box, enter `Insert 10 test data entries`. Confirm the SQL statement and click the `Run` button.

    ![trae-011](/img/mcp/trae/011.png)

    The execution result is displayed after the insertion is successful:

    ![trae-012](/img/mcp/trae/012.png)

3. Create a FastAPI project.

    In the dialog box, enter `Create a FastAPI project and generate a RESTful API based on the "customer" table`. Confirm the SQL statement and click the `Run` button.

    ![trae-013](/img/mcp/trae/013.png)

    This step generates three files. We recommend selecting "Accept All" for the first use; the content of AI-generated files can vary, so you can adjust them based on your actual needs later.

4. Create a virtual environment and install dependencies

    Execute the following command to use the uv package manager to create a virtual environment and install the required packages in the current directory:

    ```shell
    uv venv
    source .venv/bin/activate
    uv pip install -r requirements.txt
    ```

5. Start the FastAPI project.

    Run the following command to start the FastAPI project:

    ```shell
    uvicorn main:app --reload
    ```

6. View the data in the table.

    Run the following command in the command line or use other request tools to view the data in the table:

    ```shell
    curl http://127.0.0.1:8000/customers
    ```

    The return result is as follows:

    ```json
    [{"Id":1,"name":"Alice","age":25,"telephone":"123-***-7890","location":"New York"},{"Id":2,"name":"Bob","age":30,"telephone":"234-***-8901","location":"Los Angeles"},{"Id":3,"name":"Charlie","age":35,"telephone":"345-***-9012","location":"Chicago"},{"Id":4,"name":"David","age":40,"telephone":"456-***-0123","location":"Houston"},{"Id":5,"name":"Eve","age":45,"telephone":"567-***-1234","location":"Miami"},{"Id":6,"name":"Frank","age":50,"telephone":"678-***-2345","location":"Seattle"},{"Id":7,"name":"Grace","age":55,"telephone":"789-***-3456","location":"Denver"},{"Id":8,"name":"Heidi","age":60,"telephone":"890-***-4567","location":"Boston"},{"Id":9,"name":"Ivan","age":65,"telephone":"901-***-5678","location":"Philadelphia"},{"Id":10,"name":"Judy","age":70,"telephone":"012-***-6789","location":"San Francisco"}]
    ```

You can see that the RESTful APIs for CRUD operations have been successfully generated:

```python
from fastapi import FastAPI
from pydantic import BaseModel
import mysql.connector

app = FastAPI()

# Database connection configuration
config = {
    'user': '*******',
    'password': '******',
    'host': 'xx.xxx.xxx.xx',
    'database': 'test',
    'port': xxxx,
    'raise_on_warnings': True
}

class Customer(BaseModel):
    id: int
    name: str
    age: int
    telephone: str
    location: str

@app.get('/customers')
async def get_customers():
    cnx = mysql.connector.connect(**config)
    cursor = cnx.cursor(dictionary=True)
    query = 'SELECT * FROM customer'
    cursor.execute(query)
    results = cursor.fetchall()
    cursor.close()
    cnx.close()
    return results

@app.get('/customers/{customer_id}')
async def get_customer(customer_id: int):
    cnx = mysql.connector.connect(**config)
    cursor = cnx.cursor(dictionary=True)
    query = 'SELECT * FROM customer WHERE ID = %s'
    cursor.execute(query, (customer_id,))
    result = cursor.fetchone()
    cursor.close()
    cnx.close()
    return result

@app.post('/customers')
async def create_customer(customer: Customer):
    cnx = mysql.connector.connect(**config)
    cursor = cnx.cursor()
    query = 'INSERT INTO customer (ID, name, age, telephone, location) VALUES (%s, %s, %s, %s, %s)'
    data = (customer.id, customer.name, customer.age, customer.telephone, customer.location)
    cursor.execute(query, data)
    cnx.commit()
    cursor.close()
    cnx.close()
    return {'message': 'Customer created successfully'}

@app.put('/customers/{customer_id}')
async def update_customer(customer_id: int, customer: Customer):
    cnx = mysql.connector.connect(**config)
    cursor = cnx.cursor()
    query = 'UPDATE customer SET name = %s, age = %s, telephone = %s, location = %s WHERE ID = %s'
    data = (customer.name, customer.age, customer.telephone, customer.location, customer_id)
    cursor.execute(query, data)
    cnx.commit()
    cursor.close()
    cnx.close()
    return {'message': 'Customer updated successfully'}

@app.delete('/customers/{customer_id}')
async def delete_customer(customer_id: int):
    cnx = mysql.connector.connect(**config)
    cursor = cnx.cursor()
    query = 'DELETE FROM customer WHERE ID = %s'
    cursor.execute(query, (customer_id,))
    cnx.commit()
    cursor.close()
    cnx.close()
    return {'message': 'Customer deleted successfully'}
```
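
Assuming the generated routes are left as shown above, you can exercise the write endpoints with requests like the following. The values are made up for illustration, and the body must include an explicit `id` because the request model requires it:

```shell
# Create a customer
curl -X POST http://127.0.0.1:8000/customers \
  -H "Content-Type: application/json" \
  -d '{"id": 11, "name": "Test User", "age": 30, "telephone": "555-0111", "location": "Austin"}'

# Update customer 11 (the path id decides which row is updated), then delete it
curl -X PUT http://127.0.0.1:8000/customers/11 \
  -H "Content-Type: application/json" \
  -d '{"id": 11, "name": "Test User", "age": 31, "telephone": "555-0111", "location": "Austin"}'
curl -X DELETE http://127.0.0.1:8000/customers/11
```
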

@@ -0,0 +1,10 @@

---

slug: /obshell-overview
---

# Overview

OceanBase Shell (obshell) is a local database command-line tool provided by OceanBase for administrators and developers. It is a no-installation, out-of-the-box tool. obshell supports cluster and standalone (seekdb) operations, enabling unified management of different ecosystem products for the same database. This simplifies integration with third-party tools and reduces the complexity and cost of managing OceanBase databases.

obshell does not require additional installation. By default, after you install OceanBase seekdb through any method, you can find the obshell executable file in the `usr/bin` directory of the installation directory.
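
For example, on a host where seekdb is installed under a hypothetical `/opt/seekdb` prefix, you could confirm that the binary is present like this:

```shell
# The prefix below is illustrative; use your actual installation directory
ls /opt/seekdb/usr/bin/obshell
```
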

@@ -0,0 +1,151 @@

---

slug: /error
---

# Error codes

This topic describes error messages that may occur during obshell usage and provides solutions.

obshell error codes are strings whose names describe the error, which makes them easier to understand. When an operation fails, you can find the corresponding error code in the <code>obshell.log</code> log file and then refer to the <b>Solution</b> column in the following table for troubleshooting.

:::info
If an error occurs while using obshell Dashboard, you can also view the HTTP response result by using the browser developer tools to obtain the error code (<code>errCode</code>).
:::
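
For example, a quick way to pull a specific error code out of the log is to search for it; the log path below is illustrative and depends on your deployment:

```shell
# Search the obshell log for a code from the table below
grep -n 'Agent.UnderMaintenance' /path/to/obshell/log/obshell.log
```
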
|
||||
|
||||
| Error code | Error message | Description | Solution |
|
||||
| --- | --- | --- | --- |
|
||||
| Agent.AlreadyInitialized | Agent already initialized | The obshell node has already been initialized | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.Current.UnderMaintenance | Agent is under maintenance | The current obshell node is under maintenance | Wait for the maintenance task to finish, or use task commands/APIs to roll back, retry, or skip the failed maintenance task. |
|
||||
| Agent.Daemon.StartFailed | Daemon start failed | Failed to start the daemon process | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.Daemon.ServeOnUnixSocketFailed | Daemon serve on socket listener failed | The daemon process failed to serve on the socket listener | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.Identify.NotSupportOperation | '%s' is '%s', instead of '%s', does not support this operation | The current node identity does not support this operation | Please check the node identity and try again. |
|
||||
| Agent.Identify.Unknown | Unknown agent identity: %s | The identity of the obshell node is unknown | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.Info.NotEqual | Agent info not equal, input is %v, meta is %v | The obshell node information does not match | Please check the obshell node information. |
|
||||
| Agent.IP.InconsistentWithOBServer | Agent IP inconsistent with observer | The obshell node IP does not match the seekdb node IP | Please check the IP configuration. |
|
||||
| Agent.Load.OBConfigFailed | Load ob config from config file failed | Failed to load seekdb configuration from the config file | Please check the configuration file. |
|
||||
| Agent.NotInitialized | Agent not initialized | The obshell node has not been initialized | Please initialize the corresponding obshell node first. |
|
||||
| Agent.OBVersionNotSupported | Unsupported ob version '%s', the minimum supported version is '%s' | The current OceanBase seekdb version is not supported | Please use a supported OceanBase seekdb version. |
|
||||
| Agent.OceanBase.DB.NotOcs | The current database is not ocs | The current database is not OCS | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.OceanBase.NotHold | Agent is not holding OceanBase seekdb | The obshell node does not have an OceanBase seekdb connection | Please restart the obshell node and try again. |
|
||||
| Agent.Oceanbase.Password.LoadFailed | Check password of root in sqlite failed | Failed to check the root password in SQLite | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.OceanBase.Useless | The current database is useless | The current OceanBase seekdb is unavailable | Please check whether OceanBase seekdb is available. |
|
||||
| Agent.Package.NotFound | Package %v is not found | The installation package could not be found | Please check the installation package and try again. |
|
||||
| Agent.Rebuild.PortNotSame | Agent port is not the same, agent port in all_agents: %d, agent port now: %d | The obshell node ports are inconsistent | Rebuild obshell using the same port. |
|
||||
| Agent.Rebuild.VersionNotSame | Agent version is not the same, agent version in all_agents: %s, agent version now: %s | obshell version inconsistency | Rebuild obshell using the same version. |
|
||||
| Agent.Response.DataEmpty | Response data is empty | Response data is empty | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.Response.DataFormatInvalid | Response data is not map | Response data format is invalid | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.RPC.RequestError | Request [%s]%s to %s error: %v | Error occurred when sending internal request between nodes | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.RPC.RequestFailed | Request [%s]%s to %s failed: %s | Failed to send internal request between nodes | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.ServeOnTcpSocketFailed | Serve on tcp listener failed | Failed to serve on TCP listener | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.ServeOnUnixSocketFailed | Serve on unix listener failed | Failed to serve on Unix listener | Please check system resources and permissions. |
|
||||
| Agent.Sqlite.DB.NotInit | The sqlite db is not initialized | SQLite database not initialized | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.Start.ObserverFailed | Start observer via flag failed | Failed to start seekdb | Please check seekdb configuration. |
|
||||
| Agent.Start.WithInvalidInfo | Agent start with invalid info: %v | obshell start information is invalid | Please check the startup parameters. |
|
||||
| Agent.TakeOverFailed | Take over or rebuild failed | Takeover or rebuild failed | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.TCP.Listener.CreateFailed | Create tcp listerner failed | Failed to create TCP listener | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.UnderMaintenanceDag | %s is under maintenance by dag [%s:%s] | Current obshell node is under maintenance | Wait for the maintenance task to complete, or use task commands/APIs to rollback, retry, or skip the failed maintenance task. |
|
||||
| Agent.UnderMaintenance | %s is under maintenance | Current obshell node is under maintenance | Wait for the maintenance task to complete, or use task commands/APIs to rollback, retry, or skip the failed maintenance task. |
|
||||
| Agent.Unix.Socket.Listener.CreateFailed | Create unix socket listerner failed | Failed to create Unix socket listener | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.Upgrade.KillOldServerTimeout | Wait obshell server killed timeout | Timeout while waiting for obshell server termination | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Agent.Upgrade.ToLowerVersion | Target version %s is not greater than current version %s. Please verify if the params have been filled out correctly | Upgrade target version is not higher than the current version | Please verify that the parameters are correctly filled in and provide a valid target version. |
|
||||
| Agent.Version.Inconsistent | obshell version is not consistent between %s(%s) and %s(%s) | obshell version inconsistency | Please use obshell with consistent versions. |
|
||||
| Cli.FlagRequired | Required flag(s) "%s" not set | Required command options not set | Please configure all required options. |
|
||||
| Cli.NotFound | %s not found | Corresponding resource not found | Please check if the resource exists. |
|
||||
| Cli.OperationCancelled | Operation cancelled | Operation cancelled | Please re-execute the operation as needed. |
|
||||
| Cli.UnixSocket.RequestFailed | Request unix-socket [%s]%s failed: %v | Unix socket request failed | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Cli.Upgrade.NoValidTargetBuildVersionFound | No valid target build version in pkg_directory found by '%s' | No valid target build version found | Please check the installation packages in the directory or provide a specific version number. |
|
||||
| Cli.Upgrade.PackageNotFoundInPath | No valid %s package found in %s | No valid installation package found | Please check the package location. |
|
||||
| Cli.UsageError | Incorrect usage: %s | Command usage error | Please check the command syntax. You can use the `-h`/`--help` option for help information on the corresponding command. |
|
||||
| Common.BadRequest | Bad request: %v | Invalid request | Please check the request parameters and try again. |
|
||||
| Common.BindJsonFailed | Bind JSON failed: %v | Failed to bind JSON | Please check the JSON format and try again. |
|
||||
| Common.DirNotEmpty | Dir '%s' is not empty | Directory is not empty | Please check the directory and try again. |
|
||||
| Common.FileNotExist | File '%s' does not exist | File does not exist | Please check the file path and try again. |
|
||||
| Common.FilePermissionDenied | No read/write permission for file '%s' | No read/write permission for the file | Please ensure you have read/write permissions for the file before retrying. |
|
||||
| Common.IllegalArgument | Illegal argument | Invalid argument | Please check the arguments and try again. |
|
||||
| Common.InvalidAddress | '%s' is not a valid address | Invalid address | Please check the address and try again. |
|
||||
| Common.InvalidIp | '%s' is not a valid IP address | Invalid IP address | Please check the IP address and try again. |
|
||||
| Common.InvalidPath | Path '%s' is not valid: %s | Invalid path format | Please check the path format and try again. |
|
||||
| Common.InvalidPort | The port '%s' is invalid, must in (1024, 65535]. | Invalid port | Please check the port. Valid port range is (1024, 65535]. |
|
||||
| Common.InvalidTimeDuration | Time duration '%s' is invalid: %s | Invalid time duration | Please check the time duration and try again. |
|
||||
| Common.NotFound | Element not found: %v | API request not found | Please check if the URI is correct. |
|
||||
| Common.PathNotDir | '%s' is not a directory | The configured path is not a directory | Please check the path and try again. |
|
||||
| Common.PathNotExist | '%s' does not exist | Path does not exist | Please check the path and try again. |
|
||||
| Common.Unauthorized | Unauthorized | Authentication failed | Please check permissions and try again. |
|
||||
| Common.Unexpected | Unexpected error: %s | Unexpected error | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Environment.DiskSpaceNotEnough | The remaining disk space is insufficient, the remaining disk space is %d, and the required disk space is %d | Insufficient disk space | Please free up disk space and try again, or switch to another disk with sufficient space. |
|
||||
| Gorm.NoRowAffected | %s: no row affected | This operation did not make any changes to the OceanBase database | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Log.FileNameExtensionMismatched | File name '%s' extension mismatched | File extension does not match | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Log.FileNamePrefixMismatched | File name '%s' prefix mismatched | File name prefix does not match | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Log.WriteExceedMaxSize | Write length %d exceeds maximum file size %d | Write length exceeds maximum file size | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| MySQL.Error | Occur error when execute sql | Error occurred while executing SQL | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| OB.Binary.Version.Unexpected | Unexpected observer binary version. | OceanBase database version is unexpected | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| SeekDB.MinorFreezeTimeout | Minor freeze timeout | Minor Freeze timeout | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| SeekDB.NotInitialized | seekdb has not been initialized, please initialize it first | seekdb has not been initialized | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| SeekDB.Password.Incorrect | The seekdb root password is incorrect | The provided seekdb root user password is incorrect | Please check the password and try again. |
|
||||
| SeekDB.UnderMaintenance | seekdb is under maintenance, please try again later | seekdb is under maintenance | Wait for the maintenance task to complete, or use task commands/APIs to rollback, retry, or skip the failed maintenance task. |
|
||||
| SeekDB.UnderMaintenanceWithDag | seekdb is under maintenance by DAG: %s | seekdb is under maintenance | Wait for the maintenance task to complete, or use task commands/APIs to rollback, retry, or skip the failed maintenance task. |
|
||||
| SeekDB.Database.NotExist | Database %s of tenant %s | The specified database does not exist in seekdb | Please configure an existing database and try again. |
|
||||
| SeekDB.Compaction.Status.NotIdle | seekdb is in '%s' status, operation not allowed | seekdb is not in idle status | Please wait until it is idle before trying the operation again. |
|
||||
| SeekDB.Process.CheckFailed | Check seekdb process exist: %s | Failed to check seekdb process | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| SeekDB.Process.NotExist | seekdb process does not exist | seekdb process does not exist | Please start the seekdb process and try again. |
|
||||
| SeekDB.Variable.Empty | Variable name or value is empty. | Variable name or value is empty | Please check the configured variable and try again. |
|
||||
| SeekDB.Variable.Invalid | Variable '%s' is invalid: %s | Invalid variable | Please check the variable and try again. |
|
||||
| SeekDB.Variable.Name.Empty | Variable name is empty | Variable name is empty | Please provide a valid variable name. |
|
||||
| SeekDB.Variable.NotExist | Variable '%s' is not found | Variable does not exist | Please check whether the variable exists. |
|
||||
| SeekDB.User.Name.Empty | User name is empty | User name cannot be empty | Please provide a valid user name. |
|
||||
| SeekDB.Privilege.NotSupported | Unsupported privilege %s | Operation privilege not supported | Please configure the required privileges for the user. |
|
||||
| Package.Compression.NotSupported | Unsupported compression '%s', the supported compression is 'xz' | Unsupported package compression format | The supported compression format is `xz`. Please visit [OceanBase Download Center](https://en.oceanbase.com/softwarecenter) to download the corresponding installation package and re-upload. |
|
||||
| Package.Format.Invalid | Unsupported payload format '%s', the supported payload format is 'cpio' | Unsupported payload format | The supported payload format is `cpio`. Please visit [OceanBase Download Center](https://en.oceanbase.com/softwarecenter) to download the corresponding installation package and re-upload. |
|
||||
| Package.NameMismatch | RPM package name %s not match %s | Package name mismatch | Please check the package name. |
|
||||
| Package.ReleaseFormat.Invalid | Release format %s is illegal | Release format is invalid | Please visit [OceanBase Download Center](https://en.oceanbase.com/softwarecenter) to download the corresponding installation package and re-upload. |
|
||||
| Package.ReleaseInvalid | RPM package release %s not match format | Package release version does not meet format requirements | Please check the package format requirements. |
|
||||
| Request.Body.Decrypt.AES.ContentLength.Invalid | Decrypted string length is not a multiple of the block size | Decrypted string length is not a multiple of the block size | Please check the encrypted body in the HTTP request (using AES encryption). |
|
||||
| Request.Body.Decrypt.AES.KeyAndIv.Invalid | AES key and iv size error | AES key and IV size error | Please provide a valid key and IV in the request header. |
|
||||
| Request.Body.Decrypt.AES.NoKey | No key for aes | Missing AES key | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Request.Body.Decrypt.SM4.NoKey | No key for sm4 | Missing SM4 key | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Request.Body.ReadFailed | Failed to read request body: %s | Failed to read HTTP request body | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Request.File.Missing | File %s in the request is missing: %s | Missing file in request | Please check the requested file. |
|
||||
| Request.Header.NotFound | X-OCS-Header not found in http request header. | X-OCS-Header not found in HTTP request header | Please set the correct X-OCS-Header in the HTTP request header to authenticate. |
|
||||
| Request.Header.Type.Invalid | Header type error | Header type error | HTTP header type is invalid; please check the HTTP request header and try again. |
|
||||
| Request.Method.NotSupport | %s method not support | Method not supported | Please contact the OceanBase technical support team for troubleshooting. |
|
||||
| Request.Query.Param.Empty | Query param %s is empty | Query parameter is empty | Please provide the path parameter and try again. |
|
||||
| Request.Query.Param.Illegal | Query param %s is illegal | Query parameter is invalid | Please provide a valid query parameter and try again. |
|
||||
| Security.Authentication.Expired | Authentication expired | Authentication information has expired | Please update the authentication information and try again. |
| Security.Authentication.File.Sha256Mismatch | File sha256 mismatch | File SHA256 value mismatch | The file in the request does not match the SHA256 in the HTTP header. Please check and re-upload. |
| Security.Authentication.Header.DecryptFailed | Decrypt http header failed: %s | Failed to decrypt HTTP request header | Please check if the encryption public key is correct. |
| Security.Authentication.Header.UriMismatch | URI mismatch | URI in request header does not match actual request URI | Please check the request URI set in the HTTP header and try again. |
| Security.Authentication.IncorrectSeekDBPassword | seekdb root password is incorrect | Incorrect root user password in seekdb instance | Please check the root user password in seekdb and try again after confirming it is correct. |
| Security.Authentication.Timestamp.Invalid | Invalid timestamp: %s, err: %s | Invalid timestamp | Please provide a valid timestamp in the HTTP header and try again. |
| Security.Authentication.Unauthorized | Authentication failed | Authentication failed | Please check your request and try again. |
| Security.User.PermissionDenied | Permission denied | Insufficient user permissions | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Dag.Operator.CancelFinishedDag | Failed to cancel dag: dag is finished | Cannot cancel a finished DAG | Only DAGs in the RUNNING state can be canceled. |
| Task.Dag.Operator.CancelNotAllowed | Failed to cancel dag: node %s can not cancel | The DAG contains nodes that cannot be canceled | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Dag.Operator.NotSupport | Not support operator %s | Unsupported DAG operation | Please perform a supported task operation. |
| Task.Dag.Operator.PassNotAllowed | Failed to pass dag: node %s can not pass | The DAG contains nodes that cannot be skipped | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Dag.Operator.PassNotFailedDag | Failed to pass dag: dag is not failed | Cannot skip a DAG that is not in the failed state | Only failed DAGs can be skipped. |
| Task.Dag.Operator.RetryNotAllowed | Failed to set dag retry: node %s can not retry | The DAG contains nodes that cannot be retried | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Dag.Operator.RetryNotFailedDag | Failed to set dag retry: dag state is not failed | Cannot retry a DAG that is not in the failed state | Only failed DAGs can be retried. |
| Task.Dag.Operator.RollbackNotAllowed | Failed to set dag rollback: node %s can not rollback | The DAG contains nodes that cannot be rolled back | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Dag.Operator.RollbackNotFailedDag | Failed to set dag rollback: dag state is not failed | Cannot roll back a DAG that is not in the failed state | Only failed DAGs can be rolled back. |
| Task.Dag.PassTimeout | Pass %d timeout after %d seconds | Task skip operation timed out | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Dag.State.Invalid | Invalid dag state '%d' | Invalid task state | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Data.ConvertFailed | Convert %s failed: %s | Failed to convert task data | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Data.NotSet | Data %s is not set | Required task data not set | Please contact the OceanBase technical support team for troubleshooting. |
| Task.GenericID.Invalid | Invalid id: %s | Invalid generic ID | Please provide a valid generic ID. |
| Task.LocalData.ConvertFailed | Convert %s failed: %s | Failed to convert local task data | Please check the local data format and try again. |
| Task.LocalData.NotSet | Task local data %s not set | Required local task data not set | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Node.Operator.NotSupport | Not support operator %s | Unsupported operation | Please perform a supported task operation. |
| Task.Node.Operator.PassNotAllowed | Failed to pass node: node %s can not pass | Task node cannot be skipped | This node does not support the skip operation. |
| Task.Node.Operator.PassNotFailedDag | Failed to pass node: assigned dag is not failed | Cannot skip a task node when its DAG is not in the failed state | Only task nodes of a failed DAG can be skipped. |
| Task.Node.Operator.PassNotFailedNode | Failed to pass node: node %s is not failed | Cannot skip a task node that is not failed | Only failed task nodes can be skipped. |
| Task.NotFound | Task not found: %v | Task not found | Please verify the task ID and try again. |
| Task.Param.ConvertFailed | Convert %s failed: %s | Failed to convert task parameter | Please contact the OceanBase technical support team for troubleshooting. |
| Task.Param.NotSet | Param %s is not set | Required task parameter not set | Please contact the OceanBase technical support team for troubleshooting. |
| Task.RemoteTask.Failed | Remote task %s %s failed | Remote task failed | Please contact the OceanBase technical support team for troubleshooting. |
| Task.SubDag.NotAllAdvanced | Sub dag of agents: %v not advanced, main dag failed | Some Agent subtasks failed to advance | Please check and resolve the blocked subtasks. |
| Task.SubDag.NotAllCreated | Sub dag of agents: %v not created, main dag failed | Some Agent subtasks failed to be created | Please check the availability and connectivity of the Agents. |
| Task.SubDag.NotAllPassed | Not all sub dag passed, main dag failed | Some Agent subtasks failed to be skipped | Please check and resolve the failed subtasks. |
| Task.SubDag.NotAllReady | Sub dag of agents: %v not ready, can not advance main dag | Some Agent subtasks are not ready | Please wait until all subtasks are ready, or check for issues. |
| Task.SubDag.NotAllSucceed | Sub dag of agents: %v failed, main dag failed | Some Agent subtasks failed | Please check and resolve the failed subtasks. |
| Task.Template.Empty | Task template is empty | Task template is empty | Please contact the OceanBase technical support team for troubleshooting. |

@@ -0,0 +1,92 @@
---
slug: /agent-commands
---

# obshell agent commands

This topic describes the obshell agent commands, which are used to manage obshell. You can use the `-h`/`--help` option with any command to view its help information. For example, `obshell agent start -h --seekdb` displays the help information of the `start` command for seekdb; the `--seekdb` option indicates that the help applies to seekdb.

## obshell agent start

Use this command to start obshell.

```shell
obshell agent start [-P] [--password] [--seekdb] [--base-dir] [-6]

# example
obshell agent start -P 2886 --base-dir /var/lib/oceanbase
```

The following table describes the options.

| Option | Required | Data type | Default value | Description |
| --- | --- | --- | --- | --- |
| -P/--port | No | int | 2886 | The port number to which the obshell is bound. |
| --password | No | string | N/A | This option is used only for taking over a seekdb instance. When you take over a seekdb instance, you must use this option to specify the password of the root user of the seekdb instance. You can also specify the root user password by using the `OB_ROOT_PASSWORD` environment variable. |
| --seekdb | No | N/A | N/A | This option does not require a value. If you specify this option, the command applies to seekdb. If you specify `--base-dir`, you can omit `--seekdb`. |
| --base-dir | No | string | Current directory | The working directory of obshell. It must be consistent with the working directory of the corresponding seekdb instance. If you do not specify this option, the current directory is used. |
| -6/--use-ipv6 | No | N/A | N/A | This option does not require a value. If you specify this option, IPv6 is used. |

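For example, when taking over an existing seekdb instance, you can supply the root password through the `OB_ROOT_PASSWORD` environment variable instead of `--password`, as noted in the table above. A minimal sketch with placeholder values:

```shell
# placeholder password and directory; replace with your actual values
export OB_ROOT_PASSWORD='your_root_password'
obshell agent start --seekdb -P 2886 --base-dir /var/lib/oceanbase
```
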
## obshell agent stop

Use this command to stop obshell.

```shell
obshell agent stop [--port] [--seekdb] [-6]

# example
obshell agent stop --seekdb --port 2886
```

The following table describes the options.

| Option | Required | Data type | Default value | Description |
| --- | --- | --- | --- | --- |
| --port | No | int | 2886 | The port number of obshell. If you do not want to specify the port number by using this option, you can specify the obshell port by using the `OBSHELL_PORT_FOR_SEEKDB` environment variable. |
| --seekdb | No | N/A | N/A | This option does not require a value. If you specify this option, the command applies to seekdb. |
| -6/--use-ipv6 | No | N/A | N/A | This option does not require a value. If you specify this option, IPv6 is used. |

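For example, the obshell port can also come from the `OBSHELL_PORT_FOR_SEEKDB` environment variable rather than `--port`. A minimal sketch with an illustrative port value:

```shell
# illustrative: the port is read from the environment instead of --port
export OBSHELL_PORT_FOR_SEEKDB=2886
obshell agent stop --seekdb
```
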
## obshell agent restart

Use this command to restart obshell.

```shell
obshell agent restart [-P] [--password] [--seekdb] [-6]

# example
obshell agent restart --seekdb --port 2886
```

The following table describes the options.

| Option | Required | Data type | Default value | Description |
| --- | --- | --- | --- | --- |
| -P/--port | No | int | 2886 | The port number to which the obshell is bound. |
| --password | No | string | N/A | This option is used only for taking over a seekdb instance. When you take over a seekdb instance, you must use this option to specify the password of the root user of the seekdb instance. You can also specify the root user password by using the `OB_ROOT_PASSWORD` environment variable. |
| --seekdb | No | N/A | N/A | This option does not require a value. If you specify this option, the command applies to seekdb. |
| -6/--use-ipv6 | No | N/A | N/A | This option does not require a value. If you specify this option, IPv6 is used. |

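For example, on an IPv6-capable host, the restart can be performed over IPv6 with the `-6` option (illustrative; assumes an IPv6 environment):

```shell
# illustrative: restart the obshell that manages seekdb over IPv6
obshell agent restart --seekdb -6
```
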
## obshell agent upgrade

You can run this command to upgrade obshell.

```shell
obshell agent upgrade -d [-V] [-t] [--port] [--seekdb] [-6] [-y] [-v]

# example
obshell agent upgrade -d /home/oceanbase/upgrade/ -V 4.2.2.0-20231224224959 --port 2886 --seekdb
```

The following table describes the options.

| Option | Required | Data type | Default value | Description |
| --- | --- | --- | --- | --- |
| -d/--pkg_directory | Yes | string | N/A | The path where the upgrade package is stored. |
| -V/--target_version | No | string | N/A | The target version. The value must be in the correct format, such as `4.2.2.0` or `4.2.2.0-20231224224959`. If you do not specify this option, the highest version of the obshell RPM package in the directory specified by the `-d` option is selected. |
| -t/--tmp_directory | No | string | `${home_path}/upgrade` | The temporary directory for the upgrade process. This directory stores the downloaded installation package and all files generated during the decompression and installation process. The value must be an absolute path. |
| --port | No | int | 2886 | The port number of obshell. If you do not specify this option, you can specify the obshell port by setting the `OBSHELL_PORT_FOR_SEEKDB` environment variable. |
| --seekdb | No | N/A | N/A | This option does not require a value. If you specify this option, the command is applied to seekdb. |
| -6/--use-ipv6 | No | N/A | N/A | This option does not require a value. If you specify this option, IPv6 is used. |
| -y/--yes | No | N/A | N/A | This option does not require a value. If you specify this option, the system does not prompt you for confirmation when it performs an upgrade. |
| -v/--verbose | No | N/A | N/A | This option does not require a value. If you specify this option, the system displays detailed execution information. |

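If `-V` is omitted, the highest obshell RPM version found under the `-d` directory is selected, as described above. A minimal non-interactive sketch (the path is a placeholder):

```shell
# placeholder path; the version is auto-selected from the package directory
obshell agent upgrade -d /home/oceanbase/upgrade/ --seekdb -y
```
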
@@ -0,0 +1,93 @@
---
slug: /seekdb-commands
---

# obshell seekdb commands

This topic describes the seekdb commands in obshell, which you can use to manage a seekdb instance. You can use the `-h`/`--help` option with any command to view its help information, for example, `obshell seekdb start -h`. You can also use the `-v`/`--verbose` option to view the detailed execution process of a command.

:::info
The seekdb commands are designed for managing a seekdb instance, so you do not need to explicitly specify the `--seekdb` option when you execute them.
:::

## obshell seekdb start

You can run this command to start seekdb.

```shell
obshell seekdb start [--port] [--seekdb] [-6]

# example
obshell seekdb start --port 2886
```

The following table describes the options.

| Option | Required | Data Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| --port | No | int | 2886 | The port number of obshell. If you do not want to specify the port number by using this option, you can also specify it by using the environment variable `OBSHELL_PORT_FOR_SEEKDB`. |
| --seekdb | No | N/A | N/A | This option does not require a value. If you specify this option, the command applies to seekdb. |
| -6/--use-ipv6 | No | N/A | N/A | This option does not require a value. If you specify this option, IPv6 is used. |

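Because the seekdb commands apply to seekdb by default (see the note above), `--seekdb` can be omitted. A minimal sketch with the port taken from the environment (illustrative value):

```shell
# illustrative: port read from the environment; --seekdb omitted
export OBSHELL_PORT_FOR_SEEKDB=2886
obshell seekdb start
```
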
## obshell seekdb restart

You can run this command to restart seekdb.

```shell
obshell seekdb restart [-t] [--port] [--seekdb] [-6] [-y]

# example
obshell seekdb restart -t --port 2886
```

The following table describes the options.

| Option | Required | Data Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| -t/--terminate | No | N/A | N/A | This option does not require a value. If you specify this option, the `MINOR FREEZE` command is triggered before seekdb is restarted. |
| --port | No | int | 2886 | The port number of obshell. If you do not want to specify the port number by using this option, you can also specify it by using the environment variable `OBSHELL_PORT_FOR_SEEKDB`. |
| --seekdb | No | N/A | N/A | This option does not require a value. If you specify this option, the command applies to seekdb. |
| -6/--use-ipv6 | No | N/A | N/A | This option does not require a value. If you specify this option, IPv6 is used. |
| -y/--yes | No | N/A | N/A | This option does not require a value. If you specify this option, the confirmation for the restart operation is skipped. |

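For example, to trigger the `MINOR FREEZE` command before the restart and skip the confirmation prompt, combine the `-t` and `-y` options described above:

```shell
# freeze before restarting, without an interactive confirmation
obshell seekdb restart -t -y
```
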
## obshell seekdb stop

You can run this command to stop seekdb.

```shell
obshell seekdb stop [-t] [--port] [--seekdb] [-6] [-y]

# example
obshell seekdb stop -t --port 2886
```

The following table describes the options.

| Option | Required | Data Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| -t/--terminate | No | N/A | N/A | This option does not require a value. If you specify this option, the `MINOR FREEZE` command is triggered before the seekdb process is stopped. |
| --port | No | int | 2886 | The port number of obshell. If you do not want to specify the port number by using this option, you can also specify it by using the environment variable `OBSHELL_PORT_FOR_SEEKDB`. |
| --seekdb | No | N/A | N/A | This option does not require a value. If you specify this option, the command applies to seekdb. |
| -6/--use-ipv6 | No | N/A | N/A | This option does not require a value. If you specify this option, IPv6 is used. |
| -y/--yes | No | N/A | N/A | This option does not require a value. If you specify this option, the confirmation for the stop operation is skipped. |

## obshell seekdb show

You can run this command to view the configurations and status of seekdb.

```shell
obshell seekdb show [-d] [--port] [--seekdb] [-6]

# example
obshell seekdb show
```

The following table describes the options.

| Option | Required | Data Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| -d/--show_detail | No | N/A | N/A | This option does not require a value. If you specify this option, the command displays the details of seekdb. |
| --port | No | int | 2886 | The port number of obshell. If you do not want to specify the port number by using this option, you can also specify it by using the environment variable `OBSHELL_PORT_FOR_SEEKDB`. |
| --seekdb | No | N/A | N/A | This option does not require a value. If you specify this option, the command applies to seekdb. |
| -6/--use-ipv6 | No | N/A | N/A | This option does not require a value. If you specify this option, IPv6 is used. |

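To include detailed configuration information in the output, add the `-d` option described above:

```shell
# show detailed seekdb configurations and status
obshell seekdb show -d
```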