Initial commit

Zhongwei Li
2025-11-30 08:44:54 +08:00
commit eb309b7b59
133 changed files with 21979 additions and 0 deletions

---
slug: /vector-search-intro
---
# Overview of vector search
This topic introduces the core concepts of vector databases and vector search.
seekdb supports dense float vectors with up to 16,000 dimensions, as well as sparse vectors. It supports multiple vector distance metrics, including Manhattan distance, Euclidean distance, inner product, and cosine distance. seekdb also supports HNSW- and IVF-based vector indexes, along with incremental updates and deletions that do not degrade recall.
seekdb vector search offers hybrid retrieval with scalar filtering and provides flexible access interfaces: you can issue SQL over the MySQL protocol from clients in various programming languages, or use the Python SDK. In addition, seekdb integrates with AI application development frameworks such as LlamaIndex and DB-GPT, as well as the AI application development platform Dify, offering better support for AI application development.
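The four distance metrics listed above can be sketched in plain Python. This is a minimal illustration of the math only, not seekdb's implementation:

```python
import math

def manhattan(a, b):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # L2 distance: straight-line distance in the vector space
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    # Larger values indicate greater similarity
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means identical direction
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)

a, b = [1.0, 2.0, 3.0], [2.0, 2.0, 4.0]
print(manhattan(a, b))      # 2.0
print(inner_product(a, b))  # 18.0
```

Which metric to choose depends on the embedding model: cosine distance ignores vector magnitude, while inner product and L2 distance do not.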
<video data-code="9002093" src="https://obbusiness-private.oss-cn-shanghai.aliyuncs.com/doc/video/03%20OceanBase%20Vector%20Search-An%20Official%20In-depth%20Perspective.mp4" controls width="811px" height="456.188px"></video>
## Key concepts
### Unstructured data
Unstructured data is data that does not have a predefined data format or organizational structure. It typically includes data in forms such as text, images, audio, and video, as well as social media content, emails, and log files. Due to the complexity and diversity of unstructured data, processing it requires specific tools and techniques, such as natural language processing, image recognition, and machine learning.
### Vector
A vector is the projection of an object in a high-dimensional space. Mathematically, a vector is a floating-point array with the following characteristics:
* Each element in the array is a floating-point number that represents a dimension of the vector.
* The size of the array, that is, the number of elements, indicates the dimensionality of the vector space.
### Vector embedding
Vector embedding is the process of using a deep learning neural network to extract content and semantics from unstructured data such as images and videos, and convert them into feature vectors. Embedding technology maps original data from a high-dimensional space to a low-dimensional space and converts multimodal data with rich features into multi-dimensional vector data.
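The shape of the idea can be illustrated with a toy function. Real embeddings come from trained deep models (such as BERT for text or CLIP for images); the letter-frequency "embedding" below is purely illustrative of mapping arbitrary-length input to a fixed-dimension vector:

```python
def toy_embed(text, dim=26):
    # Toy "embedding": a letter-frequency vector. Real systems use deep
    # neural models; this only shows that every input, regardless of
    # length, maps to a vector of the same fixed dimensionality.
    vec = [0.0] * dim
    for ch in text.lower():
        if ch.isalpha():
            vec[(ord(ch) - ord('a')) % dim] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]  # normalize so components sum to 1

v = toy_embed("vector search")
print(len(v))  # 26: every input maps to the same dimensionality
```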
### Vector similarity search
In today's era of information explosion, users often need to quickly retrieve specific information from massive datasets. Whether it's online literature databases, e-commerce product catalogs, or rapidly growing multimedia content libraries, efficient retrieval systems are essential for locating content of interest. As data volumes continue to grow, traditional keyword-based search methods can no longer meet the demands for both accuracy and speed, giving rise to vector search technology. Vector similarity search uses feature extraction and vectorization techniques to convert unstructured data—such as text, images, and audio—into vectors. By applying similarity measurement methods to compare these vectors, it captures the deeper semantic meaning of the data. This approach delivers more precise and efficient search results, addressing the shortcomings of traditional search methods.
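The core primitive described above can be sketched as a brute-force nearest-neighbor search: score every stored vector against the query and keep the top results. This is exact search; index structures such as HNSW exist precisely to avoid scanning every vector:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, corpus, k=2):
    # Exact k-nearest-neighbor search: score every stored vector,
    # then keep the k most similar document IDs.
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

corpus = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.1],
    "doc_c": [0.8, 0.2, 0.1],
}
print(knn([1.0, 0.0, 0.0], corpus, k=2))  # ['doc_a', 'doc_c']
```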
## Why seekdb vector search?
seekdb's vector search capabilities are built on its integrated multi-model capabilities, excelling in areas such as hybrid retrieval, high performance, high availability, cost efficiency, and data security.
### Hybrid retrieval
seekdb supports hybrid retrieval across multiple data types, including vector data, spatial data, document data, and scalar data. With support for various indexes such as vector indexes, spatial indexes, and full-text indexes, seekdb delivers exceptional performance in multi-model hybrid retrieval. It enables a single database to handle diverse storage and retrieval needs for applications.
### Scalability
seekdb vector search supports the storage and retrieval of massive amounts of vector data, meeting the requirements of large-scale vector data applications.
### High performance
seekdb vector search capabilities integrate the VSAG indexing algorithm library, which demonstrates outstanding performance on the 960-dimensional GIST dataset. In the ANN-Benchmarks tests, the VSAG library significantly outperformed other algorithms.
### High availability
seekdb vector search provides reliable data storage and access capabilities. For in-memory HNSW indexes, it ensures stable retrieval performance.
### Transactions
seekdb's transaction capabilities ensure the consistency and integrity of vector data. It also offers effective concurrency control and fault recovery mechanisms.
### Cost efficiency
seekdb's storage encoding and compression capabilities significantly reduce the storage space required for vectors, helping to lower application storage costs.
### Data security
seekdb already supports comprehensive enterprise-grade security features, including identity authentication and verification, access control, data encryption, monitoring and alerts, and security auditing. These features effectively ensure data security in vector search scenarios.
### Ease of use
seekdb vector search provides flexible access interfaces, enabling SQL access through MySQL protocol clients across various programming languages, as well as seamless integration via a Python SDK. Furthermore, seekdb has been optimized for AI application development frameworks like LangChain and LlamaIndex, offering better support for AI application development.
### Comprehensive toolset
seekdb features a comprehensive database toolset, supporting data development, migration, operations, diagnostics, and full lifecycle data management, safeguarding the development and maintenance of AI applications.
## Application scenarios
* Retrieval-Augmented Generation (RAG): RAG is an artificial intelligence (AI) framework that retrieves facts from external knowledge bases to provide accurate, up-to-date information for large language models (LLMs) and to give users insight into how an LLM generates its output. RAG is commonly used in intelligent Q&A systems and knowledge bases.
* Personalized recommendation: A recommendation system suggests items that users may be interested in based on their historical behavior and preferences. When a recommendation request is initiated, the system computes similarity based on the user's features and returns items the user is likely to be interested in, such as restaurants or scenic spots.
* Image search/Text search: An image or text search task finds the results in a large-scale image or text database that are most similar to a given query. The image or text features can be stored in a vector database, where high-performance index-based storage enables efficient similarity computation and returns the images or text that best match the search criteria. This applies to scenarios such as facial recognition.
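The recommendation scenario above reduces to the same nearest-neighbor primitive: embed the user and the items in one shared space, then rank items by similarity. The three-dimensional preference vectors below are made up for illustration:

```python
def dot(a, b):
    # Inner-product similarity between two vectors
    return sum(x * y for x, y in zip(a, b))

# Hypothetical preference vectors (e.g. cuisine, price, distance)
user = [0.9, 0.2, 0.1]
items = {
    "restaurant_a": [0.8, 0.3, 0.2],
    "restaurant_b": [0.1, 0.9, 0.5],
    "scenic_spot_c": [0.7, 0.1, 0.3],
}

# Rank items by similarity to the user's preference vector
ranked = sorted(items, key=lambda name: dot(user, items[name]), reverse=True)
print(ranked[0])  # restaurant_a
```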

---
slug: /vector-search-workflow
---
# AI application workflow using seekdb vector search
This topic describes the AI application workflow using seekdb vector search.
## Convert unstructured data into feature vectors through vector embedding
Unstructured data (such as videos, documents, and images) is the starting point of the entire workflow. Various forms of unstructured data, including videos, text files (documents), and images, are transformed into vector representations through vector embedding models. The task of these models is to convert raw, unstructured data that is difficult to directly calculate similarity into high-dimensional vector data. These vectors capture the semantic information and features of the data, and can express the similarity of data through distances in the vector space. For more information, see [Vector embedding technology](../150.vector-embedding-technology.md).
## Store vector embeddings and create vector indexes in seekdb
As the core storage layer, seekdb is responsible for storing all data. This includes traditional relational tables (used for storing business data), the original unstructured data, and the vector data generated after vector embedding. For more information, see [Store vector data](../160.store-vector-data.md).
To enable efficient vector search, seekdb internally builds vector indexes for the vector data. Vector indexes are specialized data structures that significantly accelerate nearest neighbor searches in high-dimensional vector spaces. Since calculating vector similarity is computationally expensive, exact searches (calculating distances for all vectors one by one) ensure accuracy but can severely impact query performance. Through vector indexes, the system can quickly locate candidate vectors, significantly reducing the number of vectors that need distance calculations, thereby improving query efficiency while maintaining high accuracy. For more information, see [Create vector indexes](../200.vector-index/200.dense-vector-index.md).
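The speedup an index provides can be illustrated with a simplified IVF-style structure: vectors are grouped into buckets by their nearest centroid, and a query scans only the closest bucket instead of every vector. This is a toy sketch; seekdb's actual HNSW/IVF implementations are far more sophisticated:

```python
import math

def l2(a, b):
    # Euclidean distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy IVF: two fixed centroids partition the stored vectors into buckets.
centroids = [[0.0, 0.0], [10.0, 10.0]]
buckets = {0: [], 1: []}
vectors = [[0.5, 0.2], [0.1, 0.9], [9.5, 9.9], [10.2, 9.8]]
for v in vectors:
    nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
    buckets[nearest].append(v)

def ivf_search(query):
    # Scan only the bucket whose centroid is closest to the query,
    # instead of computing distances to every stored vector.
    c = min(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    return min(buckets[c], key=lambda v: l2(query, v))

print(ivf_search([9.0, 9.0]))  # [9.5, 9.9]
```

Pruning candidates this way trades a small amount of recall (the true nearest neighbor may sit in an unscanned bucket) for a large reduction in distance computations.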
## Perform nearest neighbor search and hybrid search through SQL/SDK
Users interact with the AI application through clients or programming languages by submitting queries that may involve text, images, or other formats. For more information, see [Supported clients and languages](../700.vector-search-reference/900.vector-search-supported-clients-and-languages/100.vector-search-supported-clients-and-languages-overview.md).
seekdb uses SQL statements to query and manage relational data, enabling hybrid searches that combine scalar and vector data. When a user initiates an unstructured query, the system first converts it into a vector using the embedding model. Then, leveraging both vector and scalar indexes, the system quickly retrieves the most similar vectors that also satisfy the scalar filter conditions, thus identifying the most relevant unstructured data. For detailed information about nearest neighbor search, see [Nearest neighbor search](../300.vector-similarity-search.md).
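The hybrid-search flow can be sketched in Python: apply the scalar filter first, then rank the surviving rows by vector distance. This is a conceptual sketch of the combination, not seekdb's actual execution plan:

```python
import math

def l2(a, b):
    # Euclidean distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each row mixes scalar columns (category, price) with a vector column.
rows = [
    {"id": 1, "category": "book",  "price": 12.0, "vec": [0.1, 0.9]},
    {"id": 2, "category": "book",  "price": 45.0, "vec": [0.8, 0.2]},
    {"id": 3, "category": "movie", "price": 8.0,  "vec": [0.2, 0.8]},
]

def hybrid_search(query_vec, category, top_k=1):
    # 1. Scalar filtering narrows the candidate set.
    candidates = [r for r in rows if r["category"] == category]
    # 2. Vector distance ranks the survivors.
    candidates.sort(key=lambda r: l2(query_vec, r["vec"]))
    return [r["id"] for r in candidates[:top_k]]

print(hybrid_search([0.0, 1.0], "book"))  # [1]
```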
## Generate prompts and send them to the LLM for inference
In the final stage, an optimized prompt is generated based on the hybrid search results and sent to the large language model (LLM) to complete the inference process. The LLM generates a natural language response based on this contextual information. There is a feedback loop between the LLM and the vector embedding model, meaning that the output of the LLM or user feedback can be used to optimize the embedding model, creating a cycle of continuous learning and improvement.
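The final step, turning retrieved context into a prompt, can be sketched as simple string assembly. The template below is illustrative only; real applications tune the prompt format for the target model:

```python
def build_rag_prompt(question, retrieved_passages):
    # Concatenate retrieved context as bullet points, then append
    # the user's question so the LLM answers from that context.
    context = "\n".join(f"- {p}" for p in retrieved_passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

passages = ["seekdb supports HNSW and IVF vector indexes.",
            "Vector indexes accelerate nearest neighbor search."]
print(build_rag_prompt("What indexes does seekdb support?", passages))
```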