---
slug: /full-text-index
---
# Full-text indexes
In seekdb, full-text indexes can be created on columns of the `CHAR`, `VARCHAR`, and `TEXT` types. A single table can have multiple full-text indexes, and multiple full-text indexes can be created on the same column.
Full-text indexes can be created on both partitioned and non-partitioned tables, regardless of whether they have a primary key. The limitations for creating full-text indexes are as follows:
* Full-text indexes can only be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types.
* The current version only supports creating local (`LOCAL`) full-text indexes.
* The `UNIQUE` keyword cannot be specified when creating a full-text index.
* If you want to create a full-text index involving multiple columns, you must ensure that these columns have the same character set.
Within these syntax rules and guidelines, seekdb's full-text indexing provides efficient search and retrieval over text data.
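For illustration, the following sketch creates a table with a full-text index; it also doubles as the `articles` table used by the DML examples in the next section. The column names `title` and `context` and the primary key on `title` are assumptions inferred from those examples rather than an official schema.
```sql
-- Hypothetical schema inferred from the DML examples below:
-- the primary key on title gives the ON DUPLICATE KEY example a key to conflict on,
-- and the full-text index is created on a TEXT column, as allowed by the rules above.
CREATE TABLE articles (
  title   VARCHAR(200) PRIMARY KEY,
  context TEXT,
  FULLTEXT INDEX full_idx_articles(context)
);
```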
## DML operations
For tables with full-text indexes, complex DML operations are supported, including `INSERT ... ON DUPLICATE KEY UPDATE`, `REPLACE INTO`, multi-table updates and deletes, and updatable views.
**Examples:**
* `INSERT ... ON DUPLICATE KEY UPDATE`:
```sql
INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key')
ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
```
* `REPLACE INTO`:
```sql
REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
```
* Multi-table updates and deletes.
1. Create table `tbl1`.
```sql
CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
```
2. Create table `tbl2`.
```sql
CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
```
3. Perform an update (`UPDATE`) operation on multiple tables.
```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a
SET tbl1.b = 'dddd', tbl2.b = 'eeee';
```
```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
```
```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
```
4. Perform a delete (`DELETE`) operation on multiple tables.
```sql
DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
```
```sql
DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
```
* DML operations on updatable views.
1. Create view `fts_view`.
```sql
CREATE VIEW fts_view AS SELECT * FROM tbl1;
```
2. Perform an `INSERT` operation on the updatable view.
```sql
INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
```
3. Perform an `UPDATE` operation on the updatable view.
```sql
UPDATE fts_view SET b = 'dddd';
```
```sql
UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a
SET fts_view.b = 'dddd', tbl2.b = 'eeee';
```
4. Perform a `DELETE` operation on the updatable view.
```sql
DELETE FROM fts_view WHERE b = 'dddd';
```
```sql
DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
```
## Full-text index tokenizer
seekdb's full-text index functionality supports multiple built-in tokenizers, helping users select the optimal text tokenization strategy based on their business scenarios. The default tokenizer is **Space**, while other tokenizers need to be explicitly specified using the `WITH PARSER` parameter.
**List of tokenizers**:
* **Space tokenizer**
* **Basic English tokenizer**
* **IK tokenizer**
* **Ngram tokenizer**
* **Jieba tokenizer**
**Configuration example**:
When creating or modifying a table, specify the tokenizer type for the full-text index by setting the `WITH PARSER tokenizer_option` parameter in the `CREATE TABLE/ALTER TABLE` statement.
```sql
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
FULLTEXT INDEX full_idx1_tbl2(name, doc)
WITH PARSER NGRAM
PARSER_PROPERTIES=(ngram_token_size=3));
-- Add a full-text index with a specified tokenizer to an existing table.
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx2_tbl2(name, doc)
WITH PARSER NGRAM
PARSER_PROPERTIES=(ngram_token_size=3); -- Ngram example
```
### Space tokenizer (default)
**Concepts**:
* This tokenizer splits text using spaces, punctuation marks (such as commas, periods), or non-alphanumeric characters (except underscore `_`) as delimiters.
* The tokenization results include only valid tokens with lengths between `min_token_size` (default 3) and `max_token_size` (default 84).
* Chinese characters are treated as single tokens.
**Applicable scenarios**:
* Languages separated by spaces such as English (for example "apple watch series 9").
* Chinese text with manually added delimiters (for example, "南京 长江大桥").
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM一平方公里也很小 hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM一平方公里也很小 hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
| ["详见www", "一平方公里也很小", "xxx", "南京市长江大桥有1千米长", "邮箱xx", "word", "hello", "h_name"]        |
+-------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* Spaces, commas, periods, and other symbols serve as delimiters, and consecutive Chinese characters are treated as words.
### Basic English (Beng) tokenizer
**Concepts**:
* Similar to the Space tokenizer, but treats underscores `_` as separators instead of preserving them.
* Suitable for separating English phrases, but has limited effectiveness in splitting terms without spaces (such as "iPhone15").
**Applicable scenarios**:
* Basic retrieval of English documents (such as logs, comments).
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng') |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| ["user", "log", "system", "admin", "contact", "server", "active", "visit", "status", "entry", "example", "name", "time", "response", "150ms"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* Tokens are split at underscores `_` (for example, `server_status` -> `server`, `status`, and `user_name` -> `user`, `name`). The core difference from the Space tokenizer lies in how underscores `_` are handled.
### Ngram tokenizer
**Concepts**:
* **Fixed n-value tokenization**: By default, `n=2`. This tokenizer splits consecutive non-delimiter characters into subsequences of length `n`.
* Delimiter rules follow the Space tokenizer (preserving `_`, digits, and letters).
* **Does not support length limit parameters**; it outputs all possible tokens of length `n`.
**Applicable scenarios**:
* Fuzzy matching for short text (such as user IDs, order numbers).
* Scenarios requiring fixed-length feature extraction (such as password policy analysis).
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram');
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram') |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["ab", "hn", "am", "r_", "em", "le", "po", "ma", "ou", "xy", "jo", "pl", "_d", "89", "yz", "xa", "ck", "in", "se", "tr", "oh", "12", "d1", "il", "oe", "45", "un", "ac", "co", "ex", "us", "23", "34", "or", "er", "mp", "up", "de", "su", "rt", "pp", "n_", "nt", "ki", "rd", "_a", "bc", "ng", "cc", "od", "om", "78", "ra", "ai", "do", "id"] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* With the default setting `n=2`, this tokenizer outputs all consecutive 2-character tokens, including overlapping ones (for example, `ORD12345` -> `OR`, `RD`, `D1`, `12`, `23`, `34`, `45`; `user_account` -> `us`, `se`, `er`, `r_`, `_a`, `ac`, `cc`, `co`, `ou`, `un`, `nt`).
### Ngram2 tokenizer
**Concepts**:
* Supports **dynamic n-value range**: Sets token length range through `min_ngram_size` and `max_ngram_size` parameters.
* Suitable for scenarios requiring multi-length token coverage.
**Applicable scenarios**: Scenarios that require multiple fixed-length tokens simultaneously.
:::info
When using the ngram2 tokenizer, be aware of its high memory consumption. For example, setting a wide range between the `min_ngram_size` and `max_ngram_size` parameters generates a large number of token combinations, which may lead to excessive resource consumption.
:::
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["io", "lo", "r_lo", "_ses", "_l", "r_", "ss", "user", "ses", "_s", "ogin", "sion", "on", "ess", "20", "logi", "er_", "on_", "use", "essi", "in", "se", "sio", "log", "202", "gin_", "_2", "ssi", "ogi", "us", "n_se", "r_l", "er", "024", "es", "n_2", "og", "_lo", "n_", "_log", "2024", "n_20", "gi", "er_l", "ser", "24", "ssio", "n_s", "gin", "in_", "_se", "02", "_20", "si", "sess", "on_2", "ion_", "ser_", "ion", "_202", "in_s"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* This tokenizer outputs all consecutive subsequences with lengths between 2 and 4 characters, with overlapping tokens allowed (for example, `user_login_session_2024` generates tokens like `us`, `use`, `user`, `se`, `ser`, `ser_`, `er_`, `er_l`, `r_lo`, `log`, `logi`, `ogin`, etc.).
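To bind this tokenizer to an index rather than calling `tokenize()` directly, here is a hedged sketch following the configuration syntax shown earlier; whether `PARSER_PROPERTIES` accepts the `min_ngram_size`/`max_ngram_size` keys at index-creation time is an assumption based on the `tokenize()` arguments above.
```sql
-- Sketch only: the parser name NGRAM2 and the property keys are assumed to
-- mirror the tokenize() arguments shown above; verify against your seekdb version.
CREATE TABLE tbl3 (id INT PRIMARY KEY, doc TEXT,
  FULLTEXT INDEX full_idx1_tbl3(doc)
  WITH PARSER NGRAM2
  PARSER_PROPERTIES=(min_ngram_size=2, max_ngram_size=4));
```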
### IK tokenizer
**Concepts**:
* A Chinese tokenizer based on the open-source IK Analyzer tool, supporting two modes:
* **Smart mode**: Prioritizes outputting longer words, reducing the number of splits (for example, "南京市" is not split into "南京" and "市").
* **Max Word mode**: Outputs all possible shorter words (for example, "南京市" is split into "南京" and "市").
* Automatically recognizes English words, email addresses, URLs (without `://`), IP addresses, and other formats.
**Applicable scenarios**: Chinese word segmentation
**Business scenarios**:
* E-commerce product description search (for example, precise matching for "华为Mate60").
* Social media content analysis (for example, keyword extraction from user comments).
* **Smart mode**: Ensures that each character belongs to only one word with no overlap, and guarantees that individual words are as long as possible while minimizing the total number of words. Attempts to combine numerals and quantifiers into a single token.
```shell
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["邮箱", "hello_word", "192.168.1.1", "hello-word", "长江大桥", "www.baidu.com", "www.xxx.com", "xx@ob.com", "长", "http", "1千米", "详见", "南京市", "有"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
* **Max Word mode**: Allows the same character to appear in multiple tokens, producing as many candidate words as possible.
```shell
OceanBase [(root@oceanbase)]> select tokenize("The Nanjing Yangtze River Bridge is 1 kilometer long. For more information, see www.xxx.com. E-mail: xx@ob.com.", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("The Nanjing Yangtze River Bridge is 1 kilometer long. For more information, see www.xxx.com. E-mail: xx@ob.com.", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["kilometer", "Yangtze River Bridge", "city", "dry", "Nanjing City", "Nanjing", "kilometers", "xx", "www.xxx.com", "long", "www", "xx@ob.com", "Yangtze River", "ob", "XXX", "com", "see", "l", "is", "Bridge", "E-mail"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
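To use the IK tokenizer for an index instead of an ad-hoc `tokenize()` call, here is a hedged sketch based on the `WITH PARSER` syntax from the configuration example; the parser keyword `IK` is assumed to match the name passed to `tokenize()` above.
```sql
-- Sketch only: WITH PARSER IK is assumed from the tokenize() examples;
-- no PARSER_PROPERTIES are set, so the tokenizer's default mode applies.
CREATE TABLE products (id INT PRIMARY KEY, description TEXT,
  FULLTEXT INDEX full_idx_products(description)
  WITH PARSER IK);
```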
### jieba tokenizer
**Concept**: A tokenizer based on the open-source `jieba` tool from the Python ecosystem, supporting precise mode, full mode, and search engine mode.
**Features**:
* **Precise mode**: Strictly segments words according to the dictionary (for example, "不能" is not segmented into "不" and "能").
* **Full mode**: Lists all possible segmentation combinations.
* **Search engine mode**: Balances precision and recall rate (for example, "南京市长江大桥" is segmented into "南京", "市长", and "长江大桥").
* Supports custom dictionaries and new word discovery, and is compatible with multiple languages (Chinese, English, Japanese, etc.).
**Applicable scenarios**:
* Medical/technology domain terminology analysis (e.g., precise segmentation of "人工智能").
* Multi-language mixed text processing (e.g., social media content with mixed Chinese and English).
To use the jieba tokenizer plugin, you must install it yourself. For instructions on how to compile and install it, see [Tokenizer plugin](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002414801).
:::tip
The current tokenizer plugin is an experimental feature and is not recommended for use in production environments.
:::
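Once the plugin is installed, a minimal sketch of trying it through `tokenize()` is shown below; the parser name `jieba` and the absence of mode arguments are assumptions, since the exact registration name and options depend on the plugin build.
```sql
-- Sketch only: 'jieba' is an assumed parser name; mode options (precise/full/
-- search engine) are omitted because their argument names depend on the plugin.
select tokenize("南京市长江大桥", 'jieba');
```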
### Tokenizer selection strategy
| **Business scenario** | **Recommended tokenizer** | **Reason** |
| --- | --- | --- |
| Search for English product titles | **Space** or **Basic English** | Simple and efficient, aligns with English tokenization conventions. |
| Retrieval of Chinese product descriptions | **IK tokenizer** | Accurately recognizes Chinese terminology, supports custom dictionaries. |
| Fuzzy matching of logs (such as error codes) | **Ngram tokenizer** | No dictionary required, covers fuzzy query needs for text without spaces. |
| Keyword extraction from technology papers | **jieba tokenizer** | Supports new word discovery and complex mode switching. |
## References
For more information about creating full-text indexes, see the **Create full-text indexes** section in [Create an index](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971660).