---
slug: /char-and-varchar
---

# CHAR and VARCHAR

`CHAR` and `VARCHAR` types are similar, but differ in how they are stored and retrieved, their maximum length, and whether trailing spaces are preserved.

## CHAR

The declared length `M` of the `CHAR` type is the maximum number of characters that can be stored. For example, `CHAR(30)` can contain up to 30 characters.

Syntax:

```sql
[NATIONAL] CHAR[(M)] [CHARACTER SET charset_name] [COLLATE collation_name]
```

`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `CHAR` becomes `BINARY`.

`CHAR` column length can be any value between 0 and 256. When storing `CHAR` values, they are right-padded with spaces to the specified length.

For `CHAR` columns, excess trailing spaces in inserted values are silently truncated regardless of the SQL mode. When retrieving `CHAR` values, trailing spaces are removed unless the `PAD_CHAR_TO_FULL_LENGTH` SQL mode is enabled.

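The following minimal sketch illustrates this padding and trimming behavior; the table and column names are hypothetical:

```sql
-- Hypothetical table; names are illustrative.
CREATE TABLE t_char (c CHAR(4));
INSERT INTO t_char VALUES ('ab');       -- stored right-padded with spaces as 'ab  '
INSERT INTO t_char VALUES ('ab    ');   -- excess trailing spaces are silently truncated
SELECT CONCAT('[', c, ']') FROM t_char; -- returns '[ab]' for both rows: trailing spaces removed on retrieval
```
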
## VARCHAR

The declared length `M` of the `VARCHAR` type is the maximum number of characters that can be stored. For example, `VARCHAR(50)` can contain up to 50 characters.

Syntax:

```sql
[NATIONAL] VARCHAR(M) [CHARACTER SET charset_name] [COLLATE collation_name]
```

`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `VARCHAR` becomes `VARBINARY`.

`VARCHAR` column length can be specified as any value between 0 and 262144.

Compared with `CHAR`, `VARCHAR` values are stored as a 1-byte or 2-byte length prefix plus the data. The length prefix indicates the number of bytes in the value. If values require no more than 255 bytes, the column uses a 1-byte length prefix; if values may require more than 255 bytes, it uses a 2-byte length prefix.

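For example, assuming a single-byte character set, the four-character value `'abcd'` stored in a `VARCHAR(10)` column occupies 5 bytes: a 1-byte length prefix plus 4 bytes of data.
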
For `VARCHAR` columns, trailing spaces that exceed the column length are truncated before insertion and generate a warning, regardless of the SQL mode.

`VARCHAR` values are not padded when stored. According to standard SQL, trailing spaces are preserved during both storage and retrieval.

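As a minimal sketch of this behavior (hypothetical names), a trailing space that fits within the declared length is kept:

```sql
-- Hypothetical table; names are illustrative.
CREATE TABLE t_varchar (v VARCHAR(4));
INSERT INTO t_varchar VALUES ('ab ');      -- trailing space within the length is stored as-is
SELECT CONCAT('[', v, ']') FROM t_varchar; -- returns '[ab ]': the trailing space is preserved
```
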
Additionally, seekdb also supports the extended type `CHARACTER VARYING(M)`, but `VARCHAR(M)` is recommended.

---
slug: /text
---

# TEXT types

The `TEXT` type is used to store all types of text data.

There are four text types: `TINYTEXT`, `TEXT`, `MEDIUMTEXT`, and `LONGTEXT`. They correspond to the four `BLOB` types and have the same maximum length and storage requirements.

`TEXT` values are treated as non-binary strings. They have a character set other than binary, and values are sorted and compared according to the collation rules of the character set.

When strict SQL mode is not enabled, if a value assigned to a `TEXT` column exceeds the column's maximum length, the portion that exceeds the length is truncated and a warning is generated. When using strict SQL mode, an error occurs (rather than a warning) if non-space characters are truncated, and the value insertion is prohibited. Regardless of the SQL mode, truncating excess trailing spaces from values inserted into `TEXT` columns always generates a warning.

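A minimal sketch of this behavior, assuming a `TINYTEXT` column (255-byte limit) and MySQL-compatible handling of the session `sql_mode` variable:

```sql
-- Hypothetical example; assumes MySQL-compatible sql_mode handling.
CREATE TABLE t_text (c TINYTEXT);
SET SESSION sql_mode = '';                    -- non-strict mode
INSERT INTO t_text VALUES (REPEAT('a', 300)); -- value truncated to 255 bytes, warning generated
SET SESSION sql_mode = 'STRICT_ALL_TABLES';   -- strict mode
INSERT INTO t_text VALUES (REPEAT('a', 300)); -- error: non-space characters would be truncated
```
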
## TINYTEXT

`TINYTEXT` is a `TEXT` type with a maximum length of 255 bytes.

`TINYTEXT` syntax:

```sql
TINYTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
```

`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TINYTEXT` becomes `TINYBLOB`.

## TEXT

The maximum length of a `TEXT` column is 65,535 bytes.

An optional length `M` can be specified for the `TEXT` type. Syntax:

```sql
TEXT[(M)] [CHARACTER SET charset_name] [COLLATE collation_name]
```

`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.

## MEDIUMTEXT

`MEDIUMTEXT` is a `TEXT` type with a maximum length of 16,777,215 bytes.

`MEDIUMTEXT` syntax:

```sql
MEDIUMTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
```

`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `MEDIUMTEXT` becomes `MEDIUMBLOB`.

Additionally, seekdb also supports the extended type `LONG`, but `MEDIUMTEXT` is recommended.

## LONGTEXT

`LONGTEXT` is a `TEXT` type with a maximum length of 536,870,910 bytes. The effective maximum length of a `LONGTEXT` column also depends on the maximum packet size configured in the client/server protocol and available memory.

`LONGTEXT` syntax:

```sql
LONGTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
```

`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `LONGTEXT` becomes `LONGBLOB`.

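For illustration, a table that combines the syntax shown above might look like the following sketch; the table and column names, character set, and collation are assumptions:

```sql
-- Hypothetical table; names, character set, and collation are illustrative.
CREATE TABLE doc_store (
  summary TINYTEXT,
  body    TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
  draft   MEDIUMTEXT,
  archive LONGTEXT
);
```
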
---
slug: /full-text-index
---

# Full-text indexes

In seekdb, full-text indexes can be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types. Additionally, seekdb allows multiple full-text indexes to be created on a single table, and multiple full-text indexes can also be created on the same column.

Full-text indexes can be created on both partitioned and non-partitioned tables, regardless of whether they have a primary key. The limitations for creating full-text indexes are as follows:

* Full-text indexes can only be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types.
* The current version only supports creating local (`LOCAL`) full-text indexes.
* The `UNIQUE` keyword cannot be specified when creating a full-text index.
* If you want to create a full-text index involving multiple columns, you must ensure that these columns have the same character set.

By using these syntax rules and guidelines, seekdb's full-text indexing functionality provides efficient search and retrieval capabilities for text data.

## DML operations

For tables with full-text indexes, complex DML operations are supported, including `INSERT INTO ON DUPLICATE KEY`, `REPLACE INTO`, multi-table updates/deletes, and updatable views.

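The `articles` statements below assume a table along these lines; the exact column types and primary key are assumptions inferred from the statements themselves:

```sql
-- Hypothetical definition of the articles table used in the examples below.
CREATE TABLE articles (
  title   VARCHAR(100) PRIMARY KEY,
  context TEXT,
  FULLTEXT INDEX ft_idx_articles(context)
);
```
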
**Examples:**

* `INSERT INTO ON DUPLICATE KEY`:

```sql
INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key')
ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
```

* `REPLACE INTO`:

```sql
REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
```

* Multi-table updates and deletes.

1. Create table `tbl1`.

```sql
CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
```

2. Create table `tbl2`.

```sql
CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
```

3. Perform an update (`UPDATE`) operation on multiple tables.

```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a
SET tbl1.b = 'dddd', tbl2.b = 'eeee';
```

```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
```

```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
```

4. Perform a delete (`DELETE`) operation on multiple tables.

```sql
DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
```

```sql
DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
```

* DML operations on updatable views.

1. Create view `fts_view`.

```sql
CREATE VIEW fts_view AS SELECT * FROM tbl1;
```

2. Perform an `INSERT` operation on the updatable view.

```sql
INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
```

3. Perform an `UPDATE` operation on the updatable view.

```sql
UPDATE fts_view SET b = 'dddd';
```

```sql
UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a
SET fts_view.b = 'dddd', tbl2.b = 'eeee';
```

4. Perform a `DELETE` operation on the updatable view.

```sql
DELETE FROM fts_view WHERE b = 'dddd';
```

```sql
DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
```

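After these DML operations, a table with a full-text index can be queried through `MATCH ... AGAINST`; the following sketch against `tbl1` above assumes MySQL-compatible full-text query syntax:

```sql
-- Hypothetical query; assumes MySQL-compatible MATCH ... AGAINST syntax.
SELECT * FROM tbl1 WHERE MATCH(b) AGAINST('dddd');
```
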
## Full-text index tokenizer

seekdb's full-text index functionality supports multiple built-in tokenizers, helping users select the optimal text tokenization strategy based on their business scenarios. The default tokenizer is **Space**, while other tokenizers need to be explicitly specified using the `WITH PARSER` parameter.

**List of tokenizers**:

* **Space tokenizer**
* **Basic English tokenizer**
* **IK tokenizer**
* **Ngram tokenizer**
* **Ngram2 tokenizer**
* **Jieba tokenizer**

**Configuration example**:

When creating or modifying a table, specify the tokenizer type for the full-text index by setting the `WITH PARSER tokenizer_option` parameter in the `CREATE TABLE/ALTER TABLE` statement.

```sql
-- Create a table with a full-text index that uses the Ngram tokenizer.
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
    FULLTEXT INDEX full_idx1_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3));

-- Add a full-text index with a specified tokenizer to an existing table.
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx2_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3); -- Ngram example
```

### Space tokenizer (default)

**Concepts**:

* This tokenizer splits text using spaces, punctuation marks (such as commas, periods), or non-alphanumeric characters (except underscore `_`) as delimiters.
* The tokenization results include only valid tokens with lengths between `min_token_size` (default 3) and `max_token_size` (default 84).
* Chinese characters are treated as single tokens.

**Applicable scenarios**:

* Languages separated by spaces such as English (for example, "apple watch series 9").
* Chinese text with manually added delimiters (for example, "南京 长江大桥").

**Tokenization result**:

```shell
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM,一平方公里也很小 hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM,一平方公里也很小 hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
| ["详见www", "一平方公里也很小", "xxx", "南京市长江大桥有1千米长", "邮箱xx", "word", "hello", "h_name"]         |
+-------------------------------------------------------------------------------------------------------------+
```

**Example explanation**:

* Spaces, commas, periods, and other symbols serve as delimiters, and consecutive Chinese characters are treated as words.

### Basic English (Beng) tokenizer

**Concepts**:

* Similar to the Space tokenizer, but treats underscores `_` as separators instead of preserving them.
* Suitable for separating English phrases, but has limited effectiveness in splitting terms without spaces (such as "iPhone15").

**Applicable scenarios**:

* Basic retrieval of English documents (such as logs, comments).

**Tokenization result**:

```shell
OceanBase [(root@oceanbase)]> select tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng')      |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| ["user", "log", "system", "admin", "contact", "server", "active", "visit", "status", "entry", "example", "name", "time", "response", "150ms"]  |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
```

**Example explanation**:

* Words containing underscores `_` are split at the underscore into separate tokens (for example, `server_status` -> `server`, `status`, and `user_name` -> `user`, `name`). The core difference from the Space tokenizer lies in how underscores `_` are handled.

### Ngram tokenizer

**Concepts**:

* **Fixed n-value tokenization**: By default, `n=2`. This tokenizer splits consecutive non-delimiter characters into subsequences of length `n`.
* Delimiter rules follow the Space tokenizer (preserving `_`, digits, and letters).
* **Does not support length limit parameters**, and outputs all possible tokens of length `n`.

**Applicable scenarios**:

* Fuzzy matching for short text (such as user IDs, order numbers).
* Scenarios requiring fixed-length feature extraction (such as password policy analysis).

**Tokenization result**:

```shell
OceanBase [(root@oceanbase)]> select tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram');
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram')                                                                                                                                                                                                                             |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["ab", "hn", "am", "r_", "em", "le", "po", "ma", "ou", "xy", "jo", "pl", "_d", "89", "yz", "xa", "ck", "in", "se", "tr", "oh", "12", "d1", "il", "oe", "45", "un", "ac", "co", "ex", "us", "23", "34", "or", "er", "mp", "up", "de", "su", "rt", "pp", "n_", "nt", "ki", "rd", "_a", "bc", "ng", "cc", "od", "om", "78", "ra", "ai", "do", "id"]   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

**Example explanation**:

* With the default setting `n=2`, this tokenizer outputs all consecutive 2-character tokens, including overlapping ones (for example, `ORD12345` -> `OR`, `RD`, `D1`, `12`, `23`, `34`, `45`; `user_account` -> `us`, `se`, `er`, `r_`, `_a`, `ac`, `cc`, `co`, `ou`, `un`, `nt`).

### Ngram2 tokenizer

**Concepts**:

* Supports a **dynamic n-value range**: sets the token length range through the `min_ngram_size` and `max_ngram_size` parameters.
* Suitable for scenarios requiring multi-length token coverage.

**Applicable scenarios**: Scenarios that require multiple fixed-length tokens simultaneously.

:::info
When using the ngram2 tokenizer, be aware of its high memory consumption. For example, setting a large range for the `min_ngram_size` and `max_ngram_size` parameters will generate a large number of token combinations, which may lead to excessive resource consumption.
:::

**Tokenization result**:

```shell
OceanBase [(root@oceanbase)]> select tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]')                                                                                                                                                                                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["io", "lo", "r_lo", "_ses", "_l", "r_", "ss", "user", "ses", "_s", "ogin", "sion", "on", "ess", "20", "logi", "er_", "on_", "use", "essi", "in", "se", "sio", "log", "202", "gin_", "_2", "ssi", "ogi", "us", "n_se", "r_l", "er", "024", "es", "n_2", "og", "_lo", "n_", "_log", "2024", "n_20", "gi", "er_l", "ser", "24", "ssio", "n_s", "gin", "in_", "_se", "02", "_20", "si", "sess", "on_2", "ion_", "ser_", "ion", "_202", "in_s"]   |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

**Example explanation**:

* This tokenizer outputs all consecutive subsequences with lengths between 2-4 characters, with overlapping tokens allowed (for example, `user_login_session_2024` generates tokens like `us`, `use`, `user`, `se`, `ser`, `ser_`, `er_`, `er_l`, `r_lo`, `log`, `logi`, `ogin`, etc.).

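If the ngram2 size parameters can be passed through `PARSER_PROPERTIES` in the same way the configuration example above passes `ngram_token_size`, an index definition might look like the following sketch; the property names accepted in DDL are assumptions:

```sql
-- Hypothetical sketch; assumes min_ngram_size/max_ngram_size are accepted in PARSER_PROPERTIES.
CREATE TABLE tbl3(id INT, doc TEXT,
    FULLTEXT INDEX full_idx1_tbl3(doc)
    WITH PARSER NGRAM2
    PARSER_PROPERTIES=(min_ngram_size=2, max_ngram_size=4));
```
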
### IK tokenizer

**Concepts**:

* A Chinese tokenizer based on the open-source IK Analyzer tool, supporting two modes:

  * **Smart mode**: Prioritizes outputting longer words, reducing the number of splits (for example, "南京市" is not split into "南京" and "市").
  * **Max Word mode**: Outputs all possible shorter words (for example, "南京市" is split into "南京" and "市").

* Automatically recognizes English words, email addresses, URLs (without `://`), IP addresses, and other formats.

**Applicable scenarios**: Chinese word segmentation.

**Business scenarios**:

* E-commerce product description search (for example, precise matching for "华为Mate60").
* Social media content analysis (for example, keyword extraction from user comments).

* **Smart mode**: Ensures that each character belongs to only one word with no overlap, and guarantees that individual words are as long as possible while minimizing the total number of words. Attempts to combine numerals and quantifiers into a single token.

```shell
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["邮箱", "hello_word", "192.168.1.1", "hello-word", "长江大桥", "www.baidu.com", "www.xxx.com", "xx@ob.com", "长", "http", "1千米", "详见", "南京市", "有"]                                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

* **Max Word mode**: May include the same character in multiple tokens, providing as many candidate words as possible.

```shell
OceanBase [(root@oceanbase)]> select tokenize("The Nanjing Yangtze River Bridge is 1 kilometer long. For more information, see www.xxx.com. E-mail: xx@ob.com.", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("The Nanjing Yangtze River Bridge is 1 kilometer long. For more information, see www.xxx.com. E-mail: xx@ob.com.", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["kilometer", "Yangtze River Bridge", "city", "dry", "Nanjing City", "Nanjing", "kilometers", "xx", "www.xxx.com", "long", "www", "xx@ob.com", "Yangtze River", "ob", "XXX", "com", "see", "l", "is", "Bridge", "E-mail"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

### jieba tokenizer

**Concept**: A tokenizer based on the open-source `jieba` tool from the Python ecosystem, supporting precise mode, full mode, and search engine mode.

**Features**:

* **Precise mode**: Strictly segments words according to the dictionary (for example, "不能" is not segmented into "不" and "能").
* **Full mode**: Lists all possible segmentation combinations.
* **Search engine mode**: Balances precision and recall rate (for example, "南京市长江大桥" is segmented into "南京", "市长", and "长江大桥").
* Supports custom dictionaries and new word discovery, and is compatible with multiple languages (Chinese, English, Japanese, etc.).

**Applicable scenarios**:

* Medical/technology domain terminology analysis (e.g., precise segmentation of "人工智能").
* Multi-language mixed text processing (e.g., social media content with mixed Chinese and English).

To use the jieba tokenizer plugin, you need to install it yourself. For instructions on how to compile and install the plugin, see [Tokenizer plugin](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002414801).

:::tip
The current tokenizer plugin is an experimental feature and is not recommended for use in production environments.
:::

### Tokenizer selection strategy

| **Business scenario** | **Recommended tokenizer** | **Reason** |
| --- | --- | --- |
| Search for English product titles | **Space** or **Basic English** | Simple and efficient, aligns with English tokenization conventions. |
| Retrieval of Chinese product descriptions | **IK tokenizer** | Accurately recognizes Chinese terminology, supports custom dictionaries. |
| Fuzzy matching of logs (such as error codes) | **Ngram tokenizer** | No dictionary required, covers fuzzy query needs for text without spaces. |
| Keyword extraction from technology papers | **jieba tokenizer** | Supports new word discovery and complex mode switching. |

## References

For more information about creating full-text indexes, see the **Create full-text indexes** section in [Create an index](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971660).
