Initial commit

Zhongwei Li
2025-11-30 08:44:54 +08:00
commit eb309b7b59
133 changed files with 21979 additions and 0 deletions


@@ -0,0 +1,20 @@
---
slug: /json-formatted-data-types
---
# Overview of JSON data types
seekdb supports the JavaScript Object Notation (JSON) data type in compliance with the RFC 7159 standard. You can use it to store semi-structured JSON data and access or modify the data within JSON documents.
The JSON data type offers the following advantages:
* **Automatic validation**: JSON documents stored in JSON columns are automatically validated. Invalid documents trigger an error, as shown in the sketch after this list.
* **Optimized storage format**: JSON documents stored in JSON columns are converted into an optimized format that enables fast reading and access. When the server reads a JSON value stored in binary format, it doesn't need to parse the value from text.
* **Semi-structured encoding**: This feature further reduces storage costs by splitting a JSON document into multiple sub-columns, with each sub-column encoded individually. This improves compression rates and reduces the storage space required for JSON data. For more information, see [Create a JSON value](200.create-a-json-value.md) and [Semi-structured encoding](600.json-semi-struct.md).
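For example, a minimal sketch of the validation behavior (the table name is hypothetical):
```sql
CREATE TABLE j_demo (doc JSON);
INSERT INTO j_demo VALUES ('{"valid": true}');  -- accepted
INSERT INTO j_demo VALUES ('{broken');          -- rejected: not valid JSON text
```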
## References
* [Overview of JSON functions](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974794)


@@ -0,0 +1,257 @@
---
slug: /create-a-json-value
---
# Create a JSON value
A JSON value must be one of the following: an object, an array, a string, a number, a Boolean value (true/false), or the null value. The literals true, false, and null must be lowercase.
## JSON text structure
A JSON text consists of six structural characters, strings, numbers, and three literal names. Whitespace characters (spaces, horizontal tabs, line feeds, and carriage returns) are allowed before or after any structural character.
```sql
begin-array = [ left square bracket
begin-object = { left curly bracket
end-array = ] right square bracket
end-object = } right curly bracket
name-separator = : colon
value-separator = , comma
```
### Objects
An object is represented by a pair of curly brackets containing zero or more name/value pairs (also called members). Names within an object must be unique. Each name is a string followed by a colon that separates the name from its value. Multiple name/value pairs are separated by commas.
Here is an example:
```sql
{ "NAME": "SAM", "Height": 175, "Weight": 100, "Registered" : false}
```
### Arrays
An array is represented by square brackets containing zero or more values (also called elements). Array elements are separated by commas, and values in an array do not need to be of the same type.
Here is an example:
```sql
["abc", 10, null, true, false]
```
### Numbers
Numbers use decimal format and contain an integer component that may optionally be prefixed with a minus sign (-). This can be followed by a fractional part and/or an exponent part. Leading zeros are not allowed. The fractional part consists of a decimal point followed by one or more digits. The exponent part begins with an uppercase or lowercase letter E, optionally followed by a plus (+) or minus (-) sign and one or more digits.
Here is an example:
```sql
[100, 0, -100, 100.11, -12.11, 10.22e2, -10.22e2]
```
### Strings
A string begins and ends with quotation marks ("). All Unicode characters can be placed within the quotation marks, except characters that must be escaped (including quotation marks, backslashes, and control characters).
JSON text must be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8.
Here is an example:
```sql
{"Url": "http://www.example.com/image/481989943"}
```
## Create JSON values
seekdb supports the following DDL operations on JSON types:
* Create tables with JSON columns.
* Add or drop JSON columns.
* Create indexes on generated columns based on JSON columns.
* Enable semi-structured encoding when creating tables.
* Enable semi-structured encoding on existing tables.
### Limitations
You can create multiple JSON columns in each table, with the following limitations:
* JSON columns cannot be used as `PRIMARY KEY`, `FOREIGN KEY`, or `UNIQUE KEY`, but you can add `NOT NULL` or `CHECK` constraints (see the sketch after this list).
* JSON columns cannot have default values.
* JSON columns cannot be used as partitioning keys.
* The length of JSON data cannot exceed the length of `LONGTEXT`, and the maximum depth of each JSON object or array is 99.
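The sketch below (hypothetical table names, assuming the `JSON_VALID()` function) shows how the first two limitations surface, while `NOT NULL` and `CHECK` remain available:
```sql
CREATE TABLE j_bad1 (j JSON PRIMARY KEY);                     -- error: a JSON column cannot be a key
CREATE TABLE j_bad2 (j JSON DEFAULT '{"a": 1}');              -- error: a JSON column cannot have a default value
CREATE TABLE j_ok   (j JSON NOT NULL CHECK (JSON_VALID(j)));  -- allowed
```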
### Examples
#### Create or modify JSON columns
```sql
obclient> CREATE TABLE tbl1 (id INT PRIMARY KEY, docs JSON NOT NULL, docs1 JSON);
Query OK, 0 rows affected
obclient> ALTER TABLE tbl1 MODIFY docs JSON CHECK(docs < '{"a" : 100}');
Query OK, 0 rows affected
obclient> CREATE TABLE json_tab(
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'Primary key',
json_info JSON COMMENT 'JSON data',
json_id INT GENERATED ALWAYS AS (json_info -> '$.id') COMMENT 'Virtual field from JSON data',
json_name VARCHAR(5) GENERATED ALWAYS AS (json_info -> '$.NAME'),
INDEX json_info_id_idx (json_id)
) COMMENT 'Example JSON table';
Query OK, 0 rows affected
obclient> ALTER TABLE json_tab ADD COLUMN json_info1 JSON;
Query OK, 0 rows affected
obclient> ALTER TABLE json_tab ADD INDEX (json_name);
Query OK, 0 rows affected
obclient> ALTER TABLE json_tab DROP COLUMN json_info1;
Query OK, 0 rows affected
```
#### Create an index on a specific key using a generated column
```sql
obclient> CREATE TABLE jn ( c JSON, g INT GENERATED ALWAYS AS (c->"$.id"));
Query OK, 0 rows affected
obclient> CREATE INDEX idx1 ON jn(g);
Query OK, 0 rows affected
Records: 0 Duplicates: 0 Warnings: 0
obclient> INSERT INTO jn (c) VALUES
('{"id": "1", "name": "Fred"}'), ('{"id": "2", "name": "Wilma"}'),
('{"id": "3", "name": "Barney"}'), ('{"id": "4", "name": "Betty"}');
Query OK, 4 rows affected
Records: 4 Duplicates: 0 Warnings: 0
obclient> SELECT c->>"$.name" AS name FROM jn WHERE g <= 2;
+-------+
| name |
+-------+
| Fred |
| Wilma |
+-------+
2 rows in set
obclient> EXPLAIN SELECT c->>"$.name" AS name FROM jn WHERE g <= 2\G
*************************** 1. row ***************************
Query Plan: =========================================
|ID|OPERATOR |NAME |EST. ROWS|COST|
-----------------------------------------
|0 |TABLE SCAN|jn(idx1)  |2        |92  |
=========================================
Outputs & filters:
-------------------------------------
0 - output([JSON_UNQUOTE(JSON_EXTRACT(jn.c, '$.name'))]), filter(nil),
      access([jn.c]), partitions(p0)
1 row in set
```
#### Use semi-structured encoding
seekdb supports enabling semi-structured encoding when creating a table, controlled primarily by the table-level parameter `SEMISTRUCT_PROPERTIES`. The table must also be created with `ROW_FORMAT=COMPRESSED`; otherwise, an error occurs:
* When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)`, the table is considered a semi-structured table, meaning all JSON columns in the table will have semi-structured encoding enabled.
* When `SEMISTRUCT_PROPERTIES=(encoding_type=none)`, the table is considered a structured table.
* You can also set the frequency threshold using the `freq_threshold` parameter.
* Currently, `encoding_type` and `freq_threshold` can only be modified using online DDL statements, not offline DDL statements.
1. Enable semi-structured encoding.
:::tip
If you enable semi-structured encoding, make sure that the parameter <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971939">micro_block_merge_verify_level</a> is set to the default value <code>2</code>. Do not disable micro-block major compaction verification.
:::
:::tab
tab Example: Enable semi-structured encoding during table creation
```sql
CREATE TABLE t1( j json)
ROW_FORMAT=COMPRESSED
SEMISTRUCT_PROPERTIES=(encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [CREATE TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974140).
tab Example: Enable semi-structured encoding for an existing table
```sql
CREATE TABLE t1(j json);
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [ALTER TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974126).
Some modification limitations:
* If semi-structured encoding is not enabled, modifying the frequent column threshold will not report an error but will have no effect.
* The `freq_threshold` parameter cannot be modified during direct load operations or when the table is locked.
* Modifying one sub-parameter does not affect the others.
:::
2. Disable semi-structured encoding.
When `SEMISTRUCT_PROPERTIES` is set to `(encoding_type=none)`, semi-structured encoding is disabled. This operation does not affect existing data and only applies to data written afterward. Here is an example of disabling semi-structured encoding:
```sql
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=none);
```
3. Query semi-structured encoding configuration.
Use the `SHOW CREATE TABLE` statement to query the semi-structured encoding configuration. Here is an example statement:
```sql
SHOW CREATE TABLE t1;
```
The result is as follows:
```shell
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1 | CREATE TABLE `t1` (
`j` json DEFAULT NULL
) ORGANIZATION INDEX DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPRESSED COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE ENABLE_MACRO_BLOCK_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 SEMISTRUCT_PROPERTIES=(ENCODING_TYPE=ENCODING, FREQ_THRESHOLD=50) |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)` is specified, the query displays this parameter, indicating that semi-structured encoding is enabled.
Semi-structured encoding can improve the performance of conditional filtering queries that use the [JSON_VALUE() function](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975890). Because the JSON data is split into sub-columns, the system can filter directly on the encoded sub-column data without reconstructing the complete JSON document, which significantly improves query efficiency.
Here is an example query:
```sql
-- Query rows where the value of the name field is 'Devin'
SELECT * FROM t WHERE JSON_VALUE(j_doc, '$.name' RETURNING CHAR) = 'Devin';
```
Character set considerations:
- seekdb uses the `utf8_bin` collation for JSON data.
- To ensure string whitebox filtering works properly, we recommend the following settings:
```sql
SET @@collation_server = 'utf8mb4_bin';
SET @@collation_connection='utf8mb4_bin';
```


@@ -0,0 +1,174 @@
---
slug: /querying-and-modifying-json-values
---
# Query and modify JSON values
seekdb supports querying and referencing JSON values. Using path expressions, you can extract or modify specific portions of a JSON document.
## Reference JSON values
seekdb provides two methods for querying and referencing JSON values:
* Use the `->` operator to return a key's value as JSON; string values keep their double quotes.
* Use the `->>` operator to return a key's value as unquoted text, with the double quotes removed.
Examples:
```sql
obclient> SELECT c->"$.name" AS name FROM jn WHERE g <= 2;
+---------+
| name |
+---------+
| "Fred" |
| "Wilma" |
+---------+
2 rows in set
obclient> SELECT c->>"$.name" AS name FROM jn WHERE g <= 2;
+-------+
| name |
+-------+
| Fred |
| Wilma |
+-------+
2 rows in set
obclient> SELECT JSON_UNQUOTE(c->'$.name') AS name
FROM jn WHERE g <= 2;
+-------+
| name |
+-------+
| Fred |
| Wilma |
+-------+
2 rows in set
```
Because JSON documents are hierarchical, JSON functions use path expressions to extract or modify portions of a document and to specify where in the document the operation should occur.
seekdb uses a path syntax consisting of a leading `$` character followed by a selector to represent the JSON document being accessed. The selector types are as follows:
* The `.` symbol is followed by the name of the key to access. Key names that are not valid without quoting in path expressions (for example, names containing spaces) must be enclosed in double quotes.
Example:
```sql
obclient> SELECT JSON_EXTRACT('{"id": 14, "name": "Aztalan"}', '$.name');
+---------------------------------------------------------+
| JSON_EXTRACT('{"id": 14, "name": "Aztalan"}', '$.name') |
+---------------------------------------------------------+
| "Aztalan" |
+---------------------------------------------------------+
1 row in set
```
* The `[N]` symbol is placed after the path of the selected array and represents the value at position N in the array, where N is a non-negative integer. Array positions are zero-indexed. If `path` does not select an array value, then `path[0]` evaluates to the same value as `path`.
Example:
```sql
obclient> SELECT JSON_SET('"x"', '$[0]', 'a');
+------------------------------+
| JSON_SET('"x"', '$[0]', 'a') |
+------------------------------+
| "a" |
+------------------------------+
1 row in set
```
* The `[M to N]` symbol specifies a subset or range of array values, starting from position M and ending at position N.
Example:
```sql
obclient> SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[1 to 3]');
+----------------------------------------------+
| JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[1 to 3]') |
+----------------------------------------------+
| [2, 3, 4] |
+----------------------------------------------+
1 row in set
```
* Path expressions can also include `*` or `**` wildcard characters:
* `.*` represents the values of all members in a JSON object.
* `[*]` represents the values of all elements in a JSON array.
* `prefix**suffix` represents all paths that begin with the specified prefix and end with the specified suffix. The prefix is optional, but the suffix is required; a path cannot consist of `**` alone or contain the sequence `***`.
:::info
Paths that do not exist in the document (evaluating to non-existent data) evaluate to <code>NULL</code>.
:::
## Modify JSON values
seekdb also supports modifying complete JSON values using DML statements, and using the JSON_SET(), JSON_REPLACE(), or JSON_REMOVE() functions in `UPDATE` statements to modify partial JSON values.
Examples:
```sql
-- Insert complete data.
INSERT INTO json_tab(json_info) VALUES ('[1, {"a": "b"}, [2, "qwe"]]');
-- Insert partial data.
UPDATE json_tab SET json_info=JSON_ARRAY_APPEND(json_info, '$', 2) WHERE id=1;
-- Update complete data.
UPDATE json_tab SET json_info='[1, {"a": "b"}]';
-- Update partial data.
UPDATE json_tab SET json_info=JSON_REPLACE(json_info, '$[2]', 'aaa') WHERE id=1;
-- Delete data.
DELETE FROM json_tab WHERE id=1;
-- Update partial data using a function.
UPDATE json_tab SET json_info=JSON_REMOVE(json_info, '$[2]') WHERE id=1;
```
## JSON path syntax
A path consists of a scope and one or more path segments. For paths used in JSON functions, the scope is the document being searched or otherwise operated on, represented by the leading `$` character.
Path segments are separated by periods (.). Array elements are represented by `[N]`, where N is a non-negative integer. Key names must be either double-quoted strings or valid ECMAScript identifiers.
Path expressions (like JSON text) should be encoded using the ascii, utf8, or utf8mb4 character set. Other character encodings are implicitly converted to utf8mb4.
The complete syntax is as follows:
```sql
pathExpression: // Path expression
scope[(pathLeg)*] // Scope is represented by the leading $ character
pathLeg:
member | arrayLocation | doubleAsterisk
member:
period ( keyName | asterisk )
arrayLocation:
leftBracket ( nonNegativeInteger | asterisk ) rightBracket
keyName:
ESIdentifier | doubleQuotedString
doubleAsterisk:
'**'
period:
'.'
asterisk:
'*'
leftBracket:
'['
rightBracket:
']'
```
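To make the grammar concrete, here are a few paths applied with `JSON_EXTRACT()` (the sample document is hypothetical):
```sql
SET @doc = '{"a": {"b": [10, 20]}, "key name": 1}';
SELECT JSON_EXTRACT(@doc, '$.a.b[1]');      -- 20: two member legs plus an array location
SELECT JSON_EXTRACT(@doc, '$."key name"');  -- 1: a key name that must be double-quoted
SELECT JSON_EXTRACT(@doc, '$**.b');         -- [[10, 20]]: a double-asterisk leg; matches are wrapped in an array
```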


@@ -0,0 +1,54 @@
---
slug: /json-formatted-data-type-conversion
---
# Convert JSON data types
seekdb supports the CAST function for converting between JSON and other data types.
The following table describes the conversion rules for JSON data types.
| Other data types | CAST(other_type AS JSON) | CAST(JSON AS other_type) |
|-------------------------------------|---------------------------------------------|----------------------------------------------------------|
| JSON | No change. | No change. |
| UTF-8 character types (including utf8mb4, utf8, and ascii) | The characters are converted to JSON values and validated. | The data is serialized into utf8mb4 strings. |
| Other character sets | First converted to utf8mb4 encoding, then processed as UTF-8 character type. | First serialized into utf8mb4-encoded strings, then converted to the corresponding character set. |
| NULL | An empty JSON value is returned. | Not applicable. |
| Other types | Only scalar values are converted to JSON values containing that single value. | If the JSON value contains only one scalar value that matches the target type, it is converted to the corresponding type; otherwise, NULL is returned and a warning is issued. |
:::info
<code>other_type</code> specifies a data type other than JSON.
:::
Here are some conversion examples:
```sql
obclient> SELECT CAST("123" AS JSON);
+---------------------+
| CAST("123" AS JSON) |
+---------------------+
| 123 |
+---------------------+
1 row in set
obclient> SELECT CAST(null AS JSON);
+--------------------+
| CAST(null AS JSON) |
+--------------------+
| NULL |
+--------------------+
1 row in set
CREATE TABLE tj1 (c1 JSON,c2 VARCHAR(20));
INSERT INTO tj1 VALUES ('{"id": 17, "color": "red"}','apple'),('{"id": 18, "color": "yellow"}', 'banana'),('{"id": 16, "color": "orange"}','orange');
obclient> SELECT * FROM tj1 ORDER BY CAST(JSON_EXTRACT(c1, '$.id') AS UNSIGNED);
+-------------------------------+--------+
| c1 | c2 |
+-------------------------------+--------+
| {"id": 16, "color": "orange"} | orange |
| {"id": 17, "color": "red"} | apple |
| {"id": 18, "color": "yellow"} | banana |
+-------------------------------+--------+
3 rows in set
```
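The reverse direction follows the last row of the conversion table: a JSON document holding a single scalar converts to a matching target type, while a non-scalar document does not:
```sql
SELECT CAST(CAST('123' AS JSON) AS UNSIGNED);        -- 123: a single scalar converts cleanly
SELECT CAST(CAST('[1, 2, 3]' AS JSON) AS UNSIGNED);  -- NULL with a warning, per the table above
```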


@@ -0,0 +1,328 @@
---
slug: /json-partial-update
---
# Partial JSON data updates
seekdb supports partial JSON data updates (JSON Partial Update). When only specific fields in a JSON document need to be modified, this feature allows you to update only the changed portions without having to update the entire JSON document.
## Enable or disable JSON Partial Update
The JSON Partial Update feature in seekdb is disabled by default. It is controlled by the system variable `log_row_value_options`. For more information, see [log_row_value_options](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001972193).
**Here are some examples:**
* Enable the JSON Partial Update feature.
* Session level:
```sql
SET log_row_value_options="partial_json";
```
* Global level:
```sql
SET GLOBAL log_row_value_options="partial_json";
```
* Disable the JSON Partial Update feature.
* Session level:
```sql
SET log_row_value_options="";
```
* Global level:
```sql
SET GLOBAL log_row_value_options="";
```
* Query the value of `log_row_value_options`.
```sql
SHOW VARIABLES LIKE 'log_row_value_options';
```
The result is as follows:
```sql
+-----------------------+-------+
| Variable_name | Value |
+-----------------------+-------+
| log_row_value_options | |
+-----------------------+-------+
1 row in set
```
## JSON expressions for partial updates
In addition to turning on the feature switch `log_row_value_options`, you must update the JSON document with specific expressions for JSON Partial Update to be triggered.
The following JSON expressions in seekdb currently support partial updates:
* `json_set` or `json_replace`: updates the value of a JSON field.
* `json_remove`: deletes a JSON field.
:::tip
<ol><li>Ensure that the left operand of the <code>SET</code> assignment clause and the first parameter of the JSON expression are the same and both are JSON columns in the table. For example, in <code>j = json_replace(j, '$.name', 'ab')</code>, the parameter on the left side of the equals sign and the first parameter of the JSON expression <code>json_replace</code> on the right side are both <code>j</code>.</li><li>JSON Partial Update is only triggered when the current JSON column data is stored as <code>outrow</code>. Whether data is stored as <code>outrow</code> or <code>inrow</code> is controlled by the <code>lob_inrow_threshold</code> parameter when creating the table. <code>lob_inrow_threshold</code> is used to configure the <code>INROW</code> threshold. When the LOB data size exceeds this threshold, it is stored as <code>OUTROW</code> in the LOB Meta table. The default value is 4 KB.</li></ol>
:::
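As an illustration of the second point in the tip, the following sketch lowers the inrow threshold so that even small JSON documents are stored as `outrow` and can therefore be partially updated (the table name is hypothetical; the syntax assumes the table-level `lob_inrow_threshold` option mentioned above):
```sql
CREATE TABLE json_small (
    pk INT PRIMARY KEY,
    j JSON
) LOB_INROW_THRESHOLD = 1024;  -- LOB values larger than 1 KB are stored outrow
```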
**Examples:**
1. Create a table named `json_test`.
```sql
CREATE TABLE json_test(pk INT PRIMARY KEY, j JSON);
```
2. Insert data.
```sql
INSERT INTO json_test VALUES(1, CONCAT('{"name": "John", "content": "', repeat('x',8), '"}'));
```
The result is as follows:
```shell
Query OK, 1 row affected
```
3. Query the data in the JSON column `j`.
```sql
SELECT j FROM json_test;
```
The result is as follows:
```shell
+-----------------------------------------+
| j |
+-----------------------------------------+
| {"name": "John", "content": "xxxxxxxx"} |
+-----------------------------------------+
1 row in set
```
4. Use `json_replace` to update the value of the `name` field in the JSON column.
```sql
UPDATE json_test SET j = json_replace(j, '$.name', 'ab') WHERE pk = 1;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
5. Query the modified data in JSON column `j`.
```sql
SELECT j FROM json_test;
```
Result:
```shell
+---------------------------------------+
| j |
+---------------------------------------+
| {"name": "ab", "content": "xxxxxxxx"} |
+---------------------------------------+
1 row in set
```
6. Use `json_set` to update the value of the `name` field in the JSON column.
```sql
UPDATE json_test SET j = json_set(j, '$.name', 'cd') WHERE pk = 1;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
7. Query the modified data in JSON column `j`.
```sql
SELECT j FROM json_test;
```
Result:
```shell
+---------------------------------------+
| j |
+---------------------------------------+
| {"name": "cd", "content": "xxxxxxxx"} |
+---------------------------------------+
1 row in set
```
8. Use `json_remove` to delete the `name` field value in the JSON column.
```sql
UPDATE json_test SET j = json_remove(j, '$.name') WHERE pk = 1;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
9. Query the modified data in JSON column `j`.
```sql
SELECT j FROM json_test;
```
Result:
```shell
+-------------------------+
| j |
+-------------------------+
| {"content": "xxxxxxxx"} |
+-------------------------+
1 row in set
```
## Granularity of updates
JSON data in seekdb is stored as LOBs, and LOBs are stored in chunks at the storage layer. The minimum amount of data written by each partial update is therefore one LOB chunk: the smaller the chunk, the less data is written. DDL syntax is provided to set the LOB chunk size when creating a column.
**Example:**
```sql
CREATE TABLE json_test(pk INT PRIMARY KEY, j JSON CHUNK '4k');
```
The chunk size cannot be made arbitrarily small, because an overly small chunk degrades the performance of `SELECT`, `INSERT`, and `DELETE` operations. It is generally recommended to set the chunk size based on the average field size of your JSON documents; if most fields are very small, 1 KB is a reasonable choice. To optimize LOB reads, seekdb stores data smaller than 4 KB directly as `INROW`, in which case no partial update is performed. Partial updates are mainly intended to improve the performance of updating large documents; for small documents, full updates actually perform better.
## Rebuild
JSON Partial Update does not impose restrictions on the data length before and after updating a JSON column. When the length of the new value is less than or equal to the length of the old value, the data at the original location is directly replaced with the new data. When the length of the new value is greater than the length of the old value, the new data is appended at the end. seekdb sets a threshold: when the length of the appended data exceeds 30% of the original data length, a rebuild is triggered. In this case, Partial Update is not performed; instead, a full overwrite is performed.
You can use the `JSON_STORAGE_SIZE` expression to get the actual storage length of JSON data, and `JSON_STORAGE_FREE` to get the additional storage overhead.
**Example:**
1. Enable JSON Partial Update.
```sql
SET log_row_value_options = "partial_json";
```
2. Create a test table named `json_test`.
```sql
CREATE TABLE json_test(pk INT PRIMARY KEY, j JSON CHUNK '1K');
```
3. Insert a row of data into the `json_test` table.
```sql
INSERT INTO json_test VALUES(10 , json_object('name', 'zero', 'age', 100, 'position', 'software engineer', 'profile', repeat('x', 4096), 'like', json_array('a', 'b', 'c'), 'tags', json_array('sql boy', 'football', 'summer', 1), 'money' , json_object('RMB', 10000, 'Dollers', 20000, 'BTC', 100), 'nickname', 'noone'));
```
Result:
```shell
Query OK, 1 row affected
```
4. Use `JSON_STORAGE_SIZE` to query the storage size of the JSON column (actual occupied storage space) and `JSON_STORAGE_FREE` to estimate the storage space that can be freed from the JSON column.
```sql
SELECT JSON_STORAGE_SIZE(j), JSON_STORAGE_FREE(j) FROM json_test WHERE pk = 10;
```
Result:
```shell
+----------------------+----------------------+
| JSON_STORAGE_SIZE(j) | JSON_STORAGE_FREE(j) |
+----------------------+----------------------+
| 4335 | 0 |
+----------------------+----------------------+
1 row in set
```
Since no partial update has been performed, the value of `JSON_STORAGE_FREE` is 0.
5. Use `json_replace` to update the value of the `position` field in the JSON column, where the length of the new value is less than the length of the old value.
```sql
UPDATE json_test SET j = json_replace(j, '$.position', 'software enginee') WHERE pk = 10;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
6. Again, use `JSON_STORAGE_SIZE` to query the storage size of the JSON column and `JSON_STORAGE_FREE` to estimate the storage space that can be freed from the JSON column.
```sql
SELECT JSON_STORAGE_SIZE(j), JSON_STORAGE_FREE(j) FROM json_test WHERE pk = 10;
```
Result:
```shell
+----------------------+----------------------+
| JSON_STORAGE_SIZE(j) | JSON_STORAGE_FREE(j) |
+----------------------+----------------------+
| 4335 | 1 |
+----------------------+----------------------+
1 row in set
```
After the JSON column data is updated, since the new data is one byte less than the old data, the `JSON_STORAGE_FREE` result is 1.
7. Use `json_replace` to update the value of the `position` field in the JSON column, where the length of the new value is greater than the length of the old value.
```sql
UPDATE json_test SET j = json_replace(j, '$.position', 'software engineer') WHERE pk = 10;
```
Result:
```shell
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
```
8. Use `JSON_STORAGE_SIZE` again to query the JSON column storage size, and `JSON_STORAGE_FREE` to estimate the storage space that can be freed from the JSON column.
```sql
SELECT JSON_STORAGE_SIZE(j), JSON_STORAGE_FREE(j) FROM json_test WHERE pk = 10;
```
Result:
```shell
+----------------------+----------------------+
| JSON_STORAGE_SIZE(j) | JSON_STORAGE_FREE(j) |
+----------------------+----------------------+
| 4355 | 19 |
+----------------------+----------------------+
1 row in set
```
After appending the new data to the JSON column, `JSON_STORAGE_FREE` returns 19, indicating that 19 bytes can be freed by a rebuild.


@@ -0,0 +1,124 @@
---
slug: /json-semi-struct
---
# Semi-structured encoding
This topic describes the semi-structured encoding feature supported by seekdb.
seekdb supports enabling semi-structured encoding when creating a table, controlled primarily by the table-level parameter `SEMISTRUCT_PROPERTIES`. The table must also be created with `ROW_FORMAT=COMPRESSED`; otherwise, an error occurs.
## Considerations
* When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)`, the table is considered a semi-structured table, meaning all JSON columns in the table will have semi-structured encoding enabled.
* When `SEMISTRUCT_PROPERTIES=(encoding_type=none)`, the table is considered a structured table.
* You can also set the frequency threshold using the `freq_threshold` parameter. When semi-structured encoding is enabled, the system analyzes the frequency of each path in the JSON data and stores paths with frequencies exceeding the specified threshold as independent subcolumns, known as frequent columns. For example, if you have a user table where the JSON field stores user information and 90% of users have the `name` and `age` fields, the system will automatically extract `name` and `age` as independent frequent columns. During queries, these columns are accessed directly without parsing the entire JSON, thereby improving query performance.
* Currently, `encoding_type` and `freq_threshold` can only be modified using online DDL statements, not offline DDL statements.
## Data format
JSON data is split and stored as structured columns in a specific format. The columns split from JSON columns are called subcolumns. Subcolumns can be categorized into different types, including sparse columns and frequent columns.
* Sparse columns: Subcolumns that exist in some JSON documents but not in others, with an occurrence frequency lower than the threshold specified by the table-level parameter `freq_threshold`.
* Frequent columns: Subcolumns that appear in JSON data with a frequency higher than the threshold specified by the table-level parameter `freq_threshold`. These subcolumns are stored as independent columns to improve filtering query performance.
For example:
```sql
{"id": 1001, "name": "n1", "nickname": "nn1"}
{"id": 1002, "name": "n2", "nickname": "nn2"}
{"id": 1003, "name": "n3", "nickname": "nn3"}
{"id": 1004, "name": "n4", "nickname": "nn4"}
{"id": 1005, "name": "n5"}
```
In this example, `id` and `name` are fields that exist in every JSON document with an occurrence frequency of 100%, while `nickname` exists in only four JSON documents with an occurrence frequency of 80%.
If `freq_threshold` is set to 100%, then `nickname` will be inferred as a sparse column, while `id` and `name` will be inferred as frequent columns. If set to 80%, then `nickname`, `id`, and `name` will all be inferred as frequent columns.
## Examples
1. Enable semi-structured encoding.
:::tip
If you enable semi-structured encoding, make sure that the parameter <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971939">micro_block_merge_verify_level</a> is set to the default value <code>2</code>. Do not disable micro-block major compaction verification.
:::
:::tab
tab Example: Enable semi-structured encoding during table creation
```sql
CREATE TABLE t1( j json)
ROW_FORMAT=COMPRESSED
SEMISTRUCT_PROPERTIES=(encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [CREATE TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974140).
tab Example: Enable semi-structured encoding for an existing table
```sql
CREATE TABLE t1(j json);
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [ALTER TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974126).
Some modification limitations:
* If semi-structured encoding is not enabled, modifying the frequent column threshold will not report an error but will have no effect.
* The `freq_threshold` parameter cannot be modified during direct load operations or when the table is locked.
* Modifying one sub-parameter does not affect the others.
:::
2. Disable semi-structured encoding.
When `SEMISTRUCT_PROPERTIES` is set to `(encoding_type=none)`, semi-structured encoding is disabled. This operation does not affect existing data and only applies to data written afterward. Here is an example of disabling semi-structured encoding:
```sql
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=none);
```
3. Query semi-structured encoding configuration.
Use the `SHOW CREATE TABLE` statement to query the semi-structured encoding configuration. Here is an example statement:
```sql
SHOW CREATE TABLE t1;
```
The result is as follows:
```shell
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1 | CREATE TABLE `t1` (
`j` json DEFAULT NULL
) ORGANIZATION INDEX DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPRESSED COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE ENABLE_MACRO_BLOCK_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 SEMISTRUCT_PROPERTIES=(ENCODING_TYPE=ENCODING, FREQ_THRESHOLD=50) |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)` is specified, the query displays this parameter, indicating that semi-structured encoding is enabled.
Semi-structured encoding can improve the performance of conditional filtering queries that use the [JSON_VALUE() function](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975890). Because the JSON data is split into sub-columns, the system can filter directly on the encoded sub-column data without reconstructing the complete JSON document, which significantly improves query efficiency.
Here is an example query:
```sql
-- Query rows where the value of the name field is 'Devin'
SELECT * FROM t WHERE JSON_VALUE(j_doc, '$.name' RETURNING CHAR) = 'Devin';
```
Character set considerations:
- seekdb uses the `utf8_bin` collation for JSON data.
- To ensure string whitebox filtering works properly, we recommend the following settings:
```sql
SET @@collation_server = 'utf8mb4_bin';
SET @@collation_connection='utf8mb4_bin';
```


@@ -0,0 +1,26 @@
---
slug: /spatial-data-type-overview
---
# Overview of spatial data types
The Geographic Information System (GIS) feature of seekdb includes the following spatial data types:
* `GEOMETRY`
* `POINT`
* `LINESTRING`
* `POLYGON`
* `MULTIPOINT`
* `MULTILINESTRING`
* `MULTIPOLYGON`
* `GEOMETRYCOLLECTION`
Among these, `POINT`, `LINESTRING`, and `POLYGON` are the three fundamental types, used to store individual spatial values. Each extends into a corresponding collection type (`MULTIPOINT`, `MULTILINESTRING`, and `MULTIPOLYGON`) that stores collections of spatial values but can only hold elements of its base type. `GEOMETRY` is an abstract type that can represent any base type, and `GEOMETRYCOLLECTION` can hold a collection of any `GEOMETRY` values.
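As a quick orientation, the sketch below writes one value of each category as a Well-Known Text (WKT) literal; the conversion functions are described in the spatial data formats topic:
```sql
SELECT ST_AsText(ST_GeomFromText('POINT(1 1)'));
SELECT ST_AsText(ST_GeomFromText('LINESTRING(0 0, 1 1, 2 2)'));
SELECT ST_AsText(ST_GeomFromText('POLYGON((0 0, 4 0, 4 4, 0 4, 0 0))'));
SELECT ST_AsText(ST_GeomFromText('MULTIPOINT(0 0, 1 1)'));
SELECT ST_AsText(ST_GeomFromText('GEOMETRYCOLLECTION(POINT(1 1), LINESTRING(0 0, 1 1))'));
```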


@@ -0,0 +1,39 @@
---
slug: /spacial-reference-system
---
# Spatial reference systems
A spatial reference system (SRS) for spatial data is a coordinate-based system for defining geographic locations. The current version of seekdb only supports the default SRS provided by the system.
Spatial reference systems generally include the following types:
* Projected SRS: a projection of the Earth onto a plane, essentially a flat map. The coordinate system on this plane is a Cartesian coordinate system that uses units of length (meters, feet, and so on) rather than longitude and latitude.
* Geographic SRS: a non-projected SRS that represents longitude-latitude (or latitude-longitude) coordinates on an ellipsoid, expressed in angular units.
Additionally, there is an infinite flat Cartesian plane, represented by `SRID 0`, whose axes have no assigned units. Unlike a projected SRS, it is an abstract plane with no geographic reference and does not necessarily represent the Earth. `SRID 0` is the default `SRID` for spatial data.
SRS information can be obtained from the `INFORMATION_SCHEMA.ST_SPATIAL_REFERENCE_SYSTEMS` table, as shown in the following example:
```sql
obclient> SELECT * FROM INFORMATION_SCHEMA.ST_SPATIAL_REFERENCE_SYSTEMS
WHERE SRS_ID = 4326\G
*************************** 1. row ***************************
SRS_NAME: WGS 84
SRS_ID: 4326
ORGANIZATION: EPSG
ORGANIZATION_COORDSYS_ID: 4326
DEFINITION: GEOGCS["WGS 84",DATUM["World Geodetic System 1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.017453292519943278,AUTHORITY["EPSG","9122"]],AXIS["Lat",NORTH],AXIS["Lon",EAST],AUTHORITY["EPSG","4326"]]
DESCRIPTION: NULL
1 row in set
```
The above example describes the SRS used by GPS systems, with the name (SRS_NAME) WGS 84 and ID (SRS_ID) 4326.
The SRS definition in the `DEFINITION` column is a `WKT` value. WKT is defined based on Extended Backus-Naur Form (EBNF). WKT can be used both as a data format (referred to as WKT-Data in this document) and for SRS definitions in GIS.
An `SRS_ID` value is the same kind of value as the `SRID` of a geometry value or the `SRID` argument passed to a spatial function. `SRID 0` (the unitless Cartesian plane) is a special, valid spatial reference system ID and can be used in any spatial computation that depends on `SRID` values.
For calculations involving multiple geometry values, all values must have the same SRID; otherwise, an error will occur.
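For example, a minimal sketch (the coordinates are hypothetical, and the `ST_Distance()` function is assumed to be available): both arguments carry SRID 4326, so the computation succeeds.
```sql
SET @p1 = ST_GeomFromText('POINT(40.7 -74.0)', 4326);
SET @p2 = ST_GeomFromText('POINT(34.1 -118.2)', 4326);
SELECT ST_Distance(@p1, @p2);  -- both values use SRID 4326: OK
-- Mixing SRIDs, e.g. ST_Distance(@p1, ST_GeomFromText('POINT(0 0)', 0)), raises an error.
```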


@@ -0,0 +1,45 @@
---
slug: /create-spatial-columns
---
# Create a spatial column
seekdb allows you to create a spatial column using the `CREATE TABLE` or `ALTER TABLE` statement.
To create a table with spatial columns using the `CREATE TABLE` statement, see the following syntax example:
```sql
CREATE TABLE geom (g GEOMETRY);
```
To add or remove spatial columns in an existing table using the `ALTER TABLE` statement, see the following syntax example:
```sql
ALTER TABLE geom ADD pt POINT;
ALTER TABLE geom DROP pt;
```
Examples:
```sql
obclient> CREATE TABLE geom (
p POINT SRID 0,
g GEOMETRY NOT NULL SRID 4326
);
Query OK, 0 rows affected
```
The following constraints apply when creating spatial columns:
* You can explicitly specify an SRID when defining a spatial column. If no SRID is defined on the column, the optimizer will not select the spatial index during queries, but index records will still be generated during insert/update operations.
* A spatial index can be defined on a spatial column only after specifying the `NOT NULL` constraint and an SRID. In other words, only columns with a defined SRID can use spatial indexes.
* Once an SRID is defined on a spatial column, attempting to insert values with a different SRID will result in an error.
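Continuing the example above, column `g` is declared with SRID 4326, so a value with a different SRID is rejected:
```sql
INSERT INTO geom(g) VALUES (ST_GeomFromText('POINT(1 1)', 4326));  -- OK: SRID matches the column
INSERT INTO geom(g) VALUES (ST_GeomFromText('POINT(1 1)', 0));     -- error: SRID 0 does not match column SRID 4326
```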
The following constraints apply to `SRID`:
* You must explicitly specify `SRID` for a spatial column.
* All objects in the column must have the same SRID.


@@ -0,0 +1,71 @@
---
slug: /create-spatial-indexes
---
# Create a spatial index
seekdb allows you to create a spatial index using the `SPATIAL` keyword. When creating a table, the spatial index column must be declared as `NOT NULL`. Spatial indexes can be created on stored (STORED) generated columns, but not on virtual (VIRTUAL) generated columns.
## Constraints
* The column definition for creating a spatial index must include the `NOT NULL` constraint.
* The column with a spatial index must have an SRID defined. Otherwise, the spatial index on this column will not take effect during queries.
* If you create a spatial index on a STORED generated column, you must explicitly specify the `STORED` keyword in the DDL when creating the column. If neither the `VIRTUAL` nor the `STORED` keyword is specified when creating a generated column, a VIRTUAL generated column is created by default (see the sketch after this list).
* After an index is created, comparisons use the coordinate system corresponding to the SRID defined in the column. Spatial indexes store the Minimum Bounding Rectangle (MBR) of geometric objects, and the comparison method for MBRs also depends on the SRID.
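A minimal sketch of the STORED generated column case (the table and column names are hypothetical):
```sql
CREATE TABLE places (
    id INT PRIMARY KEY,
    lon DOUBLE,
    lat DOUBLE,
    -- STORED must be explicit; a VIRTUAL generated column cannot carry a spatial index
    g GEOMETRY GENERATED ALWAYS AS (Point(lon, lat)) STORED NOT NULL SRID 0,
    SPATIAL INDEX(g)
);
```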
## Preparations
Before using the GIS feature, you need to configure GIS metadata. After connecting to the server, execute the following command to import the `default_srs_data_mysql.sql` file into the database:
```sql
-- module specifies the module to import.
-- infile specifies the relative path of the SQL file to import.
ALTER SYSTEM LOAD MODULE DATA module=gis infile = 'etc/default_srs_data_mysql.sql';
```
<!-- For more information about the syntax, see [LOAD MODULE DATA](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000003980607). -->
The following result indicates that the data file was successfully imported:
```shell
Query OK, 0 rows affected
```
## Examples
The following examples show how to create a spatial index on a regular column:
* Using `CREATE TABLE`:
```sql
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326, SPATIAL INDEX(g));
```
* Using `ALTER TABLE`:
```sql
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326);
ALTER TABLE geom ADD SPATIAL INDEX(g);
```
* Using `CREATE INDEX`:
```sql
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326);
CREATE SPATIAL INDEX g ON geom (g);
```
The following examples show how to drop a spatial index:
* Using `ALTER TABLE`:
```sql
ALTER TABLE geom DROP INDEX g;
```
* Using `DROP INDEX`:
```sql
DROP INDEX g ON geom;
```


@@ -0,0 +1,275 @@
---
slug: /spatial-data-format
---
# Spatial data formats
seekdb supports two standard spatial data formats for representing geometric objects in queries:
* Well-Known Text (WKT)
* Well-Known Binary (WKB)
## WKT
WKT is defined based on Extended Backus-Naur Form (EBNF). WKT can be used both as a data format (referred to as WKT-Data in this document) and for defining spatial reference systems (SRS) in Geographic Information System (GIS) (referred to as WKT-SRS in this document).
### Point
A point does not use commas as separators. Example format:
```sql
POINT(15 20)
```
The following example uses `ST_X()` to extract the `X` coordinate from a point object. The first statement generates the object directly with the `Point()` function; the second converts the WKT representation into a point with `ST_GeomFromText()`.
```sql
obclient> SELECT ST_X(Point(15, 20));
+---------------------+
| ST_X(Point(15, 20)) |
+---------------------+
| 15 |
+---------------------+
1 row in set
obclient> SELECT ST_X(ST_GeomFromText('POINT(15 20)'));
+---------------------------------------+
| ST_X(ST_GeomFromText('POINT(15 20)')) |
+---------------------------------------+
| 15 |
+---------------------------------------+
1 row in set
```
### Line
A line consists of multiple points separated by commas. Example format:
```sql
LINESTRING(0 0, 10 10, 20 25, 50 60)
```
### Polygon
A polygon consists of at least one exterior ring (closed line) and any number (can be 0) of interior rings (closed lines). Example format:
```sql
POLYGON((0 0,10 0,10 10,0 10,0 0),(5 5,7 5,7 7,5 7, 5 5))
```
### MultiPoint
A MultiPoint consists of multiple points, similar to a line but with different semantics. Multiple connected points form a line, while discrete multiple points form a MultiPoint. Example format:
```sql
MULTIPOINT(0 0, 20 20, 60 60)
```
In the functions `ST_MPointFromText()` and `ST_GeomFromText()`, it is also valid to enclose each point of a MultiPoint in parentheses. Example format:
```sql
ST_MPointFromText('MULTIPOINT (1 1, 2 2, 3 3)')
ST_MPointFromText('MULTIPOINT ((1 1), (2 2), (3 3))')
```
### MultiLineString
A MultiLineString is a collection of multiple lines. Example format:
```sql
MULTILINESTRING((10 10, 20 20), (15 15, 30 15))
```
### MultiPolygon
A MultiPolygon is a collection of multiple polygons. Example format:
```sql
MULTIPOLYGON(((0 0,10 0,10 10,0 10,0 0)),((5 5,7 5,7 7,5 7, 5 5)))
```
### GeometryCollection
A GeometryCollection can be a collection of multiple basic types and collection types.
```sql
GEOMETRYCOLLECTION(POINT(10 10), POINT(30 30), LINESTRING(15 15, 20 20))
```
## WKB
WKB is developed based on the OpenGIS specification and supports seven types (Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection) with corresponding format definitions.
### Point
Using `POINT(1 -1)` as an example, the format definition is shown in the following table.
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 01000000 |
| **X coordinate** | 8 bytes | double-precision | 000000000000F03F |
| **Y coordinate** | 8 bytes | double-precision | 000000000000F0BF |
### LineString
Using `LINESTRING(1 -1, -1 1)` as an example, the format definition is shown in the following table. `Num points` must be greater than or equal to 2.
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 02000000 |
| **Num points** | 4 bytes | unsigned int | 02000000 |
| **X coordinate** | 8 bytes | double-precision | 000000000000F03F |
| **Y coordinate** | 8 bytes | double-precision | 000000000000F0BF |
| **X coordinate** | 8 bytes | double-precision | 000000000000F0BF |
| **Y coordinate** | 8 bytes | double-precision | 000000000000F03F |
### Polygon
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 03000000 |
| **Num rings** | 4 bytes | unsigned int | Greater than or equal to 1 |
| **repeat ring** | - |- | - |
### MultiPoint
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 04000000 |
| **Num points** | 4 bytes | unsigned int | Greater than or equal to 1 |
| **repeat POINT** | - |- | - |
### MultiLineString
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 05000000 |
| **Num linestrings** | 4 bytes | unsigned int | Greater than or equal to 1 |
| **repeat LINESTRING** | - | - | - |
### MultiPolygon
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 06000000 |
| **Num polygons** | 4 bytes | unsigned int | Greater than or equal to 1 |
| **repeat POLYGON** | - | - | - |
### GeometryCollection
| **Component** | **Specification** | **Type** | **Value** |
| --- | --- | --- | --- |
| **Byte order** | 1 byte | unsigned int | 01 |
| **WKB type** | 4 bytes | unsigned int | 07000000 |
| **Num wkbs** | 4 bytes | unsigned int | - |
| **repeat WKB** | - | - | - |
>Note:
>
>* Only GeometryCollection can be empty, indicating that 0 elements are stored. All other types cannot be empty.
>* When `LENGTH()` is applied to a GIS object, it returns the length of the stored binary data.
```sql
obclient [test]> SET @g = ST_GeomFromText('POINT(1 -1)');
Query OK, 0 rows affected
obclient [test]> SELECT LENGTH(@g);
+------------+
| LENGTH(@g) |
+------------+
| 25 |
+------------+
1 row in set
obclient [test]> SELECT HEX(@g);
+----------------------------------------------------+
| HEX(@g) |
+----------------------------------------------------+
| 000000000101000000000000000000F03F000000000000F0BF |
+----------------------------------------------------+
1 row in set
```
## Syntax and geometric validity
### Syntax validity
Syntax validity must satisfy the following conditions:
- A linestring must have at least two points.
- A polygon must have at least one ring.
- Each ring of a polygon must be closed (its first and last points are the same).
- Each ring must have at least four points (the smallest polygon is a triangle whose first and last points coincide).
- Except for GeometryCollection, other collection types cannot be empty.
### Geometric validity
Geometric validity must satisfy the following conditions:
- A polygon cannot intersect with itself.
- The exterior ring of a polygon must enclose all of its interior rings.
- A MultiPolygon cannot contain overlapping polygons.
You can explicitly check the geometric validity of a geometry object using the `ST_IsValid()` function.
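For instance, the self-intersection rule can be checked directly:
```sql
SELECT ST_IsValid(ST_GeomFromText('POLYGON((0 0, 4 0, 4 4, 0 4, 0 0))'));  -- 1: a valid square
SELECT ST_IsValid(ST_GeomFromText('POLYGON((0 0, 2 2, 2 0, 0 2, 0 0))'));  -- 0: the ring crosses itself
```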
## GIS Examples
### Insert data
```sql
-- Both conversion functions and WKT are included in the SQL statement.
INSERT INTO geom VALUES (ST_GeomFromText('POINT(1 1)'));
-- WKT is provided as a parameter.
SET @g = 'POINT(1 1)';
INSERT INTO geom VALUES (ST_GeomFromText(@g));
-- Conversion expressions are directly embedded in the parameters.
SET @g = ST_GeomFromText('POINT(1 1)');
INSERT INTO geom VALUES (@g);
-- A unified conversion function is used.
SET @g = 'LINESTRING(0 0,1 1,2 2)';
INSERT INTO geom VALUES (ST_GeomFromText(@g));
SET @g = 'POLYGON((0 0,10 0,10 10,0 10,0 0),(5 5,7 5,7 7,5 7, 5 5))';
INSERT INTO geom VALUES (ST_GeomFromText(@g));
SET @g = 'GEOMETRYCOLLECTION(POINT(1 1),LINESTRING(0 0,1 1,2 2,3 3,4 4))';
INSERT INTO geom VALUES (ST_GeomFromText(@g));
-- Type-specific conversion functions are employed.
SET @g = 'POINT(1 1)';
INSERT INTO geom VALUES (ST_PointFromText(@g));
SET @g = 'LINESTRING(0 0,1 1,2 2)';
INSERT INTO geom VALUES (ST_LineStringFromText(@g));
SET @g = 'POLYGON((0 0,10 0,10 10,0 10,0 0),(5 5,7 5,7 7,5 7, 5 5))';
INSERT INTO geom VALUES (ST_PolygonFromText(@g));
SET @g =
'GEOMETRYCOLLECTION(POINT(1 1),LINESTRING(0 0,1 1,2 2,3 3,4 4))';
INSERT INTO geom VALUES (ST_GeomCollFromText(@g));
-- Data can also be inserted directly using WKB.
INSERT INTO geom VALUES(ST_GeomFromWKB(X'0101000000000000000000F03F000000000000F03F'));
```
### Query data
```sql
-- Query data and convert it to WKT format for output.
SELECT ST_AsText(g) FROM geom;
-- Query data and convert it to WKB format for output.
SELECT ST_AsBinary(g) FROM geom;
```

View File

@@ -0,0 +1,46 @@
---
slug: /char-and-varchar
---
# CHAR and VARCHAR
`CHAR` and `VARCHAR` types are similar, but differ in how they are stored and retrieved, their maximum length, and whether trailing spaces are preserved.
## CHAR
The declared length of the `CHAR` type is the maximum number of characters that can be stored. For example, `CHAR(30)` can contain up to 30 characters.
Syntax:
```sql
[NATIONAL] CHAR[(M)] [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `CHAR` becomes `BINARY`.
The length of a `CHAR` column can be any value from 0 to 256. When `CHAR` values are stored, they are right-padded with spaces to the specified length.
For `CHAR` columns, excess trailing spaces in inserted values are silently truncated regardless of the SQL mode. When retrieving `CHAR` values, trailing spaces are removed unless the `PAD_CHAR_TO_FULL_LENGTH` SQL mode is enabled.
## VARCHAR
The declared length `M` of the `VARCHAR` type is the maximum number of characters that can be stored. For example, `VARCHAR(50)` can contain up to 50 characters.
Syntax:
```sql
[NATIONAL] VARCHAR(M) [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `VARCHAR` becomes `VARBINARY`.
The length of a `VARCHAR` column can be any value from 0 to 262144.
Compared with `CHAR`, `VARCHAR` values are stored as a 1-byte or 2-byte length prefix plus the data. The length prefix indicates the number of bytes in the value. If the value does not exceed 255 bytes, the column uses one byte; if the value may exceed 255 bytes, it uses two bytes.
For `VARCHAR` columns, trailing spaces that exceed the column length are truncated before insertion and generate a warning, regardless of the SQL mode.
`VARCHAR` values are not padded when stored. According to standard SQL, trailing spaces are preserved during both storage and retrieval.
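A short sketch (the table name is hypothetical) makes the retrieval difference visible:
```sql
CREATE TABLE vc (c CHAR(4), v VARCHAR(4));
INSERT INTO vc VALUES ('ab  ', 'ab  ');
SELECT CONCAT('(', c, ')') AS c_val, CONCAT('(', v, ')') AS v_val FROM vc;
-- Returns (ab) and (ab  ): CHAR strips trailing spaces on retrieval,
-- while VARCHAR preserves them.
```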
Additionally, seekdb also supports the extended type `CHARACTER VARYING(m)`, but `VARCHAR(m)` is recommended.


@@ -0,0 +1,64 @@
---
slug: /text
---
# TEXT types
The `TEXT` type is used to store all types of text data.
There are four text types: `TINYTEXT`, `TEXT`, `MEDIUMTEXT`, and `LONGTEXT`. They correspond to the four `BLOB` types and have the same maximum length and storage requirements.
`TEXT` values are treated as non-binary strings. They have a character set other than binary, and values are sorted and compared according to the collation rules of the character set.
When strict SQL mode is not enabled, if a value assigned to a `TEXT` column exceeds the column's maximum length, the portion that exceeds the length is truncated and a warning is generated. When using strict SQL mode, an error occurs (rather than a warning) if non-space characters are truncated, and the value insertion is prohibited. Regardless of the SQL mode, truncating excess trailing spaces from values inserted into `TEXT` columns always generates a warning.
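The SQL-mode dependence described above can be sketched as follows (the table name is hypothetical):
```sql
CREATE TABLE t_txt (c TINYTEXT);  -- maximum length 255 bytes
SET SESSION sql_mode = '';
INSERT INTO t_txt VALUES (REPEAT('a', 300));  -- truncated to 255 bytes, with a warning
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO t_txt VALUES (REPEAT('a', 300));  -- error: non-space characters would be truncated
```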
## TINYTEXT
`TINYTEXT` is a `TEXT` type with a maximum length of 255 bytes.
`TINYTEXT` syntax:
```sql
TINYTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
## TEXT
The maximum length of a `TEXT` column is 65,535 bytes.
An optional length `M` can be specified for the `TEXT` type. Syntax:
```sql
TEXT[(M)] [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the binary attribute of `CHARACTER SET` is specified, the column will be created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
## MEDIUMTEXT
`MEDIUMTEXT` is a `TEXT` type with a maximum length of 16,777,215 bytes.
`MEDIUMTEXT` syntax:
```sql
MEDIUMTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the `binary` attribute is specified for the character set, the column is created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
Additionally, seekdb also supports the extended type `LONG`, but `MEDIUMTEXT` is recommended.
## LONGTEXT
`LONGTEXT` is a `TEXT` type with a maximum length of 536,870,910 bytes. The effective maximum length of a `LONGTEXT` column also depends on the maximum packet size configured in the client/server protocol and available memory.
`LONGTEXT` syntax:
```sql
LONGTEXT [CHARACTER SET charset_name] [COLLATE collation_name]
```
`CHARACTER SET` is used to specify the character set. If needed, you can use the `COLLATE` attribute along with other attributes to specify the collation rules for the character set. If the `binary` attribute is specified for the character set, the column is created as the corresponding binary string data type, and `TEXT` becomes `BLOB`.
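A minimal sketch declaring the four text types side by side (hypothetical table; the limits are those listed above):

```sql
CREATE TABLE t_text (
    t1 TINYTEXT,    -- up to 255 bytes
    t2 TEXT,        -- up to 65,535 bytes
    t3 MEDIUMTEXT,  -- up to 16,777,215 bytes
    t4 LONGTEXT     -- up to 536,870,910 bytes
);
```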

View File

@@ -0,0 +1,329 @@
---
slug: /full-text-index
---
# Full-text indexes
In seekdb, full-text indexes can be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types. seekdb also allows multiple full-text indexes on a single base table, including multiple full-text indexes on the same column.
Full-text indexes can be created on both partitioned and non-partitioned tables, regardless of whether they have a primary key. The limitations for creating full-text indexes are as follows:
* Full-text indexes can only be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types.
* The current version only supports creating local (`LOCAL`) full-text indexes.
* The `UNIQUE` keyword cannot be specified when creating a full-text index.
* If you want to create a full-text index involving multiple columns, you must ensure that these columns have the same character set.
Within these rules, seekdb's full-text indexing provides efficient search and retrieval over text data, as the basic sketch below illustrates.
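A minimal sketch of a table with a full-text index and a query over it (the `articles` columns mirror the DML examples in the next section; `MATCH ... AGAINST` is the usual MySQL-compatible query form):

```sql
CREATE TABLE articles (
    title VARCHAR(200),
    context TEXT,
    FULLTEXT INDEX full_idx_articles (title, context)
);

SELECT * FROM articles
WHERE MATCH(title, context) AGAINST('fulltext search');
```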
## DML operations
For tables with full-text indexes, complex DML operations are supported, including `INSERT INTO ON DUPLICATE KEY`, `REPLACE INTO`, multi-table updates/deletes, and updatable views.
**Examples:**
* `INSERT INTO ON DUPLICATE KEY`:
```sql
INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key')
ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
```
* `REPLACE INTO`:
```sql
REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
```
* Multi-table updates and deletes.
1. Create table `tbl1`.
```sql
CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
```
2. Create table `tbl2`.
```sql
CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
```
3. Perform an update (`UPDATE`) operation on multiple tables.
```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a
SET tbl1.b = 'dddd', tbl2.b = 'eeee';
```
```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
```
```sql
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
```
4. Perform a delete (`DELETE`) operation on multiple tables.
```sql
DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
```
```sql
DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
```
* DML operations on updatable views.
1. Create view `fts_view`.
```sql
CREATE VIEW fts_view AS SELECT * FROM tbl1;
```
2. Perform an `INSERT` operation on the updatable view.
```sql
INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
```
3. Perform an `UPDATE` operation on the updatable view.
```sql
UPDATE fts_view SET b = 'dddd';
```
```sql
UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a
SET fts_view.b = 'dddd', tbl2.b = 'eeee';
```
4. Perform a `DELETE` operation on the updatable view.
```sql
DELETE FROM fts_view WHERE b = 'dddd';
```
```sql
DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a;
```
## Full-text index tokenizer
seekdb's full-text index functionality supports multiple built-in tokenizers, helping users select the optimal text tokenization strategy based on their business scenarios. The default tokenizer is **Space**, while other tokenizers need to be explicitly specified using the `WITH PARSER` parameter.
**List of tokenizers**:
* **Space tokenizer**
* **Basic English tokenizer**
* **IK tokenizer**
* **Ngram tokenizer**
* **Jieba tokenizer**
**Configuration example**:
When creating or modifying a table, specify the tokenizer type for the full-text index by setting the `WITH PARSER tokenizer_option` parameter in the `CREATE TABLE/ALTER TABLE` statement.
```sql
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
FULLTEXT INDEX full_idx1_tbl2(name, doc)
WITH PARSER NGRAM
PARSER_PROPERTIES=(ngram_token_size=3));
-- Add a full-text index with the desired tokenizer to an existing table.
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx2_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3); -- Ngram example
```
### Space tokenizer (default)
**Concepts**:
* This tokenizer splits text using spaces, punctuation marks (such as commas, periods), or non-alphanumeric characters (except underscore `_`) as delimiters.
* The tokenization results include only valid tokens with lengths between `min_token_size` (default 3) and `max_token_size` (default 84).
* Consecutive Chinese characters are treated as a single token.
**Applicable scenarios**:
* Languages separated by spaces such as English (for example "apple watch series 9").
* Chinese text with manually added delimiters (for example, "南京 长江大桥").
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM一平方公里也很小 hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM一平方公里也很小 hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
| ["详见www", "一平方公里也很小", "xxx", "南京市长江大桥有1千米长", "邮箱xx", "word", "hello", "h_name"] |
+-------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* Spaces, commas, periods, and other symbols serve as delimiters, and consecutive Chinese characters are treated as words.
### Basic English (Beng) tokenizer
**Concepts**:
* Similar to the Space tokenizer, but treats underscores `_` as separators instead of preserving them.
* Suitable for separating English phrases, but has limited effectiveness in splitting terms without spaces (such as "iPhone15").
**Applicable scenarios**:
* Basic retrieval of English documents (such as logs, comments).
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng') |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| ["user", "log", "system", "admin", "contact", "server", "active", "visit", "status", "entry", "example", "name", "time", "response", "150ms"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* Words joined by underscores `_` are split into separate tokens (for example, `server_status` -> `server`, `status`, and `user_name` -> `user`, `name`). The core difference from the Space tokenizer lies in how underscores `_` are handled.
### Ngram tokenizer
**Concepts**:
* **Fixed n-value tokenization**: By default, `n=2`. This tokenizer splits consecutive non-delimiter characters into subsequences of length `n`.
* Delimiter rules follow the Space tokenizer (preserving `_`, digits, and letters).
* **Does not support length limit parameters**; all possible tokens of length `n` are output.
**Applicable scenarios**:
* Fuzzy matching for short text (such as user IDs, order numbers).
* Scenarios requiring fixed-length feature extraction (such as password policy analysis).
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram');
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram') |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["ab", "hn", "am", "r_", "em", "le", "po", "ma", "ou", "xy", "jo", "pl", "_d", "89", "yz", "xa", "ck", "in", "se", "tr", "oh", "12", "d1", "il", "oe", "45", "un", "ac", "co", "ex", "us", "23", "34", "or", "er", "mp", "up", "de", "su", "rt", "pp", "n_", "nt", "ki", "rd", "_a", "bc", "ng", "cc", "od", "om", "78", "ra", "ai", "do", "id"] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* With the default setting `n=2`, this tokenizer outputs all consecutive 2-character tokens, including overlapping ones (for example, `ORD12345` -> `OR`, `RD`, `D1`, `12`, `23`, `34`, `45`; `user_account` -> `us`, `se`, `er`, `r_`, `_a`, `ac`, `cc`, `co`, `ou`, `un`, `nt`).
### Ngram2 tokenizer
**Concepts**:
* Supports **dynamic n-value range**: Sets token length range through `min_ngram_size` and `max_ngram_size` parameters.
* Suitable for scenarios requiring multi-length token coverage.
**Applicable scenarios**: Scenarios that require multiple fixed-length tokens simultaneously.
:::info
When using the ngram2 tokenizer, be aware of its high memory consumption. For example, setting a large range for the `min_ngram_size` and `max_ngram_size` parameters will generate a large number of token combinations, which may lead to excessive resource consumption.
:::
**Tokenization result**:
```shell
OceanBase [(root@oceanbase)]> select tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["io", "lo", "r_lo", "_ses", "_l", "r_", "ss", "user", "ses", "_s", "ogin", "sion", "on", "ess", "20", "logi", "er_", "on_", "use", "essi", "in", "se", "sio", "log", "202", "gin_", "_2", "ssi", "ogi", "us", "n_se", "r_l", "er", "024", "es", "n_2", "og", "_lo", "n_", "_log", "2024", "n_20", "gi", "er_l", "ser", "24", "ssio", "n_s", "gin", "in_", "_se", "02", "_20", "si", "sess", "on_2", "ion_", "ser_", "ion", "_202", "in_s"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
**Example explanation**:
* This tokenizer outputs all consecutive subsequences with lengths between 2 and 4 characters, with overlapping tokens allowed (for example, `user_login_session_2024` generates tokens like `us`, `use`, `user`, `se`, `ser`, `ser_`, `er_`, `er_l`, `r_lo`, `log`, `logi`, `ogin`, etc.).
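By analogy with the Ngram configuration example above, an index using this tokenizer might be declared as follows. This is a hedged sketch: the `min_ngram_size`/`max_ngram_size` spellings are taken from the `tokenize()` arguments shown here, not from a verified DDL reference.

```sql
CREATE TABLE tokens_demo (id INT PRIMARY KEY, doc TEXT,
    FULLTEXT INDEX full_idx_tokens(doc)
    WITH PARSER NGRAM2
    PARSER_PROPERTIES=(min_ngram_size=2, max_ngram_size=4));
```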
### IK tokenizer
**Concepts**:
* A Chinese tokenizer based on the open-source IK Analyzer tool, supporting two modes:
* **Smart mode**: Prioritizes outputting longer words, reducing the number of splits (for example, "南京市" is not split into "南京" and "市").
* **Max Word mode**: Outputs all possible shorter words (for example, "南京市" is split into "南京" and "市").
* Automatically recognizes English words, email addresses, URLs (without `://`), IP addresses, and other formats.
**Applicable scenarios**: Chinese word segmentation
**Business scenarios**:
* E-commerce product description search (for example, precise matching for "华为Mate60").
* Social media content analysis (for example, keyword extraction from user comments).
* **Smart mode**: Ensures that each character belongs to only one word with no overlap, and guarantees that individual words are as long as possible while minimizing the total number of words. Attempts to combine numerals and quantifiers into a single token.
```shell
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["邮箱", "hello_word", "192.168.1.1", "hello-word", "长江大桥", "www.baidu.com", "www.xxx.com", "xx@ob.com", "长", "http", "1千米", "详见", "南京市", "有"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
* **Max Word mode** (`max_word`): Allows the same character to appear in multiple tokens, producing as many candidate words as possible.
```shell
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见www.xxx.com, 邮箱xx@ob.com", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见www.xxx.com, 邮箱xx@ob.com", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["南京市", "南京", "市", "长江大桥", "长江", "大桥", "有", "1千米", "千米", "长", "详见", "www.xxx.com", "www", "xxx", "com", "邮箱", "xx@ob.com", "xx", "ob"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
### jieba tokenizer
**Concept**: A tokenizer based on the open-source `jieba` tool from the Python ecosystem, supporting precise mode, full mode, and search engine mode.
**Features**:
* **Precise mode**: Strictly segments words according to the dictionary (for example, "不能" is not segmented into "不" and "能").
* **Full mode**: Lists all possible segmentation combinations.
* **Search engine mode**: Balances precision and recall rate (for example, "南京市长江大桥" is segmented into "南京", "市长", and "长江大桥").
* Supports custom dictionaries and new word discovery, and is compatible with multiple languages (Chinese, English, Japanese, etc.).
**Applicable scenarios**:
* Medical/technology domain terminology analysis (e.g., precise segmentation of "人工智能").
* Multi-language mixed text processing (e.g., social media content with mixed Chinese and English).
To use the jieba tokenizer plugin, you must install it yourself. For instructions on how to build and install it, see [Tokenizer plugin](https://en.oceanbase.com/docs/common-oceanbase-database-10000000002414801).
:::tip
The current tokenizer plugin is an experimental feature and is not recommended for use in production environments.
:::
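Once the plugin is installed, it can be exercised directly with `tokenize()`, following the calling pattern of the built-in tokenizers; treat the `'jieba'` parser name below as an illustrative assumption rather than verified syntax.

```sql
SELECT tokenize("南京市长江大桥", 'jieba');
```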
### Tokenizer selection strategy
| **Business scenario** | **Recommended tokenizer** | **Reason** |
| --- | --- | --- |
| Search for English product titles | **Space** or **Basic English** | Simple and efficient, aligns with English tokenization conventions. |
| Retrieval of Chinese product descriptions | **IK tokenizer** | Accurately recognizes Chinese terminology, supports custom dictionaries. |
| Fuzzy matching of logs (such as error codes) | **Ngram tokenizer** | No dictionary required, covers fuzzy query needs for text without spaces. |
| Keyword extraction from technology papers | **jieba tokenizer** | Supports new word discovery and complex mode switching. |
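The choice from this table maps directly onto the `WITH PARSER` clause shown earlier. As a hedged sketch (hypothetical table, and assuming `IK` is accepted as a parser name in DDL as it is in `tokenize()`):

```sql
CREATE TABLE products (
    id INT PRIMARY KEY,
    description TEXT,
    FULLTEXT INDEX full_idx_products(description) WITH PARSER IK
);
```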
## References
For more information about creating full-text indexes, see the **Create full-text indexes** section in [Create an index](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971660).