Files
gh-oceanbase-ecology-plugin…/skills/seekdb-docs/official-docs/200.develop/500.multi-model/100.json/600.json-semi-struct.md
2025-11-30 08:44:54 +08:00

124 lines
7.7 KiB
Markdown

---
slug: /json-semi-struct
---
# Semi-structured encoding
This topic describes the semi-structured encoding feature supported by seekdb.
seekdb supports enabling semi-structured encoding when creating tables, primarily controlled by the table-level parameter `SEMISTRUCT_PROPERTIES`. You must also set `ROW_FORMAT=COMPRESSED` for the table, otherwise an error will occur.
## Considerations
* When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)`, the table is considered a semi-structured table, meaning all JSON columns in the table will have semi-structured encoding enabled.
* When `SEMISTRUCT_PROPERTIES=(encoding_type=none)`, the table is considered a structured table.
* You can also set the frequency threshold using the `freq_threshold` parameter. When semi-structured encoding is enabled, the system analyzes the frequency of each path in the JSON data and stores paths with frequencies exceeding the specified threshold as independent subcolumns, known as frequent columns. For example, if you have a user table where the JSON field stores user information and 90% of users have the `name` and `age` fields, the system will automatically extract `name` and `age` as independent frequent columns. During queries, these columns are accessed directly without parsing the entire JSON, thereby improving query performance.
* Currently, `encoding_type` and `freq_threshold` can only be modified using online DDL statements, not offline DDL statements.
## Data format
JSON data is split and stored as structured columns in a specific format. The columns split from JSON columns are called subcolumns. Subcolumns can be categorized into different types, including sparse columns and frequent columns.
* Sparse columns: Subcolumns that exist in some JSON documents but not in others, with an occurrence frequency lower than the threshold specified by the table-level parameter `freq_threshold`.
* Frequent columns: Subcolumns that appear in JSON data with a frequency higher than the threshold specified by the table-level parameter `freq_threshold`. These subcolumns are stored as independent columns to improve filtering query performance.
For example:
```sql
{"id": 1001, "name": "n1", "nickname": "nn1"}
{"id": 1002, "name": "n2", "nickname": "nn2"}
{"id": 1003, "name": "n3", "nickname": "nn3"}
{"id": 1004, "name": "n4", "nickname": "nn4"}
{"id": 1005, "name": "n5"}
```
In this example, `id` and `name` are fields that exist in every JSON document with an occurrence frequency of 100%, while `nickname` exists in only four JSON documents with an occurrence frequency of 80%.
If `freq_threshold` is set to 100%, then `nickname` will be inferred as a sparse column, while `id` and `name` will be inferred as frequent columns. If set to 80%, then `nickname`, `id`, and `name` will all be inferred as frequent columns.
## Examples
1. Enable semi-structured encoding.
:::tip
If you enable semi-structured encoding, make sure that the parameter <a href="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001971939">micro_block_merge_verify_level</a> is set to the default value <code>2</code>. Do not disable micro-block major compaction verification.
:::
:::tab
tab Example: Enable semi-structured encoding during table creation
```sql
CREATE TABLE t1( j json)
ROW_FORMAT=COMPRESSED
SEMISTRUCT_PROPERTIES=(encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [CREATE TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974140).
tab Example: Enable semi-structured encoding for existing table
```sql
CREATE TABLE t1(j json);
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=encoding, freq_threshold=50);
```
For more information about the syntax, see [ALTER TABLE](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001974126).
Some modification limitations:
* If semi-structured encoding is not enabled, modifying the frequent column threshold will not report an error but will have no effect.
* The `freq_threshold` parameter cannot be modified during direct load operations or when the table is locked.
* Modifying one sub-parameter does not affect the others.
:::
2. Disable semi-structured encoding.
When `SEMISTRUCT_PROPERTIES` is set to `(encoding_type=none)`, semi-structured encoding is disabled. This operation does not affect existing data and only applies to data written afterward. Here is an example of disabling semi-structured encoding:
```sql
ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=none);
```
3. Query semi-structured encoding configuration.
Use the `SHOW CREATE TABLE` statement to query the semi-structured encoding configuration. Here is an example statement:
```sql
SHOW CREATE TABLE t1;
```
The result is as follows:
```shell
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1 | CREATE TABLE `t1` (
`j` json DEFAULT NULL
) ORGANIZATION INDEX DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPRESSED COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE ENABLE_MACRO_BLOCK_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 SEMISTRUCT_PROPERTIES=(ENCODING_TYPE=ENCODING, FREQ_THRESHOLD=50) |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
```
When `SEMISTRUCT_PROPERTIES=(encoding_type=encoding)` is specified, the query displays this parameter, indicating that semi-structured encoding is enabled.
Using semi-structured encoding can improve the performance of conditional filtering queries with the [JSON_VALUE() function](https://en.oceanbase.com/docs/common-oceanbase-database-10000000001975890). Based on JSON semi-structured encoding technology, seekdb optimizes the performance of `JSON_VALUE` expression conditional filtering query scenarios. Since JSON data is split into sub-columns, the system can filter directly based on the encoded sub-column data without reconstructing the complete JSON structure, significantly improving query efficiency.
Here is an example query:
```sql
-- Query rows where the value of the name field is 'Devin'
SELECT * FROM t WHERE JSON_VALUE(j_doc, '$.name' RETURNING CHAR) = 'Devin';
```
Character set considerations:
- seekdb uses `utf8_bin` encoding for JSON.
- To ensure string whitebox filtering works properly, we recommend the following settings:
```sql
SET @@collation_server = 'utf8mb4_bin';
SET @@collation_connection='utf8mb4_bin';
```