Managing JSON Data Types in Klustron
Starting with Klustron-1.3, Klustron fully supports the management of JSON data. This includes creating and using JSON path indexes, pushing down most JSON functions and operators to the storage nodes for execution, and exchanging data between compute nodes and storage nodes in an internal binary format, among other capabilities.
This article first introduces the JSON data types in Klustron, then explains how to import JSON-formatted data into Klustron, and showcases querying and filtering JSON format data, creating and using JSON path indexes, and examples of joining JSON object fields with other standard table fields.
01 JSON Data Types
JSON (JavaScript Object Notation) is a lightweight data interchange format that stores and represents data in a text format independent of programming languages. JSON is easy to read and write for humans while also being easy for machines to parse and generate, enhancing network transmission efficiency. In data transmission over networks, JSON has become a strong alternative to XML.
Klustron offers two JSON data types: JSON and JSONB. The primary difference between them lies in the data storage format, with JSONB using a binary format that is easier to process.
The following table describes the differences between the two data types:
Feature | JSON | JSONB
---|---|---
Storage Format | Textual | Parsed binary
Full-text Index | Not supported | Supported (from version 1.3)
Whitespace Preservation | Preserved | Not preserved
Key Order Preservation | Preserved | Not preserved
Duplicate Key Preservation | Preserved | Not preserved (only the last value is kept)
Due to differences in storage format, JSONB is slightly slower to insert (as it requires conversion) but much faster to query. The rest of this article will primarily use the JSONB data type, although most features are also available with JSON.
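A quick way to observe these differences is to cast the same string to both types. This is standard PostgreSQL behavior, which the Klustron compute node is assumed to follow:
SELECT '{"b":1, "a":2, "a":3}'::json;
-- {"b":1, "a":2, "a":3}    (text stored as-is: order, whitespace, and duplicates preserved)
SELECT '{"b":1, "a":2, "a":3}'::jsonb;
-- {"a": 3, "b": 1}         (keys reordered, duplicate key collapsed to its last value)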
02 Using JSON Fields in Databases
Create a product table named product:
CREATE TABLE product (
id INT NOT NULL PRIMARY KEY,
product_code VARCHAR(10),
product_name VARCHAR(100),
attributes JSONB
);
The product table includes a JSONB field attributes for storing product attributes.
To assign a value to a JSON field, you can use a string directly; the string must be valid JSON, or an error is returned. Execute the following statement to insert a product record:
INSERT INTO product (id, product_code, product_name, attributes)
VALUES (1, '101', 'Chair', '{"color":"brown", "material":"solid wood", "height":"60cm"}');
Next, we attempt to insert data that is not in proper JSON format:
INSERT INTO product (id, product_code, product_name, attributes)
VALUES (2, '102', 'Armchair', '"color":"white:50cm}');
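Because the string is not valid JSON, the statement is rejected. On a PostgreSQL-compatible compute node, the error typically looks like this (exact wording may vary by version):
ERROR:  invalid input syntax for type json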
Although inserting JSON data as shown above is straightforward, the input process can be cumbersome. For this reason, Klustron also provides functions that facilitate the creation of JSON data. The jsonb_build_object function creates a binary JSON object using a series of inputs, for example:
INSERT INTO product (id,product_code,product_name, attributes)
VALUES (3, '103', 'Small Table', JSONB_BUILD_OBJECT('color', 'black', 'material', 'plastic'));
Other commonly used functions for building JSON data include the following (a brief sketch follows the list):
- json_build_object
- to_json / to_jsonb
- array_to_json
- row_to_json
- json_build_array
- jsonb_build_array
- json_object
- jsonb_object
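A minimal sketch of a few of these (standard PostgreSQL built-ins, assumed to behave the same on a Klustron compute node):
SELECT to_jsonb(ARRAY['red', 'green']);                    -- ["red", "green"]
SELECT jsonb_build_array(1, 'two', true);                  -- [1, "two", true]
SELECT json_object('{color, black, material, plastic}');   -- {"color" : "black", "material" : "plastic"}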
Querying JSON fields is no different from querying regular fields. For example, a simple query that returns the whole JSONB column for every row:
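SELECT id, product_name, attributes FROM product;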
Querying not only allows for retrieving entire JSON fields but also for extracting the value of a specific attribute from within the JSON data. The -> operator retrieves the value associated with a specified key, which is still returned as a JSON type and so stays enclosed in double quotes. To return the value of a node as plain text instead, use the ->> operator, as in the following sketch against the product row inserted earlier:
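SELECT attributes->'color'  AS color_json,   -- "brown"  (JSON value, quoted)
       attributes->>'color' AS color_text    -- brown    (plain text)
FROM product
WHERE id = 1;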
For further information on using JSON data types in Klustron, please refer to the link:
https://downloads.kunlunbase.com/docs_cn/html_cn/datatype-json.html
03 Examples of Complex Queries Using JSON Data Types
First, create two tables, sales_order and product:
create table sales_order
(
order_number INT NOT NULL AUTO_INCREMENT,
customer_number INT NOT NULL,
product_code INT NOT NULL,
order_date DATETIME NOT NULL,
entry_date DATETIME NOT NULL,
order_amount DECIMAL(18,2) NOT NULL,
primary key(order_number)
);
create table product
(
product_code INT NOT NULL AUTO_INCREMENT,
product_name VARCHAR(128) NOT NULL,
product_category VARCHAR(256) NOT NULL,
PRIMARY KEY (product_code)
) ;
Insert 1,000,000 rows of data into sales_order and 1,000 rows into product.
create or replace procedure generate_order_data()
AS $$
DECLARE
v_customer_number integer;
v_product_code integer;
v_order_date date;
v_amount integer;
start_date date := to_date('2021-01-01','yyyy-mm-dd');
i integer :=1;
BEGIN
while i<=1000000 loop
v_customer_number := FLOOR(1+RANDOM()*6);                        -- customer 1..6
v_product_code := FLOOR(1+RANDOM()*500);                         -- product 1..500
v_order_date := start_date + CAST(FLOOR(RANDOM()*365) AS INT);   -- random day in 2021
v_amount := FLOOR(1000+RANDOM()*9000);                           -- amount 1000..9999
insert into sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
commit;
i := i+1;
end loop;
END; $$
LANGUAGE plpgsql;
create or replace procedure generate_product_data()
AS $$
DECLARE
v_product_name varchar(128);
i integer :=1;
BEGIN
while i<=1000 loop
case mod(i,3)
when 0 then
v_product_name := 'Hard Disk '||i;
INSERT INTO product VALUES (i,v_product_name,'Storage');
when 1 then
v_product_name := 'LCD '||i;
INSERT INTO product VALUES (i,v_product_name,'Monitor');
when 2 then
v_product_name := 'Paper '||i;
INSERT INTO product VALUES (i,v_product_name,'Paper');
end case;
commit;
i := i+1;
end loop;
END; $$
LANGUAGE plpgsql;
set statement_timeout=0;
call generate_order_data();
call generate_product_data();
Then, convert the data in sales_order into JSONB type and insert it into the sales_order_json table, with the same volume of 1,000,000 rows.
create table sales_order_json_seed (
order_detail jsonb);
insert into sales_order_json_seed SELECT row_to_json("sales_order")::jsonb FROM sales_order;
create table sales_order_json (
order_number int NOT NULL,
order_detail jsonb,
primary key(order_number));
insert into sales_order_json select (order_detail->'order_number')::integer, order_detail from sales_order_json_seed;
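For reference, each row of sales_order_json now stores an entire order as a single JSONB document. An illustrative (not actual) order_detail value produced by row_to_json looks like:
{"order_number": 1, "customer_number": 3, "product_code": 215, "order_date": "2021-05-14T00:00:00", "entry_date": "2021-05-14T00:00:00", "order_amount": 5321.00}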
3.1 Querying JSON Fields in a Single Table
To query the total sales for January 2021, you can use the following SQL. Here, the value of the order_date attribute is extracted from the JSON field as a string (order_detail->>'order_date'), so it also needs to be cast to a datetime type using ::DATETIME.
select sum((order_detail->>'order_amount')::DECIMAL(18,2)) from sales_order_json where
(order_detail->>'order_date')::DATETIME between '2021-01-01' and '2021-01-31';
You can run the same SQL on the regular sales_order table to verify the summarized results.
select sum(order_amount) from sales_order where order_date between '2021-01-01' and '2021-01-31';
3.2 Performing Joins on JSON Object Fields Between Two Tables
Query the total sales of all LCD category products in January 2021. The specific query SQL is as follows:
select sum((order_detail->>'order_amount')::DECIMAL(18,2)) from sales_order_json t1,product t2 where
(order_detail->>'product_code')::integer = t2.product_code and (order_detail->>'order_date')::DATETIME between '2021-01-01' and '2021-01-31' and t2.product_name like 'LCD%';
The query result can be verified on the regular sales_order table as well:
select sum(order_amount) from sales_order t1,product t2 where t1.product_code = t2.product_code
and t1.order_date between '2021-01-01' and '2021-01-31'
and t2.product_name like 'LCD%';
The results match those obtained using the sales_order_json table.
3.3 Creating and Using JSON Path Indexes
Preparing Data
test1=# create table gears(id serial primary key, name varchar(64), size jsonb, unit text generated always as (size->'unit') virtual);
CREATE TABLE
test1=# insert into gears(name, size) values('shaft','{"height":200, "width": 100,"unit":"mm"}');
INSERT 0 1
test1=# insert into gears(name, size) values('bolt','{"height":40, "width": 10,"unit":"mm"}');
INSERT 0 1
test1=# alter table gears add column height int generated always as (int4(size->'height')) virtual;
ALTER TABLE
test1=# alter table gears add column width int generated always as (int4(size->'width')) virtual;
ALTER TABLE
test1=# select * from gears;
id | name | size | unit | height | width
----+-------+---------------------------------------------+------+--------+-------
1 | shaft | {"unit": "mm", "width": 100, "height": 200} | "mm" | 200 | 100
2 | bolt | {"unit": "mm", "width": 10, "height": 40} | "mm" | 40 | 10
(2 rows)
test1=# create index gears_h on gears(height);
CREATE INDEX
The data is prepared as shown above. The CREATE TABLE statement creates a virtual generated column unit, followed by two ALTER TABLE statements that create the virtual generated columns height and width. These generated columns do not occupy physical storage space and are computed during queries.
An index, gears_h, is created on the height column; this is a JSON path index. Note that the queries below explicitly specify the conditions height=40 and unit='"mm"', which use the indexed virtual columns. From the query plan on the compute node, you can see that these filter conditions are pushed down to the storage node; and from the storage node's query plan, you can see that the query uses the gears_h and gears_unit path indexes.
Query Execution Results and Plans in the Compute Node
test1=# explain select * from gears where height=40;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
RemotePlan (cost=102.05..102.05 rows=1 width=222)
Shard: 4 Remote SQL: SELECT gears.id,gears.name,gears.size,gears.unit,gears.height,gears.width FROM `test1_$$_public`.`gears` WHERE (gears.height = cast(? as signed))
(2 rows)
test1=# select * from gears where height=40;
id | name | size | unit | height | width
----+------+-------------------------------------------+------+--------+-------
2 | bolt | {"unit": "mm", "width": 10, "height": 40} | "mm" | 40 | 10
(1 row)
test1=# create index gears_unit on gears(unit);
CREATE INDEX
test1=# select * from gears where unit='"mm"';
id | name | size | unit | height | width
----+-------+---------------------------------------------+------+--------+-------
1 | shaft | {"unit": "mm", "width": 100, "height": 200} | "mm" | 200 | 100
2 | bolt | {"unit": "mm", "width": 10, "height": 40} | "mm" | 40 | 10
(2 rows)
test1=# explain select * from gears where unit='"mm"';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
RemotePlan (cost=102.05..102.05 rows=1 width=222)
Shard: 4 Remote SQL: SELECT gears.id,gears.name,gears.size,gears.unit,gears.height,gears.width FROM `test1_$$_public`.`gears` WHERE (gears.unit = cast(? as char) collate `utf8mb4_0900_bin`)
(2 rows)
Storage Node Query Plans for the Above Queries
mysql> explain SELECT gears.id,gears.name,gears.size,gears.unit,gears.height,gears.width FROM `test1_$$_public`.`gears` WHERE (gears.height = cast(40 as signed));
+----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+
| 1 | SIMPLE | gears | NULL | ref | gears_h | gears_h | 5 | const | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
mysql> explain SELECT gears.id,gears.name,gears.size,gears.unit,gears.height,gears.width FROM `test1_$$_public`.`gears` WHERE (gears.unit = cast("\"mm\"" as char) collate `utf8mb4_0900_bin`);
+----+-------------+-------+------------+------+---------------+------------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------------+---------+-------+------+----------+-------------+
| 1 | SIMPLE | gears | NULL | ref | gears_unit | gears_unit | 259 | const | 2 | 100.00 | Using where |
+----+-------------+-------+------------+------+---------------+------------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
3.4 Comparison of JSON Objects in InnoDB and RocksDB Storage Engines
Importing large JSON objects obtained from the internet into Klustron requires converting them into NDJSON format first.
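If the source file is one large JSON array, a common way to flatten it into NDJSON (one object per line) is with jq; the file names here are hypothetical:
jq -c '.[]' source.json > 1.json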
create table temp(data jsonb);
\COPY temp (data) FROM '/home/kunlun/json/1.json';
\COPY temp (data) FROM '/home/kunlun/json/2.json';
A total of 10,000 JSON objects were imported. Query the id field within the JSON objects to check that the import succeeded.
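A minimal check, assuming each imported object carries a top-level id key:
select data->>'id' from temp limit 5;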
The size of the temp table is 787.52 MB.
Create a table using the RocksDB storage engine and check the storage space used.
create table temp_rdb(data jsonb) with (engine=rocksdb);
\COPY temp_rdb (data) FROM '/home/kunlun/json/1.json';
\COPY temp_rdb (data) FROM '/home/kunlun/json/2.json';
Data is compressed and stored under the RocksDB storage engine.
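To compare the on-disk footprint of the two tables, the standard PostgreSQL size helpers can be used (assuming they are available on the compute node):
select pg_size_pretty(pg_total_relation_size('temp'));
select pg_size_pretty(pg_total_relation_size('temp_rdb'));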
3.5 JSON Object Read and Write Performance Test
Since the tests are conducted on a virtual machine, the execution times mentioned in the article can only serve as a reference for comparison between different scenarios under the same environment. The configuration information of the test cluster virtual machines is as follows:
Node Type | Name | IP | CPU | Memory
---|---|---|---|---
Compute node | | 10.37.129.6 | 2C Intel i9 2.4 GHz | 6G
Storage primary node | shard_1 | 10.37.129.5 | 2C Intel i9 2.4 GHz | 6G
Storage primary node | shard_2 | 10.37.129.7 | 2C Intel i9 2.4 GHz | 6G
Create a new table test1 from the previous temp table and populate it with data.
create table test1 (id int primary key, content jsonb);
CREATE SEQUENCE
test_id_seq
INCREMENT 1
MINVALUE 1
MAXVALUE 9223372036854775807
START WITH 1
CACHE 1;
insert into test1 select nextval('test_id_seq'),data from temp;
analyze test1;
Use the pgbench tool to run read/write performance tests on the test1 table. Prepare custom.sql as follows:
\set id random(1,100000)
BEGIN;
select * from test1 where id = :id;
update test1 set content=jsonb_set(content,'{user,email}','"klustron@mail.com"') where id = :id;
END;
Parameters include 10 connections, running a stress test for 2 minutes.
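A typical invocation matching those parameters (host, port, and database name are placeholders for your cluster):
pgbench -n -c 10 -T 120 -f custom.sql -h <compute_node_ip> -p <port> <dbname>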
In this environment, the JSON object read/write workload averaged 1001 TPS with 10 connections.