Managing JSON Data Types in Klustron
Starting with Klustron-1.3, Klustron fully supports the management of JSON data. This includes creating and using JSON path indexes, pushing down most JSON functions and operators to the storage nodes for execution, and exchanging data between compute nodes and storage nodes in an internal binary format, among other capabilities.
This article first introduces the JSON data types in Klustron, then explains how to import JSON-formatted data into Klustron, and showcases querying and filtering JSON format data, creating and using JSON path indexes, and examples of joining JSON object fields with other standard table fields.
01 JSON Data Types
JSON (JavaScript Object Notation) is a lightweight data interchange format that stores and represents data in a text format independent of programming languages. JSON is easy to read and write for humans while also being easy for machines to parse and generate, enhancing network transmission efficiency. In data transmission over networks, JSON has become a strong alternative to XML.
Klustron offers two JSON data types: JSON and JSONB. The primary difference between them lies in the data storage format, with JSONB using a binary format that is easier to process.
The following table describes the differences between the two data types:
Feature | JSON | JSONB
---|---|---
Storage Format | Textual | Parsed binary
Full-text Index | Not supported | Supported (from version 1.3)
Whitespace Preservation | Preserved | Not preserved
Key Order Preservation | Preserved | Not preserved
Duplicate Key Preservation | Preserved | Not preserved (only the last value is kept)
Due to differences in storage format, JSONB is slightly slower to insert (as it requires conversion) but much faster to query. The rest of this article will primarily use the JSONB data type, although most features are also available with JSON.
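A quick way to observe these differences is to cast the same string to both types. This is standard PostgreSQL behavior, which the Klustron compute node is assumed to follow:
SELECT '{"b":1, "a":2, "a":3}'::json;
-- {"b":1, "a":2, "a":3}    (text stored as-is: order, whitespace, and duplicates preserved)
SELECT '{"b":1, "a":2, "a":3}'::jsonb;
-- {"a": 3, "b": 1}         (keys reordered, duplicate key collapsed to its last value)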
02 Using JSON Fields in Databases
Create a product table named product:
CREATE TABLE product (
id INT NOT NULL PRIMARY KEY,
product_code VARCHAR(10),
product_name VARCHAR(100),
attributes JSONB
);
The product table includes a JSONB field attributes for storing product attributes.
To assign a value to a JSON field, you can use a string directly; the string must be valid JSON, or an error is returned. Execute the following statement to insert a product record:
INSERT INTO product (id, product_code, product_name, attributes)
VALUES (1, '101', 'Chair', '{"color":"brown", "material":"solid wood", "height":"60cm"}');
Next, we attempt to insert data that is not in proper JSON format:
INSERT INTO product (id, product_code, product_name, attributes)
VALUES (2, '102', 'Armchair', '"color":"white:50cm}');
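Because the string is not valid JSON, the statement is rejected. On a PostgreSQL-compatible compute node, the error typically looks like this (exact wording may vary by version):
ERROR:  invalid input syntax for type json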
Although inserting JSON data as shown above is straightforward, the input process can be cumbersome. For this reason, Klustron also provides functions that facilitate the creation of JSON data. The jsonb_build_object function creates a binary JSON object using a series of inputs, for example:
INSERT INTO product (id,product_code,product_name, attributes)
VALUES (3, '103', 'Small Table', JSONB_BUILD_OBJECT('color', 'black', 'material', 'plastic'));
Other commonly used functions for building JSON data include the following (a brief sketch follows the list):
- json_build_object
- to_json / to_jsonb
- array_to_json
- row_to_json
- json_build_array
- jsonb_build_array
- json_object
- jsonb_object
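A minimal sketch of a few of these (standard PostgreSQL built-ins, assumed to behave the same on a Klustron compute node):
SELECT to_jsonb(ARRAY['red', 'green']);                    -- ["red", "green"]
SELECT jsonb_build_array(1, 'two', true);                  -- [1, "two", true]
SELECT json_object('{color, black, material, plastic}');   -- {"color" : "black", "material" : "plastic"}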
Querying JSON fields is no different from querying regular fields. For example, a simple query that returns the whole JSONB column for every row:
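SELECT id, product_name, attributes FROM product;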
Querying not only allows for retrieving entire JSON fields but also for extracting the value of a specific attribute from within the JSON data. The -> operator retrieves the value associated with a specified key, which is still returned as a JSON type and so stays enclosed in double quotes. To return the value of a node as plain text instead, use the ->> operator, as in the following sketch against the product row inserted earlier:
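SELECT attributes->'color'  AS color_json,   -- "brown"  (JSON value, quoted)
       attributes->>'color' AS color_text    -- brown    (plain text)
FROM product
WHERE id = 1;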
For further information on using JSON data types in Klustron, please refer to the link:
https://downloads.kunlunbase.com/docs_cn/html_cn/datatype-json.html
03 Examples of Complex Queries Using JSON Data Types
First, create two tables, sales_order and product:
create table sales_order
(
order_number INT NOT NULL AUTO_INCREMENT,
customer_number INT NOT NULL,
product_code INT NOT NULL,
order_date DATETIME NOT NULL,
entry_date DATETIME NOT NULL,
order_amount DECIMAL(18,2) NOT NULL,
primary key(order_number)
);
create table product
(
product_code INT NOT NULL AUTO_INCREMENT,
product_name VARCHAR(128) NOT NULL,
product_category VARCHAR(256) NOT NULL,
PRIMARY KEY (product_code)
) ;
Insert 1,000,000 rows of data into sales_order and 1,000 rows into product.
create or replace procedure generate_order_data()
AS $$
DECLARE
v_customer_number integer;
v_product_code integer;
v_order_date date;
v_amount integer;
start_date date := to_date('2021-01-01','yyyy-mm-dd');
i integer :=1;
BEGIN
while i<=1000000 loop
v_customer_number := FLOOR(1+RANDOM()*6);                        -- customer 1..6
v_product_code := FLOOR(1+RANDOM()*500);                         -- product 1..500
v_order_date := start_date + CAST(FLOOR(RANDOM()*365) AS INT);   -- random day in 2021
v_amount := FLOOR(1000+RANDOM()*9000);                           -- amount 1000..9999
insert into sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
commit;
i := i+1;
end loop;
END; $$
LANGUAGE plpgsql;
create or replace procedure generate_product_data()
AS $$
DECLARE
v_product_name varchar(128);
i integer :=1;
BEGIN
while i<=1000 loop
case mod(i,3)
when 0 then
v_product_name := 'Hard Disk '||i;
INSERT INTO product VALUES (i,v_product_name,'Storage');
when 1 then
v_product_name := 'LCD '||i;
INSERT INTO product VALUES (i,v_product_name,'Monitor');
when 2 then
v_product_name := 'Paper '||i;
INSERT INTO product VALUES (i,v_product_name,'Paper');
end case;
commit;
i := i+1;
end loop;
END; $$
LANGUAGE plpgsql;
set statement_timeout=0;
call generate_order_data();
call generate_product_data();
Then, convert the data in sales_order into JSONB type and insert it into the sales_order_json table, with the same volume of 1,000,000 rows.
create table sales_order_json_seed (
order_detail jsonb);
insert into sales_order_json_seed SELECT row_to_json("sales_order")::jsonb FROM sales_order;
create table sales_order_json (
order_number int NOT NULL,
order_detail jsonb,
primary key(order_number));
insert into sales_order_json select (order_detail->'order_number')::integer, order_detail from sales_order_json_seed;
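For reference, each row of sales_order_json now stores an entire order as a single JSONB document. An illustrative (not actual) order_detail value produced by row_to_json looks like:
{"order_number": 1, "customer_number": 3, "product_code": 215, "order_date": "2021-05-14T00:00:00", "entry_date": "2021-05-14T00:00:00", "order_amount": 5321.00}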
3.1 Querying JSON Fields in a Single Table
To query the total sales for January 2021, you can use the following SQL. Here, the value of the order_date attribute is extracted from the JSON field as a string (order_detail->>'order_date'), so it also needs to be cast to a datetime type using ::DATETIME.
select sum((order_detail->>'order_amount')::DECIMAL(18,2)) from sales_order_json where
(order_detail->>'order_date')::DATETIME between '2021-01-01' and '2021-01-31';
You can run the same SQL on the regular sales_order table to verify the summarized results.
select sum(order_amount) from sales_order where order_date between '2021-01-01' and '2021-01-31';
3.2 Performing Joins on JSON Object Fields Between Two Tables
Query the total sales of all LCD category products in January 2021. The specific query SQL is as follows:
select sum((order_detail->>'order_amount')::DECIMAL(18,2)) from sales_order_json t1,product t2 where
(order_detail->>'product_code')::integer = t2.product_code and (order_detail->>'order_date')::DATETIME between '2021-01-01' and '2021-01-31' and t2.product_name like 'LCD%';
The query result can be verified on the regular sales_order table as well:
select sum(order_amount) from sales_order t1,product t2 where t1.product_code = t2.product_code
and t1.order_date between '2021-01-01' and '2021-01-31'
and t2.product_name like 'LCD%';
The results match those obtained using the sales_order_json table.
3.3 Creating and Using JSON Path Indexes
Preparing Data
test1=# create table gears(id serial primary key, name varchar(64), size jsonb, unit text generated always as (size->'unit') virtual);
CREATE TABLE
test1=# insert into gears(name, size) values('shaft','{"height":200, "width": 100,"unit":"mm"}');
INSERT 0 1
test1=# insert into gears(name, size) values('bolt','{"height":40, "width": 10,"unit":"mm"}');
INSERT 0 1
test1=# alter table gears add column height int generated always as (int4(size->'height')) virtual;
ALTER TABLE
test1=# alter table gears add column width int generated always as (int4(size->'width')) virtual;
ALTER TABLE
test1=# select * from gears;
id | name | size | unit | height | width
----+-------+---------------------------------------------+------+--------+-------
1 | shaft | {"unit": "mm", "width": 100, "height": 200} | "mm" | 200 | 100
2 | bolt | {"unit": "mm", "width": 10, "height": 40} | "mm" | 40 | 10
(2 rows)
test1=# create index gears_h on gears(height);
CREATE INDEX
The data is prepared as shown above. The CREATE TABLE statement creates a virtual generated column unit, followed by two ALTER TABLE statements that create the virtual generated columns height and width. These generated columns do not occupy physical storage space and are computed during queries.
An index, gears_h, is created on the height column; this is a JSON path index. Note that the queries below explicitly specify the conditions height=40 and unit='"mm"', which use the indexed virtual columns. From the query plan on the compute node, you can see that these filter conditions are pushed down to the storage node; and from the storage node's query plan, you can see that the query uses the gears_h and gears_unit path indexes.
Query Execution Results and Plans in the Compute Node
test1=# explain select * from gears where height=40;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
RemotePlan (cost=102.05..102.05 rows=1 width=222)
Shard: 4 Remote SQL: SELECT gears.id,gears.name,gears.size,gears.unit,gears.height,gears.width FROM `test1_$$_public`.`gears` WHERE (gears.height = cast(? as signed))
(2 rows)
test1=# select * from gears where height=40;
id | name | size | unit | height | width
----+------+-------------------------------------------+------+--------+-------
2 | bolt | {"unit": "mm", "width": 10, "height": 40} | "mm" | 40 | 10
(1 row)
test1=# create index gears_unit on gears(unit);
CREATE INDEX
test1=# select * from gears where unit='"mm"';
id | name | size | unit | height | width
----+-------+---------------------------------------------+------+--------+-------
1 | shaft | {"unit": "mm", "width": 100, "height": 200} | "mm" | 200 | 100
2 | bolt | {"unit": "mm", "width": 10, "height": 40} | "mm" | 40 | 10
(2 rows)
test1=# explain select * from gears where unit='"mm"';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
RemotePlan (cost=102.05..102.05 rows=1 width=222)
Shard: 4 Remote SQL: SELECT gears.id,gears.name,gears.size,gears.unit,gears.height,gears.width FROM `test1_$$_public`.`gears` WHERE (gears.unit = cast(? as char) collate `utf8mb4_0900_bin`)
(2 rows)
Storage Node Query Plans for the Above Queries
mysql> explain SELECT gears.id,gears.name,gears.size,gears.unit,gears.height,gears.width FROM `test1_$$_public`.`gears` WHERE (gears.height = cast(40 as signed));
+----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+
| 1 | SIMPLE | gears | NULL | ref | gears_h | gears_h | 5 | const | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
mysql> explain SELECT gears.id,gears.name,gears.size,gears.unit,gears.height,gears.width FROM `test1_$$_public`.`gears` WHERE (gears.unit = cast("\"mm\"" as char) collate `utf8mb4_0900_bin`);
+----+-------------+-------+------------+------+---------------+------------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------------+---------+-------+------+----------+-------------+
| 1 | SIMPLE | gears | NULL | ref | gears_unit | gears_unit | 259 | const | 2 | 100.00 | Using where |
+----+-------------+-------+------------+------+---------------+------------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
3.4 Comparison of JSON Objects in InnoDB and RocksDB Storage Engines
Importing large JSON objects obtained from the internet into Klustron requires converting them into NDJSON format first.
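If the source file is one large JSON array, a common way to flatten it into NDJSON (one object per line) is with jq; the file names here are hypothetical:
jq -c '.[]' source.json > 1.json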
create table temp(data jsonb);
\COPY temp (data) FROM '/home/kunlun/json/1.json';
\COPY temp (data) FROM '/home/kunlun/json/2.json';
A total of 10,000 JSON objects were imported. Query the id field within the JSON objects to check that the import succeeded.
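A minimal check, assuming each imported object carries a top-level id key:
select data->>'id' from temp limit 5;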
The size of the temp table is 787.52 MB.
Create a table using the RocksDB storage engine and check the storage space used.
create table temp_rdb(data jsonb) with (engine=rocksdb);
\COPY temp_rdb (data) FROM '/home/kunlun/json/1.json';
\COPY temp_rdb (data) FROM '/home/kunlun/json/2.json';
Data is compressed and stored under the RocksDB storage engine.
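To compare the on-disk footprint of the two tables, the standard PostgreSQL size helpers can be used (assuming they are available on the compute node):
select pg_size_pretty(pg_total_relation_size('temp'));
select pg_size_pretty(pg_total_relation_size('temp_rdb'));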
3.5 JSON Object Read and Write Performance Test
Since the tests are conducted on a virtual machine, the execution times mentioned in the article can only serve as a reference for comparison between different scenarios under the same environment. The configuration information of the test cluster virtual machines is as follows:
Node Type | Name | IP | CPU | Memory
---|---|---|---|---
Compute node | | 10.37.129.6 | 2C Intel i9 2.4 GHz | 6G
Storage primary node | shard_1 | 10.37.129.5 | 2C Intel i9 2.4 GHz | 6G
Storage primary node | shard_2 | 10.37.129.7 | 2C Intel i9 2.4 GHz | 6G
Create a new table test1 from the previous temp table and populate it with data.
create table test1 (id int primary key, content jsonb);
CREATE SEQUENCE
test_id_seq
INCREMENT 1
MINVALUE 1
MAXVALUE 9223372036854775807
START WITH 1
CACHE 1;
insert into test1 select nextval('test_id_seq'),data from temp;
analyze test1;
Use the pgbench tool to run read/write performance tests on the test1 table. Prepare custom.sql as follows:
\set id random(1,100000)
BEGIN;
select * from test1 where id = :id;
update test1 set content=jsonb_set(content,'{user,email}','"klustron@mail.com"') where id = :id;
END;
Parameters include 10 connections, running a stress test for 2 minutes.
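A typical invocation matching those parameters (host, port, and database name are placeholders for your cluster):
pgbench -n -c 10 -T 120 -f custom.sql -h <compute_node_ip> -p <port> <dbname>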
In this environment, the JSON object read/write workload averaged 1001 TPS with 10 connections.