Klustron Mirror Table Feature Use Cases

Note:

Unless specifically stated, any version number mentioned can be substituted with any released version number. For all released versions, please visit: Release_notes

Objective:

This document introduces the functionality and specific use cases of Mirror tables, demonstrating how to create Mirror tables and use them to enhance the performance of analytical queries. All operations are demonstrated using the command line.

01 Role of Mirror Tables

In real-world OLAP scenarios, some data tables possess the following characteristics:

  • The data volume is relatively small, such as department information tables or business code tables.
  • Data changes infrequently.
  • These tables are often joined with larger tables (e.g., order tables).

For tables with these characteristics, Klustron allows users to define them as Mirror tables (mirror images) to optimize query performance. Klustron ensures that a copy of each Mirror table in the cluster is present on every storage node.

When performing insert, update, or delete operations on Mirror tables, Klustron's compute nodes automatically execute the corresponding operations on every storage node's data within the same global transaction, thus maintaining ACID properties.

However, a frequently updated table is a poor candidate for a Mirror table: every update must be applied to the copies on all shards within a distributed transaction, which can severely degrade performance. This is a general guideline; users should weigh these trade-offs when deciding.
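For illustration, a small, rarely changing table such as a department table could be defined as a Mirror table. The `dept` table below is a hypothetical example; the `with (shard=all)` clause it uses is the Mirror table syntax shown later in this document:

```sql
-- Hypothetical small, rarely-changing table defined as a Mirror table.
-- "with (shard=all)" places a full copy on every storage node.
create table dept
(
   dept_code    INT NOT NULL,
   dept_name    VARCHAR(128) NOT NULL,
   PRIMARY KEY (dept_code)
) with (shard=all);

-- Any DML on a Mirror table is applied to the copy on every storage node
-- inside one global transaction, so all copies stay consistent.
insert into dept values (1, 'Sales');
update dept set dept_name = 'Sales Dept' where dept_code = 1;
```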

When a new Storage Shard is added, Klustron automatically replicates all Mirror tables in the system to the new shard. During this replication process, these Mirror tables remain readable and writable, although adding new Mirror tables will be blocked until the replication completes.

Joins between Mirror tables and sharded large tables can always be pushed down to the storage nodes for execution, ensuring that these joins are performed in parallel by multiple storage nodes for enhanced performance.

In OLAP applications, dimension tables typically fit the characteristics of Mirror tables and are suitable to be defined as such. In a star schema query in OLAP, the join between multiple Mirror tables and a fact table is effectively run in parallel across multiple storage nodes because the fact table is sharded across these nodes.

Furthermore, joins between two or more Mirror tables can always be pushed down to a specific storage node for execution, which can potentially improve query performance.
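As a sketch of this case, assuming two hypothetical Mirror tables `dept` and `region` already exist, the entire join can be sent to a single storage node rather than being evaluated on the compute node:

```sql
-- Both tables have a full copy on every storage node, so the whole join
-- can be pushed down to one storage node for execution.
select d.dept_name, r.region_name
from dept d, region r
where d.region_code = r.region_code;
```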

02 Test Environment

Since the testing is conducted on virtual machines, the execution times mentioned in this article should only be used as a reference for comparisons between different scenarios under similar conditions. The configuration information of the related virtual machines in the test cluster is as follows:

03 Case Study

First, log in to the compute node at 192.168.40.152 using the psql client to create a user and database:

psql -h 192.168.40.152 -p 47001 postgres
create user kunlun_test with password 'kunlun';
create database testdb owner kunlun_test;
grant all privileges on database testdb to kunlun_test;
\q

As user kunlun_test, log into testdb and create test tables for sales_order and product:

psql -h 192.168.40.152 -p 47001 -U kunlun_test testdb
create table sales_order
(
   order_number         INT NOT NULL AUTO_INCREMENT,
   customer_number      INT NOT NULL,
   product_code         INT NOT NULL,
   order_date           DATETIME NOT NULL,
   entry_date           DATETIME NOT NULL,
   order_amount         DECIMAL(18,2) NOT NULL
)  partition by range(order_date);
create table sales_order_p0 partition of sales_order
for values from ('2021-01-01') to ('2021-04-01');
create table sales_order_p1 partition of sales_order
for values from ('2021-04-01') to ('2021-07-01');
create table sales_order_p2 partition of sales_order
for values from ('2021-07-01') to ('2021-10-01');
create table sales_order_p3 partition of sales_order
for values from ('2021-10-01') to ('2022-01-01');

create table product
(
   product_code         INT NOT NULL AUTO_INCREMENT,
   product_name         VARCHAR(128) NOT NULL,
   product_category     VARCHAR(256) NOT NULL,
   PRIMARY KEY (product_code)
) ;

Next, create stored procedures and use them to load 10,000 records into the sales_order table and 1,000 records into the product table:

create or replace procedure generate_order_data()
AS $$
DECLARE
  v_customer_number integer;
  v_product_code integer;
  v_order_date date;
  v_amount integer;
  start_date date := to_date('2021-01-01','yyyy-mm-dd');
  i integer :=1;
BEGIN
	while i<=10000 loop
		v_customer_number := FLOOR(1+RANDOM()*6);
		v_product_code := FLOOR(1+RANDOM()*500);
		v_order_date := to_date('2021-01-01','yyyy-mm-dd') + CAST(FLOOR(RANDOM()*365) AS INT);
		v_amount := FLOOR(1000+RANDOM()*9000);
		insert into sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
		commit;
		i := i+1;
	end loop;
END; $$
LANGUAGE plpgsql;

create or replace procedure generate_product_data()
AS $$
DECLARE
  v_product_name varchar(128);
  i integer :=1;
BEGIN
	while i<=1000 loop
		case mod(i,3)
			when 0 then
				v_product_name := 'Hard Disk '||i;
				INSERT INTO product VALUES (i,v_product_name,'Storage');
			when 1 then
				v_product_name := 'LCD '||i;
				INSERT INTO product VALUES (i,v_product_name,'Monitor');
			when 2 then
				v_product_name := 'Paper '||i;
				INSERT INTO product VALUES (i,v_product_name,'Paper');
		end case;
		commit;
		i := i+1;
	end loop;
END; $$
LANGUAGE plpgsql;
set statement_timeout = 0;
call generate_order_data();
call generate_product_data();

Create indexes and collect statistical data:

create index sales_order_idx1 on sales_order(product_code,order_amount);
create index product_idx1 on product(product_name,product_code);
analyze sales_order;
analyze product;

Check the distribution of the product and sales_order tables across the storage nodes:

select * from pg_shard; 
\d+ product

It is evident that the product table is located on shard id 6, i.e., on storage node shard_1, while the partitions of the sales_order table are distributed across both shard_1 and shard_2:

select relname table_name ,reltuples num_rows, name shard_name from pg_class t1,pg_shard t2 where t1.relshardid
= t2.id and t1.reltype<>0 and t1.relname  like 'sales%';

Review the execution plan for the SQL query retrieving total sales data for all hard disks:

explain select sum(order_amount) from sales_order t1, product t2 where t1.product_code = t2.product_code  and
product_name like 'Hard Disk%' ;

Initially, the compute node optimizer chooses a Nested Loop join method, with the product table as the driving table. The compute node sends the following Remote SQL to shard_1 to retrieve records meeting the criteria:

SELECT `t2`.`product_code` FROM  `testdb_$$_public`.`product`  as t2 WHERE (`t2`.`product_name`  like
"Hard Disk%")

Then, the product codes from the driving table are used to form the following Remote SQL:

SELECT `t1`.`order_amount`,`t1`.`product_code` FROM  `testdb_$$_public`.`sales_order_p0`  as t1 WHERE (?= `t1`.`product_code`)

This SQL is sent to the shards where the four partitions of the sales_order table are located (shard_1 and shard_2) for execution, and the order_amount values returned from each partition are aggregated.

The SQL execution time is 4.59 seconds:

testdb=> \timing on
testdb=> select sum(order_amount) from sales_order t1, product t2 where t1.product_code = t2.product_code  and product_name like 'Hard Disk%' ;
     sum     
-------------
 17975514.00
(1 row)

Time: 4590.807 ms (00:04.591)

Next, recreate the product table as a Mirror table and reload its data for retesting.

The clause with (shard=all) indicates that the table is created as a Mirror table, meaning it will be stored on all storage nodes, as the name suggests.

drop table product;
create table product
(
   product_code         INT NOT NULL AUTO_INCREMENT,
   product_name         VARCHAR(128) NOT NULL,
   product_category     VARCHAR(256) NOT NULL,
   PRIMARY KEY (product_code)
) with (shard=all);
call generate_product_data();
create index product_idx1 on product(product_name,product_code);
analyze product;

The product table's corresponding shard id is a large value, indicating it is a Mirror table.
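This can be checked from the compute node's catalog, using the same `pg_class.relshardid` column queried earlier (the exact value is installation-specific):

```sql
-- Mirror tables carry a special (large) shard id in pg_class rather than
-- the id of any single shard listed in pg_shard.
select relname, relshardid from pg_class where relname = 'product';
```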

Recheck the execution plan for the test SQL:

explain select sum(order_amount) from sales_order t1, product t2 where t1.product_code = t2.product_code  and
product_name like 'Hard Disk%' ;

The Remote SQL below is pushed down to the storage nodes hosting the partitions of sales_order and executed there:

SELECT sum(var$$0) FROM ( `testdb_$$_public`.`product`  as t2 join lateral (SELECT `t1`.`order_amount` as
var$$0,`t1`.`product_code` as var$$1 FROM  `testdb_$$_public`.`sales_order_p0`  as t1 WHERE (`t2`.`product_code` = `t1`.`product_code`)) tmp$$2 on (`t2`.`product_name` like "Hard Disk%")) WHERE (var$$1 = `t2`.`product_code`) 

Finally, the results are aggregated at the compute node (Finalize Aggregate). The query now runs in 1.46 seconds, down from the previous 4.59 seconds. This demonstrates that the strategic use of Mirror tables can substantially enhance performance in specific analytical scenarios.

testdb=> \timing on
testdb=> select sum(order_amount) from sales_order t1, product t2 where t1.product_code = t2.product_code  and product_name like 'Hard Disk%' ;
     sum     
-------------
 17975514.00
(1 row)

Time: 1459.327 ms (00:01.459)

END