Exploring the Klustron Table Group Feature

Note:

Unless specified otherwise, any version number mentioned in the document can be replaced by any released version number. For a list of all released versions, refer to: Release_notes

Content:

Table group is a distinctive feature of Klustron as a distributed database. A table group is an attribute of a table that affects the specific distribution of the table's data across shards. This article introduces the purpose of table groups, the syntax, and specific examples, demonstrating the advantages of using table groups. Finally, the process of completely migrating a table group to another shard will be demonstrated through XPanel.

01 Introduction to Table Group

In real-world scenarios, some tables that are related often need to join or are frequently updated within the same transaction. Therefore, Klustron allows users to place these related tables in the same shard to achieve better query performance and transaction processing performance.

When performing joins among these related tables, the join operation can be pushed down to the storage shard for execution, thus obtaining better query performance; when updating related tables within the same transaction, it can avoid two-phase commits, thereby achieving better transaction processing performance.

A single table in a table group cannot be migrated independently to another shard; if migration is necessary, it must involve the entire table group.

02 Test Environment

Since the tests are conducted on virtual machines, the execution times mentioned in this article can only serve as a reference for comparison between different scenarios under the same environment. The configuration information for the test cluster's virtual machines is as follows:

Node Type	Name	IP	CPU	Memory
Compute Node		192.168.40.152	2C Intel i9 2.4 GHz	6G
Storage Primary Node	shard_1	192.168.40.151	2C Intel i9 2.4 GHz	6G
Storage Primary Node	shard_2	192.168.40.153	2C Intel i9 2.4 GHz	6G

03 Table Group Syntax

The syntax for creating a table group is as follows:

CREATE TABLEGROUP tablegroup_name
    [ OWNER  { new_owner | CURRENT_USER | SESSION_USER } ]
    [ WITH (SHARD = shardid) ]

When creating a table, you can specify the table group name.

CREATE TABLE table_name … WITH (TABLEGROUP tablegroup_name);

You can also specify the table group for a table after its creation using the following statement:

ALTER TABLE table_name SET (TABLEGROUP tablegroup_name);

04 Practical Example

First, log into the compute node at 192.168.40.152 using the psql client to create a user and database and check the shard information in the cluster.

psql -h 192.168.40.152 -p 47001 postgres
create user kunlun_test with password 'kunlun';
create database testdb owner kunlun_test;
grant all privileges on database testdb to kunlun_test;
alter user kunlun_test superuser;

Specify that customer and product are stored on shard_2, and sales_order is stored on shard_1.

psql -h 192.168.40.152 -p 47001 -U kunlun_test testdb
CREATE TABLE customer
(
   customer_number      INT NOT NULL,
   customer_name        VARCHAR(128) NOT NULL,
   customer_street_address VARCHAR(256) NOT NULL,
   customer_zip_code    INT NOT NULL,
   customer_city        VARCHAR(32) NOT NULL,
   customer_state       VARCHAR(32) NOT NULL,
   PRIMARY KEY (customer_number)
) 
with(shard=1);
CREATE TABLE product
(
   product_code         INT NOT NULL,
   product_name         VARCHAR(128) NOT NULL,
   product_category     VARCHAR(256) NOT NULL,
   PRIMARY KEY (product_code)
)  
with (shard=1);

CREATE TABLE sales_order
(
   order_number         INT NOT NULL,
   customer_number      INT NOT NULL,
   product_code         INT NOT NULL,
   order_date           DATETIME NOT NULL,
   entry_date           DATETIME NOT NULL,
   order_amount         DECIMAL(18,2) NOT NULL,
   PRIMARY KEY (order_number)
) with (shard=2);

Insert test data:

INSERT INTO customer
( customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
 )
VALUES
  (1,'Big Customers', '7500 Louise Dr.', '17050','Mechanicsburg', 'PA')
, ( 2,'Small Stores', '2500 Woodland St.', '17055','Pittsburgh', 'PA')
, (3,'Medium Retailers', '1111 Ritter Rd.', '17055',  'Pittsburgh', 'PA')
,  (4,'Good Companies', '9500 Scott St.', '17050','Mechanicsburg', 'PA')
, (5,'Wonderful Shops', '3333 Rossmoyne Rd.', '17050','Mechanicsburg', 'PA')
, (6,'Loyal Clients', '7070 Ritter Rd.', '17055', 'Pittsburgh', 'PA')
;

Use the following stored procedures to load test data for product and sales_order, loading 1000 rows for product and 100000 rows for sales_order.

create or replace procedure generate_order_data()
AS $$
DECLARE
  v_customer_number integer;
  v_product_code integer;
  v_order_date date;
  v_amount integer;
  start_date date := to_date('2022-01-01','yyyy-mm-dd');
  i integer :=1;
BEGIN
	while i<=100000 loop
		v_customer_number := FLOOR(1+RANDOM()*6);
		v_product_code := FLOOR(1+RANDOM()*1000);
		v_order_date := to_date('2022-01-01','yyyy-mm-dd') + CAST(FLOOR(RANDOM()*365) AS INT);
		v_amount := FLOOR(1000+RANDOM()*9000);
		INSERT INTO sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
		commit;
		i := i+1;
	end loop;
END; $$
LANGUAGE plpgsql;

create or replace procedure generate_product_data()
AS $$
DECLARE
  v_product_name varchar(128);
  i integer :=1;
BEGIN
	while i<=1000 loop
               case mod(i,3)
                   when 0 then
		        v_product_name := 'Hard Disk '||i;
		        INSERT INTO product VALUES (i,v_product_name,'Storage');
	         when 1 then
                          v_product_name := 'LCD '||i;
		         INSERT INTO product VALUES (i,v_product_name,'Monitor');
	         when 2 then
		      v_product_name := 'Paper'||i;
		       INSERT INTO product VALUES (i,v_product_name,'Paper');
		end case;
	      commit;
		i := i+1;
	end loop;
END; $$
LANGUAGE plpgsql;

After data loading, create an index for sales_order and collect statistics for all tables:

set statement_timeout=0;
call generate_product_data();
call generate_order_data();
create index sales_order_idx1 on sales_order(order_date);
analyze product;
analyze sales_order;
analyze customer;
\timing on

The business wants to see how much Big Customers spent on each product category in the first half of 2022.

Here's the specific SQL:

select product_category,sum(order_amount ) from sales_order t1, product t2 ,customer t3 where 
t1.product_code = t2.product_code and 
t1.customer_number = t3.customer_number and
t1.order_date between to_date('2022-01-01','yyyy-mm-dd') and to_date('2022-06-30','yyyy-mm-dd') and customer_name='Big Customers' group by product_category;

Review the execution plan:

It shows data being fetched from both Shard 1 and Shard 2, joined on the compute node.

The execution time was 168 ms.

Now create a table group, sales_sum, on Shard 2, and place the sales_order table in it.

CREATE TABLEGROUP sales_sum WITH (SHARD=2);
alter table sales_order set (tablegroup=sales_sum);

Recreate the product and customer tables, specifying the table group sales_tg at creation.

drop table customer;
alter table product rename to product_old;
CREATE TABLE customer
(
   customer_number      INT NOT NULL,
   customer_name        VARCHAR(128) NOT NULL,
   customer_street_address VARCHAR(256) NOT NULL,
   customer_zip_code    INT NOT NULL,
   customer_city        VARCHAR(32) NOT NULL,
   customer_state       VARCHAR(32) NOT NULL,
   PRIMARY KEY (customer_number)
) 
with (tablegroup=sales_sum);

CREATE TABLE product
(
   product_code         INT NOT NULL,
   product_name         VARCHAR(128) NOT NULL,
   product_category     VARCHAR(256) NOT NULL,
   PRIMARY KEY (product_code)
)  
with (tablegroup=sales_sum);

Reload data for customer and product and collect statistics:

INSERT INTO customer
( customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
 )
VALUES
  (1,'Big Customers', '7500 Louise Dr.', '17050','Mechanicsburg', 'PA')
, ( 2,'Small Stores', '2500 Woodland St.', '17055','Pittsburgh', 'PA')
, (3,'Medium Retailers', '1111 Ritter Rd.', '17055',  'Pittsburgh', 'PA')
,  (4,'Good Companies', '9500 Scott St.', '17050','Mechanicsburg', 'PA')
, (5,'Wonderful Shops', '3333 Rossmoyne Rd.', '17050','Mechanicsburg', 'PA')
, (6,'Loyal Clients', '7070 Ritter Rd.', '17055', 'Pittsburgh', 'PA');

insert into product select * from product_old;
analyze product;
analyze customer;

Recheck the SQL execution plan; now everything is executed on Shard 2.