Skip to main content

Klustron Shard Fault Isolation Test

KlustronAbout 3 min

Klustron Shard Fault Isolation Test

Note:

Unless specified otherwise, any version number mentioned in the document can be replaced by any released version number. For a list of all released versions, refer to: http://doc.klustron.com/zh/Release_notes.html

Content:

Klustron's financial-grade high reliability is composed of a complete technical system of capabilities and technologies, solid as a rock, forming Klustron's impregnable financial-grade high reliability technical system. This includes fault isolation capabilities. This article mainly tests that when a primary and backup shard in the cluster completely fails, the Klustron cluster can still partially operate normally. Data stored on the failed shard cannot be accessed, but data on other shards can be read, written, and committed normally.

1. Test Environment

The configuration information for the test cluster machines is as follows:

NameNode TypeIPPortShard_ID
comp3Compute Node192.168.40.15247001N/A
shard_1Primary Storage Node192.168.40.151570036
Replica Storage Node192.168.40.152
Replica Storage Node192.168.40.153
shard_2Primary Storage Node192.168.40.152570057
Replica Storage Node192.168.40.153
Replica Storage Node192.168.40.151
shard_3Primary Storage Node192.168.40.153570075
Replica Storage Node192.168.40.152
Replica Storage Node192.168.40.151

2. Environment Preparation

Create a test database and user:

psql -h 192.168.40.152 -p 47001 postgres
create user kunlun_test with password 'kunlun';
create database testdb owner kunlun_test;
grant all privileges on database testdb to kunlun_test;

Prepare the test table sales_order, which is partitioned by month. Data for January 2023 is located on shard_3, February's data on shard_1, and March's data on shard_2.

psql -h 192.168.40.152 -p 47001 -U kunlun_test testdb

CREATE TABLE sales_order
(
   order_number         INT NOT NULL,
   customer_number      INT NOT NULL,
   product_code         INT NOT NULL,
   order_date           DATETIME NOT NULL,
   entry_date           DATETIME NOT NULL,
   order_amount         DECIMAL(18,2) NOT NULL,
   PRIMARY KEY (order_number,order_date)
) partition by range(order_date);
CREATE TABLE sales_order_202301 PARTITION OF sales_order
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01') with (Shard=5);
CREATE TABLE sales_order_202302 PARTITION OF sales_order
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01') with (Shard=6);
CREATE TABLE sales_order_202303 PARTITION OF sales_order
FOR VALUES FROM ('2023-03-01') TO ('2023-04-01') with (Shard=7);

Populate the tables with 10,000 entries:

create or replace procedure generate_order_data()
AS $$
DECLARE
  v_customer_number integer;
  v_product_code integer;
  v_order_date date;
  v_amount integer;
  start_date date := to_date('2023-01-01','yyyy-mm-dd');
  i integer :=1;
BEGIN
	while i<=10000 loop
		v_customer_number := FLOOR(1+RANDOM()*6);
		v_product_code := FLOOR(1+RANDOM()*1000);
		v_order_date := start_date + CAST(FLOOR(RANDOM()*90) AS INT);
		v_amount := FLOOR(1000+RANDOM()*9000);
		INSERT INTO sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
		commit;
		i := i+1;
	end loop;
END; $$
LANGUAGE plpgsql;
call generate_order_data();
analyze sales_order;
testdb=> select count(*) from sales_order;
 count 
-------
 10000
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3417
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
 count 
-------
  3086
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3497
(1 row)

3. Simulating a Complete Shard Failure

At this point, simulate a failure where all primary and backup nodes of shard_1 fail. Since Klustron has an automatic failover feature, first move the data file directory of shard_1. Execute on all primary and backup nodes of shard_1:

cd /kunlun/storage_datadir
mv 57003 57003_bak

Then, on each node of shard_1, use the kill -9 command to stop the mysqld and mysqld_safe processes running on port 57003.

4. Fault Isolation Test

Returning to the compute node to query data for February 2023, an exception is thrown:

testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
ERROR:  Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)

Queries for January and March data return results normally:

testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3417
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3497
(1 row)

Similarly, data for January and March can still be inserted normally:

testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3417
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3497
(1 row)

testdb=> insert into sales_order values(10001,1,1000,'2023-01-31','2023-01-31',1800);
INSERT 0 1
testdb=> insert into sales_order values(10002,1,1000,'2023-03-31','2023-03-31',2000);
INSERT 0 1
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3418
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3498
(1 row)

testdb=> commit;

However, if the transaction involves data from the failed shard, an error will be reported:

testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3498
(1 row)

testdb=> update sales_order set order_amount=1000 where order_date='2023-03-31';
UPDATE 113
testdb=> insert into sales_order values(10003,1,1000,'2023-02-14','2023-02-14',1800);
ERROR:  Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)
testdb=> insert into sales_order values(10005,2,1000,'2023-03-30','2023-03-30',2000);
ERROR:  current transaction is aborted, commands ignored until end of transaction block
testdb=> rollback;
ROLLBACK

These tests show that while data stored on the failed shard can no longer provide any service, data on other normal shards can still be read, written, and transacted normally.

END