Klustron Shard Fault Isolation Test
Klustron Shard Fault Isolation Test
Note:
Unless specified otherwise, any version number mentioned in the document can be replaced by any released version number. For a list of all released versions, refer to: http://doc.klustron.com/zh/Release_notes.html
Content:
Klustron's financial-grade high reliability is composed of a complete technical system of capabilities and technologies, solid as a rock, forming Klustron's impregnable financial-grade high reliability technical system. This includes fault isolation capabilities. This article mainly tests that when a primary and backup shard in the cluster completely fails, the Klustron cluster can still partially operate normally. Data stored on the failed shard cannot be accessed, but data on other shards can be read, written, and committed normally.
1. Test Environment
The configuration information for the test cluster machines is as follows:
Name | Node Type | IP | Port | Shard_ID |
---|---|---|---|---|
comp3 | Compute Node | 192.168.40.152 | 47001 | N/A |
shard_1 | Primary Storage Node | 192.168.40.151 | 57003 | 6 |
Replica Storage Node | 192.168.40.152 | |||
Replica Storage Node | 192.168.40.153 | |||
shard_2 | Primary Storage Node | 192.168.40.152 | 57005 | 7 |
Replica Storage Node | 192.168.40.153 | |||
Replica Storage Node | 192.168.40.151 | |||
shard_3 | Primary Storage Node | 192.168.40.153 | 57007 | 5 |
Replica Storage Node | 192.168.40.152 | |||
Replica Storage Node | 192.168.40.151 |
2. Environment Preparation
Create a test database and user:
psql -h 192.168.40.152 -p 47001 postgres
create user kunlun_test with password 'kunlun';
create database testdb owner kunlun_test;
grant all privileges on database testdb to kunlun_test;
Prepare the test table sales_order, which is partitioned by month. Data for January 2023 is located on shard_3, February's data on shard_1, and March's data on shard_2.
psql -h 192.168.40.152 -p 47001 -U kunlun_test testdb
CREATE TABLE sales_order
(
order_number INT NOT NULL,
customer_number INT NOT NULL,
product_code INT NOT NULL,
order_date DATETIME NOT NULL,
entry_date DATETIME NOT NULL,
order_amount DECIMAL(18,2) NOT NULL,
PRIMARY KEY (order_number,order_date)
) partition by range(order_date);
CREATE TABLE sales_order_202301 PARTITION OF sales_order
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01') with (Shard=5);
CREATE TABLE sales_order_202302 PARTITION OF sales_order
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01') with (Shard=6);
CREATE TABLE sales_order_202303 PARTITION OF sales_order
FOR VALUES FROM ('2023-03-01') TO ('2023-04-01') with (Shard=7);
Populate the tables with 10,000 entries:
create or replace procedure generate_order_data()
AS $$
DECLARE
v_customer_number integer;
v_product_code integer;
v_order_date date;
v_amount integer;
start_date date := to_date('2023-01-01','yyyy-mm-dd');
i integer :=1;
BEGIN
while i<=10000 loop
v_customer_number := FLOOR(1+RANDOM()*6);
v_product_code := FLOOR(1+RANDOM()*1000);
v_order_date := start_date + CAST(FLOOR(RANDOM()*90) AS INT);
v_amount := FLOOR(1000+RANDOM()*9000);
INSERT INTO sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
commit;
i := i+1;
end loop;
END; $$
LANGUAGE plpgsql;
call generate_order_data();
analyze sales_order;
testdb=> select count(*) from sales_order;
count
-------
10000
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
count
-------
3417
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
count
-------
3086
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3497
(1 row)
3. Simulating a Complete Shard Failure
At this point, simulate a failure where all primary and backup nodes of shard_1 fail. Since Klustron has an automatic failover feature, first move the data file directory of shard_1. Execute on all primary and backup nodes of shard_1:
cd /kunlun/storage_datadir
mv 57003 57003_bak
Then, on each node of shard_1, use the kill -9 command to stop the mysqld and mysqld_safe processes running on port 57003.
4. Fault Isolation Test
Returning to the compute node to query data for February 2023, an exception is thrown:
testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
ERROR: Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)
Queries for January and March data return results normally:
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
count
-------
3417
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3497
(1 row)
Similarly, data for January and March can still be inserted normally:
testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
count
-------
3417
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3497
(1 row)
testdb=> insert into sales_order values(10001,1,1000,'2023-01-31','2023-01-31',1800);
INSERT 0 1
testdb=> insert into sales_order values(10002,1,1000,'2023-03-31','2023-03-31',2000);
INSERT 0 1
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
count
-------
3418
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3498
(1 row)
testdb=> commit;
However, if the transaction involves data from the failed shard, an error will be reported:
testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3498
(1 row)
testdb=> update sales_order set order_amount=1000 where order_date='2023-03-31';
UPDATE 113
testdb=> insert into sales_order values(10003,1,1000,'2023-02-14','2023-02-14',1800);
ERROR: Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)
testdb=> insert into sales_order values(10005,2,1000,'2023-03-30','2023-03-30',2000);
ERROR: current transaction is aborted, commands ignored until end of transaction block
testdb=> rollback;
ROLLBACK
These tests show that while data stored on the failed shard can no longer provide any service, data on other normal shards can still be read, written, and transacted normally.