
Klustron Multi-Data Center High Availability Demo


Note:

Unless otherwise stated, any version number mentioned in this document can be replaced with any released version number. For all released versions, see: Release_notes

Objective:

After a Klustron cluster is deployed in multi-data center mode across several local data centers, Klustron's multi-data center high availability feature ensures that, if one data center fails entirely, the cluster's primary nodes automatically switch to another data center. Recovery is automatic, with no data loss or corruption, an RPO of 0, and an RTO of under 30 seconds.

01 Requirement Background

Klustron's fullsync strong synchronous replication and fullsync HA ensure that, if the primary node of a storage cluster (storage shard) fails, Klustron automatically and promptly detects the failure and elects a new primary node, so data read and write services continue without loss or corruption. This level of reliability meets the needs of most application scenarios.

However, for financial applications demanding the highest level of reliability, this might not suffice. If an entire data center (IDC) goes down (due to a power failure, fire, flood, etc.) and all nodes of a Klustron cluster are deployed in that IDC, user data could still be lost and database services would halt.

To achieve IDC-level high availability, the Klustron team has developed IDC-level disaster recovery technology. Starting from version 1.2, Klustron supports high availability across data centers.

02 Feature Overview

Klustron's IDC disaster recovery technology automatically detects failure of the primary IDC in the main city and switches the primary node of each Klustron storage shard to a standby primary node in a backup IDC in the same city; this is referred to as promoting the backup IDC to primary IDC. If all IDCs in the main city fail, the DBA receives alerts and can manually switch each shard's primary node to a standby primary node in a backup IDC in a backup city, using the Klustron XPanel GUI or the cluster_mgr API; promoting the backup city's IDC to primary IDC therefore requires some manual intervention.

Both operations essentially switch the primary nodes of all storage shards of a Klustron cluster, as well as the primary node of the metadata cluster, to candidate primary nodes in another data center.
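
For the manual case, the switch can also be driven programmatically through the cluster_mgr HTTP API instead of the XPanel GUI. The sketch below is only illustrative: the endpoint path, job name, and parameter fields are hypothetical placeholders rather than the documented cluster_mgr API, and the host/port must be replaced with your own deployment's values.

import json
import requests  # third-party HTTP client

# Hypothetical cluster_mgr endpoint; replace with your deployment's cluster_mgr address.
CLUSTER_MGR_URL = "http://192.168.56.112:58000/HttpService/Emit"  # placeholder path

# Illustrative job payload; take the real job_type and field names from the
# cluster_mgr API documentation for your Klustron version.
payload = {
    "version": "1.0",
    "job_type": "manual_switchover",   # hypothetical job name
    "paras": {
        "cluster_name": "IDC_Cluster",
        "target_idc": "IDC2",          # IDC whose candidate primaries should take over
    },
}

resp = requests.post(CLUSTER_MGR_URL, data=json.dumps(payload), timeout=30)
print(resp.status_code, resp.text)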

03 Intra-City IDC HA Deployment Architecture

An intra-city IDC HA cluster achieves intra-city IDC-level high availability by being deployed across two local data centers; the topology of each shard of a Klustron cluster is shown in the architecture document linked below. Users can deploy any number of compute nodes (Klustron-server) and clusterManager nodes in each IDC as needed. All compute nodes continuously sync user metadata updates from the metadata cluster.

For more information, see: Klustron_idc_high_availability_architecture

04 Configuring Data Center (IDC) Management

Assuming familiarity with Klustron database cluster creation, the following sections walk through: configuring IDCs, i.e. adding IDCs in data center management and then binding machines to IDCs in the computer management center; creating an intra-city IDC HA cluster and checking its operational status; performing a primary/backup IDC switch drill as part of routine maintenance of the intra-city IDC HA cluster; and testing automatic takeover by the backup IDC when the primary IDC fails. Each process is explained step by step.

All tests in this document are conducted through the XPanel console and PostgreSQL client connections to the database cluster. The XPanel service is installed on the server with IP 192.168.56.112.
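
Before starting, you can optionally confirm that a PostgreSQL client can reach a compute node of the cluster. A minimal sketch, assuming the compute node listens on 192.168.56.112:47001 with user abc and password abc (the same connection parameters used later by the pyprod.py script):

import psycopg2

# Connection parameters match those used by pyprod.py later in this document.
conn = psycopg2.connect(database='postgres', user='abc',
                        password='abc', host='192.168.56.112', port='47001')
cur = conn.cursor()
cur.execute("select version();")   # simple round trip to verify connectivity
print(cur.fetchone()[0])
cur.close()
conn.close()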

Open a browser on a device that can access 192.168.56.112 and enter the address: http://192.168.56.112:18080/KunlunXPanel/#/login?redirect=%2Fdashboard

After logging in, the homepage is displayed as follows:

Next, add IDCs in Data Center (IDC) Management.

4.1 Click "Data Center (IDC) Management", "Data Center List", and then click the "Add IDC" button on the data center list interface.

4.2 Add a new IDC, specify the IDC name "IDC1", city "Guangdong Province/Shenzhen City", and then click "Confirm".

4.3 The data center "IDC1" is added.

4.4 Repeat steps 4.1 and 4.2 to add a new data center "IDC2"; after adding, check the IDC configuration.

Next, bind IDCs in the computer management center.

4.5 Click "Computer Management", "Computer List", and then click the "Update IDC" button on the computer list interface.

4.6 Select the appropriate computer resources to bind to the IDC1 data center.

4.7 Click "Confirm" and check that the IDC binding task completes.

4.8 Use the same method to bind computer resources to the data center IDC2; after binding, check the IDC binding situation.

05 Creating an Intra-City IDC HA Cluster

5.1 Click "Cluster Management" and then "Cluster List". On the cluster list interface, click the "Create" button.

5.2 Create a new cluster, specify the business name as "IDC_Cluster", and select the cluster type as "IDC Cluster".

5.3 For the type of purchase in the IDC cluster, select "Intra-City Purchase".

5.4 In the IDC cluster (Intra-City), select the city where the IDCs are located, "Shenzhen City", and choose the primary and backup IDCs.

5.5 Select the compute nodes for the IDC cluster (Intra-City).

5.6 Set up the new cluster configuration using default values.

5.7 Overview of the new cluster.

5.8 Click "Confirm" and check that the cluster creation task completes.

5.9 After the IDC cluster configuration, check the operational status of the "IDC_Cluster".

06 Intra-City IDC HA Cluster Primary-Backup IDC Switch Test

6.1 Prepare test data by creating test tables in the database and inserting test data.

postgres=# create table prod_part (id int primary key, name char(8)) partition by hash(id);
postgres=# create table prod_part_p1 partition of prod_part for values with (modulus 6, remainder 0);
postgres=# create table prod_part_p2 partition of prod_part for values with (modulus 6, remainder 1);
postgres=# create table prod_part_p3 partition of prod_part for values with (modulus 6, remainder 2);
postgres=# create table prod_part_p4 partition of prod_part for values with (modulus 6, remainder 3);
postgres=# create table prod_part_p5 partition of prod_part for values with (modulus 6, remainder 4);
postgres=# create table prod_part_p6 partition of prod_part for values with (modulus 6, remainder 5);
postgres=# insert into prod_part select i,'text'||i from generate_series(1,300) i;
postgres=# analyze prod_part;
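
Optionally, verify from a client that the 300 rows were inserted into the partitioned table. A small check, reusing the psycopg2 connection parameters from pyprod.py:

import psycopg2

conn = psycopg2.connect(database='postgres', user='abc',
                        password='abc', host='192.168.56.112', port='47001')
cur = conn.cursor()
cur.execute("select count(*) from prod_part;")
print("rows in prod_part:", cur.fetchone()[0])   # expected: 300
cur.execute("select * from prod_part where id = 1;")
print("sample row:", cur.fetchone())
cur.close()
conn.close()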

6.2 Prepare a Python script, pyprod.py, to continuously operate on the database while switching between primary and backup IDCs in the IDC HA cluster. The content of the pyprod.py script is as follows:

import psycopg2.extras
from psycopg2 import DatabaseError
import time
import datetime

# Connect to a Klustron compute node (the host and port used throughout this demo).
conn = psycopg2.connect(database='postgres', user='abc',
                        password='abc', host='192.168.56.112', port='47001')

select_sql = ''' select * from prod_part where id=%s; '''
i = 1
try:
    # Query one row per second; the id wraps back to 1 after reaching 1000.
    while (i <= 1000 ):
        cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
        cursor.execute(select_sql,[i])
        res = cursor.fetchall()
        print(dict(res[0]))
        current_datetime = datetime.datetime.now()
        print("Current date and time: ", current_datetime)
        if (i == 1000 ):
            i = 1
        else :
            i = i+1
        cursor.close()
        conn.commit()
        time.sleep(1)
except (Exception, DatabaseError) as e:
    # The connection breaks during the IDC switch: report the error, wait for
    # the operator, then reconnect and resume the loop from the current id.
    print(e)
    input('Press any key and Enter to continue ~!')
    conn = psycopg2.connect(database='postgres', user='abc',
                            password='abc', host='192.168.56.112', port='47001')
    select_sql = ''' select * from prod_part where id=%s; '''
    while (i <= 1000):
        cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
        cursor.execute(select_sql, [i])
        res = cursor.fetchall()
        print(dict(res[0]))
        current_datetime = datetime.datetime.now()
        print("Current date and time: ", current_datetime)
        if (i == 1000):
            i = 1
        else:
            i = i + 1
        cursor.close()
        conn.commit()
        time.sleep(1)
finally:
    conn.close()

6.3 Run the pyprod.py script to continuously operate on the database.

[kunlun@kunlun1 scripts]$ python pyprod.py

6.4 Check the current primary and backup IDCs in the "Cluster List Information": the primary node is "192.168.56.112" and the backup node is "192.168.56.114". Then click the "Settings" button.

6.5 In the "Cluster Settings" interface, click the "IDC Switch" button to perform the primary-backup IDC switch.

6.6 On the IDC switch interface, select the IDC named "IDC2".

6.7 Click "Confirm" and check that the IDC switch task completes.

6.8 After the primary-backup IDC switch, view the "Cluster List Information"; the primary node changes to "192.168.56.114" and the backup node to "192.168.56.112".

6.9 After the primary-backup IDC switch, the application continues to access and operate on the data tables, maintaining uninterrupted database operations.
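
If you also want to quantify the interruption window during the switch (the objective states RTO < 30 seconds), a small probe such as the sketch below can run alongside pyprod.py. It assumes the same connection parameters as pyprod.py and simply retries a trivial query once per second, reporting how long the service was unavailable after the first failure:

import time
import datetime
import psycopg2

# Same connection parameters as pyprod.py (assumed compute node address).
PARAMS = dict(database='postgres', user='abc', password='abc',
              host='192.168.56.112', port='47001')

first_failure = None
while True:
    try:
        conn = psycopg2.connect(connect_timeout=3, **PARAMS)
        cur = conn.cursor()
        cur.execute("select 1;")
        cur.fetchone()
        cur.close()
        conn.close()
        if first_failure is not None:
            outage = (datetime.datetime.now() - first_failure).total_seconds()
            print("service restored, observed interruption: %.1f seconds" % outage)
            break
        print(datetime.datetime.now(), "ok")
    except Exception as exc:
        if first_failure is None:
            first_failure = datetime.datetime.now()
        print(datetime.datetime.now(), "unavailable:", exc)
    time.sleep(1)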

07 Intra-City IDC HA Cluster Primary IDC Failure and Backup IDC Takeover Test

7.1 Review the current primary and backup IDCs in the "Cluster List Information": the primary node is "192.168.56.114" and the backup node is "192.168.56.112".

7.2 Before simulating a primary IDC failure, run the pyprod.py script to continuously operate on the database.

[kunlun@kunlun1 scripts]$ python pyprod.py

7.3 Simulate a primary IDC failure by shutting down the server at the primary node "192.168.56.114".

7.4 After shutting down the server at "192.168.56.114", the former backup node "192.168.56.112" automatically switches to become the primary node.

7.5 After the primary IDC fails and the backup IDC takes over, the application continues to access and operate on the data tables, maintaining uninterrupted database operations.

7.6 Restore the server at the node "192.168.56.114".

7.7 After restoring the server at "192.168.56.114", it automatically rejoins the IDC HA cluster as a backup node, completing the intra-city IDC cluster's primary IDC failure and backup IDC takeover drill.

With this, the setup and configuration of the intra-city IDC HA cluster, as well as the primary-backup switch test, are complete.

END