

Preface

Klustron (Klustron Database) has a complete disaster recovery and error handling mechanism. Through its distributed-transaction two-phase commit protocol, together with the Fullsync and Fullsync HA mechanisms, it ensures that while a cluster is running, the failure or restart of any computing node, storage node, cluster_mgr or other component, as well as any network partition or disconnection, leaves the user data and metadata managed by the cluster consistent and complete. No data update that a user has committed will be lost, no transaction will end up partially committed and partially rolled back, and user metadata will never become inconsistent with user data.

Klustron implements reliable distributed transaction processing based on the two-phase commit protocol, ensuring that cluster data remains fully consistent when any number of nodes fail and guaranteeing the ACID properties of all transactions.

For more details, please refer to (Two-phase Commit Mechanism and Principle of Distributed Transaction Processing) and (Error Handling for Two-phase Commit in Distributed Transactions); the Fullsync mechanism of Klustron-storage is introduced in (Klustron Storage Cluster Fullsync Mechanism).

This article introduces the Fullsync high availability mechanism in detail, hereinafter referred to as Fullsync HA.

The Fullsync mechanism of Klustron ensures that, after completing any transaction commit and before returning a success confirmation to the client, a Klustron-storage storage shard must receive ACKs from fullsync_consistency_level slave nodes confirming that they have received the transaction's binlog.

The computing nodes take on the obligation to guarantee a transaction's durability only after they receive a successful commit confirmation from the storage nodes, which is then returned to the client; otherwise they have no such obligation.

If the master node and up to fullsync_consistency_level-1 slave nodes fail at the same time, there is still at least one slave node (when fullsync_consistency_level > 1) that contains the binlogs of all transactions that have been confirmed as committed, so the updates of these transactions will not be lost.
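As a quick sanity check, and assuming that fullsync_consistency_level is exposed as a server variable on the Klustron-storage master node (the exact way it is exposed may differ between versions), it can be inspected like this:

```sql
-- Assumption: fullsync_consistency_level is readable as a global variable on the
-- Klustron-storage master node; consult the documentation of your version for
-- the exact name and scope.
SHOW GLOBAL VARIABLES LIKE 'fullsync_consistency_level';
```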

The Fullsync HA of Klustron ensures that any master node failure or network partition in a Klustron-storage storage cluster is detected promptly, and that a new master node is elected in time to continue bearing the write load of the storage shard. The details are described below.

Klustron's Fullsync and Fullsync HA mechanisms ensure that as long as N+1 of the 2*N+1 nodes of a Klustron-storage storage shard are running, the shard can continue to accept writes.

Here, N is the fullsync_consistency_level of Klustron-storage. For example, with fullsync_consistency_level = 2 a shard has 2*2+1 = 5 nodes, and it remains writable as long as any 3 of them are running.

Klustron's Fullsync and Fullsync HA mechanisms implement a high availability mechanism equivalent to the Raft protocol, ensuring that the master node of every storage shard in a Klustron cluster keeps its data synchronized to one or more slave nodes.

For any fullsync storage shard with fullsync_consistency_level = N (N > 1), even if its master node and N-1 slave nodes fail at the same time or become isolated by a network partition, Klustron ensures that the shard's data is not lost and remains consistent with the other storage shards in the cluster, and it automatically elects a new master node to restore read and write capability.

Master Node Probing

Klustron Fullsync HA ensures that after the master node of a storage shard fails, restarts, or becomes unreachable due to a network partition, the master node election process starts automatically and the master-slave switch completes, keeping the storage cluster continuously available.

To confirm the availability of each storage shard's master node, the cluster_mgr module continuously probes it by writing a heartbeat every N seconds.

If the master node M0 of a storage shard (denoted SS1) cannot be written to for a certain period of time, the master-slave election and switch process described below is started.

If M0 recovers quickly enough, no master-slave switch is triggered: cluster_mgr simply sets it writable again so that M0 can continue to serve as the master node. Otherwise, cluster_mgr initiates the following master node election and switch process.
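A minimal sketch of this kind of probe is shown below, assuming a hypothetical heartbeat table; the actual table, schema, and statements used by cluster_mgr are internal and may differ:

```sql
-- Hypothetical heartbeat table, used here only to illustrate the probe.
CREATE TABLE IF NOT EXISTS cluster_heartbeat (
    id        INT PRIMARY KEY,
    last_beat TIMESTAMP(6) NOT NULL
);

-- Written on the shard's master node every N seconds. If this write keeps
-- failing or timing out beyond the configured threshold, the master election
-- and switch process is started.
REPLACE INTO cluster_heartbeat (id, last_beat) VALUES (1, NOW(6));
```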

Master Node Election and Switch

The master node election and switch process of Klustron Fullsync HA mainly includes the following steps:

  1. Find the slave node M1 with the newest relay log among all slave nodes of SS1 as the candidate master node.

If several slave nodes are equally up to date, the most suitable one is elected as M1 according to more detailed rules.

The Klustron Fullsync mechanism ensures that the shard has one or more (fullsync_consistency_level) slave nodes that are guaranteed to contain the binlogs of all transactions that have been confirmed as committed to the computing nodes.

Therefore, Klustron can tolerate the simultaneous failure of the master node and fullsync_consistency_level-1 slave nodes without losing the data of committed transactions.
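How up to date each slave node is can be reasoned about with standard MySQL GTID tools. The sketch below is conceptual and not the exact statements cluster_mgr runs:

```sql
-- On each slave node of SS1, inspect which transactions it has received and
-- replayed (see the Retrieved_Gtid_Set / Executed_Gtid_Set fields).
SHOW REPLICA STATUS\G

-- Conceptually, slave A has a relay log at least as new as slave B if B's
-- received GTID set minus A's received GTID set is empty. Substitute the real
-- sets collected from each slave for the placeholders.
SELECT GTID_SUBTRACT('<gtid set received by B>',
                     '<gtid set received by A>') = '' AS a_covers_b;
```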

  2. After M1 has replayed its relay log, promote M1 to the master node of SS1.

MySQL 8.0 has a writeset transaction dependency tracking mechanism (binlog_transaction_dependency_tracking = WRITESET or WRITESET_SESSION) that lets slave nodes replay transactions much faster than MySQL 5.7 when replica_parallel_type = LOGICAL_CLOCK; in most cases the master-slave delay is only a few seconds.
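For reference, these are the standard MySQL 8.0 settings behind this behavior; Klustron configures its storage nodes itself, so treat the following only as an illustration of the mechanism:

```sql
-- On the source (master): track dependencies between transactions by write set
-- so that a replica can apply non-conflicting transactions in parallel.
SET GLOBAL binlog_transaction_dependency_tracking = WRITESET;

-- On the replica: apply transactions in parallel based on the dependency
-- information in the binlog while preserving commit order. Changing the
-- parallel applier settings requires the replication applier to be stopped.
SET GLOBAL replica_parallel_type = LOGICAL_CLOCK;
SET GLOBAL replica_parallel_workers = 8;
SET GLOBAL replica_preserve_commit_order = ON;
```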

However, if the user's tables are poorly designed or used, for example if they lack primary keys and unique indexes, executing a large number of row UPDATE or DELETE statements (even if each statement changes or deletes only a few rows) causes significant delay in replaying the binlog on the slave nodes. In this special case it takes a long time to replay all the relay logs, and during that time no slave node can be promoted to become the new master node.

To handle such cases, we have developed a convenient slave rebuild interface and a slave delay alarm mechanism for Klustron, so that DBAs receive a timely alert when a slave node falls significantly behind and can, with a single click, rebuild the slave node so that it catches up with the master node again.

  3. Change the master node of the other slave nodes of SS1 to M1.

In the case of a network partition or a manual master-slave switch, if the former master node M0 can still be written to, that is, M0 has not failed or restarted, cluster_mgr first demotes M0 to a slave node and sets it to read-only before promoting M1 to the master node of SS1, in order to prevent split brain.
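The fencing itself relies on MySQL's standard read-only switches; a minimal sketch of the kind of statement involved is shown below (the exact sequence cluster_mgr executes may include additional checks):

```sql
-- Executed on the former master M0 before M1 is promoted: reject writes from
-- all clients, including users with SUPER-equivalent privileges. Enabling
-- super_read_only also implies read_only = ON.
SET GLOBAL super_read_only = ON;
```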

  4. Inform all computing nodes that "M1 is the master node of SS1", that is, update the pg_shard.master_node_id field on each computing node, so that computing nodes learn of the new master node promptly and direct their writes to it.

It does not matter if a computing node's master node information is briefly out of date; there are defensive measures for this.

First of all, any Klustron-storage node starts in read-only state. So if M0 restarts for any reason after SS1 has completed the master-slave switch, M0 cannot be written to; even if some computing nodes still try to write to M0 because they have not yet updated their master node information, those writes will fail, so no split brain can occur.

When a computing node finds that the master node it knows of cannot be written to, and cluster_mgr has not yet updated the computing node's pg_shard.master_node_id field, the computing node automatically starts its master node probing procedure to find the new master node of SS1.

Until the new master node is found, the computing node will, depending on the system settings, either block and wait for the new master election to complete or return an error directly to the client. Therefore the master node failure has no impact on the business.
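For reference, the shard-to-master mapping that computing nodes rely on is the pg_shard catalog mentioned in step 4. A hedged sketch of inspecting it on a computing node follows; apart from master_node_id, which is named in this article, the column names are assumptions and may differ between versions:

```sql
-- Run on a computing node (PostgreSQL side): list each storage shard and the
-- node it currently considers to be the master. Column names other than
-- master_node_id are illustrative assumptions.
SELECT name, id, master_node_id FROM pg_shard;
```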

  5. Re-add the former master node as a slave node (flashback).

If the former master node M0 completes its restart later, cluster_mgr re-adds it to the SS1 storage shard as a slave node.

Since the Fullsync mechanism waits for slave ACKs in after-commit mode, M0 may contain transactions that were committed in its local storage engine but never received the required slave ACKs and thus were never confirmed to the client. These transactions must be flashed back on M0, which is now a slave node.

The flashback plugin of Klustron-storage performs this flashback after startup, ensuring that subsequent slave replication runs normally.

The flashback operation mainly applies the inverse of the row operations performed by these extra transactions to remove their changes, truncates the redundant binlog files, and removes the GTIDs of the flashed-back transactions from the mysql.gtid_executed system table.
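The set of transactions that must be flashed back can be reasoned about as a GTID set difference. The statements below are a conceptual sketch using standard MySQL functions, not the internal implementation of the flashback plugin:

```sql
-- On the rejoining former master M0 and on the new master M1, collect the
-- executed GTID sets.
SELECT @@GLOBAL.gtid_executed;

-- Conceptually, the transactions to flash back on M0 are those M0 executed
-- that the new master M1 never received. Substitute the two sets collected
-- above for the placeholders.
SELECT GTID_SUBTRACT('<gtid_executed collected on M0>',
                     '<gtid_executed collected on M1>') AS gtids_to_flash_back;
```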

Finally, Klustron-storage Fullsync HA includes a set of practical measures to avoid unnecessary master-slave switches in extreme cases.

Such cases typically arise under heavy write load, which is exactly when an unnecessary master-slave switch is most likely to cause performance degradation or even temporary unavailability, so it is a problem that must be avoided as much as possible.

Based on years of experience developing and operating online systems, together with a deep understanding of MySQL kernel technology, we have implemented a complete set of logic that identifies and avoids unnecessary master-slave switches.

Through distributed transaction processing and the Fullsync and Fullsync HA mechanisms, Klustron provides complete data consistency guarantees and disaster recovery capabilities, achieving high reliability and high availability while also delivering the high performance needed for high-concurrency, heavy-load OLTP scenarios.

END