Klustron (Klustron Database) uses MySQL as its storage node, called Klustron-storage. The latest version of Klustron-storage is based on percona-mysql-8.0.26. On top of this community version, we have fixed several defects related to fault tolerance and data consistency in MySQL's XA transactions and binlog replication, added several features required by the Kunlun database cluster, and improved performance. We also keep merging upstream versions, so that the latest work of the MySQL community continues to flow into Klustron-storage.
Before the development of the Klustron-storage Fullsync mechanism, we had been using MySQL Group Replication (MGR) to achieve high availability of storage clusters. In order to achieve better data write performance, including higher throughput and shorter latency, reduce consumption of storage systems and network bandwidth, and achieve more flexible strategies in high availability, we designed and developed the Fullsync high availability mechanism in Klustron-storage.
Introduction to Klustron-storage Fullsync Functionality
Overview and Principle of Klustron-storage Fullsync
The Fullsync mechanism of Klustron is a high-availability mechanism for storage clusters. It ensures that when node failures, network partitions, or similar problems occur in a storage cluster, at least one available slave machine holds the binlogs of all transactions whose successful commit has been confirmed to users. A new master node can then be elected as needed, so that the cluster can continue to accept writes and remain highly available.
The Fullsync mechanism of Klustron is based on MySQL's well-established Row Based Replication (RBR) binlog replication mechanism, which achieves strong synchronization of master-slave replication. That is, it ensures that every transaction committed on the master node -- including explicit ordinary transactions (i.e., begin...commit), autocommit update/delete/insert statements, and XA transactions -- waits continuously until it receives confirmation (ACK) from a sufficient number of slave machines after completing the internal transaction commit process (i.e., engine log, binlog flush&sync and engine commit). Only then does it confirm to the client (in Klustron, it is the computing node) that the transaction has been successfully committed.
An ACK is a combination of a binlog file identifier and an offset value, which represents that all binlogs before this position have been received and persistently stored (flushed to disk) by the slave machine. Therefore, all transactions whose binlogs are stored before this position in the master node's binlog file have been confirmed, and their commit operations can return success status to the client.
A slave machine sends this ACK to confirm to its master node that it has received and persisted (flushed to disk) a group of transaction binlogs to its relay log file. Only when the master node receives this confirmation does it return a confirmation of transaction commit success to the client. After receiving this result, the client can send the next SQL statement to Klustron. Klustron is only obligated to ensure that the changes made by such a transaction are not lost (i.e., the D of ACID, durability) after returning a successful transaction commit (or prepare) to the client.
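The ACK semantics described above come down to comparing binlog positions. The following is a minimal sketch, not Klustron-storage code: the names `BinlogPos` and `is_confirmed` are hypothetical, and it assumes that binlog files are ordered by a numeric file index and that a transaction ending exactly at the ACK position counts as covered.

```python
from typing import NamedTuple


class BinlogPos(NamedTuple):
    """A position in the master's binlog (hypothetical representation)."""
    file_index: int  # numeric index of the binlog file
    offset: int      # byte offset within that file


def is_confirmed(txn_end: BinlogPos, ack: BinlogPos) -> bool:
    """Return True if a transaction whose binlog ends at txn_end is covered by ack.

    Tuples compare field by field, so any position in a later file is
    greater than every position in an earlier file.
    """
    return txn_end <= ack


ack = BinlogPos(file_index=12, offset=4096)
print(is_confirmed(BinlogPos(12, 1024), ack))  # earlier offset in the same file
print(is_confirmed(BinlogPos(11, 9999), ack))  # earlier file
print(is_confirmed(BinlogPos(12, 8192), ack))  # beyond the ACK position
```

The first two transactions may be reported to the client as committed; the third has to wait for a later ACK.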
Preconditions for Klustron-storage Fullsync
The Klustron-storage Fullsync mechanism requires a specific combination of parameters to function properly. The parameter configuration template file that comes with Klustron-storage contains parameter settings that have been optimized by our development team, including the parameter settings required for the fullsync feature.
All storage node instances in the Klustron cluster are created using its built-in parameter template. Specifically, the following parameters are included:
- gtid_mode=on, enforce_gtid_consistency=1, log_slave_updates=ON
- binlog_format=row, i.e. use Row based replication
- All master-slave nodes use binlog
- Sessions use binlog, i.e. sql_log_bin=true. If sql_log_bin is set to false in a session, the fullsync mechanism in this session will not work, but the fullsync in other sessions will still work.
- enable_fullsync = ON, i.e. turn on the global switch of fullsync
- thread_handling=2 or 0, i.e. the Klustron-storage fullsync mechanism works both with the thread pool (thread_handling=2) and with a dedicated thread per connection (thread_handling=0).
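A quick way to review a node's settings against this list is a small checker. This is only a sketch under stated assumptions: `REQUIRED_SETTINGS` and `unmet_preconditions` are hypothetical names, `log_bin` is this sketch's reading of "all master-slave nodes use binlog", and the values are compared in the spelling used in the list above (a live server may report some of them differently).

```python
from typing import List, Mapping

# Accepted spellings per variable, taken from the list above.
# "log_bin" is an assumption standing in for "all nodes use binlog".
REQUIRED_SETTINGS = {
    "gtid_mode": {"ON"},
    "enforce_gtid_consistency": {"ON", "1"},
    "log_slave_updates": {"ON", "1"},
    "binlog_format": {"ROW"},
    "log_bin": {"ON", "1"},
    "enable_fullsync": {"ON", "1"},
    "thread_handling": {"0", "2"},
}


def unmet_preconditions(settings: Mapping[str, object]) -> List[str]:
    """Return the names of required variables that are missing or mis-set."""
    problems = []
    for name, accepted in REQUIRED_SETTINGS.items():
        value = str(settings.get(name, "")).strip().upper()
        if value not in accepted:
            problems.append(name)
    return problems


example = {
    "gtid_mode": "on",
    "enforce_gtid_consistency": 1,
    "log_slave_updates": "ON",
    "binlog_format": "statement",  # deliberately wrong for the demo
    "log_bin": "ON",
    "enable_fullsync": "ON",
    "thread_handling": 2,
}
print(unmet_preconditions(example))  # ['binlog_format']
```

The session-level sql_log_bin switch is left out on purpose, because it has to be checked per session rather than per node.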
Klustron-storage Fullsync Mechanism Design and Implementation
1. Master Node
The fullsync mechanism of Klustron-storage is an after-commit synchronization mode. After the working thread thr that serves the user session thd has completed the commit or prepare (XA prepare) of transaction T, and before it confirms success to the client (i.e., sends an OK packet), the master node checks whether the binlog of T has been acknowledged by enough slave machines (a single slave ACK can confirm receipt of the binlogs of several transactions). This condition is called the release condition.
Fullsync_consistency_level defines how many ACKs the master node needs to wait for each transaction. If it is set to 0, the master node does not wait for any ACKs. If it is greater than 0, the master node waits for that many ACKs from slave machines.
In a Klustron cluster, clustermgr sets a reasonable Fullsync_consistency_level for each master node based on the number of cluster nodes. Generally, for a storage shard with 2*n+1 nodes, Fullsync_consistency_level is set to n, achieving a simple majority. Thus, the cluster can continue writing even if n nodes disappear. Clustermgr also supports other policies, such as requiring all slave machines to confirm.
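The simple-majority rule can be checked with a few lines of arithmetic. This is an illustrative sketch, not clustermgr's code: `majority_ack_level` and `copies_at_commit` are hypothetical names, and only the 2*n+1 case described above is covered.

```python
def majority_ack_level(total_nodes: int) -> int:
    """ACKs to wait for in a shard of 2*n+1 nodes (one master, 2*n slaves)."""
    if total_nodes < 1 or total_nodes % 2 == 0:
        raise ValueError("this sketch only covers shards with 2*n+1 nodes")
    return total_nodes // 2


def copies_at_commit(ack_level: int) -> int:
    """Nodes holding the binlog when the client is told the commit succeeded."""
    return ack_level + 1  # the master plus every slave that ACKed


for nodes in (3, 5, 7):
    level = majority_ack_level(nodes)
    print(nodes, level, copies_at_commit(level))
```

For a 5-node shard the level is 2, so every confirmed transaction exists on 3 of the 5 nodes, and losing any 2 nodes still leaves at least one copy.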
If the release condition is met, thr returns a successful status to the client and completes the request processing. Otherwise, the working thread thr puts the session object thd into the fullsync ACK waiting queue and processes other requests received from other connections.
After receiving ACKs, the master node checks the release condition for sessions in the waiting queue. If the release condition is satisfied, the session is released, and a successful status is returned to the client. When waiting for slave machine ACKs, the user session does not occupy the working thread.
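The waiting queue and the release condition can be modelled in a few lines. This is a simplified single-threaded sketch under stated assumptions, not the actual implementation: `FullsyncWaitQueue` and its methods are hypothetical names, sessions are plain strings, and positions are (file index, offset) tuples.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]  # (binlog file index, offset)


class FullsyncWaitQueue:
    def __init__(self, consistency_level: int) -> None:
        self.level = consistency_level
        self.acked: Dict[int, Pos] = {}           # slave server_id -> latest ACKed position
        self.waiting: List[Tuple[str, Pos]] = []  # (session, end position of its transaction)

    def _ack_count(self, pos: Pos) -> int:
        # A slave covers pos if its latest ACK is at or beyond pos.
        return sum(1 for p in self.acked.values() if p >= pos)

    def commit(self, session: str, end_pos: Pos) -> bool:
        """Called after the local commit. True means: reply OK right away."""
        if self.level == 0 or self._ack_count(end_pos) >= self.level:
            return True
        self.waiting.append((session, end_pos))  # the worker thread is now free
        return False

    def on_ack(self, server_id: int, pos: Pos) -> List[str]:
        """Record an ACK and return the sessions that can now be released."""
        if pos > self.acked.get(server_id, (0, 0)):  # ignore obsolete ACKs
            self.acked[server_id] = pos
        released = [s for s, p in self.waiting if self._ack_count(p) >= self.level]
        self.waiting = [(s, p) for s, p in self.waiting if self._ack_count(p) < self.level]
        return released


q = FullsyncWaitQueue(consistency_level=2)
print(q.commit("s1", (1, 100)))  # False: no ACKs yet, s1 is queued
print(q.commit("s2", (1, 200)))  # False
print(q.on_ack(101, (1, 200)))   # []: only one slave covers either transaction
print(q.on_ack(102, (1, 100)))   # ['s1']
print(q.on_ack(102, (1, 250)))   # ['s2']
```

Because `commit` returns immediately in both cases, the worker thread never blocks on a slave; the session simply stays in the queue until an ACK releases it.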
If enough ACKs are not received to release a waiting session during the fullsync_timeout, Klustron-storage has two strategies controlled by the global variable disable_fullsync_on_slave_ack_timeout:
A. If disable_fullsync_on_slave_ack_timeout is set to 1, fullsync automatically degrades to asynchronous replication, and subsequent transactions no longer wait for slave ACKs. When the master node receives ACKs from the slave machines again, the fullsync mechanism is automatically re-enabled.
B. If disable_fullsync_on_slave_ack_timeout is set to 0, a waiting session that times out in the fullsync process will return an error (error code 9000) to the client. For a Klustron cluster, the computing node that receives this error will trigger a master-slave switch.
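The two timeout strategies can be summarized as a small state machine. This is a sketch for illustration only: `FullsyncTimeoutPolicy` and `TimeoutAction` are hypothetical names, and what exactly happens to the session that timed out under strategy A is not modelled, because the text above does not spell it out.

```python
from enum import Enum

FULLSYNC_TIMEOUT_ERROR = 9000  # error code named in the text above


class TimeoutAction(Enum):
    DEGRADE_TO_ASYNC = "degrade"  # strategy A
    RETURN_ERROR = "error"        # strategy B, error 9000 goes to the client


class FullsyncTimeoutPolicy:
    def __init__(self, disable_fullsync_on_slave_ack_timeout: bool) -> None:
        self.disable_on_timeout = disable_fullsync_on_slave_ack_timeout
        self.effective = True  # whether commits currently wait for ACKs

    def on_timeout(self) -> TimeoutAction:
        """A waiting session exceeded fullsync_timeout."""
        if self.disable_on_timeout:
            self.effective = False  # later commits stop waiting for ACKs
            return TimeoutAction.DEGRADE_TO_ASYNC
        return TimeoutAction.RETURN_ERROR

    def on_ack_received(self) -> None:
        """ACKs are arriving again, so fullsync is switched back on."""
        self.effective = True


policy = FullsyncTimeoutPolicy(disable_fullsync_on_slave_ack_timeout=True)
print(policy.on_timeout(), policy.effective)
policy.on_ack_received()
print(policy.effective)
```

Strategy A favours write availability of the master, while strategy B favours consistency and leaves the reaction, a master-slave switch, to the computing node.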
If the master node crashes before a transaction T has received any ACK from the slave machines, the new master node will lack the binlog of T, even though T was already committed on the old master node M0. If M0 subsequently rejoins the cluster, Klustron-storage performs a flashback on M0 to remove the changes and binlog of T from the instance's storage engine (InnoDB) and binlog files.
As T has not been returned to the client, Klustron has not promised that T has been committed successfully, so we can completely flashback T without affecting the durability of the transaction.
2. Slave Node
An EG (Event Group, i.e. a binlog transaction) can be an explicit normal transaction, the first or second phase of an XA transaction, a DDL statement, or an autocommit statement; the abbreviation EG is used in the rest of this text. When a slave machine receives a binlog event that terminates an EG (XID_EVENT, XA_PREPARE_LOG_EVENT, or a DDL transaction), it decides whether to write the received EGs to the relay log file and flush and fsync them to persistent storage. It then sends an ACK to the master node to confirm the persistence of these received EGs.
The decision aims to minimize resource consumption and optimize performance: if a slave node has received enough EGs (configured by the parameter fullsync_fsync_ack_least_txns) or enough binlog data (configured by the parameter fullsync_fsync_ack_least_event_bytes), or if it has not sent an ACK for too long (configured by fullsync_fsync_ack_wait_max_milli_secs), it flushes and fsyncs the relay log and then sends an ACK.
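The three thresholds combine as a simple "any of" test. The sketch below is an assumption-laden illustration rather than the real code path: `AckThresholds` and `should_fsync_and_ack` are hypothetical names, the threshold values in the example are arbitrary, and treating "nothing pending" as "do nothing" is this sketch's own choice.

```python
from dataclasses import dataclass


@dataclass
class AckThresholds:
    least_txns: int         # fullsync_fsync_ack_least_txns
    least_event_bytes: int  # fullsync_fsync_ack_least_event_bytes
    wait_max_ms: int        # fullsync_fsync_ack_wait_max_milli_secs


def should_fsync_and_ack(pending_txns: int, pending_bytes: int,
                         ms_since_last_ack: int, t: AckThresholds) -> bool:
    """Decide whether the slave should fsync the relay log and send an ACK now."""
    if pending_txns == 0:
        return False  # assumption: nothing new to confirm
    return (pending_txns >= t.least_txns
            or pending_bytes >= t.least_event_bytes
            or ms_since_last_ack >= t.wait_max_ms)


limits = AckThresholds(least_txns=10, least_event_bytes=65536, wait_max_ms=5)
print(should_fsync_and_ack(3, 1000, 1, limits))     # False: keep accumulating
print(should_fsync_and_ack(10, 1000, 1, limits))    # True: enough event groups
print(should_fsync_and_ack(3, 100000, 1, limits))   # True: enough bytes
print(should_fsync_and_ack(1, 200, 7, limits))      # True: waited long enough
```

Larger thresholds mean fewer fsync calls on the slave but a longer wait before the master can release the affected transactions.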
An ACK contains the following information: the server_id of the slave node, and the position of the last EG that has been persisted in the master node's binlog (file number and offset). After receiving an ACK from a slave node, the master node can be sure that the slave node has received and persisted all EGs before the ACK position.
The global variable fullsync_relaylog_fsync_ack_level is used to control the behavior of the Fullsync mechanism in a slave node, which includes flushing & fsyncing the relay log and sending an ACK, with the following meanings:
fullsync_relaylog_fsync_ack_level | Slave Node Behavior |
---|---|
0 | Do not flush or fsync relay log nor send ACK |
1 | Do not flush or fsync relay log, but send an ACK |
2 | Flush and fsync the relay log, send an ACK |
Enabling the log_fullsync_replica_acks option on the master node allows logging of every received ACK in the mysqld running log. However, this feature is intended only for debugging the replica ACK mechanism and should not be enabled in production systems as it can significantly impact performance.
There are two methods for a slave node to send ACKs to the master node. Both require the slave node to connect to the master node through the MySQL client library, so each slave node holds two connections to the master node: one for the IO thread and one for sending ACKs. On the ACK connection, the slave node can send either the Klustron-storage-specific COM_BINLOG_ACK command or an SQL statement understood by Klustron-storage. The former performs better, while the latter allows third-party binlog storage components to send ACKs to the master node.
a. COM_BINLOG_ACK: Compile a program against Klustron-storage's client library and its mysql.h header file, then call the mysql_send_binlog_ack() function to send the ACK. The Klustron-storage fullsync feature itself uses this method to send ACKs to its master node.
b. The SQL statement SLAVE server_id CONSISTENT TO file_index offset: This method works with the client library of any community version of MySQL, and a Klustron-storage master node correctly handles this statement as an ACK. It is particularly suitable for binlog storage components.
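For method b, a third-party component only has to produce the statement text and send it over an ordinary MySQL connection. The sketch below covers the formatting step only; `build_ack_statement` is a hypothetical helper, and rendering all three fields as plain non-negative integers is an assumption based on the statement template above.

```python
def build_ack_statement(server_id: int, file_index: int, offset: int) -> str:
    """Format the SQL-style ACK: SLAVE server_id CONSISTENT TO file_index offset."""
    for name, value in (("server_id", server_id),
                        ("file_index", file_index),
                        ("offset", offset)):
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    return f"SLAVE {server_id} CONSISTENT TO {file_index} {offset}"


print(build_ack_statement(server_id=2, file_index=12, offset=4096))
# SLAVE 2 CONSISTENT TO 12 4096
```

The resulting string would then be passed to the query call of whichever MySQL client library the component already uses, on its dedicated ACK connection.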
3. Klustron-storage Fullsync Status Variables
These status variables can help DBAs observe the running status and performance of fullsync, and serve as a basis for adjusting fullsync configuration parameters. Their meanings are in the table.
Status variable name | Meaning |
---|---|
fullsync_received_replica_acks | No. of received replica ACKs |
fullsync_old_acks_received | No. of received ACKs that are obsolete, i.e. ACKs for a position already covered by previously received ACKs |
fullsync_txns_acked | No. of txns the replica ACK'ed |
fullsync_txns_fully_acked_before_wait | No. of txns fully ACK'ed before they start to wait: when the txn tries to wait for ACKs, it has already received all needed ACKs from slaves |
fullsync_txns_acked_before_wait | No. of txns partly ACK'ed before they start to wait: when the txn tries to wait for ACKs, it has already received some of the needed ACKs from slaves |
fullsync_txns_long_wait_warnings_for_acks | No. of long-wait warnings for ACKs. Although the wait did not time out, it was long enough to trigger a fullsync warning. |
fullsync_txns_timed_out_waiting_for_acks | No. of txns timed out waiting for ACKs |
fullsync_txns_received_by_replica | No. of txns received by the replica |
fullsync_relay_log_syncs | No. of relay log syncs |
fullsync_acks_sent_to_master | No. of ACKs sent to master |
fullsync_num_txns_in_acked_group | Set by a replica: No. of txns flushed and fsync'ed corresponding to the latest ACK |
fullsync_replica_skipped_old_trx_acks | No. of times the replica skipped sending ACKs because the received txns were too old |
fullsync_replica_ack_upto_file and fullsync_replica_ack_upto_offset | Fullsync replicas have ACKed up to this position (file and offset within the master's binlog file) |
fullsync_replica_fully_acked_upto_file and fullsync_replica_fully_acked_upto_offset | Fullsync replicas have fully ACKed up to this position (file and offset within the master's binlog file) |
fullsync_latest_recvd_trx_ts | Timestamp on the master node of the latest received transaction, i.e. the time at which the transaction was flushed to the master node's binlog file. It can be used to measure IO thread latency. |
fullsync_replica_ack_timedout | Whether the master node timed out waiting for replica ACKs |
fullsync_effective | Whether fullsync is effective on this master or slave node |
fullsync_num_waiting_txns | No. of transactions currently waiting for ACKs on the master node |
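As an example of how these counters might feed tuning decisions, the sketch below derives two rough indicators from a snapshot of the status variables. It is an illustration only: `fullsync_summary` is a hypothetical helper, the two ratios are this sketch's own suggestions rather than official metrics, and the snapshot is assumed to be a plain name-to-value mapping collected by the DBA's monitoring tool.

```python
from typing import Dict, Mapping, Optional


def _ratio(num: int, den: int) -> Optional[float]:
    """Return num / den, or None when the denominator is zero."""
    return num / den if den else None


def fullsync_summary(status: Mapping[str, object]) -> Dict[str, Optional[float]]:
    """Derive rough batching and redundancy indicators from status counters."""
    def get(name: str) -> int:
        return int(status.get(name, 0))  # values may arrive as strings

    return {
        # Replica side: how many transactions each relay log fsync covers.
        "txns_per_relay_log_sync": _ratio(
            get("fullsync_txns_received_by_replica"),
            get("fullsync_relay_log_syncs")),
        # Master side: share of received ACKs that carried no new position.
        "obsolete_ack_share": _ratio(
            get("fullsync_old_acks_received"),
            get("fullsync_received_replica_acks")),
    }


snapshot = {
    "fullsync_txns_received_by_replica": "1000",
    "fullsync_relay_log_syncs": "50",
    "fullsync_old_acks_received": "5",
    "fullsync_received_replica_acks": "100",
}
print(fullsync_summary(snapshot))
```

A low transactions-per-sync figure suggests the replica is fsyncing almost per transaction, which is where the accumulation thresholds could be reviewed.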
4. Fullsync Configuration Parameters
Klustron-storage Fullsync supports a rich set of configuration parameters that enable users to achieve a suitable balance between performance, resource consumption, and consistency. These variables are all MySQL global variables, and their meanings and usage are explained in the table below.
Fullsync Variables | meanings |
---|---|
fullsync_consistency_level | At the end of transaction commit, whether and how to wait for fullsync replica ACKs before replying to the client that a transaction has committed. 0: no wait; 99: wait for a simple majority of replicas; 100: wait for all replicas; [1, 98]: wait for this number of ACKs. |
fullsync_relaylog_fsync_ack_level | When fullsync is enabled, how the replica should fsync the relay log and/or reply with an ACK to the master after it writes its received event group(s) to the relay log file. 0: don't fsync or send ACK; 1: don't fsync but send ACK; 2: fsync and send ACK. |
fullsync_fsync_ack_least_event_bytes | Accumulate at least this many relay log bytes before fsync'ing the relay log and sending an ACK. |
fullsync_fsync_ack_least_txns | Accumulate at least this many event groups before fsync'ing the relay log and sending an ACK. |
fullsync_fsync_ack_wait_max_milli_secs | Replica nodes wait at most this many milliseconds for more event groups to arrive before fsync'ing the relay log and sending an ACK. |
skip_fullsync_replica_acks_older_than | If a replica lags behind the master node by this many milliseconds, skip fsync'ing the relay log and sending ACKs. |
fullsync_warning_timeout | If a replica ACK arrives later than this many milliseconds after the transaction started to wait for it, write a warning to the error log. |
fullsync_timeout | If a replica ACK has not arrived this many milliseconds after the transaction started to wait for it, return an error to the client and write an error to the error log. |
log_fullsync_replica_acks | Whether to log replica ACKs to the mysqld error log. Note that when fullsync is enabled this can produce a huge amount of log output that is seldom used. |
enable_fullsync | Whether to enable the fullsync mechanism. |
disable_fullsync_on_slave_ack_timeout | Whether to disable fullsync when replicas do not ACK in time and a timeout occurs. If this is false, the master node cannot be written to while it has no running replicas. |
Klustron-storage Fullsync Advantages
Klustron-storage Fullsync has the following advantages compared to MySQL's semi-synchronous (semisync) plugin:
While waiting for slave confirmation, the client session and its transaction do not occupy a working thread. This avoids tying up a large number of working threads that merely wait for slave ACKs, which would otherwise force the thread pool to start more working threads to handle the continuous influx of requests from other client sessions and consume too many system resources.
After the slave machine collects a certain (configurable) amount of binlog transactions, it performs fsync on the relay log to ensure that the binlog is persisted. This not only avoids the serious problems that arise when a slave machine loses its latest relay log because of a power failure, OS crash, or restart, but also avoids placing a heavy write load on storage devices. It strikes a good balance between performance, latency, and storage resource consumption.
Fullsync has flexible configuration capabilities, allowing users to make flexible trade-offs between high availability, data consistency, and performance.
Fullsync provides rich runtime status information, which helps DBAs monitor the running state of fullsync and gives them the information they need to tune fullsync-related configuration.
DBAs can configure the master node to wait for ACKs from several slaves (configurable), thereby achieving a higher fault tolerance level. For example, for some high-value businesses, DBAs can configure one master and four slaves, allowing the master node to wait for two slave ACKs for each transaction. After any commit operation completes, it must wait for two slave ACKs before the transaction commit status can be returned to the client.
At the same time, DBAs can also configure whether a specific channel of a slave sends ACK, even if the slave's fullsync_relaylog_fsync_ack_level is set to 1 or 2, to achieve the goal of flexible configuration of the cluster's high availability architecture.