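Klustron (formerly KunlunBase) Storage Cluster Fullsync Mechanism
Klustron uses MySQL as its storage node, which is called Klustron-storage. The latest version of Klustron-storage is based on percona-mysql-8.0.26. On top of this community release, we fixed fault-tolerance and data-consistency defects in MySQL's XA transactions and binlog replication, added a number of features required by the kunlun database cluster, and improved performance. We also keep merging upstream releases, so the latest work of the MySQL community continues to flow into Klustron-storage.
Before the Klustron-storage Fullsync mechanism was completed, we used MySQL Group Replication (MGR) to provide high availability for storage clusters. We designed and developed the Fullsync high-availability mechanism in Klustron-storage to achieve better write performance (higher throughput and lower latency), to reduce the load on the storage system and network bandwidth, and to allow more flexible high-availability policies.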
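Introduction to Klustron-storage Fullsync
Overview and Principles of Klustron-storage Fullsync
Klustron's Fullsync mechanism is a high-availability mechanism for storage clusters. It ensures that when a storage cluster suffers node failures, network partitions, or similar problems, the cluster still has an available replica holding the binlog of every transaction whose commit was confirmed to the user. A new primary can then be elected as needed, so the cluster remains writable and stays highly available.
Klustron's Fullsync mechanism is built on MySQL's battle-tested Row Based Replication (RBR) binlog replication and implements strongly synchronous primary-replica replication. Every transaction committed on the primary, whether an explicit ordinary transaction (BEGIN ... COMMIT), an autocommit UPDATE/DELETE/INSERT statement, or an XA transaction, first completes the internal commit flow (the three stages: engine log, binlog flush & sync, and engine commit). The primary then keeps waiting until it has received acknowledgements (ACKs) from a sufficient number of replicas, and only then confirms to the client (in Klustron, the compute node; likewise below) that the transaction committed successfully.
A replica gathers the binlog of several transactions, writes it to its relay log file in one go, and flushes it to disk (fsync). It then sends an ACK to its primary, confirming that it has received a group of transactions' binlog and persisted it to its relay log file. Only after receiving this confirmation does the primary return the commit-success status to the client, and only after receiving that result can the client send its next SQL statement to Klustron. Klustron is obliged to guarantee that a transaction's changes are not lost (the D in ACID, Durability) only once it has told the client that the transaction committed (or prepared) successfully.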
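Prerequisites for Klustron-storage Fullsync
The Klustron-storage Fullsync mechanism needs a specific combination of parameters to work correctly. The configuration template file shipped with Klustron-storage contains settings tuned by our development team, including the settings that fullsync requires.
All storage node instances in a Klustron cluster are created from this bundled template. Specifically, the following settings are required:
- gtid_mode=on, enforce_gtid_consistency=1, log_slave_updates=ON
- binlog_format=row, i.e. Row based replication is used
- All primary and replica nodes have binlog enabled
- Sessions write binlog, i.e. sql_log_bin=true. If sql_log_bin is set to false in a session, fullsync does not work in that session, but it still works in other sessions.
- enable_fullsync = ON turns on the global fullsync switch
- thread_handling=2 or 0, i.e. the Klustron-storage fullsync mechanism works with the thread pool (thread_handling=2) and with one thread handling one transaction (thread_handling=0).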
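The prerequisites above lend themselves to a simple automated check. The following is a minimal sketch in Python, assuming the variable values have already been collected into a dictionary (for example from the output of SHOW VARIABLES). The helper name `fullsync_config_problems`, the use of `log_bin` to represent "binlog is enabled", and the accepted spellings of boolean values are assumptions made for illustration, not part of Klustron-storage.

```python
# Accepted (lower-cased) values per variable, based on the prerequisite list.
# ASSUMPTION: "log_bin" stands for "binlog is enabled"; boolean spellings are illustrative.
REQUIRED = {
    "gtid_mode": {"on"},
    "enforce_gtid_consistency": {"on", "1"},
    "log_slave_updates": {"on", "1"},
    "binlog_format": {"row"},
    "log_bin": {"on", "1"},
    "sql_log_bin": {"on", "1", "true"},
    "enable_fullsync": {"on", "1"},
    "thread_handling": {"0", "2"},
}


def fullsync_config_problems(settings):
    """Return the sorted names of variables that are missing or have an unaccepted value."""
    problems = []
    for name, accepted in REQUIRED.items():
        value = str(settings.get(name, "")).strip().lower()
        if value not in accepted:
            problems.append(name)
    return sorted(problems)
```

An empty result means every listed prerequisite is met; any returned name points at a setting to review before relying on fullsync.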
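Design and Implementation of Klustron-storage Fullsync
1. Primary Node
The Klustron-storage fullsync mechanism is an after-commit synchronization mode. After the worker thread thr serving user session thd has finished committing or preparing (XA PREPARE) transaction T, and before it confirms success to the client (i.e. sends the OK packet), the primary checks whether T's binlog has been acknowledged by a sufficient number of replicas (a replica's ACK confirms receipt of the binlog of several transactions). **This condition is called the release condition.**
In a Klustron cluster, clustermgr sets a suitable fullsync_consistency_level on each master node according to the number of nodes in the cluster. The usual rule is that a storage shard with 2*n+1 nodes gets fullsync_consistency_level=n. The primary plus n acknowledging replicas then form a simple majority (n+1 of 2*n+1 nodes), so the cluster can still accept writes when n nodes are lost at the same time. clustermgr can also support other policies, such as requiring every replica to acknowledge.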
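The 2*n+1 rule can be checked with a few lines of arithmetic. This is an illustrative Python sketch; the function names are hypothetical and do not correspond to clustermgr code.

```python
def consistency_level_for_shard(num_nodes):
    """Level n for a shard of 2*n+1 nodes (one primary plus 2*n replicas)."""
    if num_nodes < 1 or num_nodes % 2 == 0:
        raise ValueError("the rule is stated for an odd number of nodes (2*n+1)")
    return (num_nodes - 1) // 2


def is_simple_majority(num_nodes, acking_replicas):
    """True if the primary plus the acknowledging replicas exceed half of the shard."""
    return 2 * (1 + acking_replicas) > num_nodes
```

For a 5-node shard this gives level 2: the primary and two replicas hold the transaction, which is 3 of 5 nodes.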
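If the release condition holds, thr returns success to the client straight away and finishes handling the request. Otherwise thr places the session object thd on the fullsync ack wait queue and goes on to handle requests arriving on other connections.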
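The release condition and the wait queue can be modelled in a few lines. The Python sketch below is a simplified model, not the server's implementation: the class `AckTracker` and its methods are hypothetical, binlog positions are modelled as (file_index, offset) tuples, and it assumes sessions enter the queue in binlog order so that only the head of the queue needs checking.

```python
from collections import deque


class AckTracker:
    """Tracks, per replica, the highest binlog position acknowledged so far."""

    def __init__(self, required_acks):
        self.required_acks = required_acks
        self.acked = {}         # server_id -> (file_index, offset)
        self.waiting = deque()  # (txn_end_position, session_id), assumed in binlog order

    def released(self, txn_pos):
        """Release condition: enough replicas have ACKed a position >= txn_pos."""
        count = sum(1 for pos in self.acked.values() if pos >= txn_pos)
        return count >= self.required_acks

    def finish_commit(self, session_id, txn_pos):
        """After the local commit: True if the client can be answered right away."""
        if self.released(txn_pos):
            return True
        self.waiting.append((txn_pos, session_id))  # park the session, free the worker
        return False

    def on_ack(self, server_id, pos):
        """Record an ACK and return the sessions that can now be answered."""
        if pos > self.acked.get(server_id, (0, 0)):
            self.acked[server_id] = pos  # older (obsolete) ACKs are ignored
        done = []
        while self.waiting and self.released(self.waiting[0][0]):
            done.append(self.waiting.popleft()[1])
        return done
```

The worker thread never blocks in this model: `finish_commit` returns immediately, and parked sessions are completed later by whichever code path processes the incoming ACK.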
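Suppose the primary crashes while some transaction T has not been acknowledged by any replica, so that after failover the new primary lacks the binlog of T, which had already been committed on the old primary M0. When M0 later rejoins the cluster, Klustron-storage performs a flashback on M0, removing T's changes and its binlog from the instance's storage engine (innodb) and binlog files.
Because the result of T was never returned to the client, Klustron never promised the client that T committed successfully. T can therefore be flashed back safely without affecting transaction durability.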
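To make the flashback decision concrete, the sketch below computes which transactions an old primary would have to undo when it rejoins. It is only an illustration in Python: transactions are represented by opaque identifiers (GTID-like strings are assumed), the function name is hypothetical, and undoing the newest transaction first is an assumption about ordering. The real flashback operates on binlog events and the storage engine.

```python
def txns_to_flash_back(old_primary_binlog, new_primary_txns):
    """Transactions present on the old primary but unknown to the new primary.

    old_primary_binlog: transaction ids in the order they were written on M0.
    new_primary_txns:   transaction ids the new primary has.
    Returns the extra transactions, newest first (assumed undo order).
    """
    known = set(new_primary_txns)
    extra = [txn for txn in old_primary_binlog if txn not in known]
    return list(reversed(extra))
```

For example, if M0 wrote g1 to g4 but the new primary only has g1 and g2, the result is g4 followed by g3. Neither of these was ever confirmed to a client, because neither received the required ACKs.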
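Details of How a Transaction Waits for Fullsync ACKs
- What is waited for: the transaction has been received and flushed to disk by enough replicas, i.e. the ACKed binlog position >= the transaction's binlog position.
- How the wait works: the worker thread is not blocked. The session (THD) is suspended until the ACK arrives, and the worker thread goes on to handle requests from other connections.
- Handling an ACK wait timeout: a "fullsync wait timeout" error is returned to the client in the session. This works together with the Fullsync HA mechanism to provide high availability. To allow for random, occasional network or IO jitter, either Consistency or Availability can be chosen as the priority. The high-availability mechanism works automatically and needs no manual intervention.
- Handling a received ACK: a background worker thread performs the remaining steps of the target session's (THD) commit flow, i.e. it sends the OK packet to the client and the client's statement returns.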
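The timeout handling can be summarized as a small decision function. The Python sketch below is an interpretation rather than server code: the thresholds correspond to the fullsync_warning_timeout and fullsync_timeout variables, the flag corresponds to disable_fullsync_on_slave_ack_timeout, and the function name, the return shape, and the use of >= at the boundaries are assumptions.

```python
def ack_wait_outcome(waited_ms, warning_timeout_ms, timeout_ms, disable_on_timeout):
    """Return (client_result, log_level, fullsync_stays_enabled) for one waiting txn."""
    if waited_ms >= timeout_ms:
        # The client gets the "fullsync wait timeout" error. Whether fullsync stays
        # on afterwards is the Consistency vs Availability choice.
        return ("error", "error", not disable_on_timeout)
    if waited_ms >= warning_timeout_ms:
        # The ACK arrived, but slowly enough to be worth a warning in the log.
        return ("ok", "warning", True)
    return ("ok", None, True)
```

With the flag off, the primary keeps insisting on ACKs (Consistency first); with the flag on, fullsync is switched off after a timeout so that writes can continue (Availability first).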
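2. Replica Node
After a replica receives the terminating binlog event (XID_EVENT, XA_PREPARE_LOG_EVENT, or a DDL transaction) of an event group (a binlog transaction, which may be an ordinary explicit transaction, the first phase of an XA transaction, the second phase of an XA transaction, a DDL statement, or an autocommit statement; abbreviated EG below), it decides whether to write the EGs received so far to the relay log file and flush them to persistent storage (flush & fsync relay log), and then sends an ACK to the primary to confirm that those EGs are persisted. Because the fsync has been done, the binlog of acknowledged transactions remains durably stored in the replica's relay log file after a power loss or OS restart.
The flush timing is chosen to minimize resource consumption and maximize performance, balancing disk load, performance, and data consistency. The replica flushes & fsyncs the relay log and then sends an ACK when it has received enough EGs (configuration parameter: fullsync_fsync_ack_least_txns), or enough binlog bytes (configuration parameter: fullsync_fsync_ack_least_event_bytes), or when too much time has passed since the last ACK (fullsync_fsync_ack_wait_max_milli_secs).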
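The three flush triggers can be expressed as a small batching helper. This Python sketch only models the decision; the class `RelayLogBatcher` is hypothetical, its constructor arguments mirror the three configuration parameters named above, and time is passed in explicitly as milliseconds to keep the example deterministic.

```python
class RelayLogBatcher:
    """Decides when a replica should flush & fsync the relay log and send an ACK."""

    def __init__(self, least_txns, least_event_bytes, wait_max_ms):
        self.least_txns = least_txns
        self.least_event_bytes = least_event_bytes
        self.wait_max_ms = wait_max_ms
        self.pending_txns = 0
        self.pending_bytes = 0
        self.last_ack_ms = 0

    def should_sync(self, now_ms):
        """True if any of the three thresholds is reached and something is pending."""
        if self.pending_txns == 0:
            return False
        return (
            self.pending_txns >= self.least_txns
            or self.pending_bytes >= self.least_event_bytes
            or now_ms - self.last_ack_ms >= self.wait_max_ms
        )

    def on_event_group(self, size_bytes, now_ms):
        """Account for one complete event group, then evaluate the triggers."""
        self.pending_txns += 1
        self.pending_bytes += size_bytes
        return self.should_sync(now_ms)

    def mark_synced(self, now_ms):
        """Call after the fsync and ACK; returns how many txns the ACK covered."""
        covered = self.pending_txns
        self.pending_txns = 0
        self.pending_bytes = 0
        self.last_ack_ms = now_ms
        return covered
```

Larger thresholds mean fewer fsync calls per transaction at the cost of a longer wait on the primary, which is the trade-off these parameters expose.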
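An ACK contains the replica's server_id and the end position (file index and offset), in the primary's binlog, of the last EG written to disk. Once the primary receives an ACK from a replica, it knows that this replica has received and durably stored every EG before the ACK position.
The global variable fullsync_relaylog_fsync_ack_level controls how the Fullsync mechanism on a replica flushes & fsyncs the relay log and sends ACKs. Its values mean the following:
fullsync_relaylog_fsync_ack_level | Replica behavior |
0 | Neither flush nor fsync the relay log, and send no ACK |
1 | Flush but do not fsync the relay log, and send no ACK |
2 | Flush & fsync the relay log, then send an ACK |
Turning on log_fullsync_replica_acks on the primary records every received ACK in the mysqld runtime log. This feature is only for debugging the replica ACK mechanism. Never turn it on in a production system, because it severely degrades performance.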
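An ACK can be delivered to the primary in either of two ways:
a. COM_BINLOG_ACK: compile a program against the Klustron-storage client library and its mysql.h header file, then call mysql_send_binlog_ack() to send the ACK. The Klustron-storage fullsync feature itself uses this method to send ACKs to its primary.
b. The SQL statement SLAVE server_id CONSISTENT TO file_index offset: this method works with any community-edition mysql client library. A Klustron-storage primary handles the statement correctly and treats it as an ACK. It is particularly suitable for binlog storage components of all kinds.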
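For method b, a binlog storage component only needs to build the statement text and execute it over an ordinary MySQL connection. The Python sketch below builds the text by substituting integers into the template quoted above; the helper name is hypothetical, and the exact token separators are assumed to match that template literally. Executing the statement is left to whichever MySQL client library is in use.

```python
def binlog_ack_statement(server_id, file_index, offset):
    """Build the ACK statement text from the template
    'SLAVE server_id CONSISTENT TO file_index offset'."""
    for name, value in (("server_id", server_id),
                        ("file_index", file_index),
                        ("offset", offset)):
        # Accept only non-negative integers so nothing unexpected ends up in the SQL text.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    return f"SLAVE {server_id} CONSISTENT TO {file_index} {offset}"
```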
3. Klustron-storage Fullsync Status Variables
Status variable name | Meaning |
fullsync_received_replica_acks | No. of replica ACKs received |
fullsync_old_acks_received | No. of obsolete ACKs received, i.e. ACKs for a position already ACK'ed by previously received ACKs |
fullsync_txns_acked | No. of txns the replica ACK'ed |
fullsync_txns_fully_acked_before_wait | No. of txns fully ACK'ed before they started to wait, i.e. when the txn was about to wait for ACKs it had already received all needed ACKs from replicas |
fullsync_txns_acked_before_wait | No. of txns partly ACK'ed before they started to wait, i.e. when the txn was about to wait for ACKs it had already received some of the needed ACKs from replicas |
fullsync_txns_long_wait_warnings_for_acks | No. of long-wait warnings for ACKs: the wait did not time out but was long enough to trigger a fullsync warning |
fullsync_txns_timed_out_waiting_for_acks | No. of txns that timed out waiting for ACKs |
fullsync_txns_received_by_replica | No. of txns received by the replica |
fullsync_relay_log_syncs | No. of relay log syncs |
fullsync_acks_sent_to_master | No. of ACKs sent to master |
fullsync_num_txns_in_acked_group | Set by a replica: No. of txns flushed and fsync'ed for the latest ACK |
fullsync_replica_skipped_old_trx_acks | No. of times the replica skipped sending ACKs because the received txns were too old |
fullsync_replica_ack_upto_file and fullsync_replica_ack_upto_offset | Fullsync replicas have ACKed up to this position (file and offset within master's binlog file) |
fullsync_replica_fully_acked_upto_file and fullsync_replica_fully_acked_upto_offset | Fullsync replicas have fully ACKed up to this position (file and offset within master's binlog file) |
fullsync_latest_recvd_trx_ts | Timestamp on master node of the latest received transaction, i.e. the time when the transaction was flushed to the master's binlog file. It can be used to measure IO thread latency |
fullsync_replica_ack_timedout | Whether the master node timed out waiting for replica ACKs |
fullsync_effective | Whether fullsync is effective on this master or slave node |
fullsync_num_waiting_txns | No. of transactions currently waiting for ACKs on master node |
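These counters can be combined into rough indicators of how well a replica batches its work. The Python sketch below assumes the values were fetched into a dictionary (for example from SHOW GLOBAL STATUS, where values arrive as strings). The two ratios and the function name are illustrative suggestions, not metrics defined by Klustron-storage.

```python
def fullsync_batching_summary(status):
    """Derive average batch sizes on a replica from fullsync status counters."""
    received = int(status.get("fullsync_txns_received_by_replica", 0))
    syncs = int(status.get("fullsync_relay_log_syncs", 0))
    acks = int(status.get("fullsync_acks_sent_to_master", 0))
    return {
        # Average number of transactions covered by one relay log fsync.
        "txns_per_relay_log_sync": received / syncs if syncs else None,
        # Average number of transactions covered by one ACK.
        "txns_per_ack": received / acks if acks else None,
    }
```

A ratio close to 1 suggests the replica fsyncs almost once per transaction, which may indicate that the batching thresholds are set very low or that the write load is light.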
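4. Fullsync Configuration Parameters
Klustron-storage Fullsync offers a rich set of configuration parameters so that users can strike a suitable balance among performance, resource consumption, and consistency. All of these are MySQL global variables. Their meaning and usage are described in the table below.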
Fullsync variable | Meaning |
fullsync_consistency_level | At the end of transaction commit, whether and how to wait for fullsync replica ACKs before replying to the client that a transaction has committed. 0: no wait; 99: wait for simple majority replicas; 100: wait for all replicas; [1, 98]: wait for this number of ACKs. |
fullsync_relaylog_fsync_ack_level | When fullsync is enabled, how the replica should fsync the relay log and/or reply with an ACK to the primary after it writes its received event group(s) to the relay log file. 0: don't fsync or send ACK; 1: don't fsync but send ACK; 2: fsync and send ACK. |
fullsync_fsync_ack_least_event_bytes | Accumulate at least this many relay log bytes before fsync'ing the relay log and sending an ACK. |
fullsync_fsync_ack_least_txns | Accumulate at least this many event groups before fsync'ing the relay log and sending an ACK. |
fullsync_fsync_ack_wait_max_milli_secs | Replica nodes wait at most this many milliseconds for more event groups to arrive before fsync'ing the relay log and sending an ACK. |
skip_fullsync_replica_acks_older_than | If a replica is this many milliseconds behind the primary node, skip fsync'ing the relay log and sending ACKs. |
fullsync_warning_timeout | If a replica ACK arrives more than this many milliseconds after the transaction started to wait for it, write a warning to the error log. |
fullsync_timeout | If a replica ACK has not arrived this many milliseconds after the transaction started to wait for it, return an error to the client and write an error to the error log. |
log_fullsync_replica_acks | Whether to log replica ACKs to the mysqld error log. Note that when fullsync is enabled there can be a huge amount of such log entries, which are seldom used. |
enable_fullsync | Whether to enable the fullsync mechanism. |
disable_fullsync_on_slave_ack_timeout | Whether to disable fullsync when replicas do not ACK in time and a timeout happens. If this is false, the primary node cannot be written when it has no running replicas. |
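The encoding of fullsync_consistency_level can be decoded into a concrete number of required ACKs. The Python sketch below follows the value ranges in the table. One point is an assumption: for level 99 it counts the primary itself toward the simple majority, matching the 2*n+1 example where n replica ACKs are enough. The function name is hypothetical.

```python
def required_acks(level, num_replicas):
    """Number of replica ACKs a transaction waits for, per fullsync_consistency_level."""
    if level == 0:
        return 0                        # no wait
    if level == 100:
        return num_replicas             # wait for all replicas
    if level == 99:
        # ASSUMPTION: the primary counts toward the majority, so a shard with
        # 2*n replicas needs n replica ACKs.
        return (num_replicas + 1) // 2
    if 1 <= level <= 98:
        return level                    # wait for exactly this many ACKs
    raise ValueError("fullsync_consistency_level must be between 0 and 100")
```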
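Advantages of Klustron-storage Fullsync
Compared with MySQL's semi-synchronous (semisync) plugin, the Klustron-storage fullsync mechanism has the following advantages.
A replica accumulates the binlog of several transactions (the amount is configurable) before it fsyncs the relay log to make sure the binlog reaches disk. This avoids the serious problem of a replica losing its most recent relay log after a power loss, OS crash, or restart, without imposing a heavy write load on the storage device. The result is a good balance among performance, latency, and storage resource consumption.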