Key Technologies of Kunlun-Cantian
Key Technologies of Kunlun-Cantian
Introduction
The Kunlun-Cantian Database is a new type of centralized database product developed by Zet拓 Technology based on Huawei's Cantian engine. As a new product jointly launched by Zet拓 Technology and Huawei Storage, it combines the powerful computing layer performance of the Kunlun Database with the excellent storage processing capabilities of the Cantian engine, achieving significant improvements in database functionality, performance, and stability across various scenarios. This sharing will mainly introduce some key kernel technologies in this new product, as well as new features that will be developed based on it, hoping to give everyone a preliminary understanding of the Kunlun-Cantian Database and a deeper understanding of the new technologies in its kernel.
The golden sentence of this article: The Kunlun-Cantian Database combines the powerful computing layer performance of the Kunlun Database with the excellent storage processing capabilities of the Cantian engine, achieving significant improvements in database functionality, performance, and stability across various scenarios.
01 Background of Kunlun-Cantian Database
First, let's look at the problems we often encounter when using common databases, such as MySQL, for high availability.
When using MySQL, we usually use a master-slave replication architecture based on Binlog for high availability. In this architecture, we usually encounter the following problems:
Performance bottleneck of the master node transaction group commit.
All threads that submit transactions concurrently are blocked;
One thread appends the Binlog events that have been written in the private cache of each transaction to the Binlog file in turn;
Measurements show that: if Binlog is turned off, performance can be doubled.
Master-slave delay: no master is available after the master node crashes.
Typical causes: tables without primary keys and unique indexes, later hash mechanisms have been slightly improved;
Typical causes: a large number of inserts, updates, and deletes in the same transaction;
Typical causes: the master node has insufficient concurrency and the slave machine joins later; large transactions (2 times IO, written and then sent).
Long recovery time: storage engine recovery, then MySQL server layer Binlog recovery.
- Scan the last Binlog file, which may be very large.
[Image: Kunlun-Cantian related diagram]
If we use MySQL MGR or Galera for high availability, we may also encounter the problem of excessive overhead for conflict detection during multi-write, specifically:
Tables must have primary keys or unique indexes for hashing to identify rows as ids;
During the transaction, the collection of ids of rows increased, deleted, or changed;
Before the transaction is committed, the collection of ids of increased, deleted, and changed is sent to other nodes, and conflicts are checked;
If there are concurrent transactions with conflicts, the transaction is rolled back, wasting computing overhead.
No conflict, then the Binlog of the transaction is distributed through the raft protocol;
Significant performance overhead: a large number of transactions submitted at the same time are queued and waiting;
Holding locks to complete the above operations, causing a large number of concurrent transactions running conflicts to be blocked.
[Image: Kunlun-Cantian related diagram]
Therefore, in order to solve the problems described above, Zet拓 has joined hands with Huawei to launch a new Kunlun-Cantian solution. Its main technical points are:
Both storage nodes have their buffer pool and redo log, shared storage, similar to Oracle RAC architecture;
Concurrency control;
Page transaction row lock for concurrency control of inserts, deletes, and changes;
Based on MVCC for select concurrency control;
Avoids the huge performance loss of transaction conflict detection when committing;
The redo log is written during the transaction, and it is submitted in parallel without queuing;
Solves the consistency problem of MySQL innodb in the same transaction select -> update.
Hash distribution to Cantian nodes by page, both nodes can be updated;
Crash recovery: both nodes recover in parallel;
There is no requirement for tables without PK, UK, avoiding the performance bottleneck and long-term lock block of a large number of conflicting transactions during submission;
It is possible to merge the update operations at the connection point to generate Binlog for use by MySQL Binlog ecological tools.
[Image: Kunlun-Cantian related diagram]
As we introduced in the introduction, Kunlun-Cantian combines the powerful computing layer performance of the Kunlun Database and the excellent storage processing capabilities of the Cantian engine. Therefore, before delving into the details of the Kunlun-Cantian Database, let's briefly introduce the Kunlun Klustron distributed database and Huawei Cantian engine.
First, let's briefly introduce the core architecture of our Zet拓 Technology's distributed data product Klustron:
Klustron's Distributed Storage and Computing Separation Architecture
Computing layer (Klustron-server): Multiple PostgreSQL instances form computing nodes that are responsible for accepting connection requests from the application software end, as well as accepting SQL query requests from established connections, executing requests, and then returning query results;
Storage layer (Klustron-storage): Three or more MySQL 8.0 instances form a storage cluster (storage shard, referred to as a shard), each shard stores a portion of user tables or table partitions;
The metadata cluster stores the metadata of the Klustron cluster, including topology structure, node connection information, DDL logs, commit logs, and other cluster management logs;
Cluster_mgr cluster is responsible for maintaining the correct cluster and node status, achieving cluster management, cluster logical backup and recovery, cluster physical backup and recovery, horizontal scaling, and other functions.
[Image: Kunlun-Cantian related diagram]
Next, let's briefly understand Huawei's Cantian engine:
Cantian engine is a storage engine launched by Huawei Storage, which is based on shared storage and supports multi-point writing. The main features of the Cantian engine are:
Global distributed caching technology is used to ensure cross-node data consistency and achieve multi-read multi-write clusters;
Distributed MVCC technology greatly improves the performance of MVCC transaction submission;
Multi-active cluster high availability technology, achieving rapid node failure switching.
As shown in the figure on the right, the difference between the Cantian engine and the traditional MySQL one master and two slave architecture is that it supports concurrent writing of the two masters and high availability through its internal management modules (cluster management, resource management, caching management, lock resource management, etc.).
[Image: Kunlun-Cantian related diagram]
02Architecture and Core Technologies of Kunlun-Cantian Database
Architecture and Advantages of Kunlun-Cantian Database
The storage nodes of the Kunlun-Cantian Database use the Cantian engine.
No other storage engines are used, including InnoDB; Binlog is not required, but Binlog can be generated;
Catalog exists in the Cantian engine;
Completely retains the SQL syntax, connection protocol, and transaction processing functions of MySQL;
Data is stored in multiple copies in the Cantian storage system, and based on its high availability, the storage system can be deployed across multiple rooms;
Cantian can generate merged Binlog for data updates from two master nodes.
Ultimate performance.
Storage nodes do not need to output Binlog, and the concurrency and performance of transaction processing are greatly improved;
Avoids the inherent master-slave delay issues of MySQL Binlog replication under special circumstances;
Huawei's high-performance storage system has high IO bandwidth and low latency: the most fundamental performance guarantee;
Dual master nodes, better read and write scalability, transparent read consistency;
Shorter RTO time (Cantian engine high-speed parallel recovery VS. MySQL two-layer serial recovery)
[Image: Kunlun-Cantian related diagram]
Applicable Scenarios for Kunlun-Cantian Database
The Kunlun-Cantian Database has the following features:
Achieve high performance under fewer constraints.
Frequent long transactions or large transactions;
Frequent queries and updates to tables without primary keys and unique indexes;
Occasional complex queries (multiple table connections, subqueries, window functions).
Better write scalability under heavy write load.
Read consistency and read scalability are obtained at the same time.
- MySQL read-write separation (reading from the slave machine) cannot achieve read consistency in a non-intrusive manner.
High-performance OLAP for BI and reporting loads (relying on Klustron Tornado).
More reliable HA, slave machines will not expire and fail due to master-slave delays.
Need faster crash recovery (shorter RTO).
Users who originally used Oracle RAC need shorter RTO.
Due to the above characteristics, Kunlun-Cantian Database can break through various restrictions and has broader support for scenarios under various conditions.
[Image: Kunlun-Cantian related diagram]
Advantages and Value of Kunlun-Cantian Database:
Completely retains the powerful functions and ecological compatibility capabilities of Klustron.
The ecological compatibility capability of Klustron-server, making full use of the technical resources of the PostgreSQL community;
Support for JSON, vector (pgvector) data management;
Support for PostgreSQL and MySQL dual protocols and dual syntax;
Support for machine learning (PostgresML, Python stored procedures with ML libraries);
Various FDWs access various external data sources (object storage, databases);
Advanced SQL functions (MySQL does not have the 2nd and 3rd);
Stored procedures, triggers, materialized views, CTE;
Multi-level fine-grained access control, data validation & constraints;
OLAP-related query functions (window function, grouping-sets, cube, roll-up, etc.).
Klustron-server's high-performance distributed query processing technology system.
- Fully utilize hardware performance.
[Image: Kunlun-Cantian related diagram]
03 High-Performance Query Processing Technology of Kunlun-Cantian Database
3.1 Distributed Parallel Query Execution Technology
In the computing layer of the Kunlun-Cantian Database, we use the same computing layer architecture as the distributed database Klustron, providing distributed parallel query technology based on multiple computing nodes.
The details of this distributed query infrastructure mainly include the following points:
Within the computing node --- Inherited from PostgreSQL's parallel query architecture.
Multi-process architecture (leader, worker) and Gather mechanism;
Parallel Append, Parallel Aggregate, Parallel Join;
Remote Plan;
Parallelism within the computing node is fully based on the original parallel query framework.
Parallel query capability between computing nodes and storage nodes --- Newly designed and developed by Klustron.
Can work independently of the parallel query framework of the computing node, or collaborate with it;
Parallel Remote Scan: multi-process division and parallel execution of Remote Scan subtasks;
Utilize multiple slave machines to improve the scalability of read-only queries;
Reduce query overhead;
Support functions of the storage node --- Newly designed and developed by Klustron;
Multiple backend connections (threads) execute the same client connection's query subtasks;
Transaction snapshot sharing technology.
[Image: Distributed parallel query execution technology diagram]
Asynchronous communication between computing nodes and storage nodes.
SQL statements are sent asynchronously on multiple connections, without blocking and waiting for query results;
- Can be used to drive multiple shards to execute write commands (update, delete, insert) in parallel;
Can be used for parallel execution of read-only queries (select) on a single or multiple nodes.
Asynchronous issuance of remote plan SQL statements.
Supported plan nodes: MergeAppend, Append, ModifyTable;
All are issued and then waited for the remote plan query results to be collected uniformly;
Parallel Remote Scan: read-only query task decomposition and multi-threaded execution;
The remote scan task is divided into multiple subtasks, each subtask returns a sub-result set;
- Use primary keys or unique indexes for range division;
Asynchronously sent in multiple connections for parallel execution by storage nodes;
Multiple worker processes can independently decompose tasks, send tasks, collect, and summarize sub-results;
Supported by Klustron version 1.2 and above.
[Image: Asynchronous remote plan execution diagram]
Read-write separation, multiple slave machines execute SELECT.
Avoid affecting the performance of the master node;
Parallelism is achieved among multiple slave machines in the same shard.
Prepared statements are used between computing nodes and storage nodes.
Suitable for clients sending prepared statements or plain text statements;
Avoid repeated parsing and optimization of query subtasks by storage nodes;
Optimize the performance of LIMIT ... OFFSET: use fetch(N) to incrementally pull query results.
- Stop halfway to avoid unnecessary query execution and result transmission; Avoid IO load surge caused by read and write temporary tables.
- Support functions of Kunlun-storage.
Multiple backend connections share the same snapshot;
Multiple threads execute multiple query subtasks in parallel to ensure the consistency of query results within the same node.
Use server-side read-only cursors to obtain results;
- Avoid materializing (materialize) query results to temporary tables on the computing node;
- Avoid caching query results to temporary tables on the storage node;
- The fetch can execute other queries or other prepared statement fetches between fetches.
[Image: Kunlun-Cantian related diagram]
Below, we provide a simple example to illustrate parallel queries:
[Image: Kunlun-Cantian related diagram]
Taking the two tables in the figure on the left as an example of a join query, the execution plan on the right is when parallel query is not turned on. We can see that it simply splits the original query into several sub-queries to be executed separately according to the shard. The two queries take 250 and 247 milliseconds respectively.
[Image: Kunlun-Cantian related diagram]
When parallel query is turned on, there are three worker threads executing the execution plan at the same time, which greatly improves the execution efficiency. The query time is reduced to 92 and 71 milliseconds, only half of the non-parallel execution time, and the performance improvement is very significant.
3.2 JIT Code Just-in-Time Compilation Technology
Next, let's introduce the JIT code just-in-time compilation technology of the Kunlun-Cantian Database.
First, let's see why we need to adopt JIT technology. You can see the figure below.
[Image: Kunlun-Cantian related diagram]
From the flame graph above, we can see that the expression execution part of a simple query accounts for a very high proportion of the entire SQL execution process. Therefore, if there is a method to improve the execution efficiency of expression calculations, it can greatly accelerate the SQL execution process. This is exactly what code just-in-time compilation JIT can provide us.
In the figure below, we can see the difference between JIT and regular code execution.
[Image: Kunlun-Cantian related diagram]
JIT (Just-in-time compilation) is a technology that compiles code while it is running. It can analyze the code during execution and determine the parts of the code that can be accelerated by immediate compilation. In these parts, the performance improvement brought by compilation or recompilation will exceed the overhead of compiling the code, thus saving a lot of resources consumed in the function call process, and thus greatly improving the code execution efficiency.
In the figure on the left, the traditional code execution process requires three function calls to perform a simple expression calculation, and after JIT analysis, this expression calculation only requires a few simple assembly instructions to complete. Therefore, JIT compiles this expression calculation into more efficient and simple machine instructions, thereby accelerating the execution process.
The figure below shows the comparison of SQL execution processes before and after adopting JIT technology.
[Image: Kunlun-Cantian related diagram]
At present, the data types supported by the JIT of the Kunlun-Cantian Database include:
int, bigint, float, double, decimal
date, time, datetime, timestamp
varchar
The supported operators include:
binary algebras: Plus, Minus, Multiply, Div
binary comparators: GT, GE, LT, LE, EQ, NE, like
logical operators: And, Or, Not
unary operators: Abs
others: Between ... and, interval
After adopting JIT technology, the performance improvement of TPC-H: 20% ~ 60%+, and it can reach 10 times performance improvement in the future! Moreover, both computing nodes and storage nodes have JIT capabilities.
[Image: Kunlun-Cantian related diagram]
3.3 Tornado High-Performance Vectorized Execution Engine
Tornado High-Performance Vectorized Execution Engine is the latest SQL execution engine launched by Zet拓. It greatly accelerates the efficiency of SQL statement execution through vectorized execution technology.
First, let's understand what vectorization is. Vectorization is based on SIMD (single Instruction Multiple Data) technology. It accelerates the execution process of code by processing the same instruction for different data in batches.
SIMD history
First used in the ILLIAC IV large-scale parallel computer in the 1970s;
Large-scale application to consumer computers in the 1990s:
In 1996, Intel launched the X86 MMX instruction set extension;
In 99, Interl introduced the SSE instruction set extension that fully covers MMX;
... SSE3, SSE4, AVX, AVX2, and other instruction set extensions.
From the figure below, we can simply understand how SIMD processes the operation of a+b in batches.
[Image: Kunlun-Cantian related diagram]
At present, many analytical databases have adopted vectorized execution technology, especially analytical databases such as ClickHouse, DukeDB, etc., which greatly accelerate the execution process of analytical calculations.
The current architecture of ZeTuo's Tornado vectorized execution engine is as follows:
[Image: Kunlun-Cantian related diagram]
First, Tornado needs to convert the data into column storage in memory, and then execute it through vectorized functions and operators in a pipeline manner.
In the figure below, we can compare the execution process of Tornado.
[Image: Kunlun-Cantian related diagram]
In the example in the figure above, Tornado first converts the values of a and b into column storage, then processes the operation of a+b in batches, and then completes the execution of this SQL through filter and aggregate operators.
[Image: Kunlun-Cantian related diagram]
In the figure below, we can see how vectorized physical operators work with SIMD instructions.
[Image: Kunlun-Cantian related diagram]
04 Q & A
Q: Can there be multiple computing nodes in the Kunlun-Cantian Database? What about storage nodes?
A: Yes, there can be multiple computing nodes, but there can only be two storage nodes currently. This is because Cantian engine only supports dual masters at present. That is to say, the number of computing nodes can be configured as needed, but there are only two storage nodes, but the two storage nodes can write at the same time.