
KunlunBase Elastic Scalability and Data Distribution Capabilities



Why Database Systems Need Elastic Scalability

To ensure that application systems can keep growing in throughput, performance, and data volume while still providing a smooth user experience, it is essential to deploy a distributed database with excellent scalability on a cluster of many ordinary servers. Elastic scalability is the foundation of a distributed database's expansion capability, and data partitioning is a prerequisite for elastic scalability. KunlunBase meets all of these requirements, offering users practically unlimited performance and throughput and the capacity to handle ever-growing read/write loads and transaction volumes.

Cloud Computing and Storage Resource Pools

Why is elastic scalability a necessary capability for database systems? In today's world, where cloud computing technology is widespread, storage space shortages are no longer the reason for needing elastic scalability in database systems. Both in public and private cloud environments, distributed file systems allow for virtually unlimited storage space, as long as there is sufficient resource budget. Distributed storage technology pools the storage resources from numerous computer servers, creating a virtual 'disk' with unlimited storage space. Thus, the issue of insufficient external storage space is perfectly resolved for database systems.

However, a cluster's computing resources include not just external storage but also CPU and memory, both of which are limited in a single server. A simple analysis reveals that memory, unlike disk storage, cannot be pooled into a resource pool without significantly increasing read/write latency.

Memory Pooling Limitations

Consider the latency for one server's CPU to read or write another server's memory over a direct hardware connection: it exceeds 10 microseconds. The propagation delay alone, over an average in-data-center (IDC) connection length of 250 meters with signals traveling at roughly 2.5 x 10^8 m/s (close to the speed of light in fiber), is about 1 microsecond each way; on top of that come signal relays, RDMA interrupt handling, cache evictions, and especially the hardware and software delays in network devices such as network cards and switches. The total exceeds the typical local memory access time (usually within 100 nanoseconds) by more than 100 times, and in practical scenarios it can be several hundred to a thousand times higher. Such high latency means that accessing memory on another server within the same IDC falls far short of the performance expected of memory operations. Therefore, memory resources cannot be pooled across many servers to obtain effectively unlimited memory capacity; the performance loss would make such an approach impractical. And since memory cannot be expanded without bound, CPU instruction streams cannot be transparently dispatched across nodes either, so CPU resources cannot be pooled.

Public Cloud RDS Scalability Limited by Single Server Resources

The acceptable solution is to add more servers and scale the distributed database system out horizontally and elastically, which single-node databases (such as MySQL, PostgreSQL, and Oracle) cannot do. On public cloud platforms, users can scale the RDS instances of these single-node databases, for example from 2 cores and 4 GB to 8 cores and 32 GB, but only if the physical machine hosting the virtual machine has spare CPU cores and memory. The scalability of a single-node RDS instance is therefore capped by the maximum CPU core count and memory of the host machine, so pooled computing resources and automatic elastic scaling of the database system remain out of reach. Relying on the operating system's virtual memory to swap in and out a working set that does not fit into physical memory leads to frequent page swapping, which severely degrades the performance of both the database system and its host operating system.

Building a Server Cluster for Continuous Resource Expansion

When database loads are heavy enough to require thousands of CPU cores and tens of terabytes of memory, users will find that no budget can buy a single server of that size. A cluster of 64 similarly configured servers, however, each with 32 physical CPU cores, 512 GB of memory, and 16 TB of SSD storage, can be kept to a total cost of roughly 6 million RMB, an entirely affordable price for companies carrying such heavy business loads.

Let's also consider the elastic scalability of application software. With technologies such as Kubernetes and containerization, automated high availability and elastic scaling can now be achieved for many applications without any changes to the application itself. This, however, relies on an easily overlooked assumption: the application can leverage Kubernetes for automatic elastic scaling because it is stateless. For most application systems, persistent state is stored in the database system, so the database's elastic scalability is the fundamental guarantee of the overall elastic scalability of the application. If the database cannot scale out, the application's overall performance and throughput face a hard, insurmountable limit: no matter how much processing capacity the application layer adds, the database remains the bottleneck.

A related challenge is scaling up within a single machine: fully utilizing a powerful server with hundreds of CPU cores requires forward-looking architectural design and meticulous engineering. Without excellent scalability, even a powerful server cannot perform at its best. This is especially difficult for single-node databases, and distributed database systems likewise need careful system design and engineering to scale well on such hardware.

KunlunBase Offers Unlimited Horizontal Scalability

We firmly believe that distributed database systems are the key to providing limitless system performance using a large number of commodity servers. This hardware solution is cost-effective, has a clear and executable technical route, and poses no technical, cost, or vendor lock-in risks, making it the optimal universal technical route for all industries and application types.

In this approach, the role of Distributed Database Management System (DDBMS) software with excellent scalability is crucial. KunlunBase is an outstanding DDBMS.

First, a DDBMS needs to be able to utilize the hardware resources of numerous servers, distributing data roughly evenly across these servers. As data volume and data read/write load continuously increase, the DBA only needs to add more servers to the DDBMS. The DDBMS can then automatically move some data to these new servers and expand the computing and storage load to them, thereby rebalancing and improving the overall load capacity of the cluster. This is the elastic scalability of a DDBMS.

Equally important is the DDBMS's scalability. If a cluster of 10 servers can provide 100,000 TPS, then a cluster of 20 servers should be able to provide 200,000 TPS. This is the ideal "linear scalability."

Moreover, a DDBMS must effectively handle hardware, software, and network failures, ensuring that these failures do not cause data corruption, loss, or service interruption. This is the fault tolerance and disaster recovery capability of a DDBMS. Besides these two critical points, a DDBMS also needs convenient database cluster management, monitoring, troubleshooting tools, and APIs to enhance DBA efficiency.

KunlunBase possesses all these capabilities. This article mainly introduces KunlunBase's capabilities in data distribution and elastic scaling.

Database Design with Distributed Databases

In the era of traditional single-node databases, database design primarily involved designing and defining tables based on the architecture and functionality of the application system. This included defining the columns in each table, their data types and constraints, the table’s primary key, (unique) indexes, table-level constraints, triggers for each table, as well as defining stored procedures, views, and access control rules.

When using distributed databases, it is also necessary to design the partitioning rules and distribution strategies for the tables. KunlunBase's data table distribution method includes both partitioning rules and distribution strategies. Partitioning rules refer to how to split a table, which columns to use, and what calculations to perform as the basis for splitting. Distribution strategies refer to which shard a single table or table partition should be placed in. KunlunBase allocates table partitions to storage clusters based on the distribution strategy.

Partitioning rules and distribution strategies are transparent to the application. Users only need to read and write the root table and do not need to know how the table is partitioned or how its partitions are distributed across shards. The compute nodes are responsible for determining which shards and table partitions the data relevant to the user’s query reside in, and sending appropriate data read/write statements to the corresponding shards.

When data is split and stored across multiple shards, a single transaction often involves writing to data table partitions on multiple shards. KunlunBase's distributed transaction processing mechanism ensures that these transactions are reliably committed, and any node failure will not result in data loss, corruption, or inconsistency. Users also do not need to consider how a SQL statement involving partitioned tables stored across shards is handled. KunlunBase's distributed query processing mechanism encapsulates all these details internally, automatically executing any valid SQL statement.
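As a minimal sketch of this transparency (the table and values are illustrative only): the application writes ordinary SQL against the root table, and KunlunBase routes each row to the right partition and coordinates the cross-shard commit internally.

CREATE TABLE orders(id serial primary key, amount int) partition by hash(id);
CREATE TABLE orders_0 partition of orders for values with (modulus 2, remainder 0);
CREATE TABLE orders_1 partition of orders for values with (modulus 2, remainder 1);

BEGIN;
INSERT INTO orders(amount) VALUES (100), (200), (300), (400);  -- rows fan out across both partitions
UPDATE orders SET amount = amount + 10 WHERE id <= 4;          -- may touch partitions on different shards
COMMIT;  -- KunlunBase commits atomically across all shards involved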

The following sections introduce the partitioning rules, their usage scenarios, and SQL syntax, as well as KunlunBase's data table distribution strategies.

KunlunBase's Partitioning Rules

(Figure: illustration of partitioned tables and their shards)

KunlunBase supports three types of partitioning rules: Hash, Range, and List. When creating a partitioned table, users need to select an appropriate partitioning method based on the expected data distribution pattern and common query methods for the table. Users should also select one or more columns as the partitioning columns and specify the specific parameters required for the chosen partitioning method when creating partitions for the table.

Once this is done, the Kunlun-server will arrange the partitioned table to be stored on appropriate shards based on a series of distribution strategies. This information is recorded in the compute node's metadata table and propagated to other compute nodes via the DDL log replication mechanism. Thus, each compute node in a KunlunBase cluster knows which shard each partition of every partitioned table resides on.

This way, the compute nodes (Klustron-servers) of KunlunBase can use the data rows in the INSERT statement and the table's partitioning method and parameters to determine which partition each row should be inserted into, thereby sending the appropriate set of INSERT statements to the relevant shards to insert the final data into the corresponding data partitions. For UPDATE/DELETE/SELECT SQL statements, the compute nodes use the filtering conditions specified by the user to identify the partition containing the target rows based on the partitioning columns, sending specific SQL statements to those partitions. If the filtering conditions do not specify the partitioning columns, the compute nodes cannot determine the target partition. In such cases, they send the query, optimized by KunlunBase's distributed query mechanism, to all shards that contain the table’s partitions. The actual results from the storage nodes may show that some shards have rows matching the filter conditions while others do not.
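Continuing the orders example above, here is how this routing plays out in practice (the comments describe the behavior KunlunBase performs automatically):

-- The filter names the partitioning column: the compute node hashes the value
-- and sends the statement only to the shard holding the matching partition.
SELECT * FROM orders WHERE id = 3;

-- The filter does not mention the partitioning column: partitions cannot be
-- pruned, so the optimized query is sent to every shard that stores a
-- partition of orders, and the results are merged on the compute node.
SELECT * FROM orders WHERE amount > 150;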

KunlunBase Data Distribution Functionality and Usage

Using KunlunBase's Table Partitioning Feature

The following sections provide an overview of the applicable scenarios and syntax for each partitioning method. In KunlunBase, specifying a partitioning method for a table involves creating the table and its partitions. KunlunBase supports using multiple columns as partitioning columns, so the values of these columns together form a partition key vector for a row of data.

Hash

The Hash partitioning method is suitable for the following scenarios:

A. The table's data volume is roughly predictable, and the growth rate is neither rapid nor sudden. This is not a strict requirement. If the data volume of a hash-partitioned table changes significantly over time, leading to an inappropriate number of partitions, KunlunBase's repartitioning feature can be used to adjust the number of partitions.

B. Equi-join queries on one or a set of columns with high cardinality (i.e., columns with low duplicate values) are among the most frequent queries.

C. The distribution of rows across partition key vectors is relatively uniform.

A table fitting these criteria is suitable for hash partitioning, and these columns are appropriate as partitioning columns. If the frequently queried columns have high duplicate values, such as gender, it is better to add one or two more columns for hash partitioning to avoid severe data skew (i.e., some partitions having a lot of data while others have very little).

Common examples of information suitable for hash partitioning include names, IDs, email addresses, and phone numbers. The pros and cons of hash partitioning are very clear.

Pros: Simple to define, no need to know the distribution of field values.

Cons: If the distribution of partition key values is highly uneven, it may lead to severe data and system load imbalance.

The hash method requires selecting columns and the number of partitions. Below is an example of the syntax for creating a hash partitioned table:

CREATE TABLE t1(a serial primary key, b int, c varchar(32)) partition by hash(a);

CREATE TABLE t1_0 partition of t1 for values with (modulus 4, remainder 0);

CREATE TABLE t1_1 partition of t1 for values with (modulus 4, remainder 1);

CREATE TABLE t1_2 partition of t1 for values with (modulus 4, remainder 2);

CREATE TABLE t1_3 partition of t1 for values with (modulus 4, remainder 3);

Range

The Range partitioning method is suitable for the following scenarios:

A. Range queries on one or a set of columns with high cardinality (i.e., columns with low duplicate values) are among the most frequent queries.

B. The distribution of partition key values is highly uneven.

Common examples of information suitable for range partitioning include quantities, dates, times, and IDs.

Pros: Allows precise definition of data partitioning rules.

Cons: As the data volume changes, the distribution rules may become unsuitable. In such cases, KunlunBase's repartitioning feature can be used to adjust partition boundaries and quantities.

The range method requires selecting partitioning columns and partition boundaries. For example, the following statements create a table t1:

CREATE TABLE t1(a serial primary key, b int, c varchar(32)) partition by range(a);

CREATE TABLE t1_0 partition of t1 for values from (MINVALUE) to (100000);

CREATE TABLE t1_1 partition of t1 for values from (100000) to (1000000);

If t1's partitioning column is a, the partition boundaries are scalars like 100000 and 1000000. As more data is added to t1, additional partitions can be created as needed. Users can also create range-partitioned tables with multiple partitioning columns:

CREATE TABLE t2(a serial primary key, b int, c varchar(32)) partition by range(c, a);

For t2, the partitioning columns are (c, a), so the partition boundaries are tuples such as ('abc', 100000) and ('ufo', 1000000):

CREATE TABLE t2_0 partition of t2 for values from (MINVALUE, MINVALUE) to ('abc', 100000);

CREATE TABLE t2_1 partition of t2 for values from ('mvp', 100000) to ('ufo', 1000000);
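Returning to the single-column table t1: as its data grows, additional range partitions can simply be appended, for example:

CREATE TABLE t1_2 partition of t1 for values from (1000000) to (10000000);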

List

The List partitioning method is suitable for scenarios where the number of different values in the partition key is relatively small (e.g., dozens), the data volume corresponding to each partition key vector is not significantly different, and the main query pattern for the partition key is equality queries.

Common examples of information suitable for list partitioning include country names, province or city names, and names or categories of items with a small number of distinct values (e.g., less than a few hundred). It pairs well with KunlunBase's enum type.

The pros and cons of the list partitioning method are:

Pros: The table definition is simple and intuitive.

Cons: Accurate knowledge of the data distribution in the partition key is required; otherwise, data can become severely skewed across partitions. If severe skew occurs, adding another level of partitioning can help resolve the issue.

List partitioning requires setting the partitioning columns and their corresponding set of partition key vectors, for example:

CREATE TABLE t1(a serial primary key, b int, province varchar(32), city varchar(32)) partition by list(province);

CREATE TABLE t1_0 partition of t1 for values in ('beijing');

CREATE TABLE t1_1 partition of t1 for values in ('guangdong');

Multi-level Partitioning and Default Partitions

KunlunBase supports multi-level partitioning and default partitions to address real-world production scenarios where data volumes across different partitions under specific partitioning rules often vary significantly, leading to severe data skew. Multi-level partitioning allows a table to be split into an unbalanced tree where the data volumes of all leaf nodes are similar, thus mitigating the uneven data distribution problem of single-level partitioning.

For instance, many large companies segment their business teams, sales channels, and customer resources by province, but the population size varies significantly across provinces. After partitioning by province, further partitioning by city may be necessary for populous provinces like Guangdong, where mega-cities like Shenzhen, Guangzhou, and Dongguan require secondary partitions, while smaller cities in the province can be grouped into a 'default' partition. In contrast, provinces in the northwest may not require secondary partitioning. Moreover, cities like Shenzhen and Guangzhou may need tertiary partitions since each district has millions of residents, exceeding the total population of many cities.

Another example is an e-commerce live-streaming platform that range-partitions its order table by influencer ID. The challenge is that top influencers generate thousands of times more orders than ordinary ones. The partitions of top influencers can therefore be sub-partitioned by manufacturer ID, while ordinary influencers can be grouped into partitions covering 100 IDs each, with the remaining low-volume IDs placed in a default partition.
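A minimal sketch of the province/city example, combining two levels of list partitioning with default partitions (the table and column names are illustrative):

CREATE TABLE customers(id serial, province varchar(32), city varchar(32), primary key(id, province, city)) partition by list(province);

-- Populous provinces get a second level of partitioning by city.
CREATE TABLE customers_gd partition of customers for values in ('guangdong') partition by list(city);
CREATE TABLE customers_gd_sz partition of customers_gd for values in ('shenzhen');
CREATE TABLE customers_gd_gz partition of customers_gd for values in ('guangzhou');
CREATE TABLE customers_gd_other partition of customers_gd DEFAULT;  -- smaller cities share one partition

-- Provinces with less data stay as single partitions; everything else falls into a default partition.
CREATE TABLE customers_qh partition of customers for values in ('qinghai');
CREATE TABLE customers_other partition of customers DEFAULT;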

Table Distribution Strategies

A distribution strategy answers the question: "Which shard should a single table or table partition be placed on?"

The ultimate goal of a distribution strategy is to optimize query performance, which can be achieved by applying different distribution strategies to tables with different access patterns. The metadata table pg_class in the compute node contains a column relshardid, which records the shard ID of each table. The compute node can use the data in the metadata tables pg_shard and pg_shard_node to obtain the connection information of the current primary node of each shard, thereby accessing the data of table partitions. KunlunBase currently supports the following distribution strategies:

Least Burden

When creating a table partition, assign it to the shard with the lowest average load.

Least Space

When creating a table partition, assign it to the shard with the least used storage space.

Random

When creating a table partition, assign it to a randomly selected shard.

Mirror (Replicated Table)

If a table is updated infrequently (e.g., no more than tens of thousands of times a day) and is relatively small (e.g., under 1 GB), yet frequently joins with large tables, it is a good candidate for a mirror table: a copy of the table is stored on every shard, and all copies hold identical data. When a mirror table is updated, the compute node updates all copies on all shards. When a mirror table joins another table t1, if t1 is a single (non-partitioned) table, Kunlun-server pushes the join down to the shard where t1 resides for a local join; if t1 is a partitioned table, Kunlun-server pushes the join down to all of t1's leaf partitions for local joins.

The advantage of a mirror table is that Kunlun-server can push down joins, especially in OLAP snowflake models where a large fact table (partitioned across all shards) joins multiple dimension tables (each a mirror table). In this case, Kunlun-server can push the join down to all shards, allowing each shard's fact table partitions to join with the shard's dimension table, and then collect the join results. Even the aggregate nodes above the join can be pushed down, enabling the use of computing resources from many shards to execute a single query.

When new shards are added, KunlunBase copies all mirror tables to the new shards, and it resolves conflicts between creating a mirror table and adding a shard at the same time, ensuring that a newly created mirror table also exists on the new shards.
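A sketch of the snowflake pattern described above. The exact option for declaring a mirror table can vary by version; shard=all below is an assumed spelling, analogous to the shard=<id> option shown under Manual Allocation later in this section:

-- Large fact table, hash-partitioned across the shards.
CREATE TABLE sales(id serial primary key, product_id int, amount int) partition by hash(id);
CREATE TABLE sales_0 partition of sales for values with (modulus 2, remainder 0);
CREATE TABLE sales_1 partition of sales for values with (modulus 2, remainder 1);

-- Small, rarely updated dimension table declared as a mirror table.
-- NOTE: shard=all is an assumption; check your version's syntax for mirror tables.
CREATE TABLE products(product_id int primary key, name varchar(64)) WITH (shard=all);

-- The join, and the aggregate above it, can be pushed down to every shard, where each
-- sales partition joins the local copy of products; the compute node merges the results.
SELECT p.name, sum(s.amount) FROM sales s JOIN products p ON p.product_id = s.product_id GROUP BY p.name;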

Table Grouping

In practical applications, some tables are closely related, frequently joined, or often updated within the same transaction. To enhance query performance and transaction handling, Klustron supports placing these tables on the same shard through the TABLEGROUP feature. This ensures that joins between tables in the same group can be pushed down to the storage shard, and updates to these tables within the same transaction can avoid two-phase commit, thereby improving performance.

When creating a table or table partition, you can specify that it belongs to a table group, for example: CREATE TABLE t1(a int) WITH (TABLEGROUP=tg1);

If not specified, the table does not belong to any table group. All tables or table partitions within the same table group are always kept on the same shard. A table in a TABLEGROUP cannot be migrated individually to another shard; KunlunBase prohibits this action. To migrate (e.g., during scaling), all tables within the table group must be migrated together to the new shard. Multiple TABLEGROUPs can exist in a database, along with tables that do not belong to any TABLEGROUP.
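A brief sketch building on the syntax above: two frequently joined and co-updated tables placed in the same table group (whether tg1 must be created beforehand depends on the version):

CREATE TABLE invoices(invoice_id serial primary key, customer_id int) WITH (TABLEGROUP=tg1);
CREATE TABLE invoice_items(item_id serial primary key, invoice_id int, qty int) WITH (TABLEGROUP=tg1);

-- Both tables always live on the same shard, so this join can be pushed down ...
SELECT i.customer_id, sum(t.qty) FROM invoices i JOIN invoice_items t ON t.invoice_id = i.invoice_id GROUP BY i.customer_id;

-- ... and a transaction touching both avoids two-phase commit.
BEGIN;
INSERT INTO invoices(customer_id) VALUES (42);
INSERT INTO invoice_items(invoice_id, qty) VALUES (1, 3);
COMMIT;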

Manual Allocation

KunlunBase also supports explicitly specifying the shard for a table or table partition, e.g., CREATE TABLE t1(a int) WITH (shard=3). While we do not recommend that users do this, we offer this flexibility for customizing KunlunBase's behavior. Additionally, users can set several other options in the CREATE TABLE statement.
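Whichever strategy placed a table or partition, a DBA can check the resulting placement from any compute node using the metadata described at the beginning of this section. The relshardid column of pg_class is documented above; the pg_shard columns used in the join below (id and name) are assumptions that may vary by version:

-- Tables and partitions together with the ID of the shard that stores them.
-- (relshardid = 0 is assumed to mean "not stored on any storage shard".)
SELECT relname, relshardid FROM pg_class WHERE relshardid <> 0;

-- Hypothetical join to resolve shard names; combine with pg_shard_node for the
-- connection information of each shard's current primary node.
SELECT c.relname, s.name AS shard_name FROM pg_class c JOIN pg_shard s ON s.id = c.relshardid;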

Adjusting Partitioning Rules and Table Partition Locations

To adjust a table's partitioning rules, use Klustron's repartition feature to automate the adjustment. To move data partitions to other shards, such as new shards added during scaling, use Klustron's table migration feature to complete the task automatically.

Summary

Using a distributed database with excellent scalability ensures that the throughput and data volume of application systems can continue to grow. Elastic scalability is the foundation of a distributed database's expansion capability, and data partitioning is the prerequisite for elastic scalability. KunlunBase supports a variety of table partitioning rules and distribution strategies, enabling users to define the distribution of their data for optimal performance based on its characteristics and access patterns. KunlunBase also supports Online DDL & Repartition functions, allowing users to modify table partitioning rules and distribution strategies as needed.
