Klustron Customizable Data Sharding Schema

Database and Schema

In Klustron, data is organized into three levels: database, schema, and table (along with indexes, stored procedures, and other database objects). A database can contain any number of schemas.

A schema is a logical grouping of tables and other database objects. Each database in Klustron has a default schema named public, and users can add more schemas as needed.

Each schema can contain any number of tables. If no schema is specified in a reference to a database object such as a table, index, or stored procedure, the public schema is always used. To explicitly reference an object in a schema, prefix the object name with the schema name, as in schemaName.tableName. When connecting to a Klustron computing node, a database must be specified, and the connection can only access the schemas, tables, and other database objects in that database; it cannot access schemas, tables, or other database objects in other databases. System metadata, however, can be accessed from a connection to any database.
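For example, the following psql sketch (the schema and table names are hypothetical) shows how schema qualification works:

postgres=# create schema sales;
postgres=# create table sales.orders(id int primary key, amount int);
postgres=# create table orders(id int primary key);
postgres=# select * from sales.orders;  -- explicit schema reference
postgres=# select * from orders;        -- no schema given, resolves to public.orders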

In PostgreSQL and Klustron, a schema is also called a namespace, because it is a logical grouping (a namespace) of database objects. The table that stores schema metadata is called pg_namespace, and the table that stores database metadata is called pg_database.

For each schema in each database of Klustron, there is a corresponding MySQL database on every backend storage cluster to store all the table shards. The naming convention is that a schema named mySchema in a database named myDB on a computing node corresponds to a database named myDB_$$_mySchema on each MySQL storage cluster. Any table or table partition in myDB.mySchema corresponds to a data table stored in the myDB_$$_mySchema database of a storage cluster. Whenever a new schema is created, the corresponding database for that schema is automatically created in all current storage clusters.
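As a sketch (the schema name sales is hypothetical), creating a schema on a computing node connected to the postgres database:

postgres=# create schema sales;

should then make the matching database visible on the master node of every storage cluster:

mysql> show databases like 'postgres_$$_sales';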

If a user creates a non-partitioned table myDB.mySchema.t1, a storage cluster (say, SS1) is automatically assigned to store t1, and Klustron creates a table named t1 in the myDB_$$_mySchema database of SS1 to store the data of myDB.mySchema.t1. Klustron records in the metadata tables of the computing node that t1's data is stored in SS1.

This way, when reading and writing data, the computing node can automatically locate the storage node that holds t1 and interact with it to read and write the data of the myDB.mySchema.t1 table.

If a user creates a partitioned table t2, the user also needs to create its table partitions (the leaf nodes of the partition tree) to store the data, such as t20, t21, and t22. Each leaf partition is handled like a non-partitioned table: the computing node automatically assigns it to a backend storage cluster and creates a table with the same name there.
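A minimal psql sketch of such a table, assuming hash partitioning on column a (the names t2, t20, t21, t22 follow the text above):

postgres=# create table t2(a int primary key, b int) partition by hash(a);
postgres=# create table t20 partition of t2 for values with (modulus 3, remainder 0);
postgres=# create table t21 partition of t2 for values with (modulus 3, remainder 1);
postgres=# create table t22 partition of t2 for values with (modulus 3, remainder 2);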

All single (non-partitioned) tables and table partitions in all schemas of all databases of Klustron are automatically distributed evenly across all backend storage clusters. If a storage cluster is added, Klustron automatically migrates a portion of the table shards to the new cluster to restore the uniform distribution.

Therefore, if users need to group data by business logic while still joining tables across groups, these groups should be separate schemas within the same database. Users also need to decide, based on their business and data characteristics, whether a table needs to be partitioned (for example, a small table of under 100,000 rows or 100 MB may not need partitioning), and if so, which columns to use as partitioning columns and how to partition them (hash, range, or list). After deciding on the partitioning method, each specific partition needs to be created by a DBA.
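For instance, a hypothetical orders table partitioned by range on a date column could be set up like this:

postgres=# create table orders(id int, created date not null) partition by range(created);
postgres=# create table orders_2023 partition of orders for values from ('2023-01-01') to ('2024-01-01');
postgres=# create table orders_2024 partition of orders for values from ('2024-01-01') to ('2025-01-01');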

Because data in different databases of a Klustron cluster is completely independent and cannot be accessed together in the same database connection, we do not recommend creating many databases in a Klustron cluster. If data tables need to be separated by business, create one or more schemas per business within the same database, and then create the tables in those schemas.

This way, the same database connection can operate on all tables in all schemas of its database. Using different Klustron clusters for different databases benefits per-database allocation of computing/storage resources, as well as per-database backup/restore/rollback management.
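As a sketch of this layout (the business schemas billing and crm and their tables are hypothetical):

postgres=# create schema billing;
postgres=# create schema crm;
postgres=# create table crm.customers(id int primary key, name text);
postgres=# create table billing.invoices(id int primary key, customer_id int, amount int);
postgres=# select c.name, i.amount from crm.customers c join billing.invoices i on i.customer_id = c.id;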

Storage Cluster Metadata

The metadata tables of each computing node in the Klustron cluster contain information about all storage clusters in the cluster.

The pg_shard table stores information about each storage cluster, including its ID, name, data volume, and number of table shards; the pg_shard_node table stores metadata about each storage node of each storage cluster, including its IP address, port, user name, and password.

We added a field called relshardid to the pg_class table; it stores the ID of the storage cluster where the corresponding data table for this table/index/sequence is located, and this ID refers to the id column of the pg_shard table.

The computing node uses this information to find the storage cluster and node where each table and other database object is located, then connects to that node and interacts with it to complete data reads, writes, and transaction processing.
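For example, a query along the following lines (the table name t1 is just an illustration) resolves a table to its storage cluster and storage nodes:

postgres=# select c.relname, s.name, n.ip, n.port from pg_class c join pg_shard s on s.id = c.relshardid join pg_shard_node n on n.shard_id = s.id where c.relname = 't1';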

Example Explanation

The Klustron cluster we are using has two storage clusters with the following details:

postgres=# select t1.name, t2.shard_id, t2.ip, t2.port, t2.user_name, t2.passwd from pg_shard t1, pg_shard_node t2  where t2.shard_id=t1.id;
  name  | shard_id |    ip     | port | user_name | passwd
--------+----------+-----------+------+-----------+---------
 shard1 |        3 | 127.0.0.1 | 4001 | pgx       | pgx_pwd
 shard2 |        4 | 127.0.0.1 | 4002 | pgx       | pgx_pwd

In the postgres database there are two schemas, public and benchmarksql, so the databases postgres_$$_public and postgres_$$_benchmarksql exist on both the shard1 and shard2 storage clusters.

Connecting to their master nodes respectively, we can see:

mysql> select @@port;
+--------+
| @@port |
+--------+
|   4001 |
+--------+
1 row in set (0.00 sec)

mysql> show databases;
+--------------------------+
| Database                 |
+--------------------------+
| information_schema       |
| mysql                    |
| performance_schema       |
| postgres_$$_benchmarksql |
| postgres_$$_public       |
| regression_$$_public     |
| sys                      |
+--------------------------+
7 rows in set (0.00 sec)

mysql> select @@port;
+--------+
| @@port |
+--------+
|   4002 |
+--------+
1 row in set (0.00 sec)

mysql> show databases;
+--------------------------+
| Database                 |
+--------------------------+
| information_schema       |
| mysql                    |
| performance_schema       |
| postgres_$$_benchmarksql |
| postgres_$$_public       |
| regression_$$_public     |
| sys                      |
+--------------------------+
7 rows in set (0.02 sec)

Use the following query to see which schema each table belongs to (the nspname column), which storage cluster stores its table shard (the relshardid column), and other details. Then check the corresponding tables or table shards on the two storage clusters.

postgres=# select t1.nspname, t2.relname, t2.relshardid, t2.relkind from pg_namespace t1 join pg_class t2 on t1.oid=t2.relnamespace where t2.relshardid != 0 order by t1.nspname;
   nspname    |      relname       | relshardid | relkind
--------------+--------------------+------------+---------
 benchmarksql | customer           |          4 | r
 benchmarksql | hist_id_seq        |          3 | S
 benchmarksql | history            |          4 | r
 benchmarksql | oorder             |          3 | r
 benchmarksql | new_order          |          4 | r
 benchmarksql | order_line         |          3 | r
 benchmarksql | stock              |          4 | r
 benchmarksql | item               |          3 | r
 benchmarksql | warehouse          |          4 | r
 benchmarksql | district           |          3 | r
 public       | t101_pkey          |          4 | i
 public       | t102               |          3 | r
 public       | t102_pkey          |          3 | i
 public       | tt14t              |          4 | r
 public       | scores1            |          4 | r
 public       | scores1_pkey       |          4 | i
 public       | t1                 |          4 | r
 public       | uv_iocu_tab_a_seq  |          4 | S
 public       | uv_iocu_tab        |          4 | r
 public       | uv_iocu_tab_pkey   |          4 | i
 public       | warehouse2         |          4 | r
 public       | warehouse2_pkey    |          4 | i
 public       | district2          |          3 | r
 public       | warehouse1         |          4 | r
 public       | warehouse1_pkey    |          4 | i
 public       | district1          |          3 | r
 public       | district1_pkey     |          3 | i
 public       | customer1          |          4 | r
 public       | customer1_pkey     |          4 | i
 public       | history1           |          3 | r
 public       | orders1            |          4 | r
 public       | orders1_pkey       |          4 | i
 public       | new_orders1        |          3 | r

Due to the large number of rows, several lines are omitted here.

Connecting to the master node of the storage cluster on port 4001, you can see the databases corresponding to each database and schema combination of the computing node, as well as the data tables located in shard1.

mysql> use postgres_$$_benchmarksql
Database changed
mysql> show tables;
+------------------------------------+
| Tables_in_postgres_$$_benchmarksql |
+------------------------------------+
| district                           |
| item                               |
| oorder                             |
| order_line                         |
+------------------------------------+
4 rows in set (0.01 sec)

mysql> use postgres_$$_public
Database changed
mysql> show tables;
+------------------------------+
| Tables_in_postgres_$$_public |
+------------------------------+
| district1                    |
| district2                    |
| history1                     |
| history2                     |
| new_orders1                  |
| new_orders2                  |
| scores                       |
| stock1                       |
| stock2                       |
| t100                         |
| t102                         |
+------------------------------+
11 rows in set (0.01 sec)

Connecting to the master node of the storage cluster on port 4002, we can see the databases corresponding to each combination of non-system database and schema used by the computing node, as well as the data tables located in shard2.

mysql> use postgres_$$_benchmarksql
Database changed
mysql> show tables;
+------------------------------------+
| Tables_in_postgres_$$_benchmarksql |
+------------------------------------+
| customer                           |
| history                            |
| new_order                          |
| stock                              |
| warehouse                          |
+------------------------------------+
5 rows in set (0.02 sec)

mysql> use postgres_$$_public
Database changed
mysql> show tables;
+------------------------------+
| Tables_in_postgres_$$_public |
+------------------------------+
| customer1                    |
| customer2                    |
| item1                        |
| item2                        |
| order_line1                  |
| order_line2                  |
| orders1                      |
| orders2                      |
| scores1                      |
| students                     |
| t1                           |
| t101                         |
| tt14t                        |
| uv_iocu_tab                  |
| warehouse1                   |
| warehouse2                   |
+------------------------------+
16 rows in set (0.02 sec)

Table Storage Mechanism

Each partition of a partitioned table is stored in a specific backend storage cluster, and each non-partitioned table is also stored in a specific backend storage cluster. All these tables have their metadata stored in the computing node to support query optimization and execution, but their data is stored in the corresponding tables in the storage clusters. For example:

t10 is a partitioned table, and its three partitions t100, t101, and t102 are stored in the storage clusters with IDs 3, 4, and 3, respectively:

postgres=# \d+ t10;
                                                Table "public.t10"
 Column |  Type   | Collation | Nullable |             Default              | Storage | Stats target | Description
--------+---------+-----------+----------+----------------------------------+---------+--------------+-------------
 a      | integer |           | not null | "nextval"('t10_a_seq'::regclass) | plain   |              |
 b      | integer |           |          |                                  | plain   |              |
Partition key: HASH (a)
Indexes:
    "t10_pkey" PRIMARY KEY, btree (a)
Partitions: t100 FOR VALUES WITH (modulus 3, remainder 0),
            t101 FOR VALUES WITH (modulus 3, remainder 1),
            t102 FOR VALUES WITH (modulus 3, remainder 2)

postgres=# select relname, relshardid from pg_class where relname like 't10_' and relkind='r';
 relname | relshardid
---------+------------
 t100    |          3
 t101    |          4
 t102    |          3

Querying t10 from a computing node shows all of its data, which is located in its three partitions:

postgres=# select * from t10;
 a  | b
----+----
  2 |  2
  4 |  4
  6 |  6
  8 |  8
 15 | 13
  3 |  3
  7 |  7
 10 | 10
 13 | 11
 14 |
  1 |  1
  5 |  5
  9 |  9
 11 |
 12 |
(15 rows)

To view the definition of the tables corresponding to the partitions of t10 and the data in each table shard, you can query each corresponding table in each storage cluster. You can see that the union of the data in these table shards is equivalent to all the data rows obtained by querying t10 on the computing node.

mysql> show create table t100;
+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                   |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| t100  | CREATE TABLE `t100` (
  `a` int(11) NOT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`a`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show create table t102;
+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                   |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| t102  | CREATE TABLE `t102` (
  `a` int(11) NOT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`a`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from t100;
+----+------+
| a  | b    |
+----+------+
|  2 |    2 |
|  4 |    4 |
|  6 |    6 |
|  8 |    8 |
| 15 |   13 |
+----+------+
5 rows in set (0.00 sec)

mysql> select * from t102;
+----+------+
| a  | b    |
+----+------+
|  1 |    1 |
|  5 |    5 |
|  9 |    9 |
| 11 | NULL |
| 12 | NULL |
+----+------+
5 rows in set (0.00 sec)

mysql> show create table t101;
+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                   |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| t101  | CREATE TABLE `t101` (
  `a` int(11) NOT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`a`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from t101;
+----+------+
| a  | b    |
+----+------+
|  3 |    3 |
|  7 |    7 |
| 10 |   10 |
| 13 |   11 |
| 14 | NULL |
+----+------+
5 rows in set (0.00 sec)
