
Klustron Parameter Optimization Settings



Note:

Unless specifically stated, any version number mentioned can be substituted with any released version number. For all released versions, please visit: Release_notes

Objective:

The default configuration of a Klustron cluster tends to be conservative for some crucial parameters. Given that the hardware environments where customers install Klustron vary, it is necessary to adjust these significant performance parameters according to the specific environment and characteristics of the business workload to meet performance metrics. This document aims to explain the meaning and use of these relevant parameters and provide setting recommendations.

01 Key Parameters for Compute Nodes (PG):

1.1 shared_buffers: In KunlunBase, compute nodes do not store data locally; data retrieved from storage nodes is not stored in the buffer pool but is streamed incrementally as needed. Therefore, the buffer pool of compute nodes is used only for storing metadata table data pages, totaling only a few MBs, hence we set this value relatively low. If the total data volume of temporary tables used by a user is substantial, then setting temp_buffers larger is advisable.

1.2 statement_timeout: This parameter controls the duration of SQL statement execution on compute nodes, measured in milliseconds. If set to 10000, it means that statements can run for up to 10000ms or 10 seconds. Typically, it is advisable to set this value to 6000000.

1.3 mysql_read_timeout, mysql_write_timeout: These parameters control the timeouts for receiving from and sending to (reading and writing) storage/metadata nodes from compute nodes. If a read exceeds mysql_read_timeout or a write exceeds mysql_write_timeout, the MySQL client library used by the compute nodes throws an error and returns from the read/write wait, prematurely terminating the SQL statement's execution. If an insert statement sent to a compute node inserts a million rows, or a select statement returns millions of rows from a storage node, it is better to increase these two variables. Both default to 50 seconds; it is recommended to increase them to, for example, 1200 seconds. Additionally, in such cases, increase mysql_max_packet_size to ensure that such large data packets can be sent to the storage nodes correctly.
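As a sketch, using the alter system set method described later in this section (the 1GB packet size is an illustrative assumption, not a recommendation):

```sql
-- Sketch: raise compute-node <-> storage-node I/O timeouts for bulk
-- inserts or very large result sets (values in seconds, per this document).
alter system set mysql_read_timeout = 1200;
alter system set mysql_write_timeout = 1200;
-- Enlarge the packet limit as well so very large statements still fit;
-- the 1GB value here is an illustrative assumption.
alter system set mysql_max_packet_size = 1073741824;
```

As with the other compute-node parameters, a restart is required before the new values take effect.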

1.4 lock_timeout: This parameter sets how long compute nodes wait for table locks, in milliseconds. The table locks taken by concurrently executed CRUD (Create, Read, Update, Delete) statements are mutually compatible, so they do not wait on each other. However, while an alter table statement is executing, other connections on the same compute node cannot perform DML (Data Manipulation Language) operations on that table; they will wait up to this duration and return an error if they still cannot obtain the lock (the default corresponds to 100 seconds). It is advisable to set this to, for example, 1200000, i.e., 1200 seconds.

1.5 log_min_duration_statement: Statements running longer than this duration are logged as slow queries in the log files. If every insert statement inserts tens of thousands of rows or more, it is crucial to increase this variable to prevent logging the text of such large statements, which could exhaust disk space on the compute node (the default corresponds to 10 seconds). It is advisable to set this to, for example, 1200000, i.e., 1200 seconds (the unit is milliseconds).

1.6 effective_cache_size: This parameter represents the maximum cache the compute node can utilize. This number should be large enough, for instance, over 50% of physical memory.

1.7 work_mem: This is the amount of memory a compute node can use during execution. When performing sorting operations, the system decides based on the size of work_mem whether to split a large result set into several temporary files approximately the size of work_mem. Naturally, splitting the result set slows down the sorting speed, so increasing work_mem helps improve sorting speed. For OLTP (Online Transaction Processing) type loads, setting this value to about 8M is sufficient, but for OLAP (Online Analytical Processing) loads, setting it to 2%-4% of physical memory is also reasonable.
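For instance, applying the sizing guidance above (the 128GB server is a hypothetical example):

```sql
-- Sketch: work_mem per workload type (values are illustrative).
alter system set work_mem = '8MB';     -- OLTP: small sorts, many sessions
-- For OLAP, 2%-4% of a hypothetical 128GB server is roughly 2.5GB-5GB:
-- alter system set work_mem = '4GB';
```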

1.8 wal_buffers: This sets the size of the Write-Ahead Logging (WAL) buffers, 16MB by default. In KunlunBase, user data is not stored locally on compute nodes, so WAL is only written when metadata tables are modified or vacuumed. Therefore, this parameter does not need to be large; the default value is sufficient, and setting it larger will not noticeably improve system performance.

1.9 autovacuum: Autovacuum in PostgreSQL is a background process that starts with the database. The roles of autovacuum are as follows:

  1. To clean expired dead tuples from tables
  2. To prevent table bloat
  3. To update table statistics for the optimizer's use
  4. The autovacuum launcher uses the stats collector background process to gather statistical information to determine the autovacuum candidate list

It is recommended to set this value to false.

All the above parameters can be modified with the command alter system set <parameter_name>=<value>. However, after making changes, the compute node services must be restarted for the new settings to take effect, as shown below:

pg_ctl restart -D /kunlun/server_datadir/47001
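Putting the two steps together, a minimal sketch (host, port, and user are placeholders; the data directory matches the example above):

```shell
# Apply a setting, then restart the compute node so it takes effect.
psql -h <compute_node_ip> -p 47001 -U <user> -d postgres \
     -c "alter system set statement_timeout = 6000000;"
pg_ctl restart -D /kunlun/server_datadir/47001
```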

02 Key Parameters for Storage Nodes (MySQL):

2.1 innodb_buffer_pool_size: Specifies the size of the Innodb buffer pool for the Klustron-Storage instance. For metadata nodes, the default value is 128MB, which may be small for production environments. It generally needs to be set based on requirements, with the maximum value depending on the CPU architecture. On a 32-bit platform, the maximum is 2^32 -1, and on a 64-bit platform, it's 2^64-1. When the buffer pool size exceeds 1GB, setting innodb_buffer_pool_instances to a value greater than one can improve server scalability. A larger buffer pool can reduce the frequency of disk I/O for accessing the same table data. On dedicated database servers, the buffer pool size can typically be set to 80% of the server's physical memory.
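As an illustration of the 80% guideline, a my.cnf sketch for a hypothetical dedicated server with 32GB of RAM (sizes are assumptions, not recommendations):

```ini
[mysqld]
innodb_buffer_pool_size      = 25769803776   # 24GB, roughly 75-80% of 32GB
innodb_buffer_pool_instances = 8             # >1 once the pool exceeds 1GB
```

Note that innodb_buffer_pool_instances cannot be changed at runtime, so it must be set in the configuration file before the instance starts.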

2.2 lock_wait_timeout: A MySQL server layer lock timeout variable, this controls the maximum waiting time for server layer table locks. If a DDL statement is altering a table, then all transactions performing DML statements on that table will block and wait up to this duration, returning an error if they cannot obtain the table lock. Since MySQL 8.0, common operations such as adding columns and indexes that previously required locking the entire table can now be done via online DDL, so the default is generally sufficient. For performance testing, whether TPCC or TPCH, it is recommended to set this value to 1200 seconds.

2.3 innodb_lock_wait_timeout: A MySQL InnoDB lock timeout variable, this controls the maximum waiting time for InnoDB row locks. Exceeding this duration will cause DML statements to return an error. For full table updates, especially when the table size is very large, such as several hundred GBs or more, update statements can lock a significant number of rows for an extended period. In such cases, other transactions may typically encounter lock timeouts unless this parameter is increased (default is 20 seconds), recommended setting in these scenarios is 1200 seconds.

2.4 fullsync_timeout: For a complete explanation of this parameter, please refer to: http://doc.klustron.com/zh/Klustron_Storage_Cluster_Fullsync.html?h=fullsync_timeout. It is recommended to set this parameter to: 1200000.

2.5 enable_fullsync: For a full explanation of this parameter, please refer to: http://doc.klustron.com/zh/Klustron_Storage_Cluster_Fullsync.html?h=enable_fullsync. Typically, this parameter should be set to true. When loading a large amount of data (e.g., more than 10GB) for TPCC/TPCH tests, it is advisable to set it to false to improve loading speed.

2.6 innodb_flush_log_at_trx_commit: Specific to the InnoDB engine, this parameter controls how the redo log (ib_logfile) is flushed to disk:

  1. If set to 0, the log buffer is written to the file system (OS buffer) once per second, and the file system's flush operation pushes the cache to disk. If the machine crashes, transactions committed within the last second may be lost.
  2. If set to 1, every transaction commit writes the log buffer to the file system and calls a flush. This places high demands on database I/O: even though InnoDB groups several transaction commits into a single disk flush, the resulting I/O load is still significant, and on hardware with poor IOPS, MySQL concurrency will quickly be limited by I/O.
  3. If set to 2, the log buffer is written to the file system's page cache at every transaction commit but is not immediately flushed to disk. Committed transactions are then only lost if the host OS crashes or power fails suddenly.

When loading a large amount of data (e.g., more than 10GB) for TPCH tests, it is recommended to set this parameter to 2 to improve loading speed.

2.7 sync_binlog: MySQL provides the sync_binlog parameter to control how the database's binlog is flushed to disk. By default, sync_binlog=0, meaning MySQL does not control binlog flushing; the file system manages its own cache refresh. This gives the best performance but the highest risk, since a system crash loses any binlog updates still sitting in the operating system's page cache. If sync_binlog>0, MySQL calls a file system flush after every sync_binlog transaction commits. The safest setting is sync_binlog=1, meaning every transaction commit forces a binlog flush, offering the highest safety but the largest performance hit. For high-concurrency transaction systems, the write performance difference between sync_binlog=0 and sync_binlog=1 can be five times or more. Depending on the needs, sync_binlog might be set not to the safest value of 1 but to 100 or 0, sacrificing some consistency for higher concurrency and performance. In performance tests, especially TPCH-type tests, it is recommended to set this to 0.
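For bulk loading, the two durability settings above are typically relaxed together and restored afterwards; a sketch:

```sql
-- Sketch: relax durability on each storage node before a large data load
-- (e.g., a TPCH import), then restore the safe settings afterwards.
set global sync_binlog = 0;
set global innodb_flush_log_at_trx_commit = 2;
-- ... load data ...
set global sync_binlog = 1;
set global innodb_flush_log_at_trx_commit = 1;
```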

2.8 max_binlog_size: Specifies the maximum size of a single binary log file; if exceeded, a new binary log file is created with an incremented suffix and recorded in the .index file. The default value from MySQL 5.0 onward is 1073741824, representing 1GB. The recommended value using the KunlunBase configuration template is 500MB.

The above parameters can be temporarily set using:

set global <parameter_name>=<parameter_value>;

If a storage node restarts, these settings will be lost unless they are permanently set in the my.cnf file.
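To make such settings permanent, a my.cnf sketch using values from this document (key names as in stock MySQL; verify them against your instance's configuration file):

```ini
[mysqld]
lock_wait_timeout        = 1200
innodb_lock_wait_timeout = 1200
sync_binlog              = 0            # performance tests only
max_binlog_size          = 524288000    # 500MB
```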

Connection method for storage nodes: mysql -h <storage node host IP> -P <storage node port> -u pgx -ppgx_pwd (note that there is no space after -p; with a space, mysql would treat the next argument as a database name and prompt for the password).

Additionally, to enhance speed and reduce unnecessary failures in TPCC/TPCH test scenarios, it is advisable to make the shard fault-tolerant. Setting fault tolerance means the shard will not trigger a primary-replica switch if the primary node fails during the configured fault-tolerance window, e.g., 10800 seconds (3 hours) from the point of setting. This action needs to be performed in XPanel.

To monitor whether a primary-replica switch has occurred, check the "Primary-Replica Switch Record".

2.9 Description of storage node engine parameters:

By default, Klustron storage nodes use the InnoDB storage engine. However, in certain scenarios to enhance the efficiency of data loading into the database, such as when dealing with large volumes of log entries, the storage engine can be switched to RocksDB. This can be achieved by modifying the MySQL configuration file of the storage node to enable the RocksDB storage plugin. An example path for this file is:

vi /kunlun/storage_datadir/57001/data/57001.cnf

To enable RocksDB, remove the “#” symbol from #default-storage-engine=rocksdb, save and exit. Afterwards, restart the MySQL service on the storage node to activate the RocksDB engine. However, specific configurations and environmental factors warrant adjustments to several other RocksDB parameters, as explained below.

2.9.1 rocksdb_max_open_files: RocksDB keeps all file descriptors in the table cache. If the number of file descriptors exceeds max_open_files, some files are evicted from the table cache and their file descriptors are closed. This means that each read operation has to search through the table cache to find the required file. It is recommended to set max_open_files to -1 to keep all files open continuously, thus avoiding costly table cache calls.

2.9.2 rocksdb_max_background_jobs: This variable replaces the earlier variables rocksdb_base_background_compactions, rocksdb_max_background_compactions, and rocksdb_max_background_flushes. It specifies the maximum number of background jobs, with RocksDB automatically deciding how many threads to allocate to flushing and compaction; the goal is to reduce the number of confusing options for users. A common configuration is CPU cores / 4. Too few background threads can cause more write stalls when flushes and compactions fall behind; too many can take CPU away from the threads handling user queries. The default value is 2.

2.9.3 rocksdb_max_background_flushes: This variable has been replaced by rocksdb_max_background_jobs.

2.9.4 rocksdb_max_background_compactions: This variable has been replaced by rocksdb_max_background_jobs.

2.9.5 rocksdb_max_total_wal_size: Specifies the maximum total size of the Write-Ahead Logging (WAL) files. WAL logs ensure data persistence and consistency. Typically, you do not want the WAL files to be too large, as they consume significant disk space, yet they must be large enough to accommodate your application's transaction data: a smaller limit triggers memtable flushes more often, while a larger limit defers them at the cost of disk space. Choose a suitable size based on application needs. The default value is 4G.

2.9.6 rocksdb_table_cache_numshardbits: Controls the number of shards in the table cache for better memory management. By default, the value for rocksdb_table_cache_numshardbits is 6, meaning RocksDB will use 64 shards. You can consider this a starting point and adjust as needed based on performance monitoring of RocksDB, looking at read latency and cache hit rates to ensure the table cache settings meet application performance requirements.

2.9.7 rocksdb_default_cf_options for write_buffer_size: Sets the size of a single memory table. Once a memtable exceeds this size, it is marked immutable and a new memtable is created. Different workloads require different write buffer sizes. For applications with heavy write operations, a larger write buffer may help reduce write amplification and enhance performance. However, if your workload is primarily read operations, a smaller write buffer might be sufficient. The default value is 128M.

2.9.8 rocksdb_default_cf_options for target_file_size_base and target_file_size_multiplier: Files at level 1 will each be target_file_size_base bytes; files at each subsequent level are target_file_size_multiplier times the size of those at the previous level. By default, target_file_size_multiplier is 1, so files at all levels from L1 to Lmax are the same size. Increasing target_file_size_base reduces the total number of database files, which is generally beneficial. It is recommended to set target_file_size_base to max_bytes_for_level_base / 10 so that there are 10 files at level 1. The default value is 64M.

2.9.9 rocksdb_default_cf_options for cache_index_and_filter_blocks: This parameter controls whether to cache the index and filter blocks of SST files, which can significantly affect read performance. If your application often performs range queries or needs fast random reads, caching these blocks is generally advantageous as it can improve read performance. However, if your application is primarily write-intensive, or if memory is limited, you might need to reduce or disable this cache to free up more memory. The default value is 1.

2.9.10 rocksdb_default_cf_options for filter_policy=bloomfilter: Specifies the Bloom Filter options for the default column family (CF). A Bloom Filter is a data structure used to quickly determine if an element is not part of a set, usually to speed up read operations and reduce unnecessary disk reads. Several related parameters need to be set, such as the false positive rate and the number of bits in the Bloom Filter, depending on the needs of your application. A Bloom Filter with a lower false positive rate will be more accurate but will use more memory. You can balance accuracy and memory usage by setting the appropriate number of bits.

2.9.11 rocksdb_lock_wait_timeout: This parameter controls the lock wait timeout, specifying the maximum time (in seconds) RocksDB will wait for a lock to be released. This can help avoid indefinite waiting in cases of lock conflict, thus reducing the risk of potential deadlocks. If your application has frequent write operations or a high likelihood of lock conflicts, a shorter wait timeout may be necessary to reduce lock waiting times. If your application is mainly read-oriented, a longer wait timeout may be appropriate. The default value is 20 seconds.
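Pulling the RocksDB parameters above together, a hypothetical .cnf fragment (the 16-core machine and the Bloom Filter bits-per-key of 10 are assumptions; the filter_policy syntax follows the MyRocks convention and should be checked against your Klustron-Storage version):

```ini
rocksdb_max_open_files      = -1    # keep all SST file descriptors cached
rocksdb_max_background_jobs = 4     # ~ CPU cores / 4 on a 16-core machine
rocksdb_max_total_wal_size  = 4G
rocksdb_lock_wait_timeout   = 20
rocksdb_default_cf_options  = write_buffer_size=128m;target_file_size_base=64m;cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false
```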

03 Metadata Cluster Parameter Adjustments:

For the Klustron metadata cluster, it is also advisable to make suitable adjustments to a few critical parameters as shown in the following example:

Log in to a cluster node:

export LD_LIBRARY_PATH=/kunlun/instance_binaries/storage/57001/Klustron-storage-1.2.1/lib:$LD_LIBRARY_PATH 
/kunlun/instance_binaries/storage/57001/Klustron-storage-1.2.1/bin/mysql --connect-expired-password -S/nvme2/kunlun/storage_logdir/57001/mysql.sock -uroot -proot 

set global innodb_buffer_pool_size=1*1024*1024*1024;  -- default is 128MB

This parameter needs to be set on all three nodes. Similarly, the above setting is only a temporary change in memory. For a permanent setting, it needs to be saved in each node's my.cnf file, and the node restarted for it to take effect.

END