What to pay attention to when inserting/updating/deleting/retrieving millions of rows of data in a DML statement?
What to pay attention to when inserting/updating/deleting/retrieving millions of rows of data in a DML statement?
Preface
Any node of a distributed computing and storage system may time out due to node overload, insufficient computing and storage resources of the node, network delay, and network short-term unreachability.
Any operation in a distributed system usually holds various resources while waiting for remote nodes to return. It cannot wait indefinitely. Otherwise, the overall operation of the system will be blocked and gradually stalled.
Therefore, timeout control is a problem that all distributed systems need to solve well, and if not properly addressed, it can cause system operations to stall and not work properly.
Introduction to Klustron's timeout control mechanism
Klustron has the following timeout control variables:
- Some are in the computing nodes. The timeout variables of the computing node are all in the configuration file of the computing node instance, which can be modified as needed, and the parameters of the running instance can be refreshed after modification.
- Some are in the storage nodes. The timeout variables of the storage node are in the storage node configuration file, which can be modified by modifying the configuration file or by executing the set statement on the computing node or storage node to modify the corresponding variable value.
In general, users do not need to modify these variables because we have optimized the configuration parameters of the computing node and storage node for common situations.
However, these timeout variables still need to be modified in special scenarios.
A typical scenario is to insert/update/delete millions or even more rows of data in a single DML statement, or to return millions or even more rows of data in a select statement. For example, logical importing large amounts of data into a table, updating the entire table of large data, querying a large table for data analysis (OLAP), or a programmer or DBA intending to delete the database and run away😂, etc.
In these scenarios, users should increase the following timeout values based on the estimated amount of data to be inserted/updated/deleted/read, to ensure that the relevant statements and operations can work properly until completion and will not be terminated prematurely by the timeout mechanism, which may mistakenly consider the statement as timed out and unable to execute correctly.
Alternatively, users can increase these timeout values after attempting these operations and getting errors.
Now let's take a look at all the timeout control variables in Klustron.
Klustron Computing Node Timeout Variables
1. Statement_timeout.
If the total time for a query executed by a computing node exceeds this limit, the statement will be rolled back.
For example, if a computing node takes too long to perform a table join with data returned by the storage cluster, it will eventually stop after reaching this timeout limit (default 100 seconds).
2. mysql_read_timeout and mysql_write_timeout: communication (read/write) timeouts between computing nodes and storage nodes/metadata nodes.
If the read time exceeds mysql_read_timeout or the write time exceeds mysql_write_timeout, the MySQL client library used by the computing node will report an error and return from the read/write wait, thus prematurely terminating the statement execution.
If an insert statement sent to the computing node will insert 1 million rows of data, or a select statement will return hundreds of thousands of rows of data from the storage node, it is best to increase the values of these two variables, which are set to 50 seconds by default.
In addition, in this case, the mysql_max_packet_size variable should be increased to ensure that such large data packets can be sent correctly to the storage node.
3. lock_timeout: The time that a computing node waits for a table lock.
Concurrent DML statements are compatible with table locks and do not need to wait for locks.
However, if an alter table statement is being executed, other connections on the same computing node cannot execute DML statements against this table. If they cannot obtain the lock after waiting for this long, they will return an error (default 100 seconds).
4. log_min_duration_statement: Statements that exceed this time will be recorded as slow queries in the log file.
If you want to insert tens of thousands or even more rows in each insert statement, you must increase this variable. Otherwise, a large amount of data will be recorded in the log file, causing the computing node to run out of disk space (default 10 seconds).
Klustron storage node timeout variables
All modifications to storage node variables described below can be made by executing a set statement on any computing node in the cluster before data is loaded, for example: set global long_query_time=1000;
This statement will be sent by the computing node to the master node of each storage shard in the cluster.
1. lock_wait_timeout: MySQL server-level lock timeout variable.
The maximum time to wait for server-level table locks. If a DDL statement is in progress, such as alter table, all transactions that attempt DML statements on that table will be blocked and wait for this many seconds for the table lock. If the table lock cannot be obtained within this time limit, an error will be returned.
In the era of MySQL 8.0, common operations like adding columns and indexes that previously required locking the entire table to complete can now be done online, so 5 seconds is generally sufficient by default.
2. innodb_lock_wait_timeout: MySQL InnoDB lock timeout variable, maximum time to wait for InnoDB row locks.
If the timeout is exceeded, the DML statement will return an error.
If a full table update needs to be performed and the table contains a large amount of data, such as several hundred GB or more, the update statement will lock a large number of rows for a long time, and other transactions will usually experience lock timeouts, unless the innodb_lock_wait_timeout variable is increased (default 20 seconds).
3. If the storage cluster uses MySQL Group Replication for high availability
Then you need to increase the timeout control variables of MGR, such as group_replication_member_expel_timeout, group_replication_component_stop_timeout, and group_replication_unreachable_majority_timeout, otherwise the slave machine of MGR will mistakenly assume that the master node has failed and initiate a master-slave switch, or the master node may not be able to write if it thinks that the slave machine has lost contact.
However, starting from Klustron-1.0, Klustron no longer supports the use of MGR in the storage cluster.
4. Slow query threshold Long_query_time
Increase Long_query_time, which is set to 10 seconds by default, to, for example, 100 seconds or more. If an insert statement that inserts 1 million rows is considered a slow query and recorded in the slow query log, writing to this log will consume a lot of time and disk storage space.
Conclusion
Klustron features a robust timeout control mechanism with time limits set for any operation across all nodes in the communication network. This ensures that the system state is continually progressing and that system resources can efficiently handle an increasing number of service requests.