PostgreSQL vs MySQL TPC-H Test
Analytical queries: even for the simple analytical queries in TPC-H, PostgreSQL performs far better than MySQL. Klustron (Kunlun Database) inherits and extends PostgreSQL's powerful analytical SQL query processing and supports all queries of TPC-H and TPC-DS. At the same time, a Klustron (Kunlun Database) cluster can manage a far larger data scale than a single PostgreSQL instance.
1 Test environment
Server configuration: PostgreSQL and MySQL are deployed on one Amazon m5.4xlarge instance (CPU: 8 cores / 16 threads, memory: 64 GB, storage: gp3 general-purpose SSD volume, 3000 IOPS, 125 MB/s throughput).
Software version:
PostgreSQL: PostgreSQL 12.7 on x86_64-koji-linux-gn
MySQL: Percona 8.0.26-16
Database parameter configuration:
PostgreSQL: shared_buffers 8192MB
MySQL: innodb_buffer_pool_size 8192MB
Test background: PostgreSQL and MySQL both use the default installation configuration with only the memory parameters above adjusted; PostgreSQL received no further tuning at any point during the test.
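For reference, the two memory parameters above can be set as follows. This is a minimal sketch, not the exact commands used in the test; the test may equally have edited postgresql.conf and my.cnf directly.

```sql
-- PostgreSQL: ALTER SYSTEM writes postgresql.auto.conf;
-- shared_buffers requires a server restart to take effect
ALTER SYSTEM SET shared_buffers = '8192MB';

-- MySQL/Percona 8.0: innodb_buffer_pool_size can be changed online
SET GLOBAL innodb_buffer_pool_size = 8589934592;  -- 8192 MB in bytes
```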
2 Table information
1 fact table: lineorder
4 dimension tables: customer, part, supplier, dates
Storage space occupied by the tables on the operating system: 19 GB
Data query: 11 standard SQL test statements (statistical queries, multi-table joins, sums, complex conditions, group by, order by, etc.).
Specific SQL statements: Appendix 1 TPC-H Test SQL.
3 Test results
MySQL error message: ERROR 3 (HY000): Error writing file '/kunlun2/data10/6010/tmp/MYfd=332' (OS errno 28 - No space left on device)
Analysis and summary: comparing the execution plans of the same SQL statement in PostgreSQL and MySQL shows that MySQL's plan does not use the optimal join order or parallel operations, which leads to poor performance.
For example, Q2.1 MySQL execution plan:
Execution plan analysis: the plan above first joins several dimension tables (dates, supplier, and part) and then joins the result with the fact table lineorder. This produces an extremely large intermediate result set of roughly 10^15 rows (the row estimate on the fifth line of the query plan), which eventually exhausts the disk space available for temporary files, so the query cannot complete.
Optimization scheme: force the join order of the tables in the SQL statement: first join lineorder with the part table to obtain the smallest data subset, then join with supplier and dates to progressively narrow the range. The rewritten query and its plan are as follows:
explain format=tree
select sum(lo_revenue) as lo_revenue, d_year as year, p_brand
from ((lineorder
    straight_join part on lo_partkey = p_partkey)
    straight_join supplier on lo_suppkey = s_suppkey)
    straight_join dates on lo_orderdate = d_datekey
where p_category = 'MFGR#12' and s_region = 'AMERICA'
group by year, p_brand
order by year, p_brand;
With the join order specified on MySQL, parallel query is also enabled: set local innodb_parallel_read_threads = 16
Execution time: 1 min 53.72 sec
And the execution plan of PostgreSQL:
Execution time: 1133 ms
PostgreSQL automatically chooses the optimal join order, which greatly reduces the amount of data that must finally be sorted, and launches three parallel workers to split the scan, so its execution performance is roughly 10 times faster than that of the manually optimized MySQL query.
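The degree of parallelism PostgreSQL chooses here is governed by its standard parallel-query settings. The following is an illustrative sketch of the relevant knobs; the values shown are PostgreSQL defaults, not settings taken from this test (which used the default configuration):

```sql
-- Upper bound on workers a single Gather node may launch
SET max_parallel_workers_per_gather = 2;
-- System-wide cap on concurrently running parallel workers
SET max_parallel_workers = 8;
-- Tables smaller than this are not considered for parallel scans
SET min_parallel_table_scan_size = '8MB';
```

With the defaults, the planner decides on its own whether a parallel plan is cheaper, which is why no manual tuning was needed for these queries.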
Conclusion: running the TPC-H test on MySQL requires manually optimizing the query statements, and even then its performance is still far below PostgreSQL's, while PostgreSQL achieves good performance with its default configuration.
4 Appendix: Test SQL and query plan
Q1.1
select sum(lo_revenue) as revenue from lineorder
join dates on lo_orderdate = d_datekey
where d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;
PostgreSQL execution plan:
MySQL execution plan:
Q1.2
select sum(lo_revenue) as revenue from lineorder
join dates on lo_orderdate = d_datekey
where d_yearmonthnum = 199401
and lo_discount between 4 and 6
and lo_quantity between 26 and 35;
PostgreSQL execution plan:
MySQL execution plan:
Q1.3
select sum(lo_revenue) as revenue from lineorder
join dates on lo_orderdate = d_datekey
where d_weeknuminyear = 6 and year(d_datekey) = 1994
and lo_discount between 5 and 7
and lo_quantity between 26 and 35;
PostgreSQL execution plan:
MySQL execution plan:
Q2.1
select sum(lo_revenue) as lo_revenue, d_year as year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_category = 'MFGR#12' and s_region = 'AMERICA'
group by year, p_brand
order by year, p_brand;
PostgreSQL execution plan:
MySQL execution plan:
Q2.2
select sum(lo_revenue) as lo_revenue, year(d_datekey) as year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_brand between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA'
group by year, p_brand
order by year, p_brand;
PostgreSQL execution plan:
MySQL execution plan:
Q2.3
select sum(lo_revenue) as lo_revenue, year(d_datekey) as year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_brand = 'MFGR#2239' and s_region = 'EUROPE'
group by year, p_brand
order by year, p_brand;
PostgreSQL execution plan:
MySQL execution plan:
Q3.1
select c_nation, s_nation, year(d_datekey) as year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where c_region = 'ASIA' and s_region = 'ASIA' and year(d_datekey) between 1992 and 1997
group by c_nation, s_nation, year
order by year asc, lo_revenue desc;
PostgreSQL execution plan:
MySQL execution plan:
Q3.2
select c_city, s_city, year(d_datekey) as year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES'
and year(d_datekey) between 1992 and 1997
group by c_city, s_city, year
order by year asc, lo_revenue desc;
PostgreSQL execution plan:
MySQL execution plan:
Q3.3
select c_city, s_city, year(d_datekey) as year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where (c_city='UNITED KI1' or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and year(d_datekey) between 1992 and 1997
group by c_city, s_city, year
order by year asc, lo_revenue desc;
PostgreSQL execution plan:
MySQL execution plan:
Q4.1
select year(d_datekey) as year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by year, c_nation
order by year, c_nation;
PostgreSQL execution plan:
MySQL execution plan:
Q4.2
select year(d_datekey) as year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA' and s_region = 'AMERICA'
and (year(d_datekey) = 1997 or year(d_datekey) = 1998)
and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by year, s_nation, p_category
order by year, s_nation, p_category;
PostgreSQL execution plan:
MySQL execution plan: