Skip to main content

PostgreSQL vs MySQL TPC-H Test

KlustronAbout 5 min

PostgreSQL vs MySQL TPC-H Test

Analytical queries Even for simple analytical queries in TPC-H, the performance of PostgreSQL is far better than that of MySQL. Klustron (Kunlun Database) inherits and extends PostgreSQL's powerful capabilities in analytical SQL query processing, and can support all queries of TPC-H and TPC-DS. At the same time, the data scale that can be managed by a Klustron (Kunlun Database) cluster is much larger than that of a PostgreSQL instance.

1 Test environment

Server configuration: PostgreSQL and MySQL are deployed on one: Amazon m5.4xlarge (CPU 8cores 16 Threads, memory: 64 G, storage gp3, general-purpose SSD volume 3000IOPS, 125 MB/s throughput).

Software version:

PostgreSQL:PostgreSQL 12.7 on x86_64-koji-linux-gn

MySQL: percona 8.0.26-16

Database parameter configuration:

PostgreSQL:shared_buffers 8192MB

MySQL: innodb_buffer_pool_size 8192MB

Test background: PostgreSQL and MySQL adopt the default installation configuration, only the memory parameters are adjusted, and PostgreSQL does not have any optimization behavior during the whole test process.

2 Table information

1 fact table: lineorder

4 dimension tables: customer, part, supplier, dates

img

Table occupies operating system storage space: 19 G

Data query: 11 standard SQL query test statements (statistical query, multi-table association, sum, complex conditions, group by, order by, etc.).

Specific SQL statements: Appendix 1 TPC-H Test SQL.

3 Test results

img

MySQL error message: ERROR 3 (HY000): Error writing file '/kunlun2/data10/6010/tmp/MYfd=332'(OS errno 28-No space left on device)

Analysis and summary: By comparing the execution plans of the same SQL statement in PostgreSQL and MySQL, it can be found that the execution plan of MySQL does not adopt the optimal join order and parallel operation, resulting in poor performance.

For example, Q2.1 MySQL execution plan:

img

Execution plan analysis: The above execution plan first joins several dimension tables (dates, supplier and part), and then joins the obtained results with the fact table lineorder, thus obtaining a super large intermediate result set with a data volume of 10 The number to the power of 15 (the rows returned by the fifth line of the query plan) eventually causes the temporary file to run out of disk space and fail to complete the query.

Optimization scheme: by specifying the join order of the tables in the SQL statement forcibly: first join with the part table to obtain a minimum data subset, and then join with the supplier and dates to gradually narrow the range. The results of the query statement and query plan are as follows:

explain format=tree  selectsum(lo_revenue) as lo_revenue, d_year as year, p_brand from ((lineorderstraight_join part on lo_partkey = p_partkey) straight_join  supplier on  lo_suppkey = s_suppkey)  straight_join dates ON  lo_orderdate = d_datekey where p_category ='MFGR#12' and s_region = 'AMERICA' group by year, p_brand order by year,p_brand;

Specify the join order on MySQL,

At the same time start parallel query: set local innodb_parallel_read_threads=16

img

Execution time: 1 min 53.72 sec

And the execution plan of PostgreSQL:

img

Execution time: 1133 ms

PostgreSQL automatically adopts the optimal join order, which greatly reduces the amount of final sorted data, and starts three parallel split data sets, so the execution performance is 10 times faster than that of manually optimized MySQL.

Conclusion: MySQL executes the TPC-H test and needs to manually optimize the query statement. Even so, the performance is still far lower than PostgreSQL, and the default configuration of PostgreSQL can achieve relatively good performance.

4 Appendix: Test SQL and query plan

Q1.1

select sum(lo_revenue) as revenue from lineorder 
join dates on lo_orderdate = d_datekey
where d_year  = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q1.2

 select sum(lo_revenue) as revenue   from lineorder
join dates on lo_orderdate = d_datekey
where d_yearmonthnum = 199401
and lo_discount between 4 and 6
and lo_quantity between 26 and 35;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q1.3

select sum(lo_revenue) as revenue   from lineorder
join dates on lo_orderdate = d_datekey
where d_weeknuminyear = 6 and year(d_datekey)  = 1994
and lo_discount between 5 and 7
and lo_quantity between 26 and 35;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q2.1

select sum(lo_revenue) as lo_revenue, d_year as year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_category = 'MFGR#12' and s_region = 'AMERICA'
group by year, p_brand
order by year, p_brand;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q2.2

select sum(lo_revenue) as lo_revenue, year(d_datekey) as year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_brand between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA'
group by year, p_brand
order by year, p_brand;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q2.3

select sum(lo_revenue) as lo_revenue, year(d_datekey) as year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_brand = 'MFGR#2239' and s_region = 'EUROPE'
group by year, p_brand
order by year, p_brand;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q3.1

select c_nation, s_nation, year(d_datekey) as year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where c_region = 'ASIA' and s_region = 'ASIA' and year(d_datekey) between 1992 and 1997
group by c_nation, s_nation, year
order by year asc, lo_revenue desc;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q3.2

select c_city, s_city, year(d_datekey) as year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES'
and year(d_datekey) between 1992 and 1997
group by c_city, s_city, year
order by year asc, lo_revenue desc;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q3.3

select c_city, s_city, year(d_datekey) as year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where (c_city='UNITED KI1' or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and year(d_datekey) between 1992 and 1997
group by c_city, s_city, year
order by year asc, lo_revenue desc;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q4.1

select year(d_datekey) as year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by year, c_nation
order by year, c_nation;

PostgreSQL execution plan:

img

MySQL execution plan:

img

Q4.2

select year(d_datekey) as year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA'and s_region = 'AMERICA'
and (year(d_datekey) = 1997 or year(d_datekey) = 1998)
and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by year, s_nation, p_category
order by year, s_nation, p_category;

PostgreSQL execution plan:

img

MySQL execution plan:

img

END