TPC-H on MariaDB (MySQL)

Benchmarks for Database can be roughly divided into two kinds, OLTP and OLAP[1]. One of the most popular OLTP benchamrks is TPC-C, while that of OLAP is TPC-H[2]. This post describes how you can run TPC-H against MariaDB.

Environment Setup

The versions of the OS and programs I used for writeup of this post are as below.

  • Ubuntu 16.04.2 Server
  • MariaDB 10.2.8
  • TPC-H toolkit 2.17.2

Automated Scripts

I automated whole things I will describe below using scripts and uploaded the scripts to the Github: https://github.com/sjp38/tpch-mariadb

If you hano no time, just go to the link and use the script.

TPC-H Spec

You can ge TPC-H benchmark specification from TPC website[3]. Refer to it if you need strict specification of the benchmark. Nonetheless, I give you a rough description of the benchmark here, though. TPC-H stores a huge data into 8 tables and run 22 queries analysing this data. To measure the performance, we can measure the time to store/load the data, runtime of each of the 22 queries, and the throughput (number of processed queries per second) of multiple sessions.

Get TPC-H Toolkit

TPC provides the specification says how the tables should be constructed, what data should be stored in the tables, and what kind of queries should be issued. So, anyone can implement their TPC-H based on this specification. Nonetheless, as manual read of the specification and manual implementation can be boring and exhaustive to someone. For the reason, TPC officially provides a tool, namely TPC-H Toolkit, that helps creation of the tables, the data, and the queries. If you go to the site[4], enter your personal information, and agree to the license, they will send you a link to download the TPC-H Toolkit by mail.

The file is in zip format. After extracting the file, you can get the directory named by the version of the toolkit. The source code and license file is under the directory. We will use dbgen/ directory here.

$ tree 2.17.2 -L 1
2.17.2
├── dbgen
├── dev-tools
├── EULA.txt
└── ref_data

Build Source Code

Under the dbgen/, there are source code of the programs that we will use, dbgen and qgen. Copy makefile.suite to makefile and edit the content as below.

$ diff -u makefile.suite makefile
--- makefile.suite      2017-04-21 06:01:08.000000000 +0900
+++ makefile    2017-09-10 11:10:12.563384756 +0900
@@ -100,15 +100,15 @@
 ################
 ## CHANGE NAME OF ANSI COMPILER HERE
 ################
-CC      =
+CC      = gcc
 # Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
 #                                  SQLSERVER, SYBASE, ORACLE, VECTORWISE
 # Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS,
 #                                  SGI, SUN, U2200, VMS, LINUX, WIN32
 # Current values for WORKLOAD are:  TPCH
-DATABASE=
-MACHINE =
-WORKLOAD =
+DATABASE= SQLSERVER
+MACHINE = LINUX
+WORKLOAD = TPCH
 #
 CFLAGS = -g -DDBNAME=\"dss\" -D$(MACHINE) -D$(DATABASE) -D$(WORKLOAD) -DRNG_TEST -D_FILE_OFFSET_BITS=64
 LDFLAGS = -O

Now, $ make command will build dbgen and qgen.

$ make

...

 bm_utils.o qgen.o rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm
$ file dbgen
dbgen: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=d0bca1a18c33947d85f5943c8325837faec7c95d, not stripped
$ file qgen
qgen: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=79a6216e04c446715b0e55f9c804586887b698c8, not stripped

TPC-H Toolkit Usage

To run TPC-H on MariaDB, we need several more modification. Before that, let’s see how we run TPC-H using the files under dbgen/ directory.

Data Preparation

dss.ddl file has the sql statements be used for the creation of the 8 tables. If you command your DB to execute the sql statements in this file, the 8 tables will be created.

dbgen creates the sql statements that stores the data to the table. If you run the dbgen with optional arguments including scale factor, 8 files having .tbl suffix will be created. Each of these files has query statements inserting the data to the table having same name with the file. You can store the data to the tables by making your DB to run the sql statements.

dss.ri has a sql statement creating the table index. Once your DB runs these sql statements, the table index will be generated and you will be ready to run the TPC-H queries.

Query Creation and Execution

qgen creates the queries doing the TPC-H’s analysis tasks. We simply said 22 queries, but it would be better to say 22 kinds of queries. If a query is repeatedly issued with same arguments, DB server can cache the results and simply return it to the client. As this can distort the results, TPC-H Toolkit has the template of the 22 queries under dbgen/queries/ directory and let qgen to create query statements using random arguments, based on this templates. The seed for the random number generation is created based on dists.dss file.

Modify TPC-H Toolkit for MariaDB

Each DBMS has a little bit different query syntax. Thus, TPC-H Toolkit is made to support multiple DBMSes, but the MariaDB is not in the supported DBMSes. Therefore, you should modify a part of the TPC-H Toolkit to run TPC-H on the MariaDB.

The files need modifications are dss.ddl, dss.ri, and the templates under the queries/ directory. As the description of the each of the modifications will be too long, I simply made patches for the modifications and uploaded on Github[5]. You can apply the patch files to your TPC-H Toolkit as below. You should be on the TPC-H Toolkit source code directory while running below command.

$ wget https://raw.githubusercontent.com/sjp38/tpch-mariadb/07fdcbfa9ba1be26f437ff130338a223d0dbfecd/0001-Modify-for-MariaDB.patch
$ patch -p1 < 0001-Modify-for-MariaDB.patch

The changes above includes the modification of the makefile. After applying, build dbgen and qgen.

Data Creation

Now, move to dbgen/ directory and run dbgen. You can give scale factor using its argument. Refer to the usage of dbgen using -h option. After its execution, you can see below 8 .tbl files.

dbgen$ ls -ahl *.tbl
-rw-rw-r-- 1 sjpark sjpark 187M Sep  9 10:48 customer.tbl
-rw-rw-r-- 1 sjpark sjpark 5.8G Sep  9 10:48 lineitem.tbl
-rw-rw-r-- 1 sjpark sjpark 2.2K Sep  9 10:48 nation.tbl
-rw-rw-r-- 1 sjpark sjpark 1.4G Sep  9 10:48 orders.tbl
-rw-rw-r-- 1 sjpark sjpark 919M Sep  9 10:48 partsupp.tbl
-rw-rw-r-- 1 sjpark sjpark 186M Sep  9 10:48 part.tbl
-rw-rw-r-- 1 sjpark sjpark  389 Sep  9 10:48 region.tbl
-rw-rw-r-- 1 sjpark sjpark  11M Sep  9 10:48 supplier.tbl

The above results used scale factor value as 8.

Load Data

Now, with your MariaDB, create a database named tpch (line 1), create tables under the database (line 2), load the data to the tables (lines 3-5), and make index for each of the tables (line 6).

$ mysql -u root -p password -e "create database tpch;"
$ mysql -u root -p password < dss.ddl
$ for table in customer lineitem nation orders partsupp part region supplier \
	do mysql -u root -p -e \
		"LOAD DATA LOCAL INFILE '$table.tbl' FIELDS TERMINATED BY '|';"
	done
$ mysql -u root -p password < ./dss.ri

Create Queries

Now, create the queries. qgen receives a number from 1 to 22 which means the type of the queries to create, generate the query statement, and print on the screen. You can save the queries with below commands. This example assumes it is executed on the dbgen/ directory.

$ cd queries/
$ for i in {1..22}; do ../qgen $i > query-$i.sql; done

Now, you have 22 query- prefixed files containing each type of queries under the dbgen/queries/ directory.

Execute Queries

As the query files have sql statement compatible with MariaDB, you can simply run them as below.

$ for i in {1..22}; do mysql -u root -p password < dbgen/queries/query-$i.sql

Few queries will take few of tens of seconds to complete while another queries take less than 1 second.

Conclusion

This post summarised how TPC-H is constructed, how you can run it, and what modifications are required to run it on MariaDB. You can do this with only few lines of command based on the automated scripts on my Github repo[6].

References

[1] http://datawarehouse4u.info/OLTP-vs-OLAP.html
[2] http://www.tpc.org/tpch/
[3] http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.2.pdf
[4] http://www.tpc.org/tpc_documents_current_versions/download_programs/tools-download-request.asp?bm_type=TPC-H&bm_vers=2.17.2&mode=CURRENT-ONLY
[5] https://raw.githubusercontent.com/sjp38/tpch-mariadb/07fdcbfa9ba1be26f437ff130338a223d0dbfecd/0001-Modify-for-MariaDB.patch
[6] https://github.com/sjp38/tpch-mariadb

Avatar
SeongJae Park (SJ)
Kernel Programmer

SeongJae Park (SJ) is a programmer who loves to analyze and develop systems.

Related