When deploying Hadoop, scaling storage can be difficult and costly because the storage and compute are co-located on the same hardware nodes. By implementing the storage layer using S3-compatible storage software and using an S3 connector instead of HDFS, it’s possible to separate storage and compute and scale storage independently. This provides greater flexibility and cost-efficiencies but raises the question of how performance will be impacted.
The two of us recently led a project to answer this question by examining the performance of four different combinations of query type and storage type:
- Hive+HDFS
- Hive+S3 (Cloudian HyperStore)
- Presto+HDFS
- Presto+S3 (Cloudian HyperStore)
We used HiBench’s SQL (Hive-QL) workloads with ~11 million records (~1.8GB) and the TPC-H benchmark with ~866 million records (~100GB). CDH5 (Cloudera Distribution Hadoop v5.14.4) was used for the Hadoop and HDFS implementation. For S3-compatible storage, we used Cloudian HyperStore v7.1, which implements the Amazon S3 API in a software package that can be deployed on Linux.
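To give a sense of how the storage layer is swapped, the Hive-QL sketch below defines the same hypothetical, HiBench-style external table once on HDFS and once on HyperStore through Hadoop’s S3A connector. The bucket name, paths, and column layout are illustrative placeholders, not the exact tables used in the benchmarks.

```sql
-- Minimal sketch: the same table definition pointed at HDFS vs. S3.
-- Bucket name, paths, and columns are hypothetical placeholders.
CREATE EXTERNAL TABLE uservisits_hdfs (
  sourceIP  STRING,
  destURL   STRING,
  adRevenue DOUBLE
)
STORED AS SEQUENCEFILE
LOCATION 'hdfs:///HiBench/Scan/Input/uservisits';

CREATE EXTERNAL TABLE uservisits_s3 (
  sourceIP  STRING,
  destURL   STRING,
  adRevenue DOUBLE
)
STORED AS SEQUENCEFILE
LOCATION 's3a://hibench-bucket/Scan/Input/uservisits';
```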
Aggregating the benchmark results, the relative performance from best to worst was:
- Presto+HDFS (best)
- Presto+S3
- Hive+HDFS
- Hive+S3 (worst)
Both Presto configurations substantially outperformed the two Hive configurations, roughly by a factor of 10. Moreover, the Presto+S3 combination showed very similar performance results to the best Presto+HDFS combination, demonstrating that Hadoop users can achieve the flexibility and cost advantages of separating storage and compute with S3 software, without any significant tradeoff in performance.
The remainder of this article describes in greater detail the basic data flow for each of the four configurations, the performance benchmarks, and the results.
Basic Data Flow
Case 1: Hive+HDFS
- Input and output Hive tables are stored on HDFS. (The output table should be empty at this point)
- A HiBench or TPC-H query is submitted from a Hive client on node 0 to the HiveServer2 on the same node.
- Hive locates the tables from its Metastore and schedules a series of MapReduce (M/R) jobs for the query. (Hive refers to each job as a “stage”.)
- Hadoop YARN (which is not shown in the diagram) runs the tasks in the M/R jobs. Each task has an embedded HDFS client and reads/writes data on HDFS. Intermediate results are stored on HDFS.
- When all stages have finished, Hive returns the query result to the client.
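To make the Case 1 flow concrete, here is a simplified HiBench-style “Scan” query in Hive-QL, assuming the hypothetical tables sketched above and an empty output table uservisits_copy with the same schema. Hive compiles the INSERT...SELECT into M/R stages whose tasks read from and write to HDFS.

```sql
-- Simplified HiBench-style "Scan" workload (write-heavy):
-- read every record from the input table and copy it to the output table.
-- Hive compiles this into M/R stages that YARN executes; uservisits_copy
-- is a hypothetical, initially empty output table.
INSERT OVERWRITE TABLE uservisits_copy
SELECT * FROM uservisits_hdfs;
```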
Case 2: Hive+S3
- Input and output Hive tables are stored on S3. (The output table should be empty at this point)
- A HiBench or TPC-H query is submitted from a Hive client on node 0 to the HiveServer2 on the same node.
- Hive locates the tables from its Metastore and schedules a series of M/R jobs for the query.
- Hadoop YARN runs the tasks in the M/R jobs. Each task has an embedded S3A filesystem client and reads/writes data on HyperStore S3. HAProxy acts as a round-robin load balancer, forwarding the S3 requests to different S3 servers. Intermediate results are stored on the default distributed filesystem, which in our case is HDFS.
- When all stages have finished, Hive returns the query result to the client.
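On the Hive side, the only difference from Case 1 is where the table data lives. One way to confirm a table’s backing store is to inspect its metadata; the table name below is again a hypothetical placeholder.

```sql
-- Check where Hive believes the table data lives; for an S3-backed table
-- the Location field shows an s3a:// URI served by HyperStore (via HAProxy).
DESCRIBE FORMATTED uservisits_s3;
```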
Case 3: Presto+HDFS
- Input and output Hive tables are stored on HDFS. (The output table should be empty at this point)
- A HiBench or TPC-H query is submitted from a Presto client on node 0 to the Presto Coordinator on the same node.
- Presto Coordinator queries the Hive Metastore to locate the Hive tables.
- Presto Coordinator schedules a series of tasks for the query.
- Presto Workers execute the tasks with their embedded HDFS clients, reading/writing data on HDFS. Intermediate results are kept in memory and streamed between Presto Workers.
- When all tasks have finished, Presto Coordinator returns the query result to the client.
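For reference, a HiBench-style aggregation expressed as Presto SQL might look like the sketch below; the query goes to the Coordinator, which schedules tasks on the Workers. The catalog, schema, and table names are assumptions, not the exact names used in the benchmark runs.

```sql
-- HiBench-style "Aggregation" query as Presto SQL, addressed through the
-- Hive connector (catalog.schema.table); hive.default is an assumed
-- catalog/schema, uservisits_hdfs a hypothetical HDFS-backed table.
SELECT sourceip, sum(adrevenue) AS totalrevenue
FROM hive.default.uservisits_hdfs
GROUP BY sourceip;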
Case 4: Presto+S3
- Input and output Hive tables are stored on S3. (The output table should be empty at this point)
- A HiBench or TPC-H query is submitted from a Presto client on node 0 to the Presto Coordinator on the same node.
- Presto Coordinator queries the Hive Metastore to locate the Hive tables.
- Presto Coordinator schedules a series of tasks for the query.
- Presto Workers execute these tasks using their embedded S3 clients, reading/writing data on HyperStore S3. Intermediate results are kept in memory and streamed between Presto Workers.
- When all tasks have finished, Presto Coordinator returns the query result to the client.
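Because the table location is resolved through the Hive Metastore, the Presto SQL is identical whether the data sits on HDFS or S3; only the Hive connector’s S3 settings differ. As a sketch (table names again hypothetical), EXPLAIN can be used to inspect the distributed plan the Coordinator will schedule across the Workers.

```sql
-- The query text does not change for an S3-backed table; the s3a:// location
-- comes from the Hive Metastore. EXPLAIN shows the distributed plan
-- (the stages/tasks the Coordinator schedules across Workers).
EXPLAIN
SELECT sourceip, sum(adrevenue) AS totalrevenue
FROM hive.default.uservisits_s3
GROUP BY sourceip;
```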
Measuring Performance
To generate load on Hive and Presto, we used the following benchmarks:
- HiBench, https://github.com/intel-hadoop/HiBench
- TPC-H, http://www.tpc.org/tpch
Intel HiBench is a big data benchmark suite that helps to evaluate different big data products in terms of speed. It includes three Hive workloads, which were developed based on the SIGMOD ’09 paper “A Comparison of Approaches to Large-Scale Data Analysis”.
TPC-H is an OLAP (Online Analytical Processing) workload that measures analytic query performance in a data warehouse context. Presto can run unmodified TPC-H queries (which are ANSI SQL compliant), and it has a TPC-H connector that can generate the TPC-H dataset. Hive cannot run TPC-H queries directly, but we found a couple of Hive-QL implementations of TPC-H on GitHub and used one of them.
HiBench workloads provide very simple write-heavy and read-heavy queries. They help us understand the differences between the products in terms of base read/write performance.
TPC-H provides complex read-heavy queries, which give a better idea of real-world performance differences.
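To illustrate the kind of complexity involved, TPC-H query Q1 (the pricing summary report, shown here with the standard 90-day substitution parameter) is representative of the read-heavy analytic queries in the suite, and Presto runs it unmodified.

```sql
-- TPC-H Q1 (pricing summary report), with the standard 90-day parameter.
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity)                                       AS sum_qty,
    sum(l_extendedprice)                                  AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity)                                       AS avg_qty,
    avg(l_extendedprice)                                  AS avg_price,
    avg(l_discount)                                       AS avg_disc,
    count(*)                                              AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
```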
HiBench Hive Workloads
The HiBench dataset was created using HiBench’s data generator and stored in SequenceFile format. We set hibench.scale.profile to large and hibench.default.map.parallelism to 48, which produced the following amount of input data:
- ~11 million records for the 2 tables combined
- ~1.8 GB for all 48 storage files combined (~37MB each)
We ported the Hive-QL queries to SQL that Presto can execute. Presto accessed the Hive tables via its Hive connector.
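The changes were mostly mechanical. As one hypothetical example of the kind of difference involved, Hive-QL’s INSERT OVERWRITE syntax is not accepted by Presto, so a write query can be re-expressed as a plain INSERT into an already-empty output table (table and catalog names below are placeholders, not the exact HiBench tables).

```sql
-- Hive-QL form (HiBench-style, hypothetical table names):
INSERT OVERWRITE TABLE uservisits_copy
SELECT * FROM uservisits;

-- Ported Presto form: Presto does not accept INSERT OVERWRITE, so the same
-- write is expressed as a plain INSERT into an initially empty output table,
-- addressed through the Hive connector (assumed hive.default catalog/schema).
INSERT INTO hive.default.uservisits_copy
SELECT * FROM hive.default.uservisits;
```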
We measured the performance by manually executing each query and recording the query time in milliseconds. Here are the results:
HiBench Query Time
Findings
Despite its name, the “Scan” query not only reads all records from the input table but also copies all of them to the output table, so it is write-heavy.
Summary of HiBench results:
- Input data size ~11 million records (~1.8GB), stored in SequenceFile format.
- HiBench is good for measuring base read/write performance.
- For the write-heavy query, Presto+S3 is 4.2 times faster than Hive+HDFS.
- For the read-heavy queries, Presto+S3 is on average 15.1 times faster than Hive+HDFS.
TPC-H Benchmark
The TPC-H dataset was created using Presto’s TPC-H connector and stored in ORC (Optimized Row Columnar) format with ZLIB compression. ORC is similar to Parquet and is widely used with Hive. We could not use Parquet because Hive 1.1 does not support the “date” column type in Parquet files.
We chose a TPC-H scale factor of 100 to generate a 100GB dataset, which amounted to roughly 866 million records for the 8 tables combined. Presto was able to run the TPC-H queries unmodified. It accessed the Hive tables storing the TPC-H dataset via the Hive connector.
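As a sketch of how a table can be materialized this way with Presto’s TPC-H connector, the statement below writes one TPC-H table at scale factor 100 into a Hive/ORC table. The target schema name is an assumption, and the ORC compression codec (e.g. ZLIB) is governed by the Hive connector’s configuration rather than the statement itself.

```sql
-- Sketch: materialize one TPC-H table at scale factor 100 into a Hive/ORC
-- table using Presto's TPC-H connector (tpch catalog, sf100 schema).
-- The target schema name (tpch_sf100) is a hypothetical placeholder;
-- ORC compression (e.g. ZLIB) is set in the Hive connector configuration.
CREATE TABLE hive.tpch_sf100.lineitem
WITH (format = 'ORC')
AS SELECT * FROM tpch.sf100.lineitem;
```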
TPC-H Query Time
Findings
- All TPC-H queries are read-heavy.
- We did not measure Hive+S3 performance because, based on the HiBench results, we expected it to be the slowest of the four combinations, so the result was of little additional interest.
- We did not run query “q19” on the 100GB dataset because Hive and Presto returned different query results on the 1GB dataset.
- We skipped query “q22” because it failed on Hive with the 100GB dataset.
- As described earlier, we set the Presto coordinator’s and workers’ query.max-memory-per-node to 12GB to process queries in memory. Most of the queries completed within 8GB of memory per node, but “q09” and “q21” required 10GB and 12GB of memory per node, respectively.
Summary of TPC-H results:
- Input data size: ~866 million records (~100GB), stored in ORC format with ZLIB compression.
- TPC-H is good for measuring real world performance of analytic queries.
- Presto+S3 is on average 11.8 times faster than Hive+HDFS.
Why Presto is Faster than Hive in the Benchmarks
Presto is an in-memory query engine, so it does not write intermediate results to storage (S3). As a result, Presto makes far fewer S3 requests than Hive does. In addition, unlike Hive M/R jobs, Presto does not perform file rename operations after writes. Rename is a very expensive operation in an S3 storage system because it is implemented as a copy followed by a delete. Finally, Hive’s architecture requires it to wait between stages (M/R jobs), making it difficult to keep all CPU and disk resources fully utilized.
Image source: https://blog.treasuredata.com/blog/2015/03/20/presto-versus-hive/
Conclusion
Aggregating the benchmark results, the relative performance from best to worst was:
- Presto+HDFS (best)
- Presto+S3
- Hive+HDFS
- Hive+S3 (worst)
As noted above, both Presto configurations substantially outperformed the two Hive configurations, roughly by a factor of 10. In addition, the Presto+S3 combination showed very similar performance results to the best Presto+HDFS combination, demonstrating that Hadoop users can achieve the flexibility and cost advantages of separating storage and compute with S3 software, without any significant tradeoff in performance.