CAT: A Cost-Aware Translator for SQL-query workflow to MapReduce jobflow

Authors:

Highlights:

Abstract

MapReduce is undoubtedly the most popular framework for large-scale processing and analysis of vast data sets on clusters of machines. To make MapReduce easier to use, SQL-like declarative languages and SQL-to-MapReduce translators have attracted increasing attention recently. A SQL-to-MapReduce translator automatically generates the MapReduce jobflow for each SQL query submitted by users, which significantly simplifies the interface between users and systems. Although a plethora of translators have been developed, the auto-generated MapReduce programs still suffer from severe inefficiency. In this paper, we attempt to address this challenge by developing a novel Cost-Aware Translator (CAT). CAT has two notable features. First, it defines two intra-SQL correlations, Generalized Job Flow Correlation (GJFC) and Input Correlation (IC), based on which a set of looser merging rules is introduced. Building on these rules, both Top-Down (TD) and Bottom-Up (BU) merging strategies are proposed and integrated into CAT. Second, it adopts a cost estimation model for MapReduce jobflows to guide the selection of the more efficient of the jobflows generated by the TD and BU merging strategies. Finally, comparative experiments on the TPC-H benchmark demonstrate the effectiveness and scalability of CAT.

Keywords: MapReduce, SQL-to-MapReduce, Intra-SQL correlations, Cost estimation model, Hadoop, Query

Article history: Received 13 July 2014, Revised 17 October 2015, Accepted 28 December 2015, Available online 19 January 2016, Version of Record 17 March 2016.

Official paper URL: https://doi.org/10.1016/j.datak.2015.12.004