Building Pipelines for Heterogeneous Execution Environments for Big Data Processing

Authors: Wu, Dongyao; Zhu, Liming; Xu, Xiwei (Sherry); Sakr, Sherif; Sun, Wei (Daniel); Lu, Qinghua
Publication date: 2016-03-01
Type: Journal Article
Journal: IEEE Software
Volume: 33
Issue: 2
Pages: 60-67

Many real-world data analysis scenarios require the pipelining and integration of multiple (big) data processing and analytics jobs, which are often executed in heterogeneous environments such as MapReduce, Spark, and R/Python/Bash scripts. For such a pipeline, a large amount of glue code has to be written to move data across environments, and maintaining and evolving such pipelines is difficult. Existing pipeline frameworks that try to solve these problems are usually built on top of a single environment and/or require the original jobs to be rewritten against a new API or paradigm. In this article, we propose Pipeline61, a framework that supports building data pipelines involving heterogeneous execution environments. Pipeline61 reuses the existing code of jobs already deployed in different environments and also provides version control and dependency management to deal with typical software engineering issues. A real-world case study shows the effectiveness of Pipeline61.
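
The abstract describes wrapping already-deployed jobs from different environments (Spark, MapReduce, shell/Python/R scripts) behind a single pipeline abstraction instead of writing ad hoc glue code. The following is a minimal sketch in Scala of what such a composition might look like; the Stage/Pipeline names, the spark-submit wrapping, and the script and jar names are illustrative assumptions, not the actual Pipeline61 API.

    // Minimal sketch: heterogeneous jobs behind one pipeline abstraction.
    // All names here are hypothetical; this is not the Pipeline61 API.
    import scala.sys.process._

    // Every stage reads from an input path and writes to an output path,
    // no matter which execution environment runs the underlying job.
    trait Stage {
      def name: String
      def run(input: String, output: String): Unit
    }

    // Wraps an already-deployed Spark job by shelling out to spark-submit,
    // so the existing jar does not have to be rewritten against a new API.
    final case class SparkStage(name: String, jar: String, mainClass: String) extends Stage {
      def run(input: String, output: String): Unit = {
        val exit = Seq("spark-submit", "--class", mainClass, jar, input, output).!
        require(exit == 0, s"Spark stage '$name' failed")
      }
    }

    // Wraps an existing Bash/Python/R script invoked as an external process.
    final case class ScriptStage(name: String, command: String) extends Stage {
      def run(input: String, output: String): Unit = {
        val exit = Seq(command, input, output).!
        require(exit == 0, s"Script stage '$name' failed")
      }
    }

    // The pipeline threads each stage's output path into the next stage,
    // replacing the glue code that would otherwise move data across environments.
    final case class Pipeline(stages: List[Stage]) {
      def execute(input: String, workDir: String): String =
        stages.zipWithIndex.foldLeft(input) { case (in, (stage, i)) =>
          val out = s"$workDir/${i}_${stage.name}"
          stage.run(in, out)
          out
        }
    }

    object Example extends App {
      val pipeline = Pipeline(List(
        ScriptStage("clean", "./clean_logs.sh"),              // hypothetical preprocessing script
        SparkStage("aggregate", "analytics.jar", "agg.Main"), // hypothetical deployed Spark jar
        ScriptStage("report", "./report.py")                  // hypothetical Python reporting step
      ))
      pipeline.execute("hdfs:///raw/logs", "hdfs:///tmp/pipeline-work")
    }

In this sketch the existing jobs are reused as-is by invoking them as external processes, which mirrors the paper's motivation of avoiding rewrites against a new API or paradigm; version control and dependency management of the stages, which the abstract also mentions, are not shown here.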


Publisher: IEEE
Keywords: big data, pipeline, Spark, MapReduce
Language: English
Identifier: nicta:9178

Wu, Dongyao; Zhu, Liming; Xu, Xiwei (Sherry); Sakr, Sherif; Sun, Wei (Daniel); Lu, Qinghua. Building Pipelines for Heterogeneous Execution Environments for Big Data Processing. IEEE Software. 2016-03-01; 33(2):60-67.


