Enhancing Spark Performance in Docker Container Clusters through Machine Learning-Based Parameter Tuning and Dynamic Scaling

Emily Carter Johnson

Abstract

Spark-based applications are now widely used, and proper configuration can significantly improve the execution efficiency of Spark jobs. Spark parameter tuning has been studied extensively for virtual machine clusters. More recently, containers have emerged as a prominent cloud computing infrastructure and are increasingly used in service clusters, so parameter tuning for Spark in container clusters also needs to be investigated. This paper addresses Spark parameter configuration in Docker container clusters and introduces a parameter tuning method called ContainerOpt. ContainerOpt uses machine learning to learn and predict job performance under different parameter combinations, and it incorporates an automatic node scaling mechanism to optimize the performance of jobs with large inputs. To better balance job execution time against resource utilization, a performance representation model based on both time and resources is proposed, replacing the traditional model that considers execution time alone. Experimental results show that the tuning method improves execution efficiency by 50% compared to the default configuration, validating its effectiveness.
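
The idea of scoring a configuration on both time and resources, rather than on execution time alone, can be illustrated with a minimal sketch. This is not the ContainerOpt model, whose exact form is not given in the abstract. The weighted-sum score, the `time_weight` parameter, the `Observation` type, and the `toy_predict` function (a stand-in for a trained performance predictor) are all assumptions made for illustration. Only the two Spark property names are real configuration keys.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

Config = Dict[str, int]


@dataclass(frozen=True)
class Observation:
    """Predicted or measured outcome of one job run (hypothetical type)."""
    exec_time_s: float       # wall-clock execution time in seconds
    cpu_core_seconds: float  # resources consumed: cores multiplied by seconds


def combined_cost(obs: Observation, baseline: Observation,
                  time_weight: float = 0.5) -> float:
    """Weighted sum of time and resource use, each normalized to a baseline.

    Assumed scoring rule: lower is better, and the baseline itself scores 1.0.
    """
    if not 0.0 <= time_weight <= 1.0:
        raise ValueError("time_weight must be in [0, 1]")
    time_ratio = obs.exec_time_s / baseline.exec_time_s
    resource_ratio = obs.cpu_core_seconds / baseline.cpu_core_seconds
    return time_weight * time_ratio + (1.0 - time_weight) * resource_ratio


def toy_predict(config: Config) -> Observation:
    """Stand-in for a learned predictor: fixed overhead plus parallel work."""
    total_cores = (config["spark.executor.instances"]
                   * config["spark.executor.cores"])
    exec_time = 1200.0 / total_cores + 20.0
    return Observation(exec_time, exec_time * total_cores)


def pick_best(candidates: Iterable[Config],
              predict: Callable[[Config], Observation],
              baseline: Observation,
              time_weight: float = 0.5) -> Config:
    """Return the candidate configuration with the lowest combined cost."""
    return min(candidates,
               key=lambda c: combined_cost(predict(c), baseline, time_weight))


DEFAULT: Config = {"spark.executor.instances": 2, "spark.executor.cores": 1}
CANDIDATES = [
    DEFAULT,
    {"spark.executor.instances": 4, "spark.executor.cores": 1},
    {"spark.executor.instances": 4, "spark.executor.cores": 2},
    {"spark.executor.instances": 8, "spark.executor.cores": 2},
    {"spark.executor.instances": 16, "spark.executor.cores": 2},
]

if __name__ == "__main__":
    baseline = toy_predict(DEFAULT)
    for weight in (0.0, 0.5, 1.0):
        best = pick_best(CANDIDATES, toy_predict, baseline, weight)
        print(f"time_weight={weight}: {best}")
```

Under this toy predictor, a score that looks only at time always favors the largest allocation, while a mixed weight settles on a mid-sized one. That is the kind of trade-off a time-and-resource model is meant to expose.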

Article Details

How to Cite
Johnson, E. C. (2022). Enhancing Spark Performance in Docker Container Clusters through Machine Learning-Based Parameter Tuning and Dynamic Scaling. Journal of Computer Science and Software Applications, 2(1), 16–24. https://doi.org/10.5281/jcssa.v2i1.49
