Analysis and Comparison of Dockerized and Standalone Apache Spark Configurations for Efficient Distributed Data Processing
Published in the proceedings of ASYU 2024: International Symposium on Computer and Information Sciences, 2024
Download paper here
Apache Spark, a powerful distributed computing framework, has become key to handling large-scale data processing tasks in many applications, including signal processing. However, performance, cost, and ease-of-use considerations differ between the development and deployment stages. This paper benchmarks Apache Spark on different local setups in terms of performance and elaborates on the costs of alternative cloud deployments. The performance of different Spark master-slave configurations, with and without Docker, is evaluated across varying numbers of worker nodes, cores, memory sizes, and executors. The Spark WordCount benchmark is run on Wikipedia datasets ranging from 1 GB to 25 GB. The results reveal that, in local setups, Docker adds parameter complexity as well as performance overheads; therefore, a "no Docker" setup is the better choice. We also observe that I/O bottlenecks dominate in local setups. These results can help practitioners choose optimal setups for different Development and Operations (DevOps) and big data processing scenarios.
Recommended citation: Alain Ndigande, Ismail Ari, and Sedat Özer, "Analysis and Comparison of Dockerized and Standalone Apache Spark Configurations for Efficient Distributed Data Processing," in Proceedings of ASYU 2024: International Symposium on Computer and Information Sciences, 2024.