DOI: 10.1002/cpe.70808 ISSN: 1532-0626

An Empirical Comparison of Kryo Serialization Optimization Strategies in Apache Spark: Cloud‐Based Evaluation at Scale

Lennah Etyang, Lawrence Nderu, Waweru Mwangi

ABSTRACT

Serialization is a major performance bottleneck in Apache Spark, accounting for up to 40% of total execution time in data processing workloads. Although Kryo serialization is substantially faster than Java's native serialization, configuring it optimally is difficult because of the complexity of parameter interactions and workload‐dependent performance behavior. In this research, we conducted a comprehensive comparative study of four Kryo optimization strategies deployed on production‐scale AWS EMR infrastructure. Using the TPC‐DS benchmark at a 100 GB scale, we evaluated the following configurations: (1) Java serialization baseline, (2) default Kryo configuration, (3) rule‐based optimization based on best practices, and (4) a new adaptive framework that dynamically changes parameters based on workload characteristics. To our knowledge, this is the first systematic comparative evaluation of Kryo optimization strategies that integrates workload‐aware adaptation with cost‐performance analysis, distinguishing our work from black‐box automatic tuning frameworks such as SparkTune, AutoSpark, and DeepTune. Our experimental results indicate that the adaptive framework executes 38.4% faster than Java serialization (p < 0.001, Cohen's d = 1.42, 95% CI [1.11, 1.73]) and achieves a 26.8% improvement over default Kryo (p = 0.0021). Ablation studies reveal that buffer sizing contributes 34.2% of the total improvement, compression selection 29.3%, class registration 17.1%, and adaptivity itself 23.9%, together explaining 65.1% of performance variance (R² = 0.651). The study also shows that optimization effectiveness depends strongly on the workload: join‐heavy queries gain 52.3%, whereas scan‐intensive operations gain only 18.7%.
Our economic analysis finds that the performance gains translate directly into cloud cost reductions: total spending in our model fell by 38.7% relative to baseline configurations, though these savings accrue only when clusters are elastically scaled. A cost‐equivalent baseline demonstrates that achieving the same performance through resource scaling alone would require 62.5% more worker nodes, highlighting the value of optimization beyond simple horizontal scaling. These results challenge one‐size‐fits‐all optimization models and provide practitioners with evidence‐based configuration recommendations that account for workload variability and cloud infrastructure behavior. Validated on held‐out TPC‐DS queries, the recommendations achieve 90.5% accuracy.
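
As a rough illustration of the three tuning dimensions named above (buffer sizing, compression selection, and class registration), the following minimal sketch maps a coarse workload class to a set of Spark properties. The property keys are standard Spark configuration names, but the function names, the workload labels, and every numeric value are hypothetical placeholders chosen for illustration; they are not the settings or the adaptive algorithm evaluated in this study.

```python
from typing import Dict, List

KRYO = "org.apache.spark.serializer.KryoSerializer"


def kryo_conf(workload: str) -> Dict[str, str]:
    """Return illustrative Spark properties for a coarse workload class.

    The workload labels and all values below are assumptions made for
    this sketch, not tuned or measured settings.
    """
    conf = {
        "spark.serializer": KRYO,
        # Class registration: when "true", Kryo rejects unregistered classes.
        "spark.kryo.registrationRequired": "false",
    }
    if workload == "join_heavy":
        # Shuffle-dominated queries serialize many records, so this
        # sketch assumes larger buffers and a fast codec.
        conf["spark.kryoserializer.buffer"] = "256k"
        conf["spark.kryoserializer.buffer.max"] = "512m"
        conf["spark.io.compression.codec"] = "lz4"
    elif workload == "scan_intensive":
        # Scan-dominated queries shuffle less, so this sketch keeps
        # Spark's default buffer sizes and assumes a denser codec.
        conf["spark.kryoserializer.buffer"] = "64k"
        conf["spark.kryoserializer.buffer.max"] = "64m"
        conf["spark.io.compression.codec"] = "zstd"
    else:
        raise ValueError(f"unknown workload class: {workload!r}")
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a property dict into spark-submit style --conf arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args
```

The resulting list can be appended to a `spark-submit` command line, or the dictionary can be applied key by key when building a Spark session. A rule‐based strategy would fix one such mapping in advance, whereas an adaptive strategy would revise it from observed workload characteristics.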
