Apache Spark is among the most popular and widely used open source distributed computing frameworks because it lets data engineers parallelize the processing of large amounts of data across a cluster of machines. That matters for anyone tasked with maintaining complex data environments.
Spark provides a tremendous advantage as a compute resource for data operations because it aligns with the things that make data ops valuable: it is well suited to machine learning and AI workloads, it handles batch processing at scale as well as near-real-time processing, and it is adept at operating within many different types of environments.
Spark doesn’t manage these clusters of machines itself but instead relies on a cluster manager (also known as a scheduler). Most companies have traditionally used the Java Virtual Machine (JVM)-based Hadoop YARN to manage their clusters. But with the dramatic rise of Kubernetes and cloud-native computing, many organizations are moving away from YARN and toward Kubernetes to manage their Spark clusters. Spark on Kubernetes has been generally available since the Apache Spark 3.1 release in March 2021.
Interest in YARN Is Declining
Spark’s popularity has contributed to declining interest in the Hadoop stack in general. Furthermore, data professionals have struggled to make Hadoop YARN fit into modern data environments -- especially when it comes to ensuring security, integrating with new data resources, or adapting to cloud and hybrid environments.
Into this modern data world comes the ease and agility of Kubernetes, which has accelerated Hadoop’s fall from grace. Spark, however, remains an important piece of the modern data stack. It’s in that spirit that we look at how Spark users -- especially those in roles like data engineering who are tasked with managing and validating data operations in complex environments -- can transition from a reliance on Hadoop YARN to Kubernetes as an effective solution for operating Spark applications.
Benefits of Kubernetes
What makes Kubernetes such a popular choice for managing containerized applications? Building on more than a decade of running production workloads at Google, Kubernetes offers numerous potential benefits:
- Scalability: Data ops teams can easily scale clusters up and down to support a near-unlimited number of workloads.
- Open Source: Kubernetes was originally developed by Google but is now maintained by the Cloud Native Computing Foundation (CNCF).
- Flexibility: Whether you’re in the cloud, multiple clouds, hybrid, or on-prem, Kubernetes provides flexibility to adapt to your infrastructure.
Benefits of Spark on Kubernetes
The benefits of transitioning from one technology to another must outweigh the cost of switching, and moving from YARN to Kubernetes can deliver both financial and operational benefits. Many companies are finding that Kubernetes offers better dependency management, better resource management, and a rich ecosystem of integrations.
Dependency Management: Spark has traditionally struggled with dependency management, forcing data scientists to spend valuable time on flaky init scripts and compiling C libraries instead of focusing on their core work. Running on Kubernetes, however, lets them package all of their dependencies into container images and move those images easily from laptop to production.
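As a rough illustration, here is a minimal PySpark sketch of pointing a job at a custom container image that already bundles its Python packages and native libraries; the API server address, namespace, service account, and image name are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: run Spark on Kubernetes with a custom image that already
# bundles the job's Python packages and native libraries.
# The master URL, namespace, service account, and image name are placeholders.
spark = (
    SparkSession.builder
    .appName("deps-in-a-container")
    .master("k8s://https://my-k8s-apiserver:6443")            # hypothetical API server
    .config("spark.kubernetes.namespace", "spark-jobs")        # hypothetical namespace
    .config("spark.kubernetes.container.image",
            "registry.example.com/team/spark-py:3.1.1-myjob")  # image built from your own Dockerfile
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .getOrCreate()
)

spark.range(10).show()
spark.stop()
```

The image itself is built once (for example, from one of the official Spark base images) and reused everywhere, which is what removes the need for per-environment init scripts.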
Resource Management: Containers also allow for better isolation of workloads, which improves the ability to bin pack them. With YARN, you either run multiple clusters or compromise on isolation, and both choices can increase costs: transient or duplicate clusters mean paying for setup and teardown of every instance, and they reduce resource sharing. Spark on Kubernetes, on the other hand, allows different versions of Spark and Python to run in the same cluster and enables seamless resource sharing: as one job ends, another can be scheduled. It even allows cluster autoscaling to further optimize costs.
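To make that concrete, here is a hedged sketch of the kind of executor sizing and autoscaling settings involved; the values are illustrative placeholders, not recommendations.

```python
from pyspark.sql import SparkSession

# Illustrative executor sizing and autoscaling settings for Spark on Kubernetes.
# All values are placeholders; tune them for your own cluster and workloads.
spark = (
    SparkSession.builder
    .appName("binpack-friendly-job")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.kubernetes.executor.request.cores", "1800m")  # pod CPU request just under the limit
    .config("spark.kubernetes.executor.limit.cores", "2")
    # Dynamic allocation lets finished executors release their pods so other
    # jobs (or the cluster autoscaler) can reclaim the capacity.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)
```

Because Kubernetes has no external shuffle service for Spark, shuffle tracking is what allows dynamic allocation to release idle executors safely.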
Ecosystem: Because modern data environments are so diverse and involve multiple layers of complexity, Kubernetes provides a broad framework within which to operate. We’ve already discussed Spark on Kubernetes, but there are many other powerful open source add-ons for management and monitoring, like Prometheus for time-series data and Fluentd for log aggregation. Acceldata Torch can be installed on an existing Kubernetes cluster and supports installation with embedded Kubernetes. In addition, Kubernetes Resource Quotas and Namespaces bring greater multitenancy control over how applications use and share system resources, and role-based access control (RBAC) gives fine-grained permissions to resources and data, improving security.
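To keep the examples in one language, here is a sketch of creating a dedicated namespace and resource quota with the official Kubernetes Python client; the names and limits are placeholders, and in practice the same objects are often applied as YAML with kubectl.

```python
from kubernetes import client, config

# Sketch: carve out a namespace and quota for Spark jobs so teams share the
# cluster without starving each other. Names and limits are placeholders.
config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name="spark-jobs"))
)

core.create_namespaced_resource_quota(
    namespace="spark-jobs",
    body=client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="spark-jobs-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={"requests.cpu": "40", "requests.memory": "160Gi", "pods": "100"}
        ),
    ),
)
```

If all of these benefits are convincing, the question then becomes how to make the migration successful.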
4 Steps for Migrating from YARN to Kubernetes
There are four key steps to switching from YARN to Kubernetes:
1. Determining the complexity of a job and YARN requirements
When migrating to a new technology, it is easiest to start with the least complex jobs -- the low-hanging fruit. Simple YARN container definitions can be quickly translated into Kubernetes resource assignments, such as the number of CPUs or the amount of memory. A great resource for this is Princeton Research Computing's guide on Tuning Spark Applications. More complex YARN scheduler definitions, like the fair scheduler or capacity scheduler, should be moved last, after considering how the equivalent Kubernetes resource assignments will be defined.
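As a hedged sketch of what that translation can look like, the settings below map a simple YARN-era job definition onto its Spark-on-Kubernetes counterparts; the values, API endpoint, and image name are illustrative placeholders.

```python
# Illustrative mapping of a simple YARN job's resource settings to their
# Spark-on-Kubernetes counterparts. Values are placeholders, not recommendations.

yarn_style_conf = {
    "spark.master": "yarn",
    "spark.executor.instances": "10",
    "spark.executor.cores": "2",
    "spark.executor.memory": "4g",
    "spark.executor.memoryOverhead": "512m",
    "spark.yarn.queue": "analytics",          # queue-based sharing under YARN
}

k8s_style_conf = {
    "spark.master": "k8s://https://my-k8s-apiserver:6443",   # hypothetical API server
    "spark.executor.instances": "10",
    "spark.executor.cores": "2",
    "spark.executor.memory": "4g",
    "spark.executor.memoryOverhead": "512m",                  # same key still applies
    "spark.kubernetes.executor.request.cores": "2",           # pod-level CPU request
    "spark.kubernetes.namespace": "analytics",                # namespaces and quotas replace queues
    "spark.kubernetes.container.image": "registry.example.com/spark-py:3.1.1",
}
```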
2. Evaluating data connectivity needs
Once the order of applications to migrate has been determined, review how each one connects to its data. Kubernetes offers many ways to access data, such as through existing HDFS clusters, S3 API-enabled storage, or cloud object storage providers. This migration can be a good opportunity to define where the data for each business use case or data type should be stored.
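For example, here is a hedged sketch of connecting to S3 API-enabled object storage through the S3A connector; the endpoint, bucket path, and credentials provider are placeholders, and the hadoop-aws dependencies must already be available to Spark.

```python
from pyspark.sql import SparkSession

# Sketch: read from S3 API-enabled object storage through the S3A connector.
# The endpoint, bucket path, and credentials provider are placeholders; the
# hadoop-aws and AWS SDK jars must be on Spark's classpath.
spark = (
    SparkSession.builder
    .appName("s3a-connectivity-check")
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example-objectstore.com")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/events/2021/")  # hypothetical path
print(df.count())
spark.stop()
```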
3. Analyzing compute and storage latency
If there is a change to how data is accessed, it is also important to evaluate how this affects compute and storage latency. One way to test this is to compare resilient distributed dataset (RDD) read and write times. This can be done by turning off all “MEMORY_ONLY” settings on RDDs and comparing the current and new implementations. This provides a baseline of “DISK_ONLY” performance, which should be like-for-like with the memory-enabled RDDs’ performance (assuming the same amount of resources is assigned in Kubernetes).
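One hedged way to sketch such a baseline check in PySpark, with a placeholder dataset standing in for a representative RDD from the job being migrated:

```python
import time
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("rdd-latency-baseline").getOrCreate()
sc = spark.sparkContext

# Placeholder dataset; in practice, use a representative RDD from the job
# being migrated and run the same script on both environments.
rdd = sc.parallelize(range(5_000_000), numSlices=100)

def timed_materialize(rdd, level):
    persisted = rdd.persist(level)
    start = time.time()
    persisted.count()          # first pass computes and writes to the chosen storage level
    write_s = time.time() - start
    start = time.time()
    persisted.count()          # second pass reads back from that storage level
    read_s = time.time() - start
    persisted.unpersist()
    return write_s, read_s

print("DISK_ONLY   (write, read):", timed_materialize(rdd, StorageLevel.DISK_ONLY))
print("MEMORY_ONLY (write, read):", timed_materialize(rdd, StorageLevel.MEMORY_ONLY))
spark.stop()
```

Running the same script against the existing YARN cluster and the new Kubernetes cluster gives two sets of numbers that can be compared directly.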
4. Auditing monitoring and security policies
Finally, when you change how data is accessed, transmitted, and processed, your monitoring and security policies must be audited and kept up to date as well.
Time to Make the Switch to Kubernetes?
Switching to Spark on Kubernetes can yield big benefits for busy data engineers like you. From simpler dependency and resource management to value-added integrations and numerous cost-saving opportunities, Kubernetes is rapidly becoming a not-so-secret weapon for managing containers.
Thinking about migrating from YARN to Kubernetes? Check out our tips for monitoring Spark on Kubernetes.