Apache Spark is an open-source framework for large-scale data processing. A standalone Spark cluster consists of a master node and multiple worker nodes. However, when the load is unpredictable and the ability to scale up and down becomes crucial, it is better to migrate standalone Spark applications to Kubernetes. With an extensible open-source orchestration platform like Kubernetes, master and worker nodes can be managed easily in isolation with better resource management. Although there are several ways to deploy a standalone Spark cluster, in this blog we will focus on Kubernetes, which lets us gain maximum efficiency from containers that can run anywhere.

Below is a diagrammatic representation of the analytics system architecture on a standalone Spark cluster (VM-based architecture):

Technology Used:
  1. Apache Spark 3.1.1
  2. Elasticsearch
  3. RabbitMQ
  4. Python Scripts & Crontab
Here is the data flow:
  • The business services and apps publish analytics events to a specific RabbitMQ queue.
  • The Spark streaming application listens to the same queue and writes the events to Elasticsearch.
  • Data-analysis jobs are triggered by Crontab every midnight: crontab invokes the Python script, which reads the current date and the last analyze date, computes the analyze range, and submits the Spark jobs accordingly (see the sketch after this list).
  • The Spark job massages and aggregates the data at different time intervals, such as hourly and daily, and stores the analyzed data in the relevant Elasticsearch indexes.
  • The dashboard queries the analyzed data to render its graphs.
  • The Angular app shows the analyzed data in charts and/or tabular format as needed.
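For illustration, here is a minimal sketch of such a scheduler script; the state file, date format, main class, and jar path are assumptions, not the actual production code.

# scheduler.py - illustrative sketch of the crontab-invoked scheduler
import subprocess
from datetime import datetime, timedelta

STATE_FILE = "/opt/analytics/last_analyze_date"   # assumed location of the last-run marker
DATE_FMT = "%Y-%m-%d %H:%M:%S"

def read_last_analyze_date():
    # fall back to yesterday's midnight if the script has never run before
    try:
        with open(STATE_FILE) as f:
            return datetime.strptime(f.read().strip(), DATE_FMT)
    except FileNotFoundError:
        return (datetime.now() - timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)

def main():
    # compute the analyze range: from the last analyze date up to today's midnight
    end = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
    start = read_last_analyze_date()

    # submit the Spark analyze job for the computed range (class name and jar path are illustrative)
    subprocess.run([
        "/opt/apache/spark/bin/spark-submit",
        "--class", "com.example.analytic.SparkAnalyze",
        "/opt/spark/jars/analyze-assembly-1.0.0.jar",
        start.strftime(DATE_FMT), end.strftime(DATE_FMT),
    ], check=True)

    # persist the new watermark for the next run
    with open(STATE_FILE, "w") as f:
        f.write(end.strftime(DATE_FMT))

if __name__ == "__main__":
    main()
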
Migration to Spark managed by Kubernetes

Spark can run on clusters managed by Kubernetes. Here are the steps to create and run the applications on Spark managed by Kubernetes.

1. Code Changes:

Ensure that the Spark session is stopped at the end of the application; otherwise, the Spark application pods will keep running indefinitely.

session.stop()
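
The article's applications are packaged as a JVM assembly jar, but the same pattern in PySpark looks roughly like the sketch below (the app name and workload are placeholders):

# illustrative PySpark sketch: always stop the session, even when the job fails
from pyspark.sql import SparkSession

session = SparkSession.builder.appName("spark-analyze").getOrCreate()
try:
    pass   # ... read, massage, and aggregate data here ...
finally:
    session.stop()   # without this, the application pods keep running after the job finishes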

Create an application jar; it will be used later when building the Docker image.

2. Spark setup:

Download the Spark archive from the official site and set it up. The steps below were used on CentOS 7 to set up Spark.

# create directory
mkdir -p /opt/apache
cd /opt/apache

# download the spark 3.1.1
wget https://mirrors.estointernet.in/apache/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz

# unzip spark and move it to spark directory
tar -xzf spark-3.1.1-bin-hadoop2.7.tgz
mv spark-3.1.1-bin-hadoop2.7/ spark

# initialize SPARK_HOME and path variables
export SPARK_HOME=/opt/apache/spark
export PATH="$SPARK_HOME/bin:$PATH"

3. Create Docker Image:

Spark (version 2.3 or higher) ships with a Dockerfile that can be used as-is or customized to an individual application's needs. It can be found in the kubernetes/dockerfiles/ directory.
Spark also ships with a bin/docker-image-tool.sh script that can be used to build and publish the Docker images to use with Kubernetes.

Copy the application jar file to the $SPARK_HOME/jars/ directory; only one application jar is packaged per image. Run the commands below to build and push the Docker image to the repository.

cd $SPARK_HOME/bin

docker-image-tool.sh -r repo.com/path/spark -t spark-analyze-latest build

docker-image-tool.sh -r repo.com/path/spark -t spark-analyze-latest push

We need to create a separate Docker image for each application: in our case, the analyze and streaming applications. These images are referenced in the spark-submit command later.

Architecture:
[Architecture diagram: analytics system on Spark managed by Kubernetes]

Every component from the VM-based system is now run as a separate container. Spark driver and executor pods are created during job execution and terminated upon completion. For example, here is the spark-submit command executed from the scheduler container's Python script:

$SPARK_HOME/bin/spark-submit \
  --master k8s://https://10.2.1.11:6443 \
  --deploy-mode cluster \
  --name spark-streaming \
  --class com.example.analytic.SparkStreaming \
  --conf spark.executor.instances=1 \
  --conf spark.executor.cores=2 \
  --conf spark.kubernetes.submission.waitAppCompletion=false \
  --conf spark.kubernetes.namespace=dev \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.container.image.pullSecrets=regcred \
  --conf spark.kubernetes.container.image=repo.com/path/spark:spark-streaming-latest \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.exvolume.mount.path=/opt/spark/work-dir \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.exvolume.options.claimName=spark-streaming-volume-pvc \
  --conf spark.kubernetes.file.upload.path=/opt/apache/spark/upload-temp \
  local:///opt/spark/jars/analyze-assembly-1.0.0.jar '2021-05-14 00:00:00' '2021-05-16 00:00:00'

The scheduler container should have permission to create, delete, watch, list, and get pods, configmaps, and services. In addition, it is responsible for running the jobs, monitoring their status, and marking each job as completed when it finishes successfully, as sketched below.
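
As a rough sketch of that monitoring step, the scheduler could watch the Spark driver pod through the official Kubernetes Python client; the label selector, namespace, and function names below are assumptions, not the actual production code.

# monitor.py - illustrative sketch of watching a Spark driver pod until it finishes
from kubernetes import client, config, watch

config.load_incluster_config()   # the scheduler itself runs inside the cluster
v1 = client.CoreV1Api()

def wait_for_driver(app_name, namespace="dev"):
    # Spark on Kubernetes labels driver pods with spark-role=driver (the app-name label is assumed)
    selector = f"spark-role=driver,spark-app-name={app_name}"
    w = watch.Watch()
    for event in w.stream(v1.list_namespaced_pod, namespace=namespace, label_selector=selector):
        phase = event["object"].status.phase
        if phase in ("Succeeded", "Failed"):
            w.stop()
            return phase

if wait_for_driver("spark-streaming") == "Succeeded":
    print("marking job as completed")   # update the job bookkeeping here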

Conclusion:

Migrating standalone Apache Spark to Kubernetes is comparatively cost-effective and provides significant cost benefits through efficient resource sharing. Neurealm has extensive expertise with container orchestration tools, and our product engineering expertise has contributed to 2000+ successful product releases.

Author

Shivaji Mutkule | Lead Software Engineer

Shivaji has 7+ years of experience in software development. He is an experienced full-stack developer and works on cutting-edge technologies in the healthcare domain. Shivaji has industry experience in Web Development, Analytics, Microservices, DevOps, and Azure Cloud Networking. He completed an M.E. in Computer Science and Engineering at Government College of Engineering, Aurangabad (GECA). His areas of interest include Web Development, Data Mining, and Machine Learning.