
Apache Spark is an open-source framework for large-scale data processing. A standalone Spark cluster consists of a master node and multiple worker nodes. However, as the need to scale up and down becomes crucial under unpredictable load, it is better to migrate standalone Spark applications to Kubernetes. With an extensible open-source orchestration platform like Kubernetes, master and worker nodes can be managed easily in isolation with better resource management. Although there are several ways to deploy a standalone Spark cluster, in this blog we will focus on Kubernetes, which allows us to gain maximum efficiency from containers that can run anywhere.

Below is a diagrammatic representation of the analytics system architecture on a standalone Spark cluster (VM-based architecture):

Technology Used:
  1. Apache Spark 3.1.1
  2. Elasticsearch
  3. RabbitMQ
  4. Python Scripts & Crontab
Here is the data flow:
  • The Business Services and Apps publish Analytics events to a specific queue of RabbitMQ.
  • The Spark application listens to the same queue and dumps the events into Elasticsearch.
  • Data analysis jobs are triggered by crontab every midnight. First, crontab invokes the Python script. The script then calculates the current date and the last analysis date, computes the analysis range, and submits the Spark jobs accordingly (see the crontab sketch after this list).
  • The Spark job performs data massaging and aggregation at different time intervals, such as hourly and daily. The analyzed data is stored in the relevant Elasticsearch indexes.
  • The dashboard queries the analyzed data for its graphs.
  • The Angular app shows the analyzed data in charts and/or tabular format as needed.
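
As a small illustration of the scheduling step above, a crontab entry along these lines would trigger the analysis script every midnight (the script and log paths are hypothetical):

# hypothetical crontab entry: run the analyze scheduler script at 00:00 every day
0 0 * * * /usr/bin/python3 /opt/analytics/analyze_scheduler.py >> /var/log/analyze_scheduler.log 2>&1
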
Migration to Spark managed by Kubernetes

Spark can run on clusters managed by Kubernetes. Here are the steps to create and run the applications on Spark managed by Kubernetes.

1. Code Changes:

Ensure that the Spark session is stopped at the end of the application; otherwise, the Spark application pods will keep running forever.

// stop the Spark session so the driver pod can terminate when the job finishes
session.stop()

Create an application jar that will be used in the Docker image creation.
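
The build tool is not prescribed here; since the jar used later is named analyze-assembly-1.0.0.jar, an sbt-assembly build is one likely setup, in which case the step might look like this sketch (adjust to your own build):

# hypothetical: build a fat/assembly jar containing the application and its dependencies
sbt clean assembly

# the resulting jar (e.g. target/scala-2.12/analyze-assembly-1.0.0.jar) is copied into the Docker image in the next step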

2. Spark setup:

Download the Spark archive from the official site and set it up. The steps below were used on CentOS 7 to set up Spark.

# create directory
mkdir -p /opt/apache
cd /opt/apache

# download the spark 3.1.1
wget https://mirrors.estointernet.in/apache/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz

# unzip spark and move it to spark directory
tar -xzf spark-3.1.1-bin-hadoop2.7.tgz
mv spark-3.1.1-bin-hadoop2.7/ spark

# initialize SPARK_HOME and path variables
export SPARK_HOME=/opt/apache/spark
export PATH="$SPARK_HOME/bin:$PATH"

3. Create Docker Image:

Spark (version 2.3 or higher) ships with a Dockerfile that can be used as-is or customized for an individual application's needs. It can be found in the kubernetes/dockerfiles/ directory.
Spark also ships with a bin/docker-image-tool.sh script that can be used to build and publish Docker images for use with Kubernetes.

Copy the application jar file to the $SPARK_HOME/jars/ directory. Only one application jar file is allowed per image. Run the commands below to build and push the Docker image to the repository.

cd $SPARK_HOME/bin

docker-image-tool.sh -r repo.com/path/spark -t spark-analyze-latest build

docker-image-tool.sh -r repo.com/path/spark -t spark-analyze-latest push

We need to create separate Docker images for each application; in our case, the analyze and streaming applications. These images are used in the spark-submit command later.
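
For example, the streaming image can be built the same way, with its own jar copied in and a different tag (the streaming jar name and path below are hypothetical; the repository path follows the commands above):

# hypothetical: keep only the streaming application jar in $SPARK_HOME/jars/ (one application jar per image)
cp /path/to/streaming-assembly-1.0.0.jar $SPARK_HOME/jars/

cd $SPARK_HOME/bin
docker-image-tool.sh -r repo.com/path/spark -t spark-streaming-latest build
docker-image-tool.sh -r repo.com/path/spark -t spark-streaming-latest push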

Architecture:

Every component from the VM-based system is now converted into a separate container. Spark pods are created during job execution and terminated upon completion. For example, here is the spark-submit command that is executed from the scheduler container's Python script.

$SPARK_HOME/bin/spark-submit \
  --master k8s://https://10.2.1.11:6443 \
  --deploy-mode cluster \
  --name spark-streaming \
  --class com.example.analytic.SparkStreaming \
  --conf spark.executor.instances=1 \
  --conf spark.executor.cores=2 \
  --conf spark.kubernetes.submission.waitAppCompletion=false \
  --conf spark.kubernetes.namespace=dev \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.container.image.pullSecrets=regcred \
  --conf spark.kubernetes.container.image=repo.com/path/spark:spark-streaming-latest \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.exvolume.mount.path=/opt/spark/work-dir \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.exvolume.options.claimName=spark-streaming-volume-pvc \
  --conf spark.kubernetes.file.upload.path=/opt/apache/spark/upload-temp \
  local:///opt/spark/jars/analyze-assembly-1.0.0.jar '2021-05-14 00:00:00' '2021-05-16 00:00:00'
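
Because spark.kubernetes.submission.waitAppCompletion=false makes spark-submit return right after submission, the driver pod can be checked separately. A minimal sketch of that check, relying on the spark-role=driver label that Spark puts on driver pods (the pod name placeholder is hypothetical):

# list driver pods created in the dev namespace
kubectl get pods -n dev -l spark-role=driver

# follow the logs of a specific driver pod (replace <driver-pod-name>)
kubectl logs -n dev <driver-pod-name> -f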

The scheduler container should have permission to create, delete, watch, list, and get pods, configmaps, and services. In addition, it is responsible for running the jobs, monitoring their status, and marking each job as completed upon successful completion.
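
A minimal sketch of granting these permissions with Kubernetes RBAC, assuming the scheduler pod runs in the dev namespace under a hypothetical spark-scheduler service account (the account and role names below are illustrative):

# hypothetical: service account and namespaced role for the scheduler container
kubectl create serviceaccount spark-scheduler -n dev
kubectl create role spark-scheduler-role -n dev --verb=create,delete,watch,list,get --resource=pods,configmaps,services
kubectl create rolebinding spark-scheduler-rb -n dev --role=spark-scheduler-role --serviceaccount=dev:spark-scheduler

The scheduler pod then references this service account, and the driver pods created by spark-submit can be given their own account through the spark.kubernetes.authenticate.driver.serviceAccountName setting.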

Conclusion:

Migrating a standalone Apache Spark cluster to Kubernetes is comparatively cost-effective and provides significant cost benefits through efficient resource sharing. GS Lab has extensive expertise working with container orchestration tools, and our product engineering expertise has helped in 2000+ successful product releases.

   Author

Shivaji Mutkule | Lead Software Engineer

Shivaji has 7+ years of experience in software development. He is an experienced full-stack developer and works on cutting-edge technologies in the healthcare domain. Shivaji has industry experience in Web Development, Analytics, Microservices, DevOps, and Azure Cloud Networking. He has completed an M.E. in Computer Science and Engineering from Government College of Engineering, Aurangabad (GECA). His areas of interest include Web Development, Data Mining, and Machine Learning.