I have worked with Spark and Spark cluster setups multiple times before. I started out with Hadoop MapReduce in Java and then moved to the much more efficient Spark framework. Python is my default choice for coding, so PySpark is my saviour for building distributed code.
But a pain point in Spark or Hadoop MapReduce is setting up the environment: installing Java, then the Hadoop framework, then setting up clusters…. blah blah blah…
I was looking for a simple one-step setup in which I could run a single click/command and just get started with coding on my local system. Once the code is ready, I can simply run the job on a pre-setup cluster (say, over the cloud).
So this article will help you get started with PySpark in your local environment.
For ease of setup, we will create a virtual environment using conda. Simply run the commands below in your terminal:
conda create -n pyspark_local python=3.7
Press y when prompted to confirm.
conda activate pyspark_local
To ensure things are working fine, just check which python/pip the environment is using.
which python
which pip
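Both should point inside the new environment. The exact prefix depends on where conda lives on your machine; with a default miniconda install it would look something like:

~/miniconda3/envs/pyspark_local/bin/python
~/miniconda3/envs/pyspark_local/bin/pip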
pip install pyspark
And voila! It's done!
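As an extra sanity check, you can confirm the package imports cleanly and see which Spark version pip resolved:

python -c "import pyspark; print(pyspark.__version__)"

If this prints a version number without any errors, the installation is good to go.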
Now that you have a PySpark setup, let us write some basic Spark code to check things.
We will be reading a file in PySpark now. So, create a sample.txt with some dummy text to check that things are running fine.
Simply run the command below to start the Spark shell (you can do the same in a Python notebook as well):
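pyspark

The pip install puts this pyspark launcher inside the active conda environment, so no extra PATH setup should be needed. The shell also creates a SparkSession for you up front (available as spark), which the getOrCreate() call below will simply reuse.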
Now let us run the below code.
from pyspark.sql import SparkSession

# build (or reuse) a SparkSession that runs Spark locally
spark = SparkSession.builder \
    .master('local') \
    .appName('firstapp') \
    .config('spark.some.config.option', 'some-value') \
    .getOrCreate()

# make sure the path of the file is correct
text_f = spark.read.text('sample.txt')
print(text_f.first())
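Assuming the first line of your sample.txt is, say, hello spark, the output should look like:

Row(value='hello spark')

spark.read.text loads each line of the file as a row of a single-column DataFrame (the column is named value), and first() returns the first of those rows.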
Hope this gets you excited about running Spark code in Python!
Quiz: Do you know what the role of SparkSession is here? Comment below.