“pip install pyspark”: Getting started with Spark in Python


I have worked with spark and spark cluster setup multiple times before. I started out with hadoop map-reduce in java and then I moved to a much efficient spark framework. Python was my default choice for coding, so pyspark is my saviour to build distributed code.

But a pain point in spark or hadoop mapreduce is setting up the pyspark environment. Having java, then installing hadoop framework and then setting up clusters …. blah blah blah…

I was looking for a simple one step setup process in which i can simply just do one click/command setup and just get started with coding in my local system. Once the code is ready I can simply run the job in a pre-setup cluster. (say over cloud)

So this article is to help you get started with pyspark in your local environment.

Assuming you have conda or python setup in local.

For the purpose of ease, we will be creating a virtual environment using conda. Simply follow the below commands in terminal:

conda create -n pyspark_local python=3.7

Click on [y] for setups.

conda activate pyspark_local

To ensure things are working fine, just check which python/pip the environment is taking.

which python
which pip
pip install pyspark

And voila! Its done!

Now that you have a pyspark setup. Let us write a basic spark code to check things.

We will we reading a file in pyspark now. So, create a sample.txt with some dummy text to check things are running fine.

Simply run the command to start spark shell: (you can do the same in python notebook as well)

pyspark

Now let us run the below code.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local') \ 
    .appName('firstapp') \
    .config('spark.come.config.option','some-value') \
    .getOrCreate()

# make sure the path of the file is correct
text_f = spark.read.text('sample.txt')
print(text_f.first())

Hope this will get you excited for running spark code in python!

For more: https://spark.apache.org/docs/latest/quick-start.html

Quiz: Do you know what is the role of SparkSession here? Comment below.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.