“pip install pyspark”: Getting started with Spark in Python

I have worked with Spark and set up Spark clusters multiple times before. I started out with Hadoop MapReduce in Java and then moved to the much more efficient Spark framework. Python was my default choice for coding, so PySpark is my saviour for building distributed code.

But a pain point in Spark or Hadoop MapReduce is setting up the environment. Having Java, then installing the Hadoop framework and then setting up clusters …. blah blah blah…

I was looking for a simple one-step setup process in which I can run one click/command and get started with coding on my local system. Once the code is ready, I can simply run the job on a pre-setup cluster (say, in the cloud).

So this article is to help you get started with pyspark in your local environment.

I am assuming you already have conda or Python set up locally.

For ease, we will create a virtual environment using conda. Simply run the commands below in a terminal:

conda create -n pyspark_local python=3.7

Type y when prompted to proceed.

conda activate pyspark_local

To ensure things are working fine, just check which python/pip the environment is using, then install pyspark.

which python
which pip
pip install pyspark

And voila! It's done!
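If you would rather double check the environment from Python itself, here is a minimal sketch that uses only the standard library. The helper name `describe_environment` is made up for illustration; it just reports the active interpreter and whether a package can be found.

```python
import importlib.util
import sys


def describe_environment(package="pyspark"):
    """Report which interpreter is running and whether `package` is importable."""
    return {
        "python": sys.executable,            # path of the active interpreter
        "version": sys.version.split()[0],   # version string of that interpreter
        "installed": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the printed path does not point inside the pyspark_local environment, the environment is probably not activated.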

Now that you have a PySpark setup, let us write some basic Spark code to check things.

We will be reading a file in PySpark now. So, create a sample.txt with some dummy text to check that things are running fine.

Simply run the following command to start the Spark shell (you can do the same in a Python notebook as well):

pyspark

Now let us run the below code.

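from pyspark.sql import SparkSession

# build a local SparkSession, or reuse one that is already running
spark = SparkSession.builder \
    .master('local') \
    .appName('firstapp') \
    .config('spark.some.config.option', 'some-value') \
    .getOrCreate()

# make sure the path of the file is correct
text_f = spark.read.text('sample.txt')
# print the first rows to confirm the file was read
text_f.show()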

Hope this will get you excited for running spark code in python!

For more: https://spark.apache.org/docs/latest/quick-start.html

Quiz: Do you know what role SparkSession plays here? Comment below.


Installing Auto-Sklearn Properly using python 3.5+

auto-sklearn installation requires python 3.5 or higher. In addition, it also has dependencies on the packages mentioned here: https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt

A better approach is to have a Python 3.5+ environment and then run pip install auto-sklearn.

  • Check which version/path you are using: which python, which pip
  • Install Python 3.5 or higher, if you don't have it already: steps to follow
  • Once you have the correct version of Python installed, set up a virtual environment for it. Use the commands below to set up a virtual environment:

python3 -m pip install --user virtualenv

python3 -m virtualenv env

source env/bin/activate

  • Finally call pip install auto-sklearn


  • In case you are using anaconda, the following commands will start your virtual env:

conda update conda                   # update your current version of conda
conda create --name py35 python=3.5  # create a virtual env for python 3.5
source activate py35                 # activate the environment
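Once the environment is active, a quick check from Python can confirm that the interpreter meets the 3.5 minimum mentioned above. This is only a small sketch; `meets_minimum` is a hypothetical helper name, not part of auto-sklearn.

```python
import sys

MIN_VERSION = (3, 5)  # minimum Python version stated in this post


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "OK" if meets_minimum() else "too old")
```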

Post your query here in case you are not sure of the steps.

How to begin a Supervised Machine Learning Problem in Kaggle!

Identifying the Problem Type:

It is important in machine learning to understand the problem type first. If the output is continuous, e.g. [1, 23, 4, 5, 6, 5.5, 6.7, ..], it is a regression problem and Linear Regression is a natural starting point. If the output is categorical, e.g. [0, 1, 0, 0, 1…] or [‘High’, ‘low’, ‘Medium’, …], it is a classification problem, so go for Logistic Regression. Since your target labels are either 0 or 1, this is a problem to be worked with Logistic Regression or other classification algorithms (SVM, Decision Tree, Random Forest).
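As a rough illustration of that first check, here is a small heuristic sketch. The function name `guess_problem_type` and the `max_classes` threshold are my own assumptions, not a standard rule; a target made of a few distinct whole numbers would be flagged as classification even if it is really a count, so treat the result as a hint only.

```python
def guess_problem_type(targets, max_classes=20):
    """Heuristic guess: 'classification' for labels, 'regression' for continuous values.

    max_classes is an arbitrary illustrative threshold, not a standard value.
    """
    distinct = set(targets)
    # Text or boolean labels are categorical by nature.
    if any(isinstance(v, (str, bool)) for v in distinct):
        return "classification"
    # A small set of whole numbers (e.g. 0/1) usually encodes classes.
    all_integral = all(float(v).is_integer() for v in distinct)
    if all_integral and len(distinct) <= max_classes:
        return "classification"
    return "regression"


if __name__ == "__main__":
    print(guess_problem_type([0, 1, 0, 0, 1]))
    print(guess_problem_type(["High", "low", "Medium"]))
    print(guess_problem_type([1, 23, 4, 5, 6, 5.5, 6.7]))
```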

Data Cleaning/Exploration:

You must convert your data to a numeric or standardized format for regression. https://realpython.com/python-data-cleaning-numpy-pandas/

Starter Code:

In case you are looking for starter code for your problem, you can find it in Kaggle kernels. Here are a few links:

Overfitting, Cross Validation & Regularization!

Why do machine learning models need regularization?

  • Overfitting is a state where the model is trying too hard to capture the noise in your training dataset. The model fits each point and feature of the visible training set so closely that it fails to understand anything beyond it. This leads to low accuracy on the test set.
  • Overfitting the train set means being specific to the training data. Hence, to have good accuracy on the test set (unknown to the model), the model must generalize.
  • In terms of the bias-variance trade-off, an overfitted model has low bias but high variance: small changes in the training data change the learned model a lot.

Now let us understand cross-validation and regularization:

  • Cross-Validation: One of the ways of avoiding overfitting is using cross-validation, which helps in estimating the error on unseen data and in deciding which parameters work best for your model. Cross-validation is done by building models on sub-samples of the train data and then evaluating them on the held-out sub-samples. Averaging over several splits reduces the effect of randomness in any single split. This is different from regularization, but it is important for choosing the regularization parameter, which I will explain below.
  • Regularization: This is a technique in which an additional term is introduced in the loss function of the learned model to reduce overfitting.
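To make the cross-validation idea concrete, here is a minimal sketch of how the train data can be split into k folds. The function name `k_fold_indices` is made up for this example, and the split is done in order without shuffling to keep it short; in practice the rows are usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, validation) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


if __name__ == "__main__":
    for train, validation in k_fold_indices(10, 3):
        print("train:", train, "validation:", validation)
```

Each row lands in exactly one validation fold, so every data point is used once for evaluation and k-1 times for training.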

Let me explain:

Consider a simple relation for linear regression. Here Y represents the learned relation and β represents the coefficient estimates for different variables or predictors(X).

Y ≈ β0 + β1*X1 + β2*X2 + …+ βp*Xp

A machine learning model tries to fit Y to X in order to estimate the β coefficients. The fitting procedure involves a loss function, known as the residual sum of squares or RSS.

This is the sum of the squared differences between the actual values (y_i) and the predicted values (y_predicted_i).

The coefficients are chosen such that they minimize this loss function.

A loss of zero (or close to it) indicates a tight fit of the model to the training data. In layman's terms, the actual and the predicted values on the train set are the same.

Hence this RSS function helps in finding the optimal coefficients of the equation.

(The equation below is before regularization.)

RSS = Σ_{i=1..n} (y_i − β0 − Σ_{j=1..p} βj*x_ij)²
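As a quick illustration, the RSS for a given set of coefficients can be computed as in the sketch below. The function names `predict` and `rss` and the toy data are assumptions made for this example.

```python
def predict(betas, row):
    """betas[0] is the intercept; betas[1:] pair up with the features in row."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], row))


def rss(betas, X, y):
    """Residual sum of squares: sum over rows of (actual - predicted) squared."""
    return sum((y_i - predict(betas, row)) ** 2 for row, y_i in zip(X, y))


if __name__ == "__main__":
    X = [[1.0], [2.0], [3.0]]   # one predictor, three observations
    y = [2.0, 4.0, 6.0]         # generated by y = 2 * x
    print(rss([0.0, 2.0], X, y))  # coefficients that match the data exactly
    print(rss([0.0, 1.0], X, y))  # a worse fit gives a larger RSS
```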

Now, this will adjust the coefficients based on your training data. If there is noise in the training data, then the estimated coefficients won’t generalize well to the future data. This is where regularization comes in and shrinks or regularizes these learned estimates towards zero. (source)

For regularization (using ridge regression), the loss function is modified as follows:

Σ_{i=1..n} (y_i − β0 − Σ_{j=1..p} βj*x_ij)² + λ * Σ_{j=1..p} βj² = RSS + λ * Σ_{j=1..p} βj²

Note that the lambda (λ) parameter multiplies the sum of the squared coefficients. λ is the tuning parameter that decides how much we want to penalize the flexibility of our model. The increase in flexibility of a model is represented by an increase in its coefficients, and if we want to minimize the above function, then these coefficients need to be small. This is how the ridge regression technique prevents coefficients from rising too high.
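The shrinking effect is easy to see in the simplest case of a single predictor, where the ridge solution has a closed form. The sketch below assumes the intercept is left unpenalized (a common convention); the function name `ridge_fit_1d` and the toy data are made up for illustration.

```python
def ridge_fit_1d(x, y, lam):
    """Closed-form ridge fit for one predictor; the intercept is not penalized.

    Minimizes sum((y_i - b0 - b1 * x_i) ** 2) + lam * b1 ** 2.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    s_xx = sum((a - x_mean) ** 2 for a in x)
    beta1 = s_xy / (s_xx + lam)        # lam in the denominator shrinks the slope
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    for lam in (0.0, 10.0, 100.0):
        b0, b1 = ridge_fit_1d(x, y, lam)
        print(f"lambda={lam}: intercept={b0:.3f}, slope={b1:.3f}")
```

With λ = 0 this is ordinary least squares; as λ grows, the slope is pulled towards zero.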

Selecting a good value of λ is critical, and cross-validation comes in handy for this purpose. The right value of λ depends on the data, and there is no universal rule for what it should be. So to find a good value, models are fitted for several candidate values of λ using cross-validation, and the λ with the lowest average validation error is chosen.
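Here is a self-contained sketch of that selection loop for the single-predictor case. All names (`ridge_fit_1d`, `cv_error`, `select_lambda`), the candidate values and the toy data are assumptions for illustration, and the folds are formed by simple interleaving rather than random shuffling.

```python
def ridge_fit_1d(x, y, lam):
    """Closed-form ridge fit for one predictor with an unpenalized intercept."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    s_xx = sum((a - x_mean) ** 2 for a in x)
    beta1 = s_xy / (s_xx + lam)
    return y_mean - beta1 * x_mean, beta1


def cv_error(x, y, lam, k):
    """Mean squared validation error of ridge with penalty lam over k folds."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        # Row i belongs to the validation fold when i % k == fold.
        val_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        b0, b1 = ridge_fit_1d([x[i] for i in train_idx],
                              [y[i] for i in train_idx], lam)
        total += sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in val_idx)
    return total / n


def select_lambda(x, y, candidates, k=3):
    """Return the candidate with the lowest cross-validated error, plus all errors."""
    errors = {lam: cv_error(x, y, lam, k) for lam in candidates}
    best = min(errors, key=errors.get)
    return best, errors


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
    y = [2.3, 3.8, 6.4, 7.7, 10.5, 11.6, 14.4, 15.9, 18.2]
    best, errors = select_lambda(x, y, candidates=[0.0, 0.1, 1.0, 10.0, 100.0])
    for lam, err in errors.items():
        print(f"lambda={lam}: cv error={err:.4f}")
    print("selected lambda:", best)
```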

Below is an image showing sample data points and two learned functions. The green and blue functions both incur zero loss on the given data points. A learned model can be induced to prefer the green function, which may generalize better to more points drawn from the underlying unknown distribution, by adjusting λ, the weight of the regularization term.

[Figure: sample data points fitted by a green function and a blue function]

Referred below links for this answer:

Semantic Web, and the Vision of World Wide Web

Hi Innovators!

The evolution of the internet, which started with the PC era and is constantly moving towards an intelligent web, can be observed these days in every part of technology. We want machines to become more and more intelligent day by day. For us, Siri is not enough! We want a real-life JARVIS!

In one of our college lectures, we came across this really interesting comparison of Semantic Web with Web 2.0. Have a look!

As a technology fanatic, I definitely count Tim Berners-Lee as my idol! In the year 2001, he wrote an article in Scientific American on the Semantic Web. It's a must read! So have a look!


Imagine: what we do today via Zomato's link to Uber (in the Zomato app) is what Tim Berners-Lee envisioned back in 2001!

Well, I am definitely enthralled by the power of the WWW! Tell us about your take on this.

Intelligent Systems and IBM Watson…

I have always been curious about Artificial Intelligence and its related topics. During my undergraduate studies, I was introduced to Alan and HAL of http://www.a-i.org. Oh, they were my best buddies, I must say. You could talk to them for hours. Alan, especially, is my favorite. He says he is merely based on pattern matching; however, I still find him very intelligent. A recent addition to their chat bots is Jennifer. Its interface is quite like a regular chat session.

Chat bots have always been my favorite. Recently TARS of hellotars.com started a WhatsApp chat service, where they have local experts serving all requests for you. I initially thought it was an AI service. However, they have a human interface at the top layer to answer the queries. Now this makes the whole idea not so fascinating 😦 However, they said they are working on building AI-based agents soon.

During my final-year B.Tech. project, I was working on a similar idea: building an ER diagram through Natural Language Processing. This was my first attempt to understand how compilers and intelligent systems work. Since then I have been learning and reading about pattern recognition, parsers, etc. Although I am still a beginner in all of these, I am quite optimistic that I will build a real AI system soon. (Very soon.)

My recent discovery is IBM Watson! Oh, it's a charm! It works through a cognitive framework and learns by taking decisions just like we humans do: Observe, Interpret, Evaluate and Decide. The best part is that it can understand any form of raw data. And that makes it even more powerful.

How does it learn? Watch the video below to find out.

Now I am still exploring more on Watson! I will update you on it soon… But you guys keep thinking on innovation!!