Why do machine learning models need regularization?

- Overfitting is a state in which the model tries too hard to capture the noise in your training dataset. The model fits every point and feature of the training set so closely that it fails to learn anything that holds beyond it. This leads to low accuracy on the test set.
- Overfitting the training set means being **specific** to the training data. Hence, to have good accuracy on the test set (unknown to the model), the model must **generalize**.
- An overfit model typically has low bias but high variance: its predictions change a lot when the training data changes slightly.

Now let us understand cross-validation and regularization:

**Cross-Validation**: One way of avoiding overfitting is cross-validation, which helps in estimating the error on unseen (test) data and in deciding which **parameters** work best for your model. Cross-validation is done by repeatedly building models on sub-samples of the training data and evaluating them on the held-out remainder. Averaging the results over several splits reduces the influence of any one random split. This is different from regularization, but it is important for choosing the regularization parameter, which I will explain below.

**Regularization**: This is a technique in which an *additional term* is added to the loss function of the learned model to reduce overfitting.
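As a rough illustration of the cross-validation idea above, here is a minimal sketch of k-fold splitting in plain Python. The function names (`k_fold_splits`, `cross_validated_error`) and the toy "predict the mean" model are assumptions made for this example only. It does no shuffling, so it assumes the order of the data carries no meaning.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validated_error(xs, ys, k, fit, predict):
    """Average the mean squared validation error over k folds."""
    fold_errors = []
    for train, validation in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in validation]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
cv_error = cross_validated_error(xs, ys, k=3, fit=fit_mean, predict=predict_mean)
print("estimated error on unseen data:", cv_error)
```

Each data point is used for validation exactly once, so the averaged error is an estimate of how the model behaves on data it was not trained on.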

Let me explain:

Consider a simple linear regression relation. Here Y represents the learned relation and β represents the coefficient estimates for the different variables or predictors (X).

**Y ≈ β0 + β1·X1 + β2·X2 + … + βp·Xp**

A machine learning model tries to **fit** X to Y to obtain the β coefficients. The fitting procedure involves a **loss** function, known as the residual sum of squares, or RSS.

This is the sum of the squared differences between the actual values (y_i) and the predicted values (y_predicted_i).

The coefficients are chosen such that they **minimize** this loss function.

A zero (or minimal) loss indicates a tight fit of the model to the training data. In layman's terms, the actual and predicted values on the training set are the same, or nearly so.

Hence this RSS function helps in finding the optimal coefficients of the equation.
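To make the role of RSS concrete, here is a small sketch for the one-predictor case (Y ≈ β0 + β1·X) in plain Python. The helper names `fit_least_squares` and `rss` and the sample numbers are made up for illustration. The closed-form slope and intercept used here are the standard least-squares solution for a single predictor.

```python
def fit_least_squares(xs, ys):
    """Return (b0, b1) that minimize RSS for y ≈ b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def rss(xs, ys, b0, b1):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data: roughly y = 1 + 2x with a little noise added by hand.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_least_squares(xs, ys)
print("fitted coefficients:", b0, b1)
print("RSS at the fitted coefficients:", rss(xs, ys, b0, b1))
# Moving either coefficient away from the fitted value increases the RSS.
print("RSS with a shifted intercept:", rss(xs, ys, b0 + 0.5, b1))
print("RSS with a smaller slope:", rss(xs, ys, b0, b1 - 0.3))
```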

(The equation below is the loss function before regularization.)
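The original equation image is not available here, so the following is a reconstruction from the definitions above, not the author's exact figure. The number of training points n and the notation x_ij (the value of predictor j for point i) are assumptions introduced for this sketch.

```latex
\[
\mathrm{RSS} = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
\]
```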

Now, this will adjust the coefficients based on your training data. If there is noise in the training data, then the estimated coefficients won’t generalize well to future data. This is where regularization comes in and shrinks, or regularizes, these learned estimates towards zero. (source)

For regularization (using ridge regression), the loss function is modified as follows:
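The equation image is missing here as well. A reconstruction of the ridge loss, using the same assumed notation (n training points, x_ij for predictor j of point i), would be the RSS plus a penalty on the squared coefficients. Leaving the intercept β0 out of the penalty is the usual convention and is an assumption here, not something the text states.

```latex
\[
\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
+ \lambda \sum_{j=1}^{p} \beta_j^2
= \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2
\]
```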

Note that the lambda (λ) parameter multiplies the sum of the squared coefficients. **λ** is the tuning parameter that decides how much we want to penalize the flexibility of our model. An increase in the flexibility of a model is reflected in larger coefficients, so minimizing the penalized loss requires these coefficients to stay small. This is how the ridge regression technique prevents coefficients from rising too high.
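The shrinking effect of λ can be seen in a small sketch for the one-predictor case. It assumes the intercept is not penalized, in which case the ridge slope has the closed form S_xy / (S_xx + λ). The function name `fit_ridge` and the data are made up for illustration.

```python
def fit_ridge(xs, ys, lam):
    """Ridge fit for y ≈ b0 + b1 * x, penalizing only the slope b1.

    Minimizes sum((y - b0 - b1*x)^2) + lam * b1^2.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b1 = s_xy / (s_xx + lam)   # lam = 0 gives the ordinary least-squares slope
    b0 = y_mean - b1 * x_mean  # the unpenalized intercept follows from the means
    return b0, b1


# Illustrative data: roughly y = 1 + 2x with a little noise added by hand.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slopes = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    b0, b1 = fit_ridge(xs, ys, lam)
    slopes.append(b1)
    print(f"lambda={lam:6.1f}  intercept={b0:.3f}  slope={b1:.3f}")
```

As λ grows, the slope is pulled towards zero and the fitted line flattens towards the mean of the targets.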

*Selecting a good value of λ is critical. Cross-validation comes in handy for this purpose.* The value of λ depends on the data, and there is no universal rule for what it should be. So to find a good value, models are fitted for several candidate values of λ using cross-validation, and the value with the lowest average validation error is chosen.
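Here is one way the selection could look in plain Python. This sketch uses leave-one-out cross-validation (each point is held out once) on a one-predictor ridge model with an unpenalized intercept. The function names, the candidate λ values, and the data are all assumptions for illustration.

```python
def fit_ridge(xs, ys, lam):
    """Ridge fit for y ≈ b0 + b1 * x, penalizing only the slope b1."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b1 = s_xy / (s_xx + lam)
    return y_mean - b1 * x_mean, b1


def loo_cv_error(xs, ys, lam):
    """Mean squared error when each point in turn is held out."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_ridge(train_x, train_y, lam)
        errors.append((ys[i] - (b0 + b1 * xs[i])) ** 2)
    return sum(errors) / len(errors)


def select_lambda(xs, ys, candidates):
    """Return the candidate with the lowest validation error, plus all scores."""
    scores = {lam: loo_cv_error(xs, ys, lam) for lam in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Illustrative noisy data and a hypothetical grid of candidate values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 6.5, 5.0, 10.5, 9.0, 14.0]
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]

best_lam, scores = select_lambda(xs, ys, candidates)
for lam in candidates:
    print(f"lambda={lam:6.1f}  validation error={scores[lam]:.3f}")
print("selected lambda:", best_lam)
```

In practice the candidate grid is usually spaced on a log scale, as here, because useful values of λ can span several orders of magnitude.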

Below is an image showing sample data points and two learned functions. The green and blue functions both incur zero loss on the given data points. A learned model can be induced to prefer the green function, which may generalize better to more points drawn from the underlying unknown distribution, by adjusting λ, the weight of the regularization term.
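The same idea can be reproduced numerically. In the sketch below, two hand-picked polynomials stand in for the two curves: both pass exactly through the same three points, but one has much larger coefficients. The points, the coefficients, and the names `smooth` and `wiggly` are made up for illustration and are not taken from the image.

```python
def evaluate(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... at x."""
    return sum(c * x ** power for power, c in enumerate(coeffs))


def rss(coeffs, points):
    """Residual sum of squares of the polynomial on the given points."""
    return sum((y - evaluate(coeffs, x)) ** 2 for x, y in points)


def l2_penalty(coeffs):
    """Sum of squared coefficients, leaving out the intercept c0."""
    return sum(c ** 2 for c in coeffs[1:])


def regularized_loss(coeffs, points, lam):
    return rss(coeffs, points) + lam * l2_penalty(coeffs)


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
smooth = [0.0, 1.0, 0.0, 0.0]    # y = x
wiggly = [0.0, 3.0, -3.0, 1.0]   # y = x + x(x - 1)(x - 2)

for name, coeffs in [("smooth", smooth), ("wiggly", wiggly)]:
    print(name,
          "RSS:", rss(coeffs, points),
          "loss with lambda=0.1:", regularized_loss(coeffs, points, 0.1))
```

With λ = 0 the two functions are tied, since both have zero RSS. Any positive λ breaks the tie in favor of the function with smaller coefficients.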

**Links referred to for this answer:**