Linear Regression in Python

Linear regression is one of the simplest algorithms in machine learning. It is a statistical model that tries to show the relationship between two variables with the help of a linear equation. In this article, we will learn about Linear Regression in Python. Let’s start!!

What is regression?

Regression analysis is a predictive modelling technique that investigates the relationship between dependent and independent variables. It involves fitting a line through a set of data points so that it matches the overall shape of the data. A regression plot shows the changes in the dependent variable on the y-axis and the changes in the explanatory variable on the x-axis.

Uses of regression

Determine the strength of predictors.
It helps analyse how strongly the independent variables affect the dependent variable, for example the relationship between sales and marketing, or between age and income.
It identifies the important relationships between the variables used for forecasting.
Regression can forecast the effects or impact of changes.
We can use regression analysis to understand how much a dependent variable changes with a change in one or more independent variables.
The regression analysis predicts trends and future values.
It helps get point estimates, for example to predict what the price of bitcoin will be in the next six months.

Linear regression

In this model, we are trying to find the relationship between x and y from the mathematical equation y = mx + c, where m is the slope and c is the intercept. In linear regression, we model the data as a straight line using continuous variables. The output of a linear regression prediction is the value of the dependent variable. Loss measures, R-squared, and adjusted R-squared are some of the methods used to check the accuracy and goodness of fit.

Types of linear regression

Linear regression is divided into two main types:

1. Simple linear regression

When we use a single independent variable to predict the value of a numerical dependent variable, this is known as simple linear regression.

2. Multiple linear regression

When more than one independent variable exists to predict the values of the numerical dependent variable, then this is known as multiple linear regression.
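As a minimal sketch of the difference between the two types, the function below computes a prediction from an intercept and one coefficient per independent variable. The name `predict` and the numbers in the usage lines are illustrative assumptions, not part of the article.

```python
def predict(intercept, coefficients, features):
    """Return intercept + sum(coefficient_i * feature_i).

    With one feature this is simple linear regression;
    with several features it is multiple linear regression.
    """
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Simple linear regression: one independent variable (illustrative numbers).
simple_prediction = predict(1.0, [2.0], [3.0])

# Multiple linear regression: two independent variables (illustrative numbers).
multiple_prediction = predict(1.0, [2.0, 3.0], [3.0, 4.0])
```

The only thing that changes between the two cases is the number of coefficients; the form of the equation stays the same.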

Linear regression line

The line shows the relation between the dependent and independent variables. There are two types of relationships the line represents:

1. Positive linear relationship

When the dependent variable on the y-axis increases as the independent variable on the x-axis increases, the relationship is a positive linear relationship.

2. Negative linear relationship

If the dependent variable on the y-axis decreases as the independent variable on the x-axis increases, the relationship is a negative linear relationship.
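One way to tell the two relationships apart in code is to look at the sign of the covariance between x and y, which matches the sign of the slope of the fitted line. The sketch below assumes plain Python lists; the function name `relationship_direction` is hypothetical.

```python
def relationship_direction(xs, ys):
    """Classify the linear relationship by the sign of the covariance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The sign of this sum is the sign of the slope of the best-fit line.
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if covariance_sum > 0:
        return "positive"
    if covariance_sum < 0:
        return "negative"
    return "none"
```

For example, `relationship_direction([1, 2, 3], [2, 4, 6])` reports a positive relationship, while reversing the y values reports a negative one.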

Finding the best-fit line or the regression line in the model

The main aim of the regression model is to find the best-fit line. The error between the actual and predicted values should be as small as possible. The best-fit line is the one that gives the minimum error.
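For a single independent variable, the line that minimises the sum of squared errors has a closed-form solution. The sketch below assumes that "minimum error" means least squares; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Calling `fit_line([1, 2, 3, 4], [3, 5, 7, 9])` recovers a slope of 2 and an intercept of 1, because those points lie exactly on y = 2x + 1.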

Cost function

The cost function is a function that helps us determine the model’s accuracy. It quantifies how wrong the model is in capturing the relationship between input and output, which tells you how your model is behaving.
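A minimal sketch of two common cost functions, mean squared error and its square root, is shown below. The function names are assumptions made for illustration.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error, in the same units as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))
```

A cost of zero means every prediction matches the actual value; the larger the cost, the further the predictions are from the data.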

Gradient descent

Gradient descent is an algorithm we use to optimise the cost function, that is, to minimise the error in the model. We use it to find the smallest error value your model can reach.

Gradient descent follows the direction that leads towards the least possible error. The error in your model differs from point to point, and you have to find the fastest way to minimise it so that no computation is wasted.

Hypothesis function for linear regression

The task of linear regression is to predict a dependent variable (y) based on an independent variable (x). It does this by finding a linear relationship between input and output.

The hypothesis function is as follows:

y = θ₁ + θ₂.x

In the above equation:

x is the input training data.

y is the label given to the data in the case of supervised learning.

θ₁ is the intercept and θ₂ is the coefficient of x.

What is the cost function of a linear regression model?

In linear regression, we fit a straight line to the data. The straight-line equation is as follows:

Output= a*input + b

In the above equation, a and b are constants. The constant a determines the steepness of the slope, and b determines the point where the line intercepts the y-axis.

The best fit will be a straight line that passes as close as possible to most data points in the dataset, without being pulled around by noise and outliers.

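For a linear regression model, the cost function is the root mean squared error (RMSE) between the predicted and the actual values, and training looks for the parameters that minimise it. It is as follows:

J = √( (1/n) · Σᵢ (predᵢ − yᵢ)² )

Here n is the number of data points, predᵢ is the predicted value and yᵢ is the actual value for the i-th data point.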

Gradient descent finds the direction in which the error decreases. That direction comes from differentiating the cost function with respect to each parameter. At each step, you subtract the resulting gradient, scaled by the learning rate, from the current parameter values to move down the slope.
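The update equations do not appear in this text. A standard form is sketched below, under the assumption that the cost J is the mean squared error of the hypothesis y = θ₁ + θ₂·x (the squared version of the root mean squared error, which has the same minimiser) and that α denotes the learning rate.

```latex
\[
J(\theta_1, \theta_2) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta_1 + \theta_2 x_i - y_i \right)^2
\]
\[
\frac{\partial J}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta_1 + \theta_2 x_i - y_i \right),
\qquad
\frac{\partial J}{\partial \theta_2} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta_1 + \theta_2 x_i - y_i \right) x_i
\]
\[
\theta_1 \leftarrow \theta_1 - \alpha \, \frac{\partial J}{\partial \theta_1},
\qquad
\theta_2 \leftarrow \theta_2 - \alpha \, \frac{\partial J}{\partial \theta_2}
\]
```

Both parameters are updated from the same set of errors at each step, and the updates are repeated until the cost stops decreasing.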

In gradient descent, α (alpha) is the learning rate. It decides how quickly you move down the slope. If alpha is big, you take big steps; if it is small, you take small steps. If alpha is too large, you can overshoot the point of least error, and the results will not be accurate. On the other hand, if it is too small, it will take too long to optimise the model. Hence you should choose a suitable value of alpha.
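The effect of the learning rate can be explored with a small gradient descent loop. This is a sketch under stated assumptions: the cost is the mean squared error, the hypothesis is y = θ₁ + θ₂·x, and the default values of `alpha` and `steps` are illustrative choices that suit the tiny dataset in the usage lines, not recommended settings.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y = theta1 + theta2 * x by gradient descent on mean squared error."""
    theta1, theta2 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every data point under the current parameters.
        errors = [theta1 + theta2 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad1 = (2.0 / n) * sum(errors)
        grad2 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient, scaled by the learning rate.
        theta1 -= alpha * grad1
        theta2 -= alpha * grad2
    return theta1, theta2


# Points that lie exactly on y = 1 + 2x (illustrative data).
theta1, theta2 = gradient_descent([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

With these settings the parameters approach an intercept of 1 and a slope of 2. Rerunning with a much larger `alpha` makes the updates overshoot and grow instead of settling, while a much smaller `alpha` needs many more steps to get close.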

Linear regression model representation

The representation of the linear regression model is pretty simple.

The input and output values are numeric. The linear equation combines a set of input values (x) to produce the predicted output (y).

In a simple regression model, the linear equation has one x value and one y value. The equation is as follows:

y = B₀ + B₁*x

The Greek letter B (beta) denotes the coefficients: B₀ is the intercept, and B₁ is the scale factor applied to the input value x.

Model performance

We use the goodness of fit to decide which regression line best fits the set of observations. Optimisation is the procedure of finding the best model out of all the available models. Several methods are relevant here; some of them are listed below:

1. R squared method

We use R-squared as a goodness-of-fit measure in linear regression.

On a scale of 0 to 100%, it measures how much of the variation in the dependent variable is explained by the independent variables.
The higher the R-squared value, the smaller the difference between actual and predicted values, and hence the better the model.
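A minimal sketch of the R-squared calculation follows, using the usual definition of one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` is an assumption.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for the given actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after using the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean of the actual values.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R-squared is undefined")
    return 1.0 - ss_res / ss_tot
```

A model whose predictions match the data exactly scores 1 (100%), and a model that always predicts the mean of the data scores 0. Multiply the result by 100 to express it as a percentage.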

2. Ordinary least squares

We use the ordinary least squares method when we have more than one input value. The method minimises the sum of the squared residuals. Given a regression line, we calculate the distance of each data point from the line. Then we square these distances and sum all the squared errors together. This sum is the value that the method seeks to minimise.

This method treats the data as a matrix and uses linear algebra to calculate the optimal values for the coefficients.
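The matrix view can be sketched with the normal equations, (XᵀX)·β = Xᵀy, solved here by Gauss-Jordan elimination in plain Python so that the example has no dependencies. The function names are hypothetical, and in practice a linear algebra library routine would normally do this work.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [matrix | rhs], copied so the inputs are not modified.
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; the features may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def ols_fit(rows, ys):
    """Return [intercept, coef_1, coef_2, ...] minimising squared residuals."""
    # Design matrix X: a leading 1 in each row models the intercept.
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Illustrative data generated from y = 1 + 2*x1 + 3*x2.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
ys = [9, 8, 19, 18, 26]
coefficients = ols_fit(rows, ys)
```

Because the illustrative data follow the generating equation exactly, the fitted coefficients come out close to 1, 2 and 3.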

3. Regularisation

Regularisation methods are extensions of the training of the linear model. The aim is to reduce both the model’s complexity and the squared error. We use these methods when there is collinearity in the data and ordinary least squares would overfit the training data.
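The article does not name a specific regularisation method. As one common example, the sketch below uses ridge (L2) regularisation for a single input, which adds a penalty `lam` times the squared slope to the sum of squared errors and leaves the intercept unpenalised. The function name and the penalty parameter `lam` are assumptions.

```python
def ridge_fit_simple(xs, ys, lam):
    """Return (slope, intercept) minimising sum((y - c - m*x)^2) + lam * m^2."""
    if lam < 0:
        raise ValueError("the penalty lam must be non-negative")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The penalty enlarges the denominator, which shrinks the slope towards 0.
    slope = sxy / (sxx + lam)
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

With `lam = 0` this reduces to ordinary least squares; increasing `lam` trades a slightly larger squared error for a smaller, simpler slope.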

Assumptions of linear regression

Linear regression assumes little or no multicollinearity between the features or independent variables. Multicollinearity means a high correlation between the independent variables. It makes it challenging to find the true relationship between the predictors and the target variable, and to decide which predictor variable is affecting the target variable and which is not.

1. Homoscedasticity Assumption:

Homoscedasticity is the situation in which the variance of the error term is the same for all values of the independent variables. With homoscedasticity, the scatter plot of the errors should show no clear pattern.

2. Normal distribution of error terms:

Linear regression works on the assumption that the error terms follow a normal distribution. If the error terms are not normally distributed, confidence intervals become too wide or too narrow, which may cause difficulties in finding the coefficients.

3. No autocorrelations:

The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it drastically reduces the model’s accuracy.
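One rough way to look for autocorrelation is to compute the lag-1 autocorrelation of the residuals, that is, how strongly each residual is related to the one before it. The sketch below assumes the residuals are listed in their natural order (for example, by time); the function name is hypothetical.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the residual just before it."""
    n = len(residuals)
    mean_r = sum(residuals) / n
    centred = [r - mean_r for r in residuals]
    denominator = sum(c * c for c in centred)
    if denominator == 0:
        raise ValueError("residuals are constant; autocorrelation is undefined")
    # Products of neighbouring residuals: large totals indicate dependence.
    numerator = sum(centred[t] * centred[t - 1] for t in range(1, n))
    return numerator / denominator
```

Values near zero are consistent with the assumption, while values far from zero in either direction suggest that neighbouring errors are related.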

Use case

Let us take a use case. Consider salary (the dependent variable) and experience (the independent variable): we want to predict the impact that experience has on an individual’s salary.

Steps to implement a linear regression model

1. Import the required libraries

2. Define the dataset

3. Plot the data points
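The three steps above can be sketched as follows. The article does not name its libraries or give its data, so everything here is an assumption: matplotlib is used for plotting, the salary and experience figures are made-up illustrative values, and the names are hypothetical. The plotting library is imported inside the function so that the dataset can be inspected without it installed.

```python
# Step 1: import the required libraries.
# Assumption: matplotlib is the plotting library. It is imported inside
# plot_data_points so that defining the dataset does not require it.

# Step 2: define the dataset (hypothetical values, for illustration only).
experience_years = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
salary_thousands = [35.0, 41.0, 44.0, 52.0, 55.0, 61.0]


# Step 3: plot the data points.
def plot_data_points(x_values, y_values):
    """Show a scatter plot of salary (y-axis) against experience (x-axis)."""
    import matplotlib.pyplot as plt

    plt.scatter(x_values, y_values)
    plt.xlabel("Experience (years)")
    plt.ylabel("Salary (thousands)")
    plt.title("Salary vs. experience")
    plt.show()
```

Calling `plot_data_points(experience_years, salary_thousands)` opens the scatter plot, which is a quick visual check of whether a straight line is a reasonable model for the data.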

Summary

This article introduced the linear regression model, a simple machine learning model. We looked at the concepts and started on a use case. We hope our explanation was easy to understand.