In my previous article, I explained Linear Regression concepts. Please go through it if you want to know the theory behind Linear Regression. In this article, I will mainly focus on the Python implementation of Linear Regression on the Boston House pricing dataset.

1. Introduction:

Linear Regression is all about finding the equation of a line (in 2-d), plane (in 3-d), or hyperplane (in n-d) that closely fits the given data, so that it can predict future (unseen) values.

2. Assumptions in Linear Regression:

Linear Regression is a parametric approach, which makes it restrictive in nature. The data should therefore satisfy certain assumptions for Linear Regression to deliver good results. These assumptions are:

  1. There should be a linear and additive relationship between the dependent variable (output: y) and the independent variables (inputs: X’s). A linear relationship produces a straight line when plotted on a graph. An additive relationship means that the effect of each X on y does not depend on the values of the other variables.
  2. There should be no correlation between the error (residual) terms. The presence of such correlation is known as autocorrelation.
  3. The independent variables (X’s) should not be correlated with each other. The presence of such correlation is called multicollinearity.
  4. The error term must have constant variance. This property is known as homoskedasticity; the presence of non-constant variance is referred to as heteroskedasticity.
  5. The error term must be normally distributed.

3. Hypothesis:

The hypothesis in Linear Regression is the equation of a line/plane/hyperplane that fits the data. In other words, the hypothesis maps the input features to the output value. Starting from the equation of a line in 2-d (y = c + m·x), our hypothesis is:

h(x) = θ₀ + θ₁·x

Linear Regression Hypothesis

The question is how to select the optimal values of θ₀ and θ₁: different values of the θ’s give a different hypothesis, and therefore a different line through the data. Which line is the best?

Different hypotheses based on different values of θ’s

So we select θ₀ and θ₁ such that the hypothesis h(x), i.e. the predicted value, is close to the actual value y for the examples (xᵢ, yᵢ) in the training dataset.

4. Cost Function:

The cost function (or loss function) measures the performance of a machine learning model: it quantifies the error between the actual values and the values predicted by our hypothesis. The cost function J for Linear Regression is represented as follows:

The cost function of Linear Regression

The cost function J sums the squared differences between the predicted and actual values; this approach is called the least squares method.
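To make the cost function concrete, here is a minimal sketch in plain Python. It assumes the common 1/(2m) scaling of the summed squared errors, which may differ from the exact formula in the original figure but does not change which θ’s are best. The toy data and the function names `hypothesis` and `cost` are hypothetical and chosen only for illustration.

```python
# Toy training set (hypothetical values lying exactly on y = 1 + 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def hypothesis(theta0, theta1, x):
    """Predicted value h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost J, using the assumed 1/(2m) scaling."""
    m = len(xs)
    squared_errors = [
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    ]
    return sum(squared_errors) / (2 * m)


# Each (theta0, theta1) pair is a different candidate line;
# the closer the line is to the data, the smaller the cost.
for theta0, theta1 in [(0.0, 1.0), (0.0, 2.0), (1.0, 2.0)]:
    print(theta0, theta1, cost(theta0, theta1, xs, ys))
```

The last pair reproduces the data exactly, so its cost is zero, while the other two lines are penalized in proportion to their squared errors.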

Linear Regression finds the optimal values of θ₀ and θ₁ by minimizing this cost function. How? By using techniques like Gradient Descent (GD), Stochastic Gradient Descent (SGD), or mini-batch SGD. In this article, we focus only on Gradient Descent.

5. Gradient Descent (GD):

GD is an iterative algorithm: we start with an initial guess at the solution and move towards the solution step by step, correcting the guess at each iteration. More precisely, gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate a minimum of that function. It is widely used, from Linear Regression to Neural Networks. Below are the steps of GD:

Gradient Descent Algorithm with m: training dataset size
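The steps can be sketched in plain Python as follows. This is an illustrative sketch, not the exact algorithm from the figure: it assumes a cost with 1/(2m) scaling (so the gradients are averages over the m training examples), a simultaneous update of θ₀ and θ₁, and a hypothetical toy dataset. The function name `gradient_descent` and the default values of `alpha` and `iterations` are also assumptions.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initial guess
    for _ in range(iterations):
        # Prediction error for every training example.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the assumed cost J = sum(errors^2) / (2m).
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients use the old theta values.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))  # approaches theta0 = 1, theta1 = 2
```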

Let’s understand what the two terms, the learning rate (α) and the derivative term, actually do.

6. Role of derivative term and learning rate:

Consider a cost function J(θ₁) with only one parameter: θ₀ is 0, which means the line passes through the origin. Let θ₁ be initialized as shown below:

θ₁ is updated using the rule below:

θ₁ := θ₁ − α · dJ(θ₁)/dθ₁

Weight update

θ₁ is updated by subtracting the derivative of the cost function with respect to θ₁, multiplied by a constant (the learning rate α). The derivative is the slope of the cost function at the given point θ₁. In this case, the slope at θ₁ is positive, so a positive value is subtracted from θ₁, which moves θ₁ to the left on the next update (towards the minimum, where slope = 0).

This is where the learning rate α comes in: it decides how far we descend in one iteration. Also, as we move towards the minimum, the slope of the curve becomes less steep, which means that we take smaller and smaller steps as we approach the minimum.

The learning rate scales the size of each step taken during gradient descent. Setting it too high makes the path unstable (the updates can overshoot the minimum), while setting it too low makes convergence slow. Setting it to zero means the model learns nothing from the gradients.
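The effect of the learning rate can be seen on a small experiment. The sketch below assumes a hypothetical one-parameter cost J(θ) = (θ − 3)², whose slope is 2(θ − 3) and whose minimum is at θ = 3; the function name `run_gd` and the chosen values of α are illustrative only.

```python
def run_gd(alpha, theta=0.0, steps=20):
    """Run a few gradient descent steps on the assumed cost J = (theta - 3)^2."""
    for _ in range(steps):
        slope = 2 * (theta - 3.0)      # derivative of J at the current theta
        theta = theta - alpha * slope  # the update rule
    return theta


# Zero, too low, reasonable, and too high learning rates.
for alpha in (0.0, 0.001, 0.1, 1.1):
    print(alpha, run_gd(alpha))
```

With α = 0 the parameter never moves, with α = 0.001 it has barely left the starting point after 20 steps, with α = 0.1 it ends close to 3, and with α = 1.1 every step overshoots by more than the previous one, so θ moves further and further away from the minimum.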

Now, let's consider θ₁ initialized on the left side of the minimum, as shown below:

The slope at this point is negative, so we update θ₁ by subtracting a negative value. This means we add some value to the weight, which moves θ₁ to the right (towards the minimum, where slope = 0).
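A single update from each side of the minimum shows how the sign of the slope sets the direction. As an assumption for illustration, the sketch uses a hypothetical cost J(θ) = (θ − 3)² with slope 2(θ − 3) and a learning rate of 0.1; the function name `update` is also hypothetical.

```python
def update(theta, alpha=0.1):
    """One gradient descent step on the assumed cost J = (theta - 3)^2."""
    slope = 2 * (theta - 3.0)
    return theta - alpha * slope, slope


# Start on the right of the minimum: positive slope, theta moves left.
right_theta, right_slope = update(5.0)
print(right_slope, right_theta)   # slope 4.0, theta moves from 5.0 to roughly 4.6

# Start on the left of the minimum: negative slope, theta moves right.
left_theta, left_slope = update(1.0)
print(left_slope, left_theta)     # slope -4.0, theta moves from 1.0 to roughly 1.4
```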

7. Python example of Linear Regression using Boston House pricing dataset:

The Boston House pricing dataset was originally part of the UCI Machine Learning Repository and has since been removed from it. The data also shipped with the scikit-learn library (the load_boston loader was deprecated in scikit-learn 1.0 and removed in 1.2). There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices using the given features.

Let's get started. First, we load the data:

Let’s check what boston_dataset contains:

which gives the keys below:

  • data: contains the information for various houses
  • target: prices of the house
  • feature_names: names of the features
  • DESCR: describes the dataset

boston_dataset.DESCR gives the following description of the dataset:

The house price, indicated by the variable MEDV, is our target variable (y), and the remaining variables are the features (X) from which we will predict the value of a house.

We will now load the data into a pandas data frame using pd.DataFrame:

We can see that the target column is missing. Let's add it using the code below:

As a pre-processing step, let’s check for missing values in each of the columns:

From the above output, we can see that there are no missing values.

To visualize the target variable MEDV, let’s plot its distribution using distplot from the seaborn library (distplot is deprecated in recent seaborn releases in favor of histplot and displot):

From the above output, we can see that the target variable MEDV is approximately normally distributed, with a few outliers.

Correlation is a statistical term describing the degree to which two variables move in coordination with one another. If the two variables move in the same direction, then those variables are said to have a positive correlation. If they move in opposite directions, then they have a negative correlation.
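As a small illustration of the definition, the Pearson correlation coefficient can be computed by hand in plain Python. The numbers below are hypothetical and are not taken from the Boston dataset, and the function name `pearson` is chosen only for this sketch; pandas exposes the same statistic through DataFrame.corr().

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values: one target rises with the feature, the other falls.
feature = [4.0, 5.0, 6.0, 7.0, 8.0]
rises_with_feature = [12.0, 18.0, 21.0, 30.0, 41.0]
falls_with_feature = [40.0, 31.0, 22.0, 19.0, 10.0]

print(pearson(feature, rises_with_feature))  # close to +1: positive correlation
print(pearson(feature, falls_with_feature))  # close to -1: negative correlation
```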

Now let’s check the correlation among different features:

Correlation matrix

From the above output, we can see that MEDV is strongly positively correlated with RM (0.7) and strongly negatively correlated with LSTAT (-0.74).

Based on the above observation, we select RM and LSTAT as our features to predict MEDV. Let's create scatter plots for the selected features:

From the above plots, we can see that RM follows a linear relationship with MEDV but LSTAT does not, so we select RM for training the linear regression model.

Now split the data into train and test sets:
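The idea behind the split can be sketched without any library: shuffle the rows, then hold out a fraction for testing. The function name `split_train_test`, the 20% test fraction, the seed, and the toy rows are all assumptions made for this sketch; scikit-learn's train_test_split offers the same functionality.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=5):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = rows[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (feature, target) pairs.
rows = [(float(i), 2.0 * i + 1.0) for i in range(10)]
train, test = split_train_test(rows)
print(len(train), len(test))
```

Keeping the test rows out of training is what lets the later evaluation estimate how the model behaves on unseen data.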

Train the model using the train data:
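For a single feature, training amounts to finding θ₀ and θ₁. The sketch below uses the closed-form least squares solution rather than gradient descent; both minimize the same squared-error cost, and scikit-learn's LinearRegression also solves the least squares problem directly. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of h(x) = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    theta1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# Hypothetical training data, roughly following y = 2x.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [2.1, 3.9, 6.2, 7.8, 10.0]

theta0, theta1 = fit_line(x_train, y_train)
predictions = [theta0 + theta1 * x for x in x_train]
print(round(theta0, 2), round(theta1, 2))
```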

Evaluate the model on the train and test data using RMSE and other metrics:
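RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it by hand, together with the R² score; using R² as the second metric is an assumption, since the article's original evaluation code is not shown here. The function names and the sample values are hypothetical.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    m = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / m)


def r2_score(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(rmse(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

Computing both metrics separately on the train and test data makes it easy to spot overfitting: a much lower error on the train data than on the test data is the usual warning sign.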

For full code, please refer to the GitHub link.

Heena Sharma

Data Scientist@Reltio, expert in ML, DL, NLP, and AI, passionate about using cutting-edge tech to solve real-world problems and drive success.