SIMPLE LINEAR REGRESSION

DEFINITION

Simple linear regression analyzes the relationship between two variables, the explanatory variable (or independent variable) and the response variable (or dependent variable), by fitting a linear model that explains this relationship.

In simple words, it is the process of estimating the equation of a straight line that most suitably fits the data under consideration.

Consider two variables x and y, taking values x_{1}, x_{2}, ..., x_{n} and y_{1}, y_{2}, ..., y_{n} respectively.

Simple linear regression is concerned with obtaining a straight line that best fits the data or best explains the linear relationship between the two variables.

Let us consider, for example, a scatterplot for x and y.

We now need a straight line that approximately fits the data in the above scatterplot, as follows:

The equation of the line is y = α + βx, where α is the y-intercept (the point where the line crosses the y-axis) and β is the gradient or slope of the line.

If x and y were perfectly correlated, then all points in the above scatterplot would lie exactly on the line. But this wouldn’t be the case in most situations as the points lie above or below the line, as shown in the following scatterplot:

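The first point (x_{1}, y_{1}) deviates from the line by an error e_{1}.

The second point (x_{2}, y_{2}) deviates from the line by an error e_{2},

and so on.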

Therefore,

y_{1} = α + βx_{1} + e_{1}

y_{2} = α + βx_{2} + e_{2}

y_{3} = α + βx_{3} + e_{3}

and so on.

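In general, for the i-th point,

y_{i} = α + βx_{i} + e_{i}

where e_{i} is the error, i.e. the vertical deviation of the i-th point from the line.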

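We learnt from the above that

e_{i} = y_{i} - (α + βx_{i})

Now, we need to find the line for which the sum of the squares of these errors is as small as possible.

In other words, we need the values of α and β, such that

Σ e_{i}^{2} = Σ (y_{i} - α - βx_{i})^{2}

is minimum. Here Σ denotes the sum over i = 1, 2, ..., n.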
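The sum of squared errors can be evaluated directly for any candidate line. The following Python sketch is only an illustration: the function name sum_squared_errors and the five data points are assumptions made up for this example and do not come from the text.

```python
def sum_squared_errors(alpha, beta, xs, ys):
    """Return the sum of e_i^2, where e_i = y_i - (alpha + beta * x_i)."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 6, 7]

# Two candidate lines; the one with the smaller sum of squared errors
# fits the data better in the least squares sense.
sse_a = sum_squared_errors(2.0, 1.0, xs, ys)  # line y = 2 + x
sse_b = sum_squared_errors(2.3, 0.9, xs, ys)  # line y = 2.3 + 0.9x
print(sse_a, sse_b)
```

For this data the second line gives the smaller value, so it is the better of the two candidates.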

To find the minimum, we differentiate this expression with respect to α and β and set each derivative equal to 0. Let

Q = Σ (y_{i} - α - βx_{i})^{2}

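Partially differentiating this with respect to α gives

∂/∂α Σ (y_{i} - α - βx_{i})^{2} = Σ 2(y_{i} - α - βx_{i})(-1) = -2 Σ (y_{i} - α - βx_{i})

(In the first step, the chain rule has been used: multiply by the power, reduce the power of the bracket by 1, and then multiply by the derivative of the terms in the bracket.)

Now we need to set this equal to zero.

-2 Σ (y_{i} - α - βx_{i}) = 0

Σ y_{i} - nα - β Σ x_{i} = 0

α = ȳ - βx̄

where x̄ and ȳ are the means of the x and y values.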

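Now partially differentiating this with respect to β gives

∂/∂β Σ (y_{i} - α - βx_{i})^{2} = Σ 2(y_{i} - α - βx_{i})(-x_{i}) = -2 Σ x_{i}(y_{i} - α - βx_{i})

Now, setting this equal to zero, we get,

Σ x_{i}y_{i} - α Σ x_{i} - β Σ x_{i}^{2} = 0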
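The two partial derivatives of the sum of squared errors Σ (y_{i} - α - βx_{i})^{2} can be checked numerically. The Python sketch below is a hedged illustration: the function name gradient and the data are made up for this example. It evaluates -2 Σ e_{i} and -2 Σ x_{i}e_{i}, with e_{i} = y_{i} - α - βx_{i}, at two candidate pairs (α, β).

```python
def gradient(alpha, beta, xs, ys):
    """Partial derivatives of sum((y - alpha - beta*x)^2) w.r.t. alpha and beta."""
    errors = [y - alpha - beta * x for x, y in zip(xs, ys)]
    d_alpha = -2 * sum(errors)
    d_beta = -2 * sum(x * e for x, e in zip(xs, errors))
    return d_alpha, d_beta


# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 6, 7]

# At the least squares solution for this data both derivatives vanish
# (up to floating-point rounding).
grad_at_minimum = gradient(2.3, 0.9, xs, ys)

# At another candidate line at least one derivative is non-zero.
grad_elsewhere = gradient(2.0, 1.0, xs, ys)
print(grad_at_minimum, grad_elsewhere)
```

Both derivatives must be zero at the same time; a line for which only one of them vanishes is not the least squares line.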

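In the previous partial derivation, we found that α = ȳ - βx̄. Substituting this in the above equation, we get

Σ x_{i}y_{i} - (ȳ - βx̄) Σ x_{i} - β Σ x_{i}^{2} = 0

Now we use Σ x_{i} = n x̄, which gives

Σ x_{i}y_{i} - n x̄ ȳ = β (Σ x_{i}^{2} - n x̄^{2})

β = (Σ x_{i}y_{i} - n x̄ ȳ) / (Σ x_{i}^{2} - n x̄^{2})

We recognize the numerator and denominator as the sums of squares of xy and x respectively, written S_{xy} and S_{xx}. Therefore,

β = S_{xy} / S_{xx}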

Instead of writing them as α and β, we write them as α̂ and β̂, since they are only estimates of α and β.
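The estimates can be computed in a few lines. The following Python sketch assumes the closed-form results β̂ = S_{xy} / S_{xx} and α̂ = ȳ - β̂x̄, and uses the equivalent centred forms S_{xy} = Σ (x_{i} - x̄)(y_{i} - ȳ) and S_{xx} = Σ (x_{i} - x̄)^{2}. The function name least_squares_estimates and the data are made up for illustration.

```python
def least_squares_estimates(xs, ys):
    """Return (alpha_hat, beta_hat) for the line y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centred sums of squares; algebraically equal to
    # sum(x*y) - n*x_bar*y_bar and sum(x^2) - n*x_bar^2.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 6, 7]

alpha_hat, beta_hat = least_squares_estimates(xs, ys)
print(alpha_hat, beta_hat)  # approximately 2.3 and 0.9 for this data
```

The function assumes that the x values are not all equal, since otherwise S_{xx} is zero and the slope is undefined.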

Once the values of α̂ and β̂ are found, we can substitute them into the equation of the regression line:

ŷ = α̂ + β̂x

Also, for any value x_{0} of the explanatory variable, the estimated response is

ŷ_{0} = α̂ + β̂x_{0}

Once the least squares estimates α̂ and β̂ are calculated and the equation of the regression line is found, it is necessary to analyze how well our model fits the data, i.e. we need to know the 'goodness of fit' of our regression model.

To analyze the 'goodness of fit' of the regression model, we need to analyze the total variation of the responses. The total variation of the response is

S_{yy} = Σ (y_{i} - ȳ)^{2}

This is simply the sum of squares of y. It measures the sum of squared deviations of the response variable y_{i} from its mean ȳ.

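Consider

y_{i} - ȳ

Adding and subtracting ŷ_{i}, the fitted response for x_{i}, we get

y_{i} - ȳ = (y_{i} - ŷ_{i}) + (ŷ_{i} - ȳ)

Squaring and summing both sides, we get,

Σ (y_{i} - ȳ)^{2} = Σ (y_{i} - ŷ_{i})^{2} + Σ (ŷ_{i} - ȳ)^{2} + 2 Σ (y_{i} - ŷ_{i})(ŷ_{i} - ȳ)

The last term in the RHS simplifies to 0, giving us the following equation:

Σ (y_{i} - ȳ)^{2} = Σ (y_{i} - ŷ_{i})^{2} + Σ (ŷ_{i} - ȳ)^{2}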
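The cross term 2 Σ (y_{i} - ŷ_{i})(ŷ_{i} - ȳ) vanishes for a least squares fit because the fitted line satisfies Σ e_{i} = 0 and Σ x_{i}e_{i} = 0. The Python sketch below checks this numerically on made-up data (the data and variable names are assumptions for illustration only).

```python
# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 6, 7]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares estimates: beta_hat = S_xy / S_xx, alpha_hat = y_bar - beta_hat * x_bar.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Fitted responses and the cross term of the squared decomposition.
fitted = [alpha_hat + beta_hat * x for x in xs]
cross_term = 2 * sum((y - f) * (f - y_bar) for y, f in zip(ys, fitted))
print(cross_term)  # zero up to floating-point rounding
```

The cross term is zero only for the least squares line; for an arbitrary line it generally is not.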

The term in the LHS is the total sum of squares. It represents the total variation in the model. It is denoted by SS_{TOT}.

The second term in the RHS is the regression sum of squares. It represents the variation that is explained by the regression model. It is the sum of squared deviations of the fitted responses from the mean. It is denoted by SS_{REG}.

The first term in the RHS is the residual sum of squares. It represents the variation that is unexplained by the regression model. It is the sum of squared deviations of the actual response from the fitted response. It is denoted by SS_{RES}.

Therefore, the variation in the model can be written as follows:

SS_{TOT} = SS_{RES} + SS_{REG}

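In order to determine the 'goodness of fit' we need to know the proportion of the total variation that is explained by the model. This proportion is called the coefficient of determination and is denoted by R^{2}:

R^{2} = SS_{REG} / SS_{TOT}

that is, the regression sum of squares divided by the total sum of squares.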

It can be seen that R^{2} lies between 0 and 1, and the closer it is to 1, the better the model fits the data.
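The three sums of squares and the coefficient of determination can be computed as follows. This Python sketch is an illustration under stated assumptions: the function name sums_of_squares, the variable names ss_tot, ss_reg and ss_res, and the data are all made up for this example.

```python
def sums_of_squares(xs, ys):
    """Return (ss_tot, ss_reg, ss_res, r_squared) for a least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    fitted = [alpha_hat + beta_hat * x for x in xs]

    ss_tot = sum((y - y_bar) ** 2 for y in ys)              # total variation
    ss_reg = sum((f - y_bar) ** 2 for f in fitted)          # explained variation
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    return ss_tot, ss_reg, ss_res, ss_reg / ss_tot


# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 6, 7]

ss_tot, ss_reg, ss_res, r_squared = sums_of_squares(xs, ys)
print(ss_tot, ss_reg, ss_res, r_squared)
```

For this data the total variation of 10 splits into roughly 8.1 explained and 1.9 unexplained, so about 81% of the variation is explained by the fitted line.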

Example 1

The following is the information on the number of classes attended by 8 different students and the marks they scored in the exam for the subject concerned:
The students wish to use a linear regression model to estimate how much a student would score in the exam after attending 40 classes.

We need to fit a line that most suitably explains the relationship between the two sets of data. The equation of this regression line will be ŷ = α̂ + β̂x. To know the equation, we need to calculate the least squares estimates, α̂ and β̂.

To calculate these, we need the means x̄ and ȳ and the sums of squares S_{xx} and S_{xy}.

Therefore, the regression equation will be,

ŷ = 13.5 + 2x

Here α̂ = 13.5 represents the y-intercept, i.e. when x = 0, the fitted value of y will be 13.5.

β̂ = 2 represents the slope of the line, i.e. for every one unit of increase in x, the response (y value) will increase by 2 units.

Now, to find the estimated response for x = 40, we substitute x = 40 into the regression equation:

ŷ = 13.5 + 2(40) = 93.5

This means that a student might be able to score 93.5 marks in the exam if he/she attended 40 classes.
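The same substitution can be written as a short Python sketch. The function name predict is an assumption for illustration; the intercept 13.5 and the slope 2 are the estimates stated in this example.

```python
def predict(alpha_hat, beta_hat, x0):
    """Estimated response on the fitted line for the value x0."""
    return alpha_hat + beta_hat * x0


# Estimates taken from Example 1: intercept 13.5, slope 2.
predicted_marks = predict(13.5, 2, 40)
print(predicted_marks)  # 93.5
```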

Let us now find out the goodness of fit:

R^{2} = 0.9411

This means that 94.11% of the total variation is explained by the regression model fitted to the given data, making it a good model for studying the relationship between the number of classes attended and the marks scored.

Example 2

The following are the prices of 6 super cars of the same model that have been used for different numbers of years:
Number of years used    Price (in million rupees)

We are required to estimate the price of a super car of the same model that has been used for 9 years.

Therefore, the regression equation will be,

ŷ = 48.6 - 4.886x

Now, to find the estimated response for x = 9, we substitute x = 9 into the regression equation:

ŷ = 48.6 - 4.886(9) = 4.626

so the estimated price is approximately 4.63 million rupees.

Let us now find out the goodness of fit:-