SIMPLE LINEAR REGRESSION
DEFINITION
Simple linear regression is concerned with analyzing the relation between two variables, the explanatory variable (or independent variable) and the response variable (or dependent variable), and with fitting a linear model that explains this relationship.
In simple words, it is the process of estimating the equation of the straight line that most suitably fits the data under consideration.
CALCULATOR

Enter the data pairs and the value of the explanatory variable to be regressed. The calculator then reports:

Regression equation
Expected response for the given value of the explanatory variable
Variance of the expected response estimator
Individual response for the given value of the explanatory variable
Variance of the individual response estimator
Coefficient of correlation
Coefficient of determination
Total sum of squares
Regression sum of squares
Residual sum of squares
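
Below is a minimal Python sketch of the quantities such a calculator reports, in the order listed above. The function name and output labels are illustrative. The estimates and sums of squares follow the formulas derived later in this article; the two variance formulas use the standard textbook estimate of the residual variance, σ̂² = SSRES/(n − 2), which this article does not derive.

import math

def regression_calculator(xs, ys, x0):
    # Least squares estimates of intercept (alpha_hat) and slope (beta_hat).
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    # The expected and the individual response at x0 both lie on the fitted
    # line; they differ only in the variance attached to them.
    y0_hat = alpha_hat + beta_hat * x0
    ss_tot = s_yy
    ss_reg = beta_hat * s_xy              # equals Sxy^2 / Sxx
    ss_res = ss_tot - ss_reg
    sigma2_hat = ss_res / (n - 2)         # standard residual variance estimate (not derived in this article)
    var_expected = sigma2_hat * (1 / n + (x0 - x_bar) ** 2 / s_xx)
    var_individual = sigma2_hat * (1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    r = s_xy / math.sqrt(s_xx * s_yy)
    return {
        "regression equation": (alpha_hat, beta_hat),
        "expected response": y0_hat,
        "variance of expected response": var_expected,
        "individual response": y0_hat,
        "variance of individual response": var_individual,
        "coefficient of correlation": r,
        "coefficient of determination": ss_reg / ss_tot,
        "total sum of squares": ss_tot,
        "regression sum of squares": ss_reg,
        "residual sum of squares": ss_res,
    }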

FORMULA AND DERIVATION
Consider two variables x and y, taking values x1, x2, ..., xn and y1, y2, ..., yn, where x is considered the explanatory variable and y the response variable, i.e. the values of y depend on the values of x.

Simple linear regression is concerned with obtaining a straight line that best fits the data or best explains the linear relationship between the two variables.

Let us consider, for example, a scatterplot of xi and yi, where i = 1, 2, ..., n.

[Scatterplot: the data points (xi, yi)]

We now need a straight line that approximately fits the data in the above scatterplot, as follows:

[Scatterplot: a straight line fitted through the data points]

The equation of the line is y = α + βx, where α is the y-intercept (the point where the line crosses the y-axis) and β is the gradient or slope of the line.

If x and y were perfectly correlated, then all points in the above scatterplot would lie exactly on the line. In most situations, however, the points lie above or below the line, as shown in the following scatterplot:

[Scatterplot: data points with vertical deviations e1, e2, ... from the fitted line]

The first point (x1, y1) is e1 units above the line.
The second point (x2, y2) is e2 units below the line,
and so on.

Therefore,

y1= α+ βx1+ e1
y2= α+ βx2+ e2
y3= α+ βx3+ e3
and so on.

e1, e2, ..., en are called errors or residuals. These ei's show the displacement of each point from the line. When fitting the straight line, we want these errors to be as small as possible. Errors above the line have a positive value and those below the line have a negative value. Since we are concerned only with the distances and not the signs, we work with the squares of the errors, so that errors of opposite sign cannot cancel each other. Therefore, instead of minimizing the errors themselves, we minimize the sum of squared errors.
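
A two-line check in Python makes the sign-cancellation problem concrete (the residual values here are made up for the illustration):

errors = [2.0, -2.0]                 # one point 2 units above the line, one 2 units below
print(sum(errors))                   # 0.0 -- the raw errors cancel and hide the misfit
print(sum(e ** 2 for e in errors))   # 8.0 -- the squared errors measure the total misfit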

We learnt from the above that

yi = α + βxi + ei, i.e. ei = yi − (α + βxi)

Now, we need to find

min Σ ei²

(here and below, Σ denotes summation over i = 1, 2, ..., n)

In other words, we need the values of α and β such that

Σ (yi − α − βxi)²

is minimum.

In order to find the minimum, we differentiate this expression with respect to α and β and set the derivatives equal to 0. Let

S = Σ (yi − α − βxi)²

Partially differentiating this with respect to α gives

∂S/∂α = Σ 2(yi − α − βxi)(−1) = −2 Σ (yi − α − βxi)

(In the first step, the chain rule has been used: multiply by the power, reduce the power of the bracket by 1, and then multiply by the derivative of the terms inside the bracket.)

Now we need to set this equal to zero.

−2 Σ (yi − α − βxi) = 0
Σ yi − nα − β Σ xi = 0
α = (Σ yi)/n − β (Σ xi)/n = ȳ − βx̄

Now partially differentiating this with respect to β gives

∂S/∂β = Σ 2(yi − α − βxi)(−xi) = −2 Σ xi(yi − α − βxi)

Now, setting this equal to zero, we get

Σ xiyi − α Σ xi − β Σ xi² = 0

In the previous partial derivative, we found that α = ȳ − βx̄. Substituting this in the above equation, we get

Σ xiyi − (ȳ − βx̄) Σ xi − β Σ xi² = 0
Σ xiyi − nx̄ȳ + βnx̄² − β Σ xi² = 0
β = (Σ xiyi − nx̄ȳ) / (Σ xi² − nx̄²)

Now we use

Sxy = Σ xiyi − nx̄ȳ = Σ (xi − x̄)(yi − ȳ)
Sxx = Σ xi² − nx̄² = Σ (xi − x̄)²

We recognize the numerator and denominator as the sum of squares of xy and of x respectively. Therefore,

β = Sxy / Sxx and α = ȳ − βx̄

Instead of writing them as α and β, we write them as α̂ and β̂, since they are only estimates of α and β.

Once the values of α̂ and β̂ are found, we can substitute them into the equation of the regression line:

ŷ = α̂ + β̂x

Also, for any value xi, we can find the predicted value of yi (the response variable) as follows:

ŷi = α̂ + β̂xi

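These two formulas translate directly into code. The following is a minimal Python sketch; the function names are illustrative:

def fit_line(xs, ys):
    # Least squares estimates: beta_hat = Sxy / Sxx, alpha_hat = y_bar - beta_hat * x_bar.
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

def predict(alpha_hat, beta_hat, x):
    # Fitted (predicted) response: y_hat = alpha_hat + beta_hat * x.
    return alpha_hat + beta_hat * x
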
Once the least squares estimates α̂ and β̂ are calculated and the equation of the regression line is found, it is necessary to analyze how well our model fits the data, i.e. we need to know the 'goodness of fit' of our regression model.

Goodness of fit:

To analyze the 'goodness of fit' of the regression model we need to analyze the total variation of the responses. The total variation of the response is

Σ (yi − ȳ)²

This is simply the sum of squares of y, denoted Syy. It measures the sum of squared deviations of the response variable yi from the mean ȳ. This total variation in the model consists of two components: the variation explained by the model and the variation not explained by the model. This can be derived as follows:

Consider

yi − ȳ

Adding and subtracting ŷi, we get

yi − ȳ = (yi − ŷi) + (ŷi − ȳ)

Squaring and summing both sides, we get

Σ (yi − ȳ)² = Σ (yi − ŷi)² + Σ (ŷi − ȳ)² + 2 Σ (yi − ŷi)(ŷi − ȳ)

The last term on the RHS simplifies to 0 (this follows from the two normal equations obtained by setting the partial derivatives to zero), giving us the following equation:

Σ (yi − ȳ)² = Σ (yi − ŷi)² + Σ (ŷi − ȳ)²

The term on the LHS is the total sum of squares. It represents the total variation in the model. It is the sum of squared deviations of the actual responses from the mean. It is denoted by SSTOT.
The second term on the RHS is the regression sum of squares. It represents the variation that is explained by the regression model. It is the sum of squared deviations of the fitted responses from the mean. It is denoted by SSREG.
The first term on the RHS is the residual sum of squares. It represents the variation that is unexplained by the regression model. It is the sum of squared deviations of the actual responses from the fitted responses. It is denoted by SSRES.

Therefore, the variation in the model can be written as follows:



SSTOT = SSRES + SSREG

Calculating SSTOT, SSREG and SSRES:

SSTOT = Σ (yi − ȳ)² = Σ yi² − nȳ² = Syy

SSREG = Σ (ŷi − ȳ)² = β̂² Sxx = Sxy² / Sxx

SSRES = Σ (yi − ŷi)² = SSTOT − SSREG = Syy − Sxy² / Sxx

In order to determine the 'goodness of fit' we need to know the proportion of the total variation that is explained by the model. This proportion is called the coefficient of determination and is denoted by R².

R² = SSREG / SSTOT

It can be seen that R² is the square of the coefficient of correlation, r, for a given set of data:

R² = (Sxy² / Sxx) / Syy = (Sxy / √(Sxx·Syy))² = r²
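
A minimal Python sketch of the decomposition and of R², assuming the estimates α̂ and β̂ have already been obtained (for instance with a fit function like the one sketched earlier); the names are illustrative:

def goodness_of_fit(xs, ys, alpha_hat, beta_hat):
    n = len(ys)
    y_bar = sum(ys) / n
    y_hat = [alpha_hat + beta_hat * x for x in xs]           # fitted responses
    ss_tot = sum((y - y_bar) ** 2 for y in ys)               # total variation, SSTOT
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation, SSREG
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation, SSRES
    # Up to rounding, ss_tot == ss_reg + ss_res, as derived above.
    return ss_tot, ss_reg, ss_res, ss_reg / ss_tot           # last value is R squared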

EXAMPLES
Example 1
The following information gives the number of classes attended by 8 different students and the marks they scored in the exam for the subject whose classes they attended:

[Table: number of classes attended and marks scored for each of the 8 students]

The students wish to use a linear regression model to find out how much a student would score in the exam after attending 40 classes.

We need to fit a line that most suitably explains the relationship between the two sets of data. The equation of this regression line will be ŷ = α̂ + β̂x. To find the equation, we need to calculate the least squares estimates, α̂ and β̂.



To calculate these, we need x̄, ȳ, Sxx and Sxy from the data. Substituting the values computed from the table into the formulas derived above gives

β̂ = Sxy / Sxx = 2

α̂ = ȳ − β̂x̄ = 13.5
Therefore, the regression equation will be

ŷ = 13.5 + 2x

Here α̂ = 13.5 represents the y-intercept, i.e. when x = 0, y will be 13.5.
β̂ = 2 represents the slope of the line, i.e. for every one-unit increase in x, the response (the y value) increases by 2 units.

Now, to find the estimated response for x = 40, we substitute x = 40 into the regression equation:

ŷ = 13.5 + 2(40) = 93.5

This means that a student might be able to score 93.5 marks in the exam if he/she attended 40 classes.
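
A quick check of this arithmetic in Python, using the estimates obtained above:

alpha_hat, beta_hat = 13.5, 2.0      # estimates from this example
print(alpha_hat + beta_hat * 40)     # 93.5, the expected marks after 40 classes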

Let us now find the goodness of fit. Computing SSTOT, SSREG and SSRES from the data and taking the ratio gives

R² = SSREG / SSTOT = 0.9411

This means that 94.11% of the total variation is explained by the regression model that we have fitted, making it a good model for studying the given data on the number of classes attended and marks scored.

Example 2
Following are the prices of 6 super cars of the same model, used for different numbers of years:

[Table: number of years used and price (in million rupees) for each of the 6 cars]

It is required to know the estimated price of a super car of the same model used for 9 years.

As before, we compute x̄, ȳ, Sxx and Sxy from the data and substitute them into the least squares formulas, giving

β̂ = Sxy / Sxx = −4.886

α̂ = ȳ − β̂x̄ = 48.6

Therefore, the regression equation will be

ŷ = 48.6 − 4.886x

The negative slope indicates that the price of the car decreases by 4.886 million rupees for every additional year of use.

Now, to find the estimated response for x = 9, we substitute x = 9 into the regression equation:

ŷ = 48.6 − 4.886(9) = 48.6 − 43.974 = 4.626

This means that the estimated price of a car of this model used for 9 years is about 4.626 million rupees.

Let us now find the goodness of fit. As in Example 1, SSTOT, SSREG and SSRES are computed from the data, and the coefficient of determination is the proportion

R² = SSREG / SSTOT

RELATED TOPICS
Correlation