SIMPLE LINEAR REGRESSION

DEFINITION

Simple linear regression is concerned with the analysis of the relationship between two variables, the explanatory variable (or independent variable) and the response variable (or dependent variable), and with fitting a linear model that explains this relationship.

In simple words, it is the process of estimating the equation of the straight line that most suitably fits the data under consideration.

This section covers the following quantities:

- Value of the explanatory variable to be regressed
- Regression equation
- Expected response
- Variance of the expected response estimator
- Individual response
- Variance of the individual response estimator
- Co-efficient of correlation
- Co-efficient of determination
- Total sum of squares
- Regression sum of squares
- Residual sum of squares

Consider two variables \(x\) and \(y\), taking values \(x_1,\ x_2,\dots, x_n\) and \(y_1,\ y_2,\dots, y_n\), where \(x\) is considered the explanatory variable and \(y\) is the response variable, i.e. the values of \(y\) are dependent on the values of \(x\). Simple linear regression is concerned with obtaining a straight line that best fits the data, or best explains the linear relationship between the two variables.

Let us consider, for example, a scatterplot of the points \(\left(x_i,\ y_i\right)\), where \(i=1,\ 2,\ 3,\dots, n\).

We now need a straight line that approximately fits the data in such a scatterplot.

The equation of the line is \(y=\alpha +\beta x\), where \(\alpha \) is the \(y\)-intercept (the point where the line crosses the \(y\)-axis) and \(\beta \) is the gradient or slope of the line.

If \(x\) and \(y\) were perfectly correlated, then all the points in the scatterplot would lie exactly on the line. In most situations, however, this is not the case: the points lie above or below the line.

The first point \(\left(x_1,\ y_1\right)\) is \(e_1\) units above the line.

The second point \(\left(x_2,\ y_2\right)\) is \(e_2\) units below the line

and so on.


Therefore,

\(y_1=\alpha +\beta x_1+e_1\)

\(y_2=\alpha +\beta x_2+e_2\)

\(y_3=\alpha +\beta x_3+e_3\)

and so on.


\(e_1,\ e_2,\dots, e_n\) are called errors or residuals. These \(e_i\)'s show the displacement of each point from the line. While fitting the straight line, we need to do it in such a way that the errors are as small as possible. Errors above the line have a positive value and those below the line have a negative value. Since we are concerned only about the distance and not the sign, we work with the squares of the errors, which removes the signs. Therefore, instead of minimizing the errors themselves, we minimize the sum of squared errors.
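The idea of comparing candidate lines by their sum of squared errors can be illustrated directly. A minimal Python sketch (the data and the two candidate lines are illustrative, and the function name `sse` is our own):

```python
def sse(alpha, beta, xs, ys):
    """Sum of squared errors of the line y = alpha + beta*x over the data."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]
ys = [2.0, 4.1, 5.9, 8.2]  # illustrative data, roughly y = 2x

# The line y = 2x leaves much smaller squared errors than y = 1 + x:
assert sse(0.0, 2.0, xs, ys) < sse(1.0, 1.0, xs, ys)
```

Least squares simply searches over all values of \(\alpha\) and \(\beta\) for the line that makes this quantity as small as possible.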

We learnt from the above that

\[y_i=\ \alpha +\beta x_i+e_i\]
\[\Rightarrow e_i=\ y_i-\left(\alpha +\beta x_i\right)\]
\[\ \ \ \ \ \ \ \ \ =\ y_i-\alpha -\ \beta x_i\]

Now, we need to find
\[\min \sum{e^2_i}=\min \sum{{\left(y_i-\alpha -\beta x_i\right)}^2}\]

In other words, we need the values of \(\alpha\) and \(\beta\) such that \(\sum{{\left(y_i-\ \alpha -\ \beta x_i\right)}^2}\) is a minimum.

In order to get the minimum, we need to differentiate this expression with respect to \(\alpha\) and \(\beta \) and set the resulting partial derivatives equal to 0.

\[S=\sum{{\left(y_i-\ \alpha -\ \beta x_i\right)}^2}\]

Partially differentiating this with respect to \(\alpha\) gives

\[\frac{\partial S}{\partial \alpha }=\ \left[\sum{2{\left(y_i-\ \alpha -\ \beta x_i\right)}^{2-1}}\right]\times \left(-1\right)\]
\[\ \ \ \ \ \ \ =\ -2\sum{\left(y_i-\alpha -\ \beta x_i\right)}\]

(In the first step, chain rule has been used – multiply by the power and reduce the power of the bracket by 1 and then multiply by the derivative of the terms in the bracket)
Now we need to set this equal to zero.

\[-2\sum{\left(y_i-\alpha -\ \beta x_i\right)=0}\]
\[\sum{\left(y_i-\ \alpha -\ \beta x_i\right)=0}\]
\[\sum{y_i-\ \sum{\alpha }-\ \sum{\beta x_i=0}}\]
\[\sum{y_i-\ \alpha \sum{1-\ \beta \sum{x_i=0}}}\]
\[\sum{y_i-\alpha n-\ \beta \sum{x_i=0}}\]
\[\alpha n=\ \sum{y_i-\beta \sum{x_i}}\]
\[\alpha =\ \frac{\sum{y_i}}{n}-\ \frac{\beta \sum{x_i}}{n}\]
\[\alpha =\ \overline{y}-\ \beta \overline{x}\]

Now partially differentiating this with respect to \(\beta \) gives

\[\frac{\partial S}{\partial \beta }=\ \left[\sum{2{\left(y_i-\ \alpha -\ \beta x_i\right)}^{2-1}}\right]\times \left(-x_i\right)\]
\[\ \ \ \ \ \ \ =\ -2\sum{x_i\left(y_i-\alpha -\ \beta x_i\right)}\]

Now, setting this equal to zero, we get,
\[-2\sum{x_i\left(y_i-\alpha -\ \beta x_i\right)}=0\]
\[\sum{x_i\left(y_i-\alpha -\ \beta x_i\right)=0\ }\]
\[\sum{\left(x_iy_i-\alpha x_i-\ \beta x^2_i\right)=0}\]
\[\sum{x_iy_i-\ \sum{{\alpha x}_i-\ \sum{\beta x^2_i}}=0}\]
\[\sum{x_i}y_i-\alpha \sum{x_i-\ \beta \sum{x^2_i}=0\ }\]

In the previous partial derivation, we found that \(\alpha =\ \overline{y}-\ \beta \overline{x}\). Substituting this into the above equation, we get

\[\sum{x_iy_i-\ \left(\overline{y}-\beta \overline{x}\right)\sum{x_i-\beta \sum{x^2_i}=0}}\]
\[\sum{x_iy_i-\overline{y}}\sum{x_i+\beta }\overline{x}\sum{x_i-\beta \sum{x^2_i=0}}\]
\[\sum{x_iy_i-\overline{y}}\sum{x_i=\ \beta \sum{x^2_i-\ \beta \overline{x}\sum{x_i}}}\]
\[\sum{x_iy_i-\overline{y}}\sum{x_i}=\ \beta \left(\sum{x^2_i-\ \overline{x}}\sum{x_i}\right)\]
\[\beta =\frac{\sum{x_iy_i}-\overline{y}\sum{x_i}}{\sum{x^2_i}-\overline{x}\sum{x_i}}\]

Now we use \(\overline{x}=\ \frac{\sum{x}}{n}\Longrightarrow \ \sum{x}=n\overline{x}\)

\[\beta =\ \frac{\sum{x_iy_i}-\overline{y}\left(n\overline{x}\right)}{\sum{x^2_i}-\overline{x}\left(n\overline{x}\right)}\]
\[\beta =\ \frac{\sum{x_iy_i}-n\overline{x}\overline{y}}{\sum{x^2_i}-n{\overline{x}}^2}\]

We recognize the numerator as the corrected sum of cross-products of \(x\) and \(y\), \(S_{xy}\), and the denominator as the corrected sum of squares of \(x\), \(S_{xx}\). Therefore,

\[\beta =\ \frac{S_{xy}}{S_{xx}}\]

Instead of writing them as \(\alpha \) and \(\beta \), we write them as \(\widehat{\alpha }\) and \(\widehat{\beta }\), since they are only estimates of \(\alpha \) and \(\beta \). Once the values of \(\widehat{\alpha }\) and \(\widehat{\beta }\) are found, we can substitute them into the equation of the regression line

\[y=\ \widehat{\alpha }+\ \widehat{\beta }x\]

Also, for any value \(x_i\), we can find out the predicted value of \(y_i\) (the response variable), as follows:-

\[\widehat{y_i}=\ \widehat{\alpha }+\ \widehat{\beta }x_i\]
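These formulas translate directly into code. Below is a minimal Python sketch (pure standard library; the function names `fit_line` and `predict` are our own), which also sanity-checks that the residuals of the fitted line sum to zero, as the first normal equation requires:

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha_hat, beta_hat) for y = alpha + beta*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # S_xy = sum(x*y) - n*x_bar*y_bar and S_xx = sum(x^2) - n*x_bar^2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    # First normal equation: residuals of the fitted line sum to zero.
    assert abs(sum(y - (alpha_hat + beta_hat * x)
                   for x, y in zip(xs, ys))) < 1e-9
    return alpha_hat, beta_hat

def predict(alpha_hat, beta_hat, x):
    """Predicted response y_hat for a given value of x."""
    return alpha_hat + beta_hat * x
```

For instance, fitting the data of Example 1 below gives \(\widehat{\alpha }=13.5\), \(\widehat{\beta }=2\) and a predicted response of 93.5 at \(x=40\).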

Once the least squares estimates \(\widehat{\alpha }\) and \(\widehat{\beta }\) are calculated and the equation of the regression line is found, it is necessary to analyze how well our model fits the data, i.e. we need to know the `goodness of fit` of our regression model.

Goodness of fit:-

To analyze the `goodness of fit` for the regression model we need to analyze the total variation of the responses. The total variation of the response is

\[S_{yy}=\ \sum{{\left(y_i-\overline{y}\right)}^2}\]

This is simply the sum of squares of \(y\). It measures the sum of squared deviation of the response variable \(y_i\) from the mean \(\overline{y}\) . This total variation in the model consists of two components, the variation explained by the model and the variation not explained by the model. This can be derived as follows:

Consider \(y_i-\ \overline{y}\)

Adding and subtracting \({\hat{y}}_i\), we get

\[y_i-\overline{y}=\ \left(y_i-{\hat{y}}_i\right)+\ \left({\hat{y}}_i-\overline{y}\right)\]

Squaring and summing both sides, we get,
\[\sum{{\left(y_i-\overline{y}\right)}^2}=\ \sum{{\left[\left(y_i-{\hat{y}}_i\right)+\ \left({\hat{y}}_i-\overline{y}\right)\right]}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ =\ \sum{\left[{\left(y_i-{\hat{y}}_i\right)}^2+2\left(y_i-{\hat{y}}_i\right)\left({\hat{y}}_i-\overline{y}\right)+\ {\left({\hat{y}}_i-\overline{y}\right)}^2\right]}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ =\ \sum{{\left(y_i-{\hat{y}}_i\right)}^2+\ \sum{{\left({\hat{y}}_i-\overline{y}\right)}^2+\sum{2\left(y_i-{\hat{y}}_i\right)\left({\hat{y}}_i-\overline{y}\right)}}}\]

The last term in the RHS simplifies to 0 (the cross term vanishes because, for the least squares line, the residuals satisfy the normal equations \(\sum{\left(y_i-{\hat{y}}_i\right)}=0\) and \(\sum{x_i\left(y_i-{\hat{y}}_i\right)}=0\)), giving us the following equation:-
\[\sum{{\left(y_i-\overline{y}\right)}^2=\ \sum{{\left(y_i-{\hat{y}}_i\right)}^2}}+\ \sum{{\left({\hat{y}}_i-\overline{y}\right)}^2}\]

The term in the LHS is the total sum of squares. It represents the total variation in the model. It is denoted by \({SS}_{TOT}\) .

The second term in the RHS is the regression sum of squares. It represents the variation that is explained by the regression model. It is the sum of squared deviations of the fitted responses from the mean. It is denoted by \({SS}_{REG}\) .

The first term in the RHS is the residual sum of squares. It represents the variation that is unexplained by the regression model. It is the sum of squared deviations of the actual response from the fitted response. It is denoted by \({SS}_{RES}\) .


Therefore, the variation in the model can be written as follows:

\[\sum{{\left(y_i-\overline{y}\right)}^2=\ \sum{{\left(y_i-{\hat{y}}_i\right)}^2}}+\ \sum{{\left({\hat{y}}_i-\overline{y}\right)}^2}\]
\[{SS}_{TOT}={SS}_{RES}+{SS}_{REG}\]
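This decomposition can be verified numerically: fit a least-squares line, form the three sums of squares from their definitions, and check that they balance. A short Python sketch (the data here are illustrative, not from the text):

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]  # illustrative, roughly linear data

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
beta = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
    sum(x * x for x in xs) - n * x_bar ** 2)
alpha = y_bar - beta * x_bar
fitted = [alpha + beta * x for x in xs]

ss_tot = sum((y - y_bar) ** 2 for y in ys)              # total variation
ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained part
ss_reg = sum((f - y_bar) ** 2 for f in fitted)          # explained part

# SS_TOT = SS_RES + SS_REG, up to floating-point rounding
assert abs(ss_tot - (ss_res + ss_reg)) < 1e-9
```

The identity holds only for the least-squares line; for an arbitrary line the cross term does not vanish.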

Calculating \({SS}_{TOT}\), \({SS}_{RES}\) and \({SS}_{REG}\) :-

\[{SS}_{TOT}=\ S_{yy}=\ \sum{{\left(y_i-\overline{y}\right)}^2}=\ \sum{y^2}-n{\overline{y}}^2\]

\[{SS}_{REG}=\ \sum{{\left({\hat{y}}_i-\overline{y}\right)}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =\ \sum{{\left[\left(\widehat{\alpha }+\ \widehat{\beta }x_i\right)-\left(\widehat{\alpha }+\ \widehat{\beta }\overline{x}\right)\right]}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =\ \sum{{\left[\widehat{\beta }\left(x_i-\overline{x}\right)\right]}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =\sum{\left[{\widehat{\beta }}^2{\left(x_i-\overline{x}\right)}^2\right]}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ ={\widehat{\beta }}^2\sum{{\left(x_i-\overline{x}\right)}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =\ {\widehat{\beta }}^2S_{xx}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =\ {\left(\frac{S_{xy}}{S_{xx}}\right)}^2S_{xx}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =\ \frac{S^2_{xy}}{S_{xx}}\]

\[{SS}_{RES}=\ {SS}_{TOT}-\ {SS}_{REG}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =\ S_{yy}-\ \frac{S^2_{xy}}{S_{xx}}\]

In order to determine the `goodness of fit` we need to know the proportion of the total variation that is explained by the model. This proportion is called the co-efficient of determination and is denoted by \(R^2\).

\[R^2=\ \frac{{SS}_{REG}}{{SS}_{TOT}}=\ \frac{S^2_{xy}}{S_{xx}S_{yy}}\]

It can be seen that \(R^2\) is the square of the co-efficient of correlation, \(r\), for a given set of data.
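This identity is easy to confirm numerically, computing \(r\) from its usual definition \(r = S_{xy}/\sqrt{S_{xx}S_{yy}}\). A short Python check (the data are illustrative):

```python
import math

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 5, 7, 8, 12, 13]  # illustrative data

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = s_xy / math.sqrt(s_xx * s_yy)   # co-efficient of correlation
r_sq = s_xy ** 2 / (s_xx * s_yy)    # co-efficient of determination R^2

assert abs(r_sq - r ** 2) < 1e-12   # R^2 equals r squared
```

Note that \(R^2\) always lies between 0 and 1, since squaring discards the sign of \(r\).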

Example 1

Following is the information on the number of classes attended by 8 different students and the marks scored by them in exams of the subject for which they attended the classes:-

Number of classes \(\left(x_i\right)\) | Marks scored \((y_i)\) |
---|---|
\(20\) | \(50\) |
\(22\) | \(57\) |
\(24\) | \(63\) |
\(26\) | \(68\) |
\(28\) | \(72\) |
\(30\) | \(75\) |
\(32\) | \(77\) |
\(34\) | \(78\) |

The students wish to use a linear regression model to find out how much a student would score in the exam after attending 40 classes.

We need to fit a line that most suitably explains the relationship between the two sets of data. The equation of this regression line will be \(y=\ \widehat{\alpha }+\ \widehat{\beta }x\). To find the equation, we need to calculate the least squares estimates \(\widehat{\alpha }\) and \(\widehat{\beta }\).

\[\ \widehat{\alpha }=\overline{y}-\ \widehat{\beta }\overline{x}\]
\[\widehat{\beta }=\ \frac{S_{xy}}{S_{xx}}=\ \frac{\sum{x_iy_i}-n\overline{x}\overline{y}}{\sum{x^2_i}-n{\overline{x}}^2}\]

To calculate these, we need \(\sum{xy}\), \(\sum{x^2}\), \(\overline{x}\) and \(\overline{y}\).

\(x\) | \(y\) | \(x^2\) | \(xy\) | \(y^2\) |
---|---|---|---|---|
\(20\) | \(50\) | \(400\) | \(1000\) | \(2500\) |
\(22\) | \(57\) | \(484\) | \(1254\) | \(3249\) |
\(24\) | \(63\) | \(576\) | \(1512\) | \(3969\) |
\(26\) | \(68\) | \(676\) | \(1768\) | \(4624\) |
\(28\) | \(72\) | \(784\) | \(2016\) | \(5184\) |
\(30\) | \(75\) | \(900\) | \(2250\) | \(5625\) |
\(32\) | \(77\) | \(1024\) | \(2464\) | \(5929\) |
\(34\) | \(78\) | \(1156\) | \(2652\) | \(6084\) |
\(\sum{x}=216\) | \(\sum{y}=540\) | \(\sum{x^2}=6000\) | \(\sum{xy}=14916\) | \(\sum{y^2}=37164\) |

\[\overline{x}=\ \frac{\sum{x}}{n}=\ \frac{216}{8}=27\]

\[\overline{y}=\ \frac{\sum{y}}{n}=\ \frac{540}{8}=67.5\]

\[\widehat{\beta }=\ \frac{\sum{x_iy_i-n\overline{x}\overline{y}}}{\sum{x^2_i-\ {n\overline{x}}^2}}\]
\[\ \ \ =\ \frac{14916-\ \left(8\times 27\times 67.5\right)}{6000-\left(8\times {27}^2\right)}\]
\[\ \ \ =2\]

\[\widehat{\alpha }=\ \overline{y}-\ \widehat{\beta }\overline{x}\]
\[\ \ \ =67.5-2\times 27\]
\[\ \ \ =13.5\]

Therefore, the regression equation will be,
\[y= 13.5+ 2x\]

Here \(\widehat{\alpha }=13.5\) represents the \(y\)-intercept, i.e. when \(x=0,\ y\) will be 13.5.

\(\widehat{\beta }=2\ \) represents the slope of the line, i.e. for every one unit of increase in \(x\), the response (\(y\) value) will increase by 2 units.

Now, to find out the estimated response for \(x=40\), we need to substitute \(x=40\) into the following equation:-

\[{\hat{y}}_i=\ \widehat{\alpha }+\ \widehat{\beta }x_i\]
\[\ \ \ =13.5+2\left(40\right)\]
\[\ \ \ =93.5\]

This means that a student might be able to score 93.5 marks in the exam if he/she attended 40 classes.
Let us now find out the goodness of fit:-

\[{SS}_{TOT}=\ \sum{y^2-\ {n\overline{y}}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =37164-\ \left(8\times \ {67.5}^2\right)\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =714\]

\[{SS}_{REG}=\ \frac{S^2_{xy}}{S_{xx}}=\ \frac{{\left(\sum{xy}-n\overline{x}\overline{y}\right)}^2}{\sum{x^2}-n{\overline{x}}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =\frac{{\left(14916-8\times 27\times 67.5\right)}^2}{6000-8\times {27}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =672\]

\[{SS}_{RES}=\ {SS}_{TOT}-\ {SS}_{REG}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =714-672\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =42\]

\[R^2=\ \frac{{SS}_{REG}}{{SS}_{TOT}}=\ \frac{672}{714}\]
\[\ \ \ \ \ =0.941176\]
\[\ \ \ \ \ =94.1176\ \%\]

This means that about \(94.12\ \%\) of the total variation is explained by the regression model we have fit, making it a good model for the given data on the number of classes attended and the marks scored.
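The whole of Example 1 can be reproduced in a few lines of Python, using the formulas derived earlier (variable names are our own):

```python
xs = [20, 22, 24, 26, 28, 30, 32, 34]   # classes attended
ys = [50, 57, 63, 68, 72, 75, 77, 78]   # marks scored
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n           # 27 and 67.5

s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
s_xx = sum(x * x for x in xs) - n * x_bar ** 2
beta = s_xy / s_xx                                # 2.0
alpha = y_bar - beta * x_bar                      # 13.5

y_hat_40 = alpha + beta * 40                      # 93.5

ss_tot = sum(y * y for y in ys) - n * y_bar ** 2  # 714.0
ss_reg = s_xy ** 2 / s_xx                         # 672.0
r_sq = ss_reg / ss_tot                            # 0.9412 (to 4 d.p.)
```

Every intermediate value agrees with the hand calculation above.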

Example 2

Following are the prices of 6 super cars of the same model, used for different numbers of years:-

Years used \(\left(x_i\right)\) | Price \((y_i)\) |
---|---|
\(1\) | \(48\) |
\(2\) | \(38\) |
\(3\) | \(30\) |
\(4\) | \(26\) |
\(5\) | \(24\) |
\(6\) | \(23\) |

It is required to know the estimated price of a super car of the same model used for 9 years.

\(\left(x_i\right)\) | \((y_i)\) | \(x^2\) | \(xy\) | \(y^2\) |
---|---|---|---|---|
\(1\) | \(48\) | \(1\) | \(48\) | \(2304\) |
\(2\) | \(38\) | \(4\) | \(76\) | \(1444\) |
\(3\) | \(30\) | \(9\) | \(90\) | \(900\) |
\(4\) | \(26\) | \(16\) | \(104\) | \(676\) |
\(5\) | \(24\) | \(25\) | \(120\) | \(576\) |
\(6\) | \(23\) | \(36\) | \(138\) | \(529\) |
\(\sum{x}=21\) | \(\sum{y}=189\) | \(\sum{x^2}=91\) | \(\sum{xy}=576\) | \(\sum{y^2}=6429\) |

\[\overline{x}=\ \frac{\sum{x}}{n}=\ \frac{21}{6}=3.5\]

\[\overline{y}=\ \frac{\sum{y}}{n}=\ \frac{189}{6}=31.5\]

\[\widehat{\beta }=\ \frac{\sum{x_iy_i}-n\overline{x}\overline{y}}{\sum{x^2_i}-n{\overline{x}}^2}\]
\[\ \ \ =\ \frac{576-\ \left(6\times 3.5\times 31.5\right)}{91-\ \left(6\times {3.5}^2\right)}\]
\[\ \ \ =\ -4.886\]

\[\widehat{\alpha }=\ \overline{y}-\ \widehat{\beta }\overline{x}\]
\[\ \ \ =31.5-\left(-4.886\right)\times 3.5\]
\[\ \ \ =48.6\]

Therefore, the regression equation will be,
\[y=48.6-4.886x\]

Now, to find out the estimated response for \(x=9\) , we need to substitute \(x=9\) into the following equation:-

\[{\hat{y}}_i=\ \widehat{\alpha }+\ \widehat{\beta }x_i\]
\[\ \ \ =48.6-4.886\left(9\right)\]
\[\ \ \ =4.626\]

This means that the estimated price of a car of the same model used for 9 years is about 4.63.

Let us now find out the goodness of fit:-
\[{SS}_{TOT}=\ \sum{y^2-n{\overline{y}}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =6429-\ \left(6\times 31.5^2\right)\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =475.5\]

\[{SS}_{REG}=\ \frac{S^2_{xy}}{S_{xx}}=\ \frac{{\left(\sum{xy}-n\overline{x}\overline{y}\right)}^2}{\sum{x^2}-n{\overline{x}}^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =\ \frac{{\left(576-6\times 3.5\times 31.5\right)}^2}{91-6\times 3.5^2}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =417.73\]

\[{SS}_{RES}=\ {SS}_{TOT}-\ {SS}_{REG}\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =475.5-417.73\]
\[\ \ \ \ \ \ \ \ \ \ \ \ =57.77\]

\[R^2=\ \frac{{SS}_{REG}}{{SS}_{TOT}}=\ \frac{417.73}{475.5}\]
\[\ \ \ \ \ =0.8785\]
\[\ \ \ \ \ =87.85\ \%\]
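Example 2 can be checked the same way in Python. (Note that the exact prediction at \(x=9\) is 4.6286; the value 4.626 above comes from rounding \(\widehat{\beta }\) to \(-4.886\) before substituting.)

```python
xs = [1, 2, 3, 4, 5, 6]                  # years used
ys = [48, 38, 30, 26, 24, 23]            # price
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n  # 3.5 and 31.5

s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar  # -85.5
s_xx = sum(x * x for x in xs) - n * x_bar ** 2                 # 17.5
beta = s_xy / s_xx                       # -4.8857 (to 4 d.p.)
alpha = y_bar - beta * x_bar             # 48.6

y_hat_9 = alpha + beta * 9               # 4.6286 (to 4 d.p.)

ss_tot = sum(y * y for y in ys) - n * y_bar ** 2  # 475.5
ss_reg = s_xy ** 2 / s_xx                # 417.73 (to 2 d.p.)
r_sq = ss_reg / ss_tot                   # 0.8785 (to 4 d.p.)
```

The lower \(R^2\) here (87.85% versus 94.12% in Example 1) reflects the visibly non-linear levelling-off of the prices as the cars age.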