CORRELATION

DEFINITION

Correlation is a measure that indicates the strength of the linear relationship (if any) between the values of two different variables.

The most commonly used measure is Karl Pearson's co-efficient of correlation. Although correlation measures the relationship between variables, it does not indicate causation. A plausible cause-and-effect relationship is needed before a correlation between variables can be meaningfully interpreted.

Co-efficient of correlation:

Interpretation:

For a set of values \(x_1, x_2, \dots, x_n\) of a variable \(x\), the sample variance can be measured as follows:

\[s^2=\frac{\sum{{\left(x-\overline{x}\right)}^2}}{n-1}=\frac{\sum{x^2}-n{\overline{x}}^2}{n-1}\]

Similarly, the variance of the values of variable \(y\) can be calculated as follows:

\[S^2=\frac{\sum{{\left(y-\overline{y}\right)}^2}}{n-1}=\frac{\sum{y^2}-n{\overline{y}}^2}{n-1}\]

The variance measures how spread out the observations are around the mean. But we also need to know how the values of \(y\) vary with respect to the values of variable \(x\). For this, we calculate a measure called the covariance.

\[Covariance=\frac{\sum{\left(x-\overline{x}\right)\left(y-\overline{y}\right)}}{n-1}=\frac{\sum{xy}-n\overline{x}\overline{y}}{n-1}\]
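As a rough illustration, the variance and covariance formulas above can be sketched in Python. The function names and the five data points are made up for this example and do not come from the text. Both the deviation form and the shortcut form of the covariance are computed, to show that they agree.

```python
def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def sample_covariance(xs, ys):
    """Sample covariance in deviation form: sum of (x - mean_x)(y - mean_y), over n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_covariance_shortcut(xs, ys):
    """Same quantity in shortcut form: (sum of xy - n * mean_x * mean_y), over n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / (n - 1)


# Hypothetical data, chosen only so the arithmetic is easy to follow.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

var_x = sample_variance(xs)
cov_xy = sample_covariance(xs, ys)
cov_xy_shortcut = sample_covariance_shortcut(xs, ys)

print("variance of x:", var_x)
print("covariance (deviation form):", cov_xy)
print("covariance (shortcut form):", cov_xy_shortcut)
```

For this data the mean of x is 3 and the mean of y is 4, so the variance of x works out to 2.5 and both covariance forms give 1.5, up to floating-point rounding.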

However, the covariance depends on the scale of the data: two sets of data with the same degree of correlation can have very different covariances. The covariance therefore does not by itself indicate the strength of the relationship between the variables. It is also expressed in the product of the units of the two variables, which makes it difficult to compare the correlation of two or more sets of data.
We therefore need a measure that is independent of the units of measurement, so that we get the same value for any set of data correlated in the same manner and to the same extent. To obtain such a measure, we divide the covariance by the product of the standard deviations of the two variables, and arrive at a measure called the co-efficient of correlation, denoted by \(r\).
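The effect of units can be seen in a small Python sketch. The heights and weights below are hypothetical values invented for this illustration, and the helper names are assumptions rather than anything defined in the text. The same heights are expressed in metres and in centimetres: the covariance changes with the unit, while the covariance divided by the two standard deviations does not.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance: sum of (x - mean_x)(y - mean_y), over n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    std_x = sqrt(covariance(xs, xs))  # covariance of a variable with itself is its variance
    std_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


# Hypothetical measurements: the same heights in metres and in centimetres.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [52, 58, 66, 71, 80]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print("covariance (m, kg): ", cov_m)
print("covariance (cm, kg):", cov_cm)  # 100 times larger, only because the unit changed
print("r with metres:      ", r_m)
print("r with centimetres: ", r_cm)    # same value up to floating-point rounding
```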

\[r=\frac{\frac{\sum{xy}-n\overline{x}\overline{y}}{n-1}}{\sqrt{\left(\frac{\sum{x^2}-n{\overline{x}}^2}{n-1}\right)\left(\frac{\sum{y^2}-n{\overline{y}}^2}{n-1}\right)}}\]
\[\ \ =\frac{\frac{\sum{xy}-n\overline{x}\overline{y}}{n-1}}{\frac{\sqrt{\left(\sum{x^2}-n{\overline{x}}^2\right)\left(\sum{y^2}-n{\overline{y}}^2\right)}}{n-1}}\]
\[\ \ =\frac{\sum{xy}-n\overline{x}\overline{y}}{\sqrt{\left(\sum{x^2}-n{\overline{x}}^2\right)\left(\sum{y^2}-n{\overline{y}}^2\right)}}\]

This equation can be written in terms of the sum of squares, as follows:-
\[r=\frac{s_{xy}}{\sqrt{s_{xx}s_{yy}}}\]

where,
\[s_{xy}=\sum{xy}-n\overline{x}\overline{y}\]
\[s_{xx}=\sum{x^2}-n{\overline{x}}^2\]
\[s_{yy}=\sum{y^2}-n{\overline{y}}^2\]
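The formula for \(r\) in terms of \(s_{xy}\), \(s_{xx}\) and \(s_{yy}\) translates fairly directly into code. The following is a minimal sketch, not a reference implementation: the function name `pearson_r`, the error handling, and the sample data are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation co-efficient r = s_xy / sqrt(s_xx * s_yy)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and products, in the shortcut form used in the text.
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    s_xx = sum(x * x for x in xs) - n * mean_x ** 2
    s_yy = sum(y * y for y in ys) - n * mean_y ** 2

    if s_xx == 0 or s_yy == 0:
        # A variable that never varies has zero variance, so r is undefined.
        raise ValueError("r is undefined when a variable has zero variance")
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data: s_xy = 6, s_xx = 10, s_yy = 6, so r = 6 / sqrt(60).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

The shortcut form subtracts two large, nearly equal numbers, so for data with a very large mean relative to its spread the deviation form \(\sum{(x-\overline{x})(y-\overline{y})}\) is usually the numerically safer choice.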

The value of \(r\) obtained is a pure number (independent of the units of measurement) between -1 and 1 inclusive, i.e. \(-1\le r\le 1\).

Different values of \(r\) can be interpreted as follows:-

Value of \(r\) | Interpretation |
---|---|
\(r=1\) | The two variables move in the same direction and have a perfectly linear relationship. |
\(0<r<1\) | The two variables tend to move in the same direction, but the linear relationship is not perfect. |
\(r=0\) | The two variables have no linear relationship. |
\(-1<r<0\) | The two variables tend to move in opposite directions, but the linear relationship is not perfect. |
\(r=-1\) | The two variables move in opposite directions and have a perfectly linear relationship. |
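The boundary cases can be checked numerically with a short Python sketch. The three data sets are invented for this illustration and `pearson_r` is a hypothetical helper name. The third case is included because \(r\) measures only the linear relationship: \(y=x^2\) on values of \(x\) symmetric about zero depends exactly on \(x\), yet gives \(r=0\).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation co-efficient r = s_xy / sqrt(s_xx * s_yy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    s_xx = sum(x * x for x in xs) - n * mean_x ** 2
    s_yy = sum(y * y for y in ys) - n * mean_y ** 2
    return s_xy / sqrt(s_xx * s_yy)


xs = [-2, -1, 0, 1, 2]

rising = [2 * x + 3 for x in xs]    # exact straight line with positive slope
falling = [-3 * x + 1 for x in xs]  # exact straight line with negative slope
curved = [x ** 2 for x in xs]       # exact relationship, but not a linear one

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_curved = pearson_r(xs, curved)

print("y = 2x + 3 :", r_rising)    # perfect positive correlation
print("y = -3x + 1:", r_falling)   # perfect negative correlation
print("y = x^2    :", r_curved)    # no linear relationship
```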

The above interpretation can be represented graphically as follows:

[Figure: scatter plots showing perfect positive correlation (\(r=1\)); strong and weak positive correlation (\(0<r<1\)); perfect negative correlation (\(r=-1\)); strong and weak negative correlation (\(-1<r<0\)); and no correlation (\(r=0\)).]

Example 1

A school wants to analyse whether conducting more classes leads to better results. It gathers the following information on the number of classes conducted and the class average marks.

Number of classes \((x_i)\) | Class average \((y_i)\) |
---|---|
\(20\) | \(62\) |
\(22\) | \(68\) |
\(24\) | \(77\) |
\(26\) | \(80\) |
\(28\) | \(79\) |
\(30\) | \(83\) |
\(32\) | \(85\) |
\(34\) | \(89\) |

Let us first calculate the means of both variables.

\[\overline{x}=\frac{\sum{x}}{n}=\frac{216}{8}=27\]
\[\overline{y}=\frac{\sum{y}}{n}=\frac{623}{8}=77.875\]

We can now calculate the sum of squares.
\[s_{xx}=\sum{x^2}-n{\overline{x}}^2\]
\[\ \ \ \ \ ={20}^2+{22}^2+{24}^2+{26}^2+{28}^2+{30}^2+{32}^2+{34}^2-\left(8\times {27}^2\right)\]
\[\ \ \ \ \ =6000-5832\]
\[\ \ \ \ \ =168\]

\[s_{yy}=\sum{y^2}-n{\overline{y}}^2\]
\[\ \ \ \ ={62}^2+{68}^2+{77}^2+{80}^2+{79}^2+{83}^2+{85}^2+{89}^2-\ \left(8\times {77.875}^2\right)\]
\[\ \ \ \ =49073-48516.125\]
\[\ \ \ \ =556.875\]

\[s_{xy}=\sum{xy}-n\overline{x}\overline{y}\]
\[\ \ \ \ =\left(20\times 62\right)+\left(22\times 68\right)+\left(24\times 77\right)
+\left(26\times 80\right)+\left(28\times 79\right)+\left(30\times 83\right)+\left(32\times 85\right)
+\left(34\times 89\right)-\left(8\times 27\times 77.875\right)\]
\[\ \ \ \ =17112-16821\]
\[\ \ \ \ =291\]

We can find the correlation co-efficient as follows:-
\[r=\frac{s_{xy}}{\sqrt{s_{xx}s_{yy}}}\]
\[\ \ =\frac{291}{\sqrt{168\times 556.875}}\]
\[\ \ =0.951\]
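The arithmetic of this example can be reproduced with a few lines of Python. This is only a checking sketch: the variable names are chosen for illustration, and the data are the eight pairs from the table in this example.

```python
from math import sqrt

# Data from Example 1: number of classes (x) and class average (y).
classes = [20, 22, 24, 26, 28, 30, 32, 34]
averages = [62, 68, 77, 80, 79, 83, 85, 89]

n = len(classes)
mean_x = sum(classes) / n    # 216 / 8
mean_y = sum(averages) / n   # 623 / 8

# Sums of squares and products in the shortcut form used in the text.
s_xx = sum(x * x for x in classes) - n * mean_x ** 2
s_yy = sum(y * y for y in averages) - n * mean_y ** 2
s_xy = sum(x * y for x, y in zip(classes, averages)) - n * mean_x * mean_y

r = s_xy / sqrt(s_xx * s_yy)

print("means:", mean_x, mean_y)
print("s_xx, s_yy, s_xy:", s_xx, s_yy, s_xy)
print("r:", round(r, 3))  # 0.951
```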

Since \(r\) is positive, there is a direct relationship between the number of classes conducted and the average marks, i.e. the average marks tend to increase as the number of classes conducted increases.

Since \(r\) is close to 1, it means that there is a strong correlation between the variables.

Example 2

Following are the prices of six supercars of the same model that have been used for different numbers of years.

Number of years used \((x_i)\) | Price (in million rupees) \((y_i)\) |
---|---|
\(1\) | \(60\) |
\(2\) | \(38\) |
\(3\) | \(30\) |
\(4\) | \(26\) |
\(5\) | \(24\) |
\(6\) | \(23\) |

In order to determine the co-efficient of correlation, we need to calculate the sum of squares.

\(x\) | \(y\) | \(x^2\) | \(y^2\) | \(xy\) |
---|---|---|---|---|
\(1\) | \(60\) | \(1\) | \(3600\) | \(60\) |
\(2\) | \(38\) | \(4\) | \(1444\) | \(76\) |
\(3\) | \(30\) | \(9\) | \(900\) | \(90\) |
\(4\) | \(26\) | \(16\) | \(676\) | \(104\) |
\(5\) | \(24\) | \(25\) | \(576\) | \(120\) |
\(6\) | \(23\) | \(36\) | \(529\) | \(138\) |
\(\sum{x}=21\) | \(\sum{y}=201\) | \(\sum{x^2}=91\) | \(\sum{y^2}=7725\) | \(\sum{xy}=588\) |

With \(n=6\), the means are \(\overline{x}=\frac{21}{6}=3.5\) and \(\overline{y}=\frac{201}{6}=33.5\).

\[s_{xx}=\sum{x^2}-n{\overline{x}}^2\]
\[\ \ \ \ =91-\left(6\times {3.5}^2\right)\]
\[\ \ \ \ =17.5\]

\[s_{yy}=\sum{y^2}-n{\overline{y}}^2\]
\[\ \ \ \ =7725-\left(6\times {33.5}^2\right)\]
\[\ \ \ \ =991.5\]

\[s_{xy}=\sum{xy}-n\overline{x}\overline{y}\]
\[\ \ \ \ =588-\left(6\times 3.5\times 33.5\right)\]
\[\ \ \ \ =-115.5\]

\[r=\frac{s_{xy}}{\sqrt{s_{xx}s_{yy}}}\]
\[\ \ =\frac{-115.5}{\sqrt{17.5\times 991.5}}\]
\[\ \ =-0.877\]

Since \(r\) is negative, it means that the price of the car and the number of years it is used move in opposite directions, i.e. there is a negative or inverse relationship.
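As a cross-check, this example can be recomputed in Python from the six \((x, y)\) pairs in the table. This sketch deliberately uses the deviation form \(\sum{(x-\overline{x})(y-\overline{y})}\) rather than the shortcut form, so that it checks the result by a different route. The variable names are assumptions made for illustration.

```python
from math import sqrt

# Data from Example 2: years used (x) and price in million rupees (y).
years = [1, 2, 3, 4, 5, 6]
prices = [60, 38, 30, 26, 24, 23]

n = len(years)
mean_x = sum(years) / n    # 21 / 6
mean_y = sum(prices) / n   # 201 / 6

# Deviation form of the sums of squares and products.
s_xx = sum((x - mean_x) ** 2 for x in years)
s_yy = sum((y - mean_y) ** 2 for y in prices)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, prices))

r = s_xy / sqrt(s_xx * s_yy)

print("column totals:", sum(years), sum(prices))
print("s_xx, s_yy, s_xy:", s_xx, s_yy, s_xy)
print("r:", round(r, 3))  # -0.877
```

Because \(\sqrt{s_{xx}s_{yy}}\) is always positive, the sign of \(r\) is the sign of \(s_{xy}\), which is negative here.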