- Central Tendency: A single value that reflects the nature of a distribution.
- Dispersion: Associating some number to the ‘spread’ of a distribution.
Notice that both of these measure properties of a single distribution.
But what if we have to compare two distributions?
Suppose one of your friends said, “Tall people tend to weigh more in my family”. It sounds intuitively logical, but is it true scientifically?
To test it, you and your friend collected information from your friend’s family: Dad, Mom, your friend and his two siblings. You can find the average height and the average weight separately, and also how ‘spread’ each set of data is (a measure of dispersion). But how will you compare the “Heights” table with the “Weights” table and draw conclusions?
For that, we have a tool called Correlation Coefficient.
The correlation coefficient (developed by Karl Pearson) indicates how closely related two sets of data are. It is given by:
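In terms of the sums used in the worked examples that follow, the standard computational form of Pearson's coefficient can be written as below. The symbol n (the number of paired observations) is introduced here for the formula and is not named in the text.

$$
r_{XY} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[\,n\sum X^2 - \left(\sum X\right)^2\right]\left[\,n\sum Y^2 - \left(\sum Y\right)^2\right]}}
$$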
Calculate the correlation coefficient between the marks in Test 1 (X) and Test 2 (Y).
| ∑X = 33 | ∑Y = 24 | ∑XY = 148 | ∑X² = 223 | ∑Y² = 146 |
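As a rough check, these sums can be plugged into the standard computational formula for Pearson's coefficient. The sketch below assumes n = 5 pairs of marks: the count is not stated alongside the sums, but it is the value that reproduces the −0.8218 quoted in the discussion of this example. The function name `pearson_from_sums` is only an illustrative choice.

```python
from math import sqrt

def pearson_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r from the five sums and the number of pairs n."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Sums from the example above; n = 5 is an assumption (not stated in the text).
r = pearson_from_sums(5, 33, 24, 148, 223, 146)
print(round(r, 4))  # -0.8218
```

Working from the sums alone is enough here because the formula never needs the individual marks, only the five totals and n.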
Meaning of Correlation Coefficient
We got a value for the correlation coefficient, and it is negative. What does that mean?
A correlation has one of two directions: positive or negative. If it is positive, the two sets of values go up together. If it is negative, one goes up while the other goes down.
In example 1, the scatter plot (plotting the points (x,y) on a graph) will give us a plot like this:
Clearly, the values of Y decrease as X increases. Remember: we got our correlation coefficient as −0.8218!
But isn’t this just checking whether Y increases or decreases with X? Can’t we do that by looking at the data or the scatter plot? Not exactly. When the data set is large, or only ‘weakly’ correlated, such trends are hard to spot by eye. The correlation coefficient also tells us how strong the relationship is, not just its direction. So, it’s a very useful tool!
Find the correlation coefficient of the following marks scored by 5 students in two exams:
| ∑X = 148 | ∑Y = 154 | ∑XY = 6096 | ∑X² = 5742 | ∑Y² = 6526 |
If you continue to calculate the correlation coefficient, you will get the answer as
But – big numbers, looks scary! Remember the trick we used for standard deviation? Turns out it will work for Correlation, too!
| X | Y | U = X − 34 | V = Y − 35 | UV | U² | V² |
|   |   | ∑U = 12 | ∑V = 14 | ∑UV = 440 | ∑U² = 302 | ∑V² = 646 |
Continue the calculation to find r_UV (the answer will be 0.9987).
Take a look at what happens to the scatter plot:
Notice that the trend in the scatter plot is the same: Y (and V) still increases as X (and U) increases!
- Choose the numbers
- A (preferably the median of X),
- B (preferably the median of Y)
- Let U = X – A and V = Y – B
- Find r_UV
- r_UV = r_XY
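The steps above can be sketched in a few lines of Python. The marks below are hypothetical values made up for illustration (they are not the table from the example), and `pearson_r` is an illustrative helper built on the usual sums-based formula.

```python
from math import sqrt
from statistics import median

def pearson_r(xs, ys):
    """Pearson's r computed from the raw sums of two equal-length lists."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

# Hypothetical marks, invented for this sketch.
X = [30, 32, 34, 36, 40]
Y = [31, 33, 35, 38, 41]

# Shift each list by its median, as in the steps above.
A, B = median(X), median(Y)
U = [x - A for x in X]
V = [y - B for y in Y]

r_xy = pearson_r(X, Y)
r_uv = pearson_r(U, V)
print(abs(r_xy - r_uv) < 1e-9)  # True
```

Subtracting a constant moves every point by the same amount, so the shape of the scatter plot, and therefore r, is unchanged; only the intermediate sums get smaller.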
Here are the heights and weights of 7 random entries from this Kaggle dataset. Find the correlation coefficient and note what you observe about the relationship between height and weight.
(Try it yourself)
Some Properties of Correlation Coefficient
- r_XY = r_YX
- −1 ≤ r ≤ 1
- r > 0 when X increases as Y increases
- r = 0 when X and Y have no linear relationship
- r < 0 when X decreases as Y increases
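These properties can be checked numerically on small data sets. All values below are hypothetical and chosen only for illustration, and `pearson_r` is an illustrative helper, not a library function. The last case is a reminder that r = 0 only rules out a straight-line relationship: there, Y is completely determined by X (Y = X²), yet r is still zero.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the raw sums of two equal-length lists."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

# Hypothetical data, invented for this sketch.
X = [1, 2, 3, 4, 5]
Y_up = [2, 4, 5, 4, 6]      # tends to rise with X
Y_down = [9, 7, 6, 4, 1]    # tends to fall as X rises

r_up = pearson_r(X, Y_up)
r_down = pearson_r(X, Y_down)

print(r_up == pearson_r(Y_up, X))   # symmetry: True
print(-1 <= r_down <= 1)            # bounds: True
print(r_up > 0, r_down < 0)         # direction: True True

# A curved relationship: Y = X squared on values symmetric around zero.
X_sym = [-2, -1, 0, 1, 2]
Y_sq = [x * x for x in X_sym]
r_curve = pearson_r(X_sym, Y_sq)
print(r_curve == 0)                 # True
```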