Correlation in Statistics

Correlation

Correlation measures how strongly two variables are related, and in which direction. It is denoted by r (or ρ) and takes values between -1 and +1. If r is -1, there is a perfect negative relationship; if r is +1, there is a perfect positive relationship between the variables.

Important Points

r=-1 Perfect Negative Correlation

r=0 No Correlation

r=1 Perfect Positive Correlation

r lies between -1 and +1
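The three reference values can be reproduced on small made-up data. The following is a minimal sketch that assumes NumPy is installed; the helper name `corr` and the sample arrays are illustrative and not part of the original post.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def corr(a, b):
    """Pearson correlation of two 1-D arrays, read from numpy.corrcoef."""
    return float(np.corrcoef(a, b)[0, 1])

r_pos = corr(x, 2 * x + 1)      # increasing straight line
r_neg = corr(x, -3 * x + 10)    # decreasing straight line
r_zero = corr(x, (x - 3) ** 2)  # symmetric U shape: no linear trend

print(r_pos, r_neg, r_zero)
```

The last case is a reminder that r = 0 only rules out a linear relationship: here the second variable is fully determined by the first, yet the coefficient is zero.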

Pearson Correlation-

Named after Karl Pearson, this is the most widely used correlation coefficient. For two variables X and Y with N observations each, the coefficient r is the covariance of X and Y divided by the product of their standard deviations.
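The standard definition of Pearson's r is r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²), where x̄ and ȳ are the means of X and Y. The sketch below is one way to compute it directly in plain Python; the function name `pearson_r` and the sample values are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from the definition (illustrative helper)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator terms)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sample data, chosen only for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_manual = pearson_r(xs, ys)
print(round(r_manual, 4))  # 6 / sqrt(60), about 0.7746
```

In practice a library routine such as `scipy.stats.pearsonr` is preferable, since it also reports a p-value.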

Calculation of Pearson’s Correlation Coefficient in Python

Let X and Y be two variables with the following values:

X = [40, 46, 55, 60, 70, 75, 78, 80, 85, 95]
Y = [40, 46, 55, 60, 70, 75, 78, 80, 85, 95]

The Python code to compute the correlation is

# Computation of Pearson's correlation coefficient
from scipy.stats import pearsonr
from matplotlib import pyplot
X = [40, 46, 55, 60, 70, 75, 78, 80, 85, 95]
Y = [40, 46, 55, 60, 70, 75, 78, 80, 85, 95]
# pearsonr(X, Y) returns the correlation coefficient and its p-value
r = pearsonr(X, Y)
print("Pearson's Correlation Coefficient", r)
# Draw the scatter plot and save it to a file
pyplot.scatter(X, Y)
pyplot.savefig("pearsonr.png")

The output of the program would be

Output: Pearson's Correlation Coefficient (1.0, 0.0)
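The pair printed above is the correlation coefficient followed by a p-value. Newer SciPy releases return a result object rather than a plain tuple, so the printed form may look different from the output shown here. A small sketch of a way to get the two numbers separately, which should work with either form (the variable names are my own):

```python
from scipy.stats import pearsonr

X = [40, 46, 55, 60, 70, 75, 78, 80, 85, 95]
Y = [40, 46, 55, 60, 70, 75, 78, 80, 85, 95]

# Unpack the result into the coefficient and the p-value
r_value, p_value = pearsonr(X, Y)

print(f"r = {r_value:.4f}")
print(f"p-value = {p_value:.3g}")
```

A small p-value suggests that a correlation this strong would be unlikely if the two variables were actually uncorrelated.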

The scatter plot for the data is saved as pearsonr.png.

Figure: Pearson correlation coefficient, perfect positive correlation

From the scatter diagram you can observe that there is a perfect positive correlation between the X and Y variables.

This is because X and Y have the same values.

Now consider the data set

X = [40, 46, 55, 60, 70, 75, 78, 80, 85, 95]
Y = [95, 85, 80, 78, 75, 70, 60, 55, 46, 40]

The corresponding Python code is

# Computation of Pearson's correlation coefficient
from scipy.stats import pearsonr
from matplotlib import pyplot
X = [40, 46, 55, 60, 70, 75, 78, 80, 85, 95]
Y = [95, 85, 80, 78, 75, 70, 60, 55, 46, 40]
# pearsonr(X, Y) returns the correlation coefficient and its p-value
r = pearsonr(X, Y)
print("Pearson's Correlation Coefficient", r)
# Draw the scatter plot and save it to a file
pyplot.scatter(X, Y)
pyplot.savefig("negativepearsonr.png")

The output of the program would be

Pearson's Correlation Coefficient (-0.9613416714042071, 9.325227687014438e-06)

The scatter plot of the data is saved as negativepearsonr.png.

Figure: Pearson correlation scatter plot, negative correlation

You can observe that I have just reversed the order of the data, and the relationship has become strongly negative (r is about -0.96).
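The coefficient here is about -0.96 rather than exactly -1 because the reversed values do not fall on a single straight line against X. To see this, it can help to compare the reversed list with a Y that is an exact decreasing linear function of X. A minimal sketch; the constant 135 and the variable names are arbitrary choices for illustration.

```python
from scipy.stats import pearsonr

X = [40, 46, 55, 60, 70, 75, 78, 80, 85, 95]

Y_reversed = X[::-1]              # same values in reverse order
Y_linear = [135 - x for x in X]   # exact straight line with slope -1

r_reversed, _ = pearsonr(X, Y_reversed)
r_linear, _ = pearsonr(X, Y_linear)

print(f"reversed order: r = {r_reversed:.4f}")
print(f"exact line:     r = {r_linear:.4f}")
```

Only the exact line reaches -1; the reversed list is a decreasing sequence, but its spacing does not mirror the spacing of X.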

Spearman Rank Correlation

Spearman's rank correlation is used for qualitative data that can be ordered. The first step is to convert the data into ranks. The coefficient is then computed from the differences between the paired ranks.
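When no two values are tied, the usual textbook form is ρ = 1 − 6Σd² / (n(n² − 1)), where d is the difference between the two ranks of each observation and n is the number of observations. The sketch below applies it to made-up scores; the function name `spearman_rho` and the data are assumptions for illustration, and `scipy.stats.rankdata` is used to assign the ranks.

```python
from scipy.stats import rankdata

def spearman_rho(xs, ys):
    """Spearman's rho from rank differences (valid only when there are no ties)."""
    n = len(xs)
    rank_x = rankdata(xs)  # rank 1 = smallest value
    rank_y = rankdata(ys)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical scores of five students in two subjects
scores_a = [10, 20, 30, 40, 50]
scores_b = [12, 25, 22, 48, 41]

rho_manual = spearman_rho(scores_a, scores_b)
print(round(rho_manual, 4))  # sum of d^2 is 4, so rho = 1 - 24/120 = 0.8
```

With tied values this shortcut is no longer exact, which is one reason to prefer `scipy.stats.spearmanr` on real data.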

Let R1 and R2 be the ranks given to students in statistics and mathematics at a university.

The values of R1 and R2 are

R1 = [3,5,8,10,15,26,30,36,40,42]
R2 = [3,5,8,10,15,26,30,36,40,42]

Python code for calculating Spearman's rank correlation coefficient

# Spearman's correlation coefficient
from scipy.stats import spearmanr
from matplotlib import pyplot
R1 = [3, 5, 8, 10, 15, 26, 30, 36, 40, 42]
R2 = [3, 5, 8, 10, 15, 26, 30, 36, 40, 42]
# spearmanr(R1, R2) returns the rank correlation coefficient and its p-value
r = spearmanr(R1, R2)
print("Spearman's Correlation Coefficient", r)
# Draw the scatter plot and save it to a file
pyplot.scatter(R1, R2)
pyplot.savefig("spearmanr.png")

The output of the program would be

Output: Spearman's Correlation Coefficient SpearmanrResult(correlation=0.9999999999999999, pvalue=6.646897422032013e-64)

The scatter plot is saved as spearmanr.png.

Because R1 and R2 have the same values, there is a perfect positive correlation.

Furthermore, if I reverse R2, the correlation becomes perfectly negative.

The Python code is

# Spearman's correlation coefficient
from scipy.stats import spearmanr
from matplotlib import pyplot
R1 = [3, 5, 8, 10, 15, 26, 30, 36, 40, 42]
R2 = [42, 40, 36, 30, 26, 15, 10, 8, 5, 3]
# spearmanr(R1, R2) returns the rank correlation coefficient and its p-value
r = spearmanr(R1, R2)
print("Spearman's Correlation Coefficient", r)
# Draw the scatter plot and save it to a file
pyplot.scatter(R1, R2)
pyplot.savefig("negcorspearmanr.png")

The output of the program would be

Output: Spearman's Correlation Coefficient SpearmanrResult(correlation=-0.9999999999999999, pvalue=6.646897422032013e-64)

The corresponding scatter plot is saved as negcorspearmanr.png.
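Spearman's coefficient depends only on the order of the values, which is where it differs from Pearson's. A minimal sketch of that difference, using hypothetical data in which y always increases with x but not along a straight line:

```python
from scipy.stats import pearsonr, spearmanr

x = list(range(1, 11))
y = [v ** 3 for v in x]  # strictly increasing, but curved

pearson_value, _ = pearsonr(x, y)
spearman_value, _ = spearmanr(x, y)

print(f"Pearson:  {pearson_value:.4f}")
print(f"Spearman: {spearman_value:.4f}")
```

Spearman's coefficient comes out at 1 (up to floating-point rounding) because the ranks of x and y agree exactly, while Pearson's stays below 1 because the points do not lie on a line.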

Conclusion-

Correlation is a very important topic in machine learning, statistics, and data science. It helps to uncover relationships in a data set. In this post, I have explained two popular methods for computing correlation. I hope you can now understand and apply them.



©Postnetwork-All rights reserved.