Pearson vs. Spearman Correlation: What’s the difference?

Anyi Guo
Sep 6, 2021

A practical guide on their difference, with examples!

A correlation coefficient measures how much two variables tend to change in relation to each other. The coefficient describes both the direction and the strength of the relationship. In this article we’ll go through two of the most popular methods of calculating correlation: Pearson’s correlation and Spearman’s correlation.

Pearson’s correlation (P)

Pearson’s correlation (named after Karl Pearson) measures the linear relationship between two variables. It is calculated as:

Pearson Correlation = covariance(X, Y) / (stdv(X) * stdv(Y))
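As a quick sanity check, here is a minimal sketch (with made-up numbers) showing that computing the formula by hand matches what scipy.stats.pearsonr returns:

    import numpy as np
    from scipy.stats import pearsonr

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

    # covariance(X, Y) / (stdv(X) * stdv(Y)), using the same ddof throughout
    r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

    print(r_manual)           # 0.8
    print(pearsonr(x, y)[0])  # 0.8, same value from scipy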

Pearson’s Correlation returns a value in the range [-1, 1], where 1 means perfect positive correlation and -1 means perfect negative correlation.

Pearson’s Correlation uses the mean and standard deviation in its calculation, which makes it a parametric method: it assumes a Gaussian-like distribution for the data.

Pearson’s Correlation is the most popular method of calculating correlation, and it tends to be the default implementation in many programming languages. For example, if you call the corr() method on a Pandas dataframe, it calculates Pearson’s correlation by default unless you explicitly override it.
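A minimal Pandas sketch, with made-up data:

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 1, 4, 3, 5]})

    print(df.corr())                    # Pearson by default
    print(df.corr(method="spearman"))   # explicit override to Spearman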

The problem with Pearson’s Correlation is that it is sensitive to outliers and may lead you to a wrong conclusion depending on the data. There’s an example below. But first, let’s briefly talk about what Spearman’s Correlation is.

Spearman Correlation (S)

Spearman’s correlation (named after Charles Spearman) is the non-parametric version of Pearson’s correlation. It is used when:

  • The relationship between the two variables is non-linear (for example, a relationship that is sometimes stronger and sometimes weaker depending on the data).
  • The variables have a non-Gaussian distribution.
Spearman Correlation = 1 - (6 * sum(d^2)) / (n * (n^2 - 1))

where d is the difference between the ranks of each pair of observations and n is the number of observations (assuming no tied ranks).

Similar to Pearson’s Correlation, Spearman’s also returns a value in the range [-1, 1], where -1 means perfect negative correlation and 1 means perfect positive correlation.
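One way to see the connection between the two methods: Spearman’s correlation is simply Pearson’s correlation computed on the ranks of the data. A minimal sketch, with made-up data:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr, rankdata

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = np.exp(x) + rng.normal(scale=0.1, size=50)  # non-linear, roughly monotonic

    # Spearman's rho equals Pearson's r applied to the ranks
    print(spearmanr(x, y)[0])
    print(pearsonr(rankdata(x), rankdata(y))[0])  # same value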

A Practical Example: Pearson vs. Spearman

Enough theory: let’s see an example where the Pearson correlation alone is not sufficient for drawing a conclusion.

Imagine we have two arrays, x and y. There is a positive relationship in most of the rows: as x increases, y also increases.

Example data: x and y are positively correlated in rows 1-9, while row 10 shows a negative relationship.

We use Python to calculate the Pearson and Spearman correlations of x and y, and we notice something strange: the Pearson correlation is negative, but the Spearman correlation is positive. What’s happening?
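Here is a minimal sketch of that situation; the exact table from the article’s screenshot isn’t reproduced, so these arrays are purely illustrative (rows 1-9 move together, row 10 is an extreme outlier):

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
    y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, -50])

    print(f"Pearson:  {pearsonr(x, y)[0]:.3f}")   # negative, dominated by the outlier
    print(f"Spearman: {spearmanr(x, y)[0]:.3f}")  # positive, based on ranks only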

Pearson and Spearman point in opposite directions when there is an extreme outlier in the data.

We create a scatterplot of the data and notice that there is a big outlier. It corresponds to the last row in the dataset: its magnitude (1000) is very large relative to the other values in array x, which drags the Pearson correlation negative despite the generally positive relationship among most of the rows.

Now that we know there is an outlier in the data, we can remove it from the sample and recalculate the Pearson (P) and Spearman (S) correlations. This time P and S are much closer to each other.
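Continuing the illustrative sketch from above, dropping the outlier row gives:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    # the same illustrative arrays as before, minus the outlier row
    x_clean = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
    y_clean = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

    print(f"Pearson:  {pearsonr(x_clean, y_clean)[0]:.3f}")   # 1.000
    print(f"Spearman: {spearmanr(x_clean, y_clean)[0]:.3f}")  # 1.000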

This example shows that the Pearson correlation is sensitive to extreme values in the data. Once such anomalies are accounted for or removed, Pearson and Spearman should yield similar values, and the discrepancy is no longer a concern.

We can now plot scatterplots for the original data and for the data with the outlier removed. The fitted blue line reflects the Pearson correlation (P), and it is clear that the weakly negative correlation becomes a strongly positive one once the outlier is removed from the data.
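A minimal matplotlib sketch of such a side-by-side plot, reusing the illustrative arrays from above:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
    y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, -50])

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, xs, ys, title in [(axes[0], x, y, "Original data"),
                              (axes[1], x[:-1], y[:-1], "Outlier removed")]:
        ax.scatter(xs, ys)
        slope, intercept = np.polyfit(xs, ys, 1)  # least-squares fit
        ax.plot(xs, slope * xs + intercept, color="blue")
        ax.set_title(title)
    plt.show()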

As a general rule, when you are doing data exploration, it is recommended to calculate both Pearson and Spearman correlation. This is because any difference between the two could shed additional light on the relationship between variables, as well as any anomalies in the data itself.

When to use Pearson vs. Spearman?

Monotonic vs. linear: in a monotonic relationship, the variables tend to move in the same relative direction, but not necessarily at a constant rate, so you cannot draw a single straight line through the data. In a linear relationship, the variables move in the same direction at a constant rate, and a straight line depicts the data well.

Method 1:

  • Use Pearson’s correlation when you think the variables’ relationship is linear
  • Use Spearman’s correlation when you think the relationship is monotonic

Method 2:

  • Use Pearson’s correlation when the data is measured on an interval scale, e.g. people’s height, distance between locations.
  • Use Spearman’s correlation when data is measured on an ordinal scale, or when outliers are orders of magnitude larger than most data points, e.g. customer satisfaction scores, movie ratings, stock price fluctuations, individuals’ incomes.

TL;DR

  • As a rule of thumb, try to compute both the Pearson (P) and Spearman (S) correlation, as comparing them gives you additional information. If they are pretty close, you don’t need to worry.
  • If S and P are very different, check whether there are any outliers in the data, and consider removing or replacing them before recalculating. Spearman’s correlation is more robust to outliers than Pearson’s correlation.
  • If S >> P or S << P, the relationship is likely monotonic but not linear. You may want to try a data transformation (e.g. a log transform) to expose the correlation more clearly; see the sketch below.
  • Use scatterplots to examine your data: both methods only capture linear or monotonic relationships, but other relationships are possible as well.
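For instance, a minimal sketch with made-up exponential data, showing how a log transform exposes the linear relationship:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    x = np.arange(1, 21)
    y = np.exp(x / 2.0)  # perfectly monotonic, but far from linear

    print(f"Spearman:        {spearmanr(x, y)[0]:.3f}")         # 1.000
    print(f"Pearson (raw):   {pearsonr(x, y)[0]:.3f}")          # well below 1
    print(f"Pearson (log y): {pearsonr(x, np.log(y))[0]:.3f}")  # 1.000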

Anyi Guo

Head of Data Science @ UW. This is my notepad for thoughts on stats, data science and ML.