Understanding Correlation Coefficients
Correlation coefficients measure the strength and direction of the relationship between two variables. The two most commonly used coefficients are:
- Pearson Correlation Coefficient (\( r \)): Measures the linear correlation between two continuous variables.
- Spearman's Rank Correlation Coefficient (\( r_s \)): Measures the strength and direction of the monotonic relationship between two variables using ranked data, making it suitable for non-linear relationships.
Pearson Correlation Formula
The Pearson correlation coefficient (\( r \)) is calculated as:
\( r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}} \)
where:
- \( x_i \) and \( y_i \) are individual data points for X and Y.
- \( \bar{x} \) and \( \bar{y} \) are the mean values of X and Y.
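As a minimal sketch, this formula translates directly into Python; the helper name pearson_r below is purely illustrative:
```python
import math

def pearson_r(x, y):
    """Pearson's r computed directly from the formula above."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of the products of deviations from the means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    den = math.sqrt(
        sum((xi - mean_x) ** 2 for xi in x) * sum((yi - mean_y) ** 2 for yi in y)
    )
    return num / den

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0 for a perfectly linear relationship
```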
Spearman's Rank Correlation Formula
The Spearman's rank correlation coefficient (\( r_s \)) is calculated as:
\( r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \)
where:
- \( d_i \) is the difference in ranks for each pair of data points.
- \( n \) is the number of pairs.
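A minimal sketch of this formula in Python, assuming the data have already been converted to ranks and there are no ties (the name spearman_rs is illustrative):
```python
def spearman_rs(rank_x, rank_y):
    """Spearman's r_s from rank differences (valid when there are no tied ranks)."""
    n = len(rank_x)
    # Sum of squared rank differences d_i^2
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

print(round(spearman_rs([1, 2, 3, 4], [4, 3, 2, 1]), 4))  # -1.0 for perfectly reversed ranks
```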
What Does Monotonic Mean?
A relationship is monotonic if, as one variable increases, the other variable either only increases or only decreases. A monotonic relationship captures consistent directionality but not necessarily a constant rate.
For example, a monotonic relationship can be positive, where higher values of X correspond to higher values of Y, or negative, where higher values of X correspond to lower values of Y. Spearman’s correlation captures monotonic relationships, even if they are non-linear, making it useful for variables without a strict linear relationship.
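For instance, \( y = x^3 \) is monotonic but clearly non-linear. A quick check with SciPy (assuming it is available) illustrates the difference: Spearman reports a perfect monotonic relationship, while Pearson does not reach 1:
```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]  # monotonic but non-linear: 1, 8, 27, 64, 125

r, _ = stats.pearsonr(x, y)      # Pearson on the raw values
rho, _ = stats.spearmanr(x, y)   # Spearman on the ranks
print(round(r, 2), rho)          # about 0.94 and exactly 1.0
```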
Why Use These Methods?
Pearson correlation is best for linear relationships between two continuous variables, assuming the data are approximately normally distributed and contain few outliers. It provides an accurate measure of linear association.
Spearman's rank correlation is useful when the data do not meet parametric assumptions or when the relationship between variables is monotonic but non-linear. It reduces the influence of outliers by ranking the data and is suitable for ordinal data.
Example Calculation
Suppose we have two sets of data:
- X values: 10, 34, 23, 54, 9
- Y values: 4, 5, 11, 15, 20
Pearson Correlation Calculation
Given Data:
X values: 10, 34, 23, 54, 9
Y values: 4, 5, 11, 15, 20
Step 1: Calculate Means
Mean of \( X \) (\( \bar{x} \)) = \( \frac{10 + 34 + 23 + 54 + 9}{5} = 26 \)
Mean of \( Y \) (\( \bar{y} \)) = \( \frac{4 + 5 + 11 + 15 + 20}{5} = 11 \)
Step 2: Calculate Deviations and Products
X | Y | (X - \( \bar{x} \)) | (Y - \( \bar{y} \)) | (X - \( \bar{x} \))(Y - \( \bar{y} \)) | (X - \( \bar{x} \))² | (Y - \( \bar{y} \))² |
---|---|---|---|---|---|---|
10 | 4 | -16 | -7 | 112 | 256 | 49 |
34 | 5 | 8 | -6 | -48 | 64 | 36 |
23 | 11 | -3 | 0 | 0 | 9 | 0 |
54 | 15 | 28 | 4 | 112 | 784 | 16 |
9 | 20 | -17 | 9 | -153 | 289 | 81 |
Sums: | | | | 23 | 1402 | 182 |
Step 3: Apply Pearson Correlation Formula
\( r = \frac{\sum (X - \bar{x})(Y - \bar{y})}{\sqrt{\sum (X - \bar{x})^2 \cdot \sum (Y - \bar{y})^2}} \)
\( r = \frac{23}{\sqrt{1402 \times 182}} \)
\( r = \frac{23}{\sqrt{255,164}} \)
\( r = \frac{23}{505.14} \)
\( r \approx 0.0455 \)
Conclusion:
The Pearson correlation coefficient of \( r \approx 0.0455 \) indicates a very weak positive correlation between the X and Y variables.
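This hand calculation can be double-checked with SciPy's pearsonr (a quick sketch, assuming SciPy is installed):
```python
from scipy import stats

x = [10, 34, 23, 54, 9]
y = [4, 5, 11, 15, 20]

r, p_value = stats.pearsonr(x, y)
print(round(r, 4))  # 0.0455
```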
Spearman's Rank Correlation Calculation
Given Ranks
i | \(x_i\) (rank) | \(y_i\) (rank) |
---|---|---|
1 | 2 | 1 |
2 | 4 | 2 |
3 | 3 | 3 |
4 | 5 | 4 |
5 | 1 | 5 |
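These ranks come from ordering each variable from smallest to largest, so the smallest value gets rank 1. One way to reproduce them, assuming SciPy is available, is scipy.stats.rankdata:
```python
from scipy import stats

x = [10, 34, 23, 54, 9]
y = [4, 5, 11, 15, 20]

print(stats.rankdata(x))  # [2. 4. 3. 5. 1.]
print(stats.rankdata(y))  # [1. 2. 3. 4. 5.]
```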
Step 1: Calculate Means
\( \bar{x} = \frac{2 + 4 + 3 + 5 + 1}{5} = 3 \)
\( \bar{y} = \frac{1 + 2 + 3 + 4 + 5}{5} = 3 \)
Step 2: Calculate Deviations and Products
i | \((x_i - \bar{x})\) | \((y_i - \bar{y})\) | \((x_i - \bar{x})^2\) | \((y_i - \bar{y})^2\) | \((x_i - \bar{x})(y_i - \bar{y})\) |
---|---|---|---|---|---|
1 | -1 | -2 | 1 | 4 | 2 |
2 | 1 | -1 | 1 | 1 | -1 |
3 | 0 | 0 | 0 | 0 | 0 |
4 | 2 | 1 | 4 | 1 | 2 |
5 | -2 | 2 | 4 | 4 | -4 |
Sums: | | | 10 | 10 | -1 |
Step 3: Apply Formula
Because the data are now ranks (with no ties), applying the Pearson formula to the ranks gives Spearman's \( r_s \):
\[ r_s = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}} \]
\[ r_s = \frac{-1}{\sqrt{10 \times 10}} = \frac{-1}{10} = -0.1 \]
Alternative Method Using Covariance and Standard Deviations
We can also express this using the sample covariance and standard deviations of the ranks:
\[ S_{XY} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{-1}{5-1} = -0.25 \]
\[ S_X = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{10}{4}} \approx 1.5811 \]
\[ S_Y = \sqrt{\frac{\sum(y_i - \bar{y})^2}{n-1}} = \sqrt{\frac{10}{4}} \approx 1.5811 \]
\[ r_s = \frac{S_{XY}}{S_X S_Y} = \frac{-0.25}{1.5811 \times 1.5811} \approx -0.1 \]
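The same result can be obtained directly from the raw data with SciPy's spearmanr, which ranks the values internally (a sketch assuming SciPy is installed):
```python
from scipy import stats

x = [10, 34, 23, 54, 9]
y = [4, 5, 11, 15, 20]

rho, p_value = stats.spearmanr(x, y)
print(round(rho, 4))  # -0.1
```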
Interpreting Correlation Coefficients
Both Pearson and Spearman correlation coefficients range from -1 to +1, where:
- +1 indicates a perfect positive correlation
- -1 indicates a perfect negative correlation
- 0 indicates no correlation
General Guidelines for Interpretation of Correlation Coefficients
Coefficient Value | Strength | Direction |
---|---|---|
0.90 to 1.00 | Very strong | Positive |
0.70 to 0.89 | Strong | Positive |
0.50 to 0.69 | Moderate | Positive |
0.30 to 0.49 | Weak | Positive |
0.00 to 0.29 | Very weak | Positive |
-0.29 to 0.00 | Very weak | Negative |
-0.49 to -0.30 | Weak | Negative |
-0.69 to -0.50 | Moderate | Negative |
-0.89 to -0.70 | Strong | Negative |
-1.00 to -0.90 | Very strong | Negative |
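As a small illustrative helper (the function name describe_correlation and the cut-offs simply mirror the table above), these guidelines can be expressed in code:
```python
def describe_correlation(coef):
    """Map a correlation coefficient to the strength and direction labels above."""
    direction = "positive" if coef >= 0 else "negative"
    magnitude = abs(coef)
    if magnitude >= 0.90:
        strength = "Very strong"
    elif magnitude >= 0.70:
        strength = "Strong"
    elif magnitude >= 0.50:
        strength = "Moderate"
    elif magnitude >= 0.30:
        strength = "Weak"
    else:
        strength = "Very weak"
    return f"{strength} {direction}"

print(describe_correlation(0.0455))  # Very weak positive
print(describe_correlation(-0.1))    # Very weak negative
```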
Key Differences in Interpretation
Pearson's Correlation (\( r \))
- Measures the strength and direction of the linear relationship
- Uses the actual values of the variables
- Best for continuous, normally distributed data
- Sensitive to outliers and misleading for non-linear relationships
Spearman's Correlation (\( \rho \) or \( r_s \))
- Measures the strength and direction of the monotonic relationship
- Uses the ranks of the variables
- Suitable for ordinal data and non-normally distributed variables
- More robust to outliers and able to capture non-linear (but monotonic) relationships
Important Considerations
- Statistical significance should be considered alongside the correlation coefficient (see the sketch after this list)
- Context of the data and field of study may affect interpretation of strength
- Visual inspection of the data (e.g., scatterplots) should accompany correlation analysis
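For example, SciPy's correlation functions return a p-value alongside the coefficient, which helps judge whether the observed correlation could plausibly have arisen by chance (a sketch using the example data above):
```python
from scipy import stats

x = [10, 34, 23, 54, 9]
y = [4, 5, 11, 15, 20]

r, p = stats.pearsonr(x, y)
print(f"r = {r:.4f}, p = {p:.4f}")  # the large p-value shows no evidence of a linear relationship
```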
Limitations of Correlation Coefficients
While correlation coefficients are useful, they have limitations:
- Not Causal: Correlation does not imply causation.
- Sensitive to Outliers (for Pearson): Outliers can skew Pearson’s correlation.
- Linear Relationships (for Pearson): Pearson only captures linear relationships.
- Ordinal Data (for Spearman): Spearman's is suitable for ordinal data but does not capture non-monotonic relationships.