In statistical analysis, R (correlation coefficient) and R² (coefficient of determination) are two related but distinct measures that help us understand relationships between variables. While they’re mathematically connected, they serve different purposes and provide different insights into our data.
Quick Definitions:
R (Correlation Coefficient): Measures the strength and direction of a linear relationship between two variables. Ranges from -1 to +1.
R² (Coefficient of Determination): Represents the proportion of variance in the dependent variable explained by the independent variable(s). Ranges from 0 to 1.
Understanding R (Correlation Coefficient)
The correlation coefficient (R) tells us about both the strength and direction of a linear relationship. Its key properties include:
- Range: -1 to +1
- Sign indicates direction (positive or negative relationship)
- Absolute value indicates strength
- Scale-independent (unitless measure)
Formula for R:
\[ R = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2 \sum(y - \bar{y})^2}} \]
Where:
- \(x\) and \(y\) are the variables
- \(\bar{x}\) and \(\bar{y}\) are their respective means
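The formula above translates directly into code. Here is a minimal pure-Python sketch (the function name `pearson_r` is our own choice, not a library API):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Numerator: sum of products of deviations from the means
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# A strongly positive linear relationship yields R close to +1
print(round(pearson_r([2, 3, 5, 7, 8], [65, 70, 80, 85, 90]), 3))
```

Because R is scale-independent, rescaling either variable (e.g., converting hours to minutes) leaves the result unchanged.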
Understanding R² (Coefficient of Determination)
R² represents the proportion of variance in the dependent variable that’s predictable from the independent variable(s). Key properties include:
- Range: 0 to 1 (or 0% to 100%)
- Always positive
- Never decreases when predictors are added (adjusted R² corrects for this)
- Represents explained variation
Formula for R²:
\[ R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2} \]
Where:
- \(y_i\) are the actual values
- \(\hat{y}_i\) are the predicted values
- \(\bar{y}\) is the mean of actual values
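The residual form of R² is equally direct to implement. A minimal sketch (the function name `r_squared` is our own choice):

```python
def r_squared(y, y_hat):
    """Coefficient of determination from actual and predicted values."""
    y_bar = sum(y) / len(y)
    # Unexplained variation: squared residuals of the predictions
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    # Total variation: squared deviations from the mean
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Perfect predictions give R² = 1; always predicting the mean gives R² = 0
print(r_squared([65, 70, 80, 85, 90], [65, 70, 80, 85, 90]))  # 1.0
```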
Practical Example: Calculating R and R²
Example: Study Hours vs. Test Scores
Let’s analyze the relationship between study hours (x) and test scores (y) for five students:
| Study Hours (x) | Test Score (y) |
|---|---|
| 2 | 65 |
| 3 | 70 |
| 5 | 80 |
| 7 | 85 |
| 8 | 90 |
Step 1: Calculate Means
\[ \bar{x} = 5 \text{ hours} \]
\[ \bar{y} = 78 \text{ points} \]
Step 2: Calculate Deviations and Products
| x - x̄ | y - ȳ | (x - x̄)(y - ȳ) | (x - x̄)² | (y - ȳ)² |
|---|---|---|---|---|
| -3 | -13 | 39 | 9 | 169 |
| -2 | -8 | 16 | 4 | 64 |
| 0 | 2 | 0 | 0 | 4 |
| 2 | 7 | 14 | 4 | 49 |
| 3 | 12 | 36 | 9 | 144 |
| ∑ = 0 | ∑ = 0 | ∑ = 105 | ∑ = 26 | ∑ = 430 |
Step 3: Calculate R
Using the Pearson correlation formula:
\[ R = \frac{105}{\sqrt{(26)(430)}} = \frac{105}{\sqrt{11{,}180}} \approx 0.993 \]
Step 4: Calculate R²
\[ R^2 = (0.993)^2 \approx 0.986 \]
Interpretation:
- R = 0.993 indicates an extremely strong positive correlation between study hours and test scores
- R² = 0.986 means that 98.6% of the variance in test scores can be explained by study hours
- The remaining 1.4% of variance might be due to other factors like sleep quality, prior knowledge, or test-taking skills
Key Differences Between R and R²
| R (Correlation Coefficient) | R² (Coefficient of Determination) |
|---|---|
| Ranges from -1 to +1 | Ranges from 0 to 1 |
| Shows direction of relationship | Direction-neutral |
| Measures strength and direction of linear relationship | Measures proportion of explained variance |
| Used primarily for correlation analysis | Used primarily in regression analysis |
Extended Example: Multiple Linear Regression
Example: Predicting Test Scores with Multiple Factors
Let’s expand our analysis to include both study hours and previous test scores as predictors of final exam performance. This example will demonstrate how R² works with multiple predictors and why we need adjusted R².
Consider data from eight students:
| Study Hours (x₁) | Previous Test Score (x₂) | Final Exam Score (y) |
|---|---|---|
| 2 | 72 | 65 |
| 3 | 75 | 70 |
| 5 | 80 | 80 |
| 7 | 85 | 85 |
| 8 | 88 | 90 |
| 4 | 78 | 75 |
| 6 | 82 | 82 |
| 5 | 79 | 78 |
Step 1: Calculate Individual Correlations
First, let’s examine how each predictor correlates with the final exam score:
- Correlation between study hours and final exam score (r₁): 0.992
- Correlation between previous test score and final exam score (r₂): 0.992
- Correlation between study hours and previous test score (r₁₂): 0.995
Step 2: Multiple Regression Equation
Using matrix algebra (the normal equations) to solve for the coefficients, our regression equation is:
\[ \hat{y} = 14.60 + 2.33x_1 + 0.65x_2 \]
Where:
- \(x_1\) is study hours
- \(x_2\) is previous test score
Step 3: Calculate Multiple R²
For multiple regression, R² is calculated as:
\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]
Where:
- \(SS_{res}\) is the sum of squared residuals
- \(SS_{tot}\) is the total sum of squares
For our data:
| Actual (y) | Predicted (ŷ) | Residual (y - ŷ) | Squared Residual |
|---|---|---|---|
| 65 | 66.02 | -1.02 | 1.041 |
| 70 | 70.30 | -0.30 | 0.089 |
| 80 | 78.21 | 1.79 | 3.218 |
| 85 | 86.11 | -1.11 | 1.240 |
| 90 | 90.39 | -0.39 | 0.153 |
| 75 | 74.58 | 0.42 | 0.179 |
| 82 | 81.84 | 0.16 | 0.027 |
| 78 | 77.56 | 0.44 | 0.197 |
This gives \(SS_{res} \approx 6.14\) and \(SS_{tot} = 454.875\), so \(R^2 = 1 - 6.14/454.875 \approx 0.9865\).
Step 4: Calculate Adjusted R²
\[ R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p-1} = 1 - (1 - 0.9865)\frac{7}{5} \approx 0.981 \]
Where:
- n = 8 (number of observations)
- p = 2 (number of predictors)
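Steps 2 through 4 can be reproduced end-to-end by solving the two-predictor normal equations directly; a pure-Python sketch using Cramer's rule on the centered sums of squares:

```python
x1 = [2, 3, 5, 7, 8, 4, 6, 5]          # study hours
x2 = [72, 75, 80, 85, 88, 78, 82, 79]  # previous test score
y = [65, 70, 80, 85, 90, 75, 82, 78]   # final exam score
n, p = len(y), 2
mx1, mx2, my = sum(x1) / n, sum(x2) / n, sum(y) / n

def cross(a, b, ma, mb):
    """Centered cross-product sum: sum of (a_i - ma)(b_i - mb)."""
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

s11, s22 = cross(x1, x1, mx1, mx1), cross(x2, x2, mx2, mx2)
s12 = cross(x1, x2, mx1, mx2)
s1y, s2y = cross(x1, y, mx1, my), cross(x2, y, mx2, my)
syy = cross(y, y, my, my)

# Solve [[s11, s12], [s12, s22]] @ [b1, b2] = [s1y, s2y] via Cramer's rule
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = my - b1 * mx1 - b2 * mx2

ss_res = syy - b1 * s1y - b2 * s2y             # residual sum of squares
r2 = 1 - ss_res / syy                          # multiple R²
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R²

print(round(b0, 2), round(b1, 2), round(b2, 2))  # intercept and slopes
print(round(r2, 3), round(r2_adj, 3))            # R² and adjusted R²
```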
Key Insights from Multiple Regression
- The multiple R² of about 0.986 is slightly higher than the 0.985 we get from study hours alone on these eight students, so adding previous test score improves the in-sample fit only marginally.
- The adjusted R² (about 0.981) is lower than the multiple R², reflecting the penalty for adding a second predictor to a small sample.
- The coefficients tell us that, holding the other variable constant:
- Each additional study hour is associated with a 2.33-point increase in final exam score
- Each point increase in previous test score is associated with a 0.65-point increase in final exam score
- The high correlation between predictors (0.995) indicates multicollinearity, which makes the individual coefficient estimates less reliable.
Practical Implications:
This multiple regression example demonstrates several important concepts:
- Adding predictors can never decrease R², but the gain here is small: from about 0.985 with study hours alone to about 0.986 with both predictors.
- The gap between R² and adjusted R² is the price of the extra predictor; when adjusted R² fails to improve, the added variable may not be pulling its weight.
- Even with very high R² values, we should consider practical significance and potential overfitting, especially with small sample sizes.
- Multicollinearity between predictors can complicate interpretation of individual effects while still maintaining high overall predictive power.
When to Use Each Measure
Use R when you want to:
- Determine if there’s a positive or negative relationship
- Measure the strength of a linear relationship
- Compare relationships between different pairs of variables
Use R² when you want to:
- Explain how much variance is accounted for by your model
- Assess the goodness of fit of a regression model
- Compare the explanatory power of different models
Quick Calculation Tool
Working with real-world datasets often involves more complex calculations than our example above. To help you quickly and accurately compute R², you can use our Coefficient of Determination (R²) Calculator.
Further Reading
- R² Calculator: Access our comprehensive calculator for quick and accurate computation of R and R², complete with visualizations and step-by-step explanations.
- Linear Regression Calculator: Explore the broader context of regression analysis with our complete linear regression calculator.
Attribution and Citation
If you found this guide and tools helpful, feel free to link back to this page or cite it in your work!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.