How to Add Regression Line Equation and R-Squared on Graph using R

by | Programming, R, Tips

This tutorial will go through adding the regression line equation and R-squared to a plot in R with code examples.


What is the Regression Equation?

Linear regression is the statistical method of finding the relationship between two variables by fitting a linear equation to observed data.

One of the two variables is considered the explanatory variable, and the other is the response variable. A linear regression line has an equation called the regression equation, which takes the form Y = a +bX, where X is the explanatory variable and Y is the dependent variable. The gradient of the line is b, and a is the intercept (the value of y when x = 0)

What is the R-Squared Value?

When we fit a linear regression model to data, we need a value to tell us how well the model fits to the data, and the R-square value does this for us.

Formally we can define R-squared as the percentage of the response variable variation explained by the linear model.

R-squared is always between 0 and 1 or 0% and 100% where:

  • 0 indicates that the model explains none of the variability of the response data around its mean.
  • 1 indicates that the model explains all of the variability of the response data around its mean.

Generally, we can say that the higher the R-squared value, the better the linear regression model fits to the data. However, not all low R-squared values are intrinsically bad and not all R-squared values are intrinsically good.

Example: Using ggpubr

Let’s look at an example of fitting a linear regression model to some data and obtaining the regression equation and R-squared.

Create Data

We will use the built-in mtcars dataset. We can look at the available features in the dataset using the head() function:

head(mtcars)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We can see there are 11 features. We will choose miles-per-gallon (mpg) and weight (wt) as we are interested in the relationship between fuel efficiency and weight. We can see the values for each feature using the dollar-sign operator:

mtcars$wt
[1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070 3.730 3.780
[15] 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840 3.845 1.935 2.140 1.513
[29] 3.170 2.770 3.570 2.780
mtcars$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
[17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

We can see if there is a linear relationship between the two variables by plotting as follows:

install.packages("ggplot2")
library(ggplot2)

ggplot(data=mtcars, aes(x=wt, y=mpg)) + 
geom_point()

If you do not have ggplot2 installed, you must use the install.packages("ggplot2") command. Otherwise, you can omit it.

mpg vs wt from mtcars dataset
mpg vs wt from mtcars dataset

We can see that there is a linear relationship between mpg and wt.

Plot Data and Add Regression Equation

Next, we will install and load ggpubr to use the stat_regline_equation() function:

install.packages("ggpubr")
library(ggpubr)

Then we create the plot with the regression line and the regression equation as follows:

ggplot(data=mtcars, aes(x=wt, y=mpg)) + 
geom_smooth(method="lm") + 
geom_point() + 
stat_regline_equation(label.x=4, label.y=30)

geom_smooth adds a line of best fit using linear regression and confidence bands in grey. stat_regline_equation adds a regression line to the plot.

The parameters label.x and label.y specify the x and y coordinates for the regression equation on the plot.

Let’s run the code to see the result:

mpg against wt from the mtcars dataset with regression equation
mpg against wt from the mtcars dataset with the regression equation

The fitted regression equation is

y = 37 - 5.3 * (x)

Where y is mpg and x is wt.

Plot Data and Add Regression Equation and R-Squared

We can add the R-squared value using the stat_cor() function as follows:

library(ggplot2)
library(ggpubr)

ggplot(data=mtcars, aes(x=wt, y=mpg)) + 
geom_smooth(method="lm") + 
geom_point() + 
stat_regline_equation(label.x=4, label.y=30) + 
stat_cor(aes(label=..rr.label..), label.x=4, label.y=28)
mpg against wt from the mtcars dataset with regression equation and R-squared
mpg against wt from the mtcars dataset with the regression equation and R-squared

The R-squared value for this model is 0.75.

We can also find the parameters of the regression equation by using the lm() function as follows:

fit <- lm(mpg ~ wt, data = mtcars)

The tilde symbol ~ means “explained by”, which tells the lm() function that mpg is the response variable and wt is the explanatory variable. We can get the coefficients and R-Squared using summary() as follows:

summary(fit)
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

We can see the estimated intercept is 37.3, and the gradient is -5.3, matching what we saw on the plot. The Multiple R-squared value is 0.75, which matches what we saw on the plot.

Summary

Congratulations on reading to the end of this tutorial!

For further reading on plotting in R, go to the articles:

Go to the online courses page on R to learn more about coding in R for data science and machine learning.

Have fun and happy researching!

Research Scientist at Moogsoft | + posts

Suf is a research scientist at Moogsoft, specializing in Natural Language Processing and Complex Networks. Previously he was a Postdoctoral Research Fellow in Data Science working on adaptations of cutting-edge physics analysis techniques to data-intensive problems in industry. In another life, he was an experimental particle physicist working on the ATLAS Experiment of the Large Hadron Collider. His passion is to share his experience as an academic moving into industry while continuing to pursue research. Find out more about the creator of the Research Scientist Pod here and sign up to the mailing list here!

Follow the Research Scientist Pod on Social media!