This tutorial will go through adding the regression line equation and R-squared to a plot in R with code examples.
Table of contents
What is the Regression Equation?
Linear regression is the statistical method of finding the relationship between two variables by fitting a linear equation to observed data.
One of the two variables is considered the explanatory variable, and the other is the response variable. A linear regression line has an equation called the regression equation, which takes the form Y = a +bX, where X is the explanatory variable and Y is the dependent variable. The gradient of the line is b, and a is the intercept (the value of y when x = 0)
You can use our free calculator to fit a linear regression model to predictor and response values.
What is the R-Squared Value?
When we fit a linear regression model to data, we need a value to tell us how well the model fits the data, and the R-square value does this for us.
We can define R-squared as the percentage of the response variable variation explained by the linear model.
R-squared is always between 0 and 1 or 0% and 100% where:
- 0 indicates that the model explains none of the variability of the response data around its mean.
- 1 indicates that the model explains all of the variability of the response data around its mean.
Generally, we can say that the higher the R-squared value, the better the linear regression model fits to the data. However, not all low R-squared values are intrinsically bad and not all R-squared values are intrinsically good.
Example: Using ggpubr
Let’s look at an example of fitting a linear regression model to some data and obtaining the regression equation and R-squared.
Create Data
We will use the built-in mtcars
dataset. We can look at the available features in the dataset using the head()
function:
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We can see there are 11 features. We will choose miles-per-gallon (mpg
) and weight (wt
) as we are interested in the relationship between fuel efficiency and weight. We can see the values for each feature using the dollar-sign operator:
mtcars$wt
[1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070 3.730 3.780 [15] 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840 3.845 1.935 2.140 1.513 [29] 3.170 2.770 3.570 2.780
mtcars$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 [17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
We can see if there is a linear relationship between the two variables by plotting as follows:
install.packages("ggplot2") library(ggplot2) ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point()
If you do not have ggplot2
installed, you must use the install.packages("ggplot2")
command. Otherwise, you can omit it.
We can see that there is a linear relationship between mpg
and wt
.
Plot Data and Add Regression Equation
Next, we will install and load ggpubr
to use the stat_regline_equation() function:
install.packages("ggpubr") library(ggpubr)
Then we create the plot with the regression line and the regression equation as follows:
ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_smooth(method="lm") + geom_point() + stat_regline_equation(label.x=4, label.y=30)
geom_smooth adds a line of best fit using linear regression and confidence bands in grey. stat_regline_equation
adds a regression line to the plot.
The parameters label.x
and label.y
specify the x
and y
coordinates for the regression equation on the plot.
Let’s run the code to see the result:
The fitted regression equation is
y = 37 - 5.3 * (x)
Where y
is mpg
and x
is wt
.
Plot Data and Add Regression Equation and R-Squared
We can add the R-squared value using the stat_cor() function as follows:
library(ggplot2) library(ggpubr) ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_smooth(method="lm") + geom_point() + stat_regline_equation(label.x=4, label.y=30) + stat_cor(aes(label=..rr.label..), label.x=4, label.y=28)
The R-squared value for this model is 0.75.
We can also find the parameters of the regression equation by using the lm()
function as follows:
fit <- lm(mpg ~ wt, data = mtcars)
The tilde symbol ~
means “explained by”, which tells the lm()
function that mpg is the response variable and wt is the explanatory variable. We can get the coefficients and R-Squared using summary()
as follows:
summary(fit)
Call: lm(formula = mpg ~ wt, data = mtcars) Residuals: Min 1Q Median 3Q Max -4.5432 -2.3647 -0.1252 1.4096 6.8727 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 37.2851 1.8776 19.858 < 2e-16 *** wt -5.3445 0.5591 -9.559 1.29e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.046 on 30 degrees of freedom Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446 F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
We can see the estimated intercept is 37.3, and the gradient is -5.3, matching what we saw on the plot. The Multiple R-squared value is 0.75, which matches what we saw on the plot.
Summary
Congratulations on reading to the end of this tutorial!
For further reading on plotting in R, go to the articles:
- How to Place Two Plots Side by Side using ggplot2 and cowplot in R
- How to Download and Plot Stock Prices with quantmod in R
- How to Remove Outliers from Boxplot using ggplot2 in R
Go to the online courses page on R to learn more about coding in R for data science and machine learning.
Have fun and happy researching!