This error occurs when you attempt to fit a model and two or more predictor variables are perfectly correlated.
You can solve this error by using the cor() function to identify the variables with a perfect correlation and dropping one of them from the model.
This tutorial will go through the error in detail and how to solve it with code examples.
Example
Let’s look at an example of fitting a linear regression model using a data frame. First, we will define a data frame containing the weight in kilograms and height in metres and centimetres of 10 subjects.
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)
# Print summary statistics for each column
summary(df)
The summary of the data frame is as follows:
     weight          height_m       height_cm
 Min.   : 47.00   Min.   :1.300   Min.   :130.0
 1st Qu.: 60.75   1st Qu.:1.540   1st Qu.:154.0
 Median : 74.50   Median :1.700   Median :170.0
 Mean   : 75.20   Mean   :1.685   Mean   :168.5
 3rd Qu.: 91.25   3rd Qu.:1.817   3rd Qu.:181.8
 Max.   :102.00   Max.   :2.000   Max.   :200.0
Next, we will fit a linear regression model on the data and print the model summary to the console:
# Fit weight against both height columns
model <- lm(weight ~ height_m + height_cm, data = df)
summary(model)
Call:
lm(formula = weight ~ height_m + height_cm, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-21.6525  -5.4040  -3.3657   0.4445  29.2511

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -29.10      40.15  -0.725   0.4893
height_m       61.90      23.67   2.616   0.0309 *
height_cm         NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.81 on 8 degrees of freedom
Multiple R-squared:  0.461,  Adjusted R-squared:  0.3936
F-statistic: 6.841 on 1 and 8 DF,  p-value: 0.03086
Note that after the residuals, the header of the coefficients table contains the message:
Coefficients: (1 not defined because of singularities)
The error occurred because the two predictor variables height_m and height_cm are perfectly correlated.
Perfectly correlated variables do not provide unique information in the regression model. It is not possible to vary the predictor variable height_m to see the effect on the response variable weight without the predictor variable height_cm also moving.
Therefore, it is impossible to estimate values for every coefficient in the regression model, which we can see from the NA values in the coefficient row for height_cm.
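As a minimal sketch of how the dropped coefficient could be found without reading the printed summary, the snippet below rebuilds the same data frame and model, lists the coefficients that came back as NA, and then calls the base R function alias() to show how the dropped term depends on the terms that were kept. The variable names na_coefs and dependencies are only illustrative.

```r
# Same data as in the example above
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)
model <- lm(weight ~ height_m + height_cm, data = df)

# Names of the coefficients that lm() could not estimate
na_coefs <- names(coef(model))[is.na(coef(model))]
print(na_coefs)

# alias() expresses each dropped term as a linear combination
# of the terms that stayed in the model
dependencies <- alias(model)
print(dependencies$Complete)
```

The Complete component has one row per dropped term, so it can also reveal dependencies that involve more than two predictors.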
The values for height_cm are the values for height_m multiplied by 100. A predictor variable that is a multiple of another is an example of perfect multicollinearity, which means there is an exact linear relationship between the two variables.
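The exact linear relationship can be checked directly. The sketch below, which assumes the same data frame as the example, confirms that height_cm equals 100 times height_m up to floating-point tolerance and then compares the number of columns in the design matrix with its numerical rank. The names scaled_ok, X, n_columns and matrix_rank are illustrative only.

```r
# Same data as in the example above
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)

# Is height_cm equal to 100 * height_m, allowing for rounding error?
scaled_ok <- isTRUE(all.equal(df$height_cm, 100 * df$height_m))
print(scaled_ok)

# Design matrix used by lm(): intercept, height_m, height_cm
X <- model.matrix(weight ~ height_m + height_cm, data = df)
n_columns <- ncol(X)

# Numerical rank from the QR decomposition
matrix_rank <- qr(X)$rank

print(c(columns = n_columns, rank = matrix_rank))
```

A rank lower than the number of columns means at least one column is a linear combination of the others, which is the situation that triggers the singularities message.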
Solution
The first step of solving the error involves calling the cor() function to get a correlation matrix and examining which pairs of variables have a correlation of exactly 1 (or -1) with each other.
cor(df)
             weight  height_m height_cm
weight    1.0000000 0.6789428 0.6789428
height_m  0.6789428 1.0000000 1.0000000
height_cm 0.6789428 1.0000000 1.0000000
We can see that the variables height_m and height_cm are perfectly correlated.
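With many columns, reading the correlation matrix by eye becomes tedious. The following is a rough sketch of one way to list the perfectly correlated pairs programmatically. The tolerance of 1e-8 and the names cm, idx and perfect_pairs are assumptions chosen for illustration, not part of the original example.

```r
# Same data as in the example above
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)

cm <- cor(df)

# Blank out the diagonal and upper triangle so each pair appears once
cm[upper.tri(cm, diag = TRUE)] <- NA

# Positions where the absolute correlation is 1 within a small tolerance
# (the tolerance is an assumed value to absorb floating-point error)
idx <- which(abs(cm) > 1 - 1e-8, arr.ind = TRUE)

perfect_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]]
)
print(perfect_pairs)
```

Note that a pairwise scan like this only finds dependencies between two variables. A predictor that is an exact combination of several others would not necessarily show a correlation of 1 with any single one of them.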
Next, we can drop either of the two variables from the model. Let’s drop height_cm and fit the linear regression model.
# Refit using only one of the two height columns
model <- lm(weight ~ height_m, data = df)
summary(model)
The summary of the new model is as follows:
Call:
lm(formula = weight ~ height_m, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-21.6525  -5.4040  -3.3657   0.4445  29.2511

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -29.10      40.15  -0.725   0.4893
height_m       61.90      23.67   2.616   0.0309 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.81 on 8 degrees of freedom
Multiple R-squared:  0.461,  Adjusted R-squared:  0.3936
F-statistic: 6.841 on 1 and 8 DF,  p-value: 0.03086
Note that the “not defined because of singularities” message is gone, and every coefficient in the model, including height_m, now has an estimate.
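Since either of the two variables can be dropped, it may be reassuring to check that the choice does not change the fit. The sketch below, using the same data and the illustrative names model_m, model_cm, same_fit and slope_ratio, fits one model per height column and compares the fitted values and the slopes.

```r
# Same data as in the example above
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)

# One model per height column
model_m  <- lm(weight ~ height_m,  data = df)
model_cm <- lm(weight ~ height_cm, data = df)

# Fitted values should agree up to floating-point tolerance
same_fit <- isTRUE(all.equal(fitted(model_m), fitted(model_cm)))
print(same_fit)

# The slopes should differ only by the unit conversion factor
slope_ratio <- unname(coef(model_m)["height_m"] / coef(model_cm)["height_cm"])
print(slope_ratio)
```

Only the scale of the slope changes with the units, so the variable to keep can be chosen on the basis of which unit is more convenient to interpret.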
Summary
Congratulations on reading to the end of this tutorial!
For further reading on R-related errors, go to the articles:
- How to Count the Number of NA in R
- How to Solve R Error: non-conformable arguments
- How to Solve R Error in n(): Must be used inside dplyr verbs
Go to the online courses page on R to learn more about coding in R for data science and machine learning.
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.