*This error occurs when you attempt to fit a model and two or more predictor variables are perfectly correlated.*

*You can solve this error by using the cor() function to identify the variables with a perfect correlation and drop one of the variables from the model.*

*This tutorial will go through the error in detail and show how to solve it with code examples.*

## Example

Let’s look at an example of fitting a linear regression model using a data frame. First, we will define a data frame containing the weight in kilograms and height in metres and centimetres of 10 subjects.

    df <- data.frame(
      weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
      height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
      height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
    )
    summary(df)

Calling `summary(df)` prints the following summary statistics for each column:

         weight          height_m       height_cm
     Min.   : 47.00   Min.   :1.300   Min.   :130.0
     1st Qu.: 60.75   1st Qu.:1.540   1st Qu.:154.0
     Median : 74.50   Median :1.700   Median :170.0
     Mean   : 75.20   Mean   :1.685   Mean   :168.5
     3rd Qu.: 91.25   3rd Qu.:1.817   3rd Qu.:181.8
     Max.   :102.00   Max.   :2.000   Max.   :200.0

Next, we will fit a linear regression model on the data and print the model summary to the console:

    model <- lm(weight ~ height_m + height_cm, data = df)
    summary(model)

    Call:
    lm(formula = weight ~ height_m + height_cm, data = df)

    Residuals:
         Min       1Q   Median       3Q      Max
    -21.6525  -5.4040  -3.3657   0.4445  29.2511

    Coefficients: (1 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   -29.10      40.15  -0.725   0.4893
    height_m       61.90      23.67   2.616   0.0309 *
    height_cm         NA         NA      NA       NA
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 14.81 on 8 degrees of freedom
    Multiple R-squared:  0.461,	Adjusted R-squared:  0.3936
    F-statistic: 6.841 on 1 and 8 DF,  p-value: 0.03086

Note that after the residuals and before the coefficients, there is the message:

    Coefficients: (1 not defined because of singularities)

The error occurred because the two predictor variables `height_m` and `height_cm` are perfectly correlated.

Perfectly correlated variables do not provide unique information in the regression model. It is not possible to vary the predictor variable `height_m` to see the effect on the response variable `weight` without the predictor variable `height_cm` also moving.
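As a minimal sketch of why this matters, the code below shows that several different coefficient vectors produce exactly the same predictions for this data, so no single set of coefficients can be preferred. The coefficient values are rounded from the model output above, and the names `b_all_m`, `b_all_cm` and `b_split` are hypothetical, chosen only for this illustration.

```r
# Data frame from the example above
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)

# Design matrix with columns (Intercept), height_m, height_cm
X <- model.matrix(~ height_m + height_cm, data = df)

# Illustrative coefficient vectors (rounded values, assumed for this sketch)
b_all_m  <- c(-29.10, 61.90, 0)       # whole slope placed on height_m
b_all_cm <- c(-29.10, 0, 0.619)       # whole slope placed on height_cm
b_split  <- c(-29.10, 30.95, 0.3095)  # slope shared between the two

pred_m     <- drop(X %*% b_all_m)
pred_cm    <- drop(X %*% b_all_cm)
pred_split <- drop(X %*% b_split)

# All three give the same predictions (up to floating-point tolerance)
all.equal(pred_m, pred_cm)
all.equal(pred_m, pred_split)
```

Because the predictions are indistinguishable, least squares has no basis for choosing between these coefficient vectors.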

Therefore, it is impossible to estimate values for every coefficient in the regression model, which we can see with the `NA` values for the coefficient estimate of `height_cm`.
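The `NA` estimates can also be picked up programmatically rather than by reading the printed summary. The following is a small sketch using `coef()` and `is.na()`; the variable name `dropped` is an assumption made for this example.

```r
# Data frame from the example above
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)

model <- lm(weight ~ height_m + height_cm, data = df)

# Named vector of coefficient estimates; terms that could not be estimated are NA
coefs <- coef(model)

# Names of the terms that could not be estimated
dropped <- names(coefs)[is.na(coefs)]
dropped
```

For this data, `dropped` contains `height_cm`, matching the `NA` row in the summary.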

The values for `height_cm` are the values for `height_m` multiplied by `100`. A predictor variable that is a multiple of another is an example of perfect multicollinearity, which means there is an exact linear relationship between the two variables.
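The exact linear relationship can be checked directly. The sketch below first compares `height_cm` with `100 * height_m`, then compares the number of columns in the design matrix with its numerical rank from `qr()`. A rank lower than the number of columns indicates that at least one column is a linear combination of the others. The names `is_multiple` and `X` are assumptions for this example.

```r
# Data frame from the example above
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)

# Is height_cm equal to 100 * height_m (up to floating-point tolerance)?
is_multiple <- isTRUE(all.equal(df$height_cm, 100 * df$height_m))
is_multiple

# Design matrix with columns (Intercept), height_m, height_cm
X <- model.matrix(~ height_m + height_cm, data = df)

ncol(X)     # number of coefficients the model tries to estimate
qr(X)$rank  # number of linearly independent columns
```

Here the design matrix has three columns but a rank of two, which is why one coefficient cannot be estimated.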

### Solution

The first step of solving the error involves calling the `cor()` function to get a correlation matrix and examining which variables have a correlation of exactly 1 with each other.

    cor(df)

                 weight  height_m height_cm
    weight    1.0000000 0.6789428 0.6789428
    height_m  0.6789428 1.0000000 1.0000000
    height_cm 0.6789428 1.0000000 1.0000000

We can see that the variables `height_m` and `height_cm` are perfectly correlated.
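With only a few variables, the correlation matrix is easy to read by eye. For larger data frames, a short sketch like the one below can list the perfectly correlated pairs automatically. It assumes `weight` is the response and treats an absolute correlation within `1e-8` of 1 as perfect. Both the tolerance and the names `predictors`, `hits` and `pairs` are choices made for this illustration.

```r
# Data frame from the example above
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)

# Correlation matrix of the predictors only (response column removed)
predictors <- df[, setdiff(names(df), "weight")]
cm <- cor(predictors)

# Keep the upper triangle so each pair appears once and the diagonal is skipped
hits <- which(abs(cm) > 1 - 1e-8 & upper.tri(cm), arr.ind = TRUE)

# One row per perfectly correlated pair
pairs <- data.frame(
  var1 = rownames(cm)[hits[, "row"]],
  var2 = colnames(cm)[hits[, "col"]]
)
pairs
```

For this data, `pairs` has a single row containing `height_m` and `height_cm`.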

Next, we can drop either of the two variables from the model. Let’s drop `height_cm` and fit the linear regression model.

    model <- lm(weight ~ height_m, data = df)
    summary(model)

Let’s print the summary of the model.

    Call:
    lm(formula = weight ~ height_m, data = df)

    Residuals:
         Min       1Q   Median       3Q      Max
    -21.6525  -5.4040  -3.3657   0.4445  29.2511

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   -29.10      40.15  -0.725   0.4893
    height_m       61.90      23.67   2.616   0.0309 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 14.81 on 8 degrees of freedom
    Multiple R-squared:  0.461,	Adjusted R-squared:  0.3936
    F-statistic: 6.841 on 1 and 8 DF,  p-value: 0.03086

Note that the “`not defined because of singularities`” message is gone, and every term in the model, including `height_m`, now has a coefficient estimate.
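The `cor()` check only compares variables in pairs, so it can miss a predictor that is an exact linear combination of two or more others. As a complementary sketch, the `alias()` function from the `stats` package can be applied to the original model with both height variables. It reports which terms are linearly dependent on the others. The names `full_model`, `deps` and `aliased` are assumptions for this example.

```r
# Data frame from the example above
df <- data.frame(
  weight    = c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
  height_m  = c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
  height_cm = c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166)
)

# Refit the original model that triggered the message
full_model <- lm(weight ~ height_m + height_cm, data = df)

# alias() describes linear dependencies among the model terms
deps <- alias(full_model)

# Rows are the dependent (aliased) terms; columns are the terms they depend on
deps$Complete

# Names of the terms that are candidates for removal
aliased <- rownames(deps$Complete)
aliased
```

The terms listed in `aliased` are the ones `lm()` could not estimate, so removing them from the formula should clear the message.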

## Summary

Congratulations on reading to the end of this tutorial!

For further reading on R-related errors, go to the articles:

- How to Count the Number of NA in R
- How to Solve R Error: non-conformable arguments
- How to Solve R Error in n(): Must be used inside dplyr verbs

Go to the online courses page on R to learn more about coding in R for data science and machine learning.

Have fun and happy researching!

Suf is a research scientist at Moogsoft, specializing in Natural Language Processing and Complex Networks. Previously he was a Postdoctoral Research Fellow in Data Science working on adaptations of cutting-edge physics analysis techniques to data-intensive problems in industry. In another life, he was an experimental particle physicist working on the ATLAS Experiment of the Large Hadron Collider. His passion is to share his experience as an academic moving into industry while continuing to pursue research.