How to Solve R Error: not defined because of singularities

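This error occurs when you attempt to fit a model and two or more predictor variables are perfectly correlated, meaning one predictor is an exact linear function of another. R still fits the model, but it reports NA for the coefficient it cannot estimate.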

You can solve this error by using the cor() function to identify the variables with a perfect correlation and drop one of the variables from the model.

This tutorial will go through the error in detail and show how to solve it with code examples.


Example

Let’s look at an example of fitting a linear regression model using a data frame. First, we will define a data frame containing the weight in kilograms and height in metres and centimetres of 10 subjects.

df <- data.frame(weight=c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
                 height_m =c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
                 height_cm=c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166))
summary(df)

    weight          height_m       height_cm    
 Min.   : 47.00   Min.   :1.300   Min.   :130.0  
 1st Qu.: 60.75   1st Qu.:1.540   1st Qu.:154.0  
 Median : 74.50   Median :1.700   Median :170.0  
 Mean   : 75.20   Mean   :1.685   Mean   :168.5  
 3rd Qu.: 91.25   3rd Qu.:1.817   3rd Qu.:181.8  
 Max.   :102.00   Max.   :2.000   Max.   :200.0  

Next, we will fit a linear regression model on the data and print the model summary to the console:

model <- lm(weight~height_m+height_cm, data=df)
summary(model)

Call:
lm(formula = weight ~ height_m + height_cm, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.6525  -5.4040  -3.3657   0.4445  29.2511 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -29.10      40.15  -0.725   0.4893  
height_m       61.90      23.67   2.616   0.0309 *
height_cm         NA         NA      NA       NA  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.81 on 8 degrees of freedom
Multiple R-squared:  0.461,	Adjusted R-squared:  0.3936 
F-statistic: 6.841 on 1 and 8 DF,  p-value: 0.03086

Note that the heading of the coefficients table, which follows the residuals, carries the message:

Coefficients: (1 not defined because of singularities)

The error occurred because the two predictor variables height_m and height_cm are perfectly correlated.

Perfectly correlated variables do not provide unique information in the regression model. It is not possible to vary the predictor variable height_m to see the effect on the response variable weight without the predictor variable height_cm also moving.

Therefore, it is impossible to estimate values for every coefficient in the regression model, which we can see with the NA values for the coefficient estimate of height_cm.

The values for height_cm are the values for height_m multiplied by 100. A predictor variable that is a multiple of another is an example of perfect multicollinearity, which means there is an exact linear relationship between the two variables.
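The exact linear relationship described above can be checked directly. The following is a minimal sketch, not part of the original tutorial: it rebuilds the same data frame, confirms that height_cm equals height_m times 100, and compares the rank of the design matrix with its number of columns. The variable names is_multiple, design_rank and dropped are illustrative only.

```r
# Rebuild the tutorial's data frame so the snippet runs on its own
df <- data.frame(weight=c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
                 height_m =c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
                 height_cm=c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166))

# height_cm is height_m times 100, up to floating-point rounding
is_multiple <- isTRUE(all.equal(df$height_cm, df$height_m * 100))

# Design matrix used by lm(): intercept, height_m, height_cm
X <- model.matrix(weight ~ height_m + height_cm, data = df)
n_columns <- ncol(X)
design_rank <- qr(X)$rank   # numerical rank of the design matrix

# Coefficients that lm() could not estimate are returned as NA
model <- lm(weight ~ height_m + height_cm, data = df)
dropped <- names(coef(model))[is.na(coef(model))]

print(is_multiple)
print(c(columns = n_columns, rank = design_rank))
print(dropped)
```

A rank lower than the number of columns means at least one column is redundant, which is why one coefficient comes back as NA.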

Solution

The first step of solving the error involves calling the cor() function to get a correlation matrix and examining which predictor variables have a correlation of exactly 1 (or -1) with each other.

cor(df)
            weight  height_m height_cm
weight    1.0000000 0.6789428 0.6789428
height_m  0.6789428 1.0000000 1.0000000
height_cm 0.6789428 1.0000000 1.0000000

We can see that the variables height_m and height_cm are perfectly correlated.
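With many predictors, reading the correlation matrix by eye becomes tedious. As a rough sketch, the perfectly correlated pairs can be pulled out programmatically. The tolerance of 1e-8 and the names predictors, hits and perfect_pairs are assumptions made for illustration. Note that cor() only reveals pairwise relationships, so a predictor that is an exact combination of several others would not necessarily show up this way.

```r
# Rebuild the tutorial's data frame so the snippet runs on its own
df <- data.frame(weight=c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
                 height_m =c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
                 height_cm=c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166))

# Correlation matrix of the predictors only (the response is excluded)
predictors <- df[, setdiff(names(df), "weight")]
cm <- cor(predictors)

# Keep the upper triangle so each pair is reported once,
# and allow a small tolerance for floating-point rounding (assumed value)
hits <- which(abs(cm) > 1 - 1e-8 & upper.tri(cm), arr.ind = TRUE)

perfect_pairs <- data.frame(var1 = rownames(cm)[hits[, "row"]],
                            var2 = colnames(cm)[hits[, "col"]])
print(perfect_pairs)
```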

Next, we can drop either of the two variables from the model. Let’s drop height_cm, fit the linear regression model, and print its summary.

model <- lm(weight~height_m, data=df)
summary(model)

Call:
lm(formula = weight ~ height_m, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.6525  -5.4040  -3.3657   0.4445  29.2511 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -29.10      40.15  -0.725   0.4893  
height_m       61.90      23.67   2.616   0.0309 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.81 on 8 degrees of freedom
Multiple R-squared:  0.461,	Adjusted R-squared:  0.3936 
F-statistic: 6.841 on 1 and 8 DF,  p-value: 0.03086

Note that the “not defined because of singularities” message is gone, and every coefficient in the model now has an estimate.
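Since either of the two variables can be dropped, it may help to see that the choice does not change the fit. The sketch below is an illustration, not part of the original tutorial. It fits one model per height variable and compares them; model_m, model_cm, same_fit and slope_ratio are hypothetical names.

```r
# Rebuild the tutorial's data frame so the snippet runs on its own
df <- data.frame(weight=c(74, 58, 96, 75, 102, 86, 47, 93, 69, 52),
                 height_m =c(1.7, 1.5, 2.0, 1.75, 1.84, 1.9, 1.3, 1.5, 1.7, 1.66),
                 height_cm=c(170, 150, 200, 175, 184, 190, 130, 150, 170, 166))

# One model per height variable
model_m  <- lm(weight ~ height_m,  data = df)
model_cm <- lm(weight ~ height_cm, data = df)

# Both models produce the same fitted values (up to rounding)
same_fit <- isTRUE(all.equal(fitted(model_m), fitted(model_cm)))

# Only the scale of the slope differs: per metre versus per centimetre
slope_ratio <- unname(coef(model_m)["height_m"] / coef(model_cm)["height_cm"])

print(same_fit)
print(slope_ratio)
```

The slope per metre is 100 times the slope per centimetre, so the choice of which variable to keep only affects the units in which the coefficient is read.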

Summary

Congratulations on reading to the end of this tutorial!

Go to the online courses page on R to learn more about coding in R for data science and machine learning.

Have fun and happy researching!

Suf is a research scientist at Moogsoft, specializing in Natural Language Processing and Complex Networks. Previously he was a Postdoctoral Research Fellow in Data Science working on adaptations of cutting-edge physics analysis techniques to data-intensive problems in industry. In another life, he was an experimental particle physicist working on the ATLAS Experiment of the Large Hadron Collider. His passion is to share his experience as an academic moving into industry while continuing to pursue research.