How to Solve R Error in lm.fit: NA/NaN/Inf

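This error occurs when you try to fit a linear regression model in R using the lm() function but the predictor or response variable contains Not a Number (NaN) or infinity (Inf) values. The message reads NA/NaN/Inf in 'y' when the problem is in the response and NA/NaN/Inf in 'x' when it is in a predictor.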

You can solve this error by replacing the NaN and Inf values (including -Inf) with NA values, which lm() drops by default, for example:

df[is.na(df) | df == Inf | df == -Inf] <- NA

The error can also occur if the response variable is not a continuous numeric variable, for example, a column of Yes/No values. In that case, you may have your predictor and response variables the wrong way round, or you may need to fit a logistic regression model instead.

This tutorial goes through the error in detail and shows how to solve it with code examples.


Example #1: Predictor and/or Response Variable contains NaN or Inf

Consider the following data frame that contains information about the amount of ice cream sold over ten days and the temperature for each of the days in Celsius.

df <- data.frame(day=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
temperature=c(10, NA, 22, 19, 28, 15, 20, 17, 13, 30),
ice_cream_sold=c(5, NaN, 40, 38, 100, 40, Inf, 10, 30, 150))

df
   day temperature ice_cream_sold
1    1          10              5
2    2          NA            NaN
3    3          22             40
4    4          19             38
5    5          28            100
6    6          15             40
7    7          20            Inf
8    8          17             10
9    9          13             30
10  10          30            150

The data frame contains NA, NaN, and Inf values.

Let’s attempt to fit a linear regression model using temperature as the predictor variable and ice_cream_sold as the response variable.

model <- lm(ice_cream_sold ~ temperature, df)

Let’s run the code to see what happens:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'y'

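The R interpreter raises an error because the response variable ice_cream_sold contains a value that lm.fit() cannot handle. By default, lm() drops rows containing NA or NaN before fitting, so the value that actually triggers the error here is the Inf on day 7.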
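Before changing the data, it can help to see where the problematic values sit. The following is a minimal sketch that reuses the example data frame; the variable names na_counts, inf_counts, and bad_rows are only illustrative.

```r
# Ice cream data from the example above
df <- data.frame(day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 temperature = c(10, NA, 22, 19, 28, 15, 20, 17, 13, 30),
                 ice_cream_sold = c(5, NaN, 40, 38, 100, 40, Inf, 10, 30, 150))

# Count missing and infinite entries separately for each column
na_counts  <- sapply(df, function(x) sum(is.na(x)))        # is.na() is TRUE for NA and NaN
inf_counts <- sapply(df, function(x) sum(is.infinite(x)))  # is.infinite() is TRUE for Inf and -Inf

print(rbind(na_or_nan = na_counts, infinite = inf_counts))

# Row numbers where either model variable is not a finite number
bad_rows <- which(!is.finite(df$temperature) | !is.finite(df$ice_cream_sold))
print(bad_rows)
```

Here bad_rows contains rows 2 and 7, which are the two observations that need attention before the model can be fitted.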

Solution

We can solve this error by replacing the NaN and Inf values with NA. By default, lm() drops rows that contain NA before fitting the linear regression model. Let’s look at the revised code:

df[is.na(df) | df == Inf | df == -Inf] <- NA

df

Let’s look at the updated data frame:

   day temperature ice_cream_sold
1    1          10              5
2    2          NA             NA
3    3          22             40
4    4          19             38
5    5          28            100
6    6          15             40
7    7          20             NA
8    8          17             10
9    9          13             30
10  10          30            150
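The one-line replacement above works on the whole data frame at once. If the data frame also holds text columns, a more explicit variant may be easier to reason about. The sketch below assumes that only numeric columns should be changed; replace_non_finite is a hypothetical helper name, not a function from base R or any package.

```r
# Hypothetical helper: set NaN, Inf and -Inf to NA in numeric columns only
replace_non_finite <- function(data) {
  numeric_cols <- vapply(data, is.numeric, logical(1))
  data[numeric_cols] <- lapply(data[numeric_cols], function(x) {
    x[!is.finite(x)] <- NA  # is.finite() is FALSE for NA, NaN, Inf and -Inf
    x
  })
  data
}

# Small made-up data frame with a text column that should stay untouched
example <- data.frame(x = c(1, -Inf, 3, NaN),
                      y = c(Inf, 2, 3, 4),
                      label = c("a", "b", "Inf", "d"))

cleaned <- replace_non_finite(example)
print(cleaned)
```

The text column label is left as it is, including the literal string "Inf", while every non-finite number in x and y becomes NA.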

Now we can fit the linear regression model and get the coefficients of the model using summary():

model <- lm(ice_cream_sold ~ temperature, df)
summary(model)
Call:
lm(formula = ice_cream_sold ~ temperature, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-28.73 -15.96   2.43  15.42  31.50 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -68.127     25.983  -2.622  0.03948 * 
temperature    6.221      1.277   4.872  0.00279 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.8 on 6 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7982,	Adjusted R-squared:  0.7646 
F-statistic: 23.73 on 1 and 6 DF,  p-value: 0.00279
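The line "2 observations deleted due to missingness" in the summary can be checked directly. This is a small sketch, assuming the same example data, that uses nobs() and na.action() to report how many rows were used and which rows were dropped; the names n_used and dropped are only illustrative.

```r
# Ice cream data from the example above
df <- data.frame(day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 temperature = c(10, NA, 22, 19, 28, 15, 20, 17, 13, 30),
                 ice_cream_sold = c(5, NaN, 40, 38, 100, 40, Inf, 10, 30, 150))

# Replace NaN, Inf and -Inf with NA, then fit the model
df[is.na(df) | df == Inf | df == -Inf] <- NA
model <- lm(ice_cream_sold ~ temperature, data = df)

n_used  <- nobs(model)                   # number of rows used in the fit
dropped <- as.integer(na.action(model))  # row numbers removed because of NA

print(n_used)
print(dropped)
print(round(coef(model), 3))
```

With this data, eight rows are used and rows 2 and 7 are dropped, which matches the six residual degrees of freedom reported in the summary.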

Example #2: Response Variable is Categorical

Consider an example where the ice cream data frame contains a new categorical column indicating whether a given day was cloudy or not.

df <- data.frame(day=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
temperature=c(10, 40, 22, 19, 28, 15, 20, 17, 13, 30),
ice_cream_sold=c(5, 200, 40, 38, 100, 40, 55, 10, 30, 150),
is_cloudy = c('Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No')
)


df
   day temperature ice_cream_sold is_cloudy
1    1          10              5       Yes
2    2          40            200        No
3    3          22             40        No
4    4          19             38       Yes
5    5          28            100        No
6    6          15             40        No
7    7          20             55        No
8    8          17             10       Yes
9    9          13             30       Yes
10  10          30            150        No

We want to fit a model using is_cloudy as the response variable and ice_cream_sold as the predictor variable:

model <- lm(is_cloudy ~ ice_cream_sold, df)

Let’s run the code to see the result:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion

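The error occurs because we have a categorical response variable. lm() tries to convert the character values Yes and No to numbers, which produces NA for every row, as the warning message indicates. We can only use continuous numerical values for our response variable in linear regression.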
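The coercion mentioned in the warning message can be reproduced outside lm(). The sketch below assumes the same example columns and sets stringsAsFactors = FALSE explicitly so that it behaves the same way on older versions of R; the name coerced is only illustrative.

```r
# Relevant columns of the example data frame
df <- data.frame(ice_cream_sold = c(5, 200, 40, 38, 100, 40, 55, 10, 30, 150),
                 is_cloudy = c('Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No'),
                 stringsAsFactors = FALSE)

# Inspect the type of the intended response variable
print(class(df$is_cloudy))
print(is.numeric(df$is_cloudy))

# Converting the text values to numbers yields only NA values
coerced <- suppressWarnings(as.numeric(df$is_cloudy))
print(coerced)
```

Checking class() and is.numeric() on the response before calling lm() is a quick way to catch this situation early.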

Solution #1: Use Logistic Regression

As the response variable can only have two outcomes (binary), we can perform logistic regression using the generalized linear model function glm(). We have to specify the parameter family=binomial().

model <- glm(as.factor(is_cloudy) ~ ice_cream_sold, data = df, family=binomial())
summary(model)

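Note that we have to tell R to treat is_cloudy as a factor because the column is stored as character strings. With a factor response, glm() treats the first level (No) as failure and models the probability of the other level (Yes). Let’s run the code to get the coefficients of the model: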

Call:
glm(formula = as.factor(is_cloudy) ~ ice_cream_sold, family = binomial(), 
    data = df)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-7.672e-05  -2.100e-08  -2.100e-08   2.100e-08   1.046e-04  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       753.70  222362.45   0.003    0.997
ice_cream_sold    -19.33    5694.43  -0.003    0.997

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.346e+01  on 9  degrees of freedom
Residual deviance: 2.272e-08  on 8  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25

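The call runs and returns a model, but the output deserves caution. The standard errors are enormous, the residual deviance is effectively zero, and the fit used 25 Fisher scoring iterations, which is the default maximum. In this small data set every cloudy day has sales of 38 or fewer and every other day has sales of 40 or more, so ice_cream_sold separates the two classes perfectly and the coefficient estimates are not reliable.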
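The very large standard errors and the near-zero residual deviance in the output above are worth a second look. One likely explanation is that the predictor splits the two groups without any overlap. The sketch below, which assumes the same example data, checks the factor levels and compares the sales ranges of the two groups; the names cloudy, sales_range, and separated are only illustrative.

```r
# Relevant columns of the example data frame
df <- data.frame(ice_cream_sold = c(5, 200, 40, 38, 100, 40, 55, 10, 30, 150),
                 is_cloudy = c('Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No'))

# The first factor level is the reference outcome for a binomial glm()
cloudy <- as.factor(df$is_cloudy)
print(levels(cloudy))

# Minimum and maximum sales within each group
sales_range <- tapply(df$ice_cream_sold, cloudy, range)
print(sales_range)

# TRUE when every cloudy day sold less than every non-cloudy day
separated <- max(df$ice_cream_sold[cloudy == "Yes"]) <
  min(df$ice_cream_sold[cloudy == "No"])
print(separated)
```

When a check like this returns TRUE, the logistic regression coefficients should not be interpreted at face value, and collecting more data or choosing a different modelling approach may be needed.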

Solution #2: Swap the variables

Alternatively, the predictor and response variables may be the wrong way round. If ice_cream_sold is the outcome and is_cloudy is the predictor, lm() can fit the model because it automatically treats the character predictor as a categorical variable. Let’s look at the revised code:

model <- lm(ice_cream_sold ~ is_cloudy, df)
summary(model)

Let’s run the code to fit the linear regression model and get the model coefficients.

Call:
lm(formula = ice_cream_sold ~ is_cloudy, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.500 -35.812  -4.125  15.250 102.500 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)     97.50      21.62   4.510  0.00198 **
is_cloudyYes   -76.75      34.18  -2.245  0.05497 . 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 52.96 on 8 degrees of freedom
Multiple R-squared:  0.3866,	Adjusted R-squared:  0.3099 
F-statistic: 5.041 on 1 and 8 DF,  p-value: 0.05497
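With a single two-level categorical predictor, the coefficients can be read as group means. The sketch below, assuming the same example data, compares the fitted coefficients with the mean sales for each group; the name group_means is only illustrative.

```r
# Relevant columns of the example data frame
df <- data.frame(ice_cream_sold = c(5, 200, 40, 38, 100, 40, 55, 10, 30, 150),
                 is_cloudy = c('Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No'))

# Mean ice cream sales on non-cloudy and cloudy days
group_means <- tapply(df$ice_cream_sold, df$is_cloudy, mean)
print(group_means)

# The intercept is the mean of the reference group (No);
# is_cloudyYes is the difference between the Yes mean and the No mean
model <- lm(ice_cream_sold ~ is_cloudy, data = df)
print(coef(model))
print(as.numeric(group_means["Yes"] - group_means["No"]))
```

The intercept of 97.50 is the mean for non-cloudy days, and the is_cloudyYes estimate of -76.75 says that cloudy days sold on average 76.75 fewer ice creams, giving a cloudy-day mean of 20.75.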

Summary

Congratulations on reading to the end of this tutorial!

Have fun and happy researching!