When working with the randomForest package in R, you might encounter the error:
Error in randomForest.default : NA not permitted in predictors
This error occurs when the predictors passed to the randomForest function contain NA or NaN values. Infinite (Inf) values are not supported either, but they raise a related error, NA/NaN/Inf in foreign function call, which is covered in the second example. This post will walk you through how to reproduce both errors and solve them.
Example to Reproduce the Error
Let’s create a simple dataset that includes NA values, which will trigger the error.
# Load the randomForest package
library(randomForest)
# Create a sample data frame with NA values
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)
# Attempt to run randomForest on this dataset
model <- randomForest(x = data[, 1:2], y = data$y)
Running this code will produce the following error:
Error in randomForest.default(x = data[, 1:2], y = data$y) : NA not permitted in predictors
This happens because randomForest cannot handle NA values directly. We need to clean or impute missing values before fitting the model.
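Before choosing a fix, it can help to see where the missing values are. The following is a minimal sketch in base R, assuming the same sample data frame as above; the names predictors, na_per_column and rows_with_na are illustrative only.

```r
# Same sample data frame as in the example above
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Illustrative name for the predictor columns
predictors <- data[, 1:2]

# Number of NA (or NaN) values in each predictor column
na_per_column <- colSums(is.na(predictors))
print(na_per_column)

# Row indices that contain at least one missing predictor
rows_with_na <- which(!complete.cases(predictors))
print(rows_with_na)
```

Here each predictor column has one missing value, in rows 3 and 4, so dropping incomplete rows would cost two of the five observations.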
Solution: Handle Missing Values
To fix the error, you can either remove rows with missing values or impute them using various strategies, such as mean imputation or using more sophisticated methods.
Option 1: Remove Missing Values
You can use na.omit() to remove rows containing NA values before passing the data to the randomForest function:
# Remove rows with NA values
clean_data <- na.omit(data)
# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data[, 1:2], y = clean_data$y)
# Print model output
print(model)
Output:
Call:
 randomForest(x = clean_data[, 1:2], y = clean_data$y)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1
        OOB estimate of  error rate: 100%
Confusion matrix:
  0 1 class.error
0 0 0         NaN
1 2 0           1
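Removing the incomplete rows ensures that only complete cases are passed into the randomForest function, which prevents the error. With so few rows left in this toy example, the OOB error estimate is not meaningful and will vary from run to run.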
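Dropping rows is simple, but it is worth checking how much data it costs and whether every class still has observations. The sketch below uses only base R and assumes the same sample data frame; n_dropped and class_counts are illustrative names.

```r
# Same sample data frame as in the example above
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Remove rows with NA values
clean_data <- na.omit(data)

# How many rows were lost?
n_dropped <- nrow(data) - nrow(clean_data)
cat("Dropped", n_dropped, "of", nrow(data), "rows\n")

# How many observations remain in each class of the response?
class_counts <- table(clean_data$y)
print(class_counts)
```

In this example two of the five rows are dropped, leaving one observation of class 0 and two of class 1. On a real dataset, a large drop or a class left with very few rows is a sign that imputation may be the better option.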
Option 2: Impute Missing Values
Alternatively, you can use mean imputation to replace NA values with the column mean.
# Impute missing values with column means
data$x1[is.na(data$x1)] <- mean(data$x1, na.rm = TRUE)
data$x2[is.na(data$x2)] <- mean(data$x2, na.rm = TRUE)
# Run randomForest on the imputed dataset
model <- randomForest(x = data[, 1:2], y = data$y)
# Print model output
print(model)
Output:
Call:
 randomForest(x = data[, 1:2], y = data$y)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1
        OOB estimate of  error rate: 80%
Confusion matrix:
  0 1 class.error
0 1 1         0.5
1 3 0         1.0
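As one step beyond mean imputation, the randomForest package includes na.roughfix(), which fills numeric columns with the column median and factor columns with the most frequent level. The sketch below first shows the median fill in base R and then, only if the package is installed, the same thing via na.roughfix(). The names data_na, median_fill, filled and rough are illustrative.

```r
# Fresh copy of the sample data, so the NA values are still present
data_na <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Illustrative helper: fill NA in a numeric column with the column median
median_fill <- function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}

filled <- data_na
filled$x1 <- median_fill(filled$x1)
filled$x2 <- median_fill(filled$x2)
print(filled)

# The model fit is skipped if the randomForest package is not installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  # na.roughfix() applies the median/mode fill to every column at once
  rough <- na.roughfix(data_na)
  model <- randomForest(x = rough[, 1:2], y = rough$y)
  print(model)
}
```

The median is less sensitive to outliers than the mean, which is one reason to prefer it for skewed predictors. Either way, imputing with a single value is a rough fix and can understate the variability in the data.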
Example 2 to Reproduce the Error: NA/NaN/Inf in foreign function call (arg 1)
If the dataset has Inf values but no NA values, you may instead get the error:
Error in randomForest.default(x = data_inf[, 1:2], y = data_inf$y) : NA/NaN/Inf in foreign function call (arg 1)
Let’s create a dataset that contains Inf values, which will also trigger the error:
# Load the randomForest package
library(randomForest)
# Create a sample data frame with Inf values
data_inf <- data.frame(
  x1 = c(2, 4, Inf, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)
# Attempt to run randomForest on this dataset
model <- randomForest(x = data_inf[, 1:2], y = data_inf$y)
When you run this code, you’ll see the following error:
Error in randomForest.default(x = data_inf[, 1:2], y = data_inf$y) : NA/NaN/Inf in foreign function call (arg 1)
This error occurs because randomForest cannot handle Inf values, just as it cannot handle NA or NaN.
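Infinite values are easy to miss because is.na() does not flag them. They often come from operations such as division by zero or log(0). A minimal base R sketch for locating them, assuming the same data_inf frame as above (non_finite_per_column is an illustrative name):

```r
# Same sample data frame as in the example above
data_inf <- data.frame(
  x1 = c(2, 4, Inf, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# is.na() is FALSE for Inf, so an NA check alone reports nothing here
print(any(is.na(data_inf[, 1:2])))

# is.finite() is FALSE for NA, NaN, Inf and -Inf, so this counts all of them
non_finite_per_column <- sapply(
  data_inf[, 1:2],
  function(col) sum(!is.finite(col))
)
print(non_finite_per_column)
```

Checking with is.finite() rather than is.na() covers both kinds of problem value in one pass.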
Solution: Handle Inf Values
Similar to missing (NA) values, you need to clean or handle Inf values in your dataset. Here are two possible approaches:
Option 1: Remove Inf Values
You can remove rows that contain Inf values using the is.finite() function:
# Remove rows with Inf values
clean_data_inf <- data_inf[is.finite(rowSums(data_inf[, 1:2])), ]
# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data_inf[, 1:2], y = clean_data_inf$y)
# Print model output
print(model)
Output:
Call:
 randomForest(x = clean_data_inf[, 1:2], y = clean_data_inf$y)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1
        OOB estimate of  error rate: 100%
Confusion matrix:
  0 1 class.error
0 0 0         NaN
1 2 0           1
This method ensures that only rows with finite values are used, preventing the error.
Option 2: Replace Inf Values with Finite Numbers
You can replace Inf values with a finite number, like the maximum value of the column:
# Replace Inf values with the column maximum (excluding Inf)
data_inf$x1[is.infinite(data_inf$x1)] <- max(data_inf$x1[is.finite(data_inf$x1)])
data_inf$x2[is.infinite(data_inf$x2)] <- max(data_inf$x2[is.finite(data_inf$x2)])
# Run randomForest on the modified dataset
model <- randomForest(x = data_inf[, 1:2], y = data_inf$y)
# Print model output
print(model)
Output:
Call:
 randomForest(x = data_inf[, 1:2], y = data_inf$y)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1
        OOB estimate of  error rate: 40%
Confusion matrix:
  0 1 class.error
0 1 1   0.5000000
1 1 2   0.3333333
In this case, the infinite values are replaced with the maximum finite value in each column, allowing the randomForest function to run without errors.
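One caveat: is.infinite() is also TRUE for -Inf, so the replacement above would assign the column maximum to negative infinities as well. If your data can contain both signs, a sign-aware variant may be closer to what you want. The helper below, cap_infinite, is a hypothetical name and only a sketch; it assumes each column has at least one finite value.

```r
# Hypothetical helper: replace +Inf with the largest finite value
# and -Inf with the smallest finite value; NA values are left untouched
cap_infinite <- function(col) {
  finite_vals <- col[is.finite(col)]
  # which() drops NA comparisons, so missing values do not break the indexing
  col[which(col == Inf)] <- max(finite_vals)
  col[which(col == -Inf)] <- min(finite_vals)
  col
}

# Small example with both signs of infinity and a missing value
x <- c(2, -Inf, Inf, 6, NA)
capped <- cap_infinite(x)
print(capped)
```

In this example -Inf becomes 2, Inf becomes 6, and the NA is kept so that it can be handled separately with one of the NA strategies described earlier.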
Handling Both NA and Inf Values
To generalize the solution, you can handle both NA and Inf values at the same time by combining is.finite() with na.omit() to clean the dataset:
# Load the randomForest package
library(randomForest)
# Create a sample data frame with both NA and Inf values
data_inf <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)
# Fitting randomForest on data_inf directly would fail, so clean it first.
# is.finite() is FALSE for NA, NaN, Inf and -Inf, so this drops all such rows.
clean_data <- data_inf[is.finite(rowSums(data_inf[, 1:2])), ]
# na.omit() additionally drops rows with NA in any other column, such as y
clean_data <- na.omit(clean_data)
# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data[, 1:2], y = clean_data$y)
# Print model output
print(model)
The model is now fitted on the three rows that remain (rows 1, 2 and 5), and print(model) reports a classification forest with 500 trees. With so few rows, the OOB error rate and confusion matrix are not meaningful and change from run to run.
This approach removes all rows containing either NA or Inf values before fitting the model.
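If you need this cleaning step in more than one place, it can be wrapped in a small function. The sketch below uses base R only; keep_finite_rows and data_mixed are hypothetical names, and the function assumes the listed columns are numeric.

```r
# Hypothetical helper: keep rows where every listed numeric column is finite
keep_finite_rows <- function(df, cols) {
  # One logical vector per column: TRUE where the value is finite
  finite_flags <- lapply(df[cols], is.finite)
  # A row is kept only if all of its flags are TRUE
  keep <- Reduce(`&`, finite_flags)
  df[keep, , drop = FALSE]
}

# Sample data with one NA and one Inf
data_mixed <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

clean_mixed <- keep_finite_rows(data_mixed, c("x1", "x2"))
print(clean_mixed)
```

This version checks each column separately instead of going through a row sum, so very large finite values that would overflow when added together are not dropped by mistake.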
Conclusion
The randomForest.default(...) : NA not permitted in predictors error is triggered by NA or NaN values in the predictors. The randomForest.default(m, y, ...) : NA/NaN/Inf in foreign function call error is what you typically see when the predictors contain infinite (Inf) values instead. You can solve both by removing or replacing the problematic values. The approach you choose depends on your data and analysis needs. These simple techniques will help you avoid these errors and enable a smooth modeling process with the randomForest package in R.
Congratulations on reading to the end of this tutorial!
For further reading on R errors related to data science, go to the articles:
- How to Solve Error in randomforest.default(m, y, …) : can’t have empty classes in y
- How to Solve R Warning: glm.fit algorithm did not converge
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.