When working with the randomForest package in R, you might encounter the error:
Error in randomForest.default : NA not permitted in predictors
This error occurs when the predictors passed to the randomForest function contain NA or NaN values, which the function does not accept. Infinite (Inf) values trigger a closely related error, which is covered in the second example. This post will walk you through how to reproduce both errors and solve them.
Example to Reproduce the Error
Let’s create a simple dataset that includes NA values, which will trigger the error.
# Load the randomForest package
library(randomForest)

# Create a sample data frame with NA values
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Attempt to run randomForest on this dataset
model <- randomForest(x = data[, 1:2], y = data$y)
Running this code will produce the following error:
Error in randomForest.default(x = data[, 1:2], y = data$y) : NA not permitted in predictors
This happens because randomForest cannot handle NA values directly. We need to clean or impute missing values before fitting the model.
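Before cleaning anything, it can help to see where the missing values are. The following is a minimal base R sketch, assuming the same sample data frame as above; the names na_counts and na_rows are only illustrative.

```r
# Same sample data as above: one NA in each predictor
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Number of missing values (NA or NaN) in each column
na_counts <- colSums(is.na(data))
print(na_counts)

# Row numbers that contain at least one missing value
na_rows <- which(!complete.cases(data))
print(na_rows)
```

Knowing how many rows are affected makes it easier to decide between dropping rows and imputing values.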
Solution: Handle Missing Values
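To fix the error, you can either remove the rows with missing values or impute them, for example with the column mean or with a more sophisticated method. The randomForest package itself provides na.roughfix() for median/mode imputation and rfImpute() for proximity-based imputation.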
Option 1: Remove Missing Values
You can use na.omit() to remove rows containing NA values before passing the data to the randomForest function:
# Remove rows with NA values
clean_data <- na.omit(data)

# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data[, 1:2], y = clean_data$y)

# Print model output
print(model)
Output:
Call:
randomForest(x = clean_data[, 1:2], y = clean_data$y)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 100%
Confusion matrix:
  0 1 class.error
0 0 0         NaN
1 2 0           1
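Removing the incomplete rows ensures that only complete cases are passed to the randomForest function, which prevents the error. With only three rows left in this toy dataset, the out-of-bag error estimate is not meaningful, and the exact numbers will vary from run to run because no random seed is set.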
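Note that na.omit() drops a row if any column is missing, including columns you do not use as predictors. As a hedged alternative, the sketch below filters on the predictor columns only with complete.cases() and reports how many rows were lost. The variable names predictors and keep are illustrative, and the block stops before fitting the model.

```r
# Same sample data as above
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Columns that will be passed to randomForest as predictors
predictors <- c("x1", "x2")

# TRUE for rows where every predictor is present
keep <- complete.cases(data[, predictors])
clean_data <- data[keep, ]

# Report how much data the filter removed
cat("Dropped", sum(!keep), "of", nrow(data), "rows\n")
```

If a large share of the rows is dropped, imputation is usually worth considering instead.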
Option 2: Impute Missing Values
Alternatively, you can use mean imputation to replace NA values with the column mean.
# Impute missing values with column means
data$x1[is.na(data$x1)] <- mean(data$x1, na.rm = TRUE)
data$x2[is.na(data$x2)] <- mean(data$x2, na.rm = TRUE)

# Run randomForest on the imputed dataset
model <- randomForest(x = data[, 1:2], y = data$y)

# Print model output
print(model)
Output:
Call:
randomForest(x = data[, 1:2], y = data$y)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 80%
Confusion matrix:
  0 1 class.error
0 1 1         0.5
1 3 0         1.0
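With more than a couple of columns, repeating the imputation line for each one becomes tedious. The following is a minimal sketch of a reusable helper, assuming all listed columns are numeric. The function name impute_median is hypothetical, and it uses the median rather than the mean because the median is less sensitive to outliers.

```r
# Hypothetical helper: replace NA/NaN in the given numeric columns
# with that column's median
impute_median <- function(df, cols) {
  for (col in cols) {
    is_missing <- is.na(df[[col]])
    df[[col]][is_missing] <- median(df[[col]], na.rm = TRUE)
  }
  df
}

# Same sample data as in the first example
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

imputed <- impute_median(data, c("x1", "x2"))
print(imputed)
```

The helper returns a modified copy, so the original data frame stays available for comparison.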
Example 2 to Reproduce the Error: NA/NaN/Inf in foreign function call (arg 1)
If the dataset contains Inf values but no NA values, you may instead get this error:
Error in randomForest.default(x = data_inf[, 1:2], y = data_inf$y) : NA/NaN/Inf in foreign function call (arg 1)
Let’s create a dataset that contains Inf values, which will also trigger the error:
# Load the randomForest package
library(randomForest)

# Create a sample data frame with Inf values
data_inf <- data.frame(
  x1 = c(2, 4, Inf, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Attempt to run randomForest on this dataset
model <- randomForest(x = data_inf[, 1:2], y = data_inf$y)
When you run this code, you’ll see the following error:
Error in randomForest.default(x = data_inf[, 1:2], y = data_inf$y) : NA/NaN/Inf in foreign function call (arg 1)
This error occurs because randomForest cannot handle Inf values, just like NA/NaN.
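The different error message makes sense once you compare how base R classifies these values: is.na() is FALSE for Inf, so a check for missing values alone will not catch it, whereas is.finite() is FALSE for NA, NaN, Inf and -Inf alike. The short sketch below illustrates this on a small vector; the variable names are illustrative.

```r
# One ordinary value plus every kind of problematic value
x <- c(2, NA, NaN, Inf, -Inf)

na_flags <- is.na(x)            # TRUE for NA and NaN only
inf_flags <- is.infinite(x)     # TRUE for Inf and -Inf only
finite_flags <- is.finite(x)    # TRUE only for ordinary numbers

print(data.frame(value = x, is_na = na_flags,
                 is_infinite = inf_flags, is_finite = finite_flags))
```

This is why is.finite() is the more convenient test when a dataset may contain both kinds of values.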
Solution: Handle Inf Values
Similar to missing (NA) values, you need to clean or handle Inf values in your dataset. Here are two possible approaches:
Option 1: Remove Inf Values
You can remove rows that contain Inf values using the is.finite() function:
# Remove rows with Inf values
clean_data_inf <- data_inf[is.finite(rowSums(data_inf[, 1:2])), ]

# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data_inf[, 1:2], y = clean_data_inf$y)

# Print model output
print(model)
Output:
Call:
randomForest(x = clean_data_inf[, 1:2], y = clean_data_inf$y)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 100%
Confusion matrix:
  0 1 class.error
0 0 0         NaN
1 2 0           1
This method ensures that only rows with finite values are used, preventing the error.
Option 2: Replace Inf Values with Finite Numbers
You can replace Inf values with a finite number, like the maximum value of the column:
# Replace Inf values with the column maximum (excluding Inf)
data_inf$x1[is.infinite(data_inf$x1)] <- max(data_inf$x1[is.finite(data_inf$x1)])
data_inf$x2[is.infinite(data_inf$x2)] <- max(data_inf$x2[is.finite(data_inf$x2)])

# Run randomForest on the modified dataset
model <- randomForest(x = data_inf[, 1:2], y = data_inf$y)

# Print model output
print(model)
Output:
Call:
randomForest(x = data_inf[, 1:2], y = data_inf$y)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 40%
Confusion matrix:
  0 1 class.error
0 1 1   0.5000000
1 1 2   0.3333333
In this case, the infinite values are replaced with the maximum finite value in each column, allowing the randomForest function to run without errors.
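One caveat: is.infinite() is also TRUE for -Inf, so the replacement above would set negative infinities to the column maximum as well. If your data can contain -Inf, a sign-aware version may be safer. The following is a minimal sketch, assuming each column has at least one finite value; the function name cap_infinite is hypothetical.

```r
# Hypothetical helper: replace Inf with the largest finite value and
# -Inf with the smallest finite value; NA and NaN are left untouched
cap_infinite <- function(v) {
  finite_vals <- v[is.finite(v)]
  # which() drops the NA comparisons, so missing values are never indexed
  v[which(v == Inf)] <- max(finite_vals)
  v[which(v == -Inf)] <- min(finite_vals)
  v
}

x <- c(2, 4, Inf, 6, -Inf)
capped <- cap_infinite(x)
print(capped)
```

To apply it to a data frame, you could call it on each predictor column, for example data_inf$x1 <- cap_infinite(data_inf$x1).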
Handling Both NA and Inf Values
To generalize the solution, you can handle both NA and Inf values at the same time by using a combination of is.finite() and other functions to clean the dataset:
# Load the randomForest package
library(randomForest)

# Create a sample data frame with both NA and Inf values
data_mixed <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Fitting randomForest on data_mixed directly would fail,
# so clean the dataset first by removing both NA and Inf.
# rowSums() is NA or Inf for any row with a non-finite predictor.
clean_data <- data_mixed[is.finite(rowSums(data_mixed[, 1:2])), ]
clean_data <- na.omit(clean_data)

# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data[, 1:2], y = clean_data$y)

# Print model output
print(model)

This approach removes all rows containing either NA or Inf values before fitting the model. Here the cleaned data frame keeps rows 1, 2, and 5, the same three complete rows as in the two removal examples above, so the printed model summary has the same structure as the output shown there.
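The rowSums() trick relies on every predictor column being numeric and on the predictors sitting in columns 1 and 2. The following is a hedged sketch of a more general helper that takes the predictor names explicitly. The function name drop_nonfinite_rows is hypothetical, the listed columns are assumed to be numeric, and the block stops before fitting the model.

```r
# Hypothetical helper: keep only rows where every listed predictor is
# finite, which excludes NA, NaN, Inf and -Inf in one pass
drop_nonfinite_rows <- function(df, predictors) {
  # One logical vector per predictor, combined element-wise with AND
  keep <- Reduce(`&`, lapply(df[predictors], is.finite))
  df[keep, , drop = FALSE]
}

# Sample data with one NA and one Inf
data_mixed <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

cleaned <- drop_nonfinite_rows(data_mixed, c("x1", "x2"))
print(cleaned)
```

The result can then be passed to randomForest in the same way as clean_data in the examples above.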
Conclusion
The "NA not permitted in predictors" error from randomForest.default() is triggered by NA or NaN values in your predictors. The "NA/NaN/Inf in foreign function call" error appears when the predictors contain infinite (Inf) values. You can solve either error by removing or replacing the problematic values. The approach you choose depends on your data and analysis needs. These simple techniques will help you avoid both errors and keep your modeling with the randomForest package in R running smoothly.
Congratulations on reading to the end of this tutorial!
For further reading on data science related errors in R, see the following articles:
- How to Solve Error in randomforest.default(m, y, …) : can’t have empty classes in y
- How to Solve R Warning: glm.fit algorithm did not converge
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.
