
How to Solve R Error: randomForest NA not permitted in predictors

by | Programming, R, Tips

When working with the randomForest package in R, you might encounter the error:

Error in randomForest.default : 
  NA not permitted in predictors

This error occurs when the predictors passed to the randomForest function contain NA or NaN values, which the function does not accept. Infinite (Inf) values trigger a related but different error, which is covered later in this post. This post walks through how to reproduce both errors and how to solve them.

Example to Reproduce the Error

Let’s create a simple dataset that includes NA values, which will trigger the error.

# Load the randomForest package
library(randomForest)

# Create a sample data frame with NA values
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Attempt to run randomForest on this dataset
model <- randomForest(x = data[, 1:2], y = data$y)

Running this code will produce the following error:

Error in randomForest.default(x = data[, 1:2], y = data$y) : 
  NA not permitted in predictors

This happens because randomForest cannot handle NA values directly. We need to clean or impute missing values before fitting the model.
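Before choosing a fix, it can help to see where the missing values are. The following is a minimal sketch using only base R; it recreates the example data frame so that it runs on its own, and the variable names na_counts and na_rows are just illustrative.

```r
# Recreate the example data frame from above
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Count the NA values in each column
na_counts <- colSums(is.na(data))
print(na_counts)

# Find the row indices that contain at least one NA
na_rows <- which(!complete.cases(data))
print(na_rows)
```

Here x1 and x2 each contain one NA, located in rows 3 and 4, which tells you how much data a row-removal strategy would cost.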

Solution: Handle Missing Values

To fix the error, you can either remove the rows with missing values or impute them, for example with the column mean or a more sophisticated method.

Option 1: Remove Missing Values

You can use na.omit() to remove rows containing NA values before passing the data to the randomForest function:

# Remove rows with NA values
clean_data <- na.omit(data)

# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data[, 1:2], y = clean_data$y)

# Print model output
print(model)

Output:

Call:
 randomForest(x = clean_data[, 1:2], y = clean_data$y) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 100%
Confusion matrix:
  0 1 class.error
0 0 0         NaN
1 2 0           1

Removing the missing data ensures that only complete cases are passed to the randomForest function, which prevents the error.
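Dropping rows is simple, but it can discard a large share of a small dataset. As a rough sketch (base R only, with illustrative variable names), you can check how many rows na.omit() removed and which ones they were:

```r
# Recreate the example data frame from above
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

clean_data <- na.omit(data)

# Number of rows lost to missing values
n_dropped <- nrow(data) - nrow(clean_data)
print(n_dropped)

# na.omit() records the indices of the dropped rows in the "na.action" attribute
dropped_rows <- as.integer(attr(clean_data, "na.action"))
print(dropped_rows)
```

In this toy example two of the five rows are dropped, so the fitted model is only useful as a demonstration that the error is gone.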

Option 2: Impute Missing Values

Alternatively, you can use mean imputation to replace NA values with the column mean.

# Impute missing values with column means
data$x1[is.na(data$x1)] <- mean(data$x1, na.rm = TRUE)
data$x2[is.na(data$x2)] <- mean(data$x2, na.rm = TRUE)

# Run randomForest on the imputed dataset
model <- randomForest(x = data[, 1:2], y = data$y)

# Print model output
print(model)

Output:

Call:
 randomForest(x = data[, 1:2], y = data$y) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 80%
Confusion matrix:
  0 1 class.error
0 1 1         0.5
1 3 0         1.0
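
Writing one imputation line per column does not scale to wider datasets. The sketch below applies the same mean imputation to every numeric column; impute_column_means is a hypothetical helper name, not a function from the randomForest package.

```r
# Hypothetical helper: replace NA in every numeric column with that column's mean
impute_column_means <- function(df) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  df[numeric_cols] <- lapply(df[numeric_cols], function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  df
}

# Recreate the example data frame from above
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

imputed <- impute_column_means(data)
print(imputed)
```

Factor columns such as y are left untouched. The randomForest package also provides na.roughfix() and rfImpute() for imputation; see the package documentation for how they behave before relying on them.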

Example 2 to Reproduce the Error: NA/NaN/Inf in foreign function call (arg 1)

If the dataset has Inf values but no NA values, you may instead get the error:

Error in randomForest.default(x = data_inf[, 1:2], y = data_inf$y) : 
  NA/NaN/Inf in foreign function call (arg 1)

Let’s create a dataset that contains Inf values, which will also trigger the error:

# Load the randomForest package
library(randomForest)

# Create a sample data frame with Inf values
data_inf <- data.frame(
  x1 = c(2, 4, Inf, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Attempt to run randomForest on this dataset
model <- randomForest(x = data_inf[, 1:2], y = data_inf$y)

When you run this code, you’ll see the following error:

Error in randomForest.default(x = data_inf[, 1:2], y = data_inf$y) : 
  NA/NaN/Inf in foreign function call (arg 1)

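This error occurs because randomForest cannot handle Inf values any more than it can handle NA or NaN. The message is different because Inf is not treated as missing, so it passes the NA check and is only rejected later, when the data is handed to the compiled code behind randomForest.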
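To see why Inf is not caught by a missing-value check, it helps to compare how base R's predicates classify each special value. This is a small illustration with a made-up vector, not part of the randomForest workflow itself:

```r
# One ordinary value followed by each special value
values <- c(1, NA, NaN, Inf, -Inf)

checks <- data.frame(
  value = values,
  is_na = is.na(values),
  is_nan = is.nan(values),
  is_infinite = is.infinite(values),
  is_finite = is.finite(values)
)
print(checks)
```

is.na() is TRUE for NA and NaN but FALSE for Inf and -Inf, while is.finite() is TRUE only for the ordinary value. That makes is.finite() the single check that rules out every problematic case.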

Solution: Handle Inf Values

Similar to missing (NA) values, you need to clean or handle Inf values in your dataset. Here are two possible approaches:

Option 1: Remove Inf Values

You can remove rows that contain Inf values using the is.finite() function:

# Remove rows with Inf values
clean_data_inf <- data_inf[is.finite(rowSums(data_inf[, 1:2])), ]

# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data_inf[, 1:2], y = clean_data_inf$y)

# Print model output
print(model)

Output:

Call:
 randomForest(x = clean_data_inf[, 1:2], y = clean_data_inf$y) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 100%
Confusion matrix:
  0 1 class.error
0 0 0         NaN
1 2 0           1

This method ensures that only rows with finite values are used, preventing the error.

Option 2: Replace Inf Values with Finite Numbers

You can replace Inf values with a finite number, like the maximum value of the column:

# Replace Inf values with the column maximum (excluding Inf)
data_inf$x1[is.infinite(data_inf$x1)] <- max(data_inf$x1[is.finite(data_inf$x1)])
data_inf$x2[is.infinite(data_inf$x2)] <- max(data_inf$x2[is.finite(data_inf$x2)])

# Run randomForest on the modified dataset
model <- randomForest(x = data_inf[, 1:2], y = data_inf$y)

# Print model output
print(model)

Output:

Call:
 randomForest(x = data_inf[, 1:2], y = data_inf$y) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 40%
Confusion matrix:
  0 1 class.error
0 1 1   0.5000000
1 1 2   0.3333333

In this case, the infinite values are replaced with the maximum finite value in each column, allowing the randomForest function to run without errors.
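
Note that is.infinite() is also TRUE for -Inf, so the replacement above would set negative infinities to the column maximum as well. If your data can contain both signs, a sketch like the following treats them separately; clamp_infinite is a hypothetical helper name, and it assumes the column has at least one finite value.

```r
# Hypothetical helper: clamp +Inf to the largest finite value
# and -Inf to the smallest finite value in the vector
clamp_infinite <- function(col) {
  finite_vals <- col[is.finite(col)]
  col[is.infinite(col) & col > 0] <- max(finite_vals)
  col[is.infinite(col) & col < 0] <- min(finite_vals)
  col
}

x <- c(2, 4, Inf, -Inf, 8)
clamped <- clamp_infinite(x)
print(clamped)
```

NA values pass through unchanged, so this step can be combined with one of the NA strategies described earlier.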

Handling Both NA and Inf Values

To generalize the solution, you can handle both NA and Inf values at the same time by combining is.finite() with na.omit() to clean the dataset:

# Load the randomForest package
library(randomForest)

# Create a sample data frame with both NA and Inf values
data_inf <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Fitting randomForest on data_inf directly would stop with an error,
# so clean the data first.

# Keep only rows whose predictors are all finite (drops NA, NaN, and Inf)
clean_data <- data_inf[is.finite(rowSums(data_inf[, 1:2])), ]

# Also drop rows with NA in any remaining column, such as the response
clean_data <- na.omit(clean_data)

# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data[, 1:2], y = clean_data$y)

# Print model output
print(model)

Only rows 1, 2, and 5 survive the cleaning step. These are the same three rows that na.omit() kept in the first example, so print(model) produces output of the same form as shown there. The OOB error estimate and the confusion matrix vary from run to run.

This approach removes all rows containing either NA or Inf values before fitting the model.
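
The is.finite(rowSums(...)) filter is compact, but rowSums() only accepts numeric columns, and summing very large finite values could in principle overflow. As an alternative sketch under those assumptions, the hypothetical helper keep_usable_rows below checks each column on its own, using is.finite() for numeric columns and !is.na() for everything else:

```r
# Hypothetical helper: keep rows where every numeric column is finite
# and every non-numeric column (such as a factor) is not NA
keep_usable_rows <- function(df) {
  col_ok <- lapply(df, function(col) {
    if (is.numeric(col)) is.finite(col) else !is.na(col)
  })
  df[Reduce(`&`, col_ok), , drop = FALSE]
}

# Example data with one NA and one Inf, as in the section above
data_mixed <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

clean_mixed <- keep_usable_rows(data_mixed)
print(clean_mixed)
```

The result contains rows 1, 2, and 5, and it can be passed to randomForest in the same way as clean_data above.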

Conclusion

The "NA not permitted in predictors" error from randomForest.default() is triggered by NA or NaN values in your predictors. The "NA/NaN/Inf in foreign function call" error is triggered by infinite (Inf) values, which are not counted as missing and are only rejected once the data reaches the compiled code. You can solve both by removing or replacing the problematic values, and the approach you choose depends on your data and analysis needs. These simple techniques will help you avoid these errors and keep your modeling with the randomForest package running smoothly.

Congratulations on reading to the end of this tutorial!

For further reading on data science-related errors in R, go to the article: 

Go to the online courses page on R to learn more about coding in R for data science and machine learning.

Have fun and happy researching!