When working with the `randomForest` package in R, you might encounter the error:

```
Error in randomForest.default : NA not permitted in predictors
```

This error occurs when the predictors passed to the `randomForest` function contain `NA` or `NaN` values, which the function does not accept. `Inf` values are rejected as well, but with a different error message that is covered later in this post. This post will walk you through how to reproduce both errors and solve them.

## Example to Reproduce the Error

Let’s create a simple dataset that includes `NA` values, which will trigger the error.

```r
# Load the randomForest package
library(randomForest)

# Create a sample data frame with NA values
data <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Attempt to run randomForest on this dataset
model <- randomForest(x = data[, 1:2], y = data$y)
```

Running this code will produce the following error:

```
Error in randomForest.default(x = data[, 1:2], y = data$y) : 
  NA not permitted in predictors
```

This happens because `randomForest` cannot handle `NA` values directly. We need to clean or impute missing values before fitting the model.
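Before choosing a fix, it can help to see which columns hold the problem values. The following is a minimal sketch in base R; the helper name `count_bad` and the data frame `check_df` are made up for illustration and are not part of the `randomForest` package.

```r
# Hypothetical toy data with one NA, one NaN, and one Inf
check_df <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NaN, Inf)
)

# Count NA, NaN, and infinite values in each numeric column.
# is.na() is TRUE for NaN as well, so NaN is excluded from the NA count.
count_bad <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    n_na  = vapply(num, function(col) sum(is.na(col) & !is.nan(col)), integer(1)),
    n_nan = vapply(num, function(col) sum(is.nan(col)), integer(1)),
    n_inf = vapply(num, function(col) sum(is.infinite(col)), integer(1))
  )
}

bad_counts <- count_bad(check_df)
print(bad_counts)
```

Each row of the result corresponds to one column of the input, so a nonzero entry points directly at the column that needs cleaning.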

## Solution: Handle Missing Values

To fix the error, you can either remove rows with missing values or impute them, for example with mean imputation or a more sophisticated method.

### Option 1: Remove Missing Values

You can use `na.omit()` to remove rows containing `NA` values before passing the data to the `randomForest` function:

```r
# Remove rows with NA values
clean_data <- na.omit(data)

# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data[, 1:2], y = clean_data$y)

# Print model output
print(model)
```

**Output:**

```
Call:
 randomForest(x = clean_data[, 1:2], y = clean_data$y) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 100%
Confusion matrix:
  0 1 class.error
0 0 0         NaN
1 2 0           1
```

Removing rows with missing data ensures that only complete cases are passed to the `randomForest` function, which prevents the error. With only three rows left, the OOB error estimate is not meaningful, and the exact numbers vary between runs because each tree is grown on a random bootstrap sample.

### Option 2: Impute Missing Values

Alternatively, you can use mean imputation to replace `NA` values with the column mean.

```r
# Impute missing values with column means
data$x1[is.na(data$x1)] <- mean(data$x1, na.rm = TRUE)
data$x2[is.na(data$x2)] <- mean(data$x2, na.rm = TRUE)

# Run randomForest on the imputed dataset
model <- randomForest(x = data[, 1:2], y = data$y)

# Print model output
print(model)
```

**Output:**

```
Call:
 randomForest(x = data[, 1:2], y = data$y) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 80%
Confusion matrix:
  0 1 class.error
0 1 1         0.5
1 3 0         1.0
```
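When there are more than a couple of predictors, repeating the imputation line for each column gets tedious. The sketch below wraps the same idea in a small helper and uses the median instead of the mean, which is less sensitive to outliers. The function name `impute_median` and the data frame `raw` are assumptions made for this example, not something the `randomForest` package provides.

```r
# Replace NA/NaN in the given columns with that column's median.
# `cols` is a character vector of numeric column names.
impute_median <- function(df, cols) {
  for (col in cols) {
    missing <- is.na(df[[col]])  # TRUE for both NA and NaN
    df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
  }
  df
}

# Same toy data as in the NA example
raw <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, NA, 7),
  y  = as.factor(c(1, 0, 1, 0, 1))
)

imputed <- impute_median(raw, c("x1", "x2"))
print(imputed)
```

The returned data frame has no missing predictors, so it can be passed to `randomForest(x = imputed[, 1:2], y = imputed$y)` in the same way as the imputed `data` above.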

## Example 2 to Reproduce the Error: NA/NaN/Inf in foreign function call (arg 1)

If the dataset has `Inf` values but no `NA` values, you may instead get the error:

```
Error in randomForest.default(x = data_inf[, 1:2], y = data_inf$y) : 
  NA/NaN/Inf in foreign function call (arg 1)
```

Let’s create a dataset that contains `Inf` values, which will trigger this error:

```r
# Load the randomForest package
library(randomForest)

# Create a sample data frame with Inf values
data_inf <- data.frame(
  x1 = c(2, 4, Inf, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Attempt to run randomForest on this dataset
model <- randomForest(x = data_inf[, 1:2], y = data_inf$y)
```

When you run this code, you’ll see the following error:

```
Error in randomForest.default(x = data_inf[, 1:2], y = data_inf$y) : 
  NA/NaN/Inf in foreign function call (arg 1)
```

This error occurs because `randomForest` cannot handle `Inf` values, just as it cannot handle `NA` or `NaN` values.

## Solution: Handle `Inf` Values

As with missing (`NA`) values, you need to clean or handle `Inf` values in your dataset. Here are two possible approaches:

### Option 1: Remove `Inf` Values

You can remove rows that contain `Inf` values using the `is.finite()` function:

```r
# Remove rows with Inf values
clean_data_inf <- data_inf[is.finite(rowSums(data_inf[, 1:2])), ]

# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data_inf[, 1:2], y = clean_data_inf$y)

# Print model output
print(model)
```

**Output:**

```
Call:
 randomForest(x = clean_data_inf[, 1:2], y = clean_data_inf$y) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 100%
Confusion matrix:
  0 1 class.error
0 0 0         NaN
1 2 0           1
```

This method ensures that only rows with finite values are used, preventing the error.

### Option 2: Replace `Inf` Values with Finite Numbers

You can replace `Inf` values with a finite number, like the maximum value of the column:

```r
# Replace Inf values with the column maximum (excluding Inf)
data_inf$x1[is.infinite(data_inf$x1)] <- max(data_inf$x1[is.finite(data_inf$x1)])
data_inf$x2[is.infinite(data_inf$x2)] <- max(data_inf$x2[is.finite(data_inf$x2)])

# Run randomForest on the modified dataset
model <- randomForest(x = data_inf[, 1:2], y = data_inf$y)

# Print model output
print(model)
```

**Output:**

```
Call:
 randomForest(x = data_inf[, 1:2], y = data_inf$y) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 40%
Confusion matrix:
  0 1 class.error
0 1 1   0.5000000
1 1 2   0.3333333
```

In this case, the infinite values are replaced with the maximum finite value in each column, allowing the `randomForest` function to run without errors.
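Note that `is.infinite()` is also `TRUE` for `-Inf`, so replacing every infinite value with the column maximum would push a `-Inf` to the wrong end of the range. A minimal sketch of a sign-aware variant follows; the helper name `cap_infinite` is hypothetical and introduced only for this example.

```r
# Replace +Inf with the largest finite value and -Inf with the smallest.
# NA and NaN entries are left untouched.
cap_infinite <- function(x) {
  finite_vals <- x[is.finite(x)]
  if (length(finite_vals) == 0) {
    stop("no finite values available to cap against")
  }
  # which() drops NA comparisons, so missing entries are not selected
  x[which(x == Inf)]  <- max(finite_vals)
  x[which(x == -Inf)] <- min(finite_vals)
  x
}

capped <- cap_infinite(c(2, 4, Inf, -Inf, 8, NA))
print(capped)
```

Applied column by column, for example `data_inf$x1 <- cap_infinite(data_inf$x1)`, this keeps all rows while removing the infinite values. Any remaining `NA` values still need to be removed or imputed separately.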

### Handling Both `NA` and `Inf` Values

To generalize the solution, you can handle both `NA` and `Inf` values at the same time by using a combination of `is.finite()` and other functions to clean the dataset:

```r
# Load the randomForest package
library(randomForest)

# Create a sample data frame with both NA and Inf values
data_inf <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y = as.factor(c(1, 0, 1, 0, 1))
)

# Attempt to run randomForest on this dataset (this call fails)
model <- randomForest(x = data_inf[, 1:2], y = data_inf$y)
```

```r
# Clean dataset by removing both NA and Inf
# (is.finite() is FALSE for NA, NaN, Inf, and -Inf)
clean_data <- data_inf[is.finite(rowSums(data_inf[, 1:2])), ]
clean_data <- na.omit(clean_data)

# Run randomForest on the cleaned dataset
model <- randomForest(x = clean_data[, 1:2], y = clean_data$y)

# Print model output
print(model)
```

**Output:**

```
Call:
 randomForest(x = clean_data[, 1:2], y = clean_data$y) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1
```

The printout continues with the OOB error estimate and confusion matrix, which are computed from the three remaining rows (1, 2, and 5) and vary between runs.

This approach removes all rows containing either `NA` or `Inf` values before fitting the model.
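The same row filter can be written as a reusable helper that checks each predictor column separately. This is a sketch under the assumption that all predictors are numeric; the names `keep_finite_rows` and `mixed` are invented for illustration.

```r
# Keep only rows where every listed predictor is finite.
# is.finite() is FALSE for NA, NaN, Inf, and -Inf, so one test covers all four.
keep_finite_rows <- function(df, predictor_cols) {
  # One logical vector per column, combined with element-wise AND
  ok <- Reduce(`&`, lapply(df[predictor_cols], is.finite))
  df[ok, , drop = FALSE]
}

mixed <- data.frame(
  x1 = c(2, 4, NA, 6, 8),
  x2 = c(1, 3, 5, Inf, 7),
  y  = as.factor(c(1, 0, 1, 0, 1))
)

cleaned <- keep_finite_rows(mixed, c("x1", "x2"))
print(cleaned)
```

Checking columns individually, rather than testing `rowSums()`, avoids dropping a row whose values are all finite but whose sum overflows to `Inf`. That is a rare edge case, but one that costs nothing to avoid.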

## Conclusion

The `randomForest.default(...) : NA not permitted in predictors` error is triggered by `NA` values in your predictors. The `randomForest.default(m, y, ...) : NA/NaN/Inf in foreign function call` error is typically triggered by infinite (`Inf`) values, which pass the `NA` check and are only rejected when the data is handed to the underlying compiled code. You can solve both errors by removing or replacing the problematic values. The approach you choose depends on your data and analysis needs. These simple techniques will help you avoid these errors and enable a smooth modeling process with the `randomForest` package in R.

Congratulations on reading to the end of this tutorial!

For further reading on data science related errors in R, go to the articles:

- How to Solve Error in randomforest.default(m, y, …) : can’t have empty classes in y
- How to Solve R Warning: glm.fit algorithm did not converge

Go to the online courses page on R to learn more about coding in R for data science and machine learning.

Have fun and happy researching!

Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.