
How to Solve R Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)


If you try to perform k-means clustering on data containing NA, NaN, or Inf values, R raises the error: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1).

In R, the kmeans() function cannot handle data containing NA, NaN, or Inf values. With these values present, the cluster means and the distances to them are no longer well defined, so the algorithm cannot determine which cluster centre is closest to each point.
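To see why these values break the centre calculation, here is a minimal sketch in base R. The vector x is made up for illustration and is not part of the tutorial's data.

```r
# Illustrative vector (an assumption, not taken from the tutorial's data set)
x <- c(1, 2, NA)

mean(x)          # NA: a single missing value makes the mean missing
mean(c(1, Inf))  # Inf: an infinite value drags the mean to infinity
Inf - Inf        # NaN: differences between infinite values are undefined

# is.finite() is FALSE for NA, NaN, Inf and -Inf alike
is.finite(c(1, NA, NaN, Inf, -Inf))
```

Any of these results propagates into the cluster means and distances that k-means relies on, which is why the values have to be handled before clustering.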

You can solve this error by replacing the Inf values with NA and then removing the rows containing missing values using na.omit(). Alternatively, you can impute the missing values.

This tutorial will go through the error in detail and how to solve it with code examples.


Example

Consider the following data frame, which contains several NaN, NA and Inf values.

df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))
df
   var1 var2 var3
1     2  NaN   22
2   NaN   14   NA
3     4   14   19
4     6    7   23
5     7    7   25
6   Inf   15   21
7     8   10   19
8     6    9   16
9    10    9   12
10   12  Inf   15

Let’s attempt to perform k-means clustering on the data frame using the kmeans() function:

km <- kmeans(df, centers=3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

The error occurs because the data frame contains NA, NaN and Inf values.
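Before cleaning the data, it can help to find out where the problem values are. The following is a minimal sketch using base R; the names bad_per_column and bad_rows are introduced here for illustration, and the approach assumes every column is numeric.

```r
# Same data frame as in the example above
df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# Count the non-finite entries (NA, NaN, Inf, -Inf) in each column
bad_per_column <- sapply(df, function(x) sum(!is.finite(x)))
bad_per_column   # var1: 2, var2: 2, var3: 1

# Row numbers that contain at least one non-finite entry
bad_rows <- which(rowSums(!is.finite(as.matrix(df))) > 0)
bad_rows         # rows 1, 2, 6 and 10
```

is.finite() is applied to a matrix here because it does not accept a data frame directly.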

Solution #1: Remove Rows

We need to clean the data frame of the values that kmeans() cannot handle. First, we will replace the Inf values with NA using do.call().

df_noinf <- do.call(data.frame,lapply(df, function(x) replace(x, is.infinite(x),NA)))
df_noinf

Here, lapply() replaces the infinite values in each column with NA (is.infinite() matches both Inf and -Inf), and do.call() reassembles the columns into a data frame. Let’s look at the updated data frame.

   var1 var2 var3
1     2  NaN   22
2   NaN   14   NA
3     4   14   19
4     6    7   23
5     7    7   25
6    NA   15   21
7     8   10   19
8     6    9   16
9    10    9   12
10   12   NA   15

Next we will call the na.omit() function to remove the rows containing NA and NaN values.

df_clean <- na.omit(df_noinf)
df_clean

Let’s run the code to see the clean data frame:

  var1 var2 var3
3    4   14   19
4    6    7   23
5    7    7   25
7    8   10   19
8    6    9   16
9   10    9   12
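The two steps above can also be combined into a single filter. This is a sketch of an alternative, not the method used in this tutorial: it keeps only the rows in which every value is finite. The names all_finite and df_clean2 are hypothetical, and the approach assumes all columns are numeric.

```r
# Same data frame as in the example above
df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# TRUE for rows with no NA, NaN, Inf or -Inf in any column
all_finite <- rowSums(!is.finite(as.matrix(df))) == 0

# Keep only the fully finite rows (rows 3, 4, 5, 7, 8 and 9)
df_clean2 <- df[all_finite, ]
df_clean2
```

The result contains the same rows as df_clean, although na.omit() additionally records which rows it dropped in an na.action attribute.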

Now that we have a clean data frame, we can run the k-means clustering algorithm and get the cluster information.

km <- kmeans(df_clean, centers=3)
km

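Let’s run the code to get the result. Because kmeans() chooses its initial centres at random, your cluster numbering and assignments may differ from the output below unless you call set.seed() first.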

K-means clustering with 3 clusters of sizes 3, 1, 2

Cluster means:
  var1 var2 var3
1  6.0   11   18
2 10.0    9   12
3  6.5    7   24

Clustering vector:
3 4 5 7 8 9 
1 3 3 1 1 2 

Within cluster sum of squares by cluster:
[1] 28.0  0.0  2.5
 (between_SS / total_SS =  81.4 %)
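If you want the clustering to be repeatable, one option is to fix the random seed and let kmeans() try several random starts. The sketch below rebuilds the cleaned data frame by hand so that it runs on its own; the seed value 42 and nstart=25 are arbitrary choices for illustration, not values from this tutorial.

```r
# The cleaned data frame from above, rebuilt by hand with its original row names
df_clean <- data.frame(var1=c(4, 6, 7, 8, 6, 10),
                       var2=c(14, 7, 7, 10, 9, 9),
                       var3=c(19, 23, 25, 19, 16, 12),
                       row.names=c("3", "4", "5", "7", "8", "9"))

set.seed(42)  # arbitrary seed so repeated runs give the same result

# nstart=25 runs 25 random initialisations and keeps the best one
km <- kmeans(df_clean, centers=3, nstart=25)

km$cluster       # cluster label for each remaining row
km$tot.withinss  # total within-cluster sum of squares
```

The names of km$cluster are the row names of df_clean, so each label can be traced back to its row in the original data frame.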

Solution #2: Impute Values

If we want to preserve the number of rows, we can instead impute values in place of the NA and NaN values, starting from the df_noinf data frame in which Inf has already been replaced with NA.

df_noinf$var1[is.na(df_noinf$var1)] <- mean(df_noinf$var1, na.rm=TRUE)
df_noinf$var2[is.na(df_noinf$var2)] <- mean(df_noinf$var2, na.rm=TRUE)
df_noinf$var3[is.na(df_noinf$var3)] <- mean(df_noinf$var3, na.rm=TRUE)
df_noinf

In the above code, we use logical indexing to replace the missing values in each column with that column’s mean (is.na() is TRUE for both NA and NaN). Let’s look at the updated data frame.

     var1   var2     var3
1   2.000 10.625 22.00000
2   6.875 14.000 19.11111
3   4.000 14.000 19.00000
4   6.000  7.000 23.00000
5   7.000  7.000 25.00000
6   6.875 15.000 21.00000
7   8.000 10.000 19.00000
8   6.000  9.000 16.00000
9  10.000  9.000 12.00000
10 12.000 10.625 15.00000
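With more than a few columns, repeating the same line for each column becomes tedious. The following sketch applies the same mean imputation to every column at once; impute_mean and df_imputed are hypothetical names introduced here, and the code assumes all columns are numeric.

```r
# df_noinf as it looks after the Inf values have been replaced with NA
df_noinf <- data.frame(var1=c(2, NaN, 4, 6, 7, NA, 8, 6, 10, 12),
                       var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, NA),
                       var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# Replace NA and NaN in a numeric vector with the mean of the remaining values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm=TRUE)
  x
}

# Apply the helper to every column while keeping the data frame structure
df_imputed <- df_noinf
df_imputed[] <- lapply(df_noinf, impute_mean)
df_imputed
```

Assigning to df_imputed[] replaces the column contents without turning the data frame into a plain list.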

Now that we have a clean data frame, we can run the k-means clustering algorithm and get the cluster information.

km <- kmeans(df_noinf, centers=3)
km

Let’s run the code to get the result:

K-means clustering with 3 clusters of sizes 3, 4, 3

Cluster means:
      var1      var2     var3
1 9.333333  9.541667 14.33333
2 6.437500 13.250000 19.52778
3 5.000000  8.208333 23.33333

Clustering vector:
 [1] 3 2 2 3 3 2 2 1 1 1

Within cluster sum of squares by cluster:
[1] 29.09375 26.41377 27.42708
 (between_SS / total_SS =  70.8 %)
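The printed summary is built from components of the kmeans() result that you can also access directly. As a minimal sketch, the code below repeats the imputation in compact form, fits the model and attaches each row's cluster label to the data. The names df_imp and df_labelled and the seed value are assumptions made for illustration.

```r
# Same data frame as in the example above
df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# Replace every non-finite value with the mean of the column's finite values
df_imp <- as.data.frame(lapply(df, function(x) {
  x[!is.finite(x)] <- NA
  x[is.na(x)] <- mean(x, na.rm=TRUE)
  x
}))

set.seed(1)  # arbitrary seed; cluster numbering depends on the random start
km <- kmeans(df_imp, centers=3)

km$size     # number of rows in each cluster
km$centers  # matrix of cluster means, one row per cluster

# Attach the cluster label of each row to the imputed data
df_labelled <- cbind(df_imp, cluster=km$cluster)
df_labelled
```

Because the initial centres are random, the cluster numbers you get may not match the output shown above.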

Summary

Congratulations on reading to the end of this tutorial!


Go to the online courses page on R to learn more about coding in R for data science and machine learning.

Have fun and happy researching!


Suf is a research scientist at Moogsoft, specializing in Natural Language Processing and Complex Networks. Previously he was a Postdoctoral Research Fellow in Data Science working on adaptations of cutting-edge physics analysis techniques to data-intensive problems in industry. In another life, he was an experimental particle physicist working on the ATLAS Experiment of the Large Hadron Collider. His passion is to share his experience as an academic moving into industry while continuing to pursue research.