If you try to perform k-means clustering on data containing NA, NaN or Inf values, R raises the error: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1).

In R, the k-means algorithm cannot handle data containing NA, NaN, or Inf values. With these values present, the cluster means and the distances to them are no longer well defined, so the algorithm cannot determine which cluster centre is closest to each observation.

You can solve this error by replacing the Inf values with NA and then removing the rows containing missing values using na.omit(). Alternatively, you can impute the missing values.

This tutorial will go through the error in detail and how to solve it with code examples.


Example

Consider the following data frame with several NaN, NA and Inf values.

df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))
df
   var1 var2 var3
1     2  NaN   22
2   NaN   14   NA
3     4   14   19
4     6    7   23
5     7    7   25
6   Inf   15   21
7     8   10   19
8     6    9   16
9    10    9   12
10   12  Inf   15

Let’s attempt to perform k-means clustering on the data frame using the kmeans() function:

km <- kmeans(df, centers=3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

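The error occurs because the data frame contains NA, NaN and Inf values. kmeans() hands the data to compiled code, and R refuses to pass any of these values in that call, which is what the "foreign function call (arg 1)" part of the message refers to.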
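
Before cleaning the data, it can help to see where the problem values are. The following is a minimal sketch, not part of the original example, that counts them per column and lists the affected rows. The names bad, bad_per_column and bad_rows are illustrative only.

```r
df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# is.finite() is FALSE for NA, NaN, Inf and -Inf alike
bad <- !is.finite(as.matrix(df))

# Number of problem values in each column
bad_per_column <- colSums(bad)
bad_per_column

# Indices of the rows that contain at least one problem value
bad_rows <- which(rowSums(bad) > 0)
bad_rows
```

If bad_per_column is all zeros, the data frame can be passed to kmeans() as it is.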

Solution #1: Remove Rows

We need to clean the data frame of the values that kmeans() cannot handle. First, we will replace the Inf values with NA using do.call() and lapply().

df_noinf <- do.call(data.frame, lapply(df, function(x) replace(x, is.infinite(x), NA)))
df_noinf

Here lapply() applies replace() to each column, swapping every Inf (and -Inf) value for NA, and do.call() passes the resulting list of columns to data.frame() to rebuild the data frame. Let’s look at the updated data frame.

   var1 var2 var3
1     2  NaN   22
2   NaN   14   NA
3     4   14   19
4     6    7   23
5     7    7   25
6    NA   15   21
7     8   10   19
8     6    9   16
9    10    9   12
10   12   NA   15
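
As an alternative sketch of the same replacement, we can assign the result of lapply() back into the data frame with empty square brackets, which keeps the data frame structure without needing do.call(). The name df_noinf2 is illustrative and not part of the original example.

```r
df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# Work on a copy so the original data frame is left untouched
df_noinf2 <- df

# Assigning into df_noinf2[] replaces the columns but keeps the data frame
df_noinf2[] <- lapply(df_noinf2, function(x) replace(x, is.infinite(x), NA))

# Check that no infinite values remain (NaN entries are left as they were)
any(sapply(df_noinf2, function(x) any(is.infinite(x))))
```

Both versions only touch the infinite values. The NaN and NA entries are dealt with in the next step.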

Next we will call the na.omit() function to remove the rows containing NA and NaN values.

df_clean <- na.omit(df_noinf)
df_clean

Let’s run the code to see the clean data frame:

  var1 var2 var3
3    4   14   19
4    6    7   23
5    7    7   25
7    8   10   19
8    6    9   16
9   10    9   12
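
Since na.omit() drops rows silently, it can be worth recording which rows were removed. The sketch below rebuilds df_noinf by hand so that it runs on its own; the names dropped and kept are illustrative.

```r
# df_noinf as produced above, with Inf already replaced by NA
df_noinf <- data.frame(var1=c(2, NaN, 4, 6, 7, NA, 8, 6, 10, 12),
                       var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, NA),
                       var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))
df_clean <- na.omit(df_noinf)

# na.omit() stores the indices of the removed rows in the "na.action" attribute
dropped <- as.integer(na.action(df_clean))
dropped

# complete.cases() marks the rows without any NA or NaN, i.e. the rows kept
kept <- which(complete.cases(df_noinf))
kept
```

Here four of the ten rows are removed, which is a large share of a small data set and one reason to consider imputation instead.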

Now that we have a clean data frame, we can run the k-means clustering algorithm and get the cluster information.

km <- kmeans(df_clean, centers=3)
km

Let’s run the code to get the result.

K-means clustering with 3 clusters of sizes 3, 1, 2

Cluster means:
  var1 var2 var3
1  6.0   11   18
2 10.0    9   12
3  6.5    7   24

Clustering vector:
3 4 5 7 8 9 
1 3 3 1 1 2 

Within cluster sum of squares by cluster:
[1] 28.0  0.0  2.5
 (between_SS / total_SS =  81.4 %)
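
Because kmeans() picks its starting centres at random, the cluster numbering, and sometimes the clusters themselves, can differ between runs. The following is a minimal sketch of making a run repeatable. The seed 42 and nstart=25 are arbitrary choices for illustration, not values from the example above, and df_clean is rebuilt by hand so its row names are 1 to 6 rather than the original row numbers.

```r
# The six complete rows from the cleaned data frame
df_clean <- data.frame(var1=c(4, 6, 7, 8, 6, 10),
                       var2=c(14, 7, 7, 10, 9, 9),
                       var3=c(19, 23, 25, 19, 16, 12))

# Fix the random number generator so the random starts are repeatable
set.seed(42)

# nstart=25 tries 25 random starts and keeps the best solution
km <- kmeans(df_clean, centers=3, nstart=25)

# Attach each row's cluster label to the data
df_clustered <- cbind(df_clean, cluster=km$cluster)
df_clustered
```

With a different seed the cluster labels may be permuted, so compare the groupings rather than the label numbers.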

Solution #2: Impute Values

If we want to preserve the number of rows, we can instead impute values in place of the NA and NaN values. We start from df_noinf, in which the Inf values have already been replaced with NA.

df_noinf$var1[is.na(df_noinf$var1)] <- mean(df_noinf$var1, na.rm=TRUE)
df_noinf$var2[is.na(df_noinf$var2)] <- mean(df_noinf$var2, na.rm=TRUE)
df_noinf$var3[is.na(df_noinf$var3)] <- mean(df_noinf$var3, na.rm=TRUE)
df_noinf

In the above code, we use the subscript operator to replace the missing values in each column with the mean of that column's remaining values. Because is.na() returns TRUE for both NA and NaN, a single pass handles both. Let’s look at the updated data frame.

     var1   var2     var3
1   2.000 10.625 22.00000
2   6.875 14.000 19.11111
3   4.000 14.000 19.00000
4   6.000  7.000 23.00000
5   7.000  7.000 25.00000
6   6.875 15.000 21.00000
7   8.000 10.000 19.00000
8   6.000  9.000 16.00000
9  10.000  9.000 12.00000
10 12.000 10.625 15.00000
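
With more than a few columns, repeating one line per column becomes tedious. As a sketch of the same mean imputation applied to every column at once, assuming all columns are numeric, we can use lapply(). The name df_imputed is illustrative, and df_noinf is rebuilt by hand so the block runs on its own.

```r
# df_noinf before imputation, with Inf already replaced by NA
df_noinf <- data.frame(var1=c(2, NaN, 4, 6, 7, NA, 8, 6, 10, 12),
                       var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, NA),
                       var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

df_imputed <- df_noinf

# For each column, fill NA and NaN entries with the mean of the other values
df_imputed[] <- lapply(df_imputed, function(x) {
  x[is.na(x)] <- mean(x, na.rm=TRUE)  # is.na() is TRUE for both NA and NaN
  x
})
df_imputed
```

This should give the same values as the three separate assignments, only written once for all columns.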

Now that we have a clean data frame, we can run the k-means clustering algorithm and get the cluster information.

km <- kmeans(df_noinf, centers=3)
km

Let’s run the code to get the result:

K-means clustering with 3 clusters of sizes 3, 4, 3

Cluster means:
      var1      var2     var3
1 9.333333  9.541667 14.33333
2 6.437500 13.250000 19.52778
3 5.000000  8.208333 23.33333

Clustering vector:
 [1] 3 2 2 3 3 2 2 1 1 1

Within cluster sum of squares by cluster:
[1] 29.09375 26.41377 27.42708
 (between_SS / total_SS =  70.8 %)

Summary

Congratulations on reading to the end of this tutorial!


Have fun and happy researching!