If you try to perform k-means clustering on data containing missing values (NA), NaN, or Inf, R will raise the error: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1).
In R, the kmeans() function cannot handle data containing NA, NaN, or Inf values. When these values are present, the cluster means and the distances from each point to the cluster centres are no longer well defined, so the algorithm cannot determine which cluster centre is closest.
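To see why these values are a problem, the short sketch below shows how a single NA or Inf propagates through the two quantities k-means relies on: a mean and a squared distance. The vectors x, point and centre are made-up illustrations, not part of the tutorial's data.

```r
# A single missing value makes a mean undefined unless it is removed first.
x <- c(2, 4, NA, 8)
mean(x)                # NA
mean(x, na.rm = TRUE)  # 4.666667

# The squared Euclidean distance from a point to a centre is NA
# as soon as one coordinate is missing.
point  <- c(var1 = 2, var2 = NA, var3 = 22)
centre <- c(var1 = 6, var2 = 11, var3 = 18)
sum((point - centre)^2)  # NA

# Inf is not missing, but it puts the point infinitely far from every centre.
sum((c(Inf, 15, 21) - centre)^2)  # Inf
```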
You can solve this error by replacing the Inf values with NA and then removing the rows that contain missing values using na.omit. Alternatively, you can impute the missing values.
This tutorial will go through the error in detail and how to solve it with code examples.
Example
Consider the following data frame with several NaN, NA and Inf values.
df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))
df
   var1 var2 var3
1     2  NaN   22
2   NaN   14   NA
3     4   14   19
4     6    7   23
5     7    7   25
6   Inf   15   21
7     8   10   19
8     6    9   16
9    10    9   12
10   12  Inf   15
Let’s attempt to perform k-means clustering on the data frame using the kmeans() function:
km <- kmeans(df, centers=3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
The error occurs because the data frame contains NA, NaN and Inf values.
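Before cleaning the data, it can help to see where the problem values are. The following is a minimal sketch that uses is.finite(), which returns FALSE for NA, NaN, Inf and -Inf alike; the variable name bad is only illustrative.

```r
df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# Logical matrix with one column per variable: TRUE marks a value
# that kmeans() cannot use.
bad <- !sapply(df, is.finite)

colSums(bad)             # number of problem values in each column
which(rowSums(bad) > 0)  # indices of rows with at least one problem value
```

For this data frame, the affected rows are 1, 2, 6 and 10.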
Solution #1: Remove Rows
We need to clean the data frame of the values that kmeans cannot handle. First, we will replace the Inf values with NA using do.call.
df_noinf <- do.call(data.frame, lapply(df, function(x) replace(x, is.infinite(x), NA)))
df_noinf
Here, lapply applies replace() to each column to swap the Inf values for NA, and do.call(data.frame, ...) reassembles the resulting list of columns into a data frame. Let’s look at the updated data frame.
   var1 var2 var3
1     2  NaN   22
2   NaN   14   NA
3     4   14   19
4     6    7   23
5     7    7   25
6    NA   15   21
7     8   10   19
8     6    9   16
9    10    9   12
10   12   NA   15
Next we will call the na.omit() function to remove the rows containing NA and NaN values.
df_clean <- na.omit(df_noinf)
df_clean
Let’s run the code to see the clean data frame:
  var1 var2 var3
3    4   14   19
4    6    7   23
5    7    7   25
7    8   10   19
8    6    9   16
9   10    9   12
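The two cleaning steps above can also be collapsed into a single row filter. This is a sketch of one possible alternative, not the approach the tutorial uses; keep and df_clean2 are illustrative names.

```r
df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# TRUE for rows in which every value is finite (no NA, NaN, Inf or -Inf).
keep <- rowSums(!sapply(df, is.finite)) == 0

# Subsetting keeps the original row names (3, 4, 5, 7, 8, 9).
df_clean2 <- df[keep, ]
df_clean2
```

The values match the na.omit() result. The only difference is that na.omit() also attaches an na.action attribute recording which rows were dropped.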
Now that we have a clean data frame, we can run the k-means clustering algorithm and get the cluster information.
km <- kmeans(df_clean, centers=3)
km
Let’s run the code to get the result.
K-means clustering with 3 clusters of sizes 3, 1, 2

Cluster means:
  var1 var2 var3
1  6.0   11   18
2 10.0    9   12
3  6.5    7   24

Clustering vector:
3 4 5 7 8 9
1 3 3 1 1 2

Within cluster sum of squares by cluster:
[1] 28.0  0.0  2.5
 (between_SS / total_SS =  81.4 %)
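When centers is a number, kmeans() chooses its starting centres at random, so the cluster numbering (and sometimes the partition itself) can differ from the output shown here between runs. A minimal sketch for making the result repeatable is given below; the seed value 123 and nstart=25 are arbitrary choices for illustration, and df_clean is rebuilt by hand so the block runs on its own (with default row names rather than 3, 4, 5, 7, 8, 9).

```r
df_clean <- data.frame(var1=c(4, 6, 7, 8, 6, 10),
                       var2=c(14, 7, 7, 10, 9, 9),
                       var3=c(19, 23, 25, 19, 16, 12))

set.seed(123)  # fix the random number generator so the run is repeatable

# nstart=25 tries 25 random sets of starting centres and keeps the best fit.
km <- kmeans(df_clean, centers=3, nstart=25)

km$cluster       # cluster label assigned to each row
km$centers       # cluster means
km$tot.withinss  # total within-cluster sum of squares
```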
Solution #2: Impute Values
If we want to preserve the number of rows we can instead impute values in place of the NA and NaN values.
df_noinf$var1[is.na(df_noinf$var1)] <- mean(df_noinf$var1, na.rm=TRUE)
df_noinf$var2[is.na(df_noinf$var2)] <- mean(df_noinf$var2, na.rm=TRUE)
df_noinf$var3[is.na(df_noinf$var3)] <- mean(df_noinf$var3, na.rm=TRUE)
df_noinf
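The code above starts from df_noinf, the data frame from Solution #1 in which the Inf values have already been replaced with NA. For each column, we use the subscript operator to replace the NA and NaN values with the mean of that column’s remaining values. Let’s look at the updated data frame.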
     var1   var2     var3
1   2.000 10.625 22.00000
2   6.875 14.000 19.11111
3   4.000 14.000 19.00000
4   6.000  7.000 23.00000
5   7.000  7.000 25.00000
6   6.875 15.000 21.00000
7   8.000 10.000 19.00000
8   6.000  9.000 16.00000
9  10.000  9.000 12.00000
10 12.000 10.625 15.00000
Now that we have a clean data frame, we can run the k-means clustering algorithm and get the cluster information.
km <- kmeans(df_noinf, centers=3)
km
Let’s run the code to get the result:
K-means clustering with 3 clusters of sizes 3, 4, 3

Cluster means:
      var1      var2     var3
1 9.333333  9.541667 14.33333
2 6.437500 13.250000 19.52778
3 5.000000  8.208333 23.33333

Clustering vector:
 [1] 3 2 2 3 3 2 2 1 1 1

Within cluster sum of squares by cluster:
[1] 29.09375 26.41377 27.42708
 (between_SS / total_SS =  70.8 %)
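The column-by-column imputation in this solution repeats the same line for each variable. As a sketch of how it could be written once and applied to every column, the block below defines impute_mean, a hypothetical helper name rather than a base R function. The input df_na is a hand-built copy of the data frame after the Inf values were replaced with NA.

```r
df_na <- data.frame(var1=c(2, NaN, 4, 6, 7, NA, 8, 6, 10, 12),
                    var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, NA),
                    var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# Hypothetical helper: replace NA and NaN in a numeric vector with the
# mean of the remaining values. is.na() is TRUE for both NA and NaN.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm=TRUE)
  x
}

# Apply the helper to every column and rebuild the data frame.
df_imputed <- as.data.frame(lapply(df_na, impute_mean))
df_imputed
```

Swapping mean for median inside the helper would give median imputation instead, which is less sensitive to extreme values.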
Summary
Congratulations on reading to the end of this tutorial!
For further reading on R-related errors, go to the articles:
- How to Solve R Error in sort.int(x, na.last = na.last, decreasing = decreasing, …) : ‘x’ must be atomic
- How to Solve R Error: replacement has length zero
Go to the online courses page on R to learn more about coding in R for data science and machine learning.
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.