This tutorial shows how to count the number of missing values (NAs) in a data frame in R.
Example
Let’s look at an example using the built-in airquality dataset.
Get Airquality Data
First, let’s look at the head of the airquality dataset.
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
We can see that there are NA values in the data frame, but we need to determine how many there are.
Solution #1: Use summary
The simplest way to see the number of NAs in each column of the data frame is to call the summary function on it:
summary(airquality)
     Ozone           Solar.R           Wind             Temp           Month
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
 NA's   :37       NA's   :7
      Day
 Min.   : 1.0
 1st Qu.: 8.0
 Median :16.0
 Mean   :15.8
 3rd Qu.:23.0
 Max.   :31.0
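The summary function returns summary statistics for each column in the data frame, plus an NA count for every column that contains missing values. We can see there are 37 NA values in Ozone and 7 NA values in Solar.R. The other columns show no NA's row because they contain no missing values.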
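The counts above are part of printed output, so reading one back in code takes an extra step. The following is a minimal sketch, not part of the original walkthrough, that calls summary on a single numeric column and looks up the "NA's" entry by name. The variable names are illustrative.

```r
# summary() on a numeric vector returns a named vector; the "NA's" entry
# is present only when the vector contains at least one NA.
ozone_summary <- summary(airquality$Ozone)
ozone_na <- ozone_summary[["NA's"]]
print(ozone_na)

# Wind has no missing values, so its summary has no "NA's" entry at all.
wind_summary <- summary(airquality$Wind)
wind_has_na_entry <- "NA's" %in% names(wind_summary)
print(wind_has_na_entry)
```

Because the entry is missing for complete columns, it is safer to check for the name before indexing than to assume it exists.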
Solution #2: Use sum and is.na
The second way to get the total number of NAs in the data frame is to combine is.na and sum. Calling is.na on a data frame returns TRUE or FALSE for each value, and sum() counts each TRUE as 1, so the result is the total number of missing values. Let’s look at the code:
sum(is.na(airquality))
Let’s run the code to see the result:
[1] 44
There is a total of 44 NAs in the data frame.
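To make the two steps visible, here is a small sketch on a made-up data frame. The data frame toy and its values are hypothetical and chosen only for illustration.

```r
# A tiny, made-up data frame with three missing values
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", NA, NA)
)

# Step 1: is.na() returns a logical matrix with the same shape as the data frame
na_mask <- is.na(toy)
print(na_mask)

# Step 2: sum() treats TRUE as 1 and FALSE as 0, so it counts the NAs
toy_total <- sum(na_mask)
print(toy_total)
```

The same two steps, applied to airquality, produce the total of 44 shown above.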
Solution #3: Use sum and is.na in a Function
If we want the number of NAs per column in a data frame, we can define a function that iterates over the columns and counts the NAs in each one using sum() and is.na. Let’s look at the code:
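f <- function(x) {
  res <- NULL                       # accumulates one row per column
  for (i in seq_len(ncol(x))) {
    temp <- sum(is.na(x[, i]))      # number of NAs in column i
    temp <- as.data.frame(temp)
    temp$var <- colnames(x)[i]      # record the column name
    res <- rbind(res, temp)
  }
  return(res)
}
Let’s call the function to see the result:
f(airquality)
  temp     var
1   37   Ozone
2    7 Solar.R
3    0    Wind
4    0    Temp
5    0   Month
6    0     Day
There are 37 NAs in the first column (Ozone) and 7 NAs in the second column (Solar.R). The remaining columns have none.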
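As an alternative to the loop, the same per-column counts can be sketched with base R's colSums applied to the logical matrix from is.na. This is a different technique from the function above, shown only for comparison. The name na_per_column is illustrative.

```r
# is.na() gives a logical matrix; colSums() adds up the TRUE values per column
na_per_column <- colSums(is.na(airquality))
print(na_per_column)

# Keep only the columns that contain at least one NA
print(na_per_column[na_per_column > 0])
```

The result is a named numeric vector rather than a data frame, so individual counts can be looked up by column name, for example na_per_column[["Ozone"]].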
Summary
Congratulations on reading to the end of this tutorial!
Go to the online courses page on R to learn more about coding in R for data science and machine learning.
For further reading on data analysis with R, go to the article: How to Download and Plot Stock Prices with quantmod in R
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.
