This tutorial will go through counting the number of missing values or NAs in a data frame in R.
Example
Let’s look at an example using the built-in airquality dataset.
Get Airquality Data
First, let’s look at the head of the airquality dataset.
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
We can see that there are NA values in the data frame, but we need to determine how many there are.
Solution #1: Use summary
The simplest way to get the number of NAs in each column of the data frame is to use the summary function. Let’s call summary on the data frame:
summary(airquality)
     Ozone           Solar.R           Wind             Temp           Month
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
 NA's   :37       NA's   :7
      Day
 Min.   : 1.0
 1st Qu.: 8.0
 Median :16.0
 Mean   :15.8
 3rd Qu.:23.0
 Max.   :31.0
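The summary function returns statistical summaries of each column in the data frame, along with the number of NAs for every column that contains any. We can see there are 37 NA values in Ozone and 7 NA values in Solar.R. The other columns have no NA's row, which means they have no missing values.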
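If only one column is of interest, the count that summary prints can also be pulled out by name. The following is a minimal sketch, assuming the column is numeric and contains at least one NA; the variable names ozone_summary and ozone_na are illustrative only.

```r
# Summarise a single numeric column of the built-in airquality data
ozone_summary <- summary(airquality$Ozone)

# The NA count is stored under the name "NA's"; this element is only
# present when the column contains at least one NA
ozone_na <- ozone_summary[["NA's"]]
ozone_na
```

For a column without missing values, the "NA's" element does not exist, so this lookup is best reserved for quick interactive checks.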
Solution #2: Use sum and is.na
The second way to get the total number of NAs in the data frame is to combine is.na and sum. is.na returns TRUE or FALSE for each value in the data set, and sum() counts the TRUE values, because TRUE is treated as 1 and FALSE as 0. Let’s look at the code:
sum(is.na(airquality))
Let’s run the code to see the result:
[1] 44
There is a total of 44 NAs in the data frame.
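To see why this works, it can help to look at the intermediate result on a very small example. The data frame df below is made up for illustration and is not part of airquality.

```r
# A small hypothetical data frame with three missing values
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "z")
)

# is.na() returns a logical matrix with the same shape as df
na_mask <- is.na(df)
na_mask

# sum() treats TRUE as 1 and FALSE as 0, so this counts the NAs
total_na <- sum(na_mask)
total_na
```

The same two steps happen inside sum(is.na(airquality)); they are just written as a single expression.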
Solution #3: Use sum and is.na in Function
If we want to get the number of NAs per column in a data frame, we can define a function that iterates over each column and counts the NAs using sum() and is.na. Let’s look at the code:
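f <- function(x) {
  # Accumulate one row per column; res starts out empty
  res <- NULL
  for (i in seq_len(ncol(x))) {
    # Count the NAs in column i
    temp <- sum(is.na(x[, i]))
    temp <- as.data.frame(temp)
    # Record the column name alongside its count
    temp$var <- colnames(x)[i]
    res <- rbind(res, temp)
  }
  return(res)
}
Let’s call the function to see the result:
f(airquality)
  temp     var
1   37   Ozone
2    7 Solar.R
3    0    Wind
4    0    Temp
5    0   Month
6    0     Day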
There are 37 NAs in the first column (Ozone) and 7 NAs in the second column (Solar.R). The remaining columns have no NAs.
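The per-column count can also be obtained without writing a loop. The sketch below swaps in base R's colSums, which is a different technique from the function above; the variable name na_per_column is illustrative only.

```r
# Count NAs per column without an explicit loop:
# is.na() gives a logical matrix, colSums() adds up each column
na_per_column <- colSums(is.na(airquality))
na_per_column

# Keep only the columns that actually contain missing values
na_per_column[na_per_column > 0]
```

The result is a named numeric vector rather than a data frame, which is often enough for a quick check and keeps the column names attached to the counts.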
Summary
Congratulations on reading to the end of this tutorial!
For further reading on data analysis with R, go to the article: How to Download and Plot Stock Prices with quantmod in R
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.