Fix R Warning: aggregate function missing

by Suf | Programming, R, Tips

This warning occurs when you use the dcast function to convert a data frame from long to wide format and more than one value maps to a single output cell of the wide data frame. You can stop the warning by specifying the fun.aggregate argument when calling dcast.

This tutorial will explain how to solve the R warning with code examples.

Example

Let’s look at an example to reproduce the warning. First, let’s create a data frame containing the scores of two students on two exams in each of two subjects, Chemistry and Physics.

#create data frame
df <- data.frame(student=c('Alex', 'Alex', 'Alex', 'Alex', 'Bill', 'Bill', 'Bill', 'Bill'),
                 subject=c('Chemistry', 'Chemistry', 'Physics', 'Physics', 'Chemistry', 'Chemistry', 'Physics', 'Physics'),
                 test=c('Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2'),
                 score=c(82, 78, 69, 80, 50, 61, 75, 82))

df

  student   subject  test score
1    Alex Chemistry Exam1    82
2    Alex Chemistry Exam2    78
3    Alex   Physics Exam1    69
4    Alex   Physics Exam2    80
5    Bill Chemistry Exam1    50
6    Bill Chemistry Exam2    61
7    Bill   Physics Exam1    75
8    Bill   Physics Exam2    82

Next, we will install (if not installed previously) and load the reshape2 package, and then use the reshape2 dcast function to cast the data frame from long to wide format.

install.packages('reshape2')
library(reshape2)

df_wide <- dcast(df, student ~ test, value.var="score")

Aggregation function missing: defaulting to length

df_wide

  student Exam1 Exam2
1    Alex     2     2
2    Bill     2     2

We can see that the dcast function works, but we receive a warning message from the R interpreter: Aggregation function missing: defaulting to length.

Solution

The warning occurs because more than one score value can go into each output cell of df_wide.

For example, for the student Alex and the test Exam1, the score could be 82 for Chemistry or 69 for Physics.

Because there is more than one value to choose from and the fun.aggregate argument is not specified, the dcast function defaults to using length as the aggregate function. For example, we can see when using length, that for student Alex and Exam1, there is a total of 2 scores.
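Before casting, we can check whether any output cell would receive more than one value. The following is a minimal sketch in base R, assuming the same df as in the example; the names counts and needs_aggregation are only illustrative.

```r
# Same data as in the example above, rebuilt so the snippet is self-contained
df <- data.frame(student=rep(c('Alex', 'Bill'), each=4),
                 subject=rep(rep(c('Chemistry', 'Physics'), each=2), times=2),
                 test=rep(c('Exam1', 'Exam2'), times=4),
                 score=c(82, 78, 69, 80, 50, 61, 75, 82))

# Count the rows that fall into each student/test combination
counts <- table(df$student, df$test)
counts

# TRUE if at least one cell of the wide data frame would hold several values
needs_aggregation <- any(counts > 1)
needs_aggregation
```

Every count here is 2, which matches the values that dcast returned when it defaulted to length.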

We can use a different aggregate function, like sum, mean, or sd. In this example, mean is a suitable choice because it gives each student's average score across the two subjects for each exam.
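To see how the choice of aggregate function changes the cell values, here is a small base R sketch that uses tapply instead of dcast. It is only a cross-check under the assumption that df matches the example above; cell_means and cell_sums are illustrative names.

```r
# Same data as in the example above, rebuilt so the snippet is self-contained
df <- data.frame(student=rep(c('Alex', 'Bill'), each=4),
                 subject=rep(rep(c('Chemistry', 'Physics'), each=2), times=2),
                 test=rep(c('Exam1', 'Exam2'), times=4),
                 score=c(82, 78, 69, 80, 50, 61, 75, 82))

# Apply an aggregate function to the scores in each student/test cell
cell_means <- with(df, tapply(score, list(student, test), mean))
cell_sums  <- with(df, tapply(score, list(student, test), sum))

cell_means
cell_sums
```

Each result is a matrix with students as rows and exams as columns, so we can compare the two aggregate functions cell by cell.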

Let’s look at the revised code:

install.packages('reshape2')
library(reshape2)

df_wide <- dcast(df, student ~ test, value.var="score", fun.aggregate=mean)

df_wide

  student Exam1 Exam2
1    Alex  75.5  79.0
2    Bill  62.5  71.5

When we run the code with the fun.aggregate argument specified, we do not receive the warning message.

We can interpret the result as follows. Because both Chemistry and Physics have an Exam1 and an Exam2, each cell is the average of a student's two subject scores for that exam:

Student Alex has an average score of 75.5 for Exam1 and an average score of 79.0 for Exam2.
Student Bill has an average score of 62.5 for Exam1 and an average score of 71.5 for Exam2.
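If averaging across subjects is not what we want, one alternative is to add subject to the left-hand side of the formula. This is a sketch that goes beyond the solution above and assumes the same df and the reshape2 package. Each student, subject, and test combination then identifies exactly one score, so dcast has nothing to aggregate.

```r
library(reshape2)

# Same data as in the example above, rebuilt so the snippet is self-contained
df <- data.frame(student=rep(c('Alex', 'Bill'), each=4),
                 subject=rep(rep(c('Chemistry', 'Physics'), each=2), times=2),
                 test=rep(c('Exam1', 'Exam2'), times=4),
                 score=c(82, 78, 69, 80, 50, 61, 75, 82))

# Keep subject as an identifier column so every output cell holds one score
df_by_subject <- dcast(df, student + subject ~ test, value.var="score")

df_by_subject
```

The result has four rows, one per student and subject, with the original scores in the Exam1 and Exam2 columns.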

Summary

Congratulations on reading to the end of this tutorial!

Go to the online courses page on R to learn more about coding in R for data science and machine learning.

Have fun and happy researching!

Suf
