This warning occurs when you use the dcast function to convert a data frame from long to wide format, but more than one value could be placed in an individual cell of the wide data frame. You can stop the warning from occurring by specifying the fun.aggregate argument when calling dcast.
This tutorial will explain how to solve the R warning with code examples.
Example
Let’s look at an example to reproduce the warning. First, let’s create a data frame containing the scores of two students on two exams in each of two subjects, Chemistry and Physics.
# create data frame
df <- data.frame(student=c('Alex', 'Alex', 'Alex', 'Alex', 'Bill', 'Bill', 'Bill', 'Bill'),
                 subject=c('Chemistry', 'Chemistry', 'Physics', 'Physics', 'Chemistry', 'Chemistry', 'Physics', 'Physics'),
                 test=c('Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2'),
                 score=c(82, 78, 69, 80, 50, 61, 75, 82))

# view data frame
df
  student   subject  test score
1    Alex Chemistry Exam1    82
2    Alex Chemistry Exam2    78
3    Alex   Physics Exam1    69
4    Alex   Physics Exam2    80
5    Bill Chemistry Exam1    50
6    Bill Chemistry Exam2    61
7    Bill   Physics Exam1    75
8    Bill   Physics Exam2    82
Next, we will install (if not installed previously) and load the reshape2 package, and then use its dcast function to cast the data frame from long to wide format.
install.packages('reshape2')
library(reshape2)

# cast from long to wide format
df_wide <- dcast(df, student ~ test, value.var="score")

# view wide data frame
df_wide
Aggregation function missing: defaulting to length
  student Exam1 Exam2
1    Alex     2     2
2    Bill     2     2
We can see that the dcast function works, but we receive a warning message from the R interpreter: Aggregation function missing: defaulting to length.
Solution
The warning occurs because more than one score value can go into each output cell of df_wide.
For example, for the student Alex and the test Exam1, the score could be 82 for Chemistry or 69 for Physics.
Because there is more than one value to choose from and the fun.aggregate argument is not specified, the dcast function defaults to using length as the aggregate function. Each cell then holds a count rather than a score: for student Alex and Exam1, the value 2 means that two scores were found.
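One way to check in advance whether a cast will need aggregation is to count the rows that map to each output cell. The following is a minimal sketch in base R and is not part of the original example; the names cell_counts, n_scores and needs_aggregation are illustrative.

```r
# Recreate the example data frame
df <- data.frame(
  student = c('Alex', 'Alex', 'Alex', 'Alex', 'Bill', 'Bill', 'Bill', 'Bill'),
  subject = c('Chemistry', 'Chemistry', 'Physics', 'Physics',
              'Chemistry', 'Chemistry', 'Physics', 'Physics'),
  test = c('Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2'),
  score = c(82, 78, 69, 80, 50, 61, 75, 82)
)

# Count how many rows fall into each student/test cell of the wide table
cell_counts <- aggregate(score ~ student + test, data = df, FUN = length)
names(cell_counts)[3] <- "n_scores"
cell_counts

# Any count above 1 means that cell receives more than one value
needs_aggregation <- any(cell_counts$n_scores > 1)
needs_aggregation
```

Here every student/test combination has two rows (one per subject), which matches the 2s that dcast produced when it fell back to length.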
We can use a different aggregate function, like sum, mean, or sd. In this example, mean is a suitable choice, because it gives each student's average score across the two subjects for each exam.
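To illustrate how the choice of aggregate function changes the result, here is a minimal sketch that uses sum instead. It assumes reshape2 is already installed; the name df_sum is illustrative.

```r
library(reshape2)

# Recreate the example data frame
df <- data.frame(
  student = c('Alex', 'Alex', 'Alex', 'Alex', 'Bill', 'Bill', 'Bill', 'Bill'),
  subject = c('Chemistry', 'Chemistry', 'Physics', 'Physics',
              'Chemistry', 'Chemistry', 'Physics', 'Physics'),
  test = c('Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2'),
  score = c(82, 78, 69, 80, 50, 61, 75, 82)
)

# Add up the Chemistry and Physics scores for each student and exam
df_sum <- dcast(df, student ~ test, value.var = "score", fun.aggregate = sum)
df_sum
# Alex: Exam1 = 82 + 69 = 151, Exam2 = 78 + 80 = 158
# Bill: Exam1 = 50 + 75 = 125, Exam2 = 61 + 82 = 143
```

Each cell now holds a total over the two subjects rather than a count, so the function should be chosen to match the question you want the wide table to answer.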
Let’s look at the revised code, which uses mean:
install.packages('reshape2')
library(reshape2)

# cast from long to wide format, averaging the scores in each cell
df_wide <- dcast(df, student ~ test, value.var="score", fun.aggregate=mean)

# view wide data frame
df_wide
  student Exam1 Exam2
1    Alex  75.5  79.0
2    Bill  62.5  71.5
When we run the code with the fun.aggregate argument specified, we do not receive the warning message.
We can interpret this as follows. Both Physics and Chemistry have an Exam1 and an Exam2, so each value is the average over the two subjects:
- Student Alex has an average score of 75.5 for Exam1 and an average score of 79.0 for Exam2
- Student Bill has an average score of 62.5 for Exam1 and an average score of 71.5 for Exam2
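If you would rather keep every individual score than aggregate, one option is to add subject to the left-hand side of the formula, so that each student/subject/test combination maps to exactly one cell. The following is a minimal sketch that assumes reshape2 is already installed; the name df_by_subject is illustrative.

```r
library(reshape2)

# Recreate the example data frame
df <- data.frame(
  student = c('Alex', 'Alex', 'Alex', 'Alex', 'Bill', 'Bill', 'Bill', 'Bill'),
  subject = c('Chemistry', 'Chemistry', 'Physics', 'Physics',
              'Chemistry', 'Chemistry', 'Physics', 'Physics'),
  test = c('Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2', 'Exam1', 'Exam2'),
  score = c(82, 78, 69, 80, 50, 61, 75, 82)
)

# One row per student and subject, one column per exam;
# every cell receives exactly one score, so nothing has to be aggregated
df_by_subject <- dcast(df, student + subject ~ test, value.var = "score")
df_by_subject
```

The result has four rows (Alex and Bill, each with Chemistry and Physics) and keeps the original scores in the Exam1 and Exam2 columns.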
Summary
Congratulations on reading to the end of this tutorial!
For further reading on R-related errors, go to the articles:
- How to Solve R Error: $ operator is invalid for atomic vectors
- How to Solve R Error: object of type ‘closure’ is not subsettable
- How to Solve R Error missing value where TRUE/FALSE needed
Go to the online courses page on R to learn more about coding in R for data science and machine learning.
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.