How to Solve R Error in fix.by(by.y, y): ‘by’ must specify a uniquely valid column

by | Programming, R, Tips

If you try to merge two data frames by a column that is present in only one of them, R raises the error fix.by(by.y, y): ‘by’ must specify a uniquely valid column.

You can solve this error either by passing to by a column name that is present in both data frames, or by passing each data frame’s own column name to by.x and by.y.

This tutorial will go through how to solve the error with code examples.


What is Merge in R?

The merge function merges two data frames by common columns or row names.

In R, merge() joins two data frames on the columns named in the by parameter. If you leave by out, it defaults to the column names that the two data frames have in common.

If the key column has the same name in both data frames, you can use by. If the key column is named differently in each data frame, you need to specify by.x and by.y instead.
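As a minimal sketch of these calling styles, the snippet below uses two small made-up data frames. The names left, right, right2 and their columns are illustrative only and are not part of this tutorial’s example.

```r
# Hypothetical data frames for illustration only.
left  <- data.frame(id = 1:3, a = c(10, 20, 30))
right <- data.frame(id = 1:3, b = c("x", "y", "z"))

# 1. No 'by': merge() joins on the shared column names, here "id".
m_default <- merge(left, right)

# 2. Explicit 'by': the named column must exist in both data frames.
m_by <- merge(left, right, by = "id")

# 3. Different key names: give one name per data frame.
right2 <- data.frame(key = 1:3, b = c("x", "y", "z"))
m_xy <- merge(left, right2, by.x = "id", by.y = "key")

# The key column keeps the name from the first data frame.
names(m_xy)
```

In the third call the merged result has a single key column named id, because merge() takes the key column name from by.x.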

Example

Let’s look at an example where we have two data frames. Each data frame has a column of variable ID numbers and a column of variable values, which are sampled randomly from the normal distribution. Because the values are random, the numbers you see will differ from the output shown here.

dat1 <- data.frame(var_ID = 1:10, x1 = rnorm(10))
dat1
   var_ID         x1
1       1 -0.4343253
2       2 -0.5291911
3       3  1.2316967
4       4 -0.4829048
5       5  0.2598425
6       6 -0.7514874
7       7 -0.6536955
8       8 -0.8750813
9       9 -0.2649102
10     10  0.3956067
dat2 <- data.frame(VAR_ids = 1:10, y1 = rnorm(10))
dat2
   VAR_ids         y1
1        1 -0.8518977
2        2  0.0305206
3        3  0.4972952
4        4  2.1803895
5        5 -2.6383560
6        6 -1.2931126
7        7  0.7551982
8        8 -0.1333365
9        9 -0.3959812
10      10  1.2125677

Next, we will try to merge the data frames using the merge() function.

dat_merge <- merge(dat1, dat2, by='var_ID')

Let’s run the code to see what happens:

Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column

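R raises the error because the column var_ID exists only in dat1; the corresponding column in dat2 is named VAR_ids. The fix.by(by.y, y) part of the message points at the second data frame (y), which is where the column lookup failed. Had the column been missing from the first data frame instead, the message would name fix.by(by.x, x). We can only use the by parameter with a column name that exists in both data frames.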
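One way to confirm the cause before calling merge() is to compare the column names of the two data frames. The sketch below rebuilds data frames with the same column names as the example (the values are random, so they will not match the output above). The variables shared and msg are introduced here for illustration, and tryCatch() is used only to capture the error message without stopping the script.

```r
dat1 <- data.frame(var_ID = 1:10, x1 = rnorm(10))
dat2 <- data.frame(VAR_ids = 1:10, y1 = rnorm(10))

# Column names the two data frames share. The result is empty here,
# so no single name can be passed to 'by'.
shared <- intersect(names(dat1), names(dat2))
shared

# Check which data frame is missing the requested column.
# Column names are case-sensitive, so "var_ID" and "VAR_ids" differ.
"var_ID" %in% names(dat1)
"var_ID" %in% names(dat2)

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  merge(dat1, dat2, by = "var_ID"),
  error = function(e) conditionMessage(e)
)
msg
```

If shared comes back empty, the two data frames have no common column name, which is the cue to reach for by.x and by.y or to rename a column.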

Solution

We can use by.x and by.y to solve this error. The parameter by.x takes the name of the key column in the first data frame, and by.y takes the name of the key column in the second. The merged data frame keeps the name given in by.x. Let’s look at the revised code:

dat_merge <- merge(dat1, dat2, by.x='var_ID', by.y='VAR_ids')
dat_merge

Let’s run the code to see the result:

   var_ID         x1         y1
1       1 -0.4343253 -0.8518977
2       2 -0.5291911  0.0305206
3       3  1.2316967  0.4972952
4       4 -0.4829048  2.1803895
5       5  0.2598425 -2.6383560
6       6 -0.7514874 -1.2931126
7       7 -0.6536955  0.7551982
8       8 -0.8750813 -0.1333365
9       9 -0.2649102 -0.3959812
10     10  0.3956067  1.2125677

We successfully merged the two data frames.
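The other remedy mentioned at the start, giving both data frames the same key column name so that by works, could look like the sketch below. It renames the column on a copy (dat2_renamed is a name introduced here for illustration) so the original dat2 is left unchanged, and set.seed() is added only to make the random values repeatable.

```r
set.seed(42)  # arbitrary seed, only for repeatable random values
dat1 <- data.frame(var_ID = 1:10, x1 = rnorm(10))
dat2 <- data.frame(VAR_ids = 1:10, y1 = rnorm(10))

# Rename the key column on a copy so both data frames share "var_ID".
dat2_renamed <- dat2
names(dat2_renamed)[names(dat2_renamed) == "VAR_ids"] <- "var_ID"

# A single 'by' now works because "var_ID" exists in both data frames.
dat_merge_by <- merge(dat1, dat2_renamed, by = "var_ID")

# Compare with the by.x / by.y approach from the solution above.
dat_merge_xy <- merge(dat1, dat2, by.x = "var_ID", by.y = "VAR_ids")
all.equal(dat_merge_by, dat_merge_xy)
```

Renaming is convenient when the same key is reused across several merges; by.x and by.y avoid touching the data frames at all.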

Summary

Congratulations on reading to the end of this tutorial!

Have fun and happy researching!

Suf is a research scientist at Moogsoft, specializing in Natural Language Processing and Complex Networks. Previously he was a Postdoctoral Research Fellow in Data Science working on adaptations of cutting-edge physics analysis techniques to data-intensive problems in industry. In another life, he was an experimental particle physicist working on the ATLAS Experiment of the Large Hadron Collider. His passion is to share his experience as an academic moving into industry while continuing to pursue research.