I'm trying to merge 2 dataframes in R.
df1 = data.frame(CustomerId = c(1:5,5), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 4, 6,7), State = c(rep("Alabama", 2), rep("Ohio", 3)))
loj=merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Actual Result:
  CustomerId Product   State
1          1 Toaster    <NA>
2          2 Toaster Alabama
3          3 Toaster    <NA>
4          4   Radio Alabama
5          4   Radio    Ohio
6          5   Radio    <NA>
7          5   Radio    <NA>
Expected Result:
  CustomerId Product   State
1          1 Toaster    <NA>
2          2 Toaster Alabama
3          3 Toaster    <NA>
4          4   Radio Alabama
5          5   Radio    <NA>
6          5   Radio    <NA>
However, rows 4 and 5 of the actual result repeat the same df1 entry, because CustomerId 4 matches two rows in df2. How can I prevent that? I only want the first match and don't care about any further matches in df2. Essentially, the merged result should have the same row count as df1.
Thanks
Answer 0 (score: 1)
One way to do it is to build an index vector (ind) of the duplicated rows we want to remove and then subset loj with it:
ind <- which(duplicated(loj$CustomerId))[1:abs(nrow(df1) - nrow(loj))]
loj[-ind,]
#   CustomerId Product   State
# 1          1 Toaster    <NA>
# 2          2 Toaster Alabama
# 3          3 Toaster    <NA>
# 4          4   Radio Alabama
# 6          5   Radio    <NA>
# 7          5   Radio    <NA>
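The index above works for this data, but it removes the first duplicated CustomerId values in loj, which are not necessarily the rows the merge added (for example, when df1 repeats a customer earlier in the table). A minimal sketch of an alternative, assuming "first match" means the first row of df2 for each CustomerId in df2's current order: drop the repeated keys from df2 before merging, so the left join cannot add rows. The names df2_first, loj_first and state_first are made up for illustration.

```r
# Data from the question
df1 <- data.frame(CustomerId = c(1:5, 5),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 4, 6, 7),
                  State = c(rep("Alabama", 2), rep("Ohio", 3)))

# Keep only the first df2 row for each CustomerId
df2_first <- df2[!duplicated(df2$CustomerId), ]

# Left join: every df1 row now matches at most one df2 row
loj_first <- merge(x = df1, y = df2_first, by = "CustomerId", all.x = TRUE)

# Same idea without merge(): match() returns the position of the first match only
state_first <- df2$State[match(df1$CustomerId, df2$CustomerId)]
```

Here nrow(loj_first) equals nrow(df1), and both customer-5 rows from df1 are kept. The match() variant also preserves the row order of df1, whereas merge() sorts by the key.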
Answer 1 (score: 0)
Merge, then rank the rows within each group, and keep the first row per group:
# dummy data
df1 = data.frame(CustomerId = c(1:5, 5),
                 Product = c(rep("Toaster", 3),
                             rep("Radio", 2),
                             "Car")) # added "Car" for customer 5
df2 = data.frame(CustomerId = c(2, 4, 4, 6, 7),
                 State = c(rep("Alabama", 2), rep("Ohio", 3)))
library(dplyr)
merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE) %>%
  group_by(CustomerId, Product) %>%
  filter(rank(CustomerId, ties.method = "first") == 1)
# Source: local data frame [6 x 3]
# Groups: CustomerId, Product [6]
#
#   CustomerId Product   State
#        (dbl)  (fctr)  (fctr)
# 1          1 Toaster      NA
# 2          2 Toaster Alabama
# 3          3 Toaster      NA
# 4          4   Radio Alabama
# 5          5   Radio      NA
# 6          5     Car      NA
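Grouping by CustomerId and Product keeps one row per distinct pair, so with the question's original df1 (two identical rows for customer 5 with "Radio") the pipeline above would return five rows instead of six. A possible dplyr sketch that keeps every df1 row, assuming the first df2 row per CustomerId is the one wanted: reduce df2 to one row per key with distinct(), then left_join(). The name joined is made up for illustration.

```r
library(dplyr)

# Original data from the question (customer 5 has two identical rows)
df1 <- data.frame(CustomerId = c(1:5, 5),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 4, 6, 7),
                  State = c(rep("Alabama", 2), rep("Ohio", 3)))

# One row per CustomerId in df2 (the first one encountered), then left join
joined <- df1 %>%
  left_join(distinct(df2, CustomerId, .keep_all = TRUE), by = "CustomerId")
```

Because df2 has unique keys by the time of the join, joined has exactly as many rows as df1, in df1's original order.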