Merge datasets without duplicates in R

Time: 2016-04-25 08:52:20

Tags: r merge

I'm trying to merge 2 dataframes in R.

df1 = data.frame(CustomerId = c(1:5, 5), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 4, 6, 7), State = c(rep("Alabama", 2), rep("Ohio", 3)))

loj = merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Actual Result:

  CustomerId Product   State
1          1 Toaster    <NA>
2          2 Toaster Alabama
3          3 Toaster    <NA>
4          4   Radio Alabama
5          4   Radio    Ohio
6          5   Radio    <NA>
7          5   Radio    <NA>

Expected Result:

  CustomerId Product   State
1          1 Toaster    <NA>
2          2 Toaster Alabama
3          3 Toaster    <NA>
4          4   Radio Alabama
5          5   Radio    <NA>
6          5   Radio    <NA>

However, rows 4 and 5 of the actual result show customer 4 twice, because CustomerId 4 appears twice in df2. How can I prevent that? I only want the first match from df2 and do not care about any further matches. Essentially, the merged result should have the same row count as df1.

Thanks

2 answers:

Answer 0 (score: 1)

One way to do it is to build an index vector of the duplicated rows that we want to remove and then subset loj with that index:

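# Positions of repeated CustomerIds in loj, truncated to the number of
# extra rows that loj has compared to df1
# Note: assumes loj has at least one more row than df1
ind <- which(duplicated(loj$CustomerId))[1:abs(nrow(df1) - nrow(loj))]
# Drop those rows
loj[-ind,]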
#  CustomerId Product   State
#1          1 Toaster    <NA>
#2          2 Toaster Alabama
#3          3 Toaster    <NA>
#4          4   Radio Alabama
#6          5   Radio    <NA>
#7          5   Radio    <NA>
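
The index arithmetic above appears to rely on the extra rows being the first duplicates found in loj, which may not hold if df1 itself repeats a CustomerId earlier in the data. A sketch of an alternative, assuming only the first State per CustomerId in df2 is wanted: drop the repeated keys from df2 before merging, so each row of df1 can match at most once. The names df2_first and loj_first are illustrative and not part of the original post.

```r
df1 <- data.frame(CustomerId = c(1:5, 5),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 4, 6, 7),
                  State = c(rep("Alabama", 2), rep("Ohio", 3)))

# Keep only the first row for each CustomerId in df2
df2_first <- df2[!duplicated(df2$CustomerId), ]

# Left join: every row of df1 is kept and matches at most one row of df2_first
loj_first <- merge(x = df1, y = df2_first, by = "CustomerId", all.x = TRUE)

# Check that the merged result has the same row count as df1
nrow(loj_first) == nrow(df1)
```

Because the deduplication happens before the join, no rows need to be removed afterwards, and repeated customers in df1 are left alone.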

Answer 1 (score: 0)

Merge, then rank the rows within each group, then keep the first row per group:

# dummy data
df1 = data.frame(CustomerId = c(1:5, 5),
                 Product = c(rep("Toaster", 3),
                             rep("Radio", 2),
                             "Car")) # added "Car" for customer 5
df2 = data.frame(CustomerId = c(2, 4, 4, 6, 7),
                 State = c(rep("Alabama", 2), rep("Ohio", 3)))

library(dplyr)

merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE) %>% 
  group_by(CustomerId, Product) %>% 
  filter(rank(CustomerId, ties.method = "first") == 1)

# Source: local data frame [6 x 3]
# Groups: CustomerId, Product [6]
# 
#   CustomerId Product   State
#        (dbl)  (fctr)  (fctr)
# 1          1 Toaster      NA
# 2          2 Toaster Alabama
# 3          3 Toaster      NA
# 4          4   Radio Alabama
# 5          5   Radio      NA
# 6          5     Car      NA
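
Grouping by CustomerId and Product keeps one row per distinct pair, so with the question's original df1, where customer 5 has "Radio" twice, those two rows would collapse into one. A sketch that sidesteps this, assuming only the first matching State is needed: look it up with base R's match(), which leaves the rows and order of df1 untouched. The name result is illustrative.

```r
df1 <- data.frame(CustomerId = c(1:5, 5),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 4, 6, 7),
                  State = c(rep("Alabama", 2), rep("Ohio", 3)))

# match() returns the position of the FIRST occurrence of each key in df2
# (NA when the key is absent), so each df1 row gets at most one State
result <- df1
result$State <- df2$State[match(df1$CustomerId, df2$CustomerId)]

result
```

This only carries over a single column; with several columns from df2, indexing df2 by the same match() vector and binding the result to df1 would be one option.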