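Find row-wise duplicates by group

Time: 2017-03-29 05:42:54

Tags: r data.table dplyr

I have a dataset that may contain duplicate products. If an order contains duplicate products, we flag that order as "Y"; otherwise we flag it as "N".

Here is my data: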

dput(Test_File)
structure(list(Order = c(1234, 1234, 2345, 2345, 2345, 3456, 
3456, 4567, 5678, 5678, 5678, 5678, 9999, 9999), Product = c("A12960", 
"12960", "B3560", "3560", "A3850", "3850", "A3850", "A2920", 
"2930", "2921", "A2921", "A2930", "A1234", "2345"), ASKU = c("Y", 
"N", "N", "N", "Y", "N", "Y", "Y", "N", "N", "Y", "Y", "Y", "N"
)), .Names = c("Order", "Product", "ASKU"), row.names = c(NA, 
14L), class = "data.frame")

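A note about the data: I have a column ASKU. It mainly indicates whether a particular SKU starts with A. Also note that A1234 and 1234 are considered duplicates. A1234 and 23456 are not considered duplicates. Likewise, B1234 and 1234 are not considered duplicates. So, to identify duplicates, the leading A of an SKU can be ignored.

Expected output: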
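To make the rule concrete, here is a minimal sketch, assuming the rule amounts to "strip one leading A before comparing" (a reading of the description above, not something confirmed beyond it). The names `sku` and `key` are made up for illustration; the codes are the ones mentioned in the note.

```r
# Example codes from the note above.
sku <- c("A1234", "1234", "23456", "B1234")

# Drop a leading "A" only; other prefixes such as "B" are left untouched.
key <- sub("^A", "", sku)
key
# "1234"  "1234"  "23456" "B1234"

# Codes that share a key with another code count as duplicates.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
is_dup
# TRUE  TRUE FALSE FALSE
```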

dput(Output_File)
structure(list(Order = c(1234, 2345, 3456, 4567, 5678, 9999), 
    Duplicate = c("Y", "N", "Y", "N", "Y", "N")), .Names = c("Order", 
"Duplicate"), row.names = c(NA, 6L), class = "data.frame")

My code (not working):

I tried the code below, but I get an error. The idea is to compare SKUs row by row after extracting the string from the SKU name.

  

Error: Duplicate identifiers for rows (3, 4), (8, 9), (10, 11)

Test_File$New_SKU<-NA_character_

Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"New_SKU"]<-sub("^A","",Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"Product"])

Test_File[Test_File$ASKU=="N","New_SKU"]<-Test_File[Test_File$ASKU=="N","Product"]

Test_File %>%
  dplyr::group_by(Order) %>%
  dplyr::mutate(DCount = n_distinct(ASKU)) %>%
  dplyr::filter(DCount>=2) %>%
  dplyr::ungroup() %>%
  dplyr::select(Order,New_SKU,ASKU) %>%
  dplyr::distinct() %>%
  tidyr::spread(key = ASKU,value = New_SKU)
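For context, `tidyr::spread()` needs each combination of the remaining identifier columns and the key to occur at most once. The following base R sketch counts rows per (Order, ASKU) pair in the sample data; it is only a diagnostic under that assumption about the cause, and the names `counts` and `dup_keys` are invented here.

```r
# Sample data copied from the dput() output above.
Test_File <- data.frame(
  Order = c(1234, 1234, 2345, 2345, 2345, 3456, 3456, 4567,
            5678, 5678, 5678, 5678, 9999, 9999),
  Product = c("A12960", "12960", "B3560", "3560", "A3850", "3850", "A3850",
              "A2920", "2930", "2921", "A2921", "A2930", "A1234", "2345"),
  ASKU = c("Y", "N", "N", "N", "Y", "N", "Y", "Y", "N", "N", "Y", "Y", "Y", "N"),
  stringsAsFactors = FALSE
)

# Number of rows for every (Order, ASKU) combination.
counts <- aggregate(Product ~ Order + ASKU, data = Test_File, FUN = length)
names(counts)[3] <- "n"

# Combinations that occur more than once cannot be spread into a single cell.
dup_keys <- counts[counts$n > 1, ]
dup_keys
# The pairs (2345, N), (5678, N) and (5678, Y) each occur twice.
```

Three offending pairs line up with the three row pairs named in the error message.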

Can someone help me? I would appreciate both a dplyr-based solution and a data.table-based solution.

2 Answers:

Answer 0: (score: 1)

We can use dplyr: remove the leading A from Product, then group by Order and check whether there are any duplicated values within each group.

library(dplyr)
Test_File %>%
  mutate(Product = sub("^A", "", Product)) %>%
  group_by(Order) %>%
  summarise(Duplicate = any(duplicated(Product)))
#  Order Duplicate
#  <dbl>     <lgl>
#1  1234      TRUE
#2  2345     FALSE
#3  3456      TRUE
#4  4567     FALSE
#5  5678      TRUE
#6  9999     FALSE

If we need the output in Y/N format, we can easily replace the values of the Duplicate column using ifelse:

Test_File %>%
  mutate(Product = sub("^A", "", Product)) %>%
  group_by(Order) %>%
  summarise(Duplicate = ifelse(any(duplicated(Product)), "Y", "N"))
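As a small worked example of what the answer above computes inside each group, the sketch below applies the same `sub()` and `duplicated()` calls to the products of two orders by hand. The vectors `p5678` and `p2345` are illustrative names; their values are copied from Test_File.

```r
# Products of order 5678 and order 2345, copied from Test_File.
p5678 <- c("2930", "2921", "A2921", "A2930")
p2345 <- c("B3560", "3560", "A3850")

# After stripping the leading "A", the last two codes of order 5678
# repeat earlier ones.
d5678 <- duplicated(sub("^A", "", p5678))
d5678
# FALSE FALSE  TRUE  TRUE

# any() collapses the per-row flags into one flag per order.
any(d5678)                              # TRUE
any(duplicated(sub("^A", "", p2345)))   # FALSE ("B3560" is not "3560")
```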

Answer 1: (score: 1)

Here is an option using data.table:

library(data.table)
setDT(Test_File)[, .(Duplicate = c("N", "Y")[(anyDuplicated(sub("^A",
                  "", Product)) > 0)+1]), Order]
#    Order Duplicate
#1:  1234         Y
#2:  2345         N
#3:  3456         Y
#4:  4567         N
#5:  5678         Y
#6:  9999         N
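The `c("N", "Y")[... + 1]` expression above may look cryptic, so here is a minimal base R sketch of the two pieces it relies on: `anyDuplicated()` returns the index of the first duplicate (0 if there is none), and adding 1 to a logical turns FALSE/TRUE into the positions 1/2. The variable names `x`, `first_dup` and `flag` are made up for illustration.

```r
# Products of order 5678 with the leading "A" already removed.
x <- c("2930", "2921", "2921", "2930")

# Index of the first duplicated element, or 0 when all are unique.
first_dup <- anyDuplicated(x)
first_dup
# 3

# Logical flag, then used as a 1-based index into c("N", "Y").
flag <- first_dup > 0
label <- c("N", "Y")[flag + 1]
label
# "Y"

# The same indexing applied to both possible flags.
c("N", "Y")[c(FALSE, TRUE) + 1]
# "N" "Y"
```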

Or with base R:

i1 <- with(Test_File, tapply(sub("^A", "", Product), Order, FUN = anyDuplicated)>0)
stack(split(i1, names(i1)))[2:1]
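The base R code above yields logical values. If the Y/N layout of the expected output is wanted, one possible variant is sketched below; it is an assumption about the desired shape rather than part of the original answer, and `has_dup` and `Result` are names invented here.

```r
# Sample data copied from the question (only the columns needed here).
Test_File <- data.frame(
  Order = c(1234, 1234, 2345, 2345, 2345, 3456, 3456, 4567,
            5678, 5678, 5678, 5678, 9999, 9999),
  Product = c("A12960", "12960", "B3560", "3560", "A3850", "3850", "A3850",
              "A2920", "2930", "2921", "A2921", "A2930", "A1234", "2345"),
  stringsAsFactors = FALSE
)

# One logical per order: does any product repeat once the leading "A" is dropped?
has_dup <- tapply(sub("^A", "", Test_File$Product), Test_File$Order,
                  function(p) anyDuplicated(p) > 0)

# Reshape into the Order / Duplicate layout with "Y" / "N" labels.
Result <- data.frame(
  Order = as.numeric(names(has_dup)),
  Duplicate = ifelse(as.vector(has_dup), "Y", "N"),
  stringsAsFactors = FALSE
)
Result
```

With the sample data this gives Y, N, Y, N, Y, N for orders 1234 through 9999, matching the expected output in the question.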