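I have a dataset that may contain duplicate products. If an order contains a duplicate product, we flag that order as "Y"; otherwise we flag it as "N".
Here is my data: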
dput(Test_File)
structure(list(Order = c(1234, 1234, 2345, 2345, 2345, 3456,
3456, 4567, 5678, 5678, 5678, 5678, 9999, 9999), Product = c("A12960",
"12960", "B3560", "3560", "A3850", "3850", "A3850", "A2920",
"2930", "2921", "A2921", "A2930", "A1234", "2345"), ASKU = c("Y",
"N", "N", "N", "Y", "N", "Y", "Y", "N", "N", "Y", "Y", "Y", "N"
)), .Names = c("Order", "Product", "ASKU"), row.names = c(NA,
14L), class = "data.frame")
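A note on the data: I have a column ASKU. It mainly indicates whether a given SKU starts with A. Also note that A1234 and 1234 are considered duplicates. A1234 and 23456 are not considered duplicates. Likewise, B1234 and 1234 are not considered duplicates. So, to identify duplicates, the leading A of an SKU (the case flagged by ASKU) can be ignored.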
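To make the matching rule concrete, here is a minimal base R sketch. The helper name normalise_sku is hypothetical and used only for illustration; it assumes that only a single leading A should be stripped and that every other prefix, such as B, is left alone.

```r
# Hypothetical helper (not from the original post): drop one leading "A" only
normalise_sku <- function(sku) sub("^A", "", sku)

normalise_sku("A1234") == normalise_sku("1234")   # TRUE: treated as duplicates
normalise_sku("A1234") == normalise_sku("23456")  # FALSE
normalise_sku("B1234") == normalise_sku("1234")   # FALSE: "B" is not stripped
```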
Expected output:
dput(Output_File)
structure(list(Order = c(1234, 2345, 3456, 4567, 5678, 9999),
Duplicate = c("Y", "N", "Y", "N", "Y", "N")), .Names = c("Order",
"Duplicate"), row.names = c(NA, 6L), class = "data.frame")
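My code (not working):
I tried the code below, but I get an error. The idea is to strip the prefix from each SKU name and then compare the SKUs row by row.
Error: Duplicate identifiers for rows (3, 4), (8, 9), (10, 11)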
Test_File$New_SKU<-NA_character_
Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"New_SKU"]<-sub("^A","",Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"Product"])
Test_File[Test_File$ASKU=="N","New_SKU"]<-Test_File[Test_File$ASKU=="N","Product"]
Test_File %>%
dplyr::group_by(Order) %>%
dplyr::mutate(DCount = n_distinct(ASKU)) %>%
dplyr::filter(DCount>=2) %>%
dplyr::ungroup() %>%
dplyr::select(Order,New_SKU,ASKU) %>%
dplyr::distinct() %>%
tidyr::spread(key = ASKU,value = New_SKU)
Can someone help me? I would appreciate both a dplyr-based solution and a data.table-based solution.
Answer 0 (score: 1)
We can use dplyr: remove the A from the start of Product, then group by Order and check whether any group contains duplicated values.
library(dplyr)
Test_File %>%
mutate(Product = sub("^A", "", Product)) %>%
group_by(Order) %>%
summarise(Duplicate = any(duplicated(Product)))
# Order Duplicate
# <dbl> <lgl>
#1 1234 TRUE
#2 2345 FALSE
#3 3456 TRUE
#4 4567 FALSE
#5 5678 TRUE
#6 9999 FALSE
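The Duplicate column is TRUE for every order that contains a repeated product once the leading A is removed.
If the output is needed in Y/N format, ifelse can be applied to the values of the Duplicate column: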
Test_File %>%
mutate(Product = sub("^A", "", Product)) %>%
group_by(Order) %>%
summarise(Duplicate = ifelse(any(duplicated(Product)), "Y", "N"))
Answer 1 (score: 1)
Here is an option using data.table:
library(data.table)
setDT(Test_File)[, .(Duplicate = c("N", "Y")[(anyDuplicated(sub("^A",
"", Product)) > 0)+1]), Order]
# Order Duplicate
#1: 1234 Y
#2: 2345 N
#3: 3456 Y
#4: 4567 N
#5: 5678 Y
#6: 9999 N
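The indexing expression c("N", "Y")[cond + 1] used above may look cryptic at first. The sketch below shows, on two made-up vectors, how it appears to work: anyDuplicated() returns 0 when nothing repeats, and adding 1 to a logical turns FALSE/TRUE into the positions 1/2. The names x_dup, x_uniq and flag are illustrative only.

```r
# anyDuplicated() returns 0 when there are no duplicates,
# otherwise the index of the first duplicated element
x_dup  <- c("3850", "3850")
x_uniq <- c("2345", "1234")

anyDuplicated(x_dup)    # 2
anyDuplicated(x_uniq)   # 0

# FALSE + 1 is 1 and TRUE + 1 is 2, so the logical selects "N" or "Y"
flag <- function(x) c("N", "Y")[(anyDuplicated(x) > 0) + 1]
flag(x_dup)    # "Y"
flag(x_uniq)   # "N"
```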
Or with base R:
i1 <- with(Test_File, tapply(sub("^A", "", Product), Order, FUN = anyDuplicated)>0)
stack(split(i1, names(i1)))[2:1]
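The stack() call returns logical values rather than Y/N flags. If the result should follow the layout of the expected output in the question (numeric Order, character Duplicate), one possible base R variant is sketched below. It rebuilds the relevant columns of Test_File so that it runs on its own; the names has_dup and result are illustrative.

```r
# Re-create the Order and Product columns from the question
Test_File <- data.frame(
  Order = c(1234, 1234, 2345, 2345, 2345, 3456, 3456, 4567,
            5678, 5678, 5678, 5678, 9999, 9999),
  Product = c("A12960", "12960", "B3560", "3560", "A3850", "3850", "A3850",
              "A2920", "2930", "2921", "A2921", "A2930", "A1234", "2345"),
  stringsAsFactors = FALSE
)

# One logical per order: does the order contain a repeated product
# once a leading "A" is removed?
has_dup <- tapply(sub("^A", "", Test_File$Product), Test_File$Order,
                  function(p) anyDuplicated(p) > 0)

# Convert the named logical array into an Order / Duplicate data frame
result <- data.frame(
  Order = as.numeric(names(has_dup)),
  Duplicate = ifelse(as.vector(has_dup), "Y", "N"),
  stringsAsFactors = FALSE
)
result
```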