Question

Data:

DB1 <- data.frame(orderItemID  = c(1,2,3,4,5,6,7,8,9,10), 
orderDate = c("1.1.12","1.1.12","1.1.12","1.1.12","1.1.12", "1.1.12","1.1.12","1.1.12","2.1.12","2.1.12"),  
itemID = c(2,3,2,5,12,4,2,3,1,5),  
size = factor(c("l", "s", "xl", "xs","m", "s", "l", "m", "xxs", "xxl")), 
color = factor(c("blue", "black", "blue", "orange", "red", "navy", "red", "purple", "white", "black")),  
customerID = c(33, 15, 1, 33, 14, 55, 33, 78, 94, 23))

Expected output:

selection_order = c("yes","no","no","no","no","no","yes","no","no","no")

In the dataset, I have items with the same size or the same color and the same itemID. Every registered user has their own unique customerID.

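I want to identify when a user orders a product with the same itemID more than once (in a different size or color; for example, the user with customerID = 33 orders the same item, itemID = 2, in two different colors) and flag it with "yes" or "no" in a new column called "selection order" (for example). When he or she orders items with other IDs, it should not show "yes". I only want a "yes" when the same itemID has been ordered more than once by that customer (on the same day or in the past), regardless of other IDs (other products).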

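I have tried a lot, but nothing worked. There are several thousand different customerIDs and itemIDs, so I cannot configure this for each ID. I tried using the duplicated function, but it did not give a satisfactory solution: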

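The problem is that if the same person orders more than one object (so the customerID is duplicated) and another person (a different customerID) orders an item with the same ID (so the itemID is duplicated), it gives me a "yes", whereas in this case it must be "no". (In the example, the duplicated function gives me a "yes" instead of a "no" at orderItemID 4.)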

Answer 1

I think I now understand what your desired output is. Try

library(data.table)
setDT(DB1)[, selection_order := .N > 1, by = list(customerID, itemID)]
DB1
#     orderItemID orderDate itemID size  color customerID selection_order
#  1:           1    1.1.12      2    l   blue         33            TRUE
#  2:           2    1.1.12      3    s  black         15           FALSE
#  3:           3    1.1.12      2   xl   blue          1           FALSE
#  4:           4    1.1.12      5   xs orange         33           FALSE
#  5:           5    1.1.12     12    m    red         14           FALSE
#  6:           6    1.1.12      4    s   navy         55           FALSE
#  7:           7    1.1.12      2    l    red         33            TRUE
#  8:           8    1.1.12      3    m purple         78           FALSE
#  9:           9    2.1.12      1  xxs  white         94           FALSE
# 10:          10    2.1.12      5  xxl  black         23           FALSE

To convert back to a data.frame, use DB1 <- as.data.frame(DB1) (for older versions) or setDF(DB1) in the latest data.table versions.
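The question asks for "yes"/"no" labels rather than TRUE/FALSE. A minimal sketch of that conversion in base R, assuming a trimmed copy of the question's data (only the columns needed here); the helper variable pair_count is introduced for illustration and is not part of the original answer:

```r
# Trimmed copy of the question's data (assumption: only the columns needed here)
DB1 <- data.frame(
  orderItemID = 1:10,
  itemID      = c(2, 3, 2, 5, 12, 4, 2, 3, 1, 5),
  customerID  = c(33, 15, 1, 33, 14, 55, 33, 78, 94, 23)
)

# Number of rows sharing each (customerID, itemID) pair, aligned to the rows
pair_count <- ave(DB1$itemID, DB1$customerID, DB1$itemID, FUN = length)

# Label pairs that occur more than once with "yes", all others with "no"
DB1$selection_order <- ifelse(pair_count > 1, "yes", "no")

DB1$selection_order
```

On the sample data only orderItemID 1 and 7 (customerID 33, itemID 2) should come out as "yes".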

You could also use base R (less efficient)

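# ave() returns values of the same type as its first argument, so compare the
# per-group count outside of ave() to get a logical column instead of 0/1
transform(DB1, selection_order = ave(itemID, customerID, itemID, FUN = length) > 1)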

Or use the dplyr package

library(dplyr)
DB1 %>%
  group_by(customerID, itemID) %>%
  mutate(selection_order = n() > 1)

Answer 2

The following code appends a new column, selection.order, to the data frame; it is TRUE when the row belongs to a duplicated (customerID, itemID) tuple.

# First merge the table with itself on (customerID, itemID)
m<- merge(x=DB1,y=DB1,by=c("customerID","itemID"))

# Now find orderItemIDs that occur more than once in the merged table
# (orderItemID is assumed to be unique in DB1)
m$selection.order<-sapply(m$orderItemID.x,function(X) sum(m$orderItemID.x==X)) > 1
m <- m[,c("orderItemID.x","selection.order")]

# Merge the two together
DB1<- merge(DB1, unique(m), by.x="orderItemID",by.y="orderItemID.x",all.x=TRUE,all.y=FALSE)

Answer 3

If you just want the subset, as you said in the title, then do this:

DB1[duplicated(DB1[c("itemID", "customerID")]),]
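Note that duplicated() only marks the second and later occurrences of a pair, so the line above leaves out the first order of each repeated (itemID, customerID) pair. If every row of a repeated pair is wanted, one possible sketch is to scan from both ends; this assumes a trimmed copy of the question's data, and the names key and in_repeated_pair are made up for illustration:

```r
# Trimmed copy of the question's data (assumption: only the columns needed here)
DB1 <- data.frame(
  orderItemID = 1:10,
  itemID      = c(2, 3, 2, 5, 12, 4, 2, 3, 1, 5),
  customerID  = c(33, 15, 1, 33, 14, 55, 33, 78, 94, 23)
)

key <- DB1[c("itemID", "customerID")]

# duplicated() skips the first occurrence; combining a forward and a backward
# scan flags every member of a repeated pair
in_repeated_pair <- duplicated(key) | duplicated(key, fromLast = TRUE)

DB1[in_repeated_pair, ]
```

On the sample data this should keep orderItemID 1 and 7, whereas duplicated(key) alone keeps only orderItemID 7.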

If you want the column, then:

f <- interaction(DB1$itemID, DB1$customerID)
DB1$multiple <- table(f)[f] > 1L

Note that the actual counts are also easy to get by simplifying the last line above.
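A sketch of what that simplification might look like: drop the comparison and keep the per-pair count. It assumes a trimmed copy of the question's data; the column name n_orders is hypothetical, and as.vector() is only there to strip the table attributes:

```r
# Trimmed copy of the question's data (assumption: only the columns needed here)
DB1 <- data.frame(
  orderItemID = 1:10,
  itemID      = c(2, 3, 2, 5, 12, 4, 2, 3, 1, 5),
  customerID  = c(33, 15, 1, 33, 14, 55, 33, 78, 94, 23)
)

# One factor level per (itemID, customerID) combination
f <- interaction(DB1$itemID, DB1$customerID)

# table(f) counts rows per level; indexing by f (via its integer codes)
# maps each count back onto the corresponding rows
DB1$n_orders <- as.vector(table(f)[f])

DB1$n_orders
```

On the sample data the count should be 2 for orderItemID 1 and 7 and 1 for every other row.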

How to get a new column in a data frame with only the elements that appear more than once in a set, in R
