Question

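I have a dataset containing customer purchase information. I am trying to create a unique ID by concatenating device_id (the customer), store_id, product_id and date (of purchase). I used the following code: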

customer$device_store_product_date <- paste(customer$device, customer$store_id, customer$product_id, customer$date, sep='_')

The resulting column looks like this:

        device_store_product_date
48c6eec37affa1db_203723_9313962_2016-02-19
eb2c2f00071b97f3_179926_6180944_2016-02-20
d82066a784c9552_180704_9308311_2016-02-20
9766bba65b1ef9ac_204187_9313852_2016-02-20
77d80c1066f5267_180488_9312672_2016-02-20

As expected, duplicates still exist. To identify them, I used duplicated():

x1 = customer[duplicated(customer$device_store_product_date),]

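However, for a few values of x1$device_store_product_date there is only a single entry. That should not be the case, since x1 should consist of duplicated values. Please let me know where I am going wrong. To select the entries corresponding to a particular value of device_store_product_date, I used: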

filter(x1, x1$device_store_product_date=="14163e6b6ed06890_203723_9313477_2016-02-20")

Answer 1

duplicated() returns TRUE only for a value that has already occurred earlier in the vector, so

x <-c("a","b","a")
duplicated(x)

will return

FALSE FALSE TRUE

If you want the first occurrence flagged as well, something like this will do:

duplicated(x)|rev(duplicated(rev(x)))
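
As a quick check of that expression on the same vector, here is a minimal sketch. The names first_pass and all_dups are only illustrative and do not come from the question.

```r
x <- c("a", "b", "a")

# Flags only later repeats: the first "a" is not marked
first_pass <- duplicated(x)

# Scanning the reversed vector flags the earlier occurrences instead;
# the outer rev() puts that result back into the original order
all_dups <- duplicated(x) | rev(duplicated(rev(x)))

print(first_pass)  # FALSE FALSE TRUE
print(all_dups)    # TRUE FALSE TRUE
```

The unique value "b" stays FALSE in both results, while both copies of "a" are flagged in the combined one.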

Answer 2

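The duplicated function has an argument, fromLast = TRUE, that checks for duplicates starting from the end. With it, the last occurrence of each value is FALSE and all the other duplicated elements return TRUE. By combining the two results with |, we make sure every duplicated element is included.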

 duplicated(x)|duplicated(x, fromLast=TRUE)

This can be used to get all the duplicated elements.
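
Applied to a data frame like the one in the question, the idea might look like the sketch below. The customer data here is a made-up stand-in with shortened keys ("k1", "k2", "k3"), and the names key and all_dups are assumptions for illustration only.

```r
# Toy stand-in for the asker's data; the key values are invented
customer <- data.frame(
  device_store_product_date = c("k1", "k2", "k1", "k3", "k2", "k1"),
  stringsAsFactors = FALSE
)

key <- customer$device_store_product_date

# What the question did: the first occurrence of each repeated key is dropped
x1 <- customer[duplicated(key), , drop = FALSE]

# Every row whose key occurs more than once, first occurrence included
all_dups <- customer[duplicated(key) | duplicated(key, fromLast = TRUE), ,
                     drop = FALSE]

# Counts per key in each subset
table(x1$device_store_product_date)
table(all_dups$device_store_product_date)
```

In this toy data "k2" occurs twice in customer but only once in x1, which matches the single entries described in the question; all_dups keeps both rows.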

How to use duplicated()
