Question

I have the following data:

> head(bigdata)
      type                               text
1  neutral              The week in 32 photos
2  neutral Look at me! 22 selfies of the week
3  neutral       Inside rebel tunnels in Homs
4  neutral                Voices from Ukraine
5  neutral  Water dries up ahead of World Cup
6 positive     Who's your hero? Nominate them

My duplicates look like this (empty $type):

7              Who's your hero? Nominate them
8           Water dries up ahead of World Cup

I remove the duplicates like this:

bigdata <- bigdata[!duplicated(bigdata$text),]

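The problem is that this removes the wrong copy. I want to remove the row whose $type is empty, not the one that has a $type value.

How can I remove specific duplicates in R?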

Answer 1

So here is a solution that does not use duplicated(...).

# creates an example - you have this already...
set.seed(1)   # for reproducible example
bigdata <- data.frame(type=rep(c("positive","negative"),5),
                      text=sample(letters[1:10],10),
                      stringsAsFactors=FALSE)
# add some duplicates
bigdata <- rbind(bigdata,data.frame(type="",text=bigdata$text[1:5]))   

# you start here...
newdf  <- with(bigdata,bigdata[order(text,type,decreasing=TRUE),])
result <- aggregate(newdf,by=list(text=newdf$text),head,1)[2:3]

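This sorts bigdata by text and type in decreasing order, so that for a given text an empty type appears after any non-empty type. Then we extract only the first occurrence of type for each text.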
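The same sort-then-keep-first idea can also be sketched with order() and duplicated() on a few rows modelled on the question's data. The data frame `df` and its rows are made up for illustration, and this is only an alternative to the aggregate() call above, assuming the empty types are empty strings.

```r
# Toy data modelled on the question: each headline appears twice, once with an empty type
df <- data.frame(
  type = c("neutral", "positive", "", ""),
  text = c("Water dries up ahead of World Cup",
           "Who's your hero? Nominate them",
           "Who's your hero? Nominate them",
           "Water dries up ahead of World Cup"),
  stringsAsFactors = FALSE
)

# Sort so that, within each text, non-empty types come before empty ones
# (FALSE sorts before TRUE)
sorted <- df[order(df$text, df$type == ""), ]

# duplicated() flags every copy after the first, so the empty-type rows are dropped
deduped <- sorted[!duplicated(sorted$text), ]
print(deduped)
```

Sorting on the logical `df$type == ""` rather than on type itself keeps the result independent of how the non-empty labels happen to sort.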

If your data really is "big", then a data.table solution will probably be faster.

library(data.table)
DT <- as.data.table(bigdata)
setkey(DT, text, type)
DT.result <- DT[, list(type = type[.N]), by = text]

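This is basically the same, but since setkey sorts only in increasing order, we use type[.N] to get the last occurrence of type for each text. .N is a special variable that holds the number of elements in the group.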
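To see what .N does inside a grouped call, here is a minimal sketch on a three-row table. The table `toy` and the column names `n` and `last_type` are invented for this example, and it assumes the data.table package is installed.

```r
library(data.table)

# Toy table: text "a" has an empty and a non-empty type, text "b" only a non-empty one
toy <- data.table(type = c("", "positive", "negative"),
                  text = c("a", "a", "b"))
setkey(toy, text, type)   # increasing sort, so "" comes first within each text

# .N is the row count of the current group, so type[.N] is the group's last type
counts <- toy[, list(n = .N, last_type = type[.N]), by = text]
print(counts)
```

Because the empty string sorts before any non-empty label, the last row of each group is the one worth keeping.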

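Note that the current development version implements a function setorder(), which sorts a data.table by reference and can sort in both increasing and decreasing order. So, using the devel version, it would be: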

require(data.table) # 1.9.3
setorder(DT, text, -type)
DT[, list(type = type[1L]), by = text]

Answer 2

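# TRUE where the type is the empty string
foo = function(x){
    x == ""
}

# TRUE for every row whose text occurs more than once, first copy included
dup <- duplicated(bigdata$text) | duplicated(bigdata$text, fromLast=TRUE)
# drop only the rows that are both a copy and have an empty type
bigdata <- bigdata[!(dup & foo(bigdata$type)),]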

Answer 3

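You should keep the rows that either are not duplicated or are not missing a type value. The duplicated function only flags the second and later copies of each value (try duplicated(c(1, 1, 2))), so we need to combine that result with the result of calling duplicated with fromLast=TRUE. The code below assumes the empty types are NA; if they are empty strings, test bigdata$type != "" instead of !is.na(bigdata$type).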

bigdata <- bigdata[!(duplicated(bigdata$text) |
                     duplicated(bigdata$text, fromLast=TRUE)) |
                   !is.na(bigdata$type),]
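The two directions of duplicated() can be compared on the small vector mentioned above. The variable names here are illustrative only.

```r
x <- c(1, 1, 2)

# Scanning from the start, only the second 1 counts as a duplicate
from_first <- duplicated(x)

# Scanning from the end, only the first 1 counts as a duplicate
from_last <- duplicated(x, fromLast = TRUE)

# Combining both marks every element whose value occurs more than once
any_copy <- from_first | from_last
print(any_copy)
```

Negating `any_copy` therefore selects the values that occur exactly once, which is the first half of the row filter above.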
