Multiple duplicates (2 times, 3 times, ...) in R

Time: 2015-04-30 16:27:59

Tags: r duplicates duplicate-data

After searching for a while, I believe this question has not been answered yet. Suppose I have the following vector:

v <- c("a", "b", "b", "c","c","c", "d", "d", "d", "d")

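How can I find the values that have more than 1 duplicate, i.e. that occur more than twice

(should be "c","c","c", "d", "d", "d", "d")

and the values that have more than 2 duplicates, i.e. that occur more than three times

(should be "d", "d", "d", "d")

The function duplicated(v) only tells me which values have duplicates, not how many times each one occurs.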
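
For reference, a minimal sketch of what duplicated() gives for this vector. It marks every element that repeats an earlier one, so a value occurring twice ("b") is flagged just like one occurring four times ("d"):

```r
v <- c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d")

# TRUE for each element that has already appeared earlier in v
dup <- duplicated(v)

# Elements flagged as duplicates; "b" is included although it has only one duplicate
flagged <- v[dup]
print(flagged)
# [1] "b" "c" "c" "d" "d" "d"
```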

2 answers:

Answer 0: (score: 7)

You can build a table() of counts and then check which elements of v belong to the relevant subset of that table, for example:

R> v <- c("a", "b", "b", "c","c","c", "d", "d", "d", "d")
R> tab <- table(v)
R> tab
v
a b c d 
1 2 3 4 
R> v[v %in% names(tab[tab > 2])]
[1] "c" "c" "c" "d" "d" "d" "d"
R> v[v %in% names(tab[tab > 3])]
[1] "d" "d" "d" "d"
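
The same idea can be wrapped in a small helper. This is only a sketch: the name keep_frequent and its min_count argument are hypothetical, and here the threshold is the minimum number of occurrences (>=) rather than the strict > used above.

```r
# Hypothetical helper: keep the elements of x whose value occurs
# at least min_count times in x.
keep_frequent <- function(x, min_count) {
  tab <- table(x)                           # occurrences of each distinct value
  frequent <- names(tab[tab >= min_count])  # values that meet the threshold
  x[x %in% frequent]                        # keep matching elements, original order
}

v <- c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d")
print(keep_frequent(v, 3))
# [1] "c" "c" "c" "d" "d" "d" "d"
print(keep_frequent(v, 4))
# [1] "d" "d" "d" "d"
```

Note that table() drops NA values by default, so NA elements are never returned by this sketch.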

Answer 1: (score: 5)

I would use ave to write a simple function like this:

myFun <- function(vector, thresh) {
  ind <- ave(rep(1, length(vector)), vector, FUN = length)
  vector[ind > thresh + 1] ## added "+1" to match your terminology
}

Here it is applied to "v":

myFun(v, 1)
# [1] "c" "c" "c" "d" "d" "d" "d"
myFun(v, 2)
# [1] "d" "d" "d" "d"
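
To see why this works, it may help to look at the intermediate vector that ave produces: every element is replaced by the size of the group it belongs to. A minimal sketch (the name counts is only for illustration):

```r
v <- c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d")

# For each element of v, the number of times its value occurs in v
counts <- ave(rep(1, length(v)), v, FUN = length)
print(counts)
# [1] 1 2 2 3 3 3 4 4 4 4

# Comparing against a threshold gives a logical index into v
print(v[counts > 2])
# [1] "c" "c" "c" "d" "d" "d" "d"
```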

Of course, there is always "data.table":

library(data.table)
as.data.table(v)[, N := .N, by = v][N > 1 + 1]$v
# [1] "c" "c" "c" "d" "d" "d" "d"
as.data.table(v)[, N := .N, by = v][N > 2 + 1]$v
# [1] "d" "d" "d" "d"