Question

I have a number of ids, each of which is associated with several hits. Here is what the data look like:

ids <- LETTERS[1:4]
all.hits <- paste("h", 1:50, sep = ".")

set.seed(1)
# For each id, draw between 5 and 9 distinct hits at random
df <- do.call(rbind, lapply(seq_along(ids), function(i) {
  hits <- all.hits[sample(50, as.integer(runif(1, 5, 10)), replace = FALSE)]
  data.frame(id = rep(ids[i], length(hits)), hit = hits, stringsAsFactors = FALSE)
}))

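Given number.hits, I would like to find the smallest set of hits such that each id is covered by at least number.hits of the hits in the set, and such that the set has the maximal representation of ids.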

So, for example, with number.hits <- 2, the set would be:

c("h.10", "h.19", "h.42", "h.50")

This is because h.10 covers ids A, B, C; h.19 covers ids A, C, D; h.42 covers ids A, D; and h.50 covers ids B, C. Hence A is covered 3 times (>= number.hits), B twice (>= number.hits), C 3 times (>= number.hits), and D twice (>= number.hits).
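The coverage arithmetic in this example can be checked with a short base R sketch. It uses only the id/hit pairs spelled out above, not the full df; example.df and coverage() are hypothetical names introduced for illustration, and the helper assumes each (id, hit) pair appears at most once.

```r
# Only the id/hit pairs listed in the example; the full df contains more rows.
example.df <- data.frame(
  id  = c("A", "B", "C",  "A", "C", "D",  "A", "D",  "B", "C"),
  hit = rep(c("h.10", "h.19", "h.42", "h.50"), times = c(3, 3, 2, 2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each id, count how many of the chosen hits it occurs with.
coverage <- function(df, chosen.hits) {
  ids <- sort(unique(df$id))
  kept <- df[df$hit %in% chosen.hits, ]
  counts <- table(factor(kept$id, levels = ids))  # ids with no chosen hit get 0
  setNames(as.integer(counts), ids)
}

number.hits <- 2
covered <- coverage(example.df, c("h.10", "h.19", "h.42", "h.50"))
print(covered)                     # A = 3, B = 2, C = 3, D = 2
print(all(covered >= number.hits)) # TRUE: every id is covered often enough
```

Dropping h.42 and h.50 from the chosen set leaves B and D covered only once each, so the same check then fails.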

Answer 1

Perhaps something like this, using library(tidyverse)?

number.hits <- 2
df %>% group_by(hit) %>% tally() %>% filter(n >= number.hits)
# A tibble: 5 × 2
hit     n
<chr> <int>
1  h.10     3
2  h.19     3
3  h.42     2
4  h.44     2
5  h.50     2

number.hits <- 3
df %>% group_by(hit) %>% tally() %>% filter(n >= number.hits)
# A tibble: 2 × 2
hit     n
<chr> <int>
1  h.10     3
2  h.19     3

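For number.hits <- 2 I get more hits than you expected (five rather than four), so I am not sure whether I have missed one of your conditions.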
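The count filter above keeps every hit shared by at least number.hits ids, which is not the same condition as the one in the question (each id covered by at least number.hits of the selected hits, with as few hits as possible). One possible way to approach that condition is a greedy heuristic: repeatedly pick the hit that covers the most ids still short of number.hits. The sketch below is only an illustration under stated assumptions. greedy.cover() and toy.df are hypothetical names, the toy data reuse the pairs from the question's example plus an invented hit h.99, and a greedy pass is not guaranteed to return the smallest possible set.

```r
# Toy data: the pairs from the question's example plus a made-up hit h.99.
toy.df <- data.frame(
  id  = c("A", "B", "C",  "A", "C", "D",  "A", "D",  "B", "C",  "A", "B"),
  hit = rep(c("h.10", "h.19", "h.42", "h.50", "h.99"), times = c(3, 3, 2, 2, 2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper; assumes unique (id, hit) pairs in df.
greedy.cover <- function(df, number.hits) {
  df$id <- as.character(df$id)
  df$hit <- as.character(df$hit)
  ids <- unique(df$id)
  need <- setNames(rep(number.hits, length(ids)), ids)  # coverage still owed per id
  candidates <- unique(df$hit)
  chosen <- character(0)
  while (any(need > 0)) {
    # Gain of a hit = number of still-deficient ids it would cover
    gain <- vapply(candidates, function(h) {
      sum(need[df$id[df$hit == h]] > 0)
    }, integer(1))
    if (length(gain) == 0 || max(gain) == 0) {
      stop("no set of hits covers every id number.hits times")
    }
    best <- candidates[which.max(gain)]  # ties go to the first candidate
    chosen <- c(chosen, best)
    covered.ids <- df$id[df$hit == best]
    need[covered.ids] <- pmax(need[covered.ids] - 1, 0)
    candidates <- setdiff(candidates, best)
  }
  chosen
}

result <- greedy.cover(toy.df, 2)
print(result)  # "h.10" "h.19" "h.42" "h.50"
```

On the toy data, h.99 is shared by two ids and would pass the count filter, but the greedy pass never needs it. Ties are broken by order of appearance, so a different row order can give a different, equally sized set.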

Answer 2

We can use data.table:

library(data.table)
number.hits <- 2
setDT(df)[, if(.N >= number.hits) .N, by = hit]
#    hit V1
#1: h.19  3
#2: h.44  2
#3: h.10  3
#4: h.42  2
#5: h.50  2

number.hits <- 3
setDT(df)[, if(.N >= number.hits) .N, hit]
#    hit V1
#1: h.19  3
#2: h.10  3

Or with base R:

subset(aggregate(.~hit, df, length), id >= number.hits)

Finding the smallest set with maximum coverage
