Question

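I have a large data frame of species IDs (SpeciesID) and individual IDs (IndID).

For my dataset, I need to delete an IndID when the unique combination of SpeciesID and IndID occurs fewer than 4 times.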

For example, given this dataset:

SpeciesID   IndID
99          13-001
99          13-001
99          14-002
99          14-002
99          14-002
100         14-005
100         14-005
100         14-005
100         14-006
100         14-007
100         14-007
100         14-008
100         14-009
500         16-001
500         16-001
500         16-002
500         16-002
500         16-002
500         16-003
500         16-003
500         16-004
500         16-004
500         16-005
500         16-006
500         16-006
500         16-007

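Given this dataset, I want to delete the rows where the unique combination of SpeciesID and IndID occurs fewer than 5 times:

In this case, I want to delete:

because the unique combinations of:

99  13-001
99  14-002

occur only 2 times.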

Answer 1

The condition does not hold for any of the combinations in this sample, but take this as a starting point.

  library(data.table)
  df <- structure(list(SpeciesID = c(99L, 99L, 99L, 99L, 99L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L)
 , IndID = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 5L, 5L, 6L, 7L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 11L, 11L, 12L, 13L, 13L, 14L), .Label = c("13-001", "14-002", "14-005", "14-006", "14-007", "14-008", "14-009", "16-001", "16-002", "16-003", "16-004", "16-005", "16-006", "16-007"), class = "factor")), class = "data.frame", row.names = c(NA, -26L))

  dt <- data.table(df)
  # Keep the rows whose (SpeciesID, IndID) combination appears at least 3 times
  dt[, .(SpeciesID, IndID, .N),
     by = list(Unique.ID = paste0(SpeciesID, IndID))][N >= 3, ]
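
The same filter can also be written by grouping on the two columns directly instead of pasting them into one key. The following is only a sketch, not the answerer's code: it rebuilds the question's sample data compactly, and `threshold` and `kept` are hypothetical names introduced for illustration.

```r
library(data.table)

# Sample data from the question, rebuilt with rep() (26 rows)
dt <- data.table(
  SpeciesID = rep(c(99L, 100L, 500L), times = c(5, 8, 13)),
  IndID = rep(
    c("13-001", "14-002", "14-005", "14-006", "14-007", "14-008", "14-009",
      "16-001", "16-002", "16-003", "16-004", "16-005", "16-006", "16-007"),
    times = c(2, 3, 3, 1, 2, 1, 1, 2, 3, 2, 2, 1, 2, 1)
  )
)

threshold <- 3L  # hypothetical name; minimum number of rows per combination

# .I holds the row numbers of the current group and .N its size, so
# .I[.N >= threshold] returns all rows of a large-enough group and none otherwise
keep_rows <- dt[, .I[.N >= threshold], by = .(SpeciesID, IndID)]$V1
kept <- dt[keep_rows]
```

Grouping on both columns avoids any chance of two different pairs collapsing into the same pasted string, and `dt` itself is left unmodified.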

Answer 2

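Using base R's ave, count the number of unique IndID values for each SpeciesID and keep only the SpeciesID groups that have at least 5 of them.

# ave() writes its result back into its first argument, so pass integer codes
# instead of the factor/character IndID to get a numeric count
df[ave(as.integer(factor(df$IndID)), df$SpeciesID, FUN = function(x) length(unique(x))) >= 5, ]
#   SpeciesID  IndID
#6        100 14-005
#7        100 14-005
#8        100 14-005
#9        100 14-006
#10       100 14-007
#11       100 14-007
#12       100 14-008
#13       100 14-009
#14       500 16-001
#15       500 16-001
#16       500 16-002
#17       500 16-002
#18       500 16-002
#19       500 16-003
#20       500 16-003
#21       500 16-004
#22       500 16-004
#23       500 16-005
#24       500 16-006
#25       500 16-006
#26       500 16-007

length(unique(x)) can also be replaced by n_distinct from dplyr:

library(dplyr)
df[ave(as.integer(factor(df$IndID)), df$SpeciesID, FUN = n_distinct) >= 5, ]

A more verbose, complete dplyr solution is also possible.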
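
A minimal sketch of what a complete dplyr version of this idea could look like, not necessarily the answerer's exact code. It rebuilds the question's sample data, and `min_individuals` and `kept` are hypothetical names introduced for illustration.

```r
library(dplyr)

# Sample data from the question, rebuilt with rep() (26 rows)
df <- data.frame(
  SpeciesID = rep(c(99L, 100L, 500L), times = c(5, 8, 13)),
  IndID = rep(
    c("13-001", "14-002", "14-005", "14-006", "14-007", "14-008", "14-009",
      "16-001", "16-002", "16-003", "16-004", "16-005", "16-006", "16-007"),
    times = c(2, 3, 3, 1, 2, 1, 1, 2, 3, 2, 2, 1, 2, 1)
  )
)

min_individuals <- 5  # hypothetical name; minimum distinct IndID per species

# Keep every row of a species that has at least min_individuals distinct IndID values
kept <- df %>%
  group_by(SpeciesID) %>%
  filter(n_distinct(IndID) >= min_individuals) %>%
  ungroup()
```

On the sample data this drops SpeciesID 99, which has only two distinct individuals, and keeps the rows for 100 and 500.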

Answer 3

You can use dplyr:

library(dplyr)

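Group the data by SpeciesID and IndID, use row_number() to count how often each combination occurs, and keep the groups whose maximum count is at least a given threshold: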

"SpeciesID   IndID
99          13-001
99          13-001
99          14-002
99          14-002
99          14-002
100         14-005
100         14-005
100         14-005
100         14-006
100         14-007
100         14-007
100         14-008
100         14-009
500         16-001
500         16-001
500         16-002
500         16-002
500         16-002
500         16-003
500         16-003
500         16-004
500         16-004
500         16-005
500         16-006
500         16-006
500         16-007" %>% 
  read.table(text = ., header = TRUE) %>% 
  group_by(SpeciesID, IndID) %>% 
  mutate(rn = row_number()) %>% 
  mutate(max = max(rn)) %>% 
  filter(max >= 3) %>% 
  select(SpeciesID, IndID)

Result (for a threshold of 3):

# A tibble: 9 x 2
# Groups:   SpeciesID, IndID [3]
  SpeciesID IndID 
      <int> <fct> 
1        99 14-002
2        99 14-002
3        99 14-002
4       100 14-005
5       100 14-005
6       100 14-005
7       500 16-002
8       500 16-002
9       500 16-002
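
The row_number()/max() pair computes the group size, so the same filter can be expressed more briefly with n(). This is a sketch under the same assumptions as the pipeline above; `threshold` and `kept` are hypothetical names, and the sample data is rebuilt here so the snippet runs on its own.

```r
library(dplyr)

# Sample data from the question, rebuilt with rep() (26 rows)
df <- data.frame(
  SpeciesID = rep(c(99L, 100L, 500L), times = c(5, 8, 13)),
  IndID = rep(
    c("13-001", "14-002", "14-005", "14-006", "14-007", "14-008", "14-009",
      "16-001", "16-002", "16-003", "16-004", "16-005", "16-006", "16-007"),
    times = c(2, 3, 3, 1, 2, 1, 1, 2, 3, 2, 2, 1, 2, 1)
  )
)

threshold <- 3  # hypothetical name; minimum number of rows per combination

# n() is the number of rows in the current (SpeciesID, IndID) group
kept <- df %>%
  group_by(SpeciesID, IndID) %>%
  filter(n() >= threshold) %>%
  ungroup()
```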

Omit IDs whose combined vector occurs fewer than x times
