Subset and group by in R

Time: 2019-04-28 20:43:19

Tags: r

I want to subset my data frame. Here is an example:

groups  names   col3
group1  Sp1 OK  
group1  Sp3 OK
group1  Sp7 OK
group1  Sp3 OK
group2  Sp1 OK
group2  Sp2 OK
group2  Sp3 OK
group3  Sp1 OK
group4  Sp1 OK
group4  Sp2 OK
group4  Sp2 OK

The idea is to keep only the groups that contain both Sp1 and Sp2, and to remove all the others.

Here, I should keep groups 2 and 4:

groups  names   col3
group2  Sp1 OK
group2  Sp2 OK
group2  Sp3 OK
group4  Sp1 OK
group4  Sp2 OK
group4  Sp2 OK

I tried something like this:

df2=df %>%
  group_by(groups) %>%
  df$names == "Sp1" & df$names == "Sp2"

But this does not seem to be the right approach.

Thanks for your help.

1 answer:

Answer 0: (score: 2)

We can use filter after the group_by step and check, with %in% and all, that the group contains both values:

library(dplyr)
df %>% 
   group_by(groups) %>%
   filter(all(c("Sp1", "Sp2") %in% names))
# A tibble: 6 x 3
# Groups:   groups [2]
#  groups names col3 
#  <chr>  <chr> <chr>
#1 group2 Sp1   OK   
#2 group2 Sp2   OK   
#3 group2 Sp3   OK   
#4 group4 Sp1   OK   
#5 group4 Sp2   OK   
#6 group4 Sp2   OK  

Or, in base R, use table together with subset:

subset(df, groups %in% names(which(!rowSums(!table(subset(df, 
        names %in% c("Sp1", "Sp2"), select = 1:2))))))
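The nested call above may be easier to follow when unpacked into steps. This is only a sketch: the intermediate names hits, counts, keep and result are introduced here for illustration, and the data frame is rebuilt inline (same values as in the Data section) so that the snippet runs on its own.

```r
# Same example data as in the question
df <- data.frame(
  groups = c("group1", "group1", "group1", "group1", "group2", "group2",
             "group2", "group3", "group4", "group4", "group4"),
  names  = c("Sp1", "Sp3", "Sp7", "Sp3", "Sp1", "Sp2", "Sp3",
             "Sp1", "Sp1", "Sp2", "Sp2"),
  col3   = "OK",
  stringsAsFactors = FALSE
)

# 1. Keep only the rows whose name is one of the required values,
#    and only the first two columns (groups, names)
hits <- subset(df, names %in% c("Sp1", "Sp2"), select = 1:2)

# 2. Count occurrences: one row per group, one column per name
counts <- table(hits)

# 3. !counts is TRUE where a count is zero, so a row sum of 0 means
#    the group has at least one occurrence of every counted name
keep <- names(which(!rowSums(!counts)))

# 4. Keep every row (including other names such as Sp3) of those groups
result <- subset(df, groups %in% keep)
result
```

One caveat with this approach: table only creates columns for names that actually occur in the data, so if one of the required values is absent from every group, the remaining column alone decides which groups pass. The filter(all(... %in% names)) version does not have this issue.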

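Note that the problem with using & is that it checks whether 'Sp1' and 'Sp2' both occur in the same row of 'names', which cannot happen because each row holds a single value. Instead, the logic should be whether both values can be found among the 'names' of a particular group.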
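The difference between the row-wise and the group-wise test can be seen in a short base R sketch. The names row_test and has_both are introduced here for illustration only, and the data frame is rebuilt inline so that the snippet is self-contained.

```r
# Same example data as in the question
df <- data.frame(
  groups = c("group1", "group1", "group1", "group1", "group2", "group2",
             "group2", "group3", "group4", "group4", "group4"),
  names  = c("Sp1", "Sp3", "Sp7", "Sp3", "Sp1", "Sp2", "Sp3",
             "Sp1", "Sp1", "Sp2", "Sp2"),
  col3   = "OK",
  stringsAsFactors = FALSE
)

# Row-wise test: each element is compared on its own, and a single
# value cannot equal both "Sp1" and "Sp2"
row_test <- df$names == "Sp1" & df$names == "Sp2"
any(row_test)  # FALSE: no row passes

# Group-wise test: split the names by group, then ask whether each
# group's names contain both required values
has_both <- vapply(split(df$names, df$groups),
                   function(x) all(c("Sp1", "Sp2") %in% x),
                   logical(1))
has_both

# Keep the rows belonging to the groups that pass
df[df$groups %in% names(has_both)[has_both], ]
```

This is the same check that filter(all(c("Sp1", "Sp2") %in% names)) performs after group_by(groups), written without dplyr.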

Data

df <- structure(list(groups = c("group1", "group1", "group1", "group1", 
"group2", "group2", "group2", "group3", "group4", "group4", "group4"
), names = c("Sp1", "Sp3", "Sp7", "Sp3", "Sp1", "Sp2", "Sp3", 
"Sp1", "Sp1", "Sp2", "Sp2"), col3 = c("OK", "OK", "OK", "OK", 
"OK", "OK", "OK", "OK", "OK", "OK", "OK")),
class = "data.frame", row.names = c(NA, 
-11L))