I would like to subset my data frame. Here is an example:
groups names col3
group1 Sp1 OK
group1 Sp3 OK
group1 Sp7 OK
group1 Sp3 OK
group2 Sp1 OK
group2 Sp2 OK
group2 Sp3 OK
group3 Sp1 OK
group4 Sp1 OK
group4 Sp2 OK
group4 Sp2 OK
The idea is to keep only the groups that contain both Sp1 and Sp2, and to drop all the others. Here, I should keep group2 and group4:
groups names col3
group2 Sp1 OK
group2 Sp2 OK
group2 Sp3 OK
group4 Sp1 OK
group4 Sp2 OK
group4 Sp2 OK
I tried something like this:
df2=df %>%
group_by(groups) %>%
df$names == "Sp1" & df$names == "Sp2"
But this does not seem to be the right approach.
Thanks for your help.
Answer 0 (score: 2)
We can use filter after the group_by step, with %in% and all, to check that each group contains both values:
library(dplyr)
df %>%
group_by(groups) %>%
filter(all(c("Sp1", "Sp2") %in% names))
# A tibble: 6 x 3
# Groups: groups [2]
# groups names col3
# <chr> <chr> <chr>
#1 group2 Sp1 OK
#2 group2 Sp2 OK
#3 group2 Sp3 OK
#4 group4 Sp1 OK
#5 group4 Sp2 OK
#6 group4 Sp2 OK
Or use base R with table and subset:
subset(df, groups %in% names(which(!rowSums(!table(subset(df,
names %in% c("Sp1", "Sp2"), select = 1:2))))))
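The nested call above can be hard to follow, so here is the same logic unpacked into separate steps, as a rough sketch. The intermediate names (`wanted`, `hits`, `counts`, `complete`, `result`) are introduced only for illustration, and the example `df` is rebuilt so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained
df <- data.frame(
  groups = rep(c("group1", "group2", "group3", "group4"), times = c(4, 3, 1, 3)),
  names  = c("Sp1", "Sp3", "Sp7", "Sp3", "Sp1", "Sp2", "Sp3", "Sp1", "Sp1", "Sp2", "Sp2"),
  col3   = "OK",
  stringsAsFactors = FALSE
)

wanted <- c("Sp1", "Sp2")  # illustrative name for the required values

# 1. Keep only rows whose name is one of the wanted values (first two columns)
hits <- subset(df, names %in% wanted, select = 1:2)

# 2. Count occurrences per group (rows) and per name (columns)
counts <- table(hits)

# 3. A group qualifies when none of its counts is zero
complete <- names(which(rowSums(counts == 0) == 0))

# 4. Keep every row that belongs to a qualifying group
result <- subset(df, groups %in% complete)
result
```

A group that contains neither value never appears in `counts`, so it is dropped as well.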
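Note that the problem with using & is that it checks whether 'Sp1' and 'Sp2' are both in the same row of 'names', which cannot happen because each row holds a single value. Instead, the logic should be whether both values can be found in the 'names' of a particular group.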
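To make the difference concrete, the sketch below compares the row-wise test with the group-wise one on the example data. It uses base R's `tapply` instead of dplyr purely so that it runs without extra packages; `row_test`, `has_both` and `kept` are illustrative names.

```r
# Rebuild the example data so this block is self-contained
df <- data.frame(
  groups = rep(c("group1", "group2", "group3", "group4"), times = c(4, 3, 1, 3)),
  names  = c("Sp1", "Sp3", "Sp7", "Sp3", "Sp1", "Sp2", "Sp3", "Sp1", "Sp1", "Sp2", "Sp2"),
  col3   = "OK",
  stringsAsFactors = FALSE
)

# Row-wise test: one value cannot equal both "Sp1" and "Sp2",
# so every element of this vector is FALSE
row_test <- df$names == "Sp1" & df$names == "Sp2"
any(row_test)  # FALSE

# Group-wise test: are both values present somewhere within each group?
has_both <- tapply(df$names, df$groups,
                   function(x) all(c("Sp1", "Sp2") %in% x))
has_both  # TRUE only for group2 and group4

# Keep the rows of the groups that pass the group-wise test
kept <- df[df$groups %in% names(has_both)[has_both], ]
kept
```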
Data:
df <- structure(list(groups = c("group1", "group1", "group1", "group1",
"group2", "group2", "group2", "group3", "group4", "group4", "group4"
), names = c("Sp1", "Sp3", "Sp7", "Sp3", "Sp1", "Sp2", "Sp3",
"Sp1", "Sp1", "Sp2", "Sp2"), col3 = c("OK", "OK", "OK", "OK",
"OK", "OK", "OK", "OK", "OK", "OK", "OK")),
class = "data.frame", row.names = c(NA,
-11L))