I have two data frames:
dat1 <- data.frame(group= c(11,11,12,12,13,13,14,14,15,15,16,16,17,17,17,18,18,18),name= c("A","B","C","D","E","F","G","H","I","J","A","B","E","F","W","A","B","V"))
dat2 <- data.frame(ID=c(1,1,2,2,3,3),name =c("A","B","E","F","X","Y"))
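The second data frame holds combinations of two values, grouped by the ID column. Based on dat2, rows need to be removed from the first data frame (dat1) whenever a group in dat1 contains one of the combinations in dat2.
For example: if a group in dat1 contains both "A" and "B", that group's rows should be removed.
So the desired output is: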
# "G" and "H" belong to group 14 in dat1, so the second group here is 14
desiredat <- data.frame(group= c(12,12,14,14,15,15),name= c("C","D","G","H","I","J"))
I am looking for a way to do this in R.
Answer 0 (score: 0)
Something like this?
dat1[dat1$name %in% setdiff(dat1$name, dat2$name), ]
   group name
3     12    C
4     12    D
7     14    G
8     14    H
9     15    I
10    15    J
15    17    W
18    18    V
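The output above still contains rows 15 (W) and 18 (V), because the filter works on individual names rather than on whole groups. Below is a minimal base R sketch of a group-level filter. It assumes that a group should be dropped whenever it contains every name of at least one ID in dat2. The helper name `has_full_combo` is hypothetical and not part of the original answer.

```r
dat1 <- data.frame(
  group = c(11,11,12,12,13,13,14,14,15,15,16,16,17,17,17,18,18,18),
  name  = c("A","B","C","D","E","F","G","H","I","J","A","B","E","F","W","A","B","V"),
  stringsAsFactors = FALSE
)
dat2 <- data.frame(
  ID   = c(1,1,2,2,3,3),
  name = c("A","B","E","F","X","Y"),
  stringsAsFactors = FALSE
)

# Names of each combination in dat2, keyed by ID
combos <- split(dat2$name, dat2$ID)

# Hypothetical helper: TRUE if `nms` contains all names of at least one combination
has_full_combo <- function(nms) {
  any(vapply(combos, function(combo) all(combo %in% nms), logical(1)))
}

# Evaluate the helper once per group (result is named by group)
drop_group <- vapply(split(dat1$name, dat1$group), has_full_combo, logical(1))

# Map the per-group decision back onto the rows and keep the rest
keep   <- !unname(drop_group[as.character(dat1$group)])
result <- dat1[keep, ]
result
```

With the sample data from the question, this keeps only groups 12, 14 and 15, so the W and V rows are dropped together with the rest of groups 17 and 18.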
Answer 1 (score: 0)
This can be solved with an anti-join. First, however, we need to identify which groups must be removed from dat1.
library(data.table)
# count names per ID
setDT(dat2)[, n.id := .N, by = ID]
# identify groups to remove by joining and ...
groups_to_remove <- dat2[setDT(dat1), on = "name", nomatch = 0L][
  # ... check which groups have a match with the complete set of names
  , which(n.id == .N), by = .(ID, group)]
# anti join
dat1[!groups_to_remove, on = "group"]
   group name
1:    12    C
2:    12    D
3:    14    G
4:    14    H
5:    15    I
6:    15    J
7:    19    A
8:    19    X
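Group 19 is not removed because the names "A" and "X" belong to different IDs in dat2. (Group 19 is an extra test case added to dat1; see the data at the end of this answer.)
A more streamlined approach uses all() instead of counting names: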
library(data.table)
setDT(dat1)
setDT(dat2)
groups_to_remove <- dat1[dat2, on = "name"][, which(all(ID == ID[1])), by = group]
dat1[!groups_to_remove, on = "group"]
   group name
1:    12    C
2:    12    D
3:    14    G
4:    14    H
5:    15    I
6:    15    J
7:    19    A
8:    19    X
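One caveat worth checking: the all() condition only tests that the matched names of a group share a single ID. It does not test that the complete combination is present. The base R sketch below contrasts the two checks on a hypothetical group 20 that holds "A" (half of ID 1) and an unrelated "C". Group 20 and the variable names `same_id`, `complete`, `n_id` and `n_match` are assumptions for illustration, and merge() stands in for the data.table join.

```r
# Group 19 as in this answer, plus a hypothetical group 20 with only "A" from ID 1
dat1 <- data.frame(group = c(19, 19, 20, 20),
                   name  = c("A", "X", "A", "C"),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(ID   = c(1, 1, 2, 2, 3, 3),
                   name = c("A", "B", "E", "F", "X", "Y"),
                   stringsAsFactors = FALSE)

# Number of names in each combination of dat2
dat2$n_id <- ave(seq_along(dat2$ID), dat2$ID, FUN = length)

# Inner join: keeps only the dat1 rows whose name occurs in dat2
m <- merge(dat1, dat2, by = "name")

# Check 1 (all() variant): do all matched names of a group share one ID?
same_id <- vapply(split(m$ID, m$group),
                  function(id) all(id == id[1]), logical(1))

# Check 2 (counting variant): does some (group, ID) pair match
# as many names as that ID has in dat2?
n_match  <- ave(seq_along(m$ID), m$group, m$ID, FUN = length)
complete <- vapply(split(n_match == m$n_id, m$group), any, logical(1))

same_id
complete
```

In this sketch `same_id` is TRUE for group 20 although the group contains only "A", while `complete` is FALSE for both groups. If the requirement is that both names of a combination must be present, the counting variant is the safer of the two.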
The same in dplyr syntax:
library(dplyr)
dat2 %>%
  left_join(dat1, by = "name") %>%
  group_by(group) %>%
  summarise(all_have_same_id = all(ID == ID[1L])) %>%
  filter(all_have_same_id) %>%
  anti_join(dat1, ., by = "group")
  group name
1    12    C
2    12    D
3    14    G
4    14    H
5    15    I
6    15    J
7    19    A
8    19    X
Warning message:
Column `name` joining factors with different levels, coercing to character vector
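The sample dataset dat1 provided by the OP consists of groups in which either none of the names appear in dat2, or all names of one ID in dat2 appear (possibly plus one additional name). However, it lacks a use case in which a group contains only one name from an ID in dat2. I have therefore added such a case (as group 19):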
dat1 <- data.frame(
group= c(11,11,12,12,13,13,14,14,15,15,16,16,17,17,17,18,18,18,19,19),
name= c("A","B","C","D","E","F","G","H","I","J","A","B","E","F","W","A","B","V","A","X"))
dat2 <- data.frame(ID=c(1,1,2,2,3,3),name =c("A","B","E","F","X","Y"))