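I'm trying to solve a tricky R problem that I haven't been able to solve by Googling keywords. Specifically, I'm trying to take the subset of one data frame whose rows don't appear in another data frame. Here is an example: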
> test
number fruit ID1 ID2
item1 "number1" "apples" "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples" "12" "13"
> test2
number fruit ID1 ID2
item1 "number1" "papayas" "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples" "123" "13"
item5 "number3" "peaches" "44" "25"
item6 "number4" "apples" "12" "13"
item7 "number1" "apples" "22" "33"
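I have two data frames, test and test2, and the goal is to select every entire row of test2 that does not appear in test, even though some of the individual values may be the same.
The output I want looks like this: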
item1 "number1" "papayas" "22" "33"
item2 "number3" "peaches" "441" "25"
item3 "number4" "apples" "123" "13"
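There can be an arbitrary number of rows or columns, but in my specific case one data frame is a direct subset of the other.
I've used the R subset(), merge() and which() functions extensively, but I can't figure out how to combine them, if that's even possible, to get what I want.
Edit: Here's the R code I used to generate the two tables.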
test <- data.frame(c("number1", "apples", 22, 33), c("number2", "oranges", 13, 33),
c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13))
test <- t(test)
rownames(test) = c("item1", "item2", "item3", "item4")
colnames(test) = c("number", "fruit", "ID1", "ID2")
test2 <- data.frame(data.frame(c("number1", "papayas", 22, 33), c("number2", "oranges", 13, 33),
c("number3", "peaches", 441, 25), c("number4", "apples", 123, 13),c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13) ))
test2 <- t(test2)
rownames(test2) = c("item1", "item2", "item3", "item4", "item5", "item6")
colnames(test2) = c("number", "fruit", "ID1", "ID2")
Thanks in advance!
Answer 0: (score: 15)
Here's another way:
x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
# number fruit ID1 ID2
# item1 number1 papayas 22 33
# item3 number3 peaches 441 25
# item4 number4 apples 123 13
Edit: modified to preserve row names.
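One way to read the one-liner above is to split its mask into two named pieces. This is only a sketch: it assumes test and test2 are rebuilt as character data frames (the question builds matrices), and the names in_test2 and has_later_copy are made up for illustration.

```r
# Rebuild the question's data as data frames (assumption: all columns are character).
test <- data.frame(
  number = c("number1", "number2", "number3", "number4"),
  fruit  = c("apples", "oranges", "peaches", "apples"),
  ID1    = c("22", "13", "44", "12"),
  ID2    = c("33", "33", "25", "13"),
  row.names = paste0("item", 1:4), stringsAsFactors = FALSE
)
test2 <- data.frame(
  number = c("number1", "number2", "number3", "number4", "number3", "number4"),
  fruit  = c("papayas", "oranges", "peaches", "apples", "peaches", "apples"),
  ID1    = c("22", "13", "441", "123", "44", "12"),
  ID2    = c("33", "33", "25", "13", "25", "13"),
  row.names = paste0("item", 1:6), stringsAsFactors = FALSE
)

# Stack test2 on top of test, so any copy found in test comes later in x.
x <- rbind(test2, test)

# TRUE for the rows of x that came from test2.
in_test2 <- seq_len(nrow(x)) <= nrow(test2)

# TRUE for rows that have an identical row further down in x.
has_later_copy <- duplicated(x, fromLast = TRUE)

# Keep the test2 rows with no later copy, i.e. no match in test.
result <- x[in_test2 & !has_later_copy, ]
print(result)
```

Because test2 comes first in the rbind(), its rows keep their original row names in the result.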
Answer 1: (score: 4)
There are two ways to solve this, using data.table and sqldf.
library(data.table)
test<- fread('
item number fruit ID1 ID2
item1 "number1" "apples" "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples" "12" "13"
')
test2<- fread('
item number fruit ID1 ID2
item1 "number1" "papayas" "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples" "123" "13"
item5 "number3" "peaches" "44" "25"
item6 "number4" "apples" "12" "13"
item7 "number1" "apples" "22" "33"
')
The data.table approach, which lets you choose which columns to compare:
setkey(test,item,number,fruit,ID1,ID2)
setkey(test2,item,number,fruit,ID1,ID2)
test[!test2]
item number fruit ID1 ID2
1: item1 number1 apples 22 33
2: item3 number3 peaches 44 25
3: item4 number4 apples 12 13
The SQL approach:
library(sqldf)
sqldf('select * from test except select * from test2')
item number fruit ID1 ID2
1: item1 number1 apples 22 33
2: item3 number3 peaches 44 25
3: item4 number4 apples 12 13
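The question asks for the other direction (rows of test2 that are absent from test) and treats the item label as a row name rather than as data. A minimal sketch of that variant follows, assuming data.table is installed and using its anti-join form with on= restricted to the four data columns. The names test_dt, test2_dt and cols are illustrative, not from the answer.

```r
library(data.table)

# Same data as in this answer, built directly instead of via fread().
test_dt <- data.table(
  item   = paste0("item", 1:4),
  number = c("number1", "number2", "number3", "number4"),
  fruit  = c("apples", "oranges", "peaches", "apples"),
  ID1    = c(22L, 13L, 44L, 12L),
  ID2    = c(33L, 33L, 25L, 13L)
)
test2_dt <- data.table(
  item   = paste0("item", 1:7),
  number = c("number1", "number2", "number3", "number4",
             "number3", "number4", "number1"),
  fruit  = c("papayas", "oranges", "peaches", "apples",
             "peaches", "apples", "apples"),
  ID1    = c(22L, 13L, 441L, 123L, 44L, 12L, 22L),
  ID2    = c(33L, 33L, 25L, 13L, 25L, 13L, 33L)
)

# Compare on the data columns only; the item label is ignored.
cols <- c("number", "fruit", "ID1", "ID2")

# Anti-join: rows of test2_dt with no match in test_dt on those columns.
only_in_test2 <- test2_dt[!test_dt, on = cols]
print(only_in_test2)
```

Passing on= avoids having to call setkey() on both tables and makes the compared columns explicit.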
Answer 2: (score: 2)
The following may help you:
rows <- unique(unlist(mapply(function(x, y)
sapply(setdiff(x, y), function(d) which(x==d)), test2, test)))
test2[rows, ]
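What's happening here is:
mapply does a column-wise comparison between the two datasets.
setdiff finds any items that are in the former but not in the latter.
which identifies the rows of the former that hold those items.
unique(unlist(....)) collects all the unique row indices.
We then use that as a filter on the former, i.e. test2: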
number fruit ID1 ID2
item1 number1 papayas 22 33
item3 number3 peaches 441 25
item4 number4 apples 123 13
Make sure your test & test2 are data.frames and not matrices, because mapply iterates over each element of a matrix but over each column of a data.frame.
test <- as.data.frame(test, stringsAsFactors=FALSE)
test2 <- as.data.frame(test2, stringsAsFactors=FALSE)
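To make the column-wise steps easier to follow, here is a small sketch that exposes the intermediate result. It assumes test and test2 are character data frames, and it swaps the inner sapply()/which(x == d) for the equivalent which(!(x %in% y)). The names per_column and rows are illustrative.

```r
test <- data.frame(
  number = c("number1", "number2", "number3", "number4"),
  fruit  = c("apples", "oranges", "peaches", "apples"),
  ID1    = c("22", "13", "44", "12"),
  ID2    = c("33", "33", "25", "13"),
  row.names = paste0("item", 1:4), stringsAsFactors = FALSE
)
test2 <- data.frame(
  number = c("number1", "number2", "number3", "number4", "number3", "number4"),
  fruit  = c("papayas", "oranges", "peaches", "apples", "peaches", "apples"),
  ID1    = c("22", "13", "441", "123", "44", "12"),
  ID2    = c("33", "33", "25", "13", "25", "13"),
  row.names = paste0("item", 1:6), stringsAsFactors = FALSE
)

# Step 1: for each column, the values found in test2 but not in test.
per_column <- mapply(setdiff, test2, test, SIMPLIFY = FALSE)
str(per_column)

# Step 2: for each column, the test2 row positions holding such values,
# then flatten and de-duplicate the positions.
rows <- sort(unique(unlist(
  mapply(function(x, y) which(!(x %in% y)), test2, test, SIMPLIFY = FALSE)
)))

# Step 3: filter test2 by those positions.
print(test2[rows, ])
```

Because the comparison is done column by column, a row of test2 is only picked up if at least one of its values is absent from the corresponding column of test.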
Answer 3: (score: 1)
Create a new row-ID column in test2, merge the data frames, and select the rows whose ID is not in the merged result.
test2 <- cbind(test2, id=seq_len(nrow(test2)))
matches <- merge(test, test2)$id
test2 <- test2[-matches, ]
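One thing to watch with negative indexing: if merge() finds no matching rows, the ID vector is empty, and indexing with a negated empty vector selects zero rows rather than all of them. The sketch below shows a logical-index variant that sidesteps this. It assumes test and test2 are data frames, and the names test2_id, remaining and no_match are illustrative.

```r
test <- data.frame(
  number = c("number1", "number2", "number3", "number4"),
  fruit  = c("apples", "oranges", "peaches", "apples"),
  ID1    = c("22", "13", "44", "12"),
  ID2    = c("33", "33", "25", "13"),
  row.names = paste0("item", 1:4), stringsAsFactors = FALSE
)
test2 <- data.frame(
  number = c("number1", "number2", "number3", "number4", "number3", "number4"),
  fruit  = c("papayas", "oranges", "peaches", "apples", "peaches", "apples"),
  ID1    = c("22", "13", "441", "123", "44", "12"),
  ID2    = c("33", "33", "25", "13", "25", "13"),
  row.names = paste0("item", 1:6), stringsAsFactors = FALSE
)

# Tag every row of test2 with its position.
test2_id <- cbind(test2, id = seq_len(nrow(test2)))

# merge() joins on the shared columns; the surviving ids are the matched rows.
matches <- merge(test, test2_id)$id

# Logical indexing keeps the rows whose id was not matched.
remaining <- test2_id[!(test2_id$id %in% matches), ]
print(remaining)

# Edge case: with no matches at all, negative indexing drops everything,
# while the logical form keeps every row.
no_match <- integer(0)
nrow(test2_id[-no_match, ])
nrow(test2_id[!(test2_id$id %in% no_match), ])
```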
Answer 4: (score: 1)
Here's another approach, but I'm not sure how well it scales.
test2[!apply(test2, 1, paste, collapse = "") %in%
apply(test, 1, paste, collapse = ""), ]
# number fruit ID1 ID2
# item1 "number1" "papayas" "22" "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples" "123" "13"
This will not drop duplicates. Compare, for example, what happens if test2 contains duplicates:
test2 <- rbind(test2, test2[1:3, ])
## Matthew's answer: Duplicates dropped
x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
# number fruit ID1 ID2
# item4 "number4" "apples" "123" "13"
# item1 "number1" "papayas" "22" "33"
# item3 "number3" "peaches" "441" "25"
## This one: Duplicates retained
test2[!apply(test2, 1, paste, collapse = "") %in%
apply(test, 1, paste, collapse = ""), ]
# number fruit ID1 ID2
# item1 "number1" "papayas" "22" "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples" "123" "13"
# item1 "number1" "papayas" "22" "33"
# item3 "number3" "peaches" "441" "25"
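A side note on building row keys with paste(): with collapse = "" two different rows can produce the same string. The sketch below shows such a collision on made-up one-row matrices and a variant that uses a separator. It assumes the separator "|" never occurs in the data; row_key is a hypothetical helper name.

```r
# Two different rows that collapse to the same string "abc".
a <- matrix(c("ab", "c"), nrow = 1)
b <- matrix(c("a", "bc"), nrow = 1)

# With collapse = "" the rows look identical: a false match.
collides <- apply(a, 1, paste, collapse = "") %in%
  apply(b, 1, paste, collapse = "")

# Hypothetical helper: one key per row, with a separator between fields.
row_key <- function(m) apply(m, 1, paste, collapse = "|")

# With the separator the two rows stay distinct.
still_collides <- row_key(a) %in% row_key(b)

print(c(collides, still_collides))
```

The same helper can be dropped into the filter above, for example test2[!row_key(test2) %in% row_key(test), ].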
Answer 5: (score: 1)
With the dplyr package, you can also use anti_join.
missing.species <- anti_join(test2, test, by = NULL)
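It returns the rows of test2 that have no match in test. The by argument names the variables to join on explicitly; if it is NULL, the function uses all variables that test and test2 have in common.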
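A minimal sketch of the same call with the join columns spelled out, assuming dplyr is installed and that both inputs are data frames rather than the matrices built in the question. The names test_df, test2_df, cols and missing_rows are illustrative.

```r
library(dplyr)

test_df <- data.frame(
  number = c("number1", "number2", "number3", "number4"),
  fruit  = c("apples", "oranges", "peaches", "apples"),
  ID1    = c("22", "13", "44", "12"),
  ID2    = c("33", "33", "25", "13"),
  stringsAsFactors = FALSE
)
test2_df <- data.frame(
  number = c("number1", "number2", "number3", "number4", "number3", "number4"),
  fruit  = c("papayas", "oranges", "peaches", "apples", "peaches", "apples"),
  ID1    = c("22", "13", "441", "123", "44", "12"),
  ID2    = c("33", "33", "25", "13", "25", "13"),
  stringsAsFactors = FALSE
)

# Naming the columns in `by` avoids the "Joining with ..." message
# and makes the comparison explicit.
cols <- c("number", "fruit", "ID1", "ID2")

# Rows of test2_df that have no match in test_df on those columns.
missing_rows <- anti_join(test2_df, test_df, by = cols)
print(missing_rows)
```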