R: select all rows from a data frame that do not appear in another data frame

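Time: 2013-07-02 14:13:37

Tags: r dataframe subset

I'm trying to solve a tricky R problem that I haven't been able to crack by Googling keywords. Specifically, I'm trying to take the subset of one data frame whose rows do not appear in another data frame. Here is an example: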

> test
      number    fruit     ID1  ID2 
item1 "number1" "apples"  "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples"  "12" "13"
> test2
      number    fruit     ID1   ID2 
item1 "number1" "papayas" "22"  "33"
item2 "number2" "oranges" "13"  "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples"  "123" "13"
item5 "number3" "peaches" "44"  "25"
item6 "number4" "apples"  "12"  "13"
item7 "number1" "apples"  "22"  "33"

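I have two data frames, test and test2, and the goal is to select all entire rows of test2 that do not appear in test, even though some of their individual values may be the same.

The output I want looks like this: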

item1 "number1" "papayas" "22"  "33"
item2 "number3" "peaches" "441" "25"
item3 "number4" "apples"  "123" "13"

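There may be any number of rows or columns, but in my particular case one data frame is a direct subset of the other.

I've used the R subset(), merge() and which() functions extensively, but I can't figure out how to combine them, if that is even possible, to get what I want.

Edit: here is the R code I used to generate the two tables.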

test <- data.frame(c("number1", "apples", 22, 33), c("number2", "oranges", 13, 33),
    c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13))

test <- t(test)
rownames(test) = c("item1", "item2", "item3", "item4")
colnames(test) = c("number", "fruit", "ID1", "ID2")

test2 <- data.frame(c("number1", "papayas", 22, 33), c("number2", "oranges", 13, 33),
    c("number3", "peaches", 441, 25), c("number4", "apples", 123, 13),
    c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13))

test2 <- t(test2)
rownames(test2) = c("item1", "item2", "item3", "item4", "item5", "item6")
colnames(test2) = c("number", "fruit", "ID1", "ID2")

Thanks in advance!

6 answers:

Answer 0 (score: 15)

Here is another way:

x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#        number   fruit ID1 ID2
# item1 number1 papayas  22  33
# item3 number3 peaches 441  25
# item4 number4  apples 123  13

Edit: modified to preserve the row names.
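To see why this works, here is a minimal, self-contained sketch of the same idea on data frames that mirror the question's tables. The helper name rows_not_in is hypothetical (it is not part of the answer), and the ID columns are assumed to be stored as character strings.

```r
# Data frames mirroring the question's tables (IDs kept as character strings).
test <- data.frame(
  number = c("number1", "number2", "number3", "number4"),
  fruit  = c("apples", "oranges", "peaches", "apples"),
  ID1    = c("22", "13", "44", "12"),
  ID2    = c("33", "33", "25", "13"),
  stringsAsFactors = FALSE
)
test2 <- data.frame(
  number = c("number1", "number2", "number3", "number4", "number3", "number4", "number1"),
  fruit  = c("papayas", "oranges", "peaches", "apples", "peaches", "apples", "apples"),
  ID1    = c("22", "13", "441", "123", "44", "12", "22"),
  ID2    = c("33", "33", "25", "13", "25", "13", "33"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows of `a` that do not occur anywhere in `b`.
rows_not_in <- function(a, b) {
  x <- rbind(a, b)                                   # stack b underneath a
  has_later_copy <- duplicated(x, fromLast = TRUE)   # TRUE if an identical row occurs further down
  from_a <- seq_len(nrow(x)) <= nrow(a)              # TRUE for the rows that came from a
  x[!has_later_copy & from_a, , drop = FALSE]
}

result <- rows_not_in(test2, test)
print(result)
```

Because b is stacked last, every row of a that also occurs in b has a later copy and is filtered out. A row that is repeated within a is kept only once (its last occurrence), for the same reason.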

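Answer 1 (score: 4)

Using data.table and sqldf

There are two ways to solve this. First, load the packages and the data:

library(data.table)
library(sqldf)  # needed for the SQL approach further down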
test<- fread('
item number fruit ID1 ID2 
item1 "number1" "apples"  "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples"  "12" "13"
')
test2<- fread('
item number fruit ID1 ID2 
item1 "number1" "papayas" "22"  "33"
item2 "number2" "oranges" "13"  "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples"  "123" "13"
item5 "number3" "peaches" "44"  "25"
item6 "number4" "apples"  "12"  "13"
item7 "number1" "apples"  "22"  "33"
')

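The data.table approach, which lets you choose which columns to compare (the key columns). Note that test[!test2] returns the rows of test that have no match in test2; to get the rows of test2 that are missing from test, swap the two tables and leave item out of the keys, since the row labels differ between the tables.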

setkey(test,item,number,fruit,ID1,ID2)
setkey(test2,item,number,fruit,ID1,ID2)
test[!test2]
    item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13

SQL approach

sqldf('select * from test except select * from test2')
    item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13

Answer 2 (score: 2)

The following may help:

rows <- unique(unlist(mapply(function(x, y) 
          sapply(setdiff(x, y), function(d) which(x==d)), test2, test)))
test2[rows, ]

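What happens here is:

  • mapply compares the two datasets column by column.
  • setdiff finds any items that are in the former but not in the latter.
  • which identifies the rows of the former that hold those items.
  • unique(unlist(....)) collects all the unique row indices.
  • We then use these as a filter on the former, i.e. test2.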
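The steps in the list above can be traced on a small example. This is a sketch on made-up data frames a, a2 and b (hypothetical names, not from the original post), and it uses which(x %in% vals) as a slightly simplified stand-in for the sapply/which step. It also illustrates a limitation: because the comparison is done column by column, a row whose individual values each occur somewhere in the other table is not flagged, even if that combination never occurs there as a row.

```r
a <- data.frame(fruit = c("apples", "papayas", "oranges"),
                id    = c("22", "22", "13"),
                stringsAsFactors = FALSE)
b <- data.frame(fruit = c("apples", "oranges"),
                id    = c("22", "13"),
                stringsAsFactors = FALSE)

# Step 1: per column, the values present in a but absent from b.
new_values <- mapply(setdiff, a, b, SIMPLIFY = FALSE)

# Step 2: per column, the rows of a that hold such a value.
hits <- mapply(function(x, vals) which(x %in% vals), a, new_values, SIMPLIFY = FALSE)

# Step 3: pool the row indices across columns and filter a.
rows <- sort(unique(unlist(hits)))
print(a[rows, ])   # the "papayas" row

# Limitation: each value of this row occurs somewhere in b, so no column
# reports a difference, although the row (apples, 13) is not a row of b.
a2 <- data.frame(fruit = "apples", id = "13", stringsAsFactors = FALSE)
missed <- unlist(mapply(setdiff, a2, b, SIMPLIFY = FALSE))
length(missed)     # 0
```

If whole-row comparison is what you need, the row-wise approaches in the other answers avoid this limitation.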

Result:

       number   fruit ID1 ID2
item1 number1 papayas  22  33
item3 number3 peaches 441  25
item4 number4  apples 123  13

Edit:

Make sure test and test2 are data.frames rather than matrices, because mapply iterates over each element of a matrix but over each column of a data.frame.

test  <- as.data.frame(test,  stringsAsFactors=FALSE)
test2 <- as.data.frame(test2, stringsAsFactors=FALSE)

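Answer 3 (score: 1)

Create a new row-ID column in test2, merge the data frames, and select the rows whose ID is not in the merged result.

# assumes test and test2 are data frames
test2 <- cbind(test2, id=seq_len(nrow(test2)))

matches <- merge(test, test2)$id

# %in% also works when nothing matches; test2[-matches, ] would then drop every row
test2 <- test2[!test2$id %in% matches, ]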
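As a worked illustration of the row-ID idea, here is a minimal sketch on made-up miniature tables (not the question's data). The names tagged, matched_ids and remaining are hypothetical. It also shows why negative indexing is fragile when the merge finds no matches.

```r
# Made-up miniature tables; both are data frames with character columns.
test  <- data.frame(fruit = c("apples", "oranges"), ID1 = c("22", "13"),
                    stringsAsFactors = FALSE)
test2 <- data.frame(fruit = c("papayas", "oranges", "apples"), ID1 = c("22", "13", "22"),
                    stringsAsFactors = FALSE)

tagged <- cbind(test2, id = seq_len(nrow(test2)))   # tag every row of test2
matched_ids <- merge(test, tagged)$id               # tags of rows that also occur in test

# Keep the rows whose tag was not matched, then drop the helper column.
remaining <- tagged[!tagged$id %in% matched_ids, names(test2), drop = FALSE]
print(remaining)   # the papayas row only

# Edge case: with nothing matched, negative indexing selects no rows at all.
none <- integer(0)
nrow(tagged[-none, ])                  # 0
nrow(tagged[!tagged$id %in% none, ])   # 3
```

merge() joins on all columns the two tables have in common, which is why the helper id column has to exist only in the tagged copy.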

Answer 4 (score: 1)

Here is another approach, but I'm not sure how well it scales.

test2[!apply(test2, 1, paste, collapse = "") %in% 
        apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"

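Unlike the duplicated() approach, this one keeps duplicated rows of test2. Compare, for instance, what happens if test2 has duplicates: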

test2 <- rbind(test2, test2[1:3, ])

## Matthew's answer: Duplicates dropped
x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#       number    fruit     ID1   ID2 
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"

## This one: Duplicates retained
test2[!apply(test2, 1, paste, collapse = "") %in%
  apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
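One caveat with pasting cells together using collapse = "": different rows can produce the same string. The sketch below, on made-up one-row matrices a and b, shows such a collision and one possible way around it, a separator that is assumed never to occur in the data (the choice of "\r" here is only an assumption).

```r
a <- matrix(c("1", "23"), nrow = 1)
b <- matrix(c("12", "3"), nrow = 1)

# Without a separator both rows collapse to "123", so a looks present in b.
key_a <- apply(a, 1, paste, collapse = "")
key_b <- apply(b, 1, paste, collapse = "")
key_a %in% key_b            # TRUE, a false match

# With a separator the keys stay distinct.
key_a_sep <- apply(a, 1, paste, collapse = "\r")
key_b_sep <- apply(b, 1, paste, collapse = "\r")
key_a_sep %in% key_b_sep    # FALSE
```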

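Answer 5 (score: 1)

Using the dplyr package, you can also use anti_join.

library(dplyr)

# assumes test and test2 are data frames, not matrices
missing.species <- anti_join(test2, test, by = NULL)

It returns the rows of test2 that have no match in test. The by argument names the variables to join on explicitly; if it is NULL, the function uses all variables that test and test2 have in common.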