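What is the fastest function in R for removing rows from a data frame when the same values in the first two columns also appear in another data frame? For example, suppose data frame A looks like this (plus more information columns):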
NAME SURENAME
John Beer
Rose Pitt
Bob Kin
Charile Kind
Smith Red
Brad Tea
Kale Joe
Ana Bread
Lauren Old
Mike Karl
and B looks like this:
NAME SURENAME
Rose Pitt
Smith Red
Mike Karl
I would like to remove the rows of B from A, like this:
NAME SURENAME
John Beer
Bob Kin
Charile Kind
Brad Tea
Kale Joe
Ana Bread
Lauren Old
In my case, A has 2 million rows (and 10 other columns), while B has 200,000 rows (all unique name and surname combinations).
Answer 0 (score: 1)
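Perhaps you can use setdiff() from the dplyr package. Note that setdiff() compares whole rows, so A and B need to have the same columns. Try the code below, but check how fast it is on large data frames (I am not sure how it performs at that size):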
C <- dplyr::setdiff(A,B)
which gives
> C
NAME SURENAME
1 John Beer
2 Bob Kin
3 Charile Kind
4 Brad Tea
5 Kale Joe
6 Ana Bread
7 Lauren Old
Data
A <- structure(list(NAME = c("John", "Rose", "Bob", "Charile", "Smith",
"Brad", "Kale", "Ana", "Lauren", "Mike"), SURENAME = c("Beer",
"Pitt", "Kin", "Kind", "Red", "Tea", "Joe", "Bread", "Old", "Karl"
)), class = "data.frame", row.names = c(NA, -10L))
B <- structure(list(NAME = c("Rose", "Smith", "Mike"), SURENAME = c("Pitt",
"Red", "Karl")), class = "data.frame", row.names = c(NA, -3L))
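Because setdiff() compares entire rows, it does not directly cover the case in the question, where A carries extra columns that B lacks. A minimal sketch for that case, assuming dplyr is installed, is an anti-join on the two key columns. The AGE column and the shortened sample below are hypothetical stand-ins for the additional information columns.

```r
library(dplyr)

# A with one hypothetical extra column (AGE) standing in for the real ones
A <- data.frame(
  NAME     = c("John", "Rose", "Bob", "Smith", "Mike"),
  SURENAME = c("Beer", "Pitt", "Kin", "Red", "Karl"),
  AGE      = c(31, 45, 27, 52, 38),
  stringsAsFactors = FALSE
)
B <- data.frame(
  NAME     = c("Rose", "Smith", "Mike"),
  SURENAME = c("Pitt", "Red", "Karl"),
  stringsAsFactors = FALSE
)

# Keep the rows of A whose (NAME, SURENAME) pair has no match in B;
# all other columns of A are carried through unchanged
C <- anti_join(A, B, by = c("NAME", "SURENAME"))
print(C)
```

The by argument restricts the comparison to the name columns, so the extra columns in A do not need to exist in B.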
Answer 1 (score: 1)
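I ran a benchmark filtering a 2-million-row data frame against one of 200,000 rows, as described in the original post, and it clearly shows how much faster data.table is than dplyr. Because the dplyr functions took so long to run, setdiff in particular, I ran each test only once. Note that the code below draws 10,000,000 values for column a and 100,000,000 for column b, so the timings come from a table larger than the one in the question.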
rbenchmark::benchmark(
"dplyr_anti_join" = {
set.seed(1)
df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
b = runif(100000000, 1, 200000))
indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
b = 1:200000)
dplyr::anti_join(df, indices, by = c("a", "b"))
},
"dplyr_set_diff" = {
set.seed(1)
df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
b = runif(100000000, 1, 200000))
indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
b = 1:200000)
dplyr::setdiff(df, indices)
},
"dt" = {
set.seed(1)
library(data.table)
df <- data.table(a = letters[runif(10000000, min = 1, max = 26)],
b = runif(100000000, 1, 200000))
indices <- data.table(a = letters[runif(200000, min = 1, max = 26)],
b = 1:200000)
fsetdiff(df, indices)
},
replications = 1
)
#> test replications elapsed relative user.self sys.self user.child sys.child
#> 1 dplyr_anti_join 1 637.06 13.165 596.86 11.50 NA NA
#> 2 dplyr_set_diff 1 9981.93 206.281 320.67 4.66 NA NA
#> 3 dt 1 48.39 1.000 80.61 8.73 NA NA
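For the situation in the question, where only the first two columns are compared and A has further columns, data.table also supports an anti-join through the x[!i, on = ...] form. The following is a small sketch on the sample data, not a benchmarked result; the ID column is a hypothetical extra column added for illustration.

```r
library(data.table)

A <- data.table(
  NAME     = c("John", "Rose", "Bob", "Charile", "Smith",
               "Brad", "Kale", "Ana", "Lauren", "Mike"),
  SURENAME = c("Beer", "Pitt", "Kin", "Kind", "Red",
               "Tea", "Joe", "Bread", "Old", "Karl"),
  ID       = 1:10  # hypothetical extra column, not in the original data
)
B <- data.table(
  NAME     = c("Rose", "Smith", "Mike"),
  SURENAME = c("Pitt", "Red", "Karl")
)

# Anti-join: rows of A with no matching (NAME, SURENAME) pair in B
C <- A[!B, on = c("NAME", "SURENAME")]
print(C)
```

Unlike fsetdiff(), this form does not require A and B to have the same columns, since only the columns named in on are matched.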