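I may be attempting the impossible, but I would appreciate any way of solving the following problem.
I have two datasets: the first has 400,000 rows and the second has 1.5 million rows. I am trying to join them while checking several conditions. This is the most important part: I do not want to join first and filter afterwards, because the number of combinations blows up RAM. Rows should therefore be joined only when all of the conditions are met. A simple example that reproduces the problem: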
library(data.table)
# arbitrary data frames
dt1 <- data.frame(
  a = 1:4,
  cond_a1 = c(10, 20, 10, 20),
  cond_b1 = c("m", "n", "m", "n"),
  cond_c1 = c(12, 13, 14, 5),
  b = letters[1:4]
)
dt2 <- data.frame(
  d = 1:4,
  cond_a2 = c(30, 20, 50, 10),
  cond_b2 = c("n", "t", "m", "t"),
  cond_c2 = c(22, 113, 200, 15),
  r = letters[5:8]
)
# make data tables
setDT(dt1)
setDT(dt2)
# this join does not work: `!=` and arithmetic expressions are not accepted in `on`
dt1[dt2,
    on = .(a = d,
           cond_a1 > cond_a2,
           cond_b1 != cond_b2,     # these two should not be equal
           cond_c2 - cond_c1 > 0   # difference should be greater than 0
    ), nomatch = 0]
# desired result
  a cond_a1 cond_b1 cond_c1 b cond_b2 cond_c2 r
1 4      10       n       5 d       t      15 h
Can this be done somehow with data.table? dplyr cannot do it (unless you join first and then filter), and sqldf can be slow.
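One possible direction, as a minimal sketch rather than a confirmed solution: data.table's `on` accepts only the operators ==, <, <=, > and >= between column names. The difference condition can therefore be rewritten as cond_c1 < cond_c2 and placed in the join itself, while the != condition is applied as a filter on the already reduced result. The sketch assumes that the equality and inequality conditions prune most combinations, so the intermediate table stays small; that assumption is not verified on the full data. The x. and i. prefixes are used in j so that each output column keeps the value from its own table (cond_a2 is kept as an extra column here).

```r
library(data.table)

# same toy data as in the example above
dt1 <- data.table(
  a = 1:4,
  cond_a1 = c(10, 20, 10, 20),
  cond_b1 = c("m", "n", "m", "n"),
  cond_c1 = c(12, 13, 14, 5),
  b = letters[1:4]
)
dt2 <- data.table(
  d = 1:4,
  cond_a2 = c(30, 20, 50, 10),
  cond_b2 = c("n", "t", "m", "t"),
  cond_c2 = c(22, 113, 200, 15),
  r = letters[5:8]
)

# equality and inequality conditions go into the join itself;
# cond_c2 - cond_c1 > 0 is expressed as cond_c1 < cond_c2
joined <- dt1[dt2,
  on = .(a == d, cond_a1 > cond_a2, cond_c1 < cond_c2),
  nomatch = NULL,
  # x.* are dt1's own values, i.* are dt2's values
  .(a = x.a, cond_a1 = x.cond_a1, cond_b1 = x.cond_b1,
    cond_c1 = x.cond_c1, b = x.b,
    cond_a2 = i.cond_a2, cond_b2 = i.cond_b2,
    cond_c2 = i.cond_c2, r = i.r)
]

# `!=` is not accepted in `on`, so it is applied to the reduced result
res <- joined[cond_b1 != cond_b2]
print(res)
```

On the toy data only the pair a = 4, d = 4 passes all conditions. Whether this stays within memory on the real tables depends on how selective the conditions inside `on` are.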
Thanks in advance for your time.