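I'm trying to work out how to join two datasets the data.table way when the join condition combines equality and inequality sub-conditions. Here is some sample data: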
> A <- data.table(name = c("Sally","Joe","Fred"),age = c(20,25,30))
> B <- data.table(name = c("Sally","Joe","Fred","Fred"),age = c(20,30,35,40),condition = c("deceased","good","good","ailing"))
> A
name age
1: Sally 20
2: Joe 25
3: Fred 30
> B
name age condition
1: Sally 20 deceased
2: Joe 30 good
3: Fred 35 good
4: Fred 40 ailing
When I run
A[B, on = .(name = name, age < age), condition := i.condition]
I only get back the following 3 rows:
> A
name age condition
1: Sally 20 <NA>
2: Joe 25 good
3: Fred 30 ailing
Intuitively, a typical SQL user would expect every row that satisfies the join condition to be returned (4 rows in this case). I am using data.table_1.11.8.
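Is there a data.table way to do this that lets me use := to assign the values to the existing dataset, so that I avoid unnecessary memory use?
If there is no data.table solution, what is the best alternative (my datasets are very large, and I would like to depend on as few packages as possible)?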
EDIT
To clarify the output I'm looking for, here is SQL code for the behaviour I want to mimic:
create table #A (
name varchar(50),
age integer
);
insert into #A
values ('Sally',20),
('Joe',25),
('Fred',30);
create table #B (
name varchar(50),
age integer,
condition varchar(50)
);
insert into #B
values ('Sally',20,'deceased'),
('Joe',30,'good'),
('Fred',35,'good'),
('Fred',40,'ailing');
select
#A.*,
condition
from #A left join #B
on #A.name = #B.name
and #A.age < #B.age;
The above returns the following result set:
name age condition
Sally 20 NULL
Joe 25 good
Fred 30 good
Fred 30 ailing
Answer 0 (score: 0)
If you need a SQL-style left join (as described in the edit), you can get it with code very similar to the suggestion in icecreamtoucan's comment:
B[A,on=.(name = name, age > age)]
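As a quick check, here is a minimal sketch that runs this join against the sample data from the question. It assumes data.table is installed. The name res is only illustrative, and the sketch uses the shorthand .(name, age > age) for the same equality-plus-inequality condition.

```r
library(data.table)

# Sample data from the question
A <- data.table(name = c("Sally", "Joe", "Fred"), age = c(20, 25, 30))
B <- data.table(name = c("Sally", "Joe", "Fred", "Fred"),
                age = c(20, 30, 35, 40),
                condition = c("deceased", "good", "good", "ailing"))

# For every row of A (the i table), pull the rows of B with the same name
# and a strictly greater age. Rows of A without a match are kept with NA in
# B's columns (nomatch = NA is the default), as in a SQL left join.
res <- B[A, on = .(name, age > age)]

# Sally has no match, Joe has one, Fred has two
print(res)
```

Note that this builds a new table instead of updating A by reference. := cannot add rows to A, which is why the update join in the question can never return more than 3 rows.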
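Note: if the result set has more rows than the sum of the row counts of the joined tables, data.table assumes you have made a mistake (unlike SQL engines) and throws an error. The workaround (assuming you have not made a mistake) is to add allow.cartesian = TRUE.
In addition, unlike SQL, this join does not return all of the columns from the constituent tables. Instead (and somewhat frustratingly for those coming from a SQL background), the values of the left table's columns used in the inequality conditions of the join (the table passed as i, here A) are returned in columns that carry the names of the right table's columns they were compared against!
The solution here (which I found in another SO answer but can't locate now) is to create copies of the join columns you want to keep, use those copies in the join conditions, and then specify which columns to keep from the join.
For example: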
A <- data.table( group = rep("WIZARD LEAGUE",3)
,name = rep("Fred",times=3)
,status_start = as.Date("2017-01-01") + c(0,370,545)
,status_end = as.Date("2017-01-01") + c(369,544,365*3-1)
,status = c("UNEMPLOYED","EMPLOYED","RETIRED"))
A <- rbind(A, data.table( group = "WIZARD LEAGUE"
,name = "Sally"
,status_start = as.Date("2017-01-01")
,status_end = as.Date("2019-12-31")
,status = "CONTRACTED"))
> A
group name status_start status_end status
1: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED
2: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED
3: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED
4: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED
B <- data.table( group = rep("WIZARD LEAGUE",times=5)
,loc_start = as.Date("2017-01-01") + 180*0:4
,loc_end = as.Date("2017-01-01") + 180*1:5-1
, loc = c("US","GER","FRA","ITA","MOR"))
> B
group loc_start loc_end loc
1: WIZARD LEAGUE 2017-01-01 2017-06-29 US
2: WIZARD LEAGUE 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE 2018-06-25 2018-12-21 ITA
5: WIZARD LEAGUE 2018-12-22 2019-06-19 MOR
>#Try to join all rows whose date ranges intersect:
>B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end)]
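vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 12 rows; more than 9 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.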
>#Try the join with allow.cartesian = TRUE
>#this succeeds but messes up column names
> B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end), allow.cartesian = TRUE]
group loc_start loc_end loc name status
1: WIZARD LEAGUE 2018-01-05 2017-01-01 US Fred UNEMPLOYED
2: WIZARD LEAGUE 2018-01-05 2017-01-01 GER Fred UNEMPLOYED
3: WIZARD LEAGUE 2018-01-05 2017-01-01 FRA Fred UNEMPLOYED
4: WIZARD LEAGUE 2018-06-29 2018-01-06 FRA Fred EMPLOYED
5: WIZARD LEAGUE 2018-06-29 2018-01-06 ITA Fred EMPLOYED
6: WIZARD LEAGUE 2019-12-31 2018-06-30 ITA Fred RETIRED
7: WIZARD LEAGUE 2019-12-31 2018-06-30 MOR Fred RETIRED
8: WIZARD LEAGUE 2019-12-31 2017-01-01 US Sally CONTRACTED
9: WIZARD LEAGUE 2019-12-31 2017-01-01 GER Sally CONTRACTED
10: WIZARD LEAGUE 2019-12-31 2017-01-01 FRA Sally CONTRACTED
11: WIZARD LEAGUE 2019-12-31 2017-01-01 ITA Sally CONTRACTED
12: WIZARD LEAGUE 2019-12-31 2017-01-01 MOR Sally CONTRACTED
>#Create aliased duplicates of the columns in the inequality condition
>#and specify the columns to keep
> keep_cols <- c(names(A),setdiff(names(B),names(A)))
> A[,start_dup := status_start]
> A[,end_dup := status_end]
> B[,start := loc_start]
> B[,end := loc_end]
>
>#Now the join works as expected (by SQL convention)
>
> B[ A
,..keep_cols
,on=.( group = group
,end >= start_dup
,start <= end_dup)
,allow.cartesian = TRUE]
group name status_start status_end status loc_start loc_end loc
1: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-01-01 2017-06-29 US
2: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED 2017-12-27 2018-06-24 FRA
5: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED 2018-06-25 2018-12-21 ITA
6: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED 2018-06-25 2018-12-21 ITA
7: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED 2018-12-22 2019-06-19 MOR
8: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-01-01 2017-06-29 US
9: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-06-30 2017-12-26 GER
10: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-12-27 2018-06-24 FRA
11: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2018-06-25 2018-12-21 ITA
12: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2018-12-22 2019-06-19 MOR
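A related idiom, which is not the approach used above, is to name the output columns explicitly in j with the x. and i. prefixes. That way no duplicate columns have to be created. The sketch below uses the small sample tables from the question; the output names a_age and b_age are made up for illustration.

```r
library(data.table)

# Sample data from the question
A <- data.table(name = c("Sally", "Joe", "Fred"), age = c(20, 25, 30))
B <- data.table(name = c("Sally", "Joe", "Fred", "Fred"),
                age = c(20, 30, 35, 40),
                condition = c("deceased", "good", "good", "ailing"))

# i.<col> refers to the column of the table in i (A) and x.<col> to the
# column of the table being subset (B), so both age values can be kept.
res <- B[A,
         on = .(name, age > age),
         .(name = i.name, a_age = i.age, b_age = x.age, condition)]

print(res)
```

The trade-off is that every output column has to be listed by hand, which gets tedious for wide tables.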
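I'm certainly not the first to point out these departures from SQL convention, or that reproducing the SQL behaviour (as above) is rather cumbersome, and I believe improvements are actively being considered.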
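To anyone considering alternative strategies (e.g. the sqldf package), I will say that although there are worthy alternatives to data.table, I have struggled to find anything comparable when very large datasets are involved, whether for joins or for other operations. Needless to say, there are many other benefits that make this package indispensable to me and to many others. So for those working with large datasets, if the above looks cumbersome, I would advise against abandoning data.table joins. Instead, either get into the habit of going through these motions, or write a helper function that replicates the sequence of steps until improved syntax comes along.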
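Such a helper could look roughly like the sketch below. range_left_join and its arguments are hypothetical names of my own, not part of data.table, and the temporary column names are assumed not to clash with anything already in your tables.

```r
library(data.table)

# Hypothetical helper (not part of data.table). For each row of i_dt it
# returns the rows of x_dt that match on `by` and whose [x_start, x_end]
# range overlaps [i_start, i_end], keeping the original columns of both.
range_left_join <- function(x_dt, i_dt, by, x_start, x_end, i_start, i_end) {
  # Copies leave the callers' tables untouched. On very large tables you
  # might instead add and later drop the helper columns by reference.
  x_tmp <- copy(x_dt)
  i_tmp <- copy(i_dt)

  # Duplicate the range columns under temporary names
  x_tmp[, c("tmp_x_start", "tmp_x_end") := .SD, .SDcols = c(x_start, x_end)]
  i_tmp[, c("tmp_i_start", "tmp_i_end") := .SD, .SDcols = c(i_start, i_end)]

  # Join on the duplicates so the original range columns survive intact
  res <- x_tmp[i_tmp,
               on = c(by, "tmp_x_end>=tmp_i_start", "tmp_x_start<=tmp_i_end"),
               allow.cartesian = TRUE]

  # Columns of i_dt first, then the columns that only x_dt has
  keep_cols <- c(names(i_dt), setdiff(names(x_dt), names(i_dt)))
  res[, keep_cols, with = FALSE]
}

# The example tables from above, built compactly
A <- data.table(group = "WIZARD LEAGUE",
                name = c("Fred", "Fred", "Fred", "Sally"),
                status_start = as.Date("2017-01-01") + c(0, 370, 545, 0),
                status_end = as.Date("2017-01-01") + c(369, 544, 1094, 1094),
                status = c("UNEMPLOYED", "EMPLOYED", "RETIRED", "CONTRACTED"))
B <- data.table(group = "WIZARD LEAGUE",
                loc_start = as.Date("2017-01-01") + 180 * 0:4,
                loc_end = as.Date("2017-01-01") + 180 * 1:5 - 1,
                loc = c("US", "GER", "FRA", "ITA", "MOR"))

res <- range_left_join(B, A, by = "group",
                       x_start = "loc_start", x_end = "loc_end",
                       i_start = "status_start", i_end = "status_end")
print(res)
```

Copying both tables is the simple option but costs memory. It also assumes the two tables share no non-join column names, since data.table would otherwise prefix the clashing columns of i with "i.".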
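Finally, I have not covered disjunctive joins here, but as far as I can tell they are another shortcoming of the data.table approach (and another area where sqldf is useful). I have been getting around them with ad hoc "hacks", and I would appreciate any advice on the best way to handle them in data.table.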
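For what it's worth, one such ad hoc approach is to run one join per branch of the OR, collect the matching pairs of row ids, and de-duplicate them. The sketch below is only an illustration on the question's sample data, with a made-up condition (same name OR B's age greater than A's age). The names a_id, b_id, b_name and b_age are my own, and I make no claim that this is the best way.

```r
library(data.table)

# Sample data from the question
A <- data.table(name = c("Sally", "Joe", "Fred"), age = c(20, 25, 30))
B <- data.table(name = c("Sally", "Joe", "Fred", "Fred"),
                age = c(20, 30, 35, 40),
                condition = c("deceased", "good", "good", "ailing"))

# Tag the rows of copies so that matched pairs can be identified
A2 <- copy(A)[, a_id := .I]
B2 <- copy(B)[, b_id := .I]

# One inner join per branch of the OR, each returning only row-id pairs
by_name <- B2[A2, on = .(name),
              .(a_id = i.a_id, b_id = x.b_id), nomatch = 0L]
by_age <- B2[A2, on = .(age > age),
             .(a_id = i.a_id, b_id = x.b_id), nomatch = 0L,
             allow.cartesian = TRUE]

# A pair that satisfies both branches must appear only once
pairs <- unique(rbind(by_name, by_age))

# Materialise the joined rows from the original tables
res <- cbind(A[pairs$a_id],
             B[pairs$b_id, .(b_name = name, b_age = age, condition)])
print(res)
```

This returns inner-join semantics only. Rows of A that satisfy neither branch would have to be added back separately to mimic a left join.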