Left join with data.table on equality and inequality conditions, with multiple matches per left-table row

Time: 2018-10-31 21:17:36

Tags: r data.table left-join inequality

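I am trying to work out how to join two datasets the data.table way when the join condition contains both an equality and an inequality sub-condition. Here is some sample data: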

> A <- data.table(name = c("Sally","Joe","Fred"),age = c(20,25,30))
> B <- data.table(name = c("Sally","Joe","Fred","Fred"),age = c(20,30,35,40),condition = c("deceased","good","good","ailing"))
> A
    name age
1: Sally  20
2:   Joe  25
3:  Fred  30

> B
    name age condition
1: Sally  20  deceased
2:   Joe  30      good
3:  Fred  35      good
4:  Fred  40    ailing

When I run A[B,on =.(name = name, age < age), condition := i.condition], I get back only the following 3 rows:

> A
    name age condition
1: Sally  20      <NA>
2:   Joe  25      good
3:  Fred  30    ailing

Intuitively, a typical SQL user would expect every row that satisfies the join condition to be returned (4 rows in this case). I am using data.table_1.11.8.
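For what it is worth, the 3-row result is consistent with how := behaves: it modifies A by reference, so it can add or overwrite a column but cannot add rows. The sketch below reruns the update join on the sample data and inspects the result; it only uses the tables and the call already shown above.

```r
library(data.table)

A <- data.table(name = c("Sally", "Joe", "Fred"), age = c(20, 25, 30))
B <- data.table(name = c("Sally", "Joe", "Fred", "Fred"),
                age = c(20, 30, 35, 40),
                condition = c("deceased", "good", "good", "ailing"))

# Update join: for each row of B, find the matching rows of A and assign.
A[B, on = .(name = name, age < age), condition := i.condition]

# := works by reference on A, so the number of rows in A is unchanged.
print(nrow(A))

# Both Fred rows of B match the single Fred row of A; the later match
# overwrites the earlier one, so only one condition survives.
print(A[name == "Fred"])
```

Keeping every matching pair therefore requires a join that returns a new table rather than an in-place assignment.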

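Is there a data.table approach that lets me

  1. handle join conditions whose sub-conditions may be both equality and inequality conditions
  2. use := to assign values to the existing dataset, to avoid unnecessary memory use
  3. keep every row that satisfies the join condition, as SQL does

If there is no data.table solution, what is the best alternative (my datasets are fairly large, and I would like to use as few packages as possible)?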

EDIT

To make clear what output I am looking for, here is SQL code for the behaviour I want to mimic:

create table #A (
name varchar(50),
age integer
);

insert into #A
values ('Sally',20),
       ('Joe',25),
       ('Fred',30);

create table #B (
name varchar(50),
age integer,
condition varchar(50)
);

insert into #B
values ('Sally',20,'deceased'),
       ('Joe',30,'good'),
       ('Fred',35,'good'),
       ('Fred',40,'ailing');

select
#A.*,
condition
from #A left join #B
on  #A.name = #B.name
and #A.age < #B.age;

The above returns the following result set:

name    age   condition
Sally   20    NULL
Joe     25    good
Fred    30    good
Fred    30    ailing

1 Answer:

Answer 0 (score: 0)

If you need a SQL-style left join (as described in the edit), you can get one with code very similar to the suggestion in icecreamtoucan's comment:

B[A,on=.(name = name, age > age)]
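As a quick check, here is a sketch that runs this call against the sample data from the question. Nothing beyond the question's two tables and the join above is assumed.

```r
library(data.table)

A <- data.table(name = c("Sally", "Joe", "Fred"), age = c(20, 25, 30))
B <- data.table(name = c("Sally", "Joe", "Fred", "Fred"),
                age = c(20, 30, 35, 40),
                condition = c("deceased", "good", "good", "ailing"))

# For every row of A (the i argument), return all rows of B with the same
# name and a larger age; rows of A without a match are kept with NA.
res <- B[A, on = .(name = name, age > age)]
print(res)

# Note that the age column of the result holds the ages from A, not from B.
```

Fred appears twice (once per matching row of B) and Sally is kept with a missing condition, which is the same shape as the SQL result set in the edit.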

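Note: if the result set has more rows than the sum of the row counts of the two joined tables, data.table assumes you have made a mistake (unlike a SQL engine) and throws an error. The workaround (assuming you have not made a mistake) is to add allow.cartesian = TRUE.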

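Also, unlike SQL, this join does not return every column from the constituent tables. Instead (and somewhat frustratingly for anyone coming from a SQL background), the values of the left table's columns used in the inequality conditions (here A, passed as i) are returned in columns that carry the names of the right table's columns (here B) they were compared against!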

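The solution here (which I found in another SO answer that I can no longer locate) is to create duplicates of the join columns you want to keep, use the duplicates in the join condition, and then specify which columns to keep from the join.

For example: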

A <- data.table( group = rep("WIZARD LEAGUE",3)
                ,name = rep("Fred",times=3)
                ,status_start = as.Date("2017-01-01") + c(0,370,545)
                ,status_end = as.Date("2017-01-01") + c(369,544,365*3-1) 
                ,status = c("UNEMPLOYED","EMPLOYED","RETIRED"))
A <- rbind(A, data.table( group = "WIZARD LEAGUE"
                         ,name = "Sally"
                         ,status_start = as.Date("2017-01-01")
                         ,status_end = as.Date("2019-12-31")
                         ,status = "CONTRACTED"))
> A
           group  name status_start status_end     status
1: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED
2: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED
3: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED
4: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED


B <- data.table( group = rep("WIZARD LEAGUE",times=5)
                ,loc_start = as.Date("2017-01-01") + 180*0:4
                ,loc_end = as.Date("2017-01-01") + 180*1:5-1
                , loc = c("US","GER","FRA","ITA","MOR"))

> B
           group  loc_start    loc_end loc
1: WIZARD LEAGUE 2017-01-01 2017-06-29  US
2: WIZARD LEAGUE 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE 2018-06-25 2018-12-21 ITA
5: WIZARD LEAGUE 2018-12-22 2019-06-19 MOR

>#Try to join all rows whose date ranges intersect:

>B[A,on=.(group = group, loc_end >= status_start,  loc_start <= status_end)]
  

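Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 12 rows; more than 9 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.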

>#Try the join with allow.cartesian = TRUE
>#this succeeds but messes up column names

> B[A,on=.(group = group, loc_end >= status_start,  loc_start <= status_end), allow.cartesian = TRUE]
            group  loc_start    loc_end loc  name     status
 1: WIZARD LEAGUE 2018-01-05 2017-01-01  US  Fred UNEMPLOYED
 2: WIZARD LEAGUE 2018-01-05 2017-01-01 GER  Fred UNEMPLOYED
 3: WIZARD LEAGUE 2018-01-05 2017-01-01 FRA  Fred UNEMPLOYED
 4: WIZARD LEAGUE 2018-06-29 2018-01-06 FRA  Fred   EMPLOYED
 5: WIZARD LEAGUE 2018-06-29 2018-01-06 ITA  Fred   EMPLOYED
 6: WIZARD LEAGUE 2019-12-31 2018-06-30 ITA  Fred    RETIRED
 7: WIZARD LEAGUE 2019-12-31 2018-06-30 MOR  Fred    RETIRED
 8: WIZARD LEAGUE 2019-12-31 2017-01-01  US Sally CONTRACTED
 9: WIZARD LEAGUE 2019-12-31 2017-01-01 GER Sally CONTRACTED
10: WIZARD LEAGUE 2019-12-31 2017-01-01 FRA Sally CONTRACTED
11: WIZARD LEAGUE 2019-12-31 2017-01-01 ITA Sally CONTRACTED
12: WIZARD LEAGUE 2019-12-31 2017-01-01 MOR Sally CONTRACTED

>#Create aliased duplicates of the columns in the inequality condition
>#and specify the columns to keep

> keep_cols <- c(names(A),setdiff(names(B),names(A)))
> A[,start_dup := status_start]
> A[,end_dup := status_end]
> B[,start := loc_start]
> B[,end := loc_end]
>
>#Now the join works as expected (by SQL convention)
>
> B[ A
    ,..keep_cols
    ,on=.( group = group
          ,end >= start_dup
          ,start <= end_dup)
          ,allow.cartesian = TRUE]
            group  name status_start status_end     status  loc_start    loc_end loc
 1: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-01-01 2017-06-29  US
 2: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-06-30 2017-12-26 GER
 3: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-12-27 2018-06-24 FRA
 4: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED 2017-12-27 2018-06-24 FRA
 5: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED 2018-06-25 2018-12-21 ITA
 6: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED 2018-06-25 2018-12-21 ITA
 7: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED 2018-12-22 2019-06-19 MOR
 8: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-01-01 2017-06-29  US
 9: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-06-30 2017-12-26 GER
10: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-12-27 2018-06-24 FRA
11: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2018-06-25 2018-12-21 ITA
12: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2018-12-22 2019-06-19 MOR
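A possible alternative to duplicating columns, sketched here under the assumption that your data.table version supports the x. and i. column prefixes in j: name every output column explicitly, taking x.-prefixed values from B and i.-prefixed values from A. The tables are the ones defined above; the choice and order of output columns is only illustrative.

```r
library(data.table)

A <- data.table(group = "WIZARD LEAGUE",
                name = c("Fred", "Fred", "Fred", "Sally"),
                status_start = as.Date("2017-01-01") + c(0, 370, 545, 0),
                status_end = as.Date("2017-01-01") + c(369, 544, 365 * 3 - 1, 365 * 3 - 1),
                status = c("UNEMPLOYED", "EMPLOYED", "RETIRED", "CONTRACTED"))
B <- data.table(group = "WIZARD LEAGUE",
                loc_start = as.Date("2017-01-01") + 180 * 0:4,
                loc_end = as.Date("2017-01-01") + 180 * 1:5 - 1,
                loc = c("US", "GER", "FRA", "ITA", "MOR"))

# x.<col> is the value from B (the table being subset),
# i.<col> is the value from A (the table passed as i).
res <- B[A,
         on = .(group = group, loc_end >= status_start, loc_start <= status_end),
         .(group = i.group, name = i.name,
           status_start = i.status_start, status_end = i.status_end,
           status = i.status,
           loc_start = x.loc_start, loc_end = x.loc_end, loc = x.loc),
         allow.cartesian = TRUE]
print(res)
```

This leaves A and B untouched, at the cost of spelling out each column in j.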

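I am certainly not the first to point out these departures from SQL convention, or that reproducing that functionality (as above) is fairly cumbersome, and I believe improvements are actively being considered.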

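For anyone considering alternative strategies (e.g. the sqldf package), I will say that although there are worthy alternatives to data.table, I have struggled to find anything comparable when it comes to very large datasets, whether for joins or for other operations. Needless to say, there are many other benefits that make this package indispensable to me and to many others. So for those working with large datasets, if the above looks cumbersome, I would advise against abandoning data.table joins; instead, either get into the habit of going through these motions or write a helper function that replicates the sequence of steps until improved syntax comes along.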
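One possible shape for such a helper, as a sketch only. The function name range_left_join, its arguments, and the tmp_* column names are made up for illustration and are not part of data.table. It assumes the two tables share the equality key names given in by and have no other column names in common. It copies both inputs, which costs memory; for very large tables you may prefer to add and drop the temporary columns in place.

```r
library(data.table)

# Hypothetical helper (not part of data.table): SQL-style left join of
# `left` to `right` on equality keys plus an interval-overlap condition,
# keeping the original range columns of both tables.
range_left_join <- function(left, right, by,
                            left_start, left_end,
                            right_start, right_end) {
  # Work on copies so the caller's tables are not modified by reference.
  l <- copy(left)
  r <- copy(right)

  # Duplicate the columns used in the inequality conditions under
  # throwaway names; only these duplicates are consumed by the join.
  set(l, j = "tmp_l_start", value = l[[left_start]])
  set(l, j = "tmp_l_end",   value = l[[left_end]])
  set(r, j = "tmp_r_start", value = r[[right_start]])
  set(r, j = "tmp_r_end",   value = r[[right_end]])

  # Two ranges overlap when right_end >= left_start and right_start <= left_end.
  on_cond <- c(by, "tmp_r_end>=tmp_l_start", "tmp_r_start<=tmp_l_end")
  res <- r[l, on = on_cond, allow.cartesian = TRUE]

  # After the join the temporary columns hold values from `left`; drop them.
  res[, c("tmp_r_start", "tmp_r_end") := NULL]
  res[]
}

# Usage with the example tables from this answer.
A <- data.table(group = "WIZARD LEAGUE",
                name = c("Fred", "Fred", "Fred", "Sally"),
                status_start = as.Date("2017-01-01") + c(0, 370, 545, 0),
                status_end = as.Date("2017-01-01") + c(369, 544, 365 * 3 - 1, 365 * 3 - 1),
                status = c("UNEMPLOYED", "EMPLOYED", "RETIRED", "CONTRACTED"))
B <- data.table(group = "WIZARD LEAGUE",
                loc_start = as.Date("2017-01-01") + 180 * 0:4,
                loc_end = as.Date("2017-01-01") + 180 * 1:5 - 1,
                loc = c("US", "GER", "FRA", "ITA", "MOR"))

res <- range_left_join(A, B, by = "group",
                       left_start = "status_start", left_end = "status_end",
                       right_start = "loc_start", right_end = "loc_end")
print(res)
```

The column order of the result differs from the keep_cols version above (the columns of B come first); reorder with setcolorder if that matters.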

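Finally, I have not mentioned disjunctive joins here, but as far as I know they are another shortcoming of the data.table approach (and another area where sqldf is useful). I have been working around these with ad hoc "hacks", but I would appreciate any advice on the best way to handle them in data.table.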