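I have been using fuzzyjoin to join two data frames, but the join runs out of memory and fails with cannot allocate memory of…. So I am trying to merge the data with data.table instead. A sample of the data is below.
df1 looks like: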
ID f_date ACCNUM flmNUM start_date end_date
1 50341 2002-03-08 0001104659-02-000656 2571187 2002-09-07 2003-08-30
2 1067983 2009-11-25 0001047469-09-010426 91207220 2010-05-27 2011-05-19
3 804753 2004-05-14 0001193125-04-088404 4805453 2004-11-13 2005-11-05
4 1090727 2013-05-22 0000712515-13-000022 13865105 2013-11-21 2014-11-13
5 1467858 2010-02-26 0001193125-10-043035 10640035 2010-08-28 2011-08-20
6 858877 2019-01-31 0001166691-19-000005 19556540 2019-08-02 2020-07-24
7 2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17
8 1478242 2004-03-12 0001193125-04-039482 4664082 2004-09-11 2005-09-03
9 1467858 2017-02-16 0001555280-17-000044 17618235 2017-08-18 2018-08-10
10 14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20
df2 looks like:
ID date fyear at lt
1 50341 1998-12-31 1998 104382 94973
2 50341 1999-12-31 1999 190692 175385
3 50341 2000-12-31 2000 179519 163347
4 50341 2001-12-31 2001 203638 186030
5 50341 2002-12-31 2002 190453 173620
6 50341 2003-12-31 2003 200235 181955
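I will focus on ID = 50341. If df2$date falls within the period between df1$start_date and df1$end_date, the two rows should be joined. Here df2$date = 2002-12-31 lies between the df1 start of 2002-09-07 and end of 2003-08-30, so this row should be joined.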
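To make the matching rule concrete, here is a minimal base R sketch of the window test for ID = 50341. The variable names (`dates`, `in_window`) are only illustrative, and the window is treated as open at the start and closed at the end, which is an assumption taken from the comparison operators used in the join code in this post.

```r
# Values taken from the ID = 50341 example above.
start_date <- as.Date("2002-09-07")
end_date   <- as.Date("2003-08-30")

# Three of the df2 dates for this ID.
dates <- as.Date(c("2001-12-31", "2002-12-31", "2003-12-31"))

# TRUE only where start_date < date <= end_date.
in_window <- dates > start_date & dates <= end_date
in_window
```

Only the 2002-12-31 row passes the test, which is the row expected in the joined result.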
I ran the following code and got the corresponding output:
df1$f_date <- as.Date(df1$f_date)
df2$date <- as.Date(df2$date)
df1$start_date <- df1$f_date + 183
df1$end_date <- df1$f_date + 540
library(fuzzyjoin)
final_data <- fuzzy_left_join(
df1, df2,
by = c(
"ID" = "ID",
"start_date" = "date",
"end_date" = "date"
),
match_fun = list(`==`, `<`, `>=`)
)
final_data
Output:
ID.x f_date ACCNUM flmNUM start_date end_date ID.y date fyear at lt
1 50341 2002-03-08 0001104659-02-000656 2571187 2002-09-07 2003-08-30 50341 2002-12-31 2002 190453.000 173620.000
2 1067983 2009-11-25 0001047469-09-010426 91207220 2010-05-27 2011-05-19 1067983 2010-12-31 2010 372229.000 209295.000
3 804753 2004-05-14 0001193125-04-088404 4805453 2004-11-13 2005-11-05 804753 2004-12-31 2004 982.265 383.614
4 1090727 2013-05-22 0000712515-13-000022 13865105 2013-11-21 2014-11-13 1090727 2013-12-31 2013 36212.000 29724.000
5 1467858 2010-02-26 0001193125-10-043035 10640035 2010-08-28 2011-08-20 1467858 2010-12-31 2010 138898.000 101739.000
6 858877 2019-01-31 0001166691-19-000005 19556540 2019-08-02 2020-07-24 NA <NA> NA NA NA
7 2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17 2488 2016-12-31 2016 3321.000 2905.000
8 1478242 2004-03-12 0001193125-04-039482 4664082 2004-09-11 2005-09-03 NA <NA> NA NA NA
9 1467858 2017-02-16 0001555280-17-000044 17618235 2017-08-18 2018-08-10 1467858 2017-12-31 2017 212482.000 176282.000
10 14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20 14693 2016-04-30 2015 4183.000 2621.000
Here we can see that ID = 50341 has been joined correctly.
When I try to do the same thing the data.table way, I get the following output:
Code:
dt_final_data <- setDT(df2)[df1, on = .(ID, date > start_date, date <= end_date)]
Output:
ID date fyear at lt date.1 f_date ACCNUM flmNUM
1: 50341 2002-09-07 2002 190453.000 173620.000 2003-08-30 2002-03-08 0001104659-02-000656 2571187
2: 1067983 2010-05-27 2010 372229.000 209295.000 2011-05-19 2009-11-25 0001047469-09-010426 91207220
3: 804753 2004-11-13 2004 982.265 383.614 2005-11-05 2004-05-14 0001193125-04-088404 4805453
4: 1090727 2013-11-21 2013 36212.000 29724.000 2014-11-13 2013-05-22 0000712515-13-000022 13865105
5: 1467858 2010-08-28 2010 138898.000 101739.000 2011-08-20 2010-02-26 0001193125-10-043035 10640035
6: 858877 2019-08-02 NA NA NA 2020-07-24 2019-01-31 0001166691-19-000005 19556540
7: 2488 2016-08-25 2016 3321.000 2905.000 2017-08-17 2016-02-24 0001193125-16-476010 161452982
8: 1478242 2004-09-11 NA NA NA 2005-09-03 2004-03-12 0001193125-04-039482 4664082
9: 1467858 2017-08-18 2017 212482.000 176282.000 2018-08-10 2017-02-16 0001555280-17-000044 17618235
10: 14693 2016-04-28 2015 4183.000 2621.000 2017-04-20 2015-10-28 0001193125-15-356351 151180619
dt_final_data
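Here start_date from df1 has become date, and end_date from df1 has become date.1. As a result, the original date column from df2 has disappeared, and it is one of the more important dates for checking whether the merge worked as intended.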
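The renaming can be reproduced on a toy table. The sketch below uses hypothetical names (`x_dt`, `i_dt`, `res`) and only the ID = 50341 values shown above; it is meant to illustrate the behaviour, not to replace the full data.

```r
library(data.table)

# Toy tables built from the ID = 50341 rows shown above.
x_dt <- data.table(
  ID    = 50341L,
  date  = as.Date(c("2001-12-31", "2002-12-31", "2003-12-31")),
  fyear = 2001:2003
)
i_dt <- data.table(
  ID         = 50341L,
  start_date = as.Date("2002-09-07"),
  end_date   = as.Date("2003-08-30")
)

res <- x_dt[i_dt, on = .(ID, date > start_date, date <= end_date)]

names(res)  # the join columns carry x's column name (date) ...
res$date    # ... but hold the values of i's start_date
```

The matched row is the fyear 2002 row, yet the value printed for date is 2002-09-07, the start of the window, rather than 2002-12-31.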
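Two questions:
How can I keep all of the date columns, as in the fuzzyjoin example? The way data.table renames the columns makes checking the join somewhat confusing.
Is the code/logic correct? I have looked at the joined data many times and it "seems" correct.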
Data 1:
df1 <-
structure(list(ID = c(50341L, 1067983L, 804753L, 1090727L, 1467858L,
858877L, 2488L, 1478242L, 1467858L, 14693L), f_date = structure(c(11754,
14573, 12552, 15847, 14666, 17927, 16855, 12489, 17213, 16736
), class = "Date"), ACCNUM = c("0001104659-02-000656", "0001047469-09-010426",
"0001193125-04-088404", "0000712515-13-000022", "0001193125-10-043035",
"0001166691-19-000005", "0001193125-16-476010", "0001193125-04-039482",
"0001555280-17-000044", "0001193125-15-356351"), flmNUM = c(2571187L,
91207220L, 4805453L, 13865105L, 10640035L, 19556540L, 161452982L,
4664082L, 17618235L, 151180619L),
start_date = structure(c(11937, 14756, 12735, 16030, 14849, 18110, 17038,
12672, 17396, 16919), class = "Date"),
end_date = structure(c(12294, 15113, 13092, 16387, 15206, 18467, 17395, 13029,
17753, 17276), class = "Date")
), row.names = c(NA, -10L), class = "data.frame")
Data 2:
df2 <-
structure(list(ID = c(2488L, 2488L, 2488L, 2488L, 2488L, 2488L,
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L,
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 1067983L, 1067983L,
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L,
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L,
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 14693L, 14693L,
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L,
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L,
14693L, 14693L, 14693L, 50341L, 50341L, 50341L, 50341L, 50341L,
50341L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L,
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L,
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L,
1090727L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L,
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L,
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L,
804753L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L,
1478242L, 1478242L, 1478242L, 1478242L, 858877L, 858877L, 858877L,
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L,
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L,
858877L, 858877L, 858877L, 858877L), date = structure(c(10591,
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878,
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166,
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783,
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070,
16435, 16800, 17166, 17531, 17896, 10346, 10711, 11077, 11442,
11807, 12172, 12538, 12903, 13268, 13633, 13999, 14364, 14729,
15094, 15460, 15825, 16190, 16555, 16921, 17286, 17651, 10591,
10956, 11322, 11687, 12052, 12417, 10591, 10956, 11322, 11687,
12052, 12417, 12783, 13148, 13513, 13878, 14244, 14609, 14974,
15339, 15705, 16070, 16435, 16800, 17166, 17531, 17896, 10591,
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878,
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166,
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783,
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070,
16435, 16800, 17166, 17531, 17896, 14609, 14974, 15339, 15705,
16070, 16435, 16800, 17166, 17531, 17896, 10438, 10803, 11169,
11534, 11899, 12264, 12630, 12995, 13360, 13725, 14091, 14456,
14821, 15186, 15552, 15917, 16282, 16647, 17013, 17378, 17743
), class = "Date"), fyear = c(1998L, 1999L, 2000L, 2001L, 2002L,
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L,
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L,
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L,
2014L, 2015L, 2016L, 2017L, 1998L, 1999L, 2000L, 2001L, 2002L,
2003L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L,
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L,
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L,
2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L,
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L,
2014L, 2015L, 2016L, 2017L, 2018L), at = c(4252.968, 4377.698,
5767.735, 5647.242, 5619.181, 7094.345, 7844.21, 7287.779, 13147,
11550, 7675, 9078, 4964, 4954, 4000, 4337, 3767, 3109, 3321,
3540, 4556, 122237, 131416, 135792, 162752, 169544, 180559, 188874,
198325, 248437, 273160, 267399, 297119, 372229, 392647, 427452,
484931, 526186, 552257, 620854, 702095, 707794, 1494, 1735, 1802,
1939, 2016, 2264, 2376, 2624, 2728, 3551, 3405, 3475, 3383, 3712,
3477, 3626, 4103, 4193, 4183, 4625, 4976, 104382, 190692, 179519,
203638, 190453, 200235, 257389, 274730, 303100, 323969, 370782,
448507, 479921, 476078, 186192, 148883, 91047, 136295, 138898,
144603, 149422, 166344, 177677, 194520, 221690, 212482, 227339,
17067, 23043, 21662, 24636, 26357, 28909, 33026, 35222, 33210,
39042, 31879, 31883, 33597, 34701, 38863, 36212, 35471, 38311,
40377, 45403, 50016, 436.485, 660.891, 616.411, 712.302, 779.279,
859.34, 982.265, 1303.629, 1491.39, 1689.956, 1880.988, 2148.567,
2422.79, 3000.358, 3704.468, 4098.364, 4530.565, 5561.984, 5629.963,
6469.311, 6708.636, NA, NA, 2322.917, 2499.153, 3066.797, 3305.832,
3926.316, 21208, 22742, 22549, 8916.705, 14725, 32870, 35238,
37795, 37107, 35594, 33883, 43315, 53340, 58734, 68128, 81130,
87095, 91759, 101191, 105134, 113481, 121652, 129818, 108784),
lt = c(2247.919, 2398.425, 2596.068, 2092.187, 3151.916,
3938.395, 3993.516, 3700.954, 7072, 8295, 7588, 7354, 3951,
3364, 3462, 3793, 3580, 3521, 2905, 2929, 3290, 63190, 72232,
72799, 103453, 104116, 102218, 102216, 106025, 137756, 149759,
153820, 161334, 209295, 223686, 235864, 260446, 283159, 293630,
334495, 350141, 355294, 677, 818, 754, 752, 705, 1424, 1291,
1314, 1165, 1978, 1680, 1659, 1488, 1652, 1408, 1998, 2071,
2288, 2621, 3255, 3660, 94973, 175385, 163347, 186030, 173620,
181955, 241738, 253490, 272218, 303516, 363134, 422932, 452164,
460442, 190443, 184363, 176387, 107340, 101739, 105612, 112422,
123170, 141653, 154197, 177615, 176282, 184562, 9894, 10569,
11927, 14388, 13902, 14057, 16642, 18338, 17728, 26859, 25099,
24187, 25550, 27593, 34130, 29724, 33313, 35820, 39948, 44373,
46979, 165.342, 281.954, 272.694, 317.463, 338.035, 363.494,
383.614, 541.81, 571.972, 556.242, 568.693, 567.769, 517.373,
689.557, 870.818, 930.7, 964.597, 1691.6, 1702.016, 1683.963,
1780.247, NA, NA, 3292.513, 3858.197, 3734.282, 4009.844,
4261.997, 12348, 14384, 15595, 1766.98, 3003, 6328, 8096,
9124, 9068, 9678, 10699, 19397, 21850, 24332, 29451, 36845,
39836, 40458, 42063, 48473, 53774, 58067, 63681, 65580)), row.names = c(NA,
-163L), class = "data.frame")
Answer 0 (score: 1)
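The data.table approach to your problem does not need a fuzzy join [at least not in the sense of inexact matching]. Instead, you just want to join on data.table columns using the non-equal binary operators >=, >, <= and/or <. In data.table terminology, these are called "non-equi joins".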
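As a small illustration of a non-equi join outside the date setting, the sketch below assigns numeric values to intervals. The tables `bins` and `obs` and their columns are made up for this example and do not come from the question.

```r
library(data.table)

# Hypothetical lookup table of intervals (lo, hi].
bins <- data.table(
  lo    = c(0, 10, 20),
  hi    = c(10, 20, 30),
  label = c("low", "mid", "high")
)

# Hypothetical observations to classify.
obs <- data.table(value = c(5, 10, 25))

# For each row of obs, find the bin where lo < value and hi >= value.
# i.value explicitly asks for the value column from obs (the i table).
res <- bins[obs, .(value = i.value, label), on = .(lo < value, hi >= value)]
res
```

The `on` clause always reads as "column of x, operator, column of i", so `lo < value` compares `bins$lo` with `obs$value`.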
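You titled your question "Fuzzy-joining two data frames together using data.table", which is understandable given that your first attempt used library(fuzzyjoin). (No problem, just clarifying for readers.)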
Solution using a data.table non-equi join to compare the date columns: you were very close to a working data.table solution,
dt_final_data <- setDT(df2)[df1,
on = .(ID, date > start_date, date <= end_date)]
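To modify it so that it works the way you want, just add a data.table j expression that selects the columns you want in the order you want. EDIT: and prefix the problem column with x. (this tells data.table to return that column from the x side of the dt_x[dt_i, ] join), for example the column x.date below: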
dt_final_data <- setDT(df2)[df1,
.(ID, f_date, ACCNUM, flmNUM, start_date, end_date, x.date, fyear, at, lt),
on = .(ID, date > start_date, date <= end_date)]
This now gives you the output you wanted:
dt_final_data
ID f_date ACCNUM flmNUM start_date end_date x.date fyear at lt
1: 50341 2002-03-08 0001104659-02-000656 2571187 2002-09-07 2003-08-30 2002-12-31 2002 190453.000 173620.000
2: 1067983 2009-11-25 0001047469-09-010426 91207220 2010-05-27 2011-05-19 2010-12-31 2010 372229.000 209295.000
3: 804753 2004-05-14 0001193125-04-088404 4805453 2004-11-13 2005-11-05 2004-12-31 2004 982.265 383.614
4: 1090727 2013-05-22 0000712515-13-000022 13865105 2013-11-21 2014-11-13 2013-12-31 2013 36212.000 29724.000
5: 1467858 2010-02-26 0001193125-10-043035 10640035 2010-08-28 2011-08-20 2010-12-31 2010 138898.000 101739.000
6: 858877 2019-01-31 0001166691-19-000005 19556540 2019-08-02 2020-07-24 <NA> NA NA NA
7: 2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17 2016-12-31 2016 3321.000 2905.000
8: 1478242 2004-03-12 0001193125-04-039482 4664082 2004-09-11 2005-09-03 <NA> NA NA NA
9: 1467858 2017-02-16 0001555280-17-000044 17618235 2017-08-18 2018-08-10 2017-12-31 2017 212482.000 176282.000
10: 14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20 2016-04-30 2015 4183.000 2621.000
As shown above, your result for ID 50341 now has date = 2002-12-31. In other words, the date column in the result (x.date) now comes from df2$date.
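On the second question (whether the logic is correct), one way to check is to test every matched row against its own window after the join. The following is a sketch on toy tables with hypothetical names (`x_dt`, `i_dt`, `matched`, `all_ok`), using values for IDs 50341 and 858877 from the question; the same check should carry over to the full result.

```r
library(data.table)

# x side: three df2-style rows for ID 50341 (values from the question).
x_dt <- data.table(
  ID   = c(50341L, 50341L, 50341L),
  date = as.Date(c("2001-12-31", "2002-12-31", "2003-12-31")),
  at   = c(203638, 190453, 200235)
)

# i side: one window that has a match and one that has none.
i_dt <- data.table(
  ID         = c(50341L, 858877L),
  start_date = as.Date(c("2002-09-07", "2019-08-02")),
  end_date   = as.Date(c("2003-08-30", "2020-07-24"))
)

res <- x_dt[i_dt,
            .(ID, start_date, end_date, x.date, at),
            on = .(ID, date > start_date, date <= end_date)]

# Every matched row must satisfy start_date < x.date <= end_date.
# Rows without a match have NA in x.date and are excluded from the check.
matched <- res[!is.na(x.date)]
all_ok  <- matched[, all(x.date > start_date & x.date <= end_date)]
all_ok
```

If `all_ok` is TRUE, every joined date lies inside its window; the rows with NA in x.date are the windows for which no date was found.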
You can of course rename the x.date column in the j expression:
setDT(df2)[ df1,
.(ID,
f_date,
ACCNUM,
flmNUM,
start_date,
end_date,
my_result_date_name = x.date,
fyear,
at,
lt),
on = .(ID, date > start_date, date <= end_date)]
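A different idiom that is sometimes used, and which is not part of the solution above, is to duplicate the date column before joining and use the copy in `on`, so that the original column passes through untouched. This is a sketch on toy tables; the names `x_dt`, `i_dt` and `join_date` are hypothetical.

```r
library(data.table)

x_dt <- data.table(
  ID    = 50341L,
  date  = as.Date(c("2001-12-31", "2002-12-31", "2003-12-31")),
  fyear = 2001:2003
)
i_dt <- data.table(
  ID         = 50341L,
  start_date = as.Date("2002-09-07"),
  end_date   = as.Date("2003-08-30")
)

# Add a copy of date by reference; only the copy is used in the join.
x_dt[, join_date := date]

res <- x_dt[i_dt, on = .(ID, join_date > start_date, join_date <= end_date)]

res$date       # original date from the x table, kept as-is
res$join_date  # holds the start of the window, as in the question
```

Note that `:=` modifies `x_dt` by reference, so the extra column stays on the table unless it is removed afterwards.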
This explanation sums it up well:
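When performing any join, only one copy of each key column is returned in the result. Currently, the column from i is returned and labelled with the column name from x, making equi joins consistent with the behaviour of base merge().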
This makes sense if you bear in mind how things were before version 1.9.8.
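Up to and including the current version 1.12.2 of data.table, this (and several overlapping issues) has generated a lot of discussion on the data.table GitHub issue list. For example, "possible inconsistency in non-equi join, returning join columns #3437" and "SQL-like column return for non-equi and rolling joins #2706" are just two of them.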
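However, keep an eye on this GitHub issue: continuing the discussion above, the keen analytical minds of the data.table team are working to reduce the confusion in some (hopefully not too distant) future version: "Both columns for rolling and non-equi joins #3093".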