Merging two tables on an exact match and a fuzzy match

Time: 2017-11-20 23:02:28

Tags: r merge data.table fuzzy-logic exact-match

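I want to merge two tables based on an exact match on one variable and a fuzzy match on another.

Consider the two tables below. For each id1 in dt1, I want to find an id2 in dt2 that matches exactly on size and whose date in dt2 is on or after the date in dt1. If there are multiple matches, I want to pick one at random.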

library(data.table)

dt1 <- data.table(c("A", "B"), c(2, 3), as.Date(c("2013-03-27", "2014-05-08"), format = '%Y-%m-%d'))
setnames(dt1, c("V1", "V2", "V3"),
c("id1", "size", "date"))

dt2 <- data.table(1:10, c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4), as.Date(c("2014-02-25", "2011-08-02", "2014-06-21", "2013-11-29", "2012-02-21", "2011-12-02",
"2014-04-22", "2011-03-05", "2014-04-21", "2014-10-29"), format = '%Y-%m-%d'))
setnames(dt2, c("V1", "V2", "V3"),
c("id2", "size", "date"))

The resulting table might look like this:

   id1 size       date  id2
1:   A    2 2013-03-27    1
2:   B    3 2014-05-08    3

or like this (depending on the random selection):

   id1 size       date  id2
1:   A    2 2013-03-27    4
2:   B    3 2014-05-08    3    

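2 Answers:

Answer 0: (score: 1)

I'm not sure this is what most people have in mind when they say 'fuzzy matching' - it sounds like you want to join the two tables and then pick at random among the matching rows, like so: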

library(data.table)
library(tidyverse)

set.seed(1234)
dt1 <- data.table(c("A", "B"), c(2, 3), as.Date(c("2013-03-27", "2014-05-08"), format = '%Y-%m-%d'))
setnames(dt1, c("V1", "V2", "V3"),
         c("id1", "size", "date"))

dt2 <- data.table(1:10, c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4), as.Date(c("2014-02-25", "2011-08-02", "2014-06-21", "2013-11-29", "2012-02-21", "2011-12-02",
                                                                   "2014-04-22", "2011-03-05", "2014-04-21", "2014-10-29"), format = '%Y-%m-%d'))
setnames(dt2, c("V1", "V2", "V3"),
         c("id2", "size", "date"))

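# Pair rows on size; the two date columns become date.x (dt1) and date.y (dt2)
dt <- full_join(dt1, dt2, by = "size") %>% 
  filter(date.y >= date.x) %>% 
  # Group by id1 (not size) so every id1 gets its own draw even if sizes repeat
  group_by(id1) %>%
  sample_n(size = 1)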

Answer 1: (score: 1)

To join on size and pick the appropriate date entries, we can use a non-equi join:

> # Rename the date columns to make the join step clear:
> setnames(dt1, "date", "date1")
> setnames(dt2, "date", "date2")

> # Non equi-join will give all entries in dt2 matching on size where
> # date2 >= date1:
> dt2[dt1, on=.(size, date2 >= date1)]
   id2 size      date2 id1
1:   4    2 2013-03-27   A
2:   1    2 2013-03-27   A
3:   3    3 2014-05-08   B
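
Note that in the output above the date2 column holds the values of dt1's date1 (2013-03-27 and 2014-05-08), not the dates stored in dt2. If both dates are needed, one option is to pick the columns explicitly in j using the `i.` and `x.` prefixes. The following is a minimal sketch rather than part of the original answer; the name `both_dates` is made up for illustration, and the tables are rebuilt with the date columns already named date1 and date2 so the block runs on its own:

```r
library(data.table)

# Same data as in the question, with the date columns already named date1/date2.
dt1 <- data.table(id1 = c("A", "B"), size = c(2, 3),
                  date1 = as.Date(c("2013-03-27", "2014-05-08")))
dt2 <- data.table(id2 = 1:10, size = c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4),
                  date2 = as.Date(c("2014-02-25", "2011-08-02", "2014-06-21",
                                    "2013-11-29", "2012-02-21", "2011-12-02",
                                    "2014-04-22", "2011-03-05", "2014-04-21",
                                    "2014-10-29")))

# i.date1 is the date from dt1; x.date2 is the original date from dt2.
both_dates <- dt2[dt1, on = .(size, date2 >= date1),
                  .(id1, size, date1 = i.date1, id2, date2 = x.date2)]
both_dates
```

Every row of `both_dates` should then satisfy date2 >= date1, which makes the join condition easy to check by eye.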

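I couldn't find a robust way to do the random selection step as part of the join itself. As a hacky workaround, we can add a new column of shuffled row numbers to the table above and then, for each id1, keep the row with the largest shuffled row number: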

> joined <- dt2[dt1, on=.(size, date2 >= date1)]
> joined[, selection_column := sample(.I, .N)] 
> filtered <- joined[,.SD[which.max(selection_column)], by=id1]
> filtered[, selection_column := NULL]
> filtered
   id1 id2 size      date2
1:   A   1    2 2013-03-27
2:   B   3    3 2014-05-08
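
A shorter variant that stays within data.table is to sample one row per group directly after the join, without the helper column. This is only a sketch and not part of the original answer; the seed value and the name `picked` are arbitrary, and the tables are rebuilt here so the block is self-contained:

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the draw is repeatable

dt1 <- data.table(id1 = c("A", "B"), size = c(2, 3),
                  date1 = as.Date(c("2013-03-27", "2014-05-08")))
dt2 <- data.table(id2 = 1:10, size = c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4),
                  date2 = as.Date(c("2014-02-25", "2011-08-02", "2014-06-21",
                                    "2013-11-29", "2012-02-21", "2011-12-02",
                                    "2014-04-22", "2011-03-05", "2014-04-21",
                                    "2014-10-29")))

joined <- dt2[dt1, on = .(size, date2 >= date1)]

# Within each id1 group, .N is the number of candidate rows;
# sample(.N, 1) draws one row index and .SD[...] keeps that row.
picked <- joined[, .SD[sample(.N, 1)], by = id1]
picked
```

Because `sample(.N, 1)` with `.N == 1` always returns 1, groups with a single candidate (id1 B here) are handled without a special case.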

Alternatively, we can use dplyr for the random selection step:

> library(dplyr)
> dt2[dt1, on=.(size, date2 >= date1)] %>% 
+   group_by(id1) %>% 
+   sample_n(1) %>% 
+   as.data.table()  
   id2 size      date2 id1
1:   4    2 2013-03-27   A
2:   3    3 2014-05-08   B