Limiting the merge range with roll = "nearest"

Time: 2019-11-24 11:01:51

Tags: r merge data.table fuzzyjoin

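I have two datasets that I want to merge. From this link: Doing a "fuzzy" and non-fuzzy, many to 1 merge with data.table, I know that when there is no exact match, these data.tables can be merged on the nearest year, as follows: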

  library(data.table)
  dfA <- fread("
  A   B   C   D   E   F   G   Z   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2010   NLD2010
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2010   AUS2010
  4   1   0   1   0   0   1   0   AUS   2006   AUS2006
  5   0   1   0   1   0   1   1   USA   2008   USA2008
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2012   USA2012
  8   1   0   1   0   0   1   0   BLG   2008   BLG2008
  9   0   1   0   1   1   0   1   BEL   2008   BEL2008
  10  1   0   1   0   0   1   0   BEL   2010   BEL2010
  11  0   1   1   1   0   1   0   NLD   2010   NLD2010
  12  1   0   0   0   1   0   1   NLD   2014   NLD2014
  13  0   0   0   1   1   0   0   AUS   2010   AUS2010
  14  1   0   1   0   0   1   0   AUS   2006   AUS2006
  15  0   1   0   1   0   1   1   USA   2008   USA2008
  16  0   0   1   0   0   0   1   USA   2010   USA2010
  17  0   1   0   1   0   0   0   USA   2012   USA2012
  18  1   0   1   0   0   1   0   BLG   2008   BLG2008
  19  0   1   0   1   1   0   1   BEL   2008   BEL2008
  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE)

  dfB <- fread("
  A   B   C   D   H   I   J   K   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2009   NLD2009
  2   1   0   0   0   1   0   1   NLD   2014   NLD2018
  3   0   0   0   1   1   0   0   AUS   2011   AUS2011
  4   1   0   1   0   0   1   0   AUS   2007   AUS2007
  5   0   1   0   1   0   1   1   USA   2007   USA2007
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2013   USA2013
  8   1   0   1   0   0   1   0   BLG   2007   BLG2007
  9   0   1   0   1   1   0   1   BEL   2009   BEL2009
  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE)

# rename the matchcode, iso and year columns so the two tables can be told apart
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

# store the column order so it can be restored at the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)

#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]

#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]

#set column order
setcolorder(result, colorder)

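I have two questions about this.

1) EDIT: this question was the result of a typo.

2) NLD 2014 in dfA is matched to NLD 2018 in dfB. If I consider a 4-year difference too large and want to limit it to two years, how do I do that?

How can I limit the number of years allowed between dfA and dfB?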

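1 answer:

Answer 0: (score: 3)

You have two options:

  1. Use roll = 2 or roll = -2, which requires the nearest match to be within 2 years in one direction: roll = 2 carries a dfB row forward (yearB no later than yearA), while roll = -2 carries it backward (yearB no earlier than yearA).
  2. Add two more columns to dfA to make it an explicit non-equi join.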
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = 2 ] 

# or
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = -2 ] 
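To see which direction each sign of roll looks in, here is a minimal sketch on made-up toy tables (x, i and val are hypothetical names, not part of the question's data). The years are chosen so that no gap lands exactly on the 2-year limit.

```r
library(data.table)

# Hypothetical toy tables, not the dfA/dfB from the question
x <- data.table(iso = "NLD", year = c(2009L, 2018L), val = c("b2009", "b2018"))
i <- data.table(iso = "NLD", year = c(2010L, 2014L, 2017L))

# roll = 2: carry an x row forward, so x's year may be up to 2 years earlier
fwd <- x[i, on = c("iso", "year"), roll = 2]

# roll = -2: carry an x row backward, so x's year may be up to 2 years later
bwd <- x[i, on = c("iso", "year"), roll = -2]

fwd$val  # "b2009" NA NA
bwd$val  # NA NA "b2018"
```

Neither call matches 2014, because the only candidates are 5 years earlier or 4 years later. A single finite roll never looks both ways, which is why the second option uses an explicit window.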

The non-equi join takes a bit more work, because it does not accept the roll = 'nearest' argument; you need to use mult = 'first' or filter the result in a follow-up step.

dfA[, `:=`(min_year.join = yearA - 2,
           max_year.join = yearA + 2)]

result <- dfB[dfA,
              on = .(iso.join,
                          year.join <= max_year.join,
                          year.join >= min_year.join)
              #, mult = 'first'
              ]

#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join", names(result)) := NULL ] # pattern no longer anchored with $, so every column containing "join" is dropped

#set column order
setcolorder(result, colorder)
result
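The follow-up filtering step could look roughly like this. It is a minimal sketch on hypothetical toy tables A and B (idA, gap, min_year and max_year are illustrative names, not from the question): after the non-equi join, compute the year gap and keep the closest candidate for each row of A. Note that mult = 'first' keeps a single match without looking at the distance, so it may not return the closest year.

```r
library(data.table)

# Hypothetical toy tables, not the dfA/dfB from the question
A <- data.table(idA = 1:3,
                iso = c("NLD", "NLD", "USA"),
                yearA = c(2010L, 2014L, 2010L))
B <- data.table(iso = c("NLD", "NLD", "NLD", "USA"),
                yearB = c(2008L, 2009L, 2018L, 2010L))

# window bounds on A; a copy of yearB on B, because the join column
# is overwritten by the bounds in the join result
A[, `:=`(min_year = yearA - 2L, max_year = yearA + 2L)]
B[, year.join := yearB]

# all B rows within +/- 2 years of each A row (unmatched A rows get NA)
cand <- B[A, on = .(iso, year.join >= min_year, year.join <= max_year)]
cand[, grep("join", names(cand)) := NULL]

# keep the candidate with the smallest absolute gap per A row
cand[, gap := abs(yearB - yearA)]
nearest <- cand[order(gap), .SD[1L], keyby = idA]

nearest$yearB  # 2009 NA 2010
```

Row 1 of A has two candidates inside the window (2008 and 2009) and keeps the closer one. Row 2 has none, because 2018 is 4 years away, so it stays in the result with NA.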