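I have two databases that I want to merge. From this link: Doing a "fuzzy" and non-fuzzy, many to 1 merge with data.table, I know that when there is no direct match, these data.tables can be merged on the nearest year, as follows: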
library(data.table)
dfA <- fread("
A B C D E F G Z iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE)
dfB <- fread("
A B C D H I J K iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2018 NLD2018
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE)
#rename the matchcode, iso and year columns so both versions survive the join
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))
#store the column order so it can be restored at the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)
#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]
#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]
#set column order
setcolorder(result, colorder)
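I have two questions about this.
1) Edit: this question turned out to be the result of a typo.
2) NLD 2014 in dfA is matched with NLD 2018 in dfB. If I consider a difference of four years too large and want to limit it to two years, what should I do? More generally, how can I limit the number of years allowed between dfA and dfB?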
Answer 0 (score: 3)
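You have two options:
1) Use roll = 2 or roll = -2, which requires the nearest match to be within 2 years in one direction. A positive value carries a dfB row forward to later years in dfA; a negative value carries it backward to earlier years.
2) Add two more columns to dfA to make it an explicit non-equi join.
#perform left join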
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = 2 ]
# or
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = -2 ]
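To make the direction of the roll concrete, here is a minimal sketch on toy data. The tables x and i, the column valB and the 2-year cutoff are made up for illustration and are not the tables from the question. The last variant is one possible way to get "within 2 years in either direction": keep roll = "nearest" and then blank out matches that are too far away.

```r
library(data.table)

# Hypothetical toy data, not the tables from the question
x <- data.table(iso = "NLD", yearB = c(2009L, 2018L), valB = c("b2009", "b2018"))
i <- data.table(iso = "NLD", yearA = c(2010L, 2017L, 2014L))

# Separate join columns so yearA and yearB both survive the join
x[, year.join := yearB]
i[, year.join := yearA]
on_cols <- c("iso", "year.join")

# roll = 2: a row of x is carried forward to later years, at most 2 years
fwd <- x[i, on = on_cols, roll = 2]

# roll = -2: a row of x is carried backward to earlier years, at most 2 years
bwd <- x[i, on = on_cols, roll = -2]

# Either direction: take the nearest match, then drop it if the gap exceeds 2 years
near <- x[i, on = on_cols, roll = "nearest"]
near[abs(yearA - yearB) > 2L, c("yearB", "valB") := NA]

fwd$valB   # only 2010 gets a match (from 2009)
bwd$valB   # only 2017 gets a match (from 2018)
near$valB  # 2010 and 2017 match; 2014 is 4 years from 2018 and is blanked
```

Filtering after roll = "nearest" loses nothing here: if the nearest year is more than 2 years away, no other year can be within 2 years either.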
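The non-equi join requires more work on your side: it has no roll = 'nearest' argument, so when several dfB rows fall inside the window you either need mult = 'first' or have to filter in a subsequent step.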
dfA[, `:=`(min_year.join = yearA - 2,
max_year.join = yearA + 2)]
result <- dfB[dfA,
on = .(iso.join,
year.join <= max_year.join,
year.join >= min_year.join)
#, mult = 'first'
]
#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join", names(result)) := NULL ] #removed $
#set column order
setcolorder(result, colorder)
result
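The "filter in a subsequent step" route could look roughly like the sketch below. It uses toy data again: the tables x and i, the id column and valB are assumptions made for illustration. Since mult = 'first' keeps the first matching row of x, which is not necessarily the closest year, this sketch keeps all matches inside the window and then picks the one with the smallest gap per row of i.

```r
library(data.table)

# Hypothetical toy data; `id` identifies each row of the left-hand table
x <- data.table(iso = "NLD",
                yearB = c(2009L, 2012L, 2019L),
                valB  = c("b2009", "b2012", "b2019"))
i <- data.table(id = 1:2, iso = "NLD", yearA = c(2010L, 2016L))

# Join column on x, and a +/- 2 year window on i
x[, year.join := yearB]
i[, `:=`(min_year.join = yearA - 2L, max_year.join = yearA + 2L)]

# Non-equi join: every row of x whose year falls inside the window of a row of i
res <- x[i, on = .(iso,
                   year.join >= min_year.join,
                   year.join <= max_year.join)]

# Distance between the two years; NA where nothing fell inside the window
res[, gap := abs(yearA - yearB)]

# Keep the closest match per row of i
closest <- res[order(id, gap), .SD[1L], by = id]

closest[, .(id, yearA, yearB, valB, gap)]
```

In this toy example id 1 (2010) has two candidates, 2009 and 2012, and keeps 2009; id 2 (2016) has no candidate within 2 years and stays unmatched.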