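I need to match two datasets on three variables. By design, two of the three variables cannot contain spelling mistakes. Only the third variable needs fuzzy matching.
The standard fuzzy merge causes problems because it fuzzy-joins on all three variables.
Is there a way to specify which of the three variables should be fuzzy matched and which should be matched exactly?
Reproducible example: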
dataset_1 <- setNames(data.frame(c(1995,1996,1995,1996),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBB")), c("var_1", "var_2", "var_3"))
dataset_2 <- setNames(data.frame(c(1995,1996,1995,1996),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBC"),c("A","B","C","D")), c("var_1", "var_2", "var_3","var_4"))
library(fuzzyjoin)
merged <- stringdist_join(dataset_1, dataset_2,
  by = c("var_1", "var_2", "var_3"),
  max_dist = 2,
  method = "soundex",
  mode = "full",
  ignore_case = FALSE)
Desired result:
merged <- setNames(data.frame(rep(1995,4),c("AA","AA","BB","BB"),c("AAAA","AAAA","BBBB","BBBB"),c("A","B","C","D")), c("var_1", "var_2", "var_3","var_4"))
Answer 0 (score: 0)
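stringdist_join is a wrapper around fuzzy_join, and fuzzy_join has a match_fun argument that can be a list of functions, one for each column in your by argument. So we can use fuzzy_full_join (which is fuzzy_join with mode = "full") with == for the two exact variables and a string-distance test for the third.
Due to the nature of fuzzy matching, the lhs and rhs values are often not the same, so the result contains two sets of by columns (suffixed .x and .y); if you only want to keep the lhs values, the .y set has to be dropped afterwards. The join itself: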
library(fuzzyjoin)
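res <- fuzzy_full_join(dataset_1, dataset_2,
  by = c("var_1", "var_2", "var_3"),
  # one function per by column: exact, exact, fuzzy
  # (the soundex distance is only ever 0 or 1, so <= 2 accepts any pair)
  match_fun = list(`==`, `==`,
    function(x, y) stringdist::stringdist(x, y, method = "soundex") <= 2))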
res
#   var_1.x var_2.x var_3.x var_1.y var_2.y var_3.y var_4
# 1    1995      AA    AAAA    1995      AA    AAAA     A
# 2    1996      AA    AAAA    1996      AA    AAAA     B
# 3    1995      BB    BBBB    1995      BB    BBBB     C
# 4    1996      BB    BBBB    1996      BB    BBBC     D
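
The answer mentions keeping only the lhs set of by columns but does not show that step. Below is a minimal base R sketch of one way to do it, assuming res looks exactly like the printout above. The data frame is rebuilt by hand here so the snippet runs without fuzzyjoin installed, and lhs_only is just an illustrative name.

```r
# res as printed above, rebuilt by hand (assumption: same columns and values)
res <- data.frame(
  var_1.x = c(1995, 1996, 1995, 1996),
  var_2.x = c("AA", "AA", "BB", "BB"),
  var_3.x = c("AAAA", "AAAA", "BBBB", "BBBB"),
  var_1.y = c(1995, 1996, 1995, 1996),
  var_2.y = c("AA", "AA", "BB", "BB"),
  var_3.y = c("AAAA", "AAAA", "BBBB", "BBBC"),
  var_4   = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# drop the rhs copies of the by columns (everything ending in ".y")
lhs_only <- res[, !grepl("\\.y$", names(res)), drop = FALSE]

# strip the ".x" suffix so the names match dataset_1 again
names(lhs_only) <- sub("\\.x$", "", names(lhs_only))

lhs_only
```

One caveat: with a full join, rows that exist only in dataset_2 would have NA in the .x columns, so dropping the .y columns would discard their keys. If that can happen in your data, filling the .x values from the .y values before dropping them is probably the safer choice.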