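I have two tables containing addresses (street, city, zip code, and two fields holding concatenations of these values). I want to fuzzy-match on Zipcode, but only for rows whose StrCity value is exactly the same. I started by selecting only the addresses whose StrCity matches one in the dictionary and then fuzzy matching, but there are two problems:
1) If I match on Zipcode, Street and City are not taken into account.
2) If I match on Address (which contains Zipcode, Street and City together), it also returns candidates where another street in the same zip code is close enough in terms of distance.
I probably need to do two different kinds of matching at the same time (one fuzzy and one exact), but I don't know how to implement that without killing my computer performance-wise.
Here is a data sample from TableAd: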
StrCity ID Zipcode Street City Address
BiałowiejskaWarszawa 5148676 01-459 Białowiejska Warszawa 01-459BiałowiejskaWarszawa
BukowińskaWarszawa 6423687 02-730 Bukowińska Warszawa 02-730BukowińskaWarszawa
KanałowaWarszawa 6425093 03-536 Kanałowa Warszawa 03-536KanałowaWarszawa
Dictionary sample (TableDict):
Zipcode Street City Address StrCity
02-882 Agaty Warszawa 02-882AgatyWarszawa AgatyWarszawa
03-663 Kanałowa Warszawa 03-663KanałowaWarszawa KanałowaWarszawa
03-536 Kołowa Warszawa 03-536KołowaWarszawa KołowaWarszawa
Here is my current code:
TableMatch <- merge(TableAd, TableDict, by="StrCity")
TableMatch <- TableMatch[, -grep("y", colnames(TableMatch))]
names(TableMatch)[names(TableMatch)=="Zipcode.x"] <- "Zipcode"
names(TableMatch)[names(TableMatch)=="Address.x"] <- "Address"
ResultTable <- TableMatch %>%
stringdist_left_join(TableDict, by="Address", distance_col="dist", method="lv", max_dist=5, ignore_case = TRUE) %>%
select(ID, Zipcode.x, Address.x, Address.y, dist) %>%
group_by(Address.x) %>%
# select best fit record
top_n(-1, dist)
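I ran into the problem specifically in the sample given above. The script verifies that the StrCity KanałowaWarszawa exists in the dictionary, but the Levenshtein distance of the combined address string when the zip code is changed is the same as when the street is changed to Kołowa, which has the same zip code as the one being checked. Here it returns both changes, but if the zip code differed by only 1 or 2 digits, it could wrongly suggest replacing the street when it is the zip code that should be changed.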
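The tie described above can be reproduced with a minimal sketch. It assumes base R's adist() (generalized Levenshtein distance) as a stand-in for the "lv" method of stringdist; the variable names are only illustrative.

```r
# Combined address strings taken from the sample data
ad_address   <- "03-536KanałowaWarszawa"
dict_zip_off <- "03-663KanałowaWarszawa"  # same street, different zip code
dict_str_off <- "03-536KołowaWarszawa"    # same zip code, different street

# adist() returns a distance matrix; [1, 1] extracts the single value
d_zip <- adist(ad_address, dict_zip_off)[1, 1]
d_str <- adist(ad_address, dict_str_off)[1, 1]

# Both candidates are equally far from the input address
print(c(zip_changed = d_zip, street_changed = d_str))
```

Because the two distances coincide, ranking candidates by distance on the concatenated string alone cannot tell a wrong zip code from a wrong street.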
Note: I am using the packages purrr, dplyr and fuzzyjoin.
Answer 0 (score: 1)
Here is an approach that uses fuzzyjoin's general-purpose join function, which makes it more flexible:
Data
TableAd <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "StrCity ID Zipcode Street City Address
BiałowiejskaWarszawa 5148676 01-459 Białowiejska Warszawa 01-459BiałowiejskaWarszawa
BukowińskaWarszawa 6423687 02-730 Bukowińska Warszawa 02-730BukowińskaWarszawa
KanałowaWarszawa 6425093 03-536 Kanałowa Warszawa 03-536KanałowaWarszawa")
# The header must name all five columns; with one name missing,
# read.table() would treat the first column as row names
TableDict <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Zipcode Street City Address StrCity
02-882 Agaty Warszawa 02-882AgatyWarszawa AgatyWarszawa
03-663 Kanałowa Warszawa 03-663KanałowaWarszawa KanałowaWarszawa
03-536 Kołowa Warszawa 03-536KołowaWarszawa KołowaWarszawa")
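Solution
library(fuzzyjoin)
library(stringdist)
library(dplyr)  # provides %>% and select()
res <- fuzzy_left_join(
  TableAd,
  TableDict,
  by = c("StrCity", "Zipcode"),
  # exact match on StrCity, Levenshtein distance of at most 5 on Zipcode
  match_fun = list(`==`, function(x, y) stringdist(tolower(x), tolower(y), method = "lv") <= 5)
)
res %>%
  select(StrCity = StrCity.x, everything(), -StrCity.y)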
# StrCity ID Zipcode.x Street.x City.x Address.x Zipcode.y Street.y City.y Address.y
# 1 BialowiejskaWarszawa 5148676 01-459 Bialowiejska Warszawa 01-459BialowiejskaWarszawa <NA> <NA> <NA> <NA>
# 2 BukowinskaWarszawa 6423687 02-730 Bukowinska Warszawa 02-730BukowinskaWarszawa <NA> <NA> <NA> <NA>
# 3 KanalowaWarszawa 6425093 03-536 Kanalowa Warszawa 03-536KanalowaWarszawa 03-663 Kanalowa Warszawa 03-663KanalowaWarszawa
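The problem with the solution above is that it builds a Cartesian product internally, which can be an issue if you have a lot of data. The impact was smaller in your version because you were joining on concatenated strings, but that felt like an avoidable hack.
One way to solve this is to apply the fuzzy join only to pairs of subsets determined by the exact matches. Below we define a function that does this, along with extended sample data.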
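To get a feel for why restricting the fuzzy comparison to exact-match groups helps, here is a rough count of pair comparisons. It is only a sketch: the key vectors are hypothetical toy values standing in for the exact-match columns, and it uses base R only.

```r
# Hypothetical toy keys standing in for the exact-match columns
ad_keys   <- c("Agaty", "Agaty", "Bukowinska", "Kanalowa")
dict_keys <- c("Agaty", "Bukowinska", "Kanalowa", "Kolowa")

# Full Cartesian product: every address row against every dictionary row
n_all <- length(ad_keys) * length(dict_keys)

# Grouped comparison: only rows that share an exact key are paired
shared      <- intersect(ad_keys, dict_keys)
ad_counts   <- as.vector(table(ad_keys)[shared])
dict_counts <- as.vector(table(dict_keys)[shared])
n_blocked   <- sum(ad_counts * dict_counts)

print(c(cartesian = n_all, grouped = n_blocked))
```

The gap between the two counts widens as the tables grow and the exact keys become more selective.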
Data
TableAd2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "ID Zipcode Street City
5148676 01-459 Białowiejska Warszawa
6423687 02-730 Bukowińska Warszawa
6423687 99-999 Agaty Warszawa
6423687 02-883 Agaty Warszawa
6425093 03-536 Kanałowa Warszawa")
TableDict2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Zipcode Street City
02-882 Agaty Warszawa
03-663 Kanałowa Warszawa
03-536 Kołowa Warszawa
02-730 Bukowińska Warszawa")
Function
fuzzy_inner_join2 <- function(x, y, by, match_fun, ...) {
  # TRUE for every match function that is plain equality
  match_fun_equal_lgl <- sapply(match_fun, identical, `==`)
  # columns to use for exact join equivalent
  by_exact <- by[match_fun_equal_lgl]
  # columns to use for fuzzy join on relevant subsets of data (for efficiency)
  by_fuzzy <- by[!match_fun_equal_lgl]
  # update match_fun
  match_fun <- match_fun[!match_fun_equal_lgl]
  # trim inputs of irrelevant data
  x <- dplyr::semi_join(x, y, by = by_exact)
  y <- dplyr::semi_join(y, x, by = by_exact)
  # make lists so we have pairs of data frames to fuzzy join together;
  # both inputs now hold the same key combinations, so the groups line up
  x_list <- dplyr::group_split(dplyr::group_by_at(x, by_exact))
  # `.keep` replaces the deprecated `keep` argument in dplyr >= 1.0.0
  y_list <- dplyr::group_split(dplyr::group_by_at(y, by_exact), .keep = FALSE)
  # apply fuzzy join on pairs and bind the results
  purrr::map2_dfr(x_list, y_list, fuzzyjoin::fuzzy_inner_join,
                  match_fun = match_fun, by = by_fuzzy, ...)
}
Solution
fuzzy_inner_join2(
TableAd2,
TableDict2,
by=c("City","Street","Zipcode"),
match_fun = list(
`==`, `==`,
function(x,y) stringdist(tolower(x), tolower(y), method="lv") <= 3)
)
# # A tibble: 3 x 5
# ID Zipcode.x Street City Zipcode.y
# <int> <chr> <chr> <chr> <chr>
# 1 6423687 02-883 Agaty Warszawa 02-882
# 2 6423687 02-730 Bukowinska Warszawa 02-730
# 3 6425093 03-536 Kanalowa Warszawa 03-663
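Answer 1 (score: 0)
To do partly fuzzy and partly exact matching with fuzzyjoin, you can supply several match_fun functions and write your own matchers. Here I use an exact match (==) for StrCity and stringdist for Zipcode and Address. To do that, I had to take the stringdist match_fun code from the package and customize it.
To make the zip code matching more accurate, you may want to break the zip code down into its digits and use a match_fun based on numeric closeness rather than stringdist.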
library(fuzzyjoin); library(dplyr)
# First, need to define match_fun_stringdist
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {
  ignore_case = TRUE
  method = "lv"
  max_dist = 99
  distance_col = "dist"
  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }
  # shortcut for Levenshtein-like methods: if the difference in
  # string length is greater than the maximum string distance, the
  # edit distance must be at least that large
  # length is much faster to compute than string distance
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist
    dists <- rep(NA, length(v1))
    dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
  } else {
    # have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method)
  }
  # tibble::tibble() replaces the deprecated dplyr::data_frame()
  ret <- tibble::tibble(include = (dists <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}
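# Now, call fuzzy_join with multiple match_fun
# (data1 and data2 are placeholders for your two tables)
fuzzy_join(data1, data2,
  by = list(x = c("Address", "Zipcode", "StrCity"), y = c("Address", "Zipcode", "StrCity")),
  match_fun = list(match_fun_stringdist, match_fun_stringdist, `==`),
  mode = "left"
) %>%
  # assumption: columns present in both tables come back with .x/.y suffixes,
  # so the grouping uses the .x versions
  group_by(StrCity.x, Zipcode.x, Address.x) %>%
  top_n(-1, Address.dist) %>%
  select(Address.dist, everything())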
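The suggestion above to compare zip codes by numeric closeness could look roughly like the sketch below. It is only an illustration: zip_to_number, zip_close and the max_diff threshold of 200 are hypothetical names and values, not part of fuzzyjoin, and whether numeric closeness is meaningful for your postal codes is an assumption you would need to check.

```r
# Hypothetical helper: strip non-digits, so "03-536" becomes the number 3536
zip_to_number <- function(zip) as.numeric(gsub("[^0-9]", "", zip))

# Hypothetical match function: TRUE where two zip codes are numerically
# within max_diff of each other (200 is an arbitrary illustrative threshold)
zip_close <- function(x, y, max_diff = 200) {
  abs(zip_to_number(x) - zip_to_number(y)) <= max_diff
}

zip_close("03-536", c("03-663", "02-882"))
# TRUE FALSE
```

A function of this shape takes two vectors and returns a logical vector, the same shape as the anonymous stringdist function used in the first answer, so it could be tried in place of the Zipcode entry in match_fun.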