Question

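Is it possible to weight a string distance metric, such as the Damerau-Levenshtein distance, so that the weights vary by character type?

I want to build a fuzzy address matcher and need to weight digits and letters differently, so that:

"5 James Street" and "5 Jmaes Street" are treated as the same, and

"5 James Street" and "6 James Street" are treated as different.

I considered splitting each address into numbers and letters before applying the string distance, but that would miss units such as "5a" and "5b". The word order is also inconsistent in the dataset, so one entry might read "James Street 5".

I am currently using R with the stringdist package, but I am not limited to that.

Thanks!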

Answer 1

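Here is an idea. It involves some manual processing, but it could be a good starting point. First, we compute the approximate string distance between the addresses with adist() (or stringdist() with whichever method best suits your data), without paying special attention to the street numbers.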

m <- adist(v) 
rownames(m) <- v

> m
#                     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#5 James Street          0    2    3    1    4   17   17
#5 Jmaes Street          2    0    4    3    6   17   17
#5#Jam#es Str$eet        3    4    0    4    6   17   17
#6 James Street          1    3    4    0    4   17   17
#James Street 5          4    6    6    4    0   16   17
#10a Cold Winter Road   17   17   17   17   16    0    1
#10b Cold Winter Road   17   17   17   17   17    1    0

In this case we can clearly identify two clusters, but we can also use hclust() to visualize them as a dendrogram.

cl <- hclust(as.dist(m))
plot(cl)
rect.hclust(cl, 2)

Then we label each address with its similarity cluster, go through the clusters, and check for matching street numbers.

library(dplyr)
res <- data.frame(cluster = cutree(cl, 2)) %>%
  tibble::rownames_to_column("address") %>%
  mutate(
    # Extract all components of the address
    lst = stringi::stri_extract_all_words(address),
    # Identify the component containing the street number and return it
    num = purrr::map_chr(lst, .f = ~ grep("\\d+", .x, value = TRUE))) %>% 
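  # For each cluster, tag matching street numbers.
  # cur_group_id() (dplyr >= 1.0.0) replaces the deprecated group_indices_().
  group_by(cluster, num) %>%
  mutate(group = cur_group_id()) %>%
  ungroup() %>%
  as.data.frame()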

This gives:

#               address cluster                     lst num group
#1       5 James Street       1        5, James, Street   5     1
#2       5 Jmaes Street       1        5, Jmaes, Street   5     1
#3     5#Jam#es Str$eet       1    5, Jam, es, Str, eet   5     1
#4       6 James Street       1        6, James, Street   6     2
#5       James Street 5       1        James, Street, 5   5     1
#6 10a Cold Winter Road       2 10a, Cold, Winter, Road 10a     3
#7 10b Cold Winter Road       2 10b, Cold, Winter, Road 10b     4

You can then use distinct() on group and pull() to extract the unique addresses:

> distinct(res, group, .keep_all = TRUE) %>% pull(address)
#[1] "5 James Street"       "6 James Street"       "10a Cold Winter Road"
#    "10b Cold Winter Road"

Data

v <- c("5 James Street", "5 Jmaes Street", "5#Jam#es Str$eet", "6 James Street", "James Street 5", "10a Cold Winter Road", "10b Cold Winter Road")
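
Weighted distance sketch

The clustering approach above sidesteps the original question of weighting edits by character type. Since the question is not tied to R, here is a minimal Python sketch of that route, using only the standard library. It implements the optimal string alignment variant of Damerau-Levenshtein and charges a higher cost for any edit that touches a token containing a digit, so "5a" vs "5b" is penalized as well as "5" vs "6". The names number_token_mask and weighted_osa and the cost values 10 and 1 are illustrative assumptions, not part of stringdist or any other library. It does not handle reordered addresses such as "James Street 5".

```python
import re


def number_token_mask(s):
    """Flag every character that sits inside a word containing a digit."""
    mask = [False] * len(s)
    for tok in re.finditer(r"\w+", s):
        if any(ch.isdigit() for ch in tok.group()):
            mask[tok.start():tok.end()] = [True] * (tok.end() - tok.start())
    return mask


def weighted_osa(a, b, heavy=10.0, light=1.0):
    """Optimal string alignment distance with character-type weights.

    An edit costs `heavy` if it touches a number-like token, else `light`.
    Both cost values are arbitrary choices for illustration.
    """
    ha, hb = number_token_mask(a), number_token_mask(b)

    def cost(*flags):
        return heavy if any(flags) else light

    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(ha[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost(hb[j - 1])

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            best = min(
                d[i - 1][j] + cost(ha[i - 1]),  # deletion
                d[i][j - 1] + cost(hb[j - 1]),  # insertion
                d[i - 1][j - 1]                 # match or substitution
                + (0.0 if same else cost(ha[i - 1], hb[j - 1])),
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                swap = cost(ha[i - 1], ha[i - 2], hb[j - 1], hb[j - 2])
                best = min(best, d[i - 2][j - 2] + swap)
            d[i][j] = best
    return d[n][m]


print(weighted_osa("5 James Street", "5 Jmaes Street"))              # 1.0
print(weighted_osa("5 James Street", "6 James Street"))              # 10.0
print(weighted_osa("10a Cold Winter Road", "10b Cold Winter Road"))  # 10.0
```

With a match threshold somewhere between the two cost levels (for example, treating distances up to 3 as the same address), the typo pair matches while the pairs that differ in street number or unit do not. The threshold is data-dependent and would need tuning.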
