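Fuzzy string matching between a list of character vectors and a character vector

Date: 2018-07-12 14:29:41

Tags: r string fuzzy-comparison

I have a list of character vectors and a single character vector. In R, I want to fuzzy match every element of the list (each a character vector) against every element (string) of the character vector, and return the maximum similarity score for each combination. Here is a toy example: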

a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a,b,c)

charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please", "tall person", "new building", "good example", "green with envy", "zebra crossing")

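Now I want to fuzzy match the first element of mylist against the first string in charvec and return the maximum of the 7 similarity scores. Likewise, I want a score for every combination of mylist and charvec.

My attempt so far:

Turn the strings in charvec into the column names of an empty data frame: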

df <- setNames(data.frame(matrix(ncol = 10, nrow = 3)), c(charvec))

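Use the jarowinkler distance from the RecordLinkage package (or any distance measure better suited to matching phrases) to compute the maximum similarity score for each combination: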

for (j in seq_along(mylist)) {
  for (i in length(ncol(df))) {
    df[[i,j]] <- max(jarowinkler(names(df)[i], mylist[[j]]))
  }
}

Unfortunately, I only get 3 scores in the first row; the remaining values are NA.

Any help would be appreciated.

3 Answers:

Answer 0 (score: 2)

Using purrr:

mylist <- setNames(mylist, c('a', 'b', 'c'))

library(purrr)
library(RecordLinkage)  # provides jarowinkler()

map_dfr(charvec,
    function(wrd, vec_list){
      setNames(map_df(vec_list, ~max(jarowinkler(wrd, .x))),
               names(vec_list)
      )

    },
    mylist)

# A tibble: 10 x 3
       a     b     c
   <dbl> <dbl> <dbl>
 1 0.911 0.580 0.603
 2 0.85  0.713 0.603
 3 0.842 0.557 0.515
 4 0.657 0.490 0.409
 5 0.912 0.489 0.659
 6 0.538 0.546 0.801
 7 0.716 0.547 0.740
 8 0.591 0.524 0.856
 9 0.675 0.509 0.821
10 0.619 0.587 0.630

If you want the wide format instead:

map_dfc(charvec,
         function(wrd, vec_list) {
          set_names(list(map_dbl(vec_list, ~max(jarowinkler(wrd, .x)))),
                    wrd)
         },
        mylist
)

# A tibble: 3 x 10
  `brown dog` `lazy cat` `white dress` `I know that` `excuse me plea~ `tall person` `new building` `good example`
        <dbl>      <dbl>         <dbl>         <dbl>            <dbl>         <dbl>          <dbl>          <dbl>
1       0.911      0.85          0.842         0.657            0.912         0.538          0.716          0.591
2       0.580      0.713         0.557         0.490            0.489         0.546          0.547          0.524
3       0.603      0.603         0.515         0.409            0.659         0.801          0.740          0.856
# ... with 2 more variables: `green with envy` <dbl>, `zebra crossing` <dbl>
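
The loop in the question can also be repaired directly. In that loop, `length(ncol(df))` evaluates to 1, so the inner loop only ever visits `i = 1`, and `df[[i, j]]` addresses row `i`, column `j`, which is why only the first row gets filled. Below is a minimal base-R sketch of a corrected loop. It assumes a hypothetical helper `lev_sim()`, a normalized Levenshtein similarity built on `utils::adist`, as a stand-in so the example runs without extra packages; `jarowinkler()` from RecordLinkage can be used in its place.

```r
# Toy data from the question (the list is named so the rows get labels)
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a = a, b = b, c = c)

charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please",
             "tall person", "new building", "good example", "green with envy", "zebra crossing")

# Hypothetical stand-in for jarowinkler(): normalized Levenshtein similarity in [0, 1].
# x is a single string, y is a character vector; one score is returned per element of y.
lev_sim <- function(x, y) {
  1 - as.vector(adist(x, y)) / pmax(nchar(x), nchar(y))
}

# One row per list element, one column per string in charvec
scores <- matrix(NA_real_, nrow = length(mylist), ncol = length(charvec),
                 dimnames = list(names(mylist), charvec))

for (j in seq_along(mylist)) {       # rows: elements of mylist
  for (i in seq_along(charvec)) {    # columns: every string in charvec, not just the first
    scores[j, i] <- max(lev_sim(charvec[i], mylist[[j]]))
  }
}

round(scores, 3)
```

Using a matrix rather than a data frame keeps the `[row, column]` indexing explicit; the scores will differ from the Jaro-Winkler values shown above because the similarity measure is different.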

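Answer 1 (score: 0)

First, a helper function that returns the best match for a given word among the strings of a character vector. I use the purrr package for the mapping because I prefer it to loops.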

library(purrr)
library(magrittr)
library(RecordLinkage)
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please", "tall person", "new building", "good example", "green with envy", "zebra crossing")

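getBestMatch <- function(word, vector){
  # Score `word` against every candidate in `vector` (the argument, not the global charvec),
  # then return the name of the highest-scoring candidate
  purrr::map_dbl(vector, ~RecordLinkage::jarowinkler(word, .x)) %>%
    magrittr::set_names(vector) %>%
    which.max %>%
    names
}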

Running the function produces the following output:

> getBestMatch("brown fox", charvec)
[1] "brown dog"

Now that we have the helper function, we just call it on each element of the vector a.

> map_chr(a, ~ getBestMatch(.x, charvec))
[1] "brown dog"        "lazy cat"         "white dress"      "I know that"     
[5] "I know that"      "new building"     "excuse me please"

Answer 2 (score: 0)

library(stringdist)

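# Flatten the list so that every phrase gets its own row
# (note: the name `dist` masks stats::dist in this session)
txt <- unlist( mylist )
dist <- stringdistmatrix( txt, charvec, method = "lcs" )
row.names( dist ) <- txt
colnames( dist ) <- charvec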

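In this example I used the lcs (Longest Common Substring) distance.

I suggest you check out the other methods as well: ?"stringdist-metrics"

The smaller the distance, the better the match...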

> dist
#               brown dog lazy cat white dress I know that excuse me please tall person new building good example green with envy zebra crossing
# brown fox             4       15          16          14               23          14           17           15              18             15
# lazy dog              9        6          15          15               20          13           14           18              21             14
# white cat            14        9           8          12               19          16           17           17              16             17
# I don't know         13       16          19          11               24          17           18           20              19             20
# sunset               13       12          13          13               16          13           14           16              17             16
# never mind           13       16          15          17               18          15           12           18              15             14
# excuse me            16       15          14          18                7          16           17           13              16             17
# very late            14        9          14          14               15          16           15           15              16             17
# do not cross         13       16          13          15               22          15           20           18              21             14
# sunrise              14       15          14          16               17          14           15           17              16             17
# long vacation        14       11          22          16               25          16           17           19              20             19
# toy example          16       13          16          16               15          14           19            5              20             21
# green apple          14       15          16          16               15          16           17           11              12             21
# tall building        16       17          18          20               25          12            7           21              22             17
# good rating          14       13          18          14               23          16           15           11              18             15
# accommodating        16       13          22          18               23          18           17           17              24             15
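
To get from a per-phrase distance matrix like this to one value per list element, the rows can be grouped by the list element they came from and reduced with `min`, since a smaller distance means a better match. The sketch below is only an illustration under stated assumptions: it uses base R's `adist` (Levenshtein distance) as a stand-in for `stringdistmatrix(..., method = "lcs")` so it runs without extra packages, and the names `txt`, `grp` and `best` are hypothetical.

```r
# Toy data from the question
mylist <- list(
  a = c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me"),
  b = c("very late", "do not cross", "sunrise", "long vacation"),
  c = c("toy example", "green apple", "tall building", "good rating", "accommodating")
)
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please",
             "tall person", "new building", "good example", "green with envy", "zebra crossing")

txt <- unlist(mylist, use.names = FALSE)     # all 16 phrases, in list order
grp <- rep(names(mylist), lengths(mylist))   # which list element each phrase belongs to

# Stand-in distance: Levenshtein via utils::adist (assumption; not the lcs metric used above)
d <- adist(txt, charvec)
dimnames(d) <- list(txt, charvec)

# For each list element, keep the smallest distance to each string in charvec
best <- t(sapply(split(seq_along(grp), grp),
                 function(idx) apply(d[idx, , drop = FALSE], 2, min)))
best
```

The result has one row per list element and one column per string in `charvec`; swapping `min` for `which.min` inside the inner function would instead return the position of the best-matching phrase within each group.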