Question

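I have two vectors of type character in R.

I want to compare a reference list against a list of raw strings using jarowinkler and assign a % similarity score. For example, if I have 10 reference items and 20 raw data items, I want to get the best score for each comparison and what the algorithm matched it to (so 2 vectors of 10). If I have raw data of size 8 and 10 reference items, I should end up with only 2 result vectors of 8 items, holding the best match and score for each item.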

item, match, matched_to
ice, 78, ice-cream

My code is below; there isn't much to look at.

NumItems.Raw = length(words)
NumItems.Ref = length(Ref.Desc)

for (item in words) 
{
  for (refitem in Ref.Desc)
  {
    jarowinkler(refitem,item)

    # Find Best match Score
    # Find Best Item in reference table
    # Add both items to vectors
    # decrement NumItems.Raw
    # Loop
  }
}

Answer 1

Using a toy example:

library(RecordLinkage)
library(dplyr)

ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala','bear','fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow','cat','horse')

wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)
wordlist %>%
  group_by(words) %>%
  mutate(match_score = jarowinkler(words, ref)) %>%
  summarise(match = match_score[which.max(match_score)],
            matched_to = ref[which.max(match_score)])

which gives

 words     match matched_to
1   cat 1.0000000        cat
2   cow 1.0000000        cow
3   dog 1.0000000        dog
4   emu 0.5277778       bear
5 horse 1.0000000      horse
6  kiwi 0.5350000      koala
7   pig 1.0000000        pig
8 sheep 1.0000000      sheep

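Edit: In response to the OP's comment, the last command uses the pipe approach from dplyr. It groups every combination of raw word and reference by the raw word, adds a column match_score holding the jarowinkler score, and returns only a summary of the highest match score (indexed by which.max(match_score)) together with the reference, which is also indexed by the maximum match_score.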
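For readers who prefer to avoid dplyr, the same group-then-pick-the-maximum logic can be sketched in base R. This is only an illustrative alternative, not the answerer's code: the names scores, best and result are made up for the example, and it assumes jarowinkler from RecordLinkage accepts two equal-length character vectors, which is how the pipeline above already calls it.

```r
library(RecordLinkage)

ref   <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala', 'bear', 'fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow', 'cat', 'horse')

# Score matrix: one row per raw word, one column per reference item.
# outer() passes two equal-length vectors covering every combination.
scores <- outer(words, ref, jarowinkler)

# Column index of the highest score in each row (the first one wins on ties).
best <- apply(scores, 1, which.max)

# One row per raw word: best score and the reference item it came from.
result <- data.frame(
  words      = words,
  match      = scores[cbind(seq_along(words), best)],
  matched_to = ref[best],
  stringsAsFactors = FALSE
)
result
```

Unlike the summarise() output, the rows stay in the original order of words rather than being sorted alphabetically.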

Answer 2

There is a package that already implements the Jaro-Winkler distance.

> install.packages("stringdist")
> library(stringdist)
> 1-stringdist('ice','ice-cream',method='jw')
[1] 0.7777778
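One detail worth knowing: in stringdist, method = 'jw' takes a prefix-weight argument p that defaults to 0, so the call above returns the plain Jaro similarity; setting p to something like 0.1 adds the Winkler prefix adjustment. The following is a rough sketch of how the question's best-match table might be built with stringdistmatrix. The ref and words vectors are borrowed from the toy example in Answer 1, and sim, best and result are names invented here.

```r
library(stringdist)

ref   <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala', 'bear', 'fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow', 'cat', 'horse')

# Similarity = 1 - distance; rows are raw words, columns are reference items.
# p = 0.1 is an assumed, commonly used Winkler prefix weight (the default p = 0 is plain Jaro).
sim <- 1 - stringdistmatrix(words, ref, method = "jw", p = 0.1)

# Column index of the best-scoring reference item for each raw word.
best <- apply(sim, 1, which.max)

result <- data.frame(
  words      = words,
  match      = sim[cbind(seq_along(words), best)],
  matched_to = ref[best],
  stringsAsFactors = FALSE
)
result
```

The result has one row per raw word, so 8 raw items against 10 reference items yield 8 scores and 8 matches, as the question asks.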

R: Fuzzy string matching with jarowinkler
