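Does anyone know of a string similarity method that gives the correct results below? I'm working with alphanumeric IDs, and the rankings I expect are noted in the code comments.
Levenshtein and Jaro-Winkler don't seem to do the right thing.
See the examples below.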
import jellyfish
t1="100"
t21=["100a","a100"] # case 1. expecting: similar, not similar
t22=["101","105","200"] # case 2. expecting: similar, less similar, least similar
fun = jellyfish.levenshtein_distance
print([fun(t1, t) for t in t21]) # all the same
print([fun(t1, t) for t in t22]) # all the same
fun = jellyfish.jaro_winkler_similarity  # called jaro_winkler in older jellyfish releases
print([fun(t1, t) for t in t21]) # all the same
print([fun(t1, t) for t in t22]) # all the same
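To make the Levenshtein ties concrete, here is a minimal pure-Python sketch of the textbook edit distance. It is not jellyfish's implementation, and the helper name `levenshtein` is just illustrative. Every candidate is exactly one insertion or substitution away from "100", so the metric has no way to rank them:

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


target = "100"
candidates = ["100a", "a100", "101", "105", "200"]
distances = {c: levenshtein(target, c) for c in candidates}
print(distances)
# {'100a': 1, 'a100': 1, '101': 1, '105': 1, '200': 1}
```

Plain edit distance counts single-character operations only, so it cannot tell a trailing suffix from a leading prefix, or a numerically close ID from a distant one.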
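To add to the fun, the first string can carry a prefix that is basically irrelevant to its role as an ID but that throws off the string similarity.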
t1="pre-100"
t21=["100a","a100"] # expecting: similar, not similar
t22=["101","105","200"] # expecting: similar, less similar, least similar
fun = jellyfish.levenshtein_distance
print([fun(t1, t) for t in t21]) # picks the wrong one
print([fun(t1, t) for t in t22]) # all the same
fun = jellyfish.jaro_winkler_similarity  # called jaro_winkler in older jellyfish releases
print([fun(t1, t) for t in t21]) # picks the wrong one
print([fun(t1, t) for t in t22]) # picks the right one