String similarity where the order and difference of ASCII codes matter

Asked: 2018-01-12 13:52:32

Tags: python string levenshtein-distance jaro-winkler

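Does anyone know of a string similarity method that gives the correct results for the cases below? I am working with alphanumeric IDs, where:

  1. Changes in the early part of the string matter more than changes in the later part. I guess I could use n-grams? That might be a problem when one of the strings has a prefix, though.
  2. The size of a character change matters: changing "a" to "b" should count as a smaller difference than changing "a" to "c".
  3. Levenshtein and Jaro-Winkler do not seem to do the right thing.

    See the examples below.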

    import jellyfish
    t1="100"
    t21=["100a","a100"] # case 1. expecting: similar, not similar
    t22=["101","105","200"] # case 2. expecting: similar, less similar, least similar
    
    fun = jellyfish.levenshtein_distance
    print([fun(t1, t) for t in t21]) # all the same
    print([fun(t1, t) for t in t22]) # all the same
    
    # Note: newer jellyfish releases name this function jaro_winkler_similarity
    fun = jellyfish.jaro_winkler
    print([fun(t1, t) for t in t21]) # all the same
    print([fun(t1, t) for t in t22]) # all the same
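
A minimal sketch of the behaviour described in points 1 and 2, assuming a plain position-by-position comparison with no alignment (so it is not an edit distance). The function `weighted_distance` and its parameters `decay` and `gap_cost` are hypothetical names invented for this illustration, and the default values are arbitrary choices that would need tuning for real IDs.

```python
def weighted_distance(a: str, b: str, decay: float = 0.1, gap_cost: float = 1.0) -> float:
    """Compare two strings position by position (illustrative sketch only).

    - Position i is weighted by decay**i, so early positions count more.
    - A substitution costs the code point difference scaled by 127 (capped at 1),
      so "a" -> "b" is cheaper than "a" -> "c".
    - A position present in only one string costs gap_cost.
    """
    total = 0.0
    for i in range(max(len(a), len(b))):
        weight = decay ** i
        if i < len(a) and i < len(b):
            cost = min(abs(ord(a[i]) - ord(b[i])) / 127, 1.0)
        else:
            cost = gap_cost  # one string has run out of characters
        total += weight * cost
    return total


t1 = "100"
case1 = [weighted_distance(t1, t) for t in ["100a", "a100"]]
case2 = [weighted_distance(t1, t) for t in ["101", "105", "200"]]
print(case1)  # smaller value means more similar
print(case2)
```

With these defaults the distances increase in the order the question expects for both cases. A `decay` close to 1 would let a large change late in the string outweigh a small change early on, so the ordering depends on that choice.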
    

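    To add to the fun, the first string may carry a prefix that is basically irrelevant to the string as an ID, but that confuses the string similarity measures.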

    t1="pre-100"
    t21=["100a","a100"] # expecting: similar, not similar
    t22=["101","105","200"] # expecting: similar, less similar, least similar
    
    fun = jellyfish.levenshtein_distance
    print([fun(t1, t) for t in t21]) # picks the wrong one
    print([fun(t1, t) for t in t22]) # all the same
    
    # Note: newer jellyfish releases name this function jaro_winkler_similarity
    fun = jellyfish.jaro_winkler
    print([fun(t1, t) for t in t21]) # picks the wrong one
    print([fun(t1, t) for t in t22]) # picks the right one
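
One way to keep an irrelevant prefix from distorting the scores is to normalize the IDs before comparing them. The sketch below assumes, purely for illustration, that such a prefix is a run of letters followed by a hyphen (as in "pre-100"). The question does not state the real prefix format, so `PREFIX_PATTERN` and `strip_prefix` are hypothetical and would have to be adapted.

```python
import re

# Assumed convention (not given in the question): letters followed by a hyphen.
PREFIX_PATTERN = re.compile(r"^[A-Za-z]+-")


def strip_prefix(identifier: str) -> str:
    """Remove one leading 'letters-' prefix, if present; otherwise return the ID unchanged."""
    return PREFIX_PATTERN.sub("", identifier, count=1)


ids = ["pre-100", "100a", "a100"]
normalized = [strip_prefix(s) for s in ids]
print(normalized)
```

Any similarity function can then be applied to the normalized strings, which reduces the prefixed case to the unprefixed one shown first.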
    

0 answers:

No answers yet