How to compute the Levenshtein distance between sentences in a text

Date: 2018-12-27 15:25:46

Tags: python nlp

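I want to compute the Levenshtein distance between sentences in a document. I found some code that computes the distance at the character level, but I want it at the word level. For example, the character-level output here is 6, but I want it to be 1, meaning that turning b into a (or a into b) only requires adding or removing a single word: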

a = "The patient tolerated this ."
b = "The patient tolerated ."

def levenshtein_distance(a, b):
    # Works on any two sequences: strings are compared character by character.
    if a == b:
        return 0
    # Make sure a is the longer sequence, so each row has len(b) + 1 entries.
    if len(a) < len(b):
        a, b = b, a
    if not b:
        return len(a)
    previous_row = list(range(len(b) + 1))
    for i, column1 in enumerate(a):
        current_row = [i + 1]
        for j, column2 in enumerate(b):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (column1 != column2)
            current_row.append(min(insertions, deletions, substitutions))
        # Advance to the next row only after the inner loop has finished.
        previous_row = current_row
    return previous_row[-1]

result = levenshtein_distance(a, b)
print(result)

1 Answer:

Answer 0 (score: 0)

I suggest not reinventing the wheel: you can use pylev (https://pypi.org/project/pylev/). To install it, just run pip install pylev in a console. Then compute the distance over words instead of letters:

 import pylev
 a = "The patient tolerated this ."
 b = "The patient tolerated ."
 a = a.split(" ")
 b = b.split(" ")
 print(pylev.levenshtein(a,b))

Keep in mind that this solution is case-sensitive and assumes that all words are separated by a single space.
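If those two limitations matter, one possible workaround is to normalize the sentences before comparing them. The following is only a minimal sketch with no third-party dependency, not part of pylev: the function name word_levenshtein and its ignore_case parameter are made up for illustration. It assumes that lowercasing and splitting on any run of whitespace is an acceptable tokenization for your text, and that punctuation is already separated from words by whitespace, as in the example sentences.

```python
def word_levenshtein(sentence_a, sentence_b, ignore_case=True):
    """Levenshtein distance counted in words rather than characters.

    Hypothetical helper for illustration; tokenization is plain whitespace splitting.
    """
    if ignore_case:
        sentence_a, sentence_b = sentence_a.lower(), sentence_b.lower()
    # split() without an argument splits on any run of whitespace,
    # so repeated spaces or tabs do not produce empty "words".
    words_a, words_b = sentence_a.split(), sentence_b.split()

    # Standard two-row dynamic programming over the word lists.
    previous_row = list(range(len(words_b) + 1))
    for i, word_a in enumerate(words_a, start=1):
        current_row = [i]
        for j, word_b in enumerate(words_b, start=1):
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            substitution = previous_row[j - 1] + (word_a != word_b)
            current_row.append(min(insertion, deletion, substitution))
        previous_row = current_row
    return previous_row[-1]


print(word_levenshtein("The patient tolerated this .", "the patient  tolerated ."))  # prints 1
```

Passing ignore_case=False restores case-sensitive comparison. For text where punctuation is attached to words, a real tokenizer would be a better fit than split().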