Question

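Suppose you have two lists of strings that contain similar items, with variations (for example, List 1: Apples, fruits_b, orange; List 2: Fruit, apples, banana, orange_juice).

Given a distance metric such as the Levenshtein distance, what is a good algorithm for finding the optimal pairing, that is, the pairing that minimizes the sum of the distances over all pairs?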

The result corresponding to my example would be:

Apples    - apples
fruits_b  - Fruit
orange    - orange_juice
          - banana

Side question: is there a tool that already implements this or something similar?

Answer 1

OK, here is my Python solution using the Levenshtein distance and the Hungarian algorithm (both provided by external packages):

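import sys

from Levenshtein import distance  # external package: edit distance
from munkres import Munkres       # external package: Hungarian algorithm

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Usage: fuzzy_match.py file file")
        print("Finds the best pairing of lines from the two input files")
        print("using the Levenshtein distance and the Hungarian algorithm")
        sys.exit(1)  # stop here instead of failing on the missing arguments

    # Read one string per line from each file
    with open(sys.argv[1]) as f:
        w1 = [line.strip() for line in f]
    with open(sys.argv[2]) as f:
        w2 = [line.strip() for line in f]

    # Make w1 the longer list (so the output columns may be swapped relative
    # to the argument order), then pad w2 with empty strings so the cost
    # matrix is square. An unmatched string is therefore charged its own length.
    if len(w2) > len(w1):
        w1, w2 = w2, w1
    w2.extend([""] * (len(w1) - len(w2)))

    # Case-insensitive cost matrix: matrix[i][j] = distance(w1[i], w2[j])
    matrix = [[distance(a.lower(), b.lower()) for b in w2] for a in w1]

    # Solve the assignment problem and print the pairs in two aligned columns
    width = max(len(w) for w in w1) + 10
    for i, j in Munkres().compute(matrix):
        print(f"{w1[i]:<{width}}{w2[j]}")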

It works very well. Still, I would be curious if anyone can come up with a better algorithm!
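For very small lists, the objective (the minimum total distance over all pairings) can be checked without any external package. The following is only a sketch under stated assumptions, not the method used in the script above: `levenshtein` and `best_pairing` are hypothetical helper names, the edit distance is a plain dynamic-programming version, and the search is brute force over all permutations rather than the Hungarian algorithm, so it is only practical for a handful of strings. It pads the shorter list with empty strings, as the script above does.

```python
from itertools import permutations


def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def best_pairing(xs, ys):
    """Brute-force search for the pairing with the smallest total distance."""
    n = max(len(xs), len(ys))
    # Pad the shorter list with "" so both lists have n entries
    xs = xs + [""] * (n - len(xs))
    ys = ys + [""] * (n - len(ys))
    # Case-insensitive cost matrix
    cost = [[levenshtein(x.lower(), y.lower()) for y in ys] for x in xs]
    # Each permutation p assigns xs[i] to ys[p[i]]; keep the cheapest one
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return [(xs[i], ys[best[i]]) for i in range(n)]


pairs = best_pairing(["Apples", "fruits_b", "orange"],
                     ["Fruit", "apples", "banana", "orange_juice"])
for left, right in pairs:
    print(f"{left:<12}{right}")
```

On the example lists from the question this gives the pairing shown there, with banana left unmatched. The number of permutations grows factorially with the list length, which is why the script above relies on the Hungarian algorithm instead.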

Algorithm for fuzzy pairing of strings from two lists
