我有两个非常长的列表,我想为第二个列表中第一个列表的每个元素找到最长的公共子字符串。
一个简化的例子是
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
所以我想在L2(“xy_b_c”)中找到“a_b_c”的最佳匹配,然后在L2(“z_d_e_y”)中找到“d_e_f”的最佳匹配。对我来说最匹配的是具有最长公共字符的字符串。在 我查看了Levenshtein Distance的示例,它适用于小型列表(http://www.stavros.io/posts/finding-the-levenshtein-distance-in-python/),但是我的列表L2有163531个元素,并且在过去15分钟内甚至找不到一个匹配..
我没有CS背景,有人能指出一些更好的算法(甚至更好,它的实现?:))非常感谢。
当前代码(从链接中复制而来自stackoverflow中的其他人):
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
def levenshtein_distance(first, second):
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
for string in L1:
print sorted(L2,key = lambda x:levenshtein_distance(x,string))[0]
编辑 - 只需点击控制+ C,它在15分钟后给了我一个不正确(但很接近)的答案。那只是第一个字符串,而且还有很多字符串...
答案 0 :(得分:6)
使用difflib
模块:
>>> from functools import partial
>>> from difflib import SequenceMatcher
def func(x, y):
s = SequenceMatcher(None, x, y)
return s.find_longest_match(0, len(x), 0, len(y)).size
...
for item in L1:
f = partial(func, item)
print max(L2, key=f)
...
xy_b_c
z_d_e_y
答案 1 :(得分:1)
您还可以查看The Levenshtein Python C extension module。如果我在你的例子中用随机字符串测试它,它似乎比python实现快约150倍。并使用 Ashwini Chaudhary 所示的max
。