Question

Say I have two strings:

str1="wild animals are trying to escape the deserted jungle to the sandy island"
str2="people are trying to escape from the smoky mountain to the sandy road"

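To find the matches between these two strings, I generate k-grams of a fixed length (10 here), compute their hash values, and compare the hashes from the two strings. For example, suppose the matching k-grams from the two strings are: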

['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']
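For reference, here is a minimal sketch of how a list of matching k-grams like this could be produced. It is an illustration only, not necessarily the hashing scheme used here: `kgrams` is a hypothetical helper, spaces are assumed to be stripped before slicing, and Python's built-in set hashing stands in for an explicit hash function.

```python
def kgrams(text, k=10):
    """Return the set of all k-grams of `text` after removing spaces.

    Hypothetical helper: assumes k-grams are taken over the string
    with spaces stripped, as the example k-grams suggest.
    """
    s = text.replace(' ', '')
    return {s[i:i + k] for i in range(len(s) - k + 1)}

str1 = 'wild animals are trying to escape the deserted jungle to the sandy island'
str2 = 'people are trying to escape from the smoky mountain to the sandy road'

# Set intersection compares the k-grams by hash first, then by value.
common = kgrams(str1) & kgrams(str2)
print(sorted(common))
```

This keeps every shared 10-gram, so the result contains the four k-grams listed above along with any other shared ones.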

Please suggest an efficient way to find the longest contiguous substring match from these k-grams. In the example above, the actual answer is

"aretryingtoescape"

Thanks in advance!

Answer 1

First, make yourself a coverage mask consisting of 0s and 1s (or other characters, if you prefer), then find the longest run of 1s with itertools.groupby().
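As a rough sketch of the second step only, the longest run of 1s in a mask can be located with `itertools.groupby()` along these lines. The function name `longest_run` and the sample mask are made up for illustration.

```python
from itertools import groupby

def longest_run(mask):
    """Return (start, length) of the longest run of truthy values in mask.

    Hypothetical helper; returns (0, 0) when the mask has no truthy entry.
    """
    best_start, best_len = 0, 0
    pos = 0
    # groupby yields consecutive groups of equal key, here True/False.
    for covered, group in groupby(mask, key=bool):
        n = len(list(group))
        if covered and n > best_len:
            best_start, best_len = pos, n
        pos += n
    return best_start, best_len

mask = [0, 1, 1, 0, 1, 1, 1, 0]
start, length = longest_run(mask)
print(start, length)
```

With the start and length in hand, the substring is simply `s[start:start + length]` for whichever string the mask was built over.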

Answer 2

Code following Ignacio's idea:

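#!/usr/bin/env python

from itertools import groupby

str1 = 'wild animals are trying to escape the deserted jungle to the sandy island'
str2 = 'people are trying to escape from the smoky mountain to the sandy road'

words = ['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']

def solve(strings, words):
    # Work on the shortest input string, with spaces removed.
    s = min((t.replace(' ', '') for t in strings), key=len)
    # coverage[i] is True when s[i] lies inside some matched k-gram.
    coverage = [False] * len(s)
    for w in words:
        p = s.find(w)  # note: only the first occurrence of w is marked
        if p >= 0:
            for i in range(p, p + len(w)):
                coverage[i] = True
    # Group consecutive positions by their coverage flag; keep the covered runs.
    runs = [''.join(ch for _, ch in g)
            for covered, g in groupby(enumerate(s), key=lambda x: coverage[x[0]])
            if covered]
    # Longest covered run, or '' when no k-gram matched.
    return max(runs, key=len, default='')

print(solve([str1, str2], words))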

Finding contiguous substring matches
