Finding consecutive substring matches

Time: 2010-11-25 19:10:06

Tags: python algorithm

Say I have two strings:

str1="wild animals are trying to escape the deserted jungle to the sandy island"
str2="people are trying to escape from the smoky mountain to the sandy road"

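To find matches between these two strings, I generate k-grams of a fixed length (10 here, with spaces removed), compute their hash values, and compare the hashes from the two strings. For example, suppose the matching k-grams from the two strings are: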

['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy'] 
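The k-gram matching step described above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `kgrams` and `matching_kgrams` are hypothetical, Python's built-in `hash()` stands in for whatever hash function is actually used, and every shared k-gram is reported, so the result is a superset of the four-item list shown above.

```python
def kgrams(text, k=10):
    """Return every length-k substring of text, with spaces removed first."""
    s = text.replace(' ', '')
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def matching_kgrams(a, b, k=10):
    """Return the k-grams of a (in order, without repeats) whose hash also
    occurs among the k-grams of b. hash() is an assumed stand-in here."""
    hashes_b = {hash(g) for g in kgrams(b, k)}
    seen = set()
    matches = []
    for g in kgrams(a, k):
        if hash(g) in hashes_b and g not in seen:
            seen.add(g)
            matches.append(g)
    return matches

str1 = "wild animals are trying to escape the deserted jungle to the sandy island"
str2 = "people are trying to escape from the smoky mountain to the sandy road"

shared = matching_kgrams(str1, str2)
print(shared)
```

Comparing hashes alone can, in principle, report a false match on a hash collision, so a real implementation may want to confirm each candidate by comparing the k-gram text as well.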

Please suggest an efficient way to find the consecutive substring match formed by these k-grams. In the example above, the actual answer is

"aretryingtoescape"  

Thanks in advance!!!

2 answers:

Answer 0: (score: 2)

First, make yourself a coverage mask made up of 0 and 1 (or other characters, if you prefer), then find the longest run of 1s with itertools.groupby().
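As a minimal sketch of the second half of that suggestion, the snippet below finds the longest run of 1s in a mask using `itertools.groupby()`. The function name `longest_run_of_ones` and the sample mask are made up for illustration; building the mask from the k-grams is not shown here.

```python
from itertools import groupby

def longest_run_of_ones(mask):
    """Return (start, length) of the longest run of '1' in a 0/1 mask string.

    Returns (0, 0) if the mask contains no '1' at all.
    """
    best_start, best_len = 0, 0
    pos = 0
    # groupby() yields one group per run of identical consecutive characters.
    for char, group in groupby(mask):
        run_len = len(list(group))
        if char == '1' and run_len > best_len:
            best_start, best_len = pos, run_len
        pos += run_len
    return best_start, best_len

mask = "0011100111110"  # hypothetical mask: '1' marks a covered position
start, length = longest_run_of_ones(mask)
print(start, length)  # → 7 5
```

With the start and length in hand, the matched text is simply the slice `s[start:start + length]` of the string the mask was built for.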

Answer 1: (score: 0)

Code following Ignacio's idea:

#!/usr/bin/env python

from itertools import groupby

str1 = 'wild animals are trying to escape the deserted jungle to the sandy island'
str2 = 'people are trying to escape from the smoky mountain to the sandy road'

words = ['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']

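def solve(strings, words):
    # Work on the shortest input string, with spaces removed.
    s = min([ s.replace(' ', '') for s in strings ], key=len)
    # coverage[i] is True when position i of s lies inside a matched k-gram.
    coverage = [False]*len(s)
    for w in words:
        p = s.find(w)  # first occurrence only
        if p >= 0:
            for i in range(len(w)):
                coverage[p+i] = True
    # Group consecutive positions by coverage flag and keep the longest covered run.
    return max([ ''.join([ y[1] for y in g ]) for k, g in groupby(enumerate(s), key=lambda x: coverage[x[0]]) if k ], key=len)

print(solve([str1, str2], words))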
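One caveat with the code above: `s.find(w)` only marks the first occurrence of each k-gram, and `max()` raises an error when nothing is covered. The following variant is a sketch of one possible way around both, assuming non-empty k-grams; the name `longest_covered` is hypothetical and not part of the original answer.

```python
from itertools import groupby

def longest_covered(s, kgrams):
    """Return the longest stretch of s covered by any occurrence of the k-grams.

    Returns '' when no k-gram occurs in s. Assumes every k-gram is non-empty.
    """
    covered = [False] * len(s)
    for w in kgrams:
        p = s.find(w)
        while p >= 0:
            # Mark this occurrence, then look for the next (possibly overlapping) one.
            covered[p:p + len(w)] = [True] * len(w)
            p = s.find(w, p + 1)
    runs = [''.join(ch for _, ch in g)
            for is_cov, g in groupby(enumerate(s), key=lambda x: covered[x[0]])
            if is_cov]
    return max(runs, key=len, default='')

s = 'peoplearetryingtoescapefromthesmokymountaintothesandyroad'
words = ['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']
print(longest_covered(s, words))  # → aretryingtoescape
```

The `default=''` argument to `max()` requires Python 3.4 or later.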