Say I have two strings:
str1="wild animals are trying to escape the deserted jungle to the sandy island"
str2="people are trying to escape from the smoky mountain to the sandy road"
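To find matches between these two strings, I generate k-grams of a fixed length (10 here), compute their hash values, and compare the hashes from the two strings. For example, suppose the matching k-grams from the two strings are: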
['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']
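Please suggest an efficient way to find the longest run of consecutive (overlapping or adjacent) matches among these k-grams. In the example above, the expected answer is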
"aretryingtoescape"
Thanks in advance!!!
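For reference, the matching step described above could be sketched roughly as follows. This is only an illustration: the function names `kgrams` and `matching_kgrams` are hypothetical, spaces are assumed to be stripped before the k-grams are cut, and Python's built-in set hashing stands in for whatever hash function is actually used.

```python
def kgrams(text, k=10):
    """Return all k-grams of text, in order, after removing spaces."""
    s = text.replace(' ', '')
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def matching_kgrams(a, b, k=10):
    """Return the k-grams of a that also occur in b (in order, no duplicates)."""
    # Set membership relies on Python's built-in string hashing.
    seen_in_b = set(kgrams(b, k))
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(g for g in kgrams(a, k) if g in seen_in_b))


str1 = "wild animals are trying to escape the deserted jungle to the sandy island"
str2 = "people are trying to escape from the smoky mountain to the sandy road"
print(matching_kgrams(str1, str2))
```

This returns every shared 10-gram, so the four k-grams listed above are a subset of what it prints.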
Answer 0 (score: 2)
First, make yourself a coverage mask consisting of 0s and 1s (or other characters, if you prefer), then find the longest run of 1s with itertools.groupby().
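As a small illustration of the second step only, here is one way the longest run of 1s could be located once a mask exists. The helper name `longest_run` and the sample mask are made up for this sketch.

```python
from itertools import groupby


def longest_run(mask, mark='1'):
    """Return (start, length) of the longest run of `mark` in mask, or (0, 0) if none."""
    best_start, best_len = 0, 0
    pos = 0
    # groupby yields one group per maximal run of equal characters.
    for value, group in groupby(mask):
        n = len(list(group))
        if value == mark and n > best_len:
            best_start, best_len = pos, n
        pos += n
    return best_start, best_len


print(longest_run('0011101111100'))  # (6, 5)
```

The start index and length can then be used to slice the matching substring out of the original (space-stripped) string.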
Answer 1 (score: 0)
Code following Ignacio's idea:
#!/usr/bin/env python
from itertools import groupby

str1 = 'wild animals are trying to escape the deserted jungle to the sandy island'
str2 = 'people are trying to escape from the smoky mountain to the sandy road'
words = ['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']

def solve(strings, words):
    # Work on the shortest input string, with spaces removed.
    s = min([x.replace(' ', '') for x in strings], key=len)
    # coverage[i] is True when s[i] lies inside a matched k-gram.
    coverage = [False] * len(s)
    for w in words:
        p = s.find(w)  # only the first occurrence of each k-gram is marked
        if p >= 0:
            for i in range(len(w)):
                coverage[p + i] = True
    # Group characters by their coverage flag and keep the covered runs.
    runs = [''.join(ch for _, ch in g)
            for k, g in groupby(enumerate(s), key=lambda x: coverage[x[0]]) if k]
    return max(runs, key=len)

print(solve([str1, str2], words))
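The code above marks only the first occurrence of each k-gram. If a k-gram can occur more than once in the string, a variant along the following lines might be preferable. It is a sketch under that assumption, not a tested replacement; the name `longest_covered` is hypothetical, and it takes a single string rather than picking the shortest of several.

```python
from itertools import groupby


def longest_covered(text, words):
    """Return the longest stretch of text (spaces removed) covered by any of words."""
    s = text.replace(' ', '')
    covered = [False] * len(s)
    for w in words:
        p = s.find(w)
        while p >= 0:
            # Mark every character of this occurrence, then look for the next one.
            covered[p:p + len(w)] = [True] * len(w)
            p = s.find(w, p + 1)
    # Pair each character with its flag and collect the covered runs.
    runs = [''.join(ch for _, ch in g)
            for k, g in groupby(zip(covered, s), key=lambda pair: pair[0]) if k]
    return max(runs, key=len, default='')


str1 = 'wild animals are trying to escape the deserted jungle to the sandy island'
words = ['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']
print(longest_covered(str1, words))  # aretryingtoescape
```

Passing `default=''` to `max` also avoids a `ValueError` when none of the k-grams is found.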