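I have a long list of substrings (close to 16,000), and I want to find where a repeating cycle starts and stops. I came up with this code as a starting point: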
strings= ['1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',
'1001000000110110',
'0010000001101101',
'1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',]
pat = [ '1100100100000010',
'1001001000000110',
'0010010000001100',]
for i in range(0, len(strings) - 1):
    for j in range(0, len(pat)):
        if strings[i] == pat[j]:
            continue
        if strings[i + 1] == pat[j]:
            print('match', strings[i])
            break
        break
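The problem with this approach is that you have to know in advance which pat to search for. I would like to start with the first n strings as the pattern (3 in this case) and search for them; if there is no match, shift down by one string to the next group of 3, and so on until either the whole list has been traversed or a repeat is found. I believe that if n is large enough (maybe 10), it will find the repeat without taking too long.
Answer 0 (score: 1)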
strings= ['1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',
'1001000000110110',
'0010000001101101',
'1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',]
n = 3
patt_dict = {}
# Slide a window of n strings over the list; the "+ 1" includes the last window.
for i in range(len(strings) - n + 1):
    patt = ' '.join(strings[i:i + n])
    # Count how many times each window of n strings has been seen.
    patt_dict[patt] = patt_dict.get(patt, 0) + 1
for key in patt_dict:
    if patt_dict[key] > 1:
        print('Found ' + str(patt_dict[key]) + ' repeating instances of ' + str(key) + '.')
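Give this a try. For a fixed n it runs in time linear in the number of strings. It basically uses a dictionary to count how many times each pattern of n consecutive strings occurs in the list. If a count is greater than 1, we have a repeating pattern :)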
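Since the question asks where the repeat starts, here is a minimal sketch of a variant that stores the first index of each window instead of a count. The function name `first_repeat` is made up for illustration, and stopping at the first repeated window is an assumption about what is wanted.

```python
strings = ['1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011',
           '1001000000110110',
           '0010000001101101',
           '1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011']


def first_repeat(strings, n):
    """Return (first_index, repeat_index) for the first window of n
    consecutive strings that is seen a second time, or None."""
    seen = {}  # window (as a tuple) -> index where it first appeared
    for i in range(len(strings) - n + 1):
        window = tuple(strings[i:i + n])
        if window in seen:
            return seen[window], i
        seen[window] = i
    return None


print(first_repeat(strings, 3))
```

For the list from the question this reports that the window starting at index 0 shows up again at index 6, so the difference between the two indices (6) is a candidate cycle length.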
Answer 1 (score: 0)
Here is one that finds all the subarrays that match in an array of strings.
strings = ['A', 'B', 'C', 'D', 'Z', 'B', 'B', 'C', 'A', 'B', 'C']
pat = ['A', 'B', 'C', 'D']
i = 0
while i < len(strings):
    if strings[i] not in pat:
        i += 1
        continue
    # Start comparing from the position in pat where strings[i] occurs.
    matches = 0
    for j in range(pat.index(strings[i]), len(pat)):
        if i + j - pat.index(strings[i]) >= len(strings):
            break
        if strings[i + j - pat.index(strings[i])] == pat[j]:
            matches += 1
        else:
            break
    if matches:
        print('matched at index %d subsequence length: %d value %s' % (i, matches, strings[i]))
        i += matches
    else:
        i += 1
Output:
matched at index 0 subsequence length: 4 value A
matched at index 5 subsequence length: 1 value B
matched at index 6 subsequence length: 2 value B
matched at index 8 subsequence length: 3 value A
Answer 2 (score: 0)
Here is a fairly simple way to find all matches of all lengths >= 1:
def findall(xs):
    from itertools import combinations
    # x2i maps each member of xs to a list of all the
    # indices at which that member appears.
    x2i = {}
    for i, x in enumerate(xs):
        x2i.setdefault(x, []).append(i)
    n = len(xs)
    for ixs in x2i.values():
        if len(ixs) > 1:
            for i, j in combinations(ixs, 2):
                length = 1  # xs[i] == xs[j]
                while (i + length < n and
                       j + length < n and
                       xs[i + length] == xs[j + length]):
                    length += 1
                yield i, j, length
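Then, with strings being the list from the question:
for i, j, n in findall(strings):
    print("match of length", n, "at indices", i, "and", j)
displays (the order of the lines may vary between Python versions):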
match of length 4 at indices 0 and 6
match of length 1 at indices 3 and 9
match of length 3 at indices 1 and 7
match of length 2 at indices 2 and 8
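You haven't specified precisely what you do and don't want, so this lists all matches. You probably don't really want some of them. For example, the match of length 3 at indices 1 and 7 is just the tail end of the match of length 4 at indices 0 and 6.
So you'll need to alter the code to compute what you really want. Perhaps you want just a single, maximal match? All maximal matches? Only matches of a particular length? Etc.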
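As one possible reading of "all maximal matches", the sketch below keeps only matches that cannot be extended one step to the left, which drops tail-end matches like the one at indices 1 and 7. The name `find_maximal` and this definition of "maximal" are assumptions for illustration, not something the question specifies.

```python
from itertools import combinations

strings = ['1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011',
           '1001000000110110',
           '0010000001101101',
           '1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011']


def find_maximal(xs):
    """Yield (i, j, length) for repeated runs that cannot be extended to the left."""
    # Map each element to the (ascending) list of indices where it appears.
    positions = {}
    for idx, x in enumerate(xs):
        positions.setdefault(x, []).append(idx)
    n = len(xs)
    for idxs in positions.values():
        for i, j in combinations(idxs, 2):  # always i < j
            # Skip pairs that are the tail of a match starting one step earlier.
            if i > 0 and xs[i - 1] == xs[j - 1]:
                continue
            length = 1  # xs[i] == xs[j]
            # Because i < j, checking j + length < n also keeps i + length in range.
            while j + length < n and xs[i + length] == xs[j + length]:
                length += 1
            yield i, j, length


for i, j, length in find_maximal(strings):
    print("maximal match of length", length, "at indices", i, "and", j)
```

On the list from the question, only the length-4 match at indices 0 and 6 survives this filter.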