In a list

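Time: 2016-07-19 18:24:11

Tags: python string list for-loop sequence

I have a long list of substrings (close to 16,000) and I want to find where a repeating cycle starts and stops. I've come up with this code as a starting point: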

strings= ['1100100100000010',
        '1001001000000110',
        '0010010000001100',
        '0100100000011011',
        '1001000000110110',
        '0010000001101101',
        '1100100100000010',
        '1001001000000110',
        '0010010000001100',
        '0100100000011011',]

pat = [ '1100100100000010',
        '1001001000000110',
        '0010010000001100',]

for i in range(0,len(strings)-1):
    for j in range(0,len(pat)):
        if strings[i] == pat[j]:
            continue
        if strings[i+1] == pat[j]:
            print('match', strings[i])
            break
        break

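The problem with this approach is that you have to know in advance which pat to search for. I'd like to be able to take the first n items as a sublist (3 in this case) and search for them; if there is no match, move down one substring to the next 3, and so on until it has gone through the whole list or found the repeat. I believe that if the length is high enough (perhaps 10), it will find the repeat without taking too long.

3 answers:

Answer 0 (score: 1)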

strings= ['1100100100000010',
        '1001001000000110',
        '0010010000001100',
        '0100100000011011',
        '1001000000110110',
        '0010000001101101',
        '1100100100000010',
        '1001001000000110',
        '0010010000001100',
        '0100100000011011',]

n = 3

patt_dict = {}

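# There are len(strings) - n + 1 windows of size n; the "+ 1" keeps the last one.
for i in range(len(strings) - n + 1):
    patt = ' '.join(strings[i:i + n])
    # Test membership on the dict itself (a hash lookup); in Python 2,
    # .keys() builds a list and makes each check a linear scan.
    if patt not in patt_dict:
        patt_dict[patt] = 1
    else:
        patt_dict[patt] += 1

for key, count in patt_dict.items():
    if count > 1:
        print('Found ' + str(count) + ' repeating instances of ' + key + '.')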

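Give this a shot. For a fixed n it runs in time linear in the length of the list. It basically uses a dictionary to count how many times each pattern of n consecutive items appears. If a count exceeds 1, we have a repeating pattern :)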
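Since the question asks where the repeat starts and stops, a count alone may not be enough. Here is a minimal sketch, assuming Python 3, that records the start index of every window instead of only counting it. `find_repeated_windows` is a hypothetical helper name introduced for illustration, not something from the answer itself.

```python
from collections import defaultdict


def find_repeated_windows(items, n):
    """Map each window of n consecutive items that occurs more than once
    to the list of indices at which it starts."""
    starts = defaultdict(list)
    # len(items) - n + 1 windows in total, including the final one.
    for i in range(len(items) - n + 1):
        # Tuples are hashable, so they can serve as dictionary keys.
        starts[tuple(items[i:i + n])].append(i)
    return {window: idxs for window, idxs in starts.items() if len(idxs) > 1}


strings = ['1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011',
           '1001000000110110',
           '0010000001101101',
           '1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011']

for window, idxs in find_repeated_windows(strings, 3).items():
    print('window starting with', window[0], 'begins at indices', idxs)
```

On the sample list with n = 3 this reports two repeated windows, one starting at indices 0 and 6 and one at indices 1 and 7. The gap between two start indices (6 here) is a candidate for the cycle length.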

Answer 1 (score: 0)

Here is a way to find all sub-arrays of pat that match within an array of strings.

strings = ['A', 'B', 'C', 'D', 'Z', 'B', 'B', 'C', 'A', 'B', 'C']

pat = ['A', 'B', 'C', 'D']

i = 0
while i < len(strings):
    if strings[i] not in pat:
        i += 1
        continue
    matches = 0
    for j in range(pat.index(strings[i]), len(pat)):
        if i + j - pat.index(strings[i]) >= len(strings):
            break
        if strings[i + j - pat.index(strings[i])] == pat[j]:
            matches += 1
        else:
            break
    if matches:
        print('matched at index %d subsequence length: %d value %s' % (i, matches, strings[i]))
        i += matches
    else:
        i += 1

Output:

matched at index 0 subsequence length: 4 value A
matched at index 5 subsequence length: 1 value B
matched at index 6 subsequence length: 2 value B
matched at index 8 subsequence length: 3 value A

Answer 2 (score: 0)

Here is a fairly simple way to find all matches of every length >= 1:

def findall(xs):
    from itertools import combinations
    # x2i maps each member of xs to a list of all the
    # indices at which that member appears.
    x2i = {}
    for i, x in enumerate(xs):
        x2i.setdefault(x, []).append(i)
    n = len(xs)
    for ixs in x2i.values():
        if len(ixs) > 1:
            for i, j in combinations(ixs, 2):
                length = 1 # xs[i] == xs[j]
                while (i + length < n and
                       j + length < n and
                       xs[i + length] == xs[j + length]):
                    length += 1
                yield i, j, length

Then:

for i, j, n in findall(strings):
    print("match of length", n, "at indices", i, "and", j)

That displays:

match of length 4 at indices 0 and 6
match of length 1 at indices 3 and 9
match of length 3 at indices 1 and 7
match of length 2 at indices 2 and 8

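What you do and don't want hasn't been specified precisely, so this lists all matches. You probably don't really want some of them. For example, the match of length 3 at indices 1 and 7 is just the tail end of the match of length 4 at indices 0 and 6.

So you'll need to alter the code to compute what you really want. Maybe you want only a single, maximal match? All maximal matches? Only matches of a particular length? And so on.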
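If only the maximal matches are wanted, one option is to skip any pair whose preceding elements also match, because such a pair is just the tail of a longer match. Here is a sketch under that assumption, for Python 3. `maximal_matches` is a hypothetical name, and it keeps the same pairwise enumeration as findall, so it can be slow when a single item repeats many times.

```python
from itertools import combinations


def maximal_matches(xs):
    """Yield (i, j, length) for repeated runs that cannot be extended
    to the left or to the right."""
    positions = {}
    for i, x in enumerate(xs):
        positions.setdefault(x, []).append(i)
    n = len(xs)
    for idxs in positions.values():
        # idxs is in ascending order, so every pair has i < j.
        for i, j in combinations(idxs, 2):
            # If the elements just before i and j also match, this pair
            # is the tail of a longer match that starts earlier.
            if i > 0 and xs[i - 1] == xs[j - 1]:
                continue
            length = 1  # xs[i] == xs[j]
            # Because i < j, checking j + length < n also bounds i + length.
            while j + length < n and xs[i + length] == xs[j + length]:
                length += 1
            yield i, j, length


strings = ['1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011',
           '1001000000110110',
           '0010000001101101',
           '1100100100000010',
           '1001001000000110',
           '0010010000001100',
           '0100100000011011']

matches = list(maximal_matches(strings))
for i, j, length in matches:
    print("maximal match of length", length, "at indices", i, "and", j)

# The single longest match, if any match exists at all.
if matches:
    print("longest:", max(matches, key=lambda m: m[2]))
```

On the question's sample list this leaves only the match of length 4 at indices 0 and 6; the three shorter matches listed earlier are all tails of it and are filtered out.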