How to find repeated patterns of lines in a large file

Time: 2017-12-08 21:44:01

Tags: python algorithm

I have a large file that contains many strings, one per line, in this format:

A
B
D
B
C
D
C
D
C
D

As you can see, the pair C, D appears three times in a row, so I want to parse the file, find repeated patterns like this one, and output:

A
B
D
B
repeat C D 3 times
C
D

In the real data, a repeated pattern of adjacent lines can be up to 3 lines long, but never longer. For example:

A
B
D
B
C
D
E
C
D
E

So the output would be:

    A
    B
    D
    B
    Repeat C D E 2 times
    C
    D
    E

Here is code that runs, but how can I find the longest repeated pattern when other repeated patterns come before it?

def is_list_equal(left, right):
    if len(left) != len(right):
        return False

    for i in range(len(left)):
        if left[i] != right[i]:
            return False

    return True

def find_dup_from_end(l, min_pattern_len, min_repeats):
    assert min_pattern_len * min_repeats <= len(l)
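    # Integer division: the longest pattern that still fits min_repeats times.
    max_pattern_len = len(l) // min_repeats
    # Reverse the list so that the most recent lines come first.
    l = l[::-1]
    for i in range(max_pattern_len, min_pattern_len-1, -1):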
        s1 = l[:i]
        for j in range(1, min_repeats):
            s2 = l[len(s1)*j:len(s1)*(j+1)]
            if not is_list_equal(s1, s2):
                break
        else:
            return (len(l)-len(s1)*min_repeats, len(s1))

    return None, None


def find_dup(file, min_pattern_len=2, max_pattern_len=10, min_repeats=2):
    assert min_pattern_len > 0
    assert min_repeats > 1
    assert min_pattern_len <= max_pattern_len

    feed = []
    min_feed_length = min_pattern_len*min_repeats
    max_feed_length = max_pattern_len*min_repeats
    start = None
    length = None
    with open(file) as f:
        mylist = f.read().splitlines() 
        for line in mylist:
            line = line.strip()
            api = line
            fn = 'a'
            feed.append((fn, api))
            if len(feed) < min_feed_length:
                continue
            if len(feed) > max_feed_length:
                # The window is full: print the oldest line and drop it.
                fn2, api2 = feed[0]
                print('%s %s' % (fn2, api2))
                feed = feed[1:]
            feed1 = [api for _, api in feed]
            start, length = find_dup_from_end(feed1, min_pattern_len, min_repeats)
            if length:
                # Report the first repetition found and stop scanning.
                print((feed[start:start+length], min_repeats))
                break
        if length:
            # Keep the lines before the repetition plus one copy of the pattern.
            feed = feed[:start + length]
        for fn2, api2 in feed:
            print('%s %s' % (fn2, api2))


if __name__ == '__main__':
    find_dup('./apiLoopTest.text', 1, 2, 5)

a string1
a string2
([('a', 'string3'), ('a', 'string4')], 5)
a string3
a string4

1 answer:

Answer 0 (score: 1)

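The following code, which uses the re module, should give you a good starting point. Paste the sample text above and the regular expression into regex101.com to get a description of the regular expression and to test how it matches other sample inputs.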

import re

regex = r"(?P<_REP>(?P<GRP>(?:^[A-Z]$\n?){2,3})(?P=GRP)+)"

test_str = ("""\
A
B
D
B
C
D
E
C
D
E
""")

for match in re.finditer(regex, test_str, re.MULTILINE | re.DOTALL):
    groupdict  = match.groupdict()
    GRP, _REP = groupdict['GRP'], groupdict['_REP']
    # _REP is GRP repeated a whole number of times, so floor division is exact.
    print('The following group was repeated %i times from location %i\n%s'
          % (len(_REP) // len(GRP), match.start(), GRP))

Output:

The following group was repeated 2 times from location 8
C
D
E
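
If the lines are not single uppercase letters, or you want the output in exactly the format from the question, a plain line-based scan may be easier to adapt than the regex. The following is only a minimal sketch, not the approach used in the code above: the function name `collapse_repeats` and its parameters are made up for illustration, and it assumes the lines fit in memory as a list. At each position it tries the longest pattern length first, counts how many times that pattern repeats back to back, and emits a `Repeat ... N times` line followed by one copy of the pattern.

```python
def collapse_repeats(lines, min_pattern_len=1, max_pattern_len=3, min_repeats=2):
    """Collapse adjacent repeated patterns in a list of lines.

    Hypothetical helper: each run of a repeated pattern is replaced by a
    'Repeat ... N times' line followed by a single copy of the pattern.
    """
    out = []
    i = 0
    n = len(lines)
    while i < n:
        # Try the longest allowed pattern first, then shorter ones.
        for size in range(max_pattern_len, min_pattern_len - 1, -1):
            pattern = lines[i:i + size]
            if len(pattern) < size:
                continue  # not enough lines left for this pattern length
            repeats = 1
            # Count how many consecutive copies of the pattern follow.
            while lines[i + repeats * size:i + (repeats + 1) * size] == pattern:
                repeats += 1
            if repeats >= min_repeats:
                out.append('Repeat %s %d times' % (' '.join(pattern), repeats))
                out.extend(pattern)
                i += repeats * size  # skip the whole repeated run
                break
        else:
            # No pattern starts here: copy the line through unchanged.
            out.append(lines[i])
            i += 1
    return out


if __name__ == '__main__':
    sample = ['A', 'B', 'D', 'B', 'C', 'D', 'E', 'C', 'D', 'E']
    print('\n'.join(collapse_repeats(sample)))
```

Because longer patterns are tried first, a run such as six identical lines would be reported as a 3-line pattern repeated twice rather than a 1-line pattern repeated six times. If that matters for your data, the order of the `size` loop is the place to change it.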