I have a large file containing many strings, one per line, in this format:
A
B
D
B
C
D
C
D
C
D
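As you can see, C and D appear next to each other 3 times in a row, so I want to parse the file, find repeating patterns like this, and output: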
A
B
D
B
repeat C D 3 times
C
D
In the real data, a repeating pattern of adjacent lines can be up to 3 lines long, but never longer. For example:
A
B
D
B
C
D
E
C
D
E
So the output would be:
A
B
D
B
Repeat C D E 2 times
C
D
E
Here is my runnable code. But if other repeating patterns occur earlier in the file, how can I find the longest repeating pattern?
# Python 2 code (uses xrange and the print statement).
def is_list_equal(left, right):
    # Element-wise comparison of two sequences.
    if len(left) != len(right):
        return False
    for i in xrange(len(left)):
        if left[i] != right[i]:
            return False
    return True

def find_dup_from_end(l, min_pattern_len, min_repeats):
    # Look for a pattern repeated min_repeats times at the end of l.
    # Returns (start index of the repeated run, pattern length),
    # or (None, None) if nothing is found.
    assert min_pattern_len * min_repeats <= len(l)
    max_pattern_len = len(l) / min_repeats
    l = l[::-1]
    for i in xrange(max_pattern_len, min_pattern_len-1, -1):
        s1 = l[:i]
        for j in range(1, min_repeats):
            s2 = l[len(s1)*j:len(s1)*(j+1)]
            if not is_list_equal(s1, s2):
                break
        else:
            return (len(l)-len(s1)*min_repeats, len(s1))
    return None, None

def find_dup(file, min_pattern_len=2, max_pattern_len=10, min_repeats=2):
    assert min_pattern_len > 0
    assert min_repeats > 1
    assert min_pattern_len <= max_pattern_len
    feed = []
    min_feed_length = min_pattern_len*min_repeats
    max_feed_length = max_pattern_len*min_repeats
    start = None
    length = None
    with open(file) as f:
        mylist = f.read().splitlines()
        for line in mylist:
            line = line.strip()
            api = line
            fn = 'a'
            feed.append((fn, api))
            if len(feed) < min_feed_length:
                continue
            # Keep the buffer bounded: print and drop the oldest entry.
            if len(feed) > max_feed_length:
                fn2, api2 = feed[0]
                print fn2, api2
                feed = feed[1:]
            feed1 = [api for _, api in feed]
            start, length = find_dup_from_end(feed1, min_pattern_len, min_repeats)
            if length:
                print(feed[start:start+length], min_repeats)
                break
    # print "length", length
    feed = feed[:(start - length*(min_repeats-1))]
    for fn2, api2 in feed:
        print fn2, api2

if __name__ == '__main__':
    find_dup('./apiLoopTest.text', 1, 2, 5)
a string1
a string2
([('a', 'string3'), ('a', 'string4')], 5)
a string3
a string4
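One way to approach the "longest pattern" question is to scan the whole list instead of a sliding buffer: at each position, try every pattern length from the longest down and keep the candidate that covers the most lines. The following is only a minimal Python 3 sketch under those assumptions, not a drop-in replacement for the code above. `collapse_repeats` and its parameters are hypothetical names, and the header text is modelled on the expected output shown earlier.

```python
def collapse_repeats(lines, min_len=1, max_len=3, min_repeats=2):
    """Collapse back-to-back repeated blocks of min_len..max_len lines.

    A repeated block is replaced by a 'Repeat ... N times' header
    followed by a single copy of the block.
    """
    out = []
    i = 0
    while i < len(lines):
        best_len, best_count = 0, 0
        # Longest pattern first, so it wins when two candidates
        # cover the same number of lines.
        for size in range(max_len, min_len - 1, -1):
            block = lines[i:i + size]
            if len(block) < size:
                continue  # not enough lines left for this pattern length
            count = 1
            # Count how many times the block repeats right after itself.
            while lines[i + count * size:i + (count + 1) * size] == block:
                count += 1
            if count >= min_repeats and count * size > best_len * best_count:
                best_len, best_count = size, count
        if best_count:
            block = lines[i:i + best_len]
            out.append("Repeat %s %d times" % (" ".join(block), best_count))
            out.extend(block)
            i += best_len * best_count  # skip past the whole repeated run
        else:
            out.append(lines[i])
            i += 1
    return out


if __name__ == "__main__":
    for line in collapse_repeats("A B D B C D E C D E".split()):
        print(line)
```

The scan is greedy from left to right: a run that starts at the current line is taken even if a longer run starts one line later, so treat it as a starting point rather than an optimal search.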
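Answer 0 (score: 1)
The following code, which uses the re module, should give you a good starting point. Paste the sample text and the regular expression into regex101.com to get an explanation of the regex and to test it against other sample inputs.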
import re

regex = r"(?P<_REP>(?P<GRP>(?:^[A-Z]$\n?){2,3})(?P=GRP)+)"
test_str = ("""\
A
B
D
B
C
D
E
C
D
E
""")

for match in re.finditer(regex, test_str, re.MULTILINE | re.DOTALL):
    groupdict = match.groupdict()
    GRP, _REP = groupdict['GRP'], groupdict['_REP']
    # Integer division, so the repeat count stays an int on Python 3 as well.
    print('The following group was repeated %i times from location %i\n%s'
          % ((len(_REP) + 1) // len(GRP), match.start(), GRP))
Output:
The following group was repeated 2 times from location 8
C
D
E
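To get from the match report to the output format asked for in the question, the same backreference idea can be combined with `re.sub` and a replacement function. This is a hedged Python 3 sketch rather than part of the original answer: `collapse_with_regex` and `REPEAT_RE` are hypothetical names, the pattern is loosened to accept any non-empty line (an assumption, since the real file holds longer strings), and it assumes every line, including the last, ends with a newline.

```python
import re

# One to three whole lines (GRP), followed by one or more exact copies.
# The quantifier is greedy, so 3-line patterns are tried before shorter ones.
REPEAT_RE = re.compile(r"(?P<GRP>(?:^.+\n){1,3})(?P=GRP)+", re.MULTILINE)


def collapse_with_regex(text):
    """Replace each repeated run with a header plus one copy of the block."""
    def describe(match):
        block = match.group("GRP")
        # The whole match is the block repeated back to back,
        # so the length ratio is the repeat count.
        count = len(match.group(0)) // len(block)
        header = "Repeat %s %d times\n" % (" ".join(block.splitlines()), count)
        return header + block

    return REPEAT_RE.sub(describe, text)


if __name__ == "__main__":
    print(collapse_with_regex("A\nB\nD\nB\nC\nD\nE\nC\nD\nE\n"), end="")
```

Because the regex engine backtracks, this may be slow on a very large file, and it needs the whole text in memory. For streaming input, a line-by-line scan is probably the safer choice.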