Extracting a repeated pattern from a text file in Python

Asked: 2016-06-06 03:41:19

Tags: python html

I want to extract all of the text between a repeated pattern in a text file. My text file XYZ.txt looks like this:

Start

This is a great day

End

Start
This is another great day

End

Start
This is 3rd great day
End

I am looking for all of the text between each Start and End, so my output should be:

This is a great day
This is another great day
This is 3rd great day

I want to save each output as a separate HTML file. The code I am using is as follows:

import re
with open('XYZ.txt') as myfile:
    content = myfile.read()

text = re.search(r'Start\n.*?End', content, re.DOTALL).group()

print(text)

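But the code above only prints the first occurrence. I don't know how to print all of the values between the patterns and save them as separate HTML files. I would really appreciate any pointers.

Thanks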

4 answers:

Answer 0 (score: 0)

You need to use re.findall to find every match of the regular expression.

>>> lines
'Start\n\nThis is a great day\n\nEnd\n\nStart\nThis is another great day\n\nEnd\n\nStart\nThis is 3rd great day\nEnd\n'
>>>
>>> re.findall('This is.*day', lines)
['This is a great day', 'This is another great day', 'This is 3rd great day']
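The pattern above is tied to the wording of the sample sentences. As a minimal sketch, assuming Start and End each sit on a line of their own, a capture group can pull out whatever lies between the markers instead. The names content, blocks and texts are illustrative only and are not part of the answer.

```python
import re

# Sample text mirroring XYZ.txt from the question.
content = (
    "Start\n\nThis is a great day\n\nEnd\n\n"
    "Start\nThis is another great day\n\nEnd\n\n"
    "Start\nThis is 3rd great day\nEnd\n"
)

# re.MULTILINE makes ^ and $ match at line boundaries, so only lines that
# consist of exactly "Start" or "End" count as markers.
# re.DOTALL lets "." cross newlines; "*?" stops at the nearest End line.
pattern = re.compile(r"^Start\n(.*?)^End$", re.DOTALL | re.MULTILINE)
blocks = pattern.findall(content)

# Remove the blank lines that surround each captured block.
texts = [block.strip() for block in blocks]
print(texts)
```

With real data, content would come from reading XYZ.txt as in the question.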

Answer 1 (score: 0)

You can use string manipulation and a generator instead of re.

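def format_file(file, start, end):
    # Read the whole file and close it afterwards.
    with open(file, 'r') as f:
        content = f.read()
    # Drop the start markers and newlines, then split on the end marker.
    return tuple(x for x in ''.join(content.split(start)).replace('\n', '').split(end) if x != '')

print(format_file('XYZ.txt', 'Start', 'End'))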

Or purely with a generator:

def format_file(file, start, end):
    with open(file, 'r') as f:
        lines = f.readlines()
    # Keep non-blank lines that do not begin with the start or end marker.
    return tuple(x.rstrip() for x in lines if x != '\n' and not x.startswith(start) and not x.startswith(end))
print(format_file('XYZ.txt', 'Start', 'End'))
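Both versions above treat the file as a flat sequence of lines, so a block spanning several lines would be merged or split. As a rough sketch, assuming the markers always appear on lines of their own, a generator can track whether it is currently inside a block. The helper name iter_blocks and the sample data are hypothetical and not taken from the answer.

```python
def iter_blocks(lines, start="Start", end="End"):
    """Yield the text found between each start line and the next end line."""
    inside = False
    collected = []
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            # A new block begins; forget anything collected earlier.
            inside, collected = True, []
        elif stripped == end and inside:
            yield "\n".join(collected)
            inside = False
        elif inside and stripped:
            # Keep non-blank lines that belong to the current block.
            collected.append(stripped)


# Sample lines mirroring XYZ.txt from the question.
sample_lines = (
    "Start\n\nThis is a great day\n\nEnd\n\n"
    "Start\nThis is another great day\n\nEnd\n\n"
    "Start\nThis is 3rd great day\nEnd\n"
).splitlines()

print(list(iter_blocks(sample_lines)))
```

An open file object can be passed in place of sample_lines, since the function only iterates over lines.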

Answer 2 (score: 0)

I would use the readlines() function and do the following:

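with open('jokes.txt') as myfile:
    for line in myfile.readlines():
        # Skip the marker lines and blank lines.
        if line.strip() != 'Start' and line.strip() != 'End' and line.strip():
            # rstrip('\n') is safe even if the last line has no newline.
            print(line.rstrip('\n'))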

This gives the output:

This is a great day
This is another great day
This is 3rd great day

It also generalizes to any kind of string between 'Start' and 'End'.

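Answer 3 (score: 0)

If your text file is literally as shown in the post, then you probably don't need regex; you can use a list comprehension.

You just store all of the lines in a list and filter out the ones you don't want.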

lst = []
with open('XYZ.txt', 'r') as myfile:
    for line in myfile:
        line = line.strip()
        lst.append(line)
# Drop blank lines as well as the marker lines.
lst2 = [i for i in lst if i and i != 'Start' and i != 'End']
print(lst2)

Output:

['This is a great day', 'This is another great day', 'This is 3rd great day']
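The question also asks for each result to be saved as a separate HTML file. One possible sketch follows, assuming the extracted strings are already in a list. The helper save_as_html, the output_N.html naming scheme, the minimal page markup and the temporary directory are all assumptions made for illustration.

```python
import html
import tempfile
from pathlib import Path


def save_as_html(texts, out_dir):
    """Write each string to its own HTML file and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(texts, start=1):
        # html.escape guards against characters such as < and & in the text.
        page = "<html><body><p>{}</p></body></html>\n".format(html.escape(text))
        path = out_dir / "output_{}.html".format(number)
        path.write_text(page, encoding="utf-8")
        paths.append(path)
    return paths


texts = ['This is a great day', 'This is another great day', 'This is 3rd great day']

# A temporary directory keeps the sketch from touching the working directory.
paths = save_as_html(texts, tempfile.mkdtemp())
print([p.name for p in paths])
```

In practice out_dir would be whichever folder the HTML files should end up in.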