I want to extract all the text between repeated patterns in a text file. My text file XYZ.txt looks like this:
Start
This is a great day
End
Start
This is another great day
End
Start
This is 3rd great day
End
I am looking for all the text between each Start and End, so my output should be:
This is a great day
This is another great day
This is 3rd great day
I want to save each piece of output as a separate HTML file. The code I am using is:
import re
with open('XYZ.txt') as myfile:
    content = myfile.read()
# re.search returns only the first match
text = re.search(r'Start\n.*?End', content, re.DOTALL).group()
print(text)
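But the code above only prints the first match. I don't know how to print all the values between the patterns and save them as separate HTML files. I would really appreciate any pointers.
Thanks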
Answer 0 (score: 0)
You need to use re.findall to find all matches of the regular expression.
>>> lines
'Start\n\nThis is a great day\n\nEnd\n\nStart\nThis is another great day\n\nEnd\n\nStart\nThis is 3rd great day\nEnd\n'
>>>
>>> re.findall('This is.*day', lines)
['This is a great day', 'This is another great day', 'This is 3rd great day']
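The pattern above depends on the wording of the sample lines. A minimal sketch that keys on the markers instead, assuming Python 3 and that each block sits between a 'Start' line and the next 'End' (the function name extract_blocks and its parameters are hypothetical, not from the question):

```python
import re

def extract_blocks(content, start="Start", end="End"):
    # Capture everything between a start marker and the nearest end marker.
    # re.DOTALL lets '.' match newlines so multi-line blocks are captured.
    pattern = re.compile(
        re.escape(start) + r"\n(.*?)" + re.escape(end),
        re.DOTALL,
    )
    # findall returns only the captured group; strip surrounding blank lines.
    return [block.strip() for block in pattern.findall(content)]

lines = ('Start\n\nThis is a great day\n\nEnd\n\nStart\nThis is another great day'
         '\n\nEnd\n\nStart\nThis is 3rd great day\nEnd\n')
print(extract_blocks(lines))
```

Note that this simple pattern would stop early if the word 'End' appeared inside a block, so it only suits input shaped like the sample.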
Answer 1 (score: 0)
You can use string manipulation and a generator instead of re.
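def format_file(file, start, end):
    # Read the whole file, drop the start markers and newlines, then split on the end marker.
    with open(file, 'r') as fh:
        f = fh.read()
    return tuple(x for x in ''.join(f.split(start)).replace('\n', '').split(end) if x != '')

print(format_file('XYZ.txt', 'Start', 'End'))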
Or a pure generator:
def format_file(file, start, end):
    with open(file, 'r') as fh:
        f = fh.readlines()
    # Keep non-blank lines that do not begin with either marker.
    return tuple(x.rstrip() for x in f if x != '\n' and not x.startswith(start) and not x.startswith(end))

print(format_file('XYZ.txt', 'Start', 'End'))
Answer 2 (score: 0)
I would use the readlines() function and do the following:
with open('jokes.txt') as myfile:
    for line in myfile.readlines():
        # Skip the marker lines and blank lines; print everything else.
        if line.strip() != 'Start' and line.strip() != 'End' and line.strip():
            print(line.rstrip('\n'))
This gives the output:
This is a great day
This is another great day
This is 3rd great day
In addition, it generalizes to other 'Start' and 'End' markers.
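One way to make the markers configurable is sketched below. It also groups multi-line blocks together rather than printing line by line. The name blocks_between and its parameters are hypothetical, and it assumes each marker sits alone on its own line:

```python
def blocks_between(lines, start="Start", end="End"):
    blocks = []      # finished blocks, one string per Start/End pair
    current = []     # lines collected for the block being read
    inside = False   # True while we are between a start and an end marker
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            inside, current = True, []
        elif stripped == end and inside:
            blocks.append("\n".join(current))
            inside = False
        elif inside and stripped:
            # Ordinary non-blank line inside a block
            current.append(stripped)
    return blocks

sample = "Start\nThis is a great day\nEnd\nStart\nfirst line\nsecond line\nEnd\n"
print(blocks_between(sample.splitlines()))
```

With a file, the same function can be called as blocks_between(myfile) inside the with block, since a file object iterates over its lines.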
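Answer 3 (score: 0)
If your text really looks like what is shown in the post, you probably do not need regex; you can use a list comprehension.
You just need to store all the lines you want to extract in a list.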
lst = []
with open('XYZ.txt', 'r') as myfile:
    for line in myfile:
        line = line.strip()
        lst.append(line)
# Keep every line that is not a marker.
lst2 = [i for i in lst if i != 'Start' and i != 'End']
print(lst2)
Output:
['This is a great day', 'This is another great day', 'This is 3rd great day']
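None of the answers covers the second half of the question, saving each result as its own HTML file. A minimal sketch, assuming Python 3 and a list of extracted strings like the one above (the function name save_blocks_as_html, the block_N.html file names and the bare-bones page markup are all assumptions made for illustration):

```python
import html
import os

def save_blocks_as_html(blocks, out_dir=".", prefix="block_"):
    paths = []
    for i, text in enumerate(blocks, start=1):
        # One file per block: block_1.html, block_2.html, ...
        path = os.path.join(out_dir, "{}{}.html".format(prefix, i))
        # Escape &, < and > so the text cannot break the markup.
        page = "<html><body><p>{}</p></body></html>".format(html.escape(text))
        with open(path, "w", encoding="utf-8") as out:
            out.write(page)
        paths.append(path)
    return paths
```

Calling save_blocks_as_html(lst2) would then write three files into the current directory and return their paths.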