I have a file from which I need to extract a piece of data that is delimited by fixed patterns, which may span multiple lines:
some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data
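The patterns are fixed in the sense that their content is always the same, except that newlines may appear between the words.
The solution would be simple if I could be sure that the patterns were formatted predictably, but I cannot.
Is there a way to match such "patterns" against a stream?
There is a question that is almost a duplicate, and its answer points to buffering the input. My case differs in that I know the exact strings in the pattern, except that the words may also be separated by newlines (so no \w*-style matching is needed).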
Answer 0 (score: 1)
Is this what you are looking for?
>>> import re
>>> data = """
... some data ... [my opening pattern
... is here
... and can be multiline] the data
... I want to extract [my ending
... pattern which can be
... multiline as well] ... more data
... """
>>> re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', data)
['the data \nI want to extract']
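Note that the pattern above treats any bracketed text as a delimiter. Since the question says the delimiter words are known exactly and only the whitespace between them varies, a stricter pattern can be built from the literal strings. The following is a minimal sketch under that assumption; `flexible` and `extract` are hypothetical helper names made up for illustration, not part of any library.

```python
import re

def flexible(literal):
    """Hypothetical helper: turn a fixed delimiter string into a regex
    fragment in which any run of whitespace (spaces, tabs, newlines)
    may separate the words."""
    return r'\s+'.join(re.escape(word) for word in literal.split())

def extract(text, opening, ending):
    """Hypothetical helper: return every piece of text found between
    the opening and ending delimiters, stripped of outer whitespace."""
    pattern = re.compile(
        flexible(opening) + r'(.*?)' + flexible(ending),
        re.DOTALL,  # let '.' cross line breaks inside the wanted data
    )
    return [found.strip() for found in pattern.findall(text)]

data = """some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data"""

result = extract(
    data,
    '[my opening pattern is here and can be multiline]',
    '[my ending pattern which can be multiline as well]',
)
print(result)
```

Splitting the literal on whitespace and rejoining the escaped words with `\s+` means the delimiter matches no matter where the line breaks fall, while everything else in it stays an exact match.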
Update: to read a large file in chunks, I suggest the following approach:
## The following was modified based on ChrisA's code in
## http://www.gossamer-threads.com/lists/python/python/1242366.
## Titled " How to read from a file to an arbitrary delimiter efficiently?"
import re

class ChunkIter:
    def __init__(self, f, delim):
        """ f: file object
        delim: regex pattern"""
        self.f = f
        self.delim = re.compile(delim)
        self.buffer = ''
        self.part = ''  # the string to return

    def read_to_delim(self):
        """Return characters up to the last delim, or None if at EOF"""
        while True:  # loop until a delimiter is found
            b = self.f.read(256)
            if not b:  # if EOF
                self.part = None
                break
            # Continue reading to buffer
            self.buffer += b
            # Try regex split the buffer string
            parts = self.delim.split(self.buffer)
            # If pattern is found
            if parts[:-1]:
                # Retrieve the string up to the last delim
                # (the capturing group keeps the matched text in parts)
                self.part = ''.join(parts[:-1])
                # Reset buffer string
                self.buffer = parts[-1]
                break
        return self.part

if __name__ == '__main__':
    with open('input.txt', 'r') as f:
        chunk = ChunkIter(f, r'(\[[^]]*\]\s+(?:[^[]+)\s+\[[^]]+\])')
        while chunk.read_to_delim():
            print(re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', chunk.part))
    print('job done.')
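
For comparison, the same chunked reading can be sketched as a Python 3 generator that yields each extracted item as soon as its closing delimiter has been read. This is only a sketch under assumptions: `iter_extracted` and its `chunk_size` parameter are hypothetical names, and the in-memory `io.StringIO` stands in for a real file.

```python
import io
import re

# Same shape as the pattern above: delimiter, wanted data, delimiter.
PATTERN = re.compile(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]')

def iter_extracted(f, pattern=PATTERN, chunk_size=256):
    """Hypothetical helper: yield group 1 of every match while reading
    the file object f in fixed-size chunks."""
    buffer = ''
    while True:
        block = f.read(chunk_size)
        if not block:  # EOF: the leftover buffer holds no complete match
            break
        buffer += block
        end = 0
        for match in pattern.finditer(buffer):
            yield match.group(1)
            end = match.end()
        buffer = buffer[end:]  # keep only the unmatched tail

# Stand-in for a real file; a tiny chunk size forces many partial reads.
sample = io.StringIO(
    "some data ... [my opening pattern\nis here\nand can be multiline] the data\n"
    "I want to extract [my ending\npattern which can be\nmultiline as well] ... more data\n"
)
items = list(iter_extracted(sample, chunk_size=16))
print(items)
```

Because the pattern only matches once the final `]` of the ending delimiter is in the buffer, a partially read delimiter never produces a premature match; it simply stays in the buffer until the next read completes it.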