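I have a large text file whose values are separated by headers starting with "#". If a condition matches a header, I want to read the file up to the next "#" header and skip the rest of the file.
To test this, I am trying to read the following text file, named test234.txt: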
# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr
The code I wrote is:
file_t = open('test234.txt')
cond = True
while cond:
    for line_ in file_t:
        print(line_)
        if file_t.read(1) == "#":
            cond = False
file_t.close()
However, the output I get is:
# abcdefgh
fnrnf
rkfr
foiernfr
erfnr
something
jndjen kj
jkndjke
vcrvr
Instead, I want the output to be the lines between the two "#" headers, that is:
1fnrnf
mrkfr
nfoiernfr
nerfnr
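How can I do this? Thanks!
Edit: "Reading in file block by block using specified delimiter in python" talks about reading a file in groups separated by headers, but I don't want to read all the headers. I only want to read the block under a header that satisfies a given condition, and to stop reading the file as soon as the next header marked with '#' is reached.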
Answer 0 (score: 3)
itertools.groupby can help:
from io import StringIO
from itertools import groupby
text = '''# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr'''
with StringIO(text) as file:
    lines = (line.strip() for line in file)  # removing trailing '\n'
    # startswith() also copes with empty lines, where x[0] would raise IndexError
    for key, group in groupby(lines, key=lambda x: x.startswith('#')):
        if key is True:
            # found a line that starts with '#'
            print('found header: {}'.format(next(group)))
        if key is False:
            # group now contains all lines that do not start with '#'
            print('\n'.join(group))
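Note that all of this is lazy: at any time, at most the lines between two headers are held in memory.
To read from the actual file, replace the line
with StringIO(text) as file:
by
with open('test234.txt', 'r') as file:
    ...
The output for the test is: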
found header: # abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
found header: # something
njndjen kj
ejkndjke
found header: #vcrvr
Update, since I misunderstood the question. Here is a new attempt:
from io import StringIO
from collections import deque
from itertools import takewhile
from_line = '# abcdefgh'
to_line = '# something'
# `text` is the sample string defined in the first snippet
with StringIO(text) as file:
    lines = (line.strip() for line in file)  # removing trailing '\n'
    # fast-forward up to (and including) from_line
    deque(takewhile(lambda x: x != from_line, lines), maxlen=0)
    # print everything until to_line is reached
    for line in takewhile(lambda x: x != to_line, lines):
        print(line)
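I use itertools.takewhile to get an iterator that runs until a stop condition is met (in your case, until the next header is found).
The deque part is just the consume pattern suggested in the itertools recipes. It simply fast-forwards to the point where the given condition no longer holds.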
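The snippet above needs the text of the following header (to_line) in advance. A minimal sketch of a variant that stops at whichever header comes next, assuming a header is any line that starts with '#'. The function name read_block and its parameters are hypothetical and not part of the original answer:

```python
from io import StringIO
from itertools import dropwhile, takewhile


def read_block(file, header):
    """Return an iterator over the lines between `header` and the next '#' line.

    `read_block`, `file` and `header` are hypothetical names for illustration.
    """
    lines = (line.rstrip('\n') for line in file)
    # skip every line that comes before the wanted header
    after_header = dropwhile(lambda x: x != header, lines)
    next(after_header, None)  # drop the header line itself (if it was found)
    # stop at the next header, whatever its text is
    return takewhile(lambda x: not x.startswith('#'), after_header)


text = '''# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr'''

with StringIO(text) as file:
    # the iterator is lazy, so consume it while the file is still open
    block = list(read_block(file, '# abcdefgh'))

print('\n'.join(block))
```

Because both dropwhile and takewhile are lazy, nothing after the next header is read once the block has been consumed. If the header is never found, the result is simply empty.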
Answer 1 (score: 1)
Learn and use regular expressions. They will help you with all kinds of text-processing tasks.
import re  # regex library
with open('test234.txt') as f:  # file stream
    lines = f.readlines()  # reads all lines
p = re.compile('^#.*')  # regex pattern creation
for l in lines:
    if p.match(l) is None:  # looks for non-matching lines
        # strip only the line ending; l[:-2] would also cut off the last character
        print(l.rstrip('\r\n'))
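The loop above prints every non-header line in the file. To collect only the block under one header and stop at the next one, a sketch along the same lines might look like this. It assumes a header is any line starting with '#'; block_after, wanted and sample are hypothetical names, and the function accepts any iterable of lines, so an open file should work as well as the list used here:

```python
import re

header = re.compile(r'^#')  # a header is any line starting with '#'


def block_after(lines, wanted):
    """Collect the lines below the header `wanted`, up to the next header."""
    block = []
    inside = False  # True while we are below the wanted header
    for line in lines:
        line = line.rstrip('\r\n')
        if header.match(line):
            if inside:
                break  # next header reached: stop reading
            inside = (line == wanted)
        elif inside:
            block.append(line)
    return block


sample = ['# abcdefgh\n', '1fnrnf\n', 'mrkfr\n', '# something\n', 'njndjen kj\n']
print('\n'.join(block_after(sample, '# abcdefgh')))
```

Passing the file object itself instead of the result of readlines() would avoid loading the whole file, since the loop breaks as soon as the next header shows up.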