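I'd like to know how to extract data from a specific range in a large data file. Is there a way to read the content that begins and ends with particular "keywords"?
I want to read every line between *NODE and **: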
*NODE
13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517
13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065
13021147, 2634.226986419154, 54.98263035830583, 205.9520084547658
13021148, 2634.224808775879, 55.181857466932044, 205.96943880353476
**
There are a thousand lines before *NODE and after **...
I know it should be something like:
a = []
with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # NOW THERE SHOULD FOLLOW SOMETHING LIKE:
            # Go to next line and "a.append" till there comes the "magical"
            # "**"
Any ideas? I'm completely new to Python. Thanks for your help! I hope you understand what I mean.
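Answer 0 (score: 1)
You've almost got it. The only thing missing is that, once you've found the start, you keep looking for the end of the sequence and, until it appears, append every line you iterate over to a list. That is: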
data = None  # a placeholder to store your lines
with open("file.txt", "r") as f:  # do not shadow the built-in `file`
    for line in f:  # iterate over the lines
        if data is None:  # we haven't found `*NODE` yet
            if line[:5] == "*NODE":  # search for `*NODE` at the line beginning
                data = []  # make `data` an empty list to begin collecting
        elif line[:2] == "**":  # data initialized, we look for the sequence's end
            break  # no need to iterate over the file anymore
        else:  # data initialized but not at the end...
            data.append(line)  # append the line to our data
Now data will contain a list of the lines between *NODE and ** (each still ending with its newline), or None if the sequence was not found.
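If the same logic is needed in more than one place, it can be wrapped in a function. The following is only a minimal sketch: the name `read_block` and its `start`/`end` parameters are made up for illustration, and it assumes both markers sit at the very beginning of a line, as in the sample from the question. It also strips the trailing newline, which the loop above does not.

```python
def read_block(lines, start="*NODE", end="**"):
    """Return the lines between the first `start` line and the next `end` line.

    `lines` can be any iterable of strings, e.g. an open file.
    Returns None if no line begins with `start`.
    (Hypothetical helper, not part of any library.)
    """
    data = None
    for line in lines:
        if data is None:
            # still searching for the start marker
            if line.startswith(start):
                data = []
        elif line.startswith(end):
            # end marker reached, stop reading
            break
        else:
            data.append(line.rstrip("\n"))
    return data


# In-memory stand-in for the real file
sample = ["header\n", "*NODE\n", "1, 2.0, 3.0, 4.0\n", "**\n", "footer\n"]
rows = read_block(sample)
```

With a real file this would be called as `with open("file.txt") as f: rows = read_block(f)`.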
Answer 1 (score: 1)
Try this:
with open('file.txt') as file:
    a = []
    running = False  # avoid NameError when 'if' statement below isn't reached
    for line in file:
        if line.startswith('*NODE'):
            running = True  # show that we are starting to add values
            continue  # make sure we don't add '*NODE'
        if line.startswith('**'):
            running = False  # show that we're done adding values
            continue  # make sure we don't add '**'
        if running:  # only add the values if 'running' is True
            a.extend([i.strip() for i in line.split(',')])
The output is a list containing the following values as strings (printed with print('\n'.join(a))):
13021145
2637.6073002472617
55.011929824413045
206.0394346892517
13021146
2637.6051226039867
55.21115693303926
206.05686503802065
13021147
2634.226986419154
54.98263035830583
205.9520084547658
13021148
2634.224808775879
55.181857466932044
205.96943880353476
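Since every value above is still a string, a conversion step may be useful. The sketch below keeps one record per line instead of a flat list; `parse_node` is a hypothetical helper, and it assumes each line holds an integer ID followed by floating-point values, which is what the sample suggests but the question does not state.

```python
def parse_node(line):
    """Split 'id, v1, v2, ...' into (int, [float, ...]).

    Assumes the first field is an integer and the rest are floats.
    """
    node_id, *values = (field.strip() for field in line.split(","))
    return int(node_id), [float(v) for v in values]


# Two lines copied from the question's sample
sample = [
    "13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n",
    "13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n",
]
nodes = [parse_node(line) for line in sample]
```

A line that does not match this layout raises ValueError, which is probably what you want for a data file that should be uniform.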
Answer 2 (score: 1)
We can iterate over the lines until there are none left or we reach the end of the block:
a = []
with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # collect block-related lines
            while True:
                try:
                    line = next(file)
                except StopIteration:
                    # there are no lines left
                    break
                if line.startswith('**'):
                    # we've reached the end of block
                    break
                a.append(line)
            # stop iterating over file
            break
which gives us
print(a)
['13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n',
'13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n',
'13021147, 2634.226986419154, 54.98263035830583, 205.9520084547658\n',
'13021148, 2634.224808775879, 55.181857466932044, 205.96943880353476\n']
Alternatively, we can write helper predicates like
def not_a_block_start(line):
    return not line.startswith('*NODE')

def not_a_block_end(line):
    return not line.startswith('**')
and then use the itertools module:
from itertools import (dropwhile,
                       takewhile)

with open('file.txt') as file:
    block_start = dropwhile(not_a_block_start, file)
    # skip block start line (the default avoids StopIteration if there is none)
    next(block_start, None)
    a = list(takewhile(not_a_block_end, block_start))
This gives us the same value of a.
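To try the dropwhile/takewhile approach without a real file, it can be run against an in-memory stand-in. This is only a sketch: `extract_block` is a name chosen for illustration, `io.StringIO` replaces `file.txt`, and the header and trailing lines are invented filler.

```python
import io
from itertools import dropwhile, takewhile


def extract_block(lines):
    """Return the lines between the first '*NODE' line and the next '**' line.

    Returns an empty list if no '*NODE' line exists.
    """
    block_start = dropwhile(lambda l: not l.startswith('*NODE'), lines)
    next(block_start, None)  # skip the '*NODE' line itself, if there is one
    return list(takewhile(lambda l: not l.startswith('**'), block_start))


# Made-up surrounding lines; the two data lines come from the question
sample = io.StringIO(
    "some header line\n"
    "*NODE\n"
    "13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n"
    "13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n"
    "**\n"
    "some trailing line\n"
)
block = extract_block(sample)
```

Both iterators are lazy, so nothing after the `**` line is read, which matters when the file is large.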