Question

How can I extract data from a specific range in a large data file? Is there a way to read the content that begins and ends with particular "buzzwords"?

I want to read every line between *NODE and **

*NODE
13021145,       2637.6073002472617,       55.011929824413045,        206.0394346892517
13021146,       2637.6051226039867,        55.21115693303926,       206.05686503802065
13021147,        2634.226986419154,        54.98263035830583,        205.9520084547658
13021148,        2634.224808775879,       55.181857466932044,       205.96943880353476
**

There are a thousand lines before *NODE and after **...

I know it should be something like:

a = []

with open('file.txt') as file:
   for line in file:
      if line.startswith('*NODE'):

      # NOW THERE SHOULD FOLLOW SOMETHING LIKE:
      #   Go to next line and "a.append" till there comes the "magical"
      #   "**"

Any ideas? I'm completely new to Python. Thanks for the help! I hope you know what I mean.

Answer 1

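You almost had it. The only thing missing is that, once you have found the start, you keep looking for the end of the sequence and, until it appears, append each line you iterate over to your list. I.e.: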

data = None  # a placeholder to store your lines
with open("file.txt", "r") as f:  # do not shadow the built-in `file`
    for line in f:  # iterate over the lines
        if data is None:  # we haven't found `*NODE` yet
            if line[:5] == "*NODE":  # search for `*NODE` at the line beginning
                data = []  # make `data` an empty list to begin collecting
        elif line[:2] == "**":  # data initialized, we look for the sequence's end
            break  # no need to iterate over the file anymore
        else:  # data initialized but not at the end...
            data.append(line)  # append the line to our data

Now data will contain a list of the lines between *NODE and ** (each still ending with its newline), or None if the sequence was not found.
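If you need this in more than one place, the same loop can be wrapped in a function. This is only a sketch: the name extract_block and its start/end parameters are my own, not part of the answer above, and it strips the trailing newline from each line. It accepts any iterable of lines, so an open file works as well as the io.StringIO sample used here.

```python
import io


def extract_block(lines, start="*NODE", end="**"):
    """Return the lines between the first `start` line and the next `end` line.

    Returns None when no line begins with `start` (hypothetical helper).
    """
    data = None  # stays None until the start marker is seen
    for line in lines:
        if data is None:
            if line.startswith(start):
                data = []  # start collecting from the next line on
        elif line.startswith(end):
            break  # end marker reached, stop reading
        else:
            data.append(line.rstrip("\n"))
    return data


# Small in-memory sample standing in for the real file.
sample = io.StringIO(
    "** header comment\n"
    "*NODE\n"
    "1, 0.5, 1.5, 2.5\n"
    "2, 3.5, 4.5, 5.5\n"
    "**\n"
    "trailing line\n"
)
block = extract_block(sample)
print(block)
```

Note that a line starting with ** before *NODE is ignored, because the end marker is only checked once collecting has begun.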

Answer 2

Try this:

with open('file.txt') as file:
    a = []
    running = False  # avoid NameError when 'if' statement below isn't reached
    for line in file:
        if line.startswith('*NODE'):
            running = True  # show that we are starting to add values
            continue  # make sure we don't add '*NODE'
        if line.startswith('**'):
            running = False  # show that we're done adding values
            continue  # make sure we don't add '**'
        if running:  # only add the values if 'running' is True
            a.extend([i.strip() for i in line.split(',')])

The output is a list containing the following values as strings (I used print('\n'.join(a))):

13021145
2637.6073002472617
55.011929824413045
206.0394346892517
13021146
2637.6051226039867
55.21115693303926
206.05686503802065
13021147
2634.226986419154
54.98263035830583
205.9520084547658
13021148
2634.224808775879
55.181857466932044
205.96943880353476
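If you would rather keep the values of each row together and convert them to numbers, something like the following might do. It assumes (the question does not say so) that the first field is an integer node ID and the remaining fields are floating-point coordinates; parse_node_line is a made-up helper name.

```python
def parse_node_line(line):
    """Split one 'id, x, y, z' line into (int id, tuple of floats)."""
    # Assumption: first field is an integer ID, the rest are coordinates.
    node_id, *coords = (field.strip() for field in line.split(","))
    return int(node_id), tuple(float(c) for c in coords)


# Two of the lines from the question, as they would be read from the file.
lines = [
    "13021145,       2637.6073002472617,       55.011929824413045,        206.0394346892517\n",
    "13021146,       2637.6051226039867,        55.21115693303926,       206.05686503802065\n",
]

# Map each node ID to its coordinate tuple.
nodes = dict(parse_node_line(line) for line in lines)
print(nodes[13021145])
```

A dict keyed by ID is just one option; a list of tuples would work equally well if you only need the rows in order.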

Answer 3

We can iterate over the lines until none are left or we reach the end of the block

a = []

with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # collect block-related lines
            while True:
                try:
                    line = next(file)
                except StopIteration:
                    # there are no lines left
                    break
                if line.startswith('**'):
                    # we've reached the end of block
                    break
                a.append(line)
            # stop iterating over file
            break

which gives us

print(a)
['13021145,       2637.6073002472617,       55.011929824413045,        206.0394346892517\n',
 '13021146,       2637.6051226039867,        55.21115693303926,       206.05686503802065\n',
 '13021147,        2634.226986419154,        54.98263035830583,        205.9520084547658\n',
 '13021148,        2634.224808775879,       55.181857466932044,       205.96943880353476\n']

Alternatively, we can write helper predicates like these

def not_a_block_start(line):
    return not line.startswith('*NODE')


def not_a_block_end(line):
    return not line.startswith('**')

and then use the brilliance of the itertools module

from itertools import dropwhile, takewhile

with open('file.txt') as file:
    block_start = dropwhile(not_a_block_start, file)
    # skip block start line
    next(block_start, None)  # the default avoids StopIteration if no '*NODE' line exists
    a = list(takewhile(not_a_block_end, block_start))

This gives us the same value for a.
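To see what each itertools step does, here is a small sketch on an in-memory list instead of a file. The sample lines are invented for illustration, and lambdas stand in for the two predicates.

```python
from itertools import dropwhile, takewhile

# Made-up sample lines standing in for the file contents.
lines = ["header\n", "*NODE\n", "1, 2.0\n", "2, 3.0\n", "**\n", "footer\n"]

# dropwhile discards items while the predicate is true, then yields
# everything that follows, starting with the first item that failed it.
rest = dropwhile(lambda line: not line.startswith("*NODE"), iter(lines))

# The first remaining item is the '*NODE' line itself (None if absent).
first = next(rest, None)

# takewhile yields items until the predicate first becomes false,
# so it stops at the '**' line and never reaches the footer.
block = list(takewhile(lambda line: not line.startswith("**"), rest))

print(repr(first))
print(block)
```

Both functions are lazy, so the file is read only as far as the ** line.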

Extracting part of a file between specific lines
