Question

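I have files in which a varying number of header lines appear in random order, followed by the data I need, which spans the number of lines given by the corresponding header (e.g. Lines: 3):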

from: blah@blah.com
Subject: foobarhah
Lines: 3
Extra: More random stuff

Foo Bar Lines of Data, which take up
some arbitrary long amount  characters on a single line, but no  matter how long 
they still only take up the number of lines as specified in the header

How can I get that data while reading the file only once? P.S. The data comes from the 20Newsgroups corpus.

EDIT: I guess this quick solution only works if I relax the read-only-once constraint:

[First read] find total_num_of_lines and match on the first Lines: header,
[Second read] discard the first (total_num_of_lines - header_num_of_lines) lines, then read the rest of the file
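
The two reads described above could be sketched roughly as follows. This assumes the body is the last thing in the file, with nothing after it; the function name `read_body_two_pass` and the sample text are made up for illustration.

```python
import io

def read_body_two_pass(fp):
    """Return the last N lines of fp, where N comes from the first 'Lines:' header."""
    # First read: count every line and remember the first "Lines:" value.
    total_num_of_lines = 0
    header_num_of_lines = None
    for line in fp:
        total_num_of_lines += 1
        if header_num_of_lines is None and line.lower().startswith("lines:"):
            header_num_of_lines = int(line.split(":", 1)[1])
    if header_num_of_lines is None:
        raise ValueError("no 'Lines:' header found")

    # Second read: rewind, skip everything before the body, keep the rest.
    fp.seek(0)
    to_skip = total_num_of_lines - header_num_of_lines
    return [line for i, line in enumerate(fp) if i >= to_skip]

# Hypothetical sample shaped like the message in the question.
sample = io.StringIO(
    "from: blah@blah.com\n"
    "Subject: foobarhah\n"
    "Lines: 3\n"
    "Extra: More random stuff\n"
    "\n"
    "first data line\n"
    "second data line\n"
    "third data line\n"
)
body = read_body_two_pass(sample)
```

Because it counts back from the end of the file, this breaks as soon as anything follows the body.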

I still can't see a way to read the data in a single pass.

Answer 1

I'm not quite sure you even need the beginning of the file to get its contents. Consider using split:

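# Text-mode reads already normalize line endings to "\n"; split only once,
# at the first blank line, so blank lines inside the body are kept.
_, contents = file_contents.split("\n\n", 1)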

However, if the Lines value really does matter, you can combine the technique above with parsing the file header:

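# Split once, at the first blank line (headers before it, body after it)
headers, contents = file_contents.split("\n\n", 1)

# Get lines length: split each header into [name, value]
headers_list = [line.split(None, 1) for line in headers.splitlines()]
lines_count = int([line[1] for line in headers_list if line[0].lower() == 'lines:'][0])

# Get contents: slice by lines, not by characters
real_contents = contents.splitlines()[:lines_count]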

Answer 2

Assuming the general case, where several messages may follow one another, maybe something like

from itertools import islice, takewhile

def msgreader(file):
    while True:
        # Header lines run up to (and consume) the first blank line.
        header = list(takewhile(lambda x: x.strip(), file))
        if not header:
            break
        header_dict = {k: v.strip() for k, v in (line.split(":", 1) for line in header)}
        line_count = int(header_dict['Lines'])
        # islice simply stops early if the file ends before line_count lines.
        message = list(islice(file, line_count))
        yield message

would work, and

with open("53903") as fp:
    for message in msgreader(fp):
        print(message)

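would give you all the listed messages. For this particular use case the above is a bit of overkill, but frankly, extracting all the header information is not much harder than extracting just the one line. I'd be surprised if there weren't already a module that parses these messages.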
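
On the module question: Python's standard-library email package parses this kind of "header block, blank line, body" layout. A minimal sketch, assuming each file holds a single message in that layout (as the sample in the question does); the sample text is made up for illustration.

```python
import email

# Hypothetical sample shaped like the message in the question.
raw = (
    "from: blah@blah.com\n"
    "Subject: foobarhah\n"
    "Lines: 3\n"
    "Extra: More random stuff\n"
    "\n"
    "first data line\n"
    "second data line\n"
    "third data line\n"
)

msg = email.message_from_string(raw)

# Header lookup is case-insensitive; the value comes back as a string.
line_count = int(msg["Lines"])

# Everything after the first blank line is the payload.
body_lines = msg.get_payload().splitlines()[:line_count]
```

For a file on disk, email.message_from_file(fp) does the same thing with an open file object. Whether it copes with every quirk in the corpus is something I haven't checked.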

Answer 3

You need to keep track of whether the headers have finished. That's it.
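
A minimal single-pass sketch of that idea, assuming the headers end at the first blank line and that a Lines: header is present. The names `read_body`, `in_headers` and `remaining`, and the sample text, are made up for illustration.

```python
import io

def read_body(fp):
    """Single pass: track whether the header block has ended, then take N lines."""
    in_headers = True   # the state flag: are we still inside the headers?
    remaining = None    # value of the "Lines:" header, once seen
    body = []
    for line in fp:
        if in_headers:
            if not line.strip():
                # A blank line marks the end of the headers.
                if remaining is None:
                    raise ValueError("no 'Lines:' header found")
                in_headers = False
            elif line.lower().startswith("lines:"):
                remaining = int(line.split(":", 1)[1])
        elif remaining > 0:
            body.append(line)
            remaining -= 1
        else:
            # The announced number of lines has been collected.
            break
    return body

# Hypothetical sample: three body lines followed by unrelated trailing text.
sample = io.StringIO(
    "from: blah@blah.com\n"
    "Subject: foobarhah\n"
    "Lines: 3\n"
    "Extra: More random stuff\n"
    "\n"
    "first data line\n"
    "second data line\n"
    "third data line\n"
    "trailing text that is not part of the body\n"
)
body = read_body(sample)
```

The file is read once, front to back, and the position of the Lines: header among the other headers doesn't matter.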

Extracting a subset of lines from a file
