Question

I have an input_file.fa file like this (FASTA format):

> header1 description
data data
data
>header2 description
more data
data
data

I would like to read the file one chunk at a time, so that each chunk contains one header and its corresponding data, e.g. block 1:

> header1 description
data data
data

Of course, I could read the file like this and split it:

with open("1.fa") as f:
    for block in f.read().split(">"):
        pass

But I want to avoid reading the whole file into memory, because the files are often large.

I can of course read the file line by line:

with open("input_file.fa") as f:
    for line in f:
        pass

But ideally what I want is something like this:

with open("input_file.fa", newline=">") as f:
    for block in f:
        pass

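But I get an error:

ValueError: illegal newline value: >

I also tried using the csv module, but with no success.

I did find this post from 3 years ago, which provides a generator-based solution to this problem, but it doesn't seem very compact. Is that really the only/best solution? It would be neat if the generator could be created with a single line rather than a separate function, something like this pseudocode: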

with open("input_file.fa") as f:
    blocks = magic_generator_split_by_>
    for block in blocks:
        pass

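If this is not possible, then I guess you could consider my question a duplicate of the other post, but if so, I hope people can explain to me why the other solution is the only one. Many thanks.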

Answer 1

The general solution here is to write a generator function that yields one group at a time. That way you only hold one group in memory at a time.

def get_groups(seq, group_by):
    data = []
    for line in seq:
        # Here the `startswith()` logic can be replaced with other
        # condition(s) depending on the requirement.
        if line.startswith(group_by):
            if data:
                yield data
                data = []
        data.append(line)

    if data:
        yield data

with open('input.txt') as f:
    for i, group in enumerate(get_groups(f, ">"), start=1):
        print ("Group #{}".format(i))
        print ("".join(group))

Output:

Group #1
> header1 description
data data
data

Group #2
>header2 description
more data
data
data

For FASTA format, I would recommend using the Biopython package.
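
If the delimiter does not have to sit at the start of a line, the same generator idea can be applied to fixed-size chunks instead of lines, which is roughly what newline=">" was meant to achieve. This is only a minimal sketch: read_delimited and chunk_size are names made up for illustration, the delimiter itself is dropped from each piece (as with str.split), and empty pieces are skipped. Note that it also splits on a ">" that appears in the middle of a header or data line.

```python
import io


def read_delimited(f, delimiter=">", chunk_size=4096):
    """Yield the pieces of f that are separated by delimiter.

    Only chunk_size characters plus one unfinished piece are held
    in memory at any time. (Hypothetical helper, not a library API.)
    """
    buffer = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(delimiter)
        # Every part except the last one is complete; the last one may
        # continue in the next chunk, so it goes back into the buffer.
        for part in parts[:-1]:
            if part:
                yield part
        buffer = parts[-1]
    # Whatever is left after the final chunk is the last piece.
    if buffer:
        yield buffer


# In-memory stand-in for the file; a tiny chunk_size exercises the buffering.
sample = io.StringIO("> header1 description\ndata data\ndata\n>header2 description\nmore data\n")
pieces = list(read_delimited(sample, chunk_size=5))
for piece in pieces:
    print(repr(piece))
```

Because the unfinished tail is re-joined with the next chunk before splitting again, a multi-character delimiter that straddles a chunk boundary should still be found.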

Answer 2

One approach I like is to use itertools.groupby together with a simple key function:

from itertools import groupby


def make_grouper():
    counter = 0
    def key(line):
        nonlocal counter
        if line.startswith('>'):
            counter += 1
        return counter
    return key

Use it as:

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        fasta_section = ''.join(group)   # or list(group)

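The join is only needed if you have to handle the contents of a whole section as a single string. If you only want to read the lines one by one, you can simply do: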

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        # parse >header description
        header, description = next(group)[1:].split(maxsplit=1)
        for line in group:
            # handle the contents of the section line by line
            pass
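
To make the groupby approach easy to try without a file on disk, here is a self-contained sketch that turns each section into a (header, sequence) pair. iter_records is a hypothetical helper name, and the sketch assumes the input starts with a header line. One thing to keep in mind: each group returned by groupby is a lazy iterator over the same underlying file, so it has to be consumed before moving on to the next group.

```python
import io
from itertools import groupby


def make_grouper():
    counter = 0

    def key(line):
        # Every header line bumps the counter, so all lines of one
        # section share the same key value.
        nonlocal counter
        if line.startswith(">"):
            counter += 1
        return counter

    return key


def iter_records(lines):
    """Yield (header, sequence) pairs, one section at a time."""
    for _, group in groupby(lines, key=make_grouper()):
        # Assumption: the first line of every group is a header line.
        header = next(group)[1:].strip()
        # Consume the rest of the group right away, before groupby advances.
        sequence = "".join(line.strip() for line in group)
        yield header, sequence


# In-memory stand-in for the FASTA file from the question.
sample = io.StringIO(
    "> header1 description\n"
    "data data\n"
    "data\n"
    ">header2 description\n"
    "more data\n"
    "data\n"
    "data\n"
)
records = list(iter_records(sample))
for header, sequence in records:
    print(header, "->", sequence)
```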

Answer 3

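def read_blocks(file):
    block = ''
    for line in file:
        # A new header line closes the block collected so far.
        if line.startswith('>') and block:
            yield block
            block = ''
        block += line
    # Emit the last block, unless the file was empty.
    if block:
        yield block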


with open('input_file.fa') as f:
    for block in read_blocks(f):
        print(block)

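This reads the file line by line and hands the blocks back with the yield statement. It is lazy, so you don't have to worry about a large memory footprint.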
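
The laziness can be checked with a small experiment. In the sketch below, numbered_lines is a made-up stand-in for a large file that counts how many lines have been pulled from it; read_blocks is repeated so the snippet runs on its own, here collecting lines in a list and joining them once per block, which is a common alternative to repeated string concatenation.

```python
def read_blocks(file):
    block = []
    for line in file:
        # A new header line closes the block collected so far.
        if line.startswith(">") and block:
            yield "".join(block)
            block = []
        block.append(line)
    if block:
        yield "".join(block)


def numbered_lines(counter, sections=1000):
    """Hypothetical stand-in for a big file: counts every line handed out."""
    for i in range(sections):
        for line in (">header{}\n".format(i), "data\n"):
            counter["read"] += 1
            yield line


counter = {"read": 0}
blocks = read_blocks(numbered_lines(counter))

first = next(blocks)
print(repr(first))       # '>header0\ndata\n'
print(counter["read"])   # 3: two lines of the block plus the next header
```

Only three of the 2000 available lines are read to produce the first block: the block's own two lines and the following header that marks its end.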

Read a file in chunks using a specified delimiter in Python
