Biopython script to split a large FASTA file into multiple files

Date: 2017-09-05 09:09:58

Tags: python biopython

I am working with a large FASTA file that I want to split into multiple files by gene ID. I am trying to use the following script from the Biopython tutorial:

def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)  # next(iterator) works on Python 3; iterator.next() is Python 2 only
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

from Bio import SeqIO

record_iter = SeqIO.parse('/path/sorted_sequences.fa', 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 93)):
    filename = 'gene_%i.fasta' % (i + 1)
    with open('/path/files/' + filename, 'w') as output_handle:
        count = SeqIO.write(batch, output_handle, 'fasta')
    print('Wrote %i records to %s' % (count, filename))

It does split the file into groups of 93 sequences, but it gives 2 files for each group of 93. I cannot see the error, but I guess there is one. Is there another way to split a large FASTA file? Thanks.

2 answers:

Answer 0 (score: 1)

After reading the code in the example, it seems the iterator does not split the file by gene ID; it simply splits the sequences into groups of batch_size, so in your case 93 sequences per file.
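
If you really want one file per gene ID rather than fixed-size batches, here is a minimal sketch (not from the original answer) using itertools.groupby. It assumes the input file is actually sorted by gene ID, as the filename suggests, and that the gene ID is the first '|'-separated field of record.id; adjust the gene_id key function to your actual header format:

from itertools import groupby
from Bio import SeqIO

# Hypothetical key function: takes the gene ID to be the first
# '|'-separated field of the record ID -- adapt to your headers.
def gene_id(record):
    return record.id.split('|')[0]

records = SeqIO.parse('/path/sorted_sequences.fa', 'fasta')
# groupby only merges *consecutive* records with the same key,
# so this relies on the input being sorted by gene ID.
for gid, group in groupby(records, key=gene_id):
    with open('/path/files/%s.fasta' % gid, 'w') as handle:
        count = SeqIO.write(group, handle, 'fasta')
    print('Wrote %i records for gene %s' % (count, gid))

Because groupby streams through the file one group at a time, this avoids loading the whole FASTA into memory.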

Answer 1 (score: 1)

In case anyone is interested in this script in the future: the script works exactly as is. The problem was that the file I was trying to split contained more sequences than expected. So I deleted the bad file and generated a new one, which the script above split just fine.
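
As a side note, a quick sanity check like the following sketch (using the same hypothetical path as above) can confirm how many sequences the input actually contains before splitting it:

from Bio import SeqIO

# Count the records in the input FASTA to verify it holds the
# expected number of sequences before splitting.
num_records = sum(1 for _ in SeqIO.parse('/path/sorted_sequences.fa', 'fasta'))
print('Found %i records' % num_records)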