I am working with a large fasta file that I want to split into several files based on gene ID. I am trying to use the following script from the Biopython tutorial:
def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator. Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                # Python 3: use next(iterator), not iterator.next()
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch
from Bio import SeqIO

record_iter = SeqIO.parse('/path/sorted_sequences.fa', 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 93)):
    filename = 'gene_%i.fasta' % (i + 1)
    with open('/path/files/' + filename, 'w') as output_handle:
        count = SeqIO.write(batch, output_handle, 'fasta')
    print('Wrote %i records to %s' % (count, filename))
It does split the file into groups of 93 sequences, but it produces 2 files per group of 93. I cannot see the error, but I guess there is one. Is there another way to split a large fasta file differently? Thanks.
Answer 0 (score: 1)
After reading the code in the example: the iterator does not separate the file by gene id, it simply splits the sequences into groups of batch_size, so in your case 93 sequences per file.
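If the goal really is one output file per gene ID rather than fixed-size batches, a minimal sketch along the following lines might work. It assumes the input is already sorted by gene ID (as the filename sorted_sequences.fa suggests) and that the gene ID is the part of the record id before the first '.'; get_gene_id is a hypothetical helper you would adapt to your actual fasta headers:

import itertools
from Bio import SeqIO

# Hypothetical helper: extract the gene ID from a SeqRecord.
# Assumption: the gene ID is everything before the first '.' in record.id;
# change this to match how your headers are formatted.
def get_gene_id(record):
    return record.id.split('.')[0]

record_iter = SeqIO.parse('/path/sorted_sequences.fa', 'fasta')
# groupby only groups *consecutive* records, which is why the input
# must be sorted by gene ID for this to give one file per gene.
for gene_id, records in itertools.groupby(record_iter, key=get_gene_id):
    filename = '/path/files/%s.fasta' % gene_id
    with open(filename, 'w') as output_handle:
        count = SeqIO.write(records, output_handle, 'fasta')
    print('Wrote %i records to %s' % (count, filename))

If the file were not sorted, itertools.groupby would create multiple groups for the same gene; in that case you would need to keep a dictionary of open handles (one per gene ID) instead.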
Answer 1 (score: 1)
In case anyone is interested in this script in the future: the script works exactly as is. The problem was that the file I was trying to divide contained more sequences than I expected. So I deleted the bad file and generated a new one, which the script above split just fine.
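If you run into the same mismatch, a quick sanity check is to count the records in the input before splitting; this is just the standard SeqIO.parse loop (the path is the one from the question):

from Bio import SeqIO

# Count the records in the input file so an unexpectedly large
# (or small) file is caught before splitting.
count = sum(1 for _ in SeqIO.parse('/path/sorted_sequences.fa', 'fasta'))
print('%i records in input' % count)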