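I'm a biologist with zero computing skills, but I need a simple script to split a large set of DNA sequences into smaller .fasta files for BLAST searches. I've been browsing this site for days and can't find an answer. I copied my code almost verbatim from the Biopython cookbook. Why doesn't it work?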
def batch_iterator(iterator, batch_size):
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = iterator.__next__
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

from Bio import SeqIO
record_iter = SeqIO.parse(open("/Users/nermin/mainfolder/MTB_NITR203.fasta"),"fasta")
for i, batch in enumerate(batch_iterator(record_iter, 1000)):
    filename = "group_%i.fasta" % (i + 1)
    with open(filename, "w") as handle:
        count = SeqIO.write(batch, handle, "fasta")
    print("Wrote %i records to %s" % (count, filename))
Answer 0 (score: 0)
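The error message is caused by an earlier step in your code, specifically:
entry = iterator.__next__
Without the parentheses, entry is bound to the method itself rather than to the next record, so the batch fills up with method objects and the failure only shows up later, when they are written out. The line should be:
entry = iterator.__next__()
or, using the built-in next() function (available in Python 2.6+ and Python 3):
entry = next(iterator)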
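To make the difference concrete, here is a minimal sketch that uses a plain list iterator as a stand-in for the record iterator returned by SeqIO.parse. The names records, not_called, first, second and exhausted are made up for this illustration and are not part of your script.

```python
# A plain list iterator stands in for the record iterator from SeqIO.parse.
records = iter(["rec1", "rec2"])

# Without parentheses: this is a reference to the method, not a record.
# Nothing is consumed from the iterator here.
not_called = records.__next__
is_method = callable(not_called)

# With parentheses, or with the built-in next(): the actual next item.
first = records.__next__()
second = next(records)

# Once the iterator is exhausted, next() raises StopIteration,
# which is what the batching loop relies on to detect the end of the file.
try:
    next(records)
    exhausted = False
except StopIteration:
    exhausted = True

print(is_method, first, second, exhausted)
```

Because the uncalled method never raises StopIteration and is never None, the original loop could not detect the end of the file either.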
My reworked version of your code:
from Bio import SeqIO

def batch_iterator(iterator, batch_size):
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = False
            if not entry:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

record_iter = SeqIO.parse('/Users/nermin/mainfolder/MTB_NITR203.fasta', 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 1000), start=1):
    filename = 'group_{}.fasta'.format(i)
    count = SeqIO.write(batch, filename, 'fasta')
    print('Wrote {} records to {}'.format(count, filename))
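As an aside, the same batching idea can be written more compactly with itertools.islice from the standard library. This is only a sketch of an alternative, not the cookbook's implementation; the function name batch_iterator_islice is made up here, and a range of integers stands in for the FASTA records so the example runs without Biopython.

```python
from itertools import islice

def batch_iterator_islice(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        # islice pulls at most batch_size items and stops quietly at the end.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # nothing left, so stop the generator
        yield batch

# Integers stand in for sequence records in this toy run.
batches = list(batch_iterator_islice(range(7), 3))
print(batches)
```

In principle the iterator returned by SeqIO.parse could be passed in the same way, since the function only needs something it can iterate over.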
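"I need a simple script to split a large set of DNA sequences into smaller .fasta files for BLAST search"
This is where I have a question about your problem. Based on your description, I would expect the input file to look like this: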
>ZB3243.4 Platypus mRNA for dDFD-w, complete cds.
TTTTAATTTTGCTTTCAATATGACGGCTGTCAATGTTGCCCTGATTCGTGATACCAAGTG
GCTGACTTTAGAAGTCTGTAGAGAATTTCAGAGAGGAACTTGCTCTCGAGCTGATGCAGA
TTGCAAGTTTGCCCATCCACCAAGAGTTTGCCATGTGGAAAATGGTCGTGTGGTGGCCTG
... extremely long sequence
GCTAGGACAGGGAACAGGGAAGCACTTACAATTATTCCTTGATTTATTCAAAAGAACTGG
GAAAGATGGTTGTAGTTGTCTTTAGCTTCGGTTCAACTGAGTTTCGTTTTGTTAAACAGT
TCAGACCCTCTCACATCATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
But what your code actually handles is a file like this:
>ZB3243.4 Platypus mRNA for dDFD-w, complete cds.
TTTTAATTTTGCTTTCAATATGACGGCTGTCAATGTTGCCCTGATTCGTGATACCAAGTG
GCTGACTTTAGAAGTCTGTAGAGAATTTCAGAGAGGAACTTGCTCTCGAGCTGATGCAGA
TTGCAAGTTTGCCCATCCACCAAGAGTTTGCCATGTGGAAAATGGTCGTGTGGTGGCCTG
>VF42354.1 Rhino Ig active H-chain V-region, subgroup VH-II
TTCATGCAAATATGCTCTCTTTCTTTAGAATATTTCTGTAGGTTTCTTGGGACTGACATT
TAAAACGCCTCACTTTTGAATGTGCACAAAACCTGCTCATTAACATGCATGTGTATAATT
TGTACCTGCAGATCTGATGTTGCATAATACAATCAAATTACTAGATTTTTTAAAGAGAGA
>GS45345.54 Aardvark binding protein.
AACAACACCTGCCACCAGCGTTCCGTTCGCTGCACCAACTACAGGCAATCAGCTGAAATT
CTGAACAGCAGAGTTATGGAGTATCAGAATCTTTCCATGGAAACCTCCATATGGCCTTTC
TATATATATTCTCGTATGTCTTATTCTACCAACACAACAATAAGCGTGTTGCAGTCAATG
... extremely large number of sequences
>GR343245.2 Eggplant subgroup VH-II, mRNA.
TTGCCGCTATGCTCACCCTACTGATGCTTCCATGATTGAAGCGAGTGATAATACTGTGAC
AATCTGCATGGATTACATCAAAGGTCGATGCTCGCGGGAGAAATGCAAGTACTTTCATCC
TCCTGCACACTTGCAAGCCAGACTCAAGGCAGCTCATCATCAGATGAACCATTCAGCTGC
>FG345252.3 Bedbug binding protein 4
TTTTGATTCTCTAAAGGGTCGGTGTACCCGAGAGAACTGCAAGTACCTTCACCCTCCTCC
ACACTTAAAAACGCAGCTGGAGATTAATGGGCGGAACAATCTGATTCAACAGAAGACTGC
CGCAGCCATGTTCGCCCAGCAGATGCAGCTTATGCTCCAAAACGCTCAAATGTCATCACT
>MD2435324.5 Mantis subgroup VH-II, mRNA.
TAAAAATATGCTAATTACAAGTTATAAATCAAACGGAGAGATGGGGGCATGGAGATAGTT
TTTACGTACTGGAGGAAAGTGTGTAAAACCATGGCAATGTCACCTTTTACACAAATGCCA
TTTTCCAAATGCAAATGGCTCATGCTCTTTAGACTACTCTTTGAATAACAAGTAAGATGC
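Are you working with a large number of sequences, or with one very large sequence?
"The input fasta is just one large sequence, about 4 million base pairs (like your first example), but because of its size we can't run a BLAST search on it. All we need is to split it into smaller fasta files, preferably under 1 MB each."
We have now established that your code was not designed for what you need. So let's solve your problem by reading the data as a single sequence and breaking it up into separate FASTA files: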
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

FILENAME = '/Users/nermin/mainfolder/MTB_NITR203.fasta'
BPS_PER_FILE = 500000  # base pairs per FASTA file

# There should be one and only one record, the entire genome:
large_record = SeqIO.read(FILENAME, 'fasta')
large_sequence = large_record.seq
large_description = large_record.description
limit = len(large_sequence)

for i, start in enumerate(range(0, limit, BPS_PER_FILE), start=1):
    stop = min(start + BPS_PER_FILE, limit)
    small_sequence = large_sequence[start:stop]
    small_description = large_description + "; base pairs {} to {}".format(start, stop)
    small_record = SeqRecord(small_sequence, id=large_record.id, description=small_description)
    filename = 'group_{}.fasta'.format(i)
    count = SeqIO.write(small_record, filename, "fasta")
    assert (count == 1), "Incorrect record count!"
    print('Wrote {} base pairs to {}'.format(stop - start, filename))
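Each output file holds a single sequence, just like the original, only smaller, with a modified description that now includes the range of base pairs contained in that particular file.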
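If you want to convince yourself that the slicing loses nothing, the following sketch reproduces only the start/stop arithmetic on a toy string. The helper chunk_ranges and the 100-base test sequence are made up for this illustration; no Biopython is needed to run it.

```python
def chunk_ranges(length, chunk_size):
    """Return (start, stop) pairs covering 0..length in chunk_size steps."""
    return [(start, min(start + chunk_size, length))
            for start in range(0, length, chunk_size)]

# Toy stand-in for the genome sequence (hypothetical data): 100 bases.
sequence = "ACGT" * 25
ranges = chunk_ranges(len(sequence), 30)
pieces = [sequence[start:stop] for start, stop in ranges]

# Joining the pieces in order should reproduce the original exactly.
reassembled = "".join(pieces)
print(ranges)
print(reassembled == sequence)

# Number of files for roughly 4 million base pairs at 500000 per file.
print(len(chunk_ranges(4_000_000, 500_000)))
```

Note that the ranges are 0-based and half-open, like Python slices, so a description reading "base pairs 0 to 500000" covers indices 0 through 499999. If you need the coordinates to match 1-based positions reported elsewhere, adjust the numbers in the description accordingly.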