I have a program in which I take a pair of very large multi-sequence files (>77,000 sequences, each averaging about 1,000 bp long), compute the alignment score between each pair's corresponding elements, and write that number to an output file (which I later load into an Excel file).
My code works for small multi-sequence files, but my large master file throws the following traceback after analyzing the 16th pair.
Traceback (most recent call last):
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 109, in <module>
    cycle(f,k,binLen)
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 85, in cycle
    a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 301, in __call__
    return _align(**keywds)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 322, in _align
    score_only)
MemoryError: Out of memory
I have tried many things to get around this (as many of you can probably tell from the code), all to no avail. I have tried splitting the large master file into smaller batches to feed into the score-computing method. I have tried del-ing files after I am done with them, and I have tried running under Ubuntu 11.11 in an Oracle virtual machine (I normally work in 64-bit Windows 7). Am I being too ambitious, or is this computationally feasible in Biopython? Below is my code; I have no experience with memory debugging, which is the obvious culprit here. Any help is greatly appreciated, as I am very frustrated with this problem.
Best, Harry
##Open reference file
##a.)Upload subjectList
##b.)Upload query list (a and b are pairwise data)
## Cycle through each paired FASTA and get alignment score of each(Large file)
from Bio import SeqIO
from Bio import pairwise2
import gc
##BATCH ITERATOR METHOD (not my code)
def batch_iterator(iterator, batch_size):
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = iterator.next()
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch
def split(subject, query):
    ##Query iterator and batch subject iterator
    query_iterator = SeqIO.parse(query, "fasta")
    record_iter = SeqIO.parse(subject, "fasta")
    ##Write both large files into many small files
    print "Splitting Subject File..."
    binLen = 2
    for j, batch1 in enumerate(batch_iterator(record_iter, binLen)):
        filename1 = "groupA_%i.fasta" % (j + 1)
        handle1 = open(filename1, "w")
        count1 = SeqIO.write(batch1, handle1, "fasta")
        handle1.close()
    print "Done splitting Subject file"
    print "Splitting Query File..."
    for k, batch2 in enumerate(batch_iterator(query_iterator, binLen)):
        filename2 = "groupB_%i.fasta" % (k + 1)
        handle2 = open(filename2, "w")
        count2 = SeqIO.write(batch2, handle2, "fasta")
        handle2.close()
    print "Done splitting both FASTA files"
    print " "
    return [k, binLen]
##This file will hold the alignment scores as tab-delimited text
f = open("C:\\Users\\Harry\\Documents\\cgigas\\alignScore.txt", 'w')
def cycle(f, k, binLen):
    i = 1
    m = 1
    while i <= k + 1:
        ##Open the next pair of small files
        subjectFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupA_" + str(i) + ".fasta", "rU")
        queryFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupB_" + str(i) + ".fasta", "rU")
        i = i + 1
        j = 0
        ##Make small file iterators
        smallQuery = SeqIO.parse(queryFile, "fasta")
        smallSubject = SeqIO.parse(subjectFile, "fasta")
        ##Cycle through both sets of FASTA files
        while j < binLen:
            j = j + 1
            currentQuery = smallQuery.next()
            currentSubject = smallSubject.next()
            ##Verify every pair is correct
            print " "
            print "Pair: " + str(m)
            print "Subject: " + currentSubject.id
            print "Query: " + currentQuery.id
            gc.collect()
            a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
            gc.collect()
            currentQuery = None
            currentSubject = None
            score = str(a)
            a = None
            print "Score: " + score
            f.write(score + "\n")  # record this pair's score
            m = m + 1
        smallQuery.close()
        smallSubject.close()
        subjectFile.close()
        queryFile.close()
        gc.collect()
        print "New file"
##MAIN PROGRAM
##Here is our paired list of FASTA files
##subject = open("C:\\Users\\Harry\\Documents\\cgigas\\subjectFASTA.fasta", "rU")
##query = open("C:\\Users\\Harry\\Documents\\cgigas\\queryFASTA.fasta", "rU")
##[k, binLen] = split(subject, query)
k = 272
binLen = 2
cycle(f, k, binLen)
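(Editorial note: the split-then-cycle logic above amounts to streaming paired records one at a time. Here is a minimal sketch of that idea in Python 3 syntax, using a hypothetical stdlib-only FASTA reader in place of Bio.SeqIO so it runs without Biopython; it avoids writing any intermediate groupA_/groupB_ files.)

```python
def read_fasta(handle):
    """Minimal FASTA reader: yields (header, sequence) tuples one at a
    time. A stand-in for Bio.SeqIO.parse so this sketch is stdlib-only."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def stream_pairs(subject_handle, query_handle):
    """Yield one (subject, query) record pair at a time. Nothing is
    accumulated, so memory use stays flat regardless of file size."""
    return zip(read_fasta(subject_handle), read_fasta(query_handle))
```

With this shape, the scoring loop becomes `for subj, quer in stream_pairs(...)`, scoring and writing each pair before the next is read.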
P.S. Please be kind; I know there are probably some silly things in the code from my attempts to fix this problem.
Answer 0 (score: 2)
See also this very similar question on BioStars, http://www.biostars.org/post/show/45893/trying-to-get-around-memoryerror-out-of-memory-exception-in-biopython-program/
There I suggested trying existing tools for this kind of thing instead, e.g. EMBOSS needleall, http://emboss.open-bio.org/wiki/Appdoc:Needleall (you can parse the EMBOSS alignment output with Biopython).
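(needleall can emit a compact score-only report rather than full alignments. A hedged sketch of pulling those scores back into Python is below; the assumed line shape "subjectID queryID alignLength (score)" is an illustration only, so check it against the actual output of your EMBOSS version before relying on it.)

```python
def parse_needleall_scores(lines):
    """Parse needleall score-report lines into (subject, query, score).

    Assumed line shape (verify against your EMBOSS version's output):
        subjectID queryID alignLength (score)
    Lines that do not match this shape are skipped.
    """
    for line in lines:
        parts = line.split()
        if len(parts) >= 4 and parts[-1].startswith("(") and parts[-1].endswith(")"):
            try:
                score = float(parts[-1].strip("()"))
            except ValueError:
                continue
            yield parts[0], parts[1], score
```

Each yielded tuple can then be written straight to the tab-delimited output file, one pair per line.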
Answer 1 (score: 0)
The pairwise2 module was updated in a recent version of Biopython (1.68) to be faster and use less memory.
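(To illustrate why a score-only alignment can get away with far less memory than a full traceback matrix, here is a stdlib-only two-row dynamic program in Python 3 syntax, using localxx-style scoring: match = 1, mismatch = 0, no gap penalty. This is an editorial sketch, not Biopython's actual implementation; memory is proportional to the shorter sequence instead of the product of both lengths.)

```python
def local_score(a, b):
    """Score-only local alignment with localxx-style scoring
    (match=1, mismatch=0, no gap penalty), keeping only two DP rows."""
    if len(b) > len(a):
        a, b = b, a  # keep the rolling rows on the shorter sequence
    prev = [0] * (len(b) + 1)
    best = 0
    for ch in a:
        curr = [0] * (len(b) + 1)
        for j, bj in enumerate(b, 1):
            curr[j] = max(0,
                          prev[j - 1] + (1 if ch == bj else 0),
                          prev[j],
                          curr[j - 1])
            if curr[j] > best:
                best = curr[j]
        prev = curr
    return best
```

For example, local_score("ACGT", "ACG") matches the 3.0 that pairwise2.align.localxx returns with score_only=True for the same inputs.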