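I have now tried to define and document my own function, but I am having trouble testing the code and I don't actually know whether it is correct. I found some solutions using BioPython and other libraries, but I really want to get this working with yield.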
#generator for GenBank to FASTA
def parse_GB_to_FASTA (lines):
    #set Default label
    curr_label = None
    #set Default sequence
    curr_seq = ""
    for line in lines:
        #if the line starts with ACCESSION this should be saved as the beginning of the label
        if line.startswith('ACCESSION'):
            #if the label has already been changed
            if curr_label is not None:
                #output the label and sequence
                yield curr_label, curr_seq
            ''' if the label starts with ACCESSION, immediately replace the current label with
            the next ACCESSION number and continue with the next check'''
            #strip the first column and leave the number
            curr_label = '>' + line.strip()[12:]
        #check for the organism column
        elif line.startswith ('  ORGANISM'):
            #add the organism name to the label line
            curr_label = curr_label + " " + line.strip()[12:]
        #check if the region of the sequence starts
        elif line.startswith ('ORIGIN'):
            #until the end of the sequence is reached
            while line.startswith ('//') is False:
                #get a line without spaces and numbers
                curr_seq += line.upper().strip()[12:].translate(None, '1234567890 ')
    #if no more lines, then give the last label and sequence
    yield curr_label, curr_seq
Answer 0 (score: 0)
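I work with very large GenBank files frequently and found (years ago) that the BioPython parsers were too brittle to make it through hundreds of thousands of records (at the time) without crashing on an unusual record.
I wrote a pure Python (2) function that returns the next whole record from an open file, reading in 1k chunks and leaving the file pointer ready to get the next record. I tied this together with a simple iterator that uses the function and a GenBank Record class that has a fasta(self) method to get a FASTA version.
YMMV, but the function that gets the next record is given here and should be pluggable into any iterator scheme you want to use. As for converting to FASTA, you can use logic similar to your ACCESSION and ORIGIN grabbing above, or you can get the text of a section (like ORIGIN) using: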
import re

# gbrText holds the text of one GenBank record
sectionTitle='ORIGIN'
searchRslt=re.search(r'^(%s.+?)^\S'%sectionTitle,
                     gbrText,re.MULTILINE | re.DOTALL)
sectionText=searchRslt.groups()[0]
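Sub-sections like ORGANISM require a left-side pad of spaces in the title to match their indentation in the record: 5 spaces for feature keys, and 2 for ORGANISM in a standard GenBank flat file.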
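As a rough illustration of the section-grabbing idea, the sketch below wraps the regex in a helper and strips an ORIGIN block down to the bare sequence. It is written for Python 3; `get_section`, `origin_to_sequence`, and the miniature record are hypothetical names and made-up data, not part of the original code.

```python
import re


def get_section(gbr_text, section_title):
    """Return the text of one section of a GenBank record, or None if absent."""
    # Match from the title at the start of a line up to the next line
    # that begins with a non-whitespace character.
    pattern = r'^(%s.+?)^\S' % re.escape(section_title)
    match = re.search(pattern, gbr_text, re.MULTILINE | re.DOTALL)
    return match.group(1) if match else None


def origin_to_sequence(origin_text):
    """Drop the ORIGIN line, then remove position numbers and whitespace."""
    body = origin_text.split('\n', 1)[1] if '\n' in origin_text else ''
    return re.sub(r'[\d\s]', '', body).upper()


# Made-up miniature record, for illustration only.
record = (
    "LOCUS       TEST0001  20 bp    DNA\n"
    "ACCESSION   TEST0001\n"
    "ORIGIN\n"
    "        1 acgtacgtac gtacgtacgt\n"
    "//\n"
)

origin = get_section(record, 'ORIGIN')
sequence = origin_to_sequence(origin)
```

Returning None for a missing section avoids the AttributeError that calling `.groups()` on a failed search would raise.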
Here is my solution to the main problem:
def getNextRecordFromOpenFile(fHandle):
    """Look in file for the next GenBank record
    return text of the record
    """
    cSize =1024
    recFound = False
    recChunks = []
    try:
        fHandle.seek(-1,1)
    except IOError:
        pass
    sPos = fHandle.tell()
    gbr=None
    while True:
        cPos=fHandle.tell()
        c=fHandle.read(cSize)
        if c=='':
            return None
        if not recFound:
            locusPos=c.find('\nLOCUS')
            if sPos==0 and c.startswith('LOCUS'):
                locusPos=0
            elif locusPos == -1:
                continue
            if locusPos>0:
                locusPos+=1
            c=c[locusPos:]
            recFound=True
        else:
            locusPos=0
        if (len(recChunks)>0 and
            ((c.startswith('//\n') and recChunks[-1].endswith('\n'))
             or (c.startswith('\n') and recChunks[-1].endswith('\n//'))
             or (c.startswith('/\n') and recChunks[-1].endswith('\n/'))
             )):
            eorPos=0
        else:
            eorPos=c.find('\n//\n',locusPos)
        if eorPos == -1:
            recChunks.append(c)
        else:
            recChunks.append(c[:(eorPos+4)])
            gbrText=''.join(recChunks)
            fHandle.seek(cPos-locusPos+eorPos)
            return gbrText
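For the yield-based approach the question asks about, a line-oriented generator is a possible alternative to chunked reading. The following is only a minimal Python 3 sketch, assuming standard GenBank column layout (12-character keyword field, `//` as the record terminator). `genbank_to_fasta` and the sample lines are hypothetical, and this is not the iterator or Record class mentioned above.

```python
def genbank_to_fasta(lines):
    """Yield (label, sequence) pairs from an iterable of GenBank lines."""
    label = None
    seq_parts = []
    in_origin = False
    for line in lines:
        if line.startswith('//'):
            # End of record: emit it and reset the state for the next one.
            if label is not None:
                yield label, ''.join(seq_parts)
            label, seq_parts, in_origin = None, [], False
        elif in_origin:
            # Keep only letters, dropping position numbers and spaces.
            seq_parts.append(''.join(ch for ch in line if ch.isalpha()).upper())
        elif line.startswith('ACCESSION'):
            label = '>' + line[12:].strip()
        elif line.startswith('  ORGANISM') and label is not None:
            label += ' ' + line[12:].strip()
        elif line.startswith('ORIGIN'):
            in_origin = True
    # Emit a final record that lacks the closing '//' line.
    if label is not None:
        yield label, ''.join(seq_parts)


# Made-up sample lines, for illustration only.
sample = [
    "LOCUS       TEST0001   12 bp    DNA\n",
    "ACCESSION   TEST0001\n",
    "  ORGANISM  Examplus testus\n",
    "ORIGIN\n",
    "        1 acgtac gtacgt\n",
    "//\n",
    "LOCUS       TEST0002    6 bp    DNA\n",
    "ACCESSION   TEST0002\n",
    "ORIGIN\n",
    "        1 ttttgg\n",
    "//\n",
]

records = list(genbank_to_fasta(sample))
```

Because an open file is itself an iterable of lines, the same generator could be fed a file handle directly, so only one record's sequence is held in memory at a time.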