Parsing GenBank to FASTA with yield in Python(x,y)

Time: 2014-11-04 23:54:34

Tags: python parsing bioinformatics fasta genbank

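I have now tried to define and document my own function, but I am having trouble testing the code, and I do not actually know whether it is correct. I found some solutions that use BioPython or other tools, but I really want to get this working with yield.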

#generator for GenBank to FASTA
def parse_GB_to_FASTA (lines):
    #set Default label
    curr_label = None
    #set Default sequence
    curr_seq = ""
    #True while we are inside the ORIGIN (sequence) block
    in_seq = False
    for line in lines:
        #if the line starts with ACCESSION this should be saved as the beginning of the label
        if line.startswith('ACCESSION'):
            #drop the 12-character keyword column and leave the number
            curr_label = '>' + line[12:].strip()
        #check for the organism column
        elif line.startswith ('  ORGANISM'):
            #add the organism name to the label line
            curr_label = curr_label + " " + line[12:].strip()
        #check if the region of the sequence starts
        elif line.startswith ('ORIGIN'):
            in_seq = True
        #'//' ends the record: output label and sequence, then reset
        elif line.startswith ('//'):
            yield curr_label, curr_seq
            curr_label, curr_seq, in_seq = None, "", False
        elif in_seq:
            #keep only the letters: drops position numbers, spaces and the newline
            curr_seq += ''.join(ch for ch in line.upper() if ch.isalpha())
    #if the input lacks a final '//', still give the last label and sequence
    if curr_label is not None:
        yield curr_label, curr_seq

1 answer:

Answer 0 (score: 0)

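I work with very large GenBank files a lot, and found (years ago) that the BioPython parsers were too brittle to get through hundreds of thousands of records (at the time) without crashing on an unusual record.

I wrote a pure Python (2) function that returns the next whole record from an open file, reading in 1k chunks and leaving the file pointer ready to fetch the next record. I tied this together with a simple iterator that uses the function and a GenBank Record class that has a fasta(self) method to get a FASTA version.

YMMV, but the function that gets the next record is here, and it should be pluggable into any iterator scheme you want to use. As for converting to FASTA, you can use logic similar to your ACCESSION and ORIGIN grabbing above, or you can get the text of a section (such as ORIGIN) using: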

import re

#gbrText holds the text of one whole GenBank record
sectionTitle='ORIGIN'
searchRslt=re.search(r'^(%s.+?)^\S'%sectionTitle,
                     gbrText,re.MULTILINE | re.DOTALL)
sectionText=searchRslt.groups()[0]

Subsections such as ORGANISM need a left-side pad of 5 spaces.
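To make the section-grabbing idea concrete, here is a minimal sketch that wraps the same regular expression in a function and then reduces the ORIGIN text to a bare sequence. The names `get_section` and `origin_to_sequence`, the `None` fallback for a missing section, and the sample record are all assumptions for illustration. The sketch is written for Python 3 and only covers top-level sections.

```python
import re


def get_section(gbr_text, section_title):
    """Return the text of one top-level section, or None if it is absent.

    Hypothetical helper: the name and the None fallback are assumptions.
    """
    # Match from the title at the start of a line up to (not including)
    # the next line that starts with a non-whitespace character.
    pattern = r'^(%s.+?)^\S' % re.escape(section_title)
    result = re.search(pattern, gbr_text, re.MULTILINE | re.DOTALL)
    return result.group(1) if result else None


def origin_to_sequence(origin_text):
    """Drop the ORIGIN header line, position numbers and whitespace."""
    body = origin_text.split('\n', 1)[1] if '\n' in origin_text else ''
    return re.sub(r'[\d\s]', '', body).upper()


# Made-up record, used only for illustration.
record = (
    "LOCUS       DEMO1\n"
    "ACCESSION   X00001\n"
    "ORIGIN\n"
    "        1 acgtacgtac gt\n"
    "       13 ttaa\n"
    "//\n"
)
origin = get_section(record, 'ORIGIN')
sequence = origin_to_sequence(origin)
```

Because the pattern stops at the next line that begins in column one, the closing `//` line is what ends the ORIGIN match. A section that is the very last thing in the text, with no following line, would not match and comes back as `None` here.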

Here is my solution to the main problem:

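def getNextRecordFromOpenFile(fHandle):
    """Look in file for the next GenBank record
    return text of the record (None when no further record is found)
    """
    cSize =1024          # characters read per chunk
    recFound = False     # True once the LOCUS line has been located
    recChunks = []       # pieces of the record collected so far
    try:
        # step back one character so the '\n' in front of LOCUS is seen
        fHandle.seek(-1,1)
    except IOError:
        # already at the start of the file
        pass
    sPos = fHandle.tell()

    while True:
        cPos=fHandle.tell()
        c=fHandle.read(cSize)
        if c=='':
            # end of file reached without a complete record
            return None
        if not recFound:
            locusPos=c.find('\nLOCUS')
            if sPos==0 and c.startswith('LOCUS'):
                locusPos=0
            elif locusPos == -1:
                continue
            else:
                # skip the newline in front of LOCUS
                locusPos+=1
            c=c[locusPos:]
            recFound=True
        else:
            locusPos=0

        # the '\n//\n' terminator can be split across two chunks;
        # eorLen counts the leading characters of c that complete it
        eorLen=0
        if len(recChunks)>0:
            for head,tail in (('//\n','\n'),('\n','\n//'),('/\n','\n/')):
                if c.startswith(head) and recChunks[-1].endswith(tail):
                    eorLen=len(head)
                    break
        if eorLen==0:
            eorPos=c.find('\n//\n')
            if eorPos == -1:
                recChunks.append(c)
                continue
            eorLen=eorPos+4

        recChunks.append(c[:eorLen])
        gbrText=''.join(recChunks)
        # c was sliced at locusPos, so add it back to get the file offset;
        # stop on the final newline of this record, ready for the next call
        fHandle.seek(cPos+locusPos+eorLen-1)
        return gbrText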
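The simple iterator and the record class with a fasta() method mentioned earlier are not shown in this answer. The following is only a sketch of how they might look, not my original code: the names `GenBankRecord` and `iter_records` are hypothetical, the sample data is made up, and the reader here goes line by line with `yield` instead of using the chunked function, so that the example stays self-contained (Python 3).

```python
import io
import re


class GenBankRecord:
    """Hypothetical wrapper around the raw text of one GenBank record."""

    def __init__(self, text):
        self.text = text

    def accession(self):
        # First token after the ACCESSION keyword, or None if absent.
        match = re.search(r'^ACCESSION\s+(\S+)', self.text, re.MULTILINE)
        return match.group(1) if match else None

    def sequence(self):
        # Everything between the ORIGIN line and the closing '//'.
        match = re.search(r'^ORIGIN[^\n]*\n(.*?)^//', self.text,
                          re.MULTILINE | re.DOTALL)
        if match is None:
            return ''
        # Drop position numbers and whitespace, keep the residues.
        return re.sub(r'[\d\s]', '', match.group(1)).upper()

    def fasta(self, width=60):
        # Header line with the accession, then the sequence wrapped at width.
        seq = self.sequence()
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        return '>%s\n%s\n' % (self.accession(), '\n'.join(lines))


def iter_records(handle):
    """Yield one GenBankRecord per LOCUS ... // block, reading line by line."""
    chunk = []
    for line in handle:
        if line.startswith('LOCUS'):
            chunk = [line]
        elif chunk:
            chunk.append(line)
            if line.startswith('//'):
                yield GenBankRecord(''.join(chunk))
                chunk = []


# Made-up two-record input, used only for illustration.
demo = io.StringIO(
    "LOCUS       DEMO1\n"
    "ACCESSION   X00001\n"
    "ORIGIN\n"
    "        1 acgtacgtac gt\n"
    "//\n"
    "LOCUS       DEMO2\n"
    "ACCESSION   X00002\n"
    "ORIGIN\n"
    "        1 ttttgggg\n"
    "//\n"
)
fasta_text = ''.join(rec.fasta() for rec in iter_records(demo))
```

Keeping the raw text inside the record object means an unusual record does not stop the iteration; a field is only parsed when one of the methods asks for it.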