Fastq parser not handling empty sequences (and other edge cases). Python

Time: 2014-12-17 08:18:59

Tags: python parsing generator bioinformatics string

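This is a continuation of "Generator not working to split string by particular identifier. Python 2". However, I completely rewrote the code and it is no longer in the same format at all. This question is about edge cases.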

Edge Cases:
 . when sequence length is different than number of quality values
 . when there's an empty sequence or entry
 . when the number of lines with quality values is more than one

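I can't figure out how to handle the edge cases above. If it is an empty data file, I still want to output empty strings. I am trying to use the sequences below as my input file. (A little background: the ID is marked by @ at the start of a line, the lines that follow hold the sequence characters until a line with + is reached, and the next lines hold the quality values (value ~= chr(char)). This format is terrible and not well thought out.)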

@m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+

@m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs
CTCTCTCATCACACACGAGGAGTGAAGAGAGAACCTCCTCTCCACACGTGGAGTGAGGAGATCCTCTCACACACGTGAGGTGTTGAGAGAGATACTCTCTCATCACCTCACGTGAGGAGTGAGAGAGAT
+
{~~~~~sXNL>>||~~fVM~jtu~&&(uxy~f8YHh=<gA5
''<O1A44N'`oK57(((G&&Q*Q66;"$$Df66E~Z\ZMO>^;%L}~~~~~Q.~~~~x~@-LF9>~MMqbV~ABBV=99mhIwGRR~
@different_number_of_seq_qual
ATCG
+
**!
@this_should_work
GGGG
+
****

For the records with errors, I am trying to replace the seq and qual strings with empty strings:

seq,qual = '',''

Here is my code so far. These edge cases are really hard for me to work out. Please help...

import sys  # needed for the sys.stderr.write call below

def read_fastq(input, offset):
    """
    Inputs a fastq file and reads each line at a time.  'offset' parameter can be set to 33 (phred+33 encoding 
    fastq), and 64.  Yields a tuple in the format (ID, comments for a sequence, sequence, [integer quality values])
    Capable of reading empty sequences and empty files. 
    """

    ID, comment, seq, qual = None,'','',''

    step = 1 #step is a variable that organizes the order fastq parsing
    #step= 1 scans for ID and comment line
    #step= 2 adds relevant lines to sequence string
    #step= 3 adds quality values to string
    for line in input:
        line = line.strip()
        if step == 1 and line.startswith('@'): #Step system from Nedda Saremi
            if ID is not None: 
                qual = [ord(char)-offset for char in qual] #Converts from phred encoding to integer values
                sep = None
                if ' ' in ID: sep = ' '
                if sep is not None: 
                    ID, comment =  ID.split(sep,1) #Separates ID and comment by  ' '

                yield ID, comment, seq, qual
                ID,comment,seq,qual = None,'','','' #Resets variable for next sequence
            ID = line[1:]
            step = 2
            continue
        if step==2 and not line.startswith('@') and not line.startswith('+'):
            seq = seq + line.strip()
            continue
        if step == 2 and line.startswith('+'):
            step = 3
            continue
        while step == 3: 
        #process the quality data
            if len(qual) == len(seq):
            #once the length of the quality seq and seq are the same, end gathering data
                step = 1
                continue
            if len(qual) < len(seq):
                qual = qual + line.strip()
                if len(qual) < len(seq): 
                    step = 3
                    continue
                if (len(qual) > len(seq)): 
                    sys.stderr.write('\nError: ' + ID + ' sequence length not equal to quality values\n')
                    comment,seq,qual= '','',''
                    ID = line
                    step = 1
                    continue
            break

    if ID is not None:
        #Section reserved for last entry in file
        if len(qual) > 0: 
            qual = [ord(char)-offset for char in qual]
        sep = None
        if ' ' in ID: sep = ' '
        if sep is not None: 
            ID, comment =  ID.split(sep,1)
        if len(seq) == 0: ID,comment,seq,qual= '','','',''
        yield ID, comment, seq, qual       

My output skips the ID @m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs and adds @**! when it should not be in the output:

@m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+


Error: different_number_of_seq_qual sequence length not equal to quality values
@**!

+

@this_should_work
GGGG
+
****

2 Answers:

Answer 0: (score: 1)

You should probably use BioPython.

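Your error appears to be that the read that gets skipped has 129 bases in its sequence but only 128 qvs (quality values). So your parser reads the next defline as a quality line, which makes the quality string too long, and the error is printed.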

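After that, your states don't account for the situation where you are in step 1 but don't see a defline. So you keep reading extra lines, which overwrite the ID variable.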

But if you really want to write your own parser:

I'll address your issues one at a time.

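when sequence length is different than number of quality values

This is invalid. Each record in a fastq file must have the same number of bases and qualities. Different records in the file can have different lengths from each other, but within each record the number of bases and the number of qualities must be equal.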
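As a minimal sketch of that rule, a parser could check each record before yielding it and reject the ones that break it. The function name check_record is hypothetical and not part of your code; the two example records are the last two from your input file.

```python
def check_record(record_id, seq, qual):
    """Raise ValueError if a record's base and quality counts differ."""
    if len(seq) != len(qual):
        raise ValueError(
            "%s: %d bases but %d quality values"
            % (record_id, len(seq), len(qual))
        )


# A valid record passes silently.
check_record("this_should_work", "GGGG", "****")

# An invalid record is reported instead of being silently repaired.
try:
    check_record("different_number_of_seq_qual", "ATCG", "**!")
except ValueError as err:
    print(err)
```

Whether to raise, log and skip, or substitute empty strings is a design choice; raising is just the simplest way to make the invalid record visible.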

when there's an empty sequence or entry

An empty read will have blank lines for the sequence and quality lines, like this:

@SOLEXA1_0007:1:9:610:1983#GATCAG/2

+SOLEXA1_0007:1:9:610:1983#GATCAG/2

@SOLEXA1_0007:2:13:163:254#GATCAG/2
CGTAGTACGATATACGCGCGTGTACTGCTACGTCTCACTTTCGCAAGATTGCTCAGCTCATTGATGCTCAATGCTGGGCCATATCTCTTTTCTTTTTTTC
+SOLEXA1_0007:2:13:163:254#GATCAG/2
HHHHGHHEHHHHHE=HAHCEGEGHAG>CHH>EG5@>5*ECE+>AEEECGG72B&A*)569B+03B72>5.A>+*A>E+7A@G<CAD?@############

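when the number of lines with quality values is more than one

Because of the requirement from the first point above, we know that the number of bases and qualities must match. Also, there will never be a + character in the sequence block. So we can keep parsing the sequence block until we see a line that starts with +. At that point we know we are done parsing the sequence. Then we can keep parsing quality lines until we have the same number of qualities as bases in the sequence. We can't rely on looking for any special characters, because depending on the quality encoding, @ may be a valid quality call.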
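The approach just described could be sketched roughly as follows. This is an illustration under assumptions, not a drop-in replacement: the name parse_fastq is hypothetical, the default offset of 33 is assumed, the ID/comment split on the first space copies what your code does, and the sketch raises on a length mismatch instead of trying to resynchronise (once the counts disagree there is no reliable way to tell a defline from a quality line).

```python
import io


def parse_fastq(handle, offset=33):
    """Yield (ID, comment, seq, quals) tuples from an open FASTQ handle."""
    lines = (line.rstrip("\r\n") for line in handle)
    for line in lines:
        if not line.startswith("@"):
            continue  # skip blank or stray lines between records
        read_id, _, comment = line[1:].partition(" ")

        # Sequence block: every line up to the one that starts with '+'.
        seq = ""
        for line in lines:
            if line.startswith("+"):
                break
            seq += line.strip()

        # Quality block: keep reading until we have one value per base.
        # No check for '@' here, since '@' can be a valid quality character.
        qual = ""
        for line in lines:
            qual += line.strip()
            if len(qual) >= len(seq):
                break

        if len(qual) != len(seq):
            raise ValueError("%s: %d bases but %d quality values"
                             % (read_id, len(seq), len(qual)))
        yield read_id, comment, seq, [ord(c) - offset for c in qual]


# Multi-line sequence and quality, an empty read, and a normal read.
sample = io.StringIO(
    u"@r1 first read\nAC\nGT\n+\n!!\n@@\n"
    u"@empty\n\n+\n\n"
    u"@r2\nGGGG\n+\n****\n"
)
for record in parse_fastq(sample):
    print(record)
```

Note that the second quality line of r1 starts with @ and is still read as quality data, because the parser counts characters rather than looking for a marker. The empty read works because its blank quality line is consumed before the length comparison.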

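Also, as a side note, it looks like you are splitting the sequence defline to parse out optional comments. You have to be careful with the CASAVA 1.8 format, which stupidly contains spaces. So you might need a regex to check whether the defline is in CASAVA 1.8 format and, if so, not split it on whitespace.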
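A rough sketch of that check is below. The pattern assumes the usual CASAVA 1.8 layout, instrument:run:flowcell:lane:tile:x:y followed by a space and read:is_filtered:control_number:index, and may need loosening for your data. The names CASAVA_18 and split_defline are hypothetical, and the defline is expected without its leading @.

```python
import re

# Assumed CASAVA 1.8 defline layout (without the leading '@'):
#   instrument:run:flowcell:lane:tile:x:y read:is_filtered:control:index
CASAVA_18 = re.compile(
    r"^[^\s:]+:\d+:[^\s:]+:\d+:\d+:\d+:\d+ [12]:[YN]:\d+:\S*$"
)


def split_defline(defline):
    """Return (ID, comment), keeping CASAVA 1.8 deflines whole."""
    if CASAVA_18.match(defline):
        return defline, ""
    # Otherwise treat everything after the first space as a comment.
    read_id, _, comment = defline.partition(" ")
    return read_id, comment


print(split_defline("EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
print(split_defline("some_read an optional comment"))
```

Returning the whole CASAVA defline as the ID is only one option; you could instead capture the part after the space in a regex group if you need the read number or filter flag separately.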

Answer 1: (score: 0)

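Have you considered using one of the robust Python packages available for working with this kind of data instead of writing a parser from scratch? In particular, I'd suggest taking a look at HTSeq.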