这是Generator not working to split string by particular identifier . Python 2的延续。但是,我完全修改了代码,它根本不是同一种格式。这是关于边缘情况
Edge Cases:
. when sequence length is different than number of quality values
. when there's an empty sequence or entry
. when the number of lines with quality values is more than one
我无法弄清楚如何处理上面的边缘情况。如果它是一个空数据文件,那么我仍然想输出空字符串。我正在尝试使用这些序列来输入我的输入文件:(只是一个小背景,ID在行的开头由@设置,序列字符后面是行,直到到达带有+的行。下一行行将具有质量值(值〜= chr(char))这种格式非常糟糕且经过深思熟虑。
@m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=@@88,8,+>::J88ELF9.-5.45G+@@@NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J@'''B;.(...HP?E@@AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:@K..M?6?::7,/4444=JK>>HNN=//16@--F@K;9<:6449@BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC@EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00@,+-/.,#(LL@>@I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=@=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000@NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=@@HEE.::NM&&&>O''PO88H@A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA@BAJ7>%6700;;=@@?=;J55>>QP<<:>MF;;RPL==JMMPPPQR@@P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,@AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP@FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66@@@AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>@5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP@@PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E@=$$$#.::,,BPJK76B;;F5<<J::K
@m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs
CTCTCTCATCACACACGAGGAGTGAAGAGAGAACCTCCTCTCCACACGTGGAGTGAGGAGATCCTCTCACACACGTGAGGTGTTGAGAGAGATACTCTCTCATCACCTCACGTGAGGAGTGAGAGAGAT
+
{~~~~~sXNL>>||~~fVM~jtu~&&(uxy~f8YHh=<gA5
''<O1A44N'`oK57(((G&&Q*Q66;"$$Df66E~Z\ZMO>^;%L}~~~~~Q.~~~~x~@-LF9>~MMqbV~ABBV=99mhIwGRR~
@different_number_of_seq_qual
ATCG
+
**!
@this_should_work
GGGG
+
****
有错误的那些,我试图用空字符串替换seq和qual字符串
seq,qual = '',''
到目前为止,这是我的代码。这些边缘情况对我来说很难找到请帮助。 。 。
def read_fastq(input, offset):
"""
Inputs a fastq file and reads each line at a time. 'offset' parameter can be set to 33 (phred+33 encoding
fastq), and 64. Yields a tuple in the format (ID, comments for a sequence, sequence, [integer quality values])
Capable of reading empty sequences and empty files.
"""
ID, comment, seq, qual = None,'','',''
step = 1 #step is a variable that organizes the order fastq parsing
#step= 1 scans for ID and comment line
#step= 2 adds relevant lines to sequence string
#step= 3 adds quality values to string
for line in input:
line = line.strip()
if step == 1 and line.startswith('@'): #Step system from Nedda Saremi
if ID is not None:
qual = [ord(char)-offset for char in qual] #Converts from phred encoding to integer values
sep = None
if ' ' in ID: sep = ' '
if sep is not None:
ID, comment = ID.split(sep,1) #Separates ID and comment by ' '
yield ID, comment, seq, qual
ID,comment,seq,qual = None,'','','' #Resets variable for next sequence
ID = line[1:]
step = 2
continue
if step==2 and not line.startswith('@') and not line.startswith('+'):
seq = seq + line.strip()
continue
if step == 2 and line.startswith('+'):
step = 3
continue
while step == 3:
#process the quality data
if len(qual) == len(seq):
#once the length of the quality seq and seq are the same, end gathering data
step = 1
continue
if len(qual) < len(seq):
qual = qual + line.strip()
if len(qual) < len(seq):
step = 3
continue
if (len(qual) > len(seq)):
sys.stderr.write('\nError: ' + ID + ' sequence length not equal to quality values\n')
comment,seq,qual= '','',''
ID = line
step = 1
continue
break
if ID is not None:
#Section reserved for last entry in file
if len(qual) > 0:
qual = [ord(char)-offset for char in qual]
sep = None
if ' ' in ID: sep = ' '
if sep is not None:
ID, comment = ID.split(sep,1)
if len(seq) == 0: ID,comment,seq,qual= '','','',''
yield ID, comment, seq, qual
我的输出正在跳过ID @ m120204_092117_richard_c100250832550000001523001204251233_s1_p0 / 904 / ccs并添加@ **!什么时候不应该在输出中
@m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=@@88,8,+>::J88ELF9.-5.45G+@@@NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J@'''B;.(...HP?E@@AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:@K..M?6?::7,/4444=JK>>HNN=//16@--F@K;9<:6449@BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC@EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00@,+-/.,#(LL@>@I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=@=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000@NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=@@HEE.::NM&&&>O''PO88H@A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA@BAJ7>%6700;;=@@?=;J55>>QP<<:>MF;;RPL==JMMPPPQR@@P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,@AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP@FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66@@@AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>@5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP@@PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E@=$$$#.::,,BPJK76B;;F5<<J::K
Error: different_number_of_seq_qual sequence length not equal to quality values
@**!
+
@this_should_work
GGGG
+
****
答案 0 :(得分:1)
您可能应该使用BioPython。
您的错误似乎是跳过的读取序列中有129个碱基但只有128个qv。因此,您的解析器将下一个defline读取为质量线,然后使其过长,以便打印错误。
然后你的州没有说明你在step 1
所处的情况,但没有看到违约。所以你继续阅读额外的行来覆盖ID
变量。
但如果你真的想编写自己的解析器:
我会一次解决一个问题。
当序列长度与质量值数量不同时
这是无效的。 fastq文件中的每条记录必须具有相同数量的基数和质量。文件中的不同记录可以是彼此不同的长度,但每个记录必须具有相同的基础和质量。
当有空序列或条目时对于序列和质量行,空读将有空行,如下所示:
@SOLEXA1_0007:1:9:610:1983#GATCAG/2
+SOLEXA1_0007:1:9:610:1983#GATCAG/2
@SOLEXA1_0007:2:13:163:254#GATCAG/2
CGTAGTACGATATACGCGCGTGTACTGCTACGTCTCACTTTCGCAAGATTGCTCAGCTCATTGATGCTCAATGCTGGGCCATATCTCTTTTCTTTTTTTC
+SOLEXA1_0007:2:13:163:254#GATCAG/2
HHHHGHHEHHHHHE=HAHCEGEGHAG>CHH>EG5@>5*ECE+>AEEECGG72B&A*)569B+03B72>5.A>+*A>E+7A@G<CAD?@############
当质量值的行数超过一行时
由于上面第一个答案的要求。我们知道基础和质量的数量必须匹配。此外,序列块中永远不会有+
个字符。因此,我们可以继续解析序列块,直到看到以+
开头的行。然后我们知道我们已经完成了解析序列。然后我们可以继续解析质量线,直到我们获得与序列中相同数量的质量。我们不能依赖于查找任何特殊字符,因为根据质量编码,@
可能是有效的质量调用。
另外,您似乎正在拆分序列defline以解析可选注释。你必须小心CASAVA 1.8格式愚蠢地有空格。因此,您可能需要一个正则表达式来查看它是否是CASAVA 1.8格式,然后不要在空格等上拆分。
答案 1 :(得分:0)
您是否考虑使用可用于处理此类数据的强大python包而不是从头开始编写解析器?在部分内容中,我建议您查看HTSeq