我有以下名为seq.fasta的文件:
>AAM15934.1| NtrX [Gluconacetobacter diazotrophicus]| NTRX1 | Response_reg - Sigma54_activat - HTH_8
MGHEILIVDDEPDIRLLVEGILRDEGYETRLAGDSDSAISAFRARRPSLVILDVWLQGSRLDGLGILQAI
QGEEPVVPTIMISGHGTIETAVAALQHGAYDFIEKPFQSDRLLLVVRRALEASRLARENAELRLRAGPEA
MLYGDSPVIAGVRNQIERVAPSGSRVLISGAAGAGKEVAARMIHARSPGPKAFIALNCATLAPGRFEEEL
FGIEGAPDGTGRRTGVLERAHGGTLLLDEVSDMPIETQGKIVRALQDQSFERVGGASRVKVDVRVLAATN
RDLQEAIAAGRFREDLYYRLAVVPLRVPSLRERREDIPGLARLFLRRAAENAGLPLRDLSGDAVAALQSY
DWPGNARELRNLMERLLIMMPGNGSDLIRAEMLPPSVGQGAPALLKFDPAADVMGLPLREARDLFETQYL
QAQLLRFGGNISRTAGFVGMERSALHRKLKQLGVTSEERGAG
>WP_002731145.1| NtrX [Phaeospirillum molischianum]| NTRX1 | Response_reg - Sigma54_activat - HTH_8
MAHDILIVDDEADIRVLIAGILEDEGHSTREAANADEALERIRARRPSLVIQDIWLQGSRLDGLGVLDEI
KREHPDVPVVMISGHGTIETAVQAIKQGAYDFIEKPFKADRLLLVVDRAIESARLKRENQELRVRSGSTG
DLVGISPALVQIRQTIERVAPTNSRVLITGPAGSGKEVAARMIHAHSRRTEGPFVVVNCAAMHPDRMEIE
LFGTEYGADGSTSPRKIGTFEQAHSGTLLLDEVADMPLETQGKIVRVLQDQTFERVGGGKRVEVDVRVIA
TTNRDLQSEMIAGHFREDLFYRLNVVPIRMPALRDGKEDIPLLARQFMQLAAQLAGVPPRPLGEDALAAL
QAYDWPGNVRQLRNAIDWLLIMAPGDWRDPVRADMLPSEIGAITPAVLRWEKSSEIMTLPLREARELFER
EYLLAQVNRFAGNISRTAAFVGMERSALHRKLKLLGINTDEKVR
>WP_002967695.1| NtrX [Brucella abortus]| NTRX1 | Response_reg - Sigma54_activat - HTH_8
MAADILVVDDEVDIRDLVAGILSDEGHETRTAFDADSALAAINDRAPRLVFLDIWLQGSRLDGLALLDEI
KKQHPELPVVMISGHGNIETAVSAIRRGAYDFIEKPFKADRLILVAERALETSKLKREVSDLRKRTGDQL
ELVGTSLAMNQLRQTIERVAPTNSRIMITGPSGAGKELVARTIHAQSSRANGPFVTVNAATITPERMEIE
LFGTEMDGGERKVGALEEAHGGILYLDEVADMPRETQNKILRVLVDQQFERVGGTKRVKVDVRIISSTAQ
NLEGMIAEGTFREDLFHRLSVVPVQVPALAARREDIPSLVEFFMKQIAEQAGIKPRKIGPDAMAVLQAHS
WPGNLRQLRNNVERLMILTRGDDPDELVTADLLPAEIGDTLPRAPTESDQHIMALPLREARERFEKEYLI
AQINRFGGNISRTAEFVGMERSALHRKLKSLGV
我想把每个字母块放在一个列表中。 例如:
列出内容:
List[0] = MGHEILIVDDEPDIRLLVEGILRDEGYETRLAGDSDSAISAFRARRPSLVILDVWLQGSRLDGLGILQAI
QGEEPVVPTIMISGHGTIETAVAALQHGAYDFIEKPFQSDRLLLVVRRALEASRLARENAELRLRAGPEA
MLYGDSPVIAGVRNQIERVAPSGSRVLISGAAGAGKEVAARMIHARSPGPKAFIALNCATLAPGRFEEEL
FGIEGAPDGTGRRTGVLERAHGGTLLLDEVSDMPIETQGKIVRALQDQSFERVGGASRVKVDVRVLAATN
RDLQEAIAAGRFREDLYYRLAVVPLRVPSLRERREDIPGLARLFLRRAAENAGLPLRDLSGDAVAALQSY
DWPGNARELRNLMERLLIMMPGNGSDLIRAEMLPPSVGQGAPALLKFDPAADVMGLPLREARDLFETQYL
QAQLLRFGGNISRTAGFVGMERSALHRKLKQLGVTSEERGAG
List[1] = MAHDILIVDDEADIRVLIAGILEDEGHSTREAANADEALERIRARRPSLVIQDIWLQGSRLDGLGVLDEI
KREHPDVPVVMISGHGTIETAVQAIKQGAYDFIEKPFKADRLLLVVDRAIESARLKRENQELRVRSGSTG
DLVGISPALVQIRQTIERVAPTNSRVLITGPAGSGKEVAARMIHAHSRRTEGPFVVVNCAAMHPDRMEIE
LFGTEYGADGSTSPRKIGTFEQAHSGTLLLDEVADMPLETQGKIVRVLQDQTFERVGGGKRVEVDVRVIA
TTNRDLQSEMIAGHFREDLFYRLNVVPIRMPALRDGKEDIPLLARQFMQLAAQLAGVPPRPLGEDALAAL
QAYDWPGNVRQLRNAIDWLLIMAPGDWRDPVRADMLPSEIGAITPAVLRWEKSSEIMTLPLREARELFER
EYLLAQVNRFAGNISRTAAFVGMERSALHRKLKLLGINTDEKVR
List[2] = MAADILVVDDEVDIRDLVAGILSDEGHETRTAFDADSALAAINDRAPRLVFLDIWLQGSRLDGLALLDEI
KKQHPELPVVMISGHGNIETAVSAIRRGAYDFIEKPFKADRLILVAERALETSKLKREVSDLRKRTGDQL
ELVGTSLAMNQLRQTIERVAPTNSRIMITGPSGAGKELVARTIHAQSSRANGPFVTVNAATITPERMEIE
LFGTEMDGGERKVGALEEAHGGILYLDEVADMPRETQNKILRVLVDQQFERVGGTKRVKVDVRIISSTAQ
NLEGMIAEGTFREDLFHRLSVVPVQVPALAARREDIPSLVEFFMKQIAEQAGIKPRKIGPDAMAVLQAHS
WPGNLRQLRNNVERLMILTRGDDPDELVTADLLPAEIGDTLPRAPTESDQHIMALPLREARERFEKEYLI
AQINRFGGNISRTAEFVGMERSALHRKLKSLGV
但我正在努力拆分并将它们放入列表中,我的代码就像:
import re
myfile = open('seq.fasta', 'r').read()
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(myfile)]
for m in matches:
onlySequences = (m[1])
print(onlySequences)
变量onlySequences
只返回最后一个字母块,如何保留所有字母,每个字母都列在一个列表中?
答案 0 :(得分:0)
您在for循环中覆盖onlySequences
。也许你只需要这个:
matches = [m.groups()[1] for m in regex.finditer(myfile)]
print(matches)
或者修改代码:
matches = [m.groups() for m in regex.finditer(myfile)]
onlySequences = [m[1] for m in matches]
答案 1 :(得分:0)
你不需要正则表达式来做到这一点。更好的方法是逐行读取文件:
text/css