Regular expression with multi-line text blocks (Python)

Time: 2017-06-03 21:05:50

Tags: python regex list split expression

I have the following file, named seq.fasta:

>AAM15934.1| NtrX [Gluconacetobacter diazotrophicus]| NTRX1 | Response_reg - Sigma54_activat - HTH_8
MGHEILIVDDEPDIRLLVEGILRDEGYETRLAGDSDSAISAFRARRPSLVILDVWLQGSRLDGLGILQAI
QGEEPVVPTIMISGHGTIETAVAALQHGAYDFIEKPFQSDRLLLVVRRALEASRLARENAELRLRAGPEA
MLYGDSPVIAGVRNQIERVAPSGSRVLISGAAGAGKEVAARMIHARSPGPKAFIALNCATLAPGRFEEEL
FGIEGAPDGTGRRTGVLERAHGGTLLLDEVSDMPIETQGKIVRALQDQSFERVGGASRVKVDVRVLAATN
RDLQEAIAAGRFREDLYYRLAVVPLRVPSLRERREDIPGLARLFLRRAAENAGLPLRDLSGDAVAALQSY
DWPGNARELRNLMERLLIMMPGNGSDLIRAEMLPPSVGQGAPALLKFDPAADVMGLPLREARDLFETQYL
QAQLLRFGGNISRTAGFVGMERSALHRKLKQLGVTSEERGAG

>WP_002731145.1| NtrX [Phaeospirillum molischianum]| NTRX1 | Response_reg - Sigma54_activat - HTH_8
MAHDILIVDDEADIRVLIAGILEDEGHSTREAANADEALERIRARRPSLVIQDIWLQGSRLDGLGVLDEI
KREHPDVPVVMISGHGTIETAVQAIKQGAYDFIEKPFKADRLLLVVDRAIESARLKRENQELRVRSGSTG
DLVGISPALVQIRQTIERVAPTNSRVLITGPAGSGKEVAARMIHAHSRRTEGPFVVVNCAAMHPDRMEIE
LFGTEYGADGSTSPRKIGTFEQAHSGTLLLDEVADMPLETQGKIVRVLQDQTFERVGGGKRVEVDVRVIA
TTNRDLQSEMIAGHFREDLFYRLNVVPIRMPALRDGKEDIPLLARQFMQLAAQLAGVPPRPLGEDALAAL
QAYDWPGNVRQLRNAIDWLLIMAPGDWRDPVRADMLPSEIGAITPAVLRWEKSSEIMTLPLREARELFER
EYLLAQVNRFAGNISRTAAFVGMERSALHRKLKLLGINTDEKVR

>WP_002967695.1| NtrX  [Brucella abortus]| NTRX1 | Response_reg - Sigma54_activat - HTH_8
MAADILVVDDEVDIRDLVAGILSDEGHETRTAFDADSALAAINDRAPRLVFLDIWLQGSRLDGLALLDEI
KKQHPELPVVMISGHGNIETAVSAIRRGAYDFIEKPFKADRLILVAERALETSKLKREVSDLRKRTGDQL
ELVGTSLAMNQLRQTIERVAPTNSRIMITGPSGAGKELVARTIHAQSSRANGPFVTVNAATITPERMEIE
LFGTEMDGGERKVGALEEAHGGILYLDEVADMPRETQNKILRVLVDQQFERVGGTKRVKVDVRIISSTAQ
NLEGMIAEGTFREDLFHRLSVVPVQVPALAARREDIPSLVEFFMKQIAEQAGIKPRKIGPDAMAVLQAHS
WPGNLRQLRNNVERLMILTRGDDPDELVTADLLPAEIGDTLPRAPTESDQHIMALPLREARERFEKEYLI
AQINRFGGNISRTAEFVGMERSALHRKLKSLGV

I want to put each block of letters into a list as a separate item. For example:

List contents:

List[0] = MGHEILIVDDEPDIRLLVEGILRDEGYETRLAGDSDSAISAFRARRPSLVILDVWLQGSRLDGLGILQAI
QGEEPVVPTIMISGHGTIETAVAALQHGAYDFIEKPFQSDRLLLVVRRALEASRLARENAELRLRAGPEA
MLYGDSPVIAGVRNQIERVAPSGSRVLISGAAGAGKEVAARMIHARSPGPKAFIALNCATLAPGRFEEEL
FGIEGAPDGTGRRTGVLERAHGGTLLLDEVSDMPIETQGKIVRALQDQSFERVGGASRVKVDVRVLAATN
RDLQEAIAAGRFREDLYYRLAVVPLRVPSLRERREDIPGLARLFLRRAAENAGLPLRDLSGDAVAALQSY
DWPGNARELRNLMERLLIMMPGNGSDLIRAEMLPPSVGQGAPALLKFDPAADVMGLPLREARDLFETQYL
QAQLLRFGGNISRTAGFVGMERSALHRKLKQLGVTSEERGAG

List[1] = MAHDILIVDDEADIRVLIAGILEDEGHSTREAANADEALERIRARRPSLVIQDIWLQGSRLDGLGVLDEI
KREHPDVPVVMISGHGTIETAVQAIKQGAYDFIEKPFKADRLLLVVDRAIESARLKRENQELRVRSGSTG
DLVGISPALVQIRQTIERVAPTNSRVLITGPAGSGKEVAARMIHAHSRRTEGPFVVVNCAAMHPDRMEIE
LFGTEYGADGSTSPRKIGTFEQAHSGTLLLDEVADMPLETQGKIVRVLQDQTFERVGGGKRVEVDVRVIA
TTNRDLQSEMIAGHFREDLFYRLNVVPIRMPALRDGKEDIPLLARQFMQLAAQLAGVPPRPLGEDALAAL
QAYDWPGNVRQLRNAIDWLLIMAPGDWRDPVRADMLPSEIGAITPAVLRWEKSSEIMTLPLREARELFER
EYLLAQVNRFAGNISRTAAFVGMERSALHRKLKLLGINTDEKVR

List[2] = MAADILVVDDEVDIRDLVAGILSDEGHETRTAFDADSALAAINDRAPRLVFLDIWLQGSRLDGLALLDEI
KKQHPELPVVMISGHGNIETAVSAIRRGAYDFIEKPFKADRLILVAERALETSKLKREVSDLRKRTGDQL
ELVGTSLAMNQLRQTIERVAPTNSRIMITGPSGAGKELVARTIHAQSSRANGPFVTVNAATITPERMEIE
LFGTEMDGGERKVGALEEAHGGILYLDEVADMPRETQNKILRVLVDQQFERVGGTKRVKVDVRIISSTAQ
NLEGMIAEGTFREDLFHRLSVVPVQVPALAARREDIPSLVEFFMKQIAEQAGIKPRKIGPDAMAVLQAHS
WPGNLRQLRNNVERLMILTRGDDPDELVTADLLPAEIGDTLPRAPTESDQHIMALPLREARERFEKEYLI
AQINRFGGNISRTAEFVGMERSALHRKLKSLGV

But I am struggling to split the blocks and put them into a list. My code looks like this:

import re

myfile = open('seq.fasta', 'r').read()

regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(myfile)]

for m in matches:
    onlySequences = (m[1])

print(onlySequences)

The variable onlySequences ends up holding only the last block of letters. How can I keep all of the blocks, each one as an item in a list?

2 answers:

Answer 0: (score: 0)

You are overwriting onlySequences on every iteration of the for loop, so only the last value survives. Maybe you just need this:

matches = [m.groups()[1] for m in regex.finditer(myfile)]
print(matches)

Or modify your code like this:

matches = [m.groups() for m in regex.finditer(myfile)]
onlySequences = [m[1] for m in matches]
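
Note that each captured group still contains the line breaks from the file, because the character class `[A-Z\n\r]` matches them. If one contiguous string per record is wanted, the line breaks can be removed afterwards. A minimal sketch of that idea, using a short made-up inline sample in place of seq.fasta (the `sample` text and the name `only_sequences` are illustrative assumptions, not part of the question):

```python
import re

# Made-up, shortened records standing in for the contents of seq.fasta.
sample = """>seq1| example header
MGHEILIVDD
EPDIRLLVEG

>seq2| example header
MAHDILIVDD
EADIRV
"""

# Same pattern as in the question: group 1 is the header, group 2 the sequence block.
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)

# Take group 2 of every match and drop the embedded line breaks.
only_sequences = [re.sub(r'[\n\r]+', '', m.group(2)) for m in regex.finditer(sample)]

print(only_sequences)  # ['MGHEILIVDDEPDIRLLVEG', 'MAHDILIVDDEADIRV']
```

If the line breaks should be kept as in the expected output shown in the question, `m.group(2).strip()` alone would only trim the trailing blank lines.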

Answer 1: (score: 0)

You don't need a regular expression to do this. A better approach is to read the file line by line.
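
A minimal sketch of what reading line by line could look like. This is an illustration of the approach, not necessarily the code the answer had in mind; the helper name `read_fasta_sequences` and the `demo` lines are hypothetical. It assumes every record starts with a header line beginning with `>` and that blank lines can be ignored:

```python
def read_fasta_sequences(lines):
    """Collect one string per FASTA record from an iterable of lines."""
    sequences = []
    current = None  # pieces of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # A header starts a new record, so store the previous one first.
            if current is not None:
                sequences.append(''.join(current))
            current = []
        elif line and current is not None:
            # Non-empty sequence line belonging to the current record.
            current.append(line)
    # Store the last record, which has no header after it.
    if current is not None:
        sequences.append(''.join(current))
    return sequences


# Usage with the file from the question (a file object iterates line by line):
# with open('seq.fasta') as handle:
#     only_sequences = read_fasta_sequences(handle)

# Made-up, shortened lines for demonstration.
demo = [">seq1| example header\n", "MGHE\n", "ILIV\n", "\n",
        ">seq2| example header\n", "MAHD\n"]
print(read_fasta_sequences(demo))  # ['MGHEILIV', 'MAHD']
```

This joins the lines of each record into one contiguous string; replacing `''.join(current)` with `'\n'.join(current)` would keep the line breaks instead.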
