Filtering information from a txt file with a regular expression

Time: 2016-05-28 10:29:23

Tags: python regex expression

I have a file that contains information, and this is what it looks like:

****ALIGNMENT****
Sequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]
Length:  201
E-value:  2.66576e-82
KYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...
+YLAMKTD+ +   +I +D+ E+  A  +L+ DA+ LG  G GT  LKW+A  AAIYLLILDRTNW+TNMLT+LL...
EYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL...

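Now I want to filter out some of this information and use it as variables. I think I should use a regular expression, but I don't know how to do that with the large amount of information on the second line (the Sequence line).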

For example, I need hitsid, protein, organism, and evalue.

The corresponding data:

hitsid = 86755972
protein = cold acclimation protein COR413-PM1
organism = Chimonanthus praecox
evalue = 2.66576e-82

So when I ask for hitsid, I want Python to print '86755972'.

Can someone help me? Thanks!

1 answer:

Answer 0 (score: 0)

Use a regular expression like this:

^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)

See the regex demo.
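To make the pattern easier to read, here is a sketch of the same expression split into commented pieces with re.VERBOSE. The name ALIGNMENT_RE and the shortened sample string are only illustrative; the sample is the first four lines of the record from the question.

```python
import re

# Same pattern as above, split up with re.VERBOSE so each part can be annotated.
# ALIGNMENT_RE is an illustrative name, not something from the original answer.
ALIGNMENT_RE = re.compile(r"""
    ^Sequence:[^|]*\|            # start of the Sequence line, up to the first pipe
    (?P<hitsid>[^|]*)\|          # hitsid: text between the first and second pipe
    \S*\s*                       # rest of the pipe-delimited ID, then spaces
    (?P<protein>[^][]*?)\s*      # protein: shortest run without square brackets
    \[(?P<organism>[^][]*)]      # organism: text inside the square brackets
    [\s\S]*?\n                   # lazily skip ahead, across line breaks
    E-value:\s*(?P<evalue>.*)    # evalue: rest of the E-value line
""", re.MULTILINE | re.VERBOSE)

sample = (
    "****ALIGNMENT****\n"
    "Sequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\n"
    "Length:  201\n"
    "E-value:  2.66576e-82\n"
)

match = ALIGNMENT_RE.search(sample)
fields = match.groupdict()
print(fields['hitsid'])    # 86755972
print(fields['organism'])  # Chimonanthus praecox
```

re.MULTILINE is what lets ^ match at the start of the Sequence line rather than only at the start of the whole string.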

Sample Python code that collects the values into a list of dictionaries (one dictionary per match):

import re
# re.MULTILINE lets ^ match at the start of every line, not only at the start of the string
p = re.compile(r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)', re.MULTILINE)
s = "****ALIGNMENT****\nSequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\nLength:  201\nE-value:  2.66576e-82\nKYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...\n+YLAMKTD+ +   +I +D+ E+  A  +L+ DA+ LG  G GT  LKW+A  AAIYLLILDRTNW+TNMLT+LL...\nEYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL..."
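# One dict of named groups per alignment record found in s
res = [m.groupdict() for m in p.finditer(s)]
for x in res:
    print(x['hitsid'])    # 86755972
    print(x['protein'])   # cold acclimation protein COR413-PM1
    print(x['organism'])  # Chimonanthus praecox
    print(x['evalue'])    # 2.66576e-82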
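If the file holds several alignment records, the same pattern can be wrapped in a small helper. This is only a sketch: the function name parse_alignments, the second record in the sample, and the conversion of evalue to float are assumptions added for illustration, not part of the original answer. For a real file you would pass in its contents (for example, the result of open(...).read()) instead of the sample string.

```python
import re

# Same pattern as in the answer, split over two string literals for readability.
PATTERN = re.compile(
    r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*'
    r'\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)',
    re.MULTILINE,
)

def parse_alignments(text):
    """Return one dict per alignment record found in text (hypothetical helper)."""
    records = []
    for m in PATTERN.finditer(text):
        record = m.groupdict()
        # Assumption: a numeric E-value is more useful than the raw string.
        record['evalue'] = float(record['evalue'])
        records.append(record)
    return records

# The first record comes from the question; the second is made up for illustration.
sample = (
    "****ALIGNMENT****\n"
    "Sequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\n"
    "Length:  201\n"
    "E-value:  2.66576e-82\n"
    "****ALIGNMENT****\n"
    "Sequence:  gi|12345|gb|XYZ00001.1| hypothetical protein [Example organism]\n"
    "Length:  100\n"
    "E-value:  1e-5\n"
)

records = parse_alignments(sample)
hitsid = records[0]['hitsid']
print(hitsid)  # 86755972
```

Note that the lazy [\s\S]*? part will keep scanning until it finds an E-value line, so a record that lacks one could be paired with the E-value of a later record. If that can happen in your data, splitting the text on the ****ALIGNMENT**** marker first may be safer.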