How do I use a regular expression to get the content between square brackets?

Time: 2017-08-01 04:05:44

Tags: python regex bioinformatics protein-database

I have a GFF file named 50267.gff that looks like this:

#start gene g1
dog1
dog2
dog3
#protein sequence = [DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD]
#end gene g1
###
#start gene g2
cat1
cat2
cat3
#protein sequence = [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
#end gene g2
###
#start gene g3
pig1
pig2
pig3
...

I want to extract the content between the square brackets and write it to a new file named 50267.fa, like this:

>g1_50267
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>g2_50267
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC
...

2 answers:

Answer 0 (score: 0)

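You need to escape the square brackets in the regular expression, because an unescaped [ starts a character class. You can then use a capture group to get the content inside.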
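As a minimal sketch of that idea (not necessarily the exact pattern this answer had in mind), the snippet below escapes both brackets and captures what lies between them. The `sample` string is a shortened, hypothetical stand-in for the question's file, and the names `pattern` and `sequences` are chosen here for illustration.

```python
import re

# Shortened stand-in for the question's 50267.gff (hypothetical sample data).
sample = (
    "#start gene g1\n"
    "#protein sequence = [DDDD]\n"
    "#end gene g1\n"
    "#start gene g2\n"
    "#protein sequence = [CCCC\n"
    "#CCC]\n"
    "#end gene g2\n"
)

# \[ and \] match literal brackets; ([^\]]+) captures everything between them.
# A negated character class also matches newlines, so re.DOTALL is not needed.
pattern = re.compile(r"\[([^\]]+)\]")

sequences = pattern.findall(sample)
print(sequences)
```

A sequence that wraps onto a second line still contains the line break and the leading `#` of the continuation line, so it needs a clean-up step before being written out.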


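Answer 1 (score: 0)

You can use the pattern \[([^\]]+), which captures everything after an opening bracket up to, but not including, the next closing bracket: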

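import re

# Read the whole file at once so that sequences spanning several lines can be matched.
with open("50267.gff", "r") as ff:
    # Capture everything after "[" up to, but not including, the next "]".
    matches = re.findall(r'\[([^\]]+)', ff.read())

# Build the FASTA records. A wrapped sequence continues after "\n#", which is replaced by a space.
# Headers are numbered by match order (g1, g2, ...), not read from the "#start gene" lines.
records = ['>g' + str(ind + 1) + "_50267\n" + x.replace('\n#', ' ') for ind, x in enumerate(matches)]

with open('50267.fa', 'w') as fa:
    fa.write("\n".join(records))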
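
The code above numbers the records by match order. A possible variation, sketched here under assumptions, reads the gene name from each "#start gene" line instead. It assumes every gene block contains exactly one bracketed protein sequence; a block without one would be paired with the next block's sequence. Unlike the code above, it joins wrapped lines without a space, which is an assumption about the desired FASTA output. The function name `gff_to_fasta` and the `sample` text are hypothetical.

```python
import re


def gff_to_fasta(text, file_id):
    """Return FASTA text built from '#start gene' / '#protein sequence' blocks."""
    # Group 1: gene name after "#start gene"; group 2: the bracketed sequence.
    # [^\[]*? lazily skips the lines between the gene header and the first "[".
    block = re.compile(r"#start gene (\S+)[^\[]*?\[([^\]]+)\]")
    records = []
    for name, seq in block.findall(text):
        # Drop the line break and the leading "#" of each continuation line.
        seq = seq.replace("\n#", "")
        records.append(f">{name}_{file_id}\n{seq}")
    return "\n".join(records)


# Shortened stand-in for the question's file (hypothetical sample data).
sample = (
    "#start gene g1\ndog1\n#protein sequence = [DDDD]\n#end gene g1\n###\n"
    "#start gene g2\ncat1\n#protein sequence = [CCCC\n#CCC]\n#end gene g2\n###\n"
    "#start gene g3\npig1\n"
)

fasta = gff_to_fasta(sample, "50267")
print(fasta)
```

To apply it to the real file, read 50267.gff into a string, pass it to `gff_to_fasta` together with "50267", and write the returned text to 50267.fa.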