Python, regular expression: string inside square brackets []

Date: 2014-07-29 20:19:12

Tags: python regex string bioinformatics

In the following lines (bla: the actual content does not matter):

> blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]
> blabla|blabla|bla|blabla| blabla [Acetobacter pasteurianus IFO 3283-07]
> blabla|blabla|bla|blabla| blabla [Gardnerella vaginalis ATCC 14019]
> blabla|blabla|bla|blabla| blabla [Granulibacter bethesdensis CGDNIH1]

I am trying to extract everything inside the square brackets [], like this:

Geobacter sp. M21
Acetobacter pasteurianus IFO 3283-07
Gardnerella vaginalis ATCC 14019
Granulibacter bethesdensis CGDNIH1

Here is my code; of course it does not work. Inside the [] there are sometimes 3 and sometimes 4 "alphanumeric words", as well as characters such as "." or "-":

import re
#code...
pattern = r'[ \w+ \w+ \w+ ]'
for i in lines_:
    m = re.search ( pattern, str(i) )
    print m.group()

Is it possible to get this information with a regular expression?

3 answers:

Answer 0 (score: 7)

You don't need a regular expression here:

>>> s = '''> blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]
... > blabla|blabla|bla|blabla| blabla [Acetobacter pasteurianus IFO 3283-07]
... > blabla|blabla|bla|blabla| blabla [Gardnerella vaginalis ATCC 14019]
... > blabla|blabla|bla|blabla| blabla [Granulibacter bethesdensis CGDNIH1]'''
>>> for x in s.splitlines():
...     print x.rsplit('[')[-1].rstrip(']')
...     
Geobacter sp. M21
Acetobacter pasteurianus IFO 3283-07
Gardnerella vaginalis ATCC 14019
Granulibacter bethesdensis CGDNIH1
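
The session above is Python 2 (print statement). For Python 3, the same idea can be sketched with str.rpartition. This is only a minimal sketch, not the answerer's code: the helper name last_bracketed is hypothetical, and it assumes the bracketed text is the last thing on the line and contains no nested brackets.

```python
def last_bracketed(line):
    """Return the text after the last '[' with trailing ']' removed, or None."""
    head, sep, tail = line.rstrip().rpartition('[')
    if not sep:  # no '[' anywhere on this line
        return None
    return tail.rstrip(']')


lines = [
    "> blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]",
    "> blabla|blabla|bla|blabla| blabla [Acetobacter pasteurianus IFO 3283-07]",
]

for line in lines:
    print(last_bracketed(line))
```

Because the split happens at the last '[', a header with brackets inside the brackets, such as [Salmonella enterica subsp. 4,[5],12:i:- str. 08-1736], would return only the text after the inner '['.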

Answer 1 (score: 3)

You can pass lines_ to re.findall and use a regex pattern like the following:

\[([^\]]+)\]

Here is a breakdown of what it matches:

\[      # [
(       # The start of a capture group
[^\]]+  # One or more characters that are not ]
)       # The close of the capture group
\]      # ]

Here is a demonstration:

>>> from re import findall
>>> lines_ = """
... > blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]
... > blabla|blabla|bla|blabla| blabla [Acetobacter pasteurianus IFO 3283-07]
... > blabla|blabla|bla|blabla| blabla [Gardnerella vaginalis ATCC 14019]
... > blabla|blabla|bla|blabla| blabla [Granulibacter bethesdensis CGDNIH1]
... """
>>> findall(r"\[([^\]]+)\]", lines_)
['Geobacter sp. M21', 'Acetobacter pasteurianus IFO 3283-07', 'Gardnerella vaginalis ATCC 14019', 'Granulibacter bethesdensis CGDNIH1']
>>>
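
As a rough Python 3 sketch of the same approach (shortened to two of the sample lines, and written with a raw string so the backslashes reach the regex engine unchanged), the following also contrasts the escaped pattern with the pattern from the question. The variable names species and first are illustrative only.

```python
import re

lines_ = """
> blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]
> blabla|blabla|bla|blabla| blabla [Acetobacter pasteurianus IFO 3283-07]
"""

# Escaped brackets match a literal "[" and "]"; the group captures what lies between.
# With exactly one capture group, findall returns a list of that group's text.
species = re.findall(r"\[([^\]]+)\]", lines_)
print(species)

# Unescaped, the question's pattern is one character class (space, word
# character, or "+"), so it matches a single character, not a bracketed phrase.
first = re.search(r"[ \w+ \w+ \w+ ]", lines_)
print(repr(first.group()))
```

On this input the second search matches just the space after the leading ">", which is why the code in the question does not return the species names.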

Answer 2 (score: 0)

In the end I did this:

import re

for i in list_:
    dop = re.search(r"\[(.+)\]$", str(i))
    if dop:
        # group(1) is the text inside the brackets; group(0) would include the brackets
        species = dop.group(1)

Explanation:

\[      # [
(       # start of a capture group
.+      # One or more of any character (greedy), because some entries had brackets inside the []
        # like > bla|bla [Salmonella enterica subsp. 4,[5],12:i:- str. 08-1736]
)       # The close of the capture group
\]      # ]
$       # End of the line, so the match must finish at the final ]
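
To illustrate the difference on the nested example, here is a small Python 3 sketch comparing the negated-class pattern from the earlier answer with the greedy, end-anchored one. The variable names negated, greedy and two are illustrative, and the last input line is a made-up case rather than data from the thread.

```python
import re

line = "> bla|bla [Salmonella enterica subsp. 4,[5],12:i:- str. 08-1736]"

# [^\]]+ stops at the first "]", so the inner brackets cut the match short.
negated = re.search(r"\[([^\]]+)\]", line)
print(negated.group(1))

# .+ is greedy and $ anchors the closing "]" at the end of the line,
# so the capture runs from the first "[" to the final "]".
greedy = re.search(r"\[(.+)\]$", line)
print(greedy.group(1))

# Hypothetical line with two separate bracket groups: the greedy pattern
# spans both of them.
two = "x [first] y [second]"
print(re.search(r"\[(.+)\]$", two).group(1))
```

The greedy form suits headers where the species name is the only bracketed field and always ends the line. If a line could carry several independent bracket groups, it would merge them, as the last call shows.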

Thanks, everyone, for your help.