In the following lines (bla: the actual content doesn't matter):
> blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]
> blabla|blabla|bla|blabla| blabla [Acetobacter pasteurianus IFO 3283-07]
> blabla|blabla|bla|blabla| blabla [Gardnerella vaginalis ATCC 14019]
> blabla|blabla|bla|blabla| blabla [Granulibacter bethesdensis CGDNIH1]
I am trying to extract everything inside the square brackets [], like this:
Geobacter sp. M21
Acetobacter pasteurianus IFO 3283-07
Gardnerella vaginalis ATCC 14019
Granulibacter bethesdensis CGDNIH1
Here is my code, which of course doesn't work. Inside the [] there are sometimes 3 and sometimes 4 "alphanumeric words", plus characters like "." or "-":
import re
#code...
pattern = r'[ \w+ \w+ \w+ ]'
for i in lines_:
    m = re.search(pattern, str(i))
    print m.group()
Is it possible to get this information with a regular expression?
Answer 0: (score: 7)
You don't need a regular expression here:
>>> s = '''> blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]
... > blabla|blabla|bla|blabla| blabla [Acetobacter pasteurianus IFO 3283-07]
... > blabla|blabla|bla|blabla| blabla [Gardnerella vaginalis ATCC 14019]
... > blabla|blabla|bla|blabla| blabla [Granulibacter bethesdensis CGDNIH1]'''
>>> for x in s.splitlines():
...     print(x.rsplit('[')[-1].rstrip(']'))
...
Geobacter sp. M21
Acetobacter pasteurianus IFO 3283-07
Gardnerella vaginalis ATCC 14019
Granulibacter bethesdensis CGDNIH1
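A slightly more defensive variant of the same no-regex idea, as a Python 3 sketch. The helper name `bracket_content` is hypothetical (it is not part of the answer above). It slices between the first `[` and the last `]`, and returns `None` for a line that has no bracketed part.

```python
def bracket_content(line):
    """Return the text between the first '[' and the last ']', or None."""
    start = line.find('[')
    end = line.rfind(']')
    # No '[' at all, or no ']' after it: nothing to extract.
    if start == -1 or end < start:
        return None
    return line[start + 1:end]


lines = [
    '> blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]',
    '> blabla|blabla|bla|blabla| blabla [Gardnerella vaginalis ATCC 14019]',
]
for line in lines:
    print(bracket_content(line))
```

Slicing from the first `[` to the last `]` keeps any brackets nested inside the name, which the `rsplit` one-liner would cut at.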
Answer 1: (score: 3)
You can pass lines_ to re.findall and use a regex pattern like this:
\[([^\]]+)\]
Here is a breakdown of what it matches:
\[ # [
( # The start of a capture group
[^\]]+ # One or more characters that are not ]
) # The close of the capture group
\] # ]
Here is a demonstration:
>>> from re import findall
>>> lines_ = """
... > blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]
... > blabla|blabla|bla|blabla| blabla [Acetobacter pasteurianus IFO 3283-07]
... > blabla|blabla|bla|blabla| blabla [Gardnerella vaginalis ATCC 14019]
... > blabla|blabla|bla|blabla| blabla [Granulibacter bethesdensis CGDNIH1]
... """
>>> findall(r"\[([^\]]+)\]", lines_)
['Geobacter sp. M21', 'Acetobacter pasteurianus IFO 3283-07', 'Gardnerella vaginalis ATCC 14019', 'Granulibacter bethesdensis CGDNIH1']
>>>
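In the question, lines_ is looped over, so it may be a list of lines rather than one string. A minimal sketch for that case, assuming one bracketed field per line; the names `BRACKETED` and `species_names` are hypothetical:

```python
import re

# Same pattern as above, compiled once and reused for every line.
BRACKETED = re.compile(r"\[([^\]]+)\]")


def species_names(lines):
    """Collect the first bracketed field of each line; skip lines without one."""
    names = []
    for line in lines:
        match = BRACKETED.search(str(line))
        if match:  # search() returns None when a line has no [...] part
            names.append(match.group(1))
    return names


lines_ = [
    "> blabla|blabla|bla|blabla| blabla [Geobacter sp. M21]",
    "> blabla|blabla|bla|blabla| blabla [Acetobacter pasteurianus IFO 3283-07]",
    "a line without brackets",
]
print(species_names(lines_))
```

Checking the match before calling `group` avoids the `AttributeError` that a bare `m.group()` raises on a line with no match.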
Answer 2: (score: 0)
In the end I did it like this:
for i in list_:
    dop = re.search(r"\[(.+)\]$", str(i))
    if dop:
        # group(1) is the text inside the outer brackets; group(0) would include the brackets
        species = dop.group(1)
Explanation:
\[ # [
( # start of a capture group
.+ # One or more characters because some of them had brackets inside []
# like > bla|bla [Salmonella enterica subsp. 4,[5],12:i:- str. 08-1736]
) # The close of the capture group
\] # ]
$ # Anchor the match at the end of the line
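To see why the greedy `.+` matters for names with inner brackets, here is a small Python 3 comparison on the Salmonella line quoted in the comments above. It is only a sketch; the variable names are illustrative.

```python
import re

line = "> bla|bla [Salmonella enterica subsp. 4,[5],12:i:- str. 08-1736]"

# Greedy .+ anchored with $ runs from the first '[' to the final ']'.
greedy = re.search(r"\[(.+)\]$", line)

# The negated class [^\]]+ stops at the first ']' it meets.
negated = re.findall(r"\[([^\]]+)\]", line)

print(greedy.group(1))  # Salmonella enterica subsp. 4,[5],12:i:- str. 08-1736
print(negated)          # ['Salmonella enterica subsp. 4,[5']
```

The greedy pattern assumes the bracketed field is the last thing on the line; trailing text after the closing `]` would make the `$` anchor fail.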
Thanks, everyone, for the help.