I have some data that looks like this:
PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.
PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
      innere2.longen@asklepios.com
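I want to write a regular expression that matches the text following PMID, TI and AB.
Is it possible to get all of these with a single regular expression?
I have spent nearly a whole day trying to work out a regex, and the closest I have come is:
reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M):
    print(i.groupdict())
This returns a match only for the second "data set", not for all of the data.
Any ideas? Thanks!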
Answer 0 (score: 2)
How about:
import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print(i.groupdict())
Output:
{'pmid': '19587274', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n be continuously acquired, interpreted, and used to guide appropriate motor\n responses. For example, when driving, a red \n', 'title': None}
{'pmid': '19583148', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n amyloidosis.\n'}
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}
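Edit
As a verbose RE, to make it easier to understand (I think verbose REs should be used for anything but the simplest of expressions, but that's just my opinion!):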
#!/usr/bin/python
import re
reg4 = re.compile(r'''
    ^                        # Start of a line (due to re.MULTILINE, this may match at the start of any line)
    (?:                      # Non-capturing group with multiple options, first option:
        PMID-\s              # Literal "PMID-" followed by a space
        (?P<pmid>[0-9]+)     # Then a string of one or more digits, group as 'pmid'
    |                        # Next option:
        TI\s{2}-\s           # "TI", two spaces, a hyphen and a space
        (?P<title>.*?)       # The title, a non-greedy match that will capture everything up to...
        ^PG                  # The characters PG at the start of a line
    |                        # Next option:
        AB\s{2}-\s           # "AB", two spaces, a hyphen and a space
        (?P<abstract>.*?)    # The abstract, a non-greedy match that will capture everything up to...
        ^AD                  # "AD" at the start of a line
    )
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())
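Note that you could replace ^PG and ^AD with ^\S to make it more general (you want to match everything up to the first non-whitespace character at the start of a line).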
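A minimal sketch of that generalisation, assuming the usual MEDLINE layout (tags padded to four characters, continuation lines indented). The sample text and the names SAMPLE, GENERAL and fields are made up for illustration. It uses a lookahead, (?=^\S|\Z), instead of a consuming ^\S, so the first character of the next tag is not swallowed and a field at the very end of the input still matches; that is a variation on the suggestion, not the exact pattern from this answer.

```python
import re

# Made-up sample in MEDLINE layout: four-character tags, indented continuations.
SAMPLE = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human\n"
    "      cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory\n"
    "      evidence must be continuously acquired.\n"
    "AD  - Perception and Cognition Laboratory.\n"
)

GENERAL = re.compile(r'''
    ^
    (?:
        PMID-\s     (?P<pmid>[0-9]+)
      |
        TI\s{2}-\s  (?P<title>.*?)     (?=^\S|\Z)   # stop before the next tag line
      |
        AB\s{2}-\s  (?P<abstract>.*?)  (?=^\S|\Z)   # same stop condition
    )
''', re.MULTILINE | re.DOTALL | re.VERBOSE)

# Keep only the group that actually matched in each hit.
fields = [
    {name: value for name, value in match.groupdict().items() if value is not None}
    for match in GENERAL.finditer(SAMPLE)
]
for field in fields:
    print(field)
```

Because the stop condition no longer names PG or AD, the same pattern keeps working if other tags follow the title or the abstract.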
Edit 2
If you want to capture the whole thing in one regex, remove the leading (?:, the trailing ), and change the | characters to .*?:
#!/usr/bin/python
import re
reg4 = re.compile(r'''
    ^                    # Start of a line (due to re.MULTILINE, this may match at the start of any line)
    PMID-\s              # Literal "PMID-" followed by a space
    (?P<pmid>[0-9]+)     # Then a string of one or more digits, group as 'pmid'
    .*?                  # Skip ahead (as little as possible) to the next part:
    TI\s{2}-\s           # "TI", two spaces, a hyphen and a space
    (?P<title>.*?)       # The title, a non-greedy match that will capture everything up to...
    ^PG                  # The characters PG at the start of a line
    .*?                  # Skip ahead (as little as possible) to the next part:
    AB\s{2}-\s           # "AB", two spaces, a hyphen and a space
    (?P<abstract>.*?)    # The abstract, a non-greedy match that will capture everything up to...
    ^AD                  # "AD" at the start of a line
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())
This gives:
{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n be continuously acquired, interpreted, and used to guide appropriate motor\n responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n amyloidosis.\n'}
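Answer 1 (score: 2)
How about not using regular expressions for this task at all, and instead writing ordinary code that splits the text on newlines and checks the prefix codes with .startswith() and the like? The code will be longer, but everyone will be able to understand it without having to come to Stack Overflow for help.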
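The idea above could be sketched roughly as follows. This is one possible reading of it, not the answerer's code: the function name parse_records, the wanted parameter and the sample text are invented here, and it assumes that every record begins with a PMID line and that continuation lines start with a space or tab.

```python
# Made-up sample in MEDLINE layout: four-character tags, indented continuations.
SAMPLE = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory\n"
    "      evidence must be continuously acquired.\n"
    "AD  - Perception and Cognition Laboratory, University of\n"
    "      California, San Diego.\n"
    "PMID- 19583148\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
)


def parse_records(text, wanted=("PMID", "TI", "AB")):
    """Return one dict per record, keeping only the tags listed in `wanted`."""
    records = []
    current = None  # dict for the record being built
    tag = None      # tag of the most recent tag line
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines between records
        if line.startswith((" ", "\t")):
            # Continuation line: glue it onto the previous field if we keep it.
            if current is not None and tag in wanted:
                current[tag] += " " + line.strip()
            continue
        # Tag line such as "TI  - text": the tag is everything before the first hyphen.
        head, _, value = line.partition("-")
        tag = head.strip()
        if tag == "PMID":
            current = {}  # assumption: a PMID line opens a new record
            records.append(current)
        if current is not None and tag in wanted:
            current[tag] = value.strip()
    return records


records = parse_records(SAMPLE)
for record in records:
    print(record)
```

Folding continuation lines into a single space-separated string is a design choice made here; keep the raw lines instead if the original line breaks matter.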
Answer 2 (score: 0)
The problem is the greedy quantifiers. Here is a more specific, non-greedy regex:
#!/usr/bin/python
import re
from pprint import pprint
with open("testdata.txt") as handle:
    data = handle.read()
reg4 = r'''
    ^PMID                 # Start matching at the string PMID
    \s*?-                 # As little whitespace as possible up to the next '-'
    \s*?                  # As little whitespace as possible
    (?P<pmid>[0-9]+)      # Capture the field "pmid", accepting only numeric characters
    .*?TI                 # Next, match any character up to the first occurrence of 'TI'
    \s*?-                 # As little whitespace as possible up to the next '-'
    \s*?                  # As little whitespace as possible
    (?P<title>.*?)PG      # Capture the field "title", accepting any character up to the next occurrence of 'PG'
    .*?AB                 # Match any character up to the following occurrence of 'AB'
    \s*?-                 # As little whitespace as possible up to the next '-'
    \s*?                  # As little whitespace as possible
    (?P<abstract>.*?)AD   # Capture the field "abstract", accepting any character up to the next occurrence of 'AD'
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
    print(78 * "-")
    pprint(i.groupdict())
Output:
------------------------------------------------------------------------------
{'abstract': ' To successfully interact with objects in the environment,
sensory evidence must\n be continuously acquired, interpreted, and
used to guide appropriate motor\n responses. For example, when
driving, a red \n',
'pmid': '19587274',
'title': ' Domain general mechanisms of perceptual decision making in
human cortex.\n'}
------------------------------------------------------------------------------
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different
diseases characterized by\n extracellular accumulation of pathologic
fibrillar proteins in various tissues\n',
'pmid': '19583148',
'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients
with hepatic\n amyloidosis.\n'}
After scanning, you will probably want to strip the whitespace from each field.
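A small sketch of that clean-up step. The helper name clean and the raw dictionary are illustrative only. Note that str.strip() by itself removes just the leading and trailing whitespace; collapsing the embedded newlines and indentation as well, as done here, is an extra assumption about what is wanted.

```python
def clean(value):
    """Collapse every run of whitespace (newlines included) into one space."""
    return " ".join(value.split())


# Made-up captures shaped like the groupdict() results shown above.
raw = {
    "pmid": "19587274",
    "title": " Domain general mechanisms of perceptual decision making in human\n"
             "      cortex.\n",
    "abstract": " To successfully interact with objects in the environment,\n"
                "      sensory evidence must be continuously acquired.\n",
}

cleaned = {key: clean(value) for key, value in raw.items()}
print(cleaned)
```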
Answer 3 (score: 0)
Another regex:
reg4 = r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?(?<=AB  - )(?P<abstract>.*?)(?=AD  - )'
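This answer does not show how the pattern is applied. A hedged usage sketch follows, assuming the MEDLINE layout in which two-letter tags are followed by two spaces, and assuming every record contains OWN, PG and AD lines in that order. Since the fields span several lines, the pattern needs re.DOTALL so that . also matches newlines. The sample text and the whitespace clean-up are additions made here, not part of the answer.

```python
import re

# Made-up sample; each record has the OWN, PG and AD lines the pattern relies on.
SAMPLE = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory\n"
    "      evidence must be continuously acquired.\n"
    "AD  - Perception and Cognition Laboratory.\n"
    "PMID- 19583148\n"
    "OWN - NLM\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "PG  - 482-6\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.\n"
)

reg4 = re.compile(
    r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?'
    r'(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?'
    r'(?<=AB  - )(?P<abstract>.*?)(?=AD  - )',
    re.DOTALL,  # let "." cross line boundaries
)

# The captures keep trailing newlines and indentation, so normalise them.
records = [
    {name: " ".join(value.split()) for name, value in match.groupdict().items()}
    for match in reg4.finditer(SAMPLE)
]
for record in records:
    print(record)
```

The lookarounds are not anchored to the start of a line, so a tag-like string inside the text itself (for example "AD  - " within an abstract) would end a field early.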
Answer 4 (score: 0)
If the order of the lines can change, you can use this pattern:
reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
      | TI   \s*-\s* (?P<title>    .* (?:\n[^\S\n].*)* ) \n
      | AB   \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
      | .+\n
    )+
""", re.MULTILINE | re.VERBOSE)
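It matches a run of consecutive non-empty lines and captures the PMID, TI and AB items.
An item's value can span several lines, as long as every line after the first starts with a whitespace character.
"[^\S\n]" matches any whitespace character (\s) except a newline (\n).
".* (?:\n[^\S\n].*)*" matches the rest of the current line plus any following lines that start with a whitespace character.
".+\n" matches any other non-empty line.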
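A usage sketch for this pattern, with a made-up sample in which the tag order differs between records. One assumption to flag: because the pattern consumes a whole run of non-empty lines, records have to be separated by a blank line. Without one, all the lines form a single match and only the last value captured for each group survives.

```python
import re

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
      | TI   \s*-\s* (?P<title>    .* (?:\n[^\S\n].*)* ) \n
      | AB   \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
      | .+\n
    )+
""", re.MULTILINE | re.VERBOSE)

# Made-up sample: tags appear in a different order in each record,
# and a blank line separates the two records.
SAMPLE = (
    "TI  - Domain general mechanisms of perceptual decision making in human\n"
    "      cortex.\n"
    "PMID- 19587274\n"
    "AB  - To successfully interact with objects in the environment.\n"
    "AD  - Perception and Cognition Laboratory.\n"
    "\n"
    "PMID- 19583148\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis.\n"
)

matches = [match.groupdict() for match in reg4.finditer(SAMPLE)]
for groups in matches:
    print(groups)
```

Multi-line values keep their embedded newline and indentation, so they may still need the same whitespace clean-up as with the other patterns.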