Question

I have some data that looks like this:

PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red 
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.

PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
      innere2.longen@asklepios.com

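I want to write a regular expression that matches the text following PMID, TI, and AB.

Is it possible to get all of these with a single regular expression?

I have spent nearly a whole day trying to work out a regex, and the closest I could get is: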

reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M):
    print(i.groupdict())

It returns a match only for the second "set" of data, not for all of them.

Any ideas? Thanks!

Answer 1

How about:

import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print(i.groupdict())

Output:

{'pmid': '19587274', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': None}
{'pmid': '19583148', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}
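
Each match above carries a single field, so the fields still have to be grouped per record. A minimal sketch of one way to do that (Python 3), assuming every record begins with its PMID line. The `records` list and the shortened sample `data` are illustrative and not part of the original answer:

```python
import re

# Shortened sample in the same layout as the question's data (illustrative).
data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment,\n"
    "      sensory evidence must be continuously acquired.\n"
    "AD  - Perception and Cognition Laboratory.\n"
    "\n"
    "PMID- 19583148\n"
    "OWN - NLM\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "PG  - 482-6\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.\n"
)

reg4 = re.compile(
    r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)',
    re.MULTILINE | re.DOTALL)

records = []
for m in reg4.finditer(data):
    name = m.lastgroup          # name of the one named group that matched
    if name == 'pmid':          # assumption: a PMID line opens a new record
        records.append({})
    if records:                 # ignore fields that appear before any PMID
        records[-1][name] = m.group(name)

for record in records:
    print(record)
```

This relies on `Match.lastgroup`, which names the group that matched because only one of the three alternatives can succeed per match.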

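Edit

As a verbose RE, to make it easier to understand (I think verbose REs should be used for anything but the simplest of expressions, but that's just my opinion!):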

#!/usr/bin/python
import re
reg4 = re.compile(r'''
        ^                     # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        (?:                   # Non capturing group with multiple options, first option:
            PMID-\s           # Literal "PMID-" followed by a space
            (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        |                     # Next option:
            TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
            (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
            ^PG               # The characters PG at the start of a line
        |                     # Next option
            AB\s{2}-\s        # "AB  - "
            (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
            ^AD               # "AD" at the start of a line
        )
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

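Note that you could replace ^PG and ^AD with ^\S to make it more general (you want to match everything up to the first non-whitespace character at the start of a line).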
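
A sketch of that generalization (Python 3), assuming continuation lines are always indented. The sample record reuses values from the question but is rearranged for illustration, so that tags other than PG and AD follow the title and abstract:

```python
import re

# Rearranged, shortened record (illustrative): TI is followed by OWN, AB by DP.
data = (
    "PMID- 19583148\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "OWN - NLM\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
    "DP  - 2009 Jun\n"
)

general = re.compile(r'''
    ^
    (?:
        PMID-\s (?P<pmid>[0-9]+)
    |
        TI\s{2}-\s (?P<title>.*?) ^\S      # stop at the next unindented line
    |
        AB\s{2}-\s (?P<abstract>.*?) ^\S   # likewise
    )
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)

# Collect whichever named group matched in each match.
fields = {m.lastgroup: m.group(m.lastgroup) for m in general.finditer(data)}
print(fields)
```

One caveat: ^\S consumes the first character of the following line, so a TI or AB line that comes directly after another captured field would be skipped. Wrapping the terminator in a lookahead, (?=^\S), is one way around that.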

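Edit 2

If you want to capture the whole thing in a single regular expression, remove the starting (?: and the ending ), and change the | characters to .*?: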

#!/usr/bin/python
import re
reg4 = re.compile(r'''
        ^                 # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        PMID-\s           # Literal "PMID-" followed by a space
        (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        .*?               # Next part:
        TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
        (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
        ^PG               # The characters PG at the start of a line
        .*?               # Next part:
        AB\s{2}-\s        # "AB  - "
        (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
        ^AD               # "AD" at the start of a line
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

This gives:

{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}

Answer 2

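How about not using a regular expression for this task at all, and instead writing code that splits on newlines and checks the prefix codes with .startswith() and the like? The code would be longer, but everyone would be able to understand it without having to come to stackoverflow for help.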
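
A minimal sketch of that line-based approach (Python 3). It assumes, going by the sample in the question, that records are separated by blank lines and that continuation lines are indented. The function name `parse_records` and its `wanted` parameter are made up for this example:

```python
def parse_records(text, wanted=("PMID", "TI", "AB")):
    """Return one dict per blank-line-separated record, keeping only `wanted` tags."""
    records = []
    current = {}
    tag = None                        # tag of the field currently being read
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current record
            if current:
                records.append(current)
            current, tag = {}, None
        elif line[:1].isspace():      # indented line continues the previous field
            if tag in wanted:
                current[tag] += " " + line.strip()
        else:                         # "TAG - value"; split at the first hyphen
            tag, _, value = line.partition("-")
            tag = tag.strip()
            if tag in wanted:
                current[tag] = value.strip()
    if current:                       # flush the last record (no trailing blank line)
        records.append(current)
    return records


# Shortened sample in the question's layout (illustrative).
data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory evidence must\n"
    "      be continuously acquired, interpreted, and used to guide appropriate motor\n"
    "AD  - Perception and Cognition Laboratory.\n"
    "\n"
    "PMID- 19583148\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
)

records = parse_records(data)
for record in records:
    print(record)
```

Wrapped lines are joined with a single space here, so the values come out already stripped; whether that is wanted depends on how the text is used afterwards.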

Answer 3

The problem is the greedy quantifiers. Here is a more specific, non-greedy regex:

#!/usr/bin/python
import re
from pprint import pprint
data = open("testdata.txt").read()

reg4 = r'''
   ^PMID               # Start matching at the string PMID
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespace as possible
   (?P<pmid>[0-9]+)    # Capture the field "pmid", accepting only numeric characters
   .*?TI               # Next, match any character up to the first occurrence of 'TI'
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespace as possible
   (?P<title>.*?)PG    # Capture the field "title", accepting any character up to the next occurrence of 'PG'
   .*?AB               # Match any character up to the following occurrence of 'AB'
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespace as possible
   (?P<abstract>.*?)AD # Capture the field "abstract", accepting any character up to the next occurrence of 'AD'
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
   print(78 * "-")
   pprint(i.groupdict())

Output:

------------------------------------------------------------------------------
{'abstract': ' To successfully interact with objects in the environment,
   sensory evidence must\n      be continuously acquired, interpreted, and
   used to guide appropriate motor\n      responses. For example, when
   driving, a red \n',
 'pmid': '19587274',
 'title': ' Domain general mechanisms of perceptual decision making in
    human cortex.\n'}
------------------------------------------------------------------------------
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different
   diseases characterized by\n      extracellular accumulation of pathologic
   fibrillar proteins in various tissues\n',
 'pmid': '19583148',
 'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients
    with hepatic\n      amyloidosis.\n'}

After scanning, you may want to strip the whitespace from each field.
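
One way to do that clean-up, as a sketch for Python 3 (the helper name `clean` is made up here). Besides stripping the ends, it collapses the newline and indentation left by wrapped lines into single spaces, which goes a little beyond a plain strip():

```python
def clean(fields):
    """Collapse every run of whitespace in each field value to one space."""
    return {name: " ".join(value.split()) for name, value in fields.items()}


# Two fields of one groupdict() result, copied from the output above.
raw = {
    'pmid': '19583148',
    'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients '
             'with hepatic\n      amyloidosis.\n',
}

cleaned = clean(raw)
print(cleaned)
```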

Answer 4

Another regex:

reg4 = r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?(?<=AB  - )(?P<abstract>.*?)(?=AD  - )'
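
The pattern is shown without flags. It needs re.DOTALL (re.S) so that .*? can run across line ends, and each captured value keeps the newline that precedes the next tag. A small usage sketch under those assumptions (Python 3, shortened sample data):

```python
import re

# Shortened sample in the question's layout (illustrative).
data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "DP  - 2009 Jul 8\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment.\n"
    "AD  - Perception and Cognition Laboratory.\n"
)

reg4 = r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?(?<=AB  - )(?P<abstract>.*?)(?=AD  - )'

records = []
for m in re.finditer(reg4, data, re.DOTALL):
    # Strip the trailing newline that each lazy group picks up.
    records.append({name: value.strip() for name, value in m.groupdict().items()})

for record in records:
    print(record)
```

Note that this pattern hard-codes the neighbouring tags: it expects OWN right after PMID, PG after the title, and AD after the abstract.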

Answer 5

If the order of the lines can change, you can use this pattern:

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
     |  TI   \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
     |  AB   \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
     |  .+\n
     )+
""", re.MULTILINE | re.VERBOSE)

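It matches consecutive non-empty lines and captures the PMID, TI and AB items.

Item values can span multiple lines, as long as the lines after the first begin with a whitespace character.

"[^\S\n]" matches any whitespace character (\s) except a newline (\n).
".* (?:\n[^\S\n].*)*" matches the rest of the line plus any following lines that begin with a whitespace character.
".+\n" matches any other non-empty line.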
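
A short usage sketch for this pattern (Python 3). The sample data is shortened and the second record is deliberately rearranged, with AB before PMID, to show that field order does not matter; that rearrangement is an illustration, not real data layout:

```python
import re

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
     |  TI   \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
     |  AB   \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
     |  .+\n
     )+
""", re.MULTILINE | re.VERBOSE)

data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "AB  - To successfully interact with objects in the environment,\n"
    "      sensory evidence must be continuously acquired.\n"
    "AD  - Perception and Cognition Laboratory.\n"
    "\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
    "PMID- 19583148\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
)

# One match per block of consecutive non-empty lines, i.e. one per record.
records = [m.groupdict() for m in reg4.finditer(data)]
for record in records:
    print(record)
```

A field that is missing from a record comes back as None in its groupdict(), and wrapped values keep their embedded newline and indentation.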

Matching multiple patterns in a multi-line string
