Matching multiple patterns in a multi-line string

Time: 2009-09-01 09:04:21

Tags: python regex

I have some data that looks like this:

PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red 
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.

PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
      innere2.longen@asklepios.com

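I want to write a regex that matches the text following PMID, TI, and AB.

Is it possible to get all of these with a single regex?

I have spent nearly a whole day trying to work out a regex, and the closest I could get is: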

reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M): print(i.groupdict())

This only returns a match for the second "data set", not for all of the data.

Any ideas? Thanks!

5 answers:

Answer 0: (score: 2)

How about:

import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print(i.groupdict())

Output:

{'pmid': '19587274', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': None}
{'pmid': '19583148', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}

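Edit

As a verbose RE, to make it easier to understand (I think verbose REs should be used for anything but the simplest expressions, but that's just my opinion!):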

#!/usr/bin/python
import re
reg4 = re.compile(r'''
        ^                     # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        (?:                   # Non capturing group with multiple options, first option:
            PMID-\s           # Literal "PMID-" followed by a space
            (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        |                     # Next option:
            TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
            (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
            ^PG               # The characters PG at the start of a line
        |                     # Next option
            AB\s{2}-\s        # "AB  - "
            (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
            ^AD               # "AD" at the start of a line
        )
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

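Note that you could replace ^PG and ^AD with ^\S to make this more general (you want to match everything up to the first non-space character at the start of a line).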
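A minimal sketch of that generalisation, written for Python 3. It assumes an abbreviated copy of the question's data (`SAMPLE`), and the names `general` and `found` are made up for illustration. One deliberate variation: the terminator is written as a lookahead, `(?=^\S)`, so that the first character of the next field is not consumed by the match.

```python
import re

# Abbreviated copy of the question's data (shortened here for illustration).
SAMPLE = "\n".join([
    "PMID- 19587274",
    "OWN - NLM",
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.",
    "PG  - 8675-87",
    "AB  - To successfully interact with objects in the environment, sensory evidence must",
    "      be continuously acquired, interpreted, and used to guide appropriate motor",
    "AD  - Perception and Cognition Laboratory, Department of Psychology, University of",
    "      California, San Diego, La Jolla, California 92093, USA.",
    "",
    "PMID- 19583148",
    "OWN - NLM",
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic",
    "      amyloidosis.",
    "PG  - 482-6",
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by",
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.",
    "",
])

# Same alternation as the verbose RE, but each multi-line field stops at
# "a non-space character at the start of a line" rather than at a named tag.
general = re.compile(r'''
    ^
    (?:
        PMID-\s     (?P<pmid>[0-9]+)
    |
        TI\s{2}-\s  (?P<title>.*?)     (?=^\S)
    |
        AB\s{2}-\s  (?P<abstract>.*?)  (?=^\S)
    )
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)

# Collect (field name, stripped value) pairs in the order they are found.
found = []
for m in general.finditer(SAMPLE):
    for name, value in m.groupdict().items():
        if value is not None:
            found.append((name, value.strip()))

for name, value in found:
    print(name, repr(value))
```

With this form the pattern no longer depends on which tag happens to follow TI or AB, only on the convention that continuation lines are indented.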

Edit 2

If you want to capture the whole record in a single regex match, remove the opening (?: and the closing ), and change the | characters to .*?:

#!/usr/bin/python
import re
reg4 = re.compile(r'''
        ^                 # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        PMID-\s           # Literal "PMID-" followed by a space
        (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        .*?               # Skip ahead, matching as little as possible, to...
        TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
        (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
        ^PG               # The characters PG at the start of a line
        .*?               # Skip ahead again, matching as little as possible, to...
        AB\s{2}-\s        # "AB  - "
        (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
        ^AD               # "AD" at the start of a line
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

This gives:

{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n      be continuously acquired, interpreted, and used to guide appropriate motor\n      responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n      extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n      amyloidosis.\n'}

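Answer 1: (score: 2)

How about not using a regex for this task at all, and instead writing code that splits on newlines and checks the prefix codes with .startswith() and the like? The code would be longer, but everyone would be able to understand it without having to come to stackoverflow for help.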
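One possible shape for such a parser, as a Python 3 sketch. The function name `parse_records`, its `wanted` parameter, and the abbreviated `SAMPLE` data are all assumptions made for illustration. It also assumes that records are separated by blank lines and that continuation lines start with a space, as in the question's data.

```python
# Abbreviated copy of the question's data (shortened here for illustration).
SAMPLE = "\n".join([
    "PMID- 19587274",
    "OWN - NLM",
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.",
    "PG  - 8675-87",
    "AB  - To successfully interact with objects in the environment, sensory evidence must",
    "      be continuously acquired, interpreted, and used to guide appropriate motor",
    "AD  - Perception and Cognition Laboratory, Department of Psychology, University of",
    "      California, San Diego, La Jolla, California 92093, USA.",
    "",
    "PMID- 19583148",
    "OWN - NLM",
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic",
    "      amyloidosis.",
    "PG  - 482-6",
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by",
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.",
    "",
])


def parse_records(text, wanted=("PMID", "TI", "AB")):
    """Split text into records and keep only the wanted fields (hypothetical helper)."""
    records = []
    current = {}   # fields collected for the record being read
    field = None   # tag of the most recent field line, wanted or not
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
            current, field = {}, None
        elif line.startswith(" "):
            # An indented line continues the previous field's value.
            if field in wanted:
                current[field] += " " + line.strip()
        else:
            # A field line looks like "TAG - value"; split on the first hyphen.
            tag, _, value = line.partition("-")
            field = tag.strip()
            if field in wanted:
                current[field] = value.strip()
    if current:
        records.append(current)
    return records


records = parse_records(SAMPLE)
for record in records:
    print(record)
```

Joining continuation lines with a single space is a design choice of this sketch; keeping the original line breaks would only require changing the separator.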

Answer 2: (score: 0)

The problem is the greedy quantifiers. Here is a more specific, non-greedy regex:

#!/usr/bin/python
import re
from pprint import pprint
data = open("testdata.txt").read()

reg4 = r'''
   ^PMID               # Start matching at the string PMID
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespace as possible
   (?P<pmid>[0-9]+)    # Capture the field "pmid", accepting only numeric characters
   .*?TI               # next, match any character up to the first occurrence of 'TI'
   \s*?-               # as little whitespace as possible up to the next '-'
   \s*?                # as little whitespace as possible
   (?P<title>.*?)PG    # capture the field <title> accepting any character up to the next occurrence of 'PG'
   .*?AB               # match any character up to the following occurrence of 'AB'
   \s*?-               # As little whitespace as possible up to the next '-'
   \s*?                # As little whitespace as possible
   (?P<abstract>.*?)AD # capture the field <abstract> accepting any character up to the next occurrence of 'AD'
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
   print(78 * "-")
   pprint(i.groupdict())

Output:

------------------------------------------------------------------------------
{'abstract': ' To successfully interact with objects in the environment,
   sensory evidence must\n      be continuously acquired, interpreted, and
   used to guide appropriate motor\n      responses. For example, when
   driving, a red \n',
 'pmid': '19587274',
 'title': ' Domain general mechanisms of perceptual decision making in
    human cortex.\n'}
------------------------------------------------------------------------------
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different
   diseases characterized by\n      extracellular accumulation of pathologic
   fibrillar proteins in various tissues\n',
 'pmid': '19583148',
 'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients
    with hepatic\n      amyloidosis.\n'}

After scanning, you will probably want to strip the surrounding whitespace from each field.
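A short Python 3 sketch of that clean-up step, using a condensed form of the pattern above without its comments. The abbreviated `SAMPLE` data and the name `cleaned` are assumptions for illustration.

```python
import re

# Abbreviated copy of the question's data (shortened here for illustration).
SAMPLE = "\n".join([
    "PMID- 19587274",
    "OWN - NLM",
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.",
    "PG  - 8675-87",
    "AB  - To successfully interact with objects in the environment, sensory evidence must",
    "      be continuously acquired, interpreted, and used to guide appropriate motor",
    "AD  - Perception and Cognition Laboratory, Department of Psychology, University of",
    "      California, San Diego, La Jolla, California 92093, USA.",
    "",
    "PMID- 19583148",
    "OWN - NLM",
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic",
    "      amyloidosis.",
    "PG  - 482-6",
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by",
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.",
    "",
])

# The non-greedy pattern from this answer, condensed onto three lines.
reg4 = r'''
   ^PMID \s*?- \s*? (?P<pmid>[0-9]+)
   .*?TI \s*?- \s*? (?P<title>.*?)PG
   .*?AB \s*?- \s*? (?P<abstract>.*?)AD
'''

# Strip the leading space and trailing newline that each captured field carries.
cleaned = [
    {name: value.strip() for name, value in m.groupdict().items()}
    for m in re.finditer(reg4, SAMPLE, re.S | re.M | re.VERBOSE)
]

for record in cleaned:
    print(record)
```

Note that `strip()` only trims the ends of each value; the newline and indentation inside a multi-line title or abstract are left in place.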

Answer 3: (score: 0)

Another regex:

reg4 = r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?(?<=AB  - )(?P<abstract>.*?)(?=AD  - )'
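The answer does not say which flags to use. Since each `.*?` has to cross line breaks, the pattern appears to need `re.DOTALL` (`re.S`). The Python 3 sketch below checks this against an abbreviated copy of the question's data; `SAMPLE`, `records`, and `without_flag` are names made up for illustration.

```python
import re

# Abbreviated copy of the question's data (shortened here for illustration).
SAMPLE = "\n".join([
    "PMID- 19587274",
    "OWN - NLM",
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.",
    "PG  - 8675-87",
    "AB  - To successfully interact with objects in the environment, sensory evidence must",
    "      be continuously acquired, interpreted, and used to guide appropriate motor",
    "AD  - Perception and Cognition Laboratory, Department of Psychology, University of",
    "      California, San Diego, La Jolla, California 92093, USA.",
    "",
    "PMID- 19583148",
    "OWN - NLM",
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic",
    "      amyloidosis.",
    "PG  - 482-6",
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by",
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.",
    "",
])

# The lookaround-based pattern from this answer, unchanged.
reg4 = (r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?'
        r'(?<=TI  - )(?P<title>.*?)(?=PG  - ).*?'
        r'(?<=AB  - )(?P<abstract>.*?)(?=AD  - )')

# With re.DOTALL, "." also matches newlines, so one match spans a whole record.
# Each captured value ends with a newline, hence the strip().
records = [
    {name: value.strip() for name, value in m.groupdict().items()}
    for m in re.finditer(reg4, SAMPLE, re.DOTALL)
]

# Without the flag, "." stops at the end of the PMID line and nothing matches.
without_flag = re.findall(reg4, SAMPLE)

for record in records:
    print(record)
print(without_flag)
```

The lookarounds hard-code the neighbouring tags (OWN, PG, AD), so this form depends on the fields appearing in exactly that order.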

Answer 4: (score: 0)

If the order of the lines can change, you can use this pattern:

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
     |  TI   \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
     |  AB   \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
     |  .+\n
     )+
""", re.MULTILINE | re.VERBOSE)

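It will match a run of consecutive non-empty lines and capture the PMID, TI, and AB items.

An item's value can span multiple lines, as long as the lines after the first begin with a whitespace character.

  • “[^\S\n]” matches any whitespace character (\s) except a newline (\n).
  • “.* (?:\n[^\S\n].*)*” matches the rest of the line plus any following lines that begin with a whitespace character.
  • “.+\n” matches any other non-empty line.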
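A usage sketch for this pattern in Python 3, assuming an abbreviated copy of the question's data (`SAMPLE`) in which records are separated by blank lines and the last line ends with a newline. The name `matches` is made up for illustration.

```python
import re

# Abbreviated copy of the question's data (shortened here for illustration).
SAMPLE = "\n".join([
    "PMID- 19587274",
    "OWN - NLM",
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.",
    "PG  - 8675-87",
    "AB  - To successfully interact with objects in the environment, sensory evidence must",
    "      be continuously acquired, interpreted, and used to guide appropriate motor",
    "AD  - Perception and Cognition Laboratory, Department of Psychology, University of",
    "      California, San Diego, La Jolla, California 92093, USA.",
    "",
    "PMID- 19583148",
    "OWN - NLM",
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic",
    "      amyloidosis.",
    "PG  - 482-6",
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by",
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.",
    "",
])

# The order-independent pattern from this answer, unchanged.
reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
     |  TI   \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
     |  AB   \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
     |  .+\n
     )+
""", re.MULTILINE | re.VERBOSE)

# Each match covers one block of non-empty lines, that is, one record.
# A named group is None if its field never appeared in that block.
matches = [m.groupdict() for m in reg4.finditer(SAMPLE)]

for record in matches:
    print(record)
```

Because the blank line between records cannot be matched by any alternative, it acts as the record separator. Multi-line values keep their embedded newline and indentation, so they may still need to be normalised afterwards.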