How to capture the lookahead and lookbehind text in a Python regular expression

Time: 2016-04-26 00:21:41

Tags: python regex regex-lookarounds

Here is a string:

str = "Academy \nADDITIONAL\nAwards and Recognition: Greek Man of the Year 2011 Stanford PanHellenic Community, American Delegate 2010 Global\nEngagement Summit, Honorary Speaker 2010 SELA Convention, Semi-Finalist 2010 Strauss Foundation Scholarship Program\nComputer Skills: Competency: MATLAB, MySQL/PHP, JavaScript, Objective-C, Git Proficiency: Adobe Creative Suite, Excel\n(highly advanced), PowerPoint, HTML5/CSS3\nLanguages: Fluent English, Advanced Spanish\n\x0c"

I want to capture from "ADDITIONAL" to "Languages", so I wrote this regex:

regex = r'(?<=\n(ADDITIONAL|Additional)\n)[\s\S]+?(?=\n(Languages|LANGUAGES)\n*)'

However, it only captures what lies between them (the [\s\S]+ part). It does not capture ADDITIONAL and Languages. What am I missing here?

6 Answers:

Answer 0 (score: 3)

Your regex is

regex = r'(?<=\n(ADDITIONAL|Additional)\n)[\s\S]+?(?=\n(Languages|LANGUAGES)\n*)'

and your string is

Academy \nADDITIONAL\nAwards and Recognition: ... \nLanguages:
                     ^^                          ^^
                     ||                          ||
Match Position:-(?<=\n(ADDITIONAL|Additional)\n)(?=\n(Languages|LANGUAGES)\n*)

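So [\s\S]+? will contain whatever lies between these two positions, excluding ADDITIONAL and LANGUAGES.

You just need to find the starting position of ADDITIONAL and the ending position of LANGUAGES. This can be done with the following regex: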
(?=\n(ADDITIONAL|Additional)\n)([\s\S]+?)(?<=\n(Languages|LANGUAGES)\b)

Also, if you want everything to be returned by [\s\S]+? alone, you can use non-capturing groups for Additional and Languages:

(?=\n(?:ADDITIONAL|Additional)\n)[\s\S]+?(?<=\n(?:Languages|LANGUAGES)\b)

Academy \nADDITIONAL\nAwards and Recognition: ... \nLanguages:
        ^^                                                  ^^
        ||                                                  ||
(?=\n(ADDITIONAL|Additional)\n)             (?<=\n(Languages|LANGUAGES))

Python code

import re

p = re.compile(r'(?=\n(?:ADDITIONAL|Additional)\n)[\s\S]+?(?<=\n(?:Languages|LANGUAGES)\b)', re.MULTILINE)
test_str = "Academy \nADDITIONAL\nAwards and Recognition: Greek Man of the Year 2011 Stanford PanHellenic Community, American Delegate 2010 Global\nEngagement Summit, Honorary Speaker 2010 SELA Convention, Semi-Finalist 2010 Strauss Foundation Scholarship Program\nComputer Skills: Competency: MATLAB, MySQL/PHP, JavaScript, Objective-C, Git Proficiency: Adobe Creative Suite, Excel\n(highly advanced), PowerPoint, HTML5/CSS3\nLanguages: Fluent English, Advanced Spanish\n\x0c"
print(re.findall(p, test_str))

Ideone Demo
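
The effect of the non-capturing groups is easiest to see with re.findall. Below is a minimal sketch that uses a shortened, hypothetical string instead of the full one from the question. The variable names are made up for illustration.

```python
import re

# Shortened, hypothetical stand-in for the string in the question.
text = "Academy \nADDITIONAL\nAwards\nLanguages: English\n"

# With capturing groups, findall returns one tuple of groups per match.
with_groups = re.findall(
    r'(?=\n(ADDITIONAL|Additional)\n)([\s\S]+?)(?<=\n(Languages|LANGUAGES)\b)',
    text,
)

# With non-capturing groups, findall returns the whole match as a string.
without_groups = re.findall(
    r'(?=\n(?:ADDITIONAL|Additional)\n)[\s\S]+?(?<=\n(?:Languages|LANGUAGES)\b)',
    text,
)

print(with_groups)
print(without_groups)
```

Note that the match starts at the newline before ADDITIONAL, so the result has a leading "\n" that you may want to strip.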

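Answer 1 (score: 1)

It is captured, but it is not part of capture group 0, because group 0 contains only what the match consumed, that is, the part of the match that advances the current position.

Assertions do not move the position, so anything you capture inside an assertion will not be part of the match.

However, if the assertion is followed by a subexpression that consumes the same text the assertion referenced, that text does become part of the overall match.

Your current regex does not match your string. To match your string, remove the \n newlines: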

 (?<=
      ( ADDITIONAL | Additional )   # (1)
 )
 [\s\S]+? 
 (?=
      ( Languages | LANGUAGES )     # (2)
 )
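
A minimal sketch of the point above, assuming the free-spacing pattern is compiled with re.VERBOSE and run against a shortened, hypothetical string: group 0 holds only the consumed text, while the markers are still available as groups 1 and 2.

```python
import re

# Shortened, hypothetical stand-in for the string in the question.
text = "Academy \nADDITIONAL\nAwards and Recognition\nLanguages: Fluent English\n"

pattern = re.compile(
    r"""
    (?<=
        ( ADDITIONAL | Additional )   # (1)
    )
    [\s\S]+?
    (?=
        ( Languages | LANGUAGES )     # (2)
    )
    """,
    re.VERBOSE,
)

m = pattern.search(text)
print(repr(m.group(0)))        # consumed text only, markers excluded
print(m.group(1), m.group(2))  # markers captured inside the assertions
```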

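Answer 2 (score: 0)

If you want to include them in the match, don't put them in lookarounds, because the purpose of a lookaround is to test the surrounding text without including it in the match result. If you only need alternation, use ordinary non-capturing groups.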

regex = r'\n(?:ADDITIONAL|Additional)\n[\s\S]+?\n(?:Languages|LANGUAGES)\n*'

By the way, your regex requires newlines around ADDITIONAL and Languages, but your string does not contain any newlines.
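
A short sketch of this approach, assuming the input does contain real newline characters around both markers. The string here is a shortened, hypothetical one.

```python
import re

# Hypothetical input with real newlines around the markers.
text = "Academy \nADDITIONAL\nAwards\nLanguages: English\n"

regex = r'\n(?:ADDITIONAL|Additional)\n[\s\S]+?\n(?:Languages|LANGUAGES)\n*'

m = re.search(regex, text)
# The markers are consumed, so they are part of the overall match.
print(repr(m.group(0)))
```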

Answer 3 (score: 0)

Try this

(?<=ADDITIONAL\s).*?(?=\sLanguages)

Regex demo

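Explanation
(?<=…): positive lookbehind
\s: whitespace character: space, tab, newline, carriage return, vertical tab
.: any character except a newline
*: zero or more times
?: after a quantifier such as *, makes it lazy (match as few characters as possible)
(?=…): positive lookahead

Python: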

import re
p = re.compile(r'(?<=ADDITIONAL\s).*?(?=\sLanguages)', re.IGNORECASE)
test_str = "the companys direction ADDITIONAL Awards: 2010 Global Engagement Summit, Languages: Fluent Japanese"

g = re.findall(p, test_str)
print(g)  # ['Awards: 2010 Global Engagement Summit,']

Answer 4 (score: 0)

If you need to capture the content including ADDITIONAL and LANGUAGES, use a simple regex like this.

\b(ADDITIONAL .* Languages)\b

Make sure to include the re.IGNORECASE flag when you use it in your solution.

See the demo on REGEX101.
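
A minimal sketch of how this pattern might be used, assuming a single-line input with spaces (not newlines) next to the markers. The sample string is hypothetical, and the marker is lowercased to show the effect of re.IGNORECASE.

```python
import re

# Hypothetical single-line input; the pattern expects literal spaces
# after ADDITIONAL and before Languages.
text = "the companys direction additional Awards: 2010 Global Engagement Summit, Languages: Fluent Japanese"

m = re.search(r'\b(ADDITIONAL .* Languages)\b', text, re.IGNORECASE)
print(m.group(1))
```

Because . does not match a newline by default, this pattern would not span several lines unless re.DOTALL is added and the literal spaces are replaced with \s.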

Answer 5 (score: 0)

I think you are overcomplicating things, i.e.:

match = re.search("(ADDITIONAL.*?Languages)", subject, re.MULTILINE)

Regex explanation:

(ADDITIONAL.*?Languages)


Match the regex below and capture its match into backreference number 1 «(ADDITIONAL.*?Languages)»
   Match the character string “ADDITIONAL” literally (case sensitive) «ADDITIONAL»
   Match any single character that is NOT a line break character (line feed) «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character string “Languages” literally (case sensitive) «Languages»

Regex101 Demo
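
One caveat worth checking: as the explanation says, . does not match a line break, and re.MULTILINE only changes how ^ and $ behave. If the text between the markers spans several lines, re.DOTALL is the flag that lets . cross newlines. A small sketch on a shortened, hypothetical string:

```python
import re

# Shortened, hypothetical stand-in for the multi-line string in the question.
subject = "Academy \nADDITIONAL\nAwards\nLanguages: English\n"

# re.MULTILINE affects only ^ and $; '.' still stops at newlines.
no_match = re.search("(ADDITIONAL.*?Languages)", subject, re.MULTILINE)

# re.DOTALL lets '.' match newline characters as well.
match = re.search("(ADDITIONAL.*?Languages)", subject, re.DOTALL)

print(no_match)
print(repr(match.group(1)))
```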