Regex - more compact/elegant form

Time: 2015-09-29 00:51:22

Tags: regex pcre

I would like to write this regex (later simplified form) in a more compact, elegant, and systematic form. PCRE or Python (newer engine) is preferred. In short, I would like to capture each artery name (iliac, femoral, popliteal and so on), regardless of the string between them. Ideally, the resulting regex won't depend on any particular regex flavor.

LE2: An even more simplified regex, which still doesn't work correctly: https://www.regex101.com/r/cK5wB6/7. I've eliminated the DEFINE section: it was added only for modularity, and DEFINE is not compatible with Python anyway (the newer V1 engine added this feature). I want to capture all the artery names, the equivalent of getting a vector of all artery names, regardless of the number of names or the strings between them.

(arteries:.{0,25}?)
((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*
(?<artfinal>(?&art))

The problem is that some arteries are still not recognized correctly (at least visually). I'm trying to capture those names without explicitly writing out separate capturing groups.

LE4: The last variant actually ignores all the names except the first and the last two.

2 answers:

Answer 0 (score: 2)

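First of all, a regex-flavor-independent pattern is a myth. Regex engines differ and have different features, and even the same pattern, using only tokens common to two or more regex engines, can return different results.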

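An example with the Python regex module, which has some interesting features, such as the ability to use a set in the pattern (\L<arteries>) and the ability to store the captures of a repeated group: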

import regex

s = '''arteries: jhjh iliac jdfd femoral 
arteries: sdsdsd iliac jdfd femoral fd d popliteal
arteries: hgv  popliteal,sddsdsds iliac  tibial nkjkknperoneal nkjkkn
arteries: iliac,  peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk   
arteries:m bkjnkjnperoneal  mm femoral jnnbn n right femoralbjkkbb   jk'''

arteries_set = ['femoral', 'iliac', 'peroneal', 'popliteal', 'tibial']

p = regex.compile(r'^arteries: (?: [^\w\n]* (?>\w+[^\w\n]+)*? (\L<arteries>) \M)+', regex.M | regex.I | regex.X, arteries=arteries_set)

for m in p.finditer(s):
    print(m.captures(1))

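I deliberately removed the "less than 25 characters" condition in order to build a more efficient pattern, but feel free to replace [^\w\n]* (?>\w+[^\w\n]+)*? with .{0,25}? \m

(\m and \M are word boundaries for the start and the end of a word, respectively.)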

Answer 1 (score: 1)

  

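  I want to capture all the artery names
  The problem is that some arteries are still not recognized correctly (at least visually)

The problem with this regex: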

((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*

is that the group art keeps overwriting its capture, so only the last match survives. This is the expected behaviour, by design.
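The overwriting can be seen with a minimal sketch in Python's standard re module. The pattern here is a simplified stand-in for the one above, and the sample string is made up for illustration; neither comes from the question.

```python
import re

ARTERY = r"iliac|femoral|popliteal|peroneal|tibial"
sample = "iliac, femoral, tibial"  # made-up input for illustration

# A capturing group inside a repeated group: each iteration overwrites "art",
# so only the capture from the final iteration is kept.
repeated = re.compile(r"(?:\W*(?P<art>" + ARTERY + r"))+")
last_only = repeated.match(sample).group("art")

# Matching one artery per match keeps every name.
one_by_one = re.findall(r"\b(?:" + ARTERY + r")\b", sample)

print(last_only)
print(one_by_one)
```

The repeated group reports only one name, while matching one artery per match returns all of them.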

  

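  I'd like to recognize all the arteries given in the list (see DEFINE).
  The same artery can appear 1...n times, at any position.
  The string between artery names can be at most 25 characters.

  As for the flavor, let's stick with PCRE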

  

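If you're using PCRE, instead of matching all the arteries in a single match, I'd suggest matching one artery at a time. To achieve this, we can use \G to match at the end of the last match.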

Regex:

/\G                  # Match anchor (BoS or EoLastMatch)
(?:
    (?!^)            # With previous match
  |
    .*?              # Or first occurrence
    arteries:        #  of arteries:
)

.{1,25}?             # Separated by max 25 chars

(?P<art>             # Group 1 (capture 1 artery)
      \b             # List of arteries
      (?:iliac|femoral|popliteal|peroneal|tibial)
      \b             #  in between word boundaries
                     # Modif: global, caseless, singleline, extra
)/gixs

This captures each artery in the group art (group 1), one artery per match.

A note on other flavors:

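As for compatibility with other regex flavors, you can loop over each match in your code to emulate \G (which is not available in every flavor). Another option is to split the text with this expression: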

(arteries:|\b(?:iliac|femoral|popliteal|peroneal|tibial)\b)

Then check the length of each token to guarantee that there are no more than 25 characters in between.
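A rough sketch of the split approach in Python's standard re module follows. The helper name arteries_by_split and the constant MAX_GAP are hypothetical, and the 1 to 25 character bound is an assumption that mirrors .{1,25}? from the PCRE pattern.

```python
import re

# The split expression from above; the capturing group keeps the delimiters.
SPLIT = re.compile(
    r"(arteries:|\b(?:iliac|femoral|popliteal|peroneal|tibial)\b)", re.I
)
MAX_GAP = 25  # assumed limit, mirroring .{1,25}? in the PCRE version


def arteries_by_split(text):
    """Return one list of artery names per 'arteries:' marker (hypothetical helper)."""
    tokens = SPLIT.split(text)
    groups = []
    current = None  # None means we are not inside a valid chain
    # With one capturing group, split() alternates: gap, delimiter, gap, ...
    for i in range(1, len(tokens), 2):
        gap, token = tokens[i - 1], tokens[i]
        if token.lower() == "arteries:":
            current = []  # a new marker starts a new chain
            groups.append(current)
        elif current is not None:
            if 1 <= len(gap) <= MAX_GAP:
                current.append(token)
            else:
                current = None  # gap too long: the chain is broken
    return groups


demo = ("arteries: jhjh iliac jdfd femoral\n"
        "arteries: x popliteal " + "z" * 30 + " tibial")
result = arteries_by_split(demo)
print(result)
```

Once a gap exceeds the limit, the rest of that chain is dropped until the next arteries: marker, which is the same stopping behaviour as the \G-anchored pattern.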

  

  The code has to be migrated to Python one (not so distant) day

Update: migrating to Python:

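You can use \G in Python if you use the regex module, which implements it. However, if you do use that module, you can also take advantage of its ability to retrieve the repeated captures of a group with the .captures method. Check @CasimiretHippolyte's answer, which is a perfect example of using captures in this case.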

On the other hand, if you stick to the standard re module, I'd suggest looping over each match to emulate the same behaviour.

Code:

import re

text = '''arteries: jhjh iliac jdfd femoral 
arteries: sdsdsd iliac jdfd femoral fd d popliteal
some arteries: hgv  popliteal,sddsdsds iliac  tibial nkjkknperoneal nkjkkn
arteries: iliac,  peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk   
arteries:m bkjnkjnperoneal  mm femoral jnnbn n right femoralbjkkbb   jk'''
n = 0

pattern_from = re.compile( r'arteries:', re.I)
pattern_token = re.compile( r'.{1,25}?\b(iliac|femoral|popliteal|peroneal|tibial)\b', re.I)

for match_from in pattern_from.finditer(text):
    n = n + 1
    print( '\nMatch #%s:' % n, end="")
    match_token = pattern_token.match( text, match_from.end())
    while match_token:
        print( '[%s:%s]="%s" ' % (match_token.start(1), match_token.end(1), match_token.group(1)), end="")
        match_token = pattern_token.match( text, match_token.end())

Output:

Match #1:[15:20]="iliac" [26:33]="femoral" 
Match #2:[52:57]="iliac" [63:70]="femoral" [76:85]="popliteal" 
Match #3:[106:115]="popliteal" [125:130]="iliac" [132:138]="tibial" 
Match #4:[171:176]="iliac" [179:187]="peroneal" 
Match #5:[243:251]="peroneal" [257:264]="femoral" [267:273]="tibial" [282:288]="tibial" 
Match #6:[344:351]="femoral"