I would like to write this regex (later simplified form) in a more compact, elegant, and systematic way. PCRE or Python (newer engine) preferred. In short, I would like to capture each artery name (iliac, femoral, popliteal, and so on), regardless of the string between them. Ideally, the resulting regex won't depend on any particular regex flavor.
LE2: An even more simplified regex, but not working correctly: https://www.regex101.com/r/cK5wB6/7. I've eliminated the DEFINE section; it was added only for modularity, and DEFINE is not compatible with Python anyway (the newer, v1 engine added this feature). I want to capture all the artery names, the equivalent of getting a vector of all artery names, regardless of the number of names or the strings between them.
(arteries:.{0,25}?)
((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*
(?<artfinal>(?&art))
The problem is that some arteries are still not recognized correctly (at least visually). I'm trying to capture those names without explicitly writing capturing groups like in this.
LE4: The last variant actually ignores all names aside from the first and the last two.
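Answer 0 (score: 2)
An example with the Python regex module, which has some interesting features such as the ability to use a set in the pattern (\L<arteries>) and the ability to store the repeated captures of a group: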
import regex
s = '''arteries: jhjh iliac jdfd femoral
arteries: sdsdsd iliac jdfd femoral fd d popliteal
arteries: hgv popliteal,sddsdsds iliac tibial nkjkknperoneal nkjkkn
arteries: iliac, peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk
arteries:m bkjnkjnperoneal mm femoral jnnbn n right femoralbjkkbb jk'''
arteries_set = ['femoral', 'iliac', 'peroneal', 'popliteal', 'tibial']
p = regex.compile(r'^arteries: (?: [^\w\n]* (?>\w+[^\w\n]+)*? (\L<arteries>) \M)+', regex.M | regex.I | regex.X, arteries=arteries_set)
for m in p.finditer(s):
    # captures(1) returns every value group 1 matched, not just the last one
    print(m.captures(1))
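I deliberately removed the "fewer than 25 characters" condition to build a more efficient pattern, but feel free to replace [^\w\n]* (?>\w+[^\w\n]+)*? with .{0,25}? \m (\m and \M are word boundaries for the start and the end of a word, respectively).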
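Answer 1 (score: 1)
"I want to capture all the artery names,"
"The problem is that some arteries are still not recognized correctly (at least visually)"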
The problem with this regex:
((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*
is that group art keeps overwriting its capture with the last match. This is the expected behaviour, by design.
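To make the overwriting concrete, here is a minimal sketch using Python's standard re module. The sample line is taken from the test data in this thread; the variable names and the slightly simplified pattern are illustrative assumptions, not the asker's exact regex.

```python
import re

line = "arteries: jhjh iliac jdfd femoral"
names = r"iliac|femoral|popliteal|peroneal|tibial"

# One overall match with a repeated group: each iteration overwrites
# the group's capture, so only the last artery survives.
repeated = re.match(r"arteries:(?:.*?(" + names + r"))+", line)
last_only = repeated.group(1)

# Matching one artery at a time keeps every occurrence.
all_names = re.findall(r"\b(?:" + names + r")\b", line)

print(last_only)   # femoral
print(all_names)   # ['iliac', 'femoral']
```

Note that the findall call ignores both the "arteries:" prefix and the 25-character limit; it only shows that matching one name per match avoids the overwriting.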
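"I want to recognize all the arteries given in the list (see DEFINE). The same artery can appear 1...n times, at any position. The string between artery names can be max. 25 chars."
"As for flavor, let's stick with PCRE"
If you're using PCRE, instead of matching all the arteries in one go, I'd suggest matching one artery at a time. To achieve this, we can use \G to match at the end of the last match.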
/\G                  # Match anchor (BoS or EoLastMatch)
(?:
    (?!^)            # With previous match
  |
    .*?              # Or first occurrence
    arteries:        #   of arteries:
)
.{1,25}?             # Separated by max 25 chars
(?P<art>             # Group 1 (capture 1 artery)
    \b               #   List of arteries
    (?:iliac|femoral|popliteal|peroneal|tibial)
    \b               #   in between word boundaries
                     # Modif: global, caseless, singleline, extra
)/gixs
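This will capture each artery in group art (group 1).
As for compatibility with other regex flavors, you can loop over each match in your code to emulate \G (not every flavor implements it; Python's standard re module, for one, does not). Another option is to split the text with the expression: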
(arteries:|\b(?:iliac|femoral|popliteal|peroneal|tibial)\b)
and then check the length of each token to guarantee there are no more than 25 characters in between.
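The split-and-check idea could be sketched roughly as follows with the standard re module. The function name arteries_by_split and the max_gap parameter are hypothetical, and the sketch assumes one "arteries:" header per line and at least one character between tokens (mirroring .{1,25}? above).

```python
import re

# The capturing group makes re.split keep the delimiters in the result.
TOKEN = re.compile(
    r"(arteries:|\b(?:iliac|femoral|popliteal|peroneal|tibial)\b)", re.I
)

def arteries_by_split(line, max_gap=25):
    """Return the artery names that follow 'arteries:' in one line,
    stopping at the first gap longer than max_gap characters."""
    parts = TOKEN.split(line)  # [text, token, text, token, ..., text]
    found, started = [], False
    # Pair each token with the text that precedes it.
    for gap, token in zip(parts[0::2], parts[1::2]):
        if not started:
            started = token.lower() == "arteries:"
            continue
        if token.lower() == "arteries:" or not 1 <= len(gap) <= max_gap:
            break
        found.append(token)
    return found

print(arteries_by_split("arteries: jhjh iliac jdfd femoral"))
# ['iliac', 'femoral']
```

Because the result is a plain list, this gives the "vector of artery names" directly and relies on nothing beyond re.split, which most flavors offer in some form.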
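"The code must be migrated to Python one day (not so distant)"
You can use \G in Python if you use the regex module, which implements it. But if you do use that module, you can also take advantage of its ability to retrieve the repeated captures of a group with the .captures method. Check @CasimiretHippolyte's answer, which is a perfect example of using captures in this scenario.
On the other hand, if you stick with the standard re module, I'd recommend looping over each match to emulate the same behaviour.
Code: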
import re
text = '''arteries: jhjh iliac jdfd femoral
arteries: sdsdsd iliac jdfd femoral fd d popliteal
some arteries: hgv popliteal,sddsdsds iliac tibial nkjkknperoneal nkjkkn
arteries: iliac, peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk
arteries:m bkjnkjnperoneal mm femoral jnnbn n right femoralbjkkbb jk'''
n = 0
pattern_from = re.compile(r'arteries:', re.I)
pattern_token = re.compile(r'.{1,25}?\b(iliac|femoral|popliteal|peroneal|tibial)\b', re.I)
for match_from in pattern_from.finditer(text):
    n = n + 1
    print('\nMatch #%s:' % n, end="")
    # Start right after "arteries:"; match() anchors at the given position,
    # which emulates \G
    match_token = pattern_token.match(text, match_from.end())
    while match_token:
        print('[%s:%s]="%s" ' % (match_token.start(1), match_token.end(1), match_token.group(1)), end="")
        # Continue from where the previous artery match ended
        match_token = pattern_token.match(text, match_token.end())
Output:
Match #1:[15:20]="iliac" [26:33]="femoral"
Match #2:[52:57]="iliac" [63:70]="femoral" [76:85]="popliteal"
Match #3:[106:115]="popliteal" [125:130]="iliac" [132:138]="tibial"
Match #4:[171:176]="iliac" [179:187]="peroneal"
Match #5:[243:251]="peroneal" [257:264]="femoral" [267:273]="tibial" [282:288]="tibial"
Match #6:[344:351]="femoral"