Question

Suppose I want to write a pattern that captures sentences like this:

<person> was one of the few <profession> in <city> whom everybody admired.

The allowed variations are:

<person> is a member of {Michael, Jack, Joe, Maria, Susan}.
<profession> is any of {painters, actors}.
<city> is the regexp pattern `[$k|K]a\w+`.

So the pattern should capture sentences like these:

Jack was one of the few painters in Kansan whom everybody admired. 
Michael was one of the few actors in Karlsruhe whom everybody admired.

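How can I model this in Python? As far as I know, a regular expression alone cannot capture a pattern like this. I could probably write a context-free grammar, but before going down that road I thought I would ask here to see whether there is a simpler way.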

Answer 1

Here you go:

>>> import re
>>> persons = ['Michael', 'Jack', 'Joe', 'Maria', 'Susan']
>>> professions = ['painters', 'actors']
>>> # Wrap each alternation in a group; without the parentheses "|" splits
>>> # the whole pattern and a bare "Jack" would already count as a match.
>>> regex = re.compile(
...     r'(?:{person}) was one of the few (?:{profession}) in {city} whom everybody admired\.'
...     .format(person='|'.join(persons),
...             profession='|'.join(professions),
...             city=r'[$k|K]a\w+'))
>>>
>>> a = ['Jack was one of the few painters in Kansan whom everybody admired.',
...      'Michael was one of the few actors in Karlsruhe whom everybody admired.',
...      'Jone was one of the few painters in Kansan whom everybody admired.',
...      'Susan was one of the few foo in Kansan whom everybody admired.',
...      'Joe was one of the few actors in Kansan whom everybody admired.']
>>>
>>> for s in a:
...     if regex.search(s):
...         print(s)
...
Jack was one of the few painters in Kansan whom everybody admired.
Michael was one of the few actors in Karlsruhe whom everybody admired.
Joe was one of the few actors in Kansan whom everybody admired.
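
A slightly more defensive variant of the same idea, as a sketch only. It assumes the word lists might contain regex metacharacters (hence `re.escape`), uses named groups so the three parts can be read back, and uses `fullmatch` so the whole sentence has to fit. The helper name `build_pattern` and the group names are made up for illustration.

```python
import re

def build_pattern(persons, professions, city_regex):
    """Compile the sentence template (hypothetical helper, not from the answer)."""
    # Escape each literal word so characters like "." or "+" stay literal.
    person = '|'.join(re.escape(p) for p in persons)
    profession = '|'.join(re.escape(p) for p in professions)
    template = (
        r'(?P<person>{person}) was one of the few (?P<profession>{profession})'
        r' in (?P<city>{city}) whom everybody admired\.'
    )
    return re.compile(
        template.format(person=person, profession=profession, city=city_regex)
    )

pattern = build_pattern(
    ['Michael', 'Jack', 'Joe', 'Maria', 'Susan'],
    ['painters', 'actors'],
    r'[$k|K]a\w+',  # city pattern copied verbatim from the question
)

m = pattern.fullmatch(
    'Jack was one of the few painters in Kansan whom everybody admired.'
)
if m:
    print(m.group('person'), m.group('profession'), m.group('city'))
```

Note that inside a character class `$` and `|` are literal characters, so `[$k|K]` matches any one of `$`, `k`, `|` or `K`. If only a lower- or upper-case K is intended, `[kK]` would be enough.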

Answer 2

You probably want something like this:

/^(Michael|Susan|Maria|Jack|Joe).*?(painters|actors).*?([P|K]a\w+).*$/gm


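PS: I assumed $k is a variable to be replaced with an actual value (P in my case). If you meant something different, comment on my answer and I will fix the regex accordingly.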
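
For reference, a sketch of how this pattern could be used from Python. The `/gm` flags are not Python syntax; here they are assumed to correspond to `re.MULTILINE` plus `finditer`. The sample text is invented for illustration. Because of the `.*?` gaps, this pattern accepts any line that contains the three parts in order, not only the exact sentence from the question.

```python
import re

# The answer's pattern; ^ and $ match at line boundaries under re.MULTILINE.
loose = re.compile(
    r'^(Michael|Susan|Maria|Jack|Joe).*?(painters|actors).*?([P|K]a\w+).*$',
    re.MULTILINE,
)

# Invented sample text: the second line starts with a name outside the list,
# the third has the right parts but a different sentence shape.
text = '''Jack was one of the few painters in Kansan whom everybody admired.
Jone was one of the few painters in Kansan whom everybody admired.
Joe liked the actors he met in Paris.'''

found = [m.groups() for m in loose.finditer(text)]
print(found)
# [('Jack', 'painters', 'Kansan'), ('Joe', 'actors', 'Paris')]
```

The third line matching shows how loose the pattern is; if only the exact sentence should pass, the literal words between the groups need to be spelled out.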

CAVEAT

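Unless the entries in an alternation group are sorted by length (longest first), a regex-based solution may not work as expected whenever one entry is a prefix of another. In Python, use something like this: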

persons.sort(key=len, reverse=True)  # longest names first; cmp-style sorting is Python 2 only

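Why? A group like (Maria|Joe|Jack|Mariano), matched on its own against the string Mariano, stops at Maria: alternatives are tried left to right and the first one that succeeds wins, much like a short-circuiting OR in most programming languages. The engine only backtracks and tries Mariano if the rest of the pattern fails after Maria.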
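
A small sketch of this behaviour with Python's `re` module. The variable names are illustrative; the name list is the one from the example above.

```python
import re

names = ['Maria', 'Joe', 'Jack', 'Mariano']

# Unsorted: "Maria" is tried before "Mariano" and wins.
unsorted_re = re.compile('|'.join(names))
first = unsorted_re.match('Mariano').group()
print(first)  # Maria

# Longest first: "Mariano" is tried before its prefix "Maria".
by_length = sorted(names, key=len, reverse=True)
sorted_re = re.compile('|'.join(by_length))
second = sorted_re.match('Mariano').group()
print(second)  # Mariano

# With more pattern after the group, backtracking recovers the longer name
# even without sorting, because " was" cannot match right after "Maria".
with_context = re.compile('(' + '|'.join(names) + ') was')
third = with_context.match('Mariano was here').group(1)
print(third)  # Mariano
```

So the sorting matters most when the group stands alone or is followed by something permissive such as `.*?`.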

Answer 3

This regex will catch your examples.

(\w+) was one of the few (painters|actors) in ([$k|K]a\w+) whom everybody admired\.

Edit: added an example of how to check the group.

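Suppose you want to check whether the name is in a list of 1,000 names. A regex alone is not a good fit for that. Instead, you can capture the name with this regex and add an extra check.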

import re

input_strs = ['Jack was one of the few painters in Kansan whom everybody admired.',
              'Michael was one of the few actors in Karlsruhe whom everybody admired.']

allowed_names = ['Michael', 'John']

# The final dot is escaped so that it only matches a literal period.
pattern = re.compile(r'(\w+) was one of the few (painters|actors) in ([$k|K]a\w+) whom everybody admired\.')

for sentence in input_strs:  # "sentence" avoids shadowing the built-in input()
    m = pattern.match(sentence)
    if m:
        # check if name is in the list
        name = m.group(1)
        print('name: ' + name)
        if name in allowed_names:
            print('ok')
        else:
            print('fail')
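
For long lists, the same check can be sketched with sets, which give constant-time membership tests. This is an illustration under assumptions: the names `ALLOWED_PERSONS`, `ALLOWED_PROFESSIONS`, `SENTENCE` and `parse` are hypothetical, the profession is also captured generically with `(\w+)`, and the whitelists reuse the values from the question.

```python
import re

# Hypothetical whitelists; in practice they might be loaded from a file.
ALLOWED_PERSONS = frozenset(['Michael', 'Jack', 'Joe', 'Maria', 'Susan'])
ALLOWED_PROFESSIONS = frozenset(['painters', 'actors'])

# The regex only fixes the sentence shape; the word checks happen in Python.
SENTENCE = re.compile(
    r'(\w+) was one of the few (\w+) in ([$k|K]a\w+) whom everybody admired\.'
)

def parse(sentence):
    """Return (person, profession, city) if the sentence is valid, else None."""
    m = SENTENCE.fullmatch(sentence)
    if m is None:
        return None
    person, profession, city = m.groups()
    if person not in ALLOWED_PERSONS or profession not in ALLOWED_PROFESSIONS:
        return None
    return person, profession, city

print(parse('Jack was one of the few painters in Kansan whom everybody admired.'))
print(parse('Jone was one of the few painters in Kansan whom everybody admired.'))
```

Keeping the word lists outside the regex means they can grow or change without recompiling the pattern.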

Finding a pattern restricted to a certain set of words
