Question

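First of all, this is homework. (I couldn't use a tag in the title, and the tag list at the bottom doesn't show anything for homework, so please let me know if I should edit something else in for this.)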

So I've been reading the Python docs and scouring SO, and I've found several solutions that are close to what I want, but not quite right.

I have a dictionary (a word list) that I read into a single string:

a
aa
aabbaa
...
z

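We're practicing various regexes against this data. The specific problem here is to return a list of the words that match my pattern, NOT a tuple of the groups in each match.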

For example:

Given a subset of this dictionary like:

someword
sommmmmeword
someworddddd
sooooomeword

I'd like to return:

['sommmmmeword', 'someworddddd']

NOT:

[('sommmmmeword', 'mmmmm', ...), ...] # or any other variant

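EDIT:

My reasoning behind the example above is that I want to see how I can avoid making a second pass over the results. That is, instead of saying: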

res = re.match(re.compile(r'pattern'), dictionary)
return [r[0] for r in res]

I specifically want a mechanism where I can just use:

return re.match(re.compile(r'pattern'), dictionary)

I know this may sound silly, but I'm doing this to really dig into regex. I come back to this at the bottom.

This is what I've tried:

# learned about back refs
r'\b([b-z&&[^eiou]])\1+\b' -> # nothing

# back refs were weird, I want to match something N times
r'\b[b-z&&[^eiou]]{2}\b' -> # nothing

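Somewhere during testing, I noticed a pattern returning things like '\nsomeword'. I couldn't figure out which one it was, but if I find that pattern again I'll include it here for completeness.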

# Maybe the \b word markers don't work how I think?
r'.*[b-z&&[^eiou]]{2}' -> # still nothing

# Okay, let's just try to match something in between anything
r'.*[b-z&&[^eiou]].*' -> # nope

# Since it's words, maybe I should be more explicit.
r'[a-z]*[b-z&&[^eiou]][a-z]*' -> # still nope

# Decided to go back to grouping.
r'([b-z&&[^eiou]])(\1)'  # I realize set difference may be the issue

# I saw someone (on SO) use set difference claiming it works
#  but I gave up on it...

# OKAY getting close
r'(([b-df-hj-np-tv-xz])(\2))' -> [('ll', 'l', 'l'), ...]

# Trying the previous ones without set difference
r'\b(.*(?:[b-df-hj-np-tv-xz]{3}).*)\b'  -> # returned everything (all words)

# Here I realize I need a non-greedy leading pattern (.* -> .*?)
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3}).*)\b' ->  # still everything

# Maybe I need the comma in {3,} to get anything 3 or more
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3,}).*)\b' ->  # still everything

# okay I'll try a 1 line test just in case
r'\b(.*?([b-df-hj-np-tv-xz])(\2{3,}).*)\b'  
    # Using 'asdfdffff' -> [('asdfdffff', 'f', 'fff')]
    # Using dictionary -> []  # WAIT WHAT?!

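How does this last one work on the test string but not on the dictionary? Maybe there are no words with 3 repeated consonants in there? I'm using /usr/share/dict/cracklib-small on my school server, which I think is about 50,000 words.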

I'm still working on this, but any advice would be great.

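One thing I find curious is that you can't back-reference a non-capturing group. If I want to output only the full word, I use (?:...) to avoid capturing, but then I can't back-reference. Obviously I could leave the captures in, loop over the results and filter out the extra stuff, but I really want to solve this with regex alone!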

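Perhaps there is a way to do a non-capture that still allows back-references? Or maybe there is an entirely different expression I haven't tested yet.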

Answer 1

Here are some points to consider:

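Use re.findall to get all the results, not re.match (which looks for only one match, and only at the start of the string).
[b-z&&[^eiou]] is Java / ICU regex syntax; Python's re does not support it. In Python, you can either redefine the ranges to skip the vowels, or use (?![eiou])[b-z].
To avoid "extra" values in the tuples returned by re.findall, do not use capturing groups. If you need backreferences, use re.finditer instead of re.findall and access .group() of each match.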
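
The three points above can be sketched roughly as follows. This is only an illustration: the sample string `words` and the names `explicit`, `lookahead`, `with_groups` and `doubled` are assumptions made for this example, not something from the question. Note that the explicit ranges used in the question also skip `y`, while `(?![eiou])[b-z]` keeps it, so the two spellings are not quite interchangeable. The last pattern is a workaround rather than a true "non-capturing backreference": it spells out every doubled consonant so that no capturing group is needed.

```python
import re
import string

words = "someword\nsommmmmeword\nsomeworddddd\nsooooomeword"

# Point 1: re.match only tries at the very start of the string,
# while re.findall scans the whole string.
print(re.match(r"\w*mm\w*", words))    # None: the first word has no "mm"
print(re.findall(r"\w*mm\w*", words))  # ['sommmmmeword']

# Point 2: two ways to write "consonant" in Python's re.
explicit = "[b-df-hj-np-tv-xz]"   # ranges that skip the vowels (and y)
lookahead = "(?![eiou])[b-z]"     # b-z, minus whatever the lookahead rejects
letters = string.ascii_lowercase
print("".join(c for c in letters if re.fullmatch(explicit, c)))
# bcdfghjklmnpqrstvwxz
print("".join(c for c in letters if re.fullmatch(lookahead, c)))
# bcdfghjklmnpqrstvwxyz

# Point 3: with capturing groups, findall returns tuples of groups ...
with_groups = rf"(\w*({explicit})\2\w*)"
print(re.findall(with_groups, words))
# [('sommmmmeword', 'm'), ('someworddddd', 'd')]

# ... while finditer yields match objects, so .group() is the whole match.
print([m.group() for m in re.finditer(rf"\w*({explicit})\1\w*", words)])
# ['sommmmmeword', 'someworddddd']

# Without any capturing group, findall returns the matched words directly.
# Spelling out each doubled consonant replaces the backreference.
doubled = "|".join(c * 2 for c in "bcdfghjklmnpqrstvwxz")
print(re.findall(rf"\w*(?:{doubled})\w*", words))
# ['sommmmmeword', 'someworddddd']
```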

Coming back to the question of how you can use a backreference and still get the whole match, here is a working demo:

import re
s = """someword
sommmmmeword
someworddddd
sooooomeword"""
res = [x.group() for x in re.finditer(r"\w*([b-df-hj-np-tv-xz])\1\w*", s)]
print(res)
# => ['sommmmmeword', 'someworddddd']
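
As a side note on the last attempt in the question: a quantifier on a backreference counts repeats in addition to the captured character. A consonant group followed by `\1` needs two identical consonants in a row, whereas a consonant group followed by `\2{3,}` (or `\1{3,}` here) needs four, which may be why it finds nothing in an ordinary word list. A small sketch, where the names `C`, `pair` and `four_plus` and the sample words are chosen only for illustration:

```python
import re

C = "[b-df-hj-np-tv-xz]"

# consonant + 1 repeat = 2 identical consonants in a row
pair = re.compile(rf"\w*({C})\1\w*")
# consonant + at least 3 repeats = 4 or more identical consonants in a row
four_plus = re.compile(rf"\w*({C})\1{{3,}}\w*")

for word in ["hello", "shhh", "asdfdffff"]:
    print(word, bool(pair.fullmatch(word)), bool(four_plus.fullmatch(word)))
# hello True False
# shhh True False
# asdfdffff True True
```

If the intent is "three or more of the same consonant", `{2,}` on the backreference would be the quantifier to try.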

Python regex to match words with repeating consonants
