Question

Is there a regex string <regex> such that re.findall(r'<regex>', doc) returns the same result as the following code?

import re

doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = []
for word in re.split(r'\s+', doc.strip()):
    # Skip words with 3+ identical chars in a row, or any underscore, digit, or non-word char
    if not re.search(r'(.)\1{2,}|[_\d\W]+', word):
        new_doc.append(word)
print(new_doc)  # => ['is', 'if']

Answer 1

Your current way of getting the matches may well be the best one.

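You cannot do this with a bare re.findall call; some extra step, such as a list comprehension, is needed. The pattern has to contain a capturing group (for the backreference), and when a pattern contains capturing groups, re.findall returns the captured substrings instead of the whole matches.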
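A minimal sketch of that re.findall behavior. The sample string and patterns here are made up for illustration and are not part of the question:

```python
import re

sample = "ab12 cd34"  # hypothetical sample text, not from the question

# No capturing group: findall returns the full matches.
no_group = re.findall(r"[a-z]+\d+", sample)

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r"([a-z]+)\d+", sample)

# Two capturing groups: findall returns one tuple of groups per match.
two_groups = re.findall(r"([a-z]+)(\d+)", sample)

print(no_group)    # ['ab12', 'cd34']
print(one_group)   # ['ab', 'cd']
print(two_groups)  # [('ab', '12'), ('cd', '34')]
```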

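So you can either add an outer capturing group, use re.findall, and take the first item of each returned tuple, or use re.finditer and take the whole match with group(0). The pattern: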

(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+

See this regex demo.

Details

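(?<!\S) - a whitespace character or the start of the string must be immediately to the left of the current position
(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W]) - a negative lookahead: immediately to the right of the current position, after any 0+ non-whitespace characters, there must be neither 3 identical consecutive non-whitespace characters nor a character that is _, a digit, or any non-word character other than whitespace
\S+ - 1+ non-whitespace characters.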
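The same breakdown can be kept next to the pattern by compiling it with re.VERBOSE. This is only a sketch: the name WORD_FILTER is hypothetical, and the pattern is the one above with whitespace and comments added.

```python
import re

# WORD_FILTER is a hypothetical name; the pattern is unchanged apart from layout.
WORD_FILTER = re.compile(r"""
    (?<!\S)                 # start of a word: no non-whitespace char to the left
    (?!                     # reject the word if either alternative matches ahead
        \S*(\S)\1{2}        #   three identical non-whitespace chars in a row
      |                     #   or
        \S*(?!\s)[_\d\W]    #   an underscore, a digit, or a non-word char
    )
    \S+                     # the word itself
""", re.VERBOSE)

doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
words = [m.group(0) for m in WORD_FILTER.finditer(doc)]
print(words)  # ['is', 'if']
```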

See the Python demo:

import re
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = [x.group(0) for x in re.finditer(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+', doc)]
print(new_doc) # => ['is', 'if']
new_doc2 = re.findall(r'(?<!\S)((?!\S*(\S)\2{2}|\S*(?!\s)[_\d\W])\S+)', doc)
print([x[0] for x in new_doc2]) # => ['is', 'if']
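As a quick sanity check, the loop from the question and the single-pattern version can be compared on a few inputs. This is a sketch, not a proof of equivalence: the function names and the extra sample strings are assumptions added for illustration, and the loop is written as a comprehension.

```python
import re

PATTERN = re.compile(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+')

def loop_filter(doc):
    """The split-and-search loop from the question, as a comprehension."""
    return [word for word in re.split(r'\s+', doc.strip())
            if not re.search(r'(.)\1{2,}|[_\d\W]+', word)]

def regex_filter(doc):
    """The single-pattern version, taking the whole match from finditer."""
    return [m.group(0) for m in PATTERN.finditer(doc)]

samples = [
    ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)',
    'plain words only',          # hypothetical extra sample
    'aaa bb c',                  # hypothetical extra sample
    'tab\tseparated\nlines ok',  # hypothetical extra sample
]
for s in samples:
    assert loop_filter(s) == regex_filter(s), s

# One edge case where the two differ: an empty (or all-whitespace) input.
print(loop_filter(''))   # ['']
print(regex_filter(''))  # []
```

The difference on empty input comes from re.split returning [''] for an empty string, which the loop then keeps because the search finds nothing to reject.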
