Question

For example, I have s="I REALLY don't want to talk about it, not at all!"

I want re.findall(reg, s) to return "I" "don't" "want" "to" "talk" "about" "it" "," "not" "at" "all" "!"

So far I have reg=r'[^\w\s]+|\w+|\n', which does not filter out the word "REALLY".

Thanks

Answer 1

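The \w+ pattern matches 1 or more word characters, so it also matches words in ALLCAPS.

Note that the pronoun I is ALLCAPS too. So, assuming you want to skip all ALLCAPS words of 2 or more letters, you may consider fixing your current pattern as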

r'[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n'

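See the regex demo.

The \b(?![A-Z]{2,}\b)\w+ pattern matches

\b - a word boundary
(?![A-Z]{2,}\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there are 2 or more ASCII uppercase letters followed by a word boundary
\w+ - 1 or more word characters (if you only want to match letters, replace it with [^\W\d_]+).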
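As a quick check, here is a minimal sketch that runs the pattern above on the string from the question. Note that \w does not match an apostrophe, so "don't" comes back in three pieces. The second pattern, reg_apos, is a hypothetical variant that is not part of the answer above; it assumes you want apostrophe-joined parts kept together, as in the question's desired output.

```python
import re

s = "I REALLY don't want to talk about it, not at all!"

# Pattern from the answer: skip ALLCAPS words of 2+ ASCII letters.
reg = r"[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n"
tokens = re.findall(reg, s)
print(tokens)
# ['I', 'don', "'", 't', 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']

# Hypothetical variant (an assumption, not from the answer):
# let a word continue across an apostrophe followed by more word characters.
reg_apos = r"[^\w\s]+|\b(?![A-Z]{2,}\b)\w+(?:'\w+)*|\n"
tokens_apos = re.findall(reg_apos, s)
print(tokens_apos)
# ['I', "don't", 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']
```

This sketch does not try to handle ALLCAPS words that contain an apostrophe, such as DON'T.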

To support all Unicode uppercase letters, you can use the PyPI regex module with the pattern

r'[^\w\s]+|\b(?!\p{Lu}{2,}\b)\w+|\n'

or build the character class yourself with

pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))

(Python 3) or

pLu = u'[{}]'.format(u"".join([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()]))

(Python 2). See "Python regex for unicode capitalized words". I recommend sticking with the latest Python version or the latest PyPI regex module.
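To illustrate the standard-library route, here is a rough sketch for Python 3 that builds the pLu class and plugs it into the pattern in place of [A-Z]. The sample string and the names ascii_reg and unicode_reg are made up for this example. Keep in mind that str.isupper() is not exactly the same set as \p{Lu}, so treat the class as an approximation.

```python
import re
import sys

# Character class of every code point that str.isupper() reports as uppercase.
pLu = "[{}]".format(
    "".join(chr(i) for i in range(sys.maxunicode) if chr(i).isupper())
)

# ASCII-only pattern from the answer, and the same pattern using pLu instead.
ascii_reg = r"[^\w\s]+|\b(?![A-Z]{2,}\b)\w+|\n"
unicode_reg = r"[^\w\s]+|\b(?!" + pLu + r"{2,}\b)\w+|\n"

text = "ÀÉ et École"  # made-up sample with a non-ASCII ALLCAPS word

print(re.findall(ascii_reg, text))
# ['ÀÉ', 'et', 'École']  (the ASCII class does not recognize ÀÉ as ALLCAPS)

print(re.findall(unicode_reg, text))
# ['et', 'École']
```

Building the class scans every code point once, so it makes sense to build it a single time and reuse the compiled pattern.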

Answer 2

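This quote from Brian Kernighan is especially true of regular expressions.

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?

So if something is hard to do in a single regex, you may want to split it into two steps. First find all the words, then filter out the ones that are all uppercase. That is easier to understand and to test.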

>>> import re
>>> s="I REALLY don't want to talk about it, not at all!"
>>> words = re.findall(r"[\w']+", s)
>>> words = [w for w in words if w.upper() != w]
>>> print(words)
["don't", 'want', 'to', 'talk', 'about', 'it', 'not', 'at', 'all']
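The transcript above drops "I" and the punctuation, because w.upper() != w is false for any token that has no lowercase letters. If you want output closer to what the question asks for, one possible variation on the same two-step idea is sketched below. The function name non_caps_tokens and the choice to keep one-letter words are assumptions made for this example.

```python
import re

def non_caps_tokens(text):
    """Tokenize first, then drop ALLCAPS words of two or more characters."""
    # Step 1: words (apostrophes allowed inside) or runs of punctuation.
    tokens = re.findall(r"[\w']+|[^\w\s]+", text)
    # Step 2: filter. str.isupper() is False for punctuation and digits,
    # so those survive; one-character tokens such as "I" are kept on purpose.
    return [t for t in tokens if not (len(t) >= 2 and t.isupper())]

s = "I REALLY don't want to talk about it, not at all!"
print(non_caps_tokens(s))
# ['I', "don't", 'want', 'to', 'talk', 'about', 'it', ',', 'not', 'at', 'all', '!']
```

Each step can be tested separately, which is the point of splitting the work.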

How can I use re.findall to find words that are not all uppercase?
